PDT (Photometric DeTrending Algorithm Using Machine Learning) aims to remove systematic trends in light curves. For details about the algorithm, see Kim et al. 2009. In brief, PDT finds clusters of highly correlated light curves using machine learning, constructs one master trend per cluster, and detrends an individual light curve with the constructed master trends by minimizing the residuals while constraining the coefficients to be positive.
The latest PDT uses Birch to find highly correlated light curves rather than the hierarchical clustering that Kim et al. 2009 originally used. This is mainly because 1) Birch does not require the number of clusters to be set in advance, and 2) Birch is scalable (i.e. applicable to large datasets).
Note that PDT is designed for light curves that have the same number of data points and are synced in time (see How to Use PDT). Nevertheless, PDT provides a module to deal with missing data points (i.e. not-synced data). For details, see the section: Missing Values. Also note that the light curves must be cleaned beforehand (e.g. by removing highly fluctuating data points).
Although PDT is designed for astronomical research, it can be applied to any kind of time series data such as stock market, weather data, etc.
These libraries will be installed automatically if your machine does not already have them. If you encounter errors while installing these dependencies, try installing them individually; your machine may be missing other libraries that these dependencies require.
The easiest way to install the PDT package is:
pip install pdtrend
pip install git+https://github.com/dwkim78/pdtrend
If you do not want to install/upgrade the dependencies, execute the above command with the
--no-deps option. PDT may also work with older versions of Python and the other libraries.
Alternatively, you can download the PDT package from the Git repository as:
git clone https://github.com/dwkim78/pdtrend
cd pdtrend
python setup.py install
You can edit
setup.py, if you do not want to update your own Python libraries (i.e. edit the
To check if PDT is correctly installed, type the following commands in the Python console.
from pdtrend import test
test()
The command will print messages like:
yyyy-mm-dd hh:mm:ss,sss INFO - Loading the light curve set.
yyyy-mm-dd hh:mm:ss,sss INFO - The number of light curves is 57.
yyyy-mm-dd hh:mm:ss,sss INFO - Initializing pdtrend.
yyyy-mm-dd hh:mm:ss,sss INFO - Calculating the distance matrix.
yyyy-mm-dd hh:mm:ss,sss INFO - Searching for clusters using Birch.
yyyy-mm-dd hh:mm:ss,sss INFO - Filtering the clusters.
yyyy-mm-dd hh:mm:ss,sss INFO - Building master trends.
yyyy-mm-dd hh:mm:ss,sss INFO - Detrending one light curves using the master trends.
yyyy-mm-dd hh:mm:ss,sss INFO - Ploting results.
yyyy-mm-dd hh:mm:ss,sss INFO - Done.
This command reads the sample dataset consisting of 57 light curves (Python pickled and bzipped), runs the clustering algorithm (i.e. Birch) to find clusters, constructs master trends of those clusters, and detrends a sample light curve. It also generates three output images under the "./output" folder.
The above image shows the master trend constructed by the clustering algorithm. In this example data set, PDT found one master trend. For details about what is a master trend, see Kim et al. 2009. In brief, it is a representative trend of a cluster.
The following image is an example light curve before (top) and after (bottom) the detrending. Note that when PDT detrends a light curve, it minimizes the RMS of the residuals while constraining the weight of each master trend to be positive. The positive constraint is mandatory to avoid undesirable RMS minimization. For instance, if the weights were allowed to be negative while the master trends are monotonically increasing, RMS minimization could suppress monotonically decreasing signals in the light curves, which is unwanted.
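The effect of the positivity constraint can be illustrated with a small, self-contained sketch. It is not PDT's own code: it assumes the detrending step can be approximated by non-negative least squares (scipy.optimize.nnls), and the helper name detrend_nnls and the toy data are hypothetical. Any normalization that PDT applies internally is not reproduced here.

```python
import numpy as np
from scipy.optimize import nnls


def detrend_nnls(lc, master_trends):
    """Remove a non-negative linear combination of master trends from lc."""
    lc = np.asarray(lc, dtype=float)
    # Design matrix with one column per master trend: (n_points, n_trends).
    design = np.asarray(master_trends, dtype=float).T
    # Solve min ||design @ w - lc||_2 subject to w >= 0.
    weights, _ = nnls(design, lc)
    return lc - design @ weights, weights


t = np.linspace(0.0, 1.0, 50)
trend = t  # a monotonically increasing master trend (toy data)

# Case 1: the light curve contains the trend, so it is fitted and removed.
trended_lc = 0.8 * trend
detrended_1, w_1 = detrend_nnls(trended_lc, [trend])

# Case 2: a genuinely decreasing signal. Fitting it with the rising trend
# would need a negative weight, so the constraint leaves the signal untouched.
decreasing_lc = -0.5 * t
detrended_2, w_2 = detrend_nnls(decreasing_lc, [trend])

# Unconstrained least squares, for comparison, does fit a negative weight.
w_free = np.linalg.lstsq(trend[:, None], decreasing_lc, rcond=None)[0]
```

In the second case the constrained fit assigns a weight of zero and keeps the decreasing signal, whereas the unconstrained fit would have removed it.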
In addition, PDT can plot spatial distribution of the constructed master trends if x and y coordinates of stars of the light curves are given (see How to Use PDT for details). In this test dataset, the x and y coordinates are randomly generated between 0 and 1000.
How to Use PDT
Using PDT is relatively simple because PDT assumes that light curves are synced. Nevertheless, note that PDT requires a sufficient number of light curves to find clusters and master trends. We recommend using PDT with at least 50 light curves, but not with too many (e.g. several tens of thousands), because the run time then becomes long. In the latter case, we recommend running PDT multiple times on individual subsets of the light curves.
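One possible way to prepare such subsets is sketched below with plain NumPy. The data are random stand-ins scaled down for illustration, subset_size is a hypothetical value, and splitting into contiguous chunks is only an assumption; PDT itself is not called here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 1,000 synced light curves with 20 data points each.
lcs = rng.normal(size=(1000, 20))

subset_size = 250  # hypothetical; choose what your machine handles well
n_subsets = int(np.ceil(len(lcs) / subset_size))

# Split along the first axis, i.e. by light curve, into near-equal chunks.
subsets = np.array_split(lcs, n_subsets)
subset_lengths = [len(subset) for subset in subsets]

# Each subset would then get its own PDT instance, for example:
#   pdt = PDTrend(subset)
#   pdt.run()
```

Whatever the split, each subset should still contain at least the recommended 50 light curves.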
The following pseudo code shows how to use PDT.
# Import PDT.
from pdtrend import PDTrend

# Read light curves.
lcs = ...

# Create PDT instance.
pdt = PDTrend(lcs)

# Find clusters and then construct master trends.
pdt.run()

# Detrend each light curve.
for lc in lcs:
    detrended_lc = pdt.detrend(lc)
In order to use PDT, light curves must be read beforehand (i.e. the line
lcs = ...). The
lcs must consist of N rows where each row contains M columns. N is the number of light curves and M is the number of data points.
lcs can be either a Python list or a numpy.ndarray. For example:
lcs = [
    [1, 2, 3, 4, 5],
    [5, 4, 3, 2, 1],
    [3, 3, 3, 3, 3],
]
is a data set consisting of three light curves, each of which contains 5 data points.
When creating the PDT instance, you can set the following options:
|n_min_member||The minimum number of members in each cluster. If a cluster has fewer members, PDT discards the cluster. Default is 10. If you have a lot of light curves (e.g. several hundreds or thousands), you may want to increase this number to 20, 30, 50, 100 or so.|
|dist_cut||The distance matrix that PDT uses is (1 - correlation matrix) / 2. If a cluster found by Birch consists of light curves of random Gaussian noise (i.e. no clear variability), it is likely that the median distance between the light curves is close to 0.5. Thus we can remove clusters whose median distance is larger than 0.5. Nevertheless, the default value is set to 0.45 in order to discard less-correlated clusters as well. If you increase this value (e.g. to 0.6 or so), PDT will construct master trends consisting of non-varying light curves.|
|weights||A list of weights for the light curves. Default is None, which gives identical weights to all light curves. The number of weights must be the same as the number of input light curves. PDT uses the weights only when constructing master trends. See Kim et al. 2009 for details.|
|xy_coords||A list of x and y spatial coordinates of the star of each light curve. Default is None. It must contain Nx2 elements, where N is the number of input light curves. The first column is the x coordinate and the second column is the y coordinate. If this list is given, you can use
|branching_factor||For details, see scikit-learn Birch.|
|threshold||For details, see scikit-learn Birch.|
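The reasoning behind dist_cut can be checked numerically. The sketch below is not PDT's code: it builds the distance matrix as (1 - correlation matrix) / 2 with NumPy on synthetic data and assumes that the median is taken over the off-diagonal pairwise distances; the helper name median_distance is hypothetical.

```python
import numpy as np


def median_distance(lcs):
    """Median off-diagonal value of (1 - Pearson correlation) / 2."""
    corr_matrix = np.corrcoef(lcs)           # shape (n_lcs, n_lcs)
    dist_matrix = (1.0 - corr_matrix) / 2.0  # 0: identical, 1: anti-correlated
    off_diagonal = ~np.eye(len(dist_matrix), dtype=bool)
    return np.median(dist_matrix[off_diagonal])


rng = np.random.default_rng(42)
n_lcs, n_points = 30, 500

# Light curves of pure Gaussian noise: no shared variability.
noise_lcs = rng.normal(size=(n_lcs, n_points))

# Light curves sharing one common trend plus a little noise.
shared_trend = np.sin(np.linspace(0.0, 4.0 * np.pi, n_points))
trended_lcs = shared_trend + 0.3 * rng.normal(size=(n_lcs, n_points))

median_noise = median_distance(noise_lcs)
median_trended = median_distance(trended_lcs)
```

The noise-only set gives a median distance close to 0.5, so a cut of 0.45 would discard it, while the set with a shared trend lies well below the cut.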
After creating a PDT instance (e.g.
pdt), you can execute the command
pdt.run(), which will find clusters and construct master trends. To remove trends in each light curve, you can then use
pdt.detrend(lc), which returns a detrended light curve.
lc is an individual light curve, given as either a 1D list or a 1D numpy.ndarray. For example:
lc = [1, 2, 3, 4, 5]
Note that you can apply the constructed master trends to any light curves if their data points are synced. Thus, if you preprocess your data and remove low signal-to-noise-ratio (SNR) light curves from
lcs before running
pdt.run, you can 1) construct high-SNR master trends, and 2) reduce the time needed to calculate the correlation matrix. The constructed master trends can, of course, still be used to detrend the low-SNR light curves.
Using the created PDT instance, you can access the following information:
|master_trends||An array of light curves of the constructed master trends. For instance, if there are two master trends,
|master_trends_indices||These indices correspond to the indices of
|corr_matrix||A correlation matrix. The correlation coefficients are Pearson correlation coefficients.|
|dist_matrix||A distance matrix = (1 - the correlation matrix) / 2. Its values therefore lie between 0 and 1: 0 is the closest distance (i.e. correlation = 1), whereas 1 is the farthest distance (i.e. correlation = -1).|
|birch||A scikit-learn instance of the trained Birch cluster.|
You can access this information using the PDT instance. For instance, to access
master_trends, you can do as follows:
master_trends = pdt.master_trends
If You Get "No clusters were found" Message
It means that PDT failed to find clusters of light curves that are highly correlated. This could imply that your dataset does not have strong trends. Nevertheless, if you still want to detect clusters of (less-highly-correlated) light curves, you can either decrease
n_min_member or increase
dist_cut, and rerun. For example,
# The first execution.
pdt.run()

# If this returns the message, "No clusters were found",
# then adjust the parameters. For example:
pdt.n_min_member = 8
pdt.dist_cut = 0.6

# And then, do the second execution.
pdt.run()
The second execution of
pdt.run() will be faster than the first execution because the Birch cluster is already trained (i.e.
pdt.birch) during the first execution. The Birch cluster will be retrained only if you create a new PDT instance (same goes for the correlation matrix and distance matrix).
In addition, you might want to increase
threshold, which is the maximum distance between sub-clusters to merge them into one cluster. Increasing the value tends to give a larger cluster (i.e. more members in the cluster), but those members might not be highly-correlated.
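The influence of threshold can be seen directly with scikit-learn's Birch on toy data. This is only an illustration under stated assumptions: the 2D points below are not light-curve data, the threshold values are arbitrary, and n_clusters=None is chosen here so that the sub-clusters are returned without a final global clustering step; how PDT configures Birch internally is not shown. In scikit-learn, threshold bounds the radius of a sub-cluster after a new sample is merged into it.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(1)
# Toy 2D points in two well-separated groups (not real light-curve data).
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=20.0, scale=0.5, size=(100, 2)),
])


def count_subclusters(threshold):
    """Fit Birch and return how many sub-clusters it ends up with."""
    # n_clusters=None skips the final global clustering step, so the
    # sub-clusters shaped by the threshold are returned directly.
    birch = Birch(threshold=threshold, branching_factor=50, n_clusters=None)
    birch.fit(points)
    return len(birch.subcluster_centers_)


n_tight = count_subclusters(0.1)  # small threshold: many small sub-clusters
n_loose = count_subclusters(1.5)  # larger threshold: fewer, larger ones
```

A larger threshold yields fewer sub-clusters with more members each, which matches the trade-off described above: bigger clusters, but potentially less tightly related members.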
Missing Values
PDT is designed to work with light curves that are synced in time. Nevertheless, PDT provides a module that fills missing values using interpolation. Remember that any such method of filling missing values can introduce additional biases and yield undesired results. Please use this module at your own risk. Note that your set of light curves must satisfy the following: 1) the light curves are from the same survey, 2) the sampling rates of the light curves are similar, and 3) their observation periods are roughly the same. If any of these conditions is not satisfied, the detrended results could be very inaccurate.
PDT uses first-order (i.e. linear) interpolation. It does not use higher-order interpolation (e.g. quadratic or cubic), in order to minimize the risk of over-fitting. You can use the module as follows:
from pdtrend import FMdata

# Filling missing data points.
fmt = FMdata(lcs_missing, times, n_min_data=3)
results = fmt.run()
lcs_missing is an array of light curves with missing values and
times is an array of observation times for the corresponding light curves. Each light curve must have the same number of data points as its corresponding time list. The following example shows three light curves that are not synced:
lcs_missing = [
    [3, 3, 5, 4, 2],
    [5, 6, 2],
    [3, 3, 3, 3]
]
times = [
    [1, 2, 3, 4, 5],
    [2, 3, 4],
    [1, 3, 4, 5]
]
Note that each list in
times must be in ascending order before using
The most important thing to remember is to set one parameter when creating an
FMdata instance, which is:
|n_min_data||The minimum number of data points in a light curve. If a light curve has fewer data points than this value,
Setting this parameter to a proper value is very important. For example, let's assume that the observation periods of almost all light curves are about one year. If there exists one light curve whose observation period is only one month, then every light curve in the returned
lcs will be one month long. Therefore, you should either increase or decrease the value of
n_min_data according to the temporal characteristics of your light curves.
results after executing
fmt.run() is a Python dictionary containing three elements as:
|lcs||An array of light curves with the missing values filled.|
|epoch||A one-dimensional array containing the synced observation epochs. Note that, in order to prevent extrapolation, the start epoch and the end epoch of
|indices||A list of the indices for each
In case of the above example, the returned
indices will be (Note: of course, we cannot apply
FMdata to the above example data since there are too few data points. This is just a conceptual example):
results['lcs'] = [
    [3, 5, 4],
    [5, 6, 2],
    [3, 3, 3]
]
results['epoch'] = [2, 3, 4]
results['indices'] = [0, 1, 2]
lcs can be ingested into PDT as:
pdt = PDTrend(lcs); pdt.run() (see How to Use PDT for details).
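The conceptual example above can be reproduced with plain NumPy, which may help to see what linear interpolation onto synced epochs does. This is a sketch, not the FMdata implementation. It assumes that the common window runs from the latest start time to the earliest end time (so nothing is extrapolated) and that the synced epochs are all observed times inside that window; both choices are assumptions made for illustration.

```python
import numpy as np

lcs_missing = [[3, 3, 5, 4, 2], [5, 6, 2], [3, 3, 3, 3]]
times = [[1, 2, 3, 4, 5], [2, 3, 4], [1, 3, 4, 5]]

# Assumption: the common window spans the latest start to the earliest end,
# so that no light curve has to be extrapolated.
start = max(t[0] for t in times)
end = min(t[-1] for t in times)

# Assumption: the synced epochs are all observed times inside that window.
all_times = np.unique(np.concatenate([np.asarray(t, dtype=float) for t in times]))
epoch = all_times[(all_times >= start) & (all_times <= end)]

# Linearly interpolate every light curve onto the synced epochs.
synced_lcs = np.array([
    np.interp(epoch, t, lc) for t, lc in zip(times, lcs_missing)
])
```

For this toy input, epoch is [2, 3, 4] and synced_lcs has rows [3, 5, 4], [5, 6, 2] and [3, 3, 3], the same values as in the conceptual example.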
If you want to write log messages either to console or to disk, you can use the PDT Logger class as:
from pdtrend import Logger

logger = Logger().getLogger()

logger.debug('debug message')
logger.info('info message')
logger.warn('warn message')
logger.error('error message')
logger.critical('critical message')
Keep in mind that you should create only one logger instance for the whole process, not several. If you want to save log messages to a file, create the logger instance as follows:
logger = Logger('/PATH/TO/FILE.log').getLogger()
This will send log messages to both the console and the log file. Note that the path must be an absolute path.
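For readers who want to see the general pattern, the sketch below sets up console and file logging with Python's standard logging module only. It is not the PDT Logger class: the helper make_logger, the logger name, and the temporary file path are hypothetical. It also hints at why a single logger instance is advisable, since attaching handlers repeatedly to the same standard logger would print every message more than once; whether that is the reason for PDT's recommendation is an assumption.

```python
import logging
import os
import tempfile


def make_logger(name, log_path=None):
    """Return a logger writing to the console and, optionally, to a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if logger.handlers:
        # Already configured: reuse it rather than adding duplicate handlers.
        return logger
    formatter = logging.Formatter('%(asctime)s %(levelname)s - %(message)s')
    handlers = [logging.StreamHandler()]
    if log_path is not None:
        handlers.append(logging.FileHandler(log_path))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# Hypothetical absolute path inside a fresh temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), 'example.log')
logger = make_logger('pdt_example', log_path)
logger.info('info message')

# Asking again for the same name returns the already configured logger.
same_logger = make_logger('pdt_example', log_path)
```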
To several survey dataset.
- release of beta version.
- test with HATNet, SuperWASP, and KMTNet dataset.
- release of alpha version.
- add another module for dealing with missing values.
- minor bug fixed for loading test light curves using pickle.load()
- Python 3 compatible
- type of the output from FMdata is changed to Python dictionary.
- PEP8 style Docstring.
- many minor bugs fixed.
- modules for dealing with missing values (i.e. not-synced observations).
- consider weights for light curves while building master trends.
- if X and Y coordinates of light curves are given, pdtrend can plot spatial distribution of constructed master trends.
- if no master trend is found, warning and advice messages will be printed.
- release of pre-alpha version.
- calculate correlation matrix and distance matrix.
- train a Birch cluster.
- construct master trends.
- add a detrending module.
- create the GitHub repository.
If you use PDT in a publication, we would appreciate citations to the paper (Kim et al. 2009) as well as to this GitHub repository.
Dae-Won Kim, email: dwkim78 at gmail.com
astronomy - light curves - time series - trend removal - detrend - machine learning - Birch - clustering