## History 

This notebook previously explored whether reducing dimensionality with PCA before clustering is useful, at three levels of explained variance: 90%, 95%, and 99%.
Aside from the reduction, the data was processed in the same way as before. Prior to dimensionality reduction, the data was scaled using sklearn's StandardScaler. After reduction, the data was clustered using the method clusterOutliers.db_out.
These reductions were compared with the paper's clustering on the full set of features and with a rerun of the clustering on the full set of features (hereafter the baseline). A summary of these results is stored in the Google Sheets file 'PCA Reduction'.
In short:
Treating the paper data as the ground truth:
Precision: 100%, Recall: 68%
The paper data had a stricter definition of a cluster and contained essentially every non-core data point identified in the reductions and the rerun baseline. Almost every point identified as an outlier or an edge member in the reductions was also identified as such in the paper's results. If we treat the paper as the ground truth (a problematic assumption), the precision of the reductions is effectively perfect.
The reductions (and the rerun baseline) clustered more of the data and identified between 63% and 70% of the non-core points identified in the paper data, a recall (again assuming the paper is the ground truth) of about 68%.

Treating the new baseline (reclustering on the full set of features) as the ground truth:
Precision: 90%-99%, recall: 85%-90%

The lower recall relative to the paper is less of a concern when the reductions are compared against the re-clustered baseline. There is no simple way to say exactly how good the reduced-data results are, because there is no simple definition of what an outlier should be and therefore no clear ground truth.
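
The precision and recall figures above compare sets of non-core points (outliers plus edge members) between two clusterings. A minimal sketch of that comparison, assuming each clustering is summarized as a set of identifiers for its non-core points; the function name and the toy identifiers are hypothetical and not taken from clusterOutliers:

```python
def precision_recall(predicted, truth):
    """Precision and recall of `predicted` non-core points against `truth`."""
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


# Toy example (made-up identifiers): the reference flags four points,
# the reduction flags three of them and nothing else.
reference_non_core = {"kplr001", "kplr002", "kplr003", "kplr004"}
reduction_non_core = {"kplr001", "kplr002", "kplr003"}

precision, recall = precision_recall(reduction_non_core, reference_non_core)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Swapping which clustering plays the role of `truth` changes both numbers, which is why the choice of ground truth matters so much here.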

In [1]:
# Import the custom code developed for this work
import sys
sys.path.append('python')
from clusterOutliers import clusterOutliers as coo
from clusterOutliers import import_gen
import quarterTools as qt

# Cluster-Outlier Object creation from calculated features

Calling import_gen with defaults creates an importer for clusterOutlier objects whose data is
stored in a common directory with a common naming convention.

Defaults:
* path to data files directory (filedir): /home/dgiles/Documents/KeplerLCs/output/
* output file suffix (suffix): \_output.p
* path to fits files directory (fitsdir): /home/dgiles/Documents/KeplerLCs/fitsFiles
* output file extension (out_file_ext): .coo
    
This was executed on 5/19/2019 to initialize the cluster outlier objects. The cell does not need to be run again
unless the definition of the object changes significantly, because it rebuilds each object from the feature data alone.
Running it and saving would overwrite any work or additions stored in the cluster outlier object, including
dimensionality reductions, outlier scoring, and any other analysis.

In [None]:
import_quarter = import_gen()
# Q_dict contains a clusterOutlier object for each quarter.
Q_dict = {'Q{}'.format(i):import_quarter('Q{}'.format(i)) for i in range(1,18)}
for k in Q_dict:
    # Saving is commented out to prevent accidental execution; it can overwrite previous work with the same name.
    # Q_dict[k].save()
    pass  # keeps the loop body valid while the save call is disabled
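
One way to guard against the accidental overwrite described above is to open the output file in exclusive-creation mode, which raises an error if the file already exists. This is only a sketch using the standard library; `save_once` and the file name are hypothetical and not part of clusterOutliers, whose `save` method may behave differently.

```python
import os
import pickle
import tempfile


def save_once(obj, path):
    """Pickle `obj` to `path`, refusing to overwrite an existing file."""
    # Mode 'xb' is exclusive creation: open() raises FileExistsError
    # instead of truncating a file that is already there.
    with open(path, "xb") as file:
        pickle.dump(obj, file)


# Demonstration in a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "Q1_demo.coo")
    save_once({"reductions": {}}, target)
    try:
        save_once({"reductions": {}}, target)
        overwritten = True
    except FileExistsError:
        overwritten = False
    print("second save overwrote the file:", overwritten)
```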

## Cluster-Outlier Object Import

In [2]:
import pickle
Q_dict = dict()
for i in range(1,18):
    with open('/home/dgiles/Documents/KeplerLCs/output/Q{}.coo'.format(i),'rb') as file:
        Q_dict['Q{}'.format(i)]=pickle.load(file)

# Reductions

### Example

In [4]:
import pickle
with open('/home/dgiles/Documents/KeplerLCs/output/Q3.coo','rb') as file:
    Q3=pickle.load(file)
    print("Q3.coo imported")
# By default the reduction method assumes that the data has not been scaled. The default scaler is StandardScaler.
pca90 = Q3.pca_red(df='self',red_name='PCA90',var_rat=0.9,scaled=False,verbose=True)
# Data can be pre-scaled and fed in. Technically the method could also be run on data outside the object,
# but in that case it would be preferable to use the sklearn package directly.
scaled = qt.data_scaler(Q3.data)
pca95 = Q3.pca_red(df=scaled,red_name='PCA95',var_rat=0.95,scaled=True,verbose=True)

Q3.coo imported
Scaling data using StandardScaler...

Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 18,
        Variance explained: 91.1%

Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 23,
        Variance explained: 95.6%
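
For data outside a clusterOutliers object, the same kind of reduction can be sketched with scikit-learn directly. The example assumes that `pca_red` standardizes the features and then keeps the fewest principal components that reach the requested variance ratio; that is inferred from the output above, not from the method's source, and the random data is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data only: 500 samples, 20 correlated features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
features = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

# Standardize each feature to zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)

# A float n_components in (0, 1) with svd_solver='full' keeps the smallest
# number of components whose cumulative explained variance exceeds it.
pca = PCA(n_components=0.90, svd_solver="full")
reduced = pca.fit_transform(scaled)

explained = pca.explained_variance_ratio_.sum()
print(f"Dimensions: {pca.n_components_}, variance explained: {explained:.1%}")
```

As in the output above, the variance actually explained usually lands a little over the requested threshold, because components are added whole.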
        


"""
Creating PCA reductions for each quarter that explain 90%, 95%, and 99% of variance

Reductions have been produced and saved in their respective cluster outlier files, output has been suppressed, cell converted to markdown to avoid accidental execution.
"""
```
for k in Q_dict.keys():
    print("Starting "+k)
    pca90 = Q_dict[k].pca_red(df='self',red_name='PCA90',var_rat=0.9,scaled=False,verbose=False)
    pca95 = Q_dict[k].pca_red(df='self',red_name='PCA95',var_rat=0.95,scaled=False,verbose=False)
    pca99 = Q_dict[k].pca_red(df='self',red_name='PCA99',var_rat=0.99,scaled=False,verbose=False)
    Q_dict[k].save()
```

Saving reductions to a separate file (probably deprecated? I can't remember why I did this)

In [None]:
for k in list(Q_dict.keys())[1:-1]:
    print(k,Q_dict[k].reductions.keys())
    with open('/home/dgiles/Documents/KeplerLCs/output/'+k+".rdct",'wb') as file:
        pickle.dump(Q_dict[k].reductions,file)

In [4]:
Q_dict['Q4'].reductions.keys()

dict_keys(['PCA90', 'PCA95', 'PCA99'])

In [8]:
featCSV = "/home/dgiles/Documents/KeplerLCs/output/Archive/Q4_FullSample.csv" # Path to csv containing feature data (should be a pandas dataframe saved as a csv)
fitsDir = "/home/dgiles/Documents/KeplerLCs/fitsFiles/Q4fitsfiles" # path to fits files
Q4_sample = coo(featCSV,fitsDir)

In [11]:
common_sampler = qt.make_sampler(Q4_sample.files)

In [12]:
Q4_common_PCA90 = common_sampler(Q_dict['Q4'].reductions['PCA90'])

In [14]:
labels = Q_dict['Q4'].db_out(Q4_common_PCA90)

Estimating Parameters...
Sampling data for parameter estimation...
Calculating nearest neighbor distances...
Finding elbow...

        Epsilon is in the neighborhood of 03.22.
        
Scaling density...
Clustering data with DBSCAN, eps=03.22,min_samples=59...
There are 4186 total outliers and 1882 edge members.
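
The log above suggests that `db_out` estimates epsilon from nearest-neighbor distances and then runs DBSCAN. A rough sketch of that idea with scikit-learn follows. The elbow rule used here (a fixed percentile of the k-th neighbor distance) is a stand-in assumption, not the method's actual elbow finder, and the data and `min_samples` value are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
dense = rng.normal(scale=0.5, size=(300, 2))        # one dense cluster
sparse = rng.uniform(low=-8, high=8, size=(20, 2))  # scattered points
data = np.vstack([dense, sparse])

min_samples = 10
# Distance from each point to its min_samples-th neighbor, counting the
# point itself as the first neighbor (distance 0).
nn = NearestNeighbors(n_neighbors=min_samples).fit(data)
distances, _ = nn.kneighbors(data)
kth_distance = np.sort(distances[:, -1])

# Stand-in for the elbow search: the 90th percentile of those distances.
eps = float(np.percentile(kth_distance, 90))

model = DBSCAN(eps=eps, min_samples=min_samples).fit(data)
labels = model.labels_

is_core = np.zeros(len(data), dtype=bool)
is_core[model.core_sample_indices_] = True
outliers = labels == -1            # not assigned to any cluster
edge = (labels != -1) & ~is_core   # in a cluster but not a core point

print(f"eps={eps:.2f}: {outliers.sum()} outliers, {edge.sum()} edge members")
```

Every point ends up in exactly one of the three groups (core, edge, outlier), which is the split the precision and recall comparison relies on.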


## Timing performance 

In [36]:
import pickle
from datetime import datetime
import pandas as pd
import numpy as np
from datetime import timedelta

In [17]:
def foo(Q):
    """Load quarter Q and time the PCA90, PCA95, and PCA99 reductions.

    Returns a list of three timedeltas, one per reduction.
    """
    with open('/home/dgiles/Documents/KeplerLCs/output/{}.coo'.format(Q),'rb') as file:
        Qcoo=pickle.load(file)
        print("{}.coo imported".format(Q))

    # Each reduction is timed as the gap between consecutive timestamps.
    timestart = datetime.now()
    pca90 = Qcoo.pca_red(df='self',red_name='PCA90',var_rat=0.9,scaled=False,verbose=False)
    time90 = datetime.now()
    pca95 = Qcoo.pca_red(df='self',red_name='PCA95',var_rat=0.95,scaled=False,verbose=False)
    time95 = datetime.now()
    pca99 = Qcoo.pca_red(df='self',red_name='PCA99',var_rat=0.99,scaled=False,verbose=False)
    time99 = datetime.now()
    return [time90-timestart,time95-time90,time99-time95]
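
Differences of `datetime.now()` work for coarse timing, but the wall clock can be adjusted while a cell runs. A small alternative sketch uses `time.perf_counter`, which is monotonic; `time_call` is a hypothetical helper, not part of quarterTools or clusterOutliers.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Toy workload standing in for a reduction call.
total, seconds = time_call(sum, range(1_000_000))
print(f"sum={total}, took {seconds:.4f} s")
```

The elapsed time comes back as a float in seconds, so no timedelta conversion is needed before writing the results to a CSV.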

In [45]:
rdct_times = pd.DataFrame(index=Q_dict.keys(),data=[],columns=['PCA90_times','PCA95_times','PCA99_times'])
for Q in Q_dict.keys():
    rdct_times.loc[Q]=foo(Q)

Q1.coo imported
Q2.coo imported
Q3.coo imported
Q4.coo imported
Q5.coo imported
Q6.coo imported
Q7.coo imported
Q8.coo imported
Q9.coo imported
Q10.coo imported
Q11.coo imported
Q12.coo imported
Q13.coo imported
Q14.coo imported
Q15.coo imported
Q16.coo imported
Q17.coo imported


In [34]:
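# Mean duration in seconds. Q_times is not defined in the cells shown here;
# it is presumably a list of timedeltas left over from an earlier run.
(sum(Q_times,timedelta(0))/len(Q_times)).total_seconds()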

20.287922

In [49]:
timedelta.total_seconds(Q_times[0])

12.344685

In [63]:
rdct_times.to_csv('reduction_times.csv')

In [61]:
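# Convert every timing column from timedelta to float seconds, matching the table below.
for col in rdct_times.columns:
    rdct_times[col]=rdct_times[col].apply(timedelta.total_seconds)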

In [62]:
rdct_times

Unnamed: 0,PCA90_times,PCA95_times,PCA99_times
Q1,10.800959,15.733025,31.469359
Q2,15.636804,23.47498,40.848296
Q3,17.268312,23.718734,40.539367
Q4,16.412181,26.08297,43.468744
Q5,15.084672,23.907383,43.325367
Q6,15.155965,19.125999,30.751354
Q7,14.579648,20.126913,33.302158
Q8,14.638839,23.367365,39.939548
Q9,13.007518,18.856272,33.763662
Q10,13.070692,20.757123,35.445522


In [65]:
Q_dict['Q1'].reductions.keys()

dict_keys(['PCA90', 'PCA95', 'PCA99'])

In [67]:
rdct_times.columns

Index(['PCA90_times', 'PCA95_times', 'PCA99_times'], dtype='object')

In [66]:
def foo(Q):
    """Time the baseline clustering (db_out on the full feature set) for quarter Q."""
    with open('/home/dgiles/Documents/KeplerLCs/output/{}.coo'.format(Q),'rb') as file:
        Qcoo=pickle.load(file)
        print("{}.coo imported".format(Q))
    # Note: the clustering below runs on the in-memory Q_dict[Q], not on the freshly loaded Qcoo.
    timestart = datetime.now()
    labels = Q_dict[Q].db_out(df='self',verbose=False)
    timebase = datetime.now()
    # Timing of the clustering on each reduction is disabled in this version of the cell.
    #pca90 = Q_dict[Q].db_out(df=Q_dict[Q].reductions['PCA90'],verbose=False)
    #time90 = datetime.now()
    #pca95 = Q_dict[Q].db_out(df=Q_dict[Q].reductions['PCA95'],verbose=False)
    #time95 = datetime.now()
    #pca99 = Q_dict[Q].db_out(df=Q_dict[Q].reductions['PCA99'],verbose=False)
    #time99 = datetime.now()
    return [timebase-timestart]#,time90-timebase,time95-time90,time99-time95]

In [68]:
cluster_times = pd.DataFrame(index=Q_dict.keys(),
                             data=[],
                             columns=['base_times','PCA90_times','PCA95_times','PCA99_times'])

for Q in Q_dict.keys():
    cluster_times.loc[Q]=foo(Q)

Q1.coo imported
Q2.coo imported
Q3.coo imported
Q4.coo imported
Q5.coo imported
Q6.coo imported
Q7.coo imported
Q8.coo imported
Q9.coo imported
Q10.coo imported
Q11.coo imported
Q12.coo imported
Q13.coo imported
Q14.coo imported
Q15.coo imported
Q16.coo imported
Q17.coo imported


In [69]:
cluster_times

Unnamed: 0,base_times,PCA90_times,PCA95_times,PCA99_times
Q1,0:00:00.664112,0:05:36.920364,0:06:31.406803,0:07:49.503053
Q2,0:00:00.597309,0:06:07.008624,0:07:13.794225,0:10:11.842755
Q3,0:00:00.585696,0:05:33.111268,0:06:19.336950,0:07:58.892724
Q4,0:00:00.599547,0:05:28.040484,0:06:40.874488,0:08:10.595437
Q5,0:00:00.683147,0:07:17.708614,0:08:28.320800,0:10:20.854077
Q6,0:00:00.667909,0:04:46.475251,0:07:41.115918,0:07:28.367959
Q7,0:00:00.594571,0:05:15.496100,0:06:06.243251,0:08:08.195946
Q8,0:00:00.601752,0:07:17.999669,0:08:24.057016,0:10:29.516116
Q9,0:00:00.578084,0:05:26.499160,0:06:26.872182,0:08:36.890855
Q10,0:00:00.582821,0:06:35.383877,0:07:38.692501,0:08:57.583528


In [70]:
for col in cluster_times.columns:
    cluster_times[col]=cluster_times[col].apply(timedelta.total_seconds)
cluster_times.to_csv('cluster_times.csv')

In [2]:
import pickle
Q_dict = dict()
for i in range(1,18):
    with open('/home/dgiles/Documents/KeplerLCs/output/Q{}.coo'.format(i),'rb') as file:
        Q_dict['Q{}'.format(i)]=pickle.load(file)
        

In [4]:
Q_dict['Q2'].data.describe()

Unnamed: 0,longtermtrend,meanmedrat,skews,varss,coeffvar,stds,kurt,mad,maxslope,minslope,...,percentamp,magratio,sautocorrcoef,autocorrcoef,flatmean,tflatmean,roundmean,troundmean,roundrat,flatrat
count,166279.0,166279.0,166279.0,166279.0,166279.0,166279.0,166279.0,166279.0,166279.0,166279.0,...,166279.0,166279.0,166279.0,166279.0,166279.0,166279.0,166279.0,166279.0,166279.0,166279.0
mean,9.371762e-07,0.999961,3.340188,0.0003351354,0.001764,0.001849,123.111624,0.001132,0.086729,-0.083243,...,0.012911,3.734413,-0.377254,0.450403,0.021728,0.025312,0.084013,-0.092002,-0.890005,1.195254
std,0.0003665585,0.008698,7.260049,0.0534972,0.017906,0.018213,346.105208,0.00645,0.543824,0.419445,...,0.082309,4.334264,0.283425,0.368717,0.242568,0.265486,4.864805,4.656313,508.366026,1.519619
min,-0.08239169,-0.325208,-42.724255,3.043061e-10,-5.672508,1.7e-05,-1.88019,9e-06,0.001443,-56.360523,...,9.2e-05,0.000993,-0.97589,-0.936395,-0.100668,-0.098674,-149.433333,-285.934677,-123456.0,-91.427759
25%,-6.16437e-07,0.999998,0.081774,1.012456e-07,0.000318,0.000318,0.343933,0.00019,0.026899,-0.076667,...,0.003026,1.044859,-0.498457,0.102779,0.004783,0.003993,-0.035727,-0.017802,-1.014421,1.058415
50%,-5.262431e-09,1.000007,0.564201,2.819561e-07,0.000531,0.000531,4.72497,0.000314,0.046925,-0.046883,...,0.006044,2.251401,-0.487424,0.347322,0.008159,0.006748,-0.006172,0.007027,-0.156886,1.167688
75%,6.244583e-07,1.000019,2.780292,7.525685e-07,0.000867,0.000868,51.072044,0.000523,0.076757,-0.026836,...,0.01238,4.598552,-0.453832,0.85799,0.012324,0.010234,0.015515,0.041276,0.728192,1.288302
max,0.07015344,3.434339,62.274471,18.93018,1.337901,4.350882,3938.013111,0.680921,70.261696,-0.001465,...,20.883038,105.570994,0.995079,0.999999,19.227104,15.378776,266.550377,132.343519,92467.13989,588.931409
