## History 

This notebook previously explored the utility of dimensionality reduction with PCA prior to clustering, considering three levels of variance explained: 90%, 95%, and 99%.
Aside from the reduction, the data was processed in the same way as before. Prior to dimensionality reduction, the data was scaled using sklearn's StandardScaler. After reduction, the data was clustered using the method clusterOutliers.db_out.
These reductions were compared to the clustering on the full set of features from the paper, as well as to a rerun of the clustering on the full set of features (hereafter referred to as the baseline). A summary of these results is stored in the Google Sheets file 'PCA Reduction'.
In short:

Treating the paper data as the ground truth:
Precision: 100%, Recall: 68%

The paper data had a stricter definition of a cluster and contained essentially every non-core data point identified in the reductions and the rerun baseline. Almost every point identified as an outlier or an edge member in the reductions was identified as such in the paper's results. If we treat the paper as the ground truth (which is problematic), then the precision of the reductions is effectively perfect.
The reductions (and the rerun baseline) clustered more of the data and identified between 63% and 70% of the non-core points identified in the paper data, a recall (assuming the paper is the ground truth) of about 68%.

Treating the new baseline (reclustering on the full set of features) as the ground truth:
Precision: 90%-99%, Recall: 85%-90%

The lower recall relative to the paper is mitigated when the reductions are compared against the re-clustered baseline. There is no simple way to say exactly how good the results on the reduced data are, because there is no simple way to define what an outlier should be and therefore no clear ground truth.
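
The precision and recall figures above compare sets of non-core points (outliers plus edge members) against a chosen reference. A minimal sketch of that comparison follows; the function name `precision_recall` and the toy IDs are illustrative assumptions, not part of `clusterOutliers` and not values from the Kepler data.

```python
def precision_recall(flagged, reference):
    """Precision and recall of `flagged` against `reference` (iterables of IDs)."""
    flagged, reference = set(flagged), set(reference)
    true_pos = len(flagged & reference)
    # Precision: fraction of flagged points that the reference also flags.
    precision = true_pos / len(flagged) if flagged else float("nan")
    # Recall: fraction of reference points that were recovered.
    recall = true_pos / len(reference) if reference else float("nan")
    return precision, recall


# Toy IDs for illustration only (hypothetical, not from the notebook's data).
paper_noncore = {"a", "b", "c", "d", "e"}
reduction_noncore = {"a", "b", "c"}

p, r = precision_recall(reduction_noncore, paper_noncore)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=1.00, recall=0.60
```

Swapping which set is passed as `reference` corresponds to switching between the paper and the rerun baseline as the assumed ground truth.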

In [1]:
# Import the custom code developed for this work
import sys
sys.path.append('python')
from clusterOutliers import clusterOutliers as coo
from clusterOutliers import import_gen
import quarterTools as qt

# Cluster-Outlier Object creation from calculated features

Calling import_gen with defaults creates an importer for clusterOutlier objects with data
stored in a common directory with a common naming convention.

Defaults:
* path to data files directory (filedir): /home/dgiles/Documents/KeplerLCs/output/
* output file suffix (suffix): \_output.p
* path to fits files directory (fitsdir): /home/dgiles/Documents/KeplerLCs/fitsFiles
* output file extension (out_file_ext): .coo
    
This was executed on 5/19/2019 to initialize the cluster outlier objects. This cell does not need to be run again
unless the definition of the object is significantly altered, because it rebuilds each object from the feature data
alone and overwrites the existing one. Doing so would discard any work or additions made to the cluster outlier object,
including dimensionality reductions, outlier scoring, and any other analysis stored in these objects.

In [None]:
import_quarter = import_gen()
# Q_dict contains a clusterOutlier object for each quarter.
Q_dict = {'Q{}'.format(i):import_quarter('Q{}'.format(i)) for i in range(1,18)}
for k in Q_dict:
    # The save call is commented out to prevent accidental execution; it can overwrite previous work with the same name.
    # Q_dict[k].save()
    pass  # placeholder so the loop body stays valid while the save is disabled
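
One way to guard against the accidental overwrite described above is to open the output file in exclusive-creation mode. The sketch below is an assumption about how such a guard could look; `save_no_overwrite` is a hypothetical helper and is not a method of the cluster outlier object.

```python
import pickle
import tempfile
from pathlib import Path


def save_no_overwrite(obj, path):
    """Pickle `obj` to `path`, refusing to replace an existing file."""
    path = Path(path)
    # Mode "xb" raises FileExistsError if the file already exists.
    with open(path, "xb") as fh:
        pickle.dump(obj, fh)
    return path


# Demonstration in a temporary directory (the file name is illustrative).
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Q1.coo"
    save_no_overwrite({"demo": 1}, target)
    try:
        save_no_overwrite({"demo": 2}, target)
    except FileExistsError:
        print("refused to overwrite", target.name)
```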

## Cluster-Outlier Object Import

In [2]:
import pickle
Q_dict = dict()
for i in range(1,18):
    with open('/home/dgiles/Documents/KeplerLCs/output/Q{}.coo'.format(i),'rb') as file:
        Q_dict['Q{}'.format(i)]=pickle.load(file)

# Reductions

### Example

In [4]:
import pickle
with open('/home/dgiles/Documents/KeplerLCs/output/Q3.coo','rb') as file:
    Q3=pickle.load(file)
    print("Q3.coo imported")
# By default the reduction method assumes that the data has not been scaled. The default scaler is StandardScaler.
pca90 = Q3.pca_red(df='self',red_name='PCA90',var_rat=0.9,scaled=False,verbose=True)
# Data can be pre-scaled and fed in. Technically the method could be run on data outside the object too,
# but in that case it would be preferable to use the sklearn package directly.
scaled = qt.data_scaler(Q3.data)
pca95 = Q3.pca_red(df=scaled,red_name='PCA95',var_rat=0.95,scaled=True,verbose=True)

Q3.coo imported
Scaling data using StandardScaler...




Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 18,
        Variance explained: 91.1%
        




Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 23,
        Variance explained: 95.6%
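
The output above reports the smallest number of dimensions whose cumulative explained variance reaches the requested ratio. The following NumPy-only sketch illustrates that selection on standardized data. It is not the implementation of `pca_red`; the function name `min_components_for_variance` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def min_components_for_variance(data, var_rat):
    """Return (k, explained): the fewest principal components whose
    cumulative explained-variance ratio reaches `var_rat`, and that ratio."""
    X = np.asarray(data, dtype=float)
    # Standardize each feature to zero mean and unit variance.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns become zeros instead of dividing by 0
    Z = (X - X.mean(axis=0)) / std
    # Squared singular values are proportional to the component variances.
    s = np.linalg.svd(Z, full_matrices=False, compute_uv=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    # First index where the cumulative ratio reaches the target, as a count.
    k = min(int(np.searchsorted(ratio, var_rat)) + 1, len(ratio))
    return k, float(ratio[k - 1])


# Synthetic example: 10 features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
features = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

for target in (0.90, 0.95, 0.99):
    k, explained = min_components_for_variance(features, target)
    print(f"target {target:.0%}: {k} dimensions, {explained:.1%} explained")
```

With scikit-learn, passing a fraction between 0 and 1 as `n_components` to `PCA` (with `svd_solver='full'`) performs a comparable selection.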
        


"""
Creating PCA reductions for each quarter that explain 90%, 95%, and 99% of the variance.

Reductions have been produced and saved in their respective cluster outlier files. Output has been suppressed, and the cell has been converted to markdown to avoid accidental execution.
"""
```
for k in Q_dict.keys():
    print("Starting "+k)
    pca90 = Q_dict[k].pca_red(df='self',red_name='PCA90',var_rat=0.9,scaled=False,verbose=False)
    pca95 = Q_dict[k].pca_red(df='self',red_name='PCA95',var_rat=0.95,scaled=False,verbose=False)
    pca99 = Q_dict[k].pca_red(df='self',red_name='PCA99',var_rat=0.99,scaled=False,verbose=False)
    Q_dict[k].save()
```

Saving reductions to a separate file (probably deprecated? I can't remember why I did this)

In [None]:
# Note: the [1:-1] slice skips the first and last keys of Q_dict.
for k in list(Q_dict.keys())[1:-1]:
    print(k,Q_dict[k].reductions.keys())
    with open('/home/dgiles/Documents/KeplerLCs/output/'+k+".rdct",'wb') as file:
        pickle.dump(Q_dict[k].reductions,file)

In [4]:
Q_dict['Q4'].reductions.keys()

dict_keys(['PCA90', 'PCA95', 'PCA99'])

In [8]:
# Path to the csv containing feature data (should be a pandas dataframe saved as a csv)
featCSV = "/home/dgiles/Documents/KeplerLCs/output/Archive/Q4_FullSample.csv"
# Path to the fits files
fitsDir = "/home/dgiles/Documents/KeplerLCs/fitsFiles/Q4fitsfiles"
Q4_sample = coo(featCSV,fitsDir)

In [11]:
common_sampler = qt.make_sampler(Q4_sample.files)

In [12]:
Q4_common_PCA90 = common_sampler(Q_dict['Q4'].reductions['PCA90'])

In [14]:
labels = Q_dict['Q4'].db_out(Q4_common_PCA90)

Estimating Parameters...
Sampling data for parameter estimation...
Calculating nearest neighbor distances...
Finding elbow...

        Epsilon is in the neighborhood of 03.22.
        
Scaling density...
Clustering data with DBSCAN, eps=03.22,min_samples=59...
There are 4186 total outliers and 1882 edge members.
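
The output above separates outliers from edge members. Assuming `db_out` follows the usual DBSCAN terminology (core points, border points reachable from a core point, and noise), the distinction can be sketched with NumPy as below. This is an illustration of those definitions only, not the code behind `db_out`; `point_types` is a hypothetical helper, and it does not assign cluster labels.

```python
import numpy as np


def point_types(X, eps, min_samples):
    """Label each row of X as 'core', 'edge', or 'outlier' using DBSCAN's definitions."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (O(n^2) memory, so only suitable for small demos).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within = dists <= eps
    # A core point has at least min_samples points (itself included) within eps.
    is_core = within.sum(axis=1) >= min_samples
    # An edge point is not core but lies within eps of at least one core point.
    near_core = (within & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "edge", "outlier"))


# Toy one-dimensional data for illustration.
toy = np.array([[0.0], [0.5], [1.0], [1.9], [10.0]])
print(point_types(toy, eps=1.0, min_samples=3))
# ['core' 'core' 'core' 'edge' 'outlier']
```

Counting the point itself toward `min_samples` matches the convention used by scikit-learn's `DBSCAN`.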


## Timing performance 

In [36]:
import pickle
from datetime import datetime, timedelta
import pandas as pd
import numpy as np

In [17]:
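def foo(Q):
    """Load the cluster outlier object for quarter Q and time each PCA reduction.

    Returns a list of three timedeltas: the durations of the 90%, 95%, and 99% reductions.
    """
    with open('/home/dgiles/Documents/KeplerLCs/output/{}.coo'.format(Q),'rb') as file:
        Qcoo=pickle.load(file)
        print("{}.coo imported".format(Q))

    # scaled=False, so each timing includes the scaling step as well as the reduction.
    timestart = datetime.now()
    pca90 = Qcoo.pca_red(df='self',red_name='PCA90',var_rat=0.9,scaled=False,verbose=False)
    time90 = datetime.now()
    pca95 = Qcoo.pca_red(df='self',red_name='PCA95',var_rat=0.95,scaled=False,verbose=False)
    time95 = datetime.now()
    pca99 = Qcoo.pca_red(df='self',red_name='PCA99',var_rat=0.99,scaled=False,verbose=False)
    time99 = datetime.now()
    # Differences between consecutive timestamps give the duration of each reduction.
    return [time90-timestart,time95-time90,time99-time95]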
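
The function above times consecutive steps by differencing wall-clock timestamps. An alternative sketch using `time.perf_counter`, which is a monotonic clock intended for measuring short durations, is shown here. The helper `time_steps` and the stand-in workloads are assumptions for illustration; in the notebook the callables would wrap the three `pca_red` calls.

```python
import time


def time_steps(steps):
    """Run each (name, callable) pair in order and return {name: seconds}."""
    timings = {}
    for name, func in steps:
        start = time.perf_counter()
        func()
        timings[name] = time.perf_counter() - start
    return timings


# Stand-in workloads (hypothetical), used only to show the call pattern.
demo = time_steps([
    ("PCA90", lambda: sum(range(10_000))),
    ("PCA95", lambda: sum(range(20_000))),
    ("PCA99", lambda: sum(range(30_000))),
])
print(demo)
```

Returning plain floats in seconds also avoids the later conversion from `timedelta` objects.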

In [45]:
rdct_times = pd.DataFrame(index=Q_dict.keys(),data=[],columns=['PCA90_times','PCA95_times','PCA99_times'])
for Q in Q_dict.keys():
    rdct_times.loc[Q]=foo(Q)

Q1.coo imported
Q2.coo imported
Q3.coo imported
Q4.coo imported
Q5.coo imported
Q6.coo imported
Q7.coo imported
Q8.coo imported
Q9.coo imported
Q10.coo imported
Q11.coo imported
Q12.coo imported
Q13.coo imported
Q14.coo imported
Q15.coo imported
Q16.coo imported
Q17.coo imported

In [34]:
(sum(Q_times,timedelta(0))/len(Q_times)).total_seconds()

20.287922

In [49]:
timedelta.total_seconds(Q_times[0])

12.344685

In [47]:
rdct_times.to_csv('reduction_times.csv')

In [58]:
rdct_times.PCA90_times.apply(timedelta.total_seconds)

Q1     10.800959
Q2     15.636804
Q3     17.268312
Q4     16.412181
Q5     15.084672
Q6     15.155965
Q7     14.579648
Q8     14.638839
Q9     13.007518
Q10    13.070692
Q11    11.410296
Q12    13.619598
Q13    12.983753
Q14    12.869900
Q15    12.776386
Q16    12.131277
Q17    12.627555
Name: PCA90_times, dtype: float64