This notebook previously sought to explore the utility of dimensionality reduction with PCA prior to clustering, considering three levels of variance explained: 90%, 95%, and 99%.
Aside from the reduction, the data was processed in the same way as before. Prior to dimensionality reduction, the data was scaled using sklearn's StandardScaler. After reduction, the data was clustered using the method clusterOutliers.db_out.
These reductions were compared to the clustering on the full set of features from the paper, as well as to a rerun of the clustering on the full set of features (hereafter referred to as the baseline). A summary of these results is stored in the Google Sheets file 'PCA Reduction'.
In short:
Treating the paper data as the ground truth:
Precision: 100%, Recall: 68%
The paper data had a stricter definition of a cluster and contained essentially every non-core data point identified in the reductions and the rerun baseline. Almost every point identified as an outlier or an edge member in the reductions was also identified as such in the paper's results. If we treat the paper as the ground truth (itself a problematic assumption), then the precision of the reductions is effectively perfect.
The reductions (and the rerun baseline) assigned more of the data to clusters and identified between 63% and 70% of the non-core points identified in the paper data, a recall (assuming the paper is the ground truth) of about 68%.

Treating the new baseline (reclustering on the full set of features) as the ground truth:
Precision: 90%-99%, Recall: 85%-90%

The reduced recall relative to the paper is mitigated when the reductions are compared against the re-clustered data instead. There is no simple way to say exactly how good the results on the reduced data are, because there is no simple way to define what an outlier should be and therefore no clear ground truth.
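
The precision and recall figures above follow from comparing two sets of non-core points (outliers and edge members), with one set treated as the ground truth. A minimal sketch of that comparison is shown below. The function name `precision_recall` and the integer IDs are hypothetical and are not taken from the clusterOutliers code or from the actual data.

```python
def precision_recall(predicted, truth):
    """Compare two collections of point IDs, treating `truth` as ground truth."""
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)  # points flagged in both sets
    precision = true_pos / len(predicted) if predicted else float("nan")
    recall = true_pos / len(truth) if truth else float("nan")
    return precision, recall


# Hypothetical IDs for illustration only (not real Kepler identifiers).
paper_non_core = {1, 2, 3, 4, 5, 6}   # stricter clustering flags more points
reduction_non_core = {1, 2, 3, 4}     # every flagged point is also in the paper set

p, r = precision_recall(reduction_non_core, paper_non_core)
print(f"precision={p:.2f}, recall={r:.2f}")
```

In this toy case every point flagged by the reduction is also flagged in the reference set, so precision is 1.0 while recall is lower. This is the same pattern reported when the paper data is treated as the ground truth.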

In [2]:
# Import the custom code developed for this work
import sys
sys.path.append('python')
from clusterOutliers import clusterOutliers as coo
from clusterOutliers import import_gen
import quarterTools as qt

"""
Calling import_gen with defaults creates an importer for clusterOutlier objects with data
stored in a common directory with a common naming convention:
    path to data files directory (filedir): /home/dgiles/Documents/KeplerLCs/output/
    output file suffix (suffix): _output.p
    path to fits files directory (fitsdir): /home/dgiles/Documents/KeplerLCs/fitsFiles
    output file extension (out_file_ext): .coo
    
This was executed on 5/19/2019 to initialize the cluster outlier objects. This cell does not need to be run again
unless the definition of the object is significantly altered, because it rebuilds each object from the feature data
alone and overwrites the existing one. Doing so would discard any work or additions made to the cluster outlier
object, including dimensionality reductions, outlier scoring, and any other analysis stored in these objects.
This cell has been converted to markdown to avoid accidentally running the code.
"""
```
import_quarter = import_gen()
# Q_dict contains a clusterOutlier object for each quarter.
Q_dict = {'Q{}'.format(i):import_quarter('Q{}'.format(i)) for i in range(1,18)}
for k in Q_dict:
    Q_dict[k].save()
```

The code below is used to import existing cluster outlier objects.

In [3]:
import pickle
Q_dict = dict()
for i in range(1,18):
    with open('/home/dgiles/Documents/KeplerLCs/output/Q{}.coo'.format(i),'rb') as file:
        Q_dict['Q{}'.format(i)]=pickle.load(file)


# Reductions

### Example

In [7]:
pca90 = Q_dict['Q2'].pca_red(df='self',red_name='PCA90',var_rat=0.9,scaled=False,verbose=True)
pca95 = Q_dict['Q2'].pca_red(df='self',red_name='PCA95',var_rat=0.95,scaled=False,verbose=True)
pca99 = Q_dict['Q2'].pca_red(df='self',red_name='PCA99',var_rat=0.99,scaled=False,verbose=True)
Q_dict['Q2'].save()

Scaling data using StandardScaler...

Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 18,
        Variance explained: 90.2%
        
Scaling data using StandardScaler...

Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 24,
        Variance explained: 95.3%
        
Scaling data using StandardScaler...

Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 36,
        Variance explained: 99.0%
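
The internals of `pca_red` are not shown in this notebook. As a rough illustration of the steps its output describes (scale with StandardScaler, then find the fewest dimensions that explain a target fraction of the variance), the following sketch uses scikit-learn directly. The helper name `pca_reduce` and the synthetic feature matrix are assumptions made for the example. This is not the clusterOutliers implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_reduce(features, var_rat=0.9):
    """Scale features, then keep the fewest components explaining >= var_rat."""
    scaled = StandardScaler().fit_transform(features)
    # A float n_components in (0, 1) tells scikit-learn to choose the number
    # of components needed to exceed that fraction of explained variance.
    pca = PCA(n_components=var_rat, svd_solver="full")
    reduced = pca.fit_transform(scaled)
    return reduced, float(pca.explained_variance_ratio_.sum())


# Synthetic stand-in data: 20 correlated features driven by 5 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
features = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

reduced, explained = pca_reduce(features, var_rat=0.9)
print(f"Dimensions: {reduced.shape[1]}, Variance explained: {explained:.1%}")
```

Because the number of components is an integer, the variance actually explained usually lands slightly above the requested threshold, as in the 90.2% and 95.3% figures reported for Q2.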
        


"""
Creating PCA reductions for each quarter that explain 90%, 95%, and 99% of the variance.

The reductions have been produced and saved in their respective cluster outlier files. Output has been suppressed, and the cell has been converted to markdown to avoid accidental execution.
"""
```
for k in Q_dict.keys():
    print("Starting "+k)
    pca90 = Q_dict[k].pca_red(df='self',red_name='PCA90',var_rat=0.9,scaled=False,verbose=False)
    pca95 = Q_dict[k].pca_red(df='self',red_name='PCA95',var_rat=0.95,scaled=False,verbose=False)
    pca99 = Q_dict[k].pca_red(df='self',red_name='PCA99',var_rat=0.99,scaled=False,verbose=False)
    Q_dict[k].save()
```