This notebook seeks to explore the utility of dimensionality reduction with PCA prior to clustering. At the time of this writing, this notebook considers 3 levels of variance explained: 90%, 95%, and 99%. 
Aside from the reduction we will process the data in the same way as previously done. We will scale the data using sklearn's standardScaler and cluster the data using the method clusterOutliers.db_out.
We will compare these reductions to the clustering done on the full set of features. 

In [1]:
# Some standard imports for math and data handling
import numpy as np
import pandas as pd
from scipy import stats

# Imports for processing specific to this workbook
from sklearn.decomposition import PCA
from sklearn import preprocessing
from datetime import datetime

# Import the custom code developed for this work
import sys
np.set_printoptions(threshold=sys.maxsize)
sys.path.append('python')
from clusterOutliers import clusterOutliers as coo
from clusterOutliers import import_gen
import quarterTools as qt

"""
Calling import_gen with defaults, creates an importer for clusterOutlier objects w/ data 
stored in a common directory with a common naming convention.
    path to data files directory (filedir): /home/dgiles/Documents/KeplerLCs/output/
    output file suffix (suffix): _output.p
    path to fits files directory (fitsdir): /home/dgiles/Documents/KeplerLCs/fitsFiles
    output file extension (out_file_ext): .coo
    
This has been executed on 5/19/2019 to initialize the cluster outlier objects, this cell does not need to be run again
unless the definition of the object is significantly altered as it will overwrite the existing object based only on the 
feature data. Doing this would overwrite any work or additions made to the cluster outlier object including 
dimensionality reductions, outlier scoring, and any other analysis stored in these objects. This cell has been
converted to markdown to avoid accidentally running the code.
"""
```
import_quarter = import_gen()
# Q_dict contains a clusterOutlier object for each quarter.
Q_dict = {'Q{}'.format(i):import_quarter('Q{}'.format(i)) for i in range(1,18)}
for k in Q_dict:
    Q_dict[k].save()
```

The code below is used to import existing cluster outlier objects.

In [5]:
import pickle
Q_dict = dict()
for i in range(1,18):
    with open('/home/dgiles/Documents/KeplerLCs/output/Q{}.coo'.format(i),'rb') as file:
        Q_dict['Q{}'.format(i)]=pickle.load(file)


# Reductions and Clustering

In [11]:
# Creating PCA reductions for each quarter that explain 90%, 95%, and 99% of variance
for k in Q_dict:
    pca90 = Q_dict['Q{}'.format(i)].pca_red(df='self',red_name='PCA90',var_rat=0.9,scaled=False,verbose=True)
    pca95 = Q_dict['Q{}'.format(i)].pca_red(df='self',red_name='PCA95',var_rat=0.95,scaled=False,verbose=True)
    pca99 = Q_dict['Q{}'.format(i)].pca_red(df='self',red_name='PCA99',var_rat=0.99,scaled=False,verbose=True)

for k in Q_dict:
    Q_dict[k].save()

Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 90.0% of the variance...

        Dimensions: 16,
        Variance explained: 91.0%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 95.0% of the variance...

        Dimensions: 21,
        Variance explained: 95.4%
        
Scaling data using StandardScaler...


  return self.partial_fit(X, y)
  scaled_data = scaler.transform(data)


Finding minimum number of dimensions to explain 99.0% of the variance...

        Dimensions: 30,
        Variance explained: 99.0%
        
