## Import necessary libraries and modules

In [None]:
import pandas as pd
import numpy as np
import os 
import seaborn as sns
import statistics

In [None]:
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler

I want to have interactive plots and visualizations for my data, so I'm going to use HoloViews.

### Load the cleaned dataset

In [None]:
# load in dataset
mutants = pd.read_csv('../data/interim/k8_clean_data.csv', header=None, low_memory=False)

In [None]:
mutants.head()

In [None]:
# pull out the column that holds the mutant nametags
nametags = mutants[5408].astype(str)
print(nametags)

## Dimensionality Reduction

Because this dataset is so high-dimensional (5408 features!), it is hard to know where to start. Common visualization techniques designed for lower-dimensional data are too computationally expensive and time-consuming to be useful at this point.

Some first steps to reduce dimensionality would be:

1) Search for features with no variance and drop these features, since they do not provide any information for our classification problem.

2) Look for multicollinearity in the features! Identify features that are highly linearly correlated with each other or that have a high PMI (pointwise mutual information) score. Such features are redundant: a group of features with high PMI scores or high linear correlation adds little information beyond what one of them already provides.

Since I am not sure yet whether the features have linear or nonlinear (geometric?) relationships, I will use both linear correlation and PMI scores (mutual information also captures nonlinear relationships) for the dimensionality reduction and compare the results of the two methods.

### Find Features with No Variance or Low Variance

Using the pandas describe() method, I will look for numerical features whose standard deviation is 0 and whose min and max values are equal. These features will be dropped.

I will also make a note of features that have very low variance, especially compared to the rest of the features, and investigate the relevance of those features further.

In [None]:
mutants.describe()
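
With thousands of columns, the describe() table is hard to scan by eye, so the same check can be done programmatically. This is a minimal sketch, not the final cleaning step: `drop_constant_features` is a hypothetical helper name, and the toy frame only stands in for `mutants`.

```python
import pandas as pd

def drop_constant_features(df):
    """Drop numeric columns with zero spread (std == 0 or min == max)."""
    # describe() only summarizes the numeric columns selected here;
    # transposing gives one row per feature.
    stats = df.select_dtypes(include="number").describe().T
    is_constant = (stats["std"] == 0) | (stats["min"] == stats["max"])
    constant = stats[is_constant].index.tolist()
    return df.drop(columns=constant), constant

# Toy stand-in for `mutants` (made-up values): column 1 is constant,
# column 2 plays the role of a non-numeric column such as the nametags.
toy = pd.DataFrame({0: [1.0, 2.0, 3.0], 1: [5.0, 5.0, 5.0], 2: ["a", "b", "c"]})
reduced, dropped = drop_constant_features(toy)
print(dropped)
```

On the real data this would be called as `drop_constant_features(mutants)`. Sorting `stats["std"]` in ascending order is one way to surface the low-variance features mentioned above for a closer look.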

### Using Linear Correlation to Reduce Dimensionality

My strategy for dimensionality reduction using linear correlation, excluding non-numerical features, is as follows:

1) Start by looking only at the 2D (electrostatic and surface-based) features. These are the first 4827 features, or columns, in the dataset.

2) Calculate the linear correlation of the first two features with each other. This is a measure of the **redundancy** of the two features.
    
    (a) If this value is higher than the chosen threshold value, then that tells me that the two features are collinear and are redundant. Now, decide which of the two features to drop by proceeding to step #3.
    
    (b) If this value is lower than the chosen threshold value, then that tells me that the two features are not collinear and both features will be retained in the dataset. Proceed to step #4.

3) Take the two features from step #2 and calculate their correlation with the rest of the features. This gives a [2 x 4827] correlation matrix. Calculate the median of each row, which gives two numbers summarizing how highly correlated each of the two features is with the rest of the 2D features, i.e. **the relevancy of that feature to the rest of the dataset**. *The mean is sensitive to outliers, so the median is a more robust summary statistic.*
    
    (a) If one of the two values is much larger than the other, then drop the feature with the larger value since that feature has more repeated information over the whole dataset than the other feature. If the two values are the same, then retain both features in the dataset.

4) Move on to the next two features and repeat the process for all of the 2D features. 

5) After subsetting the 2D features in this way, repeat the process with the 3D (distance-based) features and combine the data. 

6) Investigate the relationships of the selected features with the target variable (inactive or active). 
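
Steps 2 to 4 can be sketched roughly as follows. Several details are my assumptions rather than settled choices: the helper name `prune_correlated_pairs`, the `threshold` and `margin` defaults, using the absolute Pearson correlation, and reading "the next two features" as non-overlapping consecutive column pairs. The toy columns are constructed by hand so that the correlations are easy to reason about.

```python
import numpy as np
import pandas as pd

def prune_correlated_pairs(features, threshold=0.9, margin=0.05):
    """Walk through consecutive column pairs and drop the more redundant
    member of each highly correlated pair (hypothetical helper)."""
    corr = features.corr().abs()           # |Pearson r| for every column pair
    cols = list(features.columns)
    to_drop = []
    for a, b in zip(cols[0::2], cols[1::2]):
        if corr.loc[a, b] <= threshold:
            continue                        # step 2(b): not collinear, keep both
        others = [c for c in cols if c not in (a, b)]
        med_a = corr.loc[a, others].median()  # step 3: median correlation
        med_b = corr.loc[b, others].median()  # with the rest of the features
        if abs(med_a - med_b) > margin:       # step 3(a): one is clearly larger
            to_drop.append(a if med_a > med_b else b)
    return features.drop(columns=to_drop), to_drop

# Toy data (made up): u, v, w are mutually uncorrelated patterns.
u = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
v = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
w = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
toy = pd.DataFrame({"f0": u, "f1": u + 0.3 * v, "f2": v, "f3": w})

reduced, dropped = prune_correlated_pairs(toy)
print(dropped)
```

Here `f0` and `f1` are collinear, and `f1` also overlaps with `f2`, so `f1` is the one flagged for removal. Computing the full correlation matrix up front is convenient for a sketch, but for several thousand columns it may be cheaper to correlate only the two candidate columns against the rest inside the loop.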

### Using Mutual Information to Reduce Dimensionality

My strategy for dimensionality reduction using mutual information is as follows:

1) Determine a threshold value for the PMI score. 

2) Take the first two features (columns) and calculate the PMI score between them. 

    (a) If the PMI score is large (above the threshold value), then the two features are highly dependent. In this case, keep the feature with the largest information gain and store it in a dataframe. The other feature will be removed and stored in a separate dataframe.
    
    (b) If the PMI score is low (below the threshold value), then the two features are only weakly dependent and both features may be retained.
    
    (c) If the PMI score is 0, then the two features are statistically independent and both features may be retained.


3) Repeat the process with the next set of two features until all the 2D features have been tested. 

4) Continue the process with the 3D features.

Afterwards, I will investigate the relationships of the remaining features with the target variable.
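
A rough sketch of the pairwise test in step 2 is below. It estimates the mutual information between two features from a 2D histogram, which is the average of the pointwise values over all binned outcomes. The function names, the bin count, the `threshold` default, and reading "information gain" as mutual information with a numerically encoded target are all assumptions for illustration.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of the mutual information of x and y, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                # joint probabilities per cell
    px = pxy.sum(axis=1, keepdims=True)      # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)      # marginal of y, shape (1, bins)
    nonzero = pxy > 0                        # skip empty cells (0 * log 0 = 0)
    ratio = pxy[nonzero] / (px @ py)[nonzero]
    return float(np.sum(pxy[nonzero] * np.log(ratio)))

def keep_from_pair(x, y, target, threshold=0.5, bins=10):
    """Return 'both', 'x' or 'y' following steps 2(a) to 2(c)."""
    if mutual_information(x, y, bins) <= threshold:
        return "both"                        # weakly dependent or independent
    gain_x = mutual_information(x, target, bins)
    gain_y = mutual_information(y, target, bins)
    return "x" if gain_x >= gain_y else "y"

# Toy data (made up): identical features versus an independent grid.
x = np.arange(1000, dtype=float)
target = (x >= 500).astype(float)            # stand-in for inactive/active as 0/1
mi_same = mutual_information(x, x.copy())
grid_x = np.repeat(np.arange(10.0), 10)
grid_y = np.tile(np.arange(10.0), 10)
mi_indep = mutual_information(grid_x, grid_y)
choice = keep_from_pair(x, x.copy(), target)
print(mi_same, mi_indep, choice)
```

Histogram estimates depend on the bin count and tend to be biased upward for small samples, so the threshold would need to be tuned against the real data rather than taken from this sketch.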