# The Analysis Helper

#### Contents:
    1. Obtaining the data
        a. Imports
        b. Reading the file
        c. Feature extraction
        d. Binarification
        e. Missing values
        f. Scaling
        g. Outliers
    2. Exploring the rankings
        a. Ranking table
        b. Object popularity
        c. Clustering
        d. Cluster popularity
    3. Prediction model
        a. Initial run
        b. Feature selection
        c. Hyperparameter tuning
        d. Model significance
        e. Final run

## 1. Obtaining the data

### 1. a. Imports

#### Run this cell to import all the necessary packages.

In [None]:
# imports
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.covariance import EllipticEnvelope
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score, confusion_matrix
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import ParameterGrid

# our modules
%run ./obtainingTheData.ipynb
%run ./exploringTheRankings.ipynb
%run ./predictionModel.ipynb

### 1. b. Reading the file

#### Set the file path. The function will then read from this file.

Example:
- file = 'filename.csv'

In [None]:
file = 'filename.csv'

str_data, flt_data = csvGet(file)

#### Sanity check: Set the column title and row number to check whether the cell value is equivalent to the value in the csv.

Example: 
- column_title = 'my_column'
- row_number = 5

In [None]:
column_title = 'my_column'
row_number = 5

sanityCheck1(str_data, column_title, row_number)

### 1. c. Feature extraction

#### Run this to filter the rows. Set the column and category to only retain data from that category.

Example - Only retain data where the 'Type' is 'One':
- column = 'Type'
- category = 'One'

Example - use this setting to retain all data:
- column = None
- category = None

In [None]:
column = 'Type'
category = 'One'
    
filtered_flt_data, filtered_str_data = retainRows(flt_data, str_data, column, category)

sanityCheck2(flt_data, filtered_flt_data)

#### Get specific useful columns - the y labels, the stimulus labels, and the subject IDs. Set the labels for the csv columns matching the desired variables, and the function will retrieve them.

Example:
- y_label = 'Ranking_Stimuli'
- stimulus_label = 'Stimulus'
- subjectID_label = 'SubID'

In [None]:
y_label = 'Ranking_Stimuli'
stimulus_label = 'Stimulus'
subjectID_label = 'SubID'

y_col, stimulus_col, subID_col, object_list, subID_list = getColumns(filtered_flt_data, filtered_str_data, y_label, stimulus_label, subjectID_label)

#### Get the X-variables (features): Set the list of features you would like to include as X-values in the analysis. The function will then create the X-matrix.

Example:
- feature_list = ['Feature_one', 'Feature_two', 'Feature_three']

In [None]:
feature_list = ['Feature_one', 'Feature_two', 'Feature_three']

In [None]:
X = getFeatures(filtered_flt_data, feature_list)

### 1. d. Binarification

#### This cell will create a binary version of the y data (a column with 0s and 1s). This is done according to the median - a value above the median will become 1, otherwise it will become 0. Set the median type to 'subject' or 'overall', depending on which median you would like to use.

Example:
- median = 'subject'

In [None]:
median = 'subject'

binary_y_col = binarify(y_col, subID_col, median)
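The median split described above can be sketched as follows. This is only an illustration, assuming `y_col` and `subID_col` are equal-length arrays; `binarify_sketch` and the demo arrays are hypothetical names, not the helper defined in `obtainingTheData.ipynb`.

```python
import numpy as np

def binarify_sketch(y, sub_ids, median="subject"):
    """Hypothetical stand-in: 1 where y is above the chosen median, else 0."""
    y = np.asarray(y, dtype=float)
    sub_ids = np.asarray(sub_ids)
    if median == "overall":
        # One median computed over all rows
        return (y > np.median(y)).astype(int)
    if median != "subject":
        raise ValueError("median must be 'subject' or 'overall'")
    binary = np.zeros(len(y), dtype=int)
    for sub in np.unique(sub_ids):
        mask = sub_ids == sub
        # Each subject's rankings are compared with that subject's own median
        binary[mask] = (y[mask] > np.median(y[mask])).astype(int)
    return binary

# Made-up data: two subjects who use very different ranges of the scale
demo_y = np.array([1, 2, 3, 4, 10, 20, 30, 40])
demo_ids = np.array([1, 1, 1, 1, 2, 2, 2, 2])
demo_subject = binarify_sketch(demo_y, demo_ids, "subject")
demo_overall = binarify_sketch(demo_y, demo_ids, "overall")
```

With the per-subject median, both demo subjects end up with half 0s and half 1s. With the overall median, every ranking from the first subject becomes 0 and every ranking from the second becomes 1.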

#### Data check: it is important for the y-labels to be balanced (a similar percentage of 0s and 1s). The next cell will report the percentage of 0s and 1s.

In [None]:
dataCheck(binary_y_col)

### 1. e. Missing values

#### Run this cell to replace missing values in the data with the average value of the variable.

In [None]:
replaced_X = replaceMissingValues(X)
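As a rough illustration of mean imputation, the sketch below uses scikit-learn's `SimpleImputer`, which is imported at the top of this notebook. It assumes missing values are stored as `NaN`; the demo array is made up, and the actual `replaceMissingValues` helper may differ.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix with one missing value per column
demo_X = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan]])

# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
demo_filled = imputer.fit_transform(demo_X)
```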

### 1. f. Scaling

#### Run this cell to scale each feature of the data.

In [None]:
scaled_X = scale(replaced_X)
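One common way to scale features is standardization: each column is shifted to mean 0 and rescaled to unit variance. The sketch below assumes that is what is wanted here, as the `StandardScaler` import suggests. The demo array is made up, and the `scale` helper itself may be implemented differently.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales
demo_X = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])

# Standardize each column independently
demo_scaled = StandardScaler().fit_transform(demo_X)
```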

### 1. g. Outliers

#### Set the fraction of data you are considering removing. You can play with this number to mark different amounts.

Example: remove 1% of the data:
- outlier_fraction = 0.01

In [None]:
outlier_fraction = 0.01

sanityCheck3(scaled_X, outlier_fraction)

#### Run this cell to find and visualize the outliers.

In [None]:
outliers_indices = displayOutliers(scaled_X, outlier_fraction)
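To show how an outlier fraction can translate into flagged rows, here is a sketch using scikit-learn's `EllipticEnvelope`, which is imported at the top of this notebook. It assumes the fraction is passed as the `contamination` parameter; the demo data and variable names are hypothetical, and `displayOutliers` may work differently.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Made-up data: 200 roughly Gaussian points plus one planted extreme point
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(200, 2))
demo_X[0] = [8.0, 8.0]

# contamination is the assumed fraction of outliers in the data
detector = EllipticEnvelope(contamination=0.01, random_state=0)
labels = detector.fit_predict(demo_X)  # -1 marks outliers, 1 marks inliers
demo_outlier_indices = np.where(labels == -1)[0]
```

Raising the fraction flags more rows, so it is worth inspecting the flagged points before removing them.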

#### Finally, run this cell to remove the outliers from the data.

In [None]:
processed_X, processed_bi_y, processed_y, processed_stim_col, processed_subID_col = removeOutliers(
    outliers_indices, scaled_X, binary_y_col, y_col, stimulus_col, subID_col)

sanityCheck2(scaled_X, processed_X)

## 2. Exploring the rankings

### 2. a. Ranking table

#### This cell creates a table ("dataframe") where the rows represent subjects, and the columns represent stimuli.

In [None]:
ranking_dataframe = createDataframe(
    processed_subID_col, processed_stim_col, processed_bi_y, object_list, subID_list)

#### A peek of the dataframe:

In [None]:
ranking_dataframe

#### This cell enables you to get the ranking a certain subject gave a certain object from the table.

Example: to view the ranking (0 or 1) subject 1 gave the 'Stimulus_one' stimulus:
- subject_ID = 1
- object_name = 'Stimulus_one'

In [None]:
subject_ID = 1
object_name = 'Stimulus_one'

getRanking(ranking_dataframe, subject_ID, object_name)
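A subject-by-stimulus table of this kind can be built with a pandas pivot. The sketch below is an assumption about the general shape of the data, not the code behind `createDataframe` or `getRanking`; the column names and values are made up.

```python
import pandas as pd

# Made-up long-format data: one row per (subject, stimulus) pair
demo_long = pd.DataFrame({
    "SubID": [1, 1, 2, 2],
    "Stimulus": ["A", "B", "A", "B"],
    "binary_rank": [1, 0, 0, 1],
})

# Rows become subjects, columns become stimuli
demo_table = demo_long.pivot(index="SubID", columns="Stimulus", values="binary_rank")

# Look up the ranking that subject 1 gave stimulus 'A'
demo_value = demo_table.loc[1, "A"]
```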

### 2. b. Object popularity

#### A list of the objects from least favorite to most favorite

In [None]:
favoriteObjects(ranking_dataframe, object_list)

### 2. c. Clustering

#### Set the desired number of clusters

Example:
- num_clusters = 4

In [None]:
num_clusters = 4

#### Perform clustering and display the clusters

In [None]:
cluster_labels, t_rankings = getClusters(ranking_dataframe, num_clusters, object_list)
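The sketch below shows one way such clustering could be done with `KMeans`. It assumes the objects are the items being clustered, each described by the rankings it received from every subject, as the name `t_rankings` (transposed rankings) suggests. The table and variable names are made up, and `getClusters` may differ.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up ranking table: rows are subjects, columns are objects
demo_table = pd.DataFrame(
    {"A": [1, 1, 0, 0], "B": [1, 1, 0, 0], "C": [0, 0, 1, 1], "D": [0, 0, 1, 1]},
    index=[1, 2, 3, 4],
)

# Transpose so that each row describes one object
demo_t = demo_table.T.to_numpy()

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
demo_labels = kmeans.fit_predict(demo_t)  # one cluster label per object
```

Objects A and B were ranked identically by every subject, so they fall into the same cluster, and likewise C and D.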

#### Visualize the clusters on a PCA graph

In [None]:
pca_components = visualizeClusters(cluster_labels, t_rankings)
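For reference, projecting the objects onto two principal components might look like the sketch below. It assumes each row of the input describes one object; the random demo matrix is made up, and plotting is left out.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up binary rankings: 6 objects (rows) ranked by 10 subjects (columns)
rng = np.random.default_rng(0)
demo_t = rng.integers(0, 2, size=(6, 10)).astype(float)

# Reduce each object to 2 coordinates that can be drawn on a scatter plot
demo_components = PCA(n_components=2).fit_transform(demo_t)
```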

### 2. d. Cluster popularity

It is important to make sure the popular and unpopular objects are balanced between the clusters.
An object's popularity is measured as the number of subjects who ranked it 1: the minimum possible score is zero, and the maximum possible score is the number of subjects.
The balance will be checked in two ways.
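Under this definition, popularity is a column sum over the binary ranking table. A small sketch with a made-up table (the names here are hypothetical, not taken from the helper notebooks):

```python
import pandas as pd

# Made-up binary ranking table: rows are subjects, columns are objects
demo_table = pd.DataFrame(
    {"A": [1, 1, 0], "B": [0, 0, 0], "C": [1, 1, 1]},
    index=[1, 2, 3],
)

# Popularity = number of subjects who ranked the object 1
demo_popularity = demo_table.sum(axis=0)

# Objects ordered from least to most popular
demo_order = demo_popularity.sort_values().index.tolist()
```

Here B scores 0 (the minimum), A scores 2, and C scores 3 (the number of subjects).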

#### Color the objects based on popularity, to make sure that popularity is evenly scattered across the different clusters. A lighter color means the object is more popular.

In [None]:
popularities = visualizePopularity(pca_components, ranking_dataframe)

#### Display the mean popularity for each cluster.

In [None]:
displayPopularity(popularities, num_clusters, cluster_labels)

## 3. Prediction model

### 3. a. Initial run

#### Perform 10-fold cross-validation on the model, then display evaluations of the model.

In [None]:
summary_df, predictions = svmModel(processed_X, processed_bi_y, SVC(kernel='rbf', C=1, gamma=1))
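As an illustration of what a 10-fold cross-validation run involves, the sketch below collects out-of-fold predictions and computes a few metrics. It is an assumption about the general procedure, not the `svmModel` helper; the synthetic data, the shuffling, and the choice of metrics are all made up for the example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Made-up data: the label depends mostly on the first feature
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(100, 3))
demo_y = (demo_X[:, 0] + 0.3 * rng.normal(size=100) > 0).astype(int)

# Each row is predicted by a model that never saw it during training
demo_pred = np.zeros(len(demo_y), dtype=int)
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(demo_X):
    model = SVC(kernel="rbf", C=1, gamma=1)
    model.fit(demo_X[train_idx], demo_y[train_idx])
    demo_pred[test_idx] = model.predict(demo_X[test_idx])

demo_accuracy = float(np.mean(demo_pred == demo_y))
demo_auc = roc_auc_score(demo_y, demo_pred)
demo_cm = confusion_matrix(demo_y, demo_pred)  # rows: true class, columns: predicted
```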

In [None]:
summary_df

#### Sanity check: during the cross-validation run, the model should have made a roughly balanced number of '1' and '0' predictions. The next cell will check how many such predictions were made.

In [None]:
sanityCheck4(predictions)

### 3. b. Feature selection

#### A reminder: this is the current feature list used for the model.

In [None]:
feature_list

#### Next, select a subset of features that yields good performance. Set the desired number of features. If set to None, up to half of the features will be selected. If set to a number, that is the maximum number of selected features. Note that the number affects the runtime.

Examples:
- n_features = None
- n_features = 3 (with 3, the function will run for about 1 minute).

In [None]:
n_features = None

#### Sequential feature selection: this function will try to find a subset of features that yields good performance from the model. It will take a bit of time to run.

In [None]:
feature_selection, selection_mask = featureSelect(processed_X, processed_bi_y, n_features)
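For orientation, forward sequential selection with scikit-learn's `SequentialFeatureSelector` might look like the sketch below. The data, the estimator settings, and the explicit `n_features_to_select=2` are assumptions made for the example; `featureSelect` may be configured differently.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Made-up data: only column 2 carries information about the label
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(80, 4))
demo_y = (demo_X[:, 2] > 0).astype(int)

# Greedily add the feature that most improves the cross-validated score
selector = SequentialFeatureSelector(
    SVC(kernel="rbf", C=1, gamma=1),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
selector.fit(demo_X, demo_y)

demo_mask = selector.get_support()            # boolean mask over the columns
demo_X_selected = selector.transform(demo_X)  # matrix with only the kept columns
```

Each added feature requires refitting the model on every remaining candidate, which is why the requested number of features affects the runtime.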

#### The list of selected features:

In [None]:
feature_selection

#### An evaluation of the model with the new feature set:

In [None]:
summary_df, predictions, new_feature_X = evaluateFeatureSet(processed_X, processed_bi_y, selection_mask)

In [None]:
summary_df

#### Sanity check: again, during the cross-validation run, the model should have made a roughly balanced number of '1' and '0' predictions. Display this ratio:

In [None]:
sanityCheck4(predictions)

#### Next, you must decide whether you would like to use the previous feature list or the new subset.

To use the old feature set:
- final_X = processed_X

To use the new feature set:
- final_X = new_feature_X

In [None]:
final_X = new_feature_X

### 3. c. Hyperparameter tuning

#### The next cell will perform a 'grid search': it will try out the RBF SVM model with different values for the C and gamma parameters, and report which ones yield the best performance.

In [None]:
gridSearch(final_X, processed_bi_y)
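A grid search of this kind can be sketched with `ParameterGrid` and cross-validation. The candidate values, the synthetic data, and the use of `cross_val_score` below are illustrative assumptions, not the grid or scoring used by `gridSearch`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

# Made-up data
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(80, 2))
demo_y = (demo_X[:, 0] > 0).astype(int)

# Every combination of the candidate values is evaluated (3 x 3 = 9 models)
grid = ParameterGrid({"C": [0.1, 1, 10], "gamma": [0.1, 1, 10]})
results = []
for params in grid:
    scores = cross_val_score(SVC(kernel="rbf", **params), demo_X, demo_y, cv=5)
    results.append((scores.mean(), params))

# Keep the combination with the highest mean cross-validated accuracy
best_score, best_params = max(results, key=lambda item: item[0])
```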

#### Set the final model's parameters as desired. We suggest basing the selection on the grid search results.

Example:
- new_C = 1
- new_gamma = 10

In [None]:
new_C = 1
new_gamma = 1

#### An evaluation of the model with the new parameters:

In [None]:
summary_df, predictions = svmModel(final_X, processed_bi_y, SVC(kernel='rbf', C=new_C, gamma=new_gamma))

In [None]:
summary_df

#### Sanity check: again, during the cross-validation run, the model should have made a roughly balanced number of '1' and '0' predictions. Display this ratio:

In [None]:
sanityCheck4(predictions)

### 3. d. Model significance

#### Set the number of runs for a permutation test. This will affect runtime.

Example:
- n_permutations = 1000

In [None]:
n_permutations = 100

#### The next cell will perform a permutation test to check the model's p-value. This should take a while to run.

In [None]:
summary_df = permutationTest(n_permutations, final_X, processed_bi_y, SVC(kernel='rbf', C=new_C, gamma=new_gamma))
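The idea behind a permutation test is to compare the model's real score with the scores it gets when the labels are shuffled, which breaks any link between features and labels. The sketch below is one common formulation, assuming cross-validated accuracy as the test statistic; the data and names are made up, and `permutationTest` may compute its summary differently.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Made-up data
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(60, 2))
demo_y = (demo_X[:, 0] > 0).astype(int)
model = SVC(kernel="rbf", C=1, gamma=1)

# Score with the true labels
observed = cross_val_score(model, demo_X, demo_y, cv=5).mean()

# Scores with shuffled labels form the null distribution
n_perm = 20
null_scores = np.array([
    cross_val_score(model, demo_X, rng.permutation(demo_y), cv=5).mean()
    for _ in range(n_perm)
])

# Share of shuffled runs that did at least as well, with an add-one correction
demo_p_value = (np.sum(null_scores >= observed) + 1) / (n_perm + 1)
```

The smallest p-value this formula can return is 1 / (n_perm + 1), so more permutations give finer resolution at the cost of runtime. scikit-learn also provides `permutation_test_score` in `sklearn.model_selection` for the same purpose.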

In [None]:
summary_df

### 3. e. Final run

#### Now that the features and parameters have been selected, it's time to create the final model, and train it on all available data.

In [None]:
svm_model = finalModel(final_X, processed_bi_y, new_C, new_gamma)

#### Display the model's performance on the whole dataset.

In [None]:
svm_model.score(final_X, processed_bi_y)

#### Sanity check: again, the model should have made a roughly balanced number of '1' and '0' predictions. Display this ratio:

In [None]:
sanityCheck4(svm_model.predict(final_X))