# Homework Assignment 6: Model Evaluation 2
As in the previous assignments, in this homework assignment you will continue your exploration of the [SWAN-SF Dataset](https://doi.org/10.7910/DVN/EBCFKM), described in the paper found [here](https://doi.org/10.1038/s41597-020-0548-x).

This assignment will continue to utilize a copy of the extracted feature dataset we used in Homework 5. Recall that the dataset has been processed by performing log, z-score and range scaling. We continuing to use more than one partition worth of data, so for the scaling, the mean, standard deviation, minimum, and maximum were calculated using data from both partitions so that a global scaling can be performed on each partition. 


---

## Step 1: Downloading the Data

This assignment will continue to use [Partition 1](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB) for a training set and [Partition 2](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/TCRPUD) as a testing set. 

---

For this assignment, cleaning, transforming, and normalization of the data has been completed using both partitions to find the various minimum, maximum, standard deviation, and mean values needed to perform these operations. Recall from lecture that we should not perform these operations on each partition individually, but as a whole as there may(will) be different values for these in different partitions. 

For example, if we perform simple range scaling on each partition individually and we see a range of 0 to 100 in one partition and 0 to 10 in another. After individual scaling the values with 100 in the first would be mapped to 1 just like the values that had 10 in the second. This can cause serious performance problems in your model, so I have made sure that the normalization was treated properly for you. 

Below you will find the full partitions and `toy` sampled data from each partition, where only 20 samples from each of our 5 classes have been included in the data.  

#### Full
- [Full Normalized Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition1ExtractedFeatures.csv)
- [Full Normalized Partition 2 feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition2ExtractedFeatures.csv)

#### Toy
- [Toy Normalized Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/toy_normalized_partition1ExtractedFeatures.csv)
- [Toy Normalized Partition 2 feature dataset](http://dmlab.cs.gsu.edu/solar/data/toy_normalized_partition2ExtractedFeatures.csv)

Now that you have the two files, you should load each into a Pandas DataFrame using the [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method. 

---

### Evaluation Metric

As was done in Homework 5, for each of the models we evaluate in this assignmnet, you will calculate the True Skill Statistic score using the test data from Partition 2 to determine which model performs the best for classifying the positive flaring class.

    True skill statistic (TSS) = TPR + TNR - 1 = TPR - (1-TNR) = TPR - FPR

Where:

    True positive rate (TPR) = TP/(TP+FN) Also known as recall or sensitivity
    True negative rate (TNR) = TN/(TN+FP) Also known as specificity or selectivity
    False positive rate (FPR) = FP/(FP+TN) = (1-TNR) Also known as fall-out or false alarm ratio


**Recall**

    True positive (TP)
    True negative (TN)
    False positive (FP)
    False negative (FN)
    
See [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) for more information.

Below is a function implemented to provide your score for each model.

---

In [1]:
%matplotlib inline
import os
import itertools
import pandas as pd
from pandas import DataFrame 
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoLars
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cluster import KMeans

from sklearn.svm import SVC
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
def calc_tss(y_true, y_predict):
    '''
    Calculates the true skill score for binary classification based on the output of the confusion
    table function
    
        Parameters:
            y_true   : A vector/list of values that represent the true class label of the data being evaluated.
            y_predict: A vector/list of values that represent the predicted class label for the data being evaluated.
    
        Returns:
            tss_value (float): A floating point value (-1.0,1.0) indicating the TSS of the input data
    '''
    scores = confusion_matrix(y_true, y_predict).ravel()
    TN, FP, FN, TP = scores
    print('TN={0}\tFP={1}\tFN={2}\tTP={3}'.format(TN, FP, FN, TP))
    tp_rate = TP / float(TP + FN) if TP > 0 else 0  
    fp_rate = FP / float(FP + TN) if FP > 0 else 0
    
    return tp_rate - fp_rate

---

In addition to the TSS, you will be asked to also calculate the Heidke Skill Score (HSS) to see how much better your model performs than a random forecast.  

Below is a function implemented to provide your score fore each model.

---

In [3]:
def calc_hss(y_true, y_predict):
    '''
    Calculates the Heidke Skill Score for binary classification based on the output of the confusion
    table function.
    
    The HSS measures the fractional improvement of the forecast over the standard forecast.
    The "standard forecast" is usually the number correct by chance or the proportion 
    correct by chance.
    
        Parameters:
            y_true   : A vector/list of values that represent the true class label of the data being evaluated.
            y_predict: A vector/list of values that represent the predicted class label for the data being evaluated.
    
        Returns:
            hss_value (float): A floating point value (-inf,1.0) indicating the HSS of the input data. 
                Negative values indicate that the chance forecast is better, 0 means no skill, and a perfect forecast obtains a HSS of 1.
    '''
    scores = confusion_matrix(y_true, y_predict).ravel()
    TN, FP, FN, TP = scores
    #print('TN={0}\tFP={1}\tFN={2}\tTP={3}'.format(TN, FP, FN, TP))
    P = float(TP + FN)
    N = float(FP + TN)
    numerator = 2*((TP * TN) - (FN * FP))
    denominator = P*(FN + TN) + N*(TP + FP)
    
    return numerator/denominator

---

As in the previous assignment, we will be utilizing a binary classification of our 5 class dataset. So, below is the helper function to change our class labels from the 5 class target feature to the binary target feature. The function is implemented to take a dataframe (e.g. our `abt`) and prepares it for a binary classification by merging the `X`- and `M`-class samples into one group, and the rest (`NF`, `B`, and `C`) into another group, labeled with `1`s and `0`s, respectively.

---

In [4]:
def dichotomize_X_y(data: pd.DataFrame):
    """
    dichotomizes the dataset and split it into the features (X) and the labels (y).
    
    :return: two np.ndarray objects X and y.
    """
    data_dich = data.copy()
    data_dich['lab'] = data_dich['lab'].map({'NF': 0, 'B': 0, 'C': 0, 'M': 1, 'X': 1})
    y = data_dich['lab']
    X = data_dich.drop(['lab'], axis=1)
    return X.values, y.values

---

### Reading the partitions

In [5]:
data_dir = '/Users/zeek/Desktop/'
data_file = "normalized_partition1ExtractedFeatures.csv"
data_file2 = "normalized_partition2ExtractedFeatures.csv"

In [6]:
abt = pd.read_csv(os.path.join(data_dir, data_file))
abt2 = pd.read_csv(os.path.join(data_dir, data_file2))

---

### Run Feature Selection

Below you have code to perform feature selction using [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) class from [scikit-learn Univariate Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection). The scoring function being used is [scikit-learn f_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif).

Once feature selection is done with this one method, a set of training and testing dataframes are constructed by doing the following:

* Utilizing the `get_support` function of the feature selection object we get a mask of the features we will select from our original analytics base table DataFrame.  

* The mask of selected features is then paired with the `loc` function on our datframe containing only the descriptive features to get our selected featrues on all rows in our feature dataframe.

* The set of selected features are concatenated with our labels to construct a training dataset.

* This process was then repeated to construct the testing set.

---

In [7]:
numFeat = 20

# Split the target and descriptive features for Partition 1 into two 
# different DataFrame objects
df_labels = abt['lab'].copy()
df_feats = abt.copy().drop(['lab'], axis=1)

# Split the target and descriptive features for Partition 2 inot two
# different DataFrame Objects
df_test_labels = abt2['lab'].copy()
df_test_feats = abt2.copy().drop(['lab'], axis=1)

# Do feature selection
feats1 = SelectKBest(f_classif, k=numFeat).fit(df_feats, df_labels)

In [8]:
# Construct a training dataset from Partition 1 with only the selected descriptive 
# features and the target feature
df_selected_feats1 = df_feats.loc[:, feats1.get_support()]
df_train_set1 = pd.concat([df_labels, df_selected_feats1], axis=1)

# Construct a testing dataset from Partition 2 with only the selected descriptive
# features and the target feature
df_test_selected_feats1 = df_test_feats.loc[:, feats1.get_support()]
df_test_set1 = pd.concat([df_test_labels, df_test_selected_feats1], axis=1)

---
### Q1 (5 points)

Using the feature selection task above as a template, you will now perform feature selection again to produce a second set of training and testing data. This time you will use the [SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel) class from [scikit-learn Univariate Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection). 

Instead of using the [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC) that is shown in the example documentation linked above, you will be utilizing the [LassoLars](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLars.html#sklearn.linear_model.LassoLars) as your `estimator`.  You should also set the `max_features` to the number of features we are going to select (20 features). 

**Note:** The LassoLars class is a regression model and will not work on our string class lables. So, you need to map the `NF`, `B`, `C`, `M`, `X` labels to a range between `-1` and `1` before you attempt to fit the feature selection model to them. You can try evenly spacing them, or forcing the classes that will be used as the negative class into a tight range and those that will be used as the postive class into another tight range.  Maybe this another location for you to do hyperparameter tuning?

In [9]:
numFeat = 20
abt_copy = abt.copy()
abt2_copy = abt2.copy()

In [10]:
    #----------------------------------------------
    l_labs = abt_copy['lab'].map({'NF':-1, 'B':-0.5, 'C':0, 'M':0.5, 'X':1})
    l_feats = abt_copy.drop(['lab'], axis=1)
    l_dffeats =   SelectFromModel(max_features = numFeat, estimator = LassoLars( alpha = 0, eps =1)).fit(l_feats, l_labs)
    l_train = l_feats.loc[:, l_dffeats.get_support()]  
    l_trainset = pd.concat([abt['lab'], l_train], axis = 1)
    l_trainset
    #----------------------------------------------
 

Unnamed: 0,lab,TOTBSQ_slope_of_longest_mono_decrease,TOTUSJZ_dderivative_mean,TOTUSJZ_gderivative_mean,SAVNCPP_gderivative_mean,ABSNJZH_kurtosis,TOTUSJH_gderivative_mean,ABSNJZH_stddev,MEANPOT_slope_of_longest_mono_decrease,MEANSHR_slope_of_longest_mono_decrease,...,SHRGT45_slope_of_longest_mono_decrease,EPSX_slope_of_longest_mono_decrease,USFLUX_slope_of_longest_mono_decrease,TOTBSQ_gderivative_stddev,TOTPOT_average_absolute_change,TOTBSQ_average_absolute_change,MEANPOT_difference_of_medians,USFLUX_slope_of_longest_mono_increase,TOTBSQ_avg_mono_increase_slope,SAVNCPP_max
0,NF,0.999113,0.979145,0.979265,0.975866,0.046063,0.683983,0.166475,0.999915,0.998414,...,0.998884,0.999664,0.953375,0.015288,0.000074,0.026092,0.000041,0.028674,0.021381,0.073789
1,NF,0.999991,0.979309,0.979332,0.977080,0.030247,0.682688,0.122735,0.999982,0.998197,...,0.998806,0.998982,0.999580,0.002713,0.000017,0.005966,0.000035,0.003590,0.005345,0.009592
2,NF,0.999990,0.979237,0.979280,0.976580,0.047912,0.680145,0.105429,0.999988,0.999814,...,0.999086,0.996353,0.998553,0.004391,0.000032,0.009646,0.000003,0.007625,0.007204,0.017374
3,NF,0.999751,0.979829,0.979674,0.975500,0.038173,0.694549,0.183616,0.999948,0.999569,...,0.999646,0.999499,0.996641,0.014514,0.000147,0.028324,0.000033,0.016026,0.021110,0.037189
4,NF,0.999948,0.979352,0.979351,0.976705,0.047934,0.684250,0.103106,0.999954,0.997813,...,0.999126,0.999903,0.997705,0.004602,0.000045,0.008830,0.000005,0.007592,0.006826,0.018851
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73487,NF,0.999872,0.979317,0.979341,0.976041,0.052463,0.684865,0.057351,0.999655,0.999328,...,0.998228,0.997778,0.997607,0.002863,0.000015,0.005738,0.000006,0.010243,0.005423,0.005171
73488,NF,1.000000,0.979596,0.979535,0.975673,0.039513,0.696048,0.136841,0.999709,0.996946,...,0.997911,0.999393,0.992283,0.005927,0.000049,0.012674,0.000217,0.005476,0.009679,0.014201
73489,C,0.999981,0.979785,0.979718,0.979449,0.063704,0.695232,0.214363,0.999804,0.999597,...,0.999917,0.999618,0.973915,0.025836,0.000237,0.049894,0.000085,0.019388,0.022464,0.057426
73490,B,0.999581,0.978214,0.978635,0.978052,0.048476,0.659593,0.184096,0.999951,0.999761,...,0.999053,0.999981,0.994456,0.020409,0.000103,0.035119,0.000019,0.024067,0.020761,0.043784


In [11]:
    #test
    l2_labs = abt2_copy['lab'].map({'NF':-1, 'B':-0.5, 'C':0, 'M':0.5, 'X':1})
    l2_feats = abt2_copy.drop(['lab'], axis=1)
    l_df_testfeats = l2_feats.loc[:, l_dffeats.get_support()]
    l_test = l_feats.loc[:, l_dffeats.get_support()]  
    l_testset = pd.concat([abt2['lab'], l_test], axis = 1)
    l_testset

Unnamed: 0,lab,TOTBSQ_slope_of_longest_mono_decrease,TOTUSJZ_dderivative_mean,TOTUSJZ_gderivative_mean,SAVNCPP_gderivative_mean,ABSNJZH_kurtosis,TOTUSJH_gderivative_mean,ABSNJZH_stddev,MEANPOT_slope_of_longest_mono_decrease,MEANSHR_slope_of_longest_mono_decrease,...,SHRGT45_slope_of_longest_mono_decrease,EPSX_slope_of_longest_mono_decrease,USFLUX_slope_of_longest_mono_decrease,TOTBSQ_gderivative_stddev,TOTPOT_average_absolute_change,TOTBSQ_average_absolute_change,MEANPOT_difference_of_medians,USFLUX_slope_of_longest_mono_increase,TOTBSQ_avg_mono_increase_slope,SAVNCPP_max
0,NF,0.999113,0.979145,0.979265,0.975866,0.046063,0.683983,0.166475,0.999915,0.998414,...,0.998884,0.999664,0.953375,0.015288,0.000074,0.026092,0.000041,0.028674,0.021381,0.073789
1,NF,0.999991,0.979309,0.979332,0.977080,0.030247,0.682688,0.122735,0.999982,0.998197,...,0.998806,0.998982,0.999580,0.002713,0.000017,0.005966,0.000035,0.003590,0.005345,0.009592
2,NF,0.999990,0.979237,0.979280,0.976580,0.047912,0.680145,0.105429,0.999988,0.999814,...,0.999086,0.996353,0.998553,0.004391,0.000032,0.009646,0.000003,0.007625,0.007204,0.017374
3,NF,0.999751,0.979829,0.979674,0.975500,0.038173,0.694549,0.183616,0.999948,0.999569,...,0.999646,0.999499,0.996641,0.014514,0.000147,0.028324,0.000033,0.016026,0.021110,0.037189
4,NF,0.999948,0.979352,0.979351,0.976705,0.047934,0.684250,0.103106,0.999954,0.997813,...,0.999126,0.999903,0.997705,0.004602,0.000045,0.008830,0.000005,0.007592,0.006826,0.018851
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88552,NF,,,,,,,,,,...,,,,,,,,,,
88553,NF,,,,,,,,,,...,,,,,,,,,,
88554,NF,,,,,,,,,,...,,,,,,,,,,
88555,NF,,,,,,,,,,...,,,,,,,,,,


---
### Q2 (5 points)

In this question, you will again perform the feature selection task [SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel) class from [scikit-learn Univariate Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection). However, this time, you will be utililizing a random forest model called [ExtraTressClssifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier) as the `estimator`. 

You need set the `max_features` of the SelectFromModel object to the number of features we are going to select (20 features). 

You also need to set the `n_estimators` of the random forest algorithm to `75` when you construct it.

**Note:** This method allows you to utilize our string class labels, so you don't need to map the lables to any other values. You can use the labels that were used in the original example above.

In [12]:
numFeat = 20

In [13]:
    #----------------------------------------------
    x_labels = abt_copy['lab']
    x_features = abt_copy.drop(['lab'], axis = 1)
    x_model_features = SelectFromModel(max_features = numFeat, estimator = ExtraTreesClassifier(n_estimators = 75)).fit(x_features, x_labels)
    x_selected_features = x_features.loc[:, x_model_features.get_support()]
    x_train = pd.concat([x_labels, x_selected_features], axis = 1)
    x_train
    #----------------------------------------------

Unnamed: 0,lab,TOTUSJH_var,TOTBSQ_max,TOTBSQ_quadratic_weighted_average,TOTBSQ_last_value,TOTPOT_stddev,TOTPOT_linear_weighted_average,TOTUSJZ_mean,USFLUX_quadratic_weighted_average,R_VALUE_median,...,R_VALUE_linear_weighted_average,R_VALUE_quadratic_weighted_average,R_VALUE_last_value,TOTUSJH_min,TOTUSJH_max,TOTUSJH_quadratic_weighted_average,TOTUSJH_last_value,ABSNJZH_dderivative_stddev,ABSNJZH_avg_mono_increase_slope,id
0,NF,0.703435,0.886776,0.880515,0.883499,0.856244,0.891949,0.930756,0.950338,0.000000,...,0.086008,0.095111,0.371876,0.238758,0.275990,0.250146,0.271143,0.143979,0.147190,0.294529
1,NF,0.536687,0.837925,0.829410,0.828409,0.836245,0.866946,0.888203,0.924473,0.000000,...,0.039303,0.019833,0.000000,0.106759,0.123894,0.109375,0.108247,0.094326,0.101681,0.398784
2,NF,0.593047,0.843844,0.834038,0.833590,0.843089,0.869607,0.896440,0.927509,0.000000,...,0.000000,0.000000,0.000000,0.116361,0.141522,0.123883,0.120471,0.107461,0.108718,0.280851
3,NF,0.646995,0.925464,0.925004,0.924958,0.861715,0.919039,0.941759,0.966469,0.747649,...,0.728003,0.721305,0.697463,0.315587,0.328616,0.322843,0.320618,0.142697,0.147689,0.330699
4,NF,0.508972,0.867887,0.865675,0.862664,0.851582,0.892195,0.895863,0.931611,0.000000,...,0.021154,0.009999,0.000000,0.125745,0.140699,0.132812,0.124584,0.105344,0.104563,0.305167
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73487,NF,0.462685,0.792289,0.774293,0.770557,0.825029,0.843009,0.853690,0.898127,0.000000,...,0.048507,0.037961,0.000000,0.038092,0.066916,0.052391,0.047050,0.065959,0.068696,0.055015
73488,NF,0.606655,0.846074,0.843262,0.841001,0.855176,0.876780,0.889427,0.922339,0.415645,...,0.270086,0.210162,0.000000,0.086322,0.134423,0.121065,0.113201,0.120419,0.122110,0.323100
73489,C,0.711481,0.943771,0.939821,0.938537,0.876509,0.930272,0.956541,0.972969,0.764108,...,0.758822,0.755744,0.753714,0.409456,0.429922,0.414165,0.416328,0.190574,0.195310,0.410030
73490,B,0.732800,0.929611,0.926733,0.925886,0.874946,0.912685,0.951605,0.973103,0.703855,...,0.713355,0.717663,0.713111,0.370843,0.395401,0.374104,0.366209,0.188689,0.204119,0.145593


In [14]:
x_test_labels = abt2_copy['lab']
x_test_features = abt2_copy.drop(['lab'], axis = 1)
x_test_selected = x_test_features.loc[:, x_model_features.get_support()]
x_test = pd.concat([x_test_labels, x_test_selected], axis = 1)
x_test

Unnamed: 0,lab,TOTUSJH_var,TOTBSQ_max,TOTBSQ_quadratic_weighted_average,TOTBSQ_last_value,TOTPOT_stddev,TOTPOT_linear_weighted_average,TOTUSJZ_mean,USFLUX_quadratic_weighted_average,R_VALUE_median,...,R_VALUE_linear_weighted_average,R_VALUE_quadratic_weighted_average,R_VALUE_last_value,TOTUSJH_min,TOTUSJH_max,TOTUSJH_quadratic_weighted_average,TOTUSJH_last_value,ABSNJZH_dderivative_stddev,ABSNJZH_avg_mono_increase_slope,id
0,NF,0.625650,0.910396,0.910168,0.910402,0.868978,0.918002,0.932597,0.951585,0.575157,...,0.444758,0.410609,0.547147,0.248654,0.264179,0.255972,0.253303,0.156771,0.175954,0.578116
1,NF,0.524445,0.859605,0.854401,0.853716,0.842308,0.878799,0.906536,0.939085,0.000000,...,0.000000,0.000000,0.000000,0.148949,0.161652,0.154432,0.152454,0.112890,0.121236,0.736474
2,NF,0.542118,0.863722,0.861191,0.858692,0.843777,0.882648,0.905747,0.935372,0.000000,...,0.030981,0.013494,0.000000,0.161101,0.174518,0.167077,0.163801,0.157069,0.163217,0.942857
3,NF,0.690086,0.913040,0.909206,0.907126,0.877817,0.909533,0.933567,0.959465,0.713647,...,0.712612,0.711166,0.701873,0.263200,0.292981,0.270127,0.260010,0.136500,0.138083,0.858359
4,NF,0.447053,0.842784,0.838926,0.836830,0.843102,0.873573,0.877655,0.921374,0.000000,...,0.117955,0.135080,0.310076,0.091531,0.104345,0.097687,0.094275,0.087940,0.089760,0.589970
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88552,NF,0.636863,0.870733,0.863351,0.861416,0.857414,0.883171,0.915647,0.940915,0.356644,...,0.163147,0.105829,0.000000,0.181140,0.205758,0.185497,0.180158,0.137763,0.138010,0.941337
88553,NF,0.342742,0.788695,0.785622,0.787742,0.806316,0.844060,0.848843,0.902201,0.000000,...,0.000000,0.000000,0.000000,0.049787,0.058479,0.055035,0.057382,0.070615,0.073404,0.658663
88554,NF,0.401249,0.787746,0.775761,0.771510,0.815244,0.835070,0.847389,0.895124,0.000000,...,0.009245,0.005289,0.000000,0.047382,0.062495,0.053349,0.053345,0.078176,0.079521,0.800304
88555,NF,0.537470,0.834092,0.831255,0.829807,0.828145,0.866177,0.884763,0.926024,0.000000,...,0.000000,0.000000,0.000000,0.094972,0.115870,0.107972,0.105768,0.093569,0.097608,0.881155


---
### Q3 (5 points)

Now that you have three different datasets, you need to convert them each to a binary classification problem datase or dichotomize the training and testing data. Lucky for you, a method has already been provided to do this. All you need to do is apply it to teach of the `DataFrame`s you constructed with the feature selected training and testing data from the exmpale, Q1, and Q2.

**Note:** You might want to put the training and testing tuples you get from the call to the dichotomize method into seperate training and testing lists. Then you can loop over them later. 

In [15]:
    #----------------------------------------------
    train = [df_train_set1.copy(), l_trainset.copy(), x_train.copy()]
    test = [df_test_set1.copy(), l_testset.copy(), x_test.copy()]
    di_train_x1 = []
    di_train_y1 = []
    di_test_x2 = []
    di_test_y2 = []
    for set1 in train:
        x1, y1 = dichotomize_X_y(set1)
        di_train_x1.append(x1)
        di_train_y1.append(y1)
    for set2 in test:
        x2, y2 = dichotomize_X_y(set2)
        di_test_x2.append(x2)
        di_test_y2.append(y2)
    #----------------------------------------------

---
### Q4 (10 points)

Now that you have your data setup, you will be constructing an [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) model on each one of the different datasets. For this exercise default regularization parameter `C` value of 1.0, the default `kernel` type of `rbf`, and the default setting of the kernel coefficient `gamma` for the `rbf` kernel. You should, however, set the `class_weight` to `balanced` when you construct your models. This way the regularization parameter is adjusted for each class in proportion the occurrence of that class in the dataset.

You should train the model on your training data, then test it on the testing data with the same set of selected descriptive features. You will then calculate both the TSS and HSS scores and print them out.

**Note:** for more information on what the `C` and `gamma` parameters do on the `rbf` kernel see the [RBF SVM parameters](https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html) documentation. We won't be tuning these values in this question, but it is genearally accepted that tuning should be done to find the best performing model. 

In [16]:
selected_labels = ['F-Val', 'FromLasso', 'FromForest']

In [17]:
    #----------------------------------------------
    for ind in range(len(selected_labels)):
        classifier = SVC(class_weight = 'balanced')
        classifier.fit(di_train_x1[ind],di_train_y1[ind])
        y_pred = classifier.predict(di_test_x2[ind])
        t_score = calc_tss(di_test_y2[ind], y_pred)
        h_score = calc_hss(di_test_y2[ind], y_pred)
        print(selected_labels[ind])
        print(f"TSS: {t_score}")
        print(f"HSS: {h_score}")
    #----------------------------------------------

TN=76129	FP=11027	FN=76	TP=1325
F-Val
TSS: 0.8192327710296818
HSS: 0.1690723900997794


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

---
### Q5 (15 points)

After training and testing the SVC model on the above dataset, you will likely see that this process is quite time consuming. This is because the algorithm needs evaluate all the instances in the training dataset to find instances that can be used as points in a separating hyperplane between the samples of different classes.  

In order to speed this process, lets reduce the number of samples in the dataset through undersampling the classes like we did before. Unlike was done before, where we just pick some random sample of instances in the various classes, we will be performing some data informed under sampling using the [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans) clustering algorithm.

So, what I want you to do is construct a function (I have started it below) that performs the following:

* Groups the data by the `lab` column and finds the size of the smallest group.

* For every group of samples in the dataset that is not the smallest group, you will use the KMeans algorithm to cluster the samples into `K` clusters. The size of `K` is the size of the smallest group.  **Note:** You need to make sure you are only using the descriptive features when doing this, so drop the label column from each group.

* Once you have the `K` clusters, you will get the cluster centers by using the `cluster_centers_` attribute of your Kmeans object. This attribute is a `(n_clusters, n_features)` array of values. These will be the new set of samples of descriptive features for the class you are processing. You should construct a DataFrame with these and add a label column with your class label for each one of these samples.

* The samples for the smalles class group from the original dataset will be the samples you return for that class. 

* You will need to concatenate all of the results into one DataFrame and return it at the end of the function.

Once you have completed the function, you need to apply it to each of your three training sets (the ones that have not had the dicotimize process applied, I hope you kept a copy). Then you will apply the dicotomize process to the sampled training sets and place them into a list for use in the next problem.

**Note:** By training our models on representations of the real data instead of the acutal measurements, we are building a type of surrogate model. By doing so, we can approximate how our model might behave when trained with the true data, but can test several different settings much faster than what we otherwise would.

In [None]:
def perform_under_sample_clust(data:DataFrame)->DataFrame:
    #----------------------------------------------
    thetemp = data.copy()
    theres = pd.DataFrame()
    freq = pd.DataFrame(thetemp['lab'].value_counts())
    indices = freq.index.tolist()
    min_val_ind = freq.idxmin()[0]
    indices.remove(min_val_ind)
    min_v = freq['lab'][min_val_ind]
    desc_feats = thetemp.columns[1:]
    theres = pd.concat([theres, thetemp[thetemp['lab'] == min_val_ind]])
    for ind in indices:
        classifier = KMeans(n_clusters=min_v)
        classifier.fit(thetemp[thetemp['lab'] == ind].drop(['lab'], axis = 1))
        thetemp2 = pd.DataFrame(classifier.cluster_centers_, columns=desc_feats)
        thetemp2['lab'] = ind
        theres = pd.concat([theres, thetemp2]) 
    theres.reset_index(inplace=True, drop=True)
    return theres
    #----------------------------------------------

In [None]:
    #----------------------------------------------
    for frame in train:
        frame = perform_under_sample_clust(frame)
    us_di_train_x1 = []
    us_di_train_y1 = []
    for frame in train:
        x1, y1 = dichotomize_X_y(frame)
        us_di_train_x1.append(x1)
        us_di_train_y1.append(y1)
    #----------------------------------------------

---
### 6 (10 points)

In question 5 we produced datasets that approxumate what our real data looks like. By training our models on these representations of the real data instead of the acutal measurements, we are building a type of surrogate model. In doing so, we can approximate how our model might behave when trained with the true data, but we obtain a major advantage in that we can test several different settings much faster than what we otherwise would if using the true dataset. We can then use these surrogate results to find a range of the hyperparameters that we might wish to investigate using the true input data.

For this question, you will again train your models on the three different feature selected data. However, instead of the full partition 1 training datasets, you will be using the sampling with KMeans training datasets you constructed in Q5. 

You will again be constructing an [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) model on each one of the different datasets. Like before, you should set the `class_weight` to `balanced` when you construct your models. Unlike the previous question where you traind the models, this time you will be asked to evaluate several different settings for the `kernel`, the regularization parameter `C`, and kernel coefficient `gamma`. **Note:** The `gamma` paramter is only utilized on the ‘rbf’, ‘poly’ and ‘sigmoid’ kernels, so there is no reason to evaluate multiple settings for the `linear` kernel. I have listed the settings of each parameter in a code block below. 

For each of the settings, you should train the model on your training data, then test it on the testing data with the same set of selected descriptive features. You will then calculate both the TSS and HSS scores and print them out.

In [None]:
selected_labels = ['F-Val', 'FromLasso', 'FromForest']

In [None]:
kernel = ['linear', 'poly', 'rbf']
c_vals = [ 0.5, 1.0]
gamma_vals = [0.5, 1, 10]
temp = [kernel, c_vals, gamma_vals]
params = list(itertools.product(*temp))

In [None]:
    #----------------------------------------------
    for ind in range(len(selected_labels)):
        print(selected_labels[ind], "- - - - - - - - - -- - - - -")
        for kernel, c_value, gamma_value in params:
                classifier = SVC(class_weight = 'balanced', gamma=gamma_value, C=c_value, kernel=kernel)
                classifier.fit(us_di_train_x1[ind],di_train_y1[ind])
                y_pred = classifier.predict(di_test_x2[ind])
                t_score = calc_tss(di_test_y2[ind], y_pred)
                h_score = calc_hss(di_test_y2[ind], y_pred)
                print(f"TSS: {t_score}")
                print(f"HSS: {h_score}")
    #----------------------------------------------

---
### 7 (10 points)

Results above were able to find some combinations of hyperparamters and datasets that work fairly well for our problem. But the question remains, can we do better?

Maybe one way to improve our results would be to elinate the easy to classify instances from our dataset and only focus our efforts on the more difficult ones. If you recall from our data preparation there was a feature in our dataset that we could use to easily distinguish between a rather large percentage of `flare` and `non-flare` data. This feature was `R_VALUE_median`, but we don't know what value to use to filter off part of our data.

So, for this question, let's plot and see where a good cutoff might be. To do this, let's use the seaborn [ecdfplot](https://seaborn.pydata.org/generated/seaborn.ecdfplot.html#seaborn.ecdfplot) or the cumulative distribution function plot.  Your input will be the original analytics base table of partition one.  You should set the `x` axis to `R_VALUE_median`, and set the `hue` to `lab`.

After plotting this, you will see that around 0.5 we begin to see some instances of the `M` and around 0.7 we begin to see some instances of the `X` class flares in our dataset. So, use 0.5 as a threshold value to filter out all of the instances that fall below this threshold from our training data. Construct a copy of the original partition 1 data with this applied.

You can then verify this using the seaborn [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html#seaborn.kdeplot) using the new filtered data as input, setting `x` to `R_VALUE_median` again, and setting `hue` to `lab` again.  

In [None]:
    #----------------------------------------------
    sns.ecdfplot(abt, hue='lab', x='R_VALUE_median')
    #----------------------------------------------

In [None]:
filter_abt = abt[abt['R_VALUE_median'] >= 0.5].copy()

---
### 8 (10 points)

For this question, you will utilize the filtered analytics base table you constructed in the previous question.  You should:

* Repeat the feature selection I did for you in the  example using the [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) class from [scikit-learn Univariate Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection), and scoring function [scikit-learn f_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif).

* Repeat the feature selection from Q1 using the [SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel) class from [scikit-learn Univariate Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection), and utilizing the [LassoLars](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLars.html#sklearn.linear_model.LassoLars) as your `estimator`.  You should also set the `max_features` to the number of features we are going to select (20 features).

* Repeat the feature selection from Q2 using [SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel) class from [scikit-learn Univariate Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection). With a random forest model called [ExtraTressClssifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier) as the `estimator`. The `n_estimators` of the random forest algorithm should be set to `75` when you construct it. The `max_features` of the SelectFromModel should be set to the number of features we are going to select (20 features).

In [None]:
    #----------------------------------------------
numFeat = 20


df_labels = filter_abt['lab'].copy()
df_feats = filter_abt.copy().drop(['lab'], axis=1)
df_test_labels = abt2['lab'].copy()
df_test_feats = abt2.copy().drop(['lab'], axis=1)
feats1 = SelectKBest(f_classif, k=numFeat).fit(df_feats, df_labels)
df_selected_feats1 = df_feats.loc[:, feats1.get_support()]
df_train_set1 = pd.concat([df_labels, df_selected_feats1], axis=1)
df_test_selected_feats1 = df_test_feats.loc[:, feats1.get_support()]
df_test_set1 = pd.concat([df_test_labels, df_test_selected_feats1], axis=1)
#----------------------------------------------
l_labs = filter_abt['lab'].map({'NF':-1, 'B':-0.5, 'C':0, 'M':0.5, 'X':1})
l_feats = filter_abt.drop(['lab'], axis =1)
l_df_feats = SelectFromModel(max_features = numFeat, estimator = LassoLars( alpha = 0, eps =1)).fit(l_feats, l_labs)
l_train = l_feats.loc[:, l_df_feats.get_support()]
l_trainset = pd.concat([filter_abt['lab'], l_train], axis = 1)
#----------------------------------------------
l2_labs = abt2_copy['lab'].map({'NF':-1, 'B':-0.5, 'C':0, 'M':0.5, 'X':1})
l2_feats = abt2_copy.drop(['lab'], axis =1)
l_df_testfeats = l2_feats.loc[:, l_df_feats.get_support()]
l_test = l_feats.loc[:, l_df_feats.get_support()]
l_test_set = pd.concat([abt2['lab'], l_test], axis = 1)
#----------------------------------------------
x_labels = filter_abt['lab'].copy()
x_features = filter_abt.copy().drop(['lab'], axis = 1)
x_model_features = SelectFromModel(max_features = numFeat, estimator = ExtraTreesClassifier(n_estimators = 75)).fit(x_features, x_labels)
x_selected_features = x_features.loc[:, x_model_features.get_support()]
x_train = pd.concat([x_labels, x_selected_features], axis = 1)
#----------------------------------------------
x_test_labels = abt2_copy['lab']
x_test_features = abt2_copy.drop(['lab'], axis = 1)
x_test_selected = x_test_features.loc[:, x_model_features.get_support()]
x_test = pd.concat([x_test_labels, x_test_selected], axis = 1)
    #----------------------------------------------

---
### Q9 (10 points)

Using the training and testing datsets you constructed in the previous question after performing feature selection on the filtered partition 1 analytics base table. You now need to perform the sampling on the training data using the function you made in Q5.

Then you should convert each of the new training and testing datasets to a binary classification problem datase or dichotomize the training and testing data like you did in Q3. Lucky for you, a method has already been provided to do this. All you need to do is apply it to teach of the `DataFrame`s you constructed with the feature selected training and testing data.

**Note:** You might want to put the training and testing tuples you get from the call to the dichotomize method into seperate training and testing lists. Then you can loop over them later. 


In [None]:
    #----------------------------------------------
    train = [df_train_set1.copy(), l_trainset.copy(), x_train.copy()]
    test = [df_test_set1.copy(), l_testset.copy(), x_test.copy()]
    #undersampling
    for frame in train:
        frame = perform_under_sample_clust(frame)
    di_test_x1 = []
    di_test_y1 = []
    di_train_x2 = []
    di_train_y2 = []
    for frame in train:
        x1, y1 = dichotomize_X_y(frame)
        di_train_x1.append(x1)
        di_train_y1.append(y1)
    for frame in test:
        x2, y2 = dichotomize_X_y(frame)
        di_test_x2.append(x2)
        di_test_y2.append(y2)

    #----------------------------------------------

---
### Q10 (20 points)

Like in Q6, this question will be utilizing the filtered and sampled datasets constructed in the previous question. For this question, you will again train your models on the three different feature selected data that had the instances below our thrshold filtered out and then had sampling by clustering performed on them. 

You will again be constructing an [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) model on each one of the different datasets. Like before, you should set the `class_weight` to `balanced` when you construct your models. You will again evaluate several different settings for the `kernel`, the regularization parameter `C`, and kernel coefficient `gamma`. **Note:** The `gamma` paramter is only utilized on the ‘rbf’, ‘poly’ and ‘sigmoid’ kernels, so there is no reason to evaluate multiple settings for the `linear` kernel. I have listed the settings of each parameter in a code block below. 

For each of the settings, you should train the model on your training data, then test it on the testing data with the same set of selected descriptive features. You will then calculate both the TSS and HSS scores and print them out. **Note:** The testing data has the samples in it that are below our threshold value, so you will first need to filter those out of the data you plan to pass to your model for testing. However, you still want those instances included in the calculation of the TSS and HSSS. So, your groud truth `lab` data should include all the instances in partition 2. You will need to concatenate a vector with all zeros in it to the match the labels you partitioned from the model testing data. 

Let's give you a representation of that:
    
    labels_from_data = [labels for samples > threshold] + [labels for samples <= threshold]
    predict_labels = [labels from the model on > thrshold samples] + [0s the length of samples <= threshold]



In [None]:
thresh = 0.5

kernel = ['linear', 'poly', 'rbf']
c_vals = [ 0.5, 1.0]
gamma_vals = [0.5, 1, 10]
temp = [kernel, c_vals, gamma_vals]
params = list(itertools.product(*temp))

In [None]:
    #----------------------------------------------
    for ind in range(len(selected_labels)):
        print(selected_labels[ind], "- - - - - - - - - - - - - -", end='\n\n')
        for kernel, c_value, gamma_value in params:
            classifier = SVC(class_weight = 'balanced', gamma=gamma_value, C=c_value, kernel=kernel)
            classifier.fit(di_train_x1[ind],di_train_y1[ind])
            y_pred = classifier.predict(di_test_x2[ind])
            t_score = calc_tss(di_test_y2[ind], y_pred)
            h_score = calc_hss(di_test_y2[ind], y_pred)
            print(f"TSS: {t_score}")
            print(f"HSS: {h_score}")
    #----------------------------------------------

All of these results are getting unruely, we should maybe be saving them to do analysis on them too? Maybe I'll ask you to do that for the extra credit assignment.