# Project 3: Madelon Feature Selection and Classification

*Author: Uyen Truong*

In [2]:
%run ../__init__.py

### Project Description:

This project works with the Madelon dataset. One set of data is downloaded from the UCI website, and a second, artificial dataset can be queried from a database. The UCI dataset has 500 features, of which only 20 are real: 5 are informative and the other 15 are linear combinations of those 5 informative features. The Database dataset has 1000 features and 220,000 rows; no information was given about how many of its features are informative.

The goal of the project is to use statistical models for feature selection and classification in order to best assign the examples to the 2 classes given by the target values (-1 or 1 for the UCI dataset and 0 or 1 for the Database dataset).


### Initial Step:  Download the data and perform EDA
Although the instructions called for working on three 10% samples of the Database dataset, samples of about 22,000 rows and 1000 columns were too computationally intensive and repeatedly crashed the AWS t2.micro machine. I therefore used the SurveyMonkey sample size calculator to determine a sample size suitable for analysis. I decided that a sample size giving a 95% confidence level and a 1% margin of error would be sufficient, and according to the SurveyMonkey calculator this calls for about 5% of the dataset, which is equivalent to about 9200-9500 rows of the Database dataset. All of the calculations for the Database dataset were run on three 5% samples.
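
The sample size quoted above can be roughly reproduced by hand. The sketch below assumes the calculator uses the standard Cochran formula with a finite population correction and a worst-case proportion of 0.5; that is an assumption, since the exact formula behind the calculator is not stated in this notebook, and the function name `sample_size` is only illustrative.

```python
import math

def sample_size(population, z=1.96, margin=0.01, p=0.5):
    """Cochran's formula with a finite population correction (assumed formula)."""
    # Sample size for an infinitely large population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it to account for the finite population
    return math.ceil(n0 / (1 + (n0 - 1) / population))

n = sample_size(220_000)  # 95% confidence level (z = 1.96), 1% margin of error
print(n, f"{n / 220_000:.1%}")
```

Under these assumptions the result is roughly 9,200 rows, which is in line with the lower end of the range quoted above.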

Initial EDA of the datasets indicated that there are no null values in the data, except for one column in the UCI dataset where all the values were NaN, so I dropped that column. A heatmap of the entire UCI dataset and of a sample of the Database dataset shows that there is significant noise in both sets, and it is difficult to identify any correlation just by looking at the map.

![](../Images/uci_heatmap.png)![](../Images/db_heatmap.png)
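
A minimal sketch of the null check and column drop described above, run on a small synthetic frame rather than the real data. The column labels and the all-NaN column at position 4 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the UCI frame: four numeric columns plus one all-NaN column
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=[0, 1, 2, 3])
df[4] = np.nan

# Count nulls per column to spot the problem column
print(df.isnull().sum())

# Drop only the columns in which every value is NaN
clean = df.dropna(axis=1, how="all")

# Correlation matrix of the remaining features; this is the kind of matrix
# that could be passed to a plotting function such as seaborn.heatmap
corr = clean.corr()
print(corr.shape)
```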

### Step 1: Benchmarking

I built pipelines to perform a naive fit and set benchmark scores on the sample datasets (the entire UCI train dataset and 5% of the Database dataset) with the following model classes:

- Logistic Regression (LR)
- Decision Tree (Tree)
- K Nearest Neighbors (KNN)
- Support Vector Classifier (SVC)

According to the instructions, a large C must be set for Logistic Regression and the Support Vector Classifier to minimize regularization. Therefore, I set C equal to 1e10 for both LR and SVC. Below are the benchmark scores. **Note:** the scores presented throughout this notebook are accuracy scores.
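
The benchmarking loop can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not the code from the benchmark notebook; the `StandardScaler` step, the step names, and the default settings for the tree and KNN models are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic Madelon-like data (assumption: a stand-in for the real datasets)
X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           n_redundant=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A very large C effectively switches off regularization for LR and SVC
models = {
    "LogisticRegression": LogisticRegression(C=1e10, max_iter=1000),
    "Decision_Tree": DecisionTreeClassifier(random_state=0),
    "Nearest_Neighbors": KNeighborsClassifier(),
    "Support_vector": SVC(C=1e10),
}

benchmark = {}
for name, model in models.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", model)])
    pipe.fit(X_train, y_train)
    benchmark[name] = {
        "train_score": pipe.score(X_train, y_train),  # accuracy on training data
        "test_score": pipe.score(X_test, y_test),     # accuracy on held-out data
    }

for name, scores in benchmark.items():
    print(name, scores)
```

A large gap between the train and test accuracy of a model is the sign of overfitting that the benchmark is meant to expose.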


In [5]:
pd.read_pickle("../Images/benchmarkscores")

Unnamed: 0,data,model,test_score,train_score
0,Database,LogisticRegression,0.539298,0.677543
1,Database,Nearest_Neighbors,0.513099,0.688419
2,Database,Support_vector,0.503707,1.0
3,Database,Decision_Tree,0.629263,1.0
0,Uci,LogisticRegression,0.5,0.7825
1,Uci,Nearest_Neighbors,0.7,0.825625
2,Uci,Support_vector,0.4925,1.0
3,Uci,Decision_Tree,0.755,1.0


### Step 2: Finding Salient Features

**The detailed code for this step can be found in notebooks 02_Database_Feature_Selection and 02_UCI_Feature_Selections**

Given that 480 of the 500 features in the UCI Madelon dataset are 'noisy', I used the feature selection methods described below to identify the 20 relevant features. While no information was given for the Madelon Database dataset, I assumed that it has a similar structure to the UCI dataset, with many 'noisy' features, some redundant linear combinations, and some informative features. Therefore, I applied the same methods to the Database dataset.

**Method 1: Calculate $R^2$ scores**

In this method, the goal is to set aside the actual target column and calculate the $R^2$ score of each feature, taking each in turn as the target and using the remaining features as predictors. This technique can tell us whether there is a linear relationship between the features. If the $R^2$ is high, the feature has a linear relationship with the other features; therefore, it could be one of the informative or redundant features. If the $R^2$ is low, the feature is independent of all the other features and is most likely not a redundant or informative feature.

In this method, KNeighborsRegressor and DecisionTreeRegressor were used to calculate the $R^2$ scores. **Important note**: each time the $R^2$ was calculated for a feature, it yielded similar but slightly different results, because each run uses a different random train/test split. Therefore, the $R^2$ was calculated multiple times and the mean $R^2$ was used to decide which features could be salient. The functions below were used to calculate $R^2$; they were run in the notebooks noted above, and the sample below is for illustration purposes.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from tqdm import tqdm

# Drop one feature and check how well the remaining features predict it.
# `data` holds the feature columns only (the real target is already removed).
def calculate_r_2_for_feature_tree(data, feature):
    new_data = data.drop(feature, axis=1)

    X_train, X_test, y_train, y_test = train_test_split(
        new_data, data[feature], test_size=0.25
    )

    regressor = DecisionTreeRegressor()
    regressor.fit(X_train, y_train)

    # score() returns R^2 on the held-out split
    return regressor.score(X_test, y_test)

# Average the R^2 over 5 runs, since each random split gives a different score
def mean_r2_for_feature_tree(data, feature):
    scores = [calculate_r_2_for_feature_tree(data, feature) for _ in range(5)]
    return np.mean(scores)

# Return the columns whose mean R^2 is positive.
# Iterating over data.columns (instead of range(0, 500)) also covers
# the Database dataset, which has 1000 named columns.
def mean_column_range_tree(data):
    r2_tree = []
    for col in tqdm(data.columns):
        if mean_r2_for_feature_tree(data, col) > 0:
            r2_tree.append(col)
    return r2_tree
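
The text above also mentions KNeighborsRegressor, for which no code is shown. A minimal sketch of what that variant might look like, tried on a toy frame where one column is a linear combination of two others. The function name `mean_r2_for_feature_knn`, the seeded splits, and the toy data are assumptions made for illustration; the notebooks may differ.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

def mean_r2_for_feature_knn(data, feature, n_runs=5):
    """Mean R^2 when predicting `feature` from all other columns with KNN."""
    scores = []
    for seed in range(n_runs):
        # A different seed per run gives a different split each time
        X_train, X_test, y_train, y_test = train_test_split(
            data.drop(feature, axis=1), data[feature],
            test_size=0.25, random_state=seed,
        )
        regressor = KNeighborsRegressor()
        regressor.fit(X_train, y_train)
        scores.append(regressor.score(X_test, y_test))
    return float(np.mean(scores))

# Toy data: "combo" is a linear combination of "a" and "b"; "noise" is independent
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.normal(size=500),
    "b": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
toy["combo"] = toy["a"] + toy["b"]

r2_combo = mean_r2_for_feature_knn(toy, "combo")
r2_noise = mean_r2_for_feature_knn(toy, "noise")
print(f"combo: {r2_combo:.2f}, noise: {r2_noise:.2f}")
```

The redundant column should score well above zero while the independent noise column should not, which is the signal this method relies on.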

**Method 1: Results**

For the UCI dataset, the top $R^2$ results from KNeighborsRegressor and DecisionTreeRegressor were identical. This method identified 20 relevant features in total: 28, 48, 64, 105, 128, 153, 241, 281, 318, 336, 338, 378, 433, 442, 451, 453, 455, 472, 475, and 493.

For the Database dataset, I only ran the code for DecisionTreeRegressor: this dataset is much larger, so the computation time would be too high, and I inferred from the UCI results that DecisionTreeRegressor and KNeighborsRegressor would give the same features. This method identified 20 relevant features in total: 'feat_257', 'feat_269', 'feat_308', 'feat_315', 'feat_336', 'feat_341', 'feat_395', 'feat_504', 'feat_526', 'feat_639', 'feat_681', 'feat_701', 'feat_724', 'feat_736', 'feat_769', 'feat_808', 'feat_829', 'feat_867', 'feat_920', 'feat_956'.

**Method 2: Find top 20 features from SelectKBest**

In this method, the goal is to find the top 20 features that would predict the target using a SelectKBest pipeline with score_func=f_regression. The function used to find the top 20 features is shown after the results below.

**Method 2 Results:**

Unfortunately, the top 20 features identified by SelectKBest did not return the same 20 features as the $R^2$ method; however, it did generate some overlaps.

For the UCI sample dataset, the top 20 features identified were: 48, 64, 105, 128, 137, 149, 199, 204, 241, 282, 329, 336, 338, 378, 424, 442, 453, 472, 475, 493. Although not all 20 match the $R^2$ method, 13 out of 20 features from SelectKBest were the same as the ones from $R^2$.

For the Database sample dataset, the top features identified were: 'feat_003', 'feat_086', 'feat_257', 'feat_269', 'feat_308', 'feat_315', 'feat_336', 'feat_341', 'feat_395', 'feat_504', 'feat_526', 'feat_557', 'feat_597', 'feat_673', 'feat_681', 'feat_691', 'feat_701', 'feat_724', 'feat_736', 'feat_769', 'feat_783', 'feat_808', 'feat_829', 'feat_867', 'feat_920'. Not all of them match the $R^2$ method, but 18 of the features listed here also appear in the $R^2$ list.
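
The overlap between two selection methods can be checked with a set intersection. The sketch below uses the two UCI feature lists quoted in this notebook; the variable names are only illustrative.

```python
# Features picked by the R^2 method on the UCI dataset
r2_feats = [28, 48, 64, 105, 128, 153, 241, 281, 318, 336,
            338, 378, 433, 442, 451, 453, 455, 472, 475, 493]

# Features picked by SelectKBest on the UCI sample dataset
skb_feats = [48, 64, 105, 128, 137, 149, 199, 204, 241, 282,
             329, 336, 338, 378, 424, 442, 453, 472, 475, 493]

# Features that both methods agree on
overlap = sorted(set(r2_feats) & set(skb_feats))
print(len(overlap), overlap)  # 13 shared features
```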

In [None]:
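import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Return the column positions of the 20 features that SelectKBest scores highest
def skb_top_feats(x, y):
    X_train, X_test, y_train, y_test = train_test_split(
        x, y, test_size=.2, random_state=42
    )

    # Scale using statistics from the training split only
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)

    skb = SelectKBest(k=20, score_func=f_regression)
    # Fit on the scaled data (the original fit on the unscaled X_train)
    skb.fit(X_train_scaled, y_train)

    # get_support() is a boolean mask; np.where turns it into column positions
    skb_feats = np.where(skb.get_support())[0]

    # One-element list holding the array of selected positions
    return [skb_feats]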

**Method 3: Generate correlation matrix**
In this method, the goal is to create a correlation matrix of all the features and find the ones with strong correlations to other features. This method is similar to the $R^2$ method: the informative and redundant features will have stronger correlations, whereas the noisy features will have weak correlations.
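
A minimal sketch of this idea on toy data: for each feature, take its strongest absolute correlation with any other feature and keep the features above a cutoff. The function name `strongly_correlated_features` and the cutoff of 0.5 are assumptions for illustration, not values taken from the notebooks.

```python
import numpy as np
import pandas as pd

def strongly_correlated_features(data, threshold=0.5):
    """Columns whose strongest correlation with another column exceeds threshold."""
    corr = data.corr().abs()
    # Mask the diagonal so a feature's correlation with itself is ignored
    off_diagonal = ~np.eye(len(corr), dtype=bool)
    max_corr = corr.where(off_diagonal).max()
    return max_corr[max_corr > threshold].index.tolist()

# Toy data: "combo" is a linear combination of "a" and "b"; the rest is noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=500), "b": rng.normal(size=500)})
toy["combo"] = toy["a"] + toy["b"]
toy["noise1"] = rng.normal(size=500)
toy["noise2"] = rng.normal(size=500)

selected = strongly_correlated_features(toy)
print(selected)
```

In this toy frame the two source columns and their combination are picked up, while the independent noise columns fall below the cutoff.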

**Method 3: Results**
For the UCI dataset, this method identified the same 20 highly correlated features that were identified by the $R^2$ method.

For the Database dataset, this method also identified the same 20 highly correlated features that were identified by the $R^2$ method.

**Conclusions:** Even though I did not get consistent results across all three methods, the $R^2$ and correlation matrix methods identified the same top 20 features for both the UCI and Database datasets. Therefore, these are the 20 important features that I will use to build models for classification.

### Step 3: Feature Importance

The goal of this step is to identify the 5 important features that can be used to classify the target. I used three different methods on three UCI sample datasets and three 5% samples of the Database dataset. Since the top 20 salient features were identified in the previous step, I created new dataframes of the UCI and Database datasets that include only those 20 features, which reduces the size of the datasets and shortens computing time.

**Method 1: SelectKbest to identify top 5 features**
In this method, the goal is to find the top 5 features that would predict the target using a SelectKBest pipeline with score_func=f_regression.

**Method 1: Results**
As you can see in the results below, the top 5 features found in each of the 3 UCI sample datasets were inconsistent. The results from the database looked more promising! All 3 sample sets identified the same top 5 features.

![](../Images/skb_uci.png)
![](../Images/skb_db.png)

**Method 2: Recursive Feature Elimination to identify top 5 features**
In this method, the goal is to find the top 5 features that would predict the target using RFE with DecisionTree as the estimator.
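
A minimal sketch of this step on synthetic data. The three random subsamples stand in for the three sample datasets, and the data comes from `make_classification`; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 20 salient features (not the real data)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_redundant=15, random_state=0)

rng = np.random.default_rng(0)
top5_per_sample = []
for _ in range(3):
    # Draw a random subsample, mimicking one of the three sample datasets
    rows = rng.choice(len(X), size=300, replace=False)
    rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
              n_features_to_select=5)
    rfe.fit(X[rows], y[rows])
    # support_ is a boolean mask over the 20 columns
    top5_per_sample.append(sorted(np.where(rfe.support_)[0].tolist()))

for i, feats in enumerate(top5_per_sample, start=1):
    print(f"sample {i}: {feats}")
```

Comparing the three printed lists is a quick way to see whether the selection is stable across samples.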

**Method 2: Results**
As you can see in the results below, the top 5 features found in each of the 3 UCI sample datasets were inconsistent and the top 5 features found in each of the 3 Database datasets were also inconsistent. These results also varied from the results found in Method 1. 

![](../Images/db_rfe.png)
![](../Images/uci_rfe.png)

**Method 3: Feature_importance with RandomForest**
In this method, the goal is to find the top 5 features with the highest Gini importance.
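
A minimal sketch of ranking features by Gini importance with a random forest, again on synthetic data. The number of trees and the random seed are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 20 salient features (not the real data)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_redundant=15, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# feature_importances_ holds the impurity-based (Gini) importance of each
# column, normalized so that the values sum to 1
importances = pd.Series(forest.feature_importances_)
top5 = importances.sort_values(ascending=False).head(5)
print(top5)
```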

**Method 3: Results**

As you can see in the results below, the top 5 features found in each of the 3 UCI sample datasets were inconsistent and the top 5 features found in each of the 3 Database datasets were also inconsistent. These results also varied from the results found in Method 1 and 2. 

uci_1 | uci_2 |uci_3
- | - |- |
![](../Images/uci_gini1.png) | ![](../Images/uci_gini2.png)|![](../Images/uci_gini3.png)

db_1 | db_2 |db_3
- | - |- |
![](../Images/db_gini1.png) | ![](../Images/db_gini2.png)|![](../Images/db_gini3.png)

**Conclusions:**
While there is some overlap of features between the 3 methods, the results are still inconclusive as to which 5 features are the most important. Therefore, I decided to keep all 20 salient features in model creation and to implement the feature reduction methods within the model pipelines.

### Step 4: Pipelines/GS to Build and Test Models

In this step, I built functions to run multiple types of pipelines under grid search, with a different feature selection method in each pipeline, and ran the grid search for all three UCI and all three Database sample datasets. The following pipelines were created:
- SelectKBest as feature reduction with the following classifiers:
    - KNeighborsClassifier
    - LogisticRegression
    - DecisionTreeClassifier
    - RandomForestClassifier
    - SVC
- SelectFromModel with RidgeClassifier as the estimator, run with the following classifiers:
    - KNeighborsClassifier
    - LogisticRegression
    - DecisionTreeClassifier
    - RandomForestClassifier
    - SVC
- PCA as feature reduction, run with the following classifiers:
    - KNeighborsClassifier
    - LogisticRegression
    - DecisionTreeClassifier
    - RandomForestClassifier
    - SVC

Each pipeline included parameters specific to each of its steps. The code below illustrates the functions used to create the pipelines (details for each feature reduction function can be found in notebooks 04).

![](../Images/pipes_code.png)
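
Since the pipeline code above is only available as an image, here is a hedged sketch of how such pipelines and grid searches might be assembled. The step names `feature_selection`, `sfm`, `pca`, and `clf` are taken from the `best_params` shown in the results; the scaler step, the default SelectKBest score function, the grids, the 3-fold cross-validation, and the synthetic data are assumptions, and only SVC is used to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel, SelectKBest
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 20 salient features (not the real data)
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_redundant=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One feature reduction step and one parameter grid per method
reducers = {
    "SKB": (("feature_selection", SelectKBest()),
            {"feature_selection__k": [5, 10, 15]}),
    "SFM": (("sfm", SelectFromModel(RidgeClassifier())),
            {"sfm__estimator": [RidgeClassifier()]}),
    "PCA": (("pca", PCA()),
            {"pca__n_components": [5, 10]}),
}

results = []
for method, (step, grid) in reducers.items():
    pipe = Pipeline([("scaler", StandardScaler()), step, ("clf", SVC())])
    params = {**grid, "clf__C": [1.0, 10.0]}
    gs = GridSearchCV(pipe, params, cv=3)
    gs.fit(X_train, y_train)
    results.append({
        "Method": method,
        "best_params": gs.best_params_,
        "best_score": gs.best_score_,              # mean cross-validated accuracy
        "train_score": gs.score(X_train, y_train),
        "test_score": gs.score(X_test, y_test),
    })

for row in results:
    print(row["Method"], row["best_params"], round(row["test_score"], 3))
```

Keeping the feature reduction step inside the pipeline means it is refit on each cross-validation fold, so the held-out fold never influences which features or components are kept.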

**Please note:** for the PCA method, the data was only scaled and not deskewed. Based on the graph below, the data was evenly distributed, with no noticeable skew, in each of the 20 salient features (graphs for all 6 sample sets can be found in notebook 4).
![](../Images/skew.png)
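
A numeric check can complement the visual one. The sketch below computes the sample skewness of each feature with pandas and flags any feature whose absolute skewness exceeds 0.5; that cutoff is a common rule of thumb chosen here as an assumption, and the data is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 20 roughly symmetric features (not the real data)
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(2000, 20)))

# Sample skewness per column; values near 0 indicate a symmetric distribution
skewness = features.skew()

# Features that would be candidates for deskewing under the assumed cutoff
needs_deskew = skewness[skewness.abs() > 0.5]
print(skewness.round(2).to_dict())
print("features to deskew:", needs_deskew.index.tolist())
```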

**Results** 
For each sample dataset, there were 15 results from the pipeline/gridsearch.  For the sake of space, only the top 5 scores of all results are shown below. 

In [6]:
pd.read_pickle('../Datasets/db_top5.p')

Unnamed: 0,best_params,best_score,estimator,test_score,train_score,Method
9,"{'clf__C': 10, 'pca__n_components': 5}",0.810172,"SVC(C=10, cache_size=200, class_weight=None, c...",0.827927,0.888935,PCA
4,"{'clf__C': 10, 'pca__n_components': 5}",0.818317,"SVC(C=10, cache_size=200, class_weight=None, c...",0.82699,0.888271,PCA
35,"{'clf__n_neighbors': 9, 'sfm__estimator': Ridg...",0.792715,"KNeighborsClassifier(algorithm='auto', leaf_si...",0.816383,0.849072,SFM
0,"{'clf__n_neighbors': 9, 'pca__n_components': 5}",0.801261,"KNeighborsClassifier(algorithm='auto', leaf_si...",0.81562,0.843159,PCA
5,"{'clf__n_neighbors': 7, 'pca__n_components': 5}",0.797113,"KNeighborsClassifier(algorithm='auto', leaf_si...",0.814733,0.855945,PCA


In [8]:
pd.read_pickle('../Datasets/uci_top5.p')

Unnamed: 0,best_params,best_score,estimator,test_score,train_score,Method
0,"{'clf__C': 10.0, 'pca__n_components': 5}",0.809229,"SVC(C=10.0, cache_size=200, class_weight=None,...",0.874728,0.988817,PCA
29,"{'clf__C': 10.0, 'feature_selection__k': 15}",0.8125,"SVC(C=10.0, cache_size=200, class_weight=None,...",0.863636,0.940341,SKB
1,"{'clf__max_features': 'auto', 'clf__n_estimato...",0.827744,"(DecisionTreeClassifier(class_weight=None, cri...",0.849294,1.0,PCA
28,"{'clf__max_features': 'auto', 'clf__n_estimato...",0.801136,"(DecisionTreeClassifier(class_weight=None, cri...",0.840909,1.0,SKB
30,"{'clf__C': 10.0, 'sfm__estimator': RidgeClassi...",0.78125,"SVC(C=10.0, cache_size=200, class_weight=None,...",0.840909,0.889205,SFM


As you can see, the best scores for both the Database and UCI datasets came from the following combination:
- PCA as the feature reduction method
- SVC estimator
- C = 10.0
- n_components = 5

While these scores aren't very high, they are higher than the benchmark results, where the highest test score was 0.755. This shows that by narrowing the datasets down to the top 20 salient features and applying feature reduction and regularization in the pipelines, the new models are able to classify the data better.