<img src="ml2_group_assignment.png" width="800"/>

# <font color=darkgreen> Group A: </font> <img src="Team_A.png" width="800"/>
#### Stephanie Gessler, Pedro V. Esteban, Itay Young, Salma ElGuendy, Connie Kim 

# <font color=green> Introduction </font>

This is a continuation of forest_cover_type_detector_gr_a_Part1.

Below we import the files created in the previous notebook so that this notebook can run independently.

The table of contents is an extension of the one in the previous notebook.

<img src="tree_types.png" width="800"/>

# <font color=green> Table of contents </font>

* Data Analysis
* Exploratory Data Analysis
* Feature Engineering & Selection
* Compare Several Machine Learning Models
* Perform Hyperparameter Tuning on the Best Model
* Interpret Model Results
* Evaluate the Best Model with Test Data (answering the initial question)
* Summary & Conclusions

# Sections 
* [Libraries used](#0)
* [0. Import Data](#0)
* [5. Feature Engineering](#5)  
  * [5.1 Check for Anomalies and Outliers](#5.1)
       * [5.1.1 Outlier Detection Treatment using Inter-Quartile Range rule Function](#5.1.1)
       * [5.1.2 Inter-Quartile Range rule: 4 IQR from Median](#5.1.2)
       * [5.1.3 Inter-Quartile Range rule: 3 IQR from Median](#5.1.3)
  * [5.2 Feature Transformation and Building of new features](#5.2)
      * [5.2.1 Bivariate Combinations](#5.2.1)  
      * [5.2.2 Polynomial](#5.2.2)
      * [5.2.3 ID](#5.2.3)  
      * [5.2.4 Distance to Hydrology](#5.2.4)   
      * [5.2.5 Horizontal Distance To Roadways ](#5.2.5) 
      * [5.2.6 Slope](#5.2.6)  
      * [5.2.7 Horizontal Distance To Fire Points ](#5.2.7)  
      * [5.2.8 Hillshade](#5.2.8) 
          * [5.2.8.1 Mean Hillshade](#5.2.8.1) 
          * [5.2.8.2 Hillshade 9am](#5.2.8.2)
          * [5.2.8.3 Hillshade Noon](#5.2.8.3)          
          * [5.2.8.4 Hillshade 3pm](#5.2.8.4)       
          * [5.2.8.5 Hillshade Ratios](#5.2.8.5)        
      * [5.2.9 Geoclimate Grouping](#5.2.9) 
* [6. Feature Selection](#6)
  * [6.0 Prepare Data and Standardization](#6.1)
  * [6.1 Single tree](#6.2) 
  * [6.2 Bagging](#6.2) 
  * [6.3 Random Forest](#6.3)  
  * [6.4 Extra Trees](#6.4)
       * [6.4.1 Feature Number Selection](#6.6.1)       
       * [6.4.2 Lasso Regularization](#6.6.2)
       * [6.4.3 Filter Methods](#6.6.3)   
  * [6.5 Recursive Feature Elimination](#6.5)  
  * [6.6 Tree Based Methodologies](#6.6) 
       * [6.6.1 RandomForestClassifier](#6.6.1)       
       * [6.6.2 XGBoost](#6.6.2)
       * [6.6.3 Extra trees Classifier](#6.6.3)       
  * [6.7 Score of all methods Together](#6.7) 


<img src="roosevelt-national-forest.jpeg" width=1200 height=800 align="center">

<a id='0'></a>
# <font color=green> Libraries used </font>

In [1]:
#!pip install squarify
#!pip install htmltabletomd
#!pip install GraphViz
!pip install pygraphviz



In [2]:
import pandas as pd
import numpy as np
import math 
import seaborn as sns  # Graphing
import matplotlib.pyplot as plt
import squarify #treemap
import warnings
import plotly.graph_objects as go
import xgboost as xgb
import scipy.stats as stats
import htmltabletomd
import pydotplus

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import KBinsDiscretizer

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestRegressor


from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression

from sklearn.tree import export_graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor

from sklearn import datasets
from sklearn import linear_model

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import RFE
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import f_classif

from scipy.stats import norm


from yellowbrick.target import FeatureCorrelation
from yellowbrick.classifier import ROCAUC
from yellowbrick.model_selection import rfecv


from plotly.subplots import make_subplots
from IPython.display import Image  
from io import StringIO


from itertools import combinations

from dtreeviz.trees import *


from numpy import percentile

warnings.simplefilter(action='ignore', category=FutureWarning)

<a id='0'></a>
##  <font color=green>0.Import the Data </font>

Let’s load the original Kaggle training and test data and create a data frame

In [4]:
data_train = pd.read_csv("train.csv")
data_test = pd.read_csv("test.csv")

Let's keep the original dataset for later comparisons and make a copy for the FE process

In [5]:
df_original = data_train.copy()

<a id='5'></a>
<a id='5.1'></a>
# <font color=green>  5.Feature Engineering</font>
# <font color=green>  5.1. Check for Anomalies and Outliers </font>

The Z-score depends on the mean and standard deviation and assumes a normal distribution. Our data is skewed and failed the normality test, so we cannot use the Z-score for outlier handling. Our data is not normally distributed, or at least not yet.

The disadvantage of using percentiles is that they always flag the lowest and highest values at both ends of the distribution, so valid values can be mistaken for outliers.

As the number of observations increases, so does the number of observations considered outliers. After all, a percentile-based method will always reject a fixed percentage of our observations. Thus, we need to use percentiles with caution.
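To illustrate the point about percentile-based rules, here is a minimal sketch on synthetic data. The 1st and 99th percentile cut-offs, the normal samples and the variable names are assumptions made only for this illustration; they are not part of our pipeline.

```python
import numpy as np

rng = np.random.default_rng(37)
flagged_share = {}

for n in (1_000, 10_000):
    # Synthetic, perfectly "clean" data: no real outliers were injected
    x = rng.normal(size=n)
    low, high = np.percentile(x, [1, 99])
    # Everything outside the 1st-99th percentile band is flagged
    flagged = int(((x < low) | (x > high)).sum())
    flagged_share[n] = flagged / n
    print(n, flagged, flagged_share[n])
```

Whatever the sample size, roughly 2% of the rows are flagged, even though the data contains no genuine anomalies.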

<a id='5.1.1'></a>
### <font color=darkcyan> 5.1.1 Outlier Detection Treatment using Inter-Quartile Range rule Function </font>

The IQR is the difference between the 75th and 25th percentile. The IQR is more resistant to outliers. The IQR by definition only covers the middle 50% of the data, so outliers are well outside this range and the presence of a small number of outliers is not likely to change this significantly. 

We now test several multiples of the IQR, namely 4, 3 and 1.5, to see how many extreme values each threshold flags. 
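Written out, the rule used in the function that follows is this, where $k$ stands for the `value_IQR` argument and $Q_1$, $Q_3$ are the 25th and 75th percentiles:

$$
\mathrm{IQR} = Q_3 - Q_1, \qquad
\text{lower limit} = Q_1 - k \cdot \mathrm{IQR}, \qquad
\text{upper limit} = Q_3 + k \cdot \mathrm{IQR}
$$

A value $x$ counts as an outlier when $x < \text{lower limit}$ or $x > \text{upper limit}$. Note that the limits are measured from the quartiles, so a larger $k$ flags fewer points.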

In [11]:
def outlier_function(df, col_name,value_IQR):
    ''' This function computes the first and third quartile and the interquartile range for a given column
    of a dataframe. It then calculates the lower and upper limits (value_IQR * IQR beyond the quartiles)
    to determine outliers conservatively and returns the lower limit, the upper limit and the number of outliers
    '''
    first_quartile = np.percentile(np.array(df[col_name].tolist()), 25)
    third_quartile = np.percentile(np.array(df[col_name].tolist()), 75)
    IQR = third_quartile - first_quartile
                      
    upper_limit = third_quartile+(value_IQR*IQR)
    lower_limit = first_quartile-(value_IQR*IQR)
    outlier_count = 0
                      
    for value in df[col_name].tolist():
        if (value < lower_limit) | (value > upper_limit):
            outlier_count +=1
    return lower_limit, upper_limit, outlier_count
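As a cross-check, the same rule can be sketched in vectorised form. The helper name `count_iqr_outliers` and the toy series are hypothetical and used only for this illustration; it is meant to mirror `outlier_function`, not replace it.

```python
import numpy as np
import pandas as pd

def count_iqr_outliers(series, k):
    """Return (lower_limit, upper_limit, outlier_count) for the k * IQR rule."""
    q1, q3 = np.percentile(series, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # A boolean mask replaces the explicit Python loop
    mask = (series < lower) | (series > upper)
    return lower, upper, int(mask.sum())

# Toy data with one obviously extreme value
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper, n_out = count_iqr_outliers(toy, 3)
print(lower, upper, n_out)
```

On this toy series the quartiles are 3.25 and 7.75, so with `k = 3` the limits are -10.25 and 21.25 and only the value 100 is counted.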

<a id='5.1.2'></a>
### <font color=darkcyan> 5.1.2 Inter-Quartile Range rule: 4 IQR from Median </font>

We loop through the columns whose values are not only 0 and 1 to see if there are any outliers.


In [12]:
for column in ["Horizontal_Distance_To_Hydrology","Vertical_Distance_To_Hydrology","Horizontal_Distance_To_Roadways","Hillshade_9am","Hillshade_Noon","Hillshade_3pm","Horizontal_Distance_To_Fire_Points"]:
    if outlier_function(data_train, column,4)[2] > 0:
        print("There are {} outliers in {}".format(outlier_function(data_train, column,4)[2], column))

There are 13 outliers in Vertical_Distance_To_Hydrology
There are 1 outliers in Hillshade_9am


There is 1 record of Hillshade_9am with a zero value, which is a valid value as Hillshade can be zero. This is because there are parts in the mountain that never see the sunlight (blind spots). Hence we keep the value as it is. 

Now we remove outliers and test again in our baseline model


In [13]:
q25, q75 = percentile(data_train['Vertical_Distance_To_Hydrology'], 25), percentile(data_train['Vertical_Distance_To_Hydrology'], 75)
iqr = q75 - q25
# calculate the outlier cutoff
cut_off = iqr * 4
lower, upper = q25 - cut_off, q75 + cut_off
# remove outliers
data_train_vd_h = data_train[(data_train['Vertical_Distance_To_Hydrology'] > lower) & (data_train['Vertical_Distance_To_Hydrology'] < upper)]

We check whether the model improves after removing the Vertical_Distance_To_Hydrology outliers.

In [None]:
X4=data_train_vd_h.drop(labels=['Id','Cover_Type'],axis=1)
y4=data_train_vd_h['Cover_Type']

In [None]:
scale_numerical =['Elevation','Aspect','Slope','Horizontal_Distance_To_Hydrology','Vertical_Distance_To_Hydrology',
            'Horizontal_Distance_To_Roadways','Hillshade_9am','Hillshade_Noon','Hillshade_3pm',
            'Horizontal_Distance_To_Fire_Points']
scaler = StandardScaler()
X4[scale_numerical]=scaler.fit_transform(X4[scale_numerical])

In [None]:
y4.value_counts()

In [None]:
X4_train,X4_val,y4_train,y4_val = train_test_split(X4,y4,random_state=37) #seed is 37!

In [None]:
forest_iqr4 = RandomForestClassifier(random_state=37)
model_forest_iqr4 = forest_iqr4.fit(X4_train,y4_train)

In [None]:
# calculating accuracy_score
model_forest_iqr4.score(X4_val,y4_val)

In [None]:
forest = RandomForestClassifier(random_state=37)
print("Accuracy = {0:.4f}".format(np.mean(cross_val_score(model_forest_iqr4, X4_val, y4_val))))

<table>
  <tr>
    <th><b>Algorithm</b></th>
    <th><b>Accuracy Baseline</b></th>
    <th><b>CV Accuracy Baseline</b></th>
    <th><b>Accuracy with IQR4</b></th>
    <th><b>CV Accuracy with IQR4</b></th>
  </tr>
  <tr>
    <td> Random Forest </td>
    <td> <b>0.8613</b></td>
    <td> <b>0.8016</b></td>
    <td> 0.8607</td>
    <td> 0.7993</td>
  </tr>
</table>


Comparing the previous score with the new score after removing the outliers of Vertical_Distance_To_Hydrology: accuracy decreases slightly, hence it is better not to remove the outliers.

<a id='5.1.3'></a>
### <font color=darkcyan> 5.1.3 Inter-Quartile Range rule: 3 IQR </font>

First we check which variables would be affected by the 3 IQR function. Hillshade_9am, Hillshade_Noon and Hillshade_3pm have been excluded, as their extreme values are valid. 

In [51]:
for column in ["Horizontal_Distance_To_Hydrology","Vertical_Distance_To_Hydrology","Horizontal_Distance_To_Roadways","Horizontal_Distance_To_Fire_Points"]:
    if outlier_function(data_train, column,3)[2] > 0:
        print("There are {} outliers in {}".format(outlier_function(data_train, column,3)[2], column))

There are 53 outliers in Horizontal_Distance_To_Hydrology
There are 49 outliers in Vertical_Distance_To_Hydrology
There are 3 outliers in Horizontal_Distance_To_Roadways
There are 132 outliers in Horizontal_Distance_To_Fire_Points


In [52]:
cols = ["Horizontal_Distance_To_Hydrology","Vertical_Distance_To_Hydrology","Horizontal_Distance_To_Roadways","Hillshade_9am","Hillshade_Noon","Hillshade_3pm","Horizontal_Distance_To_Fire_Points"] # one or more

Q1 = data_train[cols].quantile(0.25)
Q3 = data_train[cols].quantile(0.75)
IQR = Q3 - Q1

df = data_train[~((data_train[cols] < (Q1 - 3 * IQR)) |(data_train[cols] > (Q3 + 3 * IQR))).any(axis=1)]

In [53]:
df.shape

(14866, 56)

In [54]:
X3=df.drop(labels=['Cover_Type'],axis=1)
y3=df['Cover_Type']

scale_numerical =['Elevation','Aspect','Slope','Horizontal_Distance_To_Hydrology','Vertical_Distance_To_Hydrology',
            'Horizontal_Distance_To_Roadways','Hillshade_9am','Hillshade_Noon','Hillshade_3pm',
            'Horizontal_Distance_To_Fire_Points']
scaler = StandardScaler()
X3[scale_numerical]=scaler.fit_transform(X3[scale_numerical])

Looking at the cover types, you can see that the classes become unbalanced if we remove the outliers. 

In [55]:
y3.value_counts()

4    2160
6    2159
3    2153
5    2121
1    2113
7    2088
2    2072
Name: Cover_Type, dtype: int64

In [56]:
X_train3,X_val3,y_train3,y_val3 = train_test_split (X3,y3,random_state=37) #seed is 37!

In [57]:
forest_iqr3 = RandomForestClassifier(random_state=37)
model_forest_iqr3 = forest_iqr3.fit(X_train3,y_train3)

In [58]:
model_forest_iqr3.score(X_val3,y_val3)

0.87409200968523

In [59]:
forest_iqr3 = RandomForestClassifier(random_state=37)
print("Accuracy = {0:.4f}".format(np.mean(cross_val_score(model_forest_iqr3, X_val3, y_val3))))

Accuracy = 0.8025


<table>
  <tr>
    <th><b>Algorithm</b></th>
    <th><b>Accuracy Baseline</b></th>
    <th><b>CV Accuracy Baseline</b></th>
    <th><b>Accuracy with IQR3</b></th>
    <th><b>CV Accuracy with IQR3</b></th>
  </tr>
  <tr>
    <td> Random Forest </td>
    <td> <b>0.8613</b></td>
    <td> <b>0.8016</b></td>
    <td> 0.8740</td>
    <td> 0.8025</td>
  </tr>
</table>

Removing all outliers with the IQR3 rule does not significantly improve the model, so we will not remove any outliers using IQR3. 

<a id='5.1.3'></a>
### <font color=darkcyan> 5.1.4 Inter-Quartile Range rule: 1.5 IQR </font>

First we check which variables would be affected by the 1.5 IQR function. Hillshade_9am, Hillshade_Noon and Hillshade_3pm have been excluded, as their extreme values are valid. 

In [63]:
for column in ["Horizontal_Distance_To_Hydrology","Vertical_Distance_To_Hydrology","Horizontal_Distance_To_Roadways","Horizontal_Distance_To_Fire_Points"]:
    if outlier_function(data_train, column,1.5)[2] > 0:
        print("There are {} outliers in {}".format(outlier_function(data_train, column,1.5)[2], column))

There are 512 outliers in Horizontal_Distance_To_Hydrology
There are 586 outliers in Vertical_Distance_To_Hydrology
There are 830 outliers in Horizontal_Distance_To_Roadways
There are 645 outliers in Horizontal_Distance_To_Fire_Points


In [64]:
cols = ["Horizontal_Distance_To_Hydrology","Vertical_Distance_To_Hydrology","Horizontal_Distance_To_Roadways","Hillshade_9am","Hillshade_Noon","Hillshade_3pm","Horizontal_Distance_To_Fire_Points"] # one or more

Q1 = data_train[cols].quantile(0.25)
Q3 = data_train[cols].quantile(0.75)
IQR = Q3 - Q1

df1 = data_train[~((data_train[cols] < (Q1 - 1.5 * IQR)) |(data_train[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]

In [65]:
df1.shape

(12261, 56)

In [67]:
X1=df1.drop(labels=['Cover_Type'],axis=1)
y1=df1['Cover_Type']

scale_numerical =['Elevation','Aspect','Slope','Horizontal_Distance_To_Hydrology','Vertical_Distance_To_Hydrology',
            'Horizontal_Distance_To_Roadways','Hillshade_9am','Hillshade_Noon','Hillshade_3pm',
            'Horizontal_Distance_To_Fire_Points']
scaler = StandardScaler()
X1[scale_numerical]=scaler.fit_transform(X1[scale_numerical])

Looking at the cover types, you can see that the classes become unbalanced if we remove the outliers. 

In [68]:
y1.value_counts()

4    2055
6    1912
5    1814
3    1780
7    1606
1    1555
2    1539
Name: Cover_Type, dtype: int64

In [69]:
X_train1,X_val1,y_train1,y_val1 = train_test_split (X1,y1,random_state=37) #seed is 37!

In [70]:
forest_iqr1 = RandomForestClassifier(random_state=37)
model_forest_iqr1 = forest_iqr1.fit(X_train1,y_train1)

In [71]:
model_forest_iqr1.score(X_val1,y_val1)

0.8701891715590345

In [72]:
forest_iqr1 = RandomForestClassifier(random_state=37)
print("Accuracy = {0:.4f}".format(np.mean(cross_val_score(model_forest_iqr1, X_val1, y_val1))))

Accuracy = 0.8001


<table>
  <tr>
    <th><b>Algorithm</b></th>
    <th><b>Accuracy Baseline</b></th>
    <th><b>CV Accuracy Baseline</b></th>
    <th><b>Accuracy with IQR1.5</b></th>
    <th><b>CV Accuracy with IQR1.5</b></th>
  </tr>
  <tr>
    <td> Random Forest </td>
    <td> <b>0.8613</b></td>
    <td> <b>0.8016</b></td>
    <td> 0.8701</td>
    <td> 0.8001</td>
  </tr>
</table>

Removing all outliers with the IQR1.5 rule does not significantly improve the model, so we will not remove any outliers using IQR1.5. 

<a id='5.2'></a>
## <font color=green> 5.2 Feature Transformation and Building of new features </font>

<a id='5.2.1'></a>
### <font color=green> 5.2.1 Bivariate Combinations </font>

During feature engineering, we want to try to create a wide variety of interactions between multiple variables in order to create new variables. 


By combining them, we create opportunities for new and impactful features that could help predict our target variable. 

For this purpose, we will create as many bivariate combinations of our predictor variables as possible, using the ‘combinations’ function from the itertools library.

We will not make interactions with the dummy variables as these are either 0 or 1 and we will not get any additional information from making the interaction this way. 

Furthermore, it is not recommended to use standardization before bivariate combinations as we want to increase the signal. <br>

Sources: https://towardsdatascience.com/feature-engineering-combination-polynomial-features-3caa4c77a755 <br>

https://samchaaa.medium.com/preprocessing-why-you-should-generate-polynomial-features-first-before-standardizing-892b4326a91d

In order to test the bivariate combinations, we split the dataset. Note that this is not the split we will use later for testing the algorithm; its only purpose is to test all the combinations and select the best ones. 

In [None]:
# Identify and drop our target variable 'Cover_Type' from dataframe
X = data_train.drop('Cover_Type', axis = 1)

# Isolate our dependent variable as a feature
y = data_train['Cover_Type']

Train Test Split (80/20 size), drop duplicates and missing values


In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = .2, random_state=37, stratify=y)

X_train.drop_duplicates(inplace = True)
X_train.dropna(inplace = True)

Here, we create every possible bivariate combination to be tested for feature engineering, without the dummy variables.


We take out the categorical variables prior to our feature engineering.

In [None]:
column_list = X_train.columns
filtered_column_list = [column for column in column_list if 'Soil_Type' not in column and 'Wilderness_Area' not in column and 'Id' not in column ] 
interactions = list(combinations(filtered_column_list, 2))
interactions

Addition and division have been taken out, as they created a lot of noise in the data. Division only makes sense if it has a business meaning, and addition only if both variables are on the same scale. 

However, in a second step we will add together the variables that share the same unit, but not in a for loop. 

In [None]:
for (key, value) in interactions:
    data_train[key + '_x_' + value] = data_train[key] * data_train[value]
    #data_train[key + '_+_' + value] = data_train[key] + data_train[value]
    #data_train[key + '_divide_' + value] = data_train[key] / data_train[value]
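To make the loop above concrete, here is a small sketch on a toy dataframe. The three columns and their values are made up for illustration; only the naming scheme (`_x_`) follows the cell above.

```python
from itertools import combinations
import pandas as pd

# Toy stand-in for the numeric columns of data_train (values are invented)
toy = pd.DataFrame({
    "Elevation": [2000, 2500],
    "Slope": [10, 20],
    "Aspect": [90, 180],
})

# Build the pairs once, before adding columns, so new columns are not re-paired
pairs = list(combinations(toy.columns, 2))

for left, right in pairs:
    toy[left + "_x_" + right] = toy[left] * toy[right]

new_columns = [c for c in toy.columns if "_x_" in c]
print(new_columns)
```

With n columns, `combinations(..., 2)` yields n(n-1)/2 pairs, so three columns give three interaction features and ten columns would give 45.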

In [None]:
pd.set_option('display.max_columns', None) #to make all columns visible in dataframe now that we have many
data_train

<a id='5.2.2'></a>
### <font color=green> 5.2.2 Polynomial Features </font>
    
We have just seen how to make two variables interact, but sometimes the relationship between the dependent and independent variables is more complex and not linear. 
    
Polynomial features are another way to create new features! A very strong option for new features is raising a single variable to a higher power. 
    
For our purposes, we will see whether the existing variables can improve our baseline when raised to a higher power.<br>
Source: https://towardsdatascience.com/feature-engineering-combination-polynomial-features-3caa4c77a755

Here we select only the columns we are interested in: `iloc[:, 1:10]` picks positions 1 to 9, that is, the second through tenth columns.

In [None]:
X_train_int_pf = data_train.iloc[:, 1:10]
X_train_int_pf

By default, when running polynomials we lose the name of our labels. This function gives us the ability to preserve it with the name + the transformation done

In [None]:
def PolynomialFeatures_labeled(input_df,power):
    '''Basically this is a wrapper for the sklearn preprocessing function. 
    The problem with that function is that if you give it a labeled dataframe, it outputs an unlabeled array with potentially
    a whole bunch of unlabeled columns. 

    Inputs:
    input_df = Your labeled pandas dataframe (list of x's not raised to any power) 
    power = what order polynomial you want variables up to. (use the same power as you want entered into pp.PolynomialFeatures(power) directly)

    Output:
    This function relies on the powers_ matrix, which is one of the preprocessing function's outputs, to create logical labels and 
    outputs a labeled pandas dataframe   
    '''
    poly = PolynomialFeatures(power)
    output_nparray = poly.fit_transform(input_df)
    powers_nparray = poly.powers_

    input_feature_names = list(input_df.columns)
    target_feature_names = ["Constant Term"]
    for feature_distillation in powers_nparray[1:]:
        intermediary_label = ""
        final_label = ""
        for i in range(len(input_feature_names)):
            if feature_distillation[i] == 0:
                continue
            else:
                variable = input_feature_names[i]
                power = feature_distillation[i]
                intermediary_label = "%s^%d" % (variable,power)
                if final_label == "":         #If the final label isn't yet specified
                    final_label = intermediary_label
                else:
                    final_label = final_label + " x " + intermediary_label
        target_feature_names.append(final_label)
    output_df = pd.DataFrame(output_nparray, columns = target_feature_names)
    return output_df



Polynomial features of degree two give us 46 new features. Since we already have enough information, we will not go for polynomial degree three, to avoid dimensionality issues later on.

In [None]:
output_df_pw2 = PolynomialFeatures_labeled(X_train_int_pf,2)
pd.set_option('display.max_columns', None)
output_df_pw2.shape

Some fields are duplicated with respect to the original df. This is a side effect of the function: the original features are replicated to the power of one, which leaves their values unchanged, so we delete these.

In [None]:
column_list = output_df_pw2.columns
cols = [column for column in column_list if '^1' not in column]
output_df_pw2=output_df_pw2[cols]
output_df_pw2

Here, we concatenate our output to consolidate the polynomials with the feature combinations.

In [None]:
data_train = pd.concat([data_train,output_df_pw2], axis=1)
data_train

This results in a huge dataset. Just out of curiosity, we ran polynomials to the power of three and saw that the resulting dataframe alone is already bigger than the consolidated one above.

In [None]:
output_df_pw3 = PolynomialFeatures_labeled(X_train_int_pf,3)
output_df_pw3
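The growth in column count can be estimated without running the expansion. A full polynomial expansion of n features up to degree d (constant term included) has C(n + d, d) columns. The sketch below assumes the nine input columns selected with `iloc[:, 1:10]`; the helper name `n_polynomial_terms` is hypothetical.

```python
from math import comb

def n_polynomial_terms(n_features, degree):
    """Number of monomials of total degree <= degree in n_features variables,
    counting the constant term."""
    return comb(n_features + degree, degree)

terms_deg2 = n_polynomial_terms(9, 2)
terms_deg3 = n_polynomial_terms(9, 3)
print(terms_deg2, terms_deg3)
```

Under that assumption, degree two gives 55 columns and degree three gives 220. Setting aside the nine original columns from the 55 leaves 46, which would match the figure quoted earlier if that is how it was counted.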

<a id='5.2.3'></a>
### <font color=green> 5.2.3 ID </font>

For the test set we keep Id, because it is the unique identifier needed for evaluation.


For the training set we remove it, as it doesn't add any value to the model.

In [None]:
data_train.drop('Id',axis = 1, inplace = True)

<a id='5.2.4'></a>
### <font color=green> 5.2.4 Distance To Hydrology </font>
#### <font color=green> New Features </font>

We combine Vertical_Distance_To_Hydrology and Horizontal_Distance_To_Hydrology, since the two are highly correlated. This suggests computing the straight-line (diagonal) distance to hydrology with the Pythagorean theorem.

We will call this newly engineered feature, Distance_To_Hydrology
 
Source : https://towardsdatascience.com/types-of-transformations-for-better-normal-distribution-61c22668d3b9

In [None]:
data_train['Distance_To_Hydrology'] = data_train['Horizontal_Distance_To_Hydrology']**2 +data_train['Vertical_Distance_To_Hydrology']**2
data_train['Distance_To_Hydrology'] = data_train['Distance_To_Hydrology']**0.5
data_train.head()
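The same Euclidean distance can also be written with `np.hypot`, which computes sqrt(a² + b²) element-wise. This is only an equivalent sketch on invented values, not a change to the cell above.

```python
import numpy as np
import pandas as pd

# Toy rows: (3, 4) and (30, -40) are classic right triangles; (0, 0) sits on the water
toy = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [3, 0, 30],
    "Vertical_Distance_To_Hydrology": [4, 0, -40],
})

toy["Distance_To_Hydrology"] = np.hypot(
    toy["Horizontal_Distance_To_Hydrology"],
    toy["Vertical_Distance_To_Hydrology"],
)
print(toy["Distance_To_Hydrology"].tolist())
```

Negative vertical distances (points below the water level) are handled naturally, because both terms are squared.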

#### <font color=green> Square root and logarithm Transformation  </font>

Now we check the distribution of the newly created variable to see whether further transformation is needed. 

Distance_To_Hydrology inherits its skewness from its parent variables. It is positively skewed and contains zero values. 

Because the logarithm of zero is undefined, we use log(x + 1) so that zero values can be transformed. 

Source: https://www.youtube.com/watch?v=_c3dVTRIK9c

Source_2: https://towardsdatascience.com/types-of-transformations-for-better-normal-distribution-61c22668d3b9

As a rule of thumb, the skewness can be interpreted as follows:
<img src="Skew.png" width=400 height=200 align="center">

Source: https://www.marsja.se/transform-skewed-data-using-square-root-log-box-cox-methods-in-python/

In [None]:
print('\033[95m'+"Skew before transformation\n", data_train['Distance_To_Hydrology'].skew(), 
      "\nmin\n", data_train['Distance_To_Hydrology'].min(),
      "\nmax\n", data_train['Distance_To_Hydrology'].max(),)

We do some transformations to minimize skewness

In [None]:
#Using the log10+ 1 logarithm 
data_train['log10_Distance_To_Hydrology'] = np.log10(data_train['Distance_To_Hydrology']+1)

In [None]:
#Using the square root 
data_train['sqr_Distance_To_Hydrology'] = data_train['Distance_To_Hydrology']**0.5
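To show how the two transformations act on a positively skewed variable that contains values near zero, here is a sketch on synthetic data. The exponential sample, its scale and the seed are assumptions for illustration only; the real comparison for Distance_To_Hydrology is the one printed in the following cells.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(37)
# Synthetic, positively skewed, non-negative sample (not our data)
sample = pd.Series(rng.exponential(scale=200.0, size=5000))

skews = {
    "raw": sample.skew(),
    "log10(x + 1)": np.log10(sample + 1).skew(),
    "sqrt": np.sqrt(sample).skew(),
}

for name, value in skews.items():
    print(f"{name}: {value:.3f}")
```

A skewness close to zero is what we are after. A transformation that is too strong can overshoot and turn a positive skew into a negative one, which is why both candidates are compared rather than picking the logarithm by default.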

#### <font color=green> Results after logarithm and </font><font color=darkcyan> Square root Transformation</font>

In [None]:
print('\033[92m' +"Skew after Log transformation\n", data_train['log10_Distance_To_Hydrology'].skew(), 
      "\nmin\n", data_train['log10_Distance_To_Hydrology'].min(),
      "\nmax\n", data_train['log10_Distance_To_Hydrology'].max(),)

In [None]:
print('\033[96m'+ "Skew after Square Root Transformation\n", data_train['sqr_Distance_To_Hydrology'].skew(), 
      "\nmin\n", data_train['sqr_Distance_To_Hydrology'].min(),
      "\nmax\n", data_train['sqr_Distance_To_Hydrology'].max(),)

In [None]:
def histPlot(first_feature,col):
    sns.distplot(first_feature,color=col,fit = norm,kde = True,kde_kws = {'shade': True, 'linewidth': 3});

f = plt.figure(figsize=(20,15))
f.add_subplot(331)
histPlot(data_train['Distance_To_Hydrology'], 'purple')
f.add_subplot(332)
histPlot(data_train['log10_Distance_To_Hydrology'], 'green')
f.add_subplot(333)
histPlot(data_train['sqr_Distance_To_Hydrology'], 'c')

As you can see above, for Distance_To_Hydrology the __square root__ performs better in terms of skewness and is closer to a normal bell shape than the logarithm transformation. We will use the square root as a new feature in the dataset and drop the log10 version.  


In [None]:
data_train.drop(['log10_Distance_To_Hydrology'], axis=1,inplace=True)

<a id='5.2.5'></a>
### <font color=green> 5.2.5 Horizontal Distance To Roadways </font>

#### <font color=green> Square root and logarithm Transformation  </font>

A log transformation requires that there are no zeros or negative values, and the distribution should be positively skewed (a skewness above 1 counts as positively skewed). As the results below show, the logarithm transformation did not improve the distribution, so we use the square root instead.

In [None]:
print('\033[95m'+"Skew before Transformation\n", data_train['Horizontal_Distance_To_Roadways'].skew(), 
      "\nmin before Transformation\n", data_train['Horizontal_Distance_To_Roadways'].min(),
      "\nmax before Transformation\n", data_train['Horizontal_Distance_To_Roadways'].max(),)

#### <font color=green> Results after logarithm and </font><font color=darkcyan> Square root Transformation</font>

In [None]:
# since we have zero values we add 1 to avoid taking the log of zero. We are using natural log and log10
data_train['Sqr_Horizontal_Distance_To_Roadways'] = data_train['Horizontal_Distance_To_Roadways']**0.5
data_train['log_Horizontal_Distance_To_Roadways'] = np.log(data_train['Horizontal_Distance_To_Roadways']+1)
data_train['log10_Horizontal_Distance_To_Roadways'] = np.log10(data_train['Horizontal_Distance_To_Roadways']+1)

In [None]:
print('\033[96m'+ "Skew after Square Root Transformation\n", data_train['Sqr_Horizontal_Distance_To_Roadways'].skew(), 
      "\nmin \n", data_train['Sqr_Horizontal_Distance_To_Roadways'].min(),
      "\nmax \n", data_train['Sqr_Horizontal_Distance_To_Roadways'].max(),)


In [None]:
print('\033[92m' +"Skew after log Transformation\n", data_train['log_Horizontal_Distance_To_Roadways'].skew(), 
      "\nmin\n", data_train['log_Horizontal_Distance_To_Roadways'].min(),
      "\nmax\n", data_train['log_Horizontal_Distance_To_Roadways'].max(),)

In [None]:
print('\033[92m'+ "Skew after log10 transformation\n", data_train['log10_Horizontal_Distance_To_Roadways'].skew(), 
      "\nmin \n", data_train['log10_Horizontal_Distance_To_Roadways'].min(),
      "\nmax \n", data_train['log10_Horizontal_Distance_To_Roadways'].max(),)

In [None]:
# testing whether the square root is normally distributed: it is not, but it is less skewed than before
stats.normaltest(data_train['Sqr_Horizontal_Distance_To_Roadways'])

In [None]:
def histPlot(first_feature,col):
    sns.distplot(first_feature,color=col,fit = norm,kde = True,kde_kws = {'shade': True, 'linewidth': 3});

f = plt.figure(figsize=(15,10))
f.add_subplot(331)
histPlot(data_train['Horizontal_Distance_To_Roadways'], 'purple')
f.add_subplot(334)
histPlot(data_train['log_Horizontal_Distance_To_Roadways'], 'green')
f.add_subplot(335)
histPlot(data_train['log10_Horizontal_Distance_To_Roadways'], 'green')
f.add_subplot(332)
histPlot(data_train['Sqr_Horizontal_Distance_To_Roadways'], 'c')

We again achieved the best result using the square root of Horizontal_Distance_To_Roadways. As before, we remove the failed experiments.

In [None]:
data_train.drop(['log_Horizontal_Distance_To_Roadways','log10_Horizontal_Distance_To_Roadways'], axis=1,inplace=True)

<a id='5.2.6'></a>
### <font color=green> 5.2.6 Slope </font>
#### <font color=green> Square root and logarithm Transformation  </font>

In [None]:
print('\033[95m'+ "Skew before transformation\n", data_train['Slope'].skew(), 
      "\nmin\n", data_train['Slope'].min(),
      "\nmax \n", data_train['Slope'].max(),)

#### <font color=green> Results after logarithm and </font><font color=darkcyan> Square root Transformation</font>

In [None]:
# since we have zero values we add 1 to avoid taking the log of zero
data_train['logSlope'] = np.log(data_train['Slope']+1)

In [None]:
print('\033[92m'+"Skew after log transformation\n", data_train['logSlope'].skew(), 
      "\nmin\n", data_train['logSlope'].min(),
      "\nmax\n", data_train['logSlope'].max(),)

In [None]:
data_train['SqrSlope'] = data_train['Slope']**0.5

In [None]:
print('\033[96m'+"Skew after Square Root transformation\n", data_train['SqrSlope'].skew(), 
      "\nmin\n", data_train['SqrSlope'].min(),
      "\nmax\n", data_train['SqrSlope'].max(),)

In [None]:
def histPlot(first_feature,col):
    sns.distplot(first_feature,color=col,fit = norm,kde = True,kde_kws = {'shade': True, 'linewidth': 3});

f = plt.figure(figsize=(15,10))
f.add_subplot(331)
histPlot(data_train['Slope'], 'purple')
f.add_subplot(332)
histPlot(data_train['logSlope'], 'green')
f.add_subplot(333)
histPlot(data_train['SqrSlope'], 'c')

Since the skewness of Slope improves most with the square root, we will use the square root transformation for this variable as well. 

In [None]:
data_train.drop(['logSlope'], axis=1,inplace=True)

<a id='5.2.7'></a>
### <font color=green> 5.2.7 Horizontal Distance To Fire Points  </font>
#### <font color=green> Transformation  </font>

In [None]:
print('\033[95m'+"Skew before transformation\n", data_train['Horizontal_Distance_To_Fire_Points'].skew(), 
      "\nmin\n", data_train['Horizontal_Distance_To_Fire_Points'].min(),
      "\nmax\n", data_train['Horizontal_Distance_To_Fire_Points'].max(),)

#### <font color=green> Results after logarithm and </font><font color=darkcyan> Square root Transformation</font>

In [None]:
# the feature contains zeros, so we add 1 to avoid taking the log of zero
data_train['log_Horizontal_Distance_To_firepoints'] = np.log(data_train['Horizontal_Distance_To_Fire_Points']+1)

In [None]:
print('\033[92m'+"Skew after log transformation\n", data_train['log_Horizontal_Distance_To_firepoints'].skew(), 
      "\nmin\n", data_train['log_Horizontal_Distance_To_firepoints'].min(),
      "\nmax\n", data_train['log_Horizontal_Distance_To_firepoints'].max(),)

In [None]:
#Transform with square root
data_train['sqr_Horizontal_Distance_To_firepoints'] = data_train['Horizontal_Distance_To_Fire_Points']**0.5

In [None]:
print('\033[96m'+"Skew after Square Root transformation\n", data_train['sqr_Horizontal_Distance_To_firepoints'].skew(), 
      "\nmin\n", data_train['sqr_Horizontal_Distance_To_firepoints'].min(),
      "\nmax\n", data_train['sqr_Horizontal_Distance_To_firepoints'].max(),)

In [None]:
def histPlot(first_feature,col):
    sns.distplot(first_feature,color=col,fit = norm,kde = True,kde_kws = {'shade': True, 'linewidth': 3});

f = plt.figure(figsize=(15,10))
f.add_subplot(331)
histPlot(data_train['Horizontal_Distance_To_Fire_Points'], 'purple')
f.add_subplot(332)
histPlot(data_train['log_Horizontal_Distance_To_firepoints'], 'green')
f.add_subplot(333)
histPlot(data_train['sqr_Horizontal_Distance_To_firepoints'], 'c')

Since the square root transformation gives the best skewness, we also use the square root for this feature.

In [None]:
data_train.drop(['log_Horizontal_Distance_To_firepoints'], axis=1,inplace=True)

<a id='5.2.8'></a>
### <font color=green> 5.2.8 Hillshade </font>
<a id='5.2.8.1'></a>
### <font color=green> 5.2.8.1 Mean Hillshade </font>
#### <font color=green> Creation of new Feature: Mean Hillshade </font>

In [None]:
# We take the average of the three Hillshade indices, which gives the average light exposure of each observation during the day
data_train['Mean_Hillshade'] = (data_train['Hillshade_9am']+data_train['Hillshade_Noon']+data_train['Hillshade_3pm'])/3

In [None]:
# Intensity of the mean Hillshade in 3 bins of equal width, using KBinsDiscretizer
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
data_train['Mean_Hillshade_bin'] = est.fit_transform(data_train[['Mean_Hillshade']])

In [None]:
data_train[['Mean_Hillshade_bin','Mean_Hillshade']].describe()
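
To make the equal-width binning concrete, here is a small NumPy-only sketch of what a uniform 3-bin discretization does. `uniform_bins` is a hypothetical helper written for illustration; it is not scikit-learn's implementation.

```python
import numpy as np

def uniform_bins(values, n_bins=3):
    """Assign each value to one of n_bins equal-width bins between min and max."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Only the interior edges are used, so the maximum falls into the last bin
    codes = np.digitize(values, edges[1:-1], right=False)
    return codes, edges

codes, edges = uniform_bins([0, 50, 100, 150, 200, 254])
print(codes)  # ordinal bin index per value
print(edges)  # bin boundaries
```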

In [None]:
print('\033[95m'+"Skew before transformation\n", data_train['Mean_Hillshade'].skew(), 
      "\nmin\n", data_train['Mean_Hillshade'].min(),
      "\nmax\n", data_train['Mean_Hillshade'].max(),)

#### <font color=green> Results after logarithm Transformation</font><font color=darkcyan>, Square root Transformation</font><font color=gold> and Box-Cox Transformation</font>

In [None]:
data_train['log_Mean_Hillshade'] = np.log(data_train['Mean_Hillshade'])

In [None]:
print('\033[92m'+"Skew after log transformation\n", data_train['log_Mean_Hillshade'].skew(), 
      "\nmin\n", data_train['log_Mean_Hillshade'].min(),
      "\nmax\n", data_train['log_Mean_Hillshade'].max(),)

In [None]:
data_train['log10Mean_Hillshade'] = np.log10(data_train['Mean_Hillshade'])

In [None]:
print('\033[92m'+"Skew after log10 transformation\n", data_train['log10Mean_Hillshade'].skew(), 
      "\nmin\n", data_train['log10Mean_Hillshade'].min(),
      "\nmax\n", data_train['log10Mean_Hillshade'].max(),)

In [None]:
data_train['sqr_Mean_Hillshade'] = data_train['Mean_Hillshade']**0.5

In [None]:
print('\033[96m'+"Skew after Square Root transformation\n", data_train['sqr_Mean_Hillshade'].skew(), 
      "\nmin\n", data_train['sqr_Mean_Hillshade'].min(),
      "\nmax\n", data_train['sqr_Mean_Hillshade'].max(),)

In [None]:
#Now, the Box-Cox transformation also requires our data to only contain positive numbers
# transform training data with Boxcox
data_train['Mean_Hillshade_boxcox'], _ = stats.boxcox(data_train['Mean_Hillshade'])

In [None]:
print('\033[93m'+"Skew after Boxcox transformation\n", data_train['Mean_Hillshade_boxcox'].skew(), 
      "\nmin\n", data_train['Mean_Hillshade_boxcox'].min(),
      "\nmax\n", data_train['Mean_Hillshade_boxcox'].max(),)

In [None]:
stats.normaltest(data_train['Mean_Hillshade_boxcox'])

In [None]:
from scipy.stats import norm
import scipy.stats as stats

def histPlot(first_feature,col):
    sns.distplot(first_feature,color=col,fit = norm,kde = True,kde_kws = {'shade': True, 'linewidth': 3});

f = plt.figure(figsize=(20,10))
f.add_subplot(331)
histPlot(data_train['Mean_Hillshade'], 'purple')
f.add_subplot(335)
histPlot(data_train['log_Mean_Hillshade'], 'green')
f.add_subplot(334)
histPlot(data_train['Mean_Hillshade_boxcox'], 'gold')                    
f.add_subplot(332)
histPlot(data_train['sqr_Mean_Hillshade'], 'c')

The distribution did not improve with the square root and logarithm transformations. Hence we use Box-Cox, which improved the distribution substantially.

In [None]:
data_train.drop(['log10Mean_Hillshade','log_Mean_Hillshade','sqr_Mean_Hillshade'], axis=1,inplace=True)
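
Box-Cox estimates a power parameter (lambda) from the data. The sketch below, on synthetic values that stand in for `Mean_Hillshade` (an assumption), shows how the lambda fitted on training data could be reused on unseen data and how the transform can be inverted. It relies on `scipy.stats.boxcox` and `scipy.special.inv_boxcox`.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(1)
# Synthetic left-skewed, strictly positive stand-ins for the hillshade feature
train_values = 255 - rng.gamma(shape=2.0, scale=15.0, size=4_000).clip(max=250)
val_values = 255 - rng.gamma(shape=2.0, scale=15.0, size=1_000).clip(max=250)

# Fit lambda on the training values only
train_bc, lam = stats.boxcox(train_values)

# Reuse the fitted lambda on unseen values instead of re-estimating it
val_bc = stats.boxcox(val_values, lmbda=lam)

# For a known lambda the transform is invertible
recovered = inv_boxcox(train_bc, lam)

print("fitted lambda:", lam)
print("skew before:", stats.skew(train_values), "after:", stats.skew(train_bc))
```

Reusing the training lambda keeps the validation data on the same scale as the training data, which matters once the transformed column is fed to a model.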

<a id='5.2.8.2'></a>
### <font color=green> 5.2.8.2 Hillshade 9am </font>
#### <font color=green> Transformation </font>

In [None]:
print('\033[95m'+"Skew before transformation\n", data_train['Hillshade_9am'].skew(), 
      "\nmin\n", data_train['Hillshade_9am'].min(),
      "\nmax\n", data_train['Hillshade_9am'].max(),)

#### <font color=green> Results after logarithm Transformation</font><font color=darkcyan>, Square root Transformation</font><font color=gold> and Box-Cox Transformation</font>

In [None]:
data_train['log_Hillshade_9am'] = np.log(data_train['Hillshade_9am']+1)

In [None]:
print('\033[92m'+"Skew after log transformation\n", data_train['log_Hillshade_9am'].skew(), 
      "\nmin\n", data_train['log_Hillshade_9am'].min(),
      "\nmax\n", data_train['log_Hillshade_9am'].max(),)

In [None]:
data_train['sqr_Hillshade_9am'] = data_train['Hillshade_9am']**0.5

In [None]:
print('\033[96m'+"Skew after Square Root transformation\n", data_train['sqr_Hillshade_9am'].skew(), 
      "\nmin\n", data_train['sqr_Hillshade_9am'].min(),
      "\nmax\n", data_train['sqr_Hillshade_9am'].max(),)

In [None]:
#Now, the Box-Cox transformation also requires our data to only contain positive numbers, transform training data with Boxcox
data_train['Hillshade_9am_boxcox'], lam  = stats.boxcox(data_train['Hillshade_9am']+1)
#lam is the best lambda for the distribution

In [None]:
print('\033[93m'+"Skew after Boxcox transformation\n", data_train['Hillshade_9am_boxcox'].skew(), 
      "\nmin\n", data_train['Hillshade_9am_boxcox'].min(),
      "\nmax\n", data_train['Hillshade_9am_boxcox'].max(),)

In [None]:
def histPlot(first_feature,col):
    sns.distplot(first_feature,color=col,fit = norm,kde = True,kde_kws = {'shade': True, 'linewidth': 3});

f = plt.figure(figsize=(20,10))
f.add_subplot(331)
histPlot(data_train['Hillshade_9am'], 'purple')
f.add_subplot(335)
histPlot(data_train['log_Hillshade_9am'], 'green')
f.add_subplot(334)
histPlot(data_train['Hillshade_9am_boxcox'], 'gold')                    
f.add_subplot(332)
histPlot(data_train['sqr_Hillshade_9am'], 'c')

Box-Cox outperforms the other two transformations for Hillshade 9am.

In [None]:
data_train.drop(['log_Hillshade_9am','sqr_Hillshade_9am'], axis=1,inplace=True)

<a id='5.2.8.3'></a>
### <font color=green> 5.2.8.3 Hillshade Noon </font>
#### <font color=green> Transformation </font>

In [None]:
print('\033[95m'+"Skew before transformation\n", data_train['Hillshade_Noon'].skew(), 
      "\nmin\n", data_train['Hillshade_Noon'].min(),
      "\nmax\n", data_train['Hillshade_Noon'].max(),)

#### <font color=green> Results after logarithm Transformation</font><font color=darkcyan>, Square root Transformation</font><font color=gold> and Box-Cox Transformation</font>

In [None]:
data_train['log_Hillshade_Noon'] = np.log(data_train['Hillshade_Noon']+1)

In [None]:
print('\033[92m'+"Skew after log transformation\n", data_train['log_Hillshade_Noon'].skew(), 
      "\nmin\n", data_train['log_Hillshade_Noon'].min(),
      "\nmax\n", data_train['log_Hillshade_Noon'].max(),)

In [None]:
data_train['sqr_Hillshade_Noon'] = data_train['Hillshade_Noon']**0.5

In [None]:
print('\033[96m'+"Skew after Square Root transformation\n", data_train['sqr_Hillshade_Noon'].skew(), 
      "\nmin\n", data_train['sqr_Hillshade_Noon'].min(),
      "\nmax\n", data_train['sqr_Hillshade_Noon'].max(),)

In [None]:
#Now, the Box-Cox transformation also requires our data to only contain positive numbers, transform training data with Boxcox
data_train['Hillshade_Noon_boxcox'], lam  = stats.boxcox(data_train['Hillshade_Noon'])
#lam is the best lambda for the distribution

In [None]:
print('\033[93m'+"Skew after Boxcox transformation\n", data_train['Hillshade_Noon_boxcox'].skew(), 
      "\nmin\n", data_train['Hillshade_Noon_boxcox'].min(),
      "\nmax\n", data_train['Hillshade_Noon_boxcox'].max(),)

In [None]:
from scipy.stats import norm
import scipy.stats as stats

def histPlot(first_feature,col):
    sns.distplot(first_feature,color=col,fit = norm,kde = True,kde_kws = {'shade': True, 'linewidth': 3});

f = plt.figure(figsize=(20,10))
f.add_subplot(331)
histPlot(data_train['Hillshade_Noon'], 'purple')
f.add_subplot(335)
histPlot(data_train['log_Hillshade_Noon'], 'green')
f.add_subplot(334)
histPlot(data_train['Hillshade_Noon_boxcox'], 'gold')                    
f.add_subplot(332)
histPlot(data_train['sqr_Hillshade_Noon'], 'c')

Box-Cox outperforms the other transformations for Hillshade Noon.

In [None]:
data_train.drop(['log_Hillshade_Noon','sqr_Hillshade_Noon'], axis=1,inplace=True)

<a id='5.2.8.4'></a>
### <font color=green> 5.2.8.4 Hillshade 3pm </font>
#### <font color=green> Transformation </font>

In [None]:
print('\033[95m'+"Skew before transformation\n", data_train['Hillshade_3pm'].skew(), 
      "\nmin\n", data_train['Hillshade_3pm'].min(),
      "\nmax\n", data_train['Hillshade_3pm'].max(),)

#### <font color=green> Results after logarithm Transformation</font><font color=darkcyan>, Square root Transformation</font><font color=gold> and Box-Cox Transformation</font>

In [None]:
data_train['log_Hillshade_3pm'] = np.log(data_train['Hillshade_3pm']+1)

In [None]:
print('\033[92m'+"Skew after log transformation\n", data_train['log_Hillshade_3pm'].skew(), 
      "\nmin\n", data_train['log_Hillshade_3pm'].min(),
      "\nmax\n", data_train['log_Hillshade_3pm'].max(),)

In [None]:
data_train['sqr_Hillshade_3pm'] = data_train['Hillshade_3pm']**0.5

In [None]:
print('\033[96m'+"Skew after Square Root transformation\n", data_train['sqr_Hillshade_3pm'].skew(), 
      "\nmin\n", data_train['sqr_Hillshade_3pm'].min(),
      "\nmax\n", data_train['sqr_Hillshade_3pm'].max(),)

In [None]:
#Now, the Box-Cox transformation also requires our data to only contain positive numbers, transform training data with Boxcox
data_train['Hillshade_3pm_boxcox'], lam  = stats.boxcox(data_train['Hillshade_3pm']+1)
#lam is the best lambda for the distribution

In [None]:
print('\033[93m'+"Skew after Boxcox transformation\n", data_train['Hillshade_3pm_boxcox'].skew(), 
      "\nmin\n", data_train['Hillshade_3pm_boxcox'].min(),
      "\nmax\n", data_train['Hillshade_3pm_boxcox'].max(),)

In [None]:
def histPlot(first_feature,col):
    sns.distplot(first_feature,color=col,fit = norm,kde = True,kde_kws = {'shade': True, 'linewidth': 3});

f = plt.figure(figsize=(20,10))
f.add_subplot(331)
histPlot(data_train['Hillshade_3pm'], 'purple')
f.add_subplot(335)
histPlot(data_train['log_Hillshade_3pm'], 'green')
f.add_subplot(334)
histPlot(data_train['Hillshade_3pm_boxcox'], 'gold')                    
f.add_subplot(332)
histPlot(data_train['sqr_Hillshade_3pm'], 'c')

Hillshade 3pm was not highly skewed, so we can either keep the original variable or use Box-Cox, which improved the distribution as well.

In [None]:
data_train.drop(['log_Hillshade_3pm','sqr_Hillshade_3pm'], axis=1,inplace=True)

<a id='5.2.8.5'></a>
### <font color=green> 5.2.8.5 Hillshade Ratios </font>

In [None]:
data_train['ratio_Hillshade_3pm'] = data_train['Hillshade_3pm']/255
data_train['ratio_Hillshade_Noon'] = data_train['Hillshade_Noon']/255
data_train['ratio_Hillshade_9am'] = data_train['Hillshade_9am']/255

<a id='5.2.8.6'></a>
### <font color=green> 5.2.8.6 Aspect </font>
#### <font color=green> Transformation </font>

In [None]:
print('\033[95m'+"Skew before transformation\n", data_train['Aspect'].skew(), 
      "\nmin\n", data_train['Aspect'].min(),
      "\nmax\n", data_train['Aspect'].max(),)

#### <font color=green> Results after logarithm Transformation </font><font color=darkcyan>and Square root Transformation</font>

In [None]:
data_train['sqr_Aspect'] = data_train['Aspect']**0.5

In [None]:
print('\033[96m'+"Skew after Square Root transformation\n", data_train['sqr_Aspect'].skew(), 
      "\nmin\n", data_train['sqr_Aspect'].min(),
      "\nmax\n", data_train['sqr_Aspect'].max(),)

In [None]:
data_train['log_Aspect'] = np.log(data_train['Aspect']+1)

In [None]:
print('\033[92m'+"Skew after log transformation\n", data_train['log_Aspect'].skew(), 
      "\nmin\n", data_train['log_Aspect'].min(),
      "\nmax\n", data_train['log_Aspect'].max(),)

In [None]:
def histPlot(first_feature,col):
    sns.distplot(first_feature,color=col,fit = norm,kde = True,kde_kws = {'shade': True, 'linewidth': 3});

f = plt.figure(figsize=(20,10))
f.add_subplot(331)
histPlot(data_train['Aspect'], 'purple')
f.add_subplot(332)
histPlot(data_train['log_Aspect'], 'green')
#f.add_subplot(334)
#histPlot(data_train['Hillshade_3pm_boxcox'], 'gold')                    
f.add_subplot(333)
histPlot(data_train['sqr_Aspect'], 'c')

For Aspect, square root turned out to be the best transformation in terms of skewness.

Overall, the best transformations here are square root and Box-Cox. Log transformations did not prove to be beneficial.

In [None]:
data_train.drop(['log_Aspect'], axis=1,inplace=True)

The Hillshade ratios created in 5.2.8.5 rescale each hillshade index to the unit interval by dividing it by 255, the maximum index value. We do so because we find the unit scale much easier to interpret.


<a id='5.2.8.7'></a>
### <font color=green> 5.2.8.7 Aspect in degrees </font>
### <font color=green> New Features </font>
Aspect is recorded as an azimuth: an angle measured clockwise from north, in degrees from 0 to 360, so an azimuth of 90 degrees points east. The figure below illustrates the same convention for the azimuth of the sun. The cut-off values lie midway between neighbouring cardinal directions, for instance at 45 degrees, halfway between north and east.
    
We create dummy variables for the four cardinal directions of Aspect: north, east, south and west

* Aspect_North: from 315 deg to 45 deg
* Aspect_East: from 45 deg to 135 deg
* Aspect_South: from 135 deg to 225 deg
* Aspect_West: from 225 deg to 315 deg    

<img src="angle_azimuth.png" width=400 height=200 align="center">
    
Source:https://www.pveducation.org/pvcdrom/properties-of-sunlight/azimuth-angle

In [None]:
#Grouping Aspect in the four directions
data_train['Aspect_North']=  np.where(((data_train['Aspect']>=0) & (data_train['Aspect']<45))|((data_train['Aspect']>=315) & (data_train['Aspect']<=360)), 1 ,0)
data_train['Aspect_East']= np.where((data_train['Aspect']>=45) & (data_train['Aspect']<135), 1 ,0)
data_train['Aspect_South']= np.where((data_train['Aspect']>=135) & (data_train['Aspect']<225), 1 ,0)
data_train['Aspect_West']= np.where((data_train['Aspect']>=225) & (data_train['Aspect']<315), 1 ,0)
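
The four sectors can also be derived with modular arithmetic, which handles the wrap-around at north in one expression. This is a minimal sketch on made-up angles; `aspect_to_direction` is a hypothetical helper, and the sector boundaries follow the list above.

```python
import numpy as np
import pandas as pd

def aspect_to_direction(aspect: pd.Series) -> pd.Series:
    """Map an azimuth in degrees to North, East, South or West."""
    # Shift by 45 degrees so each 90-degree sector starts at a multiple of 90
    sector = ((aspect % 360 + 45) % 360) // 90
    labels = np.array(["North", "East", "South", "West"])
    return pd.Series(labels[sector.astype(int).to_numpy()], index=aspect.index)

# Boundary values of each sector (illustrative data)
demo = pd.Series([0, 44, 45, 134, 135, 224, 225, 314, 315, 360])
directions = aspect_to_direction(demo)

# One indicator column per direction, comparable to the Aspect_* features
dummies = pd.get_dummies(directions, prefix="Aspect").astype(int)
print(pd.concat([demo.rename("Aspect"), directions.rename("direction")], axis=1))
```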


<a id='5.2.8.8'></a>
### <font color=green> 5.2.8.8 Elevation </font>

No transformation is applied, as Elevation is already distributed very symmetrically.


In [None]:
print('\033[95m'+"Skew before transformation\n", data_train['Elevation'].skew(), 
      "\nmin\n", data_train['Elevation'].min(),
      "\nmax\n", data_train['Elevation'].max(),)

In [None]:
data_train['binned_elevation'] = [math.floor(v/50.0) for v in data_train['Elevation']]
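
The 50-unit elevation bins can also be computed without a Python-level loop. The sketch below, on made-up elevations, checks that integer floor division gives the same bins as the list comprehension.

```python
import math
import pandas as pd

# Illustrative elevations, not taken from the dataset
elevation = pd.Series([1863, 2499, 2500, 2549, 3849])

# Row-by-row version, as in the cell above
looped = [math.floor(v / 50.0) for v in elevation]

# Vectorized equivalent: floor division by the bin width
vectorized = (elevation // 50).astype(int)

print(looped)
print(vectorized.tolist())
```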

We create more features by summing and subtracting combinations of features that share the same units.

Addition and subtraction on the same scale
Using a for loop gave us poor performance, so we add or subtract only features that are on the same scale.

In [None]:
data_train['Road+Fire'] = data_train['Horizontal_Distance_To_Roadways'] + data_train['Horizontal_Distance_To_Fire_Points']
data_train['Road-Fire'] = abs(data_train['Horizontal_Distance_To_Roadways'] - data_train['Horizontal_Distance_To_Fire_Points'])
data_train['Road+Hydro'] = data_train['Horizontal_Distance_To_Roadways'] + data_train['Horizontal_Distance_To_Hydrology']
data_train['Road-Hydro'] = abs(data_train['Horizontal_Distance_To_Roadways'] - data_train['Horizontal_Distance_To_Hydrology'])
data_train['Hydro+Fire'] = data_train['Horizontal_Distance_To_Hydrology'] + data_train['Horizontal_Distance_To_Fire_Points']
data_train['Hydro-Fire'] = abs(data_train['Horizontal_Distance_To_Hydrology'] - data_train['Horizontal_Distance_To_Fire_Points'])

data_train['Road+Fire+Hydro'] = data_train['Horizontal_Distance_To_Roadways']  + data_train['Horizontal_Distance_To_Fire_Points'] + data_train['Horizontal_Distance_To_Hydrology']

data_train['Ele+Road+Fire+Hydro'] = data_train['Elevation'] + data_train['Horizontal_Distance_To_Roadways']  + data_train['Horizontal_Distance_To_Fire_Points'] + data_train['Horizontal_Distance_To_Hydrology']

data_train['Ele+road'] = data_train['Elevation'] + data_train['Horizontal_Distance_To_Roadways']
data_train['Ele-road'] = abs(data_train['Elevation'] - data_train['Horizontal_Distance_To_Roadways'])
data_train['Ele+fire'] = data_train['Elevation'] + data_train['Horizontal_Distance_To_Fire_Points']
data_train['Ele-fire'] = abs(data_train['Elevation'] - data_train['Horizontal_Distance_To_Fire_Points'])
data_train['Ele+hydro'] = data_train['Elevation'] + data_train['Horizontal_Distance_To_Hydrology']
data_train['Ele-hydro'] = abs(data_train['Elevation'] - data_train['Horizontal_Distance_To_Hydrology'])
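
As a sketch of how the pairwise sums and absolute differences could be generated for a chosen set of same-unit columns, the helper below iterates over column pairs (not over rows), so each operation stays vectorized. `add_pairwise_features` and the short labels are illustrative assumptions.

```python
from itertools import combinations
import pandas as pd

# Short labels mapped to column names; the labels are illustrative choices
DISTANCE_COLUMNS = {
    "Road": "Horizontal_Distance_To_Roadways",
    "Hydro": "Horizontal_Distance_To_Hydrology",
    "Fire": "Horizontal_Distance_To_Fire_Points",
}

def add_pairwise_features(df: pd.DataFrame, columns: dict) -> pd.DataFrame:
    """Add the sum and absolute difference of every pair of same-unit columns."""
    out = df.copy()
    for (label_a, col_a), (label_b, col_b) in combinations(columns.items(), 2):
        out[f"{label_a}+{label_b}"] = out[col_a] + out[col_b]
        out[f"{label_a}-{label_b}"] = (out[col_a] - out[col_b]).abs()
    return out

# One made-up observation
demo = pd.DataFrame({
    "Horizontal_Distance_To_Roadways": [100],
    "Horizontal_Distance_To_Hydrology": [30],
    "Horizontal_Distance_To_Fire_Points": [250],
})
result = add_pairwise_features(demo, DISTANCE_COLUMNS)
print(result.iloc[0])
```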




### <font color=green> 5.2.9 Geoclimate grouping </font>

#### <font color=green> 5.2.9.1 Climatic feature engineering to group soils </font>

The Kaggle competition references John A. Blackard, a geologist who worked for the US Forest Service. In a co-authored paper, he gives further insight into the soil families through a list of codes whose digits categorize the soils by climate and geology. We use this insight to engineer features that reduce the number of soil variables.

From original database donated by John A. Blackard

Code Designations:

Wilderness Areas:  	<br>

1 - Rawah Wilderness Area <br>
2 - Neota Wilderness Area  <br>
3 - Comanche Peak Wilderness Area<br>
4 - Cache la Poudre Wilderness Area<br>

Soil Types:             1 to 40 : based on the USFS Ecological
                        Landtype Units (ELUs) for this study area:<br>

  Study Code USFS ELU Code			Description<br>
	 1	   2702		Cathedral family - Rock outcrop complex, extremely stony.<br>
	 2	   2703		Vanet - Ratake families complex, very stony.<br>
	 3	   2704		Haploborolis - Rock outcrop complex, rubbly.<br>
	 4	   2705		Ratake family - Rock outcrop complex, rubbly.<br>
	 5	   2706		Vanet family - Rock outcrop complex complex, rubbly.<br>
	 6	   2717		Vanet - Wetmore families - Rock outcrop complex, stony.<br>
	 7	   3501		Gothic family.<br>
	 8	   3502		Supervisor - Limber families complex.<br>
	 9	   4201		Troutville family, very stony.<br>
	10	   4703		Bullwark - Catamount families - Rock outcrop complex, rubbly.<br>
	11	   4704		Bullwark - Catamount families - Rock land complex, rubbly.<br>
	12	   4744		Legault family - Rock land complex, stony.<br>
	13	   4758		Catamount family - Rock land - Bullwark family complex, rubbly.<br>
	14	   5101		Pachic Argiborolis - Aquolis complex.<br>
	15	   5151		unspecified in the USFS Soil and ELU Survey.<br>
	16	   6101		Cryaquolis - Cryoborolis complex.<br>
	17	   6102		Gateview family - Cryaquolis complex.<br>
	18	   6731		Rogert family, very stony.<br>
	19	   7101		Typic Cryaquolis - Borohemists complex.<br>
	20	   7102		Typic Cryaquepts - Typic Cryaquolls complex.<br>
	21	   7103		Typic Cryaquolls - Leighcan family, till substratum complex.<br>
	22	   7201		Leighcan family, till substratum, extremely bouldery.<br>
	23	   7202		Leighcan family, till substratum - Typic Cryaquolls complex.<br>
	24	   7700		Leighcan family, extremely stony.<br>
	25	   7701		Leighcan family, warm, extremely stony.<br>
	26	   7702		Granile - Catamount families complex, very stony.<br>
	27	   7709		Leighcan family, warm - Rock outcrop complex, extremely stony.<br>
	28	   7710		Leighcan family - Rock outcrop complex, extremely stony.<br>
	29	   7745		Como - Legault families complex, extremely stony.<br>
	30	   7746		Como family - Rock land - Legault family complex, extremely stony.<br>
	31	   7755		Leighcan - Catamount families complex, extremely stony.<br>
	32	   7756		Catamount family - Rock outcrop - Leighcan family complex, extremely stony.<br>
	33	   7757		Leighcan - Catamount families - Rock outcrop complex, extremely stony.<br>
	34	   7790		Cryorthents - Rock land complex, extremely stony.<br>
	35	   8703		Cryumbrepts - Rock outcrop - Cryaquepts complex.<br>
	36	   8707		Bross family - Rock land - Cryumbrepts complex, extremely stony.<br>
	37	   8708		Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.<br>
	38	   8771		Leighcan - Moran families - Cryaquolls complex, extremely stony.<br>
	39	   8772		Moran family - Cryorthents - Leighcan family complex, extremely stony.<br>
	40	   8776		Moran family - Cryorthents - Rock land complex, extremely stony.<br>

        Note:   First digit:  climatic zone       Second digit:  geologic zones
                1.  lower montane dry             1.  alluvium
                2.  lower montane                 2.  glacial
                3.  montane dry                   3.  shale
                4.  montane                       4.  sandstone
                5.  montane dry and montane       5.  mixed sedimentary
                6.  montane and subalpine         6.  unspecified in the USFS ELU Survey
                7.  subalpine                     7.  igneous and metamorphic
                8.  alpine                        8.  volcanic

The USFS (US Forest Service), a federal agency within the US Department of Agriculture, classifies soil types by __climatic zone (first digit)__ and __geologic zone (second digit)__. We therefore engineer a similar grouping that collects the soils into 7 climate categories (no soil in this study area is lower montane dry) and 4 geology categories (shale, sandstone, volcanic and unspecified do not occur among the 40 soil types, so we leave them out).

#### <font color=green> 5.2.9.2 Climatic Zone feature engineering to group soils </font>

In [None]:
# Lower montane soils have ELU codes starting with 2 (2702 to 2717), i.e. Soil_Type1 to Soil_Type6
data_train["Lower_Montane_Climate"] = data_train.loc[:,data_train.columns.str.contains("^Soil_Type[1-6]$")].max(axis=1)

In [None]:
data_train['Montane_Dry_Climate'] =data_train.loc[:,data_train.columns.str.contains("^Soil_Type[78]$")].max(axis=1)

In [None]:
data_train['Montane_Climate'] =data_train.loc[:,data_train.columns.str.contains("^Soil_Type[1][0123]$|Soil_Type[9]$")].max(axis=1)

In [None]:
data_train['Montane_Dry_and_Montane_Climate'] =data_train.loc[:,data_train.columns.str.contains("^Soil_Type[1][45]$")].max(axis=1)

In [None]:
data_train['Montante_and_Subalpine_Climate'] =data_train.loc[:,data_train.columns.str.contains("^Soil_Type[1][678]$")].max(axis=1)


In [None]:
data_train['Subalpine_Climate'] =data_train.loc[:,data_train.columns.str.contains("^Soil_Type19$|^Soil_Type[2][0-9]$|^Soil_Type[3][0-4]$")].max(axis=1)


In [None]:
data_train['Alpine_Climate'] =data_train.loc[:,data_train.columns.str.contains("^Soil_Type[3][56789]$|Soil_Type40")].max(axis=1)

#### <font color=green> 5.2.9.3 Geological feature engineering to group soils </font>

        Note:   First digit:  climatic zone             Second digit:  geologic zones
                1.  lower montane dry                   1.  alluvium
                2.  lower montane                       2.  glacial
                3.  montane dry                         3.  shale
                4.  montane                             4.  sandstone
                5.  montane dry and montane             5.  mixed sedimentary
                6.  montane and subalpine               6.  unspecified in the USFS ELU Survey
                7.  subalpine                           7.  igneous and metamorphic
                8.  alpine                              8.  volcanic

In [None]:
data_train['Alluvium_Soil'] = data_train.loc[:,data_train.columns.str.contains("^Soil_Type[1][45679]$|^Soil_Type[2][01]$")].max(axis=1)

In [None]:
data_train['Glacial_Soil'] =data_train.loc[:,data_train.columns.str.contains("^Soil_Type[9]$|^Soil_Type[2][23]$")].max(axis=1)

In [None]:
data_train['Mixed_Sedimentary_Soil'] =data_train.loc[:,data_train.columns.str.contains("^Soil_Type[7-8]$")].max(axis=1)

In [None]:
# Raw string so that \d reaches the regex engine without an invalid-escape warning
data_train['Igneus_and_Metamorphic_Soil'] =data_train.loc[:,data_train.columns.str.contains(r"^Soil_Type[1-6]$|^Soil_Type[1][01238]$|^Soil_Type[3-4]\d$|^Soil_Type[2][4-9]$")].max(axis=1)
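
The climate and geology groups can also be derived directly from the ELU codes listed above, which is a convenient way to cross-check hand-written patterns. In this sketch the `ELU_CODES` dictionary is copied from the code table in this section, while `soils_by_digit` is a hypothetical helper.

```python
# Study code (Soil_Type number) -> USFS ELU code, copied from the table above
ELU_CODES = {
    1: 2702, 2: 2703, 3: 2704, 4: 2705, 5: 2706, 6: 2717, 7: 3501, 8: 3502,
    9: 4201, 10: 4703, 11: 4704, 12: 4744, 13: 4758, 14: 5101, 15: 5151,
    16: 6101, 17: 6102, 18: 6731, 19: 7101, 20: 7102, 21: 7103, 22: 7201,
    23: 7202, 24: 7700, 25: 7701, 26: 7702, 27: 7709, 28: 7710, 29: 7745,
    30: 7746, 31: 7755, 32: 7756, 33: 7757, 34: 7790, 35: 8703, 36: 8707,
    37: 8708, 38: 8771, 39: 8772, 40: 8776,
}

def soils_by_digit(position: int) -> dict:
    """Group soil type numbers by the ELU digit at `position` (0 = climate, 1 = geology)."""
    groups = {}
    for soil, code in ELU_CODES.items():
        digit = int(str(code)[position])
        groups.setdefault(digit, []).append(soil)
    return groups

climate = soils_by_digit(0)  # first digit: climatic zone
geology = soils_by_digit(1)  # second digit: geologic zone

for zone, soils in sorted(climate.items()):
    print("climate", zone, soils)
for zone, soils in sorted(geology.items()):
    print("geology", zone, soils)
```

With such a mapping, a group feature could be built as `data_train[[f"Soil_Type{s}" for s in climate[2]]].max(axis=1)`, assuming the one-hot columns are named `Soil_Type1` to `Soil_Type40`.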

In [None]:
data_train.head()

The Medium article "Preprocessing: Why you should generate polynomial features first before standardizing" argues that it is not good practice to standardize the variables before applying PolynomialFeatures. Standardization should come afterwards so that the signal in the variables is not lost.
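
A small NumPy illustration of why the order matters, using a made-up positive feature and a hypothetical `standardize` helper: squaring first keeps the ordering of the raw values, whereas standardizing first centres the feature at zero, so squaring afterwards maps values below and above the mean to the same number.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # strictly positive raw feature (illustrative)

def standardize(values: np.ndarray) -> np.ndarray:
    """Centre to zero mean and scale to unit (population) standard deviation."""
    return (values - values.mean()) / values.std()

# Option A: polynomial term first, then standardize
poly_then_scale = standardize(x ** 2)

# Option B: standardize first, then take the polynomial term
scale_then_poly = standardize(x) ** 2

print("A:", poly_then_scale.round(3))  # still increasing with x
print("B:", scale_then_poly.round(3))  # symmetric around the mean of x
```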

In [None]:
# Identify and drop our target variable 'Cover_Type' from dataframe, isolating our independent variables
X = data_train.drop('Cover_Type', axis = 1)

# Isolate our dependent variable as a feature
y = data_train['Cover_Type']

In [None]:
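# Train/validation split (80/20), then drop duplicates and missing values from the training part

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = .2, random_state = 37, stratify=y)

# Drop duplicated and incomplete rows, then keep y_train aligned with the rows that remain
X_train = X_train.drop_duplicates().dropna()
y_train = y_train.loc[X_train.index]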

### Advanced Factorization

The numerical values present a level of detail that may be much more fine-grained than we need. For instance, the soil type can be described by different categories (soil family, complex, or stony/rubbly). Aggregating the data to these coarser categories can help to avoid overfitting.

__We experimented with these features but decided not to consider any further families. For the purposes of the assignment, the code is kept as a Raw NBConvert cell.__

The only family grouping we do on soils are the ones above, on geological and climate grounds

### <font color=green> 5.6 Soil Type Family </font>

Using Discretization to bin the soil variable to the family type.<br>

__Cathedral__ <br>
1 Cathedral family - Rock outcrop complex, extremely stony.<br>

__Ratake__ <br>
2 Vanet - Ratake families complex, very stony.<br>
4 Ratake family - Rock outcrop complex, rubbly.<br>

__Vanet__<br>
5 Vanet family - Rock outcrop complex complex, rubbly.<br>

__Wetmore__<br>
6 Vanet - Wetmore families - Rock outcrop complex, stony.<br>

__Gothic__<br>
7 Gothic family.<br>
                    
__Limber__ <br>
8 Supervisor - Limber families complex. <br>

__Troutville__<br>
9 Troutville family, very stony.<br>

__Legault__<br>
12 Legault family - Rock land complex, stony.<br>
29 Como - Legault families complex, extremely stony.<br>

__Gateview__ <br>
17 Gateview family - Cryaquolis complex.<br>

__Rogert__<br>
18 Rogert family, very stony.<br>


__Como__<br>
30 Como family - Rock land - Legault family complex, extremely stony.<br>

__Bross__<br>
36 Bross family - Rock land - Cryumbrepts complex, extremely stony.<br>



__Catamount__<br>
10 Bullwark - Catamount families - Rock outcrop complex, rubbly.<br>
11 Bullwark - Catamount families - Rock land complex, rubbly.<br>
13 Catamount family - Rock land - Bullwark family complex, rubbly.<br>
26 Granile - Catamount families complex, very stony.<br>
32 Catamount family - Rock outcrop - Leighcan family complex, extremely stony.<br>
31 Leighcan - Catamount families complex, extremely stony.<br>
33 Leighcan - Catamount families - Rock outcrop complex, extremely stony.<br>

__Leighcan__<br>
21 Typic Cryaquolls - Leighcan family, till substratum complex.<br>
22 Leighcan family, till substratum, extremely bouldery.<br>
23 Leighcan family, till substratum - Typic Cryaquolls complex.<br>
24 Leighcan family, extremely stony.<br>
25 Leighcan family, warm, extremely stony.<br>
27 Leighcan family, warm - Rock outcrop complex, extremely stony.<br>
28 Leighcan family - Rock outcrop complex, extremely stony.<br>

__Moran__<br>
38 Leighcan - Moran families - Cryaquolls complex, extremely stony.<br>
39 Moran family - Cryorthents - Leighcan family complex, extremely stony.<br>
40 Moran family - Cryorthents - Rock land complex, extremely stony.<br>

__Others__<br> 
3 Haploborolis - Rock outcrop complex, rubbly.<br>
15 unspecified in the USFS Soil and ELU Survey.<br>
37 Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.<br>
34 Cryorthents - Rock land complex, extremely stony.<br>
35 Cryumbrepts - Rock outcrop - Cryaquepts complex.<br>
20 Typic Cryaquepts - Typic Cryaquolls complex.<br>
14 Pachic Argiborolis - Aquolis complex.<br>
16 Cryaquolis - Cryoborolis complex.<br>
19 Typic Cryaquolis - Borohemists complex.<br>

__NOT USED__

We will group the soil types according to their family and according to the complex and stonyness

Complex Group <br>
__Rock_outcrop_complex__ <br>
1 Cathedral family - Rock outcrop complex, extremely stony.<br>
2 Vanet - Ratake families complex, very stony.<br>
3 Haploborolis - Rock outcrop complex, rubbly.<br>
4 Ratake family - Rock outcrop complex, rubbly.<br>
5 Vanet family - Rock outcrop complex complex, rubbly.<br>
6 Vanet - Wetmore families - Rock outcrop complex, stony.<br>
10 Bullwark - Catamount families - Rock outcrop complex, rubbly.<br>
27 Leighcan family, warm - Rock outcrop complex, extremely stony.<br>
28 Leighcan family - Rock outcrop complex, extremely stony.<br>
33 Leighcan - Catamount families - Rock outcrop complex, extremely stony.<br>

__Ratake_families_complex__<br>
2 Vanet - Ratake families complex, very stony.<br>


__Limber families complex__<br>
8 Supervisor - Limber families complex.<br>

__rock land complex__<br>
11 Bullwark - Catamount families - Rock land complex, rubbly.<br>
12 Legault family - Rock land complex, stony.<br>
34 Cryorthents - Rock land complex, extremely stony.<br>
40 Moran family - Cryorthents - Rock land complex, extremely stony.<br>

__Cryoborolis complex__<br>
16 Cryaquolis - Cryoborolis complex.<br>
17 Gateview family - Cryaquolis complex.<br>

__Bullwark family complex__<br>
13 Catamount family - Rock land - Bullwark family complex, rubbly.<br>

__Aquolis complex__<br>
14 Pachic Argiborolis - Aquolis complex.<br>

__Borohemists complex__<br>
19 Typic Cryaquolis - Borohemists complex.<br>

__Cryaquolls complex__<br>
20 Typic Cryaquepts - Typic Cryaquolls complex.<br>
23 Leighcan family, till substratum - Typic Cryaquolls complex.<br>
38 Leighcan - Moran families - Cryaquolls complex, extremely stony.<br>

__till substratum complex__<br>
21 Typic Cryaquolls - Leighcan family, till substratum complex.<br>

__Catamount families complex__<br>
26 Granile - Catamount families complex, very stony.<br>
31 Leighcan - Catamount families complex, extremely stony.<br>

__Legault families complex__<br>
29 Como - Legault families complex, extremely stony.<br>
30 Como family - Rock land - Legault family complex, extremely stony.<br>

__Leighcan family complex__<br>
32 Catamount family - Rock outcrop - Leighcan family complex, extremely stony.<br>
39 Moran family - Cryorthents - Leighcan family complex, extremely stony.<br>

__Cryaquepts complex__<br>
35 Cryumbrepts - Rock outcrop - Cryaquepts complex.<br>

__Cryumbrepts complex__<br>
36 Bross family - Rock land - Cryumbrepts complex, extremely stony.<br>

__Cryorthents complex__<br>
37 Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.<br>

__others__ <br>
7 Gothic family.<br>
9 Troutville family, very stony.<br>
22 Leighcan family, till substratum, extremely bouldery.<br>
24 Leighcan family, extremely stony.<br>
25 Leighcan family, warm, extremely stony.<br>
18 Rogert family, very stony.<br>
15 unspecified in the USFS Soil and ELU Survey.<br>


Source: https://www.kaggle.com/competitions/forest-cover-type-prediction/data

__NOT USED__

__Stony__ <br>
1 Cathedral family - Rock outcrop complex, extremely stony.<br>
2 Vanet - Ratake families complex, very stony.<br>
6 Vanet - Wetmore families - Rock outcrop complex, stony.<br>
9 Troutville family, very stony.<br>
12 Legault family - Rock land complex, stony.<br>
18 Rogert family, very stony.<br>
24 Leighcan family, extremely stony.<br>
25 Leighcan family, warm, extremely stony.<br>
26 Granile - Catamount families complex, very stony.<br>
27 Leighcan family, warm - Rock outcrop complex, extremely stony.<br>
28 Leighcan family - Rock outcrop complex, extremely stony.<br>
29 Como - Legault families complex, extremely stony.<br>
30 Como family - Rock land - Legault family complex, extremely stony.<br>
31 Leighcan - Catamount families complex, extremely stony.<br>
32 Catamount family - Rock outcrop - Leighcan family complex, extremely stony.<br>
33 Leighcan - Catamount families - Rock outcrop complex, extremely stony.<br>
34 Cryorthents - Rock land complex, extremely stony.<br>
36 Bross family - Rock land - Cryumbrepts complex, extremely stony.<br>
37 Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.<br>
38 Leighcan - Moran families - Cryaquolls complex, extremely stony.<br>
39 Moran family - Cryorthents - Leighcan family complex, extremely stony.<br>
40 Moran family - Cryorthents - Rock land complex, extremely stony.<br>

__Rubbly__<br>
3 Haploborolis - Rock outcrop complex, rubbly.<br>
4 Ratake family - Rock outcrop complex, rubbly.<br>
5 Vanet family - Rock outcrop complex, rubbly.<br>
10 Bullwark - Catamount families - Rock outcrop complex, rubbly.<br>
11 Bullwark - Catamount families - Rock land complex, rubbly.<br>
13 Catamount family - Rock land - Bullwark family complex, rubbly.<br>

__others__<br>
7 Gothic family.<br>
8 Supervisor - Limber families complex.<br>
14 Pachic Argiborolis - Aquolis complex.<br>
15 unspecified in the USFS Soil and ELU Survey.<br>
16 Cryaquolis - Cryoborolis complex.<br>
17 Gateview family - Cryaquolis complex.<br>
19 Typic Cryaquolis - Borohemists complex.<br>
20 Typic Cryaquepts - Typic Cryaquolls complex.<br>
21 Typic Cryaquolls - Leighcan family, till substratum complex.<br>
22 Leighcan family, till substratum, extremely bouldery.<br>
23 Leighcan family, till substratum - Typic Cryaquolls complex.<br>
35 Cryumbrepts - Rock outcrop - Cryaquepts complex.<br>

__NOT USED__

In [None]:
# Soil Type
family_types = {
    'Type_Stony': ['Soil_Type1','Soil_Type2', 'Soil_Type6', 'Soil_Type9', 'Soil_Type12', 'Soil_Type18', 'Soil_Type24', 'Soil_Type25', 'Soil_Type26', 'Soil_Type27', 'Soil_Type28', 'Soil_Type29', 'Soil_Type30', 'Soil_Type31', 'Soil_Type32', 'Soil_Type33', 'Soil_Type34', 'Soil_Type36', 'Soil_Type37', 'Soil_Type38', 'Soil_Type39', 'Soil_Type40'],
    'Type_Rubbly': ['Soil_Type3', 'Soil_Type4', 'Soil_Type5', 'Soil_Type10', 'Soil_Type11', 'Soil_Type13'],
    'Type_Other': ['Soil_Type7','Soil_Type8', 'Soil_Type14', 'Soil_Type15', 'Soil_Type16', 'Soil_Type17', 'Soil_Type19', 'Soil_Type20', 'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type35']
} 

for family in family_types:
    data_train[family] = 0
    soil_types = family_types[family]
    for soil_type in soil_types:
        data_train[family] += data_train[soil_type]

data_train

Note: Soil type is presumably a single categorical variable that has been one-hot encoded, so we reverse-engineer it into grouped features. We will eventually drop the original soil type columns, which has the added effect of significantly reducing the total number of features.
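
As a minimal sketch of the idea in the note, one-hot columns can be collapsed back into a single soil code, or summed into a group indicator. The toy frame, the `Soil_Code` column, and the `Group_A`/`Group_B` names are illustrative assumptions, not the project's data.

```python
import pandas as pd

# Toy one-hot frame (hypothetical data): exactly one soil column is 1 per row
toy = pd.DataFrame({
    'Soil_Type1': [1, 0, 0],
    'Soil_Type2': [0, 1, 0],
    'Soil_Type3': [0, 0, 1],
})
soil_cols = ['Soil_Type1', 'Soil_Type2', 'Soil_Type3']

# Reverse the one-hot encoding: the column holding the 1 gives the soil code
toy['Soil_Code'] = (
    toy[soil_cols].idxmax(axis=1)
    .str.replace('Soil_Type', '', regex=False)
    .astype(int)
)

# Group indicator: 1 if the row's soil type belongs to the group
toy_groups = {'Group_A': ['Soil_Type1', 'Soil_Type2'], 'Group_B': ['Soil_Type3']}
for group, cols in toy_groups.items():
    toy[group] = toy[cols].sum(axis=1)

print(toy[['Soil_Code', 'Group_A', 'Group_B']])
```

Because each row has exactly one active soil column, the summed group columns stay binary.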

In [None]:
# Original soil features
soil_features = [f'Soil_Type{i}' for i in range(1,41)]

In [None]:
# Drop original soil features
data_train.drop(columns = soil_features, inplace = True)

In [None]:
# Test whether taking out Elevation and Elevation^2 improves the model now that the new interaction features are in place
data_train = data_train.drop(['Elevation^2', 'Elevation'], axis=1)

In [None]:
data_train

Removing the original variables neither improved nor worsened the model. Since the score barely changes, we drop them: they are already represented in the model by their scaled versions.

In [None]:
data_train = data_train.drop(
    ['Aspect', 'Slope', 'Horizontal_Distance_To_Roadways', 'Horizontal_Distance_To_Hydrology',
     'Vertical_Distance_To_Hydrology', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
     'Horizontal_Distance_To_Fire_Points'],
    axis=1,
)

In [None]:
X_drop = data_train.drop(labels=["Cover_Type"], axis=1)
y_drop = data_train['Cover_Type']

In [None]:
X_train_drop, X_val_drop, y_train_drop, y_val_drop = train_test_split(X_drop, y_drop, random_state=37)  # seed is 37

In [None]:
forest_dropped_variables = RandomForestClassifier(random_state=37)
model_dropped_variables = forest_dropped_variables.fit(X_train_drop,y_train_drop)

In [None]:
# calculating accuracy_score
model_dropped_variables.score(X_val_drop,y_val_drop)

In [None]:
# Cross-validated accuracy on the validation split, using a fresh (unfitted) forest
forest_dropped_variables = RandomForestClassifier(random_state=37)
print("Accuracy = {0:.4f}".format(np.mean(cross_val_score(forest_dropped_variables, X_val_drop, y_val_drop))))

The model does not improve on the baseline's 80% accuracy.

### <font color=green> 5.10 Summary </font>

<table>
  <tr>
    <th><b>Features</b></th>
    <th><b>Transformation</b></th>
  </tr>
  <tr>
    <td>ID</td>
    <td>Drop</td>
  </tr>
  <tr>
    <td>Distance To Hydrology</td>
    <td><b><i>Square Root</i></b> of the length of the side of horizontal and vertical</td>
  </tr>
  <tr>
    <td>Horizontal Distance To Roadways</td>
    <td><b><i>Square Root</i></b> of Horizontal Distance To Roadways</td>
  </tr>
  <tr>
    <td>Slope</td>
    <td><b><i>Square Root</i></b> Slope</td>
  </tr>
  <tr>
    <td>Horizontal Distance To Fire Points</td>
    <td><b><i>Square Root</i></b> Horizontal Distance To Fire Points</td>
  </tr>
  <tr>
    <td>Mean Hillshade</td>
    <td><b><i>Box Cox Average</i></b> of all Hillshade features</td>
  </tr>
  <tr>
    <td>Hillshade 9am</td>
    <td><b><i>Box Cox</i></b> Hillshade 9am</td>
  </tr>
  <tr>
    <td>Hillshade Noon</td>
    <td><b><i>Box Cox</i></b> Hillshade Noon</td>
  </tr>
  <tr>
    <td>Hillshade 3pm</td>
    <td><b><i>Box Cox</i></b> Hillshade 3pm</td>
  </tr>
  <tr>
    <td>Aspect</td>
    <td><b><i>Square Root</i></b> Aspect</td>
  </tr>
  <tr>
    <td>Aspect North, East, South and West</td>
    <td><b><i>Grouping</i></b> Aspect</td>
  </tr>
  <tr>
    <td>Geological Grouping</td>
    <td><b><i>Grouping</i></b> Soil Types</td>
  </tr>
  <tr>
    <td>Climate Grouping</td>
    <td><b><i>Grouping</i></b> Soil Types</td>
  </tr>
  <tr>
    <td>Soil Family</td>
    <td><b><i>Grouping</i></b> Soil Families</td>
  </tr>
  <tr>
    <td>Soil Type Complex</td>
    <td><b><i>Grouping</i></b> Soil Complex</td>
  </tr>
  <tr>
    <td>Soil Type Stoniness</td>
    <td><b><i>Grouping</i></b> by Soil Stoniness</td>
  </tr>
</table>

<a id='6'></a>
# <font color=green> 6. Feature Selection </font>


We apply several feature selection algorithms and combine their results: each feature is scored by how many of the methods select it, and the features with the best combined score are kept.

Source: https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2 

### Locking all features in a csv

For the sake of efficiency, we save the data to a CSV file so that it can also be reused later in Part III.

In [None]:
data_train.shape

The code below takes all the features created in the feature engineering part and exports them to a CSV file before we apply the feature selection methods and keep only a few variables.

In [None]:
#Only X_Train replacement
data_train.to_csv('all_features_data_train.csv')

<a id='6.1'></a>
## <font color=green> 6.1. Standardization </font>

We now standardize the data in the hope that it will help the feature selection algorithms work more efficiently.

We also ran a few correlation matrices and removed heavily correlated variables, but this did not improve the baseline's accuracy, so those cells are marked as Raw NBConvert.

__NOT USED__

We can barely see anything but I'd remove all green blocks (polynomial features)

__NOT USED__

Green correlations are still there, I do another drop

__NOT USED__

Doing an extra round of drops

__NOT USED__

Now removing perfect collinearity (correlations of 1)

__NOT USED__

We split the dataset into a training set and a validation set in order to test our models. We stratify on the target y so that both sets keep the same class proportions.
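
To illustrate what `stratify` does, here is a small sketch on made-up, deliberately imbalanced data. The arrays `X_toy` and `y_toy` are hypothetical; only `train_test_split` and its arguments come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target (hypothetical): 80 samples of class 0, 20 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=37, stratify=y_toy
)

# Both splits keep the 80/20 class ratio of the full data
print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_va) / len(y_va))
```

Without `stratify`, the class shares in the two splits would only match on average.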

In [None]:
X = data_train.drop(['Cover_Type'], axis=1)
y = data_train['Cover_Type']
column_list = X.columns

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=37,stratify=y)
print("The shape of validation data:{} and {} ".format(X_val.shape,y_val.shape))
print("The shape of training data:{} and {} ".format(X_train.shape,y_train.shape))

Standardization applies only to numerical values. Since the categorical variables have already been encoded, we use the column names to filter out the dummy variables.

In [None]:
# Substrings that identify the dummy (already encoded) columns
dummy_markers = ('Soil', 'Wilderness_Area', 'Aspect_North', 'Aspect_East', 'Aspect_South', 'Aspect_West', 'Climate', 'Family', 'Type', 'complex')
scale_numerical = [column for column in column_list if not any(marker in column for marker in dummy_markers)]
scale_categorial = [column for column in column_list if column not in scale_numerical]

We want to see only the dummy variables filtered 


In [None]:
data_train.dtypes

In [None]:
numerical_train = data_train.filter(items=scale_numerical)

In [None]:
categorial_train = data_train.filter(items=scale_categorial)

We now standardize using the StandardScaler. We also tried MinMaxScaler, which did not give good results.

In [None]:
scaler = StandardScaler()
#scaler = MinMaxScaler()

In [None]:
# Fit the scaler on the training split only, then apply the same scaling to the validation split
X_train[scale_numerical] = scaler.fit_transform(X_train[scale_numerical])
X_val[scale_numerical] = scaler.transform(X_val[scale_numerical])
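
A common convention is to learn the scaling statistics on the training split only and reuse them on the validation split, so that both splits are expressed on the same scale. A minimal sketch on made-up data (all names below are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(37)
train_toy = rng.normal(loc=10.0, scale=2.0, size=(100, 2))  # hypothetical training data
val_toy = rng.normal(loc=10.0, scale=2.0, size=(25, 2))     # hypothetical validation data

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train_toy)  # learns mean and std from the training split
val_scaled = toy_scaler.transform(val_toy)          # reuses the training mean and std

print(train_scaled.mean(axis=0).round(6))
print(val_scaled.mean(axis=0).round(3))
```

The validation means come out close to, but not exactly, zero, which is expected: the validation split is scaled with statistics it did not contribute to.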


<a id='6.2'></a>
## <font color=green> 6.2. Single Tree </font>

Let's start with a basic tree regressor to find the best features in the dataset.

The function below renders a fitted tree as an image.

In [None]:
def plot_tree(tree, feature_names):
    dot_data = StringIO()
    export_graphviz(tree, out_file=dot_data, feature_names=feature_names, filled=True, rounded=True,special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
    return Image(graph.create_png())

In [None]:
single_tree = DecisionTreeRegressor(random_state=37)
model_tree = single_tree.fit(X_train, y_train)
print("MSE = {0:.4f}".format(-np.mean(cross_val_score(single_tree, X_train, y_train, scoring='neg_mean_squared_error'))))

In [None]:
plot_tree(model_tree, X_train.columns)  # the tree was fitted on X_train, so use its column names

The tree is massive and needs some pruning! Let's see the initial features selected by this algorithm.

In [None]:
plt.figure(figsize=(80,40))
plt.bar(X_train.columns, single_tree.feature_importances_)
plt.title('Feature Importance', fontsize=100);

Let's prune the tree by setting a max_depth, which we will find with the help of GridSearchCV.

In [None]:
param_grid = {'max_depth': range(1,30)}

single_tree_a = GridSearchCV(single_tree,
                            param_grid,
                            scoring='neg_mean_squared_error',
                            cv=5, n_jobs=-1, verbose=1)

single_tree_a.fit(X_train, y_train)

In [None]:
print("Best parameters set found on development set:")
print()
print(single_tree_a.best_params_)
print()
print("Grid scores on development set:")
print()
means = single_tree_a.cv_results_['mean_test_score']
stds = single_tree_a.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, single_tree_a.cv_results_['params']):
    print("MSE = %0.3f (+/-%0.03f) for %r" % (-mean, std * 2, params))

In [None]:
plt.figure(figsize=(10,10))
plt.errorbar(range(1,30,1), [-m for m in means], yerr=stds, fmt='-o')
plt.title('MSE for different Depths', fontsize=20)
plt.xlabel("Depth", fontsize=16)
plt.ylabel("MSE", fontsize=16);

The best depth to select is 12; let's plug it into the regressor.

In [None]:
single_tree_pruned = DecisionTreeRegressor(random_state=18, max_depth=12)
model_tree_pruned = single_tree_pruned.fit(X_train, y_train)

# With no scoring argument, cross_val_score reports R^2 for a regressor (not accuracy)
print("MSE = {0:.4f}".format(-np.mean(cross_val_score(single_tree_pruned, X_train, y_train, scoring='neg_mean_squared_error'))))
print("R2 Train= {0:.4f}".format(np.mean(cross_val_score(single_tree_pruned, X_train, y_train))))
print("R2 Validation= {0:.4f}".format(np.mean(cross_val_score(single_tree_pruned, X_val, y_val))))

The scores are not so good; let's try the classifier.
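
Part of the difference may come from the metric itself: with no `scoring` argument, `cross_val_score` falls back on the estimator's own `score` method, which is R² for regressors and accuracy for classifiers. A small sketch on random toy data (hypothetical, not the project's data) makes the distinction explicit:

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(18)
X_toy = rng.normal(size=(200, 3))        # hypothetical features
y_toy = rng.integers(1, 8, size=200)     # hypothetical labels 1..7

reg = DecisionTreeRegressor(max_depth=3, random_state=18).fit(X_toy, y_toy)
clf = DecisionTreeClassifier(max_depth=3, random_state=18).fit(X_toy, y_toy)

# Regressor: .score() is R^2, computed on continuous predictions
reg_score = reg.score(X_toy, y_toy)
# Classifier: .score() is the share of exactly matching labels
clf_score = clf.score(X_toy, y_toy)

print("Regressor score (R^2):", round(reg_score, 4))
print("Classifier score (accuracy):", round(clf_score, 4))
```

The two numbers are therefore not directly comparable, even when both are printed under the same label.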

In [None]:
single_tree_pruned = DecisionTreeClassifier(random_state=18, max_depth=12)

print("MSE = {0:.4f}".format(-np.mean(cross_val_score(single_tree_pruned, X_train, y_train, scoring='neg_mean_squared_error'))))
print("Accuracy Train= {0:.4f}".format(np.mean(cross_val_score(single_tree_pruned, X_train, y_train))))
print("Accuracy Test= {0:.4f}".format(np.mean(cross_val_score(single_tree_pruned, X_val, y_val))))

In [None]:
plt.figure(figsize=(80,40))
plt.bar(X_train.columns, model_tree_pruned.feature_importances_)
plt.title('Feature Importance', fontsize=100);

<a id='6.3'></a>
## <font color=green> 6.3. Bagging </font>

In [None]:
tree_bagging = RandomForestRegressor(random_state=37, n_jobs=-1, max_features=len(X_train.columns))
print("MSE = {0:.4f}".format(-np.mean(cross_val_score(tree_bagging, X_train, y_train, scoring='neg_mean_squared_error'))))
# With no scoring argument, cross_val_score reports R^2 for a regressor (not accuracy)
print("R2 Train= {0:.4f}".format(np.mean(cross_val_score(tree_bagging, X_train, y_train))))
print("R2 Validation= {0:.4f}".format(np.mean(cross_val_score(tree_bagging, X_val, y_val))))

In [None]:
plt.figure(figsize=(100,60))
tree_bagging.fit(X_train,y_train)
plt.bar(X_train.columns, tree_bagging.feature_importances_)
plt.title('Feature Importance', fontsize=100);

The image must be enlarged to see the most relevant features; it is best to download it and open it in a separate image viewer.

<a id='6.3'></a>
## <font color=green> 6.3. Random Forest </font>

In [None]:
random_forest = RandomForestRegressor(random_state=37, max_features='sqrt')
forest_fit = random_forest.fit(X_train,y_train)
print("MSE = {0:.4f}".format(-np.mean(cross_val_score(random_forest, X_train, y_train, scoring='neg_mean_squared_error'))))
# With no scoring argument, cross_val_score reports R^2 for a regressor (not accuracy)
print("R2 Train= {0:.4f}".format(np.mean(cross_val_score(random_forest, X_train, y_train))))
print("R2 Validation= {0:.4f}".format(np.mean(cross_val_score(random_forest, X_val, y_val))))

In [None]:
plt.figure(figsize=(100,60))
plt.bar(X_train.columns, random_forest.feature_importances_)
plt.title('Feature Importance', fontsize=100);

It is almost a scaled version of the bagging result.

<a id='6.4'></a>
## <font color=green> 6.4. Extra Trees </font>

Now we run feature importance strategies, starting with extra trees.


In [None]:
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) 
 
#plot the graph of feature importances 
fig, ax = plt.subplots(figsize=(10, 10))
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
ax = feat_importances.nlargest(40).plot(kind='barh')
plt.show()

Resulting features seem to be reasonable to use, elevation being the most important one

In [None]:
pd.set_option('display.max_columns', None)
data_train.head()

<a id='6.4.1'></a>
## <font color=green> 6.4.1 Number of features to select </font>


We will start with 46 maximum features

In [None]:
num_feats=46

We first fit a linear model to the initial dataset to have a baseline to evaluate the data cleaning and feature engineering impact.

To facilitate the training process, we will use the `sklearn` library <https://scikit-learn.org/stable/index.html> that provides a wrapper for the preprocessing, training, and evaluation of many machine learning algorithms. 

In [None]:
initial_lm_mod = linear_model.LogisticRegression(multi_class='multinomial',solver='saga',
   max_iter=1000, penalty='none',n_jobs=-1)

#initial_lm_mod = RandomForestRegressor(n_estimators=150)
baseline_acc = np.mean(
    cross_val_score(initial_lm_mod, X_train,y_train, cv=5))

print(f"Baseline model with Accuracy = {baseline_acc:.4}")

In [None]:
def get_feature_importance(clf, feature_names):
    """
    Function to print the most important features of a logreg classifier
    based on the coefficient values
    """
    return pd.DataFrame(
        {
            'variable': feature_names, # Feature names
            'coefficient': clf.coef_[0] # Feature Coeficients
        }
    ) \
    .round(decimals=2) \
    .sort_values('coefficient', ascending=False) \
    .style.bar(color=['red', 'green'], align='zero')

In [None]:
get_feature_importance(
    initial_lm_mod.fit(X_train,y_train), 
    X_train.columns.get_level_values(0).tolist()
)

We get another view of feature importance

<a id='6.4.2'></a>
## <font color=green> Embedded Method </font>
## <font color=green> 6.4.2 Lasso Regularization </font>

When applying regularization to a machine learning model, we add a penalty on the model parameters so that the model does not fit the input data too closely. This makes the model less complex and helps avoid overfitting (learning not just the key characteristics of the data but also its intrinsic noise).
One possible regularization method is Lasso (L1) regression. With Lasso, the coefficients of input features that do not contribute to the model are shrunk, and some features may be discarded automatically by having their coefficients set exactly to zero.
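
The shrinking effect can be sketched on made-up data with one informative feature and nine pure-noise features. Everything below (the toy data, `count_zero_coefs`, the two `C` values) is an illustrative assumption; in scikit-learn, a smaller `C` means stronger regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
y_toy = rng.integers(0, 2, size=n)
informative = (2 * y_toy - 1) + rng.normal(scale=1.0, size=n)  # related to the target
noise = rng.normal(size=(n, 9))                                # unrelated to the target
X_toy = np.column_stack([informative, noise])

def count_zero_coefs(C):
    """Fit an L1-penalized logistic regression and count coefficients that are exactly 0."""
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    model.fit(X_toy, y_toy)
    return int(np.sum(model.coef_ == 0))

zeros_strong = count_zero_coefs(0.01)  # strong regularization
zeros_weak = count_zero_coefs(10.0)    # weak regularization

print("Zero coefficients with C=0.01:", zeros_strong)
print("Zero coefficients with C=10:  ", zeros_weak)
```

Features whose coefficient is driven to zero are effectively removed from the model, which is what makes L1 regularization usable as a feature selector.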

In [None]:
lasso_mod = linear_model.LogisticRegression(penalty='l1', solver='liblinear')
print("Accuracy = {:.4}".format(np.mean(cross_val_score(lasso_mod, X_train, y_train, cv=5))))

In [None]:
get_feature_importance(lasso_mod.fit(X_train,y_train), X_train.columns.get_level_values(0).tolist())

Similar performance w.r.t the un-regularized models. However, you can see how the feature coefficients are smaller than the original ones, due to the regularization.

Both methods give quite a lot of importance to elevation related variables


Let's look at how the coefficient weights and accuracy scores change with different regularization values.
To that end, the code below defines a list of regularization values to test and trains a new Logistic Regression model for each of them. We keep track of the coefficient values and the accuracy of each model and plot them against the regularization parameter.

In [None]:
#Watch out, takes 10 min to run on an i7 processor
lasso_mod = linear_model.LogisticRegression(penalty='l1',solver='liblinear')
alphas = 10**np.linspace(-1,-4,100)

coefs_ = []
scores_ = []
for a in alphas:
    lasso_mod.set_params(C=a)
    scores_.append(np.mean(cross_val_score(lasso_mod, X_train, y_train, cv=5))) # Appends the accuracy of the model
    coefs_.append(lasso_mod.fit(X_train, y_train).coef_.ravel().copy()) # Appends the coefficient of the model

coefs_ = np.array(coefs_)
scores_ = np.array(scores_)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,10))
fig.suptitle('Logistic Regression Path', fontsize=20)

# Coeff Weights Plot
ax1.plot(alphas, coefs_, marker='o')
ax1.set_xscale('log')  # the C values span several orders of magnitude
ax1.set_ylabel('Coefficient Weights', fontsize = 15)
ax1.set_xlabel('C (log scale)', fontsize = 15)
ax1.axis('tight')

# Accuracy Plot
ax2.plot(alphas, scores_, marker='o')
ax2.set_xscale('log')
ax2.set_ylabel('Accuracy Score', fontsize = 15)
ax2.set_xlabel('C (log scale)', fontsize = 15)
ax2.axis('tight')

plt.show()


As the left figure shows, the smaller the alpha value (passed to the model as `C`), the stronger the regularization and, consequently, the smaller the coefficient weights. This is because, according to the sklearn documentation, this value is the "Inverse of regularization strength."

When regularization is strong enough (i.e., alpha is small), the coefficients are close to 0 (i.e., the null model).

Because there is a trade-off between variance (more regularization gives a less over-fitted model) and bias (less regularization lets the model learn more from the training set), we must find the optimal alpha value.

As the right figure shows, this value is achieved with small alpha values (i.e., more regularization).

To automate the search for the optimal value, you can use the LogisticRegressionCV class in sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html), which performs CV, tests different hyperparameters (that you can provide), and selects the optimal one.
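
A minimal sketch of that approach, assuming synthetic data from `make_classification` and a hand-picked candidate list (`candidate_Cs` is a hypothetical name, and the values are not tuned for this project):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary problem (hypothetical stand-in for the real data)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=37
)

candidate_Cs = [0.001, 0.01, 0.1, 1.0]  # regularization values to try
lasso_cv = LogisticRegressionCV(
    Cs=candidate_Cs, cv=5, penalty='l1', solver='liblinear', scoring='accuracy'
)
lasso_cv.fit(X_toy, y_toy)

# C_ holds the selected value; np.ravel copes with scalar or array form
best_C = float(np.ravel(lasso_cv.C_)[0])
print("Selected C:", best_C)
```

This replaces the manual loop over regularization values with a single fit, at the cost of less control over what is recorded along the way.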

In [None]:
embeded_lr_selector = SelectFromModel(LogisticRegression(C=0.04,penalty='l1',solver='liblinear',max_iter=1000,n_jobs=-1), max_features=num_feats)
embeded_lr_selector.fit(X_train, y_train)

In [None]:
embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X_train.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')

In [None]:
embeded_lr_feature

<a id='6.4.3'></a>
## <font color=green> 6.4.3 Filter Method </font>

<a id='6.4.3.1'></a>
### <font color=green> 6.4.3.1 Anova F-value </font>
Chi-square does not work here because it requires non-negative values, so we use ANOVA instead. It is a univariate filter method that uses variance to measure how well each individual feature separates the classes, and it applies to multi-class targets. The SelectKBest class scores the features using the function f_classif and then "removes all but the k highest scoring features".

__f_classif__ =  ANOVA F-value between label/feature for classification tasks.
    
__k__ is the prior selected numbers of chosen features we want to select for further selection.
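
As a small illustration of how `SelectKBest` with `f_classif` behaves, the sketch below builds a toy dataset (hypothetical) in which only the first column separates the three classes. The selector has to be fitted before `get_support()` can be called.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2], 50)                         # three classes, 50 samples each
separating = y_toy + rng.normal(scale=0.1, size=150)     # class means clearly differ
noise = rng.normal(size=(150, 3))                        # same distribution in every class
X_toy = np.column_stack([separating, noise])

toy_selector = SelectKBest(f_classif, k=1)
toy_selector.fit(X_toy, y_toy)   # scores are only available after fitting

print(toy_selector.scores_.round(1))   # F-value per feature
print(toy_selector.get_support())      # boolean mask of kept features
```

The F-value of the first column is far larger than the others because its between-class variance dwarfs its within-class variance.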
    

In [None]:
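#Code from class Forum, select the best features 
anov_selector = SelectKBest(f_classif, k=num_feats) 
anov_selector.fit(X_train, y_train)  # the selector must be fitted before get_support() can be called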

In [None]:
anov_support = anov_selector.get_support()
# Get  columns from original dataframe
anov_feature = X_train.iloc[:,anov_support].columns.tolist()
print(str(len(anov_feature)), 'selected features')


<a id='6.4.3.2'></a>
### <font color=green> 6.4.3.2. Pearson correlation </font>

We get another set of features, this time giving more importance to the alpine climate

In [None]:
#X, y = data_train['data'], data_train['Cover_Type']

# Create a list of the feature names
#features = np.array(data['feature_names'])
fig, ax = plt.subplots(figsize=(10,40))         # Sample figsize in inches
# Instantiate the visualizer
visualizer = FeatureCorrelation(labels=None,sort=True)

ax = visualizer.fit(X_train, y_train)        # Fit the data to the visualizer
visualizer.show()           # Finalize and render the figure

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
corr = data_train.corrwith(data_train["Cover_Type"])
X_y = data_train.copy()
X_y['Cover_Type'] = y_train

In [None]:
fig, ax = plt.subplots(figsize=(8,50))         # Sample figsize in inches

corr_matrix = X_y.corr()

# Isolate the column corresponding to `Cover_Type`
corr_target = corr_matrix[['Cover_Type']].drop(labels=['Cover_Type'])
corr_target_sorted = corr_target.sort_values(by = 'Cover_Type')
sns.heatmap(corr_target_sorted, annot=True, fmt='.3', cmap='RdBu_r',ax=ax)
plt.show()

In [None]:
def cor_selector(X, y,num_feats):
    cor_list = []
    feature_name = X.columns.tolist()
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
    # replace NaN with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # feature name
    cor_feature = X.iloc[:,np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    # feature selection? 0 for not select, 1 for select
    cor_support = [True if i in cor_feature else False for i in feature_name]
    return cor_support, cor_feature
cor_support, cor_feature = cor_selector(X_train, y_train,num_feats)
print(str(len(cor_feature)), 'selected features')

In [None]:
cor_feature

<a id='6.5'></a>
## <font color=green> 6.5. Recursive Feature Elimination </font>

The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features. Then, the least important features are pruned from the current set of features. That procedure is repeated recursively on the pruned set until the desired number of features is reached.
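
The elimination loop can be sketched on a toy problem (hypothetical data) with one informative column and four noise columns. With `step=1`, one feature is dropped per round, and `ranking_` records the order of elimination.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
y_toy = rng.integers(0, 2, size=200)
signal = (2 * y_toy - 1) + rng.normal(scale=0.3, size=200)  # strongly tied to the target
noise = rng.normal(size=(200, 4))                           # unrelated to the target
X_toy = np.column_stack([signal, noise])

toy_rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=1, step=1)
toy_rfe.fit(X_toy, y_toy)

print(toy_rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
print(toy_rfe.support_)   # boolean mask of the surviving feature(s)
```

A larger `step` removes several features per round, which is faster but ranks features more coarsely.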

In [None]:
rfe_selector = RFE(estimator=LogisticRegression(max_iter=3000), n_features_to_select=num_feats, step=10, verbose=5)
rfe_selector.fit(X_train, y_train)

In [None]:
rfe_support = rfe_selector.get_support()
rfe_feature = X_train.loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')

In [None]:
rfe_feature

In [None]:
#Watch out, it takes 20 min to run on an i7 processor
cv = StratifiedKFold(5)
visualizer = rfecv(LogisticRegression(max_iter=3000), X=X_train, y=y_train, cv=cv, scoring='accuracy', n_jobs=-1) #uses all processors

<a id='6.6'></a>
## <font color=green> 6.6. Tree-based: SelectFromModel </font>
<a id='6.6.1'></a>
### <font color=green> 6.6.1 RandomForestClassifier </font>
Embedded methods use algorithms that have built-in feature selection. We can also use a random forest to select features based on feature importance. We calculate feature importance using the node impurities in each decision tree. In a random forest, the final feature importance is the average of the feature importances of all the decision trees.
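
One detail of `SelectFromModel` worth knowing: `max_features` is an upper bound that is applied together with the importance `threshold`, so fewer features may be kept. Passing `threshold=-np.inf` selects purely by rank. The sketch below uses made-up data, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(2)
y_toy = rng.integers(0, 3, size=300)
signal = y_toy + rng.normal(scale=0.2, size=300)   # strongly tied to the target
noise = rng.normal(size=(300, 5))                  # unrelated to the target
X_toy = np.column_stack([signal, noise])

forest = RandomForestClassifier(n_estimators=50, random_state=37)

# threshold=-np.inf disables the importance cut-off, so exactly max_features are kept
top2 = SelectFromModel(forest, max_features=2, threshold=-np.inf).fit(X_toy, y_toy)
# with the default threshold, only features above the cut-off count, capped at max_features
above_cutoff = SelectFromModel(forest, max_features=2).fit(X_toy, y_toy)

print(top2.get_support())
print(above_cutoff.get_support())
```

This explains why the number of selected features printed by a selector can be smaller than the requested maximum.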


In [None]:
embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=num_feats)
embeded_rf_selector.fit(X_train, y_train)

In [None]:
embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X_train.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')

In [None]:
embeded_rf_feature

<a id='6.6.2'></a>
### <font color=green> 6.6.2 XGBoost </font>

With XGBoost it is relatively straightforward to retrieve importance scores for each attribute.

Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance. This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared to each other.


In [None]:
# y_train needs to be re-encoded from labels 1..7 to 0..6
le = LabelEncoder()
y_train1 = le.fit_transform(y_train)

model=xgb.XGBClassifier(learning_rate=0.1,n_estimators = 400,max_depth = 3,n_jobs=-1)

embeded_xgb_selector = SelectFromModel(model, max_features=num_feats)
embeded_xgb_selector.fit(X_train, y_train1)
#learning_rate=0.1,n_estimators = 400,max_depth = 3,n_jobs=-1

We did not use cross-validation for XGBoost here: it is quite sensitive to hyperparameters, and since it takes up to 90 minutes to run and is only used for feature selection at this stage, we leave it for Part III.

In [None]:
embeded_xgb_support = embeded_xgb_selector.get_support()
embeded_xgb_feature = X_train.loc[:,embeded_xgb_support].columns.tolist()
print(str(len(embeded_xgb_feature)), 'selected features')

In [None]:
embeded_xgb_feature


<a id='6.6.3'></a>
### <font color=green> 6.6.3 ExtraTreesClassifier </font>

Each decision tree in the Extra Trees forest is constructed from the original training sample. At each test node, each tree is given a random sample of k features from the feature set, from which it must select the best feature to split the data according to some mathematical criterion (typically the Gini index). This random sampling of features leads to multiple de-correlated decision trees.
To perform feature selection with this forest structure, the normalized total reduction in the split criterion (the Gini index, if it is used to build the forest) is computed for each feature during the construction of the forest. This value is called the Gini importance of the feature.
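
The impurity reduction behind the Gini importance can be worked through by hand. The helper names `gini` and `impurity_decrease` below are hypothetical, written only for this illustration; they are not scikit-learn functions.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2, where p_k is the share of class k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Weighted drop in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted_children

parent = np.array([1, 1, 2, 2])
perfect = impurity_decrease(parent, parent[:2], parent[2:])           # children are pure
useless = impurity_decrease(parent, parent[[0, 2]], parent[[1, 3]])   # children as mixed as the parent
print(perfect, useless)  # 0.5 0.0
```

A feature's importance accumulates these decreases over every node where the feature is used to split, which is then normalized across features.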

In [None]:
extra_tree_model = ExtraTreesClassifier()
# Training the model; max_features is passed to SelectFromModel, as for the other selectors
extra_tree_forest_selector = SelectFromModel(ExtraTreesClassifier(criterion='gini', n_jobs=-1), max_features=num_feats)
extra_tree_forest_selector.fit(X_train, y_train)

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
extra_tree_model.fit(X_train,y_train)
extra_tree_model.feature_importances_
#plot graph of feature importances for better visualization
feat_importances = pd.Series(extra_tree_model.feature_importances_, index=X_train.columns)
ax = feat_importances.nlargest(num_feats).plot(kind='barh', colormap = 'rainbow')
plt.show()


In [None]:
extra_tree_forest_support = extra_tree_forest_selector.get_support()
extra_tree_forest_feature = X_train.loc[:,extra_tree_forest_support].columns.tolist()
print(str(len(extra_tree_forest_feature)), 'selected features')

In [None]:
extra_tree_forest_feature

<a id='6.7'></a>
## <font color=green> 6.7. Score of all methods together </font>

In [None]:
feature_name = list(X_train.columns)

In [None]:
pd.set_option('display.max_rows', None)
# put all selection together
feature_selection_df = pd.DataFrame({'Feature':feature_name,'ExtraTree':extra_tree_forest_support, 'Pearson':cor_support, 'RFE':rfe_support,'Anova':anov_support, 'Logistics':embeded_lr_support,
                                    'Random Forest':embeded_rf_support, 'XGB':embeded_xgb_support})
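# count how many methods selected each feature (boolean columns only, the 'Feature' name column is excluded)
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)
# display the top num_feats features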
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(num_feats)

* Logistic Regression
* Ensemble Learning and Random Forest
* Decision Trees
* Random Forest (XGB)

In [None]:
# Keep the features selected by more than 4 of the 7 methods
new_df = feature_selection_df.loc[feature_selection_df['Total'] > 4]

In [None]:
feature_list = new_df['Feature'].to_list()
feature_list
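
The vote-counting logic of this section can be sketched on a tiny table. The feature names, method names, and the `min_votes` cut-off below are hypothetical placeholders.

```python
import pandas as pd

# One boolean column per selection method (hypothetical results)
votes = pd.DataFrame({
    'Feature': ['f_a', 'f_b', 'f_c'],
    'Method_1': [True, True, False],
    'Method_2': [True, False, False],
    'Method_3': [True, True, False],
})

# Sum only the boolean method columns, leaving the name column out
method_cols = ['Method_1', 'Method_2', 'Method_3']
votes['Total'] = votes[method_cols].sum(axis=1)

min_votes = 2  # hypothetical cut-off
kept = votes.loc[votes['Total'] >= min_votes, 'Feature'].tolist()
print(kept)
```

Restricting the sum to the method columns avoids mixing the string column into the arithmetic.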

Since we want to include all the dummy variables for family and climate, we first remove the selected ones from the list and then add the complete sets back. We also remove the complex and soil features for the time being to see how the model performs.

In [None]:
# Drop the soil, type, complex, family, and climate dummies from the selected features
new = [column for column in feature_list if 'Soil' not in column and 'Type' not in column and 'complex' not in column and 'Family' not in column and 'Climate' not in column]
# Add back the complete sets of family and climate dummies
missing = data_train.filter(regex='Family').columns
missing2 = data_train.filter(regex='Climate').columns

new.extend(missing)
new.extend(missing2)
new

In [None]:
embeded_xgb_feature

In [None]:
X_trial=data_train[embeded_xgb_feature]
y_trial = data_train['Cover_Type']

In [None]:
print(X_trial.shape)
y_trial.shape

In [None]:
#Only X_Train replacement
X_trial.to_csv('X_trial.csv')
y_trial.to_csv('y_trial.csv')

In [None]:
X_selected = data_train[feature_list]
y_selected = data_train['Cover_Type']

In [None]:
#Only X_Train replacement
X_selected.to_csv('X_selected.csv')
y_selected.to_csv('y_selected.csv')

In [None]:
X_selected1 = pd.read_csv("X_selected.csv")
y_selected1 = pd.read_csv("y_selected.csv")

In [None]:
print(X_selected.shape)
print(y_selected.shape)

if X_selected.shape[0] != y_selected.shape[0]:
    print("X and y rows are mismatched, check dataset again")

In [None]:
y_selected.shape

Test

In [None]:
forest = RandomForestClassifier(n_estimators=20)
model_forest = forest.fit(X_train_new,y_train_new)

In [None]:
forest.score(X_val_new,y_val_new)

In [None]:
y_pred_test_forest = model_forest.predict(X_val_new)  # predictions of the forest fitted above
print(classification_report(y_val_new, y_pred_test_forest))

Checking Correlation among features 

In [None]:
#Heatmap 
matrix = X_selected.corr()
mask = np.zeros_like(matrix)
mask[np.triu_indices_from(mask)] = True
fig, ax = plt.subplots(figsize=(50,20))
heatmap = sns.heatmap(matrix, center=0, fmt=".3f", square=True, annot=True, linewidth=1.3, mask = mask,vmax=0.9);
plt.show()
