<a class="anchor" id="0"></a>
# Random Forest Classifier with Feature Importance


**Lab Report Assignment, Session 9**

## 1. Task Description <a class="anchor" id="1"></a>

Please find another dataset that includes a **Time Series**. Do the same thing as what we tried in the 9th simulation.

Bonus: If you can visualize with a Decision Tree, please put it at the bottom after the Classification Report (Last Part).

*Note: This file is the template for your 9th simulation. Because some of the code is still missing, you can refer to your simulation worksheet to help complete the code in this notebook. You can also edit or add code to fit your dataset, but do not rename or delete the sub-chapters (this is prohibited).*



## 2. Import libraries <a class="anchor" id="2"></a>


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
!pip install category-encoders

sns.set(style="whitegrid")

## 3. Import dataset <a class="anchor" id="3"></a>


In [None]:
data = ''

df = pd.read_csv(data)  # assumes a CSV file; use the pd.read_* function that matches your file format

## 4. Exploratory data analysis <a class="anchor" id="4"></a>

### 4.1  View dimensions of dataset <a class="anchor" id="4.1"></a>

In [None]:
# print the shape
print('The shape of the dataset : ', df.shape)

### 4.2 Preview the dataset <a class="anchor" id="4.2"></a>

In [None]:
df.head()

### 4.3 Rename column names <a class="anchor" id="4.3"></a>

We can see that the dataset does not have proper column names. The column names contain underscores. We should give proper names to the columns. We will do it as follows:-

In [None]:
col_names = []

df.columns = col_names

df.columns

### 4.4 View summary of dataset <a class="anchor" id="4.4"></a>

In [None]:
df.info()

### 4.5 Check the data types of columns <a class="anchor" id="4.5"></a>

- The above `df.info()` command gives us the number of filled values along with the data types of columns.

- If we simply want to check the data type of each column, we can use the following command.

In [None]:
df.dtypes

### 4.6 View statistical properties of dataset <a class="anchor" id="4.6"></a>

In [None]:
df.describe()

- The above `df.describe()` command presents statistical properties in vertical form.

- If we want to view the statistical properties in horizontal form, we should run the following command.

In [None]:
df.describe().T

We can see that the above `df.describe().T` command presents statistical properties in horizontal form.

#### Important points to note


- The above command `df.describe()` helps us to view the statistical properties of numerical variables. It excludes character variables.

- If we want to view the statistical properties of character variables, we should run the following command -

        `df.describe(include=['object'])`

- If we want to view the statistical properties of all the variables, we should run the following command -

        `df.describe(include='all')`

In [None]:
df.describe(include='all')

### 4.7 Check for missing values <a class="anchor" id="4.7"></a>


- In Python missing data is represented by two values:

   - **None** : None is a Python singleton object that is often used for missing data in Python code.

   - **NaN** : NaN is an acronym for Not a Number. It is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.

- There are different methods in place on how to detect missing values.


#### Pandas isnull() and notnull() functions 

- Pandas offers two methods to test for missing values - **isnull()** and **notnull()**. 

- Both return a boolean mask with the same shape as the input, marking whether each value is missing (**isnull()**) or present (**notnull()**).


Below, we will list some useful commands to deal with missing values.


#### Useful commands to detect missing values 

- **df.isnull()**

The above command checks each cell in a dataframe for a missing value. It returns True where the cell is missing and False otherwise.

- **df.isnull().sum()**

The above command returns the total number of missing values in each column of the dataframe.

- **df.isnull().sum().sum()**

It returns the total number of missing values in the dataframe.


- **df.isnull().mean()**

It returns the fraction of missing values in each column of the dataframe (multiply by 100 for a percentage).


- **df.isnull().any()**

It checks which columns have null values. It returns True for each column that contains null values and False otherwise.

- **df.isnull().any().any()**

It returns a single boolean indicating whether the dataframe has any missing values: True if it does, False otherwise.

- **df.isnull().values.any()**

It also returns a single boolean for the whole dataframe, computed on the underlying NumPy array: True if any cell is missing, False otherwise.

- **df.isnull().values.sum()**

It returns the total number of missing values in the dataframe.
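The commands listed above can be tried on a small example. The following is a minimal sketch; the dataframe `toy` and its column names are made up for illustration and are not part of the assignment dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing cells
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", None, "z", None],
    "c": [10, 20, 30, 40],
})

missing_mask = toy.isnull()                      # True where a cell is missing
per_column = missing_mask.sum()                  # missing count per column
total_missing = int(missing_mask.sum().sum())    # missing count for the whole frame
fraction_missing = missing_mask.mean()           # fraction missing per column
columns_with_missing = missing_mask.any()        # True for columns with any missing value
has_any_missing = bool(missing_mask.values.any())  # single True/False for the whole frame

print(per_column)
print(fraction_missing)
```

Here column `b` has two of its four cells missing, so its fraction is 0.5, while column `c` has none.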


In [None]:
# check for missing values
df.isnull().sum()

### 4.8 Functional approach to EDA <a class="anchor" id="4.9"></a>

- An alternative approach to EDA is to write a function that presents initial EDA of dataset.

- We can write such a function as follows :-

In [None]:
def initial_eda(df):
    if isinstance(df, pd.DataFrame):
        total_na = df.isna().sum().sum()
        print("Dimensions : %d rows, %d columns" % (df.shape[0], df.shape[1]))
        print("Total NA Values : %d " % (total_na))
        print("%38s %10s     %10s %10s" % ("Column Name", "Data Type", "#Distinct", "NA Values"))
        col_name = df.columns
        dtyp = df.dtypes
        uniq = df.nunique()
        na_val = df.isna().sum()
        for i in range(len(df.columns)):
            print("%38s %10s   %10s %10s" % (col_name[i], dtyp.iloc[i], uniq.iloc[i], na_val.iloc[i]))
        
    else:
        print("Expect a DataFrame but got a %15s" % (type(df)))


In [None]:
initial_eda(df)

### Types of variables

- In this section, we segregate the dataset into categorical and numerical variables. 

- There is a mixture of categorical and numerical variables in the dataset. 

- Categorical variables have data type object. Numerical variables have a numeric data type such as int64 or float64.

- First of all, we will explore the categorical variables.
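As an alternative to comparing each column's dtype by hand, pandas can select columns by dtype family. The sketch below is only an illustration on a made-up dataframe (`toy` and its column names are assumptions); note that a date stored as text lands in the non-numeric group until it is parsed.

```python
import pandas as pd

# Hypothetical toy data: two text columns, one float column, one integer column
toy = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "station": ["A", "B", "A"],
    "rainfall_mm": [0.0, 12.5, 3.2],
    "visitors": [120, 85, 97],
})

# "number" covers every integer and float dtype, not only int64
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numerical  :", numerical_cols)
print("categorical:", categorical_cols)
```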

## 5. Explore Categorical Variables <a class="anchor" id="5"></a>

### 5.1 Find categorical variables 

In [None]:
categorical = [var for var in df.columns if df[var].dtype == 'O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :\n\n', categorical)

### 5.2 Preview categorical variables 

In [None]:
df[categorical].head()

### 5.3 Frequency distribution of categorical variables 

Now, we will check the frequency distribution of categorical variables.

In [None]:
for var in categorical: 
    
    print(df[var].value_counts())

### 5.4 Percentage of frequency distribution of values

In [None]:
for var in categorical:
    
     print(df[var].value_counts()/float(len(df)))
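
Relative frequencies can also be obtained directly from `value_counts` without dividing by the length. A minimal sketch on a made-up series (the name `trend` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical categorical series
s = pd.Series(["up", "down", "up", "up"], name="trend")

counts = s.value_counts()                 # absolute frequencies
shares = s.value_counts(normalize=True)   # relative frequencies that sum to 1

print(counts)
print(shares)
```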

### 5.5 Explore the variables 

#### Explore target variable 

In [None]:
# check for missing values

df[''].isnull().sum()

In [None]:
# view number of unique values

df[''].nunique()

In [None]:
# view the unique values

df[''].unique()

In [None]:
# view the frequency distribution of values

df[''].value_counts()

In [None]:
# view percentage of frequency distribution of values

df[''].value_counts()/len(df)

In [None]:
# visualize frequency distribution of ... variable

f,ax=plt.subplots(1,2,figsize=(18,8))

ax[0] = df[''].value_counts().plot.pie(explode=[0,0],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('')


#f, ax = plt.subplots(figsize=(6, 8))
ax[1] = sns.countplot(x="", data=df, palette="Set1")
ax[1].set_title("Frequency distribution of ... variable")

plt.show()

We can plot the bars horizontally as follows :-

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
ax = sns.countplot(y="", data=df, palette="Set1")
ax.set_title("Frequency distribution of ...")
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="", hue="", data=df, palette="Set1")
ax.set_title("Frequency distribution of ...")
plt.show()

#### Explore *`another`* variable

You can repeat this step for any other variable that you think needs to be explored.

In [None]:
# check number of unique labels

df..nunique()

In [None]:
# view unique labels

df..unique()


In [None]:
# view frequency distribution of values

df..value_counts()

In [None]:
# replace ' ?' values in ... variable with `NaN`

df[''] = df[''].replace(' ?', np.nan)


In [None]:
# again check the frequency distribution of values

df..value_counts()

In [None]:
# visualize frequency distribution of `...` variable

f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="", data=df, palette="Set1")
ax.set_title("Frequency distribution of ... variable")
ax.set_xticklabels(df..value_counts().index, rotation=30)
plt.show()

### 5.6 Check missing values in categorical variables 

In [None]:
df[categorical].isnull().sum()

### 5.7 Number of labels: Cardinality 

- The number of labels within a categorical variable is known as **cardinality**. 

- A high number of labels within a variable is known as **high cardinality**. 

- High cardinality can cause problems for a machine learning model; for example, one-hot encoding creates one new column per label. So, we will check for high cardinality.

In [None]:
# check for cardinality in categorical variables

for var in categorical:
    
    print(var, ' contains ', len(df[var].unique()), ' labels')

We can see that the native_country column contains a relatively large number of labels compared to the other columns. We will check cardinality again after the train-test split.

## 6. Explore Numerical Variables <a class="anchor" id="6"></a>

### 6.1  Find numerical variables 

In [None]:
numerical = [var for var in df.columns if df[var].dtype != 'O']

print('There are {} numerical variables\n'.format(len(numerical)))

print('The numerical variables are :\n\n', numerical)

### 6.2 Preview the numerical variables

In [None]:
df[numerical].head()

### 6.3 Check missing values in numerical variables 

In [None]:
df[numerical].isnull().sum()

We can see that there are no missing values in the numerical variables.

### 6.4 Explore numerical variables

In [None]:
# YOUR CODE HERE

## 7. Declare feature vector and target variable <a class="anchor" id="7"></a>

In [None]:
X = df.drop(['...'], axis=1)

y = df['']

## 8. Split data into separate training and test set <a class="anchor" id="8"></a>

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)


In [None]:
# check the shape of X_train and X_test

X_train.shape, X_test.shape

## 9. Feature Engineering  <a class="anchor" id="9"></a>
**Feature Engineering** is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. 

### 9.1 Display categorical variables in training set


In [None]:
categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']

categorical

### 9.2 Display numerical variables in training set


In [None]:
numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']

numerical

### 9.3 Engineering missing values in categorical variables

In [None]:
# print percentage of missing values in the categorical variables in training set

X_train[categorical].isnull().mean()

In [None]:
# print categorical variables with missing data

for col in categorical:
    if X_train[col].isnull().mean()>0:
        print(col, (X_train[col].isnull().mean()))

In [None]:
# impute missing categorical variables with most frequent value

for df2 in [X_train, X_test]:
    # YOUR CODE HERE
    pass  # placeholder so the cell runs until the imputation code is added
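
One possible way to impute with the most frequent value is sketched below on made-up frames. The names `X_train_toy`, `X_test_toy`, and `cat_cols` are assumptions for illustration; the point of the sketch is that the mode is learned from the training set only and then reused for the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames with missing categorical values
X_train_toy = pd.DataFrame({"station": ["A", "A", np.nan, "B"]})
X_test_toy = pd.DataFrame({"station": [np.nan, "B"]})
cat_cols = ["station"]

# Learn the most frequent label per column from the training data only
train_modes = {col: X_train_toy[col].mode()[0] for col in cat_cols}

# Apply the same fill value to both sets
for frame in (X_train_toy, X_test_toy):
    for col in cat_cols:
        frame[col] = frame[col].fillna(train_modes[col])

print(X_train_toy["station"].tolist())
print(X_test_toy["station"].tolist())
```

Using the training-set mode for both sets keeps information from the test set out of the preprocessing step.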

In [None]:
# check missing values in categorical variables in X_train

X_train[categorical].isnull().sum()

In [None]:
# check missing values in categorical variables in X_test

X_test[categorical].isnull().sum()

As a final check, we will check for missing values in X_train and X_test.

In [None]:
# check missing values in X_train

X_train.isnull().sum()

In [None]:
# check missing values in X_test

X_test.isnull().sum()

We can see that there are no missing values in X_train and X_test.

### 9.4 Encode categorical variables


In [None]:
# preview categorical variables in X_train

X_train[categorical].head()

In [None]:
# import category encoders

import category_encoders as ce

In [None]:
# encode categorical variables with one-hot encoding

encoder = ce.OneHotEncoder(cols=[''])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)
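
To see what one-hot encoding does to the column layout, here is a small sketch that uses `pandas.get_dummies` instead of `category_encoders`, so it is a stand-in rather than the encoder used in this notebook. The frames and column names are made up. As with the encoder above, the column set is fixed by the training data and then imposed on the test data.

```python
import pandas as pd

# Hypothetical train/test frames; the test set contains a label ("C") unseen in training
train = pd.DataFrame({"station": ["A", "B", "A"], "rain": [0.0, 1.2, 3.4]})
test = pd.DataFrame({"station": ["B", "C"], "rain": [5.0, 0.1]})

# One new 0/1 column per label found in the training set
train_enc = pd.get_dummies(train, columns=["station"], dtype=int)

# Encode the test set, then align it to the training columns:
# labels missing from the test set become all-zero columns, unseen labels are dropped
test_enc = pd.get_dummies(test, columns=["station"], dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```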

In [None]:
X_train.head()

In [None]:
X_train.shape

In [None]:
X_test.head()

In [None]:
X_test.shape

* We now have the training and testing sets ready for model building. Before that, we should map all the feature variables onto the same scale. This is called **feature scaling**. We will do it as follows.

## 10. Feature Scaling <a class="anchor" id="10"></a>

In [None]:
cols = X_train.columns


In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)


In [None]:
# pass cols directly; wrapping it in a list would create MultiIndex columns
X_train = pd.DataFrame(X_train, columns=cols)

In [None]:
X_test = pd.DataFrame(X_test, columns=cols)

We now have X_train dataset ready to be fed into the Random Forest classifier. We will do it as follows.
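
For reference, `RobustScaler` with its default settings subtracts each column's median and divides by its interquartile range (75th minus 25th percentile), which makes it less sensitive to outliers than scaling by mean and standard deviation. The sketch below checks this on a made-up column containing one outlier; the data and names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical single-feature training data with one large outlier
train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})

scaler = RobustScaler()
scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)

# Manual version of the same transformation
median = train["x"].median()
iqr = train["x"].quantile(0.75) - train["x"].quantile(0.25)
manual = (train["x"] - median) / iqr

print(scaled["x"].tolist())
print(np.allclose(scaled["x"], manual))
```

The median value maps to 0, and the outlier stays large instead of compressing the other values toward each other.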

## 11. Random Forest Classifier model with default parameters <a class="anchor" id="11"></a>

In [None]:
# import Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

# instantiate the classifier with 10 trees
# (n_estimators is set explicitly because the scikit-learn default changed from 10 to 100 in version 0.22)
rfc = RandomForestClassifier(n_estimators=10, random_state=0)

# fit the model
rfc.fit(X_train, y_train)

# Predict the Test set results
y_pred = rfc.predict(X_test)

# Check accuracy score 
from sklearn.metrics import accuracy_score

print('Model accuracy score with 10 decision-trees : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

Here, **y_test** are the true class labels and **y_pred** are the predicted class labels in the test-set.

Here, we have built the Random Forest Classifier model with `n_estimators = 10`, so the model uses 10 decision trees. This was the default in scikit-learn before version 0.22; from 0.22 onward the default is 100, which is why the value is passed explicitly. Now, we will increase the number of decision trees and see its effect on accuracy.

## 12. Random Forest Classifier model with 100 Decision Trees  <a class="anchor" id="12"></a>

In [None]:
# instantiate the classifier with n_estimators = 100
rfc_100 = RandomForestClassifier(n_estimators=100, random_state=0)

# fit the model to the training set
rfc_100.fit(X_train, y_train)

# Predict on the test set results
y_pred_100 = rfc_100.predict(X_test)

# Check accuracy score 
print('Model accuracy score with 100 decision-trees : {0:0.4f}'. format(accuracy_score(y_test, y_pred_100)))

The model accuracy score with 10 decision-trees is `[ANSWER HERE BASED ON YOUR RESULT]` but the same with 100 decision-trees is `[ANSWER HERE BASED ON YOUR RESULT]`. Accuracy usually improves or levels off as more decision trees are added, although this is not guaranteed on every dataset; compare the two scores above to confirm it for your data.
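
The comparison between forest sizes can also be written as a loop. The sketch below runs on synthetic data from `make_classification`, which is an assumption made only so the example is self-contained; the scores it prints say nothing about your dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the assignment dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Train one forest per size and record its test accuracy
scores = {}
for n_trees in (10, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_tr, y_tr)
    scores[n_trees] = accuracy_score(y_te, model.predict(X_te))

for n_trees, acc in scores.items():
    print('Model accuracy score with {} decision-trees : {:0.4f}'.format(n_trees, acc))
```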

## 13. Find important features with Random Forest model <a class="anchor" id="13"></a>

In [None]:
# create the classifier with n_estimators = 100
clf = RandomForestClassifier(n_estimators=100, random_state=0)


# fit the model to the training set
clf.fit(X_train, y_train)

Now, we will use the `feature_importances_` attribute to see the feature importance scores.

In [None]:
# view the feature scores
feature_scores = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

feature_scores

We can see that the most important feature is `[ANSWER HERE BASED ON YOUR RESULT]` and least important feature is `[ANSWER HERE BASED ON YOUR RESULT]`.
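
As a sanity check when reading the scores, the impurity-based importances of a fitted forest are normalized so that they sum to 1. A minimal sketch on synthetic data follows; the data and the feature names `f0` to `f4` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 2 informative features and 3 noise features
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Label each score with its feature name and sort from most to least important
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
print("sum of importances:", importances.sum())
```

The first index entry is the most important feature and the last is the least important, which is how the blanks above can be filled in for your own result.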

## 14. Build the Random Forest model on selected features <a class="anchor" id="14"></a>

Now, we will drop the least important feature `[ANSWER HERE BASED ON YOUR RESULT]` from the model, rebuild the model and check its effect on accuracy.

In [None]:
# drop the least important feature from X_train and X_test

X_train = X_train.drop(['...'], axis=1)
X_test = X_test.drop(['...'], axis=1)

Now, we will build the random forest model again and check accuracy.

In [None]:
# instantiate the classifier with n_estimators = 100
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# fit the model to the training set
clf.fit(X_train, y_train)

# Predict on the test set results
y_pred = clf.predict(X_test)

# Check accuracy score 
print('Model accuracy score with ... variable removed : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

Now, based on the above analysis, if the accuracy score is high, our model is doing a good job of predicting the class labels.


However, accuracy alone does not show how the predictions are distributed across the classes. It also tells us nothing about the types of errors our classifier is making. 


Another tool, the `Confusion matrix`, addresses this.

## 15. Confusion matrix <a class="anchor" id="15"></a>

A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.


Four types of outcomes are possible while evaluating a classification model performance. These four outcomes are described below:-


**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.


**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.


**False Positives (FP)** – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called **Type I error.**



**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. Depending on the application, this can be a very serious error. It is called **Type II error.**



These four outcomes are summarized in a confusion matrix given below.


In [None]:
# Print the Confusion Matrix and slice it into four pieces

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

print('Confusion matrix\n\n', cm)



In [None]:
# visualize confusion matrix with seaborn heatmap

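# scikit-learn puts actual classes in rows and predicted classes in columns,
# with labels sorted in ascending order (0 before 1); rename to match your classes
cm_matrix = pd.DataFrame(data=cm, columns=['Predict Negative:0', 'Predict Positive:1'], 
                                 index=['Actual Negative:0', 'Actual Positive:1'])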

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
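
The four outcomes described in this section can be read off a small hand-made example. The labels below are invented for illustration. For binary 0/1 labels, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, so `ravel()` returns the counts in that order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = positive class, 0 = negative class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN =", tn, "FP =", fp, "FN =", fn, "TP =", tp)
```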

## 16. Classification Report <a class="anchor" id="16"></a>

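**Classification report** is another way to evaluate the classification model performance. It displays the **precision**, **recall**, **f1-score** and **support** for each class. Precision is TP / (TP + FP), recall is TP / (TP + FN), the f1-score is the harmonic mean of precision and recall, and support is the number of actual occurrences of each class in the test set.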

We can print a classification report as follows:-

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
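
To connect the report to the confusion-matrix counts, the scores for the positive class can be computed by hand and compared with scikit-learn's own functions. This is a sketch on invented labels, not on the assignment data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true and predicted labels (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the outcomes for the positive class
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))

precision = tp / (tp + fp)                            # share of predicted positives that are correct
recall = tp / (tp + fn)                               # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```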

## Bonus Section

Decision Tree Visualization

---


In [None]:
# CODE HERE (For bonus section only)
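
One possible starting point for the bonus is to pull a single tree out of a fitted forest and print its rules as text with `sklearn.tree.export_text`. The sketch below uses synthetic data, made-up feature names, and a shallow `max_depth` so the output stays readable; all of these are assumptions. For a graphical plot, `sklearn.tree.plot_tree` accepts the same fitted tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in data (not the assignment dataset)
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

# A small, shallow forest keeps each tree short enough to read
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

# Each fitted forest exposes its individual decision trees in estimators_
first_tree = forest.estimators_[0]
tree_text = export_text(first_tree, feature_names=feature_names)

print(tree_text)
```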