<a class="anchor" id="0"></a>
# Random Forest Classifier 


In a random forest classification, multiple decision trees are created using different random subsets of the data and features. Each decision tree is like an expert, providing its opinion on how to classify the data. Predictions are made by calculating the prediction for each decision tree, then taking the most popular result. (For regression, predictions use an averaging technique instead.)

Random forests are a popular supervised machine learning algorithm. 

* Random forests are for supervised machine learning, where there is a labeled target variable.
* Random forests can be used for solving regression (numeric target variable) and classification (categorical target variable) problems.
* Random forests are an ensemble method, meaning they combine predictions from other models.
* Each of the smaller models in the random forest ensemble is a decision tree.
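The voting step described above can be sketched in a few lines. This is a minimal illustration with made-up votes, not the internals of any library; the helper name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label among the individual tree predictions."""
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label

# Hypothetical votes from five trees for a single person
votes = ['<=50K', '>50K', '<=50K', '<=50K', '>50K']
forest_prediction = majority_vote(votes)
print(forest_prediction)  # <=50K
```

Note that scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees rather than counting hard votes, so the sketch conveys the idea rather than that exact procedure.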

![image.png](attachment:image.png)

<a class="anchor" id="0.1"></a>
## Table of Contents


1.	[The problem statement](#1)
1.	[Import libraries](#2)
1.	[Import dataset](#3)
1.	[Exploratory data analysis](#4)
1.  [Explore categorical variables](#5)
1.  [Explore numerical variables](#6)
1.  [Declare feature vector and target variable](#7)
1.	[Split data into separate training and test set](#8)
1.	[Feature Engineering](#9)
1.  [Feature Scaling](#10)
1.	[Random Forest Classifier with default parameters](#11)
1.	[Random Forest Classifier with 100 Decision Trees](#12)
1.	[Find important features with Random Forest model](#13)
1.	[Visualize feature scores of the features](#14)
1.	[Build the Random Forest model on selected features](#15)
1.	[Confusion matrix](#16)
1.	[Classification report](#17)
1.	[Results and conclusion](#18)


## 1. The problem statement <a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)


In this kernel, the prediction task is to determine whether a person makes over 50K a year. To answer this question, I build a Random Forest classifier with Python and Scikit-Learn.

I have used the **Income classification data set** for this project.


## 2. Import libraries <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)



In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.set(style="whitegrid")

In [None]:
import warnings

warnings.filterwarnings('ignore')

## 3. Import dataset <a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)


In [None]:
data = 'Data/income_evaluation.csv'

df = pd.read_csv(data)

## 4. Exploratory data analysis <a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)


Now, I will explore the data to gain insights about the data. 

### 4.1  View dimensions of dataset <a class="anchor" id="4.1"></a>

In [None]:
# print the shape
print('The shape of the dataset : ', df.shape)

We can see that there are 32561 instances and 15 attributes in the data set.

### 4.2 Preview the dataset <a class="anchor" id="4.2"></a>

In [None]:
df.head()

### 4.3 Rename column names <a class="anchor" id="4.3"></a>

We can see that the column names in the raw file are not in a convenient form. We should give proper names to the columns. I will do it as follows:-

In [None]:
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

df.columns = col_names

df.columns

### 4.4 View summary of dataset <a class="anchor" id="4.4"></a>

In [None]:
df.info()

#### Findings

- We can see that the dataset contains 9 character variables and 6 numerical variables.

- `income` is the target variable.

- `df.info()` reports no missing values in the dataset. I will check this more closely later.

### 4.5 Check the data types of columns <a class="anchor" id="4.5"></a>

- The above `df.info()` command gives us the number of non-null values along with the data types of columns.

- If we simply want to check the data types of the columns, we can use the following command.

In [None]:
df.dtypes

### 4.6 View statistical properties of dataset <a class="anchor" id="4.6"></a>

In [None]:
df.describe()

- The above `df.describe()` command presents statistical properties in vertical form.

- If we want to view the statistical properties in horizontal form, we should run the following command.

In [None]:
df.describe().T

We can see that the above `df.describe().T` command presents statistical properties in horizontal form.

#### Important points to note


- The above command `df.describe()` helps us to view the statistical properties of numerical variables. It excludes character variables.

- If we want to view the statistical properties of character variables, we should run the following command -

        `df.describe(include=['object'])`

- If we want to view the statistical properties of all the variables, we should run the following command -

        `df.describe(include='all')`

In [None]:
df.describe(include='all')

### 4.7 Check for missing values <a class="anchor" id="4.7"></a>


- In Python missing data is represented by two values:

   - **None** : None is a Python singleton object that is often used for missing data in Python code.

   - **NaN** : NaN is an acronym for Not a Number. It is a special floating-point value recognized by all systems   that use the standard IEEE floating-point representation.

- There are different methods in place on how to detect missing values.


#### Pandas isnull() and notnull() functions 

- Pandas offers two functions to test for missing values - **isnull()** and **notnull()**. 

- These are simple functions that return a boolean value indicating whether the passed in argument value is in fact missing data.


Below, I will list some useful commands to deal with missing values.


#### Useful commands to detect missing values 

- **df.isnull()**

The above command checks whether each cell in a dataframe contains missing values or not. If the cell contains missing value, it returns True otherwise it returns False.

- **df.isnull().sum()**

The above command returns total number of missing values in each column in the dataframe.

- **df.isnull().sum().sum()**

It returns total number of missing values in the dataframe.


- **df.isnull().mean()**

It returns the fraction of missing values in each column in the dataframe (multiply by 100 for a percentage).


- **df.isnull().any()**

It checks which columns have null values. It returns True for the columns which have null values and False otherwise.

- **df.isnull().any().any()**

It returns a boolean value indicating whether the dataframe has missing values or not. If dataframe contains missing values it returns TRUE and FALSE otherwise.

- **df.isnull().values.any()**

It returns a single boolean value indicating whether any cell in the dataframe is missing. It gives the same answer as `df.isnull().any().any()`.

- **df.isnull().values.sum()**

It returns the total number of missing values in the dataframe.
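To make the difference between these commands concrete, here is a small sketch on a toy dataframe. The frame `toy` and its values are made up for illustration and are not part of the income dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing cells (values are made up)
toy = pd.DataFrame({
    'age': [39, 50, np.nan, 28],
    'workclass': ['Private', None, 'Private', 'State-gov'],
    'hours': [40, 13, 40, 40],
})

per_column = toy.isnull().sum()                  # missing cells per column
total = int(toy.isnull().sum().sum())            # missing cells in the whole frame
fraction = toy.isnull().mean()                   # fraction (not percent) missing per column
columns_with_na = toy.isnull().any()             # one boolean per column
frame_has_na = bool(toy.isnull().values.any())   # single boolean for the frame

print(total)                   # 2
print(float(fraction['age']))  # 0.25
print(frame_has_na)            # True
```

Both `None` and `NaN` are counted as missing, which is why `age` and `workclass` each report one missing cell.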


In [None]:
# check for missing values

df.isnull().sum()

#### Interpretation

We can see that there are no missing values in the dataset.

### 4.8 Check with ASSERT statement <a class="anchor" id="4.8"></a>


- We must confirm that our dataset has no missing values.

- We can write an **Assert statement** to verify this.

- We can use an assert statement to programmatically check that no missing, unexpected 0 or negative values are present.

- This gives us confidence that our code is running properly.

- **Assert statement** will return nothing if the value being tested is true and will throw an AssertionError if the value is false.

- Asserts

   - assert 1 == 1 (return Nothing if the value is True)

   - assert 1 == 2 (return AssertionError if the value is False)

In [None]:
#assert that there are no missing values in the dataframe

assert pd.notnull(df).all().all()

#### Interpretation

- The above command does not throw any error. Hence, it is confirmed that there are no missing values in the dataset.

- Note that this assert only tests for missing values. It does not test whether the numerical values are greater than or equal to zero; that needs a separate check.
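A check for negative values can be written in the same style. The sketch below uses a made-up frame `toy` as a stand-in for `df`; the idea is to restrict the comparison to numerical columns with `select_dtypes`, since comparing character columns with zero is not meaningful.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'capital_gain': [2174, 0, 0],
    'sex': ['Male', 'Male', 'Female'],
})

# Keep only the numerical columns for the sign check
numeric = toy.select_dtypes(include='number')

# Both asserts pass silently: no missing cells and no negative numbers
assert toy.notnull().all().all(), "missing values found"
assert (numeric >= 0).all().all(), "negative values found"
```

Adding a message after the comma makes a failing assert easier to diagnose.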

### 4.9 Functional approach to EDA <a class="anchor" id="4.9"></a>

- An alternative approach to EDA is to write a function that presents initial EDA of dataset.

- We can write such a function as follows :-

In [None]:
def initial_eda(df):
    if isinstance(df, pd.DataFrame):
        total_na = df.isna().sum().sum()
        print("Dimensions : %d rows, %d columns" % (df.shape[0], df.shape[1]))
        print("Total NA Values : %d " % (total_na))
        print("%38s %10s     %10s %10s" % ("Column Name", "Data Type", "#Distinct", "NA Values"))
        col_name = df.columns
        dtyp = df.dtypes
        uniq = df.nunique()
        na_val = df.isna().sum()
        for i in range(len(df.columns)):
            print("%38s %10s   %10s %10s" % (col_name[i], dtyp.iloc[i], uniq.iloc[i], na_val.iloc[i]))
        
    else:
        print("Expect a DataFrame but got a %15s" % (type(df)))


In [None]:
initial_eda(df)

### Types of variables

- In this section, I segregate the dataset into categorical and numerical variables. 

- There are a mixture of categorical and numerical variables in the dataset. 

- Categorical variables have data type object. Numerical variables have data type int64.

- First of all, I will explore categorical variables.

## 5. Explore Categorical Variables <a class="anchor" id="5"></a>


[Back to Table of Contents](#0.1)

### 5.1 Find categorical variables 

In [None]:
categorical = [var for var in df.columns if df[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :\n\n', categorical)

### 5.2 Preview categorical variables 

In [None]:
df[categorical].head()

### 5.3 Summary of categorical variables 

- There are 9 categorical variables in the dataset.

- The categorical variables are given by `workclass`, `education`, `marital_status`, `occupation`, `relationship`, `race`, `sex`, `native_country` and `income`.

- `income` is the target variable.

### 5.4 Frequency distribution of categorical variables 

Now, we will check the frequency distribution of categorical variables.

In [None]:
for var in categorical: 
    
    print(df[var].value_counts())

### 5.5 Percentage of frequency distribution of values

In [None]:
for var in categorical:
    
    print(df[var].value_counts() / len(df))

#### Comment

- Now, we can see that there are several variables like `workclass`, `occupation` and `native_country` which contain missing values. 

- Generally, the missing values are coded as `NaN` and pandas will detect them with the usual command of `df.isnull().sum()`.

- But, in this case the missing values are coded as `?`. Pandas fails to detect these as missing values because it does not consider `?` as a missing value.

- So, I have to replace `?` with `NaN` so that pandas can detect these missing values.

- I will explore these variables and replace `?` with `NaN`.
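An alternative is to declare the `?` marker when the file is read, so that no replacement is needed afterwards. The sketch below uses three made-up rows in an in-memory file. It assumes that the raw file puts a space after each comma, as the `' ?'` pattern used later in this kernel suggests.

```python
import io
import pandas as pd

# Three made-up rows in the assumed style of the raw file: values follow
# a comma and a space, and missing entries are written as '?'
raw = io.StringIO(
    "age, workclass, occupation\n"
    "39, State-gov, Adm-clerical\n"
    "54, ?, ?\n"
    "25, Private, Sales\n"
)

# skipinitialspace drops the blank after each comma, so '?' matches exactly
toy = pd.read_csv(raw, na_values='?', skipinitialspace=True)

print(toy.isnull().sum())
```

With this approach the column names and labels also lose their leading space, so later comparisons would use `'Private'` rather than `' Private'`.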

### 5.6 Explore the variables 

#### Explore `income` target variable 

In [None]:
# check for missing values

df['income'].isnull().sum()

We can see that there are no missing values in the `income` target variable.

In [None]:
# view number of unique values

df['income'].nunique()

There are 2 unique values in the `income` variable.

In [None]:
# view the unique values

df['income'].unique()

The two unique values are `<=50K` and `>50K`.

In [None]:
# view the frequency distribution of values

df['income'].value_counts()

In [None]:
# view percentage of frequency distribution of values

df['income'].value_counts()/len(df)

In [None]:
# visualize frequency distribution of income variable

f,ax=plt.subplots(1,2,figsize=(18,8))

ax[0] = df['income'].value_counts().plot.pie(explode=[0,0],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Income Share')


# draw the count plot on the second subplot explicitly
ax[1] = sns.countplot(x="income", data=df, palette="Set1", ax=ax[1])
ax[1].set_title("Frequency distribution of income variable")

plt.show()

We can plot the bars horizontally as follows :-

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
ax = sns.countplot(y="income", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable")
plt.show()

#### Visualize `income` wrt `sex` variable

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="income", hue="sex", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable wrt sex")
plt.show()

#### Interpretation


- The plot shows counts rather than earnings. We can see that there are more males than females in both the income categories.

#### Visualize `income` wrt `race`

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.countplot(x="income", hue="race", data=df, palette="Set1")
ax.set_title("Frequency distribution of income variable wrt race")
plt.show()

#### Interpretation


- Again, the plot shows counts rather than earnings. We can see that there are more whites than non-whites in both the income categories.

#### Explore `workclass` variable

In [None]:
# check number of unique labels 

df.workclass.nunique()

In [None]:
# view the unique labels

df.workclass.unique()

In [None]:
# view frequency distribution of values

df.workclass.value_counts()

We can see that there are 1836 values encoded as `?` in workclass variable. I will replace these `?` with `NaN`.

In [None]:
# replace '?' values in workclass variable with `NaN`

df['workclass'] = df['workclass'].replace(' ?', np.nan)

In [None]:
# again check the frequency distribution of values in workclass variable

df.workclass.value_counts()

- Now, we can see that there are no values encoded as `?` in the workclass variable.

- I will adopt similar approach with `occupation` and `native_country` column.

#### Visualize `workclass` variable

In [None]:
f, ax = plt.subplots(figsize=(10, 6))
ax = df.workclass.value_counts().plot(kind="bar", color="green")
ax.set_title("Frequency distribution of workclass variable")
ax.set_xticklabels(df.workclass.value_counts().index, rotation=30)
plt.show()

#### Interpretation


- We can see that there are lot more private workers than other category of workers.

#### Visualize `workclass` variable wrt `income` variable

In [None]:
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="workclass", hue="income", data=df, palette="Set1")
ax.set_title("Frequency distribution of workclass variable wrt income")
ax.legend(loc='upper right')
plt.show()

#### Interpretation


- We can see that more workers make less than or equal to 50K than make above 50K in most of the working categories.

- This trend is most pronounced in the Private `workclass` category.

#### Visualize `workclass` variable wrt `sex` variable

In [None]:
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.countplot(x="workclass", hue="sex", data=df, palette="Set1")
ax.set_title("Frequency distribution of workclass variable wrt sex")
ax.legend(loc='upper right')
plt.show()

#### Interpretation


- We can see that there are more male workers than female workers in all the working categories.

- The trend is most pronounced in the Private sector.

#### Explore `occupation` variable

In [None]:
# check number of unique labels

df.occupation.nunique()

In [None]:
# view unique labels

df.occupation.unique()


In [None]:
# view frequency distribution of values

df.occupation.value_counts()

We can see that there are 1843 values encoded as `?` in occupation variable. I will replace these `?` with `NaN`.

In [None]:
# replace '?' values in occupation variable with `NaN`

df['occupation'] = df['occupation'].replace(' ?', np.nan)


In [None]:
# again check the frequency distribution of values

df.occupation.value_counts()

In [None]:
# visualize frequency distribution of `occupation` variable

f, ax = plt.subplots(figsize=(12, 8))
# fix the bar order so that the bars and their tick labels agree
order = df.occupation.value_counts().index
ax = sns.countplot(x="occupation", data=df, palette="Set1", order=order)
ax.set_title("Frequency distribution of occupation variable")
ax.tick_params(axis='x', labelrotation=30)
plt.show()

#### Explore `native_country` variable

In [None]:
# check number of unique labels

df.native_country.nunique()

In [None]:
# view unique labels 

df.native_country.unique()


In [None]:
# check frequency distribution of values

df.native_country.value_counts()


We can see that there are 583 values encoded as `?` in native_country variable. I will replace these `?` with `NaN`.

In [None]:
# replace '?' values in native_country variable with `NaN`

df['native_country'] = df['native_country'].replace(' ?', np.nan)

In [None]:
# again check the frequency distribution of values

df.native_country.value_counts()

In [None]:
# visualize frequency distribution of `native_country` variable

f, ax = plt.subplots(figsize=(16, 12))
# fix the bar order so that the bars and their tick labels agree
order = df.native_country.value_counts().index
ax = sns.countplot(x="native_country", data=df, palette="Set1", order=order)
ax.set_title("Frequency distribution of native_country variable")
ax.tick_params(axis='x', labelrotation=90)
plt.show()

We can see that `United-States` dominates amongst the `native_country` labels.

### 5.7 Check missing values in categorical variables 

In [None]:
df[categorical].isnull().sum()

Now, we can see that the `workclass`, `occupation` and `native_country` variables contain missing values.

### 5.8 Number of labels: Cardinality 

- The number of labels within a categorical variable is known as **cardinality**. 

- A high number of labels within a variable is known as **high cardinality**. 

- High cardinality may pose some serious problems in the machine learning model. So, I will check for high cardinality.

In [None]:
# check for cardinality in categorical variables

for var in categorical:
    
    print(var, ' contains ', len(df[var].unique()), ' labels')

We can see that native_country column contains relatively large number of labels as compared to other columns. I will check for cardinality after train-test split.

## 6. Explore Numerical Variables <a class="anchor" id="6"></a>


[Back to Table of Contents](#0.1)

### 6.1  Find numerical variables 

In [None]:
numerical = [var for var in df.columns if df[var].dtype!='O']

print('There are {} numerical variables\n'.format(len(numerical)))

print('The numerical variables are :\n\n', numerical)

### 6.2 Preview the numerical variables

In [None]:
df[numerical].head()

### 6.3 Summary of numerical variables

- There are 6 numerical variables.

- These are given by `age`, `fnlwgt`, `education_num`,`capital_gain`, `capital_loss` and `hours_per_week`.

- All of the numerical variables are of discrete data type.

### 6.4 Check missing values in numerical variables 

In [None]:
df[numerical].isnull().sum()

We can see that there are no missing values in the numerical variables.

### 6.5 Explore numerical variables

#### Explore `age` variable

In [None]:
df['age'].nunique()

#### View the distribution of `age` variable

In [None]:
f, ax = plt.subplots(figsize=(10,8))
x = df['age']
# histplot replaces the deprecated distplot
ax = sns.histplot(x=x, bins=10, kde=True, stat='density', color='blue')
ax.set_title("Distribution of age variable")
plt.show()

We can see that `age` is slightly positively skewed.

We can use Pandas series object to get an informative axis label as follows :-

In [None]:
f, ax = plt.subplots(figsize=(10,8))
x = df['age']
x = pd.Series(x, name="Age variable")
# histplot replaces the deprecated distplot
ax = sns.histplot(x=x, bins=10, kde=True, stat='density', color='blue')
ax.set_title("Distribution of age variable")
plt.show()

We can shade under the density curve and use a different color as follows:-

In [None]:
f, ax = plt.subplots(figsize=(10,8))
x = df['age']
x = pd.Series(x, name="Age variable")
ax = sns.kdeplot(x=x, fill=True, color='red')
ax.set_title("Distribution of age variable")
plt.show()

#### Detect outliers in `age` variable with boxplot

In [None]:
f, ax = plt.subplots(figsize=(10,8))
x = df['age']
ax = sns.boxplot(x=x)
ax.set_title("Visualize outliers in age variable")
plt.show()

We can see that there are lots of outliers in `age` variable.
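The points drawn beyond the whiskers of a boxplot are, with seaborn's default settings, those lying more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies that rule by hand to a short list of made-up ages; the helper name `iqr_bounds` is hypothetical, and its quartile convention may differ slightly from the one the plotting library uses.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ages, one of them far above the rest
ages = [22, 25, 28, 31, 35, 38, 41, 45, 90]
lower, upper = iqr_bounds(ages)
outliers = [a for a in ages if a < lower or a > upper]
print(lower, upper)  # 8.5 60.5
print(outliers)      # [90]
```

Values flagged this way are unusual relative to the bulk of the data, but they are not necessarily errors.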

#### Explore relationship between `age` and `income` variables

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.boxplot(x="income", y="age", data=df)
ax.set_title("Visualize income wrt age variable")
plt.show()

#### Interpretation

- As expected, younger people make less money as compared to senior people.

#### Visualize `income` wrt `age` and `sex` variable

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
ax = sns.boxplot(x="income", y="age", hue="sex", data=df)
ax.set_title("Visualize income wrt age and sex variable")
ax.legend(loc='upper right')
plt.show()

In [None]:
# catplot is a figure-level function and creates its own figure
g = sns.catplot(x="income", y="age", col="sex", data=df, kind="box", height=8, aspect=1)
plt.show()

#### Interpretation

- Senior people make more money than younger people.

#### Visualize relationship between `race` and `age`

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(x ='race', y="age", data = df)
plt.title("Visualize age wrt race")
plt.show()

#### Interpretation

- Whites tend to be older than other groups of people.

#### Find out the correlations

In [None]:
# display the correlation matrix of the numerical variables as a colour-coded table

df.corr(numeric_only=True).style.format("{:.4}").background_gradient(cmap=plt.get_cmap('coolwarm'), axis=1)

#### Interpretation

- We can see that there is no strong correlation between variables.

#### Plot pairwise relationships in dataset

In [None]:
sns.pairplot(df)
plt.show()

#### Interpretation

- We can see that `age` and `fnlwgt` are positively skewed.

- The variable `education_num` is negatively skewed while `hours_per_week` is normally distributed.

- There exists weak positive correlation between `capital_gain` and `education_num` (correlation coefficient=0.1226). 

In [None]:
sns.pairplot(df, hue="income")
plt.show()

In [None]:
sns.pairplot(df, hue="sex")
plt.show()

## 7. Declare feature vector and target variable <a class="anchor" id="7"></a>

[Back to Table of Contents](#0.1)

In [None]:
X = df.drop(['income'], axis=1)

y = df['income']

## 8. Split data into separate training and test set <a class="anchor" id="8"></a>

[Back to Table of Contents](#0.1)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)


In [None]:
# check the shape of X_train and X_test

X_train.shape, X_test.shape

## 9. Feature Engineering  <a class="anchor" id="9"></a>


[Back to Table of Contents](#0.1)


- **Feature Engineering** is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. 

- I will carry out feature engineering on different types of variables.

- First, I will display the categorical and numerical variables in training set separately.

### 9.1 Display categorical variables in training set


In [None]:
categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']

categorical

### 9.2 Display numerical variables in training set


In [None]:
numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']

numerical

### 9.3 Engineering missing values in categorical variables

In [None]:
# print fraction of missing values in the categorical variables in training set

X_train[categorical].isnull().mean()

In [None]:
# print categorical variables with missing data

for col in categorical:
    if X_train[col].isnull().mean()>0:
        print(col, (X_train[col].isnull().mean()))

In [None]:
# impute missing categorical variables with most frequent value

# the mode is always taken from X_train so that no test-set information leaks in
for df2 in [X_train, X_test]:
    df2['workclass'] = df2['workclass'].fillna(X_train['workclass'].mode()[0])
    df2['occupation'] = df2['occupation'].fillna(X_train['occupation'].mode()[0])
    df2['native_country'] = df2['native_country'].fillna(X_train['native_country'].mode()[0])

In [None]:
# check missing values in categorical variables in X_train

X_train[categorical].isnull().sum()

In [None]:
# check missing values in categorical variables in X_test

X_test[categorical].isnull().sum()

As a final check, I will check for missing values in X_train and X_test.

In [None]:
# check missing values in X_train

X_train.isnull().sum()

In [None]:
# check missing values in X_test

X_test.isnull().sum()

We can see that there are no missing values in X_train and X_test.

### 9.4 Encode categorical variables


In [None]:
# preview categorical variables in X_train

X_train[categorical].head()

In [None]:
# import category encoders

import category_encoders as ce

In [None]:
# encode categorical variables with one-hot encoding

encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital_status', 'occupation', 'relationship', 
                                 'race', 'sex', 'native_country'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [None]:
X_train.head()

In [None]:
X_train.shape

We can see that from the initial 14 columns, we now have 105 columns in training set.

Similarly, I will take a look at the X_test set.

In [None]:
X_test.head()

In [None]:
X_test.shape
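To show what one-hot encoding does to a single column, here is a small sketch on made-up rows. It uses `pandas.get_dummies` instead of the `category_encoders` encoder used above, so the generated column names differ from the ones in this kernel. The `reindex` step is one possible way to give the test set exactly the training columns.

```python
import pandas as pd

# Made-up training and test rows
train = pd.DataFrame({'sex': ['Male', 'Female', 'Male'], 'age': [39, 50, 38]})
test = pd.DataFrame({'sex': ['Female', 'Other'], 'age': [28, 37]})

# One indicator column per label seen in the training data
train_enc = pd.get_dummies(train, columns=['sex'], dtype=int)

# Encode the test rows, then align them to the training columns:
# labels unseen in training are dropped, absent ones are filled with 0
test_enc = pd.get_dummies(test, columns=['sex'], dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))   # ['age', 'sex_Female', 'sex_Male']
print(test_enc.values.tolist())  # [[28, 1, 0], [37, 0, 0]]
```

The label `Other` does not occur in the training rows, so the second test row ends up with zeros in both indicator columns.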

* We now have training and testing set ready for model building. Before that, we should map all the feature variables onto the same scale. It is called **feature scaling**. We will do it as follows.

## 10. Feature Scaling <a class="anchor" id="10"></a>

[Back to Table of Contents](#0.1)

In [None]:
cols = X_train.columns


In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)


In [None]:
X_train = pd.DataFrame(X_train, columns=cols)

In [None]:
X_test = pd.DataFrame(X_test, columns=cols)
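For reference, with its default settings (which is what the cell above uses, since no arguments are passed) `RobustScaler` centres each feature on its median and divides by its interquartile range:

$$
x' = \frac{x - \operatorname{median}(x)}{Q_3(x) - Q_1(x)}
$$

Here $Q_1$ and $Q_3$ are the 25th and 75th percentiles of the feature, computed on the training set and then reused for the test set. The median and the interquartile range are less affected by extreme values than the mean and the standard deviation.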

We now have X_train dataset ready to be fed into the Random Forest classifier. We will do it as follows.

## 11. Random Forest Classifier model with default parameters <a class="anchor" id="11"></a>


[Back to Table of Contents](#0.1)

In [None]:
# import Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# instantiate the classifier with 10 trees
# (10 was the default n_estimators before scikit-learn 0.22; the default is now 100)
rfc = RandomForestClassifier(n_estimators=10, random_state=0)

# fit the model
rfc.fit(X_train, y_train)

# predict the test set results
y_pred = rfc.predict(X_test)

# check accuracy score
print('Model accuracy score with 10 decision-trees : {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

Here, **y_test** are the true class labels and **y_pred** are the predicted class labels in the test-set.

Here, the model is built with 10 decision-trees (`n_estimators = 10`), which was the scikit-learn default before version 0.22. From version 0.22 onwards the default is 100, so `n_estimators=10` has to be passed explicitly to reproduce this result. Now, I will increase the number of decision-trees and see its effect on accuracy.

## 12. Random Forest Classifier model with 100 Decision Trees  <a class="anchor" id="12"></a>



[Back to Table of Contents](#0.1)

In [None]:
# instantiate the classifier with n_estimators = 100

rfc_100 = RandomForestClassifier(n_estimators=100, random_state=0)



# fit the model to the training set

rfc_100.fit(X_train, y_train)



# Predict on the test set results

y_pred_100 = rfc_100.predict(X_test)



# Check accuracy score 

print('Model accuracy score with 100 decision-trees : {0:0.4f}'. format(accuracy_score(y_test, y_pred_100)))

The model accuracy score with 10 decision-trees is 0.8446 but the same with 100 decision-trees is 0.8521. So, on this dataset accuracy increases slightly with the number of decision-trees. The gain usually levels off as more trees are added.
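The same comparison can be repeated for several forest sizes in a loop. The sketch below runs on synthetic data from `make_classification`, not on the income dataset, so its scores say nothing about the results above. The names `X_demo`, `y_demo` and `scores` are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the income dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Fit one forest per size and record its test accuracy
scores = {}
for n_trees in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(Xd_train, yd_train)
    scores[n_trees] = accuracy_score(yd_test, model.predict(Xd_test))

for n_trees, acc in scores.items():
    print(f'{n_trees:>3} trees: accuracy {acc:.4f}')
```

A single train-test split gives a noisy estimate, so small differences between forest sizes should be read with care.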

## 13. Find important features with Random Forest model <a class="anchor" id="13"></a>


[Back to Table of Contents](#0.1)


Until now, I have used all the features given in the model. Now, I will select only the important features, build the model using these features and see its effect on accuracy. 


First, I will create the Random Forest model as follows:-

In [None]:
# create the classifier with n_estimators = 100

clf = RandomForestClassifier(n_estimators=100, random_state=0)
# fit the model to the training set

clf.fit(X_train, y_train)


Now, I will use the `feature_importances_` attribute of the fitted model to see feature importance scores.

In [None]:
# view the feature scores

feature_scores = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

feature_scores

We can see that the most important feature is `fnlwgt` and least important feature is `native_country_41`.
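Instead of dropping one feature at a time, a cut-off on the cumulative importance can be used to pick a subset. The sketch below is one possible approach and not the method used in this kernel. The scores and feature names are hypothetical, and the 0.85 threshold is an arbitrary choice.

```python
import pandas as pd

# Hypothetical importance scores, already sorted in descending order
scores = pd.Series({
    'feature_a': 0.40,
    'feature_b': 0.30,
    'feature_c': 0.20,
    'feature_d': 0.07,
    'feature_e': 0.03,
})

# Keep the shortest leading run of features whose importances reach 0.85
cumulative = scores.cumsum()
n_keep = int((cumulative < 0.85).sum()) + 1
selected = scores.index[:n_keep].tolist()
print(selected)  # ['feature_a', 'feature_b', 'feature_c']
```

With one-hot encoded data, the importance of a categorical variable is spread over its indicator columns, which is worth keeping in mind when reading such a ranking.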

## 14. Visualize feature scores of the features <a class="anchor" id="14"></a>


[Back to Table of Contents](#0.1)


Now, I will visualize the feature scores with matplotlib and seaborn.

In [None]:
# Creating a seaborn bar plot

f, ax = plt.subplots(figsize=(30, 24))
ax = sns.barplot(x=feature_scores.values, y=feature_scores.index, ax=ax)
ax.set_title("Visualize feature scores of the features")
ax.set_xlabel("Feature importance score")
ax.set_ylabel("Features")
plt.show()


#### Interpretation


- The above plot confirms that the most important feature is `fnlwgt` and least important feature is `native_country_41`.

## 15. Build the Random Forest model on selected features <a class="anchor" id="15"></a>


[Back to Table of Contents](#0.1)


Now, I will drop the least important feature `native_country_41` from the model, rebuild the model and check its effect on accuracy.

In [None]:
# drop the least important feature from X_train and X_test

X_train = X_train.drop(['native_country_41'], axis=1)

X_test = X_test.drop(['native_country_41'], axis=1)


Now, I will build the random forest model again and check accuracy.

In [None]:
# instantiate the classifier with n_estimators = 100

clf = RandomForestClassifier(n_estimators=100, random_state=0)



# fit the model to the training set

clf.fit(X_train, y_train)


# Predict on the test set results

y_pred = clf.predict(X_test)



# Check accuracy score 

print('Model accuracy score with native_country_41 variable removed : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))


#### Interpretation

- I have removed the `native_country_41` variable from the model, rebuilt it and checked its accuracy. 

- The accuracy of the model now comes out to be 0.8544. 

- The accuracy of the model with all the variables taken into account is 0.8521. 

- So, we can see that the model accuracy has improved slightly (from 0.8521 to 0.8544) with the `native_country_41` variable removed from the model.

Now, based on the above analysis, we can conclude that our classification model accuracy is good, at roughly 85%. The model is doing a reasonable job of predicting the class labels.


But, accuracy does not give the underlying distribution of values. Also, it does not tell anything about the type of errors our classifier is making. 
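One way to put an accuracy score in context is to compare it with the accuracy of always predicting the most frequent class, sometimes called the null accuracy. The sketch below uses made-up labels; the helper name `null_accuracy` is hypothetical.

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up test labels: 7 of 10 belong to the majority class
y_demo = ['<=50K'] * 7 + ['>50K'] * 3
print(null_accuracy(y_demo))  # 0.7
```

Applied to `y_test`, the same calculation gives the baseline that the random forest accuracy should beat.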


We have another tool called `Confusion matrix` that comes to our rescue.

## 16. Confusion matrix <a class="anchor" id="16"></a>


[Back to Table of Contents](#0.1)


A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.


Four types of outcomes are possible while evaluating a classification model performance. These four outcomes are described below:-


**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.


**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.


**False Positives (FP)** – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called **Type I error.**



**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. Depending on the application, this can be a serious error. It is called **Type II error.**
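The four definitions above can be turned into a short counting routine. This is a plain-Python sketch on made-up labels that treats `>50K` as the positive class; the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP and FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn

# Made-up labels for six people
actual    = ['>50K', '<=50K', '>50K', '<=50K', '<=50K', '>50K']
predicted = ['>50K', '<=50K', '<=50K', '>50K', '<=50K', '>50K']
print(confusion_counts(actual, predicted, positive='>50K'))  # (2, 2, 1, 1)
```

Which class is called positive is a matter of choice, and swapping it exchanges the roles of TP and TN as well as FP and FN.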



These four outcomes are summarized in a confusion matrix given below.


In [None]:
# Print the Confusion Matrix (rows are actual classes, columns are predicted classes)

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

print('Confusion matrix\n\n', cm)



In [None]:
# visualize confusion matrix with seaborn heatmap

# rows of cm are actual classes and columns are predicted classes, in sorted label order
cm_matrix = pd.DataFrame(data=cm, index=['Actual <=50K', 'Actual >50K'],
                                 columns=['Predicted <=50K', 'Predicted >50K'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')

## 17. Classification Report <a class="anchor" id="17"></a>


[Back to Table of Contents](#0.1)


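**Classification report** is another way to evaluate the classification model performance. It displays the **precision**, **recall**, **f1** and **support** scores for each class. Precision is TP / (TP + FP), recall is TP / (TP + FN), the f1 score is the harmonic mean of precision and recall, and support is the number of actual occurrences of the class in `y_test`.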

We can print a classification report as follows:-

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
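To see how the figures in such a report arise from the confusion matrix counts, the three ratios can be computed by hand. The sketch below uses made-up counts for one class; the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 8 true positives, 2 false positives, 8 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, round(f1, 4))  # 0.8 0.5 0.6154
```

In this example the precision is high but the recall is low, and the F1 score sits between the two, closer to the lower value.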

## 18. Results and Conclusion <a class="anchor" id="18"></a>

[Back to Table of Contents](#0.1)


1.	In this project, I built a Random Forest Classifier to predict whether a person makes over 50K a year. I built two models, one with 10 decision-trees and another one with 100 decision-trees. 
2.	The model accuracy score with `10 decision-trees is 0.8446` but the same with `100 decision-trees is 0.8521`. So, on this dataset accuracy increases slightly with the number of decision-trees in the model.
3.	I have used the Random Forest model to compute feature importance scores, rebuilt the model without the least important feature and checked the effect on accuracy. 
4.	I have removed the `native_country_41` variable from the model, rebuilt it and checked its accuracy. The `accuracy of the model with native_country_41 variable removed is 0.8544`. So, we can see that the model accuracy has improved slightly with the `native_country_41` variable removed from the model.
5.	The confusion matrix and the classification report are further tools to assess model performance. They break the results down by class and by type of error.



[Go to Top](#0)