# Project objective
This project reviews the random forest method and its Python implementation, using the UCI hand-written digits dataset.

Information about the dataset, technical details of the machine learning method used, and the mathematical details of the quantification (evaluation) approaches are provided alongside the code.

# Packages we work with in this notebook
We are going to use the following libraries and packages:

* **numpy**: NumPy is the fundamental package for scientific computing with Python. (http://www.numpy.org/)
* **sklearn**: Scikit-learn is a machine learning library for the Python programming language. (https://scikit-learn.org/stable/)


In [0]:
import numpy as np
import sklearn as sk

# Introduction to the dataset

**Name**: UCI ML digit image data

**Summary**: Images of hand-written digits in UCI ML repository

**Number of features**: 8*8 (64) pixels (features)

**Number of data points (instances)**: 1797

**Dataset accessibility**: The dataset is available as part of the sklearn package.




## Loading the dataset and separating features and labels
The dataset is available as part of the sklearn package. Hence, we do not need to download the data directly from the UCI ML repository.

In [2]:
from sklearn import datasets

# Loading digit images
digits = datasets.load_digits()
# separating feature arrays of pixel values (X) and labels (y) 
input_features = digits.data
output_var = digits.target
# printing number of features (pixels) and data points 
n_samples, n_features = input_features.shape
print("number of samples (data points):", n_samples)
print("number of features:", n_features)

number of samples (data points): 1797
number of features: 64
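
To make the link between the 64 features and the 8*8 images concrete, here is a minimal sketch. It assumes only the `data`, `images` and `target` attributes of the object returned by `load_digits`; the variable name `first_image` is ours.

```python
from sklearn import datasets

digits = datasets.load_digits()

# Each row of digits.data is a flattened 8x8 image (64 pixel intensities).
first_image = digits.data[0].reshape(8, 8)

# digits.images holds the same pixels, already shaped as 8x8 arrays.
print("shape of one image:", first_image.shape)
print("same pixels as digits.images[0]:", (first_image == digits.images[0]).all())
print("label of this image:", digits.target[0])
print("number of distinct classes:", len(set(digits.target.tolist())))
```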


## Splitting the data into training and test sets

If we do not have a separate dataset for validation and/or testing, we need to split the data into training and test sets so that we can check how well the trained model generalizes to unseen data.

**test_size**: Traditionally, 30%-40% of the dataset can be used for the test set. If you split the data into training, validation and test sets, you can use 60%, 20% and 20% of the dataset, respectively.

**Note**: We need the validation and test sets to be big enough to check the generalizability of our model. At the same time, we would like to have as much data as possible in the training set to train a better model.

**random_state**, as the name suggests, initializes the internal random number generator that decides how the data are split into train and test indices. Fixing its value makes the split reproducible.


In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(input_features, output_var, test_size=0.30, random_state=5)
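
The 60%/20%/20% split mentioned above is not used in the rest of this notebook, but it can be sketched by calling `train_test_split` twice. The variable names (`X_tr`, `X_val`, `X_te`, ...) and the reuse of `random_state=5` are our assumptions, chosen only for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)

# First hold out 40% of the data, then cut that part in half:
# roughly 60% training, 20% validation and 20% test (up to rounding).
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.40, random_state=5)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.50, random_state=5)

print("train / validation / test sizes:", len(X_tr), len(X_val), len(X_te))
```

The validation set would then be used for choosing hyperparameters, and the test set only for the final performance estimate.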

## Building the supervised learning model
We want to build a multi-class classification model as the output variable is categorical with 10 classes. Here we build a random forest model.

### Decision tree
A decision tree is built by first finding the feature that splits the data points into the 2 purest possible groups. Each group is then split again using the next best feature to purify the groups further. Although this process can be continued until every group is 100% pure (contains only one class), doing so would probably lower the generalizability of the model. Hence, we usually stop growing the tree before reaching 100% purity.
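
As a minimal sketch of what stopping the tree early looks like in code, the block below compares a fully grown `DecisionTreeClassifier` with one limited by `max_depth`. The value `max_depth=3` is an arbitrary illustrative choice (and almost certainly too small for this dataset); in practice the depth would be tuned on a validation set.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=5)

# Fully grown tree: keeps splitting until it cannot purify the groups further.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Depth-limited tree: growth stops after 3 levels of splits (illustrative value).
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("full", full_tree), ("max_depth=3", shallow_tree)]:
    print(name,
          "| depth:", tree.get_depth(),
          "| train accuracy:", tree.score(X_tr, y_tr),
          "| test accuracy:", tree.score(X_te, y_te))
```

Comparing the train and test accuracies of the two trees shows the two extremes: a tree that fits the training data very closely and a tree that is cut too early to separate 10 classes.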

### Random forest
Decision trees usually have high variance, meaning that their predictions, and therefore their performance, can change considerably when the training data change. To overcome this issue we can rely on the concept of ensemble learning, in which we use the wisdom of the crowd instead of a single classifier. For example, a random forest is an ensemble model that uses multiple decision trees to predict the class of each data point. Here is the process of building a random forest model:

1) Randomly sampling data points with replacement (bootstrapping)

2) Randomly selecting a subset of the features

3) Building a decision tree using the randomly selected data points and features from steps 1 and 2

4) Building multiple decision trees as described in steps 1 to 3

5) Using the majority vote of all the decision trees as the predicted class for a given data point

**Note**: We do not need to write code for these steps, as they are carried out automatically when we use the random forest implementation in Python. Still, we need to know how the method works.
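
Purely to illustrate steps 1 to 5, here is a hand-built sketch of a small forest. It is not how `RandomForestClassifier` is implemented internally. The number of trees (`n_trees = 25`), the seeds and all variable names are our assumptions. We also assume that `max_features="sqrt"` is an acceptable stand-in for step 2: it makes each tree consider a random subset of features at every split rather than once per tree.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=5)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative value
trees = []
for _ in range(n_trees):  # step 4: repeat steps 1 to 3 for every tree
    # Step 1: bootstrap sample (same size as the training set, with replacement).
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Steps 2 and 3: grow a tree that only looks at a random subset of features
    # (here sqrt(64) = 8 of them) each time it searches for a split.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(0, 2**31 - 1)))
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Step 5: majority vote over the predictions of all trees.
votes = np.stack([tree.predict(X_te) for tree in trees])  # shape: (n_trees, n_test)
y_vote = np.array([np.bincount(column, minlength=10).argmax() for column in votes.T])

print("accuracy of the hand-built forest:", np.mean(y_vote == y_te))
print("accuracy of its first tree alone:", trees[0].score(X_te, y_te))
```

Printing the accuracy of a single tree next to the accuracy of the vote gives a feel for how much the ensemble helps on this split.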




In [4]:
from sklearn.ensemble import RandomForestClassifier 

# Create random forest classifier object
rf = RandomForestClassifier()

# Train the model using the training sets
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
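
The printed parameters show `random_state=None`, so the forest is built with different random draws on every run and the scores reported in this notebook can change slightly between runs. A minimal sketch of making the model reproducible, assuming an arbitrary seed of 0 (our choice, not part of the original notebook):

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=5)

# Two forests built with the same seed on the same data.
rf_a = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
rf_b = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("identical predictions:", np.array_equal(rf_a.predict(X_te), rf_b.predict(X_te)))
```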

## Prediction of test (or validation) set
We now use the trained model to predict the labels of the test set, so that we can compare them with y_test.

In [0]:
# Make predictions using the testing set
y_pred = rf.predict(X_test)
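
Besides hard class labels, the fitted forest can also return class probabilities through `predict_proba`. The sketch below refits a forest so that it is self-contained; the seed and the variable names `proba` and `top_class` are our assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=5)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# One row per test image, one column per class (digits 0 to 9).
proba = rf.predict_proba(X_te)
# The predicted label is the class with the highest probability.
top_class = rf.classes_[proba.argmax(axis=1)]

print("shape of the probability array:", proba.shape)
print("matches rf.predict:", np.array_equal(top_class, rf.predict(X_te)))
print("lowest winning probability:", proba.max(axis=1).min())
```

Test images with a low winning probability are the ones the forest is least sure about, which makes them natural candidates for manual inspection.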

## Evaluating performance of the model
We need to assess the performance of the model using its predictions on the test set. We use the F1 score for this model. Here is the definition of the F1 score:


* **F1 score** is the harmonic mean of precision and recall as follows

$${\displaystyle {\text{F1}}={\frac {2}{{\frac {1}{\text{precision}}}+{\frac {1}{\text{recall}}}}}\,} $$

where 

* **precision** is the fraction of true positives (tp) out of all the positive predictions, that is, true positives plus false positives (fp)

$${\displaystyle {\text{precision}}={\frac {tp}{tp+fp}}\,} $$

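* **recall**, also referred to as the true positive rate or sensitivity, is the fraction of true positives out of all the actual positives, that is, true positives plus false negatives (fn)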




$${\displaystyle {\text{recall}}={\frac {tp}{tp+fn}}\,} $$
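
To see how the three formulas fit together, the sketch below computes tp, fp and fn for every class from a confusion matrix and compares the result with scikit-learn. The small label arrays `y_true` and `y_hat` are made up for illustration and are not taken from the digits data.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for a 3-class problem.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_hat)
tp = np.diag(cm)            # correctly predicted samples of each class
fp = cm.sum(axis=0) - tp    # predicted as the class, but belonging to another one
fn = cm.sum(axis=1) - tp    # belonging to the class, but predicted as another one

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 / (1 / precision + 1 / recall)

print("precision:", precision)
print("recall:   ", recall)
print("F1:       ", f1)
print("agrees with sklearn:",
      np.allclose(f1, metrics.f1_score(y_true, y_hat, average=None)))
```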





In [6]:
from sklearn import metrics

print("F1 score of the predictions:", metrics.f1_score(y_test, y_pred, average=None))

F1 score of the predictions: [0.98275862 0.95412844 0.99130435 0.95575221 0.97674419 0.97674419
 0.97826087 0.98305085 0.92156863 0.96      ]


**Note**: We cannot use the default value of the "average" parameter of metrics.f1_score, which is "binary", because it is designed for binary classification while we are dealing with multi-class classification here. With average=None, the function returns one F1 score per class.
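
If a single number is preferred over one score per class, the same function accepts other values of `average`. A minimal sketch on made-up labels (the arrays `y_true` and `y_hat` are hypothetical), showing how the "macro", "weighted" and "micro" options relate to the per-class scores:

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for a 3-class problem.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

per_class = metrics.f1_score(y_true, y_hat, average=None)
macro = metrics.f1_score(y_true, y_hat, average="macro")        # plain mean over classes
weighted = metrics.f1_score(y_true, y_hat, average="weighted")  # mean weighted by class size
micro = metrics.f1_score(y_true, y_hat, average="micro")        # computed from global tp, fp, fn

print("per class:", per_class)
print("macro:", macro, "| mean of per-class scores:", per_class.mean())
print("weighted:", weighted)
print("micro:", micro, "| accuracy:", metrics.accuracy_score(y_true, y_hat))
```

For single-label multi-class problems such as this one, the micro-averaged F1 score coincides with accuracy, so the per-class or macro scores are usually more informative.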

### Interpretation of results
As we can see, in the run shown here we achieve an F1 score above 0.92 for all the classes (digits 0 to 9). However, there is still a gap between class 0 (images of digit 0, F1 of about 0.98) and classes 8 and 9 (images of digits 8 and 9, F1 of about 0.92 and 0.96). Hence, we need to figure out whether the lower F1 scores of classes 8 and 9 are due to lower precision, lower recall, or both. As we can see in the following results, the precision of class 8 (about 0.90) is lower than its recall (0.94), while the precision and recall of class 9 are the same (0.96). Interestingly, there are classes whose precision is higher than their recall (such as classes 2, 3 and 6), while the reverse is true for some other classes (such as classes 1, 5 and 8). Because the classifier is created without a fixed random_state, these numbers can change slightly from run to run.

In [7]:
print("precision of the predictions:", metrics.precision_score(y_test, y_pred, average=None))
print("recall of the predictions:", metrics.recall_score(y_test, y_pred, average=None))

precision of the predictions: [0.98275862 0.9122807  1.         1.         0.97674419 0.96923077
 1.         0.98305085 0.90384615 0.96      ]
recall of the predictions: [0.98275862 1.         0.98275862 0.91525424 0.97674419 0.984375
 0.95744681 0.98305085 0.94       0.96      ]
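
Precision and recall tell us how often a class is involved in errors, but not which digits are mixed up with each other. A confusion matrix can fill that gap. The sketch below refits a forest so that it is self-contained; the seed is our assumption, so the counts will not exactly match the run shown above.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=5)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_hat = rf.predict(X_te)

# Rows are true digits, columns are predicted digits.
cm = metrics.confusion_matrix(y_te, y_hat, labels=np.arange(10))
print(cm)

# Zero out the diagonal to keep only the mistakes, then find the largest entry.
errors = cm.copy()
np.fill_diagonal(errors, 0)
true_digit, predicted_digit = np.unravel_index(errors.argmax(), errors.shape)
print("most frequent confusion: true digit", true_digit,
      "predicted as", predicted_digit, "-", errors.max(), "time(s)")
```

Off-diagonal counts in the column of a class lower its precision, while off-diagonal counts in its row lower its recall.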
