<a href="https://colab.research.google.com/github/Yashithi98/Machine-Learning/blob/main/Random_Forest_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Random Forest Classifier**

A demonstration of the Random Forest machine learning technique applied to a dataset from the UCI Machine Learning Repository.

*- Python Code*

# **Importing Libraries and Packages**

In [None]:
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd  

# **Importing the Dataset from UCI Machine Learning Repository**

**-- Breast Cancer Coimbra Data Set**

There are 9 predictors, all quantitative, and a binary dependent variable indicating the presence or absence of breast cancer.
The predictors are anthropometric data and parameters which can be gathered in routine blood analysis.
Prediction models based on these predictors, if accurate, can potentially be used as a biomarker of breast cancer.

**Attributes :**

1. Age (years)
2. BMI (kg/m2)
3. Glucose (mg/dL)
4. Insulin (µU/mL)
5. HOMA
6. Leptin (ng/mL)
7. Adiponectin (µg/mL)
8. Resistin (ng/mL)
9. MCP-1 (pg/dL)

**Labels :**

- 1 = Healthy controls
- 2 = Patients


Dataset Information : https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra

Dataset Download Link : https://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv

In [None]:
dataset = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00451/dataR2.csv")

In [None]:
dataset

|  | Age | BMI | Glucose | Insulin | HOMA | Leptin | Adiponectin | Resistin | MCP.1 | Classification |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48 | 23.500000 | 70 | 2.707 | 0.467409 | 8.8071 | 9.702400 | 7.99585 | 417.114 | 1 |
| 1 | 83 | 20.690495 | 92 | 3.115 | 0.706897 | 8.8438 | 5.429285 | 4.06405 | 468.786 | 1 |
| 2 | 82 | 23.124670 | 91 | 4.498 | 1.009651 | 17.9393 | 22.432040 | 9.27715 | 554.697 | 1 |
| 3 | 68 | 21.367521 | 77 | 3.226 | 0.612725 | 9.8827 | 7.169560 | 12.76600 | 928.220 | 1 |
| 4 | 86 | 21.111111 | 92 | 3.549 | 0.805386 | 6.6994 | 4.819240 | 10.57635 | 773.920 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 111 | 45 | 26.850000 | 92 | 3.330 | 0.755688 | 54.6800 | 12.100000 | 10.96000 | 268.230 | 2 |
| 112 | 62 | 26.840000 | 100 | 4.530 | 1.117400 | 12.4500 | 21.420000 | 7.32000 | 330.160 | 2 |
| 113 | 65 | 32.050000 | 97 | 5.730 | 1.370998 | 61.4800 | 22.540000 | 10.33000 | 314.050 | 2 |
| 114 | 72 | 25.590000 | 82 | 2.820 | 0.570392 | 24.9600 | 33.750000 | 3.27000 | 392.460 | 2 |


# **Extracting Features (Independent) and Classification (Dependent) Variables**

In [None]:
# Every column except the last is a predictor
features = dataset.iloc[:, :-1].values
# The last column holds the class label (1 or 2)
classification = dataset.iloc[:, -1].values
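
Positional slicing with `iloc` relies on the label being the last column. Below is a minimal sketch of the same extraction by column name, assuming the label column is called `Classification` as in the table printed above. The two-row frame is copied from that output purely for illustration, and the `*_by_name` and `*_by_position` variable names are hypothetical.

```python
import io
import pandas as pd

# Two rows copied from the printed dataset, used here only as a stand-in
sample_csv = io.StringIO(
    "Age,BMI,Glucose,Insulin,HOMA,Leptin,Adiponectin,Resistin,MCP.1,Classification\n"
    "48,23.5,70,2.707,0.467409,8.8071,9.7024,7.99585,417.114,1\n"
    "83,20.690495,92,3.115,0.706897,8.8438,5.429285,4.06405,468.786,1\n"
)
sample = pd.read_csv(sample_csv)

# Select the label by name and treat every other column as a predictor
features_by_name = sample.drop(columns="Classification").values
classification_by_name = sample["Classification"].values

# Compare against the positional slice used in the cell above
features_by_position = sample.iloc[:, :-1].values
print(features_by_name.shape, (features_by_name == features_by_position).all())
```

Selecting by name keeps working if columns are later reordered, at the cost of hard-coding the label's name.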

In [None]:
features

array([[ 48.        ,  23.5       ,  70.        , ...,   9.7024    ,
          7.99585   , 417.114     ],
       [ 83.        ,  20.69049454,  92.        , ...,   5.429285  ,
          4.06405   , 468.786     ],
       [ 82.        ,  23.12467037,  91.        , ...,  22.43204   ,
          9.27715   , 554.697     ],
       ...,
       [ 65.        ,  32.05      ,  97.        , ...,  22.54      ,
         10.33      , 314.05      ],
       [ 72.        ,  25.59      ,  82.        , ...,  33.75      ,
          3.27      , 392.46      ],
       [ 86.        ,  27.18      , 138.        , ...,  14.11      ,
          4.35      ,  90.09      ]])

In [None]:
classification

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2])
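
The printed label array can be tallied to check class balance. This is a small sketch in which the labels are rebuilt by hand from the counts visible in the output above (52 ones followed by 64 twos) rather than loaded from the file.

```python
from collections import Counter

# Labels rebuilt from the printed array: 52 healthy controls (1), then 64 patients (2)
labels = [1] * 52 + [2] * 64

counts = Counter(labels)
total = sum(counts.values())
for label, count in sorted(counts.items()):
    print(f"class {label}: {count} samples ({count / total:.1%})")
```

The two classes are reasonably balanced, so plain accuracy is not badly skewed by a majority class.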

# **Splitting the Dataset into Training Data and Testing Data**

In [None]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state=0 makes the split reproducible
features_train, features_test, classification_train, classification_test = train_test_split(
    features, classification, test_size=0.2, random_state=0)
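
With `test_size=0.2` on 116 rows, this split holds out 24 rows and trains on the remaining 92. Below is a hedged sketch of the same call with `stratify` added so that both subsets keep roughly the original class ratio. The placeholder feature matrix, the hand-built labels and the `demo_*`, `f_*` and `c_*` names are assumptions for illustration, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 116 rows, 9 columns, with 52 controls (1) and 64 patients (2)
demo_features = np.arange(116 * 9, dtype=float).reshape(116, 9)
demo_labels = np.array([1] * 52 + [2] * 64)

# stratify keeps the class proportions similar in the train and test subsets
f_train, f_test, c_train, c_test = train_test_split(
    demo_features, demo_labels, test_size=0.2, random_state=0, stratify=demo_labels
)

print(f_train.shape, f_test.shape)
print("test class counts:", np.bincount(c_test)[1:])
```

On a dataset this small, stratifying guards against a test set that happens to over-represent one class.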

In [None]:
features_train

array([[4.40000000e+01, 2.47400000e+01, 1.06000000e+02, 5.84600000e+01,
        1.52853413e+01, 1.81600000e+01, 1.61000000e+01, 5.31000000e+00,
        2.44750000e+02],
       [4.00000000e+01, 2.76360544e+01, 1.03000000e+02, 2.43200000e+00,
        6.17890133e-01, 1.43224000e+01, 6.78387000e+00, 2.60136000e+01,
        2.93123000e+02],
       [4.50000000e+01, 2.68500000e+01, 9.20000000e+01, 3.33000000e+00,
        7.55688000e-01, 5.46800000e+01, 1.21000000e+01, 1.09600000e+01,
        2.68230000e+02],
       [7.40000000e+01, 2.86501377e+01, 8.80000000e+01, 3.01200000e+00,
        6.53804800e-01, 3.11233000e+01, 7.65222000e+00, 1.83557400e+01,
        5.72401000e+02],
       [7.30000000e+01, 2.20000000e+01, 9.70000000e+01, 3.35000000e+00,
        8.01543333e-01, 4.47000000e+00, 1.03587250e+01, 6.28445000e+00,
        1.36855000e+02],
       [4.90000000e+01, 3.24619114e+01, 1.34000000e+02, 2.48870000e+01,
        8.22598307e+00, 4.23914000e+01, 1.07939400e+01, 5.76800000e+00,
        6.5

In [None]:
features_test

array([[3.40000000e+01, 2.14700000e+01, 7.80000000e+01, 3.46900000e+00,
        6.67435600e-01, 1.45700000e+01, 1.31100000e+01, 6.92000000e+00,
        3.54600000e+02],
       [4.60000000e+01, 2.22100000e+01, 8.60000000e+01, 3.69400000e+01,
        7.83620533e+00, 1.01600000e+01, 9.76000000e+00, 5.68000000e+00,
        3.12000000e+02],
       [5.40000000e+01, 2.42187500e+01, 8.60000000e+01, 3.73000000e+00,
        7.91257333e-01, 8.68740000e+00, 3.70523000e+00, 1.03445500e+01,
        6.35049000e+02],
       [7.70000000e+01, 2.59000000e+01, 8.50000000e+01, 4.58000000e+00,
        9.60273333e-01, 1.37400000e+01, 9.75326000e+00, 1.17740000e+01,
        4.88829000e+02],
       [7.60000000e+01, 2.38000000e+01, 1.18000000e+02, 6.47000000e+00,
        1.88320133e+00, 4.31100000e+00, 1.32513200e+01, 5.10420000e+00,
        2.80694000e+02],
       [8.60000000e+01, 2.66666667e+01, 2.01000000e+02, 4.16110000e+01,
        2.06307338e+01, 4.76470000e+01, 5.35713500e+00, 2.43701000e+01,
        1.6

In [None]:
classification_train

array([2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2,
       2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2,
       1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2,
       2, 2, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2,
       2, 2, 1, 1])

In [None]:
classification_test

array([1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 1, 1,
       1, 2])

# **Using scikit-learn to Fit a Random Forest on the Training Data**

In [None]:
from sklearn.ensemble import RandomForestClassifier
# A forest of 2 trees; each split is chosen by information gain (entropy)
model = RandomForestClassifier(n_estimators=2, criterion="entropy")
model.fit(features_train, classification_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=2,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
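
`criterion="entropy"` tells each tree to choose splits by information gain, which is measured with Shannon entropy. Below is a minimal sketch of that impurity measure. The helper name `entropy` is hypothetical, and the class counts (52 and 64) are the ones visible in the `classification` output earlier in this notebook.

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a node with the given class counts."""
    total = sum(class_counts)
    probabilities = [count / total for count in class_counts if count > 0]
    return sum(p * math.log2(1 / p) for p in probabilities)

# Whole dataset: 52 healthy controls and 64 patients
root_entropy = entropy([52, 64])

# A pure node has zero entropy; an even two-class split has exactly one bit
print(round(root_entropy, 3), entropy([10, 0]), entropy([5, 5]))
```

A candidate split is scored by how far it lowers the sample-weighted entropy of the child nodes below that of the parent.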

# **Prediction using the test dataset**

In [None]:
predictions = model.predict(features_test)  

# **Creating a Confusion Matrix to Verify Correct Predictions and Identify Incorrect Ones**

In [None]:
pd.crosstab(classification_test,predictions,rownames=['Real Classification'],colnames=['Predicted Classification'])

| Real Classification \ Predicted Classification | 1 | 2 |
|---|---|---|
| 1 | 10 | 1 |
| 2 | 5 | 8 |


The predictions are not 100% accurate: 6 of the 24 test samples were misclassified (1 healthy control predicted as a patient and 5 patients predicted as healthy controls).
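
The counts in the matrix can also be turned into per-class rates. This is a short sketch using the four numbers printed above and treating class 2 (patients) as the positive class. That choice of positive class, and the variable names, are assumptions made for illustration.

```python
# Counts copied from the confusion matrix above (rows = real, columns = predicted)
true_negative = 10   # real 1, predicted 1
false_positive = 1   # real 1, predicted 2
false_negative = 5   # real 2, predicted 1
true_positive = 8    # real 2, predicted 2

total = true_negative + false_positive + false_negative + true_positive
accuracy = (true_positive + true_negative) / total
sensitivity = true_positive / (true_positive + false_negative)   # recall for patients
specificity = true_negative / (true_negative + false_positive)   # recall for controls
precision = true_positive / (true_positive + false_positive)

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f}")
```

Sensitivity shows something accuracy hides: 5 of the 13 patients in the test set were missed.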

# **Model Accuracy**

In [None]:
from sklearn import metrics
modelAccuracy = metrics.accuracy_score(classification_test,predictions)
modelAccuracy*100

75.0
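
A single 24-sample test set and a 2-tree forest built without `random_state` can give an accuracy that varies from run to run. One common way to get a steadier estimate is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the real features. The choices of 100 trees, 5 folds and `random_state=0` are illustrative assumptions rather than settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as the real data (116 rows, 9 features)
demo_features, demo_labels = make_classification(
    n_samples=116, n_features=9, random_state=0
)

# A fixed random_state makes the forest reproducible between runs
demo_model = RandomForestClassifier(
    n_estimators=100, criterion="entropy", random_state=0
)

# Accuracy on each of 5 folds, then the mean across folds
fold_scores = cross_val_score(
    demo_model, demo_features, demo_labels, cv=5, scoring="accuracy"
)
print(fold_scores, fold_scores.mean())
```

To apply this to the real data, `features` and `classification` from the cells above would replace the synthetic arrays.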