# Random Forest Algorithm Implementation

We will use the Possum Regression dataset to predict a possum’s sex from its physical measurements, such as belly size and skull width.

In [1]:
import pandas as pd

In [3]:
df = pd.read_csv("./possum.csv")
df.sample(5, random_state=44)

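|    | case | site | Pop | sex | age | hdlngth | skullw | totlngth | taill | footlgth | earconch | eye | chest | belly |
|----|------|------|-----|-----|-----|---------|--------|----------|-------|----------|----------|-----|-------|-------|
| 88 | 89 | 7 | other | m | 6.0 | 97.7 | 58.4 | 84.5 | 35.0 | 64.4 | 46.2 | 14.4 | 29.0 | 30.5 |
| 39 | 40 | 2 | Vic | f | 3.0 | 91.0 | 55.0 | 84.5 | 36.0 | 72.8 | 51.4 | 13.6 | 27.0 | 30.0 |
| 92 | 93 | 7 | other | m | 3.0 | 89.2 | 54.0 | 82.0 | 38.0 | 63.8 | 44.9 | 12.8 | 24.0 | 31.0 |
| 7 | 8 | 1 | Vic | f | 6.0 | 94.8 | 57.6 | 91.0 | 37.0 | 72.7 | 53.9 | 14.5 | 29.0 | 34.0 |
| 98 | 99 | 7 | other | f | 3.0 | 93.3 | 56.2 | 86.5 | 38.5 | 64.8 | 43.8 | 14.0 | 28.0 | 35.0 |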


In [6]:
# removing any rows with missing data
df.info()
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 101 entries, 0 to 103
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   case      101 non-null    int64  
 1   site      101 non-null    int64  
 2   Pop       101 non-null    object 
 3   sex       101 non-null    object 
 4   age       101 non-null    float64
 5   hdlngth   101 non-null    float64
 6   skullw    101 non-null    float64
 7   totlngth  101 non-null    float64
 8   taill     101 non-null    float64
 9   footlgth  101 non-null    float64
 10  earconch  101 non-null    float64
 11  eye       101 non-null    float64
 12  chest     101 non-null    float64
 13  belly     101 non-null    float64
dtypes: float64(10), int64(2), object(2)
memory usage: 11.8+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 101 entries, 0 to 103
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   case      101 non-null    int

Next, we drop the columns we do not need and store the features and the labels in separate variables.

In [8]:
X = df.drop(["case", "site", "Pop", "sex"], axis=1)
y = df["sex"]

## Training Our Random Forest Model

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# We hold out 30% (test_size=0.3) of our data as the test set (X_test, y_test)
# The remaining 70% becomes the training set (X_train, y_train)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=44)

rf_model = RandomForestClassifier(n_estimators=50, random_state=44)  # 50 trees

rf_model.fit(X_train, y_train)
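
As a quick sanity check on the split, the arithmetic below sketches how many rows land in each set. It assumes the 101 rows reported by df.info() and that scikit-learn rounds the test share up to a whole row, which is consistent with the 31 predictions shown further down. The variable names are only illustrative.

```python
import math

n_samples = 101   # rows left after dropna(), per df.info()
test_size = 0.3   # same value passed to train_test_split

# Assumption: the test share is rounded up and the training set gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

In the notebook itself, len(X_train) and len(X_test) give the same information directly.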

In [15]:
rf_model.classes_

array(['f', 'm'], dtype=object)

## Making Predictions with the Model

In [11]:
"""
We pass the held-out features (X_test) to our model (rf_model)
and call the predict() method to, well, make predictions.
The results are stored in the variable predictions.
"""
predictions = rf_model.predict(X_test)
predictions

array(['f', 'm', 'm', 'm', 'f', 'f', 'm', 'm', 'm', 'm', 'm', 'f', 'm',
       'm', 'm', 'm', 'm', 'f', 'm', 'm', 'm', 'f', 'f', 'm', 'm', 'm',
       'm', 'm', 'm', 'm', 'f'], dtype=object)

In [13]:
# comparing the predicted values (predictions) to the true values (y_test)
y_test

57     m
80     m
27     m
97     m
61     f
69     f
7      f
26     f
24     m
65     f
71     m
30     m
38     f
76     m
19     f
103    f
44     m
33     m
11     f
72     m
83     m
39     f
5      f
73     f
51     m
94     m
56     f
96     m
89     m
90     m
98     f
Name: sex, dtype: object
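
Comparing the two outputs by eye is tedious, so here is a minimal sketch that counts the matches and turns them into an accuracy score. The labels are copied by hand from the predictions and y_test outputs above; the names predicted and actual are placeholders introduced for this example.

```python
# Labels copied from the predictions output above (31 test rows)
predicted = (
    "f m m m f f m m m m m f m "
    "m m m m f m m m f f m m m "
    "m m m m f"
).split()

# Labels copied from the y_test output above, in the same row order
actual = (
    "m m m m f f f f m f m m f "
    "m f f m m f m m f f f m m "
    "f m m m f"
).split()

# Accuracy = share of test rows where the prediction equals the true label
n_correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = n_correct / len(actual)

print(f"{n_correct} of {len(actual)} correct, accuracy = {accuracy:.3f}")
```

In the notebook, sklearn.metrics.accuracy_score(y_test, predictions) should give the same ratio without copying anything by hand.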

In [14]:
# Checking the probabilities assigned by our model to each prediction
rf_model.predict_proba(X_test)

array([[0.5 , 0.5 ],
       [0.26, 0.74],
       [0.36, 0.64],
       [0.3 , 0.7 ],
       [0.66, 0.34],
       [0.54, 0.46],
       [0.48, 0.52],
       [0.42, 0.58],
       [0.18, 0.82],
       [0.34, 0.66],
       [0.48, 0.52],
       [0.54, 0.46],
       [0.4 , 0.6 ],
       [0.28, 0.72],
       [0.48, 0.52],
       [0.36, 0.64],
       [0.4 , 0.6 ],
       [0.54, 0.46],
       [0.44, 0.56],
       [0.06, 0.94],
       [0.32, 0.68],
       [0.56, 0.44],
       [0.66, 0.34],
       [0.46, 0.54],
       [0.18, 0.82],
       [0.46, 0.54],
       [0.34, 0.66],
       [0.26, 0.74],
       [0.2 , 0.8 ],
       [0.1 , 0.9 ],
       [0.5 , 0.5 ]])

Each row contains two probabilities because there are two classes to predict: female and male. The columns follow the order of rf_model.classes_ shown earlier, so the first value is the predicted probability that the possum is female and the second is the predicted probability that it is male.
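
To make the link between these probabilities and the predicted labels concrete, here is a small sketch. It assumes the model picks the class with the highest probability and that a tie goes to the first class, which is consistent with the [0.5, 0.5] rows being labelled 'f' in the predictions above. The rows are copied from the start of the output.

```python
classes = ["f", "m"]  # same order as rf_model.classes_

# First five rows copied from the predict_proba output above
probabilities = [
    [0.5, 0.5],
    [0.26, 0.74],
    [0.36, 0.64],
    [0.3, 0.7],
    [0.66, 0.34],
]

labels = []
for row in probabilities:
    # Index of the largest probability; on a tie, index() returns the first one
    best = row.index(max(row))
    labels.append(classes[best])

print(labels)
```

The resulting labels can be checked against the first five entries of the predictions array.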

In [16]:
# Checking how important each feature is in predicting a possum’s sex
rf_model.feature_importances_

array([0.05237143, 0.1461762 , 0.08878178, 0.11322144, 0.07150935,
       0.16099193, 0.11990223, 0.11082213, 0.05286153, 0.08336197])

**Matching each importance value to its feature name.**

In [None]:
importances = rf_model.feature_importances_
columns = X.columns

# Pair each feature name with its importance and print it as a percentage
for column, importance in zip(columns, importances):
    print(f"The importance of feature '{column}' is {round(importance * 100, 2)}%.")
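
The importances are easier to read once they are ranked. The sketch below does that with the values printed above, assuming the feature order is the column order of X (the df.info() columns minus the four dropped ones). The hard-coded lists stand in for X.columns and rf_model.feature_importances_.

```python
# Assumed to match X.columns: df.info() columns minus case, site, Pop, sex
columns = ["age", "hdlngth", "skullw", "totlngth", "taill",
           "footlgth", "earconch", "eye", "chest", "belly"]

# Values copied from the feature_importances_ output above
importances = [0.05237143, 0.1461762, 0.08878178, 0.11322144, 0.07150935,
               0.16099193, 0.11990223, 0.11082213, 0.05286153, 0.08336197]

# Sort the (name, importance) pairs from most to least important
ranked = sorted(zip(columns, importances), key=lambda pair: pair[1], reverse=True)

for name, value in ranked:
    print(f"{name:<10}{value * 100:6.2f}%")
```

The importances sum to roughly 1, so each percentage can be read as that feature's share of the total.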