# Random Forest in Action

In [1]:
# for data manipulation & visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the Penguins' dataset

In [2]:
# dataset
df = pd.read_csv(r"D:\ML\Machine Learning_Practical\Scikit Learn\Decision Trees\Data\penguins_size.csv")

In [3]:
# top 5 examples
df.head()

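|   | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---------|--------|------------------|-----------------|-------------------|-------------|-----|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |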


In [4]:
# features available
df.columns

Index(['species', 'island', 'culmen_length_mm', 'culmen_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')

In [5]:
# basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   culmen_length_mm   342 non-null    float64
 3   culmen_depth_mm    342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                334 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [6]:
df["island"].unique()

array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

## Understanding the Penguin Species dataset

**Numeric** features:
1. **`culmen_length_mm`**
    - Length (millimeters) of the upper ridge of a bird’s bill (beak)
2. **`culmen_depth_mm`**
    - Depth/height (millimeters) of the upper ridge of a bird’s bill (beak)
3. **`flipper_length_mm`**
    - Length (millimeters) of the penguin's flippers (~wings)
4. **`body_mass_g`**
    - Mass (grams) of penguin

**Categorical** features:
1. **`sex`** - Male / Female
2. **`island`** - Natural habitat of penguin
    - 3 possible choices
        - `Torgersen`
        - `Biscoe`
        - `Dream`

**Target** variable to predict
1. **`species`**
    - 3 possible choices
        - `Adelie`
        - `Chinstrap`
        - `Gentoo`
---
### Images & Diagrams for better understanding of features

**Parts of a Penguin**
![image.png](attachment:image.png)

**The 3 islands** (_rough location_) in the **Antarctic Peninsula**
![image-2.png](attachment:image-2.png)

---

# Dealing with missing values

The dataset was **already analyzed** in a previous notebook (_`Decision Trees.ipynb`_). Based on that analysis, **dropping all examples with missing values** will **not hamper dataset quality**: only 10 of the 344 rows are affected.
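Before dropping rows, it can help to confirm how many are affected. The sketch below is only an illustration on a small hypothetical frame (`toy` and its values are made up, not rows from the penguins file); the same calls should apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the penguins data (values are made up)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "culmen_length_mm": [39.1, np.nan, 46.1, 46.5],
    "sex": ["MALE", None, "FEMALE", None],
})

# number of missing values in each column
missing_per_column = toy.isna().sum()

# number of rows containing at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())

# dropna(axis=0) removes exactly those rows
cleaned = toy.dropna(axis=0)
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(rows_with_missing, rows_dropped)
```

If `rows_dropped` is small relative to `len(df)`, dropping is usually a reasonable choice; otherwise imputation may be worth considering.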

In [7]:
# drop NaN examples
df = df.dropna(axis=0)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 334 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            334 non-null    object 
 1   island             334 non-null    object 
 2   culmen_length_mm   334 non-null    float64
 3   culmen_depth_mm    334 non-null    float64
 4   flipper_length_mm  334 non-null    float64
 5   body_mass_g        334 non-null    float64
 6   sex                334 non-null    object 
dtypes: float64(4), object(3)
memory usage: 20.9+ KB


# Data Preparation

In [10]:
# input features

# get input features
X = df.drop(labels="species", axis=1)

# categorical features will be one-hot encoded
X = pd.get_dummies(data=X, drop_first=True)
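To see what `drop_first=True` does, here is a minimal sketch on a hypothetical frame (`toy` is made up for illustration). Each categorical column with k levels becomes k-1 indicator columns, and the dropped first level (in sorted order) is implied when all of its indicators are zero. Numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "body_mass_g": [3750.0, 4500.0, 3800.0, 5000.0],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# one-hot encode the categorical columns, dropping the first level of each
encoded = pd.get_dummies(data=toy, drop_first=True)
encoded_columns = list(encoded.columns)

print(encoded_columns)
# 'island_Biscoe' and 'sex_FEMALE' are absent: they are the dropped baselines
```

Dropping one level matters mostly for linear models; tree-based models can work either way, so this is a design choice rather than a requirement.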

In [12]:
# target variable (species col)

y = df["species"]

## Training & Test set

In [13]:
# for splitting data

from sklearn.model_selection import train_test_split

In [16]:
# training & test set split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=0)
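A quick sanity check after splitting is to look at the set sizes and the class balance in each part. The sketch below uses a hypothetical toy dataset and also passes `stratify=`, which the split above does not use; it is shown only as an optional variation that keeps class proportions the same in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 examples, two balanced classes (values are made up)
toy_X = pd.DataFrame({"flipper_length_mm": [180.0 + i for i in range(20)]})
toy_y = pd.Series(["Adelie"] * 10 + ["Gentoo"] * 10, name="species")

# stratify=toy_y is an assumption for illustration, not part of the original split
tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0, stratify=toy_y
)

# sizes of the two sets
print(len(tX_train), len(tX_test))

# class counts in each set
print(ty_train.value_counts())
print(ty_test.value_counts())
```

Without `stratify`, the class proportions in the test set can drift from those in the full data, especially for the least frequent class.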