# AdaBoost using Scikit-Learn

#### Python Imports

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

#### Load and Display the Palmer Penguins Data Set
Source: [Palmer Penguins Data Set](https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv)

**Attributes**

1. species: denotes the penguin species (Adelie, Chinstrap, or Gentoo)
2. island: denotes the island in the Palmer Archipelago, Antarctica (Biscoe, Dream, or Torgersen)
3. bill_length_mm: denotes the penguin's beak length (millimeters)
4. bill_depth_mm: denotes the penguin's beak depth (millimeters)
5. flipper_length_mm: denotes the penguin's flipper length (millimeters)
6. body_mass_g: denotes the penguin's body mass (grams)
7. sex: denotes the penguin's sex (female or male)
8. year: denotes the study year (2007, 2008, or 2009)

In [3]:
url = 'https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv'
penguins_df = pd.read_csv(url)
penguins_df = penguins_df.drop(penguins_df.columns[0], axis=1)
penguins_df

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | male | 2009 |
| 340 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | female | 2009 |
| 341 | Chinstrap | Dream | 49.6 | 18.2 | 193.0 | 3775.0 | male | 2009 |
| 342 | Chinstrap | Dream | 50.8 | 19.0 | 210.0 | 4100.0 | male | 2009 |


#### Address all the missing values

- Most of the columns are missing for the rows at index 3 and 271, so it is better to drop those rows
- Comparing the features (`bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`) of the rows at index 8, 10, and 11 with the mean values of those features, we can infer the value of `sex` to be **female**. Similarly, the value of `sex` for the row at index 9 is **male**
- Making the same comparison for the row at index 47, we can infer the value of `sex` to be **female**
- Making the same comparison for the rows at index 178, 218, 256, and 268, we can infer the value of `sex` to be **female**
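
The comparison against mean values can be sketched in code. The snippet below is only an illustration on a small made-up frame: the numbers, the two-feature subset, and the helper name `nearest_sex` are assumptions, not part of the notebook. It assigns the sex whose group mean lies closest to the row once each feature is scaled by its standard deviation.

```python
import pandas as pd

# Toy data for illustration only; the values are made up, not read from the data set.
toy_df = pd.DataFrame({
    "bill_length_mm": [37.0, 38.0, 40.5, 41.5, 37.5],
    "body_mass_g": [3300.0, 3400.0, 4000.0, 4100.0, 3350.0],
    "sex": ["female", "female", "male", "male", None],
})
features = ["bill_length_mm", "body_mass_g"]

# Per-sex feature means and per-feature spread, computed from rows where sex is known.
known = toy_df.dropna(subset=["sex"])
group_means = known.groupby("sex")[features].mean()
scale = known[features].std()

def nearest_sex(row):
    """Hypothetical helper: return the sex whose group mean is closest to the row."""
    values = row[features].astype(float)
    distances = (((group_means - values) / scale) ** 2).sum(axis=1)
    return distances.idxmin()

# Fill in only the rows where sex is missing.
missing = toy_df["sex"].isna()
toy_df.loc[missing, "sex"] = toy_df.loc[missing].apply(nearest_sex, axis=1)
print(toy_df)
```

On the real data one would likely compute the means per species as well, since body size differs between species; that choice is left open here.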

In [4]:
penguins_df = penguins_df.drop([3, 271], axis=0)
penguins_df.loc[[8, 10, 11], 'sex'] = 'female'
penguins_df.at[9, 'sex'] = 'male'
penguins_df.at[47, 'sex'] = 'female'
penguins_df.loc[[178, 218, 256, 268], 'sex'] = 'female'

#### Display feature information about the Palmer Penguins data set after the fixes

In [5]:
penguins_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 342 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            342 non-null    object 
 1   island             342 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                342 non-null    object 
 7   year               342 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 32.1+ KB


#### Encode the categorical features `island` and `sex`

In [6]:
nom_features = ['island', 'sex']
nom_encoded_df = pd.get_dummies(penguins_df[nom_features], prefix_sep='.', drop_first=True, sparse=False)
nom_encoded_df

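| | island.Dream | island.Torgersen | sex.male |
|---|---|---|---|
| 0 | 0 | 1 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |
| 5 | 0 | 1 | 1 |
| ... | ... | ... | ... |
| 339 | 1 | 0 | 1 |
| 340 | 1 | 0 | 0 |
| 341 | 1 | 0 | 1 |
| 342 | 1 | 0 | 1 |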
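
With `drop_first=True`, the first level of each feature (in sorted order) gets no column of its own and is represented by zeros in the remaining columns. The sketch below shows this on a three-row toy frame; the frame is made up for illustration, and `dtype=int` is an addition that requests 0/1 values, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame for illustration; only the category labels mirror the data set.
toy_df = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream"],
    "sex": ["male", "female", "female"],
})

# drop_first=True drops 'Biscoe' and 'female', the first levels in sorted order.
encoded = pd.get_dummies(toy_df, prefix_sep=".", drop_first=True, dtype=int)
print(encoded)
```

The `Biscoe`/`female` row comes out as all zeros, which is how the dropped baseline levels are encoded.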


#### Replace categorical features with encoded features

In [7]:
penguins_df2 = penguins_df.drop(columns=nom_features)
penguins_df3 = pd.concat([penguins_df2, nom_encoded_df], axis=1)
penguins_df3

| | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year | island.Dream | island.Torgersen | sex.male |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | 2007 | 0 | 1 | 1 |
| 1 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | 2007 | 0 | 1 | 0 |
| 2 | Adelie | 40.3 | 18.0 | 195.0 | 3250.0 | 2007 | 0 | 1 | 0 |
| 4 | Adelie | 36.7 | 19.3 | 193.0 | 3450.0 | 2007 | 0 | 1 | 0 |
| 5 | Adelie | 39.3 | 20.6 | 190.0 | 3650.0 | 2007 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Chinstrap | 55.8 | 19.8 | 207.0 | 4000.0 | 2009 | 1 | 0 | 1 |
| 340 | Chinstrap | 43.5 | 18.1 | 202.0 | 3400.0 | 2009 | 1 | 0 | 0 |
| 341 | Chinstrap | 49.6 | 18.2 | 193.0 | 3775.0 | 2009 | 1 | 0 | 1 |
| 342 | Chinstrap | 50.8 | 19.0 | 210.0 | 4100.0 | 2009 | 1 | 0 | 1 |


#### Create the training and testing data sets

In [8]:
X_train, X_test, y_train, y_test = train_test_split(penguins_df3, penguins_df3['species'], test_size=0.25, random_state=101)

#### Drop the target from the training and testing data sets

In [9]:
X_train = X_train.drop('species', axis=1)
X_test = X_test.drop('species', axis=1)
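
The two cells above split first and then drop the target. An equivalent arrangement, sketched here on a small made-up frame, separates the features from the target before splitting. The `stratify=y` argument is an addition that the notebook does not use; it keeps the class proportions the same in both splits, which may help when some species are rarer than others.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame for illustration only; the values are made up.
toy_df = pd.DataFrame({
    "species": ["Adelie"] * 4 + ["Gentoo"] * 4,
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 46.1, 50.0, 48.7, 47.6],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0, 4500.0, 5700.0, 4450.0, 5400.0],
})

# Separate the features from the target before splitting.
X = toy_df.drop(columns="species")
y = toy_df["species"]

# stratify=y (not in the notebook) preserves the class balance in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=101, stratify=y
)
print(X_tr.shape, X_te.shape)
```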

#### Initialize and fit the AdaBoost model for Classification

The hyperparameter `n_estimators` sets the maximum number of weak learners (decision trees) to train; the default is 50.

The hyperparameter `learning_rate` scales the performance coefficient $\alpha$ of each weak learner. The smaller the value, the smaller the weight each learner receives at every iteration.
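
To make the effect of `learning_rate` concrete, here is a rough sketch of the coefficient for a single weak learner. It assumes the multi-class SAMME form, $\alpha = \text{learning\_rate} \cdot \left( \ln\frac{1 - err}{err} + \ln(K - 1) \right)$ with $K$ classes; the notebook does not state the formula, and the helper name `performance_coefficient` and the error value 0.2 are chosen purely for illustration.

```python
import math

def performance_coefficient(error, learning_rate=1.0, n_classes=3):
    """Hypothetical helper: SAMME-style coefficient scaled by learning_rate."""
    return learning_rate * (math.log((1.0 - error) / error) + math.log(n_classes - 1))

# The same weak learner (weighted error 0.2) under three learning rates.
for lr in (1.0, 0.1, 0.01):
    print(lr, performance_coefficient(error=0.2, learning_rate=lr))
```

Because the coefficient scales linearly with `learning_rate`, a value of 0.01 shrinks every learner's contribution a hundredfold, which is why a small rate is usually paired with a larger `n_estimators`.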

In [10]:
model = AdaBoostClassifier(n_estimators=100, learning_rate=0.01, random_state=101)
model.fit(X_train, y_train)

AdaBoostClassifier(learning_rate=0.01, n_estimators=100, random_state=101)

#### Predict the target `species` using the testing data set

In [11]:
y_predict = model.predict(X_test)

#### Display the accuracy score

In [12]:
accuracy_score(y_test, y_predict)

0.9651162790697675
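
The accuracy score is the fraction of test rows predicted correctly. As a rough check on the number above, assume the 342 cleaned rows and `test_size=0.25` give 86 test rows (scikit-learn rounds the test count up); the reported value then corresponds to 83 correct predictions and 3 errors.

```python
import math

n_rows = 342                        # rows left after dropping index 3 and 271
n_test = math.ceil(0.25 * n_rows)   # size of the test split
reported = 0.9651162790697675       # accuracy reported above

# Recover the number of correct predictions implied by the reported score.
n_correct = round(reported * n_test)
print(n_test, n_correct, n_test - n_correct)
```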