# Feature Engineering II,
### a.k.a.
* ### Advanced Feature Engineering 
* ### Feature Engineering with _scikit-learn_

(The concepts are the same as in the intro to FE; what differs is how we transform the data.)

In [142]:
# stuff you know already
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [143]:
# new stuff !!
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

### 1. Get Data

In [167]:
df = pd.read_csv('all_penguins_clean.csv')
df

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Real ID,Sex
0,PAL0708,1,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,11/11/07,39.1,18.7,181.0,3750.0,A_0,MALE
1,PAL0708,2,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,11/11/07,39.5,17.4,186.0,3800.0,A_1,FEMALE
2,PAL0708,3,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,11/16/07,40.3,18.0,195.0,3250.0,A_2,FEMALE
3,PAL0708,4,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,11/16/07,,,,,A_3,
4,PAL0708,5,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,11/16/07,36.7,19.3,193.0,3450.0,A_4,FEMALE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
339,PAL0910,120,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N38A2,No,12/1/09,,,,,G_339,
340,PAL0910,121,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N39A1,Yes,11/22/09,46.8,14.3,215.0,4850.0,G_340,FEMALE
341,PAL0910,122,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N39A2,Yes,11/22/09,50.4,15.7,222.0,5750.0,G_341,MALE
342,PAL0910,123,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N43A1,Yes,11/22/09,45.2,14.8,212.0,5200.0,G_342,FEMALE


In [168]:
df.isna().sum()

studyName               0
Sample Number           0
Species                 0
Region                  0
Island                  0
Stage                   0
Individual ID           0
Clutch Completion       0
Date Egg                0
Culmen Length (mm)      2
Culmen Depth (mm)       2
Flipper Length (mm)     2
Body Mass (g)           2
Real ID                 0
Sex                    10
dtype: int64

In [169]:
df=df[~df["Individual ID"].isin(["N2A2","N38A2"])]
df.isna().sum()

studyName              0
Sample Number          0
Species                0
Region                 0
Island                 0
Stage                  0
Individual ID          0
Clutch Completion      0
Date Egg               0
Culmen Length (mm)     0
Culmen Depth (mm)      0
Flipper Length (mm)    0
Body Mass (g)          0
Real ID                0
Sex                    8
dtype: int64

In [170]:
df

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Real ID,Sex
0,PAL0708,1,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,11/11/07,39.1,18.7,181.0,3750.0,A_0,MALE
1,PAL0708,2,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,11/11/07,39.5,17.4,186.0,3800.0,A_1,FEMALE
2,PAL0708,3,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,11/16/07,40.3,18.0,195.0,3250.0,A_2,FEMALE
4,PAL0708,5,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,11/16/07,36.7,19.3,193.0,3450.0,A_4,FEMALE
5,PAL0708,6,Adelie,Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,11/16/07,39.3,20.6,190.0,3650.0,A_5,MALE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
338,PAL0910,119,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N38A1,No,12/1/09,47.2,13.7,214.0,4925.0,G_338,FEMALE
340,PAL0910,121,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N39A1,Yes,11/22/09,46.8,14.3,215.0,4850.0,G_340,FEMALE
341,PAL0910,122,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N39A2,Yes,11/22/09,50.4,15.7,222.0,5750.0,G_341,MALE
342,PAL0910,123,Gentoo,Anvers,Biscoe,"Adult, 1 Egg Stage",N43A1,Yes,11/22/09,45.2,14.8,212.0,5200.0,G_342,FEMALE


In [171]:
X=df[["Island","Sex","Flipper Length (mm)","Body Mass (g)"]]
y=df["Species"]

### 2. Train-Test Split

In [172]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=101)

### 3. Explore the Data

In [173]:
df.isnull().sum()

studyName              0
Sample Number          0
Species                0
Region                 0
Island                 0
Stage                  0
Individual ID          0
Clutch Completion      0
Date Egg               0
Culmen Length (mm)     0
Culmen Depth (mm)      0
Flipper Length (mm)    0
Body Mass (g)          0
Real ID                0
Sex                    8
dtype: int64

In [175]:
# Replace "." placeholder entries with "FEMALE".
# Assigning the result (instead of using inplace=True on a slice of df)
# avoids pandas' SettingWithCopyWarning.
X_train = X_train.replace(".", "FEMALE")


In [195]:
X_train["Sex"].value_counts()

FEMALE    139
MALE      125
Name: Sex, dtype: int64

### 4. Feature Engineer

Q: How do we want to feature engineer / transform our columns?

A: 
* one-hot encode the Sex and Island columns
* impute missing values in the Sex column
* discretize numerical values (bin + one-hot encode)
* scale numerical values
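
The discretizing step from the list above is not used later in this notebook, so here is a minimal sketch of it with `KBinsDiscretizer`. The body-mass values are made-up toy numbers, and `n_bins=3` with `strategy="uniform"` are illustrative choices, not settings from the course material.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical body masses in grams (toy values, not taken from the dataset)
body_mass = np.array([[2900.0], [3400.0], [3800.0], [4300.0], [5000.0], [5900.0]])

# Bin into 3 equal-width intervals and one-hot encode the bin index in one step.
# n_bins and strategy are illustrative assumptions.
binner = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform")
binned = binner.fit_transform(body_mass)

print(binner.bin_edges_[0])  # bin boundaries learned from the data
print(binned)                # one column per bin, exactly one 1.0 per row
```

Each row ends up with a single 1.0 marking the bin its value falls into, which is the "bin + one-hot encode" idea in one transformer.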

Introducing `ColumnTransformer`: 

https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
        
Takes as parameters:
* list of tuples of the format `(name, transformer, columns)`
* what to do with columns not included: `remainder='drop'/'passthrough'`

`ColumnTransformer` helps us to do all our feature engineering in one go.

In [179]:
numerical_columns=["Flipper Length (mm)","Body Mass (g)"]
categorical_columns=["Island","Sex"]

In [180]:
# each tuple: (name, transformer, columns)
# note: `sparse_output` replaced `sparse` in scikit-learn 1.2;
# on older versions use sparse=False instead
column_transformer=ColumnTransformer([
    ("sex_imputer",SimpleImputer(strategy="most_frequent"),["Sex"]),
    ("island_ohe",OneHotEncoder(sparse_output=False,handle_unknown="error",drop="if_binary"),["Island"]),
    ("num_scaler",MinMaxScaler(),numerical_columns)
])

In [181]:
column_transformer.fit(X_train)
X_train_fe = column_transformer.transform(X_train)
X_test_fe = column_transformer.transform(X_test)

In [182]:
X_train_fe

array([['MALE', 1.0, 0.0, 0.0, 0.6949152542372885, 0.875],
       ['MALE', 1.0, 0.0, 0.0, 0.745762711864407, 0.5694444444444444],
       ['MALE', 0.0, 1.0, 0.0, 0.4915254237288136, 0.375],
       ...,
       ['FEMALE', 1.0, 0.0, 0.0, 0.745762711864407, 0.6388888888888888],
       ['FEMALE', 1.0, 0.0, 0.0, 0.6779661016949157, 0.6944444444444444],
       ['FEMALE', 0.0, 0.0, 1.0, 0.1694915254237288, 0.13888888888888884]],
      dtype=object)

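Note that the Sex column in the output above was imputed but is still a string column (hence `dtype=object`). What if we want to apply two transformations to the same column, e.g. impute and then one-hot encode?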

Introducing `Pipeline`: 

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Takes as parameters: 
* list of tuples of the format `(name, transformer)`

`Pipeline` allows us to apply sequential transformations to the same data.

Shorthand: `make_pipeline` and `make_column_transformer` build the same objects and generate the step names automatically.
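
As a rough sketch of the `make_pipeline` / `make_column_transformer` helpers: they take the transformers without names and derive the names from the lowercased class names. The toy DataFrame below is hypothetical, and `sparse_threshold=0` is only there to force a dense array regardless of scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame with hypothetical values, including one missing Sex entry
toy = pd.DataFrame({
    "Sex": ["MALE", "FEMALE", np.nan, "FEMALE"],
    "Body Mass (g)": [3750.0, 3800.0, 3250.0, 4850.0],
})

# No names needed: steps are named after the lowercased class names
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Tuples are (transformer, columns); sparse_threshold=0 forces dense output
ct = make_column_transformer(
    (cat_pipe, ["Sex"]),
    (MinMaxScaler(), ["Body Mass (g)"]),
    sparse_threshold=0,
)

out = ct.fit_transform(toy)
step_names = [name for name, _ in cat_pipe.steps]
print(step_names)
print(out.shape)
```

The explicit `Pipeline([...])` form is still handy when you want to pick readable step names yourself.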

In [183]:
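# steps run in order: impute first, then one-hot encode the imputed values
# (`sparse_output` replaced `sparse` in scikit-learn 1.2)
categorical_pipeline = Pipeline([
    ("cat_imputer",SimpleImputer(strategy="most_frequent")),
    ("cat_ohe",OneHotEncoder(sparse_output=False,handle_unknown="ignore"))
])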

In [209]:
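# a Pipeline can be used wherever a single transformer is expected
# (`sparse_output` replaced `sparse` in scikit-learn 1.2)
column_transformer=ColumnTransformer([
    ("sex_imputer",categorical_pipeline,["Sex"]),
    ("island_ohe",OneHotEncoder(sparse_output=False,handle_unknown="error",drop="first"),["Island"]),
    ("num_scaler",MinMaxScaler(),numerical_columns)
])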

In [221]:
X_train

Unnamed: 0,Island,Sex,Flipper Length (mm),Body Mass (g)
233,Biscoe,MALE,213.0,5850.0
263,Biscoe,MALE,216.0,4750.0
167,Dream,MALE,201.0,4050.0
91,Dream,MALE,205.0,4300.0
299,Biscoe,MALE,223.0,5950.0
...,...,...,...,...
89,Dream,FEMALE,190.0,3600.0
64,Biscoe,FEMALE,184.0,2850.0
330,Biscoe,FEMALE,216.0,5000.0
342,Biscoe,FEMALE,212.0,5200.0


In [219]:
X_train["Sex"].unique()

array(['MALE', 'FEMALE', nan], dtype=object)

In [211]:
column_transformer.fit(X_train)
X_train_fe = column_transformer.transform(X_train)
X_test_fe = column_transformer.transform(X_test)  # Do not .fit() on test data
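
To illustrate the "do not `.fit()` on test data" rule, here is a small sketch with `MinMaxScaler` on made-up flipper lengths (toy values, not from the dataset). The scaler learns its minimum and maximum from the training data only and reuses them on the test data, so test values can land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical flipper lengths in mm (toy values)
train = np.array([[180.0], [200.0], [220.0]])
test = np.array([[170.0], [210.0], [230.0]])

scaler = MinMaxScaler()
scaler.fit(train)                     # learns min=180 and max=220 from train only
test_scaled = scaler.transform(test)  # reuses the training min and max

print(test_scaled.ravel())  # [-0.25  0.75  1.25]
```

Values below 0 or above 1 are expected here: they show that the test data was scaled with what the transformer learned from the training data, which is what the model will also face on unseen data.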

In [212]:
X_train_fe

array([[0.        , 1.        , 0.        , 0.        , 0.69491525,
        0.875     ],
       [0.        , 1.        , 0.        , 0.        , 0.74576271,
        0.56944444],
       [0.        , 1.        , 1.        , 0.        , 0.49152542,
        0.375     ],
       ...,
       [1.        , 0.        , 0.        , 0.        , 0.74576271,
        0.63888889],
       [1.        , 0.        , 0.        , 0.        , 0.6779661 ,
        0.69444444],
       [1.        , 0.        , 0.        , 1.        , 0.16949153,
        0.13888889]])

### 5. Train Model

In [187]:
m = LogisticRegression()

In [188]:
m.fit(X_train_fe, y_train)

LogisticRegression()

### 6. Optimize

In [190]:
m.score(X_train_fe,y_train)

0.8745387453874539

In [192]:
m.score(X_test_fe,y_test)

0.8676470588235294

In [218]:
# X_train_fe is a NumPy array, not a DataFrame, so indexing by column label fails
X_train_fe["Female"]

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

### 7. Calculate Test Score

**BONUS**: Figure out how to wrap your feature engineering / ColumnTransformer and your model / LogisticRegression into a single Pipeline.
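
If you get stuck on the bonus, here is one possible sketch (not the only solution). It uses a small hypothetical stand-in for `X_train` / `y_train`; the step names `"fe"` and `"model"` and `max_iter=1000` are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-in for X_train / y_train (hypothetical values)
X_toy = pd.DataFrame({
    "Island": ["Torgersen", "Biscoe", "Dream", "Biscoe", "Dream", "Torgersen"],
    "Sex": ["MALE", "FEMALE", np.nan, "FEMALE", "MALE", "FEMALE"],
    "Flipper Length (mm)": [181.0, 215.0, 195.0, 222.0, 198.0, 186.0],
    "Body Mass (g)": [3750.0, 4850.0, 3700.0, 5750.0, 3900.0, 3800.0],
})
y_toy = pd.Series(["Adelie", "Gentoo", "Chinstrap", "Gentoo", "Chinstrap", "Adelie"])

# Impute, then one-hot encode the categorical columns
categorical_pipeline = Pipeline([
    ("cat_imputer", SimpleImputer(strategy="most_frequent")),
    ("cat_ohe", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("cat", categorical_pipeline, ["Island", "Sex"]),
    ("num_scaler", MinMaxScaler(), ["Flipper Length (mm)", "Body Mass (g)"]),
])

# Feature engineering and model in one object: fit() fits both in order
full_pipeline = Pipeline([
    ("fe", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

full_pipeline.fit(X_toy, y_toy)
predictions = full_pipeline.predict(X_toy)
print(predictions)
```

With the real data, the same object would be fitted with `full_pipeline.fit(X_train, y_train)` and evaluated with `full_pipeline.score(X_test, y_test)`, so the raw, untransformed DataFrames go in directly.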