# OneHotEncoder

_By Jeff Hale_

---

## Learning Objectives
By the end of this lesson, students will be able to:

- Explain when to use OneHotEncoder
- Use OneHotEncoder to create dummy variables for training and test data
- Create a baseline model for a classification project
- Generate a confusion matrix
- Compute the sensitivity from a confusion matrix

---

## OneHotEncoder

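`OneHotEncoder` is scikit-learn's transformer for dummy encoding categorical variables for machine learning. It learns the categories from the training data when you call `fit`, then applies that same encoding to the test data when you call `transform`. Both sets end up with identical columns, and no information from the test set leaks into the training process.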
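To make the difference concrete, here is a minimal sketch on made-up data (the `pclass_name` column and its values are hypothetical, not the Titanic data used below). It assumes a test set that happens to be missing one category. `.toarray()` is used to get a dense array so the example does not depend on the name of the sparse-output argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): the test set contains no 'Second' rows.
train = pd.DataFrame({"pclass_name": ["First", "Second", "Third", "Third"]})
test = pd.DataFrame({"pclass_name": ["Third", "First"]})

# pd.get_dummies encodes each frame on its own, so the columns can differ.
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)
print(list(train_dummies.columns))
print(list(test_dummies.columns))

# OneHotEncoder learns the categories from the training data only,
# then reuses them for any data it transforms later.
ohe = OneHotEncoder()
train_encoded = ohe.fit_transform(train).toarray()
test_encoded = ohe.transform(test).toarray()
print(train_encoded.shape, test_encoded.shape)
```

With `pd.get_dummies` the two frames end up with different numbers of columns, so a model fit on one cannot score the other without extra alignment work. The encoder output has the same width for both.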

### Read in titanic data from seaborn

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [2]:
df_titanic = sns.load_dataset('titanic')
df_titanic.head()

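|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |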


In [3]:
df_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB


## Split into X and y

Let's use `survived` for y and `sex` and `class` for X.

In [4]:
X = df_titanic[['sex', 'class']]
y = df_titanic['survived']

In [5]:
X.head()

|   | sex | class |
|---|---|---|
| 0 | male | Third |
| 1 | female | First |
| 2 | female | Third |
| 3 | female | First |
| 4 | male | Third |


In [6]:
y.head()

0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64

In [9]:
y.value_counts()

0    549
1    342
Name: survived, dtype: int64

In [8]:
y.value_counts(normalize=True)

0    0.616162
1    0.383838
Name: survived, dtype: float64

## Split into training and test sets

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

In [16]:
X_train.head(2)

|   | sex | class |
|---|---|---|
| 709 | male | Third |
| 558 | female | First |


In [17]:
X_test.head(2)

|   | sex | class |
|---|---|---|
| 725 | male | Third |
| 861 | male | Second |


In [18]:
y_train.head(2)

709    1
558    1
Name: survived, dtype: int64

In [19]:
y_test.head(2)

725    0
861    0
Name: survived, dtype: int64
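The split above relies on two `train_test_split` behaviors: by default it holds out 25% of the rows for testing, and a fixed `random_state` makes the split repeatable. A small sketch with placeholder arrays (`X_demo` and `y_demo` are stand-ins with the same 891 rows, not the Titanic columns) illustrates both:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the Titanic dataset.
X_demo = np.arange(891).reshape(-1, 1)
y_demo = np.arange(891) % 2

# Default test_size is 0.25, so 891 rows split into 668 train and 223 test.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=7)
print(len(X_tr), len(X_te))

# Reusing the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=7)
print(np.array_equal(X_te, X_te2))
```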

### Make an object from the OneHotEncoder class. 

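#### Warning! ☝️ The arguments are important here.

`sparse=False` makes the encoder return a dense NumPy array instead of a SciPy sparse matrix. `handle_unknown='ignore'` encodes any category that was not seen during fitting as all zeros instead of raising an error. In scikit-learn 1.2 and later, the `sparse` argument is named `sparse_output`.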

In [20]:
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=False)
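A short sketch of what `handle_unknown` changes, using made-up data (the `'unknown'` value is hypothetical and chosen only to trigger the unseen-category case). It calls `.toarray()` instead of passing `sparse=False` so it runs regardless of which name your scikit-learn version uses for that argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"sex": ["male", "female", "male"]})
# 'unknown' never appears in the training data (hypothetical value).
test = pd.DataFrame({"sex": ["female", "unknown"]})

# With handle_unknown='ignore', an unseen category becomes a row of zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(test).toarray()
print(encoded)

# With the default handle_unknown='error', transform raises a ValueError.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True
print(raised)
```

An all-zero row means the model sees that observation as belonging to none of the known categories, which is usually preferable to a crash at prediction time.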

### Save the fit and transformed training data

In [23]:
X_train_dummified = ohe.fit_transform(X_train, y_train)
X_train_dummified

array([[0., 1., 0., 0., 1.],
       [1., 0., 1., 0., 0.],
       [1., 0., 0., 1., 0.],
       ...,
       [1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 1.],
       [0., 1., 0., 0., 1.]])

### Save the transformed `X_test`

In [25]:
X_test_dummified = ohe.transform(X_test)
X_test_dummified

array([[0., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 1.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.]])

In [26]:
pd.get_dummies(X_train)

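|   | sex_female | sex_male | class_First | class_Second | class_Third |
|---|---|---|---|---|---|
| 709 | 0 | 1 | 0 | 0 | 1 |
| 558 | 1 | 0 | 1 | 0 | 0 |
| 327 | 1 | 0 | 0 | 1 | 0 |
| 256 | 1 | 0 | 1 | 0 | 0 |
| 51 | 0 | 1 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... |
| 579 | 0 | 1 | 0 | 0 | 1 |
| 502 | 1 | 0 | 0 | 0 | 1 |
| 537 | 1 | 0 | 1 | 0 | 0 |
| 196 | 0 | 1 | 0 | 0 | 1 |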
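The `pd.get_dummies` output carries column names, while the array from `OneHotEncoder` does not. One way to recover the column order is the fitted encoder's `categories_` attribute. The sketch below uses a three-row stand-in for `X_train` (`X_small` is hypothetical) and builds names in the same style as `pd.get_dummies`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Small stand-in for X_train with the same two columns.
X_small = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "class": ["Third", "First", "Second"],
})
ohe = OneHotEncoder(handle_unknown="ignore").fit(X_small)

# categories_ holds one sorted array per input column, in column order.
# The encoded output columns follow that same order.
column_names = [
    f"{feature}_{category}"
    for feature, categories in zip(X_small.columns, ohe.categories_)
    for category in categories
]
print(column_names)
```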


## Make a LogisticRegression model

In [27]:
logreg = LogisticRegression()


### Fit the model

In [28]:
logreg.fit(X_train_dummified, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

### Create the model predictions

In [29]:
preds = logreg.predict(X_test_dummified)

In [30]:
preds

array([0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1])

In [31]:
logreg.score(X_test_dummified, y_test)

0.7309417040358744

### Generate the confusion matrix

In [32]:
confusion_matrix(y_test, preds)

array([[111,  24],
       [ 36,  52]])

In [34]:
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
tn

111

In [35]:
fp

24

In [36]:
fn

36

In [38]:
tp

52
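The unpacking order `tn, fp, fn, tp` works because scikit-learn puts the true labels on the rows and the predicted labels on the columns, with class 0 first. A tiny hand-checkable example (the label lists are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# ravel() flattens row by row, which gives tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```

Passing the arguments in the other order, `confusion_matrix(preds, y_test)`, would swap the false positives and false negatives.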

#### Compute the True Positive Rate

In [39]:
tp / (tp + fn)

0.5909090909090909

#### Compute the Sensitivity

In [40]:
tp / (tp + fn)

0.5909090909090909

#### Compute the Recall

In [41]:
tp / (tp + fn)

0.5909090909090909
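True positive rate, sensitivity, and recall are three names for the same quantity, which is why the three cells above are identical. scikit-learn exposes it as `recall_score`. The sketch below rebuilds label arrays that reproduce the confusion matrix above (the arrays are reconstructed from the counts, not the actual `y_test` and `preds`) and checks that the manual formula and `recall_score` agree:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Reconstructed labels matching the counts above:
# 111 TN, 24 FP, 36 FN, 52 TP.
y_true = np.array([0] * 135 + [1] * 88)
y_pred = np.array([0] * 111 + [1] * 24 + [0] * 36 + [1] * 52)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall = TP / (TP + FN): the share of actual positives that were found.
manual_recall = tp / (tp + fn)
sklearn_recall = recall_score(y_true, y_pred)
print(manual_recall, sklearn_recall)
```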

## Baseline model

In [42]:
y_train.value_counts(normalize=True)

0    0.61976
1    0.38024
Name: survived, dtype: float64

#### Predict the most common class every time.

In [43]:
from sklearn.metrics import accuracy_score

In [45]:
# accuracy_score expects (y_true, y_pred); predict class 0 for every row
accuracy_score(y_test, np.zeros_like(y_test))

0.6053811659192825
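scikit-learn also provides `DummyClassifier`, which wraps the same "always predict the most common class" idea in the usual `fit`/`score` interface. This is an alternative sketch, not the approach used in the cell above. The label counts are reconstructed from the outputs in this lesson (414 and 254 in training, 135 and 88 in test), and the feature arrays are placeholders because the classifier ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Label counts reconstructed from the outputs above (an assumption).
y_train = np.array([0] * 414 + [1] * 254)
y_test = np.array([0] * 135 + [1] * 88)

# DummyClassifier ignores the features, so placeholder columns suffice.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Learn the most frequent class from the training labels only.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
print(baseline_accuracy)
```

Learning the majority class from `y_train` instead of hard-coding zeros keeps the baseline honest if the class balance ever changes.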

### How does our LogisticRegression model perform compared to the baseline model?

### How could we try to improve our model?

## Summary

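You've seen how to use OneHotEncoder to dummy encode training and test data consistently.

You've built a baseline model, generated a confusion matrix, and practiced computing the recall.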

### Check for Understanding

- Why would you want to use OneHotEncoder instead of `pd.get_dummies()`?

- How do you use OneHotEncoder with the test data?