# OneHotEncoder

_By Jeff Hale_

---

## Learning Objectives
By the end of this lesson students will be able to:

- Understand when you would want to use OneHotEncoder 
- Use OneHotEncoder to create dummy variables for training and test data
- Create a baseline model for a classification project
- Generate a confusion matrix
- Compute the sensitivity from a confusion matrix

---

## OneHotEncoder

`OneHotEncoder` dummy-encodes categorical variables for machine learning. It learns the categories from the training set with `fit` and then applies that same encoding to the test set with `transform`, so information from the test set does not leak into the training process.
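A minimal sketch of that idea on made-up data (the `toy_train` and `toy_test` frames and the `color` column are illustrative, not part of the lesson). It avoids the dense-output argument and calls `.toarray()` instead, so it should not depend on the scikit-learn version:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the titanic set
toy_train = pd.DataFrame({'color': ['red', 'blue', 'red']})
toy_test = pd.DataFrame({'color': ['blue', 'green']})  # 'green' never appears in training

toy_ohe = OneHotEncoder(handle_unknown='ignore')

# fit_transform learns the categories from the training rows only
toy_train_encoded = toy_ohe.fit_transform(toy_train).toarray()

# transform reuses those categories; the unseen 'green' becomes a row of zeros
toy_test_encoded = toy_ohe.transform(toy_test).toarray()

learned_categories = list(toy_ohe.categories_[0])
print(learned_categories)
print(toy_test_encoded)
```

The encoder never sees `'green'` while fitting, so the test set cannot change which columns the model is trained on.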

### Read in titanic data from seaborn

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [None]:
df_titanic = sns.load_dataset('titanic')
df_titanic.head()

In [None]:
df_titanic.info()

## Split into X and y

Let's use `survived` for y and `sex` and `class` for X.

In [None]:
X = df_titanic[['sex', 'class']]
y = df_titanic['survived']

In [None]:
X.head()

In [None]:
y.head()

In [None]:
y.value_counts()

In [None]:
y.value_counts(normalize=True)

## Split into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

In [None]:
X_train.head(2)

In [None]:
X_test.head(2)

In [None]:
y_train.head(2)

In [None]:
y_test.head(2)

### Make an object from the OneHotEncoder class. 

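#### Warning! ☝️ The arguments are important here.

`sparse_output=False` returns a dense NumPy array rather than a sparse matrix, and `handle_unknown='ignore'` encodes any category that was not seen during fitting as all zeros rather than raising an error.

In [None]:
# `sparse_output` replaced `sparse` in scikit-learn 1.2; on older versions use `sparse=False`
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe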

### Save the fit and transformed training data

In [None]:
# Learn the categories from the training data only, then encode it (OneHotEncoder ignores y)
X_train_dummified = ohe.fit_transform(X_train)
X_train_dummified

### Save the transformed `X_test`

In [None]:
# Reuse the categories learned from X_train; do not refit on the test data
X_test_dummified = ohe.transform(X_test)
X_test_dummified

In [None]:
# For comparison: pandas dummy encoding of the same training data
pd.get_dummies(X_train)
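`pd.get_dummies` builds its columns from whatever data it is handed, so calling it separately on training and test data can produce different columns. A small sketch with invented data (the `deck` column and its values are illustrative only):

```python
import pandas as pd

# Hypothetical toy data: the test rows contain a category the training rows lack
toy_train = pd.DataFrame({'deck': ['A', 'B', 'A']})
toy_test = pd.DataFrame({'deck': ['B', 'C']})

train_dummies = pd.get_dummies(toy_train)
test_dummies = pd.get_dummies(toy_test)

train_columns = list(train_dummies.columns)
test_columns = list(test_dummies.columns)

# The two frames no longer line up, so a model fit on one cannot score the other
columns_match = train_columns == test_columns
print(train_columns, test_columns, columns_match)
```

A fitted `OneHotEncoder` avoids this because `transform` always emits the columns learned during `fit`.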

## Make a LogisticRegression model

In [None]:
logreg = LogisticRegression()


### Fit the model

In [None]:
logreg.fit(X_train_dummified, y_train)

### Create the model predictions

In [None]:
preds = logreg.predict(X_test_dummified)

In [None]:
preds

In [None]:
logreg.score(X_test_dummified, y_test)

### Generate the confusion matrix

In [None]:
confusion_matrix(y_test, preds)

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
tn

In [None]:
fp

In [None]:
fn

In [None]:
tp
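For binary labels, `confusion_matrix` puts the true classes in rows and the predicted classes in columns, so `.ravel()` returns the counts in the order tn, fp, fn, tp. A quick check on hand-made labels (the `toy_true` and `toy_pred` lists are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: three actual negatives, four actual positives
toy_true = [0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 1, 0, 1, 1, 0, 1]

toy_cm = confusion_matrix(toy_true, toy_pred)

# Row 0 holds the actual negatives (tn, fp); row 1 holds the actual positives (fn, tp)
toy_tn, toy_fp, toy_fn, toy_tp = (int(v) for v in toy_cm.ravel())
print(toy_cm)
print(toy_tn, toy_fp, toy_fn, toy_tp)
```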

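### Plot the confusion matrix with ConfusionMatrixDisplay

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
# `plot_confusion_matrix` was removed in scikit-learn 1.2; `ConfusionMatrixDisplay.from_estimator` replaces it
ConfusionMatrixDisplay.from_estimator(logreg, X_test_dummified, y_test, values_format='.5g')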

Accuracy = 73%

#### Compute the True Positive Rate

In [None]:
tp / (tp + fn)

#### Compute the Sensitivity

#### Compute the Recall
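True positive rate, sensitivity, and recall are three names for the same quantity, tp / (tp + fn). A short sketch on invented labels (`toy_true` and `toy_pred` are hypothetical) that compares the hand calculation with `recall_score`:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: three actual negatives, four actual positives
toy_true = [0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 1, 0, 1, 1, 0, 1]

# Count true positives and false negatives by hand
toy_tp = sum(1 for t, p in zip(toy_true, toy_pred) if t == 1 and p == 1)
toy_fn = sum(1 for t, p in zip(toy_true, toy_pred) if t == 1 and p == 0)

toy_sensitivity = toy_tp / (toy_tp + toy_fn)
toy_recall = recall_score(toy_true, toy_pred)
print(toy_sensitivity, toy_recall)
```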

#### Compute the Precision

In [None]:
tp / (tp + fp)

#### Compute the Specificity

In [None]:
tn / (tn + fp)
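scikit-learn has no dedicated specificity function that this lesson relies on, but specificity is the recall of the negative class, so one option is `recall_score` with `pos_label=0`. A sketch on invented labels (`toy_true` and `toy_pred` are hypothetical):

```python
from sklearn.metrics import recall_score

# Hypothetical labels: three actual negatives, four actual positives
toy_true = [0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 1, 0, 1, 1, 0, 1]

# Hand calculation: tn / (tn + fp)
toy_tn = sum(1 for t, p in zip(toy_true, toy_pred) if t == 0 and p == 0)
toy_fp = sum(1 for t, p in zip(toy_true, toy_pred) if t == 0 and p == 1)
toy_specificity_by_hand = toy_tn / (toy_tn + toy_fp)

# Treating class 0 as the "positive" label makes recall equal to specificity
toy_specificity = recall_score(toy_true, toy_pred, pos_label=0)
print(toy_specificity_by_hand, toy_specificity)
```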

In [None]:
from sklearn.metrics import recall_score, precision_score

In [None]:
recall_score(y_test, preds)

In [None]:
precision_score(y_test, preds)

In [None]:
from sklearn.metrics import classification_report

In [None]:
classification_report(y_test, preds, output_dict=True)

### Make the ROC curve
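One way to get the points of a ROC curve is `roc_curve`, which takes the true labels and a score for the positive class. The sketch below uses invented scores (`toy_true` and `toy_scores` are hypothetical); in this lesson the scores would presumably come from `logreg.predict_proba(X_test_dummified)[:, 1]`:

```python
from sklearn.metrics import roc_curve

# Hypothetical labels and positive-class scores
toy_true = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (false positive rate, true positive rate) point
fpr, tpr, thresholds = roc_curve(toy_true, toy_scores)
roc_points = list(zip(fpr.tolist(), tpr.tolist()))
print(roc_points)
```

Plotting `fpr` on the x-axis against `tpr` on the y-axis gives the curve.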

### What's the ROC AUC score
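`roc_auc_score` summarizes the ROC curve as a single number: the chance that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A sketch on invented scores (`toy_true` and `toy_scores` are hypothetical; the lesson's scores would presumably be `logreg.predict_proba(X_test_dummified)[:, 1]`):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and positive-class scores
toy_true = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

# Three of the four (positive, negative) pairs are ranked correctly
toy_auc = roc_auc_score(toy_true, toy_scores)
print(toy_auc)
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking.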

#### How good is that score?

## Baseline model

In [None]:
y_train.value_counts(normalize=True)

#### Predict the most common class every time.

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# Predict 0 (did not survive) for every passenger; arguments are (y_true, y_pred)
accuracy_score(y_test, np.zeros_like(y_test))
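The same baseline can also be built with scikit-learn's `DummyClassifier`, which learns the most common class from the training labels instead of having it hard-coded. A sketch on invented arrays (all `toy_` names are hypothetical; the features are ignored by this strategy):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical data: the features are placeholders, only the labels matter
toy_X_train = np.zeros((5, 1))
toy_y_train = np.array([0, 0, 0, 1, 1])
toy_X_test = np.zeros((4, 1))
toy_y_test = np.array([0, 1, 0, 0])

# 'most_frequent' always predicts the majority class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X_train, toy_y_train)

baseline_preds = baseline.predict(toy_X_test)
baseline_accuracy = accuracy_score(toy_y_test, baseline_preds)
print(baseline_preds, baseline_accuracy)
```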

### How does our LogisticRegression model perform compared to the baseline model?

### How could we try to improve our model?

## Summary

You've seen how to use OneHotEncoder.

You've practiced computing the recall.

### Check for Understanding

- Why would you want to use OneHotEncoder instead of pd.get_dummies()?

- How do you use OneHotEncoder with the test data?