 _Lambda School Data Science Unit 2_
 
 # Classification & Validation Sprint Challenge

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

#### For this Sprint Challenge, you'll predict whether a person's income exceeds $50k/yr, based on census data.

You can read more about the Adult Census Income dataset at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/adult

#### Run this cell to load the data:

In [74]:
import category_encoders as ce
import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import warnings


warnings.simplefilter(action='ignore', category=FutureWarning)

columns = ['age', 
           'workclass', 
           'fnlwgt', 
           'education', 
           'education-num', 
           'marital-status', 
           'occupation', 
           'relationship', 
           'race', 
           'sex', 
           'capital-gain', 
           'capital-loss', 
           'hours-per-week', 
           'native-country', 
           'income']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                 header=None, names=columns)

df['income'] = df['income'].str.strip()

## Part 1 — Begin with baselines

Split the data into an **X matrix** (all the features) and **y vector** (the target).

(You _don't_ need to split the data into train and test sets here. You'll be asked to do that at the _end_ of Part 1.)

In [75]:
features = df.drop(columns='income').columns.tolist()
target = 'income'
X = df[features]
y = df[target]

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You can answer this question either with a scikit-learn function or with a pandas function.)

In [76]:
# Get the proportion of observations for each class in the target
# The majority class baseline accuracy is the largest of those proportions
print(y.value_counts(normalize=True))
print('Majority class baseline accuracy: ', y.value_counts(normalize=True).max())

<=50K    0.75919
>50K     0.24081
Name: income, dtype: float64
Majority class baseline accuracy:  0.7591904425539756
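
The same baseline can also be obtained with a scikit-learn function. A minimal sketch, assuming a small made-up label array stands in for `y` (the names `y_toy`, `X_toy`, and `baseline` are illustrative and not part of the notebook):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy stand-in for the target: 3 of 4 labels belong to the majority class
y_toy = np.array(['<=50K', '<=50K', '<=50K', '>50K'])
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# 'most_frequent' always predicts the majority class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_accuracy)
```

On the real data, fitting the same `DummyClassifier` on `X` and `y` should reproduce the majority class proportion printed above.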


What **ROC AUC score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of ROC AUC.)

In [77]:
# ROC AUC is the area under the ROC curve, which plots TPR against FPR across thresholds
# y_true is the array of true binary labels
# y_score is the array of scores (e.g. probability estimates) for the positive class
# A majority class baseline gives every observation the same score, so it cannot rank them

print('y_true: ', y.values[:5])
print('y_score: ', ([y.value_counts(normalize=True).max()]*len(y))[:5])
roc_auc_score(y.values, [y.value_counts(normalize=True).max()]*len(y))

y_true:  ['<=50K' '<=50K' '<=50K' '<=50K' '<=50K']
y_score:  [0.7591904425539756, 0.7591904425539756, 0.7591904425539756, 0.7591904425539756, 0.7591904425539756]


0.5
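
Why a constant score gives exactly 0.5: ROC AUC equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative, with ties counted as one half. When every score is identical, every pair is a tie. A small pure-Python sketch of that pairwise definition (the helper `pairwise_auc` and the toy labels are illustrative, not part of the notebook):

```python
def pairwise_auc(y_true, y_score, pos_label):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == pos_label]
    neg = [s for t, s in zip(y_true, y_score) if t != pos_label]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_toy = ['<=50K', '<=50K', '<=50K', '>50K']

# Every observation gets the same score, as with a majority class baseline
constant_auc = pairwise_auc(y_toy, [0.76] * len(y_toy), pos_label='>50K')

# A score that ranks the single positive above all negatives
perfect_auc = pairwise_auc(y_toy, [0.1, 0.2, 0.3, 0.9], pos_label='>50K')

print(constant_auc, perfect_auc)
```

The loop over all pairs is quadratic, so this is only meant to illustrate the definition on small inputs.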

In this Sprint Challenge, you will use **"Cross-Validation with Independent Test Set"** for your model validation method.

First, **split the data into `X_train, X_test, y_train, y_test`**. You can include 80% of the data in the train set, and hold out 20% for the test set.

In [78]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.20, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((26048, 14), (6513, 14), (26048,), (6513,))

## Part 2 — Modeling with Logistic Regression!

- You may do exploratory data analysis and visualization, but it is not required.
- You may **use all the features, or select any features** of your choice, as long as you select at least one numeric feature and one categorical feature.
- **Scale your numeric features**, using any scikit-learn [Scaler](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) of your choice.
- **Encode your categorical features**. You may use any encoding (One-Hot, Ordinal, etc) and any library (category_encoders, scikit-learn, pandas, etc) of your choice.
- You may choose to use a pipeline, but it is not required.
- Use a **Logistic Regression** model.
- Use scikit-learn's [**cross_val_score**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function. For [scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules), use **accuracy**.
- **Print your model's cross-validation accuracy score.**

In [79]:
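# Explore the data, look for multicollinearity in numeric features
# Select numeric columns explicitly; newer pandas versions no longer drop non-numeric columns silently
X_train.select_dtypes('number').corr()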

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| age | 1.0 | -0.07358 | 0.038749 | 0.079644 | 0.056354 | 0.074627 |
| fnlwgt | -0.07358 | 1.0 | -0.041892 | 0.001656 | -0.010029 | -0.018145 |
| education-num | 0.038749 | -0.041892 | 1.0 | 0.123857 | 0.083419 | 0.145229 |
| capital-gain | 0.079644 | 0.001656 | 0.123857 | 1.0 | -0.031766 | 0.077219 |
| capital-loss | 0.056354 | -0.010029 | 0.083419 | -0.031766 | 1.0 | 0.051329 |
| hours-per-week | 0.074627 | -0.018145 | 0.145229 | 0.077219 | 0.051329 | 1.0 |


In [80]:
# Examine categorical data, check ordinality 
X_train.describe(exclude='number')

| | workclass | education | marital-status | occupation | relationship | race | sex | native-country |
|---|---|---|---|---|---|---|---|---|
| count | 26048 | 26048 | 26048 | 26048 | 26048 | 26048 | 26048 | 26048 |
| unique | 9 | 16 | 7 | 15 | 6 | 5 | 2 | 42 |
| top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States |
| freq | 18118 | 8416 | 12026 | 3312 | 10603 | 22221 | 17403 | 23300 |


In [81]:
# Make pipeline / Encode categoricals
preprocessor = make_pipeline(ce.BinaryEncoder(drop_invariant=True), StandardScaler())

X_train_encoded = preprocessor.fit_transform(X_train)
X_train_encoded = pd.DataFrame(X_train_encoded)

X_train_encoded.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.408756 | -0.025555 | -0.436206 | -2.369145 | 1.917634 | 0.080051 | -0.039705 | -0.473168 | -1.335366 | -0.959732 | ... | 1.418827 | -0.145715 | -0.217998 | 0.77946 | -0.098838 | -0.167153 | -0.18896 | -0.187209 | -0.282171 | 0.234889 |
| 1 | -0.188857 | -0.025555 | -0.436206 | 0.422093 | -0.521476 | -0.981653 | -0.039705 | -0.473168 | -1.335366 | 1.041957 | ... | -0.704807 | -0.145715 | 4.457168 | 0.77946 | -0.098838 | -0.167153 | -0.18896 | -0.187209 | -0.282171 | 0.234889 |
| 2 | 1.423734 | -0.025555 | -0.436206 | 0.422093 | 1.917634 | 0.126197 | -0.039705 | -0.473168 | -1.335366 | 1.041957 | ... | -0.704807 | -0.145715 | -0.217998 | -0.03151 | -0.098838 | -0.167153 | -0.18896 | -0.187209 | -0.282171 | 0.234889 |
| 3 | -1.288351 | -0.025555 | -0.436206 | 0.422093 | -0.521476 | -0.090935 | -0.039705 | -0.473168 | -1.335366 | 1.041957 | ... | -0.704807 | -0.145715 | -0.217998 | 0.455072 | -0.098838 | -0.167153 | -0.18896 | -0.187209 | -0.282171 | 0.234889 |
| 4 | -0.848554 | -0.025555 | -0.436206 | 0.422093 | -0.521476 | 0.856334 | -0.039705 | -0.473168 | 0.748858 | -0.959732 | ... | -0.704807 | -0.145715 | -0.217998 | -0.03151 | -0.098838 | -0.167153 | -0.18896 | -0.187209 | -0.282171 | 0.234889 |


In [82]:
# Instantiate and fit a Logistic Regression model
log_reg = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)
log_reg.fit(X_train_encoded, y_train)

# Get cross validation accuracy 
cv_accuracy_scores = cross_val_score(log_reg, X_train_encoded, y_train, n_jobs=-1, scoring='accuracy', cv=5)
print('Cross Validation Accuracy Scores: ', cv_accuracy_scores)

Cross Validation Accuracy Scores:  [0.84452975 0.84664107 0.84606526 0.84123632 0.84622768]
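
Note that the preprocessor above is fit on all of `X_train` before cross-validation, so each validation fold influences the encoding and scaling. One way to avoid this, sketched here under assumptions, is to put the preprocessing inside the estimator passed to `cross_val_score`. The synthetic DataFrame, the two columns, and the use of `OneHotEncoder` (instead of the notebook's `BinaryEncoder`) are stand-ins for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: one numeric and one categorical feature
rng = np.random.RandomState(42)
n = 200
toy_X = pd.DataFrame({
    'age': rng.randint(18, 70, size=n),
    'sex': rng.choice(['Male', 'Female'], size=n),
})
# Target loosely tied to age so the model has something to learn
toy_y = np.where(toy_X['age'] + rng.normal(0, 10, size=n) > 45, '>50K', '<=50K')

# Scale the numeric column and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
])

# Preprocessing is part of the pipeline, so it is refit on each training fold
pipeline = make_pipeline(preprocess, LogisticRegression(solver='lbfgs', max_iter=1000))
scores = cross_val_score(pipeline, toy_X, toy_y, scoring='accuracy', cv=5)
print('Mean CV accuracy:', scores.mean())
```

With this structure, the same `pipeline` object could later be fit on the full train set and scored on the held-out test set without any separate transform step.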


## Part 3 — Modeling with Tree Ensembles!

Part 3 is the same as Part 2, except this time, use a **Random Forest** or **Gradient Boosting** classifier. You may use scikit-learn, xgboost, or any other library. Then, print your model's cross-validation accuracy score.

In [83]:
# Random Forest Classifier with sigmoid (Platt) calibration
# cv=None uses the default number of folds; for a classifier the folds are
# stratified (sklearn.model_selection.StratifiedKFold)
rf_clf = RandomForestClassifier(max_depth=5, n_estimators=100, 
                                n_jobs=-1, random_state=42)
rf_clf_sigmoid = CalibratedClassifierCV(rf_clf, cv=None, method='sigmoid')
rf_clf_sigmoid.fit(X_train_encoded, y_train)

CalibratedClassifierCV(base_estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
            cv=None, method='sigmoid')

In [84]:
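# Still using X_train_encoded data for model evaluation
# The ROC AUC below is computed on the same data the model was fit on,
# so it is an optimistic estimate rather than a validation score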
rf_clf_pos_prob = rf_clf_sigmoid.predict_proba(X_train_encoded)[:,1]
print('ROC AUC:\t', roc_auc_score(y_train, rf_clf_pos_prob))

cv_accuracy_scores = cross_val_score(rf_clf_sigmoid, X_train_encoded, y_train, cv=5)
print('CV Accuracy Scores: ', cv_accuracy_scores)

ROC AUC:	 0.8973490432799287
CV Accuracy Scores:  [0.84587332 0.8512476  0.85028791 0.85025917 0.85659436]


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

Calculate accuracy

In [85]:
TP = 36
TN = 85
FP = 58
FN = 8

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)

0.6470588235294118


Calculate precision

In [86]:
precision = (TP) / (TP + FP)
print(precision)

0.3829787234042553


Calculate recall

In [87]:
recall = (TP) / (TP + FN)
print(recall)

0.8181818181818182


## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Experiment with feature selection, preprocessing, categorical encoding, and hyperparameter optimization, to try improving your cross-validation score.

### Part 3
Which model had the best cross-validation score? Refit this model on the train set and do a final evaluation on the held out test set — what is the test score? 

### Part 4
Calculate F1 score and False Positive Rate. 

In [88]:
### Part 1 ###

# This time select only non-numeric features
features = df.drop(columns='income').select_dtypes('object').columns.tolist()
target = 'income'
X = df[features]
y = df[target]

print(features)
df.describe(exclude='number')

['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']


| | workclass | education | marital-status | occupation | relationship | race | sex | native-country | income |
|---|---|---|---|---|---|---|---|---|---|
| count | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561 |
| unique | 9 | 16 | 7 | 15 | 6 | 5 | 2 | 42 | 2 |
| top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | <=50K |
| freq | 22696 | 10501 | 14976 | 4140 | 13193 | 27816 | 21790 | 29170 | 24720 |


In [89]:
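# Create a preprocessing pipeline (ordinal-encode the categoricals, then scale)
# Note: it is fit on all of X before the split, so the test rows influence the encoding and scaling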
preprocessor = make_pipeline(ce.OrdinalEncoder(), StandardScaler())

df_X = preprocessor.fit_transform(X)
df_X = pd.DataFrame(df_X)
df_y = y

X_train, X_test, y_train, y_test = \
train_test_split(df_X, df_y, test_size=0.2, random_state=42)


In [90]:
from xgboost import XGBClassifier

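# Same max_depth and n_estimators as the Random Forest classifier
model = XGBClassifier(max_depth=5, n_estimators=100)
model.fit(X_train, y_train)

# Note: this cross-validates on the full dataset (df_X, df_y), not only the train split,
# using the default number of folds
cross_val_score(model, df_X, df_y)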

array([0.82835821, 0.8355445 , 0.83599005])
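
Bonus Part 4 asks for the F1 score and the False Positive Rate. A short sketch using the counts from the confusion matrix in Part 4 above: F1 is the harmonic mean of precision and recall, and the False Positive Rate is the share of actual negatives that were predicted positive.

```python
# Counts from the Part 4 confusion matrix
TP, TN, FP, FN = 36, 85, 58, 8

precision = TP / (TP + FP)
recall = TP / (TP + FN)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# False Positive Rate: actual negatives incorrectly predicted positive
fpr = FP / (FP + TN)

print('F1 score:', round(f1, 4))
print('False Positive Rate:', round(fpr, 4))
```

With these counts, F1 works out to 72/138 (about 0.5217) and the False Positive Rate to 58/143 (about 0.4056).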