 _Lambda School Data Science Unit 2_
 
 # Classification & Validation Sprint Challenge

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

#### For this Sprint Challenge, you'll predict whether a person's income exceeds $50k/yr, based on census data.

You can read more about the Adult Census Income dataset at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/adult

#### Run this cell to load the data:

In [16]:
import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.dummy import DummyClassifier

In [1]:
columns = ['age', 
           'workclass', 
           'fnlwgt', 
           'education', 
           'education-num', 
           'marital-status', 
           'occupation', 
           'relationship', 
           'race', 
           'sex', 
           'capital-gain', 
           'capital-loss', 
           'hours-per-week', 
           'native-country', 
           'income']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                 header=None, names=columns)

df['income'] = df['income'].str.strip()

In [7]:
df.sample(5)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22400 | 26 | Private | 375980 | HS-grad | 9 | Separated | Sales | Unmarried | Black | Female | 0 | 0 | 37 | United-States | <=50K |
| 24089 | 35 | Self-emp-inc | 111319 | Assoc-acdm | 12 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 1887 | 45 | United-States | >50K |
| 30005 | 23 | Private | 140414 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 14395 | 42 | Local-gov | 201723 | Some-college | 10 | Never-married | Craft-repair | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 16619 | 36 | Private | 175360 | Masters | 14 | Never-married | Adm-clerical | Not-in-family | White | Male | 13550 | 0 | 50 | United-States | >50K |


In [10]:
df['inc_bin'] = df['income'].replace({'<=50K':0, '>50K':1})

## Part 1 — Begin with baselines

Split the data into an **X matrix** (all the features) and **y vector** (the target).

(You _don't_ need to split the data into train and test sets here. You'll be asked to do that at the _end_ of Part 1.)

In [13]:
X = df.drop(columns=['income', 'inc_bin'])
y = df['inc_bin']

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27924 | 42 | Private | 198619 | Assoc-voc | 11 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States |
| 30047 | 32 | Federal-gov | 90653 | HS-grad | 9 | Never-married | Exec-managerial | Unmarried | White | Female | 0 | 1380 | 40 | United-States |
| 21874 | 51 | Private | 95946 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 50 | United-States |
| 853 | 26 | Private | 96467 | Bachelors | 13 | Never-married | Prof-specialty | Not-in-family | White | Female | 0 | 0 | 40 | United-States |
| 595 | 27 | Private | 267147 | HS-grad | 9 | Never-married | Sales | Own-child | White | Male | 0 | 0 | 40 | United-States |


What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You can answer this question either with a scikit-learn function or with a pandas function.)

In [14]:
# 0 means income <=50K, 1 means income >50K

y.value_counts()

0    24720
1     7841
Name: inc_bin, dtype: int64

In [17]:
# Dummy classifier to make prediction based on most frequent

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X,y)

DummyClassifier(constant=None, random_state=None, strategy='most_frequent')

In [18]:
# Print prediction

dummy_pre = dummy.predict(X)
accuracy_score(y, dummy_pre)

0.7591904425539756

Doing the math manually: 
24720 / (24720 + 7841) = .75919
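
The same number can also be reached with pandas alone. The sketch below is a minimal illustration that rebuilds a target vector from the class counts printed above rather than downloading the dataset; the name `y_counts` is hypothetical and stands in for `y`.

```python
import pandas as pd

# Stand-in for y, rebuilt from the class counts shown above
# (24720 rows labelled 0 and 7841 rows labelled 1).
y_counts = pd.Series([0] * 24720 + [1] * 7841, name='inc_bin')

# The majority class baseline always predicts the most frequent class,
# so its accuracy equals the largest class proportion.
majority_accuracy = y_counts.value_counts(normalize=True).max()
print('Majority class baseline accuracy:', majority_accuracy)
```

On the real data, `y.value_counts(normalize=True).max()` should give the same value as the `DummyClassifier` approach.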

What **ROC AUC score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of ROC AUC.)

In [180]:
roc_auc_score(y, dummy_pre)

0.5
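
The 0.5 can also be reasoned out without scikit-learn. ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. A majority class baseline gives every row the same score, so every pair is a tie. The sketch below is a plain-Python illustration of that idea; the helper `pairwise_auc` and the toy class counts are assumptions made for the example, not part of the challenge.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. This is quadratic in the number of rows,
    so it is only meant for small illustrative inputs.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy target with roughly the same class imbalance as the census data.
y_small = [0] * 76 + [1] * 24

# A majority class baseline predicts 0 for every row.
constant_scores = [0] * len(y_small)

baseline_auc = pairwise_auc(y_small, constant_scores)
print('Baseline ROC AUC:', baseline_auc)
```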

In this Sprint Challenge, you will use **"Cross-Validation with Independent Test Set"** for your model validation method.

First, **split the data into `X_train, X_test, y_train, y_test`**. You can include 80% of the data in the train set, and hold out 20% for the test set.

In [169]:
from sklearn.model_selection import train_test_split

# Select 6 feature columns (3 categorical, 3 numeric), then do an 80/20 train/test split

X_train, X_test, y_train, y_test = train_test_split(X[['workclass','education','sex','age',
                                                       'fnlwgt','hours-per-week']], 
                                                    y, train_size=.80, test_size=.20, random_state=42)

## Part 2 — Modeling with Logistic Regression!

- You may do exploratory data analysis and visualization, but it is not required.
- You may **use all the features, or select any features** of your choice, as long as you select at least one numeric feature and one categorical feature.
- **Scale your numeric features**, using any scikit-learn [Scaler](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) of your choice.
- **Encode your categorical features**. You may use any encoding (One-Hot, Ordinal, etc) and any library (category_encoders, scikit-learn, pandas, etc) of your choice.
- You may choose to use a pipeline, but it is not required.
- Use a **Logistic Regression** model.
- Use scikit-learn's [**cross_val_score**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function. For [scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules), use **accuracy**.
- **Print your model's cross-validation accuracy score.**

In [59]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.preprocessing import MaxAbsScaler, Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

from category_encoders import OneHotEncoder

In [32]:
df['education'].value_counts()

 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: education, dtype: int64

In [170]:
# Make pipeline to do my encoding and scaling:

pipeline = make_pipeline(
    OneHotEncoder(use_cat_names=True), 
    MaxAbsScaler(), 
    LogisticRegression(solver='lbfgs', max_iter=1000))

In [174]:
# Fit the model

pipeline.fit(X_train, y_train);

In [175]:
# Generate predictions for test data

y_pre_lr = pipeline.predict(X_test)

In [176]:
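# Get cross-validation accuracy scores.
# Note: this call cross-validates on the held-out test split, which is what
# produced the scores below; "Cross-Validation with Independent Test Set"
# would normally pass X_train, y_train here instead.

cross_val_score(pipeline, X_test, y_test, cv=3, scoring='accuracy')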

array([0.80616943, 0.80423768, 0.79861751])
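
For reference, "Cross-Validation with Independent Test Set" is commonly arranged so that `cross_val_score` only ever sees the training split, and the test split is scored once at the end. The sketch below shows that arrangement on synthetic data from `make_classification`, not on the census data; `StandardScaler` and all the `demo`-style names are assumptions chosen for the illustration, and the scores it prints say nothing about the income model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, purely numeric features).
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=42)

# Hold out 20% as an independent test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, test_size=0.20, random_state=42)

demo_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='lbfgs', max_iter=1000))

# Cross-validate on the training split only.
cv_scores = cross_val_score(demo_pipeline, X_tr, y_tr, cv=3, scoring='accuracy')
print('Cross-validation accuracy:', cv_scores.mean())

# Refit on the full training split, then score the test split once.
demo_pipeline.fit(X_tr, y_tr)
test_accuracy = demo_pipeline.score(X_te, y_te)
print('Test accuracy:', test_accuracy)
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no statistics from the validation fold leak into preprocessing.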

In [177]:
# Get accuracy score

accuracy = accuracy_score(y_test, y_pre_lr)

In [178]:
print('Accuracy_score: ', accuracy)

Accuracy_score:  0.8056195301704284


In [179]:
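# For extra, get ROC AUC score.
# Note: this scores the hard 0/1 predictions; ROC AUC is usually computed
# from predicted probabilities (pipeline.predict_proba(X_test)[:, 1]).

print('ROC AUC Score: ', roc_auc_score(y_test, y_pre_lr))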

ROC AUC Score:  0.6617662401360557
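
ROC AUC measures how well a model ranks examples, so it is usually fed continuous scores such as the positive-class probability rather than thresholded labels. The sketch below compares the two inputs on synthetic data; the data, the `demo`-style names, and the class weights are assumptions for illustration only, and no particular ordering of the two scores is guaranteed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.76, 0.24], random_state=42)

# Stratify so both classes appear in the test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC AUC from hard 0/1 labels: the ROC curve has a single operating point.
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# ROC AUC from positive-class probabilities: the full ranking is used.
auc_from_probs = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print('ROC AUC from labels:       ', auc_from_labels)
print('ROC AUC from probabilities:', auc_from_probs)
```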


## Part 3 — Modeling with Tree Ensembles!

Part 3 is the same as Part 2, except this time, use a **Random Forest** or **Gradient Boosting** classifier. You may use scikit-learn, xgboost, or any other library. Then, print your model's cross-validation accuracy score.

In [76]:
from sklearn.ensemble import RandomForestClassifier

In [209]:
# Set up similar pipeline with random forest from sklearn

pipelinerf = make_pipeline(
    OneHotEncoder(use_cat_names=True), 
    MaxAbsScaler(), 
    RandomForestClassifier(max_depth=6, n_estimators=800)
)


In [210]:
# Fit the model

pipelinerf.fit(X_train, y_train);

In [211]:
# Get predictions

y_pre_rf = pipelinerf.predict(X_test)

In [219]:
# Get cross-validation accuracy scores (computed on the test split, as in Part 2)

cross_val_score(pipelinerf, X_test, y_test, cv=3, scoring='accuracy')

array([0.7946593 , 0.79548595, 0.79631336])

In [212]:
# Accuracy score

accuracy_rf = accuracy_score(y_test, y_pre_rf)

In [213]:
print('Accuracy_score: ', accuracy_rf)

Accuracy_score:  0.7937970213419315


In [220]:
# Gradient boost to see if I can get better accuracy

from xgboost import XGBClassifier


pipelinegb = make_pipeline(
    OneHotEncoder(use_cat_names=True), 
    MaxAbsScaler(), 
    XGBClassifier(max_depth=5, n_estimators=80)
)

In [221]:
pipelinegb.fit(X_train, y_train);

In [222]:
y_pre_gb = pipelinegb.predict(X_test)

In [223]:
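# Cross-validation accuracy scores for the XGBoost pipeline (computed on the test split)
cross_val_score(pipelinegb, X_test, y_test, cv=3, scoring='accuracy')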

array([0.80432781, 0.81667434, 0.80967742])

In [224]:
accuracy_gb = accuracy_score(y_test, y_pre_gb)
print('Accuracy_score: ', accuracy_gb)

Accuracy_score:  0.8136035621065562


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <td colspan="2" rowspan="2"></td>
    <td colspan="2">Predicted</td>
  </tr>
  <tr>
    <td>Negative</td>
    <td>Positive</td>
  </tr>
  <tr>
    <td rowspan="2">Actual</td>
    <td>Negative</td>
    <td style="border: solid">85</td>
    <td style="border: solid">58</td>
  </tr>
  <tr>
    <td>Positive</td>
    <td style="border: solid">8</td>
    <td style="border: solid"> 36</td>
  </tr>
</table>

Calculate accuracy

In [102]:
print('Accuracy: ', (85 + 36) / (85 + 58 + 8 + 36))

Accuracy:  0.6470588235294118


Calculate precision

In [103]:
print('Precision: ', 36 / (58 + 36))

Precision:  0.3829787234042553


Calculate recall

In [104]:
print('Recall', 36 / (8 + 36))

Recall 0.8181818181818182
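
The three calculations above are easier to check when each cell of the confusion matrix has a name. The sketch below repeats them with the counts from the table; the variable names `tn`, `fp`, `fn`, and `tp` are introduced here for readability.

```python
# Cells of the confusion matrix from the table above.
tn = 85  # actual negative, predicted negative
fp = 58  # actual negative, predicted positive
fn = 8   # actual positive, predicted negative
tp = 36  # actual positive, predicted positive

# Accuracy: share of all predictions that are correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: share of predicted positives that are actually positive.
precision = tp / (tp + fp)

# Recall: share of actual positives that were predicted positive.
recall = tp / (tp + fn)

print('Accuracy: ', accuracy)
print('Precision:', precision)
print('Recall:   ', recall)
```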


## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Experiment with feature selection, preprocessing, categorical encoding, and hyperparameter optimization, to try improving your cross-validation score.

### Part 3
Which model had the best cross-validation score? Refit this model on the train set and do a final evaluation on the held out test set — what is the test score? 

### Part 4
Calculate F1 score and False Positive Rate. 
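
One way to approach this bonus item, sketched with the Part 4 confusion matrix counts: F1 is the harmonic mean of precision and recall, and the false positive rate is the share of actual negatives that were predicted positive. The variable names are introduced here for illustration.

```python
# Cells of the Part 4 confusion matrix.
tn, fp, fn, tp = 85, 58, 8, 36

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 score: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# False positive rate: actual negatives that were wrongly flagged positive.
false_positive_rate = fp / (fp + tn)

print('F1 score:', round(f1, 4))                             # 0.5217
print('False positive rate:', round(false_positive_rate, 4))  # 0.4056
```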