<h1 style="text-align: center;">Practical 3: Towards models in the real world <br> SVMs, Model Selection, Pipelines, Standardization & Category Encoding</h1>

**This week's practical will build on last weeks. At the end of this practical you should be able to:**
1. Undertake classification tasks with SVMs. This includes both linear and non-linear versions.
2. Extend models such as SVMs to handle categorical features encoded as strings via Pipelines and OneHot encoding within a deployment scenario.
3. Perform label encoding / decoding to define target classes to take advantage of sklearns inbuilt evaluation measures (precision, recall etc) while maintaining human readable labels.
4. Undertake semi-automatic model selection as part of an "extended classifier" pipeline while still achieving unbiased estimates of final model performance.

## Before you start...
You'll want to use the same OneHotEncoder that we used last week. If you haven't already installed it, in a terminal type:<br>
`sudo pip3 install category_encoders`

And then restart your kerenel if you have already started jupyter.

If you are using colaboratory, then please include the line:<br>
`!pip install category_encoders`
at the top of the notebook.

[The documentation for this package](https://contrib.scikit-learn.org/category_encoders/onehot.html).


<br>Recall you have to choose how to deal with unknown values. I.e. values in our test data which were not in our training data. How they are dealt with via the parameter `handle_unknown` or . <br>
There are three options:<br>
`handle_unknown = 'value'`(default): will encode a new value as 0 in every dummy column. ‘indicator’ <br>
`handle_unknown = 'error'`: will raise a ValueError at transform time if there are new categories.<br>
`handle_unknown = 'return_nan'`: will encode a new value as np.nan in every dummy column.<br>
`handle_unknown = 'indicator'`: will add an additional dummy column (in both training and test data).

Please use `handle_unknown = 'indicator'` in this practical.

## The task.

Task: Predict whether a person makes over $50k per year from census data known about them.

Data set from the paper: Kohavi, Ron. "Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid." KDD. Vol. 96. 1996.
Data URL: We will be using modified versions of the publically avaliable data. Please download the data from the URLs provided. 

**Output Feature:** 

Feature | type | values
:-------:|:--------:|:--------:|
salary | categorical | >50K, <=50K| 

**Input features**

|     Feature    |     Type    |                                                                                                                                                                                                              Values                                                                                                                                                                                                             |
|:--------------:|:-----------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
|       age      |  continuous |                                                                                                                                                                                                                                                                                                                                                                                                                               |
|    workclass   | categorical |                                                                                                                                                              Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked                                                                                                                                                              |
|     fnlwgt     |  continuous |                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|    education   | categorical |                                                                                                                                      Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.                                                                                                                                     |
|  education-num |  continuous |                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| marital-status | categorical |                                                                                                                                                            Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.                                                                                                                                                           |
|   occupation   | categorical |                                                                                                    Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.                                                                                                    |
|  relationship  | categorical |                                                                                                                                                                               Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.                                                                                                                                                                               |
|      race      | categorical |                                                                                                                                                                                   White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.                                                                                                                                                                                  |
|       sex      | categorical |                                                                                                                                                                                                          Female, Male.                                                                                                                                                                                                          |
|  capital-gain  |  continuous |                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|  capital-loss  |  continuous |                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| hours-per-week |  continuous |                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| native-country | categorical | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. |



## The final flow (for developing and testing the model)

** 1. Data Preparation **
* Load data, provide headings, ensure correct dtypes
* Setup the label encoding and generate the labels
* Split the data into a training and test set

**2. Model setup & evaluation **
* The baseline (majority) learner
* The logistic regression baseline, embed in a pipeline with OneHot encoding, standardization and grid search
* The linear SVM, embed in a pipeline with OneHot encoding, standardization and grid search
* The non-linear (rbf) SVM, embed in a pipeline with OneHot encoding, standardization and grid search
* The non-linear (polynomial) SVM, embed in a pipeline with OneHot encoding, standardization and grid search
* In all cases train the classifiers using the training set (automatically doing meta-parameter selection via cross-validation with a validation set) and evaluate on the test set

**3. Final training of best model **

**4. In the future: Checking model performance **
* Evaluate the performance of the final classifier on a completely new set of data hypothetically collected after your company has been using the model for awhile. 

* Evaluate the performance of all classifiers on the test set

## Think you know what you're doing and want to see if you can do it yourself? 
Start a new jupyter notebook and do it without help.

The data you, as an analyst, have been given to develop and test your model:<br> [http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.data.small](http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.data.small)

The data you as an analyst are given 3 months later to check how your model is performing. It contains both the input and output features:<br>[http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.test](http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.test)


## Still a little unsure?
Follow this tutorial, filling in the blanks.

## Done and back for more?
Implement a Random Forest / kNN / Gradient Boosting classifier. How does it compare?

# Following along? Let's begin....
**NOTE:** It is **expected** that you will have to google and **look up documentation to complete this practical**. This is how it will be in the real world.

# 1. Data 

The data you have been given as an analyst to develop and test your model is located at the following URL (use read_csv to directly load the data as a pandas DataFrame):<br>
[http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.data.small](http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.data.small)


In [2]:
# If not using colaboratory, comment the line below
!pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.3.0-py2.py3-none-any.whl (82 kB)
[?25l[K     |████                            | 10 kB 15.7 MB/s eta 0:00:01[K     |████████                        | 20 kB 11.6 MB/s eta 0:00:01[K     |████████████                    | 30 kB 8.5 MB/s eta 0:00:01[K     |████████████████                | 40 kB 5.7 MB/s eta 0:00:01[K     |████████████████████            | 51 kB 5.0 MB/s eta 0:00:01[K     |████████████████████████        | 61 kB 5.8 MB/s eta 0:00:01[K     |████████████████████████████    | 71 kB 6.3 MB/s eta 0:00:01[K     |███████████████████████████████▉| 81 kB 7.0 MB/s eta 0:00:01[K     |████████████████████████████████| 82 kB 332 kB/s 
Installing collected packages: category-encoders
Successfully installed category-encoders-2.3.0


In [3]:
# Put your imports here
import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from category_encoders.one_hot import OneHotEncoder
from sklearn.pipeline import Pipeline

# Read the data into a pandas DataFrame 
data = pandas.read_csv('http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.data.small', header = None, names = ['age','workclass','fnlwgt','education','education-num','matrial-status','occupation','relationship','race','sex','captial-gain','captial-loss','hours-per-week','native-counrty','salary'])

# Check the DataFrame, does it look correct?
data.head(5)

  import pandas.util.testing as tm


Unnamed: 0,age,workclass,fnlwgt,education,education-num,matrial-status,occupation,relationship,race,sex,captial-gain,captial-loss,hours-per-week,native-counrty,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [41]:
data.isnull().any()

age               False
workclass         False
fnlwgt            False
education         False
education-num     False
matrial-status    False
occupation        False
relationship      False
race              False
sex               False
captial-gain      False
captial-loss      False
hours-per-week    False
native-counrty    False
salary            False
dtype: bool

In [None]:
# If not fix it. Either here or when loading the csv or via a combination of both.

In [None]:
# Check the dtypes
data.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
matrial-status    object
occupation        object
relationship      object
race              object
sex               object
captial-gain       int64
captial-loss       int64
hours-per-week     int64
native-counrty    object
salary            object
dtype: object

In [None]:
# Do they look correct (i.e. do they match the documentation)? 
# Remember Strings will be listed as Objects. 
# If not fix them here.

In [4]:
# Define our input features and our output feature
# Call our input features X and our output feature y (the sklearn standard)
X = data.drop( columns = 'salary' ) 
y = data.salary # or y = data['salary']


In [5]:
y.head(10)

0    <=50K
1    <=50K
2    <=50K
3    <=50K
4    <=50K
5    <=50K
6    <=50K
7     >50K
8     >50K
9     >50K
Name: salary, dtype: object

In [6]:
# Now we need to encode our output feature to be an integer 0 or 1. 
# This is because we have a binary classification problem and in order to use sklearn's
# built-in evaluation measures we need to have one class defined as 1 (target) and one as 0 (non-target).

# We could do this by using the LabelEncoder from sklearn. The LabelEncoder will convert n-distinct values
# to 0,..,n-1 values in this case giving us what we want. We assume that our training set contains both
# labels and that this mapping will be valid. However, we have no control 
# over which value is represented by 1 and which is represented by 0.
# Therefore it is easier (in terms of subsequent interpretation) to do this
# manually. Recall the problem, we want our target variable (1) to be '>50k'

# To do this (your variable y is a pandas.Series object, use the replace method):
# 1) update all values '<=50K' within y to equal 0
# 2) update all values '>50K' within y to equal 1

y.replace(to_replace = '<=50K', value = 0, inplace = True)
y.replace(to_replace = '>50K', value = 1, inplace = True)

In [7]:
# Check your y varaible now only contains 0 and 1
y.unique()

array([0, 1])

In [8]:
# Since we are going to automatically tune our model parameters 
# we need to split our data into training and testing sets. 
# Do this here. 

# Keep 90% of the data for training 
# Use a random_state of 42

# Use the following variable names:
# X_train : The input features for the training set
# y_train : The output feature for the training set
# X_test : The input features for the test set
# y_test : The output feature for the test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=42)

# 2. Model setup

In [9]:
# Setup the majority classifier (DummyClassifier from sklearn)
#p.s. A simple majority classifier is one where every point is assigned to 
#whichever class is in the majority in the training set. 
#(If there is no majority, one of the classes is chosen arbitrarily). 
#This classifier is often used as a baseline for comparing other machine learning techniques.
# The dummy classifier will still need to be wrapped in a Onehot encoder pipeline.
# (sklearn consistantly doesn't like to handle categorical variables natively)
# We do not need to wrap it in a gridsearch as there are no parameters to tune

from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier

steps = [
    ('onehot',OneHotEncoder(handle_unknown='indicator')),  # will automatically pick string columns (could have specified)
    ('model',DummyClassifier())
]

dc = Pipeline( steps = steps )

In [10]:
# Train and test the dummy classifier
# Print the accuracy, precision and recall (sensitivity). 
# The methods for these are in sklearn.metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score

dc.fit(X_train,y_train)
dc_pred = dc.predict(X_test)
results = {}

results['dc:'] = 'Acc: {:.2f}, Prec: {:.2f}, Recall: {:.2f}'.format(accuracy_score(y_test, dc_pred),
                                                                    precision_score(y_test, dc_pred,zero_division=1),
                                                                    recall_score(y_test, dc_pred,zero_division=1)) 
print(results)

{'dc:': 'Acc: 0.76, Prec: 1.00, Recall: 0.00'}


In [1]:
# In the next few steps we are going to use GridSearch via cross-validation with a validation set.
# This will be done across three models (1) logistic regression, (2) linear SVM & (3) RBF SVM
# In order to be directly comparable the internal cross-validation should use the same random splits.
# Create this fixed random split (KFold object) here so we can re-use in the GridSearchCV.

from sklearn.model_selection import KFold

folds = KFold(n_splits=3, shuffle=True, random_state=0)

In [12]:
# Setup the logistic regression baseline pipeline wrapped in a cross-validation grid search
# The pipeline should be of the form: 
# =================
# One-hot encoding (from category_encoders NOT sklearn)
# Standardization (sklearn StandardScalar)
# Logistic regressor (sklearn LogisticRegression)
# ================
# When listing the parameters for the grid search to search over remember that
# to specify parameters within a pipeline use:
# 'pipelineStageName__parameter'
# TIP: include the parameter n_jobs = X when creating the GridSearchCV object
#      this will use multiple cores (you have 4 hyperthreaded to act like 8)
#      What's a good number? Pick 6 or 7. That uses ~3 out of 4 cores leaving 
#      one (or at least one thread) to run your interface.
#
# TIP: Always pick the same number of meta-parameters as your number of cores
#      or a multiple of.
#
# TIP: Start with the grid 'model__C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]
#      with n_jobs = 7

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from category_encoders.one_hot import OneHotEncoder
from sklearn.model_selection import GridSearchCV


steps = [
    ('onehot',OneHotEncoder(handle_unknown='indicator')),  # will automatically pick string columns (could have specified)
    ('standardize', StandardScaler()), # will convert everything (can't specify which columns but all columns are fine)
    ('model',LogisticRegression() )
]

lr_pipe = Pipeline( steps = steps )

param_grid = [{'model__C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}]


lr = GridSearchCV( lr_pipe, param_grid, n_jobs = 7, cv = folds  )

In [13]:
# Train the logistic regressor classifier
lr.fit(X_train,y_train)

GridSearchCV(cv=KFold(n_splits=3, random_state=0, shuffle=True),
             estimator=Pipeline(steps=[('onehot',
                                        OneHotEncoder(handle_unknown='indicator')),
                                       ('standardize', StandardScaler()),
                                       ('model', LogisticRegression())]),
             n_jobs=7,
             param_grid=[{'model__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}])

In [14]:
# Print the meta-parameters that were found to be best. 
# If they were at the boundary of your grid search then
# we may not have found the best parameters. Keeping the
# same number of parameters to explore (for speed) explore more
# parameters.

print('{}, {}'.format(lr.best_params_,lr.best_score_))

{'model__C': 0.01}, 0.8266666666666667


In [15]:
# Test the logistic regressor classifier
# Print the accuracy, precision and recall (sensitivity). 

lr_pred = lr.predict(X_test)
from sklearn.metrics import accuracy_score, precision_score, recall_score

results['lr:'] = 'Acc: {:.2f}, Prec: {:.2f}, Recall: {:.2f}'.format(accuracy_score(y_test, lr_pred),
                                                                    precision_score(y_test, lr_pred),
                                                                    recall_score(y_test, lr_pred)) 

print(results)

{'dc:': 'Acc: 0.76, Prec: 1.00, Recall: 0.00', 'lr:': 'Acc: 0.83, Prec: 0.73, Recall: 0.47'}


In [16]:
# Setup the linear SVM pipeline wrapped in a cross-validation grid search
# The pipeline should be of the form: 
# =================
# One-hot encoding (from category_encoders NOT sklearn)
# Standardization (sklearn StandardScalar)
# Linear SVM (use sklearn SVC with a linear kernel)
# ================
# Make sure to set the cv equal to the same fold split as before

from sklearn.svm import LinearSVC

steps = [
    ('onehot',OneHotEncoder(handle_unknown='indicator')),  # will automatically pick string columns (could have specified)
    ('standardize', StandardScaler()), # will convert everything (can't specify which columns but all columns are fine)
    ('model',LinearSVC(max_iter=30000) )
]

lsvm_pipe = Pipeline( steps = steps )

param_grid = [{'model__C':[0.001,  0.1, 1, 100, 1000]}]

lsvm = GridSearchCV( lsvm_pipe, param_grid, n_jobs = 5, cv = folds )

In [17]:
# Train the linear SVM classifier
# Print the meta-parameters that were found to be best. 
# If they were at the boundary of your grid search then
# we may not have found the best parameters. Keeping the
# same number of parameters to explore (for speed) explore more
# parameters.

lsvm.fit(X_train,y_train)
print('{}, {}'.format(lsvm.best_params_,lsvm.best_score_))

{'model__C': 1}, 0.8188888888888889


In [None]:
# Was there a warning? Perhaps a "ConvergenceWarning"? Follow the suggestion and increase the number of iterations.

In [18]:
# Test the linear SVM classifier
# Print the accuracy, precision and recall (sensitivity). 

lsvm_pred = lsvm.predict(X_test)

results['lsvm:'] = 'Acc: {:.2f}, Prec: {:.2f}, Recall: {:.2f}'.format(accuracy_score(y_test, lsvm_pred),
                                                                    precision_score(y_test, lsvm_pred),
                                                                    recall_score(y_test, lsvm_pred)) 

for k, v in results.items():
  print('{0:>8}| {1}'.format(k,v))

     dc:| Acc: 0.76, Prec: 1.00, Recall: 0.00
     lr:| Acc: 0.83, Prec: 0.73, Recall: 0.47
   lsvm:| Acc: 0.84, Prec: 0.68, Recall: 0.64


In [19]:
# Setup the rbf SVM pipeline wrapped in a cross-validation grid search
# The pipeline should be of the form: 
# =================
# One-hot encoding (from category_encoders NOT sklearn)
# Standardization (sklearn StandardScalar)
# RBF SVM (use sklearn SVC with a rbf kernel)
# ================
## Make sure to set the cv equal to the same fold split as before

from sklearn.svm import SVC

steps = [
    ('onehot',OneHotEncoder(handle_unknown='indicator')),  # will automatically pick string columns (could have specified)
    ('standardize', StandardScaler()), # will convert everything (can't specify which columns but all columns are fine)
    ('model',SVC(kernel='rbf') )
]

rbfsvm_pipe = Pipeline( steps = steps )

param_grid = [{'model__C':[1000, 1100, 1250, 1350, 1500, 2500, 5000], 'model__gamma':[0.000001, 0.00001,0.00005,0.00008, 0.0001]}]

rbfsvm = GridSearchCV( rbfsvm_pipe, param_grid, n_jobs = 6, cv = folds )

In [20]:
# Train the RBF SVM classifier
# Print the meta-parameters that were found to be best. 
# If they were at the boundary of your grid search then
# we may not have found the best parameters. Keeping the
# same number of parameters to explore (for speed) explore more
# parameters.

rbfsvm.fit(X_train,y_train)
print('{}, {}'.format(rbfsvm.best_params_,rbfsvm.best_score_))

{'model__C': 1350, 'model__gamma': 5e-05}, 0.8377777777777777


In [21]:
# Test the RBF SVM classifier
# Print the accuracy, precision and recall (sensitivity). 

rbfsvm_pred = rbfsvm.predict(X_test)

results['rbfsvm:'] = 'Acc: {:.2f}, Prec: {:.2f}, Recall: {:.2f}'.format(accuracy_score(y_test, rbfsvm_pred),
                                                                    precision_score(y_test, rbfsvm_pred),
                                                                    recall_score(y_test, rbfsvm_pred)) 

for k, v in results.items():
  print('{0:>8}| {1}'.format(k,v))

     dc:| Acc: 0.76, Prec: 1.00, Recall: 0.00
     lr:| Acc: 0.83, Prec: 0.73, Recall: 0.47
   lsvm:| Acc: 0.84, Prec: 0.68, Recall: 0.64
 rbfsvm:| Acc: 0.84, Prec: 0.69, Recall: 0.61


# 3. Final training of best model

In [83]:
# Which performed the best? 
# Train the best performing model (logistic Regression, Linear SVM, RBF SVM) based on all our data.


steps = [
    ('onehot',OneHotEncoder(handle_unknown='indicator')),  # will automatically pick string columns (could have specified)
    ('standardize', StandardScaler()), # will convert everything (can't specify which columns but all columns are fine)
    ('model',SVC(kernel='rbf') )
]

rbfsvm = Pipeline( steps = steps )

param_grid = [{'model__C':[1100, 1250, 1350, 1500], 'model__gamma':[0.00001,0.00005,0.00008]}]

model_final = GridSearchCV( rbfsvm, param_grid, n_jobs = 5, cv = folds )

model_final.fit(X,y)



GridSearchCV(cv=KFold(n_splits=3, random_state=0, shuffle=True),
             estimator=Pipeline(steps=[('onehot',
                                        OneHotEncoder(handle_unknown='ignore')),
                                       ('standardize', StandardScaler()),
                                       ('model', SVC())]),
             n_jobs=5,
             param_grid=[{'model__C': [1100, 1250, 1350, 1500],
                          'model__gamma': [1e-05, 5e-05, 8e-05]}])

# 4. In the future: Checking model performance
3 months later you decide to check how your model is performing. You manually collect ground truth labels for a new set of data:<br>[http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.test](http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.test)

In [None]:
# Load the data, check it is correct
data_test = pandas.read_csv('http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.test', skiprows = 1, names = ['age','workclass','fnlwgt','education','education-num','matrial-status','occupation','relationship','race','sex','captial-gain','captial-loss','hours-per-week','native-counrty','salary'])

# Check the DataFrame, does it look correct?
data_test.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,matrial-status,occupation,relationship,race,sex,captial-gain,captial-loss,hours-per-week,native-counrty,salary
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [None]:
# Define our input features and our output feature
# Encode our labels

X_new = data_test.drop('salary',axis = 1) # 0 rows, 1 cols
y_new = data_test.salary
                        
y_new.replace(to_replace = '<=50K', value = 0, inplace = True)
y_new.replace(to_replace = '>50K', value = 1, inplace = True)

In [None]:
# Make the predictions using the model and evaluate them

results = model_final.predict(X_new)

'Acc: {:.2f}, Prec: {:.2f}, Recall: {:.2f}'.format(accuracy_score(y_new, results),
                                                                    precision_score(y_new, results),
                                                                    recall_score(y_new, results)) 



  Xt = transform.transform(Xt)


'Acc: 0.85, Prec: 0.74, Recall: 0.59'

# Well done. 
Still got time? Remember we only used a single split in our outter loop. Start fresh in a cell below and use a cross-validation outer loop to pick the best model. Then train using the full data set. Finally make your predictions on the separately provided "Future" set.