**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**

In [0]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer

from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [2]:
#@title -- Downloading Data -- { display-mode: "form" }
!mkdir -p data/adult_income
!wget -nc -O data/adult_income.zip https://www.dropbox.com/s/077gouf58kvzsuc/adult_income.zip?dl=1
!unzip -oq -d data/adult_income data/adult_income.zip

--2020-03-29 16:26:22--  https://www.dropbox.com/s/077gouf58kvzsuc/adult_income.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.9.1, 2620:100:601f:1::a27d:901
Connecting to www.dropbox.com (www.dropbox.com)|162.125.9.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/077gouf58kvzsuc/adult_income.zip [following]
--2020-03-29 16:26:22--  https://www.dropbox.com/s/dl/077gouf58kvzsuc/adult_income.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucd46abdd7d0996c676d3babc60f.dl.dropboxusercontent.com/cd/0/get/A00qIlfJ8AhD1w4a6wkxcTiQb4MGuPcDMBFZyS89Ch_lgMazhB5axtf9E52_tcGu-8eTkv1NclFzN4VPS-WADzMQaHWZz_J23W9dUYpE-Uph4pTwF6bOARouOV2DNykuW-w/file?dl=1# [following]
--2020-03-29 16:26:22--  https://ucd46abdd7d0996c676d3babc60f.dl.dropboxusercontent.com/cd/0/get/A00qIlfJ8AhD1w4a6wkxcTiQb4MGuPcDMBFZyS89Ch_lgMazhB5axtf9E52_tcGu-8eTkv1NclFzN4VPS-WADzMQaHWZz_J23W9d

# Pipelines and KNN: The Adult Income Dataset

As a more practical example of data preprocessing and KNN-based classification, we will use the [Adult Income Dataset](https://archive.ics.uci.edu/ml/datasets/adult), which contains census data. The task is to predict whether a particular person's income exceeds 50 000 dollars (label `>50K`) or not (label `<=50K`).

As usual, we will first display the description of the data.

In [3]:
with open("data/adult_income/adult.names", "r") as file:
    print("".join(file.readlines()))

| This data was extracted from the census bureau database found at
| http://www.census.gov/ftp/pub/DES/www/welcome.html
| Donor: Ronny Kohavi and Barry Becker,
|        Data Mining and Visualization
|        Silicon Graphics.
|        e-mail: ronnyk@sgi.com for questions.
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete    (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances : 6
| Class probabilities for adult.all file
| Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database.  A set of
|   reasonably clean records was extracted using the following conditions:
|   ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
|
| Prediction task is to determine whether a person makes over

Let's start by loading the data from CSV files. The dataset comes pre-split into a training and a testing fold, so we will load each separately. The files have no header row, and the first line of the testing file is skipped when loading. *In the testing data, the values in the last column end with an extra period. To make them compatible with the training set, we will remove it directly after loading the data.*

In [4]:
df_train = pd.read_csv("data/adult_income/adult.data",
                       header=None)
df_test = pd.read_csv("data/adult_income/adult.test",
                      header=None, skiprows=1)

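# Drop the last character (the trailing period) from every test label
# so that the labels match those in the training set.
df_test[14] = df_test[14].apply(lambda x: x[:-1])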
df_train.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
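
The data description mentions unknown values. The sketch below assumes the usual layout of the UCI files, in which fields are separated by a comma followed by a space and unknown values are written as `?`. Under that assumption, a plain `read_csv` call keeps the leading space and treats `?` as an ordinary string, so an imputer looking for `NaN` would find nothing to fill. The sample rows are made up for illustration; `skipinitialspace` and `na_values` are standard `read_csv` parameters.

```python
import io

import pandas as pd

# Two made-up rows in the assumed "comma + space" layout;
# the second row uses "?" for an unknown value.
sample = (
    "39, State-gov, 77516, <=50K\n"
    "54, ?, 180211, >50K\n"
)

# Plain loading: strings keep their leading space, "?" stays a string.
df_raw = pd.read_csv(io.StringIO(sample), header=None)

# Strip the space after each comma and read "?" as a missing value (NaN).
df_clean = pd.read_csv(io.StringIO(sample), header=None,
                       skipinitialspace=True, na_values="?")

print(df_raw[1].tolist())
print(df_clean[1].isna().sum())

# A trailing period on a label can be removed without touching
# labels that do not have one.
labels = pd.Series(["<=50K.", ">50K."]).str.rstrip(".")
print(labels.tolist())
```

Switching the loading cell to these options would likely change the results reported further down, so treat this as an optional variation rather than a fix to the cell above.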


---

## Task 1: Column Selection

**Our first task, as in the previous example, is to select which columns will be used as inputs and to decide whether each of them contains numeric or categorical data.** The desired outputs are in the last column.

---

In [0]:
categorical_inputs = [1,3,5,6,7,8,9,13]

numeric_inputs = [0,2,4,10,11,12]

output = 14
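
As a cross-check on a hand-made split like the one above, the column types can also be inferred from the dtypes that pandas assigned on loading. This is only a sketch on a made-up toy frame (the names `toy`, `output_col`, `numeric_cols` and `categorical_cols` are ours), and it is a heuristic: a column such as column 4 above stores a numeric code for the education level in column 3, so a numeric dtype does not by itself settle how a column should be treated.

```python
import pandas as pd

# Toy frame with integer column labels, mimicking header=None loading.
toy = pd.DataFrame({
    0: [39, 50, 38],
    1: ["State-gov", "Self-emp-not-inc", "Private"],
    2: [77516, 83311, 215646],
    3: ["<=50K", "<=50K", ">50K"],
})
output_col = 3

# Leave the output column out, then split the inputs by dtype.
inputs = toy.drop(columns=[output_col])
numeric_cols = inputs.select_dtypes(include="number").columns.tolist()
categorical_cols = inputs.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, categorical_cols)
```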

---

## Task 2: Creating the Pipeline

**The next step is to create our preprocessing pipeline. You can copy the pipeline from the previous example into the following cell.** The desired outputs will be preprocessed using the ``OrdinalEncoder`` transformer.

---

In [0]:
input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

In [0]:
output_enc = OrdinalEncoder()
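
To see what such a column transformer produces, here is a minimal sketch on a made-up four-row frame with one numeric and one categorical column, each containing a missing value. It uses the same building blocks as the pipeline above; the names `toy`, `toy_preproc` and `X_toy` are ours. Note that the output columns follow the order of the transformers (categorical first, then numeric), not the order of the columns in the data frame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

toy = pd.DataFrame({
    0: [20.0, 40.0, np.nan, 60.0],                     # numeric
    1: ["Private", np.nan, "Private", "State-gov"],    # categorical
})

toy_preproc = make_column_transformer(
    # Fill missing categories with the most frequent one, then encode
    # each category as an integer code.
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OrdinalEncoder()), [1]),
    # Fill missing numbers with the column mean, then standardize.
    (make_pipeline(SimpleImputer(), StandardScaler()), [0]),
)

X_toy = toy_preproc.fit_transform(toy)
print(X_toy)
```

The first output column holds the category codes and the second holds the standardized numeric values.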

## Data Preprocessing

We will now use the transformers created above to preprocess our data.

In [0]:
X_train = input_preproc.fit_transform(df_train)
Y_train = output_enc.fit_transform(df_train[[output]]).reshape(-1)

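**Keep in mind that we need to use the ``transform`` method, not ``fit_transform``, to preprocess our testing data.** This way the testing data is transformed with the parameters (imputation values, category codes, scaling statistics) that were fitted on the training data.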

In [0]:
X_test = input_preproc.transform(df_test)
Y_test = output_enc.transform(df_test[[output]]).reshape(-1)
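
The difference between the two methods is easy to see on a tiny made-up example with a single numeric column. In this sketch (all names are ours), the scaler fitted on the training values maps a test value equal to the training mean to zero, whereas refitting on the test values would silently move the origin and change the scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [10.0], [20.0]])   # mean 10
test = np.array([[10.0], [30.0]])           # mean 20

# Correct: fit on the training data, reuse its statistics for the test data.
scaler = StandardScaler().fit(train)
test_scaled = scaler.transform(test)

# Incorrect for evaluation: the test data gets its own mean and variance.
test_refit = StandardScaler().fit_transform(test)

print(test_scaled.ravel())
print(test_refit.ravel())
```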

## Training

The code to train the model can be copied from the previous example verbatim.

In [12]:
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
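
As an optional variation, the preprocessing and the classifier can be chained into a single pipeline, so that `fit` and `predict` apply the preprocessing automatically and in the right mode. The sketch below uses made-up toy data and our own names (`toy_train`, `toy_labels`, `clf`); it is not the code used to produce the results in this notebook.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

toy_train = pd.DataFrame({
    0: [25, 30, 45, 50, 28, 52],
    1: ["Private", "Private", "State-gov",
        "State-gov", "Private", "State-gov"],
})
toy_labels = [0, 0, 1, 1, 0, 1]

clf = make_pipeline(
    make_column_transformer(
        (OrdinalEncoder(), [1]),
        (StandardScaler(), [0]),
    ),
    KNeighborsClassifier(n_neighbors=3),
)

# fit() fits the preprocessing on the training data and then the classifier;
# predict() only transforms the new data before classifying it.
clf.fit(toy_train, toy_labels)

toy_test = pd.DataFrame({0: [27, 48], 1: ["Private", "State-gov"]})
toy_pred = clf.predict(toy_test)
print(toy_pred)
```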

## Testing

The code to test the model can be copied verbatim as well.

In [0]:
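# Note the naming: y_test (lowercase) holds the model's predictions,
# while Y_test (uppercase) holds the true labels.
y_test = model.predict(X_test)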

In [14]:
cm = pd.crosstab(Y_test, y_test,
                 rownames=['actual'],
                 colnames=['predicted'])
print(cm)

predicted    0.0   1.0
actual                
0.0        11254  1181
1.0         1650  2196


In [15]:
acc = accuracy_score(Y_test, y_test)
print("Accuracy = {}".format(acc))

Accuracy = 0.8261163319206437
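
The accuracy can also be recomputed directly from the confusion matrix printed above, which helps to put the number in context. The sketch below copies the four counts by hand (the variable names are ours) and additionally computes the accuracy of always predicting the more frequent class, as well as the fraction of each actual class that was classified correctly.

```python
import numpy as np

# Rows: actual class, columns: predicted class (counts copied from above).
cm_values = np.array([[11254, 1181],
                      [1650, 2196]])

# Correct predictions lie on the diagonal.
accuracy = np.trace(cm_values) / cm_values.sum()

# Accuracy of a trivial model that always predicts the most frequent class.
baseline = cm_values.sum(axis=1).max() / cm_values.sum()

# Fraction of each actual class that was predicted correctly (recall).
recall_per_class = np.diag(cm_values) / cm_values.sum(axis=1)

print(accuracy, baseline, recall_per_class)
```

On these counts the trivial baseline already reaches roughly 0.76, and the recall for class 1 (about 0.57) is much lower than for class 0 (about 0.90), so the overall accuracy alone gives a rather flattering picture.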
