# An intro to using Python in Jupyter notebooks for data science
## Introduction to machine learning tasks in scikit-learn

## Overview

**Purpose**: The notebooks in this repository perform a classification task on the selected dataset: predicting, from a set of predictors, whether a person's income exceeds $50k. The dataset includes several predictors, but the ones we will use for classification are `hours-per-week` (hours worked per week), `workclass` (broad category of the type of work the person performs), and `age` (the person's age in years).

**Notebook purpose**: This notebook demonstrates some sample steps of a machine learning task.

**Data**: In this notebook, we will use the cleaned data (at `data/adult_clean.csv`) created in the previous notebook. All the columns of the original data set are retained, but all NAs have been removed or replaced.

**Preprocessing tasks**: After the data is cleaned, we preprocess the variables of interest in the following ways:
- **numerical predictors** (`hours-per-week`, `age`): A standard scaler is applied to these. More details are given in the appropriate section.
- **categorical variables** (`workclass`, `salary-class`): The predictor `workclass` is converted into one-hot features. For the current ACCRE version of scikit-learn, this first requires converting the string-encoded categories into numerical values (e.g., 'a', 'b', 'c' -> 0, 1, 2); the one-hot encoder then converts these values into one-hot columns. The response `salary-class` is binary encoded.

**Modeling tasks**:  The data is modeled using a logistic regression classifier with 5-fold cross validation.

**Performance evaluation**: The confusion matrix and other metrics including precision, recall, and area under the curve (AUC) are computed.

## Load the data

In [None]:
#Imports for data frame behavior
import pandas as pd

In [None]:
#Now we will load the data.  As a best practice, and to help harden the pipeline, some checks should be run on the data after it is loaded.  This will be discussed at length during the semester.
cleaned_data_filename = 'data/adult_clean.csv'
df = pd.read_csv(cleaned_data_filename)
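
The post-load checks mentioned above could look something like the following. This is only a minimal sketch: the helper `check_loaded_data` and the tiny stand-in frame are hypothetical (they are not part of this repository), and the specific checks are assumptions about what is worth verifying for the columns used in this notebook.

```python
import pandas as pd

# Columns this notebook relies on; the helper below is a hypothetical illustration
REQUIRED_COLUMNS = ['age', 'hours-per-week', 'workclass', 'salary-class']

def check_loaded_data(frame, required=REQUIRED_COLUMNS):
    """Raise ValueError if expected columns are missing, the frame is empty, or NAs remain."""
    missing = [c for c in required if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    if frame.empty:
        raise ValueError("Data frame has no rows")
    na_counts = frame[required].isna().sum()
    if na_counts.any():
        raise ValueError(f"Unexpected NAs:\n{na_counts[na_counts > 0]}")
    return True

# Tiny stand-in frame (made-up values) so the sketch runs without the CSV
demo_df = pd.DataFrame({
    'age': [39, 50],
    'hours-per-week': [40, 13],
    'workclass': ['State-gov', 'Private'],
    'salary-class': ['<=50K', '>50K'],
})
demo_ok = check_loaded_data(demo_df)
```

In the notebook itself, the same call would be made on `df` right after `pd.read_csv`.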

## Modeling the data: Explicitly performing each individual step
In this step, we'll model the data using some parts of the data as predictors and one column as the response.  We will explicitly perform each step of the pipeline to understand each step and its purpose.

For the predictors, we will use \[age, workclass, hours-per-week\], and for the response, we will use salary-class.  Some of these variables will require some preprocessing to get them into a form suitable for a classifier, which will be detailed later.

Note the importance of splitting the data into training and testing sets prior to performing any preprocessing (normally also including imputation).  This ensures that we don't inform the model about the behavior of the testing set during training.

In [None]:
#imports
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

In [None]:
#Divide data into training and testing sets.  This is essential so that no information from the testing set leaks into training.

#Split data into relevant portions
data_y = df['salary-class']
data_x = df[['age', 'hours-per-week', 'workclass']]

ptrain_x, ptest_x, ptrain_y, ptest_y = train_test_split(data_x, data_y, train_size=0.85, test_size=0.15)
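
The split above draws a different random partition on every run. If a repeatable split is wanted, `train_test_split` accepts `random_state`, and `stratify` keeps the class proportions similar in both parts. The sketch below uses made-up data and a hypothetical helper `split_demo`; the seed value is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20% of them in the positive class
demo = pd.DataFrame({'x': range(100), 'label': [0] * 80 + [1] * 20})

def split_demo(seed):
    """Return a stratified 85/15 split of the demo frame for the given seed."""
    return train_test_split(demo[['x']], demo['label'],
                            test_size=0.15,
                            random_state=seed,         # fixes the shuffle
                            stratify=demo['label'])    # preserves class proportions

tr_x, te_x, tr_y, te_y = split_demo(seed=0)
tr_x2, te_x2, tr_y2, te_y2 = split_demo(seed=0)  # same seed, same partition

print(len(tr_x), len(te_x))
print(te_y.mean())  # fraction of positives in the test part
```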

### Encodings for predictors and response variables
Here we encode and preprocess the data into the form that scikit-learn expects as input.  The preprocessing will be as follows:
* **Predictors**:
    - workclass: One-hot encoding.  Note that for compatibility with the older scikit-learn and pandas versions used here, the `OneHotEncoder` must be preceded by a `LabelEncoder` (i.e., `OneHotEncoder` cannot directly encode strings in this version of scikit-learn).
    - age: Standard Scaling (i.e., $\frac{(x-\mu)}{\sigma}$)
    - hours-per-week: Standard Scaling (i.e., $\frac{(x-\mu)}{\sigma}$)
* **Response variable**:
    - salary-class: Binary encoding (e.g., $<$50k is class 0, $\geq$50k is class 1)
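
As a small worked check of the standard scaling formula above, the sketch below computes $\frac{(x-\mu)}{\sigma}$ by hand on made-up ages and compares it with `StandardScaler`. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also the NumPy default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages, shaped as a single feature column
ages = np.array([[20.0], [30.0], [40.0], [50.0]])

mu = ages.mean(axis=0)
sigma = ages.std(axis=0)        # population std (ddof=0)
manual = (ages - mu) / sigma    # (x - mu) / sigma

scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
```

After scaling, the column has mean 0 and standard deviation 1 on the data the scaler was fit to.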

In [None]:
# One hot encode workclass; note that if we wanted to, we could one hot encode all of the categorical columns
pre_wc_encoder = LabelEncoder()
wc_le = pre_wc_encoder.fit_transform(ptrain_x['workclass'])

In [None]:
# Note: in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`
wc_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
wc_1h = wc_encoder.fit_transform(pd.DataFrame(wc_le))
wc_df = pd.DataFrame(wc_1h, columns=pre_wc_encoder.classes_)

In [None]:
print(wc_df.shape)
wc_df.head()
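
For comparison, on newer scikit-learn releases (assumed here to be 0.20 or later, which is not the version this notebook targets) `OneHotEncoder` can encode string categories directly, so the `LabelEncoder` step is not needed. The sketch below uses made-up `workclass` values and relies on the default sparse output plus `.toarray()` to avoid the renamed `sparse` argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up values; 'Never-worked' appears only in the test frame
train_wc = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Private', 'Self-emp']})
test_wc = pd.DataFrame({'workclass': ['Private', 'Never-worked']})

encoder = OneHotEncoder(handle_unknown='ignore')
train_1h = encoder.fit_transform(train_wc).toarray()   # fit on training data only
test_1h = encoder.transform(test_wc).toarray()         # unseen category -> all zeros

columns = list(encoder.categories_[0])                 # learned categories, sorted
train_1h_df = pd.DataFrame(train_1h, columns=columns)
test_1h_df = pd.DataFrame(test_1h, columns=columns)
print(test_1h_df)
```

With `handle_unknown='ignore'`, a category that was not seen during fitting is encoded as a row of zeros rather than raising an error.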

In [None]:
ptrain_x.head()

In [None]:
#Scale inputs to logistic regression
num_scaler = StandardScaler()
ns = num_scaler.fit_transform(ptrain_x[['age','hours-per-week']])
ns_df = pd.DataFrame(ns, columns=['age', 'hours-per-week'])

In [None]:
print(ns_df.shape)
ns_df.head()

### Encodings for responses

In [None]:
#binary encode label
sc_encoder = LabelEncoder()
sc = sc_encoder.fit_transform(ptrain_y)
train_y = pd.Series(sc, name='salary-class')

In [None]:
ptrain_y.head()

In [None]:
print(train_y.shape)
train_y.head()

In [None]:
#Concatenate all of the features together
#feat_df = df[['age', 'capital-gain', 'capital-loss', 'hours-per-week']].copy()
train_x = pd.concat([wc_df, ns_df], axis=1)

In [None]:
print(train_x.shape)
train_x.head()

In [None]:
print(train_y.shape)
train_y.head()

### Modeling and Prediction

In [None]:
#Create a classifier (via logistic regression)
lr_classifier = LogisticRegressionCV(class_weight='balanced', solver='lbfgs', max_iter = 1000, cv=5)

In [None]:
#Train the classifier; k-fold cross validation is used internally to select the regularization strength
lr_classifier.fit(train_x, train_y)
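
To see what the cross validation inside `LogisticRegressionCV` selects, one can inspect the fitted attributes `Cs_` (the candidate inverse regularization strengths) and `C_` (the chosen one). The sketch below is illustrative only: it uses synthetic data from `make_classification` rather than the notebook's data, and the grid size `Cs=5` is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in data (not the adult dataset)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Try 5 candidate values of C, scoring each with 5-fold cross validation
clf = LogisticRegressionCV(Cs=5, cv=5, solver='lbfgs', max_iter=1000)
clf.fit(X, y)

candidate_Cs = np.asarray(clf.Cs_)    # grid of C values that were tried
chosen_C = np.ravel(clf.C_)[0]        # value selected by cross validation

print(candidate_Cs)
print(chosen_C)
```

After the selection, the model is refit on all of the training data with the chosen `C` (the default `refit=True` behavior), so the held-out test set is still needed for an unbiased performance estimate.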

In [None]:
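#Use above encoding methods to create test set (transform only; nothing is re-fit on the test data)
#Note: LabelEncoder.transform raises an error if the test set contains a workclass not seen in training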
wc_le_test = pre_wc_encoder.transform(ptest_x['workclass'])
test_x = pd.concat( [pd.DataFrame(wc_encoder.transform(pd.DataFrame(wc_le_test)),
                                 columns=pre_wc_encoder.classes_),
                     pd.DataFrame(num_scaler.transform(ptest_x[['age', 'hours-per-week']]),
                                  columns = ['age', 'hours-per-week'])], axis=1)
test_y = pd.Series(sc_encoder.transform(ptest_y), name='salary-class')
                   

In [None]:
print(test_x.shape)
test_x.head()

In [None]:
print(test_y.shape)
test_y.head()

In [None]:
#Test the classifier on the held out test set
pred_y = lr_classifier.predict(test_x)

In [None]:
#Can use a classification report to get other metrics:
print("Classification report: \n", classification_report(test_y, pred_y))

In [None]:
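#Investigate popular single-number metrics.  AUC is computed from predicted probabilities rather than hard class labels
pred_prob = lr_classifier.predict_proba(test_x)[:, 1]
roc_auc_score(test_y, pred_prob)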
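
AUC measures how well the classifier ranks positives above negatives, so it is normally computed from continuous scores (for example, the positive-class column of `predict_proba`) rather than from thresholded labels. The toy example below, with made-up scores, shows how much the two can differ.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4]                 # made-up predicted probabilities
y_label = [int(s >= 0.5) for s in y_score]     # hard labels at a 0.5 threshold: all 0

auc_from_scores = roc_auc_score(y_true, y_score)   # 1.0: every positive outranks every negative
auc_from_labels = roc_auc_score(y_true, y_label)   # 0.5: the labels carry no ranking information

print(auc_from_scores, auc_from_labels)
```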

Often, visualization provides a more intuitive understanding of the results.  For a clearer view of the confusion matrix, we can use the matplotlib and seaborn packages.

In [None]:
#Look at the confusion matrix of the result of the testing set
conf_mat = confusion_matrix(test_y,pred_y)
conf_mat_ratio = conf_mat/(pred_y.shape[0])
print("Confusion matrix: \n", conf_mat)
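
Precision and recall can be read directly off the confusion matrix. In scikit-learn's layout, rows are actual classes and columns are predicted classes, so for a binary problem `ravel()` yields the counts in the order tn, fp, fn, tp. The sketch below uses made-up labels and cross-checks the hand computation against `precision_score` and `recall_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()      # rows = actual, columns = predicted

precision = tp / (tp + fp)       # of the predicted positives, how many were correct
recall = tp / (tp + fn)          # of the actual positives, how many were found

print(cm)
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```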

In [None]:
import matplotlib.pyplot as plt
import seaborn as sn

plt.figure(figsize=(10,4))

plt.subplot(1,2,1) 
ax = sn.heatmap(pd.DataFrame(conf_mat, index=sc_encoder.classes_, columns=sc_encoder.classes_),
          annot=True, fmt = "d", annot_kws={"size": 14}, cbar=False)
ax.set_xlabel('Predicted class', fontsize=16);
ax.set_ylabel('Actual class', fontsize=16);
ax.set_title('Salary-class Confusion: Counts')

plt.subplot(1,2,2) 
ax = sn.heatmap(pd.DataFrame(conf_mat_ratio, index=sc_encoder.classes_, columns=sc_encoder.classes_),
          annot=True, fmt = "0.2f", annot_kws={"size": 14}, cbar=False)
ax.set_xlabel('Predicted class', fontsize=16);
ax.set_ylabel('Actual class', fontsize=16);
plt.subplots_adjust(wspace=0.4)
ax.set_title('Salary-class Confusion: Ratio');

## Modeling the Data via Pipelines
One extremely valuable feature of scikit-learn is its support for pipelines.  A pipeline defines the sequence of steps applied to any data that is input to the model.  This gives a simple description of what should be performed on each part of the data, and it also helps prevent information from the testing data from leaking into training, since calling `fit` on the pipeline fits each preprocessing step on the training data only.
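
A minimal sketch of such a pipeline for the three predictors used in this notebook is shown below. It makes several assumptions: a scikit-learn version that provides `ColumnTransformer` (0.20 or later), made-up rows in place of the real data, and a plain `LogisticRegression` instead of `LogisticRegressionCV` because the toy data is too small for 5-fold cross validation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows with the same column names as the notebook
demo_x = pd.DataFrame({
    'age': [25, 38, 52, 46, 29, 61],
    'hours-per-week': [20, 40, 60, 45, 35, 50],
    'workclass': ['Private', 'Private', 'Self-emp', 'State-gov', 'Private', 'Self-emp'],
})
demo_y = pd.Series([0, 0, 1, 1, 0, 1], name='salary-class')

# Scale the numeric columns and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'hours-per-week']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['workclass']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# fit() learns the scaling statistics and categories from the training rows only
model.fit(demo_x, demo_y)

# predict() reuses those fitted steps on new rows, including an unseen workclass
new_rows = pd.DataFrame({'age': [33], 'hours-per-week': [40], 'workclass': ['Never-worked']})
demo_pred = model.predict(new_rows)
print(demo_pred)
```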

One drawback of pipelines in general is that they may fail silently.  However, because scikit-learn is open source and built on Python, developers can create additional functionality (e.g., custom transformer types) to help combat such problems.
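
One way such a custom transformer might look is sketched below. The class `ColumnCheck` is hypothetical (it is not part of scikit-learn or this repository): it passes data through unchanged but raises an error when an expected column is missing, turning one kind of silent failure into a loud one.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnCheck(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through step that fails loudly if expected columns are missing."""

    def __init__(self, required_columns):
        self.required_columns = required_columns

    def _verify_columns(self, X):
        missing = [c for c in self.required_columns if c not in X.columns]
        if missing:
            raise ValueError(f"Missing expected columns: {missing}")

    def fit(self, X, y=None):
        self._verify_columns(X)
        return self

    def transform(self, X):
        self._verify_columns(X)
        return X  # data is returned unchanged

# Made-up frame that lacks 'hours-per-week'
demo_frame = pd.DataFrame({'age': [39, 50], 'workclass': ['State-gov', 'Private']})

checked = ColumnCheck(['age', 'workclass']).fit_transform(demo_frame)  # passes
try:
    ColumnCheck(['age', 'hours-per-week']).fit(demo_frame)
except ValueError as err:
    print(err)
```

A step like this could be placed first in a `Pipeline` so that the check runs on both training and testing data.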