# Introduction
In this demo, we will focus on techniques such as using Pandas to manipulate data, adding new features to your model, and training and evaluating your model.

## Pandas 🐼
Pandas is a powerful data analysis library that allows you to manipulate data. In this project, we will load the CSV file that contains the labels and existing features into a Pandas DataFrame object so that we can add new features and edit the data if we want to.

Optional: Take a look at the [Pandas Dataframe documentation](https://pandas.pydata.org/docs/reference/frame.html) if you're interested.
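
As a minimal sketch of that workflow, the snippet below builds a tiny DataFrame in memory, adds a derived column, and drops another. The two rows are made-up stand-ins for the contents of `demo.csv`, and the names `toy` and `features` are only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the CSV contents; the real data is loaded from demo.csv later
toy = pd.DataFrame({
    "label": ["Phone #:", "FILE"],
    "is_question": [1, 0],
})

# Add a new feature column derived from an existing column
toy["length"] = toy["label"].str.len()

# drop() returns a new DataFrame here, so `toy` itself keeps all of its columns
features = toy.drop(columns=["is_question"])

print(features)
```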

## scikit-learn
We will be using scikit-learn to split our data, train and refine our models, and do other amazing ML stuff. This library has many useful tools from finetuning hyperparameters to training your model and computing accuracy scores.

## Goal
Classify each label as a **question** or **not a question** with an accuracy of at least 90%.

In [23]:
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDClassifier

## Import and clean the data

In [2]:
# import csv file that contains the labels and features to dataframe using read_csv
dataframe = pd.read_csv('demo.csv')

# peek at the first 5 rows to make sure the data looks ok
# label [string]: label extracted from textract
# is_question [0 or 1]: is the label a question or not a question 
# rest of the columns: features to identify if label is a question or not
dataframe.head()

Unnamed: 0,label,is_question,length,wordcount,question_mark,colon,tx_confidence,left,right,top,bottom
0,YOUR HOSPITAL BLOOD BANK,0,24,4,0,0,0.0,0.204442,0.436648,0.27907,0.29901
1,FILE,0,4,1,0,0,0.0,0.707655,0.73148,0.275127,0.290316
2,RETURN TO BLOOD BANK AFTER TRANSFUSION,0,38,6,0,0,0.0,0.575593,0.84322,0.292005,0.309388
3,NAME OF RECIPIENT,1,17,3,0,0,77.241501,0.107973,0.198529,0.309325,0.32418
4,MEDTCAL RECORD #,1,16,3,0,0,77.615753,0.4195,0.50782,0.309093,0.322


In [3]:
# peek at the number of rows and columns in the csv
dataframe.shape

(764, 11)

In [4]:
# random peek at some rows 
dataframe.sample(n=10)

Unnamed: 0,label,is_question,length,wordcount,question_mark,colon,tx_confidence,left,right,top,bottom
318,Phone #:,1,8,2,0,1,79.189041,0.737831,0.790565,0.295985,0.308895
533,) Diabetes,1,10,2,0,0,0.0,0.466562,0.542239,0.032046,0.044946
68,Crossmatch Date/Time:,1,21,2,0,1,48.565109,0.337732,0.474819,0.279058,0.29122
381,Refer to other Community Services:,1,34,5,0,1,83.656227,0.552849,0.761436,0.66857,0.680803
122,1 Hour from start,1,17,4,0,0,0.0,0.124489,0.246901,0.776878,0.789216
243,I have been notified of a change in the plan o...,1,53,12,0,0,59.705364,0.104954,0.444643,0.340402,0.352384
34,"** IF ""YES"" NOTIFY BLOOD PHYSICIAN, AND COMPLE...",0,70,11,0,0,0.0,0.775179,0.874035,0.50288,0.515524
28,TIME FINISHED,1,13,2,0,0,0.0,0.0,0.0,0.0,0.0
475,MD Office ( ),1,13,4,0,0,0.0,0.677996,0.786652,0.371336,0.384906
75,(For Rh Immune Globulin injections with no adv...,0,96,14,0,0,0.0,0.180676,0.811592,0.350183,0.363568


In [5]:
# Get the column of question labels
question_column = dataframe['is_question']

# If you don't want a column, you can drop it
dataframe.drop(['is_question'], axis=1, inplace=True)
dataframe.head()

Unnamed: 0,label,length,wordcount,question_mark,colon,tx_confidence,left,right,top,bottom
0,YOUR HOSPITAL BLOOD BANK,24,4,0,0,0.0,0.204442,0.436648,0.27907,0.29901
1,FILE,4,1,0,0,0.0,0.707655,0.73148,0.275127,0.290316
2,RETURN TO BLOOD BANK AFTER TRANSFUSION,38,6,0,0,0.0,0.575593,0.84322,0.292005,0.309388
3,NAME OF RECIPIENT,17,3,0,0,77.241501,0.107973,0.198529,0.309325,0.32418
4,MEDTCAL RECORD #,16,3,0,0,77.615753,0.4195,0.50782,0.309093,0.322


## Insert new feature

You can think of features as the dimensions that your data lives in. If you have one feature, then when you train and fit your model, each datapoint is like a point on a one-dimensional line, and every feature you add gives the data one more dimension.

In this demo, we will add a new feature: the proportion of capital letters to the length of the label.

E.g. 'ABCDabcd' will be 4/8 or 0.5

In [6]:
def capital_proportion(row):
    return len(re.findall(r'[A-Z]', row['label'])) / row['length']

# Method 1: `apply(func, axis=1)` calls the function once per row (axis=0 would call it once per column)
dataframe['capital'] = dataframe.apply(capital_proportion, axis=1)

# Method 2: apply to series (aka columns of Dataframe object) and insert into dataframe
# dataframe['capital'] = dataframe['label'].apply(lambda label: len(re.findall(r'[A-Z]', label)) / len(label))

# Method 3: insert into pandas dataframe (read documentation here)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html

dataframe.head()

Unnamed: 0,label,length,wordcount,question_mark,colon,tx_confidence,left,right,top,bottom,capital
0,YOUR HOSPITAL BLOOD BANK,24,4,0,0,0.0,0.204442,0.436648,0.27907,0.29901,0.875
1,FILE,4,1,0,0,0.0,0.707655,0.73148,0.275127,0.290316,1.0
2,RETURN TO BLOOD BANK AFTER TRANSFUSION,38,6,0,0,0.0,0.575593,0.84322,0.292005,0.309388,0.868421
3,NAME OF RECIPIENT,17,3,0,0,77.241501,0.107973,0.198529,0.309325,0.32418,0.882353
4,MEDTCAL RECORD #,16,3,0,0,77.615753,0.4195,0.50782,0.309093,0.322,0.8125
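
As a possible alternative to the `apply`-based methods, the same proportion can be computed with Pandas string methods on the whole column at once. This is a sketch on made-up labels (the `toy` DataFrame is hypothetical), not a change to the cell above. Note that an empty label would divide by zero and not produce a valid proportion.

```python
import pandas as pd

# Toy labels for illustration only
toy = pd.DataFrame({"label": ["ABCDabcd", "FILE", "Phone #:"]})

# Count uppercase ASCII letters per label, then divide by the label length
toy["capital"] = toy["label"].str.count(r"[A-Z]") / toy["label"].str.len()

print(toy)
```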


## Train your model

Before training your model, you should **split your dataset into a training set and a test set**. The training set is the data that you will train your model on. The test set is held back and used to evaluate how well your model does on data it has not seen.

In [7]:
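# Before you train your model, let's vectorize the label column since scikit-learn doesn't take strings as input
# Credit: Raymond Xu
# The labels are passed to fit_transform, not to the constructor (its first argument is the `input` mode)
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(dataframe["label"].astype('U'))
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, use get_feature_names() instead
feature_names = vectorizer.get_feature_names_out()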
denselist = vectors.todense().tolist()
df = pd.DataFrame(denselist, columns=feature_names)

# Make sure the new dataframe has the same amount of data and looks ok
print(df.shape)
df.head()

(764, 952)


Unnamed: 0,00,000,03,032,05,10,105,106,108,11,...,xxxx,xxxxx,year,yearly,years,yeast,yes,you,your,zip
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.452608,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
df.insert(0, "length", dataframe['length'], True)
df.insert(0, "wordcount", dataframe['wordcount'], True)
df.insert(0, "question_mark", dataframe['question_mark'], True)
df.insert(0, "colon", dataframe['colon'], True)
df.insert(0, "capital", dataframe['capital'], True)

# make sure they're inserted
print(df.shape)
df.head()

(764, 957)


Unnamed: 0,capital,colon,question_mark,wordcount,length,00,000,03,032,05,...,xxxx,xxxxx,year,yearly,years,yeast,yes,you,your,zip
0,0.875,0,0,4,24,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.452608,0.0
1,1.0,0,0,1,4,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.868421,0,0,6,38,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.882353,0,0,3,17,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.8125,0,0,3,16,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
# Split your data into training and test set
data_train, data_test, target_train, target_test = train_test_split(df, question_column, test_size=.2)

# Make sure that the split is what you intended
data_train.shape, target_train.shape

((611, 957), (611,))
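
The split above is random, so the rows that land in each set differ between runs. As a hedged sketch on synthetic data (not the demo dataset), passing `random_state` makes the split reproducible, and `stratify` keeps the class ratio the same in both sets. The arrays `X` and `y` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 3 features, 70% positive labels
X = np.arange(300).reshape(100, 3)
y = np.array([1] * 70 + [0] * 30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```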

## About the model

In this demo, we're going to be looking at Linear Regression and K-Nearest Neighbors for classification. These are both supervised machine learning models, which means that the models look for relationships in pre-labeled data and then use those relationships to approximate what the correct output labels should be. Unsupervised machine learning does *not* have labeled inputs, so the goal is for the model to infer the natural structure present within a set of data points. (More on supervised vs unsupervised learning [here](https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d))

### Linear Regression

Linear regression is based on a simple linear equation that assigns a weight ($\beta$) to each column ($x$) of the input. The sum of these weighted inputs, along with the intercept/bias coefficient ($\beta_0$), produces the output ($y$).

For two columns, the equation looks like this:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
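
To make the equation concrete, here is a small worked example in NumPy. The intercept, weights, and input row are hypothetical numbers chosen for illustration; a trained model would learn its own values.

```python
import numpy as np

beta_0 = 0.5                    # intercept (hypothetical)
betas = np.array([2.0, -1.0])   # weights beta_1 and beta_2 (hypothetical)
x = np.array([3.0, 4.0])        # one input row with two columns

# y = beta_0 + beta_1 * x_1 + beta_2 * x_2
y = beta_0 + betas @ x

print(y)  # 0.5 + 2*3 - 1*4 = 2.5
```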

In [10]:
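# Train and fit your model
# Note: the `normalize` parameter was removed in scikit-learn 1.2; on newer versions, drop it and scale the features separately if needed
linreg = LinearRegression(normalize=True)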
linreg.fit(data_train, target_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)

In [14]:
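# Predict with your model: this gives what the model classifies the test data as
# Note: the raw regression output is thresholded at 0 here; with 0/1 targets, 0.5 is the more usual cut-off and may give a different score
predicted_values = (linreg.predict(data_test) > 0) * 1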
predicted_values

array([0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1])

In [15]:
accuracy_score(target_test, predicted_values) * 100

60.130718954248366

That's not very good (only a little better than guessing)! This is because linear regression is typically used for *predictions* (continuous values) and not *classifications* (discrete values). For classification, we typically use other models like logistic regression, which uses a different mathematical function, instead.
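
For comparison, here is a minimal sketch of logistic regression in scikit-learn. It uses synthetic data from `make_classification` rather than the demo dataset, so the score it prints says nothing about how the model would do on the labels above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# predict() returns class labels directly, so no manual threshold is needed
predictions = logreg.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(accuracy * 100)
```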

### K-Nearest Neighbors

The KNN algorithm works by assuming that similar things exist in close proximity. "Closeness" can be defined in a variety of different ways, but many implementations of KNN simply calculate the straight line distance between two points (even on data with multiple dimensions). Whenever a datapoint needs to be classified, the KNN algorithm looks at its $k$ nearest neighbors, aka the $k$ other known datapoints that are closest to where this new datapoint is located. These neighbors will "vote" on how the new datapoint should be classified.

Here's a brief overview of how the algorithm works:
1. Initialize k (a hyperparameter that's passed into the function -- more on that below) to be the number of neighbors you want the algorithm to use. How many neighbors depends on your data, and you have to experiment with different values to find the best one. In the example below, we use `n_neighbors=3` (where `n_neighbors` is the same thing as `k`).
2. Let's refer to the point being classified as the query point. Calculate the distance between the query point and each point in the dataset. (Note: As you might imagine, this is super inefficient as your dataset gets bigger. sklearn's KNN classifier has an `algorithm` argument that you can set depending on the structure of your data to try and make this process more runtime efficient.)
3. Find the `k` neighbors with the smallest distance to the query point
4. Get the labels of the `k` neighbors
5. Use the mode of the labels as the classification for the query point.
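
The steps above can be sketched from scratch in a few lines of NumPy. This is an illustrative brute-force version with made-up points, not how sklearn's `KNeighborsClassifier` is implemented; the function name `knn_predict` is hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(train_points, train_labels, query_point, k=3):
    # Step 2: straight-line (Euclidean) distance from the query point to every training point
    distances = np.linalg.norm(train_points - query_point, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: labels of those neighbors
    neighbor_labels = train_labels[nearest]
    # Step 5: the most common label (the mode) wins the vote
    return Counter(neighbor_labels.tolist()).most_common(1)[0][0]

# Made-up training data: three points near the origin (class 0), two near (5, 5) (class 1)
train_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
train_labels = np.array([0, 0, 0, 1, 1])

near_origin = knn_predict(train_points, train_labels, np.array([0.3, 0.3]))
near_five = knn_predict(train_points, train_labels, np.array([4.8, 5.2]))

print(near_origin, near_five)
```

For the second query, two of the three nearest neighbors belong to class 1, so class 1 wins the vote even though one class 0 point is included.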

In [16]:
# Train and fit your model
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(data_train, target_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

Notice that in our example we only set the `n_neighbors` parameter, but there are a lot of other parameters that you can set depending on the dataset you're working with! Several of them, like `algorithm`, `leaf_size` and `weights`, have default settings, which is why they appear in the output above. Read through the description of each parameter in the [KNN sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) to find out what these parameters represent and how they each change the way that the model outputs predictions.

In [17]:
# Predict with your model: this gives what the model classifies the test data as
predicted_values = neigh.predict(data_test)
predicted_values

array([0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1])

In [18]:
accuracy_score(target_test, predicted_values) * 100

70.58823529411765

## Improve on your model

A big problem when training a model is overfitting: the model does well only on the set of labels that it was trained on and does very poorly on new data. Adding new features will not necessarily make a model better. The more features you add, the more closely the model can fit the specific datapoints that you train it on; with too few features, you restrict your data to a lower-dimensional subspace that may not have enough dimensions to separate the classes. We can rely on other techniques such as hyperparameter tuning and cross validation to make a model better.

Hyperparameters are values that you choose before training a model. They aren't features of your data; they are settings that define the behavior of your model (for example, `n_neighbors` in KNN). Some hyperparameters result in bad models while others give great models, so we need to find the best hyperparameters for our model.

GridSearchCV (short for Grid Search Cross Validation) performs hyperparameter tuning by trying every combination of the candidate hyperparameter values that you provide and keeping the best one. In our example below, we generate arrays of possible values for `n_neighbors` (1 through 39, since `np.arange(1, 40, 1)` excludes the endpoint) and possible settings for `weights` (`uniform` and `distance`). We then pass them into GridSearchCV and let it decide the best params to use. The fitted search object is stored in the variable `best_model`, and the model with the best parameters is available as `best_model.best_estimator_`.

In [19]:
# Hyperparameter tuning using GridSearchCV
# Read more about hyperparameter tuning here:
# https://medium.com/@mandava807/cross-validation-and-hyperparameter-tuning-in-python-65cfb80ee485

data_train, data_test, target_train, target_test = train_test_split(df, question_column, test_size=.2)
neigh = KNeighborsClassifier()

# Create an array of all possible hyperparam values
n_neighbors = np.arange(1, 40, 1)
weights = ['uniform', 'distance']

# Perform GridSearchCV and fit the model to find the best params
hyperparams = dict(n_neighbors=n_neighbors, weights=weights)
gs = GridSearchCV(neigh, hyperparams, cv=5, verbose=0)
best_model = gs.fit(data_train, target_train)

best_n_neighbor = best_model.best_estimator_.get_params()['n_neighbors']
best_weight = best_model.best_estimator_.get_params()['weights']

print('Best n_neighbors: ', best_n_neighbor)
print('Best weights: ', best_weight)

Best n_neighbors:  19
Best weights:  distance
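
A fitted `GridSearchCV` object also exposes the winning combination and its score directly through `best_params_` and `best_score_`, which can be shorter than going through `best_estimator_.get_params()`. The sketch below uses synthetic data and a smaller grid than the cell above, so its output is not comparable to the results shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 10 values of n_neighbors x 2 weight settings = 20 combinations to try
param_grid = {"n_neighbors": np.arange(1, 11), "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # dictionary holding the winning combination
print(search.best_score_)   # mean cross-validated accuracy of that combination
```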


Cross validation is a technique that splits your training set into k smaller subsets (folds). The model is trained on k - 1 of the folds and validated on the remaining one, and this is repeated so that each fold serves as the validation set once. The goal of cross validation is to examine whether your model is overfitting. It is also useful for examining the accuracy of your model after finetuning hyperparameters.
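
A hand-written version of what `cross_val_score` does with a `KFold` splitter might look like the sketch below. It runs on synthetic data, and the choice of `n_neighbors=3` is arbitrary, so treat it as an illustration of the loop rather than a result for this dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data for illustration only
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

fold_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])      # train on four of the five folds
    predictions = model.predict(X[val_idx])    # validate on the held-out fold
    fold_scores.append(accuracy_score(y[val_idx], predictions))

# The mean over the folds is the cross-validated accuracy
print(np.mean(fold_scores) * 100)
```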

In [20]:
neigh = KNeighborsClassifier(n_neighbors=best_n_neighbor, weights=best_weight)

# Perform cross validation on the selected params, k=5 is usually pretty good
kfold = KFold(n_splits=5)
results = cross_val_score(neigh, data_train, target_train, cv=kfold)
print('Accuracy from training data: ', results.mean() * 100)

Accuracy from training data:  80.20125283220045


In [21]:
neigh.fit(data_train, target_train)
predicted_values = neigh.predict(data_test)
score = accuracy_score(target_test, predicted_values)
print('Accuracy from testing data: ', score * 100)

Accuracy from testing data:  79.73856209150327


## Gradient Descent

Another tool to improve accuracy is gradient descent. It only works for differentiable models (for example LinearSVC or Logistic Regression): it uses a gradient and a learning rate to adjust the weights of the inputs and minimize the loss (the difference between the value that the model predicts and the actual value). The Stochastic Gradient Descent classifier in sklearn (`SGDClassifier`) trains a linear classifier this way; with its default `loss='hinge'`, it fits a linear SVM.
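
To show the idea of a gradient and a learning rate in isolation, here is a minimal sketch of plain gradient descent fitting a single weight. It is not the stochastic variant and not sklearn's implementation; the data, the starting weight, and the learning rate are all made up for illustration.

```python
import numpy as np

# Synthetic data generated from y = 3x, so the ideal weight is 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = 0.0               # initial weight
learning_rate = 0.01

for _ in range(500):
    predictions = w * x
    # Gradient of the mean squared error loss with respect to w
    gradient = 2.0 * np.mean((predictions - y) * x)
    # Step against the gradient to reduce the loss
    w -= learning_rate * gradient

print(w)
```

Each step moves `w` a little closer to 3. A learning rate that is too large would overshoot and diverge, while one that is too small would need many more iterations.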

In [55]:
# SGD = Stochastic Gradient Descent
sgd = SGDClassifier(max_iter=1000, tol=1e-3) # can use GridSearch to tune these hyperparams too
sgd_model = sgd.fit(data_train, target_train)
sgd_model

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [56]:
predicted_values = sgd_model.predict(data_test)
predicted_values

array([1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0])

In [57]:
sgd_accuracy = accuracy_score(target_test, predicted_values)
print('Accuracy from stochastic gradient descent: ', sgd_accuracy * 100)

Accuracy from stochastic gradient descent:  73.8562091503268


We can also use cross validation on top of gradient descent to check for overfitting.

In [58]:
# Perform cross validation on the selected params, k=5 is usually pretty good
sgd_kfold = KFold(n_splits=5)
sgd_results = cross_val_score(sgd, data_train, target_train, cv=sgd_kfold)
print('Accuracy from training data: ', sgd_results.mean() * 100)

Accuracy from training data:  69.87338397974143


In [59]:
sgd.fit(data_train, target_train)
predicted_values = sgd.predict(data_test)
sgd_score = accuracy_score(target_test, predicted_values)
print('Accuracy from testing data: ', sgd_score * 100)

Accuracy from testing data:  75.81699346405229
