# Introduction to Machine Learning at Fireside
scott sadlo, isabella seeman, joseph nelson

## Tools

### Jupyter Notebooks

[Jupyter Notebooks](https://en.wikipedia.org/wiki/Project_Jupyter) provide an amazing interface for collaborating on and sharing code and the process of data analaysis. 

### Pandas
[Pandas](https://pandas.pydata.org) is a library for easy manipulation and analysis of large sets of data.

### SKLearn
[SKLearn](https://scikit-learn.org/) provides an abundance of machine learning tools and algorithms that are both powerful and easy to use.

## Predicting Outgoing Responses

One of the common problems faced by staffers on The Hill is pairing incoming constituent messages with outgoing responses. What follows is an investigation into various techniques for automatically predicting the appropriate response for each incoming message.


### Getting Started

We start off by reading all of the data into a Pandas data frame and then do some rudimentary data analysis. For the purposes of privacy, our data set is not actually constituent mail here, it's actually sentences from scientific articles, but the methodology is the same. These messages have been pre-classified as follows:

* AIMX The specific research goal of the paper 
* OWNX The author’s own work, e.g. methods, results, conclusions
* CONT Contrast, comparison or critique of past work
* BASE Past work that provides the basis for the work in the article.
* MISC Any other sentences

The class can be thought of as a pointer to an outgoing response, and the sentence can be considered an incoming constituent message.

Note: This data comes from [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Sentence+Classification)

 #### Prepare the Pandas/Jupyter environment

In [12]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd

pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_rows', 10000)

#### Read and Clean Our Data

In [13]:
import glob, re 
from io import StringIO
all_files = glob.glob("./data/*.txt")
all_lines = []
for f in all_files:
    r = open(f)
    for l in r.readlines():
        # some lines use ' ', and '--' as separators, here we
        # standardize these at '\t'
        all_lines.append(re.sub(r'^(AIMX|OWNX|CONT|BASE|MISC)( |-)+', '\\1\t', l))

data = StringIO('\n'.join(all_lines))
df = pd.read_csv(data, delimiter='\t', names=['outgoing', 'incoming'], header=None)

# strip out #### marker lines ####
df = df.loc[~df.outgoing.str.contains('#')]

# clean up NaN lines in 'incoming'
df['incoming'] = df['incoming'].astype(str)
df.shape

(3118, 2)

Get some basic stats on the data

In [3]:
df.describe()

Unnamed: 0,outgoing,incoming
count,3118,3118
unique,5,1323
top,MISC,The observations received by the learning algorithm often have some inherent temporal dependence
freq,1826,4


Have a peek at the first few messages in the data frame

In [4]:
df.head()

Unnamed: 0,outgoing,incoming
1,OWNX,this study was designed to assess sex-related differences in the selection of an appropriate strategy when facing novelty
2,OWNX,
3,OWNX,the exploration task was followed by a visual discrimination task and the responses were analyzed using signal detection theory
4,OWNX,during exploration women selected a local searching strategy in which the metric distance between what is already known and what is unknown was reduced whereas men adopted a global strategy based on an approximately uniform distribution of choices
5,OWNX,women's exploratory behavior gives rise to a notion of a secure base warranting a sense of safety while men's behavior does not appear to be influenced by risk


### Split the Data 


Fundamentally what we're trying to do here is take a set of known data (incoming messages or X) and predict a bit of unknown data (a specific outgoing message) from a field of other known data (outgoing messages or y). To do so using supervised learning, the entire corpus of data is typically split into training and testing datasets. Here we do just that, using 80% of the data for training, and 20% for testing. We also keep all of the data together for our look at Support-vector Machines.

Again, X is the data being used to predict the y labels.

In [14]:
import numpy as np

msk = np.random.rand(len(df)) < 0.8

incoming_X_train = df[msk].incoming.values
outgoing_y_train = df[msk].outgoing.values

incoming_X_test = df[~msk].incoming.values
outgoing_y_test = df[~msk].outgoing.values

incoming_X_all = df.incoming.values
outgoing_y_all = df.outgoing.values

### Vectorizing the Messages

In order for our sentences to be useful mathematically, we need to convert them into vectors, or sequences of numbers, that represent the vocabulary of our messages in different ways. We've put a couple of functions here to allow us to easily test different algorithms with a variety of vector representations.


For more on vectorizers, checkout [Hacking Scikit-Learn’s Vectorizers](https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af)

In [19]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer

def call_with_transforms(vect, X_train, X_test, callable):
    dtm_train = vect.fit_transform(X_train)
    dtm_test = None
    if X_test is not None:
        dtm_test = vect.transform(X_test)
    callable(dtm_train, dtm_test, vect)
    
def try_vectorizers(callable, X_train, X_test):

    print('CountVectorizer')    
    vect = CountVectorizer(ngram_range=(1, 10), stop_words='english', min_df=1)
    call_with_transforms(vect, X_train, X_test, callable)
    
    print('')
    print('TfidfVectorizer')    
    vect = TfidfVectorizer(ngram_range=(1, 10), stop_words='english', min_df=1)
    call_with_transforms(vect, X_train, X_test, callable)
    
    print('')
    print('HashingVectorizer')    
    vect = HashingVectorizer(non_negative=True, stop_words='english')
    call_with_transforms(vect, X_train, X_test, callable)
    

### Find the Best Algorithm

We now take a look at a few common machine learning algorithms that are well suited for the tasks of text classification to find which produces the best results on our data.

#### Naive Bayes
[Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a popular text classification method that makes use of probabilities. As it turns out this method, of those we try here, produces the best results on our real data as well as the sample data used by this notebook.

In [20]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

def naive_bayes(dtm_train, dtm_test, vect):
    # use Naive Bayes to predict term
    nb = MultinomialNB()
    nb.fit(dtm_train, outgoing_y_train)
    y_pred_class = nb.predict(dtm_test)

    # calculate accuracy
    print(metrics.accuracy_score(outgoing_y_test, y_pred_class))
    
try_vectorizers(naive_bayes, incoming_X_train, incoming_X_test)

CountVectorizer
0.7728706624605678

TfidfVectorizer
0.8233438485804416

HashingVectorizer
0.6845425867507886


#### Support-vector Machine (SVM)
[Support-vector Machine](https://en.wikipedia.org/wiki/Support-vector_machine) is another popular text classification algorithm, but it is geometric rather than probablistic.

In [23]:
from sklearn import svm
import numpy as np

def SVM(dtm_train, dtm_test, vect):
    clf = svm.SVC()
    gamma_range = 10.**np.arange(-5, 2)
    C_range = 10.**np.arange(-2, 3)
    kernel_range = ['rbf', 'sigmoid', 'linear', 'poly']
    param_grid = dict(gamma=gamma_range, C=C_range, kernel=kernel_range)
    grid = GridSearchCV(clf, param_grid, cv=10, scoring='accuracy')
    grid.fit(dtm_train, outgoing_y_all)
    print(grid.best_score_)
    print(grid.best_params_)
    print(grid.best_estimator_)
try_vectorizers(SVM, incoming_X_all, None)

CountVectorizer
0.8595253367543297
{'C': 10.0, 'gamma': 0.01, 'kernel': 'sigmoid'}
SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.01, kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

TfidfVectorizer
0.8617703656189866
{'C': 10.0, 'gamma': 1.0, 'kernel': 'sigmoid'}
SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1.0, kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

HashingVectorizer
0.8627325208466966
{'C': 1.0, 'gamma': 1.0, 'kernel': 'rbf'}
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1.0, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


#### Random Forest
[Random Forest](https://en.wikipedia.org/wiki/Random_forest) classifiers build decision trees from random feature subsets to determine the best features for classification.

In [None]:
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

def random_forest(dtm_train, dtm_test, vect):
    rf_grid = RandomForestClassifier(random_state=99, n_jobs=50)
    param_grid = {
        'criterion': ['gini', 'entropy'],
        'max_depth' : [None,2,5,8],
        'max_features' : ['auto', 'sqrt', 'log2'],
        'class_weight' : ['balanced', None]
    }

    grid = GridSearchCV(rf_grid, param_grid, cv=5, scoring='accuracy')

    #Fit the grid search to X, and y.
    grid.fit(dtm_train, outgoing_y_train)

    params = grid.best_params_
    rf = RandomForestClassifier(**params)
    rf_model = rf.fit(dtm_train, outgoing_y_train)
    y_pred = rf.predict(dtm_test)

    print("Best score =", grid.best_score_)
    print("RandomForest Cross Validation Score:\t", cross_val_score(rf, dtm_train, outgoing_y_train, cv=5).mean())
    print("Train/Test RandomForest Score:\t", rf.score(dtm_test, outgoing_y_test))

    df_features = pd.DataFrame(columns=['Features', 'Importance (Gini Index)'])
    df_features['Features'] = columns=vect.get_feature_names()
    df_features['Importance (Gini Index)'] = rf.feature_importances_
    df_features.sort_values('Importance (Gini Index)', ascending=False, inplace=True)

    df_features.head(15)

try_vectorizers(random_forest, incoming_X_train, incoming_X_test)

CountVectorizer
Best score = 0.785426731078905
{'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto'}
RandomForest Cross_Val Score:	 0.7846641940678634
Train/Test RandomForest Score:	 0.831230283911672

TfidfVectorizer
Best score = 0.789049919484702
{'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': None, 'max_features': 'auto'}
RandomForest Cross_Val Score:	 0.7814464453802886
Train/Test RandomForest Score:	 0.8170347003154574

HashingVectorizer


## Finding Relationships Between Messages

### Clustering
#### KMeans

In [None]:
from sklearn.cluster import KMeans

num_clusters = 20
km = KMeans(n_clusters=num_clusters)
km.fit(dtm)

labels = km.labels_.tolist()
df['kmeans-labels'] = labels
df.sort_values(by='kmeans-labels')

df.groupby('kmeans-labels').incoming.describe()