## Quora Insincere Questions Classification
### Detect toxic content to improve online conversations
### Binary Text Classification
#### Damon Resnick, 3/7/2019
  


In [1]:
# import the pandas library needed for loading the data
import pandas as pd
pd.set_option('display.max_colwidth', 1000) # This allows us to see the entire text string in a table using pandas

# numpy was used in a few cases
import numpy as np

# SMOTE may be used in one instance
from imblearn.over_sampling import SMOTE

# Use ignore warnings in the final running of the notebook to clean it up in one place.
# import warnings
# warnings.filterwarnings('ignore')

In [2]:
# load in the data as a data frame using pandas
data = pd.read_csv('train.csv')

In [3]:
data[0:4]

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province as a nation in the 1960s?,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you encourage people to adopt and not shop?",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity affect space geometry?,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg hemispheres?,0


In [4]:
data['question_text'][7]

'Is it crazy if I wash or wipe my groceries off? Germs are everywhere.'

In [5]:
# Rename the columns (qid, question_text, target) to shorter names
data.columns = ['id', 'text', 'flag']

In [6]:
data[0:4]

Unnamed: 0,id,text,flag
0,00002165364db923c7e6,How did Quebec nationalists see their province as a nation in the 1960s?,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you encourage people to adopt and not shop?",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity affect space geometry?,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg hemispheres?,0


In [7]:
print(data.info())
print('Types of flags:', data['flag'].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1306122 entries, 0 to 1306121
Data columns (total 3 columns):
id      1306122 non-null object
text    1306122 non-null object
flag    1306122 non-null int64
dtypes: int64(1), object(2)
memory usage: 29.9+ MB
None
Types of flags: [0 1]


## Prep the data for Classification

In [8]:
# The data appears to be severely unbalanced: many more 0 flags than 1 flags!
print('Total Number of rows', data['flag'].count())
print('Number of 1s in flag', data['flag'].sum())
print('Number of 0s in flag', data['flag'].count() - data['flag'].sum())

Total Number of rows 1306122
Number of 1s in flag 80810
Number of 0s in flag 1225312


The data is clearly unbalanced, but it is not obvious whether that is a problem. Modeling the data without addressing the imbalance biases the model towards correctly predicting 0s rather than 1s; if future data has the same class mix, that bias may be acceptable. **Without correcting for the imbalance, a simple model incorrectly predicts only 5.8% of the 0s but about 40% of the 1s.** Whether this needs correcting is debatable. Correcting for the imbalance might also improve the overall accuracy and other prediction metrics such as recall and precision. A method that improves the predictions of the 1s without degrading the predictions of the 0s would be the better overall method.
  
In order to try and correct for the imbalance we will try a few different methods and test them on a held out portion of the dataset.
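
To make the cost of the imbalance concrete, the class counts printed above are enough to work out what a trivial model that always predicts 0 would score. The snippet below is only an illustrative sketch; the variable names are assumptions and are not used elsewhere in the notebook.

```python
# Class counts taken from the output of the cell above.
n_total = 1306122
n_ones = 80810
n_zeros = n_total - n_ones

# A trivial "always predict 0" model gets every 0 right and every 1 wrong.
baseline_accuracy = n_zeros / n_total
minority_share = n_ones / n_total
baseline_recall = 0 / n_ones  # no 1s are ever predicted

print(f"Share of 1s in the data: {minority_share:.3f}")
print(f"Accuracy of always predicting 0: {baseline_accuracy:.3f}")
print(f"Recall on the 1s for that baseline: {baseline_recall:.3f}")
```

Because such a baseline already reaches roughly 94% accuracy while finding none of the 1s, accuracy alone is a weak guide here, which is why recall, precision, and F1 are reported alongside it.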

#### Rebalance the data set: Two options
- Over-sample the minority class, the 1s, such that there are as many as the 0s
- Under-sample the majority class, the 0s, such that there are as many as the 1s

#### Create a hold out set:
To determine which method gives the better result, we split the training data into two parts so that every method is tested on the same rows, none of which have been touched by re-balancing. The split size of 375,811 rows was chosen because the competition test data, whose classes we ultimately wish to predict, has 375,811 rows. Note that in the code below the 375,811 sampled rows form the train set used for fitting, and the remaining 930,311 rows form the hold set used for testing.

In [9]:
# Data will be split up to make a hold out set. The test set is 375,811 rows,
# so we sample a train set of that size and hold out the remaining rows.
train = data.sample(n=375811, replace=False, random_state=42)
hold = data.drop(train.index)

train = train.reset_index(drop=True) # reset the index of the train set
hold = hold.reset_index(drop=True) # reset the index of the hold set

### Over-sample the minority class:
This adds many 1s by appending a random sample (with replacement) of the 1s rows to the data. This method may be a problem because it tends to bias the model towards the minority class, even though that class would still be the minority in a real-world data set.

In [10]:
# First we need to know how many majority and minority rows there are.
num_maj = train['flag'].count() - train['flag'].sum()
num_min = train['flag'].sum()
differ = (num_maj - num_min)

print('Size of majority class:', num_maj)
print('Size of minority class:', num_min)
print('Difference in classes:', differ)

# We will then make the size of the class balanced by adding the difference to the minority class
balanced_os = pd.concat([train, 
                         train[train['flag'] == 1].sample(n=differ, replace=True, random_state=42)], 
                        ignore_index=True).sample(frac=1, random_state=42).reset_index(drop=True) 

X_os = balanced_os['text']
y_os = balanced_os['flag'] # Create a series to store the target: y


print('New size of majority class:', y_os.count() - y_os.sum())
print('New size of minority class:', y_os.sum())

Size of majority class: 352922
Size of minority class: 22889
Difference in classes: 330033
New size of majority class: 352922
New size of minority class: 352922


### Over sample using SMOTE  
  
-  SMOTE, the Synthetic Minority Over-sampling Technique, is another way to over-sample. Instead of simply replicating the minority class rows, SMOTE uses the existing minority instances to create new, synthetic ones. It does this by interpolating between a minority instance and one of its k-nearest minority neighbors.  
  
SMOTE requires continuous samples so this will have to be done after the vectorization.
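
The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the imblearn implementation: the function name `smote_like_samples` and the toy points are hypothetical, and the notebook itself would apply SMOTE to the sparse vectors produced by the vectorizer.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (illustration only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical two-dimensional minority-class points.
minority_points = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote_like_samples(minority_points, n_new=5)
for point in new_points:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a line segment between two existing minority points, which is why the method needs numeric feature vectors rather than raw text.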

### Under-sample the majority class:
Since there are only 22889 1s, we randomly sample only 22889 of the 0s then create a new data set by adding those samples to the 1s rows. This creates a training data set that is only 45778 rows which is roughly 3.5% the size of the original data set. We are throwing away a lot of data so this method may need to be modified so that accuracy is not reduced. It may be useful to use an ensemble approach here to make sure we are using all the data. For the ensemble approach the 0s and 1s rows will be separated. The 0s will be split up into roughly 15 different data sets then combined with the 1s each. This will create 15 different data sets where each has the same 1s rows but different 0s rows.

In [11]:
# Create Undersampled set
# Sample the 0s without replacement so that no majority row is duplicated
balanced_us = pd.concat([train[train['flag'] == 0].sample(n=num_min, replace=False, random_state=42), 
                         train[train['flag'] == 1]], ignore_index=True)

X_us = balanced_us['text']
y_us = balanced_us['flag'] # Create a series to store the target: y

print('New size of majority class:', y_us.count() - y_us.sum())
print('New size of minority class:', y_us.sum())

New size of majority class: 22889
New size of minority class: 22889


In [12]:
# This code cell will create several dfs that are each almost perfectly balanced

# I want to split the majority class into num_splits separate dfs (about 15 here),
# each roughly equal in size to the minority data set.
# Then create that many new balanced data sets. Using these sets I would then fit each to the model.
# I will then take each fit and predict_proba, averaging the predicted probs for each model.
# Then round to get the prediction.

# First I need to split up the majority class into separate dfs.

# randomly shuffle data set and reset the index
df_resample = train[train['flag'] == 0].sample(frac=1, random_state=42).reset_index(drop=True)

# Number of data sets needed to make it balanced.
num_splits = (num_maj/num_min).round(0).astype(int)

num = [int(n.round(0)) for n in np.arange(0, len(df_resample), len(df_resample)/num_splits)] # locations of splits

df_all_maj = []

# Make a list of dfs of the majority class by splitting the df at the locations in num
for i in range(len(num)):
    if i < len(num)-1:
        df_all_maj.append(df_resample.loc[num[i]:num[i+1]-1,:])
    else:
        df_all_maj.append(df_resample.loc[num[i]:,:])


df_min = train[train['flag'] == 1] # minority df

# Now take each df in the list and combine it with the minority df, df_min
df_all_us = [pd.concat([df, df_min]).sample(frac=1, random_state=42).reset_index(drop=True) for df in df_all_maj]

print('Number of data sets created:', len(df_all_us))

Number of data sets created: 15


In [13]:
# Make an initial train test split to test vectorizers
# This is used with the modeling below to run basic comparisons

from sklearn.model_selection import train_test_split

X = train['text']
y = train['flag']

# Create training and test sets in order to tune and compare models
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

#### Vectorize the text into a form that a classification model can use to make predictions.
  
There are basically two vectorizers: Count and TF-IDF

Count Vectorizer: The most straightforward one, it counts the number of times a token shows up in the document and uses this value as its weight.  

TF-IDF Vectorizer: TF-IDF stands for “term frequency-inverse document frequency”, meaning the weight assigned to each token depends not only on its frequency in a document but also on how common that term is across the entire corpus. More on that here.
  
https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af
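
The difference between the two weightings can be seen on a toy corpus. The sketch below computes raw counts and TF-IDF weights by hand, assuming the smoothed IDF formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1. It uses simple whitespace tokenization and skips the row normalization that TfidfVectorizer applies by default, so the numbers will not match the vectorizer's output exactly; the example sentences are made up.

```python
import math
from collections import Counter

docs = [
    "why is the sky blue",
    "why is the grass green",
    "is the sky the limit",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each token appears.
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

def smoothed_idf(token):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq[token])) + 1

for tokens in tokenized:
    counts = Counter(tokens)  # Count weights
    tfidf = {t: c * smoothed_idf(t) for t, c in counts.items()}  # unnormalized TF-IDF weights
    print(dict(counts))
    print({t: round(w, 3) for t, w in tfidf.items()})
```

Tokens such as "the" and "is" appear in every document, so their IDF is the minimum value of 1, while a rarer token such as "blue" is weighted up.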

In [14]:
# Try CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Initialize a CountVectorizer object: count_vectorizer
#count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer = CountVectorizer() # Slightly better results without stop words

# Fit and Transform the training data using only the 'text' column values: count_train 
count_train = count_vectorizer.fit_transform(X_train.values)

# Transform the test data using only the 'text' column values: count_test 
count_test = count_vectorizer.transform(X_test.values)

# Print the first 10 features of the count_vectorizer
print(count_vectorizer.get_feature_names()[:10])

['00', '000', '0000', '00000000000000000021e800', '00000001', '000000199', '000009808', '000009820', '000009920', '000125']


In [15]:
# Try TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
#tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
tfidf_vectorizer = TfidfVectorizer(max_df=0.7) # Slightly better results without stop words

# Fit and Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train.values)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test.values)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names()[:10])

# Print the first 5 vectors of the tfidf training data
#print(tfidf_train.A[:5])

['00', '000', '0000', '00000000000000000021e800', '00000001', '000000199', '000009808', '000009820', '000009920', '000125']


## Model the Data with Naive Bayes Classifier
  
A Naive Bayes Classifier is used here to try different balancing methods, as it is efficient and fast and makes sense for the type of text and problem we have. Other models may be chosen for the final fit to maximize the prediction scores, but we use NB here to determine which method of balancing works best.

First we will see if one vectorizer works better than the other by fitting each on the train/test split of the original, unbalanced train set made above.

In [16]:
# Use a Naive Bayes method to model the CountVectorizer data
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier using alpha to tune
nb_classifier = MultinomialNB(alpha=0.0000000001) # alpha close to 0 seems to be the best for CountVectorizer

# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print('Acc:', score)
print('F1:', metrics.f1_score(y_test, pred))
print('Prec:', metrics.precision_score(y_test, pred))
print('Rec:', metrics.recall_score(y_test, pred))

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=[0,1])
print(cm)

Acc: 0.9209052326277557
F1: 0.4826385867200417
Prec: 0.4032867946480512
Rec: 0.6008667388949079
[[66445  4103]
 [ 1842  2773]]


In [17]:
# Use a Naive Bayes method to model the TfidfVectorizer data

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier_t tune with different alphas
nb_classifier_t = MultinomialNB(alpha=0.0000000001) # alpha close to 0 seems to give the best results

# Fit the classifier to the training data
nb_classifier_t.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred_t = nb_classifier_t.predict(tfidf_test)

# Calculate the accuracy score: score
score_t = metrics.accuracy_score(y_test, pred_t)
print(score_t)
print('F1:', metrics.f1_score(y_test, pred_t))
print('Prec:', metrics.precision_score(y_test, pred_t))
print('Rec:', metrics.recall_score(y_test, pred_t))

# Calculate the confusion matrix: cm
cm_t = metrics.confusion_matrix(y_test, pred_t, labels=[0,1])
print(cm_t)

0.9447999680693958
F1: 0.4006933410371226
Prec: 0.6009532062391681
Rec: 0.30054171180931744
[[69627   921]
 [ 3228  1387]]


The two vectorizers give similar base accuracy scores. Tfidf does slightly better on accuracy and precision but much worse on recall and F1. In fact, Tfidf predicts nearly 70% of the 1s as 0s, compared with about 40% for the Count model. The confusion matrices show that the mix of TPs, TNs, FPs, and FNs is quite different. Which vectorizer we use should be determined by the problem we are trying to answer.
  
The type of vectorizer seems to make some difference here. We will use the **Count** vectorizer to start and later come back and make sure it was the best choice.
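
The percentages quoted above follow directly from the two confusion matrices. The sketch below recomputes them, assuming the layout scikit-learn uses with labels=[0,1] (rows are true classes, columns are predicted classes); the helper name `rates_from_confusion` is hypothetical.

```python
def rates_from_confusion(cm):
    """Derive per-class error rates and precision/recall from a 2x2
    confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return {
        "zeros_misclassified": fp / (tn + fp),
        "ones_misclassified": fn / (fn + tp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Confusion matrices copied from the two Naive Bayes outputs above.
count_cm = [[66445, 4103], [1842, 2773]]
tfidf_cm = [[69627, 921], [3228, 1387]]

for name, cm in [("Count", count_cm), ("TF-IDF", tfidf_cm)]:
    rates = rates_from_confusion(cm)
    print(name, {k: round(v, 3) for k, v in rates.items()})
```

Reading the two rows of results side by side shows the trade-off: the Count model misses fewer 1s, while the TF-IDF model raises fewer false alarms on the 0s.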

### Automate the vectorization and modeling
  
Here we automate the vectorization and modeling into one function to make it easier to compare the models on different methods of balancing the data sets.

In [18]:
# Make a function to perform all the data transformation and the modeling
# The function should take as input:
# X_train, y_train, X_test, y_test, and the tuned model to be fit and then predicted

def automate_tf1(X_train, y_train, X_test, y_test, model):
    
    ''' This function takes X_train, y_train, X_test, y_test, and a tuned classification model, fits the model, prints
    the scores and a confusion matrix, and returns the predicted probabilities of the classification model.
    '''

    count_vectorizer = CountVectorizer() # Slightly better without stop words

    # Fit and Transform the training data: X_train_t 
    X_train_t = count_vectorizer.fit_transform(X_train.values)

    # Transform the test data: X_test_t 
    X_test_t = count_vectorizer.transform(X_test.values)

    # Fit the classifier to the training data
    model.fit(X_train_t, y_train)

    # Create the predicted tags: pred
    pred = model.predict(X_test_t)
    pred_probs = model.predict_proba(X_test_t)

    # Calculate the accuracy score:
    print('Acc:', metrics.accuracy_score(y_test, pred))
    print('F1:', metrics.f1_score(y_test, pred))
    print('Prec:', metrics.precision_score(y_test, pred))
    print('Rec:', metrics.recall_score(y_test, pred))

    # Calculate the confusion matrix: cm
    print(metrics.confusion_matrix(y_test, pred, labels=[0,1]))

    return (pred_probs)

## Compare different models using the Hold out set
Now we will compare the different methods and models on the held-out data instead of using test data taken from the modified train data.  
  
The key here is that the test data is the 930,311 rows held out before any of the data was changed to make it more balanced.

In [19]:
# Use the automate function to compare different models to the hold out set

# Train on the over-sampled data
Xtra = X_os
ytra = y_os

# Test
Xtea = hold['text']
ytea = hold['flag']

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)
#model = XGBClassifier(max_depth=3, learning_rate=0.9, n_estimators=300, min_child_weight=0)

probs = automate_tf1(Xtra, ytra, Xtea, ytea, model = model)

Acc: 0.8601489179424945
F1: 0.3895996659582354
Prec: 0.26748740546042543
Rec: 0.7168557172700748
[[758685 113705]
 [ 16400  41521]]


#### Make a function to take multiple data sets and average the prediction probabilities in order to create an ensemble of results.  
  
This function will make it so we can compare different ensembles of each method in an attempt to find a combination that improves the overall model.
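
The averaging step at the heart of the ensemble can be illustrated without any model at all. The sketch below takes made-up positive-class probabilities from three hypothetical models, averages them per question, and cuts the result at 0.5, which is the threshold that rounding the averaged probability amounts to. The function name `average_and_threshold` is an assumption for illustration only.

```python
def average_and_threshold(prob_lists, threshold=0.5):
    """Average the positive-class probabilities from several models and
    turn the averages into 0/1 predictions."""
    n_models = len(prob_lists)
    averaged = [sum(column) / n_models for column in zip(*prob_lists)]
    predictions = [1 if p > threshold else 0 for p in averaged]
    return averaged, predictions

# Hypothetical P(flag = 1) outputs from three models for four questions.
model_probs = [
    [0.10, 0.80, 0.40, 0.55],
    [0.20, 0.90, 0.70, 0.35],
    [0.05, 0.70, 0.55, 0.45],
]

averaged, predictions = average_and_threshold(model_probs)
print([round(p, 3) for p in averaged])
print(predictions)
```

Averaging lets a question that one model scores below 0.5 still be flagged when the other models are confident, and the reverse, which is the behaviour the ensemble comparisons below rely on.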

In [None]:
# Make a function to perform all the data transformation and the modeling
# The function should take as input:
# X_train, y_train, X_test, y_test, and the tuned modeling to be fit and then predicted

def automate_tf2(X_train, y_train, X_test, y_test, model, is_SMOTE = False, tfidf_vect = False):
    ''' This function takes X_train, y_train, X_test, y_test, and a tuned classification model, vectorizes the text,
    optionally over-samples the training data with SMOTE, fits the model, and returns the predicted probabilities.
    y_test is not used here; it is kept so the call signature matches automate_tf1.
    '''
    
    if tfidf_vect:
        vectorizer = TfidfVectorizer(max_df=0.7)
    else:
        vectorizer = CountVectorizer()

    # Fit and Transform the training data: X_train_t 
    X_train_t = vectorizer.fit_transform(X_train.values)

    # Transform the test data: X_test_t 
    X_test_t = vectorizer.transform(X_test.values)
        
    if is_SMOTE:
        # Older imblearn releases spelled these ratio=1.0 and fit_sample
        sm = SMOTE(random_state=12, sampling_strategy=1.0)
        X_train_t, y_train = sm.fit_resample(X_train_t, y_train)
    
    # Fit the classifier to the training data
    model.fit(X_train_t, y_train)
    
    pred_probs = model.predict_proba(X_test_t)

    return pred_probs


# Automate function to account for ensembling comparisons

def auto_ensemble(list_dfs, X_test, y_test, model, is_SMOTE = False):
    ''' This function automates the ensemble process. It takes a list of dfs, fits the model to each one, averages the
    predicted probabilities, and then prints the scores and confusion matrix of the ensembled results. 
    inputs:
    list_dfs is a list of data frames with both x and y columns.
    X_test, y_test: are pandas series presumably from the hold out set.
    model: the model that is refit on each df in the list
    '''
    
    pred_probs_tot = 0
        
    for df in list_dfs:
        X_train = df['text']
        y_train = df['flag']
        pred_prob0 = automate_tf2(X_train, y_train, X_test, y_test, model, is_SMOTE)
        pred_probs_tot = pred_probs_tot + np.array(pred_prob0)
    
    pred_probs_ave = pred_probs_tot[:,1]/len(list_dfs)
    
    pred_ave_int = pred_probs_ave.round(0).astype(int)

    # Calculate the accuracy score:
    print('Acc:', metrics.accuracy_score(y_test, pred_ave_int))
    print('F1:', metrics.f1_score(y_test, pred_ave_int))
    print('Prec:', metrics.precision_score(y_test, pred_ave_int))
    print('Rec:', metrics.recall_score(y_test, pred_ave_int))

    # Calculate the confusion matrix: cm
    print(metrics.confusion_matrix(y_test, pred_ave_int, labels=[0,1]))
    
    return (pred_probs_ave, pred_ave_int)

#### Make some different lists of dfs so it is easy to see what combinations are being tried.

In [None]:
# This will put a lot of dfs into memory. To save memory you may want to do this one at a time

# Make a list of dfs that have all under-sample, over-sample, and the original data set
df_both_os = pd.concat([X_os, y_os], axis=1) # need to pass a list of dfs to the function
df_new_list1 = df_all_us.copy()
df_new_list1.extend([df_both_os])
df_new_list1.extend([train])

# Make a list of dfs that have one under-sample, over-sample, and the original data set
df_both_us = pd.concat([X_us, y_us], axis=1) # need to pass a list of dfs to the function
df_new_list2 = [df_both_us].copy()
df_new_list2.extend([df_both_os])
df_new_list2.extend([train])

# Make a list of dfs that have over-sample and the original data set
df_new_list3 = [df_both_os].copy()
df_new_list3.extend([train])

# Make a list of dfs that have one under-sample and the original data set
df_new_list4 = [df_both_us].copy()
df_new_list4.extend([train])

# Make a list of dfs that have all under-sample and the over sample
df_new_list5 = df_all_us.copy()
df_new_list5.extend([df_both_os])

# Make a list of dfs that have one under-sample and the over sample
df_new_list6 = [df_both_us].copy()
df_new_list6.extend([df_both_os])

# Make a list of dfs that have all under-sample, and the original data set
df_new_list7 = df_all_us.copy()
df_new_list7.extend([train])

# Test set is the hold out set
Xtea = hold['text']
ytea = hold['flag']

In [None]:
# Original data using NB model

df_list = [train]

# Test set is the hold out set
Xtea = hold['text']
ytea = hold['flag']

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)

print('Original Train data')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Original data using NB model using SMOTE

df_list = [train]

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)

print('Original Train data with SMOTE')

probs = auto_ensemble(df_list, Xtea, ytea, model = model, is_SMOTE = True)

In [None]:
# Original data using XGBoost model
from xgboost import XGBClassifier

df_list = [train]

# Tuned Model
model = XGBClassifier(max_depth=3, learning_rate=0.9, n_estimators=50, min_child_weight=0)

print('Original data with XGBoost model')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Original data using XGB model with SMOTE

df_list = [train]

# Tuned Model
model = XGBClassifier(max_depth=3, learning_rate=0.9, n_estimators=300, min_child_weight=0)

print('Original Train data with SMOTE and XGB')

probs = auto_ensemble(df_list, Xtea, ytea, model = model, is_SMOTE = True)

The XGB classifier does a slightly better job of fitting the minority class, but does not do quite as well with the majority class.

In [None]:
# Over-sampled model

df_list = [df_both_os]

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)

print('Over-sampled model')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Over-sampled model with XGB

df_list = [df_both_os]

# Tuned Model
model = XGBClassifier(max_depth=3, learning_rate=0.9, n_estimators=300, min_child_weight=0)

print('Over-sampled model with XGB')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Under-sampled model

df_list = [df_both_us]

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)

print('Under-sampled model')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Under-sampled model using XGB as the accuracy was very low

df_list = [df_both_us]

# Tuned Model
model = XGBClassifier(max_depth=3, learning_rate=0.9, n_estimators=100, min_child_weight=0)

print('Under-sampled model using XGB')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Ensemble of all Under-sampled data

df_list = df_all_us

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)

print('Ensemble of all Under-sampled data')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Ensemble of all Under-sampled data using XGB
# This takes 13 seconds or so to run

df_list = df_all_us

# Tuned Model
model = XGBClassifier(max_depth=3, learning_rate=0.9, n_estimators=100, min_child_weight=0)

print('Ensemble of all Under-sampled data using XGB')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

The under-sampled data sets do a much better job of predicting the minority class than the other models. Using XGBoost we can further increase the scores significantly. Clearly, the better predictions of the minority class affect the prediction accuracy of the majority class. The overall accuracy is worse than with the over-sampled sets, but this model has the best predictions of the minority class.  
  
A combination of this ensemble of under sampled data using the XGBoost model with other ensembles may be the best method.

In [None]:
# Ensemble of all different data 1

df_list = df_new_list1

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)

print('Ensemble of all different data 1')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Ensemble of all different data 2

df_list = df_new_list2

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)

print('Ensemble of all different data 2')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Ensemble of all different data 3

df_list = df_new_list3

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)

print('Ensemble of all different data 3')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Ensemble of all different data 3 with XGB

df_list = df_new_list3

# Tuned Model
model = XGBClassifier(max_depth=3, learning_rate=0.9, n_estimators=100, min_child_weight=0)

print('Ensemble of all different data 3 with XGB')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Ensemble of all different data 4

df_list = df_new_list4

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)

print('Ensemble of all different data 4')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Ensemble of all different data 5

df_list = df_new_list5

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)

print('Ensemble of all different data 5')

probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Ensemble of all different data 6

df_list = df_new_list6

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)

print('Ensemble of all different data 6')

#probs = auto_ensemble(df_all_us, Xtea, ytea, model = model)
probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Ensemble of all different data 7

df_list = df_new_list7

# Tuned Model
model = MultinomialNB(alpha=0.0000000001)

print('Ensemble of all different data 7')

#probs = auto_ensemble(df_all_us, Xtea, ytea, model = model)
probs = auto_ensemble(df_list, Xtea, ytea, model = model)

In [None]:
# Ensemble of all different data 7 with XGB

df_list = df_new_list7

# Tuned Model
model = XGBClassifier(max_depth=3, learning_rate=0.9, n_estimators=100, min_child_weight=0)

print('Ensemble of all different data 7 with XGB')

#probs = auto_ensemble(df_all_us, Xtea, ytea, model = model)
probs = auto_ensemble(df_list, Xtea, ytea, model = model)

### The best model?
Oversampling using **SMOTE** and **XGB** seemed to give the best results, but only marginally better than simply using a tuned **XGB** model. None of the simple ensemble models gave the best overall accuracy. However, some did model the minority class better, so if that is the goal then one of those models should be picked, although the **XGB** model on all the data seemed to do almost as well.

## Apply the model to the unknown test data

In [None]:
# The y_test argument and the printed scores are removed from the functions,
# since the unknown test data has no labels.

def automate_tf2_test(X_train, y_train, X_test, model, is_SMOTE = False):
    ''' This function takes X_train, y_train, X_test, and a tuned classification model, vectorizes the text,
    optionally over-samples the training data with SMOTE, fits the model, and returns the predicted probabilities.
    '''

    vectorizer = CountVectorizer() # Slightly better without stop words

    # Fit and Transform the training data: X_train_t 
    X_train_t = vectorizer.fit_transform(X_train.values)

    # Transform the test data: X_test_t 
    X_test_t = vectorizer.transform(X_test.values)
        
    if is_SMOTE:
        # Older imblearn releases spelled these ratio=1.0 and fit_sample
        sm = SMOTE(random_state=12, sampling_strategy=1.0)
        X_train_t, y_train = sm.fit_resample(X_train_t, y_train)
    
    # Fit the classifier to the training data
    model.fit(X_train_t, y_train)
    
    pred_probs = model.predict_proba(X_test_t)

    return pred_probs


# Automate function to account for ensembling comparisons

def auto_ensemble_test(list_dfs, X_test, model, is_SMOTE = False):
    ''' This function automates the ensemble process. It takes a list of dfs, fits the model to each one, and averages
    the predicted probabilities to produce the ensembled predictions. 
    inputs:
    list_dfs is a list of data frames with both x and y columns.
    X_test: a pandas series of the unlabeled test text.
    model: the model that is refit on each df in the list
    '''
    
    pred_probs_tot = 0
        
    for df in list_dfs:
        X_train = df['text']
        y_train = df['flag']
        pred_prob0 = automate_tf2_test(X_train, y_train, X_test, model, is_SMOTE)
        pred_probs_tot = pred_probs_tot + np.array(pred_prob0)
    
    pred_probs_ave = pred_probs_tot[:,1]/len(list_dfs)
    
    pred_ave_int = pred_probs_ave.round(0).astype(int)

    
    return (pred_probs_ave, pred_ave_int)

In [None]:
# load in the test data
test = pd.read_csv('test.csv')

In [None]:
# Fit the NB model to the full original data (the tuned XGB model is left commented out below)

df_list = [data]

X_test1 = test['question_text']

# XGB Model Tuned on the trained set using a holdout
#model = XGBClassifier(max_depth=3, learning_rate=0.9, n_estimators=300, min_child_weight=0)
model = MultinomialNB(alpha=0.0000000001)

#print('Original Train data with SMOTE and XGB')
print('Original Train data with NB model')

# Output the probabilities and flags for new data
probs = auto_ensemble_test(df_list, X_test1, model = model, is_SMOTE = False)

print(probs[0][16:40])
print(probs[1][16:40])

In [None]:
# See the solution table!

test['probs'] = probs[0]
test['prediction'] = probs[1]

solution = pd.concat([test['qid'], test['prediction']], axis=1)
solution[12:30]

### Save the final solution table as .csv

In [None]:
solution.to_csv('submission.csv', index=False) # save table to .csv file

### Inspect the Model by Looking at the Labels

In [None]:
# Get the class labels: class_labels
class_labels = nb_classifier_t.classes_

# Extract the features: feature_names
feature_names = tfidf_vectorizer.get_feature_names()

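# Zip the feature names together with the log probabilities of the features for the 1s class
# and sort by weights: feat_with_weights
# (feature_log_prob_[1] holds the same values as the older coef_[0] attribute for a two-class model)
feat_with_weights = sorted(zip(nb_classifier_t.feature_log_prob_[1], feature_names))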

# Print the first class label and the top 30 feat_with_weights entries
print(class_labels[0], feat_with_weights[:30])

print('') # print a break

# Print the second class label and the bottom 30 feat_with_weights entries
print(class_labels[1], feat_with_weights[-30:])

The features at the end of the sorted list are the tokens the model weights most heavily for the 1s class, while those at the start of the list are the tokens it weights least for that class. Checking that these tokens make sense for each class is a quick sanity check on the model.

# The rest of the notebook is just some code used to tinker with and tune the models.

## Tuning and some different classifier algorithms: XGBoost

In [None]:
# Use a XGBoost method to model the TfidfVectorizer data
from xgboost import XGBClassifier

# Instantiate a XGBoost classifier: xgb_classifier_t tune with different parameters
xgb_classifier_t = XGBClassifier(max_depth=3, learning_rate=0.9, n_estimators=50, min_child_weight=0) # 

# Fit the classifier to the training data
xgb_classifier_t.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred_xgbt = xgb_classifier_t.predict(tfidf_test)

# Calculate the accuracy score: score
score_t = metrics.accuracy_score(y_test, pred_xgbt)
print(score_t)
print('F1:', metrics.f1_score(y_test, pred_xgbt))
print('Prec:', metrics.precision_score(y_test, pred_xgbt))
print('Rec:', metrics.recall_score(y_test, pred_xgbt))

# Calculate the confusion matrix: cm
cm_xgbt = metrics.confusion_matrix(y_test, pred_xgbt, labels=[0,1])
print(cm_xgbt)
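
The scores printed above can all be recovered from the confusion matrix. The sketch below assumes the scikit-learn layout for `labels=[0,1]` (rows are true labels, columns are predicted labels); the helper name `scores_from_confusion` and the counts are made up for illustration and are not results from this notebook.

```python
def scores_from_confusion(cm):
    """Return accuracy, precision, recall and F1 from a 2x2 confusion matrix.

    Assumes the scikit-learn layout with labels=[0, 1]:
    rows are true labels, columns are predicted labels.
    """
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Made-up counts for an imbalanced test set (not results from this notebook)
toy_cm = [[90, 5],
          [3, 2]]
acc, prec, rec, f1 = scores_from_confusion(toy_cm)
print('Acc:', acc, 'Prec:', round(prec, 3), 'Rec:', round(rec, 3), 'F1:', round(f1, 3))
```

With these made-up counts the accuracy is 0.92 while F1 is only about 0.33, which is why accuracy alone is a weak guide when one class is rare.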

In [None]:
# Use an XGBoost method to model the CountVectorizer data

# Instantiate an XGBoost classifier: xgb_classifier_c tune with different parameters
xgb_classifier_c = XGBClassifier(max_depth=3, learning_rate=0.9, n_estimators=50, min_child_weight=0)

# Fit the classifier to the training data
xgb_classifier_c.fit(count_train, y_train)

# Create the predicted tags: pred
pred_xgbc = xgb_classifier_c.predict(count_test)

# Calculate the accuracy score: score
score_c = metrics.accuracy_score(y_test, pred_xgbc)
print(score_c)
print('F1:', metrics.f1_score(y_test, pred_xgbc))
print('Prec:', metrics.precision_score(y_test, pred_xgbc))
print('Rec:', metrics.recall_score(y_test, pred_xgbc))

# Calculate the confusion matrix: cm
cm_xgbc = metrics.confusion_matrix(y_test, pred_xgbc, labels=[0,1])
print(cm_xgbc)

It looks like XGBoost can do a bit better when tuned on the CountVectorizer data.

In [None]:
# Tune the XGBoost method to model the TfidfVectorizer data
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Each range currently holds a single value; widen the ranges to search more of the grid
param_grid = {'max_depth': np.arange(3, 4, 1), 
              'learning_rate': np.arange(0.9, 1, 0.1), 
              'n_estimators': np.arange(10, 20, 10), 
              'min_child_weight': np.arange(0, 1, 1)}

# Instantiate an XGBoost classifier: xgb_classifier_c (parameters are set by the grid search)
#xgb_classifier_c = XGBClassifier(max_depth=3, learning_rate=0.9, n_estimators=80, min_child_weight=0)
xgb_classifier_c = XGBClassifier()

# Instantiate the GridSearchCV object: grid_cv_xgb
kf = KFold(5, shuffle = True, random_state = 42) # creates the kfold object with random shuffling
grid_cv_xgb = GridSearchCV(xgb_classifier_c, param_grid, cv = kf)

# Fit it to the training data
grid_cv_xgb.fit(tfidf_train, y_train)

# Print the optimal parameters and best score
print("Best parameters: {}".format(grid_cv_xgb.best_params_))
print("Best Accuracy: {}".format(grid_cv_xgb.best_score_))

In [None]:
# Use a Naive Bayes method to model the TfidfVectorizer data

# Fit and transform the full data set for cross-validation: tfidf_X
tfidf_X = tfidf_vectorizer.fit_transform(X.values)

# Import necessary modules
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier tune with different alphas
nb_classifier_t = MultinomialNB(alpha=0.1) # alpha 0.1 seems to give the best results and beats CountVectorizer best results

# Fit the classifier to the training data
nb_classifier_t.fit(tfidf_train, y_train)

# Create the predicted probabilities for the positive class: pred_prob_t
pred_prob_t = nb_classifier_t.predict_proba(tfidf_test)[:,1]
#pred_t = nb_classifier_t.predict(tfidf_test)

# Compute and print AUC score
print("AUC: {}".format(roc_auc_score(y_test, pred_prob_t)))

# Compute cross-validated AUC scores: cv_auc
kf = KFold(5, shuffle = True, random_state = 42) # creates the kfold object with random shuffling
cv_auc = cross_val_score(nb_classifier_t, tfidf_X, y, cv = kf, scoring = 'roc_auc')

# Print list of AUC scores
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))
print("Average AUC scores computed using 5-fold cross-validation: {}".format(cv_auc.mean().round(5)))

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': np.arange(0.000000001, 0.91, 0.1)}

nb_classifier_n = MultinomialNB()

# Instantiate the GridSearchCV object: grid_cv
kf = KFold(5, shuffle = True, random_state = 42) # creates the kfold object with random shuffling
grid_cv = GridSearchCV(nb_classifier_n, param_grid, cv = kf)

# Fit it to the training data
grid_cv.fit(tfidf_train, y_train)

# Print the optimal parameters and best score
print("Best alpha: {}".format(grid_cv.best_params_))
print("Best Accuracy: {}".format(grid_cv.best_score_))
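
The grid searches above pick parameters by accuracy, which is the default score `GridSearchCV` uses for classifiers. Because insincere questions are rare, one possible alternative is to score by F1 and use stratified folds. The sketch below is a minimal illustration on synthetic count data; the `toy_` names, the data, and the alpha values are assumptions made up for this example, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# Synthetic, imbalanced count data, invented for illustration only
rng = np.random.default_rng(42)
toy_y = np.array([1] * 40 + [0] * 160)
toy_X = rng.poisson(1.0, size=(200, 20))
# Give the positive class higher counts in the first five columns
toy_X[toy_y == 1, :5] += rng.poisson(2.0, size=(40, 5))

toy_grid = {'alpha': [0.01, 0.1, 0.5, 1.0]}

# Stratified folds keep the class ratio the same in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
toy_cv = GridSearchCV(MultinomialNB(), toy_grid, cv=skf, scoring='f1')
toy_cv.fit(toy_X, toy_y)

# Print the optimal parameters and best F1 score
print("Best alpha: {}".format(toy_cv.best_params_))
print("Best F1: {}".format(toy_cv.best_score_))
```

The same `scoring='f1'` argument could be passed to the searches on `tfidf_train`, at the cost of selecting a different alpha than the accuracy-based search.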

## Final solution

In [None]:
solution.to_csv('p2_solution.csv') # save table to .csv file
solution.head(20)