# Machine Learning

Machine learning uses algorithms that can figure out how to perform important tasks by generalizing from examples

Supervised learning involves inferring a function from labeled training data to make predictions on unseen data

Unsupervised learning involves deriving structure from unlabeled data, where there is no known outcome variable to predict

The spam filter classification problem we are dealing with involves supervised learning

# K-fold Cross Validation

In k-fold cross validation, the full dataset is divided into k subsets and the holdout method is repeated k times

Each time, one of the k subsets is used as the test set and the other k-1 subsets are used to train the model
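The splitting described above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the helper name `k_fold_indices` is hypothetical, and it assumes contiguous, unshuffled folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        # one fold is held out as the test set ...
        test_idx = list(range(start, start + size))
        # ... and the other k-1 folds form the training set
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample lands in the test set exactly once, so each observation is used for both training and evaluation across the k rounds.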

# Evaluation Metrics

Accuracy = # predicted correctly / total

Precision = # correctly predicted as spam / total predicted spam

Recall = # correctly predicted as spam / total spam

If false positives are very costly, we will want to optimize for precision

If false negatives are very costly, we will want to optimize for recall
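The three metrics above can be computed by hand from true and predicted labels. The sketch below is illustrative only: the function name `spam_metrics` and the toy labels are made up for this example.

```python
def spam_metrics(y_true, y_pred, pos_label="spam"):
    """Return (accuracy, precision, recall) for the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)  # spam caught
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)  # spam missed
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# toy labels (made up): 3 spam messages, 5 ham messages
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham", "ham", "ham", "ham"]

accuracy, precision, recall = spam_metrics(y_true, y_pred)
print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}")
```

In this toy case the model never flags ham as spam (precision 1.0) but misses two of the three spam messages (recall 0.333), which shows why accuracy alone can hide a weak spam detector.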

# Random Forest

Random forest is an ensemble model, meaning it creates many models then combines them to create a mega model

The idea is to create a bunch of relatively weak models that can combine to make a strong model (each model votes on a prediction value)

Random forest models construct a collection of decision trees then aggregate the predictions of each tree to determine the final prediction
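The voting step can be sketched as a simple majority vote over the labels predicted by each tree. The function name `majority_vote` and the votes are hypothetical; note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this is a simplification of the idea.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine the per-tree labels for one message into a single label."""
    return Counter(tree_predictions).most_common(1)[0][0]


# hypothetical predictions from 5 trees for a single message
votes = ["spam", "ham", "spam", "spam", "ham"]
final_label = majority_vote(votes)
print(final_label)
```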

There are many benefits to random forests and ensemble methods:

-- Compatible with classification and regression problems

-- Can easily handle outliers, missing data, etc.

-- Accepts various types of data (ordinal, continuous, etc.)

-- Less likely to overfit

-- Outputs feature importance

We will build a random forest model in the code below

In [1]:
import nltk
import pandas as pd
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# stop words

stops = nltk.corpus.stopwords.words('english')

In [3]:
# stemmer

ps = nltk.PorterStemmer()

In [4]:
# load data

data = pd.read_csv("SMSSpamCollection.tsv", sep = '\t', header = None)

data.columns = ['label', 'text']

In [8]:
# create new features - need to make count punctuation function first

def count_punc(t):
    
    num_puncs = sum([1 for x in t if x in string.punctuation])
    
    return round(num_puncs / (len(t) - t.count(" ")), 3)*100

In [9]:
# new features

data['text_length'] = data['text'].apply(lambda x: len(x))

data['percent_punc'] = data['text'].apply(lambda x: count_punc(x))

In [10]:
data.head()

Unnamed: 0,label,text,text_length,percent_punc
0,ham,I've been searching for the right words to tha...,196,2.5
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,4.7
2,ham,"Nah I don't think he goes to usf, he lives aro...",61,4.1
3,ham,Even my brother is not like to speak with me. ...,77,3.2
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,35,7.1


In [16]:
# define clean text function

def clean_text(t):
    
    text = "".join([x.lower() for x in t if x not in string.punctuation])
    
    tokens = re.split(r'\W+', text)
    
    text = [ps.stem(x) for x in tokens if x not in stops]
    
    return text

In [17]:
# create TFIDF vectorizer with clean text analyzer

tfidf_vec = TfidfVectorizer(analyzer=clean_text)

In [18]:
# fit and transform vectorizer

X_tfidf = tfidf_vec.fit_transform(data['text'])

In [21]:
# create another object for feature only (no labels)

X_features = pd.concat([data['text_length'], data['percent_punc'], pd.DataFrame(X_tfidf.toarray())], axis = 1)

In [22]:
# check out feature data

X_features.head()

Unnamed: 0,text_length,percent_punc,0,1,2,3,4,5,6,7,...,8097,8098,8099,8100,8101,8102,8103,8104,8105,8106
0,196,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,155,4.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,61,4.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,77,3.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,35,7.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
# import random forest classifier

from sklearn.ensemble import RandomForestClassifier

In [24]:
# what attributes and methods are contained in this object?

dir(RandomForestClassifier)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_cache',
 '_abc_negative_cache',
 '_abc_negative_cache_version',
 '_abc_registry',
 '_estimator_type',
 '_get_param_names',
 '_make_estimator',
 '_set_oob_score',
 '_validate_X_predict',
 '_validate_estimator',
 '_validate_y_class_weight',
 'apply',
 'decision_path',
 'feature_importances_',
 'fit',
 'get_params',
 'predict',
 'predict_log_proba',
 'predict_proba',
 'score',
 'set_params']

In [25]:
# note that feature importances will give us info on the most helpful features for the model

# fit will allow us to fit the model, while predict will allow us to make predictions on new test data

print(RandomForestClassifier())

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)


In [26]:
# max depth is how deep our decision tree is - default is no max depth

# n estimators is how many trees will be built in the random forest - default is 10 in the version shown above (newer scikit-learn releases default to 100)

# we will want to run the model through cross validation - need to import

from sklearn.model_selection import KFold, cross_val_score

In [28]:
# Kfolds will split our data into cross validation subsets

# cross val score will tell us how our model scored on each subset

# start by creating instance of RandomForestClassifier() - n_jobs=-1 allows us to run the model faster in parallel

rf = RandomForestClassifier(n_jobs=-1)

In [29]:
# create KFolds object with 5 cross validation subsets

k_fold = KFold(n_splits=5)

In [30]:
# let's call cross_val_score with the random forest model as the estimator (first argument)

# we then pass in the features (X values) and the spam/ham labels (y values) and the kfolds object as the cv argument

# we will start by using accuracy as our scoring method and set n_jobs = -1 to run parallel calculations

cross_val_score(rf, X_features, data['label'], cv = k_fold, scoring = 'accuracy', n_jobs = -1)

array([0.96947935, 0.96947935, 0.96409336, 0.9640611 , 0.96855346])
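The five per-fold accuracies are easier to compare against other models when summarized as a mean and a spread. A minimal sketch using the scores printed above (copied by hand into a list; the variable names are illustrative):

```python
from statistics import mean, stdev

# per-fold accuracies copied from the cross_val_score output above
fold_scores = [0.96947935, 0.96947935, 0.96409336, 0.9640611, 0.96855346]

mean_accuracy = mean(fold_scores)
spread = stdev(fold_scores)

print(f"Mean accuracy: {mean_accuracy:.3f}")
print(f"Std dev across folds: {spread:.4f}")
```

A small standard deviation suggests the model performs consistently regardless of which subset is held out.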

In [31]:
# let's now create and use a holdout test set to evaluate our model performance

from sklearn.metrics import precision_recall_fscore_support as score

from sklearn.model_selection import train_test_split

In [34]:
# create the training and testing data with 20% of data in test set

X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size = 0.2)

In [35]:
# make a new random forest classifier with 50 trees and max depth of 20

rf = RandomForestClassifier(n_estimators = 50, max_depth = 20, n_jobs = -1)

In [36]:
# fit the random forest classifier into a model

rf_model = rf.fit(X_train, y_train)

In [38]:
# let's check out the feature importances of the model

sorted(zip(rf_model.feature_importances_, X_train.columns), reverse = True)[0:10]

[(0.07820339173773969, 'text_length'),
 (0.04074969502520828, 7353),
 (0.03632211190646081, 1804),
 (0.030691484822236527, 5727),
 (0.028108417496141336, 2032),
 (0.021775227340367763, 7030),
 (0.021743903809094082, 3135),
 (0.01989730051490501, 6749),
 (0.018103013080741356, 6288),
 (0.014577571923090884, 690)]

In [39]:
# let's predict values on the testing data and assign to an object

pred = rf_model.predict(X_test)

In [40]:
# score our predictions - the pos label is the label we are interested in predicting (in this case, spam)

# score returns 4 values (precision, recall, fscore, support), so we unpack them into 4 variables

precision, recall, fscore, support = score(y_test, pred, pos_label = 'spam', average = 'binary')

In [46]:
print(f"Precision: {round(precision,3)} \nRecall: {round(recall,3)} \nAccuracy: {round((pred==y_test).sum()/len(y_test),3)}")

Precision: 1.0 
Recall: 0.582 
Accuracy: 0.945
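The `fscore` value returned by `score` is never printed in this notebook. With the default settings it is the F1 score, the harmonic mean of precision and recall, which gives a single number that penalizes a large gap between the two. A minimal sketch using the rounded values printed above (the helper name `f1_from` is hypothetical):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# rounded precision and recall taken from the output above
f1 = f1_from(1.0, 0.582)
print(f"F1: {round(f1, 3)}")
```

Even with perfect precision, the low recall pulls F1 well below the 0.945 accuracy, which is a more honest picture of how much spam is slipping through.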


In [49]:
# Can we make our model better by changing the hyperparameter settings?

# We can find out using a Grid Search

# We will approach this by defining a function with entire RF training and prediction process

def train_rf(n_est, depth):
    
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs = -1)
    
    rf_model = rf.fit(X_train, y_train)
    
    pred = rf_model.predict(X_test)
    
    precision, recall, fscore, support = score(y_test, pred, pos_label='spam', average='binary')
    
    print(f'Est: {n_est}, Depth: {depth}')
    print(f'Precision: {round(precision,3)}, Recall: {round(recall,3)}, Accuracy: {round((pred==y_test).sum()/len(y_test),3)}')

In [50]:
for n_est in [10, 50, 100]:
        
        for depth in [10, 20, 30, None]:
            
            train_rf(n_est, depth)

Est: 10, Depth: 10
Precision: 1.0, Recall: 0.219, Accuracy: 0.898
Est: 10, Depth: 20
Precision: 1.0, Recall: 0.521, Accuracy: 0.937
Est: 10, Depth: 30
Precision: 1.0, Recall: 0.637, Accuracy: 0.952
Est: 10, Depth: None
Precision: 1.0, Recall: 0.76, Accuracy: 0.969
Est: 50, Depth: 10
Precision: 1.0, Recall: 0.253, Accuracy: 0.902
Est: 50, Depth: 20
Precision: 1.0, Recall: 0.5, Accuracy: 0.934
Est: 50, Depth: 30
Precision: 1.0, Recall: 0.692, Accuracy: 0.96
Est: 50, Depth: None
Precision: 1.0, Recall: 0.788, Accuracy: 0.972
Est: 100, Depth: 10
Precision: 1.0, Recall: 0.219, Accuracy: 0.898
Est: 100, Depth: 20
Precision: 1.0, Recall: 0.562, Accuracy: 0.943
Est: 100, Depth: 30
Precision: 1.0, Recall: 0.664, Accuracy: 0.956
Est: 100, Depth: None
Precision: 1.0, Recall: 0.815, Accuracy: 0.976


In [51]:
# We will now start using GridSearchCV to assist us in our parameter tuning

# start by creating a count vectorized document term matrix to compare against TFIDF matrix

from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer(analyzer = clean_text)

In [52]:
X_count = count_vec.fit_transform(data['text'])

In [54]:
# create new feature sets

X_tfidf_feat = pd.concat([data['text_length'], data['percent_punc'], pd.DataFrame(X_tfidf.toarray())], axis = 1)

X_count_feat = pd.concat([data['text_length'], data['percent_punc'], pd.DataFrame(X_count.toarray())], axis = 1)

In [56]:
# load in GridSearchCV and Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [57]:
# Our parameter grid will be a dictionary called param

# Keys will be parameter names and values will be the ranges to explore

rf = RandomForestClassifier()

param = {'n_estimators':range(50, 301, 50), 'max_depth':range(30, 151, 30)}

In [58]:
# now we can run GridSearchCV and assign to gs

gs = GridSearchCV(rf, param, cv = 5, n_jobs = -1)
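To get a sense of how much work this grid search involves, the parameter combinations can be enumerated by hand. This sketch only counts the fits implied by the grid and `cv = 5`; it does not reproduce what `GridSearchCV` does internally, and by default `GridSearchCV` also refits the best combination once on the full data.

```python
from itertools import product

# same grid as above
param = {'n_estimators': range(50, 301, 50), 'max_depth': range(30, 151, 30)}
n_folds = 5

# every combination of one value per parameter
combos = [dict(zip(param.keys(), values)) for values in product(*param.values())]
total_fits = len(combos) * n_folds

print(f"{len(combos)} parameter combinations x {n_folds} folds = {total_fits} model fits")
```

This explains why the fit step below can take a long time on a wide feature matrix.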

In [59]:
# we can fit gs like any other model - run for both TFIDF and Count Matrices

gs_tfidf = gs.fit(X_tfidf_feat, data['label'])

In [63]:
# let's see the results from each fold in this model - we clean it up in a dataframe and sort by avg test score

# we will use head to only look at top 5 models

pd.DataFrame(gs_tfidf.cv_results_).sort_values('mean_test_score', ascending = False).head()



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
18,18.461405,2.467248,0.523001,0.177135,120,50,"{'max_depth': 120, 'n_estimators': 50}",0.979372,0.976661,0.975741,...,0.975395,0.002593,1,1.0,0.999775,1.0,0.999776,0.999551,0.99982,0.000168
12,16.171962,1.38719,0.451994,0.151383,90,50,"{'max_depth': 90, 'n_estimators': 50}",0.975785,0.978456,0.97664,...,0.975216,0.002458,2,0.998877,0.998877,0.999327,0.999551,0.998204,0.998967,0.000462
13,29.632629,1.124234,0.545938,0.202112,90,100,"{'max_depth': 90, 'n_estimators': 100}",0.981166,0.979354,0.974843,...,0.975216,0.004787,2,0.999326,0.999102,0.999102,0.999327,0.999327,0.999237,0.00011
27,50.632745,3.199368,0.552323,0.066546,150,200,"{'max_depth': 150, 'n_estimators': 200}",0.981166,0.980251,0.974843,...,0.975036,0.005076,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0
29,54.589573,6.306942,0.333509,0.091739,150,300,"{'max_depth': 150, 'n_estimators': 300}",0.978475,0.979354,0.973046,...,0.974677,0.003535,5,1.0,1.0,1.0,1.0,1.0,1.0,0.0


In [64]:
# run it all again with the count matrix instead of the tfidf matrix

rf = RandomForestClassifier()

param = {'n_estimators':range(50, 301, 50), 'max_depth':range(30, 151, 30)}

gs = GridSearchCV(rf, param, cv = 5, n_jobs = -1)

gs_count = gs.fit(X_count_feat, data['label'])

In [65]:
# look at results from count vectorizer model

pd.DataFrame(gs_count.cv_results_).sort_values('mean_test_score', ascending = False).head()



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
28,70.846042,2.200965,0.77099,0.084079,150,250,"{'max_depth': 150, 'n_estimators': 250}",0.978475,0.975763,0.974843,...,0.973779,0.003488,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
22,69.897827,0.689851,0.858779,0.088106,120,250,"{'max_depth': 120, 'n_estimators': 250}",0.977578,0.976661,0.974843,...,0.97342,0.003722,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
21,60.197678,3.25249,0.632771,0.049472,120,200,"{'max_depth': 120, 'n_estimators': 200}",0.980269,0.974865,0.972147,...,0.97324,0.004089,3,0.999775,1.0,1.0,1.0,1.0,0.999955,9e-05
20,49.499514,1.229818,0.706652,0.128027,120,150,"{'max_depth': 120, 'n_estimators': 150}",0.977578,0.972172,0.974843,...,0.97324,0.00276,3,0.999775,1.0,1.0,1.0,1.0,0.999955,9e-05
19,31.003665,0.295832,0.734116,0.255641,120,100,"{'max_depth': 120, 'n_estimators': 100}",0.976682,0.973968,0.973944,...,0.97324,0.002245,3,1.0,0.999775,0.999551,1.0,0.999776,0.99982,0.000168


The TFIDF vectorizer yielded the best predictions with a smaller number of estimators

The count vectorizer yielded the best predictions with more estimators

For both, the most accurate models use large max_depth values (90 or more in the top 5 results)

However, note that mean fit time is much lower when depth and number of estimators are lower

Note that we would normally explore a lot more options:

-- Check whether or not including stop words is helpful

-- Check whether removing punctuation is helpful

-- Check results with n-grams

# Gradient Boosting Models

Gradient Boosting is another ensemble model similar to random forests (many models built and combined to make one powerful model)

It takes an iterative approach to combining weak learners to create a strong learner by focusing on mistakes of prior iterations

The first trees are tiny (essentially stumps), but the model continues to focus on what it got wrong in the previous model

The main difference between the two is that Gradient Boosting uses boosting (each new tree is fit to the errors of the trees before it), while Random Forest uses bagging (each tree is trained independently on a random sample of the data)

Unlike random forest, a gradient boosting model cannot train trees in parallel - it must be done iteratively

Gradient boosting uses a weighted vote for the final prediction, while random forest uses an unweighted vote

Gradient boosting models are also harder to tune and easier to overfit than random forests

The trade off is that gradient boosting models are typically more powerful if tuned properly
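The iterative idea can be sketched in plain Python. This is a deliberately simplified illustration, not what `GradientBoostingClassifier` does internally: it assumes a single numeric feature, squared-error loss on 0/1 targets, and one-split stumps. All function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    # skip the largest value so the right side of the split is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators=20, learning_rate=0.1):
    """Build stumps one after another, each fit to the current errors."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is still wrong
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # take a small step toward correcting the errors
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# toy data: the target switches from 0 to 1 between x=3 and x=4
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

weak = boost(xs, ys, n_estimators=1)
strong = boost(xs, ys, n_estimators=50)
print(f"MSE with 1 stump: {mse(weak, xs, ys):.4f}")
print(f"MSE with 50 stumps: {mse(strong, xs, ys):.6f}")
```

Because each stump depends on the residuals left by the previous ones, the loop cannot be parallelized across trees, and the learning rate controls how quickly the errors shrink.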

Time to code it out

In [66]:
from sklearn.ensemble import GradientBoostingClassifier

In [69]:
# create function to train gradient boosting model - we add another argument for learning rate parameter

def train_gb(est, max_depth, lr):
    
    gb = GradientBoostingClassifier(n_estimators=est, max_depth=max_depth, learning_rate=lr)
    
    gb_model = gb.fit(X_train, y_train)
    
    pred = gb_model.predict(X_test)
    
    precision, recall, fscore, support = score(y_test, pred, pos_label='spam', average='binary')
    
    print(f"Est: {est}, Max Depth: {max_depth}, Learning Rate: {lr}")
    print(f"Precision: {round(precision,3)}") 
    print(f"Recall: {round(recall,3)}") 
    print(f"Accuracy: {round((y_test==pred).sum()/len(pred),3)}")

In [70]:
# time to tune model - this will take a while

for n_est in [50, 100, 150]:
    
    for max_depth in [3, 7, 11, 15]:
        
        for lr in [0.01, 0.1, 1]:
            
            train_gb(n_est, max_depth, lr)

Est: 50, Max Depth: 3, Learning Rate: 0.01
Precision: 0.0
Recall: 0.0
Accuracy: 0.869
Est: 50, Max Depth: 3, Learning Rate: 0.1
Precision: 0.945
Recall: 0.705
Accuracy: 0.956
Est: 50, Max Depth: 3, Learning Rate: 1
Precision: 0.862
Recall: 0.767
Accuracy: 0.953
Est: 50, Max Depth: 7, Learning Rate: 0.01
Precision: 1.0
Recall: 0.007
Accuracy: 0.87
Est: 50, Max Depth: 7, Learning Rate: 0.1
Precision: 0.938
Recall: 0.822
Accuracy: 0.969
Est: 50, Max Depth: 7, Learning Rate: 1
Precision: 0.889
Recall: 0.822
Accuracy: 0.963
Est: 50, Max Depth: 11, Learning Rate: 0.01
Precision: 1.0
Recall: 0.021
Accuracy: 0.872
Est: 50, Max Depth: 11, Learning Rate: 0.1
Precision: 0.924
Recall: 0.829
Accuracy: 0.969
Est: 50, Max Depth: 11, Learning Rate: 1
Precision: 0.912
Recall: 0.856
Accuracy: 0.97
Est: 50, Max Depth: 15, Learning Rate: 0.01
Precision: 1.0
Recall: 0.014
Accuracy: 0.871
Est: 50, Max Depth: 15, Learning Rate: 0.1
Precision: 0.952
Recall: 0.822
Accuracy: 0.971
Est: 50, Max Depth: 15, Learni

In [71]:
# the worst models had learning rates of 0.01 and lower values for max depth and number of estimators 

# the best performing models had a learning rate of 0.1 and higher values for max depth and number of estimators

# moving onto next phase with GridSearchCV for gradient boosting

from sklearn.model_selection import GridSearchCV

In [75]:
# instantiate gb classifier and create dictionary of parameters

gb = GradientBoostingClassifier()

param = {
    'n_estimators':[100, 150],
    'max_depth':[12, 15],
    'learning_rate':[0.1, 0.2]
}

In [76]:
# create GridSearchCV model - note that n_jobs = -1 only parallelizes across parameter settings and folds; the trees inside each gradient boosting model are still built one after another

gs = GridSearchCV(gb, param, cv = 5, n_jobs = -1)

In [77]:
# fit model with tfidf - this could take a while

gs_tfidf = gs.fit(X_tfidf_feat, data['label'])

In [78]:
# create DataFrame for results, sort by mean test score, then show top 5 

pd.DataFrame(gs_tfidf.cv_results_).sort_values('mean_test_score', ascending = False).head()



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
4,411.058594,22.773779,0.435553,0.043537,0.2,12,100,"{'learning_rate': 0.2, 'max_depth': 12, 'n_est...",0.963229,0.980251,...,0.970726,0.005633,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
5,547.091751,8.083826,0.776125,0.310329,0.2,12,150,"{'learning_rate': 0.2, 'max_depth': 12, 'n_est...",0.965022,0.976661,...,0.970546,0.00422,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
7,541.780269,8.193791,0.252728,0.050285,0.2,15,150,"{'learning_rate': 0.2, 'max_depth': 15, 'n_est...",0.963229,0.981149,...,0.970187,0.006044,3,1.0,1.0,1.0,1.0,1.0,1.0,0.0
6,464.675907,13.922152,0.627123,0.13578,0.2,15,100,"{'learning_rate': 0.2, 'max_depth': 15, 'n_est...",0.961435,0.981149,...,0.96893,0.006682,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0
2,451.457653,6.657469,0.483105,0.048404,0.1,15,100,"{'learning_rate': 0.1, 'max_depth': 15, 'n_est...",0.961435,0.979354,...,0.96857,0.006032,5,1.0,1.0,1.0,1.0,1.0,1.0,0.0


In [79]:
# same process for count vectorizer

gb = GradientBoostingClassifier()

param = {
    'n_estimators':[100, 150],
    'max_depth':[12, 15],
    'learning_rate':[0.1, 0.2]
}

gs = GridSearchCV(gb, param, cv = 5, n_jobs = -1)

gs_count = gs.fit(X_count_feat, data['label'])

pd.DataFrame(gs_count.cv_results_).sort_values('mean_test_score', ascending = False).head()



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
7,702.508892,7.215908,0.384342,0.057611,0.2,15,150,"{'learning_rate': 0.2, 'max_depth': 15, 'n_est...",0.965919,0.982047,...,0.972522,0.005606,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
5,513.285822,3.617929,0.45688,0.080212,0.2,12,150,"{'learning_rate': 0.2, 'max_depth': 12, 'n_est...",0.96861,0.982944,...,0.972342,0.005402,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
6,461.243162,17.317166,0.567244,0.188291,0.2,15,100,"{'learning_rate': 0.2, 'max_depth': 15, 'n_est...",0.965022,0.981149,...,0.971803,0.00537,3,1.0,1.0,1.0,1.0,1.0,1.0,0.0
1,553.485005,5.391244,0.58284,0.109097,0.1,12,150,"{'learning_rate': 0.1, 'max_depth': 12, 'n_est...",0.961435,0.979354,...,0.970187,0.005709,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0
4,335.422707,3.050208,0.403925,0.073323,0.2,12,100,"{'learning_rate': 0.2, 'max_depth': 12, 'n_est...",0.96861,0.978456,...,0.970187,0.004599,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0


# Model Selection

To determine the best model, we need to use a training set and a test set

This will tell us if our model is overfitting the data 

In [80]:
from sklearn.model_selection import train_test_split

In [82]:
# this time we will create train and test data from original data instead of vectorized data

X_train, X_test, y_train, y_test = train_test_split(data[['text', 'text_length', 'percent_punc']], 
                                                    data['label'],
                                                    test_size = 0.2)

In [85]:
# vectorize with TFIDF 

tfidf_vec = TfidfVectorizer(analyzer=clean_text)

In [86]:
# fit the TFIDF vectorizer on the training text only, so the vocabulary does not include information from the test set

tfidf_vec_fit = tfidf_vec.fit(X_train['text'])

In [87]:
# now we need to transform the train and test sets

tfidf_vec_train = tfidf_vec.transform(X_train['text'])

tfidf_vec_test = tfidf_vec.transform(X_test['text'])
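The reason for fitting on the training text and then transforming both sets can be shown with a toy count vectorizer written in plain Python. The helper names and messages below are hypothetical and far simpler than `TfidfVectorizer`: the vocabulary is learned from the training messages only, so tokens that appear only in the test messages are ignored and both matrices end up with the same columns.

```python
def fit_vocabulary(train_docs):
    """Learn a token -> column index mapping from the training text only."""
    tokens = sorted({tok for doc in train_docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Count known tokens per document; unseen tokens are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.lower().split():
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows


# made-up messages for illustration
train_docs = ["free prize now", "see you now"]
test_docs = ["free lunch now"]

vocab = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)

print(vocab)
print(train_matrix)
print(test_matrix)  # "lunch" was never seen in training, so it has no column
```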

In [89]:
# concatenate the matrices with the other columns in the feature data to make new train and test dataframes

X_train_vec = pd.concat([pd.DataFrame(tfidf_vec_train.toarray()), 
                        X_train[['text_length', 'percent_punc']].reset_index(drop=True)], 
                        axis=1)

X_test_vec = pd.concat([pd.DataFrame(tfidf_vec_test.toarray()), 
                       X_test[['text_length', 'percent_punc']].reset_index(drop=True)], 
                       axis=1)

In [90]:
# see first 5 in both datasets

X_train_vec.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7129,7130,7131,7132,7133,7134,7135,7136,text_length,percent_punc
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,41,2.9
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,142,6.8
2,0.287788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,33,11.5
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,152,0.8
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,65,5.8


In [91]:
X_test_vec.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7129,7130,7131,7132,7133,7134,7135,7136,text_length,percent_punc
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,96,2.6
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24,19.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,165,1.4
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,126,2.0


In [92]:
# now that the data is ready, let's move onto the final model selection

# make sure all packages are loaded

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time

In [98]:
# build random forest model

# we will also include extra lines of code to test how long it took to run the code

rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

start = time.time()

rf_model = rf.fit(X_train_vec, y_train)

end = time.time()

rf_fit_time = (end - start)

In [99]:
# predict classes and time code

start = time.time()

pred = rf_model.predict(X_test_vec)

end = time.time()

rf_pred_time = (end - start)

In [100]:
# evaluate results

precision, recall, fscore, support = score(y_test, pred, pos_label='spam', average='binary')

print(f"Precision: {round(precision, 3)}\nRecall: {round(recall, 3)}")

print(f"Accuracy: {round((pred==y_test).sum()/len(pred), 3)}\n")

print(f"Time to fit: {rf_fit_time},\nTime to predict: {rf_pred_time}")

Precision: 1.0
Recall: 0.839
Accuracy: 0.978

Time to fit: 5.696479797363281,
Time to predict: 0.2303476333618164


In [104]:
# build gradient boosting model

gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

start = time.time()

gb_model = gb.fit(X_train_vec, y_train)

end = time.time()

gb_fit_time = (end - start)

In [105]:
# predict classes and time code

start = time.time()

pred = gb_model.predict(X_test_vec)

end = time.time()

gb_pred_time = (end - start)

In [107]:
# evaluate results

precision, recall, fscore, support = score(y_test, pred, pos_label='spam', average='binary')

print(f"Precision: {round(precision, 3)}\nRecall: {round(recall, 3)}")

print(f"Accuracy: {round((pred==y_test).sum()/len(pred), 3)}\n")

print(f"Time to fit: {gb_fit_time},\nTime to predict: {gb_pred_time}")

Precision: 0.918
Recall: 0.826
Accuracy: 0.967

Time to fit: 400.3975188732147,
Time to predict: 0.24857258796691895


# Two final points

There would normally be a much more thorough evaluation of the model in a real-life scenario, including:

-- Evaluation of performance on specific subsets (e.g. those with length 50 or more)

-- Evaluation of specific messages the model is getting wrong

The final model selection would be based on its alignment with the goals of the project (e.g. prioritizing precision or total accuracy)

-- Will a longer predict time create a significant bottleneck in your process? Is it feasible to spend a long time training the model?

-- Is there a higher cost for false positives (which lower precision) or false negatives (which lower recall)?

-- In the spam filter we probably want to prioritize precision, since we don't want people's real messages getting caught in the filter and the cost of a few spam messages getting through is low

-- In antivirus software we would probably want to prioritize recall, since there is a high price for falsely believing a program is not harmful when it actually is
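One common way to trade precision against recall is to change the probability threshold at which a message is labeled spam. The sketch below uses made-up probabilities of the kind `predict_proba` would return for the spam class. The function name, the scores, and the thresholds are all hypothetical.

```python
def precision_recall_at(threshold, spam_probs, y_true):
    """Label a message spam when its spam probability meets the threshold."""
    y_pred = ["spam" if p >= threshold else "ham" for p in spam_probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == "spam" and p == "spam")
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == "ham" and p == "spam")
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == "spam" and p == "ham")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# made-up spam probabilities and true labels for 6 messages
spam_probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
y_true = ["spam", "spam", "ham", "spam", "ham", "ham"]

strict = precision_recall_at(0.7, spam_probs, y_true)    # favors precision
lenient = precision_recall_at(0.35, spam_probs, y_true)  # favors recall

print(f"threshold 0.70 -> precision {strict[0]:.2f}, recall {strict[1]:.2f}")
print(f"threshold 0.35 -> precision {lenient[0]:.2f}, recall {lenient[1]:.2f}")
```

A strict threshold suits the spam filter case (no real messages are blocked, some spam gets through), while a lenient threshold suits the antivirus case (nothing harmful is missed, at the cost of some false alarms).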

In [111]:
ps.stem('meaning')

'mean'

In [112]:
print(re.split(r'\W+',"some of the-words are+combined"))

['some', 'of', 'the', 'words', 'are', 'combined']


In [115]:
s = "This is a test for the man to be successful in their lives"

split_s = re.split(r'\s+', s)

slist = [x for x in split_s if x not in stops]

print(slist)

['This', 'test', 'man', 'successful', 'lives']
