# Model Creation

### Contents:
- [Multinomial Naive Bayes](#Multinomial-Naive-Bayes)
- [Decision Tree](#Decision-Tree)
- [Bagging Classifier](#Bagging-Classifier)
- [Random Forest](#Random-Forest)
- [Extra Trees](#Extra-Trees)
- [AdaBoost Classifier](#AdaBoost-Classifier)

## Import Libraries

In [1]:
# Importing libraries needed for modeling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.utils import resample
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import stop_words
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier 
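
The `sklearn.feature_extraction.stop_words` module imported above is no longer available in newer scikit-learn releases, and recent releases may also reject a frozenset passed as `stop_words`. A minimal sketch of a more portable import, assuming the same built-in English list is wanted; the toy documents are hypothetical and not taken from the dataset:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Convert the frozenset to a plain list, which the vectorizer accepts.
stop_word_list = sorted(ENGLISH_STOP_WORDS)

# Hypothetical toy documents, used only to show the stop words being dropped.
toy_docs = ["the wildfire is spreading", "smoke and the wildfire"]

vectorizer = TfidfVectorizer(stop_words=stop_word_list)
vectorizer.fit(toy_docs)

# The learned vocabulary keeps content words and drops stop words such as "the".
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

Passing the string `"english"` as `stop_words` selects the same built-in list without any import.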



## Import CSV Files

In [2]:
# Reading in second CSV file and saving it to a dataframe
twitter = pd.read_csv("./data/twitter_preprocessed_all.csv")
twitter.head()

| | date | times_retweeted | times_favorited | bot_rating | words |
|---|---|---|---|---|---|
| 0 | 2019-10-28 16:04:00+00:00 | 0 | 1 | 0.005364 | AMPMUZIC #CaliforniaFires #californiawildfires... |
| 1 | 2019-11-12 03:06:00+00:00 | 2 | 1 | 0.014544 | dwatchnews nam Rebirth, angst and the 'new n... |
| 2 | 2019-11-03 20:10:28+00:00 | 0 | 0 | 0.036578 | WaterSolarWind Trump melts down on Pelosi du... |
| 3 | 2019-10-26 08:48:42+00:00 | 2 | 2 | 0.097414 | BombayHeadlines #CaliforniaWildfire #californi... |
| 4 | 2019-11-02 21:57:37+00:00 | 1 | 1 | 0.008751 | studentveronica California Wildfires Signal ... |


In [3]:
# Check the shape of the dataframe rows by columns
twitter.shape

(28106, 5)

In [4]:
# Checking for null values
twitter.isnull().sum()

date               0
times_retweeted    0
times_favorited    0
bot_rating         1
words              0
dtype: int64

In [5]:
# Dropping rows that contain null values, in place
twitter.dropna(inplace=True)

In [93]:
# Checking to see null values dropped from dataframe
twitter.isnull().sum()

date               0
times_retweeted    0
times_favorited    0
bot_rating         0
words              0
dtype: int64

In [6]:
# Changing the datatype for "bot_rating" from object to float64
twitter['bot_rating'] = pd.to_numeric(twitter['bot_rating'], errors='coerce')
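
With `errors='coerce'`, `pd.to_numeric` turns any value it cannot parse into `NaN`, so it can introduce new nulls after `dropna` has already run. A minimal sketch on a hypothetical series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical raw values: two parseable strings and one that is not numeric.
raw = pd.Series(["0.25", "not available", "0.75"])

# Unparseable entries become NaN instead of raising an error.
coerced = pd.to_numeric(raw, errors="coerce")

# Counting nulls after coercion shows whether any values were lost.
n_new_nulls = int(coerced.isna().sum())
print(coerced)
print(n_new_nulls)
```

In this notebook the later `twitter.info()` output reports 28105 non-null `bot_rating` values, so the coercion did not create any nulls here.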

In [7]:
# Confirming that datatype of "bot_rating" was correctly changed
twitter.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28105 entries, 0 to 28105
Data columns (total 5 columns):
date               28105 non-null object
times_retweeted    28105 non-null int64
times_favorited    28105 non-null int64
bot_rating         28105 non-null float64
words              28105 non-null object
dtypes: float64(1), int64(2), object(2)
memory usage: 1.3+ MB


In [8]:
# Creating a new boolean column, "likely_bot", that is True when "bot_rating" is greater than 0.5
# and False otherwise
twitter["likely_bot"] = twitter["bot_rating"] > .50

In [9]:
# Confirming the creation of the new column and boolean variables
twitter.head()

| | date | times_retweeted | times_favorited | bot_rating | words | likely_bot |
|---|---|---|---|---|---|---|
| 0 | 2019-10-28 16:04:00+00:00 | 0 | 1 | 0.005364 | AMPMUZIC #CaliforniaFires #californiawildfires... | False |
| 1 | 2019-11-12 03:06:00+00:00 | 2 | 1 | 0.014544 | dwatchnews nam Rebirth, angst and the 'new n... | False |
| 2 | 2019-11-03 20:10:28+00:00 | 0 | 0 | 0.036578 | WaterSolarWind Trump melts down on Pelosi du... | False |
| 3 | 2019-10-26 08:48:42+00:00 | 2 | 2 | 0.097414 | BombayHeadlines #CaliforniaWildfire #californi... | False |
| 4 | 2019-11-02 21:57:37+00:00 | 1 | 1 | 0.008751 | studentveronica California Wildfires Signal ... | False |


In [10]:
# Checking the values for the new column that we created, "likely_bot", and how many of each value there are 
# in the column
twitter["likely_bot"].value_counts()

False    26441
True      1664
Name: likely_bot, dtype: int64

In [11]:
# Checking the proportion of tweets flagged as likely bots
twitter["likely_bot"].mean()

0.05920654687777976

In [12]:
# Putting all likely bots in a separate dataframe for bootstrap sampling
twitter_bots_likely = twitter.loc[twitter["likely_bot"] == 1]

In [13]:
# Previewing the full dataframe again (the likely-bot subset is stored in twitter_bots_likely)
twitter.head()

| | date | times_retweeted | times_favorited | bot_rating | words | likely_bot |
|---|---|---|---|---|---|---|
| 0 | 2019-10-28 16:04:00+00:00 | 0 | 1 | 0.005364 | AMPMUZIC #CaliforniaFires #californiawildfires... | False |
| 1 | 2019-11-12 03:06:00+00:00 | 2 | 1 | 0.014544 | dwatchnews nam Rebirth, angst and the 'new n... | False |
| 2 | 2019-11-03 20:10:28+00:00 | 0 | 0 | 0.036578 | WaterSolarWind Trump melts down on Pelosi du... | False |
| 3 | 2019-10-26 08:48:42+00:00 | 2 | 2 | 0.097414 | BombayHeadlines #CaliforniaWildfire #californi... | False |
| 4 | 2019-11-02 21:57:37+00:00 | 1 | 1 | 0.008751 | studentveronica California Wildfires Signal ... | False |


In [14]:
# Changing the True/False values in the "likely_bot" column to integer values
twitter['likely_bot'] = twitter["likely_bot"].astype(int)

In [15]:
twitter.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28105 entries, 0 to 28105
Data columns (total 6 columns):
date               28105 non-null object
times_retweeted    28105 non-null int64
times_favorited    28105 non-null int64
bot_rating         28105 non-null float64
words              28105 non-null object
likely_bot         28105 non-null int64
dtypes: float64(1), int64(3), object(2)
memory usage: 1.5+ MB


In [16]:
# Checking the shape of the dataframe after adding the new column
twitter.shape

(28105, 6)

In [17]:
# Creating a bootstrap sample of the dataframe and setting it equal to a variable, boot
boot = resample(twitter_bots_likely, replace=True, n_samples=25000, random_state=22)

In [18]:
# Creating a new dataframe out of the original dataframe and the bootstrapped sample
twitter_bootstrapped = pd.concat([boot, twitter])

In [19]:
# Checking the values for the "likely_bot" column and how many of each value there are
twitter_bootstrapped["likely_bot"].value_counts()

1    26664
0    26441
Name: likely_bot, dtype: int64
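
Because the bootstrap sample is drawn before the train/test split, copies of the same bot tweet can end up in both sets, which can inflate test scores. One alternative, sketched here on a hypothetical toy dataframe rather than the real data, is to split first and oversample only the training rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: 4 likely bots and 16 other accounts.
toy = pd.DataFrame({
    "words": [f"tweet {i}" for i in range(20)],
    "likely_bot": [1] * 4 + [0] * 16,
})

# Split first, so the test set only contains rows never seen in training.
train, test = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["likely_bot"]
)

# Oversample the minority class inside the training set only.
minority = train[train["likely_bot"] == 1]
majority = train[train["likely_bot"] == 0]
n_extra = len(majority) - len(minority)
boot = resample(minority, replace=True, n_samples=n_extra, random_state=22)

train_balanced = pd.concat([train, boot])
print(train_balanced["likely_bot"].value_counts())
```

The test set keeps the original class balance, so its scores reflect how the model would behave on unseen, imbalanced data.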

## Model Preparation

In [20]:
# Defining X and y
X = twitter_bootstrapped["words"]
y = twitter_bootstrapped["likely_bot"]

In [21]:
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)

In [22]:
# Baseline accuracy: the share of the majority class in the test set
y_test.value_counts(normalize=True)

1    0.502071
0    0.497929
Name: likely_bot, dtype: float64
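
The baseline above is the accuracy of always predicting the most frequent class. The same number can be obtained with scikit-learn's `DummyClassifier`; this is a minimal sketch on hypothetical toy labels, not the notebook's data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: three positives and two negatives.
y_toy = np.array([1, 1, 1, 0, 0])
# The features are ignored by this strategy, so a column of zeros is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Any model worth keeping should score clearly above this value on the test set.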

## Multinomial Naive Bayes

In [57]:
# Building a pipeline
pipe1 = Pipeline([('tfidf', TfidfVectorizer()),
                     ('nb', MultinomialNB())
                ])
# Setting the parameters of the pipeline 
pipe_params1 = {
    'tfidf__max_features': [100, 1000, 10000],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__stop_words' : [None, stop_words.ENGLISH_STOP_WORDS],
}
# Instantiated the grid search
gs1 = GridSearchCV(pipe1, 
                  param_grid=pipe_params1
                 ) 
# Fitting the model
gs1.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [58]:
# Checking the best score
gs1.best_score_

0.8853820639790675

In [59]:
# Setting the best estimator as the model 
gs_model = gs1.best_estimator_

In [60]:
# Checking out the training score
gs_model.score(X_train, y_train)

0.9009490810485086

In [61]:
# Checking out the testing score 
gs_model.score(X_test, y_test)

0.8902613542215863
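
`best_score_` is the mean cross-validated score of the winning parameter combination, but the cells above do not show which combination won. The `best_params_` attribute of a fitted `GridSearchCV` exposes it. A minimal sketch on a hypothetical toy corpus (the documents, labels, and `cv=2` are assumptions chosen to keep the example small):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: four bot-like and four ordinary tweets.
docs = [
    "win free prizes click link now",
    "free followers click this link",
    "click link for free giveaway",
    "free prizes giveaway follow now",
    "wildfire smoke reached our neighborhood today",
    "evacuation orders lifted near the canyon",
    "firefighters contained the wildfire overnight",
    "smoke cleared after rain this morning",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__stop_words": [None, "english"],
}

# Two folds only because the toy corpus is tiny.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=2)
grid.fit(docs, labels)

# The parameter combination with the highest mean cross-validated score.
best_params = grid.best_params_
print(best_params)
print(grid.best_score_)
```

Recording the winning parameters next to each score makes it easier to justify fixing them in later grids.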

### Conclusions about this model:

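Multinomial Naive Bayes is often a strong choice for NLP. Here it scores about 0.90 on the training set and 0.89 on the test set, so it generalizes well, but it is outperformed by the Bagging Classifier, Random Forest, and Extra Trees models below.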

## Decision Tree

In [62]:
# Setting up a pipeline for Decision tree and tfidf Vectorizer
pipe2 = Pipeline([('tfidf', TfidfVectorizer()),
                     ('dt', DecisionTreeClassifier())
                ])
# The TF-IDF options are fixed to single values so the grid can cover more decision-tree hyperparameters;
# the same TF-IDF settings have consistently worked best
pipe_params2 = {
    'tfidf__max_features': [10000],
    'tfidf__ngram_range': [(1,1)],
    'tfidf__stop_words' : [stop_words.ENGLISH_STOP_WORDS],
    'dt__max_depth': [3, 10],
    'dt__min_samples_split': [5, 20],
    'dt__min_samples_leaf': [2, 7]
}
# Instantiated grid search
gs2 = GridSearchCV(pipe2, 
                  param_grid=pipe_params2) 
# Fitting the model
gs2.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [63]:
# Checking the best score
gs2.best_score_

0.7364164081527281

In [64]:
# Setting the best estimator to be the model
gs_model2 = gs2.best_estimator_

In [65]:
# Checking the training score for this model 
gs_model2.score(X_train, y_train)

0.7439991965451441

In [66]:
# Checking the testing score for this model 
gs_model2.score(X_test, y_test)

0.7346539127815018

### Conclusions about this model:

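The Decision Tree/TF-IDF Vectorizer model performed stronger than its counterpart, but at about 0.73 test accuracy it trails Multinomial Naive Bayes (0.89). The training score (0.74) is slightly higher than the test score, which hints at overfitting; however, a gap of about one percentage point is too small to draw any conclusions.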

## Bagging Classifier

In [67]:
# Building the pipeline for a bagging classifier 
pipe3 = Pipeline([('tfidf', TfidfVectorizer()),
                     ('bag', BaggingClassifier())
                ])
# Setting the parameters
pipe_params3 = {
    'tfidf__max_features': [10000],
    'tfidf__ngram_range': [(1,1)],
    'tfidf__stop_words' : [stop_words.ENGLISH_STOP_WORDS],
    'bag__max_samples' : [.5, 1.0, 10],
    'bag__n_estimators' : [2, 6, 10]
}
# Instantiated the grid search
gs3 = GridSearchCV(pipe3, 
                  param_grid=pipe_params3) 
# Fitting the model to the data
gs3.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [68]:
# Checking the best score
gs3.best_score_

0.9751932404495628

In [69]:
# Setting the model to the best estimator 
gs_model3 = gs3.best_estimator_

In [70]:
# Checking the training score
gs_model3.score(X_train, y_train)

0.99671085668374

In [71]:
# Checking the testing score 
gs_model3.score(X_test, y_test)

0.977781125254199

### Conclusions about this model:

The Bagging Classifier/TF-IDF Vectorizer model was one of the strongest models that was fit and tested, with a test accuracy of about 0.98. The training score (about 0.997) is higher than the test score, which hints at overfitting; however, a gap of about two percentage points is too small to draw any conclusions.
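
The `bag__max_samples` grid above mixes floats and an integer. In `BaggingClassifier`, a float is read as a fraction of the training rows, while an integer is read as an absolute row count, so `10` means ten rows per estimator. A minimal sketch on hypothetical toy data (the helper name `samples_per_estimator` is made up for illustration):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

# Hypothetical toy data: 40 rows, 2 features, alternating labels.
X_toy = np.arange(80).reshape(40, 2)
y_toy = np.array([0, 1] * 20)


def samples_per_estimator(max_samples):
    """Fit a small bagging ensemble and return how many rows the first estimator drew."""
    bag = BaggingClassifier(n_estimators=3, max_samples=max_samples, random_state=0)
    bag.fit(X_toy, y_toy)
    return len(bag.estimators_samples_[0])


as_fraction = samples_per_estimator(0.5)  # half of the 40 rows
as_count = samples_per_estimator(10)      # exactly 10 rows
print(as_fraction, as_count)
```

On a training set of tens of thousands of tweets, an integer value of `10` gives each estimator a very small sample, which is worth keeping in mind when reading the grid search results.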

## Random Forest

In [72]:
# Setting the pipeline for random forest 
pipe4 = Pipeline([('tfidf', TfidfVectorizer()),
                     ('rf', RandomForestClassifier())
                ])
# Pipeline parameters
pipe_params4 = {
    'tfidf__max_features': [10000],
    'tfidf__ngram_range': [(1,1)],
    'tfidf__stop_words' : [stop_words.ENGLISH_STOP_WORDS],
    'rf__n_estimators': [100, 150],
    'rf__max_depth': [None, 5, 6]
}
# Instantiating a grid search
gs4 = GridSearchCV(pipe4, 
                  param_grid=pipe_params4) 
# Fitting my model
gs4.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [73]:
# The best score for this model
gs4.best_score_

0.9877974896133426

In [74]:
# Setting the best estimator as this model
gs_model4 = gs4.best_estimator_

In [75]:
# Checking the training score
gs_model4.score(X_train, y_train)

0.9993974088580898

In [76]:
# Checking the testing score 
gs_model4.score(X_test, y_test)

0.9911877683211568

### Conclusions about this model:

The Random Forest/TF-IDF Vectorizer model performed outstandingly, with a test accuracy of about 0.99. It was the second-best model of all those fit and tested.

## Extra Trees

In [77]:
# Setting the pipeline for tfidf and extra trees
pipe5 = Pipeline([('tfidf', TfidfVectorizer()),
                     ('xt', ExtraTreesClassifier())
                ])
# Setting the pipeline parameters
pipe_params5 = {
    'tfidf__max_features': [10000],
    'tfidf__ngram_range': [(1,1)],
    'tfidf__stop_words' : [stop_words.ENGLISH_STOP_WORDS],
    'xt__n_estimators': [100, 150],
    'xt__max_depth': [None, 5, 6]
}
# Instantiating the grid search
gs5 = GridSearchCV(pipe5, 
                  param_grid=pipe_params5) 
# Fitting the model
gs5.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [78]:
# Checking the best score 
gs5.best_score_

0.9920658876811508

In [79]:
# Setting the best estimator as the model 
gs_model5 = gs5.best_estimator_

In [80]:
# Checking the training score 
gs_model5.score(X_train, y_train)

0.9993974088580898

In [81]:
# Checking the testing score
gs_model5.score(X_test, y_test)

0.9955562250508398

### Conclusions about this model:

The Extra Trees/TF-IDF Vectorizer model was the best-performing model, with a test accuracy of about 0.996.

## AdaBoost Classifier

In [23]:
# Setting the pipeline 
pipe6 = Pipeline([('tfidf', TfidfVectorizer(max_features=10000, 
                                           ngram_range=(1, 1), 
                                           stop_words=stop_words.ENGLISH_STOP_WORDS)),
                     ('ada', AdaBoostClassifier()),
])
# Setting the pipeline parameters
pipe_params6 = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
}
# Instantiated a grid search
gs6 = GridSearchCV(pipe6, 
                  param_grid=pipe_params6) 
# Fitting the model
gs6.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=10000,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                     

In [24]:
# Finding the best score for the model 
gs6.best_score_

0.7051822316407821

In [25]:
# Setting the best estimator to be the model
gs_model6 = gs6.best_estimator_

In [26]:
# Checking the training score
gs_model6.score(X_train, y_train)

0.7000853670784373

In [27]:
# Checking the testing score
gs_model6.score(X_test, y_test)

0.6989530767492657

### Conclusions about this model:

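The AdaBoost Classifier/TF-IDF Vectorizer model scored about 0.70 on both the training and test sets. That beats the 0.50 baseline, but it is the lowest score of the models fit here and falls well short of Extra Trees and Random Forest.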
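
To compare the six models side by side, the training and testing scores reported above can be collected and ranked. This sketch uses the scores from this notebook rounded to four decimals; the variable names are illustrative:

```python
# (train accuracy, test accuracy) as reported in the cells above, rounded.
scores = {
    "Multinomial Naive Bayes": (0.9009, 0.8903),
    "Decision Tree": (0.7440, 0.7347),
    "Bagging Classifier": (0.9967, 0.9778),
    "Random Forest": (0.9994, 0.9912),
    "Extra Trees": (0.9994, 0.9956),
    "AdaBoost Classifier": (0.7001, 0.6990),
}

# Rank models by test accuracy, best first.
ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)

for name, (train_acc, test_acc) in ranked:
    gap = train_acc - test_acc  # a larger positive gap suggests more overfitting
    print(f"{name:<24} train={train_acc:.4f} test={test_acc:.4f} gap={gap:+.4f}")
```

Ranking by test score rather than training score keeps the comparison focused on how each model handles data it was not fit on.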