Thad Hoskins

Mini-project 3 - Naive Bayes

Find some text data of your own choice; it could be labelled tweets, etc.

Your dataset should have at least 200 instances, and if there are several columns of text, you can choose to merge the text columns into a single text column. Each text instance should have at least 60 words.

For my dataset, I was able to find a labelled collection of tweets used for sentiment analysis.

http://help.sentiment140.com/for-students

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score

from sklearn.naive_bayes import CategoricalNB, GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

import joblib

In [1]:
from nltk.corpus import stopwords
sw = set(stopwords.words("english"))

In [2]:
tweets = pd.read_csv("tweets.csv")
tweets.columns = ["polarity", "id", "date", "query", "user", "text"]

NameError: name 'pd' is not defined

Clean the data

In [None]:
tweets.replace({"polarity": {4: 1}}, inplace=True)
tweets = tweets[(tweets.polarity==0)|(tweets.polarity==1)]

tweets.polarity.unique()

In [None]:
tweets = tweets[["polarity", "text"]]
tweets

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
import contractions
import nltk
import re
import string

nltk.download('words')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

In [None]:
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(w) for w in text.split(' ')])

def expand_contractions(text):
    expanded_words = []    
    for word in text.split():
        expanded_words.append(contractions.fix(word))   

    return ' '.join(expanded_words)

def remove_others(text):
    text = re.sub(r'\n', "", text)
    text = re.sub(r'-', " ", text)
    text = text.strip()
    text = re.sub(r' +', " ", text)
    text = re.sub(r'[\(\)\[\]\^\$\+\*\.\?\/!@#%&{}\'\",;:]', "", text)
    
    return text

def clean_text(text):
    stop = set(nltk.corpus.stopwords.words('english'))
    cleaned = text.lower()
    cleaned = expand_contractions(cleaned)
    tokens = word_tokenize(cleaned)
    cleaned = ' '.join([w for w in tokens if w not in stop])
    cleaned = remove_others(cleaned)
    cleaned = lemmatize_text(cleaned)
    return cleaned

In [None]:
len(tweets[tweets.text.str.len()>=60])

In [None]:
test = tweets.head(100).copy()
test["cleaned_text"] = test["text"].replace(regex=r'(@\w+)|#|&|!', value='')
test["cleaned_text"] = test["cleaned_text"].apply(clean_text)
test

In [None]:
# .copy() avoids a SettingWithCopyWarning when new columns are added to the filtered frame
tweets = tweets[tweets.text.str.len()>=60].copy()
# tweets = tweets[tweets.text.str.len()>=60].sample(n=2000, random_state=42).copy()
tweets.polarity.unique()

In [None]:
tweets["cleaned_text"] = tweets["text"].replace(regex=r'(@\w+)|#|&|!', value='')
tweets["cleaned_text"] = tweets["cleaned_text"].apply(clean_text)
tweets

In [None]:
import random
random_index = random.randint(0, len(tweets)-1)

print("Tweet text prior to cleaning.")
tweets.iloc[random_index].text

In [None]:
print("Tweet text after cleaning.")
tweets.iloc[random_index].cleaned_text

In [None]:
tweets.drop(["text"], axis=1, inplace=True)

In [None]:
tweets

Transform the data to a representation suitable for your algorithm

In [None]:
tfidf = TfidfVectorizer(ngram_range=(1,2), stop_words="english", min_df=10, max_features=None)
X = tfidf.fit_transform(tweets.cleaned_text.values)

In [None]:
tfidf.get_params()

Split the data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, tweets.polarity.values,
                                                    test_size=0.3,
                                                    random_state=42)

Build your model and evaluate the model

In [None]:
clf = MultinomialNB()
clf = clf.fit(X_train, y_train)

In [None]:
clf.get_params()

In [None]:
y_pred = clf.predict(X_train)
print(f"Train accuracy: {accuracy_score(y_train, y_pred)}")

y_pred = clf.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred)}")

In [None]:
cv_score = cross_val_score(clf, X_train, y_train, cv=5)

print(f"Mean Cross Validation Score: {np.mean(cv_score)}")
print(f"Cross Validation Score Standard Deviation: {np.std(cv_score)}")

Tune some parameters of interest

In [None]:
X_train_tune, X_test_tune, y_train_tune, y_test_tune = train_test_split(tweets.cleaned_text.values,
                                                                        tweets.polarity.values,
                                                                        test_size=0.3,
                                                                        random_state=42)

In [None]:
X_train_tune.shape

In [None]:
X_test_tune.shape

In [None]:
y_train_tune.shape

In [None]:
y_test_tune.shape

In [None]:
pipe = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")), ("nb", MultinomialNB())])
pipe.get_params()

In [None]:
param_grid = {"tfidf__min_df":[.01, .1, .15],
              "tfidf__ngram_range": [(1,1), (1,2), (2,2)],
              "tfidf__norm": [None, "l1", "l2"],
              "nb__alpha": [.1, .5, 1],
              "nb__fit_prior": [True, False]}

grid_search = GridSearchCV(pipe,
                           param_grid, cv=5,
                           scoring="accuracy",
                           return_train_score=True)

grid_search.fit(X_train_tune, y_train_tune)

In [None]:
best_est = grid_search.best_estimator_
best_est

In [None]:
grid_search.best_params_

In [None]:
joblib.dump(best_est, "assignment_5_part4_best.pkl")

In [None]:
print(f"Score on training set: {best_est.score(X_train_tune, y_train_tune)}")

In [None]:
print(f"Score on test set: {best_est.score(X_test_tune, y_test_tune)}")

The main problem was finding the dataset! Using the above guidance, I looked for a tweet dataset that was labelled. I found one here (referenced above):<br>
http://help.sentiment140.com/for-students

The dataset included a number of fields. However, I only needed the "polarity" and the text of the tweets. Polarity is my label or target. At the start, the values were:
<ul>
    <li>0 = negative</li>
    <li>2 = neutral</li>
    <li>4 = positive</li>
</ul>

I want to know whether a tweet is positive or negative, so I dropped the neutral tweets and recoded positive to 1.

To meet the length requirement, I limited the dataset to tweets of at least 60 characters. I decided to do this before transforming the data.

As for my text, I used code I wrote for a project in DS Tools 1 to "clean" the text, i.e., I did not use the cleaning provided in the lecture. For my cleaning, I performed the following:
<ul>
    <li>Removed @user mentions and the characters #, &amp;, and ! from every tweet</li>
    <li>Made all letters lower case</li>
    <li>Expanded contractions</li>
    <li>Removed stop words</li>
    <li>Removed punctuation and other special characters</li>
    <li>Lemmatized the words</li>
</ul>

Following the procedure from #3 of this assignment, I then vectorized the tweet text. Next, I split the data into training and test sets (features and labels for each) and fitted the model using MultinomialNB. I then cross-validated the results, outputting scores at every step.

With the ball rolling downhill, I then went back to the non-vectorized dataset.

I constructed a pipeline that both vectorizes the text and fits the model, then ran it through a grid search, so that each candidate transforms the data and fits the model in one step.

My chosen hyper-parameters cover the minimum document frequency of terms (min_df), the n-gram range (groupings of words), and row normalization (norm) in the transformation, as well as the smoothing factor (alpha) and the flag for learning class priors (fit_prior) in the Naive Bayes model.

Given the size of the dataset, the cleaning takes a long time, as does the tuning. Times like this make me wish to build a Data Science computer.
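A rough count of the work the grid search performs helps explain the tuning time. The sketch below reuses the grid from this notebook and counts the candidates with scikit-learn's `ParameterGrid`; the extra fit assumes the default `refit=True` of `GridSearchCV`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above.
param_grid = {
    "tfidf__min_df": [0.01, 0.1, 0.15],
    "tfidf__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "tfidf__norm": [None, "l1", "l2"],
    "nb__alpha": [0.1, 0.5, 1],
    "nb__fit_prior": [True, False],
}

cv_folds = 5
n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 * 3 * 2 combinations
n_fits = n_candidates * cv_folds + 1           # +1 for the final refit (assumes refit=True)

print(f"{n_candidates} candidates, {n_fits} pipeline fits in total")
```

Each of those fits re-vectorizes its training folds, so on a large dataset the count adds up quickly. Passing `n_jobs=-1` to `GridSearchCV` would spread the fits across the available CPU cores, which may help on a multi-core machine.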

In the end, the cross-validated grid search yielded worse scores. Given the run time involved for a homework assignment, I can only guess at the reason for this result. My two competing theories are overfitting and an overly limited set of hyper-parameters for tuning.

The first is overfitting. The initial model overfit with an accuracy score of nearly 75%. The tuning and cross validation may bear this out with a much lower accuracy of 64.8% on the training set and 64% on the test set. That would indicate the classification power of the model is stronger than a coin toss, but not remarkably so.
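One way to probe the overfitting theory is to compare the mean training and validation scores that the grid search records for each candidate when `return_train_score=True`. This is a minimal sketch on a made-up toy corpus (the texts, labels, and the small grid are assumptions for illustration, not the tweet data); a large positive gap would hint at overfitting.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for the cleaned tweets (hypothetical data, not the real dataset).
texts = ["love great day", "happy good fun", "great fun love", "good happy day",
         "sad bad day", "hate awful rain", "bad awful sad", "hate sad rain"] * 3
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
search = GridSearchCV(pipe, {"nb__alpha": [0.1, 1.0]}, cv=3,
                      scoring="accuracy", return_train_score=True)
search.fit(texts, labels)

# One row per candidate: mean score on the training folds vs. the held-out folds.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_train_score", "mean_test_score"]
]
results["gap"] = results["mean_train_score"] - results["mean_test_score"]
print(results.sort_values("mean_test_score", ascending=False))
```

Applied to the fitted `grid_search` object from this notebook, the same `cv_results_` table would show whether any candidate in the grid trained well but validated poorly.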

The second theory, which could be tested more thoroughly with more time and a better system than my laptop, is that my hyper-parameter grid was too restrictive and produced a worse model. My original parameters were:
{'alpha': 1.0, 'class_prior': None, 'fit_prior': True}<br>
{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 10,
 'ngram_range': (1, 2),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': 'english',
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'use_idf': True,
 'vocabulary': None}
 
The tuned parameters are:
 {'nb__alpha': 0.1,
 'nb__fit_prior': True,
 'tfidf__min_df': 0.01,
 'tfidf__ngram_range': (1, 1),
 'tfidf__norm': 'l2'}
 
Notable differences are the alpha (for the model) and the min_df and ngram_range (data transformation).

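The tuned min_df is 0.01, meaning a term must appear in at least 1% of the documents. On a dataset of this size that is a far higher threshold than the 10 documents required by the original model.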
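The difference between an integer and a float `min_df` can be seen on a small example. The sketch below uses a made-up corpus of 2,000 documents (an assumption for illustration): an integer is read as an absolute document count, while a float is read as a proportion of the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "common" is in all 2000 documents,
# "mid" is in 100 of them and "rare" is in only 12.
docs = ["common"] * 1888 + ["common rare"] * 12 + ["common mid"] * 100

# Integer min_df: keep terms found in at least 10 documents.
vec_count = TfidfVectorizer(min_df=10).fit(docs)

# Float min_df: keep terms found in at least 1% of documents (20 of 2000).
vec_share = TfidfVectorizer(min_df=0.01).fit(docs)

vocab_count = sorted(vec_count.vocabulary_)
vocab_share = sorted(vec_share.vocabulary_)

print(vocab_count)  # ['common', 'mid', 'rare']
print(vocab_share)  # ['common', 'mid']
```

With several hundred thousand tweets, 1% corresponds to thousands of documents, so every float value in the tuning grid prunes the vocabulary much harder than `min_df=10` did.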
 
I had more tuning features but runtime was a limitation.
 
Solutions in the future would be to limit the dataset significantly. The original dataset had 1.6 million tweets, reduced to 900k with some filtering. I then further randomly sampled the data to 500k (further leaning toward cross validation doing its job). Even so, the data cleaning, transformation, fitting, and tuning were very time consuming. I can reduce those costs now by saving the cleaned dataset so that I do not have to repeat much of that work. I can also save the model, change some parameters, and compare the new model to this one without having to fit this one again.
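The sampling and caching idea could look roughly like the following. This is a sketch under stated assumptions: the small DataFrame stands in for the cleaned tweets, and the sample size and cache file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned tweets DataFrame.
tweets = pd.DataFrame({
    "polarity": [0, 1] * 50,
    "cleaned_text": [f"tweet number {i}" for i in range(100)],
})

# Work on a smaller random sample so cleaning and tuning finish sooner.
sample = tweets.sample(n=20, random_state=42)

# Cache the sample on disk; the file name is an assumption for illustration.
cache_path = os.path.join(tempfile.gettempdir(), "tweets_cleaned_sample.pkl")
sample.to_pickle(cache_path)

# A later session can reload the cached frame and skip the cleaning step.
reloaded = pd.read_pickle(cache_path)
print(reloaded.shape)
```

The fitted pipeline saved earlier with `joblib.dump` can be restored in the same spirit with `joblib.load`, so a new candidate can be compared against it without refitting.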
 
I would recommend tweaks to the model to increase the accuracy before taking this to production. The method is sound, but needs further tuning.