## UVU Data Science Club - Sentiment Demo

This notebook demonstrates how to train and deploy a simple sentiment classification model using `scikit-learn`. A Naive Bayes classifier is trained on a labeled set of movie reviews (loaded from `train.tsv` below) to predict whether a review is positive or negative. We'll also use the `streamlit` package to deploy a frontend for the model so that end users can submit text for classification and testing.

The notebook is divided into the following sections:

1. Setup & Environment
2. Loading & Exploring the Data
3. Feature Extraction (Bag of Words, Tokenization, Frequency Distribution)
4. Training a Classifier
5. Build Pipeline
6. Performance Evaluation & Testing
7. Deployment


### 1. Setup & Environment

#### Environment Setup

Create a virtual environment to keep this project's dependencies isolated. You can create one inside the project directory with your installed Python interpreter by following the [official venv tutorial](https://docs.python.org/3/tutorial/venv.html).

Once your environment is set up and activated, install the required dependencies by running `python3 -m pip install -r requirements.txt` from the root of the project, where `requirements.txt` lives. Be sure to select the virtual environment in your editor before running the notebook.


In [60]:
# import dependencies for project
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None

In [72]:
### Load Our Training Data
BASE_TRAIN = "../data/raw/train.tsv"
BASE_TEST = "../data/raw/submission.tsv"
# labeled and submission data are 25000 rows long.
labeled = pd.read_csv(BASE_TRAIN, header=0, delimiter="\t", quoting=3)
prod = pd.read_csv(BASE_TEST, header=0, delimiter="\t", quoting=3)
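# Local training data is the first 80% of the labeled set; local test data is the remaining 20%.
# .copy() makes each split an independent DataFrame, so later column assignments are safe.
train = labeled.iloc[:20000].copy()
test = labeled.iloc[20000:25000].copy()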
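
Slicing by position only gives a fair split if the rows in the file are already in random order. As a hedged alternative, the sketch below uses scikit-learn's `train_test_split` with `stratify` to draw a shuffled 80/20 split that keeps the class ratio the same in both parts. The small `labeled_demo` frame and `random_state=42` are assumptions for illustration, not part of the notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the labeled reviews (10 rows, balanced classes).
labeled_demo = pd.DataFrame({
    "id": [f"r{i}" for i in range(10)],
    "sentiment": [0, 1] * 5,
    "review": [f"review text {i}" for i in range(10)],
})

# Shuffle and split 80/20 while keeping the positive/negative ratio the same
# in both parts. random_state is fixed only to make the sketch repeatable.
train_demo, test_demo = train_test_split(
    labeled_demo,
    test_size=0.2,
    stratify=labeled_demo["sentiment"],
    random_state=42,
)

print(len(train_demo), len(test_demo))
print(train_demo["sentiment"].mean(), test_demo["sentiment"].mean())
```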

In [62]:
# Inspect the structure of the data
# id - Unique identifier for each review.
# sentiment - Training target: the label the model learns to predict (1 = positive, 0 = negative).
# review - The review text used as input to predict the target.
train.head(20)

| | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` |
| 5 | `"8196_8"` | 1 | `"I dont know why people think this is such a b...` |
| 6 | `"7166_2"` | 0 | `"This movie could have been very good, but com...` |
| 7 | `"10633_1"` | 0 | `"I watched this video at a friend's house. I'm...` |
| 8 | `"319_1"` | 0 | `"A friend of mine bought this film for £1, and...` |
| 9 | `"8713_10"` | 1 | `"<br /><br />This movie is full of references....` |


In [63]:
# We notice some dirty data (HTML tags, escaped quotes), so we should clean it.
import re

# Non-greedy pattern that matches anything shaped like an HTML tag, e.g. <br />.
HTML_TAGS = re.compile(r'<.*?>')

def clean_text(text: str) -> str:
    """
    Remove backslashes, single quotes (apostrophes), and HTML tags from the text.
    """
    clean = text.replace("\\", "")
    clean = clean.replace("'", "")
    return HTML_TAGS.sub('', clean)

In [64]:
# example of clean text
print(f"This is an example of how the text looks before being cleaned:\n{train['review'].iloc[9]}\n\nCompared to after it gets cleaned:\n\n{clean_text(train['review'].iloc[9])}")

This is an example of how the text looks before being cleaned:
"<br /><br />This movie is full of references. Like \"Mad Max II\", \"The wild one\" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpiece. We´ll talk much more about in the future."

Compared to after it gets cleaned:

"This movie is full of references. Like "Mad Max II", "The wild one" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpiece. We´ll talk much more about in the future."


In [74]:
# Let's apply this to our data, overwriting the review column with the cleaned text, and save the result.
train['review'] = train['review'].apply(clean_text)
test['review'] = test['review'].apply(clean_text)
prod['review'] = prod['review'].apply(clean_text)
# save cleaned data to file
train.to_csv("../data/processed/clean_local_train.csv",index=False)
test.to_csv("../data/processed/clean_local_test.csv",index=False)
prod.to_csv("../data/processed/prod.csv",index=False)

In [3]:
print(f"We want our local training and test data to be roughly balanced between positive and negative reviews for valid testing. The average sentiment (share of positive reviews) for our training data is {train['sentiment'].mean()}. The average sentiment for our local testing data is {test['sentiment'].mean()}. Both values are close to 0.5, which confirms the two classes are evenly represented.")

We want our local training and test data to be roughly balanced between positive and negative reviews for valid testing. The average sentiment (share of positive reviews) for our training data is 0.4986. The average sentiment for our local testing data is 0.5056. Both values are close to 0.5, which confirms the two classes are evenly represented.
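
Because `sentiment` is a 0/1 label, its mean is simply the share of positive reviews. A complementary check, sketched here on made-up labels rather than the real data, is `value_counts(normalize=True)`, which reports the share of each class directly.

```python
import pandas as pd

# Hypothetical labels standing in for train['sentiment'].
sentiment_demo = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="sentiment")

# For a 0/1 column, the mean equals the fraction of positive (1) labels.
positive_share = sentiment_demo.mean()

# value_counts(normalize=True) gives the fraction of rows in each class.
class_shares = sentiment_demo.value_counts(normalize=True)

print(positive_share)
print(class_shares.to_dict())
```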


### Feature Extraction: Bag of Words, Tokenization, and Frequency Distribution

#### Bag of Words

To perform machine learning on text, we first need to turn the text into numerical feature vectors. The most intuitive way to do so is to use a bag-of-words representation:

1. Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
2. For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j, where j is the index of word w in the dictionary.
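
The two steps above can be sketched by hand in plain Python. This is only an illustration on two made-up documents with naive whitespace splitting; the names `docs`, `vocabulary`, and `X` are chosen for this sketch.

```python
from collections import Counter

# Two toy documents (hypothetical, not from the review data).
docs = ["the movie was good", "the movie was bad bad"]

# Step 1: assign a fixed integer id to every distinct word in the corpus.
vocabulary = {}
for doc in docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# Step 2: X[i][j] holds how often word j occurs in document i.
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        X[i][vocabulary[word]] = count

print(vocabulary)
print(X)
```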

The bags of words representation implies that n_features is the number of distinct words in the corpus: this number is typically larger than 100,000.

If n_samples == 10000, storing X as a NumPy array of type float32 would require 10000 x 100000 x 4 bytes = 4GB in RAM which is barely manageable on today’s computers.

Fortunately, most values in X will be zeros since for a given document less than a few thousand distinct words will be used. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by only storing the non-zero parts of the feature vectors in memory.

Source: [scikit-learn: Working With Text Data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
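
The memory argument above can be checked with a little arithmetic. The sparse estimate below assumes a CSR-style layout (one 4-byte value and one 4-byte column index per non-zero entry, plus one 4-byte row pointer per row) and roughly 200 distinct words per document; both are assumptions picked for illustration.

```python
# Dense storage: every cell of the n_samples x n_features matrix is stored.
n_samples = 10_000
n_features = 100_000
bytes_per_float32 = 4
dense_bytes = n_samples * n_features * bytes_per_float32

# Sparse (CSR-style) storage: one float32 value plus one int32 column index
# per non-zero entry, plus one int32 row pointer per row (and one extra).
# ASSUMPTION: about 200 distinct words per document, chosen for illustration.
distinct_words_per_doc = 200
nnz = n_samples * distinct_words_per_doc
sparse_bytes = nnz * (4 + 4) + (n_samples + 1) * 4

print(f"dense:  {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```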

#### Tokenization

Text preprocessing, tokenizing, and filtering of stopwords are all included in `CountVectorizer`, which builds a dictionary of features and transforms documents into feature vectors. `CountVectorizer` can also count N-grams of words or of consecutive characters. Once fitted, the vectorizer has built a dictionary of feature indices.

`CountVectorizer` transforms a given text into a vector based on the frequency (count) of each word that occurs in the entire text. Source: [GeeksForGeeks](https://www.geeksforgeeks.org/using-countvectorizer-to-extracting-features-from-text/)
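
To make this concrete, here is a small sketch of `CountVectorizer` on two made-up sentences. It shows the learned vocabulary, the resulting count matrix, and how `ngram_range=(1, 2)` adds word pairs as extra features. The toy `docs` list is an assumption for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (hypothetical, not from the review data).
docs = ["The movie was good", "The movie was bad bad"]

# Default settings: lowercase the text and count single words (unigrams).
unigram_vect = CountVectorizer()
unigram_counts = unigram_vect.fit_transform(docs)

# vocabulary_ maps each token to its column index (alphabetical order).
print(unigram_vect.vocabulary_)
# A dense view is fine for a toy example; real data should stay sparse.
print(unigram_counts.toarray())

# ngram_range=(1, 2) adds counts for pairs of consecutive words (bigrams).
bigram_vect = CountVectorizer(ngram_range=(1, 2))
bigram_counts = bigram_vect.fit_transform(docs)
print(bigram_counts.shape)
```

With the default settings only single words are counted, so n-gram features appear only when `ngram_range` is set explicitly.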


In [75]:
# Create a new instance of the CountVectorizer class we imported above.
count_vect = CountVectorizer()
# Learn the vocabulary from the cleaned training reviews and turn them into count vectors.
X_train_counts = count_vect.fit_transform(train['review'])
# Check the size of the result: rows match the reviews above, columns are the distinct tokens (features) in the learned vocabulary.
X_train_counts.shape

(20000, 73185)

In [78]:
# TF-IDF transformer: re-weights the raw counts so that words appearing in many reviews count for less.
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(20000, 73185)
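
The shape is unchanged because `TfidfTransformer` only re-weights the existing counts. As a rough sketch of what happens with scikit-learn's default settings (smoothed IDF followed by L2 row normalization), the computation can be reproduced with NumPy on a small made-up count matrix and compared with the library result.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents x 3 terms.
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [3, 1, 0],
])

n_docs = counts.shape[0]
# Document frequency: in how many documents each term appears.
df = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency (scikit-learn's default, smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + df)) + 1
# Weight the raw counts, then scale each row to unit Euclidean length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# The library result should match the manual computation.
sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```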

### Training a Classifier & Pipeline

Now that we have our features, we can train a classifier to try to predict the sentiment of a review. Let’s start with a naïve Bayes classifier, which provides a nice baseline for this task. scikit-learn includes several variants of this classifier; the one most suitable for word counts is the multinomial variant:

In [83]:
# Supply the training features and the target labels (the correct answers). The two are aligned row by row, so they line up 1:1.
clf = MultinomialNB().fit(X_train_tfidf, train['sentiment'])
# Create a pipeline that chains the steps above, so they no longer need to be run separately.
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])
# Now we can pass the raw review text straight in.
text_clf.fit(train['review'], train['sentiment'])

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])
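
Because the pipeline bundles vectorizing, TF-IDF weighting, and the classifier, it can classify raw strings directly, which is the kind of call a frontend would make. The sketch below trains the same three steps on four made-up reviews (an assumption for illustration, not the notebook's data) and classifies new text.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Four made-up reviews standing in for the real training data.
demo_reviews = [
    "great wonderful film",
    "loved it great acting",
    "terrible boring film",
    "awful waste boring",
]
demo_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

demo_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
demo_clf.fit(demo_reviews, demo_labels)

# Raw strings go in; the pipeline applies every step in order.
new_text = ["great wonderful acting", "boring awful"]
print(demo_clf.predict(new_text))
# Class probabilities, with columns ordered as in demo_clf.classes_.
print(demo_clf.predict_proba(new_text))
```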

In [86]:
# Predict on the local test set and compute accuracy (the fraction of correct predictions).
predicted = text_clf.predict(test['review'])
np.mean(predicted == test['sentiment'])

0.856
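
Accuracy alone does not show which kind of mistake the model makes. A possible next step, sketched here on made-up labels rather than the notebook's results, is to use the `metrics` module imported at the top for a confusion matrix and a per-class report.

```python
from sklearn import metrics

# Hypothetical true labels and predictions (not the notebook's results).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: fraction of predictions that match the true label.
accuracy = metrics.accuracy_score(y_true, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes.
# With labels 0 and 1 the layout is [[TN, FP], [FN, TP]].
confusion = metrics.confusion_matrix(y_true, y_pred)

print(accuracy)
print(confusion)
# Per-class precision, recall, and F1 in one text report.
print(metrics.classification_report(y_true, y_pred))
```

In the notebook itself, the same calls would take `test['sentiment']` and `predicted` as arguments.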