# Sentiment Detection: An End-to-End Project

Sentiment analysis is a thoroughly researched topic, but the field keeps improving and new techniques appear every day. Plenty of pre-trained models achieve impressive precision in sentiment detection (especially those with more complex architectures, such as BERT-style transformers and RNNs). Even so, I set out to build my own implementation, trained on data that I clean and process from its raw, textual form. The goal of this project is to develop a model with a few key abilities that a (binary) sentiment classifier should have, outlined below:

- Negation Handling (classifying `not bad` as positive, `don't like` as negative)
- Sarcasm (classifying `This is exactly what I needed right now` as negative)
- many more!

In [38]:
!pip install kaggle
!kaggle


In [2]:
# Load preliminary libraries

import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

# Data Loading

The data used for this analysis is the [Sentiment140 Dataset](https://www.kaggle.com/datasets/kazanova/sentiment140), which contains 1.6 million tweets extracted from the Twitter API. We'll conduct a few simple preprocessing steps to format the data into a model-friendly shape.

In [3]:
def load_data(path = "./tweets.csv"):
    """Loads the tweets CSV at `path` into a pandas DataFrame"""

    columns = ["sentiment", "tweet_id", "date", "query", "username", "tweet"]

    # Read from the `path` argument rather than a hard-coded filename
    data = pd.read_csv(path, 
                       encoding='latin', 
                       header = None, 
                       names = columns)

    data = data[['tweet', 'sentiment']]
    
    return data

data = load_data()
data.head()

| | tweet | sentiment |
|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 |
| 1 | is upset that he can't update his Facebook by ... | 0 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 |
| 3 | my whole body feels itchy and like its on fire | 0 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 |


# Data Cleaning

In [4]:
data.sentiment.value_counts()

0    800000
4    800000
Name: sentiment, dtype: int64

To make the data more interpretable, we'll change 4 to 1 (positive sentiment) and keep 0 (negative sentiment).

In [5]:
data['sentiment'] = data['sentiment'].replace(4, 1)

In [6]:
data.head()

| | tweet | sentiment |
|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 |
| 1 | is upset that he can't update his Facebook by ... | 0 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 |
| 3 | my whole body feels itchy and like its on fire | 0 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 |


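For this project, we'll use a subset of the data to speed up training: 300,000 rows each for the training pool and the test set, drawn from a stratified 80-20 train-test split. The training pool is then split 80-20 again into training and validation sets.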

In [7]:
from sklearn.model_selection import StratifiedShuffleSplit

def split_data(data, rows_used, target = 'sentiment', test_size = 0.2):
    """Performs a stratified shuffle split (maintaining class balance between train and test splits)"""
    
    X = data.drop([target], axis = 1)
    y = data[target]

    # One split is enough; the default n_splits=10 would repeat the work and keep only the last split
    sss = StratifiedShuffleSplit(n_splits = 1, test_size = test_size)

    for train_indices, test_indices in sss.split(X, y):
        train_X, train_y = X.iloc[train_indices], y.iloc[train_indices]
        test_X, test_y = X.iloc[test_indices], y.iloc[test_indices]
        
    pre_train_X, pre_train_y = train_X.iloc[:rows_used], train_y.iloc[:rows_used]
    test_X, test_y = test_X.iloc[:rows_used], test_y.iloc[:rows_used]
    
    for train_indices, val_indices in sss.split(pre_train_X, pre_train_y):
        train_X, train_y = pre_train_X.iloc[train_indices], pre_train_y.iloc[train_indices]
        val_X, val_y = pre_train_X.iloc[val_indices], pre_train_y.iloc[val_indices]
    
    return train_X, val_X, test_X, train_y, val_y, test_y

In [8]:
train_X, val_X, test_X, train_y, val_y, test_y = split_data(data, rows_used = 300000)
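
The idea behind a stratified split can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation; `stratified_split` and its parameters are hypothetical names chosen for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)

    # Group row positions by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    return sorted(train_idx), sorted(test_idx)


labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))        # 80 20
print(sum(labels[i] for i in test_idx))     # 10 positives in the test split
```

Because each class is shuffled and cut separately, both splits keep the 50-50 class balance of the full dataset.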

## Basic Text Transformation

Let's sample a few tweets to see what features exist in the text.

In [9]:
print(*data['tweet'].sample(5).values.squeeze(), sep = "\n")

@andyclemmensen heyyy you ALWAYS look nice  wanna come to my friend's party? its june 8th long weekend in sydney 
@penguinloverwoo me too! So sore. Moving furniture plus DDR equals bodily harm 
@drewbilation Thanx very much. I do try 
Cooky's got a new motor - anyone had their Audi stolen in the Ayrshire area?  
Hero's was amazing tonight OMG! cant wait until next week 


From the sample, we can see some key non-speech features present in tweets:
- Tagged usernames (accounts mentioned with '@' within the tweet)
- Hashtags
- URLs
- Digits that need to be converted to text
- Others (Apostrophes, blank spaces, punctuation, etc.)

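We need a way to remove or replace these features. The `clean` function below uses regular expressions to swap hashtags, URLs, usernames, and numbers for placeholder tokens and to strip everything else that isn't a letter. Note that `APOSTROPHE_PATTERN` deletes any word containing an apostrophe, so contractions such as `don't` are dropped entirely.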

In [10]:
import re

def clean(t):
    """Replaces non-speech features in tweets with regex"""
    
    URL_PATTERN = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
    HASHTAG_PATTERN = r"#\S+[a-zA-Z]"
    USERNAME_PATTERN = r"@[a-zA-Z]\S*"
    NUMBER_PATTERN = r"\d+"
    APOSTROPHE_PATTERN = r"\w+'\w+"
    NONWORD_PATTERN = r"[^a-zA-Z]+"

    t = re.sub(HASHTAG_PATTERN, "HASHTAG", t)
    t = re.sub(URL_PATTERN, "URL", t)
    t = re.sub(USERNAME_PATTERN, "USER", t)
    t = re.sub(APOSTROPHE_PATTERN, "", t)
    t = re.sub(NUMBER_PATTERN, "NUMBER", t)
    t = re.sub(NONWORD_PATTERN, " ", t)
    
    return t.lower()

# See the clean() function in action
train_X['tweet'] = train_X['tweet'].apply(clean)
val_X['tweet'] = val_X['tweet'].apply(clean)
test_X['tweet'] = test_X['tweet'].apply(clean)

train_X.head()

| | tweet |
|---|---|
| 360572 | user probably either way f cked as still not g... |
| 613863 | thinking that the feed posts a lot of shit but... |
| 1295682 | user make me one too lol |
| 220455 | hmmm i wake up and find that my home file amp ... |
| 691118 | god i really really need to study i like kidne... |
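
To see how the order of substitutions plays out on a single tweet, here is a compact sketch. It applies the same steps as `clean`, except that the URL pattern is a simplified stand-in (an assumption for brevity), and `clean_sketch` is a hypothetical name.

```python
import re

# Same substitution order as clean(); the URL pattern is a simplified stand-in.
SUBSTITUTIONS = [
    (r"#\S+[a-zA-Z]", "HASHTAG"),
    (r"https?://\S+", "URL"),
    (r"@[a-zA-Z]\S*", "USER"),
    (r"\w+'\w+", ""),          # removes whole contractions such as "don't"
    (r"\d+", "NUMBER"),
    (r"[^a-zA-Z]+", " "),      # collapses anything that is not a letter
]


def clean_sketch(text):
    """Apply each regex substitution in turn, then lowercase."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text.lower()


print(clean_sketch("@bob I don't like this http://t.co/abc #fail 2 times"))
# user i like this url hashtag number times
```

The example also shows a side effect worth keeping in mind for negation handling: `don't like` comes out as `like`, because the contraction is removed before the model ever sees it.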


After stripping non-speech features, some rows may be left with an empty string. We'll remove these.

In [11]:
print(f"Removing {sum(train_X['tweet'].str.strip() == '')} rows from train_X")
train_y = train_y[train_X['tweet'].str.strip() != ""]
train_X = train_X[train_X['tweet'].str.strip() != ""]

print(f"Removing {sum(val_X['tweet'].str.strip() == '')} rows from val_X")
val_y = val_y[val_X['tweet'].str.strip() != ""]
val_X = val_X[val_X['tweet'].str.strip() != ""]


print(f"Removing {sum(test_X['tweet'].str.strip() == '')} rows from test_X")
test_y = test_y[test_X['tweet'].str.strip() != ""]
test_X = test_X[test_X['tweet'].str.strip() != ""]


Removing 1 rows from train_X
Removing 0 rows from val_X
Removing 5 rows from test_X


## Advanced Text Transformation

Now, we apply *slightly more* complicated transformations to the text. The goal? To further condense the text without losing information.

### Lemmatization

When we lemmatize, we reduce each word to its base form (its lemma): plural nouns become singular, and continuous verbs become their base verb form.

In [12]:
import spacy

nlp = spacy.load("en_core_web_md")

def lemmatize_data(texts):

    lemmatized = []

    for doc in nlp.pipe(texts.values, batch_size=100, n_process=-1, disable=["parser", "ner"]):
        lemmatized.append(" ".join([tok.lemma_ for tok in doc]))
        
    print("Done lemmatizing texts!")
    
    return lemmatized

In [13]:
lemmatized_train_X = pd.Series(lemmatize_data(train_X['tweet']))
lemmatized_val_X = pd.Series(lemmatize_data(val_X['tweet']))
lemmatized_test_X = pd.Series(lemmatize_data(test_X['tweet']))

Done lemmatizing texts!
Done lemmatizing texts!
Done lemmatizing texts!


### Removing Stopwords

We can remove words that appear often but carry little meaning (stopwords) so the model can focus on the words that matter rather than the noise.

In [14]:
from nltk.corpus import stopwords

def remove_stopwords(sent):
    
    """Removes stopwords from a given sentence"""
    stop_words = stopwords.words('english')
    # Maintain negation in final data
    stop_words.remove('not')
    # remove apostrophes from stopwords
    final_stop_words = list(map(lambda x: re.sub(r'\W+', '', x), stop_words))
        
    return (" ".join([word for word in sent.split() if word not in final_stop_words])).lower()

In [15]:
cleaned_train_X = lemmatized_train_X.apply(remove_stopwords)
cleaned_val_X = lemmatized_val_X.apply(remove_stopwords)
cleaned_test_X = lemmatized_test_X.apply(remove_stopwords)
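
As a small illustration of why `not` is kept, consider the sketch below. The stopword list here is a tiny hypothetical sample (the notebook uses NLTK's full English list), and `remove_stopwords_sketch` is an illustrative name.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
STOP_WORDS = {"i", "do", "the", "this", "is", "a", "not"}
KEEP = {"not"}                      # negation words we want the model to see
FINAL_STOP_WORDS = STOP_WORDS - KEEP


def remove_stopwords_sketch(sentence):
    """Drop stopwords but keep negations."""
    return " ".join(w for w in sentence.lower().split() if w not in FINAL_STOP_WORDS)


print(remove_stopwords_sketch("this is not a good movie"))  # not good movie
```

Building the stopword collection once as a `set`, outside the function, also avoids rebuilding the list on every call and makes each membership check constant-time.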

# Feature Engineering

We're going to try several different approaches and see which features provide the best performance. 

## TF-IDF Vectors

TF stands for term frequency, while IDF stands for inverse document frequency. Term frequency counts how often a word occurs in a sentence; inverse document frequency down-weights words that occur in many sentences.

$$\large{w_{x,y}=tf_{x,y}\times\log\left(\frac{N}{df_{x}}\right)}$$

The **tf-idf** score $w_{x,y}$ for a given word ($x$) in a sentence ($y$) is the number of times the word appears in the sentence ($tf_{x,y}$) multiplied by the logarithm of the number of sentences ($N$) divided by the number of sentences the word appears in ($df_{x}$).
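
To make the formula concrete, here is a worked example on a toy corpus in plain Python. It follows the textbook formula above literally; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row by default, so its numbers will differ. The corpus and the `tf_idf` helper are made up for illustration.

```python
import math

# Toy corpus: each "sentence" is already tokenized
docs = [
    ["good", "movie"],
    ["not", "good"],
    ["bad", "movie"],
    ["good", "good", "fun"],
]


def tf_idf(term, doc, docs):
    """Textbook tf-idf: term count in doc times log(N / document frequency)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)


print(round(tf_idf("good", docs[3], docs), 3))  # 0.575
print(round(tf_idf("fun", docs[3], docs), 3))   # 1.386
```

Even though `good` appears twice in the last sentence, it scores lower than `fun`, because `good` occurs in three of the four sentences while `fun` occurs in only one.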

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# only include unigrams
tf_1 = TfidfVectorizer(ngram_range=(1,1))

new_train_X = tf_1.fit_transform(lemmatized_train_X)
new_val_X = tf_1.transform(lemmatized_val_X)
new_test_X = tf_1.transform(lemmatized_test_X)

tf_2 = TfidfVectorizer(ngram_range=(1,2))

new_train_X_bi = tf_2.fit_transform(cleaned_train_X)
new_val_X_bi = tf_2.transform(cleaned_val_X)
new_test_X_bi = tf_2.transform(cleaned_test_X)

Here, `ngram_range` sets the sizes of the n-gram tokens that the vectorizer includes. With `ngram_range=(1,2)`, instead of only looking at **unigrams** (single words), the vectorizer also computes TF-IDF values for **bigrams** (pairs of consecutive words).
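
A quick way to picture what an n-gram range produces is to enumerate the tokens by hand. The `ngrams` helper below is a hypothetical plain-Python sketch, not the vectorizer's own tokenizer.

```python
def ngrams(tokens, n_min, n_max):
    """List all n-grams of length n_min..n_max, joined with spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


print(ngrams("not bad at all".split(), 1, 2))
# ['not', 'bad', 'at', 'all', 'not bad', 'bad at', 'at all']
```

With bigrams included, `not bad` becomes a feature of its own, which gives a linear model a chance to weight it differently from `not` and `bad` taken separately.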

## Count Vectors

Count vectorization simply takes every word in the corpus, assigns it an index, and records for each sentence how many times each indexed word appears.

In [None]:
cv = CountVectorizer()
vectorized_train_X = cv.fit_transform(cleaned_train_X)
vectorized_val_X = cv.transform(cleaned_val_X)
vectorized_test_X = cv.transform(cleaned_test_X)
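
The mechanics can be sketched without any library. This is a dense, simplified illustration (scikit-learn returns a sparse matrix and has its own tokenization rules); `fit_vocabulary` and `count_vector` are hypothetical helper names.

```python
from collections import Counter


def fit_vocabulary(sentences):
    """Assign an index to every distinct word, in alphabetical order."""
    vocab = sorted({w for s in sentences for w in s.split()})
    return {w: i for i, w in enumerate(vocab)}


def count_vector(sentence, vocab):
    """Count occurrences of each vocabulary word; unseen words are ignored."""
    counts = Counter(sentence.split())
    return [counts.get(w, 0) for w in vocab]  # dict order matches index order


corpus = ["good good movie", "bad movie"]
vocab = fit_vocabulary(corpus)

print(vocab)                                  # {'bad': 0, 'good': 1, 'movie': 2}
print(count_vector("good good movie", vocab)) # [0, 2, 1]
print(count_vector("great movie", vocab))     # [0, 0, 1]
```

As with `fit_transform` and `transform` above, the vocabulary is learned from the training sentences only, so words that first appear in validation or test data contribute nothing.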

## Word2Vec



In [None]:
# We take the lemmatized, but not cleaned values. This is to make sure that the model can learn associations between the position of the word within the sentence 

train_texts = lemmatized_train_X.values
val_texts = lemmatized_val_X.values
test_texts = lemmatized_test_X.values

In [18]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenize sentences
tokenizer = Tokenizer(num_words = 3000)
tokenizer.fit_on_texts(train_texts)

train_seq = tokenizer.texts_to_sequences(train_texts)
val_seq = tokenizer.texts_to_sequences(val_texts)

# Pad all sequences to have the same length for training
train_X_padded = pad_sequences(train_seq, maxlen=200)
val_X_padded = pad_sequences(val_seq, maxlen=200)
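
By default, Keras `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, also truncates from the front. A plain-Python sketch of that default behaviour (the `pad_pre` helper is a hypothetical name):

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` ids and left-pad shorter sequences with `value`."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


print(pad_pre([5, 8, 2], 5))              # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

Tweets are short, so with `maxlen=200` most rows end up mostly zeros followed by the actual token ids.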

Our goal is to get a numerical representation of each word in the corpus. We will train a gensim `Word2Vec` model to get our embedding matrix. Then, we will use this matrix as the weights of a TensorFlow `Embedding` layer.

In [19]:
from gensim.models.word2vec import Word2Vec


def get_embedding_weights(sentences, vector_size = 200, vocab_size = 5000, return_vocab_size = True):
    """Trains Word2Vec on the given corpus and returns its embedding matrix"""

    # Use the `sentences` argument rather than the global train_texts
    tokenized = [sent.split() for sent in sentences]

    # Build the vocabulary first, then train; calling build_vocab again
    # after training would re-initialise the learned vectors
    w2v = Word2Vec(vector_size = vector_size, max_vocab_size = vocab_size)
    w2v.build_vocab(tokenized, keep_raw_vocab = True)
    w2v.train(tokenized, total_examples = w2v.corpus_count, epochs = w2v.epochs)
    
    embedding_weights = np.zeros((vocab_size, vector_size))
    
    # Rows follow Word2Vec's own word ids (w2v.wv.key_to_index)
    for word, index in w2v.wv.key_to_index.items():
        embedding_weights[index] = w2v.wv[word]
        
    if return_vocab_size:
        size = len(w2v.raw_vocab)
        return embedding_weights, size
    
    return embedding_weights
    
embedding_weights, vocab_size = get_embedding_weights(train_texts) 
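
One thing to watch here: the Keras `Tokenizer` and Word2Vec each assign their own integer ids to words, and the padded sequences carry the tokenizer's ids. For the `Embedding` layer to look up the right vector, row `i` of the matrix should hold the vector of the word whose tokenizer id is `i`. The sketch below shows that alignment with plain lists and made-up vectors; `build_embedding_matrix` is a hypothetical helper, `word_index` stands in for `tokenizer.word_index`, and `vectors` stands in for `w2v.wv`.

```python
def build_embedding_matrix(word_index, vectors, vocab_size, vector_size):
    """Place each word's vector in the row given by its tokenizer id."""
    matrix = [[0.0] * vector_size for _ in range(vocab_size)]
    for word, idx in word_index.items():
        # Skip ids beyond the matrix and words the embedding model never learned
        if idx < vocab_size and word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix


word_index = {"good": 1, "not": 2, "rareword": 3}   # Keras Tokenizer ids start at 1
vectors = {"good": [0.1, 0.2], "not": [0.3, 0.4]}   # made-up 2-d vectors

matrix = build_embedding_matrix(word_index, vectors, vocab_size=4, vector_size=2)
print(matrix)
# [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
```

Row 0 stays all zeros, which matches the padding id, and words without a learned vector also fall back to zeros.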

Next, we'll create the model architecture alongside the `Embedding` layer

In [20]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(input_dim = 5000,
                            output_dim = 200,
                            weights = [embedding_weights],
                            trainable = False)

In [18]:
from tensorflow.keras.layers import Dense, Embedding, Dropout, Bidirectional, LSTM
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam

def build_model():
    """Create model architecture and compile with metrics"""

    model = Sequential()

    model.add(embedding_layer)
    model.add(Dropout(0.1))
    model.add(Bidirectional(LSTM(units=64)))
    model.add(Dense(50, activation="relu"))
    model.add(Dense(1, activation = "sigmoid"))

    model.compile(optimizer = Adam(learning_rate = 0.001), 
                  loss = 'binary_crossentropy', 
                  metrics = ['accuracy'])
    
    return model
    
    
model = build_model()

In [22]:
# model.fit(train_X_padded[:20000], train_y[:20000], epochs = 20, verbose = 2, validation_data=(val_X_padded[:500], val_y[:500]))

# Baseline Modelling + Feature Selection

We'll test a few common classifiers to find the best baseline model. We'll build upon this base model for future iterations.

Also, we'll use each model to test out the performances of using different features.

In [23]:
from sklearn.linear_model import LogisticRegression

from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.dummy import DummyClassifier
from xgboost import XGBClassifier

from sklearn.metrics import precision_score, f1_score, recall_score, roc_auc_score


def eval_metrics(pred, actual):
    """Calculate various metrics from the model's predictions"""
    precision = precision_score(actual, pred)
    f1 = f1_score(actual, pred)
    recall = recall_score(actual, pred)
    roc_auc = roc_auc_score(actual, pred)
    return precision, f1, recall, roc_auc
    

models = dict(
    # logistic_regression = LogisticRegression(max_iter=500),
    svc = SVC()
    # dummy_classifier = DummyClassifier(strategy='uniform'),
    # adaboost_classifier = AdaBoostClassifier(),
    # xgboost = XGBClassifier(),
    # decision_tree = DecisionTreeClassifier(),
    # MLP = MLPClassifier(),
    # rf = RandomForestClassifier(n_estimators = 150)
)

features = {'Unigram TF-IDF':(new_train_X, new_val_X, new_test_X),
            'Bigram TF-IDF': (new_train_X_bi, new_val_X_bi, new_test_X_bi),
            'Count Vectorizer':(vectorized_train_X, vectorized_val_X, vectorized_test_X)}

baseline_results = []

for model in models:
    metrics = {"model": model}
    for feature_name, (train, val, test) in features.items():
        m = models[model]
        m.fit(train[:100000], train_y[:100000])
        preds = m.predict(val)
        precision, f1, recall, roc_auc = eval_metrics(preds, val_y)
        # Merge dicts together
        metrics = {**metrics, **{(feature_name, "precision"):precision, (feature_name, "f1 score"):f1,(feature_name, "recall"):recall, (feature_name, "roc auc score"):roc_auc}}
    baseline_results.append(metrics)

We'll display the results in a DataFrame

In [24]:
def get_row_col_index(results):
    """Return row indices (model names) and column indices (MultiIndex) for the results dataframe"""
    
    rows, cols = [], []
    
    for mod in results:
        rows.append(list(mod.values())[0])
        cols.append(list(mod.keys())[1:])
        
    col_index = pd.MultiIndex.from_tuples(cols[0])
    row_index = rows
    
    return row_index, col_index

row_index, col_index = get_row_col_index(baseline_results)
baseline_df = pd.DataFrame(baseline_results, index=row_index, columns = col_index)


Since we've balanced out the datasets, we'll use the ROC-AUC score as an indicator of performance.
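
A note on how to read this number: `eval_metrics` passes hard 0/1 predictions to `roc_auc_score`, and with hard labels the ROC-AUC reduces to the average of recall on the positive class and recall on the negative class (balanced accuracy). The plain-Python sketch below computes the same family of metrics from confusion counts on made-up labels; `binary_metrics` is a hypothetical helper and assumes both classes are present and at least one positive is predicted.

```python
def binary_metrics(actual, pred):
    """Precision, recall, F1 and balanced accuracy from hard 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 0)

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2

    return precision, recall, f1, balanced_accuracy


actual = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 1, 1, 1, 0, 0]

precision, recall, f1, balanced_accuracy = binary_metrics(actual, pred)
print(round(precision, 3), recall, round(f1, 3), balanced_accuracy)
# 0.667 1.0 0.8 0.75
```

A threshold-free ROC-AUC would need scores or probabilities (for example from `decision_function` or `predict_proba`) instead of hard predictions.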

In [44]:
baseline_df.apply(lambda x: round(x,3))

| Model | Unigram TF-IDF: precision | Unigram TF-IDF: f1 score | Unigram TF-IDF: recall | Unigram TF-IDF: roc auc score | Bigram TF-IDF: precision | Bigram TF-IDF: f1 score | Bigram TF-IDF: recall | Bigram TF-IDF: roc auc score | Count Vectorizer: precision | Count Vectorizer: f1 score | Count Vectorizer: recall | Count Vectorizer: roc auc score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.775 | 0.786 | 0.796 | 0.782 | 0.773 | 0.781 | 0.789 | 0.778 | 0.756 | 0.77 | 0.785 | 0.766 |
| Dummy | 0.502 | 0.501 | 0.5 | 0.501 | 0.501 | 0.501 | 0.502 | 0.5 | 0.501 | 0.501 | 0.501 | 0.5 |
| Adaboost | 0.683 | 0.728 | 0.78 | 0.708 | 0.712 | 0.68 | 0.651 | 0.693 | 0.71 | 0.681 | 0.654 | 0.694 |
| XGBoost | 0.741 | 0.765 | 0.79 | 0.757 | 0.739 | 0.734 | 0.729 | 0.735 | 0.736 | 0.739 | 0.742 | 0.738 |
| Decision Tree | 0.697 | 0.701 | 0.705 | 0.699 | 0.703 | 0.706 | 0.71 | 0.704 | 0.703 | 0.698 | 0.692 | 0.7 |
| MultiLayer Perceptron | 0.766 | 0.759 | 0.752 | 0.761 | 0.768 | 0.771 | 0.774 | 0.77 | 0.669 | 0.716 | 0.77 | 0.694 |
| Random Forest | 0.783 | 0.778 | 0.773 | 0.779 | 0.765 | 0.77 | 0.774 | 0.768 | 0.767 | 0.758 | 0.749 | 0.761 |
| svc | 0.781 | 0.793 | 0.805 | 0.789 | 0.765 | 0.781 | 0.797 | 0.776 | 0.757 | 0.779 | 0.803 | 0.772 |


From the table above, we see that the **SVC** classifier got the highest ROC-AUC score (0.789), with **Unigram TF-IDF** being the best-performing feature set!

# Pipeline Creation

Now, we're going to create a modelling pipeline that:

- takes in raw text, 
- preprocesses it (cleans the text, lemmatizes it, and strips away stopwords)
- applies TF-IDF feature encoding (the pipeline below uses unigrams through trigrams, capped at 50,000 features), then 
- inputs the transformed data into a support vector classifier to produce a prediction on sentiment

Since we want to preprocess text inside the pipeline, we can create a custom transformer by subclassing sklearn's `BaseEstimator` and `TransformerMixin`.

In [20]:
from sklearn.base import BaseEstimator, TransformerMixin
from nltk.stem import WordNetLemmatizer

class SentimentDetectorPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        """Initializes a sklearn transformer that'll preprocess textual data"""
        self.lemmatizer = WordNetLemmatizer()
         

    def clean(self, t):
        
        """Replaces non-speech features in tweets with regex"""

        URL_PATTERN = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
        HASHTAG_PATTERN = r"#\S+[a-zA-Z]"
        USERNAME_PATTERN = r"@[a-zA-Z]\S*"
        NUMBER_PATTERN = r"\d+"
        APOSTROPHE_PATTERN = r"\w+'\w+"
        NONWORD_PATTERN = r"[^a-zA-Z]+"

        t = re.sub(HASHTAG_PATTERN, "HASHTAG", t)
        t = re.sub(URL_PATTERN, "URL", t)
        t = re.sub(USERNAME_PATTERN, "USER", t)
        t = re.sub(APOSTROPHE_PATTERN, "", t)
        t = re.sub(NUMBER_PATTERN, "NUMBER", t)
        t = re.sub(NONWORD_PATTERN, " ", t)

        return t.lower()
        
    def remove_stopwords(self, tokens):
        """Removes stopwords from a given sentence"""        

        stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
        
        # remove apostrophes from stopwords
        final_stop_words = list(map(lambda x: re.sub(r'\W+', '', x), stopwords))

        return (" ".join([word for word in tokens if word not in final_stop_words])).lower()


    def lemmatize_data(self, text):
        return [self.lemmatizer.lemmatize(tok) for tok in text.split()]

    
    def fit(self, X, y = None):
        return self
        
    def transform(self, X, y = None):
        if isinstance(X, pd.DataFrame):
            X = X['tweet'] 
        if not isinstance(X, pd.Series):
            X = pd.Series(X)
        
        data = X.apply(self.clean)
        data = data.apply(self.lemmatize_data)
        data = data.apply(self.remove_stopwords)
        return data
    
    def fit_transform(self, X, y = None):
        self.fit(X)
        return self.transform(X)
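
To show why the transformer exposes `fit` and `transform`, here is a stripped-down sketch of how a pipeline chains its steps. `MiniPipeline`, `Lowercase`, and `KeywordModel` are hypothetical toy classes written for this example; they only mimic the general calling pattern and are not scikit-learn's implementation.

```python
class Lowercase:
    """Stateless transformer: fit learns nothing, transform does the work."""

    def fit(self, X, y=None):
        return self  # returning self lets calls be chained

    def transform(self, X):
        return [text.lower() for text in X]


class KeywordModel:
    """Toy final estimator: remembers words seen only in positive examples."""

    def fit(self, X, y):
        positive = {w for text, label in zip(X, y) if label == 1 for w in text.split()}
        negative = {w for text, label in zip(X, y) if label == 0 for w in text.split()}
        self.positive_words_ = positive - negative
        return self

    def predict(self, X):
        return [int(any(w in self.positive_words_ for w in text.split())) for text in X]


class MiniPipeline:
    """Simplified stand-in for a pipeline: transformers first, model last."""

    def __init__(self, transformers, model):
        self.transformers = transformers
        self.model = model

    def fit(self, X, y):
        for step in self.transformers:
            X = step.fit(X, y).transform(X)
        self.model.fit(X, y)
        return self

    def predict(self, X):
        # At prediction time the steps are only applied, never re-fitted
        for step in self.transformers:
            X = step.transform(X)
        return self.model.predict(X)


pipe_sketch = MiniPipeline([Lowercase()], KeywordModel())
pipe_sketch.fit(["Great day", "Awful day"], [1, 0])
print(pipe_sketch.predict(["GREAT stuff", "awful stuff"]))  # [1, 0]
```

Because the raw text goes through the same preprocessing at training and prediction time, the saved pipeline can later be fed unprocessed tweets directly.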
        
        

In [37]:
# Create the pipeline
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from utils.transformer import SentimentDetectorPreprocessor
from sklearn.pipeline import Pipeline

# best_model = gs.best_estimator_

pipe = Pipeline(steps = [('sd_processor', SentimentDetectorPreprocessor()),
                         ('tf-idf_vectorizer', TfidfVectorizer(ngram_range = (1,3), max_features = 50000)),
                         ('SVC', SVC(probability = True, gamma = 'auto'))])

# Train
pipe.fit(train_X, train_y)

In [35]:
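# Display metrics from test dataset

from sklearn.metrics import classification_report

# classification_report expects (y_true, y_pred). With the arguments swapped,
# per-class precision and recall trade places and "support" counts predictions,
# which is what the report shown below reflects; accuracy is unaffected.
print(classification_report(test_y, pipe.predict(test_X)))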

              precision    recall  f1-score   support

           0       0.77      0.79      0.78    145334
           1       0.80      0.77      0.79    154661

    accuracy                           0.78    299995
   macro avg       0.78      0.78      0.78    299995
weighted avg       0.78      0.78      0.78    299995



We'll pickle the model to get a `pkl` file.

In [36]:
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(pipe, f)
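
Loading the file back follows the same pattern in reverse. The sketch below round-trips a small made-up dictionary through a temporary file instead of the real pipeline, so it runs without any trained model; the `artifact` contents are purely illustrative.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted pipeline
artifact = {"labels": {0: "negative", 1: "positive"}, "ngram_range": (1, 3)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")

    with open(path, "wb") as f:
        pickle.dump(artifact, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)  # True
```

Pickle stores classes by their import path, so whichever script loads `model.pkl` needs to be able to import `SentimentDetectorPreprocessor` from the same module it was imported from when the pipeline was saved. Pickle files should also only be loaded from sources you trust.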

# Final Thoughts

- We ultimately ended up with a **Support Vector Classifier**, using TF-IDF vectors as features for the model. On the balanced testing dataset, the model got an accuracy of 0.78.

- After testing the initial model against the goals in the introduction, the model was able to pick up some simple negations (`not good`, `dont like`, `not the best`), but still struggled with others (`not bad`, `dont hate`). This can be attributed to several factors, including possible spelling errors and not enough training data.

- For future iterations, we could incorporate hyperparameter tuning, such as using `GridSearchCV` to pick the best kernel for the SVC. Also, with access to more memory, we could train the model on the entire dataset, so that it picks up more common phrases that help in sentiment detection.

*Thanks for sticking around this far, until next time! 👋*