### Capstone Project - Sentiment Analysis of Social Media (X/Twitter) Feed

Build a model that determines the sentiment/tone of a text passage using Natural Language Processing (NLP) techniques. This is a supervised learning task, so it needs text passages with an assigned sentiment label. The dataset used here is labeled only negative (0) or positive (1), so the models below are binary classifiers with no neutral class.

### The Data

The Twitter Sentiment Analysis datasets are from Kaggle:

- `train.csv`: the training set
- `test.csv`: the test set

Data fields:

- `ItemID`: numeric id of the tweet
- `Sentiment`: sentiment label (0 = negative, 1 = positive)
- `SentimentText`: text of the tweet

Citation: Azhar Yebekenova. (2017). Twitter sentiment analysis. Kaggle. https://kaggle.com/competitions/twitter-sentiment-analysis2


In [5]:
import pandas as pd

In [6]:
df = pd.read_csv('/Users/dsgarcha/Documents/Berkeley/Capstone/twitter-sentiment-analysis/train.csv')
target_df = pd.read_csv('/Users/dsgarcha/Documents/Berkeley/Capstone/twitter-sentiment-analysis/test.csv')
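The absolute paths above only resolve on the author's machine. A minimal sketch of a more portable loader follows, assuming a hypothetical helper `load_tweets` and the column names listed under Data fields. The in-memory CSV is made-up data that stands in for `train.csv`.

```python
import io
import pandas as pd

# Columns described in the "Data fields" list; test.csv has no Sentiment column.
TRAIN_COLUMNS = ["ItemID", "Sentiment", "SentimentText"]

def load_tweets(source, expected_columns=TRAIN_COLUMNS):
    """Read a tweet CSV (path or file-like object) and check the expected columns exist."""
    frame = pd.read_csv(source)
    missing = [col for col in expected_columns if col not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return frame

# Tiny in-memory stand-in for train.csv (hypothetical rows, for illustration only).
sample_csv = io.StringIO(
    "ItemID,Sentiment,SentimentText\n"
    "1,0,so sad today\n"
    "2,1,what a great day\n"
)
sample_df = load_tweets(sample_csv)

# Class balance is worth checking before comparing accuracy scores.
label_counts = sample_df["Sentiment"].value_counts().to_dict()
print(label_counts)
```

With the real data, the same helper could be called with a relative path such as `load_tweets("train.csv")`, assuming the file sits next to the notebook.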

In [7]:
df.head()

Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
2,3,1,omg its already 7:30 :O
3,4,0,.. Omgaga. Im sooo im gunna CRy. I've been at...
4,5,0,i think mi bf is cheating on me!!! ...


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99989 entries, 0 to 99988
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ItemID         99989 non-null  int64 
 1   Sentiment      99989 non-null  int64 
 2   SentimentText  99989 non-null  object
dtypes: int64(2), object(1)
memory usage: 2.3+ MB


In [9]:
target_df.head()

Unnamed: 0,ItemID,SentimentText
0,1,is so sad for my APL frie...
1,2,I missed the New Moon trail...
2,3,omg its already 7:30 :O
3,4,.. Omgaga. Im sooo im gunna CRy. I've been at...
4,5,i think mi bf is cheating on me!!! ...


#### Task


**Text preprocessing:** As a pre-processing step, perform both `stemming` and `lemmatizing` to normalize your text before classifying. For each technique, use both the `CountVectorizer` and the `TfidfVectorizer`, and use the stop-words and max-features options to prepare the text data for your estimator.

**Classification:** Once you have prepared the text data with the stemming and lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.

Share the results of your best classifier in the form of a table with the best version of each estimator, a dictionary of the best parameters and the best score.

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

import time


import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer, word_tokenize
from nltk.corpus import stopwords

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/dsgarcha/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/dsgarcha/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /Users/dsgarcha/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dsgarcha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
# Preprocessing: Stemming

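porter = PorterStemmer()  # create the stemmer once instead of once per tweet

def stemmer(text):
    """Tokenize a tweet and return its Porter-stemmed tokens joined by spaces."""
    stemmed_words = [porter.stem(word) for word in word_tokenize(text)]
    return ' '.join(stemmed_words)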

start_time = time.time()
stemmed_text = df['SentimentText'].apply(stemmer) # time taken: 150 sec
target_stemmed_text = target_df['SentimentText'].apply(stemmer)
print(time.time()-start_time)
print(stemmed_text[0])


88.67580795288086
is so sad for my apl friend .............


In [12]:
print(target_stemmed_text[0:5])

0            is so sad for my apl friend .............
1                      i miss the new moon trailer ...
2                              omg it alreadi 7:30 : o
3    .. omgaga . im sooo im gunna cri . i 've been ...
4               i think mi bf is cheat on me ! ! ! t_t
Name: SentimentText, dtype: object


In [13]:
y = df['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(stemmed_text, y, random_state = 42)

In [14]:
X_train.head()

60132    @ bhyundai 19 rdr and our store did 49 new for...
94181       @ cornflakesss sooo .... are you wear jean ? ?
62234    @ bjmendelson hey , i wan na be follow more . ...
74933    @ anonymousmoi ye inde ! he could soooo get it...
47231                  @ allnight_alway lmaooooo . i win .
Name: SentimentText, dtype: object

In [15]:
X_train[:100]

60132    @ bhyundai 19 rdr and our store did 49 new for...
94181       @ cornflakesss sooo .... are you wear jean ? ?
62234    @ bjmendelson hey , i wan na be follow more . ...
74933    @ anonymousmoi ye inde ! he could soooo get it...
47231                  @ allnight_alway lmaooooo . i win .
                               ...                        
85129              @ ccbake thank but my pant say otherwis
8972     & quot ; fuck you alphabetti spaghetti . i nev...
45925    @ alittlebit keep the blog come as that wa an ...
60764     @ bertbalcaen ye ! fulli agre - but got no choic
17484    @ __xkul0tx__ you 're wonder & amp ; i hope th...
Name: SentimentText, Length: 100, dtype: object
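The stemmed output above shows two side effects of `word_tokenize` on tweets: handles are split into `@ bhyundai`, and HTML entities survive as `& quot ;` and `& amp ;`. One possible alternative, sketched here with the `TweetTokenizer` that is already imported, decodes entities and drops handles before stemming. The helper name `stem_tweet` and the example tweet are assumptions, and this variant was not used to produce the scores in this notebook.

```python
import html
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# TweetTokenizer is rule-based, so it needs no extra NLTK data download.
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
porter = PorterStemmer()

def stem_tweet(text):
    """Decode HTML entities, tokenize as a tweet, and Porter-stem each token."""
    decoded = html.unescape(text)                 # "&quot;" -> '"', "&amp;" -> "&"
    tokens = tweet_tokenizer.tokenize(decoded)    # drops @handles, shortens "sooooo"
    return " ".join(porter.stem(token) for token in tokens)

# Hypothetical tweet, for illustration only.
example = "@someone I think &quot;cheating&quot; is sooooo wrong &amp; sad :("
stemmed_example = stem_tweet(example)
print(stemmed_example)
```

Whether removing handles helps or hurts accuracy on this dataset is an open question; it would need to be checked with the same train/test split.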

In [16]:
# Baseline pipeline: CountVectorizer features fed to a LogisticRegression classifier
cvect_lgr_pipe = Pipeline([('cvect', CountVectorizer(stop_words = 'english', max_features = 1000)),
                       ('lgr', LogisticRegression())])
cvect_lgr_pipe.fit(X_train, y_train)

# Score on the held-out split (accuracy)
test_acc = cvect_lgr_pipe.score(X_test, y_test) # value is 0.7348 for baseline
print(test_acc)

0.7347787823025842


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
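
The warning above means the default iteration budget of `LogisticRegression` (`max_iter=100`) ran out before the solver converged. A minimal sketch of the fix the message suggests, raising `max_iter`, is shown below on toy data. The texts, labels, and the value 1000 are assumptions for illustration.

```python
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data (hypothetical) standing in for X_train / y_train.
texts = ["so sad today", "i miss you", "great day", "love this song",
         "feel terribl", "happi and thank"]
labels = [0, 0, 1, 1, 0, 1]

# Same baseline pipeline, but with a larger iteration budget than the default of 100.
baseline_pipe = Pipeline([
    ("cvect", CountVectorizer(stop_words="english", max_features=1000)),
    ("lgr", LogisticRegression(max_iter=1000)),
])

# Record warnings so a ConvergenceWarning would be visible rather than silently printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    baseline_pipe.fit(texts, labels)

convergence_warnings = [w for w in caught if issubclass(w.category, ConvergenceWarning)]
print(len(convergence_warnings))
```

On the full dataset a larger budget costs more fit time, which is one reason the grid searches below take longer once `lgr__max_iter` is added.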


In [17]:
# Pipeline and grid search using CountVectorizer features and a LogisticRegression classifier
# Note: the 'lbfgs' and 'sag' solvers do not support the 'l1' penalty, so those candidates fail to fit
params = {'cvect__max_features': [500, 2000, 5000, 100000],
          'cvect__stop_words': ['english', None],
          'lgr__max_iter': [1000, 10000],
          'lgr__penalty': ['l2', 'l1', None],
          'lgr__solver': ['lbfgs', 'sag']}

cvect_lgr_pipe = Pipeline([('cvect', CountVectorizer()),
                       ('lgr', LogisticRegression())])
# cvect_lgr_pipe.get_params()

cvect_lgr_grid = GridSearchCV(cvect_lgr_pipe, param_grid=params)

start_time = time.time()

# fit the grid search, then score the best estimator on the held-out split

cvect_lgr_grid.fit(X_train, y_train) #fit-time: 769 sec (took only 77 sec without lgr params)
print(time.time()-start_time)
test_acc = cvect_lgr_grid.score(X_test, y_test) #0.7646 (was 0.7632 without lgr params)
print(test_acc)
print(cvect_lgr_grid.best_params_)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

2215.0111212730408
0.7639011120889672
{'cvect__max_features': 100000, 'cvect__stop_words': None, 'lgr__max_iter': 1000, 'lgr__penalty': 'l2', 'lgr__solver': 'lbfgs'}
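
The grid above pairs every penalty with every solver, but `lbfgs` and `sag` only support `l2` or no penalty, so the `l1` candidates are expected to fail and contribute nothing to the search. A sketch of one way to avoid that is to pass `GridSearchCV` a list of sub-grids, each holding only compatible pairs. The choice of `liblinear` and `saga` for `l1` is an assumption; they are solvers that support it, not ones tried in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# The grid as written in the cell above.
original_params = {
    "cvect__max_features": [500, 2000, 5000, 100000],
    "cvect__stop_words": ["english", None],
    "lgr__max_iter": [1000, 10000],
    "lgr__penalty": ["l2", "l1", None],
    "lgr__solver": ["lbfgs", "sag"],
}

# Settings shared by every candidate.
shared = {
    "cvect__max_features": [500, 2000, 5000, 100000],
    "cvect__stop_words": ["english", None],
    "lgr__max_iter": [1000, 10000],
}

# One sub-grid per family of compatible penalty/solver pairs.
compatible_params = [
    {**shared, "lgr__penalty": ["l2", None], "lgr__solver": ["lbfgs", "sag"]},
    {**shared, "lgr__penalty": ["l1"], "lgr__solver": ["liblinear", "saga"]},
]

n_original = len(ParameterGrid(original_params))
n_unsupported = sum(1 for p in ParameterGrid(original_params) if p["lgr__penalty"] == "l1")
n_compatible = len(ParameterGrid(compatible_params))
print(n_original, n_unsupported, n_compatible)
```

`compatible_params` can be passed as `param_grid` in place of `params`. The candidate count stays the same, but every candidate can actually be fitted.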


In [23]:
# Pipeline and grid search using TfidfVectorizer features and a LogisticRegression classifier
params_tfidf = {'tfidf__max_features': [500, 2000, 5000, 100000],
         'tfidf__stop_words': ['english', None],
         'lgr__max_iter': [1000, 10000],
         'lgr__penalty': ['l2', 'l1', None],
         'lgr__solver': ['lbfgs', 'sag']}

tfidf_lgr_pipe = Pipeline([('tfidf', TfidfVectorizer()),
                       ('lgr', LogisticRegression())])
tfidf_lgr_grid = GridSearchCV(tfidf_lgr_pipe, param_grid=params_tfidf)

start_time = time.time()
tfidf_lgr_grid.fit(X_train, y_train) #fit-time: 651 sec (was 75 sec without lgr params)
print(time.time()-start_time)
test_acc = tfidf_lgr_grid.score(X_test, y_test) #0.7697 (was 0.7698 without lgr params)
print(test_acc)
print(tfidf_lgr_grid.best_params_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

651.2562758922577
0.7697015761260901
{'lgr__max_iter': 1000, 'lgr__penalty': 'l2', 'lgr__solver': 'sag', 'tfidf__max_features': 100000, 'tfidf__stop_words': None}


In [18]:
#type(y_train)
len(y_train)
X_train.values.reshape(-1,1)

array([['@ bhyundai 19 rdr and our store did 49 new for the first time ... you dealer know how signific that is ... well , wa last month'],
       ['@ cornflakesss sooo .... are you wear jean ? ?'],
       ['@ bjmendelson hey , i wan na be follow more . lol !'],
       ...,
       ["@ bwgan s'not fair i 'm go to do overtim after my leav . my 'lcd ' tv fund ! ! !"],
       ['dinara lost again in roland garro . whi the safin have to do it hard ?'],
       ["* yawn * but that 's no excus not to head out"]], dtype=object)
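
The reshaped array above still holds raw strings, which is a likely cause of the data-conversion errors mentioned in the next cell: ensemble estimators such as `AdaBoostClassifier` expect numeric features. A minimal sketch of one way around this is to put a vectorizer in front of the classifier inside a `Pipeline`. The toy texts, labels, and `n_estimators=10` are assumptions for illustration.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy data (hypothetical) standing in for X_train / y_train.
texts = ["so sad today", "i miss you", "great day", "love this song",
         "feel terribl", "happi and thank", "worst day ever", "best song ever"]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

# The vectorizer turns each string into a numeric count vector before boosting.
ada_pipe = Pipeline([
    ("cvect", CountVectorizer()),
    ("ada", AdaBoostClassifier(n_estimators=10, random_state=42)),
])
ada_pipe.fit(texts, labels)   # pass the raw strings; no reshape is needed

predictions = ada_pipe.predict(["great day", "so sad"])
print(predictions)
```

The same pattern should apply to `RandomForestClassifier`, although neither ensemble was evaluated on this dataset here.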

In [19]:
'''
# Tried to use Random Forest and AdaBoost but ran into some issues regarding data conversion
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

model_1 = AdaBoostClassifier()
model_1 .fit(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1))
model_1_acc = model_1.score(X_test, y_test)

# LogisticRegression as Base Estimator
params = {'mod__base_estimator__C': [.001, 0.01, 0.1, 1.0, 10.0]}
pipe = Pipeline([('mod', AdaBoostClassifier(base_estimator = LogisticRegression(), random_state = 42))])

grid = GridSearchCV(pipe, param_grid=params)
grid.fit(X_train, y_train)
score2 = grid.score(X_test, y_test)

params = {'n_estimators': [100, 200], 'base_estimator__max_depth': [1, 2, 3]}

tree_grid = GridSearchCV(AdaBoostClassifier(base_estimator = DecisionTreeClassifier(), random_state = 42),
                         param_grid=params)

tree_grid.fit(X_train, y_train)
grid_acc = tree_grid.score(X_test, y_test) 

'''

In [105]:

# Pipeline and grid search using CountVectorizer features and a DecisionTreeClassifier
# Takes too long to execute

params = {'cvect__max_features': [500, 2000, 5000, 100000],
         'cvect__stop_words': ['english', None]}

dt_pipe = Pipeline([('cvect', CountVectorizer()),
                       ('dtreeCLF', DecisionTreeClassifier())])
dt_grid = GridSearchCV(dt_pipe, param_grid=params)

start_time = time.time()
dt_grid.fit(X_train, y_train) #fit-time: 773 sec
print(time.time()-start_time)
test_acc = dt_grid.score(X_test, y_test) #0.6979
print(test_acc)
print(dt_grid.best_params_)


772.8848021030426
0.6978958316665334
{'cvect__max_features': 100000, 'cvect__stop_words': None}


In [20]:
# help(MultinomialNB)

In [44]:
# Pipeline and Grid Search using CountVectorizer normalizer and MultinomialNB classifier
# MultinomialNB with parameters, such as MultinomialNB(alpha=1.0, fit_prior=False)
params = {'cvect__max_features': [500, 2000, 5000, 100000],
         'cvect__stop_words': ['english', None],
         'bayes__alpha': [1, 0.1, 0.001],
         'bayes__force_alpha': [False, True],
         'bayes__fit_prior': [False, True]}

bayes_pipe = Pipeline([('cvect', CountVectorizer()),
                       ('bayes', MultinomialNB())])
bayes_grid = GridSearchCV(bayes_pipe, param_grid=params)

start_time = time.time()
bayes_grid.fit(X_train, y_train) #fit-time: 272 sec (43 sec without bayes params)
print(time.time()-start_time)
test_acc = bayes_grid.score(X_test, y_test) # 0.7581 (0.7558 without bayes params)
print(test_acc)
print(bayes_grid.best_params_)

272.63193702697754
0.7581006480518442
{'bayes__alpha': 1, 'bayes__fit_prior': True, 'bayes__force_alpha': False, 'cvect__max_features': 100000, 'cvect__stop_words': None}


In [3]:
# print(bayes_grid.cv_results_)

In [21]:
results_df= pd.DataFrame({'model': ['Logistic-CV', 'Logistic-TFID', 'Decision Tree', 'Bayes'], 
             'best_params': ['max_features: 100,000, stop_words: None', 'max_features: 100,000, stop_words: None', 'max_features: 100,000, stop_words: None', 'max_features: 100,000, stop_words: None'],
             'fit-time': ['769 seconds', '651 seconds', '773 seconds', '272 seconds'],
             'best_score': ['0.7646','0.7697', '0.6979', '0.7581']}).set_index('model')
results_df

model,best_params,fit-time,best_score
Logistic-CV,"max_features: 100,000, stop_words: None",769 seconds,0.7646
Logistic-TFID,"max_features: 100,000, stop_words: None",651 seconds,0.7697
Decision Tree,"max_features: 100,000, stop_words: None",773 seconds,0.6979
Bayes,"max_features: 100,000, stop_words: None",272 seconds,0.7581
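
The table above is typed in by hand, which makes it easy for the numbers to drift from the cell outputs. A sketch of building the same kind of table directly from fitted `GridSearchCV` objects follows. The helper name `summarize_grids` and the toy data are assumptions. Note that `refit_time_` covers only the final refit of the best estimator, not the wall-clock time of the whole search.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def summarize_grids(fitted_grids, X_eval, y_eval):
    """Collect best params and scores from a dict of fitted GridSearchCV objects."""
    rows = []
    for name, grid in fitted_grids.items():
        rows.append({
            "model": name,
            "best_params": grid.best_params_,
            "cv_score": grid.best_score_,                # mean cross-validated score
            "test_score": grid.score(X_eval, y_eval),    # score on the held-out split
            "refit_time_sec": grid.refit_time_,          # time to refit the best estimator
        })
    return pd.DataFrame(rows).set_index("model")

# Toy data (hypothetical) so the sketch runs on its own.
texts = ["so sad today", "i miss you", "great day", "love this song",
         "feel terribl", "happi and thank", "worst day ever", "best song ever"]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

toy_grid = GridSearchCV(
    Pipeline([("cvect", CountVectorizer()), ("bayes", MultinomialNB())]),
    param_grid={"bayes__alpha": [1.0, 0.1]},
    cv=2,
)
toy_grid.fit(texts, labels)

summary = summarize_grids({"Bayes": toy_grid}, texts, labels)
print(summary)
```

In this notebook the call would look like `summarize_grids({'Logistic-CV': cvect_lgr_grid, 'Logistic-TFIDF': tfidf_lgr_grid, 'Decision Tree': dt_grid, 'Bayes': bayes_grid}, X_test, y_test)`.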


In [None]:
# PART 2: The objective is a better accuracy score. None of the models above scored above 77%,
# so this part tries an alternative solution using the spaCy library.

In [88]:
# help(spacy.load)

In [29]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from collections import Counter
from string import punctuation

# import spacy.cli
# spacy.cli.download("en_core_web_sm")

nlp = spacy.load("en_core_web_sm")


In [30]:
y = df['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(df["SentimentText"], y, random_state = 42)

In [31]:
# Format required for spacy's DocBin file
train_data = [
    (sentence, {'cats': {'1': label == 1, '0': label == 0}})
    for sentence, label in zip(X_train, y_train)
]
test_data = [
    (sentence, {'cats': {'1': label == 1, '0': label == 0}})
    for sentence, label in zip(X_test, y_test)
]

In [32]:
train_data[:5]

[('@bhyundai 19 RDRs  and our store did 49 new for the first time... you dealers know how significant that is... well, was last month',
  {'cats': {'1': True, '0': False}}),
 ('@cornflakesss  sooo....are you wearing jeans??',
  {'cats': {'1': False, '0': True}}),
 ('@BJMendelson Hey, I wanna be followed more. LOL! ',
  {'cats': {'1': True, '0': False}}),
 ("@Anonymousmoi yes indeed! He could soooo get it! Yum yum  my friends just don't understand.",
  {'cats': {'1': True, '0': False}}),
 ('@allnight_always LMAOOOOO. I win. ', {'cats': {'1': True, '0': False}})]

In [33]:
# Split into training and validation sets

training_len = 60000
new_train_data = train_data[:training_len]
new_dev_data = train_data[training_len:]
print(len(new_train_data), len(new_dev_data), len(test_data))

60000 14991 24998
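
The three sizes printed above follow from the row count reported by `df.info()` and the default 25% test fraction of `train_test_split`. The arithmetic can be checked as a short sketch:

```python
import math

total_rows = 99989       # from df.info()
test_fraction = 0.25     # train_test_split default when test_size is not given

# scikit-learn rounds the test size up and gives the remainder to the training set.
test_rows = math.ceil(total_rows * test_fraction)
train_rows = total_rows - test_rows

training_len = 60000                     # rows kept for spaCy training
dev_rows = train_rows - training_len     # the rest becomes the validation (dev) set

print(training_len, dev_rows, test_rows)
```

So the spaCy model is trained on about 60% of the data, validated on about 15%, and evaluated on the same 25% held out for the scikit-learn models.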


In [35]:
# Conversion function to save DocBin binary files to disk

from spacy.tokens import DocBin

def convert(train_data, outfile):
    nlp = spacy.blank("en")
    db = DocBin()
    
    for data in train_data:
        doc = nlp.make_doc(data[0])
        doc.cats = data[1].get("cats")
        db.add(doc)
    db.to_disk(outfile)
    
convert(new_train_data, "train.spacy")
convert(new_dev_data, "dev.spacy")
convert(test_data, "test.spacy")
    

In [36]:
# Generate spacy model configuration file

!python -m spacy init config --pipeline textcat config.cfg --force


2023-12-17 23:07:57.629001: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: textcat
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [39]:
# Train model and save model as textcat_model

start_time = time.time()
!python -m spacy train config.cfg --paths.train ./train.spacy  --paths.dev ./dev.spacy --output textcat_model
print(time.time()-start_time) # 142 seconds

ℹ Saving to output directory: textcat_model
ℹ Using CPU
[2023-12-17 23:19:01,670] [INFO] Set up nlp object from config
[2023-12-17 23:19:01,675] [INFO] Pipeline: ['textcat']
[2023-12-17 23:19:01,677] [INFO] Created vocabulary
[2023-12-17 23:19:01,677] [INFO] Finished initializing nlp object
[2023-12-17 23:19:14,942] [INFO] Initialized pipeline components: ['textcat']
✔ Initialized pipeline
ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.25       48.18    
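
Once training finishes, the pipeline saved under `textcat_model/model-best` can be loaded with `spacy.load`, and each processed document exposes its category scores in `doc.cats`. The sketch below shows only the last step, picking the winning label from such a dictionary. The helper name `top_category` and the example scores are assumptions, not output from the trained model.

```python
def top_category(cats):
    """Return the (label, score) pair with the highest score from a doc.cats-style dict."""
    label = max(cats, key=cats.get)
    return label, cats[label]

# Hypothetical scores, shaped like the dict spaCy stores in doc.cats for labels '1' and '0'.
example_cats = {"1": 0.81, "0": 0.19}
best_label, best_score = top_category(example_cats)
print(best_label, best_score)
```

With the trained model this would be used as `top_category(nlp_textcat(text).cats)`, where `nlp_textcat = spacy.load("textcat_model/model-best")`.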

In [40]:
# Evaluate model metrics
start_time = time.time()
!python -m spacy evaluate textcat_model/model-best/ --output metrics.json ./test.spacy
print(time.time()-start_time)

ℹ Using CPU

TOK                 100.00
TEXTCAT (macro F)   76.33 
SPEED               386918

        P       R       F
1   78.05   81.79   79.87
0   75.16   70.54   72.78

    ROC AUC
1      0.84
0      0.84

✔ Saved results to metrics.json
11.479695796966553
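
The headline spaCy number is a macro F-score, not an accuracy, so it is not measured on the same scale as the scikit-learn scores. As a sketch of how the reported figures relate, the per-label F values can be recomputed from precision and recall, and the macro F is their unweighted mean. Small differences are expected because the printed values are rounded.

```python
# Per-label precision, recall, and F as printed by `spacy evaluate` above (percentages).
per_label = {
    "1": {"P": 78.05, "R": 81.79, "F": 79.87},
    "0": {"P": 75.16, "R": 70.54, "F": 72.78},
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Recompute each label's F from its rounded P and R.
recomputed_f = {label: f1(m["P"], m["R"]) for label, m in per_label.items()}

# Macro F: unweighted mean of the per-label F-scores.
macro_f = sum(m["F"] for m in per_label.values()) / len(per_label)

print(recomputed_f)
print(macro_f)
```

The mean of 79.87 and 72.78 is about 76.33, which matches the `TEXTCAT (macro F)` line.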


In [41]:
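# Note: the spaCy score is the macro F-score (76.33) reported by `spacy evaluate`; the other rows are accuracy scores
results_df= pd.DataFrame({'model': ['Logistic-CV', 'Logistic-TFID', 'Decision Tree', 'Bayes', 'spaCy'], 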
             'best_params': ['max_features: 100,000, stop_words: None', 'max_features: 100,000, stop_words: None', 'max_features: 100,000, stop_words: None', 'max_features: 100,000, stop_words: None', 'N/A'],
             'fit-time': ['769 seconds', '651 seconds', '773 seconds', '272 seconds', '142 seconds'],
             'best_score': ['0.7646','0.7697', '0.6979', '0.7581', '0.7633']}).set_index('model')
results_df

model,best_params,fit-time,best_score
Logistic-CV,"max_features: 100,000, stop_words: None",769 seconds,0.7646
Logistic-TFID,"max_features: 100,000, stop_words: None",651 seconds,0.7697
Decision Tree,"max_features: 100,000, stop_words: None",773 seconds,0.6979
Bayes,"max_features: 100,000, stop_words: None",272 seconds,0.7581
spaCy,,142 seconds,0.7633
