### Capstone Project - Sentiment Analysis of Social Media (X/Twitter) Feed

Build a model that classifies the sentiment (tone) of a tweet using Natural Language Processing (NLP) techniques. This is a supervised learning problem, so it needs text passages that already carry a sentiment label. The dataset used here is labeled negative (0) or positive (1) only; it has no neutral class.

### The Data

The Twitter Sentiment Analysis datasets are from Kaggle:
- `train.csv` - the training set
- `test.csv` - the test set

Data fields:
- `ItemID` - numeric id of the tweet
- `Sentiment` - sentiment label (0 - negative, 1 - positive)
- `SentimentText` - text of the tweet

Citation: Azhar Yebekenova. (2017). Twitter sentiment analysis. Kaggle. https://kaggle.com/competitions/twitter-sentiment-analysis2


In [46]:
import pandas as pd

In [47]:
df = pd.read_csv('/Users/dsgarcha/Downloads/twitter-sentiment-analysis/train.csv')
target_df = pd.read_csv('/Users/dsgarcha/Downloads/twitter-sentiment-analysis/test.csv')

In [48]:
df.head()

Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
2,3,1,omg its already 7:30 :O
3,4,0,.. Omgaga. Im sooo im gunna CRy. I've been at...
4,5,0,i think mi bf is cheating on me!!! ...


In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99989 entries, 0 to 99988
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ItemID         99989 non-null  int64 
 1   Sentiment      99989 non-null  int64 
 2   SentimentText  99989 non-null  object
dtypes: int64(2), object(1)
memory usage: 2.3+ MB


In [50]:
target_df.head()

Unnamed: 0,ItemID,SentimentText
0,1,is so sad for my APL frie...
1,2,I missed the New Moon trail...
2,3,omg its already 7:30 :O
3,4,.. Omgaga. Im sooo im gunna CRy. I've been at...
4,5,i think mi bf is cheating on me!!! ...


#### Task


**Text preprocessing:** As a pre-processing step, perform both `stemming` and `lemmatizing` to normalize your text before classifying. For each technique, use both the `CountVectorizer` and the `TfidfVectorizer`, with options for stop words and max features, to prepare the text data for your estimator.

**Classification:** Once you have prepared the text data with the stemming and lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.

Share the results of your best classifier in the form of a table with the best version of each estimator, a dictionary of the best parameters and the best score.
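
The task asks for both stemming and lemmatizing, while the cells in this section apply stemming only. The sketch below shows one way to switch between the two with a single helper. The function name `normalize` and its `method` argument are hypothetical, and it tokenizes with `TweetTokenizer` so that the stemming path needs no downloaded NLTK data. The lemma path assumes `nltk.download('wordnet')` has already been run.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

# Create the helpers once and reuse them for every tweet.
_tokenizer = TweetTokenizer(preserve_case=False)
_stemmer = PorterStemmer()
_lemmatizer = WordNetLemmatizer()  # the WordNet corpus is only loaded on first use


def normalize(text, method="stem"):
    """Tokenize one tweet and return its stems or lemmas joined by spaces.

    `method` is a hypothetical switch: "stem" or "lemma".
    """
    tokens = _tokenizer.tokenize(text)
    if method == "stem":
        words = [_stemmer.stem(tok) for tok in tokens]
    elif method == "lemma":
        # Treats every token as a noun (the default part of speech).
        words = [_lemmatizer.lemmatize(tok) for tok in tokens]
    else:
        raise ValueError(f"unknown method: {method!r}")
    return " ".join(words)


stemmed = normalize("i think mi bf is cheating on me", method="stem")
print(stemmed)
```

With the notebook's data, the same helper could be applied as `df['SentimentText'].apply(normalize, method='lemma')` to produce a lemmatized column for comparison against the stemmed one.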

In [51]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

import time

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer, word_tokenize
from nltk.corpus import stopwords

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/dsgarcha/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/dsgarcha/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /Users/dsgarcha/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dsgarcha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [79]:
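# Preprocessing: stemming

porter = PorterStemmer()  # create once and reuse for every tweet

def stemmer(text):
    """Tokenize a tweet, stem each token, and re-join the stems with spaces."""
    stemmed_words = [porter.stem(word) for word in word_tokenize(text)]
    return ' '.join(stemmed_words)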

start_time = time.time()
stemmed_text = df['SentimentText'].apply(stemmer) # time taken: 150 sec
target_stemmed_text = target_df['SentimentText'].apply(stemmer)
print(time.time()-start_time)
print(stemmed_text[0])


148.0249321460724
is so sad for my apl friend .............


In [80]:
print(target_stemmed_text[0:5])

0            is so sad for my apl friend .............
1                      i miss the new moon trailer ...
2                              omg it alreadi 7:30 : o
3    .. omgaga . im sooo im gunna cri . i 've been ...
4               i think mi bf is cheat on me ! ! ! t_t
Name: SentimentText, dtype: object


In [81]:
# Target labels; the stemmed text is the only feature
y = df['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(stemmed_text, y, random_state = 42)

In [84]:

# X_train and X_test are pandas Series of stemmed strings, not DataFrames
X_train.head()

60132    @ bhyundai 19 rdr and our store did 49 new for...
94181       @ cornflakesss sooo .... are you wear jean ? ?
62234    @ bjmendelson hey , i wan na be follow more . ...
74933    @ anonymousmoi ye inde ! he could soooo get it...
47231                  @ allnight_alway lmaooooo . i win .
Name: SentimentText, dtype: object

In [70]:
# Feature extraction: bag-of-words by hand (disabled; CountVectorizer is used below instead)
# Problem noted in the original attempt: "can't find unique_words". The set was declared
# as `unique words` (with a space), and set.update(tweet) on a string adds single
# characters rather than words, so each tweet must be split first.
# unique_words = set()
# for tweet in X_train:
#     unique_words.update(tweet.split())
#
# len(unique_words)  # total number of features (words) in the bag-of-words

In [71]:
# Create a matrix of features, one column per unique word (disabled; see CountVectorizer below)
# Also check: SciPy offers a special sparse matrix class, which avoids allocating a
# dense len(X_train) x len(unique_words) array
# word_index = {word: i for i, word in enumerate(sorted(unique_words))}
# bow_matrix_train = np.zeros((len(X_train), len(word_index)))
# for i, tweet in enumerate(X_train):
#     for word in tweet.split():
#         bow_matrix_train[i, word_index[word]] += 1
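
The hand-built bag-of-words idea above can be sketched with a SciPy sparse matrix instead of a dense NumPy array. This is a minimal illustration on a toy list of already-stemmed strings, not the notebook's data. The names `docs`, `vocab`, and `bow` are assumptions made for the example.

```python
from collections import Counter

from scipy.sparse import csr_matrix

# Toy stand-in for X_train: stemmed tweets as plain strings (hypothetical data).
docs = ["i miss the new moon trailer", "omg it alreadi 7:30", "i miss it"]

# Vocabulary: every distinct whitespace-separated token mapped to a column index.
vocab = {
    word: col
    for col, word in enumerate(sorted({w for doc in docs for w in doc.split()}))
}

# Collect (row, column, count) triples, one per distinct word in each document.
rows, cols, counts = [], [], []
for row, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        rows.append(row)
        cols.append(vocab[word])
        counts.append(count)

# Only the non-zero counts are stored.
bow = csr_matrix((counts, (rows, cols)), shape=(len(docs), len(vocab)))
print(len(vocab), bow.shape)
```

`CountVectorizer.fit_transform` returns the same kind of sparse matrix, which is why the later cells call `.toarray()` only when displaying a few rows.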

In [85]:
# pd.DataFrame(X_train, columns = ['SentimentText'])

In [86]:
# CountVectorizer with options for stop words and max features to prepare the text data for the estimator

cvect = CountVectorizer(stop_words = 'english', max_features = 300)
X_train_vect = cvect.fit_transform(X_train) # create a sparse matrix
X_test_vect = cvect.transform(X_test)

# Inspect the first rows of the document-term matrix
pd.DataFrame(X_train_vect.toarray(), columns = cvect.get_feature_names_out()).head()


Unnamed: 0,10,abl,actual,add,agre,ah,alreadi,alway,amaz,amp,...,wrong,www,xx,ya,yay,ye,yea,yeah,year,yesterday
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [98]:
# Baseline pipeline: CountVectorizer features + LogisticRegression classifier
cvect_lgr_pipe = Pipeline([('cvect', CountVectorizer(stop_words = 'english', max_features = 1000)),
                       ('lgr', LogisticRegression())])
cvect_lgr_pipe.fit(X_train, y_train)

# Accuracy on the held-out split
test_acc = cvect_lgr_pipe.score(X_test, y_test) # value is 0.7348 for baseline
print(test_acc)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.7347787823025842
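
The warning above reports that the solver stopped at its iteration limit before converging. One common remedy, which the warning itself suggests, is to raise `max_iter` above its default of 100. The sketch below shows the change on a toy corpus. The texts, labels, and the value `1000` are illustrative assumptions, and the effect on the notebook's accuracy has not been measured here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for X_train / y_train (hypothetical data, not from the dataset).
texts = [
    "so sad today", "love this song", "i hate mondays",
    "great day yay", "feel sad and sick", "so happy love it",
]
labels = [0, 1, 0, 1, 0, 1]

pipe = Pipeline([
    ("cvect", CountVectorizer()),
    # max_iter=1000 is an assumed value; the scikit-learn default is 100.
    ("lgr", LogisticRegression(max_iter=1000)),
])
pipe.fit(texts, labels)

# n_iter_ holds the number of iterations the solver actually used.
print(pipe.named_steps["lgr"].n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver converged and the warning should no longer appear.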


In [97]:
# Pipeline and grid search: CountVectorizer features + LogisticRegression classifier
params = {'cvect__max_features': [500, 2000, 5000, 100000],
         'cvect__stop_words': ['english', None]}

cvect_lgr_pipe = Pipeline([('cvect', CountVectorizer()),
                       ('lgr', LogisticRegression())])
cvect_lgr_grid = GridSearchCV(cvect_lgr_pipe, param_grid=params)

start_time = time.time()

# Fit the grid search, then report accuracy on the held-out split
cvect_lgr_grid.fit(X_train, y_train) #fit-time: 77 sec
print(time.time()-start_time)
test_acc = cvect_lgr_grid.score(X_test, y_test) #0.7632
print(test_acc)
print(cvect_lgr_grid.best_params_)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

77.38721799850464
0.7632210576846148
{'cvect__max_features': 100000, 'cvect__stop_words': None}


In [100]:
# Pipeline and grid search: TfidfVectorizer features + LogisticRegression classifier
params_tfidf = {'tfidf__max_features': [500, 2000, 5000, 100000],
         'tfidf__stop_words': ['english', None]}

tfidf_lgr_pipe = Pipeline([('tfidf', TfidfVectorizer()),
                       ('lgr', LogisticRegression())])
tfidf_lgr_grid = GridSearchCV(tfidf_lgr_pipe, param_grid=params_tfidf)

start_time = time.time()
tfidf_lgr_grid.fit(X_train, y_train) #fit-time: 75 sec
print(time.time()-start_time)
test_acc = tfidf_lgr_grid.score(X_test, y_test) #0.7698
print(test_acc)
print(tfidf_lgr_grid.best_params_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

75.31152987480164
0.7697815825266021
{'tfidf__max_features': 100000, 'tfidf__stop_words': None}


In [92]:
# Pipeline: CountVectorizer features + DecisionTreeClassifier

dt_pipe = Pipeline([('cvect', CountVectorizer()),
                       ('dtreeCLF', DecisionTreeClassifier())])

start_time = time.time()
dt_pipe.fit(X_train, y_train) #fit-time: 44 sec
print(time.time()-start_time)
test_acc = dt_pipe.score(X_test, y_test) #0.6971
print(test_acc)

43.457740783691406
0.6970557644611569


In [105]:

# Pipeline and grid search: CountVectorizer features + DecisionTreeClassifier
# Takes too long to execute (fit time about 773 sec), so the code is disabled
'''
params = {'cvect__max_features': [500, 2000, 5000, 100000],
         'cvect__stop_words': ['english', None]}

dt_pipe = Pipeline([('cvect', CountVectorizer()),
                       ('dtreeCLF', DecisionTreeClassifier())])
dt_grid = GridSearchCV(dt_pipe, param_grid=params)

start_time = time.time()
dt_grid.fit(X_train, y_train) #fit-time: 773 sec
print(time.time()-start_time)
test_acc = dt_grid.score(X_test, y_test) #0.6979
print(test_acc)
print(dt_grid.best_params_)
'''

772.8848021030426
0.6978958316665334
{'cvect__max_features': 100000, 'cvect__stop_words': None}


In [101]:
# Pipeline and grid search: CountVectorizer features + MultinomialNB classifier
# MultinomialNB is configured explicitly: MultinomialNB(alpha=1.0, fit_prior=False)
params = {'cvect__max_features': [500, 2000, 5000, 100000],
         'cvect__stop_words': ['english', None]}

bayes_pipe = Pipeline([('cvect', CountVectorizer()),
                       ('bayes', MultinomialNB(alpha=1.0, fit_prior=False))])
bayes_grid = GridSearchCV(bayes_pipe, param_grid=params)

start_time = time.time()
bayes_grid.fit(X_train, y_train) #fit-time: 43 sec
print(time.time()-start_time)
test_acc = bayes_grid.score(X_test, y_test) #0.7558
print(test_acc)
print(bayes_grid.best_params_)

43.16154980659485
0.755820465637251
{'cvect__max_features': 100000, 'cvect__stop_words': None}


In [102]:
print(bayes_grid.cv_results_)

{'mean_fit_time': array([0.82943912, 0.83729682, 0.79076195, 0.84413724, 0.86531053,
       0.96744914, 0.90393586, 0.92326207]), 'std_fit_time': array([0.01617379, 0.01358041, 0.00705121, 0.00846749, 0.09296507,
       0.08561506, 0.0354676 , 0.01202188]), 'mean_score_time': array([0.15675368, 0.16151304, 0.15505333, 0.16541972, 0.1675283 ,
       0.2033165 , 0.18797879, 0.19457631]), 'std_score_time': array([0.00581112, 0.00039447, 0.00112944, 0.00266514, 0.00914764,
       0.03037278, 0.00555386, 0.00257599]), 'param_cvect__max_features': masked_array(data=[500, 500, 2000, 2000, 5000, 5000, 100000, 100000],
             mask=[False, False, False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_cvect__stop_words': masked_array(data=['english', None, 'english', None, 'english', None,
                   'english', None],
             mask=[False, False, False, False, False, False, False, False],
       fill_value='?',
            dtype=objec
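
The raw `cv_results_` dictionary printed above is hard to read. A small sketch of loading it into a DataFrame and keeping only a few columns is given below. It runs on a toy corpus with a reduced grid and `cv=2` so that it is self-contained. The data and grid values are assumptions, not the notebook's.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for X_train / y_train (hypothetical data).
texts = [
    "so sad for my friend", "missed the new trailer",
    "i hate rainy mondays", "feeling sick and tired",
    "love this great song", "yay sunny weekend ahead",
    "so happy with my results", "awesome concert last night",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

grid = GridSearchCV(
    Pipeline([("cvect", CountVectorizer()), ("bayes", MultinomialNB())]),
    param_grid={
        "cvect__max_features": [5, 50],
        "cvect__stop_words": ["english", None],
    },
    cv=2,  # small fold count only because the toy corpus is tiny
)
grid.fit(texts, labels)

# One row per parameter combination, best-ranked first.
cols = [
    "param_cvect__max_features",
    "param_cvect__stop_words",
    "mean_fit_time",
    "mean_test_score",
    "rank_test_score",
]
summary = pd.DataFrame(grid.cv_results_)[cols].sort_values("rank_test_score")
print(summary)
```

Applied to the notebook's grid, `pd.DataFrame(bayes_grid.cv_results_)` would give one row for each of the eight parameter combinations.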

In [106]:
# Summary of the CountVectorizer grid searches; best_score is accuracy on the held-out split
results_df = pd.DataFrame({'model': ['Logistic', 'Decision Tree', 'Bayes'], 
             'best_params': ['max_features: 100,000, stop_words: None', 'max_features: 100,000, stop_words: None', 'max_features: 100,000, stop_words: None'],
             'fit-time': ['77 seconds', '773 seconds', '43 seconds'],
             'best_score': ['0.7632', '0.6979', '0.7558']}).set_index('model')
results_df

model,best_params,fit-time,best_score
Logistic,"max_features: 100,000, stop_words: None",77 seconds,0.7632
Decision Tree,"max_features: 100,000, stop_words: None",773 seconds,0.6979
Bayes,"max_features: 100,000, stop_words: None",43 seconds,0.7558
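
The table above was typed in by hand from the earlier outputs. As an alternative, the same kind of table can be assembled directly from fitted `GridSearchCV` objects. The sketch below does this on a toy corpus with a reduced grid. The data, the grid values, `cv=2`, and the column names `mean_fit_time_s` and `best_cv_score` are assumptions for illustration. Note that `best_score_` is the mean cross-validated score, which is not the same quantity as the held-out accuracy reported above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for X_train / y_train (hypothetical data).
texts = [
    "so sad for my friend", "missed the new trailer",
    "i hate rainy mondays", "feeling sick and tired",
    "love this great song", "yay sunny weekend ahead",
    "so happy with my results", "awesome concert last night",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

estimators = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Bayes": MultinomialNB(),
}
params = {"cvect__max_features": [5, 50], "cvect__stop_words": ["english", None]}

rows = []
for name, clf in estimators.items():
    pipe = Pipeline([("cvect", CountVectorizer()), ("clf", clf)])
    grid = GridSearchCV(pipe, param_grid=params, cv=2)
    grid.fit(texts, labels)
    rows.append({
        "model": name,
        "best_params": grid.best_params_,
        # Mean fit time of the winning parameter combination, in seconds.
        "mean_fit_time_s": grid.cv_results_["mean_fit_time"][grid.best_index_],
        "best_cv_score": grid.best_score_,
    })

results = pd.DataFrame(rows).set_index("model")
print(results)
```

Building the table this way keeps the parameters, timings, and scores tied to the objects that produced them, so they cannot drift out of sync with a re-run.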


In [None]:
# Alternative solution using TweetTokenizer

In [108]:
# Text preprocessing and normalization: stemming with TweetTokenizer
# Issue: the result is a list of token lists (one list per tweet), not a Series of strings

def preprocess_tweets(texts):
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
    
    ret_text = []
    for tweet in texts:
        tokens = tokenizer.tokenize(tweet)
        ret_text.append([stemmer.stem(token) for token in tokens if token not in stop_words])
        
    return ret_text

start_time = time.time()
df_stemmed = preprocess_tweets(df['SentimentText']) # time taken: 107 sec
target_df_stemmed = preprocess_tweets(target_df['SentimentText'])
print(time.time()-start_time)
print(df_stemmed[0])


107.05752801895142
['sad', 'apl', 'friend', '...']


In [114]:
print(target_df_stemmed[100])

['pavel', 'tonight', '<', 'tigersfan', '>']


In [None]:
# Feature Extraction using CountVectorizer or TF–IDF

In [119]:
y = df['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(pd.Series(df_stemmed), y, random_state = 42)

In [122]:
X_train.head()

60132    [19, rdr, store, 49, new, first, time, ..., de...
94181                        [sooo, ..., wear, jean, ?, ?]
62234                   [hey, ,, wanna, follow, ., lol, !]
74933    [ye, inde, !, could, sooo, get, !, yum, yum, f...
47231                                  [lmaooo, ., win, .]
dtype: object

In [123]:
type(X_train)

pandas.core.series.Series

In [124]:
# CountVectorizer with options for stop words and max features to prepare the text data for the estimator
# Disabled: calling cvect.fit_transform(X_train) raises
#   AttributeError: 'list' object has no attribute 'lower'
# because each document here is a list of tokens, while CountVectorizer's default
# preprocessing expects a string and calls .lower() on it.
# cvect = CountVectorizer(stop_words = 'english', max_features = 1000)
# X_train_vect = cvect.fit_transform(X_train) # create a sparse matrix
# X_test_vect = cvect.transform(X_test)
#
# pd.DataFrame(X_train_vect.toarray(), columns = cvect.get_feature_names_out()).head()

In [None]:
# create a sparse matrix of features (unique words)
# Also check: SciPy offers a special sparse matrix class