# SENTIMENT ANALYSIS: GOOGLE AND APPLE

## Business understanding

Google and Apple are multinational technology companies well known for products such as Google Sheets (from Google) and the iPhone (from Apple). Both companies collect customer feedback in several ways, including in-app feedback and ratings and tweets from Twitter (now X). However, because the companies operate globally, reading through millions of pieces of feedback or tweets across multiple apps to gauge customer sentiment about a product is slow and tiresome. As a result, the companies want to build a model that can rate the sentiment of a tweet or other text based on its content. This will help them improve their products and services, raise customer satisfaction and attract more customers.

## Data understanding

The data used in this project was extracted from [data.world](https://data.world/crowdflower/brands-and-product-emotions). It contains tweets related to Google and Apple products that were labelled as negative, positive or neutral. The dataset contains over 9,000 tweets. In addition to the tweets and their labels, it has a column showing the product each tweet is directed at. This dataset will be of great help when building a model for our sentiment analysis.

### Data preparation

First, we will prepare the data with nltk and develop a model from the resulting data. This will act as our basic model. From there we will build models with RNNs using LSTMs and GRUs and pick the best performing model. The data preparation for the RNN models is different and will not incorporate nltk.

#### Importing the relevant libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet',quiet=True)
nltk.download('stopwords',quiet=True)  # needed by stopwords.words('english') below
%matplotlib inline

In [2]:
# Importing the dataset using pandas
raw_df = pd.read_csv('Data/judge-1377884607_tweet_product_company.csv',encoding='ISO-8859-1')
# Taking a look at the dataset
raw_df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


The dataset contains columns with really long names. We can start by renaming them to something shorter.

In [3]:
#Renaming the columns
raw_df.columns =['tweet','product','emotion']
raw_df.head()

Unnamed: 0,tweet,product,emotion
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


#### Exploring the dataset

In [4]:
#Taking a look at the emotion column
raw_df.emotion.value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: emotion, dtype: int64

We want to develop a model that can tell whether a tweet is positive, negative or neutral. The emotion column consists of four categories. The `I can't tell` category will be dropped since it is of no use to our model. One could consider merging this category into the no-emotion category, but the ambiguous labels might degrade the model in the long run.

In [5]:
# Removing the 'I can't tell' category from the emotion column
df_3cat = raw_df[raw_df['emotion'] != "I can't tell"]
# Checking the remaining categories in the emotion column
df_3cat.emotion.value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
Name: emotion, dtype: int64
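The counts above show a marked class imbalance. As a rough illustration (the counts are copied from the output above; the variable names are ours and not part of the notebook), the share of each class can be worked out like this:

```python
# Class counts copied from the value_counts() output above
counts = {
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
}

total = sum(counts.values())
# Share of each class, rounded to three decimals
shares = {label: round(n / total, 3) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

A classifier that always predicts the neutral class would therefore already be right roughly 60% of the time, which is a useful reference point for the accuracies reported later.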

In [6]:
# Taking a look at one of the tweets
df_3cat.tweet[0]

'.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.'

Since these are tweets from Twitter (now X), they contain hashtags and username tags (@), which carry little information about the sentiment of a tweet or text. These tags should be removed.
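One possible way to remove such tags is with regular expressions. The sketch below is an assumption on our part and not what the cells in this notebook do; `strip_twitter_tags` is a hypothetical helper that deletes `@username` mentions and keeps the word of a hashtag while dropping the `#` symbol.

```python
import re

def strip_twitter_tags(tweet):
    """Remove @username mentions and the '#' symbol of hashtags (hypothetical helper)."""
    # Drop whole @mentions such as '@wesley83'
    no_mentions = re.sub(r"@\w+", "", tweet)
    # Keep the hashtag word but drop the leading '#'
    no_hash = re.sub(r"#(\w+)", r"\1", no_mentions)
    # Collapse the whitespace left behind
    return re.sub(r"\s+", " ", no_hash).strip()

example = ".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!"
cleaned = strip_twitter_tags(example)
print(cleaned)
```

Keeping the hashtag word is a design choice: tags such as `#fail` can themselves carry sentiment, so only the symbol is discarded here.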

#### Dealing with missing values

In [7]:
# Checking for missing values
df_3cat.isna().sum()

tweet         1
product    5655
emotion       0
dtype: int64

The `tweet` column has only one missing value, while more than half of the observations in the `product` column are missing. The `product` column can be dropped since we only need the other two columns to build a model for sentiment analysis. The row containing the missing tweet will be dropped too.

In [8]:
# Dropping the product column
df_2col = df_3cat.drop('product',axis=1)
# Dropping the row containing the missing tweet
df_2col.dropna(inplace=True)
# Checking for missing values
df_2col.isna().sum()
# Resetting the index of the dataframe
df_2col.reset_index(drop=True,inplace=True)

In [9]:
#Instantiating a RegexpTokenizer that will include words with apostrophes
tokenizer = RegexpTokenizer(r"\b\w+(?:'\w+)?\b")
# Creating a list of stopwords that also excludes single digits and the sxsw tag
stopwords_list = stopwords.words('english')+['sxsw']+['0','1','2','3','4','5','6','7','8','9']
# Creating an instance of WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
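To see what the tokenizer pattern above matches, it can be tried directly with Python's `re` module. This is only an illustration: to our understanding `RegexpTokenizer` applies the pattern in much the same way as `re.findall`, and the sample sentence is made up from one of the tweets.

```python
import re

# Same pattern as the one passed to RegexpTokenizer above
pattern = r"\b\w+(?:'\w+)?\b"

sample = "i hope this year's festival isn't as crashy"
tokens = re.findall(pattern, sample)
print(tokens)
# ['i', 'hope', 'this', "year's", 'festival', "isn't", 'as', 'crashy']
```

The optional group `(?:'\w+)?` is what keeps `year's` and `isn't` as single tokens instead of splitting them at the apostrophe.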

In [10]:
#Creating a function that will produce the appropriate tokens
def word_preprocessor(text,tokenizer,stopwords_list,lemmatizer):
    # lowercasing the text
    low = text.lower()
    # tokenizing the text
    tokens = tokenizer.tokenize(low)
    # removing stopwords from the tokens
    no_stopwords_list = [word for word in tokens if word not in stopwords_list]
    # performing lemmatization
    # the first remaining token is dropped on the assumption that it is a username tag;
    # when a tweet does not start with one, this discards a real word
    preprocessed_text = [lemmatizer.lemmatize(word) for word in no_stopwords_list[1:]]
    return preprocessed_text

In [11]:
# Checking to see if the function works
word_preprocessor( df_2col.tweet[0],tokenizer,stopwords_list,lemmatizer)

['3g',
 'iphone',
 'hr',
 'tweeting',
 'rise_austin',
 'dead',
 'need',
 'upgrade',
 'plugin',
 'station']

In [12]:
# Applying the function to every tweet and storing the result in a new column
df_2col['preprocessed'] = df_2col.tweet.apply(
    lambda x: word_preprocessor(x,tokenizer,stopwords_list,lemmatizer))

In [13]:
# Taking a look at the new dataframe
df_2col.head()

Unnamed: 0,tweet,emotion,preprocessed
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,Negative emotion,"[3g, iphone, hr, tweeting, rise_austin, dead, ..."
1,@jessedee Know about @fludapp ? Awesome iPad/i...,Positive emotion,"[know, fludapp, awesome, ipad, iphone, app, li..."
2,@swonderlin Can not wait for #iPad 2 also. The...,Positive emotion,"[wait, ipad, also, sale]"
3,@sxsw I hope this year's festival isn't as cra...,Negative emotion,"[year's, festival, crashy, year's, iphone, app]"
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Positive emotion,"[great, stuff, fri, marissa, mayer, google, ti..."


Next we will create a rank column to contain integers as follows:
- Positive emotion = 1
- Negative emotion = -1
- No emotion toward brand or product = 0

In [29]:
# Creating a function to encode the categories to integers
def encoder(text):
    if text == 'Positive emotion':
        return 1
    if text == 'Negative emotion':
        return -1
    return 0

In [30]:
#Creating a new column with the rankings
df_2col['rank'] = df_2col.emotion.apply( lambda x: encoder(x) )
df_2col.head()

Unnamed: 0,tweet,emotion,preprocessed,rank,joined_text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,Negative emotion,"[3g, iphone, hr, tweeting, rise_austin, dead, ...",-1,3g iphone hr tweeting rise_austin dead need up...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,Positive emotion,"[know, fludapp, awesome, ipad, iphone, app, li...",1,know fludapp awesome ipad iphone app likely ap...
2,@swonderlin Can not wait for #iPad 2 also. The...,Positive emotion,"[wait, ipad, also, sale]",1,wait ipad also sale
3,@sxsw I hope this year's festival isn't as cra...,Negative emotion,"[year's, festival, crashy, year's, iphone, app]",-1,year's festival crashy year's iphone app
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Positive emotion,"[great, stuff, fri, marissa, mayer, google, ti...",1,great stuff fri marissa mayer google tim o'rei...


Next we are going to join the tokens in the `preprocessed` column into a single string per row to make them compatible with sklearn's CountVectorizer and TfidfVectorizer.

In [31]:
df_2col['joined_text'] = df_2col['preprocessed'].str.join(" ")
df_2col.head()

Unnamed: 0,tweet,emotion,preprocessed,rank,joined_text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,Negative emotion,"[3g, iphone, hr, tweeting, rise_austin, dead, ...",-1,3g iphone hr tweeting rise_austin dead need up...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,Positive emotion,"[know, fludapp, awesome, ipad, iphone, app, li...",1,know fludapp awesome ipad iphone app likely ap...
2,@swonderlin Can not wait for #iPad 2 also. The...,Positive emotion,"[wait, ipad, also, sale]",1,wait ipad also sale
3,@sxsw I hope this year's festival isn't as cra...,Negative emotion,"[year's, festival, crashy, year's, iphone, app]",-1,year's festival crashy year's iphone app
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Positive emotion,"[great, stuff, fri, marissa, mayer, google, ti...",1,great stuff fri marissa mayer google tim o'rei...


### Splitting the dataset

The dataset will be split into a training set and a test set. For the RNN models, a validation set is additionally held out from the training set through the `validation_split` parameter of `fit`.
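In Keras, the `validation_split` argument of `fit` holds out the last fraction of the training samples, taken before any shuffling. The sketch below mimics that behaviour on a plain list; `split_last_fraction` is an illustrative helper of ours, not a Keras function.

```python
def split_last_fraction(samples, validation_split):
    """Hold out the last fraction of the samples for validation (illustrative helper)."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

train_part, val_part = split_last_fraction(list(range(10)), validation_split=0.2)
print(train_part)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_part)    # [8, 9]
```

Because the held-out rows are simply the tail of the data, it helps that `train_test_split` has already shuffled the tweets before they reach the network.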

In [32]:
# Defining the inputs and targets
X= df_2col['joined_text']
y= df_2col['rank']

In [33]:
#Importing the relevant libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [34]:
#Splitting the dataset with a test_size of 0.2 and random_state of 42
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.2,random_state=42)

### Word vectorization using CountVectorizer

In [35]:
# Instantiating CountVectorizer object
count_vectorizer =CountVectorizer()
# fitting the vectorizer on the train set
count_vectorizer.fit(X_train)
# transforming the train and test sets
X_train_vectorized = count_vectorizer.transform(X_train)
X_test_vectorized = count_vectorizer.transform(X_test)
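As a simplified illustration of what count vectorization computes, the sketch below builds a vocabulary from a tiny made-up training corpus and turns a new document into a vector of word counts. It uses plain whitespace splitting, so it is not identical to CountVectorizer's own tokenization; the documents and the helper name `to_count_vector` are assumptions for the example.

```python
from collections import Counter

# Tiny made-up corpus standing in for the 'joined_text' column
train_docs = ["iphone dead need upgrade", "great ipad app", "ipad sale great great"]

# "fit": learn a sorted vocabulary from the training documents only
vocabulary = sorted({word for doc in train_docs for word in doc.split()})

def to_count_vector(doc):
    """Count how often each vocabulary word occurs; unseen words are ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

print(vocabulary)
vector = to_count_vector("great ipad great launch")
print(vector)  # [0, 0, 2, 1, 0, 0, 0, 0]
```

The word `launch` never appeared in the training documents, so it contributes nothing to the vector. This is why the vectorizer is fitted on the train set only and then merely applied to the test set.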

### Modelling

#### Building a baseline model 

We will build a baseline model using the outputs of the CountVectorizer. The baseline model will be a decision tree classifier.

In [36]:
# Importing the relevant libraries
from sklearn.tree import DecisionTreeClassifier
# instantiating the DecisionTreeClassifier with certain parameters
tree_clf = DecisionTreeClassifier(criterion='entropy',min_samples_split=300,random_state=42,max_depth=15)
# fitting the model
tree_clf.fit(X_train_vectorized,y_train)

In [37]:
# making predictions for the test set
y_test_pred = tree_clf.predict(X_test_vectorized)
# creating a classification report
from sklearn.metrics import classification_report
print(classification_report(y_test,y_test_pred))


              precision    recall  f1-score   support

          -1       0.57      0.10      0.16       126
           0       0.65      0.92      0.76      1094
           1       0.60      0.22      0.33       568

    accuracy                           0.64      1788
   macro avg       0.61      0.41      0.42      1788
weighted avg       0.63      0.64      0.58      1788



Our baseline model has an accuracy of 64%. Next we will try to build models with better accuracy.
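Accuracy alone hides how differently the classes are treated, so it is worth checking how the other columns of the report relate to each other. The sketch below recomputes two of the reported figures from the precision and recall values printed above; since those inputs are already rounded to two decimals, not every figure can be reproduced exactly this way.

```python
# Per-class precision and recall copied from the classification report above
report = {
    -1: {"precision": 0.57, "recall": 0.10},
    0: {"precision": 0.65, "recall": 0.92},
    1: {"precision": 0.60, "recall": 0.22},
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# F1 of the neutral class (label 0)
f1_neutral = f1_score(**report[0])
# The macro average is the unweighted mean over the three classes
macro_recall = sum(row["recall"] for row in report.values()) / len(report)

print(round(f1_neutral, 2))    # 0.76
print(round(macro_recall, 2))  # 0.41
```

The macro-averaged recall of 0.41 is far below the 64% accuracy because the tree recovers few of the negative and positive tweets and mostly predicts the neutral class.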

#### Building a second model using tf-idf vectorization

Tf-idf vectorization tends to give higher weights to words that have a low document frequency and a high term frequency. We will perform vectorization using tf-idf to see if it will improve the model.
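To make the weighting concrete, here is a minimal pure-Python sketch. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is our understanding of TfidfVectorizer's default settings; the three documents are made up.

```python
import math
from collections import Counter

docs = ["great ipad app", "great iphone", "ipad crash crash"]
n_docs = len(docs)

# Document frequency: in how many documents each word appears
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

def idf(word):
    # Smoothed inverse document frequency (assumed default form)
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(doc):
    """Return L2-normalised tf-idf weights for one document as a dict."""
    term_freq = Counter(doc.split())
    raw = {word: count * idf(word) for word, count in term_freq.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf("ipad crash crash")
print(weights)
```

In the last document `crash` ends up with a larger weight than `ipad`: it occurs twice in that document (high term frequency) and in no other document (low document frequency).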

#### Vectorization using TfidfVectorizer

In [38]:
# Instantiating TfidfVectorizer object
tf_idf_vectorizer = TfidfVectorizer()
# fitting the vectorizer on the train set
tf_idf_vectorizer.fit(X_train)
# transforming the train and test sets
X_train_tfvectorized = tf_idf_vectorizer.transform(X_train)
X_test_tfvectorized = tf_idf_vectorizer.transform(X_test)

In [39]:
# Building a model using the outputs of the TfidfVectorizer
# Instantiating a decision tree classifier
tf_tree_clf = DecisionTreeClassifier(criterion='entropy',
                                     min_samples_split=300,random_state=42,max_depth=15)
tf_tree_clf.fit(X_train_tfvectorized,y_train)

In [40]:
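# making predictions for the test set with the tf-idf model
# (the original cell called tree_clf, the model fit on count vectors)
y_test_pred = tf_tree_clf.predict(X_test_tfvectorized)
# creating a classification report
print(classification_report(y_test,y_test_pred))

              precision    recall  f1-score   support

          -1       0.50      0.01      0.02       126
           0       0.61      1.00      0.76      1094
           1       0.67      0.00      0.01       568

    accuracy                           0.61      1788
   macro avg       0.59      0.34      0.26      1788
weighted avg       0.62      0.61      0.47      1788



With the same parameters, this report shows poorer performance than the baseline model. However, the figures above were produced with `tree_clf` (the model fit on count vectors) applied to the tf-idf features, so they do not measure `tf_tree_clf` and should be regenerated before drawing conclusions. Next we will try to build models using Recurrent Neural Networks.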

### Recurrent Neural Networks (with LSTMs)

In order to build RNN models we need to one-hot encode the `rank` column using `to_categorical`, a utility function found in the Keras library.
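One-hot encoding turns each integer class label into a row with a single 1 at the position of that class. The sketch below is an illustrative stand-in for `to_categorical`, not its implementation; it also shows why the labels need to be non-negative indices, which is the reason the `rank` column is re-coded to 0, 1 and 2 further down.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows (illustrative stand-in)."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        # The label is used as a position, so it must be a non-negative index
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([1, 2, 0], num_classes=3)
print(encoded)  # [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```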

#### One hot encoding the rank column

In [26]:
# Importing the relevant libraries
from keras.utils import to_categorical

In [41]:
## Creating a function to encode the categories to integers
def encoder(text):
    if text == 'Positive emotion':
        return 2
    if text == 'Negative emotion':
        return 1 # since to_categorical is designed to work with non_negative integers
    else :
        return 0
#Creating a new column with the rankings
df_2col['rank'] = df_2col.emotion.apply( lambda x: encoder(x) )
df_2col.head()

Unnamed: 0,tweet,emotion,preprocessed,rank,joined_text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,Negative emotion,"[3g, iphone, hr, tweeting, rise_austin, dead, ...",1,3g iphone hr tweeting rise_austin dead need up...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,Positive emotion,"[know, fludapp, awesome, ipad, iphone, app, li...",2,know fludapp awesome ipad iphone app likely ap...
2,@swonderlin Can not wait for #iPad 2 also. The...,Positive emotion,"[wait, ipad, also, sale]",2,wait ipad also sale
3,@sxsw I hope this year's festival isn't as cra...,Negative emotion,"[year's, festival, crashy, year's, iphone, app]",1,year's festival crashy year's iphone app
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Positive emotion,"[great, stuff, fri, marissa, mayer, google, ti...",2,great stuff fri marissa mayer google tim o'rei...


In [50]:
#Splitting the dataset with a test_size of 0.2 and random_state of 42
X_train,X_test,y_train,y_test = train_test_split(df_2col['tweet'],df_2col['rank'],
                                                 test_size=.2,random_state=42)

In [51]:
# One hot encoding the rank column in all the sets
y_train_labels = to_categorical(y_train)
y_test_labels = to_categorical(y_test)

In [52]:
# Importing the relevant libraries for building RNNs
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, LSTM, Embedding,GRU
from keras.layers import Dropout, Activation, Bidirectional, GlobalMaxPool1D
from keras.models import Sequential
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.preprocessing import text, sequence

In [53]:
# Preprocessing the dataset for use by keras
tokenizer = text.Tokenizer(num_words=20000)
tokenizer.fit_on_texts(list(X_train))
list_tokenized_headlines = tokenizer.texts_to_sequences(X_train)
X_t = sequence.pad_sequences(list_tokenized_headlines, maxlen=100)
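Padding makes every tokenized tweet the same length so that the tweets can be stacked into one array. By default `pad_sequences` pads and truncates at the start of each sequence; the helper below (`pad_pre`, a name of ours) sketches that behaviour on plain lists.

```python
def pad_pre(sequence_ids, maxlen, value=0):
    """Left-pad (or left-truncate) a list of token ids to exactly maxlen entries."""
    trimmed = sequence_ids[-maxlen:]  # keep the last maxlen ids
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([5, 12, 7], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6], maxlen=5)
print(short)  # [0, 0, 5, 12, 7]
print(long)   # [2, 3, 4, 5, 6]
```

With `maxlen=100`, far more than the number of words in a typical tweet, most rows of `X_t` will consist largely of leading zeros.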

In [54]:
# Instantiating the sequential model
model = Sequential()
# Adding an embedding layer
embedding_size = 70
model.add(Embedding(20000, embedding_size))
# Adding an LSTM layer
model.add(LSTM(25, return_sequences=True))
model.add(GlobalMaxPool1D())
# Performing dropout regularization
model.add(Dropout(0.5))
# Adding the output layer
model.add(Dense(3, activation='softmax'))
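As a back-of-the-envelope check of the model's size, the trainable parameters of each layer can be counted by hand. The arithmetic below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias); the pooling and dropout layers have no parameters.

```python
vocab_size, embedding_size, lstm_units, n_classes = 20000, 70, 25, 3

# One embedding vector per word index
embedding_params = vocab_size * embedding_size
# Four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embedding_size + lstm_units) + lstm_units)
# Fully connected output layer: weights plus biases
dense_params = lstm_units * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 1400000 9600 78 1409678
```

Almost all of the parameters sit in the embedding layer, which is worth keeping in mind given that the training set has only a few thousand tweets. Calling `model.summary()` should show the same per-layer counts if the assumption holds.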

In [55]:
# Compiling the model
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

In [56]:
# Importing the relevant packages
from keras.callbacks import EarlyStopping,ModelCheckpoint
# Instantiating callbacks
call_backs = [EarlyStopping(monitor='val_loss', patience=7),
             ModelCheckpoint("best_model.h5", monitor='val_loss', save_best_only=True)]
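With `patience=7`, early stopping ends training once `val_loss` has failed to improve for seven consecutive epochs. The simulation below sketches that rule with made-up loss values (they are not taken from this notebook's run), and `early_stopping_epoch` is a hypothetical helper rather than part of Keras.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Made-up validation losses: improving for 6 epochs, then getting worse
hypothetical_val_losses = [0.95, 0.88, 0.83, 0.80, 0.78, 0.77] + [
    0.78 + 0.01 * i for i in range(14)
]

stopped_at = early_stopping_epoch(hypothetical_val_losses, patience=7)
print(stopped_at)  # 13
```

In this made-up run the best epoch is the sixth, and that is the state a `ModelCheckpoint` with `save_best_only=True` would keep on disk even though training continues for seven more epochs.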

In [57]:
model.fit(X_t, y_train_labels, epochs=20, batch_size=300, validation_split=0.2,callbacks=call_backs)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20


<keras.src.callbacks.History at 0x254d576d870>

From the training log we can tell that the saved model (the checkpoint with the lowest validation loss) has a validation accuracy of 69%, which is an improvement on our previous models.

Next we will use a different model on the tf-idf vectors.

#### Random Forest using tf-idf vectors

In [58]:
# Importing the relevant packages
from sklearn.ensemble import RandomForestClassifier
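# Reusing the tf-idf features; y_train now holds the 0/1/2 labels from the second split,
# whose rows line up with the first split because both use random_state=42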

# Instantiating a random forest 
forest_clf = RandomForestClassifier(n_estimators=100,criterion='entropy',
                                     min_samples_split=300,random_state=42,max_depth=15)
# Fitting the model on the dataset
forest_clf.fit(X_train_tfvectorized,y_train)

In [59]:
# making predictions for the test set
y_test_pred = forest_clf.predict(X_test_tfvectorized)
# creating a classification report
print(classification_report(y_test,y_test_pred))

              precision    recall  f1-score   support

           0       0.61      1.00      0.76      1094
           1       0.50      0.01      0.02       126
           2       1.00      0.01      0.03       568

    accuracy                           0.62      1788
   macro avg       0.70      0.34      0.27      1788
weighted avg       0.73      0.62      0.48      1788



The model has an overall accuracy of 62%, which is lower than that of the RNN model.

### Evaluation

Of the four models that have been built, the RNN performed best, with an accuracy of 69% on the validation set and 75% on the training set. The modest gap between the two suggests that the model can generalize to unseen data. This is also the model with the lowest loss.
Given that this is a multiclass classifier, an accuracy of 69% is encouraging.

#### Evaluating on the test set

In [63]:
# Loading the saved model
# Importing the relevant libraries
from keras.models import load_model
loaded_model = load_model('best_model.h5')

In [64]:
# Preprocessing the test data
list_tokenized_headlines = tokenizer.texts_to_sequences(X_test)
X_test = sequence.pad_sequences(list_tokenized_headlines, maxlen=100)

In [70]:
# Evaluating the model on the test set
loaded_model.evaluate(X_test,y_test_labels)



[0.7703506946563721, 0.6739373803138733]

The model has an accuracy of 67% on the test set. I would recommend using this model to tackle the sentiment analysis problem since it generalizes to unseen data without losing much accuracy. In addition, an accuracy of 67% on the test set is still higher than the 64% accuracy of the baseline model.
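To use the model for rating new tweets, its softmax output has to be mapped back to a sentiment label. The sketch below assumes the 0/1/2 coding introduced for the RNN labels; the probabilities are made up and `decode_prediction` is a hypothetical helper.

```python
# Mapping that mirrors the second encoder in this notebook
index_to_label = {
    0: "No emotion toward brand or product",
    1: "Negative emotion",
    2: "Positive emotion",
}

def decode_prediction(probabilities):
    """Pick the class with the highest softmax probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return index_to_label[best_index]

# Made-up softmax output for a single tweet
predicted = decode_prediction([0.15, 0.05, 0.80])
print(predicted)  # Positive emotion
```

In practice the probabilities would come from `loaded_model.predict` applied to a tweet that has been tokenized and padded in the same way as the training data.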