# SENTIMENT ANALYSIS: GOOGLE AND APPLE

## Business understanding

Google and Apple are multinational technology companies well known for products such as Google Sheets (from Google) and the iPhone (from Apple). Both companies collect customer feedback through several channels, such as in-app feedback and ratings and tweets from Twitter (now X). However, because the companies operate globally, reading through millions of feedback messages and tweets from multiple apps to gauge the customers' sentiment about a product is difficult and tiresome. As a result, the companies want to build a model that can rate the sentiment of a tweet or text based on its content. This will enable them to improve their products and services, raise customer satisfaction and attract more customers.

## Data understanding

The data used in this project was extracted from [data.world](https://data.world/crowdflower/brands-and-product-emotions). It contains tweets related to Google and Apple products, each labeled as negative, positive or neutral. The dataset contains over 9000 tweets. In addition to the tweets and their labels, the dataset contains a column that shows the product each tweet is directed at. This dataset is the basis for the sentiment analysis models built in this project.

### Data preparation

First, we will prepare the data with nltk and develop a model from the resulting data. This will act as our baseline model. From there, we will build RNN models using LSTMs and GRUs and pick the best performing model. The data preparation for the RNN models is different and will not incorporate nltk.

#### Importing the relevant libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

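# Downloading the NLTK resources used below (the stopwords list and WordNet for lemmatization)
nltk.download('stopwords',quiet=True)
nltk.download('wordnet',quiet=True)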
%matplotlib inline

In [2]:
# Importing the dataset using pandas
raw_df = pd.read_csv('Data/judge-1377884607_tweet_product_company.csv',encoding='ISO-8859-1')
# Taking a look at the dataset
raw_df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


The dataset contains columns with really long names. We can start by renaming the columns to shorter names.

In [3]:
#Renaming the columns
raw_df.columns =['tweet','product','emotion']
raw_df.head()

Unnamed: 0,tweet,product,emotion
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


#### Exploring the dataset

In [4]:
#Taking a look at the emotion column
raw_df.emotion.value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: emotion, dtype: int64

We want to develop a model that can tell whether a tweet is positive, negative or neutral. The emotion column consists of four categories. The `I can't tell` category will be dropped since it carries no usable label for our model. One could consider merging this category into the no-emotion category, but that would add ambiguous examples to the neutral class and might hurt the model in the long run.

In [5]:
# Removing the 'I can't tell' category from the emotion column
df_3cat = raw_df[raw_df['emotion'] != "I can't tell"]
# Checking the remaining categories in the emotion column
df_3cat.emotion.value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
Name: emotion, dtype: int64
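
The counts above are clearly imbalanced. As a rough reference point, the sketch below computes each class's share from those counts, along with the accuracy a trivial model would reach by always predicting the most common class. The variable names are illustrative and not part of the original notebook, and the counts are taken as printed above, before any further cleaning.

```python
# Class counts copied from the value_counts() output above
counts = {
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
}
total = sum(counts.values())

# Share of each class in the three-category dataset
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label:<36}{share:.3f}")

# Accuracy of a trivial model that always predicts the most common class
majority_baseline = max(shares.values())
```

Any model built later should be compared against this majority-class share rather than against zero.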

In [6]:
# Taking a look at one of the tweets
df_3cat.tweet[0]

'.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.'

Since these are tweets from Twitter (now X), they contain hashtags and username tags (@), which add little value when determining the sentiment of a tweet or text. These tags should be removed.
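
One possible way to remove such tags is with regular expressions. The sketch below is a minimal illustration and not the approach used later in this notebook: `strip_tags` is a hypothetical helper, and keeping the word that follows `#` (rather than deleting the whole hashtag) is an assumption.

```python
import re

# Hypothetical helper, not part of the original notebook
MENTION_RE = re.compile(r"@\w+")   # @username tags
HASH_RE = re.compile(r"#(\w+)")    # hashtags; the word after '#' is kept

def strip_tags(text):
    """Remove @mentions and the '#' symbol, keeping the hashtag words."""
    no_mentions = MENTION_RE.sub("", text)
    no_hashes = HASH_RE.sub(r"\1", no_mentions)
    # Collapse the whitespace left behind by the removals
    return " ".join(no_hashes.split())

sample = ".@wesley83 I have a 3G iPhone. Plugin stations at #SXSW."
cleaned = strip_tags(sample)
print(cleaned)
```

Unlike dropping tokens by position, this removes a username wherever it appears in the tweet.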

#### Dealing with missing values

In [7]:
# Checking for missing values
df_3cat.isna().sum()

tweet         1
product    5655
emotion       0
dtype: int64

The `tweet` column has only one missing value, while more than half of the observations in the `product` column are missing. The `product` column can be dropped since we only need the other two columns to build a model for sentiment analysis. The row containing the missing tweet will be dropped too.

In [8]:
# Dropping the product column
df_2col = df_3cat.drop('product',axis=1)
# Dropping the row containing the missing tweet
df_2col.dropna(inplace=True)
# Checking for missing values
df_2col.isna().sum()
# Resetting the index of the dataframe
df_2col.reset_index(drop=True,inplace=True)

In [9]:
#Instantiating a RegexpTokenizer that will include words with apostrophes
tokenizer = RegexpTokenizer(r"\b\w+(?:'\w+)?\b")
# Creating a list of stopwords that also excludes single digits and the sxsw tag
stopwords_list = stopwords.words('english')+['sxsw']+['0','1','2','3','4','5','6','7','8','9']
# Creating an instance of WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
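
To see what the tokenizer pattern keeps, it can be tried with Python's built-in `re` module. This is only a sketch: it assumes `re.findall` returns the same matches as the tokenizer for this pattern, and the sample sentence is made up for illustration.

```python
import re

# Same pattern as the RegexpTokenizer above
pattern = r"\b\w+(?:'\w+)?\b"

sample = "this year's festival isn't as crashy"   # made-up sample text
tokens = re.findall(pattern, sample)
print(tokens)
```

Words with an apostrophe, such as `year's` and `isn't`, stay as single tokens instead of being split at the apostrophe.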

In [10]:
#Creating a function that will produce the appropriate tokens
def word_preprocessor(text,tokenizer,stopwords_list,lemmatizer):
    # converting the text to lowercase
    low = text.lower()
    # tokenizing the text
    tokens = tokenizer.tokenize(low)
    # removing stopwords from the tokens
    no_stopwords_list = [word for word in tokens if word not in stopwords_list]
    # performing lemmatization
    # the first remaining token is dropped because it is usually a leading @username;
    # note that when the tweet does not start with a username (or starts with the
    # stopword 'sxsw'), this drops a real word instead
    preprocessed_text = [lemmatizer.lemmatize(word) for word in no_stopwords_list[1:]]
    return preprocessed_text

In [11]:
# Checking to see if the function works
word_preprocessor( df_2col.tweet[0],tokenizer,stopwords_list,lemmatizer)

['3g',
 'iphone',
 'hr',
 'tweeting',
 'rise_austin',
 'dead',
 'need',
 'upgrade',
 'plugin',
 'station']

In [12]:
# Mapping the function to the dataset and creating a new column with the preprocessed tweets
df_2col['preprocessed'] = df_2col.tweet.apply(
    lambda x: word_preprocessor(x,tokenizer,stopwords_list,lemmatizer))

In [13]:
# Taking a look at the new dataframe
df_2col.head()

Unnamed: 0,tweet,emotion,preprocessed
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,Negative emotion,"[3g, iphone, hr, tweeting, rise_austin, dead, ..."
1,@jessedee Know about @fludapp ? Awesome iPad/i...,Positive emotion,"[know, fludapp, awesome, ipad, iphone, app, li..."
2,@swonderlin Can not wait for #iPad 2 also. The...,Positive emotion,"[wait, ipad, also, sale]"
3,@sxsw I hope this year's festival isn't as cra...,Negative emotion,"[year's, festival, crashy, year's, iphone, app]"
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Positive emotion,"[great, stuff, fri, marissa, mayer, google, ti..."


Next we will change the emotion column to contain integers:
- Positive emotion = 1
- Negative emotion = -1
- No emotion toward brand or product = 0

In [14]:
# Creating a function to encode the categories to integers
def encoder(text):
    if text == 'Positive emotion':
        return 1
    elif text == 'Negative emotion':
        return -1
    else:
        return 0

In [15]:
# Mapping the function to the emotion column
df_2col['emotion'] = df_2col.emotion.apply( lambda x: encoder(x) )
df_2col.head()

Unnamed: 0,tweet,emotion,preprocessed
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,-1,"[3g, iphone, hr, tweeting, rise_austin, dead, ..."
1,@jessedee Know about @fludapp ? Awesome iPad/i...,1,"[know, fludapp, awesome, ipad, iphone, app, li..."
2,@swonderlin Can not wait for #iPad 2 also. The...,1,"[wait, ipad, also, sale]"
3,@sxsw I hope this year's festival isn't as cra...,-1,"[year's, festival, crashy, year's, iphone, app]"
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,1,"[great, stuff, fri, marissa, mayer, google, ti..."
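
The same encoding can also be written as an explicit dictionary. The sketch below is an illustrative alternative, not the notebook's method: `LABEL_TO_INT` and `encode_label` are hypothetical names. The difference is that an unexpected label raises an error instead of silently becoming 0.

```python
# Explicit label-to-integer mapping (illustrative alternative to the if/else encoder)
LABEL_TO_INT = {
    "Positive emotion": 1,
    "Negative emotion": -1,
    "No emotion toward brand or product": 0,
}

def encode_label(label):
    """Return the integer code for a label, raising KeyError on unknown labels."""
    return LABEL_TO_INT[label]

labels = ["Negative emotion", "Positive emotion", "No emotion toward brand or product"]
encoded = [encode_label(label) for label in labels]
print(encoded)
```

With pandas, the same dictionary could be passed to `Series.map` to encode the whole column at once.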


Next, we are going to join the tokens in the preprocessed column into a single string per row, to make them compatible with sklearn's CountVectorizer and TfidfVectorizer.

In [17]:
df_2col['joined_text'] = df_2col['preprocessed'].str.join(" ")
df_2col.head()

Unnamed: 0,tweet,emotion,preprocessed,joined_text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,-1,"[3g, iphone, hr, tweeting, rise_austin, dead, ...",3g iphone hr tweeting rise_austin dead need up...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,1,"[know, fludapp, awesome, ipad, iphone, app, li...",know fludapp awesome ipad iphone app likely ap...
2,@swonderlin Can not wait for #iPad 2 also. The...,1,"[wait, ipad, also, sale]",wait ipad also sale
3,@sxsw I hope this year's festival isn't as cra...,-1,"[year's, festival, crashy, year's, iphone, app]",year's festival crashy year's iphone app
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,1,"[great, stuff, fri, marissa, mayer, google, ti...",great stuff fri marissa mayer google tim o'rei...


### Splitting the dataset

Since we will develop neural networks in the end, the dataset will be split into a train set, a validation set and a test set.

In [25]:
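# Defining the inputs and targets
# X is selected as a Series of strings, since the vectorizers expect an iterable of documents
# (iterating over a DataFrame would yield its column names instead of the tweets)
X = df_2col['joined_text']
y = df_2col['emotion']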

In [26]:
#Importing the relevant libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [27]:
#Splitting the dataset with a test_size of 0.2 and random_state of 42
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.2,random_state=42)
# Creating validation sets for use in developing neural network models
X_train,X_val,y_train,y_val = train_test_split(X_train,y_train,test_size=1000,random_state=42)
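
The resulting set sizes can be worked out by hand. The sketch below assumes 8936 rows (the 5389 + 2978 + 570 labeled tweets shown earlier, minus the one row with a missing tweet) and assumes that a fractional test size is rounded up, as scikit-learn does.

```python
import math

n_rows = 8936            # assumed: 8937 labeled tweets minus 1 missing tweet
test_fraction = 0.2      # test_size used in the first split
n_val = 1000             # test_size used in the second split (absolute count)

# First split: the fractional test size is rounded up
n_test = math.ceil(test_fraction * n_rows)
n_train_full = n_rows - n_test

# Second split: a fixed number of validation rows is taken from the training rows
n_train = n_train_full - n_val
print(n_train, n_val, n_test)
```

Under these assumptions, roughly 69% of the rows remain for training, 11% go to validation and 20% to testing.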

### Word vectorization using CountVectorizer

In [28]:
# Instantiating CountVectorizer object
count_vectorizer = CountVectorizer()
# fitting the vectorizer on the train set
count_vectorizer.fit(X_train)
# transforming the train and test sets
X_train_vectorized = count_vectorizer.transform(X_train)
X_test_vectorized = count_vectorizer.transform(X_test)
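
Conceptually, count vectorization learns a vocabulary from the training documents and then represents each document by its word counts. The sketch below is a simplified pure-Python illustration of that idea, not sklearn's implementation (which has its own tokenization and returns a sparse matrix); the toy documents and the `transform` helper are made up.

```python
from collections import Counter

# Toy documents in the same shape as the joined_text column (one string per tweet)
train_docs = ["wait ipad also sale", "great ipad app", "iphone app dead"]

# "fit": learn a sorted vocabulary from the training documents only
vocabulary = sorted({word for doc in train_docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def transform(doc):
    """Return the count vector for one document; unseen words are ignored."""
    counts = Counter(word for word in doc.split() if word in index)
    return [counts[word] for word in vocabulary]

vector = transform("ipad app ipad android")
print(vocabulary)
print(vector)
```

Because the vocabulary comes from the training set alone, a word that only appears in the validation or test set (here `android`) contributes nothing to the vector.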

### Building a baseline model 

We will build a baseline model using the outputs of the CountVectorizer. The baseline model will be a decision tree classifier.

In [None]:
# Importing the relevant libraries
from sklearn.tree import DecisionTreeClassifier
# instantiating the DecisionTreeClassifier
tree_clf = DecisionTreeClassifier()
