# Twitter Sentiment and Modeling

## Overview

This project uses a CrowdFlower dataset to build an NLP model that classifies the sentiment of Tweets about Apple and Google products. Human raters labeled the sentiment in over 9,000 Tweets as positive, negative, or neither.

## Business Problem

Apple and Google want to understand how consumers feel about their products, and they are looking at Twitter as a source of that information. The task is to build a model that can rate the sentiment of a Tweet based on its content.

## Data Understanding

The dataset used for this project is a CSV file ("data.csv") containing over 9,000 Tweets about Apple and Google products. Human raters labeled the sentiment of each Tweet as positive, negative, or neither. The target is the sentiment column.

### Import Libraries

We first imported the libraries needed for analysis, visualization, preprocessing, and modeling, and set warnings to be ignored.

In [1]:
#import necessary libraries
import pandas as pd

from nltk.corpus import stopwords

### Data Inspection

We then loaded the CSV dataset and inspected its shape, size, column names, and data types, and checked for missing or duplicate entries.

In [2]:
#load the dataset, ensure the proper encoding is read
df = pd.read_csv('data.csv', encoding='latin1')
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [3]:
#change the name of the tweet, product, and sentiment columns
df = df.rename(columns={'tweet_text': 'tweet', 'emotion_in_tweet_is_directed_at': 'brand_or_product', 'is_there_an_emotion_directed_at_a_brand_or_product': 'sentiment'})
df.head()

Unnamed: 0,tweet,brand_or_product,sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [4]:
#look at the different values for sentiment column
df['sentiment'].value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: sentiment, dtype: int64

#### Change Sentiment Values

We combined 'I can't tell' and 'No emotion toward brand or product' into a single value, 'Neutral', and shortened 'Positive emotion' and 'Negative emotion' to 'Positive' and 'Negative'.

In [5]:
#change sentiment values
df['sentiment'] = df['sentiment'].replace({'No emotion toward brand or product': 'Neutral',
                                           'Positive emotion': 'Positive',
                                           'Negative emotion': 'Negative',
                                           "I can't tell": 'Neutral'})
print(df['sentiment'].value_counts())

Neutral     5545
Positive    2978
Negative     570
Name: sentiment, dtype: int64
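
A relabeling like the one above silently leaves any label it does not list unchanged when `replace` is used. The sketch below shows one way to confirm that a mapping covers every raw label, using `Series.map` on a small toy series. The toy data and the "Mixed emotion" label are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Toy labels standing in for the raw sentiment column (hypothetical data).
raw = pd.Series([
    "Positive emotion",
    "Negative emotion",
    "I can't tell",
    "No emotion toward brand or product",
    "Mixed emotion",  # hypothetical label that the mapping does not cover
])

label_map = {
    "No emotion toward brand or product": "Neutral",
    "Positive emotion": "Positive",
    "Negative emotion": "Negative",
    "I can't tell": "Neutral",
}

# Series.map returns NaN for any value missing from the dict,
# so labels the mapping does not cover can be listed explicitly.
mapped = raw.map(label_map)
unmapped = sorted(raw[mapped.isna()].unique())
print(unmapped)
```

An empty `unmapped` list on the real column would indicate that all four original labels were handled.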


In [6]:
#look at the different values for brand_or_product column
df['brand_or_product'].value_counts()

iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: brand_or_product, dtype: int64

In [7]:
#check information on each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   tweet             9092 non-null   object
 1   brand_or_product  3291 non-null   object
 2   sentiment         9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


### Duplicates

We checked for duplicate records, found 22, and dropped them.

In [8]:
#check for duplicates
df[df.duplicated()]

Unnamed: 0,tweet,brand_or_product,sentiment
468,"Before It Even Begins, Apple Wins #SXSW {link}",Apple,Positive
776,Google to Launch Major New Social Network Call...,,Neutral
2232,Marissa Mayer: Google Will Connect the Digital...,,Neutral
2559,Counting down the days to #sxsw plus strong Ca...,Apple,Positive
3950,Really enjoying the changes in Gowalla 3.0 for...,Android App,Positive
3962,"#SXSW is just starting, #CTIA is around the co...",Android,Positive
4897,"Oh. My. God. The #SXSW app for iPad is pure, u...",iPad or iPhone App,Positive
5338,RT @mention ÷¼ GO BEYOND BORDERS! ÷_ {link} ...,,Neutral
5341,"RT @mention ÷¼ Happy Woman's Day! Make love, ...",,Neutral
5881,RT @mention Google to Launch Major New Social ...,,Neutral


In [9]:
#check the number of duplicates
print(len(df[df.duplicated()]))

22


In [10]:
#drop duplicates
df.drop_duplicates(inplace=True)
df[df.duplicated()]

Unnamed: 0,tweet,brand_or_product,sentiment
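
As a side note on the duplicate check, `duplicated()` by default flags only the later copies of a repeated row, which is why the count matches the number of rows that `drop_duplicates()` removes. The sketch below illustrates this on a toy frame (hypothetical data) and shows how `keep=False` could be used to inspect every copy, including the first.

```python
import pandas as pd

# Toy frame with one row repeated three times (hypothetical data).
toy = pd.DataFrame({
    "tweet": ["a", "b", "a", "c", "a"],
    "sentiment": ["Positive", "Neutral", "Positive", "Neutral", "Positive"],
})

# Default keep='first' flags only the later copies of each duplicated row.
later_copies = toy[toy.duplicated()]

# keep=False flags every member of a duplicate group, including the first.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence and removes the later copies.
deduped = toy.drop_duplicates()

print(len(later_copies), len(all_copies), len(deduped))
```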


### Missing Values

We checked for missing values: the tweet column was missing 1 value, and the brand_or_product column was missing almost 6,000.
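
Per-column missing counts like these can be read off `df.info()`, or computed directly. The following is a minimal sketch on a toy frame (hypothetical data, with column names borrowed from the notebook) rather than the exact call used in the original analysis.

```python
import pandas as pd

# Toy frame with gaps in two columns (hypothetical data).
toy = pd.DataFrame({
    "tweet": ["great iPad app", None, "line at the Apple store"],
    "brand_or_product": ["iPad", None, None],
    "sentiment": ["Positive", "Neutral", "Negative"],
})

# Number of missing values per column.
missing_counts = toy.isna().sum()

# Share of missing values per column, rounded for display.
missing_share = toy.isna().mean().round(2)

print(missing_counts)
print(missing_share)
```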

In [11]:
#look at the row with the missing value for the 'tweet' column
df.loc[df['tweet'].isnull()]

Unnamed: 0,tweet,brand_or_product,sentiment
6,,,Neutral


#### Drop Missing Tweet 

The only row with a missing tweet contains no other useful information, so we dropped it.

In [12]:
#drop missing tweet row
df.dropna(subset=['tweet'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9070 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   tweet             9070 non-null   object
 1   brand_or_product  3282 non-null   object
 2   sentiment         9070 non-null   object
dtypes: object(3)
memory usage: 283.4+ KB


Next, we looked at the rows with a missing brand/product.

In [13]:
#look at 20 rows of missing brand/product values
df.loc[df['brand_or_product'].isnull()].head(20)

Unnamed: 0,tweet,brand_or_product,sentiment
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,Neutral
16,Holler Gram for iPad on the iTunes App Store -...,,Neutral
32,"Attn: All #SXSW frineds, @mention Register fo...",,Neutral
33,Anyone at #sxsw want to sell their old iPad?,,Neutral
34,Anyone at #SXSW who bought the new iPad want ...,,Neutral
35,At #sxsw. Oooh. RT @mention Google to Launch ...,,Neutral
37,SPIN Play - a new concept in music discovery f...,,Neutral
39,VatorNews - Google And Apple Force Print Media...,,Neutral
41,HootSuite - HootSuite Mobile for #SXSW ~ Updat...,,Neutral
42,Hey #SXSW - How long do you think it takes us ...,,Neutral


We looked for any pattern between missing brand/product values and sentiment.

In [14]:
#check missing brand/product rows that have a sentiment other than Neutral
df.loc[(df['brand_or_product'].isnull()) & (df['sentiment'] != 'Neutral')]

Unnamed: 0,tweet,brand_or_product,sentiment
46,Hand-Held Û÷HoboÛª: Drafthouse launches Û÷H...,,Positive
64,Again? RT @mention Line at the Apple store is ...,,Negative
68,Boooo! RT @mention Flipboard is developing an ...,,Negative
103,Know that &quot;dataviz&quot; translates to &q...,,Negative
112,Spark for #android is up for a #teamandroid aw...,,Positive
...,...,...,...
9011,apparently the line to get an iPad at the #sxs...,,Positive
9043,Hey is anyone doing #sxsw signing up for the g...,,Negative
9049,@mention you can buy my used iPad and I'll pic...,,Positive
9052,@mention You could buy a new iPad 2 tmrw at th...,,Positive


In [15]:
#print the number of missing brand/product rows that have a sentiment other than Neutral, and the number that have Neutral as the sentiment
print("Number of rows with a sentiment other than Neutral: ", len(df.loc[(df['brand_or_product'].isnull()) & (df['sentiment'] != 'Neutral')]))
print("Number of rows with Neutral as the sentiment: ", len(df.loc[(df['brand_or_product'].isnull()) & (df['sentiment'] == 'Neutral')]))

Number of rows with a sentiment other than Neutral:  357
Number of rows with Neutral as the sentiment:  5431
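
These counts show that 5,431 of the 5,788 rows with a missing brand/product (about 94%) are Neutral. The same comparison could be laid out as a single table with `pd.crosstab`. The sketch below uses a toy frame (hypothetical data), and the `brand_missing` name is introduced here only for illustration.

```python
import pandas as pd

# Toy frame mixing labeled and unlabeled brand/product rows (hypothetical data).
toy = pd.DataFrame({
    "brand_or_product": ["iPad", None, None, "Google", None],
    "sentiment": ["Positive", "Neutral", "Neutral", "Negative", "Positive"],
})

# Boolean flag: True where the brand/product is missing.
brand_missing = toy["brand_or_product"].isna().rename("brand_missing")

# Counts of each sentiment, split by whether the brand/product is missing.
table = pd.crosstab(brand_missing, toy["sentiment"])

# Row-wise shares, so each row sums to 1.
shares = pd.crosstab(brand_missing, toy["sentiment"], normalize="index")

print(table)
print(shares.round(2))
```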


### Data Cleaning

We performed standard cleaning steps such as standardizing and tokenizing the text.

In [16]:
windows_sample = df.iloc[7]["tweet"]
windows_sample

'Beautifully smart and simple idea RT @madebymany @thenextweb wrote about our #hollergram iPad app for #sxsw! http://bit.ly/ieaVOB'
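
To make "standardizing and tokenizing" concrete, here is a minimal sketch applied to the sample tweet above. It is not the notebook's actual cleaning code: the `clean_tweet` helper, the regular expressions, and the short stopword set are all assumptions made for illustration. The notebook imports `nltk.corpus.stopwords`, which could supply the stopword list instead.

```python
import re

# Small illustrative stopword set (assumption); the notebook imports
# nltk.corpus.stopwords, which could be used here instead.
STOPWORDS = {"and", "about", "our", "for", "the", "a", "rt"}

def clean_tweet(text):
    """Lowercase, strip URLs and @mentions, then split into word tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    # Keep runs of letters, digits, and apostrophes; '#' is dropped
    # but the hashtag word itself is kept.
    tokens = re.findall(r"[a-z0-9']+", text)
    return [t for t in tokens if t not in STOPWORDS]

sample = ("Beautifully smart and simple idea RT @madebymany @thenextweb "
          "wrote about our #hollergram iPad app for #sxsw! http://bit.ly/ieaVOB")
tokens = clean_tweet(sample)
print(tokens)
```

Whether to keep hashtag words, drop "rt", or remove mentions is a design choice that depends on the model, and the choices here are only one reasonable option.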