# Gillette Ad Tweets Sentiment Analysis

### In this tutorial, we will go over how to run sentiment analysis using the Vader sentiment analyzer (`nltk.sentiment.vader`) from NLTK in Python.
You can follow along either here or by trying it on your own. Instructions on how to set up Anaconda and Python are in my README.md.

## First Step: Import Packages and CSV

When setting up your Jupyter Notebook, you first want to designate a cell (normally the first cell) for importing packages. You will not need every package each time, but I like to import a few extra ones for convenience in case I need them later. To keep it simple, I have mostly imported the ones necessary for this tutorial. See below and follow along with the comments.

### Import Packages

In [1]:
#Import packages
import nltk #Natural Language Toolkit, used to work with the text data from the tweets
#Vader needs its lexicon; if it is missing, run nltk.download('vader_lexicon') once

from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA #The main class used to calculate sentiment

from nltk.tokenize import word_tokenize #Splits text into smaller tokens; not used here since tweets are already in
#sentence-like format

import pandas as pd #organizes our data into columns and rows

import numpy as np #not used here, but normally imported along with pandas

from nltk.sentiment.util import * #not really necessary since we are using Vader, but good to have

import re #regular expressions are not used here, but good to have when working with text data

import string  #also good to have when working with text data, but not used here

### Import the Actual Data

While the data is not fully cleaned, the sentiment package we are going to use, Vader, was designed for social media text and is pretty good at analyzing tweets without much cleaning, since that is its primary use. Here we load the data into our dataframe, where we will organize it.

To read more about the Sentiment Analyzer, click here: http://www.nltk.org/howto/sentiment.html

To download the data, check out the repository here: https://github.com/chantelmariediaz/GilletteAd-Twitter-Scrape and click gillette_tweets2.csv

Read more about pandas, here: http://pandas.pydata.org/pandas-docs/version/0.15/dsintro.html


#### Again, follow along with the comments

In [4]:
#Here we are using pd.read_csv to import our CSV file with the data into our dataframe

#I set this encoding because the default gave me a UTF-8 decoding error. It's important to read your errors. Sometimes files will
#not import smoothly because of unexpected characters or mixed data types in your data. Stack Overflow is your friend

#index_col=None because we do not have an index column in our dataset. Again, read the pandas link to know more

df = pd.read_csv("gillette_tweets2.csv", encoding="ISO-8859-1", index_col=None)
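
The `\x92` characters that show up in the printed tweets further down hint that the file may have been saved as Windows-1252 rather than ISO-8859-1. That is an assumption on my part, not something the dataset states. A minimal sketch of the difference, using a made-up byte string instead of the real file (`try_decode` is a hypothetical helper):

```python
# Hypothetical sample bytes: the word "don't" with a curly apostrophe,
# stored the way a Windows-1252 file would store it.
raw = b"don\x92t"

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

as_utf8 = try_decode(raw, "utf-8")         # fails: a lone 0x92 byte is not valid UTF-8
as_latin1 = try_decode(raw, "ISO-8859-1")  # decodes, but 0x92 becomes a control character
as_cp1252 = try_decode(raw, "cp1252")      # decodes 0x92 to a right single quotation mark

print(repr(as_utf8), repr(as_latin1), repr(as_cp1252))
```

If that assumption holds for your copy of the file, passing `encoding="cp1252"` to `pd.read_csv` may keep the apostrophes readable. The rest of this tutorial keeps ISO-8859-1 so the outputs match what is shown.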

## Preparing The Data for the Sentiment Analysis

In this section, we will get the text data into a form that Vader can read.
This is something you'll have to do with any text data you are working with. Vader scores plain strings, so here we treat each tweet as one sentence-like string instead of tokenizing it further.

In [5]:
#Checking out the columns in our dataset from the CSV file we imported previously
#As you can see, there are two columns: the tag or keyword we used to get the data, and the tweet.
#We don't really need the tag, so we're only going to work with the text in the tweet column

df.columns

Index(['Keyword', 'Tweet'], dtype='object')

In [7]:
#Using the column 'Tweet' from our dataframe, here we are creating a list so we can process the tweets as a list of strings
#Vader only takes strings, so this is an important conversion. Again, you'll understand if you read the docs

sentences = list(df['Tweet'])

In [8]:
#You can check the type to make sure you did this part properly
type(sentences)

list

In [9]:
#Print out your list of strings to get a better glimpse of your data, which is no longer in a dataframe for now

sentences

["yes #gillette show men the truth &amp; noble way, please!'",
 "#barbasol has a new ad countering the anti-real man ads promoted by #gillette.'",
 "you know if this guy just would have watched that gillette ad none of this would\x92ve happened.', o sad #gillette'",
 "guys, please calm down.', it not like #gillette asking you to shave your whole body, wear makeup, perfume, high heels;",
 'takes to youtube to rant about the stupidity where #gillette staff members call the recently released t',
 "#gillette must be aware that most of the complaints made by women are fake and frivolous.', '#fakecases are made on men are in'",
 "yes #gillette show men the truth &amp; noble way, please!'",
 "oh my gosh!!!', ken gonna need some pink #gillette razors for those pretty legs!",
 "you don\x92t get any points #gillette for shaming men not when your brand has paicipated actively in objectifying women for ye'",
 "guys, please calm down.', it not like #gillette asking you to shave your whole body, wea

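The printed tweets still carry HTML entities such as `&amp;`. The tutorial scores them as-is, but if you wanted a light clean-up first, one possible sketch looks like this. The `light_clean` helper is hypothetical and not part of the original notebook:

```python
import html
import re

def light_clean(tweet):
    """Decode HTML entities and collapse whitespace; leave wording and punctuation alone."""
    text = html.unescape(tweet)        # turns "&amp;" back into "&"
    text = re.sub(r"\s+", " ", text)   # collapses runs of whitespace into one space
    return text.strip()

# First tweet from the output above
sample = "yes #gillette show men the truth &amp; noble way, please!'"
cleaned = light_clean(sample)
print(cleaned)
```

Punctuation and capitalization are left untouched on purpose, since Vader uses cues like exclamation marks and all-caps words when scoring.
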
## Calculating Sentiment of Tweets

As you know, the goal of our project is to identify the sentiment, or public reaction, toward the controversial Gillette ad and to see whether it was a smart marketing strategy for Procter & Gamble. This can be done either manually, which we don't really want to do, or with a sentiment package. There are other packages, but Vader is used quite frequently.

#### Creating a New List of Sentiment Scores From the Tweets (Positive, Negative and Neutral)

In [10]:
#Analyze sentiment with Vader

sia = SIA()
scores = [] #New list to hold the scores: positive, negative and neutral, plus compound, a normalized overall score from -1 to 1

for sentence in sentences: #Loop over each tweet in the list
    
    pol_score = sia.polarity_scores(sentence) #Returns a dictionary with the neg, neu, pos and compound scores
    
    pol_score['sentences'] = sentence #Store the tweet text in the same dictionary under the key 'sentences'
    
    scores.append(pol_score) #Append the dictionary (scores plus tweet text) to the scores list
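
To see the shape of the records this loop builds without installing the lexicon, here is a sketch that swaps Vader for a stand-in scorer. `build_records` and `fake_scorer` are hypothetical names, and the numbers are made up for illustration. Only the dictionary keys mirror what `polarity_scores` returns.

```python
def build_records(sentences, scorer):
    """Score each sentence and keep the text alongside its scores."""
    records = []
    for sentence in sentences:
        record = dict(scorer(sentence))   # copy so the scorer's own dict is not modified
        record["sentences"] = sentence    # keep the tweet text next to its scores
        records.append(record)
    return records

def fake_scorer(text):
    # Stand-in for sia.polarity_scores; these values are invented, not real Vader output.
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

records = build_records(["first tweet", "second tweet"], fake_scorer)
print(records[0])
```

With the real analyzer, the equivalent call would be `build_records(sentences, sia.polarity_scores)`.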

#### Relabeling and Organizing Data Again into Dataframe

In [11]:
#For our modeling purposes, we do not want to work with the raw proportion scores

#We want to create a label that marks each tweet as positive (1), negative (-1), or neutral (0)

#I later take out the neutrals for simplicity, after exporting, in Excel

df2 = pd.DataFrame.from_records(scores) #Create a new dataframe or table with the new sentiment scores

df2['label'] = 0 #Start the new label column at zero (neutral)

df2.loc[df2['compound'] > 0.05, 'label'] = 1 #Label a compound score greater than 0.05 as positive
#I lowered the threshold here

df2.loc[df2['compound'] < -0.05, 'label'] = -1 #Label a compound score less than -0.05 as negative
#Anything in between stays labeled as 0 or neutral

df2.head() #Call head to see the first 5 records in your new dataframe

| | compound | neg | neu | pos | sentences | label |
|---|---|---|---|---|---|---|
| 0 | 0.807 | 0.0 | 0.458 | 0.542 | yes #gillette show men the truth &amp; noble w... | 1 |
| 1 | 0.4215 | 0.0 | 0.797 | 0.203 | #barbasol has a new ad countering the anti-rea... | 1 |
| 2 | -0.4767 | 0.147 | 0.853 | 0.0 | you know if this guy just would have watched t... | -1 |
| 3 | 0.3591 | 0.089 | 0.717 | 0.194 | guys, please calm down.', it not like #gillett... | 1 |
| 4 | -0.6486 | 0.275 | 0.725 | 0.0 | takes to youtube to rant about the stupidity w... | -1 |
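
The labeling rule can also be written as a small plain-Python function, which makes the neutral band easier to see. This is a sketch of the same rule, not the notebook's own code; `label_from_compound` is a hypothetical name.

```python
def label_from_compound(compound, threshold=0.05):
    """Return 1 (positive), -1 (negative), or 0 (neutral) for a compound score."""
    if compound > threshold:
        return 1
    if compound < -threshold:
        return -1
    return 0

# Two compound scores from the table above, plus one made-up value inside the neutral band
for value in (0.807, -0.4767, 0.02):
    print(value, label_from_compound(value))
```

Applied to the dataframe, `df2['compound'].apply(label_from_compound)` should give the same labels as the three assignment lines in the cell.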


## Export back to CSV and Excel

Now that we have calculated our new sentiment scores, we can export them back to CSV for further preprocessing, such as cleaning, and for going back to check the labels by hand, since the analyzer may miss tweets that we would consider negative. For example, BOYCOTT Gillette would be considered negative from our standpoint, but maybe not to Vader.

In [12]:
#Reorder the columns so the tweet text comes first, then export the dataframe to a CSV file you can open in Excel
df2 = df2[['sentences', 'compound', 'neg', 'neu', 'pos', 'label']]
df2.to_csv('gilettesent2.csv', encoding='utf-8', index=False)
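
The manual review described in this section could be partly automated with a keyword override. This is only a sketch under my own assumptions: the keyword list and the `review_label` helper are hypothetical, and which words count as negative is a judgment call for your project.

```python
# Hypothetical keyword list; extend it after reading through the exported tweets.
NEGATIVE_KEYWORDS = ("boycott",)

def review_label(tweet, label):
    """Force the label to -1 when the tweet contains a flagged keyword; otherwise keep it."""
    lowered = tweet.lower()
    if any(keyword in lowered for keyword in NEGATIVE_KEYWORDS):
        return -1
    return label

overridden = review_label("BOYCOTT Gillette", 0)    # flagged keyword, so forced to negative
unchanged = review_label("nice ad #gillette", 1)    # no keyword, so the label is kept
print(overridden, unchanged)
```

A simple substring match like this will also catch tweets that argue against a boycott, so it is still worth reading the flagged rows by hand.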