# Twitter Sentiment Analysis
Author: Brenda De Leon

### Overview
### Business Understanding

### Data Understanding

#### Data Source
##### Data Limitations:
#### Features
#### Target

In [6]:
# importing relevant libraries
# !pip install wordcloud
import pandas as pd
import seaborn as sns
sns.set_style('darkgrid')
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from nltk import FreqDist 
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from wordcloud import WordCloud
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report

### Data

In [8]:
# importing data
df = pd.read_csv('data/judge_1377884607_tweet_product_company.csv')
# previewing data
df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [10]:
# viewing columns, # of columns, dtypes, # of rows, # of non nulls
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8721 entries, 0 to 8720
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          8720 non-null   object
 1   emotion_in_tweet_is_directed_at                     3169 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  8721 non-null   object
dtypes: object(3)
memory usage: 204.5+ KB


In [11]:
# inspecting data value ranges
df.describe()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| count | 8720 | 3169 | 8721 |
| unique | 8693 | 9 | 4 |
| top | RT @mention Marissa Mayer: Google Will Connect... | iPad | No emotion toward brand or product |
| freq | 5 | 910 | 5156 |


We will rename the columns so that their titles are shorter and more descriptive. First, we preview each column's values to make sure the new titles describe them accurately.

In [19]:
# previewing 'tweet_text' values
df['tweet_text'].value_counts(dropna=False)[:25]

RT @mention Marissa Mayer: Google Will Connect the Digital &amp; Physical Worlds Through Mobile - {link} #sxsw                                      5
RT @mention Marissa Mayer: Google Will Connect the Digital &amp; Physical Worlds Through Mobile - {link} #SXSW                                      4
RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw                                                   4
RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #SXSW                                                   3
#SXSW is just starting, #CTIA is around the corner and #googleio is only a hop skip and a jump from there, good time to be an #android fan          2
RT @mention RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw                                       2
Really enjoying the changes in Gowalla 3.0 for Android! Looking forward to seeing what else they &am

In [21]:
# previewing 'emotion_in_tweet_is_directed_at' values
df['emotion_in_tweet_is_directed_at'].value_counts(dropna=False)

NaN                                5552
iPad                                910
Apple                               640
iPad or iPhone App                  451
Google                              412
iPhone                              288
Other Google product or service     282
Android App                          78
Android                              74
Other Apple product or service       34
Name: emotion_in_tweet_is_directed_at, dtype: int64

In [23]:
# previewing 'is_there_an_emotion_directed_at_a_brand_or_product' values
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts(dropna=False)

No emotion toward brand or product    5156
Positive emotion                      2869
Negative emotion                       545
I can't tell                           151
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
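The counts above show a strong class imbalance, which is easier to judge as proportions. The following is a minimal sketch that rebuilds the counts by hand from the output above; on the real DataFrame, `value_counts(normalize=True)` on the same column should give the same shares directly. The variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the value_counts() output above
sentiment_counts = pd.Series({
    "No emotion toward brand or product": 5156,
    "Positive emotion": 2869,
    "Negative emotion": 545,
    "I can't tell": 151,
})

# Share of each label among all labelled rows
sentiment_share = (sentiment_counts / sentiment_counts.sum()).round(3)
print(sentiment_share)
```

Negative tweets make up a small fraction of the data, so accuracy alone may be a misleading metric for later models.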

In [30]:
# renaming columns 
df = df.rename(columns = {'is_there_an_emotion_directed_at_a_brand_or_product':'Sentiment',
                          'tweet_text':'Tweet', 
                          'emotion_in_tweet_is_directed_at':'Product_or_Service'}
              )

In [31]:
# checking for nulls
df.isnull().sum()

Tweet                    1
Product_or_Service    5552
Sentiment                0
dtype: int64

In [40]:
# dropping single null in tweet column
df.dropna(subset=['Tweet'], inplace = True)
# confirming null was dropped
df['Tweet'].isnull().sum()

0

We will address the remaining null values in the Product_or_Service column later.

In [32]:
# checking for duplicates
df.duplicated().sum()

22

In [36]:
# print length of 'Tweet' column
print(len(df['Tweet']))
# unique value count for 'Tweet' column
len(df['Tweet'].unique())

8721


8694

Previewing the column values and value counts suggests that the duplicates are retweets.
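One way to check the retweet observation is to pull out every duplicated row and see how many start with `RT`. This is a sketch on a small toy DataFrame (two rows reuse a tweet from the preview above, the third is shortened); the names `toy`, `dupes`, and `deduped` are illustrative, and whether duplicates should actually be dropped is an assumption the notebook has not settled.

```python
import pandas as pd

# Toy data: the first two rows are an exact duplicate retweet
toy = pd.DataFrame({
    "Tweet": [
        "RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw",
        "RT @mention Google to Launch Major New Social Network Called Circles, Possibly Today {link} #sxsw",
        "@swonderlin Can not wait for #iPad 2 also.",
    ],
    "Sentiment": [
        "No emotion toward brand or product",
        "No emotion toward brand or product",
        "Positive emotion",
    ],
})

# keep=False marks every member of a duplicate group, not just the repeats
dupes = toy[toy.duplicated(keep=False)]

# Fraction of duplicated rows that look like retweets
share_retweets = dupes["Tweet"].str.startswith("RT ").mean()
print(share_retweets)

# Optional: keep only the first copy of each duplicated row
deduped = toy.drop_duplicates().reset_index(drop=True)
```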

### Data Cleaning
The data must be in a form that a scikit-learn model can be fit on. We will standardize the case of the text, use a tokenizer to convert the full tweets into lists of individual words, and address the remaining nulls. We will then compare the raw word frequency distributions for each sentiment.

In [None]:
# split the raw tweet on whitespace to create unprocessed_text - we'll make a visualization with this and compare it with preprocessed_text
# the column is named 'Tweet' after the rename above ('tweet_text' no longer exists)
df['unprocessed_text'] = df['Tweet'].str.split()
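The cleaning steps described in this section (lowercasing, tokenizing, then comparing word frequencies per sentiment) could look roughly like the sketch below. The three tweets are made up for illustration, and the token pattern `\w+` is an assumption rather than the pattern this notebook settles on.

```python
import pandas as pd
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

# Hypothetical toy tweets, not taken from the dataset
toy = pd.DataFrame({
    "Tweet": [
        "I love my new iPad",
        "The iPad line is great",
        "My iPhone battery died again",
    ],
    "Sentiment": ["Positive emotion", "Positive emotion", "Negative emotion"],
})

# Assumed pattern: runs of word characters, which drops punctuation
tokenizer = RegexpTokenizer(r"\w+")

# Standardize case, then split each tweet into a list of tokens
toy["tokens"] = toy["Tweet"].str.lower().apply(tokenizer.tokenize)

# One frequency distribution per sentiment label
freq_by_sentiment = {
    sentiment: FreqDist(tok for toks in group["tokens"] for tok in toks)
    for sentiment, group in toy.groupby("Sentiment")
}

for sentiment, freq in freq_by_sentiment.items():
    print(sentiment, freq.most_common(3))
```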

### Build and Evaluate a Baseline Model with TfidfVectorizer and MultinomialNB
Ultimately, all data must be in numeric form before a scikit-learn model can be fit on it, so we'll use TfidfVectorizer from sklearn.feature_extraction.text to convert the tweets into a vectorized format.

Initially we'll keep all of the default parameters for both the vectorizer and the model, in order to develop a baseline score.
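A baseline along these lines might be sketched as follows. The tweets and labels are invented toy data, and wrapping the vectorizer and model in a pipeline is an assumption about how the steps are combined; it has the benefit that the vectorizer is refit on the training portion of each cross-validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy tweets and labels, three per class
texts = [
    "love the new ipad design",
    "great google maps demo at sxsw",
    "awesome iphone app for the festival",
    "my iphone battery died again",
    "the ipad line is way too long",
    "android app keeps crashing",
]
labels = ["Positive emotion"] * 3 + ["Negative emotion"] * 3

# Default parameters for both steps, as described above
baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Stratified 3-fold cross-validation; the default score for a classifier is accuracy
scores = cross_val_score(baseline, texts, labels, cv=3)
print(scores.mean())
```

On data this small the score itself is meaningless; the point is only the shape of the workflow.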

### Requirements

1. Load the Data

   Use pandas to load the data into a DataFrame. Then get a sense of what is in this dataset by visually inspecting some samples.

2. Perform Data Cleaning and Exploratory Data Analysis with nltk

   Standardize the case of the data and use a tokenizer to convert the full tweets into lists of individual words. Then compare the raw word frequency distributions of each sentiment.

3. Build and Evaluate a Baseline Model with TfidfVectorizer and MultinomialNB

   Vectorize the text with TfidfVectorizer and fit a MultinomialNB model, keeping the default parameters for both, in order to develop a baseline score.

4. Iteratively Perform and Evaluate Preprocessing and Feature Engineering Techniques

   Here we will investigate four techniques to determine whether they should be part of our final modeling process:

   - Removing stopwords
   - Using custom tokens
   - Domain-specific feature engineering
   - Increasing max_features

5. Evaluate a Final Model on the Test Set

   Once we have chosen a final modeling process, fit it on the full training data and evaluate it on the test data.
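The last two requirements can be sketched together: compare a few vectorizer settings by cross-validation on the training split, then fit the best one on all training data and score it once on the held-out test split. Everything here is an assumption for illustration: the toy tweets, the candidate settings (including `max_features=20`), and the choice of mean cross-validated accuracy as the selection criterion.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy tweets, six per class
texts = [
    "love the new ipad design",
    "great google maps demo at sxsw",
    "awesome iphone app for the festival",
    "the android update is fantastic",
    "really enjoying the apple popup store",
    "google circles looks amazing",
    "my iphone battery died again",
    "the ipad line is way too long",
    "android app keeps crashing",
    "hate the new google layout",
    "apple store was a terrible experience",
    "this iphone app is awful and slow",
]
labels = ["Positive emotion"] * 6 + ["Negative emotion"] * 6

# Hold out a test set that is only touched once, at the very end
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Candidate preprocessing choices to compare against the default
candidates = {
    "default": TfidfVectorizer(),
    "no_stopwords": TfidfVectorizer(stop_words="english"),
    "max_features_20": TfidfVectorizer(max_features=20),
}

# Mean cross-validated accuracy on the training split for each candidate
cv_means = {
    name: cross_val_score(
        make_pipeline(vectorizer, MultinomialNB()), X_train, y_train, cv=3
    ).mean()
    for name, vectorizer in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Fit the chosen process on the full training data, then evaluate on the test set
final_model = make_pipeline(candidates[best_name], MultinomialNB())
final_model.fit(X_train, y_train)
report = classification_report(y_test, final_model.predict(X_test), zero_division=0)
print(best_name)
print(report)
```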