# Apple and Android Sentiment Analysis

This project takes the data.world dataset 

## Objective
Construct various NLP based models demonstrating multiple approaches to classifying the sentiment of 9000 tweets regarding their sentiment towards either Apple products, Android products or both. I will be using the SpaCy word vectorization library and then I will use those vectorizations (along with simple Bags of Words techniques) to classify the sentiment of each tweet. After determining the best binary classification model, I will then attempt to create a tertiary classification model that introduces 'neutral' tweets as well.

### Goal 1 - Identify Products

### Goal 2 - Identify Sentiment

## Approach

To discern the best model for each task I will begin with a basic baseline with a Gaussian Naive Bayes model dependent on context free 'bag of words'. Part of the model process will be to try to accurately identify the products discussed in the individual tweets. I will then iterate in two ways: 1) in pre-processing methods. I will have one dataset that is simply a context-free bag of words, another that will be normalized, context-dependent tokenization using SpaCy's tokenization pipeline and third I will use methods similar to the second dataset, but will also remove all stop words and include SpaCy's entity tag training methods. 2) I will examine different modeling pipelines, including Gaussian Naive Bayes, Logistic Regression, Random Forest, TF-IDF and Sequential Neural Networks.

### Preprocessing
- Tokenizing the text using SpaCy
- Noise reduction - remove tags, identify hastags - one dataset with hashtags, one without
- Removing Stop Words (vs. leaving in)
- Entity Tagging vs. None - SpaCy entity tag training
- Lexicon normalization - lemmatization

### Product Identification
- Regular Expression Extraction
- SpaCy entity labeling - HASHTAGS

## Model 1 - Naive Bayes
- Bag of words Data


## Model 2 - Random Forest
- Data Set 1: TF-IDF
- Data Set 2: SpaCy Noisy
- Data Set 3: SpaCy Clean

## Model 3 - Sequential ReLu/Softmax
- Data Set 1: TF-IDF
- Data Set 2: SpaCy Noisy
- Data Set 3: SpaCy Clean

## Model 4 - Recurrent Neural Network

### Model A - Binary Classification
- Data Set 1: TF-IDF
- Data Set 2: SpaCy Noisy
- Data Set 3: SpaCy Clean

### Model B - Multiclass Classification


In [1]:
import pandas as pd
import numpy as np
import nltk
import string
import re
from nltk.corpus import stopwords
from nltk import word_tokenize, FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import LabelBinarizer
import matplotlib.pyplot as plt
import spacy
from spacy.lang.en import English
from spacy.tokens import Span, Token, Doc
from spacy.matcher import PhraseMatcher, Matcher 
%matplotlib inline
np.random.seed(16)

In [2]:
df = pd.read_csv('judge-1377884607_tweet_product_company.csv', encoding='unicode_escape')
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


### PreProcessing

Find tags and remove
Create matcher for product names

## Part One - Sentiment Analysis
Use NLP methods to extract names into dictionary.
- Drop websites
- Tokenize - TF-IDF
- Tokenize - SpaCy



In [3]:
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [4]:
handle = re.compile('\.?(@[A-Za-z0-9_]+)')
hashtag = re.compile('(#[A-Za-z0-9_]+)')
website = re.compile('(http://[A-Za-z./0-9?=-]+)')
app_1 = re.compile('([A-Za-z]+) (app)')
# unicode = re.compile('[\\]([A-Za-z0-9])+[^:ascii:]{1,2}[x9d]*')

In [5]:
def process_text(text, stop_words=False, keep_links=False, remove_hashtags = False):
    
    if keep_links == True:
        text = re.sub('\.?(@[A-Za-z0-9_]+)', 'HANDLE', text)
        text = re.sub('{([A-Za-z]+)}', 'WEBSITE', text)
    else:
        text = re.sub('\.?(@[A-Za-z0-9_]+)', '', text)
        text = re.sub('{([A-Za-z]+)}', '', text)
    
    if remove_hashtags == True:
        text = re.sub('(#[A-Za-z0-9_]+)', '', text)
    else:
        text = text.replace('#', '')
        
    text = re.sub('&([A-Za-z])+;', '', text)
    text = text.replace('RT', '')
    text = text.replace('\x89Û÷', '')
    text = text.replace('\x89Ûª', '')
    text = text.replace('\x89ÛÏ', '')
    text = text.replace('\x9d', '')
    text = text.replace('\x89Û', '')
    text = text.replace('\x89ÛÒ', '')
    text = text.replace('[pic]', '')
    doc = nlp(text)
    
    if stop_words == True:
        return [x.lower() for x in list([token.text for token in doc if token.is_punct == False and token.text not in stopwords_list])]
    else:
        return [x.lower() for x in list([token.text for token in doc if token.is_punct == False])]

In [6]:
df.dropna(subset=['tweet_text'],inplace=True)

In [7]:
df['handles'] = df['tweet_text'].map(lambda x: handle.findall(x))

In [8]:
df['hashtags'] = df['tweet_text'].map(lambda x: ', '.join(hashtag.findall(x)))

In [9]:
df_bin = df[df['is_there_an_emotion_directed_at_a_brand_or_product'].isin(['Positive emotion', 'Negative emotion'])]

In [10]:
TEXTS = list(df_bin['tweet_text'].unique())
nlp = spacy.load("en_core_web_sm")
docs = nlp.pipe(TEXTS)

In [11]:
stopwords_list = stopwords.words('english')

In [12]:
df_bin['Simple Processed Text'] = df_bin['tweet_text'].map(lambda x: process_text(x))
df_bin['More Processed Text'] = df_bin['tweet_text'].map(lambda x: process_text(x, stop_words=True))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_bin['Simple Processed Text'] = df_bin['tweet_text'].map(lambda x: process_text(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_bin['More Processed Text'] = df_bin['tweet_text'].map(lambda x: process_text(x, stop_words=True))


In [13]:
simple_process = [x for x in list(df_bin['Simple Processed Text'])]
more_process = [x for x in list(df_bin['More Processed Text'])]

In [14]:
simple_vocab = set([y for x in simple_process for y in x])
more_vocab = set([y for x in more_process for y in x])

In [15]:
len(simple_vocab)

6130

In [16]:
simple_concat = []

for tweet in simple_process:
    simple_concat += tweet

In [17]:
len(simple_concat)

66714

In [18]:
simple_freqdist = FreqDist(simple_concat)
simple_freqdist.most_common(200)

[('sxsw', 3708),
 ('handle', 2532),
 ('the', 1908),
 (' ', 1767),
 ('to', 1417),
 ('website', 1335),
 ('ipad', 1219),
 ('at', 1149),
 ('apple', 1050),
 ('for', 1029),
 ('a', 942),
 ('google', 882),
 ('is', 823),
 ('i', 809),
 ('of', 771),
 ('in', 762),
 ('iphone', 708),
 ('and', 674),
 ('store', 594),
 ('it', 586),
 ("'s", 570),
 ('2', 559),
 ('on', 543),
 ('up', 510),
 ('app', 457),
 ('my', 408),
 ('an', 403),
 ('new', 403),
 ('you', 394),
 ('with', 354),
 ('austin', 323),
 ('just', 287),
 ('this', 267),
 ('that', 260),
 ('be', 252),
 ('have', 237),
 ("n't", 235),
 ('pop', 231),
 ('android', 228),
 ('ipad2', 224),
 ('out', 221),
 ('not', 216),
 ('by', 203),
 ('from', 190),
 ('are', 186),
 ('get', 182),
 ('launch', 182),
 ('they', 180),
 ('so', 180),
 ('your', 176),
 ('one', 171),
 ('do', 167),
 ('now', 161),
 ('all', 158),
 ('circles', 156),
 ('social', 155),
 ('like', 154),
 ('will', 151),
 ('line', 150),
 ('about', 148),
 ('time', 146),
 ('great', 145),
 ('me', 145),
 ('no', 145),
 

In [26]:
vectorizor = TfidfVectorizer()

In [21]:
len(more_vocab)

7341

In [115]:
vocab_list = []
ents_list = []

for doc in docs:
    vocab_list.extend([token.text for token in doc])
    ents_list.extend([doc.ents])
    
full_vocab = set([word.lower() for word in vocab_list])
full_ents = set(ents_list)

In [125]:

labler = LabelBinarizer()

y = df_bin['is_there_an_emotion_directed_at_a_brand_or_product']
y_bin = labler.fit_transform(y)

In [130]:
df_bin['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts(normalize=True)

Positive emotion    0.839346
Negative emotion    0.160654
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: float64

In [128]:
df_bin['is_there_an_emotion_directed_at_a_brand_or_product']

0       Negative emotion
1       Positive emotion
2       Positive emotion
3       Negative emotion
4       Positive emotion
              ...       
9077    Positive emotion
9079    Positive emotion
9080    Negative emotion
9085    Positive emotion
9088    Positive emotion
Name: is_there_an_emotion_directed_at_a_brand_or_product, Length: 3548, dtype: object

In [157]:
len(simple_process)

9092

In [158]:
len(more_process)

9092

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_bin['Simple Processed Text'] = df_bin['tweet_text'].map(lambda x: process_text(x))


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_bin['More Processed Text'] = df_bin['tweet_text'].map(lambda x: process_text(x, stop_words=True))


In [18]:
nlp = spacy.load("en_core_web_sm")

In [19]:
TEXTS = list(df['tweet_text'].unique())

In [20]:
len(TEXTS)

9065

In [43]:
df_bin = df[df['is_there_an_emotion_directed_at_a_brand_or_product'].isin(['Positive emotion', 'Negative emotion'])]

In [44]:
len(df_bin)

3548

In [45]:
df_bin

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,handles,hashtags
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,[@wesley83],"#RISE_Austin, #SXSW"
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,"[@jessedee, @fludapp]",#SXSW
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,[@swonderlin],"#iPad, #SXSW"
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,[@sxsw],#sxsw
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,[@sxtxstate],#SXSW
...,...,...,...,...,...
9077,@mention your PR guy just convinced me to swit...,iPhone,Positive emotion,[@mention],"#sxsw, #princess"
9079,&quot;papyrus...sort of like the ipad&quot; - ...,iPad,Positive emotion,[],#SXSW
9080,Diller says Google TV &quot;might be run over ...,Other Google product or service,Negative emotion,[],"#sxsw, #diller"
9085,I've always used Camera+ for my iPhone b/c it ...,iPad or iPhone App,Positive emotion,[],"#SXSW, #SXSWi"


In [23]:
df_bin

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,handles,hashtags
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,[@wesley83],"#RISE_Austin, #SXSW"
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,"[@jessedee, @fludapp]",#SXSW
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,[@swonderlin],"#iPad, #SXSW"
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,[@sxsw],#sxsw
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,[@sxtxstate],#SXSW
...,...,...,...,...,...
9077,@mention your PR guy just convinced me to swit...,iPhone,Positive emotion,[@mention],"#sxsw, #princess"
9079,&quot;papyrus...sort of like the ipad&quot; - ...,iPad,Positive emotion,[],#SXSW
9080,Diller says Google TV &quot;might be run over ...,Other Google product or service,Negative emotion,[],"#sxsw, #diller"
9085,I've always used Camera+ for my iPhone b/c it ...,iPad or iPhone App,Positive emotion,[],"#SXSW, #SXSWi"


In [24]:
TEXTS

['.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.',
 "@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",
 '@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.',
 "@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw",
 "@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)",
 '@teachntech00 New iPad Apps For #SpeechTherapy And Communication Are Showcased At The #SXSW Conference http://ht.ly/49n4M #iear #edchat #asd',
 '#SXSW is just starting, #CTIA is around the corner and #googleio is only a hop skip and a jump from there, good time to be an #android fan',
 'Beautifully smart and simple idea RT @madebymany @thenextweb wrote about our #hollergram iPad app for #sxsw! http://bit.ly/ieaV

In [36]:
df['emotion_in_tweet_is_directed_at'].value_counts()

iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: emotion_in_tweet_is_directed_at, dtype: int64

## Populate Missing Products

- Cross fold validation
- 'App' in tweet

In [40]:
missing_df = df_bin[df_bin['emotion_in_tweet_is_directed_at'].isna() == True]

In [41]:
missing_df['tweet_text'].unique()

array(['Hand-Held \x89Û÷Hobo\x89Ûª: Drafthouse launches \x89Û÷Hobo With a Shotgun\x89Ûª iPhone app #SXSW {link}',
       'Again? RT @mention Line at the Apple store is insane.. #sxsw',
       'Boooo! RT @mention Flipboard is developing an iPhone version, not Android, says @mention #sxsw',
       "Know that &quot;dataviz&quot; translates to &quot;satanic&quot; on an iPhone. I'm just sayin'. #sxsw",
       'Spark for #android is up for a #teamandroid award at #SXSW read about it here: {link}',
       'Does your #SmallBiz need reviews to play on Google Places...We got an App for that..{link}  #seo #sxsw',
       '@mention  #SXSW LonelyPlanet Austin guide for #iPhone is free for a limited time {link} #lp #travel',
       'First day at sxsw.  Fun final presentation on Google Doodles.  #GoogleDoodle #sxsw',
       '&quot;You can Google Canadian Tuxedo and lose yourself for hours&quot; #sxsw',
       'Shipments daily - follow @mention #AppleATXdt 4 updates RT @mention Pop-up Apple Store seems

In [42]:
df_bin['emotion_in_tweet_is_directed_at'].value_counts()

iPad                               918
Apple                              638
iPad or iPhone App                 460
Google                             414
iPhone                             287
Other Google product or service    283
Android App                         80
Android                             77
Other Apple product or service      34
Name: emotion_in_tweet_is_directed_at, dtype: int64

## Part Two - Multiclass Sentiment Analysis

## Part Three - Product Identification