## Final Project Submission

Please fill out:

* Student name: Dicchyant Gurung
* Student pace: Self Paced
* Scheduled project review date/time:
* Instructor name: Jeff Herman
* Blog post URL: 

### Project

Analyze Twitter sentiment about Apple and Google products. Build a model that can rate the sentiment of a Tweet based on its content.

### Import the dataset

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv(r"D:\Data Science\Module_4_Final_Project\dsc-mod-4-project-v2-1-online-ds-sp-000\tweet_product_company.csv", encoding= 'unicode_escape')

### Check the data

In [3]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [4]:
df.shape

(9093, 3)

### Separate the tweet_text column for data exploration

In [5]:
tweet_df = pd.DataFrame(df['tweet_text']).copy()

In [6]:
tweet_df.head()

Unnamed: 0,tweet_text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...
1,@jessedee Know about @fludapp ? Awesome iPad/i...
2,@swonderlin Can not wait for #iPad 2 also. The...
3,@sxsw I hope this year's festival isn't as cra...
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...


### Check and remove missing values

In [7]:
tweet_df.isnull().sum()

tweet_text    1
dtype: int64

In [8]:
tweet_df.dropna(axis=0, inplace=True)

In [9]:
tweet_df.isnull().sum()

tweet_text    0
dtype: int64

### Tokenize the data

Let's observe some of the tweets individually to see what kind of characters or sentence structure we are working with here.

In [10]:
# Print the first ten tweets
for i in range(10):
    print(tweet_df.values[i])

['.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.']
["@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW"]
['@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.']
["@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw"]
["@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)"]
['@teachntech00 New iPad Apps For #SpeechTherapy And Communication Are Showcased At The #SXSW Conference http://ht.ly/49n4M #iear #edchat #asd']
['#SXSW is just starting, #CTIA is around the corner and #googleio is only a hop skip and a jump from there, good time to be an #android fan']
['Beautifully smart and simple idea RT @madebymany @thenextweb wrote about our #hollergram iPad app for #sxsw! http://bit.ly/ieaV

Looking at the first ten tweets, we can see that a typical tweet consists of one or more Twitter handles, the sentiment expressed by the user, and some hashtags categorizing the subject.

* To tokenize this data, we need to remove all special characters, punctuation and numbers
* We will also need to lowercase all words
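As a minimal sketch of these two steps on a single tweet, the snippet below uses only the standard `re` module and the same letter-matching pattern that the notebook applies later (written here without the outer capture group). The `tokenize` helper and the tiny `SAMPLE_STOPWORDS` set are assumptions made for illustration; the notebook itself uses NLTK's English stopword list.

```python
import re

# Letters only, optionally followed by an apostrophe suffix such as "'s" or "'t".
# Digits, punctuation, '@' and '#' never match, so they are dropped.
PATTERN = re.compile(r"[a-zA-Z]+(?:'[a-z]+)?")

# Hypothetical, deliberately tiny stopword set (illustration only)
SAMPLE_STOPWORDS = {"i", "this", "as", "isn't"}


def tokenize(tweet, stopwords=SAMPLE_STOPWORDS):
    """Split a tweet into lowercase word tokens and drop stopwords."""
    tokens = [token.lower() for token in PATTERN.findall(tweet)]
    return [token for token in tokens if token not in stopwords]


tokens = tokenize("@sxsw I hope this year's festival isn't as crashy "
                  "as this year's iPhone app. #sxsw")
print(tokens)
```

Because digits are not part of the pattern, a handle such as `@wesley83` becomes `wesley` and `3G` becomes `g`.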

In [11]:
import nltk
from nltk.corpus import stopwords
from nltk.collocations import *
from nltk import FreqDist
from nltk import word_tokenize
import string
import re

In [12]:
nltk.download('stopwords')

stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)

# Match runs of letters, optionally followed by an apostrophe suffix (e.g. "year's")
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"

# Loop over the index to access individual tweets
for i in tweet_df.index:
    tweet = tweet_df.at[i, 'tweet_text']

    # tokenize each tweet
    tweet_tokens_raw = nltk.regexp_tokenize(tweet, pattern)

    # lowercase each token
    tweet_tokens = [word.lower() for word in tweet_tokens_raw]

    # remove stopwords and punctuation
    tweet_words_stopped = [word for word in tweet_tokens if word not in stopwords_list]

    # replace the raw tweet with its tokenized words in the dataframe;
    # .at avoids the chained assignment tweet_df['tweet_text'][i] = ...
    tweet_df.at[i, 'tweet_text'] = tweet_words_stopped

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dicch\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
tweet_df

Unnamed: 0,tweet_text
0,"[wesley, g, iphone, hrs, tweeting, rise, austi..."
1,"[jessedee, know, fludapp, awesome, ipad, iphon..."
2,"[swonderlin, wait, ipad, also, sale, sxsw]"
3,"[sxsw, hope, year's, festival, crashy, year's,..."
4,"[sxtxstate, great, stuff, fri, sxsw, marissa, ..."
...,...
9088,"[ipad, everywhere, sxsw, link]"
9089,"[wave, buzz, rt, mention, interrupt, regularly..."
9090,"[google's, zeiger, physician, never, reported,..."
9091,"[verizon, iphone, customers, complained, time,..."


### Categorize the tokenized data

Now that our tweets have been tokenized, let's create a list of keywords for Apple and Google products so that we can categorize the tweets respectively.

In [14]:
apple = ['ipad', 'iphone', 'imac', 'iwatch', 'itunes', 'icloud', 'apple', 'mac', 'macbook', 'macpro']
google = ['google']

We can build a dictionary that maps a tweet's index to 'Apple' or 'Google' whenever the tweet mentions one of that company's keywords. Tweets that mention neither get no entry.

The keys in the dictionary are the tweet indices, which let us match each label back to its tweet and categorize it accordingly.

In [15]:
product = {}
for i in tweet_df.index:
    for word in tweet_df.tweet_text[i]:
        if word in apple:
            product[i] = 'Apple'
            break
        elif word in google:
            product[i] = 'Google'
            break
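One consequence of the `break` statements in the loop above is that a tweet mentioning both companies takes the label of whichever keyword appears first. The sketch below shows that behaviour on made-up token lists. The `categorize` helper is hypothetical and not part of the notebook; the keyword lists are copied from it and stored as sets.

```python
# Keyword lists copied from the notebook, stored as sets for fast lookup
APPLE = {'ipad', 'iphone', 'imac', 'iwatch', 'itunes', 'icloud',
         'apple', 'mac', 'macbook', 'macpro'}
GOOGLE = {'google'}


def categorize(tokens):
    """Return the label of the first keyword found in tokens, or None."""
    for word in tokens:
        if word in APPLE:
            return 'Apple'
        if word in GOOGLE:
            return 'Google'
    return None


print(categorize(['google', 'maps', 'ipad']))  # first keyword is 'google'
print(categorize(['ipad', 'google']))          # first keyword is 'ipad'
print(categorize(['sxsw', 'link']))            # no keyword at all
```

Tweets for which the helper returns `None` correspond to the indices missing from the `product` dictionary.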

In [16]:
product

{0: 'Apple',
 1: 'Apple',
 2: 'Apple',
 3: 'Apple',
 4: 'Google',
 5: 'Apple',
 8: 'Apple',
 9: 'Apple',
 13: 'Google',
 14: 'Apple',
 15: 'Apple',
 16: 'Apple',
 17: 'Apple',
 18: 'Apple',
 19: 'Apple',
 20: 'Apple',
 21: 'Apple',
 23: 'Apple',
 25: 'Apple',
 26: 'Apple',
 27: 'Google',
 28: 'Apple',
 30: 'Apple',
 31: 'Apple',
 33: 'Apple',
 34: 'Apple',
 35: 'Google',
 36: 'Apple',
 37: 'Apple',
 38: 'Google',
 39: 'Google',
 40: 'Apple',
 41: 'Apple',
 42: 'Apple',
 43: 'Apple',
 44: 'Apple',
 45: 'Apple',
 46: 'Apple',
 47: 'Apple',
 48: 'Google',
 49: 'Apple',
 50: 'Apple',
 54: 'Google',
 56: 'Google',
 57: 'Apple',
 58: 'Apple',
 59: 'Google',
 60: 'Apple',
 61: 'Google',
 62: 'Apple',
 64: 'Apple',
 65: 'Apple',
 67: 'Apple',
 68: 'Apple',
 69: 'Apple',
 70: 'Google',
 72: 'Google',
 74: 'Google',
 75: 'Google',
 76: 'Apple',
 78: 'Apple',
 80: 'Apple',
 81: 'Apple',
 82: 'Apple',
 83: 'Apple',
 84: 'Google',
 89: 'Apple',
 92: 'Apple',
 93: 'Apple',
 95: 'Apple',
 96: 'Apple'

Let's transform this dictionary into a DataFrame.

In [17]:
product_df = pd.DataFrame.from_dict(product, orient='index')

In [18]:
product_df.rename(columns={0:'Product'}, inplace=True)

Let's join this with 'tweet_df' to categorize the tweets.

In [19]:
tweet_final_df = tweet_df.join(product_df, how='inner')
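Because the join uses `how='inner'`, only index labels present in both frames are kept, so tweets that matched no keyword are dropped from the result. A small sketch with made-up data illustrates this; the three toy tweets and the `labels` dictionary are assumptions for the example only.

```python
import pandas as pd

# Toy stand-in for tweet_df: three tokenized tweets with index 0, 1, 2
tweets = pd.DataFrame({'tweet_text': [['ipad', 'sxsw'], ['sxsw', 'link'], ['google', 'maps']]})

# Toy stand-in for the product dictionary: index 1 matched no keyword
labels = {0: 'Apple', 2: 'Google'}
labels_df = pd.DataFrame.from_dict(labels, orient='index').rename(columns={0: 'Product'})

# The inner join keeps only indices found in both frames, so row 1 is dropped
joined = tweets.join(labels_df, how='inner')
print(joined)
```

Using `how='left'` instead would keep every tweet and leave `Product` as `NaN` for the unmatched ones.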

In [20]:
tweet_final_df

Unnamed: 0,tweet_text,Product
0,"[wesley, g, iphone, hrs, tweeting, rise, austi...",Apple
1,"[jessedee, know, fludapp, awesome, ipad, iphon...",Apple
2,"[swonderlin, wait, ipad, also, sale, sxsw]",Apple
3,"[sxsw, hope, year's, festival, crashy, year's,...",Apple
4,"[sxtxstate, great, stuff, fri, sxsw, marissa, ...",Google
...,...,...
9086,"[google, says, want, give, lightning, talk, h,...",Google
9088,"[ipad, everywhere, sxsw, link]",Apple
9089,"[wave, buzz, rt, mention, interrupt, regularly...",Google
9091,"[verizon, iphone, customers, complained, time,...",Apple


We now have the final version of the tweets, categorized into Apple and Google products. We can use it in the next step to classify each tweet as either positive or negative.