# Background

The objective of this project is to classify the overall sentiment of a tweet as neutral, negative, or positive using NLP classifiers. To complete this project, we are given a dataset of 27,481 tweets, of which 22,464 were captured as having either a neutral, negative, or positive sentiment. Our goal is to use this training data of ~27.5k tweets to predict the sentiment of the 3,534 tweets in our testing dataset.

# Objective
1. Training Data - Use the 'selected_text' column to build, for each sentiment, a set of the words that appear in it.
2. Testing Data - Count the words in each tweet that appear in each sentiment's set, assign the sentiment with the highest percentage, and assess precision.

## Import Libraries

In [1]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Read Training Dataset

In [2]:
train = pd.read_csv("/Users/bethelikejiofor/Documents/GitHub/ENTITY-Final-Project/Data/train.csv")
train.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


## Data Wrangling
Data wrangling for the training set consists of dropping the ID column, lowercasing the selected text, removing punctuation, and splitting each sentence into words. We then subset the dataset by sentiment category to build the word sets that will be tried on the testing dataset.

### Dropping ID column

In [3]:
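# .copy() makes an independent frame, so adding columns later does not trigger SettingWithCopyWarning
train = train[['text', 'selected_text', 'sentiment']].copy()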

### Punctuation Removal

In [4]:
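def removepunct(text):
    # Remove every character that is neither a word character (letter, digit, underscore) nor whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return text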
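
To make the effect of `removepunct` concrete, here is a small sketch that applies the same regular expression to three strings taken from the data shown in this notebook. The names `samples` and `cleaned` are only for illustration. Note that a URL collapses into a single token and that a contraction such as i`d merges into one word, which affects the vocabulary built later.

```python
import re

def removepunct(text):
    # Same pattern as in the notebook: keep word characters and whitespace only
    return re.sub(r'[^\w\s]', '', text)

# Illustrative inputs copied from the tables in this notebook
samples = [
    "i`d have responded, if i were going",
    "http://twitpic.com/67ezh",
    "sons of ****,",
]
cleaned = [removepunct(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

The last sample keeps a trailing space after punctuation removal, which is harmless here because the text is later split on whitespace.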

In [5]:
train['selected_text'] = train.selected_text.astype(str).str.lower()
train['selected_text_clean'] = train['selected_text'].apply(removepunct)

In [6]:
train.head()

Unnamed: 0,text,selected_text,sentiment,selected_text_clean
0,"I`d have responded, if I were going","i`d have responded, if i were going",neutral,id have responded if i were going
1,Sooo SAD I will miss you here in San Diego!!!,sooo sad,negative,sooo sad
2,my boss is bullying me...,bullying me,negative,bullying me
3,what interview! leave me alone,leave me alone,negative,leave me alone
4,"Sons of ****, why couldn`t they put them on t...","sons of ****,",negative,sons of


In [7]:
train['selected_text_clean'] = train['selected_text_clean'].str.split()

### Neutral Dataset

In [8]:
neutral = train[train['sentiment']=='neutral']
neutral.head()

Unnamed: 0,text,selected_text,sentiment,selected_text_clean
0,"I`d have responded, if I were going","i`d have responded, if i were going",neutral,"[id, have, responded, if, i, were, going]"
5,http://www.dothebouncy.com/smf - some shameles...,http://www.dothebouncy.com/smf - some shameles...,neutral,"[httpwwwdothebouncycomsmf, some, shameless, pl..."
7,Soooo high,soooo high,neutral,"[soooo, high]"
8,Both of you,both of you,neutral,"[both, of, you]"
10,"as much as i love to be hopeful, i reckon the...","as much as i love to be hopeful, i reckon the ...",neutral,"[as, much, as, i, love, to, be, hopeful, i, re..."


In [9]:
neu_df = pd.DataFrame(neutral['selected_text_clean'].tolist()).add_prefix('word_')

In [10]:
neu_df.head()

Unnamed: 0,word_0,word_1,word_2,word_3,word_4,word_5,word_6,word_7,word_8,word_9,...,word_23,word_24,word_25,word_26,word_27,word_28,word_29,word_30,word_31,word_32
0,id,have,responded,if,i,were,going,,,,...,,,,,,,,,,
1,httpwwwdothebouncycomsmf,some,shameless,plugging,for,the,best,rangers,forum,on,...,,,,,,,,,,
2,soooo,high,,,,,,,,,...,,,,,,,,,,
3,both,of,you,,,,,,,,...,,,,,,,,,,
4,as,much,as,i,love,to,be,hopeful,i,reckon,...,,,,,,,,,,


In [11]:
neutral_set = set()
for col in neu_df.columns:
    # dropna() keeps the empty padding of shorter tweets out of the set
    neutral_set.update(neu_df[col].dropna().unique())
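
The same kind of per-sentiment word set can also be built without the intermediate wide DataFrame. The following is a minimal sketch, not the approach used in this notebook: it assumes a toy frame named `toy` with the same two columns, and collects the result in a dictionary named `vocab` (both names are hypothetical). It relies on `Series.explode`, which turns each list element into its own row.

```python
import pandas as pd

# Toy stand-in for the training frame after the split step (hypothetical data)
toy = pd.DataFrame({
    "sentiment": ["neutral", "neutral", "positive"],
    "selected_text_clean": [["both", "of", "you"], ["soooo", "high"], ["fun"]],
})

# One set of words per sentiment label; dropna() guards against empty token lists
vocab = {
    label: set(group["selected_text_clean"].explode().dropna())
    for label, group in toy.groupby("sentiment")
}

print(vocab)
```

This avoids hard-coding the number of `word_` columns, which differs for each sentiment subset.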

### Positive Dataset

In [12]:
positive = train[train['sentiment']=='positive']
positive_set = set()

In [13]:
pos_df = pd.DataFrame(positive['selected_text_clean'].tolist()).add_prefix('word_')

In [14]:
pos_df.head()

Unnamed: 0,word_0,word_1,word_2,word_3,word_4,word_5,word_6,word_7,word_8,word_9,...,word_20,word_21,word_22,word_23,word_24,word_25,word_26,word_27,word_28,word_29
0,fun,,,,,,,,,,...,,,,,,,,,,
1,wow,u,just,became,cooler,,,,,,...,,,,,,,,,,
2,like,,,,,,,,,,...,,,,,,,,,,
3,interesting,,,,,,,,,,...,,,,,,,,,,
4,the,free,fillin,app,on,my,ipod,is,fun,im,...,,,,,,,,,,


In [15]:
positive_set = set()
for col in pos_df.columns:
    # dropna() keeps the empty padding of shorter tweets out of the set
    positive_set.update(pos_df[col].dropna().unique())

### Negative Dataset

In [16]:
negative = train[train['sentiment']=='negative']
negative_set = set()

In [17]:
neg_df = pd.DataFrame(negative['selected_text_clean'].tolist()).add_prefix('word_')

In [18]:
neg_df.head()

Unnamed: 0,word_0,word_1,word_2,word_3,word_4,word_5,word_6,word_7,word_8,word_9,...,word_19,word_20,word_21,word_22,word_23,word_24,word_25,word_26,word_27,word_28
0,sooo,sad,,,,,,,,,...,,,,,,,,,,
1,bullying,me,,,,,,,,,...,,,,,,,,,,
2,leave,me,alone,,,,,,,,...,,,,,,,,,,
3,sons,of,,,,,,,,,...,,,,,,,,,,
4,dangerously,,,,,,,,,,...,,,,,,,,,,


In [19]:
negative_set = set()
for col in neg_df.columns:
    # dropna() keeps the empty padding of shorter tweets out of the set
    negative_set.update(neg_df[col].dropna().unique())

## Read Testing Dataset

In [20]:
test = pd.read_csv("/Users/bethelikejiofor/Documents/GitHub/ENTITY-Final-Project/Data/test.csv")
test.head()

Unnamed: 0,textID,text,sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative
3,01082688c6,happy bday!,positive
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive


## Data Wrangling
As with the training data, we drop the ID column, lowercase the text, and remove punctuation. The cleaned text is split into words when the matches are counted.

In [21]:
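# .copy() makes an independent frame, so adding columns later does not trigger SettingWithCopyWarning
test = test[['text', 'sentiment']].copy()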

In [22]:
test['text'] = test.text.astype(str).str.lower()
test['text_clean'] = test['text'].apply(removepunct)

In [23]:
test.head()

Unnamed: 0,text,sentiment,text_clean
0,last session of the day http://twitpic.com/67ezh,neutral,last session of the day httptwitpiccom67ezh
1,shanghai is also really exciting (precisely -...,positive,shanghai is also really exciting precisely s...
2,"recession hit veronique branquinho, she has to...",negative,recession hit veronique branquinho she has to ...
3,happy bday!,positive,happy bday
4,http://twitpic.com/4w75p - i like it!!,positive,httptwitpiccom4w75p i like it


Getting the number of positive, negative, and neutral words for each tweet

In [24]:
test['negative'] = test['text_clean'].apply(lambda x: len([val for val in x.split() if val in negative_set]))

In [25]:
test['positive'] = test['text_clean'].apply(lambda x: len([val for val in x.split() if val in positive_set]))

In [26]:
test['neutral'] = test['text_clean'].apply(lambda x: len([val for val in x.split() if val in neutral_set]))
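
The three counting cells follow one pattern, which can be sketched with a single helper. This is an illustration under assumptions: the helper `count_matches`, the dictionary `toy_sets`, and the frame `toy_test` are hypothetical and use made-up words rather than the real sets built from the training data.

```python
import pandas as pd

# Hypothetical toy vocabularies; the real sets come from the training data
toy_sets = {
    "negative": {"sad", "i", "it"},
    "positive": {"happy", "like", "i", "it"},
    "neutral": {"i", "it", "the"},
}

def count_matches(text, vocab):
    """Count the whitespace-separated tokens of `text` that are in `vocab`."""
    return sum(token in vocab for token in text.split())

toy_test = pd.DataFrame({"text_clean": ["i like it", "so sad"]})
for label, vocab in toy_sets.items():
    # One count column per sentiment, mirroring the three cells above
    toy_test[label] = toy_test["text_clean"].apply(lambda t: count_matches(t, vocab))

print(toy_test)
```

Because a common word can belong to more than one set, a tweet can score the same or nearly the same in every column, as the counts in the next table show.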

## Percentage Sentiment Analysis

In [27]:
test.head()

Unnamed: 0,text,sentiment,text_clean,negative,positive,neutral
0,last session of the day http://twitpic.com/67ezh,neutral,last session of the day httptwitpiccom67ezh,5,5,5
1,shanghai is also really exciting (precisely -...,positive,shanghai is also really exciting precisely s...,7,9,9
2,"recession hit veronique branquinho, she has to...",negative,recession hit veronique branquinho she has to ...,11,9,10
3,happy bday!,positive,happy bday,2,2,2
4,http://twitpic.com/4w75p - i like it!!,positive,httptwitpiccom4w75p i like it,3,3,3


## Returning Classification Based on Highest Percentage

In [30]:
test['PredictedR'] = test[['negative','positive','neutral']].idxmax(axis=1)
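
One detail of `idxmax` matters here: when several columns share the maximum, it returns the first of them in column order. The short sketch below uses the counts of the first three tweets from the table above; the frame name `counts` is only for illustration. With the column order `negative`, `positive`, `neutral`, every full tie resolves to `negative`.

```python
import pandas as pd

# Counts of the first three test tweets, copied from the table above
counts = pd.DataFrame(
    {"negative": [5, 7, 11], "positive": [5, 9, 9], "neutral": [5, 9, 10]}
)

# idxmax returns the label of the first column holding the row maximum
predicted = counts[["negative", "positive", "neutral"]].idxmax(axis=1)

print(predicted.tolist())
```

Reordering the columns passed to `idxmax` would change which label wins a tie, so the column order acts as an implicit tie-breaking rule.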

In [31]:
test.head()

Unnamed: 0,text,sentiment,text_clean,negative,positive,neutral,PredictedR
0,last session of the day http://twitpic.com/67ezh,neutral,last session of the day httptwitpiccom67ezh,5,5,5,negative
1,shanghai is also really exciting (precisely -...,positive,shanghai is also really exciting precisely s...,7,9,9,positive
2,"recession hit veronique branquinho, she has to...",negative,recession hit veronique branquinho she has to ...,11,9,10,negative
3,happy bday!,positive,happy bday,2,2,2,negative
4,http://twitpic.com/4w75p - i like it!!,positive,httptwitpiccom4w75p i like it,3,3,3,negative
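
The objective also calls for assessing precision. As a hedged sketch of how that could be done with pandas alone, the example below compares `PredictedR` with `sentiment` on the five rows displayed above. The frame `toy` and the variables `accuracy`, `confusion`, and `precision` are assumptions introduced for illustration, and five rows are far too few to say anything about the full test set.

```python
import pandas as pd

# The five rows shown above: actual label and predicted label
toy = pd.DataFrame({
    "sentiment":  ["neutral", "positive", "negative", "positive", "positive"],
    "PredictedR": ["negative", "positive", "negative", "negative", "negative"],
})

# Share of rows where the prediction equals the actual label
accuracy = (toy["sentiment"] == toy["PredictedR"]).mean()

# Rows are predicted labels, columns are actual labels
confusion = pd.crosstab(toy["PredictedR"], toy["sentiment"])

# Precision per predicted class: correct predictions / all predictions of that class
precision = {
    label: confusion.loc[label, label] / confusion.loc[label].sum()
    for label in confusion.index
    if label in confusion.columns
}

print(accuracy)
print(precision)
```

On the full data the same lines would apply to `test` instead of `toy`.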
