# Twitter Sentiment and Modeling

## Overview

This project uses a dataset from CrowdFlower to build an NLP model that rates the sentiment of Tweets about Apple and Google products. Human raters labeled the sentiment of over 9,000 Tweets as positive, negative, or neither.

## Business Problem

Apple and Google want to understand how the public feels about their products, and they are looking to Twitter as a source of that information. The task is to build a model that can rate the sentiment of a Tweet based on its content.

## Data Understanding

The dataset used for this project is a CSV file ("data.csv") containing over 9,000 Tweets about Apple and Google products. Human raters labeled the sentiment of each Tweet as positive, negative, or neither. The target is the sentiment column.

### Import Libraries

First, we imported the libraries needed for analysis, visualization, data preprocessing, and model building, and suppressed warnings.

In [1]:
#import necessary libraries
import pandas as pd

from nltk.corpus import stopwords

### Data Inspection

We then loaded the CSV dataset and inspected its shape, column names, and data types, and checked for missing or duplicate entries.

In [2]:
#load the dataset, ensure the proper encoding is read
df = pd.read_csv('data.csv', encoding='latin1')
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [3]:
#change the name of the tweet, product, and sentiment columns
df = df.rename(columns={'tweet_text': 'tweet', 'emotion_in_tweet_is_directed_at': 'brand_or_product', 'is_there_an_emotion_directed_at_a_brand_or_product': 'sentiment'})
df.head()

Unnamed: 0,tweet,brand_or_product,sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [4]:
#look at the different values for sentiment column
df['sentiment'].value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: sentiment, dtype: int64

In [5]:
#look at the different values for brand_or_product column
df['brand_or_product'].value_counts()

iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: brand_or_product, dtype: int64

In [6]:
#check information on each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   tweet             9092 non-null   object
 1   brand_or_product  3291 non-null   object
 2   sentiment         9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


The tweet column is missing 1 value, and the brand_or_product column is missing 5,802 values (only 3,291 of the 9,093 rows are non-null).
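
The missing-value counts reported by `df.info()` can also be computed directly, and the duplicate check mentioned earlier follows the same pattern. The following is a minimal sketch on a small made-up DataFrame named `toy` (the rows are invented for illustration and are not taken from the dataset); the same calls should apply to `df`.

```python
import pandas as pd

# Toy stand-in for the tweet data; these rows are invented for illustration.
toy = pd.DataFrame({
    "tweet": ["love my iPad", None, "love my iPad", "Android app crashed"],
    "brand_or_product": ["iPad", None, "iPad", None],
    "sentiment": ["Positive emotion", "No emotion toward brand or product",
                  "Positive emotion", "Negative emotion"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows that repeat an earlier row in every column
duplicate_rows = int(toy.duplicated().sum())

# Rows whose tweet text repeats an earlier tweet, ignoring the other columns
duplicate_tweets = int(toy.duplicated(subset=["tweet"]).sum())

print(missing_per_column)
print(duplicate_rows, duplicate_tweets)
```

Checking duplicates on the `tweet` column alone, as well as on whole rows, is one way to catch retweeted text that was labeled more than once; whether such rows should be dropped is a separate modeling decision.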

In [7]:
#look at the row with the missing value for the 'tweet' column
df.loc[df['tweet'].isnull()]

Unnamed: 0,tweet,brand_or_product,sentiment
6,,,No emotion toward brand or product


#### Drop Missing Tweet 

The one row with a missing tweet contains nothing else useful (its brand_or_product value is also missing), so we dropped it.

In [8]:
#drop missing tweet row
df.dropna(subset=['tweet'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   tweet             9092 non-null   object
 1   brand_or_product  3291 non-null   object
 2   sentiment         9092 non-null   object
dtypes: object(3)
memory usage: 284.1+ KB
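
After the drop, the index still runs from 0 to 9092 but holds only 9,092 entries, so one label is missing. This matters when mixing position-based (`iloc`) and label-based (`loc`) lookups. The sketch below illustrates the difference on a made-up five-row DataFrame named `toy`; renumbering with `reset_index` is shown as an optional step, not something the original analysis does.

```python
import pandas as pd

# Made-up data: the third row has a missing tweet.
toy = pd.DataFrame({"tweet": ["a", "b", None, "d", "e"]})
toy = toy.dropna(subset=["tweet"])

# The dropped row leaves a gap in the index labels.
labels_after_drop = list(toy.index)

# Position 2 now holds the row whose label is 3.
by_position = toy.iloc[2]["tweet"]
by_label = toy.loc[3, "tweet"]

# Optional: renumber the rows 0..n-1 so positions and labels agree again.
renumbered = toy.reset_index(drop=True)

print(labels_after_drop, by_position, by_label, list(renumbered.index))
```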


### Data Cleaning

We performed standard cleaning steps, such as standardizing and tokenizing the tweet text.

In [9]:
#pull a sample tweet to inspect before cleaning
windows_sample = df.iloc[7]["tweet"]
windows_sample

'Beautifully smart and simple idea RT @madebymany @thenextweb wrote about our #hollergram iPad app for #sxsw! http://bit.ly/ieaVOB'
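
As an illustration of what standardizing and tokenizing might look like for the sample tweet above, here is a minimal sketch using only the standard library. The function name `clean_tweet`, the regular expressions, and the short `STOPWORDS` set are assumptions made for this example; the notebook imports NLTK's stopword list, which would be a larger replacement for the hand-written set here.

```python
import re

# Small illustrative stopword set; an assumption, not the NLTK list.
STOPWORDS = {"a", "an", "and", "the", "for", "our", "about", "is", "to", "of"}

def clean_tweet(text):
    """Lowercase a tweet, strip URLs and @mentions, and return word tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    # Keep words that start with a letter; the '#' of a hashtag is dropped.
    tokens = re.findall(r"\b[a-z][a-z0-9']*", text)
    return [t for t in tokens if t not in STOPWORDS]

sample = ("Beautifully smart and simple idea RT @madebymany @thenextweb "
          "wrote about our #hollergram iPad app for #sxsw! http://bit.ly/ieaVOB")
tokens = clean_tweet(sample)
print(tokens)
```

Tokens such as `rt` and `sxsw` survive this pass; whether to filter them is a judgment call that depends on how much signal they carry for sentiment.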