# Introduction

## Final Project Submission

***
- Student Name: Adam Marianacci
- Student Pace: Flex
- Scheduled project review date/time: TBD
- Instructor Name: Mark Barbour

# Business Understanding

My task is to help SXSW identify positive sentiment in tweets about its event.

# Data Understanding

This dataset comes from CrowdFlower via data.world. The initial DataFrame contains 9,093 tweets, each labeled with the sentiment it expresses and the brand or product it is directed at. Limitations of the dataset include missing values and a class imbalance in the sentiment labels: about 59% of the tweets show no emotion, about 33% show a positive emotion, and only about 6% show a negative emotion. To address this imbalance, I relabeled a random sample of the 'no emotion' tweets as negative, dropped the remaining 'no emotion' tweets, and treated the result as a 'Not Positive' class the same size as the 'Positive' class. The dataset suits the project because it lets me train a sentiment classifier on the tweet text against a binary target: positive or not positive.


Dataset: [Brands and Product Emotions](https://data.world/crowdflower/brands-and-product-emotions)
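
The class shares quoted above can be reproduced from the raw label counts. A minimal sketch using only the standard library; the counts are the ones reported by `value_counts()` later in this notebook, and the variable names are illustrative.

```python
# Label counts as reported by corpus['sentiment'].value_counts() in this notebook
label_counts = {
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
}

total = sum(label_counts.values())  # number of rows in the raw file

# Share of each label, as a percentage rounded to one decimal place
label_shares = {
    label: round(100 * n / total, 1) for label, n in label_counts.items()
}

for label, share in label_shares.items():
    print(f"{label}: {share}%")
```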

# Data Preparation

In [1]:
# Importing the necessary libraries

%load_ext autoreload
%autoreload 2

import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd
import numpy as np
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix
from sklearn.model_selection import train_test_split
from collections import defaultdict
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Loading the data, and looking at the shape of the df

corpus = pd.read_csv('data/twitter_sentiment.csv', encoding='latin1')
corpus.shape

(9093, 3)

In [3]:
# previewing the dataframe
corpus.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [4]:
# Taking a look at the datatypes and checking for missing values
corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [5]:
# Dropping the 'emotion_in_tweet_is_directed_at' column because it is mostly missing and not needed for our problem.
corpus.drop('emotion_in_tweet_is_directed_at', axis=1, inplace=True)

In [6]:
# renaming the 'is_there_an_emotion...' column to 'sentiment'
corpus.rename(columns={
    'is_there_an_emotion_directed_at_a_brand_or_product': 'sentiment'}, inplace=True)

In [7]:
# Inspecting the values in 'sentiment'. The classes are imbalanced.
corpus['sentiment'].value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: sentiment, dtype: int64

In [8]:
# Dropping the "I can't tell" category because it carries no usable label and accounts for relatively few rows.
corpus.drop(corpus[corpus['sentiment'] == "I can't tell"].index, inplace=True)

In [9]:
# Creating a mask to identify rows with "No emotion toward brand or product"
no_emotion_mask = corpus['sentiment'] == "No emotion toward brand or product"

# Sampling 2,408 of those rows and relabeling them as "Negative emotion"
# (570 + 2,408 = 2,978, which matches the size of the positive class)
no_emotion_indices = corpus[no_emotion_mask].sample(n=2408, random_state=42).index
corpus.loc[no_emotion_indices, 'sentiment'] = "Negative emotion"

# Verifying the changes
print(corpus['sentiment'].value_counts())

No emotion toward brand or product    2981
Negative emotion                      2978
Positive emotion                      2978
Name: sentiment, dtype: int64


In [10]:
# Create a mask to identify rows with "No emotion toward brand or product"
no_emotion_mask = corpus['sentiment'] == "No emotion toward brand or product"

# Drop the rows with this mask
corpus.drop(corpus[no_emotion_mask].index, inplace=True)

# Verify the changes
print(corpus['sentiment'].value_counts())

Negative emotion    2978
Positive emotion    2978
Name: sentiment, dtype: int64


In [11]:
# Define the mapping of old values to new values
mapping = {'Positive emotion': 'Positive', 'Negative emotion': 'Not Positive'}

# Replace the categories in the 'sentiment' column
corpus['sentiment'] = corpus['sentiment'].replace(mapping)

# Verify the changes
print(corpus['sentiment'].value_counts())

Positive        2978
Not Positive    2978
Name: sentiment, dtype: int64


In [12]:
# Assigning 'Positive' sentiment to 1 and 'Not Positive' to 0
corpus['sentiment'] = corpus['sentiment'].replace(
    {'Positive': 1, 'Not Positive': 0})

In cells 9-12 we set this up as a binary classification problem. The original labels were imbalanced: "Positive emotion" (2,978 tweets) far outnumbered "Negative emotion" (570 tweets). To balance them, we randomly sampled 2,408 tweets from "No emotion toward brand or product", relabeled them as "Negative emotion", and dropped the remaining 2,981 "no emotion" tweets. The combined category was then renamed 'Not Positive', which gives 2,978 tweets in each class. Finally, we encoded 'Positive' as 1 and 'Not Positive' as 0.
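
The relabel-then-drop steps in cells 9-12 could also be collected into a single helper. The sketch below is illustrative rather than the code this notebook ran: `balance_binary` and the toy DataFrame are hypothetical, and it assumes the same label strings as the dataset.

```python
import pandas as pd

NEUTRAL = "No emotion toward brand or product"
NEGATIVE = "Negative emotion"
POSITIVE = "Positive emotion"


def balance_binary(df, label_col="sentiment", random_state=42):
    """Return a copy with a balanced 1 (positive) / 0 (not positive) target.

    Hypothetical helper: tops up the negative class with randomly sampled
    neutral rows until it matches the positive class, then drops the rest.
    """
    # Discard any other label (for example "I can't tell")
    df = df[df[label_col].isin([NEUTRAL, NEGATIVE, POSITIVE])].copy()

    n_pos = int((df[label_col] == POSITIVE).sum())
    n_neg = int((df[label_col] == NEGATIVE).sum())
    n_needed = n_pos - n_neg  # 2,978 - 570 = 2,408 in the real data

    neutral = df[df[label_col] == NEUTRAL]
    if not 0 <= n_needed <= len(neutral):
        raise ValueError("cannot balance: not enough neutral rows to sample")

    # Neutral rows that will be counted as 'not positive'
    sampled_idx = neutral.sample(n=n_needed, random_state=random_state).index

    # Keep positives, negatives, and the sampled neutral rows only
    keep = df[label_col].isin([POSITIVE, NEGATIVE]) | df.index.isin(sampled_idx)
    out = df[keep].copy()

    # Encode the target: 1 for positive, 0 for everything else that was kept
    out[label_col] = (out[label_col] == POSITIVE).astype(int)
    return out


# Toy data: 4 positive, 1 negative, 6 neutral, 1 ambiguous
toy = pd.DataFrame({
    "tweet_text": [f"tweet {i}" for i in range(12)],
    "sentiment": [POSITIVE] * 4 + [NEGATIVE] + [NEUTRAL] * 6 + ["I can't tell"],
})
balanced = balance_binary(toy)
print(balanced["sentiment"].value_counts())
```

Note that the sampled neutral tweets are not actually negative, so the 0 class is best read as "not positive" rather than "negative".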

In [13]:
# Inspecting the DF once again to make sure everything looks correct after all the changes we made.
corpus.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5956 entries, 0 to 9092
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   tweet_text  5956 non-null   object
 1   sentiment   5956 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 139.6+ KB


In [14]:
# Summary statistics for 'sentiment'; a mean of 0.5 confirms the two classes are balanced
corpus.describe()

| | sentiment |
|---|---|
| count | 5956.0 |
| mean | 0.5 |
| std | 0.500042 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.5 |
| 75% | 1.0 |
| max | 1.0 |


In [15]:
corpus.head()

| | tweet_text | sentiment |
|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | 0 |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | 1 |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | 1 |
| 3 | @sxsw I hope this year's festival isn't as cra... | 0 |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | 1 |


In [16]:
# Bring in stopwords

sw = stopwords.words('english')

In [17]:
# Looking at one tweet to start my data cleaning process
tweet_index = 0  # Choose a tweet by changing the index

# Accessing the tweet text using iloc
tweet = corpus.iloc[tweet_index]['tweet_text']
print(tweet)

.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.


In [18]:
# set(tweet) would count unique characters, so split on whitespace to count tokens
print(f'The raw tweet has {len(set(tweet.split()))} unique tokens')

The raw tweet has 21 unique tokens


In [19]:
# Defining X and y
# The train/test split (T,T,S) will be applied to these
X = corpus.tweet_text
y = corpus.sentiment
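
The features and target defined above would typically be split into training and test sets before any vectorizer is fitted, so that no information from the test tweets leaks into the vocabulary. One way this might look, shown on a small stand-in for `X` and `y`; the 80/20 ratio, the `random_state`, and the use of `stratify` are assumptions, not settings taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same layout as corpus (text column + 0/1 target)
demo = pd.DataFrame({
    "tweet_text": [f"example tweet {i}" for i in range(10)],
    "sentiment": [0, 1] * 5,
})
X_demo = demo["tweet_text"]
y_demo = demo["sentiment"]

# stratify keeps the 50/50 class balance in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_train), len(X_test))
print(y_test.value_counts().to_dict())
```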

In [None]:
# identifying words, getting rid of punctuation and numbers
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
tokenizer = RegexpTokenizer(pattern)
sample_tweet = tokenizer.tokenize(tweet)

In [None]:
sample_tweet = [token.lower() for token in sample_tweet]
sample_tweet = [token for token in sample_tweet if token not in sw]

In [None]:
print(sample_tweet)

In [None]:
print(f'We are down to {len(set(sample_tweet))} unique words')
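
To see what the token pattern keeps and discards, the same regular expression can be run with the standard-library `re` module. This is a sketch, not the NLTK pipeline itself: `tiny_stopwords` is a small hypothetical stand-in for NLTK's English stopword list so the example needs no downloads, and the sample text is made up.

```python
import re

# Same pattern as above: a run of letters, optionally followed by an
# apostrophe and lowercase letters (so "can't" stays one token)
TOKEN_PATTERN = re.compile(r"[a-zA-Z]+(?:'[a-z]+)?")

# Hypothetical mini stopword list (NLTK's real list is much longer)
tiny_stopwords = {"i", "a", "it", "to", "at", "have", "was", "after"}

text = "After 3 hrs tweeting at #RISE_Austin, it was dead! I can't wait to upgrade."

# Lowercase every match, then filter out the stopwords
tokens = [t.lower() for t in TOKEN_PATTERN.findall(text)]
kept = [t for t in tokens if t not in tiny_stopwords]

print(tokens)
print(kept)
```

Digits, `#`, and `_` never match the pattern, so `3` disappears and `#RISE_Austin` is split into two tokens.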

In [None]:
def preprocess_tweet(tweet):
    # Define the regular expression pattern
    pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"
    
    # Create a tokenizer using the pattern
    tokenizer = RegexpTokenizer(pattern)
    
    # Tokenize the tweet
    tweet_tokens = tokenizer.tokenize(tweet)
    
    # Convert tokens to lowercase
    tweet_tokens = [token.lower() for token in tweet_tokens]
    
    # Get stopwords
    stop_words = set(stopwords.words('english'))
    # Add 'sxsw' to the set of stopwords
    stop_words.add('sxsw')
    
    # Remove stopwords
    tweet_tokens = [token for token in tweet_tokens if token not in stop_words]
    
    return tweet_tokens

# Preprocess the tweet
processed_tweet = preprocess_tweet(tweet)

# Print the processed tweet
print(processed_tweet)

In [None]:
# Access the tweet by position (index labels are no longer contiguous after dropping rows)
tweet_to_process = corpus.iloc[tweet_index]['tweet_text']

# Preprocess the chosen tweet
processed_tweet = preprocess_tweet(tweet_to_process)

# Print the original and processed tweets
print("Original Tweet:")
print(tweet_to_process)
print("\nProcessed Tweet:")
print(processed_tweet)
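
Because rows were dropped during balancing, the DataFrame's index labels are no longer contiguous, so label-based (`.loc`) and position-based (`.iloc`) lookups can return different rows or fail outright. A toy illustration with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"tweet_text": ["first", "second", "third", "fourth"]})

# Drop the row labelled 1, as the balancing steps did for many rows
df = df.drop(index=1)

print(df.index.tolist())          # remaining labels
print(df.iloc[1]["tweet_text"])   # second row by position
print(df.loc[2, "tweet_text"])    # row whose label is 2

# Label 1 no longer exists, so a label-based lookup raises KeyError
try:
    df.loc[1, "tweet_text"]
except KeyError:
    print("label 1 is gone")

# reset_index(drop=True) makes labels and positions line up again
df = df.reset_index(drop=True)
```

Resetting the index once after the cleaning steps is one option for keeping the two access styles interchangeable.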

# Modeling

# Evaluation

# Conclusion

## Limitations

## Recommendations

## Next Steps