## **Importing Libraries**

In [12]:
# General Libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Plotting Libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import matplotlib.cm as cm
from matplotlib import rcParams
plt.style.use('ggplot')

# Machine Learning Libraries
from sklearn.metrics import roc_curve, auc, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
# NLP Libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re
import string
from nltk.tokenize import RegexpTokenizer



## **Loading Dataset**

In [21]:
#Loading dataset
df = pd.read_csv('../data_file/tweet_sentiments.csv', encoding='ISO-8859-1')

# Display the first five rows of the dataset
df.head()


| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


## **Exploratory Data Analysis**

In [22]:
# Checking the shape of the dataset
df.shape

(9093, 3)

The dataset has shape (9093, 3), meaning 9,093 rows and 3 columns:

Rows (9093): Each row is one tweet, so we have 9,093 individual tweets to analyze.

Columns (3): The three columns, as shown in the preview above, are:

**tweet_text:** The text content of the tweet.

**emotion_in_tweet_is_directed_at:** The specific brand or product that the emotion in the tweet is directed towards.

**is_there_an_emotion_directed_at_a_brand_or_product:** The sentiment or emotion expressed in relation to the brand or product (e.g., positive or negative).
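
The shape tuple can also be unpacked directly, which is handy for printing or for later sanity checks. The snippet below is a minimal sketch on a made-up two-row frame (the `toy` name and its rows are illustrative assumptions, not data from the notebook) that reuses the three column names described above.

```python
import pandas as pd

# Toy frame with the same three column names as the dataset.
# The rows are invented purely for illustration.
toy = pd.DataFrame({
    "tweet_text": ["great phone", "the app keeps crashing"],
    "emotion_in_tweet_is_directed_at": ["iPhone", "iPad or iPhone App"],
    "is_there_an_emotion_directed_at_a_brand_or_product": [
        "Positive emotion",
        "Negative emotion",
    ],
})

# .shape is a (rows, columns) tuple, so it unpacks into two integers
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")

# List the column names to confirm what each column holds
print(toy.columns.tolist())
```

On the real data, the same two lines applied to `df` should report 9,093 rows and 3 columns.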

In [23]:
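# Note: `df.info` without parentheses only displays the bound method (the output below);
# call `df.info()` to get the column dtypes and non-null counts instead.
df.info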

<bound method DataFrame.info of                                              tweet_text  \
0     .@wesley83 I have a 3G iPhone. After 3 hrs twe...   
1     @jessedee Know about @fludapp ? Awesome iPad/i...   
2     @swonderlin Can not wait for #iPad 2 also. The...   
3     @sxsw I hope this year's festival isn't as cra...   
4     @sxtxstate great stuff on Fri #SXSW: Marissa M...   
...                                                 ...   
9088                      Ipad everywhere. #SXSW {link}   
9089  Wave, buzz... RT @mention We interrupt your re...   
9090  Google's Zeiger, a physician never reported po...   
9091  Some Verizon iPhone customers complained their...   
9092  Ï¡Ïàü_ÊÎÒ£Áââ_£â_ÛâRT @...   

     emotion_in_tweet_is_directed_at  \
0                             iPhone   
1                 iPad or iPhone App   
2                               iPad   
3                 iPad or iPhone App   
4                             Google   
...                

In [24]:
# Check for any missing values 
df.isnull().sum()


tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64
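
Raw counts are easier to judge as a share of all rows: 5,802 missing values out of 9,093 rows is roughly 64% of the brand/product column, while the single missing tweet_text is negligible. The following is a small sketch of how counts and shares could be reported side by side; the `toy` frame and the `summary` name are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with invented rows: 1 of 4 texts and 3 of 4 brands are missing
toy = pd.DataFrame({
    "tweet_text": ["a", "b", None, "d"],
    "emotion_in_tweet_is_directed_at": ["iPhone", None, None, None],
})

# isnull() gives a boolean frame; sum() counts True values per column
# and mean() gives the fraction of True values per column
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary)
```

Replacing `toy` with `df` would give the same summary for the full dataset.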

In [25]:
# Check which entry is missing in tweet_text
missing_tweet = df[df['tweet_text'].isnull()]
print("Missing tweet text:")
print(missing_tweet)

# Count the missing values in the brand/product column
emotion_counts = df['emotion_in_tweet_is_directed_at'].isnull().sum()
print(f"Number of missing emotions directed at a brand or product: {emotion_counts}")

# Check the distribution of values in the sentiment column
sentiment_counts = df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()
print("Distribution of sentiment toward brand or product:")
print(sentiment_counts)

# Clean data by dropping rows with missing tweet_text if necessary
df_cleaned = df.dropna(subset=['tweet_text'])
print("Cleaned DataFrame:")
df_cleaned.info()  # info() prints its summary directly and returns None, so print() is not needed

Missing tweet text:
  tweet_text emotion_in_tweet_is_directed_at  \
6        NaN                             NaN   

  is_there_an_emotion_directed_at_a_brand_or_product  
6                 No emotion toward brand or product  
Number of missing emotions directed at a brand or product: 5802
Distribution of sentiment toward brand or product:
No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
Cleaned DataFrame:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at              


**Sentiment Toward Brand or Product (is_there_an_emotion_directed_at_a_brand_or_product)**

The distribution is:

5389 entries show no emotion toward a brand or product.

2978 entries show positive emotion.

570 entries show negative emotion.

156 entries have the value "I can't tell."
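
Expressing these counts as proportions makes the class imbalance easier to see: negative tweets are only about 6% of the data. The sketch below rebuilds the counts reported above as a Series and normalizes them; on the actual DataFrame, `value_counts(normalize=True)` on the sentiment column should give the same proportions in one step.

```python
import pandas as pd

# Counts copied from the value_counts() output above
sentiment_counts = pd.Series({
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
})

# Divide each count by the total to get the share of each class
shares = sentiment_counts / sentiment_counts.sum()
print(shares.round(3))
```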

**Missing Tweet Text**

The original dataset contained 9,093 rows, exactly one of which (index 6) had no tweet_text. After dropping that row, the cleaned dataset has 9,092 rows with no missing entries in the tweet_text column. The reduction is negligible, and it ensures every remaining row has text to analyze.
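
A quick check can confirm that the drop removed only the rows it was meant to. The following is a minimal sketch on a made-up frame (the `toy` frame and its `label` column are assumptions for illustration). The final `reset_index` call is an optional extra, not something the notebook does: dropping a row leaves a gap in the index, which is why the output above still runs from 0 to 9092 with only 9,092 entries.

```python
import pandas as pd

# Toy frame with one missing tweet text (rows invented for illustration)
toy = pd.DataFrame({
    "tweet_text": ["love the new iPad", None, "battery died again"],
    "label": [
        "Positive emotion",
        "No emotion toward brand or product",
        "Negative emotion",
    ],
})

n_missing = toy["tweet_text"].isnull().sum()
cleaned = toy.dropna(subset=["tweet_text"])

# The row count should fall by exactly the number of missing texts
assert len(cleaned) == len(toy) - n_missing
# No missing texts should remain
assert cleaned["tweet_text"].notnull().all()

# Optional: rebuild a gap-free 0..n-1 index after the drop
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```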

