## Emotion Classification

Task
1. Apply 'One-Hot Encoding' to the label series
2. Classify each message as one of four emotions:
   - Joy
   - Fear
   - Anger
   - Sadness
3. Split the dataset into training and test sets
4. Build and evaluate the model

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
df = pd.read_csv(r"D:\Open Classroom\Datasets\Emotion Classification NLP\emotion-labels-val.csv") # Raw string so the backslashes are not read as escape sequences
df.head()

|   | text | label |
| --- | --- | --- |
| 0 | @theclobra lol I thought maybe, couldn't decid... | joy |
| 1 | Nawaz Sharif is getting more funnier than @kap... | joy |
| 2 | Nawaz Sharif is getting more funnier than @kap... | joy |
| 3 | @tomderivan73 😁...I'll just people watch and e... | joy |
| 4 | I love my family so much #lucky #grateful #sma... | joy |


In [3]:
df.shape

(347, 2)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    347 non-null    object
 1   label   347 non-null    object
dtypes: object(2)
memory usage: 5.5+ KB


In [5]:
df.describe()

|   | text | label |
| --- | --- | --- |
| count | 347 | 347 |
| unique | 342 | 4 |
| top | [ @HedgehogDylan ] *she would frown a bit, fol... | fear |
| freq | 2 | 110 |


In [6]:
df.drop_duplicates(subset=["text"]) # Returns a new DataFrame; df itself keeps all 347 rows because the result is not assigned

|   | text | label |
| --- | --- | --- |
| 0 | @theclobra lol I thought maybe, couldn't decid... | joy |
| 1 | Nawaz Sharif is getting more funnier than @kap... | joy |
| 2 | Nawaz Sharif is getting more funnier than @kap... | joy |
| 3 | @tomderivan73 😁...I'll just people watch and e... | joy |
| 4 | I love my family so much #lucky #grateful #sma... | joy |
| ... | ... | ... |
| 341 | 340:892 All with weary task fordone.\nNow the ... | sadness |
| 342 | Common app just randomly logged me out as I wa... | sadness |
| 343 | I'd rather laugh with the rarest genius, in be... | sadness |
| 344 | If you #invest in my new #film I will stop ask... | sadness |
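
The call above displays a de-duplicated copy but leaves df untouched. A minimal sketch of the difference, assuming a small hypothetical frame (`toy`) in place of the tweets data:

```python
import pandas as pd

# Hypothetical toy frame standing in for the tweets data
toy = pd.DataFrame({
    "text": ["so happy", "so happy", "quite scared"],
    "label": ["joy", "joy", "fear"],
})

# Without assignment the original frame is unchanged
toy.drop_duplicates(subset=["text"])
rows_before = len(toy)

# Assigning the result keeps the de-duplicated rows
deduped = toy.drop_duplicates(subset=["text"]).reset_index(drop=True)
rows_after = len(deduped)

print(rows_before, rows_after)
```

Whether to drop the repeated messages at all is a modelling choice; the rest of this notebook continues with the full 347 rows.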


In [7]:
df.isnull().sum()

text     0
label    0
dtype: int64

In [8]:
df.rename(columns={"text":"message"}, inplace = True)
df.head()

|   | message | label |
| --- | --- | --- |
| 0 | @theclobra lol I thought maybe, couldn't decid... | joy |
| 1 | Nawaz Sharif is getting more funnier than @kap... | joy |
| 2 | Nawaz Sharif is getting more funnier than @kap... | joy |
| 3 | @tomderivan73 😁...I'll just people watch and e... | joy |
| 4 | I love my family so much #lucky #grateful #sma... | joy |


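#### Label Encoding

LabelEncoder maps each class name to a single integer code (0 to 3); it does not produce one-hot vectors.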

In [9]:
le = LabelEncoder()

In [10]:
df["label_num"] = le.fit_transform(df["label"])

In [11]:
df.head()

|   | message | label | label_num |
| --- | --- | --- | --- |
| 0 | @theclobra lol I thought maybe, couldn't decid... | joy | 2 |
| 1 | Nawaz Sharif is getting more funnier than @kap... | joy | 2 |
| 2 | Nawaz Sharif is getting more funnier than @kap... | joy | 2 |
| 3 | @tomderivan73 😁...I'll just people watch and e... | joy | 2 |
| 4 | I love my family so much #lucky #grateful #sma... | joy | 2 |


In [12]:
df["label_num"].value_counts()

1    110
0     84
2     79
3     74
Name: label_num, dtype: int64
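
The integer codes can be read back to class names: LabelEncoder sorts the classes, so here 0 = anger, 1 = fear, 2 = joy, 3 = sadness, which agrees with fear being the most frequent label (110 rows). The sketch below shows that mapping and, for comparison, a true one-hot encoding via `pd.get_dummies`. The toy `labels` series and the names `le_demo`, `mapping`, and `one_hot` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels using the four classes from this notebook
labels = pd.Series(["joy", "fear", "anger", "sadness", "fear"], name="label")

# LabelEncoder assigns integer codes in sorted (alphabetical) class order
le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)
mapping = {name: int(code) for code, name in enumerate(le_demo.classes_)}
print(mapping)

# One-hot encoding: one 0/1 indicator column per class
one_hot = pd.get_dummies(labels, dtype=int)
print(one_hot)
```

A decision tree classifier accepts integer class codes as the target directly, so the integer encoding is sufficient for the model used below; one-hot targets matter mainly for models that expect an indicator matrix.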

#### Split the Dataset into Training and Test Sets

In [13]:
x = df["message"]
y = df["label_num"]

In [14]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
print("x_train :", x_train.shape)
print("x_test :", x_test.shape)
print("y_train :", y_train.shape)
print("y_test :", y_test.shape)

x_train : (277,)
x_test : (70,)
y_train : (277,)
y_test : (70,)
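
Because the four classes are not equally frequent (110, 84, 79, and 74 rows), a random split can leave the test set with different class proportions from the training set. One possible variation, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses hypothetical toy data (`messages`, `codes`):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 messages, 5 per class (codes 0-3)
messages = [f"message {i}" for i in range(20)]
codes = [i % 4 for i in range(20)]

# stratify=codes keeps the class proportions the same in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    messages, codes, test_size=0.2, random_state=1, stratify=codes
)

test_counts = Counter(y_te)
print(test_counts)
```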


#### Vectorization

In [15]:
vect = CountVectorizer()
vect.fit(x_train)

CountVectorizer()

In [16]:
x_train_dtm = vect.fit_transform(x_train) # Learn the vocabulary from x_train and transform it into a document-term matrix
print(type(x_train_dtm))                  # Sparse matrix type returned by the vectorizer
x_train_dtm

<class 'scipy.sparse.csr.csr_matrix'>


<277x1699 sparse matrix of type '<class 'numpy.int64'>'
	with 4009 stored elements in Compressed Sparse Row format>

x_train_dtm has 277 observations and 1699 tokens (features).

In [17]:
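x_test_dtm = vect.transform(x_test) # Reuse the vocabulary learned from x_train; do not re-fit on the test set
print(type(x_test_dtm))
x_test_dtm

<class 'scipy.sparse.csr.csr_matrix'>

Calling vect.fit_transform(x_test) here instead would re-learn the vocabulary from the test messages; in this notebook that produced a 70x640 matrix with 1075 stored elements, whose columns do not match the 1699 training features. With vect.transform(x_test), x_test_dtm has 70 observations and the same 1699 features as x_train_dtm.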
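
A model trained on the training matrix expects test rows with the same columns, so the vocabulary is normally learned once on the training text and reused for the test text. A minimal sketch of that pattern, assuming hypothetical toy messages (`train_msgs`, `test_msgs`) and a separate vectorizer `vect_demo`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages standing in for x_train and x_test
train_msgs = ["i am so happy today", "this is scary", "so sad and angry"]
test_msgs = ["happy and unseen words", "scary today"]

vect_demo = CountVectorizer()
train_dtm = vect_demo.fit_transform(train_msgs)  # learn the vocabulary, then transform
test_dtm = vect_demo.transform(test_msgs)        # reuse the vocabulary, no re-fit

# Both matrices share the same columns; test-only words are ignored
print(train_dtm.shape, test_dtm.shape)
print("unseen" in vect_demo.vocabulary_)
```

Words that appear only in the test messages have no column and are dropped, which is the intended behaviour: the model has never seen them during training.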

#### Build and Evaluate the Model

##### Decision Tree Model

In [18]:
dt_model = DecisionTreeClassifier(max_depth = 4)
dt_model.fit(x_train_dtm, y_train)

DecisionTreeClassifier(max_depth=4)

In [19]:
dt_pred = dt_model.predict(x_train_dtm) # Predictions on the training data (277 rows), not on the held-out test set
dt_pred

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 2, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1,
       2, 1, 1, 2, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1,
       1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 0, 1])
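
The array above holds predictions for the 277 training rows, so it shows how the tree fits data it has already seen. A held-out estimate needs predictions on test rows vectorized with the training vocabulary. The sketch below illustrates the idea on hypothetical toy data; the messages, labels, and the names `model`, `train_acc`, and `test_acc` are assumptions for illustration, and the scores it prints say nothing about the tweets dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; codes follow 0=anger, 1=fear, 2=joy, 3=sadness
train_msgs = [
    "so happy and glad", "happy happy joy",
    "really scared now", "scared of the dark",
    "angry and furious", "so angry today",
    "sad and lonely", "feeling sad today",
]
train_y = [2, 2, 1, 1, 0, 0, 3, 3]
test_msgs = ["happy today", "scared and sad"]
test_y = [2, 1]

# Learn the vocabulary on the training text only, then reuse it
vect_demo = CountVectorizer()
train_dtm = vect_demo.fit_transform(train_msgs)
test_dtm = vect_demo.transform(test_msgs)

model = DecisionTreeClassifier(max_depth=4, random_state=1)
model.fit(train_dtm, train_y)

# Compare accuracy on seen (training) and unseen (test) messages
train_acc = accuracy_score(train_y, model.predict(train_dtm))
test_pred = model.predict(test_dtm)
test_acc = accuracy_score(test_y, test_pred)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```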