# Team 19 - COMP 472 (Artificial Intelligence) Mini Project 1: Emotion and Sentiment Classification of Reddit Posts
by
Vithushen Sivasubramaniam (40112363), Vejay Thanamjeyasingam (40112236), and David Xie (40065595)

October 22, 2022

## 1. Dataset Preparation & Analysis (5pts)

### 1.1. □ (0pts) Download the version of the GoEmotion dataset provided on Moodle.
The original GoEmotion dataset, created by [Demszky et al., 2020], is a dataset of 58k human-annotated
Reddit comments labeled with 27 emotion categories (e.g. admiration, amusement, anger,
caring, ...) plus neutral, for 28 classes in total. These emotions are themselves organized into 4 sentiments: positive (admiration,
amusement, ...), negative (anger, annoyance, ...), ambiguous (confusion, curiosity) and
neutral (neutral). This allows us to use the dataset for both:
* emotion classification (into 28 classes), and
* sentiment classification (into 4 classes).

For more information on the original dataset, you can read this blog and this paper.
The dataset we will use for this assignment is a modified version of the original GoEmotion, where only
posts annotated with a single emotion (and a single sentiment) are kept, and the data is formatted as
JSON. The JSON file contains triplets made of the post, its emotion, and its sentiment.

### 1.2. □ (0pts) Load the dataset. 
You can use gzip.open and json.load to do that.

In [None]:
import json
import gzip
from matplotlib import pyplot as plt
from collections import Counter

# opens file using GZIP and json.load
with gzip.open('goemotions.json.gz', 'rb') as f:
    file_content = json.load(f)
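
The triplet layout described above can be illustrated with a small round trip. This is a minimal sketch on made-up data: the two toy posts, the file name `toy_goemotions.json.gz`, and the temporary directory are assumptions for illustration and are not part of the Moodle dataset.

```python
import gzip
import json
import os
import tempfile

# Hypothetical toy data in the same [post, emotion, sentiment] layout
toy_triplets = [
    ["That game was great", "admiration", "positive"],
    ["Why would you do that?", "confusion", "ambiguous"],
]

# Write the toy data to a gzip-compressed JSON file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "toy_goemotions.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(toy_triplets, f)

# Read it back the same way the real dataset is loaded
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = json.load(f)

# zip(*...) transposes the list of triplets into three parallel lists
posts, emotions, sentiments = (list(column) for column in zip(*loaded))
```

Keeping the three lists parallel means that index `i` always refers to the same post in `posts`, `emotions`, and `sentiments`.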

### 1.3. □ (5pts) Extract the posts and the 2 sets of labels (emotion and sentiment), then plot the distribution of the posts in each category and save the graphic (a histogram or pie chart) in pdf. 

Do this for both the emotion and the sentiment categories. You can use matplotlib.pyplot and savefig to do this. This pre-analysis of the dataset will allow you to determine if the classes are balanced, and which metric is more appropriate to use to evaluate the performance of your classifiers.

In [None]:
# check contents of file
file_content

In [None]:
# Store emotions and sentiments in list
emotionList = []
sentimentList = []

for post in file_content:
    emotionList.append(post[1])
    sentimentList.append(post[2])
    
# Generate graphs using sentiment and emotion Lists
plt.figure(0)
plt.title('Distribution of Sentiments', pad=100, fontdict = {'fontsize' : 20})
plt.pie(Counter(sentimentList).values(), labels=Counter(sentimentList).keys(), radius=2, autopct="%0.1f%%")
plt.savefig("sentimentGraph.pdf", bbox_inches='tight')

plt.figure(1)
plt.title('Distribution of Emotions', pad=450, fontdict = {'fontsize' : 50})
plt.pie(Counter(emotionList).values(), labels=Counter(emotionList).keys(), radius=5, autopct="%0.1f%%")
plt.savefig("emotionGraph.pdf", bbox_inches='tight')

plt.show()
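
Besides the charts, class balance can also be checked numerically. The following is a minimal sketch, assuming a hypothetical helper `class_shares` and a made-up label list; it is not part of the assignment code. When one class holds a large share of the posts, plain accuracy can look good for a classifier that mostly predicts that class, so a macro-averaged metric such as macro F1 is often the more informative choice.

```python
from collections import Counter

def class_shares(labels):
    """Return each label's share of the dataset, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}

# Hypothetical toy labels, for illustration only
toy_sentiments = ["positive", "positive", "positive", "negative", "neutral", "positive"]

shares = class_shares(toy_sentiments)
majority_share = max(shares.values())  # share held by the largest class
```

The same helper could be called on `sentimentList` and `emotionList` to see how far each distribution is from uniform.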

### 2.1. □ (5pts) Process the dataset using feature_extraction.text.CountVectorizer to extract tokens/words and their frequencies. 
Display the number of tokens (the size of the vocabulary) in the dataset.

In [None]:
#Display the number of tokens (the size of the vocabulary)
from sklearn.feature_extraction.text import CountVectorizer
vocabularyList = []

cv=CountVectorizer()

for post in file_content:
    vocabularyList.append(post[0])

Z=cv.fit(vocabularyList)
print("Number of tokens: ", len(Z.vocabulary_))

In [None]:
print(Z.vocabulary_)

### 2.2. □ (2pts) Split the dataset into 80% for training and 20% for testing. 
For this, you can use train_test_split.

In [None]:
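#Splitting the dataset (80% training and 20% testing)
from sklearn.model_selection import train_test_split

# The features are the Reddit posts; each post has two labels (emotion and sentiment)
X = [post[0] for post in file_content]

# A single call applies the same shuffle to all three lists, so posts and labels stay aligned
x_train, x_test, y_emotion_train, y_emotion_test, y_sentiment_train, y_sentiment_test = train_test_split(
    X, emotionList, sentimentList, test_size=0.20)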
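
If a reproducible split is wanted, `train_test_split` also accepts `random_state`, and `stratify` keeps the class proportions similar in both parts. The sketch below runs on made-up data; the seed value 0, the choice to stratify on sentiment, and all variable names are assumptions for illustration, not requirements of the assignment.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 posts, each with an emotion and a sentiment label
toy_posts = [f"post {i}" for i in range(10)]
toy_emotions = ["joy" if i % 2 == 0 else "anger" for i in range(10)]
toy_sentiments = ["positive" if i % 2 == 0 else "negative" for i in range(10)]

# One call splits the posts and both label lists with the same shuffle.
# random_state fixes the shuffle; stratify preserves the sentiment proportions.
(p_train, p_test,
 e_train, e_test,
 s_train, s_test) = train_test_split(
    toy_posts, toy_emotions, toy_sentiments,
    test_size=0.20, random_state=0, stratify=toy_sentiments)
```

Stratifying on one label set does not guarantee the other is balanced in general; here the two toy label lists happen to coincide.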