# Sentence Emotion Prediction Using Frequent Itemset Mining

Name: Seyed Ali Mirferdos

Student Number: 99201465

# Importing required modules

In [None]:
!pip install pyspark

In [145]:
import pandas as pd
import numpy as np
import string
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from textblob import TextBlob
import re
from collections import defaultdict
import itertools
from pyspark import SparkContext
from pyspark.sql import SparkSession

In [121]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

# Reading the data

In [None]:
!unzip q4.csv.zip

Archive:  q4.csv.zip
  inflating: train.csv               


In [109]:
df = pd.read_csv('train.csv', index_col=0)

In [110]:
df.head()

Unnamed: 0,id,text,emotions
0,27383,i feel awful about it too because it s my job ...,sadness
1,110083,im alone i feel awful,sadness
2,140764,ive probably mentioned this before but i reall...,joy
3,100071,i was feeling a little low few days back,sadness
4,2837,i beleive that i am much more sensitive to oth...,love


# Preprocessing

This part is completely taken from the [link](https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908) provided in the question:

## 1. Lowercase

In [111]:
df['text'] = df['text'].map(str.lower)

## 2. Removing numbers

In [112]:
df['text'] = df['text'].map(lambda x: re.sub(r'\d+', '', x))

## 3. Removing all non-ascii characters

In [113]:
df['text'] = df['text'].map(lambda x: re.sub(r'[^\x00-\x7F]+', '', x))

## 4. Removing punctuation

In [114]:
df['text'] = df['text'].map(lambda x: x.translate(str.maketrans('','', string.punctuation)))

## 5. Removing stop words

In [115]:
stop_words = set(stopwords.words('english'))

In [116]:
df['text'] = df['text'].map(lambda x: 
                            [i for i in word_tokenize(x) if not i in stop_words])

## 6. Removing single letter words

In [117]:
df['text'] = df['text'].map(lambda x: 
                            [w for w in x if len(w) > 1])

## 7. Lemmatizing Words

This is part is taken from the third method discussed [here](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#wordnetlemmatizerwithappropriatepostag):

In [118]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [119]:
lemmatizer = WordNetLemmatizer()

In [122]:
df['text'] = df['text'].map(lambda x: 
                            [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in x])

## 8. Word Counts

In [123]:
df.head()

Unnamed: 0,id,text,emotions
0,27383,"[feel, awful, job, get, position, succeed, hap...",sadness
1,110083,"[im, alone, feel, awful]",sadness
2,140764,"[ive, probably, mention, really, feel, proud, ...",joy
3,100071,"[feel, little, low, day, back]",sadness
4,2837,"[beleive, much, sensitive, people, feeling, te...",love


In [135]:
word_total_count = df.explode('text').dropna().value_counts(subset=['text'])

For getting the emotion count I thought about different methods and asked for advice from some friends and I finally came up with two methods which I've commented one:

In [128]:
word_emotion_count = df.explode('text').dropna().pivot_table(index="text", 
                                                             columns='emotions',
                                                             aggfunc=len,
                                                             fill_value=0)

In [129]:
word_emotion_count.columns = word_emotion_count.columns.droplevel()

In [136]:
word_total_count = word_total_count.to_frame(name='total_count')

In [137]:
words = pd.merge(word_total_count, word_emotion_count, on='text')

In [None]:
# s = df.groupby("emotions")["text"]\
#        .apply(lambda x: Counter(t for texts in x for t in texts))\
#        .dropna()\
#        .astype(int)
# s.unstack(level=0)

# Requests

## 1. Saving the list of words

In [143]:
pd.Series(words.index).to_csv('words.txt', header=False, index=False)

## 2. Memory needed for all frequent itemsets

The total number of words extracted is: 52210

Let's assume each word takes 2 bytes of memory.

For the single-sets we'll have 52210 * 2 = 104.42 kilobytes.

For the pair-sets we'll have 52210 * 52209 = 2.7 gigabytes

For the three-sets we'll have 52210 * 52209 * 52208 * 2/6 = 47.4 terabytes

In the case we want to have all the itemsets it would be $2^{52211}$ bytes.

## 3. How much itemsets can be handled in 16GB memory

As we saw in the previous section we'll have around 2.8 gigabytes for the single and double sets. So in the 16GB memory, 13.2GB will be left.

All the three-sets can be handled in a 47.4 terabytes memory so the fraction which can be handled in a 13.2GB memory is 13.2GB*(number of 3 sets)/ (2 * 47.4TB) which would be 66 * $10^8$ three-set itemsets.

So we can put all the single-sets, pair-sets and 66 * $10^8$ of three-sets in a 16GB memory.

## 4. SON algorithm

The main SON algorithm for MapReduce is taken from [here](https://www.geeksforgeeks.org/the-son-algorithm-and-map-reduce/).

Most of the code of MapReduce is taken from [here](https://github.com/nityapydipati/SON-Algorithm-using-Apache-Spark/blob/master/Nitya_Pydipati_son.py).

In [None]:
sc = SparkContext()
spark = SparkSession(sc)

rdd1 will be a rdd made from the main DataFrame:

In [171]:
rdd1 = spark.createDataFrame(df).rdd

Calculating the thresholds:

In [172]:
s_th = 0.05 * len(df)
partitions = 500

In [173]:
partition_s_th = s_th / partitions

Setting the number of partitions:

In [174]:
rdd2 = rdd1.coalesce(partitions, True)

In [175]:
rdd2.getNumPartitions()

500

rdd2 will only have the text column of each row:

In [176]:
rdd2 = rdd2.map(lambda b: b['text'])

Finding the frequent sets in a list of baskets:

In [177]:
def frequent(frequent_sets,baskets,sup):
    frequent_dict = defaultdict(int)
    for item in frequent_sets:
        for basket in baskets:
            if item.issubset(basket):
                frequent_dict[frozenset(item)] += 1
    items=set()
    for item in frequent_dict:
        if frequent_dict[item] >= sup:
            items.add(item)
    return items

Main Apriori algorithm for single and pair frequent sets:

In [178]:
def apriori(items):
    freq_one = defaultdict(int)
    baskets = []
    frequent_sets = set()
    results = dict()
    for item in items:
        baskets.append(item)
        for i in item:
            freq_one[i]+=1
    for freq in freq_one:
        if(freq_one[freq]>=partition_s_th):
            frequent_sets.add(freq)
    
    results[1]=[frozenset([item]) for item in frequent_sets]
    combine=set()
    for sets in frequent_sets:
        combine.add(sets)
    sets=[set(sorted(item)) for item in 
          itertools.chain(*[itertools.combinations(combine, 2)])]
    frequent_sets=sets
    next_freq=frequent(frequent_sets,baskets,partition_s_th)
    if (next_freq): 
      results[2]=next_freq
    frequent_sets=next_freq
    return results

MapReduced SON Phase 1:

In [179]:
basket=rdd2.mapPartitions(lambda line: [y for y in apriori(line).values()])

map_one=basket.flatMap(lambda x: [(y, 1) for y in x])

reduce_one=map_one.reduceByKey(lambda x, y: x)

item_red=reduce_one.map(lambda x: x[0]).collect()

MapReduced SON Phase 2:

In [180]:
broadcasting_global_count=sc.broadcast(item_red)

map_two=rdd2.flatMap(lambda line: [(count,1) for count in 
                                   broadcasting_global_count.value 
                                   if set(line).issuperset(set(count))])
reduce_two=map_two.reduceByKey(lambda x,y: x+y)

global_count=reduce_two.filter(lambda x: x[1]>=s_th)
output=global_count.collect()

As we can see, we have the frequent itemsets and their total count in the dataset:

In [181]:
output

[(frozenset({'know'}), 18037),
 (frozenset({'im'}), 39047),
 (frozenset({'feel', 'time'}), 16749),
 (frozenset({'make'}), 20315),
 (frozenset({'like'}), 50467),
 (frozenset({'feel', 'im'}), 38484),
 (frozenset({'feel', 'go'}), 18535),
 (frozenset({'feel'}), 291534),
 (frozenset({'feel', 'like'}), 49984),
 (frozenset({'feel', 'make'}), 19937),
 (frozenset({'time'}), 17215),
 (frozenset({'feel', 'get'}), 21349),
 (frozenset({'go'}), 19162),
 (frozenset({'really'}), 17247),
 (frozenset({'feel', 'know'}), 17555),
 (frozenset({'feel', 'really'}), 16927),
 (frozenset({'get'}), 22115)]

Now we want to calculate the count of frequent itemsets in each emotion:

In [183]:
freq_itemsets = [i[0] for i in output]

In [None]:
def get_freq_itemset_emotions(row):
  return [(row['emotions'], item_set) for item_set in freq_items 
            if item_set.issubset(row['text'])]

mapped = rdd1.map(lambda row: get_freq_itemset_emotions(row))\
            .flatMap(lambda x: x)\
            .map(lambda x: (x, 1))\
            .reduceByKey(lambda x,y: x+y)
mapped.collect()

[(('joy', frozenset({'feel'})), 71984),
 (('joy', frozenset({'like'})), 16811),
 (('joy', frozenset({'feel', 'like'})), 14701),
 (('joy', frozenset({'feeling'})), 30466),
 (('anger', frozenset({'time'})), 2423),
 (('joy', frozenset({'know'})), 4966),
 (('joy', frozenset({'im'})), 12928),
 (('joy', frozenset({'feel', 'im'})), 5350),
 (('surprise', frozenset({'feeling'})), 3722),
 (('surprise', frozenset({'feel'})), 7248),
 (('joy', frozenset({'feeling', 'im'})), 7879),
 (('sadness', frozenset({'really'})), 5012),
 (('surprise', frozenset({'like'})), 1588),
 (('surprise', frozenset({'feel', 'like'})), 1357),
 (('fear', frozenset({'time'})), 2014),
 (('surprise', frozenset({'know'})), 650),
 (('surprise', frozenset({'im'})), 1232),
 (('surprise', frozenset({'feel', 'im'})), 563),
 (('sadness', frozenset({'time'})), 5112),
 (('love', frozenset({'really'})), 1491),
 (('love', frozenset({'time'})), 1372),
 (('fear', frozenset({'really'})), 1870),
 (('anger', frozenset({'really'})), 2475),
 (

## 5. Proposing a equation for feeling probability of a sentence

We suppose each sentence is preprocessed into a sequence of words: w1, w2, ..., wn

As locality is discussed in the field of NLP, the combination of words with its neighbours is important thus we approximate the feeling probability of the sentence using the 2-grams of the given sentence. i.e we consider the following 2-grams: (W1, W2), (W2, W3), ..., (Wn-1, Wn).

For each pair of words (a, b) we calculate the number of occurences in the whole dataset in addition to occurencces in each emotion. Then we divide the vector of emotion counts by the total count to get the probability of each emotion for the given pair. This gives us an estimate of how this pair would change the sentence's emotional distribution.

Then for each 2-gram of the sentence we get the emotion distribution vector and sum them up. Finally, we normalize the final vector by dividing it by the sum of all the values.

## 6. Calculating the probability of each sentence in the dataset

In [None]:
df2 = df.copy()[['text', 'emotions']]

Generating the 2-grams for each sentence in the dataset:

In [184]:
def get_2_grams(word_array):
  result = []
  for i in range (len(word_array)-1):
    result.append(tuple((word_array[i], word_array[i+1])))
  return result

In [None]:
df2['2-grams'] = df['text'].apply(get_2_grams)

Counting the number of each emotion for each 2-gram:

In [None]:
df3 = df2.explode('2-grams')\
         .dropna()\
         .pivot_table(index="2-grams", columns='emotions', 
                      aggfunc=len, fill_value=0)

In [None]:
df3.columns = df3.columns.droplevel()

Counting the total number of each 2-gram:

In [None]:
pair_totals = df2.explode('2-grams')\
                 .dropna().value_counts(subset=['2-grams'])

In [None]:
pair_totals = pair_totals.to_frame(name='count')

In [None]:
pair_counts = pd.merge(df3, pair_totals, on='2-grams')

Getting the feeling probability distribution of each 2-gram:

In [None]:
pair_counts['anger'] /= pair_counts['count']
pair_counts['fear'] /= pair_counts['count']
pair_counts['joy'] /= pair_counts['count']
pair_counts['love'] /= pair_counts['count']
pair_counts['sadness'] /= pair_counts['count']
pair_counts['surprise'] /= pair_counts['count']

Finally, let's create the final DataFrame:

In [None]:
final_df = df.copy()
final_df['2-grams']= df['text'].apply(get_2_grams)

As we just need the probability distribution we can drop the count column:

In [None]:
new_pair_counts = pair_counts.drop(['count'], axis=1)

We did several benchmarks with the help of other friends to see which method is the most efficient. Using the vanilla DataFrame crashes the Colab as it uses too much RAM. 

Finally using the dictionary is the most efficient method.

In [None]:
new_pair_counts = new_pair_counts.to_dict(orient='index')

In [None]:
def get_prob_distrib(row):
  emotions = np.zeros(6)
  for p in row:
    e = list(new_pair_counts[p].values())
    emotions += e
  
  prob_s = np.sum(emotions)
  emotions /= prob_s
  return emotions

In [None]:
prs = final_df['2-grams'].apply(get_prob_distrib)

In [None]:
list_of_emotions = ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']

In [None]:
dominant_feeling = prs.apply(lambda x: list_of_emotions[np.argmax(x)])

In [None]:
part6_final_df = pd.DataFrame({'id': df['id'], 'feeling-prs': prs, 
                               'dominant-feeling': dominant_feeling})

And saving it to a csv file:

In [None]:
part6_final_df.to_csv('result.csv')

We can zip all the results to be able to download them easier:

In [185]:
!zip q4_results.zip result.csv words.txt

updating: result.csv (deflated 61%)
  adding: words.txt (deflated 54%)
