## Student Information
Name: Edwin Sanjaya

Student ID: 110065710

GitHub ID: edwinsanjaya

Kaggle name: Edwin Sanjaya 陳潤烈

Kaggle private scoreboard snapshot:

![Snapshot](img/pic0.png)


## Assignment 2 & 3

2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm2022-isa5810-lab2-homework) on Emotion Recognition on Twitter using this link: https://www.kaggle.com/t/2b0d14a829f340bc88d2660dc602d4bd. The score is based on your place in the Private Leaderboard ranking:
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (60-x)/6 + 20 points, where x is your ranking in the leaderboard (i.e., if you rank 3rd, your score will be (60-3)/6 + 20 = 29.5% out of 30%)
    Submit your last submission __BEFORE the deadline (Nov. 22nd 11:59 pm, Tuesday)__. Make sure to take a screenshot of your position at the end of the competition and store it as '''pic0.png''' under the **img** folder of this repository and rerun the cell **Student Information**.


3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained.

### 1. Read and Explore the Data

The first step of the assignment is to read the datasets used in the lab. Pandas DataFrames are used to store the data since they provide a robust data structure for data analysis and data science work.

These are the files we are working with:
- **data_identification.csv** states whether a given tweet_id belongs to the training or the test dataset
- **emotion.csv** provides the emotion label for a given tweet_id. Only the training dataset is labelled; the test dataset has no emotion label and is used to assess model performance in the Kaggle competition
- **tweets_DM.json** provides the tweets themselves in JSON format

Sample of tweets_DM.json data:
```
{
  "_score": 391,
  "_index": "hashtag_tweets",
  "_source": {
    "tweet": {
      "hashtags": [
        "Snapchat"
      ],
      "tweet_id": "0x376b20",
      "text": "People who post \"add me on #Snapchat\" must be dehydrated. Cuz man.... that's <LH>"
    }
  },
  "_crawldate": "2015-05-23 11:42:47",
  "_type": "tweets"
}
```

##### 1.1 Reading the CSV files

**data_identification.csv** and **emotion.csv** can be loaded directly without further processing since CSV files are already in tabular format

In [1]:
import pandas as pd
import os

kaggle_folder = os.path.join(os.getcwd(), 'kaggle_data')

# Function to generate DataFrame via CSV filename
def df_from_csv(filename):
    f = os.path.join(kaggle_folder, filename)
    return pd.read_csv(f, delimiter='\t|\n|,', engine='python')

# Generate DataFrame
data_identification = df_from_csv('data_identification.csv')
emotion = df_from_csv('emotion.csv')
data_identification

Unnamed: 0,tweet_id,identification
0,0x28cc61,test
1,0x29e452,train
2,0x2b3819,train
3,0x2db41f,test
4,0x2a2acc,train
...,...,...
1867530,0x227e25,train
1867531,0x293813,train
1867532,0x1e1a7e,train
1867533,0x2156a5,train


In [2]:
emotion

Unnamed: 0,tweet_id,emotion
0,0x3140b1,sadness
1,0x368b73,disgust
2,0x296183,anticipation
3,0x2bd6e1,joy
4,0x2ee1dd,anticipation
...,...,...
1455558,0x38dba0,joy
1455559,0x300ea2,joy
1455560,0x360b99,fear
1455561,0x22eecf,joy


##### 1.2 Reading JSON file

Unlike the CSV files, the JSON file stores the tweets in a nested format. This causes two problems:
1. The file holds one JSON object per line, which ```pd.read_json``` does not parse by default (it needs ```lines=True```)
2. The ```_source``` key contains a nested object. Using ```pd.read_json``` or ```pd.DataFrame``` directly stores the whole nested object in a single column.

To solve this, two steps were used:
1. Iterate over the JSON file line by line and parse each line
2. Use ```pd.json_normalize``` to flatten the nested JSON into a single level
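
To make the flattening step concrete, here is a minimal standard-library sketch of what the two steps amount to: parse one object per line, then turn nested keys into dotted names such as `_source.tweet.tweet_id`. The helpers `flatten` and `read_json_lines` are hypothetical names for illustration and are not pandas functions. The first sample line is trimmed from the tweet shown earlier, while the second one is made up.

```python
import json

def flatten(obj, parent_key='', sep='.'):
    """Flatten nested dicts into one level with dotted keys; lists stay as values."""
    flat = {}
    for key, value in obj.items():
        full_key = f'{parent_key}{sep}{key}' if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

def read_json_lines(lines):
    """Parse one JSON object per non-empty line and flatten each of them."""
    return [flatten(json.loads(line)) for line in lines if line.strip()]

# Two lines in the same layout as tweets_DM.json (second line is hypothetical)
sample = [
    '{"_score": 391, "_source": {"tweet": {"hashtags": ["Snapchat"], "tweet_id": "0x376b20"}}}',
    '{"_score": 1, "_source": {"tweet": {"hashtags": [], "tweet_id": "0x000000"}}}',
]
records = read_json_lines(sample)
print(records[0])
```

The dotted keys produced here have the same shape as the column names that are renamed after ```pd.json_normalize``` in the notebook.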

In [3]:
# Issue when loading the JSON directly: the nested object under _source ends up in a single _source column
test_json = {
    "_score": 391,
    "_index": "hashtag_tweets",
    "_source": {
        "tweet": {
            "hashtags": [
                "Snapchat"
            ],
            "tweet_id": "0x376b20",
            "text": "People who post \"add me on #Snapchat\" must be dehydrated. Cuz man.... that's <LH>"
        }
    },
    "_crawldate": "2015-05-23 11:42:47",
    "_type": "tweets"
}
test_json_df = pd.DataFrame(test_json)
test_json_df

Unnamed: 0,_score,_index,_source,_crawldate,_type
tweet,391,hashtag_tweets,"{'hashtags': ['Snapchat'], 'tweet_id': '0x376b...",2015-05-23 11:42:47,tweets


In [4]:
import json

# Load JSON file, iterate line by line, store the data in list
data = []
with open(os.path.join(kaggle_folder, 'tweets_DM.json'), encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

# Normalize nested JSON and convert to DataFrame
tweets_dm = pd.json_normalize(data)

# Rename columns in more readable format
rename_map = {
    '_score': 'score',
    '_index': 'index',
    '_crawldate' : 'date',
    '_type': 'type',
    '_source.tweet.hashtags': 'hashtags',
    '_source.tweet.tweet_id': 'tweet_id',
    '_source.tweet.text': 'text'
}
tweets_dm.rename(columns=rename_map, inplace=True)

# Get necessary data: the tweet_id and text
tweets_dm = tweets_dm[['tweet_id', 'text']]
tweets_dm

         tweet_id                                               text
0        0x376b20  People who post "add me on #Snapchat" must be ...
1        0x2d5350  @brianklaas As we see, Trump is dangerous to #...
2        0x28b412  Confident of your obedience, I write to you, k...
3        0x1cd5b0                Now ISSA is stalking Tasha 😂😂😂 <LH>
4        0x2de201  "Trust is not the same as faith. A friend is s...
...           ...                                                ...
1867530  0x316b80  When you buy the last 2 tickets remaining for ...
1867531  0x29d0cb  I swear all this hard work gone pay off one da...
1867532  0x2a6a4f  @Parcel2Go no card left when I wasn't in so I ...
1867533  0x24faed  Ah, corporate life, where you can date <LH> us...
1867534  0x34be8c             Blessed to be living #Sundayvibes <LH>

[1867535 rows x 2 columns]

##### 1.3 Data Exploration

In [5]:
# Checking dimension (good practice :3)
print(f'Data identification: {data_identification.shape}')
print(f'Emotion: {emotion.shape}')
print(f'Tweets DM: {tweets_dm.shape}')

Data identification: (1867535, 2)
Emotion: (1455563, 2)
Tweets DM: (1867535, 2)


In [6]:
# Count distinct tweet_ids per emotion label
emotion.groupby('emotion').nunique()

Unnamed: 0_level_0,tweet_id
emotion,Unnamed: 1_level_1
anger,39867
anticipation,248935
disgust,139101
fear,63999
joy,516017
sadness,193437
surprise,48729
trust,205478
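
The label counts above are quite imbalanced, so it helps to know what accuracy a constant prediction would reach. The following is a small worked calculation using only the counts printed above; the variable names are illustrative.

```python
# Label counts copied from the groupby output above
counts = {
    'anger': 39867,
    'anticipation': 248935,
    'disgust': 139101,
    'fear': 63999,
    'joy': 516017,
    'sadness': 193437,
    'surprise': 48729,
    'trust': 205478,
}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
# Accuracy on the training labels if every tweet were predicted as the majority label
baseline_accuracy = counts[majority_label] / total

print(f'{majority_label}: {baseline_accuracy:.1%} of {total} labelled tweets')
# joy: 35.5% of 1455563 labelled tweets
```

Any trained model should be compared against this majority-class share on the training labels rather than against 0%.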


### 2. Preprocessing

Pre-processing the text is one way to improve emotion prediction.

The following preprocessing techniques were explored in this assignment (only lowercasing and the regex cleanup are enabled in the final pipeline; the other steps are commented out in the sequential pre-processing cell):
- Lowercase
- Regular Expression: remove symbols, numbers, punctuation
- Tokenization
- Stopword Removal
- Stemming
- Lemmatization

In [7]:
# Required library for pre-processing
import re
import nltk
import spacy
import preprocessor as p
from nltk.tokenize import word_tokenize
from nltk import TweetTokenizer
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

# Require: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package wordnet to C:\Users\Edwin
[nltk_data]     Sanjaya\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Edwin
[nltk_data]     Sanjaya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Edwin Sanjaya\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


##### 2.1 Basic Preprocessing: Lowercase, Banned Word, Regex



In [8]:
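# Lowercase, drop banned tokens, strip punctuation, digits and apostrophes
# (mentions and URLs are not removed here)
def basic_tweet_preprocess(text):
    output = text.lower()
    # tweet-preprocessor -> unused because of redundancy with tweet-tokenizer
    # output = p.clean(output)
    # Note: plain substring replacement also removes 'rt' inside words (e.g. 'heart' -> 'hea')
    bw_list = ['<lh>', 'rt']
    for bw in bw_list:
        output = output.replace(bw, '')
    output = re.sub(r'[!#$%^&*(),.?":;{}|<>_]', '', output)
    # output = re.sub(r'\b[0-9]+\b', '', output)
    # Note: without a backreference this matches any character followed by literal '1's,
    # not a run of repeated characters; r'(.)\1+' would be needed for that
    output = re.sub(r'(.)1+', r'1', output)
    output = re.sub(r'[0-9]+', '', output)
    output = re.sub(r"'", '', output)
    return output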
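
For comparison, here is a minimal sketch of two of the cleanup steps written with a word-boundary match and a backreference. It assumes the goal is to drop the standalone `rt` marker and the `<LH>` placeholder, and to shorten long runs of a repeated character. The helper names `drop_banned_tokens` and `squeeze_repeats` are hypothetical and are not part of the notebook's pipeline, so using them would change the outputs shown in this report.

```python
import re

def drop_banned_tokens(text):
    """Lowercase, remove the <lh> placeholder and the standalone word 'rt'."""
    text = text.lower().replace('<lh>', ' ')
    # \b keeps 'rt' inside other words (e.g. 'heart') untouched
    text = re.sub(r'\brt\b', ' ', text)
    return ' '.join(text.split())

def squeeze_repeats(text, keep=2):
    """Shorten runs of more than `keep` identical characters down to `keep`."""
    # (.)\1{keep,} = one character followed by at least `keep` copies of itself
    pattern = r'(.)\1{%d,}' % keep
    return re.sub(pattern, lambda m: m.group(1) * keep, text)

print(drop_banned_tokens('RT my heart <LH>'))
print(squeeze_repeats('soooo happy'))
```

Keeping two characters instead of one is a design choice in this sketch: it preserves legitimate double letters such as the "ss" in "issa" while still normalizing elongated words.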

##### 2.2 Tokenization

Normally, we could use nltk's common built-in tokenizer, word_tokenize. After some exploration, it turns out nltk also has a built-in tokenizer specialized for Twitter text called **TweetTokenizer()**. This tokenizer is used in this lab assignment because it offers the following benefits:
- Tokenizes words like a normal tokenizer
- Detects handles (usernames starting with @) and removes them with ```strip_handles=True```
- Detects repeated characters and shortens them with ```reduce_len=True```
- Detects phone numbers

In [9]:
def tweet_tokenize(text):
    tt = TweetTokenizer(strip_handles=True, reduce_len=True)
    return tt.tokenize(text)

##### 2.3 Stopword Removal

Stopword removal reduces the number of redundant words that give no clue for emotion analysis.

By removing these words, the model can focus on the important words and needs less processing time because there are fewer words to process.

In [10]:
# Build the stopword set once instead of on every call
STOPWORDS = set(stopwords.words("english"))

def remove_stopwords(text):
    return [w for w in text if w not in STOPWORDS]

##### 2.4 Stemming

In [11]:
def stemming(text):
    stemmer = nltk.PorterStemmer()
    return [stemmer.stem(w) for w in text]

##### 2.5 Lemmatizing

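Both nltk and spaCy provide lemmatizers. A part-of-speech tag (see ```pos_tag```) is needed to get good lemmatization quality. Compared to stemming, lemmatization takes more processing time and gives relatively similar results. In the step-by-step example below it is applied to tokens that are already stemmed, so its output is identical to the stemming output.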

In [12]:
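# NLTK-based variant, kept for reference (unused; `tag_map`, a mapping from POS tags
# to WordNet POS constants, is not defined in this notebook)
# def lemmatizing(text):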
#     output = []
#     lemmatizer = nltk.WordNetLemmatizer()
#     text_pos = pos_tag(text)
#     for token_pos in text_pos:
#         output.append(lemmatizer.lemmatize(token_pos[0], pos=tag_map[token_pos[1]]))
#     return output

def lemmatizing(text):
    text = nlp(" ".join(text))
    output = []
    for token in text:
        output.append(token.lemma_)
    return output

In [13]:
def join_token(tokens):
    # Join a list of tokens back into a single space-separated string
    return ' '.join(tokens)

##### Example: Pre-processing step by step on the first 5 rows

In [14]:
# Test data to show step by step process
head_tweets_dm = tweets_dm.head().copy()
head_tweets_dm

Unnamed: 0,tweet_id,text
0,0x376b20,"People who post ""add me on #Snapchat"" must be ..."
1,0x2d5350,"@brianklaas As we see, Trump is dangerous to #..."
2,0x28b412,"Confident of your obedience, I write to you, k..."
3,0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>
4,0x2de201,"""Trust is not the same as faith. A friend is s..."


In [15]:
head_tweets_dm['text'] = head_tweets_dm['text'].apply(lambda x: basic_tweet_preprocess(x))
head_tweets_dm

Unnamed: 0,tweet_id,text
0,0x376b20,people who post add me on snapchat must be deh...
1,0x2d5350,@brianklaas as we see trump is dangerous to fr...
2,0x28b412,confident of your obedience i write to you kno...
3,0x1cd5b0,now issa is stalking tasha 😂😂😂
4,0x2de201,trust is not the same as faith a friend is som...


In [16]:
# tokenize
head_tweets_dm['text'] = head_tweets_dm['text'].apply(lambda x: tweet_tokenize(x))
head_tweets_dm

Unnamed: 0,tweet_id,text
0,0x376b20,"[people, who, post, add, me, on, snapchat, mus..."
1,0x2d5350,"[as, we, see, trump, is, dangerous, to, freepr..."
2,0x28b412,"[confident, of, your, obedience, i, write, to,..."
3,0x1cd5b0,"[now, issa, is, stalking, tasha, 😂, 😂, 😂]"
4,0x2de201,"[trust, is, not, the, same, as, faith, a, frie..."


In [17]:
# remove stop word
head_tweets_dm['text'] = head_tweets_dm['text'].apply(lambda x: remove_stopwords(x))
head_tweets_dm

Unnamed: 0,tweet_id,text
0,0x376b20,"[people, post, add, snapchat, must, dehydrated..."
1,0x2d5350,"[see, trump, dangerous, freepress, around, wor..."
2,0x28b412,"[confident, obedience, write, knowing, even, a..."
3,0x1cd5b0,"[issa, stalking, tasha, 😂, 😂, 😂]"
4,0x2de201,"[trust, faith, friend, someone, trust, putting..."


In [18]:
# stem
head_tweets_dm['text'] = head_tweets_dm['text'].apply(lambda x: stemming(x))
head_tweets_dm

Unnamed: 0,tweet_id,text
0,0x376b20,"[peopl, post, add, snapchat, must, dehydr, cuz..."
1,0x2d5350,"[see, trump, danger, freepress, around, world,..."
2,0x28b412,"[confid, obedi, write, know, even, ask, philem..."
3,0x1cd5b0,"[issa, stalk, tasha, 😂, 😂, 😂]"
4,0x2de201,"[trust, faith, friend, someon, trust, put, fai..."


In [19]:
# lemma
head_tweets_dm['text'] = head_tweets_dm['text'].apply(lambda x: lemmatizing(x))
head_tweets_dm

Unnamed: 0,tweet_id,text
0,0x376b20,"[peopl, post, add, snapchat, must, dehydr, cuz..."
1,0x2d5350,"[see, trump, danger, freepress, around, world,..."
2,0x28b412,"[confid, obedi, write, know, even, ask, philem..."
3,0x1cd5b0,"[issa, stalk, tasha, 😂, 😂, 😂]"
4,0x2de201,"[trust, faith, friend, someon, trust, put, fai..."


In [20]:
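# Sequential pre-processing (only the basic cleanup is enabled for the final run;
# basic_tweet_preprocess already lowercases, so the first line is redundant but harmless)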
tweets_dm['text'] = tweets_dm['text'].apply(lambda x: x.lower())
tweets_dm['text'] = tweets_dm['text'].apply(lambda x: basic_tweet_preprocess(x))
# tweets_dm['text'] = tweets_dm['text'].apply(lambda x: tweet_tokenize(x))
# tweets_dm['text'] = tweets_dm['text'].apply(lambda x: remove_stopwords(x))
# tweets_dm['text'] = tweets_dm['text'].apply(lambda x: stemming(x))
# tweets_dm['text'] = tweets_dm['text'].apply(lambda x: lemmatizing(x))
tweets_dm

Unnamed: 0,tweet_id,text
0,0x376b20,people who post add me on snapchat must be deh...
1,0x2d5350,@brianklaas as we see trump is dangerous to fr...
2,0x28b412,confident of your obedience i write to you kno...
3,0x1cd5b0,now issa is stalking tasha 😂😂😂
4,0x2de201,trust is not the same as faith a friend is som...
...,...,...
1867530,0x316b80,when you buy the last tickets remaining for a...
1867531,0x29d0cb,i swear all this hard work gone pay off one da...
1867532,0x2a6a4f,@parcelgo no card left when i wasnt in so i ha...
1867533,0x24faed,ah corporate life where you can date using ju...


In [21]:
# Join token as whole
# tweets_dm['text'] = tweets_dm['text'].apply(lambda x: join_token(x))
tweets_dm

Unnamed: 0,tweet_id,text
0,0x376b20,people who post add me on snapchat must be deh...
1,0x2d5350,@brianklaas as we see trump is dangerous to fr...
2,0x28b412,confident of your obedience i write to you kno...
3,0x1cd5b0,now issa is stalking tasha 😂😂😂
4,0x2de201,trust is not the same as faith a friend is som...
...,...,...
1867530,0x316b80,when you buy the last tickets remaining for a...
1867531,0x29d0cb,i swear all this hard work gone pay off one da...
1867532,0x2a6a4f,@parcelgo no card left when i wasnt in so i ha...
1867533,0x24faed,ah corporate life where you can date using ju...


### 3. Feature Extraction

In [22]:
# Preparing the train dataframe
train_df = pd.merge(tweets_dm, emotion, on='tweet_id', how='inner')
train_df = pd.merge(train_df, data_identification, on='tweet_id', how='inner')
train_df

Unnamed: 0,tweet_id,text,emotion,identification
0,0x376b20,people who post add me on snapchat must be deh...,anticipation,train
1,0x2d5350,@brianklaas as we see trump is dangerous to fr...,sadness,train
2,0x1cd5b0,now issa is stalking tasha 😂😂😂,fear,train
3,0x1d755c,@riskshow @thekevinallison thx for the best ti...,joy,train
4,0x2c91a8,still waiting on those supplies liscus,anticipation,train
...,...,...,...,...
1455558,0x321566,im so happy nowonder the name of this show hap...,joy,train
1455559,0x38959e,in every circumtance id like to be thankful to...,joy,train
1455560,0x2cbca6,theres currently two girls walking around the ...,joy,train
1455561,0x24faed,ah corporate life where you can date using ju...,joy,train


In [23]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

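tfidf_vect = TfidfVectorizer()
# fit() returns the fitted vectorizer itself (not a document-term matrix);
# the matrices are produced later with transform()
tfidf_dtm = tfidf_vect.fit(train_df['text'])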

In [24]:
#Check terms
tfidf_vect.get_feature_names_out()

array(['aa', 'aaa', 'aaaa', ..., '𝖒𝖊𝖉𝖎𝖈𝖎𝖓𝖊', '𝖔𝖓', '𝖙𝖍𝖊'], dtype=object)

### 4. Machine Learning Modelling

In [25]:
# Preparing the test dataframe
test_df = pd.merge(tweets_dm, data_identification, on='tweet_id', how='inner')
test_df = test_df[test_df['identification']=='test']
test_df = test_df[['tweet_id', 'text']]
test_df = test_df.reset_index()

# Save to pickle
train_df.to_pickle("kaggle_train_df.pkl")
test_df.to_pickle("kaggle_test_df.pkl")
test_df

Unnamed: 0,index,tweet_id,text
0,2,0x28b412,confident of your obedience i write to you kno...
1,4,0x2de201,trust is not the same as faith a friend is som...
2,9,0x218443,when do you have enough when are you satisfie...
3,30,0x2939d5,god woke you up now chase the day godsplan god...
4,33,0x26289a,in these tough times who do you turn to as you...
...,...,...,...
411967,1867525,0x2913b4,for this is the message that ye heard from the...
411968,1867529,0x2a980e,there is a lad here which hath five barley loa...
411969,1867530,0x316b80,when you buy the last tickets remaining for a...
411970,1867531,0x29d0cb,i swear all this hard work gone pay off one da...


In [32]:
# Training the model
x_train = tfidf_dtm.transform(train_df['text'])
y_train = train_df['emotion']
x_test = tfidf_dtm.transform(test_df['text'])

# MNB_model = MultinomialNB()
# MNB_model = MNB_model.fit(x_train, y_train)
# LSVC_model = LinearSVC(verbose=True)
# LSVC_model = LSVC_model.fit(x_train, y_train)
LR_model = LogisticRegression(max_iter=850, n_jobs=-1, verbose=True)
LR_model = LR_model.fit(x_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed: 19.9min finished


In [27]:
from sklearn.linear_model import SGDClassifier
from sklearn.kernel_approximation import RBFSampler
# SGDC_model = SGDClassifier(max_iter=4000)
# SGDC_model = SGDC_model.fit(x_train, y_train)
# RBFS_model = RBFSampler()
# RBFS_model = RBFS_model.fit(x_train, y_train)

### 5. Analyze & Generate the Result

Training accuracy does not always reflect test accuracy. For example:

- With the Linear Support Vector Classifier, training accuracy was 77%, but test accuracy was only 45.5%
- With Logistic Regression, training accuracy was lower at 68%, yet test accuracy was slightly higher at 46%
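
One common way to estimate generalization before submitting is to hold out part of the labelled data and score the model on it. The sketch below uses only the standard library and toy data. The helper names `holdout_split` and `accuracy` are hypothetical, and the notebook itself does not use a validation split.

```python
import random

def holdout_split(rows, valid_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, valid) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_valid = int(len(rows) * valid_fraction)
    return rows[n_valid:], rows[:n_valid]

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy (text, label) pairs standing in for rows of train_df
rows = [(f'tweet {i}', 'joy' if i % 2 else 'sadness') for i in range(10)]
train_rows, valid_rows = holdout_split(rows, valid_fraction=0.2)

print(len(train_rows), len(valid_rows))  # 8 2
print(accuracy(['joy', 'fear'], ['joy', 'anger']))  # 0.5
```

With scikit-learn, `train_test_split` from `sklearn.model_selection` covers the same split and can also stratify by label, which may matter here given how uneven the emotion counts are.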


In [28]:
# Get training accuracy of the model
selected_model = LR_model
y_train_predict = selected_model.predict(x_train)
y_test_predict = selected_model.predict(x_test)
print(f'Train Accuracy: {accuracy_score(y_train, y_train_predict)}')

Train Accuracy: 0.6282943438380888


In [29]:
# Generate data frame for the prediction
y_test_predict = pd.DataFrame(y_test_predict, columns = ['emotion'])
y_test_predict

Unnamed: 0,emotion
0,anticipation
1,anticipation
2,joy
3,anticipation
4,trust
...,...
411967,anticipation
411968,anticipation
411969,sadness
411970,anger


In [30]:
# Rename column to meet Kaggle specification
submit_df = test_df.assign(emotion=y_test_predict)
submit_df = submit_df[['tweet_id', 'emotion']]
submit_df = submit_df.rename(columns={'tweet_id': 'id'})
submit_df

Unnamed: 0,id,emotion
0,0x28b412,anticipation
1,0x2de201,anticipation
2,0x218443,joy
3,0x2939d5,anticipation
4,0x26289a,trust
...,...,...
411967,0x2913b4,anticipation
411968,0x2a980e,anticipation
411969,0x316b80,sadness
411970,0x29d0cb,anger


In [31]:
# Create CSV file of test dataset emotion prediction
submit_df.to_csv('submit.csv', index=False)

### 6. Conclusion and Improvement

- **tweets_DM.json** stores one JSON object per line: this can be read with the parameter ```lines=True```, or by iterating over the file and calling ```json.loads``` on each line (the approach used here)
- Each object in **tweets_DM.json** is nested (has a hierarchy), so it needs to be transformed into a flat structure
    - Solve the line-separated issue: iterate line by line and use ```json.loads``` to parse each object
    - Solve the nested issue: use ```pd.json_normalize```, then rename the columns for better readability

Based on the process and results of this assignment, several improvements can be considered in the future:
- Since the dataset contains text in several languages, a translation library or API could be used to standardize the text
- Alternatively, detect the language of each text and train a separate model per language
- Extend the text pre-processing with more regular expressions
- Convert emoji into text
- Improve the lemmatization process with more robust POS tagging
- Try deep learning or language models such as Transformer-based models (e.g. BERT)
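
As an illustration of the emoji idea, the sketch below replaces emoji with their Unicode names using only the standard library. It assumes that treating every character in the Unicode category "So" (Symbol, other) as an emoji is good enough. The function name `emoji_to_text` is hypothetical, and multi-codepoint emoji sequences (skin tones, joined emoji) are not handled.

```python
import unicodedata

def emoji_to_text(text):
    """Replace symbol characters (most emoji) with their lowercase Unicode names."""
    pieces = []
    for ch in text:
        name = unicodedata.name(ch, '')
        if unicodedata.category(ch) == 'So' and name:
            # e.g. 'FACE WITH TEARS OF JOY' -> ' face_with_tears_of_joy '
            pieces.append(' ' + name.lower().replace(' ', '_') + ' ')
        else:
            pieces.append(ch)
    # Collapse the extra spaces introduced around the replaced symbols
    return ' '.join(''.join(pieces).split())

print(emoji_to_text('now issa is stalking tasha 😂😂'))
```

Turning each emoji into a single underscore-joined token would let a word-level vectorizer such as TF-IDF treat it as one feature.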