<a href="https://colab.research.google.com/github/UmbuMFA/UTSMechineLeraning/blob/main/js08_uts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Emotion Detection for Twitter Users

Emotion detection is one of the open problems in ***Natural Language Processing*** (NLP). One reason is the lack of labeled datasets for classifying emotions in Twitter data. Another is that Twitter data can carry many different emotion labels (***multi-class***). Humans experience a wide range of emotions, and it is difficult to collect enough data for each of them. As a result, the problem of ***class imbalance*** arises. For this midterm exam (Ujian Tengah Semester, UTS), you are given a Twitter text dataset that is already labeled with several emotion classes. Your main task is to build a capable model for classifying emotions from text.

### Dataset Information

The dataset used is ***tweet_emotions.csv***. The following information about the dataset may help you.

- Total rows: 40,000
- Emotion labels: anger, boredom, empty, enthusiasm, fun, happiness, hate, love, neutral, relief, sadness, surprise, worry
- The number of rows per label is not equal (***class imbalance***)
- There are 3 columns: 'tweet_id', 'sentiment', 'content'

### UTS Grading

The UTS is graded on four processes that you will carry out: data preprocessing, feature extraction, machine learning model building, and evaluation.

#### Data Preprocessing

> **Attention**
> 
> Before you do anything with your data, make sure it is in good shape: free of missing values, using appropriate data types, and so on.
>

The Twitter data you receive is raw, so there are several steps you can apply, including but not limited to:

1. Case Folding
2. Tokenizing
3. Filtering
4. Stemming

*NOTE: TWITTER DATA CONTAINS MENTIONS (@something) THAT YOU MUST HANDLE BEFORE MOVING ON TO THE FEATURE EXTRACTION STAGE.*
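
The preprocessing steps listed above, including mention handling, can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own pipeline: `TOY_STOPWORDS` and `preprocess` are hypothetical names, the stopword list is deliberately tiny, and stemming is left out because it normally relies on a library such as NLTK.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for illustration.
TOY_STOPWORDS = {"i", "to", "was", "the", "a", "and", "with"}

MENTION_RE = re.compile(r"@\w+")         # Twitter mentions such as @something
URL_RE = re.compile(r"\w+://\S+")        # links that include a scheme, e.g. http://...
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # anything except lowercase letters and whitespace


def preprocess(tweet: str) -> list[str]:
    """Case folding, mention/URL removal, tokenizing, and stopword filtering."""
    text = tweet.lower()                 # 1. case folding
    text = MENTION_RE.sub(" ", text)     # handle mentions before feature extraction
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)  # replace punctuation and digits with a space
    tokens = text.split()                # 2. tokenizing on whitespace
    return [t for t in tokens if t not in TOY_STOPWORDS]  # 3. filtering


print(preprocess("@tiffanylue I know I was listenin to bad habit earlier!"))
# ['know', 'listenin', 'bad', 'habit', 'earlier']
```

Removing mentions before stripping punctuation matters: once the `@` is gone, a username can no longer be told apart from an ordinary word.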

#### Feature Extraction

You can use several methods, including

1. Bag of Words (Count / TF-IDF)
2. N-gram
3. and so on
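
To make the first two options concrete, the sketch below counts word n-grams for a single tokenized tweet using only the standard library. The helper names `ngrams` and `bag_of_ngrams` are illustrative assumptions. In practice a library vectorizer would build the same kind of counts over the whole corpus.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return the contiguous word n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens: list[str], max_n: int = 2) -> Counter:
    """Count all 1..max_n-grams in one document (a count-based bag of words)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


doc = ["wants", "to", "hang", "out", "with", "friends", "soon"]
features = bag_of_ngrams(doc, max_n=2)
print(features["hang out"])  # count of the bigram "hang out"
print(len(features))         # 7 unigrams + 6 bigrams = 13 distinct features
```

Bigrams keep some word order that a plain bag of words discards, at the cost of a much larger and sparser vocabulary.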

#### Model Building

You are free to choose the classification algorithm. You may use an algorithm taught in class or another one, with one condition: in keeping with the principle of accountability in machine learning model development, you must be able to explain how your model arrives at a given output.
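
As one example of what "explaining how the model arrives at an output" can look like, here is a hand-written multinomial Naive Bayes with Laplace smoothing. It is a sketch on a made-up four-document corpus, and `train_nb` and `explain` are hypothetical helper names. Each class score is simply the log prior plus the sum of per-token log-likelihoods, so every term can be inspected.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing `alpha`."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # per-class token counts
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((word_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood


def explain(tokens, log_prior, log_likelihood):
    """Return each class score as log prior + summed token log-likelihoods."""
    scores = {}
    for c in log_prior:
        known = [t for t in tokens if t in log_likelihood[c]]  # skip unseen words
        scores[c] = log_prior[c] + sum(log_likelihood[c][t] for t in known)
    return scores


# Toy, made-up training data.
docs = [["happy", "day"], ["love", "happy"], ["sad", "day"], ["sad", "headache"]]
labels = ["happiness", "happiness", "sadness", "sadness"]
prior, likelihood = train_nb(docs, labels)
scores = explain(["happy", "headache", "happy"], prior, likelihood)
print(max(scores, key=scores.get))  # happiness
```

Here "happy" appears twice in the query and is three times more likely under `happiness` than under `sadness`, which outweighs the single "headache" token.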

#### Evaluation

For evaluation, you must use at least the accuracy metric. You may also add other metrics such as recall, precision, F1-score, a detailed confusion matrix, or Area Under the Curve (AUC).
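
The sketch below shows why the additional metrics matter when classes are imbalanced. It uses made-up labels and a hypothetical helper, `per_class_metrics`, to compute precision, recall, and F1 per class by hand. A classifier that always predicts the majority class reaches high accuracy while never detecting the minority class.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for every class that appears in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed an instance of the true class
    metrics = {}
    for c in sorted(set(y_true)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Toy, made-up labels: a classifier that always predicts the majority class.
y_true = ["neutral"] * 8 + ["anger"] * 2
y_pred = ["neutral"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)
macro_f1 = sum(m["f1"] for m in metrics.values()) / len(metrics)
print(accuracy, metrics["anger"]["recall"], round(macro_f1, 3))
# 0.8 0.0 0.444
```

Accuracy is 0.8, yet recall for `anger` is 0 and the macro-averaged F1 is only about 0.44, which reflects the failure on the minority class.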

#### Name : Umbu Michael FA
#### Class : TI 4J
#### Student ID (NIM) : 2241727033

### Worksheet
The worksheet starts from the cell below.

In [4]:
import numpy as np
import pandas as pd

In [5]:
df = pd.read_csv('https://raw.githubusercontent.com/UmbuMFA/UTSMechineLeraning/main/tweet_emotions.csv')

df.head()

|   | tweet_id | sentiment | content |
|---|---|---|---|
| 0 | 1956967341 | empty | @tiffanylue i know i was listenin to bad habi... |
| 1 | 1956967666 | sadness | Layin n bed with a headache ughhhh...waitin o... |
| 2 | 1956967696 | sadness | Funeral ceremony...gloomy friday... |
| 3 | 1956967789 | enthusiasm | wants to hang out with friends SOON! |
| 4 | 1956968416 | neutral | @dannycastillo We want to trade with someone w... |


Data Preparation

In [6]:
# ------ Case Folding --------
# use the pandas Series.str.lower() method
df['content'] = df['content'].str.lower()


print('Case Folding Result : \n')
print(df['content'].head(5))
print('\n\n\n')

Case Folding Result : 

0    @tiffanylue i know  i was listenin to bad habi...
1    layin n bed with a headache  ughhhh...waitin o...
2                  funeral ceremony...gloomy friday...
3                 wants to hang out with friends soon!
4    @dannycastillo we want to trade with someone w...
Name: content, dtype: object






In [7]:
import string 
import re #regex library
import nltk
nltk.download('punkt')

# import word_tokenize & FreqDist from NLTK
from nltk.tokenize import word_tokenize 
from nltk.probability import FreqDist

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:


# ------ Tokenizing ---------

def remove_content_special(text):
    # remove literal tab, newline, and unicode escape sequences, then stray backslashes
    text = text.replace('\\t'," ").replace('\\n'," ").replace('\\u'," ").replace('\\',"")
    # replace non-ASCII characters (emoticons, Chinese characters, etc.) with '?'
    text = text.encode('ascii', 'replace').decode('ascii')
    # remove mentions, hashtags, and links (raw string avoids invalid-escape warnings)
    text = ' '.join(re.sub(r"([@#][A-Za-z0-9]+)|(\w+:\/\/\S+)"," ", text).split())
    # remove leftover URL schemes from incomplete URLs
    return text.replace("http://", " ").replace("https://", " ")
                
df['content'] = df['content'].apply(remove_content_special)

#remove number
def remove_number(text):
    return  re.sub(r"\d+", "", text)

df['content'] = df['content'].apply(remove_number)

#remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans("","",string.punctuation))

df['content'] = df['content'].apply(remove_punctuation)

#remove whitespace leading & trailing
def remove_whitespace_LT(text):
    return text.strip()

df['content'] = df['content'].apply(remove_whitespace_LT)

# collapse multiple whitespace characters into a single space
def remove_whitespace_multiple(text):
    return re.sub(r'\s+',' ',text)

df['content'] = df['content'].apply(remove_whitespace_multiple)

# remove single char
def remove_singl_char(text):
    return re.sub(r"\b[a-zA-Z]\b", "", text)

df['content'] = df['content'].apply(remove_singl_char)

# NLTK word tokenize
def word_tokenize_wrapper(text):
    return word_tokenize(text)

df['content_token'] = df['content'].apply(word_tokenize_wrapper)

print('Tokenizing Result : \n') 
print(df['content_token'].head())
print('\n\n\n')

Tokenizing Result : 

0    [know, was, listenin, to, bad, habit, earlier,...
1    [layin, bed, with, headache, ughhhhwaitin, on,...
2                    [funeral, ceremonygloomy, friday]
3          [wants, to, hang, out, with, friends, soon]
4    [we, want, to, trade, with, someone, who, has,...
Name: content_token, dtype: object
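
The tokens `ughhhhwaitin` and `ceremonygloomy` above come from deleting punctuation outright, which glues together words that were separated only by `...`. A possible alternative, sketched here with a hypothetical helper `replace_punctuation`, is to map punctuation to spaces and then collapse whitespace. This is an assumption about a reasonable fix, not the notebook's method.

```python
import re
import string

# Map every punctuation character to a space instead of deleting it.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def replace_punctuation(text: str) -> str:
    """Turn punctuation into spaces, then collapse repeated whitespace."""
    spaced = text.translate(PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", spaced).strip()


print(replace_punctuation("funeral ceremony...gloomy friday..."))
# funeral ceremony gloomy friday
```

The trade-off is that contractions are split as well (for example `don't` becomes `don t`), so the single-character removal step would then drop the stray `t`.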






In [9]:
def freqDist_wrapper(text):
    return FreqDist(text)

df['content_tokens_fdist'] = df['content_token'].apply(freqDist_wrapper)

print('Frequency Tokens : \n') 
print(df['content_tokens_fdist'].head().apply(lambda x : x.most_common()))

Frequency Tokens : 

0    [(know, 1), (was, 1), (listenin, 1), (to, 1), ...
1    [(layin, 1), (bed, 1), (with, 1), (headache, 1...
2     [(funeral, 1), (ceremonygloomy, 1), (friday, 1)]
3    [(wants, 1), (to, 1), (hang, 1), (out, 1), (wi...
4    [(we, 1), (want, 1), (to, 1), (trade, 1), (wit...
Name: content_tokens_fdist, dtype: object


Filtering

In [10]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
# ----------------------- get stopword from NLTK stopword -------------------------------
list_stopwords = stopwords.words('english')

# ----------------------- add stopword from txt file ------------------------------------
# read txt stopword using pandas
txt_stopword = pd.read_csv("https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords", names= ["stopwords"], header = None)

# convert stopword string to list & append additional stopword
list_stopwords.extend(txt_stopword["stopwords"][0].split(' '))

# ---------------------------------------------------------------------------------------

# convert list to set for fast membership tests
list_stopwords = set(list_stopwords)


# remove stopwords from the token list
def stopwords_removal(words):
    return [word for word in words if word not in list_stopwords]

df['content_tokens_WSW'] = df['content_token'].apply(stopwords_removal) 


# print(df['content_tokens_WSW'].head())
print(df['content_tokens_WSW'])

0        [know, listenin, bad, habit, earlier, started,...
1               [layin, bed, headache, ughhhhwaitin, call]
2                        [funeral, ceremonygloomy, friday]
3                             [wants, hang, friends, soon]
4            [want, trade, someone, houston, tickets, one]
                               ...                        
39995                                                   []
39996                          [happy, mothers, day, love]
39997    [happy, mothers, day, mommies, woman, man, lon...
39998    [wassup, beautiful, follow, peep, new, hit, si...
39999    [bullet, train, tokyo, gf, visiting, japan, si...
Name: content_tokens_WSW, Length: 40000, dtype: object


In [16]:
# Check the number of rows per class
print(df['sentiment'].value_counts())
print('\n')

# Check data completeness
print(df.info())
print('\n')

# Check descriptive statistics
print(df.describe())

neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: sentiment, dtype: int64


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   tweet_id              40000 non-null  int64 
 1   sentiment             40000 non-null  object
 2   content               40000 non-null  object
 3   content_token         40000 non-null  object
 4   content_tokens_fdist  40000 non-null  object
 5   content_tokens_WSW    40000 non-null  object
dtypes: int64(1), object(5)
memory usage: 1.8+ MB
None


           tweet_id
count  4.000000e+04
mean   1.845184e+09
std    1.188579e+08
min    1.693956e+09
25%    1.751431e+09
50%    1.855443e+09
75%    1.962781e
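
The class counts printed above make the imbalance concrete. As a rough reference point, the sketch below computes the accuracy of a classifier that always predicts the most frequent label. The counts are copied from the output above and apply to the full dataset, so the figure for a particular train/test split may differ slightly.

```python
# Class counts copied from the value_counts() output above.
class_counts = {
    "neutral": 8638, "worry": 8459, "happiness": 5209, "sadness": 5165,
    "love": 3842, "surprise": 2187, "fun": 1776, "relief": 1526,
    "hate": 1323, "empty": 827, "enthusiasm": 759, "boredom": 179, "anger": 110,
}

total = sum(class_counts.values())
majority_label, majority_count = max(class_counts.items(), key=lambda kv: kv[1])

# Accuracy of a model that always predicts the most frequent class.
baseline_accuracy = majority_count / total
# Ratio between the largest and the smallest class.
imbalance_ratio = majority_count / min(class_counts.values())

print(total, majority_label, round(baseline_accuracy, 4), round(imbalance_ratio, 1))
```

Any model should be judged against this majority-class baseline of roughly 0.22. The largest class is also about 78 times the size of the smallest.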

In [29]:
df.head()

|   | tweet_id | sentiment | content | content_token | content_tokens_fdist | content_tokens_WSW |
|---|---|---|---|---|---|---|
| 0 | 1956967341 | empty | know was listenin to bad habit earlier and ... | [know, was, listenin, to, bad, habit, earlier,... | {'know': 1, 'was': 1, 'listenin': 1, 'to': 1, ... | [know, listenin, bad, habit, earlier, started,... |
| 1 | 1956967666 | sadness | layin bed with headache ughhhhwaitin on your... | [layin, bed, with, headache, ughhhhwaitin, on,... | {'layin': 1, 'bed': 1, 'with': 1, 'headache': ... | [layin, bed, headache, ughhhhwaitin, call] |
| 2 | 1956967696 | sadness | funeral ceremonygloomy friday | [funeral, ceremonygloomy, friday] | {'funeral': 1, 'ceremonygloomy': 1, 'friday': 1} | [funeral, ceremonygloomy, friday] |
| 3 | 1956967789 | enthusiasm | wants to hang out with friends soon | [wants, to, hang, out, with, friends, soon] | {'wants': 1, 'to': 1, 'hang': 1, 'out': 1, 'wi... | [wants, hang, friends, soon] |
| 4 | 1956968416 | neutral | we want to trade with someone who has houston ... | [we, want, to, trade, with, someone, who, has,... | {'we': 1, 'want': 1, 'to': 1, 'trade': 1, 'wit... | [want, trade, someone, houston, tickets, one] |


In [18]:
X = df['content'].values
y = df['sentiment'].values

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

bow = CountVectorizer(stop_words='english')

X_train = bow.fit_transform(X_train)

X_test = bow.transform(X_test)

In [20]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Initialize MultinomialNB
mnb = MultinomialNB()

# Fit the model
mnb.fit(X_train, y_train)

# Predict on the training data
y_pred_train = mnb.predict(X_train)

# Evaluate accuracy on the training data
acc_train = accuracy_score(y_train, y_pred_train)

# Predict on the test data
y_pred_test = mnb.predict(X_test)

# Evaluate accuracy on the test data
acc_test = accuracy_score(y_test, y_pred_test)

# Print the evaluation results
print(f'Train accuracy: {acc_train}')
print(f'Test accuracy: {acc_test}')

Train accuracy: 0.56396875
Test accuracy: 0.322125
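
The gap between train and test accuracy above suggests the model does not generalize well, and accuracy alone hides how the rare classes fare. One possible next step, which is not part of the original notebook, is sketched below on a toy, made-up corpus. It uses a stratified split, TF-IDF with unigrams and bigrams, scikit-learn's `ComplementNB` (a Naive Bayes variant designed for imbalanced classes) swapped in for `MultinomialNB`, and a per-class report. Whether this helps on the tweet data is untested here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Toy, made-up corpus standing in for df['content'] and df['sentiment'].
texts = [
    "so happy today", "love this sunny day", "happy happy day", "great fun with friends",
    "sad and tired", "worried about the exam", "such a sad day", "headache and worry",
] * 5
labels = (["positive"] * 4 + ["negative"] * 4) * 5

# stratify keeps the class proportions the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=50, stratify=labels
)

# The pipeline fits the vectorizer on the training split only.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), ComplementNB())
model.fit(X_train, y_train)

# Per-class precision, recall, and F1 alongside overall accuracy.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
```

With the real data, `texts` and `labels` would be replaced by the cleaned `content` column and the `sentiment` column.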
