## Emotion Detection for Twitter Users

Emotion detection is one of the open problems in ***Natural Language Processing*** (NLP). One reason is the shortage of labeled datasets for classifying emotion from Twitter data. Another is that Twitter data can carry many different emotion labels (***multi-class***). Humans have a wide range of emotions, and it is hard to collect enough data for each one, so some classes end up with far fewer examples than others (***class imbalance***). For this midterm exam (Ujian Tengah Semester, UTS), you are given a dataset of Twitter text that is already labeled with several emotion classes. Your main task is to build a capable model for classifying emotion from text.

### Data Information

The dataset used is ***tweet_emotions.csv***. The following information about the dataset may help you.

- Total data: 40,000 rows
- Emotion labels: anger, boredom, empty, enthusiasm, fun, happiness, hate, love, neutral, relief, sadness, surprise, worry
- The number of rows per label is not equal (***class imbalance***)
- There are 3 columns: 'tweet_id', 'sentiment', 'content'
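
Because the labels are imbalanced, it helps to look at the per-class counts and shares before modelling. The following is a minimal sketch using only the standard library; the `labels` list is made-up illustrative data, not values from the dataset, and `class_distribution` is a hypothetical helper name. With the real data loaded into a pandas DataFrame, `value_counts()` on the `sentiment` column gives the same information.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, share)} ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}

# Hypothetical labels, for illustration only
labels = ["neutral", "worry", "neutral", "anger", "worry", "neutral"]

for label, (n, share) in class_distribution(labels).items():
    print(f"{label:<8} {n:>3} {share:.1%}")
```

A class whose share is very small is a candidate for special treatment later, for example stratified splitting or per-class metrics.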

### UTS Grading

The UTS is graded on the four processes you will carry out: data preprocessing, feature extraction, building the machine learning model, and evaluation.

#### Data Preprocessing

> **Attention**
> 
> Before you do anything with your data, make sure it is in good shape: free of missing values, using appropriate data types, and so on.
>

The Twitter data you receive is raw, so there are several steps you can apply (this list is not exhaustive):

1. Case Folding
2. Tokenizing
3. Filtering
4. Stemming

*NOTE: Twitter data contains mentions (@something) that you must handle before moving on to the feature extraction stage.*
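
One possible way to handle mentions is to delete them with a regular expression before punctuation is stripped, so that usernames never become vocabulary entries. The sketch below is an assumption about what "handle" could mean here, not a required method; removing URLs in the same pass is an extra step the assignment does not ask for, and the function name is hypothetical.

```python
import re

MENTION_RE = re.compile(r"@\w+")       # @username
URL_RE = re.compile(r"https?://\S+")   # http(s) links (extra, optional step)

def strip_mentions_and_urls(text):
    """Remove @mentions and URLs, then collapse the leftover whitespace."""
    text = MENTION_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(strip_mentions_and_urls("@kelcouch I'm sorry at least it's Friday?"))
# I'm sorry at least it's Friday?
```

An alternative design is to replace each mention with a fixed placeholder token instead of deleting it, which keeps the information that someone was addressed.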

#### Feature Extraction

You can use several methods, including:

1. Bag of Words (Count / TF-IDF)
2. N-gram
3. and so on
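
To make the first two methods concrete, here is a small pure-Python sketch of counting unigrams and bigrams from a token list. The helper names `ngrams` and `bag_of_ngrams` are hypothetical and written for illustration; in practice scikit-learn's `CountVectorizer` or `TfidfVectorizer` with the `ngram_range` parameter does this work on a whole corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(tokens, n_values=(1, 2)):
    """Count every n-gram for each n in n_values (a bag-of-words when n=1)."""
    counts = Counter()
    for n in n_values:
        counts.update(ngrams(tokens, n))
    return counts

tokens = ["wants", "hang", "friends", "soon"]
print(ngrams(tokens, 2))
# ['wants hang', 'hang friends', 'friends soon']
print(bag_of_ngrams(tokens))
```

Adding bigrams grows the vocabulary quickly, so it is worth watching the number of features when the range is widened.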

#### Model Building

You are free to choose the classification algorithm. You may use an algorithm taught in class or a different one, with one condition: following the principle of accountability in machine learning model development, you must be able to explain how your model arrives at a given output.

#### Evaluation

For evaluation, you must at minimum report accuracy. You may also add other metrics such as recall, precision, F1-score, a detailed confusion matrix, or Area Under the Curve (AUC).
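
With imbalanced classes, accuracy alone can look good while minority classes are never predicted. The sketch below computes per-class precision, recall, and F1 plus their macro average from raw counts; the predictions are made up to show the effect, and the helper names are hypothetical. scikit-learn's `classification_report` and `f1_score(average='macro')` provide the same quantities.

```python
from collections import Counter

def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but was wrong
            fn[t] += 1  # the true class t was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

# Hypothetical model that always answers "neutral"
y_true = ["neutral"] * 8 + ["anger"] * 2
y_pred = ["neutral"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}  macro-F1={macro_f1(y_true, y_pred):.2f}")
```

Here accuracy is 0.80 even though the "anger" class is never predicted, while the macro F1 drops to about 0.44 because every class counts equally.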

### Worksheet
The worksheet starts from the cell below.

In [4]:
import numpy as np
import pandas as pd

In [5]:
df = pd.read_csv('data/tweet_emotions.csv')

df.head()

Unnamed: 0,tweet_id,sentiment,content
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...
2,1956967696,sadness,Funeral ceremony...gloomy friday...
3,1956967789,enthusiasm,wants to hang out with friends SOON!
4,1956968416,neutral,@dannycastillo We want to trade with someone w...


In [6]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
import nltk 
import string
import re
%matplotlib inline
pd.set_option('display.max_colwidth', 100)

In [7]:
tweet_df = pd.read_csv('data/tweet_emotions.csv')

print('Dataset size:',tweet_df.shape)
print('Columns are:',tweet_df.columns)

Dataset size: (40000, 3)
Columns are: Index(['tweet_id', 'sentiment', 'content'], dtype='object')


In [8]:
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   tweet_id   40000 non-null  int64 
 1   sentiment  40000 non-null  object
 2   content    40000 non-null  object
dtypes: int64(1), object(2)
memory usage: 937.6+ KB


In [9]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [10]:
def remove_punct(text):
    """Drop every punctuation character, then drop runs of digits."""
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub(r'[0-9]+', '', text)
    return text

In [11]:
df['Tweet_punct'] = df['content'].apply(lambda x: remove_punct(x))
df.head(10)

Unnamed: 0,tweet_id,sentiment,content,Tweet_punct
0,1956967341,empty,@tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part =[,tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin on your call...,Layin n bed with a headache ughhhhwaitin on your call
2,1956967696,sadness,Funeral ceremony...gloomy friday...,Funeral ceremonygloomy friday
3,1956967789,enthusiasm,wants to hang out with friends SOON!,wants to hang out with friends SOON
4,1956968416,neutral,"@dannycastillo We want to trade with someone who has Houston tickets, but no one will.",dannycastillo We want to trade with someone who has Houston tickets but no one will
5,1956968477,worry,Re-pinging @ghostridah14: why didn't you go to prom? BC my bf didn't like my friends,Repinging ghostridah why didnt you go to prom BC my bf didnt like my friends
6,1956968487,sadness,"I should be sleep, but im not! thinking about an old friend who I want. but he's married now. da...",I should be sleep but im not thinking about an old friend who I want but hes married now damn am...
7,1956968636,worry,Hmmm. http://www.djhero.com/ is down,Hmmm httpwwwdjherocom is down
8,1956969035,sadness,@charviray Charlene my love. I miss you,charviray Charlene my love I miss you
9,1956969172,sadness,@kelcouch I'm sorry at least it's Friday?,kelcouch Im sorry at least its Friday
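
In the output above, deleting punctuation outright glues neighbouring words together (for example `ceremony...gloomy` becomes `ceremonygloomy`). One possible alternative, sketched here under the assumption that keeping word boundaries matters more than keeping contractions intact, is to replace punctuation and digits with spaces. The function name is hypothetical.

```python
import re
import string

# Map every punctuation character to a space instead of deleting it
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_punct_keep_gaps(text):
    """Replace punctuation and digits with spaces so adjacent words stay separate."""
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"[0-9]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_punct_keep_gaps("Funeral ceremony...gloomy friday..."))
# Funeral ceremony gloomy friday
```

The trade-off is that contractions are split as well (`didn't` becomes `didn t`), so the stopword step would need to account for the stray single letters.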


In [12]:
def tokenization(text):
    # Raw string avoids the invalid-escape warning for \W
    text = re.split(r'\W+', text)
    return text

df['Tweet_tokenized'] = df['Tweet_punct'].apply(lambda x: tokenization(x.lower()))
df.head()

Unnamed: 0,tweet_id,sentiment,content,Tweet_punct,Tweet_tokenized
0,1956967341,empty,@tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part =[,tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part,"[tiffanylue, i, know, i, was, listenin, to, bad, habit, earlier, and, i, started, freakin, at, h..."
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin on your call...,Layin n bed with a headache ughhhhwaitin on your call,"[layin, n, bed, with, a, headache, ughhhhwaitin, on, your, call]"
2,1956967696,sadness,Funeral ceremony...gloomy friday...,Funeral ceremonygloomy friday,"[funeral, ceremonygloomy, friday]"
3,1956967789,enthusiasm,wants to hang out with friends SOON!,wants to hang out with friends SOON,"[wants, to, hang, out, with, friends, soon]"
4,1956968416,neutral,"@dannycastillo We want to trade with someone who has Houston tickets, but no one will.",dannycastillo We want to trade with someone who has Houston tickets but no one will,"[dannycastillo, we, want, to, trade, with, someone, who, has, houston, tickets, but, no, one, will]"


In [13]:
stopword = nltk.corpus.stopwords.words('english')
#stopword.extend(['yr', 'year', 'woman', 'man', 'girl','boy','one', 'two', 'sixteen', 'yearold', 'fu', 'weeks', 'week',
#               'treatment', 'associated', 'patients', 'may','day', 'case','old'])

In [14]:
def remove_stopwords(text):
    text = [word for word in text if word not in stopword]
    return text

df['Tweet_nonstop'] = df['Tweet_tokenized'].apply(lambda x: remove_stopwords(x))
df.head(10)

Unnamed: 0,tweet_id,sentiment,content,Tweet_punct,Tweet_tokenized,Tweet_nonstop
0,1956967341,empty,@tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part =[,tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part,"[tiffanylue, i, know, i, was, listenin, to, bad, habit, earlier, and, i, started, freakin, at, h...","[tiffanylue, know, listenin, bad, habit, earlier, started, freakin, part, ]"
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin on your call...,Layin n bed with a headache ughhhhwaitin on your call,"[layin, n, bed, with, a, headache, ughhhhwaitin, on, your, call]","[layin, n, bed, headache, ughhhhwaitin, call]"
2,1956967696,sadness,Funeral ceremony...gloomy friday...,Funeral ceremonygloomy friday,"[funeral, ceremonygloomy, friday]","[funeral, ceremonygloomy, friday]"
3,1956967789,enthusiasm,wants to hang out with friends SOON!,wants to hang out with friends SOON,"[wants, to, hang, out, with, friends, soon]","[wants, hang, friends, soon]"
4,1956968416,neutral,"@dannycastillo We want to trade with someone who has Houston tickets, but no one will.",dannycastillo We want to trade with someone who has Houston tickets but no one will,"[dannycastillo, we, want, to, trade, with, someone, who, has, houston, tickets, but, no, one, will]","[dannycastillo, want, trade, someone, houston, tickets, one]"
5,1956968477,worry,Re-pinging @ghostridah14: why didn't you go to prom? BC my bf didn't like my friends,Repinging ghostridah why didnt you go to prom BC my bf didnt like my friends,"[repinging, ghostridah, why, didnt, you, go, to, prom, bc, my, bf, didnt, like, my, friends]","[repinging, ghostridah, didnt, go, prom, bc, bf, didnt, like, friends]"
6,1956968487,sadness,"I should be sleep, but im not! thinking about an old friend who I want. but he's married now. da...",I should be sleep but im not thinking about an old friend who I want but hes married now damn am...,"[i, should, be, sleep, but, im, not, thinking, about, an, old, friend, who, i, want, but, hes, m...","[sleep, im, thinking, old, friend, want, hes, married, damn, amp, wants, scandalous]"
7,1956968636,worry,Hmmm. http://www.djhero.com/ is down,Hmmm httpwwwdjherocom is down,"[hmmm, httpwwwdjherocom, is, down]","[hmmm, httpwwwdjherocom]"
8,1956969035,sadness,@charviray Charlene my love. I miss you,charviray Charlene my love I miss you,"[charviray, charlene, my, love, i, miss, you]","[charviray, charlene, love, miss]"
9,1956969172,sadness,@kelcouch I'm sorry at least it's Friday?,kelcouch Im sorry at least its Friday,"[kelcouch, im, sorry, at, least, its, friday]","[kelcouch, im, sorry, least, friday]"


In [15]:
ps = nltk.PorterStemmer()

def stemming(text):
    text = [ps.stem(word) for word in text]
    return text

df['Tweet_stemmed'] = df['Tweet_nonstop'].apply(lambda x: stemming(x))
df.head()

Unnamed: 0,tweet_id,sentiment,content,Tweet_punct,Tweet_tokenized,Tweet_nonstop,Tweet_stemmed
0,1956967341,empty,@tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part =[,tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part,"[tiffanylue, i, know, i, was, listenin, to, bad, habit, earlier, and, i, started, freakin, at, h...","[tiffanylue, know, listenin, bad, habit, earlier, started, freakin, part, ]","[tiffanylu, know, listenin, bad, habit, earlier, start, freakin, part, ]"
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin on your call...,Layin n bed with a headache ughhhhwaitin on your call,"[layin, n, bed, with, a, headache, ughhhhwaitin, on, your, call]","[layin, n, bed, headache, ughhhhwaitin, call]","[layin, n, bed, headach, ughhhhwaitin, call]"
2,1956967696,sadness,Funeral ceremony...gloomy friday...,Funeral ceremonygloomy friday,"[funeral, ceremonygloomy, friday]","[funeral, ceremonygloomy, friday]","[funer, ceremonygloomi, friday]"
3,1956967789,enthusiasm,wants to hang out with friends SOON!,wants to hang out with friends SOON,"[wants, to, hang, out, with, friends, soon]","[wants, hang, friends, soon]","[want, hang, friend, soon]"
4,1956968416,neutral,"@dannycastillo We want to trade with someone who has Houston tickets, but no one will.",dannycastillo We want to trade with someone who has Houston tickets but no one will,"[dannycastillo, we, want, to, trade, with, someone, who, has, houston, tickets, but, no, one, will]","[dannycastillo, want, trade, someone, houston, tickets, one]","[dannycastillo, want, trade, someon, houston, ticket, one]"


In [16]:
import nltk
#nltk.download('wordnet')
#nltk.download('omw-1.4')

wn = nltk.WordNetLemmatizer()

def lemmatizer(text):
    text = [wn.lemmatize(word) for word in text]
    return text

df['Tweet_lemmatized'] = df['Tweet_nonstop'].apply(lambda x: lemmatizer(x))
df.head()

Unnamed: 0,tweet_id,sentiment,content,Tweet_punct,Tweet_tokenized,Tweet_nonstop,Tweet_stemmed,Tweet_lemmatized
0,1956967341,empty,@tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part =[,tiffanylue i know i was listenin to bad habit earlier and i started freakin at his part,"[tiffanylue, i, know, i, was, listenin, to, bad, habit, earlier, and, i, started, freakin, at, h...","[tiffanylue, know, listenin, bad, habit, earlier, started, freakin, part, ]","[tiffanylu, know, listenin, bad, habit, earlier, start, freakin, part, ]","[tiffanylue, know, listenin, bad, habit, earlier, started, freakin, part, ]"
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin on your call...,Layin n bed with a headache ughhhhwaitin on your call,"[layin, n, bed, with, a, headache, ughhhhwaitin, on, your, call]","[layin, n, bed, headache, ughhhhwaitin, call]","[layin, n, bed, headach, ughhhhwaitin, call]","[layin, n, bed, headache, ughhhhwaitin, call]"
2,1956967696,sadness,Funeral ceremony...gloomy friday...,Funeral ceremonygloomy friday,"[funeral, ceremonygloomy, friday]","[funeral, ceremonygloomy, friday]","[funer, ceremonygloomi, friday]","[funeral, ceremonygloomy, friday]"
3,1956967789,enthusiasm,wants to hang out with friends SOON!,wants to hang out with friends SOON,"[wants, to, hang, out, with, friends, soon]","[wants, hang, friends, soon]","[want, hang, friend, soon]","[want, hang, friend, soon]"
4,1956968416,neutral,"@dannycastillo We want to trade with someone who has Houston tickets, but no one will.",dannycastillo We want to trade with someone who has Houston tickets but no one will,"[dannycastillo, we, want, to, trade, with, someone, who, has, houston, tickets, but, no, one, will]","[dannycastillo, want, trade, someone, houston, tickets, one]","[dannycastillo, want, trade, someon, houston, ticket, one]","[dannycastillo, want, trade, someone, houston, ticket, one]"


In [17]:
def clean_text(text):
    # Lowercase each character and drop punctuation
    text_lc = "".join([char.lower() for char in text if char not in string.punctuation])
    text_rc = re.sub(r'[0-9]+', '', text_lc)  # remove digits
    tokens = re.split(r'\W+', text_rc)    # tokenization
    text = [ps.stem(word) for word in tokens if word not in stopword]  # remove stopwords and stem
    return text

In [18]:
#data = clean_text(df['content'])

In [19]:
#from sklearn.feature_extraction.text import CountVectorizer

#vectorizer = CountVectorizer()
#vectorizer.fit(data)
#vectorizer.vocabulary_

In [20]:
countVectorizer = CountVectorizer(analyzer=clean_text) 
countVector = countVectorizer.fit_transform(df['content'])
print('{} Number of tweets has {} words'.format(countVector.shape[0], countVector.shape[1]))
#print(countVectorizer.get_feature_names())

40000 Number of tweets has 45306 words


In [21]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_vect_df = pd.DataFrame(countVector.toarray(), columns=countVectorizer.get_feature_names_out())
count_vect_df.head()



Unnamed: 0,Unnamed: 1,aa,aaa,aaaa,aaaaa,aaaaaaaa,aaaaaaaaaaa,aaaaaaaaaahhhhhhhh,aaaaaaaaaamaz,aaaaaaaafternoon,...,½we,½whi,½who,½whyyi,½y,½you,½z,½ï,â,ï
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
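# Separate the label and the features
# NOTE: column 0 of count_vect_df is the count of the empty-string token, not the
# emotion label. The label the task asks for is df['sentiment'].
X = count_vect_df.iloc[:,1:]
y = count_vect_df.iloc[:,0]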
X

Unnamed: 0,aa,aaa,aaaa,aaaaa,aaaaaaaa,aaaaaaaaaaa,aaaaaaaaaahhhhhhhh,aaaaaaaaaamaz,aaaaaaaafternoon,aaaaaaaahhhhhhhh,...,½we,½whi,½who,½whyyi,½y,½you,½z,½ï,â,ï
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
# Encode features (the counts are already integers, so this only re-maps them to ranks)
from sklearn.preprocessing import LabelEncoder

def feature_encode(df):
    encode = LabelEncoder()
    for col in df:
        df[col] = encode.fit_transform(df[col])
    
    return df

X = feature_encode(X)
display(X)

# Encode Label
encode = LabelEncoder()
y = encode.fit_transform(y)
print(y)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = encode.fit_transform(df[col])


Unnamed: 0,aa,aaa,aaaa,aaaaa,aaaaaaaa,aaaaaaaaaaa,aaaaaaaaaahhhhhhhh,aaaaaaaaaamaz,aaaaaaaafternoon,aaaaaaaahhhhhhhh,...,½we,½whi,½who,½whyyi,½y,½you,½z,½ï,â,ï
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


[1 0 0 ... 0 0 0]


In [24]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [25]:
# Split train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)

# Initiate Model
dt_entropy = DecisionTreeClassifier(criterion='entropy')
dt_gini = DecisionTreeClassifier(criterion='gini')

# Fit model
# Entropy
dt_entropy.fit(X_train, y_train)
y_pred_entropy_train = dt_entropy.predict(X_train)
y_pred_entropy = dt_entropy.predict(X_test)

# Gini
dt_gini.fit(X_train, y_train)
y_pred_gini_train = dt_gini.predict(X_train)
y_pred_gini = dt_gini.predict(X_test)

# Evaluation
# Entropy
acc_entropy_train = accuracy_score(y_train, y_pred_entropy_train)
acc_entropy = accuracy_score(y_test, y_pred_entropy)

# Gini
acc_gini_train = accuracy_score(y_train, y_pred_gini_train)
acc_gini = accuracy_score(y_test, y_pred_gini)

print(f'Accuracy Entropy Train: {acc_entropy_train}')
print(f'Accuracy Entropy: {acc_entropy}')
print('\n')
print(f'Accuracy Gini Train: {acc_gini_train}')
print(f'Accuracy Gini: {acc_gini}')

MemoryError: Unable to allocate 9.45 GiB for an array with shape (45305, 28000) and data type int64
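
The error above comes from working with a dense `int64` array: 45305 × 28000 × 8 bytes is roughly 9.45 GiB. `CountVectorizer` already returns a SciPy sparse matrix and `DecisionTreeClassifier` accepts sparse input, so one possible way around the error is to skip the dense DataFrame altogether and take the label from the `sentiment` column. The sketch below uses a made-up toy corpus standing in for `df['content']` and `df['sentiment']`; the `stratify` argument is an added suggestion for the class imbalance, not something the notebook used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for df['content'] and df['sentiment']
texts = ["love this song", "so much love", "love you all", "pure love today",
         "hate this rain", "so much hate", "hate mondays", "pure hate today"]
labels = ["love"] * 4 + ["hate"] * 4

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # stays sparse; no .toarray()
y = labels                           # the emotion label, not a token count

# stratify keeps the class proportions equal in the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=50, stratify=y)

model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
print("test accuracy:", acc)
```

On the real data the same pattern would pass `countVector` and `df['sentiment']` directly to `train_test_split`. Limiting the vocabulary, for example with the `min_df` or `max_features` parameters of `CountVectorizer`, is another way to reduce memory use.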