In the previous session we studied text classification using an LSTM with skip-gram text vectorization, applied to emotion classification data.


In this forum you are given data with 20 labels. Classify it using an LSTM and skip-gram text vectorization.
You are free to tune the hyperparameters, including the LSTM architecture.

In [1]:
import pandas as pd
import numpy as np

In [2]:
import re
import nltk
from nltk.tokenize import word_tokenize
import warnings
warnings.filterwarnings('ignore')

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# import dataset
df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Text Mining/GSLC 2/train_data.csv')

In [5]:
df.head()

Unnamed: 0,text,label
0,Here are Thursday's biggest analyst calls: App...,0
1,Buy Las Vegas Sands as travel to Singapore bui...,0
2,"Piper Sandler downgrades DocuSign to sell, cit...",0
3,"Analysts react to Tesla's latest earnings, bre...",0
4,Netflix and its peers are set for a ‘return to...,0


In [6]:
print(df.iloc[0][0])

Here are Thursday's biggest analyst calls: Apple, Amazon, Tesla, Palantir, DocuSign, Exxon &amp; more  https://t.co/QPN8Gwl7Uh


In [7]:
df.shape

(16990, 2)

In [8]:
# check whether every column is fully populated (no missing values)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16990 entries, 0 to 16989
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    16990 non-null  object
 1   label   16990 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 265.6+ KB


In [9]:
# check the number of rows per label
df['label'].value_counts()

label
2     3545
18    2118
14    1822
9     1557
5      987
16     985
1      837
19     823
7      624
6      524
15     501
17     495
12     487
13     471
4      359
3      321
0      255
8      166
10      69
11      44
Name: count, dtype: int64
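
The counts above are strongly imbalanced (3545 rows for label 2 against 44 for label 11). One common way to compensate, which this notebook does not use, is to weight each class inversely to its frequency. The sketch below applies the "balanced" heuristic `n_samples / (n_classes * count)` to the counts printed above; in practice the counts would come from the training split only, and the resulting dictionary could be passed as `class_weight` to `model.fit`.

```python
# Label counts copied from the value_counts() output above.
label_counts = {
    2: 3545, 18: 2118, 14: 1822, 9: 1557, 5: 987, 16: 985, 1: 837,
    19: 823, 7: 624, 6: 524, 15: 501, 17: 495, 12: 487, 13: 471,
    4: 359, 3: 321, 0: 255, 8: 166, 10: 69, 11: 44,
}

n_samples = sum(label_counts.values())
n_classes = len(label_counts)

# "Balanced" heuristic: rare classes get proportionally larger weights.
class_weight = {
    label: n_samples / (n_classes * count)
    for label, count in label_counts.items()
}

# Print the weights from the most frequent to the rarest class.
for label, weight in sorted(class_weight.items(), key=lambda kv: kv[1]):
    print(f"label {label:>2}: weight {weight:.2f}")
```

Whether this helps the weak classes here is an open question; it is only one option to try alongside the tuning ideas at the end of the notebook.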

In [6]:
# drop rows whose text is duplicated
df = df.drop_duplicates(subset=['text'])

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16990 entries, 0 to 16989
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    16990 non-null  object
 1   label   16990 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 265.6+ KB


In [7]:
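def cleansing(df):
  # df is expected to be a pandas Series of raw strings; returns a list of cleaned strings
  df_clean = df.copy()
  df_clean = df_clean.str.lower()  # lowercase
  df_clean = [re.sub(r"\d+","", i) for i in df_clean]  # remove digits
  df_clean = [re.sub(r'[^\w]', ' ', i) for i in df_clean]  # replace punctuation with spaces
  df_clean = [re.sub(r'\s+',' ', i) for i in df_clean]  # collapse whitespace
  # intended to strip URLs; because punctuation is already gone and \s matches whitespace,
  # this only removes the literal token "https", so fragments such as "t co ..." remain
  df_clean = [re.sub(r'https\s+','',i) for i in df_clean]
  return df_clean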


In [8]:
# clean text
df['clean_text']=cleansing(df['text'])

In [14]:
print(df.iloc[0][2])

here are thursday s biggest analyst calls apple amazon tesla palantir docusign exxon amp more t co qpngwluh
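
The cleaned sample above still contains the URL fragments `t co qpngwluh` and the token `amp` left over from the HTML entity `&amp;`. A minimal sketch of an alternative cleaning order is shown below; it is not the function used in the rest of this notebook, and `clean_text` is a hypothetical name. It assumes URLs start with `http://` or `https://` and removes them before punctuation is stripped.

```python
import html
import re

def clean_text(text: str) -> str:
    """Hypothetical variant of the cleaning step: decode entities and drop URLs first."""
    text = html.unescape(text)                  # "&amp;" -> "&"
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove whole URLs while "://" still exists
    text = re.sub(r"\d+", "", text)             # remove digits
    text = re.sub(r"[^\w\s]", " ", text)        # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

sample = ("Here are Thursday's biggest analyst calls: Apple, Amazon, Tesla, "
          "Palantir, DocuSign, Exxon &amp; more  https://t.co/QPN8Gwl7Uh")
cleaned = clean_text(sample)
print(cleaned)
```

Applied to a whole column, this could be used as `df['text'].apply(clean_text)`. The outputs shown in the following cells were produced with the original `cleansing` function, so they still contain the `co` tokens.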


In [15]:
df.head()

Unnamed: 0,text,label,clean_text
0,Here are Thursday's biggest analyst calls: App...,0,here are thursday s biggest analyst calls appl...
1,Buy Las Vegas Sands as travel to Singapore bui...,0,buy las vegas sands as travel to singapore bui...
2,"Piper Sandler downgrades DocuSign to sell, cit...",0,piper sandler downgrades docusign to sell citi...
3,"Analysts react to Tesla's latest earnings, bre...",0,analysts react to tesla s latest earnings brea...
4,Netflix and its peers are set for a ‘return to...,0,netflix and its peers are set for a return to ...


In [9]:
# maximum number of words in a cleaned sentence
max_sen = df['clean_text'].str.split().str.len().max()

In [10]:
max_sen

57

**Split the Data**

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score, log_loss
from sklearn.metrics import classification_report, confusion_matrix

In [12]:
x_train, x_test, y_train, y_test = train_test_split(df['clean_text'], df['label'], test_size = 0.2, random_state = 42,stratify=df['label'])

In [20]:
len(x_train)

13592

In [21]:
len(x_test)

3398

In [13]:
# maximum number of words per sentence in the training data
x_train.str.split().str.len().max()

57

**Remove Stopword and Tokenization**

In [14]:
#Define Stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')
# Get a list of stop words in the English language
stop_words = set(stopwords.words('english'))

# Display 20 of the stop words (a set has no defined order, so these are not a "top 20")
list(stop_words)[:20]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['all',
 'ours',
 'myself',
 'very',
 'mightn',
 'off',
 'up',
 'too',
 'our',
 'being',
 'themselves',
 "mightn't",
 'in',
 "doesn't",
 'until',
 'are',
 'be',
 'through',
 'needn',
 'were']

In [15]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [16]:
#tokenization
word_token=[word_tokenize(i) for i in x_train]

In [26]:
word_token

[['it',
  's',
  'ecb',
  'rate',
  'decision',
  'day',
  'here',
  's',
  'what',
  'to',
  'expect',
  'via',
  'weberalexander',
  'amp',
  'carolynnlook',
  't',
  'co',
  'isqgdue'],
 ['twitter',
  'users',
  'were',
  'quick',
  'to',
  'spot',
  'liz',
  'truss',
  'seemingly',
  'recreating',
  'an',
  'outfit',
  'of',
  'margaret',
  'thatcher',
  's',
  'for',
  'her',
  'appearance',
  'at',
  'channel',
  's',
  'tory',
  'leadership',
  'debate',
  't',
  'co',
  'vsiioegrz'],
 ['jetblue',
  'announces',
  'webcast',
  'of',
  'second',
  'quarter',
  'earnings',
  'conference',
  'call',
  't',
  'co',
  'kzrfsrwcpk',
  't',
  'co',
  'xjbczmry'],
 ['calm',
  'cal',
  'maine',
  'foods',
  'stock',
  'ticks',
  'higher',
  'on',
  'record',
  'net',
  'income',
  'pricing',
  'power',
  't',
  'co',
  'nczmzphcx'],
 ['tower',
  'semiconductor',
  'and',
  'cadence',
  'expand',
  'collaboration',
  'to',
  'accelerate',
  'automotive',
  'ic',
  'development',
  't',
  

In [27]:
len(word_token)

13592

In [17]:
# Remove stopwords from each sublist in word_token
filtered_tokens_train = [[word for word in sublist if word not in stop_words] for sublist in word_token]

# Display the first sublist of filtered tokens after removing stopwords
print(filtered_tokens_train[0])

['ecb', 'rate', 'decision', 'day', 'expect', 'via', 'weberalexander', 'amp', 'carolynnlook', 'co', 'isqgdue']


In [18]:
print(filtered_tokens_train[1])

['twitter', 'users', 'quick', 'spot', 'liz', 'truss', 'seemingly', 'recreating', 'outfit', 'margaret', 'thatcher', 'appearance', 'channel', 'tory', 'leadership', 'debate', 'co', 'vsiioegrz']


**Lemmatization**

In [19]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [20]:
lemmatizer = WordNetLemmatizer()

In [21]:
lemmatized_tokens_train = [[lemmatizer.lemmatize(word) for word in sublist] for sublist in filtered_tokens_train]

In [38]:
lemmatized_tokens_train

[['ecb',
  'rate',
  'decision',
  'day',
  'expect',
  'via',
  'weberalexander',
  'amp',
  'carolynnlook',
  'co',
  'isqgdue'],
 ['twitter',
  'user',
  'quick',
  'spot',
  'liz',
  'truss',
  'seemingly',
  'recreating',
  'outfit',
  'margaret',
  'thatcher',
  'appearance',
  'channel',
  'tory',
  'leadership',
  'debate',
  'co',
  'vsiioegrz'],
 ['jetblue',
  'announces',
  'webcast',
  'second',
  'quarter',
  'earnings',
  'conference',
  'call',
  'co',
  'kzrfsrwcpk',
  'co',
  'xjbczmry'],
 ['calm',
  'cal',
  'maine',
  'food',
  'stock',
  'tick',
  'higher',
  'record',
  'net',
  'income',
  'pricing',
  'power',
  'co',
  'nczmzphcx'],
 ['tower',
  'semiconductor',
  'cadence',
  'expand',
  'collaboration',
  'accelerate',
  'automotive',
  'ic',
  'development',
  'co',
  'vssriai',
  'co',
  'kkjrddoxa'],
 ['u',
  'stock',
  'fell',
  'earnings',
  'season',
  'set',
  'kick',
  'week',
  'investor',
  'worried',
  'impact',
  'inflation',
  'corporate',
  'prof

**Vectorization Word2Vec - Skipgram**

In [22]:
import gensim
from gensim.models import Word2Vec
model_skipgram = gensim.models.Word2Vec(lemmatized_tokens_train, min_count = 2,vector_size = 100, sg=1)
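
Setting `sg=1` selects the skip-gram objective, in which each word is used to predict the words around it. The sketch below only illustrates how (center, context) training pairs are formed from one tokenized sentence. It is not gensim's implementation: the helper name `skipgram_pairs` and the window of 2 are assumptions for illustration, and gensim's real training additionally uses techniques such as negative sampling that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# First five lemmatized tokens of the first training sentence shown above.
tokens = ['ecb', 'rate', 'decision', 'day', 'expect']
pairs = skipgram_pairs(tokens, window=2)

for center, context in pairs[:5]:
    print(center, '->', context)
```

Words that occur near the same contexts end up with similar vectors, which is what the embedding layer later reuses.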

In [23]:
vocabulary_skipgram = model_skipgram.wv.index_to_key
word_vec_dict={}

for word in vocabulary_skipgram:
    word_vec_dict[word]=model_skipgram.wv.get_vector(word)
print("The no of key-value pairs : ",len(word_vec_dict)) # should come equal to vocab size


The no of key-value pairs :  9865


In [24]:
word_vec_dict

{'co': array([ 0.04944844, -0.19914235,  0.3364713 ,  0.12279912, -0.41393432,
        -0.52009106,  0.5553665 ,  0.4353354 , -0.3946706 , -0.03693954,
        -0.15181535, -0.32468238, -0.16752617, -0.07201686, -0.06053897,
        -0.0801933 ,  0.14561123,  0.055388  , -0.3384287 , -0.5023502 ,
         0.06110964,  0.3375496 ,  0.2389425 ,  0.12013111,  0.05166225,
         0.01275238, -0.10810937, -0.31905845, -0.3874342 ,  0.12249593,
        -0.03629033, -0.38851312,  0.26802772, -0.31107968,  0.05597866,
         0.4101505 ,  0.4262082 ,  0.07180206, -0.27372456, -0.31661138,
        -0.09810401, -0.21975182, -0.55745196,  0.09015995,  0.327454  ,
        -0.16610835, -0.36411408,  0.34496367,  0.1836047 ,  0.3386533 ,
        -0.02744745, -0.21516791,  0.41073635, -0.03007238,  0.13282804,
         0.08214888, -0.06219637, -0.26565292, -0.03453907,  0.13757613,
         0.13707286, -0.17200856,  0.1315011 , -0.07553819, -0.35424396,
         0.58695334,  0.2811549 ,  0.63059765

In [25]:
from keras.preprocessing.text import one_hot,Tokenizer
tok = Tokenizer()
tok.fit_on_texts(lemmatized_tokens_train)
encd_rev = tok.texts_to_sequences(lemmatized_tokens_train)

In [26]:
max_sen_len= max_sen # maximum number of words in a sentence
vocab_size = len(tok.word_index) * 2  # len(tok.word_index) + 1 would be enough: without an oov_token the Tokenizer drops words it has not seen (for example in the test set), so the extra rows are never used
embed_dim=100 # embedding dimension, must match vector_size in the Word2Vec constructor

In [27]:
# now creating the embedding matrix
embed_matrix=np.zeros(shape=(vocab_size,embed_dim))

for word,i in tok.word_index.items():
    embed_vector=word_vec_dict.get(word) # look up the word's vector in the skip-gram dictionary
    if embed_vector is not None:  # the word is in the vocabulary learned by the w2v model
        embed_matrix[i]=embed_vector
    # if the word is not found, its row in the embedding matrix stays all zeros
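
Because Word2Vec was trained with `min_count = 2`, only 9865 of the tokenizer's words (half of the `vocab_size` of 65658, so 32829) have a pretrained vector, roughly 30%; every other row starts as zeros. The sketch below shows one way to measure that coverage. The toy `word_index` and `vectors` are hypothetical stand-ins for `tok.word_index` and `word_vec_dict`.

```python
import numpy as np

# Toy stand-ins (hypothetical values) for tok.word_index and word_vec_dict.
word_index = {"co": 1, "stock": 2, "ecb": 3, "isqgdue": 4}
vectors = {"co": np.ones(3), "stock": np.full(3, 0.5)}  # the other two fell below min_count

embed_dim = 3
# One row per word plus row 0, which Keras reserves for padding.
matrix = np.zeros((len(word_index) + 1, embed_dim))
for word, idx in word_index.items():
    vec = vectors.get(word)
    if vec is not None:
        matrix[idx] = vec

covered = sum(word in vectors for word in word_index)
coverage = covered / len(word_index)
zero_rows = int((~matrix.any(axis=1)).sum())  # includes the padding row

print(f"coverage: {coverage:.0%}, all-zero rows: {zero_rows}")
```

Since the embedding layer in this notebook is trainable, the zero rows can still be learned during training; the check only shows how much of the vocabulary benefits from the skip-gram initialization.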

**Preparing the Data for Embedding Layer**

In [62]:
encd_rev

[[173, 15, 505, 49, 546, 57, 4585, 28, 4586, 1, 9866],
 [157,
  817,
  1892,
  860,
  758,
  659,
  6905,
  9867,
  5500,
  4587,
  3514,
  5501,
  1182,
  739,
  238,
  1429,
  1,
  9868],
 [3515, 21, 740, 17, 18, 6, 128, 27, 1, 9869, 1, 9870],
 [1893, 3516, 2628, 227, 3, 2133, 83, 121, 279, 326, 797, 146, 1, 9871],
 [1369, 628, 4588, 482, 1225, 1279, 1370, 4589, 571, 1, 9872, 1, 9873],
 [2, 3, 412, 6, 572, 110, 2033, 35, 31, 1894, 554, 10, 384, 137, 1, 9874],
 [144, 89, 8, 155, 60, 193, 14, 10, 1893, 1, 9875],
 [1714, 5502, 5503, 483, 2629, 9876, 1634, 313, 3517, 1, 9877],
 [5504, 58, 966, 6906, 358, 96, 2034, 9878, 1, 9879, 1, 9880],
 [345, 191, 547, 99, 1895, 723, 45, 6907, 1, 9881],
 [78,
  591,
  2035,
  2134,
  70,
  2858,
  55,
  1280,
  1896,
  60,
  833,
  83,
  14,
  122,
  362,
  232,
  8,
  724,
  104,
  99,
  555,
  85,
  1,
  9882,
  1,
  9883],
 [6908,
  2,
  9884,
  3972,
  123,
  3,
  5,
  518,
  95,
  184,
  3135,
  123,
  3,
  5,
  95,
  184,
  1635,
  123,
  3,
  5

In [61]:
vocab_size

65658

Pad the sequences so that they all have the same length.

In [28]:
from keras.preprocessing.sequence import pad_sequences

pad_rev= pad_sequences(encd_rev, maxlen=max_sen_len, padding='post')
pad_rev.shape

(13592, 57)
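
To make the effect of `padding='post'` concrete, here is a small NumPy sketch of what the call does to integer sequences. `pad_post` is a hypothetical helper, not the Keras function; it appends zeros on the right and, for sequences longer than `maxlen`, keeps the last `maxlen` ids, which mirrors the Keras default `truncating='pre'`.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep the last `maxlen` ids if too long."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]            # mirrors truncating='pre'
        out[row, :len(kept)] = kept
    return out

# The first sequence reuses ids from the notebook output; the second is made up.
padded = pad_post([[173, 15, 505], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

In the notebook no training sequence is truncated, because `max_sen_len` was taken as the maximum sentence length of the data.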

In [29]:
pad_rev

array([[  173,    15,   505, ...,     0,     0,     0],
       [  157,   817,  1892, ...,     0,     0,     0],
       [ 3515,    21,   740, ...,     0,     0,     0],
       ...,
       [  633,  1268,  2212, ...,     0,     0,     0],
       [ 2422,  3963,   437, ...,     0,     0,     0],
       [  173, 32829,   164, ...,     0,     0,     0]], dtype=int32)

**Classification**

In [30]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from keras.initializers import Constant
from tensorflow.keras.layers import Dense, Embedding, Activation, Flatten

In [31]:
model = keras.Sequential()
model._name = "lstm"
model.add(layers.Embedding(input_dim=vocab_size,output_dim=embed_dim,input_length=max_sen_len,embeddings_initializer=Constant(embed_matrix)))
model.add(layers.LSTM(max_sen_len))
model.add(layers.BatchNormalization())
model.add(layers.Dense(20)) # one output unit (logit) per class
print(model.summary())

Model: "lstm"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 57, 100)           6565800   
                                                                 
 lstm (LSTM)                 (None, 57)                36024     
                                                                 
 batch_normalization (Batch  (None, 57)                228       
 Normalization)                                                  
                                                                 
 dense (Dense)               (None, 20)                1160      
                                                                 
Total params: 6603212 (25.19 MB)
Trainable params: 6603098 (25.19 MB)
Non-trainable params: 114 (456.00 Byte)
_________________________________________________________________
None
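
The parameter counts in the summary can be reproduced by hand, which is a useful check when changing the architecture. The sketch below uses the standard formulas for each layer type with the sizes from this notebook (vocabulary 65658, embedding dimension 100, 57 LSTM units, 20 classes).

```python
vocab_size, embed_dim, units, n_classes = 65658, 100, 57, 20

# Embedding: one vector of size embed_dim per vocabulary row.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (units * (embed_dim + units) + units)

# BatchNormalization: gamma and beta (trainable), moving mean and variance (not trainable).
batchnorm_params = 4 * units
non_trainable = 2 * units

# Dense: weight matrix plus bias.
dense_params = units * n_classes + n_classes

total = embedding_params + lstm_params + batchnorm_params + dense_params
print(embedding_params, lstm_params, batchnorm_params, dense_params, total)
```

Almost all of the parameters sit in the embedding layer, so its size is driven by `vocab_size` rather than by the LSTM.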


Optimizer SGD

In [32]:
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='sgd',
    metrics=["accuracy"]
)

In [78]:
model.fit(pad_rev, y_train, batch_size=64, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x7ea19afb4eb0>

In [33]:
model2 = keras.Sequential()
model2._name = "lstm"
model2.add(layers.Embedding(input_dim=vocab_size,output_dim=embed_dim,input_length=max_sen_len,embeddings_initializer=Constant(embed_matrix)))
model2.add(layers.LSTM(max_sen_len))
model2.add(layers.BatchNormalization())
model2.add(layers.Dense(20)) # one output unit (logit) per class
print(model2.summary())

Model: "lstm"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 57, 100)           6565800   
                                                                 
 lstm_1 (LSTM)               (None, 57)                36024     
                                                                 
 batch_normalization_1 (Bat  (None, 57)                228       
 chNormalization)                                                
                                                                 
 dense_1 (Dense)             (None, 20)                1160      
                                                                 
Total params: 6603212 (25.19 MB)
Trainable params: 6603098 (25.19 MB)
Non-trainable params: 114 (456.00 Byte)
_________________________________________________________________
None


Optimizer Adam

In [34]:
model2.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=["accuracy"]
)

In [35]:
model2.fit(pad_rev, y_train, batch_size=64, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x7eedb27d37c0>

**Testing**

**Preparing  Test Data for Embedding Layer**

In [36]:
# Tokenize the words in the testing text data
word_token_test =[word_tokenize(i) for i in x_test]

In [37]:
word_token_test[0]

['the',
 'nonprofit',
 'organization',
 'pursuit',
 'has',
 'sold',
 'a',
 'million',
 'bond',
 'to',
 'finance',
 'its',
 'job',
 'training',
 'program',
 'now',
 'it',
 'needs',
 'participants',
 'to',
 'pay',
 'down',
 'the',
 'debt',
 'with',
 'their',
 'future',
 'earnings',
 't',
 'co',
 'bakphufj']

In [38]:
# Remove stopwords from each sublist in word_token_test
filtered_tokens_test = [[word for word in sublist if word not in stop_words] for sublist in word_token_test]
# Display the first sublist of filtered tokens after removing stopwords
print(filtered_tokens_test[0])

['nonprofit', 'organization', 'pursuit', 'sold', 'million', 'bond', 'finance', 'job', 'training', 'program', 'needs', 'participants', 'pay', 'debt', 'future', 'earnings', 'co', 'bakphufj']


In [39]:
lemmatized_tokens_test = [[lemmatizer.lemmatize(word) for word in sublist] for sublist in filtered_tokens_test]

In [41]:
# tok.fit_on_texts(lemmatized_tokens_test) # no need to fit again; reuse the tokenizer fitted on the training data
encd_rev_test = tok.texts_to_sequences(lemmatized_tokens_test)

In [42]:
from keras.preprocessing.sequence import pad_sequences

pad_rev_test= pad_sequences(encd_rev_test, maxlen=max_sen_len, padding='post')
pad_rev_test.shape

(3398, 57)

In [43]:
pad_rev_test

array([[ 3672,  1886,  4236, ...,     0,     0,     0],
       [  248,  1291,   405, ...,     0,     0,     0],
       [  804,    16,   612, ...,     0,     0,     0],
       ...,
       [32436,  1796,   624, ...,     0,     0,     0],
       [  228,    21,     5, ...,     0,     0,     0],
       [  176,   364,   289, ...,     0,     0,     0]], dtype=int32)

In [44]:
test_predict=model2.predict(pad_rev_test)
classe_test=np.argmax(test_predict,axis=1)
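
Because the last layer is `Dense(20)` without an activation and the loss uses `from_logits=True`, `model2.predict` returns raw logits rather than probabilities. Taking `argmax` of the logits gives the same class as taking it after a softmax, as the small NumPy sketch below illustrates with made-up logits for three classes.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up logits for two samples and three classes (illustration only).
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])

probs = softmax(logits)
pred_from_logits = np.argmax(logits, axis=1)
pred_from_probs = np.argmax(probs, axis=1)

print(probs.round(3))
print(pred_from_logits, pred_from_probs)
```

The softmax step is only needed when calibrated-looking probabilities are wanted, for example to inspect how confident the model is on the weak classes.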




In [45]:
classe_test

array([ 3,  3, 18, ..., 18, 18,  7])

In [48]:
print('\nClassification Report\n')
print(classification_report(y_test, classe_test))


Classification Report

              precision    recall  f1-score   support

           0       0.17      0.49      0.25        51
           1       0.81      0.80      0.80       167
           2       0.78      0.60      0.68       709
           3       0.48      0.50      0.49        64
           4       0.81      0.92      0.86        72
           5       0.92      0.92      0.92       198
           6       0.81      0.85      0.83       105
           7       0.74      0.62      0.68       125
           8       0.49      0.55      0.51        33
           9       0.42      0.70      0.53       311
          10       0.27      0.29      0.28        14
          11       0.06      0.56      0.11         9
          12       0.98      0.63      0.77        97
          13       0.67      0.19      0.30        94
          14       0.74      0.80      0.77       364
          15       0.86      0.48      0.62       100
          16       0.88      0.82      0.85       197
   

The architecture of the model is:
- 1 embedding layer
- 1 LSTM layer
- 1 batch normalization layer
- 1 Dense layer


With the SGD optimizer, accuracy on the training data is 0.59.

With the Adam optimizer, accuracy on the training data is 0.98.

For testing I used the model trained with the optimizer that reached the highest training accuracy, Adam. Its 98% accuracy means the model predicts the labels of the training data very well.

On the test data, overall accuracy is 0.67. The gap between 0.98 on the training data and 0.67 on the test data suggests that the model is overfitting.

The model performs reasonably well on some classes, for example classes 5, 6 and 16, which have high precision and recall.

However, it performs poorly on others, such as classes 0, 3, 9, 11 and 13.
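
The f1-score column of the report is the harmonic mean of precision and recall, which is why a class with one very low value scores poorly overall. The sketch below recomputes it for three of the weak classes from the rounded precision and recall values printed in the report, so small rounding differences are possible for other classes.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) copied from the classification report above.
weak_classes = {0: (0.17, 0.49), 11: (0.06, 0.56), 13: (0.67, 0.19)}

scores = {label: round(f1(p, r), 2) for label, (p, r) in weak_classes.items()}
for label, score in scores.items():
    print(f"class {label}: f1 = {score}")
```

Classes 0 and 11 have low precision (many other tweets are assigned to them), while class 13 has low recall (most of its tweets are assigned elsewhere), so the two kinds of weakness may call for different fixes.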




Several steps could improve the model's performance:

- increasing the number of epochs
- extending the architecture
- applying regularization
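
When experimenting with more epochs, one way to pick the stopping point is to hold out part of the training data and stop once the validation loss no longer improves. The sketch below shows only the stopping rule in plain Python; the helper name and the loss values are made up for illustration and are not results from this notebook. In Keras the same idea is available through the `EarlyStopping` callback (with `patience` and `restore_best_weights`) together with `validation_split` in `model.fit`, and dropout on the LSTM layer is a common form of regularization.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            waited = 0          # improvement: reset the counter
        else:
            waited += 1         # no improvement this epoch
            if waited >= patience:
                return epoch
    return len(val_losses)

# Made-up validation losses (illustration only, not measured in this notebook).
val_losses = [1.90, 1.40, 1.15, 1.10, 1.18, 1.25, 1.40]

stop_epoch = early_stopping_epoch(val_losses, patience=2)
best_epoch = val_losses.index(min(val_losses)) + 1
print(f"stop after epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

With a rule like this, the number of epochs becomes an upper bound rather than a value that has to be guessed in advance.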