<a href="https://colab.research.google.com/github/blankon123/random-analytics/blob/main/Multi_class_Classification_Dicoding_Pt_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiclass Text Classification with LSTM
This notebook was written for the independent project of the Intermediate Machine Learning course (Indosat Ooredoo Scholarship) at Dicoding. The data is the categorized BBC News text collection from a [Public Dataset](http://mlg.ucd.ie/datasets/bbc.html). It consists of 2225 texts in 5 classes.

## Loading Data
Download and uncompress the archive, then read the files folder by folder (one folder per class).

In [1]:
# Make sure a GPU is available for the computation
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


In [2]:
# Download the required text data
from IPython.display import clear_output

!wget 'http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip'
!unzip "bbc-fulltext.zip" -d "/content/drive/"

clear_output()

print('Downloaded to /content/drive/bbc [one folder per class]')

Downloaded to /content/drive/bbc [one folder per class]


## Preprocessing Data

In [3]:
from glob import glob
import re

raw_data = []
teks = ''

kelas = [x.split('/')[4] for x in glob("/content/drive/bbc/*/")]

# Read every text file, class by class
for k in kelas:
  daftar_file = glob("/content/drive/bbc/"+k+"/*")
  for teks_file in daftar_file:
    with open(teks_file, 'r',encoding='latin-1') as file:
      teks = file.read().replace('\n', ' ').replace('  ', ' ').lower()
    raw_data.append({
        'teks' : re.sub(r'[^\w\s]', '', teks),
        'kelas' : k
    })
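
The cleaning applied to each file (lowercasing, flattening newlines, stripping punctuation) can also be factored into a small helper. The sketch below is illustrative only: `normalize_text` is a hypothetical name, and unlike the single `.replace('  ', ' ')` pass in the cell above, it collapses whitespace runs of any length.

```python
import re

def normalize_text(raw: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (hypothetical helper)."""
    text = raw.lower()
    # Remove every character that is neither a word character nor whitespace
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse newlines, tabs, and repeated spaces into single spaces
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_text("Channel 4's ratings\n\nrose 12%,   say the BBC."))
# channel 4s ratings rose 12 say the bbc
```

Note that removing punctuation also removes decimal points and apostrophes, which is why figures such as "11.2" show up as "112" in the cleaned text.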

In [4]:
#Download Stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [5]:
# Remove stopwords, since topic modeling seems to need only the important keywords
from nltk.corpus import stopwords

# Build the stopword set once; reloading the list for every word is very slow
STOPWORDS = set(stopwords.words('english'))

def clean_stopwords(teks):
  filtered_ = [word for word in teks.split() if word not in STOPWORDS]
  return ' '.join(filtered_)

for i in raw_data:
  i['teks'] = clean_stopwords(i['teks'])

raw_data[1]

{'kelas': 'entertainment',
 'teks': 'housewives lift channel 4 ratings debut us television hit desperate housewives helped lift channel 4s january audience share 12 compared last year successes celebrity big brother simpsons enabled broadcaster surpass bbc two first month since last july bbc twos share audience fell 112 96 last month comparison january 2004 celebrity big brother attracted fewer viewers 2002 series comedy drama desperate housewives managed pull five million viewers one point run date attracting quarter television audience two main television channels bbc1 itv1 seen monthly audience share decline year year comparison january fives proportion remained slender 63 digital multichannel tv continuing strongest area growth bbc reporting freeview box ownership five million including one million sales last portion 2004 share audience soared 20 january 2005 compared last year currently stands average 286'}
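
Stopword filtering amounts to one set-membership test per word. A minimal sketch, assuming a tiny hand-written stopword set so that it runs without any download: `SAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical names, and the set is far smaller than NLTK's English list.

```python
# Hypothetical miniature stopword set, for illustration only
SAMPLE_STOPWORDS = {"the", "a", "of", "to", "in", "and", "has", "its"}

def remove_stopwords(text: str, stopword_set: set) -> str:
    """Keep only the whitespace-separated words that are not in stopword_set."""
    kept = [word for word in text.split() if word not in stopword_set]
    return " ".join(kept)

sentence = "the debut of desperate housewives has helped lift its ratings"
print(remove_stopwords(sentence, SAMPLE_STOPWORDS))
# debut desperate housewives helped lift ratings
```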

In [6]:
# Convert to a pandas DataFrame
import pandas as pd

df = pd.DataFrame(raw_data)
df.head(8)

| | teks | kelas |
|---|---|---|
| 0 | godzilla gets hollywood fame star movie monste... | entertainment |
| 1 | housewives lift channel 4 ratings debut us tel... | entertainment |
| 2 | boogeyman takes box office lead lowbudget horr... | entertainment |
| 3 | spike lee backs student directors filmmaker sp... | entertainment |
| 4 | dvd review robot one man recognises robots thr... | entertainment |
| 5 | da vinci code lousy history plot international... | entertainment |
| 6 | uganda bans vagina monologues ugandas authorit... | entertainment |
| 7 | prince crowned top music earner prince earned ... | entertainment |


In [7]:
# Check which classes are present
df.kelas.unique()

array(['entertainment', 'politics', 'sport', 'business', 'tech'],
      dtype=object)

In [8]:
# One-hot encoding (dtype=int keeps 0/1 columns; newer pandas versions default to booleans)
kategori = pd.get_dummies(df.kelas, dtype=int)
df = pd.concat([df, kategori], axis=1) 
df = df.drop(columns='kelas')
df.head()

| | teks | business | entertainment | politics | sport | tech |
|---|---|---|---|---|---|---|
| 0 | godzilla gets hollywood fame star movie monste... | 0 | 1 | 0 | 0 | 0 |
| 1 | housewives lift channel 4 ratings debut us tel... | 0 | 1 | 0 | 0 | 0 |
| 2 | boogeyman takes box office lead lowbudget horr... | 0 | 1 | 0 | 0 | 0 |
| 3 | spike lee backs student directors filmmaker sp... | 0 | 1 | 0 | 0 | 0 |
| 4 | dvd review robot one man recognises robots thr... | 0 | 1 | 0 | 0 | 0 |
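
The one-hot columns appear in alphabetical order, so each label row (and, later, each softmax output vector) maps back to a class name by position. A minimal pure-Python sketch of that mapping: `CLASS_NAMES`, `one_hot`, and `decode` are illustrative names, not part of the notebook.

```python
# Column order as shown in the one-hot table above
CLASS_NAMES = ["business", "entertainment", "politics", "sport", "tech"]

def one_hot(label: str) -> list:
    """Return a 0/1 vector with a single 1 at the position of `label`."""
    return [1 if name == label else 0 for name in CLASS_NAMES]

def decode(scores: list) -> str:
    """Return the class name at the position of the highest score (argmax)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CLASS_NAMES[best]

print(one_hot("entertainment"))                 # [0, 1, 0, 0, 0]
print(decode([0.05, 0.10, 0.70, 0.10, 0.05]))   # politics
```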


## Train-Test Splitting

In [9]:
# Separate the features (text) from the labels
berita = df.teks.values 
label = df[kategori.columns.values].values 

In [10]:
# Split into training and test sets
from sklearn.model_selection import train_test_split 

berita_latih, berita_test, label_latih, label_test = train_test_split(berita, label, test_size=0.2,shuffle=True)
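
As a quick sanity check on the split, the arithmetic below derives the expected set sizes from the 2225 texts stated in the introduction. It assumes that `model.fit` is later called without a `batch_size`, in which case Keras uses 32; the resulting count agrees with the `56/56` steps reported per epoch in the training log.

```python
import math

n_samples = 2225     # dataset size stated in the introduction
test_percent = 20    # test_size=0.2 expressed as a percentage
batch_size = 32      # Keras default when model.fit gets no batch_size

# train_test_split rounds the test set up and gives the remainder to training
n_test = math.ceil(n_samples * test_percent / 100)
n_train = n_samples - n_test
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_test, steps_per_epoch)   # 1780 445 56
```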

## Text to Tensors

In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences 

tokenizer = Tokenizer(num_words=5000, oov_token='-')
# Fit the vocabulary on the training texts only, so no test-set information leaks into it
tokenizer.fit_on_texts(berita_latih)

sekuens_latih = tokenizer.texts_to_sequences(berita_latih) 
sekuens_test = tokenizer.texts_to_sequences(berita_test) 

padded_latih = pad_sequences(sekuens_latih)
padded_test = pad_sequences(sekuens_test)
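
To make these two steps concrete, here is a simplified pure-Python sketch of the idea behind them: build a frequency-ranked word index with a reserved out-of-vocabulary slot, map each text to integers, then left-pad with zeros to a common length. It is an approximation for illustration, not the Keras implementation, and all helper names are hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token="-"):
    """Rank words by frequency; index 1 is reserved for OOV, 0 for padding."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequences(texts, word_index, num_words=5000):
    """Map words to indices; unknown words and ranks >= num_words become OOV (1)."""
    oov = 1
    return [
        [word_index[w] if word_index.get(w, num_words) < num_words else oov
         for w in text.split()]
        for text in texts
    ]

def pad_left(sequences, value=0):
    """Left-pad every sequence with `value` up to the longest length."""
    width = max(len(seq) for seq in sequences)
    return [[value] * (width - len(seq)) + seq for seq in sequences]

train = ["bbc share fell", "bbc share rose again"]
index = build_word_index(train)            # vocabulary from training texts only
padded = pad_left(to_sequences(train + ["itv share fell"], index))
print(padded)   # [[0, 2, 3, 4], [2, 3, 5, 6], [0, 1, 3, 4]]
```

The unseen word "itv" falls back to index 1, which is the role `oov_token` plays in the cell above.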

## Neural Net Design

In [12]:
import tensorflow as tf 

model = tf.keras.Sequential([ 
    tf.keras.layers.Embedding(input_dim=5000, output_dim=16), 
    tf.keras.layers.LSTM(64), 
    tf.keras.layers.Dense(128, activation='relu'), 
    tf.keras.layers.Dense(64, activation='relu'), 
    tf.keras.layers.Dense(5, activation='softmax') 
]) 

model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy']) 
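
The size of this network can be worked out by hand. The sketch below counts trainable parameters per layer using the standard formulas (an LSTM has four gates, each with input weights, recurrent weights, and a bias). The variable names are illustrative; the total can be compared with what `model.summary()` reports once the model is built.

```python
vocab_size, embed_dim = 5000, 16   # Embedding(input_dim=5000, output_dim=16)
lstm_units = 64                    # LSTM(64)

def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

params = {
    "embedding": vocab_size * embed_dim,
    # 4 gates x (input weights + recurrent weights + bias)
    "lstm": 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units),
    "dense_128": dense_params(lstm_units, 128),
    "dense_64": dense_params(128, 64),
    "dense_5": dense_params(64, 5),
}
total = sum(params.values())

for name, count in params.items():
    print(f"{name:>10}: {count}")
print(f"{'total':>10}: {total}")   # total: 117637
```

About two thirds of the parameters sit in the embedding table, so `num_words` and `output_dim` are the main levers on model size.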

## Training & Testing

In [13]:
num_epochs = 30 

history = model.fit(padded_latih, label_latih,
                    epochs=num_epochs,
                    validation_data=(padded_test, label_test),
                    verbose=2
                    )

Epoch 1/30
56/56 - 12s - loss: 1.6026 - accuracy: 0.2646 - val_loss: 1.5796 - val_accuracy: 0.4135
Epoch 2/30
56/56 - 4s - loss: 1.3220 - accuracy: 0.4236 - val_loss: 1.0538 - val_accuracy: 0.5258
Epoch 3/30
56/56 - 4s - loss: 0.9455 - accuracy: 0.5629 - val_loss: 0.9351 - val_accuracy: 0.5685
Epoch 4/30
56/56 - 4s - loss: 0.6870 - accuracy: 0.6865 - val_loss: 0.8757 - val_accuracy: 0.6337
Epoch 5/30
56/56 - 4s - loss: 0.4371 - accuracy: 0.8056 - val_loss: 0.8265 - val_accuracy: 0.6966
Epoch 6/30
56/56 - 4s - loss: 0.3237 - accuracy: 0.8899 - val_loss: 0.9816 - val_accuracy: 0.6584
Epoch 7/30
56/56 - 4s - loss: 0.1980 - accuracy: 0.9438 - val_loss: 0.9865 - val_accuracy: 0.7146
Epoch 8/30
56/56 - 4s - loss: 0.0817 - accuracy: 0.9775 - val_loss: 0.9545 - val_accuracy: 0.7640
Epoch 9/30
56/56 - 4s - loss: 0.0470 - accuracy: 0.9888 - val_loss: 1.0336 - val_accuracy: 0.7685
Epoch 10/30
56/56 - 4s - loss: 0.0312 - accuracy: 0.9899 - val_loss: 1.1154 - val_accuracy: 0.7820
Epoch 11/30
56/56 
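
The ten complete epochs in the log can be inspected programmatically. In the sketch below the validation metrics are copied by hand from the lines above. Validation loss is lowest at epoch 5 while training loss keeps falling, a pattern commonly read as overfitting, although validation accuracy continues to improve through epoch 10.

```python
# (val_loss, val_accuracy) for epochs 1-10, copied from the training log
val_metrics = [
    (1.5796, 0.4135), (1.0538, 0.5258), (0.9351, 0.5685), (0.8757, 0.6337),
    (0.8265, 0.6966), (0.9816, 0.6584), (0.9865, 0.7146), (0.9545, 0.7640),
    (1.0336, 0.7685), (1.1154, 0.7820),
]

epochs = range(len(val_metrics))
# Epoch numbers are 1-based, list positions 0-based
best_loss_epoch = min(epochs, key=lambda i: val_metrics[i][0]) + 1
best_acc_epoch = max(epochs, key=lambda i: val_metrics[i][1]) + 1

print(best_loss_epoch, best_acc_epoch)   # 5 10
```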

Passing Requirements ✅
- [x] Any dataset may be used, as long as it has at least 1000 samples.
- [x] The model architecture must include an LSTM.
- [x] The model must be sequential.
- [x] The validation set must be 20% of the total dataset.
- [x] An Embedding must be used.
- [x] The tokenizer function must be used.
- [x] Model accuracy must be at least 75%.

Five-Star Requirements ⭐⭐⭐⭐⭐
- [x] More than 3 classes
- [x] At least 2000 data samples
- [x] Accuracy > 90%