
# Deep Learning for NLP

Download the file below first. It is about 800 MB.

In [0]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2020-03-09 06:21:59--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-03-09 06:22:00--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-03-09 06:22:00--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


## Training Word2Vec

Import the required libraries.

In [0]:
import re, string 
import pandas as pd 
from time import time  
from collections import defaultdict
import spacy
from sklearn.manifold import TSNE
import nltk
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from matplotlib import pyplot as plt
%matplotlib inline

Download the stopwords provided by NLTK.

In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Stopword lists are currently available for 23 languages (this number may grow).

In [0]:
print(stopwords.fileids())

In [0]:
STOPWORDS = set(stopwords.words('english'))

Download the BBC News dataset.

In [0]:
url = "https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv"
df = pd.read_csv(url)

Check the number of articles, which equals the number of rows in the dataframe (`df`).

In [0]:
len(df)

In [0]:
df.head()

Clean the data and remove stopwords.

In [0]:
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    # Drop stopwords; text of two characters or fewer becomes an empty string
    # (returning None here would break the lemmatizer later)
    if len(text) > 2:
        return ' '.join(word for word in text.split() if word not in STOPWORDS)
    return ''

df_clean = df.copy()
df_clean['text'] = df['text'].apply(clean_text)

In [0]:
df_clean.head()
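
To make the cleaning steps concrete, the sketch below applies the same regular expressions one stage at a time to a made-up sentence. The sample text and the tiny `SAMPLE_STOPWORDS` set are assumptions for illustration only; the notebook itself uses NLTK's English stopword list.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stopword list.
SAMPLE_STOPWORDS = {"the", "a", "in", "is", "of"}

def clean_steps(text):
    """Return the text after each cleaning stage, in order."""
    stages = []
    text = text.lower()
    stages.append(text)
    text = re.sub(r'\[.*?\]', '', text)  # drop [bracketed] spans
    stages.append(text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # drop punctuation
    stages.append(text)
    text = re.sub(r'\w*\d\w*', '', text)  # drop tokens that contain a digit
    stages.append(text)
    text = ' '.join(w for w in text.split() if w not in SAMPLE_STOPWORDS)
    stages.append(text)
    return stages

sample = "The BBC [video] reported 3G sales rose in 2004, the Q4 total is up!"
for stage in clean_steps(sample):
    print(repr(stage))
```

Printing every stage makes it easy to see which rule removed which piece of the text.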

Apply lemmatization using the spaCy library. See https://spacy.io/

In [0]:
# 'en_core_web_sm' replaces the 'en' shortcut, which was removed in spaCy 3
# disabling the parser and Named Entity Recognition for speed
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def lemmatizer(text):
    doc = nlp(text)
    return " ".join(word.lemma_ for word in doc)

df_clean["text_lemmatize"] = df_clean["text"].apply(lemmatizer)

In [0]:
df_clean.head()

In [0]:
df_clean["text"][0]

In [0]:
df_clean["text_lemmatize"][0]

Prepare the list `sentences`, which holds a list of words for each sentence.  
For example, `['I and he', 'you and they']` becomes `[['I', 'and', 'he'], ['you', 'and', 'they']]`.

In [0]:
sentences = [row.split() for row in df_clean['text_lemmatize']]
word_freq = defaultdict(int)
for sent in sentences:
    for i in sent:
        word_freq[i] += 1
len(word_freq)

The 10 most frequent words.

In [0]:
sorted(word_freq, key=word_freq.get, reverse=True)[:10]
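
The same frequency count can also be written with `collections.Counter`, which returns the counts together with the words. This is a small alternative sketch on made-up data (`toy_sentences` is hypothetical), not a change to the cell above.

```python
from collections import Counter

# Toy stand-in for the tokenized articles in `sentences`.
toy_sentences = [["market", "rise", "market"], ["film", "market", "rise"]]

# Flatten all sentences and count every word in one pass.
counts = Counter(word for sent in toy_sentences for word in sent)

# most_common(n) returns (word, count) pairs sorted by descending count.
top = counts.most_common(2)
print(top)
```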

Create a model object with the Word2Vec constructor.

In [0]:
# min_count: minimum number of occurrences of a word in the corpus to be included in the model.
# window: the maximum distance between the current and predicted word within a sentence.
# size: the dimensionality of the feature vectors (renamed to vector_size in gensim 4).
# workers: number of worker threads.
# sg: 1 selects skip-gram, 0 selects CBOW.
w2v_model = Word2Vec(min_count=200,
                     window=5,
                     size=100,
                     workers=4,
                     sg=1)

Build the vocabulary.

In [0]:
# this line of code to prepare the model vocabulary
w2v_model.build_vocab(sentences)

Train the word2vec model.

In [0]:
# train word vectors
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=w2v_model.epochs)

In [0]:
# Explore the model
w2v_model.wv.most_similar(positive=['economy'])

In [0]:
w2v_model.wv.most_similar(positive=['tv'])

In [0]:
w2v_model.wv.similarity('device', 'gadget')
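
The `similarity` call above returns the cosine similarity between two word vectors. As a minimal sketch of that computation, here it is in plain Python on short made-up vectors (the numbers are illustrative, not taken from the trained model).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors score 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the score depends only on direction, two words can be rated as similar even when their vectors differ in length.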

In [0]:
w2v_model.wv['gadget']

In [0]:
w2v_model.wv['gadget'].shape

## Text Classification with LSTM

This time we use an LSTM for text classification.

In [0]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


`vocab_size`: use only the 5,000 most frequent words out of a total vocabulary of 24,582, to shorten training time.  
`embedding_dim`: dimensionality of the vector that represents each token.  
`max_length`: maximum number of tokens per article (data sample).  
`trunc_type = 'post'`: sequences longer than `max_length` are truncated at the end.  
`padding_type = 'post'`: sequences shorter than `max_length` are padded at the end.  
`oov_tok`: replacement for tokens that are not in the vocabulary.

*Should we say "token" or "word"?*

In [0]:
vocab_size = 5000
embedding_dim = 64
max_length = 200
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'
training_portion = .8

Split the data into 80% training and 20% validation.

In [0]:
train_size = int(len(df_clean) * training_portion)
train_set = df_clean[0: train_size]
validation_set = df_clean[train_size:]

In [0]:
len(train_set), len(validation_set)

Tokenize using the `keras` library. Fit the `tokenizer` **on the training set only**.

In [0]:
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_set["text"])
word_index = tokenizer.word_index

In [0]:
list(word_index.items())[0:10]

Encode each sequence of tokens as a sequence of ids.

In [0]:
train_sequences = tokenizer.texts_to_sequences(train_set["text"])

In [0]:
print(train_sequences[10])
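
The following is a simplified pure-Python sketch of how the Keras `Tokenizer` assigns and applies ids when `oov_token` and `num_words` are set: id 1 goes to the OOV token, the remaining words are numbered by descending frequency, and at encoding time any word whose id is `num_words` or higher is mapped to the OOV id. The helper names and toy texts are hypothetical, and lowercasing and punctuation filtering are left out for brevity.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Assign id 1 to the OOV token and ids 2, 3, ... by descending frequency."""
    counts = Counter(word for text in texts for word in text.split())
    index = {oov_token: 1}  # id 0 stays free for padding
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def encode(text, index, num_words, oov_token="<OOV>"):
    """Map words to ids; unknown or too-rare words become the OOV id."""
    oov_id = index[oov_token]
    ids = []
    for word in text.split():
        i = index.get(word)
        ids.append(oov_id if i is None or i >= num_words else i)
    return ids

toy_texts = ["stock market rally", "market falls", "market rally"]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(encode("market rally crashes stock", toy_index, num_words=4))
```

Note that the full index still contains every word; the `num_words` limit only takes effect when texts are encoded.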

Truncate each id sequence if it is longer than `max_length`, or pad it if it is shorter.

In [0]:
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [0]:
print('before pad_sequences: ',len(train_sequences[0]))
print('after pad_sequences: ',len(train_padded[0]))

print('before pad_sequences: ',len(train_sequences[1]))
print('after pad_sequences: ',len(train_padded[1]))

print('before pad_sequences: ',len(train_sequences[10]))
print('after pad_sequences: ',len(train_padded[10]))

Compare the data before and after padding.

In [0]:
print(np.asarray(train_sequences[10]))

The padding token is `0`.

In [0]:
print(train_padded[10])
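
For a single sequence, the effect of `padding='post'` and `truncating='post'` can be sketched in plain Python as below. `pad_or_truncate` is a hypothetical helper written for illustration; it is not part of Keras.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Cut the sequence at the end, then fill the end with pad_value."""
    clipped = seq[:maxlen]
    return clipped + [pad_value] * (maxlen - len(clipped))

print(pad_or_truncate([5, 6, 7], maxlen=5))           # short: padded at the end
print(pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4))  # long: truncated at the end
```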

Do the same for the validation data.

In [0]:
validation_sequences = tokenizer.texts_to_sequences(validation_set['text'])
validation_padded = pad_sequences(validation_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

print(len(validation_sequences))
print(validation_padded.shape)

445
(445, 200)


Convert the labels to numeric values.

In [0]:
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(df_clean['category'])

training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_set['category']))
validation_label_seq = np.array(label_tokenizer.texts_to_sequences(validation_set['category']))

In [0]:
print(training_label_seq[0])
print(training_label_seq[1])
print(training_label_seq[2])
print(training_label_seq.shape)

print(validation_label_seq[0])
print(validation_label_seq[1])
print(validation_label_seq[2])
print(validation_label_seq.shape)

Before we start classifying, let's look at how the training data has changed, shown as text. Words that are not in the vocabulary (OOV) have been replaced by the special token, and padding has been added; the padding id is decoded as `?`.

In [0]:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_article(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
print(decode_article(train_padded[10]))
print('---')
print(train_set['text'][10])

### Model

First we need an `Embedding` layer to turn token ids into vectors. Here the weights of the embedding layer are trained together with all the other layers.  
Next comes an `LSTM` layer; the **number of `units`** in this example is 64.  
Finally we add two Dense layers.  
The recorded summary below comes from an earlier run with 100 units in the LSTM and the first Dense layer, so its parameter counts differ from what this code produces.

In [0]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(embedding_dim),
    tf.keras.layers.Dense(embedding_dim, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax')
])
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 64)          320000    
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               66000     
_________________________________________________________________
dense_4 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_5 (Dense)              (None, 6)                 606       
Total params: 396,706
Trainable params: 396,706
Non-trainable params: 0
_________________________________________________________________


In [0]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Train the model.

In [0]:
num_epochs = 10
history = model.fit(train_padded, training_label_seq, epochs=num_epochs, validation_data=(validation_padded, validation_label_seq), verbose=2)

Plot the training and validation accuracy and loss. Over the 10 epochs, validation accuracy stays at around 50%.

In [0]:

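def plot_graphs(history, metric):
    # Older Keras versions log accuracy as "acc", newer ones as "accuracy",
    # so fall back to the other spelling when the requested key is missing
    aliases = {"acc": "accuracy", "accuracy": "acc"}
    if metric not in history.history:
        metric = aliases.get(metric, metric)
    plt.plot(history.history[metric])
    plt.plot(history.history['val_' + metric])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, 'val_' + metric])
    plt.show()

plot_graphs(history, "acc")
plot_graphs(history, "loss")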

## BiLSTM

Now we try a Bidirectional LSTM. `tf.keras` provides a ready-made wrapper layer for this, `tf.keras.layers.Bidirectional`. The summary shows that the bidirectional layer has exactly twice as many parameters as a plain LSTM with the same number of units.

In [0]:
model_bi = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),
    tf.keras.layers.Dense(embedding_dim, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax')
])
model_bi.summary()
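
The parameter counts can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights, and a bias, and the bidirectional wrapper holds two such LSTMs. The sketch below assumes the default Keras LSTM configuration (biases enabled); `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of one LSTM layer: 4 gates x (weights + bias)."""
    return 4 * ((input_dim + units) * units + units)

single = lstm_params(input_dim=64, units=64)  # embedding_dim = 64, units = 64
bidirectional = 2 * single                    # forward LSTM + backward LSTM
print(single, bidirectional)

# The recorded LSTM summary earlier in this notebook lists 66,000 parameters,
# which matches 100 units on a 64-dimensional input.
print(lstm_params(input_dim=64, units=100))
```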

In [0]:
model_bi.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model_bi.fit(train_padded, training_label_seq, epochs=num_epochs, validation_data=(validation_padded, validation_label_seq), verbose=2)

In [0]:
plot_graphs(history, "acc")
plot_graphs(history, "loss")

## CNN

One reason to use a bidirectional layer is the need to look at upcoming inputs, not only previous ones. A bidirectional layer is not the only way to do that. A CNN is another option, because each filter sees a few inputs before and after a position, depending on the filter size.

Below we use 128 filters with a kernel size of 5.

In [0]:
model_cnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                             input_length=max_length),
    # specify the number of convolutions that you want to learn, their size, and their activation function.
    # words will be grouped into the size of the filter in this case 5
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax')
])
model_cnn.summary()
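
As a rough check on the summary, the output length and parameter count of the `Conv1D` layer can be computed by hand. The sketch assumes the Keras defaults of `'valid'` padding and stride 1; both helper functions are hypothetical.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Number of positions a filter visits when no edge padding is added."""
    return (input_length - kernel_size) // stride + 1

def conv1d_params(in_channels, filters, kernel_size):
    """Each filter has kernel_size x in_channels weights plus one bias."""
    return kernel_size * in_channels * filters + filters

print(conv1d_output_length(input_length=200, kernel_size=5))
print(conv1d_params(in_channels=64, filters=128, kernel_size=5))
```

With `max_length = 200` and a kernel of 5, each filter produces 196 values, which `GlobalAveragePooling1D` then averages into a single number per filter.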

In [0]:
model_cnn.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model_cnn.fit(train_padded, training_label_seq, epochs=num_epochs, validation_data=(validation_padded, validation_label_seq), verbose=2)

The plots show results that are almost the same as the BiLSTM model, and even slightly better.

In [0]:
plot_graphs(history, "acc")
plot_graphs(history, "loss")

## GloVe+CNN+LSTM

This time we combine several components. We use a pretrained embedding matrix; word2vec would work, but here we try GloVe. More about GloVe: https://nlp.stanford.edu/projects/glove/. Several GloVe variants are available; we use the 100-dimensional vectors.

We then add a CNN followed by an LSTM.

In [0]:
word_index = tokenizer.word_index
vocab_size=len(word_index)
embedding_dim = 100

Build `embeddings_matrix`, which holds vectors only for the words in the vocabulary of the dataset we are using (the BBC training set).

In [0]:
# Map each GloVe word to its 100-dimensional vector
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs



embeddings_matrix = np.zeros((vocab_size+1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector
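
Words that are missing from GloVe keep an all-zero row in `embeddings_matrix`, so it can be useful to check how much of the vocabulary is actually covered. The sketch below uses toy dictionaries in place of `word_index` and `embeddings_index`; `embedding_coverage` is a hypothetical helper.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return the fraction of vocabulary words with a pretrained vector,
    plus the list of words that have none."""
    missing = [w for w in word_index if w not in embeddings_index]
    covered = len(word_index) - len(missing)
    return covered / len(word_index), missing

# Toy stand-ins for the tokenizer vocabulary and the loaded GloVe vectors.
toy_word_index = {"<OOV>": 1, "market": 2, "brexitish": 3}
toy_embeddings = {"market": [0.1, 0.2], "the": [0.3, 0.4]}

coverage, missing = embedding_coverage(toy_word_index, toy_embeddings)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```

In the notebook the same function could be called with `word_index` and `embeddings_index` directly.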

Since we are using a pretrained embedding matrix, we can freeze its weights so they are not updated during training. Set `trainable=False`.  
The model summary shows that although the total number of parameters is much larger than in the previous models, far fewer of them are trained.

In [0]:
model_combi = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length, weights=[embeddings_matrix], trainable=False),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(6, activation='softmax')
])
model_combi.summary()

In [0]:
model_combi.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model_combi.fit(train_padded, training_label_seq, epochs=num_epochs, validation_data=(validation_padded, validation_label_seq), verbose=2)

One visible difference is that validation accuracy is already above 80% after epoch 1, compared with around 50% for the earlier model. This is an effect of using pretrained word embeddings.

In [0]:
plot_graphs(history, "acc")
plot_graphs(history, "loss")

## Transformers

Finally, let's try pretrained contextual word embeddings. The Transformers library offers a large selection of models.

Install transformers from Hugging Face: https://huggingface.co/.

In [0]:
!pip install transformers

Import the required libraries.

In [0]:
import torch
import transformers as tfm # pytorch transformers
from sklearn.linear_model import LogisticRegression

Many transformer models are available; here we try DistilBERT. Other models are listed at https://huggingface.co/transformers/pretrained_models.html

In [0]:
model_class, tokenizer_class, pretrained_weights = (tfm.DistilBertModel, tfm.DistilBertTokenizer, 'distilbert-base-uncased')

Load the tokenizer and the model.

In [0]:
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Tokenize `train_set` and `validation_set`.

In [0]:
train_sequences = train_set['text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [0]:
validation_sequences = validation_set['text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [0]:
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [0]:
validation_padded = pad_sequences(validation_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

Convert `train_padded` and `validation_padded` to tensors, then run the model (forward pass). DistilBERT has a hidden size of 768.

In [0]:
train_ids = torch.tensor(np.array(train_padded)).to(torch.int64)
with torch.no_grad():
    train_last_hidden_states = model(train_ids)

In [0]:
validation_ids = torch.tensor(np.array(validation_padded)).to(torch.int64)
with torch.no_grad():
    validation_last_hidden_states = model(validation_ids)
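
The cells above pass the padded ids to the model in a single call and without an attention mask, so padding positions are attended to like real tokens and the whole dataset must fit in memory at once. A possible refinement, sketched here in plain Python on toy ids, is to build a mask that marks padding and to process the rows in batches. Both helpers are hypothetical, and the sketch assumes the padding id is 0, as produced by `pad_sequences`.

```python
def attention_mask(padded_ids, pad_id=0):
    """1 for real tokens, 0 for padding, one row per sequence."""
    return [[0 if tok == pad_id else 1 for tok in row] for row in padded_ids]

def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Toy ids standing in for rows of train_padded (101 and 102 mimic [CLS] and [SEP]).
toy_padded = [
    [101, 7592, 102, 0, 0],
    [101, 2088, 2003, 2307, 102],
    [101, 102, 0, 0, 0],
]
for batch in iter_batches(toy_padded, batch_size=2):
    print(batch, attention_mask(batch))
```

In the notebook, each batch and its mask could be converted with `torch.tensor` and passed as `model(input_ids=..., attention_mask=...)` inside `torch.no_grad()`, with the per-batch outputs concatenated afterwards.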

The shape of `last_hidden_states` is `[number of samples, sequence length, number of hidden units]`.

In [0]:
train_last_hidden_states[0].shape

For text classification, we can use only the output at the first position out of the 200 outputs the model produces. Because the model uses self-attention, every output position receives information from all input positions, not just from the corresponding one.

In [0]:
train_features = train_last_hidden_states[0][:,0,:].numpy()

In [0]:
validation_features = validation_last_hidden_states[0][:,0,:].numpy()

In [0]:
train_features[0]

In [0]:
train_label = training_label_seq.squeeze()
validation_label = validation_label_seq.squeeze()

Try a simple logistic regression as the classifier.

In [0]:
lr_clf = LogisticRegression(max_iter=500)
lr_clf.fit(train_features, train_label)

Both training and validation accuracy turn out to be fairly high.

In [0]:
lr_clf.score(train_features, train_label)

In [0]:
lr_clf.score(validation_features, validation_label)

## PoS Tagging with LSTM

Originally from https://nlpforhackers.io/lstm-pos-tagger-keras/

In [0]:
import nltk
import numpy as np

nltk.download('treebank')
 
tagged_sentences = nltk.corpus.treebank.tagged_sents()
 
print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))
print("Tagged words:", len(nltk.corpus.treebank.tagged_words()))

In [0]:
sentences, sentence_tags =[], [] 
for tagged_sentence in tagged_sentences:
    sentence, tags = zip(*tagged_sentence)
    sentences.append(np.array(sentence))
    sentence_tags.append(np.array(tags))
 
# Let's see how a sequence looks
 
print(sentences[5])
print(sentence_tags[5])

In [0]:
from sklearn.model_selection import train_test_split

(train_sentences, 
 test_sentences, 
 train_tags, 
 test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2)

In [0]:
words, tags = set([]), set([])
 
for s in train_sentences:
    for w in s:
        words.add(w.lower())
 
for ts in train_tags:
    for t in ts:
        tags.add(t)
 
word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0  # The special value used for padding
word2index['-OOV-'] = 1  # The special value used for OOVs
 
tag2index = {t: i + 1 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0  # The special value used for padding
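
One detail worth knowing: the iteration order of a set of strings can differ between Python processes, so the ids assigned above may change from run to run. If stable ids matter, for example when saving a trained model, sorting the items first is one option. The sketch below is a hypothetical variant of the index-building step, not the original tutorial's code.

```python
def build_index(items, reserved):
    """Give reserved symbols the first ids, then number the sorted items."""
    index = {symbol: i for i, symbol in enumerate(reserved)}
    for offset, item in enumerate(sorted(items), start=len(reserved)):
        index[item] = offset
    return index

toy_words = {"the", "cat", "sat"}
toy_word2index = build_index(toy_words, reserved=["-PAD-", "-OOV-"])
toy_tag2index = build_index({"NN", "DT", "VBD"}, reserved=["-PAD-"])
print(toy_word2index)
print(toy_tag2index)
```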

In [0]:
train_sentences_X, test_sentences_X, train_tags_y, test_tags_y = [], [], [], []
 
for s in train_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
 
    train_sentences_X.append(s_int)
 
for s in test_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
 
    test_sentences_X.append(s_int)
 
for s in train_tags:
    train_tags_y.append([tag2index[t] for t in s])
 
for s in test_tags:
    test_tags_y.append([tag2index[t] for t in s])
 
print(train_sentences_X[0])
print(test_sentences_X[0])
print(train_tags_y[0])
print(test_tags_y[0])

In [0]:
MAX_LENGTH = len(max(train_sentences_X, key=len))
print(MAX_LENGTH)  # 271

In [0]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
 
train_sentences_X = pad_sequences(train_sentences_X, maxlen=MAX_LENGTH, padding='post')
test_sentences_X = pad_sequences(test_sentences_X, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')

In [0]:
import tensorflow as tf

# Embedding -> BiLSTM that returns one output per token -> per-token softmax over tags
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(word2index), 128),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(len(tag2index))),
    tf.keras.layers.Activation('softmax')
])
model.summary()

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


In [0]:
def to_categorical(sequences, categories):
    cat_sequences = []
    for s in sequences:
        cats = []
        for item in s:
            cats.append(np.zeros(categories))
            cats[-1][item] = 1.0
        cat_sequences.append(cats)
    return np.array(cat_sequences)

In [0]:
model.fit(train_sentences_X, to_categorical(train_tags_y, len(tag2index)), batch_size=128, epochs=10, validation_split=0.2)

In [0]:
scores = model.evaluate(test_sentences_X, to_categorical(test_tags_y, len(tag2index)))
print(f"{model.metrics_names[1]}: {scores[1] * 100}")   
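
Because every sentence is padded to `MAX_LENGTH`, the accuracy reported above also counts padding positions, which are easy to predict and can inflate the score. A minimal sketch of an accuracy that skips padding is shown below on toy tag ids. `masked_accuracy` is a hypothetical helper, and it assumes the padding tag id is 0, as in `tag2index`.

```python
def masked_accuracy(true_ids, pred_ids, pad_id=0):
    """Accuracy over positions whose true tag is not the padding id."""
    correct = total = 0
    for true_row, pred_row in zip(true_ids, pred_ids):
        for t, p in zip(true_row, pred_row):
            if t == pad_id:
                continue  # ignore padding positions
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0

# Toy example: two real tokens followed by two padding positions.
toy_true = [[3, 4, 0, 0]]
toy_pred = [[3, 5, 0, 0]]
plain = sum(t == p for t, p in zip(toy_true[0], toy_pred[0])) / len(toy_true[0])
print(plain, masked_accuracy(toy_true, toy_pred))
```

In the notebook, the predicted ids could be obtained with `model.predict(test_sentences_X).argmax(axis=-1)` and compared against `test_tags_y`.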