Topic Modeling Amarigna

_A simple LSTM topic classifier, built to test whether topics can be identified in Amharic text_

In [64]:
from sklearn.datasets import fetch_20newsgroups
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras, numpy as np
from keras.layers import Embedding, Dense, LSTM, GRU
from keras.models import Sequential
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

_A small sample dataset to train and test the model_

In [86]:
data_loc = "./data/big_sample.csv"
data = pd.read_csv(data_loc, sep=';', names=['article_id', 'url_fragment', 'first_published', 'body', 'topic'])
# data.columns = ['article_id', 'url_fragment', 'first_published', 'body', 'topic']

ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.


In [84]:
data.body

5910888242026c28fac35d30     my very short pregnancy was extremely traumat...
5910888242026c28fac35d30     my very short pregnancy was extremely traumat...
5910888242026c28fac35d30     my very short pregnancy was extremely traumat...
5910888242026c28fac35d30     my very short pregnancy was extremely traumat...
592304c2fb5b53475047e666                                                 easy
592304c2fb5b53475047e666                                               dinner
5925cf4ab377dd5e042bec48                                              classic
5925cf4ab377dd5e042bec48                                                chili
5952d2e5068490541f4ad204                                                snack
590a0fb9f68fc128e293301f                                              noodles
590a0fb9f68fc128e293301f                                                 easy
590a0fb9f68fc128e293301f                                               dinner
5a14729a3858494222ebb024     they brought my daughter a stuffed 

In [66]:
nb_words = 100000
max_seq_len = 2000
data.columns

Index(['article_id', 'url_fragment', 'first_published', 'body', 'topic'], dtype='object')

In [67]:
train_size = int(np.floor(data.shape[0] * .8))

train_x = data["body"][0:train_size]
train_y = data["topic"][0:train_size]

test_x = data["body"][train_size:]
test_y = data["topic"][train_size:]

In [68]:
train_x.shape, train_y.shape, test_x.shape, test_y.shape

((800,), (800,), (200,), (200,))

In [69]:
X = data["body"]
y = data["topic"]

In [70]:
topics = list(y.unique())
y_encoded = [topics.index(topic) for topic in y] 

n_classes = len(topics)
n_classes

183

In [80]:
data.topic[0:100]

5910888242026c28fac35d30                                        mental health
5910888242026c28fac35d30                                                 body
5910888242026c28fac35d30                                                 mind
5910888242026c28fac35d30                                           psychology
592304c2fb5b53475047e666                                                  NaN
592304c2fb5b53475047e666                                                  NaN
5925cf4ab377dd5e042bec48                                                  NaN
5925cf4ab377dd5e042bec48                                                  NaN
5952d2e5068490541f4ad204                                                  NaN
590a0fb9f68fc128e293301f                                                  NaN
590a0fb9f68fc128e293301f                                                  NaN
590a0fb9f68fc128e293301f                                                  NaN
5a14729a3858494222ebb024     it seemed like they were in some ki

Preparing the data for the model
* Tokenizing the text - identifying the unique words, building a word-to-index dictionary and counting word frequencies across the documents (texts). Note that the tokenizer below is fit on all of `X`, before the train/validation split.
* One-hot encoding the labels (topics)
* Splitting the data into train and test (validation) sets
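
To make the three steps above concrete, here is a minimal pure-Python sketch of what they do on a toy input. It is a simplified stand-in, not the Keras or scikit-learn implementation: it only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation. The helper names (`build_word_index`, `texts_to_padded_sequences`, `one_hot`, `split_train_valid`) and the toy topic labels are invented for illustration. Padding and truncation are done at the front of each sequence, which is assumed to match the `pad_sequences` defaults.

```python
from collections import Counter
import random


def build_word_index(texts):
    """Map each word to an integer rank, most frequent first (ranks start at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def texts_to_padded_sequences(texts, word_index, max_len):
    """Turn each text into a fixed-length list of word indices (0 is the padding value)."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-max_len:]                              # keep the last max_len tokens
        padded.append([0] * (max_len - len(seq)) + seq)   # left-pad with zeros
    return padded


def one_hot(labels):
    """Return (classes, vectors) with one 0/1 vector per label."""
    classes = sorted(set(labels))
    position = {label: i for i, label in enumerate(classes)}
    vectors = [[1 if position[label] == i else 0 for i in range(len(classes))]
               for label in labels]
    return classes, vectors


def split_train_valid(rows, targets, valid_fraction=0.2, seed=0):
    """Shuffle the row order, then hold out valid_fraction of the rows for validation."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    n_valid = int(len(rows) * valid_fraction)
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    return ([rows[i] for i in train_idx], [rows[i] for i in valid_idx],
            [targets[i] for i in train_idx], [targets[i] for i in valid_idx])


# Toy data: the words are borrowed from the sample output above, the topics are made up.
toy_texts = ["easy dinner", "classic chili dinner", "easy snack",
             "noodles dinner", "chili snack"]
toy_topics = ["food", "food", "snacks", "food", "snacks"]

toy_word_index = build_word_index(toy_texts)
toy_padded = texts_to_padded_sequences(toy_texts, toy_word_index, max_len=4)
toy_classes, toy_onehot = one_hot(toy_topics)
x_tr, x_va, y_tr, y_va = split_train_valid(toy_padded, toy_onehot)

print(toy_word_index)
print(toy_padded[0], toy_onehot[0])
print(len(x_tr), len(x_va))
```

Index 0 is reserved for padding, which is why the embedding layer further down is sized `len(word_index) + 1`.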

In [71]:
tokenizer = Tokenizer(num_words=nb_words)
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
word_index = tokenizer.word_index

ydata = keras.utils.to_categorical(y_encoded)
input_data = pad_sequences(sequences, maxlen=max_seq_len)

Xtrain, Xvalid, ytrain, yvalid = train_test_split(input_data, ydata, test_size=0.2)

_Model definition and training_

In [74]:
embedding_vector_length = 64
model = Sequential()
model.add(Embedding(len(word_index)+1, embedding_vector_length, input_length=max_seq_len, embeddings_initializer='glorot_normal', 
                    embeddings_regularizer=keras.regularizers.l2(0.01)))
model.add(LSTM(80))
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 2000, 64)          653888    
_________________________________________________________________
lstm_6 (LSTM)                (None, 80)                46400     
_________________________________________________________________
dense_6 (Dense)              (None, 183)               14823     
Total params: 715,111
Trainable params: 715,111
Non-trainable params: 0
_________________________________________________________________
None
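
The parameter counts in the summary can be checked by hand. The sketch below assumes the usual formulas for these layer types: an embedding table of `rows * dim` weights, an LSTM with four gates that each have input weights, recurrent weights and a bias, and a dense layer with one bias per output unit. The function names are invented for illustration, and the vocabulary size is inferred from the summary rather than read from the data.

```python
def embedding_params(rows, dim):
    # One dim-sized vector per row of the lookup table.
    return rows * dim


def lstm_params(input_dim, units):
    # Four gates, each with (input_dim + units) * units weights plus units biases.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


embedding_dim, lstm_units, n_topic_classes = 64, 80, 183

# 653888 / 64 = 10217 rows, i.e. len(word_index) + 1 (inferred from the summary).
vocab_rows = 653888 // embedding_dim

emb = embedding_params(vocab_rows, embedding_dim)
lstm = lstm_params(embedding_dim, lstm_units)
dense = dense_params(lstm_units, n_topic_classes)

print(emb, lstm, dense, emb + lstm + dense)
```

Under these assumptions the three numbers agree with the summary, and most of the model's capacity sits in the embedding table. With only 800 training samples, that is worth keeping in mind when judging the results.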


In [76]:
# `nb_epoch` is the deprecated Keras 1 name for this argument; Keras 2 and later use `epochs`.
model.fit(Xtrain, ytrain, validation_data=(Xvalid, yvalid), epochs=10, batch_size=16)



Train on 800 samples, validate on 200 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1242c2e48>

In [None]:
# predict() needs input data; the held-out validation set is used here as an assumption.
preds = model.predict(Xvalid)