# HOMEWORK 5: TEXT CLASSIFICATION
In this homework, you will create models to classify texts from TRUE call-center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancle)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming) 

In this homework, you are asked to do the following tasks:
1. Data Cleaning
2. Preprocessing data for keras
3. Build and evaluate a model for "action" classification
4. Build and evaluate a model for "object" classification
5. Build and evaluate a multi-task model that does both "action" and "object" classifications in one-go 


Note: we have removed phone numbers from the dataset for privacy purposes. 

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
%cd /content/drive/MyDrive/Colab\ Notebooks/NLP/hw6

/content/drive/MyDrive/Colab Notebooks/NLP/hw6


In [2]:
# !wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv

## Import Libs

In [115]:
%matplotlib inline
import pandas
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
from IPython.display import display

import matplotlib.pyplot as plt

## Loading data
First, we load the data from disk into a Dataframe.

A Dataframe is essentially a table, or 2D-array/Matrix with a name for each column.

In [116]:
data_df = pandas.read_csv('clean-phone-data-for-students.csv')

Let's preview the data.

In [117]:
# Show the top 5 rows
display(data_df.head())
# Summarize the data
data_df.describe()

Unnamed: 0,Sentence Utterance,Action,Object
0,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,payment
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้...,report,suspend
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อ...,enquire,internet
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโ...,report,phone_issues


Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


## Data cleaning

We call the DataFrame.describe() again.
Notice that there are 33 unique labels/classes for object and 10 unique labels for action that the model will try to predict.
But there are unwanted duplications e.g. Idd,idd,lotalty_card,Lotalty_card

Also note that, there are 13389 unqiue sentence utterances from 16175 utterances. You have to clean that too!

## #TODO 1: 
You will have to remove unwanted label duplications as well as duplications in text inputs. 
Also, you will have to trim out unwanted whitespaces from the text inputs. 
This shouldn't be too hard, as you have already seen it in the demo.



In [118]:
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nonTrueMove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd',
       'TrueMoney', 'garbage', 'Payment', 'IDD', 'ringtone', 'Idd',
       'rate', 'loyalty_card', 'contact', 'officer', 'Balance', 'Service',
       'Loyalty_card'], dtype=object)

array(['enquire', 'report', 'cancel', 'Enquire', 'buy', 'activate',
       'request', 'Report', 'garbage', 'change'], dtype=object)

In [119]:
# TODO1: Data cleaning
data_df.Action = data_df.Action.str.lower().copy()
data_df.Object = data_df.Object.str.lower().copy()

display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,8,26
top,บริการอื่นๆ,enquire,service
freq,97,10484,2528


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nontruemove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd', 'garbage',
       'ringtone', 'rate', 'loyalty_card', 'contact', 'officer'],
      dtype=object)

array(['enquire', 'report', 'cancel', 'buy', 'activate', 'request',
       'garbage', 'change'], dtype=object)

In [120]:
data_df = data_df.drop_duplicates("Sentence Utterance", keep='first')
display(data_df.describe())

Unnamed: 0,Sentence Utterance,Action,Object
count,13389,13389,13389
unique,13389,8,26
top,เด๋วพี่ขอปรึกษาหน่อยนะ น้องสามารถเช็คโปรโมชั่น...,enquire,service
freq,1,8658,2111


In [121]:
data_df = data_df.rename(columns={"Sentence Utterance": "input"})

In [122]:
## strip space before input
data_df.input = data_df.input.str.strip()

In [124]:
data_df.to_csv('checkpoint.csv', index=False)

## #TODO 2: Preprocessing data for Keras
You will be using Tensorflow 2 keras in this assignment. Please show us how you prepare your data for keras.
Don't forget to split data into train and test sets (+ validation set if you want)

In [125]:
import tensorflow as tf

In [126]:
!pip install pythainlp



In [127]:
from pythainlp import word_tokenize, Tokenizer
from pythainlp.util.trie import dict_trie
from pythainlp.corpus.common import thai_words

In [128]:
# TODO2: Preprocessing data for Keras
data_df = pandas.read_csv('checkpoint.csv')
display(data_df)
data = data_df.to_numpy()

Unnamed: 0,input,Action,Object
0,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counter...,enquire,payment
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้ ค่ะ,report,suspend
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อง...,enquire,internet
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโท...,report,phone_issues
...,...,...,...
13384,ต้องการทราบวันตัดรอบบิลค่ะ,enquire,bill
13385,เชื่อมต่ออินเตอร์เน็ตไม่ได้ค่ะ,enquire,internet
13386,ยอดเงินเหลือเท่าไหร่ค่ะ,enquire,balance
13387,ยอดเงินในระบบ,enquire,balance


In [129]:
action_labels = data_df.Action.unique()
object_labels = data_df.Object.unique()

action2idx = dict(zip(action_labels, range(len(action_labels))))
idx2action = dict(zip(range(len(action_labels)), action_labels))

object2idx = dict(zip(object_labels, range(len(object_labels))))
idx2object = dict(zip(range(len(object_labels)), object_labels))

# display(action2index)
# display(index2action)

# display(object2index)
# display(index2object)

data[:,1] = np.vectorize(action2idx.get)(data[:,1])
data[:,2] = np.vectorize(object2idx.get)(data[:,2])

In [130]:
words = ["<PHONE_NUMBER_REMOVED>"]
custom_words_list = set(thai_words())
custom_words_list.update(words)
trie = dict_trie(dict_source=custom_words_list)

custom_tokenizer = Tokenizer(custom_dict=trie, engine='newmm')
sentence_data = [custom_tokenizer.word_tokenize(data[i,0]) for i in range(data.shape[0])]
print(sentence_data[:5])

[['<PHONE_NUMBER_REMOVED>', ' ', 'ผม', 'ไป', 'จ่าย', 'เงิน', 'ที่', ' ', 'Counter', ' ', 'Services', ' ', 'เค้า', 'เช็ต', ' ', '3276.25', ' ', 'บาท', ' ', 'เมื่อวาน', 'ที่', 'ผม', 'เช็ค', 'ที่', 'ศูนย์', 'บอก', 'มี', 'ยอด', ' ', '3057.79', ' ', 'บาท'], ['internet', ' ', 'ยัง', 'ความเร็ว', 'อยุ่', 'เท่า', 'ไห', 'ร', ' ', 'ครับ'], ['ตะกี้', 'ไป', 'ชำระ', 'ค่าบริการ', 'ไป', 'แล้ว', ' ', 'แต่', 'ยัง', 'ใช้งาน', 'ไม่', 'ได้', ' ', 'ค่ะ'], ['พี่', 'ค่ะ', 'ยัง', 'ใช้', ' ', 'internet', ' ', 'ไม่', 'ได้', 'เลย', 'ค่ะ', ' ', 'เป็น', 'เครื่อง', ' ', 'โก', 'ลไล'], ['ฮา', 'โหล', ' ', 'คะ', ' ', 'พอดี', 'ว่า', 'เมื่อวาน', 'เปิด', 'ซิม', 'ทรูมูฟ', ' ', 'แต่', 'มัน', 'โทร', 'ออก', 'ไม่', 'ได้', 'คะ', ' ', 'แต่', 'เล่น', 'เนต', 'ได้', 'คะ']]


In [131]:
print('max len train', max([len(x) for x in sentence_data]))

max len train 124


In [132]:
train_data, test_data, y_train, y_test = sklearn.model_selection.train_test_split(sentence_data.copy(), data[:,1:3].copy(), test_size=0.2)



# ya_train, ya_test = Y_train[:,0].copy(), Y_test[:,0].copy()
# yo_train, yo_test = Y_train[:,1].copy(), Y_test[:,1].copy()

# print(ya_train.shape, yo_train.shape)
# print(np.max(ya_train))
# print(np.max(yo_train))

In [133]:
word2idx = {}
idx2word = {}
for sent in train_data:
  for word in sent:
    if word not in word2idx:
      word2idx[word] = len(word2idx)+1
      idx2word[word2idx[word]] = word
word2idx['UNK'] = len(word2idx)

In [134]:
def sent2features(sent, emb_dict=word2idx):
  return np.asarray([word2idx.get(w, word2idx['UNK']) for w in sent])

sent2features(["ฉัน","หิว","โทร","กดฟห"])

array([1963, 3754,    1, 3754])

In [168]:

x_train = np.asarray([sent2features(sent) for sent in train_data])
x_test = [sent2features(sent) for sent in test_data]
print(x_train[100])
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=128, dtype='int32', padding='post', truncating='pre', value=0.)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=128, dtype='int32', padding='post', truncating='pre', value=0.)
print(x_train[100])

[ 43   9  12  39  45   9 340   9 309   9 114   9 402   9  12 103 289  70
  57  43]
[ 43.   9.  12.  39.  45.   9. 340.   9. 309.   9. 114.   9. 402.   9.
  12. 103. 289.  70.  57.  43.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   0.]


  return array(a, dtype, copy=False, order=order)


In [170]:
from sklearn.preprocessing import OneHotEncoder

naction = len(action2idx)
nobject = len(object2idx)

onehot_encoder = OneHotEncoder(sparse=False)
ya_train = onehot_encoder.fit_transform(y_train[:,0].reshape(-1,1)).astype(np.float32)
ya_test = onehot_encoder.fit_transform(y_test[:,0].reshape(-1,1)).astype(np.float32)
yo_train = onehot_encoder.fit_transform(y_train[:,1].reshape(-1,1)).astype(np.float32)
yo_test = onehot_encoder.fit_transform(y_test[:,1].reshape(-1,1)).astype(np.float32)

print('action test', ya_test.shape)
print('action train', ya_train.shape)
print('object test', yo_test.shape)
print('object train', yo_train.shape)

print('nclass action', naction)
print('nclass object', nobject)

action test (2678, 8)
action train (10711, 8)
object test (2678, 26)
object train (10711, 26)
nclass action 8
nclass object 26


## #TODO 3: Build and evaluate a model for "action" classification


In [162]:
from tensorflow.keras.backend import clear_session
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, Reshape, Activation, Input, Dense,GRU,Reshape,TimeDistributed,Bidirectional,Dropout,Masking, SimpleRNN
from tensorflow.keras.optimizers import Adam

In [208]:
#TODO 3: Build and evaluate a model for "action" classification
clear_session()

def getActionModel():
  inputs = Input(shape=(128,), dtype='int32')
  x = Embedding(len(word2idx), 32, input_length=128, mask_zero=True)(inputs)
  x = Bidirectional(GRU(32, return_sequences=True))(x)
  x = Dropout(0.2)(x)
  # x = TimeDistributed(Dense(naction,activation='softmax'))(x)
  x = SimpleRNN(128)(x)
  x = Dense(naction, activation='softmax')(x)
  model = Model(inputs, x)

  

  model.compile(optimizer=Adam(lr=0.001),  loss='categorical_crossentropy', metrics=['categorical_accuracy'])
  return model

model = getActionModel()
model.summary()
model.fit(x_train,y=ya_train, batch_size=64, epochs=7, verbose=1, validation_split=0.05)


Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 128)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 128, 32)           120160    
_________________________________________________________________
bidirectional (Bidirectional (None, 128, 64)           12672     
_________________________________________________________________
dropout (Dropout)            (None, 128, 64)           0         
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 128)               24704     
_________________________________________________________________
dense (Dense)                (None, 8)                 1032      
Total params: 158,568
Trainable params: 158,568
Non-trainable params: 0
_______________________________________________________

<tensorflow.python.keras.callbacks.History at 0x7f933cf53ed0>

In [209]:
model.save('action_bigru1')



In [210]:
def evaluate_action(y_true, y_pred):
    y_true = y_true.argmax(axis=1)
    y_pred = y_pred.argmax(axis=1)
    print(sklearn.metrics.classification_report(y_true, y_pred, target_names=[idx2action[i] for i in range(8)]))
    print('accuracy', np.sum(y_true==y_pred)/y_true.size)

In [211]:
ya_pred = model.predict(x_test)
print(evaluate_action(ya_test, ya_pred))

              precision    recall  f1-score   support

     enquire       0.91      0.87      0.89      1748
      report       0.69      0.80      0.74       282
      cancel       0.88      0.92      0.90       213
         buy       0.58      0.78      0.67       157
    activate       0.71      0.70      0.70       106
     request       0.55      0.45      0.50        58
     garbage       0.00      0.00      0.00        10
      change       0.85      0.78      0.81       104

    accuracy                           0.84      2678
   macro avg       0.65      0.66      0.65      2678
weighted avg       0.84      0.84      0.84      2678

accuracy 0.8368185212845407
None


  _warn_prf(average, modifier, msg_start, len(result))


## #TODO 4: Build and evaluate a model for "object" classification



In [227]:
#TODO 4: Build and evaluate a model for "object" classification
clear_session()
def evaluate_object(y_true, y_pred):
    y_true = y_true.argmax(axis=1)
    y_pred = y_pred.argmax(axis=1)
    print(sklearn.metrics.classification_report(y_true, y_pred, target_names=[idx2object[i] for i in range(26)]))
    print('accuracy', np.sum(y_true==y_pred)/y_true.size)

def getObjectModel():
  inputs = Input(shape=(128,), dtype='int32')
  x = Embedding(len(word2idx), 32, input_length=128, mask_zero=True)(inputs)
  x = Bidirectional(GRU(32, return_sequences=True))(x)
  x = Dropout(0.2)(x)
  # x = TimeDistributed(Dense(naction,activation='softmax'))(x)
  x = SimpleRNN(128)(x)
  x = Dense(nobject, activation='softmax')(x)
  model = Model(inputs, x)

  

  model.compile(optimizer=Adam(lr=0.001),  loss='categorical_crossentropy', metrics=['categorical_accuracy'])
  return model

model = getObjectModel()
model.summary()
model.fit(x_train,y=yo_train, batch_size=64, epochs=10, verbose=1, validation_split=0.05)

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 128)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 128, 32)           120160    
_________________________________________________________________
bidirectional (Bidirectional (None, 128, 64)           12672     
_________________________________________________________________
dropout (Dropout)            (None, 128, 64)           0         
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 128)               24704     
_________________________________________________________________
dense (Dense)                (None, 26)                3354      
Total params: 160,890
Trainable params: 160,890
Non-trainable params: 0
_______________________________________________________

<tensorflow.python.keras.callbacks.History at 0x7f93372d6b50>

In [228]:
model.save('object_bigru1')



In [230]:
yo_pred = model.predict(x_test)
print(evaluate_object(yo_test, yo_pred))

                 precision    recall  f1-score   support

        payment       0.56      0.59      0.57       131
        package       0.66      0.74      0.70       371
        suspend       0.70      0.81      0.75       140
       internet       0.70      0.76      0.73       349
   phone_issues       0.52      0.59      0.55       121
        service       0.76      0.73      0.75       404
    nontruemove       0.14      0.10      0.12        58
        balance       0.81      0.84      0.82       296
         detail       0.50      0.22      0.30        73
           bill       0.61      0.61      0.61        97
         credit       0.87      0.82      0.84        33
      promotion       0.69      0.55      0.61       229
 mobile_setting       0.53      0.44      0.48        57
       iservice       0.00      0.00      0.00        10
        roaming       0.64      0.58      0.61        48
      truemoney       0.81      0.82      0.82        51
    information       0.34    

  _warn_prf(average, modifier, msg_start, len(result))


## #TODO 5: Build and evaluate a multi-task model that does both "action" and "object" classifications in one-go 

This can be a bit tricky, if you are not familiar with the Keras functional API. PLEASE READ these webpages(https://www.tensorflow.org/guide/keras/functional, https://keras.io/getting-started/functional-api-guide/) before you start this task.   

Your model will have 2 separate output layers one for action classification task and another for object classification task. 

This is a rough sketch of what your model might look like:
![image](https://raw.githubusercontent.com/ekapolc/nlp_course/master/HW5/multitask_sketch.png)

In [240]:
#TODO 5: Build and evaluate a multi-task model that does both "action" and "object" classifications in one-go
def getMultiModel():
  inputs = Input(shape=(128,), dtype='int32')
  x = Embedding(len(word2idx), 32, input_length=128, mask_zero=True)(inputs)
  x = Bidirectional(GRU(32, return_sequences=True))(x)
  x = Dropout(0.2)(x)
  # x = TimeDistributed(Dense(naction,activation='softmax'))(x)
  x = SimpleRNN(128)(x)
  xa = Dense(naction, activation='softmax', name='action')(x)
  xo = Dense(nobject, activation='softmax', name='object')(x)
  model = Model(inputs, outputs=[xa,xo])
  model.compile(optimizer=Adam(lr=0.001),  loss={'action':'categorical_crossentropy', 'object':'categorical_crossentropy'}, loss_weights=[1.0, 1.0])
  return model

model = getMultiModel()
model.summary()
model.fit(x_train,{'action':ya_train, 'object':yo_train}, batch_size=64, epochs=17, verbose=1, validation_split=0.05)

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 128)]        0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 128, 32)      120160      input_3[0][0]                    
__________________________________________________________________________________________________
bidirectional_2 (Bidirectional) (None, 128, 64)      12672       embedding_2[0][0]                
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 128, 64)      0           bidirectional_2[0][0]            
____________________________________________________________________________________________

<tensorflow.python.keras.callbacks.History at 0x7f9336d6bb50>

In [241]:
model.save('multi_bigru1')



In [242]:
ya_pred, yo_pred = model.predict(x_test)

In [243]:
evaluate_action(ya_pred, ya_test)
evaluate_object(yo_pred, yo_test)

              precision    recall  f1-score   support

     enquire       0.91      0.88      0.89      1801
      report       0.70      0.70      0.70       282
      cancel       0.91      0.86      0.88       225
         buy       0.68      0.75      0.71       141
    activate       0.59      0.64      0.61        99
     request       0.36      0.66      0.47        32
     garbage       0.00      0.00      0.00         4
      change       0.75      0.83      0.79        94

    accuracy                           0.84      2678
   macro avg       0.61      0.66      0.63      2678
weighted avg       0.85      0.84      0.84      2678

accuracy 0.8379387602688574
                 precision    recall  f1-score   support

        payment       0.58      0.61      0.60       124
        package       0.71      0.64      0.68       411
        suspend       0.75      0.75      0.75       140
       internet       0.80      0.64      0.71       436
   phone_issues       0.50      0.5

  _warn_prf(average, modifier, msg_start, len(result))


In [245]:
myx = [sent2features(["สอบ","ถาม","ยอด","ค่า","บริการ"])]
myya, myyo = model.predict(myx)
actionidx =myya.argmax(axis=1)
objectidx =myyo.argmax(axis=1)
print(idx2action[actionidx[0]])
print(idx2object[objectidx[0]])

enquire
service
