
# HOMEWORK 5: TEXT CLASSIFICATION
In this homework, you will create models to classify texts from the TRUE call center. There are two classification tasks:
1. Action Classification: Identify which action the customer would like to take (e.g. enquire, report, cancel)
2. Object Classification: Identify which object the customer is referring to (e.g. payment, truemoney, internet, roaming)

In this homework, you are asked to do the following tasks:
1. Data Cleaning
2. Preprocessing data for Keras
3. Build and evaluate a model for "action" classification
4. Build and evaluate a model for "object" classification
5. Build and evaluate a multi-task model that does both "action" and "object" classification in one go


Note: we have removed phone numbers from the dataset for privacy purposes. 

In [1]:
!wget --no-check-certificate https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv

--2021-02-27 07:34:36--  https://www.dropbox.com/s/37u83g55p19kvrl/clean-phone-data-for-students.csv
Resolving www.dropbox.com (www.dropbox.com)... 162.125.6.18, 2620:100:601c:18::a27d:612
Connecting to www.dropbox.com (www.dropbox.com)|162.125.6.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/37u83g55p19kvrl/clean-phone-data-for-students.csv [following]
--2021-02-27 07:34:36--  https://www.dropbox.com/s/raw/37u83g55p19kvrl/clean-phone-data-for-students.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc0eba5ff79bce0641717c45d4d6.dl.dropboxusercontent.com/cd/0/inline/BJs0uZ4isV5RqP1glbEwLwSvZUqlDKT1YXtB1BA1A1F7t6Le6Y4iGXmVZI-I_EFdHEucc-IFKCskruExy27YXx2StF33S5nKKr98ltHz9JbiXw/file# [following]
--2021-02-27 07:34:37--  https://uc0eba5ff79bce0641717c45d4d6.dl.dropboxusercontent.com/cd/0/inline/BJs0uZ4isV5RqP1glbEwLwSvZUqlDKT1YXtB1BA1A1F7t6Le6Y4iGXmVZI-I_EFdHEucc

## Import Libs

In [2]:
%matplotlib inline
import pandas
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import numpy as np
from IPython.display import display

import matplotlib.pyplot as plt

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, GRU, Embedding
from tensorflow.keras.optimizers import Adam

import random
random.seed(11)

## Loading data
First, we load the data from disk into a DataFrame.

A DataFrame is essentially a table, or 2D array/matrix, with a name for each column.

In [3]:
data_df = pandas.read_csv('clean-phone-data-for-students.csv')

Let's preview the data.

In [4]:
# Show the top 5 rows
display(data_df.head())
# Summarize the data
data_df.describe()

Unnamed: 0,Sentence Utterance,Action,Object
0,<PHONE_NUMBER_REMOVED> ผมไปจ่ายเงินที่ Counte...,enquire,payment
1,internet ยังความเร็วอยุ่เท่าไหร ครับ,enquire,package
2,ตะกี้ไปชำระค่าบริการไปแล้ว แต่ยังใช้งานไม่ได้...,report,suspend
3,พี่ค่ะยังใช้ internet ไม่ได้เลยค่ะ เป็นเครื่อ...,enquire,internet
4,ฮาโหล คะ พอดีว่าเมื่อวานเปิดซิมทรูมูฟ แต่มันโ...,report,phone_issues


Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


## Data cleaning

We call `DataFrame.describe()` again.
Notice that there are 33 unique labels/classes for object and 10 unique labels for action that the model will try to predict.
But some of them are unwanted duplicates that differ only in capitalization, e.g. Idd, idd, IDD and loyalty_card, Loyalty_card.

Also note that there are only 13389 unique sentence utterances among the 16175 utterances. You have to clean that too!

## #TODO 1: 
You will have to remove unwanted label duplicates as well as duplicated text inputs.
You will also have to trim unwanted whitespace from the text inputs.
This shouldn't be too hard, as you have already seen it in the demo.
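
The three cleaning steps above can be sketched on a toy table before applying them to the real data. This is only an illustration under stated assumptions: the three rows are made up, and only the column names are taken from the homework CSV.

```python
import pandas as pd

# Toy stand-in for the real dataset (made-up rows, real column names).
toy_df = pd.DataFrame({
    "Sentence Utterance": ["  check my bill ", "check my bill", "stop roaming"],
    "Action": ["Enquire", "enquire", "cancel"],
    "Object": ["bill", "Bill", "roaming"],
})

cleaned = toy_df.copy()
# 1. Trim leading and trailing whitespace from the text inputs.
cleaned["Sentence Utterance"] = cleaned["Sentence Utterance"].str.strip()
# 2. Lowercase the labels so "Enquire" and "enquire" collapse into one class.
for col in ["Action", "Object"]:
    cleaned[col] = cleaned[col].str.lower()
# 3. Drop repeated utterances, keeping the first occurrence.
cleaned = cleaned.drop_duplicates("Sentence Utterance", keep="first").reset_index(drop=True)

print(cleaned)
```

Note that whitespace is trimmed before duplicates are dropped. In the toy table the first two utterances only become identical after trimming, so the order of the steps matters.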



In [5]:
display(data_df.describe())
display(data_df.Object.unique())
display(data_df.Action.unique())

Unnamed: 0,Sentence Utterance,Action,Object
count,16175,16175,16175
unique,13389,10,33
top,บริการอื่นๆ,enquire,service
freq,97,10377,2525


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nonTrueMove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd',
       'TrueMoney', 'garbage', 'Payment', 'IDD', 'ringtone', 'Idd',
       'rate', 'loyalty_card', 'contact', 'officer', 'Balance', 'Service',
       'Loyalty_card'], dtype=object)

array(['enquire', 'report', 'cancel', 'Enquire', 'buy', 'activate',
       'request', 'Report', 'garbage', 'change'], dtype=object)

In [6]:
def lowerString(dataframe, column):
    newColumn = column + "_clean"
    dataframe[newColumn] = dataframe[column].str.lower().copy()
    return dataframe

In [7]:
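# TODO1: Data cleaning
clean_df = data_df.copy()
# Trim leading/trailing whitespace in every cell (assumes every cell is a string)
clean_df = clean_df.applymap(lambda x: x.strip())
# Lowercase the labels so that e.g. "Idd", "IDD" and "idd" become one class
clean_df = lowerString(clean_df, "Action")
clean_df = lowerString(clean_df, "Object")
# Drop repeated utterances, keeping the first occurrence
clean_df = clean_df.drop_duplicates("Sentence Utterance", keep="first")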

In [8]:
display(clean_df.describe())
display(clean_df.Object.unique())
display(clean_df.Object_clean.unique())
display(clean_df.Action.unique())
display(clean_df.Action_clean.unique())

Unnamed: 0,Sentence Utterance,Action,Object,Action_clean,Object_clean
count,13367,13367,13367,13367,13367
unique,13367,10,32,8,26
top,ขอเปิดใช้งานบริการโรมมิ่ง 3 เดือน พอดีจะไปทำงา...,enquire,service,enquire,service
freq,1,8541,2105,8644,2108


array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nonTrueMove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd',
       'TrueMoney', 'garbage', 'Payment', 'ringtone', 'rate',
       'loyalty_card', 'Idd', 'contact', 'officer', 'Balance', 'Service',
       'Loyalty_card'], dtype=object)

array(['payment', 'package', 'suspend', 'internet', 'phone_issues',
       'service', 'nontruemove', 'balance', 'detail', 'bill', 'credit',
       'promotion', 'mobile_setting', 'iservice', 'roaming', 'truemoney',
       'information', 'lost_stolen', 'balance_minutes', 'idd', 'garbage',
       'ringtone', 'rate', 'loyalty_card', 'contact', 'officer'],
      dtype=object)

array(['enquire', 'report', 'cancel', 'Enquire', 'buy', 'activate',
       'request', 'Report', 'garbage', 'change'], dtype=object)

array(['enquire', 'report', 'cancel', 'buy', 'activate', 'request',
       'garbage', 'change'], dtype=object)

## #TODO 2: Preprocessing data for Keras
You will be using TensorFlow 2 Keras in this assignment. Please show us how you prepare your data for Keras.
Don't forget to split the data into train and test sets (plus a validation set if you want).
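
If you also want a validation set, one possible approach is to shuffle the row indices once and cut them into three parts. The sketch below is an assumption-laden illustration, not the method used in this notebook (which calls `train_test_split`): the helper name `three_way_split` and the 70/15/15 proportions are hypothetical.

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.15, test_frac=0.15, seed=11):
    """Return shuffled index arrays for the train, validation and test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the split is expressed as index arrays, the same three arrays can slice the inputs and every label array, which keeps each row aligned with its labels.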

In [9]:
def label2num(dataframe, column):
    uniqueLabel = dataframe[column].unique()
    label2numMap = dict(zip(uniqueLabel, range(len(uniqueLabel))))
    num2labelMap = dict(zip(range(len(uniqueLabel)), uniqueLabel))
    dataframe[column+"_id"] = dataframe[column].map(label2numMap)
    return dataframe, label2numMap, num2labelMap
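
To make the label mapping concrete, here is a small sketch of the same idea on a made-up list of labels. It builds the two lookup tables by hand and then one-hot encodes the ids with NumPy, which gives the same kind of matrix that `to_categorical` is used for later. The variable names are illustrative only.

```python
import numpy as np

labels = ["enquire", "report", "enquire", "cancel"]

# Assign ids in order of first appearance, mirroring zip(unique, range(...)).
label2id = {}
for name in labels:
    label2id.setdefault(name, len(label2id))
id2label = {idx: name for name, idx in label2id.items()}

ids = np.array([label2id[name] for name in labels])
# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(label2id))[ids]

print(label2id)
print(one_hot)
```

The reverse map (`id2label` here, `num2labelMap` in the function above) is what turns a predicted class index back into a readable label.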

In [10]:
def getAllChar(dataframe, column):
    allString = "".join(dataframe[column])
    allChar = sorted(np.unique(np.array(list(allString))))
    charMap = dict(zip(allChar, range(len(allChar))))
    return allChar, charMap

In [11]:
def countChar(row, column, charMap):
    result = np.zeros(len(charMap))
    np_str = np.array(list(row[column]))
    str_char, str_char_count = np.unique(np_str, return_counts=True)
    for char, count in zip(str_char, str_char_count):
        result[charMap[char]] = count
    return result
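
The two helpers above turn each utterance into a fixed-length vector of character counts. A minimal worked example of that representation, on two made-up strings and with hypothetical names (`char_map`, `count_chars`), might look like this:

```python
from collections import Counter

sentences = ["abba", "cab"]

# Vocabulary: every distinct character in the corpus, sorted, mapped to a column index.
all_chars = sorted(set("".join(sentences)))
char_map = {ch: i for i, ch in enumerate(all_chars)}

def count_chars(text, char_map):
    """Return a list where position char_map[ch] holds how often ch occurs in text."""
    vector = [0] * len(char_map)
    for ch, count in Counter(text).items():
        vector[char_map[ch]] = count
    return vector

vectors = [count_chars(s, char_map) for s in sentences]
print(char_map)   # {'a': 0, 'b': 1, 'c': 2}
print(vectors)    # [[2, 2, 0], [1, 1, 1]]
```

This representation ignores character order: "abba" and "baab" would map to the same vector, which is one likely reason a model on these features has limited accuracy.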

In [12]:
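def dataframeToNumpy(dataframe, input_col, label_col):
    test_size = 0.3
    random_state = 11
    # Stack the per-row count vectors into a 2D array of shape (n_samples, n_chars)
    data_input = np.array([[e for e in sl] for sl in dataframe[input_col]])
    data_train, data_test = train_test_split(data_input, test_size=test_size, random_state=random_state)
    # One-hot encode the integer label ids
    labels = to_categorical(dataframe[label_col])
    # Same test_size and random_state as above, so the labels stay aligned with the inputs
    label_train, label_test = train_test_split(labels, test_size=test_size, random_state=random_state)
    return data_train, data_test, label_train, label_test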

In [13]:
# TODO2: Preprocessing data for Keras
map_df = clean_df.copy()
map_df, l2n_action, n2l_action = label2num(map_df, "Action_clean")
map_df, l2n_object, n2l_object = label2num(map_df, "Object_clean")
allChar, charMap = getAllChar(map_df, "Sentence Utterance")
map_df["Char_count"] = map_df.apply(lambda row: countChar(row,"Sentence Utterance", charMap), axis=1)
action_train, action_test, action_train_label, action_test_label  = dataframeToNumpy(map_df, "Char_count", "Action_clean_id")
object_train, object_test, object_train_label, object_test_label = dataframeToNumpy(map_df, "Char_count", "Object_clean_id")

## #TODO 3: Build and evaluate a model for "action" classification


In [14]:
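def evaluate(y_pred, y_true):
    # Accuracy: fraction of samples whose highest-scoring predicted class
    # matches the one-hot encoded true class.
    correct = 0
    size = len(y_pred)
    for i in range(size):
        if y_pred[i].argmax() == y_true[i].argmax():
            correct += 1
    print("Accuracy: "+str(correct/size))
    return correct/size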
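
The accuracy computed above compares the argmax of each predicted probability row with the argmax of the matching one-hot label row. A vectorized sketch of the same computation, on four made-up predictions, is shown here for illustration:

```python
import numpy as np

# Made-up softmax outputs for 4 samples and 3 classes.
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.5, 0.4, 0.1],
                   [0.2, 0.7, 0.1]])
# One-hot encoded true classes: 0, 2, 1, 1.
y_true = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 0],
                   [0, 1, 0]])

# Predicted classes are 0, 2, 0, 1, so three of the four samples are correct.
accuracy = float(np.mean(y_pred.argmax(axis=1) == y_true.argmax(axis=1)))
print(accuracy)  # 0.75
```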

In [15]:
def model(outSize):
    # 152 is the length of each character-count vector, i.e. len(charMap)
    inputs = Input(shape=(152,))
    ff = Dense(128, activation="relu")(inputs)
    ff = Dense(128, activation="relu")(ff)
    ff = Dense(128, activation="relu")(ff)

    # One softmax unit per class
    out = Dense(outSize, activation="softmax")(ff)

    net = Model(inputs=inputs, outputs=out)
    net.compile(optimizer=Adam(), loss="categorical_crossentropy", metrics=["acc"])
    return net

In [16]:
m1 = model(8)
m1.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 152)]             0         
_________________________________________________________________
dense (Dense)                (None, 128)               19584     
_________________________________________________________________
dense_1 (Dense)              (None, 128)               16512     
_________________________________________________________________
dense_2 (Dense)              (None, 128)               16512     
_________________________________________________________________
dense_3 (Dense)              (None, 8)                 1032      
Total params: 53,640
Trainable params: 53,640
Non-trainable params: 0
_________________________________________________________________
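
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. A short sketch of that arithmetic, with a hypothetical helper `dense_params`:

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units

# Input width followed by the width of each Dense layer in the action model.
layer_sizes = [152, 128, 128, 128, 8]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [19584, 16512, 16512, 1032]
print(sum(per_layer))  # 53640
```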


In [17]:
m1.fit(action_train, action_train_label, epochs=10, batch_size=128, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f55f3cb4a50>

In [18]:
y_pred1 = m1.predict(action_test)
evaluate(y_pred1, action_test_label)

Accuracy: 0.7895786586886063


0.7895786586886063

In [27]:
m1.evaluate(action_test, action_test_label)



[0.663093626499176, 0.7895786762237549]

## #TODO 4: Build and evaluate a model for "object" classification



In [19]:
m2 = model(26)
m2.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 152)]             0         
_________________________________________________________________
dense_4 (Dense)              (None, 128)               19584     
_________________________________________________________________
dense_5 (Dense)              (None, 128)               16512     
_________________________________________________________________
dense_6 (Dense)              (None, 128)               16512     
_________________________________________________________________
dense_7 (Dense)              (None, 26)                3354      
Total params: 55,962
Trainable params: 55,962
Non-trainable params: 0
_________________________________________________________________


In [20]:
m2.fit(object_train, object_train_label, epochs=10, batch_size=128, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f55f3c84350>

In [21]:
y_pred2 = m2.predict(object_test)
evaluate(y_pred2, object_test_label)

Accuracy: 0.5893792071802543


0.5893792071802543

In [28]:
m2.evaluate(object_test, object_test_label)



[1.4165138006210327, 0.5893791913986206]

## #TODO 5: Build and evaluate a multi-task model that does both "action" and "object" classification in one go

This can be a bit tricky if you are not familiar with the Keras functional API. PLEASE READ these webpages (https://www.tensorflow.org/guide/keras/functional, https://keras.io/getting-started/functional-api-guide/) before you start this task.

Your model will have two separate output layers: one for the action classification task and another for the object classification task.

This is a rough sketch of what your model might look like:
![image](https://raw.githubusercontent.com/ekapolc/nlp_course/master/HW5/multitask_sketch.png)

In [22]:
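def getMultitaskModel():
    # Shared trunk: both tasks read the same 152-dimensional character-count vector
    inputs = Input(shape=(152,))
    ff = Dense(128, activation="relu")(inputs)
    ff = Dense(128, activation="relu")(ff)
    ff = Dense(128, activation="relu")(ff)

    # Two task-specific heads on top of the shared layers
    out1 = Dense(8, activation="softmax")(ff)    # action: 8 classes
    out2 = Dense(26, activation="softmax")(ff)   # object: 26 classes

    net = Model(inputs=inputs, outputs=[out1, out2])
    # The single loss string is applied to each output; Keras sums the two losses
    net.compile(optimizer=Adam(), loss="categorical_crossentropy", metrics=["accuracy"])
    return net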

In [23]:
mcomb = getMultitaskModel()
mcomb.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, 152)]        0                                            
__________________________________________________________________________________________________
dense_8 (Dense)                 (None, 128)          19584       input_3[0][0]                    
__________________________________________________________________________________________________
dense_9 (Dense)                 (None, 128)          16512       dense_8[0][0]                    
__________________________________________________________________________________________________
dense_10 (Dense)                (None, 128)          16512       dense_9[0][0]                    
____________________________________________________________________________________________

In [24]:
mcomb.fit(x=action_train, y=[action_train_label, object_train_label], epochs=10, batch_size=128, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f55efc9c050>

In [25]:
y_pred_comb = mcomb.predict(action_test)
evaluate(y_pred_comb[0], action_test_label)

Accuracy: 0.7968087758663674


0.7968087758663674

In [26]:
y_pred_comb = mcomb.predict(object_test)
evaluate(y_pred_comb[1], object_test_label)

Accuracy: 0.5846422338568935


0.5846422338568935

In [32]:
m1.evaluate(action_test, action_test_label)
m2.evaluate(object_test, object_test_label)
mcomb.evaluate(action_test, [action_test_label, object_test_label])



[2.0578253269195557,
 0.6421725153923035,
 1.4156533479690552,
 0.7968087792396545,
 0.584642231464386]
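
The five numbers returned by `mcomb.evaluate` are easier to read once they are labeled. The sketch below assumes the usual Keras ordering for a two-output model compiled with one loss and one metric: total loss, then the per-output losses, then the per-output metrics. The names in `names` are hypothetical descriptions, not the names Keras itself assigns.

```python
# Values copied from the evaluate() output above.
values = [2.0578253269195557, 0.6421725153923035, 1.4156533479690552,
          0.7968087792396545, 0.584642231464386]
# Hypothetical descriptive names (Keras derives the real ones from the layer names).
names = ["total_loss", "action_loss", "object_loss",
         "action_accuracy", "object_accuracy"]
report = dict(zip(names, values))

# With no loss weights set, the total should be the sum of the two per-output losses.
loss_gap = abs(report["total_loss"] - (report["action_loss"] + report["object_loss"]))

for name, value in report.items():
    print(f"{name}: {value:.4f}")
print("total minus sum of parts:", loss_gap)
```

The two accuracies agree with the values printed by `evaluate` in cells 25 and 26, and the total loss matches the sum of the two task losses up to floating-point rounding, which is consistent with the assumed ordering.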