# Homework of Ch5. Text Classification
----
This is the homework of TU-ETP-AD1062 Machine Learning Fundamentals.

For more information, please refer to:
https://sites.google.com/view/tu-ad1062-mlfundamentals/

> You do NOT have to build everything from scratch. Please try your best on the following parts:
> - **Your task: HW5.2.2.**
> - **Your task: HW5.2.3.**
> - **Your task: HW5.3.1.**
> - **Your task: HW5.3.2.**
> - **Your task: HW5.3.3.**
> - **Your task: HW5.4.**
> 
> Please refer to `demo_05.ipynb` if you are unsure about the API usage.

## HW5.1. Import Packages
----
- Model construction:
    - `keras`:
        - `preprocessing.*`: text cleanup, text pre-processing, and tokenization of text sequences before they are fed into models
        - `models.*` and `layers.*` (including `layers.embeddings` and `layers.recurrent`): components for constructing a recurrent neural network (LSTM, GRU, or the simplest version, SimpleRNN)
        - `utils.to_categorical`: converts numerical labels into categorical (one-hot) labels
- Performance evaluation:
    - `sklearn.metrics.zero_one_loss`: computes the error rate (the fraction of misclassified samples)
    - `sklearn.model_selection.train_test_split`: splits your data once into a training set and a validation set, so that you can feed them into the classifier yourself and observe the score and confusion matrix
    - `mlfund.plot.PlotMetric`: plots the confusion matrix (provided by this repository)

In [None]:
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU, SimpleRNN
from keras.layers import Dense, Dropout, Activation
from keras.utils import to_categorical

from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import text_to_word_sequence, Tokenizer

import codecs
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import zero_one_loss

import pandas as pd
import os

from matplotlib import pyplot as plt
from mlfund.plot import PlotMetric
%matplotlib inline

## HW5.2. Data pre-processing


### HW5.2.1. Read Dataset from CSV files
----
The following code snippet reads the training and testing sets from CSV files.

In [None]:
# Training set
df_train = pd.read_csv(os.path.join('data', 'hw5', 'hw5.train.csv'))

id_train = df_train['id']
X_train = df_train['review']
y_train = list(df_train['label'])


# Testing set
df_test = pd.read_csv(os.path.join('data', 'hw5', 'hw5.test.csv'))

id_test = df_test['id']
X_test = df_test['review']

display(df_train)

### HW5.2.2. Tokenize Sentences to Word Lists
---
> **Your task: HW5.2.2.**  
> Please follow **Demo 5.4.1.** in `demo_05.ipynb` to tokenize the sentences.  
> For example, for the following sentence:
> - `I'm great admirer Lon Chaney, screen writing movie work me.`
> 
> The expected result is:  
> - `["i'm", 'great', 'admirer', 'lon', 'chaney', 'screen', 'writing', 'movie', 'work', 'me']`
>
> You are expected to tokenize the sentences stored in the following 2 columns:  
> - `df_train['review']`
> - `df_test['review']`
> ---
> For convenience when using the later code snippets:
> - Use `X_train_str` and `X_test_str` to store the tokenized results.

In [None]:
# X_train_str = (...Uncomment this line and finish your task! ... )
# X_test_str = (...Uncomment this line and finish your task! ... )


# The following snippet helps you to show the first 10 words for each tokenized results of the first 20 sentences in training set
for i in range(0, 20):
    print('len = %d, [\'%s\' ...]' % (len(X_train_str[i]), '\', \''.join(X_train_str[i][0:10])))
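
The tokenization behaviour expected in HW5.2.2 can be illustrated without Keras. The following is a minimal sketch, not the library implementation: `simple_word_sequence` and `FILTERS` are hypothetical names, and the filter string is assumed to match the default punctuation filter of `text_to_word_sequence`. It lowercases the text, replaces filtered punctuation with spaces, and splits on whitespace.

```python
# Stand-in for keras.preprocessing.text.text_to_word_sequence (an approximation,
# not the library code). The apostrophe is not filtered, so "i'm" stays one token.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS):
    # Map every filtered character to a space, then split on whitespace.
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()

reviews = ["I'm great admirer Lon Chaney, screen writing movie work me."]
tokenized = [simple_word_sequence(review) for review in reviews]
print(tokenized[0])
```

In the homework itself, the same list comprehension pattern applies to `df_train['review']` and `df_test['review']`, with the Keras function in place of the stand-in.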

### HW5.2.3. One-hot Encoding for the Top-K words
----
> **Your task: HW5.2.3.**  
> Please refer to **Demo 5.4.2** in `demo_05.ipynb` and finish the following tasks:
> 1. Find the most frequently used words by fitting a `keras.preprocessing.text.Tokenizer` instance.  
> **Fit the tokenizer on `X_train_str` only, since you are not supposed to have any knowledge of your testing data.**
> 2. Convert each sentence to a sequence of word indices (called one-hot encoding in this notebook) with the tokenizer fitted in the previous step, and
> 3. Pad or truncate each index sequence to a fixed length with `keras.preprocessing.sequence.pad_sequences`.
>
>
> You are also expected to adjust `MAX_NUM_DICT_WORDS`, `MAX_SEQUENCE_LENGTH`, and `WORD_EMBED_DIMENSION`. Additional information about these 3 parameters is listed below:
> 1. Only the top `MAX_NUM_DICT_WORDS` most frequent words are kept and converted to word vectors; all other words are ignored.
> 2. Only `MAX_SEQUENCE_LENGTH` words are kept for each review; the rest are truncated. Note that `pad_sequences` defaults to `truncating='pre'`, which keeps the last words of a review; pass `truncating='post'` to keep the first words instead.
> 3. The word embedding dimension is controlled by `WORD_EMBED_DIMENSION`:
>    - If you use `glove.6B.50d.txt` to convert words to vectors, DO NOT modify this value.
>    - If you use a self-trained word embedding (i.e., `trainable=True` in the Keras embedding layer), try adjusting this value.
> ----
> For convenience when using the later code snippets:
> - Use `tokenizer` as the name of the `keras.preprocessing.text.Tokenizer` instance,
> - Use `X_train` as the name of the fixed-length, one-hot encoded word index list of the training set, and
> - Use `X_test` as the name of the fixed-length, one-hot encoded word index list of the testing set.

In [None]:
MAX_NUM_DICT_WORDS     = 5000
MAX_SEQUENCE_LENGTH    = 200
WORD_EMBED_DIMENSION   = 50

# tokenizer = (...Uncomment this line and finish your task! ... )

# X_train = (...Uncomment this line and finish your task! ... )
# X_test  = (...Uncomment this line and finish your task! ... )
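
The three steps of HW5.2.3 (rank words by frequency, map tokens to indices, pad to a fixed length) can be sketched in plain Python. This is only an illustration of the idea under stated assumptions: `build_word_index`, `to_index_sequence`, and `pad_or_truncate` are hypothetical helpers, not Keras APIs, and the exact cutoff that Keras applies for `num_words` should be checked against its documentation.

```python
from collections import Counter

def build_word_index(token_lists):
    """Rank words by frequency: index 1 is the most frequent, 0 is left for padding."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_index_sequence(tokens, word_index, top_k):
    """Keep only words ranked within the top_k; unknown or rare words are dropped."""
    return [word_index[w] for w in tokens if w in word_index and word_index[w] <= top_k]

def pad_or_truncate(seq, maxlen):
    """Left-pad with 0 and keep the last maxlen items ('pre' padding and truncating)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

# Toy data (made up for illustration).
train_tokens = [['good', 'movie', 'good', 'plot'], ['bad', 'movie', 'good']]
test_tokens = [['good', 'unknown', 'bad', 'movie', 'plot']]

word_index = build_word_index(train_tokens)  # fitted on the training data only
train_ids = [pad_or_truncate(to_index_sequence(t, word_index, 2), 3) for t in train_tokens]
test_ids = [pad_or_truncate(to_index_sequence(t, word_index, 2), 3) for t in test_tokens]
print(train_ids, test_ids)
```

Because the index is built from the training tokens alone, words that appear only in the testing set are simply dropped, which is the behaviour the task asks for.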

**(Optional)** The following snippet helps you to examine the result of Top-K words tokenizer:

In [None]:
for key in sorted(tokenizer.index_word)[:50]:
    print('%d\t => %s' % (key, tokenizer.index_word[key]))

**(Optional)** The following snippet helps you to examine the one-hot encoded result:

In [None]:
for x_train in X_train:
    assert isinstance(x_train, np.ndarray) and len(x_train) == MAX_SEQUENCE_LENGTH, "X_train should be an instance of numpy.ndarray, with shape[1] == MAX_SEQUENCE_LENGTH"

for i in range(0, 20):
    print('%s' % (X_train[i]))

### HW5.2.4. Convert Each Word by External Word2Vec Data
---
The code snippet here shows how to convert the words by external **GloVe** representations.

- You need to manually download `glove.6B.50d.zip` (60.5MB) from the Data page of the Kaggle InClass competition, or from:  
https://powerbox-file.trendmicro-cloud.com/SFDC/external_shared/b98384991fe44745da75780232ec0961.php
- After downloading, decompress it and place `glove.6B.50d.txt` under `data/hw5`.
- The embedding matrix, which maps each one-hot encoding index to its word vector, is then built from the top-K tokenizer constructed above.

In [None]:
# Convert tokenized words to vector
word2vec_path = os.path.join('data', 'hw5', 'glove.6B.50d.txt')

if not os.path.exists(word2vec_path):
    raise FileNotFoundError('Please follow the instructions mentioned above to get glove.6B.50d.txt')

embed_mat = np.zeros( (MAX_NUM_DICT_WORDS + 1, WORD_EMBED_DIMENSION ) )
with codecs.open(word2vec_path, 'r', 'utf-8') as f:
    for line in f:
        tokens = line.rstrip(' \r\n').split(' ')
        
        word_key    = tokens[0]
        word_vector = [float(i) for i in tokens[1:]]
        
        if word_key in tokenizer.word_index and tokenizer.word_index[word_key] < MAX_NUM_DICT_WORDS + 1:
            embed_mat[ tokenizer.word_index[word_key], : ] = word_vector
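
How the loop above fills the embedding matrix can be seen on a tiny in-memory example. This is a sketch under assumptions: `glove_lines` is made-up data in the GloVe text layout (`<word> <v1> <v2> ...`), and `word_index` is a hypothetical stand-in for `tokenizer.word_index`.

```python
# Made-up 3-dimensional vectors in the GloVe text layout.
glove_lines = [
    "movie 0.1 0.2 0.3",
    "good 0.4 0.5 0.6",
    "rare 0.7 0.8 0.9",
]
word_index = {'good': 1, 'movie': 2, 'rare': 3, 'oov': 4}  # hypothetical tokenizer.word_index
top_k, dim = 2, 3

# Row 0 is reserved for padding; rows 1..top_k hold the top-k words.
embed = [[0.0] * dim for _ in range(top_k + 1)]
for line in glove_lines:
    word, *values = line.split(' ')
    idx = word_index.get(word)
    if idx is not None and idx <= top_k:
        embed[idx] = [float(v) for v in values]

print(embed)
```

Words that are ranked outside the top K (`rare`) or that have no pre-trained vector (`oov`) leave no trace in the matrix; a top-K word without a GloVe entry would keep its all-zero row.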

**(Optional)** The following snippet helps you to examine the word2vec result:

In [None]:
dict_word2vec_result = []
for i in range(2,22):
    dict_word2vec_result.append({
        'word': tokenizer.index_word[i],
        'vector': '%s' % embed_mat[i]
    })
    
display(pd.DataFrame(dict_word2vec_result))

## HW5.3. Construct and Train Text Classification Model

### HW5.3.1. Adjust the Model
----
The code snippet shown below constructs an LSTM model with the following structure:
1. Embedding layer:
    * `input_dim`: Number of rows in the embedding matrix, which should be `MAX_NUM_DICT_WORDS + 1`
    * `output_dim`: Word embedding dimension, which should be `WORD_EMBED_DIMENSION` set above
    * `input_length`: Maximum sequence length, which should be `MAX_SEQUENCE_LENGTH` set above
2. LSTM layer:
    * Unit size 10
3. Fully-connected layer with ReLU activation:
    * Unit size 32
4. Dropout layer:
    * Dropout rate 0.25
5. Fully-connected output layer with sigmoid activation:
    * Unit size 2

> **Your task: HW5.3.1.**  
> Build and adjust your own model with the Keras framework, and try to maximize the performance by adjusting the model structure.
> The documents listed below might be useful:
> - Embedding Layer: https://keras.io/layers/embeddings/#embedding
> - LSTM Layer: https://keras.io/layers/recurrent/#lstm
> - GRU Layer: https://keras.io/layers/recurrent/#gru
> - Dense Layer (Fully-connected layer): https://keras.io/layers/core/#dense
>
> **Notice:** You can use either:
> - RNN/LSTM/GRU model mentioned in Chapter 5. (Same as the code snippet), or
> - CNN model mentioned in Chapter 4. (Please refer to **Demo 4.3.1.** in `demo04.ipynb`)

In [None]:
def create_rnn():
    model = Sequential()
    
    model.add(Embedding(MAX_NUM_DICT_WORDS + 1, WORD_EMBED_DIMENSION, weights=[embed_mat], input_length=MAX_SEQUENCE_LENGTH, trainable=False))
    model.add(LSTM(10))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(2, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
    
    return model
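
When adjusting layer sizes, it helps to estimate how many parameters each layer contributes before training. The arithmetic below is a sketch for the default model above, assuming the standard Keras `LSTM` layout (four gates, each with an input kernel, a recurrent kernel, and one bias vector) and the default values `MAX_NUM_DICT_WORDS = 5000` and `WORD_EMBED_DIMENSION = 50`; compare the numbers with the output of `model.summary()`.

```python
vocab_rows, embed_dim = 5000 + 1, 50   # MAX_NUM_DICT_WORDS + 1, WORD_EMBED_DIMENSION
lstm_units, dense_units, n_classes = 10, 32, 2

# Embedding weights are frozen here because trainable=False.
embedding_params = vocab_rows * embed_dim
# 4 gates, each with input kernel + recurrent kernel + bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * n_classes + n_classes

trainable_params = lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params, trainable_params)
```

With pre-trained, frozen embeddings, almost all of the model's weights sit in the embedding matrix, while the part that is actually trained is small.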

### HW5.3.2. Train the Model
----
The code snippet splits the labeled training data `X_train`, `y_train` into `X1`, `y1` (for training) and `X2`, `y2` (for validation).

> **Your task: HW5.3.2.**  
> Adjust your `fit` process for training, including `batch_size` and `epochs`, to suit your hardware.
> 
> For more details, see: https://keras.io/models/model/#fit

In [None]:
model = create_rnn()
model.summary()

X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

batch_size = 64
epochs = 10

y1_categorical = to_categorical(y1)
model.fit(X1, y1_categorical, epochs=epochs, batch_size=batch_size)
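
The label conversion used before `fit` turns each integer label into a one-hot row, one column per class. A minimal pure-Python sketch of that conversion follows; `one_hot` is a hypothetical stand-in for `to_categorical`, and the labels are made up.

```python
def one_hot(labels, num_classes=None):
    """Stand-in for keras.utils.to_categorical on a list of integer labels."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1.0 if c == y else 0.0 for c in range(num_classes)] for y in labels]

y1_demo = [0, 1, 1, 0]                 # made-up binary labels
y1_demo_categorical = one_hot(y1_demo)
print(y1_demo_categorical)
```

Each row has exactly one `1.0`, in the column of its class, which matches the two output units of the model.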

### HW5.3.3. Performance Evaluation
----
> **Your task: HW5.3.3.**  
> Check the zero-one loss and the confusion matrix, then adjust your model to improve the performance.
>
> **Notice:**
> - In general, one should conduct the cross-validation mentioned in Chapter 1.
> - To save time, you are allowed to use a single train/validation split to reduce the time spent on performance evaluation.

In [None]:
y2_categorical_predict = model.predict(X2)
y2_predict = np.argmax(y2_categorical_predict, axis=1)

# Error rate
err_01loss = zero_one_loss(y2, y2_predict)
print('Error rate = %2.3f' % err_01loss)

# Confusion matrix of prediction
plot_conf_mat = PlotMetric(figsize=(4, 4))
plot_conf_mat.set_labels(['Negative', 'Positive'])
plot_conf_mat.confusion_matrix(y2, y2_predict, True)
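
The evaluation above reduces to three small steps: take the arg-max of each output row, count mismatches, and tally (true, predicted) pairs. The sketch below reproduces them in plain Python on made-up scores; `scores`, `y_true`, and `conf` are illustrative names, not part of the homework code.

```python
scores = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.1, 0.7]]  # made-up model outputs
y_true = [0, 1, 1, 1]

# Predicted class = index of the largest score in each row (arg-max).
y_pred = [max(range(len(row)), key=row.__getitem__) for row in scores]

# Zero-one loss = fraction of misclassified samples.
error_rate = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# conf[t][p] counts samples with true label t that were predicted as p.
conf = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    conf[t][p] += 1

print(y_pred, error_rate, conf)
```

The off-diagonal cells of `conf` show which kind of mistake dominates, which the single error rate cannot tell you.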

## HW5.4. Train with the Full Training Data and Submit to Kaggle
----
Now that you have fine-tuned your `create_rnn`, train your model on the full `X_train` and `y_train` dataset, predict the labels of `X_test`, then submit to Kaggle.

> **Your task: HW5.4.**
> 1. Train the model created by `create_rnn` with the full data set `X_train`,
> 2. Predict the **unknown** testing data `X_test` with the trained model, then
> 3. Submit your result to Kaggle.
>
> **Notice: You have 5 chances to submit your result every day.**

In [None]:
# Create model and train
y_train_categorical = to_categorical(y_train)

model = create_rnn()
model.fit(X_train, y_train_categorical, batch_size=batch_size, epochs=epochs)

# Predict the testing data
y_test_categorical_predict = model.predict(X_test)
y_test_predict = np.argmax(y_test_categorical_predict, axis=1)

## Before you submit
----
Please join the homework 5 competition by **using your email address ending with \@trendmicro.com as your Kaggle InClass team name**.

Type your email address into the variable `my_trendmicro_email_which_is_also_my_team_name` to confirm that you have read this paragraph; the following code snippet will then generate the CSV file for submission.

In [None]:
my_trendmicro_email_which_is_also_my_team_name = ''

import re
assert re.match(r"[^@]+@trendmicro\.com$", my_trendmicro_email_which_is_also_my_team_name), "Please read the paragraph above carefully"

target_path = 'data/hw05.result.csv'
df_test_label = pd.DataFrame({'id': id_test, 'label': y_test_predict})
df_test_label.to_csv(target_path, index=False)

print('Congratulations! Please submit your result \'%s\' to https://www.kaggle.com/t/0d72d5a876864bc988259933bc35f3f2' % target_path)