# A More Readable Demo and Detailed Process of Text Classification


## 1. Load Dataset

In [None]:
import pandas as pd
# A two-column CSV with the headers "label" and "text"
DATA_PATH = "INSERT YOUR DATA PATH HERE"

df = pd.read_csv(DATA_PATH)
df

## 2. Preprocessing Text

Three preprocessing steps are commonly used:
1. Tokenization
2. Remove Punctuation
3. Casefolding

There are also other preprocessing steps, such as:
1. Lemmatization
2. Stemming
3. Stopword removal 

However, lemmatization, stemming, and stopword removal strip information from the text that you may want to keep for most tasks, so these steps are usually skipped when using deep learning models.

__Note__: if you're using a pretrained network, make sure your preprocessing matches the preprocessing used during pretraining.

### 2.1. Tokenization

Tokenization splits a text into a list of words (tokens). For longer texts, such as news articles, you may want to split the text into paragraphs first, then into sentences, and finally into tokens. The simplest tokenization method is to split the text on whitespace. This demo uses the __tokenize__ module of __nltk__, specifically __casual_tokenize__, a tokenization function designed for social media text.

In [None]:
# Tokenization
from nltk.tokenize import word_tokenize, casual_tokenize
# casual_tokenize is a special tokenization function for social media text
# word_tokenize uses PunktTokenizer for formal text
df["tokenized_text"] = df["text"].apply(lambda t: casual_tokenize(t.lower()))
df["tokenized_text"]
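
For comparison with the tokenizer used above, here is a minimal standard-library sketch of the two simplest approaches mentioned in this section: splitting on whitespace and splitting with a regular expression. The sample sentence and the `TOKEN_PATTERN` regex are assumptions made for illustration only; the regex keeps hashtags and mentions intact but makes no attempt at emoticon handling.

```python
import re

text = "Great phone!!! Battery lasts 2 days :) #happy @shop"

# Simplest approach: split on whitespace. Punctuation stays glued to words.
whitespace_tokens = text.lower().split()

# Illustrative regex (an assumption, not the nltk implementation):
# a word with an optional leading # or @, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"[#@]?\w+|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(text.lower())

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `phone!!!` remains a single token, while the regex separates the word from each punctuation mark. This is one reason a dedicated tokenizer is usually preferred for noisy text.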

### 2.2. Remove Punctuation (Recommended but Optional)

With the exception of some specific tasks, punctuation generally carries much less information than the words of the text, so it is usually removed. The __punctuation__ constant in the __string__ module is a string of ASCII punctuation characters: ```!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~```. You can extend this string with other punctuation characters.

In [None]:
import string


def remove_punctuation(tokenized_text):
    # Translation table that deletes every punctuation character
    punctuation_table = str.maketrans("", "", string.punctuation)
    cleaned_text = []
    for token in tokenized_text:
        # Remove punctuation
        cleaned_token = token.translate(punctuation_table)
        # Filter out empty and whitespace-only tokens
        if cleaned_token.strip():
            cleaned_text.append(cleaned_token)
    return cleaned_text


df["clean_corpus"] = [remove_punctuation(
    data) for data in df["tokenized_text"]]
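
The punctuation string can be extended with extra characters, since `string.punctuation` only covers ASCII. The sketch below shows one way this might look; the characters in `EXTRA_PUNCTUATION` and the sample tokens are assumptions chosen for illustration.

```python
import string

# Assumption: example symbols that the ASCII-only string.punctuation lacks.
EXTRA_PUNCTUATION = "“”‘’…"
punctuation_table = str.maketrans(
    "", "", string.punctuation + EXTRA_PUNCTUATION
)

tokens = ["“hello”", "world…", "!!!", "e-mail", "ok"]
# Strip punctuation from every token, then drop tokens that became empty.
cleaned = [token.translate(punctuation_table) for token in tokens]
cleaned = [token for token in cleaned if token.strip()]
print(cleaned)  # ['hello', 'world', 'email', 'ok']
```

Note that `e-mail` becomes `email`: removing punctuation inside a token merges its parts, which may or may not be what you want.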

### 2.3 Casefolding (Recommended but Optional)

As with punctuation, the difference between uppercase and lowercase is not important for _most_ text processing. Each case variant of a word is processed independently and takes up an extra vocabulary slot. Therefore, again with some exceptions, you may want to perform casefolding by lowercasing all of the text. (You can also do this __before__ tokenization; the tokenization cell above already lowercases the text.)

In [None]:
df["clean_corpus"] = [[token.lower() for token in data]
                      for data in df["clean_corpus"]]
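
To make the vocabulary argument concrete, this small sketch counts unique tokens before and after lowercasing. The token list is made up for illustration.

```python
tokens = ["The", "the", "THE", "Apple", "apple", "US", "us"]

# Every case variant occupies its own vocabulary slot...
vocab_cased = set(tokens)
# ...while lowercasing collapses the variants into one entry each.
vocab_lowered = {token.lower() for token in tokens}

print(len(vocab_cased), len(vocab_lowered))  # 7 3
```

The pair `US`/`us` is an example of the exceptions mentioned above: after casefolding, the country abbreviation and the pronoun are no longer distinguishable.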

### 2.4 Using Keras preprocessing module

Keras provides a __text_to_word_sequence__ function that, by default, tokenizes on whitespace, removes punctuation, and lowercases the text.

In [None]:
from keras.preprocessing.text import text_to_word_sequence

df["alt_clean_corpus"] = [text_to_word_sequence(text) for text in df["text"]]
df["alt_clean_corpus"]

### 2.5 Remove stopwords (Completely Optional)

Stopwords are words that add little meaning to the text, such as "this", "that", "is", "at", and "on". This step is usually skipped because it may remove information that the deep learning model would otherwise learn from. However, stopword removal can help in some cases, such as low-resource settings (small datasets), where you want the model to focus on the most important parts of the text.

nltk provides stopword lists for several languages. You could also treat the most common 0.x% of words in a large corpus (e.g. Wikipedia) as stopwords, or use your own stopword list.

In [None]:
from nltk.corpus import stopwords
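# Load the stopword list (run nltk.download("stopwords") once if it is missing)
# A set makes the membership test below much faster than a list
stopword_set = set(stopwords.words("indonesian"))

df["clean_corpus"] = [[token for token in data if token not in stopword_set]
                      for data in df["clean_corpus"]]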
df["clean_corpus"]
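
The frequency-based alternative mentioned above could be sketched as follows. The helper `frequency_stopwords`, its `top_fraction` parameter, and the toy corpus are hypothetical names and data introduced for this example; on a real corpus the fraction would be far smaller than the 25% used here.

```python
from collections import Counter


def frequency_stopwords(corpus, top_fraction):
    """Return the most frequent `top_fraction` of unique tokens as a set."""
    counts = Counter(token for doc in corpus for token in doc)
    n_stopwords = max(1, int(len(counts) * top_fraction))
    return {word for word, _ in counts.most_common(n_stopwords)}


toy_corpus = [
    ["the", "food", "is", "good"],
    ["the", "service", "is", "slow"],
    ["the", "place", "is", "the", "best"],
]
# 8 unique tokens * 0.25 -> the 2 most frequent tokens become stopwords
stopword_set = frequency_stopwords(toy_corpus, top_fraction=0.25)
filtered = [[t for t in doc if t not in stopword_set] for doc in toy_corpus]
print(stopword_set)
print(filtered)
```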

### 2.6. Data Split

Once all of the data is cleaned, the last step is to split the dataset. The data can be split three ways (train, validation (dev), and test) or two ways (train and test). If the dataset is too small, you can use k-fold cross-validation instead. This demo uses a three-way split: 20% for testing, 8% (10% of the remaining 80%) for validation, and the remaining 72% for training.

In [None]:
from sklearn.model_selection import train_test_split

X = df["clean_corpus"].values.tolist()
y = df["label"].values.tolist()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=4371
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, train_size=0.9, test_size=0.1, random_state=4371
)
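
The arithmetic behind the split, and the k-fold alternative mentioned above, can be sketched in plain Python. The helper `k_fold_indices` is a hypothetical, unshuffled illustration; in practice `sklearn.model_selection.KFold` provides the same idea with shuffling support.

```python
# Share of the full dataset that ends up in each split
test_share = 0.2
val_share = (1 - test_share) * 0.1       # 10% of the remaining 80%
train_share = 1 - test_share - val_share
print(round(train_share, 2), round(val_share, 2), test_share)  # 0.72 0.08 0.2


def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) per fold, without shuffling."""
    indices = list(range(n_samples))
    start = 0
    for fold in range(n_folds):
        # Spread the remainder over the first folds
        size = n_samples // n_folds + (1 if fold < n_samples % n_folds else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=3))
print([len(test_idx) for _, test_idx in folds])  # [4, 3, 3]
```

Each sample appears in exactly one test fold, so every data point is used for evaluation once, which is what makes k-fold useful for small datasets.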

## 3. Training model

All of the classifiers implemented in this repo hide some preprocessing steps that you would usually have to do yourself when using Keras: input and label vectorization, and embedding initialization. Before getting to the classifiers, this demo explains these steps.

### 3.-2 Label Vectorization

Label vectorization is usually done by converting the numeric label into a one-hot vector. An easy way to do this is the __to_categorical__ function in Keras. But first, build two dictionaries: an index that maps each label to a number, and its inverse.

In [None]:
from keras.utils import to_categorical


def vectorized_label(label_data, label_index):
    return to_categorical([label_index[label] for label in label_data])


# Get all unique labels
label = sorted(set(y_train))
# Make label index (to convert label to numeric)
label_index = {ch: idx for idx, ch in enumerate(label)}
# Make inverse label index (to convert back prediction result to string label)
inverse_label_index = {idx: ch for ch, idx in label_index.items()}

vectorized_y_train = vectorized_label(y_train, label_index)
vectorized_y_val = vectorized_label(y_val, label_index)
vectorized_y_test = vectorized_label(y_test, label_index)
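
To show what the one-hot conversion produces, and how the inverse index turns a prediction back into a label, here is a dependency-free sketch. The `one_hot` helper and the toy labels are assumptions for illustration; it mimics the idea behind `to_categorical`, not its implementation.

```python
def one_hot(indices, num_classes):
    """Return one row per index, with 1.0 at that index and 0.0 elsewhere."""
    rows = []
    for index in indices:
        row = [0.0] * num_classes
        row[index] = 1.0
        rows.append(row)
    return rows


toy_labels = ["neg", "pos", "neu", "pos"]
toy_index = {name: idx for idx, name in enumerate(sorted(set(toy_labels)))}
toy_inverse = {idx: name for name, idx in toy_index.items()}

encoded = one_hot([toy_index[name] for name in toy_labels], len(toy_index))
print(encoded[0])  # [1.0, 0.0, 0.0] -> "neg" has index 0

# Decoding: take the position of the largest value (argmax) per row
decoded = [toy_inverse[row.index(max(row))] for row in encoded]
print(decoded)  # ['neg', 'pos', 'neu', 'pos']
```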

### 3.-1 Input Vectorization

Input vectorization converts the input from a sequence of tokens into a sequence of numbers. The process is as follows:
1. Decide how many words (x) from your training dataset you want to represent.
2. Build a frequency table of the tokens.
3. Take the x most common words from the frequency table.
4. Build an inverse index of the chosen vocabulary. Start from 1, because 0 is reserved for padding/unknown words (in the code below, this reserved slot counts toward the vocabulary size).
5. Convert the input using the inverse index, using 0 for words that are not in the index.

In [None]:
from collections import defaultdict  # For frequency table
import numpy as np


def make_inverse_index(data_train, vocab_size):
    # Build the frequency table
    frequency_table = defaultdict(int)
    for data in data_train:
        for token in data:
            frequency_table[token] += 1

    # Sort by frequency, most common first
    sorted_vocab = sorted(
        frequency_table.items(), key=lambda x: x[1], reverse=True
    )
    # Chosen vocabulary; index 0 is reserved for padding/unknown words
    vocab = ["UNK"]
    for word, _count in sorted_vocab:
        if len(vocab) >= vocab_size:
            break
        vocab.append(word)
    # Inverse index (word -> integer id)
    return {word: idx for idx, word in enumerate(vocab)}


def input_vectorization(corpus, inverse_index, max_length):
    # Start from a zero matrix of shape [#data x max_length]: shorter
    # sequences stay zero-padded and longer ones are truncated to max_length
    vectorized_input = np.zeros(
        shape=(len(corpus), max_length), dtype=np.int32)
    for idx, data in enumerate(corpus):
        for jdx, token in enumerate(data[:max_length]):
            vectorized_input[idx, jdx] = inverse_index.get(token, 0)
    return vectorized_input


# Set the vocabulary size and the maximum sequence length
VOCAB_SIZE = 5000
MAXIMUM_LENGTH = 50
inverse_index = make_inverse_index(X_train, VOCAB_SIZE)

vectorized_x_train = input_vectorization(
    X_train, inverse_index, MAXIMUM_LENGTH)
vectorized_x_val = input_vectorization(X_val, inverse_index, MAXIMUM_LENGTH)
vectorized_x_test = input_vectorization(X_test, inverse_index, MAXIMUM_LENGTH)
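
The following toy run walks through the same steps on three tiny documents, using plain lists so the intermediate values are easy to inspect. The documents, `vocab_size`, and `max_length` values are made up for this example.

```python
from collections import Counter

train_docs = [["good", "food"], ["good", "service"], ["bad", "food", "good"]]
vocab_size = 3   # includes the reserved index 0
max_length = 4

# Steps 2-3: frequency table, then the most common words
counts = Counter(token for doc in train_docs for token in doc)
vocab = ["UNK"] + [word for word, _ in counts.most_common(vocab_size - 1)]
# Step 4: inverse index, real words start at 1
word_to_id = {word: idx for idx, word in enumerate(vocab)}
print(word_to_id)  # {'UNK': 0, 'good': 1, 'food': 2}


def encode(doc):
    """Step 5: map tokens to ids, truncate, then pad with 0."""
    ids = [word_to_id.get(token, 0) for token in doc[:max_length]]
    return ids + [0] * (max_length - len(ids))


print(encode(["bad", "food", "good"]))              # [0, 2, 1, 0]
print(encode(["good", "food", "was", "really", "very", "good"]))  # [1, 2, 0, 0]
```

Because padding and unknown words share index 0, the model cannot tell them apart: in the first output, the leading 0 is the out-of-vocabulary word `bad`, while the trailing 0 is padding.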

### 3.0 Embedding

An embedding is a vector representation of a word. There are various ways to obtain this representation. Some notes on embeddings:

- The simplest approach is to let the Embedding layer generate a random vector for each unique word (forming a matrix of size ( _VOCAB_SIZE_ x _EMBEDDING_DIMENSION_SIZE_ )) and let training adjust the representations. However, this needs a lot of data and training epochs to produce good word representations.
- Another simple way to represent a word is a one-hot embedding (similar to label vectorization). Generate an identity matrix of size ( _VOCAB_SIZE_ x _VOCAB_SIZE_ ) and use it as the initial weight of the Embedding layer.
- Most text models use a pretrained word embedding, trained with a method such as __Word2Vec__, __fastText__, or __GloVe__ on a relatively large text corpus. The embedding is a matrix of size ( _VOCAB_SIZE_ x _WORD_EMBEDDING_SIZE_ ), where each row is a word vector taken from the pretrained embedding. A random vector is generated for any vocabulary word that does not exist in the pretrained embedding. As with the one-hot embedding, the resulting matrix is used to initialize the weights of the Embedding layer.
- The current way to set the initial weights of the Embedding layer is to wrap the matrix in the __Constant__ class from __keras.initializers__ and pass it as the __embeddings_initializer__ parameter. Some articles tell you to use the __weights__ parameter of the Embedding layer instead; this works on older versions of Keras.

In [None]:
from keras.initializers import Constant

onehot_embedding = np.eye(VOCAB_SIZE)
onehot_embedding_weight = Constant(onehot_embedding)
# Pass this as the embeddings_initializer of the Embedding layer
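
For the pretrained case, the matrix could be assembled roughly as sketched below. This is a dependency-free illustration, not the repo's implementation: `build_embedding_matrix`, the toy vectors, the uniform range for unseen words, and the choice to keep row 0 at zero are all assumptions. In practice you would convert the result to a numpy array before wrapping it in `Constant`.

```python
import random


def build_embedding_matrix(word_to_id, pretrained, dim, seed=0):
    """Row i holds the vector of the word with id i."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(word_to_id))]
    for word, idx in word_to_id.items():
        if idx == 0:
            continue  # assumption: padding/unknown row stays all zeros
        if word in pretrained:
            # Copy the pretrained vector
            matrix[idx] = list(pretrained[word])
        else:
            # Word missing from the pretrained file: small random vector
            matrix[idx] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return matrix


# Toy stand-ins for a pretrained file and a vocabulary index
toy_pretrained = {"good": [0.1, 0.2, 0.3], "food": [0.4, 0.5, 0.6]}
toy_vocab = {"UNK": 0, "good": 1, "food": 2, "enak": 3}

embedding_matrix = build_embedding_matrix(toy_vocab, toy_pretrained, dim=3)
print(len(embedding_matrix), len(embedding_matrix[0]))  # 4 3
```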

### 3.1 Using the classifier

Several classifiers are implemented in this repo: CNN, RNN, RNN+Attention, Transformer, and HAN. Each classifier has its own parameters plus a set of generic parameters that apply to all classifiers. The generic parameters are:
```
input_size      : int, maximum number of input tokens
optimizer       : string, learning optimizer, keras model compile "optimizer" parameter
loss            : string, loss function, keras model compile "loss" parameter
embedding_matrix: numpy array, custom embedding matrix of the provided vocab
vocab_size      : int, maximum size of vocabulary of the model
                  (the most frequent words of the training data will be used)
                  set to 0 to use every word in training data
vocab           : dictionary, a vocab inverse index.
embedding_file  : string, path to pretrained word embedding file
embedding_type  : string, type of embedding, available options
                  w2v for word2vec, matrix will be taken from embedding file
                  ft for FastText, matrix will be taken from embedding file
                  onehot, initialize one hot encoding of vocabulary
                  custom, use embedding matrix
                  or any valid keras.initializers string
train_embedding : boolean, determine whether the Embedding layer should be trainable
                  which is apparently not recommended when using pretrained weights
                  refer -> https://keras.io/examples/nlp/pretrained_word_embeddings/
```

In [None]:
from sklearn.metrics import classification_report


def run_classifier(classifier_class, data):
    X_train, y_train, X_test, y_test, X_val, y_val = data
    classifier_class.train(
        X_train, y_train, 10, 32, (
            X_val, y_val
        )
    )
    prediction = classifier_class.test(X_test)
    print(classification_report(y_test, prediction))


PRETRAINED_PATH = "INSERT THE PRETRAINED WORD EMBEDDING PATH HERE"
base_classifier_parameter = {
    "input_size": 50,
    "optimizer": "adam",
    "loss": "categorical_crossentropy",
    "embedding_matrix": None,
    "vocab_size": 5000,
    "vocab": None,
    "embedding_file": PRETRAINED_PATH,
    "embedding_type": "ft",
    "train_embedding": False
}
grouped_data = X_train, y_train, X_test, y_test, X_val, y_val

#### 3.1 CNN
CNN-specific parameters
```
conv_layers: list of tuple, list of parameters for convolution layers,
             each tuple for one convolution layer that consists of: [
                (int) Number of filters,
                (int) filter size,
                (int) maxpooling (-1 to skip),
                (string) activation function
             ]
fcn_layers : list of tuple, list of parameters for Dense layers,
                each tuple for one FC layer,
                final layer (softmax) will be automatically added in,
                each tuple consists of: [
                    (int) Number of units,
                    (float) dropout (-1 to skip),
                    (string) activation function
                ]
conv_type  : string, set how the convolution will be performed, available options: parallel/sequence
             parallel: each cnn layer from conv_layers will run against
                 the embedding matrix directly, the results will be concatenated before the FCN layer
                 Refer to Yoon Kim, 2014
             sequence: cnn layers from conv_layers will be stacked sequentially,
                commonly used for character level CNN, on word level CNN 'parallel' is recommended
```
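
The difference between the two `conv_type` options can be seen in how the sequence length evolves. The sketch below only does the length arithmetic and rests on assumptions the parameter list does not state: stride 1, no padding, and no pooling. The repo's layers may be configured differently.

```python
def conv_output_length(input_length, filter_size):
    """Output length of a 1D convolution with stride 1 and no padding."""
    return input_length - filter_size + 1


input_size = 50
filter_sizes = [3, 4, 5]

# parallel: every convolution reads the embedded input directly
parallel_lengths = [conv_output_length(input_size, k) for k in filter_sizes]

# sequence: each convolution reads the output of the previous one
sequence_lengths = []
length = input_size
for k in filter_sizes:
    length = conv_output_length(length, k)
    sequence_lengths.append(length)

print(parallel_lengths)  # [48, 47, 46]
print(sequence_lengths)  # [48, 45, 41]
```

In the parallel setting each branch sees n-grams of a different width over the same input, which is why the branch outputs are concatenated rather than chained.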

In [None]:
from model.CNNText.cnn_classifier import CNNClassifier

cnn_parameter = {key: value for key, value in base_classifier_parameter.items()}
cnn_parameter["conv_layers"] = [
    (256, 3, 1, "relu"),
    (256, 4, 1, "relu"),
    (256, 5, 1, "relu")
]
cnn_parameter["fcn_layers"] = [(512, 0.2, "relu")]
cnn_parameter["conv_type"] = "parallel"


cnn_classifier = CNNClassifier(**cnn_parameter)
run_classifier(cnn_classifier, grouped_data)

#### 3.2 RNN & RNN + Attention

Parameter
```
rnn_size  : int, RNN hidden unit
dropout   : float, [0.0, 1.0] dropout just before softmax layer
rnn_type  : string, RNN memory type, "gru"/"lstm"
attention : string, attention scoring type choice available:
            dot/scale/general/location/add/self,
            set None to not use attention mechanism
```

In [None]:
from model.RNNText.rnn_classifier import RNNClassifier
rnn_parameter = {key: value for key, value in base_classifier_parameter.items()} 
rnn_parameter["rnn_size"] = 100
rnn_parameter["dropout"] = 0.2
rnn_parameter["rnn_type"] = "lstm"
rnn_parameter["attention"] = "self"

rnn_classifier = RNNClassifier(**rnn_parameter)
run_classifier(rnn_classifier, grouped_data)
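
As a rough intuition for the attention scoring options, the sketch below implements dot-product scoring over a list of hidden states, with an optional scaling step. It is an illustration only: treating the `scale` option as division by the square root of the dimension is an assumption, and the repo's attention layers may differ in detail.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def dot_attention(query, hidden_states, scale=False):
    # Score each hidden state by its dot product with the query
    scores = [sum(q * h for q, h in zip(query, state))
              for state in hidden_states]
    if scale:
        scores = [score / math.sqrt(len(query)) for score in scores]
    weights = softmax(scores)
    # Context vector: weighted sum of the hidden states
    context = [sum(w * state[d] for w, state in zip(weights, hidden_states))
               for d in range(len(query))]
    return weights, context


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]
weights, context = dot_attention(query, hidden_states)
print(weights)
print(context)
```

The third hidden state has the largest dot product with the query, so it receives the largest weight and dominates the context vector.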

#### 3.3 Transformer

Parameter
```
n_blocks          : int, number of transformer stack
dim_ff            : int, hidden unit on fcn layer in transformer
dropout           : float, dropout value
n_heads           : int, number of attention heads
attention_dim     : int, number of attention dimension
                    value will be overridden if using custom/pretrained embedding matrix
pos_embedding_init: bool, initialize positional embedding with
                    sincos function, or else it will be initialized with glorot_uniform
fcn_layers        : list of tuple, configuration of each
                    fcn layer after transformer, each tuple consists of:
                        [int] number of units,
                        [float] dropout after fcn layer,
                        [string] activation function
sequence_embedding: string, a method how to get representation of entire sequence,
                    the representation will be used for the input of FCN layer, available option:
                    cls, prepend [CLS] token in the sequence, then take
                        attention output of [CLS] token as sequence representation (BERT style)
                    global_avg, use GlobalAveragePool1D
```

In [None]:
from model.TransformerText.transformer_classifier import TransformerClassifier

transformer_parameter = {key: value for key, value in base_classifier_parameter.items()}
transformer_parameter["n_blocks"] = 2
transformer_parameter["dim_ff"] = 256
transformer_parameter["dropout"] = 0.3
transformer_parameter["n_heads"] = 8
transformer_parameter["attention_dim"] = 256
transformer_parameter["pos_embedding_init"] = True
transformer_parameter["fcn_layers"] = [(128, 0.1, "relu")]
transformer_parameter["sequence_embedding"] = "global_avg"

transformer_classifier = TransformerClassifier(**transformer_parameter)
run_classifier(transformer_classifier, grouped_data)
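
The sine/cosine initialization referred to by `pos_embedding_init` is commonly the sinusoidal scheme from the original Transformer paper. The sketch below follows that formula; whether the repo uses exactly this variant is an assumption, and `sinusoidal_positions` is a hypothetical helper name.

```python
import math


def sinusoidal_positions(max_length, dim):
    """Even columns use sin, odd columns use cos, at paired frequencies."""
    table = []
    for pos in range(max_length):
        row = []
        for i in range(dim):
            # Columns 2k and 2k+1 share the same frequency
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


positions = sinusoidal_positions(max_length=50, dim=256)
print(len(positions), len(positions[0]))  # 50 256
print(positions[0][:2])                   # [0.0, 1.0]
```

Every position gets a distinct, fixed vector, so the model can use token order without having to learn the position table from scratch.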

#### 3.4 Hierarchical Attention Network (HAN)

The original paper uses this network for document classification, but we can still use it for sentence classification by attending to characters and tokens to determine the class of a sentence. Unlike the other networks so far, HAN accepts a 3D input array. For document classification, we can see this as an entire text split into sentences, with each sentence split into tokens. For sentence classification, we can see it as a text split into tokens, with each token split into characters.

Parameters (similar to RNN)
```
input_shape : tuple of length 2 (int, int), maximum input shape,
              the first element refers to the maximum length of a data item or maximum number of sequences in a data item
              the second element refers to the maximum length of a sequence or maximum number of sub-sequences in
              a sequence
              NOTE this will override whatever value is passed by the input_size parameter.
rnn_size    : int, number of rnn hidden units
dropout     : float, dropout rate (before softmax)
rnn_type    : string, the type of rnn cell, available option:
              gru or lstm
```

In [None]:
from model.RNNText.han_classifier import HANClassifier
han_parameter = {key: value for key, value in base_classifier_parameter.items()}
han_parameter["rnn_size"] = 100
han_parameter["dropout"] = 0.2
han_parameter["rnn_type"] = "lstm"
han_parameter["input_shape"] = (25, 10)

grouped_data = list(grouped_data)
grouped_data[0] = [[[*token] for token in doc] for doc in grouped_data[0]]
grouped_data[2] = [[[*token] for token in doc] for doc in grouped_data[2]]
grouped_data[4] = [[[*token] for token in doc] for doc in grouped_data[4]]

han_classifier = HANClassifier(**han_parameter)
run_classifier(han_classifier, grouped_data)
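
To visualize the 3D input, the sketch below turns one tokenized sentence into a fixed-size grid of characters, in the spirit of `input_shape`. It is only an illustration of the shape: `to_char_grid` and the empty-string padding are assumptions, and the classifier is expected to handle its own padding and vectorization.

```python
def to_char_grid(tokens, max_tokens, max_chars, pad=""):
    """Return a max_tokens x max_chars grid of characters for one sentence."""
    grid = []
    for token in tokens[:max_tokens]:
        # Truncate long tokens, pad short ones
        chars = list(token)[:max_chars]
        grid.append(chars + [pad] * (max_chars - len(chars)))
    # Pad missing tokens with empty rows
    while len(grid) < max_tokens:
        grid.append([pad] * max_chars)
    return grid


grid = to_char_grid(["makan", "enak"], max_tokens=3, max_chars=4)
for row in grid:
    print(row)
```

Stacking one such grid per sentence gives an array of shape (number of sentences, `max_tokens`, `max_chars`), which is the 3D layout described above.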

In [None]:
#### Next Classifier RCNN