<a href="https://colab.research.google.com/github/danjohnvelasco/Filipino-ULMFiT/blob/master/Filipino_ULMFiT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Filipino ULMFiT
This notebook applies the ULMFiT approach to a Filipino text classification task on the [Hate Speech Dataset](https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks#datasets), using a pre-trained Filipino language model. To learn more about ULMFiT, see the fastai [docs](https://docs.fast.ai/tutorial.text) and the [paper](https://arxiv.org/abs/1801.06146).

Originally posted in this [repository](https://github.com/danjohnvelasco/Filipino-ULMFiT).

In [None]:
# Install fastai v2
# uncomment if your environment doesn't use fastai >= v2.
# run pip freeze to check if fastai is installed

# !pip install -U fastai

In [1]:
# If you're on Colab, make sure you're using a GPU instance.
# Make sure that your GPU supports mixed-precision training (e.g. Tesla T4, P4, P100, V100)
# !nvidia-smi

# Before you start...

1.  Import dependencies
2.  Define convenient functions for later use
3.  Load data

In [None]:
from fastai.text.all import *
from sklearn.metrics import accuracy_score

In [None]:
# Run this function before creating a learner if you want  
# your work to be reproducible

# Convenience function for setting the random seed manually
def set_random_seed(seed):
    # python RNG
    import random
    random.seed(seed)

    # pytorch RNGs
    import torch
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)

    # numpy RNG
    import numpy as np
    np.random.seed(seed)

In [None]:
# Convenience function for evaluating a model on the test set.
# Relies on the global `test_df` loaded below, which must have a 'label' column.
def get_test_acc(learn):
    # Create dataloader for test set
    test_dl = learn.dls.test_dl(test_df, with_labels=True)
    # Make predictions on test set; with_decoded=True also returns the predicted classes
    pred_probas, _, pred_classes = learn.get_preds(dl=test_dl, with_decoded=True)
    # Accuracy of (y_true, y_pred)
    return accuracy_score(test_df.label.values, pred_classes)

In [None]:
# Modify this to match your data directory
train_df = pd.read_csv("train.csv", lineterminator='\n')
valid_df = pd.read_csv("valid.csv", lineterminator='\n')
test_df = pd.read_csv("test.csv", lineterminator='\n')

In [None]:
# add 'is_valid' column (for fastai train-val splitting)
valid_df['is_valid'] = True
test_df['is_valid'] = False
train_df['is_valid'] = False

In [None]:
# Concatenate train and validation set
lm_df_10k =  pd.concat([train_df, valid_df])
lm_df_10k.shape

In [None]:
# HYPERPARAMETERS
lr = 5e-2
wd = 0.1
moms = (0.8,0.7,0.6)

In [None]:
# Filenames of pre-trained LM weights and vocab
pretrained_fnames = ['finetuned_weights_20', 'vocab']

# Language Model Fine-tuning

Here, we fine-tune the pre-trained language model so that it adapts to the vocabulary and syntax of the target corpus, which in our case is the Hate Speech Dataset.

**About the pre-trained language model files:**

By default, fastai looks for models inside the 'models' folder. Make sure that the pre-trained weights and vocab are in the 'models' folder.
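
Before building the learner, it can help to confirm that the files are where they are expected. The following is a minimal sketch: `missing_pretrained_files` is a hypothetical helper, not a fastai function, and it assumes the weights are saved with a `.pth` extension and the vocab with a `.pkl` extension, both directly under 'models'.

```python
from pathlib import Path

def missing_pretrained_files(fnames, model_dir="models"):
    """Return the expected pre-trained files that are not present in model_dir.

    Hypothetical helper (not a fastai function). Assumes `fnames` is
    [weights_name, vocab_name] without extensions, and that the weights
    are saved as .pth and the vocab as .pkl.
    """
    weights_name, vocab_name = fnames
    expected = [Path(model_dir) / f"{weights_name}.pth",
                Path(model_dir) / f"{vocab_name}.pkl"]
    return [str(p) for p in expected if not p.exists()]

# Same names as `pretrained_fnames` in this notebook
missing = missing_pretrained_files(["finetuned_weights_20", "vocab"])
print("Missing files:", missing if missing else "none")
```

If the list is not empty, move the listed files into the folder before running the cells below.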

In [None]:
# Create a dataloader for language model fine-tuning
# min_freq is a vocab-building option, so it is passed to TextBlock.from_df
dls_lm = DataBlock(blocks=TextBlock.from_df('text', is_lm=True, min_freq=2),
                   get_x=ColReader('text'),
                   splitter=ColSplitter()
                   ).dataloaders(lm_df_10k, bs=128, seq_len=72, num_workers=0)

In [None]:
# Callbacks for early stopping and for saving the best model.
# These are fastai callbacks; see the docs for more info.
cbs = [EarlyStoppingCallback(monitor='valid_loss', patience=2), SaveModelCallback()]

# Notice the pretrained_fnames parameter.
# Here we pass the list of filenames of the pre-trained weights and vocab.
# This is where the pre-trained language model is loaded.
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.5, 
    metrics=[accuracy, Perplexity()],
    pretrained_fnames=pretrained_fnames,
    cbs=cbs).to_fp16()
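
The `Perplexity()` metric used above is the exponential of the mean cross-entropy loss, so lower is better and a value of 1 would mean the model predicts every next token with certainty. As a rough illustration in plain Python (the probabilities are made up, and `perplexity` is a hypothetical helper, not the fastai metric itself):

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model gave to each correct next token."""
    # Cross-entropy loss: mean negative log-probability of the correct tokens
    mean_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Perplexity is the exponential of that mean loss
    return math.exp(mean_loss)

# Made-up probabilities, for illustration only
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform guess among 4 tokens
print(perplexity([0.9, 0.8, 0.95]))          # confident and mostly correct model
```

A model that always spreads its guess evenly over k tokens has a perplexity of about k, which is why perplexity is often read as an effective number of choices.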

In [None]:
learn.lr_find()

In [None]:
# train last layers first
learn.fit_one_cycle(1, 4e-2)

In [None]:
# train the whole network with smaller learning rate
learn.unfreeze()
learn.fit_one_cycle(6, 4e-3)

In [None]:
# Save encoder. To be used by text classifier learner.
learn.save_encoder('lm_finetune_final_enc')

In [None]:
# This is totally unrelated to training process but just for fun...
# you can try generating text with the language model here
TEXT = "Ako ay"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

print("\n".join(preds))
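
The `temperature` argument controls how adventurous the sampling is. The sketch below shows the general idea in plain Python, assuming that temperature rescales each next-token probability to p^(1/temperature) before renormalising; the vocabulary, the probabilities, and both helper functions are made up for illustration and are not fastai code.

```python
import random

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: p -> p**(1/temperature), renormalised."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_token(vocab, probs, temperature=1.0, rng=random):
    """Draw one token from `vocab` using the temperature-adjusted probabilities."""
    weights = apply_temperature(probs, temperature)
    return rng.choices(vocab, weights=weights, k=1)[0]

# Toy next-token distribution (made-up numbers, not model output)
vocab = ["masaya", "pagod", "gutom"]
probs = [0.6, 0.3, 0.1]

print(apply_temperature(probs, 0.75))  # sharper: the top token gains probability
print(apply_temperature(probs, 1.5))   # flatter: rare tokens gain probability
print(sample_token(vocab, probs, temperature=0.75))
```

Values below 1 make the generated text more predictable, while values above 1 make it more varied.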

# Text Classifier Fine-tuning

Here, we use the encoder of the fine-tuned language model to transfer what it has learned to the classifier. The classifier learns to assign each text one of two labels: hate (1) or non-hate (0).

We apply gradual unfreezing and discriminative learning rates, as described by [Howard and Ruder (2018)](https://arxiv.org/abs/1801.06146).
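
To make discriminative learning rates concrete, the sketch below computes one learning rate per layer group, where each group trains 2.6 times slower than the group after it. It is an illustration in plain Python, not the fastai implementation: `discriminative_lrs` is a hypothetical helper, and the number of layer groups (five) is an assumption.

```python
def discriminative_lrs(top_lr, n_groups, factor=2.6):
    """Learning rate per layer group, from the earliest group to the last.

    The last group gets `top_lr`; each earlier group trains `factor`
    times slower than the group after it.
    """
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

lr = 5e-2      # same value as the `lr` hyperparameter in this notebook
n_groups = 5   # assumption: five layer groups in the classifier

for i, group_lr in enumerate(discriminative_lrs(lr, n_groups)):
    print(f"group {i}: lr = {group_lr:.6f}")
```

Under these assumptions the earliest group trains at `lr/(2.6**4)` and the last at `lr`, which are the two ends of the `slice(...)` objects passed to `fit_one_cycle` in the cells that follow.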

In [None]:
# Create a dataloader for text classifier fine-tuning.
# Reuse the language model vocab so that the encoder's embeddings line up.
# lm_df_10k holds the train and validation rows (split by 'is_valid').
dls_clas = DataBlock(blocks=(TextBlock.from_df('text', seq_len=72, min_freq=2, vocab=dls_lm.vocab), CategoryBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader('label'),
                      splitter=ColSplitter()
                      ).dataloaders(lm_df_10k, bs=128, num_workers=0)

In [None]:
# Create learner from the classifier dataloaders
learn = text_classifier_learner(dls_clas, AWD_LSTM, moms=moms, wd=wd, metrics=accuracy).to_fp16()

# Load encoder
learn.load_encoder('lm_finetune_final_enc')

In [None]:
# Train the last layers
learn.fit_one_cycle(4, lr) 

In [None]:
learn.freeze_to(-2) # Unfreeze the last two layer groups
learn.fit_one_cycle(2, slice(lr/(2.6**4),lr)) # Discriminative learning rates: earlier layers get smaller rates

In [None]:
learn.freeze_to(-3) # Unfreeze one more layer group
learn.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2)) # Halve the learning rates

In [None]:
learn.unfreeze() # Unfreeze the whole network
learn.fit_one_cycle(1, slice(lr/10/(2.6**4),lr/10)) # Train the whole network with much smaller learning rates
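
The four training stages above follow a gradual unfreezing schedule. The sketch below lists which layer groups are trainable at each stage. It is plain Python for illustration only: `trainable_groups` is a hypothetical helper, the count of five layer groups is an assumption, and it assumes that `freeze_to(n)` freezes every group before index `n`, with negative values counting from the end.

```python
def trainable_groups(n_groups, freeze_to):
    """Indices of the layer groups that train when all groups before `freeze_to` are frozen.

    Negative values count from the end, as with Python list indexing.
    """
    first_trainable = freeze_to if freeze_to >= 0 else n_groups + freeze_to
    return list(range(first_trainable, n_groups))

n_groups = 5  # assumption: five layer groups in the classifier

# (stage description, freeze_to value) for each step of the schedule
stages = [
    ("last group only", -1),
    ("freeze_to(-2)", -2),
    ("freeze_to(-3)", -3),
    ("unfreeze()", 0),
]
for name, n in stages:
    print(f"{name:16s} -> trainable groups {trainable_groups(n_groups, n)}")
```

Training the newly added classifier head first, then unfreezing earlier groups step by step, is meant to limit how much the pre-trained encoder weights are disturbed early in training.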

In [None]:
# Get accuracy on the test set and display it
acc = get_test_acc(learn)
print(f"Test accuracy: {acc:.4f}")