<a href="https://colab.research.google.com/github/aditya-malte/SemEval/blob/master/notebooks/Hinglish_smallBERTa_Pretraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-training SmallBERTa - A tiny model to train on a tiny dataset
(Using HuggingFace Transformers)<br>
Language modeling is usually associated with terabytes of data, but not all of us have the processing power or the resources to train huge models on such huge amounts of data.
In this example, we are going to train a relatively small neural net on a small dataset (which still happens to have over 2M rows).
<br>

The ***main purpose*** of this blog is not to achieve state-of-the-art performance on LM tasks but to show, in a simple way, how the recent run_language_modeling.py script can be used to train a Transformer model from scratch.

This very notebook can be extended to specialized domains where general-purpose pre-trained models fail to perform well. Examples include medical datasets, scientific literature, legal documentation, etc.

Input:
  1. To the Tokenizer:<br>
      LM data in a directory containing all samples in separate *.txt files.
  
  2. To the Model:<br>
      LM data split into:<br>
        1. train.txt <br>
        2. eval.txt 
        
Output:<br>
  Trained model weights (that can be used elsewhere) and TensorBoard logs

## Install Dependencies

In [0]:
# Versions this notebook is known to work with:
#   tokenizers   0.5.0
#   transformers 2.5.0
!pip install transformers
!pip install tokenizers
!pip install emoji
!pip install tensorboard==2.1.0


In [0]:
import os
import getpass
from tqdm import tqdm
from requests.utils import quote

repo_name = "SemEval"
if repo_name not in os.listdir():
  # Credentials are URL-quoted so that special characters in them
  # do not break the clone URL
  username = quote(input("User: "), safe="")
  password = quote(getpass.getpass(prompt='Password: '), safe="")
  print(os.system("git clone https://"+username+":"+password+"@github.com/aditya-malte/"+repo_name+".git"))
%cd {repo_name}
from utils_text import PreProcess  # preprocessing helper that lives in the cloned repo
%cd ..
!ls

User: aditya-malte
Password: ··········
0
/content/SemEval
/content
sample_data  SemEval


In [0]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)
!ln -s "/gdrive/My Drive/SemEval_weights_data" "/content/"
drive_path = "/content/SemEval_weights_data/data/"

Mounted at /gdrive


## Fetch Data


In [0]:
import pandas as pd
text_mixed = pd.read_csv("/content/SemEval_weights_data/twitter_scraped/tweets_final_mix_394468.csv", lineterminator='\n')
text_mixed

Unnamed: 0.1,Unnamed: 0,tweets
0,0,Meri Kashti Nu Dar Kahda Toofan Da \nMai Gadha...
1,1,Yahaan choor ko jaza or Ghareeb ko Saza milti ...
2,2,Date Pakri Gai In Village: http://youtu.be/ddL...
3,3,Bhai tu smjha hi nahin ab tak ki aj baarish bo...
4,4,"Zindagi Ki BheeK To Kisi Sourat ManGi Na Gahi,..."
...,...,...
394463,37281,"मस्जिद इस्लाम का अंग है या नहीं, इस पर बहस हो ..."
394464,37282,Bhi aap great ho or Mara sval ya ha ki aap har...
394465,37284,"तुम पास रहो,\nये ज़िद तो नहीं,\nपर किसी लंबे स..."
394466,37285,केवल आम आदमी पार्टी ही है जो देश निर्माण का का...


## Load and Preprocess data

In [0]:
import pandas as pd
from tqdm import tqdm

In [0]:
import re
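def getHindi(input_list):
  # Despite its name, this keeps only the tweets that contain NO Devanagari
  # characters (Unicode block U+0900-U+097F), i.e. the romanized text.
  regex = "[\u0900-\u097F]"
  output_list = []
  for test_text in input_list:
      try:
        match = re.search(regex, test_text)
        if match is None:
          output_list.append(test_text)
      except Exception as e:
        # Non-string rows (e.g. NaN) are reported and skipped
        print(e, test_text)
  return output_list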
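
To see what the filter above keeps and drops, here is a minimal standalone sketch of the same idea using only the standard library. The helper name `keep_romanized` and the sample strings are made up for illustration. Unlike the cell above, it skips non-string rows silently instead of printing them.

```python
import re

# Devanagari Unicode block, the same range used in the cell above
DEVANAGARI = re.compile("[\u0900-\u097F]")

def keep_romanized(texts):
    """Return only the strings that contain no Devanagari characters."""
    kept = []
    for text in texts:
        if not isinstance(text, str):  # e.g. NaN coming from an empty CSV cell
            continue
        if DEVANAGARI.search(text) is None:
            kept.append(text)
    return kept

# Hypothetical samples: romanized, Devanagari, a NaN row, and a mixed-script row
samples = ["aaj mausam accha hai", "आज मौसम अच्छा है", float("nan"), "mixed आज text"]
filtered = keep_romanized(samples)
print(filtered)
```

Note that a tweet mixing both scripts is dropped as well, since a single Devanagari character is enough to match.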

In [0]:
text_pure = pd.DataFrame(getHindi(text_mixed["tweets"].tolist()), columns=["tweets"])
text_pure.head()

expected string or bytes-like object nan


Unnamed: 0,tweets
0,Meri Kashti Nu Dar Kahda Toofan Da \nMai Gadha...
1,Yahaan choor ko jaza or Ghareeb ko Saza milti ...
2,Date Pakri Gai In Village: http://youtu.be/ddL...
3,Bhai tu smjha hi nahin ab tak ki aj baarish bo...
4,"Zindagi Ki BheeK To Kisi Sourat ManGi Na Gahi,..."


### Before Preprocessing 

In [0]:
text_pure = text_pure.sample(frac=1)  # shuffle the rows; a single pass is enough
print(text_pure)

                                                   tweets
67228   Bin mausam ki baarish ya toh kuch leke jaati h...
106616  Dimag kharab hai is neta ka kashmir ko aazadi ...
121984  Har Muskrahat k Baad\nKhuda Ka Shukar Adaa Nhi...
295484  Haaa ye to h per dhawan ke wicket ke baad jo n...
235611  I agree.\nAisay khail rahay thay jaisay vacati...
...                                                   ...
135660  Manyavar pm modi va yogi sharkar mahngai rokan...
95900   @9919Shivam ek baar ek Hindu or Muslim dono me...
296401  Seekh nahi Kaan ke neeche bajana chahiye tha u...
206865  Modi ji aap toh janam se hi fakiri hai aap me....
38972   Ranji Trophy 2015-16: Mohammad Kaif, Ricky Bhu...

[306259 rows x 1 columns]


In [0]:
preprocess = PreProcess(sep_url=False, remove_url=True, lowercase=True,
               convert_emoji=False, solve_gaps=True, remove_punct = True).preprocess

data = pd.DataFrame(text_pure["tweets"].apply(preprocess).dropna(), columns=["tweets"])

### After Preprocessing

In [0]:
print(data)

                                                   tweets
67228   bin mausam ki baarish ya toh kuch leke jaati h...
106616  dimag kharab hai is neta ka kashmir ko aazadi ...
121984  har muskrahat k baad khuda ka shukar adaa nhi ...
295484  haaa ye to h per dhawan ke wicket ke baad jo n...
235611  i agree. aisay khail rahay thay jaisay vacatio...
...                                                   ...
135660  manyavar pm modi va yogi sharkar mahngai rokan...
95900   @shivam ek baar ek hindu or muslim dono me bho...
296401  seekh nahi kaan ke neeche bajana chahiye tha u...
206865  modi ji aap toh janam se hi fakiri hai aap me....
38972   ranji trophy -: mohammad kaif, ricky bhui shin...

[306259 rows x 1 columns]


Remove newline characters in case the input text has them, because the LineByLineTextDataset class that we are going to use later assumes that samples are separated by newlines.

In [0]:
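data = data["tweets"]
# Series.replace only matches whole cell values; .str.replace replaces
# newlines inside each tweet
data = data.str.replace("\n", " ", regex=False)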
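
A small standalone sketch of why this step matters when samples are later read one per line. The two sample strings are made up; the point is only that an embedded newline turns one sample into two lines.

```python
# Two samples, the first one containing an embedded newline
samples = ["har muskrahat k baad\nkhuda ka shukar", "bin mausam ki baarish"]

# Written one sample per line, the embedded newline splits a sample in two
raw_lines = "\n".join(samples).split("\n")

# Replacing newlines inside each sample first keeps one line per sample
clean_samples = [s.replace("\n", " ") for s in samples]
clean_lines = "\n".join(clean_samples).split("\n")

print(len(samples), len(raw_lines), len(clean_lines))
```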

## Train a custom tokenizer
I have used a ByteLevelBPETokenizer to avoid \<unk> tokens entirely.
The tokenizer's training function takes a list of text files, so each sample is written to its own text file below.
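
The reason a byte-level tokenizer never needs `<unk>` can be illustrated without any library: every string, whatever its script, is a sequence of UTF-8 bytes, and there are only 256 possible byte values, so a base vocabulary of 256 symbols covers any input. The snippet below only illustrates that property; it is not the tokenizer's actual implementation, and the sample strings are made up.

```python
# Hypothetical samples: romanized text, Devanagari text, and text with an emoji
samples = ["aaj bahot garmi hai", "आज बहुत गर्मी है", "mausam 🌧️ hai"]

for text in samples:
    byte_ids = list(text.encode("utf-8"))
    # Every id falls in 0..255, so a 256-symbol base vocabulary is enough
    assert all(0 <= b <= 255 for b in byte_ids)
    # The bytes decode back to the original text without loss
    assert bytes(byte_ids).decode("utf-8") == text
    print(len(text), "characters ->", len(byte_ids), "bytes")
```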

In [0]:
txt_files_dir = "/tmp/text_split"
!mkdir {txt_files_dir}

Split LM data into individual files. These files are stored in /tmp/text_split and are used to train the tokenizer **only**.

In [0]:
for i, row in enumerate(tqdm(data.to_list())):
  file_name = os.path.join(txt_files_dir, str(i)+'.txt')
  try:
    with open(file_name, 'w') as f:
      f.write(row)
  except Exception as e:  # catch exceptions (e.g. empty rows)
    print(row, e)

100%|██████████| 306259/306259 [00:14<00:00, 21764.75it/s]


In [0]:
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


paths = [str(x) for x in Path(txt_files_dir).glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

vocab_size=20000
# Customize training
tokenizer.train(files=paths, vocab_size=vocab_size, min_frequency=5, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

In [0]:
lm_data_dir = "/tmp/lm_data"
!mkdir {lm_data_dir}

## Split into Validation and Train set
We split the data into a train set and a validation set. These two files are used to train and evaluate our model.

In [0]:
train_split = 0.9
train_data_size = int(len(data)*train_split)

with open(os.path.join(lm_data_dir,'train.txt') , 'w') as f:
    for item in data[:train_data_size].tolist():
        f.write("%s\n" % item)

with open(os.path.join(lm_data_dir,'eval.txt') , 'w') as f:
    for item in data[train_data_size:].tolist():
        f.write("%s\n" % item)
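
As a sanity check on the split, the same logic can be exercised on a toy list and the written files counted line by line. This is a minimal sketch under stated assumptions: `write_split` and `count_lines` are hypothetical helpers, and it writes to a temporary directory rather than to /tmp/lm_data.

```python
import os
import tempfile

def write_split(samples, out_dir, train_fraction=0.9):
    """Write the first part of `samples` to train.txt and the rest to eval.txt."""
    n_train = int(len(samples) * train_fraction)
    paths = {}
    for name, part in (("train", samples[:n_train]), ("eval", samples[n_train:])):
        paths[name] = os.path.join(out_dir, name + ".txt")
        with open(paths[name], "w", encoding="utf-8") as f:
            for item in part:
                f.write(item + "\n")
    return paths

def count_lines(path):
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

# Ten toy samples, split 90/10
samples = ["sample %d" % i for i in range(10)]
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = write_split(samples, tmp_dir)
    n_train_lines = count_lines(paths["train"])
    n_eval_lines = count_lines(paths["eval"])

print(n_train_lines, n_eval_lines)
```

If the line counts do not add up to the number of samples, some sample probably still contains a newline.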

In [0]:
!mkdir /content/models
!mkdir /content/models/smallBERTa

In [0]:
tokenizer.save("/content/models/smallBERTa", "smallBERTa")

['/content/models/smallBERTa/smallBERTa-vocab.json',
 '/content/models/smallBERTa/smallBERTa-merges.txt']

In [0]:
!mv /content/models/smallBERTa/smallBERTa-vocab.json /content/models/smallBERTa/vocab.json
!mv /content/models/smallBERTa/smallBERTa-merges.txt /content/models/smallBERTa/merges.txt

In [0]:
train_path = os.path.join(lm_data_dir,"train.txt")
eval_path = os.path.join(lm_data_dir,"eval.txt")

## Set Model Configuration
We are training a very small model for demo purposes.

In [0]:
import json
config = {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.3,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "num_attention_heads": 1,
  "num_hidden_layers": 1,
  "vocab_size": vocab_size,
  "intermediate_size": 256,
  "max_position_embeddings": 288
}
with open("/content/models/smallBERTa/config.json", 'w') as fp:
    json.dump(config, fp)
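
To get a feel for how small this configuration is, the parameter count can be estimated by hand. The sketch below is a rough estimate under simplifying assumptions: it counts a standard Transformer encoder layer plus the token and position embedding tables, and ignores token-type embeddings, the embedding LayerNorm, and the language-modeling head. The exact number reported by the library may therefore differ slightly.

```python
# Values copied from the config above; vocab_size is the value set earlier
hidden_size = 128
intermediate_size = 256
num_hidden_layers = 1
vocab_size = 20000
max_position_embeddings = 288

# Embedding tables: one vector per token and one per position
token_embeddings = vocab_size * hidden_size
position_embeddings = max_position_embeddings * hidden_size

# One encoder layer: query, key, value and output projections (weights + biases),
# the two feed-forward projections, and two LayerNorms (scale + shift each)
attention = 4 * (hidden_size * hidden_size + hidden_size)
feed_forward = (hidden_size * intermediate_size + intermediate_size
                + intermediate_size * hidden_size + hidden_size)
layer_norms = 2 * 2 * hidden_size
per_layer = attention + feed_forward + layer_norms

approx_total = token_embeddings + position_embeddings + num_hidden_layers * per_layer
print(f"embeddings: {token_embeddings + position_embeddings:,}")
print(f"encoder:    {num_hidden_layers * per_layer:,}")
print(f"total (approx.): {approx_total:,}")
```

Under these assumptions almost all of the parameters sit in the token embedding table, so `vocab_size` is the main lever on model size here.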

In [0]:
#%cd /content
!git clone https://github.com/huggingface/transformers.git

Cloning into 'transformers'...

## Run training using the run_language_modeling.py examples script

In [0]:
!nvidia-smi #just to confirm that you are on a GPU; if not, go to Runtime -> Change runtime type

Sat Feb 22 15:58:32 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
#Setting environment variables
os.environ["train_path"] = train_path
os.environ["eval_path"] = eval_path
os.environ["CUDA_LAUNCH_BLOCKING"]='1'  #Makes for easier debugging (just in case)
weights_dir = "/content/models/smallBERTa/weights"
!mkdir {weights_dir}

In [0]:
cmd = '''python /content/transformers/examples/run_language_modeling.py --output_dir {0}  \
    --model_type roberta \
    --mlm \
    --train_data_file {1} \
    --eval_data_file {2} \
    --config_name /content/models/smallBERTa \
    --tokenizer_name /content/models/smallBERTa \
    --do_train \
    --line_by_line \
    --overwrite_output_dir \
    --do_eval \
    --block_size 256 \
    --learning_rate 1e-4 \
    --num_train_epochs 5 \
    --save_total_limit 2 \
    --save_steps 2000 \
    --logging_steps 500 \
    --per_gpu_eval_batch_size 32 \
    --per_gpu_train_batch_size 32 \
    --evaluate_during_training \
    --seed 42 \
    '''.format(weights_dir, train_path, eval_path)
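
A back-of-the-envelope estimate of how long this run is, in optimizer steps, can be derived from the flags above. It assumes a single GPU, no gradient accumulation, and that every row of the preprocessed data ends up as one training line; if some lines are empty and get skipped, the real numbers will be slightly lower.

```python
import math

num_rows = 306259          # rows printed after preprocessing
train_split = 0.9          # fraction written to train.txt
batch_size = 32            # --per_gpu_train_batch_size, assuming a single GPU
num_epochs = 5             # --num_train_epochs
save_steps = 2000          # --save_steps

train_rows = int(num_rows * train_split)
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
checkpoints_written = total_steps // save_steps

print(train_rows, steps_per_epoch, total_steps, checkpoints_written)
```

With `--save_total_limit 2`, only the two most recent of these checkpoints remain on disk at any time.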

In [0]:
!{cmd}

02/22/2020 15:58:45 - INFO - transformers.configuration_utils -   loading configuration file /content/models/smallBERTa/config.json
02/22/2020 15:58:45 - INFO - transformers.configuration_utils -   Model config RobertaConfig {
  "architectures": null,
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "do_sample": false,
  "eos_token_ids": 0,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.3,
  "hidden_size": 128,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 256,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 256,
  "model_type": "roberta",
  "num_attention_heads": 1,
  "num_beams": 1,
  "num_hidden_layers": 1,
  "num_labels": 2,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad

## Run Fill Mask

In [0]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    # Checkpoints are written under the weights directory created earlier
    model="/content/models/smallBERTa/weights/checkpoint-36000",
    tokenizer="/content/models/smallBERTa/weights/checkpoint-36000"
)

In [0]:
result = fill_mask("aaj bahot <mask> ho raha hei")
print(result)

## View Results on Tensorboard

In [0]:
!tensorboard dev upload --logdir /content/runs