<a href="https://colab.research.google.com/github/aviaga/testing/blob/main/ML%20Secondary/Intro%20to%20NLPs%20-%20RoBERTa%20Model%20Creation%20(7-28).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Intro
This notebook implements the Day 23 lesson plan item for SureStart's 2021 summer program.

The goal of this lesson plan item was to experiment with natural language processing (NLP) models. This project builds a RoBERTa model for the Spanish language. Because RoBERTa is trained as a masked language model, it predicts a masked word in a sentence from the surrounding context.


The code is adapted from [this Colab notebook](https://colab.research.google.com/drive/1mXWYYkB9UjRdklPVSDvAcUDralmv3Pgv#scrollTo=LKs_0Gy998vO).

In [1]:
%%capture
!pip uninstall -y tensorflow
!pip install transformers==2.8.0

import os

In [2]:
import os
# Importing dataset
if not os.path.exists('data/dataset.txt'):
  !wget "https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/mono/es.txt.gz" -O dataset.txt.gz
  !gzip -d dataset.txt.gz
  !mkdir data
  !mv dataset.txt data

--2021-07-29 04:04:14--  https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/mono/es.txt.gz
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1859673728 (1.7G) [application/gzip]
Saving to: ‘dataset.txt.gz’


2021-07-29 04:05:30 (23.8 MB/s) - ‘dataset.txt.gz’ saved [1859673728/1859673728]



In [3]:
#Looking at the data
!wc -l data/dataset.txt
!shuf -n 5 data/dataset.txt

179287150 data/dataset.txt
- Me refiero a la pistola.
Mira las bromas que escribiste, mira ese chandal que me hiciste llevar.
Estás borracho.
¿Con tu paga?
No salgo barato.


In [4]:
#Training and validation data
TRAIN_SIZE = 1000000 #@param {type:"integer"}
!(head -n $TRAIN_SIZE data/dataset.txt) > data/train.txt
VAL_SIZE = 10000 #@param {type:"integer"}
!(sed -n {TRAIN_SIZE + 1},{TRAIN_SIZE + VAL_SIZE}p data/dataset.txt) > data/dev.txt
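
The shell commands above write the first `TRAIN_SIZE` lines to the training file and the next `VAL_SIZE` lines to the validation file. A minimal pure-Python sketch of the same split is shown below. The helper name `split_lines` is hypothetical and not part of the original notebook.

```python
from itertools import islice


def split_lines(src_path, train_path, dev_path, train_size, val_size):
    """Write the first train_size lines to train_path and the next val_size lines to dev_path."""
    with open(src_path, encoding="utf-8") as src:
        # islice consumes exactly train_size lines from the shared file iterator...
        with open(train_path, "w", encoding="utf-8") as train:
            train.writelines(islice(src, train_size))
        # ...so the second islice starts at line train_size + 1.
        with open(dev_path, "w", encoding="utf-8") as dev:
            dev.writelines(islice(src, val_size))
```

With the paths used in this notebook, the call would be `split_lines("data/dataset.txt", "data/train.txt", "data/dev.txt", TRAIN_SIZE, VAL_SIZE)`. Streaming line by line avoids loading the full dataset into memory.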

In [5]:
%%time
#Training tokenizer (the %%time cell magic must be the first line of the cell)
from tokenizers import ByteLevelBPETokenizer

path = "data/train.txt"

tokenizer = ByteLevelBPETokenizer()

tokenizer.train(files=path,
                vocab_size=50265,
                min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

!mkdir -p "models/roberta"
tokenizer.save("models/roberta")

CPU times: user 27.8 s, sys: 285 ms, total: 28.1 s
Wall time: 28 s
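
To give some intuition for what the tokenizer training step does, here is a toy sketch of byte-pair encoding (BPE) merge learning. It is a simplified illustration only: it works on characters rather than bytes, ignores `min_frequency` and `vocab_size`, and is not how the `tokenizers` library is implemented. The function name `learn_bpe_merges` is hypothetical.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single characters, weighted by its count.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite each word, replacing occurrences of the best pair with one symbol.
        new_vocab = Counter()
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += count
        vocab = new_vocab
    return merges
```

Each learned merge becomes a new vocabulary entry, so frequent character sequences end up as single tokens.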


In [6]:
#Defining RoBERTa model architecture
import json
config = {
	"architectures": [
		"RobertaForMaskedLM"
	],
	"attention_probs_dropout_prob": 0.1,
	"hidden_act": "gelu",
	"hidden_dropout_prob": 0.1,
	"hidden_size": 768,
	"initializer_range": 0.02,
	"intermediate_size": 3072,
	"layer_norm_eps": 1e-05,
	"max_position_embeddings": 514,
	"model_type": "roberta",
	"num_attention_heads": 12,
	"num_hidden_layers": 12,
	"type_vocab_size": 1,
	"vocab_size": 50265
}

with open("models/roberta/config.json", 'w') as fp:
    json.dump(config, fp)

tokenizer_config = {"max_len": 512}

with open("models/roberta/tokenizer_config.json", 'w') as fp:
    json.dump(tokenizer_config, fp)
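
As a rough sanity check on the size of this architecture, the sketch below estimates the number of weights in the embeddings and encoder layers from the config values above. It assumes the standard BERT-style layer layout (four attention projections, a two-layer feed-forward block, and two layer norms per layer) and deliberately leaves out the pooler and the language-modeling head. The function name `estimate_encoder_params` is hypothetical.

```python
config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 514,
    "num_hidden_layers": 12,
    "type_vocab_size": 1,
    "vocab_size": 50265,
}


def estimate_encoder_params(cfg):
    """Approximate parameter count of the embeddings plus encoder layers."""
    h = cfg["hidden_size"]
    ffn = cfg["intermediate_size"]
    layer_norm = 2 * h  # scale and bias vectors
    # Word, position, and token-type embedding tables, followed by a layer norm.
    embeddings = (
        cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    ) * h + layer_norm
    # Query, key, value, and output projections (weights plus biases).
    attention = 4 * (h * h + h) + layer_norm
    # Up-projection and down-projection of the feed-forward block.
    feed_forward = (h * ffn + ffn) + (ffn * h + h) + layer_norm
    return embeddings + cfg["num_hidden_layers"] * (attention + feed_forward)


print(f"{estimate_encoder_params(config):,} parameters")
```

Under these assumptions the estimate comes to roughly 124 million parameters, about 39 million of which sit in the embedding tables.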

In [7]:
#Downloading the training script and defining model and data paths
!wget -c https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/run_language_modeling.py
MODEL_TYPE = "roberta" #@param ["roberta", "bert"]
MODEL_DIR = "models/roberta" #@param {type: "string"}
OUTPUT_DIR = "models/roberta/output" #@param {type: "string"}
TRAIN_PATH = "data/train.txt" #@param {type: "string"}
EVAL_PATH = "data/dev.txt" #@param {type: "string"}


--2021-07-29 04:09:00--  https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/run_language_modeling.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34328 (34K) [text/plain]
Saving to: ‘run_language_modeling.py’


2021-07-29 04:09:01 (107 MB/s) - ‘run_language_modeling.py’ saved [34328/34328]



In [8]:
#Training parameters (an empty string leaves the corresponding flag out of the command)
train_params = {
    "output_dir": OUTPUT_DIR,
    "model_type": MODEL_TYPE,
    "config_name": MODEL_DIR,
    "tokenizer_name": MODEL_DIR,
    "train_path": TRAIN_PATH,
    "eval_path": EVAL_PATH,
    "do_eval": "--do_eval",
    "evaluate_during_training": "",
    "line_by_line": "",
    "should_continue": "",
    "model_name_or_path": "",
}
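
The empty strings in `train_params` act as switches: when the dictionary is later substituted into the command template, an empty value simply removes that flag. A small sketch of the idea, using a shortened template that is an assumption for illustration (the full template appears in a later cell):

```python
# Shortened, illustrative template; the notebook's real template has many more flags.
template = "python run_language_modeling.py {line_by_line} {do_eval} --train_data_file {train_path}"

params = {
    "line_by_line": "",        # empty string: flag is omitted
    "do_eval": "--do_eval",    # non-empty string: flag is passed through
    "train_path": "data/train.txt",
}

# Collapse the extra whitespace left behind by the empty placeholder.
command = " ".join(template.format(**params).split())
print(command)
```

To enable an option such as line-by-line processing, the value would be set to the flag itself, for example `"--line_by_line"`.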

In [9]:
pip install tensorboard==2.1.0

Collecting tensorboard==2.1.0
  Downloading tensorboard-2.1.0-py3-none-any.whl (3.8 MB)
     |████████████████████████████████| 3.8 MB 30.1 MB/s 
Installing collected packages: tensorboard
  Attempting uninstall: tensorboard
    Found existing installation: tensorboard 2.5.0
    Uninstalling tensorboard-2.5.0:
      Successfully uninstalled tensorboard-2.5.0
Successfully installed tensorboard-2.1.0


In [10]:
cmd = """python run_language_modeling.py \
    --output_dir {output_dir} \
    --model_type {model_type} \
    --mlm \
    --config_name {config_name} \
    --tokenizer_name {tokenizer_name} \
    {line_by_line} \
    {should_continue} \
    {model_name_or_path} \
    --train_data_file {train_path} \
    --eval_data_file {eval_path} \
    --do_train \
    {do_eval} \
    {evaluate_during_training} \
    --overwrite_output_dir \
    --block_size 512 \
    --max_steps 25 \
    --warmup_steps 10 \
    --learning_rate 5e-5 \
    --per_gpu_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 100.0 \
    --save_total_limit 10 \
    --save_steps 10 \
    --logging_steps 2 \
    --seed 42
"""
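
The flags above imply a very short run. The sketch below works out the effective batch size and the learning rate at each step, assuming a single GPU and assuming the script uses a linear warmup followed by linear decay to zero. That schedule is an assumption here, not something confirmed by the notebook. The function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=10, max_steps=25):
    """Learning rate under linear warmup, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))


# per_gpu_train_batch_size * gradient_accumulation_steps, assuming one GPU
effective_batch = 4 * 4
# Total training sequences processed over max_steps optimizer updates
sequences_seen = effective_batch * 25

for step in (0, 5, 10, 20, 25):
    print(step, linear_schedule_lr(step))
print(effective_batch, sequences_seen)
```

With only 25 optimizer steps the model sees a few hundred sequences, so this run is a smoke test of the pipeline rather than a full pre-training.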

In [11]:
!{cmd.format(**train_params)}

07/29/2021 04:09:12 - INFO - transformers.configuration_utils -   loading configuration file models/roberta/config.json
07/29/2021 04:09:12 - INFO - transformers.configuration_utils -   Model config RobertaConfig {
  "_num_labels": 2,
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bad_words_ids": null,
  "bos_token_id": 0,
  "decoder_start_token_id": null,
  "do_sample": false,
  "early_stopping": false,
  "eos_token_id": 2,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-05,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 514,
  "min_length": 0,
  "model_type": "roberta",
  "no_repeat_ngram_size": 0,
  "num_attent

In [12]:
%%capture
%%time
#Making a prediction - setting it up (cell magics must come before any other line in the cell)
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="chriskhanhtran/spanberta", tokenizer="chriskhanhtran/spanberta")



In [13]:
#The prediction. In English, the input reads "Wash your hands frequently with water and ____".
#The model's top five predictions are soap, salt, steam, lemon, and vinegar, in that order.
fill_mask("Lavarse frecuentemente las manos con agua y <mask>.")

[{'score': 0.6469604969024658,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y jabón.</s>',
  'token': 18493},
 {'score': 0.06074365973472595,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y sal.</s>',
  'token': 619},
 {'score': 0.029788149520754814,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y vapor.</s>',
  'token': 11079},
 {'score': 0.0264101754873991,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y limón.</s>',
  'token': 12788},
 {'score': 0.01702934503555298,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y vinagre.</s>',
  'token': 18424}]
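
The pipeline returns full sequences, including the `<s>` and `</s>` markers. A minimal sketch for pulling out just the predicted word and its score is shown below. The helper `extract_fill` is hypothetical, and the two sample results are copied from the output above.

```python
def extract_fill(template, sequence, mask_token="<mask>"):
    """Return the text that replaced mask_token in a fill-mask result sequence."""
    prefix, suffix = template.split(mask_token)
    # Drop the sentence boundary markers that the pipeline leaves in the sequence.
    text = sequence.replace("<s>", "").replace("</s>", "").strip()
    return text[len(prefix):len(text) - len(suffix)]


template = "Lavarse frecuentemente las manos con agua y <mask>."
results = [
    {"score": 0.6469604969024658,
     "sequence": "<s> Lavarse frecuentemente las manos con agua y jabón.</s>"},
    {"score": 0.06074365973472595,
     "sequence": "<s> Lavarse frecuentemente las manos con agua y sal.</s>"},
]

predictions = [(extract_fill(template, r["sequence"]), round(r["score"], 3)) for r in results]
print(predictions)
```

This relies on the template containing exactly one mask token and on the sequence otherwise matching the template character for character.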