This notebook presents all the code cells needed to reproduce our experiments on CharBERT.

In particular, it covers pre-training followed by the Named Entity Recognition (NER) task, both on a different domain (Twitter) and on languages other than English, such as Spanish.

# Set up the environment

In [None]:
!rm -d -r /content/CharBERT-main

In [None]:
! git clone https://github.com/cmmedoro/CharBERT-main.git

Run the following cell only if you want to reproduce the experiment in which we modified the CharBERT architecture by changing how character embeddings are obtained: instead of concatenating the first and last character of each token, we use the mean and standard deviation of the characters in the token.

In [None]:
! git clone --single-branch --branch no_concatenation https://github.com/cmmedoro/CharBERT-main.git
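To make the difference between the two pooling strategies concrete, here is a minimal pure-Python sketch. It is not the code from either branch: the function names `first_last_pooling` and `mean_std_pooling` and the toy 2-dimensional vectors are assumptions made for illustration, and the real model works on learned character representations rather than hand-written lists.

```python
from statistics import fmean, pstdev

def first_last_pooling(char_vectors):
    """Concatenate the vectors of the first and last character of a token."""
    return char_vectors[0] + char_vectors[-1]

def mean_std_pooling(char_vectors):
    """Concatenate the per-dimension mean and population standard deviation."""
    dims = list(zip(*char_vectors))  # transpose: one tuple per embedding dimension
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) for d in dims]
    return means + stds

# Toy token with three characters, each represented by a 2-dimensional vector.
token_chars = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

first_last = first_last_pooling(token_chars)
mean_std = mean_std_pooling(token_chars)
print(first_last)
print(mean_std)
```

Both strategies produce a vector of the same size (twice the character dimension), but the second one depends on every character in the token rather than only on its boundaries.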

In [None]:
!pip install transformers

In [None]:
!pip install seqeval

In [None]:
!pip install tensorboardX

In [None]:
!pip install boto3

In [None]:
!pip install datasets

Note that to reproduce the model exploration experiments (branch no_concatenation of the GitHub repository) you only need to run the following two sections, "Pre-train MLM on English Wikipedia (simple version)" and "NER".

# Pre-train MLM on English Wikipedia (simple version)

For pre-training, we use the Simple English Wikipedia. We keep a portion of it and split it into train/val/test.

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("wikipedia", "20220301.simple")

In [None]:
train_text = dataset['train'][:500]['text']
eval_text = dataset['train'][500:650]['text']
test_text = dataset['train'][650:800]['text']

In [None]:
# On Colab the "data" folder may not exist yet, so create it first; adjust the paths to your environment
import os
os.makedirs("/content/data", exist_ok=True)
text = ''
for el in train_text:
  text += el
with open("/content/data/train.txt", 'w', encoding='utf-8') as f:
  f.write(text)

In [None]:
text = ''
for el in eval_text:
  text += el
with open("/content/data/eval.txt", 'w', encoding='utf-8') as f:
  f.write(text)

In [None]:
text = ''
for el in test_text:
  text += el
with open("/content/data/test.txt", 'w', encoding='utf-8') as f:
  f.write(text)
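The three write cells above follow the same pattern, so they can be folded into one helper. The sketch below is only an illustration: `write_split` is a hypothetical name, and the toy strings and the temporary folder stand in for `train_text`, `eval_text`, `test_text` and `/content/data`.

```python
import os
import tempfile

def write_split(texts, path):
    """Concatenate the documents of one split and write them to `path`."""
    os.makedirs(os.path.dirname(path), exist_ok=True)  # create the folder if missing
    with open(path, "w", encoding="utf-8") as f:
        f.write("".join(texts))  # same result as the `text += el` loop

# Toy stand-ins for train_text / eval_text / test_text.
splits = {
    "train": ["First article.", "Second article."],
    "eval": ["Third."],
    "test": ["Fourth."],
}
out_dir = os.path.join(tempfile.mkdtemp(), "data")
for name, texts in splits.items():
    write_split(texts, os.path.join(out_dir, f"{name}.txt"))

with open(os.path.join(out_dir, "train.txt"), encoding="utf-8") as f:
    train_contents = f.read()
print(train_contents)
```

In the notebook this would be called as `write_split(train_text, "/content/data/train.txt")`, and likewise for the other two splits.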

The following two cells are to be run if the pre-training needs to happen from a previous checkpoint of the model.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
!unzip -q /content/drive/MyDrive/NLP/mlm_training_3epochs.zip -d /content/ckpt

Pre-train MLM for 3 epochs:

In [None]:
DATA_DIR= "/content/data"
MODEL_DIR="/content/ckpt/model_pretrained" # Path to the downloaded model checkpoint
# Note: to resume from a checkpoint, replace "--model_name_or_path bert-base-cased" with "--model_name_or_path $MODEL_DIR"
OUTPUT_DIR="/content/output/mlm"
!python3 /content/CharBERT-main/run_lm_finetuning.py --model_type bert --model_name_or_path bert-base-cased --do_train --do_eval --char_vocab /content/CharBERT-main/data/dict/bert_char_vocab --term_vocab /content/CharBERT-main/data/dict/term_vocab --train_data_file $DATA_DIR/train.txt --eval_data_file $DATA_DIR/eval.txt --learning_rate 3e-5 --num_train_epochs 1 --mlm_probability 0.10 --input_nraws 1000 --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 4 --save_steps 10000 --block_size 384 --overwrite_output_dir --mlm --output_dir ${OUTPUT_DIR}

In [None]:
!zip -r mlm_epoch_3.zip /content/$/content/output/mlm
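The `/content/$/content/output/mlm` path in the zip command looks odd but is consistent with the training cell. A likely explanation, stated here as an assumption rather than something verified in this notebook: in a `!` command IPython expands `{OUTPUT_DIR}` but leaves the preceding `$` untouched, so the script receives the relative path `$/content/output/mlm` and creates it under the working directory, assumed to be `/content`. The sketch below only reproduces that path arithmetic.

```python
import posixpath

OUTPUT_DIR = "/content/output/mlm"
working_dir = "/content"  # assumed working directory of the notebook

# Assumed result of expanding "${OUTPUT_DIR}": a literal "$" followed by the value.
received_arg = "$" + OUTPUT_DIR

# A relative path is resolved against the working directory.
resolved = posixpath.join(working_dir, received_arg)
print(resolved)
```

Under the same assumption, writing `$OUTPUT_DIR` without braces would avoid the extra `$` folder, but every later path in this notebook that contains `/$/` would then have to change accordingly.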

Note that the same code was also used to train the model for 3 more epochs (6 in total), to obtain a Wikipedia-only model that can be compared with the one fine-tuned on Twitter data in the later experiments.

# NER

On CoNLL-2003.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
!unzip -q /content/drive/MyDrive/NLP/conll2003.zip -d /content/CharBERT-main/data # Download CoNLL-2003

In [None]:
!unzip -q /content/drive/MyDrive/NLP/mlm_training_3epochs.zip -d /content/ckpt # Checkpoint of model

In [None]:
#NER
DATA_DIR= "/content/CharBERT-main/data/conll2003" # Path to data
MODEL_DIR= "/content/kaggle/working/$/kaggle/working/mlm" # Path to checkpoint of model
OUTPUT_DIR="/content/output/ner"
!python3 /content/CharBERT-main/run_ner.py --model_type bert --data_dir $DATA_DIR --model_name_or_path $MODEL_DIR --do_train --do_predict --char_vocab /content/CharBERT-main/data/dict/bert_char_vocab --learning_rate 3e-5 --num_train_epochs 1  --per_gpu_train_batch_size 4 --overwrite_output_dir --output_dir ${OUTPUT_DIR}

In [None]:
!rm -rf /content/$/content/output/ner/checkpoint-1150
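The checkpoint number in the cell above depends on the number of training steps, so it has to be edited by hand for each run. As an alternative, here is a small sketch that removes every intermediate checkpoint folder. It assumes that checkpoints are saved as `checkpoint-<step>` subfolders of the output directory, as the paths in this notebook suggest; `remove_checkpoints` is a hypothetical helper, and the demo runs on a temporary folder with made-up contents.

```python
import glob
import os
import shutil
import tempfile

def remove_checkpoints(output_dir):
    """Delete every `checkpoint-*` subfolder of `output_dir` and return their names."""
    removed = []
    # glob.escape protects against glob metacharacters in the directory name.
    pattern = os.path.join(glob.escape(output_dir), "checkpoint-*")
    for path in sorted(glob.glob(pattern)):
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(os.path.basename(path))
    return removed

# Demo on a throwaway directory that mimics an output folder.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "checkpoint-500"))
os.makedirs(os.path.join(demo_dir, "checkpoint-1150"))
with open(os.path.join(demo_dir, "pytorch_model.bin"), "w") as f:
    f.write("placeholder")

removed = remove_checkpoints(demo_dir)
remaining = os.listdir(demo_dir)
print(removed, remaining)
```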

In [None]:
!zip -r ner_conll_epoch3.zip /content/$/content/output/ner

# Fine-tuning on Twitter data

Here we perform domain adaptation of CharBERT on social media data from Twitter. The idea is to take a dataset of tweets of roughly the same size as the Wikipedia portion used for pre-training, and further fine-tune CharBERT (the model obtained after 3 epochs of training on English Wikipedia) for 3 additional epochs. Note that for the experiments in this setting we also trained the model for 3 additional epochs on English Wikipedia, so that the two models are comparable on the downstream task.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
!unzip -q /content/drive/MyDrive/NLP/twitter_cikm_2010.zip -d /content/twitter_data

In [None]:
import pandas as pd
df=pd.read_csv('/content/twitter_data/training_set_tweets.txt', sep='\t', on_bad_lines = 'skip', header = None)

In [None]:
texts = df[2]
tweets = []
for text in texts.values:
  if isinstance(text, str):
    tweets.append(text)

In [None]:
import re
def remove_lines_with_only_numbers(tweets):
    filtered_lines = [tw for tw in tweets if not re.match(r'^\d+$', tw)]
    return filtered_lines
filtered = remove_lines_with_only_numbers(tweets)
filt = pd.DataFrame(filtered)
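To see what the filter above keeps and drops, here is the same function applied to a few made-up strings (the sample is purely illustrative and does not come from the Twitter dataset). Only entries that consist entirely of digits are removed; entries that merely contain digits are kept.

```python
import re

def remove_lines_with_only_numbers(tweets):
    # Keep a line unless it is made up of digits from start to end.
    return [tw for tw in tweets if not re.match(r'^\d+$', tw)]

sample = ["12345", "hello 123", "42 is the answer", "007", "2010-01-01"]
kept = remove_lines_with_only_numbers(sample)
print(kept)  # ['hello 123', '42 is the answer', '2010-01-01']
```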

In [None]:
train = filt[:22000]
eval = filt[22000:29000]
test = filt[29000:36000]

In [None]:
text = ''
for el in train.values:
  text += el[0] + '\n'
with open("/content/twitter_data/train.txt", 'w', encoding='utf-8') as f:
  f.write(text)

In [None]:
text = ''
for el in eval.values:
  text += el[0] + '\n'
with open("/content/twitter_data/eval.txt", 'w', encoding='utf-8') as f:
  f.write(text)

In [None]:
text = ''
for el in test.values:
  text += el[0] + '\n'
with open("/content/twitter_data/test.txt", 'w', encoding='utf-8') as f:
  f.write(text)

In [None]:
# Download model checkpoint
!unzip -q /content/drive/MyDrive/NLP/mlm_training_3epochs.zip -d /content/ckpt

In [None]:
# Training on tweets
DATA_DIR= "/content/twitter_data"
MODEL_DIR="/content/ckpt/content/$/content/output/mlm"
OUTPUT_DIR="/content/output/mlm"
!python3 /content/CharBERT-main/run_lm_finetuning.py --model_type bert --model_name_or_path $MODEL_DIR --do_train --do_eval --char_vocab /content/CharBERT-main/data/dict/bert_char_vocab --term_vocab /content/CharBERT-main/data/dict/term_vocab --train_data_file $DATA_DIR/train.txt --eval_data_file $DATA_DIR/eval.txt --learning_rate 3e-5 --num_train_epochs 1 --mlm_probability 0.10 --input_nraws 1000 --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 4 --save_steps 10000 --block_size 384 --overwrite_output_dir --mlm --output_dir ${OUTPUT_DIR}

In [None]:
!zip -r fine_tuning_twitter_epoch6.zip /content/$/content/output/mlm

# NER Twitter

Here we assess the performance of the model fine-tuned on Twitter data on a NER dataset taken from Twitter, and compare it with the model trained on English Wikipedia.

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("tner/tweetner7")

We need to map the labels, because CharBERT works with four main entity types (PER, ORG, LOC, MISC), while this dataset has seven.

In [None]:
label_mapping = {0:'B-ORG',1:'B-MISC',2:'B-MISC',3:'B-ORG',4:'B-LOC',5:'B-PER',6:'B-MISC',7:'I-ORG',8:'I-MISC',9:'I-MISC',10:'I-ORG',11:'I-LOC',12:'I-PER',13:'I-MISC',14:'O'}
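The numeric keys in `label_mapping` are easier to audit when the original label names are spelled out. The sketch below rebuilds the same dictionary from names. The list `TWEETNER7_LABELS` is an assumption: it reflects the id order we believe is published on the dataset card, so check it against the dataset before relying on it. `TYPE_TO_CONLL` and `to_conll` are hypothetical names introduced for this example.

```python
# Assumed id order of the tweetner7 labels (index in the list = label id).
TWEETNER7_LABELS = [
    "B-corporation", "B-creative_work", "B-event", "B-group",
    "B-location", "B-person", "B-product",
    "I-corporation", "I-creative_work", "I-event", "I-group",
    "I-location", "I-person", "I-product",
    "O",
]

# Seven fine-grained entity types collapsed onto the four CoNLL-style types.
TYPE_TO_CONLL = {
    "corporation": "ORG", "group": "ORG",
    "location": "LOC",
    "person": "PER",
    "creative_work": "MISC", "event": "MISC", "product": "MISC",
}

def to_conll(label):
    """Map e.g. 'B-corporation' to 'B-ORG'; the 'O' label is left unchanged."""
    if label == "O":
        return "O"
    prefix, entity_type = label.split("-", 1)
    return f"{prefix}-{TYPE_TO_CONLL[entity_type]}"

rebuilt_mapping = {i: to_conll(name) for i, name in enumerate(TWEETNER7_LABELS)}
print(rebuilt_mapping)
```

Under the assumed label order, `rebuilt_mapping` is identical to the hand-written `label_mapping` above.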

We are going to use "train_all", "test_2021" and "validation_2021".

In [None]:
train = ''
for batch_list in dataset['train_all']:
  for i in range(len(batch_list['tokens'])):
    train += batch_list['tokens'][i] +' '+label_mapping[batch_list['tags'][i]]+'\n'
  train +='\n'

In [None]:
validation = ''
for batch_list in dataset['validation_2021']:
  for i in range(len(batch_list['tokens'])):
    validation += batch_list['tokens'][i] +' '+label_mapping[batch_list['tags'][i]]+'\n'
  validation += '\n'

In [None]:
text = ''
for batch_list in dataset['test_2021']:
  for i in range(len(batch_list['tokens'])):
    text += batch_list['tokens'][i] +' '+label_mapping[batch_list['tags'][i]]+'\n'
  text += '\n'
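The three cells above produce the same file layout: one `token label` pair per line and an empty line after each sentence. A compact sketch of that conversion follows. `sentences_to_conll` is a hypothetical helper, and the two toy sentences and the three-entry label dictionary are made up for the example.

```python
def sentences_to_conll(sentences, id_to_label):
    """Render sentences as 'token label' lines, with a blank line after each sentence."""
    lines = []
    for sentence in sentences:
        for token, tag in zip(sentence["tokens"], sentence["tags"]):
            lines.append(f"{token} {id_to_label[tag]}")
        lines.append("")  # blank line marks the end of the sentence
    return "\n".join(lines) + "\n"

# Toy data in the same shape as one row of the dataset.
toy_labels = {0: "B-PER", 1: "I-PER", 2: "O"}
toy_sentences = [
    {"tokens": ["Ada", "Lovelace", "wrote"], "tags": [0, 1, 2]},
    {"tokens": ["ok"], "tags": [2]},
]

conll_text = sentences_to_conll(toy_sentences, toy_labels)
print(conll_text)
```

In the notebook, the call would look like `sentences_to_conll(dataset['train_all'], label_mapping)`. For a non-empty split this yields the same string as the loops above.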

In [None]:
# On Colab the "data" folder may not exist yet, so create it first; adjust the paths to your environment
import os
os.makedirs("/content/data", exist_ok=True)
with open("/content/data/train.txt", 'w', encoding='utf-8') as f:
  f.write(train)

In [None]:
with open("/content/data/validation.txt", 'w', encoding='utf-8') as f:
  f.write(validation)

In [None]:
with open("/content/data/test.txt", 'w', encoding='utf-8') as f:
  f.write(text)

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
!unzip -q /content/drive/MyDrive/NLP_PROVE/mlm_epoch6.zip -d /content/ckpt

In [None]:
#NER
DATA_DIR= "/content/data"
MODEL_DIR= "/content/ckpt/content/$/content/output/mlm"
OUTPUT_DIR="/content/output/ner"
!python3 /content/CharBERT-main/run_ner.py  --model_type bert --data_dir $DATA_DIR --model_name_or_path $MODEL_DIR --do_train --do_predict --char_vocab /content/CharBERT-main/data/dict/bert_char_vocab --learning_rate 3e-5 --num_train_epochs 3 --save_steps 500 --per_gpu_train_batch_size 4 --overwrite_output_dir --output_dir ${OUTPUT_DIR}

In [None]:
!rm -rf /content/$/content/output/ner/checkpoint-4500

In [None]:
!zip -r tner_our_wikipedia_epoch3.zip /content/$/content/output/ner

# Multilingual extension

Here we are going to train a model on a dataset comprising both English and Italian Wikipedia.

In [None]:
from datasets import load_dataset

In [None]:
dataset_it = load_dataset("wikipedia", "20220301.it")

In [None]:
train_text_it = dataset_it['train'][:200]['text']

In [None]:
eval_text_it = dataset_it['train'][200:220]['text']
test_text_it = dataset_it['train'][220:240]['text']

In [None]:
dataset_en = load_dataset("wikipedia", "20220301.en")

In [None]:
train_text_en = dataset_en['train'][:30]['text']
eval_text_en = dataset_en['train'][30:37]['text']
test_text_en = dataset_en['train'][37:45]['text']

In [None]:
# On Colab the "data" folder may not exist yet, so create it first; adjust the paths to your environment
import os
os.makedirs("/content/data", exist_ok=True)
text = ''
for el in train_text_it:
  text += el
for el in train_text_en:
  text += el
with open("/content/data/train.txt", 'w', encoding='utf-8') as f:
  f.write(text)

In [None]:
text = ''
for el in eval_text_it:
  text += el
for el in eval_text_en:
  text += el
with open("/content/data/eval.txt", 'w', encoding='utf-8') as f:
  f.write(text)

In [None]:
text = ''
for el in test_text_it:
  text += el
for el in test_text_en:
  text += el
with open("/content/data/test.txt", 'w', encoding='utf-8') as f:
  f.write(text)

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
!unzip -q /content/drive/MyDrive/NLP/multilingual_it_eng_epoch3.zip -d /content/ckpt

In [None]:
DATA_DIR= "/content/data"
MODEL_DIR="/content/content/model_pretrained"
OUTPUT_DIR="/content/output/mlm"
!python3 /content/CharBERT-main/run_lm_finetuning.py --model_type bert --model_name_or_path bert-base-multilingual-cased --do_train --do_eval --char_vocab /content/CharBERT-main/data/dict/bert_char_vocab --term_vocab /content/CharBERT-main/data/dict/term_vocab --train_data_file $DATA_DIR/train.txt --eval_data_file $DATA_DIR/eval.txt --learning_rate 3e-5 --num_train_epochs 1 --mlm_probability 0.10 --input_nraws 1000 --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 4 --save_steps 10000 --block_size 384 --overwrite_output_dir --mlm --output_dir ${OUTPUT_DIR}

In [None]:
!zip -r multilingual_epoch1.zip /content/$/content/output/mlm

# NER multilingual

Here we assess the performance of CharBERT pre-trained on English and Italian Wikipedia data on the downstream NER task, evaluating each language separately.

In [None]:
dataset = load_dataset("Babelscape/wikineural")

In [None]:
label_mapping = {0:'O', 1: 'B-PER', 2:'I-PER', 3 : 'B-ORG', 4:'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8:'I-MISC'}

Italian Wikipedia NER:

In [None]:
train = ''
for phrase_number, batch_list in enumerate(dataset['train_it'][:3000]['tokens']):
    for i in range(len(batch_list)):
        train += batch_list[i] +' '+label_mapping[dataset['train_it'][phrase_number]['ner_tags'][i]]+'\n'
    train +='\n'

In [None]:
validation = ''
# Use the Italian validation split, consistent with the Italian training split above
for phrase_number, batch_list in enumerate(dataset['val_it'][:1900]['tokens']):
    for i in range(len(batch_list)):
        validation += batch_list[i] +' '+label_mapping[dataset['val_it'][phrase_number]['ner_tags'][i]]+'\n'
    validation +='\n'

In [None]:
test = ''
# Use the Italian test split, consistent with the Italian training split above
for phrase_number, batch_list in enumerate(dataset['test_it'][:1900]['tokens']):
    for i in range(len(batch_list)):
        test += batch_list[i] +' '+label_mapping[dataset['test_it'][phrase_number]['ner_tags'][i]]+'\n'
    test +='\n'

English Wikipedia NER:

In [None]:
train = ''
for phrase_number, batch_list in enumerate(dataset['train_en'][:3500]['tokens']):
    for i in range(len(batch_list)):
        train += batch_list[i] +' '+label_mapping[dataset['train_en'][phrase_number]['ner_tags'][i]]+'\n'
    train +='\n'

In [None]:
validation = ''
for phrase_number, batch_list in enumerate(dataset['val_en'][:2000]['tokens']):
    for i in range(len(batch_list)):
        validation += batch_list[i] +' '+label_mapping[dataset['val_en'][phrase_number]['ner_tags'][i]]+'\n'
    validation +='\n'

In [None]:
test = ''
for phrase_number, batch_list in enumerate(dataset['test_en'][:2000]['tokens']):
    for i in range(len(batch_list)):
        test += batch_list[i] +' '+label_mapping[dataset['test_en'][phrase_number]['ner_tags'][i]]+'\n'
    test +='\n'
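The loops in this section look up `dataset[split][phrase_number]` once per token, which re-reads the same row many times. Slicing a split returns a dictionary of columns, so tokens and tags can be read from a single slice instead. The sketch below shows the idea on a plain dictionary with the same shape; `columns_to_conll` is a hypothetical helper and the toy batch is made up.

```python
def columns_to_conll(batch, id_to_label):
    """Convert a column-oriented batch (dict of lists) to 'token label' lines."""
    parts = []
    for tokens, tags in zip(batch["tokens"], batch["ner_tags"]):
        parts.extend(f"{tok} {id_to_label[tag]}\n" for tok, tag in zip(tokens, tags))
        parts.append("\n")  # blank line after each sentence
    return "".join(parts)

# Toy batch with the same shape as dataset['train_en'][:2].
toy_labels = {0: "O", 1: "B-PER", 2: "I-PER"}
toy_batch = {
    "tokens": [["Alan", "Turing", "lived"], ["Yes"]],
    "ner_tags": [[1, 2, 0], [0]],
}

conll_text = columns_to_conll(toy_batch, toy_labels)
print(conll_text)
```

In the notebook this would be called as `columns_to_conll(dataset['train_en'][:3500], label_mapping)`, and likewise for the validation and test splits.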

Save the train/validation/test splits to text files.

In [None]:
# On Colab the "data" folder may not exist yet, so create it first; adjust the paths to your environment
import os
os.makedirs("/content/data", exist_ok=True)
with open("/content/data/train.txt", 'w', encoding='utf-8') as f:
  f.write(train)

In [None]:
with open("/content/data/validation.txt", 'w', encoding='utf-8') as f:
  f.write(validation)

In [None]:
with open("/content/data/test.txt", 'w', encoding='utf-8') as f:
  f.write(test)

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
!unzip -q /content/drive/MyDrive/NLP/multilingual_it_eng_epoch3.zip -d /content/ckpt

In [None]:
# NER MULTILINGUAL
DATA_DIR= "/content/data"
MODEL_DIR= "/content/ckpt/content/$/content/output/mlm"
OUTPUT_DIR="/content/output/ner"
!python3 /content/CharBERT-main/run_ner.py  --model_type bert --data_dir $DATA_DIR --model_name_or_path $MODEL_DIR --do_train --do_predict --char_vocab /content/CharBERT-main/data/dict/bert_char_vocab --learning_rate 3e-5 --num_train_epochs 3 --save_steps 500 --per_gpu_train_batch_size 4 --overwrite_output_dir --output_dir ${OUTPUT_DIR}

In [None]:
!rm -rf /content/$/content/output/ner/checkpoint-4500

In [None]:
!zip -r multilingual_ner_epoch1.zip /content/$/content/output/ner