This notebook contains all the code cells needed to reproduce our experiments on CharBERT.

We start from pre-training and then move to the Named Entity Recognition (NER) task, both on a different domain (Twitter) and on languages other than English, such as Spanish.

# Set up the environment

In [None]:
!rm -d -r /content/CharBERT-main

In [None]:
! git clone https://github.com/cmmedoro/CharBERT-main.git

fatal: destination path 'CharBERT-main' already exists and is not an empty directory.


Run the following cell instead of the clone above if you want to reproduce the experiment in which we modified the CharBERT architecture by changing how the character embeddings of a token are combined: instead of concatenating the first and last character, we use the mean and standard deviation of all the characters in the token. Both clones target the same directory, so delete any existing CharBERT-main folder first.

In [None]:
! git clone --single-branch --branch no_concatenation https://github.com/cmmedoro/CharBERT-main.git

fatal: destination path 'CharBERT-main' already exists and is not an empty directory.
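
To make the difference between the two variants concrete, here is a minimal sketch in plain Python, where lists of floats stand in for the per-character hidden vectors of one token. The function names and the toy vectors are illustrative assumptions, not code from the CharBERT repository. Using the population standard deviation is also an assumption, chosen so that single-character tokens are well defined.

```python
from statistics import fmean, pstdev

def concat_first_last(char_vecs):
    """Token representation built from the first and last character vectors."""
    return char_vecs[0] + char_vecs[-1]

def mean_std_pool(char_vecs):
    """Token representation built from the per-dimension mean and standard deviation."""
    dims = list(zip(*char_vecs))  # transpose: one tuple per hidden dimension
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) for d in dims]  # population std (assumption)
    return means + stds

# Toy token with three characters and 2-dimensional character vectors.
token_chars = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
first_last = concat_first_last(token_chars)
pooled = mean_std_pool(token_chars)
```

Both functions return a vector twice as wide as a single character vector, so under these assumptions the rest of the model would see the same input size in either variant.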


In [None]:
!pip install transformers



In [None]:
!pip install seqeval



In [None]:
!pip install tensorboardX



In [None]:
!pip install boto3



In [None]:
!pip install datasets



# Pre-train MLM on English Wikipedia (simple version)

For pre-training we use the Simple English edition of Wikipedia. We keep the first 800 articles and split them into train (500), validation (150) and test (150).

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("wikipedia", "20220301.simple")

In [None]:
dataset['train']

In [None]:
train_text = dataset['train'][:500]['text']
eval_text = dataset['train'][500:650]['text']
test_text = dataset['train'][650:800]['text']
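
The three slices above can also be produced by a small helper that takes the split sizes, which makes it harder to leave a gap or an overlap between splits. This is a sketch only: `split_by_sizes` is a hypothetical name, and the list of strings stands in for the article texts.

```python
def split_by_sizes(items, sizes):
    """Cut `items` into consecutive, non-overlapping chunks of the given sizes."""
    splits, start = [], 0
    for size in sizes:
        splits.append(items[start:start + size])
        start += size
    return splits

# Stand-in for the list of article texts used in the notebook.
articles = [f"article {i}" for i in range(800)]
train_split, eval_split, test_split = split_by_sizes(articles, [500, 150, 150])
```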

In [None]:
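import os
os.makedirs("/content/data", exist_ok=True)  # the directory does not exist on a fresh runtime
# Concatenate the articles into a single training file
with open("/content/data/train.txt", 'w', encoding='utf-8') as f:
  f.write(''.join(train_text))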

In [None]:
# Concatenate the articles into a single evaluation file
with open("/content/data/eval.txt", 'w', encoding='utf-8') as f:
  f.write(''.join(eval_text))

In [None]:
# Concatenate the articles into a single test file
with open("/content/data/test.txt", 'w', encoding='utf-8') as f:
  f.write(''.join(test_text))

Run the following two cells only if pre-training should resume from a previous checkpoint of the model.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
!unzip -q /content/drive/MyDrive/NLP/mlm_epoch5.zip -d /content/ckpt

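Pre-train MLM for 3 epochs (the command below sets "--num_train_epochs 1"; change it to 3, or rerun the cell from the previous checkpoint, to reach 3 epochs):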

In [None]:
DATA_DIR = "/content/data"
MODEL_DIR = "/content/ckpt/model_pretrained"  # Path to the extracted model checkpoint, if resuming from one
# To resume from a checkpoint, replace "--model_name_or_path bert-base-cased" with "--model_name_or_path $MODEL_DIR"
# Note: IPython expands "${OUTPUT_DIR}" to "$" followed by the value, so the model ends up in /content/$/content/output/mlm (the path used by the zip cell below)
OUTPUT_DIR = "/content/output/mlm"
!python3 /content/CharBERT-main/run_lm_finetuning.py --model_type bert --model_name_or_path bert-base-cased --do_train --do_eval --char_vocab /content/CharBERT-main/data/dict/bert_char_vocab --term_vocab /content/CharBERT-main/data/dict/term_vocab --train_data_file $DATA_DIR/train.txt --eval_data_file $DATA_DIR/eval.txt --learning_rate 3e-5 --num_train_epochs 1 --mlm_probability 0.10 --input_nraws 1000 --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 4 --save_steps 10000 --block_size 384 --overwrite_output_dir --mlm --output_dir ${OUTPUT_DIR}

In [None]:
!zip -r mlm_epoch_3.zip /content/$/content/output/mlm
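
The training command above relies on IPython interpolating Python variables into a shell line. As an alternative, the same command can be assembled as an argument list and passed to `subprocess.run`, which avoids any shell or IPython expansion. This is a sketch: `build_mlm_command` is a hypothetical helper, and the flags are copied from the cell above rather than checked against the script.

```python
import subprocess

def build_mlm_command(data_dir, model, output_dir, epochs=1):
    """Assemble the pre-training command as an argument list (no shell expansion)."""
    repo = "/content/CharBERT-main"
    return [
        "python3", f"{repo}/run_lm_finetuning.py",
        "--model_type", "bert",
        "--model_name_or_path", model,
        "--do_train", "--do_eval",
        "--char_vocab", f"{repo}/data/dict/bert_char_vocab",
        "--term_vocab", f"{repo}/data/dict/term_vocab",
        "--train_data_file", f"{data_dir}/train.txt",
        "--eval_data_file", f"{data_dir}/eval.txt",
        "--learning_rate", "3e-5",
        "--num_train_epochs", str(epochs),
        "--mlm_probability", "0.10",
        "--input_nraws", "1000",
        "--per_gpu_train_batch_size", "4",
        "--per_gpu_eval_batch_size", "4",
        "--save_steps", "10000",
        "--block_size", "384",
        "--overwrite_output_dir", "--mlm",
        "--output_dir", output_dir,
    ]

cmd = build_mlm_command("/content/data", "bert-base-cased", "/content/output/mlm", epochs=3)
# subprocess.run(cmd, check=True)  # uncomment to launch training
```

With this form the checkpoint is written to exactly `/content/output/mlm`, so the paths in the later zip and MODEL_DIR cells would have to be adjusted accordingly.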

Note that the same command was also used to train the model on Wikipedia for 3 more epochs (6 in total), so that it is comparable with the model fine-tuned on Twitter data in the experiments below.

# NER

We first fine-tune and evaluate the pre-trained model on CoNLL-2003.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
!unzip -q /content/drive/MyDrive/NLP/conll2003.zip -d /content/CharBERT-main/data # Extract CoNLL-2003

In [None]:
!unzip -q /content/drive/MyDrive/NLP/mlm_training_3epochs.zip -d /content/ckpt # Checkpoint of model

In [None]:
%cd /content

/content


In [None]:
# NER on CoNLL-2003
DATA_DIR = "/content/CharBERT-main/data/conll2003"  # Path to data
MODEL_DIR = "/content/kaggle/working/$/kaggle/working/mlm"  # Path to the model checkpoint; adjust it to where the archive above was extracted
OUTPUT_DIR = "/content/output/ner"
!python3 /content/CharBERT-main/run_ner.py --model_type bert --data_dir $DATA_DIR --model_name_or_path $MODEL_DIR --do_train --do_predict --char_vocab /content/CharBERT-main/data/dict/bert_char_vocab --learning_rate 3e-5 --num_train_epochs 1  --per_gpu_train_batch_size 4 --overwrite_output_dir --output_dir ${OUTPUT_DIR}

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
2024-01-13 11:17:29.455326: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-13 11:17:29.455386: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-13 11:17:29.457172: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
01/13/2024 11:17:32 - INFO - modeling.configuration_utils -   loading configuration file /content/kaggle/working/$/kaggle/working/mlm/config.json
01/13/2024 11:17:32 - INFO - modeling.configuration_utils -   Model config {
  "architectures": [
    "BertF

In [None]:
!rm -rf /content/$/content/output/ner/checkpoint-1150

In [None]:
!zip -r ner_conll_epoch3.zip /content/$/content/output/ner

# Fine-tuning on Twitter data

Here we perform domain adaptation of CharBERT on social media data from Twitter. We take a dataset of tweets of roughly the same size as the Wikipedia subset used for pre-training and further fine-tune CharBERT (the model obtained after 3 epochs of training on English Wikipedia) for 3 additional epochs. For the experiments in this setting we also trained the Wikipedia model for 3 additional epochs, so that the two models are comparable on the downstream task.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
!unzip -q /content/drive/MyDrive/NLP/twitter_cikm_2010.zip -d /content/twitter_data

In [None]:
import pandas as pd
df=pd.read_csv('/content/twitter_data/training_set_tweets.txt', sep='\t', on_bad_lines = 'skip', header = None)

In [None]:
df

Unnamed: 0,0,1,2,3
0,60730027,6320951896,@thediscovietnam coo. thanks. just dropped yo...,2009-12-03 18:41:07
1,60730027,6320673258,@thediscovietnam shit it ain't lettin me DM yo...,2009-12-03 18:31:01
2,60730027,6319871652,"@thediscovietnam hey cody, quick question...ca...",2009-12-03 18:01:51
3,60730027,6318151501,@smokinvinyl dang. you need anything? I got ...,2009-12-03 17:00:16
4,60730027,6317932721,"maybe i'm late in the game on this one, but th...",2009-12-03 16:52:36
...,...,...,...,...
2118320,(Reuters),,,
2118321,: Reuters - U.S. co.. http://bit.ly/5ET5vP,2009-11-25 16:23:28,62128121,6062799704
2118322,un!! So refreshing,2009-09-22 00:36:04,58171035,4167298270
2118323,ainydepressing tommorow when we decided to go ...,2009-08-19 01:08:00,46808596,3409939792


In [None]:
texts = df[2]
tweets = []
for text in texts.values:
  if isinstance(text, str):
    tweets.append(text)

In [None]:
import re
def remove_lines_with_only_numbers(tweets):
    filtered_lines = [tw for tw in tweets if not re.match(r'^\d+$', tw)]
    return filtered_lines
filtered = remove_lines_with_only_numbers(tweets)
filt = pd.DataFrame(filtered)
filt

Unnamed: 0,0
0,@thediscovietnam coo. thanks. just dropped yo...
1,@thediscovietnam shit it ain't lettin me DM yo...
2,"@thediscovietnam hey cody, quick question...ca..."
3,@smokinvinyl dang. you need anything? I got ...
4,"maybe i'm late in the game on this one, but th..."
...,...
2039079,@Jia_J <<-- FOLLOW! <<---
2039080,List for list? Who wants? List me and I'll lis...
2039081,RT @bieberlicious13
2039082,Pin:
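
The two filtering steps above (keep only strings, then drop entries made only of digits) can be checked on a handful of made-up entries. This is a sketch under the assumption that `re.fullmatch` is an acceptable replacement for the anchored `re.match`; the two differ only for strings ending in a newline.

```python
import re

def keep_textual(lines):
    """Drop entries that are not strings or that consist only of digits."""
    return [ln for ln in lines if isinstance(ln, str) and not re.fullmatch(r"\d+", ln)]

# Made-up entries: a tweet, a bare numeric id, a missing value, mixed text, an empty string.
sample = ["@user thanks!", "6320951896", float("nan"), "RT 2009", ""]
cleaned = keep_textual(sample)
```

Empty strings survive this filter, as they do in the original cell, so an extra check may be worth adding if blank lines in the training file are a concern.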


In [None]:
filt.to_csv("tweets.txt", sep='\n', index=False)

In [None]:
train_tw = filt[:22000]
eval_tw = filt[22000:29000]  # not named "eval", which would shadow the built-in

In [None]:
# Write one tweet per line
with open("/content/twitter_data/train_tw.txt", 'w', encoding='utf-8') as f:
  f.write(''.join(tw + '\n' for tw in train_tw[0]))

In [None]:
with open("/content/twitter_data/eval_tw.txt", 'w', encoding='utf-8') as f:
  f.write(''.join(tw + '\n' for tw in eval_tw[0]))
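
Since the tweet corpus is meant to be roughly the same size as the Wikipedia one, it can help to measure both files before training. The helper below is a sketch (`corpus_stats` is a hypothetical name) that counts lines and whitespace-separated tokens and reports the file size; the demo runs on a throwaway file so it does not depend on the notebook's data.

```python
import os
import tempfile

def corpus_stats(path):
    """Return (number of lines, number of whitespace-separated tokens, size in bytes)."""
    n_lines = n_tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            n_lines += 1
            n_tokens += len(line.split())
    return n_lines, n_tokens, os.path.getsize(path)

# Demo on a temporary file; in the notebook one would pass, for example,
# "/content/twitter_data/train_tw.txt" and "/content/data/train.txt".
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("hello twitter world\nsecond tweet\n")
    demo_stats = corpus_stats(demo_path)
```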

In [None]:
# Download model checkpoint
!unzip -q /content/drive/MyDrive/NLP/fine_tuning_twitter_epoch5.zip -d /content/ckpt

In [None]:
# Training on tweets
DATA_DIR = "/content/twitter_data"
MODEL_DIR = "/content/ckpt/content/$/content/output/mlm"  # Checkpoint extracted above (originally initialized from bert-base-cased)
OUTPUT_DIR = "/content/output/mlm"
!python3 /content/CharBERT-main/run_lm_finetuning.py --model_type bert --model_name_or_path $MODEL_DIR --do_train --do_eval --char_vocab /content/CharBERT-main/data/dict/bert_char_vocab --term_vocab /content/CharBERT-main/data/dict/term_vocab --train_data_file $DATA_DIR/train_tw.txt --eval_data_file $DATA_DIR/eval_tw.txt --learning_rate 3e-5 --num_train_epochs 1 --mlm_probability 0.10 --input_nraws 1000 --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 4 --save_steps 10000 --block_size 384 --overwrite_output_dir --mlm --output_dir ${OUTPUT_DIR}

Output streaming truncated to the last 5000 lines.
Iteration:  17% 939/5528 [17:05<1:21:53,  1.07s/it]
01/28/2024 16:03:42 - INFO - __main__ -   Reading the [47]th data block from dataset file at /content/twitter_data/train_tw.txt
Iteration:  17% 940/5528 [17:06<1:23:54,  1.10s/it]
Iteration:  17% 941/5528 [17:07<1:24:06,  1.10s/it]
Iteration:  17% 942/5528 [17:08<1:28:06,  1.15s/it]
Iteration:  17% 943/5528 [17:10<1:26:03,  1.13s/it]
Iteration:  17% 944/5528 [17:11<1:24:36,  1.11s/it]
Iteration:  17% 945/5528 [17:12<1:23:57,  1.10s/it]
Iteration:  17% 946/5528 [17:13<1:23:37,  1.10s/it]
Iteration:  17% 947/5528 [17:14<1:23:27,  1.09s/it]
Iteration:  17% 948/5528 [17:15<1:22:46,  1.08s/it]
Iteration:  17% 949/5528 [17:16<1:22:17,  1.08s/it]
Iteration:  17% 950/5528 [17:17<1:21:53,  1.07s/it]
Iteration:  17% 951/5528 [17:18<1:21:53,  1.07s/it]
Iteration:  17% 952/5528 [17:19<1:22:05,  1.08s/it]
Iteration:  17% 953/5528 [17:20<1:21:5

In [None]:
!zip -r fine_tuning_twitter_epoch6.zip /content/$/content/output/mlm

  adding: content/$/content/output/mlm/ (stored 0%)
  adding: content/$/content/output/mlm/tokenizer_config.json (deflated 75%)
  adding: content/$/content/output/mlm/pytorch_model.bin (deflated 7%)
  adding: content/$/content/output/mlm/eval_results.txt (deflated 17%)
  adding: content/$/content/output/mlm/config.json (deflated 52%)
  adding: content/$/content/output/mlm/special_tokens_map.json (deflated 80%)
  adding: content/$/content/output/mlm/vocab.txt (deflated 49%)
  adding: content/$/content/output/mlm/training_args.bin (deflated 52%)


# NER Twitter

Here we assess the model fine-tuned on Twitter data on a NER dataset built from tweets, and compare its performance with that of the model trained only on English Wikipedia.

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("tner/tweetner7")


We need to map the labels, because the CharBERT NER setup works with four entity types (PER, ORG, LOC, MISC), while this dataset has seven.

In [None]:
label_mapping = {0:'B-ORG',1:'B-MISC',2:'B-MISC',3:'B-ORG',4:'B-LOC',5:'B-PER',6:'B-MISC',7:'I-ORG',8:'I-MISC',9:'I-MISC',10:'I-ORG',11:'I-LOC',12:'I-PER',13:'I-MISC',14:'O'}
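
A quick sanity check on the mapping can catch typos before any file is written. The sketch below repeats the dictionary from the cell above and assumes, as that dictionary suggests, that ids 0 to 6 are the B- tags and ids 7 to 13 the matching I- tags.

```python
label_mapping = {0:'B-ORG',1:'B-MISC',2:'B-MISC',3:'B-ORG',4:'B-LOC',5:'B-PER',6:'B-MISC',
                 7:'I-ORG',8:'I-MISC',9:'I-MISC',10:'I-ORG',11:'I-LOC',12:'I-PER',13:'I-MISC',14:'O'}

# Every one of the 15 source tag ids must have a target label.
assert sorted(label_mapping) == list(range(15))

# Strip the B-/I- prefixes to see which entity types survive the mapping.
entity_types = sorted({label.split("-", 1)[1] for label in label_mapping.values() if label != "O"})

# Each B- id and its I- counterpart (id + 7, an assumption) must map to the same type.
consistent = all(
    label_mapping[i].startswith("B-")
    and label_mapping[i + 7].startswith("I-")
    and label_mapping[i][2:] == label_mapping[i + 7][2:]
    for i in range(7)
)
```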

In [None]:
dataset

DatasetDict({
    test_2020: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 576
    })
    test_2021: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 2807
    })
    validation_2020: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 576
    })
    validation_2021: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 310
    })
    train_2020: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 4616
    })
    train_2021: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 2495
    })
    train_all: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 7111
    })
    validation_random: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 576
    })
    train_random: Dataset({
        features: ['tokens', 'tags', 'id', 'date'],
        num_rows: 4616
    })
  

We are going to use "train_all", "test_2021" and "validation_2021".

In [None]:
len(dataset['train_2020'])

4616

In [None]:
train = ''
for batch_list in dataset['train_all']:
  for i in range(len(batch_list['tokens'])):
    train += batch_list['tokens'][i] +' '+label_mapping[batch_list['tags'][i]]+'\n'
  train +='\n'

In [None]:
validation = ''
for batch_list in dataset['validation_2021']:
  for i in range(len(batch_list['tokens'])):
    validation += batch_list['tokens'][i] +' '+label_mapping[batch_list['tags'][i]]+'\n'
  validation += '\n'

In [None]:
text = ''
for batch_list in dataset['test_2021']:
  for i in range(len(batch_list['tokens'])):
    text += batch_list['tokens'][i] +' '+label_mapping[batch_list['tags'][i]]+'\n'
  text += '\n'
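
The three cells above build the same "token label" format, one token per line with a blank line after each sentence. A single helper can do this for any split. This is a sketch: `to_conll` is a hypothetical name, and the toy sentences and the three-label mapping are made up for illustration.

```python
def to_conll(sentences, tag_names):
    """Serialize sentences as 'token label' lines with a blank line after each sentence."""
    lines = []
    for sent in sentences:
        for token, tag in zip(sent["tokens"], sent["tags"]):
            lines.append(f"{token} {tag_names[tag]}")
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines) + "\n"

# Two made-up sentences with a made-up 3-label mapping, for illustration only.
toy_names = {0: "B-PER", 1: "I-PER", 2: "O"}
toy_data = [
    {"tokens": ["Ada", "Lovelace", "wrote"], "tags": [0, 1, 2]},
    {"tokens": ["ok"], "tags": [2]},
]
conll_text = to_conll(toy_data, toy_names)
```

Collecting the lines in a list and joining them once also avoids repeated string concatenation, which can matter on the larger splits.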

In [None]:
with open("/content/data/train.txt", 'w', encoding='utf-8') as f:
  f.write(train)

In [None]:
with open("/content/data/validation.txt", 'w', encoding='utf-8') as f:
  f.write(validation)

In [None]:
with open("/content/data/test.txt", 'w', encoding='utf-8') as f:
  f.write(text)

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
!unzip -q /content/drive/MyDrive/NLP_PROVE/mlm_epoch6.zip -d /content/ckpt

In [None]:
# NER
DATA_DIR = "/content/data"
MODEL_DIR = "/content/ckpt/content/$/content/output/mlm"
OUTPUT_DIR = "/content/output/ner"
!python3 /content/CharBERT-main/run_ner.py  --model_type bert --data_dir $DATA_DIR --model_name_or_path $MODEL_DIR --do_train --do_predict --char_vocab /content/CharBERT-main/data/dict/bert_char_vocab --learning_rate 3e-5 --num_train_epochs 3 --save_steps 500 --per_gpu_train_batch_size 4 --overwrite_output_dir --output_dir ${OUTPUT_DIR}

Output streaming truncated to the last 5000 lines.
Iteration:  27% 480/1778 [02:30<06:46,  3.19it/s]
Iteration:  27% 481/1778 [02:30<06:47,  3.18it/s]
Iteration:  27% 482/1778 [02:30<06:47,  3.18it/s]
Iteration:  27% 483/1778 [02:31<06:45,  3.19it/s]
Iteration:  27% 484/1778 [02:31<06:44,  3.20it/s]
Iteration:  27% 485/1778 [02:31<06:42,  3.21it/s]
Iteration:  27% 486/1778 [02:31<06:41,  3.22it/s]
Iteration:  27% 487/1778 [02:32<06:41,  3.22it/s]
Iteration:  27% 488/1778 [02:32<06:39,  3.23it/s]
Iteration:  28% 489/1778 [02:32<06:39,  3.23it/s]
Iteration:  28% 490/1778 [02:33<06:40,  3.22it/s]
Iteration:  28% 491/1778 [02:33<06:39,  3.22it/s]
Iteration:  28% 492/1778 [02:33<06:42,  3.19it/s]
Iteration:  28% 493/1778 [02:34<06:42,  3.19it/s]
Iteration:  28% 494/1778 [02:34<06:42,  3.19it/s]
Iteration:  28% 495/1778 [02:34<06:39,  3.21it/s]
Iteration:  28% 496/1778 [02:35<06:37,  3.22it/s]
Iteration:  28% 497/1778 [02:35<06:

In [None]:
!rm -rf /content/$/content/output/ner/checkpoint-4500

In [None]:
!zip -r tner_our_wikipedia_epoch3.zip /content/$/content/output/ner

  adding: content/$/content/output/ner/ (stored 0%)
  adding: content/$/content/output/ner/vocab.txt (deflated 49%)
  adding: content/$/content/output/ner/config.json (deflated 52%)
  adding: content/$/content/output/ner/test_results.txt (deflated 20%)
  adding: content/$/content/output/ner/tokenizer_config.json (deflated 75%)
  adding: content/$/content/output/ner/training_args.bin (deflated 50%)
  adding: content/$/content/output/ner/special_tokens_map.json (deflated 80%)
  adding: content/$/content/output/ner/pytorch_model.bin (deflated 7%)
  adding: content/$/content/output/ner/test_predictions.txt (deflated 65%)
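
When comparing the two models, NER quality is usually reported at the entity level, which is what the seqeval package installed above computes. The sketch below shows the idea in plain Python on one made-up tag sequence. It is a simplified illustration, not seqeval's implementation: an I- tag that does not continue the current entity is treated as opening a new one, and the input is a single flat sequence.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                spans.add((etype, start, i))  # close the span that just ended
            if tag == "O":
                start, etype = None, None
            else:
                start, etype = i, tag[2:]  # B-x, or an I-x that opens a new span
    return spans

def span_f1(gold, pred):
    """Entity-level precision, recall and F1: a span counts only on an exact match."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up example: the person is found, the location is mislabeled as an organization.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "I-PER", "O", "B-ORG"]
scores = span_f1(gold_tags, pred_tags)
```

To score several sentences with this sketch, put an "O" between them when flattening, so that an entity at the end of one sentence cannot merge with one at the start of the next.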


# Multilingual extension

Here we train a model on a dataset comprising both English and Italian Wikipedia. (The variables below keep an "_fr" suffix, but the data they hold is Italian.)

In [None]:
from datasets import load_dataset

In [None]:
dataset_fr = load_dataset("wikipedia", "20220301.it")

In [None]:
len(dataset_fr['train'])

2402095

In [None]:
dataset_fr['train']

Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 2402095
})

In [None]:
train_text_fr = dataset_fr['train'][:30]['text']

In [None]:
eval_text_fr = dataset_fr['train'][30:40]['text']
test_text_fr = dataset_fr['train'][40:60]['text']

In [None]:
dataset_en = load_dataset("wikipedia", "20220301.en")

In [None]:
train_text_en = dataset_en['train'][:30]['text']
eval_text_en = dataset_en['train'][30:37]['text']
test_text_en = dataset_en['train'][37:45]['text']

In [None]:
text = ''
for el in train_text_fr:
  text += el
for el in train_text_en:
  text += el
with open("/content/content/data/train.txt", 'w', encoding='utf-8') as f:
  f.write(text)

In [None]:
text = ''
for el in eval_text_fr:
  text += el
for el in eval_text_en:
  text += el
with open("/content/content/data/eval.txt", 'w', encoding='utf-8') as f:
  f.write(text)

OSError: [Errno 28] No space left on device
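
The error above is the operating system reporting a full disk; downloading the full English Wikipedia dump may have contributed to it, although that is only a guess. A quick check of the free space before writing can make this failure easier to anticipate. The sketch uses only the standard library, and the 1 GiB threshold is an arbitrary illustrative choice.

```python
import shutil

def free_gib(path="."):
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30

available = free_gib(".")
if available < 1.0:  # arbitrary threshold, for illustration only
    print(f"Only {available:.2f} GiB free; consider freeing space before writing.")
```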

In [None]:
text = ''
for el in test_text_fr:
  text += el
for el in test_text_en:
  text += el
with open("/content/content/data/test.txt", 'w', encoding='utf-8') as f:
  f.write(text)

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
!unzip -q /content/drive/MyDrive/NLP/mlm_epoch5.zip -d /content/ckpt

In [None]:
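DATA_DIR = "/content/content/data"  # Must match the directory the splits were written to above
MODEL_DIR = "/content/content/model_pretrained"  # Unused here: training starts from bert-base-multilingual-cased
OUTPUT_DIR = "/content/output/mlm"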
!python3 /content/CharBERT-main/run_lm_finetuning.py --model_type bert --model_name_or_path bert-base-multilingual-cased --do_train --do_eval --char_vocab /content/CharBERT-main/data/dict/bert_char_vocab --term_vocab /content/CharBERT-main/data/dict/term_vocab --train_data_file $DATA_DIR/train.txt --eval_data_file $DATA_DIR/eval.txt --learning_rate 3e-5 --num_train_epochs 1 --mlm_probability 0.10 --input_nraws 1000 --per_gpu_train_batch_size 4 --per_gpu_eval_batch_size 4 --save_steps 10000 --block_size 384 --overwrite_output_dir --mlm --output_dir ${OUTPUT_DIR}

2024-02-14 08:29:10.133580: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-14 08:29:10.133638: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-14 08:29:10.135367: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-14 08:29:10.143162: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[nltk_data] Downloading package punkt to /root/nltk_d

In [None]:
!zip -r multilingual_epoch1.zip /content/$/content/output/mlm

  adding: content/$/content/output/mlm/ (stored 0%)
  adding: content/$/content/output/mlm/config.json (deflated 52%)
  adding: content/$/content/output/mlm/vocab.txt (deflated 49%)
  adding: content/$/content/output/mlm/tokenizer_config.json (deflated 75%)
  adding: content/$/content/output/mlm/special_tokens_map.json (deflated 80%)
  adding: content/$/content/output/mlm/training_args.bin (deflated 52%)
  adding: content/$/content/output/mlm/eval_results.txt (deflated 16%)
  adding: content/$/content/output/mlm/pytorch_model.bin (deflated 7%)


# NER multilingual

Here we assess CharBERT pre-trained on English and Italian Wikipedia data on the downstream NER task, evaluating each language separately.

In [None]:
dataset = load_dataset("Babelscape/wikineural")

In [None]:
label_mapping = {0:'O', 1: 'B-PER', 2:'I-PER', 3 : 'B-ORG', 4:'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8:'I-MISC'}

Italian Wikipedia NER:

In [None]:
train = ''
for phrase_number, batch_list in enumerate(dataset['train_it'][:6000]['tokens']):
    for i in range(len(batch_list)):
        train += batch_list[i] +' '+label_mapping[dataset['train_it'][phrase_number]['ner_tags'][i]]+'\n'
    train +='\n'

In [None]:
validation = ''
for phrase_number, batch_list in enumerate(dataset['val_it'][:2000]['tokens']):
    for i in range(len(batch_list)):
        validation += batch_list[i] +' '+label_mapping[dataset['val_it'][phrase_number]['ner_tags'][i]]+'\n'
    validation +='\n'

In [None]:
test = ''
for phrase_number, batch_list in enumerate(dataset['test_it'][:2000]['tokens']):
    for i in range(len(batch_list)):
        test += batch_list[i] +' '+label_mapping[dataset['test_it'][phrase_number]['ner_tags'][i]]+'\n'
    test +='\n'

English Wikipedia NER:

In [None]:
train = ''
for phrase_number, batch_list in enumerate(dataset['train_en'][:6000]['tokens']):
    for i in range(len(batch_list)):
        train += batch_list[i] +' '+label_mapping[dataset['train_en'][phrase_number]['ner_tags'][i]]+'\n'
    train +='\n'

In [None]:
validation = ''
for phrase_number, batch_list in enumerate(dataset['val_en'][:2000]['tokens']):
    for i in range(len(batch_list)):
        validation += batch_list[i] +' '+label_mapping[dataset['val_en'][phrase_number]['ner_tags'][i]]+'\n'
    validation +='\n'

In [None]:
test = ''
for phrase_number, batch_list in enumerate(dataset['test_en'][:2000]['tokens']):
    for i in range(len(batch_list)):
        test += batch_list[i] +' '+label_mapping[dataset['test_en'][phrase_number]['ner_tags'][i]]+'\n'
    test +='\n'

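Save the train/validation/test splits to text files. The Italian and English cells above write to the same variables, so run only the block for the language you want to evaluate before saving.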

In [None]:
with open("/content/data/train.txt", 'w', encoding='utf-8') as f:
  f.write(train)

In [None]:
with open("/content/data/validation.txt", 'w', encoding='utf-8') as f:
  f.write(validation)

In [None]:
with open("/content/data/test.txt", 'w', encoding='utf-8') as f:
  f.write(test)
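
Before launching training it can be useful to read a saved file back and confirm that it contains the expected number of sentences and a plausible label distribution. The helper below is a sketch (`conll_summary` is a hypothetical name), and the demo uses a temporary file with made-up content rather than the notebook's data.

```python
import os
import tempfile
from collections import Counter

def conll_summary(path):
    """Count sentences and label frequencies in a 'token label' file."""
    n_sentences, labels, in_sentence = 0, Counter(), False
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:  # blank line closes the current sentence
                if in_sentence:
                    n_sentences += 1
                in_sentence = False
                continue
            labels[parts[-1]] += 1  # the label is the last field on the line
            in_sentence = True
    if in_sentence:  # file did not end with a blank line
        n_sentences += 1
    return n_sentences, labels

# Demo on a temporary file; in the notebook one would pass "/content/data/train.txt".
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Roma B-LOC\nbella O\n\nok O\n\n")
    demo_sentences, demo_labels = conll_summary(demo_path)
```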

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
!unzip -q /content/drive/MyDrive/NLP/multilingual_epoch3.zip -d /content/ckpt

In [None]:
# NER multilingual
DATA_DIR = "/content/data"
MODEL_DIR = "/content/ckpt/content/$/content/output/mlm"
OUTPUT_DIR = "/content/output/ner"
!python3 /content/CharBERT-main/run_ner.py  --model_type bert --data_dir $DATA_DIR --model_name_or_path $MODEL_DIR --do_train --do_predict --char_vocab /content/CharBERT-main/data/dict/bert_char_vocab --learning_rate 3e-5 --num_train_epochs 3 --save_steps 500 --per_gpu_train_batch_size 4 --overwrite_output_dir --output_dir ${OUTPUT_DIR}

Output streaming truncated to the last 5000 lines.
Iteration:  27% 480/1778 [02:30<06:46,  3.19it/s]
Iteration:  27% 481/1778 [02:30<06:47,  3.18it/s]
Iteration:  27% 482/1778 [02:30<06:47,  3.18it/s]
Iteration:  27% 483/1778 [02:31<06:45,  3.19it/s]
Iteration:  27% 484/1778 [02:31<06:44,  3.20it/s]
Iteration:  27% 485/1778 [02:31<06:42,  3.21it/s]
Iteration:  27% 486/1778 [02:31<06:41,  3.22it/s]
Iteration:  27% 487/1778 [02:32<06:41,  3.22it/s]
Iteration:  27% 488/1778 [02:32<06:39,  3.23it/s]
Iteration:  28% 489/1778 [02:32<06:39,  3.23it/s]
Iteration:  28% 490/1778 [02:33<06:40,  3.22it/s]
Iteration:  28% 491/1778 [02:33<06:39,  3.22it/s]
Iteration:  28% 492/1778 [02:33<06:42,  3.19it/s]
Iteration:  28% 493/1778 [02:34<06:42,  3.19it/s]
Iteration:  28% 494/1778 [02:34<06:42,  3.19it/s]
Iteration:  28% 495/1778 [02:34<06:39,  3.21it/s]
Iteration:  28% 496/1778 [02:35<06:37,  3.22it/s]
Iteration:  28% 497/1778 [02:35<06:

In [None]:
!rm -rf /content/$/content/output/ner/checkpoint-4500

In [None]:
!zip -r multilingual_ner_epoch1.zip /content/$/content/output/ner