<a href="https://colab.research.google.com/github/arnaudmkonan/transformers/blob/main/akonan_address_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to train a new language model from scratch using Transformers and Tokenizers

### Notebook edition (see the original [blog post](https://huggingface.co/blog/how-to-train)). Last update May 15, 2020


Over the past few months, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it easier than ever to **train a new language model from scratch**.

In this post we’ll demo how to train a “small” model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads) – that’s the same number of layers & heads as DistilBERT – on **Esperanto**. We’ll then fine-tune the model on a downstream task of part-of-speech tagging.


## 1. Find a dataset

First, let us find a corpus of text in Esperanto. Here we’ll use the Esperanto portion of the [OSCAR corpus](https://traces1.inria.fr/oscar/) from INRIA.
OSCAR is a huge multilingual corpus obtained by language classification and filtering of [Common Crawl](https://commoncrawl.org/) dumps of the Web.

<img src="https://huggingface.co/blog/assets/01_how-to-train/oscar.png" style="margin: auto; display: block; width: 260px;">

The Esperanto portion of the dataset is only 299M, so we’ll concatenate it with the Esperanto sub-corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), which comprises text from diverse sources like news, literature, and Wikipedia.

The final training corpus has a size of 3 GB, which is still small – for your model, you will get better results the more data you can get to pretrain on. 



In [87]:
# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt


# !wget -c https://adresse.data.gouv.fr/data/ban/adresses/latest/csv/adresses-france.csv.gz

--2022-09-26 06:05:54--  https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
Resolving cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)... 54.192.18.17, 54.192.18.43, 54.192.18.58, ...
Connecting to cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)|54.192.18.17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 312733741 (298M) [text/plain]
Saving to: ‘oscar.eo.txt’


2022-09-26 06:06:12 (17.6 MB/s) - ‘oscar.eo.txt’ saved [312733741/312733741]



In [2]:
%%bash

wget https://nationaladdressdata.s3.amazonaws.com/NAD_r11_TXT.zip -O temp.zip
unzip temp.zip
rm temp.zip

Archive:  temp.zip
   creating: TXT/
   creating: TXT/info/
 extracting: TXT/info/arc.dir        
  inflating: TXT/NAD_r11.txt         
  inflating: TXT/NAD_r11.txt.xml     
  inflating: TXT/NationalAddressDatabaseMetadata_v11.0.xml  
  inflating: TXT/schema.ini          





In [3]:
%%bash


head -n 2 /content/TXT/NAD_r11.txt

OID,State,County,Inc_Muni,Uninc_Comm,Nbrhd_Comm,Post_Comm,Zip_Code,Plus_4,Bulk_Zip,Bulk_Plus4,StN_PreMod,StN_PreDir,StN_PreTyp,StN_PreSep,StreetName,StN_PosTyp,StN_PosDir,StN_PosMod,AddNum_Pre,Add_Number,AddNum_Suf,LandmkPart,LandmkName,Building,Floor,Unit,Room,Addtl_Loc,Milepost,Longitude,Latitude,NatGrid_Coord,GUID,Addr_Type,Placement,Source,AddAuth,UniqWithin,LastUpdate,Effective,Expired
-1,CO,Adams,BRIGHTON,,,,80601,,,,,South,,,Cabbage,Avenue,,,,282,,,,,,,,,,-104.821935861388994,39.982623567676498,13SEE1520325843,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,


In [88]:
%%bash

head -n 2 /content/oscar.eo.txt

Ĉu ... preĝi | mediti | ricevi instigojn || kanti | muziki || informiĝi | legi | studi || prepari Diservon
Temas pri kolekto de kristanaj kantoj, eldonita de Adolf Burkhardt inter 1974 kaj 1990 en dek kajeretoj. Ili estas reeldonitaj inter 1995 kaj 1998 de Bernhard Eichkorn en tri kajeroj, kies tria estas pliampleksigita per Dek Novaj Kantoj kaj suplemento, same de Adolf Burkhardt.


In [4]:
import pandas as pd

# df_ban = pd.read_csv('/content/adresses-france.csv.gz', sep=';') 
# df_ban.head(12)

In [5]:
import pandas as pd
pd.options.display.max_columns = 50
# pd.options.display.max_rows = 50

In [3]:
import pandas as pd
df_nad = pd.read_csv("/content/TXT/NAD_r11.txt", 
                     sep=',', 
                    #  low_memory=False, 
                     nrows=1_000_000
                     )
df_nad.head()



Unnamed: 0,OID,State,County,Inc_Muni,Uninc_Comm,Nbrhd_Comm,Post_Comm,Zip_Code,Plus_4,Bulk_Zip,...,NatGrid_Coord,GUID,Addr_Type,Placement,Source,AddAuth,UniqWithin,LastUpdate,Effective,Expired
0,-1,CO,Adams,BRIGHTON,,,,80601.0,,,...,13SEE1520325843,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,
1,-1,CO,Adams,BRIGHTON,,,,80601.0,,,...,13SEE1528726184,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,
2,-1,CO,Adams,,,,,80103.0,,,...,13SEE6552801229,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,
3,-1,CO,Adams,,,,,80105.0,,,...,13SEE8440901177,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,
4,-1,CO,Adams,,,,,80103.0,,,...,13SEE6613807911,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,


In [4]:
df_nad.shape

(1000000, 42)

In [5]:
headers_needed = ['State', 'County', 'Inc_Muni', 'Zip_Code', 'StN_PreDir', 'StreetName', 'StN_PosTyp', 'Add_Number', 'Building', 'Unit', 'NatGrid_Coord']

In [6]:
df_nad[headers_needed].isnull().sum()

State                 0
County                0
Inc_Muni          46023
Zip_Code            192
StN_PreDir       369976
StreetName            2
StN_PosTyp        17931
Add_Number            0
Building         997947
Unit             723626
NatGrid_Coord         0
dtype: int64

In [7]:
df_nad[headers_needed].head(25)

Unnamed: 0,State,County,Inc_Muni,Zip_Code,StN_PreDir,StreetName,StN_PosTyp,Add_Number,Building,Unit,NatGrid_Coord
0,CO,Adams,BRIGHTON,80601.0,South,Cabbage,Avenue,282,,,13SEE1520325843
1,CO,Adams,BRIGHTON,80601.0,,CABBAGE,Street,60,,,13SEE1528726184
2,CO,Adams,,80103.0,East,26TH,Court,64400,,,13SEE6552801229
3,CO,Adams,,80105.0,East,26TH,Avenue,83107,,,13SEE8440901177
4,CO,Adams,,80103.0,,PEERLESS MINE,Road,6571,,,13SEE6613807911
5,CO,Adams,,80103.0,,BEHRENS,Road,4680,,,13SEE7274003896
6,CO,Adams,,80103.0,,BEHRENS,Road,5505,,,13SEE7252705367
7,CO,Adams,,80103.0,,CALHOUN BYERS,Road,6570,,,13SEE6954607869
8,CO,Adams,,80103.0,,BEHRENS,Road,4500,,,13SEE7270703716
9,CO,Adams,,80103.0,,BEHRENS,Road,5440,,,13SEE7281205123


In [8]:
df_nad.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 42 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   OID            1000000 non-null  int64  
 1   State          1000000 non-null  object 
 2   County         1000000 non-null  object 
 3   Inc_Muni       953977 non-null   object 
 4   Uninc_Comm     0 non-null        float64
 5   Nbrhd_Comm     0 non-null        float64
 6   Post_Comm      0 non-null        float64
 7   Zip_Code       999808 non-null   float64
 8   Plus_4         0 non-null        float64
 9   Bulk_Zip       0 non-null        float64
 10  Bulk_Plus4     0 non-null        float64
 11  StN_PreMod     0 non-null        float64
 12  StN_PreDir     630024 non-null   object 
 13  StN_PreTyp     5768 non-null     object 
 14  StN_PreSep     0 non-null        float64
 15  StreetName     999998 non-null   object 
 16  StN_PosTyp     982069 non-null   object 
 17  StN_PosDi

In [9]:
df_nad[headers_needed].nunique()

State                 1
County               22
Inc_Muni             72
Zip_Code            165
StN_PreDir            8
StreetName         8176
StN_PosTyp           47
Add_Number        32134
Building              3
Unit              23446
NatGrid_Coord    904754
dtype: int64

In [10]:
df_nad.keys()

Index(['OID', 'State', 'County', 'Inc_Muni', 'Uninc_Comm', 'Nbrhd_Comm',
       'Post_Comm', 'Zip_Code', 'Plus_4', 'Bulk_Zip', 'Bulk_Plus4',
       'StN_PreMod', 'StN_PreDir', 'StN_PreTyp', 'StN_PreSep', 'StreetName',
       'StN_PosTyp', 'StN_PosDir', 'StN_PosMod', 'AddNum_Pre', 'Add_Number',
       'AddNum_Suf', 'LandmkPart', 'LandmkName', 'Building', 'Floor', 'Unit',
       'Room', 'Addtl_Loc', 'Milepost', 'Longitude', 'Latitude',
       'NatGrid_Coord', 'GUID', 'Addr_Type', 'Placement', 'Source', 'AddAuth',
       'UniqWithin', 'LastUpdate', 'Effective', 'Expired'],
      dtype='object')

In [11]:
# https://medium.com/analytics-vidhya/create-a-tokenizer-and-train-a-huggingface-roberta-model-from-scratch-f3ed1138180c

# Sample chinese address
# https://help.sap.com/docs/SAP_DATA_QUALITY_MANAGEMENT_SDK/7ac66320ca514acc89396a367db6dba8/e540e1677ee51014a997d681b0e91070.html?version=4.2.9

# Address formatter
# https://github.com/fulfilio/address-formatter/blob/develop/address_formatter/address.py

In [12]:
# addresses = LastName + (", " + FirstName if FirstName else "")
items = "CO	Adams	80601.0	South	Cabbage	Avenue	282		  "
print(" ".join([str(s).strip() for s in items.split() if s != 'NaN']))

CO Adams 80601.0 South Cabbage Avenue 282


In [13]:
headers_needed[:-1]

['State',
 'County',
 'Inc_Muni',
 'Zip_Code',
 'StN_PreDir',
 'StreetName',
 'StN_PosTyp',
 'Add_Number',
 'Building',
 'Unit']

In [14]:
df_nad['Zip_Code'] = df_nad.Zip_Code.apply(lambda x: str(x).split('.')[0])

In [15]:
df_nad.fillna(value='', inplace=True)
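
One caveat with the `Zip_Code` conversion above: `str(float('nan'))` is `'nan'`, so rows with a missing ZIP code (192 of them in the null counts earlier) end up holding the literal text `nan`, which `fillna` no longer catches. A minimal sketch of an alternative, assuming pandas' nullable `Int64` dtype is available and using a toy Series rather than the real column:

```python
import pandas as pd

# Toy stand-in for df_nad["Zip_Code"], which pandas parsed as float64.
zips = pd.Series([80601.0, 80103.0, float("nan")])

# Approach used in the notebook: NaN turns into the string "nan".
as_text = zips.apply(lambda x: str(x).split(".")[0])

# Alternative: go through the nullable integer dtype, then blank out missing values.
clean = zips.astype("Int64").astype("string").fillna("")

print(as_text.tolist())
print(clean.tolist())
```

Whether an empty string is the right placeholder depends on how the address is rendered afterwards; it is used here only because the notebook fills other missing fields the same way.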

In [16]:
df_nad[headers_needed].head()

Unnamed: 0,State,County,Inc_Muni,Zip_Code,StN_PreDir,StreetName,StN_PosTyp,Add_Number,Building,Unit,NatGrid_Coord
0,CO,Adams,BRIGHTON,80601,South,Cabbage,Avenue,282,,,13SEE1520325843
1,CO,Adams,BRIGHTON,80601,,CABBAGE,Street,60,,,13SEE1528726184
2,CO,Adams,,80103,East,26TH,Court,64400,,,13SEE6552801229
3,CO,Adams,,80105,East,26TH,Avenue,83107,,,13SEE8440901177
4,CO,Adams,,80103,,PEERLESS MINE,Road,6571,,,13SEE6613807911


In [17]:
# df_nad['Zip_Code'] = df_nad.Zip_Code.astype(int)

cols = headers_needed[:-1]
df_nad['combined'] = df_nad[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

In [27]:
df_nad.head(25)

Unnamed: 0,OID,State,County,Inc_Muni,Uninc_Comm,Nbrhd_Comm,Post_Comm,Zip_Code,Plus_4,Bulk_Zip,...,GUID,Addr_Type,Placement,Source,AddAuth,UniqWithin,LastUpdate,Effective,Expired,combined
0,-1,CO,Adams,BRIGHTON,,,,80601,,,...,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,,CO Adams BRIGHTON 80601 South Cabbage Avenue 2...
1,-1,CO,Adams,BRIGHTON,,,,80601,,,...,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,,CO Adams BRIGHTON 80601 CABBAGE Street 60
2,-1,CO,Adams,,,,,80103,,,...,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,,CO Adams 80103 East 26TH Court 64400
3,-1,CO,Adams,,,,,80105,,,...,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,,CO Adams 80105 East 26TH Avenue 83107
4,-1,CO,Adams,,,,,80103,,,...,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,,CO Adams 80103 PEERLESS MINE Road 6571
5,-1,CO,Adams,,,,,80103,,,...,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,,CO Adams 80103 BEHRENS Road 4680
6,-1,CO,Adams,,,,,80103,,,...,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,,CO Adams 80103 BEHRENS Road 5505
7,-1,CO,Adams,,,,,80103,,,...,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,,CO Adams 80103 CALHOUN BYERS Road 6570
8,-1,CO,Adams,,,,,80103,,,...,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,,CO Adams 80103 BEHRENS Road 4500
9,-1,CO,Adams,,,,,80103,,,...,,Unknown,Unknown,Colorado OIT GIS,,,7/1/2021 0:00:00,,,CO Adams 80103 BEHRENS Road 5440


In [28]:
from sklearn.model_selection import train_test_split
# help(train_test_split)

Split the data into training and test sets.

In [29]:
df_train, df_test = train_test_split(df_nad, 
                                     test_size=.1, 
                                     random_state=20220924,
                                     )

In [30]:
!pip install pycountry
!pip install address-formatter



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [31]:
headers_needed

['State',
 'County',
 'Inc_Muni',
 'Zip_Code',
 'StN_PreDir',
 'StreetName',
 'StN_PosTyp',
 'Add_Number',
 'Building',
 'Unit',
 'NatGrid_Coord']

In [33]:
import pycountry
import address_formatter as formatter
import random 

In [34]:
from address import Address2

In [35]:
df_nad.iloc[1][headers_needed]

State                         CO
County                     Adams
Inc_Muni                BRIGHTON
Zip_Code                   80601
StN_PreDir                      
StreetName               CABBAGE
StN_PosTyp                Street
Add_Number                    60
Building                        
Unit                            
NatGrid_Coord    13SEE1528726184
Name: 1, dtype: object

In [36]:
# country_code,
# name=None, 
# company_name=None,
# street=None, 
# street2=None,
# city=None, 
# county=None,
# postal_code=None,
# subdivision_code=None

In [37]:
addr = formatter.Address(country_code='US', name='4595', street='N picadilly Ct', 
                         street2='',
                         company_name='',
                         postal_code='80019', city='aurora', county='adams', 
                         subdivision_code='Co')

addr.render_us()

['4595', '', 'N picadilly Ct', 'aurora CO 80019', 'United States']

In [40]:
# tmp = formatter.Address(df_nad.iloc[1][headers_needed].to_list())
# tmp.company_name
def assign_right_order(row):
  """Render one NAD record (a pandas Series) as a list of US-formatted address lines."""
  # Street line: house number, pre-direction, street name, street type
  street = ' '.join(str(row[col]) for col in ['Add_Number', 'StN_PreDir', 'StreetName', 'StN_PosTyp'])
  formatted = formatter.Address(country_code='US',
                                 name='',
                                 street=street,
                                 # Placeholder company name, drawn at random
                                 company_name=random.choice(["WALMART", "PUBLIC", "SAM'S CLUB", "COSTCO"]),
                                 street2='',
                                 city=row['Inc_Muni'],
                                 county=row['County'],
                                 postal_code=row['Zip_Code'],
                                 subdivision_code=row['State']
                                 ).render_us()
  return formatted
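
Concatenating the street parts with single spaces leaves a double space whenever a component such as `StN_PreDir` is empty. A small helper along these lines would skip blank parts; the name `join_nonempty` is hypothetical and not part of `address_formatter`:

```python
def join_nonempty(*parts):
    """Join address components with single spaces, skipping empty ones."""
    cleaned = (str(p).strip() for p in parts)
    return " ".join(p for p in cleaned if p)

# A record with no pre-direction, as in the second row of the sample data.
street = join_nonempty(60, "", "CABBAGE", "Street")
print(street)
```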

In [43]:
fmt = Address2(country_code='US',
            #  name=df['StreetName'],
              street=str(df_train.iloc[0]['Add_Number']) +' ' + str(df_train.iloc[0]['StN_PreDir']) +' ' + str(df_train.iloc[0]['StreetName']) +' ' + str(df_train.iloc[0]['StN_PosTyp']),
              company_name=random.choices(["WALMART", "PUBLIC","SAM'S CLUB", "COSTCO"])[0],
              street2='',
              city=df_train.iloc[0]['Inc_Muni'],
              county=df_train.iloc[0]['County'],
              postal_code=df_train.iloc[0]['Zip_Code'],
              subdivision_code=df_train.iloc[0]['State']
              ).render_us()

fmt

["SAM'S CLUB", '8296 East Girard Avenue', 'Denver CO 80231', 'United States']

In [45]:
df = df_nad.loc[1,:]
print(type(df))
formatter.Address(country_code='US',
                #  name=df['StreetName'],
                  street=str(df['Add_Number']) +' ' + str(df['StN_PreDir']) +' ' + str(df['StreetName']) +' ' + str(df['StN_PosTyp']),
                  company_name=random.choices(["WALMART", "PUBLIC","SAM'S CLUB", "COSTCO"])[0],
                  street2='',
                  city=df['Inc_Muni'],
                  county=df['County'],
                  postal_code=df['Zip_Code'],
                  subdivision_code=df['State']
                  ).render_us()

<class 'pandas.core.series.Series'>


[None, 'PUBLIC', '60  CABBAGE Street', 'BRIGHTON CO 80601', 'United States']

In [46]:
test = assign_right_order(df_nad.iloc[0])
test

['',
 'COSTCO',
 '282 South Cabbage Avenue',
 'BRIGHTON CO 80601',
 'United States']

In [47]:
# import requests

# dd = requests.get("https://www.sec.gov/rules/other/4-460list.htm")
# dd.text

In [48]:
df_companies = pd.read_html('https://www.sec.gov/rules/other/4-460list.htm')
df_companies[0]

Unnamed: 0,0,1
0,List of Companies (Corrected) A | B | C | D | ...,List of Companies (Corrected) A | B | C | D | ...
1,,3Com Corp
2,,3M Company
3,,A.G. Edwards Inc.
4,,Abbott Laboratories
...,...,...
943,,Yellow Corporation
944,,York International Corp
945,,Yum Brands Inc.
946,,Zale Corporation


In [49]:
df_companies[1]

Unnamed: 0,0,1
0,Home | Previous Page,Modified: 07/02/2002


In [69]:
# Write one text file per record, each containing the rendered address
import os
from tqdm import tqdm
def column_to_files(data, columns, index_column, txt_files_dir):
    # `columns` is kept for compatibility with the calls below but is not used.
    # The file name comes from `index_column`; if that value repeats (NatGrid_Coord
    # is not unique in this sample), later rows overwrite earlier files.
    for idx in tqdm(data.index):
        file_id = str(data.loc[idx, index_column])
        file_name = os.path.join(txt_files_dir, file_id + '.txt')
        try:
            # Render the address parts and join them into a single line
            text = ' '.join(assign_right_order(data.loc[idx, headers_needed]))
            # Write the line as UTF-8; the file is closed automatically
            with open(file_name, 'w', encoding='utf-8') as f:
                f.write(text)
        except Exception as e:  # e.g. rows whose rendered parts cannot be joined
            print(idx, e)
    # Retur
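
Because `NatGrid_Coord` has fewer distinct values than there are rows (904,754 among the 1,000,000 loaded), naming files after it means some addresses overwrite others. A minimal sketch of one way around that, assuming it is acceptable to append a running counter to repeated keys (the helper name `unique_file_names` is hypothetical):

```python
from collections import Counter

def unique_file_names(keys):
    """Yield one file name per key, adding a numeric suffix to repeated keys."""
    seen = Counter()
    for key in keys:
        count = seen[key]
        seen[key] += 1
        suffix = f"_{count}" if count else ""
        yield f"{key}{suffix}.txt"

names = list(unique_file_names(["13SEE1520325843", "13SEE1528726184", "13SEE1520325843"]))
print(names)
```

Using the DataFrame index as the file name would be an even simpler option, since it is unique by construction.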



In [70]:
' '.join(assign_right_order(df_train.loc[0,headers_needed]))

' WALMART 282 South Cabbage Avenue BRIGHTON CO 80601 United States'

In [71]:
# from pyparsing.helpers import col
!mkdir training
column_to_files(df_train, columns='combined', index_column='NatGrid_Coord', txt_files_dir='training')
# df_train.loc[0, 'combined'].replace(',', '').encode('utf-8')

100%|██████████| 900000/900000 [12:12<00:00, 1228.85it/s]


In [72]:

!mkdir testing
column_to_files(df_test, columns='combined', index_column='NatGrid_Coord', txt_files_dir='testing')


100%|██████████| 100000/100000 [01:18<00:00, 1280.49it/s]


In [73]:
# # Get the training data
# data = df_train
# # Removing the end of line character \n
# data = data.replace("\n"," ")
# # Set the ID to 0
# prefix=0
# # Create a file for every description value
# prefix = column_to_files(data, prefix, txt_files_dir='training')
# # Print the last ID
# print(prefix)
# # Get the test data
# data = test_df["description"]
# # Removing the end of line character \n
# data = data.replace("\n"," ")
# print(len(data))
# # Create a file for every description value
# prefix = column_to_files(data, prefix, txt_files_dir)
# print(prefix)

## 2. Train a tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).
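
To make the idea concrete, here is a toy sketch of one byte-level BPE training step: count adjacent pairs over UTF-8 bytes and merge the most frequent one. It only illustrates the principle; the `tokenizers` library implements this far more efficiently, and the function names are made up for the example.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the commonest."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with a single merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Start from single bytes, so any input string is representable.
corpus = ["la lando", "la lago"]
sequences = [[bytes([b]) for b in text.encode("utf-8")] for text in corpus]

pair = most_frequent_pair(sequences)
sequences = [merge_pair(seq, pair) for seq in sequences]
print(pair, sequences[0])
```

Repeating the count-and-merge step until the vocabulary reaches the target size (52,000 here) yields the list of merges.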


In [74]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1

Found existing installation: tensorflow 2.8.2+zzzcolab20220719082949
Uninstalling tensorflow-2.8.2+zzzcolab20220719082949:
  Successfully uninstalled tensorflow-2.8.2+zzzcolab20220719082949
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-c_ff7645
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-c_ff7645
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl 

In [75]:
%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path("training").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 51.2 s, sys: 1min 26s, total: 2min 17s
Wall time: 48.1 s


Now let's save files to disk

In [76]:
!mkdir AddressBERTa
tokenizer.save_model("AddressBERTa")

['AddressBERTa/vocab.json', 'AddressBERTa/merges.txt']

🔥🔥 Wow, that was fast! ⚡️🔥

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```

What is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. accented characters used in Esperanto – `ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ` – are encoded natively. We also represent sequences in a more efficient manner. Here on this corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained GPT-2 tokenizer.

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from `transformers`.


In [77]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./AddressBERTa/vocab.json",
    "./AddressBERTa/merges.txt",
)

In [78]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [79]:
tokenizer.encode("4595 N Picadilly ct aurora co")

Encoding(num_tokens=14, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [80]:
tokenizer.encode("4595 N Picadilly ct aurora co").tokens

['<s>',
 '45',
 '95',
 'ĠN',
 'ĠPicadilly',
 'Ġ',
 'ct',
 'Ġ',
 'a',
 'ur',
 'ora',
 'Ġ',
 'co',
 '</s>']
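
The `Ġ` in these tokens is how a byte-level BPE displays a leading space: every byte is mapped to a printable character before merging, and the space byte lands on `Ġ` (U+0120). The sketch below re-creates that mapping following the scheme published with GPT-2; it is an illustration, not the `tokenizers` source, and `to_visible` is a made-up helper.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Bytes without a printable form are moved to code points 256 and up.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

mapping = bytes_to_unicode()

def to_visible(text):
    """Show how a string looks after the byte-to-character mapping."""
    return "".join(mapping[b] for b in text.encode("utf-8"))

print(to_visible(" co"))
```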

## 3. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a task of *Masked language modeling*, i.e. predicting how to fill arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.
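
As a rough illustration of the masking step (not the library's implementation), the sketch below selects about 15% of the tokens and, following the recipe from the BERT paper, replaces 80% of the selected ones with `<mask>`, 10% with a random token, and leaves 10% unchanged. The function name, the toy vocabulary, and the fixed seed are assumptions for the example.

```python
import random

def mask_tokens(tokens, vocab, mask_token="<mask>", mlm_probability=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mlm_probability:
            labels.append(token)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append(mask_token)
            elif roll < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(token)
        else:
            labels.append(None)
            masked.append(token)
    return masked, labels

tokens = "282 South Cabbage Avenue BRIGHTON CO 80601".split() * 50
masked, labels = mask_tokens(tokens, vocab=sorted(set(tokens)))
selected = sum(label is not None for label in labels)
print(f"{selected} of {len(tokens)} tokens selected for prediction")
```

The loss is computed only at the selected positions, which is why the unselected ones carry no label.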


In [81]:
# Check that we have a GPU
!nvidia-smi

Mon Sep 26 06:01:15 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [82]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

### We'll define the following config for the model

In [83]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now let's re-create our tokenizer in transformers

In [84]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./AddressBERTa", model_max_length=512)

Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [85]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [86]:
model.num_parameters()
# => 84 million parameters

83504416
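
That figure can be reproduced by hand. The sketch below adds up the weights and biases of each component, assuming the `RobertaConfig` defaults for the values not set above (`hidden_size=768`, `intermediate_size=3072`) and an output projection that shares its weights with the input embeddings.

```python
vocab_size, max_positions, type_vocab = 52_000, 514, 1
hidden, intermediate, layers = 768, 3072, 6

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Word, position, and token-type embeddings, followed by a layer norm
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

encoder_layer = (
    3 * linear(hidden, hidden)        # query, key, value projections
    + linear(hidden, hidden)          # attention output projection
    + layer_norm
    + linear(hidden, intermediate)    # feed-forward expansion
    + linear(intermediate, hidden)    # feed-forward contraction
    + layer_norm
)

# LM head: dense layer, layer norm, and an output bias per vocabulary entry
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + layers * encoder_layer + lm_head
print(total)  # 83504416
```

Nearly half of the total sits in the word embeddings, so the vocabulary size is the main lever on model size at this depth.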

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our `Dataset`. We'll just use the `LineByLineTextDataset` out-of-the-box.
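
`LineByLineTextDataset` treats every non-empty line of a single file as one training example, so the per-address files written earlier would first have to be gathered into one file. A minimal sketch, assuming the addresses are available as lists of rendered parts like the output of `assign_right_order` (the file name `addresses_train.txt` and the helper name are hypothetical):

```python
import os
import tempfile

def write_line_by_line(rendered_addresses, path):
    """Write one address per line, collapsing whitespace and skipping empty entries."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for parts in rendered_addresses:
            # Drop None/empty parts, then normalize runs of whitespace.
            line = " ".join(" ".join(p for p in parts if p).split())
            if line:
                f.write(line + "\n")
                written += 1
    return written

sample = [
    ["", "COSTCO", "282 South Cabbage Avenue", "BRIGHTON CO 80601", "United States"],
    [None, "PUBLIC", "60  CABBAGE Street", "BRIGHTON CO 80601", "United States"],
    ["", "", ""],
]
path = os.path.join(tempfile.mkdtemp(), "addresses_train.txt")
print(write_line_by_line(sample, path))
```

The resulting path could then be passed as `file_path` when building the dataset.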

In [None]:
## Need to create the Line by line dataset

In [None]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)

CPU times: user 4min 54s, sys: 2.98 s, total: 4min 57s
Wall time: 1min 37s


Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a data_collator.

This is just a small helper that will help us batch different samples of the dataset together into an object that PyTorch knows how to perform backprop on.

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

### Finally, we are all set to initialize our Trainer

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./AddressBERTa",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

### Start training

In [None]:
%%time
trainer.train()


{"loss": 7.152712148666382, "learning_rate": 4.8358287365379566e-05, "epoch": 0.03283425269240872, "step": 500}
{"loss": 6.928811420440674, "learning_rate": 4.671657473075913e-05, "epoch": 0.06566850538481744, "step": 1000}
{"loss": 6.789419063568115, "learning_rate": 4.5074862096138694e-05, "epoch": 0.09850275807722617, "step": 1500}
{"loss": 6.688932447433472, "learning_rate": 4.343314946151826e-05, "epoch": 0.1313370107696349, "step": 2000}
{"loss": 6.595982004165649, "learning_rate": 4.179143682689782e-05, "epoch": 0.1641712634620436, "step": 2500}
{"loss": 6.545944199562073, "learning_rate": 4.0149724192277385e-05, "epoch": 0.19700551615445233, "step": 3000}
{"loss": 6.4864857263565066, "learning_rate": 3.850801155765695e-05, "epoch": 0.22983976884686105, "step": 3500}
{"loss": 6.412427802085876, "learning_rate": 3.686629892303651e-05, "epoch": 0.2626740215392698, "step": 4000}
{"loss": 6.363630670547486, "learning_rate": 3.522458628841608e-05, "epoch": 0.29550827423167847, "step"

TrainOutput(global_step=15228, training_loss=5.762423221226405)
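
The `learning_rate` values in the log are consistent with a linear decay from 5e-5 to zero over the 15,228 steps with no warmup, which appears to be the `Trainer` default when no schedule is configured. A quick check of that arithmetic (the function name is made up for the example):

```python
def linear_decay_lr(step, total_steps=15_228, base_lr=5e-5):
    """Learning rate after `step` updates under linear decay to zero, no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

for step in (500, 1000, 1500):
    print(step, linear_decay_lr(step))
```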

#### 🎉 Save final model (+ tokenizer + config) to disk

In [None]:
trainer.save_model("./AddressBERTa")

## 4. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.
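
Conceptually, the pipeline takes the model's scores (logits) at the masked position, turns them into probabilities with a softmax, and keeps the highest-ranked tokens. A toy sketch of that ranking step, with invented logits and a four-word vocabulary standing in for real model output:

```python
import math

def top_k_fill(logits, vocab, k=3):
    """Softmax the logits and return the k most probable (token, probability) pairs."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

vocab = ["Street", "Avenue", "Road", "Court"]
logits = [2.0, 1.5, 0.3, -1.0]  # made-up scores for the masked position
for token, prob in top_k_fill(logits, vocab):
    print(f"{token}: {prob:.3f}")
```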



In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./AddressBERTa",
    tokenizer="./AddressBERTa"
)

In [None]:
# The sun <mask>.
# =>

fill_mask("La suno <mask>.")

[{'score': 0.02119220793247223,
  'sequence': '<s> La suno estas.</s>',
  'token': 316},
 {'score': 0.012403824366629124,
  'sequence': '<s> La suno situas.</s>',
  'token': 2340},
 {'score': 0.011061107739806175,
  'sequence': '<s> La suno estis.</s>',
  'token': 394},
 {'score': 0.008284995332360268,
  'sequence': '<s> La suno de.</s>',
  'token': 274},
 {'score': 0.006471084896475077,
  'sequence': '<s> La suno akvo.</s>',
  'token': 1833}]

Ok, simple syntax/grammar works. Let’s try a slightly more interesting prompt:



In [None]:
fill_mask("Jen la komenco de bela <mask>.")

# This is the beginning of a beautiful <mask>.
# =>

[{'score': 0.01814725436270237,
  'sequence': '<s> Jen la komenco de bela urbo.</s>',
  'token': 871},
 {'score': 0.015888698399066925,
  'sequence': '<s> Jen la komenco de bela vivo.</s>',
  'token': 1160},
 {'score': 0.015662025660276413,
  'sequence': '<s> Jen la komenco de bela tempo.</s>',
  'token': 1021},
 {'score': 0.015555007383227348,
  'sequence': '<s> Jen la komenco de bela mondo.</s>',
  'token': 945},
 {'score': 0.01412549614906311,
  'sequence': '<s> Jen la komenco de bela tago.</s>',
  'token': 1633}]

## 5. Share your model 🎉

Finally, when you have a nice model, please think about sharing it with the community:

- upload your model using the CLI: `transformers-cli upload`
- write a README.md model card and add it to the repository under `model_cards/`. Your model card should ideally include:
    - a model description,
    - training params (dataset, preprocessing, hyperparameters), 
    - evaluation results,
    - intended uses & limitations
    - whatever else is helpful! 🤓

### **TADA!**

➡️ Your model has a page on http://huggingface.co/models and everyone can load it using `AutoModel.from_pretrained("username/model_name")`.

[![tb](https://huggingface.co/blog/assets/01_how-to-train/model_page.png)](https://huggingface.co/julien-c/EsperBERTo-small)


If you want to take a look at models in different languages, check https://huggingface.co/models

[![all models](https://huggingface.co/front/thumbnails/models.png)](https://huggingface.co/models)
