# (Build) Dataset for token classification

- Author: Didier Guillevic
- Date: 2024-08-10

- We would like to build a Dataset instance with two columns: `'words'`, `'labels'`
    - `'words'` will be a list of words
    - `'labels'` will be the labels (**as integers**) we wish to predict, one per word
- Define the `label_names` as a list of strings; e.g.
    ```
        label_names = [
            'O',
            'B-STREET_NB', 'I-STREET_NB',
            'B-STREET_NAME', 'I-STREET_NAME',
            'B-UNIT', 'I-UNIT',
            'B-CITY', 'I-CITY',
            'B-REGION', 'I-REGION',
            'B-POSTCODE', 'I-POSTCODE'
        ]
    ```
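
As an illustration of the target format, a single row could look like the sketch below. The address and the `example_row` name are made up for this example; `label_names` is the list defined above.

```python
label_names = [
    'O',
    'B-STREET_NB', 'I-STREET_NB',
    'B-STREET_NAME', 'I-STREET_NAME',
    'B-UNIT', 'I-UNIT',
    'B-CITY', 'I-CITY',
    'B-REGION', 'I-REGION',
    'B-POSTCODE', 'I-POSTCODE'
]

# Hypothetical row (the address is invented for illustration).
example_row = {
    'words':  ['123', 'main', 'street', ',', 'montreal', 'qc', 'h2x', '1y4'],
    'labels': [1, 3, 4, 0, 7, 9, 11, 12],
}

# Decode the integer labels back to their names, one per word.
decoded = [label_names[i] for i in example_row['labels']]
for word, name in zip(example_row['words'], decoded):
    print(f"{word:10s} {name}")
```

Each integer is simply the position of the label in `label_names`, so the two lists in a row must always have the same length.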

In [None]:
from datasets import load_dataset
import random

## Dataset

### Load dataset

In [None]:
data_files = {
    "train": "openaddresses_ca_train.parquet",
    "validation": "openaddresses_ca_validation.parquet",
    "test": "openaddresses_ca_test.parquet"
}
#data_files_dev = {
#    "train": "openaddresses_ca_test.parquet",
#}
dataset = load_dataset("parquet", data_files=data_files)
dataset

In [None]:
dataset['train'][0]

### Preprocess dataset

- The addresses have been post-normalized in the OpenAddresses dataset.
- They need to be preprocessed so they look like what people might have
  actually written (i.e. before the addresses were normalized).
- Tasks:
    - lowercase all texts: street, unit, city, region, postcode
    - add (randomly) a space between the first and last 3 characters of a postal
      code.

#### Postal code

- The postal code has been normalized as a 6 character string with no space.
- I believe people would write the postal code with a space between the first
  and last 3 characters.
- Hence, randomly (1 chance out of 2) adding a space between the first and last
  3 characters of the postal code.

In [None]:
def rand_split_postcode(example):
    """Randomly add a space between first 3 and last 3 characters of postcode"""
    return {
        'postcode': (
            (example['postcode'][:3] + ' ' + example['postcode'][3:]) if
            (example['postcode'] and len(example['postcode']) == 6 and random.randint(0, 1)) else 
            example['postcode']
        )
    }

dataset = dataset.map(rand_split_postcode)

In [None]:
dataset['train'][:8]['postcode']
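
Because the module-level `random` functions are not seeded in this notebook, each run produces a slightly different dataset. If reproducibility matters, one option is to draw from a seeded `random.Random` instance. The sketch below is a standalone variant of the same rule; `split_postcode`, the seed value and the sample postcodes are assumptions for illustration, not part of the pipeline above.

```python
import random

def split_postcode(postcode, rng):
    """Insert a space in the middle of a 6-character postcode, 1 time out of 2."""
    if postcode and len(postcode) == 6 and rng.randint(0, 1):
        return postcode[:3] + ' ' + postcode[3:]
    # Missing values and postcodes of any other length are returned unchanged.
    return postcode

rng = random.Random(42)  # arbitrary seed, chosen only for this example
samples = ['H2X1Y4', 'K1A0B1', None, 'V6B', 'M5V3L9']
print([split_postcode(p, rng) for p in samples])
```

Re-creating `rng` with the same seed yields the same sequence of choices, which makes the generated splits repeatable across runs.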

#### Region (provinces)

- The regions (provinces) have been normalized to 2-letter codes; e.g.
  "QC", "ON", "BC", ...
- Hence, the model will not be able to recognize that "British Columbia" might
  stand for the "BC" region.
- We will randomly replace the 2-letter codes with the expanded versions
  (ideally both French and English versions where appropriate); e.g.
  - "BC" -> {"British Columbia", "Colombie Britannique"}

In [None]:
print(sorted(list(set(dataset['train']['region']))))

In [None]:
# Define some alternative for each region
region_alts = {
    'AB': ['Alberta',],
    'BC': ['British Columbia', "Colombie Britannique"],
    'MB': ['Manitoba',],
    'NB': ['New Brunswick', 'Nouveau Brunswick'],
    'NL': ['Newfoundland and Labrador', 'Terre Neuve et Labrador'],
    'NS': ['Nova Scotia', 'Nouvelle Écosse'],
    'NT': ['Northwest Territories', 'Territoires du Nord Ouest'],
    'NU': ['Nunavut',],
    'ON': ['Ontario',],
    'PE': ['PEI', 'Prince Edward Island', 'Île du Prince Édouard'],
    'QC': ['Quebec', 'Québec'],
    'SK': ['Saskatchewan',],
    'YT': ['Yukon', 'Yukon Territory']
}
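
A quick check that every region code found in the data has an entry in the alternatives table can catch codes that would otherwise silently be left unexpanded. The sketch below uses a shortened copy of the table and a small hand-written list in place of `dataset['train']['region']`; the `observed_codes` values, including the unknown code `'XX'`, are invented for the example.

```python
# Shortened copy of `region_alts`, for illustration only.
region_alts_subset = {
    'BC': ['British Columbia', 'Colombie Britannique'],
    'ON': ['Ontario'],
    'QC': ['Quebec', 'Québec'],
}

# Stand-in for the codes found in the data (made-up values).
observed_codes = ['QC', 'ON', 'BC', 'QC', 'XX', None]

# Codes present in the data but absent from the alternatives table.
missing = sorted({c for c in observed_codes if c} - set(region_alts_subset))
print(missing)
```

Such a check only makes sense before the substitution is applied, since afterwards the column contains expanded names as well as codes.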

In [None]:
def rand_alt_region(example):
    """Randomly substitute the 2 character code region with an alternate form"""
    return {
        'region': (
            random.choice(region_alts[example['region']]) if
            (example['region'] and example['region'] in region_alts and random.randint(0, 1)) else
            example['region']
        )
    }

dataset = dataset.map(rand_alt_region)

In [None]:
dataset['train'][:5]['region']

#### Unit

In [None]:
# Skip missing values: sorting a mix of None and str would raise a TypeError
print(sorted({u for u in dataset['train']['unit'] if u}))

#### Lowercase text

The data has been normalized so that the city names are in all uppercase.
It is probably easier to lowercase everything.
We will need to test whether performance suffers at test / eval time.

In [None]:
def lowercase_columns(example):
    for key, value in example.items():
        if isinstance(value, str):
            example[key] = value.lower() if value else value
    return example

dataset = dataset.map(lowercase_columns)

In [None]:
dataset['train'][0]
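
Since the training text is lowercased, input seen at inference time would presumably need the same treatment. A minimal sketch, assuming whitespace-separated words and commas kept as separate tokens (as in the `'words'` column built later); `prepare_query` is a hypothetical helper, not part of this notebook.

```python
def prepare_query(text):
    """Lowercase a raw address string and split it into words.

    Commas are padded with spaces first so that each one becomes its own
    token, matching the way commas are inserted in the training data.
    """
    return text.lower().replace(',', ' , ').split()

print(prepare_query("123 Main Street, MONTREAL QC H2X 1Y4"))
```

Other punctuation would likely need similar handling; this sketch covers only commas.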

### Add lists of words and labels

1. We might want to randomly add a comma "," between the address components.
2. Additionally, we might wish to randomly omit the region and postal code to
   simulate cases where that data would be missing in operational mode.

In [None]:
# We wish to create two new columns: words and labels
label_names = [
    'O',
    'B-STREET_NB', 'I-STREET_NB',
    'B-STREET_NAME', 'I-STREET_NAME',
    'B-UNIT', 'I-UNIT',
    'B-CITY', 'I-CITY',
    'B-REGION', 'I-REGION',
    'B-POSTCODE', 'I-POSTCODE'
]
label_name_to_id = {name: i for i, name in enumerate(label_names)}
label_name_to_id

In [None]:
feature_to_labelID = {
    'number': 1,
    'street': 3,
    'unit': 5,
    'city': 7,
    'region': 9,
    'postcode': 11
}
feature_to_labelID
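
The ids in `feature_to_labelID` are hardcoded, so they have to stay in sync with the order of `label_names`. The sketch below is one way to verify that each feature points at a `B-` label that is immediately followed by its `I-` counterpart, which is what the `label+1` step in the next cell relies on. Both structures are copied here only to keep the example self-contained.

```python
label_names = [
    'O',
    'B-STREET_NB', 'I-STREET_NB',
    'B-STREET_NAME', 'I-STREET_NAME',
    'B-UNIT', 'I-UNIT',
    'B-CITY', 'I-CITY',
    'B-REGION', 'I-REGION',
    'B-POSTCODE', 'I-POSTCODE'
]
feature_to_labelID = {
    'number': 1, 'street': 3, 'unit': 5,
    'city': 7, 'region': 9, 'postcode': 11
}

for feature, begin_id in feature_to_labelID.items():
    begin_name = label_names[begin_id]
    inside_name = label_names[begin_id + 1]
    # Each feature must map to a 'B-' label directly followed by its 'I-' twin.
    assert begin_name.startswith('B-'), (feature, begin_name)
    assert inside_name == 'I-' + begin_name[2:], (feature, inside_name)

print("feature_to_labelID is consistent with label_names")
```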

Let's create two new columns: words, labels

In [None]:
feature_names = ['number', 'street', 'unit', 'city', 'region', 'postcode']
unit_alts = ['unit', 'suite', 'appt', '#']

def build_words_labels(example):
    all_words = []
    all_labels = []

    def add_word_label(word, label):
        word = word.strip()
        if not word:
            return
        words = word.split()
        for i, w in enumerate(words):
            all_words.append(w)
            all_labels.append(label if (i == 0 or label == 0) else label+1)
    
    # CanadaPost recommends to put the unit number before the civic number,
    # with both numbers separated by a hyphen.
    # https://www.canadapost-postescanada.ca/cpc/en/support/kb/business/address-accuracy/addressing-mail-accurately#:~:text=Place%20the%20unit%20number%20before,province%20symbol%20by%202%20spaces.

    # In the dataset, some unit values look like "1-2-3", which is not ideal.
    # If a hyphen is present in the unit value, we do not use the
    # "<unit> - <number>" form; the unit is written after the street instead,
    # e.g. "suite 1-2-3", "appt 1-2-3" or "# 1-2-3".

    unit_pre = (
        example['unit'] and ('-' not in example['unit']) and
        example['number'] and random.randint(0, 1)
    )

    # unit placed before the civic number: "<unit> - <number>"
    # (add_word_label splits multi-word values so each word gets its own label)
    if unit_pre:
        add_word_label(example['unit'], feature_to_labelID['unit'])
        add_word_label("-", 0)
        add_word_label(example['number'], feature_to_labelID['number'])
    
    # number
    if not unit_pre and example['number']:
        add_word_label(example['number'], feature_to_labelID['number'])
    
    # street
    if example['street']:
        add_word_label(example['street'], feature_to_labelID['street'])
    
    # unit
    if not unit_pre and example['unit']:
        # Randomly add a comma between address components (1 in 2 chance)
        if random.randint(0, 1):
            add_word_label(",", 0)

        # randomly add a prefix to the unit number
        if random.randint(0, 1):
            add_word_label(random.choice(unit_alts), 0)
        
        add_word_label(example['unit'], feature_to_labelID['unit'])
    
    # city (skipped when missing, so that no stray comma is added)
    if example['city']:
        if random.randint(0, 1):
            add_word_label(",", 0)
        add_word_label(example['city'], feature_to_labelID['city'])

    # region (skipped when missing; otherwise randomly omitted, 1 chance out of 5)
    if example['region'] and random.randint(0, 4) < 4:
        if random.randint(0, 1):
            add_word_label(",", 0)
        add_word_label(example['region'], feature_to_labelID['region'])

    # postcode (skipped when missing; otherwise randomly omitted, 1 chance out of 5)
    if example['postcode'] and random.randint(0, 4) < 4:
        if random.randint(0, 1):
            add_word_label(",", 0)
        add_word_label(example['postcode'], feature_to_labelID['postcode'])

    return {'words': all_words, 'labels': all_labels}

In [None]:
dataset_token_classif = dataset.map(build_words_labels)

In [None]:
dataset_token_classif['train'][1]
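
Two properties of the generated rows are worth checking: `words` and `labels` have the same length, and an `I-` label only ever follows the `B-` or `I-` label of the same entity. The sketch below checks both on hand-written rows; `check_row` is a hypothetical helper, the rows are invented, and the odd/even test assumes the label layout used in this notebook (odd ids are `B-` labels, even non-zero ids are `I-` labels).

```python
def check_row(words, labels, num_labels=13):
    """Return a list of problems found in one (words, labels) row."""
    problems = []
    if len(words) != len(labels):
        problems.append("length mismatch")
    previous = 0  # label 0 is 'O'
    for position, label in enumerate(labels):
        if not 0 <= label < num_labels:
            problems.append(f"label out of range at {position}")
        # Even non-zero ids are 'I-' labels; they must continue the same entity.
        elif label != 0 and label % 2 == 0 and previous not in (label - 1, label):
            problems.append(f"orphan I- label at {position}")
        previous = label
    return problems

good = (['123', 'main', 'street', 'montreal'], [1, 3, 4, 7])
bad = (['main', 'street'], [4, 4])
print(check_row(*good))
print(check_row(*bad))
```

The same function could be applied to a sample of rows from `dataset_token_classif` to spot labelling mistakes before saving.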

Let's store the label names in a JSON file, so they can be reloaded alongside the dataset

In [None]:
import json

# Save the list to a file
with open("labels.json", "w") as f:
    json.dump(label_names, f)

# Load the list from the file
with open("labels.json", "r") as f:
    loaded_labels = json.load(f)

print(loaded_labels)
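
A fine-tuning script would typically turn this list into two lookup tables, one in each direction. A minimal sketch, with the list inlined to keep the example self-contained; the names `id2label` and `label2id` follow the convention used by Hugging Face model configs, and how they are passed to a model is left out here.

```python
loaded_labels = [
    'O',
    'B-STREET_NB', 'I-STREET_NB',
    'B-STREET_NAME', 'I-STREET_NAME',
    'B-UNIT', 'I-UNIT',
    'B-CITY', 'I-CITY',
    'B-REGION', 'I-REGION',
    'B-POSTCODE', 'I-POSTCODE'
]

# Integer id -> label name, and the reverse mapping.
id2label = dict(enumerate(loaded_labels))
label2id = {name: i for i, name in id2label.items()}

print(id2label[11], label2id['B-CITY'])
```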

Let's keep only the columns: 'words', 'labels'. This will save space when saving to disk.

In [None]:
columns_to_remove = ['number', 'street', 'unit', 'city', 'region', 'postcode', 'coordinates']
dataset_to_save = dataset_token_classif.remove_columns(columns_to_remove)
dataset_to_save

Let's save the dataset ready for finetuning a token classifier...

In [None]:
dataset_preprocessed_name = "openaddresses_ca_preprocessed_token_classif"
# Use a distinct loop variable so the `dataset` defined earlier is not overwritten
for split, split_dataset in dataset_to_save.items():
    split_dataset.to_parquet(f"{dataset_preprocessed_name}_{split}.parquet")