# OpenAddresses data preparation

- Author: Didier Guillevic
- Date: 2024-08-10

### TL;DR

- We will use Canadian postal address data from [OpenAddresses.io](https://openaddresses.io).
- We will format addresses as a list of words and labels.
- We might wish to "de-normalize" the data; e.g.
    - All postal codes are 6 contiguous characters (e.g. "A0G1E0"). We might wish to
      randomly insert a space between the first and last 3 characters
      (e.g. "A0G 1E0"), which is closer to the way people write them.
    - The city names appear to be all upper case. We might lowercase the entire text.
    - The region names are two-letter abbreviations. We might wish to randomly replace those with the full name of the region.
- We will load that data as a HuggingFace Dataset.
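
The de-normalization ideas above can be sketched as a small function. This is only an illustration under stated assumptions: the function name `denormalize`, the probabilities, and the `REGION_NAMES` mapping (which lists just a few regions) are hypothetical and not part of the original notebook.

```python
import random

# Hypothetical, partial mapping from two-letter abbreviations to full region names.
REGION_NAMES = {
    "NL": "Newfoundland and Labrador",
    "ON": "Ontario",
    "QC": "Quebec",
}


def denormalize(record, rng, p_space=0.5, p_full_region=0.5, p_lower=0.5):
    """Return a copy of an address record with random surface variations."""
    out = dict(record)

    # "A0G1E0" -> "A0G 1E0" with probability p_space
    postcode = out.get("postcode", "")
    if len(postcode) == 6 and rng.random() < p_space:
        out["postcode"] = f"{postcode[:3]} {postcode[3:]}"

    # "NL" -> "Newfoundland and Labrador" with probability p_full_region
    region = out.get("region", "")
    if region in REGION_NAMES and rng.random() < p_full_region:
        out["region"] = REGION_NAMES[region]

    # Lowercase every text field with probability p_lower
    if rng.random() < p_lower:
        out = {k: v.lower() if isinstance(v, str) else v for k, v in out.items()}

    return out


sample = {"number": "434", "street": "Main ST", "city": "BIRCHY BAY",
          "region": "NL", "postcode": "A0G1E0"}
always = denormalize(sample, random.Random(0), 1.0, 1.0, 1.0)
never = denormalize(sample, random.Random(0), 0.0, 0.0, 0.0)
```

Passing an explicit `random.Random` instance keeps the variations reproducible for a given seed.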

## Data: OpenAddresses

The data is in [JSON Lines](https://jsonlines.org) format (one address per line).
A sample address has the following format (pretty-printed here for readability):
```
{
    "type": "Feature",
    "properties": {
        "hash": "42c5facec7f5f9f5",
        "number": "434",
        "street": "Main ST",
        "unit": "",
        "city": "BIRCHY BAY",
        "district": "",
        "region": "NL",
        "postcode": "A0G1E0",
        "id": ""},
    "geometry": {
        "type": "Point",
        "coordinates": [-54.7197282, 49.3584932]}
}
```
Hence, a sample address has 3 main keys:
- type: the GeoJSON object type ("Feature" in this sample; whether it is the only value in the data remains to be checked)
- properties: a dictionary with all the address components
- geometry: a dictionary with two keys, "type" and "coordinates"; the coordinates are given in
  (longitude, latitude) order
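
As a quick check of this structure, the sample record can be parsed with the standard `json` module. This is a minimal sketch: the record is the sample shown above, written on a single line as it would appear in the file, and the variable names are only illustrative. Note that GeoJSON lists the longitude first, then the latitude.

```python
import json

# The sample record, on one line as in a JSON Lines file.
line = (
    '{"type": "Feature", '
    '"properties": {"hash": "42c5facec7f5f9f5", "number": "434", "street": "Main ST", '
    '"unit": "", "city": "BIRCHY BAY", "district": "", "region": "NL", '
    '"postcode": "A0G1E0", "id": ""}, '
    '"geometry": {"type": "Point", "coordinates": [-54.7197282, 49.3584932]}}'
)

record = json.loads(line)
properties = record["properties"]

# GeoJSON positions are ordered (longitude, latitude).
longitude, latitude = record["geometry"]["coordinates"]

# Address components that are empty strings in this record.
empty_fields = [key for key, value in properties.items() if value == ""]

print(sorted(record))
print(empty_fields)
```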

In [1]:
from datasets import load_dataset, Dataset
import pandas as pd

### Load dataset

In [None]:
dataset_path = "./openaddresses/ca/countrywide-addresses-country.jsonl"

dataset_orig = load_dataset("json", data_files=dataset_path, split='train')
dataset_orig

### Re-format / keep desired columns

In [None]:
def change_format_dataset(dataset: Dataset) -> Dataset:
    """Flatten the 'properties' and 'geometry' columns with json_normalize() and drop unneeded columns."""
    # Convert to pandas DataFrame so we can use the json_normalize() function
    dataset.set_format('pandas')
    df = dataset[:]
    dataset.reset_format()
    
    # json_normalize()
    df_properties = pd.json_normalize(df['properties'])
    df_geometry = pd.json_normalize(df['geometry'])

    # Drop the columns that are empty or not needed (hash, district, id, geometry type)
    df_address = pd.concat(
        [
            df_properties.drop(['hash', 'district', 'id'], axis=1),
            df_geometry.drop('type', axis=1),
        ],
        axis=1
    )

    # Return a Dataset instance
    return Dataset.from_pandas(df_address)

In [None]:
dataset = change_format_dataset(dataset_orig)
dataset
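
The TL;DR mentions formatting each address as a list of words and labels. One possible way to do that from the flattened columns is sketched below. The function name, the field order, and the choice of using the column name as the label are assumptions made for illustration; the notebook does not specify a labeling scheme.

```python
# Hypothetical order in which the address components are written out.
FIELD_ORDER = ("unit", "number", "street", "city", "region", "postcode")


def address_to_words_and_labels(record):
    """Split each non-empty component into words, labeling each word with its field name."""
    words, labels = [], []
    for field in FIELD_ORDER:
        value = (record.get(field) or "").strip()
        for word in value.split():
            words.append(word)
            labels.append(field)
    return words, labels


sample = {"number": "434", "street": "Main ST", "unit": "",
          "city": "BIRCHY BAY", "region": "NL", "postcode": "A0G1E0"}
words, labels = address_to_words_and_labels(sample)
print(list(zip(words, labels)))
```

Empty components (such as `unit` in the sample) contribute no words, so the two lists always have the same length.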

### Split data into train / validation / test sets

In [None]:
def create_val_test_sets(dataset: Dataset, train_size: float=0.9, seed: int=0) -> Dataset:
    """Split a dataset into train, validation and test partitions.

    The result is a DatasetDict with the keys 'train', 'validation' and 'test'.
    """
    dataset_tmp = dataset.train_test_split(train_size=train_size, seed=seed)
    # We will set validation to be 2/3 and test 1/3 of the non-training data
    dataset_new = dataset_tmp['test'].train_test_split(train_size=2/3, seed=seed)
    dataset_new['validation'] = dataset_new['train']
    dataset_new['train'] = dataset_tmp['train']
    return dataset_new

In [None]:
dataset_tr_val_te = create_val_test_sets(dataset, train_size=0.95)
dataset_tr_val_te
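
To make the resulting proportions explicit, the arithmetic behind the two successive splits can be written out. This is a small sketch with a hypothetical helper name; the actual row counts produced by `train_test_split` are rounded to whole rows, so they may differ slightly from these fractions.

```python
def split_fractions(train_size, validation_share=2 / 3):
    """Fractions of the full dataset that end up in each partition."""
    held_out = 1.0 - train_size
    return {
        "train": train_size,
        "validation": held_out * validation_share,
        "test": held_out * (1.0 - validation_share),
    }


fractions = split_fractions(0.95)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.4f}")
```

With `train_size=0.95`, roughly 3.3% of the addresses go to validation and 1.7% to test.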

### Save dataset

In [None]:
dataset_name = "openaddresses_ca"
# Use a distinct loop variable so the full `dataset` above is not overwritten
for split, split_dataset in dataset_tr_val_te.items():
    split_dataset.to_parquet(f"{dataset_name}_{split}.parquet")
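
The saved files can later be reloaded as a single dataset with the three splits. The sketch below assumes the file naming used above; the helper names are hypothetical, and `load_saved_splits` only works once the parquet files exist on disk.

```python
def parquet_data_files(dataset_name, splits=("train", "validation", "test")):
    """Map each split name to the file name used when saving."""
    return {split: f"{dataset_name}_{split}.parquet" for split in splits}


def load_saved_splits(dataset_name):
    """Reload the saved splits as a DatasetDict (requires the files on disk)."""
    # Imported here so that the helper above has no third-party dependency.
    from datasets import load_dataset

    return load_dataset("parquet", data_files=parquet_data_files(dataset_name))


data_files = parquet_data_files("openaddresses_ca")
print(data_files)
```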