# Fine-Tuning SentenceTransformers for Toponym Resolution

This notebook outlines the process of adapting SentenceTransformer models for toponym resolution.

Before proceeding, install the dependencies listed in the requirements file by running `pip install -r requirements.txt`.

---
## Imports

First, we import the necessary modules from our `src` package. Each module covers one part of the pipeline:

- `toponyms.py` for handling the toponym datasets
- `gazetteer.py` for working with the gazetteer data
- `indexing.py` for indexing and candidate generation
- `training.py` for model training utilities
- `evaluation.py` for model prediction and evaluation

In [None]:
from src.toponyms import *
from src.gazetteer import *
from src.indexing import *
from src.training import *
from src.evaluation import *

---
## Data

This notebook relies on various data sources, including toponym datasets and gazetteer data from GeoNames. Please follow the links provided below to download the necessary files. After downloading, organise and name the files according to the specified folder structure to ensure the code runs without issues.

### Download Links

- Toponym Datasets:
    - [Download `lgl.xml`](https://github.com/milangritta/Pragmatic-Guide-to-Geoparsing-Evaluation/blob/master/data/Corpora/lgl.xml)
    - [Download `gwn.xml`](https://github.com/milangritta/Pragmatic-Guide-to-Geoparsing-Evaluation/blob/master/data/GWN.xml)
    - [Download `trn.xml`](https://github.com/milangritta/Pragmatic-Guide-to-Geoparsing-Evaluation/blob/master/data/Corpora/TR-News.xml)
- Gazetteer Data:
    - [Download `allCountries.zip`](https://download.geonames.org/export/dump/allCountries.zip) and extract `allCountries.txt` from the archive
    - [Download `admin1CodesASCII.txt`](https://download.geonames.org/export/dump/admin1CodesASCII.txt)
    - [Download `admin2Codes.txt`](https://download.geonames.org/export/dump/admin2Codes.txt)
    - [Download `countryInfo.txt`](https://download.geonames.org/export/dump/countryInfo.txt)
    - [Download `featureCodes_en.txt`](https://download.geonames.org/export/dump/featureCodes_en.txt)
- Demonyms:
    - [Download `demonyms.csv`](https://github.com/knowitall/chunkedextractor/blob/master/src/main/resources/edu/knowitall/chunkedextractor/demonyms.csv)

### Folder Structure

Ensure that the downloaded files are correctly named and saved in the following structure within the project directory:

```
SemToR/
│
└───data/
    │
    ├───texts/
    │   ├───lgl.xml
    │   ├───gwn.xml
    │   └───trn.xml
    │
    ├───geonames/
    │   ├───allCountries.txt
    │   ├───admin1CodesASCII.txt
    │   ├───admin2Codes.txt
    │   ├───countryInfo.txt
    │   └───featureCodes_en.txt
    │
    └───demonyms.csv
```

In [None]:
LGL_PATH = 'data/texts/lgl.xml'
GWN_PATH = 'data/texts/gwn.xml'
TRN_PATH = 'data/texts/trn.xml'
GAZETTEER_PATH = 'data/geonames/allCountries.txt'
ADMIN1_PATH = 'data/geonames/admin1CodesASCII.txt'
ADMIN2_PATH = 'data/geonames/admin2Codes.txt'
COUNTRY_PATH = 'data/geonames/countryInfo.txt'
FEATURE_PATH = 'data/geonames/featureCodes_en.txt'
DEMONYMS_PATH = 'data/demonyms.csv'
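
Before running the remaining cells, it can help to confirm that every file sits where the constants above expect it. The following is a minimal sketch; `find_missing_files` is a hypothetical helper introduced here for illustration and is not part of the `src` package.

```python
from pathlib import Path

# Paths mirror the constants defined in the cell above.
EXPECTED_FILES = [
    'data/texts/lgl.xml',
    'data/texts/gwn.xml',
    'data/texts/trn.xml',
    'data/geonames/allCountries.txt',
    'data/geonames/admin1CodesASCII.txt',
    'data/geonames/admin2Codes.txt',
    'data/geonames/countryInfo.txt',
    'data/geonames/featureCodes_en.txt',
    'data/demonyms.csv',
]


def find_missing_files(paths, root='.'):
    """Return the entries of `paths` that are not regular files under `root`."""
    root = Path(root)
    return [p for p in paths if not (root / p).is_file()]


missing = find_missing_files(EXPECTED_FILES)
if missing:
    print('Missing files:', ', '.join(missing))
```

An empty result means the folder structure matches the layout shown above.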

---
## Toponym Index

We construct an index that maps toponyms to their potential location candidates using data from the GeoNames gazetteer. This index is a Python dictionary where keys are normalized toponym strings and values are lists of GeoName IDs. We extend the index with demonyms to cover more potential string matches.

We start by loading the gazetteer data.

In [None]:
gazetteer_df = load_gazetteer(GAZETTEER_PATH)

We then generate the toponym index from the gazetteer.

In [None]:
toponym_index = build_toponym_index(gazetteer_df)
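
To illustrate the structure of such an index, here is a small sketch in plain Python. The normalization rule (lowercasing and collapsing whitespace), the record layout, and the function names are assumptions made for this example; the actual `build_toponym_index` may normalize and deduplicate differently.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return ' '.join(name.lower().split())


def build_index_sketch(records):
    """Map each normalized name to the GeoName IDs that carry it.

    `records` is an iterable of (geonameid, name, alternate_names) tuples.
    """
    index = defaultdict(list)
    for geonameid, name, alternate_names in records:
        variants = {normalize(name), *map(normalize, alternate_names)}
        for variant in variants:
            if variant and geonameid not in index[variant]:
                index[variant].append(geonameid)
    return dict(index)


# Two example rows that share the name "London".
records = [
    (2643743, 'London', ['Londres', 'LONDON']),
    (6058560, 'London', ['London, Ontario']),
]
index = build_index_sketch(records)
print(index['london'])
```

An ambiguous name such as "London" ends up with several candidate IDs under one key, which is exactly the ambiguity the fine-tuned model later has to resolve.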

Finally, we extend the index with the singular and plural demonymic forms of 2144 locations.

In [None]:
toponym_index = extend_index_with_demonyms(toponym_index, DEMONYMS_PATH)
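
The idea behind this extension can be sketched as follows. This is only an illustration: the (demonym, place name) pair layout is an assumption about the CSV, the plural is formed naively by appending "s", and `extend_with_demonyms_sketch` is a hypothetical name rather than the project's function.

```python
def extend_with_demonyms_sketch(index, demonym_pairs):
    """Return a copy of `index` with demonym keys pointing to place IDs.

    `demonym_pairs` is an iterable of (demonym, place_name) tuples
    (an assumed layout). Plurals are formed naively by appending "s".
    """
    extended = {key: list(ids) for key, ids in index.items()}
    for demonym, place in demonym_pairs:
        place_ids = index.get(place.lower(), [])
        if not place_ids:
            continue  # the place is not in the index, so there is nothing to link
        singular = demonym.lower()
        for form in (singular, singular + 's'):
            bucket = extended.setdefault(form, [])
            for geonameid in place_ids:
                if geonameid not in bucket:
                    bucket.append(geonameid)
    return extended


base_index = {'canada': [6251999]}
extended_index = extend_with_demonyms_sketch(base_index, [('Canadian', 'Canada')])
print(sorted(extended_index))
```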

---
## Toponym Datasets

To train and evaluate our models for toponym resolution, we prepare three toponym datasets of English news articles:
- [Local Global Lexicon (Lieberman et al. 2010)](https://doi.org/10.1109/ICDE.2010.5447903)
- [GeoWebNews (Gritta et al. 2020)](https://doi.org/10.1007/s10579-019-09475-3)
- [TR-News (Kamalloo and Rafiei 2018)](https://doi.org/10.1145/3178876.3186027)

We start by loading the datasets, which contain news article texts along with annotated toponyms.

In [None]:
lgl_df = load_toponyms(LGL_PATH)
gwn_df = load_toponyms(GWN_PATH)
trn_df = load_toponyms(TRN_PATH)

Next, we filter out toponyms with invalid GeoName IDs by removing rows for which the GeoName IDs are not present in the gazetteer.

In [None]:
lgl_df = filter_toponyms(lgl_df, gazetteer_df)
gwn_df = filter_toponyms(gwn_df, gazetteer_df)
trn_df = filter_toponyms(trn_df, gazetteer_df)
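
Conceptually, this filter is a membership test against the set of gazetteer IDs. The sketch below uses plain dictionaries instead of DataFrames to stay dependency-free; the `geonameid` field name and the function name are assumptions.

```python
def filter_toponyms_sketch(toponyms, gazetteer_ids):
    """Keep only annotations whose GeoName ID exists in the gazetteer."""
    valid_ids = set(gazetteer_ids)  # set lookup keeps the filter linear in size
    return [t for t in toponyms if t['geonameid'] in valid_ids]


annotations = [
    {'toponym': 'London', 'geonameid': 2643743},
    {'toponym': 'Nowhere', 'geonameid': -1},  # invalid ID, will be dropped
]
kept = filter_toponyms_sketch(annotations, gazetteer_ids=[2643743, 6058560])
print([t['toponym'] for t in kept])
```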

For each toponym, we generate a list of candidate locations using our previously created index. This process might require querying the GeoNames API for any toponym string that isn't covered by the index.

> **Important Note on API Calls and Caching**: All responses from the GeoNames API have been cached in the provided `geonames_cache.pkl` file. This cache ensures that no new API calls are required to process the datasets used in this project, reducing the need for individual users to use their own GeoNames accounts. However, if datasets are updated or extended beyond the scope of the cached responses, users may need to make new API calls, which would require a personal GeoNames username.

In [None]:
GEONAMES_USERNAME = 'demo'

lgl_df = generate_candidates(lgl_df, toponym_index, username=GEONAMES_USERNAME)
gwn_df = generate_candidates(gwn_df, toponym_index, username=GEONAMES_USERNAME)
trn_df = generate_candidates(trn_df, toponym_index, username=GEONAMES_USERNAME)
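
The lookup order described above (index first, then cached API responses, then a live query) can be sketched like this. The function and parameter names are hypothetical, and `query_api` stands in for whatever GeoNames client the project uses; no real API call is made here.

```python
def generate_candidates_sketch(toponym, index, cache, query_api=None):
    """Return candidate GeoName IDs for a toponym string.

    Lookup order: local index, then cached API responses, then an
    optional live query whose result is stored in the cache.
    """
    key = ' '.join(toponym.lower().split())
    if key in index:
        return index[key]
    if key in cache:
        return cache[key]
    if query_api is None:
        return []  # no way to resolve the string without an API client
    cache[key] = query_api(key)
    return cache[key]


calls = []


def fake_api(query):
    """Stand-in for a remote search that records how often it is called."""
    calls.append(query)
    return [123]


index = {'london': [2643743]}
cache = {}
print(generate_candidates_sketch('London', index, cache, fake_api))
print(generate_candidates_sketch('Big Apple', index, cache, fake_api))
print(generate_candidates_sketch('Big Apple', index, cache, fake_api))
```

The second lookup of the same string is served from the cache, which is why the shipped cache file can make new API calls unnecessary for the datasets used here.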

Finally, we truncate the texts to meet the model's maximum sequence length requirements. At this stage, we specify the model that will be fine-tuned later, ensuring that the texts are appropriately prepared for that specific model's input limits.

In [None]:
MODEL_NAME = 'all-MiniLM-L6-v2'

MODEL_LIMITS = {
    'all-MiniLM-L6-v2': 256,
    'all-MiniLM-L12-v2': 256,
    'all-distilroberta-v1': 512,
    'all-mpnet-base-v2': 384,
    'multi-qa-MiniLM-L6-cos-v1': 512,
    'multi-qa-distilbert-cos-v1': 512,
    'multi-qa-mpnet-base-dot-v1': 512
}

lgl_df = truncate_texts(lgl_df, f'sentence-transformers/{MODEL_NAME}', MODEL_LIMITS[MODEL_NAME])
gwn_df = truncate_texts(gwn_df, f'sentence-transformers/{MODEL_NAME}', MODEL_LIMITS[MODEL_NAME])
trn_df = truncate_texts(trn_df, f'sentence-transformers/{MODEL_NAME}', MODEL_LIMITS[MODEL_NAME])
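
One plausible way to truncate a text without losing the toponym is to keep a window of tokens around it. The sketch below is an assumption about the general approach, not a description of `truncate_texts`: it works on an already tokenized list and ignores special tokens, whereas the real function uses the model's own tokenizer.

```python
def truncate_around(tokens, target_idx, limit):
    """Return at most `limit` tokens forming a window that contains `target_idx`."""
    if len(tokens) <= limit:
        return tokens
    # Centre the window on the target, then shift it back inside the text.
    start = max(0, target_idx - limit // 2)
    end = start + limit
    if end > len(tokens):
        end = len(tokens)
        start = end - limit
    return tokens[start:end]


tokens = 'floods hit the city of Paris , Texas , on Monday'.split()
window = truncate_around(tokens, target_idx=tokens.index('Paris'), limit=5)
print(window)
```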

---
## Location Candidates

For the models to be able to generate embeddings for location candidates, we create textual representations of gazetteer entries.

First, we reduce the size of our gazetteer to include only the entries that are actual candidates for the toponyms in our datasets. This step helps in minimizing the computational resources needed for subsequent operations.

In [None]:
gazetteer_df = filter_gazetteer(gazetteer_df, [lgl_df, gwn_df, trn_df])

The GeoNames gazetteer uses codes to represent countries, administrative divisions, and feature types. We map these codes to their corresponding names using the downloaded lookup tables.

In [None]:
gazetteer_df = generate_descriptor_names(gazetteer_df, ADMIN1_PATH, ADMIN2_PATH, COUNTRY_PATH, FEATURE_PATH)

With the names extracted, we then generate pseudotexts for each gazetteer entry. These pseudotexts follow the format:

`[name] ([feature type]) in [admin2], [admin1], [country]`

In [None]:
gazetteer_df = generate_pseudotexts(gazetteer_df)
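
As a small illustration of this format, the following sketch assembles a pseudotext from the individual fields. How missing administrative levels are handled (here: they are skipped) is an assumption, and `make_pseudotext` is a hypothetical helper rather than the project's implementation.

```python
def make_pseudotext(name, feature_type, admin2=None, admin1=None, country=None):
    """Build '[name] ([feature type]) in [admin2], [admin1], [country]'.

    Empty or missing levels are skipped (an assumed convention).
    """
    text = f'{name} ({feature_type})'
    places = [part for part in (admin2, admin1, country) if part]
    if places:
        text += ' in ' + ', '.join(places)
    return text


print(make_pseudotext('Paris', 'seat of a second-order administrative division',
                      'Lamar County', 'Texas', 'United States'))
print(make_pseudotext('France', 'independent political entity'))
```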

---
## Fine-Tuning

This section outlines the process of fine-tuning the SentenceTransformer model using our prepared datasets.

First, we import the necessary components from the SentenceTransformers library.

In [None]:
from sentence_transformers import SentenceTransformer, losses

The datasets are divided into training (70%), evaluation (10%), and test (20%) sets. While the training and evaluation sets are pooled from all datasets, the test sets are kept separate for each dataset.

In [None]:
train_df, eval_df, (test_lgl_df, test_gwn_df, test_trn_df) = split_train_eval_test(lgl_df, gwn_df, trn_df, test_size=0.2, eval_size=0.1)
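
The pooling scheme can be sketched in plain Python as follows. This is an illustration under assumptions: rows are shuffled and split individually with a fixed seed, whereas `split_train_eval_test` may split by article or use a different strategy.

```python
import random


def split_sketch(datasets, test_size=0.2, eval_size=0.1, seed=42):
    """Split each dataset, pooling train/eval rows and keeping tests separate."""
    rng = random.Random(seed)
    train, evaluation, tests = [], [], []
    for rows in datasets:
        rows = list(rows)
        rng.shuffle(rows)
        n_test = round(len(rows) * test_size)
        n_eval = round(len(rows) * eval_size)
        tests.append(rows[:n_test])                      # one test set per dataset
        evaluation.extend(rows[n_test:n_test + n_eval])  # pooled
        train.extend(rows[n_test + n_eval:])             # pooled
    return train, evaluation, tests


toy_datasets = [range(0, 10), range(100, 120), range(200, 230)]
train, evaluation, tests = split_sketch(toy_datasets)
print(len(train), len(evaluation), [len(t) for t in tests])
```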

We define the batch size based on our computational resources. Using an NVIDIA GeForce RTX 3070, we found a batch size of 8 to be appropriate.

In [None]:
BATCH_SIZE = 8

Dataloader and evaluator objects are then created for use during the training process.

In [None]:
dataloader = create_dataloader(train_df, gazetteer_df, BATCH_SIZE)
evaluator = ToponymResolutionEvaluator(gazetteer_df, eval_df, model_name=MODEL_NAME, batch_size=BATCH_SIZE)

The SentenceTransformer model is instantiated with the previously specified model name, which identifies the pre-trained model to be fine-tuned.

In [None]:
model = SentenceTransformer(MODEL_NAME)

We use a contrastive loss function, which is suited to learning to distinguish between similar and dissimilar pairs of texts.

In [None]:
loss = losses.ContrastiveLoss(model)
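
To make the objective concrete, the sketch below computes a contrastive loss for a single pair of embeddings in plain Python. It assumes cosine distance and a margin of 0.5, which we understand to be the defaults of `losses.ContrastiveLoss`; treat these values as assumptions and check them against the installed library version.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def contrastive_loss(u, v, label, margin=0.5):
    """Loss for one pair: label 1 marks a matching pair, 0 a non-matching one.

    Matching pairs are pulled together (distance squared); non-matching
    pairs are pushed apart until their distance exceeds the margin.
    """
    d = cosine_distance(u, v)
    return 0.5 * (label * d ** 2 + (1 - label) * max(0.0, margin - d) ** 2)


same = contrastive_loss([1.0, 0.0], [1.0, 0.0], label=1)
clash = contrastive_loss([1.0, 0.0], [1.0, 0.0], label=0)
apart = contrastive_loss([1.0, 0.0], [0.0, 1.0], label=0)
print(same, clash, apart)
```

In this setting, a positive pair would be a toponym in context together with the pseudotext of its correct gazetteer entry, and a negative pair would combine the same toponym with an incorrect candidate.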

Finally, we fine-tune the model. Hyperparameters have been specified following the code examples in the [documentation](https://www.sbert.net/docs/training/overview.html). No hyperparameter optimisation was performed.

During training, checkpoints are saved in the `model_checkpoints` directory, and evaluation results are stored as CSV files in `evaluation_results`. The best-performing model, based on accuracy on the evaluation set, is saved in the `models` directory.

In [None]:
model.fit(train_objectives=[(dataloader, loss)],
          epochs=1,
          warmup_steps=100,
          evaluator=evaluator,
          evaluation_steps=len(dataloader) // 10,
          checkpoint_save_steps=len(dataloader) // 10,
          checkpoint_path=f'model_checkpoints/{MODEL_NAME}',
          save_best_model=True,
          output_path=f'models/{MODEL_NAME}')

---
## Evaluation

After fine-tuning the model, we can evaluate its performance using our test sets.

We start by loading the best-performing model.

In [None]:
model = SentenceTransformer(f'models/{MODEL_NAME}')

We can then proceed to evaluate its performance on the separate test datasets for LGL, GWN, and TRN.

In [None]:
evaluate_model([test_lgl_df, test_gwn_df, test_trn_df], ['LGL', 'GWN', 'TRN'], gazetteer_df, model, model_name=MODEL_NAME, batch_size=BATCH_SIZE)
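
For intuition, prediction with an embedding model can be sketched as picking the candidate whose embedding is most similar to that of the toponym in context, with accuracy as the share of correct picks. This is an assumed simplification with toy vectors; `evaluate_model` may rank candidates differently and report additional metrics.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def predict_location(toponym_vec, candidates):
    """Return the ID whose embedding is most similar to the toponym embedding.

    `candidates` maps GeoName IDs to embedding vectors.
    """
    return max(candidates, key=lambda gid: cosine_similarity(toponym_vec, candidates[gid]))


def accuracy(predicted_ids, gold_ids):
    """Fraction of toponyms resolved to the annotated GeoName ID."""
    matches = sum(p == g for p, g in zip(predicted_ids, gold_ids))
    return matches / len(gold_ids)


# Toy two-dimensional embeddings for two candidates named "London".
candidates = {2643743: [1.0, 0.0], 6058560: [0.0, 1.0]}
prediction = predict_location([0.9, 0.1], candidates)
print(prediction, accuracy([prediction], [2643743]))
```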