# Training and using a DeezyMatch model (option 2)

This notebook shows how to train a new DeezyMatch model, given an existing string pairs dataset.

To do so, the `resources/` folder should (at least) contain the following files, in the following locations:
```
toponym-resolution/
   ├── ...
   ├── resources/
   │   ├── deezymatch/
   │   │   ├── data/
   │   │   │   └── w2v_ocr_pairs.txt
   │   │   └── inputs/
   │   │       ├── characters_v001.vocab
   │   │       └── input_dfm.yaml
   │   ├── models/
   │   ├── news_datasets/
   │   ├── wikidata/
   │   │   └── mentions_to_wikidata.json
   │   └── wikipedia/
   └── ...
```
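Before going further, it can help to confirm that these files are in place. The snippet below is a minimal sketch and not part of T-Res: the helper name `missing_resources` is hypothetical, and the relative path `../resources/` assumes the notebook is run from a folder next to `resources/`, as in the cells that follow.

```python
from pathlib import Path

# Files this notebook expects, relative to the resources folder (taken from the tree above).
REQUIRED_FILES = [
    "deezymatch/data/w2v_ocr_pairs.txt",
    "deezymatch/inputs/characters_v001.vocab",
    "deezymatch/inputs/input_dfm.yaml",
    "wikidata/mentions_to_wikidata.json",
]


def missing_resources(resources_path, required=REQUIRED_FILES):
    """Return the required files that are absent under resources_path (hypothetical helper)."""
    root = Path(resources_path)
    return [rel for rel in required if not (root / rel).is_file()]


# Assumed location of the resources folder; adjust it to your own setup.
missing = missing_resources("../resources/")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files found.")
```

An empty list means every file named in the tree was found. The `models/`, `news_datasets/` and `wikipedia/` folders are not checked because the tree does not list any file inside them.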

We start by importing some libraries and the `ranking` module from `t_res.geoparser`:

In [1]:
import os
import sys
from pathlib import Path

from t_res.geoparser import ranking

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


Create a `myranker` object of the `Ranker` class.

In [2]:
myranker = ranking.Ranker(
    method="deezymatch", # Here we're telling the ranker to use DeezyMatch.
    resources_path="../resources/", # Here, the path to the Wikidata resources.
    # Parameters to create the string pair dataset:
    strvar_parameters={
        "overwrite_dataset": False,
    },
    # Parameters to train, load and use a DeezyMatch model:
    deezy_parameters={
        # Paths and filenames of DeezyMatch models and data:
        "dm_path": str(Path("../resources/deezymatch/").resolve()), # Path to the DeezyMatch directory where the model is saved.
        "dm_cands": "wkdtalts", # Name we'll give to the folder that will contain the wikidata candidate vectors.
        "dm_model": "w2v_ocr", # Name of the DeezyMatch model.
        "dm_output": "on_the_fly1", # Name of the file where the output of DeezyMatch will be stored. Feel free to change that.
        # Ranking measures:
        "ranking_metric": "faiss", # Metric used by DeezyMatch to rank the candidates.
        "selection_threshold": 50, # Threshold for that metric.
        "num_candidates": 1, # Number of name variations for a string (e.g. "London", "Londra", and "Londres" are three different variations in our gazetteer of "Londcn").
        "verbose": False, # Whether to see the DeezyMatch progress or not.
        # DeezyMatch training:
        "overwrite_training": True, # Whether to overwrite the model if it already exists: here we are training a new model, so this is set to True.
        "do_test": False, # Whether to train the DeezyMatch model in test mode, or not.
    },
)
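To make the ranking parameters more concrete, here is a toy sketch of how a distance threshold and a cap on the number of variations could interact. It is an illustration only and does not reproduce DeezyMatch's internals: the function `select_variations`, the candidate strings and the distances are all made up, and it assumes that a smaller distance means a closer match.

```python
def select_variations(distances, selection_threshold, num_candidates):
    """Keep variations within the threshold, closest first, up to num_candidates.

    distances: dict mapping a candidate string to its (made-up) distance from the query.
    """
    within = {s: d for s, d in distances.items() if d <= selection_threshold}
    ranked = sorted(within, key=within.get)
    return ranked[:num_candidates]


# Made-up distances from the OCR'd query "Londcn" to a few gazetteer strings.
toy_distances = {"London": 3.2, "Londres": 12.5, "Londra": 9.8, "Lisbon": 71.0}

print(select_variations(toy_distances, selection_threshold=50, num_candidates=1))
print(select_variations(toy_distances, selection_threshold=50, num_candidates=3))
```

With `num_candidates=1` only the closest variation survives. Raising it to 3 also returns the next two variations that fall under the threshold, while "Lisbon" is dropped in both cases because its toy distance exceeds 50.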

Load the resources (i.e. the `mentions-to-wikidata` and `wikidata-to-mentions` mappers) that will be used by the ranker:

In [3]:
# Load the resources:
myranker.mentions_to_wikidata = myranker.load_resources()

*** Loading the ranker resources.


Train a DeezyMatch model. Note that `do_test` is set to `False` above, so this is not a `test` model; set `do_test` to `True` if you want to train a test model instead:

In [4]:
# Train a DeezyMatch model if needed. The result is not assigned to
# `mentions_to_wikidata`, so the resources loaded above are left untouched:
myranker.train()

The string match dataset already exists!
2024-04-10 09:51:26 antoine-liris [INFO] read input file: /home/antoine/Documents/GitHub/T-Res/resources/deezymatch/inputs/input_dfm.yaml
2024-04-10 09:51:26 antoine-liris [INFO] GPU was requested but not available.
2024-04-10 09:51:26 antoine-liris [INFO] pytorch will use: cpu
2024-04-10 09:51:26 antoine-liris [INFO] read CSV file: /home/antoine/Documents/GitHub/T-Res/resources/deezymatch/data/w2v_ocr_pairs.txt
2024-04-10 09:51:31 antoine-liris [INFO] number of labels, True: 610031 and False: 475483
2024-04-10 09:51:31 antoine-liris [INFO] Splitting the Dataset
2024-04-10 09:51:31 antoine-liris [INFO] finish splitting the Dataset. User time: 0.3357422351837158

Exception ignored in: <bound method IPythonKernel._clean_thread_parent_frames of <ipykernel.ipkernel.IPythonKernel object at 0x76d90446e7f0>>
Traceback (most recent call last):
  File "/home/antoine/.cache/pypoetry/virtualenvs/t-res-rAxVKS4n-py3.9/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 775, in _clean_thread_parent_frames
    def _clean_thread_parent_frames(
KeyboardInterrupt: 


2024-04-10 09:51:40 antoine-liris [INFO] -- convert tokens to indices
2024-04-10 09:51:40 antoine-liris [INFO] -- create a lookup table for tokens
2024-04-10 09:51:40 antoine-liris [INFO] -- read list of characters from ../resources/deezymatch/inputs/characters_v001.vocab
2024-04-10 09:51:40 antoine-liris [INFO] -- Length of vocabulary: 7554


                                                                       




2024-04-10 09:51:47 antoine-liris [INFO] ******************************
2024-04-10 09:51:47 antoine-liris [INFO] **** (Bi-directional) GRU ****
2024-04-10 09:51:47 antoine-liris [INFO] ******************************
2024-04-10 09:51:47 antoine-liris [INFO] read inputs
2024-04-10 09:51:47 antoine-liris [INFO] create a two_parallel_rnns model
2024-04-10 09:51:48 antoine-liris [INFO] start fitting parameters
2024-04-10 09:51:48 antoine-liris [INFO] Number of batches: 28834
2024-04-10 09:51:48 antoine-liris [INFO] Number of epochs: 5


  0%|          | 0/5 [00:00<?, ?it/s]

  0%|          | 0/28834 [00:00<?, ?it/s]




Total number of params: 627963

two_parallel_rnns (
  (emb): Embedding(7554, 60), weights=((7554, 60),), parameters=453240
  (rnn_1): GRU(60, 60, num_layers=2, dropout=0.1, bidirectional=True), weights=((180, 60), (180, 60), (180,), (180,), (180, 60), (180, 60), (180,), (180,), (180, 120), (180, 60), (180,), (180,), (180, 120), (180, 60), (180,), (180,)), parameters=109440
  (attn_step1): Linear(in_features=120, out_features=60, bias=True), weights=((60, 120), (60,)), parameters=7260
  (attn_step2): Linear(in_features=60, out_features=1, bias=True), weights=((1, 60), (1,)), parameters=61
  (fc1): Linear(in_features=480, out_features=120, bias=True), weights=((120, 480), (120,)), parameters=57720
  (fc2): Linear(in_features=120, out_features=2, bias=True), weights=((2, 120), (2,)), parameters=242
)
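The parameter counts in the summary above can be checked by hand from the layer shapes it prints. The sketch below is plain arithmetic and does not import DeezyMatch or PyTorch; the helper names `linear_params` and `gru_params` are hypothetical, and the GRU formula assumes the usual layout of three weight blocks (reset gate, update gate, candidate state), each with input weights, hidden weights and two bias vectors.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def gru_params(input_size, hidden_size, num_layers, bidirectional):
    """Parameter count of a multi-layer GRU with bias terms."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first receive the concatenated outputs of all directions.
        layer_input = input_size if layer == 0 else hidden_size * directions
        # Three blocks (reset, update, candidate), each with input weights,
        # hidden weights and two bias vectors of size hidden_size.
        per_direction = 3 * hidden_size * (layer_input + hidden_size + 2)
        total += per_direction * directions
    return total


# Shapes copied from the model summary printed above.
counts = {
    "emb": 7554 * 60,
    "rnn_1": gru_params(60, 60, num_layers=2, bidirectional=True),
    "attn_step1": linear_params(120, 60),
    "attn_step2": linear_params(60, 1),
    "fc1": linear_params(480, 120),
    "fc2": linear_params(120, 2),
}
for name, n in counts.items():
    print(f"{name}: {n}")
print("total:", sum(counts.values()))
```

The per-layer values match the `parameters=` figures in the summary, and their sum is 627963, the reported total. The embedding table alone accounts for 453240 of these, which follows directly from the vocabulary of 7554 characters.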




Given the DeezyMatch model that has just been trained, find candidates on Wikidata:

In [None]:
# Find candidates given a toponym:
toponym = "Manchefter"
print(myranker.find_candidates([{"mention": toponym}])[toponym])

In [None]:
# Find candidates given a toponym:
toponym = "Londen"
print(myranker.find_candidates([{"mention": toponym}])[toponym])
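Both cells above pass a one-element list to `find_candidates`. Since the method takes a list of `{"mention": ...}` dictionaries, several toponyms can presumably be queried in a single call. The sketch below only builds that list: `build_queries` is a hypothetical helper, and the final call is left as a comment because it needs the trained model and the resources from the earlier cells.

```python
def build_queries(toponyms):
    """Turn toponym strings into the list-of-dicts format used above, dropping duplicates."""
    seen = set()
    queries = []
    for toponym in toponyms:
        if toponym not in seen:
            seen.add(toponym)
            queries.append({"mention": toponym})
    return queries


queries = build_queries(["Manchefter", "Londen", "Manchefter"])
print(queries)

# With the ranker from the cells above (not run here):
# candidates = myranker.find_candidates(queries)
# for query in queries:
#     print(query["mention"], candidates[query["mention"]])
```

Dropping duplicates keeps the query list short when the same OCR'd form appears many times in a text. The lookup by mention string in the commented loop mirrors the `[toponym]` indexing used in the cells above.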