# Basic T-Res pipeline

An example of how to run the basic pipeline (with default values).

In [1]:
import os
import sys

from t_res.geoparser import pipeline

Once the `pipeline` module has been imported (in the previous cell), we create a new object of the `Pipeline` class. Since we pass only the path to the resources, every other parameter takes its default value: toponyms are detected with the `Livingwithmachines/toponym-19thC-en` NER model, candidates are found using the perfect match approach, and they are disambiguated using the most popular approach. You can see the default `Pipeline` values [here](https://living-with-machines.github.io/T-Res/reference/geoparser/pipeline.html).

In [2]:
geoparser = pipeline.Pipeline(resources_path="../resources/")

*** Creating and loading a NER pipeline.
*** Loading the ranker resources.
*** Load linking resources.
  > Loading mentions to wikidata mapping.
  > Loading gazetteer.
*** Linking resources loaded!



### Using the pipeline: end-to-end

The pipeline can take either a sentence (`run_sentence`) or a document (`run_text`). In the latter case, the text is first split into sentences using the `sentence-splitter` library. An example of each follows:

In [3]:
resolved = geoparser.run_text(" The city of Valence.")
resolved

[{'mention': 'Valence',
  'ner_score': 1.0,
  'pos': 12,
  'sent_idx': 0,
  'end_pos': 19,
  'tag': 'LOC',
  'sentence': 'The city of Valence.',
  'prediction': 'Q8848',
  'ed_score': 0.876,
  'string_match_score': {'Valence': (1.0,
    ['Q8818',
     'Q2875976',
     'Q8848',
     'Q2052261',
     'Q1467944',
     'Q1361868',
     'Q3097931',
     'Q702697',
     'Q495485'])},
  'prior_cand_score': {},
  'cross_cand_score': {'Q8848': 0.876,
   'Q1467944': 0.053,
   'Q1361868': 0.046,
   'Q702697': 0.007,
   'Q2052261': 0.005,
   'Q3097931': 0.005,
   'Q8818': 0.002},
  'latlon': [44.9325, 4.890833],
  'wkdt_class': 'Q484170'}]
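
Each resolved toponym is a plain dictionary, so the fields of interest can be pulled out with ordinary Python. The sketch below is illustrative only: `summarise` is a hypothetical helper (not part of T-Res), the `resolved` literal is an abridged copy of the output above, and the URL pattern assumes that `prediction` holds a Wikidata QID.

```python
# Abridged copy of the `run_text` output shown above.
resolved = [
    {
        "mention": "Valence",
        "prediction": "Q8848",
        "ed_score": 0.876,
        "cross_cand_score": {
            "Q8848": 0.876,
            "Q1467944": 0.053,
            "Q1361868": 0.046,
        },
        "latlon": [44.9325, 4.890833],
    }
]


def summarise(entry):
    """Hypothetical helper: keep the fields most useful for mapping."""
    lat, lon = entry["latlon"]
    return {
        "mention": entry["mention"],
        # Assumption: the prediction is a Wikidata QID.
        "wikidata_url": f"https://www.wikidata.org/wiki/{entry['prediction']}",
        "lat": lat,
        "lon": lon,
        "score": entry["ed_score"],
    }


summaries = [summarise(entry) for entry in resolved]
print(summaries[0]["wikidata_url"])
```

The same helper should work on the output of `run_sentence`, since both outputs shown in this notebook have the same keys.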

In [4]:
resolved = geoparser.run_sentence("A remarkable case of rattening has just occurred in the building trade at Sheffield.")
print(resolved)

[{'mention': 'Sheffield', 'ner_score': 1.0, 'pos': 74, 'sent_idx': 0, 'end_pos': 83, 'tag': 'LOC', 'sentence': 'A remarkable case of rattening has just occurred in the building trade at Sheffield.', 'prediction': 'Q42448', 'ed_score': 0.896, 'string_match_score': {'Sheffield': (1.0, ['Q6707254', 'Q823917', 'Q5953687', 'Q7492778', 'Q1421317', 'Q7492594', 'Q897533', 'Q42448', 'Q7492565', 'Q1862179', 'Q4834926', 'Q17643392', 'Q7492570', 'Q1950928', 'Q2277715', 'Q79568', 'Q518864', 'Q7492591', 'Q2306176', 'Q7492775', 'Q741640', 'Q7492686', 'Q3577611', 'Q12956644', 'Q547824', 'Q7684835', 'Q3365926', 'Q7492719', 'Q7492566', 'Q7492567', 'Q4523493', 'Q3028626', 'Q7492607', 'Q7492568', 'Q1984238', 'Q1184547', 'Q925542', 'Q4664093', 'Q2892594', 'Q1916592', 'Q371969', 'Q1141915', 'Q6986914', 'Q7114883', 'Q1915446', 'Q5224096', 'Q7492766', 'Q15277074', 'Q4065168', 'Q1548891', 'Q7492772', 'Q977409', 'Q1752117', 'Q7492586', 'Q5035049', 'Q108940076'])}, 'prior_cand_score': {}, 'cross_cand_score': {'Q

### Using the pipeline: step-wise

Instead of running the pipeline end-to-end, you can run each of its steps separately.

The first step performs only toponym recognition (i.e. NER):

In [5]:
mentions = geoparser.run_text_recognition("A remarkable case of rattening has just occurred in the building trade at Sheffield.")
print(mentions)

[{'mention': 'Sheffield', 'context': ['', ''], 'candidates': [], 'gold': ['NONE'], 'ner_score': 1.0, 'pos': 74, 'sent_idx': 0, 'end_pos': 83, 'ngram': 'Sheffield', 'conf_md': 1.0, 'tag': 'LOC', 'sentence': 'A remarkable case of rattening has just occurred in the building trade at Sheffield.', 'place': '', 'place_wqid': ''}]
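
In the output above, `pos` and `end_pos` appear to be character offsets into `sentence`. This is an observation from this one example, not a documented guarantee. A quick check, using values copied from that output:

```python
sentence = (
    "A remarkable case of rattening has just occurred "
    "in the building trade at Sheffield."
)
# Only the keys needed for the check are copied from the NER output above.
mention = {"mention": "Sheffield", "pos": 74, "end_pos": 83, "sentence": sentence}

# Slice the sentence with the reported offsets and compare with the mention text.
span = mention["sentence"][mention["pos"]:mention["end_pos"]]
print(span == mention["mention"])
```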


The second step performs only candidate selection, given the output of NER:

In [15]:
candidates = geoparser.run_candidate_selection(mentions)
print(candidates)

{'Sheffield': {'Sheffield': {'Score': 1.0, 'Candidates': {'Q6707254': 0.038461538461538464, 'Q823917': 0.04389027431421446, 'Q5953687': 0.25, 'Q7492778': 0.21153846153846156, 'Q1421317': 0.044117647058823525, 'Q7492594': 0.05, 'Q897533': 0.026007802340702213, 'Q42448': 0.9632482747552559, 'Q7492565': 0.7058823529411764, 'Q1862179': 0.6057347670250897, 'Q4834926': 0.043478260869565216, 'Q17643392': 0.047619047619047616, 'Q7492570': 0.7391304347826086, 'Q1950928': 0.6666666666666666, 'Q2277715': 0.7636363636363636, 'Q79568': 0.2857142857142857, 'Q518864': 0.6551724137931034, 'Q7492591': 0.20454545454545456, 'Q2306176': 0.3943661971830986, 'Q7492775': 0.125, 'Q741640': 0.16666666666666666, 'Q7492686': 0.25, 'Q3577611': 0.1, 'Q12956644': 0.22988505747126436, 'Q547824': 0.09090909090909091, 'Q7684835': 0.1, 'Q3365926': 0.47058823529411764, 'Q7492719': 0.125, 'Q7492566': 0.6071428571428571, 'Q7492567': 1.0, 'Q4523493': 0.07692307692307693, 'Q3028626': 0.07792207792207792, 'Q7492607': 0.00714
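
The candidate dictionary is nested: mention, then matched string, then a `Score` and a `Candidates` mapping from Wikidata ID to a score. As a rough sketch of how to inspect it, the hypothetical helper `top_candidates` below (not part of T-Res) returns the highest-scoring candidates for a mention. The `candidates` literal keeps only five of the entries from the truncated output above.

```python
# Abridged copy of the (truncated) output above: only five candidates are kept.
candidates = {
    "Sheffield": {
        "Sheffield": {
            "Score": 1.0,
            "Candidates": {
                "Q6707254": 0.038461538461538464,
                "Q42448": 0.9632482747552559,
                "Q7492565": 0.7058823529411764,
                "Q2277715": 0.7636363636363636,
                "Q7492567": 1.0,
            },
        }
    }
}


def top_candidates(candidates, mention, n=3):
    """Hypothetical helper: the n highest-scoring QIDs for a mention."""
    best = {}
    # A mention may match several strings; keep each QID's best score.
    for variation in candidates.get(mention, {}).values():
        for qid, score in variation["Candidates"].items():
            best[qid] = max(score, best.get(qid, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:n]


print(top_candidates(candidates, "Sheffield"))
```

Note that the highest-scoring candidate here (`Q7492567`) is not the entity finally predicted for this sentence (`Q42448`), so these scores should not be read as the final disambiguation result.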

And finally, the pipeline can be used to perform entity disambiguation, given the output from the previous two steps:

In [16]:
disamb_output = geoparser.run_disambiguation(mentions, candidates)
print(disamb_output)

[{'mention': 'Sheffield', 'ner_score': 1.0, 'pos': 74, 'sent_idx': 0, 'end_pos': 83, 'tag': 'LOC', 'sentence': 'A remarkable case of rattening has just occurred in the building trade at Sheffield.', 'prediction': 'Q42448', 'ed_score': 0.896, 'string_match_score': {'Sheffield': (1.0, ['Q6707254', 'Q823917', 'Q5953687', 'Q7492778', 'Q1421317', 'Q7492594', 'Q897533', 'Q42448', 'Q7492565', 'Q1862179', 'Q4834926', 'Q17643392', 'Q7492570', 'Q1950928', 'Q2277715', 'Q79568', 'Q518864', 'Q7492591', 'Q2306176', 'Q7492775', 'Q741640', 'Q7492686', 'Q3577611', 'Q12956644', 'Q547824', 'Q7684835', 'Q3365926', 'Q7492719', 'Q7492566', 'Q7492567', 'Q4523493', 'Q3028626', 'Q7492607', 'Q7492568', 'Q1984238', 'Q1184547', 'Q925542', 'Q4664093', 'Q2892594', 'Q1916592', 'Q371969', 'Q1141915', 'Q6986914', 'Q7114883', 'Q1915446', 'Q5224096', 'Q7492766', 'Q15277074', 'Q4065168', 'Q1548891', 'Q7492772', 'Q977409', 'Q1752117', 'Q7492586', 'Q5035049', 'Q108940076'])}, 'prior_cand_score': {}, 'cross_cand_score': {'Q
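
The three step-wise calls can be chained in a small wrapper that also keeps the intermediate outputs, which helps when inspecting where a wrong prediction comes from. This is only a sketch: `run_stepwise` is a hypothetical helper, and `_StubGeoparser` is a stand-in with made-up return values so that the example runs without the T-Res resources. Only the three method names are taken from the cells above.

```python
def run_stepwise(geoparser, sentence):
    """Hypothetical helper: chain the three step-wise calls shown above."""
    mentions = geoparser.run_text_recognition(sentence)
    candidates = geoparser.run_candidate_selection(mentions)
    resolved = geoparser.run_disambiguation(mentions, candidates)
    return {"mentions": mentions, "candidates": candidates, "resolved": resolved}


class _StubGeoparser:
    """Stand-in for `pipeline.Pipeline`; its return values are invented."""

    def run_text_recognition(self, sentence):
        return [{"mention": "Sheffield", "sentence": sentence}]

    def run_candidate_selection(self, mentions):
        return {m["mention"]: {} for m in mentions}

    def run_disambiguation(self, mentions, candidates):
        return [dict(m, prediction=None) for m in mentions]


steps = run_stepwise(_StubGeoparser(), "The building trade at Sheffield.")
print(sorted(steps))
```

With the real object created earlier, the call would be `run_stepwise(geoparser, "...")`. In the outputs shown above, the visible part of the step-wise result matches that of `run_sentence` for the same sentence.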

# Test ranking

In [6]:
from pathlib import Path
from t_res.geoparser import pipeline, ranking, linking

In [7]:
# --------------------------------------
# Instantiate the ranker:
myranker = ranking.Ranker(
    method="deezymatch",
    resources_path="../resources/",
    strvar_parameters={
        # Parameters to create the string pair dataset:
        "ocr_threshold": 60,
        "top_threshold": 85,
        "min_len": 5,
        "max_len": 15,
        "w2v_ocr_path": str(Path("../resources/models/w2v/").resolve()),
        "w2v_ocr_model": "w2v_*_news",
        "overwrite_dataset": False,
    },
    deezy_parameters={
        # Paths and filenames of DeezyMatch models and data:
        "dm_path": str(Path("../resources/deezymatch/").resolve()),
        "dm_cands": "wkdtalts",
        "dm_model": "w2v_ocr",
        "dm_output": "deezymatch_on_the_fly",
        # Ranking measures:
        "ranking_metric": "faiss",
        "selection_threshold": 50,
        "num_candidates": 1,
        "verbose": False,
        # DeezyMatch training:
        "overwrite_training": False,
        "do_test": False,
    },
)

The toponyms queried in the next cell are taken from the first of the following two sample entries, which are in French:

    Valence, (Géograph. mod.) petite ville, disons mieux, bourg de France dans l'Agénois, sur la rive droite de la Garonne, vis-à-vis d'Aurignac. (D. J.) 

    LOURDE, Laperdum, (Géog.) petite ville de France en Gascogne, ville unique, & chef-lieu du Lavedan, avec un ancien château sur un rocher. Elle est sur le Gave de Pau, à 4 lieues de Bagnieres. Long. 17. 30. lat. 43. 8. (D. J.) 

In [8]:
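# `find_candidates` expects a list of mention dictionaries with a "mention" key,
# like those returned by `run_text_recognition`. Passing plain strings raises
# `TypeError: string indices must be integers`.
toponyms = ['Valence', 'France', 'Agénois', 'Garonne', 'Aurignac']
myranker.find_candidates(mentions=[{"mention": toponym} for toponym in toponyms])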