### Geotagging - prepare text for Label Studio
This notebook pre-processes a text document and outputs a .json file supported by the Label Studio NER annotation tool (https://labelstud.io/).  
The original text is extracted with the TextFromFile method from the deep_parser package (https://github.com/the-deep/deepex).  
The notebook computes location predictions from several NER models and adds them as pre-annotations to the Label Studio output.  
The pre-annotations are then uploaded to Label Studio for human review.
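
For orientation, here is a minimal sketch of what a Label Studio task with NER pre-annotations can look like. It is not the project's `write_LS_json`. The helper name `build_ls_task`, the `LOC` label, and the `from_name` / `to_name` values are assumptions for illustration; the last two must match the tag names in your own labeling configuration.

```python
import json


def build_ls_task(doc_id, text, spans, model_version="predictions",
                  from_name="label", to_name="text"):
    """Build one Label Studio task with NER pre-annotations.

    `spans` is a list of (start, end, label) character spans into `text`.
    `from_name` and `to_name` are assumptions: they must match the tag
    names used in the project's labeling configuration.
    """
    results = []
    for start, end, label in spans:
        results.append({
            "from_name": from_name,
            "to_name": to_name,
            "type": "labels",
            "value": {
                "start": start,
                "end": end,
                "text": text[start:end],
                "labels": [label],
            },
        })
    return {
        "id": doc_id,
        "data": {"text": text},
        "predictions": [{"model_version": model_version, "result": results}],
    }


# Toy sentence, not taken from the document being processed.
sample_text = "Displaced families arrived in Maiduguri in June 2019."
start = sample_text.index("Maiduguri")
task = build_ls_task(17344, sample_text, [(start, start + len("Maiduguri"), "LOC")])
print(json.dumps(task, indent=2))
```

Each entry in `result` is shown to the annotator as an editable suggestion, which is what makes the human review step possible.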

In [None]:
%pip install -r requirements.txt

In [None]:
# download spaCy models (French, Spanish, English)
!python -m spacy download fr_core_news_md
!python -m spacy download es_core_news_md
!python -m spacy download en_core_web_md

In [None]:
import pandas as pd
from os import path
from src.pre_processing import pre_processing
from src.NER_models import get_spacy_predictions, get_transformer_predictions, merge_multiwords
from src.LS_utils import write_LS_json, merge_overlapping_models
import json

In [4]:
#paths
input_folder = "data/input"
output_folder = "data/output"
doc_id = 17344

#fetch data
df, text = pre_processing(input_folder, doc_id, max_pages = 5)
df.head(10)

Unnamed: 0,page,block,text,offset
0,1,1,I D W N O T,23
1,1,2,O R,37
2,1,3,K E R S T A R G,43
3,1,4,A R E T,61
4,1,5,In memory of,71
5,1,6,"Saifura Hussaini Ahmed Khorsa, aid worker, kil...",86
6,1,7,NORTH-EAST NIGERIA HUMANITARIAN SITUATION UPDATE,207
7,1,8,Progress on key activities from the 2019-2021 ...,258
8,1,9,(covering 1 - 30 June 2019) JULY 2019 EDITION\...,337
9,2,1,NORTH-EAST NIGERIA: HUMANITARIAN SITUATION UPD...,414
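
The `offset` column appears to hold the character position of each block inside the full `text` string. The sketch below checks this reading against the first rows of the table and the beginning of the document text as it is printed at the end of the notebook. The helper names `block_matches` and `to_document_span` are hypothetical and not part of `src`.

```python
# Beginning of the document text, as printed in the notebook's final output.
text = (
    "17344\n[PAGE 1 START]\n\t\nI D W N O T\n\t\nO R\n\t\n"
    "K E R S T A R G\n\t\nA R E T\n\t\nIn memory of\n"
)

# (block text, offset) pairs copied from the first rows of the table.
blocks = [
    ("I D W N O T", 23),
    ("O R", 37),
    ("K E R S T A R G", 43),
    ("A R E T", 61),
    ("In memory of", 71),
]


def block_matches(text, block_text, offset):
    """Return True if `block_text` sits at character `offset` in `text`."""
    return text[offset:offset + len(block_text)] == block_text


def to_document_span(block_offset, start_in_block, end_in_block):
    """Shift a block-relative span to document-level character offsets."""
    return block_offset + start_in_block, block_offset + end_in_block


for block_text, offset in blocks:
    print(offset, block_matches(text, block_text, offset))

# "memory" occupies characters 3..9 of the block "In memory of".
doc_start, doc_end = to_document_span(71, 3, 9)
print(text[doc_start:doc_end])
```

If an NER model reports entity positions relative to a block, adding the block's `offset` gives the document-level positions that Label Studio needs.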


#### Model 1: spaCy NER
https://spacy.io/models/en#en_core_web_md

In [5]:
spacy_pred = get_spacy_predictions(df)
spacy_pred



Unnamed: 0,page,block,text,offset,locs
0,1,1,I D W N O T,23,[]
1,1,2,O R,37,[]
2,1,3,K E R S T A R G,43,[]
3,1,4,A R E T,61,[]
4,1,5,In memory of,71,[]
5,1,6,"Saifura Hussaini Ahmed Khorsa, aid worker, kil...",86,[]
6,1,7,NORTH-EAST NIGERIA HUMANITARIAN SITUATION UPDATE,207,[]
7,1,8,Progress on key activities from the 2019-2021 ...,258,[]
8,1,9,(covering 1 - 30 June 2019) JULY 2019 EDITION\...,337,[]
9,2,1,NORTH-EAST NIGERIA: HUMANITARIAN SITUATION UPD...,414,[]


#### Model 2: XLM-RoBERTa
https://huggingface.co/Davlan/xlm-roberta-base-wikiann-ner

In [6]:
tr_model = "Davlan/xlm-roberta-base-wikiann-ner"
xlmrob_pred = get_transformer_predictions(df, model_path = tr_model)
xlmrob_pred

Unnamed: 0,page,block,text,offset,locs
0,1,1,I D W N O T,23,[]
1,1,2,O R,37,[]
2,1,3,K E R S T A R G,43,[]
3,1,4,A R E T,61,[]
4,1,5,In memory of,71,[]
5,1,6,"Saifura Hussaini Ahmed Khorsa, aid worker, kil...",86,[]
6,1,7,NORTH-EAST NIGERIA HUMANITARIAN SITUATION UPDATE,207,[]
7,1,8,Progress on key activities from the 2019-2021 ...,258,[]
8,1,9,(covering 1 - 30 June 2019) JULY 2019 EDITION\...,337,[]
9,2,1,NORTH-EAST NIGERIA: HUMANITARIAN SITUATION UPD...,414,"[{'ent': 'NORTH-EAST NIGERIA', 'offset_start':..."
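
The implementation of `get_transformer_predictions` is not shown here. Token-classification models of this kind usually emit one BIO tag per token (`B-LOC`, `I-LOC`, `O`, ...), so some grouping step is needed to obtain whole location spans. The sketch below shows one way to do that grouping in plain Python. The function `group_loc_spans`, the hand-written token tags, and the `offset_end` key are assumptions; only `ent` and `offset_start` are visible in the truncated output above.

```python
def group_loc_spans(tokens):
    """Group token-level BIO tags into location spans.

    `tokens` is a list of (start, end, tag) tuples with character offsets.
    Only LOC entities are kept; other tags close the current span.
    """
    spans = []
    current = None  # [start, end] of the span being built
    for start, end, tag in tokens:
        if tag == "B-LOC" or (tag == "I-LOC" and current is None):
            # Start a new span (a stray I-LOC is treated like B-LOC).
            if current is not None:
                spans.append(tuple(current))
            current = [start, end]
        elif tag == "I-LOC":
            # Extend the span that is currently open.
            current[1] = end
        else:
            if current is not None:
                spans.append(tuple(current))
                current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


text = "NORTH-EAST NIGERIA: HUMANITARIAN SITUATION UPDATE"
# Hand-written tags for illustration, not real model output.
tokens = [
    (0, 10, "B-LOC"), (11, 18, "I-LOC"), (18, 19, "O"),
    (20, 32, "O"), (33, 42, "O"), (43, 49, "O"),
]

spans = group_loc_spans(tokens)
# `offset_end` is an assumed key name.
locs = [{"ent": text[s:e], "offset_start": s, "offset_end": e} for s, e in spans]
print(locs)
```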


In [7]:
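# tag each prediction set with the model that produced it
spacy_pred['type'] = "SPACY"
xlmrob_pred['type'] = "XLM_ROB"

# merge overlapping predictions from the two models, then merge multi-word entities
merged = merge_overlapping_models(spacy_pred, xlmrob_pred)
merged['locs'] = merged.apply(lambda x: merge_multiwords(x.text, x.locs), axis = 1)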
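
To illustrate the idea of combining two models, here is a minimal sketch of one possible merge rule: take the union of both span lists and fuse spans that overlap, keeping track of which model(s) produced each result. This is an assumption about the general technique, not the actual logic of `merge_overlapping_models`; the function name and the example spans are made up.

```python
def merge_overlapping(spans_a, spans_b):
    """Union two lists of (start, end, source) spans, fusing overlaps.

    Overlapping spans are replaced by one span covering both, and the
    source names are joined so the origin of each prediction stays visible.
    """
    merged = []
    for start, end, source in sorted(spans_a + spans_b):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: widen it and record the extra source.
            last_start, last_end, last_sources = merged[-1]
            merged[-1] = (last_start, max(last_end, end), last_sources | {source})
        else:
            merged.append((start, end, {source}))
    return [(s, e, "+".join(sorted(src))) for s, e, src in merged]


# Illustrative spans only, not real predictions.
spacy_spans = [(11, 18, "SPACY"), (60, 69, "SPACY")]
xlmrob_spans = [(0, 18, "XLM_ROB"), (100, 105, "XLM_ROB")]
result = merge_overlapping(spacy_spans, xlmrob_spans)
print(result)
```

Keeping the longer span on overlap favours complete place names such as "NORTH-EAST NIGERIA" over a partial match; annotators can still shrink it during review.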

In [8]:
#write output
write_LS_json(merged, text, doc_id = doc_id, model_version = "predictions", out_folder = output_folder, append = False)

Predictions created: predictions to 17344.json
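
Before uploading, it can be useful to confirm that every pre-annotated span points at the text it claims to. The sketch below assumes the written file follows the standard Label Studio layout (`predictions` -> `result` -> `value` with `start`, `end`, `text`); the helper `find_bad_spans` is hypothetical and the example task is hand-made.

```python
def find_bad_spans(task):
    """Return prediction results whose span does not match the task text."""
    text = task["data"]["text"]
    bad = []
    for prediction in task.get("predictions", []):
        for result in prediction.get("result", []):
            value = result["value"]
            start, end = value["start"], value["end"]
            in_bounds = 0 <= start < end <= len(text)
            expected = value.get("text")  # optional in some exports
            if not in_bounds or (expected is not None and text[start:end] != expected):
                bad.append(result)
    return bad


sample = "An internally displaced woman from Molai village"
pos = sample.index("Molai")
task = {
    "data": {"text": sample},
    "predictions": [{"result": [
        # Correct span.
        {"value": {"start": pos, "end": pos + 5, "text": "Molai", "labels": ["LOC"]}},
        # Deliberately wrong offsets, to show what gets flagged.
        {"value": {"start": 0, "end": 5, "text": "Molai", "labels": ["LOC"]}},
    ]}],
}

bad = find_bad_spans(task)
print(len(bad), "mismatched span(s)")
```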


In [9]:
#check output
with open(path.join(output_folder, str(doc_id) + '.json'), 'r') as f:
    out = json.load(f)
out

{'id': 17344,
 'data': {'text': '17344\n[PAGE 1 START]\n\t\nI D W N O T\n\t\nO R\n\t\nK E R S T A R G\n\t\nA R E T\n\t\nIn memory of\n\t\nSaifura Hussaini Ahmed Khorsa, aid worker, killed September 2018 Hauwa Mohammed Liman, aid worker, killed October 2018\n\t\nNORTH-EAST NIGERIA HUMANITARIAN SITUATION UPDATE\n\t\nProgress on key activities from the 2019-2021 Humanitarian Response Strategy\n\t\n(covering 1 - 30 June 2019) JULY 2019 EDITION\n[PAGE 1 END]\n\n[PAGE 2 START]\n\t\nNORTH-EAST NIGERIA: HUMANITARIAN SITUATION UPDATE | 1-30 June 2019\n\t\nNorth-East Nigeria Humanitarian Situation Update, July 2019 Edition – Update on key activities from the 2019-2021 Humanitarian Response Strategy. Reporting period: 1 to 30 June 2019. Publication date: 1 August 2019. Cover Photo: OCHA/Leni Kinzli Caption: An internally displaced woman from Molai village on the outskirts of Maiduguri stays in overcrowded conditions in NYSC Camp in Maiduguri. Ongoing insecurity continues to trigger new displaceme