# **PACKAGES**

To install spaCy, follow the instructions at https://spacy.io/usage and select the following options:  
Operating system, Platform (**ARM/M1** if you have an Apple M1-M3 chip), Package manager, Hardware, Configuration (**virtual env**), Trained pipelines (**English**, **French**, **Spanish**), Select pipeline for (**accuracy**)

In [1]:
import os, sys, csv, time, re
import pandas as pd, numpy as np, matplotlib.pyplot as plt
import openpyxl
from pickle import load
from datetime import datetime

import spacy
import spacy.cli

In [2]:
#spacy.cli.download("es_dep_news_trf")
#spacy.cli.download("fr_dep_news_trf")

In [2]:
print(sys.version)

3.11.11 (main, Dec 11 2024, 10:25:04) [Clang 14.0.6 ]


# **QUICK SETUP**

In [4]:
#os.getcwd()
#os.chdir("../../../Downloads")

'/Users/julienmhp/Downloads'

In [3]:
pd.set_option('display.max_rows', None)

In [4]:
cty = "Belarus" #<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< change here!
lang = "English" #<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< change here!

# **DATA IMPORT**

In [5]:
#print(os.getcwd())
path = os.getcwd() + "/data/countries/" + cty.lower().replace(" ", "_")
print(path)

/Users/julienmhp/Desktop/undp/TargetAssessmentReport/data/countries/belarus


In [6]:
file = [os.path.join(path, f) for f in os.listdir(path) if cty.lower().replace(" ", "_") in f and f.endswith(".xlsx")]
print(file[0])

/Users/julienmhp/Desktop/undp/TargetAssessmentReport/data/countries/belarus/data_belarus_3Jul25.xlsx
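
The lookup above takes `file[0]`, which raises an `IndexError` when no workbook matches. A minimal sketch of a more defensive lookup, assuming the same folder layout as the printed path; `find_country_workbook` is a hypothetical helper, not part of the notebook:

```python
from pathlib import Path


def find_country_workbook(base_dir, country):
    """Return the first .xlsx file for `country` under base_dir/data/countries/<slug>.

    Hypothetical helper: the folder layout mirrors the path printed above.
    """
    slug = country.lower().replace(" ", "_")
    folder = Path(base_dir) / "data" / "countries" / slug
    # Sort so the choice is deterministic when several workbooks match
    matches = sorted(p for p in folder.glob("*.xlsx") if slug in p.name)
    if not matches:
        raise FileNotFoundError(f"no .xlsx workbook containing '{slug}' in {folder}")
    return matches[0]
```

With this helper, the cell above could read `file = find_country_workbook(os.getcwd(), cty)`.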


In [7]:
dta = pd.read_excel(file[0], sheet_name = "targets", engine = "openpyxl")

In [8]:
dta.head()

Unnamed: 0,Country,Target Text,Target Name,Document,Source,Convention,Doc,Type
0,Belarus,Integration of the function of biodiversity co...,Target 1,CBD Online Reporting Tool,https://ort.cbd.int/,nature,NBSAP,NBTs
1,Belarus,Ensure restoration of at least 30% of disturbe...,Target 2,CBD Online Reporting Tool,https://ort.cbd.int/,nature,NBSAP,NBTs
2,Belarus,Development of the system of protected areas a...,Target 3,CBD Online Reporting Tool,https://ort.cbd.int/,nature,NBSAP,NBTs
3,Belarus,Reduction of surface and groundwater pollution...,Target 7,CBD Online Reporting Tool,https://ort.cbd.int/,nature,NBSAP,NBTs
4,Belarus,"Ensure sustainable use of flora objects, prote...",Target 9,CBD Online Reporting Tool,https://ort.cbd.int/,nature,NBSAP,NBTs


In [9]:
dta.shape

(29, 8)

# **QUICK TWEAKS**

# **MODEL**

Main attributes (parameters) of the **spaCy** model for NLP:
- **token** each word or symbol  
- **lemma** root of the lowercase token
- **pos** part-of-speech (https://universaldependencies.org/u/pos/)
- **tag** detailed 'pos' tag (not in every 'TRF' pipeline)
- **morph** returns morphosyntactic info - gender, number, case, tense, mood, ...
- **entity** named-entity type of the token, e.g. DATE or PERCENT (https://spacy.io/usage/linguistic-features) (https://www.universalner.org/)
- **dependency** grammatical role played in the phrase, i.e. relations between tokens (https://spacy.io/usage/linguistic-features) (https://universaldependencies.org/u/dep/)
- **is_alpha**, **is_digit**, **is_punct**, **is_space**, **is_title**, **is_stop**, **is_currency**, **is_quote**, ...

For more details: https://spacy.io/api/token  
... or use 'print(spacy.explain("{KEYWORD}"))'

### **Loading the model**

In [10]:
if lang == "English":
    lang_cd, media, model = "en", "web", "core"
elif lang == "Spanish":
    lang_cd, media, model = "es", "news", "dep"
elif lang == "French":
    lang_cd, media, model = "fr", "news", "dep"

In [11]:
lang_cd+"_"+model+"_"+media+"_trf"

'en_core_web_trf'

In [12]:
nlp = spacy.load(lang_cd+"_"+model+"_"+media+"_trf")
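
The name passed to `spacy.load` is assembled from three parts. A small sketch of the same mapping as a lookup table, which also fails loudly on an unsupported language; `PIPELINE_PARTS` and `pipeline_name` are hypothetical names introduced here for illustration:

```python
# (language code, model family, training genre) per supported language, as in the cells above
PIPELINE_PARTS = {
    "English": ("en", "core", "web"),
    "Spanish": ("es", "dep", "news"),
    "French": ("fr", "dep", "news"),
}


def pipeline_name(lang, size="trf"):
    """Build a spaCy pipeline name such as 'en_core_web_trf'."""
    try:
        lang_cd, model, media = PIPELINE_PARTS[lang]
    except KeyError:
        raise ValueError(f"unsupported language: {lang!r}") from None
    return f"{lang_cd}_{model}_{media}_{size}"


print(pipeline_name("English"))
print(pipeline_name("Spanish"))
```

The result could then be passed on as `spacy.load(pipeline_name(lang))`.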

### **Model attributes**
https://spacy.io/api/attributes

In [60]:
print(nlp.pipe_names)

['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


**POS/Tag**

In [61]:
print(nlp.get_pipe("tagger").labels) # fine-grained tags; only for pipelines that include a "tagger" (see nlp.pipe_names)
# Coarse POS tags (token.pos_), for reference:
# ADJ (adjective), ADP (adposition), ADV (adverb), AUX (auxiliary verb), CONJ (conjunction), CCONJ (coordinating conjunction), 
# DET (determiner), INTJ (interjection), NOUN, NUM, PART (particle), PRON (pronoun), PROPN (proper noun), PUNCT (punctuation), 
# SCONJ (subordinating conjunction), SYM (symbol), VERB, X (other/unknown), SPACE (white space)

('$', "''", ',', '-LRB-', '-RRB-', '.', ':', 'ADD', 'AFX', 'CC', 'CD', 'DT', 'EX', 'FW', 'HYPH', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NFP', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', 'XX', '``')


In [45]:
print(spacy.explain("SYM"))

symbol


**Morphologizer**

In [79]:
print(nlp.get_pipe("morphologizer").labels)

('Definite=Def|Gender=Masc|Number=Sing|POS=DET|PronType=Art', 'Gender=Masc|Number=Sing|POS=NOUN', 'Definite=Def|Gender=Masc|Number=Sing|POS=ADP|PronType=Art', 'Gender=Masc|Number=Sing|POS=ADJ', 'POS=ADP', 'Definite=Def|Gender=Fem|Number=Plur|POS=DET|PronType=Art', 'POS=PROPN', 'Case=Acc|POS=PRON|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes', 'Mood=Ind|Number=Sing|POS=VERB|Person=3|Tense=Past|VerbForm=Fin', 'POS=VERB|VerbForm=Inf', 'Gender=Fem|Number=Sing|POS=DET|PronType=Dem', 'Gender=Fem|Number=Sing|POS=NOUN', 'Gender=Fem|Number=Plur|POS=NOUN', 'Gender=Fem|Number=Plur|POS=DET|PronType=Ind', 'POS=PRON|PronType=Int,Rel', 'Mood=Sub|Number=Plur|POS=VERB|Person=3|Tense=Pres|VerbForm=Fin', 'Definite=Def|Gender=Fem|Number=Sing|POS=DET|PronType=Art', 'POS=SCONJ', 'POS=NOUN', 'Definite=Def|Gender=Masc|Number=Plur|POS=DET|PronType=Art', 'Number=Plur|POS=NOUN', 'Gender=Masc|Number=Plur|POS=DET|PronType=Ind', 'Gender=Masc|Number=Plur|POS=NOUN', 'POS=PUNCT|PunctType=Peri', 'Mood=Ind|Number=Sing|P

In [47]:
print(spacy.explain("ADJ"))

adjective


**Parser**

In [63]:
print(nlp.get_pipe("parser").labels)
# ROOT (root of sentence), nsubj (nominal subject), nsubjpass (passive nominal subject), 
# dobj (direct object), iobj (indirect object), attr (attribute), prep (prepositional modifier), 
# pobj (object of a preposition), amod (adjectival modifier), advmod (adverbial modifier), 
# compound (compound noun modifier), aux (auxiliary verb), auxpass (passive auxiliary), 
# det (determiner), conj (conjunct), cc (coordinating conjunction), nmod (nominal modifier), 
# npadvmod (noun phrase as adverbial modifier), poss (possession modifier), 
# ccomp (clausal complement), xcomp (open clausal complement), mark (marker for subordinate clause)

('ROOT', 'acl', 'acomp', 'advcl', 'advmod', 'agent', 'amod', 'appos', 'attr', 'aux', 'auxpass', 'case', 'cc', 'ccomp', 'compound', 'conj', 'csubj', 'csubjpass', 'dative', 'dep', 'det', 'dobj', 'expl', 'intj', 'mark', 'meta', 'neg', 'nmod', 'npadvmod', 'nsubj', 'nsubjpass', 'nummod', 'oprd', 'parataxis', 'pcomp', 'pobj', 'poss', 'preconj', 'predet', 'prep', 'prt', 'punct', 'quantmod', 'relcl', 'xcomp')


In [64]:
print(spacy.explain("appos"))

appositional modifier


**Entities (NER)**

In [65]:
print(nlp.get_pipe("ner").labels) # only for pipelines that include "ner" (see nlp.pipe_names)
# the Spanish and French "dep_news_trf" pipelines don't seem to include NER
# GPE (country, state, city, ...), 
# NORP (nationality, religious or political groups, ...), 
# FAC (buildings, airports, highways, ...), 
# LAW (documents)

('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART')


In [66]:
print(spacy.explain("FAC"))

Buildings, airports, highways, bridges, etc.


### **Getting started**

In [13]:
dta["Full Target"] = dta["Doc"] + " " + dta["Target Name"]

In [50]:
corpus = list(nlp.pipe(dta["Target Text"]))

In [51]:
rows = []
for doc, text in zip(corpus, dta["Full Target"]): # "doc" keeps the loop from overwriting the "corpus" list
    sent_starts = {sent[0].i for sent in doc.sents} # index of the first token of each sentence
    for token in doc:
        rows.append({ # all attributes: https://spacy.io/api/attributes
            "Full Target": text, 
            "token": token.text, 
            "lemma": token.lemma_, # root of the word (no plurals, gender variation, conjugation, ...)
            "pos": token.pos_, # coarse part-of-speech tag
            #"tag": token.tag_, # not in TRF
            "morph": str(token.morph), # morphological features, e.g. "NumType=Card"
            "dependency": token.dep_, # syntactic relation to the head token
            "head": token.head.text, # text of the token's syntactic head
            "entity": token.ent_type_, # named-entity label; empty when the pipeline has no NER
            #"start": token.sent_start, # does it start a sentence? prefer "token.doc.sents"
            "start": token.i in sent_starts,
            "alpha": token.is_alpha, # alphabetic characters only
            "digit": token.is_digit, # digits only - so no Roman numerals or "1.5"
            "num": token.like_num, # resembles a number (see also like_email, like_url)
            "punct": token.is_punct, # punctuation
            "stop": token.is_stop, # stop word
            "space": token.is_space, # white space only
            "title": token.is_title, # title case
            "upper": token.is_upper, # all upper case
            "lower": token.is_lower, # all lower case
            "shape": token.shape_ # "dddd", "X", "dd%", ...
            #"length": len(token) # number of characters in the token
        })

tokens_df = pd.DataFrame(rows)
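
Several corrections further down filter on the `shape` column (`"dddd"`, `"dddd-dddd"`, `^X+-?X+$`). As a rough illustration of how such shape strings arise, the sketch below approximates the rule: letters become `x` or `X`, digits become `d`, other characters are kept, and runs of the same mark are capped at four. `approx_shape` is a hypothetical stand-in written for this note, not spaCy's own code, so edge cases may differ from `token.shape_`:

```python
import re


def approx_shape(text):
    """Approximate a spaCy-style shape string (assumption: runs are capped at 4)."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mark = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mark = "d"
        else:
            mark = ch  # punctuation and symbols are kept as they are
        if mark == last:
            run += 1
        else:
            run = 0
            last = mark
        if run < 4:  # drop the fifth and later repeats of the same mark
            shape.append(mark)
    return "".join(shape)


for sample in ["2030", "2015-2050", "9.2", "least", "UNDP", "REDD+"]:
    print(sample, "->", approx_shape(sample))

# The all-uppercase rule used later in the notebook
print(bool(re.match(r"^X+-?X+$", approx_shape("UNDP"))))
```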

### **Corrections**

**General formatting: text and bool**

In [52]:
#print(tokens_df["shape"].dtype)
cols = ["Full Target", "token", "lemma", "pos", "morph", "dependency", "head", "entity", "shape"]
tokens_df[cols] = tokens_df[cols].astype("string")

**Makes sure all math signs are properly classified**

In [53]:
mth1 = ["±", ">", "<", "=", "≥", "≤", "~", "≈", "%"] # problems: "-" (2015-2030), "/" (86%/92%), "+" (REDD+)
mth1_p = '[' + ''.join(re.escape(s) for s in mth1) + ']'
tokens_df.loc[(tokens_df["lemma"].str.contains(mth1_p, regex=True, na=False)), 
     "entity"] = "CARD"

mth2 = ["percent", "percentage"] # keep updating
mth2_p = '|'.join(re.escape(s) for s in mth2)
tokens_df.loc[(tokens_df["lemma"].str.contains(mth2_p, regex=True, na=False)), 
     "entity"] = "CARD"

**Makes sure all monetary/currency symbols are properly classified**

In [54]:
cty1 = ["$", "€", "£", "¥"] # ?: "₹", "₣", "₽", "₩"
cty1_p = '[' + ''.join(re.escape(s) for s in cty1) + ']'
tokens_df.loc[(tokens_df["lemma"].str.contains(cty1_p, regex=True, na=False)), 
     "entity"] = "CARD"

cty2 = ["usd", "euro", "eur"] # keep updating
if lang_cd == "en":
    cty2 += ["dollar", "pound", "sterling"] # keep updating
if lang_cd == "es":
    cty2 += ["dólar", "libra", "esterlina"] # keep updating
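cty2_p = '|'.join(re.escape(s) for s in cty2)
# whole-lemma, case-insensitive match, so that "pound" does not hit "compound" and "USD" is still caught
tokens_df.loc[(tokens_df["lemma"].str.fullmatch(cty2_p, case=False, na=False)), 
     "entity"] = "CARD"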

**Makes sure that measurements are properly classified**

In [55]:
msr = ["ha", "acre", "acres", "km2", "km²", 
       "m³", # problems: "l" (obviously)
       "cm", "km", # problems: "m" (obviously)
       "kg", # problems: "t", "g" (obviously)
       "ºc", "°c", "oc", 
       "mw", # problems: "w" (obviously)
       "co2", "co₂"]
if lang_cd == "en": 
    msr += ["hectar", "hectare", "ton", "mile"] # the pattern below matches whole lemmas, so "hectare" has to be listed in full
if lang_cd == "es": 
    msr += ["hectáre", "hectárea", "tonelada"]
msr_p = r'^(?:' + '|'.join(re.escape(s) for s in msr) + r')$'
tokens_df.loc[(tokens_df["lemma"].str.contains(msr_p, regex=True, na=False)), 
     "entity"] = "CARD"

qty = [] # stays empty for languages without a list yet (e.g., French)
if lang_cd == "en":
    qty = ["hundred", "thousand", "million", "billion"]
if lang_cd == "es":
    qty = ["cien", "mil", "millón"]
if qty:
    qty_p = r'|'.join(re.escape(s) for s in qty)
    tokens_df.loc[(tokens_df["lemma"].str.fullmatch(qty_p, na=False)), 
         "entity"] = "CARD"

**Makes sure all 'entity' categories that are numeric (except for 'DATE') are properly classified:**

In [56]:
#tokens_df["entity"].unique()
if lang_cd == "en":
    tokens_df.loc[(tokens_df["entity"] == "PERCENT") | 
    (tokens_df["entity"] == "QUANTITY") | 
    (tokens_df["entity"] == "CARDINAL") | 
    (tokens_df["entity"] == "MONEY"), 
    "entity"] = "CARD"

tokens_df.loc[tokens_df["dependency"] == "quantmod", 
    "entity"] = "CARD"

**Makes sure quantitative verbs are properly classified**

In [57]:
qvr = [] # stays empty for languages without a list yet (e.g., French)
if lang_cd == "en":
    qvr = ["halve", "half", "double", "triple", "quadruple"] # keep updating
    
if lang_cd == "es":
    qvr = ["mitad", "duplicar", "triplicar", "cuadruplicar"] # keep updating

if qvr:
    qvr_p = r'|'.join(re.escape(s) for s in qvr)
    tokens_df.loc[tokens_df["lemma"].str.fullmatch(qvr_p, na=False), "entity"] = "CARD"
#fullmatch(qty_p, na=False)), 
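
The corrections above switch between `str.contains` (substring search) and `str.fullmatch` or an anchored `^(?:...)$` pattern (whole-lemma match). The plain-`re` sketch below shows why the distinction matters for short terms; the sample lemmas are made up for illustration:

```python
import re

units = ["ha", "km", "kg"]
pattern = "|".join(re.escape(u) for u in units)

lemmas = ["ha", "that", "km", "kg", "share"]

# Substring search: any lemma that merely contains a unit is flagged
loose = [w for w in lemmas if re.search(pattern, w)]

# Whole-string match: only lemmas that are exactly one of the units
strict = [w for w in lemmas if re.fullmatch(pattern, w)]

print(loose)   # ['ha', 'that', 'km', 'kg', 'share']
print(strict)  # ['ha', 'km', 'kg']
```

Substring matching is convenient for symbols such as `%` or `$`, which rarely occur inside unrelated words, while whole-lemma matching is the safer choice for short alphabetic terms.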

**Makes sure that numeric values are properly labeled**

In [58]:
if lang_cd != "en":
    tokens_df.loc[(tokens_df["pos"] == "NUM") | 
        (tokens_df["morph"].str.contains('NumForm=Digit', regex=False)) | 
        (tokens_df["morph"].str.contains('NumType=Frac', regex=False)), 
        "entity"] = "CARD"

**Makes sure that numeric IDs corresponding to the target name are properly classified**

In [59]:
# flags number-like tokens that appear as a whole word in the target name; "1,1" is treated as "1.1"
# note: "\b" also counts "." as a boundary, so "10" is still flagged when the target name is "1.10"
tokens_df["trg_id"] = tokens_df.apply(
    lambda row: bool(re.search(r'\b'+re.escape(row["token"].replace(',', '.'))+r'\b', str(row["Full Target"]))) and row["num"], 
    axis=1)

#tokens_df["trg_id"] = tokens_df["trg_id"].astype("string")
tokens_df.loc[tokens_df["trg_id"] == True, "entity"] = "TITLE"
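
Because `\b` treats `.` as a word boundary, `\b10\b` still matches inside `1.10`. A stricter check can be sketched with look-arounds that reject a match glued to a neighbouring digit group by a dot. `is_exact_id` is a hypothetical helper and the target names below are examples only:

```python
import re


def is_exact_id(token, target_name):
    """True if `token` appears in `target_name` as a complete numeric ID."""
    tok = re.escape(token.replace(",", "."))  # treat "1,1" like "1.1"
    # Not preceded by a word character or a dot,
    # and not followed by a word character or by ".<word character>"
    pattern = r"(?<![\w.])" + tok + r"(?!\.?\w)"
    return re.search(pattern, target_name) is not None


print(is_exact_id("10", "NBSAP National Action 1.10"))    # False
print(is_exact_id("1.10", "NBSAP National Action 1.10"))  # True
print(is_exact_id("10", "NBSAP Target 10"))               # True
```

This is only one possible tightening; whether it suits every target-naming scheme in the data would need checking country by country.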

**Makes sure that "dddd" structures are properly classified**

In [60]:
# e.g., 'al 2030'
if lang_cd != "en":
    tokens_df.loc[(tokens_df["shape"] == "dddd") & 
        (tokens_df["pos"] == "NOUN") & 
        (tokens_df["morph"] == "AdvType=Tim") & 
         ((tokens_df["dependency"] == "obl") | (tokens_df["dependency"] == "nmod")), 
        "entity"] = "DATE"

# e.g., 'para el año 2030'
if lang_cd != "en":
    tokens_df.loc[(tokens_df["shape"] == "dddd") & 
        (tokens_df["pos"] == "NUM") & 
        (tokens_df["morph"] == "NumForm=Digit|NumType=Card") & 
        (tokens_df["dependency"] == "compound") & 
        (tokens_df["head"].isin(["año"])), # probably complement list
        "entity"] = "DATE"

# e.g., '1700 MW'
if lang_cd != "en":
    tokens_df.loc[(tokens_df["shape"] == "dddd") & 
        (tokens_df["pos"] == "NUM") & 
        (tokens_df["morph"] == "NumForm=Digit|NumType=Card") & 
        (tokens_df["head"].isin(["MW"])) & # LIST (tokens_df["title"].shift(1) == True)
        (tokens_df["dependency"] == "nummod"), 
        "entity"] = "CARD"

# e.g. 'Ley 44 del 2002'
if lang_cd != "en":
    tokens_df.loc[(tokens_df["shape"] == "dddd") & 
        (tokens_df["pos"] == "NOUN") & 
        (tokens_df["morph"] == "AdvType=Tim") & 
        (tokens_df["dependency"] == "appos"), 
        "entity"] = "TITLE"

# e.g. 'el suplemento 2013 del IPCC'
if lang_cd != "en":
    tokens_df.loc[(tokens_df["shape"] == "dddd") & 
        (tokens_df["pos"] == "NOUN") & 
        (tokens_df["morph"] == "AdvType=Tim") & 
        (tokens_df["head"].isin(["Ley"])) & # LIST (tokens_df["title"].shift(1) == True)
        (tokens_df["dependency"] == "nmod"), 
        "entity"] = "TITLE"

**Makes sure that "dddd-dddd" are properly classified**

In [61]:
#e.g., 'entre 2022-2025'
if lang_cd != "en":
    tokens_df.loc[(tokens_df["shape"] == "dddd-dddd") & 
        #(tokens_df["pos"] == "NOUN") & 
        #(tokens_df["morph"] == "AdvType=Tim") & 
        (tokens_df["dependency"] == "obl"), 
        "entity"] = "DATE"

# e.g., 'PEN 2015-2050'
if lang_cd != "en":
    tokens_df.loc[(tokens_df["shape"] == "dddd-dddd") & 
        #(tokens_df["pos"] == "NOUN") & 
        #(tokens_df["morph"] == "AdvType=Tim") & 
        (tokens_df["dependency"] == "appos"), 
        "entity"] = "TITLE"

**Makes sure that numbered bullet-points are properly classified**

In [62]:
# e.g., '1.'
tokens_df.loc[(tokens_df["shape"] == "d") & 
    (tokens_df["dependency"] == "ROOT") & 
    (tokens_df["token"].shift(-1) == "."), 
    "entity"] = "BULLET"

**Makes sure specific words are considered time-references**

In [63]:
tim = [] # stays empty for languages without a list yet (e.g., French)
if lang_cd == "en":
    tim = ["year", "annual", "month", "week", "day", "hour"] # keep updating
if lang_cd == "es":
    tim = ["año", "anual", "mes", "semana", "día", "hora"] # keep updating

if tim:
    tim_p = r'|'.join(re.escape(s) for s in tim)
    tokens_df.loc[(tokens_df["lemma"].str.fullmatch(tim_p, na=False)) &
        (tokens_df["title"] == False), 
        "entity"] = "DATE"

**Makes sure that numbers preceded or followed by a "title" word are properly classified**

In [64]:
org = [] # stays empty for languages without a list yet (e.g., French)
if lang_cd == "en":
    org = ["Law", "law", "chapter", "act", "bill", "regulation", "decree", 
          "article", "recommendation", "target", "goal", "objective", "strategy", 
          "plan", "phase", "agenda", "policy", "action", "programme", "number"] # keep updating
if lang_cd == "es":
    org = ["Ley", "ley", "capítulo", "acto", "proyecto", "regulación", "decreto", 
          "artículo", "recomendación", "meta", "objetivo", "estrategia", 
          "plan", "fase", "orden", "política", "acción", "programa", "número"] # keep updating

# "(?!)" never matches, so an empty list simply disables the keyword rule
org_p = r'|'.join(re.escape(s) for s in org) if org else r'(?!)'
tokens_df.loc[((tokens_df["entity"] == "CARD") | (tokens_df["entity"] == "DATE")) &
    (tokens_df["head"] == tokens_df["token"].shift(1)) & 
    (((tokens_df["title"].shift(1) == True)) & ((tokens_df["start"].shift(1) == False)) | 
     (tokens_df["lemma"].shift(1).str.fullmatch(org_p, na=False))), 
    "entity"] = "TITLE"

tokens_df.loc[((tokens_df["entity"] == "CARD") | (tokens_df["entity"] == "DATE")) &
    (tokens_df["head"] == tokens_df["head"].shift(-1)) & 
    (((tokens_df["title"].shift(-1) == True)) & ((tokens_df["start"].shift(-1) == False)) | 
     (tokens_df["lemma"].shift(-1).str.fullmatch(org_p, na=False))), 
    "entity"] = "TITLE"

**Makes sure that measurement units and similar terms attached to a number are properly classified**

In [65]:
# e.g., "by 2030"
tokens_df.loc[((tokens_df["entity"].shift(-1) == "CARD") | (tokens_df["entity"].shift(-1) == "DATE")) &
    (tokens_df["token"] == tokens_df["head"].shift(-1)) & 
    (tokens_df["alpha"] == True), 
    "entity"] = tokens_df["entity"].shift(-1)

# e.g., "para el 2030"
tokens_df.loc[(tokens_df["pos"] == "ADP") & 
    (tokens_df["dependency"] == "case") & 
    (tokens_df["head"] == tokens_df["token"].shift(-2)) & 
    (tokens_df["alpha"] == True), 
    "entity"] = tokens_df["entity"].shift(-2)
tokens_df.loc[(tokens_df["pos"] == "DET") & 
    (tokens_df["head"] == tokens_df["token"].shift(-1)) & 
    (tokens_df["alpha"] == True), 
    "entity"] = tokens_df["entity"].shift(-1)

# e.g., "Al 2030"
tokens_df.loc[(tokens_df["pos"] == "ADP") & 
    (tokens_df["dependency"] == "case") & 
    (tokens_df["head"] == tokens_df["token"].shift(-1)) & 
    (tokens_df["alpha"] == True), 
    "entity"] = tokens_df["entity"].shift(-1)

**Makes sure that all-uppercase character tokens are properly classified**

In [66]:
tokens_df.loc[tokens_df["shape"].str.match(r'^X+-?X+$'), 
    "title"] = True

**Makes sure that numbers with a preceding "title" are properly classified**

In [67]:
tokens_df.loc[(tokens_df["dependency"] == "prep") & # maybe a little too restrictive?
    (tokens_df["head"].shift(-1) == tokens_df["token"]), 
    "entity"] = tokens_df["entity"].shift(-1)

# e.g., "Act, 2004"
tokens_df.loc[(tokens_df["pos"] == "NUM") & 
    (tokens_df["head"] == tokens_df["token"].shift(2)) & 
    (tokens_df["title"].shift(2) == True) & 
    (tokens_df["punct"].shift(1) == True), # maybe add something for "of"?
    "entity"] = tokens_df["entity"].shift(2)

**Makes sure that nominal time-references preceded by numbers are properly classified**

In [68]:
timm = ["año", "annual", "annually"]

timm_p = r'|'.join(re.escape(s) for s in timm)
tokens_df.loc[(tokens_df["lemma"].str.fullmatch(timm_p, na=False)) & 
    ((tokens_df["entity"].shift(1) == "CARD") | (tokens_df["entity"].shift(1) == "DATE")),
    "entity"] = tokens_df["entity"].shift(1)

**Makes sure that prepositions referring to numbers are properly classified**

In [69]:
tokens_df.loc[(tokens_df["pos"] == "ADP") & 
    (tokens_df["dependency"] == "prep") & 
    (tokens_df["stop"] == True) & 
    (tokens_df["head"].shift(-2) == (tokens_df["token"])) & 
    ((tokens_df["entity"].shift(-2) == "CARD") | (tokens_df["entity"].shift(-2) == "DATE")), 
    "entity"] = tokens_df["entity"].shift(-2)

**Makes sure that "dddd-dddd" are properly classified (2)**

In [70]:
# e.g., "Action 2021-26" - "2021" is corrected as "TITLE", but not "-26"
tokens_df.loc[(tokens_df["token"] == "-") & 
    (tokens_df['shape'].shift(1).str.fullmatch(r'^d+$', case=True)) & 
    (tokens_df['shape'].shift(-1).str.fullmatch(r'^d+$', case=True)), 
    "entity"] = tokens_df["entity"].shift(1)

tokens_df.loc[(tokens_df['shape'].str.fullmatch(r'^d+$', case=True)) & 
    (tokens_df["token"].shift(1) == "-") & 
    (tokens_df['shape'].shift(2).str.fullmatch(r'^d+$', case=True)), 
    "entity"] = tokens_df["entity"].shift(1)

In [71]:
#tokens_df
tokens_df[(tokens_df["entity"] == "CARD") | (tokens_df["entity"] == "DATE")]
#tokens_df[tokens_df["Full Target"] == "NBSAP National Action 1.10"]

#tokens_df[tokens_df["entity"] == 'CARD']
#tokens_df[tokens_df["entity"] == 'DATE']
#tokens_df.loc[741]

Unnamed: 0,Full Target,token,lemma,pos,morph,dependency,head,entity,start,alpha,digit,num,punct,stop,space,title,upper,lower,shape,trg_id
50,NBSAP Target 2,at,at,ADV,,advmod,least,CARD,False,True,False,False,False,True,False,False,False,True,xx,False
51,NBSAP Target 2,least,least,ADV,Degree=Sup,advmod,30,CARD,False,True,False,False,False,True,False,False,False,True,xxxx,False
52,NBSAP Target 2,30,30,NUM,NumType=Card,nummod,%,CARD,False,False,True,True,False,False,False,False,False,False,dd,False
53,NBSAP Target 2,%,%,NOUN,Number=Sing,pobj,of,CARD,False,False,False,False,True,False,False,False,False,False,%,False
113,NBSAP Target 3,by,by,ADP,,prep,area,DATE,False,True,False,False,False,True,False,False,False,True,xx,False
114,NBSAP Target 3,2030,2030,NUM,NumType=Card,pobj,by,DATE,False,False,True,True,False,False,False,False,False,False,dddd,False
116,NBSAP Target 3,9.2,9.2,NUM,NumType=Card,nummod,%,CARD,False,False,False,True,False,False,False,False,False,False,d.d,False
117,NBSAP Target 3,%,%,NOUN,Number=Sing,appos,area,CARD,False,False,False,False,True,False,False,False,False,False,%,False
125,NBSAP Target 3,2035,2035,NUM,NumType=Card,nummod,%,DATE,False,False,True,True,False,False,False,False,False,False,dddd,False
128,NBSAP Target 3,%,%,NOUN,Number=Sing,pobj,by,CARD,False,False,False,False,True,False,False,False,False,False,%,False


#### **Country-specific changes**

In [56]:
#Panama: "128", "132", "102"
#Guatemala: META ZMC-3.2 "(1)" and META ZMC-3.1 "(2)"
#Namibia: NBSAP Target 8 "2004"
#Sri Lanka: NBT 24: "2050 biodiversity vision"
#Tanzania: Target 3: "By 2030" ; Target 19: "at least $300 million"; "per year"; "2025-2030"
#Uzbekistan: tokens_df[tokens_df["Target Name"] == dta["Target Name"][4]] # 222 "I-IV 0 20%" […]
    #tokens_df[tokens_df["Target Name"] == dta["Target Name"][33]] # 1637 "Target 31" [√]
#Lebanon: "2024 war" in NBSAP Target 2, NBSAP National Action 2.1, NBSAP National Action 8.9 and NDC Key Action 4.3
    # "10" in NBSAP Target 1.10 and NBSAP Target 10
#Colombia: …
#Belarus: 

### **Clean-up**

In [72]:
# Keeps only the tokens labelled "CARD" or "DATE"; every other token is dropped
#tokens_df = tokens_df.loc[(tokens_df["entity"] != "")]
tokens_df = tokens_df[(tokens_df["entity"] == "CARD") | (tokens_df["entity"] == "DATE")].copy() # .copy() avoids a SettingWithCopyWarning in the next cells

In [73]:
# Lumps together consecutive tokens that come from the same entity parameter into a single string
tokens_df["flag"] = (
    (tokens_df["entity"] != tokens_df["entity"].shift()) |
    (tokens_df.index != tokens_df.index.to_series().shift() + 1))
tokens_df["entity_group"] = tokens_df["flag"].cumsum()
tokens_df.drop(columns="flag", inplace=True)

In [74]:
tokens_df["mergeable"] = (tokens_df["entity"] != "") & (tokens_df["entity"] != "O")
tokens_df["merge_group"] = tokens_df["entity_group"].where(tokens_df["mergeable"])
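
The grouping idea above (start a new group whenever the entity label changes or the row index is not consecutive) can be sketched without pandas. The rows below are taken from the filtered table shown earlier; `group_runs` is a hypothetical helper:

```python
def group_runs(rows):
    """Merge consecutive (index, entity, token) rows that share an entity label."""
    groups = []
    prev_idx, prev_ent = None, None
    for idx, ent, tok in rows:
        if ent == prev_ent and prev_idx is not None and idx == prev_idx + 1:
            groups[-1][1].append(tok)  # same run: extend the current group
        else:
            groups.append((ent, [tok]))  # label changed or gap in the index: new group
        prev_idx, prev_ent = idx, ent
    return [(ent, " ".join(toks)) for ent, toks in groups]


rows = [
    (50, "CARD", "at"), (51, "CARD", "least"), (52, "CARD", "30"), (53, "CARD", "%"),
    (113, "DATE", "by"), (114, "DATE", "2030"),
    (116, "CARD", "9.2"), (117, "CARD", "%"),
]
print(group_runs(rows))
# [('CARD', 'at least 30 %'), ('DATE', 'by 2030'), ('CARD', '9.2 %')]
```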

In [75]:
merged = (
    tokens_df.groupby(["Full Target", "merge_group", "entity"], dropna=True)
    .agg({"token": " ".join})
    .reset_index()
)
merged = merged.drop(["merge_group"], axis = 1)

In [76]:
merged

Unnamed: 0,Full Target,entity,token
0,NAPDGE 44,DATE,up to 2050
1,NBSAP Target 2,CARD,at least 30 %
2,NBSAP Target 3,DATE,by 2030
3,NBSAP Target 3,CARD,9.2 %
4,NBSAP Target 3,DATE,2035
5,NBSAP Target 3,CARD,%
6,NBSAP Target 3,DATE,by 2025
7,NBSAP Target 3,CARD,9.2 %
8,NBSAP Target 3,CARD,9.2 %
9,NBSAP Target 3,DATE,by 2035


In [77]:
# removes white spaces wrongly added when the tokens were merged, e.g., "20 %", "$ 100", "50 - 245", "m3 / ha"
merged["token"] = merged["token"].str.replace(' %', '%', regex=False)
merged["token"] = merged["token"].str.replace('$ ', '$', regex=False) # literal match: as a regex, "$" is an end-of-string anchor and never matches here
merged["token"] = merged["token"].str.replace(' - ', '-', regex=False)
merged["token"] = merged["token"].str.replace(' / ', '/', regex=False)
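
A quick way to check the clean-up rules is to run them on a few sample strings. The sketch below uses plain `re` and escapes `$`, which would otherwise act as an end-of-string anchor; `tidy` is a hypothetical helper:

```python
import re


def tidy(text):
    """Remove the spaces that joining tokens with ' ' puts around %, $, - and /."""
    text = re.sub(r" %", "%", text)
    text = re.sub(r"\$ ", "$", text)  # "$" must be escaped to match a literal dollar sign
    text = re.sub(r" - ", "-", text)
    text = re.sub(r" / ", "/", text)
    return text


for sample in ["20 %", "$ 100", "50 - 245", "m3 / ha"]:
    print(repr(sample), "->", repr(tidy(sample)))

# Without the escape, the pattern never matches and the string is unchanged
print(repr(re.sub(r"$ ", "$", "$ 100")))
```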

In [78]:
# removes duplicates
merged.drop_duplicates(inplace=True)

In [79]:
# Creates a list of time-bound terms per target
dates = (
    merged[merged["entity"] == "DATE"]
    .groupby("Full Target")["token"]
    .apply(lambda x: "; ".join(x))
    .reset_index(name="dates")
)

In [80]:
dates

Unnamed: 0,Full Target,dates
0,NAPDGE 44,up to 2050
1,NBSAP Target 3,by 2030; 2035; by 2025; by 2035; by 2025-22
2,NDC Forestry 1,by 2030; by 2050
3,NDC Forestry 2,by 2030; 2050
4,NDC Forestry 3,by 2030; by 2050
5,NDC Forestry 4,by 2030; by 2050
6,NDC Forestry 5,by 2030; by 2050


In [81]:
# Creates a list of quantitative terms per target
quants = (
    merged[merged["entity"] != "DATE"]
    .groupby("Full Target")["token"]
    .apply(lambda x: "; ".join(x))
    .reset_index(name="quants")
)

In [82]:
quants

Unnamed: 0,Full Target,quants
0,NBSAP Target 2,at least 30%
1,NBSAP Target 3,9.2%; %; 9.6%; 22%
2,NDC Forestry 1,to 41.0%; 42.0%
3,NDC Forestry 2,up to 60 and 62%; up to 5.0 and 5.5%; to 34 an...
4,NDC Forestry 3,to 230 m3/ha; 235 m3/ha
5,NDC Forestry 4,to 33%; 35%
6,NDC Forestry 5,to 47%; 50%


In [83]:
condens = pd.merge(dates, quants, on="Full Target", how="outer")

In [84]:
condens[["dates", "quants"]] = condens[["dates", "quants"]].fillna("")
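
The outer merge keeps every target that has at least one date or one quantity and leaves the missing side empty. The same shape can be sketched with dictionaries; the sample values are copied from the tables above and `combine` is a hypothetical helper:

```python
def combine(dates, quants):
    """Outer-join two {target: text} mappings, filling gaps with ''."""
    targets = sorted(set(dates) | set(quants))
    return [
        {"Full Target": t, "dates": dates.get(t, ""), "quants": quants.get(t, "")}
        for t in targets
    ]


dates = {"NAPDGE 44": "up to 2050", "NDC Forestry 1": "by 2030; by 2050"}
quants = {"NBSAP Target 2": "at least 30%", "NDC Forestry 1": "to 41.0%; 42.0%"}

for row in combine(dates, quants):
    print(row)
```

Targets with neither a date nor a quantity are absent here as well; they only reappear, with empty cells, once the result is left-joined onto the full target list.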

In [85]:
condens

Unnamed: 0,Full Target,dates,quants
0,NAPDGE 44,up to 2050,
1,NBSAP Target 2,,at least 30%
2,NBSAP Target 3,by 2030; 2035; by 2025; by 2035; by 2025-22,9.2%; %; 9.6%; 22%
3,NDC Forestry 1,by 2030; by 2050,to 41.0%; 42.0%
4,NDC Forestry 2,by 2030; 2050,up to 60 and 62%; up to 5.0 and 5.5%; to 34 an...
5,NDC Forestry 3,by 2030; by 2050,to 230 m3/ha; 235 m3/ha
6,NDC Forestry 4,by 2030; by 2050,to 33%; 35%
7,NDC Forestry 5,by 2030; by 2050,to 47%; 50%


In [175]:
#condens[["Doc", "Target Name"]] = condens["Full Target"].str.split(" ", n = 1, expand = True)
#condens = condens.drop("Full Target", axis = 1)
#condens = condens[["Doc", "Target Name", "quants", "dates"]]

# **Saving results**

In [86]:
condens.head()

Unnamed: 0,Full Target,dates,quants
0,NAPDGE 44,up to 2050,
1,NBSAP Target 2,,at least 30%
2,NBSAP Target 3,by 2030; 2035; by 2025; by 2035; by 2025-22,9.2%; %; 9.6%; 22%
3,NDC Forestry 1,by 2030; by 2050,to 41.0%; 42.0%
4,NDC Forestry 2,by 2030; 2050,up to 60 and 62%; up to 5.0 and 5.5%; to 34 an...


In [87]:
dta.head()

Unnamed: 0,Country,Target Text,Target Name,Document,Source,Convention,Doc,Type,Full Target
0,Belarus,Integration of the function of biodiversity co...,Target 1,CBD Online Reporting Tool,https://ort.cbd.int/,nature,NBSAP,NBTs,NBSAP Target 1
1,Belarus,Ensure restoration of at least 30% of disturbe...,Target 2,CBD Online Reporting Tool,https://ort.cbd.int/,nature,NBSAP,NBTs,NBSAP Target 2
2,Belarus,Development of the system of protected areas a...,Target 3,CBD Online Reporting Tool,https://ort.cbd.int/,nature,NBSAP,NBTs,NBSAP Target 3
3,Belarus,Reduction of surface and groundwater pollution...,Target 7,CBD Online Reporting Tool,https://ort.cbd.int/,nature,NBSAP,NBTs,NBSAP Target 7
4,Belarus,"Ensure sustainable use of flora objects, prote...",Target 9,CBD Online Reporting Tool,https://ort.cbd.int/,nature,NBSAP,NBTs,NBSAP Target 9


In [88]:
dta.drop(["Country", "Target Text", "Source", "Convention"], axis = 1, inplace = True, errors = 'ignore')

In [89]:
final = pd.merge(dta, condens, on = "Full Target", how = "left") # "Full Target" is the only shared column

In [90]:
final = final.fillna("")

In [91]:
final

Unnamed: 0,Target Name,Document,Doc,Type,Full Target,dates,quants
0,Target 1,CBD Online Reporting Tool,NBSAP,NBTs,NBSAP Target 1,,
1,Target 2,CBD Online Reporting Tool,NBSAP,NBTs,NBSAP Target 2,,at least 30%
2,Target 3,CBD Online Reporting Tool,NBSAP,NBTs,NBSAP Target 3,by 2030; 2035; by 2025; by 2035; by 2025-22,9.2%; %; 9.6%; 22%
3,Target 7,CBD Online Reporting Tool,NBSAP,NBTs,NBSAP Target 7,,
4,Target 9,CBD Online Reporting Tool,NBSAP,NBTs,NBSAP Target 9,,
5,Target 10,CBD Online Reporting Tool,NBSAP,NBTs,NBSAP Target 10,,
6,Target 11,CBD Online Reporting Tool,NBSAP,NBTs,NBSAP Target 11,,
7,Target 14,CBD Online Reporting Tool,NBSAP,NBTs,NBSAP Target 14,,
8,Target 15,CBD Online Reporting Tool,NBSAP,NBTs,NBSAP Target 15,,
9,Target 16,CBD Online Reporting Tool,NBSAP,NBTs,NBSAP Target 16,,


In [92]:
final.drop(["Full Target"], axis = 1, inplace = True, errors = 'ignore')

In [93]:
final.head()

Unnamed: 0,Target Name,Document,Doc,Type,dates,quants
0,Target 1,CBD Online Reporting Tool,NBSAP,NBTs,,
1,Target 2,CBD Online Reporting Tool,NBSAP,NBTs,,at least 30%
2,Target 3,CBD Online Reporting Tool,NBSAP,NBTs,by 2030; 2035; by 2025; by 2035; by 2025-22,9.2%; %; 9.6%; 22%
3,Target 7,CBD Online Reporting Tool,NBSAP,NBTs,,
4,Target 9,CBD Online Reporting Tool,NBSAP,NBTs,,


In [94]:
final.to_excel(path+"/"+cty+"_quantitative_"+datetime.today().strftime("%d%b%y").lstrip("0")+".xlsx", sheet_name = "Quantitative Terms", index=False)
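
The output name combines the country with a compact date stamp. A sketch of the same naming rule as a function, assuming an English locale for the month abbreviation produced by `%b`; `output_name` is a hypothetical helper:

```python
from datetime import date


def output_name(country, day):
    """Build '<country>_quantitative_<d><Mon><yy>.xlsx', dropping a leading zero from the day."""
    stamp = day.strftime("%d%b%y").lstrip("0")
    return f"{country}_quantitative_{stamp}.xlsx"


print(output_name("Belarus", date(2025, 7, 3)))
```

Passing the date in explicitly, rather than calling `datetime.today()` inside the function, makes the naming rule easy to check against a known day.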