#### **PACKAGES**

To install spaCy, follow the instructions at https://spacy.io/usage and select the options that match your setup:  
Operating system, Platform (**ARM/M1** if you have an Apple M1-M3 chip), Package manager, Hardware, Configuration (**virtual env**), Trained pipelines (**English**, **French**, **Spanish**), Select pipeline for (**accuracy**).

In [1]:
import os, sys, csv, time, re
import pandas as pd, numpy as np, matplotlib.pyplot as plt
import openpyxl
from pickle import load
from datetime import datetime
import spacy

In [2]:
print(sys.version)

3.11.11 (main, Dec 11 2024, 10:25:04) [Clang 14.0.6 ]


#### **QUICK SETUP**

In [3]:
pd.set_option('display.max_rows', None)

In [2]:
cty = "Dominican Republic" #<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< change here!
lang = "Spanish" #<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< change here!

#### **DATA IMPORT**

In [3]:
#print(os.getcwd())
path = os.getcwd() + "/data/countries/" + cty.lower().replace(" ", "_")
print(path)

/Users/julienmhp/Desktop/undp/TargetAssessmentReport/data/countries/dominican_republic
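
The string concatenation above works on macOS and Linux, where `/` is the path separator. As a minimal sketch, assuming the same `data/countries/<country>` layout, the folder can also be built with `pathlib`, which picks the separator for the current platform. The helper name `country_dir` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def country_dir(base: Path, country: str) -> Path:
    """Return <base>/data/countries/<slug>, where <slug> is the country
    name in lowercase with spaces replaced by underscores."""
    slug = country.lower().replace(" ", "_")
    return base / "data" / "countries" / slug


# Same folder as the cell above, built from the current working directory
print(country_dir(Path.cwd(), "Dominican Republic"))
```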


In [4]:
file = [os.path.join(path, f) for f in os.listdir(path) if cty.lower().replace(" ", "_") in f and f.endswith(".xlsx")]
print(file[0])

/Users/julienmhp/Desktop/undp/TargetAssessmentReport/data/countries/dominican_republic/data_dominican_republic_23May25.xlsx
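
`file[0]` depends on the order returned by `os.listdir`, which is arbitrary, so the result is ambiguous once the folder holds more than one export. The sketch below picks the most recent file by parsing the date stamp at the end of the name. It assumes every export follows the `..._23May25.xlsx` pattern seen above and that month abbreviations are in English; the helper names and the extra file names in the example are hypothetical.

```python
import re
from datetime import datetime


def export_date(filename: str) -> datetime:
    """Parse a trailing stamp such as '23May25' from the file name.
    Files without a valid stamp sort first (datetime.min)."""
    match = re.search(r"_(\d{1,2}[A-Za-z]{3}\d{2})\.xlsx$", filename)
    if match is None:
        return datetime.min
    try:
        return datetime.strptime(match.group(1), "%d%b%y")
    except ValueError:
        return datetime.min


def latest_export(filenames):
    """Return the file name with the most recent date stamp."""
    if not filenames:
        raise FileNotFoundError("no matching .xlsx file found")
    return max(filenames, key=export_date)


# Hypothetical folder content; only the 23May25 file appears in the notebook
candidates = [
    "data_dominican_republic_2Jan25.xlsx",
    "data_dominican_republic_23May25.xlsx",
    "data_dominican_republic_30Dec24.xlsx",
]
print(latest_export(candidates))
```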


In [5]:
dta = pd.read_excel(file[0], sheet_name = "targets", engine = "openpyxl")

In [6]:
dta.head()

Unnamed: 0,Country,Target Text,Target Name,Document,Source,Doc,Type
0,Dominican Republic,"Impulsar el desarrollo local, provincial y reg...",Objetivo especifíco 1.1.2,Ley 1- 12 Estrategia Nacional de Desarrollo,https://mepyd.gob.do/mepyd/wp-content/uploads/...,LEND,Other targets
1,Dominican Republic,Establecer mecanismos de participación permane...,Línea de acción 1.1.2.3,Ley 1- 12 Estrategia Nacional de Desarrollo,https://mepyd.gob.do/mepyd/wp-content/uploads/...,LEND,Other targets
2,Dominican Republic,"Promover la calidad de la democracia, sus prin...",Objetivo especifíco 1.3.1,Ley 1- 12 Estrategia Nacional de Desarrollo,https://mepyd.gob.do/mepyd/wp-content/uploads/...,LEND,Other targets
3,Dominican Republic,Consolidar y promover la participación de las ...,Línea de acción 1.3.1.4,Ley 1- 12 Estrategia Nacional de Desarrollo,https://mepyd.gob.do/mepyd/wp-content/uploads/...,LEND,Other targets
4,Dominican Republic,Disminuir la pobreza mediante un efectivo y ef...,Objetivo específico 2.3.3,Ley 1- 12 Estrategia Nacional de Desarrollo,https://mepyd.gob.do/mepyd/wp-content/uploads/...,LEND,Other targets


#### **MODEL**

The **spaCy** model for NLP - what to know
- **token** each word or symbol  
- **lemma** base (dictionary) form of the token
- **pos** part-of-speech
- **dependency** syntactic relation between a token and its head
- **entity** named-entity type of the token (e.g. DATE, PERCENT, GPE)

In [11]:
# POS
# Note: the tagger lists fine-grained tags (token.tag_); the coarse tags below are the values of token.pos_
print(nlp.get_pipe("tagger").labels)
# ADJ (adjective), ADP (adposition), ADV (adverb), AUX (auxiliary verb), CONJ (conjunction), CCONJ (coordinating conjunction), 
# DET (determiner), INTJ (interjection), NOUN, NUM, PART (particle), PRON (pronoun), PROPN (proper noun), PUNCT (punctuation), 
# SCONJ (subordinating conjunction), SYM (symbol), VERB, X (other/unknown), SPACE (white space)

('$', "''", ',', '-LRB-', '-RRB-', '.', ':', 'ADD', 'AFX', 'CC', 'CD', 'DT', 'EX', 'FW', 'HYPH', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NFP', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', 'XX', '``')


In [12]:
# Dependencies
print(nlp.get_pipe("parser").labels)
# ROOT (root of sentence), nsubj (nominal subject), nsubjpass (passive nominal subject), 
# dobj (direct object), iobj (indirect object), attr (attribute), prep (prepositional modifier), 
# pobj (object of a preposition), amod (adjectival modifier), advmod (adverbial modifier), 
# compound (compound noun modifier), aux (auxiliary verb), auxpass (passive auxiliary), 
# det (determiner), conj (conjunct), cc (coordinating conjunction), nmod (nominal modifier), 
# npadvmod (noun phrase as adverbial modifier), poss (possession modifier), 
# ccomp (clausal complement), xcomp (open clausal complement), mark (marker for subordinate clause)

('ROOT', 'acl', 'acomp', 'advcl', 'advmod', 'agent', 'amod', 'appos', 'attr', 'aux', 'auxpass', 'case', 'cc', 'ccomp', 'compound', 'conj', 'csubj', 'csubjpass', 'dative', 'dep', 'det', 'dobj', 'expl', 'intj', 'mark', 'meta', 'neg', 'nmod', 'npadvmod', 'nsubj', 'nsubjpass', 'nummod', 'oprd', 'parataxis', 'pcomp', 'pobj', 'poss', 'preconj', 'predet', 'prep', 'prt', 'punct', 'quantmod', 'relcl', 'xcomp')


In [13]:
# Entities
print(nlp.get_pipe("ner").labels)
# GPE (country, state, city, ...), 
# NORP (nationality, religious or political groups, ...), 
# FAC (buildings, airports, highways, ...), 
# LAW (named legal documents)

('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART')


In [6]:
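# Pipeline names follow <language>_<type>_<genre>_<size>, e.g. "es_dep_news_trf".
# Note: "core" pipelines include named entity recognition; "dep" pipelines provide tagging, parsing and lemmatization only.
if lang == "English":
    lang_cd = "en"
    media = "web"
    model = "core"
elif lang == "Spanish":
    lang_cd = "es"
    media = "news"
    model = "dep"
elif lang == "French":
    lang_cd = "fr"
    media = "news"
    model = "dep"
else:
    raise ValueError(f"Unsupported language: {lang}")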

In [8]:
lang_cd+"_"+model+"_"+media+"_trf"

'es_dep_news_trf'

In [7]:
nlp = spacy.load(lang_cd+"_"+model+"_"+media+"_trf")

OSError: [E050] Can't find model 'es_dep_news_trf'. It doesn't seem to be a Python package or a valid path to a data directory.
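
Error E050 usually means the pipeline package is not installed in the active environment; running `python -m spacy download es_dep_news_trf` in that environment is the usual way to install it. Separately, the name itself can be assembled from a lookup table instead of an `if`/`elif` chain. The following is a minimal sketch of that alternative; `PIPELINES` and `model_name` are hypothetical names, and the table only mirrors the three languages configured above.

```python
# Language -> (language code, pipeline type, training genre),
# mirroring the if/elif chain in the notebook
PIPELINES = {
    "English": ("en", "core", "web"),
    "Spanish": ("es", "dep", "news"),
    "French": ("fr", "dep", "news"),
}


def model_name(language: str) -> str:
    """Build the transformer pipeline name for a configured language."""
    try:
        code, kind, genre = PIPELINES[language]
    except KeyError:
        raise ValueError(f"No pipeline configured for {language!r}") from None
    return f"{code}_{kind}_{genre}_trf"


print(model_name("Spanish"))  # es_dep_news_trf
```

An unknown language then fails immediately with a clear message rather than later with an undefined variable.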

In [14]:
corpus = list(nlp.pipe(dta["Target Text"]))

In [15]:
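rows = []
# One row per token; "doc" is the parsed target text, "name" its target name
# (separate loop names avoid overwriting the "corpus" list)
for doc, name in zip(corpus, dta["Target Name"]):
    for token in doc:
        rows.append({
            "Target Name": name,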
            "token": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "dependency": token.dep_,
            "entity": token.ent_type_
        })

tokens_df = pd.DataFrame(rows)

In [36]:
# Uzbekistan:
#tokens_df[tokens_df["Target Name"] == dta["Target Name"][4]] # 222 "I-IV 0 20%"
#tokens_df[tokens_df["Target Name"] == dta["Target Name"][33]] # 1637 "Target 31"

Unnamed: 0,Target Name,token,lemma,pos,dependency,entity
182,CBD Target 4,By,by,ADP,prep,
183,CBD Target 4,2030,2030,NUM,pobj,DATE
184,CBD Target 4,at,at,ADV,advmod,PERCENT
185,CBD Target 4,least,least,ADV,advmod,PERCENT
186,CBD Target 4,30,30,NUM,nummod,PERCENT
187,CBD Target 4,percent,percent,NOUN,nsubjpass,PERCENT
188,CBD Target 4,of,of,ADP,prep,
189,CBD Target 4,Uzbekistan,Uzbekistan,PROPN,poss,GPE
190,CBD Target 4,’s,’s,PART,case,
191,CBD Target 4,terrestrial,terrestrial,ADJ,amod,


In [16]:
# Labels "by" as DATE when the next token is a number already labelled DATE (e.g. "By 2030")
tokens_df.loc[
    (tokens_df["lemma"] == "by") & 
    (tokens_df["pos"].shift(-1) == "NUM") & 
    (tokens_df["entity"].shift(-1) == "DATE"), 
    "entity"] = tokens_df["entity"].shift(-1)
# Clears the "ORG", "LAW", "GPE" and "LOC" labels so that those tokens are dropped in the next step
tokens_df.loc[
    tokens_df["entity"].isin(["ORG", "LAW", "GPE", "LOC"]), 
    "entity"] = ""

In [17]:
# Drops all tokens that have no entity label (.copy() avoids a SettingWithCopyWarning when columns are added later)
tokens_df = tokens_df.loc[tokens_df["entity"] != ""].copy()
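
The "by" rule applied two cells above can be hard to read in its vectorised form. As an illustration only, here is the same idea on a plain list of dictionaries, using three rows taken from the Uzbekistan example shown earlier. The function name `extend_date_to_by` is hypothetical.

```python
def extend_date_to_by(tokens):
    """Return a copy of `tokens` in which 'by' takes the DATE label
    when the next token is a number labelled DATE."""
    out = [dict(t) for t in tokens]
    for current, nxt in zip(out, out[1:]):
        if (current["lemma"] == "by"
                and nxt["pos"] == "NUM"
                and nxt["entity"] == "DATE"):
            current["entity"] = "DATE"
    return out


sample = [
    {"token": "By", "lemma": "by", "pos": "ADP", "entity": ""},
    {"token": "2030", "lemma": "2030", "pos": "NUM", "entity": "DATE"},
    {"token": "at", "lemma": "at", "pos": "ADV", "entity": "PERCENT"},
]
labelled = extend_date_to_by(sample)
print([(t["token"], t["entity"]) for t in labelled])
```

With this relabelling, "By" survives the filter on empty labels and ends up in the same group as "2030".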

In [18]:
# Numbers runs of consecutive tokens that share the same entity label (a change of label or a gap in the index starts a new run)
tokens_df["flag"] = (
    (tokens_df["entity"] != tokens_df["entity"].shift()) |
    (tokens_df.index != tokens_df.index.to_series().shift() + 1))
tokens_df["entity_group"] = tokens_df["flag"].cumsum()
tokens_df.drop(columns="flag", inplace=True)

In [19]:
tokens_df["mergeable"] = (tokens_df["entity"] != "") & (tokens_df["entity"] != "O")
tokens_df["merge_group"] = tokens_df["entity_group"].where(tokens_df["mergeable"])

In [20]:
merged = (
    tokens_df.groupby(["Target Name", "merge_group", "entity"], dropna=True)
    .agg({"token": " ".join})
    .reset_index()
)
merged = merged.drop(["merge_group"], axis = 1)
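
The grouping performed by the `flag` / `cumsum` / `groupby` cells can be sketched without pandas: a new run starts whenever the target, the entity label, or the index continuity changes. This is an illustrative equivalent under that reading, not the notebook's implementation; `merge_entity_runs` is a hypothetical name, and the rows reuse the Uzbekistan example after the "by" rule has been applied.

```python
def merge_entity_runs(rows):
    """rows: (index, target, token, entity) tuples, already filtered to
    labelled tokens. Returns (target, entity, text), one per run of
    adjacent tokens that share a target and an entity label."""
    runs = []
    previous = None
    for index, target, token, entity in rows:
        same_run = (
            previous is not None
            and previous[1] == target
            and previous[3] == entity
            and previous[0] + 1 == index
        )
        if same_run:
            runs[-1][2].append(token)
        else:
            runs.append((target, entity, [token]))
        previous = (index, target, token, entity)
    return [(target, entity, " ".join(tokens)) for target, entity, tokens in runs]


rows = [
    (182, "CBD Target 4", "By", "DATE"),
    (183, "CBD Target 4", "2030", "DATE"),
    (184, "CBD Target 4", "at", "PERCENT"),
    (185, "CBD Target 4", "least", "PERCENT"),
    (186, "CBD Target 4", "30", "PERCENT"),
    (187, "CBD Target 4", "percent", "PERCENT"),
]
print(merge_entity_runs(rows))
```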

In [21]:
# ensures there are no spaces between a number and "%"
merged["token"] = merged["token"].str.replace(r"(\d+)\s+%", r"\1%", regex=True)
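
To see what the pattern does in isolation, the same substitution can be tried on plain strings with the standard `re` module. The wrapper name `tighten_percent` is hypothetical.

```python
import re


def tighten_percent(text: str) -> str:
    """Remove whitespace between a number and a following '%' sign."""
    return re.sub(r"(\d+)\s+%", r"\1%", text)


print(tighten_percent("at least 30 %"))  # at least 30%
print(tighten_percent("0 20 %"))         # 0 20%
```

Only the whitespace directly before `%` is removed; spaces between two numbers, as in the second example, are left untouched.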

In [22]:
merged

Unnamed: 0,Target Name,entity,token
0,BTR1 Target 3,CARDINAL,31
1,BTR1 Target 3,QUANTITY,at least 7 billion m3
2,BTR1 Target 7,QUANTITY,2 million hectares
3,CBD Target 10,DATE,By 2030
4,CBD Target 12,DATE,By 2030
5,CBD Target 13,DATE,By 2030
6,CBD Target 15,DATE,By 2030
7,CBD Target 16,DATE,2030
8,CBD Target 17,DATE,By 2030
9,CBD Target 18a,DATE,By 2026


In [23]:
# Creates a list of time-bound terms per target
dates = (
    merged[merged["entity"] == "DATE"]
    .groupby("Target Name")["token"]
    .apply(list)
    .reset_index(name="dates")
)

In [24]:
# Creates a list of quantitative terms per target
quants = (
    merged[merged["entity"] != "DATE"]
    .groupby("Target Name")["token"]
    .apply(list)
    .reset_index(name="quants")
)

In [25]:
quants

Unnamed: 0,Target Name,quants
0,BTR1 Target 3,"[31, at least 7 billion m3]"
1,BTR1 Target 7,[2 million hectares]
2,CBD Target 19b,[at least 15%]
3,CBD Target 1b,[30%]
4,CBD Target 2,[at least 30%]
5,CBD Target 23,[three]
6,CBD Target 4,"[at least 30 percent, 0 20%, 10%]"
7,CBD Target 6,[at least 50%]
8,NDC2 Target 1,[25%]


In [26]:
condens = pd.merge(dates, quants, on="Target Name", how="outer")

In [27]:
condens[["dates", "quants"]] = condens[["dates", "quants"]].fillna("")

In [28]:
condens

Unnamed: 0,Target Name,dates,quants
0,BTR1 Target 3,,"[31, at least 7 billion m3]"
1,BTR1 Target 7,,[2 million hectares]
2,CBD Target 10,[By 2030],
3,CBD Target 12,[By 2030],
4,CBD Target 13,[By 2030],
5,CBD Target 15,[By 2030],
6,CBD Target 16,[2030],
7,CBD Target 17,[By 2030],
8,CBD Target 18a,[By 2026],
9,CBD Target 18b,[By 2030],


In [36]:
condens.to_excel(path+"/"+cty+"_quantitative_"+datetime.today().strftime("%d%b%y").lstrip("0")+".xlsx", sheet_name = "Quantitative Terms", index=False)
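
The file name above combines the country name with a date stamp such as `23May25`. The sketch below isolates that logic so it can be checked with a fixed date. `export_name` is a hypothetical helper, and the expected output assumes the default locale, since `%b` yields locale-dependent month abbreviations.

```python
from datetime import datetime


def export_name(country: str, when: datetime) -> str:
    """Build '<country>_quantitative_<stamp>.xlsx'; the stamp is
    day + abbreviated month + two-digit year, without a leading zero."""
    stamp = when.strftime("%d%b%y").lstrip("0")
    return f"{country}_quantitative_{stamp}.xlsx"


print(export_name("Dominican Republic", datetime(2025, 5, 3)))
# Dominican Republic_quantitative_3May25.xlsx
```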