# **PACKAGES**

To install spaCy, follow the instructions at https://spacy.io/usage and select the following options:  
Operating system, Platform (**ARM/M1** if you have an Apple M1-M3 chip), Package manager, Hardware, Configuration (**virtual env**), Trained pipelines (**English**, **French**, **Spanish**), Select pipeline for (**accuracy**)

In [1]:
import os, sys, csv, time, re
import pandas as pd, numpy as np, matplotlib.pyplot as plt
import openpyxl
from pickle import load
from datetime import datetime

import spacy
import spacy.cli

In [2]:
#spacy.cli.download("es_dep_news_trf")
#spacy.cli.download("fr_dep_news_trf")

Collecting fr-dep-news-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_dep_news_trf-3.8.0/fr_dep_news_trf-3.8.0-py3-none-any.whl (397.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 397.7/397.7 MB 4.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('fr_dep_news_trf')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
print(sys.version)

3.11.11 (main, Dec 11 2024, 10:25:04) [Clang 14.0.6 ]


# **QUICK SETUP**

In [2]:
pd.set_option('display.max_rows', None)

In [3]:
cty = "Panama" #<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< change here!
lang = "Spanish" #<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< change here!

# **DATA IMPORT**

In [4]:
#print(os.getcwd())
path = os.getcwd() + "/data/countries/" + cty.lower().replace(" ", "_")
print(path)

/Users/julienmhp/Desktop/undp/TargetAssessmentReport/data/countries/panama


In [5]:
file = [os.path.join(path, f) for f in os.listdir(path) if cty.lower().replace(" ", "_") in f and f.endswith(".xlsx")]
print(file[0])

/Users/julienmhp/Desktop/undp/TargetAssessmentReport/data/countries/panama/data_panama_11Jun25.xlsx
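
The lookup above assumes that at least one matching workbook exists; with an empty list, `file[0]` raises an `IndexError`. A minimal sketch of a more defensive lookup follows. The helper name `find_workbook` is hypothetical, and picking the alphabetically last match when several files qualify is an assumption, not something the notebook does.

```python
from pathlib import Path


def find_workbook(folder, country):
    """Return one .xlsx file in `folder` whose name contains the country slug.

    Hypothetical helper: fails early with FileNotFoundError instead of IndexError.
    """
    slug = country.lower().replace(" ", "_")
    matches = sorted(p for p in Path(folder).glob("*.xlsx") if slug in p.name)
    if not matches:
        raise FileNotFoundError(f"no .xlsx file containing '{slug}' in {folder}")
    # Assumption: when several files match, keep the last one alphabetically.
    return matches[-1]
```

Calling `find_workbook(path, cty)` would then replace the list comprehension and the `file[0]` indexing.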


In [6]:
dta = pd.read_excel(file[0], sheet_name = "targets", engine = "openpyxl")

In [57]:
dta.head()

Unnamed: 0,Country,Target Text,Target Name,Document,Source,Convention,Tipo de Meta,Sector,Doc,Type
0,Panama,"A 2050, el 30% de la capacidad instalada de la...",Meta 1.1,Segunda Contribución Determinada a Nivel Nacio...,https://sinia.gob.pa/segunda-contribucion-dete...,Clima,No-GEI,Energía,CDN,Metas de las CDN
1,Panama,"A 2050, Panamá logrará una reducción de las em...",Meta 1.2,Segunda Contribución Determinada a Nivel Nacio...,https://sinia.gob.pa/segunda-contribucion-dete...,Clima,GEI,Energía,CDN,Metas de las CDN
2,Panama,"Al 2025, Panamá contará con un Plan de Adaptac...",Meta 1.3,Segunda Contribución Determinada a Nivel Nacio...,https://sinia.gob.pa/segunda-contribucion-dete...,Clima,No-GEI,Energía,CDN,Metas de las CDN
3,Panama,Eliminación de la generación con Carbón en la ...,Meta 1.4,Segunda Contribución Determinada a Nivel Nacio...,https://sinia.gob.pa/segunda-contribucion-dete...,Clima,GEI,Energía,CDN,Metas de las CDN
4,Panama,"Al 2027, Panamá logrará generar 21.000 nuevos ...",Meta 1.5,Segunda Contribución Determinada a Nivel Nacio...,https://sinia.gob.pa/segunda-contribucion-dete...,Clima,No-GEI,Energía,CDN,Metas de las CDN


In [58]:
dta.shape

(114, 10)

# **QUICK TWEAKS**

In [7]:
# ensures there are no spaces between a number and "%"
dta["Target Text"] = dta["Target Text"].str.replace(r"(\d+)\s+%", r"\1%", regex=True)
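
To see what this substitution does outside pandas, the same pattern can be applied with the standard `re` module. This is only an illustration; the function name `close_percent_gap` is made up for the example.

```python
import re

# Same pattern as the pandas call above: digits, whitespace, then "%".
PERCENT_GAP = re.compile(r"(\d+)\s+%")


def close_percent_gap(text):
    """Remove whitespace between a number and a following percent sign."""
    return PERCENT_GAP.sub(r"\1%", text)


print(close_percent_gap("reducción del 24 % al 2030"))  # reducción del 24% al 2030
```

Text that already has no gap (e.g., "30%") is left unchanged, so the step is safe to run more than once.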

# **MODEL**

Main token attributes exposed by the **spaCy** pipeline:
- **token** each word or symbol  
- **lemma** base (dictionary) form of the token
- **pos** coarse part-of-speech tag (https://universaldependencies.org/u/pos/)
- **tag** detailed 'pos' tag (not in 'TRF')
- **morph** morphosyntactic features: gender, number, case, tense, mood, ...
- **entity** named-entity type of the token, e.g., DATE or ORG (https://spacy.io/usage/linguistic-features) (https://www.universalner.org/)
- **dependency** grammatical relation between a token and its head (https://spacy.io/usage/linguistic-features) (https://universaldependencies.org/u/dep/)
- **is_alpha**, **is_digit**, **is_punct**, **is_space**, **is_title**, **is_stop**, **is_currency**, **is_quote**, ...

For more details: https://spacy.io/api/token  
... or use 'print(spacy.explain("{KEYWORD}"))'

### **Loading the model**

In [8]:
if lang == "English":
    lang_cd = "en"; media = "web"; model = "core"
elif lang == "Spanish":
    lang_cd = "es"; media = "news"; model = "dep"
elif lang == "French":
    lang_cd = "fr"; media = "news"; model = "dep"

In [9]:
lang_cd+"_"+model+"_"+media+"_trf"

'es_dep_news_trf'
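
The `if`/`elif` chain leaves `lang_cd`, `media` and `model` undefined for any other language, so the concatenation would fail with a `NameError`. A possible alternative, sketched here with a lookup table, reports the problem explicitly. The names `MODEL_PARTS` and `trf_model_name` are hypothetical.

```python
# Hypothetical lookup table; the three entries mirror the if/elif chain above.
MODEL_PARTS = {
    "English": ("en", "core", "web"),
    "Spanish": ("es", "dep", "news"),
    "French": ("fr", "dep", "news"),
}


def trf_model_name(language):
    """Build the transformer pipeline name for a supported language."""
    try:
        lang_cd, model, media = MODEL_PARTS[language]
    except KeyError:
        raise ValueError(f"unsupported language: {language!r}") from None
    return f"{lang_cd}_{model}_{media}_trf"


print(trf_model_name("Spanish"))  # es_dep_news_trf
```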

In [10]:
nlp = spacy.load(lang_cd+"_"+model+"_"+media+"_trf")

### **Model references**

In [12]:
print(nlp.pipe_names)

['transformer', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer']


**POS/Tag**

In [14]:
print(nlp.get_pipe("tagger").labels) # raises KeyError when the pipeline has no "tagger" component, as here (see nlp.pipe_names)
# ADJ (adjective), ADP (adposition), ADV (adverb), AUX (auxiliary verb), CONJ (conjunction), CCONJ (coordinating conjunction), 
# DET (determiner), INTJ (interjection), NOUN, NUM, PART (particle), PRON (pronoun), PROPN (proper noun), PUNCT (punctuation), 
# SCONJ (subordinating conjunction), SYM (symbol), VERB, X (other/unknown), SPACE (white space)

KeyError: "[E001] No component 'tagger' found in pipeline. Available names: ['transformer', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer']"
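
The `KeyError` can be avoided by checking `nlp.pipe_names` before calling `nlp.get_pipe`. The sketch below assumes only those two members of the loaded pipeline; the helper name `component_labels` is hypothetical.

```python
def component_labels(nlp, name):
    """Return the labels of pipeline component `name`, or () if it is absent."""
    if name not in nlp.pipe_names:
        return ()
    # Not every component exposes labels, so fall back to an empty tuple.
    return tuple(getattr(nlp.get_pipe(name), "labels", ()))
```

With the pipeline loaded above, `component_labels(nlp, "tagger")` and `component_labels(nlp, "ner")` would return `()` instead of raising.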

In [45]:
print(spacy.explain("SYM"))

symbol


**Morphologizer**

In [46]:
print(nlp.get_pipe("morphologizer").labels)

('Definite=Def|Gender=Masc|Number=Sing|POS=DET|PronType=Art', 'Gender=Masc|Number=Sing|POS=NOUN', 'Definite=Def|Gender=Masc|Number=Sing|POS=ADP|PronType=Art', 'Gender=Masc|Number=Sing|POS=ADJ', 'POS=ADP', 'Definite=Def|Gender=Fem|Number=Plur|POS=DET|PronType=Art', 'POS=PROPN', 'Case=Acc|POS=PRON|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes', 'Mood=Ind|Number=Sing|POS=VERB|Person=3|Tense=Past|VerbForm=Fin', 'POS=VERB|VerbForm=Inf', 'Gender=Fem|Number=Sing|POS=DET|PronType=Dem', 'Gender=Fem|Number=Sing|POS=NOUN', 'Gender=Fem|Number=Plur|POS=NOUN', 'Gender=Fem|Number=Plur|POS=DET|PronType=Ind', 'POS=PRON|PronType=Int,Rel', 'Mood=Sub|Number=Plur|POS=VERB|Person=3|Tense=Pres|VerbForm=Fin', 'Definite=Def|Gender=Fem|Number=Sing|POS=DET|PronType=Art', 'POS=SCONJ', 'POS=NOUN', 'Definite=Def|Gender=Masc|Number=Plur|POS=DET|PronType=Art', 'Number=Plur|POS=NOUN', 'Gender=Masc|Number=Plur|POS=DET|PronType=Ind', 'Gender=Masc|Number=Plur|POS=NOUN', 'POS=PUNCT|PunctType=Peri', 'Mood=Ind|Number=Sing|P

In [47]:
print(spacy.explain("ADJ"))

adjective


**Parser**

In [48]:
print(nlp.get_pipe("parser").labels)
# ROOT (root of sentence), nsubj (nominal subject), nsubjpass (passive nominal subject), 
# dobj (direct object), iobj (indirect object), attr (attribute), prep (preposition modifier), 
# pobj (object of a preposition), amod (adjectival modifier), advmod (adverbial modifier), 
# compound (compound noun modifier), aux (auxiliary verb), auxpass (passive auxiliary), 
# det (determiner), conj (conjunct), cc (coordinating conjunction), nmod (nominal modifier), 
# npadvmod (noun phrase as adverbial modifier), poss (possession modifier), 
# ccomp (clausal complement), xcomp (open clausal complement), mark (marker for subordinate clause)

('ROOT', 'acl', 'advcl', 'advmod', 'amod', 'appos', 'aux', 'case', 'cc', 'ccomp', 'compound', 'conj', 'cop', 'csubj', 'dep', 'det', 'expl:impers', 'expl:pass', 'expl:pv', 'fixed', 'flat', 'iobj', 'mark', 'nmod', 'nsubj', 'nummod', 'obj', 'obl', 'parataxis', 'punct', 'xcomp')


In [49]:
print(spacy.explain("appos"))

appositional modifier


**Entities (NER)**

In [50]:
print(nlp.get_pipe("ner").labels) # raises KeyError when the pipeline has no "ner" component, as here (see nlp.pipe_names)
# "es_dep_news_trf" ships without an NER component; labels found in pipelines that do include one:
# GPE (country, state, city, ...), 
# NORP (nationality, religious or political groups, ...), 
# FAC (buildings, airports, highways, ...), 
# LAW (documents)

KeyError: "[E001] No component 'ner' found in pipeline. Available names: ['transformer', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer']"

In [51]:
print(spacy.explain("FAC"))

Buildings, airports, highways, bridges, etc.


### **Getting started**

In [28]:
dta["Full Target"] = dta["Doc"] + " " + dta["Target Name"]

In [36]:
corpus = list(nlp.pipe(dta["Target Text"]))

In [37]:
rows = []
for doc, text in zip(corpus, dta["Full Target"]): # "doc" avoids overwriting the "corpus" list inside the loop
    sent_starts = {sent[0].i for sent in doc.sents}
    for token in doc:
        rows.append({ # all attributes: https://spacy.io/api/attributes
            "Full Target": text, 
            "token": token.text, 
            "lemma": token.lemma_, # root of the word (no plurals, gender variation, conjugation, ...)
            "pos": token.pos_, # …
            #"tag": token.tag_, # not in TRF
            "morph": str(token.morph), # …
            "dependency": token.dep_, # …
            "head": token.head.text, # text of the syntactic head (stored as a string, not a Token object)
            "entity": token.ent_type_, # …
            #"start": token.sent_start, # does it start a sentence? prefer "token.doc.sents"
            "start": token.i in sent_starts,
            "alpha": token.is_alpha, # alphabetic characters only (accented letters such as "á" and "ñ" count)
            "digit": token.is_digit, # digits only - so no Roman numerals or "1.5"
            "num": token.like_num, # resembles number {like_email, like_url}
            "punct": token.is_punct, # 
            "stop": token.is_stop, # stopwords: 
            "space": token.is_space, # 
            "title": token.is_title, # 
            "upper": token.is_upper, # 
            "lower": token.is_lower, # 
            "shape": token.shape_ # "dddd", "X", "dd%", ...
            #"length": token.length # not in TRF
        })

tokens_df = pd.DataFrame(rows)
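
The `shape` column is used by later correction rules, so it helps to know how a shape string is formed. The function below is a pure-Python approximation, not spaCy's own code: it is an assumption about the behaviour of `token.shape_`, checked only against the shapes printed in this notebook (e.g., "dd.ddd" for "21.000" and "xxxx" for "menos").

```python
def word_shape(text):
    """Approximate a spaCy-style shape: X/x for letters, d for digits,
    other characters kept as-is, runs of the same symbol capped at four."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char
        # Count how long the current run of identical symbols is.
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= 4:
            shape.append(symbol)
    return "".join(shape)


print(word_shape("2022-2050"))  # dddd-dddd
```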

In [39]:
#tokens_df.head()
tokens_df[tokens_df["Full Target"] == "CDN Meta 1.11"]

Unnamed: 0,Full Target,token,lemma,pos,morph,dependency,head,entity,start,alpha,digit,num,punct,stop,space,title,upper,lower,shape
349,CDN Meta 1.11,Al,al,ADP,Definite=Def|Gender=Masc|Number=Sing|PronType=Art,case,2030,,True,True,False,False,False,True,False,True,False,False,Xx
350,CDN Meta 1.11,2030,2030,NOUN,AdvType=Tim,obl,provendrá,,False,False,True,True,False,False,False,False,False,False,dddd
351,CDN Meta 1.11,",",",",PUNCT,PunctType=Comm,punct,2030,,False,False,False,False,True,False,False,False,False,False,","
352,CDN Meta 1.11,al,al,ADP,Definite=Def|Gender=Masc|Number=Sing|PronType=Art,case,MW,,False,True,False,False,False,True,False,False,False,True,xx
353,CDN Meta 1.11,menos,menos,ADV,Degree=Cmp,fixed,al,,False,True,False,False,False,True,False,False,False,True,xxxx
354,CDN Meta 1.11,1700,1700,NUM,NumForm=Digit|NumType=Card,nummod,MW,,False,False,True,True,False,False,False,False,False,False,dddd
355,CDN Meta 1.11,MW,MW,SYM,NumForm=Digit|NumType=Frac,nsubj,provendrá,,False,True,False,False,False,False,False,False,True,False,XX
356,CDN Meta 1.11,de,de,ADP,,case,capacidad,,False,True,False,False,False,True,False,False,False,True,xx
357,CDN Meta 1.11,la,el,DET,Definite=Def|Gender=Fem|Number=Sing|PronType=Art,det,capacidad,,False,True,False,False,False,True,False,False,False,True,xx
358,CDN Meta 1.11,capacidad,capacidad,NOUN,Gender=Fem|Number=Sing,nmod,MW,,False,True,False,False,False,False,False,False,False,True,xxxx


### **Corrections**

#### **General**

In [40]:
# makes sure that the structure "YYYY-YYYY" is DATE
tokens_df.loc[tokens_df["lemma"].str.fullmatch(r"(19|20)\d{2}-(19|20)\d{2}"), "entity"] = "DATE"
#tokens_df.loc[(tokens_df["shape"] == "dddd") & (tokens_df["morph"] == "AdvType=Tim"), "entity"] = "DATE"
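
The year-range rule depends on a full match: the whole lemma must be two years from the 1900s or 2000s joined by a hyphen. A small illustration with the standard `re` module follows; the function name `is_year_range` is made up for the example.

```python
import re

# Same pattern as the pandas rule above.
YEAR_RANGE = re.compile(r"(19|20)\d{2}-(19|20)\d{2}")


def is_year_range(lemma):
    """True only when the entire string is a YYYY-YYYY range."""
    return YEAR_RANGE.fullmatch(lemma) is not None


print(is_year_range("2022-2050"))  # True
print(is_year_range("2022-50"))    # False
```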

#### **Country-specific changes**

In [38]:
if cty == "Uzbekistan":
    tokens_df[tokens_df["Full Target"] == dta["Full Target"][4]] # 222 "I-IV 0 20%" [...]
    #tokens_df[tokens_df["Full Target"] == dta["Full Target"][33]] # 1637 "Target 31" [√]

#### **Language-specific changes**

**Spanish**

In [41]:
#wrd = ["double", "halve", "half", "triple", "quadruple", "quarter", "quintuple"]
#lgs = ["act", "acts", "bill", "bills", "regulation", "regulations", "decree", "decrees", "article", "articles", "law", "laws", "recommendation", "recommendations", "bill", "bills", "exco"]
org = ["Cantidad", "Ley", "capítulo"] # ["target", "targets", "goal", "goals", "objective", "objectives", "figure", "table", "zone", "zones", "strategy", "strategies", "strategic", "plan", "plans", "nt", "phase", "phases", "agenda", "agendas", "policy", "policies", "stage", "stages", "programme", "programmes", "action", "actions", "budget"]
mtr = ["ha", "hectárea", "hectáreas", "mm", "tonelada", "toneladas", "MW"] #["cm", "m", "km", "km2", "mile", "miles", "g", "kg", "ºc", "oc", "microns", "acre", "acres", "factor"] # "m"?
#mth = [">", "<", "=", "≥", "≤", "~", "≈"]
qty = ["mil", "miles", "hundred", "cien", "cientos", "millón", "millones", "billón", "billones", "mtco2e", "mtco₂e", "co2", "co2e", "co2eq", "co2-eq"]
#prc = ["%", "percent", "percentage"]
tmp = ["año"] # ["annually", "annum", "year", "years", "monthly", "month", "months", "weekly", "week", "weeks", "daily", "day", "days", "hourly", "hour", "hours", tolower(month.name), tolower(month.abb)]
#mny = ["dollar", "usd", "euros", "eur", "$", "£", "¥", "€"] # "M"?2

if lang_cd == "es":
    # ensures tokens that are just digits, or digits with ("-", ",", "." and "%") are CARD
    tokens_df.loc[tokens_df["lemma"].astype(str).str.match(r"^-?\d+[.,]?\d*%?$", na=False) & 
                    ~tokens_df["lemma"].astype(str).str.match(r"^(19|20)\d{2}$", na=False), "entity"] = "CARD"
    
    # makes sure 4-digit numbers - that are not "appos" or "nummod", e.g., DR's "2010-2030" - are DATE
    tokens_df.loc[tokens_df["lemma"].astype(str).str.match(r"^(19|20)\d{2}$") & 
                    ~tokens_df["dependency"].isin(["appos", "nummod"]), "entity"] = "DATE"
    
    # makes sure that spelled-out numbers are CARD
    tokens_df.loc[(tokens_df["pos"] == "NUM") & (tokens_df["lemma"].astype(str).str.match(r"^[A-Za-z]+$", na=False)), "entity"] = "CARD"
    
    # make sure "%" is CARD
    tokens_df.loc[tokens_df["lemma"].str.match(r"^\d+$", na=False) & (tokens_df["lemma"].shift(-1) == "%"), "entity"] = "CARD"
    tokens_df.loc[tokens_df["lemma"] == "%", "entity"] = "CARD"
    
    # makes sure "hasta", "para (el)" and "al"(and similar) are considered DATE (e.g., "[hasta/para (el)/al] 2030")
    tokens_df.loc[(tokens_df["pos"] == "ADP") & (tokens_df["dependency"] == "case") & 
                    ((tokens_df["entity"].shift(-1) == "DATE") | 
                     ((tokens_df["pos"].shift(-1) == "DET") & (tokens_df["entity"].shift(-2) == "DATE"))
                    ), "entity"] = "DATE"
    tokens_df.loc[(tokens_df["pos"] == "DET") &
                    (tokens_df["entity"].shift(1) == "DATE") & (tokens_df["entity"].shift(-1) == "DATE"), "entity"] = "DATE"
    
    # makes sure that numbers preceded by "name-terms" (e.g. "target") are not CARD/DATE
    tokens_df.loc[(tokens_df["entity"] == "CARD") & 
                    (tokens_df["lemma"].shift(1).isin(org)), "entity"] = ""

    # makes sure that quantity words (e.g., "millones") are not CARD/DATE
    tokens_df.loc[tokens_df["lemma"].isin(qty), "entity"] = ""
    
    # makes sure that numbers preceded by a title (e.g. "ACE 23") are not CARD/DATE
    tokens_df.loc[(tokens_df["pos"] == "NUM") & 
                    ((tokens_df["lemma"].shift(1).astype(str).str.match(r"^[A-Z0-9_-]+$")) | 
                     (tokens_df["pos"].shift(1) == "PROPN")), "entity"] = ""
    
    # makes sure that time references (e.g., "años") preceded by numbers are DATE
    tokens_df.loc[(tokens_df["lemma"].isin(tmp)) & 
                    (tokens_df["entity"].shift(1) == "CARD"), "entity"] = "DATE"
    tokens_df.loc[(tokens_df["entity"] == "CARD") & 
                (tokens_df["lemma"].shift(-1).isin(tmp)), "entity"] = "DATE"
    
    # makes sure that unit measurements are counted as CARD (e.g., "75,102 ha")
    tokens_df.loc[((tokens_df["lemma"].isin(mtr)) | (tokens_df["dependency"] == "appos")) & 
                    (tokens_df["entity"].shift(1) == "CARD"), "entity"] = "CARD"
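
The first two Spanish rules can be tried on single lemmas without a DataFrame. The sketch below covers only those two rules and assumes the entity starts out empty, as it does with a pipeline that has no NER component; the function name `label_number` is hypothetical.

```python
import re

NUMBER = re.compile(r"^-?\d+[.,]?\d*%?$")  # "60", "11.5%", "-3,5"
YEAR = re.compile(r"^(19|20)\d{2}$")       # "2030", "1990"


def label_number(lemma, dependency=""):
    """Return "CARD", "DATE" or "" for one lemma (first two rules only)."""
    if YEAR.match(lemma):
        # A bare year is a DATE unless it modifies another token.
        return "" if dependency in ("appos", "nummod") else "DATE"
    if NUMBER.match(lemma):
        return "CARD"
    return ""


print(label_number("2030", "obl"))  # DATE
print(label_number("11.5%"))        # CARD
```

Note that the number pattern allows a single separator only, so a string such as "1.700.000" is not matched by it.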

In [145]:
# test code chunks here


In [42]:
tokens_df[tokens_df["entity"] == 'CARD']
#tokens_df[tokens_df["entity"] == 'DATE']
#tokens_df
#tokens_df[tokens_df["Target Name"] == 'Meta 1.6']

Unnamed: 0,Full Target,token,lemma,pos,morph,dependency,head,entity,start,alpha,digit,num,punct,stop,space,title,upper,lower,shape
4,CDN Meta 1.1,30%,30%,SYM,NumForm=Digit|NumType=Frac,nsubj,provenir,CARD,False,False,False,False,False,False,False,False,False,False,dd%
43,CDN Meta 1.2,24%,24%,SYM,NumForm=Digit|NumType=Frac,nmod,reducción,CARD,False,False,False,False,False,False,False,False,False,False,dd%
48,CDN Meta 1.2,11.5%,11.5%,SYM,NumForm=Digit|NumType=Frac,conj,24%,CARD,False,False,False,False,False,False,False,False,False,False,dd.d%
63,CDN Meta 1.2,60,60,NUM,NumForm=Digit|NumType=Card,nummod,millones,CARD,False,False,True,True,False,False,False,False,False,False,dd
75,CDN Meta 1.2,10,10,NUM,NumForm=Digit|NumType=Card,nummod,millones,CARD,False,False,True,True,False,False,False,False,False,False,dd
123,CDN Meta 1.5,21.000,21000,NUM,NumForm=Digit|NumType=Card,nummod,empleos,CARD,False,False,False,True,False,False,False,False,False,False,dd.ddd
148,CDN Meta 1.6,1,1,NUM,NumForm=Digit|NumType=Card,ROOT,1,CARD,True,False,True,True,False,False,False,False,False,False,d
162,CDN Meta 1.6,2,2,NOUN,AdvType=Tim,ROOT,2,CARD,True,False,True,True,False,False,False,False,False,False,d
181,CDN Meta 1.6,60%,60%,SYM,NumForm=Digit|NumType=Frac,obj,cocinar,CARD,False,False,False,False,False,False,False,False,False,False,dd%
201,CDN Meta 1.7,10%,10%,SYM,NumForm=Digit|NumType=Frac,nsubj,-,CARD,False,False,False,False,False,False,False,False,False,False,dd%


**English**

In [14]:
#wrd = ["double", "halve", "half", "triple", "quadruple", "quarter", "quintuple"]
#lgs = ["act", "acts", "bill", "bills", "regulation", "regulations", "decree", "decrees", "article", "articles", "law", "laws", "recommendation", "recommendations", "bill", "bills", "exco"]
#org = ["target", "targets", "goal", "goals", "objective", "objectives", "figure", "table", "zone", "zones", "strategy", "strategies", "strategic", "plan", "plans", "nt", "phase", "phases", "agenda", "agendas", "policy", "policies", "stage", "stages", "programme", "programmes", "action", "actions", "budget"]
#mtr = ["ha", "hectare", "hectares", "cm", "m", "km", "km2", "mile", "miles", "g", "kg", "ton", "tons", "ºc", "oc", "microns", "acre", "acres", "factor", "mw"] # "m"?
#mth = [">", "<", "=", "≥", "≤", "~", "≈"]
#qty = ["thousand", "thousands", "hundred", "hundreds", "million", "millions", "billion", "billions", "trillion", "trillions", "mtco2e", "mtco₂e", "co2", "co2e", "co2eq", "co2-eq"]
#prc = ["%", "percent", "percentage"]
#tme = ["annually", "annum", "year", "years", "monthly", "month", "months", "weekly", "week", "weeks", "daily", "day", "days", "hourly", "hour", "hours", tolower(month.name), tolower(month.abb)]
#mny = ["dollar", "usd", "euros", "eur", "$", "£", "¥", "€"] # "M"?2

if lang_cd == "en":
    # Ensures that the word "by", followed by a number representing a date also is considered as a date
    tokens_df.loc[
        (tokens_df["lemma"] == "by") & 
        (tokens_df["pos"].shift(-1) == "NUM") & 
        (tokens_df["entity"].shift(-1) == "DATE"), 
        "entity"] = tokens_df["entity"].shift(-1)
    # Ensures that numbers preceded by any terms such as "target" are not regarded as numbers
    org = ["target", "targets", "goal", "goals", "objective", "objectives", "figure", "table", 
           "zone", "zones", "strategy", "strategies", "strategic", "plan", "plans", 
           "phase", "phases", "agenda", "agendas", "policy", "policies", "stage", "stages", 
           "programme", "programmes", "action", "actions", "budget"]
    tokens_df.loc[(tokens_df["entity"] == "CARDINAL") & 
                    (tokens_df["lemma"].shift(1).isin(org)), "entity"] = ""

#### **General changes**

In [43]:
# Ensures that if the number part of a target name appears in its target text, 
# that token should be regarded as ORG - not NUM
def target_name_in_text(row):
    target = str(row["Full Target"])
    lemma = str(row["lemma"])
    matches = re.findall(r"\d+[.,]\d+", target)
    lemma_normalized = lemma.replace(",", ".")
    matches_normalized = [m.replace(",", ".") for m in matches]
    for num in matches_normalized:
        if num == lemma_normalized:
            return "ORG"
    
    return row["pos"]

tokens_df["pos"] = tokens_df.apply(target_name_in_text, axis=1)
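
The comparison inside `target_name_in_text` can be tested on plain strings. This standalone version is only an illustration of the same matching logic; the name `number_in_name` is made up for the example.

```python
import re


def number_in_name(full_target, lemma):
    """True when `lemma` equals a decimal-style number found in the target name,
    treating "," and "." as the same separator."""
    numbers = {m.replace(",", ".") for m in re.findall(r"\d+[.,]\d+", full_target)}
    return lemma.replace(",", ".") in numbers


print(number_in_name("CDN Meta 1.11", "1,11"))  # True
print(number_in_name("Target 31", "31"))        # False
```

The second call shows a limitation of the pattern: it requires a separator, so target names numbered with a plain integer are never matched.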

In [44]:
# Ensures that if a number is preceded and succeeded immediately by ".", it is not considered a "CARD"
tokens_df.loc[
(tokens_df["shape"] == "d") & 
(tokens_df["shape"].shift(-1) == ".") & 
(tokens_df["shape"].shift(1) == "."), "entity"] = ""

In [45]:
# Ensures that tokens marked "GPE", "ORG", "LAW" or "LOC" in "pos" are discarded from the analysis - neither quantitative nor temporal
tokens_df.loc[(tokens_df["pos"] == "ORG") | (tokens_df["pos"] == "LAW") | 
                (tokens_df["pos"] == "GPE") | (tokens_df["pos"] == "LOC"), "entity"] = ""

In [46]:
tokens_df[tokens_df["entity"] == 'CARD']

Unnamed: 0,Full Target,token,lemma,pos,morph,dependency,head,entity,start,alpha,digit,num,punct,stop,space,title,upper,lower,shape
4,CDN Meta 1.1,30%,30%,SYM,NumForm=Digit|NumType=Frac,nsubj,provenir,CARD,False,False,False,False,False,False,False,False,False,False,dd%
43,CDN Meta 1.2,24%,24%,SYM,NumForm=Digit|NumType=Frac,nmod,reducción,CARD,False,False,False,False,False,False,False,False,False,False,dd%
48,CDN Meta 1.2,11.5%,11.5%,SYM,NumForm=Digit|NumType=Frac,conj,24%,CARD,False,False,False,False,False,False,False,False,False,False,dd.d%
63,CDN Meta 1.2,60,60,NUM,NumForm=Digit|NumType=Card,nummod,millones,CARD,False,False,True,True,False,False,False,False,False,False,dd
75,CDN Meta 1.2,10,10,NUM,NumForm=Digit|NumType=Card,nummod,millones,CARD,False,False,True,True,False,False,False,False,False,False,dd
123,CDN Meta 1.5,21.000,21000,NUM,NumForm=Digit|NumType=Card,nummod,empleos,CARD,False,False,False,True,False,False,False,False,False,False,dd.ddd
181,CDN Meta 1.6,60%,60%,SYM,NumForm=Digit|NumType=Frac,obj,cocinar,CARD,False,False,False,False,False,False,False,False,False,False,dd%
201,CDN Meta 1.7,10%,10%,SYM,NumForm=Digit|NumType=Frac,nsubj,-,CARD,False,False,False,False,False,False,False,False,False,False,dd%
203,CDN Meta 1.7,20%,20%,SYM,NumForm=Digit,nmod,-,CARD,False,False,False,False,False,False,False,False,False,False,dd%
221,CDN Meta 1.7,7%,7%,SYM,NumForm=Digit|NumType=Frac,nmod,2027,CARD,False,False,False,False,False,False,False,False,False,False,d%


In [47]:
tokens_df[tokens_df["entity"] == 'DATE']

Unnamed: 0,Full Target,token,lemma,pos,morph,dependency,head,entity,start,alpha,digit,num,punct,stop,space,title,upper,lower,shape
0,CDN Meta 1.1,A,a,ADP,,case,2050,DATE,True,True,False,False,False,True,False,True,True,False,X
1,CDN Meta 1.1,2050,2050,NOUN,AdvType=Tim,obl,provenir,DATE,False,False,True,True,False,False,False,False,False,False,dddd
23,CDN Meta 1.2,A,a,ADP,,case,2050,DATE,True,True,False,False,False,True,False,True,True,False,X
24,CDN Meta 1.2,2050,2050,NOUN,AdvType=Tim,obl,logrará,DATE,False,False,True,True,False,False,False,False,False,False,dddd
49,CDN Meta 1.2,al,al,ADP,Definite=Def|Gender=Masc|Number=Sing|PronType=Art,case,2030,DATE,False,True,False,False,False,True,False,False,False,True,xx
50,CDN Meta 1.2,2030,2030,NOUN,AdvType=Tim,nmod,11.5%,DATE,False,False,True,True,False,False,False,False,False,False,dddd
71,CDN Meta 1.2,entre,entre,ADP,,case,2022-2050,DATE,False,True,False,False,False,True,False,False,False,True,xxxx
72,CDN Meta 1.2,2022-2050,2022-2050,NOUN,AdvType=Tim,obl,acumuladas,DATE,False,False,False,False,False,False,False,False,False,False,dddd-dddd
83,CDN Meta 1.2,entre,entre,ADP,,case,2022-2030,DATE,False,True,False,False,False,True,False,False,False,True,xxxx
84,CDN Meta 1.2,2022-2030,2022-2030,NOUN,AdvType=Tim,obl,acumuladas,DATE,False,False,False,False,False,False,False,False,False,False,dddd-dddd


### **Clean-up**

In [152]:
#tokens_df["entity"].unique()

array(['', 'DATE', 'CARD'], dtype=object)

In [48]:
# Drops all tokens whose "entity" value is empty
tokens_df = tokens_df.loc[tokens_df["entity"] != ""].copy() # .copy() avoids SettingWithCopyWarning when columns are added below

In [49]:
# Lumps together consecutive tokens that come from the same entity parameter into a single string
tokens_df["flag"] = (
    (tokens_df["entity"] != tokens_df["entity"].shift()) |
    (tokens_df.index != tokens_df.index.to_series().shift() + 1))
tokens_df["entity_group"] = tokens_df["flag"].cumsum()
tokens_df.drop(columns="flag", inplace=True)
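
The flag-and-cumsum step starts a new group whenever the entity changes or the row index is not the previous index plus one. The same idea is sketched below on plain tuples, to make the grouping rule easy to test in isolation. The function `merge_runs` and its input layout are assumptions made for the example.

```python
def merge_runs(tokens):
    """Join consecutive tokens that share an entity label.

    `tokens` is a list of (position, entity, text) tuples sorted by position.
    A token extends the current run only if it has the same entity and
    directly follows the previous position.
    """
    merged = []
    prev_pos = prev_entity = None
    for pos, entity, text in tokens:
        if merged and entity == prev_entity and pos == prev_pos + 1:
            merged[-1] = (entity, merged[-1][1] + " " + text)
        else:
            merged.append((entity, text))
        prev_pos, prev_entity = pos, entity
    return merged


print(merge_runs([(0, "DATE", "A"), (1, "DATE", "2050"), (4, "CARD", "30%")]))
# [('DATE', 'A 2050'), ('CARD', '30%')]
```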

In [50]:
tokens_df["mergeable"] = (tokens_df["entity"] != "") & (tokens_df["entity"] != "O")
tokens_df["merge_group"] = tokens_df["entity_group"].where(tokens_df["mergeable"])

In [51]:
merged = (
    tokens_df.groupby(["Full Target", "merge_group", "entity"], dropna=True)
    .agg({"token": " ".join})
    .reset_index()
)
merged = merged.drop(["merge_group"], axis = 1)

In [52]:
merged

Unnamed: 0,Full Target,entity,token
0,CDN Meta 1.1,DATE,A 2050
1,CDN Meta 1.1,CARD,30%
2,CDN Meta 1.10,DATE,Al 2030
3,CDN Meta 1.10,CARD,25%
4,CDN Meta 1.10,CARD,50%
5,CDN Meta 1.10,DATE,Al 2027
6,CDN Meta 1.10,CARD,21%
7,CDN Meta 1.10,CARD,35%
8,CDN Meta 1.11,DATE,Al 2030
9,CDN Meta 1.11,CARD,1700 MW


In [53]:
# Creates a list of time-bound terms per target
dates = (
    merged[merged["entity"] == "DATE"]
    .groupby("Full Target")["token"]
    .apply(lambda x: "; ".join(x))
    .reset_index(name="dates")
)

In [54]:
dates

Unnamed: 0,Full Target,dates
0,CDN Meta 1.1,A 2050
1,CDN Meta 1.10,Al 2030; Al 2027
2,CDN Meta 1.11,Al 2030; Al 2027
3,CDN Meta 1.12,Al 2030; 2015-2050; Al 2027
4,CDN Meta 1.13,Al 2030; 2015; Al 2027; 2015
5,CDN Meta 1.14,Al 2030; Al 2027
6,CDN Meta 1.15,Al 2030; Al 2027
7,CDN Meta 1.16,Al 2030; Al 2027
8,CDN Meta 1.17,Al 2030
9,CDN Meta 1.2,A 2050; al 2030; entre 2022-2050; entre 2022-2030


In [55]:
# Creates a list of quantitative terms per target
quants = (
    merged[merged["entity"] != "DATE"]
    .groupby("Full Target")["token"]
    .apply(lambda x: "; ".join(x))
    .reset_index(name="quants")
)

In [56]:
quants

Unnamed: 0,Full Target,quants
0,CDN Meta 1.1,30%
1,CDN Meta 1.10,25%; 50%; 21%; 35%
2,CDN Meta 1.11,1700 MW; 950 MW
3,CDN Meta 1.12,15%; 11%
4,CDN Meta 1.13,3%; 2%
5,CDN Meta 1.14,5%; 3.5%
6,CDN Meta 1.15,30%; 20%
7,CDN Meta 1.16,20%; 15%
8,CDN Meta 1.2,24%; 11.5%; 60; 10
9,CDN Meta 1.5,21.000


In [171]:
condens = pd.merge(dates, quants, on="Full Target", how="outer")

In [172]:
condens[["dates", "quants"]] = condens[["dates", "quants"]].fillna("")
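
The `dates`, `quants` and `condens` steps amount to splitting the merged rows by entity and joining the texts per target with "; ". A dictionary-based sketch of that logic follows; `condense` and its input layout are hypothetical, and a target with no quantities gets an empty string, mirroring the outer merge followed by `fillna("")`.

```python
from collections import defaultdict


def condense(merged_rows):
    """Collapse (full_target, entity, text) rows into one entry per target."""
    table = defaultdict(lambda: {"dates": [], "quants": []})
    for target, entity, text in merged_rows:
        key = "dates" if entity == "DATE" else "quants"
        table[target][key].append(text)
    return {
        target: {key: "; ".join(values) for key, values in cols.items()}
        for target, cols in table.items()
    }


rows = [
    ("CDN Meta 1.1", "DATE", "A 2050"),
    ("CDN Meta 1.1", "CARD", "30%"),
    ("CDN Meta 1.17", "DATE", "Al 2030"),
]
print(condense(rows)["CDN Meta 1.17"])  # {'dates': 'Al 2030', 'quants': ''}
```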

In [173]:
condens

Unnamed: 0,Full Target,dates,quants
0,CDN Meta 1.1,A 2050,30%
1,CDN Meta 1.10,Al 2030; Al 2027,25%; 50%; 21%; 35%
2,CDN Meta 1.11,Al 2030; Al 2027,1700 MW; 950 MW
3,CDN Meta 1.12,Al 2030; 2015-2050; Al 2027,15%; 11%
4,CDN Meta 1.13,Al 2030; 2015; Al 2027; 2015,3%; 2%
5,CDN Meta 1.14,Al 2030; Al 2027,5%; 3.5%
6,CDN Meta 1.15,Al 2030; Al 2027,30%; 20%
7,CDN Meta 1.16,Al 2030; Al 2027,20%; 15%
8,CDN Meta 1.17,Al 2030,
9,CDN Meta 1.2,A 2050; al 2030; entre 2022-2050; entre 2022-2030,24%; 11.5%; 60; 10


In [None]:
#final[["Doc", "Target Name"]] = condens["Full Target"].str.split(" ", n = 1, expand = True)

# **Saving results**

In [174]:
dta.head()

Unnamed: 0,Target Name,Document,Tipo de Meta,Sector,Full Target
0,Meta 1.1,Segunda Contribución Determinada a Nivel Nacio...,No-GEI,Energía,CDN Meta 1.1
1,Meta 1.2,Segunda Contribución Determinada a Nivel Nacio...,GEI,Energía,CDN Meta 1.2
2,Meta 1.3,Segunda Contribución Determinada a Nivel Nacio...,No-GEI,Energía,CDN Meta 1.3
3,Meta 1.4,Segunda Contribución Determinada a Nivel Nacio...,GEI,Energía,CDN Meta 1.4
4,Meta 1.5,Segunda Contribución Determinada a Nivel Nacio...,No-GEI,Energía,CDN Meta 1.5


In [175]:
dta.drop(["Country", "Target Text", "Source", "Convention", "Doc", "Type"], axis = 1, inplace = True, errors = 'ignore')

In [176]:
final = pd.merge(dta, condens, on = "Full Target", how = "left")

In [177]:
final = final.fillna("")

In [178]:
final

Unnamed: 0,Target Name,Document,Tipo de Meta,Sector,Full Target,dates,quants
0,Meta 1.1,Segunda Contribución Determinada a Nivel Nacio...,No-GEI,Energía,CDN Meta 1.1,A 2050,30%
1,Meta 1.2,Segunda Contribución Determinada a Nivel Nacio...,GEI,Energía,CDN Meta 1.2,A 2050; al 2030; entre 2022-2050; entre 2022-2030,24%; 11.5%; 60; 10
2,Meta 1.3,Segunda Contribución Determinada a Nivel Nacio...,No-GEI,Energía,CDN Meta 1.3,Al 2025,
3,Meta 1.4,Segunda Contribución Determinada a Nivel Nacio...,GEI,Energía,CDN Meta 1.4,al 2026,
4,Meta 1.5,Segunda Contribución Determinada a Nivel Nacio...,No-GEI,Energía,CDN Meta 1.5,Al 2027,21.000
5,Meta 1.6,Segunda Contribución Determinada a Nivel Nacio...,No-GEI,Energía,CDN Meta 1.6,Al 2030; Al 2024; Al 2027; 2023,60%
6,Meta 1.7,Segunda Contribución Determinada a Nivel Nacio...,GEI,Energía,CDN Meta 1.7,Al 2030; Al 2027,10%; 20%; 7%; 18%
7,Meta 1.8,Segunda Contribución Determinada a Nivel Nacio...,GEI,Energía,CDN Meta 1.8,Al 2030; Al 2027,25%; 40%; 15%; 30%
8,Meta 1.9,Segunda Contribución Determinada a Nivel Nacio...,No-GEI,Energía,CDN Meta 1.9,Al 2030; Al 2027,15%; 35%; 14%; 25%
9,Meta 1.10,Segunda Contribución Determinada a Nivel Nacio...,No-GEI,Energía,CDN Meta 1.10,Al 2030; Al 2027,25%; 50%; 21%; 35%


In [179]:
final.drop(["Full Target"], axis = 1, inplace = True, errors = 'ignore')

In [180]:
final.head()

Unnamed: 0,Target Name,Document,Tipo de Meta,Sector,dates,quants
0,Meta 1.1,Segunda Contribución Determinada a Nivel Nacio...,No-GEI,Energía,A 2050,30%
1,Meta 1.2,Segunda Contribución Determinada a Nivel Nacio...,GEI,Energía,A 2050; al 2030; entre 2022-2050; entre 2022-2030,24%; 11.5%; 60; 10
2,Meta 1.3,Segunda Contribución Determinada a Nivel Nacio...,No-GEI,Energía,Al 2025,
3,Meta 1.4,Segunda Contribución Determinada a Nivel Nacio...,GEI,Energía,al 2026,
4,Meta 1.5,Segunda Contribución Determinada a Nivel Nacio...,No-GEI,Energía,Al 2027,21.000


In [181]:
final.to_excel(path+"/"+cty+"_quantitative_"+datetime.today().strftime("%d%b%y").lstrip("0")+".xlsx", sheet_name = "Quantitative Terms", index=False)
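
The output name combines the country with a date stamp such as "11Jun25"; `lstrip("0")` removes the leading zero of single-digit days. A small sketch of the same naming, using `os.path.join` instead of string concatenation, is shown below. The helper `output_path` is hypothetical, and the month abbreviation produced by `%b` depends on the process locale (English by default).

```python
import os
from datetime import date


def output_path(folder, country, day):
    """Build '<folder>/<country>_quantitative_<dMonYY>.xlsx'."""
    stamp = day.strftime("%d%b%y").lstrip("0")  # "05Jun25" -> "5Jun25"
    return os.path.join(folder, f"{country}_quantitative_{stamp}.xlsx")


print(os.path.basename(output_path("data", "Panama", date(2025, 6, 5))))
# Panama_quantitative_5Jun25.xlsx
```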