# Synonym expansion

Explorer input spreadsheets encode many synonymous or near-synonymous phrases: the rules for a concept list alternative wordings under a shared span ID.

This notebook investigates:

* whether we can extract groups of synonyms or near-synonyms

Further work is to:

* identify how appropriate they are for uses such as data augmentation, search query expansion and measuring embedding quality (vector similarity between synonymous phrases)
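
As a rough illustration of the embedding-quality idea, one option is the mean pairwise cosine similarity between the phrases in a synonym set. The sketch below is only that: the two-dimensional vectors are made up for the example, and `mean_pairwise_similarity` is a hypothetical helper, not something this notebook or Explorer provides. A real check would use vectors from whichever embedding model is being evaluated.

```python
import itertools
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(vectors: list[list[float]]) -> float:
    """Average cosine similarity over every unordered pair of vectors."""
    pairs = list(itertools.combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors to compare")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Made-up embeddings for one synonym set (assumption: illustration only).
toy_embeddings = {
    "bushfire": [1.0, 0.0],
    "bush fire": [1.0, 0.0],
    "brush fire": [0.0, 1.0],
}

score = mean_pairwise_similarity(list(toy_embeddings.values()))
print(f"mean pairwise cosine similarity: {score:.3f}")
```

A score close to 1 would suggest the model places the synonyms close together; how low is "too low" would need to be calibrated against unrelated phrase pairs.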

### Notes

- Getting all word forms from a lemma is difficult. It isn't possible in spaCy, and the [lemminflect](https://github.com/bjascob/LemmInflect) library, which was designed to overcome this, needs a POS tag, which we don't have in the Explorer inputs.

In [2]:
import sys

!{sys.executable} -m pip install git+https://github.com/climatepolicyradar/explorer@4c67a26f8f4ee861a38ecbb877b9723c6c0e60aa

Collecting git+https://github.com/climatepolicyradar/explorer
  Cloning https://github.com/climatepolicyradar/explorer to /private/var/folders/nt/2c78pgv94312v7_mmz24h6kc0000gn/T/pip-req-build-pp_q_voi
  Running command git clone --filter=blob:none --quiet https://github.com/climatepolicyradar/explorer /private/var/folders/nt/2c78pgv94312v7_mmz24h6kc0000gn/T/pip-req-build-pp_q_voi
  Resolved https://github.com/climatepolicyradar/explorer to commit 4c67a26f8f4ee861a38ecbb877b9723c6c0e60aa
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting en-core-web-trf@ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.5.0/en_core_web_trf-3.5.0.tar.gz
  Using cached en_core_web_trf-3.5.0-py3-none-any.whl
Collecting en-core-web-sm@ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.w

In [27]:
from pathlib import Path
import itertools
from collections import defaultdict
import json

from tqdm.auto import tqdm
import spacy

from explorer.main import load_input_spreadsheet

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
spreadsheet_dir = Path("../../../concepts/")

In [4]:
patterns_by_concept = dict()

for concept_dir in tqdm(list(spreadsheet_dir.iterdir())):
    if not concept_dir.is_dir():
        continue

    if not (concept_dir / "input.xlsx").exists():
        print(
            f"Skipping {concept_dir} as it doesn't contain a recognisable input.xlsx file"
        )
        continue

    patterns, _, _ = load_input_spreadsheet(concept_dir / "input.xlsx")

    patterns_by_concept[concept_dir.stem] = sorted(
        patterns, key=lambda i: i.get("id", "")
    )

patterns_by_concept.keys()

Skipping ../../../concepts/sectors as it doesn't contain a recognisable input.xlsx file
Skipping ../../../concepts/policy-instruments as it doesn't contain a recognisable input.xlsx file

dict_keys(['loss-and-damage', 'deforestation', 'equity-and-just-transition', 'technologies-br-adaptation-br', 'barriers-and-challenges', 'response-measures', 'international-cooperation', 'greenhouse-gases', 'technologies-br-mitigation-br', 'climate-related-hazards', 'good-practice-and-opportunities', 'adaptation', 'mitigation', 'capacity-building', 'financial-flows', 'renewables', 'fossil-fuels', 'vulnerable-groups'])
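
The patterns for each concept are sorted by `id` when they are loaded because `itertools.groupby`, which is used on them later, only groups consecutive items. The sketch below shows the difference on toy records; the record shape (an `id` plus a spaCy-style `pattern`) is an assumption based on how the patterns are accessed in this notebook, and `count_rules_per_id` is a hypothetical helper.

```python
import itertools

# Hypothetical pattern records (assumption: an "id" plus a spaCy-style "pattern").
toy_patterns = [
    {"id": "Bushfires", "pattern": [{"LOWER": "bushfire"}]},
    {"id": "Brush fires", "pattern": [{"LOWER": "brushfire"}]},
    {"id": "Bushfires", "pattern": [{"LOWER": "bush"}, {"LOWER": "fire"}]},
]


def count_rules_per_id(patterns: list[dict]) -> dict[str, int]:
    """Count the rules groupby yields for each id, in the order given."""
    counts: dict[str, int] = {}
    for span_id, rules in itertools.groupby(patterns, key=lambda p: p.get("id", "")):
        # A later run with the same id overwrites the earlier count,
        # which is the failure mode that sorting first avoids.
        counts[span_id] = len(list(rules))
    return counts


unsorted_counts = count_rules_per_id(toy_patterns)
sorted_counts = count_rules_per_id(
    sorted(toy_patterns, key=lambda p: p.get("id", ""))
)

print(unsorted_counts)  # "Bushfires" is split into two runs of one rule each
print(sorted_counts)  # "Bushfires" is a single run of two rules
```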

In [33]:
concept_name = "climate-related-hazards"


def get_synonym_set(concept_name: str) -> dict[str, list[tuple]]:
    """
    Build a synonym set for a concept by turning its spaCy rules into phrases.

    Rules are grouped by span ID. Only span IDs with more than one rule are kept; each rule is then expanded into every combination of its token alternatives (such as the values of a LEMMA IN list).

    Returns a dictionary of {span_id: [("synonym", "one"), ("synonym", "two")]}. Values are already tokenised according to the Explorer input.
    """

    synonyms = defaultdict(list)

    for span_id, rules in itertools.groupby(
        patterns_by_concept[concept_name], lambda i: i.get("id", "")
    ):
        rule_list = list(rules)

        if len(rule_list) > 1:
            patterns = [p["pattern"] for p in rule_list]

            for pattern in patterns:
                tokens = []

                for token in pattern:
                    token_val = list(token.values())[0]

                    if isinstance(token_val, str):
                        tokens.append([token_val])
                    elif isinstance(token_val, dict):
                        token_vals_list = list(token_val.values())[0]
                        tokens.append(list(set([i.lower() for i in token_vals_list])))

                    else:
                        print(f"could not process: {token}")

                synonyms[span_id] += list(itertools.product(*tokens))

    return synonyms


synonyms = get_synonym_set(concept_name)

# print example list of synonyms
for idx, (span_id, syn_set) in enumerate(synonyms.items()):
    if idx > 4:
        break

    print(span_id)
    for syn in syn_set:
        print(" - " + " ".join(syn))
    print()

Biodiversity loss
 - biodiversity,species destruction
 - biodiversity,species extinction
 - biodiversity,species damage
 - biodiversity,species loss
 - extinction of biodiversity,species
 - extinction to biodiversity,species
 - damage of biodiversity,species
 - damage to biodiversity,species
 - loss of biodiversity,species
 - loss to biodiversity,species
 - biological diversity destruction
 - biological diversity damage
 - biological diversity loss
 - destruction of biological diversity
 - destruction to biological diversity
 - damage of biological diversity
 - damage to biological diversity
 - loss of biological diversity
 - loss to biological diversity

Bridge failure
 - bridge collapse
 - bridge failure
 - collapse of bridge
 - failure of bridge

Brush fires
 - brushfire
 - brush fire

Building collapse
 - structural collapse
 - structural failure
 - building collapse
 - building failure
 - collapse of building

Bushfires
 - bushfire
 - bush fire



In [32]:
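export_dir = Path("./synonyms/")
# Create the output directory if needed, so write_text doesn't fail on a fresh checkout
export_dir.mkdir(parents=True, exist_ok=True)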

for concept_name in patterns_by_concept:
    synonyms = get_synonym_set(concept_name)

    if len(synonyms) == 0:
        print(f"Skipping {concept_name} as no synonyms found")
        continue

    (export_dir / f"{concept_name}.json").write_text(json.dumps(synonyms, indent=4))

Skipping barriers-and-challenges as no synonyms found
Skipping response-measures as no synonyms found
Skipping good-practice-and-opportunities as no synonyms found
Skipping adaptation as no synonyms found
Skipping mitigation as no synonyms found
Skipping financial-flows as no synonyms found
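
As a hedged sketch of how the exported files might feed search query expansion, the example below writes a toy file in the same shape as the export (`{span_id: [[token, ...], ...]}`, since JSON stores the tuples as lists) and builds a phrase-to-synonyms lookup from it. The file name, the toy data and the helpers `load_phrase_index` and `expand_query` are all hypothetical, and the exact-match lookup is a simplification.

```python
import json
import tempfile
from pathlib import Path

# Toy data in the exported shape (assumption: illustration only).
toy_synonyms = {
    "Bushfires": [["bushfire"], ["bush", "fire"]],
    "Bridge failure": [["bridge", "collapse"], ["bridge", "failure"]],
}


def load_phrase_index(path: Path) -> dict[str, list[str]]:
    """Map each phrase to the other phrases listed under the same span ID."""
    synonym_sets = json.loads(path.read_text())
    index: dict[str, list[str]] = {}
    for token_lists in synonym_sets.values():
        phrases = [" ".join(tokens) for tokens in token_lists]
        for phrase in phrases:
            index[phrase] = [other for other in phrases if other != phrase]
    return index


def expand_query(query: str, index: dict[str, list[str]]) -> list[str]:
    """Return the query followed by any synonyms found for it."""
    return [query, *index.get(query.lower(), [])]


with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy-concept.json"
    toy_path.write_text(json.dumps(toy_synonyms, indent=4))
    phrase_index = load_phrase_index(toy_path)

expanded = expand_query("bush fire", phrase_index)
print(expanded)
```

If the same phrase appears under more than one span ID, the later entry overwrites the earlier one here; a fuller version would need to merge them.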
