# Data preparation for semantic sentence similarity 

To retrieve text, we first need a collection of text documents. For this course, I chose the [General debate of the seventy-eighth session of the United Nations General Assembly](https://en.wikipedia.org/wiki/General_debate_of_the_seventy-eighth_session_of_the_United_Nations_General_Assembly), which is freely available online. In the general debate, each country can give a (long) speech.

Unfortunately, there are a few issues with these documents:

* Sometimes, `.` has been used in the wrong places. I tried to correct that.
* The debates are long, so retrieving the correct debate is not sufficient.
* Therefore, we have to work with smaller entities.

In this first notebook, the debates are split into sentences. Normally, you would then combine several sentences into fragments (chunking). In this case, however, the sentences are already quite long and self-contained, so we keep the individual sentences as our entities.
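
Although we do not chunk in this notebook, the idea of combining several sentences into fragments can be sketched as follows. The helper `chunk_sentences` and its parameters `size` and `overlap` are hypothetical names chosen for illustration; they are not part of this course's code.

```python
def chunk_sentences(sentences, size=3, overlap=1):
    """Group consecutive sentences into fragments of `size` sentences.

    Neighbouring fragments share `overlap` sentences so that context
    is not lost at the fragment boundaries. (Hypothetical helper.)
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        # Stop once the current fragment reaches the end of the list
        if start + size >= len(sentences):
            break
    return chunks


example = ["A.", "B.", "C.", "D.", "E."]
chunks = chunk_sentences(example, size=3, overlap=1)
print(chunks)
```

With short sentences, a larger `size` gives each fragment more context. With long sentences, as in these debates, a fragment of one sentence is often enough.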

Sentence segmentation sounds easy, but we cannot just split the text at every `.`. This would not work for "Mr. X" or similar constructs. Instead, we use [spaCy](https://spacy.io), a tool for linguistic analysis that can also perform sentence splitting.
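
To see the problem, here is a small sketch of the naive approach. The sample text is adapted from one of the speeches, and the variable names are only for illustration.

```python
# Naive approach: split at every "." and drop empty parts
text = ("I would like to thank Mr. Csaba Korosi for his stewardship. "
        "I pay tribute to the Secretary-General.")

naive = [part.strip() for part in text.split(".") if part.strip()]
for part in naive:
    print(repr(part))
```

The text contains two sentences, but the naive split returns three parts, because the abbreviation "Mr." is treated as the end of a sentence.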

This notebook loads the data, splits it into sentences and saves them in JSON format so that we can use them in the later notebooks.

In [1]:
# Silence advisory warnings and disable progress bars to avoid ipywidgets rendering issues
import os
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
os.environ['HF_HUB_DISABLE_PROGRESS_BARS'] = '1'

## Load data

In [1]:
import glob
import os

In [2]:
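data = []
for n in glob.glob("un/TXT/Session 78 - 2023/*.txt"):
    # The country code is the file name without the "_78_2023.txt" suffix
    country = os.path.basename(n).replace("_78_2023.txt", "")
    with open(n) as f:  # the context manager closes the file again
        data.append({"country": country, "text": f.read()})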

In [3]:
import pandas as pd

In [4]:
pd.set_option('display.max_colwidth', 500)
df = pd.DataFrame(data)
df

Unnamed: 0,country,text
0,UGA,"I congratulate the President on his election as President of the General Assembly at its seventy-eighth session and assure him of Uganda’s full support. I would like to thank Mr. Csaba Korosi for his stewardship of the seventy-seventh session. I pay tribute to the Secretary-General. Mr. Antonio Guterres, for his leadership and commitment to the work of the United Nations.\nAs we mark 78 years of the existence of the United Nations, we yearn for a revitalized Organization that is capable of a..."
1,QAT,"At the outset. I congratulate His Excellency Mr. Dennis Francis on assuming the presidency of the General Assembly at its seventy-eighth session. I wish him every success. I also express my appreciation to His Excellency Mr. Csaba Korosi for his efforts in presiding over the General Assembly at its seventy-seventh session. I commend the efforts of the Secretary-General. His Excellency Mr. Antonio Guterres, and the staff of the United Nations for fulfilling its noble goals.\nI would like firs..."
2,ISR,"Over three millenniums ago, our great leader Moses addressed the people of Israel as they were about to enter the Promised Land. He said that they would find there two mountains facing one another: Mount Gerizim, the site on which a great blessing would be proclaimed, and Mount Ebal, the site of a great curse. Moses said that the people’s fate would be determined by the choice they made between the blessing and the curse. \nThat same choice has echoed down the ages not just for the people of..."
3,IRN,"I congratulate President Francis on the occasion of the opening of the seventy-eighth session of the General Assembly.\nSince last year, when I addressed everyone from this rostrum (see A/77/PV.6), the world has witnessed bitter as well as sweet events, but nearly eight decades following the establishment of the United Nations, the new session of the General Assembly is beginning as the world is experiencing unprecedented and historic changes. Meanwhile, the assurance of a luminous future fo..."
4,AGO,"It is with great pleasure that I take the floor at the General Assembly, at a time when the world faces a very complex situation that requires our Organization to strengthen its role and its ability to formulate the most appropriate responses in order to be able to tackle the serious challenges facing the world.\nI would like to wish Mr. Dennis Francis all the best during his term of office, starting now, as the President of the General Assembly at its seventy-eighth session. I would also li..."
...,...,...
187,HND,"Today marks one year since I last appeared before the General Assembly (see A/77/PV.5) as the first female President of the Republic of Honduras, an event that emerged from the resistance in the streets and the fight against the coup d’etat that defeated the democratically elected president. Jose Manuel Zelaya Rosales. My Government’s progress and results have already been recognized by the international community and financial organizations: greater economic growth, public finances salvaged..."
188,BOL,"Brother Vice-President of the General Assembly. Diego Pary Rodriguez, it is a source of joy and pride for Bolivia to see you lead the General Assembly of the most important multilateral Organization created by humankind, and we are sure that, together with President Dennis Francis and his leadership, you will elevate the names of the countries of our Latin American and Caribbean region.\nA year ago in this forum (see A/77/PV.5), we denounced the fact that the world was facing a capitalist cr..."
189,ARM,"First of all, let me congratulate Mr. Dennis Francis on assuming the presidency of the General Assembly at its seventy-eighth session.\nI will not be the first and definitely not the last speaker in this body who will identify global threats to democracies, challenges to security and violations of the principles and purposes of the Charter of the United Nations, including the non-use of force and the peaceful resolution of conflicts, as a main source of instability and tension in the world.\..."
190,BHR,"It is my pleasure to congratulate the President of the General Assembly at its current session. I wish him every success in conducting its work. I also want to express my thanks and appreciation to his predecessor. His Excellency Mr. Csaba Korosi, for his remarkable efforts in leading the previous session, and His Excellency Secretary-General Antonio Guterres for his continuous efforts to achieve the noble purposes and objectives of the United Nations.\nAt the outset. I wish to express the c..."


## Sentence segmentation

In [10]:
import spacy

Each language has several models which can be used to analyze text. In this case, we use a small model for English, as we are only interested in sentence segmentation. If we were interested in named entity recognition, a larger model would be more suitable.

In [11]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [12]:
nlp = spacy.load("en_core_web_sm")

In [13]:
from tqdm.auto import tqdm

In [14]:
# Takes about one minute
sentences = []
for text in tqdm(df["text"]):
    doc = nlp(text)
    for sentence in doc.sents:
        sentences.append(str(sentence).strip())

  0%|          | 0/192 [00:00<?, ?it/s]

In [15]:
sentences[0:20]

['On behalf of myself and the Turkish nation.',
 'I would like to salute the members of the General Assembly with my most heartfelt regards.',
 'I would like to congratulate Mr. Korosi, who successfully completed his term as President of the General Assembly at its seventy seventh session, and to wish Mr. Francis, who is succeeding him, every success.',
 'I hope that the seventy-eighth session of the General Assembly, convened in a spirit of trust and solidarity, will be a blessing for the entire human race.',
 'Unfortunately, it is not possible to draw a more optimistic picture of the future of our world than the assessments made from this rostrum last year (see A/77/ PV.4).',
 'The picture before us shows that we are facing increasingly complex and dangerous challenges on a global scale.',
 'There are conflicts, wars, humanitarian crises, political strife and social tensions to the south, north, east and west of my country.',
 'Those growing challenges, compounded by global economic 

In [16]:
len(sentences)

18342

In [17]:
import json
with open("sentences.json", "w") as f:
    json.dump(sentences, f)
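
The later notebooks can read the sentences back with `json.load`. The following round trip is a minimal sketch: it uses a small made-up list and a temporary file (`sentences_demo.json` is a hypothetical name), so that it does not touch the real `sentences.json`.

```python
import json
import os
import tempfile

# Made-up stand-in for the real list of sentences
sample = ["First sentence.", "Second sentence about Uganda’s support."]

# Write to a temporary location instead of the real sentences.json
path = os.path.join(tempfile.mkdtemp(), "sentences_demo.json")
with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps characters such as ’ readable in the file
    json.dump(sample, f, ensure_ascii=False)

# This is how a later notebook could load the list again
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(len(loaded), "sentences loaded")
```

The explicit `encoding="utf-8"` is an assumption made for this sketch; it makes the result independent of the platform's default encoding.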