# Data preparation for semantic sentence similarity 

To retrieve text data, we first need text documents. For this course, I chose the [General debate of the seventy-eighth session of the United Nations General Assembly](https://en.wikipedia.org/wiki/General_debate_of_the_seventy-eighth_session_of_the_United_Nations_General_Assembly), which is freely available online. In the general debate, each country can give a (long) speech.

Unfortunately, there are a few issues with these documents:

* Sometimes, `.` has been used in the wrong places. I tried to correct that.
* The debates are long, so retrieving the correct debate is not sufficient.
* Therefore, we have to work with smaller entities.

In this first notebook, the debates are split into sentences. Normally, you would then combine several sentences into fragments (chunking). In this case, however, the sentences are already quite long and self-contained, so we keep the individual sentences as entities.
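For reference, the chunking step that is skipped here could look roughly like the sketch below. It is not part of the course code: the helper name `chunk_sentences` and the parameters `size` and `overlap` are assumptions chosen purely for illustration.

```python
def chunk_sentences(sentences, size=3, overlap=1):
    """Group consecutive sentences into overlapping fragments.

    `size` is the number of sentences per fragment and `overlap` the
    number of sentences shared by neighbouring fragments. Both defaults
    are hypothetical, not values used in this course.
    """
    if size < 1 or not 0 <= overlap < size:
        raise ValueError("need size >= 1 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + size]
        chunks.append(" ".join(window))
        if start + size >= len(sentences):
            break  # this window already reached the last sentence
    return chunks


example = ["A.", "B.", "C.", "D.", "E."]
chunks = chunk_sentences(example, size=3, overlap=1)
print(chunks)
```

The overlap keeps some context across fragment boundaries, at the cost of storing a few sentences twice.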

Sentence segmentation sounds easy, but we cannot simply split the text at every `.`. This would not work for "Mr. X" or similar constructs. Instead, we use [spaCy](https://spacy.io), a tool for linguistic analysis that can also perform sentence splitting.
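To make the problem concrete, the following sketch splits a short text at every `. `. The example text is a shortened paraphrase modelled on the opening of the first speech, not a verbatim quote, and the variable names are only illustrative.

```python
# Two sentences, each containing the abbreviation "Mr."
text = "I would like to congratulate Mr. Korosi. I wish Mr. Francis every success."

# Naive approach: treat every ". " as a sentence boundary.
naive = [part.strip() for part in text.split(". ") if part.strip()]

for part in naive:
    print(repr(part))
```

The naive split yields four fragments instead of two sentences, because each "Mr." is mistaken for a sentence end.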

This notebook loads the data, splits it into sentences and saves them in `json` format so we can use them in the later notebooks.

In [1]:
# Disable progress bars to avoid ipywidgets rendering issues
import os
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
os.environ['HF_HUB_DISABLE_PROGRESS_BARS'] = '1'

## Load data

In [2]:
import glob
import os

In [4]:
data = []
for n in glob.glob("un/TXT/Session 78 - 2023/*.txt"):
    country = os.path.basename(n).replace("_78_2023.txt", "")
    with open(n) as f:  # close the file handle after reading
        data.append({"country": country, "text": f.read()})
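The country code in the loop above is derived from the file name. The sketch below isolates that step. The helper `country_from_path` is hypothetical, and the example path assumes files named like `TUR_78_2023.txt`, which is inferred from the suffix being stripped and from the country codes in the resulting table.

```python
import os


def country_from_path(path, suffix="_78_2023.txt"):
    """Return the country code encoded in a speech file name."""
    name = os.path.basename(path)
    # Strip the session/year suffix only if it is actually present.
    return name[: -len(suffix)] if name.endswith(suffix) else name


# Assumed example path, see the note above.
code = country_from_path("un/TXT/Session 78 - 2023/TUR_78_2023.txt")
print(code)
```

Unlike a plain `replace`, this variant only removes the suffix at the end of the name, so an unexpected file name is returned unchanged rather than silently altered.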

In [5]:
import pandas as pd

In [6]:
pd.set_option('display.max_colwidth', 500)
df = pd.DataFrame(data)
df

Unnamed: 0,country,text
0,TUR,"On behalf of myself and the Turkish nation. I would like to salute the members of the General Assembly with my most heartfelt regards. I would like to congratulate Mr. Korosi, who successfully completed his term as President of the General Assembly at its seventy seventh session, and to wish Mr. Francis, who is succeeding him, every success.\nI hope that the seventy-eighth session of the General Assembly, convened in a spirit of trust and solidarity, will be a blessing for the entire human r..."
1,TGO,"On the occasion of the seventy-eighth session of the General Assembly of our common institution, on behalf of my country. Togo, and His Excellency President Faure Essozimna Gnassingbe, allow me first of all to offer my warm congratulations to Mr. Dennis Francis of Trinidad and Tobago on his election and his skill in conducting the work of this session.\nMy warm congratulations also go not only to his predecessor. Mr. Csaba Korosi, who presided over our work last year, but also, and above all..."
2,SWE,"Looking back, we — the international community — did not acknowledge the signs for what they were. The war in Georgia in 2008 and the aggression in Ukraine since 2014 and in Syria since 2015 clearly show that Russia has no scruples about using military force to reach its political ambitions, recreate its former colonial empire and undermine the European security order and the Charter of the United Nations. We open this year’s session of the General Assembly at a time when a permanent member ..."
3,BLZ,"It is with immense pride that Belize offers its heartiest congratulations to Mr. Dennis Francis as a Caribbean Community (CARICOM) national unanimously elected to preside over the General Assembly at its seventy-eighth session.\nThe theme for our general debate. “Rebuilding trust and reigniting global solidarity: accelerating action on the 2030 Agenda and its Sustainable Development Goals towards peace, prosperity, progress and sustainability for all”, is timely.\nWhen we act in good faith, ..."
4,CUB,"I am bringing to this Assembly the voice of the South, the voice of the “exploited and scorned” — the words of Ernesto Che Guevara in this same Hall almost 60 years ago (see A/PV.1299).\nWe are diverse peoples with the same problems. We just confirmed that recently in Havana, which was honoured to host a summit of leaders and other high representatives of the Group of 77 (G77) and China, the most representative, broad and diverse group in the multilateral arena.\nDuring those two virtually n..."
...,...,...
187,ARM,"First of all, let me congratulate Mr. Dennis Francis on assuming the presidency of the General Assembly at its seventy-eighth session.\nI will not be the first and definitely not the last speaker in this body who will identify global threats to democracies, challenges to security and violations of the principles and purposes of the Charter of the United Nations, including the non-use of force and the peaceful resolution of conflicts, as a main source of instability and tension in the world.\..."
188,SDN,"On behalf of the people and the Government of the Sudan. I would like to congratulate the President on assuming the presidency of General Assembly at its seventy-eighth session. I also thank the President of the Assembly at \nits seventy-seventh session and the Secretary-General for their efforts in confronting the challenges that have faced the world over the past year.\nSince 15 April, the Sudanese people have been facing a devastating war launched by the rebel Rapid Support Forces, which ..."
189,LCA,"Let me join in the congratulations to Mr. Dennis Francis on his election as President of the General Assembly. This is the first time that a national of his country. Trinidad and Tobago, has assumed that office and only the fourth occasion that a representative of a Caribbean Community (CARICOM) State has been so elected. Let me therefore not only wish him success as he presides over our deliberations, but also assure him of the fullest levels of respectful cooperation from Saint Lucia as we..."
190,THA,"On behalf of the delegation of the Kingdom of Thailand, allow me to congratulate Mr. Dennis France on his election to preside over the General Assembly at its seventy-eighth session.\nThailand has marked a new chapter in our democracy. I assumed office only a few days ago, with the mandate of the people, to strengthen democratic institutions and values in Thailand and to uplift the well-being of the Thai people, who have been through difficult times over the past several years. In our foreig..."


## Sentence segmentation

In [10]:
import spacy

spaCy offers several models per language which can be used to analyze text. In this case, we use a small model for English, as we are only interested in sentence segmentation. If we were interested in named entity recognition, a larger model would be more suitable.

In [11]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [12]:
nlp = spacy.load("en_core_web_sm")

In [13]:
from tqdm.auto import tqdm

In [14]:
# runs ~ 1min
sentences = []
for text in tqdm(df["text"]):
    doc = nlp(text)
    for sentence in doc.sents:
        sentences.append(str(sentence).strip())

  0%|          | 0/192 [00:00<?, ?it/s]

In [15]:
sentences[0:20]

['On behalf of myself and the Turkish nation.',
 'I would like to salute the members of the General Assembly with my most heartfelt regards.',
 'I would like to congratulate Mr. Korosi, who successfully completed his term as President of the General Assembly at its seventy seventh session, and to wish Mr. Francis, who is succeeding him, every success.',
 'I hope that the seventy-eighth session of the General Assembly, convened in a spirit of trust and solidarity, will be a blessing for the entire human race.',
 'Unfortunately, it is not possible to draw a more optimistic picture of the future of our world than the assessments made from this rostrum last year (see A/77/ PV.4).',
 'The picture before us shows that we are facing increasingly complex and dangerous challenges on a global scale.',
 'There are conflicts, wars, humanitarian crises, political strife and social tensions to the south, north, east and west of my country.',
 'Those growing challenges, compounded by global economic 

In [16]:
len(sentences)

18342

In [17]:
import json
with open("sentences.json", "w") as f:
    json.dump(sentences, f)
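A later notebook can read the list back with `json.load`. The sketch below shows the round trip on two of the sentences from the output above. It writes to a temporary directory (an assumption made so that the example does not overwrite the real `sentences.json`), and the explicit `encoding` and `ensure_ascii` arguments are illustrative choices rather than what the cell above uses.

```python
import json
import os
import tempfile

# Two sentences taken from the segmentation output.
sample = [
    "On behalf of myself and the Turkish nation.",
    "I would like to salute the members of the General Assembly with my most heartfelt regards.",
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sentences.json")

    # Save: a JSON array of strings.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)

    # Load: this is what a later notebook would do.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(len(loaded), "sentences loaded")
```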