# Data preparation for semantic sentence similarity 

To retrieve text data, we first need text documents. For this course, I chose the [General debate of the seventy-eighth session of the United Nations General Assembly](https://en.wikipedia.org/wiki/General_debate_of_the_seventy-eighth_session_of_the_United_Nations_General_Assembly), which is freely available online. In the general debate, each country can give a (long) speech.

Unfortunately, there are a few issues with these documents:

* Sometimes, `.` has been used in the wrong places. I tried to correct that.
* The debates are long, so retrieving the correct debate is not sufficient.
* Therefore, we have to work with smaller entities.

In this first notebook, the debates are split into sentences. Normally, you would then combine several sentences into fragments (chunking). But in this case, the sentences are already quite long and self-contained. Therefore, we keep the individual sentences as our entities.
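For comparison, this is roughly what chunking could look like if the sentences were too short to stand on their own. It is only a sketch: `chunk_sentences`, its `size` and `overlap` parameters and the placeholder sentences are made up for illustration and are not used anywhere else in this course.

```python
def chunk_sentences(sentences, size=3, overlap=1):
    """Group consecutive sentences into overlapping fragments (hypothetical helper)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        # Stop once the current window has reached the last sentence
        if start + size >= len(sentences):
            break
    return chunks


# Placeholder sentences, not taken from the debates
example = ["S1.", "S2.", "S3.", "S4.", "S5."]
chunks = chunk_sentences(example, size=3, overlap=1)
print(chunks)
```

The overlap means that a statement at the border of a fragment still appears together with some of its context in the neighbouring fragment.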

Sentence segmentation sounds easy, but we cannot just split the text at every `.`. This would not work for "Mr. X" or similar constructs. Instead, we use [spaCy](https://spacy.io), a tool for linguistic analysis which can also perform sentence splitting.
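To see the problem, here is a small sketch of the naive approach using only the standard library. The example text is made up for illustration and contains three sentences.

```python
import re

# Made-up example text with three sentences
text = "Mr. X opened the debate. The U.N. was founded in 1945. Many speakers followed."

# Naive approach: split at whitespace that follows a period
naive = re.split(r"(?<=\.)\s+", text)

for part in naive:
    print(repr(part))
print(len(naive), "parts instead of 3 sentences")
```

The abbreviations "Mr." and "U.N." are treated as sentence ends, so the naive split produces too many fragments and breaks sentences apart.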

This notebook loads the data, splits the sentences and saves them in `json` format so we can use them in the later notebooks.

## Load data

In [None]:
import glob
import os

In [None]:
data = []
for n in glob.glob("un/TXT/Session 78 - 2023/*.txt"):
    country = os.path.basename(n).replace("_78_2023.txt", "")
    # Use a context manager so each file is closed after reading
    with open(n) as f:
        data.append({"country": country, "text": f.read()})

In [None]:
import pandas as pd

In [None]:
pd.set_option('display.max_colwidth', 500)
df = pd.DataFrame(data)
df

## Sentence segmentation

In [None]:
import spacy

Each language has several models which can be used to analyze text. In this case, we use a small model for English, as we are only interested in sentence segmentation. If we were interested in named entity recognition, a larger model would be more suitable.

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
from tqdm.auto import tqdm

In [None]:
# runs ~ 1min
sentences = []
for text in tqdm(df["text"]):
    doc = nlp(text)
    for sentence in doc.sents:
        sentences.append(str(sentence).strip())

In [None]:
sentences[0:20]

In [None]:
len(sentences)

In [None]:
import json
with open("sentences.json", "w") as f:
    # Write the list of sentences directly to the file as a JSON array
    json.dump(sentences, f)
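The later notebooks can read the sentences back with `json.load`. The following sketch shows the round trip with a tiny made-up list and a temporary directory, so it does not touch the real `sentences.json`. The explicit `encoding="utf-8"` is an assumption on my side, the cell above uses the platform default.

```python
import json
import os
import tempfile

# Made-up sample data, standing in for the real list of sentences
sample = ["First sentence.", "Second sentence with an umlaut: für."]

# Write to a temporary directory so the real sentences.json is not overwritten
path = os.path.join(tempfile.mkdtemp(), "sentences.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

# This is how a later notebook could load the data again
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == sample)
```

As JSON only stores the plain list of strings, the information about which country a sentence belongs to is not part of this file.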