# Load data

Sentences extracted from the 2022 UN General Debate.

In [None]:
!test -f sentences.txt || wget https://github.com/datanizing/m3-llm-workshop/raw/main/sentences.txt

In [1]:
with open("sentences.txt", encoding="utf-8") as f:
    sentences = f.read().split("@@@")

In [2]:
len(sentences)

17250

# Encode sentences

In [None]:
!pip install sentence-transformers

In [3]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('intfloat/e5-base-v2')
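
E5 models are trained with role prefixes: corpus texts are encoded as `passage: ...` and search questions as `query: ...`, which is why these strings are prepended in the cells that use this model. A minimal sketch of keeping that convention in one place; the helper names `as_passages` and `as_query` are hypothetical and not part of sentence-transformers:

```python
# Hypothetical helpers (not part of sentence-transformers) that apply
# the E5 role prefixes consistently.
PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "

def as_passages(texts):
    """Prefix every corpus text before encoding it with an E5 model."""
    return [PASSAGE_PREFIX + t for t in texts]

def as_query(question):
    """Prefix a search question, unless it already carries the prefix."""
    if question.startswith(QUERY_PREFIX):
        return question
    return QUERY_PREFIX + question

print(as_passages(["The effects of climate change are worsening."]))
print(as_query("Is the climate crisis worse for poorer countries?"))
```

The second model in this notebook, `all-mpnet-base-v2`, is called with plain text and no prefix.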

In [4]:
# can take a minute or two depending on CPU/GPU configuration
sembeddings = model.encode(["passage: " + s for s in sentences], 
                           show_progress_bar=True)

In [5]:
import numpy as np
with open("sentences-e5.npy", "wb") as f:
    np.save(f, sembeddings)

In [6]:
sembeddings.shape

(17250, 768)
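
Saving the array means the encoding step only has to run once. A minimal sketch of reloading it in a later session and checking that it still lines up with the sentence list; the function name `load_embeddings` and the small demo array are illustrative assumptions, not part of the notebook:

```python
import os
import tempfile

import numpy as np

def load_embeddings(path, expected_rows=None):
    """Load a 2-D embedding matrix saved with np.save and sanity-check it."""
    emb = np.load(path)
    if emb.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {emb.shape}")
    if expected_rows is not None and emb.shape[0] != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {emb.shape[0]}")
    return emb

# Demo with a small stand-in array instead of the real embeddings
demo = np.random.default_rng(0).normal(size=(5, 8)).astype(np.float32)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.npy")
    np.save(demo_path, demo)
    reloaded = load_embeddings(demo_path, expected_rows=5)

print(reloaded.shape)
```

In this notebook the call would presumably be `load_embeddings("sentences-e5.npy", expected_rows=len(sentences))`.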

In [7]:
model2 = SentenceTransformer('all-mpnet-base-v2')

In [8]:
# can take a minute or two depending on CPU/GPU configuration
sembeddings2 = model2.encode(sentences, show_progress_bar=True)

In [9]:
import numpy as np
with open("sentences-mpnet.npy", "wb") as f:
    np.save(f, sembeddings2)

# Retrieval

In [10]:
def search(query, text, corpus_embeddings, model, top=20):
    # Encode the query with the same model that produced corpus_embeddings
    question_embedding = model.encode(query)

    # Dot product equals cosine similarity here (assumes unit-normalized vectors)
    sim = np.dot(corpus_embeddings, question_embedding)

    # Indices of the `top` highest scores, best first
    hits = [{"text": text[i], "score": sim[i]}
            for i in sim.argsort()[::-1][:top]]

    # Return the hits as a DataFrame (pandas is imported in the next cell)
    return pd.DataFrame(hits)
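
The dot product in `search` is only a cosine similarity if every vector has unit length. A minimal sketch for checking that assumption, and for normalizing an array when it does not hold; `is_normalized` and `normalize_rows` are hypothetical helper names, and the demo vectors are made up:

```python
import numpy as np

def is_normalized(embeddings, atol=1e-3):
    """True if every row has (approximately) unit L2 norm."""
    norms = np.linalg.norm(embeddings, axis=1)
    return bool(np.allclose(norms, 1.0, atol=atol))

def normalize_rows(embeddings):
    """Scale every row to unit L2 norm (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.where(norms == 0, 1.0, norms)

# Made-up vectors standing in for real embeddings
raw = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
unit = normalize_rows(raw)
query = unit[0]

# For unit vectors the dot product and the cosine similarity coincide
dot = unit @ query
cosine = (raw @ raw[0]) / (np.linalg.norm(raw, axis=1) * np.linalg.norm(raw[0]))

print(is_normalized(raw), is_normalized(unit))
print(np.allclose(dot, cosine))
```

In this notebook, `is_normalized(sembeddings)` and `is_normalized(sembeddings2)` would check the assumption for both models. If a model does not normalize its output, passing `normalize_embeddings=True` to `model.encode` is an alternative to normalizing afterwards.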

In [11]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)

In [12]:
search("query: Is the climate crisis worse for poorer countries?", 
       sentences, sembeddings, model)

index,text,score
0,"A triple crisis concerning energy, food security and finance is weighing heavily on vulnerable countries, countries that are already suffering the most from the climate crisis and the coronavirus disease pandemic.",0.851163
1,"The negative impacts of climate change are disproportionately affecting the most vulnerable and marginalized communities around the world, and doing so at a faster pace.",0.850445
2,"For example, my country, Cabo Verde, in the past 15 years — between 2007 and 2022 — has suffered the economic and social impact of multiple crises: the economic and financial crisis of 2007-2008, at the very moment when we graduated from the list of least developed countries; the coronavirus disease pandemic, which caused a recession of 14.6 per cent in 2020; the ongoing inflationary impact of world events; and, in the last five years, one of the most profound and most serious droughts in the recent history of the country.",0.841351
3,"Meanwhile, the threats have been adding up: economic recovery from the coronavirus disease pandemic has slowed; the climate crisis is worsening, with extreme weather events, biodiversity loss and collapsing ecosystems; poverty and hunger are on the rise; and there is definitely a humanitarian crisis.",0.841316
4,"The risks of further inequality are also real, especially for our developing countries, whose capacity to adapt to and mitigate the effects of climate change is inadequate, sadly, in spite of our insignificant carbon footprint.",0.840212
5,"How much more scorched Earth, how many millions more climate refugees, how many flood victims will it take to convince us that ignoring our commitments is no longer an option?",0.839927
6,That is why the richest countries must strengthen their financial and technological solidarity with the poorest countries on climate issues.,0.839283
7,"Climate change reduces opportunity and prosperity, which in Africa, Latin America and some parts of Asia also contributes to transnational organized crime.",0.838881
8,The effects of climate change are worsening.,0.837806
9,"Climate change destroys our ecosystems, results in land degradation and contributes to the decline of agricultural productivity, which is the mainstay of small economies.",0.837374


In [13]:
search("query: Is the war on Ukraine caused by Russia?", 
       sentences, sembeddings, model)

index,text,score
0,"Russia’s unprovoked war against another sovereign State, Ukraine, has shaken the world to the core, put to the test the fundamental principles of the United Nations, shattered global security and triggered a European energy crisis, global food shortages and an economic downturn.",0.866839
1,"The war, provoked by Russian aggression, is a war in which Russia is not limiting itself to fighting the Ukrainian army.",0.863874
2,"The war in Ukraine has not only unleashed death and horrendous destruction, but has plunged the world into an economic crisis of runaway inflation and shortages of food and energy supplies and worsened a global supply chain crisis that had been triggered by the COVID-19 pandemic.",0.861158
3,Russia bears sole responsibility for the war and its consequences — and Russia is responsible for bringing it to an end.,0.85924
4,"This war, started by Russia in Ukraine, like all other conflicts going on in the world today, must be lost by the aggressor.",0.859155
5,"The Russian aggression against Ukraine has caused food insecurity and an energy crisis, which have devastating socioeconomic impacts on countries worldwide.",0.85721
6,Some argue that the war is between Russia and Ukraine.,0.856763
7,"It is the war, the aggression in Ukraine that is responsible for the problems that they are experiencing.",0.856398
8,"Russia’s aggression against Ukraine has caused the greatest refugee crisis in Europe since the Second World War, aggravating the already serious global refugee situation.",0.855043
9,"In late February 2022, open hostilities between Russia and Ukraine erupted, exacerbating the economic turmoil that was already brewing globally.",0.85386


In [14]:
search("Is the climate crisis worse for poorer countries?", 
       sentences, sembeddings2, model2)

index,text,score
0,"The risks of further inequality are also real, especially for our developing countries, whose capacity to adapt to and mitigate the effects of climate change is inadequate, sadly, in spite of our insignificant carbon footprint.",0.663719
1,No country is immune to the climate crisis.,0.655583
2,Let us consider the climate crisis.,0.653818
3,"Although climate change and the resulting extreme weather conditions occur throughout the globe, the crisis largely affects the minimally resilient and those least responsible for causing the problem.",0.64611
4,"While Africa is the region least responsible for the climate crisis, it finds itself at the epicentre of its worst impacts.",0.645133
5,"The turmoil and insecurity in many parts of the world require urgent attention, and so does the need to tackle the problems posed by climate change.",0.6389
6,The climate crisis is creating an increasingly uncertain future for people in most parts of the world.,0.634599
7,That is why the richest countries must strengthen their financial and technological solidarity with the poorest countries on climate issues.,0.633547
8,The effects of climate change are worsening.,0.633082
9,"Finally, as many Member States are clearly experiencing, the climate crisis has a particularly strong impact on our Latin American continent, and especially the Caribbean, as well as the livelihoods of our people.",0.62649


In [None]:
search("Is the war on Ukraine caused by Russia?", 
       sentences, sembeddings2, model2)
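
The two models can be compared more systematically than by reading the result tables side by side. In the climate results shown above, three of the ten displayed sentences appear in both lists. A minimal sketch of computing such an overlap; `shared_hits` is a hypothetical helper, and the short lists are a hand-picked subset of the results above, not the full output:

```python
def shared_hits(texts_a, texts_b):
    """Return the texts that occur in both result lists, in the order of texts_a."""
    in_b = set(texts_b)
    return [t for t in texts_a if t in in_b]

# Hand-picked subset of the climate results above (not the full output)
e5_hits = [
    "That is why the richest countries must strengthen their financial and "
    "technological solidarity with the poorest countries on climate issues.",
    "Climate change reduces opportunity and prosperity, which in Africa, "
    "Latin America and some parts of Asia also contributes to transnational "
    "organized crime.",
    "The effects of climate change are worsening.",
]
mpnet_hits = [
    "No country is immune to the climate crisis.",
    "Let us consider the climate crisis.",
    "That is why the richest countries must strengthen their financial and "
    "technological solidarity with the poorest countries on climate issues.",
    "The effects of climate change are worsening.",
]

common = shared_hits(e5_hits, mpnet_hits)
print(len(common), "shared sentences")
for sentence in common:
    print("-", sentence)
```

With the DataFrames returned by `search`, the inputs would be the `text` columns, for example `shared_hits(df_a["text"].tolist(), df_b["text"].tolist())`, where `df_a` and `df_b` are assumed names for the two results.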