# Description-based text similarity

Notebook that summarises work from the paper by Ravfogel et al. linked below.

09 06 24

Recreating some results from the following paper: 

- https://arxiv.org/pdf/2305.12517
- https://github.com/shauli-ravfogel/descriptions
- https://huggingface.co/biu-nlp/abstract-sim-sentence

Idea:
- Text similarity is often insufficient for retrieving information based on an abstract description
- This paper demonstrates the use of embeddings that have been trained so that they retrieve examples that match a given description.
- A nice demonstration of the use of ChatGPT for generating training data to solve an NLP problem.

Data sources:
- English Wikipedia
  
Approach:
- Extract sentences from Wikipedia
- Take a subset of the sentences
- Use GPT3 to generate 5 abstract descriptions that describe each sentence in the subset, and 5 that do not
- Then train an embedding model with contrastive loss so that sentences have embeddings close to those of their corresponding descriptions.
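
The training step in the list above can be illustrated with a generic in-batch contrastive (InfoNCE-style) loss. This is a minimal NumPy sketch under my own assumptions, not the paper's training code: the name `info_nce_loss`, the temperature value and the use of in-batch negatives only are illustrative, and the explicitly generated "bad" descriptions are not used here.

```python
import numpy as np

def info_nce_loss(sentence_emb: np.ndarray,
                  description_emb: np.ndarray,
                  temperature: float = 0.05) -> float:
    """In-batch contrastive loss; row i of each array is a matching pair."""
    # L2-normalise so that dot products are cosine similarities
    s = sentence_emb / np.linalg.norm(sentence_emb, axis=1, keepdims=True)
    d = description_emb / np.linalg.norm(description_emb, axis=1, keepdims=True)
    # logits[i, j] compares description i with sentence j
    logits = d @ s.T / temperature
    # numerically stable log-softmax over the sentences for each description
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the matching sentence for description i sits on the diagonal
    return float(-np.mean(np.diag(log_probs)))
```

The loss is small when each description is closest to its own sentence and grows when descriptions sit nearer to other sentences in the batch.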

My questions
- Can the notions of 'description', 'summary', 'paraphrase' be made more well-defined?
- Is a suitable abstract description of an example dependent on what information we want to capture in the description? Or can we define the notion of description in a consistent way?
- Was the evaluation sufficient?
- How could I extend this? Apply it to my own research?
- How to implement?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
%matplotlib inline

## Generating descriptions from sentences using ChatGPT

In [2]:
from openai import OpenAI
import os

# OPENAI_API_KEY is the variable name the OpenAI client looks for by default
api_key = os.environ.get("OPENAI_API_KEY")

client = OpenAI(api_key=api_key)

In [3]:
sentence = "My dad was very sad when Timmy the dog died. My parents buried him in the backyard with his favourite blanket."

In [4]:
system_prompt = """
You are a clear, observant writer who creates abstract descriptions from sentences.
"""

In [5]:
user_prompt1 = f"""
Let's write abstract descriptions of sentences. 

Example:
Sentence: Pilate 's role in the events leading to the crucifixion lent themselves
to melodrama , even tragedy , and Pilate often has a role in medieval mystery
plays .

Description: A description of a historical religious figure's involvement in a
significant event and its later portrayal in art.

Note: Descriptions can differ in the level of abstraction, granularity and the
part of the sentence they focus on. Some descriptions need to be abstract, while
others should be concrete and detailed.

For the following sentence, write up 5 good and stand-alone, independent
descriptions and 5 bad descriptions (which may be related, but are clearly wrong).
Output a json file with keys 'good', 'bad'.

Sentence: {sentence}

Start your answer with a curly bracket.
"""

In [6]:
completion = client.chat.completions.create(
  model="gpt-4",
  messages=[
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt1}
  ]
)

In [7]:
print(completion.choices[0].message.content)

{
"good": [
"A narrative depicting a personal experience of a family’s sorrow in the wake of a pet’s death.", 
"A reflection on a father's emotional response to the loss of a beloved family dog.",
"A recollection that describes the end of a treasured pet's life and its burial in the family home.",
"A story of grief encapsulated within the death of a pet and expressing concern for a parent's well-being.",
"An account describing a somber familial event involving a dog's life and its sentimental burial."

],
"bad": [
"A jovial story recounting a memorable family vacation.", 
"An analysis of the economical impact of pet deaths.",
"A description of a dad's excitement about a dog's playful antics.", 
"Depiction of a dog’s excitement after being rescued by a caring family.", 
"A recounting of how enthusiastically a father built a dog house in the backyard."
]
}
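
The prompt asks for JSON with keys 'good' and 'bad', so the reply can be parsed and checked before it is used as training data. A minimal sketch using only the standard library; the helper name `parse_descriptions` is hypothetical, and trimming to the outermost braces is my own guard against stray text around the JSON.

```python
import json

def parse_descriptions(raw: str) -> dict:
    """Parse the model's reply and check that it has the two expected lists."""
    # keep only the outermost {...} in case the reply has text around it
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in the reply")
    data = json.loads(raw[start:end + 1])
    for key in ("good", "bad"):
        if not isinstance(data.get(key), list):
            raise ValueError(f"missing or malformed key: {key!r}")
    return {"good": data["good"], "bad": data["bad"]}
```

In this notebook it would be called as `parse_descriptions(completion.choices[0].message.content)`.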


## Hugging Face model

Two models are needed: a sentence encoder and a query encoder

- https://huggingface.co/biu-nlp/abstract-sim-sentence
- https://huggingface.co/biu-nlp/abstract-sim-query

In [8]:
from transformers import AutoTokenizer, AutoModel
import torch
from typing import List
from sklearn.metrics.pairwise import cosine_similarity

In [13]:
def load_finetuned_model():
    sentence_encoder = AutoModel.from_pretrained("biu-nlp/abstract-sim-sentence") # load the sentence encoder
    query_encoder = AutoModel.from_pretrained("biu-nlp/abstract-sim-query") # load the query encoder
    tokenizer = AutoTokenizer.from_pretrained("biu-nlp/abstract-sim-sentence") # tokenizer (converts the text to tokens)
    return tokenizer, query_encoder, sentence_encoder

def encode_batch(model, tokenizer, sentences: List[str], device: str):
    """
    Given a model, a tokenizer and a list of strings representing sentences, return the text features

    Args
    ----
    model: hf model
    tokenizer: hf tokenizer
    sentences: list of strings representing the sentences to be encoded
    device: torch device string, e.g. "cpu" or "cuda"
    """
    inputs = tokenizer(sentences,
                       padding=True,
                       max_length=512,
                       truncation=True,
                       return_tensors="pt",
                       add_special_tokens=True).to(device)
    # inference only: skip gradient tracking to keep memory use down
    with torch.no_grad():
        features = model(**inputs)[0]  # token-level hidden states
    # mean-pool over the non-padding tokens, skipping the first (special) token
    mask = inputs["attention_mask"][:, 1:]
    features = torch.sum(features[:, 1:, :] * mask.unsqueeze(-1), dim=1) / torch.clamp(torch.sum(mask, dim=1, keepdims=True), min=1e-9)
    return features
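
The pooling line in `encode_batch` is a masked mean: it drops the first position (which I assume holds the tokenizer's special start token), zeroes out padding via the attention mask, and divides by the number of real tokens. A small NumPy sketch of the same arithmetic on toy data; `masked_mean_pool` is a hypothetical name used only for this illustration.

```python
import numpy as np

def masked_mean_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1."""
    hidden = hidden[:, 1:, :]                                 # skip the first token
    mask = attention_mask[:, 1:, None].astype(hidden.dtype)   # (batch, seq_len - 1, 1)
    summed = (hidden * mask).sum(axis=1)                      # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)            # avoid division by zero
    return summed / counts

# toy batch: 2 sequences, 4 positions, 3 dimensions
hidden = np.arange(24, dtype=float).reshape(2, 4, 3)
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]])
pooled = masked_mean_pool(hidden, attention_mask)
```

For the first sequence the result averages positions 1 and 2; for the second only position 1 survives.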

In [14]:
# load the tokenizer, query encoder and the sentence encoder
tokenizer, query_encoder, sentence_encoder = load_finetuned_model()

# examples of relevant sentences that should be returned given the query below
relevant_sentences = ["Fingersoft's parent company is the Finger Group.",
                      "WHIRC – a subsidiary company of Wright-Hennepin",
                      "CK Life Sciences International (Holdings) Inc. (), or CK Life Sciences, is a subsidiary of CK Hutchison Holdings",
                      "EM Microelectronic-Marin (subsidiary of The Swatch Group).",
                      "The company is currently a division of the corporate group Jam Industries.",
                      "Volt Technical Resources is a business unit of Volt Workforce Solutions, a subsidiary of Volt Information Sciences (currently trading over-the-counter as VISI.)."
                     ]

# examples of irrelevant sentences that should not be returned
irrelevant_sentences = ["The second company is deemed to be a subsidiary of the parent company.",
                        "The company has gone through more than one incarnation.",
                        "The company is owned by its employees.",
                        "Larger companies compete for market share by acquiring smaller companies that may own a particular market sector.",
                        "A parent company is a company that owns 51% or more voting stock in another firm (or subsidiary).",
                        "It is a holding company that provides services through its subsidiaries in the following areas: oil and gas, industrial and infrastructure, government and power.",
                        "RXVT Technologies is no longer a subsidiary of the parent company."
                        ]



In [11]:
all_sentences = relevant_sentences + irrelevant_sentences

In [22]:
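# note: this cell's counter (In [22]) is later than the cells below, so the printed scores
# appear to come from an earlier query about a company being part of a larger company
query = "<query>: A misunderstanding that could have resulted in someone getting into trouble"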

In [16]:
# numpy array representing the embeddings of each of the sentences
embeddings = encode_batch(sentence_encoder, tokenizer, all_sentences, "cpu").detach().cpu().numpy()

# numpy array representing the embedding of the query
query_embedding = encode_batch(query_encoder, tokenizer, [query], "cpu").detach().cpu().numpy()

In [15]:
embeddings.shape

(13, 768)

In [17]:
query_embedding.shape

(1, 768)

In [18]:
# calculate the cosine similarities between the query embedding and all other embeddings
sims = cosine_similarity(query_embedding, embeddings)[0]

# all sentences along with similarity score with the query
sentences_sims = list(zip(all_sentences, sims))
sentences_sims.sort(key=lambda x: x[1], reverse=True)

for s, sim in sentences_sims:
    print(s, sim)

WHIRC – a subsidiary company of Wright-Hennepin 0.9396287
EM Microelectronic-Marin (subsidiary of The Swatch Group). 0.93929076
Fingersoft's parent company is the Finger Group. 0.9362471
CK Life Sciences International (Holdings) Inc. (), or CK Life Sciences, is a subsidiary of CK Hutchison Holdings 0.9350311
The company is currently a division of the corporate group Jam Industries. 0.927349
Volt Technical Resources is a business unit of Volt Workforce Solutions, a subsidiary of Volt Information Sciences (currently trading over-the-counter as VISI.). 0.90050864
The second company is deemed to be a subsidiary of the parent company. 0.6723647
It is a holding company that provides services through its subsidiaries in the following areas: oil and gas, industrial and infrastructure, government and power. 0.60081375
A parent company is a company that owns 51% or more voting stock in another firm (or subsidiary). 0.5949048
The company is owned by its employees. 0.55286556
RXVT Technologies is 
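
The ranking step above can be wrapped in a small helper that returns only the best matches. A minimal NumPy sketch; `top_k` and its parameters are hypothetical names, and the cosine similarity is computed by hand here instead of with scikit-learn so the block stands alone.

```python
import numpy as np

def top_k(query_embedding: np.ndarray, embeddings: np.ndarray, texts: list, k: int = 3):
    """Return the k texts most similar to the query, with their cosine scores."""
    q = query_embedding.reshape(-1)
    q = q / np.linalg.norm(q)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q                      # cosine similarity of each text with the query
    order = np.argsort(-sims)[:k]     # indices of the highest scores first
    return [(texts[i], float(sims[i])) for i in order]
```

With the arrays from this section it would be called as `top_k(query_embedding, embeddings, all_sentences, k=5)`.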

## Application: search through Hansard petitions for weird things

https://www.aph.gov.au/Parliamentary_Business/Hansard

In [3]:
from pypdf import PdfReader
from transformers import AutoTokenizer, AutoModel
import torch
from typing import List
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.tokenize import sent_tokenize

In [4]:
# load a hansard document
reader = PdfReader("/Users/alexlee/Desktop/Data/pdfs/House of Representatives_2024_06_03.pdf")
number_of_pages = len(reader.pages)
print(number_of_pages)

206


In [5]:
all_text = ''

for page in reader.pages[19:]:
    text = page.extract_text()
    all_text += text

In [6]:
# Download the Punkt tokenizer for sentence splitting
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/alexlee/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
# drop the first 113 characters (leftover introductory text)
texts_tidier = all_text[113:]

In [8]:
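# turn list markers and clause separators into sentence breaks for sent_tokenize;
# the bullet patterns must run before newlines are flattened, or they never match
texts_tidier = (
    texts_tidier
    .replace('\n- ', '. ')
    .replace('\n* ', '. ')
    .replace('\n1. ', '. ')
    .replace('\n', ' ')
    .replace('•', '. ')
    .replace(':', '. ')
    .replace(';', '. ')
    .replace('.', '. ')
    .replace('?', '. ')
)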

In [9]:
sentences = sent_tokenize(texts_tidier)

In [10]:
def load_finetuned_model():
    sentence_encoder = AutoModel.from_pretrained("biu-nlp/abstract-sim-sentence") # load the sentence encoder
    query_encoder = AutoModel.from_pretrained("biu-nlp/abstract-sim-query") # load the query encoder
    tokenizer = AutoTokenizer.from_pretrained("biu-nlp/abstract-sim-sentence") # tokenizer (converts the text to tokens)
    return tokenizer, query_encoder, sentence_encoder

def encode_batch(model, tokenizer, sentences: List[str], device: str):
    """
    Given a model, a tokenizer and a list of strings representing sentences, return the text features

    Args
    ----
    model: hf model
    tokenizer: hf tokenizer
    sentences: list of strings representing the sentences to be encoded
    device: torch device string, e.g. "cpu" or "cuda"
    """
    inputs = tokenizer(sentences,
                       padding=True,
                       max_length=512,
                       truncation=True,
                       return_tensors="pt",
                       add_special_tokens=True).to(device)
    # inference only: skip gradient tracking to keep memory use down
    with torch.no_grad():
        features = model(**inputs)[0]  # token-level hidden states
    # mean-pool over the non-padding tokens, skipping the first (special) token
    mask = inputs["attention_mask"][:, 1:]
    features = torch.sum(features[:, 1:, :] * mask.unsqueeze(-1), dim=1) / torch.clamp(torch.sum(mask, dim=1, keepdims=True), min=1e-9)
    return features

In [11]:
# create the embeddings
# load the tokenizer, query encoder and the sentence encoder
tokenizer, query_encoder, sentence_encoder = load_finetuned_model()



In [13]:
# quick test: embed only the first five sentences
embeddings = encode_batch(sentence_encoder, tokenizer, sentences[:5], "cpu").detach().cpu().numpy()

In [15]:
# numpy array representing the embedding of the query
query = "<query>: A group of people overcoming a significant challenge"
query_embedding = encode_batch(query_encoder, tokenizer, [query], "cpu").detach().cpu().numpy()

In [19]:
# calculate the cosine similarities between the query embedding and all other embeddings
sims = cosine_similarity(query_embedding, embeddings)[0]

# all sentences along with similarity score with the query
sentences_sims = list(zip(sentences, sims))
sentences_sims.sort(key=lambda x: x[1], reverse=True)

for s, sim in sentences_sims:
    print(s, sim)

Current rental crises is hard enough, stop landlords and agents from taking advantage of the situation 0.0631634
When does it stop? The government needs to step in and put regulations in place to stop this and even reduce what has already happened 0.027913576
Put a limit on how much of an increase is allowed and a time limit -0.073042616
Presentation Ms TEMPLEMAN (Macquarie) (10:01):  I present the following 54 petitions: Housing Rental price increases are putting Australians on the street, in cars, tents, couches and under bridges -0.08194008
We therefore ask the House to regulate the housing rental increases -0.13467577


In [12]:
device = 'cpu'

In [13]:
num_chunks = len(sentences) // 100

In [14]:
num_chunks

95
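
`len(sentences) // 100` counts only the full chunks of 100, so a value of 95 means there are between 9,500 and 9,599 sentences, and any remainder sits in one extra chunk with index 95. A small sketch of a chunking helper that handles the remainder without index arithmetic; `iter_chunks` is a hypothetical name and the toy list stands in for the real `sentences`.

```python
from typing import Iterator, Sequence

def iter_chunks(items: Sequence[str], size: int = 100) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items, including the last partial one."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# toy example: 250 items give two full chunks and one partial chunk
toy_sentences = [f"sentence {i}" for i in range(250)]
chunk_lengths = [len(chunk) for chunk in iter_chunks(toy_sentences, size=100)]
```

Here `len(toy_sentences) // 100` is 2, while iterating yields three chunks.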

In [15]:
model = sentence_encoder

In [None]:
all_features = []

# Encode in chunks of 100 sentences; range(90, 96) covers chunks 90 to 95.
# Running two chunks consecutively killed the kernel. A likely cause (not verified here) is
# that the forward pass tracks gradients, so each chunk's autograd graph stays in memory
# while its features are kept in all_features; torch.no_grad() avoids building that graph.
for n in range(90, 96):
    print(f'Creating features for chunk {n}')
    with torch.no_grad():
        features = encode_batch(model, tokenizer, sentences[n*100:(n+1)*100], device)
    all_features.append(features)

In [17]:
features_final = torch.concatenate(all_features)

In [23]:
# load all the tensors
tensor_files = [f'features_part{n}.pt' for n in range(0, 10)]

In [29]:
all_tensors = torch.concatenate([torch.load(path) for path in tensor_files])

In [113]:
query = "<query>: Concerns about young people"
query_embedding = encode_batch(query_encoder, tokenizer, [query], "cpu").detach().cpu().numpy()

In [114]:
embeddings = all_tensors.detach().cpu().numpy()

In [None]:
sims = cosine_similarity(query_embedding, embeddings)[0]

# all sentences along with similarity score with the query
sentences_sims = list(zip(sentences, sims))
sentences_sims.sort(key=lambda x: x[1], reverse=True)

for s, sim in sentences_sims:
    print(s, sim)