# Generative AI - SEBx & Combient Hackathon December 2023

> **`Run and execute each code cell block in the notebook in consecutive order. This is important since some code cell blocks rely on previous code cell blocks having been executed properly.`**
>
>
> Notebook code blocks can be executed via either:
> * **shift + enter**: executes current code block and moves to the cell below
> * **control + enter**: executes current code block

## Environment setup

Here we set up the environment and make sure we can access data via Google Drive. Run the below code blocks to install necessary packages.

Run the code below to clone the GenAI_BootCamp2023 directory from our GitHub repository. It contains the files we will be using during this course.

> **`During execution of this cell block you will be prompted to provide your GDrive access to download the course content. Use your Google account credentials to allow this action.`**

## Packages and Imports

In [None]:
!pip install -q cohere \
    -q tiktoken \
    -q langchain \
    -q sentence_transformers \
    -q openai \
    -q faiss-cpu \
    -q colorama \
    -q pypdf \
    -q PyMuPDF \
    -q requests \
    -q beautifulsoup4 \
    -q umap-learn \
    -q mycolorpy \
    -q pandas \
    -q plotly \
    -q seaborn

In [None]:
# Some system and base modules
import os
import sys
from timeit import default_timer as timer
from typing import List, Optional, Type
import getpass

# NLP modules
import openai
from openai import OpenAI
import langchain
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, JSONLoader
import fitz
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
import requests
from bs4 import BeautifulSoup

# Other modules
import numpy as np
import pandas as pd
import umap
from colorama import Fore, Back, Style


# modules for plotting
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from mycolorpy import colorlist as mcp
from mpl_toolkits.mplot3d import Axes3D
import plotly.express as px
import seaborn as sns
%matplotlib inline

## Setting the access key for OpenAI API

> **NB: Don't share the OpenAI access key in public spaces.**

> **The OpenAI API key can be set manually in the notebook by running the code cell block below.**
>
> **`A query box will appear the first time you run the below code cell block. Paste the OpenAI API key which you have been provided into the query box and press Enter/Return (the access key has the form sk-...)`**

In [None]:
# Here we can set the OpenAI API access key manually in case it fails to load from the environment.
if not os.environ.get("OPENAI_API_KEY"):
  api_key = getpass.getpass("Enter OpenAI API Key here")
  os.environ["OPENAI_API_KEY"] = api_key
else:
  print(f"OPENAI_API_KEY fetched from environment!")


# sk-...

In [None]:
# You can optionally manually insert the OpenAI API key below between the quotation marks.
# Then uncomment the following two lines by removing the preceding # and run the cell

#os.environ["OPENAI_API_KEY"] = "sk-..."
#print(f"The Open AI access key is given by: \n\n {os.environ['OPENAI_API_KEY']}")

In [None]:
# The following helps to format print output to match the size of the browser window
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# Hands-On 2: Introduction to Embeddings & Vector index storage

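In this part we turn text into embedding vectors, visualise how they are distributed, and store them in a vector index which we can query with semantic similarity search.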

## Getting arXiv data & illustrating basic embedding concepts

We will start by collecting scientific abstracts from [arXiv](https://arxiv.org). We collect these by fetching from their new releases section, which provides the daily deluge of preprint articles in various STEM subjects.

Below we define a function for this and then try it out to see the output for a single article.

In [None]:
def fetch_arxiv_data(url, subject):
    """
    Function for fetching articles from the arXiv. Expects a url pointing to the daily release site of arXiv topics.
    Returns a list of dictionaries containing 'title', 'abstract' and 'arxiv_topic'.
    """
    response = requests.get(url)
    if response.status_code != 200:
        print('Failed to retrieve data from', url)
        return []

    soup = BeautifulSoup(response.content, 'html.parser')
    papers = []

    cnt_found=0
    cnt_not_found=0
    for item in soup.find_all('div', class_='meta'):
      # We only extract info from articles with abstract in the listing. This includes cross-topic listings
      try:
        title = item.find('div', class_='list-title mathjax').text.replace('Title:', '').strip()
        abstract = item.find('p', class_='mathjax').text.strip()
        arxiv_topic = item.find('span', class_='primary-subject').text.strip()
        papers.append({'title': title, 'abstract': abstract, 'arxiv_topic': arxiv_topic, 'subject': subject})
        cnt_found+=1
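      # We do not try to get abstract from replacements. These entries lack some of the
      # elements above, so .find() returns None and accessing .text raises AttributeError
      except AttributeError: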
        #print(f"NO ABSTRACT FOUND, DUE TO ARTICLE BEING A REPLACEMENT OF EARLIER SUBMISSION")
        cnt_not_found+=1

    print(f"Extracted abstract for {cnt_found} new articles from {subject}.\nThis excludes {cnt_not_found} replacements.")

    return papers

In [None]:
# Example usage
url = 'https://arxiv.org/list/gr-qc/new'  # URL for the General Relativity and Quantum Cosmology section
papers = fetch_arxiv_data(url, subject="gr-qc")

# Print the first paper
for paper in papers[:1]:
    print("")
    print(Style.BRIGHT + 'Title:' + Style.RESET_ALL, paper['title'])
    print(Style.BRIGHT + 'Abstract:' + Style.RESET_ALL, paper['abstract'])
    print(Style.BRIGHT + 'arXiv Topic:' + Style.RESET_ALL, paper['arxiv_topic'])
    print(Style.BRIGHT + 'arXiv Subject:' + Style.RESET_ALL, paper['subject'])
    print('---')

In [None]:
# Let's do it for all arXiv subjects
subjects = [
    "astro-ph",
    "gr-qc",
    "cond-mat",
    "quant-ph",
    "hep-th",
    "hep-ph",
    "hep-ex",
    "hep-lat",
    "nucl-ex",
    "nucl-th",
    "nlin",
    "math-ph",
    "math",
    "cs",
    "stat",
    "eess",
    ]


# We collect everything in a list of dictionaries
papers_list = []
for subject in subjects:
  papers_subject = fetch_arxiv_data(f"https://arxiv.org/list/{subject}/new", subject=subject)
  for paper in papers_subject:
    papers_list.append(paper)

print("")
print(f"We extracted a total of {len(papers_list)} abstracts")

In [None]:
# Print the first paper
for paper in papers_list[:1]:
    print(Style.BRIGHT + 'Title:' + Style.RESET_ALL, paper['title'])
    print(Style.BRIGHT + 'Abstract:' + Style.RESET_ALL, paper['abstract'])
    print(Style.BRIGHT + 'arXiv Topic:' + Style.RESET_ALL, paper['arxiv_topic'])
    print(Style.BRIGHT + 'arXiv Subject:' + Style.RESET_ALL, paper['subject'])
    print('---')

### Checking lengths of retrieved abstracts

In [None]:
abstract_lengths = []
for paper in papers_list:
  abstract_lengths.append(len(paper["abstract"]))

abstract_lengths.sort(reverse=True)
print("We print out the lengths of the 10 longest abstracts")
abstract_lengths[:10]

### Embedding models

In [None]:
# For embeddings we use models on the MTEB leaderboard at https://huggingface.co/spaces/mteb/leaderboard


# Voyage, currently nr 1 (REQUIRES REGISTERING TO GET API KEY)
#!pip install -q voyageai
#import voyageai
#from langchain.embeddings import VoyageEmbeddings
#os.environ["VOYAGE_API_KEY"] = "..."
#voyageai.api_key = os.environ["VOYAGE_API_KEY"]


# Cohere, currently nr 2 (REQUIRES REGISTERING TO GET API KEY)
#import cohere
#Get your API key from www.cohere.com
#os.environ["COHERE_API_KEY"] = "..."


# Open source HuggingFace embeddings, below is currently nr 3 & 12 (NO REGISTRATION REQUIRED)
# This list must be defined here since get_API_embedding below looks models up in it
embedding_models_HF = [
    "BAAI/bge-large-en-v1.5",
    "BAAI/bge-small-en-v1.5"
    ]

We will collect our abstracts into a list of Langchain Document objects. This is not necessary for doing embeddings, but will facilitate working with vector stores later on. The Document class has the attributes `page_content`, which stores the text string, and `metadata`, where additional metadata can be stored as a dictionary of key-value pairs.

In [None]:
documents =  []

for paper in papers_list:
  doc  = Document(
      page_content = paper["abstract"],
      metadata = {"title": paper["title"], "arxiv_topic": paper["arxiv_topic"], "subject": paper["subject"]}
  )
  documents.append(doc)
documents[0]

Let's demonstrate how an embedding works by using open source embeddings for a single abstract. First let's see what the abstract looks like in plain text.

In [None]:
test_text = documents[0].page_content
print(test_text)

In [None]:
def doc_embedding(
    embedding_model: str,
    model_kwargs: dict={'device': 'cpu'},
    encode_kwargs: dict={'normalize_embeddings': True},
    cache_folder: Optional[str]=None,
    multi_process: bool=False,
    ) -> HuggingFaceEmbeddings:
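  """
  Creates a HuggingFaceEmbeddings object for the given model.

  embedding_model is the HuggingFace model name, e.g. "BAAI/bge-small-en-v1.5".
  The remaining arguments are passed on unchanged to HuggingFaceEmbeddings.
  """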
  embedder = HuggingFaceEmbeddings(
      model_name = embedding_model,
      model_kwargs = model_kwargs,
      encode_kwargs = encode_kwargs,
      cache_folder = cache_folder,
      multi_process = multi_process
  )
  return embedder

In [None]:
def get_API_embedding(text, model="text-embedding-ada-002"):
  """This function retrieves embedding vector from text string using various models"""
  text = text.replace("\n", " ")

  # OpenAI embeddings
  if model == "text-embedding-ada-002":
    client = OpenAI()
    embedding = client.embeddings.create(input = [text], model=model).data[0].embedding

  # Voyage embeddings
  elif model == 'voyage-01':
    voyage = VoyageEmbeddings(model=model, voyage_api_key=os.environ["VOYAGE_API_KEY"])
    embedding = voyage.embed_query(text)

  # Cohere embeddings
  elif model == "embed-english-v3.0":
    co = cohere.Client(os.environ["COHERE_API_KEY"])
    # co.embed returns one embedding per input text, so we pick out the first (and only) one
    embedding = co.embed([text], input_type="search_document", model=model).embeddings[0]

  elif model in embedding_models_HF:
    embedder = doc_embedding(model)
    embedding = embedder.embed_query(text)

  else:
    embedding = [None]

  return embedding

Now we call the open source embeddings from HuggingFace and check the first 10 entries of the resulting embedding vector.

In [None]:
embedding = get_API_embedding(test_text, model="BAAI/bge-small-en-v1.5")
embedding[:10]

We can now loop through all abstracts, embed them and add the embeddings to an embedding list. Later we will see how we can do this using a vector store to manage the retrieved embeddings along with additional metadata.

**NB: This takes a couple of minutes to complete for all abstracts**

In [None]:
embeddings = []
for document in documents:
  embedding = get_API_embedding(document.page_content, model="BAAI/bge-small-en-v1.5")
  embeddings.append(embedding)


Let's check the first few entries of the embedding of one article

In [None]:
embeddings[0][:10]
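Since we passed `normalize_embeddings=True` to the embedder, each embedding vector should have unit length. For unit vectors the plain dot product equals the cosine similarity, which is the usual way of comparing embeddings. The sketch below illustrates this with made-up three-dimensional vectors; the helper names `l2_normalize` and `cosine_sim` and the toy values are our own and not part of any library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_sim(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, real ones have hundreds of entries)
toy_a = [1.0, 2.0, 2.0]
toy_b = [2.0, 1.0, 2.0]

unit_a, unit_b = l2_normalize(toy_a), l2_normalize(toy_b)
dot_of_units = sum(x * y for x, y in zip(unit_a, unit_b))

print(f"cosine similarity of raw vectors: {cosine_sim(toy_a, toy_b):.4f}")
print(f"dot product of unit vectors:      {dot_of_units:.4f}")
```

Both numbers agree, which is why normalised embeddings can be compared with a simple dot product.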

### Projecting embeddings using UMAP

We will now use the [UMAP](https://umap-learn.readthedocs.io/en/latest/) library for performing projections of the embedding vectors down to 2D, preserving both local and global structure of the data.

Let's remind ourselves what the arXiv subjects are.

In [None]:
import numpy as np
# modules for plotting
import matplotlib.pyplot as plt
#import matplotlib.lines as mlines
from mycolorpy import colorlist as mcp
#from mpl_toolkits.mplot3d import Axes3D
#import plotly.express as px
import seaborn as sns
%matplotlib inline

In [None]:
subjects

We prepare the data for input to the UMAP algorithm

In [None]:
colors = mcp.gen_color(cmap="Spectral",n=len(subjects))
color_dict_subjects =dict(zip(subjects, colors))
    
    
embedding_data_array = np.array(embeddings)
print(f"We now have an array of embeddings with shape: {embedding_data_array.shape}")

In [None]:
import umap
# We do the projection for several values of the n_neighbors hyperparameter
# This is the most important hyperparameter of the UMAP algorithm
n_neighbors = [2, 5, 15, 25, 50, 100] # 15 is default

umap_results = []
for n in n_neighbors:
    reducer = umap.UMAP(random_state=42,
                        n_components=2,
                        learning_rate=1.0,
                        min_dist=0.1,
                        n_neighbors=n,
                        metric='euclidean',
                        output_metric='euclidean',
                        target_metric='categorical',
                        target_n_neighbors=-1,
                        target_weight=0.5,)
    umap_embedding = reducer.fit_transform(embeddings)
    umap_results.append(umap_embedding)

Let's first display the result using matplotlib

In [None]:
import pandas as pd
# Let's choose one of the UMAP results to display
nr_index = 5

df_arxiv_umap = pd.DataFrame(np.array([umap_results[nr_index][:,0], umap_results[nr_index][:,1]]).T, columns=["umap-2d-one", "umap-2d-two"])

subjects_list = []
for paper in papers_list:
  subjects_list.append(paper["subject"])
df_arxiv_umap["y"] = subjects_list

plt.figure(figsize=(16,10))
sns.scatterplot(
    x="umap-2d-one", y="umap-2d-two",
    hue="y",
    palette=color_dict_subjects,
    data=df_arxiv_umap,
    legend="full",
    alpha=0.5
)
plt.title(f"UMAP projection of arXiv abstracts with n_neighbors={n_neighbors[nr_index]}", fontsize=25)

Let's make an interactive plot using Plotly where we can hover over the points and inspect the results in more detail.

In [None]:
import plotly.express as px

subjects_list = []
for paper in papers_list:
  subjects_list.append(paper["arxiv_topic"])
df_arxiv_umap["y_long"] = subjects_list


fig = px.scatter(df_arxiv_umap,
                 x='umap-2d-one',
                 y='umap-2d-two',
                 color='y',
                 color_discrete_map=color_dict_subjects, # Use your color dictionary
                 hover_data=['y_long']) # This will show the category on hover

fig.update_traces(marker=dict(size=5, opacity=0.5)) # Adjust size and opacity similar to your seaborn plot
fig.update_layout(legend_title_text='arXiv subject') # Customize legend title
fig.show()

Tracing over the points and examining the hover labels we can see a clear clustering of physics topics and computer science topics respectively. The math topics tend to lie between these and we can see that they are closer to those topics which deal with similar lines of research.

> ```We can use the above plot to remove some of the topic subjects if we wish, to reduce the amount of data to embed and make the separation even more visually clear. Also consider changing color scheme.```

### Vector index store & Semantic similarity search

Let's now examine how we can store embeddings in an indexed vector database. There are many different vector stores to choose from, and for a small demonstration like this one they offer broadly similar functionality. Here we will make use of [FAISS](https://ai.meta.com/tools/faiss/), which is an open source vector search library developed by Meta.

We will then see how we can use this tool to perform a similarity search over the indexed embeddings and retrieve the most relevant article based on a query using semantic similarity.
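To build some intuition for what a similarity search does, here is a minimal sketch of a brute-force search: the query vector is compared against every stored vector and the `k` closest ones are returned. We assume Euclidean distance here, which to our understanding is also what the simplest (flat) FAISS index uses; the function names and toy vectors are purely illustrative, and a real index is far more optimised than this.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(index_vectors, query_vector, k):
    """Return (position, distance) for the k stored vectors closest to the query."""
    scored = [
        (pos, euclidean_distance(vec, query_vector))
        for pos, vec in enumerate(index_vectors)
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Toy 2-dimensional "index" and query (hypothetical values)
toy_index = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]
toy_query = [0.5, 0.9]

hits = brute_force_search(toy_index, toy_query, k=2)
for pos, dist in hits:
    print(f"stored vector nr {pos} at distance {dist:.3f}")
```

The hits come back sorted with the closest match first, just like the results we will get from the vector store.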

We will make use of an open source model from HuggingFace for doing the embeddings here. There are many good options to choose from of varying sizes. As we will see even quite small models perform quite well with semantic search.

> ```The first time you run the below code snippet you will download the embedding model into memory and you will see the progress of this displayed.```

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
# For embeddings we use a top ranked open source model on the MTEB leaderboard at https://huggingface.co/spaces/mteb/leaderboard
embedding_models_HF = [
    "BAAI/bge-large-en-v1.5",
    "BAAI/bge-small-en-v1.5"
    ]
embedding_model = embedding_models_HF[1]


# We use the HuggingFaceEmbeddings wrapper to make an embedding object from our model
embedder = HuggingFaceEmbeddings(
      model_name = embedding_model,
      model_kwargs = {'device': 'cpu'},
      encode_kwargs = {'normalize_embeddings': True},
      cache_folder = f"{embedding_model}_cache",
      multi_process = False
      )

It is then straightforward to create a FAISS vector index using our Document and HuggingFaceEmbeddings objects. With the small BGE model this will take slightly less than one minute to complete for 50 abstracts. We will therefore pick out a subset of roughly 50 abstracts here for demonstrative purposes.

In [None]:
# We aim to pick out about 50 abstracts from across the documents list, irrespective of how many we originally retrieved
nr_abstracts_in_short_list = 50


# We pick out the above nr of abstracts evenly spaced out over our list of retrieved abstracts
documents_short = documents[::max(1, len(documents) // nr_abstracts_in_short_list)]
#documents_short = documents[-nr_abstracts_in_short_list:]

print(f"The short list of documents contains {len(documents_short)} abstracts")
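Note that slicing with an integer step only gives roughly the requested number of items. The sketch below applies the same slicing logic to plain lists of different lengths to show how the size of the subset varies; `evenly_spaced_subset` is a hypothetical helper name used only for this illustration.

```python
def evenly_spaced_subset(items, target_size):
    """Pick items at a fixed integer stride, aiming for about target_size items."""
    step = max(1, len(items) // target_size)
    return items[::step]

# Try a few list lengths with a target of 50 items
for total in (30, 100, 120, 1000):
    subset = evenly_spaced_subset(list(range(total)), 50)
    print(f"{total:>4} items in total -> {len(subset)} items in the subset")
```

When the total is not close to a multiple of the target, the subset can be noticeably larger than 50, and when fewer than 50 items are available we simply get all of them.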

In [None]:
from langchain.vectorstores import FAISS

# We embed and store embeddings for the shortened list of abstracts
faiss_index = FAISS.from_documents(
      documents=documents_short,
      embedding=embedder
      )

Let's pick out an article and display its abstract. This will allow us to make a query which we know matches this particular abstract.

In [None]:
# choose a number between 0 and len(documents_short) - 1 to pick out one of the indexed abstracts
abstract_nr = 25

documents_short[abstract_nr]

Now we can construct a search query which is related to the above abstract. We then use this query to retrieve a number of close matches from the vector store. These are retrieved as a sorted list, with the closest match appearing first.

Let's do this by having the LLM make a query for us from the above abstract.

In [None]:
from langchain.schema import SystemMessage, HumanMessage

# The system prompt will be placed at the top of every message and should set overall system behaviour
system_prompt = """
You are an expert in summarizing scientific literature.
"""

# We can try to set this quite low to make it hard for the retriever,
# noting that if several abstracts are on a similar topic then a very short summary
# should make it more difficult to retrieve the intended one
max_words_for_summary = 3

user_prompt = f"""
Consider carefully the abstract supplied below and construct a very brief summary of its content.
Try to use a maximum of {max_words_for_summary} words and do not include any mathematical formulas in your summary.

abstract: {documents_short[abstract_nr].page_content}
"""


# These are collected into a messages list of
#
#   SystemMessage - the system prompt
#   HumanMessage  - the user query
#   AIMessage     - the bot response, in case you wish to continue on a conversation
messages = [
    SystemMessage(content=system_prompt),
    HumanMessage(content=user_prompt),
]

In [None]:
from langchain.chat_models import ChatOpenAI

# This creates an instance of the model interface which we can subsequently call on
chat_model = ChatOpenAI(
    openai_api_key=os.environ['OPENAI_API_KEY'],
    # The below parameters can be changed
    model="gpt-3.5-turbo-1106",
    temperature=0.0
)


# Here we collect the output from the chat model in the variable response
response = chat_model.invoke(messages)

# We can print out the response by calling on its content using a .content
print(response.content)

Now we can use this short summary to try to find the correct article amongst all the ones we have embedded.

In [None]:
# We define a generic query which incorporates the summary from the LLM
search_query = f"Find an article which discusses: {response.content}"


# Define how many similar documents you want to retrieve
# These are returned in sorted order, with most similar placed first
nr_hits = 5


# Use FAISS to perform similarity search ...
most_similar = faiss_index.similarity_search(query = search_query, k=nr_hits)


# Let's check that the closest retrieved match is the same abstract we used to construct the query
target_title = documents_short[abstract_nr].metadata["title"]
retrieved_titles = [doc.metadata["title"] for doc in most_similar]

if target_title == retrieved_titles[0]:
  print(Style.BRIGHT + "SUCCESS! We found the correct abstract as the top ranked choice!" + Style.RESET_ALL)
  print("--"*25)
  print(most_similar[0])
# In case it was not the top pick, we check if it was among the ones retrieved from the vector store
elif target_title in retrieved_titles[1:]:
  print(Style.BRIGHT + f"PARTIAL SUCCESS! We found the correct abstract among the top {nr_hits} ranked choices!" + Style.RESET_ALL)
  print("--"*25)
  print(most_similar[retrieved_titles.index(target_title)])
else:
  print(Style.BRIGHT + f"FAILURE! The correct abstract was not among the top {nr_hits} ranked choices!" + Style.RESET_ALL)

We can inspect all the top ranked abstracts we retrieved and see how well they matched the summarization we used when searching.

In [None]:
cnt=1
for doc in most_similar:
  print(Style.BRIGHT + f"Hit nr {cnt}, Title: {doc.metadata['title']}" + Style.RESET_ALL)
  print(doc.page_content)
  print("--"*25)
  cnt+=1

If you play around with the above you will find that it is quite hard to get the retriever to fail on a semantic similarity search, even for a very condensed summary of the abstracts.
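Rather than judging the retriever from a single query, one could repeat the experiment for many abstracts and count how often the intended abstract shows up among the first `k` hits. The sketch below shows one way of computing such a hit rate from lists of retrieved titles. The function names and the toy results are hypothetical; in practice the title lists would come from calls to the vector store.

```python
def rank_of_target(retrieved_titles, target_title):
    """1-based rank of the target among the retrieved titles, or None if absent."""
    for rank, title in enumerate(retrieved_titles, start=1):
        if title == target_title:
            return rank
    return None

def hit_rate_at_k(results, k):
    """Fraction of queries whose target appears among the first k retrieved titles."""
    hits = 0
    for retrieved_titles, target_title in results:
        if rank_of_target(retrieved_titles[:k], target_title) is not None:
            hits += 1
    return hits / len(results)

# Each entry is (titles returned by the retriever, title we were looking for)
toy_results = [
    (["A", "B", "C"], "A"),  # found as top hit
    (["B", "A", "C"], "A"),  # found as second hit
    (["B", "C", "D"], "A"),  # not found
    (["C", "B", "A"], "A"),  # found as third hit
]

for k in (1, 3):
    print(f"hit rate at k={k}: {hit_rate_at_k(toy_results, k):.2f}")
```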

#### Breaking the retriever?


In order to make things a bit harder we will try to collect a series of abstracts which are all related to the same topic, but which are not necessarily recent. To this end we fetch 100 article abstracts based on the query keyword `LLM`.

In [None]:
import requests
from bs4 import BeautifulSoup

query_keyword = "LLM"
url = f"https://arxiv.org/search/?query={query_keyword}&searchtype=all&abstracts=show&order=-announced_date_first&size=100"


# We collect everything in a list of dictionaries
response = requests.get(url)
if response.status_code != 200:
  print('Failed to retrieve data from', url)

soup = BeautifulSoup(response.content, 'html.parser')


papers = []
for item in soup.find_all('li', class_='arxiv-result'):
  title = item.find('p', class_='title is-5 mathjax').text.strip()
  abstract = item.find('span', class_='abstract-full has-text-grey-dark mathjax').text.strip()
  papers.append({"title": title, "abstract": abstract})

print("")
print(f"We extracted a total of {len(papers)} abstracts on the topic of {query_keyword}")
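The search URL above inserts the keyword directly into the query string, which works for a single word such as `LLM`. For keywords containing spaces or special characters it is safer to let the standard library do the escaping. The sketch below builds the same URL with `urllib.parse.urlencode`; the helper name `build_arxiv_search_url` is our own, and the parameters are simply those already used in the URL above.

```python
from urllib.parse import urlencode

def build_arxiv_search_url(query_keyword, size=100):
    """Build an arXiv search URL with properly escaped query parameters."""
    params = {
        "query": query_keyword,
        "searchtype": "all",
        "abstracts": "show",
        "order": "-announced_date_first",
        "size": size,
    }
    return "https://arxiv.org/search/?" + urlencode(params)

print(build_arxiv_search_url("LLM"))
print(build_arxiv_search_url("retrieval augmented generation"))
```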

In [None]:
documents_topic =  []

for paper in papers:
  doc  = Document(
      page_content = paper["abstract"],
      metadata = {"title": paper["title"]}
  )
  documents_topic.append(doc)

We embed these documents just as before

In [None]:
# We embed and store embeddings for the list of abstracts on our chosen topic
faiss_index_topic = FAISS.from_documents(
      documents=documents_topic,
      embedding=embedder
      )

And then we pick out one of the abstracts to make a query we can try to use for search retrieval.

In [None]:
# choose a number between 0-99 to pick out one of the indexed abstracts
abstract_nr = 10

documents_topic[abstract_nr]

In [None]:
# The system prompt will be placed at the top of every message and should set overall system behaviour
system_prompt = """
You are an expert in summarizing scientific literature.
"""

# We can try to set this quite low to make it hard for the retriever,
# noting that if several abstracts are on a similar topic then a very short summary
# should make it more difficult to retrieve the intended one
max_words_for_summary = 3

user_prompt = f"""
Consider carefully the abstract supplied below and construct a very brief summary of its content.
Try to use a maximum of {max_words_for_summary} words and do not include any mathematical formulas in your summary.

abstract: {documents_topic[abstract_nr].page_content}
"""


# These are collected into a messages list of
#
#   SystemMessage - the system prompt
#   HumanMessage  - the user query
#   AIMessage     - the bot response, in case you wish to continue on a conversation
messages = [
    SystemMessage(content=system_prompt),
    HumanMessage(content=user_prompt),
]

In [None]:
# This creates an instance of the model interface which we can subsequently call on
chat_model = ChatOpenAI(
    openai_api_key=os.environ['OPENAI_API_KEY'],
    # The below parameters can be changed
    model="gpt-3.5-turbo-1106",
    temperature=0.0
)


# Here we collect the output from the chat model in the variable response
response = chat_model.invoke(messages)

# We can print out the response by calling on its content using a .content
print(response.content)

Now let's see if we can find the needle in this proverbial haystack as easily as before.

In [None]:
# We define a generic query which incorporates the summary from the LLM
search_query = f"Find an article which discusses: {response.content}"


# Define how many similar documents you want to retrieve
# These are returned in sorted order, with most similar placed first
nr_hits = 5


# Use FAISS to perform similarity search ...
most_similar_topic = faiss_index_topic.similarity_search(query = search_query, k=nr_hits)


# Let's check that the closest retrieved match is the same abstract we used to construct the query
target_title = documents_topic[abstract_nr].metadata["title"]
retrieved_titles = [doc.metadata["title"] for doc in most_similar_topic]

if target_title == retrieved_titles[0]:
  print(Style.BRIGHT + "SUCCESS! We found the correct abstract as the top ranked choice!" + Style.RESET_ALL)
  print("--"*25)
  print(most_similar_topic[0])
# In case it was not the top pick, we check if it was among the ones retrieved from the vector store
elif target_title in retrieved_titles[1:]:
  print(Style.BRIGHT + f"PARTIAL SUCCESS! We found the correct abstract among the top {nr_hits} ranked choices!" + Style.RESET_ALL)
  print("--"*25)
  print(most_similar_topic[retrieved_titles.index(target_title)])
else:
  print(Style.BRIGHT + f"FAILURE! The correct abstract was not among the top {nr_hits} ranked choices!" + Style.RESET_ALL)

# Fine-Tuning

We are going to work further with sentence transformers and the abstracts dataset by fine-tuning open source models on the dataset. Be aware that fine-tuning in this case is just a demonstration. In a real case scenario, you would use more data and more specialised data for fine-tuning.

For fine-tuning, we need pairs consisting of a query and its corresponding abstract. In this example, we are going to use an OpenAI chat model to generate a query for each abstract.

In [None]:
from langchain.schema import SystemMessage
from langchain.schema import HumanMessage

chat_model = ChatOpenAI(
    openai_api_key=os.environ['OPENAI_API_KEY'],
    # The below parameters can be changed
    model="gpt-3.5-turbo-1106",
    temperature=0.0
)

def generate_questions(abstract):
    system_prompt = """
    You are an expert in scientific papers.
    """
    user_prompt = f"""
    Consider carefully the following abstract of a scientific paper: {abstract}. 
    Please provide a question to this abstract. Output only the question.
    """
    messages = [
    SystemMessage(content=system_prompt),
    HumanMessage(content=user_prompt),
    ]
    response = chat_model.invoke(messages)

    return response.content

After generating the queries, we save them in a JSON file so that we can reuse them without having to rerun the query generation.

In [None]:
import json

queries = []

for paper in papers:
    queries.append(generate_questions(paper['abstract']))

abstracts_with_queries = []

for query, paper in zip(queries, papers):
    abstracts_with_queries.append({"query": query, "abstract": paper["abstract"], "title": paper["title"]})

with open("abstracts_with_queries.json", "w") as json_file:
    json.dump(abstracts_with_queries, json_file)

In [None]:
queries[0]

Load your previously generated data from json.

In [None]:
import json

with open('abstracts_with_queries.json', 'r') as file:
    abstracts_with_queries_json = json.load(file)

queries = []
abstracts = []
for entry in abstracts_with_queries_json:
    queries.append(entry["query"])
    abstracts.append(entry["abstract"])

# E5

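In this section, we are going to fine-tune an E5 model on our data. First, we need to organise the input data. Each row should have a query and a passage, which in our case is the abstract. Note that the E5 model card recommends prefixing inputs with `query: ` and `passage: `; we leave the prefixes out here to keep the demonstration simple.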

In [None]:
from sentence_transformers import InputExample

train_examples = []
for query, abstract in zip(queries, abstracts):
    train_examples.append(InputExample(texts=[query.strip(), abstract.strip()]))

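When using MultipleNegativesRankingLoss, it is crucial to avoid duplicate entries within a batch, that is, identical queries or passages. The loss treats every other passage in the batch as a negative for a given query, so a duplicate would be counted as a wrong answer even though it is correct. This is why we use the `NoDuplicatesDataLoader`.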
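The idea behind this loss can be sketched in a few lines of plain Python. For each query, the similarities to all passages in the batch are scaled and fed into a softmax cross-entropy, where the passage at the same position is the correct answer. This is only an illustration under our own assumptions: we use cosine similarity and a scale of 20, which we believe match the library defaults, and the toy vectors are made up.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def in_batch_ranking_loss(query_vecs, passage_vecs, scale=20.0):
    """Mean cross-entropy where passage i is the correct answer for query i."""
    losses = []
    for i, query in enumerate(query_vecs):
        scores = [scale * cos_sim(query, passage) for passage in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)

# A batch with two distinct pairs: each query is close to its own passage only
loss_distinct = in_batch_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])

# A batch where the same pair appears twice: the duplicate acts as a false negative
loss_duplicate = in_batch_ranking_loss([[1.0, 0.0], [1.0, 0.0]], [[0.9, 0.1], [0.9, 0.1]])

print(f"loss with distinct pairs:  {loss_distinct:.4f}")
print(f"loss with duplicate pairs: {loss_duplicate:.4f}")
```

With duplicates in the batch the loss cannot drop below log(2), no matter how good the embeddings are, since the model is asked to tell two identical passages apart.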

In [None]:
from sentence_transformers import (
    SentenceTransformer,
    losses,
    models,
    datasets,
)

word_embedding = models.Transformer("intfloat/e5-small-v2")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
e5_model = SentenceTransformer(modules=[word_embedding, pooling])

train_dataloader = datasets.NoDuplicatesDataLoader(train_examples, batch_size=8)
train_loss = losses.MultipleNegativesRankingLoss(e5_model)

num_epochs = 1
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
e5_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    show_progress_bar=True,
)

After training, we create corresponding embeddings for each abstract.

In [None]:
e5_embeddings = e5_model.encode(abstracts)

Next, we encode our user query and get the closest abstract embedding to it based on cosine similarity.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def get_closest_abstract(query, model, embeddings):
    target_embedding = model.encode(query)
    similarities = [cosine_similarity([target_embedding], [emb])[0][0] for emb in embeddings]
    return abstracts[np.argmax(similarities)]

In [None]:
query = "What are the implications of the observed decline in GPT-4's performance?"
get_closest_abstract(query, e5_model, e5_embeddings)

# BGE

At the time of writing, the BGE models rank among the best performing open source embedding models on the MTEB leaderboard.

In [None]:
!pip install -U FlagEmbedding

We can fine-tune this model with a custom function that comes with its pip package. In order to use it, we need to prepare our data in jsonl format, with one JSON object per line. Example: `{"query": "Five women walk along a beach wearing flip-flops.", "pos": ["Some women with flip-flops on, are walking along the beach"], "neg": ["The 4 women are sitting on the beach.", "There was a reform in 1996.", "She's not going to court to clear her record.", "The man is talking about hawaii.", "A woman is standing outside.", "The battle was over. ", "A group of people plays volleyball."]}`
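Before starting a training run it can be worth checking that every line of the jsonl file has the expected shape. The sketch below is a small validator written under the assumption that each line needs a string `query` and list-valued `pos` and `neg` fields, as in the example above. The function name `validate_jsonl` is hypothetical and the toy data lives in memory rather than in a file.

```python
import io
import json

# Expected fields and their types, taken from the example record above
REQUIRED_KEYS = {"query": str, "pos": list, "neg": list}

def validate_jsonl(lines):
    """Parse jsonl lines and return a list of (line_number, problem) tuples."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore empty lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((line_number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "not a JSON object"))
            continue
        for key, expected_type in REQUIRED_KEYS.items():
            if not isinstance(record.get(key), expected_type):
                problems.append((line_number, f"'{key}' missing or not a {expected_type.__name__}"))
    return problems

# Two toy records: the second one has a plain string where a list is expected
toy_file = io.StringIO(
    json.dumps({"query": "q1", "pos": ["p1"], "neg": ["n1", "n2"]}) + "\n"
    + json.dumps({"query": "q2", "pos": "p2", "neg": []}) + "\n"
)

problems = validate_jsonl(toy_file)
for line_number, problem in problems:
    print(f"line {line_number}: {problem}")
```

The same function can be applied to a real file by passing an open file handle, since iterating over a file also yields one line at a time.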

In [None]:
import random

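def generate_random_except(start_range, end_range, count, exception):
    """Return `count` distinct random integers in [start_range, end_range], excluding `exception`."""
    # Collect all allowed values so that we can never loop forever looking for a free one
    candidates = [num for num in range(start_range, end_range + 1) if num != exception]
    if count > len(candidates):
        raise ValueError("Count should not exceed the number of available values.")

    return random.sample(candidates, count)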

In [None]:
queries_with_positives_and_negatives = []

for counter, entry in enumerate(abstracts_with_queries_json):
    random_indexes = generate_random_except(0, len(abstracts_with_queries_json)-1, 4, counter)
    entries = [abstracts_with_queries_json[i] for i in random_indexes]
    negative_list = [negative['abstract'] for negative in entries]
    queries_with_positives_and_negatives.append({"query": entry["query"], "pos": [entry["abstract"]], "neg": negative_list})

with open('queries_with_positives_and_negatives.jsonl', 'w') as jsonl_file:
    for entry in queries_with_positives_and_negatives:
        json.dump(entry, jsonl_file)
        jsonl_file.write('\n')

In [None]:
!mkdir -p bge_abstracts

We run the fine-tuning script.

In [None]:
!torchrun -m FlagEmbedding.baai_general_embedding.finetune.run \
    --output_dir bge_abstracts \
    --model_name_or_path BAAI/bge-small-en-v1.5 \
    --train_data ./queries_with_positives_and_negatives.jsonl \
    --learning_rate 1e-5 \
    --num_train_epochs 1 \
    --dataloader_drop_last True \
    --normlized True \
    --temperature 0.02 \
    --query_max_len 64 \
    --passage_max_len 256 \
    --train_group_size 2 \
    --logging_steps 10 \
    --query_instruction_for_retrieval "" 

We create corresponding embeddings for each abstract with our new fine-tuned model.

In [None]:
from FlagEmbedding import FlagModel

bge_model = FlagModel('./bge_abstracts/', use_fp16=True)
bge_embeddings = bge_model.encode(abstracts)

Next, we encode our user query and get the closest abstract embedding to it based on cosine similarity. 

In [None]:
query = "What are the implications of the observed decline in GPT-4's performance?"
get_closest_abstract(query, bge_model, bge_embeddings)

# Mixed Models

Fine-tuning the foundational BGE model might boost its effectiveness for the specific task at hand, yet it could cause a significant decline in the model's overall abilities outside that specific area. Merging the fine-tuned model with the base model using LM-Cocktail can help recover performance on general tasks.
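Conceptually, merging two models of the same architecture with given weights amounts to taking a weighted average of their parameters, name by name. The sketch below illustrates that idea on tiny dictionaries of made-up numbers. It is our own simplified illustration and not LM-Cocktail's actual implementation; `mix_parameters` and the toy parameter names are hypothetical.

```python
def mix_parameters(state_dicts, weights):
    """Weighted average of parameter lists that share the same names and shapes."""
    if len(state_dicts) != len(weights):
        raise ValueError("Need exactly one weight per model.")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("Weights should sum to 1.")

    mixed = {}
    for name in state_dicts[0]:
        values = [state_dict[name] for state_dict in state_dicts]
        mixed[name] = [
            sum(weight * value[i] for weight, value in zip(weights, values))
            for i in range(len(values[0]))
        ]
    return mixed

# Toy "models" with a single layer each (hypothetical values)
base_params = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
tuned_params = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}

mixed_params = mix_parameters([base_params, tuned_params], weights=[0.5, 0.5])
print(mixed_params)
```

With equal weights the merged parameters sit halfway between the base and the fine-tuned model, which is the intuition behind trading some task-specific gain for better general behaviour.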

In [None]:
!mkdir -p mixed_bge

In [None]:
!pip install -U LM_Cocktail

In [None]:
from LM_Cocktail import mix_models, mix_models_with_data

model = mix_models(
    model_names_or_paths=["BAAI/bge-small-en-v1.5", "./bge_abstracts/"],
    model_type='encoder', 
    weights=[0.5, 0.5],
    output_path="./mixed_bge")

In [None]:
from FlagEmbedding import FlagModel

mixed_model = FlagModel('./mixed_bge', use_fp16=True)
mixed_embeddings = mixed_model.encode(abstracts)

In [None]:
query = "What are the implications of the observed decline in GPT-4's performance?"
get_closest_abstract(query, mixed_model, mixed_embeddings)