## TODO
- Integrate the arXiv API with the similarity computation (the user query can double as the arXiv API search query).
- Combine a chatbot with the retrieved papers.
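
For the first item, a minimal sketch of turning the user query into an arXiv API request URL might look like the following. The helper name `build_arxiv_query_url` and the choice of the `all:` field prefix are assumptions made for illustration; they are not part of this notebook. Only the URL is built here. Fetching and parsing the Atom response is left out.

```python
from urllib.parse import urlencode

# Base endpoint of the public arXiv API
ARXIV_API_URL = "http://export.arxiv.org/api/query"


def build_arxiv_query_url(user_query: str, max_results: int = 10) -> str:
    """Build an arXiv API search URL from a free-text query (hypothetical helper)."""
    params = {
        "search_query": f"all:{user_query}",  # assumption: search across all fields
        "start": 0,                            # offset of the first result
        "max_results": max_results,            # number of results to request
    }
    # urlencode percent-encodes the query string (spaces, colons, ...)
    return f"{ARXIV_API_URL}?{urlencode(params)}"


url = build_arxiv_query_url("re id", max_results=5)
print(url)
```

The same `query` string could then be passed both to this helper and to `model.encode`, so that the papers fetched from arXiv are ranked against the query that retrieved them.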


In [1]:
import sqlite3
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer



In [2]:
# Connect to the SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect("arxiv_papers.db")
cur = conn.cursor()

# Load the sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2') 

# NOTE: sanity check, the title of a paper
query = "re id"

# Get the vector for the query
query_embedding = model.encode([query])

# Fetch papers from the database
cur.execute("SELECT id, title, summary FROM papers")
papers = cur.fetchall()

# Encode all paper summaries in one batched call (paper[2] is the summary)
paper_embeddings = model.encode([paper[2] for paper in papers])

# Compute cosine similarities between the query and every paper summary at once
scores = cosine_similarity(query_embedding, paper_embeddings)[0]

# Pair each paper with its score: (paper, similarity score)
similarities = [(paper, float(score)) for paper, score in zip(papers, scores)]

# Sort papers by similarity, highest first
similarities.sort(key=lambda x: x[1], reverse=True)

# Print the 10 most similar papers
print("Most similar papers to your query:")
for paper, similarity in similarities[:10]:
    print(f"ID: {paper[0]}")
    print(f"Title: {paper[1]}")
    print(f"Similarity: {similarity:.4f}")
    print(f"Summary: {paper[2]}")
    print('-' * 80)

# Close the connection to the database
conn.close()

Most similar papers to your query:
ID: 1
Title: Unsupervised Person Re-Identification: A Systematic Survey of Challenges and Solutions
Similarity: 0.4352
Summary: Person re-identification (Re-ID) has been a significant research topic in the
past decade due to its real-world applications and research significance. While
supervised person Re-ID methods achieve superior performance over unsupervised
counterparts, they can not scale to large unlabelled datasets and new domains
due to the prohibitive labelling cost. Therefore, unsupervised person Re-ID has
drawn increasing attention for its potential to address the scalability issue
in person Re-ID. Unsupervised person Re-ID is challenging primarily due to
lacking identity labels to supervise person feature learning. The corresponding
solutions are diverse and complex, with various merits and limitations.
Therefore, comprehensive surveys on this topic are essential to summarise
challenges and solutions to foster future research. Existing pe