## Embedding Model

This notebook uses [SciNCL](https://huggingface.co/malteos/scincl), an open-source model from Hugging Face, to embed paper titles and abstracts.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [1]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('/mnt/c/Users/ankit/Desktop/Portfolio/Data/checkpoints/scincl')
model = AutoModel.from_pretrained('/mnt/c/Users/ankit/Desktop/Portfolio/Data/checkpoints/scincl')

In [4]:
df = pd.read_csv('arxiv-scrape')
df.head()

Unnamed: 0,Title,Abstract,Date,id
0,The Expanding Scope of the Stability Gap: Unve...,Recent research identified a temporary perform...,2024-06-07,2406.05114v1
1,Multiplane Prior Guided Few-Shot Aerial Scene ...,Neural Radiance Fields (NeRF) have been succes...,2024-06-07,2406.04961v1
2,MA-AVT: Modality Alignment for Parameter-Effic...,Recent advances in pre-trained vision transfor...,2024-06-07,2406.04930v1
3,MeLFusion: Synthesizing Music from Image and L...,Music is a universal language that can communi...,2024-06-07,2406.04673v1
4,M&M VTO: Multi-Garment Virtual Try-On and Editing,"We present M&M VTO, a mix and match virtual tr...",2024-06-06,2406.04542v1


In [6]:
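import torch

# Join each title and abstract with the tokenizer's separator token.
title_abs_2 = [title + tokenizer.sep_token + abstract for title, abstract in zip(df['Title'], df['Abstract'])]

# Embed the first 15 papers; 15 inputs was the sweet spot.
inputs_2 = tokenizer(title_abs_2[:15], padding=True, truncation=True, return_tensors="pt", max_length=512)

with torch.no_grad():  # inference only, so skip gradient tracking
    result_2 = model(**inputs_2)

# Use the hidden state of the first ([CLS]) token as the paper embedding.
embeddings_2 = result_2.last_hidden_state[:, 0, :]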

In [7]:
embeddings_2.shape

torch.Size([15, 768])
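
The cell above embeds only the first 15 papers. A minimal sketch of how the full list of papers could be split into batches of that size follows. `build_inputs` and `batched` are hypothetical helper names, the default `"[SEP]"` separator is an assumption standing in for `tokenizer.sep_token`, and the toy titles and abstracts stand in for `df['Title']` and `df['Abstract']`.

```python
from typing import Iterator, List, Sequence


def build_inputs(titles: Sequence[str], abstracts: Sequence[str], sep_token: str = "[SEP]") -> List[str]:
    """Join each title and abstract with the separator token (hypothetical helper)."""
    return [title + sep_token + abstract for title, abstract in zip(titles, abstracts)]


def batched(items: Sequence[str], batch_size: int = 15) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy data standing in for df['Title'] and df['Abstract'].
titles = [f"Title {i}" for i in range(40)]
abstracts = [f"Abstract {i}" for i in range(40)]

texts = build_inputs(titles, abstracts)
batch_sizes = [len(batch) for batch in batched(texts)]
print(batch_sizes)  # [15, 15, 10]
```

Each batch could then go through the tokenizer and model exactly as in the cell above, with the resulting embeddings concatenated at the end.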

In [24]:
a = np.array([1, 2, 3, 4, 5]).tolist()
b = np.array([['a', 'z'],
        ['b', 'x'],
        ['c', 'c'],
        ['d', 'v'],
        ['e', 'b']]).tolist()
temp = [{'id' : a[i], 'values' : b[i]} for i in range(5)]
temp

[{'id': 1, 'values': ['a', 'z']},
 {'id': 2, 'values': ['b', 'x']},
 {'id': 3, 'values': ['c', 'c']},
 {'id': 4, 'values': ['d', 'v']},
 {'id': 5, 'values': ['e', 'b']}]

In [27]:
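# Convert each embedding to a plain list of floats before sending it to the vector database.
embedding_vector = [{'id': df['id'][i], 'values': embeddings_2[i].detach().tolist()} for i in range(15)]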
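
The records follow the `{'id': ..., 'values': ...}` shape tried out in the toy cell above. A hedged sketch of building them with plain Python floats and a length check is shown here. `to_records` is a hypothetical helper, and the 3-dimensional toy vectors are an assumption standing in for the 768-dimensional embeddings.

```python
from typing import Dict, Iterable, List, Sequence, Union

Record = Dict[str, Union[str, List[float]]]


def to_records(ids: Iterable[str], vectors: Iterable[Sequence[float]], dimension: int = 768) -> List[Record]:
    """Pair each id with its vector and check the vector length (hypothetical helper)."""
    records: List[Record] = []
    for paper_id, vector in zip(ids, vectors):
        values = [float(x) for x in vector]
        if len(values) != dimension:
            raise ValueError(f"{paper_id}: expected {dimension} values, got {len(values)}")
        records.append({"id": str(paper_id), "values": values})
    return records


# Toy 3-dimensional vectors standing in for the real embeddings.
records = to_records(
    ["2406.05114v1", "2406.04961v1"],
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    dimension=3,
)
print(records[0])  # {'id': '2406.05114v1', 'values': [0.1, 0.2, 0.3]}
```

With the real data, the vectors could come from `embeddings_2.detach().tolist()`, which yields one list of floats per paper.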

## Vector Database

Using the standard version of Pinecone.

For the current state of the project, the free tier is sufficient.

In [29]:
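import os
from pinecone import Pinecone, ServerlessSpec

# Read the key from the environment instead of hardcoding it (PINECONE_API_KEY is an assumed variable name).
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])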

In [30]:
index_name = "pap-rec-sys-index"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,
        metric="cosine",
        spec=ServerlessSpec(
            cloud='aws', 
            region='us-east-1'
        ) 
    ) 
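
The index is created with `metric="cosine"`, so the scores returned by later queries are cosine similarities between the query vector and the stored vectors. A minimal, dependency-free sketch of that quantity is given below. `cosine_similarity` is an illustrative function and not part of the Pinecone client.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

Because the measure depends only on direction, scaling an embedding does not change its score.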


In [31]:
index = pc.Index(index_name)

index.upsert(
    vectors=embedding_vector,
    namespace='cs.CV'
)


{'upserted_count': 15}

In [55]:
print(index.describe_index_stats())

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'cs.CV': {'vector_count': 15}},
 'total_vector_count': 15}


In [41]:
import arxiv
client = arxiv.Client()

positive_id = '2003.08934v2'  # the version suffix is required; without it the request fails with a 400 error
negative_id = '2406.05126v1'
adversarial_id = '2310.06816v1'
ids = [positive_id, negative_id, adversarial_id]


search = arxiv.Search(
    id_list= ids,
    max_results= 10,
)

# Fetch once and reuse; each client.results(search) call triggers a new API request.
results = list(client.results(search))

df_2 = pd.DataFrame({'Title': [result.title for result in results],
              'Abstract': [result.summary.replace('\n', ' ') for result in results],
              'Date': [result.published.date().strftime('%Y-%m-%d') for result in results],
              'id': [result.entry_id.replace('http://arxiv.org/abs/', '') for result in results]})

df_2.head()

Unnamed: 0,Title,Abstract,Date,id
0,NeRF: Representing Scenes as Neural Radiance F...,We present a method that achieves state-of-the...,2020-03-19,2003.08934v2
1,GR-Athena++: magnetohydrodynamical evolution w...,We present a self-contained overview of GR-Ath...,2024-06-07,2406.05126v1
2,Text Embeddings Reveal (Almost) As Much As Text,How much private information do text embedding...,2023-10-10,2310.06816v1
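
The `id` column is obtained by stripping the URL prefix from `entry_id`, and the version suffix is kept. A small sketch of a more defensive way to split such an identifier into its base id and version follows. `split_arxiv_id` is a hypothetical helper, and the assumption is that versions always look like `v` followed by digits.

```python
import re
from typing import Tuple


def split_arxiv_id(entry_id: str) -> Tuple[str, str]:
    """Return (base_id, version) from an entry URL or a bare id (hypothetical helper)."""
    # Keep only the part after '/abs/' when a full URL is given.
    identifier = entry_id.rsplit("/abs/", 1)[-1]
    match = re.fullmatch(r"(.+?)(v\d+)?", identifier)
    if match is None:
        raise ValueError(f"not an arXiv identifier: {entry_id!r}")
    return match.group(1), match.group(2) or ""


print(split_arxiv_id("http://arxiv.org/abs/2003.08934v2"))  # ('2003.08934', 'v2')
print(split_arxiv_id("2406.05126"))                         # ('2406.05126', '')
```

Splitting on `/abs/` rather than replacing a fixed prefix also handles `https` URLs.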


In [42]:
title_abs_3 = [title + tokenizer.sep_token + abstract for title, abstract in zip(df_2['Title'], df_2['Abstract'])]

inputs_3 = tokenizer(title_abs_3, padding=True, truncation=True, return_tensors="pt", max_length=512)

result_3 = model(**inputs_3)

embeddings_3 = result_3.last_hidden_state[:, 0, :]

In [50]:
positive_example = embeddings_3[0].detach().numpy().tolist()
negative_example = embeddings_3[1].detach().numpy().tolist()
adversarial_example = embeddings_3[2].detach().numpy().tolist()

In [56]:
positive_query = index.query(
    namespace='cs.CV',
    vector=positive_example,
    top_k=3,
    include_values=False
)

negative_query = index.query(
    namespace='cs.CV',
    vector=negative_example,
    top_k=3,
    include_values=False
)

adversarial_query = index.query(
    namespace='cs.CV',
    vector=adversarial_example,
    top_k=3,
    include_values=False
)
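
The three queries above differ only in the vector they send. A hedged sketch of collapsing them into one loop is shown below. `query_all` is a hypothetical helper that uses the same `index.query` keyword arguments as the cell above, and `FakeIndex` is a stand-in written only so the sketch runs without a Pinecone connection.

```python
from typing import Any, Dict, List


def query_all(index: Any, examples: Dict[str, List[float]], namespace: str = "cs.CV", top_k: int = 3) -> Dict[str, Any]:
    """Run one query per named example vector and collect the responses (hypothetical helper)."""
    return {
        name: index.query(namespace=namespace, vector=vector, top_k=top_k, include_values=False)
        for name, vector in examples.items()
    }


class FakeIndex:
    """Offline stand-in for the Pinecone index; it only echoes its arguments."""

    def query(self, namespace: str, vector: List[float], top_k: int, include_values: bool) -> Dict[str, Any]:
        return {"namespace": namespace, "top_k": top_k, "vector_length": len(vector)}


responses = query_all(FakeIndex(), {"positive": [0.1, 0.2], "negative": [0.3, 0.4]})
print(sorted(responses))  # ['negative', 'positive']
```

With the real index, the dictionary passed in would hold `positive_example`, `negative_example`, and `adversarial_example`.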

In [57]:
print(negative_query)

{'matches': [{'id': '2406.04155v1', 'score': 0.742778301, 'values': []},
             {'id': '2406.03723v1', 'score': 0.704175055, 'values': []},
             {'id': '2406.04111v1', 'score': 0.7000646, 'values': []}],
 'namespace': 'cs.CV',
 'usage': {'read_units': 5}}


In [58]:
print(adversarial_query)

{'matches': [{'id': '2406.04542v1', 'score': 0.767109, 'values': []},
             {'id': '2406.04322v2', 'score': 0.751672, 'values': []},
             {'id': '2406.04673v1', 'score': 0.745754957, 'values': []}],
 'namespace': 'cs.CV',
 'usage': {'read_units': 5}}


In [59]:
print(positive_query)

{'matches': [{'id': '2406.03723v1', 'score': 0.903211415, 'values': []},
             {'id': '2406.04961v1', 'score': 0.889982164, 'values': []},
             {'id': '2406.04322v2', 'score': 0.875168204, 'values': []}],
 'namespace': 'cs.CV',
 'usage': {'read_units': 5}}


The negative query does not belong to Computer Science, so a paper like it will never be seen during inference.

The adversarial query is from cs.CL, the NLP category. The model still separates it well: its top scores (about 0.75 to 0.77) stay clearly below those of the positive query (about 0.88 to 0.90), even though both categories fall under the AI umbrella.

The positive query is a cs.CV paper from a few years back, so it is a valid positive example.
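
The scores printed above can be compared directly. The sketch below copies them from the three query outputs and computes the mean score per query, plus the gap between the weakest positive match and the strongest match of the other two queries. The variable names are illustrative only.

```python
from statistics import mean

# Scores copied from the three query outputs printed above.
scores = {
    "positive": [0.903211415, 0.889982164, 0.875168204],
    "adversarial": [0.767109, 0.751672, 0.745754957],
    "negative": [0.742778301, 0.704175055, 0.7000646],
}

mean_scores = {name: round(mean(values), 3) for name, values in scores.items()}
print(mean_scores)  # {'positive': 0.889, 'adversarial': 0.755, 'negative': 0.716}

# Gap between the weakest positive match and the strongest match of any other query.
gap = min(scores["positive"]) - max(scores["adversarial"] + scores["negative"])
print(round(gap, 3))  # 0.108
```

With only three queries against 15 stored vectors, this gap is an indication rather than a calibrated threshold.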

In [60]:
search = arxiv.Search(
    id_list= ['2406.03723v1'],
    max_results= 2,
)

results = list(client.results(search))  # fetch once and reuse

pd.DataFrame({'Title': [result.title for result in results],
              'Abstract': [result.summary.replace('\n', ' ') for result in results],
              'Date': [result.published.date().strftime('%Y-%m-%d') for result in results],
              'id': [result.entry_id.replace('http://arxiv.org/abs/', '') for result in results]}).head()

Unnamed: 0,Title,Abstract,Date,id
0,Gear-NeRF: Free-Viewpoint Rendering and Tracki...,Extensions of Neural Radiance Fields (NeRFs) t...,2024-06-06,2406.03723v1


In [61]:
search = arxiv.Search(
    id_list= ['2406.04961v1'],
    max_results= 2,
)

results = list(client.results(search))  # fetch once and reuse

pd.DataFrame({'Title': [result.title for result in results],
              'Abstract': [result.summary.replace('\n', ' ') for result in results],
              'Date': [result.published.date().strftime('%Y-%m-%d') for result in results],
              'id': [result.entry_id.replace('http://arxiv.org/abs/', '') for result in results]}).head()

Unnamed: 0,Title,Abstract,Date,id
0,Multiplane Prior Guided Few-Shot Aerial Scene ...,Neural Radiance Fields (NeRF) have been succes...,2024-06-07,2406.04961v1


Both papers mention NeRF in their title or abstract, so the high similarity scores for the positive query are valid.

The project looks promising.

##### Possible Next Step

Zotero Integration