# RAG Pipeline

### Pinecone

Pinecone is the vector database used to store all the embeddings created for the CORD-19 documents.

All the embeddings are stored within a single index in Pinecone. For this tutorial, we split the documents across two namespaces, one for titles and one for abstracts.

In [33]:
from pinecone import Pinecone, ServerlessSpec
from Keys import api_key

pine = Pinecone(api_key=api_key)

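In [35]:
index_name = "corddocument"

# Create a serverless index; dimension 768 matches the hidden size of BioBERT base.
pine.create_index(
    name=index_name,
    dimension=768,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

In [66]:
# Connect to the index once it has been created.
index = pine.Index(index_name)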
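
Running the `create_index` cell a second time typically fails because the index already exists. A minimal sketch of a guard is shown below. The helper name `ensure_index` is hypothetical, and it assumes the client exposes `list_indexes().names()` and `create_index(...)` as recent versions of the Pinecone Python client do; check this against the installed version.

```python
def ensure_index(client, name, **create_kwargs):
    """Create the index only if no index with this name exists yet.

    `client` is assumed to behave like a `pinecone.Pinecone` instance:
    `list_indexes().names()` returns the existing index names and
    `create_index(name=..., ...)` creates a new index.
    Returns True if the index was created, False if it already existed.
    """
    existing = client.list_indexes().names()
    if name in existing:
        return False
    client.create_index(name=name, **create_kwargs)
    return True


# Usage with the objects defined above (requires a valid API key):
# ensure_index(pine, "corddocument", dimension=768, metric="cosine",
#              spec=ServerlessSpec(cloud="aws", region="us-east-1"))
```

Because the helper only relies on two methods of the client, it can be exercised with a stub object before pointing it at a real account.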

## CORD-19 Dataset and BioBERT Embeddings

In [9]:
from datasets import load_dataset

data = load_dataset("cord19", "metadata")

In [10]:
df = data["train"].to_pandas()
df = df[["title", "abstract"]]

In [11]:
from transformers import AutoModel, AutoTokenizer

In [25]:
titles = list(df["title"])[:20]

In [53]:
model = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

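import torch

title_embeddings = []
for i, title in enumerate(titles):
    print(i)  # progress indicator
    tokens = tokenizer(title, return_tensors="pt", truncation=True, padding=True)
    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        # Token-level embeddings with shape (1, sequence_length, 768).
        embedding = model(**tokens).last_hidden_state
    title_embeddings.append(embedding)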

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19


In [58]:
# Mean-pool over the token dimension to get one vector of shape (1, 768) per title.
avg_embeddings = [embedding.mean(dim=1) for embedding in title_embeddings]

## Upserting the Data to the Database

In [76]:
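# .loc[:10] is label-based and inclusive, so with the default integer index
# this selects the first 11 titles.
titles = list(df["title"].loc[:10])
# Pair each title (used as the record ID) with its pooled embedding.
vectors = [
    {"id": title, "values": embedding.squeeze(0).tolist()}
    for title, embedding in zip(titles, avg_embeddings)
]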
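
Titles are convenient IDs for a demo, but they can be long, duplicated, or contain non-ASCII characters, which a vector database may not accept as record IDs. One possible alternative, sketched below, derives a short ASCII ID from a hash of the title and keeps the readable title in the record's `metadata` field. The helper name `make_record` and the choice of SHA-1 are assumptions for illustration, not part of the original notebook.

```python
import hashlib


def make_record(title, values):
    """Build one Pinecone-style record with a hash-derived ID.

    The ID is the SHA-1 hex digest of the UTF-8 encoded title, so it is
    always 40 ASCII characters; the readable title is kept in metadata.
    """
    record_id = hashlib.sha1(title.encode("utf-8")).hexdigest()
    return {"id": record_id, "values": list(values), "metadata": {"title": title}}


# Toy 3-dimensional vector for illustration; the real vectors have 768 values.
record = make_record("Étude clinique du SARS-CoV-2", [0.1, -0.2, 0.3])
print(len(record["id"]), record["id"].isascii())  # prints: 40 True
```

The same title always maps to the same ID, so upserting a record again overwrites it rather than creating a duplicate.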

In [77]:
vectors[0]

{'id': 'Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia',
 'values': [0.06598751246929169,
  -0.28783485293388367,
  -0.17019301652908325,
  0.05448983982205391,
  0.2601620852947235,
  -0.2154407799243927,
  -0.06793496012687683,
  0.3262845575809479,
  -0.11420939862728119,
  0.2289523482322693,
  0.2642815113067627,
  -0.1677493005990982,
  -0.1778506487607956,
  0.20629552006721497,
  0.06147000566124916,
  -0.0898958072066307,
  -0.08054717630147934,
  -0.2641662061214447,
  0.07998454570770264,
  0.29563969373703003,
  -0.26102912425994873,
  0.036744195967912674,
  -0.2422880381345749,
  0.11137557029724121,
  0.2374747395515442,
  0.027484247460961342,
  0.02110290713608265,
  0.3923691511154175,
  -0.10442624986171722,
  0.31602081656455994,
  -0.09790278971195221,
  -0.049486320465803146,
  -0.008141575381159782,
  -0.3722347617149353,
  0.048593003302812576,
  0.059041574597358704,
  0.155489161

In [78]:
index.upsert(vectors, namespace="cord-titles")

{'upserted_count': 11}
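
A single `upsert` call is fine for 11 records. For the full dataset, requests are usually sent in smaller batches. The sketch below shows one way to do that. The helper names `chunked` and `upsert_in_batches` are hypothetical, the batch size of 100 is an illustrative default rather than a documented limit, and the code assumes the response can be read like a dictionary with an `upserted_count` key, as the output above suggests.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(index, vectors, namespace, batch_size=100):
    """Upsert `vectors` batch by batch and return the total upserted count.

    `index` is assumed to expose `upsert(vectors=..., namespace=...)` and to
    return a mapping-like response that contains 'upserted_count'.
    """
    total = 0
    for batch in chunked(vectors, batch_size):
        response = index.upsert(vectors=batch, namespace=namespace)
        total += response["upserted_count"]
    return total


# Usage with the objects defined above (requires a live index):
# upsert_in_batches(index, vectors, namespace="cord-titles")
```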