# Upload PDFs to a Vector Database

## Overview
This notebook guides you through uploading a sample PDF dataset to a vector database.
You should already have a sample [Milvus](https://milvus.io/docs/install_standalone-docker.md) vector database set up from the Workbench project and configured to run on port `19530`.
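
Before running the rest of the notebook, it can help to confirm that something is actually listening on that port. The following is a minimal sketch using only the standard library; the host name `localhost` and the helper name `is_port_open` are assumptions for illustration, so substitute the host from your own Milvus URL.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


# "localhost" is an assumption; 19530 is the port mentioned above.
print("Milvus reachable:", is_port_open("localhost", 19530))
```

An open port only shows that a process is accepting connections; it does not prove that the process is Milvus or that it is healthy.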

In [1]:
!pip install pypdf

Defaulting to user installation because normal site-packages is not writeable
Collecting pypdf
  Downloading pypdf-5.4.0-py3-none-any.whl (302 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.4.0


## Unzip Dataset
The dataset used in this example consists of PDF files containing NVIDIA blogs and press releases. These files have been scraped and stored in `../data/corp-comms-dataset.zip`.

In [2]:
!unzip -n ../data/corp-comms-dataset.zip -d ../data/

Archive:  ../data/corp-comms-dataset.zip
   creating: ../data/dataset/
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjIvMTIvMDEvbW9uZGF2aS1tb25hcmNoLXNtYXJ0LWVsZWN0cmljLWpldHNvbi10cmFjdG9yLw==.pdf  
  inflating: ../data/dataset/RGVsbCBUZWNoIDUvMjMvMjMucGRm.pdf  
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjMvMDMvMjIvc3V0c2tldmVyLW9wZW5haS1ndGMv.pdf  
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjMvMDcvMTAvdHJlay1iaWN5Y2xlLXRvdXItZGUtZnJhbmNlLWdwdXMv.pdf  
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjMvMDcvMTIvbW9zYWljbWwv.pdf  
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjMvMDIvMTYvYWktbWV0YXZlcnNlLXNoYXBpbmctYXV0b21vdGl2ZS1pbmR1c3RyeS1ndGMv.pdf  
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjMvMDQvMjEvZXBpYy1iZW5lZml0cy1vbW5pdmVyc2UtY29ubmVjdG9yLXVucmVhbC1lbmdpbmUv.pdf  
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjIvMTEvMTcveHItdGVjaG5vbG9naWVzLw==.pdf  
  inflati
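
The file names in the archive listing look like base64-encoded source identifiers (for example, blog URLs). This is an observation from the names themselves, not something the dataset documents. Assuming standard base64 encoding, a small helper (the name `decode_dataset_name` is hypothetical) can turn a file name back into readable text:

```python
import base64
from pathlib import Path


def decode_dataset_name(file_name: str) -> str:
    """Decode the base64 stem of a dataset file name into readable text."""
    stem = Path(file_name).stem  # drop the trailing ".pdf"
    return base64.b64decode(stem).decode("utf-8")


# Two file names taken from the unzip output above.
print(decode_dataset_name("YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjMvMDcvMTIvbW9zYWljbWwv.pdf"))
# blogs.nvidia.com/blog/2023/07/12/mosaicml/
print(decode_dataset_name("RGVsbCBUZWNoIDUvMjMvMjMucGRm.pdf"))
# Dell Tech 5/23/23.pdf
```

Decoding the names this way makes it easier to see which blog post or press release a given PDF came from.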

## Set Up NVIDIA Embedding Model
This model, [embed-qa-4](https://build.nvidia.com/nvidia/embed-qa-4), is a fine-tuned E5-large model deployed as a NIM and hosted on the [NVIDIA API catalog](https://build.nvidia.com/). 


*⚠️* Be sure to populate the app's config variables, including the NVIDIA API key, before running the next cell!

In [1]:
from chain_server.configuration import config
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# NVIDIA API keys have the form "nvapi-xxx". If you have not configured this variable, do so before running this cell.
embedding_model = NVIDIAEmbeddings(
    model=config.embedding_model.name,
    base_url=str(config.embedding_model.url),
    api_key=config.nvidia_api_key,
    truncate="END",  # if an input exceeds the model's token limit, drop tokens from the end
)

## Set Up Milvus Vector Database
[Milvus](https://milvus.io/docs/install_standalone-docker.md) should already be running through NVIDIA Workbench. Milvus is a database that stores, indexes, and manages large collections of embedding vectors.

In [2]:
print(config.milvus.collection_name)

collection_1


In [3]:
from langchain_milvus.vectorstores.milvus import Milvus

vector_store = Milvus(
    embedding_function=embedding_model,
    connection_args={"uri": config.milvus.url},
    collection_name=config.milvus.collection_name,
    auto_id=True,
)

## Upload PDFs to Milvus Vector Database

In [4]:
import glob

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

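def upload_document(file_path):
    """Load one PDF, split it into chunks, and add the chunks to the vector store."""
    # PyPDFLoader returns one Document per page.
    loader = PyPDFLoader(str(file_path))
    data = loader.load()
    # Split pages into smaller chunks using the splitter's default settings.
    text_splitter = RecursiveCharacterTextSplitter()
    all_splits = text_splitter.split_documents(data)
    # Each chunk is embedded with the configured model and stored in Milvus.
    vector_store.add_documents(documents=all_splits)

    return f"uploaded {file_path}"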

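def upload_pdf_files(folder_path, num_files):
    """Upload at most `num_files` PDFs found directly inside `folder_path`."""
    # glob does not guarantee an order, so which files are picked may vary.
    pdf_paths = glob.glob(f"{folder_path}/*.pdf")
    for file_path in pdf_paths[:max(num_files, 0)]:
        print(upload_document(file_path))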
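
`upload_document` creates `RecursiveCharacterTextSplitter()` with no arguments, so it relies on the library's default chunk size and overlap (as far as we are aware, 4000 characters with a 200-character overlap in recent LangChain releases). To illustrate what chunking with overlap means, here is a simplified sketch. It is a plain fixed-size window splitter, not the recursive separator-aware algorithm LangChain uses, and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters.

    Simplified stand-in: the real splitter also prefers to break on
    separators such as paragraph and line breaks.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means text near a chunk boundary appears in both neighbouring chunks, which reduces the chance that a relevant sentence is cut in half at retrieval time.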

In [9]:
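NUM_DOCS_TO_UPLOAD = 1
# Only PDFs directly inside the given folder are matched; the unzipped dataset lives in ../data/dataset.
upload_pdf_files("../data", NUM_DOCS_TO_UPLOAD)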

Ignoring wrong pointing object 10 0 (offset 0)
Ignoring wrong pointing object 12 0 (offset 0)
Ignoring wrong pointing object 18 0 (offset 0)
Ignoring wrong pointing object 20 0 (offset 0)
Ignoring wrong pointing object 31 0 (offset 0)
Ignoring wrong pointing object 47 0 (offset 0)


uploaded ../data/Writing Sample_Lauren Gray.pdf


In [6]:
query = "what did the ninth circuit decide regarding the definition of registration?"
docs = vector_store.similarity_search(query)
print(docs[0])

page_content='10 
Naturally, the Fourth Circuit found that “re-registrations” or changes in ownership registrations must be included in the meaning of “registration.” Id. at 798. The Ninth Circuit is the only Circuit to review “registration” and conclude differently. Gopets v. Hise, 657 F.3d 1024, 1032 (9th Cir. 2011). The Ninth Circuit used the text and language of the act to interpret that Congress put nothing into the statute to imply the prevention of property alienability of the domain based on a non-applicable case and issue. 657 F.3d at 1031-32. Using this perceived intent of Congress, the Ninth Circuit held their interpretation that other registrations are not included in “registration” under the ACPA. 657 F.3d at 1032.  Accordingly, the Third, Fourth, and Eleventh Circuit applied the Supreme Court’s rules of statutory interpretation correctly to comply with the legislation’s purpose of promoting cybersquatting protection. Prudential, 58 F.4th at 794-97. They used plain languag
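
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it (four results by default in LangChain, to our knowledge). The toy sketch below shows that idea with hand-made three-dimensional vectors and cosine similarity. The vectors, chunk labels, and function names are all made up for illustration, and the metric Milvus actually applies depends on how the collection's index is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, stored, k=4):
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up embeddings standing in for stored chunks.
stored = [
    ("chunk about GPUs", [0.9, 0.1, 0.0]),
    ("chunk about courts", [0.1, 0.9, 0.2]),
    ("chunk about tractors", [0.0, 0.2, 0.9]),
]
query_vec = [0.2, 0.8, 0.1]  # made-up embedding of a legal question

best_text, best_score = top_k(query_vec, stored, k=1)[0]
print(best_text)
# chunk about courts
```

A real vector database avoids scoring every stored vector by using an approximate nearest-neighbour index, but the ranking principle is the same.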