## 0. Docstring

In [None]:
"""
Automate the ingestion and organization of PDF documents, their pages, and associated metadata into a Weaviate vector database for later retrieval and analysis.

This script ingests PDF documents from a specified directory, extracts both document-level metadata and page-level content, 
and stores the data in a Weaviate vector database using two separate collections:
    - PDF_document: Contains metadata about the PDF files such as title, page count, creation date, and effective date.
    - PDF_document_page: Contains the text content of individual pages along with their page number and a reference to the associated PDF document.

Key functionality includes:

**Collection Setup**:
    - Creates or replaces two Weaviate collections: one for PDF documents and one for PDF pages.
        - PDF_document: Holds metadata about the document (title, leadership scope, page count, dates, etc.).
        - PDF_document_page: Stores the text content of each page, its page number, and a reference to the associated PDF document.
    
**Library Catalog Ingestion**:
    - Loads a library catalog (Excel file) containing metadata for the PDFs into a pandas DataFrame.
    - Processes specific date columns into a standardized format (Zulu time).

**PDF Processing**:
    - Walks through the specified directory, identifying PDF files.
    - For each PDF, computes a unique document ID, retrieves corresponding metadata from the DataFrame, and stores the metadata in the `PDF_document` collection.
    - For each page of the PDF, extracts the content and stores it in the `PDF_document_page` collection with a reference to the PDF document.


Usage:
    1. Ensure that the environment variables for the Weaviate and OpenAI credentials (WEAVIATE_URL_COMP, WEAVIATE_API_KEY_COMP, OPENAI_API_KEY) are set.
    2. Place the PDF files in the specified `pdf_source_directory`.
    3. Ensure the library catalog is present in `library_catalog_directory`.
    4. Run the script to automatically create Weaviate collections and upload the PDF metadata and page content.
"""


## 1. Installs, Imports and Environmental Variables

In [None]:
%pip install -U weaviate-client
#%pip install python-dotenv

In [1]:
import os
import sys
import pandas as pd
import weaviate
import weaviate.classes as wvc
# from dotenv import load_dotenv, find_dotenv

# load_dotenv(find_dotenv())

In [2]:
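'''This little code block is used any time you want to import a local module from within a Jupyter Notebook. It is required because only the notebook's own working directory is on sys.path by default, so modules that live one level up are not importable until their directory is added.'''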

# Navigate up one level from the current notebook's directory to reach the root directory
current_dir = os.path.dirname(os.path.realpath('__file__'))
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)
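
The cell above passes the literal string `'__file__'` to `os.path.realpath`, which resolves relative to the current working directory, so it effectively means "the parent of the working directory". A possible `pathlib` equivalent is sketched below. The helper name `add_parent_to_sys_path` is hypothetical, and the sketch assumes the kernel's working directory is the notebook's directory.

```python
import sys
from pathlib import Path


def add_parent_to_sys_path(start=None):
    """Append the parent of `start` (default: the working directory) to sys.path once."""
    base = Path(start) if start is not None else Path.cwd()
    parent = str(base.resolve().parent)
    # Avoid duplicate entries when the cell is re-run.
    if parent not in sys.path:
        sys.path.append(parent)
    return parent


parent_dir = add_parent_to_sys_path()
```

Guarding against duplicates keeps `sys.path` from growing each time the cell is executed again.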

In [3]:
import utils

## 2. Set Configurations and Clients

In [4]:
url = os.getenv("WEAVIATE_URL_COMP")
api_key = os.getenv("WEAVIATE_API_KEY_COMP")
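
`os.getenv` returns `None` for an unset variable, which would only surface later as a connection error. A minimal pre-flight check, using the three variable names from the docstring, might look like this; the helper `missing_env_vars` is a hypothetical name, not part of the notebook.

```python
import os

REQUIRED_VARS = ("WEAVIATE_URL_COMP", "WEAVIATE_API_KEY_COMP", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing `env` explicitly makes the check easy to exercise without touching the real environment.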

In [5]:
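# connect_to_wcs is deprecated in recent v4 clients; connect_to_weaviate_cloud takes the same arguments.
client = weaviate.connect_to_weaviate_cloud(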
    cluster_url=url,
    auth_credentials=weaviate.auth.AuthApiKey(api_key),
    headers={
        "X-OpenAI-Api-Key": os.environ.get("OPENAI_API_KEY")
    }
)

In [20]:
import pprint

# Avoid naming the result `list`, which would shadow the Python built-in.
collections = client.collections.list_all(simple=True)
# pprint.pprint(collections)

{'ASK_vectorstore': _CollectionConfigSimple(name='ASK_vectorstore',
                                            description=None,
                                            generative_config=_GenerativeConfig(generative=<GenerativeSearches.OPENAI: 'generative-openai'>,
                                                                                model={}),
                                            properties=[_Property(name='page_content',
                                                                  description='This '
                                                                              'property '
                                                                              'was '
                                                                              'generated '
                                                                              'by '
                                                                              "Weaviate's "
                       

In [43]:
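# Count the objects. fetch_objects() returns a single page capped by a limit, so len() is
# only reliable for small collections; PDF_docs.aggregate.over_all(total_count=True).total_count
# gives the full count.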
PDF_docs = client.collections.get("PDF_document")
response = PDF_docs.query.fetch_objects()
len(response.objects)

7

In [44]:
# This returns objects in ascending UUID order.
for o in response.objects:
    print(o.uuid)

1a811779-fde1-52b7-af49-a8afedd6a86d
2d846e21-4eb5-5b2e-9cde-0c45d5298550
4b5a25d7-60a2-5e60-9a6e-f0b3de310a15
968e3b36-4da7-558a-a8be-613fa35c3380
b9e3c726-3ca1-5f61-8895-c4ce7554a288
ca36540d-6b51-5478-952b-14839f9f0722
fe2dc3ce-387a-5c6e-be60-af5e15a83141
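
The third group of every UUID above starts with `5`, which marks a version 5 (name-based, SHA-1) UUID. That fits the docstring's "computes a unique document ID": the same input always yields the same ID, so re-ingesting a file does not create a duplicate. The sketch below checks the listed IDs and shows one way such an ID could be derived. The function `document_uuid`, the choice of the file name as key, and the DNS namespace are all assumptions; the script's actual inputs are not shown in this section.

```python
import uuid

# UUIDs copied from the output above.
listed_ids = [
    "1a811779-fde1-52b7-af49-a8afedd6a86d",
    "2d846e21-4eb5-5b2e-9cde-0c45d5298550",
    "4b5a25d7-60a2-5e60-9a6e-f0b3de310a15",
    "968e3b36-4da7-558a-a8be-613fa35c3380",
    "b9e3c726-3ca1-5f61-8895-c4ce7554a288",
    "ca36540d-6b51-5478-952b-14839f9f0722",
    "fe2dc3ce-387a-5c6e-be60-af5e15a83141",
]


def document_uuid(file_name: str) -> uuid.UUID:
    """Hypothetical helper: derive a deterministic version 5 UUID from a file name."""
    return uuid.uuid5(uuid.NAMESPACE_DNS, file_name)


print({uuid.UUID(u).version for u in listed_ids})  # {5}
print(listed_ids == sorted(listed_ids))  # True
print(document_uuid("example.pdf") == document_uuid("example.pdf"))  # True
```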


In [11]:
client.close()