# Utils package
contains common reusable functions

In [None]:
!pip install pipenv

In [None]:
!pipenv install pytest

In [None]:
%%writefile -a __init__.py
print('Hello, world!')

Since this module will be exported as a package, \_\_init\_\_.py just needs to exist so the Python packaging mechanism knows where to look for functions.

Start with processing the training data(instruction manual PDFs). I want to try both the various langchain pdf readers(eg. PyPDF, MathPix, Unstructured, PyPDFium2, PDFMiner, PyMuPDF, pdfplumber) and the PyPDF2 library to see if there are any differences or advantages. I also want to try out the different text splitter parameters(eg. smaller chunks may sometimes be more likely to match a query, so try varying chunk sizes and overlap). There are also different options for text splitters(character, recursive character, NLTK, spaCy, Tiktoken, Hugging Face tokenizer, tiktoken(OpenAI) tokenizer).

In [None]:
!pipenv install pypdf2 langchain

# First, try processing using PyPDF2's PdfReader and langchain's CharacterTextSplitter 

Note: to use pipenv in conjunction with Jupyter, do the following steps:
1. Navigate to project folder. In my case it was cd/Downloads/ctlpchatbot/utils.
2. In your project folder, do pipenv install ipykernel
                              pipenv shell
  If the machine doesn't recognize pipenv, add python -m before running the commands.
3. This will bring up a terminal in your virtualenv like this: (my-virtualenv-name) $
4. In that shell do: python -m ipykernel install --user --name=my-virtualenv-name
5. Launch jupyter notebook by running in the shell: jupyter notebook
6. In your notebook, top bar, near top left, to the right of "Cell" and to the left of "Widgets"
   Click Kernel -> Change Kernel. You should see the kernel my-vertualenv-name(whatever the name you set) as an option.
   Select that as your kernel and the pipenv environment should work, allowing the imports to be recognized.

In [None]:
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter

pdf = '../../ePortG9QuickStartGuide.pdf'

#extract information from pages in pdf and combine into a singular raw text
reader = PdfReader(pdf)
text = ""
for page in reader.pages:
    text += page.extract_text()
print(text)

The results I got above are interesting. All I did was append the pages together, but it seems like some of the words are split for some reason. I'm not sure how much this will affect the results or querying but I also want to see what the text would look like with another pdf reader. After I try the text splitter, I'm going to try langchain's library for pdf readers to see if I get a similar result. It might just be the way the PDF is formatted rather than a fault of the pdf reader.

In [None]:
splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = splitter.split_text(text)
print(chunks)

Now I'll try using solely langchain's libraries.

MathPix requires an API key so I'll avoid that for now to avoid uncessesary charges.

# Using Langchain's PyPDF wrapper


In [None]:
!pip install pypdf
from langchain.document_loaders import PyPDFLoader

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = PyPDFLoader(pdf)
data = loader.load()
print(data)

# Using Unstructured:

In [None]:
!pip install unstructured
from langchain.document_loaders import UnstructuredPDFLoader

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = UnstructuredPDFLoader(pdf)
data = loader.load()
data[0]

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = UnstructuredPDFLoader(pdf)
data = loader.load()
print(data)

Just judging from looks, this text looks less fragmented than the one retrieved by PyPDF2. I don't see any words with unintended spaces in them. I also want to note here that you can also fetch remote PDFs using Unstructured, so if there are online pdf sites, the OnlinePDFLoader class can be used.

# Using PyPDFium2:

In [None]:
!pip install pypdfium2
from langchain.document_loaders import PyPDFium2Loader

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = PyPDFium2Loader(pdf)
data = loader.load()
print(data)

This also looks slightly different from the last two. Some things seem to be in different order, for example the tootle required for installation came before "Thank you for your purchase. Before you start, please read these instructions thoroughly"...

# Using PDFMiner:

Note: This can be helpful for chunking texts semanticaly into sections as the output html content can be parsed via BeautifulSoup to get more structured and rich information about font size, page numbers, pdf headers/footers, etc.

In [None]:
from langchain.document_loaders import PDFMinerPDFasHTMLLoader

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = PDFMinerPDFasHTMLLoader(pdf)
data = loader.load()[0] #entire pdf is loaded as single doc
print(data)

This outputs html code, so I don't think I'll be using it as a loader. But if you do want to parse the HTML using BeautifulSoup, it's at https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/pdf.html

# Using PyMuPDF:

Note: According to langchain, this is the fastest of the PDF parsing options, contains detailed metadata about the PDF and its pages, as well as returns one document per page.

In [None]:
!pip install pymupdf
from langchain.document_loaders import PyMuPDFLoader

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = PyMuPDFLoader(pdf)
data = loader.load()
print(data)

It's more detailed than the prior options, as you can see the metadata tags.

# Using pdfplumber:
    
Note: Similar to PyMuPDF, it also contains detailed metadata

In [None]:
!pip install pdfplumber
from langchain.document_loaders import PDFPlumberLoader

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = PDFPlumberLoader(pdf)
data = loader.load()
print(data)

Now that I've finished exploring the pdf loaders offered by langchain, I don't see any inherently "better" options. But the ones offered by langchain do appear to be better than the PyPDF2 standalone library since there's no fragmented text. But I want to try them all anyways when it comes to chunking these documents, convert into embeddings, then to a vector store. We'll see the differences in quality of responses if there are any.

Since I'll probably be re-using these functions, I'll organize the different pdf loaders into classes, so they can be instantiated and used when necessary when this utils module is imported as a package. I'll have different functions, such as a load function and a separate one for the chunking stage.

# First, I'll make the PyPDF2 class in the pypdf2.py file.

In [None]:
%%writefile -a pypdf2.py
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter

class PyPDF2_Reader:
    """
    A class to read and extract text from PDF files using the PyPDF2 library
    """
    def __init__(self, filepath):
        """
        Initializes a new instance of PyPDF2 PdfReader with the given filepath
        
        :param filepath: The path to the PDF file to be read
        """
        self.filepath = filepath
        self.reader = PdfReader(filepath)
    
    def load_text(self):
        """
        Loads and extracts the text content from the PDF file
        
        :return: The extracted text as a string
        """
        text = ""
        for page in self.reader.pages:
            text += page.extract_text()
        return text

# We also have to make/update a setup.py file to keep track of the dependencies needed since this package will be exported and used.

In [None]:
!pip install pipenv


In [None]:
!pipenv install setuptools

In [None]:
!pip list

In [None]:
%%writefile setup.py
from setuptools import setup, find_packages

setup(
    name='utils',
    version='1.0',
    author='Arvin Dinh',
    description='A Python package containing common utility functions such as pdf reading, loading, and splitting into chunks',
    packages=find_packages(),
    install_requires=['PyPDF2=3.0.1', 'langchain=0.0.199', 'pytest=7.3.2']
)

Now, we repeat these steps for the other pdf loaders in langchain's library.

Actually, before proceeding. I wanted to quickly organize the utils directory since it will become cluttered. I realized that there are going to be more modules than I thought, so I'll be organizing them into loaders and splitters. PDF Loader classes will be in loaders and text splitters will be in splitters. I can keep the setup.py file where it is since it's usually kept at the top-level, but I need to create a directory for loaders and move the pypdf2.py file there.

In [None]:
from pathlib import Path

#make the directories "loaders" and "splitters"
Path("loaders").mkdir(parents=True, exist_ok=True)
Path("splitters").mkdir(parents=True, exist_ok=True)

In [None]:
import shutil
shutil.move('pypdf2.py', 'loaders')

Now make a \_\_init\_\_.py in the subpackages loaders and splitters for them to be recognized as such.

In [None]:
%%writefile -a loaders/__init__.py
print("Hello world")

In [None]:
%%writefile -a splitters/__init__.py
print("Hello world")

Before moving on, I also want to take some time to test the PyPDF2 class I made.(I added a quick main method for testing but you can't see that here. I  just instantiated the class by doing reader = PyPDF2_Reader("example_path") then used reader.load_text()

In [None]:
%run loaders/pypdf2.py

# I'll set up the module for the Unstructured loader in unstructured.py

In [None]:
%%writefile -a loaders/pypdf.py
from langchain.document_loaders import PyPDFLoader

class PyPDF_Loader:
    """
    A class to load a PDF file using the PyPDF PyPDFLoader wrapper from the langchain library.
    """
    def __init__(self, filepath):
        """
        Initializes a new instance of PyPDF PyPDFLoader with the given filepath. 
        
        :param filepath: The path to the PDF file to be loaded
        
        """
        self.filepath = filepath
        self.loader = PyPDFLoader(filepath)
    
    def load_text(self):
        """
        Loads the text into the document format used downstream(by text splitters, etc)
        
        :return: The pdf as a a Document(contains page_content and metadata)
        """
        data = self.loader.load()
        return data

In [None]:
%%writefile setup.py
from setuptools import setup, find_packages

setup(
    name='utils',
    version='1.0',
    author='Arvin Dinh',
    description='A Python package containing common utility functions such as pdf reading, loading, and splitting into chunks',
    packages=find_packages(),
    install_requires=['PyPDF2=3.0.1', 'langchain=0.0.199', 'pytest=7.3.2', 'pypdf=3.9.1']
)

# I'll set up the module for the Unstructured loader in unstructured.py

In [None]:
%%writefile -a loaders/unstructured.py
from langchain.document_loaders import UnstructuredPDFLoader

class UnstructuredPDF_Loader:
    """
    A class to load a PDF file using the UnstructuredPDFLoader wrapper from the langchain library.
    """
    def __init__(self, filepath):
        """
        Initializes a new instance of Unstructured UnstructuredPDFLoader with the given filepath. 
        
        :param filepath: The path to the PDF file to be loaded
        
        """
        self.filepath = filepath
        self.loader = UnstructuredPDFLoader(filepath)
    
    def load_text(self):
        """
        Loads the text into the document format used downstream(by text splitters, etc)
        
        :return: The pdf as a a Document(contains page_content and metadata)
        """
        data = self.loader.load()
        return data

Update the setup.py file to include the unstructured dependency

In [None]:
%%writefile setup.py
from setuptools import setup, find_packages

setup(
    name='utils',
    version='1.0',
    author='Arvin Dinh',
    description='A Python package containing common utility functions such as pdf reading, loading, and splitting into chunks',
    packages=find_packages(),
    install_requires=['PyPDF2=3.0.1', 'langchain=0.0.199', 'pytest=7.3.2', 'pypdf=3.9.1', 'unstructured=0.7.4']
)

# I'll set up the module for the PyPDFium2 loader in pypdfium2.py

In [None]:
%%writefile -a loaders/pypdfium2.py
from langchain.document_loaders import PyPDFium2Loader

class PyPDFium2PDF_Loader:
    """
    A class to load a PDF file using the PyPDFium2 wrapper from the langchain library.
    """
    def __init__(self, filepath):
        """
        Initializes a new instance of PyPDFium2 PyPDFium2Loader with the given filepath. 
        
        :param filepath: The path to the PDF file to be loaded
        
        """
        self.filepath = filepath
        self.loader = PyPDFium2Loader(filepath)
    
    def load_text(self):
        """
        Loads the text into the document format used downstream(by text splitters, etc)
        
        :return: The pdf as a a Document(contains page_content and metadata)
        """
        data = self.loader.load()
        return data


In [None]:
%%writefile setup.py
from setuptools import setup, find_packages

setup(
    name='utils',
    version='1.0',
    author='Arvin Dinh',
    description='A Python package containing common utility functions such as pdf reading, loading, and splitting into chunks',
    packages=find_packages(),
    install_requires=['PyPDF2=3.0.1', 'langchain=0.0.199', 'pytest=7.3.2', 'pypdf=3.9.1', 'unstructured=0.7.4', 'pypdfium2=4.15.0', 'pdf2image=1.16.3']
)

# I'll set up the module for the PyMuPDF loader in pymupdf.py

In [None]:
%%writefile -a loaders/pymupdf.py
from langchain.document_loaders import PyMuPDFLoader

class PyMuPDF_Loader:
    """
    A class to load a PDF file using the PyMuPDF wrapper from the langchain library.
    """
    def __init__(self, filepath):
        """
        Initializes a new instance of PyMuPDF PyMuPDFLoader with the given filepath. 
        
        :param filepath: The path to the PDF file to be loaded
        
        """
        self.filepath = filepath
        self.loader = PyMuPDFLoader(filepath)
    
    def load_text(self):
        """
        Loads the text into the document format used downstream(by text splitters, etc)
        
        :return: The pdf as a a Document(contains page_content and metadata)
        """
        data = self.loader.load()
        return data

In [None]:
%%writefile setup.py
from setuptools import setup, find_packages

setup(
    name='utils',
    version='1.0',
    author='Arvin Dinh',
    description='A Python package containing common utility functions such as pdf reading, loading, and splitting into chunks',
    packages=find_packages(),
    install_requires=['PyPDF2=3.0.1',
                      'langchain=0.0.199',
                      'pytest=7.3.2',
                      'pypdf=3.9.1',
                      'unstructured=0.7.4',
                      'pypdfium2=4.15.0',
                      'pdf2image=1.16.3',
                      'PyMuPDF=1.22.3'
                     ]
)

# I'll set up the module for the pdfplumber loader in pdfplumber.py

In [None]:
%%writefile -a loaders/pdfplumber.py
from langchain.document_loaders import PDFPlumberLoader

class pdfplumber_loader:
    """
    A class to load a PDF file using the pdfplumber wrapper from the langchain library.
    """
    def __init__(self, filepath):
        """
        Initializes a new instance of pdfplumber PDFPlumberLoader with the given filepath. 
        
        :param filepath: The path to the PDF file to be loaded
        
        """
        self.filepath = filepath
        self.loader = PDFPlumberLoader(filepath)
    
    def load_text(self):
        """
        Loads the text into the document format used downstream(by text splitters, etc)
        
        :return: The pdf as a a Document(contains page_content and metadata)
        """
        data = self.loader.load()
        return data

In [None]:
%%writefile setup.py
from setuptools import setup, find_packages

setup(
    name='utils',
    version='1.0',
    author='Arvin Dinh',
    description='A Python package containing common utility functions such as pdf reading, loading, and splitting into chunks',
    packages=find_packages(),
    install_requires=['PyPDF2=3.0.1',
                      'langchain=0.0.199',
                      'pytest=7.3.2',
                      'pypdf=3.9.1',
                      'unstructured=0.7.4',
                      'pypdfium2=4.15.0',
                      'pdf2image=1.16.3',
                      'PyMuPDF=1.22.3',
                      'pdfplumber=0.9.0'
                     ]
)

# Now we'll start working with text splitters, which handle tokenization

Query + Context -> LLM

But need to think about max token limit of the LLM and how much space can be reserved for contexts?

Example, if the model has 4096 available space for tokens, you need to think about both the input(query, context, instructions, history) and the output tokens

# Character Text Splitter:

Note: Simplest method, splits based on characters (by default "\n\n") and measure chunk length by number of characters


In [None]:
from loaders.pymupdf import PyMuPDF_Loader
from langchain.text_splitter import CharacterTextSplitter

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = PyMuPDF_Loader(pdf)

data = loader.load_text()
text_splitter = CharacterTextSplitter(
    separator= "\n",
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len,
)

texts = text_splitter.split_documents(data)
print(texts)

# NLTK:

Note: text is split by NLTK tokenizer, chunk size measured by number of characters

In [None]:
!pip install nltk
from loaders.pypdf import PyPDF_Loader
from langchain.text_splitter import NLTKTextSplitter

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = PyPDF_Loader(pdf)
data = loader.load_text()

text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_documents(data)
print(texts)

# Recursive Character:

Note: recommeneded one for generic text. Parameterized by a list of characters["\n\n", "\n", " ", ""]. This tries to keep the paragraphs, sentences, words together. Chunk size is measured by number of characters

In [None]:
from loaders.pymupdf import PyMuPDF_Loader
from langchain.text_splitter import RecursiveCharacterTextSplitter

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = PyMuPDF_Loader(pdf)

data = loader.load_text()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len,
)

texts = text_splitter.split_documents(data)
print(texts)

# Tiktoken:

Note: fast BPE tokenizer created by OpenAI, text is split and chunk size is measured by tiktoken tokens.

In [None]:
!pip install tiktoken
from loaders.pymupdf import PyMuPDF_Loader
from langchain.text_splitter import TokenTextSplitter

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = PyMuPDF_Loader(pdf)

data = loader.load_text()
text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=200)

texts = text_splitter.split_documents(data)
print(texts)

# Now, I'll create the modules for each text splitter class


# CharacterTextSplitter

In [None]:
%%writefile -a splitters/character.py
from langchain.text_splitter import CharacterTextSplitter

class Character_TextSplitter:
    """
    A class to split a Document using the CharacterTextSplitter wrapper from the langchain library
    """
    def __init__(self, separator, chunk_size, chunk_overlap, length_function):
        """
        Initializes a new instance of CharacterTextSplitter
        
        :param separator: list of separator characters for the text splitter
        :param chunk_size: Maximum size of chunks to return
        :param chunk_overlap: Overlap in characters between chunks
        :param length_function: Function that measures the length of given chunks
        
        """
        self.splitter = CharacterTextSplitter(
            separator = separator,
            chunk_size = chunk_size,
            chunk_overlap = chunk_overlap,
            length_function = length_function
        )
    
    def split_data(self, data):
        """
        Splits the given Document based on single characters, default "\n\n", and measures chunk length by number of characters
        
        :param data: The Document to be split, in the Document format returned by the langchain pdf loaders

        :return: Split Documents
        """
        docs = self.splitter.split_documents(data)
        return docs

# NLTK:


In [None]:
%%writefile -a splitters/nltk.py
from langchain.text_splitter import NLTKTextSplitter

class NLTK_TextSplitter:
    """
    A class to split a Document using the NLTKTextSplitter wrapper from the langchain library
    """
    def __init__(self, chunk_size):
        """
        Initializes a new instance of NLTKTextSplitter
        
        :param chunk_size: Maximum size of chunks to return
        
        """
        self.splitter = NLTKTextSplitter(chunk_size = chunk_size)
    
    def split_data(self, data):
        """
        Splits the given Document based on NLTK tokenzer, chunk size is measured by number of characters
        
        :param data: The Document to be split, in the Document format returned by the langchain pdf loaders

        :return: Split Documents
        """
        docs = self.splitter.split_documents(data)
        return docs

Update the setup.py file

In [None]:
%%writefile setup.py
from setuptools import setup, find_packages

setup(
    name='utils',
    version='1.0',
    author='Arvin Dinh',
    description='A Python package containing common utility functions such as pdf reading, loading, and splitting into chunks',
    packages=find_packages(),
    install_requires=['PyPDF2=3.0.1',
                      'langchain=0.0.199',
                      'pytest=7.3.2',
                      'pypdf=3.9.1',
                      'unstructured=0.7.4',
                      'pypdfium2=4.15.0',
                      'pdf2image=1.16.3',
                      'PyMuPDF=1.22.3',
                      'pdfplumber=0.9.0',
                      'nltk=3.8.1'
                     ]
)

# Recursive Character:

In [None]:
%%writefile -a splitters/recursive.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

class RecursiveCharacter_TextSplitter:
    """
    A class to split a Document using the RecursiveCharacterTextSplitter wrapper from the langchain library.
    Recommended text splitter for generic text.
    """
    def __init__(self,, chunk_size, chunk_overlap, length_function):
        """
        Initializes a new instance of RecursiveCharacterTextSplitter
        
        :param chunk_size: Maximum size of chunks to return
        :param chunk_overlap: Overlap in characters between chunks
        :param length_function: Function that measures the length of given chunks
        
        """
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size = chunk_size,
            chunk_overlap = chunk_overlap,
            length_function = length_function
        )
    
    def split_data(self, data):
        """
        Splits the given Document based on list of characters, ["\n\n", "\n", " ", ""]. Chunk size is measured of characters.
        
        :param data: The Document to be split, in the Document format returned by the langchain pdf loaders

        :return: Split Documents
        """
        docs = self.splitter.split_documents(data)
        return docs

# Tiktoken:


In [None]:
%%writefile -a splitters/tiktoken.py
from langchain.text_splitter import TokenTextSplitter

class RecursiveCharacter_TextSplitter:
    """
    A class to split a Document using the TokenTextSplitter wrapper from the langchain library.
    """
    def __init__(self,, chunk_size, chunk_overlap):
        """
        Initializes a new instance of TokenTextSplitter
        
        :param chunk_size: Maximum size of chunks to return
        :param chunk_overlap: Overlap in characters between chunks
        
        """
        self.splitter = TokenTextSplitter(
            chunk_size = chunk_size,
            chunk_overlap = chunk_overlap,
        )
    
    def split_data(self, data):
        """
        Splits the given Document based on tiktoken tokens. The text is split and chunk size is measured by tiktoken tokens.
        
        :param data: The Document to be split, in the Document format returned by the langchain pdf loaders

        :return: Split Documents
        """
        docs = self.splitter.split_documents(data)
        return docs

Update setup.py


In [None]:
%%writefile setup.py
from setuptools import setup, find_packages

setup(
    name='utils',
    version='1.0',
    author='Arvin Dinh',
    description='A Python package containing common utility functions such as pdf reading, loading, and splitting into chunks',
    packages=find_packages(),
    install_requires=['PyPDF2=3.0.1',
                      'langchain=0.0.199',
                      'pytest=7.3.2',
                      'pypdf=3.9.1',
                      'unstructured=0.7.4',
                      'pypdfium2=4.15.0',
                      'pdf2image=1.16.3',
                      'PyMuPDF=1.22.3',
                      'pdfplumber=0.9.0',
                      'nltk=3.8.1',
                      'tiktoken=0.4.0'
                     ]
)

# Now that we have our util functions for processing/loading the PDFs to the langchain Document format and splitting/tokenizing text, we can move on embeddings/vector stores phase. Since I will be using the OpenAI Embeddings, I'll move to the ai module.

# Coming back from the ai module after implementing the openai_embeddings module, I'll now be implementing Deep Lake as my local vector store. It has the capacity to take in an Embeddings model and a Document as parameters, and both compute and store the embeddings.

In [None]:
!pip install deeplake
!pip install openai
import sys
sys.path.append('../ai/embeddings')
from openai_embeddings import OpenAI_Embeddings
from loaders.pymupdf import PyMuPDF_Loader
from splitters.recursive import RecursiveCharacter_TextSplitter
from langchain.vectorstores import DeepLake

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = PyMuPDF_Loader(pdf)
data = loader.load_text()
splitter = RecursiveCharacter_TextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len
)
docs = splitter.split_data(data)

embeddings = OpenAI_Embeddings(api_key='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').vectorizer

db = DeepLake(dataset_path="./documents_deeplake", embedding_function=embeddings)
db.add_documents(docs)
query="What are the tools required to install the ePort G9"
response=db.similarity_search(query)
print(docs[0].page_content)
db.delete_dataset()


# Now that I know it works, I'll create a module for Deep Lake

In [None]:
from pathlib import Path

#make the directories "loaders" and "splitters"
Path("vectorstores").mkdir(parents=True, exist_ok=True)

In [None]:
filename= 'vectorstores/__init__.py'

with open(filename, 'w') as file:
    pass

In [None]:
%%writefile -a vectorstores/deep_lake.py
from langchain.vectorstores import DeepLake

class DeeplakeDB:
    """
    A class to initialize the Deep Lake vector store and perform various operations based on the DeepLake wrapper from langchain
    """
    def __init__(self, store_path, embedding_model):
        """
        Initializes the DeepLake object based on a given dataset path and embedding function/model.
        DeepLake wrapper is capable of internally computing the embedding using the given model and storing it in the path.
        
        :param store_path: path that contains vector store. will create at that path if doesn't already exist 
        :param embedding_model: langchain embedding model
        """
        self.db = DeepLake(dataset_path = store_path, embedding_function = embedding_model)

    def add_docs(self, documents):
        """
        Adds the embedded documents to the path given on initialization.
        
        :param document: langchain Document object used for computing embedding, then to be stored
        """
        for document in documents:
            self.db.add_documents(document)
    
    def find_similar(self, query):
        """
        Returns the document that best matches the query
        
        :param query: String that is tested for similarity search
        
        :return: most similar Document object
        """
        return self.db.similarity_search(query)

    def delete_all(self):
        """
        Deletes the vector store in the given path.
        """
        self.db.delete_dataset()

Update setup.py accordingly:

In [None]:
%%writefile setup.py
from setuptools import setup, find_packages

setup(
    name='utils',
    version='1.0',
    author='Arvin Dinh',
    description='A Python package containing common utility functions such as pdf reading, loading, and splitting into chunks',
    packages=find_packages(),
    install_requires=['PyPDF2=3.0.1',
                      'langchain=0.0.199',
                      'pytest=7.3.2',
                      'pypdf=3.9.1',
                      'unstructured=0.7.4',
                      'pypdfium2=4.15.0',
                      'pdf2image=1.16.3',
                      'PyMuPDF=1.22.3',
                      'pdfplumber=0.9.0',
                      'nltk=3.8.1',
                      'tiktoken=0.4.0',
                      'deeplake=3.6.3'
                     ]
)


I just want to run a quick test of all the modules I created, so I'll import those and use them to store the computed embeddings in the vector store. Then I'll use that vector store as my retriever in the conversation chain to test if the LLM can produce a suitable reponse based on that stored embedding as context.

In [None]:
import sys
sys.path.append('../ai/embeddings')
from openai_embeddings import OpenAI_Embeddings
from loaders.pymupdf import PyMuPDF_Loader
from splitters.recursive import RecursiveCharacter_TextSplitter
from vectorstores.deep_lake import DeeplakeDB
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = PyMuPDF_Loader(pdf)
data = loader.load_text()
splitter = RecursiveCharacter_TextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len
)
docs = splitter.split_data(data)

embeddings = OpenAI_Embeddings(api_key='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').vectorizer

db = DeeplakeDB(store_path='./embeddings_deeplake', embedding_model=embeddings)
db.add_docs(docs)
llm = ChatOpenAI(openai_api_key='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')
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
qa = ConversationalRetrievalChain.from_llm(llm=llm, retriever=db.db.as_retriever(), memory=memory)

query = "What tools do I need to install the ePort G9"
result = qa({"question": query})
print(result["answer"])

db.delete_all()

It recognized the asterisk next to the words as corresponding labels for the "Required for surface-mount installations" tag. For example, it said Power Drill* and Multi-diameter Step Drill Bit* and at the bottom has *Required for surface-mount installations.

# It's really intelligent and good at understanding text. Next, I'll be implementing the ChatOpenAI model and the conversation wrappers in the ai module.

# Now, I'll set up and organize tests with pytest. First, create a tests directory.


In [None]:
from pathlib import Path

#make the directory "pdf"
Path("loaders/pdf").mkdir(parents=True, exist_ok=True)

In [2]:
import shutil
shutil.move('loaders/pdfplumber.py', 'loaders/pdf')
shutil.move('loaders/pymupdf.py', 'loaders/pdf')
shutil.move('loaders/pypdf.py', 'loaders/pdf')
shutil.move('loaders/pypdfium2.py', 'loaders/pdf')
shutil.move('loaders/unstructured.py', 'loaders/pdf')

'loaders/pdf\\unstructured.py'

Then, let's make a main pdf loader module to handle the different file types and load accordingly.

In [21]:
%%writefile loaders/loader_mapper.py
from langchain.document_loaders import (
    CSVLoader,
    PyMuPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    UnstructuredWordDocumentLoader,
    UnstructuredPowerPointLoader,
    UnstructuredExcelLoader,
)

Writing loaders/multi_loader.py


In [22]:
%%writefile -a loaders/loader_mapper.py

class LoaderMapper:
    """
    LoaderMapper can accept multiple file types and return a langchain loader wrapper that corresponds to the associated loader.
    Currently supports csv, pdf, txt, html, md, doc, docx, ppt, pptx, xls, xlsx, json
    """
    #keep dict of file extensions and their relevant loaders with their arguments
    loader_map = {
            ".csv": (CSVLoader, {}),
            ".pdf": (PyMuPDFLoader, {}),
            ".txt": (TextLoader, {"encoding": "utf8"}),
            ".html": (UnstructuredHTMLLoader, {}),
            ".md": (UnstructuredMarkdownLoader, {}),
            ".doc": (UnstructuredWordDocumentLoader, {}),
            ".docx": (UnstructuredWordDocumentLoader, {}),
            ".ppt": (UnstructuredPowerPointLoader, {}),
            ".pptx": (UnstructuredPowerPointLoader, {}),
            ".xls": (UnstructuredExcelLoader, {}),
            ".xlsx": (UnstructuredExcelLoader, {}),
        }
    
    @classmethod
    def find_loader(self, filepath):
        """
        Finds the associated loader based on filepath extension
        
        :param filepath: path of the file to be loaded
        
        :return: langchain loader wrapper object
        """
        ext = "." + filepath.rsplit(".", 1)[-1]
        if ext in LoaderMapper.loader_map:
            loader_class, loader_args = LoaderMapper.loader_map[ext]
            loader = loader_class(filepath, **loader_args)
            return loader
        
        raise ValueError(f"Unsupported file extension '{ext}'")

Appending to loaders/multi_loader.py


Let's test all the different file types to see if the class works as intended. We'll use the test_loaders.py module and pytest again. We also need to update the import line since we moved the pdf loaders under loaders/pdf. UPDATE: I just made a change to the multi_loader class and changed it to LoaderMapper instead. I want to return the relevant loader instead of doing all the loading work in that class. So I think I'll be removing the pdf directory altogether since we only need one pdf loader.

In [25]:
import os

import shutil

directory = 'loaders/pdf'

# Check if the directory exists
if os.path.exists(directory):
    # Remove the directory and its contents
    shutil.rmtree(directory)

Delete all the old tests with the pdf loaders and write new ones. One for returning the correct loader and another for actually returning the loaded text. Parametrize for each file type: pdf, json, docx etc.

In [41]:
%%writefile tests/test_loaders.py
from langchain.document_loaders import (
    CSVLoader,
    PyMuPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    UnstructuredWordDocumentLoader,
    UnstructuredPowerPointLoader,
    UnstructuredExcelLoader,
)
from loaders.loader_mapper import LoaderMapper
import pytest


@pytest.mark.parametrize("doc, expected", [
    ('tests/docs/dummy_doc_twinkle.pdf', PyMuPDFLoader('tests/docs/dummy_doc_twinkle.pdf'))
])
def test_return_loader(doc, expected):
    mapper = LoaderMapper()
    loader = mapper.find_loader(doc)
    assert type(loader) == type(expected)

Overwriting tests/test_loaders.py


In [42]:
!python -m pytest tests/test_loaders.py

platform win32 -- Python 3.11.3, pytest-7.3.2, pluggy-1.2.0
rootdir: C:\Users\adinh\Downloads\ctlpchatbot\utils
plugins: anyio-3.7.0
collected 1 item

tests\test_loaders.py [32m.[0m[32m                                                  [100%][0m



Add more tests for each file type. I'm going to need to make dummy files for each one. 

In [20]:
from pathlib import Path
Path("tests/docs").mkdir(parents=True, exist_ok=True)

import csv

data = [
    ['Name', 'Age', 'Country'],
    ['John', 25, 'USA'],
    ['Alice', 30, 'Canada'],
    ['Bob', 35, 'Australia']
]

filename = 'tests/docs/example.csv'

with open(filename, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

import shutil
shutil.move('tests/dummy_doc_twinkle.pdf', 'tests/docs')

'tests/docs\\dummy_doc_twinkle.pdf'

In [44]:
file_path = "tests/docs/dummy.txt"

content="""Blah blah blah. Sample text. Blah Blah
Blah Blah Blah. This is so fun. Blah Blah.
Abcdefghijklmnopqrstuvwxyz."""

with open(file_path, "w") as file: 
    file.write(content)

In [45]:
%%writefile tests/docs/dummy.html

<!DOCTYPE html>
<html>
<head>
    <title>Dummy HTML File</title>
</head>
<body>
    <h1>This is a dummy HTML file.</h1>
    <p>It serves as an example.</p>
</body>
</html>

Writing tests/docs/dummy.html


In [46]:
%%writefile tests/docs/dummy.md

# Dummy Markdown File

This is a dummy Markdown file.
It serves as an example.

- Item 1
- Item 2
- Item 3

Writing tests/docs/dummy.md


In [48]:
from docx import Document

document = Document()
document.add_heading('Dummy Document', 0)
document.add_paragraph('This is a dummy Word document.')

document.save('tests/docs/dummy.docx')

In [49]:
from docx import Document
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
from docx.shared import Inches

document = Document()
document.add_heading('Dummy Document', 0)
paragraph = document.add_paragraph('This is a dummy Word document.')
paragraph.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER

document.save('tests/docs/dummy.doc')

In [66]:
#powerpoint
from pptx import Presentation

presentation = Presentation()

slide_1 = presentation.slides.add_slide(presentation.slide_layouts[0])
title_1 = slide_1.shapes.title
subtitle_1 = slide_1.placeholders[1]

title_1.text = "Dummy Presentation"
subtitle_1.text = "This is a dummy PowerPoint presentation."

presentation.save("tests/docs/dummy.pptx")

#excel
import openpyxl

workbook = openpyxl.Workbook()
worksheet = workbook.active

worksheet.title = "Dummy Sheet"
worksheet["A1"] = "Dummy Excel Spreadsheet"
worksheet["A2"] = "This is a dummy Excel spreadsheet."

workbook.save("tests/docs/dummy.xlsx")


In [70]:
from pptx import Presentation

# Create a presentation object
presentation = Presentation()

# Add a slide with a title and content
slide_layout = presentation.slide_layouts[1]  # Slide layout with title and content
slide = presentation.slides.add_slide(slide_layout)
title = slide.shapes.title
content = slide.placeholders[1]

title.text = "Dummy Slide"
content.text = "This is a dummy PowerPoint slide."

# Save the presentation to a PPT file
presentation.save("tests/docs/dummy.ppt")

In [68]:
!pip install jq

Collecting jq
  Using cached jq-1.4.1.tar.gz (2.7 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: jq
  Building wheel for jq (pyproject.toml): started
  Building wheel for jq (pyproject.toml): finished with status 'error'
Failed to build jq


  error: subprocess-exited-with-error
  
  Building wheel for jq (pyproject.toml) did not run successfully.
  exit code: 1
  
  [5 lines of output]
  running bdist_wheel
  running build
  running build_ext
  Executing: ./configure CFLAGS=-fPIC --prefix=C:\Users\adinh\AppData\Local\Temp\pip-install-a8y06sk2\jq_996a74784b364c019316ced3b01d0e45\_deps\build\onig-install-6.9.4
  error: [WinError 2] The system cannot find the file specified
  [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for jq
ERROR: Could not build wheels for jq, which is required to install pyproject.toml-based projects


I took out JSON from the relevant files since it's not working. I'll try to come back to find a workaround or manually make my own JSON loader.

In [71]:
%%writefile tests/test_loaders.py
from langchain.document_loaders import (
    CSVLoader,
    PyMuPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    UnstructuredWordDocumentLoader,
    UnstructuredPowerPointLoader,
    UnstructuredExcelLoader,
)
from loaders.loader_mapper import LoaderMapper
import pytest


@pytest.mark.parametrize("doc, expected", [
    ('tests/docs/dummy_doc_twinkle.pdf', PyMuPDFLoader('tests/docs/dummy_doc_twinkle.pdf')),
    ('tests/docs/example.csv', CSVLoader('tests/docs/example.csv')),
    ('tests/docs/dummy.txt', TextLoader(file_path='tests/docs/dummy.txt', encoding="utf8")),
    ('tests/docs/dummy.html', UnstructuredHTMLLoader('tests/docs/dummy.html')),
    ('tests/docs/dummy.md', UnstructuredMarkdownLoader('tests/docs/dummy.md')),
    ('tests/docs/dummy.doc', UnstructuredWordDocumentLoader('tests/docs/dummy.doc')),
    ('tests/docs/dummy.docx', UnstructuredWordDocumentLoader('tests/docs/dummy.docx')),
    ('tests/docs/dummy.pptx', UnstructuredPowerPointLoader('tests/docs/dummy.pptx')),
    ('tests/docs/dummy.ppt', UnstructuredPowerPointLoader('tests/docs/dummy.ppt')),
    ('tests/docs/dummy.xlsx', UnstructuredExcelLoader('tests/docs/dummy.xlsx')),
])
def test_return_loader(doc, expected):
    mapper = LoaderMapper()
    loader = mapper.find_loader(doc)
    assert type(loader) == type(expected)



Overwriting tests/test_loaders.py


In [64]:
!pip list

Package                 Version
----------------------- ---------
aiohttp                 3.8.4
aiosignal               1.3.1
anyio                   3.7.0
argilla                 1.10.0
asttokens               2.2.1
async-timeout           4.0.2
attrs                   23.1.0
backcall                0.2.0
backoff                 2.2.1
blis                    0.7.9
boto3                   1.26.157
botocore                1.29.157
catalogue               2.0.8
certifi                 2023.5.7
cffi                    1.15.1
chardet                 5.1.0
charset-normalizer      3.1.0
click                   8.1.3
colorama                0.4.6
comm                    0.1.3
commonmark              0.9.1
confection              0.0.4
cryptography            41.0.1
cymem                   2.0.7
dataclasses-json        0.5.8
debugpy                 1.6.7
decorator               5.1.1
deeplake                3.6.3
Deprecated              1.2.14
dill                    0.3.6
distlib             

In [73]:
!python -m pytest tests/test_loaders.py

platform win32 -- Python 3.11.3, pytest-7.3.2, pluggy-1.2.0
rootdir: C:\Users\adinh\Downloads\ctlpchatbot\utils
plugins: anyio-3.7.0
collected 10 items

tests\test_loaders.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                         [100%][0m



Now test the actual file loading capabilities.

In [80]:
%%writefile -a tests/test_loaders.py
@pytest.mark.parametrize("doc, expected", [
    ('tests/docs/dummy_doc_twinkle.pdf',
    """Twinkle, twinkle, little star,\nHow I wonder what you are!\nUp above the world so high,\nLike a diamond in the sky.\nTwinkle, twinkle, little star,\nHow I wonder what you are!"""),
    ('tests/docs/example.csv', 
     """Name: John
        Age: 25
        Country: USA"""),
    ('tests/docs/dummy.txt',
    """Blah blah blah. Sample text. Blah Blah
    Blah Blah Blah. This is so fun. Blah Blah.
    Abcdefghijklmnopqrstuvwxyz.""" ),
    ('tests/docs/dummy.html', 
    """This is a dummy HTML file.
    It serves as an example."""),
    ('tests/docs/dummy.md', 
    """Dummy Markdown File
    This is a dummy Markdown file.
    It serves as an example. Item 1 Item 2 Item 3"""),
    ('tests/docs/dummy.docx',
    """Dummy Document
    This is a dummy Word document."""),
    ('tests/docs/dummy.pptx',
    """Dummy Presentation
    This is a dummy PowerPoint presentation."""),
    ('tests/docs/dummy.xlsx',
    """This is a dummy Excel spreadsheet."""),
])
def test_load_doc(doc, expected):
    mapper = LoaderMapper()
    loader = mapper.find_loader(doc)
    loaded_doc = loader.load()
    text = loaded_doc[0].page_content
    actual_normalized = re.sub(r'\s+', ' ', text.strip())
    expected_normalized = re.sub(r'\s+', ' ', expected.strip())
    assert actual_normalized == expected_normalized
    
    
 
    

Appending to tests/test_loaders.py


In [86]:
!python -m pytest tests/test_loaders.py

platform win32 -- Python 3.11.3, pytest-7.3.2, pluggy-1.2.0
rootdir: C:\Users\adinh\Downloads\ctlpchatbot\utils
plugins: anyio-3.7.0
collected 16 items

tests\test_loaders.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                   [100%][0m



Another note: I went back and removed the functionality for doc and ppt in favor of docx and pptx because it was outdated and caused issues(required me to install more things on my computer). I didn't want to overcomplicate this for anyone wanting to use it, so unless it's absolutely necessary, I won't be implementing it for now.

# Now we need to go across the repo and change the imports for the text loader to use the new loader mapper instead.

In [88]:
!python -m pytest

platform win32 -- Python 3.11.3, pytest-7.3.2, pluggy-1.2.0
rootdir: C:\Users\adinh\Downloads\ctlpchatbot\utils
plugins: anyio-3.7.0
collected 20 items

tests\test_loaders.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                   [ 80%][0m
tests\test_splitters.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                             [100%][0m



In [None]:
%%writefile tests/test_splitters.py

from loaders.loader_mapper import LoaderMapper
from splitters import character, nltk, recursive, tiktoken
import pytest

#difficult to formally compare results of text splitters, so I checked visually that it worked beforehand, and just tested to see that imports work correctly.
mapper = LoaderMapper()
loader = mapper.find_loader('tests/docs/dummy_doc_twinkle.pdf')
data = loader.load()

def test_character():
    splitter = character.Character_TextSplitter(
        separator= "\n",
        chunk_size = 1000,
        chunk_overlap = 200,
        length_function = len,
    )
    split_docs = splitter.split_data(data)
    assert split_docs is not None

def test_ntlk():
    splitter = nltk.NLTK_TextSplitter(chunk_size = 1000)
    split_docs = splitter.split_data(data)
    assert split_docs is not None

def test_recursive():
    splitter = recursive.RecursiveCharacter_TextSplitter(
        chunk_size = 1000,
        chunk_overlap = 200,
        length_function = len
    )
    split_docs = splitter.split_data(data)
    assert split_docs is not None

def test_tiktoken():
    splitter = tiktoken.Token_TextSplitter(
        chunk_size = 1000,
        chunk_overlap = 200,
    )
    split_docs = splitter.split_data(data)
    assert split_docs is not None


In [None]:
%%writefile -a tests/test_loaders.py
from loaders import pdfplumber, pymupdf, pypdf, pypdf2, pypdfium2, unstructured
import pytest


Actually, we'll use a small dummy pdf for our tests, not the instruction manual.

In [None]:
!pip install fpdf
from fpdf import FPDF


pdf = FPDF()

pdf.add_page()

pdf.set_font('Arial', size=12)

text = """Twinkle, twinkle, little star,
How I wonder what you are!
Up above the world so high,
Like a diamond in the sky.
Twinkle, twinkle, little star,
How I wonder what you are!"""

pdf.multi_cell(0, 10, txt=text)

pdf.output('tests/dummy_doc_twinkle.pdf')

make a test for the pdfplumber class

In [None]:
%%writefile -a tests/test_loaders.py

@pytest.mark.parametrize("pdf, expected", [
    ('tests/dummy_doc_twinkle.pdf',
    """Twinkle, twinkle, little star,\nHow I wonder what you are!\nUp above the world so high,\nLike a diamond in the sky.\nTwinkle, twinkle, little star,\nHow I wonder what you are!""")
])
def test_plumber(pdf, expected):
    loader = pdfplumber.pdfplumber_loader(pdf)
    data = loader.load_text()
    assert data[0].page_content.strip() == expected.strip()


In [None]:
!python -m pytest -vv

Quick note, I removed pypdf2 because it is deprecated.

Now, lets write a test for pymupdf

In [None]:
%%writefile -a tests/test_loaders.py

@pytest.mark.parametrize("pdf, expected", [
    ('tests/dummy_doc_twinkle.pdf',
    """Twinkle, twinkle, little star,\nHow I wonder what you are!\nUp above the world so high,\nLike a diamond in the sky.\nTwinkle, twinkle, little star,\nHow I wonder what you are!""")
])
def test_pymu(pdf, expected):
    loader = pymupdf.PyMuPDF_Loader(pdf)
    data = loader.load_text()
    assert data[0].page_content.strip() == expected.strip()

In [None]:
!python -m pytest

Now do pypdf

In [None]:
%%writefile -a tests/test_loaders.py

@pytest.mark.parametrize("pdf, expected", [
    ('tests/dummy_doc_twinkle.pdf',
    """Twinkle, twinkle, little star,\nHow I wonder what you are!\nUp above the world so high,\nLike a diamond in the sky.\nTwinkle, twinkle, little star,\nHow I wonder what you are!""")
])
def test_pypdf(pdf, expected):
    loader = pypdf.PyPDF_Loader(pdf)
    data = loader.load_text()
    assert data[0].page_content.strip() == expected.strip()

In [None]:
!python -m pytest

pypdfium

In [None]:
%%writefile -a tests/test_loaders.py

#this pdf loader is a little special, so normalize the text extracted before similarity comparison
import re
@pytest.mark.parametrize("pdf, expected", [
    ('tests/dummy_doc_twinkle.pdf',
    """Twinkle, twinkle, little star,
How I wonder what you are!
Up above the world so high,
Like a diamond in the sky.
Twinkle, twinkle, little star,
How I wonder what you are!""")
])
def test_pypdfium(pdf, expected):
    loader = pypdfium2.PyPDFium2PDF_Loader(pdf)
    data = loader.load_text()
    
    # Normalize line endings and remove extra whitespaces
    actual_normalized = re.sub(r'\s+', ' ', data[0].page_content.strip())
    expected_normalized = re.sub(r'\s+', ' ', expected.strip())
    
    assert actual_normalized == expected_normalized

In [None]:
!python -m pytest 

finally, make a test for unstructured

In [None]:
%%writefile -a tests/test_loaders.py

@pytest.mark.parametrize("pdf, expected", [
    ('tests/dummy_doc_twinkle.pdf',
    """Twinkle, twinkle, little star,\nHow I wonder what you are!\nUp above the world so high,\nLike a diamond in the sky.\nTwinkle, twinkle, little star,\nHow I wonder what you are!""")
])
def test_unstructured(pdf, expected):
    loader = unstructured.UnstructuredPDF_Loader(pdf)
    data = loader.load_text()
    actual_normalized = re.sub(r'\s+', ' ', data[0].page_content.strip())
    expected_normalized = re.sub(r'\s+', ' ', expected.strip())
    
    assert actual_normalized == expected_normalized


In [None]:
!python -m pytest

# Now let's write up some tests for the text splitters

In [None]:
%%writefile -a tests/test_splitters.py
from loaders import pymupdf
from splitters import character, nltk, recursive, tiktoken
import pytest

In [None]:
%%writefile -a tests/test_splitters.py

loader = pymupdf.PyMuPDF_Loader('tests/dummy_doc_twinkle.pdf')
data = loader.load_text()

def test_character():
    splitter = character.Character_TextSplitter(
        separator= "\n",
        chunk_size = 1000,
        chunk_overlap = 200,
        length_function = len,
    )
    split_docs = splitter.split_data(data)
    assert split_docs is not None

def test_ntlk():
    splitter = nltk.NLTK_TextSplitter(chunk_size = 1000)
    split_docs = splitter.split_data(data)
    assert split_docs is not None

def test_recursive():
    splitter = recursive.RecursiveCharacter_TextSplitter(
        chunk_size = 1000,
        chunk_overlap = 200,
        length_function = len
    )
    split_docs = splitter.split_data(data)
    assert split_docs is not None

def test_tiktoken():
    splitter = tiktoken.Token_TextSplitter(
        chunk_size = 1000,
        chunk_overlap = 200,
    )
    split_docs = splitter.split_data(data)
    assert split_docs is not None

In [None]:
!python -m pytest

# Now, I want to account for different file types, not just PDF. So I'm gonna have to adjust the loaders subdirectory quite a bit.

The loading stage is what handles extracting text from different file types. So let's group all the pdf loaders into a single subdirectory called pdf in loaders.

In [1]:
from pathlib import Path

#make the directory "pdf"
Path("loaders/pdf").mkdir(parents=True, exist_ok=True)

In [2]:
import shutil
shutil.move('loaders/pdfplumber.py', 'loaders/pdf')
shutil.move('loaders/pymupdf.py', 'loaders/pdf')
shutil.move('loaders/pypdf.py', 'loaders/pdf')
shutil.move('loaders/pypdfium2.py', 'loaders/pdf')
shutil.move('loaders/unstructured.py', 'loaders/pdf')

'loaders/pdf\\unstructured.py'

Then, let's make a main pdf loader module to handle the different file types and load accordingly.

In [21]:
%%writefile loaders/loader_mapper.py
from langchain.document_loaders import (
    CSVLoader,
    PyMuPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    UnstructuredWordDocumentLoader,
    UnstructuredPowerPointLoader,
    UnstructuredExcelLoader,
)

Writing loaders/multi_loader.py


In [22]:
%%writefile -a loaders/loader_mapper.py

class LoaderMapper:
    """
    LoaderMapper can accept multiple file types and return a langchain loader wrapper that corresponds to the associated loader.
    Currently supports csv, pdf, txt, html, md, doc, docx, ppt, pptx, xls, xlsx, json
    """
    #keep dict of file extensions and their relevant loaders with their arguments
    loader_map = {
            ".csv": (CSVLoader, {}),
            ".pdf": (PyMuPDFLoader, {}),
            ".txt": (TextLoader, {"encoding": "utf8"}),
            ".html": (UnstructuredHTMLLoader, {}),
            ".md": (UnstructuredMarkdownLoader, {}),
            ".doc": (UnstructuredWordDocumentLoader, {}),
            ".docx": (UnstructuredWordDocumentLoader, {}),
            ".ppt": (UnstructuredPowerPointLoader, {}),
            ".pptx": (UnstructuredPowerPointLoader, {}),
            ".xls": (UnstructuredExcelLoader, {}),
            ".xlsx": (UnstructuredExcelLoader, {}),
        }
    
    @classmethod
    def find_loader(self, filepath):
        """
        Finds the associated loader based on filepath extension
        
        :param filepath: path of the file to be loaded
        
        :return: langchain loader wrapper object
        """
        ext = "." + filepath.rsplit(".", 1)[-1]
        if ext in LoaderMapper.loader_map:
            loader_class, loader_args = LoaderMapper.loader_map[ext]
            loader = loader_class(filepath, **loader_args)
            return loader
        
        raise ValueError(f"Unsupported file extension '{ext}'")

Appending to loaders/multi_loader.py


Let's test all the different file types to see if the class works as intended. We'll use the test_loaders.py module and pytest again. We also need to update the import line since we moved the pdf loaders under loaders/pdf. UPDATE: I just made a change to the multi_loader class and changed it to LoaderMapper instead. I want to return the relevant loader instead of doing all the loading work in that class. So I think I'll be removing the pdf directory altogether since we only need one pdf loader.

In [25]:
import os

import shutil

directory = 'loaders/pdf'

# Check if the directory exists
if os.path.exists(directory):
    # Remove the directory and its contents
    shutil.rmtree(directory)

Delete all the old tests with the pdf loaders and write new ones. One for returning the correct loader and another for actually returning the loaded text. Parametrize for each file type: pdf, json, docx etc.

In [41]:
%%writefile tests/test_loaders.py
from langchain.document_loaders import (
    CSVLoader,
    PyMuPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    UnstructuredWordDocumentLoader,
    UnstructuredPowerPointLoader,
    UnstructuredExcelLoader,
)
from loaders.loader_mapper import LoaderMapper
import pytest


@pytest.mark.parametrize("doc, expected", [
    ('tests/docs/dummy_doc_twinkle.pdf', PyMuPDFLoader('tests/docs/dummy_doc_twinkle.pdf'))
])
def test_return_loader(doc, expected):
    mapper = LoaderMapper()
    loader = mapper.find_loader(doc)
    assert type(loader) == type(expected)

Overwriting tests/test_loaders.py


In [42]:
!python -m pytest tests/test_loaders.py

platform win32 -- Python 3.11.3, pytest-7.3.2, pluggy-1.2.0
rootdir: C:\Users\adinh\Downloads\ctlpchatbot\utils
plugins: anyio-3.7.0
collected 1 item

tests\test_loaders.py [32m.[0m[32m                                                  [100%][0m



Add more tests for each file type. I'm going to need to make dummy files for each one. 

In [20]:
from pathlib import Path
Path("tests/docs").mkdir(parents=True, exist_ok=True)

import csv

data = [
    ['Name', 'Age', 'Country'],
    ['John', 25, 'USA'],
    ['Alice', 30, 'Canada'],
    ['Bob', 35, 'Australia']
]

filename = 'tests/docs/example.csv'

with open(filename, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

import shutil
shutil.move('tests/dummy_doc_twinkle.pdf', 'tests/docs')

'tests/docs\\dummy_doc_twinkle.pdf'

In [44]:
file_path = "tests/docs/dummy.txt"

content="""Blah blah blah. Sample text. Blah Blah
Blah Blah Blah. This is so fun. Blah Blah.
Abcdefghijklmnopqrstuvwxyz."""

with open(file_path, "w") as file: 
    file.write(content)

In [45]:
%%writefile tests/docs/dummy.html

<!DOCTYPE html>
<html>
<head>
    <title>Dummy HTML File</title>
</head>
<body>
    <h1>This is a dummy HTML file.</h1>
    <p>It serves as an example.</p>
</body>
</html>

Writing tests/docs/dummy.html


In [46]:
%%writefile tests/docs/dummy.md

# Dummy Markdown File

This is a dummy Markdown file.
It serves as an example.

- Item 1
- Item 2
- Item 3

Writing tests/docs/dummy.md


In [48]:
from docx import Document

document = Document()
document.add_heading('Dummy Document', 0)
document.add_paragraph('This is a dummy Word document.')

document.save('tests/docs/dummy.docx')

In [49]:
from docx import Document
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
from docx.shared import Inches

document = Document()
document.add_heading('Dummy Document', 0)
paragraph = document.add_paragraph('This is a dummy Word document.')
paragraph.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER

document.save('tests/docs/dummy.doc')

In [66]:
#powerpoint
from pptx import Presentation

presentation = Presentation()

slide_1 = presentation.slides.add_slide(presentation.slide_layouts[0])
title_1 = slide_1.shapes.title
subtitle_1 = slide_1.placeholders[1]

title_1.text = "Dummy Presentation"
subtitle_1.text = "This is a dummy PowerPoint presentation."

presentation.save("tests/docs/dummy.pptx")

#excel
import openpyxl

workbook = openpyxl.Workbook()
worksheet = workbook.active

worksheet.title = "Dummy Sheet"
worksheet["A1"] = "Dummy Excel Spreadsheet"
worksheet["A2"] = "This is a dummy Excel spreadsheet."

workbook.save("tests/docs/dummy.xlsx")


In [70]:
from pptx import Presentation

# Create a presentation object
presentation = Presentation()

# Add a slide with a title and content
slide_layout = presentation.slide_layouts[1]  # Slide layout with title and content
slide = presentation.slides.add_slide(slide_layout)
title = slide.shapes.title
content = slide.placeholders[1]

title.text = "Dummy Slide"
content.text = "This is a dummy PowerPoint slide."

# Save the presentation to a PPT file
presentation.save("tests/docs/dummy.ppt")

In [68]:
!pip install jq

Collecting jq
  Using cached jq-1.4.1.tar.gz (2.7 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: jq
  Building wheel for jq (pyproject.toml): started
  Building wheel for jq (pyproject.toml): finished with status 'error'
Failed to build jq


  error: subprocess-exited-with-error
  
  Building wheel for jq (pyproject.toml) did not run successfully.
  exit code: 1
  
  [5 lines of output]
  running bdist_wheel
  running build
  running build_ext
  Executing: ./configure CFLAGS=-fPIC --prefix=C:\Users\adinh\AppData\Local\Temp\pip-install-a8y06sk2\jq_996a74784b364c019316ced3b01d0e45\_deps\build\onig-install-6.9.4
  error: [WinError 2] The system cannot find the file specified
  [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for jq
ERROR: Could not build wheels for jq, which is required to install pyproject.toml-based projects


I took out JSON from the relevant files since it's not working. I'll try to come back to find a workaround or manually make my own JSON loader.

In [71]:
%%writefile tests/test_loaders.py
from langchain.document_loaders import (
    CSVLoader,
    PyMuPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    UnstructuredWordDocumentLoader,
    UnstructuredPowerPointLoader,
    UnstructuredExcelLoader,
)
from loaders.loader_mapper import LoaderMapper
import pytest


@pytest.mark.parametrize("doc, expected", [
    ('tests/docs/dummy_doc_twinkle.pdf', PyMuPDFLoader('tests/docs/dummy_doc_twinkle.pdf')),
    ('tests/docs/example.csv', CSVLoader('tests/docs/example.csv')),
    ('tests/docs/dummy.txt', TextLoader(file_path='tests/docs/dummy.txt', encoding="utf8")),
    ('tests/docs/dummy.html', UnstructuredHTMLLoader('tests/docs/dummy.html')),
    ('tests/docs/dummy.md', UnstructuredMarkdownLoader('tests/docs/dummy.md')),
    ('tests/docs/dummy.doc', UnstructuredWordDocumentLoader('tests/docs/dummy.doc')),
    ('tests/docs/dummy.docx', UnstructuredWordDocumentLoader('tests/docs/dummy.docx')),
    ('tests/docs/dummy.pptx', UnstructuredPowerPointLoader('tests/docs/dummy.pptx')),
    ('tests/docs/dummy.ppt', UnstructuredPowerPointLoader('tests/docs/dummy.ppt')),
    ('tests/docs/dummy.xlsx', UnstructuredExcelLoader('tests/docs/dummy.xlsx')),
])
def test_return_loader(doc, expected):
    mapper = LoaderMapper()
    loader = mapper.find_loader(doc)
    assert type(loader) == type(expected)



Overwriting tests/test_loaders.py


In [64]:
!pip list

Package                 Version
----------------------- ---------
aiohttp                 3.8.4
aiosignal               1.3.1
anyio                   3.7.0
argilla                 1.10.0
asttokens               2.2.1
async-timeout           4.0.2
attrs                   23.1.0
backcall                0.2.0
backoff                 2.2.1
blis                    0.7.9
boto3                   1.26.157
botocore                1.29.157
catalogue               2.0.8
certifi                 2023.5.7
cffi                    1.15.1
chardet                 5.1.0
charset-normalizer      3.1.0
click                   8.1.3
colorama                0.4.6
comm                    0.1.3
commonmark              0.9.1
confection              0.0.4
cryptography            41.0.1
cymem                   2.0.7
dataclasses-json        0.5.8
debugpy                 1.6.7
decorator               5.1.1
deeplake                3.6.3
Deprecated              1.2.14
dill                    0.3.6
distlib             

In [73]:
!python -m pytest tests/test_loaders.py

platform win32 -- Python 3.11.3, pytest-7.3.2, pluggy-1.2.0
rootdir: C:\Users\adinh\Downloads\ctlpchatbot\utils
plugins: anyio-3.7.0
collected 10 items

tests\test_loaders.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                         [100%][0m



Now test the actual file loading capabilities.

In [80]:
%%writefile -a tests/test_loaders.py
@pytest.mark.parametrize("doc, expected", [
    ('tests/docs/dummy_doc_twinkle.pdf',
    """Twinkle, twinkle, little star,\nHow I wonder what you are!\nUp above the world so high,\nLike a diamond in the sky.\nTwinkle, twinkle, little star,\nHow I wonder what you are!"""),
    ('tests/docs/example.csv', 
     """Name: John
        Age: 25
        Country: USA"""),
    ('tests/docs/dummy.txt',
    """Blah blah blah. Sample text. Blah Blah
    Blah Blah Blah. This is so fun. Blah Blah.
    Abcdefghijklmnopqrstuvwxyz.""" ),
    ('tests/docs/dummy.html', 
    """This is a dummy HTML file.
    It serves as an example."""),
    ('tests/docs/dummy.md', 
    """Dummy Markdown File
    This is a dummy Markdown file.
    It serves as an example. Item 1 Item 2 Item 3"""),
    ('tests/docs/dummy.docx',
    """Dummy Document
    This is a dummy Word document."""),
    ('tests/docs/dummy.pptx',
    """Dummy Presentation
    This is a dummy PowerPoint presentation."""),
    ('tests/docs/dummy.xlsx',
    """This is a dummy Excel spreadsheet."""),
])
def test_load_doc(doc, expected):
    mapper = LoaderMapper()
    loader = mapper.find_loader(doc)
    loaded_doc = loader.load()
    text = loaded_doc[0].page_content
    actual_normalized = re.sub(r'\s+', ' ', text.strip())
    expected_normalized = re.sub(r'\s+', ' ', expected.strip())
    assert actual_normalized == expected_normalized
    
    
 
    

Appending to tests/test_loaders.py


In [86]:
!python -m pytest tests/test_loaders.py

platform win32 -- Python 3.11.3, pytest-7.3.2, pluggy-1.2.0
rootdir: C:\Users\adinh\Downloads\ctlpchatbot\utils
plugins: anyio-3.7.0
collected 16 items

tests\test_loaders.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                   [100%][0m



Another note: I went back and removed the functionality for doc and ppt in favor of docx and pptx because it was outdated and caused issues(required me to install more things on my computer). I didn't want to overcomplicate this for anyone wanting to use it, so unless it's absolutely necessary, I won't be implementing it for now.

# Now we need to go across the repo and change the imports for the text loader to use the new loader mapper instead.

In [88]:
!python -m pytest

platform win32 -- Python 3.11.3, pytest-7.3.2, pluggy-1.2.0
rootdir: C:\Users\adinh\Downloads\ctlpchatbot\utils
plugins: anyio-3.7.0
collected 20 items

tests\test_loaders.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                   [ 80%][0m
tests\test_splitters.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                             [100%][0m



In [None]:
%%writefile tests/test_splitters.py

from loaders.loader_mapper import LoaderMapper
from splitters import character, nltk, recursive, tiktoken
import pytest

#difficult to formally compare results of text splitters, so I checked visually that it worked beforehand, and just tested to see that imports work correctly.
mapper = LoaderMapper()
loader = mapper.find_loader('tests/docs/dummy_doc_twinkle.pdf')
data = loader.load()

def test_character():
    splitter = character.Character_TextSplitter(
        separator= "\n",
        chunk_size = 1000,
        chunk_overlap = 200,
        length_function = len,
    )
    split_docs = splitter.split_data(data)
    assert split_docs is not None

def test_ntlk():
    splitter = nltk.NLTK_TextSplitter(chunk_size = 1000)
    split_docs = splitter.split_data(data)
    assert split_docs is not None

def test_recursive():
    splitter = recursive.RecursiveCharacter_TextSplitter(
        chunk_size = 1000,
        chunk_overlap = 200,
        length_function = len
    )
    split_docs = splitter.split_data(data)
    assert split_docs is not None

def test_tiktoken():
    splitter = tiktoken.Token_TextSplitter(
        chunk_size = 1000,
        chunk_overlap = 200,
    )
    split_docs = splitter.split_data(data)
    assert split_docs is not None


In [1]:
%%writefile tests/test_vectorstore.py
import pytest
import sys
sys.path.append('..')
from ai.embeddings.embeddings_mapper import Embeddings_Mapper
from utils.loaders.loader_mapper import LoaderMapper
from utils.splitters.recursive import RecursiveCharacter_TextSplitter
from utils.vectorstores.deep_lake import DeeplakeDB

def clear_db():
    embeddings_mapper = Embeddings_Mapper()
    embeddings = embeddings_mapper.find_model("openai")
    deeplake = DeeplakeDB(store_path = './test_deeplake', embedding_model = embeddings)
    deeplake.delete_all()

@pytest.fixture(scope="session", autouse=True)
def teardown(request):
    request.addfinalizer(clear_db)
    
@pytest.mark.parametrize("file, content", [
    ('tests/docs/dummy_doc_twinkle.pdf',
    """Twinkle, twinkle, little star,\nHow I wonder what you are!\nUp above the world so high,\nLike a diamond in the sky.\nTwinkle, twinkle, little star,\nHow I wonder what you are!"""),
    ('tests/docs/example.csv', 
     """Name: John
        Age: 25
        Country: USA"""),
    ('tests/docs/dummy.txt',
    """Blah blah blah. Sample text. Blah Blah
    Blah Blah Blah. This is so fun. Blah Blah.
    Abcdefghijklmnopqrstuvwxyz.""" ),
    ('tests/docs/dummy.html', 
    """This is a dummy HTML file.
    It serves as an example."""),
    ('tests/docs/dummy.md', 
    """Dummy Markdown File
    This is a dummy Markdown file.
    It serves as an example. Item 1 Item 2 Item 3"""),
    ('tests/docs/dummy.docx',
    """Dummy Document
    This is a dummy Word document."""),
    ('tests/docs/dummy.pptx',
    """Dummy Presentation
    This is a dummy PowerPoint presentation."""),
    ('tests/docs/dummy.xlsx',
    """This is a dummy Excel spreadsheet."""),
])
def test_deeplake(file, content):
    #set up document to be embedded and stored
    loadermapper = LoaderMapper()
    loader = loadermapper.find_loader(file)
    data = loader.load()
    splitter = RecursiveCharacter_TextSplitter(
        chunk_size = 1000,
        chunk_overlap = 200,
        length_function = len
    )
    docs = [splitter.split_data(data)]
    #set up deeplake db and pass in relevant params
    embeddings_mapper = Embeddings_Mapper()
    embeddings = embeddings_mapper.find_model("huggingface")
    deeplake = DeeplakeDB(store_path = './test_deeplake', embedding_model = embeddings)
    deeplake.add_docs(docs)
    #pass in the file contents and see if it can return the most relevant document
    doc = deeplake.find_similar(content)
    source = doc[0].metadata["source"]
    assert file == source
    
    

Writing tests/test_vectorstore.py


In [1]:
!pytest tests/test_vectorstore.py

platform win32 -- Python 3.11.3, pytest-7.3.2, pluggy-1.2.0
rootdir: C:\Users\adinh\Downloads\ctlpchatbot\utils
plugins: anyio-3.7.0
collected 8 items

tests\test_vectorstore.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[33m                                       [100%][0m

tests/test_vectorstore.py::test_deeplake[tests/docs/dummy_doc_twinkle.pdf-Twinkle, twinkle, little star,\nHow I wonder what you are!\nUp above the world so high,\nLike a diamond in the sky.\nTwinkle, twinkle, little star,\nHow I wonder what you are!]
    import pkg_resources  # type: ignore

tests/test_vectorstore.py::test_deeplake[tests/docs/dummy_doc_twinkle.pdf-Twinkle, twinkle, little star,\nHow I wonder what you are!\nUp above the world so high,\nLike a diamond in the sky.\nTwinkle, twinkle, little star,\nHow I wonder what you are!]



Exception in thread Thread-37:
Traceback (most recent call last):
  File "C:\Users\adinh\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "C:\Users\adinh\.virtualenvs\utils-Z4NtiPxo\Lib\site-packages\deeplake\util\spinner.py", line 77, in run
    self._hide_cursor()
  File "C:\Users\adinh\.virtualenvs\utils-Z4NtiPxo\Lib\site-packages\deeplake\util\spinner.py", line 119, in _hide_cursor
    if self.file.isatty():
       ^^^^^^^^^^^^^^^^^^
  File "C:\Users\adinh\AppData\Local\Programs\Python\Python311\Lib\tempfile.py", line 483, in func_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
ValueError: I/O operation on closed file
Exception in thread Thread-38:
Traceback (most recent call last):
  File "C:\Users\adinh\AppData\Local\Programs\Python\Python311\Lib\threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "C:\Users\adinh\.virtualenvs\utils-Z4NtiPxo\Lib\site-packages\deeplake\util\spin

Now, I want to make an AWS lambda function to automate my training process. It will be triggered every time a document is upladed to s3 or microsoft sharepoint. But the issue is, while I have the training logic implemented, I need to have a cloud based vector store since the lambda function wont have access to my local file system since it is not ran locally. It is ran automatically whenever the trigger event occurs, so I need to set up a cloud vector store, Pinecone for this lambda function.

In [4]:
%%writefile vectorstores/pinecone.py
from langchain.vectorstores import Pinecone

class Pinecone:
    def __init__(self, index, embedding_model):
        self.db = Pinecone(index=index, embedding_function = embedding_model)
    
    def add_docs(self, documents):
        ids = []
        for document in documents:
            id = self.db.add_documents(document)
            ids.append(id)
        if len(ids) == 1:
            return ids[0]
        return ids
    
    def find_similar(self, query):
        return self.db.similarity_search(query)
    
    def delete_by_ids(self, ids):
        self.db.delete(ids)

Overwriting vectorstores/pinecone.py


Let's write a test for it and see if this implementation works.

In [6]:
!pytest tests/test_vectorstore.py

platform win32 -- Python 3.11.3, pytest-7.4.0, pluggy-1.2.0
rootdir: C:\Users\adinh\Downloads\ctlpchatbot\utils
plugins: anyio-3.7.0
collected 1 item

tests\test_vectorstore.py [31mF[0m[31m                                              [100%][0m

[31m[1m_ test_deeplake[tests/docs/dummy_doc_twinkle.pdf-Twinkle, twinkle, little star,\nHow I wonder what you are!\nUp above the world so high,\nLike a diamond in the sky.\nTwinkle, twinkle, little star,\nHow I wonder what you are!] _[0m

file = 'tests/docs/dummy_doc_twinkle.pdf'
content = 'Twinkle, twinkle, little star,\nHow I wonder what you are!\nUp above the world so high,\nLike a diamond in the sky.\nTwinkle, twinkle, little star,\nHow I wonder what you are!'

    [37m@pytest[39;49;00m.mark.parametrize([33m"[39;49;00m[33mfile, content[39;49;00m[33m"[39;49;00m, [[90m[39;49;00m
        ([33m'[39;49;00m[33mtests/docs/dummy_doc_twinkle.pdf[39;49;00m[33m'[39;49;00m,[90m[39;49;00m
    [90m    [39;49;00m[33m"""Twinkle,