# AI package
Contains the implementation for the LLMs and supporting libraries.

In [1]:
!pip install pipenv



In [2]:
!pipenv install pytest

Installing pytest...
Resolving pytest...
Adding pytest to Pipfile's [packages] ...
Installation Succeeded

Building requirements...
Resolving dependencies...

Creating a Pipfile for this project...
Pipfile.lock not found, creating...
Locking [packages] dependencies...
Locking [dev-packages] dependencies...
Updated Pipfile.lock (922e82e69ac92d524e9aec65cbead9fdef4cdb3fcff8f459d8998bfd7bd6a67f)!


Since this module will be exported as a package, \_\_init\_\_.py just needs to exist so the Python packaging mechanism knows where to look for functions.

In [5]:
%%writefile -a __init__.py
print('Hello, world!')

Writing __init__.py


The first step is processing the training data (instruction manuals). I'll move over to the utils module and implement that step there so other modules can use it.

Now that this step is done, I'll start implementing the embeddings and vector store steps.

We'll stay away from cloud-based databases such as Pinecone for now. To my knowledge, FAISS and Chroma are two options that run locally in memory. I'll do more research to find others. For now, I'll just use FAISS, which also offers efficient similarity search.


Note: For question answering over many documents, you almost always want to create an index over the data. This can be used to smartly access the most relevant documents for a given question, allowing you to avoid having to pass all the documents to the LLM (saving you time and money).
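
To make the idea of an index concrete, here is a minimal brute-force sketch of "find the most relevant chunk for a question". It is only an illustration: the three-dimensional vectors are made up, `cosine_similarity` and `top_k` are hypothetical helper names, and a real vector store such as FAISS uses far more efficient search than this linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    return sorted(scored, reverse=True)[:k]

# Toy 3-dimensional "embeddings"; real ones come from an embedding model.
index = [
    ("tools required for installation", [0.9, 0.1, 0.0]),
    ("activation and contracts", [0.1, 0.9, 0.1]),
    ("card reader wiring", [0.2, 0.1, 0.9]),
]

best = top_k([1.0, 0.0, 0.1], index, k=1)
print(best[0][1])
```

Only the top-scoring chunks would then be passed to the LLM, which is where the time and cost savings come from.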

# Let's try to use FAISS in conjunction with the util modules we created earlier, the PDF loaders and text splitters. We'll just use PyMuPDF and RecursiveCharacterTextSplitter.

In [5]:
%env OPENAI_API_KEY=<your-openai-api-key>

env: OPENAI_API_KEY=<your-openai-api-key>


In [6]:
!pip install faiss-cpu
!pip install langchain
!pip install openai
!pip install pymupdf
!pip install tiktoken
import os
import sys
sys.path.append('../utils/loaders')
sys.path.append('../utils/splitters')
from pymupdf import PyMuPDF_Loader
from recursive import RecursiveCharacter_TextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

pdf = '../../ePortG9QuickStartGuide.pdf'
loader = PyMuPDF_Loader(pdf)
data = loader.load_text()

splitter = RecursiveCharacter_TextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len
)
docs = splitter.split_data(data)

embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(docs, embeddings)

query="What are the tools required to install the ePort G9"
docs = db.similarity_search(query)

print(docs[0].page_content)

CDMA ePort G9 Quickstart Guide
TOOLS REQUIRED FOR INSTALLATION
11/32” Socket
1/4” Socket
12” Extension
Ratchet
Phillips Screwdriver
Wire Cutters
Power Drill*
3/16” & 3/8” Drill Bit*
Multi-diameter Step Drill Bit, 
1/4”-3/4”*
#VVXUD0101910
* Required for surface-mount installations
Thank you for your purchase. Before you start, please read these instructions thoroughly, 
and then take a few moments to plan your installation. 
You must have a signed ePort Connect Services Contract and a bank account assigned to 
this device for it to be able to accept credit card transactions.  Please call USA Technologies 
Customer Care at 1.888.561.4748, if you need activation documents.
G9 Telemeter
Card Reader
CANTALOUPE
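
The `chunk_size` and `chunk_overlap` arguments used in the cell above are easier to reason about with a small example. The sketch below is a simplified fixed-window splitter, not the recursive splitter itself (which prefers to break on separators such as paragraph and line breaks); `chunk_text` is a hypothetical helper written just for this illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The last 4 characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary still shows up whole in at least one chunk.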


We can see that FAISS works in memory since we didn't use an external DB. We can also see that the modules I made for the PyMuPDF loader and recursive text splitter are importable and work as intended. Now I want to try Deep Lake, which seems to be a local, persistent option, so I don't have to keep recomputing embeddings. I only need to embed the training documents once; after that they are stored locally.
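
The "compute once, store locally" idea can be sketched without any vector database. This is not how Deep Lake works internally; it is just a stdlib illustration where `cached_embed` and `fake_embed` are hypothetical names and the JSON file stands in for the persistent store.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def text_key(text):
    """Stable cache key derived from the text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def cached_embed(texts, embed_fn, cache_path):
    """Embed texts, reusing vectors already stored in a JSON file at cache_path."""
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    for text in texts:
        if text_key(text) not in cache:
            cache[text_key(text)] = embed_fn(text)  # only new text reaches the model
    cache_path.write_text(json.dumps(cache))
    return [cache[text_key(text)] for text in texts]

calls = []

def fake_embed(text):
    """Hypothetical stand-in for a real embedding call; records each invocation."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.json"
    first = cached_embed(["install the reader", "call support"], fake_embed, path)
    second = cached_embed(["install the reader", "call support"], fake_embed, path)

print(len(calls))
```

The second call returns the same vectors without invoking the embedding function again, which is the behaviour I want from a persistent store.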

We'll go back and test the different embedding models after we get the OpenAI model working and organized in a module. So, my next steps are to create a new subpackage called "embeddings" and make an "openai_embeddings" module.

In [3]:
from pathlib import Path

#make the directory embeddings
Path("embeddings").mkdir(parents=True, exist_ok=True)

In [5]:
%%writefile -a embeddings/openai_embeddings.py
from langchain.embeddings import OpenAIEmbeddings

class OpenAI_Embeddings:
    """
    A class to initialize a new embedder/vectorizer object based on the OpenAIEmbeddings wrapper from langchain.
    The wrapper has its own methods to embed/vectorize text, but we'll just use it as an object to pass in to the vector store class.
    If interested, visit https://python.langchain.com/en/latest/modules/models/text_embedding.html or the associated github repository.
    
    """
    def __init__(self, api_key, model="text-embedding-ada-002"):
        """
        Initializes a new instance of the embeddings object that can be passed in to a vector store.
        The vector store will handle using the chosen embedder/vectorizer to convert and store it.
        
        :param api_key: valid OpenAI API key to call the API
        :param model: default is "text-embedding-ada-002", visit https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
        """
        # the langchain wrapper expects the key under the name `openai_api_key`
        self.vectorizer = OpenAIEmbeddings(openai_api_key=api_key, model=model)
    

Writing embeddings/openai_embeddings.py


In [6]:
filename= 'embeddings/__init__.py'

with open(filename, 'w') as file:
    pass

Now make a setup.py file at the root level.

In [1]:
!pip list

Package                 Version
----------------------- --------
aiohttp                 3.8.4
aiosignal               1.3.1
asttokens               2.2.1
async-timeout           4.0.2
attrs                   23.1.0
backcall                0.2.0
certifi                 2023.5.7
charset-normalizer      3.1.0
colorama                0.4.6
comm                    0.1.3
dataclasses-json        0.5.8
debugpy                 1.6.7
decorator               5.1.1
exceptiongroup          1.1.1
executing               1.2.0
faiss-cpu               1.7.4
frozenlist              1.3.3
greenlet                2.0.2
idna                    3.4
iniconfig               2.0.0
ipykernel               6.23.2
ipython                 8.14.0
jedi                    0.18.2
jupyter_client          8.2.0
jupyter_core            5.3.1
langchain               0.0.200
langchainplus-sdk       0.0.10
marshmallow             3.19.0
marshmallow-enum        1.5.1
matplotlib-inline       0.1.6
multidict               6.

In [3]:
%%writefile -a setup.py
from setuptools import setup, find_packages

setup(
    name='ai',
    version='1.0',
    author='Arvin Dinh',
    description='A Python package containing LLM model implementation for embedding models, chat models',
    packages=find_packages(),
    # version pins need `==`; a single `=` is not a valid requirement specifier
    install_requires=['langchain==0.0.200',
                      'openai==0.27.8'
                     ]
)


Writing setup.py


# I'll try out different embedding models after I set up my vector store and prompting, since there's no real way to gauge the effectiveness of embedding models yet and I would also have to set up API keys for the other models. Now, I'll head back to the utils module and set up the local vector store.

# Now that it's set up, we have to try out and implement the different LLMs, as well as the buffer memory and conversational retrieval chain. I'll do the latter first.

I'll group the memory and conversational retrieval chain into a subdirectory, "chains". More chains can be added based on use case, but for now I'll make a conversational.py module.

In [1]:
from pathlib import Path

Path("chains").mkdir(parents=True, exist_ok=True)

In [2]:
%%writefile -a chains/conversational.py
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

class ConversationModel:
    """
    ConversationModel is a simple wrapper for a conversational language model that uses chat history in addition to context from db
    """
    def __init__(self, llm, db):
        """
        Initializes a conversational retrieval chain based on a given llm model, vector store.
        
        :param llm: langchain language model object
        :param db: langchain vector store object
        """
        memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
        self.chat = ConversationalRetrievalChain.from_llm(llm=llm, retriever=db.as_retriever(), memory=memory)
    
    def get_response(self, query):
        """
        returns the response given by the given language model based on a given query
        
        :param query: string, question to be passed in to the llm
        
        :return: string, response given by llm based on query and embedded documents in vector store
        """
        response = self.chat({"question": query})
        return response["answer"]

Writing chains/conversational.py
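
To show what "chat history in addition to context from db" means in practice, here is a toy version with no langchain dependency. Everything in it is hypothetical (`BufferMemory`, `ToyConversation`, `echo_llm`, `fixed_retriever`); the real chain is more involved and, as far as I understand, also rewrites a follow-up question into a standalone one before retrieving.

```python
class BufferMemory:
    """Keeps every (question, answer) pair from the conversation so far."""
    def __init__(self):
        self.history = []

    def add(self, question, answer):
        self.history.append((question, answer))

class ToyConversation:
    """Combines chat history and retrieved context into one prompt per turn."""
    def __init__(self, llm, retriever):
        self.llm = llm              # callable: prompt string -> answer string
        self.retriever = retriever  # callable: question -> list of text chunks
        self.memory = BufferMemory()

    def get_response(self, query):
        context = "\n".join(self.retriever(query))
        history = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.memory.history)
        prompt = f"History:\n{history}\n\nContext:\n{context}\n\nQuestion: {query}"
        answer = self.llm(prompt)
        self.memory.add(query, answer)  # remember this turn for the next one
        return answer

prompts = []

def echo_llm(prompt):
    """Stand-in LLM that records the prompt it was given."""
    prompts.append(prompt)
    return f"answer {len(prompts)}"

def fixed_retriever(question):
    """Stand-in retriever that always returns the same chunk."""
    return ["Tools: ratchet, wire cutters."]

chat = ToyConversation(echo_llm, fixed_retriever)
chat.get_response("Which tools do I need?")
chat.get_response("Are they all required?")
print(prompts[1])
```

The second prompt contains the first question and its answer, which is what lets a follow-up like "Are they all required?" make sense to the model.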


Now make a subdirectory for LLMs. We'll start with an openai module in there.

In [3]:
from pathlib import Path

Path("llms").mkdir(parents=True, exist_ok=True)

In [5]:
%%writefile -a llms/openai.py

from langchain.llms import OpenAI

class Open_AI:
    """
    Simple wrapper for regular OpenAI langchain class, can adjust temperature and pass in api key
    """
    # api_key comes first: a parameter without a default cannot follow one that has a default
    def __init__(self, api_key, temperature = 0):
        """
        Initializes an OpenAI object that can be passed in to a chain in a given chains module
        
        :param api_key: openai api key
        :param temperature: takes values 0-2, lower = more focused and deterministic, higher = more random and diverse.
        """
        self.model = OpenAI(temperature = temperature, openai_api_key = api_key)


Writing llms/openai.py


# Quick detour: I'll set up the training and prompting in their respective packages with the OpenAI model. Then I'll start exploring other models.

Now, we'll try AI21.

We'll be using a lot of API keys from here on, so I'll need to keep track of environment variables.

In [1]:
%%writefile -a .env
OPENAI_API_KEY='<your-openai-api-key>'
AI21_API_KEY='<your-ai21-api-key>'

UsageError: Cell magic `%%makefile` not found.


In [4]:
!pip install python-dotenv



Collecting python-dotenv
  Using cached python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.0
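
Here is a small sketch of how the keys could be read once python-dotenv is available. `require_env` and `DEMO_API_KEY` are hypothetical names used only for this example; the `load_dotenv()` call is wrapped in a try/except so the snippet still runs where the package is not installed.

```python
import os

try:
    # If python-dotenv is installed, load variables from a local .env file.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

os.environ["DEMO_API_KEY"] = "not-a-real-key"  # demo value only
print(require_env("DEMO_API_KEY"))
```

Failing early with the variable's name is easier to debug than an authentication error deep inside an API call. The .env file itself should stay out of version control.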


Another detour! We need to refactor the code and our directories, specifically the ai and embeddings subdirectories. I'm going to start with the embeddings. We can actually use an embeddings mapper similar to how we handled the text loaders. The model is going to be passed in by name and mapped to the corresponding embeddings model. For example, "openai" is going to map to langchain's OpenAIEmbeddings().
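
The name-to-model mapping described above can be sketched with stand-in classes so it runs without langchain or an API key. `FakeOpenAIEmbeddings`, `FakeLocalEmbeddings`, `MODEL_MAP`, and the "local" entry are all hypothetical and exist only for this illustration.

```python
class FakeOpenAIEmbeddings:
    """Hypothetical stand-in for an embeddings wrapper that needs a key."""
    def __init__(self, model, api_key):
        self.model = model
        self.api_key = api_key

class FakeLocalEmbeddings:
    """Hypothetical stand-in for a wrapper that needs no arguments."""

# name -> (class, keyword arguments); a tuple keeps the pair ordered and unpackable
MODEL_MAP = {
    "openai": (FakeOpenAIEmbeddings, {"model": "text-embedding-ada-002", "api_key": "demo"}),
    "local": (FakeLocalEmbeddings, {}),
}

def find_model(name):
    """Instantiate the class registered under name."""
    if name not in MODEL_MAP:
        raise ValueError(f"Model '{name}' not recognized")
    model_class, model_args = MODEL_MAP[name]
    return model_class(**model_args)

embedder = find_model("openai")
print(type(embedder).__name__, embedder.model)
```

Each entry has to be a tuple (or list) rather than a set literal: a set cannot contain a dict, and it has no defined order to unpack into class and arguments.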

In [1]:
%%writefile embeddings/embeddings_mapper.py
from langchain.embeddings import OpenAIEmbeddings

class Embeddings_Mapper:
    """
    A class to initialize a new embedding model based on the wrappers from langchain.
    """
    def __init__(self, api_key):
        self.api_key = api_key

        # each entry is a (class, kwargs) tuple; a set literal would fail because dicts are unhashable
        self.model_map = {
            "openai" : (OpenAIEmbeddings, {"model": "text-embedding-ada-002", "openai_api_key": self.api_key})
        }
    
    def find_embeddings_model(self, model):
        if model in self.model_map:
            model_class, model_args = self.model_map[model]
            return model_class(**model_args)
        
        raise ValueError(f"Model '{model}' not recognized")

Writing embeddings/embeddings_mapper.py


In [6]:
import os
os.remove("embeddings/openai_embeddings.py")

In [4]:
%%writefile llms/llms_mapper.py
from langchain.llms import OpenAI

class LLMs_Mapper:
    """
    A class to initialize a new language model based on the wrappers from langchain.
    """
    def __init__(self, api_key):
        self.api_key = api_key

        # each entry is a (class, kwargs) tuple; a set literal would fail because dicts are unhashable
        self.model_map = {
            "openai" : (OpenAI, {"temperature": 0, "openai_api_key": self.api_key})
        }
    
    def find_llm(self, model):
        if model in self.model_map:
            model_class, model_args = self.model_map[model]
            return model_class(**model_args)
        
        raise ValueError(f"LLM '{model}' not recognized")

Writing llms/llms_mapper.py


I need to make sure I can mix and match embedding models, LLMs, and so on. If not, I'll define specific combinations from the ones that work. I may put those in the chains dir afterwards.

In [11]:
%%writefile test.py 
import sys
sys.path.append('..')
import os
import argparse
from langchain.embeddings import HuggingFaceEmbeddings
from utils.loaders.loader_mapper import LoaderMapper
from utils.splitters.recursive import RecursiveCharacter_TextSplitter
from utils.vectorstores.deep_lake import DeeplakeDB

openai_api_key = os.environ.get('OPENAI_API_KEY')

def load_and_split(pdf):
    """
    This method takes an input pdf to be loaded and split into chunks
    
    :param pdf: path to training document
    
    :return: split langchain Document objects
    """
    mapper = LoaderMapper()
    loader = mapper.find_loader(pdf)
    data = loader.load()
    # split extracted text(tokenize)
    # split recursively by different characters - starting with "\n\n", then "\n", then " "
    splitter = RecursiveCharacter_TextSplitter(
        chunk_size = 1000,
        chunk_overlap = 200,
        length_function = len
    )
    docs = splitter.split_data(data)
    return docs

def embed_and_store(docs):
    """
    This method takes an input list of chunked documents to be embedded and stored
    
    :param docs: list of split langchain Document objects
    """
    # initialize embeddings model to pass in to db
    embeddings = HuggingFaceEmbeddings()
    # initialize vector store, add split docs
    # (db will compute embeddings using embedding model and store in specified path)
    deeplake = DeeplakeDB(store_path='./embeddings_deeplake', embedding_model=embeddings)
    deeplake.add_docs(docs)

def main():
    """
    When the file is run, the documents listed below are loaded, split, and embedded, then stored.
    """
    docs = ['../training/docs/ePortG11InstallGuide.pdf']
    split_docs = []
    for doc in docs:
        chunks = load_and_split(doc)
        # extend (not append) so split_docs stays a flat list of Document objects
        split_docs.extend(chunks)
        
    embed_and_store(split_docs)

if __name__=="__main__":
    main()


Overwriting test.py


In [3]:
!del test.py

Now that we know we can mix and match embeddings and LLMs, add the HuggingFace embeddings as an option to the mapper. Also, we can handle the API keys in this ai module since that's where they are actually used, so we don't have to worry about passing them in from the prompting or training modules.

In [5]:
%%writefile embeddings/embeddings_mapper.py
import os
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings

class Embeddings_Mapper:
    """
    A class to initialize a new embedding model based on the wrappers from langchain.
    """
    def __init__(self):
        self.openai_key = os.environ.get('OPENAI_API_KEY')

        # each entry is a (class, kwargs) tuple
        self.model_map = {
            "openai" : (OpenAIEmbeddings, {"model": "text-embedding-ada-002", "openai_api_key": self.openai_key}),
            "huggingface": (HuggingFaceEmbeddings, {}),
        }
    
    def find_model(self, model):
        if model in self.model_map:
            model_class, model_args = self.model_map[model]
            return model_class(**model_args)
        
        raise ValueError(f"Model '{model}' not recognized")



Invalid switch - "openai.py".


In [None]:
%%writefile llms/llms_mapper.py
import os
from langchain.llms import OpenAI

class LLMs_Mapper:
    """
    A class to initialize a new language model based on the wrappers from langchain.
    """
    def __init__(self):
        """
        Initializes a new mapper to return an LLM object based on the langchain wrapper

        """
        self.openai_key = os.environ.get('OPENAI_API_KEY')

        # each entry is a (class, kwargs) tuple
        self.model_map = {
            "openai" : (OpenAI, {"temperature": 0, "openai_api_key": self.openai_key}),
            # temperature: takes values 0-2, lower = more focused and deterministic, higher = more random and diverse.
        }
    
    def find_model(self, model):
        if model in self.model_map:
            model_class, model_args = self.model_map[model]
            model = model_class(**model_args)
            return model
        
        raise ValueError(f"LLM '{model}' not recognized")


Now head over to the training and prompting modules to fix the imports and instantiations of the new classes we made.

We need to write up some tests for the embeddings, LLMs, and chains.

In [1]:
!mkdir tests

In [8]:
%%writefile tests/test_embeddings.py
import sys
sys.path.append('..')
import pytest
from ai.embeddings.embeddings_mapper import Embeddings_Mapper
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings

@pytest.mark.parametrize("model, expected", [
    ("openai", OpenAIEmbeddings()),
    ("huggingface", HuggingFaceEmbeddings()),
])
def test_mapper(model, expected):
    mapper = Embeddings_Mapper()
    embeddings = mapper.find_model(model)
    assert isinstance(embeddings, type(expected))


Overwriting tests/test_embeddings.py


In [5]:
!pytest -v tests/test_embeddings.py

platform win32 -- Python 3.10.9, pytest-7.4.0, pluggy-1.2.0 -- C:\Users\adinh\.virtualenvs\ai-TGJVfiId\Scripts\python.exe
cachedir: .pytest_cache
rootdir: C:\Users\adinh\Downloads\ctlpchatbot\ai
collecting ... collected 2 items

tests/test_embeddings.py::test_mapper[openai-expected0] PASSED           [ 50%]
tests/test_embeddings.py::test_mapper[huggingface-expected1] PASSED      [100%]



In [10]:
%%writefile tests/test_llms.py
import sys
sys.path.append('..')
import pytest
from ai.llms.llms_mapper import LLMs_Mapper
from langchain.llms import OpenAI


@pytest.mark.parametrize("model, expected", [
    ("openai", OpenAI()),
])
def test_mapper(model, expected):
    mapper = LLMs_Mapper()
    llm = mapper.find_model(model)
    assert isinstance(llm, type(expected))


Overwriting tests/test_llms.py


In [11]:
!pytest tests/test_llms.py

platform win32 -- Python 3.10.9, pytest-7.4.0, pluggy-1.2.0
rootdir: C:\Users\adinh\Downloads\ctlpchatbot\ai
collected 1 item

tests\test_llms.py .                                                     [100%]



In [None]:
!pip install 