# LangChain

In this tutorial, we use `LangChain` as a framework to build a retrieval-based question answering (QA) system.  While `Llama index` is very nice, `LangChain` supports more complex workflows.

In [1]:
# Comment this out if you are using puffer or tokyo
import os
os.environ['http_proxy']  = 'http://192.41.170.23:3128'
os.environ['https_proxy'] = 'http://192.41.170.23:3128'

In [2]:
from watermark import watermark
print(watermark(packages="torch,langchain,transformers"))

torch       : 1.13.0
langchain   : 0.0.171
transformers: 4.21.3



In [3]:
from transformers.utils import logging
logging.set_verbosity(40)  # 40 is the ERROR level: only errors are shown

## 1. Define model

First, we load the large language model that will be used for question answering.  We use `flan-t5-large`, an instruction-tuned T5 model that works reasonably well for this task.

In [4]:
from langchain.llms.base import LLM
from transformers import pipeline
import torch

In [None]:
from typing import Mapping, Optional, Any, List
from langchain.callbacks.manager import CallbackManagerForLLMRun

class CustomLLM(LLM):
    
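    n: int  # maximum number of characters kept from the generated answer
    model_name = "google/flan-t5-large"
    # device=1 selects the second GPU; use device=0 (first GPU) or device=-1 (CPU) to match your machine
    pipeline = pipeline("text2text-generation", model=model_name, device=1, model_kwargs={"torch_dtype":torch.bfloat16})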
        
    @property
    def _llm_type(self) -> str:
        return "custom"
    
    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
    ) -> str:
        if stop is not None:
            raise ValueError("stop kwargs are not permitted.")
        
        out = self.pipeline(prompt, max_length=9999)[0]["generated_text"]
        return out[:self.n]
    
    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        """Get the identifying parameters."""
        return {"n": self.n}

In [None]:
llm=CustomLLM(n=500)
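
Apart from calling the pipeline, the wrapper does only two things: it rejects `stop` sequences and it cuts the generated text to `n` characters. A minimal sketch of that behaviour, using a stand-in for the Hugging Face pipeline so that no model has to be downloaded (`fake_pipeline` and `call_llm` are hypothetical names, not part of LangChain or transformers):

```python
def fake_pipeline(prompt, max_length=9999):
    # Hypothetical stand-in: mimics the list-of-dicts shape returned by a
    # transformers text2text-generation pipeline.
    return [{"generated_text": "echo: " + prompt}]


def call_llm(prompt, n, stop=None):
    # Mirrors the logic of CustomLLM._call: no stop sequences allowed,
    # and the output is truncated to the first n characters.
    if stop is not None:
        raise ValueError("stop kwargs are not permitted.")
    out = fake_pipeline(prompt, max_length=9999)[0]["generated_text"]
    return out[:n]


print(call_llm("Where is AIT", n=10))
```

Note that `n` counts characters, not tokens, so a long answer may be cut in the middle of a word.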

## 2. Define data

Here, I define a `langchain.document_loaders.DirectoryLoader` for each folder I want to read.  Note that I load multiple document types, namely `txt` and `pdf`.  Other sources, such as web pages, can be loaded as well.  As far as I know, this is not as readily available in `Llama index`.

In [None]:
from langchain.document_loaders import DirectoryLoader

txt_loader = DirectoryLoader('txt', glob="**/*.txt")
pdf_loader = DirectoryLoader('pdf', glob="**/*.pdf")
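
Conceptually, a directory loader walks a folder, matches file names against the `glob` pattern, and turns every match into a document. The sketch below reproduces only the matching step with the standard library. `load_text_files` is a hypothetical helper, and it reads plain text only, whereas the real loader also parses formats such as PDF.

```python
import tempfile
from pathlib import Path


def load_text_files(root, pattern="**/*.txt"):
    # Recursively collect matching files and return (path, content) pairs,
    # sorted by path so the order is deterministic.
    docs = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            docs.append((str(path), path.read_text(encoding="utf-8")))
    return docs


with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway folder tree for the demonstration.
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.txt").write_text("first file", encoding="utf-8")
    (Path(tmp) / "sub" / "b.txt").write_text("second file", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("ignored", encoding="utf-8")

    docs = load_text_files(tmp)
    print(len(docs))
```

The `**/` prefix makes the pattern recursive, so files in the top-level folder and in any subfolder are both picked up.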

In [8]:
# collect all the loaders
loaders = [pdf_loader, txt_loader]

# load the documents from every loader into a single list
documents = []
for loader in loaders:
    documents.extend(loader.load())

detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with another strategy.
Falling back to partitioning with ocr_only.


In [9]:
len(documents)

4

To improve search efficiency, we split the documents into smaller chunks using `CharacterTextSplitter`.

In [10]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

Created a chunk of size 1050, which is longer than the specified 1000
Created a chunk of size 1002, which is longer than the specified 1000
Created a chunk of size 1064, which is longer than the specified 1000
Created a chunk of size 1099, which is longer than the specified 1000
Created a chunk of size 1032, which is longer than the specified 1000
Created a chunk of size 1065, which is longer than the specified 1000
Created a chunk of size 1215, which is longer than the specified 1000
Created a chunk of size 1026, which is longer than the specified 1000
Created a chunk of size 1018, which is longer than the specified 1000
Created a chunk of size 1295, which is longer than the specified 1000
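
The warnings above are expected. As far as I know, `CharacterTextSplitter` splits on a separator (a blank line by default) and never cuts inside a piece, so a single paragraph longer than `chunk_size` ends up as an oversized chunk. The following is a simplified sketch of that idea, not LangChain's actual implementation; `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    # Split into pieces at the separator, dropping empty pieces.
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        # Try to append the next piece to the chunk being built.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A piece is never cut, so it may exceed chunk_size on its own.
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 120
chunks = split_on_separator(text, chunk_size=100)
print([len(c) for c in chunks])
```

Here the two short paragraphs are merged into one chunk, while the 120-character paragraph becomes a chunk that is longer than the requested 100 characters.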


## 3. Embeddings

Next, we define the embedding model used to embed our documents.  There are many options; here we simply use `HuggingFaceEmbeddings`.

Note: there are many good embedding models.  For example, try `hkunlp/instructor-base`.

In [11]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

hfemb = HuggingFaceEmbeddings()
# Note: HuggingFaceEmbeddings uses sentence-transformers under the hood

## 4. Create vector store

Next, we create the vector store, which holds the embedding of each chunk and makes similarity search possible.  Here I use `Chromadb`, a vector store that can persist its data to a local directory.

In [12]:
from langchain.vectorstores import Chroma

persist_directory = 'db'

vectorstore = Chroma.from_documents(documents=texts, embedding=hfemb, persist_directory=persist_directory)

Using embedded DuckDB with persistence: data will be stored in: db
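
To make the role of the vector store concrete, here is a toy version of similarity search in plain Python. It is only a sketch under strong assumptions: the "embedding" is a bag-of-words count rather than a neural model, the search is a brute-force scan, and `embed`, `cosine` and `top_k` are hypothetical helpers, not Chroma's API.

```python
import math
from collections import Counter


def embed(text):
    # Toy embedding: a sparse vector of lowercase word counts.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    # Rank every chunk by similarity to the query and keep the best k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "the campus library opens at nine",
    "research projects span several countries",
    "the cafeteria serves lunch at noon",
]
print(top_k("when does the library open", chunks, k=1))
```

A real vector store does the same ranking over dense vectors from the embedding model, usually with an index that avoids scanning every chunk.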


## 5. Query time

Now that we have the LLM (i.e., Flan-T5) and the vector store, there are many ways to combine them.  In LangChain, such a combination is called a `chain`.  Here we use a simple one called `RetrievalQA`.

In [13]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever())
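
With `chain_type="stuff"`, the retrieved chunks are all placed ("stuffed") into a single prompt together with the question, and the LLM is called once. A rough sketch of that prompt assembly follows. The template wording and the `build_stuff_prompt` helper are hypothetical and do not reproduce LangChain's own template.

```python
def build_stuff_prompt(question, chunks):
    # Join all retrieved chunks into one context block, then append the question.
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuff_prompt("Where is AIT", retrieved)
print(prompt)
```

Because everything goes into one prompt, the retrieved chunks together with the question have to fit within the model's input limit.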

In [14]:
query = "Where is AIT"
qa.run(query)

'Asia'

In [15]:
query = "What is AIT"
qa.run(query)

'AIT remains a magnet for International Organizations and institutions of higher learning, when it comes to partnerships in education, research and development and outreach activities'

In [16]:
query = "What kind of research does AIT do"
qa.run(query)

'regional and transnational research projects'