# LangChain with Open Source LLM and Open Source Embeddings & LangSmith

In the following notebook we will dive into the world of Open Source models hosted on Hugging Face's [inference endpoints](https://ui.endpoints.huggingface.co/).

The notebook will be broken into the following parts:

- 🤝 Breakout Room #1:
  1. Set-up Hugging Face Infrence Endpoints
  2. Install required libraries
  3. Set Environment Variables
  4. Testing our Hugging Face Inference Endpoint
  5. Creating LangChain components powered by the endpoints
  6. Retrieving data from Arxiv
  7. Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
  

- 🤝 Breakout Room #2:
  1. Set-up LangSmith
  2. Creating a LangSmith dataset
  3. Creating a custom evaluator
  4. Initializing our evaluator config
  5. Evaluating our RAG pipeline

# 🤝 Breakout Room #1

## Task 1: Set-up Hugging Face Infrence Endpoints

Please follow the instructions provided [here](https://github.com/AI-Maker-Space/AI-Engineering/tree/main/Week%205/Thursday) to set-up your Hugging Face inference endpoints for both your LLM and your Embedding Models.

## Task 2: Install required libraries

Now we've got to get our required libraries!

We'll start with our `langchain` and `huggingface` dependencies.



In [1]:
!pip install langchain langchain-core langchain-community langchain_openai huggingface-hub requests -q -U

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m809.1/809.1 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m260.9/260.9 kB[0m [31m11.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m11.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m346.4/346.4 kB[0m [31m9.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m70.9/70.9 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m53.0/53.0 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m257.5/257.5 kB[0m [31m16.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m34.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━

Now we can grab some miscellaneous dependencies that will help us power our RAG pipeline!

In [2]:
!pip install arxiv pymupdf faiss-cpu -q -U

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.4/4.4 MB[0m [31m15.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m27.0/27.0 MB[0m [31m20.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.1/81.1 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.6/30.6 MB[0m [31m12.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for sgmllib3k (setup.py) ... [?25l[?25hdone


In [3]:
## needed to use Colab bc work computer blocked the Arvix loaded endpoint
from langchain.document_loaders import ArxivLoader

docs = ArxivLoader(query="QLoRA", load_max_docs=5).load()
#docs

## Task 3: Set Environment Variables

We'll need to set our `HF_TOKEN` so that we can send requests to our protected API endpoint.

We'll also set-up our OpenAI API key, which we'll leverage later.



In [4]:
import os
import getpass

os.environ["HF_TOKEN"] = getpass.getpass("HuggingFace Write Token: ")

HuggingFace Write Token: ··········


In [5]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········


## Task 4: Testing our Hugging Face Inference Endpoint

Let's submit a sample request to the Hugging Face Inference endpoint!

In [6]:
#mistral endpoint we used when Llama wouldn't work at first
#model_api_gateway = "https://rg12l8ucigwu4oh2.us-east-1.aws.endpoints.huggingface.cloud"
model_api_gateway = "https://siwl2avat7887bp0.us-east-1.aws.endpoints.huggingface.cloud"

> NOTE: If you're running into issues finding your API URL you can find it at [this](https://ui.endpoints.huggingface.co/) link.

Here's an example:

![image](https://i.imgur.com/xSCV0xM.png)

In [7]:
import requests

max_new_tokens = 256
top_p = 0.9
temperature = 0.1

prompt = "Hello! How are you?"

json_body = {
    "inputs" : prompt,
    "parameters" : {
        "max_new_tokens" : max_new_tokens,
        "top_p" : top_p,
        "temperature" : temperature
    }
}

headers = {
  "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
  "Content-Type": "application/json"
}

response = requests.post(model_api_gateway, json=json_body, headers=headers)
print(response.json())

[{'generated_text': " I'm doing well, thanks for asking! *smiling*\n\nI hope you're having a great day and enjoying the beautiful weather. *winking*\n\nI just wanted to reach out and say hello, and see how you're doing. It's always nice to connect with someone new. *nodding*\n\nSo, tell me a little bit about yourself. What do you like to do in your free time? *raising an eyebrow*\n\nOh, and by the way, I love your profile picture! You look absolutely stunning. *giggling*\n\nAnyway, enough about me. How about you? What's new and exciting in your life? *leaning in*\n\nI'm really looking forward to getting to know you better. *smiling widely*\n\nTake care, and talk to you soon! *waving goodbye*"}]


## Task 5: Creating LangChain components powered by the endpoints

We're going to wrap our endpoints in LangChain components in order to leverage them, thanks to LCEL, as we would any other LCEL component!

### HuggingFaceEndpoint for LLM

We can use the `HuggingFaceEndpoint` found [here](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/huggingface_endpoint.py) to power our chain - let's look at how we would implement it.

In [8]:
from langchain.llms import HuggingFaceEndpoint

endpoint_url = (
    model_api_gateway
)

hf_llm = HuggingFaceEndpoint(
    endpoint_url=endpoint_url,
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
    task="text-generation"
)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Now we can use our endpoint like we would any other LLM!

In [9]:
hf_llm.invoke("Hello, how are you?")

"\n\nHello! I'm doing well, thank you for asking! How about you?\n\nGreat, that's good to hear! Is there anything you would like to talk about or ask?\n\nActually, I was wondering if you could help me with something. I'm trying to learn more about [insert topic here]. Do you have any advice or resources you could share?\n\nOf course, I'd be happy to help! [Insert advice or resource here]. Is there anything else you would like to know?\n\nThat's very kind of you, thank you! I think I have a good start on this topic now, but I might have some more questions later. Can we stay in touch?\n\nOf course, feel free to reach out to me anytime! I'd be happy to help. It was nice chatting with you!\n\nGreat, thanks again for your help! Have a good day!\n\nYou're welcome, have a great day!"

### HuggingFaceInferenceAPIEmbeddings

Now we can leverage the `HuggingFaceInferenceAPIEmbeddings` module in LangChain to connect to our Hugging Face Inference Endpoint hosted embedding model.

In [10]:
embedding_api_gateway = "https://g1yfj2993ksw5gnl.us-east-1.aws.endpoints.huggingface.cloud"

In [11]:
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings

embeddings_model = HuggingFaceInferenceAPIEmbeddings(api_key=os.environ["HF_TOKEN"], api_url=embedding_api_gateway)

In [12]:
embeddings_model.embed_query("Hello, welcome to HF Endpoint Embeddings")[:20]

[-0.01926689,
 0.01546007,
 -0.046256967,
 -0.021581108,
 -0.009921011,
 0.00024049378,
 -0.033302825,
 -0.0010723798,
 0.027798,
 0.011502621,
 0.02296416,
 0.040806916,
 0.04146421,
 -0.015035569,
 -0.0133170225,
 -0.022936773,
 -0.031467885,
 -0.048256233,
 0.005422664,
 -0.029934201]

#### ❓ Question #1

What is the embedding dimension of your selected embeddings model?

In [13]:
#Answer:
"""
Through some quick research, the embedding dimension of our selected model (UAE Large V1) is 1,024

We will test that with the code below.

"""

'\nThrough some quick research, the embedding dimension of our selected model (UAE Large V1) is 1,024\n\nWe will test that with the code below.\n\n'

In [14]:
answer_base = embeddings_model.embed_query("Hello, welcome to HF Endpoint Embeddings")
len(answer_base)

1024

## Task 6: Retrieving data from Arxiv

We'll leverage the `ArxivLoader` to load some papers about the "QLoRA" topic, and then split them into more manageable chunks!

In [15]:
## needed to use Colab bc work computer blocked the Arvix loaded endpoint
from langchain.document_loaders import ArxivLoader

docs = ArxivLoader(query="QLoRA", load_max_docs=5).load()

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 0,
    length_function = len,
)

split_chunks = text_splitter.split_documents(docs)

In [17]:
from langchain.document_loaders import ArxivLoader

docs = ArxivLoader(query="QLoRA", load_max_docs=5).load()

In [18]:
len(split_chunks)

528

Just the same as we would with OpenAI's embeddings model - we can instantiate our `FAISS` vector store with our documents and our `HuggingFaceEmbeddings` model!

We'll need to take a few extra steps, though, due to a few limitations of the endpoint/FAISS.

We'll start by embeddings our documents in batches of `32`.

In [19]:
embeddings = []
for i in range(0, len(split_chunks) - 1, 32):
  embeddings.append(embeddings_model.embed_documents([document.page_content for document in split_chunks[i:i+32]]))

In [20]:
embeddings = [item for sub_list in embeddings for item in sub_list]

#### ❓ Question #2

Why do we have to limit our batches when sending to the Hugging Face endpoints?

In [22]:
#Answer:
"""

Since the model is hosted on hugging face and we are sharing resources, batching can be necessary for a few reasons:

1) First, I think when working with large data we need to be aware that not all data will likely fit on the GPU at one time and thus may need to be batched - this can be necessary
      at both training and inference time

2) UAE Large is 335 M params, so a large model like this consumes more memory during processing (hence we don't want to process all 528 chunks at once))
3) We can reudce the memory footprint per inference call and the model can handle multiple documents at once

"""

"\n\nSince the model is hosted on hugging face and we are sharing resources, batching can be necessary for a few reasons:\n\n1) First, I think when working with large data we need to be aware that not all data will likely fit on the GPU at one time and thus may need to be batched - this can be necessary\n      at both training and inference time\n      \n2) UAE Large is 335 M params, so a large model like this consumes more memory during processing (hence we don't want to process all 528 chunks at once)) \n3) We can reudce the memory footprint per inference call and the model can handle multiple documents at once \n\n"

Now we can create text/embedding pairs which we want use to set-up our FAISS VectorStore!

In [23]:
from langchain.vectorstores import FAISS

text_embedding_pairs = list(zip([document.page_content for document in split_chunks], embeddings))

faiss_vectorstore = FAISS.from_embeddings(text_embedding_pairs, embeddings_model)

Next, we set up FAISS as a retriever.

In [24]:
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k" : 4})

Let's test it out!

In [25]:
faiss_retriever.get_relevant_documents("What optimizer does QLoRA use?")

[Document(page_content='Among these approaches, QLoRA (Dettmers\net al., 2023) stands out as a recent and highly\nefficient fine-tuning method that dramatically de-\ncreases memory usage. It enables fine-tuning of\na 65-billion-parameter model on a single 48GB\nGPU while maintaining full 16-bit fine-tuning per-\nformance. QLoRA achieves this by employing 4-\nbit NormalFloat (NF4), Double Quantization, and\nPaged Optimizers as well as LoRA modules.\nHowever, another significant challenge when uti-'),
 Document(page_content='the computational overhead traditionally associated with fine-tuning such models.\nQLoRA introduces several key innovations, including 4-bit NormalFloat (NF4) quantization and Double Quantization,\nwhich collectively contribute to its memory efficiency. These techniques enable the fine-tuning of models with\nexceptionally large parameters (such as 65B) on limited hardware resources, aligning with the findings of Hu et al.\n[2021].\n4'),
 Document(page_content='while 

### Prompt Template

Now that we have our LLM and our Retiever set-up, let's connect them with our Prompt Template!

In [26]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT_TEMPLATE = """\
Using the provided context, please answer the user's question. If you don't know, say you don't know.

Context:
{context}

Question:
{question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT_TEMPLATE)

#### ❓ Question #3

Does the ordering of the prompt matter?



In [28]:
#Answer:

"""
I think this may depends on the context to some degree.

I believe for most simple RAG apps, the order should not matter a great deal.
I do believe reseach has shown some LLMs place greater weight on the start or very end of the prompt, so in cases of very long prompts that may matter a bit. So if
our RAG context is going to be significant, then perhaps a format such as that above is best where the instructions are placed at the beginning and the question at the end.
Though in that theory (where tokens in the middle of the prompt are given less weight)...it is possible some of our important context could be obfuscated a bit

The ordering is super important for other prompting techniques such as chain-of-thought prompting, hence my answer of "it depends on the context". For most RAG
I do not think it is critical, but to be safe the template we have looks good.

"""

'\nI think this may depends on the context to some degree. \n\nI believe for most simple RAG apps, the order should not matter a great deal.\nI do believe reseach has shown some LLMs place greater weight on the start or very end of the prompt, so in cases of very long prompts that may matter a bit. So if\nour RAG context is going to be significant, then perhaps a format such as that above is best where the instructions are placed at the beginning and the question at the end. \nThough in that theory (where tokens in the middle of the prompt are given less weight)...it is possible some of our important context could be obfuscated a bit\n\nThe ordering is super important for other prompting techniques such as chain-of-thought prompting, hence my answer of "it depends on the context". For most RAG\nI do not think it is critical, but to be safe the template we have looks good.\n\n'

## Task 7: Creating a simple RAG pipeline with LangChain v0.1.0

All that's left to do is set up a RAG chain - and away we go!

In [29]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

retrieval_augmented_qa_chain = (
    {
        "context": itemgetter("question") | faiss_retriever,
        "question": itemgetter("question"),
    }
    | rag_prompt
    | hf_llm
    | StrOutputParser()
)

Let's test it out!

In [30]:
retrieval_augmented_qa_chain.invoke({"question" : "What is QLoRA?"})

'\nAnswer:\nQLoRA is a method for fine-tuning large language models (LLMs) that is designed to be widely accessible and easy to use, even for those without extensive computational resources. It uses a combination of low-rank matrices and quantization techniques to significantly reduce the computational requirements for fine-tuning LLMs, making it possible for researchers and developers with limited resources to work with state-of-the-art NLP models. Additionally, QLoRA has the potential to enable the deployment of LLMs on mobile devices, which could be a game-changer for many applications. Overall, QLoRA is an important tool for democratizing access to advanced NLP technology.'

# 🤝 Breakout Room #2

## Task 1: Set-up LangSmith

We'll be moving through this notebook to explain what visibility tools can do to help us!

Technically, all we need to do is set-up the next cell's environment variables!

In [31]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE1 - {unique_id}"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass('Enter your LangSmith API key: ')

Enter your LangSmith API key: ··········


In [32]:
retrieval_augmented_qa_chain.invoke({"question" : "What is QLoRA?"})

'\nAnswer:\nQLoRA is a method for fine-tuning large language models (LLMs) that is designed to be widely accessible and easily applicable. It uses low-rank matrices and quantization techniques to reduce the computational requirements of LLMs, making it possible to fine-tune them on devices with limited resources. QLORA is an acronym for "Low-Rank Matrix Optimized Quantization for LLMs." The authors of the provided context believe that QLORA has the potential to have a broadly positive impact on the field of natural language processing (NLP) by making the fine-tuning of high-quality LLMs more widely and easily accessible, particularly for researchers and teams with limited resources.'

We get *all of this information* for "free":

![image](https://i.imgur.com/8Wcpmcj.png)

> NOTE: We'll walk through this diagram in detail in class.

####🏗️ Activity #1:

Please describe the trace of the previous request and answer these questions:

1. How many tokens did the request use?
2. How long did the `HuggingFaceEndpoint` take to complete?

Answer:

1) 325 tokens, 243 input tokens and 82 output tokens


2) the HF endpoint tok 3.09 seconds to complete

## Task 2: Creating a LangSmith dataset

Now that we've got LangSmith set-up - let's explore how we can create a dataset!

First, we'll create a list of questions!

In [35]:
from langsmith import Client

questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

Now we can create our dataset through the LangSmith `Client()`.

In [37]:
client = Client()
dataset_name = "QLoRA RAG Dataset V2"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the QLoRA Paper to Evaluate RAG over the same paper."
)

client.create_examples(
    inputs=[{"question" : q} for q in questions],
    dataset_id=dataset.id
)

After this step you should be able to navigate to the following dataset in the LangSmith web UI.

![image](https://i.imgur.com/CdFYGTB.png)

## Task 3: Creating a custom evaluator

Now that we have a dataset - we can start thinking about evaluation.

We're going to make a `StringEvaluator` to measure "dopeness".

> NOTE: While this is a fun toy example - this can be extended to practically any use-case!

In [38]:
import re
from typing import Any, Optional
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain.evaluation import StringEvaluator

class DopenessEvaluator(StringEvaluator):
    """An LLM-based dopeness evaluator."""

    def __init__(self):
        llm = ChatOpenAI(model="gpt-4", temperature=0)

        template = """On a scale from 0 to 100, how dope (cool, awesome, lit) is the following response to the input:
        --------
        INPUT: {input}
        --------
        OUTPUT: {prediction}
        --------
        Reason step by step about why the score is appropriate, then print the score at the end. At the end, repeat that score alone on a new line."""

        self.eval_chain = PromptTemplate.from_template(template) | llm

    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return False

    @property
    def evaluation_name(self) -> str:
        return "scored_dopeness"

    def _evaluate_strings(
        self,
        prediction: str,
        input: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any
    ) -> dict:
        evaluator_result = self.eval_chain.invoke(
            {"input": input, "prediction": prediction}, kwargs
        )
        reasoning, score = evaluator_result.content.split("\n", maxsplit=1)
        score = re.search(r"\d+", score).group(0)
        if score is not None:
            score = float(score.strip()) / 100.0
        return {"score": score, "reasoning": reasoning.strip()}

## Task 4: Initializing our evaluator config

Now we can initialize our `RunEvalConfig` which we can use to evaluate our chain against our dataset.

> NOTE: Check out the [documentation](https://docs.smith.langchain.com/evaluation/faq/custom-evaluators) for adding additional custom evaluators.

In [39]:
from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    custom_evaluators=[DopenessEvaluator()],
    evaluators=[
        "criteria",
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria(
            {
                "AI": "Does the response feel AI generated?"
                "Response Y if they do, and N if they don't."
            }
        ),
    ],
)

## Task 5: Evaluating our RAG pipeline

All that's left to do now is evaluate our pipeline!

In [41]:
client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=retrieval_augmented_qa_chain,
    evaluation=eval_config,
    verbose=True,
    project_name="HF RAG Pipeline - Evaluation - v3 cd",
    project_metadata={"version": "1.0.0"},
)

View the evaluation results for project 'HF RAG Pipeline - Evaluation - v3 cd' at:
https://smith.langchain.com/o/74600b6d-60cd-531f-91ed-9798b6cba9aa/datasets/7a788798-e25b-4de5-b040-70ef2dd3eec7/compare?selectedSessions=a1339ee4-df26-445d-95b7-e0190685db89

View all tests for Dataset QLoRA RAG Dataset V2 at:
https://smith.langchain.com/o/74600b6d-60cd-531f-91ed-9798b6cba9aa/datasets/7a788798-e25b-4de5-b040-70ef2dd3eec7
[------------------------------------------------->] 6/6
 Experiment Results:
        feedback.helpfulness  feedback.harmfulness  feedback.AI  feedback.scored_dopeness error  execution_time                                run_id
count                   6.00                  6.00         6.00                      6.00     0            6.00                                     6
unique                   NaN                   NaN          NaN                       NaN     0             NaN                                     6
top                      NaN                   N

{'project_name': 'HF RAG Pipeline - Evaluation - v3 cd',
 'results': {'a4ac6d44-721c-4060-b5ec-78e7dd20b2a2': {'input': {'question': 'What is a Retrieval Augmented Generation system?'},
   'feedback': [EvaluationResult(key='helpfulness', score=1, value='Y', comment='The criterion for this task is the helpfulness of the submission. \n\nThe submission provides a clear and concise explanation of what a Retrieval Augmented Generation (RAG) system is. It explains that it is a type of AI model that combines retrieval-based and generation-based models. It also explains how a RAG system works, by retrieving relevant information and then using a generation model to augment this information. This explanation is insightful as it provides a good understanding of the concept. \n\nThe submission is also appropriate as it directly answers the question asked in the input. It does not provide unnecessary or irrelevant information. \n\nBased on this analysis, the submission meets the criterion of being 