# LangChain with Open Source LLM and Open Source Embeddings & LangSmith

In the following notebook we will dive into the world of Open Source models hosted on Hugging Face's [inference endpoints](https://ui.endpoints.huggingface.co/).

The notebook will be broken into the following parts:

- 🤝 Breakout Room #1:
  1. Set-up Hugging Face Inference Endpoints
  2. Install required libraries
  3. Set Environment Variables
  4. Testing our Hugging Face Inference Endpoint
  5. Creating LangChain components powered by the endpoints
  6. Retrieving data from Arxiv
  7. Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
  

- 🤝 Breakout Room #2:
  1. Set-up LangSmith
  2. Creating a LangSmith dataset
  3. Creating a custom evaluator
  4. Initializing our evaluator config
  5. Evaluating our RAG pipeline

# 🤝 Breakout Room #1

## Task 1: Set-up Hugging Face Inference Endpoints

Please follow the instructions provided [here](https://github.com/AI-Maker-Space/AI-Engineering/tree/main/Week%205/Thursday) to set up your Hugging Face inference endpoints for both your LLM and your embedding model.

## Task 2: Install required libraries

Now we've got to get our required libraries!

We'll start with our `langchain` and `huggingface` dependencies.



In [29]:
!pip install langchain langchain-core langchain-community langchain_openai huggingface-hub requests -q -U

Now we can grab some miscellaneous dependencies that will help us power our RAG pipeline!

In [30]:
!pip install arxiv pymupdf faiss-cpu -q -U

## Task 3: Set Environment Variables

We'll need to set our `HF_TOKEN` so that we can send requests to our protected API endpoint.

We'll also set-up our OpenAI API key, which we'll leverage later.



In [31]:
import os
import getpass

os.environ["HF_TOKEN"] = getpass.getpass("HuggingFace Write Token: ")

HuggingFace Write Token: ··········


In [32]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········


## Task 4: Testing our Hugging Face Inference Endpoint

Let's submit a sample request to the Hugging Face Inference endpoint!

In [33]:
model_api_gateway = "https://uhgw0hdx8tf3i0m6.us-east-1.aws.endpoints.huggingface.cloud" # << YOUR ENDPOINT URL HERE

> NOTE: If you're running into issues finding your API URL you can find it at [this](https://ui.endpoints.huggingface.co/) link.

Here's an example:

![image](https://i.imgur.com/xSCV0xM.png)

### Direct HTTP Request Using requests

- Approach: This script directly uses the requests library to send a POST request to the model's API endpoint. It manually constructs the JSON body and headers required for the request.


- Customization: It explicitly sets `max_new_tokens`, `top_p`, and `temperature` in the JSON payload. `max_new_tokens` caps the length of the generated response, `top_p` restricts sampling to the most probable tokens (nucleus sampling), and `temperature` controls how random the sampling is.


- Authentication: The script uses an environment variable (HF_TOKEN) for authentication, included in the request headers.


- Flexibility: This method offers high flexibility in terms of specifying HTTP headers, request parameters, and handling the response.

In [34]:
import requests

max_new_tokens = 256
top_p = 0.9
temperature = 0.1

prompt = "Hello! How are you?"

json_body = {
    "inputs" : prompt,
    "parameters" : {
        "max_new_tokens" : max_new_tokens,
        "top_p" : top_p,
        "temperature" : temperature
    }
}

headers = {
  "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
  "Content-Type": "application/json"
}

response = requests.post(model_api_gateway, json=json_body, headers=headers)
print(response.json())

[{'generated_text': " I'm doing well, thanks for asking! *smiling*\n\nI hope you're having a great day and enjoying the beautiful weather. *winking*\n\nI just wanted to reach out and say hello, and see how you're doing. It's always nice to connect with someone new. *nodding*\n\nSo, tell me a little bit about yourself. What do you like to do in your free time? Do you have any hobbies or interests that you enjoy? *curious*\n\nAnd hey, if you want to chat more, I'm all ears! *grinning* I'm always up for a good conversation. *winking*\n\nTake care, and talk to you soon! *smiling*"}]
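The raw response above is a list containing one dict with a `generated_text` key. The sketch below shows one way to pull the text out defensively. The helper name `extract_generated_text` is hypothetical, and the `{"error": ...}` failure shape is an assumption about what the endpoint returns when something goes wrong (for example while it is still starting up), not something shown in this notebook.

```python
from typing import Any


def extract_generated_text(payload: Any) -> str:
    """Return the generated text from a text-generation response payload.

    Assumes the success shape shown above: a list of dicts, each with a
    'generated_text' key. Anything else is treated as a failure.
    """
    # Assumed error shape: a dict carrying an "error" message.
    if isinstance(payload, dict) and "error" in payload:
        raise RuntimeError(f"Endpoint returned an error: {payload['error']}")
    # Success shape: [{"generated_text": "..."}]
    if (
        isinstance(payload, list)
        and payload
        and isinstance(payload[0], dict)
        and "generated_text" in payload[0]
    ):
        return payload[0]["generated_text"].strip()
    raise ValueError(f"Unexpected response shape: {payload!r}")


sample = [{"generated_text": " I'm doing well, thanks for asking!"}]
print(extract_generated_text(sample))
```

With the live endpoint this would be called as `extract_generated_text(response.json())`.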


## Task 5: Creating LangChain components powered by the endpoints

We're going to wrap our endpoints in LangChain components so that, thanks to LCEL, we can use them like any other LCEL component!

### Using langchain.llms Library for HuggingFaceEndpoint

- Approach: This script utilizes the langchain.llms library's HuggingFaceEndpoint class, abstracting away the direct use of HTTP requests. It's a higher-level interface compared to the raw requests approach.


- Simplicity and Readability: The use of a dedicated class for the Hugging Face endpoint makes the code simpler and more readable. It hides the complexity of constructing HTTP requests manually.


- Configuration: The example passes only the endpoint URL, the token, and the task ("text-generation"). Generation parameters such as max_new_tokens, top_p, and temperature are not set in the call, so the class's defaults apply, which gives a simpler interface for common tasks.


- Authentication: Similar to the first script, it uses an environment variable for authentication but wrapped within the class's functionality.



### HuggingFaceEndpoint for LLM

We can use the `HuggingFaceEndpoint` found [here](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/huggingface_endpoint.py) to power our chain - let's look at how we would implement it.

In [35]:
from langchain.llms import HuggingFaceEndpoint

endpoint_url = (
    model_api_gateway
)

hf_llm = HuggingFaceEndpoint(
    endpoint_url=endpoint_url,
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
    task="text-generation"
)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Now we can use our endpoint like we would any other LLM!

In [36]:
hf_llm.invoke("Hello, how are you?")

"\n* 你好，我很好，谢谢 (nǐ hǎo, wǒ hěn hǎo, xiè xiè) - Hello, I'm very good, thank you.\n* 我很愉快 (wǒ hěn yúkuài) - I'm very happy.\n* 我很感谢 (wǒ hěn gǎnxiè) - I'm very grateful.\n\nHow do you like China?\n* 中国很好 (Zhōngguó hěn hǎo) - China is very good.\n* 我喜欢中国的文化 (wǒ xǐ huān Zhōngguó de wénhuà) - I like China's culture.\n* 中国有很多好的食物 (Zhōngguó yǒu hěn duō hǎo de shíwù) - China has many delicious foods.\n\nWhat do you want to do in China?\n* 我想去中国旅游 (wǒ xiǎng qù Zhōngguó lǚyóu) - I want to travel in China.\n* 我想吃中国的菜 (wǒ xiǎng chī Zhōngguó de cài) - I want to eat Chinese food.\n* 我想学中国语言 (wǒ xiǎng xué Zhōngguó yǔyīng) - I want to learn Chinese language.\n\nCan you speak Chinese?\n* 我可以说中文 (wǒ kě yǐ shuō Zhōngwén) - I can speak Chinese.\n* 我不能说中文 (wǒ bù néng shuō Zhōngwén) - I cannot speak Chinese.\n\nWhere are you from?\n* 我来自美国 (wǒ lái zì Mēiguó) - I come from the United States.\n* 我来自中国 (wǒ lái zì Zhōngguó) - I come from China.\n* 我来自其他地"


### Key Differences:
- Level of Abstraction: The first script interacts with the API at a lower level, offering more control over the request's specifics. The second script uses a higher-level abstraction that simplifies interaction but might offer less granularity in controlling request parameters.

- Ease of Use: For users familiar with Python but not with making HTTP requests, the second approach might be more accessible and straightforward.

- Customizability: The first script may be more suitable for advanced use cases requiring specific request headers or non-standard API parameters.

- Dependency: The first script relies on the widely used requests library, which is a common choice for HTTP interactions in Python. The second script depends on the langchain.llms library, which is specifically designed for working with language models and might not be as widely adopted.

- Both scripts are valid for their purposes, with the choice between them depending on the user's specific needs, preferences for code readability and simplicity, and the level of control required over the API interaction.

### HuggingFace Inference API Embeddings

Now we can leverage the `HuggingFaceInferenceAPIEmbeddings` module in LangChain to connect to our Hugging Face Inference Endpoint hosted embedding model.

In [37]:
embedding_api_gateway = "https://vpuv173f4vtfog2r.us-east-1.aws.endpoints.huggingface.cloud" # << Embedding Endpoint API URL

In [38]:
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings

embeddings_model = HuggingFaceInferenceAPIEmbeddings(api_key=os.environ["HF_TOKEN"], api_url=embedding_api_gateway)

In [39]:
embeddings_model.embed_query("Hello, welcome to HF Endpoint Embeddings")[:10]

[-0.01926689,
 0.01546007,
 -0.046256967,
 -0.021581108,
 -0.009921011,
 0.00024049378,
 -0.033302825,
 -0.0010723798,
 0.027798,
 0.011502621]

In [42]:
import numpy as np
np.size(embeddings_model.embed_query("Hello, welcome to HF Endpoint Embeddings"))

1024

In [63]:
len(embeddings_model.embed_query("Hello"))

1024

#### ❓ Question #1

What is the embedding dimension of your selected embeddings model?


#### Answer #1
The embedding dimension is 1024.

## Task 6: Retrieving data from Arxiv

We'll leverage the `ArxivLoader` to load some papers about the "QLoRA" topic, and then split them into more manageable chunks!

In [44]:
from langchain.document_loaders import ArxivLoader

docs = ArxivLoader(query="QLoRA", load_max_docs=5).load()

In [45]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 0,
    length_function = len,
)

split_chunks = text_splitter.split_documents(docs)

In [46]:
len(split_chunks)

528
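To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts text into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs, lines, and spaces. The function name `split_fixed` is hypothetical.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    # Each window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=0)
print(chunks)  # ['abcd', 'efgh', 'ij']
```

With `chunk_overlap = 0`, as in the cell above, no text is repeated between chunks, which keeps the chunk count down but can cut a sentence in half at a boundary.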

Just as we would with OpenAI's embeddings model, we can instantiate our `FAISS` vector store with our documents and our `HuggingFaceInferenceAPIEmbeddings` model!

We'll need to take a few extra steps, though, due to a few limitations of the endpoint/FAISS.

We'll start by embedding our documents in batches of `32`.

In [47]:
embeddings = []
# Step through every chunk in batches of 32; the slice handles a short final batch.
for i in range(0, len(split_chunks), 32):
  embeddings.append(embeddings_model.embed_documents([document.page_content for document in split_chunks[i:i+32]]))

In [48]:
embeddings = [item for sub_list in embeddings for item in sub_list]

#### ❓ Question #2

Why do we have to limit our batches when sending to the Hugging Face endpoints?


#### Answer #2
Limiting the batch size keeps each request within what the endpoint can handle: it avoids overloading the shared infrastructure, keeps compute usage manageable, maintains a steady flow of requests, and reduces errors caused by sudden spikes in demand.
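The batching pattern can be sketched as a small reusable helper. This is a minimal illustration, not part of LangChain; the name `batched` is hypothetical. The point to check is that every item lands in exactly one batch, including a short final batch.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Stepping up to len(items) ensures the last partial batch is included.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sizes = [len(batch) for batch in batched(list(range(528)), 32)]
print(len(sizes), sum(sizes))  # 17 528
```

In the notebook, each batch would be a list of `page_content` strings passed to `embeddings_model.embed_documents`.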

Now we can create the text/embedding pairs we'll use to set up our FAISS VectorStore!

In [49]:
from langchain.vectorstores import FAISS

text_embedding_pairs = list(zip([document.page_content for document in split_chunks], embeddings))

faiss_vectorstore = FAISS.from_embeddings(text_embedding_pairs, embeddings_model)

In [50]:
print(type(text_embedding_pairs))
print(len(text_embedding_pairs))
print(text_embedding_pairs[1])

<class 'list'>
528
('Adapters (LoRA). Our best model family, which we name Guanaco, outperforms\nall previous openly released models on the Vicuna benchmark, reaching 99.3%\nof the performance level of ChatGPT while only requiring 24 hours of finetuning\non a single GPU. QLORA introduces a number of innovations to save memory\nwithout sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that\nis information theoretically optimal for normally distributed weights (b) Double', [0.026077243, -0.0033466534, -0.017917924, 0.003309633, 0.010076981, 0.016007666, 0.0071671694, 0.023648696, 0.02403371, 0.05265796, 0.058048148, 0.006845091, -0.006697009, -0.045905422, -0.007293039, 0.008796072, 0.0026913905, -0.06343833, -0.011306061, 0.00933657, 0.004853388, 0.003381823, -0.060565542, -0.027069392, -0.01580035, 0.04190721, 0.051325224, -0.0009167202, 0.083932884, 0.019280277, -0.006867303, -0.036250476, 0.013453251, -0.044394985, -0.05076251, -0.031008372, 0.02592916, 0.00848509

Next, we set up FAISS as a retriever.

In [51]:
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k" : 2})

Let's test it out!

In [52]:
faiss_retriever.get_relevant_documents("What optimizer does QLoRA use?")

[Document(page_content='Among these approaches, QLoRA (Dettmers\net al., 2023) stands out as a recent and highly\nefficient fine-tuning method that dramatically de-\ncreases memory usage. It enables fine-tuning of\na 65-billion-parameter model on a single 48GB\nGPU while maintaining full 16-bit fine-tuning per-\nformance. QLoRA achieves this by employing 4-\nbit NormalFloat (NF4), Double Quantization, and\nPaged Optimizers as well as LoRA modules.\nHowever, another significant challenge when uti-'),
 Document(page_content='the computational overhead traditionally associated with fine-tuning such models.\nQLoRA introduces several key innovations, including 4-bit NormalFloat (NF4) quantization and Double Quantization,\nwhich collectively contribute to its memory efficiency. These techniques enable the fine-tuning of models with\nexceptionally large parameters (such as 65B) on limited hardware resources, aligning with the findings of Hu et al.\n[2021].\n4')]
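Conceptually, a retriever with `k = 2` embeds the query and returns the two stored texts whose vectors are closest to it. The sketch below shows that idea with a brute-force search in plain Python, assuming squared Euclidean (L2) distance as the metric. It is an illustration only: FAISS uses optimized index structures, and the toy vectors and the `top_k` helper are made up for this example.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(
    query: Sequence[float],
    pairs: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the k texts whose embeddings are nearest to the query vector."""
    ranked = sorted(pairs, key=lambda pair: squared_l2(query, pair[1]))
    return [text for text, _ in ranked[:k]]


# Toy 2-dimensional "embeddings"; the real ones above have 1024 dimensions.
toy_pairs = [
    ("about optimizers", [1.0, 0.0]),
    ("about data types", [0.0, 1.0]),
    ("about cooking", [-1.0, -1.0]),
]
print(top_k([0.9, 0.1], toy_pairs, k=2))  # ['about optimizers', 'about data types']
```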

### Prompt Template

Now that we have our LLM and our Retriever set up, let's connect them with our Prompt Template!

In [53]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT_TEMPLATE = """\
Using the provided context, please answer the user's question. If you don't know, say you don't know.

Context:
{context}

Question:
{question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT_TEMPLATE)

#### ❓ Question #3

Does the ordering of the prompt matter?

#### Answer #3

- The order of elements within the prompt (context, question, and instructions) can significantly influence the model's output, because most transformer-based LLMs are sensitive to the sequence and structure of their input.


- Putting the context before the question lets the LLM take in the background information first. Clear instructions and a consistent prompt structure can also improve the model's performance.
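To see the ordering concretely, the sketch below fills the same template text with plain `str.format` and prints the result: instructions first, then context, then the question. Joining the chunks with blank lines is a choice made for this illustration, and `build_prompt` is a hypothetical helper rather than what `ChatPromptTemplate` does internally.

```python
RAG_PROMPT_TEMPLATE = """\
Using the provided context, please answer the user's question. If you don't know, say you don't know.

Context:
{context}

Question:
{question}
"""


def build_prompt(context_chunks: list[str], question: str) -> str:
    """Fill the template, placing the retrieved context before the question."""
    # Assumption for this sketch: chunks are joined with a blank line between them.
    context = "\n\n".join(context_chunks)
    return RAG_PROMPT_TEMPLATE.format(context=context, question=question)


prompt_text = build_prompt(
    ["QLoRA uses 4-bit NormalFloat (NF4).", "QLoRA uses Paged Optimizers."],
    "What is QLoRA?",
)
print(prompt_text)
```

Swapping the `Context:` and `Question:` sections in the template is an easy experiment for comparing how the endpoint's answers change.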



## Task 7: Creating a simple RAG pipeline with LangChain v0.1.0

All that's left to do is set up a RAG chain - and away we go!

In [54]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

retrieval_augmented_qa_chain = (
    {
        "context": itemgetter("question") | faiss_retriever,
        "question": itemgetter("question"),
    }
    | rag_prompt
    | hf_llm
    | StrOutputParser()
)

Let's test it out!

In [55]:
retrieval_augmented_qa_chain.invoke({"question" : "What is QLoRA?"})

'\nAnswer:\nQLoRA is a method for fine-tuning large language models (LLMs) that involves using low-rank matrices and quantization techniques to reduce the computational complexity of the models. It is designed to make the finetuning of high-quality LLMs more widely and easily accessible, particularly in the hands of large corporations that do not release models or source code for auditing.'
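The dictionary at the start of the chain is the least obvious part: it takes the input `{"question": ...}` and produces a new dict with both a `context` and a `question` key, which the prompt then consumes. Below is a plain-Python sketch of that first step, using a stand-in retriever. `fake_retriever` and `first_step` are hypothetical names, and real LCEL runs the two branches through its own runnable machinery rather than a hand-written function.

```python
from operator import itemgetter


def fake_retriever(query: str) -> list[str]:
    """Stand-in for faiss_retriever: returns one dummy chunk for the query."""
    return [f"chunk relevant to: {query}"]


def first_step(inputs: dict) -> dict:
    """Mimic the dict step: build 'context' and 'question' from the input."""
    question = itemgetter("question")(inputs)
    return {
        # Equivalent in spirit to: itemgetter("question") | faiss_retriever
        "context": fake_retriever(question),
        # Equivalent in spirit to: itemgetter("question")
        "question": question,
    }


print(first_step({"question": "What is QLoRA?"}))
```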

# 🤝 Breakout Room #2

## Task 1: Set-up LangSmith

We'll be moving through this notebook to explain what visibility tools can do to help us!

Technically, all we need to do is set the environment variables in the next cell!

In [56]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE1 - {unique_id}"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass('Enter your LangSmith API key: ')

Enter your LangSmith API key: ··········


Let's see what happens on the LangSmith project when we run this chain now!

In [57]:
retrieval_augmented_qa_chain.invoke({"question" : "What is QLoRA?"})

'\nAnswer:\nQLoRA is a method for fine-tuning large language models (LLMs) that involves using low-rank matrices and quantization techniques to improve the efficiency and accuracy of the models. It is described in a research paper as a way to make the fine-tuning of high-quality LLMs more widely and easily accessible, and is facilitated by the application of QLoRA in Viz, which clarifies the technical complexities and operational benefits of this integration.'

We get *all of this information* for "free":

![image](https://i.imgur.com/8Wcpmcj.png)

> NOTE: We'll walk through this diagram in detail in class.

#### 🏗️ Activity #1:

Please describe the trace of the previous request and answer these questions:

1. How many tokens did the request use?
2. How long did the `HuggingFaceEndpoint` take to complete?

## Task 2: Creating a LangSmith dataset

Now that we've got LangSmith set-up - let's explore how we can create a dataset!

First, we'll create a list of questions!

In [58]:
from langsmith import Client

questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

Now we can create our dataset through the LangSmith `Client()`.

In [59]:
client = Client()
dataset_name = "QLoRA RAG Dataset"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the QLoRA Paper to Evaluate RAG over the same paper."
)

client.create_examples(
    inputs=[{"question" : q} for q in questions],
    dataset_id=dataset.id
)

After this step you should be able to navigate to the following dataset in the LangSmith web UI.

![image](https://i.imgur.com/CdFYGTB.png)

## Task 3: Creating a custom evaluator

Now that we have a dataset - we can start thinking about evaluation.

We're going to make a `StringEvaluator` to measure "dopeness".

> NOTE: While this is a fun toy example - this can be extended to practically any use-case!

In [60]:
import re
from typing import Any, Optional
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain.evaluation import StringEvaluator

class DopenessEvaluator(StringEvaluator):
    """An LLM-based dopeness evaluator."""

    def __init__(self):
        llm = ChatOpenAI(model="gpt-4", temperature=0)

        template = """On a scale from 0 to 100, how dope (cool, awesome, lit) is the following response to the input:
        --------
        INPUT: {input}
        --------
        OUTPUT: {prediction}
        --------
        Reason step by step about why the score is appropriate, then print the score at the end. At the end, repeat that score alone on a new line."""

        self.eval_chain = PromptTemplate.from_template(template) | llm

    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return False

    @property
    def evaluation_name(self) -> str:
        return "scored_dopeness"

    def _evaluate_strings(
        self,
        prediction: str,
        input: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any
    ) -> dict:
        evaluator_result = self.eval_chain.invoke(
            {"input": input, "prediction": prediction}, kwargs
        )
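        # The prompt asks for the score alone on the last line, so parse from the end.
        reasoning, _, last_line = evaluator_result.content.strip().rpartition("\n")
        match = re.search(r"\d+", last_line)
        # Leave the score as None when no number is found instead of raising.
        score = float(match.group(0)) / 100.0 if match else None
        return {"score": score, "reasoning": reasoning.strip()}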
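The score-parsing step can be exercised on its own, without calling an LLM. The sketch below is one way to do it, assuming the model follows the prompt and ends its reply with the score alone on the final line. `parse_score` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import re
from typing import Optional, Tuple


def parse_score(text: str) -> Tuple[Optional[float], str]:
    """Split an evaluator reply into (score in [0, 1], reasoning).

    Reads the number from the final line; returns a score of None when
    that line contains no digits.
    """
    reasoning, _, last_line = text.strip().rpartition("\n")
    match = re.search(r"\d+", last_line)
    if match is None:
        # No usable score: hand back the whole reply as reasoning.
        return None, text.strip()
    return float(match.group(0)) / 100.0, reasoning.strip()


sample_reply = "The answer is accurate but dry.\nScore: 40\n40"
print(parse_score(sample_reply))
```

Reading from the last line matters because the step-by-step reasoning can itself contain numbers, such as "65B", that are not the score.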

## Task 4: Initializing our evaluator config

Now we can initialize our `RunEvalConfig` which we can use to evaluate our chain against our dataset.

> NOTE: Check out the [documentation](https://docs.smith.langchain.com/evaluation/faq/custom-evaluators) for adding additional custom evaluators.

In [61]:
from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    custom_evaluators=[DopenessEvaluator()],
    evaluators=[
        "criteria",
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria(
            {
                "AI": "Does the response feel AI generated? "
                "Respond Y if it does, and N if it doesn't."
            }
        ),
    ],
)

## Task 5: Evaluating our RAG pipeline

All that's left to do now is evaluate our pipeline!

In [62]:
client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=retrieval_augmented_qa_chain,
    evaluation=eval_config,
    verbose=True,
    project_name="HF RAG Pipeline - Evaluation - v3",
    project_metadata={"version": "1.0.0"},
)

View the evaluation results for project 'HF RAG Pipeline - Evaluation - v3' at:
https://smith.langchain.com/o/ec0ab543-3d20-5b47-b548-e29646706b91/datasets/bc0ff1d4-4b34-4cfc-b0f2-7919cd259454/compare?selectedSessions=15de38a5-a589-4713-ac8a-1a071b739a16

View all tests for Dataset QLoRA RAG Dataset at:
https://smith.langchain.com/o/ec0ab543-3d20-5b47-b548-e29646706b91/datasets/bc0ff1d4-4b34-4cfc-b0f2-7919cd259454
[------------------------------------------------->] 6/6
 Experiment Results:
        feedback.helpfulness  feedback.harmfulness  feedback.AI  feedback.scored_dopeness error  execution_time                                run_id
count                   6.00                  6.00         6.00                      6.00     0            6.00                                     6
unique                   NaN                   NaN          NaN                       NaN     0             NaN                                     6
top                      NaN                   NaN    

{'project_name': 'HF RAG Pipeline - Evaluation - v3',
 'results': {'086e5006-c51c-4bd9-a31a-2e969632db8e': {'input': {'question': 'What significant improvements does the LoRA system make?'},
   'feedback': [EvaluationResult(key='helpfulness', score=1, value='Y', comment='The criterion for this task is "helpfulness". \n\nTo assess this, we need to consider whether the submission provides useful, insightful, and appropriate information in response to the input question. \n\nThe input question asks for the significant improvements that the LoRA system makes. \n\nThe submission provides a detailed response, outlining three key improvements that the LoRA system offers over traditional fine-tuning practices. These include efficiency, resource-efficient customization, and notable improvements in efficiency through an extension of LoRA. \n\nThe submission also provides a summary that reiterates the main points, further enhancing its helpfulness. \n\nTherefore, the submission is helpful, insigh