# Simple RAG for GitHub issues using Hugging Face Zephyr and LangChain

This notebook demonstrates how you can quickly build a RAG (Retrieval Augmented Generation) system for a project's GitHub issues using the [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) model and LangChain.


**What is RAG?**

RAG is a popular approach to address the issue of a powerful LLM not being aware of specific content due to said content not being in its training data, or hallucinating even when it has seen it before. Such specific content may be proprietary, sensitive, or, as in this example, recent and updated often.

If your data is static and doesn't change regularly, you may consider fine-tuning a large model. In many cases, however, fine-tuning can be costly, and, when done repeatedly (e.g. to address data drift), leads to "model shift". This is when the model's behavior changes in ways that are not desirable.

**RAG (Retrieval Augmented Generation)** does not require model fine-tuning. Instead, RAG works by providing an LLM with additional context that is retrieved from relevant data so that it can generate a better-informed response.

Here's a quick illustration:

![RAG diagram](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/rag-diagram.png)

* The external data is converted into embedding vectors with a separate embeddings model, and the vectors are kept in a database. Embeddings models are typically small, so updating the embedding vectors on a regular basis is faster, cheaper, and easier than fine-tuning a model.

* At the same time, the fact that fine-tuning is not required gives you the freedom to swap your LLM for a more powerful one when it becomes available, or switch to a smaller distilled version, should you need faster inference.
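
The flow described above can be sketched in a few lines of plain Python. This is a toy illustration, not what the notebook runs: word overlap stands in for embedding similarity, and the sample documents and helper names (`score`, `retrieve`, `build_prompt`) are made up for the example.

```python
# Toy RAG flow: retrieve the most relevant snippets, then add them to the prompt.
# Word overlap is an assumed stand-in for embedding similarity.

def score(query, doc):
    """Count the distinct lowercase words shared by the query and the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query, docs, k=2):
    """Return the k documents with the highest overlap score, best first."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query, docs):
    """Place the retrieved snippets in front of the question."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base
docs = [
    "The mute switch on the top panel turns the microphone off.",
    "Connect the HDMI cable to the TV.",
    "The remote control uses two AAA batteries.",
]
query = "How do I turn the microphone off?"
prompt = build_prompt(query, docs)
print(prompt)
```

In a real pipeline the prompt would then be sent to the LLM; only the retrieval step needs to change when the underlying data changes.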

Let's illustrate building a RAG using an open-source LLM, an embeddings model, and LangChain.

First, install the required dependencies:

In [1]:
!pip install -q torch transformers accelerate bitsandbytes sentence-transformers faiss-gpu

In [2]:
# If running in Google Colab, you may need to run this cell to make sure you're using UTF-8 locale to install LangChain
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [3]:
!pip install -q langchain

In [12]:
from langchain.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline

## Prepare the data


In this example, we'll load all of the issues (both open and closed) from [PEFT library's repo](https://github.com/huggingface/peft).

First, you need to acquire a [GitHub personal access token](https://github.com/settings/tokens?type=beta) to access the GitHub API.

Next, we'll load all of the issues in the [huggingface/peft](https://github.com/huggingface/peft) repo:
- By default, pull requests are considered issues as well; here we exclude them from the data by setting `include_prs=False`.
- Setting `state = "all"` means we will load both open and closed issues.
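
To make the two options concrete, here is a rough sketch of the filtering they imply. It is an illustration under assumptions rather than the loader's actual code: it works on already-fetched items shaped like GitHub REST API responses, where a pull request carries a `pull_request` key, and `filter_issues` plus the sample data are hypothetical.

```python
def filter_issues(items, include_prs=False, state="all"):
    """Keep items matching `state`; drop pull requests unless include_prs is True."""
    kept = []
    for item in items:
        # The issues endpoint lists pull requests too; they carry a "pull_request" key.
        if not include_prs and "pull_request" in item:
            continue
        # state="all" keeps both open and closed items.
        if state != "all" and item["state"] != state:
            continue
        kept.append(item)
    return kept

# Hypothetical sample data
items = [
    {"number": 1, "state": "open", "title": "Bug report"},
    {"number": 2, "state": "closed", "title": "Question"},
    {"number": 3, "state": "open", "title": "Fix typo", "pull_request": {}},
]
issues = filter_issues(items, include_prs=False, state="all")
print([issue["number"] for issue in issues])
```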

In [6]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [30]:
# Read a PDF file and extract its text
import PyPDF2

# Open the PDF file in binary mode.
with open("/content/253976162 QSG VSB3930 StarHub - HiRes - PRINT.pdf", "rb") as f:
    # Create a PDF reader object.
    pdf_reader = PyPDF2.PdfReader(f)

    # Concatenate the text extracted from every page.
    Text = ""
    for page in pdf_reader.pages:
        Text += page.extract_text()

# Print the extracted text once, after all pages have been read.
print(Text)




SAFETY NO TICE:
 
 WHA T’S INCLUDED
Power Adapter
Remote Contr ol HDMI® Cable
2 x AAA Batteries
This product contains alkaline batteries. Keep the batteries out of children's reach. Do not install in incorrect 
direction, charge or dispose in ﬁre.
This product should only be operated in environments of temperature 25 ~ 40⁰C. Do not expose to direct 
sunlight or moisture.StarHub TV+ Pro
 
Top Panel
Mute Switch
SpeakersPower Button
Note: For Wi-Fi connections, streaming quality may be affected if the Wi-Fi signal strength is weak.GETTING STARTED
Visit starhub.com/
tvplus-pro-help 
if you require 
additional assistance 
 Connect your StarHub TV+ Pro
Connect the
Power Adapter. 
Connect an ethernet cable (not included) to the 
LAN port of your router or connect via Wi-Fi during 
the on-screen set-up. 1
Connect the 
HDMI Cable to the TV 
HDMI port and select 
the corresponding HDMI
source on your TV.Note: Audio output will be via the 
StarHub TV+ Pro Speakers.
1 A Google Account is r equir e

In [10]:
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=30)

chunked_docs = splitter.split_text(Text)


In [5]:
from langchain.text_splitter import CharacterTextSplitter

with open('/content/Logcat_opl_Cust_J.txt') as f:
    lines = f.read()

splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=30)

chunked_docs = splitter.split_text(lines)

## Create the embeddings + retriever


Now that the docs are all of the appropriate size, we can create a database with their embeddings.

To create document chunk embeddings we'll use the `HuggingFaceEmbeddings` and the [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) embeddings model. There are many other embeddings models available on the Hub, and you can keep an eye on the best performing ones by checking the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).


To create the vector database, we'll use `FAISS`, a library developed by Facebook AI. This library offers efficient similarity search and clustering of dense vectors, which is what we need here. FAISS is currently one of the most widely used libraries for nearest-neighbor (NN) search in massive datasets.

We'll access both the embeddings model and FAISS via the LangChain API.

In [6]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

db = FAISS.from_texts(
    chunked_docs,
    HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5'),
)



We need a way to retrieve the relevant documents given an unstructured query. For that, we'll call the `as_retriever` method on `db`:
- `search_type="similarity"` means we want to perform a similarity search between the query and the documents.
- `search_kwargs={'k': 4}` instructs the retriever to return the top 4 results.


In [7]:
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 4}
)
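
To illustrate what "return the top 4 results" means, here is a minimal brute-force sketch in plain Python. Cosine similarity is an illustrative choice (a FAISS index may rank by a different metric, such as L2 distance), and the two-dimensional vectors and the `cosine` and `top_k` helpers are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical 2-D "embeddings" for five document chunks
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
best = top_k([1.0, 0.0], doc_vecs, k=4)
print(best)
```

A vector index does the same job without comparing the query against every stored vector one by one, which is what makes it practical for large collections.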

The vector database and retriever are now set up. Next, we need to set up the next piece of the chain: the model.

## Load quantized model

For this example, we chose [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a small but powerful model.

With many models being released every week, you may want to swap this model for the latest and greatest. The best way to keep track of open source LLMs is to check the [Open-source LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

To reduce the model's memory footprint so that it fits on a smaller GPU, we will load the quantized version of the model:


In [8]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'HuggingFaceH4/zephyr-7b-beta'

# 4-bit NF4 quantization with double quantization; computations run in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Zephyr is a decoder-only model, so AutoModelForCausalLM is the matching class.
# An encoder-decoder checkpoint such as google/flan-t5-small is rejected by
# AutoModelForCausalLM with a ValueError.
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
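
Some rough arithmetic shows why 4-bit loading matters for a model of this size. The sketch below is a back-of-the-envelope estimate under stated assumptions: it uses a round 7 billion parameters, counts weights only, and ignores quantization constants, activations, and the KV cache. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # assumed round figure for a "7B" model

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions the weights shrink by a factor of four, which is often the difference between fitting on a single consumer or Colab GPU and not fitting at all.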

In [10]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# A small extractive question-answering model, kept under separate names so that
# the `model` and `tokenizer` defined in the previous cell are not overwritten.
qa_tokenizer = AutoTokenizer.from_pretrained("Intel/dynamic_tinybert")
qa_model = AutoModelForQuestionAnswering.from_pretrained("Intel/dynamic_tinybert")

In [13]:
# Specify the question-answering model to use
qa_model_name = "Intel/dynamic_tinybert"

# Load the tokenizer associated with that model under its own name,
# so the `tokenizer` used for text generation is not overwritten
qa_tokenizer = AutoTokenizer.from_pretrained(qa_model_name, padding=True, truncation=True, max_length=512)

# Define a question-answering pipeline using the model and tokenizer
question_answerer = pipeline(
    "question-answering",
    model=qa_model_name,
    tokenizer=qa_tokenizer,
    return_tensors='pt'
)

# Create an instance of the HuggingFacePipeline, which wraps the question-answering pipeline
# with additional model-specific arguments (temperature and max_length).
# Note: `llm` is reassigned later, when the text-generation pipeline is created.
llm = HuggingFacePipeline(
    pipeline=question_answerer,
    model_kwargs={"temperature": 0.7, "max_length": 512},
)


In [15]:
# Create an instance of the RecursiveCharacterTextSplitter class.
# It splits text into chunks of at most 1000 characters with a 150-character overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# `lines` holds the text to split; split_text returns a list of string chunks.
docs = text_splitter.split_text(lines)
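
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch. It is not LangChain's algorithm (`RecursiveCharacterTextSplitter` prefers to break on separators such as newlines); `sliding_chunks` is a hypothetical helper that cuts at fixed character offsets.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not fully contained in the previous one.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears intact in at least one chunk when it is shorter than the overlap.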

In [16]:

# Define the path to the pre-trained model you want to use
modelPath = "sentence-transformers/all-MiniLM-l6-v2"

# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device':'cpu'}

# Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
encode_kwargs = {'normalize_embeddings': False}

# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,     # Provide the pre-trained model's path
    model_kwargs=model_kwargs, # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
)


In [18]:
text = "This is a test document."
query_result = embeddings.embed_query(text)
# Show only the first three values of the embedding vector
query_result[:3]

[-0.03833850473165512, 0.1234646886587143, -0.028642967343330383]

In [20]:
# from_texts expects a list of strings. Passing the bare string `text`
# would index every character as its own document.
# A separate name keeps the main `db` built earlier intact.
test_db = FAISS.from_texts([text], embeddings)

In [24]:
question = "What is this document?"
searchDocs = test_db.similarity_search(question)
print(searchDocs)


## Set up the LLM chain

Finally, we have all the pieces we need to set up the LLM chain.

First, create a text_generation pipeline using the loaded model and its tokenizer.

Next, create a prompt template - this should follow the format of the model, so if you substitute the model checkpoint, make sure to use the appropriate formatting.

In [21]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain.chains import LLMChain

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=400,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

 """

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

llm_chain = LLMChain(llm=llm, prompt=prompt)


Note: _You can also use `tokenizer.apply_chat_template` to convert a list of messages (as dicts: `{'role': 'user', 'content': '(...)'}`) into a string with the appropriate chat format._
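
For intuition, the sketch below renders a list of messages into the same `<|role|>` ... `</s>` layout as the prompt template above. It is a hand-written approximation, with `render_zephyr_prompt` a hypothetical helper; in practice the chat template that ships with the tokenizer, applied via `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, is the authoritative source for the format.

```python
def render_zephyr_prompt(messages):
    """Render messages in the <|role|> ... </s> layout used by the prompt template above."""
    parts = [f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages]
    # End with the assistant tag so the model continues with its answer.
    parts.append("<|assistant|>\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "Answer the question using the context."},
    {"role": "user", "content": "How to disable hands-free voice control?"},
]
rendered = render_zephyr_prompt(messages)
print(rendered)
```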


Finally, we need to combine the `llm_chain` with the retriever to create the RAG:

In [None]:
from langchain.schema.runnable import RunnablePassthrough

retriever = db.as_retriever()

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)
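
Conceptually, the mapping at the start of the chain sends the same input (the question) to each of its entries and hands the resulting dict to the next step. The plain-Python sketch below mimics that data flow with stand-in functions; `fake_retriever`, `passthrough`, `fake_llm_chain`, and `run_rag` are all hypothetical and involve no LangChain code.

```python
def fake_retriever(question):
    """Stand-in for the retriever: returns canned snippets regardless of the question."""
    return ["snippet A", "snippet B"]

def passthrough(value):
    """Stand-in for RunnablePassthrough: returns its input unchanged."""
    return value

def fake_llm_chain(inputs):
    """Stand-in for the LLM chain: fills a prompt and returns a dict with a 'text' key."""
    prompt = f"{inputs['context']} | {inputs['question']}"
    return {"text": f"answer based on: {prompt}"}

def run_rag(question):
    # Step 1: every entry of the mapping receives the same input (the question).
    inputs = {"context": fake_retriever(question), "question": passthrough(question)}
    # Step 2: the resulting dict is handed to the next stage of the chain.
    return fake_llm_chain(inputs)

result = run_rag("What does the mute switch do?")
print(result["text"])
```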


## Compare the results

Let's see the difference RAG makes in generating answers to the library-specific questions.

In [None]:
question = "How to disable hands-free voice control ?"

First, let's see what kind of answer we can get with just the model itself, no context added:

In [None]:
llm_chain.invoke({"context":"", "question": question})['text']




'1. For Apple devices (iPhone, iPad, iPod touch):\n\n    a. Go to Settings > Accessibility > Touch > Call Audio Routing and select "Speaker" or "AirPlay". This will prevent calls from automatically being answered through hands-free audio.\n\n    b. Alternatively, you can go to Settings > Siri & Search > Listen for "Hey Siri" and toggle off the switch. This will disable the "Hey Siri" feature, which is used to activate Siri without touching your device.\n\n2. For Android devices:\n\n    a. Go to Settings > Accessibility > Voice Match and turn off the "Use Google Services" option. This will disable the "Ok Google" hotword detection, which is used to activate Google Assistant without touching your device.\n\n    b. Alternatively, you can go to Settings > System > Language & input > Text-to-speech output and select "None" as the preferred engine. This will disable text-to-speech functionality, which is sometimes used in conjunction with voice commands.\n\n3. For Amazon Echo devices:\n\n   

As you can see, the model interprets the question as an open-ended one and answers from its training data about Apple products.

In [None]:
rag_chain.invoke(question)['text']



' To disable hands-free voice control on the StarHub TV+ Pro, you need to slide the mute switch located on the top of the device to the "off" position to mute the microphone. This will prevent the device from responding to voice commands without pressing the microphone button on the remote control.'

As we can see, the added context helps the exact same model provide a much more relevant and informed answer to the library-specific question.

Notably, combining multiple adapters for inference has been added to the library, and one can find this information in the documentation, so for the next iteration of this RAG it may be worth including documentation embeddings.