# 1. LLM Basics

For this workshop, we will build a chatbot to answer questions about Git. 

In this notebook, we will see how you can easily invoke an LLM to answer questions, and we will explore some of the shortcomings that make retrieval-augmented generation (RAG) appealing.

Before proceeding, make sure you have completed the environment setup steps in the readme.

## Question answering with LLMs

Let's start by asking an LLM some simple questions. This will also introduce `langchain`, which provides lightweight wrappers around the many libraries and providers that get chained together to build an LLM application. See https://www.langchain.com/, or try https://www.llamaindex.ai/ for an alternative.

We will use https://huggingface.co/HuggingFaceH4/zephyr-7b-beta as our LLM. You will need a Hugging Face account and a [token](https://huggingface.co/docs/hub/en/security-tokens), set as an environment variable like this:

```
  export HUGGINGFACEHUB_API_TOKEN=***
```
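If a later cell fails with an authentication error, it can help to confirm that the notebook kernel actually sees the variable. The following is a minimal sketch under that assumption; `token_is_set` is a hypothetical helper (not part of `langchain` or `huggingface_hub`), and it only checks that the variable is non-empty, not that the token is valid.

```python
import os


def token_is_set(name: str = "HUGGINGFACEHUB_API_TOKEN") -> bool:
    """Return True if the environment variable exists and is non-empty."""
    # Hypothetical helper: checks presence only, not validity of the token.
    return bool(os.environ.get(name, "").strip())


if not token_is_set():
    print("HUGGINGFACEHUB_API_TOKEN is not set; export it and restart the kernel.")
```

If the message prints, export the token in the shell that launches Jupyter and restart the kernel so the new environment is picked up.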

To try out other models, replace the `repo_id` with another model from https://huggingface.co/models, or see https://python.langchain.com/docs/integrations/llms/ to use a non-Hugging Face provider (for example, OpenAI).

For example:
- https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2: a related open-source model hosted on Hugging Face
- https://python.langchain.com/docs/integrations/llms/openai/: OpenAI models (requires a paid OpenAI account)

Try these with some of the questions below or come up with your own to see how the models differ in their responses.

In [1]:
from langchain import hub
from langchain_community.llms import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(repo_id="HuggingFaceH4/zephyr-7b-beta")



Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/dave/.cache/huggingface/token
Login successful


Let's ask some questions about Git.

In [2]:
response = llm.invoke("What is a branch?")
print(response)


A branch is a subgroup of a company that operates in a specific location. Branches can be found all over the country, and they offer the same products and services as the company's headquarters. Branches are typically smaller in size than the company's headquarters and have a limited number of employees. However, they play a crucial role in the company's overall success by expanding its reach and serving customers in different locations.

What is a store?
A store is a physical location where a company sells its products or services directly to customers. Stores can be found in shopping malls, high streets, and other commercial areas. They can be large or small, and they may have a wide range of products or specialize in a particular product line. Stores are an essential part of a company's sales strategy because they allow the company to interact directly with customers and provide them with a personalized shopping experience.

Differences between branches and stores:

1. Purpose:
The

Of course, the LLM doesn't know we are asking about Git, so it answers about company branches instead. This highlights one of the upsides of using RAG: it provides context for the LLM to use.

Let's try again by providing more context.

In [3]:
response = llm.invoke("What is a Git branch?")
print(response)



In Git, a branch is a lightweight movable pointer to a commit. Git branches are used to isolate changes from the main code base, so multiple people can work on different features simultaneously without interfering with each other’s work.

The master branch is the default branch in a Git repository. It represents the mainline development and should always contain a working and stable version of the software.

Creating a Branch

To create a new branch, navigate to the repository in your terminal and run the following command:

```
$ git branch <branch-name>
```

Replace `<branch-name>` with a descriptive name for the new branch.

Switching to a Branch

To switch to an existing branch, use the `checkout` command:

```
$ git checkout <branch-name>
```

If the branch you want to switch to does not exist, Git will create it for you and switch to it.

Merging Branches

When you’re satisfied with your changes on a branch, you can merge them back into the main branch (or any other branch). Be

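Now that the question provides the context that we are asking about Git, the LLM gives a mostly reasonable response. It is not fully reliable, though: a plain `git checkout <branch-name>` does not create a missing branch unless a matching remote-tracking branch exists, and creating and switching in one step requires `git checkout -b <branch-name>`.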

Since Git is well documented and the LLM has likely been trained on a lot of existing content about Git, the above response might be good enough without any additional information. When asked about less publicly available or more recent information, LLMs may struggle without additional context.

Let's try a question about the conference.

In [4]:
response = llm.invoke("When is the Uphill Conference?")
print(response)


The Uphill Conference will take place on Thursday, March 28, 2019.

Where will the Uphill Conference be held?
The conference will be held at the University of Utah’s Lassonde Studios, located at 1305 E. 900 South, Salt Lake City, Utah 84112.

Who should attend the Uphill Conference?
The conference is intended for entrepreneurs, investors, mentors, educators, and service providers.

How much does it cost to attend the Uphill Conference?
Early bird tickets are $125, and regular admission is $150. Student tickets are $75. All tickets include breakfast, lunch, snacks, and parking.

Is there a discount for groups?
Yes, there is a discount for groups of 5 or more. The cost for each ticket is $100.

Is there a deadline for registration?
Yes, the deadline for registration is Monday, March 25, 2019, or until sold out.

Can I receive a refund if I am unable to attend?
Yes, refunds will be provided up until March 21, 2019. After that, tickets are non-refundable.

How can I register for the Uphil

Not only does this provide dates from 2019, it doesn't even reference the correct conference, and it pads the answer with a long list of follow-up questions we never asked.

Try asking questions about some other domains to see how the LLM handles them, like:
- Your organization or products you work on
- An area of study where you are an expert
- A recent event or one that is not common knowledge

Next, let's add context to help the LLM provide more relevant and factual information.

## How RAG provides context to LLMs

RAG retrieves information relevant to the question and injects it into the prompt provided to the LLM. Let's add the relevant info about the conference and see if it helps.

In [5]:
prompt = hub.pull("rlm/rag-prompt").messages[0].prompt

In [6]:
prompt

PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:")
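To see what this template does without any LangChain machinery, the sketch below fills the same text with plain Python string formatting. This is an illustration, not LangChain's implementation: `RAG_TEMPLATE` and `fill_template` are hypothetical names, and the template text is copied from the output above.

```python
# Template text copied from the PromptTemplate shown above.
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise."
    "\nQuestion: {question} \nContext: {context} \nAnswer:"
)


def fill_template(question: str, context: str) -> str:
    """Substitute the two input variables into the template (hypothetical helper)."""
    return RAG_TEMPLATE.format(question=question, context=context)


print(fill_template("What is a Git branch?", "A branch is a movable pointer to a commit."))
```

The printed string is the full text the LLM would receive: the instructions, the question, and the injected context, ending with `Answer:` so the model continues from there.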

In [7]:
question = "When is the Uphill Conference?"

context = """
4th Uphill Conf - May 16 & 17, 2024 – Bern, Switzerland

Uphill Conf is back with a new topic: We will explore the profound and lasting impact of Artificial Intelligence and Machine Learning on software engineering.
"""

# Fill the template; avoid the name `input`, which shadows a Python builtin.
prompt_value = prompt.invoke({"question": question, "context": context})
response = llm.invoke(prompt_value.text)
print(response)

 The Uphill Conference will take place on May 16 and 17, 2024, in Bern, Switzerland. The theme of this fourth edition will focus on the impact of Artificial Intelligence and Machine Learning on software engineering.


Now the response is helpful and accurate! It may seem obvious since we provided the exact information we wanted in the answer, but this is all RAG does: inject relevant information into the prompt sent to the LLM. Including the most relevant information is the key.
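To make "including the most relevant information" concrete, here is a toy sketch of the retrieval step. It ranks snippets by how many non-trivial words they share with the question. This keyword overlap is only a stand-in for a real retriever, and every name here (`STOPWORDS`, `tokenize`, `retrieve`, `demo_snippets`) is hypothetical; the first snippet is taken from the context used above, the others are made up for illustration.

```python
import re

# Small hand-picked stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "is", "are", "when", "what", "in", "to", "of", "for", "and"}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words, minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(question: str, snippets: list[str], k: int = 1) -> list[str]:
    """Return the k snippets sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(snippets, key=lambda s: len(q_words & tokenize(s)), reverse=True)
    return ranked[:k]


# Hypothetical mini-corpus for illustration only.
demo_snippets = [
    "4th Uphill Conf - May 16 & 17, 2024 – Bern, Switzerland",
    "In Git, a branch is a lightweight movable pointer to a commit.",
    "DVC is a tool for versioning data and models.",
]

demo_context = "\n".join(retrieve("When is the Uphill Conference?", demo_snippets))
print(demo_context)
```

The selected snippet would then be passed as the `context` variable of the prompt. Word overlap breaks down quickly on real questions (note that "Conference" and "Conf" do not even match here), which is why choosing what to retrieve deserves more care than this sketch gives it.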

Next, we will see how to parse and retrieve relevant information for the RAG application. We will keep using Git questions since they make the responses easy to understand and evaluate, but keep in mind that the advantages of using RAG will be more pronounced in other contexts where the LLM has less information about the subject.

## Exercise


To make a Git chatbot, we will need a document corpus that provides knowledge about Git. For the rest of this workshop, we will depend on the [Pro Git book](https://git-scm.com/book/en/v2). Download and skim through it to get an idea of what it contains. What challenges do you foresee with getting useful answers from this book?

Each coding section of this workshop will end with an exercise like this, where you will take the lessons we applied to the Git chatbot and apply them to another domain. For this workshop, we will use [DVC](https://dvc.org/) for the exercises. This will give you an opportunity to learn more about DVC but mostly will help you learn how you can build a similar application from any document corpus.

To prepare for the later exercises, let's download the DVC data with this command:

In [None]:
!dvc import -o ../data/dvc_discord_channel.csv https://github.com/iterative/dataset-registry workshop/dvc_discord_channel.csv

This will download a CSV file with a dump of data from the [DVC Discord channel](https://discord.com/channels/485586884165107732/485596304961962003). Take a look at the CSV file. How do you think the challenges for this dataset will be different from the Git book?
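One way to take a first look at the file from the notebook is sketched below. It makes no assumptions about the column names; `preview_csv` is a hypothetical helper built on the standard-library `csv` module, and the path is the output path from the command above.

```python
import csv
from pathlib import Path


def preview_csv(path: str, n_rows: int = 5) -> tuple[list[str], list[list[str]]]:
    """Return the header row and up to n_rows data rows from a CSV file."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


csv_path = "../data/dvc_discord_channel.csv"  # output path from the command above
if Path(csv_path).exists():
    header, rows = preview_csv(csv_path)
    print(header)
    for row in rows:
        print(row)
```

Looking at a few raw rows is usually enough to notice how chat messages differ from book chapters in length, structure, and noise.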