# 🕵🏻‍♀️ Create a RAG system expert in a Github repository and log your predictions in Argilla

In this tutorial, we'll show you how to create a RAG system that can answer questions about a specific Github repository. As an example, we will target the [Argilla repository](https://github.com/argilla-io/argilla). This RAG system will target the docs of the repository, as that's where most of the natural language information about the repository can be found.

This tutorial includes the following steps:
-   Setting up the Argilla callback handler for LlamaIndex.
-   Initializing a Github client.
-   Creating an index with a specific set of files from the Github repository of our choice.
-   Creating a RAG system out of the Argilla repository, asking questions, and automatically logging the answers to Argilla.

This tutorial is based on the [Github Repository Reader](https://docs.llamaindex.ai/en/stable/examples/data_connectors/GithubRepositoryReaderDemo/) made by LlamaIndex.

## Setup

Firstly, you need to make sure that you have the `argilla-llama-index` integration installed. You can do so using `pip`. By installing `argilla-llama-index`, you're also installing `argilla` and `llama-index`. In addition to those two, we will also need `llama-index-readers-github`.

In [None]:
%pip install argilla-llama-index llama-index-readers-github

Now, let's set some important environment variables and do the imports necessary for running the notebook. For the environment variables, we'll need the OpenAI API key and our Github token. The OpenAI API key is necessary to run the queries using GPT models, and the Github token is used to ensure that you have access to the repository you're trying to use. The token might not be necessary if the repository is public, but it is still recommended: after all, you probably also browse Github's website while logged in.

In [None]:
%env OPENAI_API_KEY=sk-...
%env GITHUB_TOKEN=github_pat_...
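Before moving on, it can help to confirm that both variables are actually visible to the kernel. The snippet below is a minimal sketch and not part of the original notebook. `missing_env_vars` is a hypothetical helper name introduced only for illustration.

```python
import os

# The two variables this tutorial relies on.
REQUIRED_VARS = ("OPENAI_API_KEY", "GITHUB_TOKEN")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Checking for empty values as well as missing ones catches the case where a placeholder cell was run without filling in the key.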

In [1]:
import os

from llama_index.core import VectorStoreIndex, set_global_handler
from llama_index.readers.github import GithubRepositoryReader, GithubClient

## Initializing Argilla and setting the global handler

First things first, you need to have an Argilla instance running. You can check our [installation guide](https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html#Installation) to choose the option that suits you best. We recommend using Hugging Face Spaces to have a remote instance, or running a local instance using Docker.

Now, we will set up an Argilla global handler for LlamaIndex. By doing so, we ensure that the predictions we obtain using LlamaIndex are automatically uploaded to our Argilla instance. Within the handler, we need to provide the dataset name that we will use. If the dataset does not exist, it will be created with the given name. You can also set the **API key**, **API URL**, and the **Workspace name**. If you want to learn more about the variables that control Argilla initialization, please go to our [workspace management guide](https://docs.argilla.io/en/latest/getting_started/installation/configurations/workspace_management.html).

In [None]:
set_global_handler(
    "argilla", 
    api_url="https://ignacioct-argilla.hf.space", # change it to the HF Space direct link if you are running Argilla in HF Spaces
    api_key="owner.apikey",
    workspace_name="admin",
    dataset_name="repo_reader"
)
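If you would rather not hardcode the connection details in the notebook, one option is to read them from the environment and fall back to the values used in this tutorial. This is only a sketch: the helper `argilla_handler_kwargs` is hypothetical, and the variable names `ARGILLA_API_URL`, `ARGILLA_API_KEY` and `ARGILLA_WORKSPACE` are assumptions chosen for illustration, so adapt them to your own setup.

```python
import os


def argilla_handler_kwargs(env=None):
    """Build keyword arguments for set_global_handler("argilla", ...).

    Values come from the environment mapping when present; otherwise the
    defaults from this tutorial are used.
    """
    env = os.environ if env is None else env
    return {
        "api_url": env.get("ARGILLA_API_URL", "https://ignacioct-argilla.hf.space"),
        "api_key": env.get("ARGILLA_API_KEY", "owner.apikey"),
        "workspace_name": env.get("ARGILLA_WORKSPACE", "admin"),
        "dataset_name": "repo_reader",
    }


handler_kwargs = argilla_handler_kwargs()
# With llama-index installed, the handler would then be registered with:
# set_global_handler("argilla", **handler_kwargs)
```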

## Initialize the Github client

Our Github client shall include the Github token we'll use to access the repo and the information of the repository itself, including the owner, the repository name and the desired branch. In our case, we'll target the `main` branch of the `argilla` repository.

In [3]:
github_token = os.environ.get("GITHUB_TOKEN")
owner = "argilla-io"
repo = "argilla"
branch = "main"

github_client = GithubClient(github_token=github_token, verbose=True)

## Select which documents are included in the RAG



Before creating our `GithubRepositoryReader` instance, we need to allow nested event loops. The Jupyter kernel itself already runs on an event loop, and the reader needs to run its own asynchronous calls inside it in order to read the whole repository. To make that possible, please run the cell below.

In [4]:
import nest_asyncio

nest_asyncio.apply()
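To see why this patch is needed, the sketch below reproduces the underlying problem using only the standard library: code that is already executing inside an event loop cannot block on that same loop. The function names are hypothetical stand-ins, and the snippet is meant to be run as a plain Python script rather than inside the notebook, because `asyncio.run` itself cannot be called from a running loop.

```python
import asyncio


async def fetch_file():
    """Stand-in for an asynchronous download; returns immediately."""
    return "file contents"


async def running_loop_context():
    """Simulate code that already executes inside an event loop, as a notebook cell does."""
    loop = asyncio.get_running_loop()
    coro = fetch_file()
    try:
        # Blocking on the loop that is currently running is not allowed.
        loop.run_until_complete(coro)
    except RuntimeError as err:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(err)
    return "no error"


error_message = asyncio.run(running_loop_context())
print(error_message)
```

`nest_asyncio.apply()` patches asyncio so that this kind of nested call is permitted, which is what lets the reader work from within Jupyter.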

Now, let's create a `GithubRepositoryReader` instance with the information about the repo we want to extract the information from. As the target of this tutorial is to focus on the documentation, we tell the reader to focus on everything in the `docs/` folder, and to avoid images and JSON files. You can also choose to include `.ipynb` files, depending on the target repository. In our case, there are a lot of tutorials with important information in Argilla, and we would want them included, but for the sake of performance in this tutorial, we will exclude them by default.

In [8]:
documents = GithubRepositoryReader(
    github_client=github_client,
    owner=owner,
    repo=repo,
    use_parser=False,
    verbose=False,
    filter_directories=(
        ["docs"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
    filter_file_extensions=(
        [
            ".png",
            ".jpg",
            ".jpeg",
            ".gif",
            ".svg",
            ".ico",
            ".json",
            ".ipynb",  # Remove this line if you want to include notebooks
        ],
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
).load_data(branch=branch)
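The effect of combining an include filter on directories with an exclude filter on extensions can be illustrated with a small standalone function. This is a simplified sketch, not the reader's actual implementation: `is_selected` and the sample paths are hypothetical, and it assumes that extensions are compared with their leading dot.

```python
import os

INCLUDED_DIRECTORIES = ["docs"]
EXCLUDED_EXTENSIONS = [".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".json", ".ipynb"]


def is_selected(path):
    """Return True if a repository path passes both illustrative filters."""
    in_included_dir = any(
        path == directory or path.startswith(directory + "/")
        for directory in INCLUDED_DIRECTORIES
    )
    extension = os.path.splitext(path)[1].lower()  # includes the leading dot
    return in_included_dir and extension not in EXCLUDED_EXTENSIONS


# Hypothetical paths, used only to exercise the filters.
sample_paths = [
    "docs/index.md",
    "docs/_static/logo.png",
    "docs/tutorials/intro.ipynb",
    "src/argilla/__init__.py",
]
selected = [path for path in sample_paths if is_selected(path)]
print(selected)
```

Under these assumptions, only the Markdown file inside `docs/` survives: the image and the notebook are dropped by the extension filter, and the source file is dropped by the directory filter.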

## Create the index and start asking questions

Now, let's create a LlamaIndex index out of these documents, and we can start querying the RAG system.

In [9]:
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query(
    "How does an Argilla's Feedback Dataset work?"
)

print(response)

The FeedbackDataset in Argilla is designed to be a versatile and adaptable dataset that supports a wide range of NLP tasks, including those focused on large language models. It offers flexibility by allowing for multiple tasks to be represented in one coherent user interface, making it particularly useful for workflows involving large language models where multiple tasks need to be performed on the same record. Additionally, the FeedbackDataset supports multiple annotators per record, customizable tasks, and synchronization with a database. However, it currently does not support weak supervision or active learning features.
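Conceptually, the query engine first retrieves the chunks of documentation most similar to the question and then passes them to the language model as context. The toy sketch below illustrates only the retrieval step, using bag-of-words vectors and cosine similarity in place of learned embeddings. It is not how LlamaIndex is implemented; all names and the sample chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks ranked by similarity to the query."""
    query_vector = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical documentation chunks, used only for illustration.
chunks = [
    "the feedback dataset supports multiple annotators per record",
    "install argilla with pip",
    "deploy argilla on hugging face spaces",
]
best = retrieve("how does the feedback dataset work", chunks, top_k=1)
print(best)
```

In a real RAG system the retrieved chunks are inserted into the prompt, so the quality of the answer depends heavily on which files were indexed in the first place.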


When the response is generated, it is automatically logged in our Argilla instance. Check it out! From Argilla you can quickly have a look at your predictions and annotate them, so you can combine both synthetic data and human feedback.

![RAG Example 1](../assets/rag_example_1.png)

Let's ask a few more questions to see the overall behaviour of the RAG chatbot. Remember that the answers are being automatically logged to your Argilla instance.

In [11]:
questions = [
    "What types of dataset can I choose from in Argilla?",
    "How can I create and update an Argilla dataset?",
    "Can I upload Markdown files into an Argilla dataset?",
    "Could you explain how to annotate datasets in Argilla?",
]

answers = []

for question in questions:
    answers.append(query_engine.query(question))

for question, answer in zip(questions, answers):
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print("----------------------------")

Question: What types of dataset can I choose from in Argilla?
Answer: You can choose between older datasets tailored to singular NLP tasks and the FeedbackDataset, which is designed to support a wider range of NLP tasks, including those focused on large language models.
----------------------------
Question: How can I create and update an Argilla dataset?
Answer: To create and update an Argilla dataset, you can follow these steps:
- For local `FeedbackDataset` instances, you can add new fields and questions by extending the existing fields and questions lists respectively. You can also remove non-required fields or questions by using the `pop()` method.
- Both local and remote `FeedbackDataset` instances allow adding metadata properties. For remote instances, you can update metadata properties using the `update_metadata_properties` method and delete metadata properties using the `delete_metadata_properties` method.
- For both local and remote instances, you can configure vector setting