<a href="https://colab.research.google.com/github/brianMutea/Chat-with-your-Github-repositories-LlamaIndex-and-Activeloop-Deep-Lake.ipynb/blob/main/Chat_with_your_Github_repositories_LlamaIndex_and_Activeloop_Deep_Lake.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chat with your GitHub repositories using LlamaIndex

You'll learn how to index GitHub repositories into Deep Lake and query your code in natural language.

#### How does LlamaIndex work?
LlamaIndex follows a simple four-step workflow. Here's a breakdown:

* **Load Documents**: The first step is to load your raw data, either manually or through a data loader. LlamaIndex offers data loaders that ingest data from various sources and convert it into Document objects, and many more are available as plugins on Llama Hub.
* **Parse the Documents into Nodes**: The loaded documents are parsed into Nodes, structured units that hold chunks of the original documents along with metadata and relationship information. This organizes the raw data so the system can handle it efficiently.
* **Construct an Index from Nodes or Documents**: An index makes the data searchable and queryable. Depending on your needs, it can be built directly from the original documents or from the parsed Nodes. A common choice is VectorStoreIndex, which is optimized for quick retrieval.
* **Query the Index**: With the index in place, you initialize a query engine and ask questions in natural language. The engine searches the indexed data and uses the relevant parts to answer.
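
The four steps above can be illustrated without any external service. The following is a toy sketch only: it ranks chunks by word overlap instead of embeddings, and every name in it (`ToyDocument`, `ToyNode`, `parse_into_nodes`, `build_index`, `query`) is hypothetical rather than part of the LlamaIndex API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a loaded document."""
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node: a text chunk plus inherited metadata."""
    text: str
    metadata: dict


def parse_into_nodes(docs, chunk_size=8):
    """Split each document into chunks of at most `chunk_size` words."""
    nodes = []
    for doc in docs:
        words = doc.text.split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            nodes.append(ToyNode(chunk, dict(doc.metadata)))
    return nodes


def build_index(nodes):
    """Pair each node with its lowercase word set (a crude substitute for an embedding)."""
    return [(set(node.text.lower().split()), node) for node in nodes]


def query(index, question, top_k=1):
    """Return the `top_k` nodes that share the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(index, key=lambda entry: len(q_words & entry[0]), reverse=True)
    return [node for _, node in ranked[:top_k]]


# Step 1: load documents (hard-coded here for illustration).
docs = [
    ToyDocument("Deep Lake stores embeddings for retrieval", {"file_name": "a.md"}),
    ToyDocument("The reader loads files from a GitHub repository", {"file_name": "b.md"}),
]
# Steps 2 and 3: parse into nodes, then build the index.
index = build_index(parse_into_nodes(docs))
# Step 4: query the index.
best = query(index, "which files does the reader load")[0]
print(best.metadata["file_name"], "->", best.text)
```

In the real pipeline, LlamaIndex replaces the word-overlap ranking with embedding similarity and passes the retrieved chunks to a language model to compose the answer.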

In [1]:
%%capture
!pip install -q llama-index==0.9.14.post3 openai==1.3.8 cohere==4.37 deeplake

In [16]:
import os
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
os.environ["ACTIVELOOP_TOKEN"] = "your_activeloop_token"
os.environ["GITHUB_TOKEN"] = "your_github_classic_token"

# Deep Lake dataset path in the form hub://<organization>/<dataset_name>
dataset_path = "hub://brianmuteak/git_repository_vdata"

In [3]:
# Fetch and set API keys
openai_api_key = os.getenv("OPENAI_API_KEY")
active_loop_token = os.getenv("ACTIVELOOP_TOKEN")
github_token = os.getenv("GITHUB_TOKEN")

In [4]:
%%capture
!pip install llama_hub

In [5]:
import os
import re
import textwrap

from llama_hub.github_repo import GithubClient, GithubRepositoryReader
from llama_index import VectorStoreIndex, download_loader
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores import DeepLakeVectorStore

In [15]:
def parse_github_url(url):
  pattern = r"https://github\.com/([^/]+)/([^/]+)"
  match = re.match(pattern, url)
  return match.groups() if match else (None, None)


def validate_owner_repo(owner, repo):
  return bool(owner) and bool(repo)


def initialize_github_client():
  github_token = os.getenv("GITHUB_TOKEN")
  return GithubClient(github_token)


def main():
  # Check for OpenAI API key
  openai_api_key = os.getenv("OPENAI_API_KEY")
  if not openai_api_key:
      raise EnvironmentError("OpenAI API key not found in environment variables")

  # Check for GitHub Token
  github_token = os.getenv("GITHUB_TOKEN")
  if not github_token:
      raise EnvironmentError("GitHub token not found in environment variables")

  # Check for Activeloop Token
  active_loop_token = os.getenv("ACTIVELOOP_TOKEN")
  if not active_loop_token:
      raise EnvironmentError("Activeloop token not found in environment variables")

  github_client = initialize_github_client()
  download_loader("GithubRepositoryReader")

  github_url = input("Please enter the GitHub repository URL: ")

  while True:
      owner, repo = parse_github_url(github_url)
      if validate_owner_repo(owner, repo):
          loader = GithubRepositoryReader(
              github_client,
              owner=owner,
              repo=repo,
              filter_file_extensions=(
                  [".ipynb"],
                  GithubRepositoryReader.FilterType.INCLUDE,
              ),
              verbose=False,
              concurrent_requests=5,
          )
          print(f"Loading {repo} repository by {owner}")
          docs = loader.load_data(branch="main")
          print("Documents uploaded:")
          for doc in docs:
              print(doc.metadata)
          break  # Exit the loop once the valid URL is processed
      else:
          print("Invalid GitHub URL. Please try again.")
          github_url = input("Please enter the GitHub repository URL: ")

  print("Uploading to vector store...")

  # ====== Create vector store and upload data ======

  vector_store = DeepLakeVectorStore(
      dataset_path= dataset_path,
      overwrite=True,
      runtime={"tensor_db": True},
  )

  storage_context = StorageContext.from_defaults(vector_store = vector_store)
  index = VectorStoreIndex.from_documents(
      docs, storage_context = storage_context
      )
  query_engine = index.as_query_engine()

  # Include a simple question to test.
  intro_question = "What is the repository about?"
  print(f"Test question: {intro_question}")
  print("=" * 50)
  answer = query_engine.query(intro_question)

  print(f"Answer: {textwrap.fill(str(answer), 100)} \n")
  while True:
      user_question = input("Please enter your question (or type 'exit' to quit): ")
      if user_question.lower() == "exit":
          print("Exiting, thanks for chatting!")
          break

      print(f"Your question: {user_question}")
      print("=" * 50)

      answer = query_engine.query(user_question)
      print(f"Answer: {textwrap.fill(str(answer), 100)} \n")


if __name__ == "__main__":
  main()

Please enter the GitHub repository URL: https://github.com/brianMutea/LlamaIndex-Precision-and-Simplicity-in-Information-Retrieval
Loading LlamaIndex-Precision-and-Simplicity-in-Information-Retrieval repository by brianMutea
Documents uploaded:
{'file_path': 'LlamaIndex_Introduction_Precision_and_Simplicity_in_Information_Retrieval.ipynb', 'file_name': 'LlamaIndex_Introduction_Precision_and_Simplicity_in_Information_Retrieval.ipynb', 'url': 'https://github.com/brianMutea/LlamaIndex-Precision-and-Simplicity-in-Information-Retrieval/blob/main/LlamaIndex_Introduction_Precision_and_Simplicity_in_Information_Retrieval.ipynb'}
Uploading to vector store...
Your Deep Lake dataset has been successfully created!




Uploading data to deeplake dataset.


100%|██████████| 10/10 [00:00<00:00, 43.90it/s]


Dataset(path='hub://brianmuteak/git_repository_vdata', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (10, 1)      str     None   
 metadata     json      (10, 1)      str     None   
 embedding  embedding  (10, 1536)  float32   None   
    id        text      (10, 1)      str     None   
Test question: What is the repository about?
Answer: The repository is about precision and simplicity in information retrieval. 

Please enter your question (or type 'exit' to quit): Give me examples of LlamaIndex index types
Your question: Give me examples of LlamaIndex index types
Answer: Summary Index and Vector Store Index are examples of LlamaIndex index types. 

Please enter your question (or type 'exit' to quit): How do I create nodes with LlamaIndex?
Your question: How do I create nodes with LlamaIndex?
Answer: To create nodes with LlamaIndex, you can use the `

## Understanding the code
At first glance, a lot is happening here; let's review it. Below is a step-by-step breakdown:

### Initialization and Environment Setup
* Import Required Libraries: The script starts by importing all the necessary modules and packages.
* Load Environment Variables: The notebook assigns the OpenAI, Activeloop, and GitHub credentials to `os.environ` and reads them back with `os.getenv`. Replace the placeholder strings with your own keys and tokens.
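
The three credential checks in `main()` follow the same pattern, so they could be collapsed into a single loop. A minimal sketch, assuming a hypothetical helper named `require_env` that is not part of the original notebook:

```python
import os


def require_env(*names):
    """Return the values of the named environment variables, in order.

    Raises EnvironmentError listing every variable that is unset or empty.
    `require_env` is a hypothetical helper, not part of the notebook.
    """
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise EnvironmentError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return [os.environ[name] for name in names]


# Placeholder values for illustration only; use real credentials in practice.
os.environ.setdefault("OPENAI_API_KEY", "your_openai_api_key")
os.environ.setdefault("ACTIVELOOP_TOKEN", "your_activeloop_token")
os.environ.setdefault("GITHUB_TOKEN", "your_github_classic_token")

openai_api_key, active_loop_token, github_token = require_env(
    "OPENAI_API_KEY", "ACTIVELOOP_TOKEN", "GITHUB_TOKEN"
)
```

Reporting all missing variables at once saves a round trip compared with failing on the first one.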

### Helper Functions
* `parse_github_url(url)`: This function takes a GitHub URL and extracts the repository owner and name using regular expressions.
* `validate_owner_repo(owner, repo)`: Validates that both the repository owner and name are present.
* `initialize_github_client()`: Initializes the GitHub client using the token fetched from the environment variables.

### Main Function
* API Key Checks: Before proceeding, `main()` checks for the OpenAI API key, GitHub token, and Activeloop token, raising an `EnvironmentError` if any is missing.
* Initialize GitHub Client: Calls `initialize_github_client()` to get a GitHub client instance.
* User Input for GitHub URL: Asks the user to input a GitHub repository URL.
* URL Parsing and Validation: Parses the URL to get the repository owner and name and validates them, prompting again until the URL is valid.
* Data Loading: Once the URL is valid, it uses `GithubRepositoryReader` from `llama_hub` to load the repository's `main` branch, keeping only `.ipynb` files.
* Indexing: The loaded documents are indexed with `VectorStoreIndex` and stored in a `DeepLakeVectorStore`, which makes the data queryable.
* Query Engine Initialization: Initializes a query engine based on the indexed data.
* Test Query: Asks "What is the repository about?" to demonstrate the system's operation.
* User Queries: Enters a loop where the user can input natural language queries to interact with the indexed GitHub repository. The loop continues until the user types 'exit'.
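
To see how the URL helpers behave on edge cases, the sketch below copies `parse_github_url` and `validate_owner_repo` from the notebook and runs them on a few sample URLs (the URLs are made-up placeholders). The pattern only anchors the start of the string, so extra path segments are ignored and a `.git` suffix stays in the repository name.

```python
import re


def parse_github_url(url):
    """Extract (owner, repo) from a GitHub URL, or (None, None) if it does not match."""
    pattern = r"https://github\.com/([^/]+)/([^/]+)"
    match = re.match(pattern, url)
    return match.groups() if match else (None, None)


def validate_owner_repo(owner, repo):
    """Return True only when both the owner and the repository name are present."""
    return bool(owner) and bool(repo)


# Placeholder URLs chosen to exercise different cases.
samples = [
    "https://github.com/some-owner/some-repo",            # plain repository URL
    "https://github.com/some-owner/some-repo/tree/main",  # extra path segments
    "https://github.com/some-owner/some-repo.git",        # clone URL with suffix
    "https://github.com/some-owner",                      # repository name missing
    "http://github.com/some-owner/some-repo",             # http instead of https
]
results = {url: parse_github_url(url) for url in samples}
for url, (owner, repo) in results.items():
    valid = validate_owner_repo(owner, repo)
    print(f"{url!r}: owner={owner!r}, repo={repo!r}, valid={valid}")
```

If clone URLs matter for your use case, one option is to strip a trailing `.git` from the repository name before constructing the reader.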

### Execution Entry Point
The script uses the standard `if __name__ == "__main__":` guard, so `main()` runs only when the file is executed directly rather than imported.