<a href="https://colab.research.google.com/github/devsjee/LLM-payground/blob/main/Chat_with_Github_Repo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We know that publicly available LLM applications like ChatGPT are very alluring in their ability to carry on meaningful conversations. It is even more interesting to learn that these LLMs can easily be used to build a chatbot that helps us explore or understand a specific dataset of our choice. For example, say I want to investigate a GitHub repository and understand its code flow. I can then create an LLM-based chatbot to answer my queries about that repo.


Broadly speaking, this application will have the following three components:

1.  A front end to receive inputs from the user and show the outputs: A simple front end would be the Python terminal itself. We can use Python's input() to get inputs from the command line and show outputs on the same terminal. We will begin with this simple front end and subsequently use a library called Streamlit to create a web-app-like front end.

2.  A back end for storing and accessing data: A simple back end for our application would be a Python list or dictionary. But would it be efficient for storing and querying embedding data? Imagine storing the embeddings in a Python list and then trying to fetch similar embeddings. For similarity-based search, we need to compute the cosine distance between the query and every embedding in the list, so the cost of each query grows linearly with the number of stored embeddings. This is computationally expensive, though there are techniques to improve search efficiency. Fortunately, there are options like Deep Lake, Pinecone, etc. that handle embedding storage and retrieval efficiently for us. I will use Deep Lake for this experiment, as I am basically following Deep Lake's tutorial to build this custom LLM chatbot.

3.  A processing engine for the chatbot: Creating a custom LLM-based chatbot involves three main steps:

    1. Processing the custom data and splitting it into chunks, i.e. creating a collection of data pieces of roughly uniform length.
    2. Converting the chunks to embeddings and storing them in a vector database. A vector database is a type of database that supports efficient storage and retrieval of high-dimensional embedding vectors.
    3. Using an LLM to answer queries about this data.
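
To make the cost argument in the back-end item concrete, here is a minimal sketch of the naive approach: embeddings kept in a plain Python list, with every query compared against all of them. The function names and the tiny 3-dimensional vectors are illustrative assumptions, not part of Deep Lake or its tutorial; the embeddings used later in this notebook have 1536 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def brute_force_search(query, stored, k=2):
    """Score the query against every stored vector and keep the best k.

    This touches all n vectors of dimension d, so each query costs O(n * d).
    """
    scored = [(cosine_similarity(query, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy data (hypothetical): (text, embedding) pairs
stored = [
    ("intro to LaTeX", [1.0, 0.0, 0.0]),
    ("bibliography", [0.0, 1.0, 0.0]),
    ("document classes", [0.9, 0.1, 0.0]),
]
top = brute_force_search([1.0, 0.05, 0.0], stored, k=2)
print([text for _, text in top])
```

A vector database avoids this full scan by indexing the vectors, which is the part we hand over to Deep Lake.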


In [50]:
# Front end for the chatbot
# In this application, there are two points where the user interacts with our program:
# 1.  To provide the GitHub URL for processing
# 2.  To chat about the GitHub repo after it has been processed by our program


# Get the GitHub URL and the file extensions to include from the user
repo_url = input('Enter the github url: ')
include_file_extensions = input('Enter the list of file extensions to be included in the chat database separated by spaces. Eg: .py .js .pdf')


Enter the github url: https://github.com/devsjee/LaTeX-Tutorial
Enter the list of file extensions to be included in the chat database separated by spaces. Eg: .py .js .pdf.pdf .tex


In [51]:
print(include_file_extensions)

.pdf .tex
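
Note that `include_file_extensions` is a single string at this point, not a list. A membership test such as `".md" in ".pdf .tex"` is then a substring check, and the empty extension `""` (a file with no extension) is "in" every string. One way to turn the raw input into a clean collection is sketched below; `parse_extensions` is a hypothetical helper, not part of the original notebook.

```python
def parse_extensions(raw):
    """Split a space-separated string such as '.pdf .tex' into a set of
    lowercase extensions, adding a leading dot where the user left it out."""
    extensions = set()
    for token in raw.split():
        token = token.lower()
        if not token.startswith("."):
            token = "." + token
        extensions.add(token)
    return extensions

print(sorted(parse_extensions(".pdf .tex")))
```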


In [4]:
# Next, we set the environment variables that hold the keys we will need.
# For the OpenAI API key: sign up at https://openai.com/ and get your API key from your account/profile page
# for Activeloop API token: Register at activeloop site https://app.activeloop.ai/register/ and get the token from
# for Activeloop User name: your registered username with activeloop

import getpass
import os

os.environ['OPENAI_API_KEY'] = getpass.getpass("Enter OpenAI API key")
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass("Enter Activeloop API token")
os.environ['ACTIVELOOP_USERNAME'] = getpass.getpass("Enter Activeloop Username")



Enter OpenAI API key··········
Enter Activeloop API token··········
Enter Activeloop Username··········


In [52]:
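def extract_repo_name(repo_url):
    """Extract the repository name from the given repository URL."""
    # Drop any trailing slash so the last path segment is never empty
    repo_name = repo_url.rstrip("/").split("/")[-1]
    # Strip only a trailing ".git" (str.replace would remove it anywhere in the name)
    if repo_name.endswith(".git"):
        repo_name = repo_name[: -len(".git")]
    return repo_name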

In [6]:
repo_name= extract_repo_name(repo_url)
activeloop_username = os.environ.get("ACTIVELOOP_USERNAME")
activeloop_dataset_path = f"hub://{activeloop_username}/{repo_name}"
activeloop_token = os.getenv("ACTIVELOOP_TOKEN")

print("repo name: ", repo_name)
print("al uname: ",activeloop_username)
print("al dataset path:" ,activeloop_dataset_path)


repo name:  LaTeX-Tutorial
al uname:  devsjee
al dataset path: hub://devsjee/LaTeX-Tutorial


One important lesson I learned from trying to implement this application from scratch is that all the libraries used here are updated frequently, so we need to make sure we have the latest versions installed. The '--upgrade' option used below was essential to getting this chatbot working.

In [62]:
!pip install -q --upgrade DeepLake
!pip install -q --upgrade langchain
!pip install -q --upgrade langchain_community
!pip install -q --upgrade langchain-openai
!pip install -q --upgrade pathspec

In [7]:
# While running the code from the peterw repo, there was an error related to base64 encoding. It is mainly about how the Activeloop token is encoded.
# The code below makes the necessary padding changes to the Activeloop token.

# Remove any trailing padding characters
activeloop_token = activeloop_token.rstrip("=")

# Add the missing padding characters (none if the length is already a multiple of 4)
padding_length = -len(activeloop_token) % 4
activeloop_token += "=" * padding_length
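
The padding arithmetic is easy to get wrong when the length is already a multiple of 4 (computing `4 - len(s) % 4` would then add four `=` characters), so here is a small self-contained check. `fix_base64_padding` is a hypothetical helper name, and the sample strings are made-up values, not real tokens.

```python
import base64

def fix_base64_padding(value):
    """Return value with its '=' padding rebuilt to a multiple of 4 characters."""
    stripped = value.rstrip("=")
    # (-n) % 4 is 0 when n is already a multiple of 4
    return stripped + "=" * (-len(stripped) % 4)

padded = fix_base64_padding("aGVsbG8")  # 7 characters, so one '=' is added
print(padded)
print(base64.b64decode(padded))
```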

In [8]:
# When I had to repeat this experiment again and again for debugging, I had to delete the local repository folder entirely, but a non-empty directory cannot be removed directly in Colab.
# So I used the code below to recursively delete the contents of the local repo folder (bottom-up: files first, then the emptied subdirectories).
import os
for root, dirs, files in os.walk("devsjee/", topdown=False):
    for name in files:
        os.remove(os.path.join(root, name))
    for name in dirs:
        os.rmdir(os.path.join(root, name))
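
The standard library can also do this recursive deletion in one call with `shutil.rmtree`. A minimal sketch, using a throwaway temporary directory as a stand-in for the local repo folder so that it is safe to run anywhere:

```python
import os
import shutil
import tempfile

# Build a small nested directory tree to delete (stand-in for the repo folder)
base = tempfile.mkdtemp()
nested = os.path.join(base, "devsjee", "subdir")
os.makedirs(nested)
with open(os.path.join(nested, "file.txt"), "w") as handle:
    handle.write("temporary content")

target = os.path.join(base, "devsjee")
# ignore_errors=True makes this a no-op if the folder is already gone
shutil.rmtree(target, ignore_errors=True)
print(os.path.exists(target))
```

Unlike the `os.walk` loop, `shutil.rmtree` also removes the top-level folder itself.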

In [9]:
# A dataset will be created in the online data hub at the Activeloop site after this step.
# Log in to your Activeloop account to see the dataset, or use the link that
# appears in the output when this cell runs successfully.
import deeplake

ds = deeplake.empty(
    activeloop_dataset_path,
    token=activeloop_token,
    overwrite=True,
)

ds.create_tensor("ids")
ds.create_tensor("metadata")
ds.create_tensor("embedding")
ds.create_tensor("text")



Your Deep Lake dataset has been successfully created!
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/devsjee/LaTeX-Tutorial
hub://devsjee/LaTeX-Tutorial loaded successfully.




Tensor(key='text')

In [None]:
!mkdir /content/repos

In [25]:
%cd /content/repos

/content/repos


In [29]:
!git clone https://github.com/devsjee/LaTeX-Tutorial

Cloning into 'LaTeX-Tutorial'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 17 (delta 2), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (17/17), 481.72 KiB | 7.90 MiB/s, done.
Resolving deltas: 100% (2/2), done.


In [37]:
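# We can also clone the given GitHub URL from Python and store it locally. This did not work for me
# (note the return code 128 in the output), so I used the previous two cells to create a local folder and clone the repo into it.
# repo_destination is still needed by the later cells.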
import subprocess

repo_destination = "/content/repos/LaTeX-Tutorial"

subprocess.run(["git","clone",repo_url,repo_destination])


CompletedProcess(args=['git', 'clone', 'github.com/devsjee/LaTeX-Tutorial', '/content/repos/LaTeX-Tutorial'], returncode=128)
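
Return code 128 is what git reports for a fatal error. Judging only from the output above, two things may have contributed: the URL was passed without the `https://` scheme, so git would treat it as a local path, and the destination folder already existed from the earlier `!git clone`. The sketch below is one way to guard against both. `normalize_repo_url` and `clone_repo` are hypothetical helpers, and this has not been verified against the original Colab session.

```python
import os
import subprocess

def normalize_repo_url(url):
    """Add an https:// scheme when the user typed a bare host/path."""
    url = url.strip()
    if "://" not in url and not url.startswith("git@"):
        url = "https://" + url
    return url

def clone_repo(url, destination):
    """Clone url into destination unless destination already exists.

    Returns True if a clone was performed, False if it was skipped.
    Raises subprocess.CalledProcessError (carrying git's stderr) on failure.
    """
    if os.path.isdir(destination):
        return False
    subprocess.run(
        ["git", "clone", normalize_repo_url(url), destination],
        check=True,
        capture_output=True,
        text=True,
    )
    return True

print(normalize_repo_url("github.com/devsjee/LaTeX-Tutorial"))
```

With `check=True`, a failed clone raises an exception whose `stderr` attribute holds git's own error message, which is easier to debug than a bare return code.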

In [63]:
import os
import pathspec
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import DeepLake

In [53]:
# We will load only the relevant files from the cloned repo into the 'docs' variable.
# Relevant files are those whose extension matches one of the file extensions given by the user.
docs = []

# Load .gitignore rules
gitignore_path = os.path.join(repo_destination, ".gitignore")

if os.path.isfile(gitignore_path):
    with open(gitignore_path, "r") as gitignore_file:
        gitignore = gitignore_file.read()
    spec = pathspec.PathSpec.from_lines(
        pathspec.patterns.GitWildMatchPattern, gitignore.splitlines()
    )
else:
    spec = None

for dirpath, dirnames, filenames in os.walk(repo_destination):
    # Remove dot directories from the list of directory names
    dirnames[:] = [d for d in dirnames if not d.startswith(".")]

    for file in filenames:
        print(file)
        file_path = os.path.join(dirpath, file)

        # Skip dotfiles
        if file.startswith("."):
            continue

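        # Skip files that match .gitignore rules (patterns are relative to the repo root)
        if spec and spec.match_file(os.path.relpath(file_path, repo_destination)):
            continue

        # Compare against the individual extensions, not the raw input string
        if include_file_extensions and os.path.splitext(file)[1] not in include_file_extensions.split():
            continue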

        try:
            loader = TextLoader(file_path, encoding="utf-8")
            docs.extend(loader.load_and_split())
        except Exception:
            # Silently skip files that cannot be loaded as UTF-8 text (such as binary files)
            pass


chapter1.png
knuth.jpg
title.png
README.md
tutorial_slides.tex
tutorial_slides.pdf
hello.png
bib.jpeg
chapter3.png
chapter2.png


In [66]:
print(len(docs))

6


In [40]:
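# We split the collection of documents into text chunks with a target size of 1000 characters.
# CharacterTextSplitter splits on a separator (blank lines by default), so individual chunks can be shorter than 1000 characters, or longer when a single piece exceeds the limit.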

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

print(len(texts))

27
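
To illustrate what chunking means, here is a deliberately simplified fixed-width splitter in plain Python. It is not how `CharacterTextSplitter` works internally (that class splits on a separator and then merges pieces up to `chunk_size`); `split_fixed` is a hypothetical helper for illustration only.

```python
def split_fixed(text, chunk_size=1000, chunk_overlap=0):
    """Cut text into pieces of at most chunk_size characters.

    Consecutive pieces share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "x" * 2500
chunks = split_fixed(sample, chunk_size=1000)
print([len(c) for c in chunks])
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a chunk boundary, at the cost of storing more embeddings.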


In [56]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m9.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tiktoken
Successfully installed tiktoken-0.6.0


In [64]:
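# Create vector embeddings for the chunks and store them in the Deep Lake dataset.
# Note: the dataset already existed when this cell ran, which is likely why the summary below shows 54 rows
# (the 27 chunks added on top of 27 from an earlier run) rather than 27.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
db = DeepLake.from_documents(texts, embeddings, dataset_path=activeloop_dataset_path)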
#db.add_documents(texts)

Deep Lake Dataset in hub://devsjee/LaTeX-Tutorial already exists, loading from the storage


Creating 27 embeddings in 1 batches of size 27:: 100%|██████████| 1/1 [00:38<00:00, 38.22s/it]

Dataset(path='hub://devsjee/LaTeX-Tutorial', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype     shape      dtype  compression
  -------   -------   -------    -------  ------- 
 embedding  generic  (54, 1536)  float32   None   
    ids      text     (54, 1)      str     None   
 metadata    json     (54, 1)      str     None   
   text      text     (54, 1)      str     None   





In [58]:
# Create a read-only instance of DeepLake with the specified dataset path and embeddings.
# It is used for search and retrieval while chatting.
db = DeepLake(
    dataset_path=activeloop_dataset_path,
    read_only=True,
    embedding_function=embeddings,
)



Deep Lake Dataset in hub://devsjee/LaTeX-Tutorial already exists, loading from the storage


In [59]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA


In [60]:
def search_db(db, query):
    """Search for a response to the query in the DeepLake database."""
    # Create a retriever from the DeepLake instance
    retriever = db.as_retriever()
    # Set the search parameters for the retriever
    retriever.search_kwargs["distance_metric"] = "cos"
    retriever.search_kwargs["fetch_k"] = 100
    #retriever.search_kwargs["maximal_marginal_relevance"] = True
    retriever.search_kwargs["k"] = 10
    # Create a ChatOpenAI model instance
    model = ChatOpenAI(model="gpt-3.5-turbo")
    # Create a RetrievalQA instance from the model and retriever
    qa = RetrievalQA.from_llm(model, retriever=retriever)
    # Return the result of the query
    return qa.run(query)


In [65]:
# This is the chatting part of the application. We are using a simple terminal-based chat for now.
# Get the user's first question from the terminal
user_input = input()

# Keep answering with search_db until the user ends the chat by entering 'bye' (in any capitalization)
while user_input.strip().lower() not in ['bye', 'bye!']:
    output = search_db(db, user_input)
    print(output)
    user_input = input()



can you summarize the first half of latex tutorial?
In the first half of the LaTeX tutorial, we covered the basics of creating LaTeX documents. This included information on document classes like article, report, and book, as well as adding a title, author, and date to the document. We also discussed creating content in the document using the \texttt{maketitle} command and adding images. Additionally, we touched on bibliography management, compilation steps, and creating a simple LaTeX document. Lastly, we mentioned some popular standalone LaTeX editors and advanced topics like macros, Beamer for presentations, and pgfplots for creating graphs and figures.
bue
I'm not sure what you mean by "bue." If you have a question or need assistance with something related to LaTeX, feel free to ask!
bye
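
In the session above, the mistyped "bue" was sent to the model as a regular question, because anything that is not an exit word counts as a query. The sketch below separates the loop from the model so that it can be tried without any API call. `chat_loop` and its parameters are hypothetical names, and the extra exit words are an assumption beyond what the notebook uses.

```python
def chat_loop(answer_fn, read_fn=input, write_fn=print,
              exit_words=("bye", "bye!", "quit", "exit")):
    """Read questions until the user types an exit word.

    answer_fn maps a question string to an answer string.
    Returns the number of questions answered.
    """
    answered = 0
    while True:
        question = read_fn().strip()
        if question.lower() in exit_words:
            break
        if not question:
            continue  # ignore empty lines instead of querying the model
        write_fn(answer_fn(question))
        answered += 1
    return answered

# Scripted run with a stand-in answer function (no LLM call involved)
scripted = iter(["what is LaTeX?", "", "BYE"])
outputs = []
count = chat_loop(lambda q: "echo: " + q,
                  read_fn=lambda: next(scripted),
                  write_fn=outputs.append)
print(count, outputs)
```

In the notebook itself, the call would look like `chat_loop(lambda q: search_db(db, q))`.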


We have successfully created and tested an LLM-powered custom chatbot that can answer queries about the given GitHub repo. However, we have been chatting via the Python terminal itself. Next, in a different notebook, we will see how to use Streamlit to create a nice GUI-based front end for this application.