<a href="https://colab.research.google.com/github/amrindersingh03/Unstructured-Machine-Learning-/blob/main/Langchain_bot_zenml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Our aim is to build a LangChain bot that retrieves markdown files from ZenML's repository and uses them to answer questions, citing the files it relied on.



In [1]:
# Install LangChain and the other required packages

!pip install langchain==0.0.55 requests openai transformers faiss-cpu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain==0.0.55
  Downloading langchain-0.0.55-py3-none-any.whl (152 kB)
Collecting openai
  Downloading openai-0.27.2-py3-none-any.whl (70 kB)
Collecting transformers
  Downloading transformers-4.27.1-py3-none-any.whl (6.7 MB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp

In [None]:
# Let's import LangChain's OpenAI wrapper. We will use it as the language model behind our chatbot.

In [2]:
from langchain.llms import OpenAI


In [3]:
# To call the OpenAI API, store the unique API key provided by OpenAI in the OPENAI_API_KEY environment variable.

In [4]:
import os
os.environ["OPENAI_API_KEY"] = "...Enter OpenAI API Key here..."
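Pasting the key into a cell makes it easy to leak when the notebook is shared. As a hedged alternative, the sketch below asks for the key at runtime instead. The helper name `ensure_api_key` and its parameters are hypothetical, introduced only for this illustration.

```python
import os
from getpass import getpass


def ensure_api_key(env=os.environ, prompt=getpass):
    """Set OPENAI_API_KEY in `env` if it is missing, asking the user for it.

    `env` and `prompt` are parameters only so the helper can be exercised
    without real user input; both names are assumptions of this sketch.
    """
    if not env.get("OPENAI_API_KEY"):
        # getpass hides the typed key, so it never shows up in cell output.
        env["OPENAI_API_KEY"] = prompt("OpenAI API key: ").strip()
    return env["OPENAI_API_KEY"]


# In the notebook, call it once before creating any OpenAI objects:
# ensure_api_key()
```

If the variable is already set (for example by the hosting environment), the helper leaves it untouched.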

## Let's fetch all markdown files of ZenML's repository

In [5]:
# Import the tools needed to clone ZenML's GitHub repository and wrap its markdown files as Document objects.

import pathlib
import subprocess
import tempfile
from langchain.docstore.document import Document
import requests

In [6]:
# Creating a function to get markdown files.

In [7]:
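def get_github_docs(repo_owner, repo_name):
    """Yield one Document per markdown file found one level below the repo root."""
    with tempfile.TemporaryDirectory() as d:
        # Shallow-clone only the latest commit into the temporary directory.
        subprocess.check_call(
            f"git clone --depth 1 https://github.com/{repo_owner}/{repo_name}.git .",
            cwd=d,
            shell=True,
        )
        # Record the commit SHA so that source links point at a fixed revision.
        git_sha = (
            subprocess.check_output("git rev-parse HEAD", shell=True, cwd=d)
            .decode("utf-8")
            .strip()
        )
        repo_path = pathlib.Path(d)
        # "*/*.md" matches files exactly one directory below the repository root.
        markdown_files = list(repo_path.glob("*/*.md")) + list(
            repo_path.glob("*/*.mdx")
        )
        for markdown_file in markdown_files:
            # Read as UTF-8 explicitly instead of relying on the platform default.
            with open(markdown_file, "r", encoding="utf-8") as f:
                relative_path = markdown_file.relative_to(repo_path)
                github_url = f"https://github.com/{repo_owner}/{repo_name}/blob/{git_sha}/{relative_path}"
                yield Document(page_content=f.read(), metadata={"source": github_url})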

## The following actions are performed by this process:

- It obtains the latest commit of the intended GitHub repository and saves it in a temporary directory.

- It retrieves the git sha, which is used to generate links that the model can reference in its sources list.

- It collects the markdown files (.md or .mdx) that sit exactly one directory below the repository root, which is what the `*/*.md` and `*/*.mdx` glob patterns match.

- It generates a URL to the markdown file on GitHub, reads the file from the local storage, and produces a Document object.
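To make the file-discovery step concrete, here is a small sketch that compares the one-level pattern used in the function with a recursive pattern on a throwaway directory tree. The helper names `find_markdown` and `demo`, and the file names in the toy tree, are invented for this illustration and are not part of the notebook.

```python
import pathlib
import tempfile


def find_markdown(repo_path, recursive=False):
    """Return relative paths of .md/.mdx files under repo_path.

    With recursive=False this mirrors the "*/*.md" pattern (one level deep);
    with recursive=True it uses "**/*.md", which also matches the root
    directory and every nested directory.
    """
    prefix = "**" if recursive else "*"
    files = []
    for ext in ("md", "mdx"):
        files.extend(repo_path.glob(f"{prefix}/*.{ext}"))
    return sorted(p.relative_to(repo_path).as_posix() for p in files)


def demo():
    """Build a toy tree and return (one_level, recursive) results."""
    with tempfile.TemporaryDirectory() as d:
        root = pathlib.Path(d)
        for rel in (
            "README.md",
            "docs/README.md",
            "docs/guide/intro.md",
            "examples/page.mdx",
        ):
            path = root / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text("# placeholder\n", encoding="utf-8")
        return find_markdown(root), find_markdown(root, recursive=True)


one_level, everything = demo()
print(one_level)   # files exactly one directory below the root
print(everything)  # files at any depth, including the root
```

Switching the notebook to the recursive pattern would index many more files, which also means more embedding requests, so it is a trade-off rather than a strict improvement.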

In [None]:
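# Use the function above to fetch the .md and .mdx files from the ZenML repo and store the result in a variable named `sources`. The function is a generator, so the clone only runs once we iterate over `sources`.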

In [8]:
sources = get_github_docs("zenml-io", "zenml")

Long documents need to be divided into smaller segments, because a language model can only process a limited amount of text per request.

To accomplish this, we can use LangChain's CharacterTextSplitter, which splits the text on a separator and returns the segments as a list.

In [9]:
# Let's import CharacterTextSplitter and break our documents down into small chunks.

from langchain.text_splitter import CharacterTextSplitter

In [12]:
source_chunks = []
splitter = CharacterTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=0)
for source in sources:
    for chunk in splitter.split_text(source.page_content):
        source_chunks.append(Document(page_content=chunk, metadata=source.metadata))
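To give an intuition for what splitting on a separator with a size limit means, here is a simplified, dependency-free sketch. It is not LangChain's implementation. The function name `split_on_separator` is hypothetical, and details such as overlap handling are deliberately left out.

```python
def split_on_separator(text, separator=" ", chunk_size=1024):
    """Greedily pack separator-delimited pieces into chunks.

    Each chunk is at most chunk_size characters long, except when a single
    piece is itself longer than chunk_size, in which case it is kept whole.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits (or the chunk is empty): keep growing it.
            current = candidate
        else:
            # The piece would overflow the chunk: close it and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_on_separator("one two three four", separator=" ", chunk_size=9))
```

Because the chunks are cut at separators rather than at a fixed character offset, words are never split in the middle.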

- Now, we need to import FAISS, a similarity search library developed by Facebook AI. Its primary function is fast search over embeddings to find items that are similar to one another. In other words, it will search the markdown chunks for text that closely matches the text in our question.

- To use FAISS, the input data must be in the form of embeddings, or vectors. Therefore, before feeding data to FAISS, the text must first be converted into embeddings.

- Once the text has been converted into embeddings, FAISS can search for the embeddings that are most similar to the embedding of the question.

- To create embeddings, we will use LangChain's OpenAIEmbeddings class, which calls OpenAI's embedding API.
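The idea behind embedding search can be sketched in a few lines of plain Python. The three-dimensional vectors below are made up for illustration; real embeddings have far more dimensions and come from the embedding model. Using Euclidean distance and the helper names `l2_distance` and `nearest` are assumptions of this sketch, and FAISS performs the same kind of lookup far more efficiently.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, index, k=2):
    """Return the k texts whose vectors are closest to query_vec.

    `index` is a list of (text, vector) pairs: a brute-force stand-in
    for a real vector store.
    """
    ranked = sorted(index, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "embeddings", invented for this example.
toy_index = [
    ("How to install ZenML", [0.9, 0.1, 0.0]),
    ("Building the API docs", [0.1, 0.9, 0.1]),
    ("Running the test suite", [0.0, 0.2, 0.9]),
]
query = [0.2, 0.8, 0.0]  # pretend this is the embedded question

print(nearest(query, toy_index, k=2))
```

The text whose vector lies closest to the query vector is returned first, which is the behaviour the notebook relies on when it retrieves chunks for a question.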

In [10]:
from langchain.vectorstores.faiss import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

In [None]:
# Let's embed our text chunks with OpenAIEmbeddings and index the resulting vectors with FAISS

In [13]:
search_index = FAISS.from_documents(source_chunks, OpenAIEmbeddings())

In [None]:
# Import the load_qa_with_sources_chain function from the LangChain library.

In [15]:
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

### Create a chain using load_qa_with_sources_chain tool

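The chain built by load_qa_with_sources_chain takes a question together with a list of documents and asks the model to answer from those documents, citing their sources. The documents themselves are retrieved beforehand with a similarity search on the FAISS vector store of markdown chunks, held in the variable `search_index`.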



In [16]:
chain = load_qa_with_sources_chain(OpenAI(temperature=0))

def print_answer(question):
    print(
        chain(
            {
                "input_documents": search_index.similarity_search(question, k=4),
                "question": question,
            },
            return_only_outputs=True,
        )["output_text"]
    )
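Conceptually, a question-answering chain of this kind places the retrieved chunks and their source links into a single prompt and sends it to the model. The sketch below shows that idea only. The function name `build_stuffed_prompt` and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuffed_prompt(question, documents):
    """Combine a question and retrieved documents into one prompt string.

    `documents` is a list of (page_content, source_url) pairs. The wording
    of the instructions is an assumption made for this illustration.
    """
    summaries = "\n\n".join(
        f"Content: {content}\nSource: {source}" for content, source in documents
    )
    return (
        "Answer the question using only the extracts below and list the "
        "sources you used after 'SOURCES:'.\n\n"
        f"QUESTION: {question}\n"
        "=========\n"
        f"{summaries}\n"
        "=========\n"
        "FINAL ANSWER:"
    )


# Placeholder documents and URLs, invented for the example.
example_docs = [
    ("ZenML pipelines are built from steps.", "https://example.com/docs/a.md"),
    ("Stacks configure where pipelines run.", "https://example.com/docs/b.md"),
]
print(build_stuffed_prompt("What is a ZenML pipeline?", example_docs))
```

Because every retrieved chunk is pasted into one prompt, the number of chunks (`k=4` above) and the chunk size together determine how much of the model's input budget is used.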

Alright! Time to ask some questions to the bot.

In [17]:
print_answer("what is zenml framework?")

 ZenML is an open-source machine learning framework that enables data scientists and engineers to quickly develop, manage, and deploy ML pipelines. It provides a unified platform for data scientists and engineers to develop, manage, and deploy ML pipelines.

SOURCES: https://github.com/zenml-io/zenml/blob/52a777f347282987fbaedf177fb6a2342ae9a7b7/tests/README.md, https://github.com/zenml-io/zenml/blob/52a777f347282987fbaedf177fb6a2342ae9a7b7/examples/README.md


In [18]:
print_answer("what are the components of a ZenML Deployment?")

 The components of a ZenML Deployment include a ZenML server running locally as a daemon process, a ZenML server running in a Docker container, a ZenML server and a MySQL server both running in Docker containers and managed by Docker Compose, and an external ZenML server running in the cloud.
SOURCES: https://github.com/zenml-io/zenml/blob/52a777f347282987fbaedf177fb6a2342ae9a7b7/tests/README.md


In [19]:
print_answer("How to construct the API docs locally?")

 To construct the API docs locally, clone the repository, install ZenML and all dependencies, and run `bash scripts/serve_api_docs.sh` from the repository root.
SOURCES: https://github.com/zenml-io/zenml/blob/52a777f347282987fbaedf177fb6a2342ae9a7b7/docs/README.md
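The answers above come back as a single string with the cited links after a `SOURCES:` marker. If the answer and the links are needed separately, a small helper along these lines could split them. The name `split_answer` is hypothetical, and the sketch assumes the output follows the `SOURCES:` layout seen in the cells above.

```python
def split_answer(output_text):
    """Split chain output into (answer, list_of_source_urls).

    Assumes sources follow a literal 'SOURCES:' marker and are separated
    by commas; if the marker is missing, the source list is empty.
    """
    answer, _, sources = output_text.partition("SOURCES:")
    urls = [s.strip() for s in sources.split(",") if s.strip()]
    return answer.strip(), urls


# Sample text in the same layout as the outputs above (placeholder URLs).
sample = (
    " To build the docs, run the script from the repository root.\n"
    "SOURCES: https://example.com/docs/README.md, https://example.com/tests/README.md"
)
answer, urls = split_answer(sample)
print(answer)
print(urls)
```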


Perfect! Our bot has correctly answered all of our questions and can also cite the sources of its information.
