<a href="https://colab.research.google.com/github/amrindersingh03/Unstructured-Machine-Learning-/blob/main/Langchain_bot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Our aim is to build a LangChain bot that retrieves the markdown files from Cohere's repository and uses them to answer questions, citing the source of each answer.

In [None]:
# Install LangChain and the other libraries we need

!pip install langchain==0.0.55 requests openai transformers faiss-cpu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain==0.0.55
  Downloading langchain-0.0.55-py3-none-any.whl (152 kB)
Collecting openai
  Downloading openai-0.27.2-py3-none-any.whl (70 kB)
Collecting transformers
  Downloading transformers-4.27.0-py3-none-any.whl (6.8 MB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp

In [None]:
# Let's import LangChain's OpenAI wrapper. We will use it as the language model behind our chatbot.

In [None]:
from langchain.llms import OpenAI

In [None]:
# To access the OpenAI API, supply your unique API key, provided by OpenAI, through an environment variable.

In [None]:
import os
from getpass import getpass

# Prompt for the key at run time rather than hard-coding a secret in the notebook.
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

## Let's fetch all markdown files of Cohere's repository

In [None]:
# Import the tools required to fetch Cohere's GitHub repository.

import pathlib
import subprocess
import tempfile
from langchain.docstore.document import Document
import requests

In [None]:
# Creating a function to get markdown files.

In [None]:
def get_github_docs(repo_owner, repo_name):
    with tempfile.TemporaryDirectory() as d:
        subprocess.check_call(
            f"git clone --depth 1 https://github.com/{repo_owner}/{repo_name}.git .",
            cwd=d,
            shell=True,
        )
        git_sha = (
            subprocess.check_output("git rev-parse HEAD", shell=True, cwd=d)
            .decode("utf-8")
            .strip()
        )
        repo_path = pathlib.Path(d)
        markdown_files = list(repo_path.glob("*/*.md")) + list(
            repo_path.glob("*/*.mdx")
        )
        for markdown_file in markdown_files:
            with open(markdown_file, "r") as f:
                relative_path = markdown_file.relative_to(repo_path)
                github_url = f"https://github.com/{repo_owner}/{repo_name}/blob/{git_sha}/{relative_path}"
                yield Document(page_content=f.read(), metadata={"source": github_url})

## The following actions are performed by this process:

It obtains the latest commit of the intended GitHub repository and saves it in a temporary directory.

It retrieves the git sha, which is used to generate links that the model can reference in its sources list.

It scans through the markdown files (.md or .mdx) that sit one directory below the repository root, which is what the `*/*.md` and `*/*.mdx` glob patterns match.

It generates a URL to the markdown file on GitHub, reads the file from the local storage, and produces a Document object.
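The glob and URL steps above can be tried offline, without cloning anything. The following is a minimal sketch, assuming a throwaway directory layout and a made-up commit SHA (`abc123`); `find_markdown` and `permalink` are hypothetical helper names that do not appear in the notebook. It shows which files the `*/*.md` and `*/*.mdx` patterns pick up and how the commit-pinned link is assembled.

```python
import pathlib
import tempfile


def find_markdown(repo_path):
    """Return .md/.mdx files exactly one directory below repo_path, sorted."""
    files = list(repo_path.glob("*/*.md")) + list(repo_path.glob("*/*.mdx"))
    return sorted(files)


def permalink(repo_owner, repo_name, git_sha, relative_path):
    """Build a commit-pinned GitHub URL for a file in the repository."""
    # as_posix() keeps forward slashes even on Windows.
    return (
        f"https://github.com/{repo_owner}/{repo_name}"
        f"/blob/{git_sha}/{pathlib.PurePath(relative_path).as_posix()}"
    )


with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    # Hypothetical layout: one file at the root, one a level down, one deeper.
    (root / "docs" / "guide").mkdir(parents=True)
    (root / "README.md").write_text("root")
    (root / "docs" / "README.md").write_text("docs")
    (root / "docs" / "guide" / "intro.mdx").write_text("deep")

    found = [p.relative_to(root).as_posix() for p in find_markdown(root)]
    urls = [permalink("cohere-ai", "tokenizers", "abc123", p) for p in found]

print(found)
print(urls)
```

Only `docs/README.md` is matched here: the root-level file has no parent directory inside the repository, and the `.mdx` file is nested too deeply. A recursive pattern such as `**/*.md` would widen the search, at the price of more chunks to embed.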

In [None]:
# We will store our fetched data in a variable named "sources"

In [None]:
sources = get_github_docs("cohere-ai", "tokenizers")

In [None]:
# Dividing lengthy text into smaller segments is crucial for handling it efficiently.

# To accomplish this, we can use the CharacterTextSplitter class, which splits the text on a separator and returns the segments as a list

In [None]:
#  Let's import CharacterTextSplitter and break down our documents into small chunks. 

from langchain.text_splitter import CharacterTextSplitter

In [None]:
source_chunks = []
splitter = CharacterTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=0)
for source in sources:
    for chunk in splitter.split_text(source.page_content):
        source_chunks.append(Document(page_content=chunk, metadata=source.metadata))
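To get a feel for what `separator` and `chunk_size` control, here is a simplified, standard-library-only sketch of separator-based chunking. It is not LangChain's implementation: `split_on_separator` is a hypothetical function written for illustration, and it ignores overlap entirely (which matches `chunk_overlap=0` above).

```python
def split_on_separator(text, separator=" ", chunk_size=1024):
    """Greedily pack separator-delimited pieces into chunks.

    Chunks stay within chunk_size characters, except that a single piece
    longer than chunk_size is kept whole rather than cut mid-word.
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Still fits (or the chunk is empty, so we must accept the piece).
            current = candidate
        else:
            # Adding the piece would overflow: close this chunk, start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "the quick brown fox jumps over the lazy dog"
chunks = split_on_separator(text, separator=" ", chunk_size=15)
print(chunks)
```

With a 15-character limit the sentence falls into three chunks, each ending on a word boundary. A larger `chunk_size` such as 1024 gives the model more context per chunk but means each retrieved chunk consumes more tokens in the prompt.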

Next we turn to FAISS (installed earlier as faiss-cpu), a similarity search library developed by Facebook AI. Its primary function is rapid search over embeddings to find those that are most similar to one another. In other words, it will find the chunks of our markdown files whose text most closely matches the text of our question.

To use FAISS, the input data must be in the form of embeddings, or vectors. Therefore, before feeding data to FAISS, the text must first be converted into embeddings.

Once the text has been converted into embeddings, FAISS can compare and search for similar embeddings to those found in the question.
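The idea of comparing embeddings can be illustrated without FAISS or the OpenAI API. The sketch below is a brute-force nearest-neighbour search over made-up three-dimensional vectors; the vectors, the texts, and the helper names `l2_distance` and `nearest` are all assumptions for illustration. FAISS performs the same kind of lookup far more efficiently, over vectors with many more dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, indexed, k=2):
    """Return the k texts whose vectors lie closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Made-up "embeddings": each text is paired with a hand-written vector.
indexed = [
    ("tokenizers split text into tokens", [0.9, 0.1, 0.0]),
    ("install the package with pip", [0.0, 0.9, 0.1]),
    ("byte-pair encoding merges frequent pairs", [0.8, 0.0, 0.2]),
]

# Pretend this is the embedding of the question "what is a tokenizer?".
query_vec = [1.0, 0.0, 0.0]
top = nearest(query_vec, indexed, k=2)
print(top)
```

The two tokenizer-related texts come back first because their vectors sit closest to the query vector, while the installation text is far away and is left out.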

To create embeddings, we will use a tool called OpenAIEmbeddings.

In [None]:
from langchain.vectorstores.faiss import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

In [None]:
# Let's embed our text chunks with OpenAIEmbeddings and index the resulting vectors with FAISS

In [None]:
search_index = FAISS.from_documents(source_chunks, OpenAIEmbeddings())

In [None]:
# Import the load_qa_with_sources_chain tool from the LangChain library.

In [None]:
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

### Create a chain using load_qa_with_sources_chain tool

load_qa_with_sources_chain builds a chain that takes the query together with the documents we look up in the vector database (the FAISS index of markdown chunks, stored in the variable 'search_index') and produces an answer along with its sources.



In [None]:
chain = load_qa_with_sources_chain(OpenAI(temperature=0))

def print_answer(question):
    print(
        chain(
            {
                "input_documents": search_index.similarity_search(question, k=4),
                "question": question,
            },
            return_only_outputs=True,
        )["output_text"]
    )
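Conceptually, the retrieved chunks and the question end up in a single prompt for the language model. The sketch below illustrates that idea only: the prompt wording is invented, `Doc` is a stand-in for LangChain's `Document`, `build_prompt` is a hypothetical helper, and the URLs are placeholders. The actual prompt LangChain sends is not shown in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a document chunk: text plus metadata with a 'source' URL."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_prompt(question, docs):
    """Place each retrieved chunk, labelled with its source, ahead of the question."""
    parts = [
        f"Content: {doc.page_content}\nSource: {doc.metadata['source']}"
        for doc in docs
    ]
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the extracts below "
        "and list the sources you used.\n\n"
        f"{context}\n\n"
        f"QUESTION: {question}\n"
        "FINAL ANSWER:"
    )


docs = [
    Doc("Tokenizers turn raw text into tokens.", {"source": "https://example.com/a.md"}),
    Doc("Encodings hold the token ids.", {"source": "https://example.com/b.md"}),
]
prompt = build_prompt("What is a tokenizer?", docs)
print(prompt)
```

Because every chunk travels with its source URL, the model can cite where an answer came from. This is also why `k=4` matters: each additional chunk adds tokens to every question asked.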

Alright! Time to ask the bot some questions.

In [None]:
print_answer("what is a software defined asset")

 A software defined asset is a piece of software that can be managed and configured using software.
SOURCES: https://github.com/dagster-io/dagster/blob/1985cdb22d26f4150b86bd7643c1fc9a21ea59e6/docs/README.md


In [None]:
sources = get_github_docs("cohere-ai", "tokenizers")

In [None]:
print_answer("What is cohere's tokenizers library")

 Cohere's tokenizers library is a core written in Rust that provides an implementation of today's most used tokenizers, with a focus on performance and versatility. It works as a pipeline, processing raw text as input and outputs an Encoding.
SOURCES: https://github.com/cohere-ai/tokenizers/blob/1dc19e0dd4ba73af938ff4dc80b4b1cd40edf4bf/tokenizers/README.md


Perfect! Our bot has answered our question and is also able to give us the source of its information.

The problem now is that our bot is a little slow and expensive. The reason is that every time we run it, the repository is cloned and every chunk is embedded again from scratch.

OpenAI charges per token processed, so we pay for the embeddings again on every rebuild.

The solution is to cache our data.
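The caching idea can be sketched with the standard library alone: build the expensive object once, pickle it to disk, and load it on later runs. In the sketch below, `load_or_build` and `expensive_build` are hypothetical names, and the dictionary stands in for the real search index.

```python
import os
import pickle
import tempfile


def load_or_build(path, build_fn):
    """Return the pickled object at path, building and saving it first if missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    value = build_fn()
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return value


build_calls = []


def expensive_build():
    # Stand-in for cloning, chunking, and embedding; records each invocation.
    build_calls.append(1)
    return {"chunks": ["a", "b", "c"]}


with tempfile.TemporaryDirectory() as d:
    cache_path = os.path.join(d, "index.pickle")
    first = load_or_build(cache_path, expensive_build)   # builds and saves
    second = load_or_build(cache_path, expensive_build)  # loads from disk

print(len(build_calls), first == second)
```

The build function runs only once, and the second call reads the saved result instead. Unpickling can execute arbitrary code, so only load pickle files you created yourself.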

In [None]:
!pip install dagster dagit


In [None]:
from dagster import asset
import pickle

In [None]:
@asset
def source_docs():
    return list(get_github_docs("dagster-io", "dagster"))

In [None]:
x = source_docs()

In [None]:
@asset
def search_index(source_docs):
    source_chunks = []
    splitter = CharacterTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=0)
    for source in source_docs:
        for chunk in splitter.split_text(source.page_content):
            source_chunks.append(Document(page_content=chunk, metadata=source.metadata))

    with open("search_index.pickle", "wb") as f:
        pickle.dump(FAISS.from_documents(source_chunks, OpenAIEmbeddings()), f)

In [None]:
y = search_index(x)

In [None]:
def print_answer(question):
    with open("search_index.pickle", "rb") as f:
        search_index = pickle.load(f)
    print(
        chain(
            {
                "input_documents": search_index.similarity_search(question, k=4),
                "question": question,
            },
            return_only_outputs=True,
        )["output_text"]
    )

In [None]:
print_answer("What is cohere's tokenizers library")

 Cohere's tokenizers library is a library for tokenizing text.
SOURCES: https://github.com/dagster-io/dagster/blob/866577f925c09a7bdc4dadedf285dd41abb8f006/docs/README.md


In [None]:
print_answer("what is a software defined asset")

 A software defined asset is a piece of software that can be used to manage and control other software.
SOURCES: https://github.com/dagster-io/dagster/blob/866577f925c09a7bdc4dadedf285dd41abb8f006/docs/README.md
