<a href="https://colab.research.google.com/github/adharshrj/mlexplore/blob/main/langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1 : Installation

In [None]:
# Install LangChain and the Hugging Face Hub client
!pip install langchain
!pip install huggingface_hub

# Install the additional libraries needed for embeddings, vector search, and document parsing
!pip install sentence_transformers
!pip install faiss-cpu
!pip install unstructured

# Download the transcript of a YouTube video
!pip install youtube_transcript_api


# Step 2 : Environment Setup

In [None]:
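import os
import requests
from getpass import getpass

# Prompt for the token at run time rather than hard-coding a secret in the notebook
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("Hugging Face API token: ")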
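
A missing or empty token usually only shows up later, as an authentication error on the first model call. The helper below is a minimal sketch of a fail-fast check; `require_env` is a hypothetical name introduced here for illustration and is not part of LangChain or `huggingface_hub`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling `require_env("HUGGINGFACEHUB_API_TOKEN")` right after this cell would surface a missing token before any model is loaded.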

# Step 3 : Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


# Step 4 : Load test files with document loaders


### 4.1 Load Text files using TextLoader

In [None]:
import requests
from langchain.document_loaders import TextLoader

#### 4.1.1 Load text from a file on the mounted Google Drive

In [None]:
def loadTXTFileFromLocal(local_file_name="/content/drive/MyDrive/LLM_Testing_Docs/local_text_file.txt"):
    # Load the text document using TextLoader, decoding it as UTF-8
    loader = TextLoader(local_file_name, encoding='utf-8')
    loaded_docs = loader.load()
    return loaded_docs

#### 4.1.2 Load text from URL endpoint

In [None]:
def loadTXTFileFromURL(text_file_url="https://raw.githubusercontent.com/adharshrj/currency-in-wordsinr/main/README.md"):
    # Fetch the text file and fail early on an HTTP error
    output_file_name = "url_text_file.txt"
    response = requests.get(text_file_url)
    response.raise_for_status()
    with open(output_file_name, "w", encoding='utf-8') as file:
        file.write(response.text)

    # Load the text document using TextLoader
    loader = TextLoader('./' + output_file_name, encoding='utf-8')
    loaded_docs = loader.load()
    return loaded_docs

### 4.2 Load PDF from Mounted Google Drive using UnstructuredPDFLoader

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader

In [None]:
def loadPDFFromLocal(pdf_file_path="/content/drive/MyDrive/LLM_Testing_Docs/Eurovision_Song_Contest_2023.pdf"):
    loader = UnstructuredPDFLoader(pdf_file_path)
    loaded_docs = loader.load()
    return loaded_docs

### 4.3 Load Data from a Website using UnstructuredURLLoader

In [None]:
from langchain.document_loaders import UnstructuredURLLoader

In [None]:
def loadTextFromWebsite(url="https://saturncloud.io/blog/breaking-the-data-barrier-how-zero-shot-one-shot-and-few-shot-learning-are-transforming-machine-learning/"):
    loader = UnstructuredURLLoader(urls=[url])
    loaded_docs = loader.load()
    return loaded_docs

### 4.4 Load Video Transcripts (YouTube video) using YouTubeTranscriptApi

In [None]:
from youtube_transcript_api import YouTubeTranscriptApi

In [None]:
def loadTextFromYoutubeVideo(youtube_video_id="eg9qDjws_bU"):
    # Each transcript entry is a dict whose 'text' field holds one caption fragment
    transcript = YouTubeTranscriptApi.get_transcript(youtube_video_id)

    # Join the caption fragments into a single string
    transcript_text = ' '.join(entry['text'] for entry in transcript)

    youtube_local_txt_file = "youtube_transcript.txt"
    with open('./' + youtube_local_txt_file, "w", encoding='utf-8') as file:
        file.write(transcript_text)

    # Load the text document using TextLoader
    loader = TextLoader('./' + youtube_local_txt_file, encoding='utf-8')
    loaded_docs = loader.load()
    return loaded_docs
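
To make the flattening step concrete without calling the YouTube API, here is a small sketch that runs on hand-written entries. The sample data is made up for illustration, and `transcript_to_text` is a hypothetical helper name; only the `text` key is relied on, matching how the function above reads each entry.

```python
# Made-up transcript entries, shaped like the dicts the loader above iterates over
sample_transcript = [
    {"text": "hello and welcome", "start": 0.0, "duration": 1.5},
    {"text": "to the show", "start": 1.5, "duration": 1.2},
]


def transcript_to_text(entries):
    """Join the caption fragments into one space-separated string."""
    return " ".join(entry["text"].strip() for entry in entries)


flat_text = transcript_to_text(sample_transcript)
print(flat_text)
```

The timing fields are dropped here because the question-answering steps that follow only need the spoken text.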

# Step 5 : Split the documents into chunks (**Important** : LLMs can only accept inputs up to a fixed context length)

In [None]:
from langchain.text_splitter import CharacterTextSplitter

In [None]:
def splitDocument(loaded_docs):
    # Splitting documents into chunks
    splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=10)
    chunked_docs = splitter.split_documents(loaded_docs)
    return chunked_docs
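
To show what `chunk_size` and `chunk_overlap` mean, the sketch below cuts a string into fixed-size character windows that share a few characters with their neighbours. It is a simplified illustration, not how `CharacterTextSplitter` works internally (the real splitter also honours a separator); `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 10) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Simplified illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

The overlap means a sentence cut at a chunk boundary still appears partly in both neighbours, which helps retrieval when the answer straddles the cut.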

# Step 6 : Convert the documents into embeddings and store them

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

In [None]:
def createEmbeddings(chunked_docs):
    # Create embeddings and store them in a FAISS vector store
    embedder = HuggingFaceEmbeddings()
    vector_store = FAISS.from_documents(chunked_docs, embedder)
    return vector_store
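
Conceptually, a vector store keeps one vector per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The toy sketch below illustrates that lookup in plain Python. The three-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions), Euclidean distance is an assumption chosen for simplicity, and `l2_distance` and `nearest` are hypothetical names. FAISS does this far more efficiently.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, index, k=2):
    """Return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(index, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Invented toy "embeddings": (chunk text, vector)
toy_index = [
    ("cats purr", [0.9, 0.1, 0.0]),
    ("dogs bark", [0.8, 0.2, 0.1]),
    ("stocks fell", [0.0, 0.1, 0.9]),
]

result = nearest([1.0, 0.0, 0.0], toy_index, k=2)
print(result)  # ['cats purr', 'dogs bark']
```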

# Step 7 : Use the embeddings to retrieve relevant chunks, pass them to the LLM, and answer questions

In [None]:
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub

In [None]:
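def loadLLMModel():
    # temperature=0 keeps answers as repeatable as possible; max_length caps the generated text
    llm = HuggingFaceHub(
        repo_id="declare-lab/flan-alpaca-large",
        model_kwargs={"temperature": 0, "max_length": 512},
    )
    # "stuff" places all retrieved documents into a single prompt
    chain = load_qa_chain(llm, chain_type="stuff")
    return chain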

def askQuestions(vector_store, chain, question):
    # Ask a question using the QA chain
    similar_docs = vector_store.similarity_search(question)
    response = chain.run(input_documents=similar_docs, question=question)
    return response
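
The `"stuff"` chain type places every retrieved chunk into one prompt alongside the question. The sketch below illustrates that idea in plain Python. The prompt wording is an assumption made up for this example, not LangChain's actual template, and `build_stuff_prompt` is a hypothetical name.

```python
def build_stuff_prompt(doc_texts, question):
    """Concatenate all retrieved chunks and the question into one prompt string.

    The wording of the prompt is invented for illustration.
    """
    context = "\n\n".join(doc_texts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["Paris is the capital of France.", "France is in Europe."],
    "What is the capital of France?",
)
print(prompt)
```

Because everything goes into one prompt, the combined chunks must still fit within the model's input limit, which is why the documents are split into chunks first.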

In [None]:
chain = loadLLMModel()

# Step 8 : Testing

### 8.1 Test with local file

In [None]:
LOCAL_loaded_docs = loadTXTFileFromLocal()
LOCAL_chunked_docs = splitDocument(LOCAL_loaded_docs)
LOCAL_vector_store = createEmbeddings(LOCAL_chunked_docs)

In [None]:
LOCAL_response = askQuestions(LOCAL_vector_store, chain, "Explain me how ChatGPT and Plugin are empowering Citizen Data Scientists?")
print(LOCAL_response)

ChatGPT and plugins are helping Citizen Data Scientists by providing them with the tools they need to analyze and interpret data. By enabling them to use natural language, they are able to ask questions and get answers in plain English, without knowing complex programming languages or statistical techniques. Additionally, ChatGPT is a personal expert who is always available to help them turn their idea into reality.


### 8.2 Test with file from URL

In [None]:
URL_loaded_docs = loadTXTFileFromURL()
URL_chunked_docs = splitDocument(URL_loaded_docs)
URL_vector_store = createEmbeddings(URL_chunked_docs)

In [None]:
URL_response = askQuestions(URL_vector_store, chain, "How do I import?")
print(URL_response)

Import the library using the import or require approach.


### 8.3 Test with PDF from local path

In [None]:
PDF_loaded_docs = loadPDFFromLocal()
PDF_chunked_docs = splitDocument(PDF_loaded_docs)
PDF_vector_store = createEmbeddings(PDF_chunked_docs)



In [None]:
PDF_response = askQuestions(PDF_vector_store, chain, "When did semi-final happen?")
print(PDF_response)

The second semi-final took place on 11 May 2023 at 20:00 BST (21:00 CEST).


### 8.4 Test with Website

In [None]:
WEBSITE_loaded_docs = loadTextFromWebsite()
WEBSITE_chunked_docs = splitDocument(WEBSITE_loaded_docs)
WEBSITE_vector_store = createEmbeddings(WEBSITE_chunked_docs)

In [None]:
WEBSITE_response = askQuestions(WEBSITE_vector_store, chain, "Summarize this in 1 page for someone aged 18 who is a high school senior")
print(WEBSITE_response)

This article is about the basics of machine learning. It explains how it works, how it works, and how it can be used to make predictions. It also explains how it works and how it can be used to make decisions. It also explains how it works and how it can be used to make predictions. Finally, it explains how it can be used to make predictions and how it can be used to make decisions.


### 8.5 Test with text from video

In [None]:
VIDEO_loaded_docs = loadTextFromYoutubeVideo()
VIDEO_chunked_docs = splitDocument(VIDEO_loaded_docs)
VIDEO_vector_store = createEmbeddings(VIDEO_chunked_docs)

In [None]:
VIDEO_response = askQuestions(VIDEO_vector_store, chain, "Summarize the video?")
print(VIDEO_response)

The speaker believes that AI is becoming more advanced and dangerous than nuclear warheads. They believe that AI could be used to make incredibly effective propaganda and could be used to manipulate people. They believe that one day humans will be able to serve the machines and not the other way around.
