## This notebook demonstrates commonly used Loaders and Splitters

#### In LangChain, a Document is a simple structure with two fields:
- `page_content (string)`: This field contains the raw text of the document.
- `metadata (dictionary)`: This field stores additional metadata about the text, such as the source URL, author, or any other relevant information.
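
As a rough illustration of that two-field shape, here is a minimal stand-in built with a plain dataclass. `SimpleDocument` is a hypothetical name used only for this sketch; it is not LangChain's own `Document` class, and the sample text and source path are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two fields described above."""
    page_content: str                             # raw text of the document
    metadata: dict = field(default_factory=dict)  # e.g. source URL, author


doc = SimpleDocument(
    page_content="Lorem ipsum is placeholder text.",
    metadata={"source": "loaders-samples/sample.txt"},
)
print(doc.page_content)
print(doc.metadata["source"])
```

Every loader in this notebook returns a list of objects with this shape, which is why the later cells index into the result with `[0]`.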

In [21]:
from langchain_community.document_loaders import TextLoader
 
# Load text data from a file using TextLoader
loader = TextLoader("loaders-samples/sample.txt")
document = loader.load()
print(document)

[Document(metadata={'source': 'loaders-samples/sample.txt'}, page_content='The Lorem ipum filling text is used by graphic designers, programmers and printers with the aim of occupying the spaces of a website, an advertising product or an editorial production whose final text is not yet ready.\n\nThis expedient serves to get an idea of the finished product that will soon be printed or disseminated via digital channels.\n\nIn order to have a result that is more in keeping with the final result, the graphic designers, designers or typographers report the Lorem ipsum text in respect of two fundamental aspects, namely readability and editorial requirements.\n\nThe choice of font and font size with which Lorem ipsum is reproduced answers to specific needs that go beyond the simple and simple filling of spaces dedicated to accepting real texts and allowing to have hands an advertising/publishing product, both web and paper, true to reality.\n\nIts nonsense allows the eye to focus only on the 

In [None]:
document[0].page_content

In [None]:
document[0].metadata

### Types of Document Loaders in LangChain

#### LangChain offers three main types of Document Loaders:

- `Transform Loaders`: These loaders handle different input formats and transform them into the Document format. For instance, consider a CSV file named "data.csv" with columns for "name" and "age". Using the CSVLoader, you can load the CSV data into Documents.
- `Public Dataset or Service Loaders`: LangChain provides loaders for popular public sources, allowing quick retrieval and creation of Documents. For example, the WikipediaLoader can load content from Wikipedia.
- `Proprietary Dataset or Service Loaders`: These loaders are designed to handle proprietary sources that may require additional authentication or setup. For instance, a loader could be created specifically for loading data from an internal database or an API with proprietary access.
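
To make the "data.csv" example concrete, the sketch below shows roughly what a row-per-document transform looks like, using only the standard library. The CSV contents, the `rows_to_documents` helper, and the plain-dict output are assumptions made for illustration; the exact formatting that `CSVLoader` produces may differ.

```python
import csv
import io

# Hypothetical contents of "data.csv" (columns: name, age)
csv_text = "name,age\nAlice,30\nBob,25\n"


def rows_to_documents(text, source):
    """Turn each CSV row into a dict with page_content and metadata."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        # One "column: value" line per field in the row
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"page_content": content,
                     "metadata": {"source": source, "row": i}})
    return docs


for doc in rows_to_documents(csv_text, "data.csv"):
    print(doc)
```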

### Transform Loader example

In [None]:
# CSVLoader

from langchain_community.document_loaders import CSVLoader
 
# Load data from a CSV file using CSVLoader
loader = CSVLoader("HR-Employee-Attrition.csv")
documents = loader.load()
 
# Access the content and metadata of each document
for document in documents:
    content = document.page_content
    metadata = document.metadata
 
    # Process the content and metadata
    print(content)
    print("------")

### PDFLoader
`PyPDFLoader` loads each page of the PDF as a separate document.

In [35]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("Software-Engineer-CV.pdf")
pages = loader.load()

In [None]:
# Number the pages from 1 while printing each page's text
for cnt, page in enumerate(pages, start=1):
    print("---- Document #", cnt)
    print(page.page_content.strip())


### WebBaseLoader
This section shows how to use `WebBaseLoader` to load all text from HTML webpages into a document format that can be used downstream.

In [None]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.ibm.com/us-en/")
data = loader.load()

In [None]:
data[0].page_content

In [None]:
# Trim surrounding whitespace and collapse double newlines into single ones
formatted_text = data[0].page_content.strip().replace("\n\n", "\n")

print(formatted_text)

In [None]:
# Use regular expressions for more thorough cleaning
import re

# Collapse runs of spaces and tabs, leaving newlines intact
cleaned_text = re.sub(r"[ \t]+", " ", formatted_text)
# Cap consecutive newlines at two (one blank line between paragraphs)
cleaned_text = re.sub(r"\n{3,}", "\n\n", cleaned_text)

print(cleaned_text)
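
To see the effect of this kind of whitespace cleanup without fetching a page, here is a small self-contained sketch. The `raw` string is a made-up stand-in for scraped text, and the three substitutions are one reasonable ordering, not the only one.

```python
import re

# Hypothetical scraped text with ragged spacing and runs of blank lines
raw = "  IBM   home \n\n\n\n  Products\t and   services  \n\nContact  "

text = raw.strip()
text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
text = re.sub(r" ?\n ?", "\n", text)    # drop spaces that hug a newline
text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line

print(repr(text))
```

Note that a pattern such as `\s+` also matches newlines, so replacing it with a single space flattens every paragraph break; matching `[ \t]+` instead keeps the line structure.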

### JSON Loader

In [None]:
#!pip install jq

In [13]:
from langchain_community.document_loaders import JSONLoader

import json
from pathlib import Path
from pprint import pprint

file_path = "loaders-samples/sample.json"
data = json.loads(Path(file_path).read_text())

In [None]:
pprint(data)

In [26]:
loader = JSONLoader(
    file_path="loaders-samples/sample.json",
    jq_schema=".employees[].email",  # jq expression: the email of every employee
    text_content=False,
)

data = loader.load()

In [None]:
data
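
For readers unfamiliar with jq, the sketch below shows what the expression `.employees[].email` selects, written as plain Python. The sample JSON is hypothetical; only the `employees` and `email` keys are taken from the expression above, and the real `sample.json` may contain other fields.

```python
import json

# Hypothetical sample shaped like the file the jq expression expects
sample = json.loads("""
{
  "employees": [
    {"name": "Ana", "email": "ana@example.com"},
    {"name": "Ben", "email": "ben@example.com"}
  ]
}
""")

# Plain-Python equivalent of the jq expression ".employees[].email"
emails = [employee["email"] for employee in sample["employees"]]
print(emails)
```

Each selected value becomes the content of one document, so a file with two employees would yield two documents.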

## Public Dataset or Service Loaders

### Wikipedia Loader

In [28]:
from langchain_community.document_loaders import WikipediaLoader
 
# Load content from Wikipedia using WikipediaLoader
loader = WikipediaLoader("Machine_learning")
document = loader.load()

In [None]:
document[0].page_content

In [None]:
document[0].metadata

### IMSDb Movie Script Loader

In [31]:
from langchain_community.document_loaders import IMSDbLoader

loader = IMSDbLoader("https://imsdb.com/scripts/BlacKkKlansman.html")

data = loader.load()

In [None]:
# Keep the first 5,000 characters and trim surrounding whitespace
formatted_text = data[0].page_content[:5000].strip()

# Print the formatted text
print(formatted_text)

### YouTubeLoader

In [None]:
!pip install youtube-transcript-api

In [None]:
from langchain_community.document_loaders import YoutubeLoader
from youtube_transcript_api._errors import NoTranscriptFound

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=8grWlJSRg_w", add_video_info=False
)
# Set the language to Arabic
loader.language = ['ar']
try:
    data = loader.load()
    print(data)
except NoTranscriptFound:
    print("No transcript available for this video in the specified language.")


In [None]:
!pip install youtube-transcript-api
!pip install pytube

In [None]:
from langchain_community.document_loaders import YoutubeLoader
from youtube_transcript_api._errors import NoTranscriptFound

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=8grWlJSRg_w", add_video_info=False
)
# Set the language to Arabic
loader.language = ['ar']
try:
    data = loader.load()
    print(data[0].metadata)
    print(data[0].page_content)
except NoTranscriptFound:
    print("No transcript available for this video in the specified language.")


In [None]:
# Keep the first 5,000 characters and trim surrounding whitespace
formatted_text = data[0].page_content[:5000].strip()

# Print the formatted text
print(formatted_text)

#### Add language and translation preferences
- `language` param: a list of language codes in descending priority; `en` by default.
- `translation` param: a translation preference; an available transcript can be translated into your preferred language.
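
"Descending priority" means the first code in the list that has a transcript wins. The sketch below illustrates that selection rule only; `pick_language` and the `available` set are hypothetical and are not part of `YoutubeLoader` or the transcript API.

```python
def pick_language(available, preferred=("en",)):
    """Return the first preferred code that is available, or None."""
    for code in preferred:
        if code in available:
            return code
    return None


# Hypothetical set of transcript languages offered for some video
available = {"id", "en"}
print(pick_language(available, ["ar", "id"]))
```

With the list `["ar", "id"]`, Arabic is tried first and Indonesian is used only if no Arabic transcript exists.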

In [None]:
from langchain_community.document_loaders import YoutubeLoader
from pytube.exceptions import PytubeError

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=8grWlJSRg_w",
    add_video_info=False,
    language=["ar", "id"],
    translation="en",
)

try:
    ytdata = loader.load()
    print(ytdata)
except KeyError:
    print("The 'videoDetails' key is missing, possibly due to restricted access or changes in YouTube's metadata format.")
except PytubeError as e:
    print(f"An error occurred with Pytube: {e}")


In [None]:
ytdata

In [None]:
# Keep the first 5,000 characters and trim surrounding whitespace
formatted_text = ytdata[0].page_content[:5000].strip()

# Print the formatted text
print(formatted_text)

In [None]:
!pip install pytube moviepy

In [None]:
import time
from pytube import YouTube
from moviepy.editor import AudioFileClip
from urllib.error import HTTPError

video_url = "https://www.youtube.com/watch?v=8grWlJSRg_w"

max_retries = 3
for attempt in range(max_retries):
    try:
        yt = YouTube(video_url)
        video_stream = yt.streams.filter(only_audio=True).first()
        video_path = video_stream.download(filename="video.mp4")
        
        # Convert to audio file (e.g., mp3)
        audio_path = "audio.mp3"
        audio_clip = AudioFileClip(video_path)
        audio_clip.write_audiofile(audio_path)
        audio_clip.close()

        print("Audio file saved as:", audio_path)
        break
    except HTTPError as e:
        print(f"HTTPError on attempt {attempt + 1}: {e}")
        if attempt < max_retries - 1:
            time.sleep(2)  # Wait 2 seconds before retrying
        else:
            print("Max retries reached. Unable to download the video.")
    except Exception as e:
        print("An error occurred:", e)
        break


LangChain can work with the contents of an MP3 file once the audio has been transcribed to text. A speech-to-text model such as OpenAI's Whisper or Google Speech-to-Text produces the transcript, which can then be used as input or context in LangChain.

In [None]:
!pip install yt-dlp

In [None]:
import yt_dlp

ydl_opts = {
    'format': 'bestaudio/best',
    'outtmpl': 'audio_01.mp3',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',
    }],
    'ffmpeg_location': '/usr/local/bin/ffmpeg'  # Replace with your ffmpeg path if needed
}

video_url = "https://www.youtube.com/watch?v=wDMdX-pbmxQ"
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([video_url])

Transcribe the Audio: Use a transcription tool to convert the MP3 file to text. Below is an example with Whisper, which has high accuracy and supports multiple languages.

Install Whisper (or another transcription tool):

In [None]:

!pip install openai-whisper

In [None]:
import whisper

model = whisper.load_model("base")  # or "large" for better accuracy
result = model.transcribe("audio.mp3")  # Use the path to your MP3 file
text = result['text']
print("Transcription:", text)


## Text Splitters

Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

At a high level, text splitters work as follows:

- Split the text up into small, semantically meaningful chunks (often sentences).
- Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
- Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which you can customize your text splitter:

- How the text is split
- How the chunk size is measured
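
The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: `merge_pieces` is a hypothetical helper, the input is assumed to be already split into sentences, and size is measured with `len`.

```python
def merge_pieces(pieces, chunk_size, chunk_overlap, sep=" "):
    """Greedily pack small pieces into chunks, carrying trailing pieces over."""
    chunks, current = [], []
    for piece in pieces:
        if current and len(sep.join(current + [piece])) > chunk_size:
            # The current chunk is full: emit it
            chunks.append(sep.join(current))
            # Keep only trailing pieces that fit the overlap budget
            # and still leave room for the incoming piece
            while current and (
                len(sep.join(current)) > chunk_overlap
                or len(sep.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(sep.join(current))
    return chunks


sentences = ["Cats purr.", "Dogs bark.", "Birds sing.", "Fish swim."]
for chunk in merge_pieces(sentences, chunk_size=25, chunk_overlap=12):
    print(chunk)
```

In this sketch the overlap is made of whole sentences, so each chunk repeats the last sentence of the previous one when it fits in the overlap budget.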

The code snippet below sets up a text splitter using LangChain's `CharacterTextSplitter` class, which breaks long text into smaller chunks according to the parameters it is given. Each parameter is described in turn.

#### Parameter Breakdown

- `separator="\n\n"`: The delimiter used to split the text. Here the text is split wherever there is a double newline (`\n\n`), which often marks the end of a paragraph. `CharacterTextSplitter` splits only on this separator, so text that contains no separator comes back as a single chunk.
- `chunk_size=200`: The target maximum size of each chunk, measured by `length_function` (characters here). Pieces produced by the separator are merged until adding another would exceed 200 characters. A single piece longer than 200 characters is not split further; it is kept as an oversized chunk and a warning is logged.
- `chunk_overlap=20`: The maximum overlap between consecutive chunks. Up to 20 characters from the end of one chunk are repeated at the start of the next. The overlap is built from whole pieces, so it appears only when the trailing pieces fit within that budget. Overlap helps maintain context across chunks, especially when they are passed to language models with limited context windows.
- `length_function=len`: The function used to measure the length of each chunk. Here it is Python's built-in `len()`, which counts characters.
- `is_separator_regex=False`: Whether the separator should be interpreted as a regular expression. Setting it to `False` means the separator is treated as a plain string, not as a regex pattern.
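
Because `length_function` is just a callable that maps a string to an integer, the unit of "size" can be changed. The sketch below compares `len` with a hypothetical `word_count` function; the name and the sample string are illustrative only.

```python
def word_count(text):
    """Hypothetical alternative to len(): measure size in words."""
    return len(text.split())


sample = "Split text into chunks"
print(len(sample))         # size in characters
print(word_count(sample))  # size in words
```

Passing a function like this as `length_function` would make `chunk_size` and `chunk_overlap` count words instead of characters.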


A 20-character overlap means that, when a long text is split into chunks, up to the last 20 characters of one chunk are repeated at the beginning of the next. This overlap helps preserve context between chunks, making it easier for the model to follow the flow of information across multiple parts.
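
To visualize what an overlap looks like, the sketch below uses a deliberately simplified fixed-width window. `sliding_chunks` is a hypothetical helper, not LangChain's algorithm: LangChain's splitters build the overlap from whole pieces, so their real output will differ from this.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    step = chunk_size - chunk_overlap  # assumes chunk_size > chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, len(text) - chunk_overlap, step)]


# Hypothetical 100-character input made of repeating digits
text = "".join(str(i % 10) for i in range(100))
chunks = sliding_chunks(text, chunk_size=50, chunk_overlap=20)

for previous, following in zip(chunks, chunks[1:]):
    # The tail of one chunk equals the head of the next
    print(previous[-20:] == following[:20])
```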

In [12]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

In [14]:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://www.ibm.com/")
data = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [15]:
chunks = text_splitter.split_text(data[0].page_content)
len(chunks)

32

In [16]:
for chunk in chunks:
    print(chunk)
    print('----')

IBM - United States

See, play and build with AI models for business
----
Meet our trusted, open IBM Granite™ models—optimized to scale your AI applications from water management to Fantasy Football
  


    

Meet Granite 3.0


Win at Fantasy Football
----
Latest news
----
IBM "AI in Action" Report Identifies Key Characteristics of Businesses That Consider Themselves Leaders in AI
----
IBM Selected as Official Fan Engagement and Data Analytics Partner for Scuderia Ferrari HP
----
Cognizant Launches Global FinOps Center of Excellence, New Solutions Built with IBM Technology to Tackle Enterprise Modernization Challenges

IBM BOARD APPROVES REGULAR QUARTERLY CASH DIVIDEND
----
IBM Brings Apptio Product Portfolio to the Microsoft Cloud to Help Organizations Make Informed Technology Planning Decisions
----
IBM Receives FedRAMP Authorization for its Envizi ESG Data Capture, Analysis and Reporting Solution

IBM RELEASES THIRD-QUARTER RESULTS
----
IBM Advances Secure AI, Quantum Safe Technolo

In [17]:
documents = text_splitter.create_documents([data[0].page_content])
len(documents)

32

In [18]:
for doc in documents:
    print(doc)
    print('----')

page_content='IBM - United States

See, play and build with AI models for business'
----
page_content='Meet our trusted, open IBM Granite™ models—optimized to scale your AI applications from water management to Fantasy Football
  


    

Meet Granite 3.0


Win at Fantasy Football'
----
page_content='Latest news'
----
page_content='IBM "AI in Action" Report Identifies Key Characteristics of Businesses That Consider Themselves Leaders in AI'
----
page_content='IBM Selected as Official Fan Engagement and Data Analytics Partner for Scuderia Ferrari HP'
----
page_content='Cognizant Launches Global FinOps Center of Excellence, New Solutions Built with IBM Technology to Tackle Enterprise Modernization Challenges

IBM BOARD APPROVES REGULAR QUARTERLY CASH DIVIDEND'
----
page_content='IBM Brings Apptio Product Portfolio to the Microsoft Cloud to Help Organizations Make Informed Technology Planning Decisions'
----
page_content='IBM Receives FedRAMP Authorization for its Envizi ESG Data Capture,

## RecursiveCharacterTextSplitter

This text splitter is the recommended one for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of keeping paragraphs (and then sentences, and then words) together as long as possible, since those tend to be the most strongly related pieces of text.
- How the text is split: by a list of characters.
- How the chunk size is measured: by number of characters.
- Like `CharacterTextSplitter`, it takes `chunk_size` and `chunk_overlap` parameters. Its `split_text` method recursively splits the text on successive separators until each piece is no longer than `chunk_size`.
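
The "try each separator in order" idea can be sketched in plain Python as follows. This is a simplified illustration, not the library's code: `recursive_split` is a hypothetical function, it ignores `chunk_overlap`, and it measures size with `len`.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the first separator; recurse with finer ones on oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    # An empty separator means "split into single characters"
    parts = list(text) if sep == "" else [p for p in text.split(sep) if p]
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate      # the part still fits in the current chunk
            continue
        if current:
            chunks.append(current)   # emit the chunk that is now full
            current = ""
        if len(part) <= chunk_size:
            current = part
        elif rest:
            # The part alone is too big: retry it with the finer separators
            chunks.extend(recursive_split(part, chunk_size, rest))
        else:
            chunks.append(part)
    if current:
        chunks.append(current)
    return chunks


sample = ("First paragraph is short.\n\n"
          "Second paragraph is a little bit longer than the limit.")
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

The short first paragraph survives intact, while the second one is too long for a single chunk and is broken at word boundaries instead.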

In [186]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

rectext_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

In [187]:
texts = rectext_splitter.create_documents([data[0].page_content])

In [None]:
for text in texts:
    print(text)
    print("-----")
