<a href="https://colab.research.google.com/github/dyumnaa/langchain/blob/main/Introduction_to_Langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Prerequisites

In [1]:
!pip install -qU langchain langchain_community  langchain-huggingface  langchain-chroma langchain-text-splitters crawl4ai


## Documents

`Document` is the LangChain class for storing a piece of text (`page_content`) together with its associated `metadata`.

In [4]:
from langchain_core.documents import Document

document = Document(
    page_content="Hello, world!",
    metadata={"source": "https://example.com"}
)

## Document Loaders

Document loaders read data from a source and return it in the standard LangChain `Document` format.

[Different Document Loaders](https://python.langchain.com/docs/integrations/document_loaders/)


In [3]:
from langchain_community.document_loaders import RecursiveUrlLoader

# max_depth caps how many levels of links the loader follows from the start URL.
loader = RecursiveUrlLoader(
    "https://en.wikipedia.org/wiki/Thangal_Kunju_Musaliar_College_of_Engineering",
    max_depth=2,
)
documents_ = loader.load()
print(documents_)

[Document(metadata={'source': 'https://en.wikipedia.org/wiki/Thangal_Kunju_Musaliar_College_of_Engineering', 'content_type': 'text/html; charset=UTF-8', 'title': 'Thangal Kunju Musaliar College of Engineering - Wikipedia', 'language': 'en'}, page_content='<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Thangal Kunju Musaliar College of Engineering - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language
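As the output above shows, the loader returns raw HTML by default. The block below is a minimal sketch of a function that reduces HTML to plain text using only the standard library. The names `_TextExtractor` and `html_to_text` are illustrative and not part of LangChain, and the choice to skip only `<script>` and `<style>` content is an assumption that may need tuning for real pages.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes while skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0          # > 0 while inside a skipped tag
        self._parts: list[str] = []   # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        return " ".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", parser.text()).strip()


sample = (
    "<html><head><title>T</title><script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>World &amp; more</p></body></html>"
)
print(html_to_text(sample))
```

`RecursiveUrlLoader` accepts an `extractor` callable that maps the raw page string to the text stored in `page_content`, so a function like this could be passed as `extractor=html_to_text`.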

## Text Splitters

Text splitters break large documents into smaller chunks. There are two main reasons to split:

- Handling non-uniform document lengths: Real-world document collections often contain texts of widely varying sizes. Splitting gives every downstream step chunks of a consistent size.
- Overcoming model limitations: Many embedding models and language models have a maximum input size. Splitting lets us process documents that would otherwise exceed these limits.

[Different HTML Splitters](https://python.langchain.com/docs/how_to/split_html/)
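To make the idea of chunking concrete, here is a minimal sketch of a fixed-size character splitter with overlap. It is not how `HTMLHeaderTextSplitter` works (that splitter cuts on header tags); the function name `split_text` and its parameters are illustrative assumptions.

```python
def split_text(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters so that a sentence
    cut at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text, so the tail
        # is not emitted again as a redundant shorter chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_text("0123456789", chunk_size=4, chunk_overlap=1))
# ['0123', '3456', '6789']
```

Splitting on characters ignores document structure, which is why structure-aware splitters such as the header-based one used in this section are usually a better fit for HTML.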

In [5]:
from langchain_text_splitters import HTMLHeaderTextSplitter

url = "https://en.wikipedia.org/wiki/Thangal_Kunju_Musaliar_College_of_Engineering"

headers_to_split_on = [
    ("h1", "Main Title"),   # Page title
    ("h2", "Section"),      # Major sections
    ("h3", "Subsection"),   # Subsections within major sections
    ("h4", "Sub-subsection") # Further nested sections
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)

print(html_header_splits)


[Document(metadata={}, page_content='Jump to content  \nMain menu  \nMain menu  \nmove to sidebar  \nhide  \nNavigation  \nMain page  \nContents  \nCurrent events  \nRandom article  \nAbout Wikipedia  \nContact us  \nContribute  \nHelp  \nLearn to edit  \nCommunity portal  \nRecent changes  \nUpload file  \nSpecial pages  \nSearch  \nSearch  \nAppearance  \nDonate  \nCreate account  \nLog in  \nPersonal tools  \nDonate  \nCreate account  \nLog in  \nPages for logged out editors  \nlearn more  \nContributions  \nTalk  \nCentralNotice'), Document(metadata={'Section': 'Contents'}, page_content='Contents'), Document(metadata={}, page_content='move to sidebar  \nhide  \n(Top)  \n1  \nHistory  \n2  \nCampus  \n3  \nOrganisation and administration  \nToggle Organisation and administration subsection  \n3.1  \nGovernance  \n3.2  \nDepartments  \n3.3  \nFacilities  \n4  \nAcademics  \nToggle Academics subsection  \n4.1  \nAccreditation and affiliation  \n4.2  \nAdmission  \n4.3  \nSports  \n5  

## Crawling Tools

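[Crawl4AI](https://github.com/unclecode/crawl4ai) is an open-source crawler that returns page content as Markdown. The cells below install its browser dependencies and wrap it in a LangChain document loader.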


In [6]:
!crawl4ai-setup


In [7]:
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from crawl4ai import AsyncWebCrawler, BrowserConfig
import asyncio

class Crawl4AILoader(BaseLoader):
    def __init__(self, start_url: str, browser_config: BrowserConfig = None):
        """
        Initialize the Crawl4AI document loader.

        Args:
            start_url (str): The URL to start crawling from.
            browser_config (BrowserConfig, optional): Optional browser configuration for crawl4ai.
        """
        self.start_url = start_url
        self.browser_config = browser_config
        self._crawler = AsyncWebCrawler(config=browser_config)

    async def aload(self) -> list[Document]:
        """
        Asynchronously load documents using crawl4ai.

        Returns:
            A list of Document objects containing the crawled content and metadata.
        """
        # Start the asynchronous crawler.
        await self._crawler.start()

        # Crawl the provided URL to fetch content.
        results = await self._crawler.arun(self.start_url)
        metadata = {"url": results.url}
        content = results.markdown

        # Close the crawler to free resources.
        await self._crawler.close()

        # Wrap the fetched content in a Document object.
        return [Document(page_content=content, metadata=metadata)]

# Asynchronous main function to test the loader.
async def main():
    loader = Crawl4AILoader("https://en.wikipedia.org/wiki/Thangal_Kunju_Musaliar_College_of_Engineering")
    docs = await loader.aload()
    for doc in docs:
        # Display the URL and a snippet of the content.
        print(f"URL: {doc.metadata['url']}")
        print(f"Content: {doc.page_content}")  # Prints the full Markdown content.

# A notebook already runs an event loop, so the coroutine can be awaited directly.
# In a plain Python script, call asyncio.run(main()) instead.
await main()


[INIT].... → Crawl4AI 0.5.0.post4
[FETCH]... ↓ https://en.wikipedia.org/wiki/Thangal_Kunju_Musali... | Status: True | Time: 1.23s
[SCRAPE].. ◆ https://en.wikipedia.org/wiki/Thangal_Kunju_Musali... | Time: 0.398s
[COMPLETE] ● https://en.wikipedia.org/wiki/Thangal_Kunju_Musali... | Status: True | Total: 1.65s
URL: https://en.wikipedia.org/wiki/Thangal_Kunju_Musaliar_College_of_Engineering
Content: [Jump to content](https://en.wikipedia.org/wiki/Thangal_Kunju_Musaliar_College_of_Engineering#bodyContent)
Main menu
Main menu
move to sidebar hide
Navigation 
  * [Main page](https://en.wikipedia.org/wiki/Main_Page "Visit the main page \[alt-shift-z\]")
  * [Contents](https://en.wikipedia.org/wiki/Wikipedia:Contents "Guides to browsing Wikipedia")
  * [Current events](https://en.wikipedia.org/wiki/Portal:Current_events "Articles related to current events")
  * [Random article](https://en.wikipedia.org/wiki/Special:Random "Visit a randomly selected article \[alt-shift-x\]")
  * [About Wikipedia

## Introduction to Vector Databases

* <b>Vector stores are specialized data stores that index and retrieve information based on vector representations (embeddings).</b>

* <b>A key advantage of embedding vectors is that they can be compared using simple mathematical operations:</b>

    - Cosine Similarity: Measures the cosine of the angle between two vectors, ignoring their magnitudes.
    - Euclidean Distance: Measures the straight-line distance between two points.
    - Dot Product: Measures how strongly two vectors point in the same direction, scaled by their magnitudes.
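The three measures can be sketched in plain Python as follows. The function names are illustrative; real vector stores use optimized implementations, but the arithmetic is the same.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two points."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / (norm_a * norm_b)


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length
print(cosine_similarity(a, b))   # 1.0
print(euclidean_distance(a, b))  # 5.0
print(dot_product(a, b))         # 50.0
```

The example shows why the choice of measure matters: the two vectors point in the same direction, so their cosine similarity is maximal even though they are five units apart.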



The key methods are:

- `add_documents`: Add a list of documents to the vector store.
- `delete`: Delete documents from the vector store by ID.
- `similarity_search`: Search for the documents most similar to a given query.
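To show how these three methods fit together, here is a toy in-memory store. Everything in it is hypothetical: `ToyVectorStore` is not a LangChain or Chroma class, documents are simplified to `(text, metadata)` tuples, and `toy_embed` (letter frequencies) is only a stand-in for a real embedding model.

```python
import math
import uuid
from collections import Counter
from string import ascii_lowercase


def toy_embed(text: str) -> list[float]:
    """Hypothetical embedding: one dimension per letter, valued by its count."""
    counts = Counter(ch for ch in text.lower() if ch in ascii_lowercase)
    return [float(counts[ch]) for ch in ascii_lowercase]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Illustrative only; mirrors the method names listed above."""

    def __init__(self, embed=toy_embed):
        self._embed = embed
        self._rows = {}  # id -> (vector, text, metadata)

    def add_documents(self, documents):
        """Embed and store (text, metadata) pairs; return their generated IDs."""
        ids = []
        for text, metadata in documents:
            doc_id = str(uuid.uuid4())
            self._rows[doc_id] = (self._embed(text), text, metadata)
            ids.append(doc_id)
        return ids

    def delete(self, ids):
        """Remove documents by ID; unknown IDs are ignored."""
        for doc_id in ids:
            self._rows.pop(doc_id, None)

    def similarity_search(self, query, k=4, filter=None):
        """Return the k stored documents closest to the query by cosine similarity."""
        query_vec = self._embed(query)
        scored = [
            (cosine(query_vec, vec), text, meta)
            for vec, text, meta in self._rows.values()
            if not filter or all(meta.get(key) == val for key, val in filter.items())
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return [(text, meta) for _, text, meta in scored[:k]]


store = ToyVectorStore()
ids = store.add_documents([
    ("pancakes and eggs", {"source": "tweet"}),
    ("cloudy weather tomorrow", {"source": "news"}),
])
print(store.similarity_search("pancake", k=1, filter={"source": "tweet"}))
```

Note that `delete` works on the IDs returned by `add_documents`, so those IDs need to be kept if documents are to be removed later.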


[documentation](https://python.langchain.com/docs/concepts/vectorstores/)

[list of supported vector databases](https://python.langchain.com/docs/integrations/vectorstores/)



In [8]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma


embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vector_store = Chroma(embedding_function=embeddings)


In [9]:
from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocalate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

documents = [document_1, document_2]
vector_store.add_documents(documents=documents)

['2cfc804e-65c8-48b0-aee4-b64e2f766104',
 '10824591-b392-4d3d-ba00-0f1222bd1472']

In [10]:
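# delete() removes documents by ID. The IDs "0" and "1" do not match the
# auto-generated IDs returned by add_documents above, so this call removes nothing.
# To actually delete a document, pass one of the returned IDs.
vector_store.delete(ids=["0", "1"])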



In [15]:
vector_store.similarity_search(
    "PaNcke",
    k=1,
    filter={"source": "tweet"},
)

[Document(id='2cfc804e-65c8-48b0-aee4-b64e2f766104', metadata={'source': 'tweet'}, page_content='I had chocalate chip pancakes and scrambled eggs for breakfast this morning.')]

[LangChain introduction](https://python.langchain.com/docs/introduction/)
