# [Retrieval](https://python.langchain.com/docs/modules/data_connection/)

## [Document loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/)
### [From URL](https://python.langchain.com/docs/integrations/document_loaders/url)

In [14]:
import json
import ssl
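# Workaround: disable HTTPS certificate verification for this session.
# This is insecure; prefer fixing the local certificate store where possible.
ssl._create_default_https_context = ssl._create_unverified_context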

In [15]:
# Read the JSON file of repository URLs, then parse it
with open('dev-launchers_repositories.json', 'r') as file:
    json_data = file.read()

urls = json.loads(json_data)
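
The loaders in the following cells receive `urls` directly, so the file is assumed to hold a flat JSON array of URL strings. A minimal sketch of a more defensive parse under that assumption follows. `load_urls` is a hypothetical helper, not part of LangChain, and the sample string only stands in for the file contents.

```python
import json
from urllib.parse import urlparse


def load_urls(raw_json):
    """Parse a JSON array and keep unique http(s) URL strings, in order."""
    parsed = json.loads(raw_json)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array of URLs")

    seen = set()
    urls = []
    for item in parsed:
        if not isinstance(item, str):
            continue  # skip numbers, nulls, nested objects
        parts = urlparse(item)
        is_web_url = parts.scheme in ("http", "https") and bool(parts.netloc)
        if is_web_url and item not in seen:
            seen.add(item)
            urls.append(item)
    return urls


# Sample input (an assumption standing in for the real file contents)
raw = (
    '["https://github.com/dev-launchers/discord-gateway", '
    '"not a url", 3, '
    '"https://github.com/dev-launchers/discord-gateway"]'
)
print(load_urls(raw))
```

Dropping malformed entries early keeps one bad line in the file from failing the whole load later on.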

In [16]:
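# One-time setup (uncomment to run).
# SeleniumURLLoader, used below, also needs the selenium package.
# !pip3 install unstructured
# !brew install libmagic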

In [17]:
# Classic approach: UnstructuredURLLoader fetches pages without executing JavaScript.
# Kept for reference; uncomment to use it instead of the Selenium loader below.
# from langchain_community.document_loaders import UnstructuredURLLoader
#
# loader = UnstructuredURLLoader(urls=urls)
# data = loader.load()

In [18]:
# SeleniumURLLoader renders each page in a browser, so JavaScript-generated content is included.
from langchain_community.document_loaders import SeleniumURLLoader

loader = SeleniumURLLoader(urls=urls)
data = loader.load()

In [19]:
data[0]

Document(page_content='dev-launchers\n\ndiscord-gateway\n\nPublic\n\nNotifications\n\nFork\n    0\n\nStar\n          1\n\n\n\nA gateway between Discord events and Discord bot api\n\nLicense\n\nMIT license\n\n1\n          star\n\n0\n          forks\n\nBranches\n\nTags\n\nActivity\n\nStar\n\nNotifications\n\nCode\n\nIssues\n          0\n\nPull requests\n          7\n\nActions\n\nProjects\n          0\n\nSecurity\n\nInsights\n\nCode\n\nIssues\n\nPull requests\n\nActions\n\nProjects\n\nSecurity\n\nInsights\n\ndev-launchers/discord-gateway\n\nThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.\n\nmaster\n\nBranches\n\nTags\n\nGo to file\n\nCode\n\nFolders and files\n\nName Name Last commit message Last commit date Latest commit History 105 Commits .github .github workload workload .flux.yaml .flux.yaml .gitignore .gitignore Dockerfile Dockerfile LICENSE LICENSE app.js app.js discord-gateway.service discord-gateway.service flux-pa
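
The loader output above keeps the page's layout whitespace (for example `Fork\n    0`) along with many blank lines. One possible cleanup step, sketched here with the standard library only, trims each line and drops the empty ones. `normalize_page_text` is a hypothetical name, and the sample is copied from the start of the output above.

```python
import re


def normalize_page_text(text):
    """Trim each line, collapse runs of spaces/tabs, and drop blank lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = (
    "Fork\n    0\n\nStar\n          1\n\n\n\n"
    "A gateway between Discord events and Discord bot api"
)
print(normalize_page_text(sample))
```

In the notebook this could be applied to `page_content` for each item in `data` before splitting. Whether it improves retrieval quality is not tested here.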

## [Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/)
### [HTMLHeaderTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/HTML_header_metadata)

In [22]:
from langchain.text_splitter import HTMLHeaderTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

url = "https://plato.stanford.edu/entries/goedel/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# for local file use html_splitter.split_text_from_file(<path_to_file>)
html_header_splits = html_splitter.split_text_from_url(url)

chunk_size = 2000
chunk_overlap = 0
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split the header-level sections further into chunks of at most chunk_size characters
splits = text_splitter.split_documents(html_header_splits)
splits[80:85]

[Document(page_content='How to cite this entry. Preview the PDF version of this entry at the Friends of the SEP Society. Look up topics and thinkers related to this entry at the Internet Philosophy Ontology Project (InPhO). Enhanced bibliography for this entry at PhilPapers, with links to its database.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': 'Academic Tools'}),
 Document(page_content='Other Internet Resources', metadata={'Header 1': 'Kurt Gödel'}),
 Document(page_content='Avigad, Jeremy, “Gödel and the metamathematical tradition”, manuscript in PDF available online. Koellner, Peter, “Truth in Mathematics:The question of Pluralism”, manuscript in PDF available online. The Bernays Project.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': 'Other Internet Resources'}),
 Document(page_content='Related Entries', metadata={'Header 1': 'Kurt Gödel'}),
 Document(page_content='Gödel, Kurt: incompleteness theorems | Hilbert, David: program in the foundations of mathematics | Husserl, E

In [23]:
from langchain.text_splitter import HTMLHeaderTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

def process_url(url):
    """Fetch one URL, split it by HTML headers, then into chunks of at most 2000 characters."""
    headers_to_split_on = [
        ("h1", "Header 1"),
        ("h2", "Header 2"),
        ("h3", "Header 3"),
        ("h4", "Header 4"),
    ]

    html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

    # for local file use html_splitter.split_text_from_file(<path_to_file>)
    html_header_splits = html_splitter.split_text_from_url(url)

    chunk_size = 2000
    chunk_overlap = 0
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )

    # Split the header-level sections further into chunks of at most chunk_size characters
    splits = text_splitter.split_documents(html_header_splits)
    
    return splits

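# json and json_data come from the earlier cells
urls = json.loads(json_data)

# Process every URL and collect the chunks into one flat list
all_splits = []

for url in urls:
    splits_for_url = process_url(url)
    all_splits.extend(splits_for_url)

# Inspect a slice of the combined list (indices run across all URLs, not a single one)
print(all_splits[80:85])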

[Document(page_content='MIT license  \nActivity  \nCustom properties', metadata={'Header 1': 'dev-launchers/onboarding-bot-model', 'Header 2': 'About', 'Header 3': 'License'}), Document(page_content='1 star', metadata={'Header 1': 'dev-launchers/onboarding-bot-model', 'Header 2': 'About', 'Header 3': 'Stars'}), Document(page_content='1 watching', metadata={'Header 1': 'dev-launchers/onboarding-bot-model', 'Header 2': 'About', 'Header 3': 'Watchers'}), Document(page_content='0 forks  \nReport repository', metadata={'Header 1': 'dev-launchers/onboarding-bot-model', 'Header 2': 'About', 'Header 3': 'Forks'}), Document(page_content='Releases', metadata={'Header 1': 'dev-launchers/onboarding-bot-model'})]
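
Because `all_splits` is one flat list, an index range does not select a single page. In the output above, `Header 1` carries the repository name, so one way to recover per-repository groups is to bucket on that metadata key. The sketch below assumes that convention holds for every page. `Doc` is a hypothetical stand-in for LangChain's `Document` so the example runs without the library, `group_by_header` is an illustrative helper, and the sample list abbreviates documents from the notebook output.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain's Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_header(docs, key="Header 1"):
    """Bucket documents by a metadata value; docs lacking the key go under None."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.metadata.get(key)].append(doc)
    return dict(groups)


# Abbreviated sample data modeled on the output above
docs = [
    Doc("1 star", {"Header 1": "dev-launchers/onboarding-bot-model",
                   "Header 2": "About", "Header 3": "Stars"}),
    Doc("Releases", {"Header 1": "dev-launchers/onboarding-bot-model"}),
    Doc("Footer  \nSkip to content Toggle navigation"),  # no header metadata
]

by_repo = group_by_header(docs)
for name, items in by_repo.items():
    print(name, len(items))
```

Chunks that land under `None` are those with no header metadata, which in this data appear to be site navigation text. An alternative would be to record the source URL in each split's metadata inside `process_url`.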


In [24]:
all_splits

[Document(page_content='Footer  \nSkip to content Toggle navigation  \nSign in  \nProduct Solutions Open Source Pricing  \nAutomate any workflow  \nActions  \nHost and manage packages  \nPackages  \nFind and fix vulnerabilities  \nSecurity  \nInstant dev environments  \nCodespaces  \nWrite better code with AI  \nCopilot  \nManage code changes  \nCode review  \nPlan and track work  \nIssues  \nCollaborate outside of code  \nDiscussions  \nExplore  \nAll features Documentation GitHub Skills Blog  \nFor  \nEnterprise Teams Startups Education  \nBy Solution  \nCI/CD & Automation DevOps DevSecOps  \nResources  \nLearning Pathways White papers, Ebooks, Webinars Customer Stories Partners  \nFund open source developers  \nGitHub Sponsors  \nGitHub community articles  \nThe ReadME Project  \nRepositories  \nTopics Trending Collections  \nSign up  \nSearch or jump to...  \nSearch code, repositories, users, issues, pull requests...'),
 Document(page_content='Search  \nClear  \nSearch syntax tips'