<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Build Fast with AI](https://img.shields.io/badge/BuildFastWithAI-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://www.buildfastwithai.com/genai-course)
[![EduChain GitHub](https://img.shields.io/github/stars/satvik314/educhain?style=for-the-badge&logo=github&color=gold)](https://github.com/satvik314/educhain)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1mzjhMj6Za2EgFGjsXnuvw8LRMfQpsaS1?usp=sharing)

## Overview of Unstructured 📚

Unstructured is an open-source library for preprocessing documents for LLM applications. 🚀 It extracts, cleans, and structures text from many file types so that the result can be passed directly to a language model.

**Key Features:** ✨

- **Multi-format Support:** Handles PDFs, Word documents, HTML, and more. 📄
- **Text Extraction:** Extracts text from documents while preserving document structure. 📝
- **Data Cleaning:** Cleans and normalizes text for better LLM performance. 🧹
- **Element Chunking:** Breaks down text into meaningful chunks. 🧩
- **Easy Integration:** Integrates seamlessly with LangChain and other LLM frameworks. 🤝
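
To make the cleaning and chunking ideas above concrete, here is a minimal, library-free sketch. It is not how Unstructured implements chunking (the library ships its own chunking functions, such as `chunk_by_title`); the function name `chunk_texts` and the `max_chars` parameter are assumptions made for illustration only.

```python
def chunk_texts(texts, max_chars=500):
    """Greedily pack consecutive text pieces into chunks of about max_chars.

    A single piece longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for text in texts:
        # Simple cleaning step: collapse runs of whitespace and line breaks
        text = " ".join(text.split())
        if not text:
            continue  # skip empty pieces
        candidate = f"{current} {text}".strip()
        if current and len(candidate) > max_chars:
            # Adding this piece would overflow the chunk, so start a new one
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


pieces = ["Intro   paragraph.", "", "Second paragraph\nwith a line break.", "Third."]
print(chunk_texts(pieces, max_chars=40))
```

Smaller `max_chars` values give more, shorter chunks, which is usually the knob to tune against a model's context window.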

### 📦 **Dependency Installation**

In [None]:
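# Install Unstructured (with PDF support), the LangChain community integrations, ChromaDB, and tiktoken
# The quotes keep the shell from treating [pdf] as a glob pattern
!pip install "unstructured[pdf]" langchain_community chromadb tiktoken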

### Set Up API Keys


In [15]:
import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
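
The cell above only works inside Google Colab. If the notebook is run elsewhere, one possible approach is to read the key from the environment first and fall back to Colab's secret store only when it is available. The helper name `load_api_key` and its `env` parameter are hypothetical, introduced here for illustration.

```python
import os


def load_api_key(name, env=None):
    """Return secret `name` from the environment, else from Colab's secret store."""
    env = os.environ if env is None else env
    value = env.get(name)
    if value:
        return value
    try:
        from google.colab import userdata  # importable only inside Colab
    except ImportError:
        raise RuntimeError(f"{name} is not set; export it before running") from None
    return userdata.get(name)


# Demo with a stand-in environment so no real key is needed
print(load_api_key("DEMO_KEY", env={"DEMO_KEY": "demo-value"}))
```

In the notebook itself this would be used as `os.environ["OPENAI_API_KEY"] = load_api_key("OPENAI_API_KEY")`.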

### 📜 **Basic Text Extraction from a PDF**

In [None]:
from unstructured.partition.auto import partition
import requests

# Example PDF URL (replace with any publicly accessible PDF)
pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"  # Example: a research paper

# Download the PDF and fail early on an HTTP error
response = requests.get(pdf_url, timeout=30)
response.raise_for_status()
with open("example.pdf", "wb") as f:
    f.write(response.content)

# Partition the PDF into elements; partition() detects the file type automatically
elements = partition(filename="example.pdf")

# Print the extracted text
for element in elements:
    print(element.text)
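
Printing every element can produce a lot of output for a long PDF. A quick way to get an overview is to count elements per type and list the titles. The sketch below assumes each element exposes `category` and `text` attributes, as Unstructured's elements do at the time of writing. `FakeElement` is a hypothetical stand-in so that the example runs without the library, and `summarize_elements` is an illustrative helper, not part of Unstructured.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in exposing the two attributes used below."""
    category: str
    text: str


def summarize_elements(elements):
    """Count elements per category and collect the text of every title."""
    counts = Counter(el.category for el in elements)
    titles = [el.text for el in elements if el.category == "Title"]
    return counts, titles


sample = [
    FakeElement("Title", "Introduction"),
    FakeElement("NarrativeText", "First paragraph of body text."),
    FakeElement("NarrativeText", "Second paragraph of body text."),
    FakeElement("Title", "Results"),
]
counts, titles = summarize_elements(sample)
print(dict(counts))
print(titles)
```

With real data, pass the `elements` list returned by `partition` instead of `sample`.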


### 📄 **Extracting Text from a Local .txt File**

In [None]:
from unstructured.partition.text import partition_text

# Create a dummy text file
with open("dummy_text.txt", "w") as f:
    f.write("This is a sample text file.\n")
    f.write("It contains multiple lines of text.\n")
    f.write("Unstructured can process this easily.")

# Partition the text file
elements = partition_text(filename="dummy_text.txt")

# Print the extracted text
for element in elements:
    print(element.text)

This is a sample text file.
It contains multiple lines of text.
Unstructured can process this easily.


### 🌐 **Extracting Text from a Website URL**

In [None]:
from unstructured.partition.html import partition_html
import requests

# Fetch HTML content from a URL
url = "https://www.unstructured.io/"
response = requests.get(url)
html_content = response.text

# Partition the HTML
elements = partition_html(text=html_content)

# Print the extracted text
for element in elements:
    print(element.text)

## 🌐 Vector Database Ingestion

In this guide, we use Unstructured, ChromaDB, and LangChain to summarize topics from the front page of CNN Lite. With these three libraries, the whole workflow fits in fewer than two dozen lines of code.

### Gather Links with Unstructured

In [27]:
from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)
links = []

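for element in elements:
    if element.metadata.link_urls:
        # Drop the leading "/" so the path can be appended to the base URL
        relative_link = element.metadata.link_urls[0][1:]
        # Article paths start with the publication year; change "2025" to the current year
        if relative_link.startswith("2025"):
            links.append(f"{cnn_lite_url}{relative_link}")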
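
The filter above depends on a hard-coded year and can collect the same article twice. As a hedged alternative, the sketch below matches any four-digit year prefix, builds absolute URLs with `urljoin`, and removes duplicates while keeping order. The helper `article_links`, the `ARTICLE_PATH` pattern, and the sample paths are all assumptions for illustration, not real CNN Lite links.

```python
import re
from urllib.parse import urljoin

# Assumption: article paths begin with /YYYY/
ARTICLE_PATH = re.compile(r"^/\d{4}/")


def article_links(base_url, hrefs):
    """Return absolute, de-duplicated article URLs in first-seen order."""
    seen = {}
    for href in hrefs:
        if ARTICLE_PATH.match(href):
            # dict keys preserve insertion order and drop duplicates
            seen.setdefault(urljoin(base_url, href), None)
    return list(seen)


# Hypothetical hrefs, shaped like the ones found in element.metadata.link_urls
hrefs = [
    "/",
    "/2025/01/15/world/example-story",
    "/terms",
    "/2025/01/15/world/example-story",
    "/2024/12/31/us/another-story",
]
print(article_links("https://lite.cnn.com/", hrefs))
```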

### Ingest Individual Articles with UnstructuredURLLoader

In [28]:
from langchain_community.document_loaders import UnstructuredURLLoader

# Download and partition every article, returning one LangChain Document per URL
loader = UnstructuredURLLoader(urls=links, show_progress_bar=True)
docs = loader.load()

100%|██████████| 96/96 [00:29<00:00,  3.21it/s]


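### Load Documents into ChromaDB

In [29]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

# Embed each article with OpenAI and index the vectors in an in-memory Chroma collection
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings)
# Retrieve the single article most similar to the query
query_docs = vectorstore.similarity_search("Update on the coup in Niger.", k=1)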
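
Conceptually, a similarity search embeds the query and returns the stored documents whose vectors lie closest to it. The toy sketch below shows that idea with brute-force cosine similarity over three-dimensional vectors. It is only an illustration: real embeddings have far more dimensions, and Chroma uses its own index and distance settings rather than this code. The names `cosine`, `top_k`, and `store` are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=1):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy "vector store": (text, embedding) pairs with invented values
store = [
    ("Article about politics", [0.9, 0.1, 0.0]),
    ("Article about sports", [0.1, 0.9, 0.0]),
    ("Article about weather", [0.0, 0.2, 0.9]),
]
best = top_k([0.8, 0.2, 0.1], store, k=1)
print(best[0][1])
```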

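### Summarize the Documents

In [None]:
from langchain_community.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

# The "stuff" chain type places all retrieved documents into a single prompt
llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
chain = load_summarize_chain(llm, chain_type="stuff")
result = chain.invoke({"input_documents": query_docs})
print(result["output_text"])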
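
The `"stuff"` chain type sends all retrieved documents to the model in one prompt, which works well as long as the combined text fits in the context window. The sketch below shows the general shape of that step. The template text, the separator, and the helper name `build_stuff_prompt` are assumptions for illustration and may differ from what LangChain actually uses.

```python
# Hypothetical prompt template; LangChain's built-in wording may differ
TEMPLATE = "Write a concise summary of the following:\n\n{text}\n\nCONCISE SUMMARY:"


def build_stuff_prompt(texts, template=TEMPLATE, separator="\n\n"):
    """Join all document texts and place them into a single prompt."""
    return template.format(text=separator.join(texts))


prompt = build_stuff_prompt(["First article text.", "Second article text."])
print(prompt)
print(f"Prompt length: {len(prompt)} characters")
```

With `k=1` in the similarity search, only one article is stuffed into the prompt. For larger `k`, checking the prompt length before calling the model helps avoid context-window errors.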