### Module Overview

This module covers parsing and ingesting data for RAG systems, from basic text files to complex PDFs. It uses LangChain v0.3 and explores each technique with practical examples.

Table of Contents

```
- Introduction to Data Ingestion
- Text Files (.txt)
- Markdown Files (.md)
- PDF Documents
- Microsoft Word Documents
- CSV and Excel Files
- JSON and Structured Data
- Web Scraping
- Databases (SQL)
- Audio and Video Transcripts
- Advanced Techniques
- Best Practices
```


### Introduction To Data Ingestion


In [1]:
import os
from typing import List, Dict, Any
import pandas as pd


In [2]:
from langchain_core.documents import Document
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
)

print("Setup Completed")

Setup Completed


### Understanding Document Structure in LangChain


In [4]:
# Create a simple document
doc = Document(
    page_content="This is the main text content that will be embedded and searched",
    metadata={
        "source": "example.txt",
        "page": 1,
        "author": "Abhijeet",
        "date_created": "2025-01-01",
        "custom_field": "any_value",
    },
)
print("Document Structured")
print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}")

# Why metadata matters:
print("\n📝 Metadata is crucial for:")
print("- Filtering search results")
print("- Tracking document sources")
print("- Providing context in responses")
print("- Debugging and auditing")

Document Structured
Content: This is the main text content that will be embedded and searched
Metadata: {'source': 'example.txt', 'page': 1, 'author': 'Abhijeet', 'date_created': '2025-01-01', 'custom_field': 'any_value'}

📝 Metadata is crucial for:
- Filtering search results
- Tracking document sources
- Providing context in responses
- Debugging and auditing


In [5]:
type(doc)

langchain_core.documents.base.Document
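
The metadata dictionary is what makes filtering possible later on. The sketch below illustrates the idea with a small stand-in class, `SimpleDoc`, which is hypothetical and only mimics the `page_content` and `metadata` attributes shown above. `filter_by_metadata` is likewise an illustrative helper, not a LangChain API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[SimpleDoc], **criteria: Any) -> List[SimpleDoc]:
    """Keep documents whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    SimpleDoc("Intro to Python", {"source": "example.txt", "page": 1}),
    SimpleDoc("More Python", {"source": "example.txt", "page": 2}),
    SimpleDoc("ML basics", {"source": "other.txt", "page": 1}),
]

# Filter on one field, then on two fields at once
from_example = filter_by_metadata(docs, source="example.txt")
page_one = filter_by_metadata(docs, source="example.txt", page=1)
print([d.page_content for d in from_example])
print([d.page_content for d in page_one])
```

Because the helper only reads `.metadata`, it should work unchanged on a list of LangChain `Document` objects.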

### Text Files (.txt) - The Simplest Case {#2-text-files}


In [None]:
## Create a directory for the sample text files
import os

os.makedirs("data/text_files", exist_ok=True)

In [None]:
sample_texts={
    "data/text_files/python_intro.txt":"""Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.

Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support

Python is widely used in web development, data science, artificial intelligence, and automation.""",
    
    "data/text_files/machine_learning.txt": """Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems
    
    
    """

}

for filepath,content in sample_texts.items():
    with open(filepath, 'w', encoding="utf-8") as f:
        f.write(content)

print("Sample text files created")

Sample text files created


### TextLoader - Read Single File

In [6]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/text_files/python_intro.txt", encoding="utf-8")
print(loader)  # prints the loader object's repr (the memory address varies per run)
documents = loader.load()

print(type(documents))  # <class 'list'>
print(documents)  # a list with one Document: metadata={'source': ...}, page_content='Python...'

<langchain_community.document_loaders.text.TextLoader object at 0x000001EBAEA38B30>
<class 'list'>
[Document(metadata={'source': 'data/text_files/python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popular\nprogramming languages in the world.\n\nKey Features:\n- Easy to learn and use\n- Extensive standard library\n- Cross-platform compatibility\n- Strong community support\n\nPython is widely used in web development, data science, artificial intelligence, and automation.')]


In [7]:
print(f"Loaded {len(documents)} document")
print(f"Content preview: {documents[0].page_content[0:100]}...")
print(f"Metadata: {documents[0].metadata}")

Loaded 1 document
Content preview: Python Programming Introduction

Python is a high-level, interpreted programming language known for ...
Metadata: {'source': 'data/text_files/python_intro.txt'}


### DirectoryLoader - Multiple Text Files

In [8]:
from langchain_community.document_loaders import DirectoryLoader

## Load all the text files from the directory

dir_loader = DirectoryLoader(
    "./data/text_files",
    glob= "**/*.txt", ## Pattern to match files
    loader_cls=TextLoader, # loader class to use
    loader_kwargs={'encoding':'utf-8'},
    show_progress= True
)

documents = dir_loader.load()
print(f" Loaded {len(documents)} documents")

for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}:")
    print(f"  Source : {doc.metadata['source']}")
    print(f"  Length: {len(doc.page_content)} characters")

100%|██████████| 2/2 [00:00<00:00, 81.66it/s]

 Loaded 2 documents

Document 1:
  Source : data\text_files\machine_learning.txt
  Length: 575 characters

Document 2:
  Source : data\text_files\python_intro.txt
  Length: 489 characters





In [9]:
# 📊 Analysis
print("\n📊 DirectoryLoader Characteristics:")
print("✅ Advantages:")
print("  - Loads multiple files at once")
print("  - Supports glob patterns")
print("  - Progress tracking")
print("  - Recursive directory scanning")

print("\n❌ Disadvantages:")
print("  - All files must be same type")
print("  - Limited error handling per file")
print("  - Can be memory intensive for large directories")


📊 DirectoryLoader Characteristics:
✅ Advantages:
  - Loads multiple files at once
  - Supports glob patterns
  - Progress tracking
  - Recursive directory scanning

❌ Disadvantages:
  - All files must be same type
  - Limited error handling per file
  - Can be memory intensive for large directories
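
One way to work around the limited per-file error handling is to walk the directory yourself and catch failures file by file. The sketch below uses only the standard library. `load_text_files` is a hypothetical helper, and the plain dicts it returns are stand-ins for `Document` objects, not LangChain's own types.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def load_text_files(
    root: str, pattern: str = "**/*.txt", encoding: str = "utf-8"
) -> Tuple[List[Dict], List[Tuple[str, str]]]:
    """Read every matching file; collect failures instead of aborting."""
    loaded, failed = [], []
    for path in sorted(Path(root).glob(pattern)):
        try:
            text = path.read_text(encoding=encoding)
        except (OSError, UnicodeDecodeError) as exc:
            # Record which file failed and why, then move on
            failed.append((str(path), type(exc).__name__))
            continue
        loaded.append({"page_content": text, "metadata": {"source": str(path)}})
    return loaded, failed


# Demo in a throwaway directory: one readable file, one with invalid UTF-8 bytes
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.txt").write_text("hello", encoding="utf-8")
    Path(tmp, "bad.txt").write_bytes(b"\xff\xfe\xfa")  # not valid UTF-8
    loaded, failed = load_text_files(tmp)

print(f"Loaded {len(loaded)} file(s), skipped {len(failed)}")
for source, error in failed:
    print(f"  skipped {source}: {error}")
```

Processing one file at a time also keeps the failure list available for logging or a later retry, which a single bulk `load()` call does not give you directly.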


### Text Splitting Strategies


In [9]:
## Different text splitting strategies
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)

print(documents)

[Document(metadata={'source': 'data\\text_files\\machine_learning.txt'}, page_content='Machine Learning Basics\n\nMachine learning is a subset of artificial intelligence that enables systems to learn and improve\nfrom experience without being explicitly programmed. It focuses on developing computer programs\nthat can access data and use it to learn for themselves.\n\nTypes of Machine Learning:\n1. Supervised Learning: Learning with labeled data\n2. Unsupervised Learning: Finding patterns in unlabeled data\n3. Reinforcement Learning: Learning through rewards and penalties\n\nApplications include image recognition, speech processing, and recommendation systems\n\n\n    '), Document(metadata={'source': 'data\\text_files\\python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popular\nprog

In [34]:
## Method 1: Character Text Splitter (split on spaces)
text = documents[0].page_content

char_splitter = CharacterTextSplitter(
    separator=" ", # Split on spaces
    chunk_size = 200, # Max chunk size in characters
    chunk_overlap = 20, # Overlap between chunks
    length_function = len # How to measure chunk size
)

char_chunks = char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First chunk: {char_chunks[0][0:100]}...")
print("----------")
print(f" Chunk 1: {char_chunks[0]} \n")
print(f" Chunk 2: {char_chunks[1]} \n")
print(f" Chunk 3: {char_chunks[2]} \n")

Created 3 chunks
First chunk: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables system...
----------
 Chunk 1: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing 

 Chunk 2: on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: 

 Chunk 3: Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems 



In [33]:
## Method 1 (variant): Character Text Splitter with a newline separator
text = documents[0].page_content

char_splitter = CharacterTextSplitter(
    separator="\n", # Split on newline
    chunk_size = 200, # Max chunk size in characters
    chunk_overlap = 50, # Overlap between chunks
    length_function = len # How to measure chunk size
)

char_chunks = char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First chunk: {char_chunks[0][0:100]}...")
print("----------")
print(f" Chunk 1: {char_chunks[0]} \n")
print(f" Chunk 2: {char_chunks[1]} \n")
print(f" Chunk 3: {char_chunks[2]} \n")


Created 4 chunks
First chunk: Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables systems...
----------
 Chunk 1: Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables systems to learn and improve 

 Chunk 2: from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.
Types of Machine Learning: 

 Chunk 3: Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data 



In [18]:
print(f" Chunk 1: {char_chunks[0]} \n")
print(f" Chunk 2: {char_chunks[1]} \n")
print(f" Chunk 3: {char_chunks[2]} \n")
print(f" Chunk 4: {char_chunks[3]} \n")

 Chunk 1: Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables systems to learn and improve 

 Chunk 2: from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.
Types of Machine Learning: 

 Chunk 3: 1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties 

 Chunk 4: Applications include image recognition, speech processing, and recommendation systems 
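
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch of split-then-merge with overlap. It is not LangChain's implementation: `split_with_overlap` is a hypothetical helper that splits on one separator, packs pieces greedily up to `chunk_size`, and carries trailing pieces forward while they fit in `chunk_overlap`.

```python
from typing import List


def split_with_overlap(
    text: str, separator: str, chunk_size: int, chunk_overlap: int
) -> List[str]:
    """Greedy split-and-merge; a single oversized piece is emitted as-is."""
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    window: List[str] = []  # pieces that make up the chunk being built
    for piece in pieces:
        candidate = separator.join(window + [piece])
        if window and len(candidate) > chunk_size:
            # The next piece does not fit: emit the current chunk
            chunks.append(separator.join(window))
            # Drop leading pieces until the remainder fits the overlap budget
            # and still leaves room for the incoming piece
            while window and (
                len(separator.join(window)) > chunk_overlap
                or len(separator.join(window + [piece])) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(separator.join(window))
    return chunks


demo_chunks = split_with_overlap(
    "aaaa bbbb cccc dddd eeee", separator=" ", chunk_size=14, chunk_overlap=5
)
for i, chunk in enumerate(demo_chunks, start=1):
    print(f"Chunk {i}: {chunk!r}")
```

In the demo, the piece `cccc` appears at the end of the first chunk and at the start of the second, which is the overlap at work. Overlap is measured in whole pieces, so a piece longer than `chunk_overlap` is never carried over.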



In [20]:
text

'Machine Learning Basics\n\nMachine learning is a subset of artificial intelligence that enables systems to learn and improve\nfrom experience without being explicitly programmed. It focuses on developing computer programs\nthat can access data and use it to learn for themselves.\n\nTypes of Machine Learning:\n1. Supervised Learning: Learning with labeled data\n2. Unsupervised Learning: Finding patterns in unlabeled data\n3. Reinforcement Learning: Learning through rewards and penalties\n\nApplications include image recognition, speech processing, and recommendation systems\n\n\n    '

In [28]:
## Method 2: Recursive character splitting (RECOMMENDED)
print("\n 2. RECURSIVE CHARACTER TEXT SPLITTER")
recursive_splitter = RecursiveCharacterTextSplitter(
    # separators=["\n\n","\n", " ", ""], # Alternative: try these separators in order
    separators=[" "], # Here: split on spaces only
    chunk_size = 200, # Max chunk size in characters
    chunk_overlap = 20, # Overlap between chunks
    length_function = len # How to measure chunk size
)
recursive_chunks = recursive_splitter.split_text(text)
print(f"Created {len(recursive_chunks)} chunks")
print(f"First chunk: {recursive_chunks[0][0:100]}...")


 2. RECURSIVE CHARACTER TEXT SPLITTER
Created 4 chunks
First chunk: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables system...


In [29]:
print(f" Chunk 1: {recursive_chunks[0]} \n")
print(f" Chunk 2: {recursive_chunks[1]} \n")
print(f" Chunk 3: {recursive_chunks[2]} \n")
print(f" Chunk 4: {recursive_chunks[3]} \n")
# print(f" Chunk 5: {recursive_chunks[4]} \n")
# print(f" Chunk 6: {recursive_chunks[5]} \n")

 Chunk 1: Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing 

 Chunk 2: on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: 

 Chunk 3: Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation 

 Chunk 4: and recommendation systems 
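
The commented-out list `["\n\n", "\n", " ", ""]` hints at what "recursive" means: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. The sketch below shows that fallback idea alone. `recursive_split` is a hypothetical helper that deliberately omits the merging and overlap steps, so it is not a substitute for `RecursiveCharacterTextSplitter`.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator; recurse with the rest on long pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separator left: fall back to hard cuts every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, *rest = separators
    pieces: List[str] = []
    for part in text.split(head):
        pieces.extend(recursive_split(part, rest, chunk_size))
    return pieces


sample = "alpha beta\n\ngamma delta epsilon"
pieces = recursive_split(sample, ["\n\n", " "], chunk_size=12)
print(pieces)
```

The first paragraph fits within `chunk_size` and stays whole, while the second is too long and is broken further on spaces. A real splitter would then merge the small pieces back together up to the size limit.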



In [27]:
# Create text without natural break points
simple_text = "This is sentence one and it is quite long. This is sentence two and it is also quite long. This is sentence three which is even longer than the others. This is sentence four. This is sentence five. This is sentence six."

splitter = RecursiveCharacterTextSplitter(
    separators=[" "], # only split on spaces
    chunk_size = 80,
    chunk_overlap = 20,
    length_function = len 
)

chunks = splitter.split_text(simple_text)

print(f"\nSimple text example - {len(chunks)} chunks")

for i in range(len(chunks)-1):
    print(f"chunk {i+1}: {chunks[i]}")
    print(f"chunk {i+2}: {chunks[i+1]} \n")
    


Simple text example - 4 chunks
chunk 1: This is sentence one and it is quite long. This is sentence two and it is also
chunk 2: two and it is also quite long. This is sentence three which is even longer than 

chunk 2: two and it is also quite long. This is sentence three which is even longer than
chunk 3: is even longer than the others. This is sentence four. This is sentence five. 

chunk 3: is even longer than the others. This is sentence four. This is sentence five.
chunk 4: is sentence five. This is sentence six. 
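
To check the overlap rather than eyeball it, you can compute the text shared between the end of one chunk and the start of the next. `shared_overlap` below is a hypothetical helper written for this purpose; the two strings are copied from the output above.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


chunk_1 = "This is sentence one and it is quite long. This is sentence two and it is also"
chunk_2 = "two and it is also quite long. This is sentence three which is even longer than"

overlap = shared_overlap(chunk_1, chunk_2)
print(f"Shared text: {overlap!r} ({len(overlap)} characters)")
```

The shared text stays within the configured `chunk_overlap` of 20 characters because the splitter only carries over whole space-separated words.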



In [37]:
## Token-based splitting
print("\n TOKEN TEXT SPLITTER")
token_splitter = TokenTextSplitter(
    chunk_size = 50, # size in tokens (not characters)
    chunk_overlap = 10
)

token_chunks = token_splitter.split_text(text)
print(f"Created {len(token_chunks)} chunks")
print("----------")
print(f" Chunk 1: {token_chunks[0]} \n")
print(f" Chunk 2: {token_chunks[1]} \n")
print(f" Chunk 3: {token_chunks[2]} \n")


 TOKEN TEXT SPLITTER
Created 3 chunks
----------
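
Token-based splitting can be pictured as a fixed-size window sliding over the token sequence with a stride of `chunk_size - chunk_overlap`. The sketch below illustrates that windowing on a plain list. `sliding_token_windows` is a hypothetical helper, and the integer list is an assumed stand-in for real tokenizer output.

```python
from typing import List, Sequence


def sliding_token_windows(
    tokens: Sequence, chunk_size: int, chunk_overlap: int
) -> List[Sequence]:
    """Cut `tokens` into windows of chunk_size that share chunk_overlap items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the sequence
        start += step
    return windows


# Assumed example: 120 placeholder token ids, same settings as the cell above
fake_tokens = list(range(120))
windows = sliding_token_windows(fake_tokens, chunk_size=50, chunk_overlap=10)
print(f"{len(windows)} windows with lengths {[len(w) for w in windows]}")
```

With these settings each window starts 40 tokens after the previous one, so any sequence of 91 to 130 tokens yields three windows.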



In [38]:
# 📊 Comparison
print("\n📊 Text Splitting Methods Comparison:")
print("\nCharacterTextSplitter:")
print("  ✅ Simple and predictable")
print("  ✅ Good for structured text")
print("  ❌ May break mid-sentence")
print("  Use when: Text has clear delimiters")

print("\nRecursiveCharacterTextSplitter:")
print("  ✅ Respects text structure")
print("  ✅ Tries multiple separators")
print("  ✅ Best general-purpose splitter")
print("  ❌ Slightly more complex")
print("  Use when: Default choice for most texts")

print("\nTokenTextSplitter:")
print("  ✅ Respects model token limits")
print("  ✅ More accurate for embeddings")
print("  ❌ Slower than character-based")
print("  Use when: Working with token-limited models")


📊 Text Splitting Methods Comparison:

CharacterTextSplitter:
  ✅ Simple and predictable
  ✅ Good for structured text
  ❌ May break mid-sentence
  Use when: Text has clear delimiters

RecursiveCharacterTextSplitter:
  ✅ Respects text structure
  ✅ Tries multiple separators
  ✅ Best general-purpose splitter
  ❌ Slightly more complex
  Use when: Default choice for most texts

TokenTextSplitter:
  ✅ Respects model token limits
  ✅ More accurate for embeddings
  ❌ Slower than character-based
  Use when: Working with token-limited models
