### Data Ingestion Techniques

This notebook covers data ingestion: loading, processing, and storing data from a range of sources, from basic text files to complex PDFs and databases.

In [1]:
import os
import langchain

import pandas as pd
from typing import List, Dict, Any

In [2]:
from langchain_core.documents import Document
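# Recent LangChain releases ship the splitters in the langchain-text-splitters package;
# older code imports the same classes from langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter, TokenTextSplitter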

print("Importing done")

Importing done


### Creating a simple document

In [None]:
# 1. Create a simple document

doc = Document(
    page_content="This is a simple sample document that will be used for testing. So stay tuned and follow my instructions.",
    metadata={
        "source": "example.txt",
        "page": 1,
        "author": "Akarshan Kapoor",
        "date_created": "2025-08-15",  # ISO 8601 format (YYYY-MM-DD)
    }
)

print("My Document Structure:")
print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}")

My Document Structure:
Content: This is a simple sample document that will be used for testing. So stay tuned and follow my instructions.
Metadata: {'source': 'example.txt', 'page': 1, 'author': 'Akarshan Kapoor', 'date_created': '2025-08-15'}


### Creating a simple text file

In [None]:
# 2. Create the directory that will hold the sample text files

import os

os.makedirs("data/text_files", exist_ok=True)


In [6]:
sample_text = {
    "data/text_files/india_independence.txt": 
    """
India’s struggle for independence was long and challenging, spanning several decades of resistance against British colonial rule. The freedom movement included non-violent protests, civil disobedience, and mass movements led by iconic leaders such as Mahatma Gandhi, Jawaharlal Nehru, Sardar Vallabhbhai Patel, Bhagat Singh, Subhas Chandra Bose, and many others. The efforts of countless ordinary citizens—farmers, students, women, and workers—also played a crucial role.

On 15th August 1947, India finally became a sovereign nation, ending centuries of foreign domination. Jawaharlal Nehru, the first Prime Minister, hoisted the national flag at the Red Fort in Delhi, inspiring the country with his speech, “Tryst with Destiny.”

Today, Independence Day is celebrated across India with flag hoisting, parades, cultural programs, and patriotic events, honoring the sacrifices, unity, and resilience of those who fought for freedom. It is a day to remember the past, cherish democracy, and look forward to building a strong and progressive nation.
    """
}

# Write content to the file
for filepath, content in sample_text.items():
    with open(filepath, 'w', encoding="utf-8") as f:
        f.write(content)

print("Sample text file created.")


Sample text file created.


### Reading Single Text File (Text Loaders)

In [None]:
# Text Loaders
from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/text_files/india_independence.txt", encoding="utf-8")
document = loader.load()

print(type(document))
print(document)
print(f"Content:{document[0].page_content}")
print(f"Metadata:{document[0].metadata}")
print(f"Length of the document:{len(document)}")

<class 'list'>
[Document(metadata={'source': 'data/text_files/india_independence.txt'}, page_content='\nIndia’s struggle for independence was long and challenging, spanning several decades of resistance against British colonial rule. The freedom movement included non-violent protests, civil disobedience, and mass movements led by iconic leaders such as Mahatma Gandhi, Jawaharlal Nehru, Sardar Vallabhbhai Patel, Bhagat Singh, Subhas Chandra Bose, and many others. The efforts of countless ordinary citizens—farmers, students, women, and workers—also played a crucial role.\n\nOn 15th August 1947, India finally became a sovereign nation, ending centuries of foreign domination. Jawaharlal Nehru, the first Prime Minister, hoisted the national flag at the Red Fort in Delhi, inspiring the country with his speech, “Tryst with Destiny.”\n\nToday, Independence Day is celebrated across India with flag hoisting, parades, cultural programs, and patriotic events, honoring the sacrifices, unity, and re

### Reading Multiple Text Files from a Directory (DirectoryLoader)



In [None]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(
    "data/text_files",
    glob="**/*.txt",  # match .txt files in this directory and all subdirectories
    loader_cls=TextLoader,  # load each matched file with TextLoader
    loader_kwargs={"encoding": "utf-8"},  # keyword arguments passed to TextLoader
)
documents = loader.load()

print(type(documents))
print(f"Number of documents loaded: {len(documents)}")

<class 'list'>
Number of documents loaded: 1
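
For intuition about what the recursive pattern `**/*.txt` selects, here is a minimal standard-library sketch of a similar load step. It is an illustration only, not how DirectoryLoader is implemented: the helper name `load_text_files` is hypothetical, and plain dictionaries stand in for Document objects.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_text_files(root: str, pattern: str = "**/*.txt") -> List[Dict[str, Any]]:
    """Read every file under `root` that matches `pattern` into a document-like dict."""
    docs = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            docs.append({
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": str(path)},
            })
    return docs


# Usage: build a throwaway directory tree, then load only its .txt files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "a.txt").write_text("first file", encoding="utf-8")
    (Path(tmp) / "nested" / "b.txt").write_text("second file", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("ignored", encoding="utf-8")
    sample_docs = load_text_files(tmp)

print(f"Number of documents loaded: {len(sample_docs)}")
```

The `**` component matches zero or more directory levels, so files in the root folder and in nested folders are both picked up, while the `.md` file is skipped.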


### Text Splitting Methods

#### 🔹 1. Character-based Splitters

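Split text into chunks whose size is measured in characters. LangChain's CharacterTextSplitter first splits the text on a single separator and then merges the resulting pieces into chunks of at most `chunk_size` characters where possible.

Consecutive chunks can overlap so that context is preserved across chunk boundaries.

Example: splitting a 10,000-character document into 1,000-character chunks with a 200-character overlap.

👉 Library: CharacterTextSplitter (LangChain).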
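
To make the chunk-size and overlap arithmetic from the example concrete, here is a minimal pure-Python sketch of fixed-length chunking. It is not the CharacterTextSplitter implementation, which works from a separator; the helper name `fixed_size_chunks` is hypothetical.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# A 10,000-character document made of repeating digits.
sample = "".join(str(i % 10) for i in range(10_000))
chunks = fixed_size_chunks(sample, chunk_size=1_000, chunk_overlap=200)

print(f"Number of chunks: {len(chunks)}")
print(f"First chunk length: {len(chunks[0])}, last chunk length: {len(chunks[-1])}")
```

Each window advances by `chunk_size - chunk_overlap` characters (800 here), so the last 200 characters of one chunk are repeated at the start of the next.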

In [15]:
text = document[0].page_content
text

'\nIndia’s struggle for independence was long and challenging, spanning several decades of resistance against British colonial rule. The freedom movement included non-violent protests, civil disobedience, and mass movements led by iconic leaders such as Mahatma Gandhi, Jawaharlal Nehru, Sardar Vallabhbhai Patel, Bhagat Singh, Subhas Chandra Bose, and many others. The efforts of countless ordinary citizens—farmers, students, women, and workers—also played a crucial role.\n\nOn 15th August 1947, India finally became a sovereign nation, ending centuries of foreign domination. Jawaharlal Nehru, the first Prime Minister, hoisted the national flag at the Red Fort in Delhi, inspiring the country with his speech, “Tryst with Destiny.”\n\nToday, Independence Day is celebrated across India with flag hoisting, parades, cultural programs, and patriotic events, honoring the sacrifices, unity, and resilience of those who fought for freedom. It is a day to remember the past, cherish democracy, and lo

In [18]:
print("1. Character Text Splitter")
char_splitter = CharacterTextSplitter(
    separator="\n",       # split on single newlines
    chunk_size=500,       # target maximum chunk size, in characters
    chunk_overlap=20,     # characters shared between consecutive chunks
    length_function=len,  # measure chunk size with len(), i.e. in characters
)

char_chunks = char_splitter.split_text(text)

print(f"Number of chunks created: {len(char_chunks)}")
print(char_chunks[0])  # Print the first chunk
print(char_chunks[1])  # Print the second chunk
print(char_chunks[2])  # Print the third chunk

1. Character Text Splitter
Number of chunks created: 3
India’s struggle for independence was long and challenging, spanning several decades of resistance against British colonial rule. The freedom movement included non-violent protests, civil disobedience, and mass movements led by iconic leaders such as Mahatma Gandhi, Jawaharlal Nehru, Sardar Vallabhbhai Patel, Bhagat Singh, Subhas Chandra Bose, and many others. The efforts of countless ordinary citizens—farmers, students, women, and workers—also played a crucial role.
On 15th August 1947, India finally became a sovereign nation, ending centuries of foreign domination. Jawaharlal Nehru, the first Prime Minister, hoisted the national flag at the Red Fort in Delhi, inspiring the country with his speech, “Tryst with Destiny.”
Today, Independence Day is celebrated across India with flag hoisting, parades, cultural programs, and patriotic events, honoring the sacrifices, unity, and resilience of those who fought for freedom. It is a day t

#### 🔹 2. Recursive Splitters

Split text hierarchically, trying separators in priority order:

1. Paragraphs → 2. Lines → 3. Words → 4. Characters.

A finer separator is used only when a piece is still larger than the chunk size, so chunks break at the most natural boundary available.

Best for preserving readability and context.

👉 Library: RecursiveCharacterTextSplitter (LangChain).
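
The fallback from coarse to fine separators can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: it ignores chunk overlap, and the helper names `hard_cut` and `recursive_split` are hypothetical.

```python
from typing import List


def hard_cut(text: str, chunk_size: int) -> List[str]:
    """Last resort: cut at fixed character positions."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        return hard_cut(text, chunk_size)

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""  # pieces merged so far into the chunk being built
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits, keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry it with finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


demo_separators = ["\n\n", "\n", " ", ""]
demo_chunks = recursive_split("aaa bbb\n\nccc ddd eee", demo_separators, chunk_size=7)
print(demo_chunks)
```

In the demo, the first paragraph fits in one chunk as is, while the second paragraph is too long and is therefore split again at word boundaries.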

In [21]:
print("Recursive Character Text Splitter")
recursive_char_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=500,
    chunk_overlap=20,
    length_function=len
)

recursive_chunks = recursive_char_splitter.split_text(text)

print(f"Number of chunks created: {len(recursive_chunks)}")

# Print each pair of neighbouring chunks so any overlap between them is easy to spot
for i in range(len(recursive_chunks)-1):
    print(f"Chunk {i+1}:'{recursive_chunks[i]}'")
    print(f"Chunk {i+2}:'{recursive_chunks[i+1]}'")
    print()

Recursive Character Text Splitter
Number of chunks created: 3
Chunk 1:'India’s struggle for independence was long and challenging, spanning several decades of resistance against British colonial rule. The freedom movement included non-violent protests, civil disobedience, and mass movements led by iconic leaders such as Mahatma Gandhi, Jawaharlal Nehru, Sardar Vallabhbhai Patel, Bhagat Singh, Subhas Chandra Bose, and many others. The efforts of countless ordinary citizens—farmers, students, women, and workers—also played a crucial role.'
Chunk 2:'On 15th August 1947, India finally became a sovereign nation, ending centuries of foreign domination. Jawaharlal Nehru, the first Prime Minister, hoisted the national flag at the Red Fort in Delhi, inspiring the country with his speech, “Tryst with Destiny.”'

Chunk 2:'On 15th August 1947, India finally became a sovereign nation, ending centuries of foreign domination. Jawaharlal Nehru, the first Prime Minister, hoisted the national flag at th

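#### 🔹 3. Token-based Splitters

Split text by LLM tokens rather than characters.

This keeps chunk sizes aligned with the model's tokenization, which is the unit that context-window limits are measured in.

Tokens and characters are not the same thing: a short string such as "chatGPT" may be encoded as several tokens, and the exact count depends on the tokenizer.

👉 Library: TokenTextSplitter (LangChain), which tokenizes with OpenAI's tiktoken.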
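
The windowing idea behind token-based splitting can be sketched without any dependencies. The assumption here is loud: a whitespace word stands in for a token, whereas a real splitter would use a subword tokenizer such as tiktoken. The helper names `toy_tokenize` and `split_on_tokens` are hypothetical.

```python
from typing import List


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return text.split()


def split_on_tokens(text: str, tokens_per_chunk: int, chunk_overlap: int) -> List[str]:
    """Slide a window of `tokens_per_chunk` tokens, sharing `chunk_overlap` tokens."""
    if tokens_per_chunk <= 0 or not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("need tokens_per_chunk > 0 and 0 <= chunk_overlap < tokens_per_chunk")
    tokens = toy_tokenize(text)
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + tokens_per_chunk]
        chunks.append(" ".join(window))  # "decode" the window back to text
        if start + tokens_per_chunk >= len(tokens):
            break  # the window reached the last token
        start += step
    return chunks


sentence = "one two three four five six seven eight nine ten"
token_windows = split_on_tokens(sentence, tokens_per_chunk=4, chunk_overlap=1)
for window in token_windows:
    print(window)
```

Because the window is measured in tokens, chunk lengths in characters vary from chunk to chunk, which is the main practical difference from the character-based splitters.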

In [23]:
print("Token Text Splitters")
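token_splitter = TokenTextSplitter(
    chunk_size=200,    # maximum chunk size, in tokens
    chunk_overlap=20,  # tokens shared between consecutive chunks
    # Length is measured by the tokenizer, so no length_function is needed here
)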

token_chunks = token_splitter.split_text(text)

print(f"Number of chunks created: {len(token_chunks)}")

# Print each pair of neighbouring chunks so the overlap between them is visible
for i in range(len(token_chunks)-1):
    print(f"Chunk {i+1}:'{token_chunks[i]}'")
    print(f"Chunk {i+2}:'{token_chunks[i+1]}'")
    print()

Token Text Splitters
Number of chunks created: 2
Chunk 1:'
India’s struggle for independence was long and challenging, spanning several decades of resistance against British colonial rule. The freedom movement included non-violent protests, civil disobedience, and mass movements led by iconic leaders such as Mahatma Gandhi, Jawaharlal Nehru, Sardar Vallabhbhai Patel, Bhagat Singh, Subhas Chandra Bose, and many others. The efforts of countless ordinary citizens—farmers, students, women, and workers—also played a crucial role.

On 15th August 1947, India finally became a sovereign nation, ending centuries of foreign domination. Jawaharlal Nehru, the first Prime Minister, hoisted the national flag at the Red Fort in Delhi, inspiring the country with his speech, “Tryst with Destiny.”

Today, Independence Day is celebrated across India with flag hoisting, parades, cultural programs, and patriotic events, honoring the sacrifices, unity, and resilience of those who'
Chunk 2:'ades, cultural pr