# Introduction to Data Ingestion

In [1]:
import os
from typing import List, Dict, Any
import pandas as pd

In [2]:
from langchain_core.documents import Document
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
)

In [3]:
print("Set up Completed...")

Set up Completed...


In [4]:
## Understand document structure in langchain

In [5]:
## Create a simple document
doc = Document(
    page_content="This is the main text content that will be embedded and searched.",
    metadata={
        "source": "example.txt",
        "page": 1,
        "author": "Siddhesh m",
        "date_created": "2025-08-17",
        "custom_field": "any_value",
    }
)

In [6]:
print("Document Structure")
print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}")

# Metadata is useful for:
# - Filtering search results
# - Tracking document sources
# - Providing context in responses
# - Debugging and auditing

Document Structure
Content: This is the main text content that will be embedded and searched.
Metadata: {'source': 'example.txt', 'page': 1, 'author': 'Siddhesh m', 'date_created': '2025-08-17', 'custom_field': 'any_value'}

In [7]:
type(doc)

langchain_core.documents.base.Document
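
Since metadata is what makes filtering possible, here is a minimal sketch of filtering a list of documents by metadata values. The helper `filter_by_metadata` and the sample documents are hypothetical and not part of LangChain; the fallback `Document` dataclass is only a stand-in so the cell also runs where `langchain_core` is not installed.

```python
try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two attributes, used only if LangChain is absent.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, **criteria):
    """Return the documents whose metadata matches every key/value in criteria.

    Hypothetical helper for illustration; vector stores usually offer their
    own filter syntax.
    """
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


# Illustrative documents (made-up content and sources).
sample_docs = [
    Document(page_content="Intro to Python", metadata={"source": "python.txt", "page": 1}),
    Document(page_content="Python features", metadata={"source": "python.txt", "page": 2}),
    Document(page_content="Intro to ML", metadata={"source": "ml.txt", "page": 1}),
]

python_docs = filter_by_metadata(sample_docs, source="python.txt")
print([d.page_content for d in python_docs])
```

Passing several keyword arguments, for example `source="python.txt", page=2`, narrows the result to documents that match all of them.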

# Text File (.txt)

In [8]:
# Create the folder that will hold the sample text files
import os
os.makedirs("data/text_files",exist_ok=True)

In [9]:
sample_text={
    "data/text_files/pyhton_intro.txt":"""1-data_ingestion.ipynb
    Python is a high-level, versatile, and easy-to-learn programming language that 
    is widely used across various domains such as web development, data science, machine 
    learning, automation, and software development. Known for its simple syntax and readability, 
    Python allows developers to write clean and efficient code with fewer lines compared to many 
    other languages. It supports multiple programming paradigms including object-oriented,
    procedural, and functional programming, making it flexible for different types of projects. 
    With a vast ecosystem of libraries and frameworks like NumPy, Pandas, Django, and TensorFlow,
    Python empowers beginners and professionals alike to build powerful applications quickly. 
    Its strong community support and cross-platform compatibility further make it one of the most 
    popular and in-demand programming languages in the world today.

    Here are some key features of Python 

Simple & Easy to Learn – Python has a clean, readable syntax close to English, making it beginner-friendly.

Interpreted Language – No need for compilation; Python code runs directly, making debugging easier.

Cross-Platform – Works on Windows, Mac, Linux, and even mobile/embedded systems.

Open Source & Free – Anyone can use and modify it without cost.

Extensive Libraries – Comes with rich standard libraries and third-party modules for almost every domain (AI, Data Science, Web, etc.).

Object-Oriented & Multi-Paradigm – Supports OOP, procedural, and functional programming styles.

Dynamic Typing – No need to declare variable types; Python decides at runtime.

High-Level Language – Focuses on problem-solving rather than low-level details like memory management.
    """,

    "data/text_files/machine_learning.txt":"""1-data_ingestion.ipynb

Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on developing systems capable of learning from data and improving their performance over time without being explicitly programmed. Unlike traditional programming, where rules are predefined and outcomes are strictly determined by code, machine learning algorithms allow computers to recognize patterns, make predictions, and adapt to new information. This makes it an incredibly powerful tool in today’s digital era where vast amounts of data are generated every second.

The foundation of machine learning lies in algorithms and models that process data to identify trends and patterns. These algorithms are broadly categorized into three types: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the system is trained with labeled data, meaning the input and corresponding correct output are provided. This helps the model learn relationships between inputs and outputs, making it useful for tasks such as spam detection, sentiment analysis, or predicting house prices. Unsupervised learning, on the other hand, deals with unlabeled data. Here, the algorithm groups or clusters the data based on similarities, which is often used in customer segmentation or market basket analysis. Reinforcement learning is inspired by behavioral psychology, where an agent learns to make decisions by interacting with its environment and receiving rewards or penalties based on its actions. This technique is widely applied in robotics, gaming, and autonomous driving.

One of the most important aspects of machine learning is the data itself. The quality, quantity, and relevance of data directly influence the performance of a model. Machine learning involves several steps, including data collection, preprocessing, feature selection, training, testing, and evaluation. Preprocessing ensures that raw data is cleaned, standardized, and made suitable for analysis. Feature selection helps in identifying the most relevant attributes that contribute to accurate predictions. The training phase involves feeding the data into an algorithm to build a predictive model, while testing evaluates how well the model performs on unseen data.

Machine learning has found widespread applications in almost every domain. In healthcare, it assists in disease prediction, drug discovery, and medical imaging analysis. In finance, ML models are used for fraud detection, credit scoring, and algorithmic trading. Retailers employ machine learning for personalized recommendations, demand forecasting, and inventory management. Social media platforms use ML algorithms to filter content, detect fake accounts, and enhance user engagement. Moreover, self-driving cars rely heavily on machine learning to process sensor data, recognize objects, and make real-time driving decisions.

Despite its immense benefits, machine learning also faces challenges. Issues such as biased data, overfitting, lack of interpretability, and ethical concerns regarding privacy and fairness must be carefully addressed. For example, if a model is trained on biased data, it may produce unfair outcomes that reinforce existing inequalities. Researchers are actively working on explainable AI techniques to make ML models more transparent and trustworthy.

In conclusion, machine learning is revolutionizing the way technology interacts with humans by making systems smarter, more adaptive, and capable of independent decision-making. As data continues to grow at an unprecedented rate, the role of machine learning will only expand, opening new possibilities for innovation in science, business, and everyday life. With its ability to learn, predict, and improve, machine learning stands as a cornerstone of modern artificial intelligence and a driving force behind future technological advancements.

    """
}

for filepath, content in sample_text.items():
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)
print("Sample text file created")

Sample text file created


## TextLoader - Read Single File

In [10]:
# TextLoader is provided by langchain_community; one import is enough
from langchain_community.document_loaders import TextLoader

In [11]:
# Loading single text file
loader = TextLoader("data/text_files/pyhton_intro.txt", encoding="utf-8")

In [12]:
loader

<langchain_community.document_loaders.text.TextLoader at 0x22aef619dd0>

In [13]:
document = loader.load()
print(type(document))
print(document)

<class 'list'>
[Document(metadata={'source': 'data/text_files/pyhton_intro.txt'}, page_content='1-data_ingestion.ipynb\n    Python is a high-level, versatile, and easy-to-learn programming language that \n    is widely used across various domains such as web development, data science, machine \n    learning, automation, and software development. Known for its simple syntax and readability, \n    Python allows developers to write clean and efficient code with fewer lines compared to many \n    other languages. It supports multiple programming paradigms including object-oriented,\n    procedural, and functional programming, making it flexible for different types of projects. \n    With a vast ecosystem of libraries and frameworks like NumPy, Pandas, Django, and TensorFlow,\n    Python empowers beginners and professionals alike to build powerful applications quickly. \n    Its strong community support and cross-platform compatibility further make it one of the most \n    popular and in-de

In [14]:
print(f"loaded {len(document)} document")
print(f"content preview: {document[0].page_content[:100]}...")
print(f"Metadata: {document[0].metadata}")

loaded 1 document
content preview: 1-data_ingestion.ipynb
    Python is a high-level, versatile, and easy-to-learn programming language...
Metadata: {'source': 'data/text_files/pyhton_intro.txt'}
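
The output above shows what the loader produces: a list with one document whose metadata records the source path. As a rough mental model only (not LangChain's actual implementation), the same result can be sketched with the standard library. The function `load_text_file` and the plain-dict record are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def load_text_file(path, encoding="utf-8"):
    """Read one text file and wrap it in a single-item list.

    Hypothetical stand-in: each record is a plain dict with the same two
    fields a Document carries (page_content and metadata).
    """
    text = Path(path).read_text(encoding=encoding)
    return [{"page_content": text, "metadata": {"source": str(path)}}]


# Demo on a throwaway file so the cell does not depend on data/text_files.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"
    demo_path.write_text("Hello, ingestion!", encoding="utf-8")
    loaded = load_text_file(demo_path)

print(len(loaded), loaded[0]["page_content"])
```

Returning a list even for one file keeps the result the same shape as loaders that yield many documents.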


## DirectoryLoader - Multiple Text Files

In [15]:
from langchain_community.document_loaders import DirectoryLoader

In [16]:
dir_loader = DirectoryLoader(
    "data/text_files",
    glob="**/*.txt",  # glob pattern for the files to load
    loader_cls=TextLoader,  # loader class applied to each matched file
    loader_kwargs={'encoding':'utf-8'},
    show_progress=True
)

documents = dir_loader.load()

100%|██████████| 2/2 [00:00<00:00, 2273.95it/s]


In [17]:
print(f"loaded {len(documents)} document")
for i, doc in enumerate(documents):
    print(f"\nDocument {i+1}:")
    print(f"Source: {doc.metadata['source']} ")
    print(f"Length: {len(doc.page_content)} characters")

loaded 2 document

Document 1:
Source: data\text_files\machine_learning.txt 
Length: 3906 characters

Document 2:
Source: data\text_files\pyhton_intro.txt 
Length: 1763 characters
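
Note that the `source` values above use backslashes because this run was on Windows. A minimal standard-library sketch of the same idea (walk a folder with a glob pattern, read each match, record its source) is shown below. `load_text_directory` and the dict records are hypothetical stand-ins, and `as_posix()` is used here only to get forward slashes on every platform.

```python
import tempfile
from pathlib import Path


def load_text_directory(root, pattern="**/*.txt", encoding="utf-8"):
    """Read every file under root that matches pattern, in sorted order.

    Hypothetical helper: returns plain dicts rather than Document objects.
    """
    records = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            records.append(
                {
                    "page_content": path.read_text(encoding=encoding),
                    # as_posix() gives forward slashes on Windows too
                    "metadata": {"source": path.as_posix()},
                }
            )
    return records


# Demo on a throwaway folder with two .txt files and one file to ignore.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("alpha", encoding="utf-8")
    (root / "sub" / "b.txt").write_text("beta", encoding="utf-8")
    (root / "c.md").write_text("not matched", encoding="utf-8")
    records = load_text_directory(root)

for record in records:
    print(Path(record["metadata"]["source"]).name, len(record["page_content"]))
```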


## Text Splitting Strategies

In [18]:
# different text splitting strategies

from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter   
)

print(documents)

[Document(metadata={'source': 'data\\text_files\\machine_learning.txt'}, page_content='1-data_ingestion.ipynb\n\nMachine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on developing systems capable of learning from data and improving their performance over time without being explicitly programmed. Unlike traditional programming, where rules are predefined and outcomes are strictly determined by code, machine learning algorithms allow computers to recognize patterns, make predictions, and adapt to new information. This makes it an incredibly powerful tool in today’s digital era where vast amounts of data are generated every second.\n\nThe foundation of machine learning lies in algorithms and models that process data to identify trends and patterns. These algorithms are broadly categorized into three types: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the system is trained with labeled data, meaning the input and 

In [19]:
### Method 1: character text splitter

text=documents[0].page_content
text

'1-data_ingestion.ipynb\n\nMachine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on developing systems capable of learning from data and improving their performance over time without being explicitly programmed. Unlike traditional programming, where rules are predefined and outcomes are strictly determined by code, machine learning algorithms allow computers to recognize patterns, make predictions, and adapt to new information. This makes it an incredibly powerful tool in today’s digital era where vast amounts of data are generated every second.\n\nThe foundation of machine learning lies in algorithms and models that process data to identify trends and patterns. These algorithms are broadly categorized into three types: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the system is trained with labeled data, meaning the input and corresponding correct output are provided. This helps the model learn relationships be

In [20]:
## Method 1: character-based splitting
print("Character Text Splitter")
char_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

Character Text Splitter


In [21]:
char_chuncks=char_splitter.split_text(text)
print(f"created {len(char_chuncks)} chunks")
print(f"first chunk: {char_chuncks[0][:100]}...")

Created a chunk of size 549, which is longer than the specified 200
Created a chunk of size 1028, which is longer than the specified 200
Created a chunk of size 665, which is longer than the specified 200
Created a chunk of size 629, which is longer than the specified 200
Created a chunk of size 451, which is longer than the specified 200
Created a chunk of size 544, which is longer than the specified 200


created 7 chunks
first chunk: 1-data_ingestion.ipynb...
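
The warnings above appear because the text is cut only at the single separator (`"\n"`): pieces are then merged up to `chunk_size`, but a piece that is already longer than `chunk_size` (here, a whole paragraph) is kept as it is. The following is a simplified sketch of that split-then-merge behaviour, not LangChain's code; `split_on_separator` is a hypothetical name and overlap is left out to keep it short.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily merge pieces up to chunk_size.

    Simplified illustration: no overlap handling, and a piece longer than
    chunk_size is emitted unchanged because there is nothing else to cut on.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks


# One short line, one 300-character line, one short line.
demo_text = "short line\n" + "x" * 300 + "\nend"
demo_chunks = split_on_separator(demo_text, separator="\n", chunk_size=200)
print([len(c) for c in demo_chunks])  # [10, 300, 3]
```

The 300-character piece survives intact even though `chunk_size` is 200, which mirrors the oversized chunks reported in the warnings.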


In [22]:
print(char_chuncks)

['1-data_ingestion.ipynb', 'Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on developing systems capable of learning from data and improving their performance over time without being explicitly programmed. Unlike traditional programming, where rules are predefined and outcomes are strictly determined by code, machine learning algorithms allow computers to recognize patterns, make predictions, and adapt to new information. This makes it an incredibly powerful tool in today’s digital era where vast amounts of data are generated every second.', 'The foundation of machine learning lies in algorithms and models that process data to identify trends and patterns. These algorithms are broadly categorized into three types: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the system is trained with labeled data, meaning the input and corresponding correct output are provided. This helps the model learn relationships b

In [27]:
print(char_chuncks[0])
print(char_chuncks[1])
print(char_chuncks[2])
print(char_chuncks[3])
print(char_chuncks[4])


1-data_ingestion.ipynb
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on developing systems capable of learning from data and improving their performance over time without being explicitly programmed. Unlike traditional programming, where rules are predefined and outcomes are strictly determined by code, machine learning algorithms allow computers to recognize patterns, make predictions, and adapt to new information. This makes it an incredibly powerful tool in today’s digital era where vast amounts of data are generated every second.
The foundation of machine learning lies in algorithms and models that process data to identify trends and patterns. These algorithms are broadly categorized into three types: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the system is trained with labeled data, meaning the input and corresponding correct output are provided. This helps the model learn relationships between i

In [32]:
## Method 2: recursive character splitting (recommended)
print("Recursive Character Text Splitter")
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n","\n"," ",""],
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

Recursive Character Text Splitter


In [33]:
recursive_chuncks=recursive_splitter.split_text(text)
# Report on the recursive splitter's own output (not char_chuncks)
print(f"created {len(recursive_chuncks)} chunks")
print(f"first chunk: {recursive_chuncks[0][:100]}...")

first chunk: 1-data_ingestion.ipynb...


In [34]:
print(recursive_chuncks[0])
print(recursive_chuncks[1])


1-data_ingestion.ipynb
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on developing systems capable of learning from data and improving their performance over time without being explicitly
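
The second chunk above stops near 200 characters because the recursive splitter falls back to finer separators when a piece is still too long. The sketch below is a simplified illustration of that fallback idea under stated assumptions: `recursive_split` is a hypothetical function, overlap is ignored, and the real splitter's merging details differ.

```python
def recursive_split(text, separators, chunk_size):
    """Try separators in order; recurse with finer ones on oversized pieces.

    Simplified illustration without overlap. An empty-string separator (or
    running out of separators) means a hard cut every chunk_size characters.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, finer, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


# A short paragraph followed by one long paragraph of 100 words.
demo_text = "para one is short.\n\n" + " ".join(["word"] * 100)
demo_chunks = recursive_split(demo_text, ["\n\n", "\n", " ", ""], chunk_size=50)
print(len(demo_chunks), max(len(c) for c in demo_chunks))
```

Every chunk stays within `chunk_size` here because the long paragraph is re-split on spaces, which is the main practical advantage over splitting on a single separator.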


In [36]:
# Token-based splitting
print("\n Token based splitting")

token_splitter = TokenTextSplitter(
    chunk_size=50,
    chunk_overlap=10
)

token_chunks = token_splitter.split_text(text)
print(f"Created {len(token_chunks)} chunks")
print(f"first chunk: {token_chunks[0][:100]}..")


 Token based splitting
Created 18 chunks
first chunk: 1-data_ingestion.ipynb

Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focus..
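
With token-based splitting, `chunk_size` and `chunk_overlap` are counted in tokens rather than characters, and consecutive chunks share `chunk_overlap` tokens. The sliding-window idea can be sketched as follows. This is an illustration only: `sliding_token_chunks` is a hypothetical helper, and the made-up string tokens stand in for the tokenizer output that a real token splitter would use.

```python
def sliding_token_chunks(tokens, chunk_size, chunk_overlap):
    """Cut a token list into windows of chunk_size that overlap by chunk_overlap.

    Hypothetical helper: each window starts chunk_size - chunk_overlap tokens
    after the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
        start += step
    return chunks


# Twelve placeholder tokens stand in for real tokenizer output (an assumption).
demo_tokens = [f"t{i}" for i in range(12)]
token_windows = sliding_token_chunks(demo_tokens, chunk_size=5, chunk_overlap=2)
for window in token_windows:
    print(" ".join(window))
```

The last two tokens of each window reappear at the start of the next one, which helps preserve context across chunk boundaries.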
