## Intro to Data Ingestion

In [48]:
import os
from typing import List, Dict, Any
import pandas as pd

In [49]:
from langchain_core.documents import Document
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter
)

## Understanding the document structure

In [50]:
## create a simple document
doc = Document(
    page_content="This is the main text content that will be embedded and searched",
    metadata = {
        "source":"example.txt",
        "page":1,
        "author":"Biswa",
        "date_created":"31-08-2024",
        "custom_field":"any value"
    }
)

print("Document Structure")

print("\n")

print(f"content: {doc.page_content}")
print(f"metadata: {doc.metadata}")

Document Structure


content: This is the main text content that will be embedded and searched
metadata: {'source': 'example.txt', 'page': 1, 'author': 'Biswa', 'date_created': '31-08-2024', 'custom_field': 'any value'}
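
To make the two fields concrete, here is a minimal sketch of a record with the same shape, built from a plain dataclass. `SimpleDocument` and `filter_by_metadata` are hypothetical names used only for illustration and are not part of LangChain; the point is that metadata travels with the text and can be used to narrow a collection later.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document record: text plus metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[SimpleDocument], **criteria: Any) -> List[SimpleDocument]:
    """Keep documents whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    SimpleDocument("Intro text", {"source": "example.txt", "page": 1}),
    SimpleDocument("More text", {"source": "example.txt", "page": 2}),
    SimpleDocument("Other text", {"source": "other.txt", "page": 1}),
]

# Select only page 1 of example.txt
page_one = filter_by_metadata(docs, source="example.txt", page=1)
print([d.page_content for d in page_one])
```

Keeping fields such as `source` and `page` consistent across documents is what makes this kind of filtering possible.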


## Text files - the simplest use case

In [51]:
os.makedirs("data/text_files", exist_ok = True)

In [52]:
sample_text = {
    "data/text_files/python_intro.txt": """\
# Introduction to Python

Python is one of the most popular programming languages used for Data Science, Web Development, and AI.
In this notebook (1-data_ingestion.ipynb), we will cover:

- Setting up Python environment
- Understanding variables and data types
- Writing simple functions
- Reading and writing files
- A quick intro to using external libraries

By the end, you will be able to write basic Python programs and prepare for more advanced topics.
""",

    "data/text_files/data_cleaning.txt": """\
# Data Cleaning

Data cleaning is one of the most important steps before applying any ML model.
In this notebook (2-data_cleaning.ipynb), we will cover:

- Handling missing values (drop, fill, interpolate)
- Removing duplicates
- Fixing inconsistent formatting
- Detecting and removing outliers
- Standardizing column names

Clean data = Better results for your ML pipeline.
""",

    "data/text_files/eda.txt": """\
# Exploratory Data Analysis (EDA)

EDA helps us understand the data better before modeling.
In this notebook (3-eda.ipynb), we will:

- Summarize datasets with Pandas
- Visualize distributions (histograms, boxplots, scatterplots)
- Correlation analysis
- Feature relationships

Good EDA often reveals hidden patterns in the data.
""",

    "data/text_files/feature_engineering.txt": """\
# Feature Engineering

Feature Engineering is the art of creating better inputs for ML models.
In this notebook (4-feature_engineering.ipynb), we will:

- Encoding categorical variables (OneHot, Label Encoding)
- Normalization and Standardization
- Binning numerical variables
- Feature selection techniques
- Creating interaction features

Better features = Smarter models.
""",

    "data/text_files/model_training.txt": """\
# Model Training

In this notebook (5-model_training.ipynb), we will train our ML models.
Topics covered:

- Train/Test split
- Cross-validation
- Training models (Linear Regression, Decision Trees, Random Forests)
- Hyperparameter tuning (GridSearchCV, RandomizedSearchCV)
- Saving and loading models with joblib/pickle
""",

    "data/text_files/model_evaluation.txt": """\
# Model Evaluation

The last step in the ML pipeline is model evaluation.
In this notebook (6-model_evaluation.ipynb), we will cover:

- Accuracy, Precision, Recall, F1-score
- ROC-AUC curves
- Confusion Matrix
- Bias-variance tradeoff
- Model interpretability with SHAP/feature importance

Evaluation ensures that our model is reliable and ready for deployment.
"""
}


In [53]:
for filepath,content in sample_text.items():
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)

print("Created sample files successfully")


Created sample files successfully


### TextLoader - Read single file

In [54]:
from langchain.document_loaders import TextLoader

In [55]:
loader = TextLoader("data/text_files/python_intro.txt", encoding="utf-8")

In [56]:
document = loader.load()
document

[Document(metadata={'source': 'data/text_files/python_intro.txt'}, page_content='# Introduction to Python\n\nPython is one of the most popular programming languages used for Data Science, Web Development, and AI.\nIn this notebook (1-data_ingestion.ipynb), we will cover:\n\n- Setting up Python environment\n- Understanding variables and data types\n- Writing simple functions\n- Reading and writing files\n- A quick intro to using external libraries\n\nBy the end, you will be able to write basic Python programs and prepare for more advanced topics.\n')]

In [57]:
print(type(document))

<class 'list'>
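
Note that `load()` returns a list even for a single file. The sketch below shows the same idea with the standard library only: read one file and wrap the result in a one-element list with the path recorded as `source`. `load_text_file` is a hypothetical helper, not LangChain's implementation, and it uses plain dicts in place of `Document` objects.

```python
import os
import tempfile
from typing import Any, Dict, List


def load_text_file(path: str, encoding: str = "utf-8") -> List[Dict[str, Any]]:
    """Hypothetical helper: read one file and wrap it in a one-element list."""
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    return [{"page_content": text, "metadata": {"source": path}}]


# Write a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("# Sample\n\nHello, loader.\n")

    loaded = load_text_file(sample_path)

print(type(loaded), len(loaded))
print(loaded[0]["page_content"])
```

Returning a list keeps the interface uniform with loaders that produce many documents from one source, so downstream code can always iterate.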


### Directory loader - Reading multiple text files

In [58]:
from langchain.document_loaders import DirectoryLoader

In [59]:
dir_loader = DirectoryLoader(
    "data/text_files",
    glob = "**/*.txt", ## pattern to match files
    loader_cls = TextLoader, ## loader class to use
    loader_kwargs = {'encoding':'utf-8'},
    show_progress = True
)

documents2 = dir_loader.load()

100%|██████████| 6/6 [00:00<00:00, 2001.10it/s]
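
The `**/*.txt` pattern matches `.txt` files in the target directory and in any subdirectory. The following sketch illustrates those glob semantics with `pathlib` on a temporary directory; the file names are made up for the example, and it says nothing about how `DirectoryLoader` is implemented internally.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("top level", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("one level down", encoding="utf-8")
    (root / "notes.md").write_text("ignored by the pattern", encoding="utf-8")

    # "**" matches zero or more directories, so both .txt files are found.
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(matched)
```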


In [60]:
documents2

[Document(metadata={'source': 'data\\text_files\\data_cleaning.txt'}, page_content='# Data Cleaning\n\nData cleaning is one of the most important steps before applying any ML model.\nIn this notebook (2-data_cleaning.ipynb), we will cover:\n\n- Handling missing values (drop, fill, interpolate)\n- Removing duplicates\n- Fixing inconsistent formatting\n- Detecting and removing outliers\n- Standardizing column names\n\nClean data = Better results for your ML pipeline.\n'),
 Document(metadata={'source': 'data\\text_files\\eda.txt'}, page_content='# Exploratory Data Analysis (EDA)\n\nEDA helps us understand the data better before modeling.\nIn this notebook (3-eda.ipynb), we will:\n\n- Summarize datasets with Pandas\n- Visualize distributions (histograms, boxplots, scatterplots)\n- Correlation analysis\n- Feature relationships\n\nGood EDA often reveals hidden patterns in the data.\n'),
 Document(metadata={'source': 'data\\text_files\\feature_engineering.txt'}, page_content='# Feature Engine

In [61]:
print(f"Loaded {len(documents2)} documents")
for i, doc in enumerate(documents2):
    print(f"\nDocument {i + 1}")
    print(f"source: {doc.metadata['source']}")
    print(f"Length: {len(doc.page_content)} characters")

Loaded 6 documents

Document 1
source: data\text_files\data_cleaning.txt
Length: 375 characters

Document 2
source: data\text_files\eda.txt
Length: 330 characters

Document 3
source: data\text_files\feature_engineering.txt
Length: 375 characters

Document 4
source: data\text_files\model_evaluation.txt
Length: 363 characters

Document 5
source: data\text_files\model_training.txt
Length: 321 characters

Document 6
source: data\text_files\python_intro.txt
Length: 460 characters
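
The `source` values above use backslashes because the notebook was run on Windows, while the single-file example earlier kept the forward slashes it was given. If sources need to be compared or used as keys across platforms, one option is to normalize them. This is a small sketch with `pathlib`; `normalize_source` is a hypothetical helper name.

```python
from pathlib import PureWindowsPath


def normalize_source(path: str) -> str:
    """Return the path with forward slashes, whichever separator it used."""
    # PureWindowsPath accepts both "\\" and "/" as separators.
    return PureWindowsPath(path).as_posix()


sources = [
    "data\\text_files\\data_cleaning.txt",
    "data\\text_files\\eda.txt",
    "data/text_files/python_intro.txt",
]

normalized = [normalize_source(s) for s in sources]
print(normalized)
```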


## Text Splitting Strategies

In [62]:
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)

print(documents2)


[Document(metadata={'source': 'data\\text_files\\data_cleaning.txt'}, page_content='# Data Cleaning\n\nData cleaning is one of the most important steps before applying any ML model.\nIn this notebook (2-data_cleaning.ipynb), we will cover:\n\n- Handling missing values (drop, fill, interpolate)\n- Removing duplicates\n- Fixing inconsistent formatting\n- Detecting and removing outliers\n- Standardizing column names\n\nClean data = Better results for your ML pipeline.\n'), Document(metadata={'source': 'data\\text_files\\eda.txt'}, page_content='# Exploratory Data Analysis (EDA)\n\nEDA helps us understand the data better before modeling.\nIn this notebook (3-eda.ipynb), we will:\n\n- Summarize datasets with Pandas\n- Visualize distributions (histograms, boxplots, scatterplots)\n- Correlation analysis\n- Feature relationships\n\nGood EDA often reveals hidden patterns in the data.\n'), Document(metadata={'source': 'data\\text_files\\feature_engineering.txt'}, page_content='# Feature Engineer

### Character Text Splitter

In [63]:
text = documents2[0].page_content
text

'# Data Cleaning\n\nData cleaning is one of the most important steps before applying any ML model.\nIn this notebook (2-data_cleaning.ipynb), we will cover:\n\n- Handling missing values (drop, fill, interpolate)\n- Removing duplicates\n- Fixing inconsistent formatting\n- Detecting and removing outliers\n- Standardizing column names\n\nClean data = Better results for your ML pipeline.\n'

In [64]:
char_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size = 200,
    chunk_overlap = 20,
    length_function = len
)

char_chunks = char_splitter.split_text(text)

In [65]:
print(f"created {len(char_chunks)} chunks")
print(f"first chunk {char_chunks[0]}")

created 3 chunks
first chunk # Data Cleaning
Data cleaning is one of the most important steps before applying any ML model.
In this notebook (2-data_cleaning.ipynb), we will cover:
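
To see why the first chunk stops where it does, it helps to sketch the basic idea behind separator-based chunking: split on the separator, then pack pieces together until adding the next one would exceed `chunk_size`. The function below is a simplified illustration under that assumption. `merge_pieces` is a hypothetical name, it ignores `chunk_overlap`, and it is not LangChain's actual code.

```python
from typing import List


def merge_pieces(text: str, separator: str, chunk_size: int) -> List[str]:
    """Split on separator, then greedily pack pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size is kept whole in this sketch.
    """
    pieces = [p for p in text.split(separator) if p]  # drop empty pieces
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # Adding this piece would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


toy_text = "aaaa\nbbbb\n\ncccc\ndddd"
toy_chunks = merge_pieces(toy_text, separator="\n", chunk_size=9)
print(toy_chunks)
```

With `chunk_size=9`, two four-character lines plus one separator fit exactly, so the toy text is packed into two chunks of two lines each.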


In [66]:
recur_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n","\n"," ", ""],
    chunk_size = 200,
    chunk_overlap = 5,
    length_function = len
)

recur_chunks = recur_splitter.split_text(text)

In [67]:
print(f"created {len(recur_chunks)} chunks")
print(f"first chunk {recur_chunks[0]}")
print()
print(f"second chunk {recur_chunks[1]}")
print()
print(f"third chunk {recur_chunks[2]}")

created 3 chunks
first chunk # Data Cleaning

Data cleaning is one of the most important steps before applying any ML model.
In this notebook (2-data_cleaning.ipynb), we will cover:

second chunk - Handling missing values (drop, fill, interpolate)
- Removing duplicates
- Fixing inconsistent formatting
- Detecting and removing outliers
- Standardizing column names

third chunk Clean data = Better results for your ML pipeline.
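
The recursive splitter tries the separators in order, from coarse (`"\n\n"`) to fine (`""`), and only falls back to a finer one when a piece is still too large. The sketch below illustrates that strategy in plain Python. `recursive_split` is a hypothetical function that omits `chunk_overlap` and other details, so treat it as a teaching aid rather than a description of LangChain's implementation.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split text with the coarsest separator first, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separator left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush what we have, then split the oversized piece more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, finer, chunk_size))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


toy_text = "Title\n\nalpha beta gamma delta\n\nend"
toy_chunks = recursive_split(toy_text, ["\n\n", "\n", " ", ""], chunk_size=12)
print(toy_chunks)
```

In the toy text, the middle paragraph is longer than 12 characters, so it is split again on spaces, while the short paragraphs around it stay intact.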
