# 📖 Chapter 02 - Data Processing & Chunking

## 🎯 Objectives

In this chapter, we process the raw tourism data into a format suitable for a retrieval-augmented generation (RAG) pipeline.

We build a unified document structure, test a chunking strategy on the documents, and save the processed data for vector database ingestion.

In [1]:
import pandas as pd
from src.config import RAW_DATA_DIR, PROCESSED_DATA_DIR
import json
from langchain_text_splitters import RecursiveCharacterTextSplitter

## 🧹 Step 01 - Data Loading

Load the raw tourism data from the `scenic_spot.json` file and replace missing values in the text columns with empty strings.

In [2]:
with open(RAW_DATA_DIR / "scenic_spot.json", "r", encoding="utf-8-sig") as f:
    data = json.load(f)

attraction = data["XML_Head"]["Infos"]["Info"]
df = pd.DataFrame(attraction)

text_cols = ["Add", "Description", "Opentime", "Ticketinfo", "Travellinginfo"]
df[text_cols] = df[text_cols].fillna("")
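
The cell above opens the file with `encoding="utf-8-sig"`, presumably because the export may begin with a UTF-8 byte-order mark (BOM). The following is a minimal sketch of why that matters, using a hypothetical in-memory payload that only mimics the `XML_Head` / `Infos` / `Info` nesting; it is not the real dataset.

```python
import json

# Hypothetical payload that mimics the nesting used in this chapter.
payload = {"XML_Head": {"Infos": {"Info": [{"Id": "A1", "Name": "Sample"}]}}}

# Simulate a file that was saved with a UTF-8 byte-order mark (BOM).
raw_bytes = b"\xef\xbb\xbf" + json.dumps(payload).encode("utf-8")

# Decoding as plain "utf-8" keeps the BOM as "\ufeff", which json rejects.
try:
    json.loads(raw_bytes.decode("utf-8"))
    plain_utf8_ok = True
except json.JSONDecodeError:
    plain_utf8_ok = False

# "utf-8-sig" strips the BOM during decoding, so parsing succeeds.
records = json.loads(raw_bytes.decode("utf-8-sig"))["XML_Head"]["Infos"]["Info"]

print(plain_utf8_ok, len(records))
```

If the file has no BOM, `utf-8-sig` behaves like plain `utf-8`, so it is a safe default for reading.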

## 📄 Step 02 - Document Creation

Transform each attraction into a unified document format.

In [3]:
def create_document(row):
    # Combine the text columns
    content_parts = [
        f"Name: {row['Name']}",
        f"Region: {row['Region']}",
    ]

    # row.get returns "" when a column is missing; an empty or whitespace-only
    # string is falsy after strip(), so the field is skipped.
    if str(row.get("Add", "")).strip():
        content_parts.append(f"Address: {row['Add']}")

    if str(row.get("Description", "")).strip():
        content_parts.append(f"Description: {row['Description']}")

    if str(row.get("Opentime", "")).strip():
        content_parts.append(f"Open_time: {row['Opentime']}")

    if str(row.get("Ticketinfo", "")).strip():
        content_parts.append(f"Ticket: {row['Ticketinfo']}")

    # Join the parts with newlines, skipping any empty strings
    content = "\n".join([part for part in content_parts if part])

    return {
        "id": row["Id"],
        "content": content,
        "metadata": {
            "name": row["Name"],
            "region": row["Region"],
            "category": row.get("Class1", "Unknown"),
        },
    }

In [4]:
# axis=1 calls create_document once per row; the default (axis=0) would pass each column instead.
contents = df.apply(create_document, axis=1).tolist()
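
Because `create_document` only uses `row[...]` and `row.get(...)`, the same logic can be tried on a plain dictionary. The sketch below is a hypothetical, table-driven variant of the content assembly (the names `OPTIONAL_FIELDS` and `build_content` and the sample row are illustrative, not part of the notebook). Unlike the original, it also strips surrounding whitespace from the values it keeps.

```python
# Hypothetical mapping of source column -> label, mirroring create_document.
OPTIONAL_FIELDS = [
    ("Add", "Address"),
    ("Description", "Description"),
    ("Opentime", "Open_time"),
    ("Ticketinfo", "Ticket"),
]


def build_content(row):
    """Assemble the text content, skipping missing or blank optional fields."""
    parts = [f"Name: {row['Name']}", f"Region: {row['Region']}"]
    for column, label in OPTIONAL_FIELDS:
        value = str(row.get(column, "")).strip()
        if value:
            parts.append(f"{label}: {value}")
    return "\n".join(parts)


# Made-up row: Description is whitespace only and Opentime is absent.
sample_row = {
    "Name": "Sample Park",
    "Region": "Sample City",
    "Add": "1 Example Rd.",
    "Description": "   ",
    "Ticketinfo": "Free",
}

content = build_content(sample_row)
print(content)
```

Only the name, region, address, and ticket lines appear in the result, which is the behavior the chained `if` checks above are meant to produce.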

## ✂️ Step 03 - Text Chunking

Test chunking strategies for longer documents.

In [5]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    separators=[
        "\n\n",
        "\n",
        " ",
        "",
    ],  # Try paragraph breaks first, then line breaks, then spaces, then individual characters
)
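
To build intuition for how `chunk_size=500` and `chunk_overlap=50` interact, here is a simplified fixed-window illustration. It is not LangChain's algorithm: `RecursiveCharacterTextSplitter` splits on the listed separators and merges the pieces, while this hypothetical `sliding_window_chunks` helper simply slides a window over the characters.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Made-up 1200-character text.
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(text)

print([len(chunk) for chunk in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

Each window starts 450 characters after the previous one, so neighbouring chunks share 50 characters. That shared context is what the overlap setting is meant to preserve across chunk boundaries.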

In [6]:
sample_doc = contents[0]
print("Original docs:")
print(sample_doc["content"])
print(f"\nlength: {len(sample_doc['content'])} characters")  # single quotes inside the f-string keep this valid before Python 3.12

Original docs:
Name: ÂÆè‰∫ûÈ£üÂìÅÂ∑ßÂÖãÂäõËßÄÂÖâÂ∑•Âª†
Region: Ê°ÉÂúíÂ∏Ç
Address: Ê°ÉÂúíÁ∏£ÂÖ´Âæ∑Â∏ÇÂª∫ÂúãË∑Ø386Ëôü
Description: Â∑ßÂÖãÂäõÂÖ±ÂíåÂúãÊòØ‰∏ÄÂ∫ß‰ª•Â∑ßÂÖãÂäõÁÇ∫‰∏ªÈ°åÁöÑËßÄÂÖâÂ∑•Âª†ÔºåÂª∫ÁØâË®≠Ë®à„ÄÅÈ§®ÂÖß‰∏ªÈ°åË®≠Ë®àÁöÜ‰ª•Â∑ßÂÖãÂäõÁÇ∫‰∏ªÈ°åÔºåÈÄôË£°‰πüÊèê‰æõË±êÂØåÁöÑÂ∑ßÂÖãÂäõÁõ∏ÈóúÁü•Ë≠òÔºå‰∫¶ÂèØ‰ª•DIYÂâµ‰ΩúÂ∑ßÂÖãÂäõÔºåÁÇ∫‰∏ÄÂØìÊïôÊñºÊ®Ç„ÄÅÈÅ©ÂêàË¶™Â≠ê‰ºëÈñíÂ®õÊ®ÇÁöÑÁµï‰Ω≥ÂéªËôï„ÄÇ
Ticket: Êî∂Ë≤ªÊñπÂºèË´ãÈõªÊ¥Ω

length: 173 characters


In [7]:
# Split the sample document into chunks
chunks = text_splitter.split_text(sample_doc["content"])
print(f"Split into {len(chunks)} chunk(s)")

for i, chunk in enumerate(chunks, 1):
    print(f"\n--- Chunk {i} ({len(chunk)} characters) ---")
    print(chunk)

Split into 1 chunk(s)

--- Chunk 1 (173 characters) ---
Name: ÂÆè‰∫ûÈ£üÂìÅÂ∑ßÂÖãÂäõËßÄÂÖâÂ∑•Âª†
Region: Ê°ÉÂúíÂ∏Ç
Address: Ê°ÉÂúíÁ∏£ÂÖ´Âæ∑Â∏ÇÂª∫ÂúãË∑Ø386Ëôü
Description: Â∑ßÂÖãÂäõÂÖ±ÂíåÂúãÊòØ‰∏ÄÂ∫ß‰ª•Â∑ßÂÖãÂäõÁÇ∫‰∏ªÈ°åÁöÑËßÄÂÖâÂ∑•Âª†ÔºåÂª∫ÁØâË®≠Ë®à„ÄÅÈ§®ÂÖß‰∏ªÈ°åË®≠Ë®àÁöÜ‰ª•Â∑ßÂÖãÂäõÁÇ∫‰∏ªÈ°åÔºåÈÄôË£°‰πüÊèê‰æõË±êÂØåÁöÑÂ∑ßÂÖãÂäõÁõ∏ÈóúÁü•Ë≠òÔºå‰∫¶ÂèØ‰ª•DIYÂâµ‰ΩúÂ∑ßÂÖãÂäõÔºåÁÇ∫‰∏ÄÂØìÊïôÊñºÊ®Ç„ÄÅÈÅ©ÂêàË¶™Â≠ê‰ºëÈñíÂ®õÊ®ÇÁöÑÁµï‰Ω≥ÂéªËôï„ÄÇ
Ticket: Êî∂Ë≤ªÊñπÂºèË´ãÈõªÊ¥Ω


In [8]:
doc_lengths = [len(doc["content"]) for doc in contents]
length_df = pd.DataFrame({
    "length": doc_lengths
})

print("Docs length summary:")
print(length_df.describe())

Docs length summary:
            length
count  5086.000000
mean    204.130751
std     135.558116
min      21.000000
25%      85.000000
50%     175.000000
75%     287.750000
max     643.000000


In [9]:
need_chunking = sum(1 for length in doc_lengths if length > 500)
print(f"The docs with over 500 characters: {need_chunking} / {len(contents)}")
print(f"Ratio: {need_chunking / len(contents) * 100:.1f}%")

The docs with over 500 characters: 99 / 5086
Ratio: 1.9%
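
Since only a small share of documents exceed 500 characters, one option is to split just those and pass the rest through unchanged. The sketch below assumes the document shape produced in Step 02. The `chunk_documents` helper, the `_chunk_{index}` id scheme, and the `chunk_index` / `chunk_count` metadata keys are hypothetical, and `naive_split` is only a stand-in so the example runs without LangChain.

```python
def chunk_documents(docs, split_text, max_chars=500):
    """Return one record per chunk, copying each document's metadata."""
    records = []
    for doc in docs:
        content = doc["content"]
        # Split only documents longer than max_chars; keep short ones whole.
        pieces = split_text(content) if len(content) > max_chars else [content]
        for index, piece in enumerate(pieces):
            records.append(
                {
                    "id": f"{doc['id']}_chunk_{index}",
                    "content": piece,
                    "metadata": {
                        **doc["metadata"],
                        "chunk_index": index,
                        "chunk_count": len(pieces),
                    },
                }
            )
    return records


def naive_split(text, width=500):
    """Stand-in splitter: fixed-width slices with no overlap."""
    return [text[i:i + width] for i in range(0, len(text), width)]


# Made-up documents: one short, one long.
docs = [
    {"id": "A1", "content": "x" * 120, "metadata": {"name": "Short"}},
    {"id": "A2", "content": "y" * 1100, "metadata": {"name": "Long"}},
]

chunked = chunk_documents(docs, naive_split)
print([record["id"] for record in chunked])
```

In the notebook, the same helper could be called as `chunk_documents(contents, text_splitter.split_text)`. Keeping the parent metadata on every chunk lets a retrieved chunk be traced back to its attraction.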


## 💾 Step 04 - Save Processed Data

Export processed documents to JSON format.

In [16]:
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

output_path = PROCESSED_DATA_DIR / "documents.json"

with open(output_path, "w", encoding="utf-8") as f:
    json.dump(contents, f, ensure_ascii=False, indent=2)

print(f"Saved {len(contents)} docs to:")
print(f"{output_path} ✅ ")
print(f"File size: {output_path.stat().st_size / 1024 / 1024:.2f} MB")

Saved 5086 docs to:
c:\Users\dinni\OneDrive\桌面\Travel_rag\data\processed\documents.json ✅ 
File size: 3.21 MB


In [17]:
# Double check
with open(output_path, "r", encoding="utf-8") as f:
    loaded_docs = json.load(f)

print("Success! ✅ ")
print(f"Original docs: {len(contents)}")
print(f"Loaded docs: {len(loaded_docs)}")
print(f"Data consistency: {len(contents) == len(loaded_docs)}")

Success! ✅ 
Original docs: 5086
Loaded docs: 5086
Data consistency: True
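
The check above compares document counts only. A slightly stricter round-trip check, sketched here with a made-up document and a temporary directory, compares each reloaded document with its original by id. The `find_mismatches` helper is hypothetical and assumes ids are unique.

```python
import json
import tempfile
from pathlib import Path


def find_mismatches(original, reloaded):
    """Return the ids of documents whose reloaded copy differs from the original."""
    reloaded_by_id = {doc["id"]: doc for doc in reloaded}
    return [doc["id"] for doc in original if reloaded_by_id.get(doc["id"]) != doc]


# Made-up document with non-ASCII text, similar in shape to the real ones.
docs = [{"id": "A1", "content": "Name: 範例", "metadata": {"name": "範例"}}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "documents.json"
    # Same settings as the save cell: UTF-8 on disk, non-ASCII kept readable.
    path.write_text(json.dumps(docs, ensure_ascii=False, indent=2), encoding="utf-8")
    reloaded = json.loads(path.read_text(encoding="utf-8"))

mismatches = find_mismatches(docs, reloaded)
print(mismatches)
```

An empty list means every document survived the write and read unchanged. Any id in the list points to a document worth inspecting.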
