# 📖 Chapter 02 - Data Enrichment
## 🎯 Objectives
In this chapter, we will enrich our attraction data with detailed descriptions from Wikipedia.

**What we'll accomplish:**

- Set up Wikipedia API access
- Fetch Wikipedia extracts for attractions
- Clean and process description text
- Combine Geoapify data with Wikipedia content
- Create the final enriched dataset for RAG
- Validate data quality and completeness


## 📦 Step 01 - Import Libraries
Import the libraries needed to call the Wikipedia API and process the data.

In [82]:
import pandas as pd
import requests
import json
import time

from src.config import PROCESSED_DATA_DIR, RAW_DATA_DIR, TARGET_CITY, CITY_BBOX
from src.utils.emoji_log import success, data, error, info, task, done, warn, save

## 🔍 Step 02 - Load Processed Data
Load the attractions with Wikipedia links from Chapter 1.

In [63]:
data_file = PROCESSED_DATA_DIR / "seattle_attractions_with_wikipedia.json"

with open(data_file, "r", encoding="utf-8") as f:
    attractions_with_wiki = json.load(f)

success(f"Loaded {len(attractions_with_wiki)} attractions with Wikipedia links")
info(f"File name: {data_file.name}")

✅ Loaded 62 attractions with Wikipedia links
💬 File name: seattle_attractions_with_wikipedia.json
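Before calling any API, it can help to confirm that every loaded record carries the fields the later steps read. The following is a minimal sketch: `find_invalid_records` is a hypothetical helper that is not part of the project, and the required keys are inferred from the sample record printed in Step 04.

```python
def find_invalid_records(records):
    """Return (position, reason) pairs for records missing a required field."""
    problems = []
    for position, record in enumerate(records):
        if not record.get("name"):
            problems.append((position, "missing name"))
        if not record.get("place_id"):
            problems.append((position, "missing place_id"))
        # wiki_data may be absent or None, so fall back to an empty dict
        wiki_tag = (record.get("wiki_data") or {}).get("wikipedia")
        if not wiki_tag:
            problems.append((position, "missing wiki_data.wikipedia"))
    return problems


# Hand-written sample records, not taken from the real dataset
sample_records = [
    {"name": "Place A", "place_id": "id-a", "wiki_data": {"wikipedia": "en:Place A"}},
    {"name": "Place B", "place_id": "id-b", "wiki_data": {}},
]
invalid = find_invalid_records(sample_records)
print(invalid)
```

An empty result means the fetch loop in Step 04 can index `attraction["wiki_data"]["wikipedia"]` without raising a `KeyError`.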


## 🌐 Step 03 - Test Wikipedia API Connection
Configure the Wikipedia API endpoint and test the connection.

In [64]:
WIKIPEDIA_API_BASE = "https://{language}.wikipedia.org/w/api.php"

task("Testing Wikipedia API connection...")

headers = {"User-Agent": "TravelRAG/1.0 (Educational Project; dinnis1107@gmail.com)"}

test_params = {
    "action": "query",
    "format": "json",
    "titles": "Seattle",
    "prop": "extracts",
    "exintro": True,
    "explaintext": True,
}

# Send a test request; the timeout keeps the cell from hanging on a stalled connection
test_response = requests.get(
    WIKIPEDIA_API_BASE.format(language="en"),
    params=test_params,
    headers=headers,
    timeout=10,
)

if test_response.status_code == 200:
    success("Wikipedia API connection successful!")
else:
    error(f"Connection failed: {test_response.status_code}")

🚀 Testing Wikipedia API connection...
✅ Wikipedia API connection successful!


In [65]:
data(json.dumps(test_response.json(), indent=2))

📊 {
  "batchcomplete": "",
  "query": {
    "pages": {
      "11388236": {
        "pageid": 11388236,
        "ns": 0,
        "title": "Seattle",
        "extract": "Seattle (  see-AT-\u0259l) is the most populous city in the U.S. state of Washington and the Pacific Northwest region of North America. It is the 18th-most populous city in the United States with a population of 780,995 in 2024, while the Seattle metropolitan area at over 4.15 million residents is the 15th-most populous metropolitan area in the nation. The city is the county seat of King County, the most populous county in Washington. Seattle's growth rate of 21.1% between 2010 and 2020 made it one of the country's fastest-growing large cities.\nSeattle is situated on an isthmus between Puget Sound, an inlet of the Pacific Ocean, and Lake Washington. It is the northernmost major city in the United States, located about 100 miles (160 km) south of the Canadian border. A gateway for trade with the West Pacific, the Port
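The response nests the text under `query` → `pages` → `<pageid>` → `extract`, so the page ID is not known in advance and the pages have to be iterated. A minimal sketch of that lookup follows; `extract_intro` is a hypothetical helper, and both payloads are hand-written to mimic the response shape shown above rather than copied from the API.

```python
def extract_intro(payload):
    """Return the first non-missing page extract in an API response, or None."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        # Pages that do not exist carry a "missing" key and no extract
        if "missing" not in page and page.get("extract"):
            return page["extract"]
    return None


# Hand-written payloads that imitate the JSON structure printed above
found_payload = {
    "query": {"pages": {"11388236": {"title": "Seattle", "extract": "Seattle is a city."}}}
}
missing_payload = {
    "query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}
}

print(extract_intro(found_payload))
print(extract_intro(missing_payload))
```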

## 📚 Step 04 - Fetch Wikipedia Descriptions
Fetch Wikipedia extracts for each attraction using the Wikipedia API.

In [66]:
attractions_with_wiki[0]

{'index': 0,
 'name': 'Seattle Public Library - Central Library',
 'place_id': '5186d2eaed4a955ec059a29297cfa8cd4740f00102f901ba6f35020000000092032853656174746c65205075626c6963204c696272617279202d2043656e7472616c204c696272617279',
 'wiki_data': {'wikidata': 'Q2531939',
  'wikipedia': 'en:Seattle Central Library',
  'wikimedia_commons': 'File:Seattle_(WA,_USA),_Seattle_Central_Library_--_2022_--_200930.jpg',
  'image': 'https://commons.wikimedia.org/wiki/File:Seattle_(WA,_USA),_Seattle_Central_Library_--_2022_--_200930.jpg'}}
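The `wikipedia` field combines a language code and an article title in one string (`en:Seattle Central Library`). The sketch below shows one way to split it defensively; `split_wiki_tag` and its `default_language` parameter are assumptions for illustration, not part of the project.

```python
def split_wiki_tag(tag, default_language="en"):
    """Split a 'lang:Title' tag into (language, title).

    Tags without a colon are treated as a bare title in the default language.
    """
    language, separator, title = tag.partition(":")
    if not separator:
        return default_language, tag.strip()
    return language.strip(), title.strip()


print(split_wiki_tag("en:Seattle Central Library"))
print(split_wiki_tag("Space Needle"))
```

Only the first colon is treated as the separator, so titles that themselves contain a colon stay intact as long as a language prefix is present. A bare title that contains a colon would still be split incorrectly, which is a known limitation of this sketch.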

In [67]:
task("Fetching Wikipedia descriptions for all attractions...")

success_count = 0
fail_count = 0

for i, attraction in enumerate(attractions_with_wiki, 1):
    wiki_code = attraction["wiki_data"]["wikipedia"]

    # Split the language and title
    language, title = wiki_code.split(":", maxsplit=1)

    # api url
    api_url = WIKIPEDIA_API_BASE.format(language=language)

    # params
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "exintro": True,
        "explaintext": True,
        "redirects": 1,
    }

    # The timeout prevents one stalled request from blocking the whole loop
    response = requests.get(
        api_url,
        params=params,
        headers=headers,
        timeout=10,
    )

    if response.status_code == 200:
        response_data = response.json()

        # Retrieve the description
        pages = response_data.get("query", {}).get("pages", {})

        # Iterate the pages
        description_found = False
        for page_id, page_data in pages.items():
            if "extract" in page_data and "missing" not in page_data:
                attraction["description"] = page_data["extract"]
                success_count += 1
                description_found = True
                success(f"[{i}/{len(attractions_with_wiki)}] {attraction['name']}")
                break

        if not description_found:
            attraction["description"] = None
            fail_count += 1
            error(
                f"[{i}/{len(attractions_with_wiki)}] {attraction['name']} - No extract found"
            )
    else:
        attraction["description"] = None
        fail_count += 1
        error(
            f"[{i}/{len(attractions_with_wiki)}] {attraction['name']} - API Error: {response.status_code}"
        )

    # Rate limiting
    time.sleep(0.5)

done(f"Completed! Success: {success_count}, Failed: {fail_count}")

🚀 Fetching Wikipedia descriptions for all attractions...
✅ [1/62] Seattle Public Library - Central Library
✅ [2/62] Space Needle
✅ [3/62] Pike Place Market
✅ [4/62] Seattle Great Wheel
✅ [5/62] Japanese Garden
✅ [6/62] Grass Blades
✅ [7/62] Seattle Center
✅ [8/62] Carl S. English Jr. Botanical Gardens
✅ [9/62] West Point Light
✅ [10/62] Alki Point Lighthouse
✅ [11/62] Gum Wall
✅ [12/62] Bruce and Brandon Lee Graves
✅ [13/62] Red Square
✅ [14/62] Ravenna Park Bridge
✅ [15/62] Pioneer Building
✅ [16/62] Large Lock
✅ [17/62] Small Lock
✅ [18/62] Swiftsure (LV-83)
✅ [19/62] Arthur Foss
✅ [20/62] Virginia V
✅ [21/62] Ward House
✅ [22/62] Historic Ballard Fire Station No. 18
✅ [23/62] Panama Hotel
✅ [24/62] Montlake Boulevard East
✅ [25/62] Chinatown Gate
✅ [26/62] Duwamish
✅ [27/62] A Sound Garden
✅ [28/62] Sick's Stadium
✅ [29/62] Jose Rizal Bridge
✅ [30/62] Montlake Bridge
✅ [31/62] Aurora Avenue North
✅ [32/62] Union Trus
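The loop above marks an attraction as failed after a single unsuccessful request. If transient errors become a problem, one option is to retry with an increasing delay. The sketch below is an assumption about how that could look, not the notebook's method: `call_with_retries` is a hypothetical helper, and the fake fetcher stands in for a real `requests.get` call so the example runs offline.

```python
import time


def call_with_retries(fetch, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out, doubling the delay each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # a real loop would catch requests.RequestException
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error


# Fake fetcher that fails twice before succeeding (stands in for a network call)
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"status": "ok"}


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable without waiting; in the notebook it would default to `time.sleep`.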

## 🧹 Step 05 - Clean and Process Text

Analyze data quality and clean the dataset:
- Check for duplicate entries
- Remove duplicates based on place_id
- Check for special characters
- Analyze description lengths

In [68]:
attractions_with_wiki[0]

{'index': 0,
 'name': 'Seattle Public Library - Central Library',
 'place_id': '5186d2eaed4a955ec059a29297cfa8cd4740f00102f901ba6f35020000000092032853656174746c65205075626c6963204c696272617279202d2043656e7472616c204c696272617279',
 'wiki_data': {'wikidata': 'Q2531939',
  'wikipedia': 'en:Seattle Central Library',
  'wikimedia_commons': 'File:Seattle_(WA,_USA),_Seattle_Central_Library_--_2022_--_200930.jpg',
  'image': 'https://commons.wikimedia.org/wiki/File:Seattle_(WA,_USA),_Seattle_Central_Library_--_2022_--_200930.jpg'},
 'description': 'The Seattle Central Library is the flagship library of the Seattle Public Library system. The 11-story (185 feet or 56.9 meters high) glass and steel building in the downtown core of Seattle, Washington was opened to the public on May 23, 2004. Rem Koolhaas and Joshua Prince-Ramus of OMA/LMN were the principal architects, and Magnusson Klemencic Associates was the structural engineer with Arup. Arup also provided mechanical, electrical, and plumbin

In [69]:
# Check the duplicates and remove
info("Checking for duplicates")

seen_place_ids = {}
duplicates = []

for attraction in attractions_with_wiki:
    place_id = attraction["place_id"]
    name = attraction["name"]

    if place_id in seen_place_ids:
        duplicates.append(
            {
                "name": name,
                "place_id": place_id,
                "first_index": seen_place_ids[place_id],
            }
        )
    else:
        seen_place_ids[place_id] = attraction["index"]

if duplicates:
    warn(f"Found {len(duplicates)} duplicate entries:")

    unique_attractions = []
    seen_ids = set()

    for attraction in attractions_with_wiki:
        place_id = attraction["place_id"]
        if place_id not in seen_ids:
            unique_attractions.append(attraction)
            seen_ids.add(place_id)

    attractions_with_wiki = unique_attractions
    done(
        f"Removed {len(duplicates)} duplicates. Remaining {len(attractions_with_wiki)} attractions"
    )
else:
    success("No duplicates found based on place_id")
    unique_attractions = attractions_with_wiki

💬 Checking for duplicates
✅ No duplicates found based on place_id
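The cell above walks the list twice: once to detect duplicates and once to filter them. The same result can be obtained in a single pass. This is a sketch only; `split_duplicates` is a hypothetical helper, and the sample rows are hand-written.

```python
def split_duplicates(records, key_func):
    """Return (unique, duplicates), keeping the first record seen for each key."""
    seen = set()
    unique, duplicates = [], []
    for record in records:
        key = key_func(record)
        if key in seen:
            duplicates.append(record)
        else:
            seen.add(key)
            unique.append(record)
    return unique, duplicates


# Hand-written rows; the third repeats the first row's place_id
rows = [
    {"place_id": "a", "name": "Place A"},
    {"place_id": "b", "name": "Place B"},
    {"place_id": "a", "name": "Place A (copy)"},
]
unique_rows, duplicate_rows = split_duplicates(rows, lambda r: r["place_id"])
print(len(unique_rows), len(duplicate_rows))
```

Because the key is a function, the same helper could also group records by `r["wiki_data"]["wikipedia"]` to find attractions that share one Wikipedia article.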


In [70]:
# Check for special characters

info("Checking for special characters in descriptions...")

special_char_count = 0
special_char_examples = []

for attraction in attractions_with_wiki:
    desc = attraction.get("description", "")
    if desc:
        # Flag the description if any character falls outside the ASCII range
        if any(ord(char) > 127 for char in desc):  # ord() returns the Unicode code point
            special_char_count += 1
            special_char_examples.append(attraction["name"])

info(f"Descriptions with special characters: {special_char_count/len(attractions_with_wiki)*100:.2f}%")

if special_char_examples:
    print("Examples:")
    for name in special_char_examples:
        print(f"- {name}")

💬 Checking for special characters in descriptions...
💬 Descriptions with special characters: 27.42%
Examples:
- Red Square
- Ward House
- Panama Hotel
- Montlake Boulevard East
- Chinatown Gate
- Jose Rizal Bridge
- Montlake Bridge
- Jose Rizal Bridge
- Mount Baker Ridge Tunnel (old bore, south)
- Mount Baker Ridge Tunnel (old bore, north)
- Mount Baker Ridge Tunnel (cut and cover lid)
- Montlake Boulevard East
- Montlake Boulevard East
- Waiting for the Interurban
- 9 Spaces 9 Trees
- Broken Obelisk
- Mount Baker Ridge Tunnel (new bore)
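The check above only reports which descriptions contain non-ASCII characters; it does not show which characters they are or change the text. The sketch below lists the characters and applies a conservative cleanup. Both helpers are hypothetical, the sample string is hand-written, and whether NFC normalization and whitespace collapsing suit the downstream embedding model is an assumption to verify.

```python
import unicodedata


def non_ascii_characters(text):
    """Return the distinct non-ASCII characters in text, sorted by code point."""
    return sorted({char for char in text if ord(char) > 127})


def normalize_description(text):
    """Apply NFC normalization and collapse whitespace runs within each line."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u00a0", " ")  # no-break space becomes a plain space
    lines = [" ".join(line.split()) for line in text.split("\n")]
    return "\n".join(lines)


# Hand-written sample: decomposed accent, double space, and a no-break space
sample_text = "Cafe\u0301  near\u00a0the market"
cleaned_text = normalize_description(sample_text)
print(non_ascii_characters(sample_text))
print(cleaned_text)
```

Legitimate accented letters are kept; only their encoding form and the surrounding whitespace change.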


In [71]:
# Analyze description lengths
info("Analyzing description lengths...")

desc_lengths = []
no_desc_count = 0

for attraction in attractions_with_wiki:
    desc = attraction.get("description", "")
    if desc:
        desc_lengths.append(len(desc))
    else:
        no_desc_count += 1

if desc_lengths:
    avg = sum(desc_lengths) / len(desc_lengths)
    min_length = min(desc_lengths)
    max_length = max(desc_lengths)

    # Print inside the guard so an empty list cannot raise a NameError
    data("Description Length Statistics:")
    print(f"  - With descriptions: {len(desc_lengths)}")
    print(f"  - Without descriptions: {no_desc_count}")
    print(f"  - Average length: {avg:.0f} characters")
    print(f"  - Min length: {min_length} characters")
    print(f"  - Max length: {max_length} characters")
else:
    warn("No descriptions found; skipping length statistics")

💬 Analyzing description lengths...
📊 Description Length Statistics:
  - With descriptions: 62
  - Without descriptions: 0
  - Average length: 859 characters
  - Min length: 54 characters
  - Max length: 3121 characters
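With a minimum of 54 and a maximum of 3121 characters, the mean alone says little about a typical description. A median is a cheap addition. The sketch below uses the standard library's `statistics` module; `summarize_lengths` is a hypothetical helper and the input numbers are made up for illustration.

```python
import statistics


def summarize_lengths(lengths):
    """Return count, mean, median, min, and max, or None for an empty list."""
    if not lengths:
        return None
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


# Made-up lengths, not values from the dataset
summary = summarize_lengths([50, 200, 950])
print(summary)
```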


## 🔗 Step 06 - Enrich with Location Data

Load complete location information from raw data and merge with Wikipedia descriptions.

In [72]:
task("Loading raw data and enriching with location information...")

raw_data_file = RAW_DATA_DIR / "seattle_attractions_raw.json"
with open(raw_data_file, "r", encoding="utf-8") as f:
    raw_attractions = json.load(f)

# property -> lon, lat, formatted, address_line1, address_line2, city, state
raw_place_id_list = {}
for attraction in raw_attractions:
    props = attraction["properties"]
    place_id = props["place_id"]
    raw_place_id_list[place_id] = {
        "lon": props.get("lon"),
        "lat": props.get("lat"),
        "address": props.get("formatted"),
        "address_line1": props.get("address_line1"),
        "address_line2": props.get("address_line2"),
        "city": props.get("city"),
        "state": props.get("state"),
        "postcode": props.get("postcode"),
    }

for attraction in attractions_with_wiki:
    attr_place_id = attraction.get("place_id")
    location_value = raw_place_id_list.get(attr_place_id)

    if location_value:
        attraction["location"] = location_value

attractions_with_wiki[0]

🚀 Loading raw data and enriching with location information...


{'index': 0,
 'name': 'Seattle Public Library - Central Library',
 'place_id': '5186d2eaed4a955ec059a29297cfa8cd4740f00102f901ba6f35020000000092032853656174746c65205075626c6963204c696272617279202d2043656e7472616c204c696272617279',
 'wiki_data': {'wikidata': 'Q2531939',
  'wikipedia': 'en:Seattle Central Library',
  'wikimedia_commons': 'File:Seattle_(WA,_USA),_Seattle_Central_Library_--_2022_--_200930.jpg',
  'image': 'https://commons.wikimedia.org/wiki/File:Seattle_(WA,_USA),_Seattle_Central_Library_--_2022_--_200930.jpg'},
 'description': 'The Seattle Central Library is the flagship library of the Seattle Public Library system. The 11-story (185 feet or 56.9 meters high) glass and steel building in the downtown core of Seattle, Washington was opened to the public on May 23, 2004. Rem Koolhaas and Joshua Prince-Ramus of OMA/LMN were the principal architects, and Magnusson Klemencic Associates was the structural engineer with Arup. Arup also provided mechanical, electrical, and plumbin
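The merge above silently skips any attraction whose `place_id` is absent from the raw data. A variant that reports the unmatched IDs makes such gaps visible. This is a sketch under assumptions: `attach_locations` is a hypothetical helper and the sample data is hand-written.

```python
def attach_locations(attractions, locations_by_id):
    """Attach a 'location' dict to each attraction; return place_ids with no match."""
    unmatched = []
    for attraction in attractions:
        place_id = attraction.get("place_id")
        location = locations_by_id.get(place_id)
        if location is None:
            unmatched.append(place_id)
        else:
            attraction["location"] = location
    return unmatched


# Hand-written sample: the second attraction has no entry in the lookup table
sample_attractions = [
    {"name": "Place A", "place_id": "id-a"},
    {"name": "Place B", "place_id": "id-b"},
]
sample_locations = {"id-a": {"lat": 47.6, "lon": -122.3, "address": "1 Example Street"}}

unmatched_ids = attach_locations(sample_attractions, sample_locations)
print(unmatched_ids)
```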

## 📄 Step 07 - Create Final Documents
Format the final documents for RAG ingestion following the design from Chapter 1.

In [75]:
task("Creating final documents for RAG...")

documents = []

for attraction in attractions_with_wiki:
    name = attraction.get("name", "Unknown")
    location = attraction.get("location", {})
    # Step 04 stores None for failed lookups, so a .get() default alone would not apply
    description = attraction.get("description") or "No description available"

    doc = f"""Name: {name}

    Location: {location.get("address", "N/A")}

    Coordinates: {location.get("lat", "N/A"), location.get("lon", "N/A")}

    Description: {description}
    """

    documents.append({
        "place_id": attraction.get("place_id"),
        "name": name,
        "document": doc
    })

done(f"Created {len(documents)} documents for RAG")

# Sample
print(documents[0]["document"])

🚀 Creating final documents for RAG...
🏁 Created 62 documents for RAG
Name: Seattle Public Library - Central Library

    Location: Seattle Central Library, 1000 4th Avenue, Seattle, WA 98104, United States of America

    Coordinates: (47.6067142, -122.33269832546111)

    Description: The Seattle Central Library is the flagship library of the Seattle Public Library system. The 11-story (185 feet or 56.9 meters high) glass and steel building in the downtown core of Seattle, Washington was opened to the public on May 23, 2004. Rem Koolhaas and Joshua Prince-Ramus of OMA/LMN were the principal architects, and Magnusson Klemencic Associates was the structural engineer with Arup. Arup also provided mechanical, electrical, and plumbing engineering, as well as fire/life safety, security, IT and communications, and audio visual consulting. Hoffman Construction Company of Portland, Oregon, was the general contractor.
The 362,987 square feet (33,722.6 m2) public library has the capacity t
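In the sample output, the `Location`, `Coordinates`, and `Description` lines are indented by four spaces because the triple-quoted f-string keeps the source indentation, and the coordinates are rendered as a Python tuple. If that is not intended, joining explicit lines avoids both effects. The sketch below is one possible alternative, not the notebook's format; `build_document` is a hypothetical helper and the inputs are placeholders.

```python
def build_document(name, location, description):
    """Build a RAG document with no stray indentation and plain 'lat, lon' coordinates."""
    lat = location.get("lat")
    lon = location.get("lon")
    coordinates = f"{lat}, {lon}" if lat is not None and lon is not None else "N/A"
    lines = [
        f"Name: {name}",
        f"Location: {location.get('address') or 'N/A'}",
        f"Coordinates: {coordinates}",
        f"Description: {description or 'No description available'}",
    ]
    return "\n\n".join(lines)


# Placeholder values, not taken from the dataset
example_doc = build_document(
    "Example Place",
    {"address": "1 Example Street", "lat": 47.6, "lon": -122.3},
    None,
)
print(example_doc)
```

The `Description: ` label is unchanged, so the string searches in Step 08 would still find it.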

## ✅ Step 08 - Quality Validation
Validate the enriched dataset:
- Check for missing descriptions
- Verify data completeness
- Analyze description lengths
- Sample quality review

In [76]:
# Check for missing descriptions
info("Checking document completeness...")

total_docs = len(documents)
missing_location = 0
missing_description = 0
short_description = 0

for doc in documents:
    doc_text = doc["document"]

    if "N/A" in doc_text:
        missing_location += 1

    if "No description available" in doc_text:
        missing_description += 1

    desc_start = doc_text.find("Description: ") + len("Description: ")
    description = doc_text[desc_start:].strip()
    if len(description) < 100 and description != "No description available":
        short_description += 1

print(f"  - Total documents: {total_docs}")
print(f"  - Missing location data: {missing_location}")
print(f"  - Missing descriptions: {missing_description}")
print(f"  - Short descriptions (<100 chars): {short_description}")
print(f"  - Complete documents: {total_docs - missing_location - missing_description}")

💬 Checking document completeness...
  - Total documents: 62
  - Missing location data: 0
  - Missing descriptions: 0
  - Short descriptions (<100 chars): 2
  - Complete documents: 62
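The check above searches the rendered text for `N/A`, which would also match a description that happens to contain that string, and it subtracts both counters from the total, so a document missing both fields would be counted twice. Checking the structured fields avoids both issues. This is a sketch; `completeness_report` is a hypothetical helper and the records are hand-written.

```python
def completeness_report(records, required=("name", "location", "description")):
    """Count records that have every required field and tally missing fields."""
    missing = {field: 0 for field in required}
    complete = 0
    for record in records:
        absent = [field for field in required if not record.get(field)]
        for field in absent:
            missing[field] += 1
        if not absent:
            complete += 1
    return {"total": len(records), "complete": complete, "missing": missing}


# Hand-written records: the second lacks both location and description
sample_records = [
    {"name": "Place A", "location": {"lat": 1.0}, "description": "Rated N/A by some guides."},
    {"name": "Place B", "location": None, "description": None},
]
report = completeness_report(sample_records)
print(report)
```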


In [77]:
info("Checking data completeness...")

complete_count = 0
for attraction in attractions_with_wiki:
    has_name = bool(attraction.get("name"))
    has_location = bool(attraction.get("location"))
    has_description = bool(attraction.get("description"))

    if has_name and has_description and has_location:
        complete_count += 1

print(
    f"  - Complete records: {complete_count}/{len(attractions_with_wiki)} ({complete_count/len(attractions_with_wiki)*100:.1f}%)"
)

💬 Checking data completeness...
  - Complete records: 62/62 (100.0%)


In [78]:
info("Analyzing description lengths...")

desc_lengths = []

for attraction in attractions_with_wiki:
    desc = attraction.get("description", "")
    if desc and desc != "No description available":
        desc_lengths.append(len(desc))

if desc_lengths:
    avg = sum(desc_lengths) / len(desc_lengths)
    max_length = max(desc_lengths)
    min_length = min(desc_lengths)

    data("Description Length Statistics:")
    print(f"  - Average: {avg:.0f} characters")
    print(f"  - Min: {min_length} characters")
    print(f"  - Max: {max_length} characters")

    # Count descriptions in each length bucket
    short = sum(1 for n in desc_lengths if n < 200)
    medium = sum(1 for n in desc_lengths if 200 <= n < 1000)
    long = sum(1 for n in desc_lengths if n >= 1000)

    data("\n  Length Distribution:")
    print(f"  - Short (<200 chars): {short}")
    print(f"  - Medium (200-1000 chars): {medium}")
    print(f"  - Long (≥1000 chars): {long}")

💬 Analyzing description lengths...
📊 Description Length Statistics:
  - Average: 859 characters
  - Min: 54 characters
  - Max: 3121 characters
📊 
  Length Distribution:
  - Short (<200 chars): 10
  - Medium (200-1000 chars): 26
  - Long (≥1000 chars): 26
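The three buckets use the boundaries 200 and 1000, with each lower bound inclusive. If more buckets are needed later, the boundaries can be kept in one place and looked up with `bisect`. The sketch below reproduces the same boundaries; `bucket_lengths` is a hypothetical helper and the input lengths are made up.

```python
from bisect import bisect_right
from collections import Counter


def bucket_lengths(lengths, edges=(200, 1000), labels=("short", "medium", "long")):
    """Count lengths per bucket; a value equal to an edge falls into the upper bucket."""
    counts = Counter(labels[bisect_right(edges, n)] for n in lengths)
    return {label: counts.get(label, 0) for label in labels}


# Made-up lengths chosen to sit on both sides of each boundary
distribution = bucket_lengths([54, 199, 200, 999, 1000, 3121])
print(distribution)
```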


In [79]:
info("Performing sample quality review...")

for i in range(min(3, len(documents))):
    doc_item = documents[i]
    doc_text = doc_item["document"]

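    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"{i + 1}. {doc_item['name']}")

    # Preview the first 200 characters, starting at the "Description: " label
    desc_start = doc_text.find("Description: ")
    desc_text = doc_text[desc_start:]
    preview = desc_text[:200] + "..." if len(desc_text) > 200 else desc_text
    print(preview)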

💬 Performing sample quality review...
1. Seattle Public Library - Central Library
Description: The Seattle Central Library is the flagship library of the Seattle Public Library system. The 11-story (185 feet or 56.9 meters high) glass and steel building in the downtown core of Seat...
2. Space Needle
Description: The Space Needle is an observation tower in Seattle, Washington, United States. Considered to be an icon of the city, it has been designated a Seattle landmark. Located in the Lower Queen...
3. Pike Place Market
Description: Pike Place Market is a public market in Seattle, Washington, United States. It opened on August 17, 1907, and is one of the older continuously operated public farmers' markets in the Unit...


## 💾 Step 09 - Save Enriched Dataset
Save the final enriched dataset and update metadata.

In [None]:
enriched_data = PROCESSED_DATA_DIR / "seattle_attractions_enriched_with_location.json"
with open(enriched_data, "w", encoding="utf-8") as f:
    json.dump(attractions_with_wiki, f, indent=2, ensure_ascii=False)
save(f"Enriched data saved at: {enriched_data}")

💾 Enriched data saved at: c:\Users\dinni\OneDrive\桌面\Travel_rag\data\processed\seattle_attractions_enriched_with_location.json


In [86]:
task("Saving final documents and updating metadata...")

document_file = PROCESSED_DATA_DIR / "seattle_attractions_documents.json"

with open(document_file, "w", encoding="utf-8") as f:
    json.dump(documents, f, indent=2, ensure_ascii=False)

save(f"Saved {len(documents)} documents to: {document_file.name}")

🚀 Saving final documents and updating metadata...
💾 Saved 62 documents to: seattle_attractions_documents.json
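Writing directly to the target file can leave a truncated JSON file behind if the kernel is interrupted mid-write. One common safeguard is to write to a temporary file in the same directory and then swap it into place. The sketch below shows that pattern with the standard library only; `write_json_atomic` is a hypothetical helper, and the demonstration writes to a throwaway directory rather than `PROCESSED_DATA_DIR`.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path, payload):
    """Write payload as JSON to a temp file, then replace the target in one step."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f, indent=2, ensure_ascii=False)
        os.replace(tmp_name, path)  # replaces the target only after a full write
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "documents.json"
    write_json_atomic(target, [{"name": "Caf\u00e9"}])
    round_trip = json.loads(target.read_text(encoding="utf-8"))

print(round_trip)
```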


In [87]:
metadata_file = PROCESSED_DATA_DIR / "metadata.json"

with open(metadata_file, "r", encoding="utf-8") as f:
    metadata = json.load(f)

# Compute shared values once so the metadata dictionary stays readable
total = len(attractions_with_wiki)
complete_records = [
    a
    for a in attractions_with_wiki
    if a.get("name") and a.get("description") and a.get("location")
]

metadata["enrichment"] = {
    "enrichment_date": pd.Timestamp.now().isoformat(),
    "wikipedia_descriptions": {
        "total_attractions": total,
        "with_descriptions": len(
            [a for a in attractions_with_wiki if a.get("description")]
        ),
    },
    # `or ""` guards against descriptions that were set to None in Step 04
    "avg_description_length": sum(
        len(a.get("description") or "") for a in attractions_with_wiki
    )
    / total,
    "location_data": {
        "with_location": len([a for a in attractions_with_wiki if a.get("location")])
    },
    "final_documents": {
        "total_documents": len(documents),
        "file": "seattle_attractions_documents.json",
    },
    "data_quality": {
        "complete_records": len(complete_records),
        "completeness_rate": f"{len(complete_records) / total * 100:.1f}%",
    },
}

with open(metadata_file, "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)

save(f"Updated metadata: {metadata_file.name}")

💾 Updated metadata: metadata.json
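The cell above assumes `metadata.json` already exists from Chapter 1 and raises `FileNotFoundError` otherwise. If this chapter may be run on its own, a loader that falls back to an empty dictionary is one option. This is a sketch: `load_metadata` is a hypothetical helper, and the `collection` key in the demonstration is a made-up placeholder, not the real metadata schema.

```python
import json
import tempfile
from pathlib import Path


def load_metadata(path):
    """Return the parsed metadata file, or an empty dict if the file does not exist."""
    path = Path(path)
    if not path.exists():
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "metadata.json"
    first_load = load_metadata(demo_file)  # file is absent at this point
    demo_file.write_text(json.dumps({"collection": {"city": "Seattle"}}), encoding="utf-8")
    second_load = load_metadata(demo_file)

print(first_load, second_load)
```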


---

# 📋 Chapter 2 Summary

## ✅ Completed Steps

1. **Setup & API Connection** - Successfully connected to Wikipedia API
2. **Fetch Descriptions** - 62/62 attractions (100% success)
3. **Data Cleaning** - No duplicate `place_id` values; 27.4% of descriptions contain non-ASCII characters
4. **Location Enrichment** - Merged lat/lon and address data
5. **Document Creation** - 62 RAG-ready documents
6. **Quality Validation** - 100% complete records

## 📊 Final Results

| Metric | Value |
|--------|-------|
| Total Attractions | 62 |
| With Descriptions | 62 (100%) |
| With Location Data | 62 (100%) |
| Avg Description Length | 859 chars |

## 💾 Output Files

- `seattle_attractions_enriched_with_location.json` - Full enriched data
- `seattle_attractions_documents.json` - RAG-ready documents
- `metadata.json` - Updated with Chapter 2 statistics

## 🔑 Key Takeaways

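- Wikimedia's User-Agent policy expects every API request to carry a descriptive User-Agent header
- All 62 records have a name, a description, and location data (100% completeness)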
- Documents ready for RAG embedding
- Location data successfully merged from raw data