# 📖 Chapter 01 - Data Exploration

## 🎯 Objectives

In this chapter, we explore the Geoapify Places API and the structure of the travel data it returns, using Seattle as a test city.

**What we'll accomplish:**
- Set up Geoapify API credentials
- Explore available data for Seattle
- Understand the data structure and fields
- Analyze data quality and coverage
- Design our document format for RAG

## 📦 Step 01 - Import Libraries

Import the libraries needed to work with the Geoapify Places API, along with the project's configuration values and logging helpers.

In [46]:
import requests
import json
import pandas as pd

from src.config import (
    GEOAPIFY_API_KEY,
    TARGET_CITY,
    CITY_BBOX,
    RAW_DATA_DIR,
    PROCESSED_DATA_DIR,
    GEOAPIFY_BASE_URL,
)
from src.utils.emoji_log import (
    success,
    info,
    error,
    data,
    task,
    save,
    done,
)
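
The `src.config` module is imported above but not shown in this chapter. As a rough idea of what it might contain, here is a minimal sketch. The variable names, the base URL, and the bounding box values match what the notebook imports and prints; the directory layout and the way the key is read from the environment are assumptions.

```python
# Hypothetical sketch of src/config.py; the real module is not shown in this chapter.
import os
from pathlib import Path

# Read the key from the environment. A .env loader could populate
# os.environ before this line runs (assumption).
GEOAPIFY_API_KEY = os.environ.get("GEOAPIFY_API_KEY", "")

# Endpoint as printed in Step 02.
GEOAPIFY_BASE_URL = "https://api.geoapify.com/v2/places"

TARGET_CITY = "Seattle"

# Bounding box as printed in Step 03.
CITY_BBOX = {
    "lon_min": -122.45,
    "lat_min": 47.48,
    "lon_max": -122.22,
    "lat_max": 47.73,
}

# Directory layout is an assumption; only the variable names come from the notebook.
DATA_DIR = Path("data")
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
```

Keeping these values in one module means later chapters can switch cities by editing a single file.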

## 🔑 Step 02 - Set Up API Credentials

Check that the Geoapify API key was loaded from the `.env` file and confirm the API endpoint.

In [2]:
if not GEOAPIFY_API_KEY:
    raise ValueError(error("GEOAPIFY_API_KEY not found in .env file!"))

success("API credentials configured!")
info(f"API key: {GEOAPIFY_API_KEY[:8]}...")
info(f"Base URL: {GEOAPIFY_BASE_URL}")

✅ API credentials configured!
💬 API key: 3199830a...
💬 Base URL: https://api.geoapify.com/v2/places


## ☕ Step 03 - Explore Seattle Data

Fetch tourist attractions within Seattle's bounding box.
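
The bounding box is passed to the API as a `rect:lon_min,lat_min,lon_max,lat_max` filter string. A small helper can build that string and catch swapped coordinates early. This is a sketch: `build_rect_filter` is a hypothetical name, and the range checks assume a box that does not cross the antimeridian.

```python
def build_rect_filter(bbox: dict) -> str:
    """Return a 'rect:lon_min,lat_min,lon_max,lat_max' filter string."""
    lon_min, lat_min = bbox["lon_min"], bbox["lat_min"]
    lon_max, lat_max = bbox["lon_max"], bbox["lat_max"]

    # Longitude comes first in this format, so swapped values are a common mistake.
    if not (-180 <= lon_min < lon_max <= 180):
        raise ValueError(f"Invalid longitude range: {lon_min}..{lon_max}")
    if not (-90 <= lat_min < lat_max <= 90):
        raise ValueError(f"Invalid latitude range: {lat_min}..{lat_max}")

    return f"rect:{lon_min},{lat_min},{lon_max},{lat_max}"


seattle_bbox = {"lon_min": -122.45, "lat_min": 47.48, "lon_max": -122.22, "lat_max": 47.73}
rect_filter = build_rect_filter(seattle_bbox)
print(rect_filter)  # rect:-122.45,47.48,-122.22,47.73
```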

In [3]:
# Geoapify bbox format: rect:lon_min,lat_min,lon_max,lat_max
# Single quotes inside the f-string keep this valid on Python versions before 3.12.
filter_box = f"rect:{CITY_BBOX['lon_min']},{CITY_BBOX['lat_min']},{CITY_BBOX['lon_max']},{CITY_BBOX['lat_max']}"

params = {
    "apiKey": GEOAPIFY_API_KEY,
    "categories": "tourism",
    "filter": filter_box,
    "limit": 500,
}

task(f"Fetching attractions in {TARGET_CITY}...")
info(f"Bounding box: {CITY_BBOX}")

# A timeout keeps the cell from hanging if the API does not respond.
response = requests.get(url=GEOAPIFY_BASE_URL, params=params, timeout=30)

if response.status_code == 200:
    raw_data = response.json()
    attractions = raw_data["features"]
    done(f"Found {len(attractions)} attractions")

    print("Sample attraction:")
    data(json.dumps(attractions[0], indent=2))
else:
    error(f"Error: {response.status_code}")
    error(response.text)

🚀 Fetching attractions in Seattle...
💬 Bounding box: {'lon_min': -122.45, 'lat_min': 47.48, 'lon_max': -122.22, 'lat_max': 47.73}
🏁 Found 500 attractions
Sample attraction:
📊 {
  "type": "Feature",
  "properties": {
    "name": "Seattle Public Library - Central Library",
    "country": "United States",
    "country_code": "us",
    "state": "Washington",
    "county": "King County",
    "city": "Seattle",
    "postcode": "98104",
    "district": "Central Business District",
    "suburb": "First Hill",
    "street": "4th Avenue",
    "housenumber": "1000",
    "iso3166_2": "US-WA",
    "lon": -122.33269832546111,
    "lat": 47.6067142,
    "state_code": "WA",
    "formatted": "Seattle Central Library, 1000 4th Avenue, Seattle, WA 98104, United States of America",
    "address_line1": "Seattle Central Library",
    "address_line2": "1000 4th Avenue, Seattle, WA 98104, United States of America",
    "categories": [
      "building",
      "building.public_and_civil",
      "bu
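
The request above returned exactly 500 attractions, which equals the `limit` we passed, so the bounding box may contain more places than we received. One way to collect everything is to request pages until a short page comes back. The sketch below keeps the paging logic separate from the HTTP call: `fetch_all_features` and `fake_fetch_page` are hypothetical names, and a real `fetch_page` would wrap `requests.get` and pass `offset` and `limit` as query parameters (the `offset` parameter name is an assumption to verify against the Geoapify documentation).

```python
from typing import Callable, Dict, List

Feature = Dict[str, object]


def fetch_all_features(
    fetch_page: Callable[[int, int], List[Feature]],
    page_size: int = 500,
    max_pages: int = 20,
) -> List[Feature]:
    """Collect pages until one comes back short, or max_pages is reached."""
    features: List[Feature] = []
    for page in range(max_pages):
        batch = fetch_page(page * page_size, page_size)
        features.extend(batch)
        if len(batch) < page_size:
            break  # a short page means there is nothing left to fetch
    return features


def fake_fetch_page(offset: int, limit: int) -> List[Feature]:
    """Stand-in for the HTTP call; pretends the API holds 1234 places."""
    total = 1234  # made-up number, for illustration only
    stop = min(offset + limit, total)
    return [{"properties": {"place_id": str(i)}} for i in range(offset, stop)]


all_features = fetch_all_features(fake_fetch_page)
print(len(all_features))  # 1234
```

The `max_pages` cap is a safety net so that a misbehaving endpoint cannot keep the loop running indefinitely.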

## 🔍 Step 04 - Analyze Data Structure

Examine the structure of returned data and available fields.

In [58]:
attractions_list = []
for feature in attractions:
    props = feature.get("properties", {})
    coords = feature.get("geometry", {}).get("coordinates", [None, None])

    attractions_list.append(
        {
            "place_id": props.get("place_id"),
            "name": props.get("name"),
            "category": ", ".join(props.get("categories", [])),
            "address_line1": props.get("address_line1"),
            "address_line2": props.get("address_line2"),
            "city": props.get("city"),
            "state": props.get("state"),
            "postcode": props.get("postcode"),
            "lon": coords[0],
            "lat": coords[1],
            "formatted": props.get("formatted"),
            "datasource": props.get("datasource", {}).get("sourcename"),
        }
    )

df = pd.DataFrame(attractions_list)
df.head(5)

|   | place_id | name | category | address_line1 | address_line2 | city | state | postcode | lon | lat | formatted | datasource |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5186d2eaed4a955ec059a29297cfa8cd4740f00102f901... | Seattle Public Library - Central Library | building, building.public_and_civil, building.... | Seattle Central Library | 1000 4th Avenue, Seattle, WA 98104, United Sta... | Seattle | Washington | 98104 | -122.332698 | 47.606714 | Seattle Central Library, 1000 4th Avenue, Seat... | openstreetmap |
| 1 | 51ea567bfd5a965ec05928ad27f96ccf4740f00102f901... | Space Needle | access, access.yes, building, building.tourism... | Space Needle | 400 Broad Street, Seattle, WA 98109, United St... | Seattle | Washington | 98109 | -122.349304 | 47.620513 | Space Needle, 400 Broad Street, Seattle, WA 98... | openstreetmap |
| 2 | 519c74775d03955ec059508ee7cd97ce4740f00102f901... | Starbucks Reserve | building, building.catering, building.commerci... | Starbucks Reserve | 1124 Pike Street, Seattle, WA 98101, United St... | Seattle | Washington | 98101 | -122.32833 | 47.614008 | Starbucks Reserve, 1124 Pike Street, Seattle, ... | openstreetmap |
| 3 | 51a1e711aad9955ec059d8d473b600ce4740f00102f901... | Pike Place Market | tourism, tourism.attraction, wheelchair, wheel... | Pike Place Market | Post Alley, Seattle, WA 98181, United States o... | Seattle | Washington | 98181 | -122.34141 | 47.609397 | Pike Place Market, Post Alley, Seattle, WA 981... | openstreetmap |
| 4 | 518758b384eb955ec059c2c581e096cd4740f00103f901... | Seattle Great Wheel | fee, tourism, tourism.attraction | Seattle Great Wheel | 1301 Alaskan Way, Seattle, WA 98101, United St... | Seattle | Washington | 98101 | -122.3425 | 47.606167 | Seattle Great Wheel, 1301 Alaskan Way, Seattle... | openstreetmap |


In [12]:
print("Available fields:")
info(f"{df.columns.tolist()}")

Available fields:
💬 ['place_id', 'name', 'category', 'address_line1', 'address_line2', 'city', 'state', 'postcode', 'lon', 'lat', 'formatted', 'datasource']


In [13]:
info(f"Data shape: {df.shape}")

💬 Data shape: (500, 12)


In [14]:
print("First few attractions:")
df[["name", "category", "city", "formatted"]].head(10)

First few attractions:


|   | name | category | city | formatted |
|---|---|---|---|---|
| 0 | Seattle Public Library - Central Library | building, building.public_and_civil, building.... | Seattle | Seattle Central Library, 1000 4th Avenue, Seat... |
| 1 | Space Needle | access, access.yes, building, building.tourism... | Seattle | Space Needle, 400 Broad Street, Seattle, WA 98... |
| 2 | Starbucks Reserve | building, building.catering, building.commerci... | Seattle | Starbucks Reserve, 1124 Pike Street, Seattle, ... |
| 3 | Pike Place Market | tourism, tourism.attraction, wheelchair, wheel... | Seattle | Pike Place Market, Post Alley, Seattle, WA 981... |
| 4 | Seattle Great Wheel | fee, tourism, tourism.attraction | Seattle | Seattle Great Wheel, 1301 Alaskan Way, Seattle... |
| 5 | Merchant's Cafe and Saloon | building, building.catering, building.historic... | Seattle | Merchant's Cafe and Saloon, 109 Yesler Way, Se... |
| 6 | Japanese Garden | access_limited, access_limited.customers, leis... | Seattle | Japanese Garden, 1075 Lake Washington Boulevar... |
| 7 | Seattle Glassblowing Studio | building, building.tourism, tourism, tourism.a... | Seattle | Seattle Glassblowing Studio, 2227 5th Avenue, ... |
| 8 | Sky View Observatory | tourism, tourism.attraction | Seattle | Sky View Observatory, 701 5th Avenue, Seattle,... |
| 9 | Argosy Cruises | building, building.tourism, tourism, tourism.a... | Seattle | Argosy Cruises, 1101 Alaskan Way, Seattle, WA ... |
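
Before relying on any of these fields in a document, it helps to know how often each one is actually filled in. The following is a minimal sketch of a per-field completeness check over a list of flattened records, using only the standard library. `field_completeness` is a hypothetical helper, and the sample records are hand-made for illustration, not real API output.

```python
def field_completeness(records: list, fields: list) -> dict:
    """Return the share of records (0.0 to 1.0) with a non-empty value per field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}

    result = {}
    for field in fields:
        # Treat missing keys, None, and empty strings as "not filled in".
        filled = sum(1 for record in records if record.get(field) not in (None, ""))
        result[field] = filled / total
    return result


# Hand-made records for illustration only.
sample_records = [
    {"name": "Place A", "postcode": "98109"},
    {"name": "Place B", "postcode": None},
    {"name": None, "postcode": "98101"},
    {"name": "Place C"},
]

completeness = field_completeness(sample_records, ["name", "postcode"])
print(completeness)  # {'name': 0.75, 'postcode': 0.5}
```

Applied to `attractions_list`, the same function would show which fields are reliable enough to include in every document.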


## 📊 Step 05 - Get Detailed Information

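Check which attractions carry Wikipedia, Wikidata, or image references in their `wiki_and_media` property. These references are what we will use later to fetch descriptions from Wikipedia.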

In [20]:
# Checking for attractions with Wikipedia information...

attractions_with_wiki = []

for i, feature in enumerate(attractions):
    props = feature.get("properties", {})
    wiki_media = props.get("wiki_and_media")

    if wiki_media:
        attractions_with_wiki.append({
            "index": i,
            "name": props.get("name"),
            "place_id": props.get("place_id"),
            "wiki_data": wiki_media,
        })

info(f"Found {len(attractions_with_wiki)} attractions with Wikipedia/Media data")
info(f"Out of {len(attractions)} total attractions ({len(attractions_with_wiki)/len(attractions)*100:.1f}%)")

💬 Found 181 attractions with Wikipedia/Media data
💬 Out of 500 total attractions (36.2%)


In [22]:
for i, item in enumerate(attractions_with_wiki[:5]):
    print(f"{'='*70}")
    print(f"{i+1}. {item['name']}")
    print(f"{'='*70}")

    wiki_data = item["wiki_data"]

    if "wikipedia" in wiki_data:
        info(f"Wikipedia: {wiki_data['wikipedia']}")

    if "wikidata" in wiki_data:
        info(f"Wikidata: {wiki_data['wikidata']}")

    if "image" in wiki_data:
        info(f"Image: {wiki_data['image'][:80]}...")

    print()

done(f"Displayed {min(5, len(attractions_with_wiki))} attractions with Wikipedia data")

1. Seattle Public Library - Central Library
💬 Wikipedia: en:Seattle Central Library
💬 Wikidata: Q2531939
💬 Image: https://commons.wikimedia.org/wiki/File:Seattle_(WA,_USA),_Seattle_Central_Libra...

2. Space Needle
💬 Wikipedia: en:Space Needle
💬 Wikidata: Q5317

3. Starbucks Reserve
💬 Wikidata: Q111398756

4. Pike Place Market
💬 Wikipedia: en:Pike Place Market
💬 Wikidata: Q1373418

5. Seattle Great Wheel
💬 Wikipedia: en:Seattle Great Wheel
💬 Wikidata: Q7442108

🏁 Displayed 5 attractions with Wikipedia data
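
The `wikipedia` values above follow a `language:Title` pattern, such as `en:Space Needle`. To fetch a description later, we need to split that tag and turn it into an article URL. Here is a minimal sketch; `parse_wikipedia_tag` and `wikipedia_url` are hypothetical helpers, and falling back to English when no language prefix is present is an assumption.

```python
from urllib.parse import quote


def parse_wikipedia_tag(tag: str) -> tuple:
    """Split a tag like 'en:Space Needle' into (language, title)."""
    lang, sep, title = tag.partition(":")
    if not sep:
        # No language prefix found; assume English (assumption).
        return "en", tag.strip()
    return lang.strip(), title.strip()


def wikipedia_url(tag: str) -> str:
    """Build the article URL for a 'language:Title' tag."""
    lang, title = parse_wikipedia_tag(tag)
    # Wikipedia article paths use underscores in place of spaces.
    path = quote(title.replace(" ", "_"))
    return f"https://{lang}.wikipedia.org/wiki/{path}"


print(parse_wikipedia_tag("en:Seattle Central Library"))  # ('en', 'Seattle Central Library')
print(wikipedia_url("en:Seattle Central Library"))
# https://en.wikipedia.org/wiki/Seattle_Central_Library
```

Splitting only on the first colon keeps titles that themselves contain a colon intact.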


## 📈 Step 06 - Data Quality Analysis

Analyze data completeness and quality:
- How many attractions have descriptions?
- What categories are available?
- What data sources are used?

In [24]:
# 1. How many attractions have descriptions?

info(f"Total attractions: {len(df)}")
info(f"Attractions with Wikipedia/Media: {len(attractions_with_wiki)} ({len(attractions_with_wiki)/len(df)*100:.1f}%)")

💬 Total attractions: 500
💬 Attractions with Wikipedia/Media: 181 (36.2%)


In [32]:
# 2. What categories are available?

from collections import Counter

# Split each comma-joined category string back into individual tags.
all_categories = []
for cats in df["category"].dropna():
    all_categories.extend([c.strip() for c in cats.split(",")])

cat_counts = Counter(all_categories)
info(f"Total unique categories: {len(cat_counts)}")

print("Top 10 categories")
for cat, count in cat_counts.most_common(10):
    if cat:
        print(f"- {cat}: {count}")

💬 Total unique categories: 89
Top 10 categories
- tourism: 500
- tourism.attraction: 329
- tourism.attraction.artwork: 279
- tourism.sights: 189
- building: 117
- heritage: 108
- building.historic: 106
- tourism.sights.memorial: 40
- building.tourism: 20
- wheelchair: 19
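
The counts above show that the categories are hierarchical: every `tourism.attraction` place is also tagged `tourism`, so the parent tags add little information. For the RAG documents it may be enough to keep only the most specific tags. The sketch below does that; `leaf_categories` is a hypothetical helper, and treating a dot as the hierarchy separator is inferred from the tag names shown in this chapter.

```python
def leaf_categories(categories: list) -> list:
    """Keep only tags that are not a dotted prefix of another tag in the list."""
    unique = sorted(set(categories))
    return [
        cat
        for cat in unique
        if not any(other.startswith(cat + ".") for other in unique)
    ]


# Category list of the Seattle Great Wheel, as shown in Step 04.
great_wheel = ["fee", "tourism", "tourism.attraction"]
print(leaf_categories(great_wheel))  # ['fee', 'tourism.attraction']
```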


In [35]:
# 3. What data sources are used?

source_counts = df["datasource"].value_counts()
for source, count in source_counts.items():
    print(f"- {source}: {count} attractions")

- openstreetmap: 500 attractions


## 📝 Step 07 - Design Document Format

Based on the data structure, design our document format for RAG.

**Proposed format:**

Name: [Attraction Name]

Categories: [Categories]

Location: [Full Address]

Coordinates: [Latitude, Longitude]

Description: [Wikipedia extract or description]

In [59]:
sample = attractions_with_wiki[0]
sample_row = df[df["place_id"] == sample["place_id"]].iloc[0]

example_doc = f"""Name: {sample_row["name"]}

Categories: {sample_row["category"]}

Location: {sample_row["formatted"]}

Coordinates: {sample_row["lat"]}, {sample_row["lon"]}

Description: [Will be fetched from Wikipedia: {sample["wiki_data"].get("wikipedia", "N/A")}]
"""

print(example_doc)

Name: Seattle Public Library - Central Library

Categories: building, building.public_and_civil, building.tourism, education, education.library, internet_access, tourism, tourism.attraction, wheelchair, wheelchair.yes

Location: Seattle Central Library, 1000 4th Avenue, Seattle, WA 98104, United States of America

Coordinates: 47.606714200029515, -122.33269832546111

Description: [Will be fetched from Wikipedia: en:Seattle Central Library]
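
Once descriptions are available, the same template can be wrapped in a function so every attraction is rendered the same way. This is a sketch under the assumption that each record is a dict with the column names from Step 04; `build_document` is a hypothetical name, and using `N/A` for a missing description is a placeholder choice.

```python
from typing import Optional


def build_document(record: dict, description: Optional[str] = None) -> str:
    """Render one attraction in the proposed Name/Categories/Location format."""
    lines = [
        f"Name: {record['name']}",
        f"Categories: {record['category']}",
        f"Location: {record['formatted']}",
        f"Coordinates: {record['lat']}, {record['lon']}",
        # Fall back to a placeholder when no description was fetched.
        f"Description: {description or 'N/A'}",
    ]
    return "\n\n".join(lines)


# Values copied from the example document printed above.
library_record = {
    "name": "Seattle Public Library - Central Library",
    "category": "building, building.public_and_civil, building.tourism, education, "
                "education.library, internet_access, tourism, tourism.attraction, "
                "wheelchair, wheelchair.yes",
    "formatted": "Seattle Central Library, 1000 4th Avenue, Seattle, WA 98104, "
                 "United States of America",
    "lat": 47.606714200029515,
    "lon": -122.33269832546111,
}

doc = build_document(library_record)
print(doc)
```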



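## 💾 Step 08 - Save Data

Save the raw API response, the subset of attractions with a Wikipedia link, and a metadata summary to disk.

In [47]: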
# ========================================
# DATA LAKE (Raw Data)
# ========================================
raw_file = RAW_DATA_DIR / "seattle_attractions_raw.json"
with open(raw_file, "w", encoding="utf-8") as f:
    json.dump(attractions, f, indent=2, ensure_ascii=False)
save(f"Raw data: {raw_file.name} ({len(attractions)} attractions)")

💾 Raw data: seattle_attractions_raw.json (500 attractions)


In [57]:
# ========================================
# DATA WAREHOUSE (Processed Data)
# ========================================
attractions_with_wikipedia = [
    item for item in attractions_with_wiki if "wikipedia" in item["wiki_data"]
]

processed_file = PROCESSED_DATA_DIR / "seattle_attractions_with_wikipedia.json"
with open(processed_file, "w", encoding="utf-8") as f:
    json.dump(attractions_with_wikipedia, f, indent=2, ensure_ascii=False)
save(f"Processed data: {processed_file.name} ({len(attractions_with_wikipedia)} attractions)")

💾 Processed data: seattle_attractions_with_wikipedia.json (62 attractions)


In [None]:
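# Entries with wiki/media data but no Wikipedia link (for example, Wikidata-only entries).
wiki_only_wikidata = [item for item in attractions_with_wiki if "wikipedia" not in item["wiki_data"]]
# Entries that include an image reference.
wiki_with_image = [item for item in attractions_with_wiki if "image" in item["wiki_data"]]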

22

In [56]:
# 3. Save metadata
metadata = {
    "dataset": "Seattle Tourist Attractions",
    "source": "Geoapify Places API",
    "city": TARGET_CITY,
    "bbox": CITY_BBOX,
    "collection_date": pd.Timestamp.now().isoformat(),
    "raw_data": {
        "total_attractions": len(attractions),
        "file": "seattle_attractions_raw.json"
    },
    "processed_data": {
        "attractions_with_wikipedia": len(attractions_with_wikipedia),
        "filter_criteria": "Has Wikipedia link for description",
        "file": "seattle_attractions_with_wikipedia.json"
    },
    "data_quality": {
        "total_with_wiki_data": len(attractions_with_wiki),
        "with_wikipedia_link": len(attractions_with_wikipedia),
        "only_wikidata": len(wiki_only_wikidata),
        "with_images": len(wiki_with_image)
    }
}
metadata_file = PROCESSED_DATA_DIR / "metadata.json"
with open(metadata_file, 'w', encoding='utf-8') as f:
    json.dump(metadata, f, indent=2)
save(f"Metadata: {metadata_file.name}")

💾 Metadata: metadata.json
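
Because the notebook cells can be run out of order, the counts written to `metadata.json` may drift apart. A quick consistency check on the `data_quality` block can catch that. The sketch below assumes the key names used in the metadata cell above; `check_wiki_counts` is a hypothetical helper. The example uses the 181 and 62 reported earlier, with 119 derived as their difference.

```python
def check_wiki_counts(data_quality: dict) -> list:
    """Return a list of human-readable problems; an empty list means the counts agree."""
    problems = []
    total = data_quality["total_with_wiki_data"]
    with_link = data_quality["with_wikipedia_link"]
    without_link = data_quality["only_wikidata"]

    # Every entry either has a Wikipedia link or it does not.
    if with_link + without_link != total:
        problems.append(
            f"{with_link} + {without_link} does not add up to {total}"
        )

    # The image count is optional here; it can never exceed the total.
    if data_quality.get("with_images", 0) > total:
        problems.append("with_images exceeds total_with_wiki_data")

    return problems


example_quality = {
    "total_with_wiki_data": 181,
    "with_wikipedia_link": 62,
    "only_wikidata": 119,  # derived: 181 - 62
}
print(check_wiki_counts(example_quality))  # []
```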
