# Synthetic GeoJSON Fusion Dataset Generator

This notebook explores the provided GeoJSON assets and offers tooling to synthesize large sets of tweet-like records that blend hurricane situational data with U.S. city metadata. Adjust the configuration cells to produce datasets as large as you need.

## Workflow

1. Inspect the hurricane and city GeoJSON inputs to understand their schema.
2. Define a reusable generator that fuses hurricane points with city metadata and text templates.
3. Tune the dataset size and export options to create massive synthetic samples on demand.

In [18]:
from pathlib import Path
import json
from collections import Counter
from datetime import datetime, timedelta
import random
import math
from typing import Dict, Any, Iterator, Optional

BASE_PATH = Path('C:/Users/colto/Documents/GitHub/Tweet_project') / 'data' / 'geojson'
if not BASE_PATH.exists():
    raise FileNotFoundError(f'Expected GeoJSON directory at {BASE_PATH!s}')
BASE_PATH

WindowsPath('C:/Users/colto/Documents/GitHub/Tweet_project/data/geojson')

## Inspect hurricane GeoJSON files

We gather headline statistics about the storm feature collections to understand how many facilities, place mentions, and timestamps they provide for sampling.

In [19]:
hurricane_files = ['francine.geojson', 'helene.geojson']
for name in hurricane_files:
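    # Read explicitly as UTF-8 so the result does not depend on the platform default encoding.
    data = json.loads((BASE_PATH / name).read_text(encoding='utf-8'))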
    features = data.get('features', [])
    facilities = Counter()
    places = Counter()
    times = []
    latitudes = []
    longitudes = []

    for feature in features:
        props = feature.get('properties', {})
        fac = props.get('FAC')
        place = props.get('GPE')
        timestamp = props.get('time')
        lat = props.get('Latitude')
        lon = props.get('Longitude')

        if fac:
            facilities[fac] += 1
        if place:
            places[place] += 1
        if timestamp:
            try:
                times.append(datetime.fromisoformat(timestamp.replace('Z', '+00:00')))
            except ValueError:
                pass
        if lat not in (None, ''):
            latitudes.append(float(lat))
        if lon not in (None, ''):
            longitudes.append(float(lon))

    print(f"{name}: {len(features)} features")
    print(f"  unique facilities: {len(facilities)}")
    print(f"  unique place mentions: {len(places)}")
    if times:
        print(f"  time span: {min(times).isoformat()} -> {max(times).isoformat()}")
    if latitudes and longitudes:
        print(f"  latitude range: {min(latitudes):.3f} -> {max(latitudes):.3f}")
        print(f"  longitude range: {min(longitudes):.3f} -> {max(longitudes):.3f}")
    print(f"  top facilities: {facilities.most_common(5)}")

francine.geojson: 2303 features
  unique facilities: 29
  unique place mentions: 290
  time span: 2024-09-09T11:00:36+00:00 -> 2024-09-16T15:24:14+00:00
  latitude range: 25.774 -> 41.876
  longitude range: -106.487 -> -80.194
  top facilities: [('I-10', 5), ('I-65', 2), ('I-35', 2), ('Interstate 10', 2), ('Tulane Drive', 2)]
helene.geojson: 3007 features
  unique facilities: 59
  unique place mentions: 478
  time span: 2024-09-26T02:29:25+00:00 -> 2024-09-27T19:59:41+00:00
  latitude range: 25.582 -> 41.681
  longitude range: -90.052 -> -76.274
  top facilities: [('I-4', 26), ('Bayshore Boulevard', 5), ('Capitol', 4), ('Freedom Parkway', 4), ('Amalie Arena', 4)]
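
The inspection loop rewrites a trailing `Z` to `+00:00` before calling `datetime.fromisoformat`, because Python versions before 3.11 reject the `Z` suffix. A minimal sketch of that step as a reusable helper follows; the name `parse_timestamp` is hypothetical and does not appear elsewhere in this notebook.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string, returning None for missing or malformed input."""
    if not value:
        return None
    try:
        # Replace the 'Z' (UTC) suffix with an explicit offset that
        # fromisoformat accepts on every supported Python version.
        return datetime.fromisoformat(value.replace('Z', '+00:00'))
    except ValueError:
        return None


parsed = parse_timestamp('2024-09-09T11:00:36Z')
print(parsed)
print(parse_timestamp('not a timestamp'))
```

Returning `None` instead of raising mirrors the notebook's choice to skip unparseable timestamps silently; a stricter pipeline might prefer to count or log them.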


## Inspect U.S. city GeoJSON file

Population distributions and timezone coverage help determine how rich the downstream synthetic dataset can be.

In [20]:
city_data = json.loads((BASE_PATH / 'us_cities.geojson').read_text(encoding='utf-8'))
city_features = city_data.get('features', [])
print(f"us_cities.geojson: {len(city_features)} records")
populations = []
timezones = Counter()
for feature in city_features:
    props = feature.get('properties', {})
    population = props.get('population')
    timezone = props.get('timezone')
    if population not in (None, ''):
        populations.append(int(population))
    if timezone:
        timezones[timezone] += 1

print(f"  population min/max: {min(populations)} -> {max(populations)}")
print(f"  population mean: {sum(populations)/len(populations):,.1f}")
sorted_pops = sorted(populations)
print(f"  population median: {sorted_pops[len(sorted_pops)//2]:,}")
print(f"  sample timezones: {timezones.most_common(5)}")

us_cities.geojson: 7471 records
  population min/max: 5001 -> 8804190
  population mean: 33,715.2
  population median: 13,277
  sample timezones: [('America/New_York', 3509), ('America/Chicago', 2020), ('America/Los_Angeles', 1063), ('America/Denver', 324), ('America/Detroit', 182)]
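
The cell above takes the median as the middle element of the sorted list, which is exact for an odd number of records and picks the upper of the two middle values for an even number. As an alternative sketch, the standard `statistics` module can compute the median and quartiles directly. The `populations` values below are hypothetical stand-ins, not the real city data.

```python
import statistics

# Hypothetical stand-in values; in the notebook this would be the
# `populations` list built from us_cities.geojson.
populations = [5001, 7200, 13277, 48000, 8804190]

median = statistics.median(populations)
mean = statistics.fmean(populations)
# quantiles(n=4) returns the three cut points that split the data into quartiles.
q1, q2, q3 = statistics.quantiles(populations, n=4)

print(f"median: {median:,}")
print(f"mean: {mean:,.1f}")
print(f"quartiles: {q1:,.0f} / {q2:,.0f} / {q3:,.0f}")
```

With a distribution as skewed as city populations, the mean sits far above the median, so quartiles give a better sense of how the sampled cities will be distributed.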


## Synthetic dataset generator

The generator blends hurricane observations with randomly paired U.S. cities, perturbs geographic coordinates, and composes narrative strings. It exposes helpers for bulk iteration and CSV export so you can scale to millions of rows.

In [21]:
def _haversine(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    radius_km = 6371.0
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return radius_km * c


class SyntheticTweetDatasetGenerator:
    '''Fuse hurricane GeoJSON features with city metadata to create synthetic samples.'''

    def __init__(self, base_path: Path):
        self.base_path = Path(base_path)
        self.events = self._load_events()
        self.cities = self._load_cities()
        self.event_pool = [
            (event_name, feature)
            for event_name, features in self.events.items()
            for feature in features
        ]
        if not self.event_pool:
            raise ValueError('No event data loaded from hurricane GeoJSON files.')
        if not self.cities:
            raise ValueError('No city data loaded from the us_cities GeoJSON file.')

    def _load_events(self) -> Dict[str, list]:
        events: Dict[str, list] = {}
        for filename in ['francine.geojson', 'helene.geojson']:
            data = json.loads((self.base_path / filename).read_text(encoding='utf-8'))
            features = []
            for feature in data.get('features', []):
                props = feature.get('properties', {})
                geometry = feature.get('geometry', {})
                timestamp = props.get('time')
                parsed_time = None
                if timestamp:
                    cleaned = timestamp.replace('Z', '+00:00')
                    try:
                        parsed_time = datetime.fromisoformat(cleaned)
                    except ValueError:
                        parsed_time = None
                latitude = props.get('Latitude')
                longitude = props.get('Longitude')
                features.append({
                    'facility': props.get('FAC'),
                    'place': props.get('GPE'),
                    'latitude': float(latitude) if latitude not in (None, '') else None,
                    'longitude': float(longitude) if longitude not in (None, '') else None,
                    'time': parsed_time,
                    'geometry_type': geometry.get('type'),
                })
            events[filename.split('.')[0]] = features
        return events

    def _load_cities(self) -> list:
        data = json.loads((self.base_path / 'us_cities.geojson').read_text(encoding='utf-8'))
        cities = []
        for feature in data.get('features', []):
            props = feature.get('properties', {})
            cities.append({
                'geonameid': props.get('geonameid'),
                'name': props.get('name'),
                'latitude': float(props.get('latitude')),
                'longitude': float(props.get('longitude')),
                'population': (int(props.get('population'))
                               if props.get('population') not in (None, '') else None),
                'timezone': props.get('timezone'),
                'pop_category': props.get('pop_category'),
            })
        return cities

    def describe_sources(self) -> Dict[str, Any]:
        summary = {'events': {}, 'cities': {}}
        for event_name, features in self.events.items():
            facilities = {f['facility'] for f in features if f['facility']}
            places = {f['place'] for f in features if f['place']}
            times = [f['time'] for f in features if f['time']]
            summary['events'][event_name] = {
                'records': len(features),
                'unique_facilities': len(facilities),
                'unique_places': len(places),
                'time_range': (
                    (min(times).isoformat(), max(times).isoformat()) if times else None
                ),
            }
        summary['cities'] = {
            'records': len(self.cities),
            'population_min': min(
                (c['population'] for c in self.cities if c['population'] is not None),
                default=None,
            ),
            'population_max': max(
                (c['population'] for c in self.cities if c['population'] is not None),
                default=None,
            ),
        }
        return summary

    def iter_synthetic_records(self, size: int, seed: Optional[int] = None) -> Iterator[Dict[str, Any]]:
        rng = random.Random(seed)
        narrative_templates = [
            "{event} update: {facility} in {place} is coordinating support with leaders in {city} (pop {population:,}).",
            "Emergency crews from {facility} ({event}) are staging near {city}, {timezone} timezone, to assist {place} region.",
            "{city} (population {population:,}) is receiving {event} briefings about {facility} operations near {place}.",
            "Situation report: {facility} teams tied to {event} are aligning with {city} officials to cover {place}.",
        ]
        for idx in range(1, size + 1):
            event_name, event_feature = rng.choice(self.event_pool)
            city = rng.choice(self.cities)
            event_time = event_feature.get('time')
            if event_time:
                jitter = timedelta(minutes=rng.randint(-240, 240))
                jittered_time = event_time + jitter
            else:
                jittered_time = None
            base_lat = (
                event_feature.get('latitude')
                if event_feature.get('latitude') is not None
                else city['latitude']
            )
            base_lon = (
                event_feature.get('longitude')
                if event_feature.get('longitude') is not None
                else city['longitude']
            )
            lat_noise = rng.gauss(0, 0.18)
            lon_noise = rng.gauss(0, 0.18)
            synthetic_lat = max(-90, min(90, base_lat + lat_noise))
            synthetic_lon = max(-180, min(180, base_lon + lon_noise))
            distance_km = _haversine(synthetic_lat, synthetic_lon, city['latitude'], city['longitude'])
            template = rng.choice(narrative_templates)
            narrative = template.format(
                event=event_name.title(),
                facility=event_feature.get('facility') or 'local teams',
                place=event_feature.get('place') or 'the impact area',
                city=city['name'],
                population=city['population'] or 0,
                timezone=city['timezone'] or 'UTC',
            )
            yield {
                'sample_id': idx,
                'event': event_name,
                'event_time': jittered_time.isoformat() if jittered_time else None,
                'facility': event_feature.get('facility'),
                'place_mention': event_feature.get('place'),
                'geometry_type': event_feature.get('geometry_type'),
                'synthetic_latitude': round(synthetic_lat, 6),
                'synthetic_longitude': round(synthetic_lon, 6),
                'reference_latitude': event_feature.get('latitude'),
                'reference_longitude': event_feature.get('longitude'),
                'city_name': city['name'],
                'city_geonameid': city['geonameid'],
                'city_population': city['population'],
                'city_timezone': city['timezone'],
                'city_pop_category': city['pop_category'],
                'distance_to_city_km': round(distance_km, 2),
                'urgency_score': round(rng.uniform(0.2, 0.98), 3),
                'narrative': narrative,
                'data_source': 'synthetic_fusion_v1',
            }

    def generate_dataset(self, size: int, seed: Optional[int] = None) -> list:
        return list(self.iter_synthetic_records(size, seed=seed))

    def write_csv(self, output_path: Path, size: int, seed: Optional[int] = None) -> Path:
        import csv
        iterator = self.iter_synthetic_records(size, seed=seed)
        try:
            first = next(iterator)
        except StopIteration as exc:
            raise ValueError('Requested dataset size must be positive.') from exc
        fieldnames = list(first.keys())
        output_path = Path(output_path)
        output_path.parent.mkdir(parents=True, exist_ok=True)
        with output_path.open('w', newline='', encoding='utf-8') as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerow(first)
            for record in iterator:
                writer.writerow(record)
        return output_path
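
The coordinate perturbation in `iter_synthetic_records` draws Gaussian noise with a standard deviation of 0.18 degrees for both latitude and longitude. The sketch below converts that figure to approximate kilometres, assuming a spherical Earth with the same 6371 km radius used in `_haversine`. The helper name `degrees_to_km` and the example latitude are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same radius as `_haversine`
SIGMA_DEG = 0.18          # standard deviation passed to rng.gauss


def degrees_to_km(sigma_deg: float, latitude_deg: float) -> tuple:
    """Return (north-south km, east-west km) spanned by sigma_deg at a latitude."""
    km_per_degree = math.pi * EARTH_RADIUS_KM / 180.0
    north_south = sigma_deg * km_per_degree
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns_km, ew_km = degrees_to_km(SIGMA_DEG, 27.95)  # roughly the latitude of Tampa
print(f"north-south: {ns_km:.1f} km, east-west: {ew_km:.1f} km")
```

One standard deviation is therefore about 20 km north-south and somewhat less east-west at Gulf Coast latitudes, so most synthetic points stay within a few tens of kilometres of their source observation.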

In [22]:
import pprint

generator = SyntheticTweetDatasetGenerator(BASE_PATH)
source_summary = generator.describe_sources()
pprint.pprint(source_summary)

{'cities': {'population_max': 8804190, 'population_min': 5001, 'records': 7471},
 'events': {'francine': {'records': 2303,
                         'time_range': ('2024-09-09T11:00:36+00:00',
                                        '2024-09-16T15:24:14+00:00'),
                         'unique_facilities': 29,
                         'unique_places': 290},
            'helene': {'records': 3007,
                       'time_range': ('2024-09-26T02:29:25+00:00',
                                      '2024-09-27T19:59:41+00:00'),
                       'unique_facilities': 59,
                       'unique_places': 478}}}


## Configure dataset size

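Adjust `SAMPLE_SIZE` (and optionally the random `SEED`) to control how many synthetic records are produced. Increase the number into the millions to stress-test downstream analytics. Note that `generate_dataset` builds the full list in memory, so for very large sizes prefer `iter_synthetic_records` or `write_csv`, which yield records one at a time.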

In [23]:
SAMPLE_SIZE = 5000  # Change this to scale the dataset size (e.g., 100_000 or 1_000_000)
SEED = 42
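
The generator creates a private `random.Random(seed)` instance on each call, so a fixed `SEED` should reproduce the same records while leaving the global `random` state untouched. A minimal sketch of that behaviour, using a hypothetical `draw` helper that mimics only the `urgency_score` draw:

```python
import random


def draw(seed, n=5):
    """Draw n urgency-like values from a private generator seeded with `seed`."""
    rng = random.Random(seed)
    return [round(rng.uniform(0.2, 0.98), 3) for _ in range(n)]


first = draw(42)
second = draw(42)
other = draw(7)

print(first == second)  # same seed, same sequence
print(first == other)   # a different seed gives a different sequence
```

Passing `seed=None` makes `random.Random` seed itself from the operating system or the current time, so repeated runs will differ.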

In [24]:
synthetic_records = generator.generate_dataset(SAMPLE_SIZE, seed=SEED)
len(synthetic_records)

5000

In [25]:
for record in synthetic_records[:5]:
    print(json.dumps(record, indent=2))

{
  "sample_id": 1,
  "event": "helene",
  "event_time": "2024-09-27T16:07:34+00:00",
  "facility": "",
  "place_mention": "FL, Tampa",
  "geometry_type": "Point",
  "synthetic_latitude": 27.9406,
  "synthetic_longitude": -82.593169,
  "reference_latitude": 27.9477595,
  "reference_longitude": -82.458444,
  "city_name": "Hidden Valley",
  "city_geonameid": 4258871,
  "city_population": 5387,
  "city_timezone": "America/New_York",
  "city_pop_category": "medium",
  "distance_to_city_km": 1264.96,
  "urgency_score": 0.774,
  "narrative": "Emergency crews from local teams (Helene) are staging near Hidden Valley, America/New_York timezone, to assist FL, Tampa region.",
  "data_source": "synthetic_fusion_v1"
}
{
  "sample_id": 2,
  "event": "helene",
  "event_time": "2024-09-27T10:54:20+00:00",
  "facility": "",
  "place_mention": "Florida",
  "geometry_type": "Point",
  "synthetic_latitude": 27.717714,
  "synthetic_longitude": -81.443124,
  "reference_latitude": 27.7567667,
  "reference_lo

In [26]:
event_counts = Counter(rec['event'] for rec in synthetic_records)
facility_counts = Counter(rec['facility'] for rec in synthetic_records if rec['facility'])
urgency_values = [rec['urgency_score'] for rec in synthetic_records]
distance_values = [rec['distance_to_city_km'] for rec in synthetic_records]
print('Event counts:', event_counts)
print('Top facilities:', facility_counts.most_common(10))
print(f"Urgency range: {min(urgency_values):.3f} -> {max(urgency_values):.3f}")
print(f"Distance range: {min(distance_values):.1f} -> {max(distance_values):.1f} km")

Event counts: Counter({'helene': 2816, 'francine': 2184})
Top facilities: [('I-4', 21), ('River City Marketplace', 7), ('White Settlement', 6), ('US-21', 5), ('Bayshore Boulevard', 5), ('Highway 90', 4), ('Lake Lure Dam', 4), ('US15', 4), ('Amalie Arena', 4), ('Deerfield Road', 4)]
Urgency range: 0.200 -> 0.980
Distance range: 14.5 -> 7766.1 km
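
The summary above works on a list that is already in memory. For much larger sample sizes, the same kind of statistics can be accumulated in a single pass over an iterator. The sketch below uses a hypothetical `fake_records` function as a stand-in for `generator.iter_synthetic_records`, so it runs without the GeoJSON files. The `summarize` helper is likewise an illustration, not part of the notebook.

```python
from collections import Counter
import random
from typing import Any, Dict, Iterable, Iterator


def fake_records(size: int, seed: int = 0) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for generator.iter_synthetic_records."""
    rng = random.Random(seed)
    for _ in range(size):
        yield {
            'event': rng.choice(['francine', 'helene']),
            'urgency_score': round(rng.uniform(0.2, 0.98), 3),
        }


def summarize(records: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Accumulate counts and min/max in one pass without storing the records."""
    event_counts: Counter = Counter()
    lo, hi = float('inf'), float('-inf')
    total = 0
    for rec in records:
        event_counts[rec['event']] += 1
        lo = min(lo, rec['urgency_score'])
        hi = max(hi, rec['urgency_score'])
        total += 1
    return {'total': total, 'events': event_counts, 'urgency_range': (lo, hi)}


summary = summarize(fake_records(10_000, seed=42))
print(summary)
```

Memory use stays constant regardless of `size`, because only the running counters are kept between iterations.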


## Optional: export directly to CSV

Run the next cell to stream records straight to disk without holding them all in memory. Adjust `CSV_SAMPLE_SIZE` to the volume you need.

In [27]:
OUTPUT_DIR = Path('..') / 'data' / 'generated_samples'
CSV_SAMPLE_SIZE = 10000  # Tweak as needed for large exports
csv_path = generator.write_csv(OUTPUT_DIR / f'synthetic_samples_{CSV_SAMPLE_SIZE}.csv',
                                 size=CSV_SAMPLE_SIZE, seed=SEED)
csv_path

WindowsPath('../data/generated_samples/synthetic_samples_10000.csv')
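
When the exported file is read back with the standard `csv` module, every field arrives as a string, and values that were `None` at write time come back as empty strings. The sketch below shows one way to restore types for two columns. It writes a small hypothetical file to a temporary directory rather than touching the real export, and the row values are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical two-row file standing in for the exported CSV.
rows = [
    {'sample_id': 1, 'event': 'helene', 'facility': '', 'city_population': 5387},
    {'sample_id': 2, 'event': 'francine', 'facility': 'I-10', 'city_population': None},
]
demo_path = Path(tempfile.mkdtemp()) / 'synthetic_samples_demo.csv'
with demo_path.open('w', newline='', encoding='utf-8') as handle:
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# csv returns every field as a string, so numeric columns need explicit casts.
with demo_path.open(newline='', encoding='utf-8') as handle:
    loaded = [
        {
            **row,
            'sample_id': int(row['sample_id']),
            'city_population': (int(row['city_population'])
                                if row['city_population'] else None),
        }
        for row in csv.DictReader(handle)
    ]
print(loaded)
```

The same casting pattern would apply to the float columns such as `synthetic_latitude` and `urgency_score`.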