# Testing data.gov API Access

Let's explore how to fetch open government data listed on data.gov, using the Socrata Open Data API (SODA) through the `sodapy` client.

In [1]:
# Install sodapy if needed
# !pip install sodapy

from sodapy import Socrata
import pandas as pd
import os
from pathlib import Path

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


## Available Data Sources

Data.gov is a catalog: many of the datasets it lists are hosted on Socrata-powered portals and served through the Socrata Open Data API (SODA). Each portal has its own domain:

- **data.cityofnewyork.us** - NYC Open Data
- **data.seattle.gov** - Seattle Open Data
- **data.cityofchicago.org** - Chicago Data Portal
- **chronicdata.cdc.gov** - CDC Data

Popular datasets:
- NYC 311 Service Requests: `fhrw-4uyv` (NYC)
- COVID-19 Case Surveillance: `vbim-akqf` (CDC)
- Chicago Crimes: `ijzp-q8t2` (Chicago)
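
Each dataset ID above is a Socrata "four-by-four" identifier; combined with the portal domain, it maps to a REST resource URL. Below is a minimal sketch of that mapping, assuming the standard SODA endpoint layout (`/resource/<id>.json` with a `$limit` query parameter) and lowercase alphanumeric IDs. The helper name `build_resource_url` is hypothetical and not part of `sodapy`.

```python
import re
from urllib.parse import urlencode

# Assumed pattern for Socrata "four-by-four" dataset IDs such as fhrw-4uyv
FOUR_BY_FOUR = re.compile(r"^[a-z0-9]{4}-[a-z0-9]{4}$")


def build_resource_url(domain, dataset_id, limit=10):
    """Return the JSON resource URL for a dataset (hypothetical helper)."""
    if not FOUR_BY_FOUR.match(dataset_id):
        raise ValueError(f"Not a four-by-four dataset ID: {dataset_id!r}")
    # safe="$" keeps the parameter name readable instead of %24limit
    query = urlencode({"$limit": limit}, safe="$")
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


print(build_resource_url("chronicdata.cdc.gov", "vbim-akqf", limit=5))
```

Opening the printed URL in a browser is a quick way to check whether a portal responds at all before involving `sodapy`.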

In [2]:
# Test with NYC 311 Service Requests (smaller sample)
dataset_id = "fhrw-4uyv"
domain = "data.cityofnewyork.us"

print(f"Testing with dataset: {dataset_id}")
print(f"Domain: {domain}")

try:
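    # None = no app token; anonymous requests work but are throttled more strictly
    client = Socrata(domain, None)
    # Creating the client sends no request yet; the first network call is client.get() below
    print("✓ Connected to Socrata API")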
    
    # Fetch just 10 records to test
    results = client.get(dataset_id, limit=10)
    df = pd.DataFrame.from_records(results)
    
    print(f"\n✓ Successfully fetched {len(df)} records!")
    print(f"Columns: {list(df.columns)}")
    
except Exception as e:
    print(f"❌ Error: {e}")



Testing with dataset: fhrw-4uyv
Domain: data.cityofnewyork.us
✓ Connected to Socrata API
❌ Error: HTTPSConnectionPool(host='data.cityofnewyork.us', port=443): Read timed out. (read timeout=10)
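
The `read timeout=10` in the error matches `sodapy`'s default request timeout of 10 seconds. One way to tolerate slow portals is to pass a larger `timeout` to `Socrata(...)` and retry with backoff. The sketch below is only an illustration: `with_retries` and `fetch_sample` are hypothetical helpers, and the 60-second timeout and three attempts are arbitrary assumptions rather than recommended values.

```python
import time


def with_retries(fn, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fn() until it succeeds, waiting base_delay * 2**n between tries."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: this is only a smoke test
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


def fetch_sample(domain, dataset_id, limit=10, timeout=60):
    """Fetch a few records with a longer timeout (hypothetical helper)."""
    from sodapy import Socrata  # imported lazily so with_retries stands alone

    client = Socrata(domain, None, timeout=timeout)
    try:
        return with_retries(lambda: client.get(dataset_id, limit=limit))
    finally:
        client.close()
```

A call such as `fetch_sample("data.cityofnewyork.us", "fhrw-4uyv")` would then give the NYC portal up to three attempts of 60 seconds each before failing.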


In [None]:
# Display the data
if 'df' in locals() and not df.empty:
    display(df.head())
    print(f"\nShape: {df.shape}")
    print(f"\nData types:\n{df.dtypes}")
else:
    print("No data loaded yet")

## Try Different Datasets

Let's try fetching from different sources:

In [3]:
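# Helper for smoke-testing any Socrata dataset
def test_dataset(domain, dataset_id, name, limit=10):
    """
    Fetch up to `limit` records from a Socrata dataset.

    Returns a DataFrame on success, or None if the request fails.
    """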
    print(f"\n{'='*60}")
    print(f"Testing: {name}")
    print(f"Dataset ID: {dataset_id}")
    print(f"Domain: {domain}")
    print(f"{'='*60}")
    
    try:
        client = Socrata(domain, None)
        results = client.get(dataset_id, limit=limit)
        df = pd.DataFrame.from_records(results)
        
        print(f"✓ Success! Fetched {len(df)} records")
        print(f"Columns ({len(df.columns)}): {', '.join(df.columns[:5].tolist())}...")
        return df
        
    except Exception as e:
        print(f"❌ Failed: {e}")
        return None

In [4]:
# Test multiple datasets
datasets = [
    ("data.cityofnewyork.us", "fhrw-4uyv", "NYC 311 Service Requests"),
    ("data.cityofchicago.org", "ijzp-q8t2", "Chicago Crimes"),
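    # Label kept as run; the rows returned below ("Aid Response", "Auto Fire Alarm") look like fire/EMS calls
    ("data.seattle.gov", "kzjm-xkqj", "Seattle Police Reports"),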
    ("chronicdata.cdc.gov", "vbim-akqf", "CDC COVID-19 Data"),
]

results = {}
for domain, dataset_id, name in datasets:
    df = test_dataset(domain, dataset_id, name, limit=5)
    if df is not None:
        results[name] = df




Testing: NYC 311 Service Requests
Dataset ID: fhrw-4uyv
Domain: data.cityofnewyork.us
❌ Failed: HTTPSConnectionPool(host='data.cityofnewyork.us', port=443): Read timed out. (read timeout=10)

Testing: Chicago Crimes
Dataset ID: ijzp-q8t2
Domain: data.cityofchicago.org
❌ Failed: HTTPSConnectionPool(host='data.cityofchicago.org', port=443): Read timed out. (read timeout=10)

Testing: Seattle Police Reports
Dataset ID: kzjm-xkqj
Domain: data.seattle.gov
✓ Success! Fetched 5 records
Columns (10): address, type, datetime, latitude, longitude...

Testing: CDC COVID-19 Data
Dataset ID: vbim-akqf
Domain: chronicdata.cdc.gov
✓ Success! Fetched 5 records
Columns (12): cdc_case_earliest_dt, cdc_report_dt, pos_spec_dt, onset_dt, current_status...


In [5]:
# Show which datasets worked
print(f"\n\n{'='*60}")
print("SUMMARY")
print(f"{'='*60}\n")
print(f"Successfully loaded {len(results)} datasets:\n")
for name in results.keys():
    print(f"  ✓ {name}")



SUMMARY

Successfully loaded 2 datasets:

  ✓ Seattle Police Reports
  ✓ CDC COVID-19 Data


## Fetch a Larger Dataset

Now let's fetch more data from a working dataset:

In [6]:
# Fetch more records from one of the datasets that responded above
if results:
    # Use Seattle Police Reports (working dataset)
    domain = "data.seattle.gov"
    dataset_id = "kzjm-xkqj"
    limit = 1000
    
    print(f"Fetching {limit} records from Seattle Police Reports...")
    
    client = Socrata(domain, None)
    results_large = client.get(dataset_id, limit=limit)
    df_large = pd.DataFrame.from_records(results_large)
    
    print(f"\n✓ Fetched {len(df_large):,} records")
    print(f"✓ Columns: {len(df_large.columns)}")
    print(f"✓ Memory usage: {df_large.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
else:
    print("No datasets loaded successfully")



Fetching 1000 records from Seattle Police Reports...

✓ Fetched 1,000 records
✓ Columns: 12
✓ Memory usage: 0.83 MB
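
A single `limit=1000` call returns only the first page of the dataset. For larger pulls, SODA supports paging with `limit` and `offset`. Below is a minimal sketch of that loop, assuming the caller supplies a page-fetching callable; `fetch_all` is a hypothetical helper, and a list slice stands in for the API so the example runs offline.

```python
def fetch_all(fetch_page, page_size=1000, max_records=None):
    """Collect records page by page until a short page or max_records is hit.

    fetch_page(limit, offset) must return a list of records.
    """
    records = []
    offset = 0
    while max_records is None or len(records) < max_records:
        limit = page_size
        if max_records is not None:
            limit = min(page_size, max_records - len(records))
        page = fetch_page(limit, offset)
        records.extend(page)
        if len(page) < limit:
            break  # short (or empty) page: no more data
        offset += len(page)
    return records


# Stand-in for a real API call so the sketch runs offline
fake_rows = [{"id": i} for i in range(25)]
rows = fetch_all(lambda limit, offset: fake_rows[offset:offset + limit], page_size=10)
print(len(rows))
```

With `sodapy`, the callable could be `lambda limit, offset: client.get(dataset_id, limit=limit, offset=offset, order=":id")`. The `order=":id"` part is an assumption about a suitable stable sort, so that pages do not overlap while the dataset is being updated.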


In [7]:
# Explore the data
if 'df_large' in locals():
    print("First few rows:")
    display(df_large.head())
    
    print("\nData info:")
    df_large.info()

First few rows:


| | address | type | datetime | latitude | longitude | report_location | incident_number | :@computed_region_ru88_fbhk | :@computed_region_kuhn_3gp2 | :@computed_region_q256_3sug | :@computed_region_2day_rhn5 | :@computed_region_cyqu_gs94 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2300 Elliott Ave | Auto Fire Alarm | 2025-10-05T10:12:00.000 | 47.612046 | -122.348232 | {'type': 'Point', 'coordinates': [-122.348232,... | F250138091 | 5 | 9 | 19576 | | |
| 1 | 715 9th Ave | Aid Response | 2025-10-05T10:09:00.000 | 47.606371 | -122.325496 | {'type': 'Point', 'coordinates': [-122.325496,... | F250138089 | 19 | 12 | 18379 | | |
| 2 | Madison St / Minor Ave | Aid Response | 2025-10-05T10:08:00.000 | 47.609802 | -122.324029 | {'type': 'Point', 'coordinates': [-122.324029,... | F250138090 | 19 | 12 | 18379 | | |
| 3 | 430 Minor Ave N | Automatic Medical Alarm | 2025-10-05T10:08:00.000 | 47.622312 | -122.332995 | {'type': 'Point', 'coordinates': [-122.332995,... | F250138088 | 56 | 10 | 18390 | | |
| 4 | 7501 56th Ave Ne | Aid Response | 2025-10-05T10:05:00.000 | 47.683193 | -122.268012 | {'type': 'Point', 'coordinates': [-122.268012,... | F250138087 | 55 | 48 | 18792 | | |



Data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   address                      1000 non-null   object
 1   type                         1000 non-null   object
 2   datetime                     1000 non-null   object
 3   latitude                     999 non-null    object
 4   longitude                    999 non-null    object
 5   report_location              1000 non-null   object
 6   incident_number              1000 non-null   object
 7   :@computed_region_ru88_fbhk  998 non-null    object
 8   :@computed_region_kuhn_3gp2  998 non-null    object
 9   :@computed_region_q256_3sug  1000 non-null   object
 10  :@computed_region_2day_rhn5  81 non-null     object
 11  :@computed_region_cyqu_gs94  74 non-null     object
dtypes: object(12)
memory usage: 93.9+ KB
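
Every column above has dtype `object` because the API returns values as JSON strings. The sketch below converts selected fields before analysis, using values from the first row shown earlier. `coerce_record` is a hypothetical helper, and the choice of which fields are numeric or timestamps is an assumption based on the column names. In pandas, `pd.to_numeric` and `pd.to_datetime` do the equivalent on whole columns.

```python
from datetime import datetime

# Assumed field roles, inferred from the column names in the output above
FLOAT_FIELDS = ("latitude", "longitude")
DATETIME_FIELDS = ("datetime",)


def coerce_record(record):
    """Return a copy of record with numeric and timestamp strings parsed."""
    out = dict(record)
    for field in FLOAT_FIELDS:
        value = out.get(field)
        # Missing or empty values become None instead of raising
        out[field] = float(value) if value not in (None, "") else None
    for field in DATETIME_FIELDS:
        value = out.get(field)
        out[field] = datetime.fromisoformat(value) if value else None
    return out


sample = {
    "address": "2300 Elliott Ave",
    "type": "Auto Fire Alarm",
    "datetime": "2025-10-05T10:12:00.000",
    "latitude": "47.612046",
    "longitude": "-122.348232",
}
clean = coerce_record(sample)
print(clean["latitude"], clean["datetime"])
```

Applying `coerce_record` to each item of `results_large` before building the DataFrame would give numeric coordinates and real timestamps, including for the one row where latitude and longitude are missing.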


## Save to Data Lake

In [8]:
# Save the data
if 'df_large' in locals() and not df_large.empty:
    # Create data directories
    Path('data/raw').mkdir(parents=True, exist_ok=True)
    
    # Save to CSV
    from datetime import datetime
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # NOTE: the "nyc_311" prefix is a leftover from the first test; df_large holds the Seattle data
    output_file = f"data/raw/nyc_311_{timestamp}.csv"
    
    df_large.to_csv(output_file, index=False)
    
    file_size_mb = os.path.getsize(output_file) / (1024**2)
    
    print(f"\n{'='*60}")
    print("✓ Data Saved Successfully!")
    print(f"{'='*60}")
    print(f"File: {output_file}")
    print(f"Records: {len(df_large):,}")
    print(f"Size: {file_size_mb:.2f} MB")
    print(f"\nNext steps:")
    print("1. Create a dbt staging model")
    print("2. Run data quality checks")
    print("3. Build analytics models")


✓ Data Saved Successfully!
File: data/raw/nyc_311_20251005_121831.csv
Records: 1,000
Size: 0.16 MB

Next steps:
1. Create a dbt staging model
2. Run data quality checks
3. Build analytics models
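
The file above is named `nyc_311_...` although `df_large` holds the Seattle sample. One way to avoid that kind of mismatch is to derive the file name from the dataset label. A minimal sketch follows; `raw_output_path` is a hypothetical helper, and the naming scheme is an assumption modeled on the path used above.

```python
import re
from datetime import datetime
from pathlib import Path


def raw_output_path(dataset_name, when=None, base_dir="data/raw"):
    """Build a timestamped CSV path: <base_dir>/<slug>_<YYYYmmdd_HHMMSS>.csv."""
    when = when or datetime.now()
    # Lowercase the label and collapse anything non-alphanumeric into underscores
    slug = re.sub(r"[^a-z0-9]+", "_", dataset_name.lower()).strip("_")
    return Path(base_dir) / f"{slug}_{when:%Y%m%d_%H%M%S}.csv"


path = raw_output_path("Seattle Police Reports", datetime(2025, 10, 5, 12, 18, 31))
print(path.as_posix())
```

In the save cell this would be used as `df_large.to_csv(raw_output_path("Seattle Police Reports"), index=False)`, after creating the directory as before.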


## Data Analysis Preview

In [9]:
# Quick analysis
if 'df_large' in locals() and not df_large.empty:
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Set style
    sns.set_style('whitegrid')
    plt.rcParams['figure.figsize'] = (12, 6)
    
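    # Only runs for NYC 311 data, which has a complaint_type column; the Seattle
    # sample loaded above has no such column, so this block is skipped here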
    if 'complaint_type' in df_large.columns:
        print("\nTop 10 Complaint Types:")
        top_complaints = df_large['complaint_type'].value_counts().head(10)
        print(top_complaints)
        
        # Visualization
        plt.figure(figsize=(12, 6))
        top_complaints.plot(kind='barh', color='steelblue')
        plt.title('Top 10 NYC 311 Complaint Types', fontsize=16, fontweight='bold')
        plt.xlabel('Number of Complaints', fontsize=12)
        plt.ylabel('Complaint Type', fontsize=12)
        plt.tight_layout()
        plt.show()
    
    # Missing data analysis
    print("\nMissing Data:")
    missing = df_large.isnull().sum()
    missing_pct = (missing / len(df_large) * 100).round(2)
    missing_df = pd.DataFrame({'Count': missing, 'Percentage': missing_pct})
    missing_df = missing_df[missing_df['Count'] > 0].sort_values('Count', ascending=False)
    display(missing_df)


Missing Data:


| | Count | Percentage |
|---|---|---|
| :@computed_region_cyqu_gs94 | 926 | 92.6 |
| :@computed_region_2day_rhn5 | 919 | 91.9 |
| :@computed_region_ru88_fbhk | 2 | 0.2 |
| :@computed_region_kuhn_3gp2 | 2 | 0.2 |
| latitude | 1 | 0.1 |
| longitude | 1 | 0.1 |


## Summary

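This notebook demonstrates:
1. ✅ Connecting to Socrata-hosted open data portals with `sodapy`
2. ✅ Testing different datasets (Seattle and CDC responded; NYC and Chicago timed out)
3. ✅ Fetching a larger sample (1,000 records)
4. ✅ Saving to the data lake as CSV
5. ✅ Basic data analysis (missing-value counts)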

**Next Steps:**
- Create dbt staging models for the fetched data
- Build data quality tests
- Create analytics/mart models
- Set up automated data refreshes