# U.S. Census Data Analysis for Journalists

This notebook provides tools and examples for journalists to analyze and visualize U.S. Census Bureau data. You'll learn how to:
- Access the Census Bureau API
- Retrieve demographic data
- Process population statistics
- Create visualizations
- Export data in various formats

First, let's set up our environment and import the required libraries.

In [None]:
# Import required libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from census import Census
import requests

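# Set plotting style
# ('seaborn' was renamed 'seaborn-v0_8' in Matplotlib 3.6; fall back on older versions)
style_name = 'seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn'
plt.style.use(style_name)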
%matplotlib inline

## 1. Configure Census API Access

To access the Census Bureau API, you'll need an API key. You can get one for free at:
https://api.census.gov/data/key_signup.html

Once you have your key, you can either:
1. Set it as an environment variable: `export CENSUS_API_KEY=your_key_here`
2. Or enter it directly in the notebook (less secure)

In [None]:
# Get Census API key from environment variable
api_key = os.getenv('CENSUS_API_KEY')

# If not set in the environment, prompt for it
# (getpass hides the key so it is not saved in the notebook output)
if not api_key:
    from getpass import getpass
    api_key = getpass('Enter your Census API key: ')

# Initialize Census API client
c = Census(api_key)

# Test the connection with a simple query
try:
    # Get total population for California (state FIPS code 06)
    ca_pop = c.acs5.state(('NAME', 'B01003_001E'), '06', year=2019)
    print("Connection successful! Example data:")
    print(f"California population (2019 ACS5): {int(ca_pop[0]['B01003_001E']):,}")
except Exception as e:
    # A missing or invalid key, or a network problem, ends up here
    print(f"Connection failed: {e}")
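
The `census` client wraps ordinary HTTP requests, and it can help to see the underlying URL when debugging a query or documenting a source. The sketch below builds the request URL for the same California query. `build_acs5_url` is a hypothetical helper (not part of the `census` package), and it assumes the endpoint layout the Census Bureau uses for the ACS 5-year dataset.

```python
from urllib.parse import urlencode


# Hypothetical helper for illustration; not part of the `census` package.
def build_acs5_url(year, variables, geography, api_key=None):
    """Return the Census API URL for an ACS 5-year query."""
    base = f"https://api.census.gov/data/{year}/acs/acs5"
    params = {"get": ",".join(variables), "for": geography}
    if api_key:
        params["key"] = api_key
    # Leave ',', ':' and '*' unescaped so the URL stays readable
    return f"{base}?{urlencode(params, safe=',:*')}"


url = build_acs5_url(2019, ["NAME", "B01003_001E"], "state:06")
print(url)
```

Fetching such a URL with `requests.get(url).json()` should return a list of rows in which the first row holds the column names.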

## 2. Fetch Demographic Data

Now that we're connected to the Census API, let's fetch some demographic data. We'll start with a basic example of getting population and demographic information for all states.

In [None]:
# Define variables we want to retrieve
# B01003_001E: Total population
# B19013_001E: Median household income
# B01002_001E: Median age
variables = ('NAME', 'B01003_001E', 'B19013_001E', 'B01002_001E')

# Get data for all states
states_data = c.acs5.state(variables, '*', year=2019)

# Convert to DataFrame
df = pd.DataFrame(states_data)

# Rename columns for clarity
df = df.rename(columns={
    'B01003_001E': 'population',
    'B19013_001E': 'median_household_income',
    'B01002_001E': 'median_age'
})

# Convert numeric columns
df[['population', 'median_household_income', 'median_age']] = \
    df[['population', 'median_household_income', 'median_age']].apply(pd.to_numeric)

# Show the five most populous states
print("Top 5 states by population:")
print(df.nlargest(5, 'population')[['NAME', 'population', 'median_household_income', 'median_age']])
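
State-level figures are normally complete, but for smaller geographies the ACS can return large negative placeholder values (for example `-666666666`) where an estimate could not be computed, and a plain numeric conversion will treat them as real numbers. Below is a minimal sketch for blanking them out before analysis. It assumes that any negative value in these columns is a placeholder, and `clean_estimate` is a hypothetical helper.

```python
def clean_estimate(value):
    """Convert a raw ACS estimate to a float, or None if it is missing."""
    if value is None:
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    # Assumption: population, income and age are never negative,
    # so a negative number is a placeholder rather than a real estimate.
    return None if number < 0 else number


# Illustrative raw values, not real Census output
raw_incomes = ["75235", "-666666666", None, "61937.0"]
cleaned = [clean_estimate(v) for v in raw_incomes]
print(cleaned)
```

On a DataFrame, the same rule could be applied per column with something like `df['median_household_income'].map(clean_estimate)`.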

## 3. Process Population Statistics

Let's analyze the population data to find interesting patterns and calculate some key statistics that could be useful for stories.

In [None]:
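# Calculate key statistics
# Note: the state-level query also returns the District of Columbia and Puerto Rico,
# so these figures cover more than the 50 states
total_us_population = df['population'].sum()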
avg_state_population = df['population'].mean()
median_state_population = df['population'].median()
population_std = df['population'].std()

# Print summary statistics
print("U.S. Population Statistics (2019 ACS5):")
print(f"Total Population: {total_us_population:,.0f}")
print(f"Average State Population: {avg_state_population:,.0f}")
print(f"Median State Population: {median_state_population:,.0f}")
print(f"Population Standard Deviation: {population_std:,.0f}")

# Calculate per-state percentage of total population
df['population_percentage'] = (df['population'] / total_us_population * 100)

# Show states with highest and lowest population percentages
print("\nStates with highest population percentage:")
print(df.nlargest(5, 'population_percentage')[['NAME', 'population_percentage']].round(2))

print("\nStates with lowest population percentage:")
print(df.nsmallest(5, 'population_percentage')[['NAME', 'population_percentage']].round(2))
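
The state-level query returns rows for the District of Columbia (FIPS code `11`) and Puerto Rico (FIPS code `72`) alongside the 50 states, so which rows count toward a "national" total is an editorial choice worth stating explicitly. Here is a sketch of computing shares against an explicit set of areas, using made-up figures rather than real estimates. `population_shares` is a hypothetical helper, and it assumes each row carries its FIPS code under a `state` key.

```python
def population_shares(rows, exclude_fips=()):
    """Return {name: percent of total}, skipping excluded FIPS codes."""
    kept = [r for r in rows if r["state"] not in exclude_fips]
    total = sum(r["population"] for r in kept)
    return {r["NAME"]: round(r["population"] / total * 100, 2) for r in kept}


# Illustrative numbers only, not Census estimates
rows = [
    {"NAME": "State A", "state": "01", "population": 600},
    {"NAME": "State B", "state": "02", "population": 300},
    {"NAME": "Puerto Rico", "state": "72", "population": 100},
]
all_areas = population_shares(rows)
without_pr = population_shares(rows, exclude_fips={"72"})
print(all_areas)
print(without_pr)
```

Note how every remaining share rises once an area is dropped from the denominator. With the DataFrame, a comparable filter would be `df[~df['state'].isin(['72'])]`.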

## 4. Create Visualizations

Let's create some visualizations to help tell stories with this data. We'll make a few different types of plots that are commonly used in news articles.

In [None]:
# Create a bar plot of the top 10 states by population
plt.figure(figsize=(12, 6))
top_10_states = df.nlargest(10, 'population')
plt.bar(top_10_states['NAME'], top_10_states['population'])
plt.xticks(rotation=45, ha='right')
plt.title('Top 10 States by Population (2015-2019 ACS 5-Year Estimates)')
plt.ylabel('Population')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Create a scatter plot of population vs median household income
plt.figure(figsize=(12, 6))
plt.scatter(df['population'], df['median_household_income'])
plt.xlabel('Population')
plt.ylabel('Median Household Income ($)')
plt.title('State Population vs Median Household Income (2015-2019 ACS 5-Year Estimates)')
plt.grid(True, alpha=0.3)

# Add state labels for the five most populous states
for _, row in df.nlargest(5, 'population').iterrows():
    plt.annotate(row['NAME'],
                 (row['population'], row['median_household_income']),
                 xytext=(5, 5), textcoords='offset points')

plt.tight_layout()
plt.show()
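
Raw population axes tend to render as long digit strings or in scientific notation, which reads poorly in a published chart. One option is a small label formatter. `format_population` below is a hypothetical helper, written with the `(value, position)` signature that Matplotlib's `FuncFormatter` expects.

```python
def format_population(value, pos=None):
    """Format an axis value as a compact label such as '39.3M'."""
    if abs(value) >= 1_000_000:
        return f"{value / 1_000_000:.1f}M"
    if abs(value) >= 1_000:
        return f"{value / 1_000:.0f}K"
    return f"{value:.0f}"


# Try it on a few sample values
labels = [format_population(v) for v in (39_283_497, 581_024, 950)]
print(labels)
```

To apply it to a chart, wrap it as `matplotlib.ticker.FuncFormatter(format_population)` and pass the result to `plt.gca().yaxis.set_major_formatter(...)` before calling `plt.show()`.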

## 5. Export Data

Finally, let's export our processed data in formats that are easy to use in other tools or share with colleagues.

In [None]:
# Create the output directory if it doesn't exist
output_dir = '../output'
os.makedirs(output_dir, exist_ok=True)

# Export to CSV
csv_path = os.path.join(output_dir, 'state_demographics_2019.csv')
df.to_csv(csv_path, index=False)
print(f"Data exported to CSV: {csv_path}")

# Export the data and summary statistics to Excel (requires the openpyxl package)
excel_path = os.path.join(output_dir, 'state_demographics_summary_2019.xlsx')
with pd.ExcelWriter(excel_path) as writer:
    # Main data sheet
    df.to_excel(writer, sheet_name='State Data', index=False)
    
    # Summary statistics sheet
    summary_stats = pd.DataFrame({
        'Metric': ['Total Population', 'Average State Population', 'Median State Population'],
        'Value': [total_us_population, avg_state_population, median_state_population]
    })
    summary_stats.to_excel(writer, sheet_name='Summary Statistics', index=False)
    
print(f"Summary report exported to Excel: {excel_path}")

# Save the bar chart as a high-resolution PNG
plt.figure(figsize=(12, 6))
top_10_states = df.nlargest(10, 'population')
plt.bar(top_10_states['NAME'], top_10_states['population'])
plt.xticks(rotation=45, ha='right')
plt.title('Top 10 States by Population (2015-2019 ACS 5-Year Estimates)')
plt.ylabel('Population')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(output_dir, 'top_10_states_population.png'), dpi=300, bbox_inches='tight')
plt.close()

print(f"\nAll files have been saved to the '{output_dir}' directory.")
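
Graphics desks and web tools often prefer JSON to CSV or Excel. The following is a minimal sketch of writing records to a JSON file with the standard library. The file name and sample record are illustrative, and `export_records_json` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def export_records_json(records, path):
    """Write a list of dicts to `path` as pretty-printed JSON."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    return path


# Illustrative record, written to a temporary directory
records = [{"NAME": "Example State", "population": 1000}]
out_dir = tempfile.mkdtemp()
json_path = export_records_json(records, Path(out_dir) / "state_demographics.json")
print(json_path.name)
```

When working from the DataFrame directly, `df.to_json(path, orient='records')` is a one-line alternative.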