# Phase 2: Exploration - Data Overview

## Overview

Understand the structure and basic characteristics of the crime dataset.

### Objectives
1. Load the consolidated Parquet file
2. Display shape, columns, and data types
3. Show sample records and basic statistics
4. Visualize distributions of key columns
5. Identify temporal coverage

## Cell 1: Setup and Imports

In [None]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

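# Add the project root to sys.path so the `src` package can be imported.
# Assumes the notebook is run from a directory two levels below the project root.
PROJECT_ROOT = Path.cwd().parent.parent
sys.path.insert(0, str(PROJECT_ROOT))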

from src.data import loader
from src.analysis import profiler
from src.utils.config import get_processed_data_path

# Configure visualization
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("Imports successful")
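
The `Path.cwd().parent.parent` lookup above only works when the notebook is launched from a directory exactly two levels below the project root. A minimal sketch of a more tolerant lookup follows; `find_project_root` and `add_to_sys_path` are hypothetical helpers (not part of this project), and using a `src/` directory as the root marker is an assumption based on the imports in this cell.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`/."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}/' at or above {start}")


def add_to_sys_path(root: Path) -> None:
    """Put `root` at the front of sys.path unless it is already listed."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
```

In the notebook this would be called as `add_to_sys_path(find_project_root(Path.cwd()))`, which also avoids adding duplicate entries when the cell is re-run.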

## Cell 2: Load Data

In [None]:
# Load the consolidated crime data
df = loader.load_crime_data()
print(f"Loaded {len(df):,} crime records")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
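
The memory figure above uses `deep=True`, so it includes the contents of string columns, which usually dominate. As a rough sketch of how that number can be reduced, the block below converts low-cardinality string columns to `category`. The helper names, the 0.5 threshold, and the toy frame are illustrative assumptions, not part of the project or the real dataset.

```python
import pandas as pd


def memory_mb(frame: pd.DataFrame) -> float:
    """Total memory of the frame in MB, counting string contents."""
    return frame.memory_usage(deep=True).sum() / 1024**2


def categorize_low_cardinality(frame: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy where string columns with few distinct values become categoricals."""
    out = frame.copy()
    n_rows = max(len(out), 1)
    for col in out.columns:
        if not pd.api.types.is_string_dtype(out[col]):
            continue
        if out[col].nunique(dropna=True) / n_rows <= max_unique_ratio:
            out[col] = out[col].astype("category")
    return out


# Toy data for illustration only (not taken from the crime dataset)
toy = pd.DataFrame({
    "district": ["North", "South", "East", "West"] * 2500,
    "incident_id": range(10_000),
})
compact = categorize_low_cardinality(toy)
print(f"Before: {memory_mb(toy):.2f} MB, after: {memory_mb(compact):.2f} MB")
```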

## Cell 3: Basic Info

In [None]:
print("=" * 60)
print("DATA SHAPE AND STRUCTURE")
print("=" * 60)
print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns\n")

print("Columns:")
for i, col in enumerate(df.columns, 1):
    print(f"  {i:2d}. {col}")

print("\n" + "=" * 60)
print("DATA TYPES")
print("=" * 60)
print(df.dtypes)
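
Printing columns and dtypes separately works, but a single table is often easier to scan. The sketch below builds one row per column with its dtype, non-null count, percentage of missing values, and number of distinct values. `column_overview` is a hypothetical helper and the toy frame is made up for illustration.

```python
import pandas as pd


def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, non-null count, percent missing, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "null_pct": (frame.isna().mean() * 100).round(2),
        "n_unique": frame.nunique(dropna=True),
    })


# Toy data for illustration only
toy = pd.DataFrame({
    "district": ["A", "B", None, "A"],
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"]),
})
overview = column_overview(toy)
print(overview)
```

In the notebook, `column_overview(df)` would give the same table for the loaded crime data.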

## Cell 4: Data Profiling

In [None]:
# Use the DataProfiler utility
profile = profiler.DataProfiler(df)

print("\n" + "=" * 60)
print("SUMMARY STATISTICS")
print("=" * 60)
summary = profile.get_summary()
for key, value in summary.items():
    print(f"{key}: {value}")
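
The internals of `profiler.DataProfiler` are not shown in this notebook. For readers without access to that module, here is a minimal stand-in that returns a dictionary in the same spirit. The function name and the chosen keys are assumptions; the real `get_summary()` may report different fields.

```python
import pandas as pd


def basic_summary(frame: pd.DataFrame) -> dict:
    """Hypothetical stand-in for a profiler summary: a few whole-frame statistics."""
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "missing_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "memory_mb": round(frame.memory_usage(deep=True).sum() / 1024**2, 2),
    }


# Toy data for illustration only: the second row repeats the first
toy = pd.DataFrame({
    "district": ["A", "A", "B", "C"],
    "count": [1.0, 1.0, None, 4.0],
})
summary = basic_summary(toy)
for key, value in summary.items():
    print(f"{key}: {value}")
```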

## Cell 5: Sample Records

In [None]:
print("\n" + "=" * 60)
print("SAMPLE RECORDS (First 10)")
print("=" * 60)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
print(df.head(10))
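
`pd.set_option` changes the display settings for every later cell in the session. If that is not wanted, `pd.option_context` applies the same settings only inside a `with` block. A small sketch, with `head_as_text` as a hypothetical helper and a toy frame for illustration:

```python
import pandas as pd


def head_as_text(frame: pd.DataFrame, n: int = 10) -> str:
    """Render the first n rows with no column or cell-width truncation."""
    with pd.option_context("display.max_columns", None, "display.max_colwidth", None):
        return repr(frame.head(n))


# Toy data for illustration only: one cell is longer than the default cell width
long_text = "x" * 80
toy = pd.DataFrame({"description": [long_text, "short"], "count": [1, 2]})
print(head_as_text(toy))
```

After the `with` block exits, the previous display options are restored automatically.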

## Cell 6: Temporal Coverage

In [None]:
if 'date' in df.columns:
    # Parse once here; Cell 7 relies on 'date' being a datetime column
    df['date'] = pd.to_datetime(df['date'])
    start, end = df['date'].min(), df['date'].max()
    span_days = (end - start).days
    print(f"Date range: {start} to {end}")
    print(f"Span: {span_days} days ({span_days / 365.25:.1f} years)")
    print("\nRecords by year:")
    print(df.groupby(df['date'].dt.year).size())
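
The date range and yearly counts show the outer bounds of the data but can hide gaps inside them. One way to check, sketched below, is to list every calendar month between the first and last record that has no rows at all. `months_without_records` is a hypothetical helper and the dates are toy values.

```python
import pandas as pd


def months_without_records(dates: pd.Series) -> pd.PeriodIndex:
    """Calendar months between the first and last date that contain no records."""
    dates = pd.to_datetime(dates).dropna()
    if dates.empty:
        return pd.PeriodIndex([], freq="M")
    observed = dates.dt.to_period("M")
    expected = pd.period_range(observed.min(), observed.max(), freq="M")
    return expected.difference(pd.PeriodIndex(observed.unique()))


# Toy data for illustration only
toy_dates = pd.Series(["2023-01-15", "2023-01-20", "2023-03-02", "2023-06-30"])
missing = months_without_records(toy_dates)
print([str(p) for p in missing])
```

In the notebook, `months_without_records(df['date'])` would run the same check on the loaded data.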

## Cell 7: Key Distributions

In [None]:
# Create a quick visualization dashboard (run Cell 6 first so 'date' is a datetime column)
fig, axes = plt.subplots(2, 2, figsize=(14, 8))

# Records by year
if 'date' in df.columns:
    df.groupby(df['date'].dt.year).size().plot(ax=axes[0, 0], kind='bar')
    axes[0, 0].set_title('Records by Year')
    axes[0, 0].set_xlabel('Year')
    axes[0, 0].set_ylabel('Count')

# Top crime types
if 'general_crime_category' in df.columns:
    df['general_crime_category'].value_counts().head(10).plot(ax=axes[0, 1], kind='barh')
    axes[0, 1].set_title('Top 10 Crime Types')
    axes[0, 1].set_xlabel('Count')

# Top districts
if 'district' in df.columns:
    df['district'].value_counts().head(10).plot(ax=axes[1, 0], kind='barh')
    axes[1, 0].set_title('Top 10 Districts')
    axes[1, 0].set_xlabel('Count')

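# Records by month (last year)
if 'date' in df.columns:
    last_year = df[df['date'] >= df['date'].max() - pd.Timedelta(days=365)]
    # Group by year-month rather than month number: the window can start and end
    # in the same calendar month of different years, and those must not be merged.
    # This also keeps the x-axis in chronological order.
    last_year.groupby(last_year['date'].dt.to_period('M')).size().plot(ax=axes[1, 1])
    axes[1, 1].set_title('Records by Month (Last Year)')
    axes[1, 1].set_xlabel('Month')
    axes[1, 1].set_ylabel('Count')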

plt.tight_layout()
plt.show()
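
Because each panel above is guarded by a column check, a missing column leaves an empty set of axes in the grid. A small sketch of one way to tidy that up is to hide any panel that received no data before calling `plt.show()`. `hide_empty_axes` is a hypothetical helper, the bar values are toy data, and the `Agg` backend line is only there so the sketch runs without a display; it would be omitted inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this standalone sketch only
import matplotlib.pyplot as plt


def hide_empty_axes(axes) -> int:
    """Hide every Axes in the grid that has nothing drawn on it; return how many."""
    hidden = 0
    for ax in axes.flat:
        if not ax.has_data():
            ax.set_visible(False)
            hidden += 1
    return hidden


fig, axes = plt.subplots(2, 2, figsize=(14, 8))
axes[0, 0].bar(["2022", "2023"], [3, 5])  # toy values; only one panel is filled
hidden = hide_empty_axes(axes)
print(f"Hid {hidden} empty panel(s)")
```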

## Summary

✓ **Data overview complete!** You now understand:
- Dataset size and structure
- Available columns and data types
- Temporal coverage
- Distribution of records across years, crime types, and districts

### Next Steps
- Proceed to **02_data_quality_assessment.ipynb** to identify data quality issues
- Or skip ahead to Phase 3 if the overview above shows no obvious quality problems