# Notebook 01 – Data Ingestion & Initial Exploration



## 1. Environment Setup

A clean, reproducible environment comes first. Here we import the core libraries used throughout the notebook and set up the workspace so that anyone cloning this repo starts from the same setup and can rerun the analysis.



In [5]:
# Essential imports
import pandas as pd
import geopandas as gpd
import os
from pathlib import Path
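
Imports alone do not record which library versions were used. As a minimal sketch (the helper name `collect_versions` is hypothetical and not part of the original notebook), the installed versions could be logged with the standard library so a later run can be compared against this one:

```python
import platform
from importlib import metadata


def collect_versions(packages=("pandas", "geopandas")):
    """Return a dict with the Python version and each package's installed version."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence instead of failing, so the log is always complete
            versions[name] = "not installed"
    return versions


print(collect_versions())
```

Saving this output next to the results is one lightweight option; a pinned requirements file would be a stricter alternative.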

## 2. Load Raw Data

Loading raw datasets is the first critical step. By separating this phase, we maintain a clear pipeline: raw files remain untouched, enabling transparent version control and easy auditing of original sources.


In [6]:

# 1) Define raw data directory
RAW_DIR = Path('..') / 'data' / 'raw'

# 2) Find all NYPD CSV parts (nypd_1.csv, nypd_2.csv, …)
crime_files = sorted(RAW_DIR.glob('nypd_*.csv'))

# 3) Read each part into a DataFrame, parse the date, and collect
crime_dfs = []
for f in crime_files:
    print(f"Loading {f.name}…")
    df_part = pd.read_csv(
        f,
        parse_dates=['cmplnt_fr_dt'],   # complaint start date (column names are lowercase)
        low_memory=False
    )
    crime_dfs.append(df_part)

# 4) Concatenate into one DataFrame
df_crime = pd.concat(crime_dfs, ignore_index=True)
print("Combined crime shape:", df_crime.shape)

# 5) Load the weather table and the borough boundaries
df_weather = pd.read_csv(
    RAW_DIR / 'noaa_ghcnd_2024.csv',
    parse_dates=['DATE'],
    low_memory=False
)
print("Weather shape:", df_weather.shape)

gdf_boroughs = gpd.read_file(RAW_DIR / 'nyc_boroughs.geojson')
print("Boroughs shape:", gdf_boroughs.shape)


Loading nypd_1.csv…
Loading nypd_2.csv…
Loading nypd_3.csv…
Loading nypd_4.csv…
Combined crime shape: (565118, 35)
Weather shape: (366, 151)
Boroughs shape: (5, 5)
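
Concatenating several CSV parts assumes they all share the same schema. The sketch below, using toy frames in place of the real `nypd_*.csv` parts and a hypothetical helper `check_same_columns`, shows one way that assumption might be checked before calling `pd.concat`:

```python
import pandas as pd


def check_same_columns(frames):
    """Raise ValueError if any frame's columns (names or order) differ from the first."""
    if not frames:
        raise ValueError("no frames to check")
    expected = list(frames[0].columns)
    for i, frame in enumerate(frames[1:], start=1):
        if list(frame.columns) != expected:
            diff = sorted(set(expected).symmetric_difference(frame.columns))
            raise ValueError(
                f"part {i} has different columns or column order; differing names: {diff}"
            )
    return expected


# Toy stand-ins for the CSV parts (values are illustrative only)
part_a = pd.DataFrame({"cmplnt_num": ["1"], "boro_nm": ["BRONX"]})
part_b = pd.DataFrame({"cmplnt_num": ["2"], "boro_nm": ["QUEENS"]})

columns = check_same_columns([part_a, part_b])
combined = pd.concat([part_a, part_b], ignore_index=True)
print(columns, combined.shape)
```

Without such a check, `pd.concat` silently aligns mismatched columns and fills the gaps with NaN, which can inflate the missing-value counts later on.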


## 3. Initial Data Inspection

### 3.1 Data Types & Shapes

Understanding the structure and scale of each dataset is key to planning the analysis. Checking data types and shapes helps identify conversion needs, memory constraints, and potential anomalies before diving deeper.



In [7]:
print("Crime data shape:", df_crime.shape)
print(df_crime.dtypes)

print("Weather data shape:", df_weather.shape)
print(df_weather.dtypes)

print("Boroughs GeoDataFrame shape:", gdf_boroughs.shape)
print(gdf_boroughs.dtypes)

Crime data shape: (565118, 35)
cmplnt_num                   object
cmplnt_fr_dt         datetime64[ns]
cmplnt_fr_tm                 object
cmplnt_to_dt                 object
cmplnt_to_tm                 object
addr_pct_cd                   int64
rpt_dt                       object
ky_cd                         int64
ofns_desc                    object
pd_cd                       float64
pd_desc                      object
crm_atpt_cptd_cd             object
law_cat_cd                   object
boro_nm                      object
loc_of_occur_desc            object
prem_typ_desc                object
juris_desc                   object
jurisdiction_code             int64
parks_nm                     object
hadevelopt                   object
housing_psa                 float64
x_coord_cd                  float64
y_coord_cd                  float64
susp_age_group               object
susp_race                    object
susp_sex                     object
transit_district            float
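
The listing shows `cmplnt_to_dt` and `rpt_dt` stored as `object` rather than datetimes. A possible conversion is sketched below on a toy frame; the `MM/DD/YYYY` format is an assumption about the raw files and should be checked against the actual values:

```python
import pandas as pd

# Toy frame mimicking the object-typed date columns (values are illustrative only)
toy = pd.DataFrame({
    "cmplnt_to_dt": ["01/15/2024", "not a date", None],
    "rpt_dt": ["01/16/2024", "01/17/2024", "01/18/2024"],
})

for col in ["cmplnt_to_dt", "rpt_dt"]:
    # errors="coerce" turns unparseable values into NaT instead of raising
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y", errors="coerce")

print(toy.dtypes)
print("Unparseable or missing end dates:", int(toy["cmplnt_to_dt"].isna().sum()))
```

Counting the resulting NaT values separates genuinely missing dates from malformed ones only if the count is compared with the missing count before conversion.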


### 3.2 Preview Samples

Previewing a few rows offers an immediate look at real records, revealing formatting quirks and guiding early decisions on column selection, renaming, or basic transformations.


In [8]:
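# Only the last expression in a cell is rendered automatically,
# so the earlier previews are wrapped in display()
from IPython.display import display

# Display first rows of crime data
display(df_crime.head())

# Display first rows of weather data
display(df_weather.head())

# Display first rows of borough geometries
gdf_boroughs.head()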

| | BoroCode | BoroName | Shape_Leng | Shape_Area | geometry |
|---|---|---|---|---|---|
| 0 | 5 | Staten Island | 330385.03697 | 1623853000.0 | MULTIPOLYGON (((-74.05051 40.56642, -74.05047 ... |
| 1 | 4 | Queens | 861038.4793 | 3049947000.0 | MULTIPOLYGON (((-73.83668 40.59495, -73.83678 ... |
| 2 | 3 | Brooklyn | 726568.94634 | 1959432000.0 | MULTIPOLYGON (((-73.86706 40.58209, -73.86769 ... |
| 3 | 1 | Manhattan | 358532.95642 | 636442200.0 | MULTIPOLYGON (((-74.01093 40.68449, -74.01193 ... |
| 4 | 2 | Bronx | 464517.89055 | 1186804000.0 | MULTIPOLYGON (((-73.89681 40.79581, -73.89694 ... |


### 3.3 Missing Values Check

Missing data is common in real-world datasets. Here we measure the extent of missingness in both the crime and weather records. This gives an early read on data quality, can reveal inconsistencies in data collection, and informs the choice of handling technique (such as imputation or exclusion) for the downstream analysis.


In [9]:
# Count missing values in crime data
# (printed explicitly, because only the last expression in a cell is rendered)
print(df_crime.isna().sum().sort_values(ascending=False))

# Count missing values in weather data
df_weather.isna().sum().sort_values(ascending=False)

ACMC    366
ACSC    366
ACMH    366
ACSH    366
DAPR    366
       ... 
PRCP      0
SNWD      0
SNOW      0
TMIN      0
TMAX      0
Length: 151, dtype: int64
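
Several weather columns (for example `ACMC`) are missing in all 366 rows, so they carry no information. The following sketch, on a toy frame with made-up values, shows how the missing share per column could be computed and fully empty columns dropped; whether to drop them in the real pipeline is a judgment call the notebook has not yet made:

```python
import pandas as pd

# Toy weather frame (values are illustrative only)
toy_weather = pd.DataFrame({
    "DATE": pd.date_range("2024-01-01", periods=4, freq="D"),
    "TMAX": [5.0, 6.1, None, 4.2],
    "ACMC": [None, None, None, None],
})

# Fraction of missing values per column, largest first
missing_share = toy_weather.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns with no observations at all
empty_cols = missing_share[missing_share == 1.0].index.tolist()
trimmed = toy_weather.drop(columns=empty_cols)
print("Dropped:", empty_cols)
```

Reporting shares rather than raw counts makes the crime and weather tables comparable despite their very different row counts.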

## 4. Next Steps & Conclusions

At this point, we've set up the environment, loaded the core datasets, and performed an initial inspection. These findings inform the cleaning strategy, feature engineering, and later exploratory analysis. Documenting each decision keeps the work transparent for collaborators and reviewers.

* Verify date columns for completeness and correctness.
* Determine if filtering by year range or crime type is necessary.
* Assess missing data and choose an appropriate strategy (e.g., imputation or removal).
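
The first item above could start from a check like the one sketched here. It uses a toy daily series, and the target year 2024 is an assumption taken from the weather file name:

```python
import pandas as pd

# Toy daily series with one day missing (values are illustrative only)
toy_dates = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]))

# Every calendar day between the first and last observed date
expected = pd.date_range(toy_dates.min(), toy_dates.max(), freq="D")
missing_days = expected.difference(pd.DatetimeIndex(toy_dates))
print("Missing days:", list(missing_days.strftime("%Y-%m-%d")))

# Rows whose date falls outside the assumed target year
outside_year = int((toy_dates.dt.year != 2024).sum())
print("Rows outside 2024:", outside_year)
```

The gap check suits the daily weather table; for the crime table, where many days repeat, the year filter is likely the more useful half.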

