# Notebook 01 – Data Ingestion & Initial Exploration



## 1. Environment Setup

A clean, reproducible environment is the foundation of the analysis. Here, we import the core libraries up front so that anyone cloning this repo can see which dependencies the notebook needs and replicate our analysis with fewer compatibility issues.



In [None]:
# Essential imports
import pandas as pd
import geopandas as gpd
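
One lightweight way to support the reproducibility goal is to record the interpreter and library versions at the top of the notebook. The sketch below is an illustration, not part of the original notebook: the helper name `collect_versions` is hypothetical, and it relies only on the standard library.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Return a dict mapping 'python' and each package name to a version string."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of raising, so the check never blocks a run
            versions[name] = "not installed"
    return versions


# Packages imported in this notebook
for name, version in collect_versions(["pandas", "geopandas"]).items():
    print(f"{name}: {version}")
```

Saving this output alongside the notebook gives collaborators a concrete reference point if results differ on their machine.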

## 2. Load Raw Data

Loading raw datasets is the first critical step. By separating this phase, we maintain a clear pipeline: raw files remain untouched, enabling transparent version control and easy auditing of original sources.


In [None]:
# 2.1 Load NYPD crime data
# Adjust column name 'CMPLNT_FR_DT' as needed
df_crime = pd.read_csv(
    '../data/raw/nypd_crime.csv',
    parse_dates=['CMPLNT_FR_DT'],
    dayfirst=False,
    low_memory=False
)

# 2.2 Load NOAA weather data
# Adjust column name 'DATE' as needed
df_weather = pd.read_csv(
    '../data/raw/noaa_weather.csv',
    parse_dates=['DATE'],
    dayfirst=False,
    low_memory=False
)

# 2.3 Load NYC boroughs GeoJSON
gdf_boroughs = gpd.read_file(
    '../data/raw/nyc_boroughs.geojson'
)
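
To make the "raw files remain untouched" claim checkable, one option is to fingerprint each raw file and compare the digests between runs. This is a minimal sketch under that assumption; the helper names `sha256_of_file` and `fingerprint_raw_files` are hypothetical, and the paths are the ones used in the loading cell.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint_raw_files(paths):
    """Map each path to its digest, or to None when the file is missing."""
    report = {}
    for raw_path in paths:
        path = Path(raw_path)
        report[str(path)] = sha256_of_file(path) if path.is_file() else None
    return report


# Paths taken from the loading cell
RAW_FILES = [
    "../data/raw/nypd_crime.csv",
    "../data/raw/noaa_weather.csv",
    "../data/raw/nyc_boroughs.geojson",
]

for file_path, file_hash in fingerprint_raw_files(RAW_FILES).items():
    print(file_path, file_hash or "MISSING")
```

A changed digest signals that a raw file was modified or replaced, and a `MISSING` entry catches path problems before `read_csv` raises.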

## 3. Initial Data Inspection

### 3.1 Data Types & Shapes

Understanding the structure and scale of each dataset is key to planning the analysis. Checking data types and shapes helps identify conversion needs, memory constraints, and potential anomalies before diving deeper.



In [None]:
print("Crime data shape:", df_crime.shape)
print(df_crime.dtypes)

print("Weather data shape:", df_weather.shape)
print(df_weather.dtypes)

print("Boroughs GeoDataFrame shape:", gdf_boroughs.shape)
print(gdf_boroughs.dtypes)
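
Shapes and dtypes alone do not show how much memory a frame occupies, which matters for the memory constraints mentioned above. The sketch below shows one way to add that figure; `summarize_frame` is a hypothetical helper, and the toy frame (including its `offense` column) is made up for illustration rather than taken from the real data.

```python
import pandas as pd


def summarize_frame(df, name):
    """Return row count, column count, and deep memory usage in MB for a DataFrame."""
    n_rows, n_cols = df.shape
    # deep=True also measures the Python strings held in object columns
    memory_bytes = df.memory_usage(deep=True).sum()
    return {
        "name": name,
        "rows": n_rows,
        "columns": n_cols,
        "memory_mb": memory_bytes / 1024**2,
    }


# Toy frame standing in for df_crime; the values are invented
toy = pd.DataFrame({
    "CMPLNT_FR_DT": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "offense": ["A", "B", "A"],
})
print(summarize_frame(toy, "toy"))
```

In the notebook, the same call could be applied to `df_crime` and `df_weather` to decide whether dtype conversions are worth doing early.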


### 3.2 Preview Samples

Previewing a few rows offers an immediate look at real records, revealing formatting quirks and guiding early decisions on column selection, renaming, or basic transformations.


In [None]:
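# Jupyter renders only the last expression in a cell, so call display() explicitly
from IPython.display import display

# Display first rows of crime data
display(df_crime.head())

# Display first rows of weather data
display(df_weather.head())

# Display first rows of borough geometries
display(gdf_boroughs.head())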

### 3.3 Missing Values Check

Missing data is a fact of life in real-world datasets. Here, we measure the extent of missingness in both the crime and weather records. This step gives an early read on data quality, helps flag inconsistencies in data collection, and informs the choice of handling technique (such as imputation or exclusion) for the downstream analysis. Recruiters and real-world project teams value this diligence, as it reflects a careful, responsible approach to data science.


In [None]:
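# Jupyter renders only the last expression in a cell, so print each result explicitly
# Count missing values in crime data
print(df_crime.isna().sum().sort_values(ascending=False))

# Count missing values in weather data
print(df_weather.isna().sum().sort_values(ascending=False))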
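
Raw counts are hard to compare across datasets of different sizes, so it can help to report the share of missing values as well. The sketch below is one way to do that; `missing_summary` is a hypothetical helper, and the toy frame is invented so the example runs on its own.

```python
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, most missing first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy frame with invented values
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
print(missing_summary(toy))
```

Applied to `df_crime` and `df_weather`, the percentage column makes it easier to judge whether a column is a candidate for imputation or for exclusion.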

## 4. Next Steps & Conclusions

At this point, we’ve set up the environment, loaded our core datasets, and performed an initial inspection. The insights gained here lay the groundwork for informed cleaning strategies, feature engineering, and subsequent exploratory analyses. Documenting each decision fosters transparency and trust, qualities that recruiters and collaborators highly appreciate.

* Verify date columns for completeness and correctness.
* Determine if filtering by year range or crime type is necessary.
* Assess missing data and choose an appropriate strategy (e.g., imputation or removal).
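
The first two items above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual cleaning code: `date_column_report` and `filter_by_year` are hypothetical helpers, the toy values are invented, and only the column name `CMPLNT_FR_DT` comes from the loading cell. Coercing again is a precaution, since `read_csv` can leave a column as `object` dtype when some values cannot be parsed as dates.

```python
import pandas as pd


def date_column_report(series):
    """Coerce a column to datetime and report unparsed values and the covered range."""
    parsed = pd.to_datetime(series, errors="coerce")  # unparseable entries become NaT
    return {
        "n_total": int(len(parsed)),
        "n_missing_or_unparsed": int(parsed.isna().sum()),
        "min": parsed.min(),
        "max": parsed.max(),
    }


def filter_by_year(df, date_col, start_year, end_year):
    """Keep rows whose date falls within [start_year, end_year]; NaT rows are dropped."""
    years = pd.to_datetime(df[date_col], errors="coerce").dt.year
    return df[years.between(start_year, end_year)]


# Toy frame with invented values, including one bad string and one missing entry
toy = pd.DataFrame({
    "CMPLNT_FR_DT": ["2019-01-15", "2021-12-31", "not a date", None],
})
print(date_column_report(toy["CMPLNT_FR_DT"]))
print(filter_by_year(toy, "CMPLNT_FR_DT", 2020, 2021))
```

The report's minimum and maximum give a quick sanity check on the covered period before any year range is chosen.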

