# Croton Cholera Project
## Notebook 01: Data Validation

Purpose:

*   Load frozen ward-level dataset from GitHub
*   Verify structure
*   Confirm data types and completeness
*   Ensure dataset is ready for modeling

This notebook performs no modeling. It only validates the dataset.


### Clone Repository

In [22]:
# Clone GitHub repo into Colab
!git clone https://github.com/atoothman/croton-cholera-analysis.git


fatal: destination path 'croton-cholera-analysis' already exists and is not an empty directory.
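The `fatal` message above means the target folder was already present when the cell ran, which is what happens when the clone cell is re-run in the same session. A minimal sketch of a re-runnable alternative follows. It assumes the standard-library `subprocess` module in place of the `!` shell magic, and `clone_if_missing` is a hypothetical helper name, not part of the project.

```python
import os
import subprocess


def clone_if_missing(repo_url, dest):
    """Clone repo_url into dest unless dest already exists.

    Returns True if a clone was run, False if it was skipped.
    (Hypothetical helper for illustration.)
    """
    if os.path.isdir(dest):
        print(f"'{dest}' already exists, skipping clone")
        return False
    # check=True raises CalledProcessError if git exits with an error
    subprocess.run(["git", "clone", repo_url, dest], check=True)
    return True
```

In the notebook this could be called as `clone_if_missing("https://github.com/atoothman/croton-cholera-analysis.git", "/content/croton-cholera-analysis")`. Note that a skipped clone does not update an existing checkout.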


### Import libraries and define paths

In [23]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [24]:
# Define repository and data paths

# Define root repo path
base_path = "/content/croton-cholera-analysis"

# Define path to the data folder inside repo
data_path = os.path.join(base_path, "data")

# Confirm files exist
print("Files inside data folder:")
print(os.listdir(data_path))


Files inside data folder:
['data_dictionary.csv', 'ward_level_data.csv', 'sources_and_notes.csv']
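Printing the folder contents relies on reading the list by eye. As a hedged sketch, the same check can be made explicit by comparing the listing against the three file names shown above. `missing_files` and `EXPECTED_FILES` are hypothetical names introduced here for illustration.

```python
import os

# File names taken from the listing printed above
EXPECTED_FILES = ["data_dictionary.csv", "ward_level_data.csv", "sources_and_notes.csv"]


def missing_files(folder, expected):
    """Return the expected file names that are not present in folder, sorted."""
    present = set(os.listdir(folder))
    return sorted(set(expected) - present)


# Same data path as defined in the cell above; the check is skipped if the folder is absent
data_path = "/content/croton-cholera-analysis/data"
if os.path.isdir(data_path):
    missing = missing_files(data_path, EXPECTED_FILES)
    assert not missing, f"Missing files: {missing}"
```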


### Load ward-level dataset & inspect index, columns and ward_id

In [25]:
# Load ward-level dataset

# Create file path to ward-level dataset
ward_file = os.path.join(data_path, "ward_level_data.csv")

# Load dataset into dataframe
df = pd.read_csv(ward_file)

# Display dataset
print("Dataset shape:", df.shape)
df.head()


Dataset shape: (19, 29)


| | ward_id | ward_name | population_1830 | population_1850 | ward_area_1832 | ward_area_1849 | pop_density_1830 | pop_density_1850 | cholera_1832_hotspot | cholera_1849_deaths | ... | cellar_exposure_rate_1850 | cellar_density_1850 | stopcock_count_1850 | stopcock_density_1850 | hydrant_count_1850 | hydrant_density_1850 | stopcocks_per_sq_mile_1850 | hydrants_per_sq_mile_1850 | stopcocks_per_1000_1850 | hydrants_per_1000_1850 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | First Ward | 11327.0 | 19754 | 0.25 | 0.25 | 45308.0 | 79016.0 | 0.0 | 31.0 | ... | 0.029766 | 488.0 | 79.0 | 3.0 | 106.0 | 3.0 | 316.0 | 424.0 | 3.99919 | 5.366002 |
| 1 | 2 | Second Ward | 8202.0 | 6655 | 0.12 | 0.12 | 68350.0 | 55458.33333 | 0.0 | 87.0 | ... | 0.029452 | 350.0 | 44.0 | 2.0 | 58.0 | 2.0 | 366.666667 | 483.333333 | 6.61157 | 8.715252 |
| 2 | 3 | Third Ward | 9620.0 | 10355 | 0.15 | 0.15 | 64133.33333 | 69033.33333 | 1.0 | 81.0 | ... | 0.045485 | 606.666667 | 45.0 | 2.0 | 64.0 | 2.0 | 300.0 | 426.666667 | 4.345727 | 6.180589 |
| 3 | 4 | Fourth Ward | 12705.0 | 23250 | 0.13 | 0.13 | 97730.76923 | 178846.1538 | 1.0 | 674.0 | ... | 0.03871 | 1430.769231 | 35.0 | 2.0 | 48.0 | 1.0 | 269.230769 | 369.230769 | 1.505376 | 2.064516 |
| 4 | 5 | Fifth Ward | 17722.0 | 22686 | 0.24 | 0.24 | 73841.66667 | 94525.0 | 1.0 | 165.0 | ... | 0.023847 | 395.833333 | 62.0 | 3.0 | 88.0 | 3.0 | 258.333333 | 366.666667 | 2.732963 | 3.879044 |


In [26]:
# Inspect index and columns

# Display index structure
print("Index structure:")
print(df.index)

# Display column names
print("Column names:")
print(df.columns)


Index structure:
RangeIndex(start=0, stop=19, step=1)
Column names:
Index(['ward_id', 'ward_name', 'population_1830', 'population_1850',
       'ward_area_1832', 'ward_area_1849', 'pop_density_1830',
       'pop_density_1850', 'cholera_1832_hotspot', 'cholera_1849_deaths',
       'cholera_1849_rate_per_1000', 'sewer_present_1838',
       'sewer_density_1838', 'sewer_present_1847', 'sewer_density_1847',
       'cellar_or_basement_count_1850', 'cellar_room_count_1850',
       'cellar_inhabitant_count_1850', 'persons_per_room_1850',
       'cellar_exposure_rate_1850', 'cellar_density_1850',
       'stopcock_count_1850', 'stopcock_density_1850', 'hydrant_count_1850',
       'hydrant_density_1850', 'stopcocks_per_sq_mile_1850',
       'hydrants_per_sq_mile_1850', 'stopcocks_per_1000_1850',
       'hydrants_per_1000_1850'],
      dtype='object')


In [27]:
# ward_id check: each ward should appear once

# Count and print number of unique ward_id values
print("Unique ward_id count:", df["ward_id"].nunique())

# Count and print total number of rows in the dataset
print("Total rows:", len(df))


Unique ward_id count: 19
Total rows: 19


The steps above confirm that the dataset loaded correctly: it has 19 rows and 19 unique `ward_id` values, so no ward identifier is duplicated. Structural integrity is confirmed.
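Comparing the two printed counts works, but it does not show which rows are at fault when the counts differ. A minimal sketch of a check that returns the offending rows is given below. `check_unique_ids` is a hypothetical helper, and the toy frame is illustrative, not project data.

```python
import pandas as pd


def check_unique_ids(frame, id_col="ward_id"):
    """Return the rows whose id_col value occurs more than once (empty if all are unique)."""
    return frame[frame[id_col].duplicated(keep=False)]


# Toy frame with a deliberately repeated id, for illustration only
toy = pd.DataFrame({
    "ward_id": [1, 2, 2],
    "ward_name": ["First Ward", "Second Ward", "Second Ward"],
})
dupes = check_unique_ids(toy)
print(dupes)
```

On the real dataframe, `assert check_unique_ids(df).empty` (or `assert df["ward_id"].is_unique`) would stop the notebook if a duplicate were ever introduced.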

### Check for missing values

In [28]:
# Check for missing values

# Count missing values in each column
missing_counts = df.isna().sum()

# Print missing values
print("Missing values per column:")
print(missing_counts)

Missing values per column:
ward_id                          0
ward_name                        0
population_1830                  5
population_1850                  0
ward_area_1832                   5
ward_area_1849                   1
pop_density_1830                 5
pop_density_1850                 1
cholera_1832_hotspot             5
cholera_1849_deaths              1
cholera_1849_rate_per_1000       1
sewer_present_1838               5
sewer_density_1838               5
sewer_present_1847               1
sewer_density_1847               1
cellar_or_basement_count_1850    1
cellar_room_count_1850           1
cellar_inhabitant_count_1850     1
persons_per_room_1850            1
cellar_exposure_rate_1850        1
cellar_density_1850              1
stopcock_count_1850              3
stopcock_density_1850            3
hydrant_count_1850               3
hydrant_density_1850             3
stopcocks_per_sq_mile_1850       3
hydrants_per_sq_mile_1850        3
stopcocks_per_1000_1850     

In [29]:
# Identify wards missing 1830 population
df[df["population_1830"].isna()][["ward_id", "ward_name"]]

| | ward_id | ward_name |
|---|---|---|
| 14 | 15 | Fifteenth Ward |
| 15 | 16 | Sixteenth Ward |
| 16 | 17 | Seventeenth Ward |
| 17 | 18 | Eighteenth Ward |
| 18 | 19 | Nineteenth Ward |


### Interpretation of missing values
These missing values are not errors in the dataset; they reflect structural differences across the years.

*   Variables tied to 1832 are available for only 14 wards, consistent with 5 missing values per column; for `population_1830` these are wards 15 through 19. The structure is intentionally restricted.
*   Variables tied to 1850 with 1 missing value per column reflect that there were 19 wards in the 1850 census count, but only 18 wards were recorded in the cholera, cellar and sewer mapping.
*   Columns with 3 missing values (the stopcock and hydrant variables) mark records that could not be confirmed from contemporary mapping sources.
*   All missing values listed are expected and previously documented.
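The interpretation above rests on columns with the same missing count being missing for the same wards, which the per-column counts alone do not show. The sketch below groups columns by the exact set of ids with missing values. `missing_patterns` is a hypothetical helper, and the toy frame uses made-up values rather than project data.

```python
import numpy as np
import pandas as pd


def missing_patterns(frame, id_col="ward_id"):
    """Map each tuple of ids with missing values to the list of columns sharing it."""
    patterns = {}
    for col in frame.columns:
        # ids of the rows where this column is missing, as plain Python values
        ids = tuple(frame.loc[frame[col].isna(), id_col].tolist())
        if ids:
            patterns.setdefault(ids, []).append(col)
    return patterns


# Toy frame for illustration: columns a and b share one pattern, c has another
toy = pd.DataFrame({
    "ward_id": [1, 2, 3],
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, np.nan, np.nan],
    "c": [np.nan, 1.0, 2.0],
})
patterns = missing_patterns(toy)
for ids, cols in patterns.items():
    print(ids, cols)
```

Running `missing_patterns(df)` on the ward-level data would show whether each group of columns is in fact missing for a single, shared set of wards.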





### Validate Data Types

In [30]:
# Display data types for all columns
print("Data types:")
print(df.dtypes)

Data types:
ward_id                            int64
ward_name                         object
population_1830                  float64
population_1850                    int64
ward_area_1832                   float64
ward_area_1849                   float64
pop_density_1830                 float64
pop_density_1850                 float64
cholera_1832_hotspot             float64
cholera_1849_deaths              float64
cholera_1849_rate_per_1000       float64
sewer_present_1838               float64
sewer_density_1838               float64
sewer_present_1847               float64
sewer_density_1847               float64
cellar_or_basement_count_1850    float64
cellar_room_count_1850           float64
cellar_inhabitant_count_1850     float64
persons_per_room_1850            float64
cellar_exposure_rate_1850        float64
cellar_density_1850              float64
stopcock_count_1850              float64
stopcock_density_1850            float64
hydrant_count_1850               float64
hydr