# Sprint 3 - Data Integration & Cleaning Notebook
**Emily Nguyen, Kaylynn Francisco-Nelson, Angela Iraya**

### LA Hospital Dataset Exploration
---

The Points of Interest Hospitals dataset is maintained by the Los Angeles County Internal Services Department, Enterprise GIS Section. It is part of LA County’s “Points of Interest” collection and provides geolocated information on hospitals and related health facilities across LA County. We use it to integrate contextual healthcare accessibility data into our main sexual crimes dataset, since access to hospitals may provide insight into emergency response times, healthcare proximity, or victim outcomes.

The hospital dataset contains 93 hospital records. Key variables include:
- Facility Information: FACNAME, BUSINESS_NAME, FAC_TYPE_CODE, FAC_STATUS_TYPE_CODE, CAPACITY
- Location Data: ADDRESS, City, ZIP Code, LATITUDE, LONGITUDE
- Administrative Fields: LICENSE_NUMBER, LICENSE_STATUS_DESCRIPTION, DISTRICT_NAME, COUNTY_NAME
- Healthcare Attributes: BIRTHING_FACILITY_FLAG, TRAUMA_CTR, CRITICAL_ACCESS_HOSPITAL

In [11]:
# Import modules
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
import matplotlib.pyplot as plt
import requests
import zipfile
import os

In [32]:
# Load the hospital dataset
df = pd.read_csv("../data/Points_of_Interest_Hospitals.csv")
df.head()

Unnamed: 0,OBJECTID,City,ZIP Code,LICENSED_CERTIFIED,FLAG,T18_19,FACID,FAC_STATUS_TYPE_CODE,ASPEN_FACID,CCN,...,CCLHO_NAME,FIPS_COUNTY_CODE,BIRTHING_FACILITY_FLAG,TRAUMA_PED_CTR,TRAUMA_CTR,TYPE_OF_CARE,CRITICAL_ACCESS_HOSPITAL,DATA_DATE,x,y
0,122506,TORRANCE,90502,LICENSED AND CERTIFIED,,,60000027,OPEN,CA060000027,50376.0,...,LOS ANGELES,6037,YES,LEVEL II PED,LEVEL I,,,2025-03-17T00:00:00,6472977.0,1760762.0
1,122507,DOWNEY,90242,LICENSED AND CERTIFIED,,,60000028,OPEN,CA060000028,50717.0,...,LOS ANGELES,6037,,,,,,2025-03-17T00:00:00,6513737.0,1796618.0
2,122508,LOS ANGELES,90059,LICENSED AND CERTIFIED,,,60000035,OPEN,CA06000035,50779.0,...,LOS ANGELES,6037,YES,,,,,2025-03-17T00:00:00,6487597.0,1794882.0
3,122509,SYLMAR,91342,LICENSED AND CERTIFIED,,,60000038,OPEN,CA060000038,50040.0,...,LOS ANGELES,6037,YES,,,,,2025-03-17T00:00:00,6425291.0,1941683.0
4,122510,LOS ANGELES,90033,LICENSED AND CERTIFIED,,,60000040,OPEN,CA060000040,50373.0,...,LOS ANGELES,6037,YES,LEVEL II PED,LEVEL I,,,2025-03-17T00:00:00,6499125.0,1842927.0


In [13]:
# Look at the columns in the hospital dataset
df.columns

Index(['OBJECTID', 'City', 'ZIP Code', 'LICENSED_CERTIFIED', 'FLAG', 'T18_19',
       'FACID', 'FAC_STATUS_TYPE_CODE', 'ASPEN_FACID', 'CCN', 'TERMINAT_SW',
       'PARTICIPATION_DATE', 'APPROVAL_DATE', 'NPI', 'CAN_BE_DEEMED_FAC_TYPE',
       'CAN_BE_CERTIFIED_FAC_TYPE', 'DEEMED', 'AO_CD', 'DMG_EFCTV_DT',
       'AO_TRMNTN_DT', 'AO_NAME', 'FACNAME', 'FAC_TYPE_CODE', 'FAC_FDR', 'LTC',
       'CAPACITY', 'ADDRESS', 'ZIP9', 'FACADMIN', 'CONTACT_EMAIL',
       'CONTACT_FAX', 'CONTACT_PHONE_NUMBER', 'COUNTY_CODE', 'COUNTY_NAME',
       'DISTRICT_NUMBER', 'DISTRICT_NAME', 'ISFACMAIN', 'PARENT_FACID',
       'FAC_FAC_RELATIONSHIP_TYPE_CODE', 'START_DATE', 'LICENSE_NUMBER',
       'BUSINESS_NAME', 'LICENSE_STATUS_DESCRIPTION', 'INITIAL_LICENSE_DATE',
       'LICENSE_EFFECTIVE_DATE', 'LICENSE_EXPIRATION_DATE',
       'ENTITY_TYPE_DESCRIPTION', 'LATITUDE', 'LONGITUDE', 'LOCATION',
       'HCAI_ID', 'CCLHO_CODE', 'CCLHO_NAME', 'FIPS_COUNTY_CODE',
       'BIRTHING_FACILITY_FLAG', 'TRAUMA_PED_CTR', 

**Initial Observations**:

The dataset contains over 60 columns, but we will likely need only a few of them.
Key observations so far:
- Some columns (e.g., x, y, LATITUDE, LONGITUDE) provide location data.
- Others like FAC_STATUS_TYPE_CODE, TYPE_OF_CARE, and TRAUMA_CTR describe each facility’s status and services.
- ZIP Code appears to be the most useful for linking to our crime dataset, which also includes ZIP codes.

In [14]:
df['City'].unique()

array(['TORRANCE', 'DOWNEY', 'LOS ANGELES', 'SYLMAR', 'WEST HOLLYWOOD',
       'ALHAMBRA', 'LANCASTER', 'MONTEBELLO', 'AVALON', 'CULVER CITY',
       'BELLFLOWER', 'POMONA', 'INGLEWOOD', 'PASADENA', 'PANORAMA CITY',
       'WEST HILLS', 'PARAMOUNT', 'DUARTE', 'NORWALK', 'GARDENA',
       'HUNTINGTON PARK', 'SAN GABRIEL', 'WEST COVINA', 'MARINA DEL REY',
       'LAKEWOOD', 'SANTA MONICA', 'ENCINO', 'GLENDORA', 'MONTEREY PARK',
       'GLENDALE', 'SOUTH EL MONTE', 'VALENCIA', 'MISSION HILLS',
       'COVINA', 'HARBOR CITY', 'PALMDALE', 'LA MIRADA', 'TARZANA',
       'LONG BEACH', 'ARCADIA', 'WHITTIER', 'BURBANK', 'MONROVIA',
       'NORTHRIDGE', 'SAN DIMAS', 'LYNWOOD', 'VAN NUYS', 'SUN VALLEY',
       'SHERMAN OAKS', 'BALDWIN PARK', 'SAN PEDRO', 'WOODLAND HILLS'],
      dtype=object)

Variables like T18_19 and TYPE_OF_CARE are not useful for our analyses: the dataset has only 93 records, and both columns are missing in every one of them.

In [35]:
# Count missing values per column and show the 10 columns with the most
df.isna().sum().sort_values(ascending=False).head(10)

T18_19                      93
TYPE_OF_CARE                93
CRITICAL_ACCESS_HOSPITAL    92
TERMINAT_SW                 92
AO_TRMNTN_DT                91
FLAG                        88
TRAUMA_PED_CTR              85
PARENT_FACID                84
START_DATE                  84
TRAUMA_CTR                  78
dtype: int64
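
Columns with no observed values at all, such as T18_19 and TYPE_OF_CARE above, can also be found and dropped programmatically instead of by eye. The following is a minimal sketch on a toy frame; the values and the names `toy`, `missing_share`, `empty_cols`, and `trimmed` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the hospital data (values are illustrative only)
toy = pd.DataFrame({
    "FACNAME": ["A", "B", "C"],
    "T18_19": [np.nan, np.nan, np.nan],
    "TYPE_OF_CARE": [np.nan, np.nan, np.nan],
    "TRAUMA_CTR": ["LEVEL I", np.nan, np.nan],
})

# Share of missing values per column, highest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns in which every record is missing
empty_cols = missing_share[missing_share == 1.0].index.tolist()

# how="all" drops only the columns that are entirely NaN
trimmed = toy.dropna(axis=1, how="all")

print(missing_share)
print("Dropped:", sorted(empty_cols))
```

Working with the share of missing values rather than the raw count makes the threshold independent of the number of records.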

**Key Variables Kept for Analyses**:
- FACNAME: Facility name, identifies each hospital record
- ZIP Code: Postal area of the hospital, used to join with the crime dataset
- City: City where the facility is located, secondary spatial identifier
- FAC_STATUS_TYPE_CODE: Operational status (e.g., OPEN, CLOSED), indicates active healthcare coverage
- TRAUMA_CTR: Trauma care level (e.g., Level I, II), reflects emergency service capacity
- BIRTHING_FACILITY_FLAG: Indicates birthing facility availability, relevant for gender-related healthcare accessibility
- CRITICAL_ACCESS_HOSPITAL: Marks federally designated rural emergency hospitals, adds rural–urban healthcare context
- LATITUDE / LONGITUDE: Facility coordinates, could be used for spatial visualization later
- COUNTY_NAME: County in which the hospital is located, supports geographic summaries

In [36]:
# Keep only the columns needed for analysis
key_vars = [
    "FACNAME",
    "City",
    "ZIP Code",
    "FAC_STATUS_TYPE_CODE",
    "TRAUMA_CTR",
    "BIRTHING_FACILITY_FLAG",
    "CRITICAL_ACCESS_HOSPITAL",
    "LATITUDE",
    "LONGITUDE",
    "COUNTY_NAME"
]

# Copy so later column assignments modify this frame, not a view of the original
df = df[key_vars].copy()
df.head()

Unnamed: 0,FACNAME,City,ZIP Code,FAC_STATUS_TYPE_CODE,TRAUMA_CTR,BIRTHING_FACILITY_FLAG,CRITICAL_ACCESS_HOSPITAL,LATITUDE,LONGITUDE,COUNTY_NAME
0,LAC/HARBOR UCLA MEDICAL CENTER,TORRANCE,90502,OPEN,LEVEL I,YES,,33.830325,-118.292018,LOS ANGELES
1,LAC/RANCHO LOS AMIGOS NATIONAL REHABILITATION ...,DOWNEY,90242,OPEN,,,,33.9291,-118.157999,LOS ANGELES
2,"MARTIN LUTHER KING, JR. COMMUNITY HOSPITAL",LOS ANGELES,90059,OPEN,,YES,,33.924186,-118.244151,LOS ANGELES
3,LAC/OLIVE VIEW-UCLA MEDICAL CENTER,SYLMAR,91342,OPEN,,YES,,34.326981,-118.4517,LOS ANGELES
4,LOS ANGELES GENERAL MEDICAL CENTER,LOS ANGELES,90033,OPEN,LEVEL I,YES,,34.056278,-118.206478,LOS ANGELES


In [37]:
# get the number of rows and columns
num_rows, num_cols = df.shape

print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_cols}")

Number of rows: 93
Number of columns: 10


**Handling Missing Values**

In [38]:
# Count missing values per column in the reduced dataset
df.isna().sum().sort_values(ascending=False).head()

CRITICAL_ACCESS_HOSPITAL    92
TRAUMA_CTR                  78
BIRTHING_FACILITY_FLAG      42
FACNAME                      0
City                         0
dtype: int64

We filtered for facilities whose status is 'OPEN' to keep only hospitals that are currently operating. Conveniently, all 93 hospitals in the dataset are active.

In [39]:
# Filter and check for active hospitals
active_hospitals = df[df['FAC_STATUS_TYPE_CODE'] == 'OPEN']
print("Number of active hospitals:", active_hospitals.shape[0])

Number of active hospitals: 93
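
Before filtering, it can help to tally every status value, including missing ones, so that an unexpected code does not get dropped silently. The following is a small sketch on invented rows; the CLOSED and missing statuses below are hypothetical and do not occur in the hospital data shown above.

```python
import pandas as pd

# Toy stand-in for the hospital table; the non-OPEN rows are invented
toy = pd.DataFrame({
    "FACNAME": ["A", "B", "C", "D"],
    "FAC_STATUS_TYPE_CODE": ["OPEN", "OPEN", "CLOSED", None],
})

# Tally every status value; dropna=False also counts missing statuses
status_counts = toy["FAC_STATUS_TYPE_CODE"].value_counts(dropna=False)

# Keep only facilities that are currently operating
active = toy[toy["FAC_STATUS_TYPE_CODE"] == "OPEN"]

print(status_counts)
print("Active:", len(active), "of", len(toy))
```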


In [40]:
# Fill missing categorical data
df['TRAUMA_CTR'] = df['TRAUMA_CTR'].fillna('No Trauma')
df['BIRTHING_FACILITY_FLAG'] = df['BIRTHING_FACILITY_FLAG'].fillna('No')
df['CRITICAL_ACCESS_HOSPITAL'] = df['CRITICAL_ACCESS_HOSPITAL'].fillna('No')

# Checking for missing values again
df.isna().sum().sort_values(ascending=False).head()

FACNAME                 0
City                    0
ZIP Code                0
FAC_STATUS_TYPE_CODE    0
TRAUMA_CTR              0
dtype: int64
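
The same fills can be written as a single `fillna` call with one value per column, and the filled labels can then be turned into boolean indicators for later aggregation. The following is a sketch on toy rows; the values and the column names `HAS_TRAUMA_CTR` and `IS_BIRTHING_FACILITY` are hypothetical and not part of the notebook. It assumes, as the cell above does, that a missing value means the service is absent.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the three sparse columns (values are illustrative only)
toy = pd.DataFrame({
    "TRAUMA_CTR": ["LEVEL I", np.nan, "LEVEL II"],
    "BIRTHING_FACILITY_FLAG": ["YES", np.nan, "YES"],
    "CRITICAL_ACCESS_HOSPITAL": ["YES", np.nan, np.nan],
})

# One fill value per column, mirroring the cell above
fill_values = {
    "TRAUMA_CTR": "No Trauma",
    "BIRTHING_FACILITY_FLAG": "No",
    "CRITICAL_ACCESS_HOSPITAL": "No",
}
filled = toy.fillna(fill_values)

# Boolean indicators; upper-casing sidesteps the 'YES' vs 'No' case mismatch
filled["HAS_TRAUMA_CTR"] = filled["TRAUMA_CTR"] != "No Trauma"
filled["IS_BIRTHING_FACILITY"] = (
    filled["BIRTHING_FACILITY_FLAG"].str.upper() == "YES"
)

print(filled)
```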

### Joining Datasets by ZIP Code
---

In [41]:
df["ZIP Code"].unique()

array([90502, 90242, 90059, 91342, 90033, 90048, 90027, 90015, 91801,
       93534, 90640, 90704, 90026, 90232, 90706, 91767, 90301, 91105,
       91402, 91307, 90067, 90723, 91010, 90650, 90247, 90255, 91776,
       91791, 90292, 90712, 90404, 90241, 90023, 91436, 91741, 91754,
       91206, 91733, 91355, 90028, 91345, 90017, 91723, 90710, 90034,
       93551, 90503, 90056, 90638, 91356, 91204, 90806, 91007, 90602,
       91505, 90813, 91016, 91325, 91773, 90505, 90262, 91405, 91790,
       91352, 91403, 91706, 90605, 90732, 90036, 90095, 90089, 91208,
       91367])
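
The array above lists 73 distinct ZIP codes for 93 hospitals, so some ZIP codes contain more than one hospital. Merging the hospital rows directly onto crime records would therefore duplicate crimes in those ZIP codes. One way to avoid this is to aggregate hospitals to one row per ZIP code first. The following is a minimal sketch under stated assumptions: the crime frame, its column names (`crime_id`, `zip_code`), the helper `normalize_zip`, and the output columns `n_hospitals` and `n_trauma_centers` are all hypothetical.

```python
import pandas as pd

# Toy hospital rows (other than the ZIP codes, values are illustrative)
hospitals = pd.DataFrame({
    "FACNAME": ["H1", "H2", "H3"],
    "ZIP Code": [90033, 90033, 90502],
    "TRAUMA_CTR": ["LEVEL I", "No Trauma", "LEVEL I"],
})

# Hypothetical crime records; the column names are assumptions
crimes = pd.DataFrame({
    "crime_id": [1, 2, 3],
    "zip_code": ["90033", "90502-1234", "90001"],
})


def normalize_zip(series):
    """Cast to string and keep the first 5 characters (drops any ZIP+4 suffix)."""
    return series.astype(str).str.strip().str[:5]


hospitals["zip"] = normalize_zip(hospitals["ZIP Code"])
crimes["zip"] = normalize_zip(crimes["zip_code"])

# One row per ZIP code so the merge cannot duplicate crime records
per_zip = (
    hospitals.groupby("zip")
    .agg(
        n_hospitals=("FACNAME", "size"),
        n_trauma_centers=("TRAUMA_CTR", lambda s: (s != "No Trauma").sum()),
    )
    .reset_index()
)

# Left join keeps every crime; validate raises if per_zip has duplicate keys
merged = crimes.merge(per_zip, on="zip", how="left", validate="many_to_one")

# ZIP codes without a hospital get a count of zero rather than NaN
count_cols = ["n_hospitals", "n_trauma_centers"]
merged[count_cols] = merged[count_cols].fillna(0).astype(int)

print(merged)
```

Casting both keys to 5-character strings matters because the hospital ZIP codes are stored as integers, and a merge between integer and string keys would fail or match nothing.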