---
format: 
  html:
    toc: false
    page-layout: full
    code-fold: true
    code-tools: true
execute:
    echo: true
    warning: false
---

# 2) NFIP Data Processing


FEMA administers the National Flood Insurance Program (NFIP), which helps identify flood risks, supports floodplain management, and provides flood insurance and protection. The OpenFEMA data portal provides access to the Federal Insurance and Mitigation Administration (FIMA) [NFIP Redacted Claims data set](https://www.fema.gov/openfema-data-page/fima-nfip-redacted-claims-v2). This data set has over 2 million records, with personally identifying information removed and latitude/longitude coordinates rounded to one decimal place. Because of difficulties accessing the data through FEMA's API, and because the full file is too large to host on GitHub, the data was downloaded locally, cleaned, and saved as a smaller version available in this project's repository.
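
To give a sense of what one-decimal coordinates mean spatially, the sketch below estimates the size of a 0.1 degree cell. It assumes a spherical Earth with a mean radius of 6371 km and uses 40 degrees north as a rough latitude for New Jersey. The function name `cell_size_km` is hypothetical and is not part of the project code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def cell_size_km(latitude_deg, step_deg=0.1):
    """Approximate north-south and east-west extent (km) of a step_deg cell."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    north_south = step_deg * km_per_degree
    # Meridians converge toward the poles, so east-west distance shrinks with cos(latitude)
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# 40 degrees north is an assumed, roughly central latitude for New Jersey
ns_km, ew_km = cell_size_km(40.0)
print(f"0.1 degree cell at 40N: about {ns_km:.1f} km by {ew_km:.1f} km")
```

Under these assumptions a rounded coordinate locates a claim only to within a cell of roughly 11 km by 8.5 km, so the points are better suited to tract-level or regional summaries than to parcel-level analysis.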

Cleaning steps included extracting latitude and longitude coordinates, then filtering for New Jersey claims between 1995 and 2023, with the earlier date chosen because it aligns with the start of the Blue Acres program. Additionally, only single family home claims were included in this cleaned data set, since only private property is eligible for buyouts and the Blue Acres program has focused on homeowners.
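
The filtering logic described above can be sketched on a handful of made-up records. The column names follow the OpenFEMA fields used in this project, and occupancy codes 1 and 11 are treated as single family homes, as in the processing code. The values themselves are invented for illustration.

```python
import pandas as pd

# Toy records with made-up values; only the column names come from the real data set
toy = pd.DataFrame(
    {
        "state": ["NJ", "NJ", "NY", "NJ", "NJ"],
        "yearOfLoss": [1992, 1999, 2012, 2012, 2021],
        "occupancyType": [1, 1, 1, 2, 11],
    }
)

# One boolean mask per cleaning rule keeps each condition easy to read and test
in_nj = toy["state"] == "NJ"
since_1995 = toy["yearOfLoss"] >= 1995
single_family = toy["occupancyType"].isin([1, 11])

nj_claims = toy[in_nj & since_1995]                   # all NJ claims from 1995 onward
sfh_claims = toy[in_nj & since_1995 & single_family]  # single family subset

print(len(nj_claims), len(sfh_claims))
```

Naming the masks separately makes it straightforward to report how many records each rule removes.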

In [1]:
#| eval: false

# packages
import geopandas as gpd
import numpy as np
import pandas as pd

# options
pd.options.display.max_columns = 999

# load data (not in the repo because the file is too large; download it from https://www.fema.gov/openfema-data-page/fima-nfip-redacted-claims-v2)
claims_df = pd.read_csv("data/claims.csv")

# keep only the columns needed downstream; .copy() avoids SettingWithCopyWarning on later assignments
claims_features = claims_df[["dateOfLoss",
                             "baseFloodElevation",
                             "ratedFloodZone",
                             "locationOfContents",
                             "occupancyType",
                             "amountPaidOnBuildingClaim",
                             "amountPaidOnContentsClaim",
                             "totalBuildingInsuranceCoverage",
                             "totalContentsInsuranceCoverage",
                             "yearOfLoss",
                             "primaryResidenceIndicator",
                             "buildingDamageAmount",
                             "netBuildingPaymentAmount",
                             "buildingPropertyValue",
                             "contentsDamageAmount",
                             "netContentsPaymentAmount",
                             "contentsPropertyValue",
                             "floodCharacteristicsIndicator",
                             "floodWaterDuration",
                             "floodproofedIndicator",
                             "floodEvent",
                             "buildingReplacementCost",
                             "contentsReplacementCost",
                             "stateOwnedIndicator",
                             "buildingDescriptionCode",
                             "rentalPropertyIndicator",
                             "state",
                             "countyCode",
                             "censusTract",
                             "censusBlockGroupFips",
                             "latitude",
                             "longitude",
                             "id"]].copy()

# fix number formatting: the codes may load as floats, so drop only a literal trailing ".0"
# (str.rstrip('.0') strips any trailing '.' or '0' characters, which corrupts codes ending in zero)
# this has to be repeated when loading the data into a different notebook
for col in ['countyCode', 'censusTract', 'censusBlockGroupFips']:
    claims_features[col] = claims_features[col].astype(str).str.replace(r'\.0$', '', regex=True)

# Remove rows with missing geometry
claims_features = claims_features.dropna(subset=["latitude", "longitude"])

# Create geoDataFrame
claims = gpd.GeoDataFrame(
    claims_features,
    geometry=gpd.points_from_xy(claims_features["longitude"], claims_features["latitude"]),  # use the filtered frame so lengths match
    crs="EPSG:4326",
)

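# reproject to Web Mercator (EPSG:3857), a projected CRS with units of meters
claims_gpd = claims.to_crs('EPSG:3857')

# extract projected x and y coordinates (meters, not degrees of longitude/latitude)
claims_gpd['x'] = claims_gpd['geometry'].x
claims_gpd['y'] = claims_gpd['geometry'].y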

# filtering for NJ claims since 1995
states = ['NJ']
claims_NJ = claims_gpd[(claims_gpd['yearOfLoss'] >= 1995) & (claims_gpd['state'].isin(states))]

# save all NJ claims
from pathlib import Path  
filepath = Path('data/claims_NJ.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
claims_NJ.to_csv(filepath) 

# flag single family homes (occupancyType 1 or 11) in the cleaned NJ claims; claims_sfh keeps only those rows
claims_NJ_clean = claims_NJ.copy()  # copy so new columns are not written to a slice of claims_gpd
sfh = claims_NJ_clean['occupancyType'].isin([1, 11])
claims_NJ_clean['sfh'] = np.where(sfh, 1, 0)
claims_NJ_clean['observation'] = 1
claims_sfh = claims_NJ_clean[claims_NJ_clean['sfh'] == 1]

# save cleaned NJ claims (Path is already imported above)
filepath = Path('data/claims_NJ_clean.csv')
filepath.parent.mkdir(parents=True, exist_ok=True)
claims_NJ_clean.to_csv(filepath)

# save single family housing NJ claims (Path is already imported above)
filepath = Path('data/claims_sfh.csv')
filepath.parent.mkdir(parents=True, exist_ok=True)
claims_sfh.to_csv(filepath)
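
The processing code notes that the number formatting has to be repeated whenever the saved file is loaded in another notebook, because `pd.read_csv` infers numeric types again. One possible way around this, shown as a minimal sketch and not necessarily how the other notebooks in this project do it, is to pass `dtype` for the identifier columns when reading. The CSV text below is made up and stands in for `data/claims_sfh.csv`.

```python
import io

import pandas as pd

# Made-up rows standing in for the saved claims file
csv_text = (
    "countyCode,censusTract,yearOfLoss\n"
    "34001,34001010100,2012\n"
    "34029,34029714000,2011\n"
)

# Default parsing infers numeric types for the identifier columns
inferred = pd.read_csv(io.StringIO(csv_text))

# Declaring the identifier columns as strings keeps them exactly as written in the file
as_text = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"countyCode": str, "censusTract": str},
)

print(inferred.dtypes)
print(as_text.dtypes)
```

Keeping these codes as strings also makes later joins to census geographies less error-prone, since the join keys there are typically stored as text.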