# Flood Insurance Data Cleaning
_Calvin Whealton_

This notebook processes the redacted National Flood Insurance Program (NFIP) claims dataset. The data was obtained from https://www.fema.gov/media-library/assets/documents/180374. It covers claims from 1970 to 2019 and, in addition to many characteristics of each claim, includes the zip code of the claim. The main value analyzed in this work is the amount paid on each claim.

In [None]:
import os
import pandas as pd
import geopandas as gpd

In [None]:
# directory where the data is stored
os.chdir('/Users/calvinwhealton/Documents/GitHub/floods_housing_zipcode/data/FIMA_NFIP_Redacted_Claims_Data_Set')

# reading in file
claims = pd.read_csv('openFEMA_claims20190831.csv')

In [None]:
claims.head()

In [None]:
claims.shape

In [None]:
min(claims['yearofloss']),max(claims['yearofloss'])

In [None]:
len(claims['reportedzipcode'].unique())

The focus is on housing. Claims for small businesses, agricultural buildings, non-profit buildings, and places of worship are not of immediate interest and are dropped.

In [None]:
claims.drop(claims[claims['houseworship']=='Y'].index,inplace=True)
claims.drop(claims[claims['agriculturestructureindicator']=='Y'].index,inplace=True)
claims.drop(claims[claims['nonprofitindicator']=='Y'].index,inplace=True)
claims.drop(claims[claims['smallbusinessindicatorbuilding']=='Y'].index,inplace=True)
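
As a possible alternative to the four separate drops, the same filter can be sketched with a single boolean mask. The toy values below are made up; only the column names come from the dataset, and the sketch assumes the flags are stored as `'Y'`/`'N'` strings.

```python
import pandas as pd

# Toy data: column names match the notebook, values are invented for illustration
toy = pd.DataFrame({
    'houseworship': ['N', 'Y', 'N', 'N'],
    'agriculturestructureindicator': ['N', 'N', 'Y', 'N'],
    'nonprofitindicator': ['N', 'N', 'N', 'N'],
    'smallbusinessindicatorbuilding': ['N', 'N', 'N', None],
})

flag_cols = ['houseworship', 'agriculturestructureindicator',
             'nonprofitindicator', 'smallbusinessindicatorbuilding']

# True for rows where any flag equals 'Y'; missing flags compare as False
is_excluded = toy[flag_cols].eq('Y').any(axis=1)

# Keep only the rows with no 'Y' flag
housing_only = toy.loc[~is_excluded]
print(housing_only.index.tolist())
```

As with the drops above, a row with a missing flag is kept, since a missing value is not equal to `'Y'`.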

In [None]:
claims.shape

The metadata states that a negative claim amount means that the check was not cashed and had to be reissued. The payment itself would therefore presumably still be positive, so the absolute value is taken. The claims are divided into building and contents. For the purpose of this analysis, both are treated as damage to the structure.

In [None]:
min(claims['amountpaidonbuildingclaim']),max(claims['amountpaidonbuildingclaim'])

In [None]:
claims['amountpaidonbuildingclaim'] = claims['amountpaidonbuildingclaim'].abs()
claims['amountpaidoncontentsclaim'] = claims['amountpaidoncontentsclaim'].abs()
min(claims['amountpaidonbuildingclaim']),max(claims['amountpaidonbuildingclaim'])
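
Before overwriting the column, it can be useful to know how many rows the sign flip affects. A minimal sketch on made-up values (the column name is from the dataset, the numbers are not):

```python
import pandas as pd

# Toy payments, including one negative (reissued check) and one missing value
toy = pd.DataFrame({'amountpaidonbuildingclaim': [1200.0, -350.0, 0.0, None]})

# Count the negative entries before they are flipped
n_negative = int((toy['amountpaidonbuildingclaim'] < 0).sum())

# abs() flips the sign of negatives and leaves missing values missing
toy['amountpaidonbuildingclaim'] = toy['amountpaidonbuildingclaim'].abs()
print(n_negative, toy['amountpaidonbuildingclaim'].min())
```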

Making a year-month column, by dropping the day from the date of loss, that will be used to aggregate the losses to monthly values.

In [None]:
claims['yearmonthofloss'] = claims['dateofloss'].str[:-3]
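
The slice above relies on `dateofloss` being a string whose last three characters are the day, as in `YYYY-MM-DD`. That format is an assumption here. Under the same assumption, a sketch of a stricter variant parses the dates first, which raises an error on malformed input instead of silently slicing it:

```python
import pandas as pd

# Toy dates, assumed to be 'YYYY-MM-DD' strings
dates = pd.Series(['2005-08-29', '2012-10-29', '1998-01-05'])

# Approach used in the notebook: drop the last three characters ('-DD')
sliced = dates.str[:-3]

# Alternative: parse explicitly, then format as year-month
parsed = pd.to_datetime(dates, format='%Y-%m-%d').dt.strftime('%Y-%m')

print(sliced.tolist())
print(parsed.tolist())
```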

In [None]:
claims.head()

In [None]:
claims['GEOID10_str'] = claims['reportedzipcode'].apply(lambda x: '{0:0>5}'.format(x))
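
The format string left-pads the zip code with zeros to five characters. A small sketch on made-up integer zip codes shows the effect, with `str.zfill` as an equivalent. One caveat, which may or may not apply to this file: if the column were read as floats (for example because it contains missing values), the formatted text would include `.0` and would not be padded.

```python
import pandas as pd

# Toy zip codes stored as integers, so leading zeros have been lost
raw = pd.Series([7030, 90210, 501])

# Notebook approach: pad on the left with '0' to width 5
padded_format = raw.apply(lambda x: '{0:0>5}'.format(x))

# Equivalent vectorized approach
padded_zfill = raw.astype(str).str.zfill(5)

# Caveat: a float input keeps its decimal part and is already wider than 5
float_case = '{0:0>5}'.format(7030.0)

print(padded_format.tolist(), float_case)
```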

In [None]:
claims_for_groupby = claims.filter(['GEOID10_str','yearmonthofloss','amountpaidoncontentsclaim','amountpaidonbuildingclaim'])

Loading a zip code shapefile that will be used to identify valid zip codes. The valid zip codes are those in the US Census ZCTA (Zip Code Tabulation Area) shapefile. The shapefile has been clipped to the 48 contiguous states.

In [None]:
os.chdir('/Users/calvinwhealton/Documents/GitHub/floods_housing_zipcode/data/geo_data/tl_2019_us_zcta510_clipped48contig')
zip_shape = gpd.read_file('clipped48contig.shp')

In [None]:
valid_zips = zip_shape['ZCTA5CE10'].values

Dropping rows whose zip codes are not in the list of valid zip codes. The claims are filtered before the group-by to reduce the number of results.

Using the `isin()` function because it is faster than looping through the dataframe.

In [None]:
claims_for_groupby = claims_for_groupby.loc[claims_for_groupby['GEOID10_str'].isin(valid_zips)]
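
A minimal sketch of the same `isin()` filter on toy data, also counting how many rows are dropped. The zip codes and amounts are made up, and `valid` stands in for `valid_zips`.

```python
import numpy as np
import pandas as pd

# Toy claims: '99999' is not in the list of valid zip codes
toy = pd.DataFrame({
    'GEOID10_str': ['07030', '99999', '10001', '07030'],
    'amountpaidonbuildingclaim': [100.0, 50.0, 25.0, 10.0],
})
valid = np.array(['07030', '10001'])  # stand-in for valid_zips

# Boolean mask: True where the zip code is valid
mask = toy['GEOID10_str'].isin(valid)
kept = toy.loc[mask]
n_dropped = int((~mask).sum())

print(len(kept), n_dropped)
```

Both sides of the comparison need to be strings of the same form (five characters, zero-padded) for the match to work.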

In [None]:
claims_for_groupby.head()

In [None]:
claims_gb = claims_for_groupby.groupby(['GEOID10_str','yearmonthofloss']).sum()

In [None]:
claims_gb.head()

Checking that nothing was lost in the `groupby()` operation: the total contents payments should be the same before and after.

In [None]:
claims_gb['amountpaidoncontentsclaim'].sum()

In [None]:
claims_for_groupby['amountpaidoncontentsclaim'].sum()
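
Rather than comparing the two totals by eye, the check could be sketched as a tolerance-based comparison. The data below is made up, and `np.isclose` is one possible choice of comparison that allows for floating-point rounding.

```python
import numpy as np
import pandas as pd

# Toy claims with two rows falling in the same zip code and month
toy = pd.DataFrame({
    'GEOID10_str': ['07030', '07030', '10001', '10001'],
    'yearmonthofloss': ['2005-08', '2005-08', '2005-08', '2012-10'],
    'amountpaidoncontentsclaim': [10.25, 4.75, 0.0, 100.1],
    'amountpaidonbuildingclaim': [200.0, 50.5, 30.0, 0.0],
})

grouped = toy.groupby(['GEOID10_str', 'yearmonthofloss']).sum()

# The grand total should be unchanged by grouping, up to rounding
totals_match = np.isclose(grouped['amountpaidoncontentsclaim'].sum(),
                          toy['amountpaidoncontentsclaim'].sum())
print(grouped.shape[0], bool(totals_match))
```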

Adding the contents and building payments to get the total amount paid for each zip code and month.

In [None]:
claims_gb['amountpaid'] = claims_gb['amountpaidoncontentsclaim'] + claims_gb['amountpaidonbuildingclaim']

In [None]:
claims_for_groupby['yearmonthofloss'].min(),claims_for_groupby['yearmonthofloss'].max()

In [None]:
# splitting the (zip code, year-month) index tuples into two sequences
zips,dates = zip(*claims_gb.index)
zips2 = sorted(list(set(zips)))
dates2 = sorted(list(set(dates)))

In [None]:
# making claims dataframe
claims_ts = pd.DataFrame({'GEOID10_str':zips2})

In [None]:
for d in dates2:
    claims_ts[d] = 0

In [None]:
for vals in range(len(zips)):
    claims_ts.loc[claims_ts['GEOID10_str']==zips[vals],dates[vals]] = claims_gb['amountpaid'].values[vals]
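
The loop above fills the table one cell at a time. A possible vectorized alternative, sketched here on toy data rather than taken from the notebook, is to unstack the year-month level of the grouped result into columns and fill the missing combinations with zero.

```python
import pandas as pd

# Toy totals per zip code and month (values are invented)
toy = pd.DataFrame({
    'GEOID10_str': ['07030', '07030', '10001'],
    'yearmonthofloss': ['2005-08', '2012-10', '2012-10'],
    'amountpaid': [100.0, 40.0, 7.5],
})
gb = toy.groupby(['GEOID10_str', 'yearmonthofloss']).sum()

# Move the year-month index level into columns; absent pairs become 0
wide = gb['amountpaid'].unstack(fill_value=0).reset_index()
wide.columns.name = None  # drop the leftover 'yearmonthofloss' axis label

print(wide)
```

The result has one row per zip code and one column per month, the same layout as `claims_ts`.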

In [None]:
claims_ts.sum()

Comparing the column sums above with the monthly sums below shows that they match to within rounding, so no claim amounts were lost or gained.

In [None]:
claims_for_groupby.groupby('yearmonthofloss').sum()

In [None]:
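# number of zeros (no claims); the zip code column never equals 0, so it adds nothing
num_zero_claims = sum((claims_ts == 0).astype(int).sum(axis=1))

In [None]:
# number of zip code and month combinations (excluding the zip code column)
num_possible_claims = (claims_ts.shape[1]-1)*claims_ts.shape[0]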

In [None]:
num_claims = num_possible_claims - num_zero_claims
num_claims
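
The same count can be sketched directly on the month columns of a toy table. The values are made up and `toy_ts` is a hypothetical stand-in for the wide table. Note that a zero cell means no amount was paid for that zip code and month, which is taken here to mean no claims.

```python
import pandas as pd

# Toy wide table: one row per zip code, one column per month
toy_ts = pd.DataFrame({
    'GEOID10_str': ['07030', '10001'],
    '2005-08': [100.0, 0.0],
    '2012-10': [40.0, 7.5],
})

# All columns except the zip code identifier
month_cols = toy_ts.columns.drop('GEOID10_str')

# Count zip-month cells with a nonzero paid amount
num_nonzero = int((toy_ts[month_cols] != 0).sum().sum())
num_possible = toy_ts.shape[0] * len(month_cols)
num_zero = num_possible - num_nonzero

print(num_nonzero, num_possible, num_zero)
```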

In [None]:
os.chdir('/Users/calvinwhealton/Documents/GitHub/floods_housing_zipcode/data/processed_data')
claims_ts.to_csv('ts_claims_month.csv')