# Fusing Data

Data comes from five sources:

1. BAH data available from https://www.travel.dod.mil/Allowances/Basic-Allowance-for-Housing/BAH-Rate-Lookup/
2. Housing Price Index for Three-Digit ZIP Codes (Developmental Index; Not Seasonally Adjusted) available from https://www.fhfa.gov/data/hpi/datasets?tab=quarterly-data
3. US City average consumer price index for all items available from https://data.bls.gov/timeseries/CUUR0000SA0
4. Political party control of the Federal government (Senate, House of Representatives, Presidency)
5. Location data for each military housing area based on the data available from https://www.kaggle.com/datasets/mexwell/us-military-bases/data


# 1. Data File Preparation

*Package Imports*

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import re
from geopy.geocoders import Nominatim

*Global Variables/Settings*

In [2]:
raw_dir = "Raw Data"
fuse_dir = "Fused Data"
clean_dir = "Clean Data"

*Metadata Files*

In [160]:
mha_names_df = pd.read_csv(os.path.join(raw_dir,"mhanames.csv"))
mha_zips_df = pd.read_csv(os.path.join(raw_dir,"sorted_zipmha.csv"))

## 1.1. BAH Data

>Basic Allowance for Housing, or BAH, provides uniformed service members equitable housing compensation based on housing costs in local civilian housing markets within the 50 U.S. states when government quarters are not provided.
>
>BAH is not intended to cover all of a service member’s housing costs. The opportunity for Service members to choose their off-base housing is important to DoD. Each member has the freedom to decide how to allocate their income (including the housing allowance) without a penalty for deciding to conserve some dollars on rent to pay other expenses. Therefore, actual out-of-pocket expense for an individual may be higher or lower than the prescribed rate based on choice of housing.
>
>Basic Allowance for Housing | BAH | Defense Travel Management Office. (2025). Dod.mil. https://www.travel.dod.mil/Allowances/Basic-Allowance-for-Housing/

BAH data is published separately for each year, generally as an ASCII/CSV file, and is split into two categories: with and without dependents.

The ASCII/CSV file does not include headers, but a PDF that accompanies the data defines the columns as follows:

| Column | Data Type |
|---|---|
| MHA | CHAR (5) |
| E1 | NUM |
| E2 | NUM |
| E3 | NUM |
| E4 | NUM |
| E5 | NUM |
| E6 | NUM |
| E7 | NUM |
| E8 | NUM |
| E9 | NUM |
| W1 | NUM |
| W2 | NUM |
| W3 | NUM |
| W4 | NUM |
| W5 | NUM |
| O1E | NUM |
| O2E | NUM |
| O3E | NUM |
| O1 | NUM |
| O2 | NUM |
| O3 | NUM |
| O4 | NUM |
| O5 | NUM |
| O6 | NUM |
| O7 | NUM |

Some CSV files also include ranks O8-O10, although the accompanying data definitions do not list them. The column labels below are extended to account for the difference; where a file lacks those values, the columns are populated with nulls.
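
As a minimal sketch of that null-filling behaviour (toy data, not an actual BAH file): when `pd.read_csv` receives more labels in `names` than the file has columns, the unmatched trailing labels become all-null columns. The shortened `demo_header` and the values below are assumptions made up for illustration.

```python
import io
import pandas as pd

# Hypothetical shortened header; the notebook's csv_header has 28 labels.
demo_header = ["MHA", "E1", "E2", "O8"]

# Toy file content with only three columns, so there is no O8 value.
demo_csv = io.StringIO("AK404,792,810\nAL001,382,397\n")

demo_df = pd.read_csv(demo_csv, names=demo_header)

# The unmatched trailing label ends up as a column of nulls.
print(demo_df)
print(demo_df["O8"].isna().all())
```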

In [78]:
# Create the column labels for the CSV files
csv_header = ["MHA", 
              "E1", "E2", "E3", "E4", "E5", "E6", "E7", "E8", "E9", 
              "W1", "W2", "W3", "W4", "W5", 
              "O1E", "O2E", "O3E", 
              "O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8", "O9", "O10"]

# Create the column labels for the final dataframe
final_header = ["Dependents", "Year", "MHA", 
                "E1", "E2", "E3", "E4", "E5", "E6", "E7", "E8", "E9", 
                "W1", "W2", "W3", "W4", "W5", 
                "O1E", "O2E", "O3E", 
                "O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8", "O9", "O10"]


In [79]:
# get the list of data files, skipping the combined output that is saved to this folder later
bah_path = os.path.join(raw_dir, "BAH Data")
files = [f for f in os.listdir(bah_path) if f != "bah_df.csv"]

# create blank dataframe
bah_df = pd.DataFrame(columns=final_header)

In [80]:
for file in files:
    path = os.path.join(bah_path, file)
    name, ext = os.path.splitext(file)
    
    year = name[-2:]
    # if year = 9_ then set it to 199_, otherwise it is 20__
    if year[0] == "9":
        year = "19" + year
    else:
        year = "20" + year
    year = int(year)
    
    dependents = name[-3:-2]
    # dependents = 1 for with (w) and 0 for without (o)
    if dependents == "w":
        dependents = 1
    else:
        dependents = 0
    
    # read file to dataframe
    df = pd.read_csv(path, names=csv_header)
    df.insert(loc=0, column="Year", value = year)
    df.insert(loc=0, column="Dependents", value = dependents)
    bah_df = pd.concat([bah_df, df])



In [82]:
# save dataframe to file
bah_df.to_csv(os.path.join(bah_path, "bah_df.csv"), index=False)
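
The file-name parsing used in the loop can be isolated into a small helper, which makes the convention easier to check on its own. This is a sketch that mirrors the slicing shown above; the function name `parse_bah_filename` and the example file names are hypothetical and only illustrate the pattern (a `w` or `o` flag followed by a two-digit year).

```python
import os

def parse_bah_filename(filename):
    """Return (year, dependents) using the same slicing as the loading loop."""
    name, _ext = os.path.splitext(filename)
    two_digit_year = name[-2:]
    # Two-digit years starting with 9 are read as 199x; all others as 20xx.
    century = "19" if two_digit_year[0] == "9" else "20"
    year = int(century + two_digit_year)
    # The character before the year is "w" (with dependents) or "o" (without).
    dependents = 1 if name[-3:-2] == "w" else 0
    return year, dependents

# Hypothetical file names, used only to exercise the helper.
print(parse_bah_filename("bahw98.txt"))
print(parse_bah_filename("bahwo05.txt"))
```

Any file whose name does not end in two digits raises a `ValueError` at the `int` call, which is one way to notice stray files in the data folder.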

Unused MHAs will be removed at a later step.

## 1.2. Housing Price Index Data

>The FHFA HPI is a broad measure of the movement of single-family house prices. The FHFA HPI is a weighted, repeat-sales index, meaning that it measures average price changes in repeat sales or refinancings on the same properties. This information is obtained by reviewing repeat mortgage transactions on single-family properties whose mortgages have been purchased or securitized by Fannie Mae or Freddie Mac since January 1975.
>
>FHFA House Price Index | FHFA. (2025, January 28). FHFA.gov. https://www.fhfa.gov/data/hpi

In [126]:
# identify the data path
hpi_path = os.path.join(raw_dir, "HPI Data")

In [127]:
# read data file into a dataframe
hpi_df = pd.read_excel(os.path.join(hpi_path,"hpi_at_3zip.xlsx"),skiprows=3, header=1)
hpi_df.head()

Unnamed: 0,Three-Digit ZIP Code,Year,Quarter,Index (NSA),Index Type
0,10,1995,1,100.0,Native 3-Digit ZIP index
1,10,1995,2,101.39,Native 3-Digit ZIP index
2,10,1995,3,103.63,Native 3-Digit ZIP index
3,10,1995,4,103.66,Native 3-Digit ZIP index
4,10,1996,1,104.96,Native 3-Digit ZIP index


In [128]:
# remove extraneous index types
hpi_df.drop(hpi_df.loc[hpi_df["Index Type"]!="Native 3-Digit ZIP index"].index, inplace=True)

In [129]:
# limit data to the first HPI value per year (quarter = 1)
hpi_df.drop(hpi_df.loc[hpi_df["Quarter"]!=1].index, inplace=True)

# drop the quarter column
hpi_df.drop(labels="Quarter", axis=1, inplace=True)

hpi_df.head()

Unnamed: 0,Three-Digit ZIP Code,Year,Index (NSA),Index Type
0,10,1995,100.0,Native 3-Digit ZIP index
4,10,1996,104.96,Native 3-Digit ZIP index
8,10,1997,104.04,Native 3-Digit ZIP index
12,10,1998,107.96,Native 3-Digit ZIP index
16,10,1999,110.83,Native 3-Digit ZIP index


In [130]:
# change three-digit ZIP to string datatype
hpi_df["Three-Digit ZIP Code"] = hpi_df["Three-Digit ZIP Code"].astype("string")

# add leading zero to the three digit codes
hpi_df["Three-Digit ZIP Code"] = hpi_df["Three-Digit ZIP Code"].str.zfill(3)

# drop index type column
hpi_df.drop(labels="Index Type", axis=1, inplace=True)

hpi_df.head()

Unnamed: 0,Three-Digit ZIP Code,Year,Index (NSA)
0,10,1995,100.0
4,10,1996,104.96
8,10,1997,104.04
12,10,1998,107.96
16,10,1999,110.83
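
One detail worth keeping in mind with zero-padded codes: the padding is lost if the CSV is read back with default type inference, which is why the codes are padded again later in the notebook. The sketch below uses toy values and an in-memory buffer to show the round trip, and shows that passing `dtype` to `pd.read_csv` is one way to keep the codes as text.

```python
import io
import pandas as pd

# Toy three-digit ZIP codes stored as integers, as in the raw HPI file.
zips = pd.DataFrame({"Three-Digit ZIP Code": [10, 5, 996]})
zips["Three-Digit ZIP Code"] = (
    zips["Three-Digit ZIP Code"].astype("string").str.zfill(3)
)

# Write to an in-memory CSV, then read it back in two different ways.
buffer = io.StringIO()
zips.to_csv(buffer, index=False)

buffer.seek(0)
as_numbers = pd.read_csv(buffer)  # default inference parses the codes as integers

buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"Three-Digit ZIP Code": "string"})

print(as_numbers["Three-Digit ZIP Code"].tolist())
print(as_text["Three-Digit ZIP Code"].tolist())
```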


In [131]:
# save the dataframe to a CSV
hpi_df.to_csv(os.path.join(hpi_path, "hpi_df.csv"), index=False)

## 1.3. CPI (Inflation) Data

- **CPI Series ID:** CUUR0000SA0
- **Series Title:** All items in U.S. city average, all urban consumers, not seasonally adjusted
- **Source:** https://data.bls.gov/timeseries/CUUR0000SA0

>The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. Indexes are available for the U.S. and various geographic areas. Average price data for select utility, automotive fuel, and food items are also available.
>
>CPI Home. (2023, February 15). Bureau of Labor Statistics. https://www.bls.gov/cpi/


In [132]:
# identify the data path
cpi_path = os.path.join(raw_dir, "CPI Data")

In [133]:
# read data file into a dataframe
cpi_df = pd.read_excel(os.path.join(cpi_path,"CPI Inflation Data (CUUR0000SA0).xlsx"), skiprows=10, header=1)
cpi_df.head()



Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,HALF1,HALF2
0,1995,150.3,150.9,151.4,151.9,152.2,152.5,152.5,152.9,153.2,153.7,153.6,153.5,151.5,153.2
1,1996,154.4,154.9,155.7,156.3,156.6,156.7,157.0,157.3,157.8,158.3,158.6,158.6,155.8,157.9
2,1997,159.1,159.6,160.0,160.2,160.1,160.3,160.5,160.8,161.2,161.6,161.5,161.3,159.9,161.2
3,1998,161.6,161.9,162.2,162.5,162.8,163.0,163.2,163.4,163.6,164.0,164.0,163.9,162.3,163.7
4,1999,164.3,164.5,165.0,166.2,166.2,166.2,166.7,167.1,167.9,168.2,168.3,168.3,165.4,167.8


In [134]:
# dropping the Half 1 and Half 2 features (they are averages of the first half of the year and second half)
cpi_df.drop(labels=["HALF1","HALF2"], axis=1, inplace=True)

In [135]:
# add an average per year column (excluding the year column)
avg_cols = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
cpi_df["Average CPI"] = cpi_df[avg_cols].mean(axis=1)

# drop the month columns as they will not be used
cpi_df.drop(labels=avg_cols, axis=1, inplace=True)
cpi_df.head()

Unnamed: 0,Year,Average CPI
0,1995,152.383333
1,1996,156.85
2,1997,160.516667
3,1998,163.008333
4,1999,166.575
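
As a quick check of the row-wise average, the sketch below recomputes the 1995 value from the monthly figures shown in the raw table earlier in this section. The single-row dataframe `cpi_1995` exists only for this illustration.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# 1995 monthly CPI values copied from the raw table output above.
cpi_1995 = pd.DataFrame(
    [[1995, 150.3, 150.9, 151.4, 151.9, 152.2, 152.5,
      152.5, 152.9, 153.2, 153.7, 153.6, 153.5]],
    columns=["Year"] + months,
)

# axis=1 averages across the month columns within each row.
cpi_1995["Average CPI"] = cpi_1995[months].mean(axis=1)
print(round(cpi_1995.loc[0, "Average CPI"], 6))
```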


In [136]:
# save dataframe to CSV
cpi_df.to_csv(os.path.join(cpi_path, "cpi_df.csv"), index=False)

## 1.4. Political Party Data

The dataset covers each political body (House of Representatives, Senate, and Presidency). It identifies the majority party in the House and the Senate, as well as the President and their party. It also includes a "Government Unity" column ("Unified" or "Divided"), which can be derived from the other columns.

In [137]:
# identify the data path
party_path = os.path.join(raw_dir, "Political Party Data")

In [138]:
# read data into dataframe
party_df = pd.read_csv(os.path.join(party_path, "Party_Government_Data.csv"))
party_df.head()

Unnamed: 0,Year,Senate Majority Party,House Majority Party,President Party,President Name,Government Unity
0,1993,Democrats,Democrats,Democrat,Clinton,Unified
1,1994,Democrats,Democrats,Democrat,Clinton,Unified
2,1995,Democrats,Democrats,Democrat,Clinton,Unified
3,1995,Republicans,Republicans,Democrat,Clinton,Divided
4,1996,Republicans,Republicans,Democrat,Clinton,Divided


In [139]:
# drop unnecessary columns
del_cols = ["President Name", "Government Unity"]
party_df.drop(labels=del_cols, axis=1, inplace=True)

In [140]:
# reduce entries to first letter of political party
# this eliminates some strange entries in the House Majority Party column
party_df["Senate Majority Party"] = party_df["Senate Majority Party"].str[0]
party_df["House Majority Party"] = party_df["House Majority Party"].str[0]
party_df["President Party"] = party_df["President Party"].str[0]

In [141]:
# change the dtype of each column to category, excluding Year
party_df["Senate Majority Party"] = party_df["Senate Majority Party"].astype("category")
party_df["House Majority Party"] = party_df["House Majority Party"].astype("category")
party_df["President Party"] = party_df["President Party"].astype("category")

# use the category codes as integer labels (codes follow the sorted category labels)
party_df["Senate Majority Party"] = party_df["Senate Majority Party"].cat.codes
party_df["House Majority Party"] = party_df["House Majority Party"].cat.codes
party_df["President Party"] = party_df["President Party"].cat.codes

party_df.head()

Unnamed: 0,Year,Senate Majority Party,House Majority Party,President Party
0,1993,0,0,0
1,1994,0,0,0
2,1995,0,0,0
3,1995,1,1,0
4,1996,1,1,0
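
To make the encoding explicit: category codes follow the sorted category labels, so with only "D" and "R" present, "D" becomes 0 and "R" becomes 1. The sketch below uses toy labels to show this and compares it with an explicit mapping, which is offered as an alternative and is not what the notebook does. The explicit mapping has the advantage that the codes would not shift if a third label ever appeared in a column.

```python
import pandas as pd

# Toy labels in the mixed singular/plural style seen in the raw data.
party = pd.Series(["Democrats", "Republicans", "Democrat", "Republican"])
letters = party.str[0]

# Category codes follow the sorted categories: "D" -> 0, "R" -> 1.
codes = letters.astype("category").cat.codes

# Alternative: an explicit mapping that stays stable if new labels appear.
explicit = letters.map({"D": 0, "R": 1})

print(codes.tolist())
print(explicit.tolist())
```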


In [142]:
# save dataframe to CSV
party_df.to_csv(os.path.join(party_path, "party_df.csv"), index=False)

## 1.5. Location Data

>The Military Bases dataset is as of May 21, 2019, and is part of the U.S. Department of Transportation (USDOT)/Bureau of Transportation Statistics's (BTS's) National Transportation Atlas Database (NTAD). The dataset depicts the authoritative boundaries of the most commonly known Department of Defense (DoD) sites, installations, ranges, and training areas in the United States and Territories. These sites encompass land which is federally owned or otherwise managed.
>
>mexwell. (2019). 🪖 US Military Bases. Kaggle.com. https://www.kaggle.com/datasets/mexwell/us-military-bases/data

Because the data dates from 2019, some of it is out of date (for example, several military bases have since been renamed and some have been designated Space Force bases), so it needed significant cleaning. It also includes all federal DoD real property, which indicates land holdings rather than a military presence. Locations were filtered down as described later.

In [113]:
# File paths (path parts are passed separately so the separator is not hard-coded)
loc_path = os.path.join(raw_dir, "Location Data", "military-bases.csv")
mha_path = os.path.join(raw_dir, "Location Data")

In [11]:
# Load data to a data frame
loc_df = pd.read_csv(loc_path, delimiter=";")

# eliminate inactive bases
loc_df.drop(loc_df.loc[loc_df["Oper Stat"]=="Inactive"].index, inplace=True)

# drop bases not within the United States
loc_df.drop(loc_df.loc[loc_df["COUNTRY"]!="United States"].index, inplace=True)

# Remove unnecessary features
loc_ft_del = ["Geo Shape", "OBJECTID_1", "OBJECTID", "PERIMETER", "Oper Stat", "Shape_Leng", "Shape_Area"]
loc_df.drop(columns=loc_ft_del, axis=1, inplace=True)

# Remove Guard and Reserve locations
loc_df.drop(loc_df.loc[loc_df["COMPONENT"].str.contains("Guard")].index, inplace=True)
loc_df.drop(loc_df.loc[loc_df["COMPONENT"].str.contains("Reserve")].index, inplace=True)

# Remove small areas
loc_df.drop(loc_df.loc[loc_df["AREA"]<1].index, inplace=True)

loc_df.head()

Unnamed: 0,Geo Point,COMPONENT,Site Name,Joint Base,State Terr,COUNTRY,AREA
2,"33.1594636742, -106.425696182",Army Active,White Sands Missile Range NM,,New Mexico,United States,3548.570164
5,"21.3621462703, -157.718266082",MC Active,MC Trng Area Bellows,,Hawaii,United States,1.611546
7,"27.8965304593, -98.0434307225",Navy Active,ALF Orange,,Texas,United States,5.293969
9,"42.7402329675, -115.563336812",AF Active,Saylor Creek Air Force Range,,Idaho,United States,171.097376
10,"29.3567318395, -100.782932919",AF Active,Laughlin AFB,,Texas,United States,6.396106


In [None]:
# write dataframe to CSV file
loc_path = os.path.join(raw_dir, "Location Data", "loc_df_reduced.csv")
loc_df.to_csv(loc_path, index=False)

From this point, the data is reduced manually by cross-referencing against https://installations.militaryonesource.mil/view-all

Data reduction rationale:
- Inactive bases are dropped because a closed base leaves no military population in the area.
- Bases outside of CONUS, Hawaii, and Alaska are not considered because their economic situations likely differ and they would likely be outliers.
- Features are removed when they are irrelevant to this analysis; most were used for mapping purposes.
- Guard and Reserve members do not normally receive BAH, so those bases and areas are removed.
- Small areas (`AREA` below 1) are removed to focus on major military populations; they indicate very small federal properties that would not contribute meaningfully to personnel numbers.
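
The reduction rules listed above can also be expressed as a single boolean mask, which keeps all the conditions in one place. The sketch below runs on invented toy rows (only the column names come from the dataset) and is an equivalent formulation, not the code used in the notebook. `na=False` is an added assumption so that a missing `COMPONENT` value does not break the mask.

```python
import pandas as pd

# Toy rows; column names match the dataset, values are invented.
toy_loc = pd.DataFrame({
    "Site Name": ["A", "B", "C", "D", "E"],
    "COMPONENT": ["Army Active", "Army Guard", "AF Reserve", "Navy Active", "AF Active"],
    "Oper Stat": ["Active", "Active", "Active", "Inactive", "Active"],
    "COUNTRY": ["United States", "United States", "United States", "United States", "Guam"],
    "AREA": [12.5, 40.0, 3.0, 8.0, 0.4],
})

keep = (
    (toy_loc["Oper Stat"] != "Inactive")                              # active bases only
    & (toy_loc["COUNTRY"] == "United States")                         # within the United States
    & ~toy_loc["COMPONENT"].str.contains("Guard|Reserve", na=False)   # no Guard or Reserve sites
    & ~(toy_loc["AREA"] < 1)                                          # drop very small areas
)
reduced = toy_loc.loc[keep]
print(reduced["Site Name"].tolist())
```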

In [None]:
# read back in the reduced data
loc_df = pd.read_csv(loc_path)

In [None]:
# # CAUTION code makes call to external API to get location data from lat-long
# city_loc = []
# geolocator = Nominatim(user_agent="me")
# for ll in loc_df["Geo Point"]:
#     location = geolocator.reverse(ll)
#     city_loc.append(location)

In [None]:
# Decompose the location data into the city and lat-long
city_name = []
city_ll = []
for i in range(0,len(city_loc)):
    city_name.append(city_loc[i][0])
    city_ll.append(city_loc[i][1])

163
['Maxwell Air Force Base, Burkett Drive, Montgomery, Montgomery County, Alabama, 36113, United States', 'Mariner Road, Madison County, Alabama, United States', 'Elwood Avenue, Calhoun County, Alabama, United States', 'Dale County, Alabama, United States', 'AN/GSC-52 SATCOM terminal, Satellite Communications Facility, Unorganized Borough, Alaska, United States', '37th Street, Anchorage, Alaska, 99506, United States', 'Unorganized Borough, Alaska, 99731, United States', 'Anchorage, Alaska, 99505, United States', 'Fairbanks North Star Borough, Alaska, United States', 'Clear Highway, Anderson, Denali Borough, Alaska, 99704, United States', 'Fairbanks North Star Borough, Alaska, 99702, United States', 'Yuma County, Arizona, United States', 'East Prescott Street, Tucson, Pima County, Arizona, 85708, United States', 'Luke Air Force Base, West Bethany Home Road, Maricopa County, Arizona, 85309, United States', 'Thompson Street, Sierra Vista, Cochise County, Arizona, 85670, United States', 

In [None]:
# stitch together the city and lat-long into a dataframe
city_df = pd.DataFrame({"City": city_name, "Lat-Long": city_ll})
city_df["City"] = city_df["City"].str.split(",")

# save the city dataframe to a file
city_path = os.path.join(raw_dir, "Location Data", "city_df.csv")
city_df.to_csv(city_path, index=False)

In [None]:
# read in file to dataframe
city_df = pd.read_csv(city_path)

In [28]:
# add ZIP code to the dataframe if it exists
pattern = re.compile(r"\b\d{5}\b")
zip_code = []
for c in city_df["City"]:
    matches = re.findall(pattern,c)
    if not matches:
        zip_code.append("")
    else:
        zip_code.append(matches[0])

city_df["ZIP Code"] = zip_code
city_df.head()

Unnamed: 0,City,Lat-Long,ZIP Code
0,"['Maxwell Air Force Base', ' Burkett Drive', '...","(32.3818479, -86.35330759865383)",36113.0
1,"['Mariner Road', ' Madison County', ' Alabama'...","(34.63287805104256, -86.65713769680043)",
2,"['Elwood Avenue', ' Calhoun County', ' Alabama...","(33.6522778, -85.9688511)",
3,"['Dale County', ' Alabama', ' United States']","(31.39992542910119, -85.74747847025453)",
4,"['AN/GSC-52 SATCOM terminal', ' Satellite Comm...","(52.72235205, 174.10329374865051)",
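
The ZIP extraction step can be tried on individual address strings. The sketch below wraps the same pattern in a hypothetical helper, `first_zip`, and runs it on two addresses taken from the geocoder output shown earlier, plus one invented address that exposes a limitation.

```python
import re

ZIP_PATTERN = re.compile(r"\b\d{5}\b")

def first_zip(address):
    """Return the first standalone five-digit number in the address, or ""."""
    match = ZIP_PATTERN.search(address)
    return match.group(0) if match else ""

with_zip = ("Maxwell Air Force Base, Burkett Drive, Montgomery, "
            "Montgomery County, Alabama, 36113, United States")
without_zip = "Dale County, Alabama, United States"

# Invented address: a five-digit street number comes before the ZIP code.
street_number_first = "12345 Example Road, Some County, Alabama, 36113, United States"

print(first_zip(with_zip))
print(repr(first_zip(without_zip)))
print(first_zip(street_number_first))
```

Because the first match is taken, a five-digit street number would be picked up instead of the ZIP code, so the extracted values are worth a spot check before they are used as join keys.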


In [None]:
# save updated dataframe to file (overwrite existing)
city_df.to_csv(city_path, index=False)

The remaining ZIP codes were entered manually.

In [31]:
# read CSV back in after manual edits
city_df = pd.read_csv(city_path)

In [39]:
# merge the dataframe with the MHA/ZIP code dataframe
# we will inner join onto the MHA/ZIP dataframe to drop MHAs that are not represented
mha_coords_df = pd.merge(mha_zips_df, city_df, how="inner", left_on="ZIP", right_on="ZIP Code")
mha_coords_df.head()

Unnamed: 0,ZIP,MHA,City,Lat-Long,ZIP Code
0,1731,MA377,"['Marrett Street', ' Lincoln', ' Middlesex Cou...","(42.4585923, -71.27662042742608)",1731
1,2841,RI256,"['110', ' Lay Street', ' Middletown', ' Newpor...","(41.535219999999995, -71.309740644)",2841
2,7416,NJ202,"['Craine Road', ' Rockaway Township', ' Morris...","(40.95409665120382, -74.54447631172972)",7416
3,8527,NJ204,"['Jackson Township', ' Ocean County', ' New Je...","(40.02707537090311, -74.37507652215362)",8527
4,8641,NJ204,"['Manor Road', ' New Hanover Township', ' Burl...","(40.02511345, -74.58019596065915)",8641


In [40]:
# remove unnecessary columns
del_cols = ["City","ZIP Code"]
mha_coords_df.drop(del_cols, axis=1, inplace=True)
mha_coords_df.head()

Unnamed: 0,ZIP,MHA,Lat-Long
0,1731,MA377,"(42.4585923, -71.27662042742608)"
1,2841,RI256,"(41.535219999999995, -71.309740644)"
2,7416,NJ202,"(40.95409665120382, -74.54447631172972)"
3,8527,NJ204,"(40.02707537090311, -74.37507652215362)"
4,8641,NJ204,"(40.02511345, -74.58019596065915)"


In [42]:
# expand Lat-Long into two columns
# regex=False treats the parentheses as literal characters
mha_coords_df["Lat-Long"] = mha_coords_df["Lat-Long"].str.replace("(", "", regex=False)
mha_coords_df["Lat-Long"] = mha_coords_df["Lat-Long"].str.replace(")", "", regex=False)
coords_split = mha_coords_df["Lat-Long"].str.split(", ", expand=True)
mha_coords_df["Latitude"] = coords_split[0]
mha_coords_df["Longitude"] = coords_split[1]
del_cols = ["Lat-Long"]
mha_coords_df.drop(del_cols, axis=1, inplace=True)
mha_coords_df.head()

Unnamed: 0,ZIP,MHA,Latitude,Longitude
0,1731,MA377,42.4585923,-71.27662042742608
1,2841,RI256,41.53522,-71.309740644
2,7416,NJ202,40.95409665120382,-74.54447631172972
3,8527,NJ204,40.02707537090311,-74.37507652215362
4,8641,NJ204,40.02511345,-74.58019596065915
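
After the split, the latitude and longitude values are still strings. A compact variant is sketched below on two coordinate strings copied from the output above. It strips the parentheses with `str.strip` and converts the parts to floats, which is an optional extra step and not part of the notebook's code.

```python
import pandas as pd

# Two coordinate strings copied from the dataframe output above.
coords = pd.DataFrame({
    "Lat-Long": ["(42.4585923, -71.27662042742608)",
                 "(40.02511345, -74.58019596065915)"]
})

# Strip the surrounding parentheses, then split on the comma separator.
parts = coords["Lat-Long"].str.strip("()").str.split(", ", expand=True)

# Convert to floats so the values can be used numerically later.
coords["Latitude"] = parts[0].astype(float)
coords["Longitude"] = parts[1].astype(float)
print(coords[["Latitude", "Longitude"]])
```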


In [49]:
mha_names_df.columns

Index(['MHA', 'CITY', 'STATE'], dtype='object')

In [50]:
# merge with the MHA names metadata
# use a LEFT join on the MHA column onto the mha_coords_df

mha_df = pd.merge(mha_coords_df, mha_names_df, how="left", on="MHA")
mha_df.head()

Unnamed: 0,ZIP,MHA,Latitude,Longitude,CITY,STATE
0,1731,MA377,42.4585923,-71.27662042742608,HANSCOM AFB,MA
1,2841,RI256,41.53522,-71.309740644,NEWPORT,RI
2,7416,NJ202,40.95409665120382,-74.54447631172972,NORTHERN NEW JERSEY,NJ
3,8527,NJ204,40.02707537090311,-74.37507652215362,JB MCGUIRE-DIX-LAKEHURST,NJ
4,8641,NJ204,40.02511345,-74.58019596065915,JB MCGUIRE-DIX-LAKEHURST,NJ


In [51]:
# remove duplicate MHAs
mha_df.drop_duplicates(subset=["MHA"], inplace=True)
mha_df.head()

Unnamed: 0,ZIP,MHA,Latitude,Longitude,CITY,STATE
0,1731,MA377,42.4585923,-71.27662042742608,HANSCOM AFB,MA
1,2841,RI256,41.53522,-71.309740644,NEWPORT,RI
2,7416,NJ202,40.95409665120382,-74.54447631172972,NORTHERN NEW JERSEY,NJ
3,8527,NJ204,40.02707537090311,-74.37507652215362,JB MCGUIRE-DIX-LAKEHURST,NJ
6,10922,NY217,41.36385255,-74.02383342525056,WEST POINT,NY


In [None]:
# save mha_df to csv
mha_df.to_csv(os.path.join(raw_dir, "Location Data", "mha_df.csv"), index=False)

# 2. Fuse Data Together

In [168]:
# read in the data files
bah_df = pd.read_csv(os.path.join(bah_path, "bah_df.csv"))
hpi_df = pd.read_csv(os.path.join(hpi_path, "hpi_df.csv"))
cpi_df = pd.read_csv(os.path.join(cpi_path, "cpi_df.csv"))
party_df = pd.read_csv(os.path.join(party_path, "party_df.csv"))
mha_df = pd.read_csv(os.path.join(mha_path, "mha_df.csv"))

In [169]:
# get a list of the relevant MHA values
mha_values = mha_df["MHA"]

In [170]:
# reduce the BAH dataframe based on MHA values
bah_df = bah_df.loc[bah_df["MHA"].isin(mha_values)]
bah_df.head()

Unnamed: 0,Dependents,Year,MHA,E1,E2,E3,E4,E5,E6,E7,...,O1,O2,O3,O4,O5,O6,O7,O8,O9,O10
4,1,2000,AK404,792.0,792.0,792.0,792.0,983.0,1101.0,1185.0,...,997.0,1098.0,1331.0,1487.0,1594.0,1608.0,1627.0,,,
5,1,2000,AK405,823.0,823.0,823.0,823.0,892.0,973.0,1039.0,...,896.0,940.0,1185.0,1255.0,1360.0,1362.0,1390.0,,,
6,1,2000,AL001,382.0,382.0,397.0,422.0,490.0,547.0,592.0,...,510.0,570.0,673.0,811.0,912.0,942.0,964.0,,,
7,1,2000,AL002,380.0,380.0,395.0,420.0,456.0,484.0,515.0,...,460.0,504.0,586.0,710.0,810.0,843.0,946.0,,,
8,1,2000,AL003,436.0,436.0,436.0,437.0,495.0,545.0,588.0,...,514.0,568.0,669.0,828.0,947.0,977.0,1067.0,,,


In [171]:
# create a fused dataframe based on the BAH dataframe
fused_df = bah_df.copy(deep=True)

# add the MHA location data
fused_df = pd.merge(fused_df, mha_df, how="left", on="MHA")

# add the CPI data
fused_df = pd.merge(fused_df, cpi_df, how="left", on="Year")

# add party data
fused_df = pd.merge(fused_df, party_df, how="left", on="Year")

fused_df.head()

Unnamed: 0,Dependents,Year,MHA,E1,E2,E3,E4,E5,E6,E7,...,O10,ZIP,Latitude,Longitude,CITY,STATE,Average CPI,Senate Majority Party,House Majority Party,President Party
0,1,2000,AK404,792.0,792.0,792.0,792.0,983.0,1101.0,1185.0,...,,99505,61.26647,-149.641223,ANCHORAGE,AK,172.2,1,1,0
1,1,2000,AK405,823.0,823.0,823.0,823.0,892.0,973.0,1039.0,...,,99701,64.864904,-146.775162,FAIRBANKS,AK,172.2,1,1,0
2,1,2000,AL001,382.0,382.0,397.0,422.0,490.0,547.0,592.0,...,,36201,33.652278,-85.968851,ANNISTON/FORT MCCLELLAN,AL,172.2,1,1,0
3,1,2000,AL002,380.0,380.0,395.0,420.0,456.0,484.0,515.0,...,,36362,31.399925,-85.747478,FORT RUCKER,AL,172.2,1,1,0
4,1,2000,AL003,436.0,436.0,436.0,437.0,495.0,545.0,588.0,...,,35808,34.632878,-86.657138,HUNTSVILLE,AL,172.2,1,1,0


In [161]:
# reduce the MHA ZIP code metadata to 3-digit ZIP codes
mha_zips_df["ZIP"] = mha_zips_df["ZIP"].astype("string")
mha_zips_df["ZIP"] = mha_zips_df["ZIP"].str.zfill(5)
mha_zips_df["ZIP"] = mha_zips_df["ZIP"].str[:3]
mha_zips_df.head()

Unnamed: 0,ZIP,MHA
0,5,NY218
1,5,NY218
2,6,XX499
3,6,XX499
4,6,XX499


In [162]:
# reduce down to the actual MHAs
mha_zips_df = mha_zips_df.loc[mha_zips_df["MHA"].isin(mha_values)]
mha_zips_df.head()

Unnamed: 0,ZIP,MHA
228,10,ZZ820
241,10,ZZ820
258,10,ZZ820
323,13,ZZ820
324,13,ZZ820


In [172]:
# set HPI zip codes to a string
hpi_df["Three-Digit ZIP Code"] = hpi_df["Three-Digit ZIP Code"].astype("string")
hpi_df["Three-Digit ZIP Code"] = hpi_df["Three-Digit ZIP Code"].str.zfill(3)

# merge with the HPI data on three-digit ZIP
hpi_df = pd.merge(hpi_df, mha_zips_df, how="left", left_on="Three-Digit ZIP Code", right_on="ZIP")
hpi_df.dropna(subset=["MHA"], inplace=True)
hpi_df.drop(labels=["Three-Digit ZIP Code","ZIP"], axis=1, inplace=True)
hpi_df.head()

Unnamed: 0,Year,Index (NSA),MHA
0,1995,100.0,ZZ820
1,1995,100.0,ZZ820
2,1995,100.0,ZZ820
3,1996,104.96,ZZ820
4,1996,104.96,ZZ820


In [173]:
# remove the duplicate rows
hpi_df.drop_duplicates(inplace=True)
hpi_df.head()

Unnamed: 0,Year,Index (NSA),MHA
0,1995,100.0,ZZ820
3,1996,104.96,ZZ820
6,1997,104.04,ZZ820
9,1998,107.96,ZZ820
12,1999,110.83,ZZ820


In [174]:
# merge to the fused dataframe on year and MHA
fused_df = pd.merge(fused_df, hpi_df, how="left", on=["Year","MHA"])
fused_df.head()

Unnamed: 0,Dependents,Year,MHA,E1,E2,E3,E4,E5,E6,E7,...,ZIP,Latitude,Longitude,CITY,STATE,Average CPI,Senate Majority Party,House Majority Party,President Party,Index (NSA)
0,1,2000,AK404,792.0,792.0,792.0,792.0,983.0,1101.0,1185.0,...,99505,61.26647,-149.641223,ANCHORAGE,AK,172.2,1,1,0,117.55
1,1,2000,AK404,792.0,792.0,792.0,792.0,983.0,1101.0,1185.0,...,99505,61.26647,-149.641223,ANCHORAGE,AK,172.2,1,1,0,117.57
2,1,2000,AK405,823.0,823.0,823.0,823.0,892.0,973.0,1039.0,...,99701,64.864904,-146.775162,FAIRBANKS,AK,172.2,1,1,0,
3,1,2000,AL001,382.0,382.0,397.0,422.0,490.0,547.0,592.0,...,36201,33.652278,-85.968851,ANNISTON/FORT MCCLELLAN,AL,172.2,1,1,0,122.9
4,1,2000,AL001,382.0,382.0,397.0,422.0,490.0,547.0,592.0,...,36201,33.652278,-85.968851,ANNISTON/FORT MCCLELLAN,AL,172.2,1,1,0,123.67
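
The output above contains two rows for AK404 in 2000 (index values 117.55 and 117.57) and two for AL001, because an MHA can span more than one three-digit ZIP code. If one row per MHA and year is wanted, one option is to aggregate the HPI values before merging. The sketch below averages them. Whether a mean is the right summary is a modelling assumption, and this is not a step the notebook performs.

```python
import pandas as pd

# Index values for 2000 copied from the merged output above.
toy_hpi = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2000],
    "MHA": ["AK404", "AK404", "AL001", "AL001"],
    "Index (NSA)": [117.55, 117.57, 122.9, 123.67],
})

# Collapse to one HPI value per (Year, MHA) pair by averaging.
hpi_per_mha = toy_hpi.groupby(["Year", "MHA"], as_index=False)["Index (NSA)"].mean()
print(hpi_per_mha)
```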


In [175]:
# sort final dataframe by year and MHA
fused_df.sort_values(by=["Year","MHA"], inplace=True)
fused_df.head()

Unnamed: 0,Dependents,Year,MHA,E1,E2,E3,E4,E5,E6,E7,...,ZIP,Latitude,Longitude,CITY,STATE,Average CPI,Senate Majority Party,House Majority Party,President Party,Index (NSA)
21182,1,1995,AK404,343.72,336.38,342.4,389.17,478.92,545.71,612.63,...,99505,61.26647,-149.641223,ANCHORAGE,AK,152.383333,0,0,0,100.0
21183,1,1995,AK404,343.72,336.38,342.4,389.17,478.92,545.71,612.63,...,99505,61.26647,-149.641223,ANCHORAGE,AK,152.383333,1,1,0,100.0
45170,0,1995,AK404,192.42,211.35,252.19,271.72,334.33,371.4,425.62,...,99505,61.26647,-149.641223,ANCHORAGE,AK,152.383333,0,0,0,100.0
45171,0,1995,AK404,192.42,211.35,252.19,271.72,334.33,371.4,425.62,...,99505,61.26647,-149.641223,ANCHORAGE,AK,152.383333,1,1,0,100.0
21184,1,1995,AK405,257.4,257.4,262.27,313.24,375.78,413.67,444.98,...,99701,64.864904,-146.775162,FAIRBANKS,AK,152.383333,0,0,0,
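
In the sorted output above, each 1995 row appears twice with different Senate and House codes, because the party table has two entries for 1995. A left merge repeats the left-hand row for every matching key on the right. One way to surface this at merge time is the `validate` argument of `pd.merge`, sketched below on toy frames. The frames and the flag `duplicate_keys_found` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from pandas.errors import MergeError

toy_bah = pd.DataFrame({"Year": [1995, 1996], "MHA": ["AK404", "AK404"]})

# Two rows share the key 1995, as in the party data.
toy_party = pd.DataFrame({
    "Year": [1995, 1995, 1996],
    "Senate Majority Party": [0, 1, 1],
})

# validate="many_to_one" requires the merge keys to be unique on the right side.
try:
    pd.merge(toy_bah, toy_party, how="left", on="Year", validate="many_to_one")
    duplicate_keys_found = False
except MergeError:
    duplicate_keys_found = True

# Without validation, the 1995 row is silently doubled.
unchecked = pd.merge(toy_bah, toy_party, how="left", on="Year")
print(duplicate_keys_found, len(unchecked))
```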


In [176]:
# save fused dataframe

fused_df.to_csv(os.path.join(fuse_dir, "fused_df.csv"), index=False)