# Fusing Data

Data comes from five sources:

1. BAH data available from https://www.travel.dod.mil/Allowances/Basic-Allowance-for-Housing/BAH-Rate-Lookup/
2. Housing Price Index for ​Three-Digit ZIP Codes (Developmental Index; Not Seasonally Adjusted) available from https://www.fhfa.gov/data/hpi/datasets?tab=quarterly-data
3. US City average consumer price index for all items available from https://data.bls.gov/timeseries/CUUR0000SA0
4. Political Party control of different branches of Federal government (Congress, House, Presidency)
5. Location data for each military housing area based on the data available from https://www.kaggle.com/datasets/mexwell/us-military-bases/data


*Package Imports*

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import re
from geopy.geocoders import Nominatim

*Global Variables/Settings*

In [2]:
raw_dir = "Raw Data"
fuse_dir = "Fused Data"
clean_dir = "Clean Data"

*Metadata Files*

In [48]:
mha_names_df = pd.read_csv(os.path.join(raw_dir,"mhanames.csv"))
mha_zips_df = pd.read_csv(os.path.join(raw_dir,"sorted_zipmha.csv"))

## 1. BAH Data

BAH Data is separated by year and is generally available as an ASCII/csv file. Data is further spread across two categories: with and without dependents.

The ASCII/csv file does not include headers, but a PDF that accompanies the data includes definitions of the columns as follows:

| Column | Data Type |
|---|---|
| MHA | CHAR (5) |
| E1 | NUM |
| E2 | NUM |
| E3 | NUM |
| E4 | NUM |
| E5 | NUM |
| E6 | NUM |
| E7 | NUM |
| E8 | NUM |
| E9 | NUM |
| W1 | NUM |
| W2 | NUM |
| W3 | NUM |
| W4 | NUM |
| W5 | NUM |
| O1E | NUM |
| O2E | NUM |
| O3E | NUM |
| O1 | NUM |
| O2 | NUM |
| O3 | NUM |
| O4 | NUM |
| O5 | NUM |
| O6 | NUM |
| O7 | NUM |

Some CSV files also include ranks O8-O10, but do not specify it in their data definitions. Columns are expanded below to account for the difference. If those values do not exist then they are populated with null values.

In [78]:
# Create the column labels for the CSV files
csv_header = ["MHA", 
              "E1", "E2", "E3", "E4", "E5", "E6", "E7", "E8", "E9", 
              "W1", "W2", "W3", "W4", "W5", 
              "O1E", "O2E", "O3E", 
              "O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8", "O9", "O10"]

# Create the column labels for the final dataframe
final_header = ["Dependents", "Year", "MHA", 
                "E1", "E2", "E3", "E4", "E5", "E6", "E7", "E8", "E9", 
                "W1", "W2", "W3", "W4", "W5", 
                "O1E", "O2E", "O3E", 
                "O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8", "O9", "O10"]


In [79]:
# get the list of data files
bah_path = os.path.join(raw_dir, "BAH Data")
files = os.listdir(bah_path)

# create blank dataframe
bah_df = pd.DataFrame(columns=final_header)

In [80]:
for file in files:
    path = os.path.join(bah_path, file)
    name, ext = os.path.splitext(file)
    
    year = name[-2:]
    # if year = 9_ then set it to 199_, otherwise it is 20__
    if year[0] == "9":
        year = "19" + year
    else:
        year = "20" + year
    year = int(year)
    
    dependents = name[-3:-2]
    # dependents = 1 for with (w) and 0 for without (o)
    if dependents == "w":
        dependents = 1
    else:
        dependents = 0
    
    # read file to dataframe
    df = pd.read_csv(path, names=csv_header)
    df.insert(loc=0, column="Year", value = year)
    df.insert(loc=0, column="Dependents", value = dependents)
    bah_df = pd.concat([bah_df, df])

  bah_df = pd.concat([bah_df, df])


In [82]:
# save dataframe to file
bah_df.to_csv(os.path.join(bah_path, "bah_df.csv"), index=False)

Unused MHAs will be removed at a later step.

## 2. Housing Price Index Data

>The FHFA HPI is a broad measure of the movement of single-family house prices. The FHFA HPI is a weighted, repeat-sales index, meaning that it measures average price changes in repeat sales or refinancings on the same properties. This information is obtained by reviewing repeat mortgage transactions on single-family properties whose mortgages have been purchased or securitized by Fannie Mae or Freddie Mac since January 1975.
>FHFA House Price Index | FHFA. (2025, January 28). FHFA.gov. https://www.fhfa.gov/data/hpi

‌

In [83]:
# identify the data path
hpi_path = os.path.join(raw_dir, "HPI Data")

In [88]:
# read data file into a dataframe
hpi_df = pd.read_excel(os.path.join(hpi_path,"hpi_at_3zip.xlsx"),skiprows=3, header=1)
hpi_df.head()

Unnamed: 0,Three-Digit ZIP Code,Year,Quarter,Index (NSA),Index Type
0,10,1995,1,100.0,Native 3-Digit ZIP index
1,10,1995,2,101.39,Native 3-Digit ZIP index
2,10,1995,3,103.63,Native 3-Digit ZIP index
3,10,1995,4,103.66,Native 3-Digit ZIP index
4,10,1996,1,104.96,Native 3-Digit ZIP index


In [89]:
# remove extraneous index types
hpi_df.drop(hpi_df.loc[hpi_df["Index Type"]!="Native 3-Digit ZIP index"].index, inplace=True)

In [91]:
# save the dataframe to a CSV
hpi_df.to_csv(os.path.join(hpi_path, "hpi_df.csv"))

## 5. Location Data

A dataset containing latitude and longitude coordinates for various military bases. This is used to create geographic similarities between different bases.

In [5]:
# File path
loc_path = os.path.join(raw_dir, "Location Data\\military-bases.csv")

In [11]:
# Load data to a data frame
loc_df = pd.read_csv(loc_path, delimiter=";")

# eliminate inactive bases
loc_df.drop(loc_df.loc[loc_df["Oper Stat"]=="Inactive"].index, inplace=True)

# drop bases not within the United States
loc_df.drop(loc_df.loc[loc_df["COUNTRY"]!="United States"].index, inplace=True)

# Remove unnecessary features
loc_ft_del = ["Geo Shape", "OBJECTID_1", "OBJECTID", "PERIMETER", "Oper Stat", "Shape_Leng", "Shape_Area"]
loc_df.drop(columns=loc_ft_del, axis=1, inplace=True)

# Remove Guard and Reserve locations
loc_df.drop(loc_df.loc[loc_df["COMPONENT"].str.contains("Guard")].index, inplace=True)
loc_df.drop(loc_df.loc[loc_df["COMPONENT"].str.contains("Reserve")].index, inplace=True)

# Remove small areas
loc_df.drop(loc_df.loc[loc_df["AREA"]<1].index, inplace=True)

loc_df.head()

Unnamed: 0,Geo Point,COMPONENT,Site Name,Joint Base,State Terr,COUNTRY,AREA
2,"33.1594636742, -106.425696182",Army Active,White Sands Missile Range NM,,New Mexico,United States,3548.570164
5,"21.3621462703, -157.718266082",MC Active,MC Trng Area Bellows,,Hawaii,United States,1.611546
7,"27.8965304593, -98.0434307225",Navy Active,ALF Orange,,Texas,United States,5.293969
9,"42.7402329675, -115.563336812",AF Active,Saylor Creek Air Force Range,,Idaho,United States,171.097376
10,"29.3567318395, -100.782932919",AF Active,Laughlin AFB,,Texas,United States,6.396106


In [None]:
# write dataframe to CSV file
loc_path = os.path.join(raw_dir, "Location Data\\loc_df_reduced.csv")
loc_df.to_csv(loc_path, index=False)

From this point data is manually reduced by cross-referencing https://installations.militaryonesource.mil/view-all

Data Reduction rationale:
- Inactive bases are dropped due to lack of military population in the area due to base closure
- Bases outside of the CONUS (plus Hawaii and Alaska) likely have different economic situations and are not considered due to likely outliers
- Features are removed due to being irrelevant, most were used for mapping purposes
- Guard and Reserve members do not normally receive BAH so those bases/areas were removed
- Small areas were removed to reduce to major military populations, small areas were indicative of very small federal property that wouldn't contribute to actual personnel numbers

In [None]:
# read back in the reduced data
loc_df = pd.read_csv(loc_path)

In [None]:
# # CAUTION code makes call to external API to get location data from lat-long
# city_loc = []
# geolocator = Nominatim(user_agent="me")
# for ll in loc_df["Geo Point"]:
#     location = geolocator.reverse(ll)
#     city_loc.append(location)

[Location(Maxwell Air Force Base, Burkett Drive, Montgomery, Montgomery County, Alabama, 36113, United States, (32.3818479, -86.35330759865383, 0.0)), Location(Mariner Road, Madison County, Alabama, United States, (34.63287805104256, -86.65713769680043, 0.0)), Location(Elwood Avenue, Calhoun County, Alabama, United States, (33.6522778, -85.9688511, 0.0)), Location(Dale County, Alabama, United States, (31.39992542910119, -85.74747847025453, 0.0)), Location(AN/GSC-52 SATCOM terminal, Satellite Communications Facility, Unorganized Borough, Alaska, United States, (52.72235205, 174.10329374865051, 0.0)), Location(37th Street, Anchorage, Alaska, 99506, United States, (61.269377907823475, -149.8110735235881, 0.0)), Location(Unorganized Borough, Alaska, 99731, United States, (63.96595475, -145.71614973785188, 0.0)), Location(Anchorage, Alaska, 99505, United States, (61.26647030083226, -149.6412227645382, 0.0)), Location(Fairbanks North Star Borough, Alaska, United States, (64.8649039, -146.775

In [None]:
# Decompose the location data into the city and lat-long
city_name = []
city_ll = []
for i in range(0,len(city_loc)):
    city_name.append(city_loc[i][0])
    city_ll.append(city_loc[i][1])

163
['Maxwell Air Force Base, Burkett Drive, Montgomery, Montgomery County, Alabama, 36113, United States', 'Mariner Road, Madison County, Alabama, United States', 'Elwood Avenue, Calhoun County, Alabama, United States', 'Dale County, Alabama, United States', 'AN/GSC-52 SATCOM terminal, Satellite Communications Facility, Unorganized Borough, Alaska, United States', '37th Street, Anchorage, Alaska, 99506, United States', 'Unorganized Borough, Alaska, 99731, United States', 'Anchorage, Alaska, 99505, United States', 'Fairbanks North Star Borough, Alaska, United States', 'Clear Highway, Anderson, Denali Borough, Alaska, 99704, United States', 'Fairbanks North Star Borough, Alaska, 99702, United States', 'Yuma County, Arizona, United States', 'East Prescott Street, Tucson, Pima County, Arizona, 85708, United States', 'Luke Air Force Base, West Bethany Home Road, Maricopa County, Arizona, 85309, United States', 'Thompson Street, Sierra Vista, Cochise County, Arizona, 85670, United States', 

In [None]:
# stitch together the city and lat-long into a dataframe
city_df = pd.DataFrame({"City": city_name, "Lat-Long": city_ll})
city_df["City"] = city_df["City"].str.split(",")

# save the city dataframe to a file
city_path = os.path.join(raw_dir, "Location Data\\city_df.csv")
city_df.to_csv(city_path, index=False)

In [None]:
# read in file to dataframe
city_df = pd.read_csv(city_path)

In [28]:
# add ZIP code to the dataframe if it exists
pattern = re.compile(r"\b\d{5}\b")
zip_code = []
for c in city_df["City"]:
    matches = re.findall(pattern,c)
    if not matches:
        zip_code.append("")
    else:
        zip_code.append(matches[0])

city_df["ZIP Code"] = zip_code
city_df.head()

Unnamed: 0,City,Lat-Long,ZIP Code
0,"['Maxwell Air Force Base', ' Burkett Drive', '...","(32.3818479, -86.35330759865383)",36113.0
1,"['Mariner Road', ' Madison County', ' Alabama'...","(34.63287805104256, -86.65713769680043)",
2,"['Elwood Avenue', ' Calhoun County', ' Alabama...","(33.6522778, -85.9688511)",
3,"['Dale County', ' Alabama', ' United States']","(31.39992542910119, -85.74747847025453)",
4,"['AN/GSC-52 SATCOM terminal', ' Satellite Comm...","(52.72235205, 174.10329374865051)",


In [None]:
# save updated dataframe to file (overwrite existing)
city_df.to_csv(city_path, index=False)

Remaining ZIP Codes were entered manually

In [31]:
# read CSV back in after manual edits
city_df = pd.read_csv(city_path)

In [39]:
# merge the dataframe with the MHA/ZIP code dataframe
# we will inner join onto the MHA/ZIP dataframe to drop MHAs that are not represented
mha_coords_df = pd.merge(mha_zips_df, city_df, how="inner", left_on="ZIP", right_on="ZIP Code")
mha_coords_df.head()

Unnamed: 0,ZIP,MHA,City,Lat-Long,ZIP Code
0,1731,MA377,"['Marrett Street', ' Lincoln', ' Middlesex Cou...","(42.4585923, -71.27662042742608)",1731
1,2841,RI256,"['110', ' Lay Street', ' Middletown', ' Newpor...","(41.535219999999995, -71.309740644)",2841
2,7416,NJ202,"['Craine Road', ' Rockaway Township', ' Morris...","(40.95409665120382, -74.54447631172972)",7416
3,8527,NJ204,"['Jackson Township', ' Ocean County', ' New Je...","(40.02707537090311, -74.37507652215362)",8527
4,8641,NJ204,"['Manor Road', ' New Hanover Township', ' Burl...","(40.02511345, -74.58019596065915)",8641


In [40]:
# remove unnecessary columns
del_cols = ["City","ZIP Code"]
mha_coords_df.drop(del_cols, axis=1, inplace=True)
mha_coords_df.head()

Unnamed: 0,ZIP,MHA,Lat-Long
0,1731,MA377,"(42.4585923, -71.27662042742608)"
1,2841,RI256,"(41.535219999999995, -71.309740644)"
2,7416,NJ202,"(40.95409665120382, -74.54447631172972)"
3,8527,NJ204,"(40.02707537090311, -74.37507652215362)"
4,8641,NJ204,"(40.02511345, -74.58019596065915)"


In [42]:
# expand Lat-Long into two columns
mha_coords_df["Lat-Long"] = mha_coords_df["Lat-Long"].str.replace("(", "")
mha_coords_df["Lat-Long"] = mha_coords_df["Lat-Long"].str.replace(")", "")
coords_split = mha_coords_df["Lat-Long"].str.split(", ", expand=True)
mha_coords_df["Latitude"] = coords_split[0]
mha_coords_df["Longitude"] = coords_split[1]
del_cols = ["Lat-Long"]
mha_coords_df.drop(del_cols, axis=1, inplace=True)
mha_coords_df.head()

Unnamed: 0,ZIP,MHA,Latitude,Longitude
0,1731,MA377,42.4585923,-71.27662042742608
1,2841,RI256,41.53522,-71.309740644
2,7416,NJ202,40.95409665120382,-74.54447631172972
3,8527,NJ204,40.02707537090311,-74.37507652215362
4,8641,NJ204,40.02511345,-74.58019596065915


In [49]:
mha_names_df.columns

Index(['MHA', 'CITY', 'STATE'], dtype='object')

In [50]:
# merge with the MHA names metadata
# use a LEFT join on the MHA column onto the mha_coords_df

mha_df = pd.merge(mha_coords_df, mha_names_df, how="left", on="MHA")
mha_df.head()

Unnamed: 0,ZIP,MHA,Latitude,Longitude,CITY,STATE
0,1731,MA377,42.4585923,-71.27662042742608,HANSCOM AFB,MA
1,2841,RI256,41.53522,-71.309740644,NEWPORT,RI
2,7416,NJ202,40.95409665120382,-74.54447631172972,NORTHERN NEW JERSEY,NJ
3,8527,NJ204,40.02707537090311,-74.37507652215362,JB MCGUIRE-DIX-LAKEHURST,NJ
4,8641,NJ204,40.02511345,-74.58019596065915,JB MCGUIRE-DIX-LAKEHURST,NJ


In [51]:
# remove duplicate MHAs
mha_df.drop_duplicates(subset=["MHA"], inplace=True)
mha_df.head()

Unnamed: 0,ZIP,MHA,Latitude,Longitude,CITY,STATE
0,1731,MA377,42.4585923,-71.27662042742608,HANSCOM AFB,MA
1,2841,RI256,41.53522,-71.309740644,NEWPORT,RI
2,7416,NJ202,40.95409665120382,-74.54447631172972,NORTHERN NEW JERSEY,NJ
3,8527,NJ204,40.02707537090311,-74.37507652215362,JB MCGUIRE-DIX-LAKEHURST,NJ
6,10922,NY217,41.36385255,-74.02383342525056,WEST POINT,NY


In [52]:
# save mha_df to csv
mha_df.to_csv(os.path.join(raw_dir, "Location Data\\mha_df_final.csv"), index=False)