# Dataset Preparation
- Inputs:
    - American Housing Survey (AHS) for survey years 2015, 2017, 2019, 2021, 2023.
    - HUD Income Limits for Select Metro Areas and Years (2015, 2017, 2019, 2021, 2023)
        - Atlanta-Sandy Springs-Roswell, GA (12060)
        - Boston-Cambridge-Quincy, MA-NH (14460)
        - Dallas-Fort Worth-Arlington, TX (19100)
        - Houston-The Woodlands-Sugar Land, TX (26420)
        - Phoenix-Mesa-Scottsdale, AZ (38060)
        - Seattle-Tacoma-Bellevue, WA (42660)
        - Washington-Arlington-Alexandria, DC-VA-MD (47900)
- Output:
    - A panel dataset (2015 - 2023) of renter households within the target metropolitan areas listed above. The panel data includes a new `'AMI'` variable that categorizes each observation/household (identified by the `'CONTROL'` variable) by Area Median Income level based on the corresponding Income Limit thresholds for that Metropolitan and fiscal year defined by the U.S. Department of Housing and Urban Development (HUD). 

This notebook accomplishes the following parts of the workflow/Analysis:
1. Complies a longitudinal dataset where observations are for the same subjects each time.
2. Isolates the AHS data to renter households within the metropolitan areas listed above.
3. Identfies unique renter households by AMI level:
    - Above LI: >80% AMI
    - Low-Income (LI): <= 80% AMI
    - Very Low-Income (VLI): <= 50% AMI
    - Extemely Low-Income (ELI): <= 30% AMI

In [1]:
#Importing Libraries
import requests
from io import StringIO

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns

### Data Access

The AHS Public Use Files (PUFs) used in this analysis are not stored in this repo due to size restrictions.

You can download the data directly from HUD here:
- [2015 AHS National PUF](https://www.census.gov/programs-surveys/ahs/data/2015/ahs-2015-public-use-file--puf-/ahs-2015-national-public-use-file--puf-.html)
- [2017 AHS National PUF](https://www.census.gov/programs-surveys/ahs/data/2017/ahs-2017-public-use-file--puf-/ahs-2017-national-public-use-file--puf-.html)
- [2019 AHS National PUF](https://www.census.gov/programs-surveys/ahs/data/2019/ahs-2019-public-use-file--puf-/ahs-2019-national-public-use-file--puf-.html)
- [2021 AHS National PUF](https://www.census.gov/programs-surveys/ahs/data/2021/ahs-2021-public-use-file--puf-/ahs-2021-national-public-use-file--puf-.html)
- [2023 AHS National PUF](https://www.census.gov/programs-surveys/ahs/data/2023/ahs-2023-public-use-file--puf-/ahs-2023-national-public-use-file--puf-.html)

**Note:** This analysis uses the PUF "Flat File."

In [2]:
#Selected relevant variables for the analysis
columns = ['CONTROL', 'WEIGHT', 'OMB13CBSA', 'TENURE', 'NUMPEOPLE', 'HINCP', 'RMCOSTS', 'RENT', 'HUDSUB', 'RENTSUB', 'RENTCNTRL']

#Extra columns to bring into the analysis at a later time (if needed):
#['VACANCY', 'VACMONTHS', 'BLD', 'YRBUILT', 'BEDROOMS', 'HSHLDTYPE','HHMOVE', 'HHRACE', 'HHSEX', 'HHSPAN', 'HHSOGIG', 
#'HHSOGILGBT', 'HHSOGISO', 'FINCP', 'MVG1COST', 'MVG1LOC', 'MVG2COST', 'MVG2LOC', 'MVG3COST', 'MVG3LOC', 'MOVFORCE', 
#'MOVWHY', 'HIAFFORD', 'HIHALF', 'HIPREVNUM', 'HINUMOVE', 'HINUMOVE', 'HIEVICLK', 'HIEVICNOTE', 'HIEVICTHT']

#Loading 2015 AHS METRO PUF w/ selected columns
df_15 = pd.read_csv('data/ahs/ahs2015n.csv',
                    usecols=columns)

#Loading 2017 AHS METRO PUF w/ selected columns
df_17 = pd.read_csv('data/ahs/ahs2017n.csv',
                    usecols=columns)

#Loading 2019 AHS METRO PUF w/ selected columns
df_19 = pd.read_csv('data/ahs/ahs2019n.csv',
                    usecols=columns)
#Loading 2021 AHS METRO PUF w/ selected columns
df_21 = pd.read_csv('data/ahs/ahs2021n.csv',
                    usecols=columns)

#Loading 2023 AHS METRO PUF w/ selected columns
df_23 = pd.read_csv('data/ahs/ahs2023n.csv',
                    usecols=columns)

In [3]:
#Storing the DataFrames in a dictionary with full survey years as keys
dfs_by_year = {
    2023: df_23,
    2021: df_21,
    2019: df_19,
    2017: df_17,
    2015: df_15
}

#Previewing each df
for year, df in dfs_by_year.items():
    print(f'--- AHS {year} ---')
    print(df.head(), '\n')

--- AHS 2023 ---
      CONTROL  RENT RMCOSTS TENURE RENTCNTRL RENTSUB OMB13CBSA       WEIGHT  \
0  '11000002'  1600    '-6'   '-6'      '-6'     '8'   '99998'   813.890194   
1  '11000003'   840     '1'    '2'      '-6'     '8'   '99998'   581.103231   
2  '11000005'    -6    '-6'    '1'      '-6'    '-6'   '99998'  7335.965001   
3  '11000006'    -6    '-6'    '1'      '-6'    '-6'   '99998'  6562.865941   
4  '11000008'   800    '-6'    '2'      '-6'     '8'   '99998'  1490.800600   

   NUMPEOPLE HUDSUB   HINCP  
0         -6   '-6'      -6  
1          3    '3'   48000  
2          2   '-6'  292500  
3          3   '-6'   56000  
4          1    '3'   36000   

--- AHS 2021 ---
      CONTROL  RENT RMCOSTS TENURE RENTCNTRL RENTSUB OMB13CBSA       WEIGHT  \
0  '11000005'    -6    '-6'    '1'      '-6'    '-6'   '99998'  7686.712735   
1  '11000007'    -6    '-6'    '1'      '-6'    '-6'   '37980'  1371.137443   
2  '11000009'    -6    '-6'   '-6'      '-6'    '-6'   '99998'  2014.302

In [4]:
#Checking the shape/info of each DataFrame
for year, df in dfs_by_year.items():
    print(f'\n--- AHS {year} ---')
    print('Shape:', df.shape)
    df.info()


--- AHS 2023 ---
Shape: (55669, 11)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55669 entries, 0 to 55668
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   CONTROL    55669 non-null  object 
 1   RENT       55669 non-null  int64  
 2   RMCOSTS    55669 non-null  object 
 3   TENURE     55669 non-null  object 
 4   RENTCNTRL  55669 non-null  object 
 5   RENTSUB    55669 non-null  object 
 6   OMB13CBSA  55669 non-null  object 
 7   WEIGHT     55669 non-null  float64
 8   NUMPEOPLE  55669 non-null  int64  
 9   HUDSUB     55669 non-null  object 
 10  HINCP      55669 non-null  int64  
dtypes: float64(1), int64(3), object(7)
memory usage: 4.7+ MB

--- AHS 2021 ---
Shape: (64141, 11)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64141 entries, 0 to 64140
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   CONTROL    64141 non-null  object 
 1   R

## Setting up and Preparing DataFrames for Longitudinal Analysis
- Adding `'SRVYEAR'` variable/column to identify the year of the AHS dataset that will be concatenated later in the analysis
- Using `.concat()` to stack the DataFrame vertically (long format)
- Stripping the extra `''` from the sting values 

In [5]:
#Add the survey year as a new column in each DataFrame
dfs_by_year = {
    year: df.assign(SRVYEAR=str(year)) for year, df in dfs_by_year.items()
}

In [6]:
#Checking the DataFrames to see if the column was added
for year, df in dfs_by_year.items():
    print(f'--- AHS {year} ---')
    print(df.head(), '\n')

--- AHS 2023 ---
      CONTROL  RENT RMCOSTS TENURE RENTCNTRL RENTSUB OMB13CBSA       WEIGHT  \
0  '11000002'  1600    '-6'   '-6'      '-6'     '8'   '99998'   813.890194   
1  '11000003'   840     '1'    '2'      '-6'     '8'   '99998'   581.103231   
2  '11000005'    -6    '-6'    '1'      '-6'    '-6'   '99998'  7335.965001   
3  '11000006'    -6    '-6'    '1'      '-6'    '-6'   '99998'  6562.865941   
4  '11000008'   800    '-6'    '2'      '-6'     '8'   '99998'  1490.800600   

   NUMPEOPLE HUDSUB   HINCP SRVYEAR  
0         -6   '-6'      -6    2023  
1          3    '3'   48000    2023  
2          2   '-6'  292500    2023  
3          3   '-6'   56000    2023  
4          1    '3'   36000    2023   

--- AHS 2021 ---
      CONTROL  RENT RMCOSTS TENURE RENTCNTRL RENTSUB OMB13CBSA       WEIGHT  \
0  '11000005'    -6    '-6'    '1'      '-6'    '-6'   '99998'  7686.712735   
1  '11000007'    -6    '-6'    '1'      '-6'    '-6'   '37980'  1371.137443   
2  '11000009'    -6    '

### Concatenating the 2015, 2017, 2019, etc. PUFs to produce a **long-format panel dataset**.
This will allow:
- **Track units over time** --> This requires **multiple rows per unit** (i.e., one row per unit * per year).
- **Identify the same unit** --> This is already possible with the `'CONTROL'` variable.
- **Compare status across years** (e.g., 2015 vs. 2023) --> This is easiest when **filtering by year** can be cleanly, which is what the `'SRVYEAR'` column enables.

In [7]:
#Stack all years into a single long-form DataFrame
ahs_long_df = pd.concat(dfs_by_year.values(), ignore_index=True)

In [8]:
#Inspecting the shape and the preview of the concatenated DataFrames
print(ahs_long_df.shape)
print(ahs_long_df.info())
print(ahs_long_df.head())
print(ahs_long_df['SRVYEAR'].value_counts())

(319240, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319240 entries, 0 to 319239
Data columns (total 12 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   CONTROL    319240 non-null  object 
 1   RENT       319240 non-null  int64  
 2   RMCOSTS    319240 non-null  object 
 3   TENURE     319240 non-null  object 
 4   RENTCNTRL  319240 non-null  object 
 5   RENTSUB    319240 non-null  object 
 6   OMB13CBSA  319240 non-null  object 
 7   WEIGHT     319240 non-null  float64
 8   NUMPEOPLE  319240 non-null  int64  
 9   HUDSUB     319240 non-null  object 
 10  HINCP      319240 non-null  int64  
 11  SRVYEAR    319240 non-null  object 
dtypes: float64(1), int64(3), object(8)
memory usage: 29.2+ MB
None
      CONTROL  RENT RMCOSTS TENURE RENTCNTRL RENTSUB OMB13CBSA       WEIGHT  \
0  '11000002'  1600    '-6'   '-6'      '-6'     '8'   '99998'   813.890194   
1  '11000003'   840     '1'    '2'      '-6'     '8'   '99998'   581.103231 

In [9]:
#Checking for unique unit-year combinations, the lines of code should produce the same output
print(len(ahs_long_df))
print(ahs_long_df[['CONTROL', 'SRVYEAR']].drop_duplicates().shape[0])

319240
319240


In [10]:
#Ensuring no columns were dropped
print(ahs_long_df.columns)

Index(['CONTROL', 'RENT', 'RMCOSTS', 'TENURE', 'RENTCNTRL', 'RENTSUB',
       'OMB13CBSA', 'WEIGHT', 'NUMPEOPLE', 'HUDSUB', 'HINCP', 'SRVYEAR'],
      dtype='object')


In [11]:
#Looking for missing or duplicated `CONTROL`-year rows
dupes = ahs_long_df.duplicated(subset=["CONTROL", "SRVYEAR"], keep=False)
print("Duplicated CONTROL + SRVYEAR rows:", dupes.sum())

Duplicated CONTROL + SRVYEAR rows: 0


In [12]:
#Checking counts by year to confirm stacking
ahs_long_df["SRVYEAR"].value_counts().sort_index()

2015    69493
2017    66752
2019    63185
2021    64141
2023    55669
Name: SRVYEAR, dtype: int64

##### Stripping the extra quotation marks `''` from the sting values of the entire dataset to make coding a cleaner/easier

In [13]:
ahs_long_df[ahs_long_df.select_dtypes(include='object')
            .columns] = ahs_long_df.select_dtypes(include='object').apply(lambda col: col.str.strip("'"))
ahs_long_df.head()

Unnamed: 0,CONTROL,RENT,RMCOSTS,TENURE,RENTCNTRL,RENTSUB,OMB13CBSA,WEIGHT,NUMPEOPLE,HUDSUB,HINCP,SRVYEAR
0,11000002,1600,-6,-6,-6,8,99998,813.890194,-6,-6,-6,2023
1,11000003,840,1,2,-6,8,99998,581.103231,3,3,48000,2023
2,11000005,-6,-6,1,-6,-6,99998,7335.965001,2,-6,292500,2023
3,11000006,-6,-6,1,-6,-6,99998,6562.865941,3,-6,56000,2023
4,11000008,800,-6,2,-6,8,99998,1490.8006,1,3,36000,2023


### Isolating Renter Households

In [14]:
#Examining `TENURE` variable
ahs_long_df['TENURE'].value_counts()

1     162666
2     111084
-6     41729
3       3761
Name: TENURE, dtype: int64

I am interested in the 111,084 renter households in the DataFrame since I am creating a new DataFrame for renters.

I am disregarding:
- Owners ('1'): 162,666
- Not Applicable ('-6'): 41,729
- Units occupied w/o payment of rent ('3'): 3,761

In [15]:
#Looking at the characteristic of 'HINCP' to see if "N" values exists in the DataFrame or if its just '-6' values
print(ahs_long_df['HINCP'].dtype)
print(ahs_long_df['HINCP'].unique())

#Number of unique values in the df
print(ahs_long_df['HINCP'].nunique())

int64
[    -6  48000 292500 ...    364 327100 111660]
16891


In [16]:
#Looking at the number of renter households that have incomes that are "not applicable" (coded as '-6')
ahs_long_df[ahs_long_df['TENURE'] == '2'].groupby('HINCP').size().get(-6,0)

0

In [17]:
ahs_long_df[(ahs_long_df['TENURE'] == '2') & (ahs_long_df['HINCP'] == -6)].shape[0]

0

In [18]:
#Checking any NA/-6 values among renter houhseholds in the df
grouped = ahs_long_df[ahs_long_df['TENURE'] == '2'].groupby('HINCP').size()

#Check if -6 is in the index before using .loc
if -6 in grouped.index:
    count_minus_6 = grouped.loc[-6]
else:
    count_minus_6 = 0

print("Number of renter households with HINCP == -6:", count_minus_6)

Number of renter households with HINCP == -6: 0


All renter households/observations have a dollar amount within the `'HINCP'` variable.

In [19]:
#New df consisting of renters
rent_ahs_long_df = ahs_long_df[ahs_long_df['TENURE'] == '2']

#Checking if `rent_df` was filtered correctly
len(rent_ahs_long_df)

111084

### Examinning the household income and number of people in the unit variables for renters
- These variables are critical for matching renter households/observations to the correct Area Median Income (AMI) category

In [20]:
pd.options.display.float_format = '{:.4f}'.format #changing numeric format to see the dollar amount more clearly

rent_ahs_long_df[['HINCP', 'NUMPEOPLE']].describe()

Unnamed: 0,HINCP,NUMPEOPLE
count,111084.0,111084.0
mean,52883.5915,2.2678
std,77385.2487,1.4754
min,-10000.0,1.0
25%,14100.0,1.0
50%,34200.0,2.0
75%,68000.0,3.0
max,6445000.0,19.0


In [21]:
#Examining the the oultiers
rent_ahs_long_df[['HINCP', 'NUMPEOPLE']].quantile([0, .01, .2,.4,.6,.8,.99])

Unnamed: 0,HINCP,NUMPEOPLE
0.0,-10000.0,1.0
0.01,0.0,1.0
0.2,11400.0,1.0
0.4,25000.0,1.0
0.6,45000.0,2.0
0.8,79800.0,3.0
0.99,315000.0,7.0


In [22]:
#Looking at the number of renter households with negative incomes
rent_ahs_long_df.loc[rent_ahs_long_df["HINCP"] < 0]

Unnamed: 0,CONTROL,RENT,RMCOSTS,TENURE,RENTCNTRL,RENTSUB,OMB13CBSA,WEIGHT,NUMPEOPLE,HUDSUB,HINCP,SRVYEAR
49144,11088288,1900,-6,2,-6,8,99998,5379.9862,3,3,-5000,2023
112419,11086871,2000,2,2,-6,8,33100,1367.133,3,3,-10000,2021
131499,11016858,740,1,2,-6,-9,99998,2464.2658,1,3,-800,2019
183849,11001154,800,-6,2,-6,8,99998,3748.5769,1,3,-5000,2017
257153,11009196,1700,-6,2,2,8,41860,705.1834,1,3,-5000,2015


There are negative values for `'HINCP'` which means the values are invalid due to either survey misreporting or processing errors. I will remove these rows from the DataFrame. The `'CONTROL'` (respondent/unit ID) for those rows will still appear in others years, if it had valid income data in those years. The panel structure will remain intact, but some years will be missing for some units.

In [23]:
#Keeping a copy for debugging or comparison
rent_ahs_long_df_raw = rent_ahs_long_df.copy()

#Dropping observations with negative 'HINCP' values
rent_ahs_long_df = rent_ahs_long_df[rent_ahs_long_df["HINCP"] >= 0]

### Isolating to renters in the following metropolitan areas/CBSAs within the National PUF
- Atlanta-Sandy Springs-Roswell, GA (12060)
- Boston-Cambridge-Quincy, MA-NH (14460)
- Dallas-Fort Worth-Arlington, TX (19100)
- Houston-The Woodlands-Sugar Land, TX (26420)
- Phoenix-Mesa-Scottsdale, AZ (38060)
- Seattle-Tacoma-Bellevue, WA (42660)
- Washington-Arlington-Alexandria, DC-VA-MD (47900)

In [24]:
#Examing 'OMB13CBSA' unique values
print(rent_ahs_long_df['OMB13CBSA'].unique())

#Looking at number of renter households/observations within each CBSA in `rent_df`
print(rent_ahs_long_df['OMB13CBSA'].value_counts())

#Number of Metros in the dataframe
print(rent_ahs_long_df['OMB13CBSA'].nunique())

['99998' '37980' '47900' '99999' '35620' '14460' '41860' '26420' '33100'
 '12060' '38060' '16980' '19100' '19820' '42660' '31080' '40140']
99998    40614
99999     7800
35620     6702
31080     6367
41860     4758
42660     4413
19100     4385
33100     4114
47900     4091
26420     4062
14460     3978
40140     3577
12060     3569
16980     3558
38060     3271
19820     2924
37980     2896
Name: OMB13CBSA, dtype: int64
17


In [25]:
#Creating a DataFrame of only renter households in ATL, BOS, PHX, Dallas/Ft. Worth, Houston, Seattle, Washington DC
#Define CBSA codes of interest as strings
target_cbsa_codes = ['12060', '14460', '19100', '26420', '38060', '42660', '47900']

#Filter the DataFrame
metro_long_rent_df = rent_ahs_long_df[rent_ahs_long_df['OMB13CBSA'].isin(target_cbsa_codes)]

#Checking filter was correct
len(metro_long_rent_df)

27769

It worked. The length of the new DataFrame and the total count of observations under each CBSA is the same.

In [26]:
#Checking the distribution of household income and number of people in renter-occupied units in the new DataFrame
metro_long_rent_df[['HINCP', 'NUMPEOPLE']].describe()

Unnamed: 0,HINCP,NUMPEOPLE
count,27769.0,27769.0
mean,65398.4558,2.3234
std,94875.4947,1.4813
min,0.0,1.0
25%,21170.0,1.0
50%,46200.0,2.0
75%,84000.0,3.0
max,6445000.0,19.0


In [27]:
#Examining the the oultiers
metro_long_rent_df[['HINCP', 'NUMPEOPLE']].quantile([0, .01, .2,.4,.6,.8,.99])

Unnamed: 0,HINCP,NUMPEOPLE
0.0,0.0,1.0
0.01,0.0,1.0
0.2,16802.4,1.0
0.4,35800.0,2.0
0.6,60000.0,2.0
0.8,96000.0,3.0
0.99,340788.0,7.0


In [28]:
#Examinig how many renter-occupied households have more than 8 people 
#Since HUD only publishes Income Limits for households up to 8 people, explained more below
metro_long_rent_df['NUMPEOPLE'].value_counts()

1     10514
2      7585
3      4121
4      2975
5      1518
6       653
7       260
8        86
9        43
10        6
12        3
19        1
11        1
13        1
14        1
15        1
Name: NUMPEOPLE, dtype: int64

### Loading in Income Limits to catagorize renter households by AMI
Accessing 2017, 2019, 2021, and 2023 Income Limit data via the API Function from HUD USER (https://www.huduser.gov/portal/dataset/fmr-api.h) and will merge the newly created `stacked_il_df` on the CBSA code (`'OMB13CBSA'`) variable to the AHS data and then create a new `'AMI'` categorical variable with the following values:
- Above LI: Above Low-Income
    - Households above the low-income thresholds of 80% Area Median Income (AMI).
    - Households above 80% AMI are not elibigle for federal rental subsidies in most cases.
- LI: Low-Income
    - Renter households with incomes at or below 80% AMI.
- VLI: Very Low-Income
    - Renter Households with incomes at or below 50% AMI.
- ELI: Extremely Low-Income
    - Renter Households with incomes at or below 30% AMI

A stacked DataFrame will be created with a new column `'il_fiscal_year'` to merge the stacked income limits DataFrame with the longitudinal AHS data to match an household's `'HINCP'` (household income) with the corresponding `'AMI'` category based on the income limits of that fiscal year. 

For example, a household's income/AMI categorization for 2015 will be based on FY2015 Income Limits for that metro area. 

**NOTE:** HUD _**does not**_ support API requests for years prior to 2017. I will need to download the FY2015 from HUDUSER, isolate the corresponding CBSA/Metro Area, and merge that into the API-requested DataFrame.

In [29]:
#API key
API_KEY = 'YOUR_API_KEY'

#Identifying metro area
cbsa_codes = [
    'METRO12060M12060', #ATL
    'METRO14460MM1120', #BOS
    'METRO19100M19100', #DAL
    'METRO26420M26420', #HOU
    'METRO38060M38060', #PHX
    'METRO42660MM7600', #SEA
    'METRO47900M47900'  #DC
]

#Identifying years for the staked DataFrame
years = [2017, 2019, 2021, 2023] 

#Initialize both dictionaries & storage
metro_dict = {}  
year_dict = {}   
all_rows = []

#API request & loop through CBSA and year
for cbsa_code in cbsa_codes:
    for year in years:
        url = f'https://www.huduser.gov/hudapi/public/il/data/{cbsa_code}?year={year}'
        headers = {'Authorization': f'Bearer {API_KEY}'}
        response = requests.get(url, headers=headers)
    
        json_data = response.json()
        data = json_data.get('data',{})
        
        #Structuring the json into a panda dataframe that is reflective of the excel file avaialbe on the web
        #Parse row
        row = {
            'hud_area_code': cbsa_code,
            'hud_area_name': data.get('area_name'),
            'il_fiscal_year': data.get('year'),
            'median_income': data.get('median_income')
        }
        
        #Add l50 (Very Low), ELI (Exrememly Low), l80 (Low)
        for i in range(1, 9):
            row[f'l50_{i}'] = data.get('very_low', {}).get(f'il50_p{i}')
            row[f'ELI_{i}'] = data.get('extremely_low', {}).get(f'il30_p{i}')
            row[f'l80_{i}'] = data.get('low', {}).get(f'il80_p{i}')
            
        #Add dictionaries
        metro_dict.setdefault(cbsa_code, []).append(row)
        year_dict.setdefault(int(data.get('year')), []).append(row)
        
        #Adding to masterlist
        all_rows.append(row)
        
#Creating the final Dataframe for income limits from 2017-2023 for the target metro areas
metro_staked_il_df = pd.DataFrame(all_rows)
metro_staked_il_df.head(50)

Unnamed: 0,hud_area_code,hud_area_name,il_fiscal_year,median_income,l50_1,ELI_1,l80_1,l50_2,ELI_2,l80_2,...,l80_5,l50_6,ELI_6,l80_6,l50_7,ELI_7,l80_7,l50_8,ELI_8,l80_8
0,METRO12060M12060,"Atlanta-Sandy Springs-Roswell, GA HUD Metro FM...",2017,69700,24400,14650,39050,27900,16750,44600,...,60250,40450,32960,64700,43250,37140,69150,46050,41320,73600
1,METRO12060M12060,"Atlanta-Sandy Springs-Roswell, GA HUD Metro FM...",2019,79700,27900,16750,44650,31900,19150,51000,...,68850,46250,34590,73950,49450,39010,79050,52650,43430,84150
2,METRO12060M12060,"Atlanta-Sandy Springs-Roswell, GA HUD Metro FM...",2021,86200,30200,18100,48300,34500,20700,55200,...,74500,50000,35580,80000,53450,40120,85500,56900,44660,91050
3,METRO12060M12060,"Atlanta-Sandy Springs-Roswell, GA HUD Metro FM...",2023,103500,35750,21500,57200,40850,24550,65350,...,88200,59250,40280,94750,63350,45420,101250,67400,50560,107800
4,METRO14460MM1120,"Boston-Cambridge-Quincy, MA-NH HUD Metro FMR Area",2017,103400,36200,21700,54750,41400,24800,62550,...,84450,60000,36000,90700,64150,38450,96950,68250,41320,103200
5,METRO14460MM1120,"Boston-Cambridge-Quincy, MA-NH HUD Metro FMR Area",2019,113300,41500,24900,62450,47400,28450,71400,...,96350,68750,41250,103500,73500,44100,110650,78250,46950,117750
6,METRO14460MM1120,"Boston-Cambridge-Quincy, MA-NH HUD Metro FMR Area",2021,120800,47000,28200,70750,53700,32200,80850,...,109150,77850,46700,117250,83250,49950,125350,88600,53150,133400
7,METRO14460MM1120,"Boston-Cambridge-Quincy, MA-NH HUD Metro FMR Area",2023,149300,51950,31150,82950,59400,35600,94800,...,127950,86100,51650,137450,92050,55200,146900,97950,58750,156400
8,METRO19100M19100,"Dallas, TX HUD Metro FMR Area",2017,73400,25700,15400,41100,29400,17600,47000,...,63400,42600,32960,68100,45550,37140,72800,48450,41320,77500
9,METRO19100M19100,"Dallas, TX HUD Metro FMR Area",2019,83100,29100,17500,46550,33250,20000,53200,...,71850,48200,34590,77150,51550,39010,82500,54850,43430,87800


In [30]:
#Loading in HUD's 2015 Income Limit Data for target CBSAs

#File path
file_path = '/Volumes/LaCie/GCPI 2025 Visiting Fellowship/Policy Brief 1/Data/Section8_Rev.xlsx'

#Using excel rows to identify target CBSAs (1-based indexing)
target_rows_excel = [661, 2182, 3880, 3924, 105, 4513, 482]

#Convert to 0-based pandas row indices
target_indices = [i - 1 for i in target_rows_excel]

#Read the file, only desired rows (skip all others)
il_metro_15_df = pd.read_excel(
    file_path,
    header=None,
    skiprows=lambda x: x not in target_indices
)

#Read only the header row separately (row 0) for column names
headers = pd.read_excel(file_path, nrows=0).columns.tolist()

#Assign headers back into DataFrame
il_metro_15_df.columns = headers
il_metro_15_df

Unnamed: 0,State_Alpha,fips2000,State,County,County_Name,CBSASub,Metro_Area_Name,fips2010,median2015,l50_1,...,l80_3,l80_4,l80_5,l80_6,l80_7,l80_8,MSA,county_town_name,state_name,metro
0,AZ,401399999,4,13,Maricopa County,METRO38060M38060,"Phoenix-Mesa-Glendale, AZ MSA",401399999,64000,22400,...,46100,51200,55300,59400,63500,67600,6200,Maricopa County,Arizona,1
1,DC,1100199999,11,1,District of Columbia,METRO47900M47900,"Washington-Arlington-Alexandria, DC-VA-MD HUD ...",1100199999,109200,38250,...,61200,68000,73450,78900,84350,89800,8840,District of Columbia,District of Columbia,1
2,GA,1322799999,13,227,Pickens County,METRO12060M12060,"Atlanta-Sandy Springs-Marietta, GA HUD Metro F...",1322799999,68300,23900,...,49100,54550,58950,63300,67650,72050,520,Pickens County,Georgia,1
3,MA,2502507000,25,25,Suffolk County,METRO14460MM1120,"Boston-Cambridge-Quincy, MA-NH HUD Metro FMR Area",2502507000,98500,34500,...,62750,69700,75300,80900,86450,92050,1120,Boston city,Massachusetts,1
4,TX,4811399999,48,113,Dallas County,METRO19100M19100,"Dallas, TX HUD Metro FMR Area",4811399999,70400,24650,...,50700,56300,60850,65350,69850,74350,1920,Dallas County,Texas,1
5,TX,4820199999,48,201,Harris County,METRO26420M26420,"Houston-Baytown-Sugar Land, TX HUD Metro FMR Area",4820199999,69300,24300,...,49950,55450,59900,64350,68800,73200,3360,Harris County,Texas,1
6,WA,5303399999,53,33,King County,METRO42660MM7600,"Seattle-Bellevue, WA HUD Metro FMR Area",5303399999,89600,31400,...,59250,65800,71100,76350,81600,86900,7600,King County,Washington,1


#### Structuring the FY2015 Income Limit DataFrame to match the API-request DataFrame above.

In [31]:
#Dropping columns that are not relevant 
il_metro_15_df.drop(columns=['State_Alpha', 'fips2000', 'State', 'County', 'County_Name', 'fips2010', 'MSA', 'county_town_name',
                           'state_name', 'metro'], inplace=True)

#Adding `year` column
il_metro_15_df['il_fiscal_year'] = '2015'

#Moving `year` to 3rd column (index position 2)
il_metro_15_df = il_metro_15_df[[*il_metro_15_df.columns[:2], 'il_fiscal_year', *il_metro_15_df.columns[2:-1]]]

#Rename columns to match API-requested DataFrame
il_metro_15_df.rename(columns={'CBSASub': 'hud_area_code', 'Metro_Area_Name': 'hud_area_name', 'median2015': 'median_income'}, 
                      inplace=True)
il_metro_15_df

Unnamed: 0,hud_area_code,hud_area_name,il_fiscal_year,median_income,l50_1,l50_2,l50_3,l50_4,l50_5,l50_6,...,ELI_7,ELI_8,l80_1,l80_2,l80_3,l80_4,l80_5,l80_6,l80_7,l80_8
0,METRO38060M38060,"Phoenix-Mesa-Glendale, AZ MSA",2015,64000,22400,25600,28800,32000,34600,37150,...,36730,40890,35850,41000,46100,51200,55300,59400,63500,67600
1,METRO47900M47900,"Washington-Arlington-Alexandria, DC-VA-MD HUD ...",2015,109200,38250,43700,49150,54600,59000,63350,...,40650,43250,47600,54400,61200,68000,73450,78900,84350,89800
2,METRO12060M12060,"Atlanta-Sandy Springs-Marietta, GA HUD Metro F...",2015,68300,23900,27300,30700,34100,36850,39600,...,36730,40890,38200,43650,49100,54550,58950,63300,67650,72050
3,METRO14460MM1120,"Boston-Cambridge-Quincy, MA-NH HUD Metro FMR Area",2015,98500,34500,39400,44350,49250,53200,57150,...,36730,40890,48800,55800,62750,69700,75300,80900,86450,92050
4,METRO19100M19100,"Dallas, TX HUD Metro FMR Area",2015,70400,24650,28200,31700,35200,38050,40850,...,36730,40890,39450,45050,50700,56300,60850,65350,69850,74350
5,METRO26420M26420,"Houston-Baytown-Sugar Land, TX HUD Metro FMR Area",2015,69300,24300,27750,31200,34650,37450,40200,...,36730,40890,38850,44400,49950,55450,59900,64350,68800,73200
6,METRO42660MM7600,"Seattle-Bellevue, WA HUD Metro FMR Area",2015,89600,31400,35850,40350,44800,48400,52000,...,36730,40890,46100,52650,59250,65800,71100,76350,81600,86900


#### Stacking the Income Limit DataFrames

In [32]:
stacked_il_df = pd.concat([il_metro_15_df, metro_staked_il_df], ignore_index=True)

In [33]:
#Exmaining the DataFrame
stacked_il_df

Unnamed: 0,hud_area_code,hud_area_name,il_fiscal_year,median_income,l50_1,l50_2,l50_3,l50_4,l50_5,l50_6,...,ELI_7,ELI_8,l80_1,l80_2,l80_3,l80_4,l80_5,l80_6,l80_7,l80_8
0,METRO38060M38060,"Phoenix-Mesa-Glendale, AZ MSA",2015,64000,22400,25600,28800,32000,34600,37150,...,36730,40890,35850,41000,46100,51200,55300,59400,63500,67600
1,METRO47900M47900,"Washington-Arlington-Alexandria, DC-VA-MD HUD ...",2015,109200,38250,43700,49150,54600,59000,63350,...,40650,43250,47600,54400,61200,68000,73450,78900,84350,89800
2,METRO12060M12060,"Atlanta-Sandy Springs-Marietta, GA HUD Metro F...",2015,68300,23900,27300,30700,34100,36850,39600,...,36730,40890,38200,43650,49100,54550,58950,63300,67650,72050
3,METRO14460MM1120,"Boston-Cambridge-Quincy, MA-NH HUD Metro FMR Area",2015,98500,34500,39400,44350,49250,53200,57150,...,36730,40890,48800,55800,62750,69700,75300,80900,86450,92050
4,METRO19100M19100,"Dallas, TX HUD Metro FMR Area",2015,70400,24650,28200,31700,35200,38050,40850,...,36730,40890,39450,45050,50700,56300,60850,65350,69850,74350
5,METRO26420M26420,"Houston-Baytown-Sugar Land, TX HUD Metro FMR Area",2015,69300,24300,27750,31200,34650,37450,40200,...,36730,40890,38850,44400,49950,55450,59900,64350,68800,73200
6,METRO42660MM7600,"Seattle-Bellevue, WA HUD Metro FMR Area",2015,89600,31400,35850,40350,44800,48400,52000,...,36730,40890,46100,52650,59250,65800,71100,76350,81600,86900
7,METRO12060M12060,"Atlanta-Sandy Springs-Roswell, GA HUD Metro FM...",2017,69700,24400,27900,31400,34850,37650,40450,...,37140,41320,39050,44600,50200,55750,60250,64700,69150,73600
8,METRO12060M12060,"Atlanta-Sandy Springs-Roswell, GA HUD Metro FM...",2019,79700,27900,31900,35900,39850,43050,46250,...,39010,43430,44650,51000,57400,63750,68850,73950,79050,84150
9,METRO12060M12060,"Atlanta-Sandy Springs-Roswell, GA HUD Metro FM...",2021,86200,30200,34500,38800,43100,46550,50000,...,40120,44660,48300,55200,62100,68950,74500,80000,85500,91050


In [34]:
#Inspecting the shape/info of the stacked DataFrame
print(stacked_il_df.shape)
print(stacked_il_df.info())

(35, 28)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   hud_area_code   35 non-null     object
 1   hud_area_name   35 non-null     object
 2   il_fiscal_year  35 non-null     object
 3   median_income   35 non-null     int64 
 4   l50_1           35 non-null     int64 
 5   l50_2           35 non-null     int64 
 6   l50_3           35 non-null     int64 
 7   l50_4           35 non-null     int64 
 8   l50_5           35 non-null     int64 
 9   l50_6           35 non-null     int64 
 10  l50_7           35 non-null     int64 
 11  l50_8           35 non-null     int64 
 12  ELI_1           35 non-null     int64 
 13  ELI_2           35 non-null     int64 
 14  ELI_3           35 non-null     int64 
 15  ELI_4           35 non-null     int64 
 16  ELI_5           35 non-null     int64 
 17  ELI_6           35 non-null     int64 
 18  ELI

#### Creating new column `'cbsa_code'` that extracts the 5-digit cbsa code from the `'hud_area_code'` so I can merge it to the AHS `metro_long_rent_df` DataFrame.

In [35]:
stacked_il_df['OMB13CBSA'] = stacked_il_df['hud_area_code'].str.extract(r'(\d{5})')

#Looking at the DataFrame to see if the Column was created correctly
print(stacked_il_df)

       hud_area_code                                      hud_area_name  \
0   METRO38060M38060                      Phoenix-Mesa-Glendale, AZ MSA   
1   METRO47900M47900  Washington-Arlington-Alexandria, DC-VA-MD HUD ...   
2   METRO12060M12060  Atlanta-Sandy Springs-Marietta, GA HUD Metro F...   
3   METRO14460MM1120  Boston-Cambridge-Quincy, MA-NH HUD Metro FMR Area   
4   METRO19100M19100                      Dallas, TX HUD Metro FMR Area   
5   METRO26420M26420  Houston-Baytown-Sugar Land, TX HUD Metro FMR Area   
6   METRO42660MM7600            Seattle-Bellevue, WA HUD Metro FMR Area   
7   METRO12060M12060  Atlanta-Sandy Springs-Roswell, GA HUD Metro FM...   
8   METRO12060M12060  Atlanta-Sandy Springs-Roswell, GA HUD Metro FM...   
9   METRO12060M12060  Atlanta-Sandy Springs-Roswell, GA HUD Metro FM...   
10  METRO12060M12060  Atlanta-Sandy Springs-Roswell, GA HUD Metro FM...   
11  METRO14460MM1120  Boston-Cambridge-Quincy, MA-NH HUD Metro FMR Area   
12  METRO14460MM1120  Bos

#### Merging the AHS and IL datasets so a new `'AMI`' variable can be computed and renter households can be catagorized by AMI based on the household income (`'HINCP'`) and HUD's Income Limits.

In [36]:
#Merging AHS and IL DataFrames
metro_rent_ami_df = pd.merge(
    metro_long_rent_df,
    stacked_il_df,
    left_on=['OMB13CBSA', 'SRVYEAR'],
    right_on=['OMB13CBSA', 'il_fiscal_year'],
    how='left',  #use left since I only want records from AHS
    indicator=True
)

### Validating that the join worked correctly
1. Checking `'_merge'` column to see whether each row came from the left, right, or both DataFrames. All or most values should be `both`.
2. Check for `NaN` values in a colum that should always be present if the merge worked. A result near `0.0` means that all values are present.
3. Cross-tabulate year values to spot mismatches
4. Inspect unmatched rows - look at examples of merge failures

In [37]:
#Checking `_merge` column
metro_rent_ami_df['_merge'].value_counts()

both          27769
left_only         0
right_only        0
Name: _merge, dtype: int64

In [38]:
#Checking `NaN` values
metro_rent_ami_df['median_income'].isna().mean()

0.0

In [39]:
#Cross-tab year values to spot mismatches
pd.crosstab(metro_rent_ami_df['SRVYEAR'], metro_rent_ami_df['_merge'])

_merge,both
SRVYEAR,Unnamed: 1_level_1
2015,6209
2017,6024
2019,5165
2021,5700
2023,4671


In [40]:
#Inspecting unmatched rows
metro_rent_ami_df[metro_rent_ami_df['_merge'] != 'both'].head(50)

Unnamed: 0,CONTROL,RENT,RMCOSTS,TENURE,RENTCNTRL,RENTSUB,OMB13CBSA,WEIGHT,NUMPEOPLE,HUDSUB,...,ELI_8,l80_1,l80_2,l80_3,l80_4,l80_5,l80_6,l80_7,l80_8,_merge


In [41]:
#Rows only in AHS DataFrame
unmatched_from_metro_df = metro_rent_ami_df[metro_rent_ami_df['_merge'] == 'left_only']

#Rows only in Income Limit DataFrame
unmatched_from_il_metro_df = metro_rent_ami_df[metro_rent_ami_df['_merge'] == 'right_only']

In [42]:
#Counting the unmatched rows
print("unmatched_from_metro_df:", len(unmatched_from_metro_df))
print("unmatched_from_il_metro_df:", len(unmatched_from_il_metro_df))

unmatched_from_metro_df: 0
unmatched_from_il_metro_df: 0


In [43]:
#Viewing the actual unmatched rows
print(unmatched_from_metro_df.head())
print(unmatched_from_il_metro_df.head())

Empty DataFrame
Columns: [CONTROL, RENT, RMCOSTS, TENURE, RENTCNTRL, RENTSUB, OMB13CBSA, WEIGHT, NUMPEOPLE, HUDSUB, HINCP, SRVYEAR, hud_area_code, hud_area_name, il_fiscal_year, median_income, l50_1, l50_2, l50_3, l50_4, l50_5, l50_6, l50_7, l50_8, ELI_1, ELI_2, ELI_3, ELI_4, ELI_5, ELI_6, ELI_7, ELI_8, l80_1, l80_2, l80_3, l80_4, l80_5, l80_6, l80_7, l80_8, _merge]
Index: []

[0 rows x 41 columns]
Empty DataFrame
Columns: [CONTROL, RENT, RMCOSTS, TENURE, RENTCNTRL, RENTSUB, OMB13CBSA, WEIGHT, NUMPEOPLE, HUDSUB, HINCP, SRVYEAR, hud_area_code, hud_area_name, il_fiscal_year, median_income, l50_1, l50_2, l50_3, l50_4, l50_5, l50_6, l50_7, l50_8, ELI_1, ELI_2, ELI_3, ELI_4, ELI_5, ELI_6, ELI_7, ELI_8, l80_1, l80_2, l80_3, l80_4, l80_5, l80_6, l80_7, l80_8, _merge]
Index: []

[0 rows x 41 columns]


In [44]:
#Looking at the dataframe
metro_rent_ami_df

Unnamed: 0,CONTROL,RENT,RMCOSTS,TENURE,RENTCNTRL,RENTSUB,OMB13CBSA,WEIGHT,NUMPEOPLE,HUDSUB,...,ELI_8,l80_1,l80_2,l80_3,l80_4,l80_5,l80_6,l80_7,l80_8,_merge
0,11000095,5100,2,2,2,8,47900,1594.6085,3,3,...,59700,66750,76250,85800,95300,102950,110550,118200,125800,both
1,11000127,250,-6,2,-6,4,47900,449.0207,2,1,...,59700,66750,76250,85800,95300,102950,110550,118200,125800,both
2,11000152,1200,-6,2,-6,5,47900,1294.5002,3,3,...,59700,66750,76250,85800,95300,102950,110550,118200,125800,both
3,11000157,1100,-6,2,1,8,47900,935.6503,1,3,...,59700,66750,76250,85800,95300,102950,110550,118200,125800,both
4,11000173,1600,-6,2,2,8,47900,935.6503,1,3,...,59700,66750,76250,85800,95300,102950,110550,118200,125800,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27764,11084589,660,1,2,-6,8,12060,955.9019,3,3,...,40890,38200,43650,49100,54550,58950,63300,67650,72050,both
27765,11084594,600,2,2,-6,8,12060,917.6095,3,3,...,40890,38200,43650,49100,54550,58950,63300,67650,72050,both
27766,11084786,1400,2,2,-6,6,19100,1012.9549,2,3,...,40890,39450,45050,50700,56300,60850,65350,69850,74350,both
27767,11085055,850,-6,2,-6,8,12060,988.1582,5,3,...,40890,38200,43650,49100,54550,58950,63300,67650,72050,both


All obversations/rows were matched. Dropping the `'_merge'` column.

In [45]:
metro_rent_ami_df.drop(columns=['_merge'], inplace=True)

In [46]:
#Checking the DataFrame again
metro_rent_ami_df.head()

Unnamed: 0,CONTROL,RENT,RMCOSTS,TENURE,RENTCNTRL,RENTSUB,OMB13CBSA,WEIGHT,NUMPEOPLE,HUDSUB,...,ELI_7,ELI_8,l80_1,l80_2,l80_3,l80_4,l80_5,l80_6,l80_7,l80_8
0,11000095,5100,2,2,2,8,47900,1594.6085,3,3,...,56050,59700,66750,76250,85800,95300,102950,110550,118200,125800
1,11000127,250,-6,2,-6,4,47900,449.0207,2,1,...,56050,59700,66750,76250,85800,95300,102950,110550,118200,125800
2,11000152,1200,-6,2,-6,5,47900,1294.5002,3,3,...,56050,59700,66750,76250,85800,95300,102950,110550,118200,125800
3,11000157,1100,-6,2,1,8,47900,935.6503,1,3,...,56050,59700,66750,76250,85800,95300,102950,110550,118200,125800
4,11000173,1600,-6,2,2,8,47900,935.6503,1,3,...,56050,59700,66750,76250,85800,95300,102950,110550,118200,125800


In [47]:
#Checking the shape/info of the new metro_rent_ami_df DataFrame
print(metro_rent_ami_df.shape)
print(metro_rent_ami_df.info())

(27769, 40)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 27769 entries, 0 to 27768
Data columns (total 40 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CONTROL         27769 non-null  object 
 1   RENT            27769 non-null  int64  
 2   RMCOSTS         27769 non-null  object 
 3   TENURE          27769 non-null  object 
 4   RENTCNTRL       27769 non-null  object 
 5   RENTSUB         27769 non-null  object 
 6   OMB13CBSA       27769 non-null  object 
 7   WEIGHT          27769 non-null  float64
 8   NUMPEOPLE       27769 non-null  int64  
 9   HUDSUB          27769 non-null  object 
 10  HINCP           27769 non-null  int64  
 11  SRVYEAR         27769 non-null  object 
 12  hud_area_code   27769 non-null  object 
 13  hud_area_name   27769 non-null  object 
 14  il_fiscal_year  27769 non-null  object 
 15  median_income   27769 non-null  int64  
 16  l50_1           27769 non-null  int64  
 17  l50_2           277

In [48]:
#Verify `CONTROL` is unique by survey year
grouped = metro_rent_ami_df.groupby('SRVYEAR')

for year, group in grouped:
    assert group['CONTROL'].nunique() == len(group), f"Control numbers are not unique for SRVYEAR {year}!"

In [49]:
#observing values of the `hud_area_name` column at a list
hud_area_name_list = metro_rent_ami_df['hud_area_name'].unique()

#observing unique CBSA codes
hud_area_code_list = metro_rent_ami_df['hud_area_code'].unique()
OMB13CBSA_list = metro_rent_ami_df['OMB13CBSA'].unique()

# Print the list
print(hud_area_name_list)
print(hud_area_code_list)
print(OMB13CBSA_list)

['Washington-Arlington-Alexandria, DC-VA-MD HUD Metro FMR Area'
 'Boston-Cambridge-Quincy, MA-NH HUD Metro FMR Area'
 'Houston-The Woodlands-Sugar Land, TX HUD Metro FMR Area'
 'Atlanta-Sandy Springs-Roswell, GA HUD Metro FMR Area'
 'Phoenix-Mesa-Scottsdale, AZ MSA' 'Dallas, TX HUD Metro FMR Area'
 'Seattle-Bellevue, WA HUD Metro FMR Area'
 'Houston-Baytown-Sugar Land, TX HUD Metro FMR Area'
 'Atlanta-Sandy Springs-Marietta, GA HUD Metro FMR Area'
 'Phoenix-Mesa-Glendale, AZ MSA']
['METRO47900M47900' 'METRO14460MM1120' 'METRO26420M26420'
 'METRO12060M12060' 'METRO38060M38060' 'METRO19100M19100'
 'METRO42660MM7600']
['47900' '14460' '26420' '12060' '38060' '19100' '42660']


Since some values under the `'hud_area_name_list'` (metropolitan area names) change within the panel data, I will refer to the CBSA codes (`'OMB13CBSA'`) for metro/CBSA-based analyses.

## Categorizing Renter Households based on Area Median Income (AMI)/Income Limits.

I am creating a new `AMI` column consisting of catagorical variables (ELI, VLI, LI, and Above LI) that catagorizes each observation into an AMI group. I will include larger households by incresing the 8-person limit by 8% of the 4-person limit for each additional person beyond 8. 


### 1. Creating a function to assign an AMI catagory to renter households that have over eight (8) peoeple in the unit.

_This is from [HUD USER FY23 Income Limit Documentation ](https://www.huduser.gov/portal/datasets/il//il23/IncomeLimitsMethodology-FY23.pdf)_

| **Number of Persons in Family and Percentage Adjustments** | 1    | 2    | 3    | 4    | 5     | 6     | 7     | 8     |
|:-----------------------------------------------------------:|:----:|:----:|:----:|:----:|:-----:|:-----:|:-----:|:-----:|
|                                                             | 70%  | 80%  | 90%  | Base | 108%  | 116%  | 124%  | 132%  |


HUD does not include income limits for families with more than eight persons in the printed lists because of space limitations. For each person over eight-persons, the four-person income limit should be multiplied by an additional 8 percent. (For example, the nine-person limit equals 140 percent \[132 + 8\] of the relevant four-person income limit.) HUD rounds income limits up to the nearest &#36;50.

**Note:**  HUD has used the very low-income limits as the basis for deriving other income limits, unless the relevant statutory language has no references or relationship to low- and very low-income limits, as defined by the U.S. Housing Act of 1937.

#### Example:

* Very Low-Inome limit for 4-person household = &#36;46,750 
* Very Low-Income limit for 8-person household = &#36;61,750 
    * (&#36;46,750 * .32) _rounded to the nearest &#36;50_  
* Low-Income limit for 9-person = &#36;65,450 
    * (&#36;46,750 * .40) _not rounded since the product is a multiple of &#36;50_

In [50]:
def adjusted_thresholds(threshold_4_array, ppl_count_array):
    """
    Vectorized adjustment of income thresholds for households with more than 8 people.
    - Applies HUD rule: Add 8% of the 4-person threshold for each person beyond 8.
    - Returns NaN for households with 8 or fewer people (which will be handled in the function below)
    
    Parameters:
    - threshold_4_array: array of 4-person income thresholds
    - ppl_count_array: array of household sizes (number of people)
    
    Returns:
    - array of adjusted thresholds (or Nan if 'NUMPEOPLE' <= 8)
    """
    #Convert inputs to NumPy arrays if they are not already an array
    threshold_4_array = np.asarray(threshold_4_array)
    ppl_count_array = np.asarray(ppl_count_array)
    
    #Calculate how many obversations/renter households exceed 8 people
    extra_people = np.maximum(ppl_count_array - 8, 0)
    
    #Compute the total adjustment: 8% increase for each "extra" person and rounds to the nearest $50 
    adjustment = np.round((threshold_4_array * 0.08 * extra_people) / 50) * 50
    
    #Apply the adjusment only if there are extra people; else return NaN
    adjusted_thresholds = np.where(extra_people > 0, threshold_4_array + adjustment, np.nan)
    
    return adjusted_thresholds

### 2. Assigning the correct income threshold based on houhsehold size.

In [51]:
def get_threshold_by_size(df, base_col):
    """
    Return a pandas Series with the appropriate income threshold for each household size ('NUMPEOPLE' 1–8).
    
    Parameters:
    - df: DataFrame containing threshold columns like 'ELI_1' to 'ELI_8', 'l50_1' to l50_8`, etc.
    - base_cal: string, the prefix for the threshold columns ('ELI', 'l50', 'l80', etc.)
    
    Returns:
    - A pandas Series of thresholds matched to each household's NUMPEOPLE.
    """
    #Create list of column names to pull threshold values by size
    columns = [f'{base_col}_{i}' for i in range(1, 9)]
    
    #Convert these threshold columns to a NumPY matrix (2d array)
    matrix = df[columns].to_numpy()
    
    #Calculate the index of the column that corresponds to each row's NUMPEOPLE
    #Clip NUMPEOPLE to 1-8 range, subtract 1 for zero-based indexing
    indexer = df['NUMPEOPLE'].clip(1, 8).astype(int) - 1
    
    #Select the threshold from the appropriate column for each row
    return matrix[np.arange(len(df)), indexer]

### 3. Assign the AMI to each unique renter houhsehold
This is the final part of the function/method for assigning renter households to **AMI categories** (Extremely Low Income `ELI`, Very Low Income `VLI`, Low Income `LI`, and Above Low Income `Above LI`) based on:
- AHS Household inomce variable (`'HINCP'`)
- AHS Household size (`'NUMPEOPLE'`)
- HUD Income Limts (`'ELI_1'` to `'ELI_8'`, `'l50_1'` to `'l50_8'`, etc.

In [52]:
def assign_ami_category(df):
    """
    Assign AMI category (ELI, VLI, LI, Above LI) to each unique renter household (`CONTROL`).
    
    Expects Columns:
    - 'HINCP': household income
    - 'NUMPEOPLE': household size
    - Thresholds: 'ELI_1' - 'ELI_8', 'l50_1' - 'l50_8', 'l80_1' - 'l80_8'
    - And 'ELI_4', 'l50_4', 'l80_4' for adjustments for households with more than 8 people
    """
    df = df.copy()
    
    
    #--------- Vectorized thresholds for ELI, VLI, LI ---------
    for prefix, out_col, fallback_col in [
        ('ELI', 'ELI_threshold', 'ELI_4'),
        ('l50', 'VLI_threshold', 'l50_4'),
        ('l80', 'LI_threshold', 'l80_4'),
    ]:
        #Threshold matrix for 1-8 people: shape (n_rows, 8)
        cols = [f'{prefix}_{i}' for i in range (1, 9)]
        matrix = df[cols].to_numpy()
        
        #Index into the threshold matrix using NUMPEOPLE
        ppl_idx = df['NUMPEOPLE'].clip(1, 8).astype(int) - 1
        size_thresholds = matrix[np.arange(len(df)), ppl_idx]
        
        #Adjust for larger households (>8)
        adjusted = adjusted_thresholds(df[fallback_col], df['NUMPEOPLE'])
        
        #Combine adjusted and base thresholds
        df[out_col] = np.where(df['NUMPEOPLE'] <= 8, size_thresholds, adjusted)
        
    #--------- AMI assignment (Fully vectorized) ---------

    conditions = [
        df['HINCP'] <= df['ELI_threshold'],
        df['HINCP'] <= df['VLI_threshold'],
        df['HINCP'] <= df['LI_threshold']
    ]
    choices = ['ELI', 'VLI', 'LI']
    df['AMI'] = np.select(conditions, choices, default='Above LI')

    return df

### 4. Using the functions to assign an `'AMI'` category and examining `metro_rent_ami_df` after the funciton was executed.

In [53]:
#Usage
metro_rent_ami_df = assign_ami_category(metro_rent_ami_df)

In [54]:
#Looking at the newly created DataFrame with the 'AMI' values
metro_rent_ami_df.head()

Unnamed: 0,CONTROL,RENT,RMCOSTS,TENURE,RENTCNTRL,RENTSUB,OMB13CBSA,WEIGHT,NUMPEOPLE,HUDSUB,...,l80_3,l80_4,l80_5,l80_6,l80_7,l80_8,ELI_threshold,VLI_threshold,LI_threshold,AMI
0,11000095,5100,2,2,2,8,47900,1594.6085,3,3,...,85800,95300,102950,110550,118200,125800,40700.0,67850.0,85800.0,Above LI
1,11000127,250,-6,2,-6,4,47900,449.0207,2,1,...,85800,95300,102950,110550,118200,125800,36200.0,60300.0,76250.0,ELI
2,11000152,1200,-6,2,-6,5,47900,1294.5002,3,3,...,85800,95300,102950,110550,118200,125800,40700.0,67850.0,85800.0,VLI
3,11000157,1100,-6,2,1,8,47900,935.6503,1,3,...,85800,95300,102950,110550,118200,125800,31650.0,52750.0,66750.0,Above LI
4,11000173,1600,-6,2,2,8,47900,935.6503,1,3,...,85800,95300,102950,110550,118200,125800,31650.0,52750.0,66750.0,Above LI


In [55]:
#Preliminary review of AMI counts over the survey years.
print(metro_rent_ami_df.groupby('SRVYEAR')['AMI'].value_counts())

SRVYEAR  AMI     
2015     Above LI    2546
         ELI         1673
         LI          1001
         VLI          989
2017     Above LI    2581
         ELI         1493
         LI          1086
         VLI          864
2019     Above LI    2156
         ELI         1296
         LI           968
         VLI          745
2021     Above LI    2246
         ELI         1667
         LI           950
         VLI          837
2023     Above LI    1775
         ELI         1328
         LI           817
         VLI          751
Name: AMI, dtype: int64


### 5. Data Integrity Check

In [56]:
# Data integrity check
def check_data_integrity(df):
    print("🔍 DATA INTEGRITY CHECK\n" + "-"*30)
    
    total_rows = len(df)
    unique_ids = df[['CONTROL', 'SRVYEAR']].drop_duplicates().shape[0]
    control_duplicates = df['CONTROL'].duplicated().sum()
    
    print(f"Total rows: {total_rows}")
    print(f"Unique CONTROL + SRVYEAR pairs: {unique_ids}")
    print(f"Duplicate CONTROL values (across years): {control_duplicates}")
    
    # Nulls in key columns
    missing_income = df['HINCP'].isna().sum()
    missing_size = df['NUMPEOPLE'].isna().sum()
    missing_ami = df['AMI'].isna().sum()
    
    print(f"Missing household income (HINCP): {missing_income}")
    print(f"Missing household size (NUMPEOPLE): {missing_size}")
    print(f"Missing AMI category assignments: {missing_ami}")
    
    # AMI distribution
    print("\n📊 AMI Category Distribution:")
    print(df['AMI'].value_counts(dropna=False).sort_index())
    
    print("\n✅ All checks complete.\n")

In [57]:
#Usage
check_data_integrity(metro_rent_ami_df)

🔍 DATA INTEGRITY CHECK
------------------------------
Total rows: 27769
Unique CONTROL + SRVYEAR pairs: 27769
Duplicate CONTROL values (across years): 17012
Missing household income (HINCP): 0
Missing household size (NUMPEOPLE): 0
Missing AMI category assignments: 0

📊 AMI Category Distribution:
Above LI    11304
ELI          7457
LI           4822
VLI          4186
Name: AMI, dtype: int64

✅ All checks complete.



In [58]:
#Looking at the instance were `NUMPEOPLE` is greater than 8 people and checking matches outside of python
metro_rent_ami_df.loc[metro_rent_ami_df['NUMPEOPLE'] > 8]

Unnamed: 0,CONTROL,RENT,RMCOSTS,TENURE,RENTCNTRL,RENTSUB,OMB13CBSA,WEIGHT,NUMPEOPLE,HUDSUB,...,l80_3,l80_4,l80_5,l80_6,l80_7,l80_8,ELI_threshold,VLI_threshold,LI_threshold,AMI
475,11009681,2900,-6,2,2,8,47900,3725.606,9,3,...,85800,95300,102950,110550,118200,125800,48800.0,81400.0,102900.0,VLI
1481,11025555,4700,-6,2,-6,8,38060,1749.065,15,3,...,67350,74800,80800,86800,92800,98750,46800.0,72950.0,116700.0,LI
2610,11042999,5100,-6,2,2,8,47900,1300.2463,9,3,...,85800,95300,102950,110550,118200,125800,48800.0,81400.0,102900.0,Above LI
3078,11054195,2200,2,2,-6,1,47900,572.2356,9,3,...,85800,95300,102950,110550,118200,125800,48800.0,81400.0,102900.0,Above LI
4837,11002809,1200,-6,2,-6,8,26420,1393.38,9,3,...,57050,63350,68450,73500,78600,83650,28600.0,42750.0,68400.0,ELI
5010,11005625,1100,-6,2,-6,8,26420,1076.682,10,3,...,57050,63350,68450,73500,78600,83650,30750.0,45950.0,73500.0,ELI
5285,11010547,1500,-6,2,-6,8,26420,1185.2185,11,3,...,57050,63350,68450,73500,78600,83650,32850.0,49100.0,78550.0,Above LI
6027,11020063,1100,-6,2,-6,8,38060,813.6709,9,3,...,56900,63200,68300,73350,78400,83450,28600.0,42650.0,68250.0,ELI
6146,11021460,1500,-6,2,-6,8,38060,556.0477,9,3,...,56900,63200,68300,73350,78400,83450,28600.0,42650.0,68250.0,VLI
6150,11021476,1000,-6,2,-6,8,38060,768.4264,9,3,...,56900,63200,68300,73350,78400,83450,28600.0,42650.0,68250.0,Above LI


## Exporting `metro_rent_ami_df` to csv for analysis in subsequent Notebooks.

In [59]:
#Export the DataFRame to a CSV file without the index
metro_rent_ami_df.to_csv('metro_rent_ami_df.csv', index=False)