# Importing Data

The data is fetched as an Excel file directly from HRSA's data warehouse for federally qualified health centers (FQHCs). I'll start by importing the 2022 data.

In [1]:
import pandas as pd
from urllib.request import urlretrieve

In [2]:
url = 'https://www.hrsa.gov/sites/default/files/hrsa/foia/h80-2022.xlsx'

# save file locally (plain string: no f-string needed, nothing is interpolated)
urlretrieve(url, '/Users/katialopes-gilbert/repos/springboard-projects/capstone-project-fqhc-model/data/2022-h80-data.xlsx')

('/Users/katialopes-gilbert/repos/springboard-projects/capstone-project-fqhc-model/data/2022-h80-data.xlsx',
 <http.client.HTTPMessage at 0x12972c5d0>)
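
The absolute path above only resolves on one machine. A minimal sketch of a more portable download step, assuming the data lives in a `data/` folder relative to the working directory. The `DATA_DIR` constant and the `fetch_if_missing` helper are hypothetical names and are not part of the original notebook:

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: data is kept in a "data" folder relative to the working directory.
DATA_DIR = Path("data")


def fetch_if_missing(url, dest):
    """Download url to dest unless dest already exists; return dest as a Path."""
    dest = Path(dest)
    if not dest.exists():
        # create the parent folder on first use
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))
    return dest
```

Calling `fetch_if_missing(url, DATA_DIR / '2022-h80-data.xlsx')` would then download the file once and reuse the local copy on later runs.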

In [3]:
# load every sheet into a dictionary of dataframes keyed by sheet name (sheet_name=None)
df = pd.read_excel('/Users/katialopes-gilbert/repos/springboard-projects/capstone-project-fqhc-model/data/2022-h80-data.xlsx', sheet_name=None)

## Some key information about the 2022 UDS dataset:

**How this data is collected:** 
Data is collected through HRSA's Uniform Data System (UDS) report, which health center grantees must complete annually. In 2022, 1,370 entities filed a UDS report.

**How missing values are represented:**
1. "-" means the health center entered no data
2. "--" means a patient count from 1 to 15 was suppressed to protect patient privacy
3. "---" means confidential health center data was suppressed
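
Because the three markers mean different things, it may help to keep them distinct rather than collapsing them all into a single missing value right away. A minimal sketch in plain Python, assuming each cell arrives as a string or a number. `MISSING_MARKERS` and `parse_count` are hypothetical helpers, not part of HRSA's documentation or the original notebook:

```python
# Meaning of each placeholder, taken from the list above.
MISSING_MARKERS = {
    "-": "not_reported",               # no data entry by the health center
    "--": "suppressed_small_count",    # patient count between 1 and 15
    "---": "suppressed_confidential",  # confidential health center data
}


def parse_count(value):
    """Return (number, reason); number is None when the cell holds a marker."""
    text = str(value).strip()
    if text in MISSING_MARKERS:
        return None, MISSING_MARKERS[text]
    return float(text), None
```

Note that a "--" cell is not a zero: the true count lies somewhere from 1 to 15, so treating it as 0 or dropping the row may bias totals.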

In [4]:
df.keys()



In [5]:
# load sheets of interest into separate dataframes

health_centers = df['HealthCenterInfo']
health_center_sites = df['HealthCenterSiteInfo']
health_center_funding = df['Table9E']
health_center_zipcodes = df['HealthCenterZipCodes']
personnel_and_visits = df['Table5']
patients_age = df['Table3A']
patients_race = df['Table3B']
patients_other_demographics = df['Table4']
patient_services_revenue = df['Table9D']

In [6]:
health_centers.head(3)

Unnamed: 0,BHCMISID,GrantNumber,ReportingYear,HealthCenterName,HealthCenterStreetAddress,HealthCenterOtherAddress,HealthCenterCity,HealthCenterState,HealthCenterZIPCode,ProjectDirector,ProjectDirectorPhone,ProjectDirectorPhoneExt,ProjectDirectorFax,ProjectDirectorEmail,FundingCHC,FundingMHC,FundingHO,FundingPH,UrbanRuralFlag
0,10030,H80CS00803,2022,"HOLYOKE HEALTH CENTER, INC.",230 MAPLE ST,-,Holyoke,MA,1040,Alejandro Esparza Perez,(413)420-2175,-,-,alejandro.esparza@hhcinc.org,True,False,False,False,Urban
1,10040,H80CS00443,2022,MAINE MOBILE HEALTH PROGRAM INC.,9 GREEN ST STE 1,-,Augusta,ME,4330,Carol Murphy,(917)209-3777,-,-,cmurphy@mainemobile.org,False,True,False,False,Rural
2,10060,H80CS00741,2022,"FAIR HAVEN COMMUNITY HEALTH CLINIC, INC.",374 GRAND AVE,-,New Haven,CT,6513,Suzanne Lagarde,(203)752-5129,-,(203)777-8506,s.lagarde@fhchc.org,True,False,False,False,Urban


In [7]:
health_center_sites.head(3)

Unnamed: 0,BHCMISID,GrantNumber,HealthCenterName,SiteName,SiteType,SiteStatus,LocationType,LocationSetting,OperationalSchedule,CalendarSchedule,...,SiteCity,SiteState,SiteZIPCode,MailingStreetAddress,MailingCity,MailingState,MailingZIPCode,MedicaidNumber,MedicaidPharmNumber,DataAsof
0,10030,H80CS00803,"HOLYOKE HEALTH CENTER, INC.",CHICOPEE HEALTH CENTER,Service Delivery Site,Active,Permanent,All Other Clinic Types,Full-Time,Year-Round,...,Chicopee,MA,01013-3140,505-Front St,Chicopee,MA,01013-3140,1320874,401480,12/31/2022 11:59 PM EST
1,10030,H80CS00803,"HOLYOKE HEALTH CENTER, INC.","HOLYOKE HEALTH CENTER, INC.",Service Delivery Site,Active,Permanent,All Other Clinic Types,Full-Time,Year-Round,...,Holyoke,MA,01040-5144,230-Maple St,Holyoke,MA,01040-5144,1300237,401480,12/31/2022 11:59 PM EST
2,10030,H80CS00803,"HOLYOKE HEALTH CENTER, INC.",Holyoke Soldier Home,Service Delivery Site,Active,Permanent,All Other Clinic Types,Full-Time,Year-Round,...,Holyoke,MA,01040-7002,-,-,-,-,1300237,401480,12/31/2022 11:59 PM EST


In [8]:
health_center_funding.head(3)

Unnamed: 0,BHCMISID,GrantNumber,T9E_L1a_Ca,T9E_L1b_Ca,T9E_L1c_Ca,T9E_L1e_Ca,T9E_L1g_Ca,T9E_L1k_Ca,T9e_L1l_Ca,T9e_L1m_Ca,...,T9E_L6a_Other,T9E_L6a_Ca,T9E_L7_Other,T9E_L7_Ca,T9E_L8_Other,T9E_L8_Ca,T9E_L9_Ca,T9E_L10_Other,T9E_L10_Ca,T9E_L11_Ca
0,,,Migrant Health Center-Amount (a),Community Health Center-Amount (a),Health Care for the Homeless-Amount (a),Public Housing Primary Care-Amount (a),Total Health Center (Sum of Lines 1a through 1...,"Capital Development Grants, including School-B...",Coronavirus Preparedness and Response Suppleme...,"Coronavirus Aid, Relief, and Economic Security...",...,State/Local Indigent Care Programs-Source,State/Local Indigent Care Programs-Amount (a),Local Government Grants and Contracts-Source,Local Government Grants and Contracts-Amount (a),Foundation/Private Grants and Contracts-Source,Foundation/Private Grants and Contracts-Amount...,Total Non-Federal Grants and Contracts (Sum of...,Other Revenue (non-patient service revenue not...,Other Revenue (non-patient service revenue not...,Total Revenue (Sum of Lines 1 + 5 + 9 + 10)-Am...
1,10030.0,H80CS00803,0,5721128,0,0,5721128,0,0,0,...,HSN,1442182,-,0,"MA League - CHWs, La Linda Manita, Project Bre...",764680,5122518,"Rental Income from tenants,\nInterest Income, ...",14258919,28330029
2,10040.0,H80CS00443,1758567,-,-,-,1758567,-,1256,-,...,-,-,-,-,MeHAF Advocacy Grant,25000,25000,"Interest $703; Other Income $33,875; Donations...",36578,2320228


In [9]:
health_center_zipcodes.head(3)

Unnamed: 0,BHCMISID,GrantNumber,ReportingYear,ZipCode,ZipCodeType,None_UninsuredPatients,Medicaid_CHIP_OtherPublicPatients,MedicarePatients,PrivatePatients,TotalNumberofPatients
0,10030,H80CS00803,2022,1011,ZipCode,--,--,0,--,--
1,10030,H80CS00803,2022,1013,ZipCode,61,1346,385,182,1974
2,10030,H80CS00803,2022,1014,ZipCode,0,--,--,0,21


In [10]:
personnel_and_visits.head(3)

Unnamed: 0,BHCMISID,GrantNumber,T5_L1_Ca,T5_L1_Cb,T5_L1_Cb2,T5_L2_Ca,T5_L2_Cb,T5_L2_Cb2,T5_L3_Ca,T5_L3_Cb,...,T5_L21f_Cb2,T5_L21f_Cc,T5_L21g_Ca1,T5_L21g_Cb,T5_L21g_Cb2,T5_L21g_Cc,T5_L21h_Ca1,T5_L21h_Cb,T5_L21h_Cb2,T5_L21h_Cc
0,,,Family Physicians-FTEs (a),Family Physicians-Clinic Visits (b),Family Physicians-Virtual Visits (b2),General Practitioners-FTEs (a),General Practitioners-Clinic Visits (b),General Practitioners-Virtual Visits (b2),Internists-FTEs (a),Internists-Clinic Visits (b),...,Licensed Clinical Psychologists-Virtual Visits...,Licensed Clinical Psychologists-Patients (c),Licensed Clinical Social Workers-Personnel (a1),Licensed Clinical Social Workers-Clinic Visits...,Licensed Clinical Social Workers-Virtual Visit...,Licensed Clinical Social Workers-Patients (c),Other Licensed Mental Health Providers-Personn...,Other Licensed Mental Health Providers-Clinic ...,Other Licensed Mental Health Providers-Virtual...,Other Licensed Mental Health Providers-Patient...
1,---,---,---,---,---,---,---,---,---,---,...,---,---,---,---,---,---,---,---,---,---
2,010040,H80CS00443,0.92,1013,3,0,0,0,0.1,16,...,-,-,2,6,10,7,-,-,-,-


In [11]:
patients_age.head(3)

Unnamed: 0,BHCMISID,GrantNumber,T3a_L1_Ca,T3a_L1_Cb,T3a_L2_Ca,T3a_L2_Cb,T3a_L3_Ca,T3a_L3_Cb,T3a_L4_Ca,T3a_L4_Cb,...,T3a_L35_Ca,T3a_L35_Cb,T3a_L36_Ca,T3a_L36_Cb,T3a_L37_Ca,T3a_L37_Cb,T3a_L38_Ca,T3a_L38_Cb,T3a_L39_Ca,T3a_L39_Cb
0,,,Under age 1-Male Patients (a),Under age 1-Female Patients (b),Age 1-Male Patients (a),Age 1-Female Patients (b),Age 2-Male Patients (a),Age 2-Female Patients (b),Age 3-Male Patients (a),Age 3-Female Patients (b),...,Ages 70–74-Male Patients (a),Ages 70–74-Female Patients (b),Ages 75–79-Male Patients (a),Ages 75–79-Female Patients (b),Ages 80–84-Male Patients (a),Ages 80–84-Female Patients (b),Age 85 and over-Male Patients (a),Age 85 and over-Female Patients (b),Total Patients (Sum of Lines 1-38)-Male Patien...,Total Patients (Sum of Lines 1-38)-Female Pati...
1,10030.0,H80CS00803,81,69,104,99,112,113,126,120,...,364,411,226,294,141,161,126,121,8821,10323
2,10040.0,H80CS00443,--,--,--,0,--,--,--,--,...,--,--,--,--,0,--,0,0,609,241


In [12]:
patients_race.head(3)

Unnamed: 0,BHCMISID,GrantNumber,T3b_L1_Ca,T3b_L1_Cb,T3b_L1_Cd,T3b_L2a_Ca,T3b_L2a_Cb,T3b_L2a_Cd,T3b_L2b_Ca,T3b_L2b_Cb,...,T3b_L18a_Ca,T3b_L19_Ca,T3b_L20_Ca,T3b_L21_Ca,T3b_L22_Ca,T3b_L23_Ca,T3b_L24_Ca,T3b_L25_Ca,T3b_L25a_Ca,T3b_L26_Ca
0,,,Asian-Hispanic or Latino/a (a),Asian-Non-Hispanic or Latino/a (b),Asian-Total (d) (Sum Columns a+b+c),Native Hawaiian-Hispanic or Latino/a (a),Native Hawaiian-Non-Hispanic or Latino/a (b),Native Hawaiian-Total (d) (Sum Columns a+b+c),Other Pacific Islander-Hispanic or Latino/a (a),Other Pacific Islander-Non-Hispanic or Latino/...,...,Unknown-Number (a),Total Patients (Sum of Lines 13 to 18a)-Number...,Male-Number (a),Female-Number (a),Transgender Man/Transgender Male/Transmasculin...,Transgender Woman/Transgender Female/Transfemi...,Other-Number (a),Chose not to disclose-Number (a),Unknown-Number (a),Total Patients (Sum of Lines 20 to 25a)-Number...
1,10030.0,H80CS00803,--,148,158,215,17,232,--,--,...,51,19144,7744,9138,33,28,64,2109,28,19144
2,10040.0,H80CS00443,0,0,0,0,0,0,0,--,...,0,850,609,241,0,0,0,0,0,850


In [13]:
patients_other_demographics.head(3)

Unnamed: 0,BHCMISID,GrantNumber,T4_L1_Ca,T4_L2_Ca,T4_L3_Ca,T4_L4_Ca,T4_L5_Ca,T4_L6_Ca,T4_L7_Ca,T4_L7_Cb,...,T4_L18_Ca,T4_L19_Ca,T4_L20_Ca,T4_L21a_Ca,T4_L21_Ca,T4_L22_Ca,T4_L23_Ca,T4_L24_Ca,T4_L25_Ca,T4_L26_Ca
0,,,100% and below-Number of Patients (a),101–150%-Number of Patients (a),151–200%-Number of Patients (a),Over 200%-Number of Patients (a),Unknown-Number of Patients (a),TOTAL (Sum of Lines 1–5)-Number of Patients (a),None/Uninsured-0-17 years old (a),None/Uninsured-18 and older (b),...,Transitional (330h awardees only)-Number of Pa...,Doubling Up (330h awardees only)-Number of Pat...,Street (330h awardees only)-Number of Patients...,Permanent Supportive Housing (330h awardees on...,Other (330h awardees only)-Number of Patients (a),Unknown (330h awardees only)-Number of Patient...,Total Homeless (All health centers report this...,Total School-Based Service Site Patients (All ...,Total Veterans (All health centers report this...,Total Patients Served at a Health Center Locat...
1,10030.0,H80CS00803,2706,184,103,168,15983,19144,104,609,...,-,-,-,-,-,-,2998,0,332,19144
2,10040.0,H80CS00443,768,68,--,--,0,850,35,734,...,-,-,-,-,-,-,--,0,--,0


In [14]:
patient_services_revenue.head(3)

Unnamed: 0,BHCMISID,GrantNumber,T9D_L1_Ca,T9D_L1_Cb,T9D_L1_Cc1,T9D_L1_Cc2,T9D_L1_Cc3,T9D_L1_Cc4,T9D_L1_Cd,T9D_L2a_Ca,...,T9D_L13_Cf,T9D_L14_Ca,T9D_L14_Cb,T9D_L14_Cc1,T9D_L14_Cc2,T9D_L14_Cc3,T9D_L14_Cc4,T9D_L14_Cd,T9D_L14_Ce,T9D_L14_Cf
0,,,Medicaid Non-Managed Care-Full Charges This Pe...,Medicaid Non-Managed Care-Amount Collected Thi...,Medicaid Non-Managed Care-Collection of Reconc...,Medicaid Non-Managed Care-Collection of Reconc...,Medicaid Non-Managed Care-Collection of Other ...,Medicaid Non-Managed Care-Penalty/Payback (c4),Medicaid Non-Managed Care-Adjustments (d),Medicaid Managed Care (capitated)-Full Charges...,...,Self-Pay-Bad Debt Write-Off (f),TOTAL (Sum of Lines 3 + 6 + 9 + 12 + 13)-Full ...,TOTAL (Sum of Lines 3 + 6 + 9 + 12 + 13)-Amoun...,TOTAL (Sum of Lines 3 + 6 + 9 + 12 + 13)-Colle...,TOTAL (Sum of Lines 3 + 6 + 9 + 12 + 13)-Colle...,TOTAL (Sum of Lines 3 + 6 + 9 + 12 + 13)-Colle...,TOTAL (Sum of Lines 3 + 6 + 9 + 12 + 13)-Penal...,TOTAL (Sum of Lines 3 + 6 + 9 + 12 + 13)-Adjus...,TOTAL (Sum of Lines 3 + 6 + 9 + 12 + 13)-Slidi...,TOTAL (Sum of Lines 3 + 6 + 9 + 12 + 13)-Bad D...
1,---,---,---,---,---,---,---,---,---,---,...,---,---,---,---,---,---,---,---,---,---
2,010040,H80CS00443,4963,151,-,-,-,-,99,-,...,1235,636869,2724,-,-,-,-,184,617158,1235


# Data Cleaning

Write up an initial overview of how data will need to be cleaned here.

Several of the dataframes need their columns renamed. The current names are abbreviations that are hard to interpret without looking each one up in HRSA's reference. I'll create a function that keeps the first two original column names and replaces every other column name with the value in the first row.

In [22]:
def rename_columns(df):
    """Renames dataframe columns in place by preserving the first two column
       names and using the first row's values as the names for all columns
       after the 2nd column. Returns the same (modified) dataframe."""
    
    # save the first two column names
    original_columns = df.columns[:2]
    # create new column names by combining saved columns + first row
    new_column_names = list(original_columns) + df.iloc[0, 2:].tolist()
    # rename columns
    df.columns = new_column_names
    # drop redundant first row; drop(..., inplace=True) returns None,
    # so its result must not be assigned back to df
    df.drop(index=0, inplace=True)
    
    return df

In [23]:
dataframes_to_rename = [health_center_funding, patient_services_revenue, personnel_and_visits, 
                        patients_age, patients_race, patients_other_demographics]

renamed_dataframes = [rename_columns(df) for df in dataframes_to_rename]

In [24]:
health_center_funding.head(2)

Unnamed: 0,BHCMISID,GrantNumber,Migrant Health Center-Amount (a),Community Health Center-Amount (a),Health Care for the Homeless-Amount (a),Public Housing Primary Care-Amount (a),Total Health Center (Sum of Lines 1a through 1e)-Amount (a),"Capital Development Grants, including School-Based Service Site Capital Grants-Amount (a)",Coronavirus Preparedness and Response Supplemental Appropriations Act (H8C)-Amount (a),"Coronavirus Aid, Relief, and Economic Security Act (CARES) (H8D)-Amount (a)",...,State/Local Indigent Care Programs-Source,State/Local Indigent Care Programs-Amount (a),Local Government Grants and Contracts-Source,Local Government Grants and Contracts-Amount (a),Foundation/Private Grants and Contracts-Source,Foundation/Private Grants and Contracts-Amount (a),Total Non-Federal Grants and Contracts (Sum of Lines 6 + 6a + 7 + 8)-Amount (a),Other Revenue (non-patient service revenue not reported elsewhere)-Source,Other Revenue (non-patient service revenue not reported elsewhere)-Amount (a),Total Revenue (Sum of Lines 1 + 5 + 9 + 10)-Amount (a)
1,10030,H80CS00803,0,5721128,0,0,5721128,0,0,0,...,HSN,1442182,-,0,"MA League - CHWs, La Linda Manita, Project Bre...",764680,5122518,"Rental Income from tenants,\nInterest Income, ...",14258919,28330029
2,10040,H80CS00443,1758567,-,-,-,1758567,-,1256,-,...,-,-,-,-,MeHAF Advocacy Grant,25000,25000,"Interest $703; Other Income $33,875; Donations...",36578,2320228


In [25]:
personnel_and_visits.head(2)

Unnamed: 0,BHCMISID,GrantNumber,Family Physicians-FTEs (a),Family Physicians-Clinic Visits (b),Family Physicians-Virtual Visits (b2),General Practitioners-FTEs (a),General Practitioners-Clinic Visits (b),General Practitioners-Virtual Visits (b2),Internists-FTEs (a),Internists-Clinic Visits (b),...,Licensed Clinical Psychologists-Virtual Visits (b2),Licensed Clinical Psychologists-Patients (c),Licensed Clinical Social Workers-Personnel (a1),Licensed Clinical Social Workers-Clinic Visits (b),Licensed Clinical Social Workers-Virtual Visits (b2),Licensed Clinical Social Workers-Patients (c),Other Licensed Mental Health Providers-Personnel (a1),Other Licensed Mental Health Providers-Clinic Visits (b),Other Licensed Mental Health Providers-Virtual Visits (b2),Other Licensed Mental Health Providers-Patients (c)
1,---,---,---,---,---,---,---,---,---,---,...,---,---,---,---,---,---,---,---,---,---
2,010040,H80CS00443,0.92,1013,3,0,0,0,0.1,16,...,-,-,2,6,10,7,-,-,-,-


In [39]:
dataframes_dict = {'health_centers': health_centers, 
              'health_center_sites': health_center_sites, 
              'health_center_funding': health_center_funding, 
              'health_center_zipcodes':health_center_zipcodes,
              'patient_services_revenue': patient_services_revenue, 
              'personnel_and_visits': personnel_and_visits, 
              'patients_age': patients_age, 
              'patients_race': patients_race, 
              'patients_other_demographics': patients_other_demographics}

In [49]:
def dataframe_summary(dataframe_dict, key):
    """
    Prints an overview of a dataframe's structure and columns.
    
    Parameters:
    - dataframe_dict: Dict[str, pd.DataFrame], a dictionary of DataFrames.
    - key: str, the key for the DataFrame to process.

    Returns:
    None. Prints the dataframe's shape, column names, 
        and number of non-null values in each column."""
    # access the df with its key
    df = dataframe_dict[key]

    # print relevant information about the df
    print(f'The {key} dataframe has a shape of {df.shape}.')
    print()
    print(f'The {key} dataframe has the following columns and number of values: ')
    # df.info() prints its own report and returns None, so it is not wrapped in print()
    df.info(verbose=True)
    print()
    print('-----------------------------------')

In [50]:
for key in dataframes_dict.keys():
    dataframe_summary(dataframes_dict, key)

The health_centers dataframe has a shape of (1370, 19).

The health_centers dataframe has the following columns and number of values: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1370 entries, 0 to 1369
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   BHCMISID                   1370 non-null   object
 1   GrantNumber                1370 non-null   object
 2   ReportingYear              1370 non-null   int64 
 3   HealthCenterName           1370 non-null   object
 4   HealthCenterStreetAddress  1370 non-null   object
 5   HealthCenterOtherAddress   1370 non-null   object
 6   HealthCenterCity           1370 non-null   object
 7   HealthCenterState          1370 non-null   object
 8   HealthCenterZIPCode        1370 non-null   object
 9   ProjectDirector            1370 non-null   object
 10  ProjectDirectorPhone       1370 non-null   object
 11  ProjectDirectorPhoneExt    1370 non-nu

I won't need every column in every dataframe, and some of the dataframes need to be combined for easier analysis. Below I'll subset each dataframe to the columns of interest and then combine them before doing exploratory data analysis.