In [5]:
import pandas as pd
pd.set_option('display.max_columns', None)

The datasets for 2018 and 2019 can be retrieved from: https://ffiec.cfpb.gov/data-browser/

The following parameters have been selected in the HMDA data browser prior to downloading CSV files:
1. States: Nationwide
2. Financial institutions: All
3. Filters:
    A. Action Taken: Options 1, 2, and 3 have been selected from the following options:
    
        1 - Loan Originated
        2 - Application approved but not accepted
        3 - Application denied
        4 - Application withdrawn by applicant
        5 - File closed for incompleteness
        6 - Purchased loan
        7 - Preapproval request denied
        8 - Preapproval request approved but not accepted
        
    B. Loan Purpose: Option 1 has been selected from the following options:
        
        1 - Home Purchase
        2 - Home Improvement
        31 - Refinancing
        32 - Cash Out Refinancing
        4 - Other Purpose
        5 - Not Applicable

This amounts to 5,191,043 rows for 2018 and 5,336,491 rows for 2019.
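
Once a file has been downloaded, a quick check can confirm that the filters were applied as described. The sketch below is illustrative only: the helper name `filters_hold` and the tiny sample frame are hypothetical, and it assumes that `action_taken` and `loan_purpose` are read as integers.

```python
import pandas as pd

# Codes selected in the data browser, as listed above.
EXPECTED_ACTIONS = {1, 2, 3}
EXPECTED_PURPOSES = {1}

def filters_hold(frame: pd.DataFrame) -> bool:
    """Return True if every row carries one of the selected codes."""
    actions_ok = frame["action_taken"].isin(EXPECTED_ACTIONS).all()
    purposes_ok = frame["loan_purpose"].isin(EXPECTED_PURPOSES).all()
    return bool(actions_ok and purposes_ok)

# Hypothetical stand-in for a downloaded file.
sample = pd.DataFrame({"action_taken": [1, 3, 2], "loan_purpose": [1, 1, 1]})
print(filters_hold(sample))  # True

# The two yearly row counts quoted above, combined.
total_rows = 5_191_043 + 5_336_491
print(total_rows)  # 10527534
```

The combined count matches the number of entries reported by df.info() after both years are loaded.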

In [6]:
#Importing CSV to dataframe for year 2018
filename = '/Users/agm/Desktop/springboard/Capstone 2/Data/2018.csv'
df = pd.read_csv(filename,low_memory=False)

In [7]:
#Appending 2019 CSV to dataframe (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
filename = '/Users/agm/Desktop/springboard/Capstone 2/Data/2019.csv'
df = pd.concat([df, pd.read_csv(filename,low_memory=False)],ignore_index=True)
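
With roughly 10.5 million rows and 99 columns, memory use can become a concern. One possible variation, sketched below, reads only the columns of interest through the `usecols` parameter of `pd.read_csv` and combines the yearly files in a single `pd.concat` call. The helper name `load_years` and the in-memory CSV text are hypothetical stand-ins for the real files.

```python
import io
import pandas as pd

def load_years(sources, usecols=None):
    """Read each yearly CSV and stack the results with a fresh index."""
    frames = [pd.read_csv(src, usecols=usecols, low_memory=False) for src in sources]
    return pd.concat(frames, ignore_index=True)

# Hypothetical miniature files; the real inputs would be the 2018 and 2019 CSVs.
csv_2018 = io.StringIO("activity_year,lei,action_taken\n2018,AAA,1\n2018,BBB,3\n")
csv_2019 = io.StringIO("activity_year,lei,action_taken\n2019,CCC,2\n")

combined = load_years([csv_2018, csv_2019], usecols=["activity_year", "action_taken"])
print(combined.shape)  # (3, 2)
```

Whether this is worthwhile depends on whether the full set of columns is still needed for the review that follows.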

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10527534 entries, 0 to 10527533
Data columns (total 99 columns):
 #   Column                                    Dtype  
---  ------                                    -----  
 0   activity_year                             int64  
 1   lei                                       object 
 2   derived_msa-md                            int64  
 3   state_code                                object 
 4   county_code                               object 
 5   census_tract                              object 
 6   conforming_loan_limit                     object 
 7   derived_loan_product_type                 object 
 8   derived_dwelling_category                 object 
 9   derived_ethnicity                         object 
 10  derived_race                              object 
 11  derived_sex                               object 
 12  action_taken                              int64  
 13  purchaser_type                            int64  
 14  

The following steps have been taken to prepare data for analysis:

1. Each column is reviewed for relevance and redundancy to decide whether it should be kept or dropped.

In [9]:
#Dropping columns: derived_msa-md is redundant here as a geographic referent
df.drop(columns=['derived_msa-md'],inplace=True)

In [10]:
#Dropping columns: census_tract is deemed too granular a geographic referent
df.drop(columns=['census_tract'],inplace=True)

0 - activity_year: Kept; 2018 or 2019

1 - lei: Kept; financial institution code--5772 unique values

2 - derived_msa-md: Dropped; redundant as geographic referent

3 - state_code: Kept

4 - county_code: Kept

5 - census_tract: Dropped; too granular as geographic referent

6 - conforming_loan_limit: Kept; while 95% of the applications are conforming, this is an important factor

7 - derived_loan_product_type: Dropped; this is redundant as an aggregate of other columns

8 - derived_dwelling_category: Dropped; this is redundant as an aggregate of other columns

9 - derived_ethnicity: Dropped; this is redundant as an aggregate of other columns

10 - derived_race: Dropped; this is redundant as an aggregate of other columns

11 - derived_sex: Dropped; this is redundant as an aggregate of other columns

12 - action_taken: Kept; this is the target feature

13 - purchaser_type: Dropped; not especially relevant

14 - preapproval: Dropped; for about 95% of loans preapproval was not requested

15 - loan_type: Kept; an important feature

16 - loan_purpose: Dropped; this dataset had been previously filtered to select only one type: home purchase

17 - lien_status: Kept; will be filtered; roughly 99% of the loans are type 1 (secured by a first lien)

18 - reverse_mortgage: Kept; will be filtered; roughly 99% of the loans are type 2 (not a reverse mortgage)

19 - open-end_line_of_credit: Kept; will be filtered; roughly 99% of these are type 2 (not an open-end credit line)

20 - business_or_commercial_purpose: Kept; the set will be filtered for the 97% type 2 (not business/commercial)

21 - loan_amount: Kept; this is a key feature

22 - loan_to_value_ratio: Kept; this is a key feature

23 - interest_rate: Kept; this is a key feature

24 - rate_spread: Dropped; this feature is not especially relevant for the matter at hand

25 - hoepa_status: Dropped; this feature is not especially relevant for the matter at hand

26 - total_loan_costs: Dropped; this feature is not especially relevant for the matter at hand

27 - total_points_and_fees: Dropped; this feature is not especially relevant for the matter at hand

28 - origination_charges: Dropped; this feature is not especially relevant for the matter at hand

29 - discount_points: Dropped; this feature is not especially relevant for the matter at hand    

30 - lender_credits: Dropped; this feature is not especially relevant for the matter at hand

31 - loan_term: Dropped; roughly 85% of loans are for 30 years; too much variability in other options

32 - prepayment_penalty_term: Dropped; this feature is not especially relevant for the matter at hand

33 - intro_rate_period: Dropped; this feature is not especially relevant for the matter at hand

34 - negative_amortization: Dropped; this feature is not especially relevant for the matter at hand

35 - interest_only_payment: Dropped; this feature is not especially relevant for the matter at hand

36 - balloon_payment: Dropped; this feature is not especially relevant for the matter at hand

37 - other_nonamortizing_features: Dropped; this feature is not especially relevant for the matter at hand

38 - property_value: Kept; this is a key feature

39 - construction_method: Kept; to be filtered; roughly 98% is type 1, site-built

40 - occupancy_type: Kept; despite roughly 93% being type 1 (principal residence), this is a relevant factor

41 - manufactured_home_secured_property_type: Dropped; this feature is not especially relevant for the matter at hand

42 - manufactured_home_land_property_interest: Dropped; this feature is not especially relevant for the matter at hand

43 - total_units: Kept; to be filtered; roughly 99% is type 1, 1-unit

44 - multifamily_affordable_units: Dropped; not especially relevant to the matter at hand

45 - income: Kept; this is a key feature

46 - debt_to_income_ratio: Kept; this is a key feature

47 - applicant_credit_score_type: Dropped; not especially relevant to the matter at hand

48 - co-applicant_credit_score_type: Dropped; not especially relevant to the matter at hand

49 - applicant_ethnicity-1: Kept; this is a key feature

50 - applicant_ethnicity-2: Dropped; this feature is not especially relevant for the matter at hand

51 - applicant_ethnicity-3: Dropped; this feature is not especially relevant for the matter at hand

52 - applicant_ethnicity-4: Dropped; this feature is not especially relevant for the matter at hand

53 - applicant_ethnicity-5: Dropped; this feature is not especially relevant for the matter at hand

54 - co-applicant_ethnicity-1: Dropped; this feature is not especially relevant for the matter at hand

55 - co-applicant_ethnicity-2: Dropped; this feature is not especially relevant for the matter at hand

56 - co-applicant_ethnicity-3: Dropped; this feature is not especially relevant for the matter at hand

57 - co-applicant_ethnicity-4: Dropped; this feature is not especially relevant for the matter at hand

58 - co-applicant_ethnicity-5: Dropped; this feature is not especially relevant for the matter at hand

59 - applicant_ethnicity_observed: Dropped; this feature is not especially relevant for the matter at hand

60 - co-applicant_ethnicity_observed: Dropped; this feature is not especially relevant for the matter at hand

61 - applicant_race-1: Kept; this is a key feature

62 - applicant_race-2: Dropped; this feature is not especially relevant for the matter at hand

63 - applicant_race-3: Dropped; this feature is not especially relevant for the matter at hand

64 - applicant_race-4: Dropped; this feature is not especially relevant for the matter at hand

65 - applicant_race-5: Dropped; this feature is not especially relevant for the matter at hand

66 - co-applicant_race-1: Dropped; this feature is not especially relevant for the matter at hand

67 - co-applicant_race-2: Dropped; this feature is not especially relevant for the matter at hand

68 - co-applicant_race-3: Dropped; this feature is not especially relevant for the matter at hand

69 - co-applicant_race-4: Dropped; this feature is not especially relevant for the matter at hand

70 - co-applicant_race-5: Dropped; this feature is not especially relevant for the matter at hand

71 - applicant_race_observed: Dropped; this feature is not especially relevant for the matter at hand

72 - co-applicant_race_observed: Dropped; this feature is not especially relevant for the matter at hand

73 - applicant_sex: Kept; this is a key feature

74 - co-applicant_sex: Dropped; this feature is not especially relevant for the matter at hand 

75 - applicant_sex_observed: Dropped; this feature is not especially relevant for the matter at hand

76 - co-applicant_sex_observed: Dropped; this feature is not especially relevant for the matter at hand

77 - applicant_age: Kept; this is a key feature

78 - co-applicant_age: Dropped; this feature is not especially relevant for the matter at hand

79 - applicant_age_above_62: Dropped; this feature is not especially relevant for the matter at hand

80 - co-applicant_age_above_62: Dropped; this feature is not especially relevant for the matter at hand

81 - submission_of_application: Kept; seems relevant

82 - initially_payable_to_institution: Dropped; this feature is not especially relevant for the matter at hand

83 - aus-1: Dropped; this feature is not especially relevant for the matter at hand

84 - aus-2: Dropped; this feature is not especially relevant for the matter at hand

85 - aus-3: Dropped; this feature is not especially relevant for the matter at hand

86 - aus-4: Dropped; this feature is not especially relevant for the matter at hand

87 - aus-5: Dropped; this feature is not especially relevant for the matter at hand

88 - denial_reason-1: Dropped; this feature is not especially relevant for the matter at hand

89 - denial_reason-2: Dropped; this feature is not especially relevant for the matter at hand

90 - denial_reason-3: Dropped; this feature is not especially relevant for the matter at hand

91 - denial_reason-4: Dropped; this feature is not especially relevant for the matter at hand

92 - tract_population: Dropped; this feature is not especially relevant for the matter at hand

93 - tract_minority_population_percent: Kept; this seems relevant

94 - ffiec_msa_md_median_family_income: Dropped; this feature is not especially relevant for the matter at hand

95 - tract_to_msa_income_percentage: Kept; this seems relevant

96 - tract_owner_occupied_units: Dropped; this feature is not especially relevant for the matter at hand

97 - tract_one_to_four_family_homes: Dropped; this feature is not especially relevant for the matter at hand

98 - tract_median_age_of_housing_units: Dropped; this feature is not especially relevant for the matter at hand
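
Several of the decisions above rest on how concentrated a column is, for example "roughly 99% of the loans are type 1". A minimal sketch of how such shares could be computed is shown below. The helper name `dominant_share` and the sample frame are hypothetical, and this is not necessarily how the percentages above were obtained.

```python
import pandas as pd

def dominant_share(frame: pd.DataFrame) -> pd.DataFrame:
    """For each column, report its most frequent value and that value's share of rows."""
    rows = []
    for col in frame.columns:
        # value_counts sorts by frequency, so the first entry is the most common value.
        counts = frame[col].value_counts(normalize=True, dropna=False)
        rows.append({
            "column": col,
            "top_value": counts.index[0],
            "share": counts.iloc[0],
            "n_unique": frame[col].nunique(dropna=False),
        })
    return pd.DataFrame(rows).set_index("column")

# Hypothetical sample with one concentrated and one more varied column.
sample = pd.DataFrame({"lien_status": [1, 1, 1, 2], "loan_type": [1, 2, 3, 1]})
summary = dominant_share(sample)
print(summary)
```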

In [11]:
#Dropping columns specified above
df.drop(columns=['derived_loan_product_type', 'derived_dwelling_category', 'derived_ethnicity', 'derived_race',
                 'derived_sex', 'purchaser_type', 'preapproval', 'loan_purpose', 'rate_spread', 'hoepa_status', 
                 'total_loan_costs', 'total_points_and_fees', 'origination_charges', 'discount_points', 
                 'lender_credits', 'loan_term', 'prepayment_penalty_term', 'intro_rate_period', 
                 'negative_amortization', 'interest_only_payment', 'balloon_payment', 'other_nonamortizing_features', 
                 'manufactured_home_secured_property_type', 'manufactured_home_land_property_interest', 
                 'multifamily_affordable_units', 'applicant_credit_score_type', 'co-applicant_credit_score_type', 
                 'applicant_ethnicity-2', 'applicant_ethnicity-3', 'applicant_ethnicity-4', 'applicant_ethnicity-5', 
                 'co-applicant_ethnicity-1', 'co-applicant_ethnicity-2', 'co-applicant_ethnicity-3', 
                 'co-applicant_ethnicity-4', 'co-applicant_ethnicity-5', 'applicant_ethnicity_observed', 
                 'co-applicant_ethnicity_observed', 'applicant_race-2', 'applicant_race-3', 
                 'applicant_race-4', 'applicant_race-5', 'co-applicant_race-1', 'co-applicant_race-2', 
                 'co-applicant_race-3', 'co-applicant_race-4', 'co-applicant_race-5', 'applicant_race_observed', 
                 'co-applicant_race_observed', 'co-applicant_sex', 'applicant_sex_observed', 
                 'co-applicant_sex_observed', 'co-applicant_age', 'applicant_age_above_62', 
                 'co-applicant_age_above_62', 'initially_payable_to_institution', 'aus-1', 'aus-2', 'aus-3', 'aus-4', 
                 'aus-5', 'denial_reason-1', 'denial_reason-2', 'denial_reason-3', 'denial_reason-4', 
                 'tract_population', 'ffiec_msa_md_median_family_income', 'tract_owner_occupied_units', 
                 'tract_one_to_four_family_homes', 'tract_median_age_of_housing_units'],inplace=True)
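
An alternative to a long drop list is to select the columns marked "Kept" in the review above, which makes it easier to see what remains. The sketch below assumes that approach. The names `KEEP_COLUMNS` and `select_columns` are hypothetical, and the demo frame is an empty stand-in that carries only column names.

```python
import pandas as pd

# The 27 columns marked "Kept" in the column review.
KEEP_COLUMNS = [
    "activity_year", "lei", "state_code", "county_code", "conforming_loan_limit",
    "action_taken", "loan_type", "lien_status", "reverse_mortgage",
    "open-end_line_of_credit", "business_or_commercial_purpose", "loan_amount",
    "loan_to_value_ratio", "interest_rate", "property_value", "construction_method",
    "occupancy_type", "total_units", "income", "debt_to_income_ratio",
    "applicant_ethnicity-1", "applicant_race-1", "applicant_sex", "applicant_age",
    "submission_of_application", "tract_minority_population_percent",
    "tract_to_msa_income_percentage",
]

def select_columns(frame: pd.DataFrame, keep: list) -> pd.DataFrame:
    """Return only the requested columns, failing loudly if any are absent."""
    missing = [col for col in keep if col not in frame.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return frame.loc[:, keep].copy()

# Empty stand-in frame with the kept columns plus two that should be discarded.
demo = pd.DataFrame(columns=KEEP_COLUMNS + ["rate_spread", "aus-1"])
reduced = select_columns(demo, KEEP_COLUMNS)
print(reduced.shape[1])  # 27
```

Raising on a missing name catches typos in hyphenated column names such as `open-end_line_of_credit` before they silently shrink the dataset.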


The following cleanup and filtering operations are applied:

    I. Data types and handling null values

    Changing "object" types to more specific ones:
    
    A. Columns with no null values:
        1 - lei: From object to string.
        2 - total_units: From object to category; current unique values are: 
            ['1', '2', '4', '3', '50-99', '>149', '5-24', '25-49', '100-149', '0']
        3 - applicant_age: From object to category; current unique values are:
            ['>74', '45-54', '55-64', '25-34', '35-44', '65-74', '<25', '8888','9999']
                224,515 or 2.13% are '8888' or '9999'
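
The conversions listed under A could be sketched as follows. This is one possible approach rather than a settled design: the helper name `convert_types` and the sample rows are hypothetical, and treating '8888' and '9999' in applicant_age as missing is an assumption about how those placeholder codes should be handled.

```python
import pandas as pd

# Age bands in ascending order; '8888' and '9999' are deliberately left out.
AGE_ORDER = ["<25", "25-34", "35-44", "45-54", "55-64", "65-74", ">74"]

def convert_types(frame: pd.DataFrame) -> pd.DataFrame:
    """Cast the object columns from section A to more specific dtypes."""
    out = frame.copy()
    out["lei"] = out["lei"].astype("string")
    out["total_units"] = out["total_units"].astype("category")
    age_dtype = pd.CategoricalDtype(categories=AGE_ORDER, ordered=True)
    # Values that are not in AGE_ORDER become missing (NaN) during the cast.
    out["applicant_age"] = out["applicant_age"].astype(age_dtype)
    return out

# Hypothetical sample rows.
sample = pd.DataFrame({
    "lei": ["A1", "B2", "C3"],
    "total_units": ["1", "2", "5-24"],
    "applicant_age": ["<25", "8888", ">74"],
})
typed = convert_types(sample)
print(typed.dtypes)
print(typed["applicant_age"].isna().sum())  # 1
```

An ordered categorical keeps the age bands sortable, which a plain category would not.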

    B. Columns with null values:
        1 - state_code (56,864 or 0.54%) // possibly reconstructible via census tract code [previously dropped]
        2 - county_code (177,137 or 1.68%) // possibly reconstructible via census tract code [previously dropped]
        3 - conforming_loan_limit (54,549 or 0.52%) // possibly reconstructible via loan_amount + county_code
        4 - loan_to_value_ratio (785,874 or 7.46%) // possibly reconstructible via loan_amount + property_value
        5 - interest_rate (1,393,269 or 13.23%) // possibly reconstructible via rate_spread [previously dropped]
        6 - property_value (212,226 or 2.02%) // possibly reconstructible via loan_to_value_ratio + loan_amount
        7 - debt_to_income_ratio (367,313 or 3.49%) // possibly reconstructible via income + loan_amount
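
As an illustration of the null counts and of one reconstruction idea from B, the sketch below tabulates missing values and fills loan_to_value_ratio from loan_amount and property_value. It assumes the ratio is expressed as a percentage and that the loan is the only lien on the property, neither of which is confirmed here. The helper names and sample rows are hypothetical.

```python
import pandas as pd

def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count nulls per column and express them as a percentage of rows."""
    counts = frame.isna().sum()
    percent = (100 * counts / len(frame)).round(2)
    return pd.DataFrame({"nulls": counts, "percent": percent})

def fill_ltv(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing loan_to_value_ratio with 100 * loan_amount / property_value."""
    out = frame.copy()
    # Non-numeric entries (for example text placeholders) are coerced to NaN.
    amount = pd.to_numeric(out["loan_amount"], errors="coerce")
    value = pd.to_numeric(out["property_value"], errors="coerce")
    ltv = pd.to_numeric(out["loan_to_value_ratio"], errors="coerce")
    out["loan_to_value_ratio"] = ltv.fillna(100 * amount / value)
    return out

# Hypothetical sample: only the first row has enough information to be filled.
sample = pd.DataFrame({
    "loan_amount": [200000, 150000, 100000],
    "property_value": ["250000", "Exempt", "Exempt"],
    "loan_to_value_ratio": [None, "95", None],
})
print(null_summary(sample))
filled = fill_ltv(sample)
print(filled["loan_to_value_ratio"].tolist())
```

Rows where property_value is itself missing or non-numeric stay unfilled, so this can only recover part of the gap.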

    C. Changing into categorical:
        1 - action_taken: Currently, the unique values stored are 1 - Loan originated, 2 - Application approved but 
            not accepted, and 3 - Application denied. Could store this as Y = either 1 or 2 and N = 3.
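
The Y/N encoding suggested in C could look like the sketch below, where codes 1 and 2 map to Y and code 3 maps to N. The helper name `to_approved_flag` is hypothetical, and the sample values are made up.

```python
import pandas as pd

def to_approved_flag(action_taken: pd.Series) -> pd.Series:
    """Map action_taken codes to a two-level categorical: Y (1 or 2) and N (3)."""
    flag = action_taken.map({1: "Y", 2: "Y", 3: "N"})
    return flag.astype(pd.CategoricalDtype(categories=["N", "Y"]))

# Hypothetical sample of action_taken codes.
sample = pd.Series([1, 2, 3, 1])
flags = to_approved_flag(sample)
print(flags.tolist())  # ['Y', 'Y', 'N', 'Y']
```

Any code outside 1, 2, and 3 would become missing, which should not occur given the filters applied at download.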
        
    II. Filtering by columns with a majority value and subsequently dropping the column.
    
    1 - business_or_commercial_purpose: Roughly 97% are type 2 (not business/commercial).
    2 - conforming_loan_limit: About 95% of the applications are conforming. (?)
    3 - lien_status: Roughly 99% of the loans are type 1 (secured by a first lien).
    4 - reverse_mortgage: Roughly 99% of the loans are type 2 (not a reverse mortgage).
    5 - open-end_line_of_credit: Roughly 99% of these are type 2 (not an open-end credit line).
    6 - construction_method: Roughly 98% is type 1, site-built.
    7 - total_units: Roughly 99% is type 1, 1-unit.
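
The filter-then-drop step in II could be sketched as follows. The majority codes are taken from the list above. conforming_loan_limit is left out because its entry is still marked with a question mark. The names `MAJORITY_VALUES` and `filter_and_drop` are hypothetical, and the sketch assumes that total_units holds strings while the other columns hold integers.

```python
import pandas as pd

# Majority value to keep for each column, per the list above.
MAJORITY_VALUES = {
    "business_or_commercial_purpose": 2,
    "lien_status": 1,
    "reverse_mortgage": 2,
    "open-end_line_of_credit": 2,
    "construction_method": 1,
    "total_units": "1",
}

def filter_and_drop(frame: pd.DataFrame, majority_values: dict) -> pd.DataFrame:
    """Keep rows matching every majority value, then drop those columns."""
    mask = pd.Series(True, index=frame.index)
    for column, value in majority_values.items():
        mask &= frame[column] == value
    kept = frame.loc[mask].drop(columns=list(majority_values))
    return kept.reset_index(drop=True)

# Hypothetical sample: only the first row matches every majority value.
sample = pd.DataFrame({
    "business_or_commercial_purpose": [2, 2, 1],
    "lien_status": [1, 2, 1],
    "reverse_mortgage": [2, 2, 2],
    "open-end_line_of_credit": [2, 2, 2],
    "construction_method": [1, 1, 1],
    "total_units": ["1", "1", "1"],
    "loan_amount": [200000, 150000, 300000],
})
result = filter_and_drop(sample, MAJORITY_VALUES)
print(result)
```

After filtering, each of these columns holds a single value, which is why dropping them loses no information.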

In [74]:
#Saving the reduced dataframe; index=False omits the row index from the CSV
df.to_csv('./Data/2018_9_reduced.csv',index=False)