<img style="float: left;" src="../images/fanniemae.png">
<br><br><br><br><br><br>
______

# Mortgage Loan Default Classifier
____________
____________

## Problem Statement:
_____________
Fannie Mae, formally the Federal National Mortgage Association (FNMA), is a government-sponsored enterprise whose primary goal is to raise home ownership and affordable housing levels.  In essence, Fannie Mae does this by purchasing mortgage loans that fall within certain parameters from mortgage lenders.  In turn, the lenders receive cash flow that they can use to issue additional mortgages.<br>

The cause of the Financial Crisis of 2008 can in part be traced back to the purchase of mortgage loans whose actual probability of default was higher than assumed.  A classification model that predicts whether a mortgage loan will default, based on pre-purchase characteristics, may help Fannie Mae better avoid high-risk mortgage loans.  The model will be evaluated on Accuracy and False Negative Rate.  In this case, the "positive" class is loans that default; therefore, we will seek to minimize the False Negative Rate (defaults the model fails to flag) while maximizing Accuracy.
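
As a rough illustration of the two evaluation metrics, the sketch below computes Accuracy and False Negative Rate from true and predicted labels, with `1` standing for a loan that defaults. The function name `accuracy_and_fnr` and the labels are made up for this example and are not part of the project.

```python
def accuracy_and_fnr(y_true, y_pred):
    """Return (accuracy, false negative rate), treating 1 (default) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaults correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-defaults correctly passed
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defaults the model missed
    accuracy = (tp + tn) / len(pairs)
    # FNR = FN / (FN + TP): the share of actual defaults that went undetected
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, fnr


# made-up labels, for illustration only
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]
acc, fnr = accuracy_and_fnr(y_true, y_pred)
```

Here one of the three actual defaults is missed, so the False Negative Rate is one third even though six of the eight predictions are correct.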

## Convert Data Description Summary from .pdf to .csvs
_________

The raw data files on Fannie Mae's website (hyperlink) do not include column headers.  The headers are instead listed in a pdf (hyperlink) that summarizes the data file layout.  This workbook extracts the information from the file layout pdf and converts it to a more convenient csv format using the Pandas and Tabula packages.

### Import packages and load pdf file

In [1]:
# import necessary packages
import pandas as pd
from tabula import read_pdf, convert_into

In [2]:
# declare file path for pdf containing data headers
input_file = '../data/FNMA_SF_Loan_Performance_File_layout.pdf'

# load input file and explore output
df = read_pdf(input_file, pages='all')
df.head(10)

Unnamed: 0,File Position,Unnamed: 1,Field Name,Type,Max Length
0,,,,,
1,1.0,LOAN IDENTIFIER,,ALPHA-NUMERIC,20
2,2.0,ORIGINATION CHANNEL,,ALPHA-NUMERIC,1
3,3.0,SELLER NAME,,ALPHA-NUMERIC,80
4,4.0,ORIGINAL INTEREST RATE,,NUMERIC,1410
5,5.0,ORIGINAL UPB,,NUMERIC,112
6,6.0,ORIGINAL LOAN TERM,,NUMERIC,30
7,7.0,ORIGINATION DATE,,DATE,MM/YYYY
8,8.0,FIRST PAYMENT DATE,,DATE,MM/YYYY
9,9.0,ORIGINAL LOAN-TO-VALUE,(LTV),NUMERIC,1410


__Insight:__<br>
It appears that the Field Name column was errantly split into two columns in a few instances.  In addition, the conversion package translated blank space in the pdf into empty rows.  Both issues will need to be corrected.  

### Clean and save to csv files

In [3]:
# solve read errors on Field Name by replacing nulls and combining Unnamed: 1 and Field Name
df.fillna(value=' ', inplace=True)
df['Field Name'] = df['Unnamed: 1'] +' '+ df['Field Name']
df.head()

Unnamed: 0,File Position,Unnamed: 1,Field Name,Type,Max Length
0,,,,,
1,1.0,LOAN IDENTIFIER,LOAN IDENTIFIER,ALPHA-NUMERIC,20.0
2,2.0,ORIGINATION CHANNEL,ORIGINATION CHANNEL,ALPHA-NUMERIC,1.0
3,3.0,SELLER NAME,SELLER NAME,ALPHA-NUMERIC,80.0
4,4.0,ORIGINAL INTEREST RATE,ORIGINAL INTEREST RATE,NUMERIC,1410.0
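
The reason for filling nulls before combining the two columns can be seen on a small toy frame: in pandas, adding a string to a missing value yields a missing value, so an unfilled concatenation would lose every name that sits in only one of the two columns. The frame below is made up for illustration (the third row in particular is a hypothetical case), and the `.str.strip()` call is one possible way to remove the leftover padding.

```python
import numpy as np
import pandas as pd

# toy stand-in for the two columns produced by the split (values are illustrative)
toy = pd.DataFrame({
    'Unnamed: 1': ['LOAN IDENTIFIER', 'ORIGINAL LOAN-TO-VALUE', np.nan],
    'Field Name': [np.nan, '(LTV)', 'ZIP CODE'],
})

# without filling nulls, any missing operand makes the combined value missing
naive = toy['Unnamed: 1'] + ' ' + toy['Field Name']

# filling nulls with a space first keeps both halves; strip() removes the padding
filled = toy.fillna(' ')
combined = (filled['Unnamed: 1'] + ' ' + filled['Field Name']).str.strip()
```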


In [4]:
# drop Unnamed: 1 column and null rows
df.drop(columns='Unnamed: 1', inplace=True)
df.drop(index = df[df['File Position'] == ' '].index, inplace=True)

# assign data description summaries to associated data files; Acquisition and Performance
df_acq = df.iloc[0:25,:].reset_index(drop=True)
df_perf = df.iloc[26:,:].reset_index(drop=True)

# strip additional spaces from Field Name
df_acq['Field Name'] = df_acq['Field Name'].map(lambda x: x.strip())
df_perf['Field Name'] = df_perf['Field Name'].map(lambda x: x.strip())
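
The split above relies on hard-coded row positions. As a possible alternative, the sketch below locates the boundary from the data itself, under the assumption that File Position restarts at 1 where the second layout begins. The toy frame, its field names, and the stray non-numeric row are all invented for illustration.

```python
import pandas as pd

# toy stand-in for the cleaned layout table (values are illustrative)
toy = pd.DataFrame({
    'File Position': [1.0, 2.0, 3.0, 'File Position', 1.0, 2.0],
    'Field Name': ['A1', 'A2', 'A3', 'Field Name', 'P1', 'P2'],
})

# coerce to numbers; anything non-numeric becomes NaN and is dropped
pos = pd.to_numeric(toy['File Position'], errors='coerce')
toy = toy[pos.notna()]
pos = pos[pos.notna()]

# each time the position returns to 1, a new layout table starts
table_id = (pos == 1).cumsum()
toy_acq = toy[table_id == 1].reset_index(drop=True)
toy_perf = toy[table_id == 2].reset_index(drop=True)
```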

In [5]:
# save the data description summaries for the Acquisition and Performance data files to csvs
output_file = '../data/acquisition_data_dict_summary.csv'
df_acq.to_csv(output_file, index=False)

output_file = '../data/performance_data_dict_summary.csv'
df_perf.to_csv(output_file, index=False)
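
Once saved, the summaries can supply the missing headers when a raw data file is loaded. The sketch below shows one way this might look, using in-memory stand-ins instead of the real files. The pipe delimiter, the sample loan rows, and the assumed column layout of the summary csv are assumptions made for this example.

```python
import io
import pandas as pd

# stand-in for acquisition_data_dict_summary.csv (first three fields only)
dict_csv = io.StringIO(
    "File Position,Field Name,Type,Max Length\n"
    "1.0,LOAN IDENTIFIER,ALPHA-NUMERIC,20\n"
    "2.0,ORIGINATION CHANNEL,ALPHA-NUMERIC,1\n"
    "3.0,SELLER NAME,ALPHA-NUMERIC,80\n"
)
headers = pd.read_csv(dict_csv)['Field Name'].tolist()

# stand-in for a headerless raw file; the '|' delimiter and values are assumptions
raw = io.StringIO("100000000001|R|EXAMPLE BANK\n100000000002|C|OTHER LENDER\n")

# header=None tells pandas the file has no header row; names supplies the columns
loans = pd.read_csv(
    raw,
    sep='|',
    header=None,
    names=headers,
    dtype={'LOAN IDENTIFIER': str},  # keep identifiers as text, not integers
)
```

Reading the identifier as a string avoids losing leading zeros or other formatting if the real identifiers are not purely numeric.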