# Data Preparation
This notebook **preprocesses** and **explores** the credit approval dataset.
Each section begins with a summary of the steps it covers.

In [2]:
# Load Dependencies
import numpy as np
import pandas as pd
import torch
import sklearn
import matplotlib.pyplot as plt

In [3]:
# Load Raw Dataset
filepath = "/Users/drewgjerstad/repos/credit-approval-prediction/data/crx.data"

credit_data = pd.read_table(filepath_or_buffer=filepath,
                            delimiter=",", na_values=["?"],
                            names=["A1", "A2", "A3", "A4", "A5", "A6", "A7",
                                   "A8", "A9", "A10", "A11", "A12", "A13",
                                   "A14", "A15", "A16"])
credit_data.head()

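|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280.0 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | + |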


## Data Preprocessing
In this section, we preprocess the raw dataset loaded above. Once preprocessing
is complete, we will move on to the next section, which explores the (training)
data. The preprocessing steps taken in this section are summarized below.

1. Split the dataset into **features** and **response**.
2. Drop any rows with missing response values (if necessary).
3. Identify features with missing data.
4. Split the dataset into **training** and **test** sets.
5. Impute missing values for columns with missing data.
6. Encode categorical features.

After performing these steps, we will design a data preprocessing pipeline that
applies them to hypothetical future versions of this dataset. The pipeline will
live in a `preprocessing.py` file in the `src` directory and is used in other
parts of this project.
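
As a rough illustration of what such a pipeline could look like, the sketch below combines imputation (step 5) and categorical encoding (step 6) with scikit-learn's `ColumnTransformer`. It is not the contents of `preprocessing.py`: the toy data, the variable names, and the choice of median and most-frequent imputation are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame standing in for the credit features.
toy_features = pd.DataFrame({
    "A1": ["a", "b", np.nan, "b"],
    "A2": [30.83, np.nan, 24.50, 27.83],
})

# Numeric columns: fill gaps with the column median (an assumed strategy).
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
])

# Categorical columns: fill gaps with the most frequent level (an assumed
# strategy), then one-hot encode the result.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each column to the matching steps based on its dtype.
preprocessor = ColumnTransformer([
    ("numeric", numeric_steps,
     make_column_selector(dtype_include=np.number)),
    ("categorical", categorical_steps,
     make_column_selector(dtype_exclude=np.number)),
])

transformed = preprocessor.fit_transform(toy_features)
print(transformed.shape)
```

Selecting columns by dtype avoids hard-coding which of `A1` through `A15` are categorical, at the cost of relying on pandas having inferred the dtypes correctly.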

### Split Dataset Into Features/Response
First, we split the dataset into a feature matrix and a response vector. As
mentioned in the `README`, the response of this dataset is whether or not the
credit card application was approved. This information is stored in `A16`.

In [4]:
# Split Features/Response
credit_features = credit_data.drop(labels=["A16"], axis=1, inplace=False)
credit_response = credit_data.loc[:, "A16"]

In [5]:
# Verify Split
print(f"Features: {list(credit_features.columns)}")
print(f"Response: {credit_response.name}")

Features: ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14', 'A15']
Response: A16


### Drop Rows with Missing Response Values

A key goal of this project is to build a model that predicts the outcome of
credit card applications, so observations with a missing response value cannot
be used for training or evaluation. Therefore, before any further
preprocessing, we remove any observations without a corresponding response.

In [6]:
# Identify Rows with Missing Response
# Collect the index labels (not positions) of rows whose response is missing
missing_response_idx = credit_response.index[credit_response.isna()]
print(f"Number of Missing Response Values: {missing_response_idx.shape[0]}")

Number of Missing Response Values: 0


In [7]:
# Remove Rows (if necessary)
if missing_response_idx.shape[0] > 0:
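    # drop() returns a new object, so the result must be assigned back
    rows_to_drop = np.ravel(missing_response_idx).tolist()
    credit_features = credit_features.drop(labels=rows_to_drop, axis=0)
    credit_response = credit_response.drop(labels=rows_to_drop, axis=0)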
    print(f"Number of Rows Dropped: {missing_response_idx.shape[0]}")
    print(f"Feature Matrix Shape Post-Removal: {credit_features.shape}")
    print(f"Response Vector Shape Post-Removal: {credit_response.shape}")
else:
    print("No Missing Response Values. No Rows Dropped.")
    print(f"Feature Matrix Shape: {credit_features.shape}")
    print(f"Response Vector Shape: {credit_response.shape}")

No Missing Response Values. No Rows Dropped.
Feature Matrix Shape: (690, 15)
Response Vector Shape: (690,)


### Identify Features with Missing Data
Now that rows with a missing response have been handled, we can identify the
features with missing data. From a business standpoint, we could simply
discard any observations ("applications") with missing data. However, because
our goal is a predictive model and its corresponding pipeline, we will instead
develop methods to impute the missing values.

In [8]:
# Identify Features with Missing Data
# isna().any() yields one boolean per column; use it to select column names
has_missing = credit_features.isna().any()
missing_features = credit_features.columns[has_missing].tolist()
print(f"Features with Missing Data: {missing_features}")

Features with Missing Data: ['A1', 'A2', 'A4', 'A5', 'A6', 'A7', 'A14']
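
Knowing which features contain missing values is a start, but the amount missing per feature usually matters when choosing an imputation method. The sketch below shows one way to tabulate counts and percentages. It runs on a small hypothetical frame (`toy_features`), not on the credit data, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy_features = pd.DataFrame({
    "A1": ["a", np.nan, "b", "b"],
    "A2": [30.83, np.nan, np.nan, 27.83],
    "A3": [0.0, 4.46, 0.5, 1.54],
})

# Count missing entries per column.
missing_counts = toy_features.isna().sum()

# Combine counts with the share of rows they represent.
missing_summary = pd.DataFrame({
    "n_missing": missing_counts,
    "pct_missing": (100 * missing_counts / len(toy_features)).round(1),
})

# Keep only the features that have at least one missing value.
missing_summary = missing_summary[missing_summary["n_missing"] > 0]
print(missing_summary)
```

The same three statements should work unchanged on `credit_features`.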


### Split Dataset Into Training/Test Sets
If we were to impute missing values before splitting the dataset into training
and test sets, information from the test set could "leak" into the training
set, because the imputed values would be computed from all observations.
Therefore, we will split the dataset into training and test sets before
continuing.

To ensure that the proportion of each response class is maintained in both the
training and test sets, we will use a stratified split.
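
To see what stratification buys us, the following sketch splits a hypothetical 40/60 response and checks the class proportions on each side. The toy data and the `toy_`-prefixed names are assumptions made for illustration and are not part of the notebook's actual workflow.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical response with 40% "+" and 60% "-".
toy_response = pd.Series(["+"] * 40 + ["-"] * 60, name="A16")
toy_features = pd.DataFrame({"x": range(100)})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_response,
    test_size=0.25,
    random_state=42,
    stratify=toy_response,  # preserve class proportions in both subsets
)

# Class proportions in each subset should match the full response.
train_props = toy_y_train.value_counts(normalize=True)
test_props = toy_y_test.value_counts(normalize=True)
print(train_props)
print(test_props)
```

Without `stratify`, the proportions in the two subsets would depend on the random draw and could drift from those of the full dataset.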

In [9]:
TEST_SIZE = 0.25
SEED = 42

In [None]:
# Split Dataset Into Training/Test Sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    credit_features, credit_response,
    test_size=TEST_SIZE,
    random_state=SEED,
    stratify=credit_response
)

print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")
print(f"Training response shape: {y_train.shape}")
print(f"Testing response shape: {y_test.shape}")

Training features shape: (517, 15)
Testing features shape: (173, 15)
Training response shape: (517,)
Testing response shape: (173,)
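
The leakage concern raised at the start of this section can be made concrete with a small sketch: an imputer is fitted on the training rows only, and the statistic it learned is then reused for the test rows. The toy frames and mean imputation are assumptions chosen for illustration, not the method this project settles on.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test frames, each with one missing value.
toy_train = pd.DataFrame({"A2": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"A2": [np.nan, 100.0]})

# Fit on the training set only, so the fill value ignores the test set.
imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train)

# Reuse the training-set mean to fill both sets.
train_imputed = imputer.transform(toy_train)
test_imputed = imputer.transform(toy_test)

print(imputer.statistics_)   # fill value learned from toy_train
print(test_imputed.ravel())  # the test gap is filled with that same value
```

Had the imputer been fitted on the combined data, the test value of 100.0 would have pulled the fill value upward, which is the kind of leakage the split is meant to prevent.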


### Impute Missing Values

### Feature Engineering

### Encode Categorical Features

### Standardize Continuous Features

### Export Training and Test Sets

## Data Exploration