# Auto credit card approval - case study description

<p>This case study focuses on building an automatic credit card approval predictor using machine learning. The Credit Approval Data Set from the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">UCI Machine Learning Repository</a> is used as an example dataset to demonstrate the methodology. Although the feature labels are anonymized, conventional credit card application analysis suggests that they are: Gender, Age, Debt, Married status, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and finally the ApprovalStatus.</p>

<p>Credit card approval is a good case study for applied machine learning because the application approval process can be framed naturally as a binary classification problem: each application is either approved or rejected. The underlying pattern that differentiates trustworthy customers from unreliable ones can be inferred from the customer's credit and personal details. The conventional approval system was subjective and based on the bank manager's experience. With machine learning, this subjective judgement can be supplemented with quantitative metrics, which can lead to faster and more accurate approval processes.</p>

<p>This analysis involves data pre-processing and cleaning, followed by an exploratory analysis. Pre-processing is required to deal with the missing values and to prepare the dataset for use in machine learning libraries. After some exploratory analysis, we'll build a pipeline that tests several machine learning models as predictors for the credit card applications.</p>


<p><sub>Sources:</sub><br>    
<sub>Data: Credit Approval Data Set, UCI Machine Learning Repository</sub><br>
<sub>Project: Sayak Paul, Predicting Credit Card Approvals, Datacamp </sub></p>


In [45]:
#Original dataset
import pandas as pd

df = pd.read_csv('../dat/cc_approvals.data',header=None,na_values='?')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     678 non-null object
1     678 non-null float64
2     690 non-null float64
3     684 non-null object
4     684 non-null object
5     681 non-null object
6     681 non-null object
7     690 non-null float64
8     690 non-null object
9     690 non-null object
10    690 non-null int64
11    690 non-null object
12    690 non-null object
13    677 non-null float64
14    690 non-null int64
15    690 non-null object
dtypes: float64(4), int64(2), object(10)
memory usage: 86.4+ KB


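| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280.0 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | + |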


As the output shows, several columns contain missing values and ten of the sixteen columns are stored as text, so the dataset requires some cleaning before it can be used for any exploratory analysis.
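To make the cleaning task more concrete, the sketch below attaches the presumed feature labels from the introduction to the five rows shown in the `df.head()` output above and lists the columns that pandas could not parse as numbers. The label list is an assumption carried over from the introduction (with "Married status" shortened to `Married`), and the variable names `presumed_names`, `sample` and `text_cols` are illustrative only; they are not used elsewhere in this notebook.

```python
import io
import pandas as pd

# Presumed labels from the introduction; the real labels are anonymized,
# so treat these names as an assumption.
presumed_names = ['Gender', 'Age', 'Debt', 'Married', 'BankCustomer',
                  'EducationLevel', 'Ethnicity', 'YearsEmployed',
                  'PriorDefault', 'Employed', 'CreditScore',
                  'DriversLicense', 'Citizen', 'ZipCode', 'Income',
                  'ApprovalStatus']

# The five rows displayed by df.head() above, re-entered as CSV text.
sample_rows = """b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,+
a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,+
a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280.0,824,+
b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,+
b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,+
"""

sample = pd.read_csv(io.StringIO(sample_rows), names=presumed_names,
                     na_values='?')

# Columns that are not numeric still need to be encoded before modelling.
text_cols = sample.select_dtypes(exclude='number').columns.tolist()

# Missing values per column (none in these five rows; the full file has some).
missing_per_col = sample.isna().sum()

print(text_cols)
print(missing_per_col[missing_per_col > 0])
```

The cells that follow keep the original integer column labels, so this renaming is only a reading aid.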

In [76]:
df = pd.read_csv('../dat/cc_approvals.data',header=None,na_values='?')

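# Encode the approval status: '+' (approved) -> 1, '-' (rejected) -> 0
df[15]=df[15].map({'+':1,'-':0})
# Merge the 'gg' label in column 4 into 'g'
df[4]=df[4].str.replace('gg','g')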

def convert_cat_cols(df,cat_var_limit=30,verbose=True):
    """
    Converts columns with a small proportion of unique values
    (fewer than len(df)/cat_var_limit) into categorical variables.
    Modifies df in place and returns it.
    """
    temp_var = df.apply(lambda x: len(x.value_counts()))
    temp_var = temp_var[temp_var<len(df)/cat_var_limit].index
    df[temp_var] = df[temp_var].astype('category')
    if verbose:
        print(df[temp_var].describe())
    return df

# Fill missing numeric values with the column median; numeric_only=True
# skips the categorical columns, which recent pandas versions refuse to average
df = convert_cat_cols(df,verbose=False)
df = df.fillna(df.median(numeric_only=True))

def impute_most_freq(df):
    """
    Imputes the most frequent value in place of NaN's
    """
    temp_var = df.apply(lambda x: x.value_counts().index[0])
    return df.fillna(temp_var)

df = impute_most_freq(df)
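The two imputation steps above can be checked on a small example. The sketch below uses a made-up two-column frame (`toy`, with hypothetical `age` and `gender` columns that are not part of the dataset) and applies the same idea: the median for numeric gaps first, then the most frequent value for whatever is still missing.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only; not taken from the credit dataset.
toy = pd.DataFrame({
    'age': [20.0, np.nan, 40.0, 30.0],
    'gender': ['a', 'b', None, 'b'],
})

# Step 1: numeric gaps get the column median (median of 20, 40, 30 is 30).
toy_filled = toy.fillna(toy.median(numeric_only=True))

# Step 2: remaining gaps get each column's most frequent value,
# mirroring impute_most_freq above ('b' for the gender column).
most_frequent = toy_filled.apply(lambda col: col.value_counts().index[0])
toy_filled = toy_filled.fillna(most_frequent)

print(toy_filled)
```

Running the median step first matters: if the most frequent value were imputed first, numeric columns would be filled with their mode and the median step would have nothing left to do.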

In [77]:
df.info()

# Show the value distribution of every column, including any remaining NaNs
for i in df.columns:
    print(df[i].value_counts(dropna=False))
    print('\n')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     690 non-null category
1     690 non-null float64
2     690 non-null float64
3     690 non-null category
4     690 non-null category
5     690 non-null category
6     690 non-null category
7     690 non-null float64
8     690 non-null category
9     690 non-null category
10    690 non-null int64
11    690 non-null category
12    690 non-null category
13    690 non-null float64
14    690 non-null int64
15    690 non-null category
dtypes: category(10), float64(4), int64(2)
memory usage: 41.1 KB
b    480
a    210
Name: 0, dtype: int64


28.46    12
22.67     9
20.42     7
24.50     6
20.67     6
         ..
17.83     1
44.83     1
60.58     1
50.08     1
28.33     1
Name: 1, Length: 350, dtype: int64


1.500     21
0.000     19
3.000     19
2.500     19
1.250     16
          ..
12.125     1
13.915     1
22.000     1
12.835     1
10.915     1
Name: 2, Length: 215, dtype: int64


u