This dataset contains credit card default information for clients in Taiwan. A complete modeling methodology is explored, starting with the basics of data exploration and treatment and ending with different techniques for predictive analytics (logistic regression, decision trees, gradient boosting, etc.). <br>

What follows is a brief description of the 25 variables:
<b>ID</b>: ID of each client
<b>LIMIT_BAL</b>: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
<b>SEX</b>: Gender (1 = male; 2 = female).
<b>EDUCATION</b>: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
<b>MARRIAGE</b>: Marital status (1 = married; 2 = single; 3 = others).
<b>AGE</b>: Age (year).

History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows:

<b>PAY_0</b>:  the repayment status in September, 2005;
<b>PAY_2</b>: the repayment status in August, 2005; . . .;
<b>PAY_3</b>: . . .
<b>PAY_4</b>: . . .
<b>PAY_5</b>: . . .
<b>PAY_6</b>: the repayment status in April, 2005. 
The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

Amount of bill statement (NT dollar).

<b>BILL_AMT1</b>: amount of bill statement in September, 2005;
<b>BILL_AMT2</b>: amount of bill statement in August, 2005; . . .;
<b>BILL_AMT3</b>: . . .;
<b>BILL_AMT4</b>: . . .;
<b>BILL_AMT5</b>: . . .;
<b>BILL_AMT6</b>: amount of bill statement in April, 2005.

Amount of previous payment (NT dollar).

<b>PAY_AMT1</b>: amount paid in September, 2005;
<b>PAY_AMT2</b>: amount paid in August, 2005; . . .;
<b>PAY_AMT3</b>: . . .;
<b>PAY_AMT4</b>: . . .;
<b>PAY_AMT5</b>: . . .;
<b>PAY_AMT6</b>: amount paid in April, 2005;
<b>default.payment.next.month</b>: payment default (1 = yes; 0 = no)
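
As a quick sanity check on the description above, the column names can be rebuilt by group and counted. This is only a sketch: the grouping variable names (`demographic_cols`, `pay_status_cols`, and so on) are introduced here for illustration and are not part of the dataset. Note that the repayment status columns jump from `PAY_0` to `PAY_2`.

```python
# Column groups as listed in the variable description.
demographic_cols = ["ID", "LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]

# Repayment status: PAY_0 (September) and then PAY_2 ... PAY_6 (there is no PAY_1).
pay_status_cols = ["PAY_0"] + [f"PAY_{i}" for i in range(2, 7)]

# Bill statement amounts and previous payment amounts, September back to April.
bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]
pay_amt_cols = [f"PAY_AMT{i}" for i in range(1, 7)]

target_col = "default.payment.next.month"

expected_columns = (
    demographic_cols + pay_status_cols + bill_cols + pay_amt_cols + [target_col]
)
print(len(expected_columns))
```

Once the data is loaded, comparing `expected_columns` with the DataFrame's columns is a cheap way to catch a renamed or missing field.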

<b>References/Sources:</b>

[1] UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
[2] Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.<br>
[3] Name: I-Cheng Yeh 
email addresses: (1) icyeh '@' chu.edu.tw (2) 140910 '@' mail.tku.edu.tw 
institutions: (1) Department of Information Management, Chung Hua University, Taiwan. (2) Department of Civil Engineering, Tamkang University, Taiwan. 
other contact information: 886-2-26215656 ext. 3181 




In [1]:
print("Importing required libraries")
import pandas as pd
import os, boto3, subprocess, re, sys, gc
from botocore.client import Config

print("All libraries successfully loaded!")

kms_key = os.environ['AW_S3_ENCRYPTION_KEY']

bucket_name = os.environ['AW_S3_STORAGE_BUCKET']
storage_key = os.environ['AW_S3_STORAGE_KEY'] + '/awdata/rawfiles/'
full_s3_location = 's3://' + bucket_name + '/' + storage_key 
print("full_s3_location: '{}'".format(full_s3_location))

#raw_df= pd.read_csv(full_s3_location + "sampled_ssme_app_data1.csv.gz", compression='gzip',encoding = 'iso8859_11')
df_twn= pd.read_csv(full_s3_location + "UCI_Credit_Card.csv")
z.show(df_twn)

1. Are there any duplicates in "ID" ? 
2. How many distinct "AGE" values are there?

``NOTE`` - Without using the ``groupby`` method


In [3]:
df_twn[['ID']].duplicated().sum()

In [4]:
df_twn[['AGE']].drop_duplicates().shape[0]

# df_twn[['AGE']].nunique() # number of unique values
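
The same two checks can be tried on a small frame without access to the S3 data. The sketch below uses a made-up `toy` DataFrame (its values are assumptions for illustration only) and works on the Series directly rather than on a one-column DataFrame.

```python
import pandas as pd

# Toy stand-in for df_twn; the values are invented for illustration.
toy = pd.DataFrame({"ID": [1, 2, 3, 3], "AGE": [24, 35, 35, 41]})

# Number of rows whose ID repeats an earlier row.
n_duplicate_ids = int(toy["ID"].duplicated().sum())

# Equivalent yes/no check: is every ID distinct?
ids_are_unique = toy["ID"].is_unique

# Number of distinct ages (missing values are not counted by default).
n_distinct_ages = toy["AGE"].nunique()

print(n_duplicate_ids, ids_are_unique, n_distinct_ages)
```

Neither check relies on `groupby`, which keeps it within the constraint of the exercise.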

Write code to replace the 'SEX' values: 1 with 'male' and 2 with 'female'.

In [6]:
df_twn['SEX'].replace([1,2],['male', 'female']).head()
# Note that the column type changed automatically
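
A dictionary passed to `map` is another way to express the same recoding. The sketch below uses a made-up `toy` frame and a hypothetical `SEX_LABEL` column so that the original codes stay available. One difference worth knowing: `map` turns any code missing from the dictionary into a missing value, whereas `replace` leaves unmatched codes as they are.

```python
import pandas as pd

# Toy stand-in for df_twn; the values are invented for illustration.
toy = pd.DataFrame({"SEX": [1, 2, 2, 1]})

# Code-to-label mapping taken from the variable description.
sex_labels = {1: "male", 2: "female"}

# Store the labels in a new column so the numeric codes are preserved.
toy["SEX_LABEL"] = toy["SEX"].map(sex_labels)

print(toy)
```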

 
Write a script to bin "LIMIT_BAL" into \\(2\\) groups: one with values \\(\lt\\) the average and another with values \\(\geq\\) the average.


In [8]:
 

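# Two half-open intervals: [-inf, mean) and [mean, +inf)
mean_limit = df_twn["LIMIT_BAL"].mean()
bins = [float("-inf"), mean_limit, float("inf")]

# right=False closes each interval on the left, so values equal to the mean fall in the upper group
pd.cut(df_twn["LIMIT_BAL"], bins, right=False).head()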
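
With `pd.cut`, which side of each interval is closed depends on the `right` argument (intervals are right-closed by default), so a value exactly equal to the mean needs some care. For a two-way split, a direct comparison avoids interval edges altogether. The sketch below assumes a made-up `toy` frame whose mean is exactly 30000, and the group names `below_mean` and `mean_or_above` are hypothetical labels.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_twn; one value sits exactly on the mean (30000).
toy = pd.DataFrame({"LIMIT_BAL": [10000, 20000, 30000, 60000]})

mean_limit = toy["LIMIT_BAL"].mean()

# Strictly below the mean goes to one group, everything else to the other.
toy["LIMIT_GROUP"] = np.where(
    toy["LIMIT_BAL"] < mean_limit, "below_mean", "mean_or_above"
)

print(toy)
```

The row with 30000 lands in `mean_or_above`, matching the \\(\geq\\) side of the exercise statement.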


 
Write a Python script to divide the "BILL_AMT1" values into 5 equally populated bins:


In [10]:
pd.qcut(df_twn['BILL_AMT1'],5).head()
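
To confirm that the bins really are equally populated, the bin sizes can be counted. The sketch below runs on a made-up series of 100 distinct values; on real data with many tied values the counts can be uneven, and `pd.qcut` raises an error if two bin edges coincide unless `duplicates="drop"` is passed.

```python
import pandas as pd

# Toy stand-in for the BILL_AMT1 column: 100 distinct values.
toy_bills = pd.Series(range(1, 101), name="BILL_AMT1")

# labels=False returns the integer bin index (0 to 4) instead of intervals.
quintiles = pd.qcut(toy_bills, 5, labels=False)

# Count how many observations fall in each bin.
counts = quintiles.value_counts().sort_index()
print(counts)
```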



Create a function that bins each numeric variable in a given list into three equally sized categories and returns them, with the prefix "categ_", along with all the other variables, without changing the original data.

Run the function on all "BILL_AMT" variables.


In [12]:
def automatic_bin(df,numeric_list,q):
    df_aux = df.copy()
    for names in numeric_list:
        df_aux["categ_" + names] = pd.qcut(df_aux[names],q)
    
    return df_aux 
        
new_df_twn = automatic_bin(df_twn,['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5', 'BILL_AMT6'] ,3)
# The original DataFrame is untouched: no "categ_" columns appear here
df_twn.columns
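
The behaviour of the function can be checked on a small frame. The sketch below repeats the function so that it runs on its own, uses a made-up `toy` DataFrame, and selects the columns by prefix instead of typing each name; the prefix-based selection is an illustrative choice, not something the exercise requires.

```python
import pandas as pd


def automatic_bin(df, numeric_list, q):
    """Return a copy of df with a 'categ_' quantile-bin column per listed variable."""
    df_aux = df.copy()
    for name in numeric_list:
        df_aux["categ_" + name] = pd.qcut(df_aux[name], q)
    return df_aux


# Toy stand-in for df_twn; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "BILL_AMT1": [0, 100, 200, 300, 400, 500],
        "BILL_AMT2": [5, 15, 25, 35, 45, 55],
    }
)

# Pick every column whose name starts with "BILL_AMT".
bill_cols = [col for col in toy.columns if col.startswith("BILL_AMT")]

binned = automatic_bin(toy, bill_cols, 3)

print(list(toy.columns))     # original columns only
print(list(binned.columns))  # original columns plus the categ_ columns
```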

Write a script to show all the rows in the dataframe where the absolute "LIMIT_BAL" value is greater than 500,000.00

``NOTE`` - Without using ``.loc`` or ``.iloc``


In [14]:
 


z.show(df_twn[abs(df_twn['LIMIT_BAL']) > 500000])
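
The filter is a plain boolean mask, so it can be built and inspected as a separate step. The sketch below uses a made-up `toy` frame and the `Series.abs()` method. Because the comparison is strict, a limit of exactly 500,000 is excluded.

```python
import pandas as pd

# Toy stand-in for df_twn; the values are invented for illustration.
toy = pd.DataFrame({"ID": [1, 2, 3], "LIMIT_BAL": [200000, 500000, 800000]})

# True for rows whose absolute credit limit is strictly above the threshold.
mask = toy["LIMIT_BAL"].abs() > 500000

# Indexing with the mask keeps only the matching rows (no .loc or .iloc needed).
high_limit = toy[mask]

print(high_limit)
```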


Consider the data you created in Exercise 5:

Modify the function to return all the dummies along with all the other variables (including the newly created variables).

In [16]:
def automatic_bin(df,numeric_list,q):
    df_aux = df.copy()
    for names in numeric_list:
        # Build the labels from q so the function also works when q is not 3
        df_aux["categ_" + names] = pd.qcut(df_aux[names], q, labels=list(range(1, q + 1)))
        df_aux = df_aux.join(pd.get_dummies(df_aux["categ_" + names], prefix = 'D_'+names))
    
    return df_aux 
    
new_df_twn = automatic_bin(df_twn,['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5', 'BILL_AMT6'] ,3)
new_df_twn.columns.values
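
To see what the dummy step produces for a single variable, it helps to run it on a small frame. The sketch below uses a made-up `toy` DataFrame and derives the labels from `q` so it is not tied to three groups. `pd.get_dummies` names each column as the prefix, an underscore, and the category label.

```python
import pandas as pd

# Toy stand-in for df_twn; the values are invented for illustration.
toy = pd.DataFrame({"BILL_AMT1": [0, 100, 200, 300, 400, 500]})

q = 3

# Quantile bins labelled 1..q.
categ = pd.qcut(toy["BILL_AMT1"], q, labels=list(range(1, q + 1)))

# One indicator column per bin: D_BILL_AMT1_1, D_BILL_AMT1_2, D_BILL_AMT1_3.
dummies = pd.get_dummies(categ, prefix="D_BILL_AMT1")

# Attach the indicators to the original columns.
result = toy.join(dummies)

print(list(result.columns))
```

Each row has exactly one indicator set, so the dummy columns sum to one across every row.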