This dataset contains information on credit card defaults among clients in Taiwan. The analysis walks through a complete modeling workflow, from basic data exploration and treatment to several predictive-analytics techniques (logistic regression, decision trees, gradient boosting, etc.). <br>

What follows is a brief description of the 25 variables:
<b>ID</b>: ID of each client
<b>LIMIT_BAL</b>: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
<b>SEX</b>: Gender (1 = male; 2 = female).
<b>EDUCATION</b>: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
<b>MARRIAGE</b>: Marital status (1 = married; 2 = single; 3 = others).
<b>AGE</b>: Age (year).

History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows:

<b>PAY_0</b>:  the repayment status in September, 2005;
<b>PAY_2</b>: the repayment status in August, 2005; . . .;
<b>PAY_3</b>: . . .
<b>PAY_4</b>: . . .
<b>PAY_5</b>: . . .
<b>PAY_6</b>: the repayment status in April, 2005. 
The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

Amount of bill statement (NT dollar).

<b>BILL_AMT1</b>: amount of bill statement in September, 2005;
<b>BILL_AMT2</b>: amount of bill statement in August, 2005; . . .;
<b>BILL_AMT3</b>: . . .;
<b>BILL_AMT4</b>: . . .;
<b>BILL_AMT5</b>: . . .;
<b>BILL_AMT6</b>: amount of bill statement in April, 2005.

Amount of previous payment (NT dollar).

<b>PAY_AMT1</b>: amount paid in September, 2005;
<b>PAY_AMT2</b>: amount paid in August, 2005; . . .;
<b>PAY_AMT3</b>: . . .;
<b>PAY_AMT4</b>: . . .;
<b>PAY_AMT5</b>: . . .;
<b>PAY_AMT6</b>: amount paid in April, 2005.
<b>default.payment.next.month</b>: payment default (1 = yes; 0 = no)

<b>References/Sources:</b>

[1] UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
[2] Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.<br>
[3] Name: I-Cheng Yeh 
email addresses: (1) icyeh '@' chu.edu.tw (2) 140910 '@' mail.tku.edu.tw 
institutions: (1) Department of Information Management, Chung Hua University, Taiwan. (2) Department of Civil Engineering, Tamkang University, Taiwan. 
other contact information: 886-2-26215656 ext. 3181 




In [1]:
print("Importing required libraries")
import pandas as pd
import os, boto3, subprocess, re, sys, gc
from botocore.client import Config

print("All libraries successfully loaded!")

# S3 settings are read from environment variables (kms_key is not used in this cell)
kms_key = os.environ['AW_S3_ENCRYPTION_KEY']

bucket_name = os.environ['AW_S3_STORAGE_BUCKET']
storage_key = os.environ['AW_S3_STORAGE_KEY'] + '/awdata/rawfiles/'
full_s3_location = 's3://' + bucket_name + '/' + storage_key
print("full_s3_location: '{}'".format(full_s3_location))

# Load the Taiwan credit card dataset (reading s3:// paths with pandas requires s3fs)
df_twn = pd.read_csv(full_s3_location + "UCI_Credit_Card.csv")
z.show(df_twn)

Create a cross table between "PAY_0" and "EDUCATION".
Which combination has the highest percentage of IDs?

``Challenge`` - Do it in a single line of code, rounding the percentages to 4 decimal places and displaying the result with z.show() (with the row index visible).


In [3]:
import numpy as np
z.show(np.round(df_twn.groupby(['PAY_0','EDUCATION'])['ID'].count().unstack().fillna(0)/df_twn.shape[0],4).reset_index())
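
As an alternative to the groupby/unstack approach, the same table can be sketched with ``pd.crosstab``, whose ``normalize="all"`` option divides each cell by the total row count. The ``toy`` frame below is a hypothetical stand-in for ``df_twn`` with made-up values, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for df_twn (not the real dataset)
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "PAY_0": [0, 0, 0, 1, 0, 2, 0, 1],
    "EDUCATION": [2, 2, 2, 2, 1, 3, 1, 1],
})

# normalize="all" divides every cell by the total number of rows;
# combinations that never occur are filled with 0
share = pd.crosstab(toy["PAY_0"], toy["EDUCATION"], normalize="all").round(4)
print(share)

# Share of each (PAY_0, EDUCATION) pair as a Series; idxmax() returns
# the pair with the largest share
combo_share = toy[["PAY_0", "EDUCATION"]].value_counts(normalize=True)
top_combo = combo_share.idxmax()
print(top_combo)
```

On the real data, ``share.reset_index()`` could be passed to ``z.show()`` so that the PAY_0 row labels appear as a column.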

 

Which "EDUCATION" level has the highest proportion of clients with "PAY_0" equal to 1?

In [5]:
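# Number of clients in each EDUCATION level (the count lands in a column named 0)
total_by_educ = df_twn.groupby("EDUCATION").size().reset_index()
# Number of clients with PAY_0 == 1 in each EDUCATION level
gb_df = df_twn.loc[df_twn["PAY_0"] == 1,:].groupby(["EDUCATION","PAY_0"])[["ID"]].count().reset_index()
# Inner merge: EDUCATION levels without any PAY_0 == 1 rows are dropped
pd_merge = pd.merge(total_by_educ, gb_df, on = ["EDUCATION"]).drop(["PAY_0"], axis = 1)

pd_merge.rename(columns = {0:"TOTAL"}, inplace = True)
# Share of each EDUCATION level with PAY_0 == 1
pd_merge["Prop_perc"] = pd_merge['ID']/pd_merge["TOTAL"]
pd_merge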
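
A shorter route to the same proportions, offered as a sketch and not as the notebook's own method: the mean of a boolean flag within each group is the share of rows where the flag is true. The ``toy`` frame is hypothetical illustration data.

```python
import pandas as pd

# Hypothetical toy data standing in for df_twn (not the real dataset)
toy = pd.DataFrame({
    "EDUCATION": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "PAY_0":     [1, 0, 0, 0, 1, 0, 1, 1, 1, 0],
})

# eq(1) builds a True/False flag; its group mean is the share of
# clients with PAY_0 == 1 inside each EDUCATION level
prop_by_educ = toy["PAY_0"].eq(1).groupby(toy["EDUCATION"]).mean()
print(prop_by_educ)

# EDUCATION level with the largest share
best_educ = prop_by_educ.idxmax()
print(best_educ)
```

Unlike an inner merge, this keeps EDUCATION levels that have no PAY_0 == 1 rows and reports them with a share of 0.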



By using ``pivot`` or ``melt`` methods, calculate the average "AGE" by "SEX"

In [7]:
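# Keyword arguments are required by pandas 2.0+ and also work on older versions
df_twn[['ID','AGE', 'SEX']].pivot(index='ID', columns='SEX', values='AGE').mean()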
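
Since the exercise allows either method, here is a minimal sketch of the ``melt`` variant: "AGE" is moved into a long ``value`` column while "ID" and "SEX" stay as identifiers, and the result is then averaged per "SEX". The ``toy`` frame is a hypothetical stand-in for ``df_twn``.

```python
import pandas as pd

# Hypothetical toy data standing in for df_twn (not the real dataset)
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "SEX": [1, 2, 2, 1],
    "AGE": [30, 40, 50, 20],
})

# Long format: one row per (ID, SEX) with AGE stored in the "value" column
long_df = pd.melt(toy, id_vars=["ID", "SEX"], value_vars=["AGE"])

# Average of the melted AGE values within each SEX code
avg_age_by_sex = long_df.groupby("SEX")["value"].mean()
print(avg_age_by_sex)
```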

Which ID has the highest total payment amount over the last 6 months?

``NOTE`` - You need to use ``pivot`` or ``melt``

In [9]:
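# Sum only the melted "value" column so the text column "variable" is not aggregated
z.show(pd.melt(df_twn[['ID','PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']], id_vars=["ID"]).groupby("ID")[["value"]].sum().reset_index().sort_values("value", ascending = False))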
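
The melted total can be cross-checked against a plain row-wise sum, which does not use ``melt`` and so serves only as a sanity check, not as an answer to the exercise. The sketch below uses a hypothetical ``toy`` frame with made-up amounts.

```python
import pandas as pd

pay_cols = [f"PAY_AMT{i}" for i in range(1, 7)]

# Hypothetical toy data: one row of six monthly payments per ID
amounts = {
    1: [100, 100, 100, 100, 100, 100],
    2: [0, 0, 0, 0, 0, 1000],
    3: [50, 50, 50, 50, 50, 50],
}
toy = (
    pd.DataFrame.from_dict(amounts, orient="index", columns=pay_cols)
    .rename_axis("ID")
    .reset_index()
)

# melt route: stack the six columns into "value", then sum per ID
melted_total = (
    pd.melt(toy, id_vars="ID", value_vars=pay_cols)
    .groupby("ID")["value"]
    .sum()
)

# Cross-check: sum the six columns directly across each row
direct_total = toy.set_index("ID")[pay_cols].sum(axis=1)

print(melted_total.sort_values(ascending=False))
print((melted_total == direct_total).all())

# ID with the largest total payment
top_id = melted_total.idxmax()
print(top_id)
```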