Since pandas is all about working with data, we will illustrate many pandas concepts using a dataset. We explore an entire modeling methodology, starting from the basics of data exploration and treatment and ending with different techniques for predictive analytics (logistic regression, decision trees, gradient boosting, etc.). The dataset we will use is a credit risk dataset containing credit card default information of clients in Taiwan.

What follows is a brief description of the 25 variables in the dataset:
<b>ID</b>: ID of each client
<b>LIMIT_BAL</b>: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
<b>SEX</b>: Gender (1 = male; 2 = female).
<b>EDUCATION</b>: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
<b>MARRIAGE</b>: Marital status (1 = married; 2 = single; 3 = others).
<b>AGE</b>: Age (year).

History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows:

<b>PAY_0</b>:  the repayment status in September, 2005;
<b>PAY_2</b>: the repayment status in August, 2005; . . .;
<b>PAY_3</b>: . . .
<b>PAY_4</b>: . . .
<b>PAY_5</b>: . . .
<b>PAY_6</b>: the repayment status in April, 2005. 
The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

Amount of bill statement (NT dollar).

<b>BILL_AMT1</b>: amount of bill statement in September, 2005;
<b>BILL_AMT2</b>: amount of bill statement in August, 2005; . . .;
<b>BILL_AMT3</b>: . . .;
<b>BILL_AMT4</b>: . . .;
<b>BILL_AMT5</b>: . . .;
<b>BILL_AMT6</b>: amount of bill statement in April, 2005.

Amount of previous payment (NT dollar).

<b>PAY_AMT1</b>: amount paid in September, 2005;
<b>PAY_AMT2</b>: amount paid in August, 2005; . . .;
<b>PAY_AMT3</b>: . . .;
<b>PAY_AMT4</b>: . . .;
<b>PAY_AMT5</b>: . . .;
<b>PAY_AMT6</b>: amount paid in April, 2005;
<b>default.payment.next.month</b>: payment default (1 = yes; 0 = no)

Source: UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients





In [1]:
import pandas as pd
import numpy as np

import os, boto3, subprocess, re, sys, gc
from botocore.client import Config

print("All libraries successfully loaded!")

kms_key = os.environ['AW_S3_ENCRYPTION_KEY']

bucket_name = os.environ['AW_S3_STORAGE_BUCKET']
storage_key = os.environ['AW_S3_STORAGE_KEY'] + '/awdata/rawfiles/'
full_s3_location = 's3://' + bucket_name + '/' + storage_key 
print("full_s3_location: '{}'".format(full_s3_location))
df_twn= pd.read_csv(full_s3_location + "UCI_Credit_Card.csv",nrows=100)

To explore what we have learned so far about pandas dataframes, complete the following questions. 

1. Rename the column `default.payment.next.month` to simply `default`.

2. Write a function to detect all the binary variables in the dataframe. 

3. Write a function to detect all the data types in a pandas dataframe and output the results as a dictionary (e.g. `{'float':[var1,...],...}`).

4. Find the `median` and `max` values of the columns `'BILL_AMT1'`,`'BILL_AMT2'`,`'BILL_AMT3'`,`'BILL_AMT4'`,`'BILL_AMT5'` and `'BILL_AMT6'` for <b>each</b> value of `AGE`. 


In [3]:
 

# 1)Rename the column default.payment.next.month to simply default.
df_twn.rename({'default.payment.next.month':'default'},axis=1,inplace=True)
print("1.1: ",df_twn.columns)

# 2)Write a function to detect all the binary variables in the dataframe.
def CheckBinaries(df): 
    '''
    Check for the binaries in a dataframe
    '''
    variables = df.columns.values
    mask      = list(df.apply(lambda x: len(np.unique(x)) == 2))
    return list(np.array(variables)[mask])

# CheckBinaries(df_twn)
print("\n1.2: ",df_twn[CheckBinaries(df_twn)].head(5))

# 3)Write a function to detect all the data types in a pandas dataframe and output the results as a dictionary (e.g. {'float':[var1,...],...}).
def DetectTypes(df):
    '''Maps each dtype name to the list of columns with that dtype'''
    dataTypes = set(df.dtypes.astype(str))
    return {j: [i for i in df.columns.values if str(df[i].dtype) == j] for j in dataTypes}

print("\n1.3: ",DetectTypes(df_twn))

# 4)Find the median and max values of the columns 'BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5' and 'BILL_AMT6' for each value of AGE.
print("\n1.4:")
# Select several columns with a list (double brackets); tuple-style selection after groupby fails in recent pandas.
bill_cols = ['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6']
z.show(df_twn.groupby(['AGE'])[bill_cols].agg(['median','max']).reset_index())


The coefficient of variation (CV) of a sample is defined as \\(\sigma/\mu\\) (the standard deviation divided by the mean). It is non-dimensional and is used to compare the spread of datasets. Another useful measure is the z-score, defined as \\((x - \mu)/\sigma\\). It measures the distance of a data point from the mean in terms of the standard deviation. This is also called standardization of data.
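
As a quick illustration of both definitions, here is a minimal sketch on a made-up sample (the values are hypothetical and not taken from the credit dataset). It uses the population standard deviation (`ddof=0`) so the numbers come out round; note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), which gives slightly different values.

```python
import pandas as pd

# Hypothetical toy sample, chosen so that the mean and spread are round numbers
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = s.mean()            # 5.0
sigma = s.std(ddof=0)    # population standard deviation: 2.0

cv = sigma / mu              # coefficient of variation (dimensionless)
z_scores = (s - mu) / sigma  # one z-score per data point

print("CV:", cv)
print("z-scores:", z_scores.tolist())
```

By construction the z-scores have mean 0 and (population) standard deviation 1, which is why this transformation is called standardization.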

1. Compute the coefficient of variation of the variable 'LIMIT_BAL' for each value of 'AGE'. 

2. Compute the z-scores of the variable 'PAY_AMT1'.

3. Write a function to compute general statistical information on any given pandas dataframe column, and apply it to the BILL_AMT2 column. The output should follow this format (you can add other statistical functions as well!): 
````
ComputeStats(dataframe,'BILL_AMT2')
>>> {'max': 983931.0, 'min': -69777.0, 'median': 21197.0, 'mean': 49338.90635242128}
````
4. Write a function to calculate the number of unique values in each variable of a dataframe and apply it to the df_twn dataframe. 

5. Write a function to define a variable as continuous or categorical based on the number of unique values (you can use a threshold, e.g. more than 20 unique values is considered continuous) and apply it to all columns in df_twn. The output should be a dictionary.


In [5]:
 
# 2.1) Compute the coefficient of variation of the variable ‘LIMIT_BAL’ for each value of ‘AGE’.
print("2.1: \n",df_twn.groupby(['AGE'])['LIMIT_BAL'].apply(lambda x: x.std()/x.mean()).reset_index().rename(columns={'LIMIT_BAL':'CV'}))

# 2.2) Compute the z-scores of the variable ‘PAY_AMT1’.
z_scores = (df_twn.PAY_AMT1 - df_twn.PAY_AMT1.mean())/df_twn.PAY_AMT1.std()
print("2.2: \n",pd.DataFrame({'PAY_AMT1': df_twn.PAY_AMT1,'z_scores':z_scores}).head(5))

# 2.3) Define a function to compute general statistical information on any given pandas dataframe column. 
def ComputeStats(df,variable=None): 
    '''
    Returns statistical information of any given variable
    '''
    stats_ = df.select_dtypes(['int64','float64']).apply(lambda x: {'mean': x.mean(),'median': x.median(),'max': x.max(),'min': x.min()},axis=0)
    
    if not variable: 
        return stats_
    else:
        return stats_[str(variable)]
        
print("\n2.3: ",ComputeStats(df_twn,'BILL_AMT2'))


# 2.4) Write a function to calculate the number of unique values in each variable.
def CountUnique(df): 
    '''
    Returns a dataframe containing the number of unique values in each variable
    '''
    return df.apply(lambda x: len(np.unique(x))).reset_index().rename(columns={'index':'Column',0:'Count Unique'}) #Notice that the default column name of 0 is an integer and not a string.

print("2.4: \n",CountUnique(df_twn))

# 2.5) Write a function to define a variable as continuous or categorical based on the number of unique values
def find_type_vars(df, lim):
    '''
    Labels each variable as continuous (more than lim unique values) or categorical
    '''
    unique = df.apply(lambda x: len(np.unique(x)))
    mask = np.array(unique.values > lim) #Continuous: use the threshold passed in, not a hardcoded 20
    inverse_mask = ~mask #Categorical
    variables = df.columns.values
    
    return{
            "Continuous": variables[mask].tolist(),
            "Categorical": variables[inverse_mask].tolist()
            }

print("2.5: \n",find_type_vars(df_twn, 20))

Consider the following random value generator: 

````
def GenerateRandom(domain,values):
    '''
    Generates a list (of size 'values') of random integers between 0 and 'domain'-1
    '''
    import random
    return np.unique([random.randint(0,domain-1) for x in range(values)])
````
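
One caveat about `GenerateRandom`: `random.randint` samples with replacement and `np.unique` then drops the repeats, so the function can return fewer than `values` indices. If exactly *N* distinct positions are needed, one possible alternative is to sample without replacement. The sketch below is only an illustration; the name `GenerateRandomUnique` is hypothetical and not part of the original notebook.

```python
import random
import numpy as np

def GenerateRandomUnique(domain, values):
    '''
    Hypothetical variant of GenerateRandom: returns exactly 'values' distinct
    integers between 0 and 'domain'-1, sorted (sampling without replacement).
    Requires values <= domain.
    '''
    return np.sort(random.sample(range(domain), values))

random.seed(0)  # fixed seed only so that the sketch is repeatable
idx = GenerateRandomUnique(100, 50)
print(len(idx), len(np.unique(idx)))
```

Here both printed counts are 50, because `random.sample` never picks the same position twice.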

1. Write a function that uses the function above to **randomly** replace *N* values in a specified pandas dataframe column with missing (`np.nan`) values. Use your function to replace **50** values in the 'LIMIT_BAL' column of df_twn with missing values.

2. Generalize the approach of 3.1 to apply the same function to *multiple* columns. Use your function to replace **50** values in each of the following columns: 'PAY_AMT1', 'BILL_AMT3'.

3. Write a function to check the <b>number</b> of missing values in each column of a pandas dataframe. 

*hint* - Use ``df[col].isna().sum()`` to count how many NULL values are inside a Series.



In [7]:
 

import random

def GenerateRandom(domain,values): 
    return np.unique([random.randint(0,domain-1) for x in range(values)])

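def InsertMissing(data,variable,index):
    '''
    Sets the rows in index to NaN for the given column(s), modifying data in place
    '''
    # Assign directly on data; writing to a slice such as data[variable] would only modify a copy.
    data.loc[index,variable] = np.nan
    return data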
    
index_list = GenerateRandom(df_twn.shape[0],50) 
# print(index_list) #list of 50 random numbers matching size of one column in df_twn
InsertMissing(df_twn,['LIMIT_BAL'],index_list)
print ("3.1: \n",df_twn['LIMIT_BAL'].head(5))

InsertMissing(df_twn,['PAY_AMT1','BILL_AMT3'],index_list)
print ("3.2: \n",df_twn[['PAY_AMT1','BILL_AMT3']].head(5))

def CountNaN(df):
    '''
    Counts the number of missing values per variable
    '''
    return pd.DataFrame([(i,df[i].isna().sum()) for i in df.columns.values],columns=['Column', '# Missing'])

print ("3.3: \n",CountNaN(df_twn))



Suppose we want to detect the records in our dataframe that are outliers. One method is to use the interquartile range (IQR). The IQR is given by \\(\mathrm{IQR} = Q_{3} - Q_{1}\\), where \\(Q_{3}\\) is the third quartile and \\(Q_{1}\\) is the first quartile. Outliers are records which lie outside of the range \\( [Q_{1} - 1.5 \cdot \mathrm{IQR}, Q_{3} + 1.5 \cdot \mathrm{IQR}] \\), that is, a value \\(x\\) is an outlier if \\( x < Q_{1} - 1.5 \cdot \mathrm{IQR} \\) or \\( x > Q_{3} + 1.5 \cdot \mathrm{IQR} \\). This is the same logic that is used to draw box plots:

<img src="https://miro.medium.com/max/11250/1*2c21SkzJMf3frPXPAR_gZA.png" title="Source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51" width="500"/>
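
To make the rule concrete, here is a minimal sketch on a made-up sample with one obviously extreme value (the data is hypothetical). It relies on the default linear interpolation of `Series.quantile` and returns a boolean mask rather than index positions.

```python
import pandas as pd

# Hypothetical toy sample with a single extreme value
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = s.quantile(0.25)    # first quartile: 3.0
q3 = s.quantile(0.75)    # third quartile: 7.0
iqr = q3 - q1            # 4.0
lower = q1 - 1.5 * iqr   # -3.0
upper = q3 + 1.5 * iqr   # 13.0

is_outlier = (s < lower) | (s > upper)
print(lower, upper)
print(s[is_outlier].tolist())
```

Only the value 100 falls outside the range \\([-3, 13]\\), so it is the single point flagged as an outlier.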

1. Define a function to compute the IQR outliers of a given pandas dataframe column (return the index of each outlier)

2. Write a function to <b>replace</b> the outliers of all <i>continuous</i> variables with `np.nan` (remember we have defined a function to determine whether a variable is continuous or not in Question 2.5 of this notebook).

3. Write a function to replace results of question 4.2 with the median value of each column.  Double check that no missing values remain by using the function you wrote in question 3.3.


*hint* - use ``df.quantile(q)`` to calculate the \\(q\\) quantile, where \\(0 \le q \le 1\\)


<b>Bonus</b>: Rewrite the function from question 4.1 using the z-score and modified z-score methods for outlier detection.
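
One possible approach to the bonus is sketched below; it is not a reference solution. The function names and the cutoffs (3.0 for the z-score, 3.5 for the modified z-score) are assumptions, as is the constant 0.6745, which is the scaling conventionally used with the median absolute deviation (MAD). The modified z-score replaces the mean with the median and the standard deviation with the MAD, so a single extreme value distorts it much less.

```python
import pandas as pd

def OutlierZScore(data, threshold=3.0):
    '''Index labels of values whose |z-score| exceeds threshold (assumed cutoff)'''
    z = (data - data.mean()) / data.std()
    return data.index[z.abs() > threshold]

def OutlierModZScore(data, threshold=3.5):
    '''Index labels of values whose |modified z-score| exceeds threshold (assumed cutoff)'''
    med = data.median()
    mad = (data - med).abs().median()      # median absolute deviation
    mod_z = 0.6745 * (data - med) / mad    # not meaningful when mad == 0
    return data.index[mod_z.abs() > threshold]

# Hypothetical toy sample with a single extreme value at label 8
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(list(OutlierZScore(s)))
print(list(OutlierModZScore(s)))
```

On this toy sample the plain z-score flags nothing, because the extreme value inflates both the mean and the standard deviation, while the modified z-score flags label 8. This masking effect is one reason to prefer median-based methods on small samples.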




In [9]:
 
# 4.1) Define a function to compute the IQR outliers of a given pandas dataframe column (return the index of each outlier)

def OutlierIQR(data):
    '''
    Computes the index labels of the corresponding outlier values according to IQR
    '''
    Q1 = data.quantile(q = 0.25)
    Q3 = data.quantile(q = 0.75)
    IQR = Q3 - Q1
    LB = Q1 - (IQR*1.5)
    UB = Q3 + (IQR*1.5)
    # Return labels (not positions) so that the result can be used safely with .loc
    outlier = data.index[(data > UB) | (data < LB)]
    return outlier
 

print("4.1: ",OutlierIQR(df_twn['BILL_AMT1']))

# 4.2: Write a function to replace the outliers of all continuous variables with np.nan 
def find_type_vars(df, lim):
    '''
    Labels each variable as continuous (more than lim unique values) or categorical
    '''
    unique = df.apply(lambda x: len(np.unique(x)))
    mask = np.array(unique.values > lim) #Continuous: use the threshold passed in, not a hardcoded 20
    inverse_mask = ~mask #Categorical
    variables = df.columns.values
    
    return{
            "Continuous": variables[mask].tolist(),
            "Categorical": variables[inverse_mask].tolist()
            }

def RemoveOutliers(data,variables):
    '''
    Function to replace the IQR outliers of the given variables with NaN (modifies data in place)
    '''
    for var in variables: 
        # print(var)
        index_list = OutlierIQR(data[str(var)])
        data.loc[index_list,str(var)] = np.nan

print("\n 4.2:")
continuous=find_type_vars(df_twn,20)['Continuous']
RemoveOutliers(df_twn,continuous)
print(df_twn[continuous].head(5))

# 4.3: Write a function to replace results of question 4.2 with the median value of each column.  Double check that no missing values remain by using the function you wrote in question 3.3.
def InputMethod(x): 
    return {'median':x.median(),'mean':x.mean(),'zero':0}
    
def ReplaceMissing(data): 
    '''
    Returns a copy of data in which missing values are replaced by the median of each column
    '''
    return data.fillna(data.median())
    

df_twn2 = ReplaceMissing(df_twn)
print(df_twn2[continuous].head(5))
def CountNaN(df):
    '''
    Counts the number of missing values per variable
    '''
    return pd.DataFrame([(i,df[i].isna().sum()) for i in df.columns.values],columns=['Column', '# Missing'])

print ("4.3: Count Missing in df_twn\n",CountNaN(df_twn))
print ("4.3: Count Missing in df_twn2\n",CountNaN(df_twn2))
