<center>

# Should This Loan be Approved or Denied?

## *Book Two: Univariate Data Exploration*

</center>

### ABSTRACT

For this case-study assignment, students assume the role of loan officer at a bank and are asked to approve or deny a loan by assessing its risk of default using logistic regression. 

---

## I. Environment

### 1. Libraries
Import the libraries used in the notebook, grouped by their role.

In [None]:
# Mad Math modules
from scipy.stats import skewtest
from scipy.stats import kurtosistest
import numpy as np
#
# data modules
import pandas as pd
from MungingOps import GetDataDictionary
#
# visualization modules
import matplotlib.pyplot as plt
from matplotlib import gridspec
import seaborn as sns
from IPython.display import display
from IPython.display import HTML
#
#
from warnings import filterwarnings

### 2. Options

In [None]:
# Silence
filterwarnings('ignore')
# allow bottleneck
pd.options.compute.use_bottleneck = True

### 3. Styling

In [None]:
#
pd.set_option("display.precision", 2)
pd.options.display.float_format = '{:.2f}'.format
pd.set_option("display.max.columns", None)
#--
accent = '#705096' # deep purple
shades = '#c5b4d7' # light purple
silver = '#f0e6fa' # purple silver
gray = '#595959'
fill = '#DAF7A6' # lemon
#
# Apply the style; sns.axes_style() only returns a dict and changes nothing
sns.set_style("whitegrid", {#'figure.facecolor': 'white',
                #'xtick.direction': 'out',
                #'ytick.direction': 'out',
                #'xtick.color': '.15',
                #'ytick.color': '.15',
                'xtick.top': False,
                'ytick.right': False,
                'xtick.bottom': False,
                'ytick.left': False,
                'axes.axisbelow': True,
                'grid.linestyle': ':',
                'grid.color': shades,#'white',
                'text.color': gray,#'.15',
                'font.family': ['sans-serif'],
                'font.sans-serif': ['Verdana',
                                    'Arial',
                                    'DejaVu Sans',
                                    'Liberation Sans',
                                    'Bitstream Vera Sans',
                                    'sans-serif'],
                #'lines.solid_capstyle': 'round',
                #'patch.edgecolor': 'w',
                'patch.force_edgecolor': True,
                #'image.cmap': 'rocket',
                'axes.grid': True,
                'axes.labelcolor': gray,#'.15',
                'axes.facecolor': 'white',#'#EAEAF2',
                'axes.edgecolor': 'white',
                #'axes.spines.left': True,
                #'axes.spines.bottom': True,
                #'axes.spines.right': True,
                #'axes.spines.top': True,
                })
palette = [accent,shades,silver,gray]

### 4. Custom Functions

Variable's Basic Information

In [None]:
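def display_info(var):
    """Display a one-row summary table for a single pandas Series."""
    info = {}
    # skewtest/kurtosistest return (z-score, p-value); the z-score is kept.
    # Both fail on non-numeric data, in which case the column is omitted.
    try:
        info['Skewness'] = skewtest(var)[0]
    except Exception:
        pass
    try:
        info['Kurtosis'] = kurtosistest(var)[0]
    except Exception:
        pass
    #
    info['dtypes'] = var.dtypes
    info['unique'] = var.unique()
    info['count'] = len(var)
    info['nulls'] = var.isnull().sum()
    info['nulls %'] = '{:.2f}'.format(100*(var.isnull().sum()/len(var)))
    info['memory mb'] = '{:.2f}'.format(var.memory_usage()/1024/1024)
    # A list with one dict yields a single row, so no value is lost
    display(pd.DataFrame([info]))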
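
`display_info` keeps the first element returned by `skewtest` and `kurtosistest`, which is the z-score of the test rather than the skewness or kurtosis itself. The sketch below is not part of the original notebook: it reports the descriptive statistics next to the test results so the two can be compared. `shape_summary` is a hypothetical helper and the exponential sample is synthetic.

```python
import numpy as np
from scipy.stats import kurtosis, kurtosistest, skew, skewtest


def shape_summary(values):
    """Return descriptive shape statistics next to their test z-scores."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # drop NaN before testing
    skew_z, skew_p = skewtest(values)
    kurt_z, kurt_p = kurtosistest(values)
    return {
        'skewness': float(skew(values)),             # sample skewness
        'skew_z': float(skew_z),                     # z-score of the skewness test
        'skew_p': float(skew_p),
        'excess_kurtosis': float(kurtosis(values)),  # Fisher definition, 0 for a normal
        'kurt_z': float(kurt_z),
        'kurt_p': float(kurt_p),
    }


# Synthetic right-skewed sample (assumption: stands in for a real column)
rng = np.random.default_rng(0)
summary = shape_summary(rng.exponential(size=1000))
print(summary)
```

With large samples the z-score grows with the sample size, so it should not be read against the skewness cut-offs used for the descriptive statistic.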

Display Graphs for Basic Univariate Analysis

In [None]:
#
def draw_basic_univariate(i, df, bins=2, discrete=False, xscale='linear',yscale='linear'):
    # source: https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
    # Rules of thumb used to choose the scales:
    # kurtosis > 2: not normal (Weibull-like), use a log Y scale
    # kurtosis < 2: normal, use a linear Y scale
    #
    # If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
    # If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are
    # moderately skewed. If the skewness is less than -1 or greater than 1,
    # the data are highly skewed.
    # For highly skewed data use a log X scale, or create a log_variable
    #
    dta = np.array(df[i])
    #
    try:
        #
        plt.figure(figsize=(24, 4))
        plt.xscale(xscale)
        plt.yscale(yscale)
        plt.xticks(rotation=90)
        #
        ax = gridspec.GridSpec(1, 3, width_ratios=[1, 1, 1]) 
        plt.suptitle(str(i).replace('_',' ').upper()+'\n\n ---',
                     fontdict = {'color': gray}
                     )
        
        #
        ax0 = plt.subplot(ax[0])
        '''
        ax0 = sns.barplot(data=df, x=df[i],  color='#DAF7A6')
        '''
        ax0 = sns.countplot(#data = df, 
                            color=fill, 
                            #hue=df[i].unique(),
                            #edgecolor= "#555555",
                            x = df[i],
                            #y = df[i].value_counts()
        )
        ax0.set_title('Counting')
        ax0.set_xlabel('value')
        ax0.set_ylabel('count')
        ax0.tick_params(axis='x', rotation=90)
        #
        ax1 = plt.subplot(ax[1])
        try:
            ax1 = sns.boxplot(x=df[i].value_counts(),
                            color=fill,
                            flierprops={"marker": "x"},
                            medianprops={"color": '#aa22bb'})
            ax1.set_title('Outliers')
            ax1.set_xlabel('value')
        except: None #display_info(df[i])
        #
        ax2 = plt.subplot(ax[2])
        #
        #
        # Both cases share every setting; None keeps seaborn's default
        # binning when `discrete` is False
        ax2 = sns.histplot(
                        bins=bins,
                        x=dta,
                        discrete=True if discrete else None,
                        kde=False,
                        color=fill,
                        #edgecolor='#555555',
                        stat='probability',
                        )
        ax2.set_title('Histogram')
        ax2.set_xlabel('value')
        ax2.set_ylabel('probability')
        #
        plt.show()
    except ValueError as err:
        # Series.info() prints its own report and returns None
        df[i].info()
        print()
        print(i, err)
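
The skewness rule of thumb quoted in the comments of `draw_basic_univariate` can be sketched as a small helper. This is an illustration only: `classify_skewness` and `suggest_xscale` are hypothetical names, and assigning the boundary values (exactly 0.5 or 1) to the milder category is an assumption.

```python
def classify_skewness(skewness):
    """Label a skewness value using the 0.5 and 1 cut-offs (absolute value)."""
    magnitude = abs(skewness)
    if magnitude <= 0.5:
        return 'fairly symmetrical'
    if magnitude <= 1:
        return 'moderately skewed'
    return 'highly skewed'


def suggest_xscale(skewness):
    """Suggest a log X scale only for highly skewed data."""
    return 'log' if classify_skewness(skewness) == 'highly skewed' else 'linear'


for value in (0.2, -0.8, 2.5):
    print(value, classify_skewness(value), suggest_xscale(value))
```

A log scale only makes sense for strictly positive values, so the suggestion still needs a check against the data range.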

---

## II. Load the data

In [None]:
# Dataset
df = pd.read_pickle('./dta_01_dataset_dta.dd',compression='gzip')

In [None]:
df.head()

Variable group identification: count the unique values in each column.

In [None]:
tmp = {}
for i in df.columns:
    tmp[i] = len(df[i].unique())

In [None]:
tmp = pd.Series(tmp).sort_values()
display(tmp)

In [None]:
# dummy variables: columns with fewer than 4 unique values
dummy_vars = list(tmp[tmp < 4].index)
dummy_vars

In [None]:
# define possible discrete variables
discrete_vars = list(set(list(tmp[tmp < 57].index)) - set(dummy_vars))
#
discrete_vars

In [None]:
# define possible non-discrete variables
non_discrete_vars = list(tmp[tmp > 56].index)
non_discrete_vars
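
The three cells above can be condensed into one function. The sketch below assumes the same cut-offs used in the notebook (fewer than 4 unique values for dummies, fewer than 57 for discrete variables); `group_by_cardinality` and the toy column names are hypothetical.

```python
import pandas as pd


def group_by_cardinality(frame, dummy_max=3, discrete_max=56):
    """Split column names into dummy, discrete and non-discrete groups."""
    # dropna=False counts NaN as a value, matching len(series.unique())
    counts = frame.nunique(dropna=False).sort_values()
    dummy = list(counts[counts <= dummy_max].index)
    discrete = list(counts[(counts > dummy_max) & (counts <= discrete_max)].index)
    non_discrete = list(counts[counts > discrete_max].index)
    return dummy, discrete, non_discrete


# Toy data (assumption): column names are illustrative only
toy = pd.DataFrame({
    'flag': [0, 1] * 50,
    'zone': [n % 10 for n in range(100)],
    'amount': [n * 1.5 for n in range(100)],
})
print(group_by_cardinality(toy))
```

Filtering with a boolean mask keeps the columns ordered by cardinality, whereas a set difference returns them in arbitrary order.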

---

## II. Univariate Data Analysis

General view of the non-categorical variables.

### 1. Dummy Variable Analysis

In [None]:
dummy_vars

**new_business**

Examine its central tendency and dispersion.

In [None]:
display_info(df.new_business)


In [None]:
draw_basic_univariate('new_business', df, bins=3, discrete=True)

*observations:*
- The variable new_business is a highly skewed dummy variable without nulls, and its kurtosis confirms that its values are unbalanced

**insights:**

1. The series is unbalanced
2. Most records are not new businesses
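
For a 0/1 variable the skewness depends only on the share of ones, p, through (1 - 2p) / sqrt(p(1 - p)), so a high skewness is another way of saying that one class dominates. The sketch below checks this against `scipy.stats.skew` on a synthetic column; `dummy_skewness` is a hypothetical helper and the 10%/90% split is an assumption, not a figure from the dataset.

```python
import numpy as np
from scipy.stats import skew


def dummy_skewness(values):
    """Skewness of a 0/1 variable computed from its share of ones."""
    p = float(np.mean(values))
    return (1 - 2 * p) / np.sqrt(p * (1 - p))


# Synthetic dummy column (assumption): 10% ones, 90% zeros
flags = np.array([1] * 10 + [0] * 90)
print(dummy_skewness(flags))   # closed form from the proportion
print(skew(flags))             # same quantity computed by scipy
```

The formula is undefined when the column contains only zeros or only ones.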

In [None]:
display(HTML('<hr>'))
dummy_vars.pop(dummy_vars.index('new_business'))
dummy_vars

**low_doc_loan**

In [None]:
display_info(df.low_doc_loan)


In [None]:
draw_basic_univariate('low_doc_loan', df, bins=3, discrete=True)

In [None]:
display(HTML('<hr>'))
dummy_vars.pop(dummy_vars.index('low_doc_loan'))
dummy_vars

**revolving_credit**

In [None]:
display_info(df.revolving_credit)


In [None]:
draw_basic_univariate('revolving_credit', df, bins=3, discrete=True)

In [None]:
display(HTML('<hr>'))
dummy_vars.pop(dummy_vars.index('revolving_credit'))
dummy_vars

**urban**

In [None]:
display_info(df.urban)


In [None]:
draw_basic_univariate('urban', df, bins=3, discrete=True)

In [None]:
display(HTML('<hr>'))
dummy_vars.pop(dummy_vars.index('urban'))
dummy_vars

**loan_status**

In [None]:
display_info(df.loan_status)


In [None]:
draw_basic_univariate('loan_status', df, bins=3, discrete=True)

**observations**
- Variables of this kind are highly skewed
    - I recommend analyzing a log-transformed version, log_approval_fiscal_year
    - This effect is apparently caused by a deceleration in loan openings
- gross_amount_outstanding has a high kurtosis

**Observations**

- The dataset is highly unbalanced
    - Most loans are not in the low-doc class
    - In general, the credit type is not revolving
    - new_business is mostly 0
    - The loans are mainly urban
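
The imbalance noted above can be put into numbers with relative frequencies. This is a minimal sketch on a synthetic column; `class_shares` and `majority_share` are hypothetical helpers, and the 85/15 split is an assumption rather than a result from the dataset.

```python
import pandas as pd


def class_shares(series):
    """Return the relative frequency of each value, NaN included."""
    return series.value_counts(normalize=True, dropna=False)


def majority_share(series):
    """Share of rows that belong to the most frequent value."""
    # value_counts sorts in descending order, so the first entry is the largest
    return float(class_shares(series).iloc[0])


# Synthetic example (assumption): 85 zeros and 15 ones
example = pd.Series([0] * 85 + [1] * 15, name='example_flag')
print(class_shares(example))
print(majority_share(example))
```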

### 2. Discrete Variable Analysis

In [None]:
display(HTML('<hr>'))
#discrete_vars.pop(discrete_vars.index('XXX'))
discrete_vars

**borrower_national_zone**

In [None]:
df.borrower_national_zone = df.borrower_national_zone.astype(float)

In [None]:
display_info(df.borrower_national_zone)


In [None]:
draw_basic_univariate('borrower_national_zone', df, bins=(len(df.gross_amount_outstanding)+1), discrete=True)

In [None]:
display(HTML('<hr>'))
discrete_vars.pop(discrete_vars.index('borrower_national_zone'))
discrete_vars

**bank_state**

In [None]:
display_info(df.bank_state)

In [None]:
import plotly.express as px

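# One row per state with its loan count ('loans' is an illustrative column name)
state_counts = df.bank_state.value_counts().rename_axis('bank_state').reset_index(name='loans')
fig = px.choropleth(state_counts, locations='bank_state', locationmode="USA-states", color='loans', color_continuous_scale='aggrnyl', scope="usa")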
fig.show()

In [None]:
draw_basic_univariate('bank_state', df, bins=(len(df.gross_amount_outstanding)+1), discrete=True)

In [None]:
display(HTML('<hr>'))
discrete_vars.pop(discrete_vars.index('bank_state'))
discrete_vars

**approval_fiscal_year**

In [None]:
display_info(df.approval_fiscal_year)


In [None]:
draw_basic_univariate('approval_fiscal_year', df, bins=(len(df.approval_fiscal_year)+1), discrete=True)

In [None]:
display(HTML('<hr>'))
discrete_vars.pop(discrete_vars.index('approval_fiscal_year'))
discrete_vars

**gross_amount_outstanding**

In [None]:
df.gross_amount_outstanding = df.gross_amount_outstanding.astype(float)
df['gross_amount_outstanding_LOG'] = np.log10(df.gross_amount_outstanding)
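
`np.log10` returns `-inf` for zero and `NaN` for negative values, and with warnings silenced this passes unnoticed. A minimal sketch of a guarded transform follows; `safe_log10` is a hypothetical helper, and mapping non-positive amounts to `NaN` is an assumption about how such rows should be treated.

```python
import numpy as np
import pandas as pd


def safe_log10(series):
    """log10 of a numeric Series; non-positive values become NaN."""
    values = series.astype(float)
    # where() replaces non-positive entries with NaN before the log is taken
    return np.log10(values.where(values > 0))


# Toy amounts (assumption): includes a zero and a negative value
amounts = pd.Series([0.0, 10.0, 1000.0, -5.0])
print(safe_log10(amounts).tolist())
```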

In [None]:
display_info(df.gross_amount_outstanding)


In [None]:
draw_basic_univariate('gross_amount_outstanding_LOG', df, bins=(len(df.gross_amount_outstanding)*0.1), discrete=True)

In [None]:
display(HTML('<hr>'))
discrete_vars.pop(discrete_vars.index('gross_amount_outstanding'))
discrete_vars

**borrower_state**

In [None]:
display_info(df.borrower_state)

In [None]:
draw_basic_univariate('borrower_state', df, bins=(len(df.gross_amount_outstanding)+1), discrete=True)

In [None]:
display(HTML('<hr>'))
discrete_vars.pop(discrete_vars.index('borrower_state'))
discrete_vars

**two_digit_naic**

In [None]:
display_info(df.two_digit_naic)


In [None]:
draw_basic_univariate('two_digit_naic', df, bins=(len(df.gross_amount_outstanding)*0.1), discrete=True)

### 3. Non-Discrete Variable Analysis

In [None]:
display(HTML('<hr>'))
#discrete_vars.pop(discrete_vars.index('XXX'))
non_discrete_vars

**created_jobs**

In [None]:
display_info(df.created_jobs)

In [None]:
# bins must be an integer, so the 10% share is truncated with int()
draw_basic_univariate('created_jobs', df, bins=int(len(df.created_jobs)*0.1), discrete=False)

In [None]:
display(HTML('<hr>'))
non_discrete_vars.pop(non_discrete_vars.index('created_jobs'))
non_discrete_vars

---

## III. Bivariate Data Analysis

The first stage is to identify the relationships between a continuous variable and its non-continuous (probably categorical) counterpart.

The second stage is to identify the correlations between variables.
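
The two stages described above can be sketched with pandas alone. The data is synthetic and the column names (`group`, `x`, `y`) are assumptions used only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): names and values are illustrative only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'group': np.repeat(['a', 'b'], 100),
    'x': rng.normal(size=200),
})
toy['y'] = 2 * toy['x'] + rng.normal(scale=0.1, size=200)

# Stage 1: a continuous variable summarised within each category
by_group = toy.groupby('group')['x'].describe()

# Stage 2: pairwise Pearson correlations between the numeric columns
correlations = toy.select_dtypes('number').corr()

print(by_group)
print(correlations)
```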

---

## IV. Multivariate Analysis

The third stage is to identify how the relationships between the variables vary.

Linear regression of x and y, and how it varies across a categorical variable, with a selected variable used as the class.
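
One way to look at how a linear relationship varies across a categorical variable is to fit a separate line per category. The sketch below uses `np.polyfit` on synthetic data; `slopes_by_category` is a hypothetical helper and the column names are assumptions.

```python
import numpy as np
import pandas as pd


def slopes_by_category(frame, x, y, category):
    """Fit y = slope * x + intercept separately for each category."""
    rows = {}
    for label, part in frame.groupby(category):
        slope, intercept = np.polyfit(part[x], part[y], deg=1)
        rows[label] = {'slope': slope, 'intercept': intercept}
    return pd.DataFrame.from_dict(rows, orient='index')


# Synthetic data (assumption): two categories with different slopes
x = np.tile(np.arange(10.0), 2)
toy = pd.DataFrame({
    'category': ['a'] * 10 + ['b'] * 10,
    'x': x,
})
toy['y'] = np.where(toy['category'] == 'a', 1.0 * x + 2.0, 3.0 * x - 1.0)

fits = slopes_by_category(toy, 'x', 'y', 'category')
print(fits)
```

Slopes that differ clearly between categories suggest that the categorical variable changes the relationship between x and y.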

K-means clustering of x and y.

