## **PCA on a credit scoring dataset with visualization using the Bokeh library**

#### Manuel Sánchez-Montañés

In [None]:
COLAB = True

First we import the libraries we will need and suppress warnings so that the cell outputs stay readable.

In [None]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## **Data Load**

Now we load the database:

In [None]:
if COLAB:
  !mkdir -p datasets
  aux = "'https://docs.google.com/uc?export=download&id=1LRagJb0IJPIR9evKpU5v5rpBhKbAxAQw&confirm=t'"
  !wget $aux -O ./datasets/credit_scoring.csv

  aux = "'https://docs.google.com/uc?export=download&id=1KLLKiw0qVFLVYtjZzdE5sijtdv5p9DJ6&confirm=t'"
  !wget $aux -O ./datasets/credit_scoring_Data_Dictionary.xls


data = pd.read_csv('./datasets/credit_scoring.csv', delimiter=',', header=0)
data.drop(data.columns[0], axis='columns', # drop the first column
          inplace=True)

class_column = 'SeriousDlqin2yrs'
classes_names = data[class_column].unique()

print('\033[1m' + 'Credit scoring database\n', '\033[0m')
print('Number of real classes: %d' % len(classes_names))
print('Unique class labels:', classes_names, '\n')
print('\033[1m' + 'First 5 instances:' + '\033[0m')
data.head()

## **Data Description**

In [None]:
description = pd.read_excel('./datasets/credit_scoring_Data_Dictionary.xls',
                            header=1)
pd.set_option('display.max_colwidth', 200)
description

## **Data Exploration**

In [None]:
type(data)

In [None]:
data.describe().T[["count", "min", "max", "mean", "std"]]

### **Outliers**

In [None]:
# the values 96 and 98 appear to encode missing values (NaNs):

aux = ["NumberOfTime30-59DaysPastDueNotWorse", "NumberOfTime60-89DaysPastDueNotWorse", "NumberOfTimes90DaysLate"]

for a in aux:
    print(data[a].unique())

In [None]:
# set those values to NaN (mask() avoids chained assignment):

for a in aux:
    data[a] = data[a].mask(data[a] >= 96)
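
The sentinel replacement can also be applied to several columns at once. The following is a minimal sketch on a toy frame, assuming (as the inspection above suggests) that codes of 96 and above are not real counts. The column names `late_30_59` and `late_90` and their values are hypothetical stand-ins, not columns of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the delinquency-count columns (hypothetical values).
toy = pd.DataFrame({
    "late_30_59": [0, 1, 98, 2, 96],
    "late_90": [0, 98, 0, 96, 3],
})

# mask() replaces every entry where the condition is True with NaN
# and leaves the remaining entries untouched.
cleaned = toy.mask(toy >= 96)

print(cleaned)
print(cleaned.isna().sum())
```

Because `mask` returns a new object, the original frame is left unchanged and no chained-assignment warning can occur.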

In [None]:
# in addition, drop from the dataset minors (??) and people older than 80:

data = data[(data["age"]>=18) & (data["age"]<=80)]

### **Missing values**

In [None]:
data.isnull().sum()

In [None]:
clean_data = data.copy()
medians = data.median()
#clean_data.dropna(axis=0, inplace=True)
clean_data.fillna(medians, inplace=True)
clean_data.isnull().sum()
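
The cell above fills missing values with medians computed over the whole dataset. If a model is going to be evaluated on a held-out test set, one possible alternative is to learn the medians from the training rows only. The sketch below illustrates that idea with scikit-learn's `SimpleImputer` on small made-up arrays; `X_tr` and `X_te` are hypothetical and are not part of this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test matrices with missing entries.
X_tr = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 50.0]])
X_te = np.array([[np.nan, 20.0], [4.0, np.nan]])

# Medians are learned from the training rows only ...
imputer = SimpleImputer(strategy="median")
X_tr_filled = imputer.fit_transform(X_tr)

# ... and reused unchanged on the test rows.
X_te_filled = imputer.transform(X_te)

print(imputer.statistics_)  # per-column medians of the training data
print(X_te_filled)
```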

In [None]:
attribute_names = list(clean_data.columns)
attribute_names.remove(class_column)

print(class_column)
print(classes_names)
print(attribute_names)

### **Statistics**

In [None]:
clean_data.hist(bins=20, figsize=(12,16), layout=(4,3));

### Log-transform some variables:

In [None]:
clean_data["MonthlyIncome"] = np.log(1+clean_data["MonthlyIncome"])
clean_data["DebtRatio"] = np.log(1+clean_data["DebtRatio"])
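
The transformation `log(1 + x)` is also available directly as `np.log1p`, with `np.expm1` as its inverse. The short sketch below, on made-up income values, checks that both forms agree and shows why `log1p` is usually preferred for very small inputs.

```python
import numpy as np

# Made-up, heavily skewed values standing in for an income-like variable.
income = np.array([0.0, 1500.0, 3000.0, 250000.0])

manual = np.log(1 + income)   # form used in the cell above
stable = np.log1p(income)     # equivalent built-in

print(np.allclose(manual, stable))

# expm1 undoes log1p, which is handy when reporting results in original units.
recovered = np.expm1(stable)
print(recovered)

# For tiny inputs 1 + x rounds to 1.0, so log(1 + x) loses the value entirely.
tiny = 1e-18
print(np.log(1 + tiny), np.log1p(tiny))
```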

In [None]:
clean_data.hist(bins=20, figsize=(12,16), layout=(4,3));

In [None]:
attribute_names

In [None]:
class_column

In [None]:
X = clean_data[attribute_names].values
y = clean_data[class_column].values

In [None]:
X.shape

In [None]:
X.var(axis=0)

# **PCA when a model will be built afterwards (training-test split)**

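After loading the dataset we need some basic preprocessing: standardization. We first split the data into training and test sets, then fit the scaler and the PCA on the training set only and apply them to both sets.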

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print(X.shape, y.shape)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
scaler.fit(X_train)

X_std_train = scaler.transform(X_train)
X_std_test  = scaler.transform(X_test)

pca = PCA(n_components=2)
pca.fit(X_std_train)

X_pca_train = pca.transform(X_std_train)
X_pca_test  = pca.transform(X_std_test)
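
The same fit-on-training, transform-both pattern can be written more compactly with a scikit-learn pipeline. The sketch below uses synthetic data (`X_demo` is a hypothetical stand-in for `X`, and all variable names are chosen so that they do not overwrite the notebook's own) and also reads how much variance the two components retain.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for X: 500 rows, 10 features on very different scales.
X_demo = rng.normal(size=(500, 10)) * np.arange(1, 11)

X_tr, X_te = train_test_split(X_demo, test_size=0.3, random_state=1)

# Scaling and PCA are chained; fit_transform learns both from training data only.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
Z_tr = pipe.fit_transform(X_tr)
Z_te = pipe.transform(X_te)  # test data reuses the training mean, std and axes

# Fraction of the standardized training variance kept by each component.
ratio = pipe.named_steps["pca"].explained_variance_ratio_

print(Z_tr.shape, Z_te.shape)
print(ratio, ratio.sum())
```

Keeping both steps in one object makes it harder to accidentally fit the scaler or the PCA on test data.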

In [None]:
X_std_train.var(axis=0)

In [None]:
X_std_test.var(axis=0)
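
The two variance checks above are expected to differ slightly: the scaler is fitted on the training set, so the training variances equal 1 up to floating-point error, while the test variances are only close to 1. A small sketch on synthetic data (all names hypothetical) reproduces this behaviour.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic training and test samples drawn from the same distribution.
A_train = rng.normal(loc=5.0, scale=3.0, size=(700, 4))
A_test = rng.normal(loc=5.0, scale=3.0, size=(300, 4))

# Mean and standard deviation are estimated from the training sample only.
sc = StandardScaler().fit(A_train)

var_train = sc.transform(A_train).var(axis=0)
var_test = sc.transform(A_test).var(axis=0)

print(var_train)  # 1 for every feature, up to rounding
print(var_test)   # near 1, but not exactly 1
```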