# Credit Modeling

In this activity, you will create a classification model with logistic regression using a dataset collected in the 1970s in Germany. You will tune the model hyperparameters with a grid search, then evaluate the model's performance and report the best parameters.

## Instructions

* The dataset was collected in the 1970s in Germany. Each row contains information on a loan, such as the amount of the loan, as well as whether the loan was repaid. See [here](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29) for more information.

* Create a classification model with logistic regression, completing the following tasks:

  * Read in the dataset and check to see if there are rows with null values.
  * Remove all rows with null values.
  * Split the dataset into data (X) and labels (y), then split them further into training and testing datasets. The `kredit` column should be the labels.
  * Create a pipeline with the following estimators: standardization of data, dimensionality reduction, logistic regression.
  * Tune the model hyperparameters with a grid search. Evaluate the performance of the model and the best parameters.

In [7]:
# import dependencies
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd
from pathlib import Path

Info on dataset: [South German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29)

In [8]:
# file path
file = Path('../Resources/german_credit.csv')

#### Read in the dataset and check to see if there are rows with null values.

In [9]:
# read csv
df = pd.read_csv(file)
df.head()

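|  | laufkont | laufzeit | moral | verw | hoehe | sparkont | beszeit | rate | famges | buerge | ... | verm | alter | weitkred | wohn | bishkred | beruf | pers | telef | gastarb | kredit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 18 | 4 | 2 | 1049 | 1 | 2 | 4 | 2 | 1 | ... | 2 | 21 | 3 | 1 | 1 | 3 | 2 | 1 | 2 | 1.0 |
| 1 | 1 | 9 | 4 | 0 | 2799 | 1 | 3 | 2 | 3 | 1 | ... | 1 | 36 | 3 | 1 | 2 | 3 | 1 | 1 | 2 | 1.0 |
| 2 | 2 | 12 | 2 | 9 | 841 | 2 | 4 | 2 | 2 | 1 | ... | 1 | 23 | 3 | 1 | 1 | 2 | 2 | 1 | 2 | 1.0 |
| 3 | 1 | 12 | 4 | 0 | 2122 | 1 | 3 | 3 | 3 | 1 | ... | 1 | 39 | 3 | 1 | 2 | 2 | 1 | 1 | 1 | 1.0 |
| 4 | 1 | 10 | 4 | 0 | 2241 | 1 | 2 | 1 | 3 | 1 | ... | 1 | 48 | 3 | 1 | 2 | 2 | 1 | 1 | 1 | 1.0 |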


In [10]:
# Count null values in each column -> isnull
df.isnull().sum()

laufkont      0
laufzeit      0
moral         0
verw          0
hoehe         0
sparkont      0
beszeit       0
rate          0
famges        0
buerge        0
wohnzeit      0
verm          0
alter         0
weitkred      0
wohn          0
bishkred      0
beruf         0
pers          0
telef         0
gastarb       0
kredit      200
dtype: int64
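
`df.isnull().sum()` counts nulls per column, not per row. To count the rows that contain at least one null, the check can be reduced across columns first. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the credit data
toy = pd.DataFrame({
    "hoehe": [1049, 2799, np.nan, 2122],
    "kredit": [1.0, np.nan, np.nan, 0.0],
})

# Nulls per column (the same view df.isnull().sum() gives)
nulls_per_column = toy.isnull().sum()

# Rows that contain at least one null in any column
rows_with_nulls = int(toy.isnull().any(axis=1).sum())

print(nulls_per_column)
print("rows with at least one null:", rows_with_nulls)
```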

#### Remove all rows with null values.

In [11]:
# Delete rows with null values -> dropna
df = df.dropna()
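
Since the null counts show missing values only in `kredit`, dropping on that column alone gives the same rows here, and it would avoid discarding rows for unrelated reasons if the file changed. A small sketch with a made-up frame (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: labels missing in the last two rows
toy = pd.DataFrame({
    "laufzeit": [18, 9, 12, 12],
    "kredit": [1.0, 0.0, np.nan, np.nan],
})

# Drop rows with a null in any column (what df.dropna() does)
dropped_any = toy.dropna()

# Drop only rows whose label is missing
dropped_label = toy.dropna(subset=["kredit"])

print(len(toy), "->", len(dropped_any))
print(dropped_any.equals(dropped_label))
```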

In [12]:
# Inspect the last rows after dropping nulls
df.tail()

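|  | laufkont | laufzeit | moral | verw | hoehe | sparkont | beszeit | rate | famges | buerge | ... | verm | alter | weitkred | wohn | bishkred | beruf | pers | telef | gastarb | kredit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 795 | 1 | 18 | 4 | 0 | 3966 | 1 | 5 | 1 | 2 | 1 | ... | 1 | 33 | 1 | 1 | 3 | 3 | 2 | 2 | 2 | 0.0 |
| 796 | 1 | 12 | 0 | 3 | 6199 | 1 | 3 | 4 | 3 | 1 | ... | 2 | 28 | 3 | 1 | 2 | 3 | 2 | 2 | 2 | 0.0 |
| 797 | 4 | 21 | 4 | 0 | 12680 | 5 | 5 | 4 | 3 | 1 | ... | 4 | 30 | 3 | 3 | 1 | 4 | 2 | 2 | 2 | 0.0 |
| 798 | 2 | 12 | 2 | 3 | 6468 | 5 | 1 | 2 | 3 | 1 | ... | 4 | 52 | 3 | 2 | 1 | 4 | 2 | 2 | 2 | 0.0 |
| 799 | 1 | 30 | 2 | 2 | 6350 | 5 | 5 | 4 | 3 | 1 | ... | 2 | 31 | 3 | 2 | 1 | 3 | 2 | 1 | 2 | 0.0 |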


#### Split the dataset into data (X) and labels (y), then split them further into training and testing datasets. 

In [14]:
# Separate the dataset into data and target
X = df.drop(['kredit'], axis=1)
y = df['kredit']

In [15]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
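
When `test_size` is not given, `train_test_split` holds out 25% of the rows. Passing `stratify=y` additionally keeps the label ratio the same in both parts, which can matter when one class is rarer. A minimal sketch on synthetic data (the 800 rows and the 70/30 label balance are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 800 rows, 5 features, 70/30 labels (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(800, 5))
y_demo = np.array([1] * 560 + [0] * 240)

# Default test_size is 0.25; stratify=y preserves the label ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```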

#### Create a pipeline with the following estimators: standardization of data, dimensionality reduction, logistic regression.

In [16]:
# Define the pipeline steps: scale, reduce dimensionality, classify
steps = [
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.9)),
    ('lr', LogisticRegression())
]

In [17]:
# Build the pipeline from the steps
pipe = Pipeline(steps)
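
A float `n_components` between 0 and 1 asks `PCA` to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below fits the same three steps on synthetic data from `make_classification` (not the credit data) and inspects how many components were kept:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit features (20 columns, assumed for illustration)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=1
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.9)),
    ("lr", LogisticRegression()),
])
demo_pipe.fit(X_demo, y_demo)

# Fitted PCA step: number of components kept and the variance they explain
pca = demo_pipe.named_steps["pca"]
kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()
print(kept, round(explained, 3))
```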

In [18]:
# Hyperparameter grid: regularization strength C and solver for the 'lr' step
params = {'lr__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
          'lr__solver': ['sag', 'lbfgs']}
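
Grid keys follow the `<step name>__<parameter>` convention, so `lr__C` targets the `C` parameter of the step named `lr`. Valid keys can be listed with `get_params()`. The sketch below checks a grid against them; the `pca__n_components` entry and all grid values are hypothetical additions, not part of the grid used in this activity:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.9)),
    ("lr", LogisticRegression()),
])

# Every parameter name the pipeline exposes
valid_keys = set(demo_pipe.get_params().keys())

# Hypothetical extended grid: also varies the variance kept by PCA
extended_params = {
    "pca__n_components": [0.8, 0.9, 0.95],
    "lr__C": [0.01, 0.1, 1, 10],
}

# Keys not in valid_keys would be rejected as invalid parameters by the search
unknown = [k for k in extended_params if k not in valid_keys]
print(unknown)
```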

#### Tune the model hyperparameters with a grid search.

In [19]:
# Run GridSearchCV
cv = GridSearchCV(pipe, params)
cv.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('pca', PCA(n_components=0.9)),
                                       ('lr', LogisticRegression())]),
             param_grid={'lr__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                         'lr__solver': ['sag', 'lbfgs']})
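
The grid has 7 values of `C` times 2 solvers, so 14 candidates. With scikit-learn's default 5-fold cross-validation, that is 70 fits plus one final refit. Per-candidate scores are stored in `cv_results_`. A sketch on synthetic data follows; `max_iter=1000` is an assumption added here to give `sag` more room to converge, and the names `search` and `results` are illustrative:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumed for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=1)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.9)),
    ("lr", LogisticRegression(max_iter=1000)),  # max_iter raised as an assumption
])
demo_params = {
    "lr__C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "lr__solver": ["sag", "lbfgs"],
}

search = GridSearchCV(demo_pipe, demo_params)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
cols = ["param_lr__C", "param_lr__solver", "mean_test_score",
        "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())
```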

#### Evaluate the performance of the model and the best parameters.

In [20]:
# Evaluate performance: accuracy of the best estimator on the test set
cv.score(X_test, y_test)

0.755
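
With no `scoring` argument, the score above is plain accuracy, which does not show which class the errors fall on. A confusion matrix and per-class report give that breakdown. The sketch below uses synthetic, mildly imbalanced data and a plain logistic regression (`clf`) as stand-ins; none of it comes from the credit dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in for the credit data (assumed for illustration)
X_demo, y_demo = make_classification(
    n_samples=800, n_features=20, weights=[0.3, 0.7], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Per-class precision, recall, and F1
print(classification_report(y_te, y_pred))
```

With a fitted search object, the same calls would take `cv.predict(X_test)` in place of `clf.predict(X_te)`.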

In [21]:
# Extract best params
cv.best_params_

{'lr__C': 0.1, 'lr__solver': 'sag'}
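
Alongside `best_params_`, a fitted search exposes `best_score_` (the mean cross-validated score of the winning combination) and `best_estimator_` (the pipeline refit on the full training set, which is what `score` and `predict` use). A sketch on synthetic data, with a reduced grid chosen only to keep the example quick:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumed for illustration)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(demo_pipe, {"lr__C": [0.1, 1, 10]})  # hypothetical small grid
search.fit(X_tr, y_tr)

# Cross-validated score on the training folds vs. score on held-out test data
cv_score = search.best_score_
test_score = search.score(X_te, y_te)

# search.score delegates to the refit best estimator
refit_score = search.best_estimator_.score(X_te, y_te)
print(search.best_params_, round(cv_score, 3), round(test_score, 3))
```

Comparing `best_score_` with the test score is a quick check on whether the tuned model generalizes about as well as cross-validation suggested.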