<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Description" data-toc-modified-id="Description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Description</a></span></li><li><span><a href="#Load-the-libraries" data-toc-modified-id="Load-the-libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load the libraries</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href="#Data-Processing" data-toc-modified-id="Data-Processing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Processing</a></span></li><li><span><a href="#Modelling" data-toc-modified-id="Modelling-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Modelling</a></span><ul class="toc-item"><li><span><a href="#Train-validation-split" data-toc-modified-id="Train-validation-split-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Train validation split</a></span></li><li><span><a href="#Xgboost" data-toc-modified-id="Xgboost-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Xgboost</a></span></li></ul></li></ul></div>

# Description
Reference: https://datahack.analyticsvidhya.com/contest/all/  


**Predict Loan Eligibility for Dream Housing Finance company**
Dream Housing Finance company deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.

**Data Dictionary**
Train file: a CSV of customers whose loan eligibility is known, stored in the `Loan_Status` column.

| Variable | Description |
| :---|:---|
| Loan_ID | Unique Loan ID |
| Gender | Male/ Female |
| Married | Applicant married (Y/N) |
| Dependents | Number of dependents |
| Education | Applicant education (Graduate/ Not Graduate) |
| Self_Employed | Self employed (Y/N) |
| ApplicantIncome | Applicant income |
| CoapplicantIncome | Coapplicant income |
| LoanAmount | Loan amount in thousands |
| Loan_Amount_Term | Term of loan in months |
| Credit_History | Credit history meets guidelines (1/0) |
| Property_Area | Urban/ Semi Urban/ Rural |
| Loan_Status | (Target) Loan approved (Y/N) |


**Evaluation Metric**  
Model performance is evaluated on your predicted loan status for the test data (test.csv), which contains the same kind of data points as the train file but without the loan status. Your submission must follow the format shown in the sample submission.

We hold the actual loan status for the test dataset, against which your predictions are evaluated. Accuracy is the metric used to judge your submission.



**Public and Private Split**   
The test file is further divided into Public (25%) and Private (75%) parts.

Your initial responses will be checked and scored on the Public data.
The final rankings will be based on your Private score, which will be published once the competition is over.
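
The Accuracy metric named above is the fraction of predictions that match the true label. A minimal sketch follows, using made-up labels purely for illustration; the helper name `accuracy` and both arrays are assumptions, not part of the competition material.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

# Hypothetical labels: 'Y' = approved, 'N' = rejected.
y_true = ['Y', 'N', 'Y', 'Y', 'N']
y_pred = ['Y', 'Y', 'Y', 'N', 'N']
score = accuracy(y_true, y_pred)  # 3 of the 5 predictions match
print(score)
```

In the notebook itself the same quantity is computed with `metrics.accuracy_score`.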

# Load the libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

pd.options.plotting.backend = "matplotlib"
pd.set_option('display.max_columns', 100)  # full option name; the short form fails on newer pandas

import time,os,json
time_start_notebook = time.time()
home = os.path.expanduser('~')
SEED=100

import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

[(x.__name__,x.__version__) for x in [np,pd,sns]]

[('numpy', '1.18.4'), ('pandas', '1.1.0'), ('seaborn', '0.10.1')]

In [2]:
import sklearn
import xgboost as xgb

# Load the data

In [3]:
df_train = pd.read_csv('../data/raw/train.csv')
print(df_train.shape)
df_train.head()

(614, 13)


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [4]:
df_test = pd.read_csv('../data/raw/test.csv')
print(df_test.shape)
df_test.head()

(367, 12)


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,LP001035,Male,Yes,2,Graduate,No,2340,2546,100.0,360.0,,Urban
4,LP001051,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban


# Data Processing

In [5]:
def clean_data(df):
    df = df.copy()
    # drop unwanted features
    df = df.drop('Loan_ID',axis=1)

    # missing values imputation
    ## fill missing Married (Yes/No) based on whether the applicant is a graduate
    cond = (df['Education']=='Graduate') & (df['Married'].isnull()) 
    df.loc[cond, 'Married'] = 'Yes'
    cond = (df['Education']!='Graduate') & (df['Married'].isnull()) 
    df.loc[cond, 'Married'] = 'No'

    ## fill with mode
    cols_mode = ['Gender', 'Dependents', 'Self_Employed', 'Credit_History']
    for c in cols_mode:
        df[c] = df[c].fillna(df[c].mode()[0])

    ## fill with mean
    cols_mean = ['LoanAmount','Loan_Amount_Term' ]
    for c in cols_mean:
        df[c] = df[c].fillna(df[c].mean())

    # mapping string to integers
    df['Gender'] = df['Gender'].map({'Male':1, 'Female': 0})
    df['Married'] = df['Married'].map({'Yes':1, 'No': 0 })
    df['Education'] = df['Education'].map({'Graduate': 1, 'Not Graduate': 0})
    df['Self_Employed'] = df['Self_Employed'].map({'Yes':1, 'No': 0})
    
    # target 
    target = 'Loan_Status'
    if target in df.columns:
        df[target] = df[target].map({'Y':1, 'N': 0})

    # one hot encoding
    cols = ['Dependents','Property_Area']
    df = pd.get_dummies(df,columns=cols,drop_first=True)
    return df
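
Note that `clean_data` computes the modes and means from whichever frame it receives, so the train and test files are imputed with slightly different values. One possible alternative, sketched below under assumptions, is to learn the fill values on the training frame and reuse them on the test frame. The helper names `fit_fill_values` and `apply_fill_values` and the toy frames are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def fit_fill_values(df, cols_mode, cols_mean):
    """Learn one fill value per column from the given (training) frame."""
    fill = {c: df[c].mode()[0] for c in cols_mode}
    fill.update({c: df[c].mean() for c in cols_mean})
    return fill

def apply_fill_values(df, fill):
    """Return a copy with missing values replaced by the learned values."""
    return df.fillna(value=fill)

# Toy frames standing in for train.csv and test.csv.
toy_train = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', None],
    'LoanAmount': [100.0, 200.0, np.nan, 300.0],
})
toy_test = pd.DataFrame({
    'Gender': [None, 'Female'],
    'LoanAmount': [np.nan, 50.0],
})

fill = fit_fill_values(toy_train, cols_mode=['Gender'], cols_mean=['LoanAmount'])
toy_test_filled = apply_fill_values(toy_test, fill)
print(toy_test_filled)
```

The same idea would apply to the validation split: fill values learned on the training part only, then applied to the held-out part.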

In [6]:
df_train = pd.read_csv('../data/raw/train.csv')
df_test = pd.read_csv('../data/raw/test.csv')

df_train = clean_data(df_train)
df_test = clean_data(df_test)

print(df_train.shape)
print(df_test.shape)
pd.concat([df_train.head(2), df_train.tail(2)])  # DataFrame.append is removed in pandas 2.x

(614, 15)
(367, 14)


Unnamed: 0,Gender,Married,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status,Dependents_1,Dependents_2,Dependents_3+,Property_Area_Semiurban,Property_Area_Urban
0,1,0,1,0,5849,0.0,146.412162,360.0,1.0,1,0,0,0,0,1
1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0,1,0,0,0,0
612,1,1,1,0,7583,0.0,187.0,360.0,1.0,1,0,1,0,0,1
613,0,0,1,1,4583,0.0,133.0,360.0,0.0,0,0,0,0,1,0


# Modelling

## Train validation split

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
df_train_orig = df_train.copy()
target = 'Loan_Status'
df_Xtrain, df_Xvalid, ser_ytrain, ser_yvalid = train_test_split(
    df_train_orig.drop(target,axis=1), df_train_orig[target],
    test_size = 0.2,
    random_state=SEED,
    stratify=df_train_orig[target]
)

ytrain = ser_ytrain.to_numpy().ravel()
yvalid = ser_yvalid.to_numpy().ravel()

print(f'train shape: {df_Xtrain.shape}')
print(f'valid shape: {df_Xvalid.shape}')

df_Xtrain.head(2)

train shape: (491, 14)
valid shape: (123, 14)


Unnamed: 0,Gender,Married,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Dependents_1,Dependents_2,Dependents_3+,Property_Area_Semiurban,Property_Area_Urban
203,1,1,0,0,3500,1083.0,135.0,360.0,1.0,1,0,0,0,1
369,1,1,1,0,19730,5266.0,570.0,360.0,1.0,0,0,0,0,0
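
The split above passes `stratify=`, which keeps the class proportions the same in both parts. A small self-contained check of that behaviour is sketched here on a toy target; the 70/30 mix and the single `feature` column are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with a 70/30 class mix (the real data is roughly 69/31).
y = pd.Series([1] * 70 + [0] * 30, name='Loan_Status')
X = pd.DataFrame({'feature': range(len(y))})

X_tr, X_vd, y_tr, y_vd = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y
)

# With stratify=y both parts keep the original positive rate.
rate_all = y.mean()
rate_tr = y_tr.mean()
rate_vd = y_vd.mean()
print(f'all={rate_all:.2f} train={rate_tr:.2f} valid={rate_vd:.2f}')
```

Without `stratify`, the positive rate of a small validation set can drift from the training rate by chance, which makes accuracy harder to compare across splits.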


In [9]:
Xtr = df_Xtrain
ytr = ytrain
Xvd = df_Xvalid
yvd = yvalid

In [10]:
df_eval = pd.DataFrame({
    'Model': [],
    'Description': [],
    'Accuracy': [],
    'Precision': [],
    'Recall': [],
    'F-score': []
})

## Xgboost
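Note: with default parameters, xgboost 0.90 gives a validation accuracy of 0.82,  
but default xgboost 1.1.1 gives 0.78.
The default parameters of `XGBClassifier` changed between these versions (for example, `max_depth` went from 3 to 6 and `learning_rate` from 0.1 to 0.3).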


```text

Parameters
----------

    max_depth : int
        Maximum tree depth for base learners.
    learning_rate : float
        Boosting learning rate (xgb's "eta")
    verbosity : int
        The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
    objective : string or callable
        Specify the learning task and the corresponding learning objective or
        a custom objective function to be used (see note below).
    booster: string
        Specify which booster to use: gbtree, gblinear or dart.
    tree_method: string
        Specify which tree method to use.  Default to auto.  If this parameter
        is set to default, XGBoost will choose the most conservative option
        available.  It's recommended to study this option from parameters
        document.
    n_jobs : int
        Number of parallel threads used to run xgboost.
    gamma : float
        Minimum loss reduction required to make a further partition on a leaf
        node of the tree.
    min_child_weight : int
        Minimum sum of instance weight(hessian) needed in a child.
    max_delta_step : int
        Maximum delta step we allow each tree's weight estimation to be.
    subsample : float
        Subsample ratio of the training instance.
    colsample_bytree : float
        Subsample ratio of columns when constructing each tree.
    colsample_bylevel : float
        Subsample ratio of columns for each level.
    colsample_bynode : float
        Subsample ratio of columns for each split.
    reg_alpha : float (xgb's alpha)
        L1 regularization term on weights
    reg_lambda : float (xgb's lambda)
        L2 regularization term on weights
    scale_pos_weight : float
        Balancing of positive and negative weights.
    base_score:
        The initial prediction score of all instances, global bias.
    random_state : int
        Random number seed.

        .. note::

           Using gblinear booster with shotgun updater is nondeterministic as
           it uses Hogwild algorithm.

    missing : float, default np.nan
        Value in the data which needs to be present as a missing value.
    num_parallel_tree: int
        Used for boosting random forest.
    monotone_constraints : str
        Constraint of variable monotonicity.  See tutorial for more
        information.
    interaction_constraints : str
        Constraints for interaction representing permitted interactions.  The
        constraints must be specified in the form of a nest list, e.g. [[0, 1],
        [2, 3, 4]], where each inner list is a group of indices of features
        that are allowed to interact with each other.  See tutorial for more
        information
    importance_type: string, default "gain"
        The feature importance type for the feature_importances\_ property:
        either "gain", "weight", "cover", "total_gain" or "total_cover".
```

In [11]:
import xgboost
from xgboost import XGBClassifier
from sklearn import metrics
import time

TREE_METHOD = 'auto'
xgboost.__version__

'1.1.1'

In [12]:
from sklearn import metrics

df_eval = pd.DataFrame({
    'Model': [],
    'Description': [],
    'Accuracy': [],
    'Precision': [],
    'Recall': [],
    'F-score': [],
    'Time_Taken': [],
})

In [13]:
# XGBClassifier?

In [14]:
time_start = time.time()
model = XGBClassifier(random_state=SEED,n_jobs=-1)

model.fit(Xtr,ytr)

vd_preds = model.predict(Xvd)

acc = metrics.accuracy_score(yvd, vd_preds)
pre = metrics.precision_score(yvd, vd_preds)
rec = metrics.recall_score(yvd, vd_preds)
f1  = metrics.f1_score(yvd,vd_preds)

time_taken = time.time() - time_start
time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

row = ['xgb', 'default', acc, pre, rec, f1,time_taken]
df_eval.loc[len(df_eval)] = row
df_eval = df_eval.drop_duplicates(['Model','Description'])
df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

Unnamed: 0,index,Model,Description,Accuracy,Precision,Recall,F-score,Time_Taken
0,0,xgb,default,0.780488,0.808511,0.894118,0.849162,0 min 0.09 s
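
The `Time_Taken` column is built with `divmod(seconds, 60)`, which splits a duration into whole minutes and leftover seconds. A standalone sketch of that formatting step; the wrapper name `format_elapsed` is an assumption introduced here for illustration.

```python
def format_elapsed(seconds):
    """Format a duration in seconds as '<minutes> min <seconds> s'."""
    minutes, secs = divmod(seconds, 60)
    return "{:.0f} min {:.2f} s".format(minutes, secs)

short = format_elapsed(0.09)    # a sub-second fit, as in the table above
longer = format_elapsed(125.5)  # hypothetical longer run
print(short)
print(longer)
```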


In [15]:
time_start = time.time()
# default from xgboost version 0.9
model = XGBClassifier(
    # always same
    tree_method=TREE_METHOD,
    objective='binary:logistic',
    random_state=100,
    n_jobs=-1,
    verbosity=1,
    base_score=0.5,

    # most imp
    n_estimators=100,
    learning_rate=0.1,

    # subsample and colsample
    subsample=1,
    colsample_bylevel=1,
    colsample_bynode=1, 
    colsample_bytree=1,

    # tree depth
    max_depth=3,
    min_child_weight=1,
    scale_pos_weight=1,
    
    # regularization: alpha, lambda, gamma
    reg_alpha=0,
    reg_lambda=1,
    gamma=0,
    )

model.fit(Xtr,ytr)
vd_preds = model.predict(Xvd)

acc = metrics.accuracy_score(yvd, vd_preds)
pre = metrics.precision_score(yvd, vd_preds)
rec = metrics.recall_score(yvd, vd_preds)
f1  = metrics.f1_score(yvd,vd_preds)

time_taken = time.time() - time_start
time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

row = ['xgb', 'xgb9.0', acc, pre, rec, f1,time_taken]
df_eval.loc[len(df_eval)] = row
df_eval = df_eval.drop_duplicates(['Model','Description'])
df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

Unnamed: 0,index,Model,Description,Accuracy,Precision,Recall,F-score,Time_Taken
0,1,xgb,xgb9.0,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
1,0,xgb,default,0.780488,0.808511,0.894118,0.849162,0 min 0.09 s


In [16]:
for x in [0.6,0.7,0.8,0.9]:

    time_start = time.time()
    # default from xgboost version 0.9
    model = XGBClassifier(
        # always same
        tree_method=TREE_METHOD,
        objective='binary:logistic',
        random_state=100,
        n_jobs=-1,
        verbosity=1,
        base_score=0.5,

        # most imp
        n_estimators=100,
        learning_rate=0.1,

        # subsample and colsample
        subsample=x,
        colsample_bylevel=1,
        colsample_bynode=1, 
        colsample_bytree=1,

        # tree depth
        max_depth=3,
        min_child_weight=1,
        scale_pos_weight=1,

        # regularization: alpha, lambda, gamma
        reg_alpha=0,
        reg_lambda=1,
        gamma=0,
        )

    model.fit(Xtr,ytr)
    vd_preds = model.predict(Xvd)

    acc = metrics.accuracy_score(yvd, vd_preds)
    pre = metrics.precision_score(yvd, vd_preds)
    rec = metrics.recall_score(yvd, vd_preds)
    f1  = metrics.f1_score(yvd,vd_preds)

    time_taken = time.time() - time_start
    time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

    row = ['xgb', f'subsample={x}', acc, pre, rec, f1,time_taken]
    df_eval.loc[len(df_eval)] = row
    df_eval = df_eval.drop_duplicates(['Model','Description'])


df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

Unnamed: 0,index,Model,Description,Accuracy,Precision,Recall,F-score,Time_Taken
0,1,xgb,xgb9.0,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
1,5,xgb,subsample=0.9,0.821138,0.811881,0.964706,0.88172,0 min 0.04 s
2,2,xgb,subsample=0.6,0.813008,0.803922,0.964706,0.877005,0 min 0.06 s
3,3,xgb,subsample=0.7,0.813008,0.803922,0.964706,0.877005,0 min 0.04 s
4,4,xgb,subsample=0.8,0.813008,0.803922,0.964706,0.877005,0 min 0.04 s
5,0,xgb,default,0.780488,0.808511,0.894118,0.849162,0 min 0.09 s
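
When reading the table above, it may help to remember that the validation set has only 123 rows, so every accuracy value is a multiple of 1/123. The sketch below converts the reported accuracies back into counts of correct predictions; it uses only numbers printed in this notebook.

```python
n_valid = 123  # validation rows reported by the train/validation split

# Accuracy values that appear in the results table.
reported = [0.829268, 0.821138, 0.813008, 0.780488]

counts = []
for acc in reported:
    n_correct = round(acc * n_valid)
    counts.append(n_correct)
    print(f'{acc:.6f} -> {n_correct}/{n_valid} correct')

# Smallest possible accuracy difference on this validation set.
step = 1 / n_valid
print(f'one prediction = {step:.6f} accuracy')
```

The gap between 0.829268 and 0.821138 therefore corresponds to a single validation row, so small differences between settings should be interpreted with caution.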


In [17]:
for x in [0.6,0.7,0.8,0.9]:

    time_start = time.time()
    # default from xgboost version 0.9
    model = XGBClassifier(
        # always same
        tree_method=TREE_METHOD,
        objective='binary:logistic',
        random_state=100,
        n_jobs=-1,
        verbosity=1,
        base_score=0.5,

        # most imp
        n_estimators=100,
        learning_rate=0.1,

        # subsample and colsample
        subsample=1,
        colsample_bylevel=1,
        colsample_bynode=1, 
        colsample_bytree=x,

        # tree depth
        max_depth=3,
        min_child_weight=1,
        scale_pos_weight=1,

        # regularization: alpha, lambda, gamma
        reg_alpha=0,
        reg_lambda=1,
        gamma=0,
        )

    model.fit(Xtr,ytr)
    vd_preds = model.predict(Xvd)

    acc = metrics.accuracy_score(yvd, vd_preds)
    pre = metrics.precision_score(yvd, vd_preds)
    rec = metrics.recall_score(yvd, vd_preds)
    f1  = metrics.f1_score(yvd,vd_preds)

    time_taken = time.time() - time_start
    time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

    row = ['xgb', f'colsample_bytree={x}', acc, pre, rec, f1,time_taken]
    df_eval.loc[len(df_eval)] = row
    df_eval = df_eval.drop_duplicates(['Model','Description'])


df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

Unnamed: 0,index,Model,Description,Accuracy,Precision,Recall,F-score,Time_Taken
0,1,xgb,xgb9.0,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
1,9,xgb,colsample_bytree=0.9,0.829268,0.807692,0.988235,0.888889,0 min 0.03 s
2,5,xgb,subsample=0.9,0.821138,0.811881,0.964706,0.88172,0 min 0.04 s
3,6,xgb,colsample_bytree=0.6,0.821138,0.805825,0.976471,0.882979,0 min 0.05 s
4,7,xgb,colsample_bytree=0.7,0.821138,0.805825,0.976471,0.882979,0 min 0.03 s
5,2,xgb,subsample=0.6,0.813008,0.803922,0.964706,0.877005,0 min 0.06 s
6,3,xgb,subsample=0.7,0.813008,0.803922,0.964706,0.877005,0 min 0.04 s
7,4,xgb,subsample=0.8,0.813008,0.803922,0.964706,0.877005,0 min 0.04 s
8,8,xgb,colsample_bytree=0.8,0.813008,0.803922,0.964706,0.877005,0 min 0.03 s
9,0,xgb,default,0.780488,0.808511,0.894118,0.849162,0 min 0.09 s
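
The sweep cells repeat the same constructor call and metric code, changing one argument each time. One way to reduce the repetition is a helper that merges a dictionary of base parameters with per-run overrides. The sketch below is an assumption about how that could look, not the notebook's own code: the function name `evaluate`, the toy data, and the use of `DecisionTreeClassifier` as a stand-in estimator are all introduced here so the example runs without xgboost.

```python
import time
import pandas as pd
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier  # stand-in estimator for the demo

def evaluate(make_model, base_params, overrides, Xtr, ytr, Xvd, yvd):
    """Fit make_model(**params) and return one row of validation metrics."""
    params = {**base_params, **overrides}
    start = time.time()
    model = make_model(**params)
    model.fit(Xtr, ytr)
    preds = model.predict(Xvd)
    return {
        'Description': ', '.join(f'{k}={v}' for k, v in overrides.items()) or 'base',
        'Accuracy': metrics.accuracy_score(yvd, preds),
        'Precision': metrics.precision_score(yvd, preds, zero_division=0),
        'Recall': metrics.recall_score(yvd, preds),
        'F-score': metrics.f1_score(yvd, preds),
        'Time_Taken': time.time() - start,
    }

# Tiny made-up data so the sketch runs on its own.
Xtr = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7]})
ytr = [0, 0, 0, 0, 1, 1, 1, 1]
Xvd = pd.DataFrame({'a': [1, 6]})
yvd = [0, 1]

base = {'random_state': 100, 'max_depth': 3}
rows = [evaluate(DecisionTreeClassifier, base, {'max_depth': d}, Xtr, ytr, Xvd, yvd)
        for d in [1, 2]]
df_sweep = pd.DataFrame(rows)
print(df_sweep[['Description', 'Accuracy']])
```

In this notebook, `make_model` would be `XGBClassifier` and `base_params` would hold the xgboost 0.90 defaults listed in the cells above, with `overrides` carrying the one parameter being swept.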


In [18]:
for x in range(3,20):

    time_start = time.time()
    # default from xgboost version 0.9
    model = XGBClassifier(
        # always same
        tree_method=TREE_METHOD,
        objective='binary:logistic',
        random_state=100,
        n_jobs=-1,
        verbosity=1,
        base_score=0.5,

        # most imp
        n_estimators=100,
        learning_rate=0.1,

        # subsample and colsample
        subsample=1,
        colsample_bylevel=1,
        colsample_bynode=1, 
        colsample_bytree=1,

        # tree depth
        max_depth=x,
        min_child_weight=1,
        scale_pos_weight=1,

        # regularization: alpha, lambda, gamma
        reg_alpha=0,
        reg_lambda=1,
        gamma=0,
        )

    model.fit(Xtr,ytr)
    vd_preds = model.predict(Xvd)

    acc = metrics.accuracy_score(yvd, vd_preds)
    pre = metrics.precision_score(yvd, vd_preds)
    rec = metrics.recall_score(yvd, vd_preds)
    f1  = metrics.f1_score(yvd,vd_preds)

    time_taken = time.time() - time_start
    time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

    row = ['xgb', f'max_depth={x}', acc, pre, rec, f1,time_taken]
    df_eval.loc[len(df_eval)] = row
    df_eval = df_eval.drop_duplicates(['Model','Description'])


df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

Unnamed: 0,index,Model,Description,Accuracy,Precision,Recall,F-score,Time_Taken
0,9,xgb,colsample_bytree=0.9,0.829268,0.807692,0.988235,0.888889,0 min 0.03 s
1,10,xgb,max_depth=3,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
2,11,xgb,max_depth=4,0.829268,0.82,0.964706,0.886486,0 min 0.04 s
3,1,xgb,xgb9.0,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
4,5,xgb,subsample=0.9,0.821138,0.811881,0.964706,0.88172,0 min 0.04 s
5,6,xgb,colsample_bytree=0.6,0.821138,0.805825,0.976471,0.882979,0 min 0.05 s
6,7,xgb,colsample_bytree=0.7,0.821138,0.805825,0.976471,0.882979,0 min 0.03 s
7,8,xgb,colsample_bytree=0.8,0.813008,0.803922,0.964706,0.877005,0 min 0.03 s
8,2,xgb,subsample=0.6,0.813008,0.803922,0.964706,0.877005,0 min 0.06 s
9,3,xgb,subsample=0.7,0.813008,0.803922,0.964706,0.877005,0 min 0.04 s


In [19]:
for x in [0.1,0.5,1,1.5,2,3,10]:

    time_start = time.time()
    # default from xgboost version 0.9
    model = XGBClassifier(
        # always same
        tree_method=TREE_METHOD,
        objective='binary:logistic',
        random_state=100,
        n_jobs=-1,
        verbosity=1,
        base_score=0.5,

        # most imp
        n_estimators=100,
        learning_rate=0.1,

        # subsample and colsample
        subsample=1,
        colsample_bylevel=1,
        colsample_bynode=1, 
        colsample_bytree=1,

        # tree depth
        max_depth=3,
        min_child_weight=x,
        scale_pos_weight=1,

        # regularization: alpha, lambda, gamma
        reg_alpha=0,
        reg_lambda=1,
        gamma=0,
        )

    model.fit(Xtr,ytr)
    vd_preds = model.predict(Xvd)

    acc = metrics.accuracy_score(yvd, vd_preds)
    pre = metrics.precision_score(yvd, vd_preds)
    rec = metrics.recall_score(yvd, vd_preds)
    f1  = metrics.f1_score(yvd,vd_preds)

    time_taken = time.time() - time_start
    time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

    row = ['xgb', f'min_child_weight={x}', acc, pre, rec, f1,time_taken]
    df_eval.loc[len(df_eval)] = row
    df_eval = df_eval.drop_duplicates(['Model','Description'])


df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

Unnamed: 0,index,Model,Description,Accuracy,Precision,Recall,F-score,Time_Taken
0,11,xgb,max_depth=4,0.829268,0.82,0.964706,0.886486,0 min 0.04 s
1,10,xgb,max_depth=3,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
2,29,xgb,min_child_weight=1,0.829268,0.807692,0.988235,0.888889,0 min 0.05 s
3,9,xgb,colsample_bytree=0.9,0.829268,0.807692,0.988235,0.888889,0 min 0.03 s
4,1,xgb,xgb9.0,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
5,5,xgb,subsample=0.9,0.821138,0.811881,0.964706,0.88172,0 min 0.04 s
6,6,xgb,colsample_bytree=0.6,0.821138,0.805825,0.976471,0.882979,0 min 0.05 s
7,7,xgb,colsample_bytree=0.7,0.821138,0.805825,0.976471,0.882979,0 min 0.03 s
8,30,xgb,min_child_weight=1.5,0.821138,0.8,0.988235,0.884211,0 min 0.03 s
9,4,xgb,subsample=0.8,0.813008,0.803922,0.964706,0.877005,0 min 0.04 s


In [20]:
df_train[target].value_counts(normalize=True)

1    0.687296
0    0.312704
Name: Loan_Status, dtype: float64
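
The class shares above feed into the choice of `scale_pos_weight`. The XGBoost parameter documentation suggests `sum(negative instances) / sum(positive instances)` as a typical value to consider. A short calculation with the shares printed above; the variable names are introduced here for illustration.

```python
# Class shares printed above (1 = approved, 0 = rejected).
share_pos = 0.687296
share_neg = 0.312704

# Heuristic from the XGBoost parameter docs: sum(negative) / sum(positive).
suggested_scale_pos_weight = share_neg / share_pos
print(round(suggested_scale_pos_weight, 3))
```

Because the positive class is the majority here, the heuristic gives a value below 1, which down-weights approved loans. Since the competition metric is plain accuracy, such reweighting is not guaranteed to help.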

In [21]:
for x in [0.32,0.68,0.2, 0.3,0.4,0.5,0.6,1]:

    time_start = time.time()
    # default from xgboost version 0.9
    model = XGBClassifier(
        # always same
        tree_method=TREE_METHOD,
        objective='binary:logistic',
        random_state=100,
        n_jobs=-1,
        verbosity=1,
        base_score=0.5,

        # most imp
        n_estimators=100,
        learning_rate=0.1,

        # subsample and colsample
        subsample=1,
        colsample_bylevel=1,
        colsample_bynode=1, 
        colsample_bytree=1,

        # tree depth
        max_depth=3,
        min_child_weight=1,
        scale_pos_weight=x,

        # regularization: alpha, lambda, gamma
        reg_alpha=0,
        reg_lambda=1,
        gamma=0,
        )

    model.fit(Xtr,ytr)
    vd_preds = model.predict(Xvd)

    acc = metrics.accuracy_score(yvd, vd_preds)
    pre = metrics.precision_score(yvd, vd_preds)
    rec = metrics.recall_score(yvd, vd_preds)
    f1  = metrics.f1_score(yvd,vd_preds)

    time_taken = time.time() - time_start
    time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

    row = ['xgb', f'scale_pos_weight={x}', acc, pre, rec, f1,time_taken]
    df_eval.loc[len(df_eval)] = row
    df_eval = df_eval.drop_duplicates(['Model','Description'])


df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

Unnamed: 0,index,Model,Description,Accuracy,Precision,Recall,F-score,Time_Taken
0,41,xgb,scale_pos_weight=1,0.829268,0.807692,0.988235,0.888889,0 min 0.03 s
1,1,xgb,xgb9.0,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
2,29,xgb,min_child_weight=1,0.829268,0.807692,0.988235,0.888889,0 min 0.05 s
3,9,xgb,colsample_bytree=0.9,0.829268,0.807692,0.988235,0.888889,0 min 0.03 s
4,10,xgb,max_depth=3,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
5,11,xgb,max_depth=4,0.829268,0.82,0.964706,0.886486,0 min 0.04 s
6,30,xgb,min_child_weight=1.5,0.821138,0.8,0.988235,0.884211,0 min 0.03 s
7,5,xgb,subsample=0.9,0.821138,0.811881,0.964706,0.88172,0 min 0.04 s
8,6,xgb,colsample_bytree=0.6,0.821138,0.805825,0.976471,0.882979,0 min 0.05 s
9,7,xgb,colsample_bytree=0.7,0.821138,0.805825,0.976471,0.882979,0 min 0.03 s


In [22]:
for x in [0, 0.001,0.01,0.1,1, 10,100,1000]:

    time_start = time.time()
    # default from xgboost version 0.9
    model = XGBClassifier(
        # always same
        tree_method=TREE_METHOD,
        objective='binary:logistic',
        random_state=100,
        n_jobs=-1,
        verbosity=1,
        base_score=0.5,

        # most imp
        n_estimators=100,
        learning_rate=0.1,

        # subsample and colsample
        subsample=1,
        colsample_bylevel=1,
        colsample_bynode=1, 
        colsample_bytree=1,

        # tree depth
        max_depth=3,
        min_child_weight=1,
        scale_pos_weight=1,

        # regularization: alpha, lambda, gamma
        reg_alpha=x,
        reg_lambda=1,
        gamma=0,
        )

    model.fit(Xtr,ytr)
    vd_preds = model.predict(Xvd)

    acc = metrics.accuracy_score(yvd, vd_preds)
    pre = metrics.precision_score(yvd, vd_preds)
    rec = metrics.recall_score(yvd, vd_preds)
    f1  = metrics.f1_score(yvd,vd_preds)

    time_taken = time.time() - time_start
    time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

    row = ['xgb', f'reg_alpha={x}', acc, pre, rec, f1,time_taken]
    df_eval.loc[len(df_eval)] = row
    df_eval = df_eval.drop_duplicates(['Model','Description'])


df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])


Unnamed: 0,index,Model,Description,Accuracy,Precision,Recall,F-score,Time_Taken
0,29,xgb,min_child_weight=1,0.829268,0.807692,0.988235,0.888889,0 min 0.05 s
1,44,xgb,reg_alpha=0.01,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
2,42,xgb,reg_alpha=0,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
3,41,xgb,scale_pos_weight=1,0.829268,0.807692,0.988235,0.888889,0 min 0.03 s
4,11,xgb,max_depth=4,0.829268,0.82,0.964706,0.886486,0 min 0.04 s
5,10,xgb,max_depth=3,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
6,9,xgb,colsample_bytree=0.9,0.829268,0.807692,0.988235,0.888889,0 min 0.03 s
7,45,xgb,reg_alpha=0.1,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
8,1,xgb,xgb9.0,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
9,7,xgb,colsample_bytree=0.7,0.821138,0.805825,0.976471,0.882979,0 min 0.03 s


In [23]:
for x in [0, 0.001,0.01,0.1,1, 10,100,1000]:

    time_start = time.time()
    # default from xgboost version 0.9
    model = XGBClassifier(
        # always same
        tree_method=TREE_METHOD,
        objective='binary:logistic',
        random_state=100,
        n_jobs=-1,
        verbosity=1,
        base_score=0.5,

        # most imp
        n_estimators=100,
        learning_rate=0.1,

        # subsample and colsample
        subsample=1,
        colsample_bylevel=1,
        colsample_bynode=1, 
        colsample_bytree=1,

        # tree depth
        max_depth=3,
        min_child_weight=1,
        scale_pos_weight=1,

        # regularization: alpha, lambda, gamma
        reg_alpha=0,
        reg_lambda=x,
        gamma=0,
        )

    model.fit(Xtr,ytr)
    vd_preds = model.predict(Xvd)

    acc = metrics.accuracy_score(yvd, vd_preds)
    pre = metrics.precision_score(yvd, vd_preds)
    rec = metrics.recall_score(yvd, vd_preds)
    f1  = metrics.f1_score(yvd,vd_preds)

    time_taken = time.time() - time_start
    time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

    row = ['xgb', f'reg_lambda={x}', acc, pre, rec, f1,time_taken]
    df_eval.loc[len(df_eval)] = row
    df_eval = df_eval.drop_duplicates(['Model','Description'])


df_eval.sort_values('Accuracy',ascending=False).reset_index().style.background_gradient(subset=['Accuracy'])

Unnamed: 0,index,Model,Description,Accuracy,Precision,Recall,F-score,Time_Taken
0,29,xgb,min_child_weight=1,0.829268,0.807692,0.988235,0.888889,0 min 0.05 s
1,1,xgb,xgb9.0,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
2,54,xgb,reg_lambda=1,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
3,45,xgb,reg_alpha=0.1,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
4,44,xgb,reg_alpha=0.01,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
5,42,xgb,reg_alpha=0,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
6,9,xgb,colsample_bytree=0.9,0.829268,0.807692,0.988235,0.888889,0 min 0.03 s
7,10,xgb,max_depth=3,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
8,11,xgb,max_depth=4,0.829268,0.82,0.964706,0.886486,0 min 0.04 s
9,41,xgb,scale_pos_weight=1,0.829268,0.807692,0.988235,0.888889,0 min 0.03 s


In [25]:
for x in [0, 0.001,0.01,0.1,1, 10,100,1000]:

    time_start = time.time()
    # default from xgboost version 0.9
    model = XGBClassifier(
        # always same
        tree_method=TREE_METHOD,
        objective='binary:logistic',
        random_state=100,
        n_jobs=-1,
        verbosity=1,
        base_score=0.5,

        # most imp
        n_estimators=100,
        learning_rate=0.1,

        # subsample and colsample
        subsample=1,
        colsample_bylevel=1,
        colsample_bynode=1, 
        colsample_bytree=1,

        # tree depth
        max_depth=3,
        min_child_weight=1,
        scale_pos_weight=1,

        # regularization: alpha, lambda, gamma
        reg_alpha=0,
        reg_lambda=1,
        gamma=x,
        )

    model.fit(Xtr,ytr)
    vd_preds = model.predict(Xvd)

    acc = metrics.accuracy_score(yvd, vd_preds)
    pre = metrics.precision_score(yvd, vd_preds)
    rec = metrics.recall_score(yvd, vd_preds)
    f1  = metrics.f1_score(yvd,vd_preds)

    time_taken = time.time() - time_start
    time_taken = "{:.0f} min {:.2f} s".format(*divmod(time_taken,60))

    row = ['xgb', f'gamma={x}', acc, pre, rec, f1,time_taken]
    df_eval.loc[len(df_eval)] = row
    df_eval = df_eval.drop_duplicates(['Model','Description'])


df_eval.sort_values('Accuracy',ascending=False).head().reset_index().style.background_gradient(subset=['Accuracy'])

Unnamed: 0,index,Model,Description,Accuracy,Precision,Recall,F-score,Time_Taken
0,54,xgb,reg_lambda=1,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
1,11,xgb,max_depth=4,0.829268,0.82,0.964706,0.886486,0 min 0.04 s
2,42,xgb,reg_alpha=0,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
3,41,xgb,scale_pos_weight=1,0.829268,0.807692,0.988235,0.888889,0 min 0.03 s
4,1,xgb,xgb9.0,0.829268,0.807692,0.988235,0.888889,0 min 0.04 s
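
Many settings tie at the same accuracy on the single 123-row validation split, so a cross-validated score may separate them more reliably. The following is a minimal sketch of stratified 5-fold cross-validation, under stated assumptions: the data is synthetic (`make_classification`), and scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier` so the example runs without xgboost. Neither is part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data of roughly the same size and class mix as the training set.
X, y = make_classification(
    n_samples=614, n_features=14, weights=[0.31, 0.69], random_state=100
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=100
)

# One accuracy per fold; the spread indicates how noisy a single split is.
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f'accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```

With the real data, `X` and `y` would be the cleaned training features and `Loan_Status`, and the estimator would be the `XGBClassifier` configuration under test.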
