<a href="https://colab.research.google.com/github/boliang-liu/CreditsPrediction/blob/main/Credits_prediction_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [36]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import BayesianRidge, HuberRegressor, LassoCV, RidgeCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder
from sklearn.svm import SVC

In [37]:
credit = pd.read_csv('https://raw.githubusercontent.com/boliang-liu/CreditsPrediction/main/BankChurners.csv')

Load the dataset from the GitHub repository.

In [38]:
data = credit[['Customer_Age', 'Gender',
       'Dependent_count', 'Education_Level', 'Marital_Status',
       'Income_Category', 'Card_Category', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2']]

Select the features to use and drop columns that carry no meaning for this project, such as the customer ID.

In [39]:
target = credit['Credit_Limit']
target = target.values.ravel()

Select the target column 'Credit_Limit' and convert it to a one-dimensional NumPy array for fitting the models later.

In [40]:
X_train, X_test, y_train, y_test = train_test_split(data, 
                                                    target, 
                                                    test_size=0.2)

Split the dataset into a training set and a test set, with 20% of the rows held out for testing.
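The split above draws a different random partition on every run. As a minimal sketch, assuming reproducibility is wanted, passing a fixed `random_state` makes the partition repeatable. The toy arrays and the seed value 42 below are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for `data` and `target` (hypothetical values, not the real dataset).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# random_state=42 is an arbitrary seed chosen for illustration.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows

# With the same seed, both calls return identical partitions.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```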

In [41]:
categorical_columns = (X_train.dtypes == object)
continuous_columns  = (X_train.dtypes != object)

Identify which columns are continuous and which are categorical. These boolean masks are used later to route each group of columns to its own transformer.
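To make the masks concrete, here is a small sketch on a hypothetical three-row frame (the values are made up for illustration). The string column is created with an explicit `object` dtype, which is an assumption that mirrors how `read_csv` stored text columns in the pandas version used here.

```python
import pandas as pd

# Hypothetical frame that mimics the mix of column types in the data.
toy = pd.DataFrame({
    "Customer_Age": [45, 49, 51],
    "Gender": pd.Series(["M", "F", "M"], dtype=object),
    "Total_Trans_Amt": [1144.0, 1291.0, 1887.0],
})

# One boolean per column: True where the column holds Python objects (strings).
categorical_mask = (toy.dtypes == object)
continuous_mask = (toy.dtypes != object)

categorical_names = toy.columns[categorical_mask.to_numpy()].tolist()
continuous_names = toy.columns[continuous_mask.to_numpy()].tolist()
print(categorical_names)
print(continuous_names)
```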

In [42]:
class DummyEstimator(BaseEstimator):
    "Pass-through placeholder; RandomizedSearchCV replaces it with a real model."
    def fit(self, X, y=None): return self
    def score(self, X, y=None): pass

Define a placeholder estimator for the final pipeline step. RandomizedSearchCV later swaps real models into that step.
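The swap works because a `Pipeline` accepts a step name as a parameter in `set_params`. The sketch below shows that mechanism in isolation; `PlaceholderEstimator` and `demo_pipe` are hypothetical names used only for this illustration, and `Ridge` is an arbitrary model choice.

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler


class PlaceholderEstimator(BaseEstimator):
    """Hypothetical stand-in that only occupies the final pipeline slot."""

    def fit(self, X, y=None):
        return self

    def score(self, X, y=None):
        return 0.0


demo_pipe = Pipeline([("scaler", MaxAbsScaler()), ("clf", PlaceholderEstimator())])
before = type(demo_pipe.named_steps["clf"]).__name__

# Setting the step name replaces the whole estimator in that slot,
# which is what a search-space entry such as {'clf': [Ridge()]} relies on.
demo_pipe.set_params(clf=Ridge(alpha=1.0))
after = type(demo_pipe.named_steps["clf"]).__name__
print(before, "->", after)
```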

In [43]:
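# Continuous columns: scale by the maximum absolute value, then impute the median.
con_pipe = Pipeline([('scaler', MaxAbsScaler()),
                     ('imputer', SimpleImputer(missing_values=np.nan, strategy='median', add_indicator=True))
                     ])

# Categorical columns: one-hot encode, then impute the most frequent value.
cat_pipe = Pipeline([('ohe', OneHotEncoder(handle_unknown='ignore')),
                     ('imputer', SimpleImputer(strategy='most_frequent', add_indicator=True))])

# Route each group of columns through its own sub-pipeline.
preprocessing = ColumnTransformer([('categorical', cat_pipe,  categorical_columns),
                                   ('continuous',  con_pipe,  continuous_columns),
                                   ])
# Full pipeline: preprocessing, PCA, then a placeholder model for the search to replace.
pipe = Pipeline([('preprocessing', preprocessing),
                 ('pca', PCA(n_components=15)),
                 ('clf', DummyEstimator())
                ])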

Scale the continuous columns with MaxAbsScaler and impute missing values with the column median. One-hot encode the categorical columns and impute missing values with the most frequent category. Combine both sub-pipelines in a ColumnTransformer, then build a pipeline of the column transformer, PCA, and the DummyEstimator placeholder. The dataset has 20 input columns, but not all of them are equally informative. To reduce dimensionality and improve generalization, we set n_components=15, so PCA projects the preprocessed data onto its 15 leading principal components. These components are linear combinations of the features, not a selection of 15 original columns.
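A short sketch of what PCA does may help here. It uses synthetic data (an assumption, not the credit data) in which three columns are noisy copies of three others, so three components capture nearly all of the variance. Each component mixes all six input columns rather than picking a subset of them.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 6 columns; the last 3 are near-duplicates of the first 3.
base = rng.normal(size=(200, 3))
X_demo = np.hstack([base, base + 0.01 * rng.normal(size=(200, 3))])

pca = PCA(n_components=3).fit(X_demo)
X_reduced = pca.transform(X_demo)

# Share of the total variance kept by the 3 retained components.
explained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape)
print(pca.components_.shape)  # one row of 6 weights per component
print(explained > 0.99)
```

On real data, `explained_variance_ratio_` offers one way to check whether a choice such as 15 components keeps enough of the variance.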

In [44]:
from sklearn.svm import SVR  # regression counterpart of SVC; the target is continuous

search_space = [{'clf': [RandomForestRegressor()],
                 'clf__n_estimators': np.arange(100, 1000, 150), # number of trees
                 'clf__max_features': ['log2','sqrt'], # features considered at each split
                 'clf__max_depth' : np.arange(15,25,1), # maximum depth of each tree
                 'clf__min_samples_leaf': np.arange(1,10,1), # minimum samples per leaf
                 'clf__bootstrap': [True, False] # whether to bootstrap the rows for each tree
                },
                
                {'clf': [SVR()], 
                 'clf__C': np.logspace(-1, 3, 5), # C from 0.1 to 1000 (logspace takes exponents)
                 'clf__gamma': np.logspace(-4, 0, 5), # gamma from 0.0001 to 1
                 'clf__kernel':['rbf','poly'] # kernel type
                },
               
                {'clf': [RidgeCV()], 
                 'clf__normalize': [False, True], # decides if doing normalization
                 'clf__alpha_per_target' : [False, True] # decides if using alpha in per target
                },
                
                {'clf': [LassoCV()], 
                 'clf__eps': np.arange(0.0005, 0.01, 0.0005), # decides the best epsilon
                 'clf__normalize': [False, True], # decides if doing normalization
                 'clf__max_iter': np.arange(1000,5000,1000), # decides the maximum times of iteration
                 'clf__n_alphas': np.arange(100,500,100) # decides the best n_alphas
                },
                
                {'clf': [BayesianRidge()], 
                 'clf__normalize': [False, True], # decides if doing normalization
                 'clf__n_iter': np.arange(100, 1000, 100) # decides the times of iteration
                },
                
                {
                 'clf': [HuberRegressor()],
                 'clf__alpha': np.arange(0.0001, 0.001, 0.0001), # decides the best alpha
                 'clf__max_iter': np.arange(100,1000,100), # decides the times of iteration
                 'clf__epsilon': np.arange(1,2,0.1) # decides the best epsilon
                },
                
                {
                 'clf': [ExtraTreesRegressor()], 
                 'clf__max_features': ['log2','sqrt'], # features considered at each split
                 'clf__max_depth' : np.arange(15,25,1), # maximum depth of each tree
                 'clf__n_estimators': np.arange(100, 1000, 150), # number of trees
                 'clf__min_samples_leaf': np.arange(1,10,1), # minimum samples per leaf
                 'clf__bootstrap': [True, False] # whether to bootstrap the rows for each tree
                }
                 ]

clf_algos_rand = RandomizedSearchCV(estimator=pipe,
                                    param_distributions=search_space, 
                                    n_iter=50,
                                    cv=5, 
                                    n_jobs=-1,
                                    verbose=10,
                                    scoring='neg_root_mean_squared_error')

Construct a search space of 7 different models with various hyperparameters. Use RandomizedSearchCV to search for the best model and hyperparameters, scored by negative root mean squared error (RMSE) under 5-fold cross-validation.
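The following is a scaled-down sketch of the same idea: a list of dictionaries, one per model family, sampled by RandomizedSearchCV. The data is synthetic, and `demo_pipe`, `demo_space`, and the small grids are assumptions chosen so the example runs in seconds. It also shows that `np.logspace` takes exponents of 10, so a grid from 0.1 to 1000 is written as `np.logspace(-1, 3, 5)`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the credit data (assumption).
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=0)

# np.logspace takes exponents of 10: this grid runs from 10**-1 to 10**3.
c_grid = np.logspace(-1, 3, 5)

demo_pipe = Pipeline([("scale", MaxAbsScaler()), ("clf", Ridge())])

# One dict per model family; the 'clf' entry swaps the final pipeline step.
demo_space = [
    {"clf": [Ridge()], "clf__alpha": [0.1, 1.0, 10.0]},
    {"clf": [SVR()], "clf__C": c_grid, "clf__kernel": ["rbf"]},
]

demo_search = RandomizedSearchCV(
    demo_pipe,
    demo_space,
    n_iter=5,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

best_name = type(demo_search.best_estimator_.named_steps["clf"]).__name__
best_rmse = -demo_search.best_score_  # flip the sign to read it as an RMSE
print(best_name, best_rmse)
```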

In [46]:
for i in range(5):

    best_model = clf_algos_rand.fit(X_train, y_train)

    print(best_model.best_estimator_.get_params()['clf'], end='\n')
    print(best_model.best_score_, end='\n')

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   11.2s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   23.8s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:   57.5s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done  57 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:  7.3min
[Parallel(n_jobs=-1)]: Done  81 tasks      | elapsed:  7.5min
[Parallel(n_jobs=-1)]: Done  94 tasks      | elapsed:  7.8min
[Parallel(n_jobs=-1)]: Done 109 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done 124 tasks      | elapsed:  9.6min
[Parallel(n_jobs=-1)]: Done 141 tasks      | elapsed: 10

RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                      max_depth=19, max_features='log2', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=850, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
-1771.6695369482761
Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   19.0s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   38.4s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done  57 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done  81 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done  94 tasks      | elapsed:  7.3min
[Parallel(n_jobs=-1)]: Done 109 tasks      | elapsed:  8.4min
[Parallel(n_jobs=-1)]: Done 124 tasks      | elapsed: 10.1min
[Parallel(n_jobs=-1)]: Done 141 tasks      | elapsed: 10

RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                      max_depth=20, max_features='log2', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=250, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
-1765.3591850968237
Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   10.3s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   20.8s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   52.5s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done  57 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done  81 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done  94 tasks      | elapsed:  6.5min
[Parallel(n_jobs=-1)]: Done 109 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done 124 tasks      | elapsed:  9.6min
[Parallel(n_jobs=-1)]: Done 141 tasks      | elapsed: 10

RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                      max_depth=20, max_features='log2', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=700, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
-1766.55209906544
Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   17.2s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   35.0s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   52.0s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:   54.6s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done  57 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:  6.1min
[Parallel(n_jobs=-1)]: Done  81 tasks      | elapsed:  6.4min
[Parallel(n_jobs=-1)]: Done  94 tasks      | elapsed:  6.7min
[Parallel(n_jobs=-1)]: Done 109 tasks      | elapsed:  8.5min
[Parallel(n_jobs=-1)]: Done 124 tasks      | elapsed:  9.9min
[Parallel(n_jobs=-1)]: Done 141 tasks      | elapsed: 10

RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                      max_depth=23, max_features='sqrt', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=2,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
-1814.8345727824217
Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    4.6s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    9.2s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   22.5s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  4.7min
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done  57 tasks      | elapsed:  6.7min
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:  7.2min
[Parallel(n_jobs=-1)]: Done  81 tasks      | elapsed:  7.5min
[Parallel(n_jobs=-1)]: Done  94 tasks      | elapsed:  8.0min
[Parallel(n_jobs=-1)]: Done 109 tasks      | elapsed:  8.8min
[Parallel(n_jobs=-1)]: Done 124 tasks      | elapsed: 10.3min
[Parallel(n_jobs=-1)]: Done 141 tasks      | elapsed: 11

RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                      max_depth=17, max_features='log2', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=2,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=550, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
-1847.4984585662569


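Run the search 5 times to see 5 candidate best models and their hyperparameters. Compare the 5 cross-validated scores (negative RMSE, so values closer to zero are better) to choose the final model and hyperparameters. Every run selects RandomForestRegressor, and the second run has the best score (-1765.36), with max_depth=20 and n_estimators=250.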
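The comparison can be sketched in a few lines. The scores are copied from the five runs printed above; the variable names are illustrative. Because the scorer returns negative RMSE, the best run is the one with the largest (least negative) score.

```python
# Negative RMSE scores copied from the five runs above.
run_scores = [
    -1771.6695369482761,
    -1765.3591850968237,
    -1766.55209906544,
    -1814.8345727824217,
    -1847.4984585662569,
]

# The largest score is the one closest to zero, i.e. the lowest RMSE.
best_run = max(range(len(run_scores)), key=run_scores.__getitem__)
best_rmse = -run_scores[best_run]
print(best_run + 1, best_rmse)  # 1-based run number and its RMSE
```

The index points at the second run, whose hyperparameters are the ones carried forward.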

In [47]:
RandomForestRegressor().get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

Check the default hyperparameters of RandomForestRegressor

In [48]:
params={'bootstrap': False,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': 20,
 'max_features': 'log2',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 250,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

Replace the default hyperparameters with the chosen best hyperparameters.

In [49]:
pipe = Pipeline([('preprocessing', preprocessing),
                 ('pca', PCA(n_components=15)),
                 ('reg',RandomForestRegressor(**params))
                ])

Construct the final pipeline by adding the best model and hyperparameters

In [50]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('preprocessing',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('categorical',
                                                  Pipeline(memory=None,
                                                           steps=[('ohe',
                                                                   OneHotEncoder(categories='auto',
                                                                                 drop=None,
                                                                                 dtype=<class 'numpy.float64'>,
                                                                                 handle_unknown='ignore',
                                                                                 sparse=True)),
                                           

Train the final pipeline on the training set.

In [51]:
y_pred   = pipe.predict(X_test)

Use the trained model to predict on the test set.

In [52]:
mse  = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r_2 = r2_score(y_test, y_pred)

Calculate the MSE, MAE, and R2 between the predictions and y_test. All three are standard regression metrics, so they suit this task. MSE is the mean squared error between the predicted and actual values. MAE is the mean absolute error between the predicted and actual values. R2 is the coefficient of determination of the model's predictions.
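To make the three definitions concrete, the sketch below computes each metric by hand with NumPy and checks it against scikit-learn. The four target and prediction values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical credit limits and predictions.
y_true = np.array([3000.0, 5000.0, 12000.0, 20000.0])
y_hat = np.array([3500.0, 4500.0, 11000.0, 21000.0])

residuals = y_true - y_hat  # -500, 500, 1000, -1000

mse_manual = np.mean(residuals ** 2)     # mean of squared errors
mae_manual = np.mean(np.abs(residuals))  # mean of absolute errors
# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse_manual, mean_squared_error(y_true, y_hat))
print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```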

In [53]:
print('mse = {mse}\nmae = {mae}\nr2  = {r_2}'.format(mse=mse, mae=mae, r_2=r_2))

mse = 2575826.8301345236
mae = 1044.6272158530253
r2  = 0.9702055308283112


Print the scores together. The mean squared error between the predictions and the actual values is 2575826.83, the mean absolute error is 1044.63, and the R2 coefficient of determination is 0.97.
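The search was scored with RMSE, while the test evaluation reports MSE. Taking the square root of the test MSE puts both on the same scale, in the units of the credit limit. The sketch below uses the values printed above; the variable names are illustrative.

```python
import math

test_mse = 2575826.8301345236        # test-set MSE printed above
best_cv_score = -1765.3591850968237  # best cross-validated score (negative RMSE)

test_rmse = math.sqrt(test_mse)  # roughly 1,605
cv_rmse = -best_cv_score         # flip the sign to get an RMSE

print(round(test_rmse, 1), round(cv_rmse, 1))
```

The test RMSE comes out somewhat lower than the cross-validated RMSE, so the held-out result is at least consistent with the estimate from the search.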

The selected model and hyperparameters predict the credit limit well on the test set. Next, I'll run a grid search if possible, which is more exhaustive than a randomized search but takes more time.
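As a rough sketch of that next step, GridSearchCV evaluates every combination in a grid instead of sampling from it. The data below is synthetic, and the small grid and tree count are assumptions chosen to keep the example fast; a real grid would be centred on the hyperparameters found above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the credit data (assumption).
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)

# Every combination is tried: 2 depths x 2 feature rules = 4 candidates.
param_grid = {"max_depth": [4, 8], "max_features": ["log2", "sqrt"]}

grid = GridSearchCV(
    RandomForestRegressor(n_estimators=25, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_demo, y_demo)

n_candidates = len(grid.cv_results_["params"])
print(n_candidates, grid.best_params_, -grid.best_score_)
```

The cost grows with the product of the grid sizes, which is why the exhaustive search takes longer than the randomized one.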