Term frequency (TF) and inverse document frequency (IDF) together measure how important a word is to a document relative to the rest of a corpus in natural language processing. TF-IDF features, combined with XGBoost, a scalable tree boosting ML model, can be used to predict whether a loan will default. TF-IDF with XGBoost is a shallower and more lightweight approach than deep learning models (e.g. BERT), though it is not as accurate. Here are the steps to conduct the experiment:
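To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a made-up three-document corpus. It assumes the smoothed IDF variant, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document vector, which is what scikit-learn documents as the default for `TfidfVectorizer`. The helper name `tfidf` and the toy sentences are illustrative only and are not part of the Kiva experiment.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights for tokenised docs."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts; missing terms count as 0
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Made-up corpus, already split into tokens.
toy_docs = [
    "loan for farm seeds".split(),
    "loan for shop stock".split(),
    "loan for school fees".split(),
]
vocab, rows = tfidf(toy_docs)
weights = dict(zip(vocab, rows[0]))

# "loan" occurs in every document, "farm" in only one, so "farm" gets the larger weight.
print(weights["loan"] < weights["farm"])
```

Words shared by every document carry little information about any single one, which is why they end up with the smallest weights.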

Install and upgrade XGBoost:

In [None]:
!pip3 install --upgrade xgboost

Import TF-IDF Vectorizer from scikit-learn:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

Download Kiva's train and test datasets:

In [None]:
# Quote the URLs so the shell does not treat "&" as a background operator
!wget -O kiva_train.csv "https://drive.google.com/u/0/uc?id=1dzzVbgHphbCf7kvq9IKiIhwzmxPbuH4s&export=download"
!wget -O kiva_test.csv "https://drive.google.com/u/0/uc?id=1EVWfyqQOd_W2uTKrr4JTD2iFrEZHoOHT&export=download"

Load train and test datasets from CSV files:

In [None]:
import pandas as pd

train_dataset = pd.read_csv('kiva_train.csv')
test_dataset = pd.read_csv('kiva_test.csv')

Set aside 10% of data from train dataset for validation:

In [None]:
eval_dataset = train_dataset.sample(frac=0.1, random_state=2)

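Note: The following block removes the validation rows from the train dataset. Skip it when training the model that predicts loan defaults of the held-out test dataset, so that the model is trained on the full train dataset.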

In [None]:
# Drop the validation rows from the train dataset so the two sets do not overlap.
# A vectorized isin() filter avoids rebuilding the loan_id list on every row.
eval_loan_ids = eval_dataset["loan_id"]
train_dataset = train_dataset[~train_dataset["loan_id"].isin(eval_loan_ids)]
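As a quick sanity check of the split logic, the sketch below repeats the same two steps on a small made-up frame and confirms that the train and validation parts share no `loan_id`. The `toy` frame and its values are hypothetical stand-ins for the Kiva data.

```python
import pandas as pd

# Hypothetical stand-in for the Kiva train dataset: 20 loans with made-up labels.
toy = pd.DataFrame({"loan_id": range(100, 120), "defaulted": [0, 1] * 10})

# Same steps as above: sample 10% for validation, keep the rest for training.
toy_eval = toy.sample(frac=0.1, random_state=2)
toy_train = toy[~toy["loan_id"].isin(toy_eval["loan_id"])]

# No loan should appear in both parts.
overlap = set(toy_train["loan_id"]) & set(toy_eval["loan_id"])
print(len(toy_train), len(toy_eval), len(overlap))
```

This check relies on `loan_id` being unique per row, which is assumed here and is worth verifying on the real data.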

Calculate TF-IDF values on all datasets:

In [None]:
# create object
tfidf = TfidfVectorizer()

# fit on the training text only, then get TF-IDF values for each dataset
tfidf_vectorizer = tfidf.fit(list(train_dataset["en_clean"]))
train_tfidf_vectors = tfidf_vectorizer.transform(list(train_dataset["en_clean"]))
eval_tfidf_vectors = tfidf_vectorizer.transform(list(eval_dataset["en_clean"]))
test_tfidf_vectors = tfidf_vectorizer.transform(list(test_dataset["en_clean"]))
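Because the vectorizer is fitted on the training text only, its vocabulary is fixed at that point, and words that first appear in the validation or test text have no column and are ignored. The sketch below shows this on made-up sentences; the `toy_` names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training and unseen text.
toy_train_texts = ["loan for farm seeds", "loan for shop stock"]
toy_new_texts = ["loan for fishing boat"]

toy_tfidf = TfidfVectorizer().fit(toy_train_texts)
toy_vectors = toy_tfidf.transform(toy_new_texts)

# "fishing" and "boat" were never seen during fit, so they get no column;
# only "loan" and "for" contribute non-zero entries.
print(sorted(toy_tfidf.vocabulary_))
print(toy_vectors.shape, toy_vectors.nnz)
```

Fitting on the training text alone keeps information from the validation and test sets out of the features.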


Train the model using XGBoost and TF-IDF features from the previous steps:

In [None]:
import xgboost as xgb
train_features = pd.DataFrame(train_tfidf_vectors.toarray())
dtrain = xgb.DMatrix(train_features, label=pd.DataFrame(train_dataset["defaulted"]))
dtrain.save_binary('train.buffer')

eval_features = pd.DataFrame(eval_tfidf_vectors.toarray())
deval = xgb.DMatrix(eval_features, label=pd.DataFrame(eval_dataset["defaulted"]))
deval.save_binary('eval.buffer')

test_features = pd.DataFrame(test_tfidf_vectors.toarray())
dtest = xgb.DMatrix(test_features)
dtest.save_binary('test.buffer')

In [None]:
param = {'max_depth': 9, 'eta': 0.1, 'objective': 'binary:logistic'}
param['nthread'] = 4

param['eval_metric'] = ['error','auc']

evallist = [(dtrain, 'train'),(deval, 'eval')]

num_round = 41
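# Pass optional arguments by keyword; early stopping monitors the last metric ('auc')
# on the last entry of evallist ('eval')
bst = xgb.train(param, dtrain, num_boost_round=num_round, evals=evallist, early_stopping_rounds=3)
#bst = xgb.train(param, dtrain, num_boost_round=num_round, evals=evallist)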

bst.save_model('0001.model')
bst.dump_model('dump.raw.txt')

Predict loan defaults of the held-out test dataset:

In [None]:
test_pred = bst.predict(dtest, iteration_range=(0, bst.best_iteration + 1))
test_pred = [int(x) for x in (test_pred>0.5)]
test_dataset["defaulted"] = test_pred
test_dataset.to_csv("kiva_test_with_defaulted.csv",index=False)

Debugging code: dump validation predictions and validation labels, and print validation accuracy.

In [None]:
deval2 = xgb.DMatrix(eval_features)
eval_pred = bst.predict(deval2, iteration_range=(0, bst.best_iteration + 1))
eval_pred = [int(x) for x in (eval_pred>0.5)]
print(eval_pred)
print(list(eval_dataset["defaulted"]))

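# Import the function explicitly: "import sklearn" alone does not guarantee
# that the sklearn.metrics submodule is loaded
from sklearn.metrics import accuracy_score
import numpy as np
# accuracy_score expects (y_true, y_pred)
acc = accuracy_score(np.array(list(eval_dataset["defaulted"])), np.array(eval_pred))
print("Validation Accuracy: ", acc)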