BERT is a bidirectional transformer pretrained with a masked language modeling objective and next sentence prediction. It has been applied successfully to natural language processing tasks such as question answering, sentiment analysis, and document summarization, to name a few. Evaluating whether BERT can predict loan defaults involves the following steps:


1.   Install TensorFlow 2.x and the transformers library.
2.   Download Kiva's train and test datasets.
3.   Load the train and test datasets from CSV files.
4.   Tokenize the English texts in both datasets using the BERT tokenizer.
5.   Load the pretrained BERT model from Hugging Face.
6.   Fine-tune the BERT model using Kiva's train dataset.
7.   Use the fine-tuned model to predict whether a Kiva loan request will default.

The Colab notebook below walks through the experiment step by step:



Install TensorFlow 2.x and the Hugging Face transformers library:

In [None]:
%tensorflow_version 2.x
!pip3 install transformers

Download Kiva's train and test datasets:

In [None]:
!wget -O kiva_train.csv "https://drive.google.com/u/0/uc?id=1dzzVbgHphbCf7kvq9IKiIhwzmxPbuH4s&export=download"
!wget -O kiva_test.csv "https://drive.google.com/u/0/uc?id=1EVWfyqQOd_W2uTKrr4JTD2iFrEZHoOHT&export=download"

Import TensorFlow and load the BERT tokenizer:

In [None]:
import tensorflow as tf
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Load train and test datasets from CSV files:

In [None]:
import pandas as pd

train_df = pd.read_csv("kiva_train.csv")
test_df = pd.read_csv("kiva_test.csv")

Tokenize English texts:



In [None]:
tokenized_datasets = {"train": [], "test": []}

# Tokenize each training text, padding or truncating it to the model's maximum length,
# and keep the loan ID and the default label alongside the model inputs.
for _, row in train_df.iterrows():
    tokenized_row = tokenizer(row.en_clean, padding="max_length", truncation=True)
    tokenized_row["loan_id"] = row.loan_id
    tokenized_row["label"] = row.defaulted
    tokenized_datasets["train"].append(tokenized_row)

# Test rows keep only the tokenizer outputs.
for _, row in test_df.iterrows():
    tokenized_row = tokenizer(row.en_clean, padding="max_length", truncation=True)
    tokenized_datasets["test"].append(tokenized_row)

tokenized_datasets["train"] = pd.DataFrame(tokenized_datasets["train"])
# Sample 10% of the training rows as an evaluation split (fixed seed for reproducibility).
tokenized_datasets["eval"] = tokenized_datasets["train"].sample(frac=0.1, random_state=2)
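To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a simplified plain-Python sketch of what happens to the token IDs of a single text. The helper `pad_or_truncate` is hypothetical and not part of the transformers library, and the special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are assumed values for illustration.

```python
# Simplified illustration of max_length padding and truncation for one sequence.
# pad_or_truncate is a hypothetical helper, not a transformers API.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token IDs


def pad_or_truncate(content_ids, max_length):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    # Reserve two positions for [CLS] and [SEP]; drop content tokens that do not fit.
    kept = content_ids[: max_length - 2]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    # Real tokens get mask 1; the padding appended below gets mask 0.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


# A short text is padded on the right; a long text is cut down to fit.
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The real tokenizer does the same thing at the model's maximum length, so every row ends up with inputs of identical length regardless of how long the original text was.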

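Note: The following cell removes the evaluation rows from the training set so that the two splits do not overlap. Skip it when training the final model that predicts loan defaults for the held-out test dataset, so that the model is trained on all of the training data.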

In [None]:
# Drop the rows that were sampled into the eval split, matching on loan_id.
eval_loan_ids = set(tokenized_datasets["eval"]["loan_id"])
is_eval = tokenized_datasets["train"]["loan_id"].isin(eval_loan_ids)
tokenized_datasets["train"] = tokenized_datasets["train"][~is_eval]

Convert pandas dataframes to TensorFlow datasets:

In [None]:
train_features = {x: list(tokenized_datasets["train"][x]) for x in tokenizer.model_input_names}
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, list(tokenized_datasets["train"]["label"])))
train_tf_dataset = train_tf_dataset.shuffle(len(tokenized_datasets["train"])).batch(16)

eval_features = {x: list(tokenized_datasets["eval"][x]) for x in tokenizer.model_input_names}
eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, list(tokenized_datasets["eval"]["label"])))
eval_tf_dataset = eval_tf_dataset.batch(16)

Load the pretrained BERT model from Hugging Face:

In [None]:
from transformers import TFBertForSequenceClassification
model = TFBertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

Fine-tune the BERT model on Kiva's train dataset for three epochs and save the model. (Colab Pro usually stops training after about a day of execution, so it is better to save the model every few epochs and reload it to continue.)

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)


checkpoint_filepath = 'bert-kiva-checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_sparse_categorical_accuracy',
    mode='max',
    save_best_only=True)

model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=3, callbacks=[model_checkpoint_callback])
model.save_pretrained("bert-kiva")

Reload the saved model from the previous step:

In [None]:
from transformers import TFBertForSequenceClassification
model = TFBertForSequenceClassification.from_pretrained("bert-kiva", num_labels=2)

Train the BERT model for another three epochs:

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)


checkpoint_filepath = 'bert-kiva-checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_sparse_categorical_accuracy',
    mode='max',
    save_best_only=True)

model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=3, callbacks=[model_checkpoint_callback])
model.save_pretrained("bert-kiva")
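The training cells above track only accuracy on the eval split. If defaulted loans are a minority of the data (an assumption; the class balance is not stated here), precision and recall for the default class can be more informative than accuracy. A minimal plain-Python sketch, using the hypothetical helper `binary_report` and made-up labels rather than Kiva results:

```python
# Accuracy, precision, recall and F1 for the "defaulted" class (label 1).
# binary_report is a hypothetical helper written for this illustration.
def binary_report(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration only: 2 defaults among 10 loans.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
report = binary_report(y_true, y_pred)
print(report)
```

In this toy case the accuracy looks reasonable while only half of the actual defaults are caught, which is the kind of gap that accuracy alone would hide.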

Predict loan defaults of the held-out test dataset:

In [None]:
import numpy as np
from tqdm import tqdm

test_pred = []
for row in tqdm(tokenized_datasets["test"]):
    # Reshape each input to a batch of one: shape (1, sequence_length).
    row = dict(row)
    row["input_ids"] = tf.reshape(row["input_ids"], (1,-1))
    row["attention_mask"] = tf.reshape(row["attention_mask"], (1,-1))
    row["token_type_ids"] = tf.reshape(row["token_type_ids"], (1,-1))
    outputs = model(**row)
    # No labels are passed, so only the logits are needed; the predicted class is the index of the larger logit.
    logits = outputs.logits
    test_pred.append(int(np.argmax(logits)))
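The loop above keeps only the index of the larger logit. When a probability-like score is more useful than a hard 0/1 label, the two logits can be passed through a softmax. The sketch below uses plain Python and made-up logits; it assumes that index 1 corresponds to `defaulted == 1`, which follows from the labels used in training but is worth verifying on real data.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one loan: [score for class 0, score for class 1].
example_logits = [1.2, -0.3]
probs = softmax(example_logits)
# The index of the largest probability matches the argmax of the logits.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

Because softmax preserves the ordering of the logits, the predicted class is unchanged; the second probability can simply be stored as an estimated default score.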


Save the predicted defaults to a CSV file:

In [None]:
test_df["defaulted"] = test_pred
test_df.to_csv("kiva_test_with_defaulted.csv", index=False)