Term frequency (TF) and inverse document frequency (IDF) are used in natural language processing to measure how important a word is to a document within a collection of documents. TF-IDF features, combined with a support vector machine (SVM), can be used to predict whether a loan will default. TF-IDF with SVM is a shallower, more lightweight approach than deep learning models (e.g., BERT), though it is generally not as accurate. Here are the steps to conduct the experiment:

Import TF-IDF Vectorizer from scikit-learn:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

Download Kiva's train and test datasets:

In [None]:
!wget -O kiva_train.csv "https://drive.google.com/u/0/uc?id=1dzzVbgHphbCf7kvq9IKiIhwzmxPbuH4s&export=download"
!wget -O kiva_test.csv "https://drive.google.com/u/0/uc?id=1EVWfyqQOd_W2uTKrr4JTD2iFrEZHoOHT&export=download"

Load train and test datasets from CSV files:

In [None]:
import pandas as pd

train_dataset = pd.read_csv(r'kiva_train.csv')
test_dataset = pd.read_csv(r'kiva_test.csv')

Set aside 10% of the training dataset for validation:

In [None]:
eval_dataset = train_dataset.sample(frac=0.1, random_state=2)

Note: Remove the following block when the model is used to predict loan defaults for the held-out test dataset.

In [None]:
# Keep only the training rows whose loan_id was not sampled into the validation set
train_dataset = train_dataset[~train_dataset["loan_id"].isin(eval_dataset["loan_id"])]
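
The sampling and filtering steps above can also be done in a single call. The following is a minimal sketch, assuming scikit-learn's `train_test_split` as a swapped-in alternative and a hypothetical toy DataFrame (`toy`) in place of the Kiva data. It will not select the same rows as `sample(frac=0.1, random_state=2)`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the training data, reusing the "loan_id" and
# "defaulted" column names; the values are invented for illustration.
toy = pd.DataFrame({"loan_id": range(100, 120), "defaulted": [0, 1] * 10})

# Hold out 10% of the rows for validation; the two parts never overlap.
toy_train, toy_eval = train_test_split(toy, test_size=0.1, random_state=2)

print(len(toy_train), len(toy_eval))
```

Because the split returns both parts at once, no separate filtering step is needed to remove the validation rows from the training set.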

Calculate TF-IDF values on all datasets:

In [None]:
# Create the vectorizer
tfidf = TfidfVectorizer()

# Fit on the training text only, then compute TF-IDF values for each dataset
tfidf_vectorizer = tfidf.fit(list(train_dataset["en_clean"]))
train_tfidf_vectors = tfidf_vectorizer.transform(list(train_dataset["en_clean"]))
eval_tfidf_vectors = tfidf_vectorizer.transform(list(eval_dataset["en_clean"]))
test_tfidf_vectors = tfidf_vectorizer.transform(list(test_dataset["en_clean"]))
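
To make the fit-then-transform pattern above concrete, here is a small sketch on a hypothetical toy corpus (the sentences are invented and merely stand in for the `en_clean` column). It illustrates that the vocabulary is learned from the training text only, so words that appear only in new text are ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy text standing in for the "en_clean" column.
train_texts = [
    "loan for farm supplies",
    "loan for school fees",
    "buy seeds for farm",
]
new_texts = ["loan for a tractor"]

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_texts)  # learn vocabulary and IDF weights
new_vectors = vectorizer.transform(new_texts)          # reuse them on unseen text

# Every matrix has one column per word in the training vocabulary.
print(sorted(vectorizer.vocabulary_))
print(train_vectors.shape, new_vectors.shape)
```

Here "tractor" never occurs in the training text, so it gets no column, and the default tokenizer drops the single-character token "a".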


Train the model using SVM and the TF-IDF features from the previous steps:

In [None]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='sigmoid') # Sigmoid kernel

#Train the model using the training sets
clf.fit(train_tfidf_vectors.toarray(), list(train_dataset["defaulted"]))

#Predict the response for the validation dataset
eval_pred = clf.predict(eval_tfidf_vectors.toarray())

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy: how often is the classifier correct?
print("Validation Accuracy:",metrics.accuracy_score(list(eval_dataset["defaulted"]), eval_pred))
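
The vectorizing and training steps can also be chained into one object. The sketch below is only an illustration, assuming scikit-learn's `make_pipeline` and hypothetical toy texts and labels (not the Kiva data). The pipeline hands the sparse TF-IDF matrix straight to `SVC`, which accepts sparse input, so no `.toarray()` conversion is involved.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data; the labels are invented for illustration only.
texts = [
    "loan for farm supplies", "buy seeds for farm", "farm tools and seeds",
    "loan for school fees", "pay school fees", "school books and fees",
]
labels = [0, 0, 0, 1, 1, 1]

# The vectorizer is fitted on the training text, then its sparse output
# is passed directly to the SVM with the same sigmoid kernel as above.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="sigmoid"))
model.fit(texts, labels)

# Raw text goes in; the pipeline applies the fitted vectorizer before predicting.
predictions = model.predict(["seeds for the farm", "fees for school"])
print(predictions)
```

Skipping the dense conversion can matter on larger datasets, where a dense array with one column per vocabulary word may take far more memory than the sparse matrix.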

Predict loan defaults of the held-out test dataset:

In [None]:
test_pred = clf.predict(test_tfidf_vectors.toarray())
test_dataset["defaulted"] = test_pred
test_dataset.to_csv("kiva_test_with_defaulted.csv",index=False)