<center><h1>3.2) Dealing with imbalance</h1></center>



As we saw in the second section of the project, the models still struggle to classify the two smallest classes correctly, so I will use two sampling techniques to address the imbalance in the dataset. The two approaches share the same goal but work from opposite sides: oversampling enlarges the two minority classes, while undersampling shrinks the biggest class.


<center><h2>Oversampling</h2></center>

I will use the SMOTE algorithm to augment the two smallest classes (non funded and partially funded loans). SMOTE stands for Synthetic Minority Over-sampling Technique.

The idea behind the algorithm is to create synthetic data points that resemble existing points of the same class: each new point is placed between a minority sample and one of its nearest neighbours from that class. By using SMOTE I will increase the size of the two smallest classes until they match the biggest class, which removes the class imbalance.
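
The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation; the function name `smote_like_samples` and the toy points are assumptions made for this example.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between all minority points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    k = min(k, len(X) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per point

    base = rng.integers(0, len(X), size=n_new)                  # random starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # position on the segment
    return X[base] + gap * (X[picked] - X[base])

# Hypothetical minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6, k=2)
print(synthetic.shape)
```

Every synthetic row lies on a line segment between two existing minority points, which is why the new data stays inside the region the class already occupies.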

SMOTE is unfortunately not part of Sklearn or Keras, so the package has to be installed separately. It is available from PyPI: https://pypi.org/project/imbalanced-learn/

The biggest drawback of SMOTE is computing time, as the training set in this case grows to roughly three times its original size.
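
The growth factor follows from simple arithmetic: if every class is raised to the size of the largest one, the result has (number of classes) × (largest class) rows. The training log further down reports 598312 + 149579 = 747891 rows, which is exactly 3 × 249297. The sketch below uses hypothetical class counts, not the Kiva numbers, and the helper name `balanced_size` is an assumption for this example.

```python
def balanced_size(class_counts):
    """Rows after every class is oversampled up to the largest class."""
    return len(class_counts) * max(class_counts.values())

# Hypothetical class counts, for illustration only
counts = {"Totally funded": 9000, "partially funded": 800, "not funded": 200}
before = sum(counts.values())
after = balanced_size(counts)
print(before, after, round(after / before, 2))  # 10000 27000 2.7
```

The more dominant the majority class, the closer the factor gets to the number of classes.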


In [1]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


In [2]:
# Load the prepared file
kiva_smote = pd.read_csv('data_files/kiva_keras.csv')
# Keep a reference to the full dataset for the undersampling part
kiva_undersampling = kiva_smote
# Halve the dataset so the augmented dataset doesn't get too big
kiva_smote = kiva_smote.sample(frac=0.5,random_state=1)
kiva_smote.shape

(335581, 11)

In [3]:
kiva_smote.head()

Unnamed: 0,loan_amount,description,loan_status,activity,sector,country,currency,term_in_months,lender_count,borrower_genders,repayment_interval
63804,1050.0,Yes,Totally funded,Pigs,Agriculture,Indonesia,IDR,8.0,33,Female,irregular
510864,400.0,Yes,Totally funded,Embroidery,Arts,Pakistan,PKR,14.0,15,Female,monthly
595924,300.0,Yes,Totally funded,Bakery,Food,El Salvador,USD,8.0,7,Male,monthly
580780,500.0,Yes,Totally funded,Sewing,Services,Pakistan,PKR,11.0,14,Female,irregular
344839,1200.0,Yes,Totally funded,Farming,Agriculture,Cambodia,KHR,13.0,26,Female,monthly


In [4]:
# Preprocessing for Keras
from sklearn.preprocessing import StandardScaler,MinMaxScaler

#features_to_scale = ['loan_amount','term_in_months','lender_count']
scaler = MinMaxScaler()
#scaled_features = scaler.fit_transform(kiva[features_to_scale])

kiva_smote = pd.get_dummies(data=kiva_smote, columns=['activity','description', 'sector', 'country','currency','repayment_interval','borrower_genders'])
kiva_smote[['loan_amount','term_in_months','lender_count']] = scaler.fit_transform(kiva_smote[['loan_amount','term_in_months','lender_count']])

#kiva_keras.head()

In [5]:
# Separate the target y from the features X
y = kiva_smote.loan_status
kiva_smote.drop(['loan_status'], axis=1, inplace=True)

In [6]:
# Split first, then apply SMOTE to the training data only

X_train, X_test, y_train, y_test = train_test_split(kiva_smote, y, test_size=0.2, random_state=0, shuffle=True)


smote = SMOTE(random_state=42)
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X_res, y_res = smote.fit_resample(X_train, y_train)

In [7]:
from keras.utils import to_categorical
from sklearn import preprocessing


# Encode train_y
le = preprocessing.LabelEncoder()
y_train_step_one = le.fit_transform(y_res)
smote_train_y = to_categorical(y_train_step_one)

# Encode test_y with the encoder fitted on the training labels
y_test_step_one = le.transform(y_test)
keras_test_y = to_categorical(y_test_step_one)



  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


<h3>Caveat:</h3>

I haven’t changed the architecture of the model or its parameters. In this part of the project I only wanted to see whether balancing the classes improves the metrics.

This is not the best approach. The GridSearchCV run in part two was based on the original, imbalanced dataset, so we can’t assume that the parameters that were optimal for that dataset are also optimal for the augmented one. Unfortunately, running a grid search or randomized search on nearly two million samples would probably take days.


In [11]:
import numpy as np
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
from keras.layers import Dense,Dropout
from keras.callbacks import ModelCheckpoint  

model = Sequential()
model.add(Dense(10, input_shape=kiva_smote.shape[1:], activation='relu'))
model.add(Dense(1000,  activation='relu'))
model.add(Dropout(0.5))
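# Note: softmax is the usual output activation for categorical_crossentropy; sigmoid is kept to match the reported runs
model.add(Dense(3, activation='sigmoid'))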
  
model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.0015), metrics=['accuracy'])

checkpointer = ModelCheckpoint(filepath='weights.bestsmote.hdf5', 
                               verbose=1, save_best_only=True)

model.fit(X_res, smote_train_y,validation_split = 0.2,epochs = 80, batch_size= 1000,callbacks=[checkpointer],verbose=1)

Train on 598312 samples, validate on 149579 samples
Epoch 1/80

Epoch 00001: val_loss improved from inf to 1.24587, saving model to weights.bestsmote.hdf5
Epoch 2/80

Epoch 00002: val_loss improved from 1.24587 to 1.06441, saving model to weights.bestsmote.hdf5
Epoch 3/80

Epoch 00003: val_loss did not improve from 1.06441
Epoch 4/80

Epoch 00004: val_loss improved from 1.06441 to 0.88222, saving model to weights.bestsmote.hdf5
Epoch 5/80

Epoch 00005: val_loss improved from 0.88222 to 0.87099, saving model to weights.bestsmote.hdf5
Epoch 6/80

Epoch 00006: val_loss improved from 0.87099 to 0.61425, saving model to weights.bestsmote.hdf5
Epoch 7/80

Epoch 00007: val_loss did not improve from 0.61425
Epoch 8/80

Epoch 00008: val_loss improved from 0.61425 to 0.56889, saving model to weights.bestsmote.hdf5
Epoch 9/80

Epoch 00009: val_loss improved from 0.56889 to 0.50542, saving model to weights.bestsmote.hdf5
Epoch 10/80

Epoch 00010: val_loss did not improve from 0.50542
Epoch 11/80




Epoch 00040: val_loss did not improve from 0.27385
Epoch 41/80

Epoch 00041: val_loss did not improve from 0.27385
Epoch 42/80

Epoch 00042: val_loss did not improve from 0.27385
Epoch 43/80

Epoch 00043: val_loss did not improve from 0.27385
Epoch 44/80

Epoch 00044: val_loss did not improve from 0.27385
Epoch 45/80

Epoch 00045: val_loss did not improve from 0.27385
Epoch 46/80

Epoch 00046: val_loss did not improve from 0.27385
Epoch 47/80

Epoch 00047: val_loss did not improve from 0.27385
Epoch 48/80

Epoch 00048: val_loss did not improve from 0.27385
Epoch 49/80

Epoch 00049: val_loss did not improve from 0.27385
Epoch 50/80

Epoch 00050: val_loss improved from 0.27385 to 0.26870, saving model to weights.bestsmote.hdf5
Epoch 51/80

Epoch 00051: val_loss did not improve from 0.26870
Epoch 52/80

Epoch 00052: val_loss improved from 0.26870 to 0.25653, saving model to weights.bestsmote.hdf5
Epoch 53/80

Epoch 00053: val_loss did not improve from 0.25653
Epoch 54/80

Epoch 00054: va

<keras.callbacks.History at 0x210046ce4e0>

In [9]:
model = Sequential()
model.add(Dense(10, input_shape=kiva_smote.shape[1:], activation='relu'))
model.add(Dense(1000,  activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='sigmoid'))

model.load_weights("weights.bestsmote.hdf5")
model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.0015), metrics=['accuracy'])
model.fit(X_res,smote_train_y,batch_size= 1000)
score = model.evaluate(X_test,keras_test_y)

Epoch 1/1


In [12]:
from sklearn.metrics import classification_report
import numpy as np

keras_test_y = np.argmax(keras_test_y, axis=1) # Convert one-hot to index
y_pred = model.predict_classes(X_test)
print(classification_report(keras_test_y, y_pred, target_names=['full funded', 'non funded','partially funded']))

                  precision    recall  f1-score   support

     full funded       0.99      0.94      0.96     62278
      non funded       0.71      0.93      0.81       319
partially funded       0.49      0.83      0.62      4520

     avg / total       0.95      0.93      0.94     67117
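
As a sanity check, the f1 column can be recomputed from the precision and recall columns, since f1 is their harmonic mean. The sketch below copies the numbers from the report above; the `avg / total` row of this report format is the support-weighted average, which is why it stays close to the majority class.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) copied from the classification report
report = {
    "full funded": (0.99, 0.94, 62278),
    "non funded": (0.71, 0.93, 319),
    "partially funded": (0.49, 0.83, 4520),
}
f1_scores = {name: f1(p, r) for name, (p, r, _) in report.items()}
total = sum(s for _, _, s in report.values())
weighted_f1 = sum(f1_scores[n] * s for n, (_, _, s) in report.items()) / total

for name, score in f1_scores.items():
    print(f"{name}: {score:.2f}")
print(f"weighted average: {weighted_f1:.2f}")
```

The low precision for the partially funded loans (0.49) is what pulls their f1 down to 0.62 despite a recall of 0.83.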



<center><h2>Undersampling</h2></center>

The idea behind data reduction is to shrink the biggest class, thus giving the smaller classes a higher weight. The reduction is done by randomly removing rows, for example with the pandas sample method. The problem with data reduction is that we risk losing information.

In this case the problem is accentuated by the small number of non funded loans. To balance information loss against weight compensation, I will reduce the biggest class to 20 percent of its original size, even if that means the smallest class will still be proportionally smaller. If the reduction still shows some imbalance, I will further reduce the proportion of the biggest class.
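
A minimal sketch of random undersampling with pandas, run on toy data rather than the Kiva file. The helper name `undersample_majority` and the row counts are assumptions made for this example.

```python
import pandas as pd

def undersample_majority(df, label_col, majority_label, frac, random_state=0):
    """Keep a random fraction of the majority class and all other rows."""
    is_majority = df[label_col] == majority_label
    kept = df[is_majority].sample(frac=frac, random_state=random_state)
    # Concatenate with the untouched minority rows, then shuffle
    return pd.concat([kept, df[~is_majority]]).sample(frac=1, random_state=random_state)

# Toy data, hypothetical: 1000 majority rows, 60 and 20 minority rows
toy = pd.DataFrame({
    "loan_status": ["Totally funded"] * 1000 + ["partially funded"] * 60 + ["not funded"] * 20,
    "loan_amount": range(1080),
})
reduced = undersample_majority(toy, "loan_status", "Totally funded", frac=0.2)
print(reduced["loan_status"].value_counts())
```

Passing a fixed `random_state` makes the reduction reproducible, which helps when comparing different fractions.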


In [13]:
kiva_undersampling = pd.read_csv('data_files/kiva_keras.csv')

In [14]:
totally_funded = kiva_undersampling.loc[kiva_undersampling.loan_status == 'Totally funded']
partially_funded = kiva_undersampling.loc[kiva_undersampling.loan_status == 'partially funded']
not_funded = kiva_undersampling.loc[kiva_undersampling.loan_status == 'not funded']


df1 = totally_funded.sample(frac=.05)
df2 = partially_funded
df3 = not_funded
kiva_reduced = (pd.concat([df1,df2,df3]))
kiva_reduced.shape

(79462, 11)

Note on .sample(frac=.05): reducing the biggest class to 20 % did not produce significant results, so I gradually reduced it to 5 %.

In [15]:
# Preprocessing for Keras
from sklearn.preprocessing import StandardScaler,MinMaxScaler

#features_to_scale = ['loan_amount','term_in_months','lender_count']
scaler = MinMaxScaler()
#scaled_features = scaler.fit_transform(kiva[features_to_scale])

kiva_reduced = pd.get_dummies(data=kiva_reduced, columns=['activity','description', 'sector', 'country','currency','repayment_interval','borrower_genders'])
kiva_reduced[['loan_amount','term_in_months','lender_count']] = scaler.fit_transform(kiva_reduced[['loan_amount','term_in_months','lender_count']])



In [16]:
y = kiva_reduced.loan_status
kiva_reduced.drop(['loan_status'], axis=1, inplace=True)

In [17]:
X_train_red, X_test, y_train_red, y_test = train_test_split(kiva_reduced, y, test_size=0.2, random_state=0, shuffle=True)

In [18]:


# Encode train_y
le = preprocessing.LabelEncoder()
y_train_step_one = le.fit_transform(y_train_red)
reduced_train_y = to_categorical(y_train_step_one)

# Encode test_y with the encoder fitted on the training labels
y_test_step_one = le.transform(y_test)
keras_test_y = to_categorical(y_test_step_one)

In [19]:
model = Sequential()
model.add(Dense(10, input_shape=kiva_reduced.shape[1:], activation='relu'))
model.add(Dense(1000,  activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='sigmoid'))
  
model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.0015), metrics=['accuracy'])

checkpointer = ModelCheckpoint(filepath='weights.bestredu.hdf5', 
                               verbose=1, save_best_only=True)

model.fit(X_train_red,reduced_train_y,validation_split = 0.2,epochs = 80,batch_size= 1000,callbacks=[checkpointer],verbose=1)

Train on 50855 samples, validate on 12714 samples
Epoch 1/80

Epoch 00001: val_loss improved from inf to 0.68284, saving model to weights.bestredu.hdf5
Epoch 2/80

Epoch 00002: val_loss improved from 0.68284 to 0.65886, saving model to weights.bestredu.hdf5
Epoch 3/80

Epoch 00003: val_loss improved from 0.65886 to 0.64171, saving model to weights.bestredu.hdf5
Epoch 4/80

Epoch 00004: val_loss improved from 0.64171 to 0.63139, saving model to weights.bestredu.hdf5
Epoch 5/80

Epoch 00005: val_loss improved from 0.63139 to 0.61872, saving model to weights.bestredu.hdf5
Epoch 6/80

Epoch 00006: val_loss improved from 0.61872 to 0.61247, saving model to weights.bestredu.hdf5
Epoch 7/80

Epoch 00007: val_loss did not improve from 0.61247
Epoch 8/80

Epoch 00008: val_loss improved from 0.61247 to 0.59578, saving model to weights.bestredu.hdf5
Epoch 9/80

Epoch 00009: val_loss improved from 0.59578 to 0.57864, saving model to weights.bestredu.hdf5
Epoch 10/80

Epoch 00010: val_loss did not 


Epoch 00039: val_loss improved from 0.42989 to 0.42920, saving model to weights.bestredu.hdf5
Epoch 40/80

Epoch 00040: val_loss did not improve from 0.42920
Epoch 41/80

Epoch 00041: val_loss did not improve from 0.42920
Epoch 42/80

Epoch 00042: val_loss did not improve from 0.42920
Epoch 43/80

Epoch 00043: val_loss did not improve from 0.42920
Epoch 44/80

Epoch 00044: val_loss improved from 0.42920 to 0.41883, saving model to weights.bestredu.hdf5
Epoch 45/80

Epoch 00045: val_loss improved from 0.41883 to 0.40934, saving model to weights.bestredu.hdf5
Epoch 46/80

Epoch 00046: val_loss did not improve from 0.40934
Epoch 47/80

Epoch 00047: val_loss did not improve from 0.40934
Epoch 48/80

Epoch 00048: val_loss improved from 0.40934 to 0.40789, saving model to weights.bestredu.hdf5
Epoch 49/80

Epoch 00049: val_loss did not improve from 0.40789
Epoch 50/80

Epoch 00050: val_loss improved from 0.40789 to 0.39472, saving model to weights.bestredu.hdf5
Epoch 51/80

Epoch 00051: val


Epoch 00080: val_loss did not improve from 0.35650


<keras.callbacks.History at 0x210058674a8>

In [21]:
from sklearn.metrics import classification_report
import numpy as np
model.load_weights("weights.bestredu.hdf5")
keras_test_y = np.argmax(keras_test_y, axis=1) # Convert one-hot to index
y_pred = model.predict_classes(X_test)
print(classification_report(keras_test_y, y_pred, target_names=['full funded', 'non funded','partially funded']))

                  precision    recall  f1-score   support

     full funded       0.85      0.83      0.84      6063
      non funded       0.70      0.72      0.71       721
partially funded       0.87      0.89      0.88      9109

     avg / total       0.86      0.86      0.85     15893



<center><h3>Resampling the datasets conclusion</h3></center>



I tried two resampling approaches to improve the results from part two / Keras. Each improved the result for at least one of the two smallest classes.


Augmenting the dataset with SMOTE improved the f1 score for the non funded loans (0.53 to 0.81), but at the cost of worse results for the partially funded loans (0.67 to 0.62) and the fully funded loans (0.98 to 0.96).
Undersampling gave better results for the non funded and partially funded loans, but to the detriment of the fully funded loans.



<center><h1>Final Conclusion</h1></center>

The goal of this project was to predict whether a loan from Kiva gets funded, especially for those outcomes that could have negative consequences for the borrower. Based on the assumption that no single model fits all purposes, I compared Sklearn algorithms with a Keras neural network, with and without resampling to overcome class imbalance.<br>

As the following table shows, the f1 scores per class differ noticeably between the algorithms.




| Algorithm | f1 fully funded   | f1 non funded   | f1 partially funded |
|------|------|------|------|
|   Random forest| 0.98 | 1.00| 0.71  |
|   Keras| 0.98 | 0.53| 0.67  |
|   Keras SMOTE| 0.96 | 0.81 | 0.62  |
|   Keras undersampling| 0.84 | 0.71 | 0.88  |
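
One way to condense the table into a single number per model is the macro-averaged f1, the unweighted mean over the three classes, which treats the small classes as equally important. The sketch below uses the values from the table; note that the undersampling row was scored on its own reduced test split, so the comparison is only indicative.

```python
# f1 per class (fully funded, non funded, partially funded), copied from the table
f1_table = {
    "Random forest": (0.98, 1.00, 0.71),
    "Keras": (0.98, 0.53, 0.67),
    "Keras SMOTE": (0.96, 0.81, 0.62),
    "Keras undersampling": (0.84, 0.71, 0.88),
}
macro_f1 = {name: sum(scores) / len(scores) for name, scores in f1_table.items()}

# Print the models from best to worst macro f1
for name, value in sorted(macro_f1.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.2f}")
```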

The first good candidate was already found in the first part, where the Random forest algorithm achieved a perfect score, with no misclassifications, for one of the three classes, the non funded loans. This was the most critical outcome, as a non funded loan means the death of the borrower's project.
Random forest had problems predicting the correct outcome for the partially funded loans.

The Keras-based model came close to Random forest on the partially funded loans (0.67 versus 0.71) but failed on the smallest class, the non funded loans (0.53). It’s interesting to point out that the Sklearn MLP showed similar behaviour.

Resampling by augmenting (SMOTE) improved the results for the non funded loans but showed worse results for the other classes. Undersampling improved the results for the non funded and partially funded loans but didn't perform well for the biggest class.

The overall conclusion is that Random forest is well suited to the problem, as it achieved the best possible predictions for the class most critical to the borrowers, the non funded loans. Using f1 as the metric, it outperformed Keras and was much faster.


The Keras results could improve by widening the search space for the grid search/randomized search and by extending it to include the architecture (layers and hidden nodes).