# Introduction to the Data

Lending Club releases data for all of the approved and declined loan applications periodically on their website. You can select a few different year ranges to download the datasets (in CSV format) for both approved and declined loans.

You'll also find a data dictionary (in XLS format), which contains information on the different column names, towards the bottom of the page. This data dictionary is available in the data/financial loans repo, so you can refer to it whenever you'd like to learn more about what a column represents in the datasets.

Can we build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not?

In this notebook, we'll build a machine learning model that will focus on approved loans data from 2007 to 2011, since a good number of the loans have already finished. In the datasets for later years, many of the loans are current and still being paid off.

# Reading in the Data

To ensure that code runs fast on our platform, we reduced the size of LoanStats3a.csv by:

(1) Removing the first line because it contains the extraneous text Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action) instead of the column titles, which prevents the dataset from being parsed by the pandas library properly.

(2) Removing the desc column which contains a long text explanation for each loan.

(3) Removing the url column which contains a link to each loan on Lending Club which can only be accessed with an investor account.

(4) Removing all columns containing more than 50% missing values, which lets us move faster since we spend less time trying to fill these values.

In [None]:
import pandas as pd
import os
import matplotlib.pyplot as plt
%matplotlib inline

os.chdir(r"C:\Users\gerr1\Desktop\Data Science Portfolio\data\financial loans")

#The chunk of code below performs the data cleaning described above
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1)
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)
loans_2007.to_csv('loans_2007.csv', index=False, encoding='utf-8')

In [None]:
#Reading in data
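#on_bad_lines='skip' replaces the removed error_bad_lines=False option (pandas 1.3 and later)
loans_2007 = pd.read_csv("loans_2007.csv", sep=',', on_bad_lines='skip', index_col=False, dtype=str)
#drop_duplicates must be called and its result assigned to take effect
loans_2007 = loans_2007.drop_duplicates()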

print(loans_2007.shape)
loans_2007.head()

In [None]:
#Printing column dtypes
print(loans_2007.dtypes)

As we can see, every column is reported with a string (object) dtype, because we read all of the values in as strings. This means we will have to manually convert the numeric columns to float64 or int64, after cleaning up string columns such as the percentages.
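As a minimal sketch of what that conversion can look like, the toy Dataframe below mimics columns that were read in as strings. The sample values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values: every column holds strings,
# as it would after reading the CSV with a string dtype.
toy = pd.DataFrame({
    "loan_amnt": ["5000", "2500", "2400"],
    "int_rate": ["10.65%", "15.27%", "15.96%"],
})

# Plain numeric strings can be converted directly.
toy["loan_amnt"] = pd.to_numeric(toy["loan_amnt"])

# Strings with a trailing '%' need the symbol stripped first.
toy["int_rate"] = toy["int_rate"].str.rstrip("%").astype(float)

print(toy.dtypes)
```

pd.to_numeric picks an integer or float dtype based on the values, while astype(float) always yields floats.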

# Grouping Columns

The Dataframe contains many columns, and exploring them all at once can be cumbersome. We will break up the columns into 3 groups of 18 columns and use the data dictionary to become familiar with what each column represents. As we come to understand each feature, we will pay attention to any features that:

leak information from the future (after the loan has already been funded),
don't affect a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending Club),
are formatted poorly and need to be cleaned up,
require more data or a lot of processing to turn into a useful feature, or
contain redundant information.

In the next few cells, we'll focus on just columns that we need to remove from consideration. Then, we can circle back and further dissect the columns we decided to keep.

After analyzing each column, we can conclude that the following features need to be removed:

id: randomly generated field by Lending Club for unique identification purposes only,
member_id: also a randomly generated field by Lending Club for unique identification purposes only,
funded_amnt: leaks data from the future (after the loan has already started to be funded),
funded_amnt_inv: also leaks data from the future (after the loan has already started to be funded),
grade: contains information that is redundant with the interest rate column (int_rate),
sub_grade: also contains information that is redundant with the interest rate column (int_rate),
emp_title: requires other data and a lot of processing to potentially be useful, and
issue_d: leaks data from the future (after the loan has already been completely funded).

In [None]:
#Dropping redundant columns
drop_cols = ["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"]

loans = loans_2007.drop(drop_cols, axis=1)
print(loans.shape)
loans.head()

Next, we need to drop the following columns:

zip_code: redundant with the addr_state column since only the first 3 digits of the 5-digit zip code are visible (which can only be used to identify the state the borrower lives in),
out_prncp: leaks data from the future (after the loan has already started to be paid off),
out_prncp_inv: also leaks data from the future (after the loan has already started to be paid off),
total_pymnt: also leaks data from the future (after the loan has already started to be paid off),
total_pymnt_inv: also leaks data from the future (after the loan has already started to be paid off),
total_rec_prncp: also leaks data from the future (after the loan has already started to be paid off).

The out_prncp and out_prncp_inv columns both describe the outstanding principal amount for a loan, which is the remaining amount the borrower still owes. These 2 columns, as well as the total_pymnt column, describe properties of the loan after it has been fully funded and has started to be paid off. This information isn't available to an investor before the loan is fully funded, so we don't want to include it in our model.

In [None]:
loans = loans.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)

In [None]:
print(loans.shape)
loans.head()

In the next cell, we need to drop the following columns:

total_rec_int: leaks data from the future (after the loan has already started to be paid off),
total_rec_late_fee: also leaks data from the future (after the loan has already started to be paid off),
recoveries: also leaks data from the future (after the loan has already started to be paid off),
collection_recovery_fee: also leaks data from the future (after the loan has already started to be paid off),
last_pymnt_d: also leaks data from the future (after the loan has already started to be paid off),
last_pymnt_amnt: also leaks data from the future (after the loan has already started to be paid off).

All of these columns leak data from the future, meaning that they're describing aspects of the loan after it's already been fully funded and started to be paid off by the borrower.

In [None]:
loans = loans.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)

In [None]:
#Shape and first row
print(loans.shape)
loans.iloc[0]

# Target Column & Binary Classification

The column we will use as a target is the loan_status column, since it indicates whether the loan was fully paid. However, this column contains text values, which we will need to convert to numeric ones.

In [None]:
#First, we'll take a look at the loan_status column
print(loans_2007["loan_status"].value_counts())

There are 8 different possible values for the loan_status column.

From the investor's perspective, we're interested in trying to predict which loans will be paid off on time and which ones won't be. Only the Fully Paid and Charged Off values describe the final outcome of the loan. The other values describe loans that are still ongoing, where the jury is still out on whether the borrower will pay back the loan on time. While the Default status resembles the Charged Off status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while defaulted ones have a small chance.

Since we're interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as a binary classification one. Let's remove all the loans that don't contain either Fully Paid or Charged Off as the loan's status and then transform the Fully Paid values to 1 for the positive case and the Charged Off values to 0 for the negative case. While there are a few different ways to transform all of the values in a column, we'll use the Dataframe method replace. According to the documentation, we can pass the replace method a nested mapping dictionary in the following format:

    mapping_dict = {
        "date": {
            "january": 1,
            "february": 2,
            "march": 3
        }
    }
    df = df.replace(mapping_dict)

Lastly, one thing we need to keep in mind is the class imbalance between the positive and negative cases. While there are 33,136 loans that have been fully paid off, there are only 5,634 that were charged off. Class imbalance is a common problem in binary classification: during training, the model ends up with a strong bias towards predicting the class with more observations in the training set and will rarely predict the class with fewer observations. The stronger the imbalance, the more biased the model becomes. There are a few different ways to tackle this class imbalance.
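To put numbers on the imbalance, the short calculation below works only from the two counts quoted above. It is a back-of-the-envelope sketch, not a computation on the actual Dataframe.

```python
# Counts quoted in the text above.
fully_paid = 33136
charged_off = 5634
total = fully_paid + charged_off

# How many fully paid loans there are per charged off loan.
imbalance_ratio = fully_paid / charged_off      # about 5.88

# Share of loans that were charged off.
charged_off_share = charged_off / total         # about 0.145

# Accuracy of a model that always predicts "Fully Paid" (1).
baseline_accuracy = fully_paid / total          # about 0.855

print(round(imbalance_ratio, 2), round(charged_off_share, 3), round(baseline_accuracy, 3))
```

A model that never predicts a charged off loan would already be right about 85% of the time, which is why accuracy alone is a misleading yardstick here.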

In [None]:
#Removing all rows from loans that contain values other than Fully Paid or Charged Off for the loan_status column
#.copy() avoids a SettingWithCopyWarning when the filtered Dataframe is modified later
loans = loans[loans["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()

#Using the Dataframe method replace with a nested mapping dictionary: Fully Paid becomes 1, Charged Off becomes 0
replacements = {"loan_status": {"Fully Paid": 1, "Charged Off": 0}}
loans = loans.replace(replacements)
#Make sure the target is stored as integers
loans["loan_status"] = loans["loan_status"].astype(int)

print(loans.shape)

# Removing Single Value Columns

We'll look for any columns that contain only one unique value and remove them. These columns won't be useful for the model since they don't add any information to each loan application. In addition, removing these columns will reduce the number of columns we'll need to explore further.

In [None]:
#Removing any columns from loans that contain only one unique value
drop_columns = []

for col in list(loans.columns):
    non_null = loans[col].dropna()
    unique_non_null = non_null.unique()
    num_true_unique = len(unique_non_null)
    if num_true_unique <= 1:
        drop_columns.append(col)
        
loans = loans.drop(drop_columns, axis=1)

print(drop_columns)

In [None]:
#Shape of filtered data
print(loans.shape)

It looks like we were able to remove 9 more columns since they only contained 1 unique value. Now, we'll explore the individual features in greater depth and work towards training our first machine learning model.

# Handling Missing Values

We'll now prepare the data for machine learning by focusing on handling missing values, converting categorical columns to numeric columns, and removing any other extraneous columns we encounter throughout this process.

In [None]:
#Use the isnull and sum methods to return the number of null values in each column
null_count = loans.isnull().sum()
print(null_count)

While most of the columns have 0 missing values, 2 columns have 50 or fewer rows with missing values, and 1 column, pub_rec_bankruptcies, contains 697 rows with missing values. Let's remove any column where more than 1% of the rows contain a null value, which here means dropping pub_rec_bankruptcies. In addition, we'll remove the remaining rows containing null values.

In [None]:
loans = loans.drop("pub_rec_bankruptcies", axis=1)
loans = loans.dropna(axis=0)
print(loans.dtypes.value_counts())
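The cell above names the column to drop by hand. As a sketch of how the same 1% rule could be applied programmatically, the example below uses a made-up toy Dataframe; the names max_null_share, null_share and cols_to_drop are hypothetical helpers, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data: 200 rows, column "b" is 25% missing, column "c" is 0.5% missing.
n = 200
toy = pd.DataFrame({
    "a": np.arange(n, dtype=float),
    "b": np.arange(n, dtype=float),
    "c": np.arange(n, dtype=float),
})
toy.loc[:49, "b"] = np.nan   # labels 0..49 inclusive, 50 rows
toy.loc[0, "c"] = np.nan     # a single row

max_null_share = 0.01  # hypothetical threshold matching the 1% rule

# Fraction of null values in each column.
null_share = toy.isnull().mean()

# Drop columns above the threshold, then drop any rows that still contain nulls.
cols_to_drop = null_share[null_share > max_null_share].index.tolist()
cleaned = toy.drop(columns=cols_to_drop).dropna(axis=0)

print(cols_to_drop, cleaned.shape)
```

Dropping the sparse columns first matters: calling dropna on rows first would have thrown away every row where "b" was missing.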

# Categorical Columns

In [None]:
#Use the Dataframe method select_dtypes to select only the columns of object type from loans and assign the resulting
#Dataframe to object_columns_df
object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.shape)
object_columns_df.head()

Some of the columns seem like they represent categorical values, but we should confirm by checking the number of unique values in those columns:

    home_ownership: home ownership status, can only be 1 of 4 categorical values according to the data dictionary,
    verification_status: indicates if income was verified by Lending Club,
    emp_length: number of years the borrower was employed upon time of application,
    term: number of payments on the loan, either 36 or 60,
    addr_state: borrower's state of residence,
    purpose: a category provided by the borrower for the loan request,
    title: loan title provided by the borrower.

We're also left with some columns that need to be converted to numeric like loan_amnt, revol_util and int_rate.

Lastly, some of the columns contain date values that would require a good amount of feature engineering for them to be potentially useful:

    earliest_cr_line: The month the borrower's earliest reported credit line was opened,
    last_credit_pull_d: The most recent month Lending Club pulled credit for this loan.

Since these date features require some feature engineering for modeling purposes, let's remove these date columns from the Dataframe.
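For reference, this is roughly what a first step of that feature engineering might look like. The sketch assumes the values are formatted like "Jan-1985" (abbreviated English month, then year); both that format and the sample values are assumptions, and earliest_cr_year is a hypothetical feature name.

```python
import pandas as pd

# Hypothetical sample values; the "Mon-YYYY" format is an assumption.
toy = pd.DataFrame({"earliest_cr_line": ["Jan-1985", "Apr-1999", "Nov-2001"]})

# Parse the strings into datetimes using an explicit format.
parsed = pd.to_datetime(toy["earliest_cr_line"], format="%b-%Y")

# A simple numeric feature: the year the earliest credit line was opened.
toy["earliest_cr_year"] = parsed.dt.year

print(toy)
```

A numeric column like this could be fed to a model directly, whereas the raw strings could not.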

In [None]:
#Let's explore the unique value counts of the columns that seem like they contain categorical values
cat_cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']

for col in cat_cols:
    print(loans[col].value_counts())

The home_ownership, verification_status, emp_length, term, and addr_state columns all contain multiple discrete values. We should clean the emp_length column and treat it as a numerical one since the values have ordering (2 years of employment is less than 8 years).

In [None]:
#Let's look at the unique value counts for the purpose and title columns to understand which column we want to keep
print(loans["purpose"].value_counts())
print(loans["title"].value_counts())

The home_ownership, verification_status, and term columns each contain a few discrete categorical values. We should encode these columns as dummy variables and keep them.

It seems like the purpose and title columns do contain overlapping information but we'll keep the purpose column since it contains a few discrete values. In addition, the title column has data quality issues since many of the values are repeated with slight modifications (e.g. Debt Consolidation and Debt Consolidation Loan and debt consolidation).

We can use the following mapping to clean the emp_length column:

    "10+ years": 10
    "9 years": 9
    "8 years": 8
    "7 years": 7
    "6 years": 6
    "5 years": 5
    "4 years": 4
    "3 years": 3
    "2 years": 2
    "1 year": 1
    "< 1 year": 0
    "n/a": 0
    
We erred on the side of being conservative with the 10+ years, < 1 year and n/a mappings. We assume that people who may have been working more than 10 years have only really worked for 10 years. We also assume that people who've worked less than a year, or whose employment length is not available, have worked for 0 years. This is a general heuristic, but it's not perfect.

Lastly, the addr_state column contains many discrete values and we'd need to add 49 dummy variable columns to use it for classification. This would make our Dataframe much larger and could slow down how quickly the code runs. Let's remove this column from consideration.

In [None]:
#Remove the last_credit_pull_d, addr_state, title, and earliest_cr_line columns from loans
rem_cols = ["last_credit_pull_d", "addr_state", "title", "earliest_cr_line"]
loans = loans.drop(rem_cols, axis=1)

#Convert the int_rate and revol_util columns to float columns
loans["int_rate"] = loans['int_rate'].str.rstrip('%').astype(float)
loans["revol_util"] = loans['revol_util'].str.rstrip('%').astype(float)

In [None]:
#Use the replace method to clean the emp_length column
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}

loans = loans.replace(mapping_dict)

In [None]:
#Now we just need to convert the remaining numeric columns to the float or int dtype
import numpy as np

float_cols = ['loan_amnt', 'installment', 'annual_inc', 'dti', 'revol_bal']
int_cols = ['delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'total_acc']

for col in float_cols:
    loans[col] = loans[col].astype(float)
    
for col in int_cols:
    column = loans[col].astype(float)
    loans[col] = column.astype(np.int64)
    
print(loans.dtypes)

In [None]:
#Encoding the home_ownership, verification_status, purpose, and term columns as dummy variables
cat_columns = ["home_ownership", "verification_status", "purpose", "term"]

dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)

print(loans.shape)

In [None]:
#Dropping the pymnt_plan column because it has 'n': 39016 and 'y': 1. So this column is not giving any new information
loans = loans.drop('pymnt_plan', axis=1)
loans.head()

We've now finished the data preparation needed to start training machine learning models. We converted all of the columns to numerical values because those are the only type of values scikit-learn can work with. In the next section, we'll experiment with training models and evaluating them using cross-validation.

# Picking An Error Metric

An error metric will help us figure out when our model is performing well, and when it's performing poorly. To tie error metrics all the way back to the original question we wanted to answer, let's say we're using a machine learning model to predict whether or not we should fund a loan on the Lending Club platform. Our objective in this is to make money -- we want to fund enough loans that are paid off on time to offset our losses from loans that aren't paid off. An error metric will help us determine if our algorithm will make us money or lose us money.

In this case, we're primarily concerned with false positives and false negatives. Both of these are different types of misclassifications. With a false positive, we predict that a loan will be paid off on time, but it actually isn't. This costs us money, since we fund loans that lose us money. With a false negative, we predict that a loan won't be paid off on time, but it actually would be paid off on time. This loses us potential money, since we didn't fund a loan that actually would have been paid off.

We are interested in the false positive rate (fpr) and the true positive rate (tpr).
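Both rates come straight from the four confusion-matrix counts: tpr = tp / (tp + fn) and fpr = fp / (fp + tn). The helper below is a small sketch of that calculation on made-up labels; tpr_fpr is a hypothetical name, not a function used elsewhere in this notebook.

```python
import numpy as np

def tpr_fpr(actual, predicted):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    # Converting to arrays compares values by position and ignores any pandas index.
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)

    tp = np.sum((predicted == 1) & (actual == 1))
    fn = np.sum((predicted == 0) & (actual == 1))
    fp = np.sum((predicted == 1) & (actual == 0))
    tn = np.sum((predicted == 0) & (actual == 0))

    return tp / (tp + fn), fp / (fp + tn)

# Made-up example: 4 good loans (1) and 2 bad loans (0).
actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 1, 0]

tpr, fpr = tpr_fpr(actual, predicted)
print("True Positive Rate:", tpr)
print("False Positive Rate:", fpr)
```

Here 3 of the 4 good loans are funded (tpr of 0.75) and 1 of the 2 bad loans is funded (fpr of 0.5). The helper assumes both classes are present; otherwise one of the divisions is by zero.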

# Class Imbalance

We mentioned earlier that there is a significant class imbalance in the loan_status column. There are about 6 times as many loans that were paid off on time (1) as loans that weren't paid off on time (0). This causes a major issue when we use accuracy as a metric: because of the class imbalance, a classifier can predict 1 for every row and still have high accuracy. An example follows.

In [None]:
# Predict that all loans will be paid off on time.
# Reuse the index of loans so the boolean filters below line up row by row.
predictions = pd.Series(np.ones(loans.shape[0]), index=loans.index)

tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

fpr = fp / (fp + tn)
tpr = tp / (tp + fn)

print("True Positive Rate:", tpr)
print("False Positive Rate:", fpr)

# Logistic Regression with Cross Validation

In the last cell, you may have noticed that both fpr and tpr were 1. This is because we predicted 1 for each row. This means that we correctly identified all of the good loans (true positive rate), but we also incorrectly labeled all of the bad loans as good (false positive rate). Now that we've set up error metrics, we can move on to making predictions using a machine learning algorithm. To fit the machine learning models, we'll use the scikit-learn library. In this case, we'll train a logistic regression model.

In order to get a realistic depiction of the accuracy of the algorithm, we'll need to use cross validation to generate predictions. Cross validation splits the dataset into groups, then makes predictions on each group using the other groups as training data. This ensures that we don't overfit by generating predictions on the same data that we train our algorithm with.
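To make the splitting concrete, here is a small sketch on a toy array of 6 rows. It assumes a recent scikit-learn release, where KFold is imported from sklearn.model_selection and takes n_splits rather than the number of rows.

```python
import numpy as np
from sklearn.model_selection import KFold

# 6 toy rows with 2 features each.
X = np.arange(12).reshape(6, 2)

# 3 folds; shuffle=True is needed for random_state to have any effect.
kf = KFold(n_splits=3, shuffle=True, random_state=1)

test_rows = []
for train_idx, test_idx in kf.split(X):
    # A row is never used for training and testing within the same fold.
    assert set(train_idx).isdisjoint(test_idx)
    test_rows.extend(test_idx.tolist())
    print("train:", train_idx, "test:", test_idx)
```

Across the 3 folds every row lands in the test set exactly once, so each prediction comes from a model that never saw that row during training.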

We can perform cross validation using the cross_val_predict function from scikit-learn. cross_val_predict allows us to pass in a classifier, the features, and the target.

We'll create an instance of KFold, which will perform 3 fold cross validation across our dataset. We set random_state to 1 to ensure that the folds are always consistent, and we can compare scores between runs. If we don't, each fold will be randomized every time, making it hard to tell if we're improving our model or not.

If we pass the instance of KFold into cross_val_predict, it will then perform 3 fold cross validation to generate unbiased predictions.

Once we have cross validated predictions, we can compute true positive rate and false positive rate.

In [None]:
from sklearn.linear_model import LogisticRegression
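#sklearn.cross_validation was removed from scikit-learn; these helpers now live in sklearn.model_selection
from sklearn.model_selection import cross_val_predict, KFold

#Create a Dataframe named features that contains just the feature columns
features = loans.drop("loan_status", axis=1)

#Create a Series named target that contains just the target column
target = loans["loan_status"]

lr = LogisticRegression()
#3 fold cross validation; shuffle=True is required for random_state to take effect
kf = KFold(n_splits=3, shuffle=True, random_state=1)

#Generate cross validated predictions for features
predictions = cross_val_predict(lr, features, y=target, cv=kf)
#Reuse the index of target so the boolean filters below line up row by row
predictions = pd.Series(predictions, index=target.index)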

tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

fpr = fp / (fp + tn)
tpr = tp / (tp + fn)

print("True Positive Rate:", tpr)
print("False Positive Rate:", fpr)

As you can see from the last cell, our fpr and tpr are around what we'd expect if the model was predicting all ones.

One way to correct for the imbalance is to tell the classifier to penalize misclassifications of the less common class more heavily, which is easy to implement using scikit-learn. We can do this by setting the class_weight parameter to balanced when creating the LogisticRegression instance. We can repeat the cross validation procedure we performed in the last cell, but with the class_weight parameter set to balanced.

In [None]:
lr = LogisticRegression(class_weight='balanced')
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

predictions = cross_val_predict(lr, features, y=target, cv=kfold)
predictions = pd.Series(predictions, index=target.index)

tn_filter = (predictions == 0) & (target == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (target == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (target == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (target == 0)
fp = len(predictions[fp_filter])

fpr = fp / (fp + tn)
tpr = tp / (tp + fn)

print("True Positive Rate:", tpr)
print("False Positive Rate:", fpr)

# Manual Penalties

By balancing the classes in the last cell, we lowered the false positive rate, at the cost of a lower true positive rate. Our true positive rate is now around 63%, and our false positive rate is around 61%. From a conservative investor's standpoint, it's reassuring that the false positive rate is lower because it means that we'll be able to do a better job at avoiding bad loans than if we funded everything. However, we'd only fund about 63% of the loans that would have been paid off (the true positive rate), so we'd immediately reject a good number of profitable loans.

We can try to lower the false positive rate further by assigning a harsher penalty for misclassifying the negative class. While setting class_weight to balanced will automatically set a penalty based on the number of 1s and 0s in the column, we can also set a manual penalty. In the last cell, the penalty scikit-learn imposed for misclassifying a 0 would have been around 5.89 times the penalty for misclassifying a 1 (since there are 5.89 times as many 1s as 0s). We will use a dictionary to manually set these weights. We will also try some larger penalties.
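As a rough check on where that penalty comes from, the sketch below reproduces the balanced weights, which scikit-learn documents as n_samples / (n_classes * np.bincount(y)). It builds a stand-in target from the class counts quoted earlier in this notebook, so the ratio may differ slightly from the one in the cleaned Dataframe.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in target built from the counts quoted earlier (an assumption,
# not the cleaned Dataframe): 33,136 ones and 5,634 zeros.
y = np.array([1] * 33136 + [0] * 5634)

# Weights in the order given by classes, i.e. [weight for 0, weight for 1].
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# What matters to the model is the penalty for a 0 relative to a 1.
relative_penalty = weights[0] / weights[1]

print(weights, round(relative_penalty, 2))
```

The relative penalty equals the ratio of 1s to 0s, which is why a manual dictionary such as {0: ratio, 1: 1} behaves much like class_weight set to balanced.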

In [None]:
#We will now train a new model with the new manual weights, penalty will be the dict we use
penalty = {0: 5.89, 1: 1}

#Now we will re-run our new model and test our predictions
lr = LogisticRegression(class_weight=penalty)
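#Same folds as before (random_state=1) so the scores are comparable between runs
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

predictions = cross_val_predict(lr, features, y=target, cv=kfold)
predictions = pd.Series(predictions, index=target.index)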

tn_filter = (predictions == 0) & (target == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (target == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (target == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (target == 0)
fp = len(predictions[fp_filter])

fpr = fp / (fp + tn)
tpr = tp / (tp + fn)

print("True Positive Rate:", tpr)
print("False Positive Rate:", fpr)

In [None]:
#We can even train a model with some larger manual penalties
penalty = {0: 10, 1: 1}

#Now we will re-run our new model and test our predictions
lr = LogisticRegression(class_weight=penalty)
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

predictions = cross_val_predict(lr, features, y=target, cv=kfold)
predictions = pd.Series(predictions, index=target.index)

tn_filter = (predictions == 0) & (target == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (target == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (target == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (target == 0)
fp = len(predictions[fp_filter])

fpr = fp / (fp + tn)
tpr = tp / (tp + fn)

print("True Positive Rate:", tpr)
print("False Positive Rate:", fpr)

It looks like assigning manual penalties lowered the false positive rate to 19%, and thus lowered our risk. Note that this comes at the expense of true positive rate. While we have fewer false positives, we're also missing opportunities to fund more loans and potentially make more money. Given that we're approaching this as a conservative investor, this strategy makes sense, but it's worth keeping in mind the tradeoffs.

While we could tweak the penalties further, it's best to move on to trying a different model right now, for larger potential false positive rate gains. We can always loop back and iterate on the penalties more later.

# Random Forests

Random forests can capture nonlinear relationships between the features and the target, and can learn complex conditionals. Logistic regression, by contrast, can only learn a decision boundary that is linear in the features. Training a random forest may therefore give us better results if some columns relate nonlinearly to loan_status.
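The difference is easy to see on synthetic data. The sketch below uses a made-up XOR-style problem, not the loans data, where the label depends on the interaction of two features and no single straight line can separate the classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic XOR-style data: the label is 1 when both features share the same sign.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Mean accuracy over 3 cross validation folds for each model.
lr_acc = cross_val_score(LogisticRegression(), X, y, cv=3).mean()
rf_acc = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=3
).mean()

print("Logistic regression accuracy:", lr_acc)
print("Random forest accuracy:", rf_acc)
```

On data like this the logistic regression stays close to guessing while the forest does far better. Whether the loans data contains interactions of this kind is an open question that only the experiment can answer.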

We can use the RandomForestClassifier class from scikit-learn to do this.

In [None]:
from sklearn.ensemble import RandomForestClassifier

#Now we will train our Random Forest and analyze the predictions
rf = RandomForestClassifier(class_weight='balanced', random_state=8)
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

predictions = cross_val_predict(rf, features, y=target, cv=kfold)
predictions = pd.Series(predictions, index=target.index)

tn_filter = (predictions == 0) & (target == 0)
tn = len(predictions[tn_filter])

tp_filter = (predictions == 1) & (target == 1)
tp = len(predictions[tp_filter])

fn_filter = (predictions == 0) & (target == 1)
fn = len(predictions[fn_filter])

fp_filter = (predictions == 1) & (target == 0)
fp = len(predictions[fp_filter])

fpr = fp / (fp + tn)
tpr = tp / (tp + fn)

print("True Positive Rate:", tpr)
print("False Positive Rate:", fpr)

Unfortunately, using a random forest classifier didn't improve our false positive rate. The model is likely weighting too heavily on the 1 class, and still mostly predicting 1s. We could fix this by applying a harsher penalty for misclassifications of 0s.

Ultimately, our best model had a false positive rate of 19.6%, and a true positive rate of 20%. For a conservative investor, this means that they make money as long as the interest rate is high enough to offset the losses from 19.6% of borrowers defaulting, and that the pool of 20% of borrowers is large enough to make enough interest money to offset the losses. There is still a lot of improvement to be made on this algorithm.

# Further Improvement

If we had randomly picked loans to fund, borrowers would have defaulted on 14.5% of them, and our model is better than that, although we're excluding more loans than a random strategy would. Given this, there's still quite a bit of room to improve:

    We can tweak the penalties further.
    We can try models other than a random forest and logistic regression.
    We can use some of the columns we discarded to generate better features.
    We can ensemble multiple models to get more accurate predictions.
    We can tune the parameters of the algorithm to achieve higher performance.