# Coursework 2
# General Instructions
In this CW, we apply predictive modelling to build investment strategies.

The data required to run this notebook (including the notebook itself) is shared on Moodle-->coursework. The data is saved in a pickle format with the file name "clean_data_v2.pickle". You can find this in CW 1. 

You need to save this file on your PC, and then load it using an appropriate file path. There are five exercises for this CW. 

For this CW, no preliminary code is provided; you must build the entire notebook yourself.

No approximate number of lines is provided for this CW.

Marks for each exercise are shown in brackets; note that these marks are provisional and may change.

If you need parameters that are not specified, you may choose them yourself, but you must justify the choice.

Where applicable, for simplicity, use only continuous features when forming the training, validation, and test sets.

No short selling is allowed.

The weight of a selected loan in a portfolio is either zero or one, i.e. no partial weights are allowed.

You are advised to support your code with brief comments.

You can copy code from CW1 if required.


# Provide all the necessary preliminaries such as importing libraries, dataset, etc after this block and before Exercise 3.1.

In [41]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.axes as ax
import numpy as np
import pickle
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from keras.models import Sequential
from keras.layers import Dense, Activation
import warnings
from keras.layers import Dropout

warnings.filterwarnings('ignore')

# Load the cleaned dataset and the lists of feature names
with open("./clean_data_v2.pickle", "rb") as f:
    data, discrete_features, continuous_features = pickle.load(f)
default_seed=1
# Put all return names in one list
return_cols=["return_1","return_2","return_3a","return_3b","return_3c"]

In [42]:
data.head()

Unnamed: 0,id,loan_amnt,funded_amnt,term,int_rate,installment,grade,emp_length,home_ownership,annual_inc,...,total_pymnt,last_pymnt_d,recoveries,loan_length,term_num,ret_PESS,ret_OPT,ret_INTa,ret_INTb,ret_INTc
0,40390412,5000.0,5000.0,36 months,12.39,167.01,C,< 1 year,RENT,48000.0,...,5475.14,2015-12-01,0.0,10.973531,36,0.031676,0.103917,0.031155,0.050634,0.086751
2,40401108,17000.0,17000.0,36 months,12.39,567.82,C,1 year,RENT,53000.0,...,20452.09912,2018-03-01,0.0,37.947391,36,0.067688,0.064215,0.050574,0.066334,0.09495
3,40501689,9000.0,9000.0,36 months,14.31,308.96,C,6 years,RENT,39000.0,...,9792.56,2015-11-01,0.0,9.987885,36,0.029354,0.105803,0.029798,0.049345,0.085622
4,40352737,14000.0,14000.0,36 months,11.99,464.94,B,6 years,RENT,44000.0,...,16592.9113,2018-01-01,0.0,36.008953,36,0.061736,0.061721,0.047093,0.063007,0.091937
5,40431323,10000.0,10000.0,60 months,19.24,260.73,E,10+ years,MORTGAGE,130000.0,...,15122.07997,2018-10-01,0.0,44.978336,60,0.102442,0.136655,0.113866,0.131897,0.164518


In [43]:
print(discrete_features)
print(continuous_features)

['home_ownership', 'grade', 'emp_length', 'purpose', 'verification_status', 'term']
['loan_amnt', 'funded_amnt', 'installment', 'annual_inc', 'dti', 'revol_bal', 'delinq_2yrs', 'open_acc', 'pub_rec', 'fico_range_high', 'fico_range_low', 'int_rate', 'revol_util']


In [44]:
#Return Number 1
data['return_1'] = ( (data.total_pymnt - data.funded_amnt) 
                                            / data.funded_amnt ) * (12 / data['term_num'])


#Return Number 2
# Assume that when a loan gives a positive return, we can immediately reinvest in a similar loan,
# so the return is annualised over the actual loan length; when the loan takes a loss, we fall back to return_1


data['return_2'] = ( (data.total_pymnt - data.funded_amnt)
                                            / data.funded_amnt ) * (12 / data['loan_length'])
data.loc[data.return_2 < 0,'return_2'] = data.return_1[data.return_2 < 0]
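
As a quick sanity check of the two definitions above, the following sketch recomputes both returns by hand for the first loan shown in `data.head()` (funded amount 5000, total payment 5475.14, 36-month term, loan length of about 10.97 months). The helper functions `return_1` and `return_2` are illustrative stand-ins written for this example, not part of the notebook's code.

```python
def return_1(total_pymnt, funded_amnt, term_num):
    """Annualised return assuming the money is tied up for the full term."""
    return (total_pymnt - funded_amnt) / funded_amnt * (12 / term_num)


def return_2(total_pymnt, funded_amnt, term_num, loan_length):
    """Annualise over the actual loan length; fall back to return_1 on a loss."""
    r = (total_pymnt - funded_amnt) / funded_amnt * (12 / loan_length)
    return r if r >= 0 else return_1(total_pymnt, funded_amnt, term_num)


# Values of the first loan in data.head()
r1 = return_1(5475.14, 5000.0, 36)
r2 = return_2(5475.14, 5000.0, 36, 10.973531)
print(round(r1, 6), round(r2, 6))
```

The two results should agree, up to rounding, with the `ret_PESS` and `ret_OPT` values stored for that row.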

In [45]:
def ret_method_3(data, T, i):
    '''
    Given an investment time horizon T (in months) and a monthly
    re-investment interest rate i, calculate the return of each loan
    '''
    
    # Assuming that the total amount paid back was paid at equal
    # intervals during the duration of the loan, calculate the
    # size of each of these installments
    actual_installment = (data.total_pymnt - data.recoveries) / data['loan_length']

    # Assuming the amount is immediately re-invested at the prime
    # rate, find the total amount of money we'll have by the end
    # of the loan
    cash_by_end_of_loan = actual_installment * (1 - pow(1 + i, data.loan_length)) / ( 1 - (1 + i) )
    
    cash_by_end_of_loan = cash_by_end_of_loan + data.recoveries
    
    # Assuming that cash is then re-invested at the prime rate,
    # with monthly re-investment, until T months from the start
    # of the loan
    remaining_months = T - data['loan_length']
    final_return = cash_by_end_of_loan * pow(1 + i, remaining_months)

    # Find the percentage return
    return( (12/T) * ( ( final_return - data['funded_amnt'] ) / data['funded_amnt'] ) )


#--------------------------------------
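
The closed-form expression used for `cash_by_end_of_loan` is the sum of a geometric series: each installment is re-invested at the monthly rate `i` until the loan ends. The sketch below checks this numerically for a whole number of months. The function names and the sample numbers are assumptions made for illustration; the notebook itself applies the formula to fractional `loan_length` values.

```python
def future_value_closed_form(installment, i, n_months):
    """Same expression as in ret_method_3: A * (1 - (1+i)^n) / (1 - (1+i))."""
    return installment * (1 - (1 + i) ** n_months) / (1 - (1 + i))


def future_value_by_summation(installment, i, n_months):
    """Add up each installment after k months of compounding, k = 0..n-1."""
    return sum(installment * (1 + i) ** k for k in range(n_months))


closed = future_value_closed_form(100.0, 0.0025, 36)
summed = future_value_by_summation(100.0, 0.0025, 36)
# The two values should agree up to floating-point error
print(abs(closed - summed) < 1e-6)
```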

In [46]:
#Calculating three different types of returns based on the ret_method_3
data['return_3a'] = ret_method_3(data, 5*12, 0.001)
data['return_3b'] = ret_method_3(data, 5*12, 0.0025)
data['return_3c'] = ret_method_3(data, 5*12, 0.005)

In [47]:
# Flag loans that were charged off or defaulted as 1, all others as 0
data["outcome"] = data.loan_status.isin(["Charged Off", "Default"]).astype(int)
data["outcome"].head()

0    0
2    0
3    0
4    0
5    0
Name: outcome, dtype: int64

In [48]:
x_continuous = data[continuous_features]
y = data.outcome.values

#store return values in y_return
y_return1 = data.return_1
y_return2 = data.return_2
y_return3a = data.return_3a
y_return3b = data.return_3b
y_return3c = data.return_3c

In [49]:
# Min-max scaling of each continuous feature column to
# the range [0, 1] (disabled here)
#cs = MinMaxScaler()
#x_continuous = cs.fit_transform(x_continuous)

# Random Based Strategy
# Exercise 3.1 --- [8/30]

In this part, you implement the random based strategy: choose loans completely at random (uniform distribution) and build a portfolio from these randomly chosen loans. You then calculate the average return (using the test dataset) of an investor who follows this strategy in the long run. More details are provided below.

Split the dataset into a training set, a cross validation set, and a test set (though you will not use the training and cross validation sets for this exercise): 60% training, 20% cross validation, and 20% testing. Use 1 as the default seed.
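
One way to obtain a 60/20/20 partition is to shuffle the row positions once and cut them into three blocks, so that the features and every notion of return share the same partition. This is a minimal sketch using NumPy only; `three_way_split` is a hypothetical helper, and it will not reproduce the exact rows chosen by scikit-learn's `train_test_split` with the same seed.

```python
import numpy as np


def three_way_split(n_rows, seed=1, train_frac=0.6, val_frac=0.2):
    """Return shuffled, disjoint index arrays for train / validation / test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = int(round(train_frac * n_rows))
    n_val = int(round(val_frac * n_rows))
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a DataFrame, the same positions can be applied through `data.iloc[test_idx]`, which keeps the features and all return columns aligned.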

In this exercise, use the following notions of return: ["return_1","return_2","return_3a","return_3b","return_3c"]. These returns are exactly the same as in CW1. For return_3, use the solution of CW1.

The goal is to estimate the average returns (using the test set) that an investor might obtain by following this random based strategy. To do that, fix the number of iterations at 1000; in each iteration, build a random portfolio by randomly selecting 100 loans from the test dataset. Note that no selection with partial weights is allowed. More precisely, if the returns of the 100 randomly selected loans are $(r_1, r_2, \ldots, r_{100})$, the weight of each $r_i$, $i=1,2,\ldots,100$, in the random portfolio is 0.01; however, in the initial random selection the whole $r_i$ is selected. Based on this dataset, and for each notion of return in ['return_1', 'return_2', 'return_3a', 'return_3b', 'return_3c'], calculate the average return that an investor might obtain by following this random strategy.
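
The procedure described above can be sketched as follows. This is one possible reading, not a reference solution: the helper `random_strategy_mean_return`, the use of NumPy's `default_rng`, and the synthetic `fake_returns` array (standing in for one return column of the test set) are all assumptions made for illustration.

```python
import numpy as np


def random_strategy_mean_return(returns, n_iter=1000, n_loans=100, seed=1):
    """Average return of equally weighted portfolios of randomly drawn loans."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    portfolio_returns = np.empty(n_iter)
    for k in range(n_iter):
        # Draw n_loans distinct loans; each enters the portfolio with weight 1/n_loans
        picked = rng.choice(returns.size, size=n_loans, replace=False)
        portfolio_returns[k] = returns[picked].mean()
    return portfolio_returns.mean()


# Synthetic stand-in for one return column of the test set
fake_returns = np.random.default_rng(0).normal(loc=0.03, scale=0.1, size=5000)
estimate = random_strategy_mean_return(fake_returns)
print(estimate, fake_returns.mean())
```

Because every loan is equally likely to be drawn, the estimate should land close to the plain mean of the return column; the simulation mainly shows how much a single 100-loan portfolio can vary around it.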

Provide your solution after this block. Do not forget to print your result.


In [73]:
x_train, x_test, y_return_train1, y_return_test1 = train_test_split(x_continuous, y_return1, test_size=0.4,
                                                                  random_state=default_seed) 
x_test, x_val, y_return_test1, y_return_val1 = train_test_split(x_test, y_return_test1, test_size=0.5,
                                                                  random_state=default_seed) 

In [74]:
x_train, x_test, y_return_train2, y_return_test2 = train_test_split(x_continuous, y_return2, test_size=0.4,
                                                                  random_state=default_seed) 
x_test, x_val, y_return_test2, y_return_val2 = train_test_split(x_test, y_return_test2, test_size=0.5,
                                                                  random_state=default_seed) 

In [75]:
x_train, x_test, y_return_train3a, y_return_test3a = train_test_split(x_continuous, y_return3a, test_size=0.4,
                                                                  random_state=default_seed) 
x_test, x_val, y_return_test3a, y_return_val3a = train_test_split(x_test, y_return_test3a, test_size=0.5,
                                                                  random_state=default_seed) 

In [76]:
x_train, x_test, y_return_train3b, y_return_test3b = train_test_split(x_continuous, y_return3b, test_size=0.4,
                                                                  random_state=default_seed) 
x_test, x_val, y_return_test3b, y_return_val3b = train_test_split(x_test, y_return_test3b, test_size=0.5,
                                                                  random_state=default_seed) 

In [77]:
x_train, x_test, y_return_train3c, y_return_test3c = train_test_split(x_continuous, y_return3c, test_size=0.4,
                                                                  random_state=default_seed) 
x_test, x_val, y_return_test3c, y_return_val3c = train_test_split(x_test, y_return_test3c, test_size=0.5,
                                                                  random_state=default_seed) 

In [78]:
r1=[]
r2=[]
r3a=[]
r3b=[]
r3c=[]
for i in range(1000):
    r1.append(np.mean(y_return_test1.sample(n=100)))
    r2.append(np.mean(y_return_test2.sample(n=100)))
    r3a.append(np.mean(y_return_test3a.sample(n=100)))
    r3b.append(np.mean(y_return_test3b.sample(n=100)))
    r3c.append(np.mean(y_return_test3c.sample(n=100)))
print("The average returns of return1  is " + str(np.mean(r1)))
print("The average returns of return2  is " + str(np.mean(r2)))
print("The average returns of return3a is " + str(np.mean(r3a)))
print("The average returns of return3b is " + str(np.mean(r3b)))
print("The average returns of return3c is " + str(np.mean(r3c)))

The average returns of return1  is 0.0056673844564332195
The average returns of return2  is 0.04624899351502564
The average returns of return3a is 0.012583900124201861
The average returns of return3b is 0.028523575051907536
The average returns of return3c is 0.057671676723057584


# Return Based Strategy
# Exercise 3.2 ---- [5/30]

In this section, you implement a return based strategy: you estimate the return of the loans (here using linear regression) and choose the 100 loans with the highest predicted return. More specifically, use Ridge regression, and suppose that you have done your cross validation and found that the optimal value of the hyperparameter alpha is 240.

Use the same dataset and notions of return as in Exercise 3.1, and use only continuous features. The goal is to estimate the return (for each notion of return as in Exercise 3.1) that an investor will obtain by following this strategy.
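
Once a model has produced predicted returns for the test set, selecting the portfolio amounts to taking the positions of the 100 largest predictions and averaging the realised returns at those positions. The sketch below uses `np.argsort` for this, which always returns exactly `k` loans even when several predictions are tied. The helper `top_k_realised_return` and the toy arrays are hypothetical; `.ravel()` is included so that a column-shaped `(n, 1)` prediction array is handled as well.

```python
import numpy as np


def top_k_realised_return(predicted, realised, k=100):
    """Mean realised return of the k loans with the highest predicted return."""
    predicted = np.asarray(predicted, dtype=float).ravel()
    realised = np.asarray(realised, dtype=float).ravel()
    # Stable sort on the negated scores: descending order, ties broken by position
    top_idx = np.argsort(-predicted, kind="stable")[:k]
    return realised[top_idx].mean()


predicted = np.array([0.02, 0.10, 0.05, 0.10, -0.01])
realised = np.array([0.01, 0.08, -0.20, 0.04, 0.03])
# With k=2 the two loans predicted at 0.10 are selected
print(top_k_realised_return(predicted, realised, k=2))
```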

Provide your solution after this block. Do not forget to print your result.




In [86]:
r1_test=[]
clf = Ridge(alpha=240)
clf.fit(x_train, y_return_train1)
y_return_pre1=clf.predict(x_test)
res = sorted(y_return_pre1, reverse = True)[:100]
for count, value in enumerate(y_return_pre1):
    if value in res:
        r1_test.append(y_return_test1.values[count])


In [87]:
np.mean(res)

0.04596370925645696

In [88]:
r2_test=[]
clf = Ridge(alpha=240)
clf.fit(x_train, y_return_train2)
y_return_pre2=clf.predict(x_test)
res = sorted(y_return_pre2, reverse = True)[:100]
for count, value in enumerate(y_return_pre2):
    if value in res:
        r2_test.append(y_return_test2.values[count])

In [89]:
np.mean(res)

0.12452493404207292

In [90]:
r3a_test=[]
clf = Ridge(alpha=240)
clf.fit(x_train, y_return_train3a)
y_return_pre3a=clf.predict(x_test)
res = sorted(y_return_pre3a, reverse = True)[:100]
for count, value in enumerate(y_return_pre3a):
    if value in res:
        r3a_test.append(y_return_test3a.values[count])

In [91]:
np.mean(res)

0.044971746691009305

In [92]:
r3b_test=[]
clf = Ridge(alpha=240)
clf.fit(x_train, y_return_train3b)
y_return_pre3b=clf.predict(x_test)
res = sorted(y_return_pre3b, reverse = True)[:100]
for count, value in enumerate(y_return_pre3b):
    if value in res:
        r3b_test.append(y_return_test3b.values[count])

In [93]:
np.mean(res)

0.0643855081675103

In [94]:
r3c_test=[]
clf = Ridge(alpha=240)
clf.fit(x_train, y_return_train3c)
y_return_pre3c=clf.predict(x_test)
res = sorted(y_return_pre3c, reverse = True)[:100]
for count, value in enumerate(y_return_pre3c):
    if value in res:
        r3c_test.append(y_return_test3c.values[count])

In [95]:
np.mean(res)

0.10051069766611956

In [96]:
print("The average returns of return1  is " + str(np.mean(r1_test)))
print("The average returns of return2  is " + str(np.mean(r2_test)))
print("The average returns of return3a is " + str(np.mean(r3a_test)))
print("The average returns of return3b is " + str(np.mean(r3b_test)))
print("The average returns of return3c is " + str(np.mean(r3c_test)))

The average returns of return1  is 0.021225409277534343
The average returns of return2  is 0.058890925043272514
The average returns of return3a is 0.023851225553156677
The average returns of return3b is 0.03664908214740802
The average returns of return3c is 0.06709502108830714


# Exercise 3.3 ---- [5/30]

Repeat Exercise 3.2, but use a neural network instead of a linear model. Your neural network will have two hidden layers. You can choose the rest of the parameters of the network as appropriate. Since using cross validation to choose the best architecture takes time, you are not required to perform any cross validation to choose the optimal architecture. However, the return of your model should be reasonable and at least as good as that of the linear model in the previous exercise.

Use the same dataset and notions of returns as in Exercise 3.2.
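
Neural networks usually train more reliably when the input features are on comparable scales. Below is a minimal sketch of min-max scaling in which the minimum and range are learned from the training rows only and then reused for the test rows. The function names and the toy arrays are hypothetical; scikit-learn's `MinMaxScaler` provides the same behaviour through `fit` and `transform`.

```python
import numpy as np


def fit_min_max(x_train):
    """Learn per-column minimum and range from the training data."""
    x_train = np.asarray(x_train, dtype=float)
    col_min = x_train.min(axis=0)
    col_range = x_train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return col_min, col_range


def apply_min_max(x, col_min, col_range):
    """Scale any data set with the statistics learned from the training data."""
    return (np.asarray(x, dtype=float) - col_min) / col_range


x_train = np.array([[1000.0, 5.0], [3000.0, 15.0], [2000.0, 10.0]])
x_test = np.array([[4000.0, 0.0]])
col_min, col_range = fit_min_max(x_train)
print(apply_min_max(x_train, col_min, col_range))
print(apply_min_max(x_test, col_min, col_range))
```

Test values can fall outside [0, 1] under this scheme; that is expected, since the test set plays no part in choosing the scaling.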

Provide your solution after this block. Do not forget to print your result.


In [107]:
# Min-max scale each continuous feature column to
# the range [0, 1]
cs = MinMaxScaler()
x_nn = cs.fit_transform(x_continuous)
x_train, x_test, y_return_train1, y_return_test1 = train_test_split(x_nn, y_return1, test_size=0.4,
                                                                  random_state=default_seed) 
x_test, x_val, y_return_test1, y_return_val1 = train_test_split(x_test, y_return_test1, test_size=0.5,
                                                                  random_state=default_seed) 
x_train, x_test, y_return_train2, y_return_test2 = train_test_split(x_nn, y_return2, test_size=0.4,
                                                                  random_state=default_seed) 
x_test, x_val, y_return_test2, y_return_val2 = train_test_split(x_test, y_return_test2, test_size=0.5,
                                                                  random_state=default_seed) 
x_train, x_test, y_return_train3a, y_return_test3a = train_test_split(x_nn, y_return3a, test_size=0.4,
                                                                  random_state=default_seed) 
x_test, x_val, y_return_test3a, y_return_val3a = train_test_split(x_test, y_return_test3a, test_size=0.5,
                                                                  random_state=default_seed) 
x_train, x_test, y_return_train3b, y_return_test3b = train_test_split(x_nn, y_return3b, test_size=0.4,
                                                                  random_state=default_seed) 
x_test, x_val, y_return_test3b, y_return_val3b = train_test_split(x_test, y_return_test3b, test_size=0.5,
                                                                  random_state=default_seed) 
x_train, x_test, y_return_train3c, y_return_test3c = train_test_split(x_nn, y_return3c, test_size=0.4,
                                                                  random_state=default_seed) 
x_test, x_val, y_return_test3c, y_return_val3c = train_test_split(x_test, y_return_test3c, test_size=0.5,
                                                                  random_state=default_seed) 

In [108]:
# Create a Sequential model
model = Sequential()

# Create the hidden layers (three Dense layers are defined here)
layer1 = Dense(512,use_bias = False, input_dim=13, kernel_initializer='he_uniform', activation='relu')
layer2 = Dense(256,use_bias = False, kernel_initializer='he_uniform', activation='relu')
layer3 = Dense(128,use_bias = False, kernel_initializer='he_uniform', activation='relu')


# Stack the layers with .add()
# The model now takes arrays of shape (*, 13) as input and outputs arrays of shape (*, 1)
model.add(layer1)
# After the first layer, the input size no longer needs to be specified
model.add(layer2)
model.add(layer3)
model.add(Dense(1, activation='linear'))

# Configure the learning process with .compile()
model.compile(optimizer='rmsprop',
              loss='mse')
r1_test=[]
cs = MinMaxScaler()
# Fit the scaler on the training set only, then apply the same scaling to the test set
x_train=cs.fit_transform(x_train)
x_test=cs.transform(x_test)
model.fit(x_train, y_return_train1,epochs=3, batch_size=1000)
y_return_pre1=model.predict(x_test)
res = sorted(y_return_pre1, reverse = True)[:100]
for count, value in enumerate(y_return_pre1):
    if value in res:
        r1_test.append(y_return_test1.values[count])
np.mean(r1_test)

Epoch 1/3
Epoch 2/3
Epoch 3/3


0.009402039927653441

In [109]:
r2_test=[]
model.fit(x_train, y_return_train2,epochs=3, batch_size=1000)
y_return_pre2=model.predict(x_test)
res = sorted(y_return_pre2, reverse = True)[:100]
for count, value in enumerate(y_return_pre2):
    if value in res:
        r2_test.append(y_return_test2.values[count])
np.mean(r2_test)

Epoch 1/3
Epoch 2/3
Epoch 3/3


0.12797865065301808

In [110]:
r3a_test=[]
model.fit(x_train, y_return_train3a,epochs=3, batch_size=1000)
y_return_pre3a=model.predict(x_test)
res = sorted(y_return_pre3a, reverse = True)[:100]
for count, value in enumerate(y_return_pre3a):
    if value in res:
        r3a_test.append(y_return_test3a.values[count])

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [111]:
r3b_test=[]
model.fit(x_train, y_return_train3b,epochs=3, batch_size=1000)
y_return_pre3b=model.predict(x_test)
res = sorted(y_return_pre3b, reverse = True)[:100]
for count, value in enumerate(y_return_pre3b):
    if value in res:
        r3b_test.append(y_return_test3b.values[count])

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [112]:
r3c_test=[]
model.fit(x_train, y_return_train3c,epochs=3, batch_size=1000)
y_return_pre3c=model.predict(x_test)
res = sorted(y_return_pre3c, reverse = True)[:100]
for count, value in enumerate(y_return_pre3c):
    if value in res:
        r3c_test.append(y_return_test3c.values[count])

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [113]:
print("The average returns of return1  is " + str(np.mean(r1_test)))
print("The average returns of return2  is " + str(np.mean(r2_test)))
print("The average returns of return3a is " + str(np.mean(r3a_test)))
print("The average returns of return3b is " + str(np.mean(r3b_test)))
print("The average returns of return3c is " + str(np.mean(r3c_test)))

The average returns of return1  is 0.009402039927653441
The average returns of return2  is 0.12797865065301808
The average returns of return3a is 0.02969940110393391
The average returns of return3b is 0.04050067379966151
The average returns of return3c is 0.0682262511903451


# Default Based Strategy
# Exercise 3.4 --- [8/30]

In this exercise, you will implement a default based strategy, i.e. select the 100 loans with the highest credit quality (a loan with a probability of default (PD) of zero has the highest credit quality). For the feature space, use only the same continuous features as in Exercise 3.2; you will also need to determine the output. Split the dataset: 60% training, 20% cross validation (though we don't perform cross validation here), and 20% testing.

Train three machine learning models (of your choice) to estimate the probability of default for these loans. For simplicity, ignore any cross validation analysis that might have been done or required, but use reasonable parameters. For instance, for a logistic regression model, one can use an l2 penalty with C=1 and an appropriate solver of your choice.

Although in practice, you should provide justification for the choice of the models, you are not required to provide such justification for this exercise.

For each notion of return ["return_1","return_2","return_3a","return_3b","return_3c"], estimate the return of this strategy.
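
Given estimated default probabilities for the test loans, the strategy reduces to picking the 100 loans with the lowest PD and averaging each notion of return over that same selection. The sketch below shows this step in isolation; `lowest_pd_portfolio` and the toy numbers are hypothetical. With a fitted scikit-learn classifier, the PD estimates would typically be the `predict_proba` column belonging to class 1 (the column order follows the estimator's `classes_` attribute).

```python
import numpy as np


def lowest_pd_portfolio(pd_hat, returns_by_notion, k=100):
    """Mean return per notion for the k loans with the lowest estimated PD."""
    pd_hat = np.asarray(pd_hat, dtype=float).ravel()
    # Ascending stable sort: the safest loans come first
    chosen = np.argsort(pd_hat, kind="stable")[:k]
    return {name: float(np.asarray(r, dtype=float)[chosen].mean())
            for name, r in returns_by_notion.items()}


pd_hat = np.array([0.30, 0.05, 0.12, 0.01])
returns_by_notion = {
    "return_1": np.array([-0.40, 0.03, 0.06, 0.02]),
    "return_2": np.array([-0.40, 0.05, 0.09, 0.04]),
}
# With k=2 the loans with PD 0.01 and 0.05 are selected
result = lowest_pd_portfolio(pd_hat, returns_by_notion, k=2)
print(result)
```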

Provide your solution after this block. Do not forget to print your result.


In [102]:
x_train, x_test, y_credit_train, y_credit_test = train_test_split(x_continuous, y, test_size=0.4,
                                                                  random_state=default_seed) 
x_test, x_val, y_credit_test, y_credit_val = train_test_split(x_test, y_credit_test, test_size=0.5,
                                                                  random_state=default_seed) 

In [103]:
rf = RandomForestClassifier(n_estimators=100, max_depth=2,random_state=0)
rf.fit(x_train, y_credit_train)
y_credit_pre=rf.predict_proba(x_test)
res = sorted(y_credit_pre.T[0], reverse = True)[:100]
r1_test=[]
r2_test=[]
r3a_test=[]
r3b_test=[]
r3c_test=[]
for count, value in enumerate(y_credit_pre.T[0]):
    if value in res:
        r1_test.append(y_return_test1.values[count])
        r2_test.append(y_return_test2.values[count])
        r3a_test.append(y_return_test3a.values[count])
        r3b_test.append(y_return_test3b.values[count])
        r3c_test.append(y_return_test3c.values[count])
r_rf=[np.mean(r1_test),np.mean(r2_test),np.mean(r3a_test),np.mean(r3b_test),np.mean(r3c_test)]

In [104]:
lr = LogisticRegression(random_state=0,C=1.0,max_iter=1000,penalty='l2').fit(x_train, y_credit_train)
y_credit_pre=lr.predict_proba(x_test)
res = sorted(y_credit_pre.T[0], reverse = True)[:100]
r1_test=[]
r2_test=[]
r3a_test=[]
r3b_test=[]
r3c_test=[]
for count, value in enumerate(y_credit_pre.T[0]):
    if value in res:
        r1_test.append(y_return_test1.values[count])
        r2_test.append(y_return_test2.values[count])
        r3a_test.append(y_return_test3a.values[count])
        r3b_test.append(y_return_test3b.values[count])
        r3c_test.append(y_return_test3c.values[count])
r_lr=[np.mean(r1_test),np.mean(r2_test),np.mean(r3a_test),np.mean(r3b_test),np.mean(r3c_test)]

In [105]:
svc = make_pipeline(StandardScaler(), SVC(gamma='auto',max_iter=100,probability=True))
svc.fit(x_train, y_credit_train)
y_credit_pre=svc.predict_proba(x_test)
res = sorted(y_credit_pre.T[0], reverse = True)[:100]
r1_test=[]
r2_test=[]
r3a_test=[]
r3b_test=[]
r3c_test=[]
for count, value in enumerate(y_credit_pre.T[0]):
    if value in res:
        r1_test.append(y_return_test1.values[count])
        r2_test.append(y_return_test2.values[count])
        r3a_test.append(y_return_test3a.values[count])
        r3b_test.append(y_return_test3b.values[count])
        r3c_test.append(y_return_test3c.values[count])
r_svc=[np.mean(r1_test),np.mean(r2_test),np.mean(r3a_test),np.mean(r3b_test),np.mean(r3c_test)]

In [106]:
df=pd.DataFrame({'Random Forest':r_rf,'Logistic Regression':r_lr,'Support Vector':r_svc})
df.index=return_cols
df

Unnamed: 0,Random Forest,Logistic Regression,Support Vector
return_1,0.017391,0.013895,0.012658
return_2,0.052075,0.040547,0.049067
return_3a,0.02173,0.018732,0.018578
return_3b,0.039821,0.035161,0.035673
return_3c,0.073312,0.06538,0.067212


# Exercise 3.5 [Max 300 words] --- [4/30]

Compare the last three strategies, i.e. random, return (as in the two models of Exercises 3.2 and 3.3), and default. Which strategy would you pick, and why?

Provide and explain a new investment strategy that is different from those strategies explained above.

Your explanations should be to the point, concise, clear, grammatically correct, and free from spelling errors.

Write your answer below:
