# IN-STK5000 project 1, part 1
## by Espen H. Kristensen (espenhk)

### 2.1.1 $\texttt{NameBanker.expected_utility}$

Given a probability $p \in [0,1]$ that a loan is paid back, we wish to find the expected return on investment. We'll use $a$ for the amount, $d$ for the duration of the loan in months, and $r=0.005$ for the monthly interest rate. Treating the outcome (repaid/forfeited) as a Bernoulli random variable $X$, with $X=1$ for a repaid loan and $X=0$ for a forfeited one, the expected value is

$$ E(X) = p $$

That is, we will have $X=1$ (a repaid loan) $100p\,\%$ of the time, and $X=0$ (a forfeited one) $100(1-p)\,\%$ of the time. Weighting the return in each case by its probability ($p$ and $1-p$) and adding the two terms gives the expected return $R$:

$$ R = p \cdot (a \cdot 1.005^d) + (1-p) \cdot (-a) $$
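
As a quick numerical check of the formula, here is a minimal standalone sketch. The function name `expected_return` and the example values ($a=1000$, $d=12$, $p=0.8$) are chosen for illustration only and are not taken from $\texttt{name_banker.py}$.

```python
def expected_return(p, amount, duration, rate=0.005):
    """Expected return R = p * a * (1 + r)^d + (1 - p) * (-a)."""
    return_win = amount * (1 + rate) ** duration  # loan repaid with interest
    return_loss = -amount                         # loan forfeited
    return p * return_win + (1 - p) * return_loss


# Illustrative values only: p = 0.8, a = 1000, d = 12 months
print(expected_return(0.8, 1000, 12))
```

The boundary cases behave as expected: $p=1$ gives the full repaid amount $a \cdot 1.005^d$, and $p=0$ gives $-a$.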

I've implemented the $\texttt{expected_utility}$ function in $\texttt{name_banker.py}$ as follows. Note that $\texttt{get_proba}$ has been hard-coded to always return $p=0.8$ for this part of the exercise.

Note on filenames: the file $\texttt{name_banker.py}$ delivered alongside this report contains the entire implementation, since part 1 of the project is finished, so it will deviate slightly from the excerpts below.

In [None]:
# %load -r 37-65 name_banker.py
# The expected utility of granting the loan or not. Here there are two actions:
# action = 0 do not grant the loan
# action = 1 grant the loan
#
# Make sure that you extract the length_of_loan from the
# 2nd attribute of x. Then the return if the loan is paid off to you is amount_of_loan*(1 + rate)^length_of_loan
# The return if the loan is not paid off is -amount_of_loan.
def expected_utility(self, x, action):
    duration = x[0]
    amount = x[1]
    rate = self.rate
    # Return if the loan is repaid with interest, and if it is forfeited
    return_win = amount*(1+rate)**duration
    return_loss = -amount
    success_prob = self.get_proba()

    expected_return = (success_prob*return_win +
                       (1-success_prob)*return_loss)

    # Grant the loan only if we expect to get more than the
    # original amount back. In practice, you'd likely require a
    # margin so you're making at least, say, 5% on every loan.
    # With return_margin = 0 this is the plain comparison.
    return_margin = 0
    if expected_return > amount + return_margin:
        action = 1
    else:
        action = 0
    return action

Then, $\texttt{get_best_action}$ simply calculates action using this function, and returns the action chosen:

In [None]:
# %load -r 54-60 name_banker.py
# Return the best action. This is normally the one that maximises expected utility.
# However, you are allowed to deviate from this if you can justify the reason.
def get_best_action(self, x):
    # dummy value, action will be set by expected_utility()
    action=0
    action = self.expected_utility(x, action)
    return action
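
The skeleton comments describe the best action as the one that maximises expected utility. As a point of comparison, here is a minimal sketch of that arrangement, written as standalone functions rather than methods. It is not the implementation in $\texttt{name_banker.py}$: the signatures are hypothetical, and it assumes that not granting a loan has utility 0, so it grants whenever the expected return is positive rather than when it exceeds the loan amount.

```python
def expected_utility(p, amount, duration, action, rate=0.005):
    """Expected utility of an action: 0 = do not grant, 1 = grant."""
    if action == 0:
        return 0.0  # assumption: nothing lent, nothing gained or lost
    return_win = amount * (1 + rate) ** duration
    return_loss = -amount
    return p * return_win + (1 - p) * return_loss


def get_best_action(p, amount, duration, rate=0.005):
    """Pick the action with the highest expected utility (ties go to 0)."""
    return max((0, 1),
               key=lambda action: expected_utility(p, amount, duration, action, rate))


# Illustrative values only
print(get_best_action(0.8, 1000, 12))
```

Keeping the utility calculation separate from the choice of action makes it easy to add further actions or a required margin later.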

See the $\texttt{name_banker.py}$ file for the rest of this implementation; other than this and the hard-coded $\texttt{get_proba}$ function, there are no changes from the skeleton code. Running the program while varying the hard-coded probability produced the following test output file:

In [None]:
# %load test_lending_output.txt
= Using NameBanker with probability p of a successful return
== p=0.8
Trial 1: 811671.63351
Trial 2: 785167.647446
Trial 3: 730726.926036
== p=0.5
Trial 1: 835328.181468
Trial 2: 881273.321313
Trial 3: 826356.885197
== p=0.2
Trial 1: 605824.790654
Trial 2: 611672.643699
Trial 3: 603023.393655
== p=0.1
Trial 1: 27313.5038378
Trial 2: 43701.6061405
Trial 3: 36418.0051171

= Using RandomBanker
Trial 1: 366147.424382
Trial 2: 375869.218748
Trial 3: 327109.889969


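In short, even with the hard-coded repayment probability set as low as $p=0.2$, our NameBanker outperforms the random banker in every trial, by more than 225,000 each time. Only at $p=0.1$ does it fall well below the random banker.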
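
To make the comparison easier to read, the trials can be averaged per setting. The sketch below copies the numbers from the output above; the dictionary keys are just labels chosen here.

```python
from statistics import mean

# Total utility per trial, copied from test_lending_output.txt above
results = {
    "NameBanker p=0.8": [811671.63351, 785167.647446, 730726.926036],
    "NameBanker p=0.5": [835328.181468, 881273.321313, 826356.885197],
    "NameBanker p=0.2": [605824.790654, 611672.643699, 603023.393655],
    "NameBanker p=0.1": [27313.5038378, 43701.6061405, 36418.0051171],
    "RandomBanker": [366147.424382, 375869.218748, 327109.889969],
}

averages = {name: mean(trials) for name, trials in results.items()}
baseline = averages["RandomBanker"]

# Print each mean and its difference from the RandomBanker mean
for name, avg in averages.items():
    print(f"{name:18s} mean={avg:12.2f}  vs. random: {avg - baseline:+12.2f}")
```

With only three trials per setting, these averages are a rough indication rather than a reliable estimate.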

### 2.1.2 $\texttt{NameBanker.fit} , \texttt{NameBanker.predict_proba}$, comments on labelling

$\texttt{fit()}$: I've chosen to use a K-nearest neighbors classifier. After some testing, the results seem fairly stable for any $k \in [5, 100]$. A $k$ as high as 100 seems excessive, though, so the current implementation uses $k=20$. The fit function is implemented as follows. Note that it doesn't return anything, but saves the fitted model to the instance variable $\texttt{self.model}$:

In [None]:
# %load -r 9-15 name_banker.py
# Fit the model to the data.  You can use any model you like to do
# the fit, however you should be able to predict all class
# probabilities
def fit(self, X, y):
    self.data = [X, y]
    self.model = KNeighborsClassifier(n_neighbors=20)
    self.model.fit(X, y)

$\texttt{predict_proba()}$: After $\texttt{fit}$ is called, we have a $\texttt{self.model}$ whose built-in $\texttt{predict_proba}$ function we can call, so our implementation is pretty straightforward. Note that the function both saves the result to $\texttt{self.proba}$, so it can be used with $\texttt{get_proba}$, and returns the value itself, so you can predict and retrieve the probability in one call to $\texttt{predict_proba}$. The implementation is:

In [None]:
# %load -r 22-28 name_banker.py
# Predict the probability of failure for a specific person with data x
def predict_proba(self, x):
    # data needs to be packed in a list, as the function expects a 2D array
    prob = self.model.predict_proba([x])
    # unpack: prob[0] holds one probability per class, in the order of
    # self.model.classes_; we only need the first, p, as the other is (1-p)
    self.proba = prob[0][0]
    return self.proba
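
To illustrate what a K-nearest neighbors probability estimate amounts to, here is a minimal pure-Python sketch, assuming uniform weights and Euclidean distance (the defaults of $\texttt{KNeighborsClassifier}$, as far as I know). The function `knn_predict_proba` and the toy data are hypothetical and only meant to show the idea: the estimated probability of a class is the fraction of the $k$ nearest training points carrying that label, reported in sorted label order.

```python
from collections import Counter
from math import dist


def knn_predict_proba(X, y, x, k):
    """Estimate class probabilities for x as label fractions among its k nearest neighbors."""
    # Sort training points by Euclidean distance to x and keep the k closest
    nearest = sorted(zip(X, y), key=lambda pair: dist(pair[0], x))[:k]
    counts = Counter(label for _, label in nearest)
    # Report every class, in sorted label order (missing labels count as 0)
    return {label: counts[label] / k for label in sorted(set(y))}


# Toy data; the labels 1 and 2 are made up for this example
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
y = [1, 1, 1, 2, 2]
print(knn_predict_proba(X, y, (0.15, 0.15), k=3))
print(knn_predict_proba(X, y, (0.15, 0.15), k=5))
```

Since the first entry corresponds to the smallest label, which outcome $\texttt{prob[0][0]}$ refers to depends on how the labels in $y$ are encoded.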

Comments on labelling: We are missing information on how this data was collected. It is unclear whether the labels are based on actual loans that were granted and then repaid or forfeited, or on a (professional) bank's assessment of a number of loan applications. In the latter case, and particularly without knowing how reliable that bank's assessments are, it's hard to say whether predictions based on the data apply to the real world or are simply "best-guess" estimates. Labels that come from assessments carry the bias of whoever made them, and may or may not be accurate. If they deviate from the real-world outcomes, then no matter how good a classifier we build, it will always carry these problems with it.

Of course, obtaining these real-world outcomes may not be trivial, since the data (a) requires access to payment records rather than application forms, and (b) may not even exist yet if the loans haven't been granted or aren't yet repaid. Also, collecting outcomes for all variations of input might entail granting large numbers of loans even when our current best classifier suggests it's a bad idea, and good luck finding a bank that will risk giving away millions "just to see".