# Logistic Regression

## Overview
In this lab, we apply logistic regression to the "prosper-loan" data, a data set of individual loans provided by the P2P lending company Prosper.
Our goal is to predict whether an individual can get a loan based on their credentials.


## Downloading S3 Data into Jupyter Notebook

First, we need to upload the data required to run the lab into an S3 bucket.

Next, we need to create another empty S3 bucket whose name starts with "sagemaker". This bucket stores the model, results, and other artifacts after the algorithm runs.

Download the data file <give path here> from S3. You should see a message like this:

<img src="../assets/images/logistic-regression/pic1.jpg" style="width:50%"/>

---

## Preparing the data for Training

* The data has more than 50 columns. To keep things simple, we use only the following 6 columns: 'LoanStatus', 'BorrowerRate', 'EmploymentStatus', 'CreditScore', 'StatedMonthlyIncome', and 'IncomeVerifiable'.

<img src="../assets/images/logistic-regression/pic2.jpg" style="width:50%"/>

* Convert the categorical values to numeric ones using one-hot encoding.
The pandas library provides built-in functions for encoding the data. Let's do this for the "EmploymentStatus" column.

<img src="../assets/images/logistic-regression/pic3.jpg" style="width:50%"/>

* Convert the "IncomeVerifiable" column to integer values. We assign 1 for True and 0 for False.

<img src="../assets/images/logistic-regression/pic4.jpg" style="width:50%"/>

<br>

Now all the columns can be treated as float values, so we can proceed to training.
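
The two encoding steps above can be sketched on a small, made-up frame. The rows below are hypothetical and stand in for the real Prosper data; only the column names "EmploymentStatus" and "IncomeVerifiable" come from the lab.

```python
import pandas as pd

# Hypothetical toy rows; the real lab data has many more rows and columns.
toy = pd.DataFrame({
    "EmploymentStatus": ["Employed", "Retired", "Employed"],
    "IncomeVerifiable": [True, False, True],
})

# One indicator column per distinct category, cast to int so the values are 0/1.
onehot = pd.get_dummies(toy["EmploymentStatus"]).astype(int)

# Map True/False to 1/0.
toy["IncomeVerifiable_int"] = toy["IncomeVerifiable"].astype(int)

# Drop the original columns and attach the encoded ones.
encoded = toy.drop(columns=["EmploymentStatus", "IncomeVerifiable"]).join(onehot)
print(encoded)
```

Each distinct employment status becomes its own 0/1 column, so the number of feature columns grows with the number of categories in the data.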

---

Before giving input to the algorithm, we divide the data into features and labels. The label is the column to predict, which in this case is "LoanStatus".
The features are all the columns except "LoanStatus".
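
As a rough illustration of the split, assuming the label sits in the first column of a NumPy array (the values below are invented for the example):

```python
import numpy as np

# Hypothetical array: column 0 is the label, the remaining columns are features.
toy = np.array([[1, 0.12, 700],
                [0, 0.25, 640]], dtype="float32")

labels = toy[:, 0]     # first column only
features = toy[:, 1:]  # every column except the first

print(labels.shape, features.shape)
```

Slicing this way keeps the rows aligned, so row `i` of `features` always belongs to entry `i` of `labels`.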

Next, we create a bucket in S3 whose name begins with "sagemaker". Amazon then creates the subfolders it needs, which in this case are sagemaker/grades and others. It is important to create the S3 buckets in the same Amazon region as the notebook. Otherwise Amazon will throw an error saying it cannot find the data. See the note below on that.
Copy this text into a notebook cell and then run it.

<img src="../assets/images/logistic-regression/pic5.jpg" style="width:50%"/>

We will get an output displaying the location where the training data and artifacts will be stored.

Next, we create a Docker container in the same region as our notebook. In our case, that is "us-east-1".


---

## Setting up the Logistic Regression Estimator

Now we set up the Estimator. Amazon will not let you use any of their smaller (i.e. less expensive) instances, so here we use a virtual machine of size ml.c4.xlarge.



Now we provide the hyperparameters. There are many, such as which loss function to use. Here we set only the most important ones:

* feature_dim: the number of columns in our feature array. In this case it is 12.
* mini_batch_size: the number of records in each mini-batch. This number should be smaller than the number of records in our training set. We only have 20000 records, so we take 2000 as the batch size.
* predictor_type: we use binary_classifier, which means logistic regression.
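
To make the batch-size arithmetic concrete, here is a small sketch that reads mini_batch_size as the number of records per batch. The variable names are chosen for illustration only and are not part of the SageMaker API.

```python
import math

num_records = 20000      # training rows used in this lab
mini_batch_size = 2000   # records per mini-batch

# Number of mini-batches needed to pass over the training set once.
batches_per_epoch = math.ceil(num_records / mini_batch_size)
print(batches_per_epoch)  # 10
```

With these numbers, one pass over the training data takes 10 mini-batches.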

When you run the fit() method, Amazon kicks off the training job. It takes several minutes to run.

<img src="../assets/images/logistic-regression/pic6.jpg" style="width:50%"/>


---

## Deploying the Model

When training is done, deploy the model to an endpoint. Remember that Amazon charges you for as long as the endpoint is running, so delete your endpoints when you are done unless you want to keep being charged.


---

## Validating the Dataset

Now we run the prediction. We take 1000 data points from outside the training set and run the prediction on them.

<img src="../assets/images/logistic-regression/pic1.jpg" style="width:50%"/>

That's the end of the Logistic Regression session. You can try different data and observe the results. You can also evaluate the model using measures such as pseudo-R², precision, and recall.

## ToDo

To assess the accuracy of the model, calculate:
1. Precision
2. Recall
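
As a quick reference, the notebook defines precision as TP/(TP+FP) and recall as TP/(TP+FN). The sketch below applies those two formulas to invented counts; `precision_recall` is a hypothetical helper written for this example, not part of the lab code.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive, false-negative counts."""
    # Guard against division by zero when there are no positive predictions or no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts, not taken from the Prosper data.
p, r = precision_recall(tp=6, fp=2, fn=4)
print(p, r)  # 0.75 0.6
```

In the lab itself, the counts come from comparing the predicted labels with the "LoanStatus" values of the held-out rows.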



# Let's Get Started

In [None]:
# Downloading S3 Data into Jupyter Notebook

# First, download the data file from Amazon S3 into the working directory.
# If the download succeeds, you will get a message like the one below.

# !wget 'http://datakmeans.s3.amazonaws.com/prosper-loan-data.csv'

In [None]:
## TODO : Modify these to your own bucket
bucket = "elephantscale-sagemaker"
prefix = "elephantscale-sagemaker/prosper"

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

location = 'http://datakmeans.s3.amazonaws.com/prosper-loan-data.csv'
# location = 'https://s3.amazonaws.com/elephantscale-public/data/prosper-loan/prosper-loan-data-sample.csv'
prosper_data = pd.read_csv(location, header=0)
prosper_data.head()

In [None]:
# Preparing the data for Training

# The data has more than 50 columns. To keep things simple, we use only 6 of them:
# 'LoanStatus', 'BorrowerRate', 'EmploymentStatus', 'CreditScore', 'StatedMonthlyIncome', and 'IncomeVerifiable'.

prosper_final = prosper_data[['LoanStatus','BorrowerRate','EmploymentStatus', 'CreditScore', 'StatedMonthlyIncome','IncomeVerifiable']]
prosper_final.head()

In [None]:
# Get the execution role for the notebook instance. 
# This is the IAM role that you created when you created your notebook instance. 

from sagemaker import get_execution_role
role = get_execution_role()


In [None]:
# The next step is to convert the categorical values using one-hot encoding.
# The pandas library provides built-in functions for one-hot encoding columns.
# In our data, we convert the "EmploymentStatus" column.


EmpStatus_onehot = pd.get_dummies(prosper_final['EmploymentStatus'])

In [None]:
# Replace the EmploymentStatus column with its one-hot encoded columns
prosper_final = prosper_final.drop('EmploymentStatus',axis = 1)
prosper_final = prosper_final.join(EmpStatus_onehot)

In [None]:
prosper_final.head()

In [None]:
len(prosper_final)

In [None]:
# Next, convert the "IncomeVerifiable" column to integer values.
# We assign 1 for True and 0 for False.

# astype(int) maps True/False to 1/0 for the whole column at once
IncomeVerifiable_int = prosper_final['IncomeVerifiable'].astype(int)
    

In [None]:
prosper_final['IncomeVerifiable_int'] = IncomeVerifiable_int

In [None]:
prosper_final1 = prosper_final.drop('IncomeVerifiable',axis = 1)

In [None]:
prosper_final1.head()

In [None]:
# Take the first 20000 rows for training and convert them to a float32 NumPy array

import numpy as np
a = np.array(prosper_final1[:20000]).astype('float32')

# Now all the columns are float values, so we can proceed to training.

In [None]:
# Divide the data into features and labels.
# The label is the column to predict. In this case, it is the "LoanStatus" column.
# The features are all the columns except the "LoanStatus" column.

features = a[:,1:]
labels = a[:,0]

In [None]:
len(features)

In [None]:
# Next, create a bucket in S3 whose name begins with "sagemaker".
# SageMaker will create the subfolders it needs automatically.
# It is important to create the S3 buckets in the same Amazon region as the notebook.
# Otherwise you will get an error saying it cannot find the data. See the note below on that.

# You should see output displaying the location where the training data and artifacts will be stored.

import boto3
import sagemaker
import io
import sagemaker.amazon.common as smac
import os

sess = sagemaker.Session()


buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, a, labels)
buf.seek(0)

key = 'linearlearner'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))

output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('training artifacts will be uploaded to: {}'.format(output_location))

In [None]:
# Next, create a Docker container in the same region as our notebook.
# In our case, that is "us-east-1".

# containers = {
#               'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest'
#               }

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

In [None]:
# containers[boto3.Session().region_name]

In [None]:
# Setting up the Logistic Regression Estimator

# Now we set up the Estimator.
# SageMaker will not let you use any of their smaller (i.e. less expensive) instances,
# so here we use a virtual machine of size ml.c4.xlarge.

# Set up the linear estimator
linear = sagemaker.estimator.Estimator(container,
                                       role=role, 
                                       train_instance_count=1, 
                                       train_instance_type='ml.c4.xlarge',
                                       output_path=output_location,
                                       sagemaker_session=sess)

In [None]:
%%time

# Now we provide the hyperparameters. There are many, such as which loss function to use.
# Here we set only the most important ones:
# feature_dim - the number of columns in our feature array. In this case it is 12.
# mini_batch_size - the number of records in each mini-batch.
#                   This number should be smaller than the number of records in our training set.
#                   We only have 20000 records, so we take 2000 as the batch size.
# predictor_type - we use binary_classifier, which means logistic regression.

linear.set_hyperparameters(feature_dim=12,
                           mini_batch_size=2000,
                           predictor_type='binary_classifier')

# When you run the fit() method, Amazon kicks off the training job. It takes several minutes to run.

linear.fit({'train': s3_train_data})

In [None]:
# Deploying the Model


# When training is done, deploy the model to an endpoint.
# Remember that Amazon charges you for as long as the endpoint is running,
# so delete your endpoints when you are done unless you want to keep being charged.
# Deploy the linear model

linear_predictor = linear.deploy(initial_instance_count=1,
                                 instance_type='ml.c4.xlarge')

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer

linear_predictor.content_type = 'text/csv'
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer

In [None]:
# Creating an array for testing.
b = np.array(prosper_final1[20000:21000]).astype('float32')

In [None]:
# Validating the Dataset

# Now we run the prediction. We take 1000 data points and run the prediction.  

result = linear_predictor.predict(b[0:1000])

# Print the first 100 predictions
for i in range(100):
    # print(i)
    print(result['predictions'][i])


In [None]:
# ToDo 1
# To find the accuracy of the model, 
#   1. Precision - Precision tries to find out what proportion of positive identifications was actually correct.
#   2. Recall - Recall tries to find what proportion of actual positives was identified correctly.

# 1. Precision formula,
#     precision = TP/(TP+FP)

# 2. Recall formula,
#     recall = TP/(TP+FN)
# where,
# TP(True Positive): A true positive is an outcome where the model correctly predicts the positive class.
# TN(True Negative): It is an outcome where the model correctly predicts the negative class.
# FP(False Positive): It is an outcome where the model incorrectly predicts the positive class.
# FN(False Negative): It is an outcome where the model incorrectly predicts the negative class.

        
# Correct prediction
true_positive = 0
true_negative = 0
# Wrong prediction
false_positive = 0
false_negative = 0

for i in range(500):
    # print(i)
    original = float(prosper_final1['LoanStatus'][20000+i])
    predicted = result['predictions'][i]['predicted_label']
    # Tally the counts needed to calculate precision and recall
    if original == 1.0:
        if original == predicted:
            # print(str(original) +" : "+str(predicted))
            true_positive += 1
        else:
            false_negative += 1
    elif original == 0.0:
        if original == predicted:
            true_negative += 1
        else:
            false_positive += 1
            
    # print (original)        
    # print (predicted)

# print(result)
# print("TP "+str(true_positive) +" : TN "+str(true_negative))
# print("FP "+str(false_positive)+" : FN "+str(false_negative))

precision = ??? # Apply precision formula here and uncomment below line
# print("Precision "+str(precision))

# Output similar to below

# Precision 0.9916546414285714

In [None]:
# ToDo 2

recall = ??? # Apply recall formula here and uncomment below line
# print("Recall "+str(recall))

# Output similar to below

# Recall 1.0