<link rel='stylesheet' href='../assets/css/main.css'/>

[<< back to main index](../README.md)

# Logistic Regression in Spark - Credit Card Approval (Demo)

### Overview
Instructor to demo this on screen.
 
### Builds on
None

### Run time
approx. 20-30 minutes

### Notes

Spark ML provides logistic regression through the `LogisticRegression` class in `pyspark.ml.classification`.

## Load imports

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

## Sigmoid Curve

Logistic regression passes a linear combination of the inputs through the sigmoid (logistic) function to turn it into a probability between 0 and 1.  Let's generate a sigmoid curve in Python (no Spark required).

In [None]:
import numpy as np
# plot sigmoid curve
x = np.arange(-10.,10.,1.)
b = 0 # intercept
m = 1 # slope
sigmoid = lambda x,b,m: np.exp((b + m*x)) / (1 + np.exp((b + m*x)))
y = sigmoid(x,b,m)
plt.scatter(x,y)
plt.title("Sigmoid (Logistic) Function")
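
The expression used in the cell above, `exp(z) / (1 + exp(z))` with `z = b + m*x`, is algebraically the same as `1 / (1 + exp(-z))`, and its inverse is the log-odds (logit). Here is a minimal standard-library sketch that checks both facts; the helper names `sigmoid`, `sigmoid_exp_form`, and `logit` are illustrative and not part of any library.

```python
import math

def sigmoid(z):
    """Logistic function written as 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_exp_form(z):
    """Same function in the exp(z) / (1 + exp(z)) form used in the cell above."""
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    """Inverse of the sigmoid: the log-odds of probability p."""
    return math.log(p / (1.0 - p))

for z in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    p = sigmoid(z)
    # Both forms agree, and logit undoes the sigmoid.
    assert math.isclose(p, sigmoid_exp_form(z))
    assert math.isclose(logit(p), z, abs_tol=1e-9)
    print(z, round(p, 4))
```

The second form is usually preferred for large positive `z`, where `exp(z)` in the numerator and denominator can overflow.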


## Step 1: Credit Approval Data

Here is the sample data we are looking at:

| score | approved | 
|-------|----------| 
| 550   | 0        | 
| 750   | 1        | 
| 680   | 1        | 
| 650   | 0        | 
| 450   | 0        | 
| 800   | 1        | 
| 775   | 1        | 
| 525   | 0        | 
| 620   | 0        | 
| 705   | 0        | 
| 830   | 1        | 
| 610   | 1        | 
| 690   | 0        | 


## Step 2: Let's visualize the data

In [None]:
mydata = pd.DataFrame({'score' : [550., 750., 680., 650., 450., 800., 775., 525., 620., 705., 830., 610., 690.],
              'approved' : [0,1,1,0,0,1,1,0,0,0,1,1,0]
             })

mydata



## Let us plot and visualize the sample data.

In [None]:
plt.scatter(mydata.score,mydata.approved)
plt.xlabel('score')
plt.ylabel('approved')

## Step 3: Convert dataframe to Spark and Prepare feature vector

We first convert the pandas DataFrame to a Spark DataFrame, and then prepare the feature vector with `VectorAssembler`. This assumes a `SparkSession` is already available as `spark`, as it is in a PySpark notebook.

By default the model looks for a numeric column called "label" (the `labelCol` parameter), so we add one by copying the `approved` column.



In [None]:
spark_credit = spark.createDataFrame(mydata)
assembler = VectorAssembler(inputCols=["score"], outputCol="features")
featureVector = assembler.transform(spark_credit)
featureVector = featureVector.withColumn("label",featureVector.approved)
featureVector.show()



## Step 4: Fit logistic regression

Now it's time to fit our logistic regression model.  This is a linear model, so we will be getting the coefficients and intercept.

In [None]:

lr = LogisticRegression(maxIter=50, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(featureVector)

# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))
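
To see how these two numbers are used: with a single feature, the model computes `z = intercept + coefficient * score` and passes `z` through the sigmoid. The sketch below uses made-up placeholder values, not output from this notebook; to try it with the real fit, substitute `lrModel.coefficients[0]` and `lrModel.intercept`. The names `approval_probability`, `decision_boundary`, and `odds_ratio_per_point` are hypothetical.

```python
import math

# Hypothetical values for illustration only (NOT the fitted result).
coefficient = 0.02   # change in log-odds per credit-score point
intercept = -13.0

def approval_probability(score):
    """Sigmoid of the linear term intercept + coefficient * score."""
    z = intercept + coefficient * score
    return 1.0 / (1.0 + math.exp(-z))

# Score at which z = 0, i.e. the probability is 0.5
# (undefined if the coefficient is 0).
decision_boundary = -intercept / coefficient

# Each extra score point multiplies the odds of approval by exp(coefficient).
odds_ratio_per_point = math.exp(coefficient)

for score in (600.0, 650.0, 700.0):
    print(score, round(approval_probability(score), 3))
```

With a positive coefficient the probability rises with the score, and scores above the decision boundary get a probability above 0.5.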

In [None]:
lrModel.summary.predictions.show()

The output lists, for each training row, the actual approval (`label`) alongside the model's `rawPrediction`, `probability`, and `prediction` columns.

In [None]:

# Extract the training summary from the LogisticRegressionModel fitted above
trainingSummary = lrModel.summary

# Obtain the objective per iteration
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)

# Print the area under the ROC curve (the curve itself is in trainingSummary.roc)
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))

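# Find the threshold that maximizes F-Measure and set it on the estimator `lr`
# (this applies to later fits; the already-fitted lrModel is not changed)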
fMeasure = trainingSummary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']) \
    .select('threshold').head()['threshold']
lr.setThreshold(bestThreshold)
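
The cell above picks the probability threshold with the highest F-measure. The same idea can be sketched by hand on the 13 sample rows using only the standard library. As a simplifying assumption, this sketch thresholds the raw score instead of the model's probability; the two orderings agree only if the fitted coefficient is positive. Spark's `fMeasureByThreshold` reports probability thresholds, so its numbers will differ. The helper `f_measure` is illustrative.

```python
# Sample data from Step 1.
scores   = [550, 750, 680, 650, 450, 800, 775, 525, 620, 705, 830, 610, 690]
approved = [0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0]

def f_measure(threshold):
    """F1 when every score >= threshold is predicted as approved."""
    pairs = list(zip(scores, approved))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn)

# Try each observed score as a candidate cut-off and keep the best one.
best_threshold = max(sorted(scores), key=f_measure)
best_f = f_measure(best_threshold)
print(best_threshold, best_f)
```

On this data the best cut-off is a score of 750: it approves four of the six approved applicants with no false positives, giving an F-measure of 0.8.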

## Visualize the ROC curve

Here we plot the ROC curve from the training summary, with the diagonal (a random classifier) drawn in red for reference.

In [None]:
# ROC

roc_df = trainingSummary.roc.toPandas()

plt.plot(roc_df['FPR'], roc_df['TPR'])
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC Curve")
plt.plot([0.0, 1.0], [0.0, 1.0], 'r')
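
To make the curve less of a black box, here is a standard-library sketch that builds ROC points and the area under the curve by hand from the 13 sample rows. As a simplifying assumption it ranks applicants by raw score rather than by the model's probability, so it matches Spark's curve only if the fitted coefficient is positive. The helpers `roc_points` and `trapezoid_auc` are illustrative names.

```python
# Sample data from Step 1.
scores   = [550, 750, 680, 650, 450, 800, 775, 525, 620, 705, 830, 610, 690]
approved = [0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0]

positives = [s for s, y in zip(scores, approved) if y == 1]
negatives = [s for s, y in zip(scores, approved) if y == 0]

def roc_points():
    """(FPR, TPR) pairs as the cut-off sweeps from the highest score down."""
    points = [(0.0, 0.0)]
    for cutoff in sorted(scores, reverse=True):
        tpr = sum(s >= cutoff for s in positives) / len(positives)
        fpr = sum(s >= cutoff for s in negatives) / len(negatives)
        points.append((fpr, tpr))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

roc = roc_points()
auc = trapezoid_auc(roc)

# Cross-check: AUC is the fraction of (approved, rejected) pairs
# in which the approved applicant has the higher score.
pairwise = (sum(p > n for p in positives for n in negatives)
            / (len(positives) * len(negatives)))
print(round(auc, 4), round(pairwise, 4))
```

Both calculations give 36/42 (about 0.857) for this data: of the 42 approved/rejected pairs, 36 are ranked in the right order by score.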

## Step 5: Visualize data and logit model

Let's visualize the data and our model.

In [None]:
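# Actual outcomes
plt.scatter(mydata.score, mydata.approved, label='actual')
plt.xlabel('score')
plt.ylabel('approved')
lrModel.summary.predictions.printSchema()
probabilities = lrModel.summary.predictions.select('score', 'rawPrediction', 'probability').toPandas()
# Each vector has two entries: index 0 is class 0 (rejected), index 1 is class 1 (approved)
probabilities[['raw1','raw2']] = pd.DataFrame([v.toArray() for v in probabilities.rawPrediction])
probabilities[['prob1','prob2']] = pd.DataFrame([v.toArray() for v in probabilities.probability])
# Model output for the "approved" class: probability and raw prediction
plt.scatter(probabilities['score'], probabilities['prob2'], label='probability of approval')
plt.scatter(probabilities['score'], probabilities['raw2'], label='raw prediction')
plt.legend()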
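
For reference, the columns plotted above are related as follows, to the best of my understanding of Spark's binary logistic regression: `rawPrediction` holds `[-z, z]` where `z` is the log-odds of approval, `probability` holds `[1 - p, p]` with `p = sigmoid(z)`, and `prediction` is 1 when `p` exceeds the threshold (0.5 by default). The sketch below mimics that layout in plain Python; the helper `describe_row` and the `z` values are hypothetical.

```python
import math

def describe_row(z, threshold=0.5):
    """Mimic the three output columns for one row, given its log-odds z."""
    p = 1.0 / (1.0 + math.exp(-z))
    raw_prediction = [-z, z]       # index 1 corresponds to 'raw2' above
    probability = [1.0 - p, p]     # index 1 corresponds to 'prob2' above
    prediction = 1.0 if p > threshold else 0.0
    return raw_prediction, probability, prediction

for z in (-2.0, 0.5, 3.0):
    print(describe_row(z))
```

Because `raw2` is a log-odds value rather than a probability, it is not confined to the 0 to 1 range that `prob2` and the actual labels share.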

## Step 6: Let's create some new test data and make predictions

In [None]:
newdata = pd.DataFrame({'score' : [600., 700., 810.]
             })
print(newdata)

spark_newdata = spark.createDataFrame(newdata)
newfeatures = assembler.transform(spark_newdata)
lrModel.transform(newfeatures).show()