## REFER: Sections 4 to 4.3 of "Introduction to Statistical Learning" by Gareth James

#### pyspark API Documentation:
* http://spark.apache.org/docs/latest/
* http://spark.apache.org/docs/latest/ml-guide.html
* https://spark.apache.org/docs/latest/api/python/

## [Introduction to Statistical Learning](<https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf>)

## Logistic Regression
* Logistic Regression is a classification algorithm: it predicts discrete categories.
* Linear Regression is used for predicting continuous values,
* whereas Logistic Regression is used for classification, i.e. grouping into discrete categories (despite the word "regression", which usually points at continuous data).
* For the concepts behind the evaluation methods and metrics used in classification:
>* REFER sections 4 - 4.3 in Introduction to Statistical Learning by Gareth James
* EXAMPLES of binary classification (classification between two classes):
>* "Spam" vs "Ham" emails - finding whether a mail is bad (spam) or good (ham)
>* Loan Default (Yes/No) - finding whether a customer will default on a loan or not
>* Disease Diagnosis - e.g. whether a patient will be diagnosed with cancer or not, based on certain body parameters
* To classify into one of two classes we need a function that fits binary categorical data. Linear Regression does not work well here: a straight line fits 0/1 labels poorly and can produce predictions below 0 or above 1.
* The "Sigmoid Function" returns a value between 0 and 1 for any input value.
* So to do binary classification (say between class 0 and class 1) we can follow the approach below:
>1. We take the output of the linear model (as in Linear Regression).
>2. Pass that output into the Sigmoid Function to get a value between 0 and 1.
>3. Then we set the cut-off point at 0.5.
>4. Anything below the cut-off point is classified as class 0 and anything above as class 1.
* The block below explains in brief how the Sigmoid Function works.

## Sigmoid Function (aka Logistic)

# $\phi(x) = \frac{e^x}{e^x+1}  =  \frac{1}{1+e^{-x}}$
The sigmoid function always returns a value between 0 and 1 for any value of x
* So we can take our linear model's output and pass it into the Sigmoid Function

* <h6 style="display: inline"></h6>Linear Model
<h2 style="display: inline">$ y = b_0 + b_1x$</h2>

* <h6 style="display: inline"></h6>Logistic Model
<h2 style="display: inline">$\phi(y) = \frac{1}{1+e^{-y}} = \frac{1}{1+e^{-(b_0+b_1x)}}$</h2>
* So no matter what the result of the linear model is, the output of the logistic model will always be between 0 and 1
* This output is read as the probability (from 0 to 1) of belonging to class 1.
* Then we set the cut-off point at 0.5: anything below it is classified as class 0 and anything above as class 1.
* ![](./logistic_regression_steps.png)
* <table><tr>
    <td><img src="logistic_regression_steps.png" width="500" /></td>
    <td><img src="confusiton_matrix_metrics_ratios.png" width="400" /></td>
  </tr></table>
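
The linear model, sigmoid and cut-off steps described above can be sketched in a few lines of plain Python. This is only an illustration: the coefficients `b0` and `b1` and the helper names `sigmoid` and `predict_class` are assumptions made for the example, not values or functions from the notebook.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (ez + 1.0)

def predict_class(x, b0, b1, cutoff=0.5):
    """Return (probability of class 1, predicted class) for a single input x."""
    y = b0 + b1 * x            # linear model: y = b0 + b1*x
    p = sigmoid(y)             # squashed into the range 0..1
    return p, (1 if p > cutoff else 0)

# Hypothetical coefficients, chosen only for illustration
b0, b1 = -3.0, 1.5
for x in (0.0, 2.0, 4.0):
    p, cls = predict_class(x, b0, b1)
    print(f"x={x:.1f}  p={p:.3f}  class={cls}")
```

A probability of exactly 0.5 is assigned to class 0 here; which side the boundary value falls on is a design choice.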

###### Metrics in Logistic Regression
* Legend:  T-True, F-False, P-Positive, N-Negative
* The metrics are computed from a confusion matrix.
* Accuracy is $\frac{TP + TN}{Total Cases}$
* Misclassification rate is $\frac{FP + FN}{Total Cases}$
* Precision = $\frac{True Positives}{Predicted Positives}$ = $\frac{TP}{TP + FP}$
* Sensitivity or Recall = $\frac{True Positives}{Actual Positives}$ = $\frac{TP}{TP + FN}$
* Specificity = $\frac{True Negatives}{Actual Negatives}$ = $\frac{TN}{TN + FP}$
* Type-I error = False Positives, Type-II error = False Negatives
<img src="metrics_from_confusion_matrix.png" width="700" height="100" />
* Receiver Operating Characteristic curve (ROC curve) is the plot of Sensitivity (y-axis) vs 1-Specificity (x-axis), i.e. the graph of true positive rate vs false positive rate
>* Area under the curve is a metric for how well the model separates the two classes.
>* A curve above the red random-guess line means our model is performing better than a random guess. If it is below, the model is performing worse than a random guess. <img src="roc_curve_plot.png" width="500" height="100" />
* The other type of metric used in BinaryClassificationEvaluator is the PR curve, or Precision-Recall curve.
<hr/>
* Metrics for binary classification are 'areaUnderROC' or 'areaUnderPR' (default: areaUnderROC)
* Metrics for multiclass classification are (f1|weightedPrecision|weightedRecall|accuracy) (default: f1)
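
As a quick check of the confusion-matrix ratios listed above, here is a minimal sketch that computes them from the four counts. The function name `confusion_metrics` and the counts passed to it are hypothetical, picked only to make the arithmetic easy to follow.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the basic binary-classification ratios from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),      # of predicted positives, how many are real
        "recall": tp / (tp + fn),         # of actual positives, how many were found
        "specificity": tn / (tn + fp),    # of actual negatives, how many were found
    }

# Hypothetical counts, for illustration only
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The sketch assumes every denominator is non-zero; a model that never predicts the positive class would need a guard around the precision term.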

### Evaluators
* Evaluation DataFrame - the dataframe of predictions obtained by evaluating a fitted model, i.e. the "predictions" member of the summary returned by the model.evaluate(test_data) method
* Evaluators are similar to Machine Learning Algorithm objects but are designed to take evaluation dataframes.
* Evaluators are marked experimental, so we should be cautious about using them in production code.
* We have two types of evaluators:
>1. BinaryClassificationEvaluator
        * Metrics for binary classification are 'areaUnderROC' or 'areaUnderPR' (default: areaUnderROC)
        * Expects 'rawPrediction' field (raw scores, +ve or -ve float values) in input evaluation dataframe
>2. MulticlassClassificationEvaluator
        * Metrics for multiclass classification are (f1|weightedPrecision|weightedRecall|accuracy) (default: f1)
        * Expects 'prediction' field (0.0 or 1.0, a binary number) in input evaluation dataframe
* Both evaluators take an evaluation dataframe as the parameter of their .evaluate() method, which returns a single score between 0.0 and 1.0 (i.e. 0% to 100%) for the metrics listed above; higher is better. For 'accuracy' this is the fraction of rows where the predicted value ('prediction' column) matches the observed value ('label' column), while the area metrics summarize how well the raw scores separate the two classes.
* The input dataframe to these evaluators is the "predictions" dataframe member of the result of fitted_model.evaluate(test_dataset)

#### Steps in Linear Regression (flashback)
* Load CSV file into DataFrame (spark dataframe)
  * To load simple text files use sparksessn.read.csv(...) etc
  * To load vectorized text files use sparksessn.read.format('libsvm').load(...)
    * When we load a vector formatted or libsvm text file we need not worry about VectorAssembler, as the data already has a vectorized features column. We can jump directly to the train-test split.
* Convert participating string feature columns into numeric using StringIndexer. Use .fit() and .transform() to get the converted dataframe.
* Create a VectorAssembler with this indexed dataframe including the numeric feature columns and generating a new vectorised combined feature column
* Get the final dataframe using the combined vectorized feature and numeric label column.
* Split the final data into training data frame (70%) and test data frame (30%)
* Create a LinearRegression with the combined feature and label columns, do a .fit(sdf) to get the model.
* Do a .evaluate() on the model with test data to get comparison of test_result (evaluation) and test_data.label col to find various parameters.
* Then you can run .transform on the model with unlabeled data (could be the test data minus the labels) to get the predictions.
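
To make the StringIndexer and VectorAssembler steps above more concrete, here is a rough plain-Python analogy of what they produce. It is not Spark code: the rows, the column meanings and the helper `fit_string_index` are all hypothetical, and breaking ties alphabetically is an assumption of this sketch. The one behaviour taken from Spark is that StringIndexer, by default, gives index 0 to the most frequent string.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to a numeric index, most frequent value first."""
    counts = Counter(values)
    # Assumption: ties are broken alphabetically in this sketch
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical rows: (city, age, income, label)
rows = [
    ("Pune", 30, 52.0, 1.0),
    ("Delhi", 41, 61.5, 0.0),
    ("Pune", 25, 48.0, 1.0),
]

# Analogy of StringIndexer.fit(): learn the string-to-index mapping
city_index = fit_string_index(r[0] for r in rows)

# Analogy of .transform() plus VectorAssembler: one combined feature vector per row
final_data = [
    {"features": [city_index[city], float(age), income], "label": label}
    for city, age, income, label in rows
]
print(final_data[0])
```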

#### Steps in Logistic Regression
* Load CSV file into DataFrame (spark dataframe)
  * ??? To load simple text files use sparksessn.read.csv(...) etc
  * To load vectorized text files use sparksessn.read.format('libsvm').load(...)
    * When we load a vector formatted or libsvm text file we need not worry about VectorAssembler, as the data already has a vectorized features column. We can jump directly to the train-test split.
* ??? Convert participating string feature columns into numeric using StringIndexer. Use .fit() and .transform() to get the converted dataframe.
* ??? Create a VectorAssembler with this indexed dataframe including the numeric feature columns and generating a new vectorised combined feature column
* Get the final dataframe using the combined vectorized feature and numeric label column.
* Split the final data into training data frame (70%) and test data frame (30%)
* Create a LogisticRegression with the combined feature and label columns, do a .fit(sdf) to get the model.
* Do a .evaluate() on the model with test data to get a BinaryLogisticRegressionSummary.
* To get a BinaryLogisticRegressionTrainingSummary object, read the summary attribute of the fitted model itself; this can be done before calling fitted_model.evaluate().
* Check the "predictions" dataframe member of the summary returned by fitted_model.evaluate(test_data).
* To explore this evaluation dataframe we can use "BinaryClassificationEvaluator" or "MulticlassClassificationEvaluator". The metric for evaluation also differs based on the evaluator used.
>* REFER "<b>Evaluators</b>" section above for more details.
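
The libsvm text format mentioned in the steps above stores one row per line as `label index:value index:value ...`, listing only the non-zero features. The sketch below parses such a line into a label and a sparse vector in the `(size, [indices], [values])` shape that the notebook output shows. The sample line and the helper name `parse_libsvm_line` are made up for illustration; the shift from 1-based to 0-based indices mirrors what Spark does when loading this format.

```python
def parse_libsvm_line(line, num_features):
    """Parse 'label idx:val idx:val ...' into (label, (size, indices, values))."""
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)   # libsvm indices are 1-based; shift to 0-based
        values.append(float(val))
    return label, (num_features, indices, values)

# Hypothetical line, not taken from sample_libsvm_data.txt
label, features = parse_libsvm_line("1 3:0.5 10:2.0 692:1.0", num_features=692)
print(label, features)
```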

###### Import convenient utilities from my personal library
* The "import *" brings in all my modules, along with the setup that allows multiple implicit prints in a notebook shell

In [1]:
import sys
sys.path.append('C:/Users/nishita/exercises_udemy/tools/')
from chinmay_tools import *

###### Logistic Regression Exercise

In [2]:
printHighlighted('Load the libsvm or vector formatted text')
from pyspark.sql import SparkSession
spark1 = SparkSession.builder.appName('logireg_1').getOrCreate()
training = spark1.read.format('libsvm').load('Logistic_Regression/sample_libsvm_data.txt')

[7m[1mLoad the libsvm or vector formatted text[0m[0m


In [3]:
training.printSchema()
training.describe().show()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                100|
|   mean|               0.57|
| stddev|0.49756985195624287|
|    min|                0.0|
|    max|                1.0|
+-------+-------------------+



In [4]:
printHighlighted('Instantiate a LogisticRegression model to be fitted / trained next')

[7m[1mInstantiate a LogisticRegression model to be fitted / trained next[0m[0m


In [5]:
from pyspark.ml.classification import LogisticRegression

In [6]:
###### Create the model
# lr = LogisticRegression()   ## This line is same as below one as the parameter values are default ones
lr = LogisticRegression(featuresCol='features', labelCol='label', predictionCol='prediction')

In [7]:
printHighlighted('First fit the entire dataset and check the "predictions" dataframe from summary of the fitted model')
model_fitted_all = lr.fit(training)

[7m[1mFirst fit the entire dataset and check the "predictions" dataframe from summary of the fitted model[0m[0m


In [8]:
model_fitted_all.summary.predictions.printSchema()
model_fitted_all.summary.predictions.show()
printUnderlined('We can observe that the "prediction" column (the predicted results) mostly matches the "label" column (the actual values)')

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[127,128,129...|[19.8534775947478...|[0.99999999761359...|       0.0|
|  1.0|(692,[158,159,160...|[-20.377398194908...|[1.41321555111056...|       1.0|
|  1.0|(692,[124,125,126...|[-27.401459284891...|[1.25804865126979...|       1.0|
|  1.0|(692,[152,153,154...|[-18.862741612668...|[6.42710509170303...|       1.0|
|  1.0|(692,[151,152,153...|[-20.483011833009...|[1.27157209200604...|       1.0|
|  0.0|(692,[129,130,131...|[19.8506078990277...|[0.99999999760673...|       0.0|
|  1.0|(692,[158,159,160...|[-20.337256674833...

#### That was a basic way to do a Logistic Regression
#### We will now introduce concept of 'evaluators'

In [9]:
printHighlighted('To evaluate the model we need to split the entire data into train_data and test_data, '
                    + 'train a new model with train_data and evaluate the test_data with the trained model.')

[7m[1mTo evaluate the model we need to split the entire data into train_data and test_data, train a new model with train_data and evaluate the test_data with the trained model.[0m[0m


In [10]:
train_data, test_data = training.randomSplit([0.7, 0.3])
train_data.describe().show()
test_data.describe().show()

# final_model = LogisticRegression(featuresCol='features', labelCol='label', predictionCol='prediction')
final_model = LogisticRegression()
model_fitted_train_data = final_model.fit(train_data)
predictions_and_labels_summary = model_fitted_train_data.evaluate(test_data)
type(predictions_and_labels_summary)

+-------+------------------+
|summary|             label|
+-------+------------------+
|  count|                69|
|   mean|0.5652173913043478|
| stddev|0.4993602044724247|
|    min|               0.0|
|    max|               1.0|
+-------+------------------+

+-------+------------------+
|summary|             label|
+-------+------------------+
|  count|                31|
|   mean|0.5806451612903226|
| stddev| 0.501610310127101|
|    min|               0.0|
|    max|               1.0|
+-------+------------------+



pyspark.ml.classification.BinaryLogisticRegressionSummary
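
The split above produced 69 and 31 rows rather than exactly 70 and 30, because the weights act as sampling probabilities, not exact quotas. The plain-Python sketch below is a loose analogy of that behaviour, not Spark's actual implementation; the helper `random_split` and the seed value are assumptions for illustration.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total               # cumulative upper bound for each split
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()               # uniform draw in [0, 1)
        for i, upper in enumerate(bounds):
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(range(100), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

DataFrame.randomSplit also accepts a `seed` argument; passing one makes the split repeatable across runs, which helps when comparing metrics between notebook executions.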

In [11]:
predictions_and_labels_summary.predictions.show()
printHighlighted('The result looks good, i.e. the actual "label" values are almost identical to the "prediction" column values, probably because the documentation example data is well prepared')

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[95,96,97,12...|[23.9967265226565...|[0.99999999996212...|       0.0|
|  0.0|(692,[123,124,125...|[32.2218848363852...|[0.99999999999998...|       0.0|
|  0.0|(692,[124,125,126...|[31.0461877476375...|[0.99999999999996...|       0.0|
|  0.0|(692,[124,125,126...|[20.8145774182799...|[0.99999999908726...|       0.0|
|  0.0|(692,[125,126,127...|[23.3025129766084...|[0.99999999992416...|       0.0|
|  0.0|(692,[126,127,128...|[23.4852932650678...|[0.99999999993683...|       0.0|
|  0.0|(692,[126,127,128...|[25.5661709794507...|[0.99999999999211...|       0.0|
|  0.0|(692,[126,127,128...|[29.5303545329574...|[0.99999999999985...|       0.0|
|  0.0|(692,[126,127,128...|[21.8600559943781...|[0.99999999967915...|       0.0|
|  0.0|(692,[127

In [12]:
printHighlighted('Now we will explore the evaluation of this prediction using an evaluator object.')

[7m[1mNow we will explore the evaluation of this prediction using an evaluator object.[0m[0m


In [13]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

* <b>BinaryClassificationEvaluator</b> takes 3 parameters, rawPredictionCol, labelCol and metricName, with default values 'rawPrediction', 'label' and 'areaUnderROC' respectively.
* The other value for metricName is 'areaUnderPR'. ROC stands for the Receiver Operating Characteristic curve and PR stands for the Precision-Recall curve.
* BinaryClassificationEvaluator.evaluate() expects an evaluation dataframe with labels/observations and raw predictions
    * The evaluation dataframe is the "predictions" dataframe member of the BinaryLogisticRegressionSummary object returned from fitted_model.evaluate(test_data)
* binary_evaluator.evaluate(predictions_dataframe) returns the area under the chosen curve, a value between 0 and 1 (0% to 100%) indicating how well the 'rawPrediction' scores separate the two 'label' classes
* It is a bit difficult to grab accuracy, precision, recall etc. directly from the BinaryClassificationEvaluator, so for these we use <b>MulticlassClassificationEvaluator</b>
* BinaryClassificationEvaluator expects <i>rawPrediction</i>, whereas MulticlassClassificationEvaluator expects the <i>prediction</i> column (binary value of 0 or 1), in which the cut-off has already been applied to 'probability'.

In [29]:
# assuming default values 'rawPrediction', 'label' and default metricName is 'areaUnderROC' (other metric is 'areaUnderPR') in its parameter
bin_evaluator = BinaryClassificationEvaluator()

In [50]:
BinaryClassificationEvaluator?

In [16]:
predictions_and_labels_summary.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[95,96,97,12...|[23.9967265226565...|[0.99999999996212...|       0.0|
|  0.0|(692,[123,124,125...|[32.2218848363852...|[0.99999999999998...|       0.0|
|  0.0|(692,[124,125,126...|[31.0461877476375...|[0.99999999999996...|       0.0|
|  0.0|(692,[124,125,126...|[20.8145774182799...|[0.99999999908726...|       0.0|
|  0.0|(692,[125,126,127...|[23.3025129766084...|[0.99999999992416...|       0.0|
|  0.0|(692,[126,127,128...|[23.4852932650678...|[0.99999999993683...|       0.0|
|  0.0|(692,[126,127,128...|[25.5661709794507...|[0.99999999999211...|       0.0|
|  0.0|(692,[126,127,128...|[29.5303545329574...|[0.99999999999985...|       0.0|
|  0.0|(692,[126,127,128...|[21.8600559943781...|[0.99999999967915...|       0.0|
|  0.0|(692,[127

In [30]:
bin_evaluator.evaluate(predictions_and_labels_summary.predictions)

1.0

* The metrics used for BinaryClassificationEvaluator are 'areaUnderROC' (default) or 'areaUnderPR'
    * ROC means Receiver Operating Characteristic curve
    * PR means Precision-Recall curve
* The value of 1 for 'areaUnderROC' indicates that the raw scores rank every class-1 record of test_data above every class-0 record, so some cut-off separates the two classes perfectly. It does not by itself guarantee that every predicted value ('prediction', taken at the default cut-off) is identical to the observed value ('label'). Both columns are binary (0 or 1).
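
To see what an area under the ROC curve of 1.0 does and does not say, here is a small plain-Python sketch. It uses the pairwise-ranking definition of AUC (the fraction of positive/negative pairs in which the positive record scores higher), which is a standard equivalent formulation and not necessarily how Spark computes the value. The labels, scores and helper names are hypothetical.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_at(labels, scores, cutoff=0.5):
    """Fraction of records whose thresholded score matches the label."""
    preds = [1 if s > cutoff else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical class-1 probabilities: every positive outranks every negative
labels = [0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.40, 0.80, 0.90]

print("AUC:", area_under_roc(labels, scores))
print("accuracy at 0.5:", accuracy_at(labels, scores, cutoff=0.5))
print("accuracy at 0.3:", accuracy_at(labels, scores, cutoff=0.3))
```

In this toy data the ranking is perfect, yet the positive record scored 0.40 is misclassified at the 0.5 cut-off, so a perfect AUC and an accuracy below 1 can coexist.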

In [54]:
# Evaluating again with the non-default metric 'areaUnderPR' also resulted in a perfect score.
BinaryClassificationEvaluator(metricName='areaUnderPR').evaluate(predictions_and_labels_summary.predictions)

1.0

In [32]:
printHighlighted('evaluate() method of Binary evaluator returned 1, indicating that the "rawPrediction" scores separate the two "label" classes perfectly')
printHighlighted('So it was a very good fit - maybe due to the fact that the data was from a documentation example.')

evaluate() method of Binary evaluator returned 1, indicating that the "rawPrediction" scores separate the two "label" classes perfectly
So it was a very good fit - maybe due to the fact that the data was from a documentation example.


In [34]:
multi_evaluator = MulticlassClassificationEvaluator()

* Explain the parameter details of the multi class evaluator using multi_evaluator.explainParams()

In [38]:
multi_evaluator.explainParams()

'labelCol: label column name. (default: label)\nmetricName: metric name in evaluation (f1|weightedPrecision|weightedRecall|accuracy) (default: f1)\npredictionCol: prediction column name. (default: prediction)'

In [49]:
printHighlighted('Print scores based on default f1, or weighted precision score, or weighted recall score or accuracy')
multi_evaluator.evaluate(predictions_and_labels_summary.predictions)
multi_evaluator.evaluate(predictions_and_labels_summary.predictions, {multi_evaluator.metricName:'f1'})

# Predicted positives = True Positives + False Positives = TP + FP
# Actual positives = True Positives + False Negatives = TP + FN

# precision = TP/(TP+FP) i.e. TP / total predicted positives i.e. percentage of predicted positives that are truly positive
multi_evaluator.evaluate(predictions_and_labels_summary.predictions, {multi_evaluator.metricName:'weightedPrecision'})

# recall or sensitivity = TP/(TP+FN) i.e. TP / actual positives i.e. percentage of actual positives that were predicted as positive
multi_evaluator.evaluate(predictions_and_labels_summary.predictions, {multi_evaluator.metricName:'weightedRecall'})

# accuracy = (TP+TN) / total population i.e. percentage of predictions, positive or negative, that are correct i.e. (TP+TN)/(TP+TN+FP+FN)
multi_evaluator.evaluate(predictions_and_labels_summary.predictions, {multi_evaluator.metricName:'accuracy'})

Print scores based on default f1, or weighted precision score, or weighted recall score or accuracy


0.9358904852263485

0.9358904852263485

0.9440860215053763

0.935483870967742

0.9354838709677419
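
The "weighted" metrics average the per-class precision, recall and F1, weighting each class by its share of the actual labels. The plain-Python sketch below shows that calculation. The confusion counts are an inference, not something the notebook prints: 16 true positives, 2 false negatives, 13 true negatives and no false positives is one split of the 31 test rows that is consistent with the scores shown above. The helper name `weighted_metrics` is hypothetical.

```python
def weighted_metrics(pairs):
    """pairs: list of (label, prediction). Returns class-frequency-weighted scores."""
    n = len(pairs)
    classes = sorted({y for y, _ in pairs})
    w_prec = w_rec = w_f1 = 0.0
    for c in classes:
        tp = sum(1 for y, p in pairs if y == c and p == c)
        actual = sum(1 for y, _ in pairs if y == c)
        predicted = sum(1 for _, p in pairs if p == c)
        prec = tp / predicted if predicted else 0.0
        rec = tp / actual
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = actual / n            # share of rows whose true label is c
        w_prec += weight * prec
        w_rec += weight * rec
        w_f1 += weight * f1
    accuracy = sum(1 for y, p in pairs if y == p) / n
    return {"f1": w_f1, "weightedPrecision": w_prec,
            "weightedRecall": w_rec, "accuracy": accuracy}

# Inferred (label, prediction) counts; see the assumption stated above
pairs = [(1.0, 1.0)] * 16 + [(1.0, 0.0)] * 2 + [(0.0, 0.0)] * 13
for name, value in weighted_metrics(pairs).items():
    print(f"{name}: {value:.4f}")
```

Weighted recall always equals accuracy, since weighting each class's recall by its share of the labels simply counts all correctly predicted rows; this is why the last two scores in the output above agree.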

* The "predictions" dataframe in the summary (BinaryLogisticRegressionTrainingSummary or BinaryLogisticRegressionSummary) has the following fields
>1. 'label' -- actual labels from input
>2. 'features' -- the feature vector from input
>3. 'rawPrediction' -- raw score from Logistic Regression (before the sigmoid is applied)
>4. 'probability' -- probability derived from the raw prediction
>5. 'prediction' -- label as predicted by the model.
* Here 'label' and 'prediction' are binary, either 0 or 1, representing one of the two classes in binary classification.
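
The relation between the 'rawPrediction', 'probability' and 'prediction' columns can be sketched in plain Python. This reflects my understanding of binary logistic regression in Spark and should be read as an assumption: the raw prediction holds the negated and the plain linear score, the probability is the sigmoid of that score, and the prediction applies a 0.5 cut-off. The helper `from_margin` is hypothetical; the input value is the first raw score visible in the notebook output, negated.

```python
import math

def from_margin(margin, threshold=0.5):
    """margin: linear score for class 1. Returns (rawPrediction, probability, prediction)."""
    raw = [-margin, margin]
    # Numerically stable sigmoid of the margin
    if margin >= 0:
        p1 = 1.0 / (1.0 + math.exp(-margin))
    else:
        ez = math.exp(margin)
        p1 = ez / (1.0 + ez)
    probability = [1.0 - p1, p1]
    prediction = 1.0 if p1 > threshold else 0.0
    return raw, probability, prediction

# First row of the output above shows rawPrediction starting with 19.8534775947478
raw, probability, prediction = from_margin(-19.8534775947478)
print(raw[0], probability[0], prediction)
```

For that row the class-0 probability comes out just below 1, in line with the truncated value in the 'probability' column, and the predicted label is 0.0.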

* The fitted_model.summary gives a BinaryLogisticRegressionTrainingSummary
* The fitted_model.evaluate(test_data) returns a BinaryLogisticRegressionSummary

In [None]:
printHighlighted('VALIDATING')

In [None]:
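# Use the names defined earlier in this notebook (the original names were never assigned)
model_fitted_train_data.summary.accuracy
predictions_and_labels_summary.predictions.show()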