## Refer to Chapters 2 and 3 of "Introduction to Statistical Learning" by Gareth James

#### pyspark API Documentation:
* http://spark.apache.org/docs/latest/
* http://spark.apache.org/docs/latest/ml-guide.html
* https://spark.apache.org/docs/latest/api/python/

## [Introduction to Statistical Learning](<https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf>)

## Files to open in browser:
* Documentation Page for Linear Regression
* Data from Documentation: 
* Linear_Regression_Example.ipynb
* New Untitled Notebook: for my experiment

###### Spark Documentation Example Page for Linear Regression
* http://spark.apache.org/docs/latest/ml-guide.html
* How to get there
> http://spark.apache.org/docs/latest/ --> Programming Guide --> MLlib (Machine Learning) --> http://spark.apache.org/docs/latest/ml-guide.html --> Classification and Regression --> Linear Regression --> Python

In [None]:
## Enable the shell to print multiple results (instead of only the last result)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

###### import chinmay_tools for various utility functions

In [None]:
import sys
sys.path.append('C:/Users/nishita/exercises_udemy')
from tools.chinmay_tools import printHighlighted

In [None]:
printHighlighted('STEPs for training testing and deploying using LinearRegression model')

#### How to Train-Validate-Deploy a LinearRegression model
#### STEPS PERFORMED BELOW

* Initiated a Spark session to start work: SparkSession.builder.appName('test')<b>.getOrCreate</b>()
* Created an instance of the model, e.g. LinearRegression(), with specific parameters supplied.
* Loaded the data from file into a dataframe (say "all_data") [sparkSesn.read<b>.format('libsvm').load</b>(fileName)]
    * Shortcut readers such as .read.text(file) exist for some formats, BUT .read.text() loads each line as a plain string rather than as label and features, and the .format(...).load(file) version is more generic.
    * Sometimes .format('libsvm').load(fileName) fails; then use .format('libsvm').option('numFeatures', 10).load(fileName)
        * Reason for 10: there are 10 features in the text file (open and check it), so set the numFeatures option to 10
* Split the data (in proportion n1:n2) into train_data and test_data using the dataframe<b>.randomSplit([n1,n2])</b> method. We used n1=0.7 and n2=0.3
* Trained our model on train_data (proportion n1) using trained_model = our_model<b>.fit</b>(train_data). The result is of type LinearRegressionModel, whereas the instantiated model was of type LinearRegression.
* Then compared the trained model (result of our_model.fit) against the test data using trained_model<b>.evaluate</b>(test_data). Used attributes like .r2, .rootMeanSquaredError, .residuals etc. on the result from .evaluate
    * Residuals are the differences between the actual values and the predicted values, i.e. the vertical distances of the actual points from the regression line.
* Then we predicted the label on unlabeled data using trained_model<b>.transform()</b>
* Basic packages used:
    * SparkSession from pyspark.sql
    * LinearRegression from pyspark.ml.regression

##### Create a spark session to work on

In [2]:
from pyspark.sql import SparkSession
spark1 = SparkSession.builder.appName('lr_example').getOrCreate()


###### Load the data

In [None]:
from pyspark.ml.regression import LinearRegression

# Load training data

# training = spark.read.format("libsvm").load("sample_linear_regression_data.txt")

## The above line works from the udemy example folder.
## But if I copy the file somewhere else, it fails with the error "Py4JJavaError: An error occurred while calling o317.load."
## The error is avoided by setting "numFeatures" via the option method. Refer: https://stackoverflow.com/questions/59244415/spark-read-formatlibsvm-not-working-with-python
### BUT THE REASON IS NOT CLEAR
## Checked the text file and found that it has 10 features, hence set the option "numFeatures" to "10". We can pass a higher value, but NOT one lower than the number of features.

training = spark1.read.format("libsvm").option("numFeatures","10").load("Linear_Regression/sample_linear_regression_data.txt")

In [None]:
#training = spark1.read.format('libsvm').load('sample_linear_regression_data.txt')

# printHighlighted("Note down the format of 'libsvm', it is similar to csv or json. The 'libsvm' is not well documented.")
training.printSchema()
training.show()
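
The libsvm text format stores one row per line as `label index:value index:value ...`, with 1-based feature indices and zero-valued features omitted. The sketch below is plain Python, not Spark code, and only illustrates why `numFeatures` may be higher but not lower than the largest index in the file. The function names and the two sample lines are made up for illustration; they are not taken from `sample_linear_regression_data.txt`.

```python
def parse_libsvm_line(line):
    """Split one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)  # indices are 1-based
    return label, features


def min_num_features(lines):
    """Largest feature index seen, i.e. the smallest numFeatures that fits."""
    return max(
        (max(parse_libsvm_line(line)[1], default=0) for line in lines if line.strip()),
        default=0,
    )


# Hypothetical sample lines in libsvm format (sparse: missing indices mean 0.0)
sample_lines = [
    "-9.49 1:0.45 2:0.36 3:-0.38 10:0.25",
    "0.25 1:0.83 4:-0.19 7:-0.62",
]

label, features = parse_libsvm_line(sample_lines[0])
needed = min_num_features(sample_lines)
print(label, features)
print("smallest valid numFeatures:", needed)
```

A vector of size 12 can still hold index 10, but a vector of size 8 cannot, which matches the "higher is fine, lower is not" observation above.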

##### Create an instance of the algorithm or the model

In [3]:
from pyspark.ml.regression import LinearRegression

# Create an instance of the model
lr=LinearRegression(featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=3)

In [4]:
LinearRegression?

###### Train the model with the loaded data

In [None]:
# Train the model
lrModel = lr.fit(training)
# Here we have fit or trained the model with the entire dataset.

###### Print the details and parameters from the trained model and its summary

In [None]:
# Print the details of the model
lrModel.coefficients  # print coefficients of the features
lrModel.intercept

#print the model summary information
training_summary = lrModel.summary

training_summary.residuals.show()
training_summary.rootMeanSquaredError
training_summary.meanSquaredError
training_summary.meanAbsoluteError
training_summary.r2
training_summary.totalIterations
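
To make the summary attributes less abstract, here is a small plain-Python sketch of how such metrics are conventionally computed from labels and predictions. It is not Spark's implementation; the function name `regression_metrics` and the four data points are invented for illustration, and the dictionary keys merely mirror the attribute names used above. Residuals are taken as label minus prediction.

```python
import math


def regression_metrics(labels, predictions):
    """Compute common regression metrics from paired labels and predictions."""
    residuals = [y - p for y, p in zip(labels, predictions)]  # label - prediction
    n = len(residuals)
    ss_res = sum(r * r for r in residuals)  # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    mse = ss_res / n
    return {
        "residuals": residuals,
        "meanSquaredError": mse,
        "rootMeanSquaredError": math.sqrt(mse),
        "meanAbsoluteError": sum(abs(r) for r in residuals) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Made-up labels and predictions
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.5, 5.5, 7.0, 8.0]
metrics = regression_metrics(labels, predictions)
for name, value in metrics.items():
    print(name, value)
```

An r2 close to 1 means the predictions explain most of the variance in the labels, while rootMeanSquaredError is expressed in the same units as the label.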

###### Now load the data again but don't train the model with the entire set of data

In [None]:
# all_data = spark1.read.format("libsvm").load("sample_linear_regression_data.txt")
all_data = spark1.read.format("libsvm").option("numFeatures", 10).load("Linear_Regression/sample_linear_regression_data.txt")
#REFER: https://stackoverflow.com/questions/59244415/spark-read-formatlibsvm-not-working-with-python

## LOAD
###### Split the data into training data and test data using randomSplit() on the dataframe
* i.e. load train_data with the major portion of the DataFrame

In [None]:
train_data, test_data = all_data.randomSplit([0.7, 0.3])    # Split into two dataframes: about 70% (0.7) for train_data and about 30% for test_data

train_data.describe().show()
test_data.describe().show()
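
The counts shown by `describe()` will usually not be an exact 70/30 split, because each row is assigned to a split by a random draw. The plain-Python sketch below mimics that idea; it is not Spark's implementation, and `random_split` is a hypothetical helper. In PySpark itself, `randomSplit` also accepts a `seed` argument, e.g. `all_data.randomSplit([0.7, 0.3], seed=42)`, to make the split repeatable.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed=None):
    """Assign each row to one part by an independent random draw."""
    total = sum(weights)
    # Cumulative boundaries of the normalized weights, e.g. [0.7, 1.0]
    bounds = list(accumulate(w / total for w in weights))
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            if u < bound:
                parts[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last boundary
            parts[-1].append(row)
    return parts


train_rows, test_rows = random_split(range(1000), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))  # close to 700 and 300, rarely exact
```

Running it again with the same seed gives the same two lists; without a seed, the sizes and the membership change from run to run.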

## TRAIN
###### Train the model with train_data

In [None]:
# Now train the model on the training data
correct_model = lr.fit(train_data)

## EVALUATE
###### Now evaluate how the model trained on train_data performs by running it on test_data, which the model has not yet seen
* Run trained_model.evaluate(test_data)

In [None]:
test_result = correct_model.evaluate(test_data)

###### The .evaluate() on the trained model actually compares our predictions against the labels that were already assigned in the test data

In [None]:
type(test_result)
type(correct_model.summary)

In [None]:
test_result.r2
test_result.rootMeanSquaredError
test_result.residuals.show()

###### How to improve the model to do better predictions
* It is a trial-and-error process.
* Keep testing different parameters to the model, i.e. in the constructor LinearRegression().
* Explore the other available parameters and keep modifying the ones used to fine-tune the model.
* After each change to the constructor parameters, repeat the cycle: split, train on the train data, evaluate against the test data.
* Repeat this until you are comfortable with the result, e.g. rootMeanSquaredError.
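
The loop described above can be sketched in plain Python with a toy one-feature model, so that it runs without Spark. Everything here is an assumption made for illustration: the helper names, the tiny dataset, and the closed-form ridge fit for `y ≈ w * x`. The `reg_param` values play a role analogous to, but not numerically identical with, the `regParam` argument of `LinearRegression()`.

```python
import math


def fit_ridge_1d(xs, ys, reg_param):
    """Closed-form weight for y ≈ w * x (no intercept) with an L2 penalty."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg_param)


def rmse(weight, xs, ys):
    """Root mean squared error of the predictions weight * x."""
    return math.sqrt(sum((y - weight * x) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Made-up train and test data
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [5.0, 6.0], [10.1, 11.9]

# Try each candidate setting: refit on train, score on test
results = {}
for reg_param in [0.0, 0.1, 1.0, 10.0]:
    weight = fit_ridge_1d(train_x, train_y, reg_param)
    results[reg_param] = rmse(weight, test_x, test_y)

best_reg_param = min(results, key=results.get)
for reg_param, score in results.items():
    print(reg_param, round(score, 4))
print("best reg_param:", best_reg_param)
```

With PySpark the same pattern applies: build `LinearRegression(...)` with each candidate setting, call `.fit(train_data)`, then compare `.evaluate(test_data).rootMeanSquaredError` across the candidates.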

###### Once we are comfortable with the result, we can deploy our model on unlabeled data.

* Ensure that there is no label on any of the deployment data
>* In other words: the deployment data should be unseen by the model and should not have a label.
* To mimic production data, create a dataframe from test_data by dropping the "label" column

In [None]:
test_data.columns

###### Get the unlabeled data

In [None]:
unlabeled_data = test_data.select('features')

In [None]:
unlabeled_data.show()

###### Use .transform() on the model to get our predictions
* Note that we used trained_model<b>.evaluate()</b> against labeled but unseen data to compare our predictions against the labels.
* trained_model<b>.transform()</b> is used to predict labels when there is no label already available in the supplied data.
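
The difference between the two calls can be illustrated with a tiny stand-in model in plain Python. `TinyLinearModel` is hypothetical and is not a Spark class; it only shows the idea that a prediction is the intercept plus the dot product of coefficients and features, that `transform` needs only the features, and that `evaluate` additionally needs labels to compare against.

```python
import math


class TinyLinearModel:
    """Hypothetical stand-in for a fitted linear model (not a Spark class)."""

    def __init__(self, coefficients, intercept):
        self.coefficients = coefficients
        self.intercept = intercept

    def transform(self, rows):
        """Return copies of the rows with a 'prediction' added; no label needed."""
        scored = []
        for row in rows:
            dot = sum(c * x for c, x in zip(self.coefficients, row["features"]))
            scored.append({**row, "prediction": self.intercept + dot})
        return scored

    def evaluate(self, rows):
        """Compare predictions with the known labels; rows must carry 'label'."""
        scored = self.transform(rows)
        squared = [(r["label"] - r["prediction"]) ** 2 for r in scored]
        return {"rootMeanSquaredError": math.sqrt(sum(squared) / len(squared))}


# Made-up coefficients and rows
model = TinyLinearModel(coefficients=[2.0, -1.0], intercept=0.5)
labeled = [
    {"features": [1.0, 1.0], "label": 1.5},
    {"features": [2.0, 0.0], "label": 5.0},
]
unlabeled = [{"features": row["features"]} for row in labeled]  # drop the label

predicted = model.transform(unlabeled)
summary = model.evaluate(labeled)
print(predicted)
print(summary)
```

Calling `evaluate` on the unlabeled rows would fail here with a `KeyError`, which is the toy equivalent of needing a label column for evaluation.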

## DEPLOY

In [None]:
predicted_data = correct_model.transform(unlabeled_data)

In [None]:
predicted_data.show()

In [None]:
type(lr)
type(lrModel)