# Linear Regression Code Along
###### (Chinmay's Version)

###### Purpose of this exercise:
* We examine a dataset of Ecommerce customer data for a company's website and mobile app, then build a regression model to predict each customer's yearly spend on the company's products.

In [None]:
# Import all the utility functions from Chinmay for convenience
import sys
sys.path.append('C:/Users/nishita/exercises_chinmay')
from tools.chinmay_tools import *

###### ACTIVITIES:
    * Create a spark session

In [None]:
from pyspark.sql import SparkSession
sparkSesn = SparkSession.builder.appName('code_along').getOrCreate()

from pyspark.ml.regression import LinearRegression

data = sparkSesn.read.csv('Linear_Regression/Ecommerce_Customers.csv', inferSchema=True, header=True)

In [None]:
printTextFile('Linear_Regression/Ecommerce_Customers.csv')

In [None]:
data.printSchema()

In [None]:
printHighlighted("Printing the first row from the data")
row = data.head(1)[0]
# Pair each column name with its value in the first row
for column, item in zip(data.columns, row):
    print(column + ' = ' + str(item))

In [None]:
data.head(3)

### Now set up the data for Machine Learning

* NOT USED in this step: from pyspark.ml.regression import LinearRegression (it was already imported above)
* Import <b>VectorAssembler</b>. This is a feature transformer that merges multiple columns into a single vector column, which is the purpose of this step, as Spark expects all features in a single column.
>* Optionally import "Vector", probably not needed
>* Search at https://spark.apache.org/docs/latest/api/python/index.html
>* You can use <b>VectorAssembler?</b> to get the same help locally and more conveniently
* We will now use VectorAssembler to combine the numeric columns into a vector column

In [None]:
from pyspark.ml.feature import VectorAssembler

In [None]:
data.printSchema()

* Let's take the 4th, 5th, 6th and 7th columns as features and the 8th (last) column as the label that we want to predict ("Yearly Amount Spent")
* Create an assembler (<b>VectorAssembler</b>) with the 4th to 7th columns as input columns and a new output column, named say "our_features", which will hold the features from those 4 columns in vectorized form.
* Now transform the data through our vector assembler: the output feature column that we passed to the assembler will be created, holding the vectorized values of the input columns from "data"
* Spark, for any ML algorithm, needs this single vectorized feature column built from the pre-existing numeric columns.
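
To make the idea of "merging columns into one vector column" concrete, here is a minimal plain-Python sketch. It is not Spark's implementation: the helper `assemble` and the sample values are made up for illustration, and rows are plain dicts instead of a DataFrame.

```python
# Conceptual sketch only: mimics what a vector assembler does to each row,
# using plain Python dicts instead of a Spark DataFrame.
def assemble(rows, input_cols, output_col):
    """Return new rows with an extra column holding the input columns as one list."""
    assembled = []
    for row in rows:
        new_row = dict(row)  # keep all the original columns
        new_row[output_col] = [float(row[c]) for c in input_cols]
        assembled.append(new_row)
    return assembled

# Made-up values, for illustration only
rows = [
    {"Time on App": 12.5, "Time on Website": 39.0, "Yearly Amount Spent": 500.0},
    {"Time on App": 11.0, "Time on Website": 37.5, "Yearly Amount Spent": 450.0},
]
out = assemble(rows, ["Time on App", "Time on Website"], "our_features")
print(out[0]["our_features"])
```

Each output row keeps its original columns and gains one new column with all the feature values together, which is the shape the ML step works with.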


* ###### Create a VectorAssembler using the numeric features or columns from our data

In [None]:
assembler = VectorAssembler(inputCols=['Avg Session Length', 
                                       'Time on App', 
                                       'Time on Website', 
                                       'Length of Membership'], 
                            outputCol='our_features')

In [None]:
output = assembler.transform?

In [None]:
output = assembler.transform

* ###### Transform the original input data to get a single vector column whose members are all the desired feature columns

In [None]:
output = assembler.transform(data)
output.printSchema()

In [None]:
data.head()
output.head()

###### Please note that the vector feature column holds, for each row, the values of all the numeric columns we passed as inputCols to the vector assembler

In [None]:
output.select('our_features').show()

* ###### Finally prepare a 2 column DataFrame (as desired by spark for ML algos) having the vectorized feature column and the Label column

In [None]:
final_data = output.select('our_features', 'Yearly Amount Spent')
final_data.show()

###### Split the final data into train and test data

In [None]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])
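
A random split of this kind assigns each row to a split by chance, so the resulting sizes are only approximately 70/30 and change from run to run unless a seed is fixed (randomSplit accepts an optional seed argument). The plain-Python sketch below illustrates the idea; the helper `random_split` is hypothetical and is not how Spark implements it internally.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split at random, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random() * total
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if u < cumulative:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point edge cases
    return splits

rows = list(range(500))  # stand-in for 500 customer rows
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # roughly 350 and 150, not exactly
```

Fixing the seed makes the split reproducible, which helps when comparing metrics between runs.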

In [None]:
train_data.describe().show()
test_data.describe().show()

###### Fit the regression model on the training data; we will then evaluate it on the test data

In [None]:
lr = LinearRegression(featuresCol='our_features', labelCol='Yearly Amount Spent')

In [None]:
lr_model = lr.fit(train_data)

In [None]:
test_result = lr_model.evaluate(test_data)

In [None]:
test_result.rootMeanSquaredError

###### The residuals in test_result are the differences between the actual labels and the predicted values.
* i.e. this is the list of vertical distances of the actual labels in test_data from the fitted regression line (label minus prediction; with 4 features the "line" is really a hyperplane).
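
As a small worked example of what a residual is, the sketch below computes label minus prediction for a few rows. The numbers are made up for illustration and do not come from the dataset.

```python
# Made-up labels and predictions, for illustration only
labels = [500.0, 450.0, 610.0, 380.0]
predictions = [490.0, 465.0, 600.0, 385.0]

# Residual = actual label - predicted value
residuals = [y - y_hat for y, y_hat in zip(labels, predictions)]
print(residuals)  # [10.0, -15.0, 10.0, -5.0]
```

A positive residual means the model predicted too low for that row, a negative one means it predicted too high.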

In [None]:
test_result.residuals.show()

In [None]:
final_data.describe().show()

In [None]:
test_result.r2
test_result.rootMeanSquaredError

* RMSE (RMS error) or rootMeanSquaredError = $\sqrt{\frac{1}{n}\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2}}$
* The errors ($y_i-\hat{y}_i$) are nothing but the elements of <b>residuals</b>.
* So RMSE is the square root of the average of the squared residuals
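
The formula can be checked by hand on a few residuals. The sketch below uses made-up residual values and also computes MSE and MAE for comparison.

```python
import math

residuals = [10.0, -15.0, 10.0, -5.0]  # made-up values, for illustration only
n = len(residuals)

mse = sum(r ** 2 for r in residuals) / n   # mean of squared residuals
rmse = math.sqrt(mse)                      # square root of the MSE
mae = sum(abs(r) for r in residuals) / n   # mean of absolute residuals

print("MSE: {}".format(mse))    # 112.5
print("RMSE: {}".format(rmse))
print("MAE: {}".format(mae))    # 10.0
```

RMSE is never smaller than MAE because squaring gives large residuals more weight.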

* .r2 (r squared) of 0.98 means that our model explains 98% of the variance in the label, which is very good
* Now that both RMSE (RMS) and r squared look very good, we should double check our data and the way we fitted the model before trusting the result
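
For reference, r squared is commonly defined as one minus the ratio of the residual sum of squares to the total sum of squares around the mean label. The sketch below applies that standard definition to made-up numbers; it is an illustration, not output from our model.

```python
# Made-up labels and predictions, for illustration only
labels = [500.0, 450.0, 610.0, 380.0]
predictions = [490.0, 465.0, 600.0, 385.0]

mean_label = sum(labels) / len(labels)

# Residual sum of squares: what the model leaves unexplained
ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
# Total sum of squares: spread of the labels around their mean
ss_tot = sum((y - mean_label) ** 2 for y in labels)

r2 = 1 - ss_res / ss_tot
print(round(r2, 4))
```

An r squared near 1 means the residuals are tiny compared with the natural spread of the labels.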

* Compare the RMSE (RMS) with the mean and standard deviation of the label in the final data (final_data.describe().show()). If the RMSE is much smaller than the stddev, the fit is good; if it looks too good, be suspicious.
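
The standard deviation is a useful yardstick because a "model" that always predicts the mean label has an RMSE equal to the population standard deviation of the labels. The sketch below shows this on made-up numbers. Note that describe() reports the sample standard deviation, which is slightly larger but nearly the same for large datasets.

```python
import math
import statistics

labels = [500.0, 450.0, 610.0, 380.0]  # made-up values, for illustration only
mean_label = sum(labels) / len(labels)

# RMSE of a baseline that predicts the mean label for every row
baseline_rmse = math.sqrt(sum((y - mean_label) ** 2 for y in labels) / len(labels))

print(baseline_rmse)
print(statistics.pstdev(labels))  # population stddev: same value as the baseline RMSE
print(statistics.stdev(labels))   # sample stddev: slightly larger
```

So an RMSE far below the stddev means the model does much better than guessing the average.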

* If you get very good results with Linear Regression, be suspicious and check how you fit the model
* Did we evaluate our model on the training data, which is already known to the model? That is a common mistake.
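
To see why evaluating on training data is misleading, the sketch below fits a one-feature least-squares line on a few made-up training points and compares the RMSE on those same points with the RMSE on held-out points. The helpers `fit_line` and `rmse` and all the numbers are hypothetical, for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the line on the given points."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5

# Made-up data: the line is fitted on the training points only
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [5.0, 6.0], [10.5, 11.4]

slope, intercept = fit_line(train_x, train_y)
train_rmse = rmse(train_x, train_y, slope, intercept)
test_rmse = rmse(test_x, test_y, slope, intercept)
print(train_rmse, test_rmse)
```

In this example the training error is lower than the test error, since the fit was chosen to minimize the error on exactly those training points. The test error is the more honest estimate.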

######  Points to note
* We can expect more advanced and more complex models to give a better fit, but if even simple LinearRegression gives a very good fit, double check the data and the fitting procedure against realistic data.

#### Now we want to deploy this model on some unlabeled data.
* For this we need some customer data with features alone, without any label, that our model was not trained on.
* For now we don't have realistic production data, so we mimic it: we remove the labels from test_data and treat the remaining feature column as unlabeled data.

In [None]:
unlabeled_data = test_data.select('our_features')
unlabeled_data.printSchema()
test_data.printSchema()

In [None]:
predicted_data = lr_model.transform(unlabeled_data)

In [None]:
predicted_data.show()

In [None]:
test_data.show()

###### Here the basic flow of data is as follows
* EVALUATION:  test_data-->test_result
    * This compares the model's predictions against the labels in test_data
* PREDICTION:   test_data-->unlabeled_data-->predicted_data  [IDEALLY: It should have been NewData(unlabeled)-->predicted_data]
    * Here we need to manually compare the prediction in predicted_data against the label in test_data

###### Here we use a select join to find the difference between the label of the input data and the predicted label of the deployed data
* register both the dataframes as temporary views using DataFrame.createOrReplaceTempView()
* join the two views with a standard SQL statement and WHERE condition, and create a "difference" column holding the difference between the actual label "Yearly Amount Spent" and the predicted label "prediction"
* Note: we have used <b>backticks</b> (reverse single quotes) to quote column names containing spaces
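
The logic of that join can be sketched in plain Python: look up each test row's prediction by its feature values and subtract. The rows below are made up, and the sketch assumes every feature vector is unique. If two rows shared identical features, a join on the features would match them against each other and the row count would grow, which is worth checking with count().

```python
# Made-up rows: (features, label) and (features, prediction)
test_rows = [((34.5, 12.7), 588.0), ((31.9, 11.1), 392.0)]
predicted_rows = [((34.5, 12.7), 580.0), ((31.9, 11.1), 400.0)]

# Index predictions by their feature tuple (assumes feature vectors are unique)
prediction_by_features = dict(predicted_rows)

# "Join" on the features and add a difference column: label - prediction
joined = [
    (features, label, prediction_by_features[features],
     label - prediction_by_features[features])
    for features, label in test_rows
    if features in prediction_by_features
]
for row in joined:
    print(row)
```

The last element of each joined row plays the role of the "difference" column.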

In [None]:
### Using SQL
#### To use SQL queries directly with the dataframe, you will need to register it to a temporary view:

# Register the DataFrame as a SQL temporary view
test_data.createOrReplaceTempView("my_test_data")
predicted_data.createOrReplaceTempView("my_predicted_data")

In [None]:
input_output_df = sparkSesn.sql("SELECT t.*, p.prediction, (t.`Yearly Amount Spent` - p.prediction) as difference  \
                                FROM my_test_data t, my_predicted_data p WHERE t.our_features==p.our_features")

In [None]:
input_output_df.count()
input_output_df.show()

###### Now we compare our deviation ("difference" field) with the deviation from the evaluation step on test data (test_result.residuals) to check that our predictions are consistent with the evaluation

In [None]:
#test_result.residuals.printSchema()
#input_output_df.printSchema()
test_result.residuals.show()

###### If we compare, the values in test_result.residuals should be identical to our "difference" column.
* i.e. the test_data evaluation result is identical to the final prediction on the unlabeled data for the same feature set
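
Rather than comparing the two outputs by eye, the check can be done programmatically once both columns are collected into Python lists. The sketch below uses made-up values in place of the collected columns. It sorts both lists first, since the row order of the two results is not guaranteed to match, and compares with a tolerance because floating-point values may differ in the last digits.

```python
import math

# Made-up values standing in for the two collected columns
residuals = [8.0, -8.0, 1.5]
differences = [1.5000000001, 8.0, -8.0]  # same values, different row order

# Sort first, then compare pairwise with a relative tolerance
pairs = zip(sorted(residuals), sorted(differences))
all_match = len(residuals) == len(differences) and all(
    math.isclose(r, d, rel_tol=1e-6) for r, d in pairs
)
print(all_match)
```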

In [None]:
print("RMSE: {}".format(test_result.rootMeanSquaredError))
print("MSE: {}".format(test_result.meanSquaredError))
print("MAE: {}".format(test_result.meanAbsoluteError))