### PROJECT BACKGROUND

* Requirement: Give an accurate estimate of how many crew members a ship will require.
    * The company will pass this information on to its customers when selling the respective ships to them.
* The input DataSet fields are: ['Ship Name', 'Cruise Line', 'Age (as of 2013)', 'Tonnage (1000s of tons)', 'passengers (100s)', 'Length (100s of feet)', 'Cabins (100s)', 'Passenger Density', 'Crew (100s)']
* Create a regression model to predict the number of crew members needed for future ships.
* Condition: Particular cruise lines will differ in acceptable crew counts. This may be an important feature, as per Hyundai.
* The cruise line value is a string, so we need to convert these strings to numbers. We can use StringIndexer from pyspark.ml.feature.
* Exercise file is: Linear_Regression_Consulting_Project.ipynb and data file is: cruise_ship_info.csv

# Consulting Project
Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
It is saved in a csv file for you called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that they have found that particular cruise lines will differ in acceptable crew counts, so it is most likely an important feature to include in your analysis! 

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

In [None]:
## Enable the shell to print multiple results (instead of only the last result)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
import sys
sys.path.append('c:/Users/nishita/exercises_chinmay/tools')
from chinmay_tools import *

In [None]:
printTextFile('Linear_Regression/cruise_ship_info.csv')

* As we can see from the contents of the dataset, the second column "Cruise_line" is not numeric but affects the crew count, so it is a categorical data point. We can use StringIndexer to convert it into a numeric column.

### PREPARE THE DATA
* Load data from csv
* Convert the string field into a numeric field using StringIndexer from pyspark.ml.feature
* Decide on the label column and the feature columns (all numeric)
* Create a vectorized combined feature column named 'features'
* Get the final input data with two columns, 'features' and 'crew' (the label column)
* Split the final input data into a training set and a test set (70:30 proportion).
* NEXT: We will train with the training data, evaluate the trained model against the test data and print comparison results.
* FINALLY: Deploy the model on production data (mimicked from the test data by stripping out the label column, i.e. removing the 'crew' column). Compare the prediction results against the residuals of the evaluation test result.
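
A minimal plain-Python sketch of the 70:30 split idea, for illustration only. The helper `split_rows` and its `seed` parameter are hypothetical names, not part of the exercise. Spark's `randomSplit` assigns rows by random sampling, so its split sizes are only approximately 70:30, whereas this sketch cuts at an exact position.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of `rows` and cut it into (train, test) lists.

    Hypothetical helper for illustration; it is not Spark's randomSplit.
    """
    rng = random.Random(seed)        # fixed seed makes the split repeatable
    shuffled = list(rows)            # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Ten made-up row ids standing in for ships
rows = list(range(10))
train, test = split_rows(rows)

print(len(train), len(test))
print(set(train).isdisjoint(test))   # no row is in both sets
```

The key property is that the two sets do not overlap: the model must never be evaluated on rows it was trained on.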

In [None]:
from pyspark.sql import SparkSession
sparkEx = SparkSession.builder.appName('cruise_crews').getOrCreate()
sdf_cruises = sparkEx.read.csv('Linear_Regression/cruise_ship_info.csv', inferSchema=True, header=True)
sdf_cruises.printSchema()

In [None]:
df_cruises = getPandasDFfromSparkDF(sdf_cruises)
df_cruises.columns

* As we will see next, "Cruise_line" holds a limited set of distinct string values. StringIndexer will later convert these values into numeric groups 0, 1, 2, ... etc.

In [None]:
sdf_cruises.columns
sdf_cruises.groupBy('Cruise_line').count().show()

#### Convert the String data (Cruise_line) into Numeric using StringIndexer
* StringIndexer assigns numbers starting from 0 according to the frequency of occurrence of each value in that string field.
* Then we can use that indexed numeric field as a feature in place of the string field.
* As we will see next, the StringIndexer will convert the "Cruise_line" string values into numeric groups 0, 1, 2, 3, ... etc., and the algorithm can use these numerical values instead of the raw strings.
* StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. This is similar to label encoding in pandas.
* The format and usage of StringIndexer is similar to VectorAssembler. Both transform a DataFrame to introduce a new column with transformed data from one old column (for StringIndexer) or more than one column (for VectorAssembler).
>* Note that we will use .fit(sdf).transform(sdf) for StringIndexer, whereas we will call only .transform(sdf) with VectorAssembler. StringIndexer must first learn the label frequencies from the data; VectorAssembler has nothing to learn.
* There are different ways to deal with categorical information.
>* We can separate out the "Cruise_line" categorical value into dummy variables, i.e. a single 'Yes'/'No' column for every cruise line (one-hot encoding).
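
The two encodings described above can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: `frequency_index` and `one_hot` are hypothetical helpers, the cruise-line values are made up, and breaking frequency ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct string to an index, most frequent value first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, num_labels):
    """Return a 0/1 list with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(num_labels)]


# Made-up cruise-line values, for illustration only
cruise_lines = ["Royal", "Carnival", "Royal", "Azamara", "Carnival", "Royal"]

mapping = frequency_index(cruise_lines)
indexed = [mapping[v] for v in cruise_lines]
dummies = [one_hot(i, len(mapping)) for i in indexed]

print(mapping)     # label -> index, most frequent label gets 0
print(indexed)     # the indexed column
print(dummies[0])  # dummy-variable form of the first row
```

A single indexed column implies an ordering between cruise lines that a linear model will treat as meaningful, which is why the dummy-variable form is often preferred for regression.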

In [None]:
from pyspark.ml.feature import StringIndexer

str_Indexer = StringIndexer(inputCol='Cruise_line', outputCol='Cruise_line_indexed')
sdf_cruises_indexed = str_Indexer.fit(sdf_cruises).transform(sdf_cruises)
sdf_cruises_indexed.printSchema()
sdf_cruises.printSchema()

#### Vectorize the numeric feature columns to get a single 'features' column to pass to Spark ML algorithms

In [None]:
from pyspark.ml.feature import VectorAssembler

feature_input_cols = ['Cruise_line_indexed', 'Age',  'Tonnage',  'passengers',  'length',  'cabins',  'passenger_density']
assembler = VectorAssembler(inputCols=feature_input_cols, outputCol='features')

sdf_cruises_indexed_vec_data = assembler.transform(sdf_cruises_indexed)
# The last call appends a vector column containing the merged data of the feature columns passed to the assembler constructor

sdf_cruises_indexed_vec_data.printSchema()

In [None]:
sdf_cruises_indexed_vec_data.select('features', 'crew').show()

In [None]:
final_data = sdf_cruises_indexed_vec_data.select('features', 'crew')

final_data.printSchema()

In [None]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])

train_data.describe().show()
test_data.describe().show()

In [None]:
train_data.printSchema()

In [None]:
from pyspark.ml.regression import LinearRegression
'''
Default parameters are featuresCol='features', labelCol='label', predictionCol='prediction'.

The input DataFrame has 'features' as the vector feature column and 'crew' as the label column.
The predicted data will go into a new column named 'prediction' by default.

So in the LinearRegression() constructor we only need to adjust labelCol: the default value is 'label', and we change it to 'crew'.
For the other constructor params the defaults are fine.
'''

lr_ship = LinearRegression(labelCol='crew')

### TRAIN THE MODEL

In [None]:
lr_ship_model_trained = lr_ship.fit(train_data)

In [None]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {} Intercept: {}".format(lr_ship_model_trained.coefficients,lr_ship_model_trained.intercept))

* If the equation is Y = mX + c, '<b>m</b>' is the <b>coefficient</b>, which describes the relationship between X (the predictor or independent variable) and Y (the response or dependent variable), and '<b>c</b>' is the constant called the <b>intercept</b>.
>* A <b>+ve coefficient</b> indicates that as the predictor variable increases, the response variable also increases.
>* A <b>-ve coefficient</b> indicates that as the predictor variable increases, the response variable decreases.
* The coefficient value represents the mean change in the response given a one-unit change in the predictor, with the other predictors held fixed.
* The intercept is the value of the response variable Y when the predictor value (X) is zero.

###### My Theory about coefficients and Intercept 
* If we can express the equation as below:
* $Y = M_1X_1 + M_2X_2 + M_3X_3 + \dots + C = \left(\sum_{i=1}^n M_i X_i\right) + C$
* The coefficients are [$M_1, M_2, M_3, M_4, \dots$] and the intercept is $C$
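
To make the formula concrete, here is a small plain-Python sketch that applies it to one feature vector. The function `predict` and all the numbers are hypothetical and chosen only so the arithmetic is easy to follow; they are not the coefficients of the trained ship model.

```python
def predict(coefficients, intercept, features):
    """Return sum(M_i * X_i) + C for a single feature vector."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    return sum(m * x for m, x in zip(coefficients, features)) + intercept


# Made-up values, for illustration only
coefficients = [0.5, -0.25, 2.0]   # M_1, M_2, M_3
intercept = 1.0                    # C
features = [4.0, 8.0, 1.5]         # X_1, X_2, X_3

y_hat = predict(coefficients, intercept, features)
print(y_hat)   # 0.5*4.0 - 0.25*8.0 + 2.0*1.5 + 1.0

# With every feature at zero, the prediction is just the intercept
print(predict(coefficients, intercept, [0.0, 0.0, 0.0]))
```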

### EVALUATE THE MODEL AGAINST THE TEST DATA AND PRINT THE COMPARISON

In [None]:
test_result = lr_ship_model_trained.evaluate(test_data)

In [None]:
test_result.meanAbsoluteError
test_result.meanSquaredError
test_result.r2
test_result.rootMeanSquaredError
test_result.residuals.show()

###### If we get really good results, we need to do a reality check on the results
* This is real data on real ships, obtained from a US machine learning repository, i.e. UC Irvine.
* If we get very good results (a high value for R-squared), see whether the label is highly correlated with any of the feature columns.
>* See if the number of crew is highly correlated with the number of passengers on board,
>* or see if the number of crew is highly correlated with the number of cabins in the ship.
>* We can use Pearson correlation to find this.
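
For reference, the Pearson correlation itself can be sketched in a few lines of plain Python. The helper `pearson_corr` and the sample numbers are hypothetical; in the notebook the value comes from Spark's `corr` function instead.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, and the two sums of squares
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up passenger and crew figures, for illustration only
passengers = [10.0, 20.0, 30.0, 40.0]
crew = [4.0, 7.0, 11.0, 14.0]

r = pearson_corr(passengers, crew)
print(round(r, 4))   # close to 1: crew rises almost linearly with passengers
```

A value near +1 or -1 means a strong linear relationship; a value near 0 means little linear relationship.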

In [None]:
sdf_cruises.printSchema()

In [None]:
from pyspark.sql.functions import corr   # Pearson Correlation

In [None]:
printHighlighted("corr('col1', 'col2') says how related are these two columns.")
sdf_cruises.select(corr('crew', 'passengers')).show()
sdf_cruises.select(corr('crew', 'cabins')).show()

In [None]:
sdf_cruises.select(corr('crew', 'passenger_density')).show()
sdf_cruises.select(corr('crew', 'length')).show()
sdf_cruises.select(corr('crew', 'Age')).show()
sdf_cruises.select(corr('crew', 'tonnage')).show()
printHighlighted("Here we see that there is high correlation of 'crew' with 'passengers', 'tonnage' and 'cabins'")

* Here we see that the higher the number of passengers, the more crew a ship needs, and with fewer passengers a lower crew count is needed.
* Similarly, the higher the number of cabins, the higher the crew count.
* So many features of the ship itself indicate how many crew members we need.

In [None]:
train_data.describe().show()

### DEPLOY THE MODEL / PREDICT

In [None]:
unlabeled_data = test_data.select('features')
unlabeled_data.printSchema()

In [None]:
predicted_data = lr_ship_model_trained.transform(unlabeled_data)

In [None]:
predicted_data.show()

#### COMPARE THE PREDICTED RESULT WITH THE EVALUATED TEST RESULT

In [None]:
# Register the DataFrame as a SQL temporary view
test_data.createOrReplaceTempView("my_test_data")
predicted_data.createOrReplaceTempView("my_predicted_data")

input_output_df = sparkEx.sql("SELECT t.*, p.prediction, (t.crew - p.prediction) as difference  \
                                FROM my_test_data t, my_predicted_data p WHERE t.features==p.features")
input_output_df.show()

In [None]:
test_result.residuals.show()

### Explanation of the error metrics
* Refer: https://statisticsbyjim.com/glossary/


###### <u>Residuals</u> and <u>RMS</u>
* A residual is the difference between the observed value ($y_i$) and the mean value that the model predicts ($\hat{y_i}$) for that observation.
* $y_i$ is the observed or actual value of the dependent variable and $\hat{y_i}$ is the predicted value (which falls on the regression line) for the same values of $x_i$.
* The errors ($y_i-\hat{y_i}$) are the elements of the <b>residuals</b>.
* <i>Example:</i> When we predicted on the features from test_data (the final two-column data frame of features & label, split into train_data and test_data in 70:30 proportion), the difference between the actual 'crew' values and our predictions was almost identical to the residuals of the evaluated test result.

* <b>RMS</b> error (RMSE) or <b>rootMeanSquaredError</b> = $\sqrt{\frac{1}{n}\sum{(y_i-\hat{y_i})^2}}$, which is the same (as per <u>my theory</u>) as $\sqrt{\frac{1}{n}\sum{(residuals[i])^2}}$
* That means RMS is the square root of the average of the squares of the residual elements.
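
The relationship between the residuals and the error metrics can be checked with a short plain-Python sketch. The function `regression_errors` and the sample values are hypothetical; the notebook reads the same quantities from `test_result` instead.

```python
import math


def regression_errors(actual, predicted):
    """Return (residuals, MAE, MSE, RMSE) for paired actual/predicted values."""
    residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n    # mean absolute error
    mse = sum(r * r for r in residuals) / n     # mean squared error
    rmse = math.sqrt(mse)                       # root mean squared error
    return residuals, mae, mse, rmse


# Made-up crew values and predictions, for illustration only
actual = [10.0, 12.0, 8.0, 6.0]
predicted = [9.0, 13.0, 8.0, 4.0]

residuals, mae, mse, rmse = regression_errors(actual, predicted)
print(residuals)
print(mae, mse, round(rmse, 4))
```

Because the residuals are squared before averaging, one large miss raises the RMS error more than it raises the mean absolute error.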
<hr/>

###### <u>R-squared</u>
* R-squared is the percentage of the variation in Y (the response variable) that is explained by a linear model.
* For a least-squares fit with an intercept, measured on the data it was fitted to, it is between 0 and 100% (i.e. between 0 and 1). On held-out test data it can fall below 0 if the model predicts worse than the mean of the label.
* R-squared is a statistical measure of how close the data are to the fitted regression line.
* It is also known as the <b>coefficient of determination</b>, or the <b>coefficient of multiple determination</b> for multiple regression.
* In general, the <b>higher the R-squared, the better the model fits your data</b>. However, there are important caveats to this guideline, some of which are noted below.
* A larger training share such as [0.8, 0.2] gives the model more data to learn from and may improve the fit, but it shrinks the test sample, so the test metrics become noisier and we cannot validate as reliably. A commonly used proportion is [0.7, 0.3].
* Before you can trust the statistical measures for goodness-of-fit, like R-squared, you <b>should check the residual plots for unwanted patterns</b> that indicate biased results.
* Also check how the label is related to some of the crucial features of the ship itself if we find the fit is very good.
>* Use the Pearson correlation method to find the correlation
>* sdf_cruises.select(corr('crew', 'passengers')) OR sdf_cruises.select(corr('crew', 'cabins'))  # <i>from pyspark.sql.functions import corr</i>
>* If corr indicates high correlation by returning a high value, e.g. above 0.9, then we can assume that the good fit is plausible: the features genuinely carry most of the information about the label.
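
As a rough illustration of the definition, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r_squared` and the numbers below are hypothetical. The second call shows that predictions worse than simply guessing the mean give a value below 0.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - SS_res / SS_tot for paired actual/predicted values."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


# Made-up crew values and two sets of predictions, for illustration only
actual = [10.0, 12.0, 8.0, 6.0]
good_predictions = [9.0, 13.0, 8.0, 4.0]
bad_predictions = [6.0, 8.0, 12.0, 10.0]   # worse than predicting the mean

r2_good = r_squared(actual, good_predictions)
r2_bad = r_squared(actual, bad_predictions)
print(round(r2_good, 4))
print(round(r2_bad, 4))
```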

In our exercise
* An .r2 (R-squared) of 0.98 says that our model explains 98% of the variance in the data, which is very good.
* Now that RMS and R-squared are very good, we should double-check our data and the way we fitted our model, to be realistic.

* Compare the RMS value with the mean and standard deviation of the label in the final data (final_data.describe().show()). If the RMS is much less than the stddev, the fit is good, perhaps too good, so be suspicious.
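
A minimal sketch of that comparison in plain Python, using made-up numbers. The variable names are hypothetical. `statistics.stdev` is the sample standard deviation, which is assumed here to correspond to the stddev row that `describe()` reports.

```python
import math
import statistics

# Made-up crew values and predictions, for illustration only
crew = [4.0, 6.0, 8.0, 10.0, 12.0]
predicted = [4.5, 5.5, 8.0, 10.5, 11.5]

# Root mean squared error of the predictions
rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(crew, predicted)) / len(crew))

label_mean = statistics.mean(crew)
label_std = statistics.stdev(crew)   # sample standard deviation

# A ratio well below 1 means the model beats "always predict the mean" by a wide margin
ratio = rmse / label_std
print(round(rmse, 4), label_mean, round(label_std, 4), round(ratio, 4))
```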

* If you get very good results with Linear Regression, be suspicious and check how you fitted the model.
* Did we also evaluate our model on the training data, which is already known to the model? That is a common mistake.

###### Points to note
* We can expect more advanced and more complex models to fit better, but if the fit is still very good even with a simple LinearRegression, double-check the data and the way the model was fitted against realistic data.

#### EXTRA VALIDATIONS

In [None]:
sdf_cruises
sdf_cruises_indexed
sdf_cruises_indexed_vec_data
lr_ship
lr_ship_model_trained
test_result
predicted_data

In [None]:
test_result.meanAbsoluteError
test_result.meanSquaredError
test_result.rootMeanSquaredError
test_result.r2

In [None]:
test_result.residuals.show()