# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
It is saved in a csv file for you called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that they have found that particular cruise lines will differ in acceptable crew counts, so it is most likely an important feature to include in your analysis! 

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

## Imports

In [44]:
import findspark
findspark.init('/home/aforestier10/Downloads/spark-3.5.3-bin-hadoop3')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('LinRegConsult').getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

In [45]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator

## Read in data and show

In [46]:
data = spark.read.csv('cruise_ship_info.csv', inferSchema=True, header=True)
data.printSchema()

                                                                                

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [47]:
data.describe().show()

                                                                                

+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|summary|Ship_name|Cruise_line|               Age|           Tonnage|       passengers|           length|            cabins|passenger_density|             crew|
+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|  count|      158|        158|               158|               158|              158|              158|               158|              158|              158|
|   mean| Infinity|       NULL|15.689873417721518| 71.28467088607599|18.45740506329114|8.130632911392404| 8.830000000000005|39.90094936708861|7.794177215189873|
| stddev|     NULL|       NULL| 7.615691058751413|37.229540025907866|9.677094775143416|1.793473548054825|4.4714172221480615| 8.63921711391542|3.503486564627034|
|    min|Adventure|    Azamara|   

In [48]:
# Print each field of the first row, one value per line
for value in data.head(1)[0]:
    print(value)

Journey
Azamara
6
30.276999999999997
6.94
5.94
3.55
42.64
3.55


## Linear Regression Model Parameters, Pipeline and Cross Validation

In [49]:
# change string category to numeric input for feature vector
string_indexer = StringIndexer(inputCol='Cruise_line', outputCol='Cruise_line_indexed', handleInvalid='keep')
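
As a rough mental model of what `StringIndexer` does here: by default it orders labels by descending frequency (ties broken alphabetically) and gives the most frequent label index 0.0. With `handleInvalid='keep'`, labels not seen during fitting are mapped to one extra index equal to the number of known labels. The plain-Python sketch below mimics that behaviour. The helper names `fit_string_index` and `transform_keep` are hypothetical, and the cruise line values other than Azamara are illustrative placeholders, not rows from the dataset.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_keep(index, labels):
    """Encode labels; unseen labels share one extra index (the 'keep' behaviour)."""
    unseen_index = float(len(index))
    return [index.get(label, unseen_index) for label in labels]

# Illustrative placeholder values (assumption, not taken from the CSV)
train_lines = ["Line_B", "Line_C", "Line_B", "Azamara", "Line_B", "Line_C"]
index = fit_string_index(train_lines)
encoded = transform_keep(index, ["Azamara", "Line_B", "Unknown_Line"])
print(index)
print(encoded)
```

One caveat: the index is passed to the regression as an ordinary number, so the model treats cruise lines as if they were ordered. One-hot encoding the index (for example with `OneHotEncoder`) is a common alternative that this notebook does not use.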

In [50]:
# Set up feature vector
features = ['Cruise_line_indexed', 'Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density']
assembler = VectorAssembler(inputCols=features, outputCol='features')

In [51]:
# Create base estimator object
lr = LinearRegression(featuresCol="features", labelCol='crew')

### Create pipeline object
- A pipeline is a sequence of stages, where each stage is either a Transformer (an algorithm that converts one DataFrame into another) or an Estimator (an algorithm that must be fitted on data, producing a Transformer).

**Advantages**
1. Organization: Keeps all preprocessing & modeling steps together
2. Consistency: Ensures same steps are applied to training and test data
3. Convenience: One call to fit() handles everything
4. Reproducibility: Pipeline can be saved and reused on new data
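
The Transformer/Estimator distinction and the consistency advantage can be illustrated with a toy pipeline in plain Python. Every class below is a hypothetical stand-in written for this sketch, not part of `pyspark.ml`. The point is that anything learned during `fit` (here, a mean) comes only from the training data and is then reused unchanged on the test data.

```python
class Doubler:
    """Transformer: needs no fitting, just maps values to new values."""
    def transform(self, values):
        return [2 * v for v in values]

class MeanCenterer:
    """Estimator: learns the mean from data and returns a fitted transformer."""
    def fit(self, values):
        return MeanCenterModel(sum(values) / len(values))

class MeanCenterModel:
    """Transformer produced by fitting MeanCenterer."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        return [v - self.mean for v in values]

class ToyPipeline:
    """Runs stages in order, fitting estimators on the output of earlier stages."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        current = values
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(current)  # estimator becomes a transformer
            current = stage.transform(current)
            fitted.append(stage)
        return ToyPipelineModel(fitted)

class ToyPipelineModel:
    """Fitted pipeline: applies the already-fitted stages to any data."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values

train = [1.0, 2.0, 3.0]
test = [4.0]
model = ToyPipeline([Doubler(), MeanCenterer()]).fit(train)
train_out = model.transform(train)
test_out = model.transform(test)  # uses the mean learned from train
print(train_out, test_out)
```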

In [52]:
# Create the pipeline - sequence of steps to perform
pipeline = Pipeline(stages=[string_indexer, assembler, lr])

### Cross validation model with hyperparameter tuning
* maxIter - maximum number of optimization iterations. Each iteration tries to improve the model's fit; if the solver converges before reaching the maximum, it stops early. Default is 100.
* regParam - controls the strength of regularization. Default is 0.0 (no regularization). Any value >= 0 is valid; there is no upper limit of 1.0.
* elasticNetParam - mix of Lasso (L1) and Ridge (L2) penalties, in the range [0, 1]. Default is 0.0, which is pure L2; 1.0 is pure L1.
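
To see how the two regularization parameters interact, here is a minimal sketch of the penalty term that Spark's documentation gives for `LinearRegression`: `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. The function name and the example weights are made up for illustration.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the squared-error loss for a given weight vector."""
    l1 = sum(abs(w) for w in weights)       # Lasso part
    l2_squared = sum(w * w for w in weights)  # Ridge part
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )

weights = [3.0, -4.0]  # hypothetical coefficients
ridge_only = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)
lasso_only = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)
mixed = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.5)
no_penalty = elastic_net_penalty(weights, reg_param=0.0, elastic_net_param=0.5)
print(ridge_only, lasso_only, mixed, no_penalty)
```

Note that when `regParam` is 0.0 the penalty vanishes regardless of `elasticNetParam`, so the mix only matters once some regularization is switched on.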

In [53]:
param_grid = ParamGridBuilder()
param_grid = param_grid.addGrid(lr.regParam, [0.0, 0.1, 0.3])
param_grid = param_grid.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
param_grid = param_grid.build()

### Create evaluator - scores each cross-validation model's predictions so the best model can be selected

In [76]:
rmse_evaluator = RegressionEvaluator(labelCol='crew', predictionCol='prediction', metricName='rmse')
r2_evaluator = RegressionEvaluator(labelCol='crew', predictionCol='prediction', metricName='r2')
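
For reference, the two metrics requested above can be computed by hand. The sketch below uses the standard definitions (RMSE as the square root of the mean squared error, R^2 as one minus the ratio of residual to total sum of squares) on a few made-up numbers; the function names and values are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """Share of the variance in y_true that the predictions account for."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up labels and predictions (assumption, not from the dataset)
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 9.0]
toy_rmse = rmse(y_true, y_hat)
toy_r2 = r2(y_true, y_hat)
print(toy_rmse, toy_r2)
```

RMSE is in the same units as the label (hundreds of crew here), while R^2 is unitless, which is why the two are useful together.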

### Create cross validator

In [77]:
# Split data
train_data, test_data = data.randomSplit([.8, .2], seed=42)

# Cross validation object
cross_val = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=rmse_evaluator,
    numFolds=3,
    seed=42
)

# Fit model
cv_model = cross_val.fit(train_data)

# Get the best model
best_model = cv_model.bestModel
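
It can help to count how much work this cross-validation does. The sketch below enumerates the grid in plain Python, using the same values as the parameter grid and `numFolds` above. The variable names are chosen for illustration.

```python
from itertools import product

reg_params = [0.0, 0.1, 0.3]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 3

# Every (regParam, elasticNetParam) pair is one candidate setting
combos = list(product(reg_params, elastic_net_params))

# Each candidate is fitted once per fold, then the best one is refitted
# on the full training set to produce bestModel
cv_fits = len(combos) * num_folds
total_fits = cv_fits + 1
print(len(combos), cv_fits, total_fits)
```

Because `elasticNetParam` has no effect when `regParam` is 0.0, three of the nine candidates describe the same unregularized model.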

### Evaluate best model

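In [78]:
# Mean crew over the full dataset, used as a yardstick for the RMSE.
# .first()[0] returns the value; .show() only prints and returns None.
mean_crew = data.agg({'crew': 'mean'}).first()[0]

y_pred = best_model.transform(test_data)
rmse = rmse_evaluator.evaluate(y_pred)
print(f'RMSE = {rmse}. The mean crew is {mean_crew}')

RMSE = 0.6993867393223809. The mean crew is 7.794177215189873


In [79]:
data.agg({'crew': 'mean'}).show()

+-----------------+
|        avg(crew)|
+-----------------+
|7.794177215189873|
+-----------------+



Crew is recorded in hundreds, so an RMSE of about 0.70 corresponds to a typical prediction error of roughly 70 crew members, against a mean crew of about 7.79 (roughly 779 crew members). That is about 9% of the mean.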

In [80]:
r2 = r2_evaluator.evaluate(y_pred)
print(f"R^2 says model explains {r2} of variance in the data!")

R^2 says model explains 0.9517658022255572 of variance in the data!
