# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
The data is saved for you in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

## Load Spark and Data

In [1]:
# start spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lr_project').getOrCreate()

In [2]:
# read in the input csv file.
data = spark.read.csv('cruise_ship_info.csv', inferSchema=True, header=True)

In [3]:
data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [4]:
for r in data.head(1)[0]:
    print(r)

Journey
Azamara
6
30.276999999999997
6.94
5.94
3.55
42.64
3.55


## Clean and format data

In [5]:
# format the columns
cleaned_data = data.select([
    data['Cruise_line'].alias('line'),
    data['Age'].alias('age'),
    (data['Tonnage']*1000).cast('integer').alias('tonnage'),
    (data['passengers']*100).cast('integer').alias('passengers'),
    (data['length']*100).cast('integer').alias('length'),
    (data['cabins']*100).cast('integer').alias('cabins'),
    data['passenger_density'],
    (data['crew']*100).cast('integer').alias('crew')
])
cleaned_data.show()

+--------+---+-------+----------+------+------+-----------------+----+
|    line|age|tonnage|passengers|length|cabins|passenger_density|crew|
+--------+---+-------+----------+------+------+-----------------+----+
| Azamara|  6|  30276|       694|   594|   355|            42.64| 355|
| Azamara|  6|  30276|       694|   594|   355|            42.64| 355|
|Carnival| 26|  47262|      1486|   722|   743|             31.8| 670|
|Carnival| 11| 110000|      2974|   952|  1488|            36.99|1910|
|Carnival| 17| 101353|      2642|   892|  1321|            38.36|1000|
|Carnival| 22|  70367|      2052|   855|  1019|            34.29| 919|
|Carnival| 15|  70367|      2052|   855|  1019|            34.29| 919|
|Carnival| 23|  70367|      2056|   855|  1022|            34.23| 919|
|Carnival| 19|  70367|      2052|   855|  1019|            34.29| 919|
|Carnival|  6| 110238|      3700|   951|  1487|            29.79|1150|
|Carnival| 10| 110000|      2974|   951|  1487|            36.99|1160|
|Carni
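One detail of the conversion above is worth a quick check: `cast('integer')` drops the fractional part instead of rounding, which is why the first ship's tonnage of 30.276999999999997 shows up as 30276 in the table. The sketch below reproduces that arithmetic in plain Python (no Spark needed). The helper names `scale_and_truncate` and `scale_and_round` are made up for this illustration, and rounding is only one possible alternative, not something the notebook does.

```python
import math

def scale_and_truncate(value: float, factor: int) -> int:
    """Mimic (col * factor).cast('integer'): multiply, then drop the fraction."""
    return math.trunc(value * factor)

def scale_and_round(value: float, factor: int) -> int:
    """Hypothetical alternative: round to the nearest unit before converting."""
    return int(round(value * factor))

# Tonnage of the first row, exactly as printed by data.head(1)
raw_tonnage = 30.276999999999997

truncated = scale_and_truncate(raw_tonnage, 1000)
rounded = scale_and_round(raw_tonnage, 1000)
print("truncated:", truncated, "rounded:", rounded)
```

For this dataset the difference is at most one ton per ship, so it is unlikely to matter for the regression, but it is worth knowing when the scaled values need to match the source exactly.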

In [6]:
cleaned_data.printSchema()

root
 |-- line: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- tonnage: integer (nullable = true)
 |-- passengers: integer (nullable = true)
 |-- length: integer (nullable = true)
 |-- cabins: integer (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: integer (nullable = true)



In [7]:
# list number of NANs or NULLs in each column
from pyspark.sql.functions import count, when, isnan, col
cleaned_data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in cleaned_data.columns]).show()

+----+---+-------+----------+------+------+-----------------+----+
|line|age|tonnage|passengers|length|cabins|passenger_density|crew|
+----+---+-------+----------+------+------+-----------------+----+
|   0|  0|      0|         0|     0|     0|                0|   0|
+----+---+-------+----------+------+------+-----------------+----+



Nothing to clean, no nulls. Data looks good.

## Prepare data for MLlib

Let's first convert the 'line' string to a numeric index so the vector assembler can include it in the features vector.

In [24]:
# convert the 'line' string to a numeric index (StringIndexer outputs a double)
from pyspark.ml.feature import StringIndexer
string_indexer = StringIndexer(inputCol="line", outputCol="line_index")
prepared_data = string_indexer.fit(cleaned_data).transform(cleaned_data)
prepared_data = prepared_data.select([prepared_data['line_index'].alias('line'),'age','tonnage','passengers','length','cabins','passenger_density','crew'])
prepared_data.show()

+----+---+-------+----------+------+------+-----------------+----+
|line|age|tonnage|passengers|length|cabins|passenger_density|crew|
+----+---+-------+----------+------+------+-----------------+----+
|16.0|  6|  30276|       694|   594|   355|            42.64| 355|
|16.0|  6|  30276|       694|   594|   355|            42.64| 355|
| 1.0| 26|  47262|      1486|   722|   743|             31.8| 670|
| 1.0| 11| 110000|      2974|   952|  1488|            36.99|1910|
| 1.0| 17| 101353|      2642|   892|  1321|            38.36|1000|
| 1.0| 22|  70367|      2052|   855|  1019|            34.29| 919|
| 1.0| 15|  70367|      2052|   855|  1019|            34.29| 919|
| 1.0| 23|  70367|      2056|   855|  1022|            34.23| 919|
| 1.0| 19|  70367|      2052|   855|  1019|            34.29| 919|
| 1.0|  6| 110238|      3700|   951|  1487|            29.79|1150|
| 1.0| 10| 110000|      2974|   951|  1487|            36.99|1160|
| 1.0| 28|  46052|      1452|   727|   726|            31.72| 
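To make the indices above easier to read: by default `StringIndexer` orders labels by descending frequency, so the most common cruise line gets index 0.0. Below is a minimal plain-Python sketch of that idea. The function `frequency_index`, the toy list, and the alphabetical tie-break are assumptions made for illustration, not the library's implementation.

```python
from collections import Counter

def frequency_index(labels):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically here; that tie-break is an assumption
    of this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical toy column, not the real dataset
lines = ["Carnival", "Carnival", "Carnival", "Azamara",
         "Hypothetical_Line", "Hypothetical_Line",
         "Hypothetical_Line", "Hypothetical_Line"]

mapping = frequency_index(lines)
print(mapping)
```

Note that a linear model reads this index as an ordinary number, so index 16.0 is treated as "sixteen times" index 1.0 even though the ordering only reflects how often each line appears. One-hot encoding the index (for example with `OneHotEncoder` from `pyspark.ml.feature`) is a common way to avoid that, at the cost of more coefficients.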

Now use the vector assembler to create the features vector that will be fed to the MLlib model.

In [25]:
# import helpers for vector assembling
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [26]:
prepared_data.printSchema()

root
 |-- line: double (nullable = true)
 |-- age: integer (nullable = true)
 |-- tonnage: integer (nullable = true)
 |-- passengers: integer (nullable = true)
 |-- length: integer (nullable = true)
 |-- cabins: integer (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: integer (nullable = true)



In [27]:
# create and configure the vector assembler with the desired 'features' columns
assembler = VectorAssembler(
    inputCols=['line','age','tonnage','passengers','length','cabins','passenger_density'],
    outputCol='features')

# assemble the 'features'
final_data = assembler.transform(prepared_data).select('features', 'crew')
final_data.show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[16.0,6.0,30276.0...| 355|
|[16.0,6.0,30276.0...| 355|
|[1.0,26.0,47262.0...| 670|
|[1.0,11.0,110000....|1910|
|[1.0,17.0,101353....|1000|
|[1.0,22.0,70367.0...| 919|
|[1.0,15.0,70367.0...| 919|
|[1.0,23.0,70367.0...| 919|
|[1.0,19.0,70367.0...| 919|
|[1.0,6.0,110238.0...|1150|
|[1.0,10.0,110000....|1160|
|[1.0,28.0,46052.0...| 660|
|[1.0,18.0,70367.0...| 919|
|[1.0,17.0,70367.0...| 919|
|[1.0,11.0,86000.0...| 930|
|[1.0,8.0,110000.0...|1160|
|[1.0,9.0,88500.0,...|1030|
|[1.0,15.0,70367.0...| 919|
|[1.0,12.0,88500.0...| 930|
|[1.0,20.0,70367.0...| 919|
+--------------------+----+
only showing top 20 rows



In [72]:
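# compute L2-normalized features in 'features_norm'
# note: the select below keeps the raw 'features' column and drops 'features_norm',
# so the model in the following cells is trained on unnormalized features
from pyspark.ml.feature import Normalizer
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=2.0)
final_data_norm = normalizer.transform(final_data).select('features', 'crew')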
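For context, `Normalizer` with `p=2.0` rescales each row so that the feature vector has unit Euclidean length; it does not scale each column separately. The plain-Python sketch below applies that row-wise rescaling to the first assembled row. `l2_normalize` is a hypothetical helper written for this illustration.

```python
import math

def l2_normalize(row):
    """Scale one feature vector so its Euclidean (L2) norm equals 1."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)

# First assembled row: line, age, tonnage, passengers, length, cabins, passenger_density
features = [16.0, 6.0, 30276.0, 694.0, 594.0, 355.0, 42.64]

unit = l2_normalize(features)
print([round(x, 4) for x in unit])
```

Because tonnage is by far the largest entry, it dominates the normalized vector and the other features shrink toward zero. If the goal is to put columns on a comparable scale, a column-wise scaler such as `StandardScaler` from `pyspark.ml.feature` is probably closer to what is wanted.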

In [73]:
# split into training and testing sets
train_data, test_data = final_data_norm.randomSplit([0.7,0.3])

In [74]:
train_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               119|
|   mean| 765.0840336134454|
| stddev|359.19162016554816|
|    min|                59|
|    max|              2100|
+-------+------------------+



In [75]:
test_data.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|               39|
|   mean|822.7948717948718|
| stddev|322.2279496753142|
|    min|               60|
|    max|             1220|
+-------+-----------------+
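The 119/39 split above is close to, but not exactly, 70/30 of 158 rows. That is expected, because `randomSplit` assigns each row independently at random, so the weights are targets rather than guarantees. Below is a rough plain-Python sketch of that behavior. `random_split` and the seed value 42 are assumptions for illustration, and Python's generator will not reproduce Spark's actual assignment.

```python
import random

def random_split(rows, train_weight=0.7, seed=42):
    """Assign each row to train or test by an independent uniform draw."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(158))  # stand-ins for the 158 ships
train, test = random_split(rows)
print(len(train), len(test))
```

In the notebook itself, passing a seed, as in `final_data_norm.randomSplit([0.7, 0.3], seed=42)`, should make the split and all the numbers that follow reproducible between runs.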



## Create the Linear Regression Model

In [76]:
# load linear regression lib
from pyspark.ml.regression import LinearRegression

In [77]:
# create linear regression model object
lr = LinearRegression(featuresCol='features', labelCol='crew', predictionCol='crew_pred')

In [78]:
# fit the model with the training data
lr_model = lr.fit(train_data)

In [79]:
# print the coefficients and intercept for linear regression
# Coefficients => 'line','age','tonnage','passengers','length','cabins','passenger_density'
print("Coefficients: {}\nIntercept: {}".format(lr_model.coefficients,lr_model.intercept))

Coefficients: [5.711237343146823,-1.7045758708090215,0.0008952993351538641,-0.17374051831234982,0.44556166824166077,0.898309688065081,-0.5590078925005209]
Intercept: -91.83804833736346
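To see what these numbers mean, a prediction is the intercept plus the dot product of the coefficients with a feature vector. The sketch below applies the printed coefficients by hand to the first ship in the dataset (Azamara Journey, actual crew 355). `predict_crew` is a hypothetical helper, and this ship may have been part of the training set, so treat it as a sanity check rather than an evaluation.

```python
# Coefficients and intercept copied from the output above, in feature order:
# line, age, tonnage, passengers, length, cabins, passenger_density
COEFFICIENTS = [5.711237343146823, -1.7045758708090215, 0.0008952993351538641,
                -0.17374051831234982, 0.44556166824166077, 0.898309688065081,
                -0.5590078925005209]
INTERCEPT = -91.83804833736346

def predict_crew(features):
    """Linear model: intercept + sum of coefficient * feature value."""
    return INTERCEPT + sum(w * x for w, x in zip(COEFFICIENTS, features))

# Feature values of the first row of prepared_data
first_ship = [16.0, 6.0, 30276.0, 694.0, 594.0, 355.0, 42.64]

prediction = predict_crew(first_ship)
print(round(prediction, 1))
```

The raw coefficient sizes are not directly comparable, because the features are on very different scales: tonnage is in tons, while age is in years.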


## Evaluate the Linear Regression Model

In [80]:
# evaluate the model with the test_data
test_results = lr_model.evaluate(test_data)

In [81]:
# print some statistical evaluation indicators of the model
print('MAE:  {}\nMSE:  {}\nRMSE: {}\nR2:   {}'.format(test_results.meanAbsoluteError, 
                                                       test_results.meanSquaredError,
                                                       test_results.rootMeanSquaredError,
                                                       test_results.r2))

MAE:  58.027712993596346
MSE:  4645.7213949757215
RMSE: 68.159529010812
R2:   0.9540793786250188
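As a reminder of what these four indicators measure, here is a small plain-Python sketch using their standard definitions. The function `regression_metrics` and the toy values are hypothetical; they are not taken from the test set.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE and R2 from paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical crew counts and predictions, for illustration only
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE is the square root of MSE, which matches the output above: the square root of 4645.72 is about 68.16.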


In [82]:
# analyze indicators against original data
final_data.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              158|
|   mean|779.3291139240506|
| stddev|350.3192255413956|
|    min|               59|
|    max|             2100|
+-------+-----------------+



* RMSE of 68.16 against a mean crew value of 779 (roughly 9% of the mean), not bad
* An R-squared of 0.95 indicates that this model explains about 95% of the variance in the test data, good
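One plausible reason the model does so well is that crew size is strongly correlated with the ship-size features. The sketch below computes a Pearson correlation in plain Python between `cabins` and `crew`, using only the 11 complete rows visible in the `cleaned_data` output earlier. That is a small sample, almost entirely Carnival ships, so it is a hint rather than a conclusion. `pearson` is a helper written for this illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# cabins and crew from the 11 complete rows shown in the cleaned_data output
cabins = [355, 355, 743, 1488, 1321, 1019, 1019, 1022, 1019, 1487, 1487]
crew = [355, 355, 670, 1910, 1000, 919, 919, 919, 919, 1150, 1160]

r = pearson(cabins, crew)
print(round(r, 3))
```

On the full dataset the same check can be run in Spark, for example with `corr` from `pyspark.sql.functions` applied to `cleaned_data`.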

## Apply model to unlabeled data
Here we treat the test data as if it were unlabeled.

In [83]:
# create a set of unlabeled data using the test_data
unlabeled_data = test_data.select('features')

In [84]:
# apply the model and predict 'crew'
crew_predictions = lr_model.transform(unlabeled_data)

In [85]:
crew_predictions.show()

+--------------------+------------------+
|            features|         crew_pred|
+--------------------+------------------+
|[0.0,12.0,90090.0...| 885.1139933784553|
|[0.0,12.0,138000....|1298.1506698994644|
|[0.0,18.0,70000.0...| 800.4584751713626|
|[0.0,25.0,73192.0...| 835.5948961533312|
|[1.0,6.0,110238.0...|1102.3646316669308|
|[1.0,8.0,110000.0...|1220.8531581523084|
|[1.0,9.0,110000.0...|1219.5941439497412|
|[1.0,11.0,86000.0...| 963.5402820291749|
|[1.0,12.0,88500.0...| 1053.245293989608|
|[1.0,14.0,101509....|1065.6903046719608|
|[1.0,20.0,70367.0...| 863.4300741805278|
|[1.0,22.0,70367.0...| 860.0209224389097|
|[2.0,6.0,113000.0...|1159.1333307263908|
|[2.0,7.0,116000.0...|1274.3907668406118|
|[2.0,9.0,116000.0...|1122.9614275964393|
|[2.0,11.0,108977....|1115.3470989009272|
|[2.0,15.0,108806....|1107.8416408167145|
|[2.0,16.0,77499.0...| 925.3129917172189|
|[2.0,29.0,44348.0...| 555.6462180224939|
|[3.0,11.0,85000.0...| 889.6262299749642|
+--------------------+------------