# Consulting Project: Hyundai Heavy Industries

## Case

**Building a predictive model of crew size for ships built by Hyundai Heavy Industries.**

Hyundai Heavy Industries is selling ships to some new customers and wants us to build a model that accurately estimates how many crew members each ship will need.

**Note**: Acceptable crew counts differ from one **Cruise Line** to another, so the cruise line is most likely an important feature for the analysis.

In [5]:
from pyspark.sql import SparkSession

In [7]:
spark = SparkSession.builder.appName('lr_project').getOrCreate()

## Load the Data

In [9]:
df = spark.read.csv('datasets/cruise_ship_info.csv', header=True, inferSchema=True)

In [36]:
df.show(5)
df.printSchema()
df.describe().show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
only showing top 5 rows

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: int

### Deal with Categorical Variable

In [13]:
df.groupBy('Cruise_line').count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



In [14]:
from pyspark.ml.feature import StringIndexer

In [16]:
indexer = StringIndexer(inputCol='Cruise_line', outputCol='cruise_cat')
indexed = indexer.fit(df).transform(df)

In [20]:
indexed.select(['Cruise_line', 'cruise_cat']).show()

+-----------+----------+
|Cruise_line|cruise_cat|
+-----------+----------+
|    Azamara|      16.0|
|    Azamara|      16.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
|   Carnival|       1.0|
+-----------+----------+
only showing top 20 rows
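
The indices above follow from how `StringIndexer` orders labels by default: most frequent label first, which is why Carnival (22 ships) maps to 1.0 and Azamara (2 ships) to 16.0. Below is a minimal plain-Python sketch of that ordering, using the counts from the `groupBy` output. The alphabetical tie-break is assumed from the Spark 3.x documentation, and the helper name `frequency_desc_index` is hypothetical.

```python
# Ship counts per cruise line, copied from the groupBy output above.
counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}


def frequency_desc_index(label_counts):
    """Map each label to a float index, most frequent label first.

    Ties are broken alphabetically (assumed Spark 3.x behaviour).
    """
    ordered = sorted(label_counts, key=lambda label: (-label_counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


cruise_cat = frequency_desc_index(counts)
print(cruise_cat["Royal_Caribbean"], cruise_cat["Carnival"], cruise_cat["Azamara"])
```

Because the index is only a frequency rank, passing it to a linear model as a single numeric column treats the cruise lines as if they were ordered. One-hot encoding the index is a common alternative worth considering.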



## Set the Features

In [21]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [22]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'cruise_cat']

In [25]:
assembler = VectorAssembler(
    inputCols=['Age', 'Tonnage', 'passengers', 'length', 'cabins',
               'passenger_density', 'cruise_cat'],
    outputCol='features')

In [27]:
dataReady = assembler.transform(indexed)

In [30]:
dataReady.select('features', 'crew').show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows
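
Each `features` value is the input columns concatenated, in `inputCols` order, into one numeric vector. Here is a plain-Python sketch for the first row of the data. The `assemble` helper is hypothetical, and Spark stores the result as a vector type rather than a Python list.

```python
INPUT_COLS = ["Age", "Tonnage", "passengers", "length", "cabins",
              "passenger_density", "cruise_cat"]


def assemble(row, input_cols=INPUT_COLS):
    """Concatenate the chosen columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]


# First row of the dataset, with cruise_cat taken from the indexer output.
journey = {
    "Ship_name": "Journey", "Cruise_line": "Azamara", "Age": 6,
    "Tonnage": 30.276999999999997, "passengers": 6.94, "length": 5.94,
    "cabins": 3.55, "passenger_density": 42.64, "crew": 3.55,
    "cruise_cat": 16.0,
}

features = assemble(journey)
print(features)
```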



In [31]:
final_data = dataReady.select('features', 'crew')

In [32]:
final_data.show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



## Train/Test Split

In [33]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])

In [34]:
train_data.describe().show()
test_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               113|
|   mean| 7.619734513274346|
| stddev|3.1148912100290223|
|    min|              0.59|
|    max|              13.6|
+-------+------------------+

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|               45|
|   mean|8.232222222222221|
| stddev|4.338844048449733|
|    min|              0.6|
|    max|             21.0|
+-------+-----------------+
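
The split is random, so the row counts (113 and 45 here, not exactly 70/30 of 158) and every downstream metric change between runs unless a seed is fixed. `randomSplit` accepts an optional `seed` argument for this. The sketch below illustrates the idea in plain Python with an independent draw per row. It shows why the sizes are only approximate and why a seed makes the split repeatable; it is not Spark's actual sampling implementation, and `random_split` and the seed value 42 are hypothetical.

```python
import random


def random_split(rows, train_weight=0.7, seed=None):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(158))  # stand-in for the 158 ships in the dataset

train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

# Same seed gives the same split; the sizes only approximate 70/30.
print(len(train_a), len(test_a), train_a == train_b)
```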



## Train the Model

In [37]:
from pyspark.ml.regression import LinearRegression

In [38]:
lr = LinearRegression(featuresCol='features', labelCol='crew', predictionCol='prediction')

In [46]:
lrModel = lr.fit(train_data)

## Evaluate the Model

In [47]:
ship_results = lrModel.evaluate(test_data)

In [48]:
ship_results.rootMeanSquaredError

1.3471486214738875

In [50]:
ship_results.r2

0.9014077498974512
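
For reference, the two metrics reported above can be written out directly. The sketch below is a plain-Python version of the standard definitions: RMSE as the square root of the mean squared residual, and R² as one minus the ratio of the residual sum of squares to the total sum of squares. The function names and the toy numbers are illustrative assumptions, not values from the ship data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]

print(rmse(actual, predicted))
print(r2(actual, predicted))
```

On the test split shown earlier, an RMSE of about 1.35 compares with a mean crew value of about 8.23 and a standard deviation of about 4.34, so the typical error is well under one standard deviation of the target.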

In [51]:
from pyspark.sql.functions import corr

In [54]:
df.select(corr('crew', 'passengers')).show()
df.select(corr('crew', 'cabins')).show()


+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+
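
Both correlations are above 0.9, which suggests the high R² is plausible rather than a sign of a mistake: crew size tracks ship capacity closely. `corr` computes the Pearson correlation coefficient. A minimal plain-Python sketch of that formula follows, with a hypothetical helper name and toy inputs rather than the ship data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: y is an exact linear function of x, so the coefficient is 1
# up to floating-point rounding.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```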

