# Hyundai Consulting Project

Hyundai has a cruise ship plant in South Korea and needs a model to predict how many crew members fit on each ship. They want to disclose an accurate crew count for each ship at the time of negotiation. Our job is to build that model.

In [15]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('consulting_project').getOrCreate()

In [16]:
from pyspark.ml.regression import LinearRegression

In [17]:
data = spark.read.csv('./cruise_ship_info.csv', header=True, inferSchema=True)

In [18]:
data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



- Age - How many years old as of 2013
- Tonnage - 1,000s of tons
- passengers - 100s
- length - 100s feet
- cabins - 100s
- passenger_density
- crew - 100s

In [19]:
data.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [23]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="Cruise_line", outputCol="cruise_index")
indexed_data = indexer.fit(data).transform(data)
indexed_data.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|cruise_index|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|        16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|        16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|         1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|         1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|         1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|         1.0|
|    Elati
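
The indices in the output above follow from how `StringIndexer` orders labels: by default the most frequent label gets index 0.0, so `Carnival` (1.0) occurs more often in the data than `Azamara` (16.0). Here is a minimal plain-Python sketch of that ordering, using a made-up list of cruise lines rather than the real file. `index_labels` is a hypothetical helper, and the alphabetical tie-break is an assumption.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample, not the real contents of cruise_ship_info.csv.
sample = ["Carnival", "Azamara", "Carnival", "Royal_Caribbean",
          "Royal_Caribbean", "Royal_Caribbean", "Carnival", "Royal_Caribbean"]
mapping = index_labels(sample)
print(mapping)
```

Because the index is only a frequency rank, a linear model that consumes `cruise_index` directly treats it as a numeric quantity. One-hot encoding the index is a common alternative worth trying.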

In [24]:
indexed_data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- cruise_index: double (nullable = true)



- Vector
    - Age
    - Tonnage
    - passengers
    - length
    - cabins
    - passenger_density
    - cruise_index
- Predict
    - crew

In [25]:
# head() returns the first Row; iterating over it yields that row's field values
for value in indexed_data.head():
    print(value)

Journey
Azamara
6
30.276999999999997
6.94
5.94
3.55
42.64
3.55
16.0


Import `Vectors` and `VectorAssembler` to combine multiple feature columns into a single vector column.

In [26]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [27]:
indexed_data.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'cruise_index']

`inputCols` is the list of column names to use as inputs. `outputCol` is the name of the resulting vector column.

In [28]:
assembler = VectorAssembler(inputCols=['Age',
                                       'Tonnage',
                                       'passengers',
                                       'length',
                                       'cabins',
                                       'passenger_density',
                                       'cruise_index'],
                            outputCol='features')

When applied, `assembler` combines the values of all 7 columns ('Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density', 'cruise_index') into a single vector per row and stores it in a new column called `features`.
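
As a rough plain-Python picture of what the assembler does to one row (this is not Spark code; `assemble` is a hypothetical helper, and the values are those of the first row shown earlier):

```python
INPUT_COLS = ['Age', 'Tonnage', 'passengers', 'length',
              'cabins', 'passenger_density', 'cruise_index']

def assemble(row, input_cols=INPUT_COLS):
    """Return the selected column values, in order, as a list of floats."""
    return [float(row[col]) for col in input_cols]

# First row of the indexed data, written as a plain dict.
first_row = {'Ship_name': 'Journey', 'Cruise_line': 'Azamara', 'Age': 6,
             'Tonnage': 30.276999999999997, 'passengers': 6.94, 'length': 5.94,
             'cabins': 3.55, 'passenger_density': 42.64, 'crew': 3.55,
             'cruise_index': 16.0}
features = assemble(first_row)
print(features)
```

Columns left out of `INPUT_COLS`, such as `Ship_name` and the label `crew`, do not end up in the vector.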

In [29]:
output = assembler.transform(indexed_data) 

In [30]:
output.printSchema()  # the features column has been added

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- cruise_index: double (nullable = true)
 |-- features: vector (nullable = true)



In [31]:
output.select('features').show()

+--------------------+
|            features|
+--------------------+
|[6.0,30.276999999...|
|[6.0,30.276999999...|
|[26.0,47.262,14.8...|
|[11.0,110.0,29.74...|
|[17.0,101.353,26....|
|[22.0,70.367,20.5...|
|[15.0,70.367,20.5...|
|[23.0,70.367,20.5...|
|[19.0,70.367,20.5...|
|[6.0,110.23899999...|
|[10.0,110.0,29.74...|
|[28.0,46.052,14.5...|
|[18.0,70.367,20.5...|
|[17.0,70.367,20.5...|
|[11.0,86.0,21.24,...|
|[8.0,110.0,29.74,...|
|[9.0,88.5,21.24,9...|
|[15.0,70.367,20.5...|
|[12.0,88.5,21.24,...|
|[20.0,70.367,20.5...|
+--------------------+
only showing top 20 rows



The `features` column holds a `DenseVector` containing the value of each feature in the row.

In [32]:
output.head(1)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, cruise_index=16.0, features=DenseVector([6.0, 30.277, 6.94, 5.94, 3.55, 42.64, 16.0]))]

Selecting the columns to build the model

In [33]:
final_data = output.select('features', 'crew')

In [34]:
final_data.show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



Splitting the data into training and test sets

In [35]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])

In [36]:
train_data.describe().show() # 70% split

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              112|
|   mean|7.769821428571439|
| stddev|3.695671191192548|
|    min|             0.59|
|    max|             21.0|
+-------+-----------------+



In [37]:
test_data.describe().show() # 30% split

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|                46|
|   mean| 7.853478260869564|
| stddev|3.0214751492101812|
|    min|              0.59|
|    max|              13.6|
+-------+------------------+
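
The counts above (112 and 46 out of 158) are close to a 70/30 split but not exact. Since no `seed` is passed to `randomSplit`, rerunning the notebook can also put different rows in each set. The plain-Python sketch below illustrates the general idea of a weighted random split with a fixed seed. It shows the concept only, not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, train_weight=0.7, seed=42):
    """Assign each row to train or test independently with the given probability."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(158))  # stand-in for the 158 ships
train, test = random_split(rows)
print(len(train), len(test))  # close to, but usually not exactly, 70/30
```

In PySpark itself, `randomSplit` accepts an optional `seed` argument for the same purpose.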



Creating the linear regression model object. `labelCol` is the column we want to predict. `featuresCol` is the column holding the feature vectors.

In [38]:
lr = LinearRegression(labelCol='crew', featuresCol='features')

Training the model on `train_data`

In [39]:
lr_model = lr.fit(train_data)

Testing the `lr_model` against `test_data`

In [40]:
test_results = lr_model.evaluate(test_data)

In [41]:
test_results.residuals.show()

+--------------------+
|           residuals|
+--------------------+
| -1.4559963749872011|
| -1.1696945126371983|
|  0.3098414873863593|
|  -0.434912376881849|
| -0.2690256466913201|
|-0.19288214849169982|
| -0.6045095459710073|
| -0.6484054597404239|
|  1.7526932123094383|
|-0.35251081719456323|
| -0.5230848035854709|
|  1.4737579460232588|
| 0.11589268541076159|
|-0.43155933627867604|
| -0.1891705691470431|
|  0.9264043705468996|
|   0.054135411372493|
|   0.832061486280459|
|   0.832061486280459|
|  -1.380806292466029|
+--------------------+
only showing top 20 rows



In [42]:
test_results.predictions.show()

+--------------------+-----+------------------+
|            features| crew|        prediction|
+--------------------+-----+------------------+
|[5.0,160.0,36.34,...| 13.6|  15.0559963749872|
|[6.0,90.0,20.0,9....|  9.0|10.169694512637198|
|[6.0,110.23899999...| 11.5| 11.19015851261364|
|[6.0,112.0,38.0,9...| 10.9| 11.33491237688185|
|[6.0,158.0,43.7,1...| 13.6| 13.86902564669132|
|[7.0,158.0,43.7,1...| 13.6|  13.7928821484917|
|[8.0,110.0,29.74,...| 11.6|12.204509545971007|
|[9.0,90.09,25.01,...| 8.69| 9.338405459740423|
|[10.0,46.0,7.0,6....| 4.47|2.7173067876905614|
|[10.0,58.825,15.6...|  7.0| 7.352510817194563|
|[10.0,86.0,21.14,...|  9.2|  9.72308480358547|
|[10.0,151.4,26.2,...|12.53| 11.05624205397674|
|[11.0,58.6,15.66,...|  7.6| 7.484107314589238|
|[11.0,90.09,25.01...| 8.48| 8.911559336278676|
|[11.0,91.62700000...|  9.0| 9.189170569147043|
|[11.0,108.977,26....| 12.0|  11.0735956294531|
|[12.0,42.0,14.8,7...|  6.8| 6.745864588627507|
|[12.0,91.0,20.32,...| 9.99| 9.157938513
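
Each residual shown earlier is the actual `crew` value minus the model's prediction. A quick check in plain Python using the first three rows of the predictions table:

```python
# (crew, prediction) pairs copied from the first three rows of the predictions output
pairs = [
    (13.6, 15.0559963749872),
    (9.0, 10.169694512637198),
    (11.5, 11.19015851261364),
]

# residual = label - prediction
residuals = [crew - prediction for crew, prediction in pairs]
for r in residuals:
    print(round(r, 6))
```

A negative residual means the model predicted more crew than the ship actually has.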

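Based on the RMSE and mean crew size computed below (both in units of 100), the model's typical error is about 84 people against a mean of about 779 crew members per ship, roughly 11% error. Not bad!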

In [43]:
test_results.rootMeanSquaredError

0.8432066600860751

In [44]:
final_data.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              158|
|   mean|7.794177215189873|
| stddev|3.503486564627034|
|    min|             0.59|
|    max|             21.0|
+-------+-----------------+
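
Since `crew` is recorded in hundreds, the RMSE and the mean printed above can be converted to head counts and compared directly. A short worked calculation with those two numbers:

```python
rmse = 0.8432066600860751      # value of test_results.rootMeanSquaredError
mean_crew = 7.794177215189873  # mean from final_data.describe()

rmse_people = rmse * 100       # crew is stored in units of 100
mean_people = mean_crew * 100
relative_error = rmse / mean_crew

print(f"RMSE: about {rmse_people:.0f} people")
print(f"Mean crew: about {mean_people:.0f} people")
print(f"Relative error: {relative_error:.1%}")
```

The split is random, so these figures will shift slightly from run to run.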



The model explains 92% of the variance in `crew` on the test data.

In [45]:
test_results.r2

0.9203885890553847
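
For reference, R² compares the squared residuals with the spread of the labels around their mean: R² = 1 - SS_res / SS_tot. Below is a minimal plain-Python sketch, where `r_squared` is a hypothetical helper. The five pairs are copied from the predictions table, so the result will not match the full test-set value.

```python
def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

# A handful of (crew, prediction) values from the predictions output.
labels = [13.6, 9.0, 11.5, 10.9, 4.47]
predictions = [15.0559963749872, 10.169694512637198, 11.19015851261364,
               11.33491237688185, 2.7173067876905614]
print(r_squared(labels, predictions))
```

A value of 1.0 means perfect predictions, while a model that always predicts the mean label scores 0.0.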

In [46]:
from pyspark.sql.functions import corr

In [47]:
data.select(corr('crew', 'passengers')).show()

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+



In [48]:
data.select(corr('crew', 'cabins')).show()

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+
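
`corr` computes the Pearson correlation coefficient. The plain-Python sketch below applies the same formula to the `crew` and `cabins` values from the first seven rows of the `data.show()` output. `pearson` is a hypothetical helper, and with such a small sample the result differs from the full-data value.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Values from the first seven rows shown by data.show().
cabins = [3.55, 3.55, 7.43, 14.88, 13.21, 10.2, 10.2]
crew = [3.55, 3.55, 6.7, 19.1, 10.0, 9.2, 9.2]
print(pearson(crew, cabins))
```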

