# Cruise ship crew requirement estimation
## Data overview
In this notebook we prepare the dataset described below and use linear regression with Spark to predict the crew required for a given ship.

Data:

Description: Measurements of ship size, capacity, crew, and age for 158 cruise ships.

Variables/Columns:

- Ship Name: 1-20
- Cruise Line: 21-40
- Age (as of 2013): 46-48
- Tonnage (1000s of tons): 50-56
- passengers (100s): 58-64
- Length (100s of feet): 66-72
- Cabins (100s): 74-80
- Passenger Density: 82-88
- Crew (100s): 90-96

## Data preparation

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cruise').getOrCreate()
df = spark.read.csv('cruise_ship_info.csv',inferSchema=True,header=True)


In [2]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



Most of our data is numerical. Only Ship_name and Cruise_line are strings. Ship_name carries no predictive information, so we will drop it later on, but Cruise_line is useful, so we have to convert it to a numerical form.

In [3]:
df.head(1)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)]

In [4]:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="Cruise_line", outputCol="cruise_cat")
indexed = indexer.fit(df).transform(df)

                                                                                

In [5]:
indexed.head(1)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, cruise_cat=16.0)]

Using StringIndexer, we transformed the cruise line name into a numerical value representing it.
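
As a rough illustration of where a value such as `cruise_cat=16.0` comes from: by default `StringIndexer` orders labels by descending frequency, so the most common label gets index 0.0. The sketch below imitates that ordering in plain Python. The helper `index_labels` and the sample values are hypothetical, and the alphabetical tie-breaking is an assumption.

```python
from collections import Counter

def index_labels(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically so ties are deterministic
    # (the tie-breaking rule is an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample, not taken from cruise_ship_info.csv.
lines = ["Carnival", "Azamara", "Carnival", "Celebrity", "Carnival", "Celebrity"]
mapping = index_labels(lines)
print(mapping)
```

Because the indices are frequency ranks rather than measurements, a linear model treats them as an ordered quantity. One-hot encoding the indexed column is a common alternative.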

In [6]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [7]:
assembler = VectorAssembler(
  inputCols=['Age',
             'Tonnage',
             'passengers',
             'length',
             'cabins',
             'passenger_density',
             'cruise_cat'],
    outputCol="features")
output = assembler.transform(indexed)

In [8]:
output.select("features", "crew").show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



## Linear regression

First we have to split our data into training and test datasets.

In [9]:
final_data = output.select("features", "crew")
train_data,test_data = final_data.randomSplit([0.7,0.3])
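
`randomSplit` assigns rows to the splits at random, so the 0.7/0.3 weights describe expected proportions rather than exact sizes, and without a `seed` argument the split changes on every run. The plain-Python sketch below imitates that behaviour. `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random

def random_split(rows, train_weight=0.7, seed=None):
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(158))  # one placeholder per ship in the dataset
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

print(len(train_a), len(test_a))  # sizes are close to, not exactly, 70/30
print(train_a == train_b)         # the same seed reproduces the same split
```

In the notebook, passing a seed, for example `final_data.randomSplit([0.7, 0.3], seed=42)`, would make the split reproducible between runs. The value 42 is arbitrary.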

Next we can prepare our model

In [10]:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(labelCol='crew')
lrModel = lr.fit(train_data)
# Print the coefficients and intercept for linear regression
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))
test_results = lrModel.evaluate(test_data)

22/04/08 11:03:21 WARN Instrumentation: [77228b17] regParam is zero, which might cause numerical instability and overfitting.


Coefficients: [-0.015144009704948483,-7.873785666729284e-05,-0.12565033423234856,0.3977215207194875,0.9079226466251031,-0.0006587663603538364,0.07134553252829551] Intercept: -1.1853267253041428
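
A fitted linear model predicts crew as the intercept plus the dot product of the coefficients with the feature vector. The sketch below applies the coefficients printed above to the first row of the dataset (the ship 'Journey'), assuming the coefficients follow the `inputCols` order given to `VectorAssembler`. The exact numbers will differ between runs because the split is random.

```python
# Coefficients and intercept copied from the output above.
coefficients = [
    -0.015144009704948483,   # Age
    -7.873785666729284e-05,  # Tonnage
    -0.12565033423234856,    # passengers
    0.3977215207194875,      # length
    0.9079226466251031,      # cabins
    -0.0006587663603538364,  # passenger_density
    0.07134553252829551,     # cruise_cat
]
intercept = -1.1853267253041428

# Feature values of the first row shown earlier (Ship_name='Journey').
features = [6.0, 30.276999999999997, 6.94, 5.94, 3.55, 42.64, 16.0]

# Linear prediction: intercept + sum of coefficient * feature.
prediction = intercept + sum(c * x for c, x in zip(coefficients, features))
print("Predicted crew (100s): {:.2f}".format(prediction))
```

The result can be compared against the recorded `crew=3.55` for that row.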


Now we can evaluate if our model is any good.

In [11]:
print("MSE: {}".format(test_results.meanSquaredError))
print("R2: {}".format(test_results.r2))

MSE: 0.4750538054677656
R2: 0.9669631416349129


In [12]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))

RMSE: 0.6892414710881561
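
The three metrics are closely related: RMSE is the square root of MSE, and R2 compares the squared error of the model with that of always predicting the mean label. A minimal plain-Python sketch of these definitions follows. The function name and sample values are hypothetical and do not come from the dataset.

```python
import math

def regression_metrics(labels, predictions):
    """Return (MSE, RMSE, R2) for paired labels and predictions."""
    n = len(labels)
    mean_label = sum(labels) / n
    # Residual sum of squares: error of the model.
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    mse = ss_res / n
    return mse, math.sqrt(mse), 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only.
labels = [3.0, 5.0, 7.0]
predictions = [2.5, 5.0, 7.5]
mse, rmse, r2 = regression_metrics(labels, predictions)
print("MSE: {} RMSE: {} R2: {}".format(mse, rmse, r2))

# Check that the RMSE reported above is the square root of the reported MSE.
print(math.isclose(math.sqrt(0.4750538054677656), 0.6892414710881561, rel_tol=1e-6))
```

Since crew is measured in hundreds, an RMSE of about 0.69 corresponds to a typical error of roughly 69 crew members.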
