# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
The data is saved for you in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis!
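Several columns, including the `Crew` label, are stored in hundreds, so a predicted value has to be scaled back before it is quoted as a headcount. A minimal sketch of that conversion follows; the helper name `to_headcount` is hypothetical and not part of the original notebook.

```python
def to_headcount(value_in_hundreds):
    """Convert a value stored in hundreds (e.g. the crew column) to a whole count."""
    return round(value_in_hundreds * 100)


# The first ship in the dataset has crew = 3.55, i.e. 355 crew members.
print(to_headcount(3.55))  # 355
```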

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

# Import Dependencies

In [33]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import corr
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression


# Data Preparation

## Start a Spark Session

In [2]:
spark = SparkSession.builder.appName('cruise').getOrCreate()



## Read Data

In [3]:
df = spark.read.csv('cruise_ship_info.csv', inferSchema=True, header=True)

In [5]:
# Analyze the data Schema
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [6]:
# Look at the first entries
for ship in df.head(5):
    print(ship)
    print('\n')

Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)


Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)


Row(Ship_name='Celebration', Cruise_line='Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7)


Row(Ship_name='Conquest', Cruise_line='Carnival', Age=11, Tonnage=110.0, passengers=29.74, length=9.53, cabins=14.88, passenger_density=36.99, crew=19.1)


Row(Ship_name='Destiny', Cruise_line='Carnival', Age=17, Tonnage=101.353, passengers=26.42, length=8.92, cabins=13.21, passenger_density=38.36, crew=10.0)




## Convert Categorical Cruise Line Column

In [7]:
# See the different cruise lines
df.groupBy('Cruise_line').count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



In [10]:
# Use StringIndexer to convert the categorical cruise line column to a numeric index
indexer = StringIndexer(inputCol='Cruise_line', outputCol='cruise_cat')
indexed = indexer.fit(df).transform(df)
indexed.head(2)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, cruise_cat=16.0),
 Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, cruise_cat=16.0)]
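To see why Azamara ends up with `cruise_cat=16.0`, it helps to reproduce the ordering by hand. By default `StringIndexer` orders labels by descending frequency, and recent Spark versions break ties alphabetically. The pure-Python sketch below mimics that rule using the counts from the `groupBy` output; the function name `frequency_desc_index` is hypothetical, and this is an illustration of the rule, not Spark's implementation.

```python
# Cruise line counts copied from the groupBy('Cruise_line').count() output.
line_counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}


def frequency_desc_index(counts):
    """Map each label to a float index: most frequent first, ties alphabetical."""
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


cruise_index = frequency_desc_index(line_counts)
print(cruise_index["Royal_Caribbean"])  # 0.0 (most frequent line)
print(cruise_index["Azamara"])          # 16.0 (matches the rows shown above)
```

Note that the resulting index is an arbitrary ordering by frequency, so a linear model will treat it as a numeric quantity.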

## Assemble feature vector

In [13]:
# Look at the columns to assemble
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'cruise_cat']

In [14]:
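# Assemble feature vector
# NOTE: 'crew' is also the label column. Keeping it among the inputs leaks
# the target into the features, which is why the evaluation further down
# reports a near-perfect fit.
assembler = VectorAssembler(
    inputCols=[
        'Age',
        'Tonnage',
        'passengers',
        'length',
        'cabins',
        'passenger_density',
        'crew',
        'cruise_cat',
    ],
    outputCol='features',
)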

In [15]:
output = assembler.transform(indexed)
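The assembler above lists `crew` among its inputs even though `crew` is also the label. One possible way to avoid this is to derive the feature list by excluding the label and the non-numeric columns. The sketch below is an assumption about how that could look, not the notebook's original method; `LABEL_COL`, `NON_FEATURE_COLS`, and `build_assembler` are hypothetical names.

```python
# Columns of `indexed`, copied from the listing above.
all_columns = [
    'Ship_name', 'Cruise_line', 'Age', 'Tonnage', 'passengers',
    'length', 'cabins', 'passenger_density', 'crew', 'cruise_cat',
]

LABEL_COL = 'crew'
# String columns and the label must not be assembled into the features.
NON_FEATURE_COLS = {'Ship_name', 'Cruise_line', LABEL_COL}

feature_cols = [c for c in all_columns if c not in NON_FEATURE_COLS]


def build_assembler(input_cols):
    """Return a VectorAssembler over input_cols (pyspark is imported lazily)."""
    from pyspark.ml.feature import VectorAssembler
    return VectorAssembler(inputCols=input_cols, outputCol='features')


print(feature_cols)
```

With a Spark session available, `build_assembler(feature_cols).transform(indexed)` would produce a feature vector without the label. The metrics reported later in this notebook were computed with `crew` included and would change after such a fix.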

In [17]:
# Verify your prepared dataset to use with Linear Regression
output.select('features','crew').show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



In [19]:
# Create final data to use with Linear Regression
final_data = output.select('features','crew')

## Create Linear Regression Model

In [20]:
# Train, Test split
train_data, test_data = final_data.randomSplit([0.7,0.3])
train_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               107|
|   mean| 7.791495327102814|
| stddev|3.8480224776091827|
|    min|              0.59|
|    max|              21.0|
+-------+------------------+



In [22]:
test_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|                51|
|   mean| 7.799803921568627|
| stddev|2.6739854077358602|
|    min|              0.88|
|    max|              13.6|
+-------+------------------+
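The split came out as 107/51 rather than an exact 70/30 of the 158 rows (about 111/47). That is expected if rows are assigned to each side by independent random draws rather than by exact counts, which is how `randomSplit` appears to behave. The pure-Python sketch below illustrates that idea; `bernoulli_split` is a hypothetical helper, not Spark code.

```python
import random


def bernoulli_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(158))  # stand-ins for the 158 ships
train, test = bernoulli_split(rows)

# The sizes are only approximately 70/30, but repeatable for a fixed seed.
print(len(train), len(test))
```

In PySpark, passing a seed, as in `final_data.randomSplit([0.7, 0.3], seed=42)`, makes the split repeatable across runs, which helps when comparing models.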



In [25]:
# Build linear regression model
ship_lr = LinearRegression(labelCol='crew')

In [26]:
# Fit to the training data
trained_model = ship_lr.fit(train_data)

23/04/25 01:29:27 WARN Instrumentation: [8c579f2b] regParam is zero, which might cause numerical instability and overfitting.


In [27]:
# Evaluate model
results_model = trained_model.evaluate(test_data)

In [29]:
# RMSE
results_model.rootMeanSquaredError

9.482844654815497e-15

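The RMSE is about 9.5e-15, effectively zero, while the mean of the `crew` label in the test set is about 7.8 (in hundreds of crew members). An error this small on held-out data is not a sign of a good model. It indicates target leakage: `crew` was included in the `VectorAssembler` inputs, so the model can read the label directly from the features.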
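RMSE is expressed in the same units as the label, so it is easiest to judge next to the label's scale. A minimal sketch of the computation follows, using the `crew` values of the first five ships as labels; the predictions are made up for illustration and do not come from the trained model.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [3.55, 3.55, 6.7, 19.1, 10.0]      # crew (100s), first five rows
copied = list(labels)                        # what a model that sees the label predicts
off_by_one = [y + 1.0 for y in labels]       # hypothetical: always 100 crew too many

print(rmse(labels, copied))      # 0.0
print(rmse(labels, off_by_one))  # approximately 1.0, i.e. about 100 crew members
```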

In [32]:
# R squared
results_model.r2

1.0

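An R² of 1.0 means the model explains 100% of the variance of the label column on the test set. On real data this is almost never achievable legitimately. With the label `crew` among the assembled features, the regression only has to copy one of its inputs, so this score reflects target leakage rather than predictive power.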
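For reference, R² compares the model's squared error with that of always predicting the label mean. The sketch below uses the usual definition, 1 - SS_res / SS_tot, which is assumed here to match what the evaluation summary reports; the predictions are illustrative only.

```python
def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


labels = [3.55, 3.55, 6.7, 19.1, 10.0]            # crew (100s), first five rows
mean_only = [sum(labels) / len(labels)] * len(labels)

print(r_squared(labels, list(labels)))  # 1.0: predictions copy the label exactly
print(r_squared(labels, mean_only))     # 0.0: no better than predicting the mean
```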

In [34]:
# Check Pearson correlation between two columns 
df.select(corr('crew','passengers')).show()

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+



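The correlation between `crew` and `passengers` is high (about 0.92), so passenger capacity is a strong predictor of crew size. By itself, though, a correlation of 0.92 corresponds to an R² of roughly 0.84 in a single-predictor fit, so it does not account for the perfect score above. That score comes from `crew` being part of the feature vector.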
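The Pearson coefficient returned by `corr` can be reproduced by hand, and squaring it gives the share of variance a single-predictor linear fit would explain. The sketch below uses only the five rows printed earlier, so its value will differ from the full-dataset figure of about 0.915.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# First five rows of the dataset, as printed by df.head(5).
crew = [3.55, 3.55, 6.7, 19.1, 10.0]
passengers = [6.94, 6.94, 14.86, 29.74, 26.42]

r = pearson(crew, passengers)
print(r, r ** 2)  # correlation and the variance share of a one-feature fit
```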