<h1>Predicting Crew Members with Linear Regression using PySpark</h1>

In this notebook, I build a linear regression model to predict how many crew members Hyundai Heavy Industries ships will need.

<h2>Step 1: Create a new Spark session</h2>

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PredictingCrewMembers").getOrCreate()

<h2>Step 2: Load the data</h2>

First, I load a CSV file with cruise ship information called <code>cruise_ship_info.csv</code>, which you can find online <a href="https://raw.githubusercontent.com/kushangbhatt/CrewMemberPrediction/master/cruise_ship_info.csv">here</a>.

In [2]:
file = "cruise_ship_info.csv"
df = spark.read.csv(file, header = True, inferSchema = True)

In [3]:
print("This dataset has {} columns.\n\nThe names of these columns are:\n  {}".format(len(df.columns), df.columns))

This dataset has 9 columns.

The names of these columns are:
  ['Ship_name', 'Cruise_line', 'Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density', 'crew']


<h2>Step 3: Explore the data</h2>

Before modeling, it is worth examining the data. The first five rows in the dataset look like this:

In [4]:
df.show(5)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
only showing top 5 rows



We can also print the schema of the DataFrame in a tree format:

In [5]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



The target variable is **crew**, which is continuous. So are most of the features in this dataset. Only **Ship_name** and **Cruise_line** are categorical variables.

Below, a summary of the dataset is shown as a <em>Pandas</em> DataFrame. It lists the mean, standard deviation, minimum and maximum values for each numerical column (the mean and stddev entries for the two string columns are not meaningful). Also, notice that there are 158 rows in this dataset.

In [6]:
df.describe().toPandas()

|   | summary | Ship_name | Cruise_line | Age | Tonnage | passengers | length | cabins | passenger_density | crew |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 158 | 158 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 |
| 1 | mean | Infinity |  | 15.689873417721518 | 71.28467088607599 | 18.45740506329114 | 8.130632911392404 | 8.830000000000005 | 39.90094936708861 | 7.794177215189873 |
| 2 | stddev |  |  | 7.615691058751413 | 37.229540025907866 | 9.677094775143416 | 1.793473548054825 | 4.4714172221480615 | 8.63921711391542 | 3.503486564627034 |
| 3 | min | Adventure | Azamara | 4.0 | 2.329 | 0.66 | 2.79 | 0.33 | 17.7 | 0.59 |
| 4 | max | Zuiderdam | Windstar | 48.0 | 220.0 | 54.0 | 11.82 | 27.0 | 71.43 | 21.0 |


Next, let's check how many unique values each of the two categorical variables has:

In [7]:
for column in ["Ship_name", "Cruise_line"]:    
    print("There are %s different values in %s column." % (df.select(column).distinct().count(), column))

There are 138 different values in Ship_name column.
There are 20 different values in Cruise_line column.


**Ship_name** has almost as many distinct values (138) as there are rows in the dataset (158). A feature that is nearly unique per row is unlikely to help predict crew members, so I do not include it in the linear regression model.

**Cruise_line**, on the other hand, has only 20 distinct values, and I want to include it in the model. To do so, I convert each string in this column to a numerical label.

In [8]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="Cruise_line", outputCol="line_id")
indexed = indexer.fit(df).transform(df)
indexed.select("line_id").describe().show()

+-------+-----------------+
|summary|          line_id|
+-------+-----------------+
|  count|              158|
|   mean|5.063291139240507|
| stddev|4.758744608182735|
|    min|              0.0|
|    max|             19.0|
+-------+-----------------+



We see that the strings in **Cruise_line** have been transformed to numerical values between 0 and 19. Now, we can include this numerical variable in the linear regression model.

In [9]:
print([int(row.line_id) for row in indexed.select("line_id").distinct().orderBy("line_id").collect()])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]


In [10]:
for i, name in enumerate(indexed.schema.names):
    print("Column number", i, "is", name)

Column number 0 is Ship_name
Column number 1 is Cruise_line
Column number 2 is Age
Column number 3 is Tonnage
Column number 4 is passengers
Column number 5 is length
Column number 6 is cabins
Column number 7 is passenger_density
Column number 8 is crew
Column number 9 is line_id


The predictor variables for the linear regression model are columns 2 through 9, except column 8 (**crew**), which is the target variable. The next block of code creates the vector assembler that gathers these features into a single column.

In [11]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols = [indexed.schema.names[i] for i in [9] + list(range(2,8))],
                            outputCol = "features")
output = assembler.transform(indexed)
output.select("features").show(5)

+--------------------+
|            features|
+--------------------+
|[16.0,6.0,30.2769...|
|[16.0,6.0,30.2769...|
|[1.0,26.0,47.262,...|
|[1.0,11.0,110.0,2...|
|[1.0,17.0,101.353...|
+--------------------+
only showing top 5 rows
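
The positional indices used for the assembler depend on the column order staying the same. As a hedged alternative, the same set of features can be selected by name. The sketch below uses plain Python on the column names printed earlier; `excluded` and `feature_cols` are names introduced here for illustration.

```python
# Column names as printed by the notebook for the `indexed` DataFrame
columns = ["Ship_name", "Cruise_line", "Age", "Tonnage", "passengers",
           "length", "cabins", "passenger_density", "crew", "line_id"]

# Columns that should not be predictors: the ship identifier, the raw string
# version of the cruise line, and the target variable itself
excluded = {"Ship_name", "Cruise_line", "crew"}

# Keep every remaining column, preserving the original order
feature_cols = [name for name in columns if name not in excluded]
print(feature_cols)
```

Here `line_id` ends up last instead of first. That only changes the order of the entries in the features vector, not the set of predictors.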



The features vector has been generated correctly. Next, I fit a linear regression model on these seven features.

In [12]:
from pyspark.ml.regression import LinearRegression

# Splitting the assembled data (output) into train (70%) and test (30%) sets.
#   Random seed is set to today's date, 2020-12-21

train_data, test_data = output.randomSplit([0.7, 0.3], seed = 20201221)


# Fitting linear regression model with train_data

lr = LinearRegression(labelCol="crew")
lr_model = lr.fit(train_data)

How well does this model predict crew members? I evaluate its performance on the test set in terms of RMSE and R²:

In [13]:
# Evaluating linear regression model with test_data
test_results = lr_model.evaluate(test_data)


# Outputting RMSE and R-squared of this model
print("RMSE of model =", test_results.rootMeanSquaredError)
print("R-squared of model =", test_results.r2)

RMSE of model = 0.8104566925385541
R-squared of model = 0.9004064252541517


R² is about 0.90, and the RMSE of about 0.81 is small compared with the standard deviation of **crew** (about 3.50) shown in the summary above, so this model seems adequate for predicting the number of crew members from these features.
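
For reference, both metrics can be computed by hand from actual and predicted values. The block below is a small sketch in plain Python using the usual definitions (RMSE as the square root of the mean squared residual, R² as one minus the ratio of residual to total sum of squares). The helper names `rmse` and `r_squared` and the sample numbers are hypothetical and do not come from the dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical crew values and predictions, for illustration only
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]

print("RMSE =", rmse(actual, predicted))
print("R-squared =", r_squared(actual, predicted))
```

RMSE is expressed in the same units as the target, which is why it is natural to compare it with the spread of **crew**, while R² is unitless.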

However, what would have happened if I hadn't included **Cruise_line** as a numerical variable (**line_id**) in the model? Let's see:

In [14]:
assembler2 = VectorAssembler(inputCols = [df.columns[i] for i in range(2,8)], outputCol = "features")

output2 = assembler2.transform(df)

# Splitting output2 in train (70%) and test (30%) sets, reusing the same
#   random seed as before (20201221)

train_data2, test_data2 = output2.randomSplit([0.7, 0.3], seed = 20201221)


# Fitting linear regression model with train_data2

lr_model2 = lr.fit(train_data2)


# Evaluating linear regression model with test_data2
test_results2 = lr_model2.evaluate(test_data2)


# Outputting RMSE and R-squared of this model
print("RMSE of model =", test_results2.rootMeanSquaredError)
print("R-squared of model =", test_results2.r2)

RMSE of model = 0.8508139693858561
R-squared of model = 0.8902408035381612


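Both models perform well: with **line_id**, RMSE is about 0.81 and R² about 0.90; without it, RMSE is about 0.85 and R² about 0.89. The difference is small, which suggests that **Cruise_line**, at least when encoded as a single numeric index, is not a very informative predictor variable. Keep in mind that this comparison rests on a single random split of 158 rows, so it is indicative rather than conclusive.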

This is just a very simple example, but linear regression can be applied to many more complex and interesting tasks using Spark. I hope you have enjoyed this notebook. See you soon!