## Use Case - Regression Model

You've been flown to a shipbuilder's headquarters in Ulsan, South Korea, to help the company accurately estimate how many crew members a ship will need.

The company is currently building new ships for several customers and would like you to build a model and use it to estimate how many crew members those ships will require.


Metadata: measurements of ship size, capacity, crew, and age for 158 cruise ships.

The data is saved in a CSV file called "ITI_data.csv". Your task is to develop a regression model that predicts the number of crew members required for future ships. The client also noted that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis.

In [1]:
import findspark
findspark.init()  # locate the local Spark installation and add it to sys.path

from pyspark.sql import SparkSession

# Build (or reuse) a single Hive-enabled session with the spark-avro package configured
spark = (SparkSession.builder.appName('SparkSQL')
         .enableHiveSupport()
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.0.1")
         .getOrCreate())

In [2]:
# Explicit schema (DDL string) so Spark does not have to infer column types
df_schema = ("Ship_name STRING, Cruise_line STRING, Age INT, Tonnage DOUBLE, "
             "passengers DOUBLE, length DOUBLE, cabins DOUBLE, "
             "passenger_density DOUBLE, crew DOUBLE")

ships_df = (spark.read.format('csv')
            .option('header', 'true')
            .schema(df_schema)
            .load('ITI_data.csv'))

# Register a temp view so the data can also be queried with spark.sql
ships_df.createOrReplaceTempView("ships")
ships_df.show()
ships_df.printSchema()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [3]:
spark.sql('SELECT DISTINCT Cruise_line FROM ships').show()

+-----------------+
|      Cruise_line|
+-----------------+
|            Costa|
|              P&O|
|           Cunard|
|Regent_Seven_Seas|
|              MSC|
|         Carnival|
|          Crystal|
|           Orient|
|         Princess|
|        Silversea|
|         Seabourn|
| Holland_American|
|         Windstar|
|           Disney|
|        Norwegian|
|          Oceania|
|          Azamara|
|        Celebrity|
|             Star|
|  Royal_Caribbean|
+-----------------+



In [4]:
# Drop the Ship_name column: it identifies a ship but is not used as a feature
ships_df = ships_df.drop('Ship_name')
ships_df.printSchema()

root
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [5]:
# 80/20 split; the fixed seed makes the split reproducible
trainDF, testDF = ships_df.randomSplit([0.8, 0.2], seed=42)
print(f"There are {trainDF.count()} rows in the training set, and {testDF.count()} in the test set")

There are 133 rows in the training set, and 25 in the test set
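
The split comes out as 133/25 rather than an exact 80/20 (which would be roughly 126/32) because `randomSplit` samples row by row, so the weights act as expected proportions and not as guaranteed sizes. The following is a minimal plain-Python sketch of that idea, not Spark's actual implementation; `bernoulli_split` is a hypothetical helper introduced only for illustration, and the integers stand in for the 158 ships.

```python
import random

def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(158))  # stand-ins for the 158 ships
train, test = bernoulli_split(rows, 0.8, seed=42)

# The sizes are close to 126/32 but are not forced to match it exactly
print(len(train), len(test))
```

Re-running with the same seed reproduces the same partition, which is why the notebook fixes `seed=42`.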


### OneHotEncoder 


In [6]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

In [7]:
categoricalCols = [field for (field, dataType) in trainDF.dtypes
                   if dataType == "string"]
categoricalCols

['Cruise_line']

In [8]:
indexOutputCols = [x + "_Index" for x in categoricalCols]
indexOutputCols

['Cruise_line_Index']

In [9]:
oheOutputCols = [x + "_OHE" for x in categoricalCols]
oheOutputCols

['Cruise_line_OHE']

In [10]:
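# handleInvalid='skip' drops rows whose label was not seen during fitting
# (or is null) instead of raising an error at transform time
stringIndexer = StringIndexer(inputCols=categoricalCols,
                              outputCols=indexOutputCols,
                              handleInvalid='skip')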
oheEncoder = OneHotEncoder(inputCols=indexOutputCols,
                          outputCols=oheOutputCols)
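
To make the two stages configured above more concrete, here is a small plain-Python sketch of what they compute, assuming the default settings: `StringIndexer` orders labels by descending frequency, and `OneHotEncoder` drops the last category so that it is represented by an all-zeros vector. The helper names `fit_string_index` and `one_hot_drop_last` are hypothetical, the sample labels are made up for illustration, and dense lists are used for readability where Spark would produce sparse vectors.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index: most frequent first, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """Encode an index as a vector of length num_categories - 1.

    The last category gets the all-zeros vector, which keeps the
    encoded columns from summing to one on every row.
    """
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec

# Made-up sample of cruise lines, for illustration only
lines = ["Carnival", "Carnival", "Carnival", "Azamara", "Azamara", "Disney"]
mapping = fit_string_index(lines)
encoded = {label: one_hot_drop_last(i, len(mapping)) for label, i in mapping.items()}

print(mapping)  # {'Carnival': 0, 'Azamara': 1, 'Disney': 2}
print(encoded)  # {'Carnival': [1.0, 0.0], 'Azamara': [0.0, 1.0], 'Disney': [0.0, 0.0]}
```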

In [11]:
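# Numeric features: every DOUBLE column except the label. Age is typed INT in
# the schema, so this filter leaves it out of the feature list.
numericCols = [field for (field, dataType) in trainDF.dtypes
               if dataType == 'double' and field != 'crew']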
numericCols

['Tonnage', 'passengers', 'length', 'cabins', 'passenger_density']

### Use VectorAssembler to merge all feature columns into one vector column

In [12]:
assemblerInputs = oheOutputCols + numericCols
assemblerInputs

['Cruise_line_OHE',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density']

In [13]:
vecAssembler = VectorAssembler(inputCols=assemblerInputs,outputCol='features')
vecAssembler

VectorAssembler_142f8f2d3634
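
Conceptually, the assembler concatenates its input columns, in order, into a single flat feature vector per row. The sketch below illustrates this in plain Python; `assemble` is a hypothetical helper, the numeric values come from the Celebration row shown earlier, and the two-slot `Cruise_line_OHE` value is a made-up stand-in for the real encoding.

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector-valued columns into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector-valued column (e.g. a one-hot encoding): append every slot
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features

row = {
    "Cruise_line_OHE": [0.0, 1.0],  # hypothetical 2-slot encoding, illustration only
    "Tonnage": 47.262,
    "passengers": 14.86,
    "length": 7.22,
    "cabins": 7.43,
    "passenger_density": 31.8,
}
input_cols = ["Cruise_line_OHE", "Tonnage", "passengers",
              "length", "cabins", "passenger_density"]

features = assemble(row, input_cols)
print(features)  # [0.0, 1.0, 47.262, 14.86, 7.22, 7.43, 31.8]
```

In the notebook itself, assuming all 20 cruise lines appear in the training split and the encoder drops the last category, the assembled vector would have 19 + 5 = 24 entries.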

### Create a Linear Regression Model 

In [14]:
from pyspark.ml.regression import LinearRegression

# Predict 'crew' from the assembled 'features' vector
lr = LinearRegression(labelCol='crew', featuresCol='features')

### Creating a Pipeline

In [15]:
from pyspark.ml import Pipeline

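# Chain indexing, encoding, assembling, and regression into one estimator
pipeline = Pipeline(stages=[stringIndexer, oheEncoder, vecAssembler, lr])
pipelineModel = pipeline.fit(trainDF)      # fit every stage on the training set
predDF = pipelineModel.transform(testDF)   # adds a 'prediction' column to the test rows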

### Model Evaluation

In [16]:
from pyspark.ml.evaluation import RegressionEvaluator

# Root-mean-square error on the held-out test set
regressionEvaluator = RegressionEvaluator(predictionCol='prediction',
                                          labelCol='crew',
                                          metricName='rmse')
rmse = regressionEvaluator.evaluate(predDF)
print(f"RMSE is {rmse:.1f}")

# Coefficient of determination on the same predictions
r2 = RegressionEvaluator(predictionCol='prediction',
                         labelCol='crew',
                         metricName='r2').evaluate(predDF)
print(f"R2 is {r2}")



RMSE is 1.6
R2 is 0.8507705398325915
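
For reference, the two metrics can be computed by hand. RMSE is the square root of the mean squared residual, so it is expressed in the units of the `crew` column, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean label. The sketch below is plain Python and not the evaluator's own code; the function names are hypothetical, the labels are crew values from the sample rows shown earlier, and the predictions are made up for illustration.

```python
import math

def rmse_score(labels, predictions):
    """Root-mean-square error: sqrt of the mean squared residual."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def r2_score(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

labels = [3.55, 6.7, 19.1, 10.0]      # crew values from the sample rows above
predictions = [4.0, 6.0, 18.0, 11.0]  # made-up predictions, illustration only

print(f"RMSE is {rmse_score(labels, predictions):.2f}")
print(f"R2 is {r2_score(labels, predictions):.3f}")
```

A model that always predicts the mean label scores an R2 of 0 and a perfect model scores 1, so the notebook's value of about 0.85 means the model accounts for roughly 85% of the variance in `crew` on the test set.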
