## Simple ML task

### Build SparkSession:

In [1]:
import pyspark
from pyspark.sql import SparkSession


ModuleNotFoundError: No module named 'pyspark'
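The `ModuleNotFoundError` above means the kernel's Python environment has no `pyspark` package installed. A minimal sketch of a check to run before importing it is shown below; the helper name `pyspark_available` is hypothetical, and the `pip install pyspark` hint assumes a pip-managed environment.

```python
import importlib.util


def pyspark_available():
    """Return True if the current interpreter can locate the 'pyspark' package."""
    return importlib.util.find_spec("pyspark") is not None


if pyspark_available():
    print("pyspark is importable")
else:
    # Assumption: the environment is managed with pip.
    print("pyspark is not installed; try: pip install pyspark")
```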

In [3]:
spark = SparkSession.builder.getOrCreate()

You've been flown to a shipbuilder's headquarters in Ulsan, South Korea, to help them accurately estimate the number of crew members a ship will need.


They're currently building new ships for certain customers, and they'd like you to create a model and use it to estimate how many crew members those ships will require.


Metadata: measurements of the following for 158 cruise ships:
1. ship size
2. capacity
3. crew
4. age

The data is saved in a CSV file called "ITI_data.csv". Your task is to develop a regression model that predicts the number of crew members required for future ships. The client also noted that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important factor to consider in your analysis.

In [57]:
data = spark.read.csv("ITI_data.csv" , header=True , inferSchema=True)
data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [58]:
data.show(5)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
only showing top 5 rows



In [75]:
trainDF , testDF = data.randomSplit([0.8,0.2] , seed=42)

### OneHotEncoder 


In [61]:
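# Collect the columns to index and one-hot encode.
# Note: there is no dtype filter here, so this picks up every column
# (numeric ones and the label 'crew' included), not only the string columns.
string_col = [c for (c,d) in trainDF.dtypes]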
string_col

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew']
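The list above contains every column, including the numeric ones and the label `crew`, because the comprehension does not test the dtype. A minimal sketch of a dtype-filtered selection follows. It uses a literal list of `(name, dtype)` pairs copied from the `printSchema()` output so that it runs without Spark; in the notebook the same pairs would come from `trainDF.dtypes`. The names `dtypes` and `label_col` are illustrative, not part of the original notebook.

```python
# (name, dtype) pairs copied from the printSchema() output;
# in the notebook this list would come from trainDF.dtypes.
dtypes = [
    ("Ship_name", "string"),
    ("Cruise_line", "string"),
    ("Age", "int"),
    ("Tonnage", "double"),
    ("passengers", "double"),
    ("length", "double"),
    ("cabins", "double"),
    ("passenger_density", "double"),
    ("crew", "double"),
]
label_col = "crew"  # illustrative name for the target column

# Keep only the string-typed columns for StringIndexer / OneHotEncoder.
string_cols = [c for (c, d) in dtypes if d == "string"]

# Keep numeric columns as plain features, excluding the label.
numeric_cols = [c for (c, d) in dtypes if d in ("double", "int") and c != label_col]

print(string_cols)
print(numeric_cols)
```

With this filter the label never reaches the encoder, so no `crew`-derived column can end up in the feature vector. Whether `Ship_name` should be encoded at all is a separate judgment call, since it is close to a unique identifier per ship.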

In [62]:
#output column names for StringIndexer
strInd_col = [c+"_Ind" for c in string_col]
strInd_col

['Ship_name_Ind',
 'Cruise_line_Ind',
 'Age_Ind',
 'Tonnage_Ind',
 'passengers_Ind',
 'length_Ind',
 'cabins_Ind',
 'passenger_density_Ind',
 'crew_Ind']

In [63]:
#output columns for OneHotEncoder

OHE_col = [c+"_OHE" for c in string_col]
OHE_col

['Ship_name_OHE',
 'Cruise_line_OHE',
 'Age_OHE',
 'Tonnage_OHE',
 'passengers_OHE',
 'length_OHE',
 'cabins_OHE',
 'passenger_density_OHE',
 'crew_OHE']

In [65]:
#getting numeric columns to be merged with OHE columns
numeric_cols = [c for (c,d) in trainDF.dtypes if ((d=='double' or d=='int') and (c != 'crew'))]
numeric_cols

['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density']

In [66]:
# getting all features together and Instantiate VectorAssembler
all_columns = OHE_col + numeric_cols
all_columns

['Ship_name_OHE',
 'Cruise_line_OHE',
 'Age_OHE',
 'Tonnage_OHE',
 'passengers_OHE',
 'length_OHE',
 'cabins_OHE',
 'passenger_density_OHE',
 'crew_OHE',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density']

In [81]:
from pyspark.ml.feature import StringIndexer , OneHotEncoder , VectorAssembler
#Instantiate StringIndexer
strInd = StringIndexer(inputCols=string_col , outputCols=strInd_col , handleInvalid='keep')

#Instantiate OneHotEncoding
OHEncoder = OneHotEncoder(inputCols=strInd_col , outputCols=OHE_col)

#Instantiate VectorAssembler
vecAssem = VectorAssembler(inputCols=all_columns , outputCol='features')
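To make the two encoding stages concrete, here is a simplified pure-Python sketch of what `StringIndexer` and `OneHotEncoder` do to a single column under their default settings (labels ordered by descending frequency, last category dropped). It is an illustration only: it returns dense lists instead of Spark's sparse vectors, it ignores the extra bucket that `handleInvalid='keep'` adds for unseen labels, and the helper `one_hot` is hypothetical.

```python
from collections import Counter

# Cruise_line values from the five rows shown by data.show(5).
cruise_lines = ["Azamara", "Azamara", "Carnival", "Carnival", "Carnival"]

# StringIndexer, default ordering: the most frequent label gets index 0.
# Ties are broken alphabetically here (there are none in this sample).
counts = Counter(cruise_lines)
labels = sorted(counts, key=lambda s: (-counts[s], s))
index_of = {label: i for i, label in enumerate(labels)}


def one_hot(index, num_categories, drop_last=True):
    """Dense stand-in for the sparse vector produced by OneHotEncoder."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0  # the dropped last category stays all zeros
    return vec


for name in ["Carnival", "Azamara"]:
    print(name, index_of[name], one_hot(index_of[name], len(labels)))
```

Dropping the last category avoids a redundant column: with two cruise lines, a single 0/1 entry is enough to tell them apart.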


### Create a Linear Regression Model 

In [82]:
from pyspark.ml.regression import LinearRegression

lr_multi = LinearRegression(featuresCol='features' , labelCol='crew')

### Creating a Pipeline

In [83]:
from pyspark.ml import Pipeline
pip = Pipeline(stages=[strInd , OHEncoder , vecAssem , lr_multi] )
pip_model = pip.fit(trainDF)

#transform train data
train_transformed = pip_model.transform(trainDF)
train_pred = train_transformed.select("crew" , "prediction")

#transform test data
test_transformed = pip_model.transform(testDF)
test_pred = test_transformed.select("crew" , "prediction")


In [84]:
test_pred.show(5)

+----+------------------+
|crew|        prediction|
+----+------------------+
| 6.0| 7.163955294028179|
| 5.2| 6.364820511097811|
| 8.5| 8.560809743834433|
|6.17| 6.916979653836525|
|12.0|10.602570456632998|
+----+------------------+
only showing top 5 rows



### Model Evaluation

In [87]:
from pyspark.ml.evaluation import RegressionEvaluator
# metricName='mae' makes the evaluator report the mean absolute error
regressionEvaluator = RegressionEvaluator(predictionCol='prediction',
                                         labelCol='crew',
                                         metricName='mae')
# MAE on the training data
mae = regressionEvaluator.evaluate(train_pred)
print("MAE is {:.1f}".format(mae))

# MAE on the test data
mae = regressionEvaluator.evaluate(test_pred)
print("MAE is {:.1f}".format(mae))


MAE is 0.0
MAE is 0.9
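The evaluator in this cell is configured with `metricName='mae'`, and MAE and RMSE are different quantities: MAE averages the absolute errors, while RMSE takes the square root of the mean squared error and so weights large errors more heavily. The sketch below computes both by hand on the five `(crew, prediction)` rows shown by `test_pred.show(5)`. It covers only those five rows, so its numbers are not expected to match the metrics for the full test set.

```python
import math

# (crew, prediction) pairs copied from the test_pred.show(5) output.
rows = [
    (6.0, 7.163955294028179),
    (5.2, 6.364820511097811),
    (8.5, 8.560809743834433),
    (6.17, 6.916979653836525),
    (12.0, 10.602570456632998),
]

errors = [label - pred for label, pred in rows]

# Mean absolute error: average magnitude of the errors.
mae = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: square root of the average squared error.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE on the five displayed rows:  {mae:.3f}")
print(f"RMSE on the five displayed rows: {rmse:.3f}")
```

RMSE is never smaller than MAE on the same data, so a gap between the two hints at a few comparatively large errors.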


In [88]:
r2 = RegressionEvaluator(predictionCol='prediction',
                                         labelCol='crew',
                                         metricName='r2').evaluate(test_pred)
print(f"R2 is {r2}")

R2 is 0.8840935158691435
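For reference, R2 compares the model's squared error with that of a baseline that always predicts the mean label: R2 = 1 - SS_res / SS_tot. A minimal sketch on the five `(crew, prediction)` rows shown by `test_pred.show(5)` follows. Because it uses only those five rows, its value will differ from the full test-set figure printed above.

```python
# (crew, prediction) pairs copied from the test_pred.show(5) output.
rows = [
    (6.0, 7.163955294028179),
    (5.2, 6.364820511097811),
    (8.5, 8.560809743834433),
    (6.17, 6.916979653836525),
    (12.0, 10.602570456632998),
]

labels = [label for label, _ in rows]
mean_label = sum(labels) / len(labels)

# Residual sum of squares: error of the model's predictions.
ss_res = sum((label - pred) ** 2 for label, pred in rows)

# Total sum of squares: error of always predicting the mean label.
ss_tot = sum((label - mean_label) ** 2 for label in labels)

r2 = 1 - ss_res / ss_tot
print(f"R2 on the five displayed rows: {r2:.3f}")
```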


By Eng. Mostafa Nabieh 