# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
It is saved for you in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that particular cruise lines differ in acceptable crew counts, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.appName('Hyundai').getOrCreate()

In [5]:
data = spark.read.csv('/FileStore/tables/cruise_ship_info.csv', inferSchema=True, header=True)

In [1]:
data.show()

In [7]:
data.printSchema()

In [8]:
# The model will use every feature except the ship name.
# Add a column that encodes the string column Cruise_line as a numeric index.

from pyspark.ml.feature import StringIndexer

In [9]:
indexer = StringIndexer(inputCol='Cruise_line', outputCol='Cruise_line_index')
data2 = indexer.fit(data).transform(data)
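
To make the indexing step concrete, here is a minimal pure-Python sketch of what `StringIndexer` appears to do with its default ordering: labels are ranked by descending frequency, so the most common cruise line gets index 0.0. The function name `index_labels` and the sample values are hypothetical, and the alphabetical tie-break is an assumption about recent Spark versions rather than something this notebook confirms.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties alphabetically (assumed behaviour).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return [mapping[label] for label in labels], mapping

# Hypothetical sample values, not read from cruise_ship_info.csv.
lines = ["Royal", "Carnival", "Royal", "Azamara", "Carnival", "Royal"]
indexed, mapping = index_labels(lines)
print(mapping)
print(indexed)
```

The resulting index is only a code for the category, yet a linear model will treat it as a magnitude, so one-hot encoding the index is a common alternative worth considering.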

In [2]:
data2.show()

In [11]:
# Import the tools needed to put the data into the format the ml library accepts
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [12]:
data2.columns

In [13]:
# Create the assembler: inputCols lists the predictor features,
# and outputCol names the single vector column that will hold them
assembler = VectorAssembler(inputCols=['Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'Cruise_line_index'], outputCol='features')

In [14]:
# Transform the data
output = assembler.transform(data2)
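
As a rough illustration of the assembling step, the sketch below gathers the chosen columns of one row into a single numeric vector, in the order given. It is plain Python rather than Spark; the helper `assemble` and the row values are hypothetical and only mimic the idea behind `VectorAssembler`.

```python
def assemble(row, input_cols):
    """Collect the chosen columns of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]

input_cols = ["Age", "Tonnage", "passengers", "length",
              "cabins", "passenger_density", "Cruise_line_index"]

# Hypothetical row; the numbers are made up for illustration.
row = {"Age": 6, "Tonnage": 30.28, "passengers": 6.94, "length": 5.94,
       "cabins": 3.55, "passenger_density": 42.64,
       "Cruise_line_index": 16.0, "crew": 3.55}

features = assemble(row, input_cols)
print(features)
```

The label column `crew` is deliberately left out of `input_cols`, which matches the notebook: the label stays in its own column next to `features`.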

In [15]:
output.printSchema()

In [16]:
# Select the columns that make up the final dataset for the model
final_data = output.select('features','crew')

In [3]:
final_data.show()

In [18]:
# Split into training and test sets
train_data, test_data = final_data.randomSplit([0.7,0.3])
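
Because the split is random, the metrics further down will change from run to run unless a seed is fixed (`randomSplit` accepts an optional `seed` argument). The sketch below is a plain-Python stand-in, not Spark's implementation: it assumes each row is assigned by an independent random draw, which is why the resulting sizes are only approximately 70/30. The helper `random_split` and the seed value are hypothetical.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # A draw below train_weight sends the row to the training set.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(158))  # stand-in for the 158 ships
train, test = random_split(rows, 0.7, seed=42)
print(len(train), len(test))
```

Running the function twice with the same seed yields the same partition, which makes the later evaluation numbers comparable across runs.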

In [4]:
train_data.describe().show()

In [5]:
test_data.describe().show()

In [21]:
# Import the linear regression class from PySpark's ml package
from pyspark.ml.regression import LinearRegression

In [22]:
# Create the LinearRegression object
lr = LinearRegression(labelCol='crew')

In [23]:
# Fit the model to the training data
lr_model = lr.fit(train_data)

In [24]:
# Evaluate the model on the test set
test_results = lr_model.evaluate(test_data)

In [25]:
test_results.residuals.show()

In [26]:
test_results.r2

In [27]:
# Root mean squared error
test_results.rootMeanSquaredError
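
For reference, the two metrics read off the evaluation summary can be written out by hand. This is a minimal sketch using the standard textbook definitions, with residuals taken as label minus prediction; the function names and the toy numbers are hypothetical and unrelated to the ship data.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: square root of the mean squared residual."""
    residuals = [y - p for y, p in zip(labels, predictions)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Toy values chosen for easy arithmetic.
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.5, 5.5, 7.0, 9.0]
toy_rmse = rmse(labels, predictions)
toy_r2 = r2(labels, predictions)
print(toy_rmse, toy_r2)
```

RMSE is expressed in the units of the label, while R² is unitless, so the two answer different questions about the same residuals.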

In [28]:
final_data.describe().show()

The mean crew size is 7.94 with a standard deviation of 3.50, and our RMSE is 0.75. Relative to those values, the model appears to perform well.
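
The comparison can be made explicit with a little arithmetic on the three figures quoted above. Since crew is recorded in hundreds, an RMSE of 0.75 corresponds to an error of about 75 crew members. The last line is only a rough estimate: it assumes the test set has about the same spread as the full dataset, which the notebook does not verify.

```python
mean_crew = 7.94   # mean of crew, from the describe() output quoted above
std_crew = 3.50    # standard deviation of crew, same source
rmse = 0.75        # test-set RMSE quoted above

relative_error = rmse / mean_crew     # error as a fraction of a typical crew size
rmse_to_std = rmse / std_crew         # error relative to the spread of the target
approx_r2 = 1.0 - rmse_to_std ** 2    # rough estimate under the assumption above

print(f"RMSE / mean: {relative_error:.3f}")
print(f"RMSE / std:  {rmse_to_std:.3f}")
print(f"approx R^2:  {approx_r2:.3f}")
```

An error of roughly 9% of the mean and about a fifth of the standard deviation is what makes the fit look good, and also what makes it worth checking why.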

Next we check how strongly the predictor features correlate with the target. This tells us whether the model's results make sense or are just a coincidence.

In [31]:
from pyspark.sql.functions import corr

In [32]:
data.select(corr('crew','passengers')).show()

The crew and passengers columns are highly correlated. Intuitively this makes sense: the more passengers on board, the larger the crew needed.

In [34]:
data.select(corr('crew','cabins')).show()
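
The `corr` function used in these checks returns the Pearson correlation coefficient. As a reminder of what that number measures, here is a minimal pure-Python version of the formula. The helper `pearson` and the sample values are hypothetical and are not read from the ship data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values in which crew grows with passengers.
passengers = [6.9, 14.9, 26.4, 29.7, 37.0]
crew = [3.6, 6.7, 10.0, 11.5, 12.0]
r = pearson(passengers, crew)
print(round(r, 3))
```

A value close to 1 means the two columns move together almost linearly, which would explain why a linear model fits the crew counts so well.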