# Session 11 - Spark Workshop

In today's practice session we will combine Spark's `DataFrame` API with the `Spark MLlib` library. We will use the dataset known as **Titanic**. On April 15, 1912, during its maiden voyage, the Titanic sank after colliding with an iceberg; 1502 of the 2224 people on board died. The goal of this dataset is to analyze the characteristics of the passengers who survived.

# Loading the data
Run the following commands. Afterwards, we will split the dataset into `training` and `test`.

In [3]:
%sh wget "biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv"

Let's check where our files are with the following commands:
* `pwd` prints the current working directory.
* `ls` lists the contents of the current directory.

In [5]:
%sh pwd

In [6]:
%sh ls -lah

We load the CSV file as a `DataFrame`, keep the columns of interest, and split it into `training` and `test`

In [8]:
dataset = (sqlContext
                   .read
                   .format('com.databricks.spark.csv')
                   .options(header='true', inferSchema='true')
                   .load('file:///databricks/driver/titanic3.csv'))
dataset.cache()
dataset.printSchema()

In [9]:
dataset = dataset.select(['survived', 
                          'pclass', 
                          'name', 
                          'sex', 
                          'age', 
                          'sibsp', 
                          'parch', 
                          'ticket', 
                          'fare', 
                          'cabin', 
                          'embarked'])

In [10]:
(train, test) = dataset.randomSplit([0.7, 0.3], seed=100)
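
`randomSplit` assigns rows to the splits at random according to the given weights, so the resulting sizes are only approximately 70% and 30%, while a fixed `seed` is meant to make the assignment repeatable. The plain-Python sketch below mimics that idea without Spark; the helper `random_split` and the toy rows are illustrative assumptions, not Spark's actual implementation.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row independently to train or test (hypothetical helper)."""
    rng = random.Random(seed)          # fixed seed -> same assignment on every run
    train_part, test_part = [], []
    for row in rows:
        # Each row lands in the train part with probability `train_weight`.
        (train_part if rng.random() < train_weight else test_part).append(row)
    return train_part, test_part

rows = list(range(1000))               # stand-in for the passenger rows
train_rows, test_rows = random_split(rows, 0.7, seed=100)
again_train, again_test = random_split(rows, 0.7, seed=100)

print(len(train_rows), len(test_rows))  # close to 700 / 300, rarely exact
```

Running the split twice with the same seed yields identical parts, which is what keeps later cell outputs comparable between runs.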

Let's view the data table

In [12]:
display(train)

survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,"""Lindeberg-Lind, Mr. Erik Gustaf (""""Mr Edward Lingrey"""")""",male,42.0,0,0,17475,26.55,,S
0,1,"""Rosenshine, Mr. George (""""Mr George Thorne"""")""",male,46.0,0,0,PC 17585,79.2,,C
0,1,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S
0,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S
0,1,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S
0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
0,1,"Beattie, Mr. Thomson",male,36.0,0,0,13050,75.2417,C6,C
0,1,"Birnbaum, Mr. Jakob",male,25.0,0,0,13905,26.0,,C
0,1,"Blackwell, Mr. Stephen Weart",male,45.0,0,0,113784,35.5,T,S
0,1,"Borebank, Mr. John James",male,42.0,0,0,110489,26.55,D22,S


You can also use Databricks for other kinds of visualizations. Click the chart icon in the output of the cell below and choose *Scatter plot*.

![](https://i.imgur.com/qiTzW16.png)

Then drag `Age` and `Fare` to the `Values` field.

![](https://i.imgur.com/RoUaV0T.png)

In [14]:
display(train)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


# Working with the `DataFrame` API
## Basic grouping operations

Let's determine how many people survived in the `training` set.

In [16]:
countings = train.groupBy("survived").count()
display(countings)

survived,count
1,359
0,551


We can also count how many male passengers survived.

In [18]:
male_countings = train.filter(train.sex == 'male').groupBy("survived").count()
display(male_countings)

survived,count
1,117
0,458
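
The counts are easier to compare as rates. A small sketch in plain Python, using only the numbers shown in the two outputs above; the variable names are illustrative, and the female counts are derived by subtraction under the assumption that every passenger in the split is recorded as either male or female.

```python
# Counts copied from the two outputs above (training split).
overall = {"survived": 359, "died": 551}
male = {"survived": 117, "died": 458}

def survival_rate(counts):
    """Fraction of passengers in the group who survived."""
    return counts["survived"] / (counts["survived"] + counts["died"])

overall_rate = survival_rate(overall)
male_rate = survival_rate(male)
# Assumption: the remaining passengers are all recorded as female.
female = {k: overall[k] - male[k] for k in overall}
female_rate = survival_rate(female)

print(f"overall={overall_rate:.3f} male={male_rate:.3f} female={female_rate:.3f}")
```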


## Feature extraction and transformation

In this section we will build a new feature that indicates whether a passenger was a minor. To do so, we will use a **user-defined function**, or **UDF**, which defines an operation to apply to a column. In the following code cell, we define `set_child()` so that it receives a value corresponding to the age, checks whether it is below 18, and returns 1 if so and 0 otherwise; missing ages stay missing.

Next, we wrap `set_child()` in the Spark function `udfChild`. It is important to specify the data type that this Spark function returns. Finally, with `withColumn`, we apply the transformation defined in `udfChild` to the `age` column.

In [20]:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf, when, lit, col

def set_child(age):
  # A missing age arrives as None (or NaN, which is never equal to itself).
  if age is None or age != age:
    return None
  if age >= 18:
    return 0
  if age < 18:
    return 1
    
udfChild = udf(set_child, IntegerType())
train = train.withColumn('child', udfChild('age'))
test = test.withColumn('child', udfChild('age'))
display(train)

survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child
0,1,"""Lindeberg-Lind, Mr. Erik Gustaf (""""Mr Edward Lingrey"""")""",male,42.0,0,0,17475,26.55,,S,0.0
0,1,"""Rosenshine, Mr. George (""""Mr George Thorne"""")""",male,46.0,0,0,PC 17585,79.2,,C,0.0
0,1,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,1.0
0,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,0.0
0,1,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S,0.0
0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,0.0
0,1,"Beattie, Mr. Thomson",male,36.0,0,0,13050,75.2417,C6,C,0.0
0,1,"Birnbaum, Mr. Jakob",male,25.0,0,0,13905,26.0,,C,0.0
0,1,"Blackwell, Mr. Stephen Weart",male,45.0,0,0,113784,35.5,T,S,0.0
0,1,"Borebank, Mr. John James",male,42.0,0,0,110489,26.55,D22,S,0.0
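
Because a UDF wraps an ordinary Python function, its logic can be checked locally before Spark applies it row by row. The sketch below restates the age rule in plain Python (the name `is_child` is illustrative) and exercises the edge cases: a missing value, `NaN`, and the boundary at 18.

```python
import math

def is_child(age):
    """Return 1 for under 18, 0 otherwise, None when the age is unknown."""
    if age is None or math.isnan(age):   # missing or NaN age
        return None
    return 1 if age < 18 else 0

results = {age: is_child(age) for age in (2.0, 17.9, 18.0, 42.0)}
print(results)
print(is_child(None), is_child(float("nan")))
```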


In [21]:
child_countings = train.filter(train.child == 1).groupBy("survived").count()
display(child_countings)

survived,count
1,64
0,48


We do the same for the `sex` variable

In [23]:
def set_gender(gender):
  if gender == 'male':
    return 0
  else:
    return 1
    
udfGender = udf(set_gender, IntegerType())
train = train.withColumn('sex', udfGender('sex'))
test = test.withColumn('sex', udfGender('sex'))
display(train)

survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child
0,1,"""Lindeberg-Lind, Mr. Erik Gustaf (""""Mr Edward Lingrey"""")""",0,42.0,0,0,17475,26.55,,S,0.0
0,1,"""Rosenshine, Mr. George (""""Mr George Thorne"""")""",0,46.0,0,0,PC 17585,79.2,,C,0.0
0,1,"Allison, Miss. Helen Loraine",1,2.0,1,2,113781,151.55,C22 C26,S,1.0
0,1,"Allison, Mr. Hudson Joshua Creighton",0,30.0,1,2,113781,151.55,C22 C26,S,0.0
0,1,"Andrews, Mr. Thomas Jr",0,39.0,0,0,112050,0.0,A36,S,0.0
0,1,"Artagaveytia, Mr. Ramon",0,71.0,0,0,PC 17609,49.5042,,C,0.0
0,1,"Beattie, Mr. Thomson",0,36.0,0,0,13050,75.2417,C6,C,0.0
0,1,"Birnbaum, Mr. Jakob",0,25.0,0,0,13905,26.0,,C,0.0
0,1,"Blackwell, Mr. Stephen Weart",0,45.0,0,0,113784,35.5,T,S,0.0
0,1,"Borebank, Mr. John James",0,42.0,0,0,110489,26.55,D22,S,0.0


### Handling categorical variables
As you may have noticed, `embarked` is a categorical variable. Spark has several functions that let you transform this column into a categorical representation. First, we must transform this col

In [25]:
test_embarked = train.groupBy("embarked").count()
display(test_embarked)

embarked,count
Q,89
,1
C,176
S,644


Each value represents a different port of embarkation. Below we replace each port with a number; however, that number is not meaningful by itself, so we convert it to a *one-hot* representation. Since two passengers have no recorded port of embarkation, we create a special category for them (`handleInvalid="keep"`); you could also choose to drop them.

In [27]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

indexer = StringIndexer(inputCol="embarked", outputCol="embarkedIndex", handleInvalid="keep")
encoder = OneHotEncoderEstimator(inputCols=["embarkedIndex"], outputCols=["embarkedCategorical"])
pipeline = Pipeline(stages=[indexer, encoder])
pipeline = pipeline.fit(train)
train = pipeline.transform(train)
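
To make the two pipeline stages concrete, here is a plain-Python sketch of what they compute conceptually. It assumes the indexer numbers labels from most to least frequent and that the encoder drops the last category, which matches Spark's default settings; the helper names and the toy port list are illustrative only.

```python
from collections import Counter

def fit_indexer(values):
    """Map each label to an index, most frequent first (ties broken alphabetically here)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the final category becomes all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == index else 0.0 for i in range(size)]

ports = ["S", "C", "S", "Q", "S", "C"]      # toy data, not the real column
index_of = fit_indexer(ports)                # {'S': 0, 'C': 1, 'Q': 2}
encoded = [one_hot(index_of[p], len(index_of)) for p in ports]
print(index_of)
print(encoded[0], encoded[1], encoded[3])
```

A bare integer would suggest an ordering between ports that does not exist; the 0/1 columns avoid that.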

In [28]:
display(train)

survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child,embarkedIndex,embarkedCategorical
0,1,"""Lindeberg-Lind, Mr. Erik Gustaf (""""Mr Edward Lingrey"""")""",0,42.0,0,0,17475,26.55,,S,0.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"""Rosenshine, Mr. George (""""Mr George Thorne"""")""",0,46.0,0,0,PC 17585,79.2,,C,0.0,1.0,"List(0, 3, List(1), List(1.0))"
0,1,"Allison, Miss. Helen Loraine",1,2.0,1,2,113781,151.55,C22 C26,S,1.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Allison, Mr. Hudson Joshua Creighton",0,30.0,1,2,113781,151.55,C22 C26,S,0.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Andrews, Mr. Thomas Jr",0,39.0,0,0,112050,0.0,A36,S,0.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Artagaveytia, Mr. Ramon",0,71.0,0,0,PC 17609,49.5042,,C,0.0,1.0,"List(0, 3, List(1), List(1.0))"
0,1,"Beattie, Mr. Thomson",0,36.0,0,0,13050,75.2417,C6,C,0.0,1.0,"List(0, 3, List(1), List(1.0))"
0,1,"Birnbaum, Mr. Jakob",0,25.0,0,0,13905,26.0,,C,0.0,1.0,"List(0, 3, List(1), List(1.0))"
0,1,"Blackwell, Mr. Stephen Weart",0,45.0,0,0,113784,35.5,T,S,0.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Borebank, Mr. John James",0,42.0,0,0,110489,26.55,D22,S,0.0,0.0,"List(0, 3, List(0), List(1.0))"


Look at `embarkedCategorical`. 
What peculiarities do you notice?

In [30]:
train.select("embarkedCategorical").show()
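
In the `display` output, each `embarkedCategorical` value appears as `List(type, size, indices, values)`, where type `0` marks a sparse vector and `1` a dense one. A minimal sketch for expanding the sparse form by hand follows; the function `to_dense` is illustrative, and PySpark vectors offer their own conversion methods.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a plain list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Two rows taken from the display output above.
port_s = to_dense(3, [0], [1.0])   # embarkedIndex 0.0
port_c = to_dense(3, [1], [1.0])   # embarkedIndex 1.0
print(port_s, port_c)
```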

In [31]:
test = pipeline.transform(test)

## Replacing null values
A common strategy is to replace null values with the mean or the median; however, Spark also lets us replace them with a fixed value, for example:

```python
train = train.na.fill(10, subset=['age'])
```
In the example below, we use Spark's `Imputer` to replace null values with the median of each column.

In [33]:
from pyspark.ml.feature import Imputer

imputer = Imputer(inputCols=["age", "fare"], outputCols=["age", "fare"], strategy='median')
model = imputer.fit(train)
train = model.transform(train)
test = model.transform(test)
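
The key detail in this cell is that the median is learned from `train` only and then reused for `test`, so no information from the test split leaks into preprocessing. A plain-Python sketch of the same fit/transform pattern follows, with made-up ages; it uses the exact median from the standard library, whereas Spark's `Imputer` may compute an approximate one.

```python
import statistics

def fit_median(values):
    """Learn the median from the known (non-missing) values only."""
    return statistics.median(v for v in values if v is not None)

def fill_missing(values, fill_value):
    """Replace missing entries with a previously learned value."""
    return [fill_value if v is None else v for v in values]

train_ages = [22.0, None, 38.0, 26.0, None, 35.0]   # made-up values
test_ages = [None, 54.0]

age_median = fit_median(train_ages)                  # learned on train only
train_filled = fill_missing(train_ages, age_median)
test_filled = fill_missing(test_ages, age_median)    # reuses the train median
print(age_median, test_filled)
```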

In [34]:
display(train)

survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child,embarkedIndex,embarkedCategorical
0,1,"""Lindeberg-Lind, Mr. Erik Gustaf (""""Mr Edward Lingrey"""")""",0,42.0,0,0,17475,26.55,,S,0.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"""Rosenshine, Mr. George (""""Mr George Thorne"""")""",0,46.0,0,0,PC 17585,79.2,,C,0.0,1.0,"List(0, 3, List(1), List(1.0))"
0,1,"Allison, Miss. Helen Loraine",1,2.0,1,2,113781,151.55,C22 C26,S,1.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Allison, Mr. Hudson Joshua Creighton",0,30.0,1,2,113781,151.55,C22 C26,S,0.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Andrews, Mr. Thomas Jr",0,39.0,0,0,112050,0.0,A36,S,0.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Artagaveytia, Mr. Ramon",0,71.0,0,0,PC 17609,49.5042,,C,0.0,1.0,"List(0, 3, List(1), List(1.0))"
0,1,"Beattie, Mr. Thomson",0,36.0,0,0,13050,75.2417,C6,C,0.0,1.0,"List(0, 3, List(1), List(1.0))"
0,1,"Birnbaum, Mr. Jakob",0,25.0,0,0,13905,26.0,,C,0.0,1.0,"List(0, 3, List(1), List(1.0))"
0,1,"Blackwell, Mr. Stephen Weart",0,45.0,0,0,113784,35.5,T,S,0.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Borebank, Mr. John James",0,42.0,0,0,110489,26.55,D22,S,0.0,0.0,"List(0, 3, List(0), List(1.0))"


In [35]:
display(test)

survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child,embarkedIndex,embarkedCategorical
0,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",1,25.0,1,2,113781,151.55,C22 C26,S,0.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Astor, Col. John Jacob",0,47.0,1,0,PC 17757,227.525,C62 C64,C,0.0,1.0,"List(0, 3, List(1), List(1.0))"
0,1,"Baumann, Mr. John D",0,28.0,0,0,PC 17318,25.925,,S,,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Baxter, Mr. Quigg Edmond",0,24.0,0,1,PC 17558,247.5208,B58 B60,C,0.0,1.0,"List(0, 3, List(1), List(1.0))"
0,1,"Brandeis, Mr. Emil",0,48.0,0,0,PC 17591,50.4958,B10,C,0.0,1.0,"List(0, 3, List(1), List(1.0))"
0,1,"Cairns, Mr. Alexander",0,28.0,0,0,113798,31.0,,S,,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Carlsson, Mr. Frans Olof",0,33.0,0,0,695,5.0,B51 B53 B55,S,0.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Carrau, Mr. Jose Pedro",0,17.0,0,0,113059,47.1,,S,1.0,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Chisholm, Mr. Roderick Robert Crispin",0,28.0,0,0,112051,0.0,,S,,0.0,"List(0, 3, List(0), List(1.0))"
0,1,"Clark, Mr. Walter Miller",0,27.0,1,0,13508,136.7792,C89,C,0.0,1.0,"List(0, 3, List(1), List(1.0))"


## Building a feature matrix

In [37]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler

target_labeler = StringIndexer(inputCol='survived', outputCol='label').fit(train)

train_set = target_labeler.transform(train)

In [38]:
assembler = VectorAssembler(
 inputCols=['pclass', 'sex','age', 'fare', 'embarkedCategorical'],
 outputCol="features")

train_set = assembler.transform(train_set)
test_set = assembler.transform(test)
test_set = target_labeler.transform(test_set)

display(train_set)

survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child,embarkedIndex,embarkedCategorical,label,features
0,1,"""Lindeberg-Lind, Mr. Erik Gustaf (""""Mr Edward Lingrey"""")""",0,42.0,0,0,17475,26.55,,S,0.0,0.0,"List(0, 3, List(0), List(1.0))",0.0,"List(1, 7, List(), List(1.0, 0.0, 42.0, 26.55, 1.0, 0.0, 0.0))"
0,1,"""Rosenshine, Mr. George (""""Mr George Thorne"""")""",0,46.0,0,0,PC 17585,79.2,,C,0.0,1.0,"List(0, 3, List(1), List(1.0))",0.0,"List(1, 7, List(), List(1.0, 0.0, 46.0, 79.2, 0.0, 1.0, 0.0))"
0,1,"Allison, Miss. Helen Loraine",1,2.0,1,2,113781,151.55,C22 C26,S,1.0,0.0,"List(0, 3, List(0), List(1.0))",0.0,"List(1, 7, List(), List(1.0, 1.0, 2.0, 151.55, 1.0, 0.0, 0.0))"
0,1,"Allison, Mr. Hudson Joshua Creighton",0,30.0,1,2,113781,151.55,C22 C26,S,0.0,0.0,"List(0, 3, List(0), List(1.0))",0.0,"List(1, 7, List(), List(1.0, 0.0, 30.0, 151.55, 1.0, 0.0, 0.0))"
0,1,"Andrews, Mr. Thomas Jr",0,39.0,0,0,112050,0.0,A36,S,0.0,0.0,"List(0, 3, List(0), List(1.0))",0.0,"List(0, 7, List(0, 2, 4), List(1.0, 39.0, 1.0))"
0,1,"Artagaveytia, Mr. Ramon",0,71.0,0,0,PC 17609,49.5042,,C,0.0,1.0,"List(0, 3, List(1), List(1.0))",0.0,"List(1, 7, List(), List(1.0, 0.0, 71.0, 49.5042, 0.0, 1.0, 0.0))"
0,1,"Beattie, Mr. Thomson",0,36.0,0,0,13050,75.2417,C6,C,0.0,1.0,"List(0, 3, List(1), List(1.0))",0.0,"List(1, 7, List(), List(1.0, 0.0, 36.0, 75.2417, 0.0, 1.0, 0.0))"
0,1,"Birnbaum, Mr. Jakob",0,25.0,0,0,13905,26.0,,C,0.0,1.0,"List(0, 3, List(1), List(1.0))",0.0,"List(1, 7, List(), List(1.0, 0.0, 25.0, 26.0, 0.0, 1.0, 0.0))"
0,1,"Blackwell, Mr. Stephen Weart",0,45.0,0,0,113784,35.5,T,S,0.0,0.0,"List(0, 3, List(0), List(1.0))",0.0,"List(1, 7, List(), List(1.0, 0.0, 45.0, 35.5, 1.0, 0.0, 0.0))"
0,1,"Borebank, Mr. John James",0,42.0,0,0,110489,26.55,D22,S,0.0,0.0,"List(0, 3, List(0), List(1.0))",0.0,"List(1, 7, List(), List(1.0, 0.0, 42.0, 26.55, 1.0, 0.0, 0.0))"
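
`VectorAssembler` concatenates the listed columns, in order, into a single vector; a vector-valued column such as `embarkedCategorical` contributes all of its entries. The following sketch reproduces the `features` value of the first row of the output above in plain Python (the `assemble` helper is illustrative).

```python
def assemble(row, columns):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for name in columns:
        value = row[name]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)   # vector column: append every entry
        else:
            features.append(float(value))              # scalar column: append one entry
    return features

first_row = {"pclass": 1, "sex": 0, "age": 42.0, "fare": 26.55,
             "embarkedCategorical": [1.0, 0.0, 0.0]}
features = assemble(first_row, ["pclass", "sex", "age", "fare", "embarkedCategorical"])
print(features)
```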


## Defining an evaluation function

In [40]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

def evaluate(model, dataset):
  predictions = model.transform(dataset)
  #display(predictions)
  evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                                metricName="accuracy")
  accuracy = evaluator.evaluate(predictions)
  print("Set accuracy = " + str(accuracy))
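
With `metricName="accuracy"`, the evaluator reports the fraction of rows whose `prediction` equals `label`. A plain-Python sketch of that metric on made-up labels:

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

labels = [0.0, 0.0, 1.0, 1.0, 0.0]        # made-up ground truth
predictions = [0.0, 1.0, 1.0, 1.0, 0.0]   # made-up model output
score = accuracy(labels, predictions)
print("Set accuracy = " + str(score))
```

Keep a baseline in mind when reading this number: in the training split above, always predicting class 0 would already be right for 551 of 910 passengers, about 0.61.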

## Training a decision tree

In [42]:
from pyspark.ml.classification import DecisionTreeClassifier

model = DecisionTreeClassifier(maxDepth=2).fit(train_set)
evaluate(model, train_set)

In [43]:
evaluate(model, test_set)

## Analyzing the results

Let's inspect the rules of the tree

In [46]:
print(model.toDebugString)

In [47]:
predictions = model.transform(test_set)
display(predictions)

survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,child,embarkedIndex,embarkedCategorical,features,label,rawPrediction,probability,prediction
0,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",1,25.0,1,2,113781,151.55,C22 C26,S,0.0,0.0,"List(0, 3, List(0), List(1.0))","List(1, 7, List(), List(1.0, 1.0, 25.0, 151.55, 1.0, 0.0, 0.0))",0.0,"List(1, 2, List(), List(15.0, 171.0))","List(1, 2, List(), List(0.08064516129032258, 0.9193548387096774))",1.0
0,1,"Astor, Col. John Jacob",0,47.0,1,0,PC 17757,227.525,C62 C64,C,0.0,1.0,"List(0, 3, List(1), List(1.0))","List(1, 7, List(), List(1.0, 0.0, 47.0, 227.525, 0.0, 1.0, 0.0))",0.0,"List(1, 2, List(), List(447.0, 98.0))","List(1, 2, List(), List(0.8201834862385321, 0.1798165137614679))",0.0
0,1,"Baumann, Mr. John D",0,28.0,0,0,PC 17318,25.925,,S,,0.0,"List(0, 3, List(0), List(1.0))","List(1, 7, List(), List(1.0, 0.0, 28.0, 25.925, 1.0, 0.0, 0.0))",0.0,"List(1, 2, List(), List(447.0, 98.0))","List(1, 2, List(), List(0.8201834862385321, 0.1798165137614679))",0.0
0,1,"Baxter, Mr. Quigg Edmond",0,24.0,0,1,PC 17558,247.5208,B58 B60,C,0.0,1.0,"List(0, 3, List(1), List(1.0))","List(1, 7, List(), List(1.0, 0.0, 24.0, 247.5208, 0.0, 1.0, 0.0))",0.0,"List(1, 2, List(), List(447.0, 98.0))","List(1, 2, List(), List(0.8201834862385321, 0.1798165137614679))",0.0
0,1,"Brandeis, Mr. Emil",0,48.0,0,0,PC 17591,50.4958,B10,C,0.0,1.0,"List(0, 3, List(1), List(1.0))","List(1, 7, List(), List(1.0, 0.0, 48.0, 50.4958, 0.0, 1.0, 0.0))",0.0,"List(1, 2, List(), List(447.0, 98.0))","List(1, 2, List(), List(0.8201834862385321, 0.1798165137614679))",0.0
0,1,"Cairns, Mr. Alexander",0,28.0,0,0,113798,31.0,,S,,0.0,"List(0, 3, List(0), List(1.0))","List(1, 7, List(), List(1.0, 0.0, 28.0, 31.0, 1.0, 0.0, 0.0))",0.0,"List(1, 2, List(), List(447.0, 98.0))","List(1, 2, List(), List(0.8201834862385321, 0.1798165137614679))",0.0
0,1,"Carlsson, Mr. Frans Olof",0,33.0,0,0,695,5.0,B51 B53 B55,S,0.0,0.0,"List(0, 3, List(0), List(1.0))","List(1, 7, List(), List(1.0, 0.0, 33.0, 5.0, 1.0, 0.0, 0.0))",0.0,"List(1, 2, List(), List(447.0, 98.0))","List(1, 2, List(), List(0.8201834862385321, 0.1798165137614679))",0.0
0,1,"Carrau, Mr. Jose Pedro",0,17.0,0,0,113059,47.1,,S,1.0,0.0,"List(0, 3, List(0), List(1.0))","List(1, 7, List(), List(1.0, 0.0, 17.0, 47.1, 1.0, 0.0, 0.0))",0.0,"List(1, 2, List(), List(447.0, 98.0))","List(1, 2, List(), List(0.8201834862385321, 0.1798165137614679))",0.0
0,1,"Chisholm, Mr. Roderick Robert Crispin",0,28.0,0,0,112051,0.0,,S,,0.0,"List(0, 3, List(0), List(1.0))","List(0, 7, List(0, 2, 4), List(1.0, 28.0, 1.0))",0.0,"List(1, 2, List(), List(447.0, 98.0))","List(1, 2, List(), List(0.8201834862385321, 0.1798165137614679))",0.0
0,1,"Clark, Mr. Walter Miller",0,27.0,1,0,13508,136.7792,C89,C,0.0,1.0,"List(0, 3, List(1), List(1.0))","List(1, 7, List(), List(1.0, 0.0, 27.0, 136.7792, 0.0, 1.0, 0.0))",0.0,"List(1, 2, List(), List(447.0, 98.0))","List(1, 2, List(), List(0.8201834862385321, 0.1798165137614679))",0.0
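
In this output, `rawPrediction` holds the number of training rows of each class that reached the leaf, and `probability` is that vector normalized to sum to 1. A quick check in plain Python using the two leaves that appear in the rows above (the `normalize` helper is illustrative):

```python
def normalize(raw_prediction):
    """Turn per-class counts at a leaf into class probabilities."""
    total = sum(raw_prediction)
    return [count / total for count in raw_prediction]

leaf_a = normalize([15.0, 171.0])    # first row of the output
leaf_b = normalize([447.0, 98.0])    # remaining rows shown
predicted_a = leaf_a.index(max(leaf_a))   # class with the highest probability
predicted_b = leaf_b.index(max(leaf_b))
print(leaf_a, predicted_a)
print(leaf_b, predicted_b)
```

The predicted class is simply the one with the larger share at the leaf, which is why the first row is predicted as 1.0 and the others as 0.0.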


Use the plotting option to visually assess the relationship between age and survival

In [49]:
display(predictions.select("age", "survived"))

age,survived
25.0,0
47.0,0
28.0,0
24.0,0
48.0,0
28.0,0
33.0,0
17.0,0
28.0,0
27.0,0


# Workshop

In this workshop we will work with the same dataset, but we will use `Random Forest` and study how to perform cross-validation. With cross-validation, your goal is to explore a near-optimal number of trees for this problem.

* Cross-validation: 
We now want to explore a near-optimal set of parameters for our model. To do so, we generate a parameter grid over the number of trees. We create an instance of `ParamGridBuilder()`, where we specify the parameter and the set of values to explore. Next, we create a `CrossValidator` object, where we specify the model to validate, the parameter grid, the performance metric, and the number of folds. Finally, we call `fit()` on the training set. Use the following code as a starting point:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

model = RandomForestClassifier()
paramGrid = ParamGridBuilder() \
    .addGrid(model.numTrees, [5, 10, 15, 20, 25, 50]) \
    .build()

crossval = CrossValidator(estimator=model,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                                metricName="accuracy"),
                          numFolds=5)

cvModel = crossval.fit(train_set)
```
* `cvModel` can be used like any trained model. Use the `evaluate` function to measure its performance on the test set.
```python
evaluate(cvModel, test_set)
```
* Visualize the complexity curve in Spark using the following code:
```python
exploration = sqlContext.createDataFrame(list(zip([5, 10, 15, 20, 25, 50], cvModel.avgMetrics)), ['numTrees', 'score'])
display(exploration)
```
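
The cross-validation procedure described above can be sketched without Spark: for every candidate value, train and score once per fold, average the scores, and keep the candidate with the best average. In the sketch below, `k_fold_indices`, `grid_search` and the toy `score_fn` are illustrative assumptions standing in for real model training; only the looping structure mirrors what `CrossValidator` does.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs; every row is used for validation exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

def grid_search(candidates, n_rows, k, score_fn):
    """Return (best candidate, average score per candidate)."""
    avg_metrics = []
    for candidate in candidates:
        scores = [score_fn(candidate, tr, va) for tr, va in k_fold_indices(n_rows, k)]
        avg_metrics.append(sum(scores) / len(scores))
    best = candidates[avg_metrics.index(max(avg_metrics))]
    return best, avg_metrics

calls = []

def score_fn(num_trees, train_idx, valid_idx):
    """Toy score that peaks at 20 trees; a real one would fit and evaluate a model."""
    calls.append(num_trees)
    return 1.0 - abs(num_trees - 20) / 100

best, avg_metrics = grid_search([5, 10, 15, 20, 25, 50], n_rows=100, k=5, score_fn=score_fn)
print(best, len(calls))   # 6 candidates x 5 folds = 30 fits
```

With the grid used in this workshop, `CrossValidator` therefore fits 30 models before refitting the best setting on the full training set, which is why that cell can take a while to run.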

# Deliverables
Go to the top of the page and download this notebook as HTML, then rename it as follows: 

* s11_firstname_lastname.html 

And submit it to the following file request:
[Dropbox link](https://www.dropbox.com/request/T8oiFuVbiUm8pzhB9f4w)

Due date: Monday, July 22, 2019

# Help
* Detailed PySpark documentation: https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html
* Spark MLlib documentation with examples: https://spark.apache.org/docs/2.3.1/ml-features.html