# Apache Spark Example

The goal of this workshop is to familiarize students with the Apache Spark execution environment and its basic functions.

## Installing PySpark

PySpark is a Python library that gives access to the Apache Spark engine, which is written in Scala, so that Spark can be used from Python. It can currently be installed easily with pip, as shown below. Spark runs on the JVM, so the first two cells install OpenJDK 8 and set `JAVA_HOME`.

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

In [0]:
!pip install pyspark
!pip install -q findspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/37/98/244399c0daa7894cdf387e7007d5e8b3710a79b67f3fd991c0b0b644822d/pyspark-2.4.3.tar.gz (215.6MB)
Collecting py4j==0.10.7 (from pyspark)
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/8d/20/f0/b30e2024226dc112e256930dd2cd4f06d00ab053c86278dcf3
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.4.3


## Creating a Spark context

The Spark context is the fundamental object that enables communication with the cluster. **Only one active context can exist per JVM.**

In [0]:
import pyspark
from pyspark import SparkContext
sc = SparkContext()

## Creating an RDD

An RDD is created with the `parallelize` method of the `SparkContext` object.

In [0]:
# Convert the list of values into an RDD
nums = sc.parallelize([1, 2, 3, 4])

In [0]:
# Take one of the values
nums.take(1)

[1]

In [0]:
# Basic operations on an RDD
squared = nums.map(lambda x: x*x)

In [0]:
squared.collect()

[1, 4, 9, 16]

In [0]:
# Show the result
for i in squared.collect():
    print(i)

1
4
9
16
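RDD transformations such as `map` are lazy: Spark only records them, and nothing is computed until an action such as `collect` or `take` runs. The sketch below is a plain-Python analogy (no Spark involved) in which a generator stands in for the transformation and `list()` for the action. The helper `square` and the `calls` log are illustrative names, not part of the notebook.

```python
# Plain-Python analogy of lazy evaluation; this is not Spark code.
calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


nums = [1, 2, 3, 4]
squared = (square(x) for x in nums)  # "transformation": nothing runs yet
calls_before = list(calls)           # still empty at this point
result = list(squared)               # "action": forces the computation

print(calls_before, result)
```

The same reasoning explains why the `map` cell above returns immediately, while `collect()` is the call that triggers the work.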


## Spark DataFrames

Although RDDs are robust and flexible data structures, in many cases it is simpler and more convenient to handle data as DataFrames. To do so, we need an `SQLContext` object.

In [0]:
from pyspark.sql import Row
from pyspark.sql import SQLContext

In [0]:
# Create the sqlContext object
sqlContext = SQLContext(sc)

In [0]:
# Create a list of tuples
lista_p = [("John", 20), ("Camila", 22), ("Andres", 25), ("Nancy", 18)]

In [0]:
# Create an RDD from the list of tuples above
ppl_rdd = sc.parallelize(lista_p)

In [0]:
# Map each tuple to a Row with named fields; the schema is inferred from these Rows
ppl = ppl_rdd.map(lambda x: Row(name=x[0], age=int(x[1])))

In [0]:
# Create a DataFrame from the RDD of Rows
DF_ppl = sqlContext.createDataFrame(ppl)

In [0]:
# Print the DataFrame schema to check that it was built correctly
DF_ppl.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [0]:
DF_ppl.show()

+---+------+
|age|  name|
+---+------+
| 20|  John|
| 22|Camila|
| 25|Andres|
| 18| Nancy|
+---+------+



## Machine Learning with Spark

In the following example we build a complete machine learning pipeline with Spark.

### Loading the data

We download the dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/. It is a dataset for classifying breast cancer tumors and has the following comma-separated columns:

| # | Attribute | Domain |
|---|-----------|--------|
| 1 | Sample code number | id number |
| 2 | Clump Thickness | 1 - 10 |
| 3 | Uniformity of Cell Size | 1 - 10 |
| 4 | Uniformity of Cell Shape | 1 - 10 |
| 5 | Marginal Adhesion | 1 - 10 |
| 6 | Single Epithelial Cell Size | 1 - 10 |
| 7 | Bare Nuclei | 1 - 10 |
| 8 | Bland Chromatin | 1 - 10 |
| 9 | Normal Nucleoli | 1 - 10 |
| 10 | Mitoses | 1 - 10 |
| 11 | Class | 2 for benign, 4 for malignant |

In [0]:
# Download the dataset file
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

--2019-07-23 03:37:26--  https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19889 (19K) [application/x-httpd-php]
Saving to: ‘breast-cancer-wisconsin.data’


2019-07-23 03:37:26 (664 KB/s) - ‘breast-cancer-wisconsin.data’ saved [19889/19889]



We will load the file from our local file system. To do so, we first need to add it as a Spark resource.

In [0]:
from pyspark import SparkFiles

In [0]:
# Add the data file as a Spark resource
sc.addFile("breast-cancer-wisconsin.data")

In [0]:
# Read the data file with the help of the sqlContext
# sqlContext.read.csv("breast-cancer-wisconsin.data", header=False, inferSchema=True)
df_cancer = sqlContext.read.csv(SparkFiles.get("breast-cancer-wisconsin.data"), header=False, inferSchema=True)

In [0]:
# Check the inferred types by printing the schema
df_cancer.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: integer (nullable = true)
 |-- _c2: integer (nullable = true)
 |-- _c3: integer (nullable = true)
 |-- _c4: integer (nullable = true)
 |-- _c5: integer (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: integer (nullable = true)
 |-- _c8: integer (nullable = true)
 |-- _c9: integer (nullable = true)
 |-- _c10: integer (nullable = true)



Column `_c6` was inferred as a string because the file marks missing values with `?`. We list the rows where `_c6` is missing.

In [0]:
df_cancer.where(df_cancer["_c6"] == "?").show()

+-------+---+---+---+---+---+---+---+---+---+----+
|    _c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|
+-------+---+---+---+---+---+---+---+---+---+----+
|1057013|  8|  4|  5|  1|  2|  ?|  7|  3|  1|   4|
|1096800|  6|  6|  6|  9|  6|  ?|  7|  8|  1|   2|
|1183246|  1|  1|  1|  1|  1|  ?|  2|  1|  1|   2|
|1184840|  1|  1|  3|  1|  2|  ?|  2|  1|  1|   2|
|1193683|  1|  1|  2|  1|  3|  ?|  1|  1|  1|   2|
|1197510|  5|  1|  1|  1|  2|  ?|  3|  1|  1|   2|
|1241232|  3|  1|  4|  1|  2|  ?|  3|  1|  1|   2|
| 169356|  3|  1|  1|  1|  2|  ?|  3|  1|  1|   2|
| 432809|  3|  1|  3|  1|  2|  ?|  2|  1|  1|   2|
| 563649|  8|  8|  8|  1|  2|  ?|  6| 10|  1|   4|
| 606140|  1|  1|  1|  1|  2|  ?|  2|  1|  1|   2|
|  61634|  5|  4|  3|  1|  2|  ?|  2|  3|  1|   2|
| 704168|  4|  6|  5|  6|  7|  ?|  4|  9|  1|   2|
| 733639|  3|  1|  1|  1|  2|  ?|  3|  1|  1|   2|
|1238464|  1|  1|  1|  1|  1|  ?|  2|  1|  1|   2|
|1057067|  1|  1|  1|  1|  1|  ?|  1|  1|  1|   2|
+-------+---+---+---+---+---+--

We keep only the rows with valid values.

In [0]:
df_cancer = df_cancer.where(df_cancer["_c6"] != "?")

We display the first 5 rows of the data.

In [0]:
# Show the first 5 rows
df_cancer.show(5, truncate=False)

+-------+---+---+---+---+---+---+---+---+---+----+
|_c0    |_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|
+-------+---+---+---+---+---+---+---+---+---+----+
|1000025|5  |1  |1  |1  |2  |1  |3  |1  |1  |2   |
|1002945|5  |4  |4  |5  |7  |10 |3  |2  |1  |2   |
|1015425|3  |1  |1  |1  |2  |2  |3  |1  |1  |2   |
|1016277|6  |8  |8  |1  |3  |4  |3  |7  |1  |2   |
|1017023|4  |1  |1  |3  |2  |1  |3  |1  |1  |2   |
+-------+---+---+---+---+---+---+---+---+---+----+
only showing top 5 rows



The columns were given unclear automatic names, so we rename them.

In [0]:
target_names = ["Sample code number", "Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape", "Marginal Adhesion", "Single Epithelial Cell Size",
                "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class"]

In [0]:
# Rename the columns: _c0, _c1, ... become the descriptive names above
for i, name in enumerate(target_names):
    df_cancer = df_cancer.withColumnRenamed("_c{}".format(i), name)

In [0]:
# Print the schema to confirm
df_cancer.printSchema()

root
 |-- Sample code number: integer (nullable = true)
 |-- Clump Thickness: integer (nullable = true)
 |-- Uniformity of Cell Size: integer (nullable = true)
 |-- Uniformity of Cell Shape: integer (nullable = true)
 |-- Marginal Adhesion: integer (nullable = true)
 |-- Single Epithelial Cell Size: integer (nullable = true)
 |-- Bare Nuclei: string (nullable = true)
 |-- Bland Chromatin: integer (nullable = true)
 |-- Normal Nucleoli: integer (nullable = true)
 |-- Mitoses: integer (nullable = true)
 |-- Class: integer (nullable = true)



In [0]:
df_cancer.show(5, truncate=False)

+------------------+---------------+-----------------------+------------------------+-----------------+---------------------------+-----------+---------------+---------------+-------+-----+
|Sample code number|Clump Thickness|Uniformity of Cell Size|Uniformity of Cell Shape|Marginal Adhesion|Single Epithelial Cell Size|Bare Nuclei|Bland Chromatin|Normal Nucleoli|Mitoses|Class|
+------------------+---------------+-----------------------+------------------------+-----------------+---------------------------+-----------+---------------+---------------+-------+-----+
|1000025           |5              |1                      |1                       |1                |2                          |1          |3              |1              |1      |2    |
|1002945           |5              |4                      |4                       |5                |7                          |10         |3              |2              |1      |2    |
|1015425           |3              |1             

We select several columns.

In [0]:
df_cancer.select("Uniformity of Cell Size", "Uniformity of Cell Shape").show(5)

+-----------------------+------------------------+
|Uniformity of Cell Size|Uniformity of Cell Shape|
+-----------------------+------------------------+
|                      1|                       1|
|                      4|                       4|
|                      1|                       1|
|                      8|                       8|
|                      1|                       1|
+-----------------------+------------------------+
only showing top 5 rows



Next, we group the data and count the rows in each group.

In [0]:
df_cancer.groupBy("Class").count().sort("count", ascending=True).show()

+-----+-----+
|Class|count|
+-----+-----+
|    4|  239|
|    2|  444|
+-----+-----+



We convert the `Bare Nuclei` column from string to integer.

In [0]:
df_cancer = df_cancer.withColumn("Bare Nuclei", df_cancer["Bare Nuclei"].cast("int"))
df_cancer.show(4)

+------------------+---------------+-----------------------+------------------------+-----------------+---------------------------+-----------+---------------+---------------+-------+-----+
|Sample code number|Clump Thickness|Uniformity of Cell Size|Uniformity of Cell Shape|Marginal Adhesion|Single Epithelial Cell Size|Bare Nuclei|Bland Chromatin|Normal Nucleoli|Mitoses|Class|
+------------------+---------------+-----------------------+------------------------+-----------------+---------------------------+-----------+---------------+---------------+-------+-----+
|           1000025|              5|                      1|                       1|                1|                          2|          1|              3|              1|      1|    2|
|           1002945|              5|                      4|                       4|                5|                          7|         10|              3|              2|      1|    2|
|           1015425|              3|              

In [0]:
df_cancer.printSchema()

root
 |-- Sample code number: integer (nullable = true)
 |-- Clump Thickness: integer (nullable = true)
 |-- Uniformity of Cell Size: integer (nullable = true)
 |-- Uniformity of Cell Shape: integer (nullable = true)
 |-- Marginal Adhesion: integer (nullable = true)
 |-- Single Epithelial Cell Size: integer (nullable = true)
 |-- Bare Nuclei: integer (nullable = true)
 |-- Bland Chromatin: integer (nullable = true)
 |-- Normal Nucleoli: integer (nullable = true)
 |-- Mitoses: integer (nullable = true)
 |-- Class: integer (nullable = true)



We describe the data using basic summary statistics:
* Count
* Mean
* Standard deviation
* Minimum
* Maximum

In [0]:
df_cancer.describe().show()

+-------+------------------+------------------+-----------------------+------------------------+-----------------+---------------------------+------------------+-----------------+------------------+------------------+------------------+
|summary|Sample code number|   Clump Thickness|Uniformity of Cell Size|Uniformity of Cell Shape|Marginal Adhesion|Single Epithelial Cell Size|       Bare Nuclei|  Bland Chromatin|   Normal Nucleoli|           Mitoses|             Class|
+-------+------------------+------------------+-----------------------+------------------------+-----------------+---------------------------+------------------+-----------------+------------------+------------------+------------------+
|  count|               683|               683|                    683|                     683|              683|                        683|               683|              683|               683|               683|               683|
|   mean|1076720.2269399706|  4.44216691068814|     

To describe a single column, pass the column name to the `describe()` method. Several column names can be passed as well.

In [0]:
df_cancer.describe("Clump Thickness").show()

+-------+------------------+
|summary|   Clump Thickness|
+-------+------------------+
|  count|               683|
|   mean|  4.44216691068814|
| stddev|2.8207613188371266|
|    min|                 1|
|    max|                10|
+-------+------------------+



In [0]:
df_cancer.describe("Clump Thickness", "Uniformity of Cell Size").show()

+-------+------------------+-----------------------+
|summary|   Clump Thickness|Uniformity of Cell Size|
+-------+------------------+-----------------------+
|  count|               683|                    683|
|   mean|  4.44216691068814|      3.150805270863836|
| stddev|2.8207613188371266|     3.0651448557860426|
|    min|                 1|                      1|
|    max|                10|                     10|
+-------+------------------+-----------------------+
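The five statistics reported by `describe()` can be reproduced by hand on a small list. The sketch below uses Python's `statistics` module on hypothetical values (not taken from the dataset). It uses the sample standard deviation (dividing by n - 1), which is the convention behind the `stddev` row above.

```python
import statistics

# Hypothetical column values, for illustration only
values = [5, 5, 3, 6, 4]

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
    "min": min(values),
    "max": max(values),
}

for name, value in summary.items():
    print(name, value)
```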



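`crosstab` can reveal interesting relationships between two columns. The first column of the result holds the values of `Uniformity of Cell Size` as strings, which is why the sort below places 10 right after 1.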

In [0]:
df_cancer.crosstab("Uniformity of Cell Size", "Class").sort("Uniformity of Cell Size_Class", ascending=True).show()

+-----------------------------+---+---+
|Uniformity of Cell Size_Class|  2|  4|
+-----------------------------+---+---+
|                            1|369|  4|
|                           10|  0| 67|
|                            2| 37|  8|
|                            3| 27| 25|
|                            4|  8| 30|
|                            5|  0| 30|
|                            6|  0| 25|
|                            7|  1| 18|
|                            8|  1| 27|
|                            9|  1|  5|
+-----------------------------+---+---+
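A cross-tabulation is a count of how often each pair of values occurs. The sketch below builds one in plain Python with `collections.Counter` on hypothetical `(cell size, class)` pairs, and also contrasts sorting the row keys as text with sorting them as numbers. All names and data are illustrative.

```python
from collections import Counter

# Hypothetical (cell size, class) pairs, for illustration only
rows = [(1, 2), (1, 2), (10, 4), (2, 2), (2, 4), (10, 4)]

counts = Counter(rows)
classes = sorted({c for _, c in rows})

# Sorting the keys as text puts "10" before "2"; sorting as numbers does not
sizes_as_text = sorted({str(s) for s, _ in rows})
sizes_numeric = sorted({s for s, _ in rows})

# One row per cell size, one count per class
table = {s: [counts[(s, c)] for c in classes] for s in sizes_numeric}

print(sizes_as_text)
print(table)
```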



We drop the rows with null values.

In [0]:
df_cancer = df_cancer.dropna()

In [0]:
df_cancer.count()

683

We filter the data, here counting the rows where `Mitoses` is greater than 5.

In [0]:
df_cancer.filter(df_cancer["Mitoses"] > 5).count()

34

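We add a column of computed values to the DataFrame. The power operator on a column produces a floating-point result, so the new column has type double.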

In [0]:
df_cancer2 = df_cancer.withColumn("Mitoses_square", df_cancer["Mitoses"]**2)

In [0]:
df_cancer2.printSchema()

root
 |-- Sample code number: integer (nullable = true)
 |-- Clump Thickness: integer (nullable = true)
 |-- Uniformity of Cell Size: integer (nullable = true)
 |-- Uniformity of Cell Shape: integer (nullable = true)
 |-- Marginal Adhesion: integer (nullable = true)
 |-- Single Epithelial Cell Size: integer (nullable = true)
 |-- Bare Nuclei: integer (nullable = true)
 |-- Bland Chromatin: integer (nullable = true)
 |-- Normal Nucleoli: integer (nullable = true)
 |-- Mitoses: integer (nullable = true)
 |-- Class: integer (nullable = true)
 |-- Mitoses_square: double (nullable = true)



In [0]:
df_cancer2.show(4)

+------------------+---------------+-----------------------+------------------------+-----------------+---------------------------+-----------+---------------+---------------+-------+-----+--------------+
|Sample code number|Clump Thickness|Uniformity of Cell Size|Uniformity of Cell Shape|Marginal Adhesion|Single Epithelial Cell Size|Bare Nuclei|Bland Chromatin|Normal Nucleoli|Mitoses|Class|Mitoses_square|
+------------------+---------------+-----------------------+------------------------+-----------------+---------------------------+-----------+---------------+---------------+-------+-----+--------------+
|           1000025|              5|                      1|                       1|                1|                          2|          1|              3|              1|      1|    2|           1.0|
|           1002945|              5|                      4|                       4|                5|                          7|         10|              3|              2|     

In [0]:
# Drop the column created earlier
df_cancer2.drop("Mitoses_square").columns

['Sample code number',
 'Clump Thickness',
 'Uniformity of Cell Size',
 'Uniformity of Cell Shape',
 'Marginal Adhesion',
 'Single Epithelial Cell Size',
 'Bare Nuclei',
 'Bland Chromatin',
 'Normal Nucleoli',
 'Mitoses',
 'Class']

### Machine Learning classification with Spark

In [0]:
# Import the LinearSVC model
from pyspark.ml.classification import LinearSVC
from pyspark.sql.functions import udf

In [0]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler

We select the columns that will make up the features.

In [0]:
feature_cols = df_cancer.columns[1:10]
print(feature_cols)

['Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses']


In [0]:
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

In [0]:
df = assembler.transform(df_cancer)
df.show(4)

+------------------+---------------+-----------------------+------------------------+-----------------+---------------------------+-----------+---------------+---------------+-------+-----+--------------------+
|Sample code number|Clump Thickness|Uniformity of Cell Size|Uniformity of Cell Shape|Marginal Adhesion|Single Epithelial Cell Size|Bare Nuclei|Bland Chromatin|Normal Nucleoli|Mitoses|Class|            features|
+------------------+---------------+-----------------------+------------------------+-----------------+---------------------------+-----------+---------------+---------------+-------+-----+--------------------+
|           1000025|              5|                      1|                       1|                1|                          2|          1|              3|              1|      1|    2|[5.0,1.0,1.0,1.0,...|
|           1002945|              5|                      4|                       4|                5|                          7|         10|             

We keep only the `Class` and `features` columns by selecting them.

In [0]:
df = df.select("Class", "features")
df.show(4)

+-----+--------------------+
|Class|            features|
+-----+--------------------+
|    2|[5.0,1.0,1.0,1.0,...|
|    2|[5.0,4.0,4.0,5.0,...|
|    2|[3.0,1.0,1.0,1.0,...|
|    2|[6.0,8.0,8.0,1.0,...|
+-----+--------------------+
only showing top 4 rows



We encode the labels using the `StringIndexer` class.

In [0]:
indexer = StringIndexer(inputCol="Class", outputCol="label")
indexed = indexer.fit(df)
df = indexed.transform(df)
df.show(4)

+-----+--------------------+-----+
|Class|            features|label|
+-----+--------------------+-----+
|    2|[5.0,1.0,1.0,1.0,...|  0.0|
|    2|[5.0,4.0,4.0,5.0,...|  0.0|
|    2|[3.0,1.0,1.0,1.0,...|  0.0|
|    2|[6.0,8.0,8.0,1.0,...|  0.0|
+-----+--------------------+-----+
only showing top 4 rows
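By default `StringIndexer` assigns indices by descending frequency, so the most frequent value gets 0.0. That matches the output above, where class 2 (444 rows) becomes 0.0. The plain-Python sketch below imitates that ordering on a hypothetical list of classes. The alphabetical tie-break is an assumption of the sketch, not something the notebook shows.

```python
from collections import Counter

# Hypothetical class column, for illustration only
classes = [2, 2, 4, 2, 4, 2]

# Values are treated as strings and ranked by how often they occur
freq = Counter(str(c) for c in classes)
ordered = sorted(freq, key=lambda v: (-freq[v], v))  # tie-break is assumed

index = {value: float(i) for i, value in enumerate(ordered)}
labels = [index[str(c)] for c in classes]

print(index)
print(labels)
```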



We split the dataset into train and test sets using stratified sampling, keeping roughly 70% of each class for training.

In [0]:
train = df.sampleBy("label", fractions={0.0: 0.7, 1.0: 0.7}, seed=42)
train.show(4)

+-----+--------------------+-----+
|Class|            features|label|
+-----+--------------------+-----+
|    2|[5.0,1.0,1.0,1.0,...|  0.0|
|    2|[4.0,1.0,1.0,3.0,...|  0.0|
|    4|[8.0,10.0,10.0,8....|  1.0|
|    2|[1.0,1.0,1.0,1.0,...|  0.0|
+-----+--------------------+-----+
only showing top 4 rows



We check that the sample is stratified: each class keeps roughly 70% of its rows (313 of 444 and 172 of 239).

In [0]:
train.groupby("label").count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0|  313|
|  1.0|  172|
+-----+-----+
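The counts are close to 70% of each class but not exactly 70%. `sampleBy` keeps each row independently with the probability given for its stratum, so the sample size varies from run to run unless the seed is fixed. The function below is a plain-Python sketch of that idea, not Spark's implementation. `sample_by` and the placeholder rows are hypothetical; only the class sizes (444 and 239) come from the notebook.

```python
import random


def sample_by(rows, key, fractions, seed):
    """Keep each row with the probability assigned to its stratum."""
    rng = random.Random(seed)
    kept = []
    for row in rows:
        if rng.random() < fractions.get(key(row), 0.0):
            kept.append(row)
    return kept


# Placeholder rows: (label, row id), with the class sizes seen earlier
rows = [(0.0, i) for i in range(444)] + [(1.0, i) for i in range(239)]

train = sample_by(rows, key=lambda r: r[0],
                  fractions={0.0: 0.7, 1.0: 0.7}, seed=42)
per_class = {label: sum(1 for r in train if r[0] == label)
             for label in (0.0, 1.0)}

print(per_class)
```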



We generate the test set from the rows that are not in the training set.

In [0]:
test = df.subtract(train)
test.show(4)

+-----+--------------------+-----+
|Class|            features|label|
+-----+--------------------+-----+
|    4|[10.0,10.0,10.0,3...|  1.0|
|    4|[8.0,10.0,10.0,10...|  1.0|
|    4|[10.0,3.0,4.0,5.0...|  1.0|
|    2|[4.0,1.0,4.0,1.0,...|  0.0|
+-----+--------------------+-----+
only showing top 4 rows



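We check the class counts in the test set. Note that `subtract` behaves like SQL `EXCEPT DISTINCT`: it drops duplicate rows and every row whose values also appear in `train`. The test set therefore has 117 rows rather than the 198 (683 - 485) left out of `train`, and its class proportions no longer match the original data.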

In [0]:
test.groupBy("label").count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0|   50|
|  1.0|   67|
+-----+-----+
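The effect of a distinct set difference on data with repeated rows can be seen on a toy example. The plain-Python sketch below compares a set difference (the `EXCEPT DISTINCT` behaviour) with a multiset difference that removes each training row only once. The data is hypothetical. In Spark, `DataFrame.exceptAll` (available from 2.4) or `DataFrame.randomSplit` are possible alternatives, though the notebook does not use them.

```python
from collections import Counter

# Hypothetical (features, label) rows; the first row occurs three times
full = [("a", 0.0), ("a", 0.0), ("a", 0.0), ("b", 1.0), ("c", 1.0)]
train = [("a", 0.0), ("b", 1.0)]

# Distinct difference: every copy of a row seen in train disappears
set_diff = sorted(set(full) - set(train))

# Multiset difference: each training row removes exactly one copy
multiset_diff = sorted((Counter(full) - Counter(train)).elements())

print(set_diff)
print(multiset_diff)
```

With the distinct difference the test set loses the two remaining copies of the repeated row, which is the same mechanism that shrinks the benign class in the counts above.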



We keep only the features and labels in both the train and test sets.

In [0]:
train = train.select("label", "features")
train.show(4)
test = test.select("label", "features")
test.show(4)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|[5.0,1.0,1.0,1.0,...|
|  0.0|[4.0,1.0,1.0,3.0,...|
|  1.0|[8.0,10.0,10.0,8....|
|  0.0|[1.0,1.0,1.0,1.0,...|
+-----+--------------------+
only showing top 4 rows

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|[10.0,10.0,10.0,3...|
|  1.0|[8.0,10.0,10.0,10...|
|  1.0|[10.0,3.0,4.0,5.0...|
|  0.0|[4.0,1.0,4.0,1.0,...|
+-----+--------------------+
only showing top 4 rows



We create our model with its hyperparameters.

In [0]:
model = LinearSVC(maxIter=100, regParam=0.01)

We train the model.

In [0]:
model_trained = model.fit(train)

We predict on the test set using the trained model.

In [0]:
test_prediction = model_trained.transform(test)
test_prediction.show(4)

+-----+--------------------+--------------------+----------+
|label|            features|       rawPrediction|prediction|
+-----+--------------------+--------------------+----------+
|  1.0|[10.0,10.0,10.0,3...|[-5.5967786576080...|       1.0|
|  1.0|[8.0,10.0,10.0,10...|[-5.6500907583149...|       1.0|
|  1.0|[10.0,3.0,4.0,5.0...|[-1.8537633714647...|       1.0|
|  0.0|[4.0,1.0,4.0,1.0,...|[1.55325298020805...|       0.0|
+-----+--------------------+--------------------+----------+
only showing top 4 rows



We evaluate the model.

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

In [0]:
evaluator_1 = MulticlassClassificationEvaluator(predictionCol="prediction")

In [0]:
print("Metric {}: {}".format(evaluator_1.getMetricName(), evaluator_1.evaluate(test_prediction)))

Metric f1: 0.9311903877121268
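When no `metricName` is given, `MulticlassClassificationEvaluator` reports F1 averaged over the classes, with each class weighted by its share of the true labels. The function below is a minimal plain-Python sketch of that weighted F1 on hypothetical labels. `weighted_f1` is an illustrative helper, not part of PySpark.

```python
def weighted_f1(y_true, y_pred):
    """F1 per class, averaged with weights proportional to class size."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * (tp + fn) / total  # weight by true class frequency
    return score


# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(weighted_f1(y_true, y_pred))
```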


## Example with text (unstructured data)

The following example shows how to process unstructured data such as text.

We download the YouTube spam dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/00380/.

In [0]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip

--2019-07-23 03:38:08--  https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 163567 (160K) [application/x-httpd-php]
Saving to: ‘YouTube-Spam-Collection-v1.zip’


2019-07-23 03:38:08 (1.76 MB/s) - ‘YouTube-Spam-Collection-v1.zip’ saved [163567/163567]



We unzip the downloaded file.

In [0]:
!unzip YouTube-Spam-Collection-v1.zip

Archive:  YouTube-Spam-Collection-v1.zip
  inflating: Youtube01-Psy.csv       
   creating: __MACOSX/
  inflating: __MACOSX/._Youtube01-Psy.csv  
  inflating: Youtube02-KatyPerry.csv  
  inflating: __MACOSX/._Youtube02-KatyPerry.csv  
  inflating: Youtube03-LMFAO.csv     
  inflating: __MACOSX/._Youtube03-LMFAO.csv  
  inflating: Youtube04-Eminem.csv    
  inflating: __MACOSX/._Youtube04-Eminem.csv  
  inflating: Youtube05-Shakira.csv   
  inflating: __MACOSX/._Youtube05-Shakira.csv  


We list the files in the working directory (/content).

In [0]:
!ls

breast-cancer-wisconsin.data  Youtube03-LMFAO.csv
__MACOSX		      Youtube04-Eminem.csv
sample_data		      Youtube05-Shakira.csv
Youtube01-Psy.csv	      YouTube-Spam-Collection-v1.zip
Youtube02-KatyPerry.csv


We load the comments from the Shakira video.

In [0]:
df_yt = sqlContext.read.csv("Youtube05-Shakira.csv", header=True, inferSchema=True)
df_yt.show(4)

+--------------------+--------------------+--------------------+--------------------+-----+
|          COMMENT_ID|              AUTHOR|                DATE|             CONTENT|CLASS|
+--------------------+--------------------+--------------------+--------------------+-----+
|z13lgffb5w3ddx1ul...|          dharma pal|2015-05-29 02:30:...|          Nice song﻿|    0|
|z123dbgb0mqjfxbtz...|       Tiza Arellano|2015-05-29 00:14:...|       I love song ﻿|    0|
|z12quxxp2vutflkxv...|Prìñçeśś Âliś Łøv...|2015-05-28 21:00:...|       I love song ﻿|    0|
|z12icv3ysqvlwth2c...|       Eric Gonzalez|2015-05-28 20:47:...|860,000,000 lets ...|    0|
+--------------------+--------------------+--------------------+--------------------+-----+
only showing top 4 rows



In [0]:
df_yt.printSchema()

root
 |-- COMMENT_ID: string (nullable = true)
 |-- AUTHOR: string (nullable = true)
 |-- DATE: timestamp (nullable = true)
 |-- CONTENT: string (nullable = true)
 |-- CLASS: string (nullable = true)



We extract features from the `CONTENT` column, which holds the comments.

In [0]:
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel, IDFModel, HashingTF, Tokenizer, Word2Vec, StopWordsRemover, IDF

We tokenize the text and store the tokens in another column.

In [0]:
tokenizer = Tokenizer(inputCol="CONTENT", outputCol="words")

In [0]:
df_yt = tokenizer.transform(df_yt)
df_yt.show(4)

+--------------------+--------------------+--------------------+--------------------+-----+--------------------+
|          COMMENT_ID|              AUTHOR|                DATE|             CONTENT|CLASS|               words|
+--------------------+--------------------+--------------------+--------------------+-----+--------------------+
|z13lgffb5w3ddx1ul...|          dharma pal|2015-05-29 02:30:...|          Nice song﻿|    0|       [nice, song﻿]|
|z123dbgb0mqjfxbtz...|       Tiza Arellano|2015-05-29 00:14:...|       I love song ﻿|    0|  [i, love, song, ﻿]|
|z12quxxp2vutflkxv...|Prìñçeśś Âliś Łøv...|2015-05-28 21:00:...|       I love song ﻿|    0|  [i, love, song, ﻿]|
|z12icv3ysqvlwth2c...|       Eric Gonzalez|2015-05-28 20:47:...|860,000,000 lets ...|    0|[860,000,000, let...|
+--------------------+--------------------+--------------------+--------------------+-----+--------------------+
only showing top 4 rows
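`Tokenizer` lowercases the text and splits it on whitespace, which is why an invisible trailing character survives as its own token in the output above. The plain-Python sketch below imitates that behaviour and shows one way to strip the character first. It assumes the invisible character is U+FEFF and that splitting happens on single ASCII whitespace characters. `tokenize` and `tokenize_clean` are hypothetical helpers.

```python
import re

BOM = "\ufeff"  # assumption: the invisible character seen in the comments


def tokenize(text):
    """Lowercase, split on single ASCII whitespace, drop trailing empties."""
    tokens = re.split(r"[ \t\n\x0b\f\r]", text.lower())
    while tokens and tokens[-1] == "":
        tokens.pop()
    return tokens


def tokenize_clean(text):
    """Same as tokenize, but removes the invisible character and empty tokens."""
    return [t for t in tokenize(text.replace(BOM, " ")) if t]


raw = tokenize("I love song " + BOM)
clean = tokenize_clean("I love song " + BOM)
print(raw)
print(clean)
```

Cleaning the text before tokenizing would also make `song` and `song` followed by the invisible character count as the same word in the later feature steps.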



We remove the stop words.

In [0]:
stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered_words")

In [0]:
df_yt = stopWordsRemover.transform(df_yt)
df_yt.show(4)

+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+
|          COMMENT_ID|              AUTHOR|                DATE|             CONTENT|CLASS|               words|      filtered_words|
+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+
|z13lgffb5w3ddx1ul...|          dharma pal|2015-05-29 02:30:...|          Nice song﻿|    0|       [nice, song﻿]|       [nice, song﻿]|
|z123dbgb0mqjfxbtz...|       Tiza Arellano|2015-05-29 00:14:...|       I love song ﻿|    0|  [i, love, song, ﻿]|     [love, song, ﻿]|
|z12quxxp2vutflkxv...|Prìñçeśś Âliś Łøv...|2015-05-28 21:00:...|       I love song ﻿|    0|  [i, love, song, ﻿]|     [love, song, ﻿]|
|z12icv3ysqvlwth2c...|       Eric Gonzalez|2015-05-28 20:47:...|860,000,000 lets ...|    0|[860,000,000, let...|[860,000,000, let...|
+--------------------+--------------------+-------------------

We extract features from the filtered-words column: first bag of words, then TF-IDF, and finally a word embedding (word2vec).

Bag-of-words features

In [0]:
bowExtractor = CountVectorizer(inputCol="filtered_words", outputCol="bow", minDF=2.0)
extractor = bowExtractor.fit(df_yt)
df_yt = extractor.transform(df_yt)
df_yt.show(4)

+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+
|          COMMENT_ID|              AUTHOR|                DATE|             CONTENT|CLASS|               words|      filtered_words|                 bow|
+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+
|z13lgffb5w3ddx1ul...|          dharma pal|2015-05-29 02:30:...|          Nice song﻿|    0|       [nice, song﻿]|       [nice, song﻿]|(508,[42,62],[1.0...|
|z123dbgb0mqjfxbtz...|       Tiza Arellano|2015-05-29 00:14:...|       I love song ﻿|    0|  [i, love, song, ﻿]|     [love, song, ﻿]|(508,[3,4,27],[1....|
|z12quxxp2vutflkxv...|Prìñçeśś Âliś Łøv...|2015-05-28 21:00:...|       I love song ﻿|    0|  [i, love, song, ﻿]|     [love, song, ﻿]|(508,[3,4,27],[1....|
|z12icv3ysqvlwth2c...|       Eric Gonzalez|2015-05-28 20:47:...|860,00
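A bag-of-words vector counts how often each vocabulary term occurs in a document. With `minDF=2.0`, a term enters the vocabulary only if it appears in at least two documents. The plain-Python sketch below reproduces that idea on hypothetical token lists. The vocabulary is ordered by corpus frequency, and the alphabetical tie-break is an assumption of the sketch.

```python
from collections import Counter

# Hypothetical tokenized documents, for illustration only
docs = [["nice", "song"], ["love", "song"], ["love", "song", "song"]]
min_df = 2

doc_freq = Counter(t for d in docs for t in set(d))  # documents containing t
term_freq = Counter(t for d in docs for t in d)      # total occurrences of t

# Keep terms seen in at least min_df documents, most frequent first
vocab = sorted((t for t in term_freq if doc_freq[t] >= min_df),
               key=lambda t: (-term_freq[t], t))
index = {t: i for i, t in enumerate(vocab)}

# Sparse representation: {vocabulary index: count} per document
bow = [{index[t]: c for t, c in Counter(d).items() if t in index}
       for d in docs]

print(vocab)
print(bow)
```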

TF-IDF features 

In [0]:
# Term frequencies via the hashing trick (only 20 buckets, so collisions are likely)
hashingTF = HashingTF(inputCol="filtered_words", outputCol="raw_tf", numFeatures=20)
df_yt = hashingTF.transform(df_yt)
# Rescale the raw term frequencies by inverse document frequency
idf = IDF(inputCol="raw_tf", outputCol="tfidf")
idfModel = idf.fit(df_yt)
df_yt = idfModel.transform(df_yt)
df_yt.show(4)

+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|          COMMENT_ID|              AUTHOR|                DATE|             CONTENT|CLASS|               words|      filtered_words|                 bow|              raw_tf|               tfidf|
+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|z13lgffb5w3ddx1ul...|          dharma pal|2015-05-29 02:30:...|          Nice song﻿|    0|       [nice, song﻿]|       [nice, song﻿]|(508,[42,62],[1.0...|(20,[10,18],[1.0,...|(20,[10,18],[0.76...|
|z123dbgb0mqjfxbtz...|       Tiza Arellano|2015-05-29 00:14:...|       I love song ﻿|    0|  [i, love, song, ﻿]|     [love, song, ﻿]|(508,[3,4,27],[1....|(20,[0,7,10],[1.0...|(20,[0,7,10],[0.9...|
|z12quxxp2vutfl
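The two steps above can be sketched in plain Python. Each token is hashed into one of `num_features` buckets to get term frequencies, and each bucket is then scaled by log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents that hit the bucket. `zlib.crc32` is only a stand-in for Spark's hash function, so bucket numbers will differ from the output above. The documents are hypothetical.

```python
import math
import zlib
from collections import Counter


def hashing_tf(tokens, num_features):
    """Count tokens per hash bucket (crc32 is a stand-in hash)."""
    return Counter(zlib.crc32(t.encode("utf-8")) % num_features
                   for t in tokens)


# Hypothetical tokenized documents, for illustration only
docs = [["nice", "song"], ["love", "song"], ["love", "song"]]
num_features = 20

tf = [hashing_tf(d, num_features) for d in docs]
m = len(docs)

# Document frequency per bucket, then smoothed inverse document frequency
doc_freq = Counter(bucket for vec in tf for bucket in vec)
idf = {b: math.log((m + 1) / (n + 1)) for b, n in doc_freq.items()}

tfidf = [{b: c * idf[b] for b, c in vec.items()} for vec in tf]
print(tfidf)
```

A bucket hit by every document gets weight 0, which is how TF-IDF discounts terms that carry no information about a specific document.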

Word2vec embedding features

In [0]:
w2vec = Word2Vec(vectorSize=5, seed=42, inputCol="filtered_words", outputCol="w2vec")
w2vec_tr = w2vec.fit(df_yt)
df_yt = w2vec_tr.transform(df_yt)
df_yt.show(4)

+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|          COMMENT_ID|              AUTHOR|                DATE|             CONTENT|CLASS|               words|      filtered_words|                 bow|              raw_tf|               tfidf|               w2vec|
+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|z13lgffb5w3ddx1ul...|          dharma pal|2015-05-29 02:30:...|          Nice song﻿|    0|       [nice, song﻿]|       [nice, song﻿]|(508,[42,62],[1.0...|(20,[10,18],[1.0,...|(20,[10,18],[0.76...|[0.07292558252811...|
|z123dbgb0mqjfxbtz...|       Tiza Arellano|2015-05-29 00:14:...|       I love song ﻿|    0|  [i, love, song, ﻿]|     [love, song
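The fitted Word2Vec model turns each document into a single vector by averaging the vectors of its words. The plain-Python sketch below shows that averaging step with hypothetical 3-dimensional word vectors. Dividing by the total number of tokens, including unknown ones, is an assumption of the sketch. `document_vector` is an illustrative helper.

```python
# Hypothetical word vectors; a fitted model would learn these from the data
word_vectors = {
    "love": [0.2, 0.4, -0.6],
    "song": [0.4, 0.0, 0.2],
}


def document_vector(tokens, vectors, size):
    """Average the vectors of the known tokens in a document."""
    total = [0.0] * size
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    n = len(tokens)
    return [x / n for x in total] if n else total


doc = document_vector(["love", "song"], word_vectors, size=3)
print(doc)
```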

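We encode the labels.

In [0]:
indexer = StringIndexer(inputCol="CLASS", outputCol="label")
# Assumed completion of the truncated line, mirroring the earlier StringIndexer cell
df_yt = indexer.fit(df_yt).transform(df_yt)

Now we train the model.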
