##### Title: Spark Connection with MongoDB
##### Author: Dr. Gabriel Guerrero, saxsa2000@gmail.com
##### Date: 20190702



### Connecting Spark to MongoDB

This notebook shows how to read a MongoDB collection with Spark.

The approach is to read a MongoDB collection and convert it to a Spark DataFrame.

DataFrame methods are then applied to it, and the result, a new Spark DataFrame, is written back to a MongoDB collection.

![mongo_spark](spark_mongo.jpg)
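
Both the read and the write step described above are driven by a MongoDB connection URI that names the database and the collection. The following is a minimal sketch of how such URIs can be assembled; the helper `mongo_uri` is hypothetical (it is not part of Spark or of the MongoDB connector), and the host, database, and collection names are the ones used later in this notebook.

```python
def mongo_uri(database, collection, host="127.0.0.1", read_preference=None):
    """Build a MongoDB URI that targets one collection (hypothetical helper)."""
    uri = f"mongodb://{host}/{database}.{collection}"
    if read_preference is not None:
        # Optional query parameter, as used for the read URIs in this notebook.
        uri += f"?readPreference={read_preference}"
    return uri


# URI for reading the cliente collection.
print(mongo_uri("bimbodb", "cliente", read_preference="primaryPreferred"))

# URI for writing results to a new collection.
print(mongo_uri("bimbodb", "coleccion_nueva"))

# With a live SparkSession named `spark` and the MongoDB connector available
# (neither is created in this sketch), the URI would be passed as an option:
# df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
#     .option("spark.mongodb.input.uri", mongo_uri("bimbodb", "cliente")) \
#     .load()
```

Keeping the URI construction in one place makes it harder for the read and write targets to drift apart when the database or host changes.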

## The source data are BIMBO data from a Kaggle competition

In [1]:
%%bash

jps

24928 SecondaryNameNode
24737 DataNode
25300 LivyServer
27513 Jps
24572 NameNode


In [2]:
%%bash

systemctl status mongod


● mongod.service - High-performance, schema-free document-oriented database
   Loaded: loaded (/usr/lib/systemd/system/mongod.service; enabled; vendor preset: disabled)
   Active: active (running) since mar 2019-07-02 16:25:33 CDT; 6h ago
     Docs: https://docs.mongodb.org/manual
  Process: 1158 ExecStart=/usr/bin/mongod $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 1135 ExecStartPre=/usr/bin/chmod 0755 /var/run/mongodb (code=exited, status=0/SUCCESS)
  Process: 1119 ExecStartPre=/usr/bin/chown mongod:mongod /var/run/mongodb (code=exited, status=0/SUCCESS)
  Process: 1114 ExecStartPre=/usr/bin/mkdir -p /var/run/mongodb (code=exited, status=0/SUCCESS)
 Main PID: 1529 (mongod)
    Tasks: 23
   CGroup: /system.slice/mongod.service
           └─1529 /usr/bin/mongod -f /etc/mongod.conf


=== Start the MongoDB client in a Linux terminal ===

mongo

=== Once the MongoDB client is running, list the databases ===

show dbs

=== Switch to the database named bimbodb ===

use bimbodb

=== List the collections in the database ===

show collections

=== Show the contents of the collections ===

db.cliente.find()
db.producto.find()
db.town_state.find().pretty()
db.train.find().pretty()

=== Exit the MongoDB client ===

exit

In [3]:
%%bash

start-dfs.sh

Starting namenodes on [hostsaxsa]
hostsaxsa: namenode running as process 24572. Stop it first.
hostsaxsa: datanode running as process 24737. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 24928. Stop it first.


In [4]:
%%bash 

start-spark.sh

starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/logs/spark-saxsa-org.apache.spark.deploy.master.Master-1-hostsaxsa.out
hostsaxsa: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/logs/spark-saxsa-org.apache.spark.deploy.worker.Worker-1-hostsaxsa.out


In [5]:
%%bash 

jps


24928 SecondaryNameNode
24737 DataNode
27938 Master
25300 LivyServer
28037 Worker
24572 NameNode
28092 Jps


In [6]:
from pyspark.sql.functions import count

Starting Spark application


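| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|----|---------------------|------|-------|----------|------------|------------------|
| 3 | | pyspark3 | idle | | | ✔ |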


SparkSession available as 'spark'.


### Reading the cliente collection from the bimbodb database

In [7]:
df_c = spark.read.format("com.mongodb.spark.sql.DefaultSource")\
    .option("spark.mongodb.input.uri", "mongodb://127.0.0.1/bimbodb.cliente?readPreference=primaryPreferred")\
    .option("spark.mongodb.output.uri", "mongodb://127.0.0.1/bimbodb.cliente").load()

In [8]:
df_c

DataFrame[Cliente_ID: int, NombreCliente: string, _id: struct<oid:string>]

In [9]:
df_c = df_c.drop('_id')
df_c.describe().show()

+-------+------------------+--------------------+
|summary|        Cliente_ID|       NombreCliente|
+-------+------------------+--------------------+
|  count|            450000|              450000|
|   mean|2349831.5673133335|   4103.354430379747|
| stddev| 1915668.265326387|  22887.275797071834|
|    min|                 2|056 THE AIRPORT M...|
|    max|          19988629|                ÑEKA|
+-------+------------------+--------------------+

In [10]:
df_c.select('Cliente_ID').distinct().count()

448852

In [11]:
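# Count how many times each Cliente_ID appears and keep only the IDs
# that do not appear exactly once, sorted by number of appearances.
df_cli = df_c.select('Cliente_ID').groupby('Cliente_ID')\
            .agg(count('Cliente_ID').alias('Apariciones'))\
            .sort('Apariciones', ascending=False)\
            .where("Apariciones != 1")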

df_cli.show()

+----------+-----------+
|Cliente_ID|Apariciones|
+----------+-----------+
|   6137009|          2|
|    525200|          2|
|    149688|          2|
|    406654|          2|
|    370466|          2|
|    101541|          2|
|    164940|          2|
|    103066|          2|
|    496802|          2|
|   1453247|          2|
|    373944|          2|
|     19758|          2|
|    996554|          2|
|   1190343|          2|
|   2403824|          2|
|    390498|          2|
|     21633|          2|
|    523171|          2|
|    321977|          2|
|     69340|          2|
+----------+-----------+
only showing top 20 rows

In [12]:
df_cli.count()

1148
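
The three counts above are consistent with each other. The sketch below mirrors the group-and-filter step in plain Python on a tiny made-up list of IDs (hypothetical values, not the real data), and then checks the arithmetic with the figures reported by the cells above.

```python
from collections import Counter

# Hypothetical miniature of the Cliente_ID column (not the real data).
client_ids = [2, 7, 7, 11, 15, 15, 20]

# Same idea as groupby + count + where("Apariciones != 1").
appearances = Counter(client_ids)
repeated = {cid: n for cid, n in appearances.items() if n != 1}
print(repeated)  # {7: 2, 15: 2}

# Figures reported by the notebook cells above.
total_rows = 450000     # count from describe()
distinct_ids = 448852   # distinct Cliente_ID values
repeated_ids = 1148     # IDs that do not appear exactly once

# Rows beyond the first occurrence of each ID.
surplus_rows = total_rows - distinct_ids
print(surplus_rows)                  # 1148
print(surplus_rows == repeated_ids)  # True
```

Each repeated ID contributes at least one surplus row, so the surplus can equal the number of repeated IDs only if every one of them appears exactly twice. That matches the `Apariciones` column in the output shown earlier.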

In [13]:
df_c.filter("Cliente_ID == 1453247").show(2, False)

+----------+------------------------+
|Cliente_ID|NombreCliente           |
+----------+------------------------+
|1453247   |NOVEDADES Y V  J  INGRID|
|1453247   |NOVEDADES Y V J INGRID  |
+----------+------------------------+
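
The two rows above share an ID and differ only in the number of spaces between words. The notebook does not deduplicate them; the following is a minimal sketch of one possible cleanup, assuming that names differing only in whitespace should be treated as the same client. The helper `normalize_name` is hypothetical and runs on plain Python tuples rather than on the Spark DataFrame.

```python
def normalize_name(name):
    """Collapse runs of whitespace to a single space and trim both ends."""
    return " ".join(name.split())


# The two rows shown above for Cliente_ID 1453247.
rows = [
    (1453247, "NOVEDADES Y V  J  INGRID"),
    (1453247, "NOVEDADES Y V J INGRID"),
]

# After normalization both rows collapse into one (ID, name) pair.
unique_rows = sorted({(cid, normalize_name(name)) for cid, name in rows})
print(unique_rows)  # [(1453247, 'NOVEDADES Y V J INGRID')]
```

Whether every repeated ID in the collection follows this pattern is not shown by the notebook, so the result of such a cleanup would need to be checked against the full list of repeated IDs.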

In [14]:
df_p = spark.read.format("com.mongodb.spark.sql.DefaultSource")\
    .option("spark.mongodb.input.uri", "mongodb://127.0.0.1/bimbodb.producto?readPreference=primaryPreferred")\
    .option("spark.mongodb.output.uri", "mongodb://127.0.0.1/bimbodb.producto").load()

In [15]:
df_p

DataFrame[NombreProducto: string, Producto_ID: int, _id: struct<oid:string>]

In [16]:
df_p = df_p.drop('_id')
df_p.describe().show()

+-------+--------------------+------------------+
|summary|      NombreProducto|       Producto_ID|
+-------+--------------------+------------------+
|  count|                2592|              2592|
|   mean|                null|32591.095679012345|
| stddev|                null|13004.091023722118|
|    min|100pct Whole Whea...|                 0|
|    max|Wonderbutter 680g...|             49997|
+-------+--------------------+------------------+

In [17]:
df_p.createOrReplaceTempView('producto')

In [18]:
%%sql
SELECT * FROM producto WHERE NombreProducto LIKE '%Pan%' limit 100


In [19]:
df_cons = spark.sql("SELECT * FROM producto WHERE NombreProducto LIKE '%Pan%'")

In [20]:
df_cons.show()

+--------------------+-----------+
|      NombreProducto|Producto_ID|
+--------------------+-----------+
|Pan Multigrano Li...|         73|
|Pan Blanco 567g W...|         99|
|Super Pan Bco Ajo...|        100|
|Pan Multicereal 4...|        109|
|Pan Multigrano 68...|        713|
|Pan 100pct Integr...|        715|
|Pan Blanco Siluet...|        779|
|Panitos Chocolate...|        357|
|Pan Bolsa 2a 500g...|       1031|
|Pan 12 Granos 680...|       1039|
|Panque Marmol 255...|       1064|
|Pan Blanco Chico ...|       1109|
|Pan Blanco 2Pq 13...|       1111|
|Super Pan Blanco ...|       1112|
|Pan Blanco 680g B...|       1120|
|Pan Integral Gde ...|       1143|
|Pan Integral 480g...|       1144|
|Pan Integral 680g...|       1145|
|Pan Integral 675g...|       1146|
|Panque Pasas 255g...|       1230|
+--------------------+-----------+
only showing top 20 rows

In [21]:
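# Write the query result to the collection bimbodb.coleccion_nueva.
# With mode "append", rerunning this cell adds the same documents again.
df_cons.write.format("com.mongodb.spark.sql.DefaultSource")\
    .mode("append")\
    .option("spark.mongodb.input.uri", "mongodb://127.0.0.1/bimbodb.coleccion_nueva?readPreference=primaryPreferred")\
    .option("spark.mongodb.output.uri", "mongodb://127.0.0.1/bimbodb.coleccion_nueva").save()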