## Extracting Data from MongoDB

1. MongoDB Spark Connector documentation: https://docs.mongodb.com/spark-connector/current/
2. Mongo Spark on Spark Packages: https://spark-packages.org/package/mongodb/mongo-spark
3. Launch Jupyter Notebook with PySpark and the MongoDB Spark Connector: **$SPARK_HOME/bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.0**

In [1]:
from pyspark.sql import SparkSession

**Creating the Spark Session**

In [2]:
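# Not needed when launched through bin/pyspark, which already provides `spark`.
# Uncomment to build the session manually; it is named `spark` to match the cells below.
# spark = SparkSession \
#     .builder \
#     .appName("myApp") \
#     .config("spark.mongodb.input.uri", "mongodb://localhost/store.clothes") \
#     .config("spark.mongodb.output.uri", "mongodb://localhost/store.clothes") \
#     .getOrCreate()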
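
The session configuration in the commented cell can be sketched as a small helper for runs outside the `pyspark` shell. The names `mongo_uri`, `build_session`, and `STORE_URI` are hypothetical and introduced only for illustration. The two `spark.mongodb.*.uri` keys are the ones from the commented cell, and the connector package from step 3 is assumed to be on the classpath already.

```python
def mongo_uri(database, collection, host="localhost"):
    """Build a MongoDB URI of the form used in this notebook."""
    return f"mongodb://{host}/{database}.{collection}"


def build_session(uri, app_name="myApp"):
    """Create (or reuse) a SparkSession that reads from and writes to `uri`."""
    # Imported lazily so the URI helper works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.mongodb.input.uri", uri)
        .config("spark.mongodb.output.uri", uri)
        .getOrCreate()
    )


# Same database and collection as the rest of the notebook.
STORE_URI = mongo_uri("store", "clothes")

# Uncomment when not running inside the pyspark shell:
# spark = build_session(STORE_URI)
```

With the URIs set on the session, `spark.read.format("mongo").load()` can omit the per-call `uri` option.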

In [3]:
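# Long-form name of the "mongo" data source; with no "uri" option it relies on
# spark.mongodb.input.uri being set on the session.
# df = spark.read.format('com.mongodb.spark.sql.DefaultSource').load()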

**Reading Data**

In [4]:
data = spark.read.format("mongo").option("uri", "mongodb://localhost/store.clothes").load()

In [5]:
data.printSchema()

root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- item: string (nullable = true)
 |-- qty: double (nullable = true)
 |-- size: struct (nullable = true)
 |    |-- h: double (nullable = true)
 |    |-- w: double (nullable = true)
 |    |-- uom: string (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [6]:
data.count()

4

In [7]:
data.head()

Row(_id=Row(oid='5f50f2a029127016b1ffe361'), item='Shirt', qty=25.0, size=Row(h=14.0, w=21.0, uom='cm'), tags=['white', 'red'])

In [8]:
data.show()

+--------------------+-----+----+-----------------+-------------+
|                 _id| item| qty|             size|         tags|
+--------------------+-----+----+-----------------+-------------+
|[5f50f2a029127016...|Shirt|25.0| [14.0, 21.0, cm]| [white, red]|
|[5f50f2a029127016...|Dress|85.0| [27.9, 35.5, cm]|       [gray]|
|[5f50f2a029127016...|Pants|45.0|[19.0, 22.85, cm]|[green, blue]|
|[5f5172a1d7877900...|Shoes|30.0|             null|         null|
+--------------------+-----+----+-----------------+-------------+
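
Because `_id` and `size` are structs, `show()` prints them as bracketed, truncated values. A minimal sketch of flattening them into top-level columns follows. The mapping `FLAT_COLUMNS` and the function `flatten` are hypothetical names, and the dotted paths are taken from the `printSchema()` output above.

```python
# Output column name -> dotted path into the schema printed by printSchema().
FLAT_COLUMNS = {
    "id": "_id.oid",
    "item": "item",
    "qty": "qty",
    "height": "size.h",
    "width": "size.w",
    "uom": "size.uom",
}


def flatten(df, columns=FLAT_COLUMNS):
    """Return a DataFrame with one top-level column per entry in `columns`."""
    # Imported lazily so the mapping can be inspected without PySpark installed.
    from pyspark.sql.functions import col

    return df.select([col(path).alias(name) for name, path in columns.items()])


# Usage inside the notebook, where `data` is the DataFrame loaded earlier:
# flatten(data).show(truncate=False)
```

The `tags` array is left out of the mapping here; it could be added as `"tags": "tags"` if it is needed in the flat view.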



**Inserting Data**

In [9]:
newItem = spark.createDataFrame([("Purse", 40)], ["item", "qty"])

In [10]:
newItem.write.format("mongo").option("uri", "mongodb://localhost/store.clothes").mode("append").save()
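
The cell above builds `qty` from the Python integer `40`, which Spark infers as a long, while the schema read from the collection reports `qty` as a double. One way to keep the types aligned is to convert the values and pass an explicit schema, as in this sketch. `NEW_ITEM_SCHEMA`, `as_rows`, and `append_items` are hypothetical names, and the write call mirrors the one used in the notebook.

```python
# DDL schema string matching the `item` and `qty` fields shown by printSchema().
NEW_ITEM_SCHEMA = "item string, qty double"


def as_rows(items):
    """Convert (item, qty) pairs into tuples whose qty is a Python float."""
    return [(str(item), float(qty)) for item, qty in items]


def append_items(spark, items, uri):
    """Append `items` to the collection at `uri` through the "mongo" format."""
    df = spark.createDataFrame(as_rows(items), schema=NEW_ITEM_SCHEMA)
    df.write.format("mongo").option("uri", uri).mode("append").save()
    return df


# Usage inside the notebook, where `spark` is the active session:
# append_items(spark, [("Purse", 40)], "mongodb://localhost/store.clothes")
```

The `float()` conversion matters because Spark's schema verification rejects a Python `int` for a `double` column.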

In [11]:
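# `data` is lazy and was never cached, so show() re-reads the collection
# and includes the newly appended document.
data.show()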

+--------------------+-----+----+-----------------+-------------+
|                 _id| item| qty|             size|         tags|
+--------------------+-----+----+-----------------+-------------+
|[5f50f2a029127016...|Shirt|25.0| [14.0, 21.0, cm]| [white, red]|
|[5f50f2a029127016...|Dress|85.0| [27.9, 35.5, cm]|       [gray]|
|[5f50f2a029127016...|Pants|45.0|[19.0, 22.85, cm]|[green, blue]|
|[5f5172a1d7877900...|Shoes|30.0|             null|         null|
|[5f51738fddd1653e...|Purse|40.0|             null|         null|
+--------------------+-----+----+-----------------+-------------+

