## Store data in MongoDB

This notebook demonstrates uploading data to MongoDB from a Structured Streaming job that uses a checkpoint.

The data can be read from events captured in the data lake or from a Delta table.

### Option - 1
##### Structured Streaming Upload
This option uploads the data as and when it is stored in the data lake.

In [0]:
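# Import only the types the schema needs; wildcard imports from
# pyspark.sql.functions shadow Python built-ins such as sum, min and max.
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, TimestampType
)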

claims_schema = StructType([
    StructField("id", StringType(), True),
    StructField("customer_name", StringType(), True),
    StructField("phone_number", StringType(), True),
    StructField("country", StringType(), True),
    StructField("claim_amount", IntegerType(), True),
    StructField("type_id", StringType(), True),
    StructField("status", StringType(), True),
    StructField("processed", TimestampType(), True),
    StructField("type", StringType(), True)
])
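
Because the schema is supplied explicitly, Spark's CSV reader by default applies it by column position and does not validate the header names (the `enforceSchema` option). A quick header check before ingestion can catch reordered columns. The following is a minimal sketch in plain Python; the helper name `header_mismatches` and the sample header lines are hypothetical and are not part of the original notebook.

```python
import csv
import io

# Column order taken from claims_schema above.
EXPECTED_COLUMNS = [
    "id", "customer_name", "phone_number", "country", "claim_amount",
    "type_id", "status", "processed", "type",
]


def header_mismatches(csv_text, expected=EXPECTED_COLUMNS):
    """Return (position, expected_name, actual_name) for each differing column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    mismatches = []
    for i in range(max(len(header), len(expected))):
        want = expected[i] if i < len(expected) else None
        got = header[i].strip() if i < len(header) else None
        if want != got:
            mismatches.append((i, want, got))
    return mismatches


# Hypothetical header lines, for illustration only.
good = "id,customer_name,phone_number,country,claim_amount,type_id,status,processed,type\n"
bad = "id,customer_name,country,phone_number,claim_amount,type_id,status,processed,type\n"

print(header_mismatches(good))
print(header_mismatches(bad))
```

An empty list means the header matches the schema order; any tuple in the result points at a column that would be loaded into the wrong field.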


In [0]:
checkpoint_path = "dbfs:/FileStore/events/_checkpoints/lake_events"
upload_path = "dbfs:/mnt/stream-data/claims"

# Set up the stream to begin reading incoming files from the
# upload_path location.
events_datalake_df = spark.readStream.format('cloudFiles') \
  .option('cloudFiles.format', 'csv') \
  .option('header', 'true') \
  .schema(claims_schema) \
  .load(upload_path)

In [0]:
def write_row(batch_df, batch_id):
    # Called once per micro-batch; batch_df is a regular (non-streaming) DataFrame.
    # checkpointLocation is a streaming-query option and has no effect on a
    # batch write, so it is not set here.
    batch_df.write \
        .format("mongo") \
        .mode("append") \
        .option("uri", "mongodb+srv://admin:demo%40PSL@cluster0.s5tuvb0.mongodb.net/events_db.claims_events?retryWrites=true&w=majority") \
        .save()
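
The connection string above percent-encodes the `@` in the password as `%40`, which is required because `@` also separates the credentials from the host. One possible way to avoid hand-encoding, and to keep credentials out of the notebook source, is to assemble the URI from parts. This is only a sketch: `build_mongo_uri` and the environment variable names `MONGO_USER` and `MONGO_PASSWORD` are hypothetical, and the fallback values are placeholders.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, database, collection):
    """Assemble a mongodb+srv URI, percent-encoding the credentials."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}/"
        f"{database}.{collection}?retryWrites=true&w=majority"
    )


# MONGO_USER and MONGO_PASSWORD are hypothetical environment variable names;
# the defaults are placeholders, not real credentials.
uri = build_mongo_uri(
    os.environ.get("MONGO_USER", "admin"),
    os.environ.get("MONGO_PASSWORD", "p@ss/word"),
    "cluster0.s5tuvb0.mongodb.net",
    "events_db",
    "claims_events",
)
print(uri)
```

The resulting string can be passed to `.option("uri", uri)` in place of the literal. On Databricks, a secret scope would be a natural source for the two credential values instead of environment variables.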

In [0]:
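# The checkpoint location is set on the streaming query so that a restart
# resumes from the last committed micro-batch.
events_datalake_df.writeStream \
    .foreachBatch(write_row) \
    .option("checkpointLocation", checkpoint_path) \
    .start()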
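
The Spark documentation describes `foreachBatch` as providing at-least-once guarantees by default, so a micro-batch can be delivered again after a failure and restart; the `batch_id` argument can be used to deduplicate such replays. The toy sketch below illustrates the idea in plain Python. `FakeCollection` is a hypothetical in-memory stand-in, not the MongoDB connector, and the sample rows are invented for illustration.

```python
class FakeCollection:
    """In-memory stand-in for a MongoDB collection (illustration only)."""

    def __init__(self):
        self.docs = {}          # document id -> document
        self.committed = set()  # batch ids that were already written


def write_batch(rows, batch_id, collection):
    """Write a micro-batch once; replays of the same batch_id are skipped."""
    if batch_id in collection.committed:
        return 0
    for row in rows:
        # Upsert keyed on the claim id, so re-writing a row cannot duplicate it.
        collection.docs[row["id"]] = row
    collection.committed.add(batch_id)
    return len(rows)


claims = FakeCollection()
batch = [{"id": "c1", "status": "open"}, {"id": "c2", "status": "closed"}]

first = write_batch(batch, 0, claims)
replay = write_batch(batch, 0, claims)  # same batch delivered again after a restart
print(first, replay, len(claims.docs))
```

With a real MongoDB sink, the same effect would need either a record of committed batch ids or writes that replace documents by a stable key rather than appending new ones.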

In [0]:
# Reading from MongoDB
mongo_df = spark.read\
.format("com.mongodb.spark.sql.DefaultSource")\
.option("uri", "mongodb+srv://admin:demo%40PSL@cluster0.s5tuvb0.mongodb.net/events_db?retryWrites=true&w=majority")\
.option("database", "events_db")\
.option("collection", "claims_events")\
.load()

display(mongo_df)

### Option - 2
##### Batch Upload
This option uploads the complete data set in one go.

In [0]:
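# Plain batch read: checkpointLocation only applies to streaming queries, so it
# is omitted. The header option matches the streaming read in Option 1, which
# treats the first line of each file as a header row.
events_datalake_df = spark.read.format("csv") \
    .schema(claims_schema) \
    .option("header", "true") \
    .load("dbfs:/mnt/stream-data/claims")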

In [0]:
display(events_datalake_df)

In [0]:
# Batch write: checkpointLocation has no effect here and is omitted.
# mode("overwrite") replaces the existing contents of the collection.
events_datalake_df.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("overwrite") \
    .option("uri", "mongodb+srv://admin:demo%40PSL@cluster0.s5tuvb0.mongodb.net/events_db.claims_events?retryWrites=true&w=majority") \
    .save()

In [0]:
# Reading from MongoDB
mongo_df = spark.read\
.format("com.mongodb.spark.sql.DefaultSource")\
.option("uri", "mongodb+srv://admin:demo%40PSL@cluster0.s5tuvb0.mongodb.net/events_db.claims_events?retryWrites=true&w=majority")\
.option("database", "events_db")\
.option("collection", "claims_events")\
.load()

display(mongo_df)

##### ==== end of notebook ====