# Implementing a Data Pipeline with Apache Spark - Demo 

In this demo we will see how to use [Apache Spark](https://spark.apache.org/) via Python ([PySpark](http://spark.apache.org/docs/latest/api/python/)) to process data from different sources that together represent a Data Lake.
For this demo we will use data stored in:

* __MySQL__ - RDBMS database
* __MongoDB__ - NoSQL database
* __Parquet__ Files - [Apache Parquet](https://parquet.apache.org/) is a columnar data format often used in [Apache Hadoop](https://hadoop.apache.org/) environments, particularly suitable for analytics


The basic idea is to design and implement a Big Data Pipeline consisting of several **Jobs** and **Tasks**. 

* Job: a complete data transformation activity, from reading data from a source to saving it somewhere
* Task: a single step of a job
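
The Job/Task distinction can be sketched in plain Python, without Spark. This is only an illustration: `run_job`, `drop_missing` and `double` are hypothetical names, and the lists stand in for a real source and sink.

```python
from typing import Callable, Iterable, List

# A task is a single step: it maps a dataset to a new dataset.
Task = Callable[[list], list]

def run_job(source: Iterable, tasks: List[Task], sink: list) -> list:
    """Run a job: read from the source, apply each task in order, save to the sink."""
    data = list(source)          # read
    for task in tasks:           # transform, one task per step
        data = task(data)
    sink.extend(data)            # save
    return sink

def drop_missing(rows):
    """Hypothetical task: remove missing values."""
    return [r for r in rows if r is not None]

def double(rows):
    """Hypothetical task: double every value."""
    return [r * 2 for r in rows]

result = run_job([1, None, 3], [drop_missing, double], sink=[])
print(result)
```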
___

To start the demo, run:

`docker-compose -f ./docker-compose-full.yml up -d`
 
To stop the demo, run:

`docker-compose -f ./docker-compose-full.yml down`

Docker will build up an ecosystem with:

* **MongoDB** with a populated database
* **MySQL** with a populated database
* **Apache Spark** deployed in Standalone Mode
* **Jupyter** enabled to work with Spark


### SparkSession
In the first cell we have to instantiate the __SparkSession__ object. Via the SparkSession we can read, manipulate, and store data from different data sources using both the RDD and the DataFrame APIs.

> A Spark application normally works with a single SparkSession: `getOrCreate()` returns the existing session if there is one. Through it, the driver builds the DAG of operations and interacts with the executors to run it.

In [1]:
from pyspark.sql import SparkSession
ss = SparkSession.builder \
.config("spark.mongodb.input.uri", "mongodb://root:example@mongo/test.coll?authSource=admin") \
.config("spark.mongodb.output.uri", "mongodb://root:example@mongo/test.coll?authSource=admin") \
.config('spark.jars.packages', 'mysql:mysql-connector-java:8.0.17,org.mongodb.spark:mongo-spark-connector_2.11:2.4.1') \
.getOrCreate()
ss.version
# Spark version 2.4.4 uses Scala 2.11

'2.4.4'

The [builder pattern](https://en.wikipedia.org/wiki/Builder_pattern) is used to create and initialize the SparkSession. Notice that we used the *config* method to store metadata such as the MongoDB input and output URIs and a list of Java packages (fully specified in [Apache Maven](https://maven.apache.org/) format).

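> Note that Spark resolves the listed packages when the session starts: it looks them up in the local repository cache and downloads them from the Maven repositories if they are missing, then makes them available to the driver and the executors.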

Other metadata (hostname, username, password, etc.) are defined as Python variables. The next cell defines the MySQL connection parameters. The resulting **jdbcUrl** is used in the following cells to create DataFrames from MySQL.

In [2]:
jdbcHostname = "mysql"
jdbcDatabase = "esame"
username = "root"
password = "example"
jdbcPort = 3306
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}?user={3}&password={4}".format(jdbcHostname, jdbcPort, jdbcDatabase, username, password)
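
The plain `format` call above works for the demo credentials, but a password containing characters such as `&` or `=` would corrupt the URL. A minimal sketch of a safer builder follows; `build_jdbc_url` is a hypothetical helper, and it assumes the JDBC driver decodes percent-encoded values in the connection URL.

```python
from urllib.parse import quote

def build_jdbc_url(host, port, database, user, password):
    """Build a MySQL JDBC URL, percent-encoding the credentials."""
    return "jdbc:mysql://{0}:{1}/{2}?user={3}&password={4}".format(
        host, port, database,
        quote(user, safe=""),       # encode every reserved character
        quote(password, safe=""),
    )

# With the demo values the result is identical to the jdbcUrl built above.
demo_url = build_jdbc_url("mysql", 3306, "esame", "root", "example")

# A hypothetical password with reserved characters is encoded instead of
# being read as extra URL parameters.
tricky_url = build_jdbc_url("mysql", 3306, "esame", "root", "p&ss=word")
print(demo_url)
print(tricky_url)
```

Spark's JDBC data source also accepts `user` and `password` as separate options, which avoids putting credentials in the URL at all.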

### 1. Total sales per film category

List the **total sales per film category** considering only the sales referred to rented movies.
Measure the query execution time (you might want to use the Python `time` library).

You may need to use the following tables: 

1. category
2. film_category
3. inventory
4. payment
5. rental

![EER Diagram](figures/mysql_table1.png)


The result must have __two columns__:

1. the film_category
2. the total_sales

and has to be **sorted in descending order** with respect to the **total_sales**.

> For performance reasons it is recommended to write a single SQL query rather than import the tables into separate Spark dataframes and join them in Spark. With the `query` option the whole statement, joins and aggregations included, is executed by MySQL, and only the result is transferred to Spark.



The following code snippet presents how to create a dataframe from MySQL. 

Note that it is necessary to provide:

1. The data source format (jdbc) and the JDBC driver class (driver)
2. The connection string (url)
3. The query MySQL has to execute to export data to Spark

Moreover, note that the dataframe API is lazily evaluated: nothing is computed until an **action** is called. In the code snippet below:

* load() is not an action, the dataframe is therefore only defined but not created
* show() is an action, the dataframe is created here and the results are returned to the main program (driver)
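
The difference between defining and executing can be illustrated with an analogy in plain Python, where generators play the role of lazy transformations. This is not Spark code, only a sketch of the same idea; `read_rows` is a hypothetical stand-in for a data source.

```python
log = []

def read_rows():
    """Stand-in for a data source: the body runs only when rows are consumed."""
    for i in range(3):
        log.append("read {}".format(i))
        yield i

rows = read_rows()                  # like load(): defined, nothing read yet
doubled = (r * 2 for r in rows)     # a "transformation": still lazy
log_before_action = list(log)       # snapshot: no work has been done so far

result = list(doubled)              # the "action": triggers all the work
print(log_before_action, result, log)
```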


In [None]:
# 1
import time

query1 =  '''
        XXXX put your query here
    '''

salesCat = ss.read \
    .format("jdbc") \
    .option("url", jdbcUrl) \
    .option("query", query1)\
    .option("driver", "com.mysql.jdbc.Driver") \
    .load()


start = time.time()
salesCat.show() # this is an action, the dataframe is created only at this point.
end = time.time()
time_taken = end - start
print('Time: ',time_taken)
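
Since the same start/end timing pattern is repeated in several cells, it can be wrapped in a small helper. The sketch below uses a context manager; `timed` and `timings` are hypothetical names, and `time.perf_counter` is used because it is intended for measuring short durations.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Measure the wall-clock time of the enclosed block and print it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed      # keep the value for later comparison
        print("{}: {:.3f} s".format(label, elapsed))

timings = {}
with timed("demo action", timings):
    sum(range(100000))    # replace with an action such as salesCat.show()
```

Collecting the measurements in a dictionary makes it easy to compare the timings of the following exercises.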

### 2. Optimizing data loading with indexes

Indexes are extremely important for speeding up analytical processes, particularly when JOIN operations must be performed. It is easy to create indexes in RDBMS and NoSQL systems, but less so when dealing with large files in HDFS.

**Optimize** the query created in the previous cell by creating the appropriate indexes in the MySQL database.

**Report** as comments the SQL statements used to create the indexes.

**Re-execute** the query (creating the dataframe df2) and measure the time.


In [None]:
# 2
start = time.time()
salesCat.show()
end = time.time()
time_taken = end - start
print('Time: ',time_taken)
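
The effect of an index on a lookup can be observed without the demo stack. The sketch below uses SQLite from the Python standard library as a stand-in for MySQL; the `rental` table is a simplified, hypothetical version of the real one, and `idx_rental_inventory` is an arbitrary index name. The `CREATE INDEX` statement has the same form in MySQL, where the plan would be inspected with `EXPLAIN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rental ("
    "rental_id INTEGER PRIMARY KEY, inventory_id INTEGER, customer_id INTEGER)"
)
conn.executemany(
    "INSERT INTO rental (inventory_id, customer_id) VALUES (?, ?)",
    [(i % 50, i % 7) for i in range(500)],
)

def plan(sql):
    """Return the query plan as one string (the detail column of each row)."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

lookup = "SELECT * FROM rental WHERE inventory_id = 42"

plan_before = plan(lookup)    # no index yet: the table is scanned
conn.execute("CREATE INDEX idx_rental_inventory ON rental (inventory_id)")
plan_after = plan(lookup)     # the planner can now search the index

print(plan_before)
print(plan_after)
```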

### 3. Optimizing data loading with views

Views are a great way to simplify the definition of a data pipeline because they define tasks that are executed at the database level. In some cases they can also improve overall query performance.

In the big data world, a view can be implemented as a **batch process**, for example through [Apache Hive](https://hive.apache.org/), which makes partially pre-processed data available.
This is very useful when you want to create a data pipeline for production, a bit less so when you want to perform exploratory analysis.

1. **Create** a view called "total_sales" from the query implemented in the previous cells.
2. **Report** here the SQL statement used to create the view
3. **Load** in a Spark dataframe all the rows of the view and show them
4. **Measure** the data loading time


In [None]:
# 3

# put here the sql statement for the view creation

import time

query2 = '''XXX here the query for loading all data from total_sales view'''

dfview = ss.read \
    .format("jdbc") \
    .option("url", jdbcUrl) \
    .option("query", query2)\
    .option("driver", "com.mysql.jdbc.Driver") \
    .load()

start = time.time()
dfview.show()
end = time.time()
time_taken = end - start
print('Time: ',time_taken)
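
As a rough illustration of how a view hides a multi-table join behind a simple name, the sketch below builds a tiny in-memory SQLite database (a stand-in for MySQL). The column names (`name`, `amount`, `rental_id`, and so on) are assumptions modelled on a Sakila-like schema and may differ from the real "esame" database; the rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE film_category (film_id INTEGER, category_id INTEGER);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);
CREATE TABLE payment (payment_id INTEGER PRIMARY KEY, rental_id INTEGER, amount REAL);

INSERT INTO category VALUES (1, 'Action'), (2, 'Comedy');
INSERT INTO film_category VALUES (10, 1), (20, 2);
INSERT INTO inventory VALUES (100, 10), (200, 20);
INSERT INTO rental VALUES (1000, 100), (1001, 100), (2000, 200);
INSERT INTO payment VALUES (1, 1000, 4.0), (2, 1001, 2.0), (3, 2000, 5.0);

-- The view stores the join and the aggregation under a single name.
CREATE VIEW total_sales AS
SELECT c.name AS film_category, SUM(p.amount) AS total_sales
FROM payment p
JOIN rental r ON p.rental_id = r.rental_id
JOIN inventory i ON r.inventory_id = i.inventory_id
JOIN film_category fc ON i.film_id = fc.film_id
JOIN category c ON fc.category_id = c.category_id
GROUP BY c.name;
""")

# The client query is now trivial; sorting is applied when reading the view.
rows = conn.execute(
    "SELECT film_category, total_sales FROM total_sales ORDER BY total_sales DESC"
).fetchall()
print(rows)
```

In the notebook, the final `SELECT` is the only statement that would need to be passed to Spark through the `query` option.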



### 4. Films per category

List films per category, creating a dataframe called "film_category". The table must have the following structure:

1. film_id
2. film_title
3. film_description
4. film_category
5. film_rental_rate
6. film_length
7. film_rating

and measure the query execution time.

In [None]:
# 4

query3 ='''
    XXX put the query here
'''

film_category=ss.read \
    .format("jdbc") \
    .option("url", jdbcUrl) \
    .option("query", query3)\
    .option("driver", "com.mysql.jdbc.Driver") \
    .load()

start = time.time()
film_category.show()
end = time.time()
time_taken = end - start
print('Time: ',time_taken)

### 5. Actor - Film dataframe 

The MySQL database is incomplete: it lacks information on addresses, cities and countries.
This information is present in the "esame" database and in the collections: "denormalizedAddress", "actor" and "film_actor" in MongoDB. 

___

The following pictures depict the structure of a typical document within the "actor" and "film_actor" collections, respectively.

![Actor collection](figures/actor.png)

___

![Film_Actor collection](figures/film_actor.png)


___

**Join** (using a pipeline of `lookup`, `unwind`, `project`, `concat` and `sort`) the "actor" and "film_actor" collections and extract a Spark DataFrame with the following columns:

1. actor_id
2. film_id
3. name i.e. "first_name last_name"

sorted by film_id

> As with the dataframes extracted from MySQL, for performance reasons it is recommended to execute in-database operations (filters, joins, concat, etc.) within the database itself.


Notes:

* [`$lookup`](https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/). Performs a left outer join with another collection; it is the **MongoDB Query Language (MQL)** counterpart of a SQL join.
* [`$unwind`](https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/). Deconstructs an array field from the input documents to output a document for each element.
* [`$project`](https://docs.mongodb.com/manual/reference/operator/aggregation/project/). Passes along the documents with the requested fields to the next stage in the pipeline. The specified fields can be existing fields from the input documents or newly computed fields.
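
To show how these stages chain together, here is a sketch of an aggregation pipeline written as a Python list. The field names (`actor_id`, `film_id`, `first_name`, `last_name`) are assumptions based on the column list above, and `films` is an arbitrary name for the joined array; check them against the document structure shown in the pictures.

```python
import json

actor_film_pipeline = [
    # Join each actor with the matching film_actor documents.
    {"$lookup": {
        "from": "film_actor",
        "localField": "actor_id",
        "foreignField": "actor_id",
        "as": "films",
    }},
    # One output document per (actor, film) pair.
    {"$unwind": "$films"},
    # Keep only the requested fields and build the full name.
    {"$project": {
        "_id": 0,
        "actor_id": 1,
        "film_id": "$films.film_id",
        "name": {"$concat": ["$first_name", " ", "$last_name"]},
    }},
    {"$sort": {"film_id": 1}},
]

# Serialising to JSON gives an unambiguous string form of the pipeline.
pipeline_json = json.dumps(actor_film_pipeline)
stage_names = [next(iter(stage)) for stage in actor_film_pipeline]
print(stage_names)
```

Passing the JSON string rather than the Python list to the connector's `pipeline` option is a cautious choice, since it does not depend on how Python objects are converted to text.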


In [None]:
# 5

queryFA = [
            XXX put the mongo pipeline here
]

FA = ss.read.format("mongo")\
.option("pipeline", queryFA)\
.option("uri","mongodb://root:example@mongo/esame.actor?authSource=admin&readPreference=primaryPreferred")\
.load()

FA.show()


### 6. Actor - Film - Category dataframe

Using the dataframe created in the previous cell ("FA"), add to the "film_category" dataframe a column with the names (separated by commas) of the actors starring in each film.

> hint: you might want to use the **concat_ws** and **collect_list** functions to aggregate the actors' names

Note that in the code snippet below film_category and FA are registered as temporary relational tables, which means that we can use the Spark SQL API to manipulate them.

In [None]:
# 6

# createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0
film_category.createOrReplaceTempView('film_category')
FA.createOrReplaceTempView('FA')

full_film_category = ss.sql('''
        XXX put the query here
''')

full_film_category.show()
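
What the combination of `collect_list` and `concat_ws` computes can be reproduced in plain Python. The sketch below only illustrates the semantics and is not Spark code; the rows are invented sample data shaped like the FA dataframe.

```python
from collections import defaultdict

# Hypothetical (actor_id, film_id, name) rows, shaped like the FA dataframe.
fa_rows = [
    (1, 10, "PENELOPE GUINESS"),
    (2, 10, "NICK WAHLBERG"),
    (1, 20, "PENELOPE GUINESS"),
]

# GROUP BY film_id with collect_list(name): one list of names per film.
names_by_film = defaultdict(list)
for _actor_id, film_id, name in fa_rows:
    names_by_film[film_id].append(name)

# concat_ws(', ', ...): join each list into a single string.
actors_column = {film_id: ", ".join(names) for film_id, names in names_by_film.items()}
print(actors_column)
```

Unlike this sketch, Spark does not guarantee the order of the elements returned by `collect_list` after a shuffle, so the order of the names within a film may vary between runs.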

### 7. Conclude the pipeline saving results as a parquet file

Save the dataframe created in the previous cell as a Parquet file named "film_category".


In [None]:
# 7
# put here the code


Hurray! We implemented our first Big Data Pipeline in Apache Spark. A graphical representation of the pipeline follows.

![Pipeline 1](figures/pipeline1.png)

### 8. More on indexing - Geospatial indexes in MongoDB

**Create** a geospatial index (type 2dsphere) in MongoDB on the "location.coordinates" field of the "denormalizedAddress" collection.

The following picture shows the typical structure of an address in the "denormalizedAddress" collection:

![denormalizedAddress collection](figures/address.png)

In [None]:
# 8
# put here the code


### 9. Geospatial querying data from MongoDB to Spark

**Retrieve** all documents that contain coordinates within 200 km of the point [ 8.659, 45.955 ]

> you might want to use the **geoNear** aggregation stage

**Select** only the following fields: address_id, address, city_name, district, country_name

Note that in the snippet below the selection is performed with the Spark dataframe API. It may seem wasteful, but Spark uses the MongoDB connector to push the projection down: instead of loading all the data and then selecting the columns, the query and the selection are both executed within MongoDB.


In [None]:
# 9

distPip =[
        XXX put the geo-spatial query here
]

disDF = ss.read.format("mongo")\
.option("pipeline", distPip)\
.option("uri","mongodb://root:example@mongo/esame.denormalizedAddress?authSource=admin&readPreference=primaryPreferred")\
.load().select('address_id', 'address', 'city_name', 'district', 'country_name')

disDF.show()
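
A sketch of what such a geospatial stage might look like, together with a plain-Python distance check, is given below. It assumes that the point is expressed as [longitude, latitude]; `distance` is an arbitrary name for the output field, and the two test cities are only sample coordinates. With a GeoJSON point, `maxDistance` is expressed in metres.

```python
from math import asin, cos, radians, sin, sqrt

CENTER = [8.659, 45.955]    # assumed to be [longitude, latitude]

geo_pipeline = [{
    "$geoNear": {
        "near": {"type": "Point", "coordinates": CENTER},
        "distanceField": "distance",    # arbitrary output field name
        "maxDistance": 200 * 1000,      # 200 km expressed in metres
        "spherical": True,
    }
}]

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in km between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# Sanity check with two sample cities: Milan should be inside the radius, Rome outside.
milan_km = haversine_km(CENTER[0], CENTER[1], 9.19, 45.4642)
rome_km = haversine_km(CENTER[0], CENTER[1], 12.4964, 41.9028)
print(round(milan_km), round(rome_km))
```

A local check like this is a quick way to validate a few rows returned by the query before trusting the whole result.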

### 10. Customers within a certain area

Generate a table with information about the customers whose address has been identified in the previous cell.

The table must have the following structure:

1. Customer name ("first_name last_name")
2. Customer address ("address, city_name")
3. district
4. country_name
5. active

The information about the customers can be found in the "customer.parquet" file

In [None]:
# 10. 

customers = ss.read.parquet("customer.parquet")
# createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0
customers.createOrReplaceTempView('customer')
disDF.createOrReplaceTempView('address')
ss.sql('''
XXX put the query here
''').show()

### 11. Conclude the pipeline saving results as a parquet file

Save the dataframe created in the previous cell as a Parquet file named "Custormer_full".

In [None]:
# 11
# put code here

The previous code concludes the second Spark-powered pipeline, representable as:
![Pipeline 2](figures/pipeline2.png)