Deployment of a machine learning model to production:
The main focus here is not to build a machine learning model or API, but to assess the ability to design
the architecture and use open-source tools to deploy a model to production in the cloud. You are
welcome to choose any model API (Flask, Django, FastAPI, etc.) that is publicly available or that you
have built yourself. However, don’t forget to serve your API through a Web Server Gateway Interface
(WSGI) server such as gunicorn.
Please explain and document your workflow properly; we take this seriously.
The steps below are offered as guidance. We are open to other methods too.
1. Build a big data pipeline for both stream processing and batch processing. You can refer to
the Lambda Architecture with Kafka.
2. Use MongoDB or Hadoop (HDFS) for your data storage (sinking and sourcing).
3. Build a CI/CD pipeline with 2 stages (Build, Tag) in GitLab using the model API, and tag the image
as ":latest" when you are on the master branch:
i. Dockerize the API.
ii. Create the file .gitlab-ci.yml.
iii. Built images must be pushed to the GitLab Container Registry (tagged with the branch name).
iv. Run the Tag stage only on the master branch and tag the image as ":latest".
v. The pipeline starts automatically on a commit to any branch.
vi. Orchestrate the Docker container(s) using Kubernetes, including creation of the Kubernetes cluster(s).
Use of an nginx reverse proxy, async workers, and task schedulers (Celery) is highly appreciated and
counts as a bonus.
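
For the WSGI requirement above, it may help to see what a server such as gunicorn expects: a callable that takes `(environ, start_response)`. The sketch below is a minimal illustration only; the `app` name, the `/health` route, and the `call` helper are assumptions made for this example, and a Flask or Django application object exposes the same interface.

```python
import json
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Minimal WSGI callable that a WSGI server such as gunicorn can serve."""
    if environ.get("PATH_INFO") == "/health":  # hypothetical route
        status, payload = "200 OK", {"status": "ok"}
    else:
        status, payload = "404 Not Found", {"error": "not found"}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def call(path):
    """Invoke the app in-process, without a server, for a quick check."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)
```

With gunicorn installed, a module containing this callable would typically be started with something like `gunicorn --workers 2 module_name:app`, where `module_name` is a placeholder.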

## Apache Spark
* Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
* Programming Language - Scala

<img src="Image/spark.JPG" width="500" />

## Lambda Architecture 
* Build the data ingestion and reporting pipeline with the Lambda Architecture using Spark.
* This approach provides code reusability between the streaming and batch layers. The sections below cover key deployment configurations and a few troubleshooting tips.

<img src="Image/Lambda_spark.png" width="800" />

**Lambda Architecture (LA) provides**
* A robust system that is fault-tolerant against hardware failures and human mistakes.
* Linear scalability (scaling out instead of scaling up).
* Protection against imperfect input: there is no guarantee that data arrives in one shot or without noise, and some data arrives late because of network or server instability.
* Double processing to overcome the issues above: LA processes the data twice, once in real-time streaming and a second time in a scheduled batch process.
* Idempotent writes to KairosDB, so that data can be reprocessed while still guaranteeing exactly-once processing semantics.
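
The idempotent-write point can be illustrated with a small sketch. It is plain Python, with an in-memory dictionary standing in for KairosDB (an assumption for illustration, not the KairosDB client API): each data point is keyed by metric name, timestamp, and tags, so replaying the same batch overwrites values rather than double counting them.

```python
class InMemoryTimeSeriesStore:
    """Stand-in for a time series database; NOT the KairosDB client API."""

    def __init__(self):
        self._points = {}

    def put(self, metric, timestamp_ms, value, tags=None):
        # The key fully identifies a data point, so a repeated write
        # replaces the earlier value instead of adding a second one.
        key = (metric, timestamp_ms, tuple(sorted((tags or {}).items())))
        self._points[key] = value

    def total(self, metric):
        return sum(v for (m, _, _), v in self._points.items() if m == metric)

    def __len__(self):
        return len(self._points)


def write_batch(store, batch):
    """Write every point of a batch; safe to call again with the same batch."""
    for point in batch:
        store.put(point["metric"], point["ts"], point["value"], point.get("tags"))
```

Because the streaming job and the later batch job both write points with the same keys, the batch result simply replaces the streaming result for the same time slot.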




**Why Spark Streaming**
* Provides a unified API for streaming, batch, and interactive analytics.
* It is a flexible, robust, and scalable data processing engine.
* Provides high throughput and good integration with data stores and systems such as Cassandra, Kafka, and Hadoop.


## Details
We use the Spark Kafka Direct stream for our streaming job, which guarantees exactly-once processing semantics, with checkpointing enabled for recovery. Batch processing is built on top of core Spark. The metrics are calculated and stored in KairosDB, a time series database. The user session information from batch processing is also stored in HDFS as Hive external tables. In streaming, the user state is maintained for 12 hours; in batch, it is maintained for as long as the user is active.
RDDs in the streaming and batch processing are processed as follows:

* **Parsing and validation:** This is done in a series of map(), flatMap(), filter(), groupByKey(), reduceByKey() and leftOuterJoin() RDD transformations, which are reused across the streaming and batch jobs.
* **Data enrichment and state maintenance:** This is done in updateStateByKey() in streaming and in mapPartitions() in batch processing. The business logic for merging sessions and enriching them with additional information is reused across both jobs. In streaming, the state is maintained in memory and in the checkpoint, as part of the functionality provided by updateStateByKey(). In batch processing, the state is maintained in Hive external tables in HDFS. In the streaming job we also calculate rolling 1-hour and 24-hour unique visitors using reduceByKeyAndWindow() with an inverse function.
* **Metric generation and save to KairosDB:** Metrics are generated using flatMap() and reduceByKey() after sessionization and data enrichment, and are saved to KairosDB using a KairosDB client. The write to KairosDB happens in a foreachPartition() action. The save logic is reused in both streaming and batch processing.
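
The rolling unique-visitor calculation with an inverse function can be sketched in plain Python. This is an illustration of the idea behind reduceByKeyAndWindow() with an inverse function, not Spark code: the window result is updated by adding the batch that enters the window and subtracting the batch that leaves it, instead of recomputing the whole window. The class name and the batch-count window are assumptions for this example.

```python
from collections import Counter, deque


class RollingUniqueVisitors:
    """Sliding-window unique count using an add step and its inverse."""

    def __init__(self, window_batches):
        self.window_batches = window_batches
        self.batches = deque()   # per-batch visitor counts still in the window
        self.counts = Counter()  # visitor id -> hits inside the window

    def add_batch(self, visitor_ids):
        entering = Counter(visitor_ids)
        self.batches.append(entering)
        self.counts.update(entering)        # "reduce" step: add the new batch
        if len(self.batches) > self.window_batches:
            leaving = self.batches.popleft()
            self.counts.subtract(leaving)   # "inverse" step: remove the old batch
            # Drop visitors whose count fell to zero, as a filter function would.
            for visitor in leaving:
                if self.counts[visitor] <= 0:
                    del self.counts[visitor]
        return len(self.counts)
```

With a 60-second micro batch, a rolling hour would correspond to `RollingUniqueVisitors(window_batches=60)` under these assumptions.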
In stream processing, the incoming beacons for each micro batch are grouped by sessionId and sorted by timestamp. Session state is updated by processing the sessions from the current micro batch and merging them with the session state from the previous micro batch. Metrics generated after sessionization are reduced and then stored in KairosDB. We store the offsets in ZooKeeper so that we can restart the job from the last successfully processed offsets. For accuracy, the offsets are updated in the same transaction as the results. The following code sample gives an overview.

**StreamingSpecificCode.txt**
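
As a rough illustration of the micro-batch steps described above (a plain-Python sketch, not the original Spark code), the functions below group beacons by sessionId, sort them by timestamp, and merge them into the state carried over from the previous micro batch. The beacon fields, the session fields, and the reading of the 12-hour retention as an idle timeout are assumptions made for this example.

```python
from itertools import groupby
from operator import itemgetter

SESSION_TIMEOUT_MS = 12 * 60 * 60 * 1000  # 12-hour retention, read as idle time


def group_and_sort(beacons):
    """Group beacons by sessionId and sort each group by timestamp."""
    ordered = sorted(beacons, key=itemgetter("sessionId", "timestamp"))
    return {sid: list(group)
            for sid, group in groupby(ordered, key=itemgetter("sessionId"))}


def update_state(previous_state, micro_batch, now_ms):
    """Merge the current micro batch into the state of the previous one."""
    state = {sid: dict(s) for sid, s in previous_state.items()}
    for sid, events in group_and_sort(micro_batch).items():
        session = state.setdefault(
            sid, {"start": events[0]["timestamp"], "events": 0})
        session["start"] = min(session["start"], events[0]["timestamp"])
        session["last_seen"] = max(session.get("last_seen", 0),
                                   events[-1]["timestamp"])
        session["events"] += len(events)
    # Expire sessions that have been idle longer than the retention period.
    return {sid: s for sid, s in state.items()
            if now_ms - s["last_seen"] <= SESSION_TIMEOUT_MS}
```

The function returns a new state dictionary and leaves the previous one untouched, which mirrors how each micro batch produces the state for the next one.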

The following code sample contains the reusable functions that are called from both stream processing and batch processing: filtering and populating the valid beacons, populating grouped and sorted sessions, processing stateful sessions, generating the metric values, reducing the metrics, and storing the generated metrics in KairosDB.

**CommonCodeBetweenBatchAndStreaming**
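
The reuse idea can be sketched in plain Python (an illustration, not the original Spark functions): validation, metric generation, and reduction are written once, and both a streaming-style caller and a batch-style caller go through the same `process` function. The beacon schema in `REQUIRED_FIELDS` and the metric naming are hypothetical.

```python
from collections import defaultdict

REQUIRED_FIELDS = ("sessionId", "timestamp", "event")  # hypothetical schema


def valid_beacons(raw_beacons):
    """Keep only beacons that carry every required field."""
    return [b for b in raw_beacons
            if all(b.get(field) is not None for field in REQUIRED_FIELDS)]


def generate_metrics(beacons):
    """Emit one (metric_name, 1) pair per beacon, like a flatMap step."""
    return [("events." + b["event"], 1) for b in beacons]


def reduce_metrics(pairs):
    """Sum values per metric name, like a reduceByKey step."""
    totals = defaultdict(int)
    for name, value in pairs:
        totals[name] += value
    return dict(totals)


def process(raw_beacons):
    """Shared pipeline: the same call serves a micro batch or a batch run."""
    return reduce_metrics(generate_metrics(valid_beacons(raw_beacons)))


def streaming_job(micro_batches):
    """Streaming-style caller: one result per micro batch."""
    return [process(batch) for batch in micro_batches]


def batch_job(all_beacons):
    """Batch-style caller: one result over all available data."""
    return process(all_beacons)
```

Keeping the business logic in one place means a fix to `valid_beacons` or `generate_metrics` applies to both paths at once.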

The following is a code sample of the batch processing, which performs stateful sessionization, metric generation, and the save to KairosDB. The functions above that are called from stream processing are called from batch processing as well, which provides code reusability between batch and stream processing.

## Streaming Configuration and Deployment
We use a standalone cluster for Spark Streaming job deployment. For HA/DR, we run the streaming job in 2 different regions. At any point in time, only the job in one region writes to the KairosDB cluster; the job in the second region keeps running and maintains state but does not write to KairosDB. If there is an issue with the job in the first region, if we need to do maintenance on the cluster or upgrade the code, or if there are issues with the Kafka topic in one region, we switch the KairosDB writes to the job running in the second region. This gives us higher availability of the streaming job: session state is maintained and the data written to KairosDB is continuous.
Our streaming job runs with a micro batch interval of 60 seconds. The right interval depends on the processing times. We started with a batch interval of 10 seconds, observed the processing times, and increased it incrementally until processing completed within the micro batch interval and the scheduling delay either stayed at zero or rose only occasionally and recovered quickly.

The ideal number of executors and cores depends on the application. It is determined by factors such as the peak number of events per second, the maximum allowed lag, and the buffering capabilities of the streaming source, and can be arrived at by testing in a pre-production environment. The right amount of parallelism is between 2–3 times the number of cores and needs to be found iteratively by testing the job with various configurations.

Another key consideration is setting spark.memory.fraction and spark.memory.storageFraction to the right values. Based on various tests, we observed that spark.locality.wait=0s is good for job performance. We use Kryo serialization in our Spark jobs for better performance. When upgrading the application code, the application needs to be shut down gracefully, with no further records left to process.
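
The settings named above can be collected in a small helper. This is a sketch under assumptions: the helper names are invented for this example, mapping "parallelism" to spark.default.parallelism is one possible reading, and spark.streaming.stopGracefullyOnShutdown is included as one way to get a graceful shutdown. The memory fractions are left out because the text gives no values for them.

```python
def suggested_parallelism(total_cores, factor=2):
    """Starting point for parallelism: 2 to 3 times the number of cores."""
    if not 2 <= factor <= 3:
        raise ValueError("the text suggests a factor between 2 and 3")
    return int(total_cores * factor)


def streaming_conf(total_cores, factor=2):
    """Collect the settings named in the text as plain key/value pairs."""
    return {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.locality.wait": "0s",
        "spark.default.parallelism": str(
            suggested_parallelism(total_cores, factor)),
        # Lets the job finish in-flight batches on shutdown before an upgrade.
        "spark.streaming.stopGracefullyOnShutdown": "true",
    }


def as_submit_args(conf):
    """Render the settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

The values are only a starting point; as the text notes, the final numbers come from iterative testing in a pre-production environment.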

## Job Monitoring and Automatic Job Restarts for Streaming
We use a Python script that runs every 5 minutes to check whether the streaming job is up and running. It monitors lag, total delay, and the number of failed jobs versus successful jobs, and takes appropriate steps if any of them exceeds a threshold. We query these metrics from the Spark UI endpoints http://streamingUIURL:4040/metrics/json and http://streamingUIURL:4040/api/v1/applications/<appId>/jobs. If the job is down, if the number of failed jobs exceeds the number of successful jobs over the past 5 minutes, or if any task is stuck because of an infrastructure issue, the job is restarted automatically and we receive an alert email.
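
A minimal sketch of such a check is shown below. It is not the original monitoring script: the function names and the delay threshold are assumptions, the 5-minute time filter is omitted for brevity, and the restart and alerting actions are left to the caller. The `status` values `FAILED` and `SUCCEEDED` are those reported by the jobs endpoint of the Spark REST API.

```python
import json
from urllib.request import urlopen


def fetch_jobs(base_url, app_id, timeout=10):
    """Read the job list from the Spark REST API endpoint named in the text."""
    url = f"{base_url}/api/v1/applications/{app_id}/jobs"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def should_restart(jobs, scheduling_delay_ms, max_delay_ms):
    """Decide on a restart from job statuses and the observed delay."""
    if jobs is None:  # the UI could not be reached, so treat the job as down
        return True
    failed = sum(1 for j in jobs if j.get("status") == "FAILED")
    succeeded = sum(1 for j in jobs if j.get("status") == "SUCCEEDED")
    return failed > succeeded or scheduling_delay_ms > max_delay_ms


def check(base_url, app_id, scheduling_delay_ms, max_delay_ms, fetch=fetch_jobs):
    """Fetch the jobs and apply the decision; network errors count as 'down'."""
    try:
        jobs = fetch(base_url, app_id)
    except OSError:
        jobs = None
    return should_restart(jobs, scheduling_delay_ms, max_delay_ms)
```

Separating the decision (`should_restart`) from the HTTP call makes the thresholds easy to test without a running Spark job.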

## Troubleshooting the streaming job
We use the Spark Streaming UI to troubleshoot issues with the job. Ideally the scheduling delay should be 0; if it does increase, it should recover quickly within an acceptable time period. The processing time should be less than or equal to the micro batch interval. The consumption parallelism of the Kafka Direct stream equals the number of Kafka partitions, so the number of partitions needs to be adjusted if there is not enough parallelism for the reads.

When processing times increase, individual batches can be drilled into to see which stage is taking the time. Shuffle reads and writes should be kept as small as possible, and there should be no data skew. If the shuffles are taking a lot of time, one way of fixing that is to increase the cores and the parallelism.

The DAG visualization is used to determine whether any pipelined operations are separated by shuffles. Performance is good when all operations go through a single pipeline without any shuffles. The visualization also shows whether any RDDs are cached; cached RDDs are denoted by a green highlight. Caching an RDD means future computations on it can read it from memory, which improves performance.

The Storage section of the Streaming UI shows whether cached RDDs are being cleared regularly. Ideally, cached RDDs should be cleared at regular intervals to avoid performance bottlenecks.

The Executors tab of the Web UI provides information about the number of active tasks, core utilization, task GC time, and shuffle reads and writes. A large gap between active tasks and allocated cores means the cluster is underutilized, and the core allocation needs to be adjusted accordingly. The thread dump is used to drill down into possible performance issues.


## Batch Configuration and Deployment
We use Mesos as our cluster manager for batch jobs and HDFS for state maintenance and intermediate file storage. The batch processing job runs every 30 minutes and processes any new data available in HDFS. The data is stored in Hive external tables. We configured the executor memory according to the amount of data processed in each batch. The batch job also uses shuffle operations such as reduceByKey, groupByKey, repartition, coalesce, cogroup, and join, so settings such as spark.executor.memory, spark.cores.max, and spark.default.parallelism were adjusted iteratively to fit the job's requirements.

## Troubleshooting the batch job
A few slow local tasks can have a huge performance impact when one stage needs to finish before the next one can start. The detailed stage/tasks link in the Spark UI can be used to identify slow-running nodes, nodes with resource problems, and skew in data partitioning. Skew can be identified by looking at the input size/records, or by a small number of tasks taking significantly longer to execute than the others. By drilling down into the slower tasks we can determine whether the slowness is in writing data, reading data, or computation. If the processing is slow, it can be due to insufficient resources, and we would need to look at how much memory and CPU has been allocated to the executors, as well as the total number of cores allocated to the job.
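
The "small number of tasks taking significantly longer" check can be sketched as a comparison against the median task duration. This is an illustrative helper, not part of Spark; the function name and the default ratio of 3 are assumptions, and the durations would be copied from the stage detail page of the Spark UI.

```python
from statistics import median


def find_stragglers(task_durations_ms, ratio=3.0):
    """Return indexes of tasks that ran much longer than the median task."""
    if not task_durations_ms:
        return []
    typical = median(task_durations_ms)
    return [i for i, duration in enumerate(task_durations_ms)
            if typical > 0 and duration / typical >= ratio]
```

For example, with task durations of 100, 110, 90, 105, and 900 ms, only the last task is flagged, which would point to skewed input or a slow node for that task.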

## Conclusion

To recap, here are some points covered in this article:

* Building the data pipeline for A/B testing with the Lambda Architecture using Spark gave us a quick view of the data and metrics generated by a streaming job, and a reliable view from a scheduled batch process.
* Using Spark and Spark Streaming let us write the business logic functions once and reuse the code in both a batch ETL process and a streaming process. This lowered the risk of errors from duplicate code bases and helped developer productivity, because Spark provides a unified API for streaming, batch, and interactive analytics.
* We focused on having a stable stream/batch processing application first, before focusing on throughput. The performance of the applications was improved by tuning Spark's serialization and memory parameters and by increasing the number of cores and the parallelism iteratively.
* The Spark Streaming/Batch UI provides very good information on performance bottlenecks such as shuffles, data skew, slow-running tasks due to resource issues, task GC time, shuffle reads/writes, slow-running stages, and storage, which helps in troubleshooting the jobs.

Reference:
1. https://medium.com/walmartglobaltech/how-we-built-a-data-pipeline-with-lambda-architecture-using-spark-spark-streaming-9d3b4b4555d3