# Batch Processing

## Objectives:
- Understand batch processing and why it is used
- Explore batch processing in Python with the `joblib` library
- Create a batch ETL pipeline to update a model and a dashboard

## What is Batch Processing

### Definition
- Jobs that can run without end user interaction, or can be scheduled to run as resources permit
- Used for running high-volume, repetitive data jobs
- Batch processing works in an **automated** way based on a **scheduler**

For a further introductory discussion, see [this Talend overview](https://www.talend.com/resources/batch-processing/).

#### Batch vs Stream

![img](https://res.cloudinary.com/hevo/images/f_auto,q_auto/v1649315584/hevo-learn/Batch-Processing-Batch-Processing-vs-Stream-Processing/Batch-Processing-Batch-Processing-vs-Stream-Processing.png?_i=AA)

(Source: https://hevodata.com/learn/batch-processing/.)

Batch processing contrasts with *stream* processing, which handles each record as it arrives. Stream processing is critical when data reports or analyses must update in real time. When large chunks of data can be collected first and processed later, it is often better to process them in batches.

### Batch size
The batch size refers to the number of work units to be processed within one batch operation. Some examples are:

- The number of lines from a file to load into a database before committing the transaction.
- The number of messages to dequeue from a queue.
- The number of requests to send within one payload.
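The first example above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `sales` table, the made-up rows, and the helper names `batched` and `load_rows` are all assumptions introduced here. The loader commits once per batch, so `batch_size` directly controls how many transactions are issued.

```python
import sqlite3
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load_rows(conn, rows, batch_size):
    """Insert rows into a hypothetical `sales` table, committing once per batch."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (day TEXT, amount REAL)")
    n_commits = 0
    for batch in batched(rows, batch_size):
        conn.executemany("INSERT INTO sales VALUES (?, ?)", batch)
        conn.commit()  # one transaction per batch
        n_commits += 1
    return n_commits


conn = sqlite3.connect(":memory:")
rows = [(f"2023-01-{d:02d}", 100.0 + d) for d in range(1, 11)]  # 10 made-up rows
n_commits = load_rows(conn, rows, batch_size=4)
n_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(n_commits, n_rows)
```

A larger batch size means fewer commits but more work lost if a batch fails partway, so the right value depends on the workload.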

### Common batch processing usage

- Efficient bulk database updates and automated transaction processing, as contrasted to interactive online transaction processing (OLTP) applications.
- The extract, transform, load (ETL) step in populating data warehouses is inherently a batch process in most implementations.
- Performing bulk operations on digital images such as resizing, conversion, watermarking, or otherwise editing a group of image files.
- Converting computer files from one format to another. For example, a batch job may convert proprietary and legacy files to common standard formats for end-user queries and display.

(Source: https://en.wikipedia.org/wiki/Batch_processing.)

In [None]:
# Import Packages
import sqlite3
import time
from joblib import Parallel, delayed, Memory
from tqdm import tqdm

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from prophet import Prophet

## Today's Agenda

Today we will be exploring batch processing through two examples:

1. Use Python's `joblib` package
1. Create a simple batch ETL pipeline that continuously updates a model and deploys it to a dashboard.

## `joblib`

### Advantages

- Disk Caching of Functions & Lazy Re-Evaluation

Cache the results of expensive function calls for later use. Useful during pipeline development.

- Parallel Computing

Execute multiple operations at the same time.

### Caching of Functions

In [None]:
# Return the square of a number:
def square_number(no):
    return no * no

# Compute the squares of the integers in [start_no, end_no):
def get_square_range(start_no, end_no):
    result = []  # local list, so repeated calls do not accumulate old values
    for i in range(start_no, end_no):
        time.sleep(1)  # simulate an expensive computation
        result.append(square_number(i))
    return result

start = time.time()
# Getting square of 1 to 20:
final_result = get_square_range(1, 21)
end = time.time()

# Total time to compute
print('\nThe function took {:.2f} s to compute.'.format(end - start))
print(final_result)

In [None]:
# COMPLETE: Define a location to store cache

# Compute the squares of the integers in [start_no, end_no):
def get_square_range_cached(start_no, end_no):
    result = []  # local list, so the return value depends only on the arguments
    for i in range(start_no, end_no):
        time.sleep(1)  # simulate an expensive computation
        result.append(square_number(i))
    return result

# COMPLETE: Cache results of function


start = time.time()
# Getting square of 1 to 20:
final_result = get_square_range_cached(1, 21)
end = time.time()

# Total time to compute
print('\nThe function took {:.2f} s to compute.'.format(end - start))
print(final_result)

In [None]:
start = time.time()
# Getting square of 1 to 20:
final_result = get_square_range_cached(1, 21)
end = time.time()

print('\nThe function took {:.2f} s to compute.'.format(end - start))
print(final_result)
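
One way the caching pattern might look is sketched below. It is a standalone illustration rather than the notebook's intended solution: the temporary cache directory, the shorter sleep, and the name `slow_square_range` are assumptions made here. `Memory.cache` wraps a function so that its results are stored on disk and reused when the same arguments are seen again.

```python
import tempfile
import time

from joblib import Memory

# Assumption: a throwaway temporary directory serves as the cache location.
cache_dir = tempfile.mkdtemp()
memory = Memory(location=cache_dir, verbose=0)


def slow_square_range(start_no, end_no):
    """Square each integer in [start_no, end_no), pausing to mimic expensive work."""
    squares = []
    for i in range(start_no, end_no):
        time.sleep(0.05)
        squares.append(i * i)
    return squares


# Wrapping the function makes joblib store results on disk, keyed by the arguments.
slow_square_range_cached = memory.cache(slow_square_range)

start = time.time()
first = slow_square_range_cached(1, 21)   # computed, then written to the cache
first_duration = time.time() - start

start = time.time()
second = slow_square_range_cached(1, 21)  # same arguments: read back from the cache
second_duration = time.time() - start

print(f"first call: {first_duration:.2f} s, second call: {second_duration:.2f} s")
```

Because the cache key is built from the arguments, the cached function should return a value that depends only on its inputs. Appending to a list defined outside the function would undermine that.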

### Parallelizing

The function below is based on the Leibniz formula for $\pi$:

$\large\frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \cdots = \lim_{n\rightarrow\infty}\sum_{j=0}^{n}\frac{(-1)^j}{2j+1}$
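As a quick numerical check of the series, the sketch below compares partial sums with `math.pi`. The function name `leibniz_pi` and the chosen term counts are assumptions made for illustration. Since the series alternates with terms that shrink in magnitude, the error after `n_terms` terms is at most the first omitted term, $4/(2\,n_{terms}+1)$. Convergence is therefore slow, which is what makes this a convenient CPU-bound workload.

```python
import math


def leibniz_pi(n_terms):
    """Approximate pi as 4 times the first n_terms terms of the Leibniz series."""
    return 4 * sum((-1) ** j / (2 * j + 1) for j in range(n_terms))


for n_terms in (10, 1_000, 100_000):
    approx = leibniz_pi(n_terms)
    error = abs(math.pi - approx)
    bound = 4 / (2 * n_terms + 1)  # alternating-series bound: first omitted term
    print(f"{n_terms:>7} terms: {approx:.6f}  error={error:.2e}  bound={bound:.2e}")
```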

In [None]:
def batch_process_function(row, order, payload):
    """
    Simulate a CPU-bound processing step.

    Approximates pi by summing the first 10**order terms of the series above.
    `row` and `payload` are ignored.
    """
    k, pi = 1, 0
    for i in range(10**order):
        if i % 2 == 0: # even
            pi += 4 / k
        else:  # odd 
            pi -= 4 / k 
        k += 2
    return pi

In [None]:
# Settings


#### Serial

In [None]:
%%time

result = None

#### Batch

In [None]:
%%time

# Parallel using joblib and a progress bar using tqdm
result = None
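
A minimal sketch of the serial versus parallel comparison is shown below. The settings (`n_tasks`, `order`, `n_jobs`) and the simplified one-argument function `approximate_pi` are assumptions chosen to keep the demo short, and the progress bar is left out. `delayed` records a function call without running it, and `Parallel` distributes the recorded calls across worker processes.

```python
import time

from joblib import Parallel, delayed


def approximate_pi(order):
    """Sum the first 10**order terms of the Leibniz series for pi."""
    k, total = 1, 0.0
    for i in range(10 ** order):
        total += 4 / k if i % 2 == 0 else -4 / k
        k += 2
    return total


# Assumed settings, chosen only to keep the demo short.
n_tasks, order, n_jobs = 8, 5, 2

start = time.time()
serial_result = [approximate_pi(order) for _ in range(n_tasks)]
serial_duration = time.time() - start

start = time.time()
# delayed() records the call without running it; Parallel dispatches the calls to workers.
parallel_result = Parallel(n_jobs=n_jobs)(
    delayed(approximate_pi)(order) for _ in range(n_tasks)
)
parallel_duration = time.time() - start

print(f"serial: {serial_duration:.2f} s, parallel: {parallel_duration:.2f} s")
```

Starting worker processes has a fixed cost, so for small workloads like this one the parallel run may not be faster. The benefit tends to show up as each task becomes more expensive.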

## Batch ETL Pipeline

Next we will walk through a simple example of a batch ETL pipeline that can be used to update a model and deploy it to a dashboard.

### Scenario

We work for a store that is interested in forecasting their future sales. They have a model that forecasts total daily sales for the upcoming month. They would like us to create a pipeline that will automatically update the model on a weekly basis and deploy the results to a dashboard.

### Tasks:
- Extract recent sales data from the database
- Transform it into the format required by the time series model
- Load the transformed data into the "Data Warehouse"
- Train the model on the most recent data and deploy the forecasts to the dashboard
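
The first three tasks could be sketched as below. Everything specific here is an assumption made for illustration: the in-memory databases, the `sales` and `daily_sales` table names, the column names, and the sample rows. The transform step produces the `ds` (date) and `y` (value) columns that Prophet expects. Model training and the dashboard are left out.

```python
import sqlite3

import pandas as pd

# Hypothetical source database with a made-up `sales` table (one row per transaction).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (sold_at TEXT, amount REAL)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2023-01-01 09:15:00", 20.0),
        ("2023-01-01 14:30:00", 35.5),
        ("2023-01-02 10:05:00", 12.0),
        ("2023-01-03 16:45:00", 40.0),
        ("2023-01-03 17:20:00", 8.5),
    ],
)
source.commit()


def extract(conn, since):
    """Extract: pull transactions recorded on or after `since`."""
    query = "SELECT sold_at, amount FROM sales WHERE sold_at >= ?"
    return pd.read_sql_query(query, conn, params=(since,), parse_dates=["sold_at"])


def transform(raw):
    """Transform: total sales per day, in the `ds`/`y` columns Prophet expects."""
    daily = raw.set_index("sold_at")["amount"].resample("D").sum()
    return daily.rename("y").rename_axis("ds").reset_index()


def load(daily, conn):
    """Load: write the daily totals to a hypothetical `daily_sales` warehouse table."""
    daily.to_sql("daily_sales", conn, if_exists="replace", index=False)


warehouse = sqlite3.connect(":memory:")
daily_sales = transform(extract(source, since="2023-01-01"))
load(daily_sales, warehouse)
print(daily_sales)
```

Keeping extract, transform, and load as separate functions lets a scheduler rerun the whole chain each week with a new `since` value, and lets each step be tested on its own.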