![Königsweg Logo](../img/koenigsweg_150.png)

<span style="font-size: small;float: right;">&copy; 2015-2020 Alexander C.S. Hendorf, <a href="http://koenigsweg.com">Königsweg GmbH</a>, Mannheim </span>

---

# Analytics with Pandas and Jupyterlab

---

# Scaling and Optimizing Performance

---

In [None]:
import numpy as np
import pandas as pd
import parquet
import dask
%matplotlib inline
from datetime import datetime as dt

In [None]:
large_file_path = '../data/blooth_sales_data_big.json'  # 42 MB json

---

### Categorical

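If you work with a table that contains a lot of repetitive data, a Categorical can be a good way to save space. It is basically a lookup table: each distinct value is stored once, and every row only holds a small integer code that points to it.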
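To make the lookup-table idea concrete, here is a minimal sketch on a made-up column. The `products` Series and its values are illustrative only and are not taken from the sales data.

```python
import pandas as pd

# Hypothetical toy column: three distinct values repeated many times.
products = pd.Series(["apple", "banana", "cherry"] * 10_000)
as_category = products.astype("category")

# The lookup table: every distinct value is stored exactly once ...
categories = list(as_category.cat.categories)
print(categories)

# ... and each row holds only a small integer code pointing into that table.
codes = as_category.cat.codes
print(codes.head(4).tolist(), codes.dtype)

# Compare the memory footprint of both representations (in bytes).
plain_bytes = products.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(plain_bytes, category_bytes)
```

The saving depends on how few distinct values the column has relative to its length; a column of mostly unique strings gains little or nothing.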

In [None]:
tiny_big_set = pd.read_json(large_file_path)

In [None]:
tiny_big_set.head(3)

In [None]:
tiny_big_set.info()

In [None]:
tiny_big_set.info(memory_usage='deep')

In [None]:
tiny_big_set.memory_usage()

In [None]:
tiny_big_set.memory_usage(deep=True)

In [None]:
tiny_big_set['product'] = tiny_big_set['product'].astype('category')

In [None]:
tiny_big_set.memory_usage(deep=True)

---

### Parquet

Apache Parquet is
* a free and open-source column-oriented data storage format from the Apache Hadoop ecosystem,
* a top-level Apache Software Foundation (ASF) project,
* built from the ground up with complex nested data structures in mind.

Benefits:
* Column-wise compression is efficient and saves storage space
* Compression techniques specific to a type can be applied as the column values tend to be of the same type
* Queries that fetch specific column values need not read the entire row data thus improving performance
* Different encoding techniques can be applied to different columns
* It can be used from a number of programming languages such as C++, Java, Python and PHP
* Data storage costs go down and querying becomes more effective (e.g. with serverless technologies)
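The column-pruning benefit can be sketched as follows. The frame and its column names are hypothetical, and the example assumes a Parquet engine such as pyarrow or fastparquet is installed; the `try`/`except` only keeps the sketch from failing when none is available.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sales-like frame; the column names are assumptions.
sales = pd.DataFrame({
    "product": ["apple", "banana", "apple", "cherry"],
    "units": [3, 5, 2, 7],
    "unitprice": [0.5, 0.25, 0.5, 2.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sales.parquet.gzip")
    try:
        # Each column is stored and compressed separately.
        sales.to_parquet(path, compression="gzip")
        # Only the requested columns are read back from disk.
        subset = pd.read_parquet(path, columns=["product", "units"])
    except ImportError:
        # No Parquet engine (pyarrow / fastparquet) is installed.
        subset = None

print(subset)
```

With a row-oriented format such as CSV or JSON, the reader still has to parse every full row even when only two columns are needed.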


In [None]:
start = dt.utcnow()
df = pd.read_json(large_file_path)
took = dt.utcnow() - start  # elapsed time for reading the JSON file
took.total_seconds()

In [None]:
df.to_parquet(f'{large_file_path}.parquet.gzip', compression='gzip')

In [None]:
start = dt.utcnow()
df = pd.read_parquet(f'{large_file_path}.parquet.gzip')
took = dt.utcnow() - start  # elapsed time for reading the Parquet file
took.total_seconds()

---

### Dask

#### Dask natively scales Python.

Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love, such as
* Pandas
* NumPy
* Scikit-Learn

We can summarize the basics of Dask as follows:
* process data that doesn't fit into memory by breaking it into blocks and specifying task chains
* parallelize execution of tasks across cores and even nodes of a cluster
* move computation to the data rather than the other way around, to minimize communication overheads
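The first point, processing data block by block, can be illustrated without Dask at all. The following is a minimal sketch in plain pandas using the `chunksize` option of `read_csv`; the CSV content is generated on the fly and purely hypothetical, and this is not how Dask is implemented internally.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file too large for RAM.
rows = [f"{i % 7 + 1},{(i % 5 + 1) * 2.5}" for i in range(1000)]
csv_text = "units,unitprice\n" + "\n".join(rows)

running_max = float("-inf")
rows_seen = 0

# chunksize makes read_csv yield DataFrames of at most 250 rows each,
# so only one block has to be held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    chunk_total = chunk["units"] * chunk["unitprice"]
    running_max = max(running_max, chunk_total.max())
    rows_seen += len(chunk)

print(rows_seen, running_max)
```

Dask automates this pattern: it builds the chain of per-block tasks for you and can additionally run them in parallel.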

In [None]:
# preprocessing: write five copies of the artificial data as CSV files
df['total'] = df.units * df.unitprice
for i in range(5):
    df.to_csv(f'/tmp/data_for_dask_{i}.csv')

Pandas is great for tabular datasets that fit in memory. 
Dask becomes useful when the dataset you want to analyze is larger than your machine's RAM. 

The dask.dataframe module implements a blocked parallel DataFrame object that mimics a large subset of the Pandas DataFrame. One Dask DataFrame is comprised of many in-memory pandas DataFrames separated along the index. One operation on a Dask DataFrame triggers many pandas operations on the constituent pandas DataFrames in a way that is mindful of potential parallelism and memory constraints.
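The idea of one logical operation fanning out into many pandas operations can be sketched with plain pandas and a thread pool. This is only a conceptual illustration under made-up data; the names `partitions` and `partition_max` are hypothetical, and Dask's real scheduler is considerably more sophisticated.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Hypothetical small frame standing in for a much larger dataset.
pdf = pd.DataFrame({"total": [12.5, 80.0, 3.2, 45.9, 67.1, 9.9, 51.0, 23.4]})

# Split along the index into blocks, similar to the partitions of a Dask DataFrame.
block_size = 3
partitions = [pdf.iloc[i:i + block_size] for i in range(0, len(pdf), block_size)]


def partition_max(part):
    """Ordinary pandas operation applied to a single partition."""
    return part["total"].max()


# One logical "max" becomes one pandas call per partition, run concurrently ...
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_maxes = list(pool.map(partition_max, partitions))

# ... followed by a cheap step that combines the partial results.
overall_max = max(partial_maxes)
print(len(partitions), overall_max)
```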

In [None]:
import dask
filename = '/tmp/data_for_dask_*.csv'  # glob pattern matching all generated CSV files

In [None]:
import dask.dataframe as dd
df = dd.read_csv(filename)
# preview the first rows
df.head()

In [None]:
# count the number of rows across all files
len(df)

##### **Pandas** way

In [None]:
start = dt.utcnow()

maxes = []
for fn in [f'/tmp/data_for_dask_{i}.csv' for i in range(5)]:
    pdf = pd.read_csv(fn)
    maxes.append(pdf.total.max())
    
took = dt.utcnow() - start
took.total_seconds(), max(maxes)

##### **Dask** way

In [None]:
start = dt.utcnow()

# compute() triggers the actual work; keep the result so it can be reported
dask_max = df.total.max().compute()

took = dt.utcnow() - start
took.total_seconds(), dask_max

![Dask](../img/dask-compute.gif)

---