# Parallelize your Python code

In this lesson you will learn how to parallelize custom Python code using Dask. You will learn about the Futures and Delayed APIs, and how to use them to parallelize custom functions.

The example we will tackle consists of scraping and cleaning some data from the Stack Overflow website. But first, let's do a quick recap of Futures and Delayed objects.

## Futures and Delayed: low-level collections

Dask low-level collections are the best tools when you need fine control to build custom parallel and distributed computations.

**NOTE:** For an introductory lesson on futures and delayed revisit:
- https://tutorial.dask.org/03_dask.delayed.html
- https://tutorial.dask.org/05_futures.html


## Recap Basics
### Futures

Submit arbitrary functions for computation in a parallelized, eager, and non-blocking way. 

The `futures` interface (derived from the built-in `concurrent.futures`) provides fine-grained, real-time execution for custom situations. We can submit an individual function for evaluation with one set of inputs using `submit()`, or evaluate it over a sequence of inputs using `map()`. The call returns immediately, giving one or more *futures*, whose status begins as "pending" and later becomes "finished". The local Python session is not blocked. With futures, the computation starts as soon as the inputs and compute resources are available.
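Because the interface mirrors `concurrent.futures`, the submit-then-collect pattern can be sketched with the standard library alone. The snippet below is only an illustration under that assumption: it uses a local `ThreadPoolExecutor` instead of a Dask `Client`, and `slow_inc` is a hypothetical stand-in for real work.

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep


def slow_inc(x):
    """Hypothetical stand-in for real work: wait briefly, then add one."""
    sleep(0.2)
    return x + 1


with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns immediately with a single future
    single = pool.submit(slow_inc, 1)

    # Submitting several inputs gives one future per input
    many = [pool.submit(slow_inc, x) for x in range(4)]

    # result() blocks until the value is available
    single_value = single.result()
    many_values = [f.result() for f in many]

print(single_value)  # 2
print(many_values)   # [1, 2, 3, 4]
```

One difference worth keeping in mind: the standard library's `Executor.map()` yields results directly, whereas Dask's `client.map()` returns a list of futures.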

In [1]:
import dask
from dask.distributed import Client

In [2]:
client = Client(n_workers=4)
client

2022-12-08 09:35:51,965 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/1y/ydztfpnd11b6qmvbb8_x56jh0000gn/T/dask-worker-space/worker-spng1cll', purging
2022-12-08 09:35:51,965 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/1y/ydztfpnd11b6qmvbb8_x56jh0000gn/T/dask-worker-space/worker-4sckt0tb', purging
2022-12-08 09:35:51,966 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/1y/ydztfpnd11b6qmvbb8_x56jh0000gn/T/dask-worker-space/worker-nm7ouwy2', purging
2022-12-08 09:35:51,966 - distributed.diskutils - INFO - Found stale lock file and directory '/var/folders/1y/ydztfpnd11b6qmvbb8_x56jh0000gn/T/dask-worker-space/worker-tbk7x2py', purging


Client
- Connection method: Cluster object
- Cluster type: distributed.LocalCluster
- Dashboard: http://127.0.0.1:8787/status

Cluster info
- Workers: 4
- Total threads: 8
- Total memory: 16.00 GiB
- Status: running
- Using processes: True

Scheduler
- Comm: tcp://127.0.0.1:63898
- Started: Just now

Workers

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:63914 | http://127.0.0.1:63919/status | tcp://127.0.0.1:63902 | 2 | 4.00 GiB | /var/folders/1y/ydztfpnd11b6qmvbb8_x56jh0000gn/T/dask-worker-space/worker-q8kxqda1 |
| tcp://127.0.0.1:63916 | http://127.0.0.1:63918/status | tcp://127.0.0.1:63903 | 2 | 4.00 GiB | /var/folders/1y/ydztfpnd11b6qmvbb8_x56jh0000gn/T/dask-worker-space/worker-2hj7dtzv |
| tcp://127.0.0.1:63915 | http://127.0.0.1:63920/status | tcp://127.0.0.1:63904 | 2 | 4.00 GiB | /var/folders/1y/ydztfpnd11b6qmvbb8_x56jh0000gn/T/dask-worker-space/worker-8_an2lww |
| tcp://127.0.0.1:63913 | http://127.0.0.1:63917/status | tcp://127.0.0.1:63901 | 2 | 4.00 GiB | /var/folders/1y/ydztfpnd11b6qmvbb8_x56jh0000gn/T/dask-worker-space/worker-10k4qv3l |


Let's make a toy function, `inc`, that sleeps for a while to simulate work. We'll then time running this function normally.

In [3]:
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

We can run it locally

In [7]:
inc(1)

2

**`client.submit()`**

Or we can submit it to run remotely with Dask. This immediately returns a future that points to the ongoing computation, and eventually to the stored result.

In [6]:
future = client.submit(inc, 1)  # returns immediately with pending future
future

If you wait a second, and then check on the future again, you’ll see that it has finished.

In [8]:
future

You can block on the computation and gather the result with the `.result()` method.

In [9]:
future.result()

2

**`client.map()`**

In [10]:
futures = client.map(inc, range(8))  # returns immediately with pending list of futures
futures

[<Future: pending, key: inc-e89eb876bfcce4d2b7fa5a1eedd71da0>,
 <Future: finished, type: int, key: inc-b612ca5982bd2c57f06bea3c2e9ee23a>,
 <Future: pending, key: inc-192b31e5f13235db413a8e8d7fbfcd41>,
 <Future: pending, key: inc-a4235b69b0416ff52d2fd50a33197afb>,
 <Future: pending, key: inc-9d6f01cb38a804c622ab0031f981c5aa>,
 <Future: pending, key: inc-380564c010c36f918a586922e139c5a7>,
 <Future: pending, key: inc-1b3178850e2c9e15bc6c073bcbc3ce06>,
 <Future: pending, key: inc-9915b3e13c584dfc5f078311ae6d0a1d>]

In [11]:
future_sum = client.submit(sum, futures)
future_sum.result()

36

**NOTE:** For an introductory lesson on futures revisit:
- https://tutorial.dask.org/05_futures.html

**Useful links**
* [Futures documentation](https://docs.dask.org/en/latest/futures.html)
* [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)
* [Futures examples](https://examples.dask.org/futures.html)

### Delayed 

Similarly to `futures`, `delayed` can be used to support arbitrary task scheduling, but in the case of `delayed` this happens **lazily**. This is the important difference between the two: `delayed` constructs a task graph that runs only when you ask for the result, while `futures` are eager.


In [12]:
@dask.delayed
def inc(x):
    sleep(1)
    return x + 1

In [13]:
%%time
x = inc(1)
x

Delayed('inc-2804d179-c1c9-4b1d-9ae3-ce9fe5d23a92')

This ran immediately, since nothing has really happened yet.

To get the result, call `compute`. A single call still takes about one second, just like the original function; the speedup appears once several independent calls can run in parallel.

In [14]:
x.compute()

2

The same example we did with `client.map()` would look like this using `delayed`:

In [15]:
%%time
results = []
for x in range(8):
    y = inc(x)
    results.append(y)
    
total = sum(results)

In [16]:
total.compute()

36

**NOTE:** For an introductory lesson on delayed revisit:
- https://tutorial.dask.org/03_dask.delayed.html

**Useful links**

* [Delayed documentation](https://docs.dask.org/en/latest/delayed.html)
* [Delayed screencast](https://www.youtube.com/watch?v=SHqFmynRxVU)
* [Delayed API](https://docs.dask.org/en/latest/delayed-api.html)
* [Delayed examples](https://examples.dask.org/delayed.html)
* [Delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html)

## Grown-up example: Scraping and cleaning Stack Overflow data

In the recap, as in plenty of introductory tutorials, we used toy examples. In this section, we graduate to a grown-up example. You will learn how to parallelize a scraping and cleaning workflow.

You'll be scraping multiple pages from https://stackoverflow.com/questions/, cleaning the data you receive, and doing some computations with it. You'll first see how the sequential code works, and then you'll use `futures` and `delayed` to do this in parallel.

**Note about throttling:**

Stack Exchange has a throttling limit.

> If an application does not have an access_token, then the application shares an IP based quota with all other applications on that IP. This quota is based on the key being passed by the applications; it is the max of the daily request limit for the applications involved, which by default is 10,000. This quota scheme is essentially unchanged from earlier versions of the API.

In the following code we will be working well within those limits, but if you want to explore more, keep in mind the limitations you'll run into: https://api.stackexchange.com/docs/throttle
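If you do scale up the experiments, one way to stay polite is to enforce a minimum spacing between requests. The helper below is a hypothetical sketch, not part of the lesson's code: `MinIntervalLimiter` and its parameters are illustrative names, and the limiter only works within a single process, so each Dask worker process would keep its own independent state.

```python
import time


class MinIntervalLimiter:
    """Hypothetical helper: allow at most one call every `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleeper=time.sleep):
        self.min_interval = min_interval
        self._clock = clock        # injectable so the logic can be checked offline
        self._sleeper = sleeper
        self._last = None          # time at which the previous call was allowed

    def wait(self):
        """Sleep just long enough to respect the interval; return the time slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleeper(delay)
        self._last = now + delay
        return delay


limiter = MinIntervalLimiter(min_interval=0.05)
delays = [limiter.wait() for _ in range(3)]  # the first call never sleeps
```

Calling `limiter.wait()` right before each `requests.get(...)` would space the requests out; the interval itself is an assumption you would tune to the limits you are working under.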

In [20]:
import pandas as pd
import re
import requests
from requests_html import HTML




### Stack Overflow walkthrough

Before we start, let's quickly see what kind of data we are going to retrieve.

In [21]:
#Example url
base_url = "https://stackoverflow.com/questions/tagged/"
tag = "dask"
query_filter = "Newest"
url = f"{base_url}{tag}?tab={query_filter}"
url

'https://stackoverflow.com/questions/tagged/dask?tab=Newest'

In [92]:
r = requests.get(url)
html_str = r.text
html = HTML(html=html_str)


`.s-post-summary`: Base parent container for a post summary

https://stackoverflow.design/product/components/post-summary/

`html.find(".s-post-summary")` returns a list of 50 elements. Let's explore one of them.

In [26]:
html.find(".s-post-summary")[0].text

'0 votes\n0 answers\n10 views\nCreating xr.DataArray with dask.delayed coordinates\nI am trying to create a xr.DataArray from the output of dask.delayed. Both the data and its coordinates are read from file in the same call to the delayed function, therefore the coordinates are not ...\npython\ndask\npython-xarray\npnjun\n109\nasked 4 hours ago'

In [27]:
print(html.find(".s-post-summary")[0].text)

0 votes
0 answers
10 views
Creating xr.DataArray with dask.delayed coordinates
I am trying to create a xr.DataArray from the output of dask.delayed. Both the data and its coordinates are read from file in the same call to the delayed function, therefore the coordinates are not ...
python
dask
python-xarray
pnjun
109
asked 4 hours ago


In [36]:
question = html.find(".s-post-summary")[0]

In [39]:
question.text

'0 votes\n0 answers\n10 views\nCreating xr.DataArray with dask.delayed coordinates\nI am trying to create a xr.DataArray from the output of dask.delayed. Both the data and its coordinates are read from file in the same call to the delayed function, therefore the coordinates are not ...\npython\ndask\npython-xarray\npnjun\n109\nasked 4 hours ago'

In [42]:
question.find(".s-post-summary--content-title", first=True).text

'Creating xr.DataArray with dask.delayed coordinates'

In [41]:
question.find(".s-post-summary--stats", first=True).text

'0 votes\n0 answers\n10 views'

In [43]:
question.find(".s-post-summary--meta-tags", first=True).text

'python\ndask\npython-xarray'

## Scrape and clean 

With this information, we wrote some functions that get a page, clean the data from all the posts on that page, and return it as a list of dictionaries.

In [136]:
def extract_url_and_parse_html(url):
    """
    Given a SO url gets questions summary and cleans data.
    Returns a list of dictionaries with the clean data.
    
    see also: clean_scraped_data
    """
    # Fetch a single page; give up on any non-2xx response
    r = requests.get(url)
    
    if r.status_code not in range(200, 300):
        return []
    
    #get html
    html = HTML(html=r.text)
        
    questions = html.find(".s-post-summary")
    
    key_class_dict = {"title": ".s-post-summary--content-title" ,
                  "stats": ".s-post-summary--stats",
                  "tags": ".s-post-summary--meta-tags"}

    datas = []

    for q_el in questions:
        q_data = {}
        for k, v in key_class_dict.items():
            q_data[k] = clean_scraped_data(q_el.find(v, first=True).text, k)
        
        q_data["votes"] = q_data["stats"][0]
        q_data["answers"] = q_data["stats"][1]
        q_data["views"] = q_data["stats"][2]
        datas.append(q_data)
    return datas

In [137]:
def clean_scraped_data(text, keyname=None):
    """
    Cleans the scraped data once in text format
    """
    if keyname == "stats":
        # '12415 votes\n51 answers\n3.1m views' -> ['12415 votes', '51 answers', '3.1m views']
        # t.split()[0] grabs what is before the space and we apply the str_to_num func
        
        text = [str_to_num(t.split()[0]) for t in text.split("\n")]

    elif keyname == "tags":
        text  = text.split("\n")
        
    return text

In [138]:
def str_to_num(x):
    """
    Converts strings of the form '1.5k' or '3.1m' into numbers
    """
    
    mult = {'k': 1e3, 'm': 1e6}

    if x[-1] in mult:
        # strip the suffix and scale, e.g. '1.5k' -> 1500
        num = int(float(x[:-1]) * mult[x[-1]])
    else:
        num = int(float(x))
        
    return num
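The cleaning helpers can be checked offline, without hitting Stack Overflow, by feeding them text shaped like the post summaries explored earlier. The sketch below re-declares compact versions of the two helpers so that it runs on its own; the third input string is a made-up example of the `k` suffix.

```python
def str_to_num(x):
    """Convert strings such as '1.5k' or '3.1m' into integers."""
    mult = {"k": 1e3, "m": 1e6}
    if x[-1] in mult:
        return int(float(x[:-1]) * mult[x[-1]])
    return int(float(x))


def clean_scraped_data(text, keyname=None):
    """Turn the raw text of one scraped field into a cleaner Python value."""
    if keyname == "stats":
        # keep only the number in front of each label
        return [str_to_num(t.split()[0]) for t in text.split("\n")]
    if keyname == "tags":
        return text.split("\n")
    return text


stats = clean_scraped_data("0 votes\n0 answers\n10 views", "stats")
tags = clean_scraped_data("python\ndask\npython-xarray", "tags")
scaled = clean_scraped_data("12415 votes\n51 answers\n1.5k views", "stats")

print(stats)   # [0, 0, 10]
print(tags)    # ['python', 'dask', 'python-xarray']
print(scaled)  # [12415, 51, 1500]
```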

## Let's scrape 

The following function scrapes one page of results for a specific `tag` and `query_filter` and returns a `pandas.DataFrame`.

In [139]:
def scrape_tag(page_num, tag="dask", query_filter="Newest", pagesize=50):
    
    base_url = "https://stackoverflow.com/questions/tagged/"
    url = f"{base_url}{tag}?tab={query_filter}&page={page_num}&pagesize={pagesize}"
    
    datas_page = extract_url_and_parse_html(url)

    return pd.DataFrame(datas_page)

## Sequential

Let's get some data for the first 32 pages, where we have 50 posts per page. 

In [143]:
%%time
pd_res = []
for page in range(1, 33):
    pd_res.append(scrape_tag(page))

##pd_res

CPU times: user 7.33 s, sys: 192 ms, total: 7.52 s
Wall time: 12.2 s


In [144]:
len(pd_res)

32

In [145]:
pd_res[0].head()

| | title | stats | tags | votes | answers | views |
|---|---|---|---|---|---|---|
| 0 | Creating xr.DataArray with dask.delayed coordi... | [0, 0, 10] | [python, dask, python-xarray] | 0 | 0 | 10 |
| 1 | Insert data to mssql in partitions | [0, 0, 20] | [python, sql-server, dataframe, sqlalchemy, dask] | 0 | 0 | 20 |
| 2 | How to check if file is empty in dask python w... | [0, 0, 24] | [python, dask] | 0 | 0 | 24 |
| 3 | How to check if xarray's `DataArray` is `dask.... | [1, 1, 21] | [python, dask, python-xarray] | 1 | 1 | 21 |
| 4 | How to calculate the columnwise minimum of a d... | [0, 0, 6] | [dask] | 0 | 0 | 6 |


## Parallel
The sequential version took ~12 s. Let's see how long it takes if we do it in parallel using futures and delayed.

### `client.submit()`

Notice that the code changes are minimal compared to the sequential version.

In [146]:
%%time
fut = []
for page_num in range(1, 33):
    future = client.submit(scrape_tag, page_num)
    fut.append(future)

res = client.gather(fut)

CPU times: user 373 ms, sys: 73.7 ms, total: 446 ms
Wall time: 2.52 s


In [147]:
fut[0]

In [148]:
res[0].head(3)

| | title | stats | tags | votes | answers | views |
|---|---|---|---|---|---|---|
| 0 | Creating xr.DataArray with dask.delayed coordi... | [0, 0, 10] | [python, dask, python-xarray] | 0 | 0 | 10 |
| 1 | Insert data to mssql in partitions | [0, 0, 20] | [python, sql-server, dataframe, sqlalchemy, dask] | 0 | 0 | 20 |
| 2 | How to check if file is empty in dask python w... | [0, 0, 24] | [python, dask] | 0 | 0 | 24 |


### `client.map()`

We can do this in a different way, and in fewer lines of code, using `client.map()`. We first delete the futures in `fut`: Dask is smart enough to realize we already computed these results and would otherwise reuse them instead of running the tasks again.

In [149]:
del fut
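The reuse happens because, for functions treated as pure (the default), Dask derives the task key from the function and its arguments. The sketch below illustrates this with a toy function. It starts its own small in-process client so that it runs standalone (in the lesson you would reuse the existing `client`), and the cluster settings are assumptions chosen only to keep it lightweight.

```python
from dask.distributed import Client


def inc(x):
    return x + 1


# In-process cluster without a dashboard, used only for this sketch
client = Client(processes=False, n_workers=1, threads_per_worker=2,
                dashboard_address=None)

first = client.submit(inc, 1)
second = client.submit(inc, 1)               # same function, same input
impure = client.submit(inc, 1, pure=False)   # ask Dask not to deduplicate

same_key = first.key == second.key           # identical work shares one key
different_key = first.key != impure.key      # pure=False gets a fresh key
values = client.gather([first, second, impure])

client.close()
```

As long as a future with a given key is still referenced, submitting the same work again returns that result rather than recomputing it, which would make a timing comparison misleading.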

In [151]:
%%time
fut_map = client.map(scrape_tag, range(1, 33)) #this returns a list of futures
res_map = client.gather(fut_map)

CPU times: user 250 ms, sys: 58 ms, total: 308 ms
Wall time: 2.55 s


In [152]:
res_map[0].head(3)

| | title | stats | tags | votes | answers | views |
|---|---|---|---|---|---|---|
| 0 | Creating xr.DataArray with dask.delayed coordi... | [0, 0, 10] | [python, dask, python-xarray] | 0 | 0 | 10 |
| 1 | Insert data to mssql in partitions | [0, 0, 20] | [python, sql-server, dataframe, sqlalchemy, dask] | 0 | 0 | 20 |
| 2 | How to check if file is empty in dask python w... | [0, 0, 24] | [python, dask] | 0 | 0 | 24 |


### `client.map()` with lambda functions

Let's suppose that we want to explore a different `tag`. We can use `client.map` along with a `lambda` function to pass the variable along. 

In [153]:
%%time
fut_py = client.map(lambda p: scrape_tag(p, tag="python"), range(1, 33))
res_py = client.gather(fut_py)

CPU times: user 282 ms, sys: 60.8 ms, total: 342 ms
Wall time: 2.58 s


In [154]:
res_py[0].head(3)

| | title | stats | tags | votes | answers | views |
|---|---|---|---|---|---|---|
| 0 | What does 1.(x = y or z ) & 2.(x=y and z) assi... | [0, 0, 2] | [python] | 0 | 0 | 2 |
| 1 | factorial calculation by user defining functio... | [0, 0, 2] | [python, python-3.x, python-2.7] | 0 | 0 | 2 |
| 2 | Extracting minimum/maximum/average value from ... | [0, 0, 2] | [python, numpy] | 0 | 0 | 2 |
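As an alternative to the `lambda`, keyword arguments passed to `client.map()` are forwarded to every call, and `functools.partial` achieves the same effect. The sketch below uses `scrape_stub`, a hypothetical stand-in for `scrape_tag` that needs no network access, and starts its own in-process client so that it runs standalone.

```python
from functools import partial

from dask.distributed import Client


def scrape_stub(page_num, tag="dask"):
    """Hypothetical stand-in for scrape_tag that needs no network access."""
    return f"{tag}:{page_num}"


# In-process cluster without a dashboard, used only for this sketch
client = Client(processes=False, n_workers=1, threads_per_worker=2,
                dashboard_address=None)

# Keyword arguments given to client.map are forwarded to every call
via_kwargs = client.gather(client.map(scrape_stub, range(1, 4), tag="python"))

# functools.partial fixes the argument up front
via_partial = client.gather(
    client.map(partial(scrape_stub, tag="python"), range(1, 4))
)

client.close()

print(via_kwargs)   # ['python:1', 'python:2', 'python:3']
print(via_partial)  # ['python:1', 'python:2', 'python:3']
```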


### `delayed`

Another approach to this problem is to use `delayed`. Let's create a separate function that is **exactly the same** as `scrape_tag`, but this time with the `@dask.delayed` decorator.

For clarity, I will rename the function, but keep in mind that nothing else has changed.

In [155]:
@dask.delayed
def scrape_tag_delayed(page_num, tag="dask", query_filter="Newest", pagesize=50):
    
    base_url = "https://stackoverflow.com/questions/tagged/"
    url = f"{base_url}{tag}?tab={query_filter}&page={page_num}&pagesize={pagesize}"
    
    datas_page = extract_url_and_parse_html(url)

    return pd.DataFrame(datas_page)

In [160]:
%%time
res = []
for page in range(1, 33):
    res.append(scrape_tag_delayed(page))

CPU times: user 2.97 ms, sys: 800 µs, total: 3.77 ms
Wall time: 4.03 ms


Notice that nothing happened in the dashboard, and things ran very fast. This is because no computation has happened yet. Remember, `delayed` is **lazy**.

In [163]:
res[:3]

[Delayed('scrape_tag_delayed-834741ab-246e-4e86-b256-7a76660c971d'),
 Delayed('scrape_tag_delayed-2582c001-84ec-4f59-ba0e-b275ee47290d'),
 Delayed('scrape_tag_delayed-1e6e3e34-a55e-4fe9-b7b0-87f0915af4a1')]

In [164]:
%%time
r = dask.compute(res)

CPU times: user 266 ms, sys: 55.3 ms, total: 322 ms
Wall time: 2.46 s
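Note the shape of what `dask.compute` returns: passing the list itself gives a one-element tuple that holds the list of results, while unpacking the list gives one tuple entry per delayed object. The sketch below shows both forms with `make_page`, a hypothetical stand-in for `scrape_tag_delayed` that builds a tiny DataFrame instead of scraping. The threaded scheduler is an assumption chosen so the snippet runs without a cluster.

```python
import dask
import pandas as pd


@dask.delayed
def make_page(page_num):
    """Hypothetical stand-in for scrape_tag_delayed: one tiny DataFrame per page."""
    return pd.DataFrame({"page": [page_num], "views": [page_num * 10]})


pages = [make_page(p) for p in range(1, 4)]

# Passing the list: a one-element tuple holding the list of results
(frames,) = dask.compute(pages, scheduler="threads")

# Unpacking the list: one tuple entry per delayed object
unpacked = dask.compute(*pages, scheduler="threads")

# Either form can be combined into a single DataFrame
combined = pd.concat(frames, ignore_index=True)
total_views = int(combined["views"].sum())

print(len(frames), len(unpacked))  # 3 3
print(total_views)                 # 60
```

With the lesson's variables, this means the list of DataFrames is `r[0]`.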


## Process results as they complete with `as_completed`

In [169]:
from dask.distributed import as_completed


In [189]:
%%time
t15 = client.map(scrape_tag, range(15))

tot_views = 0

for future in as_completed(t15):
    # scrape_tag names this column "views"
    page_views = future.result()["views"]
    tot_views += page_views.sum()

tot_views

CPU times: user 35.5 ms, sys: 7.28 ms, total: 42.8 ms
Wall time: 51.7 ms


105874
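For comparison, `wait` blocks until all futures are done, whereas `as_completed` hands each one back as soon as it finishes. The sketch below contrasts the two using `fake_views`, a hypothetical stand-in that returns made-up view counts after a short sleep. It starts its own in-process client so that it runs standalone; none of the numbers come from Stack Overflow.

```python
from time import sleep

from dask.distributed import Client, as_completed, wait


def fake_views(page_num):
    """Hypothetical stand-in for a scraped page: returns a made-up view count."""
    sleep(0.05 * page_num)
    return page_num * 100


# In-process cluster without a dashboard, used only for this sketch
client = Client(processes=False, n_workers=1, threads_per_worker=4,
                dashboard_address=None)

futures = client.map(fake_views, range(1, 5))

# as_completed: handle each result as soon as it arrives
running_total = 0
for future, result in as_completed(futures, with_results=True):
    running_total += result

# wait: block until everything is done, then process in submission order
wait(futures)
total_after_wait = sum(f.result() for f in futures)

client.close()

print(running_total, total_after_wait)  # 1000 1000
```

Both approaches give the same total. `as_completed` pays off when the per-result processing is itself expensive, because that processing can overlap with the tasks that are still running.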

In [118]:
import dask.dataframe as dd

In [183]:
test_del = dd.from_delayed(t15)

In [184]:
test_del

| npartitions=15 | title | stats | tags | votes | answers | views |
|---|---|---|---|---|---|---|
| | object | object | object | int64 | int64 | int64 |
| | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... |


In [185]:
test_del['views'].sum()

dd.Scalar<series-..., dtype=int64>

In [186]:
%%time
test_del['views'].sum().compute()

CPU times: user 37.5 ms, sys: 4.54 ms, total: 42.1 ms
Wall time: 49 ms


105874

In [191]:
client.shutdown()