# Parallelize your Python code

In this lesson you will learn how to parallelize custom Python code using Dask. You will learn about the Futures and Delayed APIs, and how to use them to parallelize custom functions.

The example we will be tackling consists of scraping and cleaning some data from the Stack Overflow website. But first, let's do a quick recap of Futures and Delayed objects.

## Futures and Delayed: low-level collections

Dask low-level collections are the best tools when you need fine-grained control to build custom parallel and distributed computations.

**NOTE:** For an introductory lesson on futures and delayed revisit:
- https://tutorial.dask.org/03_dask.delayed.html
- https://tutorial.dask.org/05_futures.html


## Recap Basics
### Futures

Submit arbitrary functions for computation in a parallelized, eager, and non-blocking way. 

The `futures` interface (derived from the built-in `concurrent.futures`) provides fine-grained, real-time execution for custom situations. We can submit an individual function for evaluation with one set of inputs using `submit()`, or evaluate it over a sequence of inputs using `map()`. The call returns immediately, giving one or more *futures*, whose status begins as "pending" and later becomes "finished". There is no blocking of the local Python session. With futures, the computation starts as soon as the inputs and compute resources are available.
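
Because the interface is derived from `concurrent.futures`, the submit/map pattern can be sketched with the standard library alone, without a cluster. The following is only an illustration using a `ThreadPoolExecutor`, not Dask itself, and `square` is a hypothetical stand-in for real work.

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep


def square(x):
    """Hypothetical stand-in for a slow task."""
    sleep(0.2)
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns right away with a future for a single call
    future = pool.submit(square, 3)
    started_pending = not future.done()  # usually True: the task is still sleeping

    # result() blocks until the value is ready
    single = future.result()

    # map() applies the function over a sequence of inputs
    many = list(pool.map(square, range(4)))

print(single, many)
```

One difference worth keeping in mind: the standard library's `Executor.map` yields results directly, whereas Dask's `client.map` returns a list of futures.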

In [None]:
import dask
from dask.distributed import Client

In [None]:
client = Client(n_workers=4)
client

Let's make a toy function, `inc`, that sleeps for a while to simulate work. We'll then time running this function normally.

In [None]:
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

We can run it locally:

In [None]:
inc(1)

**`client.submit()`**

Or we can submit it to run remotely with Dask. This immediately returns a future that points to the ongoing computation, and eventually to the stored result.

In [None]:
future = client.submit(inc, 1)  # returns immediately with pending future
future

If you wait a second, and then check on the future again, you’ll see that it has finished.

In [None]:
future

You can block on the computation and gather the result with the `.result()` method.

In [None]:
future.result()

**`client.map()`**

In [None]:
futures = client.map(inc, range(8))  # returns immediately with pending list of futures
futures

In [None]:
future_sum = client.submit(sum, futures)
future_sum.result()

**NOTE:** For an introductory lesson on futures revisit:
- https://tutorial.dask.org/05_futures.html

**Useful links**
* [Futures documentation](https://docs.dask.org/en/latest/futures.html)
* [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)
* [Futures examples](https://examples.dask.org/futures.html)

### Delayed 

Like `futures`, `delayed` supports arbitrary task scheduling, but it does so **lazily**. This is the key difference between the two: `delayed` builds a task graph that runs only when you ask for a result, while `futures` start executing eagerly.


In [None]:
@dask.delayed
def inc(x):
    sleep(1)
    return x + 1

In [None]:
x = inc(1)
x

This ran immediately, since nothing has really happened yet.

To get the result, call `compute`. A single delayed call takes about as long as the plain function; the speed-up appears once the graph contains several independent tasks.

In [None]:
x.compute()

The same example we did with `client.map()` looks like this with delayed:

In [None]:
results = []
for x in range(8):
    y = inc(x)
    results.append(y)
    
total = sum(results)

In [None]:
total.compute()

**NOTE:** For an introductory lesson on delayed revisit:
- https://tutorial.dask.org/03_dask.delayed.html

**Useful links**

* [Delayed documentation](https://docs.dask.org/en/latest/delayed.html)
* [Delayed screencast](https://www.youtube.com/watch?v=SHqFmynRxVU)
* [Delayed API](https://docs.dask.org/en/latest/delayed-api.html)
* [Delayed examples](https://examples.dask.org/delayed.html)
* [Delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html)

## Grown-up example: Scraping and cleaning SO data

The recap, like plenty of introductory tutorials, uses toy examples. In this section, we graduate to a grown-up example. You will learn how to parallelize a scraping and cleaning workflow.


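In this example we scrape the question listings for a given tag from https://stackoverflow.com/questions/ and, for each question summary, extract the title, the stats (votes, answers, and views), and the tags.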



**Note about throttling:** Stack Exchange enforces a throttling limit on its API.

> If an application does not have an access_token, then the application shares an IP based quota with all other applications on that IP. This quota is based on the key being passed by the applications; it is the max of the daily request limit for the applications involved, which by default is 10,000. This quota scheme is essentially unchanged from earlier versions of the API.

The following code works well within those limits, but if you want to explore further, keep in mind the limitations you may run into: https://api.stackexchange.com/docs/throttle
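
One simple way to stay under a request quota is to enforce a minimum interval between calls. The sketch below is an illustration only: the `Throttle` class and the interval value are hypothetical, not part of the Stack Exchange API or of Dask, and the helper only limits calls made from a single process.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum delay between successive calls."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        # Sleep just long enough so that calls are at least min_interval apart
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=0.1)  # assumed interval, for illustration only
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a real scraper would issue its request right after this
elapsed = time.monotonic() - start
print(f"three calls took {elapsed:.2f}s")
```

When tasks run on several Dask worker processes, each process would hold its own `Throttle`, so the combined request rate could still exceed the interval chosen here.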

In [1]:
import re
import time

import pandas as pd
import requests
from requests_html import HTML

Except for the number of pages to scrape and their size, the code is pretty much ready; we need to polish it and write the narrative around the examples.

TODO:

    - Introduce the "SO problem"
    - Write text around how to solve it serially
    - Write text around multiple ways to parallelize the workflow
    - Write text around as_completed vs. the Dask DataFrame example
    - Select some parts to be exercises.
     

In [4]:
# r = requests.get(url)
# html_str = r.text
# html = HTML(html=html_str)

`.s-post-summary`: Base parent container for a post summary

https://stackoverflow.design/product/components/post-summary/

In [5]:
def clean_scraped_data(text, keyname=None):
    """
    Cleans the scraped data once in text format
    """
    if keyname == "stats":
        # Capture each count together with an optional "k" suffix,
        # e.g. "12", "3k" or "1.2k" (the suffix was previously lost)
        text_strings = re.findall(r"(\d+(?:\.\d+)?)(k?)", text)
        # Expand the suffix and return the counts as integers
        return [round(float(n) * (1000 if s == "k" else 1))
                for n, s in text_strings]
    elif keyname == "tags":
        return text.split("\n")
    return text
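
To see what the `stats` branch is meant to produce, here is a small standalone check. It assumes the summary text looks like `"0 votes\n0 answers\n4 views"` and that large counts may be abbreviated with a `k` suffix. Both the sample strings and the helper name `parse_counts` are illustrative assumptions, not taken from the page markup.

```python
import re


def parse_counts(text):
    """Hypothetical helper: extract integer counts, expanding a 'k' suffix."""
    counts = []
    for number, suffix in re.findall(r"(\d+(?:\.\d+)?)(k?)", text):
        value = float(number) * (1000 if suffix == "k" else 1)
        counts.append(round(value))
    return counts


# Made-up sample strings in the assumed "count label" layout
plain = parse_counts("0 votes\n0 answers\n4 views")
abbreviated = parse_counts("12 votes\n3 answers\n1.2k views")
print(plain, abbreviated)
```

Capturing the suffix together with the number matters: a pattern that matches digits alone would silently read `1.2k` as two separate small numbers.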

In [6]:
def extract_url_and_parse_html(url):
    """
    Given a SO url gets questions summary and cleans data.
    
    see also: clean_scraped_data
    """
    # parse a single page
    r = requests.get(url)
    
    # treat any non-2xx response as "no data"
    if r.status_code not in range(200, 300):
        return []
    
    # build an HTML object from the response body
    html = HTML(html=r.text)
        
    questions = html.find(".s-post-summary")
    
    # CSS selector for each field of a question summary
    key_class_dict = {"title": ".s-post-summary--content-title",
                      "stats": ".s-post-summary--stats",
                      "tags": ".s-post-summary--meta-tags"}

    datas = []

    for q_el in questions:
        q_data = {}
        for k, v in key_class_dict.items():
            q_data[k] = clean_scraped_data(q_el.find(v, first=True).text, k)
        
        q_data["vote"] = q_data["stats"][0]
        q_data["answer"] = q_data["stats"][1]
        q_data["view"] = q_data["stats"][2]
        datas.append(q_data)
    return datas

In [7]:
def scrape_tag(page_num, tag="python", query_filter="Newest", pagesize=50):
    
    base_url = "https://stackoverflow.com/questions/tagged/"
    url = f"{base_url}{tag}?tab={query_filter}&page={page_num}&pagesize={pagesize}"
    
    datas_page = extract_url_and_parse_html(url)

    return pd.DataFrame(datas_page)

## Sequential

In [9]:
%%time
pd_res = []
for page in range(1, 11):
    pd_res.append(scrape_tag(page))

##pd_res

CPU times: user 2.2 s, sys: 36.6 ms, total: 2.24 s
Wall time: 3.94 s


## client.submit

### NOTE: ADD SOMETHING WITH wait. 

In [14]:
%%time
fut = []
for page_num in range(1, 11):
    future = client.submit(scrape_tag, page_num)
    fut.append(future)

res = client.gather(fut)

CPU times: user 209 ms, sys: 77.1 ms, total: 286 ms
Wall time: 1.46 s


In [15]:
fut[0]

In [18]:
res[0].head(3)

Unnamed: 0,title,stats,tags,vote,answer,view
0,Python errors in command prompt,"[0, 0, 4]","[python, python-3.x, python-2.7]",0,0,4
1,How to use KinematicTrajectoryOptimization in ...,"[1, 0, 3]","[python, optimization, drake]",1,0,3
2,Python Selenium & Browserstack - Connect to proxy,"[0, 0, 6]","[python, selenium, browserstack]",0,0,6


## client.map

`client.map` can be called with the function itself or with a lambda that fixes extra arguments; both versions are shown below.

In [21]:
del fut

In [28]:
%%time
t = client.map(scrape_tag, range(1, 11))
test = client.gather(t)
#wait(t)

CPU times: user 76 ms, sys: 15.4 ms, total: 91.4 ms
Wall time: 831 ms


In [30]:
test[0].head(3)

Unnamed: 0,title,stats,tags,vote,answer,view
0,Python errors in command prompt,"[0, 0, 4]","[python, python-3.x, python-2.7]",0,0,4
1,How to use KinematicTrajectoryOptimization in ...,"[1, 0, 3]","[python, optimization, drake]",1,0,3
2,Python Selenium & Browserstack - Connect to proxy,"[0, 0, 6]","[python, selenium, browserstack]",0,0,6


In [31]:
%%time
t_d = client.map(lambda p: scrape_tag(p, tag="dask"), range(1, 11))
t_d_res = client.gather(t_d)

CPU times: user 104 ms, sys: 31.6 ms, total: 136 ms
Wall time: 1.23 s


In [33]:
t_d_res[0].head(3)

Unnamed: 0,title,stats,tags,vote,answer,view
0,Load data with chunks,"[0, 1, 20]","[python, pandas, sqlalchemy, dask, chunks]",0,1,20
1,Apply a function over the columns of a Dask array,"[2, 0, 23]","[dask, dask-distributed, dask-dataframe, dask-...",2,0,23
2,Is py-datatable compatible with Dask?,"[0, 0, 18]","[dask, py-datatable]",0,0,18


## Use delayed

In [28]:
import dask

In [39]:
@dask.delayed
def scrape_tag_delayed(page_num, tag="python", query_filter="Newest", pagesize=50):
    base_url = "https://stackoverflow.com/questions/tagged/"
    url = f"{base_url}{tag}?tab={query_filter}&page={page_num}&pagesize={pagesize}"
    datas_page = extract_url_and_parse_html(url)
    # time.sleep(1.2)  # a pause between requests may be needed
    return pd.DataFrame(datas_page)

In [40]:
res = []
for page in range(1, 11):
    res.append(scrape_tag_delayed(page))

In [41]:
res

[Delayed('scrape_tag_delayed-2cc531c7-c50d-40af-bf65-d3a1909de259'),
 Delayed('scrape_tag_delayed-7980fbe0-dbc5-46f6-b79b-379617bead4b'),
 Delayed('scrape_tag_delayed-ca4684a0-a78d-4241-ab0f-c0bea51900e5'),
 Delayed('scrape_tag_delayed-e8a95c0e-e2ad-432e-929a-c4267b7fda68'),
 Delayed('scrape_tag_delayed-6df82447-fc5f-49f9-8f8a-84f2addade57'),
 Delayed('scrape_tag_delayed-fce202db-e24c-42c0-99bf-ca28040f48fc'),
 Delayed('scrape_tag_delayed-812e0f2d-558b-4262-a56c-fa5a6c05680a'),
 Delayed('scrape_tag_delayed-18e7de00-63e9-471b-a921-dbb9c7080fe2'),
 Delayed('scrape_tag_delayed-2c163ab4-4732-49a1-b093-b443499c3428'),
 Delayed('scrape_tag_delayed-b17adc02-e4c2-42a9-909f-b3f0d6a16c6f')]

In [33]:
%%time
r = dask.compute(res)

CPU times: user 99.9 ms, sys: 30.3 ms, total: 130 ms
Wall time: 1.21 s


In [169]:
from dask.distributed import as_completed

## NOTE
HERE DO EXAMPLE BELOW WITH WAIT AND TIME IT 

In [189]:
%%time
t15 = client.map(scrape_tag, range(15))

tot_views= 0 

for future in as_completed(t15):
    test = future.result()['view']
    tot_views += test.sum()

tot_views

CPU times: user 35.5 ms, sys: 7.28 ms, total: 42.8 ms
Wall time: 51.7 ms


105874
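
The loop above handles each result as soon as its future finishes, instead of waiting for the whole batch. The standard library's `concurrent.futures.as_completed` follows the same pattern, so the idea can be sketched without a cluster. In this illustration `fake_views` is a hypothetical stand-in for `scrape_tag`, with made-up delays and values.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import sleep


def fake_views(page):
    """Hypothetical stand-in for a page scrape: later pages sleep less."""
    sleep(0.05 * (4 - page))
    return page * 10


completion_order = []
total_views = 0

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(fake_views, page) for page in range(1, 4)]
    # Futures are yielded as they finish, not in submission order
    for future in as_completed(futures):
        views = future.result()
        completion_order.append(views)
        total_views += views

print(completion_order, total_views)
```

The running total is the same whatever order the futures finish in, which is why this pattern suits reductions such as a sum.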

In [118]:
import dask.dataframe as dd

In [183]:
test_del = dd.from_delayed(t15)

In [184]:
test_del

,title,stats,tags,vote,answer,view
npartitions=15,,,,,,
,object,object,object,int64,int64,int64
,...,...,...,...,...,...
...,...,...,...,...,...,...
,...,...,...,...,...,...
,...,...,...,...,...,...


In [185]:
test_del['view'].sum()

dd.Scalar<series-..., dtype=int64>

In [186]:
%%time
test_del['view'].sum().compute()

CPU times: user 37.5 ms, sys: 4.54 ms, total: 42.1 ms
Wall time: 49 ms


105874

In [191]:
client.shutdown()