<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">
     
# Parallelize your Python code

In this lesson you will learn how to parallelize custom Python code using Dask. You will learn about the Futures API and how to use it to parallelize custom functions.

The example we will be tackling consists of scraping and cleaning some data from the Stack Overflow website. But first, let's do a quick recap on Futures.

## Futures: a low-level collection.

Dask low-level collections are the best tools when you need fine control to build custom parallel and distributed computations.

One of the most common cases where you can use `Futures` is when you have a for loop. For example, you need to apply a read-transform-write function over multiple files, where your serial code will look something like:


```python
def process_file(filename):
    data = read_a_file(filename)
    data_transformed = do_a_transformation(data)
    destination = f"results/{filename}"
    write_out_data(data_transformed, destination)
    return destination

files = ["file_1", "file_2", "file_3", ..., "file_n"]  # list of files
new_files = []  # where we save the destination file names

for f in files:
    new_files.append(process_file(f))
```

In the code above, every call to `process_file` is independent of the others; this is what is called an embarrassingly parallel problem. You can run the calls in parallel with:

```python
futures = []
for f in files:
    future = client.submit(process_file, f)
    futures.append(future)
    
futures
```
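
In the snippet above, `client` is a Dask `Client`, which is created later in this lesson. As a rough way to try the same submit-and-collect pattern before then, here is a minimal sketch that swaps Dask for the standard library's `concurrent.futures`. The body of `process_file` (upper-casing small text files in a temporary directory) and the file names are made up for illustration.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_file(source, results_dir):
    """Read a text file, upper-case it, and write it to results_dir."""
    data = source.read_text()
    data_transformed = data.upper()  # stand-in for a real transformation
    destination = results_dir / source.name
    destination.write_text(data_transformed)
    return destination


# Create three small input files in a temporary directory.
workdir = Path(tempfile.mkdtemp())
results_dir = workdir / "results"
results_dir.mkdir()
files = []
for i in range(1, 4):
    path = workdir / f"file_{i}.txt"
    path.write_text(f"contents of file {i}")
    files.append(path)

# Submit one task per file; each call is independent of the others.
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(process_file, f, results_dir) for f in files]
    new_files = [future.result() for future in futures]

print([p.name for p in new_files])
```

Because the results are collected by iterating over `futures` in submission order, `new_files` lines up with `files` even if the tasks finish in a different order.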

## Example: Get the title of SO questions pages

During this lesson, we will be working with the Stack Overflow question pages. To start, let's see how to grab the title of each page, and how to do this in parallel for multiple pages.

If you go to https://stackoverflow.com/questions/ you will see a list of the newest posts; if you scroll to the bottom of the page you can switch to the next page. For example, at the time this notebook was created, the top of page number two looked like this:

<center>
<img src="images/SO_page.png"
     width="65%"
     alt="SO page">
</center>

### Get the title

The title of the page is what is shown in the browser tab. The following function gets the title of a page, given its page number:

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import time

In [22]:
def get_questions_page_title(page_num):
    """ Get title of a SO questions page
    """
    url = f"https://stackoverflow.com/questions?tab=newest&page={page_num}"
    req = requests.get(url)
    html = bs(req.text, "html.parser")

    return html.title.text

In [23]:
page_2 = get_questions_page_title(2)
page_2

**Serial code to get 8 pages would be**

In [6]:
%%time
page_html = []
for p in range(1, 9):  # page numbers start at 1
    page_html.append(get_questions_page_title(p))
    

CPU times: user 1e+03 ms, sys: 31.4 ms, total: 1.03 s
Wall time: 2.8 s


In [8]:
page_html[:3]

['Newest Questions - Page 1 - Stack Overflow',
 'Newest Questions - Page 2 - Stack Overflow',
 'Newest Questions - Page 3 - Stack Overflow']

### Exercise

Run the code in parallel, using futures.

In [15]:
from dask.distributed import Client, wait

In [10]:
client = Client(n_workers=4)
client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Dashboard: http://127.0.0.1:8787/status
Workers: 4
Total threads: 8
Total memory: 16.00 GiB
Status: running
Using processes: True

Comm: tcp://127.0.0.1:64310
Workers: 4
Dashboard: http://127.0.0.1:8787/status
Total threads: 8
Started: Just now
Total memory: 16.00 GiB

Comm: tcp://127.0.0.1:64322
Total threads: 2
Dashboard: http://127.0.0.1:64327/status
Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:64315
Local directory: /var/folders/1y/ydztfpnd11b6qmvbb8_x56jh0000gn/T/dask-worker-space/worker-b9is7613

Comm: tcp://127.0.0.1:64323
Total threads: 2
Dashboard: http://127.0.0.1:64326/status
Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:64313
Local directory: /var/folders/1y/ydztfpnd11b6qmvbb8_x56jh0000gn/T/dask-worker-space/worker-d2q8g6tj

Comm: tcp://127.0.0.1:64330
Total threads: 2
Dashboard: http://127.0.0.1:64331/status
Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:64316
Local directory: /var/folders/1y/ydztfpnd11b6qmvbb8_x56jh0000gn/T/dask-worker-space/worker-1t8xq33d

Comm: tcp://127.0.0.1:64321
Total threads: 2
Dashboard: http://127.0.0.1:64324/status
Memory: 4.00 GiB
Nanny: tcp://127.0.0.1:64314
Local directory: /var/folders/1y/ydztfpnd11b6qmvbb8_x56jh0000gn/T/dask-worker-space/worker-8x1_t2zz

In [11]:
#Solution
futures = []
for p in range(1,9):
    future = client.submit(get_questions_page_title, p)
    futures.append(future)
    
futures

[<Future: pending, key: get_questions_page_title-cf6b290a43bbc50b6b7ae36d0bea825c>,
 <Future: pending, key: get_questions_page_title-ea19e73919b66b229f06298dd0296f66>,
 <Future: pending, key: get_questions_page_title-a7eee38cc4e44a007b4af84a4dbcdfda>,
 <Future: pending, key: get_questions_page_title-c5f800008e189c6188c490aeda41ca28>,
 <Future: pending, key: get_questions_page_title-84a443272ee63b7d105e90d2fc8c9d7c>,
 <Future: pending, key: get_questions_page_title-5b408944028ad056224128a8ee0a680f>,
 <Future: pending, key: get_questions_page_title-965a4ee509af33e806162b35830cb692>,
 <Future: pending, key: get_questions_page_title-1cc10177431e28480edf4972c7d585c6>]

In [12]:
futures[0]

In [13]:
futures[0].result()

'Newest Questions - Page 1 - Stack Overflow'

In [None]:
results = [future.result() for future in futures]
results

**Extra:**

To `%%time` the cell and compare times with the serial version, you need to wait for the futures to finish by calling `wait(futures)`. If you do that and re-run the cell, you will notice that it returns immediately. This is because, by default, distributed assumes that all functions are pure, so a task submitted again with the same function and arguments gets the same key and can reuse the result that is already in memory. Pure functions:

- always return the same output for a given set of inputs
- do not have side effects, like modifying global state or creating files

 
To force the tasks to run again, pass the `pure=False` keyword argument to `client.submit()`. Modify your solution to match this code:


```python
%%time
futures = []
for p in range(1,9):
    future = client.submit(get_questions_page_title, p, pure=False)
    futures.append(future)
    
wait(futures)
```
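
To build intuition for why a resubmitted pure task comes back immediately, here is a toy model in plain Python. It is not Dask's implementation: `toy_submit`, `slow_title`, and the way the key is built are hypothetical simplifications of the idea that a pure task's key is derived from the function and its arguments (compare the `get_questions_page_title-<hash>` keys printed earlier), while an impure task gets a fresh key on every submission.

```python
import hashlib
import itertools

calls = []          # records every real execution of slow_title
_finished = {}      # key -> result; stands in for results already in memory
_unique = itertools.count()


def slow_title(page_num):
    calls.append(page_num)
    return f"Newest Questions - Page {page_num}"


def toy_submit(func, *args, pure=True):
    """Toy model: reuse a stored result when the task key already exists."""
    if pure:
        # Same function and same arguments give the same key.
        token = hashlib.md5(repr(args).encode()).hexdigest()
    else:
        # Impure tasks get a fresh key on every submission.
        token = f"unique-{next(_unique)}"
    key = f"{func.__name__}-{token}"
    if key not in _finished:
        _finished[key] = func(*args)
    return _finished[key]


toy_submit(slow_title, 1)              # runs the function
toy_submit(slow_title, 1)              # same key: nothing is re-run
toy_submit(slow_title, 1, pure=False)  # new key: runs again
toy_submit(slow_title, 1, pure=False)  # new key: runs again
print(len(calls))  # prints 3
```

Four submissions lead to only three executions, because the second pure submission finds its key already present.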

## TODO
- do `client.map` as the equivalent of the for-loop `submit` with the page title
- replace the code so that one function requests the html and another function takes it and gets the links,
    and as they are completed pass it to get the post links

- transition to a complicated example where we have an extra function to get data
- we also need motivation and to show a plot or something
- something like (histogram with people usrname and +votes, +most answered questions)

```
page_2.find_all("a", class_="s-link")[3:-1][0]

t = page_2.find_all("a", class_="s-link")[3:-1][0]['href']

```

## Recap the Basics
### Futures

Submit arbitrary functions for computation in a parallelized, eager, and non-blocking way. 

The `futures` interface (derived from the built-in `concurrent.futures`) provides fine-grained real-time execution for custom situations. We can submit individual functions for evaluation with one set of inputs using `submit()`, or evaluate them over a sequence of inputs using `map()`. The call returns immediately, giving one or more *futures*, whose status begins as "pending" and later becomes "finished". There is no blocking of the local Python session. With futures, the computation starts as soon as the inputs and the compute resources are available.
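
Because the interface is derived from `concurrent.futures`, the same vocabulary can be tried with the standard library alone. The following is a minimal sketch under that assumption, not Dask code; `gated_inc` and `gate` are illustrative names, and the `threading.Event` is only there to make the pending state observable in a deterministic way.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

gate = threading.Event()  # controls when the tasks are allowed to finish


def gated_inc(x):
    gate.wait()  # block until the main thread opens the gate
    return x + 1


with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns immediately with a future for one call.
    future = executor.submit(gated_inc, 1)
    done_before = future.done()  # False: the task is still waiting on the gate

    gate.set()                   # let the work proceed
    result = future.result()     # blocks until the result is ready
    done_after = future.done()   # True

    # map() applies the function over a sequence of inputs.
    mapped = list(executor.map(gated_inc, range(4)))

print(done_before, result, done_after, mapped)
# prints: False 2 True [1, 2, 3, 4]
```

One difference to keep in mind: the standard library's `executor.map` yields results directly, whereas Dask's `client.map` returns a list of futures.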

Let's make a toy function, `inc`, that sleeps for a while to simulate work. We'll then time running this function normally.

In [None]:
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

We can run these locally

In [None]:
inc(1)

**`client.submit()`**

Or we can submit them to run remotely with Dask. This immediately returns a future that points to the ongoing computation, and eventually to the stored result.

In [None]:
future = client.submit(inc, 1)  # returns immediately with pending future
future

If you wait a second, and then check on the future again, you’ll see that it has finished.

In [None]:
future

You can block on the computation and gather the result with the `.result()` method.

In [None]:
future.result()

**`client.map()`**

In [None]:
futures = client.map(inc, range(8))  # returns immediately with pending list of futures
futures

In [None]:
future_sum = client.submit(sum, futures)
future_sum.result()

**`as_completed()`**

In [None]:
from dask.distributed import as_completed
import numpy as np

In [None]:
def random_score(x):
    return np.random.uniform(low=0.5, high=10.0)

In [None]:
score_futures = client.map(random_score, range(20))

best_score = 0
for future in as_completed(score_futures):
    score = future.result()
    if score > best_score:
        best_score = score

In [None]:
print(best_score)

**Useful links: futures**
* [Futures documentation](https://docs.dask.org/en/latest/futures.html)
* [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)
* [Futures examples](https://examples.dask.org/futures.html)

## Grown-up example: Scraping and cleaning SO data

In the recap, as in plenty of introductory tutorials, we used toy examples. In this section, we graduate to a grown-up example. You will learn how to parallelize a scraping and cleaning workflow.

You'll be scraping multiple pages from https://stackoverflow.com/questions/, then finding the links to every post on each page, and finally getting some data from each post. You'll first see how the sequential code works, and then you'll use `futures` to do this in parallel.

Note about throttling:

When scraping directly from the pages instead of using the API, it is not clear what the throttling limits are, but in our experience you run into them pretty quickly.

The following examples work as they are. If you change the number of pages, you will likely hit a limit and be banned for a few minutes. We will work around this towards the end; in the meantime, avoid changing the number of pages.

In [None]:
# TODO: needs motivation and an explanation of what we are trying to do

# Probably a plot showing the results (image)

## Scrape, crawl, get data

We wrote some functions that get a page, clean the data from all the posts on that page, and return it as a list of dictionaries.

In [None]:
def request_html_page(url):
    req = requests.get(url)
    html = bs(req.text, "html.parser")
    return html

In [None]:
def get_page_html_links(page_num, tag="dask", query_filter = "MostVotes"):
    base_url = "https://stackoverflow.com/questions/tagged/"
    
    page_url = f"{base_url}{tag}?sort={query_filter}&page={page_num}"

    page_html = request_html_page(page_url)
    
    return page_html

In [None]:
def get_post_links_per_page(html_page):
    question_href = html_page.find_all("a", class_="s-link")[3:-1]
    
    question_link = [f"https://stackoverflow.com{q['href']}" for q in question_href]
    
    return question_link

In [None]:
def get_data(post_link):
    html_post = request_html_page(post_link)
    post_info = {}
    
    
    post_info["title"] = html_post.title.text
    post_info["question"] = html_post.find("div", class_="s-prose js-post-body").text
    
    answ = html_post.find("div", class_="answer")  # this gets us the first/most voted answer
    
    if answ:
        post_info["best_answer_votes"] = int(answ["data-score"])
    
        best_answer_author_obj = answ.find("span", itemprop="name")
        
        if best_answer_author_obj:
            best_answer_author = best_answer_author_obj.text
        else:
            best_answer_author = "comunity_post"

        post_info["best_answer_usrname"] = best_answer_author
    else:
        post_info["best_answer_votes"] = 0
        post_info["best_answer_usrname"] = "no-answer"

    
    return post_info

## Serial

In [None]:
%%time

df_list =[]
for page_num in range(1, 3):  # more than 2 pages and we get throttling issues
    page_html = get_page_html_links(page_num)
    posts_links = get_post_links_per_page(page_html)
    list_post_data = []
    
    for link in posts_links:
        p_data = get_data(link)
        list_post_data.append(p_data)

    df = pd.DataFrame(list_post_data)
    df_list.append(df)

In [None]:
df_list[0].head()

## Parallel

In [None]:
from dask.distributed import wait, as_completed

### Get pages and links of posts

In [None]:
%%time
pages_futures = client.map(get_page_html_links, range(1,3))
wait(pages_futures)

**`wait()`**

Notice that here we used `wait()`. You can wait on a future or a collection of futures using the `wait` function, which blocks until all futures are finished or have erred. This is useful when you need all the futures to be completed before proceeding with your computations.

**`as_completed()`**

In other situations you might need to iterate over the futures as they complete. To do so, use the `as_completed` function.

In [None]:
%%time

posts_links_futures = client.map(get_post_links_per_page, pages_futures)
crawling = as_completed(posts_links_futures)

dfs_data = []
for future in crawling:
    list_links = future.result() # list of links per page
    df_data = []
    for link in list_links:
        fut_data = client.submit(get_data, link) 
        df_data.append(fut_data)

    dfs_data.append(df_data)
_ = wait(dfs_data)
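
To see the difference between the two functions in isolation, here is a small sketch that uses the standard library's `concurrent.futures`, which has `wait` and `as_completed` functions with the same names and similar behaviour. It is an analogy rather than Dask code; `nap` and the chosen durations are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed, wait


def nap(seconds):
    """Sleep for the given time and return it, to simulate uneven work."""
    time.sleep(seconds)
    return seconds


with ThreadPoolExecutor(max_workers=3) as executor:
    durations = [0.9, 0.1, 0.5]
    futures = [executor.submit(nap, d) for d in durations]

    # as_completed yields each future as soon as it finishes,
    # so shorter tasks come out before longer ones.
    completion_order = [f.result() for f in as_completed(futures)]

    # wait blocks until every future is done, then reports both sets.
    done, not_done = wait(futures)

print(completion_order, len(done), len(not_done))
```

The results arrive in completion order rather than submission order, which is what lets the loop above start submitting `get_data` tasks for one page while other pages are still being processed.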

## Avoid throttling using a cluster

In the example above, if you try to work with more pages, you will hit throttling issues. This is a problem, but we can solve it by scaling out to many machines. When using a cluster, each worker has its own public IP address, so it is as if we were requesting from different machines.

Let's create a Coiled cluster and scrape a larger number of pages:

In [None]:
#Shutdown LocalCluster
client.shutdown()

In [None]:
import coiled

In [None]:
cluster = coiled.Cluster(n_workers=10, 
                        package_sync=True)

In [None]:
client =  Client(cluster)
client

In [None]:
%%time
pages_10_futures = client.map(get_page_html_links, range(1,11))
wait(pages_10_futures)

posts_links_futures = client.map(get_post_links_per_page, pages_10_futures)
crawling = as_completed(posts_links_futures)

dfs_data = []
for future in crawling:
    list_links = future.result() # list of links per page
    df_data = []
    for link in list_links:
        fut_data = client.submit(get_data, link) 
        df_data.append(fut_data)

    dfs_data.append(df_data)
_ = wait(dfs_data)

In [None]:
len(dfs_data) #10 pages

At this point, we have the data to build each page dataframe. 

In [None]:
dfs_data[0]

### Exercise 
Using futures, transform `dfs_data` into a list of `pandas.DataFrame` futures.

In [None]:
## Solution
df_futures = client.map(pd.DataFrame, dfs_data)

In [None]:
df_futures[0]

Now we have some pandas dataframes!!

### Exercise

Write a function that returns the `value_counts` of the `best_answer_usrname`, and apply it to the dataframe futures

In [None]:
###Solution
def best_ans_val_counts(df):
    return df.best_answer_usrname.value_counts()

best_ans_val_counts_futures = client.map(best_ans_val_counts, df_futures)
best_ans_val_counts_futures[0].result()

### Exercise

Write a function that calculates the total number of votes per best-answer author, apply it to the dataframe futures, and, as they complete, accumulate the votes per username in a dictionary.

In [None]:
#solution
def most_votes(df):
    return df.groupby("best_answer_usrname")['best_answer_votes'].sum()

most_votes_futures = client.map(most_votes, df_futures, pure=False)

In [None]:
##solution
d = {}
for fut in as_completed(most_votes_futures):
    s = fut.result()
    fut_d = s.to_dict()
    for k, v in fut_d.items():
        if k in d:
            d[k] += v
        else:
            d[k] = v

In [None]:
max(d, key=d.get)

In [None]:
max(d.values())
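
The accumulation loop in the solution can also be written with `collections.Counter`, which adds values for keys it already holds and inserts new ones. The sketch below uses made-up per-page totals (the user names and vote counts are hypothetical) shaped like the dictionaries returned by `s.to_dict()`.

```python
from collections import Counter

# Made-up per-page vote totals, one dict per completed future.
page_totals = [
    {"alice": 10, "bob": 3},
    {"bob": 7, "carol": 2},
    {"alice": 1},
]

votes = Counter()
for totals in page_totals:
    votes.update(totals)  # adds to existing keys, inserts new ones

top_user, top_votes = votes.most_common(1)[0]
print(top_user, top_votes)  # prints: alice 11
```

In the exercise, `votes.update(fut.result().to_dict())` inside the `as_completed` loop would play the role of the inner `for k, v` loop.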

### Dask dataframes API

Now that we are in dataframe land, we can do pandas-like operations.

We could keep applying operations to these dataframes using `futures`, but since we are now working with dataframes we can use `dask.dataframe` instead.

In [None]:
import dask.dataframe as dd

In [None]:
ddf_so = dd.from_delayed(df_futures)

In [None]:
ddf_so

In [None]:
ddf_so.columns

We can check which of the users who wrote a best answer has the most "best answers":

In [None]:
ddf_so.best_answer_usrname.value_counts().compute()[:6]

We can also check how many votes these users got:

In [None]:
ddf_so.groupby("best_answer_usrname")['best_answer_votes'].sum().compute()

### Exercise

Modify the following code to get a different tag (e.g. `tag="python"`) and re-run the experiment

```python
pages_futures = client.map(get_page_html_links, range(1,3))
wait(pages_futures)
```

Hint: You can use `client.map()` with `lambda` functions


In [None]:
###Solution
pages_py_futures = client.map(lambda p: get_page_html_links(p, tag="python"), range(1, 5))
wait(pages_py_futures)
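
An alternative to the `lambda` is `functools.partial`, which fixes the keyword argument and leaves the page number as the only free argument. The sketch below shows the idea without touching the network: `build_page_url` is a hypothetical stand-in that only reproduces the URL construction from `get_page_html_links`, and the built-in `map` stands in for `client.map`.

```python
from functools import partial


def build_page_url(page_num, tag="dask", query_filter="MostVotes"):
    """Stand-in that only builds the URL; it makes no network request."""
    base_url = "https://stackoverflow.com/questions/tagged/"
    return f"{base_url}{tag}?sort={query_filter}&page={page_num}"


# partial() fixes tag="python"; page_num stays as the only free argument.
build_python_url = partial(build_page_url, tag="python")

urls = list(map(build_python_url, range(1, 3)))
for url in urls:
    print(url)
```

With Dask, the equivalent call would be `client.map(partial(get_page_html_links, tag="python"), range(1, 5))`. `client.map` also forwards extra keyword arguments to the function, so `client.map(get_page_html_links, range(1, 5), tag="python")` should be another option.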

### Useful links

- https://tutorial.dask.org/05_futures.html

### Next lesson

In the next lesson, you will get better at Dask DataFrames. We will recap the basics, but dive deeper into data formats (csv vs parquet), learn about `pyarrow-strings`, shuffle operations, and other useful content that is not usually covered in the introductory material.
