<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">
     
# Parallelize your Python code

In this lesson you will learn how to parallelize custom Python code with Dask, using the Futures API.

## Futures: a low-level collection.

Dask low-level collections are the best tools when you need fine control to build custom parallel and distributed computations.

The `futures` interface (derived from the built-in `concurrent.futures`) provides fine-grained real-time execution for custom situations. It allows you to submit arbitrary functions for computation in a parallelized, eager, and non-blocking way. 

### Why use Futures?

The `futures` API offers a work submission style that can easily emulate the map/reduce paradigm. If that is familiar to you then futures might be the simplest entrypoint into Dask.

The other big benefit of futures is that the intermediate results, represented by futures, can be passed to new tasks without having to pull data locally from the cluster. Submitting work **returns immediately**, giving one or more *futures*, whose status begins as "pending" and later becomes "finished". The local Python session is not blocked. With futures, the computation starts as soon as the inputs and compute resources are available.
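The pending-then-finished life cycle can be tried without a cluster. The following is a minimal sketch that uses the standard library's `concurrent.futures` (from which the Dask interface is derived) rather than Dask itself. The `slow_square` task and the `release` event are hypothetical names introduced only for this illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

release = threading.Event()  # lets us control when the task is allowed to finish


def slow_square(x):
    """Hypothetical task: block until `release` is set, then return x squared."""
    release.wait()
    return x * x


with ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(slow_square, 7)  # returns immediately with a future
    done_before = future.done()               # False: the task is still blocked
    release.set()                             # allow the task to finish
    value = future.result()                   # blocks until the result is ready
    done_after = future.done()                # True: the future has finished

print(done_before, value, done_after)  # False 49 True
```

With Dask, `client.submit()` plays the role of `executor.submit()` here, and the returned future reports its status in a similar way.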

### When do we use Futures?

One of the most common cases where you can use `Futures` is when you have a for loop. For example, you need to apply a **read-transform-write** function over multiple files. Your serial code will look something like:


```python
# Serial code
def process_file(filename):
    data = read_a_file(filename)
    data_transformed = do_a_transformation(data)
    destination = f"results/{filename}"
    write_out_data(data_transformed, destination)
    return destination

files = ["file_1", "file_2", "file_3", ..., "file_n"] #list of files
new_files = [] #where we save the destination file names

for f in files:
    new_files.append(process_file(f))
```

Notice that every call to `process_file` is independent of the others. This is what is called an embarrassingly parallel problem. You can run it in parallel with Dask, assuming a Dask `client` is already connected, by doing:

```python
#Parallel code
futures = []
for f in files:
    future = client.submit(process_file, f)
    futures.append(future)
    
futures
```
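The read-transform-write pattern can also be exercised end to end on a local machine. The sketch below is an assumption-laden stand-in: it uses the standard library's `ThreadPoolExecutor` instead of a Dask client, temporary text files instead of real data, and an upper-casing step as a hypothetical transformation.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_file(src: Path, out_dir: Path) -> Path:
    """Read a text file, upper-case its contents, and write the result to out_dir."""
    data = src.read_text()
    transformed = data.upper()  # hypothetical transformation
    destination = out_dir / src.name
    destination.write_text(transformed)
    return destination


with tempfile.TemporaryDirectory() as tmp:
    in_dir, out_dir = Path(tmp, "inputs"), Path(tmp, "results")
    in_dir.mkdir()
    out_dir.mkdir()

    # Create a few small input files to process.
    files = []
    for i in range(4):
        path = in_dir / f"file_{i}.txt"
        path.write_text(f"contents of file {i}")
        files.append(path)

    # Submit one independent task per file, then collect the destinations.
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(process_file, f, out_dir) for f in files]
        new_files = [future.result() for future in futures]

    written = [p.read_text() for p in new_files]

print(written[0])  # CONTENTS OF FILE 0
```

Because each call only touches its own input and output file, the tasks can run in any order without coordinating with each other.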

## Example: Get SO questions page title

During this lesson, you will be working with the Stack Overflow question pages. To start let's see how to grab the title of each page, and how for multiple pages we can perform this in parallel. 

If you go to https://stackoverflow.com/questions/ you will see a list of the newest posts, and at the bottom of the page you can switch to the next page. For example, at the time this notebook was created, the top of page two looked like this:

<center>
<img src="images/SO_page.png"
     width="65%"
     alt="SO page">
</center>

### Get the title

The title of the page is what is shown in the browser tab. The following function gets the title of a page given its page number:

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import time

In [None]:
def get_questions_page_title(page_num):
    """ Get title of a SO questions page
    """
    url = f"https://stackoverflow.com/questions?tab=newest&page={page_num}"
    req = requests.get(url)
    html = bs(req.text, "html.parser")

    return html.title.text

In [None]:
page_2 = get_questions_page_title(2)
page_2

### Serial code to get 8 pages would be:

In [None]:
%%time
page_html = []
for p in range(1,9): #page numbers start at 1
    page_html.append(get_questions_page_title(p))
    

In [None]:
page_html[:3]

### Exercise

Run the code in parallel, using futures.

In [None]:
from dask.distributed import Client, wait

In [None]:
client = Client(n_workers=4)
client

In [None]:
#Solution
futures = []
for p in range(1,9):
    future = client.submit(get_questions_page_title, p)
    futures.append(future)
    
futures

In [None]:
futures[0]

In [None]:
futures[0].result()

In [None]:
results = [future.result() for future in futures]
results
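The effect of running the page requests concurrently can be checked offline as well. This is a rough sketch, not a benchmark: `fake_get_title` is a hypothetical stand-in that sleeps instead of calling Stack Overflow, and the standard library's `ThreadPoolExecutor` stands in for the Dask client.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_get_title(page_num, latency=0.1):
    """Hypothetical stand-in for a network request: sleep, then return a title."""
    time.sleep(latency)
    return f"Questions page {page_num}"


pages = range(1, 9)

# Serial: each simulated request waits for the previous one.
start = time.perf_counter()
serial_titles = [fake_get_title(p) for p in pages]
serial_seconds = time.perf_counter() - start

# Parallel: up to four simulated requests are in flight at once.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fake_get_title, p) for p in pages]
    parallel_titles = [f.result() for f in futures]
parallel_seconds = time.perf_counter() - start

print(f"serial: {serial_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

The results come back in submission order in both cases, because the list comprehension asks each future for its result in the order the futures were created.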

**Extra:**

To `%%time` the cell and compare times with the serial version, you need to wait for the futures to finish by calling `wait(futures)`. If you do that and re-run the cell, you will notice that it returns immediately. This is because, by default, distributed assumes that all functions are pure, so it reuses the results it already has instead of computing them again. Pure functions:

- always return the same output for a given set of inputs
- do not have side effects, like modifying global state or creating files

 
To force the tasks to run again, pass the `pure=False` keyword argument to `client.submit()`. Modify your solution to match this code:


```python
%%time
futures = []
for p in range(1,9):
    future = client.submit(get_questions_page_title, p, pure=False)
    futures.append(future)
    
wait(futures)
```
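One way to picture why a pure task is not recomputed is to think of results as being stored under a key built from the function and its inputs. The class below is a toy model written only for this explanation. It is not how Dask is implemented, and `ToyScheduler` and `double` are hypothetical names.

```python
import itertools


class ToyScheduler:
    """Toy model of key-based result reuse. This is NOT Dask's implementation."""

    def __init__(self):
        self.results = {}                 # key -> stored result
        self.calls = 0                    # how many times a function really ran
        self._unique = itertools.count()  # source of never-repeating suffixes

    def submit(self, func, *args, pure=True):
        if pure:
            # Same function and same inputs always produce the same key.
            key = (func.__name__, args)
        else:
            # A unique suffix guarantees a fresh key, so the function runs again.
            key = (func.__name__, args, next(self._unique))
        if key not in self.results:
            self.calls += 1
            self.results[key] = func(*args)
        return self.results[key]


def double(x):
    return 2 * x


scheduler = ToyScheduler()
scheduler.submit(double, 21)
scheduler.submit(double, 21)              # reuses the stored result
calls_pure = scheduler.calls

scheduler.submit(double, 21, pure=False)
scheduler.submit(double, 21, pure=False)  # runs again each time
calls_total = scheduler.calls

print(calls_pure, calls_total)  # 1 3
```

In this model, the second pure submission is answered from the stored result, which mirrors why the re-run cell appears to finish instantly.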

**`client.map()`**

With `client.submit()` you can submit individual functions for evaluation with one set of inputs, and together with a `for` loop you can also evaluate over a sequence of inputs. `client.map()` provides a simpler interface for the latter. Let's see how to perform the example above using `client.map()`:


In [None]:
futures = client.map(get_questions_page_title, range(1, 9))

`client.map()` returns a list of futures. You can block on the computation and gather the results by doing:

In [None]:
res = client.gather(futures)
res

### Futures are great...

The other big benefit of `futures` is that the intermediate results, represented by `futures`, can be passed to new tasks without having to pull data locally from the cluster. New operations can be set up to work on the output of previous jobs that have not even begun yet.

Let's break our steps into multiple functions:

- `request_html_page`: given a URL, returns the HTML of that page.
- `get_page_html_links`: given a SO questions page number, returns the HTML for that page number.
- `get_post_links_per_page`: given a SO questions HTML page, returns a list with the links to the posts on that page.


<center>
<img src="images/dask_SO_posts_links.png"
     width="65%"
     alt="dask post links">
</center>




In [None]:
def request_html_page(url):
    """Given a url returns the html of that page
    """
    req = requests.get(url)
    html = bs(req.text, "html.parser")
    return html

In [None]:
def get_page_html_links(page_num, tag="dask", query_filter="MostVotes"):
    """Given a SO questions page number, returns the html for that page number
    for a tag and query_filter.
    """
    base_url = "https://stackoverflow.com/questions/tagged/"
    
    page_url = f"{base_url}{tag}?sort={query_filter}&page={page_num}"

    page_html = request_html_page(page_url)
    
    return page_html

In [None]:
def get_post_links_per_page(html_page):
    """Given a SO questions html page, returns a list with the posts of that page.
    """
    question_href = html_page.find_all("a", class_="s-link")[2:-1]
    
    question_link = [f"https://stackoverflow.com{q['href']}" for q in question_href]
    
    return question_link

### Explore the functions:  

In [None]:
page_number = 3

page_3_html = get_page_html_links(page_num=page_number)

In [None]:
post_links_page_3 = get_post_links_per_page(page_3_html)

In [None]:
len(post_links_page_3)

In [None]:
post_links_page_3[:3]

### Get post links for multiple pages

In [None]:
# serial code
page_posts_links = []
for page in range(1, 5):
    page_html = get_page_html_links(page_num=page)
    posts_links = get_post_links_per_page(page_html)
    
    page_posts_links.append(posts_links)

In [None]:
len(page_posts_links)

In [None]:
[len(l) for l in page_posts_links]

**Parallel code: using `client.map()`**

We can first get the futures for every page's HTML, and then pass those futures as the iterable to get the links per page.

In [None]:
pages_html_futures = client.map(get_page_html_links, range(1,5))
wait(pages_html_futures) #wait until completed

**`wait()`**

Notice that here we used `wait()`. You can wait on a future or a collection of futures using the `wait` function, which blocks until all futures are finished or have erred. This is useful when you need all the futures to be completed before proceeding with your computations.
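The "finished or have erred" part is worth seeing in action: waiting does not raise when a task fails, so the futures have to be inspected afterwards. The sketch below uses the standard library's `concurrent.futures.wait` as an analog of the Dask function, and `parse_page_number` is a hypothetical task chosen so that one input fails.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def parse_page_number(text):
    """Hypothetical task: convert text to int, raising ValueError on bad input."""
    return int(text)


with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(parse_page_number, t) for t in ["1", "2", "oops"]]

    # Blocks until every future has either a result or an exception.
    done, not_done = wait(futures)

    # Inspect each future instead of calling .result() blindly.
    succeeded = [f.result() for f in futures if f.exception() is None]
    failed = [type(f.exception()).__name__ for f in futures if f.exception() is not None]

print(len(done), len(not_done), succeeded, failed)  # 3 0 [1, 2] ['ValueError']
```

Calling `.result()` on the failed future would re-raise its exception, which is why the sketch checks `.exception()` first.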

In [None]:
pages_html_futures[0]

### Exercise:

Use `client.map()` and the `pages_html_futures` you just got to get the post links for the four pages, in parallel.

In [None]:
#Solution
posts_links_futures = client.map(get_post_links_per_page, pages_html_futures)
posts_links_futures

In [None]:
posts_links_futures[0]

In [None]:
posts_links_futures[0].result()[:3]

**`as_completed()`**

In the example above we waited for the `pages_html_futures` to finish before we proceeded to get the `posts_links_futures`. However, we can get the `post_links_futures` for every page as the `pages_html_futures` finish.

`as_completed()` returns an iterator that yields the input futures in the order in which they complete.

In [None]:
from dask.distributed import as_completed

In [None]:
pages_html_futures = client.map(get_page_html_links, range(1,5), pure=False) #use pure=False to re-compute

post_links_futures = []
for p in as_completed(pages_html_futures):
    post_links_futures.append(client.submit(get_post_links_per_page, p))
    

In [None]:
post_links_futures
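To make the completion order visible, here is a small offline sketch with the standard library's `as_completed`. The `fetch` and `shout` functions and the delays are hypothetical. Note one difference from Dask: a standard-library executor cannot take a future as an argument, so the sketch passes `finished.result()` to the follow-up task instead of the future itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(name, delay):
    """Hypothetical first-stage task: pretend to download something."""
    time.sleep(delay)
    return name


def shout(name):
    """Hypothetical second-stage task applied to each first-stage result."""
    return name.upper()


# Submission order is slow, fast, medium; completion order will differ.
delays = {"slow": 0.6, "fast": 0.05, "medium": 0.3}

with ThreadPoolExecutor(max_workers=3) as executor:
    first_stage = [executor.submit(fetch, n, d) for n, d in delays.items()]

    completion_order = []
    second_stage = []
    for finished in as_completed(first_stage):
        # Start the follow-up work as soon as each first-stage task is done.
        completion_order.append(finished.result())
        second_stage.append(executor.submit(shout, finished.result()))

    final = [f.result() for f in second_stage]

print(completion_order)  # ['fast', 'medium', 'slow']
print(final)             # ['FAST', 'MEDIUM', 'SLOW']
```

The follow-up task for the fastest input starts while the slower inputs are still running, which is the benefit over waiting for the whole first stage.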

## Grown-up example: Scrape, crawl and get SO data

Let's use everything we've learned in the examples above to do something a bit more advanced. In this section, you will learn how to parallelize a workflow that scrapes, crawls, and collects data.

Up to now, you learned how to scrape multiple pages from https://stackoverflow.com/questions/ and get a list of the post links for every page. Let's go a step further and get some data from each post. For example, we can extract:

- Title
- Question body
- Most voted answer
- Number of votes for the best answer
- Who wrote the most voted answer

<center>
<img src="images/data_from_post.png"
     width="65%"
     alt="data to extract">
</center>


#### Data insights

For every page we will end up with one dictionary per post that contains the information above. We can convert them into a dataframe and find useful aggregated information, for example:

- Which username gets the most "best answers"?
- Which of the best answer usernames is the most voted?


<center>
<img src="images/sketch_bar_plots.png"
     width="85%"
     alt="bar plots sketch">
</center>

**Note about throttling:**

When scraping directly from the pages instead of using the API, it is not clear what the throttling limits are, but from experience we run into them pretty quickly.

The following examples work as they are. If you increase the number of pages you will likely hit a limit and be banned for a few minutes. We will work around this towards the end; in the meantime, avoid changing the number of pages.

## Scrape, crawl, get data, and plot

Below you have the set of functions that we used above, plus a function that will allow us to scrape the data needed to get some insights.

You will see these functions in action in serial, and together we will use everything we learned about futures to run things in parallel.

In [None]:
def request_html_page(url):
    """Given a url returns the html of that page
    """
    req = requests.get(url)
    html = bs(req.text, "html.parser")
    return html

In [None]:
def get_page_html_links(page_num, tag="dask", query_filter="MostVotes"):
    """Given a SO questions page number, returns the html for that page number
    for a tag and query_filter.
    """
    base_url = "https://stackoverflow.com/questions/tagged/"
    
    page_url = f"{base_url}{tag}?sort={query_filter}&page={page_num}"

    page_html = request_html_page(page_url)
    
    return page_html

In [None]:
def get_post_links_per_page(html_page):
    """Given a SO questions html page, returns a list with the posts of that page.
    """
    question_href = html_page.find_all("a", class_="s-link")[2:-1]
    
    question_link = [f"https://stackoverflow.com{q['href']}" for q in question_href]
    
    return question_link

In [None]:
def get_data(post_link):
    """Get data from a SO post as a dictionary
    
    - Title
    - Question body
    - Number of votes for the best answer
    - Who wrote the most voted answer
    
    """
    html_post = request_html_page(post_link)
    post_info = {}
    
    
    post_info["title"] = html_post.title.text
    post_info["question"] = html_post.find("div", class_="s-prose js-post-body").text
    
    answ = html_post.find("div", class_="answer") #this gets us the first/most voted answer
    
    if answ:
        post_info["best_answer_votes"] = int(answ["data-score"])
    
        best_answer_author_obj = answ.find("span", itemprop="name")
        
        if best_answer_author_obj:
            best_answer_author = best_answer_author_obj.text
        else:
            best_answer_author = "comunity_post"

        post_info["best_answer_usrname"] = best_answer_author
    else:
        post_info["best_answer_votes"] = 0
        post_info["best_answer_usrname"] = "no-answer"

    
    return post_info

## Serial

In [None]:
%%time

df_list =[]
for page_num in range(1, 3): #more than 2 pages can cause throttling issues
    page_html = get_page_html_links(page_num)
    posts_links = get_post_links_per_page(page_html)
    list_post_data = []
    
    for link in posts_links:
        p_data = get_data(link)
        list_post_data.append(p_data)

    df = pd.DataFrame(list_post_data)
    df_list.append(df)

In [None]:
df_list[0].head()

In [None]:
most_answers = df_list[0]["best_answer_usrname"].value_counts()
most_answers[:3].plot.bar(title="Most Best Answers",
                       ylabel="Best answers",
                       xlabel="Usernames");

In [None]:
most_voted = df_list[0].groupby("best_answer_usrname").best_answer_votes.sum().sort_values(ascending=False)

In [None]:
most_voted[:3].plot.bar(title="Most Voted Usernames",
                       ylabel="Votes");

In [None]:
most_answers_series = []
most_voted_series = []
for df in df_list:
    most_answ = df["best_answer_usrname"].value_counts()
    most_vot = df.groupby("best_answer_usrname").best_answer_votes.sum()
    
    most_answers_series.append(most_answ)
    most_voted_series.append(most_vot)
    

In [None]:
most_answers_tot = pd.concat(most_answers_series, axis=1).sum(axis=1).sort_values(ascending=False)
most_voted_tot = pd.concat(most_voted_series, axis=1).sum(axis=1).sort_values(ascending=False)
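The concatenate-then-sum step lines up the per-page counts by username and adds them, treating a username that is missing from a page as contributing nothing. The sketch below shows the same idea with the standard library's `collections.Counter` instead of pandas. The usernames and counts are made up for illustration.

```python
from collections import Counter

# Hypothetical per-page "best answer" counts, one Counter per page.
page_1_counts = Counter({"alice": 3, "bob": 1})
page_2_counts = Counter({"alice": 1, "carol": 2})

# Add the pages together; usernames absent from a page add nothing.
total = Counter()
for counts in [page_1_counts, page_2_counts]:
    total.update(counts)

# Sort from most to fewest best answers, like sort_values(ascending=False).
ranking = total.most_common()

print(ranking)  # [('alice', 4), ('carol', 2), ('bob', 1)]
```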

In [None]:
most_answers_tot[:5].plot.bar(title="Most Best Answers",
                       ylabel="Best answers",
                       xlabel="Usernames");

In [None]:
most_voted_tot[:5].plot.bar(title="Most Voted Usernames",
                       ylabel="Votes");

## Parallel

### Get pages and links of posts

In [None]:
%%time
pages_futures = client.map(get_page_html_links, range(1,3))
wait(pages_futures)

posts_links_futures = client.map(get_post_links_per_page, pages_futures)
crawling = as_completed(posts_links_futures)

dfs_data = []
for future in crawling:
    list_links = future.result() # list of links per page
    df_data = []
    for link in list_links:
        fut_data = client.submit(get_data, link) 
        df_data.append(fut_data)

    dfs_data.append(df_data)
_ = wait(dfs_data)

### Parallel is fast

That was a speedup of about 9x on a laptop with 4 CPU cores. But we have only processed 2 pages, so let's tackle more pages!

**Note:** If you are running this from Binder, the limited resources on Binder will result in lower speedups.

## Avoid throttling using a cluster

In the example above, if you try to work with more pages, you will hit throttling issues. This is a problem, but we can solve it by scaling out to many machines. When using a cluster, each worker has its own public IP address, so it is as if we were requesting from different machines.

Let's create a Coiled cluster and scrape a larger number of pages:

In [None]:
#Shutdown LocalCluster
client.shutdown()

In [None]:
### coiled login 
#!coiled login --token ### --account dask-tutorials

In [None]:
import coiled

In [None]:
cluster = coiled.Cluster(n_workers=10, 
                        package_sync=True,
                        scheduler_port=443) #port needed for binder

In [None]:
#When running from binder dask-lab extension won't work, use link to dashboard
client =  Client(cluster)
client

In [None]:
%%time
pages_10_futures = client.map(get_page_html_links, range(1,11))
wait(pages_10_futures)

posts_links_futures = client.map(get_post_links_per_page, pages_10_futures)
crawling = as_completed(posts_links_futures)

dfs_data = []
for future in crawling:
    list_links = future.result() # list of links per page
    df_data = []
    for link in list_links:
        fut_data = client.submit(get_data, link) 
        df_data.append(fut_data)

    dfs_data.append(df_data)
_ = wait(dfs_data)

In [None]:
len(dfs_data) #10 pages

At this point, we have the data to build each page dataframe. 

In [None]:
dfs_data[0]

### Let's get some dataframes:
In the serial code you got some insights into the data by manipulating the pandas dataframes. At the moment our futures do not hold dataframes, but we can convert them by mapping `pd.DataFrame` over our `dfs_data` futures.

In [None]:
df_futures = client.map(pd.DataFrame, dfs_data)

In [None]:
df_futures[0]

Now we have some pandas dataframes!!

### Exercise

Write a function that returns the `value_counts` of the `best_answer_usrname`, and apply it to the dataframe futures.

In [None]:
###Solution
def best_ans_val_counts(df):
    return df.best_answer_usrname.value_counts()

best_ans_val_counts_futures = client.map(best_ans_val_counts, df_futures)
best_ans_val_counts_futures[0].result()

### Exercise

Write a function that calculates the total number of votes per best-answer username, and apply it to the dataframe futures.

In [None]:
#solution
def most_votes(df):
    return df.groupby("best_answer_usrname")['best_answer_votes'].sum()

most_votes_futures = client.map(most_votes, df_futures, pure=False)

## Results for 10 pages aggregation

Now we have 10 futures per aggregation, each of which holds a `pd.Series`. We can bring these to the client, concatenate them, and redo our plots.

### Exercise

Gather the results into a list of `pd.Series`.

In [None]:
# Solution
best_ans_count_res = client.gather(best_ans_val_counts_futures)
most_votes_res = client.gather(most_votes_futures)

### Let's Plot!

In [None]:
most_answers_tot = pd.concat(best_ans_count_res, axis=1).sum(axis=1).sort_values(ascending=False)
most_voted_tot = pd.concat(most_votes_res, axis=1).sum(axis=1).sort_values(ascending=False)

In [None]:
most_answers_tot[:5].plot.bar(title="Most Best Answers",
                       ylabel="Best answers",
                       xlabel="Usernames");

In [None]:
most_voted_tot[:5].plot.bar(title="Most Voted Usernames",
                       ylabel="Votes");

### Dask dataframes API

Now that we are in dataframe world, we can do pandas-like operations.

We can do multiple operations on these dataframes using `futures`, but at this point, since we are working with dataframes, we can use `dask.dataframe`.

In [None]:
import dask.dataframe as dd

In [None]:
ddf_so = dd.from_delayed(df_futures)

In [None]:
ddf_so

In [None]:
ddf_so.columns

We can check which of the users that got a best answer has the most "best answers":

In [None]:
ddf_so.best_answer_usrname.value_counts().compute()[:6]

We can also check how many votes these users got:

In [None]:
ddf_so.groupby("best_answer_usrname")['best_answer_votes'].sum().compute()

## Extra:

Repeat the analysis we did for the `tag="dask"` for a different one, like `tag="python"`.

You will need to modify this portion of the code:

```python
pages_futures = client.map(get_page_html_links, range(1,3))
wait(pages_futures)
```
so that it uses `client.map()` with a `lambda` function, like:

```python
# Solution
pages_py_futures = client.map(lambda p: get_page_html_links(p, tag="python"), range(1, 3))
wait(pages_py_futures)
```
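An alternative to a `lambda` is `functools.partial`, which fixes the `tag` argument ahead of time and leaves only the page number to be mapped. The sketch below is offline and makes two assumptions: `build_page_url` is a hypothetical helper that only builds the URL (no request is sent), and the standard library's `executor.map` stands in for `client.map()`.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def build_page_url(page_num, tag="dask", query_filter="MostVotes"):
    """Hypothetical helper: build the tagged-questions URL without requesting it."""
    base_url = "https://stackoverflow.com/questions/tagged/"
    return f"{base_url}{tag}?sort={query_filter}&page={page_num}"


# Fix tag="python" once; the resulting callable only needs the page number.
python_pages = partial(build_page_url, tag="python")

with ThreadPoolExecutor(max_workers=2) as executor:
    urls = list(executor.map(python_pages, range(1, 3)))

print(urls[0])  # https://stackoverflow.com/questions/tagged/python?sort=MostVotes&page=1
```

The same `partial` object could be passed to `client.map()` in place of the `lambda`. `client.map()` is also documented to forward extra keyword arguments to the function, so passing `tag="python"` directly may work too, but check the documentation for your installed version.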

### Useful links

- [Dask tutorial: Futures](https://tutorial.dask.org/05_futures.html)
- [Futures documentation](https://docs.dask.org/en/latest/futures.html)
- [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)
- [Futures examples](https://examples.dask.org/futures.html)

### Next lesson

Register [here](https://www.coiled.io/tutorials) for reminders. 

In the next lesson, we’ll learn some best practices around working with larger-than-memory datasets. We’ll use the Uber/Lyft dataset to:

- Manipulate Parquet files and optimize queries
- Navigate inconvenient file sizes and data types
- Extract useful features with Dask Dataframe

By the end, we’ll learn the advantages of working with the Parquet file format and how to efficiently perform an exploratory analysis with Dask.