![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

# Lab | Parallelization

## Introduction

This lab combines parallelization with other topics from the Intermediate Python module of this program (list comprehensions, the requests library, functional programming, web scraping, etc.). You will write code that extracts a list of links from a web page, requests each URL, and then crawls the page referenced by each link to count its links, first sequentially and then in parallel.

## Resources

- [Multiprocessing Library Documentation](https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing#module-multiprocessing)
- [Python Parallel Computing (in 60 Seconds or less)](https://dbader.org/blog/python-parallel-computing-in-60-seconds)
- [Python Multiprocessing: Pool vs Process – Comparative Analysis](https://www.ellicium.com/python-multiprocessing-pool-process/)

## Step 1: Use the requests library to retrieve the content from the URL below.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [2]:

url = 'https://en.wikipedia.org/wiki/Data_science'

In [3]:
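response = requests.get(url)  # fetch the page
html = response.content  # raw bytes of the response body
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's parser-guessing warning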

In [4]:
response

<Response [200]>

## Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [5]:
from bs4 import BeautifulSoup

In [6]:
links = soup.find_all('a', href=True)  # every <a> tag that has an href attribute
unique_links = list({link['href'] for link in links})  # a set comprehension drops duplicate hrefs
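
Deduplicating through a set does not keep the links in page order. If order matters, one common alternative is `dict.fromkeys`, which keeps the first occurrence of each item. The sketch below is only an illustration: `hrefs` and `unique_in_order` are hypothetical names, and the sample values are made up rather than taken from the page.

```python
# Hypothetical sample of href values, including duplicates
hrefs = [
    "/wiki/Data",
    "https://example.org/",
    "/wiki/Data",
    "#cite_note-1",
    "https://example.org/",
]

# dict keys are unique and remember insertion order (Python 3.7+),
# so this removes duplicates while keeping first-seen order
unique_in_order = list(dict.fromkeys(hrefs))

print(unique_in_order)
# ['/wiki/Data', 'https://example.org/', '#cite_note-1']
```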

## Step 3: Use list comprehensions with conditions to clean the link list.

Create a list of the absolute links (those that start with `http`) and remove any that contain a percent sign (%).

In [7]:
absolute_links = [link for link in unique_links if link.startswith('http') and '%' not in link]

In [8]:
absolute_links

['https://es.wikipedia.org/wiki/Ciencia_de_datos',
 'https://en.wikipedia.org/w/index.php?title=Data_science&oldid=1051586722',
 'https://nl.wikipedia.org/wiki/Datawetenschap',
 'https://wikimediafoundation.org/',
 'https://statmodeling.stat.columbia.edu/2013/11/14/statistics-least-important-part-data-science/',
 'https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/',
 'https://magazine.amstat.org/blog/2016/06/01/datascience-2/',
 'https://web.archive.org/web/20140102194117/http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/',
 'https://books.google.com/books?id=oGs_AQAAIAAJ',
 'https://fi.wikipedia.org/wiki/Datatiede',
 'https://towardsdatascience.com/how-data-science-will-impact-future-of-businesses-7f11f5699c4d',
 'https://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/',
 'https://www.oreilly.com/library/view/doing-data-science/9781449363871/ch01.html',
 'https:
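
The filter above keeps only hrefs that are already absolute and discards relative ones such as `/wiki/...`. As a variation, and not what this lab does, relative links can first be resolved against the page URL with `urllib.parse.urljoin` and then filtered with the same conditions. This is a minimal sketch: `base_url`, `sample_links`, `resolved`, and `cleaned` are hypothetical names, and the sample hrefs are invented for illustration.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Data_science"

# Hypothetical mix of relative, protocol-relative, and absolute hrefs
sample_links = [
    "/wiki/Statistics",
    "https://www.example.org/page",
    "/wiki/C%2B%2B",
    "//de.wikipedia.org/wiki/Data_Science",
]

# Resolve every href against the page it came from
resolved = [urljoin(base_url, link) for link in sample_links]

# Same conditions as the lab: absolute links without a percent sign
cleaned = [link for link in resolved if link.startswith("http") and "%" not in link]

for link in cleaned:
    print(link)
```

With this approach the percent-encoded link is still dropped, while the two relative links survive as full URLs.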

## Step 4: Write a function called crawl_page that accepts a link and does the following.

- Request the content of the page referenced by that link.
- Create a soup from the response content.
- Extract a list of links.
- Return the count of links on the page.

In [9]:
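def crawl_page(url):
    # Import inside the function so it is self-contained when run in a worker process
    import requests
    from bs4 import BeautifulSoup

    html = requests.get(url).content
    soup = BeautifulSoup(html, 'html.parser')  # explicit parser avoids bs4's parser-guessing warning
    link_list = soup.find_all('a', href=True)  # every <a> tag that has an href attribute
    return len(link_list)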

In [10]:
crawl_page(url)

475
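
To check the link-counting logic without network access, it can help to run it on a small HTML string. The sketch below swaps BeautifulSoup for the standard library's `html.parser`, so it is not the lab's implementation; `LinkCounter`, `count_links`, and `sample_html` are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> start tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1


def count_links(html):
    """Return the number of <a href=...> tags in an HTML string."""
    parser = LinkCounter()
    parser.feed(html)
    return parser.count


# Two anchors with href, one without
sample_html = (
    '<p><a href="/wiki/A">A</a> '
    '<a name="x">anchor</a> '
    '<a href="https://example.org">B</a></p>'
)

print(count_links(sample_html))  # 2
```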

## Step 5: Sequentially loop through the list of links, running the crawl_page function each time and saving the results in a list.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [12]:
%%time
result_list = []

for url in absolute_links:
    result_list.append(crawl_page(url))

print(result_list)

[275, 489, 128, 100, 0, 331, 4, 67, 151, 119, 76, 4, 69, 54, 71, 369, 113, 363, 0, 171, 0, 162, 298, 229, 176, 279, 176, 146, 0, 10, 116, 226, 172, 100, 0, 162, 1, 39, 287, 140, 80, 70, 189, 99, 126, 115, 157, 21, 2, 27, 362, 258, 279, 5, 272, 1, 237, 83, 125]
CPU times: user 12.7 s, sys: 234 ms, total: 13 s
Wall time: 1min 33s
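
`%%time` is an IPython cell magic, so it only works inside a notebook. In a plain Python script, a similar wall-clock measurement can be taken with `time.perf_counter`. This is a minimal sketch: `fake_crawl` is a hypothetical stand-in that sleeps instead of making a request, and the URLs are placeholders.

```python
import time


def fake_crawl(url):
    """Stand-in for crawl_page: sleeps briefly instead of requesting the page."""
    time.sleep(0.01)
    return len(url)


# Placeholder URLs, not taken from the lab's link list
urls = ["https://example.org/a", "https://example.org/bb", "https://example.org/ccc"]

start = time.perf_counter()
results = [fake_crawl(u) for u in urls]  # sequential: one call after another
elapsed = time.perf_counter() - start

print(results)
print(f"Wall time: {elapsed:.3f} s")
```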


## Step 6: Use multiprocessing to run the crawl_page function on the list of links in parallel.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

In [15]:
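# multiprocess is a third-party fork of multiprocessing that serializes with dill,
# which lets it send functions defined in a notebook cell to worker processes
import multiprocess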

In [16]:
%%time
# The with-block shuts the pool down once map() has returned all results
with multiprocess.Pool() as pool:
    result = pool.map(crawl_page, absolute_links)

print(result)

[275, 489, 128, 100, 0, 330, 4, 67, 151, 119, 76, 4, 69, 54, 71, 369, 113, 363, 0, 171, 0, 162, 298, 229, 176, 279, 176, 146, 0, 10, 116, 226, 172, 100, 0, 162, 1, 39, 287, 140, 80, 70, 189, 99, 130, 115, 157, 21, 2, 32, 362, 258, 279, 5, 272, 1, 237, 83, 125]
CPU times: user 29.5 ms, sys: 57.8 ms, total: 87.3 ms
Wall time: 12.9 s
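
In the runs above, wall time drops from about 93 s to 12.9 s, roughly a 7x speedup. Because crawling spends most of its time waiting on the network, a thread pool is a common alternative to a process pool for this kind of work. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the standard library, which is a different technique from the one used in this lab; `fake_crawl` is a hypothetical stand-in that sleeps instead of requesting a page, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_crawl(url):
    """Stand-in for crawl_page: sleeps to imitate waiting on the network."""
    time.sleep(0.05)
    return len(url)


# Placeholder URLs, not taken from the lab's link list
urls = [f"https://example.org/page{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    # executor.map returns results in the same order as the inputs
    parallel_results = list(executor.map(fake_crawl, urls))
parallel_time = time.perf_counter() - start

print(parallel_results)
print(f"Wall time: {parallel_time:.3f} s")
```

Since the eight simulated waits overlap, the total wall time should be close to a single wait rather than the sum of all eight.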
