# Parallelization Lab

In this lab, you will apply several concepts you have learned to collect the links on a web page, then crawl and index the pages those links reference, first sequentially and then in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [1]:
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'

In [2]:
response = requests.get(url)
html = response.content
html

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Data science - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"dcb0e429-e537-47b5-82bf-682c113c63d1","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Data_science","wgTitle":"Data science","wgCurRevisionId":1016655116,"wgRevisionId":1016655116,"wgArticleId":35458904,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: others","Articles with short description","Short description matches Wikidata","Use dmy dates from December 2012","Information science","Computer occup

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [7]:
from bs4 import BeautifulSoup

In [8]:
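soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
y = []
for a in soup.find_all('a', href=True):  # skip anchors that have no href
    # Keep internal wiki links and absolute https links.
    if a['href'].startswith(('/wiki', 'https')):
        y.append(a['href'])
y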

['/wiki/Information_science',
 '/wiki/File:PIA23792-1600x1200(1).jpg',
 '/wiki/File:PIA23792-1600x1200(1).jpg',
 '/wiki/Comet_NEOWISE',
 '/wiki/Astronomical_survey',
 '/wiki/Space_telescope',
 '/wiki/Wide-field_Infrared_Survey_Explorer',
 '/wiki/Machine_learning',
 '/wiki/Data_mining',
 '/wiki/File:Multi-Layer_Neural_Network-Vector-Blank.svg',
 '/wiki/Statistical_classification',
 '/wiki/Cluster_analysis',
 '/wiki/Regression_analysis',
 '/wiki/Anomaly_detection',
 '/wiki/Automated_machine_learning',
 '/wiki/Association_rule_learning',
 '/wiki/Reinforcement_learning',
 '/wiki/Structured_prediction',
 '/wiki/Feature_engineering',
 '/wiki/Feature_learning',
 '/wiki/Online_machine_learning',
 '/wiki/Semi-supervised_learning',
 '/wiki/Unsupervised_learning',
 '/wiki/Learning_to_rank',
 '/wiki/Grammar_induction',
 '/wiki/Supervised_learning',
 '/wiki/Statistical_classification',
 '/wiki/Regression_analysis',
 '/wiki/Decision_tree_learning',
 '/wiki/Ensemble_learning',
 '/wiki/Bootstrap_aggre
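
The list above still contains repeats (`/wiki/Statistical_classification` and `/wiki/Regression_analysis` each appear twice), while the step asks for unique links. One way to drop them while keeping page order is sketched below; `unique_links` is a hypothetical helper name, not part of the original solution.

```python
def unique_links(hrefs):
    """Return hrefs without duplicates, keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(hrefs))


sample = [
    '/wiki/Statistical_classification',
    '/wiki/Cluster_analysis',
    '/wiki/Statistical_classification',
]
print(unique_links(sample))
```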

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.
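
The three rules above can be sketched with two list comprehensions. This version is an assumption-laden alternative rather than the lab's expected answer: it resolves relative links with `urllib.parse.urljoin` against the page URL instead of concatenating a domain string, and `clean_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def clean_links(hrefs, page_url):
    """Return a sorted list of unique, fully qualified links without '%'."""
    # Absolute links: keep as-is unless they contain a percent sign.
    absolute = [h for h in hrefs if h.startswith('http') and '%' not in h]
    # Relative links: resolve against the page they came from.
    relative = [urljoin(page_url, h) for h in hrefs
                if h.startswith('/') and '%' not in h]
    # A set removes duplicates; sorting makes the result reproducible.
    return sorted(set(absolute + relative))


hrefs = ['/wiki/Data_mining', 'https://example.org/a', '/wiki/Data_mining',
         '/wiki/R%C3%A9sum%C3%A9', '#cite_note-1']
print(clean_links(hrefs, 'https://en.wikipedia.org/wiki/Data_science'))
```

Links that are neither absolute nor start with a slash, such as `#cite_note-1`, are dropped by both comprehensions.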

In [9]:
domain = 'http://wikipedia.org'

In [10]:
absolute = []
relative = []
for i in y:
    if i.startswith('https'):
        if '%' not in i:
            absolute.append(i)
    else:
        if '%' not in i:
            relative.append(f'{domain}{i}')

In [12]:
# Combine both lists, then use a set to drop duplicates.
total = absolute + relative
final = list(set(total))
final

['https://www.stat.purdue.edu/~wsc/',
 'https://stats.wikimedia.org/#/en.wikipedia.org',
 'http://wikipedia.org/wiki/Portal:Current_events',
 'http://wikipedia.org/wiki/Perceptron',
 'http://wikipedia.org/wiki/Online_machine_learning',
 'http://wikipedia.org/wiki/List_of_datasets_for_machine-learning_research',
 'http://wikipedia.org/wiki/Mathematics',
 'http://wikipedia.org/wiki/Gated_recurrent_unit',
 'http://wikipedia.org/wiki/Peter_Naur',
 'http://wikipedia.org/wiki/Journal_of_Machine_Learning_Research',
 'http://wikipedia.org/wiki/Basic_research',
 'http://wikipedia.org/wiki/Supervised_learning',
 'http://wikipedia.org/wiki/Reinforcement_learning',
 'http://wikipedia.org/wiki/Special:BookSources/978-0-9825442-0-4',
 'http://wikipedia.org/wiki/Data_(computing)',
 'http://wikipedia.org/wiki/Informatics',
 'http://wikipedia.org/wiki/Data_archaeology',
 'http://wikipedia.org/wiki/Data_degradation',
 'http://wikipedia.org/wiki/Wikipedia:Contents',
 'http://wikipedia.org/wiki/Automated_

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [11]:
import os

In [13]:
directory = 'wikipedia1'
parent_dir = '/Users/eduar/downloads'
path = os.path.join(parent_dir, directory)
os.mkdir(path)

In [15]:
os.chdir(path)
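
A more portable variant of this step is sketched below, assuming the base directory is passed in rather than hard-coded. `os.makedirs` with `exist_ok=True` does not raise if the folder already exists, so the cell can be re-run safely. `make_and_enter` is a hypothetical helper name.

```python
import os


def make_and_enter(base_dir, name='wikipedia'):
    """Create base_dir/name if needed, make it the working directory, return its path."""
    path = os.path.join(base_dir, name)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    os.chdir(path)
    return path
```

For example, `make_and_enter(os.path.expanduser('~'))` would place the folder in the current user's home directory.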

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.
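
The steps above can be sketched without network access or third-party packages. Everything named here is an assumption made for illustration: `simple_slug` is a crude stand-in for `slugify`, `fetch` is any callable that returns the page bytes, and `save_page` returns the written path (or `None` on failure) instead of silently passing, so the outcome can be checked.

```python
import re
from pathlib import Path


def simple_slug(text):
    """Hypothetical stand-in for slugify: lowercase, other characters become hyphens."""
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')


def save_page(link, fetch, folder):
    """Fetch link and write its bytes to <folder>/<slug>.html; return None on any error."""
    try:
        content = fetch(link)
        target = Path(folder) / (simple_slug(link) + '.html')
        target.write_bytes(content)
        return target
    except Exception:
        # Mirrors the lab's instruction to ignore failures.
        return None
```

In the notebook, `fetch` could be something like `lambda link: requests.get(link).content`, and `simple_slug` would be replaced by `slugify`.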

In [16]:
from slugify import slugify

In [41]:
url = 'https://en.wikipedia.org/wiki/Data_science'
def index_page(x):
    # Imports live inside the function so it also works in worker processes.
    import requests
    from slugify import slugify
    try:
        response = requests.get(x, headers={'Cache-Control': 'no-cache', 'Pragma': 'no-cache'})
        filename = slugify(x) + '.html'
        # The context manager closes the file even if the write fails.
        with open(filename, 'wb') as f:
            f.write(response.content)
    except Exception:
        # The step asks to ignore any failure and move on.
        pass
index_page(url)

In [None]:
#with open(path + '/' + filename, "wb") as f:f.write(html)

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run. 

_Hint: Use tqdm to track progress through the list._
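
`%%time` is an IPython cell magic, so it is only available inside the notebook. When the same loop runs from a plain Python script, a rough equivalent is `time.perf_counter`. The sketch below shows the idea; `timed_map` is a hypothetical helper, and `len` stands in for `index_page`.

```python
import time


def timed_map(func, items):
    """Call func on each item in order; return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = [func(item) for item in items]
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = timed_map(len, ['a', 'bb', 'ccc'])
print(f'{len(results)} items in {elapsed:.4f} s')
```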

In [19]:
import time 
from tqdm.auto import tqdm

In [20]:
%%time
list(map(index_page, tqdm(final)))

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=268.0), HTML(value='')))


Wall time: 9min 39s



### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

Use both methods. First, use the `multiprocess` module, which can run a function defined inside the Jupyter notebook in parallel.

Then, create a Python file containing the download function and run it in parallel with the `multiprocessing` module.
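
Both methods above use process pools. Because downloading is dominated by waiting on the network, a thread pool is a possible third option to compare against; this is a swapped-in technique, not what the lab asks for. `multiprocessing.pool.ThreadPool` offers the same `map` interface as `Pool` but runs workers as threads, so the function does not need to be pickled. In this sketch `fake_download` is a hypothetical stand-in for `index_page`.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_download(link):
    """Hypothetical stand-in for index_page: waits briefly instead of using the network."""
    time.sleep(0.05)
    return len(link)


def run_threaded(func, items, workers=4):
    """Apply func to items using a pool of worker threads; results keep input order."""
    with ThreadPool(workers) as pool:
        return pool.map(func, items)


links = [f'https://example.org/page{i}' for i in range(8)]
start = time.perf_counter()
sizes = run_threaded(fake_download, links)
print(f'{len(sizes)} links in {time.perf_counter() - start:.2f} s')
```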

In [27]:
from multiprocessing import Pool, cpu_count
cpu_count()

4

In [39]:
import requests
import time 
from tqdm.auto import tqdm
from multiprocess import Pool
import multiprocessing

In [42]:
%%time
# The context manager terminates the pool once map has returned.
with Pool() as pool:
    result = pool.map(index_page, final)

Wall time: 2min 26s


In [49]:
def count_items(path):
    return len(os.listdir(path))

In [44]:
count_items(path)

267

In [58]:
%%time
pool = Pool()
result = pool.map_async(index_page, final)

Wall time: 191 ms


In [62]:
count_items(path)

266

**BONUS**: Create a function that uses the `os` module to count how many files are in the wikipedia folder.

Delete the files from the folder, then run the solution above asynchronously.

Use your function to check how many files have been downloaded so far.
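
`map_async` returns immediately with an `AsyncResult`, which is why the timed `map_async` cell reports only milliseconds: the work continues in the background. The sketch below polls the folder while that work is in progress. It makes two assumptions so that it runs anywhere: a thread pool replaces the process pool, and `write_dummy` is a hypothetical stand-in for `index_page` that writes a one-byte file.

```python
import os
import tempfile
import time
from multiprocessing.pool import ThreadPool


def count_items(path):
    """Number of entries currently in the folder."""
    return len(os.listdir(path))


def write_dummy(task):
    """Hypothetical stand-in for index_page: pause, then write a tiny file."""
    folder, name = task
    time.sleep(0.01)
    with open(os.path.join(folder, name), 'wb') as f:
        f.write(b'x')


def run_and_watch(folder, names, workers=4, interval=0.005):
    """Start the writes asynchronously; return (counts seen while running, final count)."""
    tasks = [(folder, name) for name in names]
    seen = []
    with ThreadPool(workers) as pool:
        async_result = pool.map_async(write_dummy, tasks)
        while not async_result.ready():   # still working in the background
            seen.append(count_items(folder))
            time.sleep(interval)
        async_result.get()                # re-raise any worker exception
    return seen, count_items(folder)


with tempfile.TemporaryDirectory() as demo_dir:
    progress, total_files = run_and_watch(demo_dir, [f'page{i}.html' for i in range(10)])
    print(progress, total_files)
```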