# Parallelization Lab

In this lab, you will apply several concepts you have learned: obtain a list of links from a web page, then crawl and index the pages those links reference, first sequentially and then in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
url = 'https://en.wikipedia.org/wiki/Data_science'

In [2]:
# your code here
html = requests.get(url)

### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [3]:
# your code here
soup = BeautifulSoup(html.content, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
links = soup.find_all(href = re.compile('.*'))
unique_links = list(set([link['href'] for link in links]))
unique_links

['/wiki/Data_analysis',
 '/wiki/Information_visualization',
 'https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_contribute',
 '#cite_ref-:4_31-2',
 '/wiki/Data_curation',
 '/wiki/Information_privacy',
 '/wiki/Recurrent_neural_network',
 '/wiki/Mathematics',
 '/wiki/TensorFlow',
 '/wiki/Perceptron',
 '/wiki/Data_warehouse',
 '/wiki/Deep_learning',
 'https://flowingdata.com/2009/06/04/rise-of-the-data-scientist/',
 '/wiki/Template:Machine_learning_bar',
 '/wiki/Montpellier_2_University',
 '/wiki/Vasant_Dhar',
 '/wiki/State%E2%80%93action%E2%80%93reward%E2%80%93state%E2%80%93action',
 '/static/favicon/wikipedia.ico',
 '/static/apple-touch/wikipedia.png',
 '#cite_ref-28',
 '/wiki/Pytorch',
 'https://ru.wikipedia.org/wiki/%D0%9D%D0%B0%D1%83%D0%BA%D0%B0_%D0%BE_%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D1%85',
 '/wiki/American_Statistical_Association',
 '/w/index.php?title=Special:UserLogin&returnto=Data+science',
 '/wiki/Wikipedia:File_Upload_Wizard',
 '#cite_ref-15',
 '/wiki/Special:MyContribut

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.
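
The cleaning rules above can be sketched on a small, made-up list of links. This is only an illustration: `sample_hrefs` is a hypothetical sample, and `urllib.parse.urljoin` from the standard library is one possible way to attach the domain (it also handles protocol-relative links that begin with `//`).

```python
from urllib.parse import urljoin

base = 'http://wikipedia.org'

# Hypothetical sample of raw href values (not taken from a real page).
sample_hrefs = [
    'https://creativecommons.org/licenses/by-sa/3.0/',  # absolute
    '/wiki/Data_analysis',                               # relative to the domain
    '//en.wikipedia.org/',                               # protocol-relative
    '/wiki/State%E2%80%93action',                        # contains %, dropped
    '#cite_ref-15',                                      # page anchor, ignored
    '/wiki/Data_analysis',                               # duplicate
]

# Absolute links: keep those starting with http and without a percent sign.
absolute_sample = [h for h in sample_hrefs if h.startswith('http') and '%' not in h]

# Relative links: resolve against the base URL, again dropping any with a percent sign.
relative_sample = [urljoin(base, h) for h in sample_hrefs if h.startswith('/') and '%' not in h]

# Combine and de-duplicate; sorting just makes the result easier to read.
cleaned = sorted(set(absolute_sample + relative_sample))
print(cleaned)
```

Going through a `set` removes the duplicate entry, at the cost of losing the original order.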

In [4]:
domain = 'http://wikipedia.org'

In [5]:
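# your code here
# Note: this requests the Wikipedia portal page (`domain`), so the links cleaned below come from wikipedia.org.
html = requests.get(domain)
soup = BeautifulSoup(html.content, 'html.parser')
links = soup.find_all(href = re.compile('.*'))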

In [6]:
absolute_links = [link['href'] for link in links if link['href'].startswith('http') and '%' not in link['href']]
absolute_links

['https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications',
 'https://creativecommons.org/licenses/by-sa/3.0/']

In [7]:
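# Protocol-relative links ('//host/...') only need a scheme; other relative links get the full domain.
relative_links = [('http:' + link['href'] if link['href'].startswith('//') else domain + link['href'])
                  for link in links if link['href'].startswith('/') and '%' not in link['href']]
relative_links

['http://wikipedia.org/static/apple-touch/wikipedia.png',
 'http://wikipedia.org/static/favicon/wikipedia.ico',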
 'http://creativecommons.org/licenses/by-sa/3.0/',
 'http://upload.wikimedia.org',
 'http://en.wikipedia.org/',
 'http://ja.wikipedia.org/',
 'http://es.wikipedia.org/',
 'http://de.wikipedia.org/',
 'http://ru.wikipedia.org/',
 'http://fr.wikipedia.org/',
 'http://it.wikipedia.org/',
 'http://zh.wikipedia.org/',
 'http://pt.wikipedia.org/',
 'http://pl.wikipedia.org/',
 'http://ar.wikipedia.org/',
 'http://de.wikipedia.org/',
 'http://en.wikipedia.org/',
 'http://es.wikipedia.org/',
 'http://fr.wikipedia.org/',
 'http://it.wikipedia.org/',
 'http://nl.wikipedia.org/',
 'http://ja.wikipedia.org/',
 'http://pl.wikipedia.org/',
 'http://pt.wikipedia.org/',
 'http://ru.wikipedia.org/',
 'http://ceb.wikipedia.org/',
 'http://sv.wikipedia.org/',
 'http://uk.wikipedia.org/',
 'http://vi.wikipedia.org/',
 'http://war.wikipedia.org/',
 'http://zh.wikipedia.org/',
 'http://ast.wikipedia.org/',
 'http://az.wikipedi

In [8]:
all_links = list(set(absolute_links + relative_links))
all_links

['http://se.wikipedia.org/',
 'http://pt.wikipedia.org/',
 'http://ie.wikipedia.org/',
 'http://dsb.wikipedia.org/',
 'https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications',
 'http://cv.wikipedia.org/',
 'http://bug.wikipedia.org/',
 'http://av.wikipedia.org/',
 'http://ks.wikipedia.org/',
 'http://ki.wikipedia.org/',
 'http://st.wikipedia.org/',
 'http://meta.wikimedia.org/wiki/Privacy_policy',
 'http://hu.wikipedia.org/',
 'http://bjn.wikipedia.org/',
 'http://ca.wikipedia.org/',
 'http://species.wikimedia.org/',
 'http://uk.wikipedia.org/',
 'http://vi.wikipedia.org/',
 'http://he.wikipedia.org/',
 'http://mhr.wikipedia.org/',
 'http://frp.wikipedia.org/',
 'http://hsb.wikipedia.org/',
 'http://na.wikipedia.org/',
 'http://myv.wikipedia.org/',
 'http://es.wikipedia.org/',
 'http://lt.wikipedia.org/',
 'http://crh.wikipedia.org/',
 'http://fy.wikipedia.org/',
 'http://xal.wikipedia.org/',
 'http://nds.wikipedia.org/',
 'http://bar.wikipedia.org/',
 'http://gag.wikipedi

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [9]:
import os

In [10]:
# your code here
os.mkdir('./wikipedia')

FileExistsError: [WinError 183] Cannot create a file when that file already exists: './wikipedia'
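
A `FileExistsError` like the one above appears whenever the cell is re-run after the folder has been created. One way to make the step repeatable, sketched here under the assumption that a temporary directory stands in for the lab folder, is `os.makedirs` with `exist_ok=True`, followed by `os.chdir`.

```python
import os
import tempfile

# A temporary directory stands in for the lab folder so the sketch can be re-run safely.
base_dir = tempfile.mkdtemp()
target = os.path.join(base_dir, 'wikipedia')

os.makedirs(target, exist_ok=True)  # creates the folder
os.makedirs(target, exist_ok=True)  # second call is a no-op instead of raising FileExistsError

original_cwd = os.getcwd()
os.chdir(target)                    # make it the current working directory
current = os.getcwd()
os.chdir(original_cwd)              # restore, so later code is unaffected
```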

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.

In [11]:
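def index_page(link):
    try:
        # Imported inside the function so each worker process has what it needs
        import requests
        from slugify import slugify
        response = requests.get(link)
        # response.content is bytes, so write it in binary mode
        with open(f'{slugify(link)}.html', 'wb') as f:
            f.write(response.content)
    except Exception:
        pass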
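
Because `requests` returns page content as bytes, the file mode matters when saving it. A small sketch shows the difference between formatting bytes into a string and writing them in binary mode. It uses a hard-coded byte string in place of a real response, which is an assumption made so the example runs offline.

```python
import os
import tempfile

content = b'<p>hello</p>'   # hypothetical stand-in for response.content

# Formatting bytes into an f-string yields their repr, including the b'...' wrapper.
as_text = f'{content}'
print(as_text)

# Binary mode writes the bytes unchanged.
path = os.path.join(tempfile.mkdtemp(), 'page.html')
with open(path, 'wb') as f:
    f.write(content)

with open(path, 'rb') as f:
    saved = f.read()
```

Writing `as_text` to a text file would store the literal characters `b'` and `'` around the markup, so the saved file would not be the page itself.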

In [12]:
pwd

'C:\\Users\\55119\\Desktop\\DAFT-202006\\W4\\labs'

In [13]:
os.chdir('./wikipedia')

### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run. 

_hint: Use tqdm to keep track of the time._ 

In [14]:
%%time
# your code here
for link in all_links:
    index_page(link)
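
`%%time` is a cell magic and only works inside IPython or Jupyter. Outside a notebook, for example in a plain `.py` file, a rough equivalent is `time.perf_counter`. The sketch below assumes a hypothetical `fake_index_page` that sleeps instead of downloading, so it runs offline.

```python
import time

def fake_index_page(link):
    """Hypothetical stand-in for index_page: sleeps briefly instead of downloading."""
    time.sleep(0.01)
    return link

# Made-up links, used only to have something to loop over.
sample_links = [f'http://example.org/{i}' for i in range(20)]

start = time.perf_counter()
results = [fake_index_page(link) for link in sample_links]   # one link after another
elapsed = time.perf_counter() - start

print(f'{len(results)} pages in {elapsed:.2f} s')
```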

### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

Use both approaches:

First, use the `multiprocess` module, which can run the function defined in the Jupyter notebook, to perform the downloads in parallel.

Second, create a Python file containing the download function and use the `multiprocessing` module to run it.

In [39]:
from multiprocess import Pool

In [17]:
%%time
# your code here
pool = Pool(processes = 8)
result = pool.map(index_page, all_links)
pool.close()  # no more tasks will be submitted
pool.join()   # wait for the workers to exit
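
Downloading pages is I/O-bound, so most of the time is spent waiting. That makes the `map` pattern easy to try out with `multiprocessing.pool.ThreadPool`, which exposes the same interface as `Pool` but runs its workers as threads. The sketch below is a thread-based stand-in, not the process-based solution itself, and `fake_index_page` is a hypothetical placeholder that sleeps instead of downloading.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_index_page(link):
    """Hypothetical stand-in for index_page: waits instead of downloading."""
    time.sleep(0.05)
    return link

# Made-up links, used only to have something to map over.
sample_links = [f'http://example.org/{i}' for i in range(16)]

start = time.perf_counter()
with ThreadPool(processes=8) as pool:    # the pool is cleaned up when the block exits
    results = pool.map(fake_index_page, sample_links)   # blocks until all tasks finish
elapsed = time.perf_counter() - start

print(f'{len(results)} links in {elapsed:.2f} s')
```

`map` returns results in the same order as the input list, even though the tasks run concurrently.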

In [14]:
from multiprocessing import Pool, cpu_count

cpu_count()

8

In [25]:
os.chdir('./labs')

In [17]:
pwd

'C:\\Users\\55119\\Desktop\\DAFT-202006\\W4\\labs\\wikipedia'

In [15]:
from w4d3 import index_pages

In [34]:
%%time
pool = Pool(processes = 8)
result = pool.map(index_pages, all_links)
pool.close()  # no more tasks will be submitted
pool.join()   # wait for the workers to exit
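
The `w4d3` module imported above is not shown. A minimal sketch of what such a file might contain follows, assuming it mirrors `index_page`. To keep the sketch free of third-party dependencies, the standard library's `urllib.request` stands in for `requests` and a small regex helper (`link_to_filename`, a hypothetical name) stands in for `slugify`. Both substitutions are assumptions, not the original file's contents.

```python
# w4d3.py (hypothetical contents; the original module is not shown)
import os
import re
import urllib.request


def link_to_filename(link):
    """Stand-in for slugify: lowercase, runs of other characters become hyphens."""
    return re.sub(r'[^a-z0-9]+', '-', link.lower()).strip('-') + '.html'


def index_pages(link, folder='.'):
    """Download one link into `folder`; ignore any error, as the lab asks."""
    try:
        with urllib.request.urlopen(link, timeout=10) as response:
            content = response.read()   # raw bytes
        with open(os.path.join(folder, link_to_filename(link)), 'wb') as f:
            f.write(content)
    except Exception:
        pass
```

Keeping the function in its own file lets the standard `multiprocessing` workers import it by name, which is why this variant does not need the `multiprocess` package.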

**BONUS**: Create a function that uses the `os` module to count how many files there are in the wikipedia folder.

Delete the files from the folder before you start, then run the solution above asynchronously.

Use your function to check how many files have been downloaded so far.

In [16]:
import os.path

In [37]:
os.chdir('./wikipedia')

In [17]:
def how_many_files():
    return len([name for name in os.listdir('.') if os.path.isfile(name)])

In [50]:
pwd

'C:\\Users\\55119\\Desktop\\DAFT-202006\\W4\\labs\\wikipedia'

In [19]:
pool = Pool(processes = 8)
result = pool.map_async(index_pages, all_links)
pool.close()  # stop accepting tasks; terminate() here would kill the downloads still in progress
how_many_files()

23

In [20]:
how_many_files()

307
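
The two counts above show the folder filling up while the asynchronous call is still running. The same idea can be sketched offline by polling the result object's `ready()` method. The assumptions here: `ThreadPool` (same `map_async` interface as `Pool`, but thread-based) replaces the process pool, `fake_download` is a hypothetical stand-in for `index_pages`, and a temporary folder replaces the wikipedia folder.

```python
import os
import tempfile
import time
from multiprocessing.pool import ThreadPool

folder = tempfile.mkdtemp()   # temporary stand-in for the wikipedia folder

def fake_download(i):
    """Hypothetical stand-in for index_pages: waits, then writes a small file."""
    time.sleep(0.02)
    with open(os.path.join(folder, f'page-{i}.html'), 'wb') as f:
        f.write(b'<p>stub</p>')

def count_files(path):
    """Count regular files in `path`, independent of the current working directory."""
    return len([n for n in os.listdir(path) if os.path.isfile(os.path.join(path, n))])

pool = ThreadPool(processes=4)
result = pool.map_async(fake_download, range(20))   # returns immediately
pool.close()                                        # no new tasks; queued ones still run

counts = []
while not result.ready():                           # poll until every task has finished
    counts.append(count_files(folder))
    time.sleep(0.01)

pool.join()
final_count = count_files(folder)
print(counts, final_count)
```

Joining the folder path inside `count_files` means the count does not depend on which directory the notebook happens to be in.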