# Parallelization Lab

In this lab, you will apply several concepts you have learned to collect the links on a web page and then crawl and index the pages those links reference, both sequentially and in parallel. Follow the steps below to complete the lab.

### Step 1: Use the requests library to retrieve the content from the URL below.

In [1]:
import requests

url = 'https://en.wikipedia.org/wiki/Data_science'

In [2]:
# your code here
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error status
html = response.content



### Step 2: Use BeautifulSoup to extract a list of all the unique links on the page.

In [3]:
from bs4 import BeautifulSoup

In [4]:
# your code here
import urllib.request

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen('https://en.wikipedia.org/wiki/Data_science')
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])

#mw-head
#searchInput
/wiki/Information_science
/wiki/File:PIA23792-1600x1200(1).jpg
/wiki/File:PIA23792-1600x1200(1).jpg
/wiki/Comet_NEOWISE
/wiki/Astronomical_survey
/wiki/Space_telescope
/wiki/Wide-field_Infrared_Survey_Explorer
/wiki/Machine_learning
/wiki/Data_mining
/wiki/File:Multi-Layer_Neural_Network-Vector-Blank.svg
/wiki/Statistical_classification
/wiki/Cluster_analysis
/wiki/Regression_analysis
/wiki/Anomaly_detection
/wiki/Automated_machine_learning
/wiki/Association_rule_learning
/wiki/Reinforcement_learning
/wiki/Structured_prediction
/wiki/Feature_engineering
/wiki/Feature_learning
/wiki/Online_machine_learning
/wiki/Semi-supervised_learning
/wiki/Unsupervised_learning
/wiki/Learning_to_rank
/wiki/Grammar_induction
/wiki/Supervised_learning
/wiki/Statistical_classification
/wiki/Regression_analysis
/wiki/Decision_tree_learning
/wiki/Ensemble_learning
/wiki/Bootstrap_aggregating
/wiki/Boosting_(machine_learning)
/wiki/Random_forest
/wiki/K-nearest_neighbors_algorithm
/wi

In [5]:
import requests
from bs4 import BeautifulSoup as bs


url = 'https://en.wikipedia.org/wiki/Data_science'
r = requests.get(url)
html_content = r.text
soup = bs(html_content, 'lxml')

links = [i.get('href') for i in soup.find_all('a', href=True)]
links

['#mw-head',
 '#searchInput',
 '/wiki/Information_science',
 '/wiki/File:PIA23792-1600x1200(1).jpg',
 '/wiki/File:PIA23792-1600x1200(1).jpg',
 '/wiki/Comet_NEOWISE',
 '/wiki/Astronomical_survey',
 '/wiki/Space_telescope',
 '/wiki/Wide-field_Infrared_Survey_Explorer',
 '/wiki/Machine_learning',
 '/wiki/Data_mining',
 '/wiki/File:Multi-Layer_Neural_Network-Vector-Blank.svg',
 '/wiki/Statistical_classification',
 '/wiki/Cluster_analysis',
 '/wiki/Regression_analysis',
 '/wiki/Anomaly_detection',
 '/wiki/Automated_machine_learning',
 '/wiki/Association_rule_learning',
 '/wiki/Reinforcement_learning',
 '/wiki/Structured_prediction',
 '/wiki/Feature_engineering',
 '/wiki/Feature_learning',
 '/wiki/Online_machine_learning',
 '/wiki/Semi-supervised_learning',
 '/wiki/Unsupervised_learning',
 '/wiki/Learning_to_rank',
 '/wiki/Grammar_induction',
 '/wiki/Supervised_learning',
 '/wiki/Statistical_classification',
 '/wiki/Regression_analysis',
 '/wiki/Decision_tree_learning',
 '/wiki/Ensemble_lear

In [6]:
links = []

# Keep only internal article links and absolute https links.
for anchor in soup.find_all('a', href=True):
    href = anchor['href']
    if href.startswith(('/wiki/', 'https')):
        links.append(href)
links

['/wiki/Information_science',
 '/wiki/File:PIA23792-1600x1200(1).jpg',
 '/wiki/File:PIA23792-1600x1200(1).jpg',
 '/wiki/Comet_NEOWISE',
 '/wiki/Astronomical_survey',
 '/wiki/Space_telescope',
 '/wiki/Wide-field_Infrared_Survey_Explorer',
 '/wiki/Machine_learning',
 '/wiki/Data_mining',
 '/wiki/File:Multi-Layer_Neural_Network-Vector-Blank.svg',
 '/wiki/Statistical_classification',
 '/wiki/Cluster_analysis',
 '/wiki/Regression_analysis',
 '/wiki/Anomaly_detection',
 '/wiki/Automated_machine_learning',
 '/wiki/Association_rule_learning',
 '/wiki/Reinforcement_learning',
 '/wiki/Structured_prediction',
 '/wiki/Feature_engineering',
 '/wiki/Feature_learning',
 '/wiki/Online_machine_learning',
 '/wiki/Semi-supervised_learning',
 '/wiki/Unsupervised_learning',
 '/wiki/Learning_to_rank',
 '/wiki/Grammar_induction',
 '/wiki/Supervised_learning',
 '/wiki/Statistical_classification',
 '/wiki/Regression_analysis',
 '/wiki/Decision_tree_learning',
 '/wiki/Ensemble_learning',
 '/wiki/Bootstrap_aggre
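
The lists above still contain repeats (`/wiki/File:PIA23792-1600x1200(1).jpg` shows up twice), whereas this step asks for unique links. One possible way to drop repeats while keeping the order in which links appear on the page is sketched below. `unique_in_order` is a hypothetical helper name, and the sample list is a hand-picked excerpt of the output above rather than a fresh crawl.

```python
def unique_in_order(hrefs):
    """Return hrefs without repeats, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(hrefs))


# hand-picked excerpt of the notebook output, used only for illustration
sample = [
    '/wiki/Information_science',
    '/wiki/File:PIA23792-1600x1200(1).jpg',
    '/wiki/File:PIA23792-1600x1200(1).jpg',
    '/wiki/Comet_NEOWISE',
]
unique_links = unique_in_order(sample)
print(unique_links)
```

When page order does not matter, `list(set(hrefs))` should give the same set of links in arbitrary order.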

### Step 3: Use list comprehensions with conditions to clean the link list.

There are two types of links, absolute and relative. Absolute links have the full URL and begin with *http* while relative links begin with a forward slash (/) and point to an internal page within the *wikipedia.org* domain. Clean the respective types of URLs as follows.

- Absolute Links: Create a list of these and remove any that contain a percentage sign (%).
- Relative Links: Create a list of these, add the domain to the link so that you have the full URL, and remove any that contain a percentage sign (%).
- Combine the list of absolute and relative links and ensure there are no duplicates.
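
The three rules above can be sketched with two list comprehensions and a set. This is only an illustration under stated assumptions: `clean_links` is a hypothetical helper name, and the sample links (including the made-up `/wiki/Caf%C3%A9` entry) are invented to exercise each rule.

```python
def clean_links(hrefs, domain):
    """Split hrefs into absolute and relative links, clean both, and merge them."""
    # absolute links: keep those without a percent sign
    absolute = [h for h in hrefs if h.startswith('http') and '%' not in h]
    # relative links: prefix the domain, again skipping percent signs
    relative = [domain + h for h in hrefs if h.startswith('/') and '%' not in h]
    # a set removes duplicates; sorting only makes the result reproducible
    return sorted(set(absolute + relative))


# illustrative input, not taken from a real crawl
sample = [
    '/wiki/Data_mining',
    '/wiki/Data_mining',                    # duplicate
    'https://arxiv.org/list/cs.LG/recent',  # absolute
    '/wiki/Caf%C3%A9',                      # contains a percent sign
    '#mw-head',                             # neither absolute nor relative
]
cleaned = clean_links(sample, 'http://wikipedia.org')
print(cleaned)
```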

In [7]:
domain = 'http://wikipedia.org'

In [8]:
# your code here
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

base_url = 'http://wikipedia.org'
url = 'https://en.wikipedia.org/wiki/Data_science'

soup = BeautifulSoup(requests.get(url).content, 'lxml')  # name the parser explicitly


for img in soup.find_all('img', src=True):
    src = img.get('src')
    if not src.startswith('http'):
        src = urljoin(base_url, src)

    print(src)
    


http://upload.wikimedia.org/wikipedia/commons/thumb/4/45/PIA23792-1600x1200%281%29.jpg/250px-PIA23792-1600x1200%281%29.jpg
http://upload.wikimedia.org/wikipedia/commons/thumb/0/00/Multi-Layer_Neural_Network-Vector-Blank.svg/150px-Multi-Layer_Neural_Network-Vector-Blank.svg.png
http://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
http://wikipedia.org/static/images/footer/wikimedia-button.png
http://wikipedia.org/static/images/footer/poweredby_mediawiki_88x31.png


In [9]:
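# Absolute links: keep those without a percent sign.
absolute = [link for link in links if link.startswith('https') and '%' not in link]

# Relative links: prepend the domain, again skipping any with a percent sign.
relative = [f'{domain}{link}' for link in links
            if not link.startswith('https') and '%' not in link]
absolute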


['https://arxiv.org/list/cs.LG/recent',
 'https://en.wikipedia.org/w/index.php?title=Template:Machine_learning_bar&action=edit',
 'https://www.wikidata.org/wiki/Q2963551',
 'https://api.semanticscholar.org/CorpusID:6107147',
 'https://web.archive.org/web/20141109113411/http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext',
 'https://web.archive.org/web/20140102194117/http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/',
 'https://www.springer.com/book/9784431702085',
 'https://books.google.com/books?id=oGs_AQAAIAAJ',
 'https://web.archive.org/web/20170320193019/https://books.google.com/books?id=oGs_AQAAIAAJ',
 'https://api.semanticscholar.org/CorpusID:9743327',
 'https://www.oreilly.com/library/view/doing-data-science/9781449363871/ch01.html',
 'https://medriscoll.com/post/4740157098/the-three-sexy-skills-of-data-geeks',
 'https://flowingdata.com/2009/06/04/rise-of-the-data-scientist/',
 'https://benfry.com/phd/dis

In [10]:
relative

['http://wikipedia.org/wiki/Information_science',
 'http://wikipedia.org/wiki/File:PIA23792-1600x1200(1).jpg',
 'http://wikipedia.org/wiki/File:PIA23792-1600x1200(1).jpg',
 'http://wikipedia.org/wiki/Comet_NEOWISE',
 'http://wikipedia.org/wiki/Astronomical_survey',
 'http://wikipedia.org/wiki/Space_telescope',
 'http://wikipedia.org/wiki/Wide-field_Infrared_Survey_Explorer',
 'http://wikipedia.org/wiki/Machine_learning',
 'http://wikipedia.org/wiki/Data_mining',
 'http://wikipedia.org/wiki/File:Multi-Layer_Neural_Network-Vector-Blank.svg',
 'http://wikipedia.org/wiki/Statistical_classification',
 'http://wikipedia.org/wiki/Cluster_analysis',
 'http://wikipedia.org/wiki/Regression_analysis',
 'http://wikipedia.org/wiki/Anomaly_detection',
 'http://wikipedia.org/wiki/Automated_machine_learning',
 'http://wikipedia.org/wiki/Association_rule_learning',
 'http://wikipedia.org/wiki/Reinforcement_learning',
 'http://wikipedia.org/wiki/Structured_prediction',
 'http://wikipedia.org/wiki/Featur

In [11]:
# Combine both lists; converting to a set drops the duplicates.
final = list(set(absolute + relative))
final


['http://wikipedia.org/wiki/Decision_tree',
 'http://wikipedia.org/wiki/Wikipedia:About',
 'http://wikipedia.org/wiki/Category:Use_dmy_dates_from_December_2012',
 'http://wikipedia.org/wiki/Regression_analysis',
 'http://wikipedia.org/wiki/Data_storage',
 'http://wikipedia.org/wiki/Special:MyContributions',
 'http://wikipedia.org/wiki/Statistical_classification',
 'https://es.wikipedia.org/wiki/Ciencia_de_datos',
 'http://wikipedia.org/wiki/Data_reduction',
 'http://wikipedia.org/wiki/Transformer_(machine_learning_model)',
 'http://wikipedia.org/wiki/Informatics',
 'http://wikipedia.org/wiki/Data_integrity',
 'http://wikipedia.org/wiki/Wide-field_Infrared_Survey_Explorer',
 'http://wikipedia.org/wiki/Statistics',
 'http://wikipedia.org/wiki/Echo_state_network',
 'https://arxiv.org/list/cs.LG/recent',
 'http://wikipedia.org/wiki/Ensemble_learning',
 'https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century',
 'http://wikipedia.org/wiki/Data_mining',
 'http://wikipedia.

### Step 4: Use the os library to create a folder called *wikipedia* and make that the current working directory.

In [45]:
import os

In [13]:
# your code here
os.makedirs('Wikipedia', exist_ok=True)  # exist_ok avoids FileExistsError when the folder is already there

In [46]:
os.chdir('C:/Users/user/1.IRONHACK/Ironhack Labs/Semana 4 Dia 4/Parallelization/your-code/Wikipedia')

os.getcwd()


'C:\\Users\\user\\1.IRONHACK\\Ironhack Labs\\Semana 4 Dia 4\\Parallelization\\your-code\\Wikipedia'
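
A hard-coded absolute path like the one above only works on one machine, and `os.makedirs` raises `FileExistsError` on an existing folder unless `exist_ok=True` is passed. A more portable variant might look like the sketch below. `enter_folder` is a hypothetical helper, and the demo runs inside a temporary directory (an assumption made so the example leaves nothing behind).

```python
import os
import tempfile


def enter_folder(name, parent=None):
    """Create name under parent if needed, make it the working directory, return its path."""
    parent = parent or os.getcwd()
    path = os.path.join(parent, name)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    os.chdir(path)
    return os.getcwd()


start = os.getcwd()
with tempfile.TemporaryDirectory() as scratch:
    first = enter_folder('wikipedia', parent=scratch)
    second = enter_folder('wikipedia', parent=scratch)  # safe to repeat
    os.chdir(start)  # leave the folder before the temporary directory is removed
print(os.path.basename(first))
```

In the notebook itself, calling `enter_folder('wikipedia')` once from the lab directory would be enough.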

### Step 5: Write a function called index_page that accepts a link and does the following.

- Tries to request the content of the page referenced by that link.
- Slugifies the filename using the `slugify` function from the [python-slugify](https://pypi.org/project/python-slugify/) library and adds a .html file extension.
    - If you don't already have the python-slugify library installed, you can pip install it as follows: `$ pip3 install python-slugify`.
    - To import the slugify function, you would do the following: `from slugify import slugify`.
    - You can then slugify a link as follows: `slugify(link)`.
- Creates a file in the wikipedia folder using the slugified filename and writes the contents of the page to the file.
- If an exception occurs during the process above, just `pass`.

In [None]:
!pip install python-slugify

In [30]:
from slugify import slugify

In [18]:
# your code here
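def index_page(url):
    try:
        r = requests.get(url)
        slug = slugify(url)
        filename = slug + '.html'
        print(slug)
        # the with-block closes the file even if the write fails
        with open(filename, "wb") as file:
            file.write(r.content)
    except Exception:
        # the lab asks to skip any link that fails
        pass



index_page('https://en.wikipedia.org/wiki/Data_science')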


https-en-wikipedia-org-wiki-data-science


### Step 6: Sequentially loop through the list of links, running the index_page function each time.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run. 

_hint: Use tqdm to keep track of the time._ 
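
`%%time` is a cell magic, so it is only available inside IPython or Jupyter. For code that runs outside a notebook, a rough equivalent can be built on `time.perf_counter`, as in the sketch below. `run_sequentially` and `fake_download` are hypothetical names, and `fake_download` merely sleeps as a stand-in for `index_page` so the example needs no network.

```python
import time


def run_sequentially(func, items):
    """Call func on each item in turn and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = [func(item) for item in items]
    elapsed = time.perf_counter() - start
    return results, elapsed


def fake_download(link):
    # hypothetical stand-in for index_page: sleeps instead of hitting the network
    time.sleep(0.01)
    return len(link)


results, elapsed = run_sequentially(fake_download, ['a', 'bb', 'ccc'])
print(f'{len(results)} items in {elapsed:.3f}s')
```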

In [None]:
# your code here
import time 
from tqdm.auto import tqdm


In [35]:
%%time

for link in tqdm(final):
    index_page(link)

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=268.0), HTML(value='')))

http-wikipedia-org-wiki-decision-tree
http-wikipedia-org-wiki-wikipedia-about
http-wikipedia-org-wiki-category-use-dmy-dates-from-december-2012
http-wikipedia-org-wiki-regression-analysis
http-wikipedia-org-wiki-data-storage
http-wikipedia-org-wiki-special-mycontributions
http-wikipedia-org-wiki-statistical-classification
https-es-wikipedia-org-wiki-ciencia-de-datos
http-wikipedia-org-wiki-data-reduction
http-wikipedia-org-wiki-transformer-machine-learning-model
http-wikipedia-org-wiki-informatics
http-wikipedia-org-wiki-data-integrity
http-wikipedia-org-wiki-wide-field-infrared-survey-explorer
http-wikipedia-org-wiki-statistics
http-wikipedia-org-wiki-echo-state-network
https-arxiv-org-list-cs-lg-recent



KeyboardInterrupt: 

### Step 7: Perform the page indexing in parallel and note the difference in performance.

Remember to include `%%time` at the beginning of the cell so that it measures the time it takes for the cell to run.

Use both methods. On the one hand, use the `multiprocess` module to run the function defined in the Jupyter notebook and download the pages in parallel.

On the other hand, create a Python file containing the download function and run it with the `multiprocessing` module.
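
For the second method, the standard `multiprocessing` module pickles functions by reference, so the worker function generally has to live in an importable module rather than in a notebook cell. A minimal sketch of such a file follows. The file name `download.py`, the `run_parallel` helper, the 10-second timeout, and the command-line usage are all assumptions made for illustration, not part of the lab.

```python
# download.py  (hypothetical file name)
import sys
from multiprocessing import Pool


def index_page(url):
    """Download url and save it as <slug>.html; skip the link on any error."""
    try:
        # third-party imports sit inside the function so each worker loads them itself
        import requests
        from slugify import slugify

        response = requests.get(url, timeout=10)
        with open(slugify(url) + '.html', 'wb') as file:
            file.write(response.content)
    except Exception:
        # the lab asks to ignore failures
        pass


def run_parallel(links, processes=6):
    """Index all links with a pool of worker processes."""
    if not links:
        return []
    with Pool(processes=processes) as pool:
        return pool.map(index_page, links)


if __name__ == '__main__' and len(sys.argv) > 1:
    # usage: python download.py URL [URL ...]
    run_parallel(sys.argv[1:])
```

From the notebook, `from download import run_parallel` followed by `run_parallel(final)` in a `%%time` cell would then mirror the `multiprocess` version, provided the file is importable from the notebook's working directory.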

In [21]:
from multiprocess import Pool
import time 
from tqdm.auto import tqdm
import requests

In [47]:
def index_page(url):
    # Importing inside the function keeps it self-contained when it is sent to worker processes.
    from slugify import slugify
    import requests
    try:
        r = requests.get(url)
        slug = slugify(url)
        filename = slug + '.html'
        print(slug)
        with open(filename, "wb") as file:
            file.write(r.content)
    except Exception:
        # Skip any link that fails, as required in Step 5.
        pass



index_page('https://en.wikipedia.org/wiki/Data_science')

https-en-wikipedia-org-wiki-data-science

In [48]:
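%%time

with Pool(processes=6) as pool:
    # Wrap the result iterator so the bar advances as pages finish downloading.
    result = list(tqdm(pool.imap_unordered(index_page, final), total=len(final)))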

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=268.0), HTML(value='')))


Wall time: 1min 22s


**BONUS**: Create a function that uses the `os` module to count how many files there are in the wikipedia folder.

Delete the files from the folder before you run it, and perform the solution above asynchronously.

Use your function to check how many files are being downloaded.
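
A possible starting point for the bonus is sketched below. `count_files` and `clear_folder` are hypothetical helper names, and the demo uses a temporary directory with three dummy files (an assumption, so that no real downloads are touched).

```python
import os
import tempfile


def count_files(folder):
    """Return the number of regular files directly inside folder."""
    with os.scandir(folder) as entries:
        return sum(1 for entry in entries if entry.is_file())


def clear_folder(folder):
    """Delete every regular file directly inside folder; subfolders are left alone."""
    with os.scandir(folder) as entries:
        paths = [entry.path for entry in entries if entry.is_file()]
    for path in paths:
        os.remove(path)


# demo in a throwaway directory
with tempfile.TemporaryDirectory() as scratch:
    for name in ('a.html', 'b.html', 'c.html'):
        with open(os.path.join(scratch, name), 'wb') as file:
            file.write(b'<html></html>')
    before = count_files(scratch)
    clear_folder(scratch)
    after = count_files(scratch)

print(before, after)
```

For the asynchronous part, one option might be `pool.map_async(index_page, final)`, which returns an `AsyncResult` immediately. `count_files('.')` could then be polled in a loop until `result.ready()` is true.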