# 1. Demo downloading files from websites 

There are ```txt``` and ```pdf``` files on:

```https://sandeepmj.github.io/scrape-example-page/pages.html```

Do the following:

1. Download all ```pdf``` files.

In [1]:
# import libraries
from bs4 import BeautifulSoup  ## scrape info from web pages
import requests ## get web pages from server
import time # time is required. we will use its sleep function
from random import randrange # generate random numbers


In [2]:
## write function here
def mkRequest (url):
    '''
    Takes a provided url and returns requested response
    '''
    response = requests.get(url)
    if 200 <= response.status_code < 400:  # check that the site responded and can be scraped
        return response
    else:
        print(f"request returned {response.status_code} error")

In [3]:
# url to scrape
url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

In [4]:
## call the function
response = mkRequest(url)
response

<Response [200]>

In [5]:
## create function to create soup
def mkSoup (response):
    '''
    Make soup
    '''
    return BeautifulSoup (response.text, "html.parser")

In [6]:
## call the function
soup = mkSoup(response)
soup

<html lang="en">
<head>
<!-- Makes the page responsive and scaled to be read easily -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- Links to stylesheet -->
<link href="style.css" rel="stylesheet" type="text/css"/>
<!-- Remember to update page title -->
<title>List of Documents</title>
</head>
<body>
<!-- All content goes here -->
<div class="container">
<h1>Documents to Download</h1>
<li>Junk Li <a href="">tag 1</a></li>
<li>Junk Li <a href="">tag 2</a></li>
<ul class="txts downloadable">
<p class="pages">Download this first set of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.

In [7]:
# MC function: chains mkRequest and mkSoup
def scraper (url):
    '''
    Enter url to return soup of page
    '''
    return mkSoup(mkRequest(url))

In [8]:
soup = scraper(url)
soup

<html lang="en">
<head>
<!-- Makes the page responsive and scaled to be read easily -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- Links to stylesheet -->
<link href="style.css" rel="stylesheet" type="text/css"/>
<!-- Remember to update page title -->
<title>List of Documents</title>
</head>
<body>
<!-- All content goes here -->
<div class="container">
<h1>Documents to Download</h1>
<li>Junk Li <a href="">tag 1</a></li>
<li>Junk Li <a href="">tag 2</a></li>
<ul class="txts downloadable">
<p class="pages">Download this first set of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.

In [10]:
## save a tags in list 
aTags = soup.find("ul", class_ = "pdfs").find_all("a")
aTags

[<a href="files/pdf_1.pdf">1</a>,
 <a href="files/pdf_2.pdf">2</a>,
 <a href="files/pdf_3.pdf">3</a>,
 <a href="files/pdf_4.pdf">4</a>,
 <a href="files/pdf_5.pdf">5</a>,
 <a href="files/pdf_6.pdf">6</a>,
 <a href="files/pdf_7.pdf">7</a>,
 <a href="files/pdf_8.pdf">8</a>,
 <a href="files/pdf_9.pdf">9</a>,
 <a href="files/pdf_10.pdf">10</a>]

In [11]:
## base url
base_url = "https://sandeepmj.github.io/scrape-example-page/"

In [12]:
## build full urls from the href values using a list comprehension
links = [base_url + aTag.get("href") for aTag in aTags]
links

['https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_3.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_4.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_5.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_6.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_7.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_8.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_9.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_10.pdf']
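Concatenating `base_url` and the `href` works here because every `href` is relative and `base_url` ends in a slash. As a hedged alternative, the sketch below uses the standard library's `urljoin`, which resolves a relative `href` against the page address. The names `page_url`, `hrefs` and `full_links` are illustrative and not part of the original notebook.

```python
from urllib.parse import urljoin

# the page the links were scraped from (same URL as in the notebook)
page_url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

# two sample href values, copied from the a tags above
hrefs = ["files/pdf_1.pdf", "files/pdf_2.pdf"]

# urljoin resolves each relative href against the page URL
full_links = [urljoin(page_url, href) for href in hrefs]
print(full_links)
```

One reason to prefer this approach is that it also handles an `href` that is already an absolute URL, which plain concatenation would break.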

In [15]:
## import wget
import wget

In [16]:
## make timer function
def snoozer (start_range, end_range):
    snooze_time = randrange(start_range,end_range)
    print(f" \n Snoozing for {snooze_time} seconds")
    return time.sleep(snooze_time) # without this line, the program does not sleep
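A small check of the bounds may help when choosing `start_range` and `end_range`: `randrange` excludes its stop value, so the `10, 21` used in this notebook yields pauses of 10 to 20 seconds. The variable `samples` is only for illustration.

```python
from random import randrange

# draw many values with the same bounds used in the notebook
samples = [randrange(10, 21) for _ in range(1000)]

# randrange excludes the stop value, so 21 never appears
print(min(samples) >= 10, max(samples) <= 20)
```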

In [17]:
## download with timer
link_count = 1
start_range, end_range = 10, 21
for link in links:
    print(f"Downloading link {link_count} of {len(links)}")
    link_count +=1
    wget.download(link)
    snoozer(start_range, end_range)

Downloading link 1 of 10
 
 Snoozing for 14 seconds
Downloading link 2 of 10
 
 Snoozing for 18 seconds
Downloading link 3 of 10
 
 Snoozing for 10 seconds
Downloading link 4 of 10
 
 Snoozing for 16 seconds
Downloading link 5 of 10
 
 Snoozing for 16 seconds
Downloading link 6 of 10
 
 Snoozing for 17 seconds
Downloading link 7 of 10
 
 Snoozing for 15 seconds
Downloading link 8 of 10
 
 Snoozing for 12 seconds
Downloading link 9 of 10
 
 Snoozing for 20 seconds
Downloading link 10 of 10
 
 Snoozing for 20 seconds
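The loop above could also be written with `enumerate` instead of a manual counter. The sketch below is one possible variant, not the notebook's method: `filename_from_url` and `download_all` are hypothetical helpers, and the download and sleep functions are passed in as `fetch` and `pause` so the loop can be tried without a network connection. Unlike the original loop, it skips the pause after the last file.

```python
import os
from random import randrange
from urllib.parse import urlparse


def filename_from_url(link):
    """Return the last path segment of a URL, e.g. 'pdf_1.pdf'."""
    return os.path.basename(urlparse(link).path)


def download_all(links, fetch, pause, start_range=10, end_range=21):
    """
    Call fetch(link, filename) for each link, pausing between downloads.
    fetch and pause are passed in by the caller (assumed signatures).
    """
    saved = []
    for count, link in enumerate(links, start=1):
        print(f"Downloading link {count} of {len(links)}")
        filename = filename_from_url(link)
        fetch(link, filename)
        saved.append(filename)
        if count < len(links):  # no need to wait after the last file
            pause(randrange(start_range, end_range))
    return saved
```

For a real run, something like `download_all(links, fetch=urllib.request.urlretrieve, pause=time.sleep)` should work, since `urlretrieve` accepts a URL followed by a target filename.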


# 2. Demo downloading ALL files from websites 

There are ```txt``` and ```pdf``` files on:

```https://sandeepmj.github.io/scrape-example-page/pages.html```

Do the following:

1. Download all files.

In [21]:
## save in a list called file_holder
file_holder = soup.find_all("ul", class_ = "downloadable")
file_holder

[<ul class="txts downloadable">
 <p class="pages">Download this first set of text documents</p>
 <li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>
 </ul>,
 <ul class="pdfs downloadable">
 <p class="pages">Download this list of PDFs</p>
 <li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>
 <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>
 <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>
 <li>PDF Document <a href="

In [27]:
# save the a tags from each ul in a nested list
aTag_list = []
for ul in file_holder:
    aTag_list.append(ul.find_all("a"))
aTag_list

[[<a href="files/text_doc_01.txt">1</a>,
  <a href="files/text_doc_02.txt">2</a>,
  <a href="files/text_doc_03.txt">3</a>,
  <a href="files/text_doc_04.txt">4</a>,
  <a href="files/text_doc_05.txt">5</a>,
  <a href="files/text_doc_06.txt">6</a>,
  <a href="files/text_doc_07.txt">7</a>,
  <a href="files/text_doc_08.txt">8</a>,
  <a href="files/text_doc_09.txt">9</a>,
  <a href="files/text_doc_10.txt">10</a>],
 [<a href="files/pdf_1.pdf">1</a>,
  <a href="files/pdf_2.pdf">2</a>,
  <a href="files/pdf_3.pdf">3</a>,
  <a href="files/pdf_4.pdf">4</a>,
  <a href="files/pdf_5.pdf">5</a>,
  <a href="files/pdf_6.pdf">6</a>,
  <a href="files/pdf_7.pdf">7</a>,
  <a href="files/pdf_8.pdf">8</a>,
  <a href="files/pdf_9.pdf">9</a>,
  <a href="files/pdf_10.pdf">10</a>],
 [<a href="files/text_doc_A.txt">1</a>,
  <a href="files/text_doc_B.txt">2</a>,
  <a href="files/text_doc_C.txt">3</a>,
  <a href="files/text_doc_D.txt">4</a>,
  <a href="files/text_doc_E.txt">5</a>,
  <a href="files/text_doc_F.txt">6<

In [28]:
#import itertools
import itertools

In [43]:
# flatten the nested list into a single list of a tags
html_links = list(itertools.chain(*aTag_list)) 
html_links


[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>,
 <a href="files/pdf_1.pdf">1</a>,
 <a href="files/pdf_2.pdf">2</a>,
 <a href="files/pdf_3.pdf">3</a>,
 <a href="files/pdf_4.pdf">4</a>,
 <a href="files/pdf_5.pdf">5</a>,
 <a href="files/pdf_6.pdf">6</a>,
 <a href="files/pdf_7.pdf">7</a>,
 <a href="files/pdf_8.pdf">8</a>,
 <a href="files/pdf_9.pdf">9</a>,
 <a href="files/pdf_10.pdf">10</a>,
 <a href="files/text_doc_A.txt">1</a>,
 <a href="files/text_doc_B.txt">2</a>,
 <a href="files/text_doc_C.txt">3</a>,
 <a href="files/text_doc_D.txt">4</a>,
 <a href="files/text_doc_E.txt">5</a>,
 <a href="files/text_doc_F.txt">6</a>,
 <a href="files/text_do
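To make the flattening step easier to see, here is a minimal sketch on plain strings instead of tags. It compares `itertools.chain.from_iterable` (equivalent to the `chain(*...)` call above) with a nested list comprehension. The list `nested` is made-up sample data.

```python
from itertools import chain

# a nested list shaped like aTag_list, using strings instead of tags
nested = [["text_doc_01.txt", "text_doc_02.txt"], ["pdf_1.pdf"], []]

# option 1: chain.from_iterable walks each inner list in order
flat_chain = list(chain.from_iterable(nested))

# option 2: a nested list comprehension does the same without an import
flat_comp = [item for group in nested for item in group]

print(flat_chain == flat_comp)
```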

In [45]:
#Add base url
all_links = [base_url + link.get("href") for link in html_links]
all_links

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_3.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/

In [48]:
## download with timer
link_count_all = 1
start_range, end_range = 10, 21
for link in all_links:
    print(f"Downloading link {link_count_all} of {len(all_links)}")
    link_count_all +=1
    wget.download(link)
    snoozer(start_range, end_range)

Downloading link 1 of 30
 
 Snoozing for 12 seconds
Downloading link 2 of 30
 
 Snoozing for 16 seconds
Downloading link 3 of 30
 
 Snoozing for 19 seconds
Downloading link 4 of 30
 
 Snoozing for 14 seconds
Downloading link 5 of 30
 
 Snoozing for 12 seconds
Downloading link 6 of 30
 
 Snoozing for 19 seconds
Downloading link 7 of 30
 
 Snoozing for 18 seconds
Downloading link 8 of 30
 
 Snoozing for 10 seconds
Downloading link 9 of 30
 
 Snoozing for 14 seconds
Downloading link 10 of 30
 
 Snoozing for 13 seconds
Downloading link 11 of 30
 
 Snoozing for 10 seconds
Downloading link 12 of 30
 
 Snoozing for 12 seconds
Downloading link 13 of 30
 
 Snoozing for 15 seconds
Downloading link 14 of 30
 
 Snoozing for 11 seconds
Downloading link 15 of 30
 
 Snoozing for 11 seconds
Downloading link 16 of 30
 
 Snoozing for 16 seconds
Downloading link 17 of 30
 
 Snoozing for 10 seconds
Downloading link 18 of 30
 
 Snoozing for 20 seconds
Downloading link 19 of 30
 
 Snoozing for 17 seconds
Do

# 3. Conversion function


Write a function that takes string values like ```$12.24```, ```10,201.7654``` and ```$12,501``` and converts them into floating-point numbers rounded to two decimal places, like ```12.24```, ```10201.77``` and ```12501.0```.

Test it out on those 3 string values.

In [89]:
## write function here
def convert (number):
    '''
    Takes a number provided in string format and converts it to a floating point number with 2 decimal places
    '''
    return round(float(number.replace("$", "").replace(",","")),2)

In [90]:
## call it on "$12.24"
convert("$12.24")

12.24

In [91]:
## call it on "10,201.7654"
convert("10,201.7654")

10201.77

In [92]:
## call it on "$12,501"
convert("$12,501")

12501.0
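The three manual calls can also be bundled into one self-checking cell. The sketch below repeats `convert` so that it runs on its own. The dictionary `expected` is an illustrative name, and its values are the outputs shown above.

```python
def convert(number):
    """Strip '$' and ',' from a numeric string and return a float rounded to 2 places."""
    return round(float(number.replace("$", "").replace(",", "")), 2)


# inputs paired with the results shown in the cells above
expected = {"$12.24": 12.24, "10,201.7654": 10201.77, "$12,501": 12501.0}

for text, value in expected.items():
    assert convert(text) == value, f"{text} gave {convert(text)}"

print("all conversions match")
```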