# Scraping Files from Websites 

### You need to create a data set that tracks how many companies the <a href="https://www.sec.gov/litigation/suspensions.shtml">SEC suspended</a> between 1999 and 2019. The data is at:

```https://www.sec.gov/litigation/suspensions.shtml```



### We want to write a scraper that aggregates:

* Date of suspension
* Company name
* Order
* Release (the PDFs in the XX-YYYYY format)

# The Challenge?

# Downloading files from websites 

Downloading files from a site that holds your target data is one of the most common scraping tasks. Examples include:

- <a href="https://www.uscourts.gov/report-name/bankruptcy-filings">Bankruptcy documents</a>
- <a href="https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action">NYS Doctors Disciplinary Decisions</a>

In [1]:
# import libraries
from bs4 import BeautifulSoup  ## scrape info from web pages
import requests ## get web pages from server
import time # we will use its sleep function to pause between requests
from random import randrange # generate random numbers

In [2]:
# scrape url to soup
url = "https://sandeepmj.github.io/scrape-example-page/pages.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

## Find all PDF files

In [3]:
# save target data in list
aTags = soup.find("ul", class_="pdfs").find_all("a")
aTags

[<a href="files/pdf_1.pdf">1</a>,
 <a href="files/pdf_2.pdf">2</a>,
 <a href="files/pdf_3.pdf">3</a>,
 <a href="files/pdf_4.pdf">4</a>,
 <a href="files/pdf_5.pdf">5</a>,
 <a href="files/pdf_6.pdf">6</a>,
 <a href="files/pdf_7.pdf">7</a>,
 <a href="files/pdf_8.pdf">8</a>,
 <a href="files/pdf_9.pdf">9</a>,
 <a href="files/pdf_10.pdf">10</a>]

In [4]:
# what type
type(aTags)

bs4.element.ResultSet

In [5]:
base_url = "https://sandeepmj.github.io/scrape-example-page/"

In [6]:
# build the full urls with a list comprehension
links = [ base_url + aTag.get("href") for aTag in aTags ]
links

['https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_3.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_4.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_5.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_6.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_7.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_8.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_9.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_10.pdf']
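String concatenation works here because every `href` on the example page is relative to `base_url`. On other sites the links may be root-relative or already absolute. A minimal sketch of an alternative, assuming you want the standard library's `urljoin` to handle those cases; the `hrefs` list and its third, root-relative entry are hypothetical and only illustrate the difference:

```python
from urllib.parse import urljoin

# Same base URL as in the cell above.
base_url = "https://sandeepmj.github.io/scrape-example-page/"

# Hypothetical hrefs: two relative (as on the example page), one root-relative.
hrefs = [
    "files/pdf_1.pdf",
    "files/pdf_2.pdf",
    "/scrape-example-page/files/pdf_3.pdf",
]

# urljoin resolves each href against the base, whatever form it takes.
links = [urljoin(base_url, href) for href in hrefs]
for link in links:
    print(link)
```

With plain concatenation the third entry would come out with a doubled path, so `urljoin` is the safer choice when you have not checked every link by hand.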

In [7]:
# importing new library
import wget

In [8]:
# pulling snoozer timer from class exercise
from random import randrange
import time
def snoozer(start_time, end_time):
    '''
    Pause for a random whole number of seconds between requests.
    start_time = shortest possible pause, in seconds
    end_time = upper bound of the range, in seconds (never reached, because randrange excludes it)
    '''
    timer = randrange(start_time, end_time)
    print(f"Snoozing for {timer} seconds...")
    time.sleep(timer)
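One detail worth checking: `randrange` never returns its upper bound. A quick sketch to confirm this, using the same 10 and 21 that the download cell passes to `snoozer` (the sample size of 10,000 is an arbitrary choice):

```python
from random import randrange

# Draw many samples and collect the distinct values that come back.
samples = {randrange(10, 21) for _ in range(10_000)}

# The largest value can be at most 20; 21 is never produced.
print(min(samples), max(samples))
```

So a range of 10 to 21 means pauses of 10 through 20 seconds.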

In [9]:
# download with timer
link_count = 0
start_range, end_range = 10, 21

for link in links:
    link_count += 1
    print(f"Downloading link {link_count} of {len(links)}.")
    ## downloading docs
    wget.download(link)
    print("")
    snoozer(start_range, end_range)
print(f"Done downloading {link_count} of {len(links)}.")

Downloading link 1 of 10.
100% [..........................................................] 12812 / 12812
Snoozing for 12 seconds...
Downloading link 2 of 10.
100% [..........................................................] 12897 / 12897
Snoozing for 18 seconds...
Downloading link 3 of 10.
100% [..........................................................] 12908 / 12908
Snoozing for 15 seconds...
Downloading link 4 of 10.
100% [..........................................................] 12843 / 12843
Snoozing for 16 seconds...
Downloading link 5 of 10.
100% [..........................................................] 12881 / 12881
Snoozing for 20 seconds...
Downloading link 6 of 10.
100% [..........................................................] 12906 / 12906
Snoozing for 16 seconds...
Downloading link 7 of 10.
100% [..........................................................] 12816 / 12816
Snoozing for 14 seconds...
Downloading link 8 of 10.
100% [.....................................
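If `wget` is not installed, the same download step can be sketched with the standard library alone. This is an assumption-laden alternative, not the method used above: `fetch_bytes`, `save_file`, and the `downloads` folder name are all hypothetical names introduced for illustration.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_bytes(url):
    """Return the raw bytes served at url (plain HTTP GET)."""
    with urlopen(url) as response:
        return response.read()


def save_file(url, folder="downloads", fetch=fetch_bytes):
    """Save url into folder and return the local path.

    Files already on disk are skipped, so a rerun does not fetch them again.
    """
    os.makedirs(folder, exist_ok=True)
    # Name the local file after the last segment of the URL path.
    filename = os.path.basename(urlparse(url).path)
    path = os.path.join(folder, filename)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(fetch(url))
    return path
```

Inside the loop above, `save_file(link)` would take the place of `wget.download(link)`; the `snoozer` call stays as it is.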

## Your turn - Download the first 10 text files.