# Scraping Files from Websites 

### You need to create a data set that tracks how many companies the <a href="https://www.sec.gov/litigation/suspensions.shtml">SEC suspended</a> between 1999 and 2019. You find the data at:

```https://www.sec.gov/litigation/suspensions.shtml```



### We want to write a scraper that aggregates:

* Date of suspension
* Company name
* Order
* Release (the PDFs in the XX-YYYYY format)

# The Challenge?

### Details are actually in PDFs!

# Demo downloading files from websites 

There are ```txt``` and ```pdf``` files on:

```https://sandeepmj.github.io/scrape-example-page/pages.html```

Do the following:

1. Download all ```txt``` files.
2. Download all ```pdf``` files.
3. Download all files at once.
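
The first two tasks come down to filtering links by file extension. Here is a minimal sketch, assuming the ```href``` values have already been pulled off the page. ```filter_by_extension``` is a hypothetical helper, not part of the notebook, and the ```.pdf``` names are invented for illustration:

```python
def filter_by_extension(hrefs, extension):
    """Return only the hrefs that end with the given extension (case-insensitive)."""
    return [href for href in hrefs if href.lower().endswith(extension.lower())]


# Illustrative hrefs; the .pdf names are invented for this sketch
hrefs = [
    "files/text_doc_01.txt",
    "files/text_doc_02.txt",
    "files/pdf_doc_01.pdf",
    "files/pdf_doc_02.PDF",
    "",  # empty href, like the junk links on the page
]

txt_links = filter_by_extension(hrefs, ".txt")
pdf_links = filter_by_extension(hrefs, ".pdf")
print(txt_links)
print(pdf_links)
```

Lower-casing both sides means a link ending in ```.PDF``` is still caught, and empty ```href``` values simply drop out.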

In [8]:
# import libraries
from bs4 import BeautifulSoup  ## scrape info from web pages
import requests ## get web pages from server
import time # time is required. we will use its sleep function
from random import randrange # generate random numbers

# from google.colab import files ## code for downloading in google colab

### Create function to handle our initial requests

In [11]:
## write function here
def mkRequest(url):
    '''
    Takes a provided url and returns the requested response,
    or None if the request fails
    '''
    response = requests.get(url)
    if 200 <= response.status_code < 400:  # check that the page responded and can be scraped
        return response
    else:
        print(f"request returned {response.status_code} error")
        return None
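
A request can also fail before any status code comes back, for example on a timeout or a dropped connection. The sketch below is one possible stricter variant, not the notebook's function. It relies on ```timeout``` and ```raise_for_status()``` from ```requests```. The name ```mk_request_strict``` and the ```get``` parameter are assumptions added so the function can be tried without a network:

```python
import requests


def mk_request_strict(url, timeout=10, get=requests.get):
    """Return the response, or None if the request fails or returns an error status."""
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError on 4xx/5xx status codes
        return response
    except requests.RequestException as error:
        # Covers timeouts, connection errors and HTTP error statuses
        print(f"request failed: {error}")
        return None
```

Called as ```mk_request_strict(url)```, it behaves like ```mkRequest``` for a healthy page but returns ```None``` instead of raising when the server cannot be reached.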

In [12]:
# url to scrape
url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

In [16]:
## call the function
response = mkRequest(url)
response

<Response [200]>

## Turn page into soup

In [20]:
## create function to create soup
def mkSoup (response):
    '''
    Make soup
    '''
    return BeautifulSoup (response.text, "html.parser")

In [21]:
## call the function
soup = mkSoup(response)
soup

<html lang="en">
<head>
<!-- Makes the page responsive and scaled to be read easily -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- Links to stylesheet -->
<link href="style.css" rel="stylesheet" type="text/css"/>
<!-- Remember to update page title -->
<title>List of Documents</title>
</head>
<body>
<!-- All content goes here -->
<div class="container">
<h1>Documents to Download</h1>
<li>Junk Li <a href="">tag 1</a></li>
<li>Junk Li <a href="">tag 2</a></li>
<ul class="txts downloadable">
<p class="pages">Download this first set of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.

In [31]:
# MC function (master of ceremonies, Sandeep's informal name for it)
def scraper (url):
    '''
    Enter url to return soup of page
    '''
    return mkSoup(mkRequest(url))

In [32]:
soup = scraper(url)
soup

<html lang="en">
<head>
<!-- Makes the page responsive and scaled to be read easily -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- Links to stylesheet -->
<link href="style.css" rel="stylesheet" type="text/css"/>
<!-- Remember to update page title -->
<title>List of Documents</title>
</head>
<body>
<!-- All content goes here -->
<div class="container">
<h1>Documents to Download</h1>
<li>Junk Li <a href="">tag 1</a></li>
<li>Junk Li <a href="">tag 2</a></li>
<ul class="txts downloadable">
<p class="pages">Download this first set of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.

## Find all txt files

In [49]:
## save the a tags from the first txts list in aTags
aTags = soup.find("ul", class_ = "txts").find_all("a")
aTags

[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>]
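
Note that ```class_ = "txts"``` matches even though the list's full class attribute is ```txts downloadable```. Beautiful Soup treats ```class``` as a multi-valued attribute and matches on any single value. A small sketch on an inline snippet copied from the page (```soup_demo``` and ```first_list``` are names made up for this example):

```python
from bs4 import BeautifulSoup

html = """
<li>Junk Li <a href="">tag 1</a></li>
<ul class="txts downloadable">
<li>Text Document <a href="files/text_doc_01.txt">1</a></li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
</ul>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# class_="txts" matches one value of the multi-valued class attribute
first_list = soup_demo.find("ul", class_="txts")
hrefs = [a.get("href") for a in first_list.find_all("a")]

print(first_list.get("class"))
print(hrefs)
```

Searching inside the ```ul``` rather than the whole page is also what keeps the junk ```li``` links out of the result.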

In [None]:
## what type
type(aTags)


## Find all the ```a``` tags 

In [52]:
## target a tags
aTags

[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>]

In [59]:
## base url
base_url = "https://sandeepmj.github.io/scrape-example-page/"

In [60]:
## build the full URLs from the href values (list comprehension)
links = [base_url + aTag.get("href") for aTag in aTags]
links

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt']

In [None]:
## save without html using list comprehension

## What is missing from the URLs?

## Create a list of the full URLs

Without all the ```html```
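
The ```href``` values on this page are relative, so each one needs the site's base in front of it. As an alternative to joining strings by hand, here is a sketch using ```urljoin``` from the standard library, which resolves a relative link against the page it was found on (```page_url``` and ```full_links``` are names chosen for this example):

```python
from urllib.parse import urljoin

page_url = "https://sandeepmj.github.io/scrape-example-page/pages.html"
hrefs = ["files/text_doc_01.txt", "files/text_doc_02.txt"]

# urljoin drops "pages.html" and appends the relative path
full_links = [urljoin(page_url, href) for href in hrefs]
print(full_links)
```

This avoids having to type the base URL separately and handles a missing or doubled slash.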

In [61]:
## lc


## Download all the ```txt``` documents

Install ```wget```, a small Python package for downloading files from links:

In [62]:
pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9680 sha256=a1d48fa3ee14711490e498397cefdece88d7c8efe75559e9fa7d0d900d70081f
  Stored in directory: /Users/patxiuranga/Library/Caches/pip/wheels/bd/a8/c3/3cf2c14a1837a4e04bd98631724e81f33f462d86a1d895fae0
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
Note: you may need to restart the kernel to use updated packages.


In [64]:
## import wget
import wget

In [67]:
## make timer function
def snoozer (start_range, end_range):
    snooze_time = randrange(start_range,end_range)
    print(f" \n Snoozing for {snooze_time} seconds")
    return time.sleep(snooze_time) # without this line, the program does not pause

In [69]:
snoozer(5,10)

 
 Snoozing for 5 seconds
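
One detail worth knowing: ```randrange(start, end)``` never returns ```end``` itself, so ```snoozer(5,10)``` pauses for 5 to 9 seconds. The sketch below checks this. ```snoozer_demo``` and its ```sleep``` parameter are assumptions added here so the check runs instantly instead of actually sleeping:

```python
import time
from random import randrange


def snoozer_demo(start_range, end_range, sleep=time.sleep):
    """Pause for a random whole number of seconds and return the number chosen."""
    snooze_time = randrange(start_range, end_range)  # end_range itself is never picked
    sleep(snooze_time)
    return snooze_time


# Pass a do-nothing sleep so 1000 draws finish immediately
picks = {snoozer_demo(10, 21, sleep=lambda seconds: None) for _ in range(1000)}
print(min(picks), max(picks))
```

To allow pauses of up to 20 seconds, the end of the range has to be 21.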


In [84]:
## download with timer
link_count = 1
start_range, end_range = 10, 21
for link in links:
    print(f"Downloading link {link_count} of {len(links)}")
    link_count +=1
    wget.download(link)
    snoozer(start_range, end_range)

Downloading link 1 of 10
 
 Snoozing for 14 seconds
Downloading link 2 of 10
 
 Snoozing for 17 seconds
Downloading link 3 of 10
 
 Snoozing for 12 seconds
Downloading link 4 of 10
 
 Snoozing for 17 seconds
Downloading link 5 of 10
 
 Snoozing for 11 seconds
Downloading link 6 of 10
 
 Snoozing for 10 seconds
Downloading link 7 of 10
 
 Snoozing for 10 seconds
Downloading link 8 of 10
 
 Snoozing for 20 seconds
Downloading link 9 of 10
 
 Snoozing for 15 seconds
Downloading link 10 of 10
 
 Snoozing for 16 seconds
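
The manual ```link_count``` counter can be replaced with ```enumerate```. This is a sketch of the same loop wrapped in a function. ```download_all```, ```fetch``` and ```pause``` are hypothetical names, with stand-ins used below so it runs without touching the network:

```python
def download_all(links, fetch, pause):
    """Call fetch on each link in order, pausing between downloads."""
    total = len(links)
    for link_count, link in enumerate(links, start=1):
        print(f"Downloading link {link_count} of {total}")
        fetch(link)
        if link_count < total:  # no need to wait after the last file
            pause()


# Stand-ins: record the links instead of downloading, and skip the pause
downloaded = []
demo_links = [
    "https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt",
    "https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt",
]
download_all(demo_links, fetch=downloaded.append, pause=lambda: None)
print(downloaded)
```

In the notebook this would be called as ```download_all(links, wget.download, lambda: snoozer(10, 21))```.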


In [88]:
#find all text files


all_text = soup.find_all("ul", class_="txts")

In [104]:
aTag_list = []
for atag in all_text:
    aTag_list.append(atag.find_all("a"))


In [105]:
aTag_list # They are nested (a list inside a list). Flat is better than nested, so we have to flatten it.

[[<a href="files/text_doc_01.txt">1</a>,
  <a href="files/text_doc_02.txt">2</a>,
  <a href="files/text_doc_03.txt">3</a>,
  <a href="files/text_doc_04.txt">4</a>,
  <a href="files/text_doc_05.txt">5</a>,
  <a href="files/text_doc_06.txt">6</a>,
  <a href="files/text_doc_07.txt">7</a>,
  <a href="files/text_doc_08.txt">8</a>,
  <a href="files/text_doc_09.txt">9</a>,
  <a href="files/text_doc_10.txt">10</a>],
 [<a href="files/text_doc_A.txt">1</a>,
  <a href="files/text_doc_B.txt">2</a>,
  <a href="files/text_doc_C.txt">3</a>,
  <a href="files/text_doc_D.txt">4</a>,
  <a href="files/text_doc_E.txt">5</a>,
  <a href="files/text_doc_F.txt">6</a>,
  <a href="files/text_doc_G.txt">7</a>,
  <a href="files/text_doc_H.txt">8</a>,
  <a href="files/text_doc_I.txt">9</a>,
  <a href="files/text_doc_J.txt">10</a>]]
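
The nesting comes from calling ```find_all("a")``` once per ```ul```. One way to avoid it, sketched here on an inline snippet, is a CSS selector with ```select```, which returns every matching ```a``` tag in a single flat list. The snippet assumes the second list carries the same ```txts downloadable``` classes as the first, which the output above does not show:

```python
from bs4 import BeautifulSoup

# Assumption: both lists use class="txts downloadable"
html = """
<ul class="txts downloadable">
<li>Text Document <a href="files/text_doc_01.txt">1</a></li>
</ul>
<ul class="txts downloadable">
<li>Text Document <a href="files/text_doc_A.txt">1</a></li>
</ul>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# "ul.txts a" means: every a tag inside any ul that has the class txts
flat_tags = soup_demo.select("ul.txts a")
hrefs = [tag.get("href") for tag in flat_tags]
print(hrefs)
```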

In [107]:
# This is how I flatten it
flat_list = []
for sub_list in aTag_list:
    for item in sub_list:
        flat_list.append(item)
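
The same nested loop can be written as a single list comprehension. A minimal sketch, using plain strings in place of the tags (```nested``` and ```flat``` are names chosen for this example):

```python
nested = [
    ["files/text_doc_01.txt", "files/text_doc_02.txt"],
    ["files/text_doc_A.txt", "files/text_doc_B.txt"],
]

# Read left to right like the nested for loops: outer loop first, then inner
flat = [item for sub_list in nested for item in sub_list]
print(flat)
```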

In [108]:
flat_list

[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>,
 <a href="files/text_doc_A.txt">1</a>,
 <a href="files/text_doc_B.txt">2</a>,
 <a href="files/text_doc_C.txt">3</a>,
 <a href="files/text_doc_D.txt">4</a>,
 <a href="files/text_doc_E.txt">5</a>,
 <a href="files/text_doc_F.txt">6</a>,
 <a href="files/text_doc_G.txt">7</a>,
 <a href="files/text_doc_H.txt">8</a>,
 <a href="files/text_doc_I.txt">9</a>,
 <a href="files/text_doc_J.txt">10</a>]

In [110]:
# with itertools
import itertools

In [120]:
html_links = list(itertools.chain(*aTag_list))
html_links
# The asterisk unpacks the list so each sub-list is passed to chain as a separate argument (and chained together)

[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>,
 <a href="files/text_doc_A.txt">1</a>,
 <a href="files/text_doc_B.txt">2</a>,
 <a href="files/text_doc_C.txt">3</a>,
 <a href="files/text_doc_D.txt">4</a>,
 <a href="files/text_doc_E.txt">5</a>,
 <a href="files/text_doc_F.txt">6</a>,
 <a href="files/text_doc_G.txt">7</a>,
 <a href="files/text_doc_H.txt">8</a>,
 <a href="files/text_doc_I.txt">9</a>,
 <a href="files/text_doc_J.txt">10</a>]