# Intro to Scraping

## Part 1: BeautifulSoup

BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

In [None]:
# Imports
# If this isn't working, uncomment the following line to install (this is not the recommended way)
# !pip install beautifulsoup4
from bs4 import BeautifulSoup
import pandas as pd

# Import the requests library, which underlies most of this tutorial
# You don't actually need to know much more than requests.get(url) though # https://2.python-requests.org/en/v2.5.3/user/advanced/
import requests 

The first thing to do is to get the HTML of the page we are trying to scrape, in this case Worker Adjustment and Retraining Notification data from the NYC department of labor.
https://labor.ny.gov/app/warn/

In [None]:
# Identify a target URL, and fetch the HTML 
warn_url = "https://labor.ny.gov/app/warn/"
response = requests.get(warn_url)
print(response.status_code)

We want a 200 (success) status code returned from the request. If you don't see this, you might not be connected to the internet

In [None]:
# Assuming we only want 2020 data
warn_2020_url = "https://labor.ny.gov/app/warn/default.asp?warnYr=2020"
response = requests.get(warn_url)

### Create a soup object to help parse the html

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library that helps parse HTML files.

Pass the HTML we requested from the WARN site to BeautifulSoup to create a 'soup' object for the page.


In [None]:
soup = BeautifulSoup(response.text, 'html.parser')
# print(soup.prettify())

### Looking at the WARN page

[](img/warn.png)

![WARN](img/warn.png)

This table looks like the thing we want to parse. We need a list of each link to a WARN record. Let's extract this list.


### Quick HTML Syntax Review
![HTML tag syntax](img/html-tag.png)

### Get the table with .find()

[find()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) is the simplest of BeautifulSoup's HTML parsing methods. It will return the first HTML element that matches your query, in this case a Tag Name. Because there is only one table on the WARN page, this will work fine for us.

In [None]:
# Get the table
table = soup.find('table')
print('1', type(table))

# This is does same thing, but the syntax is a little more simple
table = soup.table
print('2', type(table))

### Look at the items

In [None]:
# Here's one way to do it
for item in table:
    print(item)

### find_all()

find_all() is identical to find(), except it will return a list of ALL occurances of your query.

We call find_all() directly on the table we've already extracted, rather than the soup object. This way, we only find matches that are in the table, not other links on the page.

In [None]:
# find_all() called on the table. find_all() returns a list
link_tags = table.find_all('a')
print('number of links:', len(link_tags))

# Take a look
for link in link_tags:
    print(link.prettify())

### Accessing Attributes

We want to look at each WARN record, so we need to get the url for each page. We can do this by accessing the href attribute from each 'a' (hyperlink) tag.

In [None]:
# Create a list of links
links = []

# Iterate over each <a>
for link in table.find_all('a'):
    # Use the bracket notation to get an attribute from a tag
    # print(link['href'])
    
    # add the current link text to the list
    links.append(warn_url + link['href'])

In [None]:
for link in links:
    print(link)

In [None]:
import time
import gzip
import unicodedata

# Create a list to store the data we scrape.
# Each item in the list will correspond to a single WARN listing
# Each column will be a piece of single labeled piece information from the listing
data = []

def scrape_single_page(url):
    
    # Print the current URL that we're scraping
    print('scraping', url, end='\r')
    
    # Create a dictionary to store the data for a single WARN listing
    page_data = {}
    
    # Fetch the page
    response = requests.get(url)
    
    # This is pretty atypical. Some of these pages are not being unzipped correctly
    # If the request didn't automatically unzip the html, we have to do it ourselves
    # We can check the encoding that the requests library thinks the page has returned
    if response.apparent_encoding is None:
        html = gzip.decompress(response.content).decode('utf-8')
    else:
        html = response.text
        
    # Remove non-breaking space characters in the HTML
    html = html.replace('&nbsp;', ' ')
    
    # Make the soup for the single page
    page_soup = BeautifulSoup(html, 'html.parser')
    
    # Sanity check
    # print(page_soup.prettify())
    
    # Get the first table (there should only be one)
    table = page_soup.table
    
    # Get each paragraph tag
    paragraphs = table.find_all('p')
    
    # Use .text to get the inner text for each p
    for paragraph in paragraphs:
        # print(paragraph.text)
        text = paragraph.text
        
        # We are going to split on only the first colon in each row (':') by using text.split(delim, 1)
        split_text = text.split(':', 1)
        
        if len(split_text) == 2:
            # Add this paragraph to the page data
            page_data[split_text[0]] = split_text[1]
        
        
    # After looping through each paragraph, add this listing to the DataFrame
    data.append(page_data)

for link in links:
    scrape_single_page(link)
    
    # This is the most important line in the entire notebook
    # This line ensures that you won't crash servers and make peoople come knocking on your door
    # Don't be an idiot!
    time.sleep(0.25)

In [None]:
# Create a pandas DataFrame to store the data we scrape.
# Each row in the dataframe will correspond to a single WARN listing
# Each column will be a piece of single labeled piece information from the listing
data = []

def scrape_single_page(url):
    
    # Create a dictionary to store the data for a single WARN listing
    page_data = {}
    
    # Fetch the page
    response = requests.get(url)
    page_soup = BeautifulSoup(response.text, 'html.parser')
    
    # Sanity check
    # print(page_soup.prettify())
    
    # Get the first/only table
    table = page_soup.table
    
    # Get each paragraph tag
    paragraphs = table.find_all('p')
    
    # Use .text to get the inner text for each p
    for paragraph in paragraphs:
        # print(paragraph.text)
        text = paragraph.text
        print(text)
        
        # We are going to split on only the first colon in each row (':') by using text.split(delimeter, 1)
        split_text = text.split(':', 1)
        
        print(split_text)
        
        # Add this paragraph to the page data
        page_data[split_text[0]] = split_text[1]
        
    # After looping through each paragraph, add this listing to the DataFrame
    data.append(page_data)

scrape_single_page(links[0])

In [None]:
import time
import gzip

# This is a simple progress bar library. Nice for long tasks
! pip install tqdm
from tqdm import tqdm

def get_warn_page(url):
    
    ## Print the current URL that we're scraping
    # print('scraping', url, end='\r')
    
    # Fetch the page
    response = requests.get(url)
    
    # This is pretty atypical. Some of these pages are not being unzipped correctly
    # If the request didn't automatically unzip the html, we have to do it ourselves
    # We can check the encoding that the requests library thinks the page has returned
    if response.apparent_encoding is None:
        html = gzip.decompress(response.content).decode('utf-8')
    else:
        html = response.text
        
    # Remove non-breaking space characters in the HTML
    html = html.replace('&nbsp;', ' ')
    
    return html
    

warn_pages = []

for link in tqdm(links):
    html = get_warn_page(link)
    warn_pages.append(html)
    
    # This is the most important line in the entire notebook
    # This line ensures that you won't crash a server. 
    # It's necessary whenever you are scraping in a loop
    time.sleep(0.1)
    
print('done fetching data')

In [68]:
print(len(warn_pages))

416


In [152]:
import re

def parse_single_page(html, url):
    # Create a dictionary to store the data for a single WARN listing
    page_data = {'URL': url}
    
    # Make the soup for the single page
    page_soup = BeautifulSoup(html, 'html.parser')
    
    # Sanity check
    # print(page_soup.prettify())
    
    # Get the first table (there should only be one)
    table = page_soup.table
    
    # Get all text in the table
    # get_text() will get all text in all elements underneath the current element, 
    # in this case all of the text in the <p> tags
    table_text = table.get_text()
    
    # Use a regular expression to throw away some extra info we don't care about
    table_text = re.split('(?:Additional|Other|Location).*', table_text)[0]
    
    # Split the text into a list of lines with the newline character '\n'
    lines = table_text.split('\n')
    
    for line in lines:
        line = line.strip()
        line = line.replace('  ', ' ')
        line = line.replace('Dates', 'Date')
        line = line.replace('Counties', 'County')
        split_text = line.split(':', 1)
        # print('split', split_text)
        
        if len(split_text) == 2:
            
            key = split_text[0]
            value = split_text[1]
            
            # https://docs.python.org/3/library/re.html
#             if re.match('(?:Additional|Other|Location).*', key):
#                 page_data['Other sites'] = value
#             elif re.match('[0-9]{4}-[0-9]{4}', key):
#                 page_data['Other sites'] = value
#             else:
                
            page_data[key] = value
    
    return page_data


# Create a list to store the data we scrape.
# Each item in the list will correspond to a single WARN listing
# Each column will be a piece of single labeled piece information from the listing
data = []

for html_page, link in zip(warn_pages, links):
    page_data = parse_single_page(html_page, link)
    data.append(page_data)
    
print('done parsing html')

done parsing html


In [154]:
print(len(data))
df = pd.DataFrame(data)
# df[df['2019-0247']]

# df[df['New York City'].notna()]
# for row in df:
#     print(row)

416


In [155]:
df

Unnamed: 0,URL,Date of Notice,Event Number,Rapid Response Specialist,Reason Stated for Filing,Company,County,Contact,Phone,Business Type,Number Affected,Total Employees,Layoff Date,Closing Date,Reason for Dislocation,FEIN NUM,Union,Classification
0,https://labor.ny.gov/app/warn/details.asp?id=7417,3/23/2020,2019-0671,Stuart Goldberg,Temporary Plant Layoff,DL1961 Premium Denim Inc. 121 Varick Str...,New York | WDB Name: NEW YORK CITY | Region: ...,Rati Bhandari,(646) 514-9738,Denim Retailer,37,-----,3/23/2020,-----,Unforeseeable business circumstances prompted...,-----,The employees are not represented by a union,Temporary Plant Layoff
1,https://labor.ny.gov/app/warn/details.asp?id=7418,3/23/2020,2019-0670,Stuart Goldberg,Plant Closing,"Citizen Watch Company of America, Inc. 5...",Queens | WDB Name: NEW YORK CITY | Region: Ne...,"Seth Presser, Vice President, Legal",(212) 497-9795,Watch Manufacturing,42,42,6/21/2020,6/21/2020,Unforeseeable business circumstances prompted...,-----,Independent Production Maintenance and Servic...,Plant Closing
2,https://labor.ny.gov/app/warn/details.asp?id=7419,3/25/2020,2019-0668,Stuart Goldberg,Temporary Plant Layoff,Indochino Apparel (US) Inc. 424 Broome S...,New York/Kings | WDB Name: NEW YORK CITY | Re...,"Ryan Mann, Manager, People and Culture - Oper...",(778) 945-2172 Ext: 808,Retail,56,-----,3/25/2020,-----,Unforeseeable business circumstances prompted...,-----,The employees are not represented by a union,Temporary Plant Layoff
3,https://labor.ny.gov/app/warn/details.asp?id=7420,3/16/2020,2019-0699,Stuart Goldberg,Temporary Plant Closing,Howard Beach Fitness Center dba Limitles...,Queens | WDB Name: NEW YORK CITY | Region: Ne...,"Joseph Ponte, Operations Manager",(718) 845-4653,Gym,-----,-----,3/16/2020,3/16/2020,Unforeseeable business circumstances prompted...,-----,The employees are not represented by a union,Temporary Plant Closing
4,https://labor.ny.gov/app/warn/details.asp?id=7421,3/23/2020,2019-0697,Stuart Goldberg,Temporary Plant Layoff,Fitzpatrick Grand Central Hotel 141 East...,New York | WDB Name: NEW YORK CITY | Region: ...,"Tony Ruscitto, Director of Human Resources",(212) 784-2566,Hotel,57,-----,3/20/2020,-----,Unforeseeable business circumstances prompted...,-----,"New York Hotel & Motel Trades Council, AFL-CIO",Temporary Plant Layoff
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
411,https://labor.ny.gov/app/warn/details.asp?id=7053,1/6/2020,2019-0207,Frederick Danks,Plant Closing,Macy's Broadway Mall Store (Macy's Retail Hold...,Nassau | WDB Name: OYSTER BAY | Region: Long ...,"Heath R. Salit, Human Resources Business Partner",(646) 429-7462,Retail Store,155,155,Macy's Broadway Mall Store employee separatio...,"April 19, 2020; June 29, 2020",Economic,43-0398035,The employees are not represented by a union.,Plant Closing
412,https://labor.ny.gov/app/warn/details.asp?id=7052,12/27/2019,2019-0206,Regenna Darrah,Temporary Plant Closing,Wesley Gardens Nursing Home 3 Upton Park Roche...,Monroe | WDB Name: MONROE | Region: Finger Lakes,"Sharon Davis, Human Resources Manager",(585) 241-2105,Nursing Home,132,132,"Beginning on 12/31/2019, with all affected em...",12/27/2019,Due to a water line break,22-3139841,1199 SEIU,Temporary Plant Closing
413,https://labor.ny.gov/app/warn/details.asp?id=7051,12/30/2019,2019-0205,Stuart Goldberg,Plant Closing,"127 W. 43rd St. Chophouse, Inc. (Heartla...",New York | WDB Name: NEW YORK CITY | Region: ...,"Jon Bloostein, Chief Executive Officer",(917) 999-6532,Restaurant,106,106,The layoffs are expected to commence on March...,3/31/2020,Economic,13-4141784,The employees are not represented by a union.,Plant Closing
414,https://labor.ny.gov/app/warn/details.asp?id=7050,12/30/2019,2019-0201,"Jacqueline Huertas, Karl Price, Regenna Darra...",Plant Closing,"New York Express and Logistics, LLC 292 Wolf R...",Albany | WDB Name: CAPITAL DISTRICT | Region:...,"Chris Kalavantis, Operations Manager",(617) 968-5311,Trucking company providing freight transporta...,48,48,3/31/2020,3/31/2020,Contract between New York Express and Logisti...,47-2136557,The employees are not represented by a union.,Plant Closing


In [80]:
!pip install openpyxl
df.to_excel("output.xlsx")  
df.to_csv("output.csv")



In [None]:
! open ouptut.xlsx