# Do Now: Scraping

### Imports

In [1]:
import requests
from bs4 import BeautifulSoup

## Question 1: New York Times

I've heard that https://www.nytimes.com is an important paper, but I don't want to subscribe. Can you print out most of the headlines on the page for me?

In [2]:
# lxml - very fast and lenient, but has to be installed separately
# html.parser - built into Python, fine and good and reasonable
# html5lib - very slow, but very very accepting (parses like a browser does)

response = requests.get("https://www.nytimes.com/")
doc = BeautifulSoup(response.text, 'html.parser')

```
<div>
I would say
<p>Hello</p>
</div>
```
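The snippet above is the kind of markup where `.string` gets confusing: a tag that holds both loose text and another tag has no single string, so `.string` comes back as `None`. Here is a small sketch of that behavior, assuming the `html.parser` parser and reusing the snippet as a plain string (the variable names are just for illustration):

```python
from bs4 import BeautifulSoup

# The same markup as the snippet above, as a plain string
html = "<div>\nI would say\n<p>Hello</p>\n</div>"
sample = BeautifulSoup(html, "html.parser")

div = sample.find("div")
p = sample.find("p")

# .string only works when a tag has exactly one child
div_string = div.string  # None, because the div holds text AND a <p>
p_string = p.string      # "Hello"

# .get_text() gathers every piece of text inside the tag
div_text = div.get_text(" ", strip=True)  # "I would say Hello"

print(div_string, p_string, div_text)
```

If a loop prints `None` where you expected a headline, swapping `.string` for `.text` or `.get_text(strip=True)` is usually the first thing to try.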

In [5]:
headlines = doc.find_all('h2')
for headline in headlines:
    print(headline.string)

Your Tuesday Briefing
New York Today
Listen to ‘The Daily’
As U.S. and Iran Face Off, Europe Is Stuck in the Middle
Trump’s Promises on Iran and North Korea Were Easier Said Than Done
Cities Start to Question an American Ideal
Is Trump a Fluke, or the Future?
Conservative Host’s Listeners Don’t Want to Hear Doubts About Trump
Nine Democratic candidates spoke about poverty and systemic racism at a forum.
Racial Slurs Cost a Pro-Gun Parkland Student a Place at Harvard
Facebook Plans Global Financial System Based on Cryptocurrency
Here’s how Libra could work for users.
UNC Hospital Suspends Complex Heart Surgeries on Children
Sarah Huckabee Sanders Wants You to Know She Was Not Impressed
Androgyny Is Now Fashionable in the W.N.B.A.
Harvard’s False Path to Wisdom
Is the Religious Right Privileged?
I’m a Climber, and a Mother, and Doing Great, Thank You
‘There’s Just No Doubt That It Will Change the World’: David Chalmers on V.R. and A.I.
Beijing Is Treading Lightly in Hong Kong, for Now
St

In [12]:
# Two equivalent ways to filter by class; the second is shorter
# headlines = doc.find_all('span', {'class': 'balancedHeadline'})
headlines = doc.find_all('span', class_='balancedHeadline')
for headline in headlines:
    print(headline.string)
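The dictionary form, the `class_` keyword, and a CSS selector should all find the same tags. A minimal sketch to check that, using made-up markup in place of the real page (the HTML and the `sample` variable are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page
html = """
<h2><span class="balancedHeadline">First headline</span></h2>
<h2><span class="balancedHeadline other">Second headline</span></h2>
<h2><span class="kicker">Not a headline</span></h2>
"""
sample = BeautifulSoup(html, "html.parser")

# Three ways of asking for "every span with the class balancedHeadline"
by_dict = sample.find_all("span", {"class": "balancedHeadline"})
by_keyword = sample.find_all("span", class_="balancedHeadline")
by_selector = sample.select("span.balancedHeadline")

texts = [span.get_text(strip=True) for span in by_keyword]
print(texts)
```

A tag with several classes still matches as long as one of them is the class you asked for, which is why the second span is included.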

## Question 2: Nike sales

I need some shoes, but I also love sales! Visit https://store.nike.com/us/en_us/pw/mens-clearance/47Z7pu?ipp=120 and print out the name of every shoe that's on sale.

* **Better:** No space between each shoe name (so just one per line)

In [13]:
response = requests.get('https://store.nike.com/us/en_us/pw/mens-clearance/47Z7pu?ipp=120')
doc = BeautifulSoup(response.text, 'html.parser')

In [14]:
# product-display-name nsg-font-family--base edf-font-size--regular nsg-text--dark-grey

In [21]:
# "CSS selectors"
# .product-display-name means "the class of product-display-name"
#doc.select('.product-display-name')

names = doc.find_all('p', class_='product-display-name')
for name in names:
    print(name.text)

Nike Air Max 720
Nike Air VaporMax 2019
Nike Sportswear
Nike Sportswear Tech Pack
Nike Epic React Flyknit 2
Nike Air Max 98
Nike Air Max 270 SE
Nike Free RN 5.0
Air Jordan 11 Retro Low LE
Air Jordan 4 Retro
Air Jordan 3 Retro TH SP
Air Jordan 1 Retro High OG
Nike Air More Uptempo 720 QS 2
Nike SB Stefan Janoski Max
Golden State Warriors Nike Therma Flex Showtime
Jordan Wings Classics
Nike Zoom Fly Flyknit
Nike Zoom Fly
Nike Air Max 270
Nike Air Max Dia
LeBron 16
Air Jordan 9 Retro
Nike Sportswear Tech Fleece
Nike Sportswear Tech Fleece
Nike Zonal Cooling TW
Nike Sportswear Tech Fleece
Nike Air Max 97 QS
Nike Air Max 97
Nike Air VaporMax Plus
Nike Free RN Flyknit 3.0
Nike React Element 55
Kyrie 5
LeBron Soldier 12 SFG
Nike Therma Elite
Kevin Durant Earned City Edition Swingman (Golden State Warriors)
Jordan Jumpman Air
Nike Therma Flex Showtime
Nike Pro Tech Pack
Nike Therma Sphere Tech Pack
Nike Tech Pack
Nike Dri-FIT
Nike Dri-FIT
Nike Challenger
Nike Challenger
Air Jordan Legacy 312
N

In [25]:
# "CSS selectors"
# .product-display-name means "the class of product-display-name"
#doc.select('.product-display-name')

# Find all of the big grid things that the shoes are inside of
items = doc.find_all('div', class_='grid-item')
for item in items:
    # Is 'shoe' in the URL that it seems to be hiding? If so...
    if 'shoe' in item['data-pdpurl']:
        # find the 'product-display-name' and print it out
        name = item.find('p', class_='product-display-name')
        print(name.text)

Nike Air Max 720
Nike Air VaporMax 2019
Nike Epic React Flyknit 2
Nike Air Max 98
Nike Air Max 270 SE
Nike Free RN 5.0
Air Jordan 11 Retro Low LE
Air Jordan 4 Retro
Air Jordan 3 Retro TH SP
Air Jordan 1 Retro High OG
Nike Air More Uptempo 720 QS 2
Nike SB Stefan Janoski Max
Nike Zoom Fly Flyknit
Nike Zoom Fly
Nike Air Max 270
Nike Air Max Dia
LeBron 16
Air Jordan 9 Retro
Nike Air Max 97 QS
Nike Air Max 97
Nike Air VaporMax Plus
Nike Free RN Flyknit 3.0
Nike React Element 55
Kyrie 5
LeBron Soldier 12 SFG
Air Jordan Legacy 312
Nike Air Max Plus Premium
Kyrie 5 x ROKIT All Star
Nike Zoom Pegasus Turbo
Nike Metcon Flyknit 3
Nike Free RN 2018
Nike Free RN Flyknit 2018
Nike Mercurial Vapor XII Academy Neymar Jr IC
Nike Zoom Fly Flyknit
Nike Revolution 4
Nike Odyssey React Flyknit 2
Nike Air Skylon II
Nike EXP-X14
Nike Air Wildwood ACG
Nike Okwahn II
Nike Air Max 270 Bowfin
Nike Air Max 97 Realtree®
Nike Air Max 270 Premium
Nike Air Zoom Alpha
Nike Zoom Stefan Janoski
Nike SB Blazer Zoom Low
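One thing to watch with `item['data-pdpurl']`: it raises a `KeyError` if a grid item happens not to have that attribute. A slightly more defensive sketch, assuming made-up grid markup (the real page's structure may differ), uses `.get()` with a default and strips whitespace so each name sits on its own line:

```python
from bs4 import BeautifulSoup

# Hypothetical grid markup; the real page's structure may differ
html = """
<div class="grid-item" data-pdpurl="/t/air-max-shoe-abc">
  <p class="product-display-name"> Nike Air Max 97 </p>
</div>
<div class="grid-item" data-pdpurl="/t/tech-fleece-hoodie">
  <p class="product-display-name">Nike Sportswear Tech Fleece</p>
</div>
<div class="grid-item">
  <p class="product-display-name">Mystery item</p>
</div>
"""
sample = BeautifulSoup(html, "html.parser")

shoe_names = []
for item in sample.find_all("div", class_="grid-item"):
    # .get() returns the default instead of raising when the attribute is missing
    url = item.get("data-pdpurl", "")
    name = item.find("p", class_="product-display-name")
    if "shoe" in url and name is not None:
        # strip=True removes the surrounding whitespace and newlines
        shoe_names.append(name.get_text(strip=True))

print(shoe_names)
```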


## Question 3: School Board Minutes

Visit http://www.marylandpublicschools.org/stateboard/Pages/Meetings-2018.aspx and get a list of the school board minutes.

* **Good:** Print out each link's text, and the URL it goes to
* **Better:** I only want the meeting agendas and minutes
* **Best:** I only want the minutes (the PDFs)
* *Bonus: Save it to a list instead of printing it out*
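For the "Best" and "Bonus" levels, one possible approach is to keep only the links whose `href` ends in `.pdf` and append them to a list. The sketch below assumes a tiny made-up table (the two paths are copied from the kind of links the page returns) and uses `urljoin` to turn the relative paths into full URLs; `BASE_URL`, `sample`, and `minutes` are names chosen for illustration:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://www.marylandpublicschools.org/stateboard/Pages/Meetings-2018.aspx"

# Hypothetical table standing in for the real page
html = """
<table>
  <tr>
    <td><a href="/stateboard/Pages/meeting-agendas/2018-01-29.aspx">January 29, 2018 Board Agenda</a></td>
    <td><a href="/stateboard/Documents/minutes/2018/January292018.pdf">January 29, 2018 Board Minutes</a></td>
  </tr>
  <tr><td><a>Link without an href</a></td></tr>
</table>
"""
sample = BeautifulSoup(html, "html.parser")

minutes = []
# href=True skips any <a> tag that has no href attribute at all
for link in sample.find_all("a", href=True):
    href = link["href"]
    if href.lower().endswith(".pdf"):
        minutes.append({
            "title": link.get_text(strip=True),
            # Resolve the relative path against the page's address
            "url": urljoin(BASE_URL, href),
        })

print(minutes)
```

Filtering on the URL instead of the link text avoids depending on how consistently the page spells "Board Minutes".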

In [30]:
response = requests.get('http://www.marylandpublicschools.org/stateboard/Pages/Meetings-2018.aspx')
doc = BeautifulSoup(response.text, 'html.parser')

In [37]:
# Option 1: find the big div that has the table of minutes inside of it
table = doc.find('div', id='ctl00_PlaceHolderMain_ctl02__ControlWrapper_RichHtmlField')

# Option 2 (this one replaces the result above):
# right-click the element in the inspector, Copy > Copy selector,
# then paste whatever it gives you into doc.select_one
# select_one only finds ONE thing, like a submit button or a table or whatever
table = doc.select_one('#ctl00_PlaceHolderMain_ctl02__ControlWrapper_RichHtmlField > table')
# Now find all of the links inside of that table (table.find_all, not doc.find_all)
links = table.find_all('a')

for link in links:
    url = link['href']
    print(link.text, url)

January 29, 2018 Board Agenda /stateboard/Pages/meeting-agendas/2018-01-29.aspx
January 29, 2018Board Minutes​ /stateboard/Documents/minutes/2018/January292018.pdf
January 30, 2018 Board Agenda /stateboard/Pages/meeting-agendas/2018-01-30.aspx
 /stateboard/Pages/meeting-agendas/2016-02-11.aspx
January 30, 2018Board Minutes​ /stateboard/Documents/minutes/2018/January302018.pdf
February 27, 2018 Board Agenda​ /stateboard/Pages/meeting-agendas/2018-02-27.aspx
February 27, 2018 Board Minutes​ /stateboard/Documents/minutes/2018/February272018.pdf
March 20, 2018Board Agenda​​ /stateboard/Pages/meeting-agendas/2018-03-20.aspx
March 20, 2018Board Minutes /stateboard/Documents/minutes/2018/March202018.pdf
April 24, 2018Board Agenda​​ /stateboard/Pages/meeting-agendas/2018-04-24.aspx
April 24, 2018Board Minutes /stateboard/Documents/minutes/2018/April242018.pdf
May 22, 2018Board Agenda​​ /stateboard/Pages/meeting-agendas/2018-05-22.aspx
May 22, 2018Board Minutes /stateboard/Documents/minutes/201