## 12-Web-Scraping-and-Document-Databases - Day 2 - Web Scraping

### Class Objectives

* Use Beautiful Soup to scrape your own data from the web.
* Save the results of web scraping into MongoDB.

### Presentation:
* [Web Scraping](https://ucb.bootcampcontent.com/UCB-Coding-Bootcamp/ucb-bel-data-pt-10-2020-u-c/blob/master/01-Lesson-Plans/12-Web-Scraping-and-Document-Databases/Slideshows/Data-12.2-Web_Scraping.pdf)


### Resources:
* [Splinter Chrome WebDriver setup](https://splinter.readthedocs.io/en/latest/drivers/chrome.html)
* [Python Requests](http://docs.python-requests.org/en/master/)
* [Web scraping with BeautifulSoup](https://www.dataquest.io/blog/web-scraping-tutorial-python/)
* [Beautiful Soup intro](https://www.pythonforbeginners.com/beautifulsoup/scraping-websites-with-beautifulsoup)
* [Python Splinter](https://splinter.readthedocs.io/en/latest/)


### Install
* Beautiful Soup `[sudo] pip install bs4`
* Selenium `[sudo] pip install selenium`
* html5lib `[sudo] pip install html5lib`
* Splinter `[sudo] pip install splinter`
* Webdriver Manager `[sudo] pip install webdriver-manager`

# ==========================================

### 2.01 Instructor Do: Introduction to Beautiful Soup (0:10)

In [26]:
# Dependencies
from bs4 import BeautifulSoup as bs

In [27]:
html_string = """
<html>
<head>
  <title>
     A Simple HTML Document
  </title>
</head>
  <body>
    <p>This is a very simple HTML document</p>
    <p>It only has two paragraphs</p>
  </body>
</html>
"""

In [29]:
# Create a Beautiful Soup object
soup = bs(html_string, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [30]:
# Print formatted version of the soup
print(soup.prettify())

<html>
 <head>
  <title>
   A Simple HTML Document
  </title>
 </head>
 <body>
  <p>
   This is a very simple HTML document
  </p>
  <p>
   It only has two paragraphs
  </p>
 </body>
</html>



In [7]:
# Extract the title of the HTML document
soup.title

<title>
     A Simple HTML Document
  </title>

In [8]:
# Extract the text of the title
soup.title.text

'\n     A Simple HTML Document\n  '

In [9]:
# Clean up the text
soup.title.text.strip()

'A Simple HTML Document'

In [31]:
# Extract the contents of the HTML body
soup.body

<body>
<p>This is a very simple HTML document</p>
<p>It only has two paragraphs</p>
</body>

In [32]:
soup.body.text.strip()

'This is a very simple HTML document\nIt only has two paragraphs'

In [11]:
# Extract the text of the body
soup.body.text

'\nThis is a very simple HTML document\nIt only has two paragraphs\n'

In [12]:
# Text of the first paragraph
soup.body.p.text

'This is a very simple HTML document'

In [13]:
# Extract all paragraph elements
soup.body.find_all('p')

[<p>This is a very simple HTML document</p>, <p>It only has two paragraphs</p>]

In [14]:
texts = [x.text.strip() for x in soup.body.find_all('p')]

texts

['This is a very simple HTML document', 'It only has two paragraphs']

In [15]:
# Extract paragraph by index
soup.body.find_all('p')[0].text.strip()

'This is a very simple HTML document'

In [16]:
soup.body.find_all('p')[1]

<p>It only has two paragraphs</p>

In [17]:
# The text of the first paragraph
soup.body.find('p').text

'This is a very simple HTML document'
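
Beautiful Soup also provides `get_text()` and `stripped_strings`, which can handle the whitespace cleanup without chaining `.text.strip()` on every element. The following is a minimal sketch on the same sample document; the variable names `title_text` and `body_lines` are illustrative only.

```python
from bs4 import BeautifulSoup

html_string = """
<html>
<head>
  <title>
     A Simple HTML Document
  </title>
</head>
  <body>
    <p>This is a very simple HTML document</p>
    <p>It only has two paragraphs</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_string, "html.parser")

# get_text(strip=True) trims the whitespace around each text node
title_text = soup.title.get_text(strip=True)

# stripped_strings yields every non-empty text node with whitespace removed
body_lines = list(soup.body.stripped_strings)

print(title_text)
print(body_lines)
```

Note that `get_text(strip=True)` joins the stripped pieces with no separator, so for elements with nested tags it is usually safer to pass one explicitly, for example `get_text(" ", strip=True)`.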

# ==========================================

### 2.02 Students Do: CNN Soup (0:15)

# A Soup Starter

## Instructions

* Believe it or not, CNN's website for **1996: Year in Review** is still alive on the web: <http://edition.cnn.com/EVENTS/1996/year.in.review/>

* We have, however, stored the HTML document as a string in your starter file.

* Your task, should you accept it (and you should), is to use Beautiful Soup to scrape and print the following pieces of information:

1. The **title**

2. All **paragraph** texts

3. The top 10 headlines (warning: this one is a bit tricky, and may not come out perfectly!)

## Hints

* For the third task, you will need a means of filtering the data, perhaps over multiple iterations.

## Bonus

* If you finish early, head over to the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to read up on accessing `attributes` and navigating the DOM.
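
As a starting point for the bonus, here is a small sketch of attribute access and DOM navigation. The HTML snippet and its paths are made up for illustration and are not taken from the CNN page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only to demonstrate the API
html = """
<ul id="stories">
  <li><a href="/first.html" class="lead">First story</a></li>
  <li><a href="/second.html">Second story</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
first_link = soup.find("a")

# Bracket notation reads an attribute and raises KeyError if it is missing
href = first_link["href"]

# .get() returns None instead of raising when the attribute is absent
css_classes = first_link.get("class")   # class is multi-valued, so this is a list
missing = first_link.get("target")

# Navigate up to the parent, then across to the next <li> sibling
parent_name = first_link.parent.name
next_item = first_link.parent.find_next_sibling("li")
second_text = next_item.a.text

# Collect every href on the page, skipping anchors without one
all_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(href, css_classes, missing, parent_name, second_text, all_hrefs)
```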


In [33]:
# Dependencies
import os
from bs4 import BeautifulSoup as bs

In [21]:
# Read HTML from file
filepath = os.path.join("2", "Activities", "02-Stu_CNNSoup", "Resources", "template.html")
with open(filepath) as file:
    html = file.read()


In [19]:
# Create a Beautiful Soup object
soup = bs(html, 'lxml')


In [19]:
# Extract title text
title = soup.title.text
print(title)

Top Ten Stories From 1996


In [20]:
# Print all paragraph texts
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print("-"*64)
    print(paragraph.text.strip())

----------------------------------------------------------------

----------------------------------------------------------------
What were the biggest stories of the year?

It's a question journalists like to ask themselves at the end of every
            year. Now you can join in the process. Here are our selections for the top ten news
            stories of 1996.

            Disagree with our choices? Then tell us what stories you think were most compelling
            in the poll below.
----------------------------------------------------------------

----------------------------------------------------------------
What makes a big
            story BIG?
----------------------------------------------------------------
It depends on your criteria, of course, and your perspective. That's why we offered
            a poll to find out what you think.
----------------------------------------------------------------
For our list, we polled producers throughout the CNN/Pathfinder famil

In [21]:
# Collect the td elements that wrap a headline link
tds = soup.find_all('td')
# A blank list to hold the headlines
headlines = []
# Loop over td elements
for td in tds:
    # Keep the td only if it has an anchor with non-blank text
    if td.a and td.a.text:
        headlines.append(td)

In [22]:
headlines

[<td><a href="topten/israel/israel.index.html" target="_top"><b>Israel</b> elects <b>Netanyahu</b></a></td>,
 <td><a href="topten/twa/twa.index.html" target="_top">Crash of TWA Flight 800</a></td>,
 <td><a href="topten/yeltsin/yeltsin.index.html" target="_top"><b>Russia</b> elects <b>Yeltsin</b></a></td>,
 <td><a href="topten/clinton/clinton.index.html" target="_top"><b>U.S</b>. elects <b>Clinton</b></a></td>,
 <td><a href="topten/hutu/hutu.index.html" target="_top"><b>Hutu-Tutsi</b> conflict in central Africa</a></td>,
 <td><a href="topten/bosnia/bosnia.index.html" target="_top">Peace, elections in <b>Bosnia</b></a></td>,
 <td><a href="topten/saudi/saudi.index.html" target="_top"><b>U.S</b>. base bombed in <b>Saudi Arabia</b></a></td>,
 <td><a href="topten/olympics/olympics.index.html" target="_top">Centennial <b>Olympic</b> Games</a></td>,
 <td><a href="topten/aids/aids.index.html" target="_top">Advances against <b>AIDS</b></a></td>,
 <td><a href="topten/unabomb/unabomb.index.html" t

In [23]:
# Print only the text of the first ten headlines
for headline in headlines[:10]:
    print(headline.text)


Israel elects Netanyahu
Crash of TWA Flight 800
Russia elects Yeltsin
U.S. elects Clinton
Hutu-Tutsi conflict in central Africa
Peace, elections in Bosnia
U.S. base bombed in Saudi Arabia
Centennial Olympic Games
Advances against AIDS
Unabomb suspect Ted Kaczynski arrested
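
The same filtering can be written more compactly with a CSS selector. This is a sketch on a small made-up table (the `href` values are hypothetical), not on the real CNN markup, so the selector may need adjusting for the actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical table that mimics the td > a structure of the headlines
sample_html = """
<table>
  <tr><td><a href="topten/one.html"><b>First</b> headline</a></td></tr>
  <tr><td><a href="topten/two.html">Second headline</a></td></tr>
  <tr><td><a href="blank.html"></a></td></tr>
  <tr><td>No link in this cell</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector: anchors that are direct children of a td
links = soup.select("td > a")

# Keep only anchors whose text is non-blank after stripping whitespace
headlines = [a.text.strip() for a in links if a.text.strip()]

print(headlines)
```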


# ==========================================

### 2.03 Instructor Do: Craig's Wishlist (0:05)

In [22]:
# Dependencies
from bs4 import BeautifulSoup
import requests

In [34]:
# URL of page to be scraped (only the last assignment is used)
# url = 'https://newjersey.craigslist.org/search/sss?sort=rel&query=guitar'
url = 'https://sfbay.craigslist.org/search/sss?sort=rel&query=guitar'

In [35]:
# Retrieve page with the requests module
response = requests.get(url)

In [36]:
# Create BeautifulSoup object; parse with 'html.parser'
soup = BeautifulSoup(response.text, 'html.parser')

In [37]:
# Examine the results, then determine element that contains sought info
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <title>
   SF bay area for sale "guitar"  - craigslist
  </title>
  <script id="ld_breadcrumb_data" type="application/ld+json">
   {"@context":"https://schema.org","itemListElement":[{"item":{"name":"sfbay.craigslist.org","@id":"https://sfbay.craigslist.org"},"position":1,"@type":"ListItem"},{"item":{"name":"for sale","@id":"https://sfbay.craigslist.org/d/for-sale/search/sss"},"position":2,"@type":"ListItem"}],"@type":"BreadcrumbList"}
  </script>
  <meta content="" name="description"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible">
   <link href="https://sfbay.craigslist.org/d/for-sale/search/sss?query=guitar&amp;sort=rel" rel="canonical"/>
   <link href="https://sfbay.craigslist.org/d/for-sale/search/sss?s=120&amp;query=guitar&amp;sort=rel" rel="next"/>
   <meta content="width=device-width,initial-scale=1" name="viewport"/>
   <link href="//www.craigslist.org/styles/cl.css?v=5ea548767c8f312eb8ee55e79d68d2c4" media="all" rel="s

In [38]:
# results are returned as an iterable list
results = soup.find_all('li', class_="result-row")
results[0]

<li class="result-row" data-pid="7259695769" data-repost-of="7198590159">
<a class="result-image gallery" data-ids="3:00c0c_7kew5CsFvku_0CI0t2,3:01616_RRgrsaH15L_0CI0t2,3:00N0N_40Pdg47g7WU_0CI0t2,3:00h0h_76QGZVVMiZ2_0CI0t2,3:00I0I_gvDy5Xl5FwR_0CI0t2" href="https://sfbay.craigslist.org/sfc/fuo/d/san-francisco-vintage-yngve-ekstrom/7259695769.html">
<span class="result-price">$350</span>
</a>
<div class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2021-01-09 17:57" title="Sat 09 Jan 05:57:35 PM">Jan  9</time>
<h3 class="result-heading">
<a class="result-title hdrlnk" data-id="7259695769" href="https://sfbay.craigslist.org/sfc/fuo/d/san-francisco-vintage-yngve-ekstrom/7259695769.html" id="postid_7259695769">Vintage Yngve Ekstrom Guitar Pik End Table</a>
</h3>
<span class="result-meta">
<span class="result-price">$350</span>
<span class="result-hood"> (San Francisco)</span>


In [39]:
# Loop through returned results
for result in results:
    # Error handling: find() returns None when an element is missing,
    # so accessing .text on it raises AttributeError
    try:
        # Identify and return title of listing
        title = result.find('a', class_="result-title").text
        # Identify and return price of listing
        price = result.find('span', class_="result-price").text
        # Identify and return link to listing
        link = result.a['href']

        # Print results only if title, price, and link are available
        if title and price and link:
            print('-------------')
            print(title)
            print(price)
            print(link)
    except AttributeError as e:
        print(e)

-------------
Vintage Yngve Ekstrom Guitar Pik End Table
$350
https://sfbay.craigslist.org/sfc/fuo/d/san-francisco-vintage-yngve-ekstrom/7259695769.html
-------------
GUITAR IBANEZ CUSTOM FLAME INFERNO
$0
https://sfbay.craigslist.org/eby/msg/d/antioch-guitar-ibanez-custom-flame/7249798501.html
-------------
Behringer Strat style Electric Guitar
$110
https://sfbay.craigslist.org/nby/msg/d/novato-behringer-strat-style-electric/7259694995.html
-------------
Custom Halo Electric Guitar and Amp Combo
$1,350
https://sfbay.craigslist.org/sby/msg/d/san-jose-custom-halo-electric-guitar/7255032565.html
-------------
EVERGREEN - Guitar cases - FREE - not full size
$0
https://sfbay.craigslist.org/sby/zip/d/san-jose-evergreen-guitar-cases-free/7259691519.html
-------------
Guitar Amp
$25
https://sfbay.craigslist.org/sfc/msg/d/san-francisco-guitar-amp/7247243865.html
-------------
Blackstar ID Core 20 guitar amplifier
$99
https://sfbay.craigslist.org/sby/msg/d/san-jose-blackstar-id-core-20-guitar/72

# ==========================================

### 2.04 Students Do: Reddit Scraper (0:10)

## Instructions

* In this activity, you will scrape the Programmer-Humor.html file provided

* Use Beautiful Soup to scrape only threads that have twenty or more comments, then print the thread's title, number of comments, and the URL to the thread.

## Bonus

* If you finish early, try to display each thread's top comment in your output!

* As an added bonus try re-scraping using the URL instead. What happens when you try to do this? Why might this be happening?

In [104]:
# Dependencies
from bs4 import BeautifulSoup
import requests
import os

In [105]:
filepath = os.path.join("2", "Activities", "04-Stu_RedditScrape", "Unsolved", "Programmer-Humor.html")
with open(filepath, encoding='utf-8') as file:
    html = file.read()

In [106]:
# Create BeautifulSoup object; parse with 'html.parser'
soup = BeautifulSoup(html, 'html.parser')

In [107]:
print(soup.prettify())

<!DOCTYPE html>
<!-- saved from url=(0041)https://www.reddit.com/r/ProgrammerHumor/ -->
<html class="js cssanimations csstransforms" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Programmer Humor
  </title>
  <meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/>
  <meta content="reddit: the front page of the internet" name="description"/>
  <meta content="always" name="referrer"/>
  <link href="https://www.reddit.com/static/opensearch.xml" rel="search" type="application/opensearchdescription+xml"/>
  <link href="https://www.reddit.com/r/ProgrammerHumor/" rel="canonical"/>
  <meta content="width=1024" name="viewport"/>
  <link href="https://out.reddit.com/" rel="dns-prefetch"/>
  <link href="https://out.reddit.com/" rel="preconnect"/>
  <meta content="https://www.redditstatic.com/icon.png" property="og:image"/>
  <meta content="reddit" property="og:site_nam

In [109]:
# Find the number of subscribers
number_subscribers = soup.find("span", class_='subscribers').find('span', class_='number').text
print(f"The number of subscribers: {number_subscribers}")

The number of subscribers: 422,381


In [111]:
# Examine the results, then determine the element that contains the sought info
# Each thread sits in a div with the class 'top-matter'
# find_all returns the matches as an iterable list
results = soup.find_all('div', class_='top-matter')
len(results)

27

In [114]:
# Loop through returned results
for result in results:

    # Retrieve the thread title element
    title = result.find('p', class_='title')

    # Access the thread's text content
    title_text = title.a.text

    try:
        # Locate the list item that holds the comments link
        thread = result.find('li', class_='first')

        # The comments label, e.g. '47 comments'
        comments = thread.text.lstrip()

        # Parse the leading number out of the label for numeric comparison
        comments_num = int(comments.split()[0])

        # Access the href attribute with bracket notation
        link = thread.a['href']

        # Print only threads that have 20 or more comments
        if comments_num >= 20:
            print('\n-----------------\n')
            print(title_text)
            print('Comments:', comments_num)
            print(link)
    except AttributeError as e:
        # Raised when find() returned None, e.g. a result without a comments link
        print(e)

'NoneType' object has no attribute 'text'

-----------------

[Meta] Clarification on rules
Comments: 79
https://www.reddit.com/r/ProgrammerHumor/comments/6y2b47/meta_clarification_on_rules/

-----------------

Doing conditionals
Comments: 258
https://www.reddit.com/r/ProgrammerHumor/comments/7pw5qk/doing_conditionals/

-----------------

Perfect date
Comments: 58
https://www.reddit.com/r/ProgrammerHumor/comments/7pyyl2/perfect_date/

-----------------

The truth about java.
Comments: 61
https://www.reddit.com/r/ProgrammerHumor/comments/7pxod4/the_truth_about_java/

-----------------

It all makes sense now.
Comments: 341
https://www.reddit.com/r/ProgrammerHumor/comments/7pp66f/it_all_makes_sense_now/

-----------------

This is where US's bandwidth going.
Comments: 20
https://www.reddit.com/r/ProgrammerHumor/comments/7pv1ta/this_is_where_uss_bandwidth_going/
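
The `'NoneType' object has no attribute 'text'` line at the top of the output appears because `find()` returns `None` when nothing matches. One way to handle that explicitly, rather than through `try`/`except`, is a small helper. The sketch below uses a hypothetical helper name (`find_text`) and a made-up HTML fragment that only imitates the `top-matter` structure.

```python
from bs4 import BeautifulSoup


def find_text(parent, tag, css_class):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = parent.find(tag, class_=css_class)
    if element is None:
        return None
    return element.text.strip()


# Hypothetical fragment: the second thread has no comments link
sample_html = """
<div class="top-matter"><p class="title"><a>With comments</a></p>
  <ul><li class="first"><a href="/t/1">47 comments</a></li></ul></div>
<div class="top-matter"><p class="title"><a>No comments link</a></p></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
labels = [
    find_text(div, "li", "first")
    for div in soup.find_all("div", class_="top-matter")
]

print(labels)
```

Threads whose label comes back as `None` can then be skipped with a plain `if label is None: continue`.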


In [115]:
# BONUS
# Try to scrape the site using the URL
url = 'https://www.reddit.com/r/ProgrammerHumor/'

# Retrieve page with the requests module
html = requests.get(url)

In [116]:
# Create BeautifulSoup object; parse with 'html.parser'
soup = BeautifulSoup(html.text, 'html.parser')

In [117]:
# Display how different this HTML looks
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <script>
   var __SUPPORTS_TIMING_API = typeof performance === 'object' && !!performance.mark && !! performance.measure && !!performance.getEntriesByType;
    function __perfMark(name) { __SUPPORTS_TIMING_API && performance.mark(name); };
    var __firstPostLoaded = false;
    function __markFirstPostVisible() {
      if (__firstPostLoaded) { return; }
      __firstPostLoaded = true;
      __perfMark("first_post_title_image_loaded");
    }
    var __firstCommentLoaded = false;
    function __markFirstCommentVisible() {
      if (__firstCommentLoaded) { return; }
      __firstCommentLoaded = true;
      __perfMark("first_comment_loaded");
    }
  </script>
  <script>
   __perfMark('head_tag_start');
  </script>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="origin-when-cross-origin" name="referrer"/>
  <style>
   /* http://meyerweb.com/eric/tools/css/reset/
    v2.0 | 201101
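
The HTML returned by `requests` looks quite different from the saved file, so selectors written for the saved page may find nothing. A quick sanity check is to count the elements you expect before parsing any further. This is only a sketch: the two HTML strings are made-up stand-ins, and `count_elements` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def count_elements(html, tag, css_class):
    """Count how many elements with the given tag and class the HTML contains."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(tag, class_=css_class))


# Stand-in for the saved file, which uses div.top-matter for each thread
saved_page = '<div class="top-matter"><p class="title"><a>Thread</a></p></div>'

# Stand-in for a script-heavy response with none of the expected markup
script_page = "<html><head><script>var loaded = false;</script></head><body></body></html>"

saved_count = count_elements(saved_page, "div", "top-matter")
script_count = count_elements(script_page, "div", "top-matter")

print(saved_count, script_count)
```

A count of zero suggests the page markup differs from what the scraper was written against, or that the content is filled in by JavaScript after the initial response.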

# ==========================================

### 2.05 Instructor Do: Mongo Craig (0:10)

In [40]:
# Dependencies
from bs4 import BeautifulSoup
import requests
import pymongo

In [41]:
# Initialize PyMongo to work with MongoDB
conn = 'mongodb://localhost:27017'
client = pymongo.MongoClient(conn)

In [42]:
# Define database and collection
db = client.craigslist_db
collection = db.items

In [43]:
# URL of page to be scraped (only the last assignment is used)
# url = 'https://newjersey.craigslist.org/search/sss?sort=rel&query=guitar'
url = 'https://sfbay.craigslist.org/search/sss?sort=rel&query=guitar'

# Retrieve page with the requests module
response = requests.get(url)
# Create BeautifulSoup object; parse with 'lxml'
soup = BeautifulSoup(response.text, 'lxml')

In [44]:
# Examine the results, then determine element that contains sought info
# results are returned as an iterable list
results = soup.find_all('li', class_='result-row')

# Loop through returned results
for result in results:
    # Error handling
    try:
        # Identify and return title of listing
        title = result.find('a', class_='result-title').text
        # Identify and return price of listing
        price = result.find('span', class_="result-price").text
        # Convert to a number; prices with a thousands separator such as
        # '$2,699' raise ValueError here and are reported by the except below
        price = float(price.replace("$", ""))
        # Identify and return link to listing
        link = result.a['href']

        # Run only if title, price, and link are available
        # (a price of 0.0 is falsy, so free listings are skipped)
        if title and price and link:
            # Print results
            print('-------------')
            print(title)
            print(price)
            print(link)

            # Dictionary to be inserted as a MongoDB document
            post = {
                'title': title,
                'price': price,
                'url': link
            }

            collection.insert_one(post)

    except Exception as e:
        print(e)

could not convert string to float: '2,699'
-------------
3 Monster guitar / instrument cables 20 feet long
20.0
https://sfbay.craigslist.org/eby/msg/d/berkeley-monster-guitar-instrument/7258022193.html
could not convert string to float: '1,000'
-------------
Vintage Yngve Ekstrom Guitar Pik End Table
350.0
https://sfbay.craigslist.org/sfc/fuo/d/san-francisco-vintage-yngve-ekstrom/7259695769.html
-------------
Behringer Strat style Electric Guitar
110.0
https://sfbay.craigslist.org/nby/msg/d/novato-behringer-strat-style-electric/7259694995.html
could not convert string to float: '1,350'
-------------
Guitar Amp
25.0
https://sfbay.craigslist.org/sfc/msg/d/san-francisco-guitar-amp/7247243865.html
-------------
Blackstar ID Core 20 guitar amplifier
99.0
https://sfbay.craigslist.org/sby/msg/d/san-jose-blackstar-id-core-20-guitar/7257364877.html
-------------
Wii Console Guitar Hero Bundle 6 Games Balance Board
225.0
https://sfbay.craigslist.org/nby/vgm/d/petaluma-wii-console-guitar-hero-bun

-------------
Guitar Custom Built- Nice
250.0
https://sfbay.craigslist.org/nby/msg/d/rohnert-park-guitar-custom-built-nice/7259469587.html
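
The `could not convert string to float` messages above come from prices that contain a thousands separator. A minimal sketch of a more tolerant conversion follows; `parse_price` is a hypothetical helper name, and it assumes prices look like `$1,350`.

```python
def parse_price(text):
    """Convert a price label such as '$1,350' to a float, or None if it is not numeric."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_price("$2,699"))
print(parse_price("$0"))
print(parse_price("call for price"))
```

Because `0.0` is falsy, a check such as `if title and price and link` would still drop free listings; testing `price is not None` instead keeps them.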


In [139]:
# Display items in MongoDB collection
listings = db.items.find()

for listing in listings:
    print(listing)

{'_id': ObjectId('5ff0cee91f7a3580057808f2'), 'title': 'Laguna Electric Guitar LE122', 'price': 80.0, 'url': 'https://sfbay.craigslist.org/sby/msg/d/san-jose-laguna-electric-guitar-le122/7244968364.html'}
{'_id': ObjectId('5ff0cee91f7a3580057808f3'), 'title': 'Fender Acoustic Guitar', 'price': 70.0, 'url': 'https://sfbay.craigslist.org/scz/msg/d/santa-cruz-fender-acoustic-guitar/7255868125.html'}
{'_id': ObjectId('5ff0cee91f7a3580057808f4'), 'title': 'Agile AL-2000 electric guitar with Seymour Duncan pickups', 'price': 300.0, 'url': 'https://sfbay.craigslist.org/pen/msg/d/san-bruno-agile-al-2000-electric-guitar/7255869018.html'}
{'_id': ObjectId('5ff0cee91f7a3580057808f5'), 'title': 'Fender R.A.D amplifier (guitar)', 'price': 100.0, 'url': 'https://sfbay.craigslist.org/sfc/msg/d/san-francisco-fender-rad-amplifier/7252415537.html'}
{'_id': ObjectId('5ff0cee91f7a3580057808f6'), 'title': 'Montoya classical guitar', 'price': 90.0, 'url': 'https://sfbay.craigslist.org/eby/msg/d/berkeley-mon
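
Each run of the scraping cell calls `insert_one` again, so the same listing can end up in the collection more than once. One option is an upsert keyed on the listing URL, using PyMongo's `update_one(..., upsert=True)`. In the sketch below, `save_listing` is a hypothetical helper, and `RecordingCollection` is a stand-in class so the example runs without a MongoDB server; with a real collection you would pass `db.items` instead.

```python
def save_listing(collection, post):
    """Insert the listing, or update it if a document with the same URL exists."""
    return collection.update_one(
        {"url": post["url"]},   # filter: match on the listing URL
        {"$set": post},         # update: overwrite the stored fields
        upsert=True,            # insert when no document matches
    )


class RecordingCollection:
    """Hypothetical stand-in that records calls instead of talking to MongoDB."""

    def __init__(self):
        self.calls = []

    def update_one(self, filter_doc, update_doc, upsert=False):
        self.calls.append((filter_doc, update_doc, upsert))


demo = RecordingCollection()
listing = {"title": "Guitar Amp", "price": 25.0, "url": "https://example.com/amp"}
save_listing(demo, listing)

print(demo.calls)
```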

# ==========================================

### 2.06 Students Do: Hockey Headers (0:10)

Teamwork! Speed! Mental and physical toughness! Passion! Excitement! Unpredictable matchups down to the wire! What could be better? While these terms could easily be applied to a data science hackathon, we're talking about the magnificent sport of hockey.

Your assignment is to scrape the articles on the news page of the NHL website - which is frequently updated - and then post the results of your scraping to MongoDB.

## Instructions

* Use Beautiful Soup and requests to scrape the header and subheader of each article on the news page.

* Post the above information as a MongoDB document and then print all of the documents on the database to the console.

* In addition to the above, post the date of the article publication as well.


In [141]:
# Dependencies
from bs4 import BeautifulSoup
import requests
import pymongo

In [142]:
# Initialize PyMongo to work with MongoDB
conn = 'mongodb://localhost:27017'
client = pymongo.MongoClient(conn)

In [143]:
# Define database and collection
db = client.nhl_db
collection = db.articles

In [45]:
# URL of page to be scraped
url = 'https://www.nhl.com/news'

# Retrieve page with the requests module
response = requests.get(url)
# Create BeautifulSoup object; parse with 'lxml'
soup = BeautifulSoup(response.text, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en_US">
 <head>
  <title>
   NHL Hockey News | NHL.com
  </title>
  <!-- meta meta tag -->
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="no-cache" http-equiv="Cache-Control"/>
  <meta content="no-cache" http-equiv="Pragma"/>
  <meta content="-1" http-equiv="Expires"/>
  <meta content="en" http-equiv="content-language"/>
  <meta content="nhl, nhl.com, www.nhl.com, playoffs, scores, video, photos, standings, news, features, players, shop, auctions, tickets, mobile, game center live, stanley cup, winter classic, draft, free agency" name="keywords"/>
  <meta content="US" name="countryCode"/>
  <meta content="NHL Hockey News" property="og:title"/>
  <meta content="NHL Hockey News NHL.com" itemprop="name"/>
  <meta content="NHL.com" property="og:site_name"/>
  <meta content="website" property="og:type"/>
  <meta content="https://cms.nhl.bamgrid.com/images/photos/3201

In [145]:
# Retrieve the parent divs for all articles
results = soup.find_all('div', class_='article-item__top')

# loop over results to get article data
for result in results:
    # scrape the article header 
    header = result.find('h1', class_='article-item__headline').text
    
    # scrape the article subheader
    subheader = result.find('h2', class_='article-item__subheader').text
    
    # scrape the datetime
    datetime = result.find('span', class_='article-item__date')['data-date'] 
    
    # get only the date from the datetime
    date = datetime.split('T')[0]
    
    # print article data
    print('-----------------')
    print(header)
    print(subheader)
    print(datetime)
    print(date)

    # Dictionary to be inserted into MongoDB
    post = {
        'header': header,
        'subheader': subheader,
        'date': date,
    }

    # Insert dictionary into MongoDB as a document
    collection.insert_one(post)

-----------------
NHL teams that missed playoffs start training camp
Red Wings, Senators, Devils among those fired up for first practice
2021-01-01T16:54:21-0500
2021-01-01
-----------------
Lightning season preview: Return of Stamkos bodes well for Cup defense
Kucherov's absence negated by captain's health, arrival of key prospects
2021-01-02T00:00:00-0500
2021-01-02
-----------------
Hall takes ice for first time with Sabres
Forward says familiarity with coach Krueger will help him get acclimated quicker in training camp
2021-01-01T17:07:33-0500
2021-01-01
-----------------
3 'Star' keys for United States against Slovakia in WJC quarterfinals
NHL Network analyst Starman says goalie decision, play of forwards will be important
2021-01-02T09:52:51-0500
2021-01-02
-----------------
Stars season preview: Better start vital without Seguin, Bishop
Western Conference champs not expected to have injured forward, goalie for first two months
2021-01-02T00:01:59-0500
2021-01-02
----------------
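
Splitting on `'T'` works for pulling out the date, but the standard library can also parse the full timestamp, including the UTC offset. A minimal sketch, assuming the `data-date` values always follow the format shown in the output above:

```python
from datetime import datetime, timedelta

# Sample value copied from the output above
raw = "2021-01-01T16:54:21-0500"

# %z parses the trailing UTC offset (-0500)
published = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S%z")

date_only = published.date().isoformat()
utc_offset = published.utcoffset()

print(date_only)
print(utc_offset == timedelta(hours=-5))
```

If you import `datetime` this way, give the loop variable a different name (for example `published_raw`) so it does not shadow the import.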

In [146]:
# Display the MongoDB records created above
articles = db.articles.find()
for article in articles:
    print(article)

{'_id': ObjectId('5ff0d10a1f7a358005780959'), 'header': 'NHL teams that missed playoffs start training camp', 'subheader': 'Red Wings, Senators, Devils among those fired up for first practice', 'date': '2021-01-01'}
{'_id': ObjectId('5ff0d10a1f7a35800578095a'), 'header': 'Lightning season preview: Return of Stamkos bodes well for Cup defense', 'subheader': "Kucherov's absence negated by captain's health, arrival of key prospects", 'date': '2021-01-02'}
{'_id': ObjectId('5ff0d10a1f7a35800578095b'), 'header': 'Hall takes ice for first time with Sabres', 'subheader': 'Forward says familiarity with coach Krueger will help him get acclimated quicker in training camp', 'date': '2021-01-01'}
{'_id': ObjectId('5ff0d10a1f7a35800578095c'), 'header': "3 'Star' keys for United States against Slovakia in WJC quarterfinals", 'subheader': 'NHL Network analyst Starman says goalie decision, play of forwards will be important', 'date': '2021-01-02'}
{'_id': ObjectId('5ff0d10a1f7a35800578095d'), 'header'

# ==========================================

# BREAK (0:30)

# ==========================================

### 2.07 Instructor Do: Introduction to Splinter (0:15)

In [3]:
from splinter import Browser
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager

In [9]:
# Setup splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

[WDM] - Current google-chrome version is 87.0.4280
[WDM] - Get LATEST driver version for 87.0.4280
[WDM] - Driver [C:\Users\k\.wdm\drivers\chromedriver\win32\87.0.4280.88\chromedriver.exe] found in cache


# Mac Users

In [10]:
# https://splinter.readthedocs.io/en/latest/drivers/chrome.html
# !which chromedriver
# executable_path = {'executable_path': '/usr/local/bin/chromedriver'}


# Windows Users

In [11]:
# executable_path = {'executable_path': '2/Activities/07-Ins_Splinter/Solved/chromedriver.exe'}

In [12]:
# import os
# if os.name=="nt":
#     executable_path = {'executable_path': './chromedriver.exe'}
# else:
#     executable_path = {"executable_path": "/usr/local/bin/chromedriver"}

In [13]:
# browser = Browser('chrome', **executable_path, headless=False)

In [47]:
url = 'http://quotes.toscrape.com/'
browser.visit(url)


In [46]:
for x in range(1, 6):

    html = browser.html
    soup = BeautifulSoup(html, 'html.parser')

    quotes = soup.find_all('span', class_='text')

    for quote in quotes:
        print('page:', x, '-------------')
        print(quote.text)

    # Follow the 'Next' link (click_link_by_partial_text is deprecated in Splinter)
    browser.links.find_by_partial_text('Next').click()


In [59]:
browser.quit()

# ==========================================

### 2.08 Students Do: Bookscraper (0:15)

## Instructions

* Go to <http://books.toscrape.com/>

* Scrape the titles and the URLs to all books on this fictional online bookstore. Display the results in console.

* That's it!

* If you're craving extra challenge, try scraping all books by **category**. Good luck!


In [48]:
from splinter import Browser
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager

In [49]:
# # Mac Users
# # https://splinter.readthedocs.io/en/latest/drivers/chrome.html
# # !which chromedriver
# # executable_path = {'executable_path': '/usr/local/bin/chromedriver'}

# # Windows Users
# import os
# if os.name=="nt":
#     executable_path = {'executable_path': './chromedriver.exe'}
# else:
#     executable_path = {"executable_path": "/usr/local/bin/chromedriver"}

In [50]:
# browser = Browser('chrome', **executable_path, headless=False)

In [51]:
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

[WDM] - Current google-chrome version is 87.0.4280
[WDM] - Get LATEST driver version for 87.0.4280
[WDM] - Driver [/Users/AliciaLy/.wdm/drivers/chromedriver/mac64/87.0.4280.88/chromedriver] found in cache


 


In [52]:
url = 'http://books.toscrape.com/'
browser.visit(url)

In [54]:
# Iterate through all pages
for x in range(50):
    # HTML object
    html = browser.html
    # Parse HTML with Beautiful Soup
    soup = BeautifulSoup(html, 'html.parser')
    # Retrieve all elements that contain book information
    articles = soup.find_all('article', class_='product_pod')

    # Iterate through each book
    for article in articles:
        # Use Beautiful Soup's find() method to navigate and retrieve attributes
        h3 = article.find('h3')
        link = h3.find('a')
        href = link['href']
        title = link['title']
        print('-----------')
        print(title)
        print(url + href)

    # Click the 'next' link; stop once the last page has none
    try:
        browser.links.find_by_partial_text('next').click()
    except Exception:
        print("Scraping Complete")
        break


-----------
In Her Wake
http://books.toscrape.com/in-her-wake_980/index.html
-----------
How Music Works
http://books.toscrape.com/how-music-works_979/index.html
-----------
Foolproof Preserving: A Guide to Small Batch Jams, Jellies, Pickles, Condiments, and More: A Foolproof Guide to Making Small Batch Jams, Jellies, Pickles, Condiments, and More
http://books.toscrape.com/foolproof-preserving-a-guide-to-small-batch-jams-jellies-pickles-condiments-and-more-a-foolproof-guide-to-making-small-batch-jams-jellies-pickles-condiments-and-more_978/index.html
-----------
Chase Me (Paris Nights #2)
http://books.toscrape.com/chase-me-paris-nights-2_977/index.html
-----------
Black Dust
http://books.toscrape.com/black-dust_976/index.html
-----------
Birdsong: A Story in Pictures
http://books.toscrape.com/birdsong-a-story-in-pictures_975/index.html
-----------
America's Cradle of Quarterbacks: Western Pennsylvania's Football Factory from Johnny Unitas to Joe Montana
http://books.toscrape.com/americ

-----------
Princess Jellyfish 2-in-1 Omnibus, Vol. 01 (Princess Jellyfish 2-in-1 Omnibus #1)
http://books.toscrape.com/princess-jellyfish-2-in-1-omnibus-vol-01-princess-jellyfish-2-in-1-omnibus-1_920/index.html
-----------
Princess Between Worlds (Wide-Awake Princess #5)
http://books.toscrape.com/princess-between-worlds-wide-awake-princess-5_919/index.html
-----------
Pop Gun War, Volume 1: Gift
http://books.toscrape.com/pop-gun-war-volume-1-gift_918/index.html
-----------
Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics
http://books.toscrape.com/political-suicide-missteps-peccadilloes-bad-calls-backroom-hijinx-sordid-pasts-rotten-breaks-and-just-plain-dumb-mistakes-in-the-annals-of-american-politics_917/index.html
-----------
Patience
http://books.toscrape.com/patience_916/index.html
-----------
Outcast, Vol. 1: A Darkness Surrounds Him (Outcast #1)
http://books.toscrape

-----------
The Shadow Hero (The Shadow Hero)
http://books.toscrape.com/the-shadow-hero-the-shadow-hero_860/index.html
-----------
The Secret (The Secret #1)
http://books.toscrape.com/the-secret-the-secret-1_859/index.html
-----------
The Regional Office Is Under Attack!
http://books.toscrape.com/the-regional-office-is-under-attack_858/index.html
-----------
The Psychopath Test: A Journey Through the Madness Industry
http://books.toscrape.com/the-psychopath-test-a-journey-through-the-madness-industry_857/index.html
-----------
The Project
http://books.toscrape.com/the-project_856/index.html
-----------
The Power of Now: A Guide to Spiritual Enlightenment
http://books.toscrape.com/the-power-of-now-a-guide-to-spiritual-enlightenment_855/index.html
-----------
The Omnivore's Dilemma: A Natural History of Four Meals
http://books.toscrape.com/the-omnivores-dilemma-a-natural-history-of-four-meals_854/index.html
-----------
The Nerdy Nummies Cookbook: Sweet Treats for the Geek in All of Us
ht

-----------
Whole Lotta Creativity Going On: 60 Fun and Unusual Exercises to Awaken and Strengthen Your Creativity
http://books.toscrape.com/whole-lotta-creativity-going-on-60-fun-and-unusual-exercises-to-awaken-and-strengthen-your-creativity_780/index.html
-----------
What's It Like in Space?: Stories from Astronauts Who've Been There
http://books.toscrape.com/whats-it-like-in-space-stories-from-astronauts-whove-been-there_779/index.html
-----------
We Are Robin, Vol. 1: The Vigilante Business (We Are Robin #1)
http://books.toscrape.com/we-are-robin-vol-1-the-vigilante-business-we-are-robin-1_778/index.html
-----------
Walt Disney's Alice in Wonderland
http://books.toscrape.com/walt-disneys-alice-in-wonderland_777/index.html
-----------
V for Vendetta (V for Vendetta Complete)
http://books.toscrape.com/v-for-vendetta-v-for-vendetta-complete_776/index.html
-----------
Until Friday Night (The Field Party #1)
http://books.toscrape.com/until-friday-night-the-field-party-1_775/index.html
-

-----------
Hold Your Breath (Search and Rescue #1)
http://books.toscrape.com/hold-your-breath-search-and-rescue-1_700/index.html
-----------
Hamilton: The Revolution
http://books.toscrape.com/hamilton-the-revolution_699/index.html
-----------
Greek Mythic History
http://books.toscrape.com/greek-mythic-history_698/index.html
-----------
God: The Most Unpleasant Character in All Fiction
http://books.toscrape.com/god-the-most-unpleasant-character-in-all-fiction_697/index.html
-----------
Glory over Everything: Beyond The Kitchen House
http://books.toscrape.com/glory-over-everything-beyond-the-kitchen-house_696/index.html
-----------
Feathers: Displays of Brilliant Plumage
http://books.toscrape.com/feathers-displays-of-brilliant-plumage_695/index.html
-----------
Far & Away: Places on the Brink of Change: Seven Continents, Twenty-Five Years
http://books.toscrape.com/far-away-places-on-the-brink-of-change-seven-continents-twenty-five-years_694/index.html
-----------
Every Last Word
http://

-----------
Hide Away (Eve Duncan #20)
http://books.toscrape.com/hide-away-eve-duncan-20_620/index.html
-----------
Furiously Happy: A Funny Book About Horrible Things
http://books.toscrape.com/furiously-happy-a-funny-book-about-horrible-things_619/index.html
-----------
Everyday Italian: 125 Simple and Delicious Recipes
http://books.toscrape.com/everyday-italian-125-simple-and-delicious-recipes_618/index.html
-----------
Equal Is Unfair: America's Misguided Fight Against Income Inequality
http://books.toscrape.com/equal-is-unfair-americas-misguided-fight-against-income-inequality_617/index.html
-----------
Eleanor & Park
http://books.toscrape.com/eleanor-park_616/index.html
-----------
Dirty (Dive Bar #1)
http://books.toscrape.com/dirty-dive-bar-1_615/index.html
-----------
Can You Keep a Secret? (Fear Street Relaunch #4)
http://books.toscrape.com/can-you-keep-a-secret-fear-street-relaunch-4_614/index.html
-----------
Boar Island (Anna Pigeon #19)
http://books.toscrape.com/boar-island

-----------
Roller Girl
http://books.toscrape.com/roller-girl_540/index.html
-----------
Rising Strong
http://books.toscrape.com/rising-strong_539/index.html
-----------
Proofs of God: Classical Arguments from Tertullian to Barth
http://books.toscrape.com/proofs-of-god-classical-arguments-from-tertullian-to-barth_538/index.html
-----------
Please Kill Me: The Uncensored Oral History of Punk
http://books.toscrape.com/please-kill-me-the-uncensored-oral-history-of-punk_537/index.html
-----------
Out of Print: City Lights Spotlight No. 14
http://books.toscrape.com/out-of-print-city-lights-spotlight-no-14_536/index.html
-----------
My Life Next Door (My Life Next Door )
http://books.toscrape.com/my-life-next-door-my-life-next-door_535/index.html
-----------
Miller's Valley
http://books.toscrape.com/millers-valley_534/index.html
-----------
Man's Search for Meaning
http://books.toscrape.com/mans-search-for-meaning_533/index.html
-----------
Love That Boy: What Two Presidents, Eight Road Trip

-----------
Benjamin Franklin: An American Life
http://books.toscrape.com/benjamin-franklin-an-american-life_460/index.html
-----------
At The Existentialist Café: Freedom, Being, and apricot cocktails with: Jean-Paul Sartre, Simone de Beauvoir, Albert Camus, Martin Heidegger, Edmund Husserl, Karl Jaspers, Maurice Merleau-Ponty and others
http://books.toscrape.com/at-the-existentialist-cafe-freedom-being-and-apricot-cocktails-with-jean-paul-sartre-simone-de-beauvoir-albert-camus-martin-heidegger-edmund-husserl-karl-jaspers-maurice-merleau-ponty-and-others_459/index.html
-----------
A Summer In Europe
http://books.toscrape.com/a-summer-in-europe_458/index.html
-----------
A Short History of Nearly Everything
http://books.toscrape.com/a-short-history-of-nearly-everything_457/index.html
-----------
A Gathering of Shadows (Shades of Magic #2)
http://books.toscrape.com/a-gathering-of-shadows-shades-of-magic-2_456/index.html
-----------
The Sound Of Love
http://books.toscrape.com/the-sound-o

-----------
Isla and the Happily Ever After (Anna and the French Kiss #3)
http://books.toscrape.com/isla-and-the-happily-ever-after-anna-and-the-french-kiss-3_380/index.html
-----------
If I Stay (If I Stay #1)
http://books.toscrape.com/if-i-stay-if-i-stay-1_379/index.html
-----------
I Know Why the Caged Bird Sings (Maya Angelou's Autobiography #1)
http://books.toscrape.com/i-know-why-the-caged-bird-sings-maya-angelous-autobiography-1_378/index.html
-----------
Harry Potter and the Deathly Hallows (Harry Potter #7)
http://books.toscrape.com/harry-potter-and-the-deathly-hallows-harry-potter-7_377/index.html
-----------
Fruits Basket, Vol. 5 (Fruits Basket #5)
http://books.toscrape.com/fruits-basket-vol-5-fruits-basket-5_376/index.html
-----------
Foundation (Foundation (Publication Order) #1)
http://books.toscrape.com/foundation-foundation-publication-order-1_375/index.html
-----------
Fool Me Once
http://books.toscrape.com/fool-me-once_374/index.html
-----------
Find Her (Detective D.

-----------
Walk the Edge (Thunder Road #2)
http://books.toscrape.com/walk-the-edge-thunder-road-2_300/index.html
-----------
Voyager (Outlander #3)
http://books.toscrape.com/voyager-outlander-3_299/index.html
-----------
Very Good Lives: The Fringe Benefits of Failure and the Importance of Imagination
http://books.toscrape.com/very-good-lives-the-fringe-benefits-of-failure-and-the-importance-of-imagination_298/index.html
-----------
Vegan Vegetarian Omnivore: Dinner for Everyone at the Table
http://books.toscrape.com/vegan-vegetarian-omnivore-dinner-for-everyone-at-the-table_297/index.html
-----------
Unstuffed: Decluttering Your Home, Mind, and Soul
http://books.toscrape.com/unstuffed-decluttering-your-home-mind-and-soul_296/index.html
-----------
Under the Banner of Heaven: A Story of Violent Faith
http://books.toscrape.com/under-the-banner-of-heaven-a-story-of-violent-faith_295/index.html
-----------
Two Boys Kissing
http://books.toscrape.com/two-boys-kissing_294/index.html
-------

-----------
Seven Days in the Art World
http://books.toscrape.com/seven-days-in-the-art-world_220/index.html
-----------
Seven Brief Lessons on Physics
http://books.toscrape.com/seven-brief-lessons-on-physics_219/index.html
-----------
Scarlet (The Lunar Chronicles #2)
http://books.toscrape.com/scarlet-the-lunar-chronicles-2_218/index.html
-----------
Sarah's Key
http://books.toscrape.com/sarahs-key_217/index.html
-----------
Saga, Volume 3 (Saga (Collected Editions) #3)
http://books.toscrape.com/saga-volume-3-saga-collected-editions-3_216/index.html
-----------
Running with Scissors
http://books.toscrape.com/running-with-scissors_215/index.html
-----------
Rogue Lawyer (Rogue Lawyer #1)
http://books.toscrape.com/rogue-lawyer-rogue-lawyer-1_214/index.html
-----------
Rise of the Rocket Girls: The Women Who Propelled Us, from Missiles to the Moon to Mars
http://books.toscrape.com/rise-of-the-rocket-girls-the-women-who-propelled-us-from-missiles-to-the-moon-to-mars_213/index.html
-------

-----------
Civilization and Its Discontents
http://books.toscrape.com/civilization-and-its-discontents_140/index.html
-----------
Cinder (The Lunar Chronicles #1)
http://books.toscrape.com/cinder-the-lunar-chronicles-1_139/index.html
-----------
Catastrophic Happiness: Finding Joy in Childhood's Messy Years
http://books.toscrape.com/catastrophic-happiness-finding-joy-in-childhoods-messy-years_138/index.html
-----------
Career of Evil (Cormoran Strike #3)
http://books.toscrape.com/career-of-evil-cormoran-strike-3_137/index.html
-----------
Breaking Dawn (Twilight #4)
http://books.toscrape.com/breaking-dawn-twilight-4_136/index.html
-----------
Brave Enough
http://books.toscrape.com/brave-enough_135/index.html
-----------
Boy Meets Boy
http://books.toscrape.com/boy-meets-boy_134/index.html
-----------
Born to Run: A Hidden Tribe, Superathletes, and the Greatest Race the World Has Never Seen
http://books.toscrape.com/born-to-run-a-hidden-tribe-superathletes-and-the-greatest-race-the-worl

-----------
The Bhagavad Gita
http://books.toscrape.com/the-bhagavad-gita_60/index.html
-----------
The Bette Davis Club
http://books.toscrape.com/the-bette-davis-club_59/index.html
-----------
The Art of Not Breathing
http://books.toscrape.com/the-art-of-not-breathing_58/index.html
-----------
Taking Shots (Assassins #1)
http://books.toscrape.com/taking-shots-assassins-1_57/index.html
-----------
Starlark
http://books.toscrape.com/starlark_56/index.html
-----------
Skip Beat!, Vol. 01 (Skip Beat! #1)
http://books.toscrape.com/skip-beat-vol-01-skip-beat-1_55/index.html
-----------
Sister Sable (The Mad Queen #1)
http://books.toscrape.com/sister-sable-the-mad-queen-1_54/index.html
-----------
Shatter Me (Shatter Me #1)
http://books.toscrape.com/shatter-me-shatter-me-1_53/index.html
-----------
Shameless
http://books.toscrape.com/shameless_52/index.html
-----------
Shadow Rites (Jane Yellowrock #10)
http://books.toscrape.com/shadow-rites-jane-yellowrock-10_51/index.html
-----------
Settl

Scraping Complete
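The `find_all()` and `find()` calls in the loop above can be tried without a browser. The sketch below runs the same navigation on a small hand-written HTML string; the markup is a simplified assumption modeled on the `product_pod` articles, not a copy of the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on the book listing
html = """
<article class="product_pod">
  <h3><a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a></h3>
</article>
<article class="product_pod">
  <h3><a href="how-music-works_979/index.html" title="How Music Works">How Music ...</a></h3>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

books = []
# find_all() returns every matching element; find() returns only the first match
for article in soup.find_all("article", class_="product_pod"):
    link = article.find("h3").find("a")
    # Tag attributes are read with dictionary-style indexing
    books.append({"title": link["title"], "href": link["href"]})

for book in books:
    print(book["title"], book["href"])
```

Collecting the results in a list of dictionaries, rather than only printing them, keeps them available for later steps such as saving to a database.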


In [24]:
browser.quit()
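The loop builds each link with `url + href`, which only gives a valid address when the `href` is relative to the site root. A more defensive option is to resolve the `href` against the URL of the page it was found on, using `urljoin` from the standard library. This is a minimal sketch; the page URL below is an assumed example of a paginated listing address, not a value taken from the output.

```python
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/"

# Assumed example: the address of a listing page reached by clicking 'next'
page_url = urljoin(base_url, "catalogue/page-2.html")

# An href as it might appear on that page, relative to the page itself
href = "in-her-wake_980/index.html"

# urljoin resolves the relative href against the page it came from
book_url = urljoin(page_url, href)
print(book_url)
```

With Splinter, the current page address is available as `browser.url`, so `urljoin(browser.url, href)` could replace the string concatenation inside the loop.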

# ==========================================

### 2.09 Instructor Do: Pandas Scraping (0:10)

Scraping with Pandas

In [55]:
import pandas as pd

We can use the `read_html` function in Pandas to automatically scrape any tabular data from a page.

In [56]:
url = 'https://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States'

In [57]:
tables = pd.read_html(url)
len(tables)

8

#### What we get in return is a list of dataframes for any tabular data that Pandas found.

In [58]:
type(tables)

list
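To see the list-of-DataFrames behavior without a network request, `read_html` can also parse HTML held in memory. The sketch below is an illustration with made-up table contents; it assumes an HTML parser library that Pandas supports (such as `lxml`, or `bs4` with `html5lib`) is installed.

```python
from io import StringIO

import pandas as pd

# Two small hand-written tables (illustrative data only)
html = """
<table>
  <tr><th>City</th><th>State</th></tr>
  <tr><td>Albany</td><td>New York</td></tr>
  <tr><td>Baltimore</td><td>Maryland</td></tr>
</table>
<table>
  <tr><th>abb</th><th>full_name</th></tr>
  <tr><td>BP</td><td>blood pressure</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element it finds
tables = pd.read_html(StringIO(html))
print(type(tables), len(tables))
print(tables[0])
```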

#### We can slice off any of those dataframes that we want using normal indexing.

In [59]:
df = tables[0]
df.head()

Unnamed: 0_level_0,City,Building,Start Date,End Date,Duration,Ref
Unnamed: 0_level_1,Albany Congress,Albany Congress,Albany Congress,Albany Congress,Albany Congress,Albany Congress
0,"Albany, New York",Stadt Huys,"June 19, 1754","July 11, 1754",22 days,[8]
1,Stamp Act Congress,Stamp Act Congress,Stamp Act Congress,Stamp Act Congress,Stamp Act Congress,Stamp Act Congress
2,"New York, New York",City Hall,"October 7, 1765","October 25, 1765",23 days,[9]
3,First Continental Congress,First Continental Congress,First Continental Congress,First Continental Congress,First Continental Congress,First Continental Congress
4,"Philadelphia, Pennsylvania",Carpenters' Hall,"September 5, 1774","October 26, 1774",1 month and 21 days,[10]


#### Flatten the two-level header and drop the section-title rows

In [60]:
df.columns = df.columns.get_level_values(0)
df.columns

Index(['City', 'Building', 'Start Date', 'End Date', 'Duration', 'Ref'], dtype='object')

In [61]:
# Keep only data rows: section-title rows repeat the title in every column,
# so their `Ref` value does not start with "["
df = df.loc[df.Ref.str.startswith("[")]
df.head()

Unnamed: 0,City,Building,Start Date,End Date,Duration,Ref
0,"Albany, New York",Stadt Huys,"June 19, 1754","July 11, 1754",22 days,[8]
2,"New York, New York",City Hall,"October 7, 1765","October 25, 1765",23 days,[9]
4,"Philadelphia, Pennsylvania",Carpenters' Hall,"September 5, 1774","October 26, 1774",1 month and 21 days,[10]
6,"Philadelphia, Pennsylvania",Independence Hall,"May 10, 1775","December 12, 1776","1 year, 7 months and 2 days",[11]
7,"Baltimore, Maryland",Henry Fite House,"December 20, 1776","February 27, 1777",2 months and 7 days,[12]
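The two cleanup steps above can be reproduced on a tiny hand-built frame. This is a sketch with made-up rows that mimic the shape of the scraped table: a two-level header, and title rows that repeat one value across every column.

```python
import pandas as pd

# Two-level header, similar in shape to what read_html returned above
columns = pd.MultiIndex.from_tuples(
    [("City", "Albany Congress"), ("Ref", "Albany Congress")]
)
df = pd.DataFrame(
    [
        ["Albany, New York", "[8]"],
        ["Stamp Act Congress", "Stamp Act Congress"],  # section-title row
        ["New York, New York", "[9]"],
    ],
    columns=columns,
)

# Keep only the top level of the header
df.columns = df.columns.get_level_values(0)

# Data rows carry a footnote marker such as "[8]" in Ref; title rows do not
df = df.loc[df["Ref"].str.startswith("[")]
print(df)
```

Filtering leaves gaps in the index (here 0 and 2), which is why the notebook later resets it.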


#### Split column values into two separate columns

In [62]:
columnsplit = df['City'].str.split(", ", expand=True)
df = df.assign(City=columnsplit[0],State=columnsplit[1])
df.head()

Unnamed: 0,City,Building,Start Date,End Date,Duration,Ref,State
0,Albany,Stadt Huys,"June 19, 1754","July 11, 1754",22 days,[8],New York
2,New York,City Hall,"October 7, 1765","October 25, 1765",23 days,[9],New York
4,Philadelphia,Carpenters' Hall,"September 5, 1774","October 26, 1774",1 month and 21 days,[10],Pennsylvania
6,Philadelphia,Independence Hall,"May 10, 1775","December 12, 1776","1 year, 7 months and 2 days",[11],Pennsylvania
7,Baltimore,Henry Fite House,"December 20, 1776","February 27, 1777",2 months and 7 days,[12],Maryland
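As a standalone illustration of the split, the sketch below applies `str.split` with `expand=True` to made-up values. Passing `n=1` is an optional addition, not part of the cell above: it limits the split to the first separator, so a value with more than one comma still produces exactly two columns.

```python
import pandas as pd

df = pd.DataFrame({"City": ["Albany, New York", "Baltimore, Maryland"]})

# expand=True returns a DataFrame with one column per piece (0, 1, ...)
parts = df["City"].str.split(", ", n=1, expand=True)

# Overwrite City with the first piece and add State from the second
df = df.assign(City=parts[0], State=parts[1])
print(df)
```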


#### Drop a column

In [63]:
df = df.drop(['Ref'], axis=1)
df.head()

Unnamed: 0,City,Building,Start Date,End Date,Duration,State
0,Albany,Stadt Huys,"June 19, 1754","July 11, 1754",22 days,New York
2,New York,City Hall,"October 7, 1765","October 25, 1765",23 days,New York
4,Philadelphia,Carpenters' Hall,"September 5, 1774","October 26, 1774",1 month and 21 days,Pennsylvania
6,Philadelphia,Independence Hall,"May 10, 1775","December 12, 1776","1 year, 7 months and 2 days",Pennsylvania
7,Baltimore,Henry Fite House,"December 20, 1776","February 27, 1777",2 months and 7 days,Maryland


#### Reset an index

In [64]:
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,City,Building,Start Date,End Date,Duration,State
0,Albany,Stadt Huys,"June 19, 1754","July 11, 1754",22 days,New York
1,New York,City Hall,"October 7, 1765","October 25, 1765",23 days,New York
2,Philadelphia,Carpenters' Hall,"September 5, 1774","October 26, 1774",1 month and 21 days,Pennsylvania
3,Philadelphia,Independence Hall,"May 10, 1775","December 12, 1776","1 year, 7 months and 2 days",Pennsylvania
4,Baltimore,Henry Fite House,"December 20, 1776","February 27, 1777",2 months and 7 days,Maryland


In [65]:
df.loc[df.State=="New York"]

Unnamed: 0,City,Building,Start Date,End Date,Duration,State
0,Albany,Stadt Huys,"June 19, 1754","July 11, 1754",22 days,New York
1,New York,City Hall,"October 7, 1765","October 25, 1765",23 days,New York
13,New York,City Hall,"January 11, 1785","October 6, 1788","3 years, 11 months and 5 days",New York
14,New York,Federal Hall,"March 4, 1789","December 5, 1790","1 year, 9 months and 1 day",New York


## DataFrames as HTML

#### Pandas also has a `to_html` method that we can use to generate HTML tables from DataFrames.

In [51]:
html_table = df.to_html()
html_table

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>City</th>\n      <th>Building</th>\n      <th>Start Date</th>\n      <th>End Date</th>\n      <th>Duration</th>\n      <th>State</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Albany</td>\n      <td>Stadt Huys</td>\n      <td>June 19, 1754</td>\n      <td>July 11, 1754</td>\n      <td>22\xa0days</td>\n      <td>New York</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>New York</td>\n      <td>City Hall</td>\n      <td>October 7, 1765</td>\n      <td>October 25, 1765</td>\n      <td>23\xa0days</td>\n      <td>New York</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Philadelphia</td>\n      <td>Carpenters\' Hall</td>\n      <td>September 5, 1774</td>\n      <td>October 26, 1774</td>\n      <td>1\xa0month and 21\xa0days</td>\n      <td>Pennsylvania</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Philadelphia</td>\n      <td>In

#### You may have to strip unwanted newlines to clean up the table.

In [53]:
html_table.replace('\n', '')

'<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>City</th>      <th>Building</th>      <th>Start Date</th>      <th>End Date</th>      <th>Duration</th>      <th>State</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>Albany</td>      <td>Stadt Huys</td>      <td>June 19, 1754</td>      <td>July 11, 1754</td>      <td>22\xa0days</td>      <td>New York</td>    </tr>    <tr>      <th>1</th>      <td>New York</td>      <td>City Hall</td>      <td>October 7, 1765</td>      <td>October 25, 1765</td>      <td>23\xa0days</td>      <td>New York</td>    </tr>    <tr>      <th>2</th>      <td>Philadelphia</td>      <td>Carpenters\' Hall</td>      <td>September 5, 1774</td>      <td>October 26, 1774</td>      <td>1\xa0month and 21\xa0days</td>      <td>Pennsylvania</td>    </tr>    <tr>      <th>3</th>      <td>Philadelphia</td>      <td>Independence Hall</td>      <td>May 10, 1775</td>      <td>December 12, 1776</td>      <
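A small sketch of the same idea on made-up data follows. The `index=False` and `classes` arguments are optional extras not used in the cells above: the first omits the row-number column, and the second adds CSS class names (the Bootstrap-style names here are only an example).

```python
import pandas as pd

df = pd.DataFrame(
    {"City": ["Albany", "Baltimore"], "State": ["New York", "Maryland"]}
)

# Render the frame as an HTML <table> string without the index column
html_table = df.to_html(index=False, classes="table table-striped")

# Remove newlines so the markup can be embedded on a single line
compact = html_table.replace("\n", "")
print(compact)
```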

You can also save the table directly to a file.

In [54]:
df.to_html('2/Activities/09-Ins_Pandas_Scraping/Solved/table.html')

In [80]:
# macOS users can run this to open the file in a browser.
# On other systems (the error below comes from Windows), find the file and open it in the browser manually.
!open 2/Activities/09-Ins_Pandas_Scraping/Solved/table.html

'open' is not recognized as an internal or external command,
operable program or batch file.
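The `open` command exists only on macOS, which is why the cell above fails elsewhere. One portable alternative, sketched here under the assumption that a default browser is configured, is the standard library's `webbrowser` module. The helper names `file_uri` and `open_in_browser` are hypothetical.

```python
import webbrowser
from pathlib import Path


def file_uri(path):
    """Return a file:// URI for a local path, resolved to an absolute path."""
    return Path(path).resolve().as_uri()


def open_in_browser(path):
    """Ask the default browser to open a local HTML file."""
    return webbrowser.open(file_uri(path))


# Example (hypothetical file name; left commented out because it launches a browser):
# open_in_browser("table.html")
print(file_uri("table.html"))
```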


# ==========================================

### 2.10 Students Do: Doctor Decoder (0:10)

In this activity, you will use `read_html` from Pandas to scrape a Wikipedia article. You will then use the resulting DataFrame to convert a list of medical abbreviations to their full description.

## Instructions

* Use Pandas' `read_html` to parse the URL.

* Find the medical abbreviations DataFrame in the list of DataFrames and assign it to `df`.

  * Assign the columns `['abb', 'full_name', 'other']`

* Drop the `other` column from the DataFrame.

* Drop the header row (the first row) and set the index to the `abb` column.

* Loop through the list of medical abbreviations and print the abbreviation along with the full description.

  * Use the DataFrame to perform the lookup.

# Doctor Decoder

Use Pandas scraping to help decode the medical abbreviations that a doctor might use.

In [66]:
import pandas as pd

Use Pandas to scrape the following site and decode the medical abbreviations in the list.

In [67]:
url = 'https://en.wikipedia.org/wiki/List_of_medical_abbreviations'
med_abbreviations = ['BMR', 'BP', 'ECG', 'MRI', 'qid', 'WBC']

In [68]:
# Use Pandas' `read_html` to parse the url
### BEGIN SOLUTION
tables = pd.read_html(url)
len(tables)
### END SOLUTION

6

In [69]:
tables[2].head()

Unnamed: 0,0,1,2
0,EG abb,EG full name,"Other(ver change, need to know...etc.)"
1,ABG,arterial blood gas,
2,ACE,angiotensin-converting enzyme,
3,ACTH,adrenocorticotropic hormone,
4,ADH,antidiuretic hormone,


In [70]:
# Find the medical abbreviations DataFrame in the list of DataFrames and assign it to `df`
# Assign the columns `['abb', 'full_name', 'other']`
### BEGIN SOLUTION
df = tables[2]
df.columns = ['abb', 'full_name', 'other']
df.head()
### END SOLUTION

Unnamed: 0,abb,full_name,other
0,EG abb,EG full name,"Other(ver change, need to know...etc.)"
1,ABG,arterial blood gas,
2,ACE,angiotensin-converting enzyme,
3,ACTH,adrenocorticotropic hormone,
4,ADH,antidiuretic hormone,


Clean up the extra column and the header row

In [None]:
# drop the `other` column
### BEGIN SOLUTION
del df['other']
### END SOLUTION

In [66]:
# Drop the first row and set the index to the `abb` column.
# Run this cell only once: after `abb` becomes the index,
# a second run raises the KeyError shown below.
### BEGIN SOLUTION
df = df.iloc[1:].set_index('abb')
df.head()
### END SOLUTION

KeyError: "None of ['abb'] are in the columns"

In [67]:
# Loop through the list of medical abbreviations and print the abbreviation
# along with the full description.
# Use the DataFrame to perform the lookup.
### BEGIN SOLUTION
for abb in med_abbreviations:
    print(abb, df.loc[abb].full_name)
### END SOLUTION

BMR basal metabolic rate
BP blood pressure
ECG electrocardiogram
MRI magnetic resonance imaging
qid 4 times a day
WBC white blood cell
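The lookup relies on `abb` being the index, so `df.loc[abb]` raises a `KeyError` for any abbreviation missing from the table. The sketch below shows the same lookup on a small made-up table with a guard for missing keys; the `decode` helper is a hypothetical name, and the sketch assumes each abbreviation appears only once in the index.

```python
import pandas as pd

# Small illustrative table, indexed by abbreviation
df = pd.DataFrame(
    {
        "abb": ["BP", "ECG", "MRI"],
        "full_name": [
            "blood pressure",
            "electrocardiogram",
            "magnetic resonance imaging",
        ],
    }
).set_index("abb")


def decode(abb, table):
    """Return the full name for an abbreviation, or None if it is not listed."""
    if abb in table.index:
        return table.loc[abb, "full_name"]
    return None


for abb in ["BP", "MRI", "XYZ"]:
    print(abb, decode(abb, df))
```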


# ==========================================

### Rating Class Objectives

* Rate your understanding of each objective on a scale of 1 to 5.

In [None]:
objectives = [
    "Use Beautiful Soup to scrape your own data from the web",
    "Save the results of web scraping into MongoDB",
]
rating = []
total = 0
# Ask for a 1-5 rating per objective and keep a running total
for objective in objectives:
    rate = input(objective + "? ")
    total += int(rate)
    rating.append(objective + ". (" + rate + "/5)")
print("="*96)
print("My rating today is:")
print("-"*24)
for i in rating:
    print(i)
print("-"*64)
print("Average: " + str(total/len(objectives)))