## 12-Web-Scraping-and-Document-Databases - Day 2 - Web Scraping

### Class Objectives

* Use Beautiful Soup to scrape your own data from the web.
* Save the results of web scraping into MongoDB.

### Presentation:
* [Web Scraping](https://ucb.bootcampcontent.com/UCB-Coding-Bootcamp/ucb-bel-data-pt-10-2020-u-c/blob/master/01-Lesson-Plans/12-Web-Scraping-and-Document-Databases/Slideshows/Data-12.2-Web_Scraping.pdf)


### Resorces:
* [Selenium](https://splinter.readthedocs.io/en/latest/drivers/chrome.html)
* [Python Requests](http://docs.python-requests.org/en/master/)
* [Webscraping with BeautifulSoup](https://www.dataquest.io/blog/web-scraping-tutorial-python/)
* [Beautiful soup intro](https://www.pythonforbeginners.com/beautifulsoup/scraping-websites-with-beautifulsoup)
* [Python Splinter](https://splinter.readthedocs.io/en/latest/)
* [Splinter Docs](https://splinter.readthedocs.io/en/latest/drivers/chrome.html)


### Install
* Beautiful Soup `[sudo] pip install bs4`
* Selenium `[sudo] pip install selenium`
* html5 `[sudo] pip install html5 lib`
* Splinter `[sudo] pip install splinter`
* Webdriver Manager `[sudo] pip install webdriver-manager`

# ==========================================

### 2.01 Instructor Do: Introduction to Beautiful Soup (0:10)

In [1]:
# Dependencies
from bs4 import BeautifulSoup as bs

In [2]:
html_string = """
<html>
<head>
  <title>
     A Simple HTML Document
  </title>
</head>
  <body>
    <p>This is a very simple HTML document</p>
    <p>It only has two paragraphs</p>
  </body>
</html>
"""

In [3]:
# Create a Beautiful Soup object
soup = bs(html_string, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [4]:
# Print formatted version of the soup
print(soup.prettify())

<html>
 <head>
  <title>
   A Simple HTML Document
  </title>
 </head>
 <body>
  <p>
   This is a very simple HTML document
  </p>
  <p>
   It only has two paragraphs
  </p>
 </body>
</html>



In [5]:
# Extract the title of the HTML document
soup.title

<title>
     A Simple HTML Document
  </title>

In [6]:
# Extract the text of the title
soup.title.text

'\n     A Simple HTML Document\n  '

In [7]:
# Clean up the text
soup.title.text.strip()

'A Simple HTML Document'

In [8]:
# Extract the contents of the HTML body
soup.body

<body>
<p>This is a very simple HTML document</p>
<p>It only has two paragraphs</p>
</body>

In [9]:
# Extract the text of the body
soup.body.text

'\nThis is a very simple HTML document\nIt only has two paragraphs\n'

In [10]:
# Text of the first paragraph
soup.body.p.text

'This is a very simple HTML document'

In [11]:
# Extract all paragraph elements
soup.body.find_all('p')

[<p>This is a very simple HTML document</p>, <p>It only has two paragraphs</p>]

In [12]:
texts = [x.text.strip() for x in soup.body.find_all('p')]

texts

['This is a very simple HTML document', 'It only has two paragraphs']

In [13]:
# Extract paragraph by index
soup.body.find_all('p')[0].text.strip()

'This is a very simple HTML document'

In [14]:
soup.body.find_all('p')[1]

<p>It only has two paragraphs</p>

In [15]:
# The text of the first paragraph
soup.body.find('p').text

'This is a very simple HTML document'

# ==========================================

### 2.02 Students Do: CNN Soup (0:15)

# A Soup Starter

## Instructions

* Believe it or not, CNN's website for **1996: Year in Review** is still alive on the web: <http://edition.cnn.com/EVENTS/1996/year.in.review/>

* We have, however, stored the HTML document as a string in your starter file.

* Your task, should you accept it (and you should), is to use Beautiful Soup to scrape and print the following pieces of information:

1. The **title**

2. All **paragraph** texts

3. The top 10 headlines (warning: this one is a bit tricky, and may not come out perfectly!)

## Hints

* For the third task, you will need a means of filtering the data, perhaps over multiple iterations.

## Bonus

* If you finish early, head over to the [Beautful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to read up on accessing `attributes` and navigating the DOM.


In [16]:
# Dependencies
import os
from bs4 import BeautifulSoup as bs

In [17]:
# Read HTML from file
filepath = os.path.join("2", "Activities", "02-Stu_CNNSoup", "Resources", "template.html")
with open(filepath) as file:
    html = file.read()

In [18]:
# Create a Beautiful Soup object
soup = bs(html, 'lxml')

In [19]:
# Extract title text
title = soup.title.text
print(title)

Top Ten Stories From 1996


In [20]:
# Print all paragraph texts
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print("-"*64)
    print(paragraph.text.strip())

----------------------------------------------------------------

----------------------------------------------------------------
What were the biggest stories of the year?

It's a question journalists like to ask themselves at the end of every
            year. Now you can join in the process. Here are our selections for the top ten news
            stories of 1996.

            Disagree with our choices? Then tell us what stories you think were most compelling
            in the poll below.
----------------------------------------------------------------

----------------------------------------------------------------
What makes a big
            story BIG?
----------------------------------------------------------------
It depends on your criteria, of course, and your perspective. That's why we offered
            a poll to find out what you think.
----------------------------------------------------------------
For our list, we polled producers throughout the CNN/Pathfinder famil

In [21]:
# Print all ten headlines
tds = soup.find_all('td')
# A blank list to hold the headlines
headlines = []
# Loop over td elements
for td in tds:
    # If td element has an anchor...
    if (td.a):
        # And the anchor has non-blank text...
        if (td.a.text):
            # Append the td to the list
            headlines.append(td)

In [22]:
headlines

[<td><a href="topten/israel/israel.index.html" target="_top"><b>Israel</b> elects <b>Netanyahu</b></a></td>,
 <td><a href="topten/twa/twa.index.html" target="_top">Crash of TWA Flight 800</a></td>,
 <td><a href="topten/yeltsin/yeltsin.index.html" target="_top"><b>Russia</b> elects <b>Yeltsin</b></a></td>,
 <td><a href="topten/clinton/clinton.index.html" target="_top"><b>U.S</b>. elects <b>Clinton</b></a></td>,
 <td><a href="topten/hutu/hutu.index.html" target="_top"><b>Hutu-Tutsi</b> conflict in central Africa</a></td>,
 <td><a href="topten/bosnia/bosnia.index.html" target="_top">Peace, elections in <b>Bosnia</b></a></td>,
 <td><a href="topten/saudi/saudi.index.html" target="_top"><b>U.S</b>. base bombed in <b>Saudi Arabia</b></a></td>,
 <td><a href="topten/olympics/olympics.index.html" target="_top">Centennial <b>Olympic</b> Games</a></td>,
 <td><a href="topten/aids/aids.index.html" target="_top">Advances against <b>AIDS</b></a></td>,
 <td><a href="topten/unabomb/unabomb.index.html" t

In [23]:
# Print only the headlines
for x in range(10):
    print(headlines[x].text)        


Israel elects Netanyahu
Crash of TWA Flight 800
Russia elects Yeltsin
U.S. elects Clinton
Hutu-Tutsi conflict in central Africa
Peace, elections in Bosnia
U.S. base bombed in Saudi Arabia
Centennial Olympic Games
Advances against AIDS
Unabomb suspect Ted Kaczynski arrested


# ==========================================

### 2.03 Instructor Do: Craig's Wishlist (0:05)

In [24]:
# Dependencies
from bs4 import BeautifulSoup
import requests

In [30]:
# URL of page to be scraped
url = 'https://newjersey.craigslist.org/search/sss?sort=rel&query=guitar'
url = 'https://sfbay.craigslist.org/search/sss?sort=rel&query=guitar'

In [31]:
# Retrieve page with the requests module
response = requests.get(url)

In [32]:
# Create BeautifulSoup object; parse with 'html.parser'
soup = BeautifulSoup(response.text, 'html.parser')

In [33]:
# Examine the results, then determine element that contains sought info
print(soup.prettify())

﻿
<!DOCTYPE html>
<html class="no-js">
 <head>
  <title>
   SF bay area for sale "guitar"  - craigslist
  </title>
  <script id="ld_breadcrumb_data" type="application/ld+json">
   {"@context":"https://schema.org","itemListElement":[{"item":{"name":"sfbay.craigslist.org","@id":"https://sfbay.craigslist.org"},"position":1,"@type":"ListItem"},{"item":{"name":"for sale","@id":"https://sfbay.craigslist.org/d/for-sale/search/sss"},"position":2,"@type":"ListItem"}],"@type":"BreadcrumbList"}
  </script>
  <meta content="" name="description"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible">
   <link href="https://sfbay.craigslist.org/d/for-sale/search/sss?query=guitar&amp;sort=rel" rel="canonical"/>
   <link href="https://sfbay.craigslist.org/d/for-sale/search/sss?s=120&amp;query=guitar&amp;sort=rel" rel="next"/>
   <meta content="width=device-width,initial-scale=1" name="viewport"/>
   <link href="//www.craigslist.org/styles/cl.css?v=5ea548767c8f312eb8ee55e79d68d2c4" media="all" rel="s

In [34]:
# results are returned as an iterable list
results = soup.find_all('li', class_="result-row")
results[0]

<li class="result-row" data-pid="7257951203" data-repost-of="7085896188">
<a class="result-image gallery" data-ids="3:00J0J_2hBDgVQwF5gz_0t20CI,3:00h0h_iCF9XFoRYgDz_0CI0t2,1:00202_eR3IfcyZlkD" href="https://sfbay.craigslist.org/eby/msg/d/hayward-guitar-bass-straps/7257951203.html">
<span class="result-price">$0</span>
</a>
<div class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2021-01-09 09:27" title="Sat 09 Jan 09:27:42 AM">Jan  9</time>
<h3 class="result-heading">
<a class="result-title hdrlnk" data-id="7257951203" href="https://sfbay.craigslist.org/eby/msg/d/hayward-guitar-bass-straps/7257951203.html" id="postid_7257951203">Guitar / Bass Straps</a>
</h3>
<span class="result-meta">
<span class="result-price">$0</span>
<span class="result-hood"> (hayward / castro valley)</span>
<span class="result-tags">
<span class="pictag">pic</span>
</span>
<span class="banish icon 

In [35]:
# Loop through returned results
for result in results:
    # Error handling
    try:
        # Identify and return title of listing
        title = result.find('a', class_="result-title").text
        # Identify and return price of listing
        price = result.find('span', class_="result-price").text
#         price = result.a.span.text
        # Identify and return link to listing
        link = result.a['href']
#         image = result.a.img
        image = result.find_all('img')

        # Print results only if title, price, and link are available
        if (title and price and link):
            print('-------------')
            print(title)
            print(price)
#             image = result.a.img
#             print(image)
#             print(result)
            print(link)
    except AttributeError as e:
        print(e)

-------------
Guitar / Bass Straps
$0
https://sfbay.craigslist.org/eby/msg/d/hayward-guitar-bass-straps/7257951203.html
-------------
Very Nice 1970 Yamaha FG-230 12-String Acoustic Guitar
$535
https://sfbay.craigslist.org/sby/msg/d/morgan-hill-very-nice-1970-yamaha-fg/7255194460.html
-------------
Autographed Epiphone SG Guitar for Trade
$0
https://sfbay.craigslist.org/eby/msg/d/tracy-autographed-epiphone-sg-guitar/7259407091.html
-------------
Ampeg R12R Reverberocket Guitar Amp
$300
https://sfbay.craigslist.org/nby/msg/d/san-anselmo-ampeg-r12r-reverberocket/7259404572.html
-------------
2007 Stromberg Newport Archtop Jazz Guitar
$800
https://sfbay.craigslist.org/eby/msg/d/union-city-2007-stromberg-newport/7254065614.html
-------------
Child Size Guitar
$35
https://sfbay.craigslist.org/eby/msg/d/danville-child-size-guitar/7256381738.html
-------------
NEW Xtonebox Carbon Booster Pedal for Guitar & Bass (Retail:$155)
$60
https://sfbay.craigslist.org/eby/msg/d/union-city-new-xtonebox-c

# ==========================================

### 2.04 Students Do: Reddit Scraper (0:10)

## Instructions

* In this activity, you will scrape the Programmer-Humor.html file provided

* Use Beautiful Soup to scrape only threads that have twenty or more comments, then print the thread's title, number of comments, and the URL to the thread.

## Bonus

* If you finish early, try to display each thread's top comment in your output!

* As an added bonus try re-scraping using the URL instead. What happens when you try to do this? Why might this be happening?

In [104]:
# Dependencies
from bs4 import BeautifulSoup
import requests
import os

In [105]:
filepath = os.path.join("2/Activities/04-Stu_RedditScrape/Unsolved/Programmer-Humor.html")
with open(filepath, encoding='utf-8') as file:
    html = file.read()

In [106]:
# Create BeautifulSoup object; parse with 'html.parser'
soup = BeautifulSoup(html, 'html.parser')

In [107]:
print(soup.prettify())

<!DOCTYPE html>
<!-- saved from url=(0041)https://www.reddit.com/r/ProgrammerHumor/ -->
<html class="js cssanimations csstransforms" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Programmer Humor
  </title>
  <meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/>
  <meta content="reddit: the front page of the internet" name="description"/>
  <meta content="always" name="referrer"/>
  <link href="https://www.reddit.com/static/opensearch.xml" rel="search" type="application/opensearchdescription+xml"/>
  <link href="https://www.reddit.com/r/ProgrammerHumor/" rel="canonical"/>
  <meta content="width=1024" name="viewport"/>
  <link href="https://out.reddit.com/" rel="dns-prefetch"/>
  <link href="https://out.reddit.com/" rel="preconnect"/>
  <meta content="https://www.redditstatic.com/icon.png" property="og:image"/>
  <meta content="reddit" property="og:site_nam

In [109]:
# Find the number of subscribers
number_subscribers = soup.find("span", class_='subscribers').find('span', class_='number').text
print(f"The number of subscribers: {number_subscribers}")

The number of subscribers: 422,381


In [111]:
# Examine the results, then determine element that contains sought info
# results are returned as an iterable list
# Examine the results and look for a div withe the class 'top-matter'
results = soup.find_all('div', class_='top-matter')
len(results)

27

In [114]:
# Loop through returned results
for result in results:
    
    # Retrieve the thread title
    title = result.find('p', class_='title')
    
    
    # Access the thread's text content
    title_text = title.a.text
#     print(title_text)

    try:
        # Access the thread with CSS selectors
        thread = result.find('li', class_='first')    

        # The number of comments made in the thread
        comments = thread.text.lstrip()

        # Parse string, e.g. '47 comments' for possible numeric manipulation
        comments_num = int(comments.split()[0])

        # Access the href attribute with bracket notation
        link = thread.a['href']

        # Run if the thread has 20 or more comments
        if (comments_num >=20 ):
            print('\n-----------------\n')
            print(title_text)
            print('Comments:', comments_num)
            print(link)
    except AttributeError as e:
        print(e)

'NoneType' object has no attribute 'text'

-----------------

[Meta] Clarification on rules
Comments: 79
https://www.reddit.com/r/ProgrammerHumor/comments/6y2b47/meta_clarification_on_rules/

-----------------

Doing conditionals
Comments: 258
https://www.reddit.com/r/ProgrammerHumor/comments/7pw5qk/doing_conditionals/

-----------------

Perfect date
Comments: 58
https://www.reddit.com/r/ProgrammerHumor/comments/7pyyl2/perfect_date/

-----------------

The truth about java.
Comments: 61
https://www.reddit.com/r/ProgrammerHumor/comments/7pxod4/the_truth_about_java/

-----------------

It all makes sense now.
Comments: 341
https://www.reddit.com/r/ProgrammerHumor/comments/7pp66f/it_all_makes_sense_now/

-----------------

This is where US's bandwidth going.
Comments: 20
https://www.reddit.com/r/ProgrammerHumor/comments/7pv1ta/this_is_where_uss_bandwidth_going/


In [115]:
# BONUS
# Try to scrape the site using the URL
url = 'https://www.reddit.com/r/ProgrammingHumor/'

# Retrieve page with the requests module
html = requests.get(url)

In [116]:
# Create BeautifulSoup object; parse with 'html.parser'
soup = BeautifulSoup(html.text, 'html.parser')

In [117]:
# Display how different this HTML looks
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <script>
   var __SUPPORTS_TIMING_API = typeof performance === 'object' && !!performance.mark && !! performance.measure && !!performance.getEntriesByType;
    function __perfMark(name) { __SUPPORTS_TIMING_API && performance.mark(name); };
    var __firstPostLoaded = false;
    function __markFirstPostVisible() {
      if (__firstPostLoaded) { return; }
      __firstPostLoaded = true;
      __perfMark("first_post_title_image_loaded");
    }
    var __firstCommentLoaded = false;
    function __markFirstCommentVisible() {
      if (__firstCommentLoaded) { return; }
      __firstCommentLoaded = true;
      __perfMark("first_comment_loaded");
    }
  </script>
  <script>
   __perfMark('head_tag_start');
  </script>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="origin-when-cross-origin" name="referrer"/>
  <style>
   /* http://meyerweb.com/eric/tools/css/reset/
    v2.0 | 201101

# ==========================================

### 2.05 Instructor Do: Mongo Craig (0:10)

In [134]:
# Dependencies
from bs4 import BeautifulSoup
import requests
import pymongo

In [135]:
# Initialize PyMongo to work with MongoDBs
conn = 'mongodb://localhost:27017'
client = pymongo.MongoClient(conn)

In [136]:
# Define database and collection
db = client.craigslist_db
collection = db.items

In [137]:
# URL of page to be scraped
url = 'https://newjersey.craigslist.org/search/sss?sort=rel&query=guitar'
url = 'https://sfbay.craigslist.org/search/sss?sort=rel&query=guitar'

# Retrieve page with the requests module
response = requests.get(url)
# Create BeautifulSoup object; parse with 'lxml'
soup = BeautifulSoup(response.text, 'lxml')

In [138]:
# Examine the results, then determine element that contains sought info
# results are returned as an iterable list
results = soup.find_all('li', class_='result-row')

# Loop through returned results
for result in results:
    # Error handling
    try:
        # Identify and return title of listing
        title = result.find('a', class_='result-title').text
        # Identify and return price of listing
#         price = result.a.span.text
        price = result.find('span', class_="result-price").text
        price = float(price.replace("$", ""))
        # Identify and return link to listing
        link = result.a['href']

        # Run only if title, price, and link are available
        if (title and price and link):
            # Print results
            print('-------------')
            print(title)
            print(price)
            print(link)

            # Dictionary to be inserted as a MongoDB document
            post = {
                'title': title,
                'price': price,
                'url': link
            }

            collection.insert_one(post)

    except Exception as e:
        print(e)

-------------
Laguna Electric Guitar LE122
80.0
https://sfbay.craigslist.org/sby/msg/d/san-jose-laguna-electric-guitar-le122/7244968364.html
-------------
Fender Acoustic Guitar
70.0
https://sfbay.craigslist.org/scz/msg/d/santa-cruz-fender-acoustic-guitar/7255868125.html
-------------
Agile AL-2000 electric guitar with Seymour Duncan pickups
300.0
https://sfbay.craigslist.org/pen/msg/d/san-bruno-agile-al-2000-electric-guitar/7255869018.html
-------------
Fender R.A.D amplifier (guitar)
100.0
https://sfbay.craigslist.org/sfc/msg/d/san-francisco-fender-rad-amplifier/7252415537.html
-------------
Montoya classical guitar
90.0
https://sfbay.craigslist.org/eby/msg/d/berkeley-montoya-classical-guitar/7255856557.html
-------------
Montoya guitar
100.0
https://sfbay.craigslist.org/eby/msg/d/berkeley-montoya-guitar/7254112496.html
-------------
Guitar Setup
100.0
https://sfbay.craigslist.org/sfc/msg/d/san-francisco-guitar-setup/7242886929.html
-------------
Ortola Classical Guitar Case Backpack

In [139]:
# Display items in MongoDB collection
listings = db.items.find()

for listing in listings:
    print(listing)

{'_id': ObjectId('5ff0cee91f7a3580057808f2'), 'title': 'Laguna Electric Guitar LE122', 'price': 80.0, 'url': 'https://sfbay.craigslist.org/sby/msg/d/san-jose-laguna-electric-guitar-le122/7244968364.html'}
{'_id': ObjectId('5ff0cee91f7a3580057808f3'), 'title': 'Fender Acoustic Guitar', 'price': 70.0, 'url': 'https://sfbay.craigslist.org/scz/msg/d/santa-cruz-fender-acoustic-guitar/7255868125.html'}
{'_id': ObjectId('5ff0cee91f7a3580057808f4'), 'title': 'Agile AL-2000 electric guitar with Seymour Duncan pickups', 'price': 300.0, 'url': 'https://sfbay.craigslist.org/pen/msg/d/san-bruno-agile-al-2000-electric-guitar/7255869018.html'}
{'_id': ObjectId('5ff0cee91f7a3580057808f5'), 'title': 'Fender R.A.D amplifier (guitar)', 'price': 100.0, 'url': 'https://sfbay.craigslist.org/sfc/msg/d/san-francisco-fender-rad-amplifier/7252415537.html'}
{'_id': ObjectId('5ff0cee91f7a3580057808f6'), 'title': 'Montoya classical guitar', 'price': 90.0, 'url': 'https://sfbay.craigslist.org/eby/msg/d/berkeley-mon

# ==========================================

### 2.06 Students Do: Hockey Headers (0:10)

Teamwork! Speed! Mental and physical toughness! Passion! Excitement! Unpredictable matchups down to the wire! What could be better? While these terms could easily be applied to a data science hackathon, we're talking about the magnificent sport of hockey.

Your assignment is to scrape the articles on the news page of the NHL website - which is frequently updated - and then post the results of your scraping to MongoDB.

## Instructions

* Use Beautiful Soup and requests to scrape the header and subheader of each article on the news page.

* Post the above information as a MongoDB document and then print all of the documents on the database to the console.

* In addition to the above, post the date of the article publication as well.


In [141]:
# Dependencies
from bs4 import BeautifulSoup
import requests
import pymongo

In [142]:
# Initialize PyMongo to work with MongoDBs
conn = 'mongodb://localhost:27017'
client = pymongo.MongoClient(conn)

In [143]:
# Define database and collection
db = client.nhl_db
collection = db.articles

In [144]:
# URL of page to be scraped
url = 'https://www.nhl.com/news'

# Retrieve page with the requests module
response = requests.get(url)
# Create BeautifulSoup object; parse with 'lxml'
soup = BeautifulSoup(response.text, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en_US">
 <head>
  <title>
   NHL Hockey News | NHL.com
  </title>
  <!-- meta meta tag -->
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="no-cache" http-equiv="Cache-Control"/>
  <meta content="no-cache" http-equiv="Pragma"/>
  <meta content="-1" http-equiv="Expires"/>
  <meta content="en" http-equiv="content-language"/>
  <meta content="nhl, nhl.com, www.nhl.com, playoffs, scores, video, photos, standings, news, features, players, shop, auctions, tickets, mobile, game center live, stanley cup, winter classic, draft, free agency" name="keywords"/>
  <meta content="US" name="countryCode"/>
  <meta content="NHL Hockey News" property="og:title"/>
  <meta content="NHL Hockey News NHL.com" itemprop="name"/>
  <meta content="NHL.com" property="og:site_name"/>
  <meta content="website" property="og:type"/>
  <meta content="https://cms.nhl.bamgrid.com/images/photos/3199

In [145]:
# Retrieve the parent divs for all articles
results = soup.find_all('div', class_='article-item__top')

# loop over results to get article data
for result in results:
    # scrape the article header 
    header = result.find('h1', class_='article-item__headline').text
    
    # scrape the article subheader
    subheader = result.find('h2', class_='article-item__subheader').text
    
    # scrape the datetime
    datetime = result.find('span', class_='article-item__date')['data-date'] 
    
    # get only the date from the datetime
    date = datetime.split('T')[0]
    
    # print article data
    print('-----------------')
    print(header)
    print(subheader)
    print(datetime)
    print(date)

    # Dictionary to be inserted into MongoDB
    post = {
        'header': header,
        'subheader': subheader,
        'date': date,
    }

    # Insert dictionary into MongoDB as a document
    collection.insert_one(post)

-----------------
NHL teams that missed playoffs start training camp
Red Wings, Senators, Devils among those fired up for first practice
2021-01-01T16:54:21-0500
2021-01-01
-----------------
Lightning season preview: Return of Stamkos bodes well for Cup defense
Kucherov's absence negated by captain's health, arrival of key prospects
2021-01-02T00:00:00-0500
2021-01-02
-----------------
Hall takes ice for first time with Sabres
Forward says familiarity with coach Krueger will help him get acclimated quicker in training camp
2021-01-01T17:07:33-0500
2021-01-01
-----------------
3 'Star' keys for United States against Slovakia in WJC quarterfinals
NHL Network analyst Starman says goalie decision, play of forwards will be important
2021-01-02T09:52:51-0500
2021-01-02
-----------------
Stars season preview: Better start vital without Seguin, Bishop
Western Conference champs not expected to have injured forward, goalie for first two months
2021-01-02T00:01:59-0500
2021-01-02
----------------

In [146]:
# Display the MongoDB records created above
articles = db.articles.find()
for article in articles:
    print(article)

{'_id': ObjectId('5ff0d10a1f7a358005780959'), 'header': 'NHL teams that missed playoffs start training camp', 'subheader': 'Red Wings, Senators, Devils among those fired up for first practice', 'date': '2021-01-01'}
{'_id': ObjectId('5ff0d10a1f7a35800578095a'), 'header': 'Lightning season preview: Return of Stamkos bodes well for Cup defense', 'subheader': "Kucherov's absence negated by captain's health, arrival of key prospects", 'date': '2021-01-02'}
{'_id': ObjectId('5ff0d10a1f7a35800578095b'), 'header': 'Hall takes ice for first time with Sabres', 'subheader': 'Forward says familiarity with coach Krueger will help him get acclimated quicker in training camp', 'date': '2021-01-01'}
{'_id': ObjectId('5ff0d10a1f7a35800578095c'), 'header': "3 'Star' keys for United States against Slovakia in WJC quarterfinals", 'subheader': 'NHL Network analyst Starman says goalie decision, play of forwards will be important', 'date': '2021-01-02'}
{'_id': ObjectId('5ff0d10a1f7a35800578095d'), 'header'

# ==========================================

# BREAK (0:30)

# ==========================================

### 2.07 Instructor Do: Introduction to Splinter (0:15)

In [3]:
from splinter import Browser
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager

In [9]:
# Setup splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

[WDM] - Current google-chrome version is 87.0.4280
[WDM] - Get LATEST driver version for 87.0.4280


 


[WDM] - Driver [C:\Users\k\.wdm\drivers\chromedriver\win32\87.0.4280.88\chromedriver.exe] found in cache


# Mac Users

In [10]:
# https://splinter.readthedocs.io/en/latest/drivers/chrome.html
# !which chromedriver
# executable_path = {'executable_path': '/usr/local/bin/chromedriver'}


# Windows Users

In [11]:
# executable_path = {'executable_path': '2/Activities/07-Ins_Splinter/Solved/chromedriver.exe'}

In [12]:
# import os
# if os.name=="nt":
#     executable_path = {'executable_path': './chromedriver.exe'}
# else:
#     executable_path = {"executable_path": "/usr/local/bin/chromedriver"}

In [13]:
# browser = Browser('chrome', **executable_path, headless=False)

In [14]:
url = 'http://quotes.toscrape.com/'
browser.visit(url)

In [15]:
for x in range(1, 6):

    html = browser.html
    soup = BeautifulSoup(html, 'html.parser')

    quotes = soup.find_all('span', class_='text')

    for quote in quotes:
        print('page:', x, '-------------')
        print(quote.text)

    browser.click_link_by_partial_text('Next')

page: 1 -------------
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
page: 1 -------------
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
page: 1 -------------
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
page: 1 -------------
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
page: 1 -------------
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
page: 1 -------------
“Try not to become a man of success. Rather become a man of value.”
page: 1 -------------
“It is better to be hated for what you are than to be loved for what you are not.”
page: 1 -------------
“I have not failed. I've just found 10,000 ways that won't work.”
page: 1 -------------
“A woman is like a tea bag; you ne

In [59]:
browser.quit()

# ==========================================

### 2.08 Students Do: Bookscraper (0:15)

## Instructions

* Go to <http://books.toscrape.com/>

* Scrape the titles and the URLs to all books on this fictional online bookstore. Display the results in console.

* That's it!

* If you're craving extra challenge, try scraping all books by **category**. Good luck!


In [18]:
from splinter import Browser
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager

In [19]:
# # Mac Users
# # https://splinter.readthedocs.io/en/latest/drivers/chrome.html
# # !which chromedriver
# # executable_path = {'executable_path': '/usr/local/bin/chromedriver'}

# # Windows Users
# import os
# if os.name=="nt":
#     executable_path = {'executable_path': './chromedriver.exe'}
# else:
#     executable_path = {"executable_path": "/usr/local/bin/chromedriver"}

In [20]:
# browser = Browser('chrome', **executable_path, headless=False)

In [21]:
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

[WDM] - Current google-chrome version is 87.0.4280
[WDM] - Get LATEST driver version for 87.0.4280
[WDM] - Driver [C:\Users\k\.wdm\drivers\chromedriver\win32\87.0.4280.88\chromedriver.exe] found in cache


 


In [22]:
url = 'http://books.toscrape.com/'
browser.visit(url)

In [23]:
# Iterate through all pages
for x in range(50):
    # HTML object
    html = browser.html
    # Parse HTML with Beautiful Soup
    soup = BeautifulSoup(html, 'html.parser')
    # Retrieve all elements that contain book information
    articles = soup.find_all('article', class_='product_pod')

    # Iterate through each book
    for article in articles:
        # Use Beautiful Soup's find() method to navigate and retrieve attributes
        h3 = article.find('h3')
        link = h3.find('a')
        href = link['href']
        title = link['title']
        print('-----------')
        print(title)
        print(url + href)

    # Click the 'Next' button on each page
    try:
        browser.links.find_by_partial_text('next').click()
          
    except:
        print("Scraping Complete")


-----------
A Light in the Attic
http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
-----------
Tipping the Velvet
http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
-----------
Soumission
http://books.toscrape.com/catalogue/soumission_998/index.html
-----------
Sharp Objects
http://books.toscrape.com/catalogue/sharp-objects_997/index.html
-----------
Sapiens: A Brief History of Humankind
http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html
-----------
The Requiem Red
http://books.toscrape.com/catalogue/the-requiem-red_995/index.html
-----------
The Dirty Little Secrets of Getting Your Dream Job
http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html
-----------
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhu

-----------
The Nameless City (The Nameless City #1)
http://books.toscrape.com/the-nameless-city-the-nameless-city-1_940/index.html
-----------
The Murder That Never Was (Forensic Instincts #5)
http://books.toscrape.com/the-murder-that-never-was-forensic-instincts-5_939/index.html
-----------
The Most Perfect Thing: Inside (and Outside) a Bird's Egg
http://books.toscrape.com/the-most-perfect-thing-inside-and-outside-a-birds-egg_938/index.html
-----------
The Mindfulness and Acceptance Workbook for Anxiety: A Guide to Breaking Free from Anxiety, Phobias, and Worry Using Acceptance and Commitment Therapy
http://books.toscrape.com/the-mindfulness-and-acceptance-workbook-for-anxiety-a-guide-to-breaking-free-from-anxiety-phobias-and-worry-using-acceptance-and-commitment-therapy_937/index.html
-----------
The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing
http://books.toscrape.com/the-life-changing-magic-of-tidying-up-the-japanese-art-of-decluttering-and-o

-----------
Algorithms to Live By: The Computer Science of Human Decisions
http://books.toscrape.com/algorithms-to-live-by-the-computer-science-of-human-decisions_880/index.html
-----------
A World of Flavor: Your Gluten Free Passport
http://books.toscrape.com/a-world-of-flavor-your-gluten-free-passport_879/index.html
-----------
A Piece of Sky, a Grain of Rice: A Memoir in Four Meditations
http://books.toscrape.com/a-piece-of-sky-a-grain-of-rice-a-memoir-in-four-meditations_878/index.html
-----------
A Murder in Time
http://books.toscrape.com/a-murder-in-time_877/index.html
-----------
A Flight of Arrows (The Pathfinders #2)
http://books.toscrape.com/a-flight-of-arrows-the-pathfinders-2_876/index.html
-----------
A Fierce and Subtle Poison
http://books.toscrape.com/a-fierce-and-subtle-poison_875/index.html
-----------
A Court of Thorns and Roses (A Court of Thorns and Roses #1)
http://books.toscrape.com/a-court-of-thorns-and-roses-a-court-of-thorns-and-roses-1_874/index.html
---------

-----------
Dark Notes
http://books.toscrape.com/dark-notes_800/index.html
-----------
Daring Greatly: How the Courage to Be Vulnerable Transforms the Way We Live, Love, Parent, and Lead
http://books.toscrape.com/daring-greatly-how-the-courage-to-be-vulnerable-transforms-the-way-we-live-love-parent-and-lead_799/index.html
-----------
Close to You
http://books.toscrape.com/close-to-you_798/index.html
-----------
Chasing Heaven: What Dying Taught Me About Living
http://books.toscrape.com/chasing-heaven-what-dying-taught-me-about-living_797/index.html
-----------
Big Magic: Creative Living Beyond Fear
http://books.toscrape.com/big-magic-creative-living-beyond-fear_796/index.html
-----------
Becoming Wise: An Inquiry into the Mystery and Art of Living
http://books.toscrape.com/becoming-wise-an-inquiry-into-the-mystery-and-art-of-living_795/index.html
-----------
Beauty Restored (Riley Family Legacy Novellas #3)
http://books.toscrape.com/beauty-restored-riley-family-legacy-novellas-3_794/in

-----------
My Name Is Lucy Barton
http://books.toscrape.com/my-name-is-lucy-barton_720/index.html
-----------
My Mrs. Brown
http://books.toscrape.com/my-mrs-brown_719/index.html
-----------
My Kind of Crazy
http://books.toscrape.com/my-kind-of-crazy_718/index.html
-----------
Mr. Mercedes (Bill Hodges Trilogy #1)
http://books.toscrape.com/mr-mercedes-bill-hodges-trilogy-1_717/index.html
-----------
More Than Music (Chasing the Dream #1)
http://books.toscrape.com/more-than-music-chasing-the-dream-1_716/index.html
-----------
Made to Stick: Why Some Ideas Survive and Others Die
http://books.toscrape.com/made-to-stick-why-some-ideas-survive-and-others-die_715/index.html
-----------
Luis Paints the World
http://books.toscrape.com/luis-paints-the-world_714/index.html
-----------
Luckiest Girl Alive
http://books.toscrape.com/luckiest-girl-alive_713/index.html
-----------
Lowriders to the Center of the Earth (Lowriders in Space #2)
http://books.toscrape.com/lowriders-to-the-center-of-the-ear

-----------
The Midnight Watch: A Novel of the Titanic and the Californian
http://books.toscrape.com/the-midnight-watch-a-novel-of-the-titanic-and-the-californian_640/index.html
-----------
The Lonely City: Adventures in the Art of Being Alone
http://books.toscrape.com/the-lonely-city-adventures-in-the-art-of-being-alone_639/index.html
-----------
The Gray Rhino: How to Recognize and Act on the Obvious Dangers We Ignore
http://books.toscrape.com/the-gray-rhino-how-to-recognize-and-act-on-the-obvious-dangers-we-ignore_638/index.html
-----------
The Golden Condom: And Other Essays on Love Lost and Found
http://books.toscrape.com/the-golden-condom-and-other-essays-on-love-lost-and-found_637/index.html
-----------
The Epidemic (The Program 0.6)
http://books.toscrape.com/the-epidemic-the-program-06_636/index.html
-----------
The Dinner Party
http://books.toscrape.com/the-dinner-party_635/index.html
-----------
The Diary of a Young Girl
http://books.toscrape.com/the-diary-of-a-young-girl_634

-----------
Chernobyl 01:23:40: The Incredible True Story of the World's Worst Nuclear Disaster
http://books.toscrape.com/chernobyl-012340-the-incredible-true-story-of-the-worlds-worst-nuclear-disaster_560/index.html
-----------
Art and Fear: Observations on the Perils (and Rewards) of Artmaking
http://books.toscrape.com/art-and-fear-observations-on-the-perils-and-rewards-of-artmaking_559/index.html
-----------
A Shard of Ice (The Black Symphony Saga #1)
http://books.toscrape.com/a-shard-of-ice-the-black-symphony-saga-1_558/index.html
-----------
A Hero's Curse (The Unseen Chronicles #1)
http://books.toscrape.com/a-heros-curse-the-unseen-chronicles-1_557/index.html
-----------
23 Degrees South: A Tropical Tale of Changing Whether...
http://books.toscrape.com/23-degrees-south-a-tropical-tale-of-changing-whether_556/index.html
-----------
Zero to One: Notes on Startups, or How to Build the Future
http://books.toscrape.com/zero-to-one-notes-on-startups-or-how-to-build-the-future_555/index

-----------
The Story of Art
http://books.toscrape.com/the-story-of-art_500/index.html
-----------
The Origin of Species
http://books.toscrape.com/the-origin-of-species_499/index.html
-----------
The Great Gatsby
http://books.toscrape.com/the-great-gatsby_498/index.html
-----------
The Good Girl
http://books.toscrape.com/the-good-girl_497/index.html
-----------
The Glass Castle
http://books.toscrape.com/the-glass-castle_496/index.html
-----------
The Faith of Christopher Hitchens: The Restless Soul of the World's Most Notorious Atheist
http://books.toscrape.com/the-faith-of-christopher-hitchens-the-restless-soul-of-the-worlds-most-notorious-atheist_495/index.html
-----------
The Drowning Girls
http://books.toscrape.com/the-drowning-girls_494/index.html
-----------
The Constant Princess (The Tudor Court #1)
http://books.toscrape.com/the-constant-princess-the-tudor-court-1_493/index.html
-----------
The Bourne Identity (Jason Bourne #1)
http://books.toscrape.com/the-bourne-identity-jason

-----------
World Without End (The Pillars of the Earth #2)
http://books.toscrape.com/world-without-end-the-pillars-of-the-earth-2_420/index.html
-----------
Will Grayson, Will Grayson (Will Grayson, Will Grayson)
http://books.toscrape.com/will-grayson-will-grayson-will-grayson-will-grayson_419/index.html
-----------
Why Save the Bankers?: And Other Essays on Our Economic and Political Crisis
http://books.toscrape.com/why-save-the-bankers-and-other-essays-on-our-economic-and-political-crisis_418/index.html
-----------
Where She Went (If I Stay #2)
http://books.toscrape.com/where-she-went-if-i-stay-2_417/index.html
-----------
What If?: Serious Scientific Answers to Absurd Hypothetical Questions
http://books.toscrape.com/what-if-serious-scientific-answers-to-absurd-hypothetical-questions_416/index.html
-----------
Two Summers
http://books.toscrape.com/two-summers_415/index.html
-----------
This Is Your Brain on Music: The Science of a Human Obsession
http://books.toscrape.com/this-is-yo

-----------
Shopaholic Ties the Knot (Shopaholic #3)
http://books.toscrape.com/shopaholic-ties-the-knot-shopaholic-3_340/index.html
-----------
Paper and Fire (The Great Library #2)
http://books.toscrape.com/paper-and-fire-the-great-library-2_339/index.html
-----------
Outlander (Outlander #1)
http://books.toscrape.com/outlander-outlander-1_338/index.html
-----------
Orchestra of Exiles: The Story of Bronislaw Huberman, the Israel Philharmonic, and the One Thousand Jews He Saved from Nazi Horrors
http://books.toscrape.com/orchestra-of-exiles-the-story-of-bronislaw-huberman-the-israel-philharmonic-and-the-one-thousand-jews-he-saved-from-nazi-horrors_337/index.html
-----------
No One Here Gets Out Alive
http://books.toscrape.com/no-one-here-gets-out-alive_336/index.html
-----------
Night Shift (Night Shift #1-20)
http://books.toscrape.com/night-shift-night-shift-1-20_335/index.html
-----------
Needful Things
http://books.toscrape.com/needful-things_334/index.html
-----------
Mockingjay (

-----------
The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses
http://books.toscrape.com/the-lean-startup-how-todays-entrepreneurs-use-continuous-innovation-to-create-radically-successful-businesses_260/index.html
-----------
The Last Painting of Sara de Vos
http://books.toscrape.com/the-last-painting-of-sara-de-vos_259/index.html
-----------
The Land of 10,000 Madonnas
http://books.toscrape.com/the-land-of-10000-madonnas_258/index.html
-----------
The Infinities
http://books.toscrape.com/the-infinities_257/index.html
-----------
The Husband's Secret
http://books.toscrape.com/the-husbands-secret_256/index.html
-----------
The Hitchhiker's Guide to the Galaxy (Hitchhiker's Guide to the Galaxy #1)
http://books.toscrape.com/the-hitchhikers-guide-to-the-galaxy-hitchhikers-guide-to-the-galaxy-1_255/index.html
-----------
The Guns of August
http://books.toscrape.com/the-guns-of-august_254/index.html
-----------
The Guernsey Literar

-----------
Jurassic Park (Jurassic Park #1)
http://books.toscrape.com/jurassic-park-jurassic-park-1_180/index.html
-----------
It's Never Too Late to Begin Again: Discovering Creativity and Meaning at Midlife and Beyond
http://books.toscrape.com/its-never-too-late-to-begin-again-discovering-creativity-and-meaning-at-midlife-and-beyond_179/index.html
-----------
Is Everyone Hanging Out Without Me? (And Other Concerns)
http://books.toscrape.com/is-everyone-hanging-out-without-me-and-other-concerns_178/index.html
-----------
Into the Wild
http://books.toscrape.com/into-the-wild_177/index.html
-----------
Inferno (Robert Langdon #4)
http://books.toscrape.com/inferno-robert-langdon-4_176/index.html
-----------
In the Garden of Beasts: Love, Terror, and an American Family in Hitler's Berlin
http://books.toscrape.com/in-the-garden-of-beasts-love-terror-and-an-american-family-in-hitlers-berlin_175/index.html
-----------
If I Run (If I Run #1)
http://books.toscrape.com/if-i-run-if-i-run-1_174/

-----------
Fruits Basket, Vol. 2 (Fruits Basket #2)
http://books.toscrape.com/fruits-basket-vol-2-fruits-basket-2_100/index.html
-----------
Diary of a Minecraft Zombie Book 1: A Scare of a Dare (An Unofficial Minecraft Book)
http://books.toscrape.com/diary-of-a-minecraft-zombie-book-1-a-scare-of-a-dare-an-unofficial-minecraft-book_99/index.html
-----------
Y: The Last Man, Vol. 1: Unmanned (Y: The Last Man #1)
http://books.toscrape.com/y-the-last-man-vol-1-unmanned-y-the-last-man-1_98/index.html
-----------
While You Were Mine
http://books.toscrape.com/while-you-were-mine_97/index.html
-----------
Where Lightning Strikes (Bleeding Stars #3)
http://books.toscrape.com/where-lightning-strikes-bleeding-stars-3_96/index.html
-----------
When I'm Gone
http://books.toscrape.com/when-im-gone_95/index.html
-----------
Ways of Seeing
http://books.toscrape.com/ways-of-seeing_94/index.html
-----------
Vampire Knight, Vol. 1 (Vampire Knight #1)
http://books.toscrape.com/vampire-knight-vol-1-vampi

-----------
Frankenstein
http://books.toscrape.com/frankenstein_20/index.html
-----------
Forever Rockers (The Rocker #12)
http://books.toscrape.com/forever-rockers-the-rocker-12_19/index.html
-----------
Fighting Fate (Fighting #6)
http://books.toscrape.com/fighting-fate-fighting-6_18/index.html
-----------
Emma
http://books.toscrape.com/emma_17/index.html
-----------
Eat, Pray, Love
http://books.toscrape.com/eat-pray-love_16/index.html
-----------
Deep Under (Walker Security #1)
http://books.toscrape.com/deep-under-walker-security-1_15/index.html
-----------
Choosing Our Religion: The Spiritual Lives of America's Nones
http://books.toscrape.com/choosing-our-religion-the-spiritual-lives-of-americas-nones_14/index.html
-----------
Charlie and the Chocolate Factory (Charlie Bucket #1)
http://books.toscrape.com/charlie-and-the-chocolate-factory-charlie-bucket-1_13/index.html
-----------
Charity's Cross (Charles Towne Belles #4)
http://books.toscrape.com/charitys-cross-charles-towne-belle

In [24]:
browser.quit()

# ==========================================

### 2.09 Instructor Do: Pandas Scraping (0:10)

Scraping with Pandas

In [25]:
import pandas as pd

We can use the `read_html` function in Pandas to automatically scrape any tabular data from a page.

In [40]:
url = 'https://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States'

In [41]:
tables = pd.read_html(url)
len(tables)

8

#### What we get in return is a list of dataframes for any tabular data that Pandas found.

In [42]:
type(tables)

list

#### We can slice off any of those dataframes that we want using normal indexing.

In [43]:
df = tables[0]
df.head()

Unnamed: 0_level_0,City,Building,Start Date,End Date,Duration,Ref
Unnamed: 0_level_1,Albany Congress,Albany Congress,Albany Congress,Albany Congress,Albany Congress,Albany Congress
0,"Albany, New York",Stadt Huys,"June 19, 1754","July 11, 1754",22 days,[8]
1,Stamp Act Congress,Stamp Act Congress,Stamp Act Congress,Stamp Act Congress,Stamp Act Congress,Stamp Act Congress
2,"New York, New York",City Hall,"October 7, 1765","October 25, 1765",23 days,[9]
3,First Continental Congress,First Continental Congress,First Continental Congress,First Continental Congress,First Continental Congress,First Continental Congress
4,"Philadelphia, Pennsylvania",Carpenters' Hall,"September 5, 1774","October 26, 1774",1 month and 21 days,[10]


#### Drop all single header rows

In [44]:
df.columns = df.columns.get_level_values(0)
df.columns

Index(['City', 'Building', 'Start Date', 'End Date', 'Duration', 'Ref'], dtype='object')

In [45]:
df.columns = df.columns.get_level_values(0)
df = df.loc[df.Ref.str.startswith("[")]
df.head()

Unnamed: 0,City,Building,Start Date,End Date,Duration,Ref
0,"Albany, New York",Stadt Huys,"June 19, 1754","July 11, 1754",22 days,[8]
2,"New York, New York",City Hall,"October 7, 1765","October 25, 1765",23 days,[9]
4,"Philadelphia, Pennsylvania",Carpenters' Hall,"September 5, 1774","October 26, 1774",1 month and 21 days,[10]
6,"Philadelphia, Pennsylvania",Independence Hall,"May 10, 1775","December 12, 1776","1 year, 7 months and 2 days",[11]
7,"Baltimore, Maryland",Henry Fite House,"December 20, 1776","February 27, 1777",2 months and 7 days,[12]


#### Slipt column values into two separate columns

In [46]:
columnsplit = df['City'].str.split(", ", expand=True)
df = df.assign(City=columnsplit[0],State=columnsplit[1])
df.head()

Unnamed: 0,City,Building,Start Date,End Date,Duration,Ref,State
0,Albany,Stadt Huys,"June 19, 1754","July 11, 1754",22 days,[8],New York
2,New York,City Hall,"October 7, 1765","October 25, 1765",23 days,[9],New York
4,Philadelphia,Carpenters' Hall,"September 5, 1774","October 26, 1774",1 month and 21 days,[10],Pennsylvania
6,Philadelphia,Independence Hall,"May 10, 1775","December 12, 1776","1 year, 7 months and 2 days",[11],Pennsylvania
7,Baltimore,Henry Fite House,"December 20, 1776","February 27, 1777",2 months and 7 days,[12],Maryland


#### Drop a column

In [47]:
df = df.drop(['Ref'], axis=1)
df.head()

Unnamed: 0,City,Building,Start Date,End Date,Duration,State
0,Albany,Stadt Huys,"June 19, 1754","July 11, 1754",22 days,New York
2,New York,City Hall,"October 7, 1765","October 25, 1765",23 days,New York
4,Philadelphia,Carpenters' Hall,"September 5, 1774","October 26, 1774",1 month and 21 days,Pennsylvania
6,Philadelphia,Independence Hall,"May 10, 1775","December 12, 1776","1 year, 7 months and 2 days",Pennsylvania
7,Baltimore,Henry Fite House,"December 20, 1776","February 27, 1777",2 months and 7 days,Maryland


#### Reset an index

In [48]:
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,City,Building,Start Date,End Date,Duration,State
0,Albany,Stadt Huys,"June 19, 1754","July 11, 1754",22 days,New York
1,New York,City Hall,"October 7, 1765","October 25, 1765",23 days,New York
2,Philadelphia,Carpenters' Hall,"September 5, 1774","October 26, 1774",1 month and 21 days,Pennsylvania
3,Philadelphia,Independence Hall,"May 10, 1775","December 12, 1776","1 year, 7 months and 2 days",Pennsylvania
4,Baltimore,Henry Fite House,"December 20, 1776","February 27, 1777",2 months and 7 days,Maryland


In [50]:
df.loc[df.State=="New York"]

Unnamed: 0,City,Building,Start Date,End Date,Duration,State
0,Albany,Stadt Huys,"June 19, 1754","July 11, 1754",22 days,New York
1,New York,City Hall,"October 7, 1765","October 25, 1765",23 days,New York
13,New York,City Hall,"January 11, 1785","October 6, 1788","3 years, 11 months and 5 days",New York
14,New York,Federal Hall,"March 4, 1789","December 5, 1790","1 year, 9 months and 1 day",New York


## DataFrames as HTML

#### Pandas also had a `to_html` method that we can use to generate HTML tables from DataFrames.

In [51]:
html_table = df.to_html()
html_table

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>City</th>\n      <th>Building</th>\n      <th>Start Date</th>\n      <th>End Date</th>\n      <th>Duration</th>\n      <th>State</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Albany</td>\n      <td>Stadt Huys</td>\n      <td>June 19, 1754</td>\n      <td>July 11, 1754</td>\n      <td>22\xa0days</td>\n      <td>New York</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>New York</td>\n      <td>City Hall</td>\n      <td>October 7, 1765</td>\n      <td>October 25, 1765</td>\n      <td>23\xa0days</td>\n      <td>New York</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Philadelphia</td>\n      <td>Carpenters\' Hall</td>\n      <td>September 5, 1774</td>\n      <td>October 26, 1774</td>\n      <td>1\xa0month and 21\xa0days</td>\n      <td>Pennsylvania</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Philadelphia</td>\n      <td>In

#### You may have to strip unwanted newlines to clean up the table.

In [53]:
html_table.replace('\n', '')

'<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>City</th>      <th>Building</th>      <th>Start Date</th>      <th>End Date</th>      <th>Duration</th>      <th>State</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>Albany</td>      <td>Stadt Huys</td>      <td>June 19, 1754</td>      <td>July 11, 1754</td>      <td>22\xa0days</td>      <td>New York</td>    </tr>    <tr>      <th>1</th>      <td>New York</td>      <td>City Hall</td>      <td>October 7, 1765</td>      <td>October 25, 1765</td>      <td>23\xa0days</td>      <td>New York</td>    </tr>    <tr>      <th>2</th>      <td>Philadelphia</td>      <td>Carpenters\' Hall</td>      <td>September 5, 1774</td>      <td>October 26, 1774</td>      <td>1\xa0month and 21\xa0days</td>      <td>Pennsylvania</td>    </tr>    <tr>      <th>3</th>      <td>Philadelphia</td>      <td>Independence Hall</td>      <td>May 10, 1775</td>      <td>December 12, 1776</td>      <

You can also save the table directly to a file.

In [54]:
df.to_html('2/Activities/09-Ins_Pandas_Scraping/Solved/table.html')

In [80]:
# OSX Users can run this to open the file in a browser, 
# or you can manually find the file and open it in the browser
!open 2/Activities/09-Ins_Pandas_Scraping/Solved/table.html

'open' is not recognized as an internal or external command,
operable program or batch file.


# ==========================================

### 2.10 Students Do: Doctor Decoder (0:10)

In this activity, you will use `read_html` from Pandas to scrape a Wikipedia article. You will then use the resulting DataFrame to convert a list of medical abbreviations to their full description.

## Instructions

* Use Panda's `read_html` to parse the URL.

* Find the medical abbreviations DataFrame in the list of DataFrames as assign it to `df`.

  * Assign the columns `['abb', 'full_name', 'other']`

* Drop the `other` column from the DataFrame.

* Drop the header row (the first row) and set the index to the `abb` column.

* Loop through the list of medical abbreviations and print the abbreviation along with the full description.

  * Use the DataFrame to perform the lookup.

# Doctor Decoder

Use Pandas scraping to help decode the medical abbreviations that a doctor might use.

In [69]:
import pandas as pd

Use Pandas to scrape the following site and decode the medical abbreviations in the list

In [70]:
url = 'https://en.wikipedia.org/wiki/List_of_medical_abbreviations'
med_abbreviations = ['BMR', 'BP', 'ECG', 'MRI', 'qid', 'WBC']

In [71]:
# Use Panda's `read_html` to parse the url
### BEGIN SOLUTION
tables = pd.read_html(url)
len(tables)
### END SOLUTION

6

In [72]:
tables[2].head()

Unnamed: 0,0,1,2
0,EG abb,EG full name,"Other(ver change, need to know...etc.)"
1,ABG,arterial blood gas,
2,ACE,angiotensin-converting enzyme,
3,ACTH,adrenocorticotropic hormone,
4,ADH,antidiuretic hormone,


In [73]:
# Find the medical abbreviations DataFrame in the list of DataFrames as assign it to `df`
# Assign the columns `['abb', 'full_name', 'other']`
### BEGIN SOLUTION
df = tables[2]
df.columns = ['abb', 'full_name', 'other']
df.head()
### END SOLUTION

Unnamed: 0,abb,full_name,other
0,EG abb,EG full name,"Other(ver change, need to know...etc.)"
1,ABG,arterial blood gas,
2,ACE,angiotensin-converting enzyme,
3,ACTH,adrenocorticotropic hormone,
4,ADH,antidiuretic hormone,


Cleanup of extra row

In [None]:
# drop the `other` column
### BEGIN SOLUTION
del df['other']
### END SOLUTION

In [66]:
# Drop the first row and set the index to the `abb` column
### BEGIN SOLUTION
df = df.iloc[1:]
df.set_index('abb', inplace=True)
df.head()
### END SOLUTION

KeyError: "None of ['abb'] are in the columns"

In [67]:
# Loop through the list of medical abbreviations and print the abbreviation
# along with the full description.
# Use the DataFrame to perform the lookup.
### BEGIN SOLUTION
for abb in med_abbreviations:
    print(abb, df.loc[abb].full_name)
### END SOLUTION

BMR basal metabolic rate
BP blood pressure
ECG electrocardiogram
MRI magnetic resonance imaging
qid 4 times a day
WBC white blood cell


# ==========================================

### Rating Class Objectives

* rate your understanding using 1-5 method in each objective

In [None]:
objectives = [
    "Use Beautiful Soup to scrape your own data from the web",
    "Save the results of web scraping into MongoDB",
]
rating = []
total = 0
for i in range(len(objectives)):
    rate = input(objectives[i]+"? ")
    total += int(rate)
    rating.append(objectives[i] + ". (" + rate + "/5)")
print("="*96)
print("My rating today is:")
print("-"*24)
for i in rating:
    print(i)
print("-"*64)
print("Average: " + str(total/len(objectives)))