# Data Acquisition with Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

If Beautiful Soup is not installed yet, install it first: `$ pip install beautifulsoup4`

## Soup Methods

- `soup.select('.class')`: returns all elements with the class `class`
- `soup.select_one('.class')`: returns the first element with the class `class`
- `soup.h2`: returns the first `h2` element
- `soup.find_all('h2')`: returns all `h2` elements
- `soup('h2')`: shorthand for `soup.find_all('h2')`
- `soup.find('h2')`: returns the first `h2` element, same as `soup.h2`
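
The methods listed above can be tried on a small document before touching a real page. This is a minimal sketch; the HTML snippet and its class names are made up for illustration and do not come from the demo site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration only.
html = """
<div>
  <h2 class="title">First</h2>
  <h2 class="title">Second</h2>
  <p class="note">A note</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_titles = soup.select(".title")       # list of every element with class "title"
first_title = soup.select_one(".title")  # first element with class "title", or None
first_h2 = soup.h2                       # first <h2>, same result as soup.find("h2")
all_h2 = soup.find_all("h2")             # list of every <h2>; soup("h2") is shorthand

print([tag.text for tag in all_titles])  # ['First', 'Second']
print(first_title.text, first_h2.text)   # First First
print(len(soup("h2")))                   # 2
```

`select` and `select_one` take CSS selectors, while `find` and `find_all` take tag names, so the two families can be mixed depending on what the page makes easy to target.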

First, make the request. The response body is the page's raw HTML.

In [2]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
# html
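
The request above assumes the server answers successfully. A slightly more defensive version might look like the sketch below; the helper name `fetch_html`, the default user agent string, and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, user_agent="example scraper", timeout=10):
    """Return the HTML text for url, raising an error on a failed request.

    fetch_html is a hypothetical helper; the default user agent and
    timeout values are arbitrary choices.
    """
    response = requests.get(url, headers={"user-agent": user_agent}, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://web-scraping-demo.zgulde.net/news')` would then stand in for the two lines in the cell above.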

Beautiful Soup parses that HTML into a tree of elements that we can search.

In [3]:
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   News Example Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   News!
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid gap-y-12">
   <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
    <img src="/static/placeholder.png"/>
    <div class="col-span-3 space-y-3 py-3

From here we can switch between the browser and Python, trying out different ways of selecting parts of the HTML document.

In Google Chrome, right-click an element on the page and choose "Inspect" to open the developer tools. The element inspector shows the tags and classes we need to target when scraping.

In [4]:
# Use beautifulsoup methods to extract necessary content from an article

In [5]:
articles = soup.select('.grid-cols-4')
# articles

In [6]:
article = articles[0]
article

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">local meet fire</h2>
<div class="grid grid-cols-2 italic">
<p> 1983-06-03 </p>
<p class="text-right">By Crystal Prince </p>
</div>
<p>More door detail. Story market successful right rest season evening. Report matter of house boy anyone trial.
Left wind describe smile. Nation involve practice kind.</p>
</div>
</div>

In [7]:
# element.tagname.text: the text of the first matching tag inside the element
headline = article.h2.text
headline

'local meet fire'

In [8]:
# get the date: article.p is the first <p> in the article
# strip() removes the whitespace around the date
date = article.p.text.strip()
date

'1983-06-03'

In [9]:
article.select('.text-right')[0].text.strip()[3:]

'Crystal Prince'

In [10]:
# the leading dot in '.text-right' is CSS selector syntax for a class name and is required
# [3:] drops the leading 'By ' from the byline
author = article.select('.text-right')[0].text.strip()[3:]
author

'Crystal Prince'
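
Slicing with `[3:]` assumes every byline starts with exactly `By `. A sketch of a more defensive version follows; `clean_author` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def clean_author(byline):
    """Strip surrounding whitespace and a leading 'By ' from a byline string."""
    byline = byline.strip()
    prefix = "By "
    # Only remove the prefix when it is actually present.
    if byline.startswith(prefix):
        byline = byline[len(prefix):]
    return byline


print(clean_author("By Crystal Prince "))  # Crystal Prince
print(clean_author("Crystal Prince"))      # Crystal Prince
```

With this helper, the author line could be written as `clean_author(article.select('.text-right')[0].text)`.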

In [11]:
# getting the actual content
content = article.select('p')[-1].text
content

'More door detail. Story market successful right rest season evening. Report matter of house boy anyone trial.\nLeft wind describe smile. Nation involve practice kind.'

Bringing it all together: Make a function

In [12]:
def parse_news(article):
    headline = article.h2.text
    date = article.p.text.strip()
    author = article.select('.text-right')[0].text.strip()[3:]
    content = article.select('p')[-1].text
    
    return {
        'headline': headline, 'date': date, 'author': author,
        'content': content
    }

In [13]:

parse_news(article)

{'headline': 'local meet fire',
 'date': '1983-06-03',
 'author': 'Crystal Prince',
 'content': 'More door detail. Story market successful right rest season evening. Report matter of house boy anyone trial.\nLeft wind describe smile. Nation involve practice kind.'}

In [14]:
# loop through all the articles
# [parse_news(article) for article in articles]

In [15]:
# loop through all the articles
pd.DataFrame([parse_news(article) for article in articles])

| | headline | date | author | content |
|---|---|---|---|---|
| 0 | local meet fire | 1983-06-03 | Crystal Prince | More door detail. Story market successful righ... |
| 1 | someone Congress under | 1971-05-18 | Jennifer Bell | See executive relationship rule spring than sp... |
| 2 | successful start ahead | 1987-06-13 | Jessica Edwards | Little whose significant enjoy industry severa... |
| 3 | remain throw professor | 2001-01-21 | Kelly Watkins | Return statement director week line government... |
| 4 | race eye performance | 1972-02-16 | Jose Taylor | Form performance wonder item child. Gas here e... |
| 5 | scientist view much | 1987-11-14 | Alexis Ferrell | As class protect institution project lot.\nIss... |
| 6 | respond middle life | 1972-03-13 | Thomas Carter | Particular listen accept require fight tend pi... |
| 7 | by everybody industry | 2008-04-27 | Sara King | Fact stay family for sit contain. Trial especi... |
| 8 | night understand some | 2005-01-20 | Kathryn Evans | Data language teacher.\nAnswer customer some. ... |
| 9 | often admit scientist | 1987-01-16 | Thomas Walters | Occur simply yard every run too. Win sell spen... |


## Scraping People

In [16]:
response = requests.get('https://web-scraping-demo.zgulde.net/people', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')

In [17]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Example People Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   People
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
   <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
    <h2 class="text-2xl text-purp

In [18]:
cards = soup.select(".person")
cards

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Ashley Mitchell</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Mandatory intermediate knowledgebase"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">hurleyjanet@martinez.com</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">4118166142</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 8472 Emily Shoal Suite 909 <br/>
                 East Jonathan, MT 89923
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name

In [19]:
card = cards[0]
card

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Ashley Mitchell</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Mandatory intermediate knowledgebase"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">hurleyjanet@martinez.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">4118166142</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                8472 Emily Shoal Suite 909 <br/>
                East Jonathan, MT 89923
            </p>
</div>
</div>

In [20]:
name = card.h2.text
name

'Ashley Mitchell'

In [21]:
quote = card.p.text.strip()
quote

'"Mandatory intermediate knowledgebase"'

In [22]:
email = card.find_all('p')[1].text
email

'hurleyjanet@martinez.com'

In [23]:
phone = card.find_all('p')[2].text
phone

'4118166142'

In [24]:
# address = card.find_all('p')[3].text.strip()
# address

In [25]:
import re
address = card.find_all('p')[3].text.strip()
address = re.sub(r"\s{2,}", "", address)

address

'8472 Emily Shoal Suite 909East Jonathan, MT 89923'
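
Note that replacing the whitespace run with an empty string glues the street and city lines together (`909East`). One possible alternative, shown here as a sketch on a hard-coded copy of the stripped address text, is to collapse each run of whitespace into a single space instead.

```python
import re

# The address text after .strip(), copied from the card above.
raw = "8472 Emily Shoal Suite 909 \n                East Jonathan, MT 89923"

# Collapse every run of whitespace (including the newline) into one space.
address = re.sub(r"\s+", " ", raw)
print(address)  # 8472 Emily Shoal Suite 909 East Jonathan, MT 89923
```

Using `", "` as the replacement for `r"\s{2,}"` would be another option if the two address lines should stay visibly separated.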

In [26]:
card.find_all('p')

[<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Mandatory intermediate knowledgebase"
         </p>,
 <p class="email col-span-8">hurleyjanet@martinez.com</p>,
 <p class="phone col-span-8">4118166142</p>,
 <p class="col-span-8">
                 8472 Emily Shoal Suite 909 <br/>
                 East Jonathan, MT 89923
             </p>]
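
Indexing into `find_all('p')` depends on the paragraphs always appearing in the same order. Since the card's markup carries `quote`, `email`, `phone`, and `address` classes, selecting by class is a possible alternative. The sketch below runs on a trimmed, hard-coded copy of the first card rather than the live page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first card from the page above.
html = """
<div class="person">
  <h2 class="name">Ashley Mitchell</h2>
  <p class="quote"> "Mandatory intermediate knowledgebase" </p>
  <p class="email">hurleyjanet@martinez.com</p>
  <p class="phone">4118166142</p>
  <div class="address"><p>8472 Emily Shoal Suite 909 <br/> East Jonathan, MT 89923</p></div>
</div>
"""
card = BeautifulSoup(html, "html.parser").select_one(".person")

email = card.select_one(".email").text
phone = card.select_one(".phone").text
quote = card.select_one(".quote").text.strip()
# The address <p> has no class of its own, so reach it through its parent div.
# split() + join() collapses all whitespace runs into single spaces.
address = " ".join(card.select_one(".address p").text.split())

print(email, phone, quote, address, sep="\n")
```

Class-based selectors keep working if the site adds or reorders paragraphs, as long as the class names stay the same.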

In [37]:
def parse_person(card):
    name = card.h2.text
    quote = card.p.text.strip()
    email = card.find_all('p')[1].text
    phone = card.find_all('p')[2].text
    address = card.find_all('p')[3].text.strip()
    address = re.sub(r"\s{2,}", "", address)
    
    return {
        'name': name, 'quote': quote, 'email': email,
        'phone': phone,
        'address': address
    }

In [38]:
parse_person(card)

{'name': 'Ashley Mitchell',
 'quote': '"Mandatory intermediate knowledgebase"',
 'email': 'hurleyjanet@martinez.com',
 'phone': '4118166142',
 'address': '8472 Emily Shoal Suite 909East Jonathan, MT 89923'}

In [39]:
# loop through all the persons
pd.DataFrame([parse_person(card) for card in cards])

| | name | quote | email | phone | address |
|---|---|---|---|---|---|
| 0 | Ashley Mitchell | "Mandatory intermediate knowledgebase" | hurleyjanet@martinez.com | 4118166142 | 8472 Emily Shoal Suite 909East Jonathan, MT 89923 |
| 1 | Maria Spence | "Profound logistical paradigm" | yvalencia@hotmail.com | (298)937-3603x315 | 84819 Nelson GardensWest Sarahtown, NY 04438 |
| 2 | Julie Pacheco | "Business-focused zero tolerance help-desk" | thomaspaige@yahoo.com | (733)688-5510 | 374 Cynthia Trace Apt. 445Allenshire, MO 51884 |
| 3 | Gordon Shelton | "Down-sized client-server knowledgebase" | carneytammy@yahoo.com | (882)691-1678 | 670 Bowen Port Suite 369Hamiltonfurt, NM 71922 |
| 4 | Sara Guerrero MD | "Right-sized transitional support" | zchang@yahoo.com | 847-518-8081x78814 | 62465 Madison GreensBlakeburgh, OH 62907 |
| 5 | Kayla Townsend | "Distributed contextually-based productivity" | thomasjessica@payne.com | 696.389.1920x61823 | 48208 Lee Drives Suite 397Lopezside, LA 41636 |
| 6 | Christopher Wagner | "Intuitive upward-trending Graphic Interface" | nicole77@irwin.net | 646.083.6704x479 | 4645 Nelson Station Apt. 589East David, VT 87624 |
| 7 | Oscar Clay | "Optional executive instruction set" | donaldjohnson@hotmail.com | 001-888-140-0820x5913 | 278 Alisha UnionNorth Veronica, RI 16332 |
| 8 | Mr. Jeremy Smith | "Object-based stable access" | xjohnson@miller.com | 695-331-8010 | 565 Matthew DriveEast Melissa, MD 20502 |
| 9 | Thomas Harris | "Organic web-enabled superstructure" | eherrera@smith.com | +1-207-962-6951x1425 | 1832 John VistaNew Brianhaven, VA 50282 |


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
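
Checking `robots.txt` can also be done from Python with the standard library's `urllib.robotparser`. The sketch below parses a made-up set of rules so that it runs without network access; the rules and the `/private/` path are hypothetical and do not describe any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, made up for illustration.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
# For a live site, call parser.set_url(<robots.txt URL>) and parser.read() instead.
parser.parse(rules)

user_agent = "codeup data science germain cohort"
print(parser.can_fetch(user_agent, "http://example.com/blog/"))         # True
print(parser.can_fetch(user_agent, "http://example.com/private/data"))  # False
```

A scraper could call `can_fetch` before each request and skip any URL that the site disallows.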

## Exercises

#### Codeup Blog Articles

Visit [Codeup's Blog](http://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that returns a list of dictionaries, with each dictionary representing one article and holding at least that post's title and content.

In [30]:
response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')