# Data Acquisition with Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

`$ pip install beautifulsoup4`

## Soup Methods

- `soup.select('.class')`: to get all the elements with class `class`
- `soup.select_one('.class')`: to get the first element with class `class`
- `soup.h2`: to get the first `h2` element
- `soup.find_all('h2')`: to get all the elements with tag name of `h2`
- `soup('h2')` : same as `find_all` method above
- `soup.find('h2')`: finds the first matching element
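
The methods above can be tried on a small inline document before pointing them at a real page. This is a minimal sketch; `sample_html` and `demo_soup` are made-up names for illustration and are not part of the lesson's pages.

```python
from bs4 import BeautifulSoup

# Tiny hand-written document (an assumption for illustration only).
sample_html = """
<div class="post"><h2>First</h2><p class="lead">Intro</p></div>
<div class="post"><h2>Second</h2><p>Body</p></div>
"""
demo_soup = BeautifulSoup(sample_html, 'html.parser')

all_posts = demo_soup.select('.post')       # every element with class "post"
first_post = demo_soup.select_one('.post')  # only the first such element
first_h2 = demo_soup.h2                     # first <h2>, like find('h2')
all_h2 = demo_soup.find_all('h2')           # every <h2>, like demo_soup('h2')

print(len(all_posts))                # 2
print(first_post.h2.text)            # First
print([h.text for h in all_h2])      # ['First', 'Second']
```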

First, make the request. The response body is the page's raw HTML.

In [2]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
# html

We can make more sense of that HTML with the Beautiful Soup library.

In [3]:
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's GuessedAtParserWarning
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   News Example Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   News!
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid gap-y-12">
   <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
    <img src="/static/placeholder.png"/>
    <div class="col-span-3 space-y-3 py-3

From here, we can switch back and forth between the browser and Python, trying different ways of pulling out parts of the HTML document.

Google Chrome's developer tools help with this: right-click an element on the page and choose "Inspect" to see where it sits in the document and which tags and classes identify it.

In [21]:
# Use beautifulsoup methods to extract necessary content from an article

In [5]:
articles = soup.select('.grid-cols-4')
# articles

In [6]:
article = articles[0]
article

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">help apply southern</h2>
<div class="grid grid-cols-2 italic">
<p> 2015-11-15 </p>
<p class="text-right">By Jeffrey Hughes </p>
</div>
<p>Contain technology walk expect. Just machine thought suggest some possible day research. Religious chair fund also who. Need coach just although whether my skin heavy.
Us line travel life by.</p>
</div>
</div>

In [7]:
# soupmethod.tagname.text
headline = article.h2.text
headline

'help apply southern'

In [8]:
# get the date
# there was some white space that we stripped out
date = article.p.text.strip()
date

'2015-11-15'

In [56]:
article.select('.text-right')[0].text.strip()[3:]

'Jeffrey Hughes'

In [9]:
# the leading dot in '.text-right' is CSS selector syntax for a class name and is required;
# [3:] drops the 'By ' prefix from the byline
author = article.select('.text-right')[0].text.strip()[3:]
author

'Jeffrey Hughes'

In [10]:
# getting the actual content
content = article.select('p')[-1].text
content

'Contain technology walk expect. Just machine thought suggest some possible day research. Religious chair fund also who. Need coach just although whether my skin heavy.\nUs line travel life by.'

Bringing it all together: Make a function

In [11]:
def parse_news(article):
    headline = article.h2.text
    date = article.p.text.strip()
    author = article.select('.text-right')[0].text.strip()[3:]
    content = article.select('p')[-1].text
    
    return {
        'headline': headline, 'date': date, 'author': author,
        'content': content
    }
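
The indexing in `parse_news` raises an error when an element is missing (`article.h2` is `None`, or `select(...)` returns an empty list). A more defensive variant is sketched below. The helper names `text_or_none` and `parse_news_safe` are hypothetical, and the selectors `.italic p` and `.col-span-3 > p` assume the markup matches the article card shown above.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


def parse_news_safe(article):
    byline = text_or_none(article, '.text-right')
    if byline is not None and byline.startswith('By '):
        byline = byline[len('By '):]  # drop the 'By ' prefix only when present
    return {
        'headline': text_or_none(article, 'h2'),
        'date': text_or_none(article, '.italic p'),         # first <p> in the date/byline row
        'author': byline,
        'content': text_or_none(article, '.col-span-3 > p'),  # the only <p> directly under the text column
    }


# Shortened copy of the article card from above, inlined so the sketch runs offline.
sample_article = BeautifulSoup("""
<div class="grid grid-cols-4">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">help apply southern</h2>
<div class="grid grid-cols-2 italic">
<p> 2015-11-15 </p>
<p class="text-right">By Jeffrey Hughes </p>
</div>
<p>Contain technology walk expect.</p>
</div>
</div>
""", 'html.parser').div

parsed = parse_news_safe(sample_article)
empty = parse_news_safe(BeautifulSoup('<div></div>', 'html.parser').div)
print(parsed)
print(empty)
```

A missing field then shows up as `None` in the resulting row instead of stopping the whole loop.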

In [12]:

parse_news(article)

{'headline': 'help apply southern',
 'date': '2015-11-15',
 'author': 'Jeffrey Hughes',
 'content': 'Contain technology walk expect. Just machine thought suggest some possible day research. Religious chair fund also who. Need coach just although whether my skin heavy.\nUs line travel life by.'}

In [13]:
# loop through all the articles
# [parse_news(article) for article in articles]

In [14]:
# loop through all the articles
pd.DataFrame([parse_news(article) for article in articles])

|   | headline | date | author | content |
|---|---|---|---|---|
| 0 | help apply southern | 2015-11-15 | Jeffrey Hughes | Contain technology walk expect. Just machine t... |
| 1 | cold environment right | 1973-04-22 | Erica Mcclure | Final difficult draw bring throw from. Owner r... |
| 2 | PM data man | 1983-07-12 | Frank Hayes | Article management finally tree. Check fast th... |
| 3 | and defense although | 2014-08-26 | Jennifer Guzman | Likely event draw have key. About measure figu... |
| 4 | authority recognize sure | 1972-09-03 | Shane Douglas | Bed service audience dark suggest leader. Ten ... |
| 5 | group either whole | 2003-01-15 | Casey Reyes | Possible new growth tell final each. Writer pr... |
| 6 | for respond blood | 2016-08-08 | Amy Perez | Main director interview project plan. Improve ... |
| 7 | production even still | 2007-05-08 | Kenneth Nguyen | Happy police the set then herself not. Recent ... |
| 8 | federal our environment | 1982-09-05 | Aaron Holder | Police resource anyone water matter religious ... |
| 9 | southern lawyer house | 2014-05-21 | Jeffrey Mcintyre | Though action better method part. Professional... |


## Scraping People

In [27]:
response = requests.get('https://web-scraping-demo.zgulde.net/people', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')

In [28]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Example People Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   People
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
   <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
    <h2 class="text-2xl text-purp

In [39]:
cards = soup.select(".person")
cards

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Lisa Freeman</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Diverse executive attitude"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">jacquelinescott@hotmail.com</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">+1-643-488-5409</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 696 Shawn Loop <br/>
                 Port Jessica, IL 02847
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full bor

In [40]:
card = cards[0]
card

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Lisa Freeman</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Diverse executive attitude"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">jacquelinescott@hotmail.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">+1-643-488-5409</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                696 Shawn Loop <br/>
                Port Jessica, IL 02847
            </p>
</div>
</div>

In [45]:
name = card.h2.text
name

'Lisa Freeman'

In [58]:
quote = card.p.text.strip()
quote

'"Diverse executive attitude"'

In [77]:
email = card.find_all('p')[1].text
email

'jacquelinescott@hotmail.com'

In [81]:
phone = card.find_all('p')[2].text
phone

'+1-643-488-5409'

In [88]:
address = card.find_all('p')[3].text.strip()
address

'696 Shawn Loop \n                Port Jessica, IL 02847'
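
The address still carries the line break and indentation from the page source. Two ways to tidy it are sketched here using only the standard library. The variable names are illustrative, and which form is preferable depends on how the address will be used later.

```python
import re

raw_address = '696 Shawn Loop \n                Port Jessica, IL 02847'

# Collapse every run of whitespace (including the newline) into one space.
one_line = ' '.join(raw_address.split())

# Or keep the line boundary visible by turning it into a comma.
comma_separated = re.sub(r'\s*\n\s*', ', ', raw_address)

print(one_line)         # 696 Shawn Loop Port Jessica, IL 02847
print(comma_separated)  # 696 Shawn Loop, Port Jessica, IL 02847
```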

In [89]:
def parse_person(person):
    # read from the `person` argument (not the global `card`) so each card yields its own values
    name = person.h2.text
    quote = person.p.text.strip()
    email = person.find_all('p')[1].text
    phone = person.find_all('p')[2].text
    address = person.find_all('p')[3].text.strip()

    return {
        'name': name, 'quote': quote, 'email': email,
        'phone': phone,
        'address': address
    }

In [90]:
parse_person(card)

{'name': 'Lisa Freeman',
 'quote': '"Diverse executive attitude"',
 'email': 'jacquelinescott@hotmail.com',
 'phone': '+1-643-488-5409',
 'address': '696 Shawn Loop \n                Port Jessica, IL 02847'}

In [91]:
# loop through all the persons
pd.DataFrame([parse_person(card) for card in cards])

|   | name | quote | email | phone | address |
|---|---|---|---|---|---|
| 0 | Lisa Freeman | "Diverse executive attitude" | jacquelinescott@hotmail.com | +1-643-488-5409 | 696 Shawn Loop \n Port Jessica,... |
| ... | ... | ... | ... | ... | ... |
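
Positional lookups such as `find_all('p')[1]` depend on the order of the paragraphs inside a card. The card markup shown earlier also carries the class names `name`, `quote`, `email`, `phone`, and `address`, so selecting by class is one possible alternative. The function name `parse_person_by_class` is hypothetical, and the inline card is a shortened copy of the first card so the sketch runs offline.

```python
from bs4 import BeautifulSoup

card_html = """
<div class="person">
<h2 class="name">Lisa Freeman</h2>
<p class="quote">
    "Diverse executive attitude"
</p>
<div class="grid grid-cols-9">
<p class="email col-span-8">jacquelinescott@hotmail.com</p>
<p class="phone col-span-8">+1-643-488-5409</p>
</div>
<div class="address grid grid-cols-9">
<p class="col-span-8">
    696 Shawn Loop <br/>
    Port Jessica, IL 02847
</p>
</div>
</div>
"""


def parse_person_by_class(person):
    return {
        'name': person.select_one('.name').get_text(strip=True),
        'quote': person.select_one('.quote').get_text(strip=True),
        'email': person.select_one('.email').get_text(strip=True),
        'phone': person.select_one('.phone').get_text(strip=True),
        # join the two address lines (split by <br/>) with a single space
        'address': person.select_one('.address p').get_text(' ', strip=True),
    }


sample_card = BeautifulSoup(card_html, 'html.parser').select_one('.person')
person_row = parse_person_by_class(sample_card)
print(person_row)
```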


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
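
Checking `robots.txt` can be done in code with the standard library's `urllib.robotparser`. The sketch below parses an inlined, made-up `robots.txt` (an assumption, so that it runs without network access) and asks whether two example URLs may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(robots_txt.splitlines())

user_agent = 'codeup data science germain cohort'
blog_allowed = robots.can_fetch(user_agent, 'https://example.com/blog/')
private_allowed = robots.can_fetch(user_agent, 'https://example.com/private/data')

print(blog_allowed)     # True
print(private_allowed)  # False
```

For a live site, `RobotFileParser` can instead download the file with `set_url(...)` followed by `read()`.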

## Exercises

#### Codeup Blog Articles

Visit [Codeup's Blog](http://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that returns a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

In [20]:
response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')
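
One possible starting skeleton for the exercise is sketched below. It is not a finished solution. The `fetch` parameter is a hypothetical callable that returns a page's HTML for a URL, and the choice of the first `<h1>` as the title and all `<p>` text as the content is an assumption that must be checked against the real blog markup in the browser inspector.

```python
from bs4 import BeautifulSoup


def parse_blog_article(html):
    """Turn one post's HTML into a {'title': ..., 'content': ...} dictionary."""
    page = BeautifulSoup(html, 'html.parser')
    # Assumption: the title is the first <h1>; the content is the text of every <p>.
    title = page.h1.get_text(strip=True) if page.h1 else None
    content = '\n'.join(p.get_text(' ', strip=True) for p in page.find_all('p'))
    return {'title': title, 'content': content}


def get_blog_articles(urls, fetch):
    """Return one dictionary per URL; `fetch(url)` must return that page's HTML."""
    return [parse_blog_article(fetch(url)) for url in urls]


# Offline stand-in for real requests: a made-up page keyed by a made-up URL.
fake_pages = {
    'https://example.com/post-1':
        '<html><body><h1> Post One </h1><p>Hello</p><p>World</p></body></html>',
}
blog_articles = get_blog_articles(list(fake_pages), fake_pages.get)
print(blog_articles)
```

With real pages, `fetch` could be something like `lambda url: requests.get(url, headers={'user-agent': 'Codeup DS Hopper'}).text`.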