# Data Acquisition with Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

If the `bs4` import fails, install the library first: `$ pip install beautifulsoup4`

## Soup Methods

- `soup.select('.class')`: returns all elements with class `class`
- `soup.select_one('.class')`: returns the first element with class `class`
- `soup.h2`: returns the first `h2` element
- `soup.find_all('h2')`: returns all `h2` elements
- `soup('h2')`: shorthand for `soup.find_all('h2')`
- `soup.find('h2')`: returns the first `h2` element (same result as `soup.h2`)
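The methods listed above can be tried without a network request. The following is a minimal sketch that runs each one against a small hand-written document; the markup and the names `sample_html` and `sample_soup` are made up for illustration and are not taken from the demo site.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document (hypothetical, not the demo site's markup).
sample_html = """
<div class="card"><h2>first headline</h2><p class="byline">By A. Writer</p></div>
<div class="card"><h2>second headline</h2><p class="byline">By B. Writer</p></div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# select returns a list of every match; select_one returns only the first.
cards = sample_soup.select('.card')
first_byline = sample_soup.select_one('.byline')

# Attribute access and find both return the first matching tag.
first_h2 = sample_soup.h2
same_first_h2 = sample_soup.find('h2')

# find_all and calling the soup directly both return every matching tag.
all_h2 = sample_soup.find_all('h2')
also_all_h2 = sample_soup('h2')

print(len(cards))
print(first_byline.text)
print(first_h2.text)
print([h2.text for h2 in all_h2])
```

Note that `select` and `select_one` take CSS selectors, while `find` and `find_all` take tag names.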

First, make the request. The response body is the page's raw HTML.

In [2]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
# html

We can make more sense of that HTML by parsing it with the Beautiful Soup library.

In [3]:
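# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')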
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   News Example Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   News!
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid gap-y-12">
   <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
    <img src="/static/placeholder.png"/>
    <div class="col-span-3 space-y-3 py-3

From here we can switch between the browser and Python, trying out different ways of extracting different parts of the HTML document.

In Google Chrome, right-click an element on the page and choose "Inspect" to open the developer tools. The element inspector shows the tags and class names we need to target when scraping.

In [4]:
# Use beautifulsoup methods to extract necessary content from an article

In [5]:
articles = soup.select('.grid-cols-4')
articles

[<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">executive laugh rather</h2>
 <div class="grid grid-cols-2 italic">
 <p> 1979-09-04 </p>
 <p class="text-right">By Eileen Rivera </p>
 </div>
 <p>Main although hair air age third shoulder song. Ago scene skill do force PM audience could. Son week home certainly hotel.
 During though have strong. Example land military go foreign street some.</p>
 </div>
 </div>,
 <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">send far foreign</h2>
 <div class="grid grid-cols-2 italic">
 <p> 1973-07-29 </p>
 <p class="text-right">By Paige Sanchez </p>
 </div>
 <p>Nearly pass onto. Her call southern doctor

In [6]:
article = articles[0]
article

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">executive laugh rather</h2>
<div class="grid grid-cols-2 italic">
<p> 1979-09-04 </p>
<p class="text-right">By Eileen Rivera </p>
</div>
<p>Main although hair air age third shoulder song. Ago scene skill do force PM audience could. Son week home certainly hotel.
During though have strong. Example land military go foreign street some.</p>
</div>
</div>

In [7]:
# tag.child_tag_name.text: .h2 is the first h2 inside the article, .text is its text
headline = article.h2.text
headline

'executive laugh rather'

In [8]:
# get the date from the first p in the article
# strip() removes the whitespace surrounding the date
date = article.p.text.strip()
date

'1979-09-04'

In [9]:
# the leading dot in '.text-right' is CSS selector syntax for a class name and is required
# [3:] drops the "By " prefix from the byline
author = article.select('.text-right')[0].text.strip()[3:]
author

'Eileen Rivera'

In [10]:
# getting the actual content
content = article.select('p')[-1].text
content

'Main although hair air age third shoulder song. Ago scene skill do force PM audience could. Son week home certainly hotel.\nDuring though have strong. Example land military go foreign street some.'

Bringing it all together: Make a function

In [11]:
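def parse_news(article):
    """Extract the headline, date, author, and content from one article element."""
    headline = article.h2.text
    # the first p holds the date, padded with whitespace
    date = article.p.text.strip()
    # the byline reads "By <name>"; [3:] drops the leading "By "
    author = article.select('.text-right')[0].text.strip()[3:]
    # the last p holds the article body
    content = article.select('p')[-1].text

    return {
        'headline': headline, 'date': date, 'author': author,
        'content': content
    }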

In [12]:

parse_news(article)

{'headline': 'executive laugh rather',
 'date': '1979-09-04',
 'author': 'Eileen Rivera',
 'content': 'Main although hair air age third shoulder song. Ago scene skill do force PM audience could. Son week home certainly hotel.\nDuring though have strong. Example land military go foreign street some.'}
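`parse_news` assumes every article contains an `h2`, a `.text-right` element, and at least one `p`; if any of them is missing it raises an error. Below is a sketch of a more tolerant variant, run against markup copied from the first article above (with the class list and body text abridged). The names `parse_news_safe` and `text_or_none` are hypothetical, and unlike `parse_news` this version also strips whitespace from the headline and content.

```python
from bs4 import BeautifulSoup

# Markup copied from the first article shown above (abridged).
sample_article_html = """
<div class="grid grid-cols-4">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">executive laugh rather</h2>
<div class="grid grid-cols-2 italic">
<p> 1979-09-04 </p>
<p class="text-right">By Eileen Rivera </p>
</div>
<p>Main although hair air age third shoulder song.</p>
</div>
</div>
"""

def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text().strip() if tag is not None else None

def parse_news_safe(article):
    """Hypothetical variant of parse_news that tolerates missing elements."""
    paragraphs = article.select('p')
    author = text_or_none(article.select_one('.text-right'))
    if author is not None and author.startswith('By '):
        author = author[len('By '):]  # drop the "By " prefix explicitly
    return {
        'headline': text_or_none(article.select_one('h2')),
        'date': text_or_none(paragraphs[0]) if paragraphs else None,
        'author': author,
        'content': text_or_none(paragraphs[-1]) if paragraphs else None,
    }

sample_article = BeautifulSoup(sample_article_html, 'html.parser').select_one('.grid-cols-4')
parsed = parse_news_safe(sample_article)
print(parsed)
```

Using `select_one` instead of `select(...)[0]` returns `None` for a missing element rather than raising an `IndexError`, which lets the caller decide how to handle incomplete articles.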

In [14]:
# loop through all the articles
[parse_news(article) for article in articles]

[{'headline': 'executive laugh rather',
  'date': '1979-09-04',
  'author': 'Eileen Rivera',
  'content': 'Main although hair air age third shoulder song. Ago scene skill do force PM audience could. Son week home certainly hotel.\nDuring though have strong. Example land military go foreign street some.'},
 {'headline': 'send far foreign',
  'date': '1973-07-29',
  'author': 'Paige Sanchez',
  'content': 'Nearly pass onto. Her call southern doctor these. Office resource student direction attack capital.\nProfessional state author personal policy friend either. No development travel year talk forget budget.'},
 {'headline': 'white action often',
  'date': '1976-03-11',
  'author': 'Melissa Cooper',
  'content': 'Around education current month staff. Standard notice exist dinner our life they. Itself room here street event base.\nMother improve thank. The along success sell vote. Dark hundred although that few next approach.'},
 {'headline': 'cup evening against',
  'date': '1993-08-27',


In [17]:
# parse every article and collect the results in a DataFrame
pd.DataFrame([parse_news(article) for article in articles])

| | headline | date | author | content |
|---|---|---|---|---|
| 0 | executive laugh rather | 1979-09-04 | Eileen Rivera | Main although hair air age third shoulder song... |
| 1 | send far foreign | 1973-07-29 | Paige Sanchez | Nearly pass onto. Her call southern doctor the... |
| 2 | white action often | 1976-03-11 | Melissa Cooper | Around education current month staff. Standard... |
| 3 | cup evening against | 1993-08-27 | James Wagner | Serious western perform situation the however.... |
| 4 | low natural skill | 2012-02-16 | Amanda Lee DVM | Hard add first on manager industry hotel. Get ... |
| 5 | enter letter claim | 2019-08-24 | Lee Keller | Stand goal information full. Success listen li... |
| 6 | reason nation evening | 2017-09-25 | Joseph Burns | Suddenly from be every recent collection resul... |
| 7 | development successful stay | 1991-08-13 | Justin Robinson | Prove simple stage lead.\nAnything act happy d... |
| 8 | bar life attack | 1982-12-23 | Diana Howell | Establish read subject. Give rock first toward... |
| 9 | space result mouth | 1986-04-23 | Emma Edwards | Officer process property almost live despite w... |
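Every scraped value arrives as a string, including the dates. As a possible next step (not part of the original walkthrough), the sketch below converts the `date` column to a datetime type so that date-based sorting and filtering work. It uses three rows copied from the table above; the names `news` and `oldest_first` are illustrative.

```python
import pandas as pd

# Three rows copied from the table above (content column omitted for brevity).
news = pd.DataFrame([
    {'headline': 'executive laugh rather', 'date': '1979-09-04', 'author': 'Eileen Rivera'},
    {'headline': 'send far foreign', 'date': '1973-07-29', 'author': 'Paige Sanchez'},
    {'headline': 'white action often', 'date': '1976-03-11', 'author': 'Melissa Cooper'},
])

# Parse the date strings into datetime values.
news['date'] = pd.to_datetime(news['date'], format='%Y-%m-%d')

# With real datetimes we can sort chronologically and pull out parts such as the year.
oldest_first = news.sort_values('date').reset_index(drop=True)
years = oldest_first['date'].dt.year.tolist()
print(oldest_first[['date', 'headline']])
```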


## Scraping People

In [None]:
response = requests.get('https://web-scraping-demo.zgulde.net/people', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
# def parse_person(person):
#     name = 
#     quote = 
#     email = 
#     phone = 
#     address = 

    
#     return {
#         'name': name, 'quote': quote, 'email': email,
#         'phone': phone,
#         'address': address
#     }

In [None]:
# loop through all the people


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
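One way to check a `robots.txt` file programmatically is Python's standard-library `urllib.robotparser`. The sketch below parses a made-up `robots.txt` (the rules and the `example.com` URLs are assumptions for illustration, not any real site's policy) and asks whether a given user agent may fetch two paths.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, for illustration only (not any real site's rules).
sample_robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

user_agent = 'codeup data science germain cohort'

# can_fetch returns True when the rules allow this user agent to request the URL.
news_allowed = parser.can_fetch(user_agent, 'https://example.com/news')
private_allowed = parser.can_fetch(user_agent, 'https://example.com/private/data')
print(news_allowed, private_allowed)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the site's actual `robots.txt` instead of a local string.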

## Exercises

#### Codeup Blog Articles

Visit [Codeup's Blog](http://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

In [None]:
response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')