# Data Acquisition with Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

`$ pip install beautifulsoup4`

## Soup Methods

- `soup.select('.class')`: to get all the elements with class `class`
- `soup.select_one('.class')`: to get the first element with class `class`
- `soup.h2`: to get the first `h2` element
- `soup.find_all('h2')`: to get all the elements with tag name of `h2`
- `soup('h2')`: shorthand for the `find_all` call above
- `soup.find('h2')`: finds the first matching element
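
As a rough illustration of the methods listed above, here is a minimal sketch run against a small inline document. The HTML snippet, its class names, and the variable names are invented for this example, and passing `'html.parser'` is an assumption about which parser to use.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
sample_html = """
<div class="post"><h2>First headline</h2><p class="byline">By Ada</p></div>
<div class="post"><h2>Second headline</h2><p class="byline">By Grace</p></div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

all_bylines = sample_soup.select('.byline')       # every element with class "byline"
first_byline = sample_soup.select_one('.byline')  # first match only (None if no match)
first_h2 = sample_soup.h2                         # first h2 element
all_h2 = sample_soup.find_all('h2')               # every h2 element, as a list
same_h2 = sample_soup('h2')                       # calling the soup is shorthand for find_all

print(len(all_bylines), first_byline.text, first_h2.text)
print([h.text for h in all_h2] == [h.text for h in same_h2])
```

Note that `select` and `find_all` always return a list, while `select_one`, `find`, and the attribute shortcut return a single element or `None`.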

First, make the request. The response body is the page's raw HTML, returned as one long string.

In [2]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>News Example Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">News!</h1>\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n<div class="grid gap-y-12">\n    \n    <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">\n        <img src="/static/placeholder.png" />\n        <d

We can make more sense of that HTML by parsing it with the Beautiful Soup library.

In [3]:
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a parser-guessing warning
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   News Example Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   News!
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid gap-y-12">
   <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
    <img src="/static/placeholder.png"/>
    <div class="col-span-3 space-y-3 py-3

From here we can switch between the browser and Python, trying out different ways of extracting different parts of the HTML document.

Google Chrome's developer tools help here: right-click an element on the page and choose "Inspect". The inspector shows the tags and classes around that element, which tells us what selectors to use when scraping.

In [None]:
# Use beautifulsoup methods to extract necessary content from an article

In [6]:
articles = soup.select('.grid-cols-4')
articles

[<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">in question try</h2>
 <div class="grid grid-cols-2 italic">
 <p> 2003-04-26 </p>
 <p class="text-right">By Timothy Moreno DDS </p>
 </div>
 <p>Option policy store activity fly relationship. Keep guy fine anyone contain.
 Everything turn bill behind. Attention simple choice per.</p>
 </div>
 </div>,
 <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">series I level</h2>
 <div class="grid grid-cols-2 italic">
 <p> 1988-02-13 </p>
 <p class="text-right">By Matthew Bryant </p>
 </div>
 <p>Development what part water issue major. Daughter win leg buy such say industry.
 Three quickly skill make

In [11]:
article = articles[0]
article

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">in question try</h2>
<div class="grid grid-cols-2 italic">
<p> 2003-04-26 </p>
<p class="text-right">By Timothy Moreno DDS </p>
</div>
<p>Option policy store activity fly relationship. Keep guy fine anyone contain.
Everything turn bill behind. Attention simple choice per.</p>
</div>
</div>

In [13]:
headline = article.h2.text
headline

'in question try'

In [16]:
date = article.p.text.strip()
date

'2003-04-26'

In [22]:
author = article.select('.text-right')[0].text.strip()[3:]
author

'Timothy Moreno DDS'

In [28]:
content = article.select('p')[-1].text
content

'Option policy store activity fly relationship. Keep guy fine anyone contain.\nEverything turn bill behind. Attention simple choice per.'

Bringing it all together: Make a function

In [29]:
def parse_news(article):
    headline = article.h2.text
    date = article.p.text.strip()
    author = article.select('.text-right')[0].text.strip()[3:]
    content = article.select('p')[-1].text
   
    return {
        'headline': headline, 'date': date, 'author': author,
        'content': content
    }

In [31]:
parse_news(article)

{'headline': 'in question try',
 'date': '2003-04-26',
 'author': 'Timothy Moreno DDS',
 'content': 'Option policy store activity fly relationship. Keep guy fine anyone contain.\nEverything turn bill behind. Attention simple choice per.'}
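
One caveat with the function above: `article.h2` and `article.p` return `None` when the element is missing, so calling `.text` on them raises an `AttributeError`. The following is a defensive sketch, not the notebook's own approach. The names `text_or_none` and `parse_news_safe` are hypothetical, and the sample article is invented to show the fallback.

```python
from bs4 import BeautifulSoup

def text_or_none(element):
    """Return stripped text for a tag, or None when the tag was not found."""
    return element.get_text(strip=True) if element is not None else None

def parse_news_safe(article):
    # Hypothetical variant of parse_news that tolerates missing elements
    author = text_or_none(article.select_one('.text-right'))
    if author is not None and author.startswith('By '):
        author = author[len('By '):]  # drop the "By " prefix only when present
    paragraphs = article.select('p')
    return {
        'headline': text_or_none(article.h2),
        'date': text_or_none(article.p),
        'author': author,
        'content': paragraphs[-1].get_text() if paragraphs else None,
    }

# Invented article with no byline, to show the fallback
broken = BeautifulSoup(
    '<div><h2>no byline here</h2><p> 2003-04-26 </p><p>Body text.</p></div>',
    'html.parser',
).div
print(parse_news_safe(broken))
```

Returning `None` for a missing field keeps one malformed article from stopping the whole loop; the gaps then show up as missing values in the resulting DataFrame.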

In [33]:
# loop through all the articles
pd.DataFrame([parse_news(article) for article in articles])

| | headline | date | author | content |
|---|---|---|---|---|
| 0 | in question try | 2003-04-26 | Timothy Moreno DDS | Option policy store activity fly relationship.... |
| 1 | series I level | 1988-02-13 | Matthew Bryant | Development what part water issue major. Daugh... |
| 2 | past member generation | 1981-02-16 | Ashley Taylor | Mother chair their teach simple. Senior test a... |
| 3 | simple drop among | 2003-05-22 | Danielle Haynes | Happy only pass travel modern either. Year add... |
| 4 | end difference compare | 1984-07-21 | Danny Boyd | Time along lead each for prevent. Deal bit sim... |
| 5 | my step should | 2004-12-15 | Joseph Baker | Republican they listen hold TV. Choice mean sc... |
| 6 | pretty television various | 1975-11-14 | Mr. Mark Hill | Consider defense really answer. Standard parti... |
| 7 | degree improve easy | 1999-05-10 | Richard Fry | Central enough police point. Exist president a... |
| 8 | federal sure quality | 2017-09-27 | Eric Rowe | Forget green quickly agreement. Tell guess thr... |
| 9 | organization trade business | 2012-03-05 | Jason White | Movement money watch quality true anyone educa... |
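
The scraped `date` values are plain strings. As a possible follow-up step (not part of the original notebook), this sketch converts them to real datetimes so they sort and filter chronologically. The two rows are copied from the output above as a stand-in for the full scrape, and the variable name `news` is an assumption.

```python
import pandas as pd

# Two rows copied from the output above, standing in for the full scrape
news = pd.DataFrame([
    {'headline': 'in question try', 'date': '2003-04-26', 'author': 'Timothy Moreno DDS'},
    {'headline': 'series I level', 'date': '1988-02-13', 'author': 'Matthew Bryant'},
])

# Parse the ISO-formatted strings into datetime64 values
news['date'] = pd.to_datetime(news['date'], format='%Y-%m-%d')

# With real datetimes, the rows can be ordered chronologically
news = news.sort_values('date').reset_index(drop=True)
print(news[['headline', 'date']])
```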


## Scraping People

In [None]:
response = requests.get('https://web-scraping-demo.zgulde.net/people', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
# def parse_person(person):
#     name = 
#     quote = 
#     email = 
#     phone = 
#     address = 

    
#     return {
#         'name': name, 'quote': quote, 'email': email,
#         'phone': phone,
#         'address': address
#     }

In [None]:
# loop through all the persons


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [Codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
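
To make the `robots.txt` point concrete, here is a minimal sketch using the standard library's `urllib.robotparser`. The rules are invented and parsed from an in-memory string so that no network request is made, and the `example.com` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt rules, parsed from a string so no request is sent
rules = """
User-agent: *
Disallow: /private/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(rules)

user_agent = 'codeup data science germain cohort'
blog_allowed = parser.can_fetch(user_agent, 'https://example.com/blog/')
private_allowed = parser.can_fetch(user_agent, 'https://example.com/private/x')
print(blog_allowed, private_allowed)
```

Against a real site, `parser.set_url(...)` followed by `parser.read()` would download the site's actual `robots.txt` instead of the inline rules used here.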

## Exercises

#### Codeup Blog Articles

Visit [Codeup's Blog](http://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

In [None]:
response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')