# Data Acquisition with Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

`$ pip install beautifulsoup4`

## Soup Methods

- `soup.select('.class')`: to get all the elements with class `class`
- `soup.select_one('.class')`: to get the first element with class `class`
- `soup.h2`: to get the first `h2` element
- `soup.find_all('h2')`: to get all the elements with tag name of `h2`
- `soup('h2')`: shorthand for the `find_all` call above
- `soup.find('h2')`: to get the first matching element (same result as `soup.h2`)
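As a quick, self-contained illustration of the methods listed above, the sketch below runs each one against a small inline HTML string. The markup and the class name `lead` are made up for the example and do not come from the demo site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate the methods above.
html = """
<div>
  <h2 class="lead">First</h2>
  <h2>Second</h2>
  <p class="lead">Intro</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".lead")            # every element with class "lead"
first_by_class = soup.select_one(".lead")  # first element with class "lead"
first_h2 = soup.h2                         # first <h2>, same as soup.find("h2")
all_h2 = soup.find_all("h2")               # every <h2>; soup("h2") is equivalent

print(len(by_class), first_by_class.text, first_h2.text, [h.text for h in all_h2])
```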

First, make the request. The response body is the page's raw HTML.

In [2]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
# html
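The request above assumes the server answers successfully. A slightly more defensive variant might look like the sketch below. `fetch_html` is a hypothetical helper (not part of the lesson) that sets a timeout and raises on HTTP error statuses; the `get` parameter is an assumption added so the helper can be tried without network access.

```python
import requests

def fetch_html(url, user_agent="ds_student", timeout=10, get=requests.get):
    """Return the body of `url` as text, raising on HTTP error statuses.

    `fetch_html` and its parameters are illustrative names; `get` is
    injectable so the helper can be exercised without a live request.
    """
    response = get(url, headers={"user-agent": user_agent}, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text

# Example call (performs a real request):
# html = fetch_html('https://web-scraping-demo.zgulde.net/news')
```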

We can make more sense of that HTML by parsing it with the Beautiful Soup library.

In [3]:
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's parser warning
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   News Example Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   News!
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid gap-y-12">
   <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
    <img src="/static/placeholder.png"/>
    <div class="col-span-3 space-y-3 py-3

From here we can switch between the browser and Python and try out different ways of getting different parts of the HTML document.

We can leverage Google Chrome's developer tools by right-clicking an element and choosing "Inspect". We can then use this HTML document inspector to find the tags and classes we need for scraping.

In [4]:
# Use beautifulsoup methods to extract necessary content from an article

In [5]:
articles = soup.select('.grid-cols-4')
# articles

In [6]:
article = articles[0]
article

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">process expert throughout</h2>
<div class="grid grid-cols-2 italic">
<p> 2011-05-12 </p>
<p class="text-right">By Matthew Liu </p>
</div>
<p>Listen during discover knowledge color. Generation course pressure sort agent direction.
Tough product such push. True speak air wall bed keep hot.</p>
</div>
</div>

In [7]:
# element.tagname.text returns the text of the first matching tag inside the element
headline = article.h2.text
headline

'process expert throughout'

In [8]:
# get the date from the first <p>
# .strip() removes the surrounding whitespace
date = article.p.text.strip()
date

'2011-05-12'

In [9]:
article.select('.text-right')[0].text.strip()[3:]

'Matthew Liu'

In [10]:
# the leading dot in '.text-right' is CSS selector syntax for a class; [3:] drops the 'By ' prefix
author = article.select('.text-right')[0].text.strip()[3:]
author

'Matthew Liu'

In [11]:
# getting the actual content
content = article.select('p')[-1].text
content

'Listen during discover knowledge color. Generation course pressure sort agent direction.\nTough product such push. True speak air wall bed keep hot.'

Bringing it all together: Make a function

In [12]:
def parse_news(article):
    headline = article.h2.text
    date = article.p.text.strip()
    author = article.select('.text-right')[0].text.strip()[3:]
    content = article.select('p')[-1].text
    
    return {
        'headline': headline, 'date': date, 'author': author,
        'content': content
    }
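One caveat of `parse_news` is that it slices off the `By ` prefix purely by position. The following is a sketch of a variant that checks for the prefix before removing it. `parse_news_safe` is a hypothetical name, and the sample markup is abridged from the first article shown earlier.

```python
from bs4 import BeautifulSoup

# Abridged from the first article printed earlier in this notebook.
sample = """
<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">process expert throughout</h2>
<div class="grid grid-cols-2 italic">
<p> 2011-05-12 </p>
<p class="text-right">By Matthew Liu </p>
</div>
<p>Listen during discover knowledge color.</p>
</div>
</div>
"""

def parse_news_safe(article):
    """Variant of parse_news that only strips 'By ' when it is present."""
    paragraphs = article.find_all("p")
    byline = article.select_one(".text-right").text.strip()
    author = byline[3:] if byline.startswith("By ") else byline
    return {
        "headline": article.h2.text.strip(),
        "date": paragraphs[0].text.strip(),
        "author": author,
        "content": paragraphs[-1].text.strip(),
    }

article_tag = BeautifulSoup(sample, "html.parser").select_one(".grid-cols-4")
print(parse_news_safe(article_tag))
```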

In [13]:

parse_news(article)

{'headline': 'process expert throughout',
 'date': '2011-05-12',
 'author': 'Matthew Liu',
 'content': 'Listen during discover knowledge color. Generation course pressure sort agent direction.\nTough product such push. True speak air wall bed keep hot.'}

In [14]:
# loop through all the articles
# [parse_news(article) for article in articles]

In [15]:
# loop through all the articles
pd.DataFrame([parse_news(article) for article in articles])

Unnamed: 0,headline,date,author,content
0,process expert throughout,2011-05-12,Matthew Liu,Listen during discover knowledge color. Genera...
1,edge accept six,2019-07-03,Patrick Thomas,Beyond simple add I television. Wrong we troub...
2,force often indicate,2015-12-19,Lisa Lara,Key get alone meet. Situation pull few opportu...
3,me point dog,2006-09-03,Jacqueline Stout,Marriage civil director goal certainly. By att...
4,use rise step,2007-07-11,Kristen Hernandez,Design investment trouble we forget. Enter mai...
5,admit reach work,1984-06-04,Stephen White,Dream real available cold. Shoulder reveal guy...
6,may resource financial,1989-08-11,Julie Torres,Again any soldier reality local type group rol...
7,according security clearly,1980-10-09,James Torres Jr.,College then gas language list. Central produc...
8,thought case population,1993-11-18,Julie Diaz,Own scientist themselves forward. Professional...
9,young leader tree,2011-01-23,Jasmine Morgan,Same base great card situation economic wind c...


## Scraping People

In [16]:
response = requests.get('https://web-scraping-demo.zgulde.net/people', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')

In [17]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Example People Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   People
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
   <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
    <h2 class="text-2xl text-purp

In [18]:
cards = soup.select(".person")
cards

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Alicia Boyd</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Self-enabling neutral model"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">huangkevin@hotmail.com</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">090.767.3956</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 58172 Arias Extension <br/>
                 South Randy, KY 34607
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full borde

In [19]:
card = cards[0]
card

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Alicia Boyd</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Self-enabling neutral model"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">huangkevin@hotmail.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">090.767.3956</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                58172 Arias Extension <br/>
                South Randy, KY 34607
            </p>
</div>
</div>

In [20]:
name = card.h2.text
name

'Alicia Boyd'

In [21]:
quote = card.p.text.strip()
quote

'"Self-enabling neutral model"'

In [22]:
email = card.find_all('p')[1].text
email

'huangkevin@hotmail.com'

In [23]:
phone = card.find_all('p')[2].text
phone

'090.767.3956'

In [24]:
# address = card.find_all('p')[3].text.strip()
# address

In [25]:
import re
address = card.find_all('p')[3].text.strip()
address = re.sub(r"\s{2,}", "", address)

address

'58172 Arias ExtensionSouth Randy, KY 34607'
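Note that replacing each whitespace run with an empty string fuses the two address lines (`ExtensionSouth`). If a readable separator is wanted, one option is to substitute a comma and a space instead. This is a sketch of an alternative, not what the cells in this notebook do; `raw_address` is a hand-written stand-in for `card.find_all('p')[3].text`.

```python
import re

# Stand-in for card.find_all('p')[3].text: two lines split by <br/> and indentation.
raw_address = "\n                58172 Arias Extension \n                South Randy, KY 34607\n            "

# Collapse each run of 2+ whitespace characters into ", " rather than "".
clean_address = re.sub(r"\s{2,}", ", ", raw_address.strip())
print(clean_address)  # 58172 Arias Extension, South Randy, KY 34607
```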

In [26]:
card.find_all('p')

[<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Self-enabling neutral model"
         </p>,
 <p class="email col-span-8">huangkevin@hotmail.com</p>,
 <p class="phone col-span-8">090.767.3956</p>,
 <p class="col-span-8">
                 58172 Arias Extension <br/>
                 South Randy, KY 34607
             </p>]

In [27]:
def parse_person(card):
    name = card.h2.text
    quote = card.p.text.strip()
    email = card.find_all('p')[1].text
    phone = card.find_all('p')[2].text
    address = card.find_all('p')[3].text.strip()
    address = re.sub(r"\s{2,}", "", address)
    
    return {
        'name': name, 'quote': quote, 'email': email,
        'phone': phone,
        'address': address
    }

In [28]:
parse_person(card)

{'name': 'Alicia Boyd',
 'quote': '"Self-enabling neutral model"',
 'email': 'huangkevin@hotmail.com',
 'phone': '090.767.3956',
 'address': '58172 Arias ExtensionSouth Randy, KY 34607'}

In [29]:
# loop through all the persons
pd.DataFrame([parse_person(card) for card in cards])

Unnamed: 0,name,quote,email,phone,address
0,Alicia Boyd,"""Self-enabling neutral model""",huangkevin@hotmail.com,090.767.3956,"58172 Arias ExtensionSouth Randy, KY 34607"
1,Russell Mitchell,"""Ergonomic grid-enabled capacity""",caitlinfields@francis.com,+1-615-412-4880,"373 Carolyn PortsSandrashire, NH 74201"
2,Juan Adams,"""Multi-tiered leadingedge protocol""",belldana@phelps-kirk.com,(093)127-1374x7525,"047 Rhonda Walk Apt. 117Popemouth, IA 96348"
3,Ronald Blevins,"""Extended non-volatile ability""",michael35@hotmail.com,+1-816-785-2365x69413,"2314 Nicole RoadNorth Julieberg, GA 19816"
4,Annette Taylor,"""Grass-roots explicit strategy""",jeremy60@middleton.com,070-245-2765x0640,"021 Deborah PlaceLake Jacobberg, OH 95483"
5,Michael Atkinson,"""Front-line bandwidth-monitored middleware""",vwilkinson@gmail.com,8954594745,"887 Hawkins Fork Apt. 056New Davidton, OK 52972"
6,Jason Miller,"""Networked bi-directional archive""",xtaylor@christensen-lyons.com,186.172.2353x43570,"605 Wallace Bypass Suite 993South Matthew, CA ..."
7,Christopher Mills,"""Persistent executive frame""",xwillis@gmail.com,464-378-8235x83897,"9394 Gill Inlet Suite 716South Dwaynebury, IL ..."
8,Joseph Garza,"""Cross-group encompassing capacity""",coxvincent@martin.biz,(411)674-7689x81669,"9120 David RueSouth Daniel, WA 75480"
9,Barbara Harrell,"""Multi-lateral zero tolerance pricing structure""",akennedy@gonzalez.com,233-206-7976,"086 Le RidgesSanchezfort, PA 97034"


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
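One way to honor `robots.txt` programmatically is the standard library's `urllib.robotparser`. The sketch below parses an inline, made-up rule set so it runs without network access; against a real site you would instead call `set_url(...)` and `read()`. The rules and paths here are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # for a live site: parser.set_url(url); parser.read()

agent = "codeup data science germain cohort"
print(parser.can_fetch(agent, "http://example.com/blog/"))      # True
print(parser.can_fetch(agent, "http://example.com/private/x"))  # False
```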

## Exercises

#### Codeup Blog Articles

Visit [Codeup's Blog](http://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that returns a list of dictionaries, with each dictionary representing one article and holding at least its title and content.

In [30]:
response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')

In [31]:
# print(soup.prettify())

In [None]:
<h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>

In [110]:
articles = soup.find_all('h2', class_ = 'entry-title')
# articles

In [70]:
articles[0]

<h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>

In [71]:
article = articles[0]
article

<h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>

In [72]:
title = article.text
title

'Codeup Start Dates for March 2022'

In [73]:
article

<h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>

In [109]:
link = article.a.attrs['href']
link

'https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/'

In [150]:
article_response = requests.get(link, headers={'user-agent': 'Codeup DS Hopper'})
article_soup = BeautifulSoup(article_response.text, 'html.parser')
# article_soup

In [145]:
article_content = [p.text for p in article_soup.find_all('p')]

In [149]:
article_content

['Jan 26, 2022 | Codeup News',
 'As we approach the end of January we wanted to look forward to our next start dates for all of our current programs.',
 'Full Stack Web Development is the first program we built and also our most popular. You’ve asked and we listened! Our next Web Development cohort will start on 3/7/2022 and is ENTIRELY VIRTUAL! THESE SEATS WILL GO FAST!',
 'As one of the most in-demand jobs in the country, software and web development is the tech career with the newest jobs. In the U.S., there’s:',
 '\xa0',
 'Our first new Data Science class of 2022 starts Monday 3/22/2022 at our downtown campus at the Vogue building.',
 'Why consider pivoting careers to Data Science?',
 'The supply of data scientists remains painfully low compared to the outrageous demand. YOU can help close the gap while launching a fulfilling, secure, and high-paying career – one of the very best in the country!',
 'Employers are scrambling to find talent due to a lack of qualified applicants. YOU 
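The paragraph list still contains the byline and a stray non-breaking space (`'\xa0'`). If a single string is preferred for the article content, a sketch like the following drops whitespace-only entries and joins the rest. `paragraphs_to_text` is a hypothetical helper, and the sample list is abridged from the output above.

```python
def paragraphs_to_text(paragraphs):
    """Join paragraph strings, skipping entries that are only whitespace."""
    # str.strip() also removes non-breaking spaces.
    cleaned = (p.strip() for p in paragraphs)
    return "\n".join(p for p in cleaned if p)

sample_paragraphs = [
    "Jan 26, 2022 | Codeup News",
    "\xa0",
    "Why consider pivoting careers to Data Science?",
]
print(paragraphs_to_text(sample_paragraphs))
```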

## The approach below returns the same "content" for every title

In [78]:
# # for content:
# summaries = soup.find_all('div',class_="post-content")
# summaries[0]

<div class="post-content"><div class="post-content-inner"><p>As we approach the end of January we wanted to look forward to our next start dates for all of our current programs....</p>
</div></div>

In [84]:
# summary = summaries[0].text.strip()
# summary

'As we approach the end of January we wanted to look forward to our next start dates for all of our current programs....'

In [151]:
def get_blog_articles(article):
    # `article` is one <h2 class="entry-title"> element from the blog index page,
    # so the index page does not need to be requested again here
    title = article.text
    link = article.a.attrs['href']
    # fetch the full post and collect the text of every paragraph
    article_response = requests.get(link, headers={'user-agent': 'Codeup DS Hopper'})
    article_soup = BeautifulSoup(article_response.text, 'html.parser')
    article_content = [p.text for p in article_soup.find_all('p')]

    return {
        'title': title, 'article_content': article_content
    }
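The exercise asks for a function that returns the whole list of dictionaries. One possible shape is sketched here under a few assumptions: `scrape_blog` and the injectable `fetch` argument are hypothetical names, the `h2.entry-title` selector is the one used above, and the tiny pages in the usage example are fabricated stand-ins so the code runs offline.

```python
from bs4 import BeautifulSoup

def scrape_blog(index_html, fetch):
    """Return one {'title', 'content'} dict per post linked from the index page.

    `fetch` is any callable mapping a URL to that page's HTML text.
    """
    index_soup = BeautifulSoup(index_html, "html.parser")
    posts = []
    for heading in index_soup.find_all("h2", class_="entry-title"):
        link = heading.a["href"]
        post_soup = BeautifulSoup(fetch(link), "html.parser")
        paragraphs = [p.text.strip() for p in post_soup.find_all("p")]
        posts.append({
            "title": heading.text.strip(),
            "content": "\n".join(p for p in paragraphs if p),
        })
    return posts

# Fabricated stand-in pages for an offline demonstration.
fake_index = '<h2 class="entry-title"><a href="/post-1/">Post One</a></h2>'
fake_pages = {"/post-1/": "<p>Hello.</p><p> </p><p>Goodbye.</p>"}
print(scrape_blog(fake_index, fake_pages.get))
```

With the real site, `fetch` could be a small wrapper around `requests.get(url, headers=...).text`, which keeps the parsing logic testable without making requests.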

In [153]:
codeup_blog_posts = pd.DataFrame([get_blog_articles(article) for article in articles])

In [154]:
codeup_blog_posts

Unnamed: 0,title,article_content
0,Codeup Start Dates for March 2022,"[Jan 26, 2022 | Codeup News, As we approach th..."
1,VET TEC Funding Now Available For Dallas Veterans,"[Jan 7, 2022 | Codeup News, Dallas Newsletter,..."
2,Dallas Campus Re-opens With New Grant Partner,"[Dec 30, 2021 | Codeup News, Featured, We are ..."
3,Codeup Dallas Open House,"[Nov 30, 2021 | Dallas Newsletter, Events, Com..."
4,Codeup’s Placement Team Continues Setting Records,"[Nov 19, 2021 | Codeup News, Employers, Who ex..."
5,"IT Certifications 101: Why They Matter, and Wh...","[Nov 18, 2021 | IT Training, Tips for Prospect..."
6,A rise in cyber attacks means opportunities fo...,"[Nov 17, 2021 | Cybersecurity, In the last few..."
7,Use your GI Bill® benefits to Land a Job in Tech,"[Nov 4, 2021 | Codeup News, Tips for Prospecti..."
8,Which program is right for me: Cyber Security ...,"[Oct 28, 2021 | IT Training, Tips for Prospect..."
9,What the Heck is System Engineering?,"[Oct 21, 2021 | IT Training, Tips for Prospect..."


In [155]:
codeup_blog_posts.article_content[0]

['Jan 26, 2022 | Codeup News',
 'As we approach the end of January we wanted to look forward to our next start dates for all of our current programs.',
 'Full Stack Web Development is the first program we built and also our most popular. You’ve asked and we listened! Our next Web Development cohort will start on 3/7/2022 and is ENTIRELY VIRTUAL! THESE SEATS WILL GO FAST!',
 'As one of the most in-demand jobs in the country, software and web development is the tech career with the newest jobs. In the U.S., there’s:',
 '\xa0',
 'Our first new Data Science class of 2022 starts Monday 3/22/2022 at our downtown campus at the Vogue building.',
 'Why consider pivoting careers to Data Science?',
 'The supply of data scientists remains painfully low compared to the outrageous demand. YOU can help close the gap while launching a fulfilling, secure, and high-paying career – one of the very best in the country!',
 'Employers are scrambling to find talent due to a lack of qualified applicants. YOU 

### Cracked it : )

#### News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named `get_news_articles` that returns a list of dictionaries, where each dictionary has this shape:

```python
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

In [189]:
base_url = 'https://inshorts.com'
section_links = ["/en/read/business","/en/read/sports","/en/read/technology","/en/read/entertainment"]
response = requests.get(base_url + '/en/read', headers={'user-agent': 'ds_student'})
soup = BeautifulSoup(response.text, 'html.parser')


In [190]:
response

<Response [200]>

In [191]:
soup.find_all(class_ = 'active-category')

[<li class="active-category selected">All News</li>,
 <li class="active-category">India</li>,
 <li class="active-category">Business</li>,
 <li class="active-category">Sports</li>,
 <li class="active-category">World</li>,
 <li class="active-category">Politics</li>,
 <li class="active-category">Technology</li>,
 <li class="active-category">Startup</li>,
 <li class="active-category">Entertainment</li>,
 <li class="active-category">Miscellaneous</li>,
 <li class="active-category">Hatke</li>,
 <li class="active-category">Science</li>,
 <li class="active-category">Automobile</li>]

In [192]:
temp = soup.find_all('ul')

In [214]:
soup.ul.attrs

{'class': ['category-list']}
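The `category-list` element holds one `<a>` per category, each with the section path in its `href`. A minimal sketch of collecting those paths into a lookup follows. The markup is abridged from this page's category list (the `onclick` attributes are omitted), and `category_links` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Abridged from the page's category list; onclick attributes omitted.
nav_html = """
<ul class="category-list">
 <a href="/en/read"> <li class="active-category selected">All News</li> </a>
 <a href="/en/read/business"> <li class="active-category">Business</li> </a>
 <a href="/en/read/sports"> <li class="active-category">Sports</li> </a>
</ul>
"""
nav_soup = BeautifulSoup(nav_html, "html.parser")

# Map each visible label to its relative link.
category_links = {a.text.strip(): a["href"] for a in nav_soup.ul.find_all("a")}
print(category_links)
```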

In [219]:
temp2 = soup.ul.find_all('a')

In [220]:
temp2

[<a href="/en/read" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToAllNews', 'action': 'clicked', 'label': 'RedirectedToAllNews' });"> <li class="active-category selected">All News</li> </a>,
 <a href="/en/read/national" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToIndiaNews', 'action': 'clicked', 'label': 'RedirectedToIndiaNews' });"> <li class="active-category">India</li> </a>,
 <a href="/en/read/business" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToBusinessNews', 'action': 'clicked', 'label':  'RedirectedToBusinessNews' });"> <li class="active-category">Business</li> </a>,
 <a href="/en/read/sports" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkTosportsNews', 'action': 'clicked', 'label': 'RedirectedToSportsNews' });"> <li class="active-category">Sports</li> </a>,
 <a href="/en/read/world" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToworldNews', 'action': 'clicked', 'label': 'Re

In [221]:
temp2[2]

<a href="/en/read/business" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToBusinessNews', 'action': 'clicked', 'label':  'RedirectedToBusinessNews' });"> <li class="active-category">Business</li> </a>

### Might need to come back to this and try an easier approach: for example, scrape each section page directly instead of navigating to it through the main page
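Following that idea, the section pages can be addressed directly by joining `base_url` with each entry of `section_links`, and the category label can be read off the last path segment. This is only a sketch of the URL bookkeeping; `section_urls` is a hypothetical name, and how each section page is parsed is left open because its markup is not shown here.

```python
from urllib.parse import urljoin

base_url = "https://inshorts.com"
section_links = ["/en/read/business", "/en/read/sports",
                 "/en/read/technology", "/en/read/entertainment"]

# Map category name (last path segment) to the full section URL.
section_urls = {link.rsplit("/", 1)[-1]: urljoin(base_url, link)
                for link in section_links}

for category, url in section_urls.items():
    print(category, url)
```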