# Data Acquisition with Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

`$ pip install beautifulsoup4`

## Soup Methods

- `soup.select('.class')`: to get all the elements with class `class`
- `soup.select_one('.class')`: to get the first element with class `class`
- `soup.h2`: to get the first `h2` element
- `soup.find_all('h2')`: to get all the elements with tag name of `h2`
- `soup('h2')`: shorthand for the `find_all` call above
- `soup.find('h2')`: finds the first matching element
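
To make the list above concrete, here is a small sketch that runs each method against a hand-written snippet. The HTML string and the variable names (`sample_html`, `sample_soup`) are made up for illustration and are not taken from the demo site.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document (hypothetical, not from the demo site).
sample_html = """
<div>
  <h2 class="title">First</h2>
  <p class="note">one</p>
  <h2 class="title">Second</h2>
  <p class="note">two</p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

all_notes = sample_soup.select(".note")       # every element with class "note"
first_note = sample_soup.select_one(".note")  # first match only
first_h2 = sample_soup.h2                     # first <h2> via attribute access
all_h2 = sample_soup.find_all("h2")           # every <h2>
also_all_h2 = sample_soup("h2")               # calling the soup is shorthand for find_all
found_h2 = sample_soup.find("h2")             # first <h2>

print([tag.text for tag in all_notes])
print(first_h2.text, len(all_h2))
```

`select` and `find_all` always return a list-like result, while `select_one` and `find` return a single tag or `None` when nothing matches.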

First, make the request. The response body is the page's raw HTML.

In [2]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
# html

We can make more sense of that HTML with the Beautiful Soup library.

In [3]:
# name the parser explicitly so bs4 does not have to guess one (and warn about it)
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   News Example Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   News!
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid gap-y-12">
   <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
    <img src="/static/placeholder.png"/>
    <div class="col-span-3 space-y-3 py-3

From here we can switch between the browser and Python, trying out different ways to pull specific parts out of the HTML document.

Google Chrome's developer tools help here: right-click an element and choose "Inspect" to open the HTML inspector, then use it to find the tags and classes to target when scraping.

In [4]:
# Use beautifulsoup methods to extract necessary content from an article

In [5]:
articles = soup.select('.grid-cols-4')
# articles

In [6]:
article = articles[0]
article

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">security fine southern</h2>
<div class="grid grid-cols-2 italic">
<p> 2013-04-19 </p>
<p class="text-right">By Michael Lindsey </p>
</div>
<p>On spend watch station south. Care need note politics possible both class. Health word ready thought discuss.
Write major lot imagine. Stop debate hair key about between happy. Nature group third song section process organization strong. Minute major there well serious girl.</p>
</div>
</div>

In [7]:
# soupmethod.tagname.text
headline = article.h2.text
headline

'security fine southern'

In [8]:
# get the date
# there was some white space that we stripped out
date = article.p.text.strip()
date

'2013-04-19'

In [9]:
article.select('.text-right')[0].text.strip()[3:]

'Michael Lindsey'

In [10]:
# the leading dot in '.text-right' is CSS selector syntax for a class name and is required
# [3:] drops the leading 'By ' from the byline
author = article.select('.text-right')[0].text.strip()[3:]
author

'Michael Lindsey'

In [11]:
# getting the actual content
content = article.select('p')[-1].text
content

'On spend watch station south. Care need note politics possible both class. Health word ready thought discuss.\nWrite major lot imagine. Stop debate hair key about between happy. Nature group third song section process organization strong. Minute major there well serious girl.'

Bringing it all together: Make a function

In [12]:
def parse_news(article):
    headline = article.h2.text
    date = article.p.text.strip()
    author = article.select('.text-right')[0].text.strip()[3:]
    content = article.select('p')[-1].text
    
    return {
        'headline': headline, 'date': date, 'author': author,
        'content': content
    }

In [13]:

parse_news(article)

{'headline': 'security fine southern',
 'date': '2013-04-19',
 'author': 'Michael Lindsey',
 'content': 'On spend watch station south. Care need note politics possible both class. Health word ready thought discuss.\nWrite major lot imagine. Stop debate hair key about between happy. Nature group third song section process organization strong. Minute major there well serious girl.'}
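
`parse_news` assumes every card contains an `h2`, a date paragraph, and a `.text-right` byline. If one is missing, `article.h2.text` raises `AttributeError` and `[0]` raises `IndexError`. A minimal sketch of a more forgiving variant follows; `text_or_none` and `parse_news_safe` are hypothetical names, and the incomplete card is hand-written for illustration.

```python
from bs4 import BeautifulSoup


def text_or_none(element):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return element.text.strip() if element is not None else None


def parse_news_safe(article):
    # Hypothetical variant of parse_news: same selectors, but a missing
    # element yields None instead of raising an exception.
    author = text_or_none(article.select_one('.text-right'))
    if author is not None and author.startswith('By '):
        author = author[3:]
    paragraphs = article.select('p')
    return {
        'headline': text_or_none(article.h2),
        'date': text_or_none(article.p),
        'author': author,
        'content': text_or_none(paragraphs[-1]) if paragraphs else None,
    }


# Hand-written card with no author paragraph (an assumption for illustration).
incomplete = BeautifulSoup(
    '<div class="grid-cols-4"><h2>some headline</h2>'
    '<p> 2013-04-19 </p><p>Body text.</p></div>',
    'html.parser',
).div

print(parse_news_safe(incomplete))
```

Unlike the original, this version also strips the content text, so leading and trailing whitespace is removed from every field.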

In [14]:
# loop through all the articles
# [parse_news(article) for article in articles]

In [15]:
# loop through all the articles
pd.DataFrame([parse_news(article) for article in articles])

| | headline | date | author | content |
|---|---|---|---|---|
| 0 | security fine southern | 2013-04-19 | Michael Lindsey | On spend watch station south. Care need note p... |
| 1 | challenge pressure move | 1987-06-03 | Spencer Waller | Decide could finally throughout thing debate l... |
| 2 | option happen himself | 1984-11-06 | Brandon Ellis | Mission reduce camera. Social religious respon... |
| 3 | little democratic popular | 2003-11-08 | Angela Ewing | Dog ok space yeah. Among certainly including.\... |
| 4 | score help before | 2022-01-27 | Jay Johnson | Within any enter cut play sit clearly as. Whet... |
| 5 | under certainly style | 2016-06-14 | Michael Mitchell | Commercial interest company. Fast national gre... |
| 6 | reduce up black | 1971-09-28 | Mark Blevins | Answer argue social voice.\nTrip worker bill f... |
| 7 | stand team suddenly | 1971-02-19 | Julie Walker | Top industry easy whole name report. Floor tri... |
| 8 | worry our side | 1979-07-22 | Stacey Ortiz | Baby gun relate morning threat as. Material tr... |
| 9 | candidate customer until | 1983-03-11 | Cory White | Everybody some analysis do arm stand about. Re... |


## Scraping People

In [16]:
response = requests.get('https://web-scraping-demo.zgulde.net/people', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text)

In [17]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Example People Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   People
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
   <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
    <h2 class="text-2xl text-purp

In [18]:
cards = soup.select(".person")
cards

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Andrew Moore</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Down-sized object-oriented moderator"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">zroberts@wiley.net</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">001-098-027-7795x94100</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 38041 Dawson Shoals <br/>
                 Riveraton, NH 41221
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-spa

In [19]:
card = cards[0]
card

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Andrew Moore</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Down-sized object-oriented moderator"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">zroberts@wiley.net</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">001-098-027-7795x94100</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                38041 Dawson Shoals <br/>
                Riveraton, NH 41221
            </p>
</div>
</div>

In [20]:
name = card.h2.text
name

'Andrew Moore'

In [21]:
quote = card.p.text.strip()
quote

'"Down-sized object-oriented moderator"'

In [22]:
email = card.find_all('p')[1].text
email

'zroberts@wiley.net'

In [23]:
phone = card.find_all('p')[2].text
phone

'001-098-027-7795x94100'

In [24]:
# address = card.find_all('p')[3].text.strip()
# address

In [25]:
import re
address = card.find_all('p')[3].text.strip()
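# collapse runs of 2+ whitespace characters; note that replacing them with ""
# joins the street and city lines with no separator between them
address = re.sub(r"\s{2,}", "", address)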

address

'38041 Dawson ShoalsRiveraton, NH 41221'
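
The result above runs the street and the city together ("ShoalsRiveraton") because the whitespace run is replaced with an empty string. One possible alternative, sketched here under the assumption that a single space is an acceptable separator, is to replace each run of whitespace with one space. The `raw_address` string is reconstructed by hand from the card markup shown earlier.

```python
import re

# Raw text roughly as it comes out of the address <p> after .strip()
# (reconstructed by hand from the card shown above).
raw_address = "38041 Dawson Shoals \n                Riveraton, NH 41221"

# Replace every run of whitespace with a single space so the street
# and city lines stay separated.
clean_address = re.sub(r"\s+", " ", raw_address)
print(clean_address)
```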

In [26]:
card.find_all('p')

[<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Down-sized object-oriented moderator"
         </p>,
 <p class="email col-span-8">zroberts@wiley.net</p>,
 <p class="phone col-span-8">001-098-027-7795x94100</p>,
 <p class="col-span-8">
                 38041 Dawson Shoals <br/>
                 Riveraton, NH 41221
             </p>]

In [27]:
def parse_person(card):
    name = card.h2.text
    quote = card.p.text.strip()
    email = card.find_all('p')[1].text
    phone = card.find_all('p')[2].text
    address = card.find_all('p')[3].text.strip()
    address = re.sub(r"\s{2,}", "", address)
    
    return {
        'name': name, 'quote': quote, 'email': email,
        'phone': phone,
        'address': address
    }

In [28]:
parse_person(card)

{'name': 'Andrew Moore',
 'quote': '"Down-sized object-oriented moderator"',
 'email': 'zroberts@wiley.net',
 'phone': '001-098-027-7795x94100',
 'address': '38041 Dawson ShoalsRiveraton, NH 41221'}
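
`parse_person` finds the email, phone, and address by their position in `card.find_all('p')`, which breaks if a card ever gains or loses a paragraph. The card markup also carries descriptive classes (`name`, `quote`, `email`, `phone`, `address`), so a sketch that selects by class is shown below. `parse_person_by_class` is a hypothetical name, the card HTML is a trimmed hand copy of the first card, and the address is joined with single spaces rather than with no separator.

```python
from bs4 import BeautifulSoup


def parse_person_by_class(card):
    # Hypothetical variant of parse_person: each field is located by the
    # class name visible in the card markup rather than by its position.
    address = card.select_one('.address p').text
    return {
        'name': card.select_one('.name').text,
        'quote': card.select_one('.quote').text.strip(),
        'email': card.select_one('.email').text,
        'phone': card.select_one('.phone').text,
        # split() with no arguments drops all whitespace runs, including newlines
        'address': ' '.join(address.split()),
    }


# Trimmed copy of the first card from the output above.
card_html = """
<div class="person">
<h2 class="name">Andrew Moore</h2>
<p class="quote">
    "Down-sized object-oriented moderator"
</p>
<div>
<p class="email">zroberts@wiley.net</p>
<p class="phone">001-098-027-7795x94100</p>
</div>
<div class="address">
<p>
    38041 Dawson Shoals <br/>
    Riveraton, NH 41221
</p>
</div>
</div>
"""
sample_card = BeautifulSoup(card_html, 'html.parser').select_one('.person')
print(parse_person_by_class(sample_card))
```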

In [29]:
# loop through all the persons
pd.DataFrame([parse_person(card) for card in cards])

| | name | quote | email | phone | address |
|---|---|---|---|---|---|
| 0 | Andrew Moore | "Down-sized object-oriented moderator" | zroberts@wiley.net | 001-098-027-7795x94100 | 38041 Dawson ShoalsRiveraton, NH 41221 |
| 1 | Ashley Robinson | "De-engineered object-oriented process improve... | alexandra50@hotmail.com | 250.734.4739x5861 | 5194 Timothy Path Apt. 254Mccallhaven, MO 71865 |
| 2 | Casey Lawrence | "Re-contextualized disintermediate application" | kwalker@parsons.com | 438.773.8967 | 73062 Ross Lakes Suite 322Lindseyhaven, SC 89183 |
| 3 | Tammy Burnett | "Synergistic client-driven installation" | wrightmelanie@hotmail.com | 5718147101 | 558 Bauer ViaSouth Dianeshire, CA 44238 |
| 4 | Charles Farmer | "Enterprise-wide eco-centric migration" | zwells@sanders.com | +1-325-027-4192 | 34829 Jamie TraceSouth Justin, SC 70782 |
| 5 | Heather Lynch | "Re-contextualized cohesive forecast" | rgarcia@yahoo.com | 9627196987 | 72877 Frost SpringShawnmouth, DE 18056 |
| 6 | Mary Peterson | "Customizable 4thgeneration throughput" | jennifercuevas@collier-hogan.com | (657)582-0996x02985 | 7786 Brandon LandingEast Jennifer, IN 19540 |
| 7 | Linda Murray | "De-engineered bandwidth-monitored installation" | romanmarcus@gmail.com | 857.989.0351x4733 | 18799 Schroeder LocksPort Paula, OK 55336 |
| 8 | Erica Rogers | "Stand-alone maximized algorithm" | nicolelittle@hotmail.com | (648)719-4877 | 34992 Matthew Center Suite 080Kelleyfurt, OR 8... |
| 9 | Jeremy Lee | "Enhanced 24hour Local Area Network" | mclark@west.org | 556-220-0114x86437 | 802 Benjamin PortStephanieberg, KY 59602 |


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
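
Checking `robots.txt` can be automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up set of rules inline so it runs offline; the rules, the `example-scraper` user agent, and the example URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so the example runs offline.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = "example-scraper"  # assumed name; pick one that identifies you
allowed_news = parser.can_fetch(user_agent, "https://example.com/news")
allowed_private = parser.can_fetch(user_agent, "https://example.com/private/x")
print(allowed_news, allowed_private)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would take the place of `parse`, fetching the site's own `robots.txt` before any scraping starts.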

## Exercises

#### Codeup Blog Articles

Visit [Codeup's Blog](http://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that returns a list of dictionaries, with each dictionary representing one article. Each dictionary should contain at least the post's title and content.

In [30]:
response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text)

In [31]:
# print(soup.prettify())

In [32]:
# <h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>

In [33]:
articles = soup.find_all('h2', class_ = 'entry-title')
articles

[<h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/codeup-news/vet-tec-funding-dallas/">VET TEC Funding Now Available For Dallas Veterans</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/codeup-news/dallas-campus-re-opens-with-new-grant-partner/">Dallas Campus Re-opens With New Grant Partner</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/dallas-newsletter/codeup-dallas-open-house/">Codeup Dallas Open House</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeups-placement-team-continues-setting-records/">Codeup’s Placement Team Continues Setting Records</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/it-training/it-certifications-101/">IT Certifications 101: Why They Matter, and Why They Don’t</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/cybersecurity/a-

In [34]:
articles[0]

<h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>

In [35]:
article = articles[0]
article

<h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>

In [36]:
title = article.text
title

'Codeup Start Dates for March 2022'

In [37]:
article

<h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>

In [38]:
link = article.a.attrs['href']
link

'https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/'

In [39]:
def get_links():
    link_list = []
    response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
    soup = BeautifulSoup(response.text)
    articles = soup.find_all('h2', class_ = 'entry-title')
    for article in articles:
        link = article.a.attrs['href']
        link_list.append(link)
    return link_list

In [40]:
def get_link(article):
    link = article.a.attrs['href']
    return link

In [41]:
get_link(article)

'https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/'

In [42]:
temp_list = get_links()
temp_list

['https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/',
 'https://codeup.com/codeup-news/vet-tec-funding-dallas/',
 'https://codeup.com/codeup-news/dallas-campus-re-opens-with-new-grant-partner/',
 'https://codeup.com/dallas-newsletter/codeup-dallas-open-house/',
 'https://codeup.com/codeup-news/codeups-placement-team-continues-setting-records/',
 'https://codeup.com/it-training/it-certifications-101/',
 'https://codeup.com/cybersecurity/a-rise-in-cyber-attacks-means-opportunities-for-veterans-in-san-antonio/',
 'https://codeup.com/codeup-news/use-your-gi-bill-benefits-to-land-a-job-in-tech/',
 'https://codeup.com/tips-for-prospective-students/which-program-is-right-for-me-cyber-security-or-systems-engineering/',
 'https://codeup.com/it-training/what-the-heck-is-system-engineering/',
 'https://codeup.com/alumni-stories/from-speech-pathology-to-business-intelligence/',
 'https://codeup.com/behind-the-billboards/boris-behind-the-billboards/',
 'https://codeup.com/codeup-new

In [97]:
article_response = requests.get(link, headers={'user-agent': 'Codeup DS Hopper'})
article_soup = BeautifulSoup(article_response.text)
# article_soup

In [98]:
article_content = [p.text for p in article_soup.find_all('p')]

In [99]:
article_content

['Jan 26, 2022 | Codeup News',
 'As we approach the end of January we wanted to look forward to our next start dates for all of our current programs.',
 'Full Stack Web Development is the first program we built and also our most popular. You’ve asked and we listened! Our next Web Development cohort will start on 3/7/2022 and is ENTIRELY VIRTUAL! THESE SEATS WILL GO FAST!',
 'As one of the most in-demand jobs in the country, software and web development is the tech career with the newest jobs. In the U.S., there’s:',
 '\xa0',
 'Our first new Data Science class of 2022 starts Monday 3/22/2022 at our downtown campus at the Vogue building.',
 'Why consider pivoting careers to Data Science?',
 'The supply of data scientists remains painfully low compared to the outrageous demand. YOU can help close the gap while launching a fulfilling, secure, and high-paying career – one of the very best in the country!',
 'Employers are scrambling to find talent due to a lack of qualified applicants. YOU 

In [105]:
article_soup.find('h1',class_='entry-title').text

'Codeup Start Dates for March 2022'

In [112]:
title = article_soup.find('h1',class_='entry-title').text
title

'Codeup Start Dates for March 2022'

In [106]:
article_soup.find('span',class_='published').text

'Jan 26, 2022'

In [113]:
date_published = article_soup.find('span',class_='published').text
date_published

'Jan 26, 2022'

In [111]:
article_soup.find('div',class_='entry-content').text.strip()

'As we approach the end of January we wanted to look forward to our next start dates for all of our current programs.\nFull Stack Web Development – 3/7/22\nFull Stack Web Development is the first program we built and also our most popular. You’ve asked and we listened! Our next Web Development cohort will start on 3/7/2022 and is ENTIRELY VIRTUAL! THESE SEATS WILL GO FAST!\nAs one of the most in-demand jobs in the country, software and web development is the tech career with the newest jobs. In the U.S., there’s:\n\n1.5 million developer jobs*\n250,000 of them remain open\na high growth rate of 13%*\n\n\xa0\nData Science – 3/22/22\nOur first new Data Science class of 2022 starts Monday 3/22/2022 at our downtown campus at the Vogue building.\nWhy consider pivoting careers to Data Science?\n\n#1 job in America from 2016-2020 (Glassdoor*)\n650% increase in data science positions since 2012\nNearly 12 million new jobs between 2019 and 2029\n31% ten-year growth rate\n\nThe supply of data sc

In [114]:
content = article_soup.find('div',class_='entry-content').text.strip()
content

'As we approach the end of January we wanted to look forward to our next start dates for all of our current programs.\nFull Stack Web Development – 3/7/22\nFull Stack Web Development is the first program we built and also our most popular. You’ve asked and we listened! Our next Web Development cohort will start on 3/7/2022 and is ENTIRELY VIRTUAL! THESE SEATS WILL GO FAST!\nAs one of the most in-demand jobs in the country, software and web development is the tech career with the newest jobs. In the U.S., there’s:\n\n1.5 million developer jobs*\n250,000 of them remain open\na high growth rate of 13%*\n\n\xa0\nData Science – 3/22/22\nOur first new Data Science class of 2022 starts Monday 3/22/2022 at our downtown campus at the Vogue building.\nWhy consider pivoting careers to Data Science?\n\n#1 job in America from 2016-2020 (Glassdoor*)\n650% increase in data science positions since 2012\nNearly 12 million new jobs between 2019 and 2029\n31% ten-year growth rate\n\nThe supply of data sc

In [46]:
# # for content:
# summaries = soup.find_all('div',class_="post-content")
# summaries[0]

In [47]:
# summary = summaries[0].text.strip()
# summary

In [48]:
# def get_blog_articles(article):
#     response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
#     soup = BeautifulSoup(response.text)
#     articles = soup.find_all('h2', class_ = 'entry-title')
#     title = article.text
#     link = article.a.attrs['href']
#     article_response = requests.get(link, headers={'user-agent': 'Codeup DS Hopper'})
#     article_soup = BeautifulSoup(article_response.text)
#     article_content = [p.text for p in article_soup.find_all('p')]

#     return {
#         'title': title, 'article_content': article_content
#     }


# codeup_blog_posts = pd.DataFrame([get_blog_articles(article) for article in articles])



### This was the original, working function I created before making it more complicated with the link function and writing to JSON


In [118]:
# tinker in this cell but not the next

def get_blog_articles():
    filename = 'codeup_blog_articles.json'
    headers = {'user-agent': 'Codeup DS Hopper'}

    # reuse the cached file if a previous run already scraped the blog
    if os.path.isfile(filename):
        return pd.read_json(filename)

    article_list = []
    response = requests.get('https://codeup.com/blog/', headers=headers)
    soup = BeautifulSoup(response.text)
    # each post on the index page is linked from an <h2 class="entry-title">
    for heading in soup.find_all('h2', class_='entry-title'):
        link = get_link(heading)
        article_response = requests.get(link, headers=headers)
        article_soup = BeautifulSoup(article_response.text)
        article_list.append({
            'title': article_soup.find('h1', class_='entry-title').text,
            'date_published': article_soup.find('span', class_='published').text,
            'article_content': article_soup.find('div', class_='entry-content').text.strip(),
        })

    df = pd.DataFrame(article_list)
    df.to_json(filename)  # cache the result for the next call
    return df


# this is the CURRENT, FUNCTIONING iteration of the function


In [121]:
# def get_blog_articles():
#     filename = 'codeup_blog_articles.json'
#     if os.path.isfile(filename):
#         return pd.read_json(filename)
    
#     else:
#         article_list=[]
#         response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
#         soup = BeautifulSoup(response.text)
#         articles = soup.find_all('h2', class_ = 'entry-title')
#         for article in articles:
#             title = article.text
#             link = get_link(article)
#             article_response = requests.get(link, headers={'user-agent': 'Codeup DS Hopper'})
#             article_soup = BeautifulSoup(article_response.text)
#             article_content = [p.text for p in article_soup.find_all('p')]

#             article = {
#                 'title': title, 'article_content': article_content
#             }
#             article_list.append(article)
#         df = pd.DataFrame(article_list)
#         df.to_json('codeup_blog_articles.json')
#     return df


# # this was the PREVIOUS iteration

In [120]:
codeup_blog_posts = get_blog_articles()
codeup_blog_posts

| | title | date_published | article_content |
|---|---|---|---|
| 0 | Codeup Start Dates for March 2022 | Jan 26, 2022 | As we approach the end of January we wanted to... |
| 1 | VET TEC Funding Now Available For Dallas Veterans | Jan 7, 2022 | We are so happy to announce that VET TEC benef... |
| 2 | Dallas Campus Re-opens With New Grant Partner | Dec 30, 2021 | We are happy to announce that our Dallas campu... |
| 3 | Codeup Dallas Open House | Nov 30, 2021 | Come join us for the re-opening of our Dallas ... |
| 4 | Codeup’s Placement Team Continues Setting Records | Nov 19, 2021 | Our Placement Team is simply defined as a grou... |
| 5 | IT Certifications 101: Why They Matter, and Wh... | Nov 18, 2021 | AWS, Google, Azure, Red Hat, CompTIA…these are... |
| 6 | A rise in cyber attacks means opportunities fo... | Nov 17, 2021 | In the last few months, the US has experienced... |
| 7 | Use your GI Bill® benefits to Land a Job in Tech | Nov 4, 2021 | As the end of military service gets closer, ma... |
| 8 | Which program is right for me: Cyber Security ... | Oct 28, 2021 | What IT Career should I choose?\nIf you’re thi... |
| 9 | What the Heck is System Engineering? | Oct 21, 2021 | Codeup offers a 13-week training program: Syst... |


In [52]:
# codeup_blog_posts = pd.DataFrame([get_blog_articles(article) for article in articles])


In [53]:
# codeup_blog_posts = 
# codeup_blog_posts

In [54]:
# codeup_blog_posts.article_content[0]

### Cracked it : )

#### Inshorts News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:


```python
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

In [55]:
# beginning this scraping with just the business section
base_url = 'https://inshorts.com/en/read/'
response = requests.get(base_url + 'business', headers={'user-agent':'ds_student'})
soup = BeautifulSoup(response.text)

In [56]:
cards = soup.select('.news-card')

In [57]:
card = cards[0]

In [58]:
title = card.find('span',itemprop='headline').text

In [59]:
title

'RBI cancels licence of Maha-based Independence Co-operative Bank'

In [60]:
card.find('span', class_="author").text

'Shalini Ojha'

In [61]:
author = card.find('span', class_="author").text
author

'Shalini Ojha'

In [62]:
card.find('div', itemprop="articleBody").text

"RBI has cancelled licence of Maharashtra-based Independence Co-operative Bank, citing inadequate capital. It will cease to carry on banking operations from the close of business on February 3. In the present situation, the bank won't be able to pay its depositors in full, RBI said. It added that the bank didn't comply with multiple sections of Banking Regulation Act, 1949. "

In [63]:
content = card.find('div', itemprop="articleBody").text
content

"RBI has cancelled licence of Maharashtra-based Independence Co-operative Bank, citing inadequate capital. It will cease to carry on banking operations from the close of business on February 3. In the present situation, the bank won't be able to pay its depositors in full, RBI said. It added that the bank didn't comply with multiple sections of Banking Regulation Act, 1949. "

In [90]:
## was trying to arrive at a variable that would be either/or depending on the 
## (mis)spelling of the 'clas=date' tag
# variable = 'clas="date"'
# card.find('span',variable.strip()).text

In [80]:
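# the page's markup appears to spell this attribute `clas` (not `class`),
# so it is passed to find() as an ordinary attribute filter
date_published = card.find('span', clas="date").text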
date_published

'03 Feb 2022,Thursday'

#### OK this is looking great, time to build a function for it:

In [95]:
def get_news_articles():
    filename = 'inshort_news_articles.json'
    if os.path.isfile(filename):
        return pd.read_json(filename)
    else:
        base_url = 'https://inshorts.com/en/read/'
        sections = ["business","sports","technology","entertainment"]
        articles = []
        for section in sections:
            response = requests.get(base_url + section, headers={'user-agent': 'ds_student'})
            soup = BeautifulSoup(response.text)
            cards = soup.select('.news-card')
            for card in cards:
                title = card.find('span',itemprop='headline').text
                author = card.find('span', class_="author").text
                content = card.find('div', itemprop="articleBody").text
                date_published = card.find('span', clas="date").text
                article = ({'section': section, 
                            'title': title, 
                            'author': author, 
                            'content': content,
                            'date_published': date_published})
                articles.append(article)
        df = pd.DataFrame(articles)
        df.to_json('inshort_news_articles.json')
    return df

### Consider passing a 'caching' argument so that you can bypass the existing json if you need to (without having to delete it in the directory)
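
One way the caching idea above could look is sketched below. The helper `load_or_build`, its `use_cache` parameter, and the `fake_scrape` stand-in are all hypothetical; the sketch uses the standard `json` module and a temporary file instead of pandas and real HTTP requests so that it runs offline.

```python
import json
import os
import tempfile


def load_or_build(filename, build, use_cache=True):
    """Return cached records from filename, or call build() and cache the result.

    `build` stands in for the scraping step; use_cache=False forces a rebuild
    even when the file already exists.
    """
    if use_cache and os.path.isfile(filename):
        with open(filename) as f:
            return json.load(f)
    records = build()
    with open(filename, 'w') as f:
        json.dump(records, f)
    return records


# Demonstration with a fake builder instead of real HTTP requests.
calls = []


def fake_scrape():
    calls.append(1)  # count how many times the "scrape" actually runs
    return [{'title': 'example', 'section': 'business'}]


cache_path = os.path.join(tempfile.mkdtemp(), 'articles.json')
first = load_or_build(cache_path, fake_scrape)                   # scrapes, writes the cache
second = load_or_build(cache_path, fake_scrape)                  # reads the cache
third = load_or_build(cache_path, fake_scrape, use_cache=False)  # scrapes again
print(len(calls))
```

The same flag could be added to `get_news_articles` by guarding its `os.path.isfile(filename)` check, which avoids having to delete the JSON file by hand to get fresh data.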

In [96]:
inshort_articles = get_news_articles()
inshort_articles

| | section | title | author | content | date_published |
|---|---|---|---|---|---|
| 0 | business | RBI cancels licence of Maha-based Independence... | Shalini Ojha | RBI has cancelled licence of Maharashtra-based... | 03 Feb 2022,Thursday |
| 1 | business | Boost to EVs a big step: Windmill Capital | Roshan Gupta | Increased use of EVs in public transport, spec... | 03 Feb 2022,Thursday |
| 2 | business | Tesla co-worker used N-word, threw a hot tool ... | Kiran Khatri | A former Tesla worker has filed a lawsuit agai... | 03 Feb 2022,Thursday |
| 3 | business | Mark Zuckerberg risks losing $24 billion wealt... | Kiran Khatri | Mark Zuckerberg risks losing $24 billion from ... | 03 Feb 2022,Thursday |
| 4 | business | Why did Facebook's parent company Meta lose $2... | Pragya Swastik | Facebook parent Meta's shares fell over 20% ea... | 03 Feb 2022,Thursday |
| ... | ... | ... | ... | ... | ... |
| 94 | entertainment | Light candle, pray for me; let's heal together... | Ria Kapoor | Singer Tom Parker who was reportedly diagnosed... | 03 Feb 2022,Thursday |
| 95 | entertainment | Tejasswi winning Bigg Boss because of Naagin i... | Kriti Kambiri | Actor Karan Kundrra has said that claims of hi... | 03 Feb 2022,Thursday |
| 96 | entertainment | Greatest blessing: Riteish wishes Genelia on 1... | Mahima Kharbanda | Actor Riteish Deshmukh on Thursday marked his ... | 03 Feb 2022,Thursday |
| 97 | entertainment | Priyanka to star opposite Anthony Mackie in ac... | Kriti Kambiri | Actress Priyanka Chopra will be starring oppos... | 03 Feb 2022,Thursday |
