In [1]:
# Import libraries

import numpy as np
import pandas as pd

from requests import get
import re
from bs4 import BeautifulSoup

import os

### Exercises

By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the end function should be present in acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module.)

### 1. Codeup Blog Articles

Scrape the article text from the following pages:

- https://codeup.com/codeups-data-science-career-accelerator-is-here/
- https://codeup.com/data-science-myths/
- https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
- https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
- https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:
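A minimal sketch of that shape, assuming the keys `title` and `content` used by the scraping code in this notebook (the values are placeholders, not scraped text):

```python
# Placeholder values; only the keys 'title' and 'content' come from the notebook's code
example_article = {
    'title': 'the title of the article',
    'content': 'the full text content of the article',
}

# get_blog_articles would return a list of such dictionaries, one per article
example_articles = [example_article]
print(sorted(example_article))
```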

In [2]:
# Create a list of urls

urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/', 
        'https://codeup.com/data-science-myths/', 
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/', 
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/', 
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

# Print the length
len(urls)

5

In [4]:
# Create an empty list
blog_articles = []

# Loop over the urls to extract the title and content

for url in urls:
    headers = {'User-Agent': 'Codeup Data Science'}
    
    # Request the page; response.content is used below to make the soup object
    response = get(url, headers=headers)
    
    # Create the soup object by passing the HTML string and choice of parser.
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # The h1 element holds the title
    title = soup.find('h1', class_='jupiterx-post-title')
    
    # Grab the text from page
    content = soup.find('div', class_='jupiterx-post-content')
    
    # Store the title and text in a dictionary
    d = {'title': title.text, 'content': content.text}
    
    # Append the dictionary to the list
    blog_articles.append(d)
    
# Convert the list of dicts to a dataframe (once, after the loop finishes)
df = pd.DataFrame(blog_articles)

# Write the df to a json file for faster access
df.to_json('codeup_blogs.json')

blog_articles

[{'title': 'Codeup’s Data Science Career Accelerator is Here!',
  'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspac

#### Build the Helper Functions

In [5]:
# Create a helper function that requests and parses HTML, returning a soup object

def make_soup(url):
    '''
    This helper function takes in a url and requests and parses HTML
    returning a soup object
    '''
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    return soup

In [6]:
def acquire_codeup_blogs(urls, cached=False):
    '''
    This function takes in a list of Codeup blog urls and a parameter cached with default False.
    If cached == False, it scrapes the title and text for each url, creates a list of dictionaries
    with the title and text for each blog, converts the list to a df, writes the df to a json file,
    and returns the df.
    If cached == True, the function returns a dataframe read from the json file.
    '''
    if cached:
        # Read the dataframe saved by a previous scrape
        df = pd.read_json('codeup_blogs.json')
    else:
        # cached == False completes a fresh scrape for df
        
        blog_articles = []
        
        for url in urls:
            soup = make_soup(url)
            title = soup.find('h1', class_='jupiterx-post-title')
            content = soup.find('div', class_='jupiterx-post-content')
            d = {'title': title.text, 'content': content.text}
            blog_articles.append(d)
        
        df = pd.DataFrame(blog_articles)
        df.to_json('codeup_blogs.json')
    
    return df
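
The `cached` flag above relies on the caller knowing whether the json file already exists. A minimal sketch of an alternative that checks for the file with `os.path.isfile` follows. The helper name `load_or_scrape` and its `scrape_fn` parameter are hypothetical, and it caches the list of dictionaries with the standard `json` module rather than a dataframe so that it runs without network access or pandas:

```python
import json
import os
import tempfile


def load_or_scrape(cache_path, scrape_fn):
    '''
    Hypothetical helper: return the cached list of article dictionaries
    when cache_path exists; otherwise call scrape_fn(), cache its result
    as JSON, and return it.
    '''
    if os.path.isfile(cache_path):
        with open(cache_path) as f:
            return json.load(f)

    articles = scrape_fn()
    with open(cache_path, 'w') as f:
        json.dump(articles, f)
    return articles


# Stand-in scraper (no network access) that records how often it is called
calls = []

def fake_scrape():
    calls.append(1)
    return [{'title': 'Example title', 'content': 'Example content'}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'articles.json')
    first = load_or_scrape(path, fake_scrape)   # no cache yet: scrapes and writes the file
    second = load_or_scrape(path, fake_scrape)  # cache exists: reads the file instead

print(first == second, len(calls))
```

In the notebook's setting, `scrape_fn` could be a function that loops over the urls with `make_soup`, as `acquire_codeup_blogs` does.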

In [7]:
# Test the functions

codeup_blogs = acquire_codeup_blogs(urls)
codeup_blogs

| | title | content |
|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... |


### Bonus URL Scrape

In [9]:
# Hit codeup's main blog page to scrape the urls. 

url = 'https://codeup.com/resources/#blog'
soup = make_soup(url)
type(soup)

bs4.BeautifulSoup

In [11]:
# Filter the soup to return a list of the anchor elements that carry the blog-link class
urls_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')

# Print the type of the element
print(type(urls_list[0]))

# Take a peek at the urls_list
urls_list[0]

<class 'bs4.element.Tag'>


<a class="jet-listing-dynamic-link__link" href="https://codeup.com/introducing-salary-refund-guarantee/"><span class="jet-listing-dynamic-link__label">Introducing Our Salary Refund Guarantee</span></a>

In [14]:
# Extract the href attribute value from each anchor element in the list
# 40 urls are scraped.
# Duplicates exist

urls = [url.get('href') for url in urls_list]
len(urls)

40

In [15]:
# Use a set comprehension to return only unique urls

urls = {url.get('href') for url in urls_list}
len(urls)

20
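
A set removes the duplicates but does not keep the order in which the links appear on the page. If that order matters, one alternative is `dict.fromkeys`. A minimal sketch, using stand-in hrefs rather than the scraped ones:

```python
# Stand-in hrefs with duplicates (hypothetical values, not scraped)
hrefs = [
    'https://example.com/a/',
    'https://example.com/b/',
    'https://example.com/a/',
    'https://example.com/c/',
    'https://example.com/b/',
]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_hrefs = list(dict.fromkeys(hrefs))
print(unique_hrefs)
```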

In [16]:
# Convert the set to a list

urls = list(urls)
print(f'There are {len(urls)} unique urls in the list')
urls

There are 20 unique urls in the list


['https://codeup.com/journey-into-web-development/',
 'https://codeup.com/codeup-wins-civtech-datathon/',
 'https://codeup.com/new-scholarship/',
 'https://codeup.com/codeup-alumni-make-water/',
 'https://codeup.com/introducing-salary-refund-guarantee/',
 'https://codeup.com/codeup-inc-5000/',
 'https://codeup.com/succeed-in-a-coding-bootcamp/',
 'https://codeup.com/how-were-celebrating-world-mental-health-day-from-home/',
 'https://codeup.com/what-data-science-career-is-for-you/',
 'https://codeup.com/transition-into-data-science/',
 'https://codeup.com/codeup-in-houston/',
 'https://codeup.com/from-slacker-to-data-scientist/',
 'https://codeup.com/covid-19-data-challenge/',
 'https://codeup.com/what-is-python/',
 'https://codeup.com/codeups-application-process/',
 'https://codeup.com/what-is-machine-learning/',
 'https://codeup.com/math-in-data-science/',
 'https://codeup.com/what-to-expect-at-codeup/',
 'https://codeup.com/build-your-career-in-tech/',
 'https://codeup.com/education-

#### Build the Helper Function

In [17]:
def get_blog_urls():
    '''
    This function scrapes the Codeup blog urls from the main Codeup blog page
    and returns a list of the unique urls.
    '''
    base_url = 'https://codeup.com/resources/#blog' 
    soup = make_soup(base_url)
    urls_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')
    urls = {url.get('href') for url in urls_list}
    urls = list(urls)
    
    return urls

In [18]:
# Now test the function
# cached == False does a fresh scrape.

all_blogs = acquire_codeup_blogs(urls=get_blog_urls())

# Print the shape
all_blogs.shape

(20, 2)

In [19]:
# Take a peek at the df
all_blogs.head()

| | title | content |
|---|---|---|
| 0 | Alumni Share their Journey into Web Development | Everyone starts somewhere. Many developers out... |
| 1 | Codeup Grads Win CivTech Datathon | Many Codeup alumni enjoy competing in hackatho... |
| 2 | Announcing: The Annie Easley Scholarship to Su... | We have an exciting announcement! We’re launch... |
| 3 | How Codeup Alumni are Helping to Make Water | Imagine having a kit mailed to you with all th... |
| 4 | Introducing Our Salary Refund Guarantee | Here at Codeup, we believe it’s time to revolu... |


In [22]:
# cached == True reads in a dataframe from 'codeup_blogs.json'

all_blogs = acquire_codeup_blogs(urls=get_blog_urls(), cached=True)
all_blogs.head()

| | title | content |
|---|---|---|
| 0 | Alumni Share their Journey into Web Development | Everyone starts somewhere. Many developers out... |
| 1 | Codeup Grads Win CivTech Datathon | Many Codeup alumni enjoy competing in hackatho... |
| 2 | Announcing: The Annie Easley Scholarship to Su... | We have an exciting announcement! We’re launch... |
| 3 | How Codeup Alumni are Helping to Make Water | Imagine having a kit mailed to you with all th... |
| 4 | Introducing Our Salary Refund Guarantee | Here at Codeup, we believe it’s time to revolu... |


### 2. News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment
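
The four topics can be mapped to one page url each before any scraping starts. The sketch below assumes that the topic pages live under `https://inshorts.com/en/read/<topic>`; that pattern does not appear in this notebook and should be checked in the browser first:

```python
# Assumption: topic pages live under this base url; verify it in the browser
BASE_URL = 'https://inshorts.com/en/read'
topics = ['business', 'sports', 'technology', 'entertainment']

# Build one url per topic
topic_urls = {topic: f'{BASE_URL}/{topic}' for topic in topics}

for topic, topic_url in topic_urls.items():
    print(topic, topic_url)
```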

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:
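
A minimal sketch of that shape, assuming the three fields inspected in this section (`title`, `content`, `category`); the values are placeholders:

```python
# Placeholder values; the keys follow the fields inspected in this section
example_news_article = {
    'title': 'the headline of the article',
    'content': 'the body text of the article',
    'category': 'business',  # one of the four topics listed above
}

# get_news_articles would return a list of such dictionaries
example_news_articles = [example_news_article]
print(sorted(example_news_article))
```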

### a. Start by inspecting the website in your browser. Figure out which elements will be useful.
- title: span, itemprop="headline"
- content: div, itemprop="articleBody"
- category: li, class="active-category"

In [13]:
url = 'http://inshorts.com/en/news/china-suspends-fish-imports-from-indian-firm-after-coronavirus-detected-1605240633688'
headers = {'User-Agent': 'Codeup Data Science Darden'}
response = get(url, headers=headers)

In [14]:
print(response.text[: 400])

<!doctype html>
<html lang="en">

<head>
  <meta charset="utf-8" />
  <style>
    /* The Modal (background) */
    .modal_contact {
        display: none; /* Hidden by default */
        position: fixed; /* Stay in place */
        z-index: 8; /* Sit on top */
        left: 0;
        top: 0;
        width: 100%; /* Full width */
        height: 100%;
        overflow: auto; /* Enable scroll if ne


In [15]:
soup = BeautifulSoup(response.content, 'html.parser')

In [16]:
title = soup.find('span', itemprop='headline')
title.text

'China suspends fish imports from Indian firm after coronavirus detected'

In [17]:
content = soup.find('div', itemprop='articleBody')
content.text

"China has suspended imports from India's Basu International for one week after detecting the novel coronavirus on three samples taken from the outer packaging of frozen cuttlefish. Imports will resume automatically after one week, Chinese customs said. Companies from Brazil, Russia, Ecuador and Indonesia have also faced similar one-week suspensions during the last month."

In [20]:
category = soup.find('li', class_='active-category')
category.text

'All News'