By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the final functions should be available from acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).

In [2]:
import pandas as pd
import numpy as np

from requests import get
import re
from bs4 import BeautifulSoup
import os

## Exercise 1
Codeup Blog Articles

Scrape the article text from the following pages:

- https://codeup.com/codeups-data-science-career-accelerator-is-here/
- https://codeup.com/data-science-myths/
- https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
- https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
- https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/


Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
Plus any additional properties you think might be helpful.

In [3]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'} 
    
response = get(url, headers=headers)
response.ok

True

In [4]:
# Here's our long string of HTML; we'll use response.text to make our soup object.
# response.text holds the full HTML of the page as a string, not just the article text.
print(type(response.text))

<class 'str'>


In [5]:
# Create our Soup object by passing our HTML string and choice of parser.

soup = BeautifulSoup(response.text, 'html.parser')

# Now we have our BeautifulSoup object and can use its built-in methods and attributes.

print(type(soup))

<class 'bs4.BeautifulSoup'>


In [6]:
title = soup.find('h1').text
title

'Codeup’s Data Science Career Accelerator is Here!'

In [8]:
# beautiful soup uses `class_` as the keyword argument for searching
# for a class because `class` is a reserved word in python
# we'll use the class name that we identified from looking in the inspector in chrome
content = soup.find('div', class_="jupiterx-post-content").text
print(content)

The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in Glassdoor’s #1 Best Job in America.
Data Science is a method of providing actionable intelligence from data. The data revolution has hit San Antonio, resulting in an explosion in Data Scientist positions across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen UTSA invest $70 M for a Cybersecurity Center and School of Data Science. We built a program to specifically meet the growing demands of this industry.
Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Students will work with real

In [9]:
print(type(content))

<class 'str'>


In [10]:
# Create a helper function that requests and parses HTML returning a soup object.

def make_soup(url):
    '''
    This helper function takes in a url and requests and parses HTML
    returning a soup object.
    '''
    headers = {'User-Agent': 'Codeup Data Science'} 
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

In [11]:
def get_blog_articles(urls, cached=False):
    '''
    This function takes in a list of Codeup blog urls and a cached flag.
    With the default cached == False, it scrapes the title and text for each url,
    creates a list of dictionaries with the title and text for each blog,
    converts the list to a df, writes the df to a json file, and returns the df.
    If cached == True, the function ignores urls and returns a df read from the json file.
    '''
    if cached:
        df = pd.read_json('big_blogs.json')
        
    # cached == False completes a fresh scrape for df     
    else:

        # Create an empty list to hold dictionaries
        articles = []

        # Loop through each url in our list of urls
        for url in urls:

            # Make request and soup object using helper
            soup = make_soup(url)

            # Save the title of each blog in variable title
            title = soup.find('h1').text

            # Save the text in each blog to variable text
            content = soup.find('div', class_="jupiterx-post-content").text

            # Create a dictionary holding the title and content for each blog
            article = {'title': title, 'content': content}

            # Add each dictionary to the articles list of dictionaries
            articles.append(article)
            
        # convert our list of dictionaries to a df
        df = pd.DataFrame(articles)

        # Write df to a json file for faster access
        df.to_json('big_blogs.json')
    
    return df
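
The `cached` flag above assumes `big_blogs.json` already exists whenever `cached=True`. One alternative, sketched here under assumptions, is to check for the file first and fall back to a fresh scrape when it is missing. `load_or_scrape` and `scrape_fn` are hypothetical names that are not part of the notebook, and the sketch stores a plain list of dictionaries with the standard-library `json` module rather than a DataFrame.

```python
import json
import os


def load_or_scrape(path, scrape_fn):
    """
    Return the list of article dictionaries stored at `path` if that file
    exists; otherwise call `scrape_fn()` for a fresh scrape, save the
    result to `path`, and return it.

    `load_or_scrape` and `scrape_fn` are hypothetical names for this sketch.
    """
    if os.path.isfile(path):
        # Cache hit: read the saved list of dictionaries.
        with open(path, encoding='utf-8') as f:
            return json.load(f)

    # Cache miss: do the (slow) scrape, then save it for next time.
    articles = scrape_fn()
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(articles, f, ensure_ascii=False)
    return articles
```

Here `scrape_fn` would be any zero-argument callable that returns the scraped list, so the first call pays for the scrape and later calls only read the file.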

In [12]:
# Here cached == False, so the function will do a fresh scrape of the urls and write data to a json file.

urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/',
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

blogs = get_blog_articles(urls=urls, cached=False)
blogs

Unnamed: 0,title,content
0,Codeup’s Data Science Career Accelerator is Here!,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,By Dimitri Antoniou and Maggie Giust\nData Sci...
2,Data Science VS Data Analytics: What’s The Dif...,"By Dimitri Antoniou\nA week ago, Codeup launch..."
3,10 Tips to Crush It at the SA Tech Job Fair,SA Tech Job Fair\nThe third bi-annual San Anto...
4,Competitor Bootcamps Are Closing. Is the Model...,Competitor Bootcamps Are Closing. Is the Model...


## Bonus:

Scrape the text of all the articles linked on codeup's blog page.

In [13]:
# I'm going to hit Codeup's main blog page to scrape the urls and use my new function.

url = 'https://codeup.com/resources/#blog'
soup = make_soup(url)

In [14]:
# I'm filtering my soup to return a list of all anchor elements from my HTML. (view first 2)

urls_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')
urls_list[:2]

[<a class="jet-listing-dynamic-link__link" href="https://codeup.com/introducing-salary-refund-guarantee/"><span class="jet-listing-dynamic-link__label">Introducing Our Salary Refund Guarantee</span></a>,
 <a class="jet-listing-dynamic-link__link" href="https://codeup.com/introducing-salary-refund-guarantee/"><span class="jet-listing-dynamic-link__label">Read More</span></a>]

In [15]:
len(urls_list)

40

In [16]:
# Extract the href attribute value from each anchor element in my list; we scraped 40 anchors.

# I'm using a set comprehension to return only unique urls because there are two links for each article.
urls = {link.get('href') for link in urls_list}

# I'm converting my set to a list of urls.
urls = list(urls)

print(f'There are {len(urls)} unique links in our urls list.')
print()
urls

There are 20 unique links in our urls list.



['https://codeup.com/transition-into-data-science/',
 'https://codeup.com/codeup-inc-5000/',
 'https://codeup.com/math-in-data-science/',
 'https://codeup.com/journey-into-web-development/',
 'https://codeup.com/new-scholarship/',
 'https://codeup.com/introducing-salary-refund-guarantee/',
 'https://codeup.com/codeups-application-process/',
 'https://codeup.com/what-is-machine-learning/',
 'https://codeup.com/what-data-science-career-is-for-you/',
 'https://codeup.com/build-your-career-in-tech/',
 'https://codeup.com/from-slacker-to-data-scientist/',
 'https://codeup.com/covid-19-data-challenge/',
 'https://codeup.com/education-is-an-investment/',
 'https://codeup.com/succeed-in-a-coding-bootcamp/',
 'https://codeup.com/how-were-celebrating-world-mental-health-day-from-home/',
 'https://codeup.com/codeup-in-houston/',
 'https://codeup.com/codeup-alumni-make-water/',
 'https://codeup.com/what-is-python/',
 'https://codeup.com/codeup-wins-civtech-datathon/',
 'https://codeup.com/what-to-

In [17]:
def get_all_urls():
    '''
    This function scrapes all of the Codeup blog urls from
    the main Codeup blog page and returns a list of urls.
    '''
    # The base url for the main Codeup blog page
    url = 'https://codeup.com/resources/#blog' 
    
    # Make request and soup object using helper
    soup = make_soup(url)
    
    # Create a list of the anchor elements that hold the urls.
    urls_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')
    
    # I'm using a set comprehension to return only unique urls because the list contains duplicate urls.
    urls = {link.get('href') for link in urls_list}

    # I'm converting my set to a list of urls.
    urls = list(urls)
        
    return urls
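
A set does not keep the order in which the links appear on the page, so the list returned above can come back in a different order from one run to the next. If page order matters, one option is `dict.fromkeys`, which drops duplicates while keeping the first occurrence of each item. The helper name `unique_in_order` and the sample links below are made up for illustration.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and (since Python 3.7) keep insertion order.
    return list(dict.fromkeys(items))


# Hypothetical hrefs: each article is linked twice, as on the blog page.
hrefs = [
    'https://example.com/a/',
    'https://example.com/a/',
    'https://example.com/b/',
    'https://example.com/b/',
]

print(unique_in_order(hrefs))
# ['https://example.com/a/', 'https://example.com/b/']
```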

In [18]:
# Now I can use my same function with my new function.
# cached == False does a fresh scrape.

big_blogs = get_blog_articles(urls=get_all_urls(), cached=False)

In [19]:
big_blogs

Unnamed: 0,title,content
0,What is the Transition into Data Science Like?,Alumni Katy Salts and Brandi Reger joined us a...
1,Codeup on Inc. 5000 Fastest Growing Private Co...,We’re excited to announce a huge Codeup achiev...
2,What are the Math and Stats Principles You Nee...,"Coming into our Data Science program, you will..."
3,Alumni Share their Journey into Web Development,Everyone starts somewhere. Many developers out...
4,Announcing: The Annie Easley Scholarship to Su...,We have an exciting announcement! We’re launch...
5,Introducing Our Salary Refund Guarantee,"Here at Codeup, we believe it’s time to revolu..."
6,What is Codeup’s Application Process?,Curious about Codeup’s application process? Wo...
7,What is Machine Learning?,"There’s a lot we can learn about machines, and..."
8,What Data Science Career is For You?,If you’re struggling to see yourself as a data...
9,Build Your Career in Tech: Advice from Alumni!,"Bryan Walsh, Codeup Web Development alum, and ..."


In [20]:
big_blogs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    20 non-null     object
 1   content  20 non-null     object
dtypes: object(2)
memory usage: 448.0+ bytes


In [21]:
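# cached == True reads in a df from `big_blogs.json`; the urls argument is not used for scraping,
# although get_all_urls() still makes one request to the blog page when it is called here.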

big_blogs = get_blog_articles(urls=get_all_urls(), cached=True)
big_blogs

Unnamed: 0,title,content
0,What is the Transition into Data Science Like?,Alumni Katy Salts and Brandi Reger joined us a...
1,Codeup on Inc. 5000 Fastest Growing Private Co...,We’re excited to announce a huge Codeup achiev...
2,What are the Math and Stats Principles You Nee...,"Coming into our Data Science program, you will..."
3,Alumni Share their Journey into Web Development,Everyone starts somewhere. Many developers out...
4,Announcing: The Annie Easley Scholarship to Su...,We have an exciting announcement! We’re launch...
5,Introducing Our Salary Refund Guarantee,"Here at Codeup, we believe it’s time to revolu..."
6,What is Codeup’s Application Process?,Curious about Codeup’s application process? Wo...
7,What is Machine Learning?,"There’s a lot we can learn about machines, and..."
8,What Data Science Career is For You?,If you’re struggling to see yourself as a data...
9,Build Your Career in Tech: Advice from Alumni!,"Bryan Walsh, Codeup Web Development alum, and ..."


## Exercise 2
News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

- Hints:

    - Start by inspecting the website in your browser. Figure out which elements will be useful.
    - Start by creating a function that handles a single article and produces a dictionary like the one above.
    - Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
    - Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
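
The hints above can be sketched as small functions. This is only an outline under assumptions: the names `parse_card`, `parse_page`, and `SAMPLE_HTML` are hypothetical, and the sample markup is a simplified, made-up stand-in for a page of news cards, so the parsing can be tried without a network request. The third step, looping over every topic page, would call `parse_page` once per page with the HTML returned by the request.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup standing in for one page of news cards.
SAMPLE_HTML = '''
<div class="news-card">
  <span itemprop="headline">First headline</span>
  <div itemprop="articleBody">First body.</div>
</div>
<div class="news-card">
  <span itemprop="headline">Second headline</span>
  <div itemprop="articleBody">Second body.</div>
</div>
'''


def parse_card(card, category):
    """Step 1: turn a single news card into a dictionary."""
    return {
        'title': card.find('span', itemprop='headline').text,
        'content': card.find('div', itemprop='articleBody').text,
        'category': category,
    }


def parse_page(html, category):
    """Step 2: find every card on one page and parse each one."""
    soup = BeautifulSoup(html, 'html.parser')
    cards = soup.find_all('div', class_='news-card')
    return [parse_card(card, category) for card in cards]


articles = parse_page(SAMPLE_HTML, 'business')
print(len(articles))
print(articles[0]['title'])
```

Keeping the per-card logic in its own function means a change in the page markup only needs a fix in one place.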

In [22]:
# Make the soup object using my function.

url = 'https://inshorts.com/en/read/entertainment'
soup = make_soup(url)

In [23]:
# Scrape a ResultSet of all the news cards on the page and inspect the elements on the first card.

cards = soup.find_all('div', class_='news-card')

print(f'There are {len(cards)} news cards on this page.')
print()
cards[0]

There are 24 news cards on this page.



<div class="news-card z-depth-1" itemscope="" itemtype="http://schema.org/NewsArticle">
<span content="" itemid="https://inshorts.com/en/news/tamil-actors-video-amid-cancer-treatment-surfaces-fans-say-he-looks-skinny-1605583414590" itemprop="mainEntityOfPage" itemscope="" itemtype="https://schema.org/WebPage"></span>
<span itemprop="author" itemscope="itemscope" itemtype="https://schema.org/Person">
<span content="Ankush Verma" itemprop="name"></span>
</span>
<span content="Tamil actor's video amid cancer treatment surfaces, fans say 'he looks skinny'" itemprop="description"></span>
<span itemprop="image" itemscope="" itemtype="https://schema.org/ImageObject">
<meta content="https://static.inshorts.com/inshorts/images/v1/variants/jpg/m/2020/11_nov/17_tue/img_1605581692950_418.jpg?" itemprop="url"/>
<meta content="864" itemprop="width"/>
<meta content="483" itemprop="height"/>
</span>
<span itemprop="publisher" itemscope="itemscope" itemtype="https://schema.org/Organization">
<span cont

In [24]:
# Create a list of titles using the span element, its itemprop attribute, and the .text attribute.

titles = [card.find('span', itemprop='headline').text for card in cards]
titles[:5]

["Tamil actor's video amid cancer treatment surfaces, fans say 'he looks skinny'",
 "Sushmita shares video of 'Money Heist' actress singing 'Chunari Chunari'",
 'Tamil TV series actor hacked to death, CCTV footage shows argument with gang',
 "Twinkle posts pic of 'Mela' villain's poster, says movie 'left mark or scar on me'",
 'Milind Soman picks up garbage on trek to a temple; shares pic']

In [25]:
# Create a list of authors using the span element, its class attribute, and the .text attribute.

authors = [card.find('span', class_='author').text for card in cards]
authors[:5]

['Ankush Verma',
 'Ankush Verma',
 'Daisy Mowke',
 'Ankush Verma',
 'Pragya Swastik']

In [26]:
# Create a list of content strings using the div element, its itemprop attribute, and the .text attribute.

content = [card.find('div', itemprop='articleBody').text for card in cards]
content[:5]

['A video showing Tamil actor Thavasi, known for his supporting roles, appealing for help has surfaced online. The actor is suffering from cancer and is admitted to a hospital. "I never thought I\'d be affected by such a disease. I request...my fellow actors in the industry and the people of the state to help me recover from this," he said.',
 'Sushmita Sen has shared a video of \'Money Heist\' actress Itziar Ituño singing the song \'Chunari Chunari\' from the 1999 movie \'Biwi No. 1\'. Itziar sang the lines while speaking with News18\'s Showsha on her love for Bollywood. "Still having hangover of chunari chunari...Everything in the song was superb," a Twitter user commented.',
 'Selvarathinam, an actor who played a villain in a Tamil TV series was hacked to death. "[On Sunday]...he received a call after which he left...Later, his roommate received the information [about his death]," police said. CCTV footage shows 4 suspicious men moving about near the murder spot. The actor could be 

In [27]:
# Create an empty list, articles, to hold the dictionaries for each article.
articles = []

# Loop through each news card on the page and get what we want
for card in cards:
    title = card.find('span', itemprop='headline' ).text
    author = card.find('span', class_='author').text
    content = card.find('div', itemprop='articleBody').text
    
    # Create a dictionary, article, for each news card
    article = {'title': title, 'author': author, 'content': content}
    
    # Add the dictionary, article, to our list of dictionaries, articles.
    articles.append(article)

In [28]:
# Here we see our list contains one dictionary per news card (24 on this page).

print(len(articles))
articles[0]

24


{'title': "Tamil actor's video amid cancer treatment surfaces, fans say 'he looks skinny'",
 'author': 'Ankush Verma',
 'content': 'A video showing Tamil actor Thavasi, known for his supporting roles, appealing for help has surfaced online. The actor is suffering from cancer and is admitted to a hospital. "I never thought I\'d be affected by such a disease. I request...my fellow actors in the industry and the people of the state to help me recover from this," he said.'}

In [29]:
def get_news_articles(cached=False):
    '''
    With the default cached == False, this function does a fresh scrape of the inshorts pages
    for the topics business, sports, technology, and entertainment, writes the df to a json file,
    and returns the df. With cached == True, it returns a df read in from that json file.
    '''
    # option to read in a json file instead of scrape for df
    if cached:
        df = pd.read_json('articles.json')
        
    # cached == False completes a fresh scrape for df    
    else:
    
        # Set base_url that will be used in get request
        base_url = 'https://inshorts.com/en/read/'
        
        # List of topics to scrape
        topics = ['business', 'sports', 'technology', 'entertainment']
        
        # Create an empty list, articles, to hold our dictionaries
        articles = []

        for topic in topics:
            
            # Create url with topic endpoint
            topic_url = base_url + topic
            
            # Make request and soup object using helper
            soup = make_soup(topic_url)

            # Scrape a ResultSet of all the news cards on the page
            cards = soup.find_all('div', class_='news-card')

            # Loop through each news card on the page and get what we want
            for card in cards:
                title = card.find('span', itemprop='headline' ).text
                author = card.find('span', class_='author').text
                content = card.find('div', itemprop='articleBody').text

                # Create a dictionary, article, for each news card
                article = {'topic': topic,
                           'title': title,
                           'author': author,
                           'content': content}

                # Add the dictionary, article, to our list of dictionaries, articles.
                articles.append(article)
            
        # Create a DataFrame from list of dictionaries
        df = pd.DataFrame(articles)
        
        # Write df to json file for future use
        df.to_json('articles.json')
    
    return df
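
The exercise asks for a list of dictionaries keyed by `category`, while the function above returns a DataFrame with a `topic` column. Here is a minimal sketch of bridging the two, assuming the scraped rows are available as dictionaries with the keys used in the loop above. `to_category_records` is a hypothetical helper and the sample row is invented.

```python
def to_category_records(rows):
    """
    Rename the 'topic' key to 'category' in each article dictionary,
    leaving the other keys untouched. Returns new dictionaries.
    """
    records = []
    for row in rows:
        record = {key: value for key, value in row.items() if key != 'topic'}
        record['category'] = row['topic']
        records.append(record)
    return records


# Invented sample row in the shape built inside the loop above.
rows = [{'topic': 'business', 'title': 'A title',
         'author': 'An author', 'content': 'Some text.'}]

print(to_category_records(rows))
# [{'title': 'A title', 'author': 'An author', 'content': 'Some text.', 'category': 'business'}]
```

Starting from the DataFrame instead, `df.to_dict(orient='records')` gives the rows in this dictionary form.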

In [30]:
# Test our function with cached == False to do a fresh scrape and create the `articles.json` file.

df = get_news_articles(cached=False)
df.head()

Unnamed: 0,topic,title,author,content
0,business,"Lakshmi Vilas Bank withdrawals capped at ₹25,0...",Pragya Swastik,The Centre has imposed a 30-day moratorium on ...
1,business,Shutting Delhi markets may prove counterproduc...,Sakshita Khosla,Traders' body CAIT on Tuesday said a proposal ...
2,business,Pfizer shares drop 4.5% as Moderna says its va...,Krishna Veera Vanamali,Pfizer’s shares fell as much as 4.5% on Monday...
3,business,How does Moderna's COVID-19 vaccine candidate ...,Pragya Swastik,Moderna's initial results of late-stage trial ...
4,business,"Musk gets $15bn richer in 2 hours, becomes wor...",Krishna Veera Vanamali,Billionaire Elon Musk added $15 billion to his...


In [31]:
df.topic.value_counts()

sports           25
business         25
technology       25
entertainment    24
Name: topic, dtype: int64

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   topic    99 non-null     object
 1   title    99 non-null     object
 2   author   99 non-null     object
 3   content  99 non-null     object
dtypes: object(4)
memory usage: 3.2+ KB


In [33]:
# Test our function with cached == True to read in the df from `articles.json`.

df = get_news_articles(cached=True)
df.head()

Unnamed: 0,topic,title,author,content
0,business,"Lakshmi Vilas Bank withdrawals capped at ₹25,0...",Pragya Swastik,The Centre has imposed a 30-day moratorium on ...
1,business,Shutting Delhi markets may prove counterproduc...,Sakshita Khosla,Traders' body CAIT on Tuesday said a proposal ...
2,business,Pfizer shares drop 4.5% as Moderna says its va...,Krishna Veera Vanamali,Pfizer’s shares fell as much as 4.5% on Monday...
3,business,How does Moderna's COVID-19 vaccine candidate ...,Pragya Swastik,Moderna's initial results of late-stage trial ...
4,business,"Musk gets $15bn richer in 2 hours, becomes wor...",Krishna Veera Vanamali,Billionaire Elon Musk added $15 billion to his...


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 0 to 98
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   topic    99 non-null     object
 1   title    99 non-null     object
 2   author   99 non-null     object
 3   content  99 non-null     object
dtypes: object(4)
memory usage: 3.9+ KB
