# Acquire Data through Web Scraping Exercises


By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may split your work into a separate file for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the final functions should still be available from acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).



### imports

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### 1. Codeup Blog Articles

Visit Codeup's Blog and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

In [2]:
# {
#     'title': 'the title of the article',
#     'content': 'the full text content of the article'
# }

Plus any additional properties you think might be helpful.

Bonus: Scrape the text of all the articles linked on Codeup's blog page.

In [3]:
# https://codeup.com/blog/

Articles to scrape: 

- https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/
- https://codeup.com/codeup-news/vet-tec-funding-dallas/
- https://codeup.com/codeup-news/dallas-campus-re-opens-with-new-grant-partner/
- https://codeup.com/dallas-newsletter/codeup-dallas-open-house/
- https://codeup.com/codeup-news/codeups-placement-team-continues-setting-records/


In [4]:
# response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
# soup = BeautifulSoup(response.text)

In [17]:
response = requests.get('https://codeup.com/dallas-newsletter/codeup-dallas-open-house/', headers={'user-agent': 'Codeup DS Hopper'})
html = response.text
# html

In [18]:
soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly to avoid the parser-guessing warning

In [19]:
# print(soup.prettify())

In [20]:
title = soup.select_one(".entry-title").text

In [21]:
title

'Codeup Dallas Open House'

In [37]:
content = soup.select_one(".entry-content").text.strip()

In [38]:
content

'Come join us for the re-opening of our Dallas Campus with some drinks and snacks at Codeup! Curious about what our campus looks like? Click here to register for free\nAbout this event\nCome join us for the re-opening of our Dallas Campus with some drinks and snacks at Codeup!\nCurious about what our campus looks like? Interested in our Web Development Career Accelerator? Keen to chat with an instructor or financial aid rep?\nAt our Open House, we are here to answer all your questions!\nMeet a Codeup instructor, who can help explain what’s taught in our classes and answer questions.\nUnderstand how to join one of our upcoming cohort ( Dec. 6th).\nDon’t miss this opportunity to learn more about how you can start the new year transitioning into a new, exciting career in tech. We’re here to answer any questions you may have about Codeup and your future.\nTake the first step of your new career today and create your tomorrow!'

In [52]:
# function to get all links from the main blog page

def get_blog_links():
    
    '''
    This function returns a list of all the urls linked on Codeup's main blog page.
    '''
    
    # imports
    import requests
    from bs4 import BeautifulSoup
    
    response = requests.get("https://codeup.com/blog/", headers={"user-agent": "Codeup DS Hopper"})
    soup = BeautifulSoup(response.text, "html.parser")
    links = [link.attrs["href"] for link in soup.select(".more-link")]
    return links


In [53]:
# use function to get list of links
get_blog_links()

['https://codeup.com/dallas-newsletter/codeup-dallas-open-house/',
 'https://codeup.com/codeup-news/codeups-placement-team-continues-setting-records/',
 'https://codeup.com/it-training/it-certifications-101/',
 'https://codeup.com/cybersecurity/a-rise-in-cyber-attacks-means-opportunities-for-veterans-in-san-antonio/',
 'https://codeup.com/codeup-news/use-your-gi-bill-benefits-to-land-a-job-in-tech/',
 'https://codeup.com/tips-for-prospective-students/which-program-is-right-for-me-cyber-security-or-systems-engineering/',
 'https://codeup.com/it-training/what-the-heck-is-system-engineering/',
 'https://codeup.com/alumni-stories/from-speech-pathology-to-business-intelligence/',
 'https://codeup.com/behind-the-billboards/boris-behind-the-billboards/',
 'https://codeup.com/codeup-news/is-codeup-the-best-bootcamp-in-san-antonio-or-the-world/',
 'https://codeup.com/codeup-news/codeup-launches-first-podcast-hire-tech/',
 'https://codeup.com/tips-for-prospective-students/why-should-i-become-a-s

In [49]:
def parse_codeup_blog_article(url):
    
    '''
    This function: 
    takes url of blog post,
    extracts info
    returns the specified info as a dictionary
    '''
    
    # imports
    import requests
    from bs4 import BeautifulSoup
    
    response = requests.get(url, headers={"user-agent": "Codeup DS Hopper"})
    soup = BeautifulSoup(response.text, "html.parser")
    return {
        "title": soup.select_one(".entry-title").text,
        "published": soup.select_one(".published").text,
        "content": soup.select_one(".entry-content").text.strip(),
    }



In [50]:
url = 'https://codeup.com/dallas-newsletter/codeup-dallas-open-house/'

In [51]:
parse_codeup_blog_article(url)

{'title': 'Codeup Dallas Open House',
 'published': 'Nov 30, 2021',
 'content': 'Come join us for the re-opening of our Dallas Campus with some drinks and snacks at Codeup! Curious about what our campus looks like? Click here to register for free\nAbout this event\nCome join us for the re-opening of our Dallas Campus with some drinks and snacks at Codeup!\nCurious about what our campus looks like? Interested in our Web Development Career Accelerator? Keen to chat with an instructor or financial aid rep?\nAt our Open House, we are here to answer all your questions!\nMeet a Codeup instructor, who can help explain what’s taught in our classes and answer questions.\nUnderstand how to join one of our upcoming cohort ( Dec. 6th).\nDon’t miss this opportunity to learn more about how you can start the new year transitioning into a new, exciting career in tech. We’re here to answer any questions you may have about Codeup and your future.\nTake the first step of your new career today and create 

In [60]:
# return blog article info as a df
def get_blog_articles_info():
    
    '''
    This function: 
    uses get_blog_links to get blog post links from the main Codeup blog page,
    extracts info using parse_codeup_blog_article,
    returns the specified info as a df where each row represents a separate blog post
    '''
    
    # imports
    import pandas as pd
    
    links = get_blog_links()
    df = pd.DataFrame([parse_codeup_blog_article(link) for link in links])
    return df

In [59]:
get_blog_articles_info()

| | title | published | content |
|---|---|---|---|
| 0 | Codeup Dallas Open House | Nov 30, 2021 | Come join us for the re-opening of our Dallas ... |
| 1 | Codeup’s Placement Team Continues Setting Records | Nov 19, 2021 | Our Placement Team is simply defined as a grou... |
| 2 | IT Certifications 101: Why They Matter, and Wh... | Nov 18, 2021 | AWS, Google, Azure, Red Hat, CompTIA…these are... |
| 3 | A rise in cyber attacks means opportunities fo... | Nov 17, 2021 | In the last few months, the US has experienced... |
| 4 | Use your GI Bill® benefits to Land a Job in Tech | Nov 4, 2021 | As the end of military service gets closer, ma... |
| 5 | Which program is right for me: Cyber Security ... | Oct 28, 2021 | What IT Career should I choose?\nIf you’re thi... |
| 6 | What the Heck is System Engineering? | Oct 21, 2021 | Codeup offers a 13-week training program: Syst... |
| 7 | From Speech Pathology to Business Intelligence | Oct 18, 2021 | By: Alicia Gonzalez\nBefore Codeup, I was a ho... |
| 8 | Boris – Behind the Billboards | Oct 3, 2021 | |
| 9 | Is Codeup the Best Bootcamp in San Antonio…or ... | Sep 16, 2021 | Looking for the best data science bootcamp in ... |
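
The exercise asks for a function named get_blog_articles that returns a list of dictionaries, whereas get_blog_articles_info returns a DataFrame. A minimal sketch of such a wrapper might look like the following. The `parse` parameter and the `fake_parse` stub are assumptions added for illustration so the example runs without network access; in acquire.py you would pass parse_codeup_blog_article instead.

```python
def get_blog_articles(urls, parse):
    """Return a list of dictionaries, one per blog post URL.

    `parse` is any callable that takes a URL and returns a dict with at
    least 'title' and 'content' keys (e.g. parse_codeup_blog_article).
    """
    return [parse(url) for url in urls]


# Hypothetical stand-in for parse_codeup_blog_article, used only for this demo.
def fake_parse(url):
    return {"title": "placeholder title", "content": "placeholder content", "url": url}


articles = get_blog_articles(
    ["https://codeup.com/dallas-newsletter/codeup-dallas-open-house/"],
    parse=fake_parse,
)
print(articles)
```

If a DataFrame is still wanted for exploration, `pd.DataFrame(articles)` builds one from the list, and `df.to_dict("records")` converts it back.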


#### 2. News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

In [3]:
# {
#     'title': 'The article title',
#     'content': 'The article content',
#     'category': 'business' # for example
# }

Hints:

a. Start by inspecting the website in your browser. Figure out which elements will be useful.

b. Start by creating a function that handles a single article and produces a dictionary like the one above.

c. Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.

d. Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
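
One possible way to organize hints b through d is sketched below. The URL pattern, the `.news-card` class, and the `itemprop` selectors are all assumptions about the inshorts markup and should be checked in the browser first (hint a). The `fetch_html` parameter is also an assumption, added so the parsing logic can be tried on a made-up HTML snippet without any requests.

```python
from bs4 import BeautifulSoup

# Assumed URL pattern for category pages; verify in your browser.
BASE_URL = "https://inshorts.com/en/read/"
CATEGORIES = ["business", "sports", "technology", "entertainment"]


def parse_news_card(card, category):
    """Hint b: turn one article element into a dictionary."""
    # The itemprop selectors are assumptions about the page markup.
    return {
        "title": card.select_one('[itemprop="headline"]').text.strip(),
        "content": card.select_one('[itemprop="articleBody"]').text.strip(),
        "category": category,
    }


def parse_news_page(html, category):
    """Hint c: parse every article found in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # ".news-card" is an assumed class name for the article container.
    return [parse_news_card(card, category) for card in soup.select(".news-card")]


def get_news_articles(fetch_html, categories=CATEGORIES):
    """Hint d: collect the articles from every category page.

    `fetch_html` is a callable that takes a URL and returns HTML text,
    for example: lambda url: requests.get(url).text
    """
    articles = []
    for category in categories:
        articles.extend(parse_news_page(fetch_html(BASE_URL + category), category))
    return articles


# Offline demo with made-up HTML that follows the assumed structure.
SAMPLE_HTML = """
<div class="news-card">
  <span itemprop="headline">Example headline</span>
  <div itemprop="articleBody">Example body text.</div>
</div>
"""

sample_articles = get_news_articles(lambda url: SAMPLE_HTML, categories=["business"])
print(sample_articles)
```

Keeping the fetching separate from the parsing makes each step easy to test on saved HTML before pointing it at the live site.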

#### 3. Bonus: cache the data

Write your code such that the acquired data is saved locally in some form or fashion. Your functions that retrieve the data should prefer to read the local data instead of having to make all the requests every time the function is called. Include a boolean flag in the functions to allow the data to be acquired "fresh" from the actual sources (re-writing your local cache).
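
A minimal sketch of this caching pattern, assuming the data is a list of dictionaries stored as JSON, is shown below. The helper name `get_cached`, the `use_cache` flag, and the `fake_acquire` stub are hypothetical names chosen for illustration; any zero-argument function that performs the real scraping could be passed as `acquire`.

```python
import json
import os
import tempfile


def get_cached(path, acquire, use_cache=True):
    """Return data from `path` if cached, otherwise call `acquire` and save it.

    `acquire` is a zero-argument callable returning JSON-serializable data,
    such as a list of article dictionaries. Pass use_cache=False to force a
    fresh acquisition and overwrite the local file.
    """
    if use_cache and os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    data = acquire()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


# Demo: a stub that counts how often the "expensive" acquisition runs.
calls = []


def fake_acquire():
    calls.append(1)
    return [{"title": "t", "content": "c"}]


demo_path = os.path.join(tempfile.mkdtemp(), "articles.json")
first = get_cached(demo_path, fake_acquire)                   # no file yet: acquires and writes
second = get_cached(demo_path, fake_acquire)                  # file exists: reads from disk
third = get_cached(demo_path, fake_acquire, use_cache=False)  # forced refresh
print(len(calls))
```

Only the first and third calls run the acquisition; the second is served from the local file.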

