In [1]:
import requests

import pandas as pd
from bs4 import BeautifulSoup

from acquire import *

# Data Acquisition Exercises

## 1

Visit Codeup's blog and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}

Plus any additional properties you think might be helpful.

Bonus: Scrape the text of all the articles linked on Codeup's blog page.

### Get Blog Articles

Let's figure out how to get the article info for a Codeup blog post.

In [2]:
# We'll scrape these URLs

urls = [
    'https://codeup.com/codeup-news/codeups-placement-team-continues-setting-records/',
    'https://codeup.com/it-training/what-the-heck-is-system-engineering/',
    'https://codeup.com/codeup-news/is-codeup-the-best-bootcamp-in-san-antonio-or-the-world/',
    'https://codeup.com/codeup-news/codeup-candidate-for-accreditation/',
    'https://codeup.com/tips-for-prospective-students/why-you-need-the-best-coding-bootcamp-instructors/'
]

In [3]:
# We'll start with just the first URL and get the response with requests
# We need to set the user agent header so the site will let us scrape it

headers = {'user-agent' : 'Innis Data Science Cohort'}
response = requests.get(urls[0], headers = headers)

In [4]:
# Get the HTML using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

In [5]:
# Get the title of the blog post
soup.find('h1', class_ = 'entry-title').text

'Codeup’s Placement Team Continues Setting Records'

In [6]:
# Let's also get the date
soup.find('p', class_ = 'post-meta').find('span').text

'Nov 19, 2021'

In [7]:
# Get the body of the blog post
soup.find('div', class_ = 'entry-content').text

'\n\n\n\n\n\nOur Placement Team is simply defined as a group that manages relationships with our employer partners and our graduating students to help get our graduating students hired. Last quarter the Placement Team helped 48 students get hired to life-changing careers in tech. Last month our Placement Team has already placed 40 students with top tech companies. For that, we want to send a huge thank you to both our Placement Team and our Employer Partners who have done a tremendous job of helping Codeup empower life change for these students.\nWho exactly got hired and where? Check out the list below!\n\n\nKirsten Collier – hired at CGI as a Java Developer\nMichael Baker – hired to CGI as a Java Developer\nMichael Troia – hired to CDW as an Associate Consulting Engineer\xa0\nCarlos Padilla – hired at CGI as a Java Developer\xa0\nVictor G. Hernandez – hired at Anderson Marketing Group as a Java Developer\nNicholas Martinez – hired at Seggazza as a Software Engineer\xa0\nJordan Felan 

In [8]:
# Let's see if this works on other blog posts.

headers = {'user-agent' : 'Innis Data Science Cohort'}
response = requests.get(urls[1], headers = headers)
soup = BeautifulSoup(response.text, 'html.parser')

In [9]:
# Check the title
soup.find('h1', class_ = 'entry-title').text

'What the Heck is System Engineering?'

In [10]:
# Check the date
soup.find('p', class_ = 'post-meta').find('span').text

'Oct 21, 2021'

In [11]:
# Check the blog contents
soup.find('div', class_ = 'entry-content').text

'\n\n\n\n\n\nCodeup offers a 13-week training program: Systems Engineering.\xa0Designed to help you launch your career in tech, this program takes you from 0 experience to IT hero with certifications and hands-on experience in just weeks. But if you’re new to tech, you might be wondering…what is Systems Engineering?\nWhat is Systems Engineering?\nIn IT terms, a Systems Engineer (Sysadmin or Sysad for short) is responsible for the configuration, upkeep, and operation of computer systems. They manage things like security, storage, automation, troubleshooting, and a whole lot more. For example, some of the day to day tasks include things like:\n\nReviewing system logs for anomalies and issues\nUpdating operating systems (OS)\nInstalling new hardware and software\nManaging user accounts\nDocumenting information about the system setup\nManaging file systems\n\nYou might be thinking, “Woah, that looks like a lot!” and that’s because it is! A Systems Engineer is a critical piece of IT infrast

In [12]:
# Let's try one more

headers = {'user-agent' : 'Innis Data Science Cohort'}
response = requests.get(urls[2], headers = headers)
soup = BeautifulSoup(response.text, 'html.parser')

In [13]:
# Check the title
soup.find('h1', class_ = 'entry-title').text

'Is Codeup the Best Bootcamp in San Antonio…or the World?'

In [14]:
# Check the date
soup.find('p', class_ = 'post-meta').find('span').text

'Sep 16, 2021'

In [15]:
# Check the blog contents
soup.find('div', class_ = 'entry-content').text

'\n\n\n\n\n\nLooking for the best data science bootcamp in the world? Or how about the best coding bootcamp in San Antonio? If you’re reading this, you’ve found both! We are thrilled to announce that Codeup has been chosen as a Best Data Science Bootcamp of 2021 by Course Report, and a Best San Antonio Coding Bootcamp 2021 by Career Karma.\nBest Data Science Bootcamp in the World\nCourse Report is a leading authority in ranking and reviewing bootcamps. Every year, they compile a list of the Best Data Science Bootcamps in the world. Codeup made the 2021 list, along with 21 other bootcamps, for our Data Science program. Why? Course Report considers alumni and student reviews, financing options, outcome transparency, and duration of the program.\xa0More on how Codeup performs in each of those categories below!\nBest Coding Bootcamp in San Antonio\nThe folks at Career Karma are experts on what the best bootcamps are for various tech careers. They’ve recently identified the 10 Best Coding B

Everything looks good. Let's put this all into a function.
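
One detail in the raw output above: the content still contains runs of newlines and non-breaking spaces (`\xa0`). Here is a minimal sketch of one way to tidy that up, assuming single spaces everywhere are acceptable (paragraph breaks are lost). `normalize_whitespace` is a hypothetical helper, not something defined in this notebook or in acquire.py.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (newlines, tabs, non-breaking spaces) into one space."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes the non-breaking space character '\xa0'
    return ' '.join(text.split())


sample = '\n\n\nCodeup offers a 13-week training program: Systems Engineering.\xa0Designed to help'
print(normalize_whitespace(sample))
```

This could be applied to the `content` value in place of (or after) `.strip()` if cleaner text is wanted at acquisition time.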

In [16]:
def get_blog_articles(urls):
    articles = []
    headers = {'user-agent' : 'Innis Data Science Cohort'}

    for url in urls:
        # Skip any URL that can't be fetched (connection error, timeout, etc.)
        try:
            response = requests.get(url, headers = headers)
        except requests.exceptions.RequestException:
            continue

        soup = BeautifulSoup(response.text, 'html.parser')

        # Only keep the page if we weren't redirected somewhere else
        if url == response.url:
            articles.append({
                'title' : soup.find('h1', class_ = 'entry-title').text,
                'date' : soup.find('p', class_ = 'post-meta').find('span').text,
                'content' : soup.find('div', class_ = 'entry-content').text.strip()
            })

    return articles

In [17]:
# Let's test it
pd.DataFrame(get_blog_articles(urls))

Unnamed: 0,title,date,content
0,Codeup’s Placement Team Continues Setting Records,"Nov 19, 2021",Our Placement Team is simply defined as a grou...
1,What the Heck is System Engineering?,"Oct 21, 2021",Codeup offers a 13-week training program: Syst...
2,Is Codeup the Best Bootcamp in San Antonio…or ...,"Sep 16, 2021",Looking for the best data science bootcamp in ...
3,Announcing our Candidacy for Accreditation!,"Jun 30, 2021",Did you know that even though we’re an indepen...
4,Why You Need the Best Coding Bootcamp Instructors,"May 21, 2021",One of the many reasons students love Codeup i...


### Bonus

In [18]:
# Now let's try grabbing the URLs for all blog posts on the Codeup blog page

url = 'https://codeup.com/blog/'
response = requests.get(url, headers = headers)
soup = BeautifulSoup(response.text, 'html.parser')

In [19]:
# We'll search for all article elements and grab the href from the a element associated with the article
for article in soup.find_all('article'):
    print(article.find('a')['href'])

https://codeup.com/workshops/from-bootcamp-to-bootcamp-a-military-appreciation-panel/
https://codeup.com/featured/our-acquisition-of-the-rackspace-cloud-academy-one-year-later/
https://codeup.com/workshops/virtual/learn-to-code-html-css-on-4-30/
https://codeup.com/workshops/virtual/learn-to-code-python-workshop-on-4-16/
https://codeup.com/codeup-news/coming-soon-cloud-administration/
https://codeup.com/featured/5-books-every-woman-in-tech-should-read/
https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/
https://codeup.com/codeup-news/vet-tec-funding-dallas/
https://codeup.com/codeup-news/dallas-campus-re-opens-with-new-grant-partner/
https://codeup.com/codeup-news/codeups-placement-team-continues-setting-records/
https://codeup.com/it-training/it-certifications-101/
https://codeup.com/cybersecurity/a-rise-in-cyber-attacks-means-opportunities-for-veterans-in-san-antonio/
https://codeup.com/codeup-news/use-your-gi-bill-benefits-to-land-a-job-in-tech/
https://codeup.com/tips-

In [20]:
# Let's try grabbing the Older Entries link
link = soup.find('a', string = '« Older Entries')['href']
link

'https://codeup.com/blog/page/2/?et_blog'

In [21]:
# Now let's see what we get when we try scraping the Older Entries link
response = requests.get(link, headers = headers)
soup = BeautifulSoup(response.text, 'html.parser')

In [22]:
# Let's get all the links on this page
for article in soup.find_all('article'):
    print(article.find('a')['href'])

https://codeup.com/workshops/virtual/learn-to-code-python-workshop-on-4-16/
https://codeup.com/codeup-news/coming-soon-cloud-administration/
https://codeup.com/featured/5-books-every-woman-in-tech-should-read/
https://codeup.com/codeup-news/codeup-launches-first-podcast-hire-tech/
https://codeup.com/tips-for-prospective-students/why-should-i-become-a-system-administrator/
https://codeup.com/codeup-news/codeup-candidate-for-accreditation/
https://codeup.com/codeup-news/codeup-takes-over-more-of-the-historic-vogue-building/
https://codeup.com/codeup-news/inclusion-at-codeup-during-pride-month-and-always/
https://codeup.com/tips-for-prospective-students/why-you-need-the-best-coding-bootcamp-instructors/
https://codeup.com/codeup-news/meet-the-new-codeup-coo-stephen-noteboom/
https://codeup.com/alumni-stories/how-i-went-from-codeup-to-business-owner/
https://codeup.com/tips-for-prospective-students/coding-is-for-women/
https://codeup.com/codeup-news/codeup-acquires-rackspace-cloud-academy/

There are a few duplicates in the list, but otherwise it worked as expected. We'll keep following the Older Entries link until it is no longer present, and on each page we'll grab all the article links and add them to a list. Once we have the full list of links, we'll convert it into a set to remove the duplicates and pass that set into the get_blog_articles function.
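
The plan above can be sketched as a plain loop. This is only an illustration under stated assumptions: `get_page` is a hypothetical callable standing in for the request and parsing step, and it is assumed to return a page's article links together with the URL of its Older Entries link (or `None` on the last page). The sketch also uses `dict.fromkeys` instead of `set`, which drops duplicates while keeping the links in the order they were found.

```python
def collect_links(start_url, get_page):
    """Follow 'next page' URLs until there are none left and return the unique links."""
    links = []
    visited = set()  # guards against a page that links back to an earlier page
    url = start_url

    while url is not None and url not in visited:
        visited.add(url)
        page_links, url = get_page(url)  # hypothetical: (links on page, next URL or None)
        links.extend(page_links)

    # dict.fromkeys keeps the first occurrence of each link, in order
    return list(dict.fromkeys(links))


# A fake two-page blog, so the sketch runs without any network access
fake_blog = {
    'page1': (['a', 'b', 'c'], 'page2'),
    'page2': (['c', 'd'], None),
}
print(collect_links('page1', fake_blog.get))
# ['a', 'b', 'c', 'd']
```

Passing a set, as described above, works just as well for scraping; the only difference is that the resulting rows come back in arbitrary order.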

In [23]:
def get_all_blog_articles():
    url = 'https://codeup.com/blog/'
    links = get_all_article_links(url)
    return get_blog_articles(set(links))

def get_all_article_links(url):
    links = []
    headers = {'user-agent' : 'Innis Data Science Cohort'}
    response = requests.get(url, headers = headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    for article in soup.find_all('article'):
        links.append(article.find('a')['href'])
        
    if (link := soup.find('a', string = '« Older Entries')):
        links += get_all_article_links(link['href'])
    
    return links

In [24]:
# Let's test it
pd.DataFrame(get_all_blog_articles())

Unnamed: 0,title,date,content
0,"Meet the new Codeup COO, Stephen Noteboom!","May 3, 2021","A big welcome to Stephen Noteboom, who will be..."
1,Codeup Launches a Houston Bootcamp!,"Oct 26, 2020","Houston, we have a problem: there aren’t enoug..."
2,Codeup Named a Top 30 Coding School,"Aug 14, 2018",Codeup Named a Top 30 Coding School\n \nWhile ...
3,Codeup Success Story: Ryan Orsinger,"Aug 14, 2018",Codeup Success Story: Ryan Orsinger\n \nWatch ...
4,Codeup Success Story: Cole Reveal,"Aug 14, 2018",Codeup Success Story: Cole Reveal\nWatch the v...
...,...,...,...
215,Which program is right for me: Cyber Security ...,"Oct 28, 2021",What IT Career should I choose?\nIf you’re thi...
216,How Codeup Paid Off for Both Employee and Empl...,"Feb 10, 2021","After graduating from Codeup in 2016, Stan H. ..."
217,Codeup Acquires Rackspace Cloud Academy!,"Apr 16, 2021",We are thrilled to officially announce Codeup’...
218,Screening candidates just got easier – and mor...,"Feb 22, 2021","In the magical time of “before COVID,” Codeup ..."


### Put it in a Module

Now let's put all this acquisition code into a module and test it.

In [26]:
# Let's test the function to make sure it works
get_codeup_blog().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    220 non-null    object
 1   date     220 non-null    object
 2   content  220 non-null    object
dtypes: object(3)
memory usage: 5.3+ KB
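
The `date` column above has dtype `object`, so the dates are still strings such as 'Nov 19, 2021'. Here is a minimal sketch of turning one of those strings into a real date with the standard library. `parse_post_date` is a hypothetical helper, and the format string assumes every post uses the same 'Mon D, YYYY' layout seen in the outputs above, with English month abbreviations.

```python
from datetime import datetime


def parse_post_date(text):
    """Parse a date string like 'Nov 19, 2021' into a datetime.date."""
    # %b = abbreviated month name, %d = day of month, %Y = four-digit year
    return datetime.strptime(text.strip(), '%b %d, %Y').date()


print(parse_post_date('Nov 19, 2021'))
# 2021-11-19
```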


## 2

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

### Get Info From One Article

Let's start by getting the information we need from just one article.

In [27]:
# Let's grab the HTML for a single article

url = 'https://inshorts.com/en/news/lt-infotech-announces-merger-with-mindtree-1651834439520'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

In [28]:
# Let's try getting the title

soup.find('div', class_ = 'news-card-title').find('span').text

'L&T Infotech announces merger with Mindtree'

In [29]:
# Next let's try getting the content

soup.find('div', class_ = 'news-card-content').find('div').text

"Larsen & Toubro Infotech (L&T Infotech) on Friday announced that its board has approved a scheme of amalgamation and arrangement with Mindtree. Under the scheme, L&T Infotech will issue 73 shares for every 100 shares of Mindtree. After the completion of the merger, Larsen & Toubro Limited will hold a 68.73% stake in 'LTIMindtree', the combined entity."

The category can be filled in with a function parameter. We'll pass it in as we go through each category page. Let's try this on a few more articles to see if it works.

In [30]:
# Let's try this one

url = 'https://inshorts.com/en/news/indias-covid19-death-numbers-quite-a-bit-lower-than-rich-countries-bill-gates-1651858819621'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

In [31]:
# Let's try getting the title

soup.find('div', class_ = 'news-card-title').find('span').text

"India's COVID-19 death numbers quite a bit lower than rich countries: Bill Gates"

In [32]:
# Let's try getting the content

soup.find('div', class_ = 'news-card-content').find('div').text

'India\'s COVID-19 death numbers are "quite a bit lower than rich countries", Microsoft Co-founder Bill Gates said in an interview with Times Now. He, however, admitted that there is "some debate over what those numbers are". On being asked how he would rate India\'s COVID-19 response on a scale of 1-10, Gates said, "I\'ll give India seven or eight."'

In [33]:
# Let's try one more.

url = 'https://inshorts.com/en/news/mi-hand-gt-their-2nd-loss-in-a-row-as-daniel-sams-defends-9-runs-in-last-over-1651860324419'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

In [34]:
# Let's try getting the title

soup.find('div', class_ = 'news-card-title').find('span').text

'MI hand GT their 2nd loss in a row as Daniel Sams defends 9 runs in last over'

In [35]:
# Let's try getting the content

soup.find('div', class_ = 'news-card-content').find('div').text

"Mumbai Indians (MI) defeated Gujarat Titans (GT) by five runs to register their second win in a row in IPL 2022. MI pacer Daniel Sams defended nine runs in the last over as GT suffered their second straight defeat in the tournament. Ishan Kishan top-scored for MI with 45(29), while MI's Murugan Ashwin took two wickets in the match."

It looks like it works. Let's put this in a function.

In [36]:
def get_article_info(url, category):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    return {
        'title' : soup.find('div', class_ = 'news-card-title').find('span').text,
        'content' : soup.find('div', class_ = 'news-card-content').find('div').text,
        'category' : category
    }

In [37]:
# Let's test it

url = 'https://inshorts.com/en/news/mi-hand-gt-their-2nd-loss-in-a-row-as-daniel-sams-defends-9-runs-in-last-over-1651860324419'
get_article_info(url, 'sports')

{'title': 'MI hand GT their 2nd loss in a row as Daniel Sams defends 9 runs in last over',
 'content': "Mumbai Indians (MI) defeated Gujarat Titans (GT) by five runs to register their second win in a row in IPL 2022. MI pacer Daniel Sams defended nine runs in the last over as GT suffered their second straight defeat in the tournament. Ishan Kishan top-scored for MI with 45(29), while MI's Murugan Ashwin took two wickets in the match.",
 'category': 'sports'}

### Find All Articles on a Single Page

Now let's try getting all the article links for a single category page.

In [38]:
# Let's get the HTML for one category page.

url = 'https://inshorts.com/en/read/sports'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

In [39]:
# Let's try getting the links for each article

# We can use find_all to get all news-card-title elements

for card in soup.find_all('div', class_ = 'news-card-title'):
    print(card.find('a')['href'])

/en/news/virat-kohli-bows-down-to-dinesh-karthik-after-his-375strike-rate-knock-video-goes-viral-1652012606197
/en/news/dc-to-bowl-first-against-csk-injured-ravindra-jadeja-replaced-by-shivam-dube-1652017597345
/en/news/kohlis-frustrated-reaction-after-golden-duck-goes-viral-bangar-seen-consoling-him-1652008813822
/en/news/prithvi-shaw-hospitalised-after-fever-shares-pic-from-hospital-bed-1652010924429
/en/news/wanindu-hasaranga-records-best-bowling-figures-of-ipl-2022-1652019879487
/en/news/csk-record-biggest-win-of-ipl-2022-defeat-dc-1652031957410
/en/news/australia-cricketer-head-fiancées-flight-skids-into-a-field-after-emergency-landing-1652027563246
/en/news/rcb-hand-srh-their-4th-straight-defeat-in-ipl-2022-get-to-14-points-1652018209677
/en/news/hes-equally-good-against-pace-spin-watson-on-dc-allrounder-powell-1652021357253
/en/news/hes-impressed-me-the-most-sehwag-on-pbks-wicketkeeperbatter-jitesh-1652010343725
/en/news/theekshana-has-been-the-steal-for-csk-one-of-the-best-buys

These are relative links, so they can't be requested on their own. We'll have to put the base domain in front of each one.
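
As an alternative to string concatenation, the standard library's `urllib.parse.urljoin` can combine a base URL with a relative href. A small sketch follows; the variable names are only for illustration.

```python
from urllib.parse import urljoin

base_domain = 'https://inshorts.com'
href = '/en/news/csk-record-biggest-win-of-ipl-2022-defeat-dc-1652031957410'

# A relative href is resolved against the base domain
full_url = urljoin(base_domain, href)
print(full_url)

# An href that is already absolute is returned unchanged
already_absolute = urljoin(base_domain, 'https://codeup.com/blog/')
print(already_absolute)
```

The second call is the main reason to prefer `urljoin`: plain concatenation would produce a broken URL if a page ever mixed absolute and relative links.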

In [40]:
# Insert the base domain before the link we get from the href attribute

for card in soup.find_all('div', class_ = 'news-card-title'):
    print('https://inshorts.com' + card.find('a')['href'])

https://inshorts.com/en/news/virat-kohli-bows-down-to-dinesh-karthik-after-his-375strike-rate-knock-video-goes-viral-1652012606197
https://inshorts.com/en/news/dc-to-bowl-first-against-csk-injured-ravindra-jadeja-replaced-by-shivam-dube-1652017597345
https://inshorts.com/en/news/kohlis-frustrated-reaction-after-golden-duck-goes-viral-bangar-seen-consoling-him-1652008813822
https://inshorts.com/en/news/prithvi-shaw-hospitalised-after-fever-shares-pic-from-hospital-bed-1652010924429
https://inshorts.com/en/news/wanindu-hasaranga-records-best-bowling-figures-of-ipl-2022-1652019879487
https://inshorts.com/en/news/csk-record-biggest-win-of-ipl-2022-defeat-dc-1652031957410
https://inshorts.com/en/news/australia-cricketer-head-fiancées-flight-skids-into-a-field-after-emergency-landing-1652027563246
https://inshorts.com/en/news/rcb-hand-srh-their-4th-straight-defeat-in-ipl-2022-get-to-14-points-1652018209677
https://inshorts.com/en/news/hes-equally-good-against-pace-spin-watson-on-dc-allrounde

Now we have good URLs. Let's put this in a function.

In [41]:
def get_category_links(category):
    links = []
    base_domain = 'https://inshorts.com'
    url = f'{base_domain}/en/read/{category}'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    for card in soup.find_all('div', class_ = 'news-card-title'):
        links.append(base_domain + card.find('a')['href'])
        
    return links

In [42]:
# Let's test it
get_category_links('sports')

['https://inshorts.com/en/news/virat-kohli-bows-down-to-dinesh-karthik-after-his-375strike-rate-knock-video-goes-viral-1652012606197',
 'https://inshorts.com/en/news/dc-to-bowl-first-against-csk-injured-ravindra-jadeja-replaced-by-shivam-dube-1652017597345',
 'https://inshorts.com/en/news/kohlis-frustrated-reaction-after-golden-duck-goes-viral-bangar-seen-consoling-him-1652008813822',
 'https://inshorts.com/en/news/prithvi-shaw-hospitalised-after-fever-shares-pic-from-hospital-bed-1652010924429',
 'https://inshorts.com/en/news/wanindu-hasaranga-records-best-bowling-figures-of-ipl-2022-1652019879487',
 'https://inshorts.com/en/news/csk-record-biggest-win-of-ipl-2022-defeat-dc-1652031957410',
 'https://inshorts.com/en/news/australia-cricketer-head-fiancées-flight-skids-into-a-field-after-emergency-landing-1652027563246',
 'https://inshorts.com/en/news/rcb-hand-srh-their-4th-straight-defeat-in-ipl-2022-get-to-14-points-1652018209677',
 'https://inshorts.com/en/news/hes-equally-good-agains

### Scrape All Articles From Various Categories

Finally, let's use the last two functions to create a function that scrapes all articles from the categories we request.
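
One thing to keep in mind when combining the two functions: if a single article page is missing an expected element, `soup.find(...)` returns `None` and the attribute access that follows raises `AttributeError`, which would stop the whole run. Here is a minimal sketch of a loop that records such failures and carries on. `collect_articles`, `fake_links`, and `fake_info` are hypothetical names; the two fake functions stand in for `get_category_links` and `get_article_info` so the sketch runs without network access.

```python
def collect_articles(categories, get_links, get_info):
    """Scrape every link in every category, skipping pages that fail to parse."""
    articles, failures = [], []

    for category in categories:
        for link in get_links(category):
            try:
                articles.append(get_info(link, category))
            except AttributeError as err:
                # Raised when an expected element was not found on the page
                failures.append((link, str(err)))

    return articles, failures


def fake_links(category):
    return [f'{category}/1', f'{category}/2']


def fake_info(link, category):
    if link.endswith('/2'):
        missing = None
        missing.text  # mimics soup.find(...) returning None
    return {'title': link, 'content': 'example', 'category': category}


articles, failures = collect_articles(['business', 'sports'], fake_links, fake_info)
print(len(articles), len(failures))
# 2 2
```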

In [43]:
# Get all links from the categories we choose, and get article info for each link

categories = [
    'business',
    'sports',
    'technology',
    'entertainment'
]
articles = []

for category in categories:
    for link in get_category_links(category):
        articles.append(get_article_info(link, category))
    
pd.DataFrame(articles)

Unnamed: 0,title,content,category
0,Best investment you'll make: Poonawalla sugges...,The Serum Institute of India's Adar Poonawalla...,business
1,Amazon driver leaves 'kind' message for girl f...,Amazon Founder Jeff Bezos took to Instagram St...,business
2,"Who is Jared Birchall, who manages the world's...",Jared Birchall is the Managing Director of Exc...,business
3,Work ethic expectations from Twitter employees...,"The world's richest man Elon Musk tweeted ""wor...",business
4,Baseless and untrue: ED as Xiaomi alleges it t...,After Xiaomi alleged in a court filing that it...,business
...,...,...,...
94,Hope hero says he's male Nushrratt: Nushrratt ...,Actress Nushrratt Bharuccha reacted to being c...,entertainment
95,"Preity shares pic of twins, says 'Beginning to...",Preity Zinta shared a picture of her twins and...,entertainment
96,'Arjun Reddy' actor Rahul Ramakrishna announce...,"Actor Rahul Ramakrishna, known for portraying ...",entertainment
97,"Irrfan in 'Piku' car scene was ""brilliance bey...",Actor Amitabh Bachchan has stated that late ac...,entertainment


In [44]:
# Let's put it in a function

def get_news_articles(categories):
    articles = []

    for category in categories:
        for link in get_category_links(category):
            articles.append(get_article_info(link, category))

    return pd.DataFrame(articles)

In [45]:
get_news_articles(categories)

Unnamed: 0,title,content,category
0,"Who is Jared Birchall, who manages the world's...",Jared Birchall is the Managing Director of Exc...,business
1,Best investment you'll make: Poonawalla sugges...,The Serum Institute of India's Adar Poonawalla...,business
2,Amazon driver leaves 'kind' message for girl f...,Amazon Founder Jeff Bezos took to Instagram St...,business
3,Work ethic expectations from Twitter employees...,"The world's richest man Elon Musk tweeted ""wor...",business
4,Baseless and untrue: ED as Xiaomi alleges it t...,After Xiaomi alleged in a court filing that it...,business
...,...,...,...
94,"Johnny Depp gets gifts, cards from fans outsid...",A video going viral on social media shows acto...,entertainment
95,Gangster from Middle East offered ₹1.25cr owed...,Late veteran actor Amjad Khan's son Shadaab ha...,entertainment
96,Faced hurdles getting into OTT when I was bare...,"Rithvik Dhanjani said he, initially, faced ""so...",entertainment
97,'Spider-Man: No Way Home' one of greatest movi...,"Filmmaker Sam Raimi, who directed the Tobey Ma...",entertainment


### Put it in a Module

Now let's put all this acquisition code into a module and test it.

In [47]:
%run acquire.py

In [49]:
get_inshorts_articles().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     99 non-null     object
 1   content   99 non-null     object
 2   category  99 non-null     object
dtypes: object(3)
memory usage: 2.4+ KB


## Bonus: cache the data

All code for caching the data is located in _acquire.py.
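
The caching code itself is not shown in this notebook, so the following is only a sketch of the general idea under stated assumptions, not the contents of that module. `load_or_fetch` is a hypothetical helper: it reads a JSON file if one exists and otherwise calls the supplied function and saves the result.

```python
import json
import os
import tempfile


def load_or_fetch(path, fetch, refresh=False):
    """Return cached data from `path`, or call `fetch()` and cache its result."""
    if os.path.exists(path) and not refresh:
        with open(path, encoding='utf-8') as f:
            return json.load(f)

    data = fetch()
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return data


calls = []

def fake_fetch():
    # Stands in for a scraping function; records how often it is called
    calls.append(1)
    return [{'title': 'Example title', 'content': 'Example content'}]


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'articles.json')
    first = load_or_fetch(path, fake_fetch)   # fetches and writes the file
    second = load_or_fetch(path, fake_fetch)  # reads the file instead

print(first == second, len(calls))
# True 1
```

Passing `refresh=True` forces a new scrape, which is useful because both sites change their front pages regularly.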