By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the final functions should be available from acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).

#### 1. Codeup Blog Articles
###### Visit Codeup's Blog and record the urls for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:


>{<br>
    'title': 'the title of the article',
    'content': 'the full text content of the article'<br>
}

Plus any additional properties you think might be helpful.
###### Bonus: Scrape the text of all the articles linked on codeup's blog page.
---
#### 2. News Articles
###### We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

    - Business
    - Sports
    - Technology
    - Entertainment
The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:


{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
###### Hints:
- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
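
The three hints above map onto three small functions. The following is only a rough sketch of that structure: the markup in `SAMPLE_HTML`, the class names `news-card`, `news-card-title` and `news-card-content`, and the helper names `parse_article`, `parse_page` and `scrape_pages` are all assumptions made for illustration, so check the real page in the browser inspector before relying on any of them.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page structure may differ.
SAMPLE_HTML = """
<div class="news-card">
  <div class="news-card-title">First headline</div>
  <div class="news-card-content">First body text.</div>
</div>
<div class="news-card">
  <div class="news-card-title">Second headline</div>
  <div class="news-card-content">Second body text.</div>
</div>
"""

def parse_article(card, category):
    """Step 1: turn one article element into a dictionary."""
    return {
        'title': card.select_one('.news-card-title').get_text(strip=True),
        'content': card.select_one('.news-card-content').get_text(strip=True),
        'category': category,
    }

def parse_page(html, category):
    """Step 2: find every article on one page and parse each one."""
    soup = BeautifulSoup(html, 'html.parser')
    return [parse_article(card, category) for card in soup.select('div.news-card')]

def scrape_pages(pages):
    """Step 3: combine the articles from all pages.

    `pages` maps a category name to that page's HTML. In real use the HTML
    would come from requests.get(...).text for each category URL.
    """
    articles = []
    for category, html in pages.items():
        articles.extend(parse_page(html, category))
    return articles

articles = scrape_pages({'business': SAMPLE_HTML})
print(articles)
```

Passing the HTML in, rather than fetching it inside the parsing functions, makes each step easy to try on a saved page without making new requests.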
---
#### Bonus: cache the data
Write your code such that the acquired data is saved locally in some form or fashion. Your functions that retrieve the data should prefer to read the local data instead of having to make all the requests every time the function is called. Include a boolean flag in the functions to allow the data to be acquired "fresh" from the actual sources (re-writing your local cache).

In [1]:
import requests
from bs4 import BeautifulSoup

#### We'll use the beautiful soup library to work with HTML data in python.
>Make a soup variable holding the response content<br>
soup = BeautifulSoup(response.content, 'html.parser')

#### Beautiful Soup Methods and Properties
- soup.title.string gets the page's title (the same text shown in the browser tab; this is the \<title> element)
- soup.prettify() is useful to print in case you want to see the HTML
- soup.find_all("a") finds all the anchor tags, or whatever tag is passed as the argument
- soup.find("h1") finds the first matching element
- soup.get_text() gets the text from within a matching piece of soup/HTML
- The soup.select() method takes in a CSS selector as a string and returns all matching elements. Super useful.

> see also `soup.find_all`<br>
beautiful soup uses `class_` as the keyword argument for searching<br>
for a class because `class` is a reserved word in python<br>
we'll use the class name that we identified from looking in the inspector in chrome<br>

article = soup.find('div', id='main-content')<br>
article.text
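
To try these methods without making a request, the sketch below runs them against a tiny made-up page. The HTML string and its ids and classes are invented for the example; only the Beautiful Soup calls themselves come from the notes above.

```python
from bs4 import BeautifulSoup

# A tiny made-up page so the example runs without a network request.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div id="main-content">
      <h1 class="entry-title">Hello</h1>
      <p class="intro">First paragraph.</p>
      <p>Second paragraph with a <a href="https://example.com/">link</a>.</p>
    </div>
  </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')

page_title = soup.title.string                    # text of the <title> element
first_h1 = soup.find('h1').get_text()             # first matching element
links = [a['href'] for a in soup.find_all('a')]   # every anchor tag
intro = soup.find_all('p', class_='intro')        # class_ because class is reserved
selected = soup.select('div#main-content p')      # CSS selector, returns a list
article = soup.find('div', id='main-content')     # look up an element by id

print(page_title, first_h1, links, len(intro), len(selected))
print(article.get_text(strip=True))
```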


---

#### 1.
###### Visit Codeup's Blog and record the urls for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:


>{<br>
    'title': 'the title of the article',
    'content': 'the full text content of the article'<br>
}

Plus any additional properties you think might be helpful.
###### Bonus: Scrape the text of all the articles linked on codeup's blog page.

In [2]:
url1 = 'https://codeup.com/cloud-administration/cloud-computing-and-aws/'
url2 = 'https://codeup.com/codeup-news/c-suite-award-stephen-noteboom/'
url3 = 'https://codeup.com/data-science/recession-proof-career/'
url4 = 'https://codeup.com/codeup-news/codeup-x-comic-con/'
url5 = 'https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/'

In [3]:
# Grab content from the url
requests.get(url1).content

b'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'

In [4]:
# Taking content from curriculum to fix error
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = requests.get(url1, headers=headers)
response

<Response [200]>
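
The only difference between the 403 and the 200 is the User-Agent header. As a small sketch, assuming the site simply rejects the default python-requests agent, a helper can keep the header in one place and fail loudly on error responses. The name `fetch_html` and the 10-second timeout are illustrative choices, not part of the exercise.

```python
import requests

# The value requests sends when no User-Agent is given ('python-requests/<version>').
default_agent = requests.utils.default_user_agent()
print(default_agent)

HEADERS = {'User-Agent': 'Codeup Data Science'}

def fetch_html(url, headers=HEADERS, timeout=10):
    """Return the page's HTML, raising an exception on 4xx/5xx responses."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # Turn a 403 into requests.HTTPError instead of quietly returning the error page.
    response.raise_for_status()
    return response.text
```

With this in place, `fetch_html(url1)` would stand in for the `requests.get(url1, headers=headers)` call above.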

In [5]:
# response.text, similar output
response.content

b'<!DOCTYPE html>\n<html lang="en-US">\n<head>\n\t<meta charset="UTF-8" />\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n\t<link rel="pingback" href="https://codeup.com/xmlrpc.php" />\n\n\t<script type="text/javascript">\n\t\tdocument.documentElement.className = \'js\';\n\t</script>\n\t\n\t<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=window.DiviAreaConfig={"zIndex":1000000,"animateSpeed":400,"triggerClassPrefix":"show-popup-","idAttrib":"data-popup","modalIndicatorClass":"is-modal","blockingIndicatorClass":"is-blocking","defaultShowCloseButton":true,"withCloseClass":"with-close","noCloseClass":"no-close","triggerCloseClass":"close","singletonClass":"single","darkModeClass":"dark","noShadowClass":"no-shadow","altCloseClass":"close-alt","popupSelector":".et_pb_section.popup","initializeOnEvent":"et_pb_after_init_modules","popupWrapperClass":"area-outer-wrap","fullHeightClass":"full-height","openPopupClas

In [6]:
print(response.text[:400])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.com/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
	
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=wi


In [7]:
BeautifulSoup(response.content, 'html.parser')

<!DOCTYPE html>

<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="https://codeup.com/xmlrpc.php" rel="pingback"/>
<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/><script id="diviarea-loader">window.DiviPopupData=window.DiviAreaConfig={"zIndex":1000000,"animateSpeed":400,"triggerClassPrefix":"show-popup-","idAttrib":"data-popup","modalIndicatorClass":"is-modal","blockingIndicatorClass":"is-blocking","defaultShowCloseButton":true,"withCloseClass":"with-close","noCloseClass":"no-close","triggerCloseClass":"close","singletonClass":"single","darkModeClass":"dark","noShadowClass":"no-shadow","altCloseClass":"close-alt","popupSelector":".et_pb_section.popup","initializeOnEvent":"et_pb_after_init_modules","popupWrapperClass":"area-outer-wrap","fullHeightClass":"full-height","openPopupClass":"da-overlay-visible","ove

In [8]:
BeautifulSoup(response.content, 'html.parser').find('div', id='main-content')

<div id="main-content">
<div class="container">
<div class="clearfix" id="content-area">
<div id="left-area">
<article class="et_pb_post post-19148 post type-post status-publish format-standard has-post-thumbnail hentry category-cloud-administration category-tips-for-prospective-students" id="post-19148">
<div class="et_post_meta_wrapper">
<h1 class="entry-title">What is Cloud Computing and AWS?</h1>
<p class="post-meta"><span class="published">Sep 13, 2022</span> | <a href="https://codeup.com/category/cloud-administration/" rel="category tag">Cloud Administration</a>, <a href="https://codeup.com/category/tips-for-prospective-students/" rel="category tag">Tips for Prospective Students</a></p><img alt="cloud computing" class="" height="675" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1080px, 100vw" src="https://tribucodeup.wpenginepowered.com/wp-content/uploads/2022/09/Copy-of-Blog-BannerFeatured-Image-1-1080x

In [9]:
BeautifulSoup(response.content, 'html.parser').find('div', id='main-content').text

'\n\n\n\n\n\nWhat is Cloud Computing and AWS?\nSep 13, 2022 | Cloud Administration, Tips for Prospective Students\n\n\nWith many companies switching to cloud services and implementing cloud infrastructure in their practices, many of us may be wondering…what is cloud computing, and what is AWS?\nHopefully, this beginner-friendly guide will answer these questions and give you a foundational understanding of AWS and how cloud infrastructure differs from traditional IT.\nWhat is cloud computing?\nCloud computing is data hosted via the internet by an independent party. This independent party extends its infrastructure to customers with a pay-as-you-go pricing model.\nCloud Service Providers\nRecently, more organizations have given up physical data centers and are transitioning to the Cloud for their needs. The following are the top Cloud Infrastructure Services providers in 2022:\n\nAmazon Web Services (AWS)\nMicrosoft Azure\nGoogle Cloud Platform (GCP)\n\nAWS| Amazon Web Services\nAt Codeu

In [10]:
article = BeautifulSoup(response.content, 'html.parser').find('div', id='main-content')

with open('article.txt', 'w') as f:
    f.write(article.text)

In [11]:
def get_article_text():
    # if we already have the data, read it locally
    # (disabled here: it needs `from os import path`, and would return the article.txt saved above)
    #if path.exists('article.txt'):
        #with open('article.txt') as f:
            #return f.read()

    # otherwise go fetch the data
    url = 'https://codeup.com/data-science/math-in-data-science/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    article = soup.find('div', id='main-content')

    # save it for next time
    with open('article.txt', 'w') as f:
        f.write(article.text)

    return article.text.strip()

In [12]:
get_article_text()

'What are the Math and Stats Principles You Need for Data Science?\nOct 21, 2020 | Data Science\n\n\nComing into our Data Science program, you will need to know some math and stats. However, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! Data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. But what “skills” do we mean, exactly? Just what exactly are the data science math and stats principles you need to know?\nWhat are the main math principles you need to know to get into Codeup’s Data Science program?\n\n\nAlgebra\nDo you know PEMDAS and can you solve for x? You will need to be or become comfortable with the following:\xa0\n\nVariables (x, y, n, etc.)\nFormulas, functions, and variable manipulations (e.g. x^2 = x + 6, solve for x).\nOrder of evaluation: PEMDAS: parentheses, exponents, then multiplication, divis

In [13]:
article.text.strip()

'What is Cloud Computing and AWS?\nSep 13, 2022 | Cloud Administration, Tips for Prospective Students\n\n\nWith many companies switching to cloud services and implementing cloud infrastructure in their practices, many of us may be wondering…what is cloud computing, and what is AWS?\nHopefully, this beginner-friendly guide will answer these questions and give you a foundational understanding of AWS and how cloud infrastructure differs from traditional IT.\nWhat is cloud computing?\nCloud computing is data hosted via the internet by an independent party. This independent party extends its infrastructure to customers with a pay-as-you-go pricing model.\nCloud Service Providers\nRecently, more organizations have given up physical data centers and are transitioning to the Cloud for their needs. The following are the top Cloud Infrastructure Services providers in 2022:\n\nAmazon Web Services (AWS)\nMicrosoft Azure\nGoogle Cloud Platform (GCP)\n\nAWS| Amazon Web Services\nAt Codeup, our Cloud

In [14]:
# find_all, not final_all: the misspelling fails with "TypeError: 'NoneType' object is not callable"
article.find_all('p')
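
A misspelled method name such as `final_all` does not raise an `AttributeError` in Beautiful Soup. An unknown attribute on a tag is treated as a search for a child tag with that name, so it evaluates to `None`, and calling `None` produces the `TypeError`. The short sketch below shows this on a made-up snippet of HTML.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>one</p><p>two</p></div>', 'html.parser')
div = soup.find('div')

# Unknown attribute names are treated as a search for a child tag with that name.
print(div.final_all)             # no <final_all> tag exists, so this is None
print(div.p)                     # shorthand for div.find('p')

paragraphs = div.find_all('p')   # the correctly spelled method
print([p.get_text() for p in paragraphs])
```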

In [15]:
def get_blog_articles():
    url = 'https://codeup.com/blog/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    articles = []

    # assumes each post on the blog page is linked from an <h2> heading; take the first six
    for heading in soup.find_all('h2')[0:6]:
        link = heading.find('a')
        if link is None or not link.get('href'):
            continue  # skip headings that do not link to a post
        response2 = requests.get(link['href'], headers=headers)
        soup2 = BeautifulSoup(response2.content, 'html.parser')
        title = soup2.select('h1.entry-title')
        content = soup2.select('div.entry-content')
        articles.append({'title': title[0].text.strip(), 'content': content[0].text.strip()})
    # return the list of dictionaries the exercise asks for
    # (wrap it in pd.DataFrame(...) after `import pandas as pd` if a table is preferred)
    return articles

#### 2.
###### We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

    - Business
    - Sports
    - Technology
    - Entertainment
The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:


{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

In [16]:
url6 = 'https://inshorts.com/en/read'

In [17]:
requests.get(url6).content



In [18]:
#headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
requests.get(url6)

<Response [200]>

In [19]:
BeautifulSoup(requests.get(url6).content, 'html.parser')#.find('div', id='main-content')

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<style>
    /* The Modal (background) */
    .modal_contact {
        display: none; /* Hidden by default */
        position: fixed; /* Stay in place */
        z-index: 8; /* Sit on top */
        left: 0;
        top: 0;
        width: 100%; /* Full width */
        height: 100%;
        overflow: auto; /* Enable scroll if needed */
        background-color: rgb(0,0,0); /* Fallback color */
        background-color: rgba(0,0,0,0.4); /* Black w/ opacity */
    }

    /* Modal Content/Box */
    .modal-content {
        background-color: #fefefe;
        margin: 15% auto;
        padding: 20px !important;
        padding-top: 0 !important;
        /* border: 1px solid #888; */
        text-align: center;
        position: relative;
        border-radius: 6px;
    }

    /* The Close Button */
    .close {
      left: 90%;
      color: #aaa;
      float: right;
      font-size: 28px;
      font-weight: bold;
    /* positi

In [20]:
BeautifulSoup(requests.get(url6).content, 'html.parser').find('div', id='main-content').text

AttributeError: 'NoneType' object has no attribute 'text'
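
`find` returns `None` when nothing matches, which is what happened here: the inshorts page has no `div` with `id='main-content'` (that id came from the Codeup page). One way to make this failure easier to spot is a small wrapper that checks for `None` before reading the text. The sketch below uses a made-up HTML snippet, and `text_of` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

def text_of(soup, name, **attrs):
    """Return the stripped text of the first match, or None when nothing matches."""
    element = soup.find(name, **attrs)
    return element.get_text(strip=True) if element is not None else None

soup = BeautifulSoup('<div class="card">Some news</div>', 'html.parser')
missing = text_of(soup, 'div', id='main-content')   # no such element on this page
found = text_of(soup, 'div', class_='card')
print(missing, found)
```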

In [21]:
import re

def get_news_articles():
    articles = []

    home_url = 'https://inshorts.com/en/read/'
    cat_list = ['technology', 'sports', 'business', 'entertainment']
    # capture the text that precedes the first blank line in each card
    regexp = r'(.+?)\n\n'
    for cat in cat_list:
        response = requests.get(home_url + cat)
        soup = BeautifulSoup(response.content, 'html.parser')
        titles = soup.select('div.news-card-title.news-right-box')
        contents = soup.select('div.news-card-content.news-right-box')

        # titles and contents are parallel lists, one entry per article card
        for title_tag, content_tag in zip(titles, contents):
            title = re.findall(regexp, title_tag.text.strip())
            content = re.findall(regexp, content_tag.text.strip())
            articles.append({'title': title[0], 'content': content[0], 'category': cat})
    # return the list of dictionaries the exercise asks for
    # (wrap it in pd.DataFrame(...) after `import pandas as pd` if a table is preferred)
    return articles

#### Bonus: cache the data
Write your code such that the acquired data is saved locally in some form or fashion. Your functions that retrieve the data should prefer to read the local data instead of having to make all the requests every time the function is called. Include a boolean flag in the functions to allow the data to be acquired "fresh" from the actual sources (re-writing your local cache).
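
One possible shape for this, sketched under the assumption that the scraped data is a JSON-serializable list of dictionaries: a generic wrapper reads the cache file when it exists and only calls the scraper when the file is missing or `fresh=True`. The names `cached`, `fake_scraper` and `articles.json` are illustrative. In real use the scraper passed in would be a function such as `get_blog_articles`.

```python
import json
import os
import tempfile

def cached(fetch, cache_file, fresh=False):
    """Return fetch()'s result, reading from or writing to a local JSON cache."""
    if not fresh and os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    data = fetch()
    with open(cache_file, 'w') as f:
        json.dump(data, f)
    return data

# Stand-in for a real scraper; it records how often it actually runs.
calls = []
def fake_scraper():
    calls.append(1)
    return [{'title': 'A title', 'content': 'Some content'}]

cache_path = os.path.join(tempfile.mkdtemp(), 'articles.json')
first = cached(fake_scraper, cache_path)               # cache miss: scraper runs
second = cached(fake_scraper, cache_path)              # cache hit: scraper is skipped
third = cached(fake_scraper, cache_path, fresh=True)   # forced refresh: scraper runs again
print(len(calls))
```

Keeping the cache logic separate from the scraping functions means the same wrapper can serve both the blog and the news scraper, each with its own cache file.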