By the end of this exercise, you should have a file named `acquire.py` that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. `acquire_codeup_blog.py` and `acquire_news_articles.py`), but the final functions should still be available from `acquire.py` (that is, `acquire.py` should import `get_blog_articles` from the `acquire_codeup_blog` module).

In [3]:
import pandas as pd

import requests
import bs4
import os

### 1. Codeup Blog Articles

Scrape the article text from the following pages:

- https://codeup.com/codeups-data-science-career-accelerator-is-here/
- https://codeup.com/data-science-myths/
- https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
- https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
- https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/


Encapsulate your work in a function named `get_blog_articles` that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

> `{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}`


Plus any additional properties you think might be helpful.


**Bonus**:
- Scrape the text of all the articles linked on [codeup's blog page](https://codeup.com/blog/).
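
A rough sketch of the link-gathering step for the bonus. To stay runnable without network access or extra installs, it parses a small inline HTML string with the standard library's `html.parser` instead of BeautifulSoup. The `LinkCollector` class, the `collect_article_links` helper, and the sample markup are all illustrative assumptions. The real blog page's structure is not shown here, so on the live page the links would probably need further filtering to keep only article URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links into absolute ones and skip duplicates
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:
                self.links.append(absolute)


def collect_article_links(html, base_url="https://codeup.com/blog/"):
    """Return the unique absolute links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical stand-in for the blog index markup (not the real page)
sample_html = """
<div>
  <a href="/data-science-myths/">Data Science Myths</a>
  <a href="https://codeup.com/data-science-myths/">Read more</a>
  <a href="/10-tips-to-crush-it-at-the-sa-tech-job-fair/">10 Tips</a>
</div>
"""
links = collect_article_links(sample_html)
```

The resulting list of URLs could then be passed to a function such as `get_blog_articles`, which already accepts a list of article URLs.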

In [4]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'}

response = requests.get(url, headers=headers)
html = response.text
soup = bs4.BeautifulSoup(html, 'html.parser')

In [5]:
article = soup.select('.jupiterx-main-content')[0]
title = article.find('h1').text
content = article.select('.jupiterx-post-content.clearfix')[0].text

{"title": title, "content": content}

{'title': 'Codeup’s Data Science Career Accelerator is Here!',
 'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace,

In [6]:
#listing pages to scrape the article text in function
urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
    'https://codeup.com/data-science-myths/',
    'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
    'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
    'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']
    
    
#function to get list of article dictionaries
def get_blog_articles(urls):
    
    '''
    This function takes in urls that contain articles from the Codeup website,
    scrapes the article title and article text from each,
    and returns a list of dictionaries, 
    with each dictionary representing one article. 
    '''
    
    #create empty list 
    article_list = []
    
    #for loop to iterate through list of URLs
    for url in urls:
        
        #empty dictionary
        dic = {}
        
        #specify headers
        headers = {'User-Agent': 'Codeup Data Science'}
        
        #make http request and turn response into a beautiful soup object
        response = requests.get(url, headers=headers)
        html = response.text
        soup = bs4.BeautifulSoup(html, 'html.parser')
        
        #save article body text
        article = soup.select('.jupiterx-main-content')[0]
        
        #save article title and content text
        title = article.find('h1').text
        content = article.select('.jupiterx-post-content.clearfix')[0].text
        
        #store article body and title in dictionary
        dic['title'] = title
        dic['content'] = content
        
        #append dictionary to list
        article_list.append(dic)
        
    #return list when all article text added
    return article_list

In [7]:
#list of dictionaries w/ each dictionary representing one article
get_blog_articles(urls)

[{'title': 'Codeup’s Data Science Career Accelerator is Here!',
  'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspac

In [8]:
#can specify particular article 0-4
get_blog_articles(urls)[0]

{'title': 'Codeup’s Data Science Career Accelerator is Here!',
 'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace,

In [18]:
#making it into a dataframe along w/ date and image info

# Make a function that works on a single url
# Make sure your function has everything it needs inside (try to avoid globals)

def get_codeup_blog(url):
    
    # Send an old-style User-Agent string (HTTrack on Windows 98) instead of the default requests one
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = requests.get(url, headers=headers)
    
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    
    title = soup.find("h1").text
    published_date = soup.time.text
    
    if len(soup.select(".jupiterx-post-image")) > 0:
        blog_image = soup.select(".jupiterx-post-image")[0].picture.img["data-src"]
    else:
        blog_image = None
        
    content = soup.select(".jupiterx-post-content")[0].text
    
    output = {}
    output["title"] = title
    output["published_date"] = published_date
    output["blog_image"] = blog_image
    output["content"] = content
    
    return output

In [19]:
def get_all_blog_articles(urls):
    # List of dictionaries
    posts = [get_codeup_blog(url) for url in urls]
    
    return pd.DataFrame(posts)

In [20]:
urls = [
    "https://codeup.com/codeups-data-science-career-accelerator-is-here/",
    "https://codeup.com/data-science-myths/",
    "https://codeup.com/data-science-vs-data-analytics-whats-the-difference/",
    "https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/",
    "https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/"
]

In [21]:
df = get_all_blog_articles(urls)
df.head()

|   | title | published_date | blog_image | content |
|---|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | September 30, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | October 31, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | October 17, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | August 14, 2018 |  | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | August 14, 2018 |  | Competitor Bootcamps Are Closing. Is the Model... |


***
### 2. News Articles

We will now be scraping text data from [inshorts](https://inshorts.com/), a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment


The end product of this should be a function named `get_news_articles` that returns a list of dictionaries, where each dictionary has this shape:

> `{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}`



Hints:

- a. Start by inspecting the website in your browser. Figure out which elements will be useful.
- b. Start by creating a function that handles a single article and produces a dictionary like the one above.
- c. Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- d. Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.

In [54]:
url = 'https://inshorts.com/en/read'

response = requests.get(url)
html = response.text
soup = bs4.BeautifulSoup(html, 'html.parser')

In [60]:
article = soup.select('.card-stack')[0]
title = article.find(itemprop='headline').text
content = article.find(itemprop='articleBody').text

{"title": title, "content": content}

{'title': "US woman who didn't know she was pregnant gives birth mid-flight; video viral",
 'content': 'An American woman who didn\'t know she was pregnant gave birth to her son mid-flight and a video of the incident, posted by another passenger, went viral online. The Utah-based woman Lavinia Mounga was allowed to fly from Salt Lake City to Hawaii in her third trimester, as according to her father the family had "no idea" about the pregnancy.'}

In [22]:
category = "business"
base = "https://inshorts.com/en/read/"
url = base + category
url

'https://inshorts.com/en/read/business'

In [23]:
#function that handles a single article and produces a dictionary
def get_article(article, category):
    # Attribute selector
    title = article.select("[itemprop='headline']")[0].text
    
    # article body
    content = article.select("[itemprop='articleBody']")[0].text
    
    output = {}
    output["title"] = title
    output["content"] = content
    output["category"] = category
    
    return output

In [24]:
def get_articles(category):
    
    '''
    This function takes in a category as a string (category),
    which must be an available category on inshorts,
    and returns a list of dictionaries, where each dictionary represents a single inshorts article.
    '''
    
    base = "https://inshorts.com/en/read/"
    
    #concatenate base_url with the category
    url = base + category
    
    #get http response object from the server
    response = requests.get(url)

    #make soup out of the raw html
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    
    #ignore everything, focusing only on the news cards
    articles = soup.select(".news-card")
    
    output = []
    
    #iterate through every article tag/soup 
    for article in articles:
        
        #returns a dictionary of the article's title, body, and category
        article_data = get_article(article, category) 
        
        #append the dictionary to the list
        output.append(article_data)
    
    #return the list of dictionaries
    return output

In [26]:
# Example of using the get_articles function sending in the category name that's part of the URL
get_articles("business")[0]

{'title': 'India underestimated the coronavirus: Raghuram Rajan on 2nd wave',
 'content': 'Speaking about India\'s second COVID-19 wave, former RBI Governor Raghuram Rajan said, "I think what went wrong was simply [that]...we underestimated the virus and its ability to adapt." After the first wave, "there was a sense that we had endured the worst...and we had come through, and it was time to open up, and that complacency hurt us", he added.',
 'category': 'business'}

In [27]:
def get_news_articles(categories):
    '''
    This function takes in a list of categories, where each category is part of the URL pattern on inshorts,
    and returns a list of dictionaries covering every article from every category listed.

    Each dictionary represents a single article.
    '''
    all_inshorts = []

    for category in categories:
        all_category_articles = get_articles(category)
        all_inshorts = all_inshorts + all_category_articles

    
    return all_inshorts

In [29]:
categories = ["business", "sports", "technology", "entertainment"]

#specify article in brackets to check out different titles, content, and categories
get_news_articles(categories)[30]

{'title': 'Cricket reporter Ruchir Mishra dies due to COVID-19, Wasim Jaffer condoles demise',
 'content': 'Cricket reporter Ruchir Mishra passed away due to coronavirus at the age of 42. Ruchir worked for a national daily in Nagpur. Punjab Kings batting coach Wasim Jaffer condoled his demise, tweeting, "Don\'t want to believe what I have just heard. Ruchir Mishra is no more. This is just so sad. He was one of the nicest journalist from Nagpur."',
 'category': 'sports'}

In [35]:
def news_articles_df(categories):
    """
    Takes in a list of categories where the category is part of the URL pattern on inshorts
    Returns a dataframe of every article from every category listed
    Each row in the dataframe is a single article
    """
    all_inshorts = []

    for category in categories:
        all_category_articles = get_articles(category)
        all_inshorts = all_inshorts + all_category_articles

    df = pd.DataFrame(all_inshorts)
    return df

In [37]:
categories = ["business", "sports", "technology", "entertainment", "science", "world"]
df = news_articles_df(categories)
df

|   | title | content | category |
|---|---|---|---|
| 0 | Air India pilots demand vaccination on priorit... | Indian Commercial Pilots Association (ICPA) on... | business |
| 1 | India underestimated the coronavirus: Raghuram... | Speaking about India's second COVID-19 wave, f... | business |
| 2 | World's biggest jeweller says it will no longe... | Pandora, the world's biggest jeweller, has sai... | business |
| 3 | South Korea's richest woman gets fortune worth... | South Korea’s richest woman Hong Ra-hee added ... | business |
| 4 | SAIL increases oxygen supply to 1,100 metric t... | Steel Authority of India Limited (SAIL) said t... | business |
| ... | ... | ... | ... |
| 142 | Egypt buys 30 more Rafale jets from France in ... | Egypt will buy 30 more Rafale fighter jets fro... | world |
| 143 | Germany's Oktoberfest cancelled again in 2021 ... | Germany's Oktoberfest, the world's largest bee... | world |
| 144 | Nepal PM appeals for COVID-19 vaccines as case... | Nepal PM KP Sharma Oli has urged neighbouring ... | world |
| 145 | Iraqi Health Minister resigns over hospital fi... | Iraq's Health Minister Hassan al-Tamimi steppe... | world |


***
### 3. Bonus: cache the data

Write your code so that the acquired data is saved locally in some form. Your functions that retrieve the data should prefer to read the local data instead of making all the requests every time the function is called. Include a boolean flag in the functions to allow the data to be acquired "fresh" from the actual sources (rewriting your local cache).
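
One possible shape for this caching behavior, as a minimal sketch. The helper name `cached_acquire`, its `fetch` callable, the `fresh` flag name, and the JSON file format are all assumptions made for illustration. The exercise only asks for "some form" of local storage and a boolean flag. The demo uses a fake fetch function and a temporary directory so it runs without any network access.

```python
import json
import os
import tempfile


def cached_acquire(path, fetch, fresh=False):
    """
    Return cached data from `path` when it exists and `fresh` is False.
    Otherwise call `fetch()` (which should return JSON-serializable data,
    such as a list of dictionaries), write the result to `path`, and return it.
    """
    if not fresh and os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    data = fetch()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return data


# Demo with a stand-in for a real scraping function
calls = []


def fake_fetch():
    calls.append(1)  # record each time a "request" would have been made
    return [{"title": "t", "content": "c"}]


cache_path = os.path.join(tempfile.mkdtemp(), "articles.json")
first = cached_acquire(cache_path, fake_fetch)               # no cache yet: fetch runs
second = cached_acquire(cache_path, fake_fetch)              # cache exists: read from disk
third = cached_acquire(cache_path, fake_fetch, fresh=True)   # forced refresh: fetch runs again
```

In the notebook this could wrap an existing scraper, for example `cached_acquire("codeup_blog.json", lambda: get_blog_articles(urls))`, where the file name is a hypothetical choice.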