2. News Articles

Next, we will scrape text data from inshorts, a website that publishes short news summaries on many different topics.

Write a function that scrapes the news articles for the following topics:

    Business
    Sports
    Technology
    Entertainment
    
The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

**Hints:**

- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Then write a function that handles a single article and produces a dictionary like the one above.
- Next, write a function that finds all the articles on a single page and calls the single-article function on each one.
- Finally, write a function that uses the previous two to scrape the articles from every page you need and does any additional processing.
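
The three layers described in the hints can be sketched as follows. This is only an illustration run against a small inline HTML snippet instead of the live site: the sample markup, the `.news-card` class, and the `itemprop` selectors are assumptions about the page structure, and `parse_article`, `parse_page`, and `scrape_topics` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one topic page; the real markup may differ.
SAMPLE_HTML = """
<div class="news-card">
  <span itemprop="headline">First headline</span>
  <div itemprop="articleBody">First body.</div>
</div>
<div class="news-card">
  <span itemprop="headline">Second headline</span>
  <div itemprop="articleBody">Second body.</div>
</div>
"""


def parse_article(card, category):
    """Step 1: turn a single article element into a dictionary."""
    return {
        "title": card.select_one("[itemprop='headline']").get_text(strip=True),
        "content": card.select_one("[itemprop='articleBody']").get_text(strip=True),
        "category": category,
    }


def parse_page(html, category):
    """Step 2: find every article on one page and parse each of them."""
    soup = BeautifulSoup(html, "html.parser")
    return [parse_article(card, category) for card in soup.select(".news-card")]


def scrape_topics(topics, fetch_html):
    """Step 3: collect the articles for every topic into one flat list.

    fetch_html is any callable that maps a topic name to that page's HTML,
    so the network request can be swapped out while testing.
    """
    articles = []
    for topic in topics:
        articles.extend(parse_page(fetch_html(topic), topic))
    return articles


# Every topic gets the same sample page here, purely for demonstration.
articles = scrape_topics(["business", "sports"], lambda topic: SAMPLE_HTML)
print(len(articles))
print(articles[0])
```

Passing the fetching step in as a callable keeps the parsing functions testable without touching the network; in real use it would wrap the HTTP request for that topic's URL.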

**Bonus: cache the data**

3. Write your code so that the acquired data is saved locally in some form. Your functions that retrieve the data should prefer to read the local copy instead of making all the requests every time they are called. Include a boolean flag in the functions to allow the data to be acquired "fresh" from the actual sources (overwriting your local cache).
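
One possible shape for such a cache is sketched below, using only the standard library. The helper name `get_cached`, its `acquire` callable, and the JSON file format are assumptions made for illustration; the exercise does not prescribe any of them.

```python
import json
import os
import tempfile


def get_cached(filename, acquire, fresh=False):
    """Return cached data from filename, calling acquire() only when needed.

    acquire is a hypothetical zero-argument callable that does the scraping
    and returns JSON-serializable data such as a list of dictionaries.
    """
    if not fresh and os.path.exists(filename):
        with open(filename, encoding="utf-8") as f:
            return json.load(f)

    # No cache yet, or a fresh copy was requested: acquire and overwrite.
    data = acquire()
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


# Demonstration with a fake acquire function that records each call.
calls = []


def fake_acquire():
    calls.append(1)
    return [{"title": "t", "content": "c", "category": "business"}]


cache_path = os.path.join(tempfile.mkdtemp(), "articles.json")
first = get_cached(cache_path, fake_acquire)              # acquires and writes the cache
second = get_cached(cache_path, fake_acquire)             # reads the cache
third = get_cached(cache_path, fake_acquire, fresh=True)  # forces a new acquisition
print(len(calls))
```

Only the first and third calls reach `fake_acquire`; the second is served entirely from the file on disk.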

In [4]:
import pandas as pd
from requests import get
import os
from bs4 import BeautifulSoup

In [9]:
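def get_articles_from_topic(url):
    """Scrape one inshorts topic page and return a list of article dictionaries."""
    headers = {'User-Agent':'Codeup Data Science Student'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    output = []
    # each article on the page sits in its own .news-card element
    articles = soup.select(".news-card")

    for article in articles:
        title = article.select("[itemprop='headline']")[0].get_text()
        body = article.select("[itemprop='articleBody']")[0].get_text()
        author = article.select(".author")[0].get_text()
        publish_date = article.select(".time")[0].get_text()
        # the topic is the last path segment of the URL, e.g. 'business'
        category = response.url.split("/")[-1]

        # note: the exercise calls the article text 'content'; this solution
        # stores it under 'body' and also keeps the author and publish time
        article_data = {'title': title,
                        'body': body,
                        'category': category,
                        'author': author,
                        'publish_date': publish_date
                       }
        output.append(article_data)

    return output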
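
The category above comes from splitting the URL on `/` and taking the last piece. That works for the four topic URLs used in this exercise, but a trailing slash would yield an empty string. A slightly more defensive variant, shown only as a sketch (the helper name `category_from_url` is hypothetical), parses the URL first:

```python
from urllib.parse import urlparse


def category_from_url(url):
    """Return the last non-empty path segment of url."""
    # urlparse separates the path from any query string or fragment
    path = urlparse(url).path
    return path.rstrip("/").split("/")[-1]


print(category_from_url("https://inshorts.com/en/read/business"))
print(category_from_url("https://inshorts.com/en/read/business/"))
# the plain split returns an empty string when the URL ends with a slash
print("https://inshorts.com/en/read/business/".split("/")[-1] == "")
```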

In [10]:
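def make_new_request():
    """Scrape all four topics, save the result to a CSV file, and return it as a DataFrame."""
    urls = ["https://inshorts.com/en/read/business",
            "https://inshorts.com/en/read/sports",
            "https://inshorts.com/en/read/technology",
            "https://inshorts.com/en/read/entertainment"]
    output = []

    for url in urls:
        # use .extend (not .append) so output stays a flat list of dictionaries
        output.extend(get_articles_from_topic(url))

    df = pd.DataFrame(output)
    # write the local cache; the DataFrame index is saved as the first column
    df.to_csv('inshorts_news_articles.csv')

    return df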
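
The choice of `.extend` over `.append` in the loop matters because each topic page yields a list. A small standalone comparison with made-up data shows the difference:

```python
# Two hypothetical per-page results, each a list of article dictionaries.
page_one = [{"title": "a"}, {"title": "b"}]
page_two = [{"title": "c"}]

nested = []
flat = []
for page in (page_one, page_two):
    nested.append(page)  # adds the whole list as a single element
    flat.extend(page)    # adds each dictionary individually

print(len(nested))  # number of pages
print(len(flat))    # number of articles
```

A flat list of dictionaries is what `pd.DataFrame` turns into one row per article; a list of lists would not give that shape.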

In [11]:
make_new_request()

Unnamed: 0,author,body,category,publish_date,title
0,Krishna Veera Vanamali,"British cave explorer Vernon Unsworth, who los...",business,09:01 pm,I'll take it on the chin: Cave explorer after ...
1,Krishna Veera Vanamali,After a US jury found that Elon Musk did not d...,business,10:04 pm,My faith in humanity is restored: Musk after w...
2,Dharna,A Lucknow-based customer has filed an FIR agai...,business,01:47 pm,FIR filed against Club Factory in Lucknow for ...
3,Krishna Veera Vanamali,Congress leader Shashi Tharoor said he wants a...,business,07:00 am,I want a 'New India' where Bajaj can speak fea...
4,Krishna Veera Vanamali,Finance Minister Nirmala Sitharaman on Saturda...,business,06:06 pm,Sitharaman hints at personal income tax rate c...
5,Pragya Swastik,The Uttar Pradesh Cabinet on Monday approved Z...,business,03:05 pm,UP Cabinet approves Zurich Airport Int'l as de...
6,Pragya Swastik,Former RBI Governor Raghuram Rajan in an artic...,business,04:10 pm,Ideas come from a small set of people around P...
7,Pragya Swastik,Price of onion has shot up to ₹200 per kg in B...,business,02:14 pm,Onion prices surge up to ₹200 per kg in Bengaluru
8,Krishna Veera Vanamali,A panel set up to increase GST collections is ...,business,08:00 am,GST panel considers raising 5% slab to 6%: Rep...
9,Pragya Swastik,Tata Sons Chairman Emeritus Ratan Tata in an i...,business,04:04 pm,I wanted to buy a proper piano until I saw the...


In [12]:
def get_news_articles(fresh=False):
    """Return the articles, reading the local CSV cache unless fresh=True."""
    filename = 'inshorts_news_articles.csv'
    if os.path.exists(filename) and not fresh:
        # index_col=0 restores the index that to_csv wrote as the first
        # column, so it does not come back as an extra 'Unnamed: 0' column
        return pd.read_csv(filename, index_col=0)
    # no cache yet, or a fresh scrape was requested: overwrite the cache
    return make_new_request()

In [13]:
df = get_news_articles()

In [14]:
df.head()

Unnamed: 0,author,body,category,publish_date,title
0,Krishna Veera Vanamali,"British cave explorer Vernon Unsworth, who los...",business,09:01 pm,I'll take it on the chin: Cave explorer after ...
1,Krishna Veera Vanamali,After a US jury found that Elon Musk did not d...,business,10:04 pm,My faith in humanity is restored: Musk after w...
2,Dharna,A Lucknow-based customer has filed an FIR agai...,business,01:47 pm,FIR filed against Club Factory in Lucknow for ...
3,Krishna Veera Vanamali,Congress leader Shashi Tharoor said he wants a...,business,07:00 am,I want a 'New India' where Bajaj can speak fea...
4,Krishna Veera Vanamali,Finance Minister Nirmala Sitharaman on Saturda...,business,06:06 pm,Sitharaman hints at personal income tax rate c...


In [15]:
df.category.unique()

array(['business', 'sports', 'technology', 'entertainment'], dtype=object)