## Web-Scraping Exercises

By the end of this exercise, you should have a file named ```acquire.py``` that contains the specified functions. If you wish, you may split your work into a separate file for each website (e.g. ```acquire_codeup_blog.py``` and ```acquire_news_articles.py```), but the final functions must still be available from ```acquire.py``` (that is, ```acquire.py``` should import ```get_blog_articles``` from the ```acquire_codeup_blog``` module).



**1.  Codeup Blog Articles.  Scrape the article text from the following pages:**
- https://codeup.com/codeups-data-science-career-accelerator-is-here/
- https://codeup.com/data-science-myths/
- https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
- https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
- https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

Encapsulate your work in a function named ```get_blog_articles``` that returns a list of dictionaries, with each dictionary representing one article. Each dictionary should have this shape:

{'title': 'the title of the article', 'content': 'the full text content of the article'}

You may also include any additional information you think might be helpful.

In [1]:
#imports
import pandas as pd
import requests
import bs4
from requests import get
from bs4 import BeautifulSoup


In [2]:
# make the http request
headers = {'User-Agent': 'Codeup Data Science'}
response = requests.get('https://codeup.com/codeups-data-science-career-accelerator-is-here/', headers=headers)
html = response.text

In [3]:
# turn the response into a beautiful soup object (name the parser explicitly to avoid a bs4 warning)
soup = bs4.BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots"/>
  <style id="nab-alternative-loader-style" type="text/css">
  </style>
  <script id="nelio-ab-testing-kickoff" type="text/javascript">
   /*nelio-ab-testing-kickoff*//* <![CDATA[ */( function() { var ua = window.navigator.userAgent || ''; if ( -1 !== ua.indexOf( 'MSIE ' ) || -1 !== ua.indexOf( 'Trident/' ) ) { window.nabAddSingleAction = function() {}; window.nabDoSingleAction = function() {}; return; } function hideContent() { var element = document.getElementById( 'nab-alternative-loader-style' ); if ( ! element ) { return; } var style = 'html::before, body::before { background:#fff; content:""; width:100vw; height:100vh; position:fixed; z-index:999999999; }'; if ( element.styleSheet ) { element.styleSheet.cssText = style; } else { ele

In [4]:
# find the elements that hold the information we want
title = soup.find('title').text
content = soup.find('div', class_='jupiterx-post-content')
# content.text

In [5]:
#create list of urls
urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/',
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/', 
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/'
       ]

# create an acquire function
def get_blog_articles(urls):
    '''
    Takes in a list of Codeup blog post urls, scrapes each page, and returns
    a list of dictionaries containing the title and text of each article.
    '''
    headers = {'User-Agent': 'Codeup Data Science'}
    articles = []
    for url in urls:
        # fetch the article
        response = requests.get(url, headers=headers)

        # turn the response into a beautiful soup object
        soup = bs4.BeautifulSoup(response.text, 'html.parser')

        # create a dictionary for each article and add it to the list
        articles.append({
            'title': soup.find('title').text,
            'content': soup.find('div', class_='jupiterx-post-content').text
        })

    return articles

In [6]:
# call the function
articles = get_blog_articles(urls)
#articles

In [7]:
#create a dataframe of the articles and text
df = pd.DataFrame(articles)

In [8]:
df

| | title | content |
|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is He... | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths - Codeup | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair - ... | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... |
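
The output above shows two things that may need tidying: some titles carry the site suffix from the page `<title>` tag (for example "Data Science Myths - Codeup"), and the content contains embedded newlines. The following is a minimal sketch of one way to clean each dictionary after scraping. The helper name `clean_article` and the `SITE_SUFFIX` constant are hypothetical and not part of the original exercise, and the sample content is made-up placeholder text.

```python
import re

# Assumption: the suffix seen in the scraped <title> text, e.g. "Data Science Myths - Codeup"
SITE_SUFFIX = " - Codeup"


def clean_article(article, suffix=SITE_SUFFIX):
    """Return a copy of an article dictionary with a tidied title and content."""
    title = article["title"].strip()
    # Drop the site suffix only when it is actually present
    if suffix and title.endswith(suffix):
        title = title[: -len(suffix)].rstrip()
    # Collapse runs of whitespace (including newlines) into single spaces
    content = re.sub(r"\s+", " ", article["content"]).strip()
    cleaned = dict(article)  # keep any additional keys untouched
    cleaned.update({"title": title, "content": content})
    return cleaned


example = {
    "title": "Data Science Myths - Codeup",
    "content": "First line\n\nSecond   line ",  # placeholder text, not real article content
}
print(clean_article(example))
```

Applied to the scraped list, this would look like `articles = [clean_article(a) for a in articles]`. Whether to keep or remove newlines depends on how the text will be used later, so treat the whitespace step as optional.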


**2. News Articles - Write a function that scrapes the news articles for the following topics:**

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

**Hints:**

- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.

In [9]:
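def get_article(article, category):
    """
    Takes in the soup for a single news card and its category as a string.
    Returns a dictionary with the article's title, content, and category
    """
    # Attribute selector for the headline
    title = article.select("[itemprop='headline']")[0].text

    # Attribute selector for the article body
    content = article.select("[itemprop='articleBody']")[0].text

    return {"title": title, "content": content, "category": category}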

In [10]:
def get_articles(category):
    """
    Takes in a category as a string. The category must be one that is available on inshorts.
    Returns a list of dictionaries where each dictionary represents a single inshorts article.
    """
    base = "https://inshorts.com/en/read/"
    
    # Concatenate the base url with the category
    url = base + category
    
    # Identify as a Windows 98-era client (the string names HTTrack, an old website copier),
    # b/c I feel like creating an anomaly in the logs
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = get(url, headers=headers)

    # Make soup out of the raw html (name the parser explicitly to avoid a bs4 warning)
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Ignore everything, focusing only on the news cards
    articles = soup.select(".news-card")
    
    output = []
    
    # Iterate through every article tag/soup 
    for article in articles:
        
        # Returns a dictionary of the article's title, body, and category
        article_data = get_article(article, category) 
        
        # Append the dictionary to the list
        output.append(article_data)
    
    # Return the list of dictionaries
    return output
    

In [11]:
def get_all_news_articles(categories):
    """
    Takes in a list of categories where the category is part of the URL pattern on inshorts
    Returns a dataframe of every article from every category listed
    Each row in the dataframe is a single article
    """
    all_inshorts = []

    for category in categories:
        # Add every article from this category to the running list
        all_inshorts.extend(get_articles(category))

    return pd.DataFrame(all_inshorts)

In [12]:
categories = ["business", "sports", "technology", "entertainment", "science", "world"]
df = get_all_news_articles(categories)
df

| | title | content | category |
|---|---|---|---|
| 0 | India underestimated the coronavirus: Raghuram... | Speaking about India's second COVID-19 wave, f... | business |
| 1 | Air India pilots demand vaccination on priorit... | Indian Commercial Pilots Association (ICPA) on... | business |
| 2 | South Korea's richest woman gets fortune worth... | South Korea’s richest woman Hong Ra-hee added ... | business |
| 3 | World's biggest jeweller says it will no longe... | Pandora, the world's biggest jeweller, has sai... | business |
| 4 | India announced triumph over COVID-19 early: U... | Confederation of Indian Industry (CII) Preside... | business |
| ... | ... | ... | ... |
| 142 | Myanmar's military govt bans satellite TV citi... | Myanmar's military government has announced a ... | world |
| 143 | Germany's Oktoberfest cancelled again in 2021 ... | Germany's Oktoberfest, the world's largest bee... | world |
| 144 | Senior Swiss diplomat in Iran found dead after... | The first secretary at Switzerland's embassy i... | world |
| 145 | Myanmar indicts Japanese journalist on 'fake n... | Japanese journalist Yuki Kitazumi detained by ... | world |
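
The exercise asks for a function named `get_news_articles` that returns a list of dictionaries, whereas `get_all_news_articles` returns a DataFrame. Below is a minimal sketch of a list-returning variant. The `fetch` parameter is an assumption added for illustration: it stands for any callable that takes a category string and returns a list of article dictionaries (such as `get_articles`). `fake_fetch` is a hypothetical stand-in that lets the sketch run without network access.

```python
def get_news_articles(categories, fetch):
    """
    Return one flat list of article dictionaries across all categories.

    `fetch` is a hypothetical parameter: a callable that takes a category
    string and returns a list of article dictionaries.
    """
    articles = []
    for category in categories:
        articles.extend(fetch(category))
    return articles


def fake_fetch(category):
    # Stand-in for a real scraper; returns placeholder data, not real articles
    return [{
        "title": f"{category} headline",
        "content": "placeholder text",
        "category": category,
    }]


news = get_news_articles(["business", "sports"], fetch=fake_fetch)
print(len(news), news[0]["category"])
```

With the real scraper, the call would be `get_news_articles(categories, fetch=get_articles)`, and `pd.DataFrame(news)` would then produce a table like the one above.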
