In [1]:
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
from os import path

At a high level, we'll go about web scraping through this process:

* Manually explore the site in a web browser, and identify the relevant HTML elements.
* Use the requests module to obtain the HTML from the page.
* Use BeautifulSoup to parse the HTML and obtain the text/data that we want.
* (Maybe) Script the process of requesting another page and parsing the data from it as well.
* Take this data further down the data science pipeline.

Steps

1) Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.
2) Assign the address of the web page to a variable named url.
3) Request the content of the web page from the server by using get(), and store the server’s response in the variable response.
4) Print the response text to ensure you have an html page.
5) Take a look at the actual web page contents and inspect the source to understand the structure a bit.
6) Use BeautifulSoup to parse the HTML into a variable ('soup').
7) Identify the key tags you need to extract the data you are looking for.
8) Create a dataframe of the data desired.
9) Run some summary stats and inspect the data to ensure you have what you wanted.
10) Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
11) Create a corpus of the column with the text you want to analyze.
12) Store that corpus for use in a future notebook.
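
Steps 11 and 12 are not demonstrated in the cells that follow. As a minimal sketch, assuming the scraped posts are held as a list of dictionaries with a content key, the corpus could be written to a JSON file and read back later. The helper names save_corpus and load_corpus, the toy posts, and the file name are hypothetical and not part of the original notebook.

```python
import json
import tempfile
from os import path


def save_corpus(posts, filename):
    """Collect the 'content' field of each post and write the list to a JSON file."""
    corpus = [post["content"] for post in posts]
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(corpus, f, ensure_ascii=False)
    return corpus


def load_corpus(filename):
    """Return the stored corpus, or None if the file has not been written yet."""
    if not path.exists(filename):
        return None
    with open(filename, encoding="utf-8") as f:
        return json.load(f)


# Toy posts for illustration only; real posts would come from the scraper
toy_posts = [
    {"title": "Post A", "content": "Text of post A"},
    {"title": "Post B", "content": "Text of post B"},
]
corpus_file = path.join(tempfile.mkdtemp(), "codeup_corpus.json")  # hypothetical file name
missing_before = load_corpus(corpus_file)  # nothing stored yet
saved = save_corpus(toy_posts, corpus_file)
reloaded = load_corpus(corpus_file)
```

Checking for the file first means a later notebook can reuse the stored corpus instead of requesting the pages again.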

In [2]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
# Set user agent
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
# Get http response object from the server
response = get(url, headers=headers)

In [3]:
print(response.text[:200])

<!DOCTYPE html><html lang="en-US"><head >	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<meta name='robots' content='index, follow, max-image-previe
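
Printing the first characters is a quick visual check. A programmatic check is also possible. The sketch below assumes the response exposes status_code and headers the way a requests response does. The helper looks_like_html is hypothetical, and stand-in objects are used so the example runs without a network call.

```python
from types import SimpleNamespace


def looks_like_html(response):
    """Return True if the request succeeded and the server declared an HTML body."""
    content_type = response.headers.get("Content-Type", "")
    return response.status_code == 200 and "text/html" in content_type.lower()


# Stand-ins for real responses (assumption: only these two attributes are needed)
ok_response = SimpleNamespace(status_code=200, headers={"Content-Type": "text/html; charset=UTF-8"})
blocked_response = SimpleNamespace(status_code=403, headers={"Content-Type": "text/html"})
json_response = SimpleNamespace(status_code=200, headers={"Content-Type": "application/json"})

results = [looks_like_html(r) for r in (ok_response, blocked_response, json_response)]
```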


In [5]:
raw_html = response.text

In [6]:
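# Make a soup variable holding the response content
# Naming the parser explicitly avoids BeautifulSoup's "no parser was explicitly specified" warning
soup = BeautifulSoup(raw_html, 'html.parser')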

In [7]:
# get title
title = soup.find('h1').text
title

'Codeup’s Data Science Career Accelerator is Here!'

In [8]:
# Another way to get title using the jupiterx-post-title class
title = soup.select('.jupiterx-post-title')[0].text
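
Indexing the result of select() with [0] raises an IndexError when nothing matches. A minimal sketch of the alternative, select_one(), which returns None instead. The HTML string below is a made-up stand-in for the blog page, not the real markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML that reuses the class names from the blog page
sample_html = """
<article>
  <h1 class="jupiterx-post-title">Sample Title</h1>
  <div class="jupiterx-post-content"><p>First paragraph.</p></div>
</article>
"""
soup_sample = BeautifulSoup(sample_html, "html.parser")

# select_one returns the first match, or None when there is no match
title_tag = soup_sample.select_one(".jupiterx-post-title")
sample_title = title_tag.get_text(strip=True) if title_tag else None

missing_tag = soup_sample.select_one(".no-such-class")
```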

In [11]:
# Get content
content = soup.select('.jupiterx-post-content')[0].text
content

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Student
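
The \xa0 sequences in the output are non-breaking spaces carried over from the page. A minimal sketch of one way to normalize them before analysis. The helper name clean_text is hypothetical, and the sample string is a fragment of the output above.

```python
def clean_text(text):
    """Replace non-breaking spaces with regular spaces and trim both ends."""
    return text.replace("\xa0", " ").strip()


sample = "will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\n"
cleaned = clean_text(sample)
```

Newlines inside the text are left alone here because they mark paragraph breaks in the post.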

In [12]:
# image
div_for_image = soup.select('.jupiterx-post-image')[0]
image_src = div_for_image.picture.img['data-src']
image_src

'https://codeup.com/wp-content/uploads/2018/10/Data-Science-7.png'
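
Not every post has an image, so chained attribute access like .picture.img can fail on some pages. A sketch of a more tolerant lookup, tested on made-up HTML. The helper get_image_src and the fallback to src are assumptions, not something the original page is known to require.

```python
from bs4 import BeautifulSoup


def get_image_src(soup):
    """Return the post image URL, or None when the page has no post image."""
    img = soup.select_one(".jupiterx-post-image img")
    if img is None:
        return None
    # Lazy-loaded images keep the real URL in data-src; fall back to src otherwise
    return img.get("data-src") or img.get("src")


# Made-up snippets: one with a post image, one without
with_image = BeautifulSoup(
    '<div class="jupiterx-post-image"><picture>'
    '<img data-src="https://example.com/a.png" src="placeholder.gif">'
    "</picture></div>",
    "html.parser",
)
without_image = BeautifulSoup("<div><p>No image here</p></div>", "html.parser")

found = get_image_src(with_image)
not_found = get_image_src(without_image)
```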

In [27]:
# Make a function that works on a single url
# Make sure your function has everything it needs inside (try to avoid globals)

def acquire_codeup_blog(url):
    '''scrapes website elements and creates corpus for future use'''
    # Set header
    headers = {'User-Agent': 'Codeup Data Science'}

    # Get the http response object from the server
    response = get(url, headers=headers)
    
    soup = BeautifulSoup(response.text, "html.parser")
    
    title = soup.find("h1").text
    published_date = soup.time.text
    
    if len(soup.select(".jupiterx-post-image")) > 0:
        blog_image = soup.select(".jupiterx-post-image")[0].picture.img["data-src"]
    else:
        blog_image = None
        
    content = soup.select(".jupiterx-post-content")[0].text
    
    output = {}
    output["title"] = title
    output["published_date"] = published_date
    output["blog_image"] = blog_image
    output["content"] = content
    
    return output


In [29]:
output = acquire_codeup_blog('https://codeup.com/codeups-data-science-career-accelerator-is-here/')
output                          

{'title': 'Codeup’s Data Science Career Accelerator is Here!',
 'published_date': 'September 30, 2018',
 'blog_image': 'https://codeup.com/wp-content/uploads/2018/10/Data-Science-7.png',
 'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum dev

In [30]:
urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here',
'https://codeup.com/data-science-myths',
'https://codeup.com/data-science-vs-data-analytics-whats-the-difference',
'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair',
'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger']

def get_blog_articles(urls):
    '''iterates through a list of websites and compiles corpus of elements'''
    # List of dictionaries
    posts = [acquire_codeup_blog(url) for url in urls]
    
    return pd.DataFrame(posts)

In [31]:
get_blog_articles(urls)

,title,published_date,blog_image,content
0,Codeup’s Data Science Career Accelerator is Here!,"September 30, 2018",https://codeup.com/wp-content/uploads/2018/10/...,The rumors are true! The time has arrived. Cod...
1,Data Science Myths,"October 31, 2018",https://codeup.com/wp-content/uploads/2018/10/...,By Dimitri Antoniou and Maggie Giust\nData Sci...
2,Data Science VS Data Analytics: What’s The Dif...,"October 17, 2018",https://codeup.com/wp-content/uploads/2018/10/...,"By Dimitri Antoniou\nA week ago, Codeup launch..."
3,10 Tips to Crush It at the SA Tech Job Fair,"August 14, 2018",,SA Tech Job Fair\nThe third bi-annual San Anto...
4,Competitor Bootcamps Are Closing. Is the Model...,"August 14, 2018",,Competitor Bootcamps Are Closing. Is the Model...
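
The published_date column comes back as plain strings. A minimal sketch of converting it to datetimes so the posts can be sorted or summarized by date. The small dataframe reuses three date strings from the output above and is named dates here only for illustration.

```python
import pandas as pd

# Date strings copied from the scraped output above
dates = pd.DataFrame(
    {"published_date": ["September 30, 2018", "October 31, 2018", "August 14, 2018"]}
)

# %B = full month name, %d = day of month, %Y = four-digit year
dates["published_date"] = pd.to_datetime(dates["published_date"], format="%B %d, %Y")

earliest = dates["published_date"].min()
```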


2) News Articles

    We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

    Write a function that scrapes the news articles for the following topics:

    * Business
    * Sports
    * Technology
    * Entertainment

    The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

    {
        'title': 'The article title',
        'content': 'The article content',
        'category': 'business' # for example
    }

Hints:

a) Start by inspecting the website in your browser. Figure out which elements will be useful.

b) Start by creating a function that handles a single article and produces a dictionary like the one above.

c) Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.

d) Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.

In [32]:
# categories
categories = ["business", "sports", "technology", "entertainment", "science", "world"]

In [34]:
# url
url = 'https://inshorts.com/en/read/'

In [35]:
first_cat = categories[0]

In [37]:
first_page = url + first_cat
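
String concatenation works here because the base url ends with a slash. As a hedged alternative, the standard library's urljoin can build one URL per category. The names base and category_urls are illustrative only.

```python
from urllib.parse import urljoin

base = "https://inshorts.com/en/read/"
sample_categories = ["business", "sports", "technology", "entertainment"]

# The trailing slash on base matters: without it, urljoin replaces the last path segment
category_urls = {category: urljoin(base, category) for category in sample_categories}

no_slash = urljoin("https://inshorts.com/en/read", "business")
```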

In [39]:
headers = {'User-Agent': 'Codeup Data Science'}

In [40]:
# get our content
response = get(first_page, headers=headers)

In [43]:
response.text[:200]

'<!doctype html>\n<html lang="en">\n\n<head>\n  <meta charset="utf-8" />\n  <style>\n    /* The Modal (background) */\n    .modal_contact {\n        display: none; /* Hidden by default */\n        position: fix'

In [44]:
# make our soup object, naming the parser explicitly
soup = BeautifulSoup(response.text, 'html.parser')

In [48]:
# pull articles by the news-card class
articles = soup.select('.news-card')

In [75]:
# pull HTML code from first article
articles[0]

<div class="news-card z-depth-1" itemscope="" itemtype="http://schema.org/NewsArticle">
<span content="" itemid="https://inshorts.com/en/news/amazon-job-posting-fuels-speculations-about-plan-to-accept-payments-in-crypto-1627312165039" itemprop="mainEntityOfPage" itemscope="" itemtype="https://schema.org/WebPage"></span>
<span itemprop="author" itemscope="itemscope" itemtype="https://schema.org/Person">
<span content="Pragya Swastik" itemprop="name"></span>
</span>
<span content="Amazon job posting fuels speculations about plan to accept payments in crypto" itemprop="description"></span>
<span itemprop="image" itemscope="" itemtype="https://schema.org/ImageObject">
<meta content="https://static.inshorts.com/inshorts/images/v1/variants/jpg/m/2021/07_jul/26_mon/img_1627309467319_923.jpg?" itemprop="url"/>
<meta content="864" itemprop="width"/>
<meta content="483" itemprop="height"/>
</span>
<span itemprop="publisher" itemscope="itemscope" itemtype="https://schema.org/Organization">
<span 

In [73]:
# pull body text from the first article using the itemprop attribute
articles[0].select('[itemprop="articleBody"]')[0].text

"A new job posting by Amazon has fuelled speculations that the e-commerce major may begin accepting Bitcoin, Ether and other cryptocurrencies as a form of payment. According to the job posting, Amazon's Payments Acceptance & Experience team is hiring a 'Digital Currency and Blockchain Product Lead'. Following the speculations around Amazon's plan, Bitcoin surged near $40,000 on Monday."

In [126]:
def get_article(article, category):
    # Attribute selector
    title = article.select("[itemprop='headline']")[0].text
    
    # article body
    content = article.select('[itemprop="articleBody"]')[0].text
    
    output = {}
    output["title"] = title
    output["content"] = content
    output["category"] = category
    
    return output
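
get_article assumes every card has both a headline and a body; if one is missing, the [0] index raises an IndexError. A sketch of a more forgiving variant, run against made-up card HTML that mimics the itemprop attributes seen above. The name parse_card is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up cards: the second one has no articleBody
card_html = """
<div class="news-card">
  <span itemprop="headline">Sample headline</span>
  <div itemprop="articleBody">Sample body text.</div>
</div>
<div class="news-card">
  <span itemprop="headline">Headline without a body</span>
</div>
"""


def parse_card(card, category):
    """Return a dict for one card; missing fields become None instead of raising."""
    headline = card.select_one('[itemprop="headline"]')
    body = card.select_one('[itemprop="articleBody"]')
    return {
        "title": headline.get_text(strip=True) if headline else None,
        "content": body.get_text(strip=True) if body else None,
        "category": category,
    }


cards = BeautifulSoup(card_html, "html.parser").select(".news-card")
parsed = [parse_card(card, "business") for card in cards]
```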

In [127]:
def get_articles(category, url_base="https://inshorts.com/en/read/"):
    """
    Takes in a category as a string. The category must be one that is available on inshorts.
    Returns a list of dictionaries, where each dictionary represents a single inshorts article.
    """
    
    # Concatenate url_base with the category
    url = url_base + category
    
    # Set the headers
    headers = {'User-Agent': 'Codeup Data Science'}

    # Get the http response object from the server
    response = get(url, headers=headers)

    # Make soup out of the raw html
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Ignore everything, focusing only on the news cards
    articles = soup.select(".news-card")
    
    output = []
    
    # Iterate through every article tag/soup 
    for article in articles:
        
        # Returns a dictionary of the article's title, body, and category
        article_data = get_article(article, category) 
        
        # Append the dictionary to the list
        output.append(article_data)
    
    # Return the list of dictionaries
    return output

In [128]:
categories = ["business", "sports", "technology", "entertainment", "science", "world"]
def get_all_news_articles(categories):
    """
    Takes in a list of categories where the category is part of the URL pattern on inshorts
    Returns a dataframe of every article from every category listed
    Each row in the dataframe is a single article
    """
    all_inshorts = []

    for category in categories:
        all_category_articles = get_articles(category)
        all_inshorts = all_inshorts + all_category_articles

    df = pd.DataFrame(all_inshorts)
    return df
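
When looping over several categories, pausing between requests is a common courtesy to the server. A minimal sketch, assuming a fetch function that takes a category and returns a list of article dictionaries, as get_articles does. The names collect_with_delay and fake_fetch are hypothetical, and fake_fetch stands in for a real request so the example runs offline.

```python
import time


def collect_with_delay(categories, fetch, delay_seconds=1.0):
    """Call fetch(category) for each category, pausing between calls."""
    collected = []
    for i, category in enumerate(categories):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        collected.extend(fetch(category))
    return collected


def fake_fetch(category):
    """Stand-in for a real scraper; returns one placeholder article per category."""
    return [{"title": f"{category} story", "content": "placeholder", "category": category}]


demo = collect_with_delay(["business", "sports"], fake_fetch, delay_seconds=0)
```

Passing the fetch function in as an argument keeps the loop testable without touching the network.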

In [125]:
get_all_news_articles(categories)

,title,content,category
0,Amazon job posting fuels speculations about pl...,A new job posting by Amazon has fuelled specul...,business
1,China's ex-teacher turned billionaire no more ...,"China's Larry Chen, a former teacher who becam...",business
2,"Musk takes a jibe at rival car companies, says...",Tesla CEO and the world's second-richest perso...,business
3,Mahua Moitra writes to FM to look into 'over-i...,Lok Sabha MP Mahua Moitra has shared a letter ...,business
4,Govt paid Infosys ₹164.5 crore for new Income ...,The government paid ₹164.5 crore to Infosys to...,business
...,...,...,...
142,Russian PM Mishustin visits Pacific islands cl...,During his tour of Russia's Far East and Siber...,world
143,46 Afghan soldiers flee to Pakistan in retreat...,The Pakistani Army on Monday said that 46 Afgh...,world
144,US offers further air support to Afghan troops...,The US will continue to carry out airstrikes a...,world
145,Afghan Army chief postpones India visit amid T...,Afghan Army chief General Wali Mohammad Ahmadz...,world
