# Lab1.4: Google news as a source of text

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, we use the Google news API to search for news: 

https://news.google.com/?hl=en-US&gl=US&ceid=US:en

We follow an approach similar to the one used for TechCrunch and use BeautifulSoup to get the content. However, now we have to deal with the structure of the Google news output. We do the usual imports for making a request and processing the output through BeautifulSoup.

In [1]:
from bs4 import BeautifulSoup
import requests


## 1. Making a request to the Google News API

In [4]:
import requests
query='vaccines'

query = query.lower()
language='en'
region='us'
base_url = "http://news.google.com"
# Placeholders: {0}=base_url, {1}=query, {2}=language, {3}=region
query_url = "{0}/?q={1}&hl={2}-{3}&gl={3}".format(base_url, query, language, region)
print(query_url)
google_news_content= requests.get(query_url).content

http://news.google.com/?q=vaccines&hl=en-us&gl=us
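
The query above is a single lowercase word, so plain string formatting is enough. Queries with spaces or special characters need to be URL-encoded first. The following is a minimal sketch using the standard library, assuming the same `q`, `hl` and `gl` parameters as in the cell above. The helper name `build_query_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode

def build_query_url(query, language="en", region="us", base_url="http://news.google.com"):
    """Build a search URL with URL-encoded parameters (hypothetical helper)."""
    params = {
        "q": query.lower(),
        "hl": "{}-{}".format(language, region),
        "gl": region,
    }
    # urlencode escapes spaces and special characters in the parameter values
    return "{}/?{}".format(base_url, urlencode(params))

print(build_query_url("flu vaccines"))
```

As an alternative, `requests.get` accepts a `params=` dictionary and performs the same encoding itself.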


The result is an HTML string. We can inspect the start of this string. Let's look at the first 1000 characters:

In [12]:
print(google_news_content[:1000])

b'<!doctype html><html lang="en" dir="ltr"><head><base href="https://news.google.com/"><meta name="referrer" content="origin"><link rel="canonical" href="https://news.google.com/search"><meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui"><meta name="apple-itunes-app" content="app-id=459182288"><meta name="google-site-verification" content="AcBy5YFny2HQgVUCR18tO5YUTf6MpVlcJqGTd-a9-SI"><meta name="mobile-web-app-capable" content="yes"><meta name="apple-mobile-web-app-capable" content="yes"><meta name="application-name" content="News"><meta name="apple-mobile-web-app-title" content="News"><meta name="apple-mobile-web-app-status-bar-style" content="black"><meta name="theme-color" content="white"><meta name="msapplication-tap-highlight" content="no"><link rel="shortcut icon" href="https://lh3.googleusercontent.com/-DR60l-K8vnyi99NZovm9HlXyZwQ85GMDxiwJWzoasZYCUrPuUM_P_4Rb7ei03j-0nRs0c4F=w16" sizes="16x16"><link rel="shortcut icon" href="https://lh3.googleusercontent
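
The `b'...'` prefix in the output shows that `requests` returns the raw content as bytes rather than as a text string. BeautifulSoup accepts bytes directly, but for printing or string operations it can help to decode them first. A small sketch, assuming the page is UTF-8 encoded (the sample bytes are copied from the start of the output above):

```python
# A short sample of the bytes shown above
raw = b'<!doctype html><html lang="en" dir="ltr">'

# Decode the bytes into a regular Python string (assumes UTF-8)
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)
print(text[:15])
```

The response object returned by `requests.get` also has a `.text` attribute that gives the decoded string.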

## 2. Using BeautifulSoup to process the result

We are going to use BeautifulSoup again to extract the results. So let's first turn the string returned by Google into a *soup* object.

In [13]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(google_news_content, 'html.parser')

We will not go into the structure of the Google news result in detail. Instead, we define two functions that get the *author* of a news article and its actual *content*.

Analyse the two functions and try to understand the code. Note that if the format of the Google news output changes, the code may have to be adapted as well. These functions use regular expressions: https://docs.python.org/3/library/re.html. Regular expressions allow you to 'parse' strings using simple patterns. We need to import the re package to use regular expressions.
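
To get a feeling for the patterns used in the two functions below, here is a small sketch that applies the same regular expressions to made-up strings. The helper names `clean_byline` and `has_punctuation` are hypothetical and only serve this illustration.

```python
import re

def clean_byline(raw):
    """Remove every occurrence of 'by ' (any case) and collapse runs of whitespace."""
    without_by = re.sub("by ", "", raw, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", without_by).strip()

def has_punctuation(paragraph):
    """Return True if the paragraph contains at least one of . , ! ?"""
    return bool(re.findall("[.,!?]", paragraph))

print(clean_byline("By  Jane\n Doe "))
print(has_punctuation("Read more"))
print(has_punctuation("Vaccines are safe, researchers say."))
```

Note that the pattern `"by "` matches anywhere in the string, so a name such as "Abby Smith" would also be altered. A stricter pattern may be needed for real data.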

In [14]:
import re

#This function requires the HTML content of a news article page (as returned by requests), not a soup object

def parse_author(html):
    result = ""
    html = BeautifulSoup(html, 'html.parser')
    # Try Parsing Author from Meta Tags
    author = html.find('meta', attrs={'name': re.compile('author')}) or \
             html.find('meta', property=re.compile('author', re.IGNORECASE))

    if author: result = author['content']
    else:  # Otherwise, try parsing Author from Text
        author = html.find(attrs={'itemprop': 'author'}) or \
                 html.find(attrs={'class': 'byline'})
        if author: result = author.text

    return re.sub(r"\s+", " ", re.sub("by ", "", result, flags=re.IGNORECASE)).strip()


#This function requires the HTML content of a news article page (as returned by requests)

def parse_news_text(html):

    html = BeautifulSoup(html, 'html.parser')

    # Try to find Article Body by Semantic Tag
    article = html.find('article')

    # Otherwise, try to find Article Body by Class Name (with the largest number of paragraphs)
    if not article:
        articles = html.find_all(class_=re.compile('(body|article|main)', re.IGNORECASE))
        if articles:
            article = sorted(articles, key=lambda x: len(x.find_all('p')), reverse=True)[0]

    # Parse text from all Paragraphs
    text = []
    if article:
        for paragraph in [tag.text for tag in article.find_all('p')]:
            if re.findall("[.,!?]", paragraph):
                text.append(paragraph)
    text = re.sub(r"\s+", " ", " ".join(text))

    return text
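
The lookups in the two functions can be tried out on a small, made-up HTML page. The sketch below repeats the first lookup of each function (the `meta` author tag and the `article` tag) on a sample string; the sample HTML and the variable names are assumptions for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Made-up page with an author meta tag and two paragraphs
sample_html = """
<html><head><meta name="author" content="Jane Doe"></head>
<body><article>
<p>Vaccines are safe, researchers say.</p>
<p>Share</p>
</article></body></html>
"""

page = BeautifulSoup(sample_html, "html.parser")

# Same lookup as the first step in parse_author
author_tag = page.find("meta", attrs={"name": re.compile("author")})
author = author_tag["content"] if author_tag else ""

# Same filtering as in parse_news_text: keep paragraphs that contain punctuation
paragraphs = [p.text for p in page.find("article").find_all("p")]
sentences = [p for p in paragraphs if re.findall("[.,!?]", p)]

print(author)
print(sentences)
```

The paragraph "Share" is dropped because it contains no punctuation, which is how the function filters out buttons and menu items.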

The Google news output consists of articles. We therefore need code to iterate over all article data elements and collect the data. The next code shows how we aggregate the content and meta information for each article, where we apply the above functions to obtain the author and the text from the content obtained from the URL. We break the loop after the first iteration.

In [15]:
# Iterate over all Articles in Google News
articles = soup.find_all('article')
for i, article in enumerate(articles, 1):
    ## new_entry is going to contain the data for each article returned
    new_entry = []
    
    div, title, publisher = article.find_all('a')
    time = re.sub(r"[Z\-:]", "", article.find('time').get('datetime'))

    article_redirect = "{}{}".format(base_url, title.get('href')[1:])
    article_url = requests.get(article_redirect).url
    article_hash = int(abs(hash(article_url)))
    
    news_content= requests.get(article_url).content
    author = parse_author(news_content)
    news_text = parse_news_text(news_content)
    
    #new entry append in the order of the data frame
    new_entry += [article_url, time, language, author, publisher.text, title.text, news_text]
    print(new_entry)
    break


['https://komonews.com/news/local/65-of-pregnant-women-dont-get-flu-vaccines-put-newborns-at-risk-researcher-says', '20191020T180200', 'en', 'KOMO News Staff', 'KOMO News', "65% of pregnant women don't get flu vaccines - put newborns at risk, data finds", '']
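
Two steps in the loop are easy to overlook: the clean-up of the time stamp and the construction of the redirect URL. The sketch below shows both on assumed example values. The `datetime` attribute value and the relative link are made up, and `urljoin` is shown as a standard-library alternative to slicing the `href`.

```python
import re
from datetime import datetime
from urllib.parse import urljoin

# Assumed example value of the 'datetime' attribute of a <time> tag
raw_time = "2019-10-20T18:02:00Z"
# Remove 'Z', '-' and ':' as done in the loop
compact_time = re.sub(r"[Z\-:]", "", raw_time)
# The compact string can still be parsed into a datetime object
parsed_time = datetime.strptime(compact_time, "%Y%m%dT%H%M%S")

# Hypothetical relative link of the kind the loop expects
href = "./articles/EXAMPLE"
# The loop drops the leading '.' and prepends the base URL
manual = "{}{}".format("http://news.google.com", href[1:])
# urljoin resolves the relative link against the base URL
joined = urljoin("http://news.google.com/", href)

print(compact_time)
print(parsed_time)
print(manual)
print(joined)
```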


## 3. Save the search results in a CSV file with the metadata

Now we want to store the results for the complete loop in a CSV file so that we can load it later on. To create the output as CSV data, we are going to use the `Pandas` package: https://pandas.pydata.org
Please follow the instructions to install pandas locally:

* `conda install pandas`
* `python -m pip install --upgrade pandas`

Consult the documentation to learn more about its functionality. Here we are going to use it to convert our list of features for an article to CSV format.

We need to import *os* for writing to a file and *pandas* (after the install) for dealing with the data structure. Take your time to study the next bit of code so that you understand the individual steps. 

We need to define the columns for the result table. The data need to be stored in the order of the columns.

In [16]:
COLS = ['url', 'created_at', 'lang', 'author',  'publisher', 'title', 'news_text']

Now we use the same article processing for-loop to obtain the data for each article and store the result in a pandas data frame that we call `all_news_dataframe`.

In [26]:
import os
import pandas as pd

# We first define a data frame that we name 'all_news_dataframe' 
# with pandas imported as 'pd' using the columns list that we defined before.
# Basically, we tell pandas what data will be stored.

all_news_dataframe = pd.DataFrame(columns=COLS)

# Iterate over all Articles in Google News
articles = soup.find_all('article')
for i, article in enumerate(articles, 1):
    ## new_entry is going to contain the data for each article returned
    new_entry = []
    
    div, title, publisher = article.find_all('a')
    time = re.sub(r"[Z\-:]", "", article.find('time').get('datetime'))

    article_redirect = "{}{}".format(base_url, title.get('href')[1:])
    ### Since the request call may generate an error from the website it tries to reach
    ### We have to catch the error message so that we can continue with the next URL to obtain a result
    ### To handle the errors, we create a try and except block. IF there is an error, we print it, otherwise we carry out commands
    try:
        article_url = requests.get(article_redirect).url
        article_hash = int(abs(hash(article_url)))
    
        news_content= requests.get(article_url).content
        author = parse_author(news_content)
        news_text = parse_news_text(news_content)
    
        #new entry append in the order of the data frame
        new_entry += [article_url, time, language, author, publisher.text, title.text, news_text]

        # We have now collected all the values for this article.
        # We use the pandas framework imported as 'pd' to create a dataframe from the aggregated data in new_entry
        # We need to provide the columns COLS to tell pandas what value belongs to what.
        # Note that the data need to be aggregated in the same order as the names in COLS, otherwise values will get mixed up
        single_article_dataframe = pd.DataFrame([new_entry], columns=COLS)
        
        # single_article_dataframe now contains the data for a single article
        # next we add it to the data frame for all articles 'all_news_dataframe'
        # DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
        # check the pandas documentation if you want to know what ignore_index=True does to the data aggregation
        all_news_dataframe = pd.concat([all_news_dataframe, single_article_dataframe], ignore_index=True)
    # requests.exceptions is a module; RequestException is the base class of the errors raised by requests
    except requests.exceptions.RequestException as e:
        print("Error:" + str(e))

When running the code, you may see a whole batch of error messages. Most likely, these are due to websites that cannot be reached. Basically, we fail to collect the data from the site. Still, some of the sites could be reached.
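
Errors raised by `requests` derive from `requests.exceptions.RequestException`, so catching that class covers unreachable sites, timeouts and malformed URLs. The following is a minimal sketch of a helper that returns `None` instead of stopping the loop. The name `fetch_or_none` and the timeout of 10 seconds are assumptions, not part of the notebook.

```python
import requests

def fetch_or_none(url, timeout=10):
    """Return the page content as bytes, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        # Treat HTTP error codes such as 404 as failures too
        response.raise_for_status()
        return response.content
    except requests.exceptions.RequestException as e:
        print("Error: " + str(e))
        return None

# A URL without a scheme fails before any network traffic takes place
print(fetch_or_none("not-a-valid-url"))
```

Setting a timeout keeps a single slow website from blocking the whole loop.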

Our data frame is basically a table with columns and rows. We use the `shape` attribute to ask for the number of rows and columns.

In [25]:
print(all_news_dataframe.shape)

(35, 7)
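
An alternative to growing the data frame inside the loop is to collect the rows in a plain Python list and build the data frame once at the end. The sketch below shows this pattern with two made-up rows. The values are for illustration only; real rows come from the article loop.

```python
import pandas as pd

COLS = ['url', 'created_at', 'lang', 'author', 'publisher', 'title', 'news_text']

# Made-up rows, each in the same order as COLS
rows = [
    ['https://example.com/a', '20191020T180200', 'en', 'Jane Doe', 'Example News', 'Title A', 'Text A.'],
    ['https://example.com/b', '20191021T090000', 'en', '', 'Example Daily', 'Title B', ''],
]

# Build the data frame in one step from the list of rows
example_dataframe = pd.DataFrame(rows, columns=COLS)

print(example_dataframe.shape)  # (2, 7): two rows, seven columns
print(example_dataframe[['publisher', 'title']])
```

For a few dozen articles the difference is small, but this pattern avoids copying the data frame on every iteration.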


Through the *pandas* framework, we can now save it to a CSV file.

In [11]:
# We define a file path to store the results as CSV. The 'news_results_<query>.csv' file will be created
# in the folder 'googlenews_search_results'; you can also specify another location.
csvFilePath='googlenews_search_results/news_results_'+query+'.csv'

# Create the output folder if it does not exist yet
os.makedirs(os.path.dirname(csvFilePath), exist_ok=True)

# Write the data frame to the CSV file; pandas opens and closes the file for us
all_news_dataframe.to_csv(csvFilePath, columns=COLS, index=False)
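
To load the saved results later on, `pandas.read_csv` can be used. The sketch below writes a tiny made-up data frame to a temporary folder and reads it back. The file name and the data are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Tiny made-up data frame standing in for all_news_dataframe
example = pd.DataFrame([['https://example.com/a', 'Title A']], columns=['url', 'title'])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'news_results_example.csv')
    # Write the data frame without the index column
    example.to_csv(path, index=False)
    # Read the CSV file back into a new data frame
    loaded = pd.read_csv(path)

print(loaded.shape)
print(loaded.loc[0, 'title'])
```

By default, `read_csv` turns empty fields into missing values (NaN). If empty strings such as an empty `news_text` should stay strings, passing `keep_default_na=False` is one option.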

Unfortunately, the Google News API is no longer maintained. It is still running but it is not known for how long:

https://medium.com/rakuten-rapidapi/top-10-best-news-apis-google-news-bloomberg-bing-news-and-more-bbf3e6e46af6


In the next notebook Lab1.5, we show how you can access other news sources directly.

## End of this notebook