Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 1.3: Extracting Metadata

In this notebook, we use the Google News API to search for news: 

https://news.google.com/?hl=en-US&gl=US&ceid=US:en

We learn how to extract some metadata from the API. 

## 1. Queries in different languages

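In the Google News API, we can specify the query (*q*) and the language. The cell below passes the language as *gl*. Note that in the example URL above, *hl* carries the language (*en-US*) and *gl* the country (*US*); Section 4 of this notebook uses *hl* instead. In both cases, the language is abbreviated with its two-letter ISO-639-1 code.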

**Play with different queries and languages.** 

Note: There are different ISO code classifications for languages. ISO-639-1 is the oldest one and uses two letters. More recent schemes use three letters to include more languages (living and extinct): 
* https://www.iso.org/iso-639-language-codes.html 
* https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

In [1]:
from util_html import *

topic ='veganism'
language='en'
base_url = "http://news.google.com/"
query = topic.lower()

# Make sure you understand how this string is composed. 
full_query = "?q={0}&gl={1}".format(query, language)
query_url = (base_url + full_query)
print("The query URL is:", query_url)

query_content = url_to_html(query_url)


The query URL is: http://news.google.com/?q=veganism&gl=en
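
The string composition above can also be sketched with the standard library, which escapes spaces and special characters in the query for us. This is a minimal sketch, not part of `util_html`: the helper name `build_query_url` is hypothetical, and the parameter names *q* and *gl* are simply the ones used in the cell above.

```python
from urllib.parse import urlencode


def build_query_url(base_url, topic, language):
    """Compose a query URL from a topic and a two-letter language code."""
    # urlencode escapes spaces and special characters in the parameter values
    params = {"q": topic.lower(), "gl": language}
    return base_url + "?" + urlencode(params)


print(build_query_url("http://news.google.com/", "veganism", "en"))
print(build_query_url("http://news.google.com/", "Climate Change", "nl"))
```

The first call reproduces the URL printed above. In the second call, the space in the topic is encoded as `+`, which plain string formatting would not do.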


## 2. Extracting Metadata

Google News lists many articles for each query. As with the NOS articles, we first want to extract the links. If you click on the links, you will notice that Google News does not write its own articles, but only lists articles from other sources. In the following function, we try to extract metadata from the HTML. 

This particular strategy for metadata extraction only works for this version of Google News.
If you use another engine or if their code changes, you will need to adapt the metadata extraction. 

**Make sure to add additional printouts to inspect the html content and understand how we find the metadata.** 

In [2]:
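import requests  # needed below; imported explicitly instead of relying on the star import


def extract_metadata_googlenews(article):
    # Extract the publication date from the <time> element, if there is one
    time_tag = article.find('time')
    if time_tag:
        # The attribute looks like "2020-10-20T12:22:11Z"
        timestamp = time_tag.get('datetime')
        date, time = timestamp.split("T")
    else:
        date = ""
        time = ""

    # Discover the structure in the data.
    # This assumes that every article contains exactly three links;
    # otherwise the unpacking raises a ValueError.
    technical_data, title_html, publisher_html = article.find_all('a')

    # Extract metadata
    publisher = publisher_html.contents[0]
    title = title_html.contents[0]
    url = title_html.get('href')

    # The URL is a redirect from the Google page. Let's re-create the original URL from this.
    # base_url is the global variable defined in the first cell.
    article_redirect = base_url + url
    article_url = requests.get(article_redirect).url

    return date, time, publisher, title, article_url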
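
To see what the date handling in this function does, here is a minimal standalone sketch of the split. The helper name `split_timestamp` is hypothetical; the example timestamp corresponds to the first article in the output further down.

```python
def split_timestamp(timestamp):
    """Split a timestamp such as '2020-10-20T12:22:11Z' into date and time."""
    if timestamp and "T" in timestamp:
        # Split only at the first "T", which separates date and time
        date, time = timestamp.split("T", 1)
        return date, time
    # Same fallback as in the function above: empty strings if nothing usable is found
    return "", ""


print(split_timestamp("2020-10-20T12:22:11Z"))
print(split_timestamp(None))
```

The second call shows why some rows in the output have an empty date and time: the article simply had no usable timestamp.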

## 3. Extracting Content

In util_html.py, you will find two additional functions: *parse_author* and *parse_news_text*. These functions try to extract the author and the text from each article. Note that the functions are only approximations. They might fail because we do not know the HTML structure of every publisher. 

**If you are an advanced programmer, check the code of the functions and make sure you understand it.** 

In [3]:
articles = query_content.find_all('article')

# Only process the first few articles.
# We avoid the variable name "max" because it would shadow a Python built-in.
max_articles = 10
for article in articles[:max_articles]:

    date, time, publisher, title, article_url = extract_metadata_googlenews(article)

    article_content = url_to_html(article_url)
    author = parse_author(article_content)
    content = parse_news_text(article_content)

    print(date, time, "publisher:", publisher)
    print(article_url)
    print("author:", author)
    print("title:", title)
    print(content[:100])
    print()

2020-10-20 12:22:11Z publisher: Clevelandmagazine.com
https://clevelandmagazine.com/food-drink/articles/picked-proteins-offers-an-alternative-to-meat-for-vegans
author: 
title: Picked Proteins Offers An Alternative To Meat For Vegans
Scott Roger didn’t miss much once he went vegan. But his cravings for pepperoni led him on a quest t

2020-10-18 07:16:42Z publisher: New Bloom
https://newbloommag.net/2020/10/18/chang-veganism-commentary/
author: 
title: Popular YouTube Video by Chang Chih-chyi Misrepresents Veganism
FAMOUS TAIWANESE Youtube influencer, Chang Chih-chyi, made a video about Veganism that is, at best, 

2020-10-11 07:00:00Z publisher: Telegraph.co.uk
https://www.telegraph.co.uk/health-fitness/nutrition/diet/gave-veganism-health-improved-instantly/
author: Flic Everett
title: 'I gave up veganism and my health improved instantly'
 Although many advocates of veganism remain healthy, after two years of health issues, I’m admitting

  publisher: RADIO.COM
https://www.radio.com/al

## 4. Saving results as TSV

A common open format for storing the content of multiple variables is the CSV file. CSV stands for comma-separated values. CSV files are text-based, but when they are imported into a spreadsheet program such as Excel, they are displayed as a table. Lines in the text file are interpreted as rows; commas in the text file are interpreted as separators for columns.  

For text data, many programmers prefer TSV files. In these files, the values are separated by tab characters ("\t") instead of commas, which is convenient because commas occur frequently in running text whereas tabs are rare. Both variants can be easily processed, but you need to know which separator has been used. 

**If necessary, recap information on CSV and TSV files in [Chapter 16](https://github.com/cltl/python-for-text-analysis/blob/master/Chapters/Chapter%2016%20-%20Data%20formats%20I%20(CSV%20and%20TSV).ipynb) of the Python course.** 
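
The difference between the two separators can be illustrated with Python's built-in `csv` module. This is a minimal sketch that writes to an in-memory buffer instead of a file; the helper name `rows_to_text` is hypothetical, and the rows are taken from the example output in Section 3.

```python
import csv
import io

rows = [
    ["Publisher", "Title"],
    ["New Bloom", "Popular YouTube Video by Chang Chih-chyi Misrepresents Veganism"],
    ["Telegraph.co.uk", "'I gave up veganism and my health improved instantly'"],
]


def rows_to_text(rows, delimiter):
    """Serialize rows to a string, using the given column separator."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = rows_to_text(rows, "\t")

# Reading with the correct separator restores the original columns
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(parsed == rows)

# Reading with the wrong separator puts each line into a single column
wrong = list(csv.reader(io.StringIO(tsv_text), delimiter=","))
print([len(row) for row in wrong])
```

The last line shows why you need to know which separator was used: with the wrong one, the table structure is silently lost.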

In [6]:
# Specify query
base_url = "http://news.google.com/"
query='veganism'
language='en'

# We can restrict the search result to the last 48 hours (48h), or 3 days (3d) or even to the last year (1y)
# Note, however, that this is a relative date, which makes it difficult to reproduce the retrieval at a later time!
# We do not call this variable "time", because that name is used for the publication time below.
time_window = "48h"

# Extract data

full_query = "?q={0}&hl={1}&when={2}".format(query.lower(), language, time_window)
query_url = base_url + full_query
query_content = url_to_html(query_url)
articles = query_content.find_all('article')

outfile = "../results/googlenews_results/" + query + "_overview.tsv"

# Extract metadata and write 
# The output directory must already exist. We set the encoding explicitly,
# because the articles can contain non-ASCII characters.
with open(outfile, "w", encoding="utf-8") as f:
    f.write("Publication Date\tTime\tPublisher\tAuthor\tTitle\tURL\tText\n")
    
    for article in articles:
        
        # Extract metadata
        date, time, publisher, title, article_url = extract_metadata_googlenews(article)
        
        # Extract content
        article_content = url_to_html(article_url)
        author = parse_author(article_content)
        content = parse_news_text(article_content)
        
        # We replace the newlines in the content by spaces, so that we can easily store it in a single line
        # without gluing the last word of one paragraph to the first word of the next. 
        # Keep in mind that newlines can also carry meaning.
        # For example, they separate paragraphs and this information is lost in the analysis if we remove them. 
        content = content.replace("\n", " ")
        
        # We want the fields to be separated by tabs (\t)
        output = "\t".join([date, time, publisher, author, title, article_url, content])
        f.write(output + "\n")
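
The loop above takes the newlines out of the article text before joining the fields with tabs. A tab character inside the text would break the columns in the same way a newline breaks the rows. The following is a minimal sketch of a helper that handles both; the name `flatten_field` and the sample text are invented for illustration.

```python
def flatten_field(text):
    """Collapse tabs, newlines and repeated spaces into single spaces."""
    # str.split() without arguments splits on any run of whitespace
    return " ".join(text.split())


sample = "First paragraph.\n\nSecond\tparagraph."
row = "\t".join(["2020-10-11", flatten_field(sample)])

# The row splits cleanly into exactly two columns
print(row.split("\t"))
```

As with removing newlines, this discards the paragraph structure, so it is only suitable if that structure is not needed for the analysis.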


## 5. Inspect the results

You will notice that we do not always find a value for the author. There can be two reasons for that: 
- The author name is not provided by the publisher.
- Our code cannot find it.

Double-check on the website which explanation holds. When we are working with automatic methods, we will always be confronted with the issue of missing data. 
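
Before checking individual websites, it can help to quantify how much data is missing. The following is a minimal sketch that counts empty author fields; the rows are copied from the example output in Section 3, and the dictionary layout is an assumption made for illustration (it is not the format of the TSV file).

```python
rows = [
    {"publisher": "Clevelandmagazine.com", "author": ""},
    {"publisher": "New Bloom", "author": ""},
    {"publisher": "Telegraph.co.uk", "author": "Flic Everett"},
]

# Collect the publishers for which no author was found
missing = [row["publisher"] for row in rows if not row["author"].strip()]
share = len(missing) / len(rows)

print(f"{len(missing)} of {len(rows)} articles have no author ({share:.0%})")
print("Check these publishers manually:", missing)
```

The list of publishers tells you where to look on the website to decide whether the author is really not provided or whether the extraction code failed.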

**Discuss how this can affect the methodology and interpretation of your experiments.** 

Note: Unfortunately, the Google News API is no longer maintained. It is still running but it is not known for how long.