Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 1.3: Extracting Metadata

In this notebook, we use the [MediaStack](https://mediastack.com/) API to search for news.

We learn how to extract some metadata from the API. 

Go to https://mediastack.com/signup and fill in the required information. <br>
Save the *API Access Key* and use it in the code. 

## 1. Queries in different languages

In the MediaStack API, we can vary several parameters, such as the keywords and the language.

The language needs to be abbreviated according to the two-letter ISO-639-1 code.

Try out different queries and languages.

**Language codes**: There are different ISO code classifications for languages. ISO-639-1 is the oldest one and uses two letters. More recent schemes use three letters to include more languages (living and extinct):

- https://www.iso.org/iso-639-language-codes.html
- https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

**Query limit**: MediaStack returns at most 100 articles per query. You can get around this limit by changing the parameters, e.g. offset, source, or category, with every query. Check the [documentation](https://mediastack.com/documentation).
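As a minimal sketch of the offset idea, the following builds the query strings for several consecutive pages without sending any request. The helper `build_page_params` and its zero-based `page` argument are hypothetical names introduced for this illustration. The sketch assumes that `offset` counts the number of articles to skip, so check the documentation before relying on it.

```python
import urllib.parse

PAGE_SIZE = 100  # per-query maximum, as stated above


def build_page_params(access_key, keywords, language, page):
    """Return the URL-encoded query string for one page of results.

    `page` is a hypothetical zero-based page counter. It is translated
    into the `offset` parameter (assumed: number of articles to skip).
    """
    return urllib.parse.urlencode({
        'access_key': access_key,
        'keywords': keywords,
        'languages': language,
        'sort': 'published_desc',
        'limit': PAGE_SIZE,
        'offset': page * PAGE_SIZE,
    })


# Query strings for the first three pages of results
page_queries = [build_page_params('YOUR_ACCESS_KEY', 'vegan', 'en', page)
                for page in range(3)]
for page_query in page_queries:
    print(page_query)
```

Each of these strings could be passed to `conn.request('GET', '/v1/news?{}'.format(...))` in the same way as in the cell below.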

In [2]:
import http.client, urllib.parse, json
from util_html import *

conn = http.client.HTTPConnection('api.mediastack.com')

params = urllib.parse.urlencode({
    'access_key': '', ## ADD YOUR ACCESS KEY (do not share it publicly)
    'keywords': 'vegan',
    'sort': 'published_desc',
    'languages':'en', 
    'limit': 100
    })

conn.request('GET', '/v1/news?{}'.format(params))

res = conn.getresponse()
data = res.read()

query_content=(data.decode('utf-8'))

query = json.loads(query_content)
print(query)
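
Before working with the result, it can help to check whether the request succeeded. The sketch below assumes that a failed request (for example, with a wrong access key) returns a dictionary with an `error` key instead of `data`. This is an assumption about the response format, so inspect your own printout to confirm it. The helper `get_articles` and both example responses are hypothetical.

```python
def get_articles(query):
    """Return the list of articles from a parsed API response.

    Assumption: a failed request yields a dictionary with an "error" key
    instead of "data". Inspect your own response to confirm this.
    """
    if "error" in query:
        raise ValueError("API request failed: {}".format(query["error"]))
    return query.get("data", [])


# Hypothetical responses, only for illustration
ok_response = {"data": [{"title": "Example article"}]}
failed_response = {"error": {"code": "invalid_access_key"}}

print(len(get_articles(ok_response)))

try:
    get_articles(failed_response)
except ValueError as err:
    print(err)
```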



## 2. Extracting Metadata

MediaStack lists many articles for each query. As with the NOS articles, we first want to extract the links. If you click on the links, you will notice that MediaStack does not write its own articles, but only lists articles from other sources. In the following function, we extract the metadata that MediaStack returns for each article.

This particular strategy for metadata extraction only works for this version of MediaStack.
If you use another engine or if their code changes, you will need to adapt the metadata extraction.

**Make sure to add additional printouts to inspect the returned content and understand how we find the metadata.**
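
One possible way to add such printouts is sketched below. The `sample_article` dictionary is hypothetical: its values are invented, and it only contains the fields used in this notebook. In practice, you would pass an entry of `query["data"]` instead. The helper `summarize_fields` is also an illustrative name.

```python
import json

# Hypothetical article entry; the values are invented for illustration
sample_article = {
    "title": "Example title",
    "url": "https://example.org/news/1",
    "category": "general",
    "country": "us",
    "published_at": "2023-11-03T12:01:52+00:00",
}


def summarize_fields(article):
    """Return (field name, type name) pairs, sorted by field name."""
    return sorted((key, type(value).__name__) for key, value in article.items())


# Pretty-print the full entry, then list the available fields and their types
print(json.dumps(sample_article, indent=2, ensure_ascii=False))
for name, type_name in summarize_fields(sample_article):
    print(name, type_name)
```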

In [3]:
def extract_metadata(article):
    """Return date, time, title, URL, category and country of one article."""
    # Extract the publication date, e.g. "2023-11-03T12:01:52+00:00"
    published_at = article['published_at']
    if published_at:
        date, time = published_at.split("T")
    else:
        date = ""
        time = ""

    # Extract metadata
    url = article['url']
    title = article['title']

    # Category associated with the given news article
    category = article['category']

    # Country code associated with the given article
    country = article['country']

    return date, time, title, url, category, country
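
Splitting the timestamp at `"T"` works as long as the value has exactly that shape. As an alternative sketch, the standard library can parse the timestamp and reject malformed values. It assumes that `published_at` is an ISO 8601 string like the ones shown in the output of this notebook. The helper name `split_timestamp` is hypothetical.

```python
from datetime import datetime


def split_timestamp(published_at):
    """Split an ISO 8601 timestamp into (date, time) strings.

    Returns two empty strings if the timestamp is missing or malformed.
    """
    if not published_at:
        return "", ""
    try:
        parsed = datetime.fromisoformat(published_at)
    except ValueError:
        return "", ""
    return parsed.date().isoformat(), parsed.timetz().isoformat()


print(split_timestamp("2023-11-03T12:01:52+00:00"))
print(split_timestamp(None))
print(split_timestamp("not a date"))
```

A parsed `datetime` object also makes it easier to sort or filter articles by date later on.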
    

## 3. Extracting Content

In `util_html.py`, you find two additional functions: *parse_author* and *parse_news_text*. These functions try to extract the author and the text from each article. Note that the functions are only approximations: they might fail because we do not know the HTML structure of every publisher.
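
To illustrate why such functions are only approximations, here is a minimal sketch of one possible author heuristic. It is not the implementation in `util_html.py`. It assumes that a publisher stores the author in a `<meta name="author" content="...">` tag, which many pages do not. The names `AuthorMetaParser` and `find_author_in_meta` are hypothetical.

```python
from html.parser import HTMLParser


class AuthorMetaParser(HTMLParser):
    """Collect the content of the first <meta name="author"> tag."""

    def __init__(self):
        super().__init__()
        self.author = ""

    def handle_starttag(self, tag, attrs):
        # Ignore other tags, and stop looking once an author was found
        if tag != "meta" or self.author:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "author":
            self.author = (attributes.get("content") or "").strip()


def find_author_in_meta(html_text):
    """Return the author from a meta tag, or "" if none is found."""
    parser = AuthorMetaParser()
    parser.feed(html_text)
    return parser.author


# Hypothetical HTML snippets for illustration
with_author = '<html><head><meta name="author" content="Jane Doe"></head></html>'
without_author = '<html><head><title>No author here</title></head></html>'

print(repr(find_author_in_meta(with_author)))
print(repr(find_author_in_meta(without_author)))
```

A page that names its author only in the visible text, or in a differently named tag, would yield an empty string with this heuristic.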


In [4]:
articles = query["data"]
max_articles = 10  # number of articles to inspect; avoids shadowing the built-in max()
for article in articles[:max_articles]:

    date, time, title, article_url, category, country = extract_metadata(article)

    article_content = url_to_html(article_url)
    author = parse_author(article_content)
    content = parse_news_text(article_content)

    print(date, time)
    print(article_url)
    print("author:", author)
    print("title:", title)
    print("category:", category)
    print("country:", country)
    print(content[:100])
    print()

2023-11-03 12:01:52+00:00
https://boingboing.net/2023/11/03/new-steak-umm-campaign-turns-vegans-into-meat-lovers-to-provide-critical-education-about-ai-deepfakes.html
author: 
title: New Steak-umm campaign turns vegans into meat-lovers to provide critical education about AI deepfakes
category: general
country: us
Steak-umm—the brand that sells thin-sliced frozen meat steaks—is back with another project in their 

2023-11-03 00:21:35+00:00
https://www.pedestrian.tv/bites/krispy-kreme-vegan-doughnuts/
author: Soaliha Iqbal
title: Calling All Vegans: I Tried Krispy Kreme’s New Plant-Based Doughnuts & Here’s My Review
category: general
country: tv
I’ve been vegan for a couple of years now and in that time I’ve pretty much perfected (big claim, I 

2023-11-02 19:33:44+00:00
https://www.healthcanal.com/nutrition/healthy-eating/kefir-vs-kombucha
author: 
title: Kefir Vs Kombucha: Which Fermented Drink Is Better?
category: general
country: us


2023-11-02 16:44:01+00:00
https://www.ksro.com/20

## 4. Saving results as TSV

In [5]:
conn = http.client.HTTPConnection('api.mediastack.com')

keywords = 'veganism'

params = urllib.parse.urlencode({
    'access_key': '',## YOUR ACCESS KEY
    'keywords' : keywords,
    'sort': 'published_desc',
    'languages':'en', 
    'limit': 100
    })

conn.request('GET', '/v1/news?{}'.format(params))

res = conn.getresponse()
data = res.read()

query_content = data.decode('utf-8')
query = json.loads(query_content)

# Use the articles from this new query, not the ones from the previous cell
articles = query["data"]

outfile = "../results/mediastack_results/" + keywords + "_overview.tsv"

with open(outfile, "w", encoding="utf-8") as f:
    f.write("Publication Date\tTime\tAuthor\tTitle\tURL\tText\n")

    for article in articles:
        
        # Extract metadata
        date, time, title, article_url, category, country = extract_metadata(article)
        
        # Extract content
        article_content = url_to_html(article_url)
        author = parse_author(article_content)
        content = parse_news_text(article_content)
        
        # We replace newlines and tabs in the content with spaces, so that we can store it in a single TSV field.
        # Keep in mind that newlines can also carry meaning.
        # For example, they separate paragraphs, and this information is lost in the analysis if we remove them.
        content = content.replace("\n", " ").replace("\t", " ")

        # We want the fields to be separated by tabs (\t)
        output = "\t".join([date, time, author, title, article_url, content])
        f.write(output +"\n")
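
The cell above flattens the text so that each article fits on one line. As an alternative sketch, Python's `csv` module can write tab-separated files and quotes fields that contain tabs or newlines, so the paragraph structure survives. The rows below are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Hypothetical rows; the second text contains a newline and a tab
rows = [
    ["2023-11-03", "12:01:52+00:00", "", "Example title",
     "https://example.org/1", "Plain text."],
    ["2023-11-02", "19:33:44+00:00", "Jane Doe", "Second title",
     "https://example.org/2", "First paragraph.\nSecond\tparagraph."],
]

buffer = io.StringIO()  # stands in for the output file
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["Publication Date", "Time", "Author", "Title", "URL", "Text"])
writer.writerows(rows)

# Reading the data back restores the original fields, including the newline
buffer.seek(0)
restored = list(csv.reader(buffer, delimiter="\t"))
print(restored[2][5] == rows[1][5])
```

When writing to a real file this way, the `csv` documentation recommends opening it with `newline=""`. Note that a quoted field can span several physical lines, so tools that read the file line by line would need a CSV-aware reader.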


## 5. Inspect the results

You will notice that we do not always find a value for the author. There can be two reasons for that: 
- The author name is not provided by the publisher.
- Our code cannot find it.

Double-check on the website which explanation holds. When we are working with automatic methods, we will always be confronted with the issue of missing data. 
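
To get a first impression of how much data is missing, you could count the empty values per field. The sketch below uses hypothetical records that mimic rows of the overview file. The helper `missing_rate` is an illustrative name, and it cannot tell which of the two reasons above applies.

```python
def missing_rate(records, field):
    """Return the fraction of records whose `field` is empty or absent."""
    if not records:
        return 0.0
    missing = sum(1 for record in records if not record.get(field))
    return missing / len(records)


# Hypothetical records, shaped like rows of the overview file
records = [
    {"Author": "", "Title": "First title"},
    {"Author": "Jane Doe", "Title": "Second title"},
    {"Author": "", "Title": "Third title"},
    {"Author": "John Doe", "Title": "Fourth title"},
]

print("Missing authors: {:.0%}".format(missing_rate(records, "Author")))
```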

**Discuss how this can affect the methodology and interpretation of your experiments.** 