Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 1.2: Querying a REST API

News articles are a powerful resource for analyzing current trends and events. In this lab, we will extract Dutch news articles from nos.nl.

**Visit https://nos.nl/zoeken/ and have a look at the search utility.**

The search utility provides access to a database of articles. Instead of searching the database in the browser, we can also access it using code through a so-called REST API. REST stands for REpresentational State Transfer, an architectural style in which a client queries information from a server by sending HTTP requests to URLs.

## 1. Querying keywords

We want to create a dataset that contains articles related to a specific topic. In order to find these articles, we need to determine good keywords for the topic. In this example, we use the keyword "veganisme".
**Make sure to test several keywords for your topic and inspect the quality of the results.**  

Instead of manually copying all search results into files, we want to automate the task. We create a URL by appending the keyword to the search URL of NOS. Note that other websites can use a different syntax. It helps to first test the search interface manually to understand how searches are specified. 
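
Simple string concatenation works for a single word such as "veganisme", but keywords that contain spaces or special characters need to be encoded first. The following is a minimal sketch using only the standard library; the helper name `build_search_url` is our own and not part of `util_html`, and we assume that the NOS search accepts standard query-string encoding.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://nos.nl/zoeken/"

def build_search_url(keyword):
    """Return the search URL for a keyword, encoding spaces and special characters."""
    return SEARCH_URL + "?" + urlencode({"q": keyword})

print(build_search_url("veganisme"))
print(build_search_url("plantaardig eten"))
```

For a keyword without special characters, this produces the same URL as the concatenation used in this lab.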

In [1]:
from util_html import *

keyword="veganisme"
url = "https://nos.nl/zoeken/?q=" + keyword

print('The search request URL:', url)

parser_content= url_to_html(url)

# The class name of the search results looks auto-generated and may change over time.
# You can look it up in the HTML source in your web browser.
search_results = parser_content.find_all("a", {"class":"sc-f75afcb6-4 isiLEZ"})

# For comparison, print the full output, scroll through it and make sure you find the search results in there. 
print(search_results)

The search request URL: https://nos.nl/zoeken/?q=veganisme
[<a class="sc-f75afcb6-4 isiLEZ" data-testid="listitem" href="/artikel/2314011-veganistische-moeder-niet-langer-geweerd-van-moedermelkbank"><div class="sc-f75afcb6-1 dwJDmr"><span class="sc-89aee953-0 flpnmJ"><picture><source media="" sizes="(min-width: 760px) 165px, 100px" srcset="https://cdn.nos.nl/image/2019/12/09/612898/96x72a.jpg 96w, https://cdn.nos.nl/image/2019/12/09/612898/192x144a.jpg 192w, https://cdn.nos.nl/image/2019/12/09/612898/288x216a.jpg 288w, https://cdn.nos.nl/image/2019/12/09/612898/384x288a.jpg 384w, https://cdn.nos.nl/image/2019/12/09/612898/480x360a.jpg 480w, https://cdn.nos.nl/image/2019/12/09/612898/576x432a.jpg 576w, https://cdn.nos.nl/image/2019/12/09/612898/768x576a.jpg 768w, https://cdn.nos.nl/image/2019/12/09/612898/960x720a.jpg 960w, https://cdn.nos.nl/image/2019/12/09/612898/1152x864a.jpg 1152w, https://cdn.nos.nl/image/2019/12/09/612898/1440x1080a.jpg 1440w, https://cdn.nos.nl/image/2019/12/09/
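
The output above shows that each search result also carries a `data-testid="listitem"` attribute. Selecting on that attribute instead of the class name might be less brittle, although that is an assumption about the NOS page and not something the site guarantees. The sketch below uses only the standard library; the class `ListItemLinkParser` is a hypothetical helper, and the sample string is a shortened version of the first result.

```python
from html.parser import HTMLParser

class ListItemLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose data-testid is 'listitem'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("data-testid") == "listitem":
            href = attributes.get("href")
            if href:
                self.links.append(href)

# Shortened copy of the first search result; the inner content is left out
sample = (
    '<a class="sc-f75afcb6-4 isiLEZ" data-testid="listitem" '
    'href="/artikel/2314011-veganistische-moeder-niet-langer-geweerd-van-moedermelkbank">'
    '</a>'
)

parser = ListItemLinkParser()
parser.feed(sample)
print(parser.links)
```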

In [4]:
parser_content.find_all("div",{"class":"sc-d6d7be46-0 jjSfnY sc-f75afcb6-6 hGGBnM"})

[<div class="sc-d6d7be46-0 jjSfnY sc-f75afcb6-6 hGGBnM" data-testid="metadata-container"><span><time datetime="2019-12-09T15:29:45+0100">maandag 9 december 2019, 15:29</time></span></div>,
 <div class="sc-d6d7be46-0 jjSfnY sc-f75afcb6-6 hGGBnM" data-testid="metadata-container"><span><time datetime="2019-11-10T22:19:48+0100">zondag 10 november 2019, 22:19</time></span></div>,
 <div class="sc-d6d7be46-0 jjSfnY sc-f75afcb6-6 hGGBnM" data-testid="metadata-container"><span><time datetime="2018-08-25T15:56:43+0200">zaterdag 25 augustus 2018, 15:56</time></span></div>,
 <div class="sc-d6d7be46-0 jjSfnY sc-f75afcb6-6 hGGBnM" data-testid="metadata-container"><span><time datetime="2018-06-03T21:50:25+0200">zondag 3 juni 2018, 21:50</time></span><span>•</span><span>Nieuwsuur</span></div>,
 <div class="sc-d6d7be46-0 jjSfnY sc-f75afcb6-6 hGGBnM" data-testid="metadata-container"><span><time datetime="2018-02-21T11:36:49+0100">woensdag 21 februari 2018, 11:36</time></span></div>,
 <div class="sc-d6d7

In [20]:
titles = parser_content.find_all("h2",{"class":"sc-f75afcb6-3 lhteiV"})
s = str(titles[0])

In [21]:
s

'<h2 class="sc-f75afcb6-3 lhteiV">Veganistische moeder niet langer geweerd van Moedermelkbank</h2>'

In [22]:
import re
# None of the characters in this pattern need a backslash, so the escapes are dropped
re.search(r'<h2 class="sc-f75afcb6-3 lhteiV">(.*?)</h2>', s).group(1)

'Veganistische moeder niet langer geweerd van Moedermelkbank'
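
The pattern in the cell above depends on the exact class name, which may change when the website is updated. As a hedged alternative, the sketch below matches any `<h2>` element regardless of its attributes and returns `None` when no title is found. The function name `extract_h2_text` is our own.

```python
import html
import re

def extract_h2_text(fragment):
    """Return the text of the first <h2> element in an HTML fragment, or None."""
    match = re.search(r"<h2[^>]*>(.*?)</h2>", fragment, flags=re.DOTALL)
    if match is None:
        return None
    # Drop nested tags, then decode entities such as &amp;
    text = re.sub(r"<[^>]+>", "", match.group(1))
    return html.unescape(text).strip()

s = '<h2 class="sc-f75afcb6-3 lhteiV">Veganistische moeder niet langer geweerd van Moedermelkbank</h2>'
print(extract_h2_text(s))
```

Regular expressions are fine for small, regular fragments like this one. For whole pages, a real HTML parser is usually the safer choice.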

In [33]:
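import re
import requests
from bs4 import BeautifulSoup

def get_nos_content_and_section(url):
    '''
    Return the section name (for example 'Buitenland') of the NOS article at url
    '''
    soup = BeautifulSoup(requests.get(url).text, "html5lib")
    # find_all filters on the class attribute; select() expects a CSS selector instead
    section_container = soup.find_all("p", {"class": "sc-f9df6382-7 cMuisv"}) # NOS section class
    section = re.search(r'cMuisv">(.*?)</p>', str(section_container)).group(1)
    return section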

In [34]:
get_nos_content_and_section('https://nos.nl/artikel/2455639-groenlandse-vrouwen-kregen-ook-na-1991-nog-ongewilde-anticonceptie')

'Buitenland'

## 2. Collecting results

If you inspect the search results, you can see that they have different properties such as *title*, *time*, and *category*. It would be possible to filter our results based on these properties. For the moment, we only want to collect the links to the articles in a list.
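
Filtering on the *time* property could look like the sketch below. It parses the `datetime` attribute of the `<time>` elements and keeps only the results from a given year. The three example pairs are copied from the cell outputs in this notebook; pairing each link with a timestamp by position is our assumption, and the function names are hypothetical.

```python
from datetime import datetime

# (href, datetime attribute) pairs; the pairing by position is an assumption
results = [
    ("/artikel/2314011-veganistische-moeder-niet-langer-geweerd-van-moedermelkbank",
     "2019-12-09T15:29:45+0100"),
    ("/artikel/2309909-ellen-degeneres-krijgt-golden-globe-oeuvreprijs",
     "2019-11-10T22:19:48+0100"),
    ("/artikel/2247497-veganisten-protesteren-in-amsterdam-tegen-hamburgers-en-bontjassen",
     "2018-08-25T15:56:43+0200"),
]

def parse_nos_time(value):
    """Parse a datetime attribute such as '2019-12-09T15:29:45+0100'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")

def filter_by_year(pairs, year):
    """Keep only the (href, timestamp) pairs published in the given year."""
    return [(href, ts) for href, ts in pairs if parse_nos_time(ts).year == year]

for href, ts in filter_by_year(results, 2019):
    print(parse_nos_time(ts).date(), href)
```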

We iterate through all links, send a request and extract the text. Then we store the text in a file.

In [2]:
import os

domain = "https://nos.nl"  # no trailing slash: each href already starts with "/"
out_dir = "../results/nos_search_results/"
os.makedirs(out_dir, exist_ok=True)  # create the output directory if it does not exist

for i, link in enumerate(search_results):
    found_url = domain + link["href"]
    print(i, found_url)

    # Extract text and add the url as first line
    text = found_url + '\n' + url_to_string(found_url)

    # Save in file
    filename = keyword + "_" + str(i) + ".txt"
    with open(out_dir + filename, "w", encoding="utf-8") as f:
        f.write(text)

0 https://nos.nl/artikel/2314011-veganistische-moeder-niet-langer-geweerd-van-moedermelkbank
1 https://nos.nl/artikel/2309909-ellen-degeneres-krijgt-golden-globe-oeuvreprijs
2 https://nos.nl/artikel/2247497-veganisten-protesteren-in-amsterdam-tegen-hamburgers-en-bontjassen
3 https://nos.nl/nieuwsuur/artikel/2234918-is-rundvlees-eten-vervuilender-dan-autorijden
4 https://nos.nl/artikel/2218583-nepleer-is-nu-ineens-veganistisch
5 https://nos.nl/artikel/2208423-mcdonald-s-komt-met-mcvegan-over-15-jaar-zijn-alle-snacks-vega
6 https://nos.nl/nieuwsuur/video/2204560-veganisme-is-populair-maar-is-het-ook-gezond
7 https://nos.nl/nieuwsuur/artikel/2204524-veganisme-is-populair-maar-is-het-ook-gezond
8 https://nos.nl/op3/artikel/2128277-veganisme-is-booming-business-in-israel
9 https://nos.nl/artikel/2099574-discussierende-demonstranten-en-politie-botsen-in-parijs
10 https://nos.nl/op3/artikel/2090859-gaat-het-nieuwe-vlees-het-oude-vlees-verslaan


## 3. Inspecting data

Once you have collected your preliminary data, the most important pre-processing phase begins: **Data Analysis**.

Inspect the collected files and examine the following aspects: 
- Are the articles related to the topic that you had in mind when choosing the keyword? Or do you need to adjust the query? Many search engines let you combine several keywords in one query, for example with Boolean operators or, in some cases, regular expressions.
- Is the data structure of the results homogeneous?
- How useful is the extracted text? What kind of additional data pre-processing would be helpful? How could you do that? 
- Can you identify different thematic groups or perspectives in the results? Are the differences reflected in the language? 
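
A first pass over these questions can be partly automated. The sketch below assumes that the files were written as in section 2, with the article URL on the first line, and reports the word count and the most frequent words for each file. The function name `summarize_files` and the choice of statistics are our own; adapt them to your topic.

```python
from collections import Counter
from pathlib import Path
import re

def summarize_files(directory, top_n=5):
    """Return (filename, word_count, most_common_words) for each .txt file."""
    summary = []
    for path in sorted(Path(directory).glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        # The first line holds the article URL, so it is skipped
        body = " ".join(lines[1:]).lower()
        words = re.findall(r"\w+", body)
        summary.append((path.name, len(words), Counter(words).most_common(top_n)))
    return summary

for name, n_words, common in summarize_files("../results/nos_search_results/"):
    print(name, n_words, common)
```

Very short files or files dominated by navigation words are a hint that the extracted text needs further cleaning.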
 

**Try out other public APIs to collect texts from interesting sources. You can find some examples here: https://github.com/public-apis/public-apis. Experiment with data from different languages.** 

How do you need to adjust the code? 

Note: Many APIs require that you register and obtain an API key. This key is an identifier that you add to each query. It lets the provider track and limit the number of queries that you send, which helps to prevent abuse.
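
How the key is passed differs per API, so always check the documentation of the service. The sketch below only illustrates the general idea: the endpoint `https://api.example.com/v1/articles`, the parameter name `api_key`, and the environment variable `MY_API_KEY` are all hypothetical. The request is built but not sent.

```python
import os
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameter names, for illustration only
API_URL = "https://api.example.com/v1/articles"

def build_request(keyword, api_key):
    """Build (but do not send) a GET request that carries an API key."""
    query = urlencode({"q": keyword, "api_key": api_key})
    return Request(API_URL + "?" + query, headers={"Accept": "application/json"})

# Read the key from an environment variable so that it is not stored in the notebook
api_key = os.environ.get("MY_API_KEY", "demo-key")
request = build_request("veganisme", api_key)
print(request.get_method(), request.full_url)
```

Keeping the key out of the notebook avoids sharing it by accident when you hand in or publish your code.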