# Web data for dummies (in-class)

*The internet offers abundant possibilities to collect data for use in empirical research projects. This tutorial is a gentle introduction to using web scraping and APIs to collect such data in Python. Get inspired now!*

## Learning Objectives

After completion of this tutorial, students will be able to:

* Explain the differences between retrieving data from websites vs. APIs
* Retrieve web data in Python using the `requests` library, and store retrieved data in HTML or JSON/TXT files for further inspection.
* Use browser control tools ("inspect") to develop strategies for selecting and capturing information from websites (e.g., text, numbers, pictures, etc.)
* Select elements from websites using BeautifulSoup (e.g., class names, attribute or tag names)
* Select elements from JSON dictionaries obtained through APIs (attribute-value pairs)
* Apply programming concepts (e.g., loops, functions) to the collection of web data, and convert dictionaries to JSON files.
* Understand the difference between Jupyter Notebooks and “raw” Python files, and run collection via the command line/terminal

<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>

------


## 1. Web Scraping

### 1.1 What is web scraping?

Say that you want to capture and analyze data from a website. You could simply copy-paste the data from each page, but doing this by hand has severe limitations. What if the data on the page gets updated (i.e., would you have time available to copy-paste the new data, too)? Or what if there are simply so many pages that you can't possibly do it all by hand (e.g., thousands of product pages)? 

Web scraping can help you overcome these issues __by programmatically extracting data from the web__. Before we can extract/grab/capture/scrape information from a website, we need a bit of background on how websites work technically, so let's focus on that first.


### 1.2. How websites work

#### Importance

It's vital to take some time to get familiar with HTML, the primary markup language used to build websites. Once we're familiar with HTML (and the structure of websites), we can rapidly navigate complex websites to extract the information we're interested in (e.g., prices, names of product categories, ...). In other words: to reach our end goal, we do have to give you some technical details first.

So, here we go: A web page consists of various text files, each one with its style, formatting, and syntax. These files each serve a specific purpose:

- `.html` (HyperText Markup Language) files give structure to a page (e.g., where's the menu?, which content to show (e.g., text, tables)?)
- `.css` (Cascading Style Sheet) files determine how the page looks (e.g., which color do headers have? what's the font used for text in a paragraph?)
- `.js` (JavaScript) files add interactivity (e.g., button animations)

#### Let's try it out
Check out this simple [example](https://codepen.io/rcyou/pen/QEObEk/). The site shows the source code of a site (`.html`, `.css`, and `.js`) in an online editor, along with a rendered ("viewable") version of the site. Once you make changes to the code, the site gets automatically updated.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/codepen.png" align="left" width=60%/>


#### Exercise 1.1 
Just to get a feeling for how things work, let's make the following changes in the [CodePen snippet](https://codepen.io/rcyou/pen/QEObEk/): 
1. Change the text between the `<h1>` tags to `I am a purple of size 3em`. 
2. Change the `h1` font-size to `3em` and the color to purple (add `color: purple;` below `margin-bottom`).  
3. Remove the JavaScript code. What happens now when you click the blue button?


#### Solutions
Clicking the button should no longer trigger the script to hide the paragraph text.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/purple_headline.png" align="left" width=60%/>



### 1.3 Advancing HTML skills

#### Importance

Most HTML elements are represented by a pair of tags - an opening tag and a closing tag. 

For example, a table starts with `<table>` and ends with `</table>`. The first tag tells the browser: "Hey! I got a table here! Render it as a table, so it displays nicely on the site." The closing tag (note the forward-slash!) tells the browser: "Hey! I'm all done with that table, thanks." Inside the table are nested more HTML tags representing rows (`<tr>`) and cells (`<td>`).


```html
<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr> <!-- Header row -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
    </table>
</html>
```

This is what the rendered HTML table looks like:

<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
    </table>
</html>


HTML elements can have any number of __attributes__. Three common ones are:

- __IDs__, which *uniquely* identify elements

```html
<table id="example-table">
```

- __classes__, which identify a *type* of an element (contrary to ids, a class can be used more than once)

```html
<table class="striped-table">
```

- and __styles__, which define how specific elements *appear* (e.g. the width of the table)

```html
<table style="width:95%;">
```

As you may already have noticed, we use spaces (or tabs) to separate the elements from one another (the geeks among us will call this "indentation") to provide structure and improve readability. Yes, that's right. *Improve readability*.

Code may look complex to read at first, but when you take a closer look at it, it boils down to simple English, following a particular structure (also known as syntax). For example, the fact that the `<table>` tag is placed farther to the right than the `<html>` tag indicates that the table is nested within the HTML block.
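
To make the idea of nesting and attributes a bit more tangible, here is a minimal sketch in Python (the language used later in this tutorial). It relies on the `html.parser` module from Python's standard library to walk through a shortened copy of the example table and list each opening tag, indented by how deeply it is nested. The class name `OutlinePrinter` is made up for this illustration.

```python
from html.parser import HTMLParser

# shortened copy of the example table from above
html = """
<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr>
            <td>Column A</td>
            <td>Column B</td>
        </tr>
    </table>
</html>
"""

class OutlinePrinter(HTMLParser):
    """Collects one line per opening tag, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g., ('id', 'example-table')
        attributes = ', '.join(f'{name}={value}' for name, value in attrs)
        label = f'{tag} ({attributes})' if attributes else tag
        self.lines.append('  ' * self.depth + label)
        self.depth += 1  # everything until the closing tag is nested one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # a closing tag takes us one level back up

printer = OutlinePrinter()
printer.feed(html)
print('\n'.join(printer.lines))
```

The printed outline shows `table` one level below `html`, `tr` below `table`, and the two `td` cells below `tr`, with the table's `id`, `class`, and `style` attributes listed next to it.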

This may be a lot to take in if you're entirely new to HTML, but don't worry, as the goal of this section is not to teach you how to code from scratch but rather to teach you what HTML is and why it is relevant for web scraping.


__Let's try it out__

Double-click on the rendered table below to edit its HTML structure. Try to change some simple things, e.g., the text. Rerun the cell (Shift + Enter, or click the Run button in Jupyter Notebook). Watch your changes come alive!


#### Exercise 1.2

Please finish the exercises below. After each change, rerun the cell.

1. Add another row to the table below so that it becomes a 2 (columns) x 4 (rows) table. That is 3 regular rows and 1 table header row.
2. Fill the cells with the corresponding text labels (e.g., Row 3, Column A). 
3. Change the table width to `50%` so that the table becomes narrower.


<div class="alert alert-block alert-info"><b>Proceed in small steps!</b> 
Try not to make too many changes at once. Always proceed in small steps to see whether your code still works!
</div>


*Make your changes here:*

<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
    </table>
</html>

**Solutions**
<html>
    <table id="example-table" class="striped-table" style="width: 50%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
        <tr> <!-- Row 3 --->
            <td>Row 3, Column A</td>
            <td>Row 3, Column B</td>
        </tr>
    </table>
</html>

--- 
### 1.4 Finding content in a website's source code

#### Importance

Alright, we've now covered tables (`<table>`). However, there are more than a hundred different tags in HTML, and it's impossible to memorize all of them. That's why developers use a pretty handy tool to *inspect the source* of a website directly in the browser. From now onwards, we recommend that you use Chrome (in Safari and Firefox, things look slightly different, and we can't cover those, unfortunately).

Suppose you have identified an element you want to capture (e.g., a price or the name of a product). You can "ask" your browser for the specific HTML tag of that object (so it becomes easier to capture that element later). 

#### Let's try it out 


How does it work? Start by inspecting specific elements on the page by *right-clicking on the page* and selecting __"Inspect"__ from the context menu that pops up. Then, hover over elements in the "Elements" tab to highlight them on the page. This can be super helpful when you're trying to figure out how to (uniquely) identify the element you want to scrape.

Check out the HTML structure of this fictitious [online bookstore](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html). Each of the 1000 books has its own page, which shows the title, stock level, star rating, product description, and a table with other product information. Note that the prices and ratings are randomly generated, and therefore the figures on your screen may deviate from the ones below.


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/inspect.gif" align="left" width=60%/>

In the screenshot above, we've selected the book title ("A Light in the Attic"), right-clicked on it, and chosen "Inspect." The same text is highlighted in blue in the HTML code below. 

__Try it out yourself!__

The `<h1>` and `</h1>` tags surrounding the title indicate that this text is a header on the web page. Move your pointer down to the line below (`<p class="price_color">£51.77</p>`), and you'll see that on the top screen, it now highlights the price (rather than the title) of the book. This way, you can quickly investigate any webpage. 

__Also try this out...!__

As we discussed earlier, tags can be nested within other tags. This also becomes clear from the screenshot below, in which the small gray triangles (▶) indicate that there is code hidden within these blocks. Click on them to expand the code, see what's inside, and click again to collapse them.


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/html_structure.png" align="left" width=70%/>

#### Exercise 1.3
1. Use the inspect tool to find the HTML element that constitutes the table header "**Number of reviews**" at the bottom of [this page](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html).
2. Look up how many elements on the page are associated with the class `sub-header` (within the Inspector screen, use `Ctrl+F` on a PC or `⌘+F` on Mac to search)
3. You can make local (only on your computer) changes to the web page by double-clicking in the inspector and swapping the code for something else (yes, you can overwrite what's already written there!). Change the price of the book to £39.95 and assign it a five star-rating. 
4. After making the changes in 3.), refresh the page (reload it). What happens (and why)?


*A "faked" price and star-rating*

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/exercise_inspector.png" width=40% align="left"  style="border: 1px solid black"/>


In [None]:
# your answer goes here!

#### Solutions
1. The `<th>` (table header) tags enclose the text "Number of reviews." 
2. Three elements are associated with the class `sub-header` (product description, product information, reviews)
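3. The price can be changed by double-clicking it in the Inspector and typing `£39.95`. The star rating can be changed by setting the element's class attribute to `star-rating Five`.
4. Once you refresh the page, the original (unedited) price and star rating appear again, because your edits only changed the copy of the page in your browser, not the website itself.
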
---


### 1.5 Loading a website's source code into Python

#### Importance

Alright. Up to this moment, we've learned about HTML and fiddled around with a website's source code. What we ultimately want to understand, though, is how we can load a website's source code into Python.

Rather than (manually) using the Inspector, we now automate these tasks using Python's `requests` library. Libraries are "extensions" to Python, but most of them are not loaded by default. So let's import the library using `import requests`.

#### Let's try it out

Please (re)run the code cell below.

In [None]:
import requests

# make a get request to the "A Light in the Attic" webpage
url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
user_agent = {'User-agent': 'Mozilla/5.0'} # the user agent tells the web server which browser is making the request
book_request = requests.get(url, headers = user_agent)

# return the source code from the request object
book_source_code = book_request.text

# print out the source code to verify you have loaded the correct page
print(book_source_code)



<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    A Light in the Attic | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="
    It&#39;s hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein&#39;s humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and lov

<div class="alert alert-block alert-info"><b>Why are we using "user agents" when making website requests?</b>
    <br>

- The user agent tells the web server which browser and device are making the request, so that the server can decide which version of the website to return. For example, your smartphone's browser typically receives the mobile version of a website, whereas your laptop receives a version suited to larger screens.
    
- Changing the user agent is also your "first line of defense" against being blocked during data collection. Later, we will learn about other ways to prevent being blocked when automatically extracting information from the web. 
    
- Interested in the ethical aspects of retrieving publicly available web data? Check the relevant sections in ["Fields of Gold"](https://journals.sagepub.com/doi/abs/10.1177/00222429221100750?journalCode=jmxa).
 
</div>
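
To see where the user agent actually ends up, the sketch below prepares a request without sending it, so you can inspect what would go out to the server. It uses the `Request` class and its `prepare()` method from the `requests` library; the helper name `build_request` is made up for this illustration.

```python
import requests

def build_request(url, user_agent='Mozilla/5.0'):
    """Prepare (but do not send) a GET request with a custom User-Agent header."""
    request = requests.Request('GET', url, headers={'User-Agent': user_agent})
    return request.prepare()

prepared = build_request('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')

# nothing has been sent yet; we only look at what *would* be sent
print(prepared.method)
print(prepared.url)
print(prepared.headers['User-Agent'])
```

The user agent is simply one of the headers that travel along with the request. Calling `requests.get(url, headers=...)`, as in the code cell above, builds and sends such a request in one step.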


#### Exercise 1.4
1. Using the code snippet above, write a function (starting with `def download_website(url):`) that downloads the raw website data for [this product](https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html), printing it to the screen. Remember to use the same number of spaces or tabs ("indents") when writing your function!
2. Adapt the function (copy-paste first, relabel as `def save_website(url, filename):`), storing the website's raw code __in a file__ (that you can specify in the second parameter, `filename`). Recall that you can use previously learnt concepts, specifically `f=open(<filename>, 'w', encoding='utf-8')`, `f.write()`, and `f.close()`. Rerun the function on the URL above. Does it work?
3. Write a loop to store the raw HTML source code for the first four books [from this page](https://books.toscrape.com/catalogue/category/books_1/index.html). Before starting, create an array/list of *dictionaries* (`books`) with URLs and filenames to store the websites. Use the previously written function `save_website(url, filename)` for this exercise.

```
books = [{'url': 'first_url',
         'filename': 'filename1.html'},
        {'url': 'second_url',
         'filename': 'filename2.html'}]
```
         
         


In [None]:
# start with your code here

#### Solutions 

In [None]:
# Q1
import requests

def download_website(url):
    user_agent = {'User-agent': 'Mozilla/5.0'} # the user agent tells the web server which browser is making the request
    request = requests.get(url, headers = user_agent)
    source_code = request.text
    print(source_code)

download_website('https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html')



<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    Shakespeare&#39;s Sonnets | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="
    This book is an important and complete collection of the Sonnets of William Shakespeare. Most readers are aware of the great plays and manuscripts written for the stage, but are unaware of the magnificent Sonnets which were written around the same period. This is an excellent, complete collection of the Sonnets and poetry of William Shakespeare and should not be missed by

In [None]:
# Q2
import requests

def save_website(url, filename):
    user_agent = {'User-agent': 'Mozilla/5.0'} # the user agent tells the web server which browser is making the request
    request = requests.get(url, headers = user_agent)
    source_code = request.text
    f=open(filename, 'w', encoding='utf-8')
    f.write(source_code)
    f.close()
    print(f'Done retrieving {url} and saving as {filename}')

save_website('https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html', 'website.html')

Done retrieving https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html and saving as website.html


In [None]:
# Q3
books = [{'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
         'filename': 'filename1.html'},
        {'url': 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
         'filename': 'filename2.html'},
        {'url': 'https://books.toscrape.com/catalogue/soumission_998/index.html',
         'filename': 'filename3.html'},
        {'url': 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
         'filename': 'filename4.html'}
       ]

for book in books:
    save_website(book['url'], book['filename'])

Done retrieving https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html and saving as filename1.html
Done retrieving https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html and saving as filename2.html
Done retrieving https://books.toscrape.com/catalogue/soumission_998/index.html and saving as filename3.html
Done retrieving https://books.toscrape.com/catalogue/sharp-objects_997/index.html and saving as filename4.html
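
Typing out a filename for every URL gets tedious once the list grows. As a minimal sketch of an alternative, the list of dictionaries could be built in a loop, deriving each filename from the URL itself. The naming scheme (using the second-to-last part of the URL) is just one possible choice, not something prescribed by the exercise.

```python
urls = ['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
        'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
        'https://books.toscrape.com/catalogue/soumission_998/index.html',
        'https://books.toscrape.com/catalogue/sharp-objects_997/index.html']

books = []
for url in urls:
    # the part of the URL just before "index.html", e.g., "soumission_998"
    slug = url.split('/')[-2]
    books.append({'url': url, 'filename': f'{slug}.html'})

for book in books:
    print(book['filename'])
```

The resulting `books` list has the same structure as the one typed by hand above, so it can be passed to the same loop.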


---
### 1.6 Extracting information from a website's source code using `BeautifulSoup` 

#### Importance

It's useful to store raw data from websites (we will make use of this a lot). But how can we extract specific information from a website, such as a product's title or price? 

Fortunately, we can make use of the *structured* nature of HTML, by selecting information on the basis of:
 
- tag names (e.g., `<h1>`, `<table>`), or
- attributes such as IDs (e.g., `<table id="example-table">`), or class names (e.g., `<table class="striped-table">`)
    
For now, we'll show you how to apply these concepts using *BeautifulSoup*, a fantastic Python library that allows you to navigate and extract data from HTML files. BeautifulSoup does NOT gather information from the web itself (for this, we still use `requests`, as above). 
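
As a quick preview of these three ways of selecting, the sketch below applies them to a shortened copy of the example table from section 1.3. It assumes the `bs4` package is installed; `'html.parser'` is the HTML parser that ships with Python.

```python
from bs4 import BeautifulSoup

# shortened copy of the example table from section 1.3
html = """
<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr>
            <td>Column A</td>
            <td>Column B</td>
        </tr>
    </table>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

by_tag = soup.find('table')                   # select by tag name
by_id = soup.find(id='example-table')         # select by ID
by_class = soup.find(class_='striped-table')  # select by class name

# all three selections point at the same table
print(by_tag.name)
print(by_id['style'])
print(by_class['id'])
```

Which of the three to use depends on the page: IDs are the most precise when they exist, while class and tag names usually match several elements.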

#### Let's try it out
First, we import the package `BeautifulSoup` and turn `book_source_code` (the HTML code from the "A Light in the Attic" webpage we used earlier) into a BeautifulSoup object. Once converted, we can easily navigate the code by *tag names*, *attribute names*, or *class names*. This process is called __parsing__, and is one of the central tasks in web scraping.

Since we know that the title is surrounded by `<h1>` tags (see the Inspector screenshot above), we use `soup.find('h1')` to parse the title of the book.

Please run the following cells to see things in action!


In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(book_source_code, 'html.parser') # 'html.parser' is Python's built-in HTML parser
print(soup.find('h1'))

<h1>A Light in the Attic</h1>


The `.find()` method always returns the first matching element that it finds. For example, the web page has two `<h2>` elements which contain the "Product Description" and "Product Information" subheaders.  Only the first one will be returned by `.find()`:

In [None]:
print(soup.find('h2'))

<h2>Product Description</h2>


To capture __all__ matching `<h2>` elements, use the `find_all()` method like this:

In [None]:
print(soup.find_all('h2'))

[<h2>Product Description</h2>, <h2>Product Information</h2>]


Note that it now returns a list of elements (`[element1, element2]`), so to access individual elements you need to apply indexing (which starts with [0] for the first element, [1] for the second, and so on...).

In [None]:
# obtain first h2 element 
print(soup.find_all('h2')[0])

# obtain second h2 element
print(soup.find_all('h2')[1])

# we can also count the number of elements returned, using the len() function
print(len(soup.find_all('h2'))) # will return 2

<h2>Product Description</h2>
<h2>Product Information</h2>
2


Both subheaders are still surrounded by `<h2>` and `</h2>` tags. To get rid of them, append `.get_text()` to your code:

In [None]:
# sub header without h2 tags
print(soup.find_all('h2')[0].get_text())

Product Description


#### Exercise 1.5

1. Retrieve the website's source code, then parse and print the following information:
    - product title,
    - price,
    - in-stock availability, and
    - the number of stars.
    

<div class="alert alert-block alert-info"><b>Tips</b>
    <br>
<ul>
    <li>To extract information using class names, use the <code>class_</code> argument in the <code>find()</code> function.<br>
        <b>Example:</b> <code>soup.find(class_ = 'class_name_to_find')</code>
    </li>
    <li>You can also extract information by <b>counting</b> the number of matching elements. For example, <code>len(soup.find_all('h2'))</code> returns the number of <code>h2</code> elements on a site.
    </li>
    <li>
        Too much whitespace surrounding your parsed information? Use Python's <code>strip()</code> function, e.g., <code>'   too much whitespace    '.strip()</code>.
    </li>
</ul>
 
</div>

In [None]:
# your answer goes here!

#### Solutions

In [None]:
# title
print(soup.find('h1').get_text())
# price
print(soup.find(class_='price_color').get_text())
# in-stock availability
print(soup.find(class_='instock availability').get_text().strip())
# number of stars
print(len(soup.find_all(class_='icon-star')))

A Light in the Attic
Â£51.77
In stock (22 available)
5
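
You may have noticed the odd `Â` in front of the pound sign in the output above. One likely explanation (an assumption, not something verified in this tutorial) is that the page was sent as UTF-8 but decoded with a different character set. The sketch below reproduces that effect with plain Python strings, without making any web request.

```python
price = '£51.77'

raw_bytes = price.encode('utf-8')        # the text as bytes, the way a server could send it
wrong = raw_bytes.decode('latin-1')      # decoding with the wrong character set
right = raw_bytes.decode('utf-8')        # decoding with the matching character set

print(wrong)
print(right)
```

If this is indeed the cause, setting the `encoding` attribute of the response (e.g., `book_request.encoding = 'utf-8'`) before reading `book_request.text` would be one way to obtain the clean `£` sign.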


### 1.7 Writing your complete first web scraper

__Exercise 1.6__

Now it's time to put in action everything we have learnt so far.

- Use the list of four URLs specified above to start your data collection, 
- Write a loop that stores the raw website data in separate HTML files (storing raw data for diagnostic purposes is very helpful!), 
- Store the extracted information (title, price, in-stock availability, and number of stars) in a dictionary, and write each dictionary as one line of a newline-separated JSON file.
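
In case "newline-separated JSON" is new to you: it means that every line of the file holds one complete JSON object. The sketch below illustrates the format with two made-up records (the titles and values are hypothetical) and shows how such text can be turned back into dictionaries.

```python
import json

# two made-up records, only used to illustrate the format
records = [{'title': 'Book A', 'stars': 3},
           {'title': 'Book B', 'stars': 5}]

# one JSON object per line
lines = [json.dumps(record) for record in records]
text = '\n'.join(lines) + '\n'
print(text)

# reading it back: parse every non-empty line separately
restored = [json.loads(line) for line in text.splitlines() if line.strip()]
print(restored == records)
```

The advantage of this format is that you can keep appending new lines during a long data collection, and a single broken line does not make the rest of the file unreadable.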

<div class="alert alert-block alert-info"><b>Tips</b>
    <br>
    <ul>
        <li>
    Copy-paste and then modify the <code>save_website()</code> function from above to handle the two tasks of storing AND extracting information ("parsing"). </li>
            <li>Use a <code>return</code> statement in the function to return the parsed data as a dictionary. </li>
        <li>
            You can import the json package (<code>import json</code>) and use the <code>json.dumps()</code> function to convert the dictionary to writable output data.
        </li>
    </ul>
 
</div>


In [None]:
# write your code here

__Solution__

In [None]:
import requests
from bs4 import BeautifulSoup
import json

books = [{'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
         'filename': 'filename1.html'},
        {'url': 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
         'filename': 'filename2.html'},
        {'url': 'https://books.toscrape.com/catalogue/soumission_998/index.html',
         'filename': 'filename3.html'},
        {'url': 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
         'filename': 'filename4.html'}
       ]

def parse_website(url, filename):
    user_agent = {'User-agent': 'Mozilla/5.0'} # the user agent tells the web server which browser is making the request
    request = requests.get(url, headers = user_agent)
    source_code = request.text
    f=open(filename, 'w', encoding='utf-8')
    f.write(source_code)
    f.close()
    
    # parse information
    soup = BeautifulSoup(source_code, 'html.parser')
    
    # title, price, availability, and number of stars
    title = soup.find('h1').get_text()
    price = soup.find(class_='price_color').get_text()
    instock = soup.find(class_='instock availability').get_text().strip()
    stars = len(soup.find_all(class_='icon-star'))

    data = {'title': title,
            'price': price,
            'instock': instock,
            'stars': stars}
    
    print(f'Done retrieving {url} and saving as {filename}')
    
    return data


f=open('book_data.json', 'w', encoding='utf-8')
for book in books:
    data = parse_website(book['url'], book['filename'])
    f.write(json.dumps(data))
    f.write('\n') # new line to separate objects
f.close()
    

Done retrieving https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html and saving as filename1.html
Done retrieving https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html and saving as filename2.html
Done retrieving https://books.toscrape.com/catalogue/soumission_998/index.html and saving as filename3.html
Done retrieving https://books.toscrape.com/catalogue/sharp-objects_997/index.html and saving as filename4.html
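
A word of caution on the `stars` value: the scraper counts all elements with the class `icon-star`. If a page draws the same number of star icons for every book and encodes the actual rating elsewhere (for example in a class name such as `star-rating Five`, which you edited in Exercise 1.3), the count will be identical for all books. The sketch below shows an alternative on a simplified, hand-written snippet. The markup is an assumption about how such a page could look, so please verify it with the Inspector before relying on it.

```python
from bs4 import BeautifulSoup

# simplified, hand-written snippet (an assumption about the page's markup)
html = """
<p class="star-rating Three">
    <i class="icon-star"></i><i class="icon-star"></i><i class="icon-star"></i>
    <i class="icon-star"></i><i class="icon-star"></i>
</p>
"""

soup = BeautifulSoup(html, 'html.parser')

# counting icons: gives 5 here, regardless of the rating
icon_count = len(soup.find_all(class_='icon-star'))

# reading the rating from the class attribute instead
word_to_number = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
classes = soup.find(class_='star-rating')['class']  # a list of class names
rating_word = [name for name in classes if name in word_to_number][0]
stars = word_to_number[rating_word]

print(icon_count)
print(stars)
```

In this snippet, the icon count and the rating differ, which is exactly the situation to watch out for when you inspect the raw HTML files you stored.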


### 1.8 Wrapping up

Congrats! You've just learned the first steps in collecting online data from websites! Along with boosting your "geek"-factor (wait till you show this to your friends!), you've gained an intuition on how websites are built up (HTML, CSS, JS), how source code translates into a rendered (visual) website (or, in other words, you know how to spoof websites - now, take screenshots and show *that* to your friends...), how websites can be loaded into Python, and how you can use `BeautifulSoup` to extract information using tag or attribute names and classes. Good job!

--- 

## 2. Application Programming Interface (API)


### 2.1 What is an API?

An equally important way to collect data is through an Application Programming Interface (API). That's a mouthful, but in essence, an API is nothing more than a version of a _website intended for computers, rather than humans, to talk to one another_. 

APIs are everywhere, and most are used to provide...
- data (e.g., retrieve a user name and demographics), 
- functions (e.g., start playing music from Spotify, turn on your lamps in your "smart home"), or 
- algorithms (e.g., submit an image, retrieve a written text for what's *on* the image).

In what follows, we'll introduce you to the API of [Reddit](https://www.reddit.com), a popular American social news aggregation and discussion site. Reddit gives you an up-to-date view of what's happening around the world, all sorted through a voting system ("upvotes") by 1bn.+ users!


<div class="alert alert-block alert-info"><b>Tip:</b><br><br>
<ul>
    <li>Think of Reddit upvotes as a like on Instagram. Posts are arranged based on the number of votes, and those with many upvotes are featured on the homepage. The grey number next to each post represents the sum of votes (= upvotes - downvotes).
    </li>
    <li>
        Each API relies on slightly different ways to access data. The Reddit API is particularly easy, as it only requires you to append <code>.json</code> to a URL from their website. Most other APIs likely have different ways to map content from their websites to API endpoints. Check the API documentation of your data source for information on this!
    </li>
    </ul>
</div>
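
The "append `.json`" rule from the tip above can be sketched in a few lines of Python. The function name `to_api_url` is made up for this illustration, and the rule is specific to Reddit as described in this tutorial; other APIs will need a different mapping.

```python
def to_api_url(page_url):
    """Turn the URL of a Reddit page into the corresponding API URL."""
    # remove a trailing slash (if any) before appending .json
    return page_url.rstrip('/') + '.json'

page_url = 'https://www.reddit.com/r/science/comments/k0bjqt/study_finds_users_not_notifications_initiate_89'
print(to_api_url(page_url))
```

Keeping this mapping in a small function makes it easy to convert a whole list of page URLs in a loop later on.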
    
<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/reddit_homepage.png" width=60% align="left"  style="border: 1px solid black"/>



### 2.2 How APIs work

__Importance__

APIs work very similarly to websites. At the core, instead of obtaining the source code of a (rendered) website, you obtain output that computers can easily process. APIs provide you with simpler and more scalable ways to obtain data, so you really have to understand how they work.

__Let's try it out__

Consider the screenshot above (a view of the Reddit website). Here's an example of the output of the Reddit API (click on the link to view it in your browser):

https://www.reddit.com/r/science/comments/k0bjqt/study_finds_users_not_notifications_initiate_89.json

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/api_example.png" width=60% align="left"  style="border: 1px solid black"/>

*A few things stand out right away:*

- the output only contains text, which is structured according to a data structure (e.g., array or list (`[]`) and dictionary (`{}`)), 
- there's no human interface with buttons, menus, and links, yet...
- you can access it like any other website by filling out the URL in your browser (`reddit.com/r/science/...` in this example).

In fact, the API output above corresponds to the (visual) Reddit thread, which you can open here:

https://www.reddit.com/r/science/comments/k0bjqt/study_finds_users_not_notifications_initiate_89

For example, look at the third and fourth lines of the API output, which state the `title` of the post that you can also see below on the rendered website.  

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/reddit.png" width=60% align="left"  style="border: 1px solid black"/>



<div class="alert alert-block alert-info"><b>Tips:</b><br>
<ul>
    <li>If you have taken a look at the API output, you may conclude that making sense of raw JavaScript Object Notation (JSON) is easier said than done. Fortunately, this <a href='https://chrome.google.com/webstore/detail/json-viewer/gbmdgpbipfallnflgajpaliibnhdgobh'>plugin</a> automatically formats and highlights the output such that it's easier to digest. If your browser does not automatically display JSON data in a "nice" way, we recommend installing the Chrome plugin. 
    </li>
    <li>
After installation, view the output again. That's much better, right?
</li>
    </ul>
</div>
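
The lists and dictionaries you see in the browser map directly onto Python lists and dictionaries. Here is a minimal sketch that parses a small, made-up piece of JSON text with Python's built-in `json` module. The sample text and its keys (`kind`, `data`, `title`, `ups`) are illustrative assumptions that only loosely mimic the Reddit output; they are not copied from the API.

```python
import json

# A made-up JSON text: a list ([]) holding one dictionary ({})
raw_text = '[{"kind": "post", "data": {"title": "An example title", "ups": 42}}]'

# json.loads() converts the text into Python lists and dictionaries
parsed = json.loads(raw_text)

print(type(parsed))                 # the outer structure is a list
print(type(parsed[0]))              # its first element is a dictionary
print(parsed[0]['data']['title'])   # select values step by step
```

Selecting values works exactly like it does for any other Python list (by position) or dictionary (by key).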





#### Exercise 2.1

Navigate through the JSON tree structure of the post above and answer the following questions:

1. At the parent level you find two dictionaries at lines 5 and 197 (i.e., the blue arrows). Collapse the content and describe in your own words what each dictionary represents. How does it relate to the Reddit HTML page? 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/webdata-for-dummies/images/reddit_api.gif" width=70% align="left"  style="border: 1px solid black"/>

2. The first comment is from the post author (fotogneric) and has gathered the most points. How many downvotes did this comment get (you find the answer in the JSON output)? 

3. Suppose that you want to extract the date and time each comment was created. What path do you need to navigate? 

<div class="alert alert-block alert-info"><b>Tips:</b><br>
   <ul> <li>
Times are often stored as Unix timestamps (also known as Epoch time), a globally interchangeable time representation. More specifically, a Unix timestamp is the number of seconds elapsed since January 1, 1970 (UTC), so it can be used as a universal time scale around the world. 
    </li>
    <li>
    Copy-paste the timestamp into an <a href='https://epochconverter.com' target='_blank'>online epoch converter</a> and check whether it corresponds with the date and time on the webpage.
    </li>
    </ul>
</div>
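
The same conversion can be done in Python. This is a minimal sketch using the built-in `datetime` module; the timestamp below is a hypothetical value (exactly one day after the start of the epoch), chosen only because the result is easy to check by hand.

```python
from datetime import datetime, timezone

# Hypothetical timestamp, chosen only for illustration: one day after the epoch
example_timestamp = 86400

# Convert seconds since 1 January 1970 (UTC) into a readable date and time
readable = datetime.fromtimestamp(example_timestamp, tz=timezone.utc)
print(readable.isoformat())  # 1970-01-02T00:00:00+00:00

# And back again: from a datetime object to a Unix timestamp
print(int(readable.timestamp()))  # 86400
```

Passing `tz=timezone.utc` makes sure the result is expressed in UTC rather than in the local time zone of your computer.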






**Your answer**  

...

#### Solutions
1. The dictionary that starts at line 5 contains data on the post (title, subreddit, upvote ratio, thumbnail/image, link to article). The other dictionary stores the comments of the post (author, body text, timestamp). 
2. At the moment of writing this solution (December 2020) the post has 0 downvotes (`'downs': 0`).
3. The `created` key stores a large number (e.g., 1606274053) that can be translated into a date and time (for this example: 25 November 2020 03:14 GMT). The corresponding path for the timestamp of the first comment is: `request[1]['data']['children'][0]['data']['created']` (a written description that follows these directions also suffices: first, you take the 2nd element `[1]` in the list, then you choose the `data` key, etc.).
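
The path from solution 3 can be tried out without calling the API. The sketch below builds a tiny stand-in for the API output (named `request`, as in the solution) that mirrors only the nesting and contains just the one value quoted above; the real output contains many more keys.

```python
from datetime import datetime, timezone

# Tiny stand-in for the parsed API output: a list with two dictionaries
# (element 0: the post, element 1: the comments). Only the nesting is
# mirrored here; all other keys of the real output are left out.
request = [
    {'data': {'children': []}},
    {'data': {'children': [{'data': {'created': 1606274053}}]}},
]

# Follow the path one step at a time
comments = request[1]['data']['children']   # list of comments
first_comment = comments[0]['data']         # dictionary of the first comment
created = first_comment['created']          # Unix timestamp

print(datetime.fromtimestamp(created, tz=timezone.utc))  # 2020-11-25 03:14:13+00:00
```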


### 2.2 Inspect data before collection

__Importance__

Before we proceed to *downloading* data from an API, it is useful to first inspect the corresponding website (if it exists) to get an understanding of which data are available.

Here, we zoom in on "subreddits" (which are easier to understand than the comments considered earlier). Subreddits are niche communities centered around a particular topic. Users can post nearly anything in these subreddits, and you'd be surprised to find out what people are talking about. For example, see below for a screenshot of the [subreddit on Science](https://www.reddit.com/r/Science).

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/apis101/images/reddit_science.png" width=60% align="left"  style="border: 1px solid black"/>


__Let's try it out__

Subreddits all start with `reddit.com/r/...`. Here are a few examples: [askreddit](https://www.reddit.com/r/AskReddit), [aww](https://www.reddit.com/r/aww/), [gifs](https://www.reddit.com/r/gifs/), [showerthoughts](https://www.reddit.com/r/Showerthoughts), [lifehacks](https://www.reddit.com/r/lifehacks), [getmotivated](https://www.reddit.com/r/GetMotivated), [moviedetails](https://www.reddit.com/r/MovieDetails), [todayilearned](https://www.reddit.com/r/todayilearned/), [foodporn](https://www.reddit.com/r/FoodPorn/).

Take your time to browse through some of the subreddits, and get familiar with the structure of the pages.

#### Exercise 2.2
Consult the [`marketing`](https://www.reddit.com/r/marketing/hot/) subreddit and answer the following questions: 
1. For your thesis, you need to collect survey responses. Are you allowed to share a link to your survey in this subreddit? Please explain how you came to this conclusion. 
2. You post a link (and wonder how many users will potentially be able to see your post). How many users are subscribed to the subreddit? How many users are currently online?
3. Like other social media platforms, you can navigate towards Reddit's user profiles and learn more about these persons. Inspect the profile of a user that has posted in the Subreddit. Describe what types of information you can gather from this user. How is the feed organized?

In [None]:
# your answer goes here!

#### Solutions
1. No, the subreddit rules prohibit users from posting surveys and homework assignments (see the right sidebar).
2. `r/marketing` has about 370k members, and (at the time of writing this tutorial) about 160 of them were online.
3. On a user page, you find the bio, trophies, communities the user moderates, connected accounts, and most importantly: all of the user's posts and comments.



### 2.3 Retrieving data from the Reddit.com API

**Importance**  

Many APIs are provided via paid subscriptions, but parts of the Reddit.com API are free to use. To request data from the Reddit API, we need to include `headers` in our HTTP request. Like in web scraping (remember the user agent?), headers contain *meta-data* that are required for the API call to work (e.g., type of browser, language, expected data format, etc.). 

**Let's try it out**  

Below, we request the `about` page of the `marketing` subreddit and include such headers in the request. We make our first request to the Reddit API here and parse the output in the upcoming exercise!


In [None]:
import requests
url = 'https://www.reddit.com/r/marketing/about/.json'
headers = {'authority': 'www.reddit.com', 'cache-control': 'max-age=10', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 'sec-fetch-site': 'same-origin', 'sec-fetch-mode': 'navigate', 'sec-fetch-user': '?1', 'sec-fetch-dest': 'document', 'accept-language': 'en-GB,en;q=0.9'}
response = requests.get(url, headers=headers)
json_response = response.json()


#### Exercise 2.3
1. First, take a look at the `json_response` object. Then, leave out the `headers` parameter in your request (so it becomes `requests.get(url)` instead), rerun the cell, and inspect the `json_response` another time. Are there any differences? 
2. Write a while-loop that prints the number of currently active users of the `marketing` subreddit. Have your code pause for 5 seconds before each refresh. Stop the loop after 3 iterations. For pausing, use the function `time.sleep(5)`. Import the time package using `import time`.

```
import time
i = 0
while i < 3:
    print('Starting to collect data, iteration', i+1)
    #### YOUR API COLLECTION CODE HERE
    print('   waiting 5 seconds...')
    time.sleep(5)
    i = i + 1
```

3. Convert your code from the previous exercise into a function `get_usercount()` that takes a `subreddit` as input and returns the total number of users, and the number of currently active users as a dictionary. Test your function for the `science`, `skateboarding`, and `marketing` subreddits. How many total and currently active users do these communities have?


In [None]:
# your answer goes here!


#### Solutions
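1. Without the `headers` parameter, the API returns an error code (429, "Too Many Requests"). Headers are frequently used to track who is using the API. Requests sent without custom headers all carry the same default user agent of the `requests` library, and this "anonymous" user has pushed the boundaries too much!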
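
A common reaction to a 429 response is to wait and try again. The sketch below illustrates that idea without touching the network: `fetch_with_retry()` and `fake_fetch()` are hypothetical helpers written for this example (they are not part of `requests` or the Reddit API), and the fake function merely imitates an API that rejects the first two calls.

```python
import time

def fetch_with_retry(fetch, max_attempts=3, wait_seconds=1.0):
    """Call fetch() again while it reports status 429, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        status_code, payload = fetch()
        if status_code != 429:
            return status_code, payload
        if attempt < max_attempts:
            # wait a little longer after every rejected attempt
            time.sleep(wait_seconds * attempt)
    return status_code, payload

# Hypothetical stand-in for a real request: rejected twice, then accepted
calls = {'count': 0}

def fake_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        return 429, None
    return 200, {'data': {'active_user_count': 200}}

status, payload = fetch_with_retry(fake_fetch, wait_seconds=0)
print(status, calls['count'])  # 200 3
```

In the notebook, `fetch` would be a small function that wraps the `requests.get()` call and returns the status code together with the parsed data.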


In [None]:
# Question 2 
import time
import requests

# the URL and headers stay the same, so define them once outside the loop
url = 'https://www.reddit.com/r/marketing/about/.json'
headers = {'authority': 'www.reddit.com', 'cache-control': 'max-age=10', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 'sec-fetch-site': 'same-origin', 'sec-fetch-mode': 'navigate', 'sec-fetch-user': '?1', 'sec-fetch-dest': 'document', 'accept-language': 'en-GB,en;q=0.9'}

i = 1
while i <= 3:
    # request the data again in every iteration to get the latest count
    response = requests.get(url, headers=headers)
    json_response = response.json()
    
    print(json_response['data']['active_user_count'])
    i += 1
    time.sleep(5)  # pause for 5 seconds before the next request

In [None]:
# Question 3
def get_usercount(subreddit):
    # uses the `headers` dictionary defined in the cells above
    response = requests.get(f'https://www.reddit.com/r/{subreddit}/about/.json', headers=headers)
    json_response = response.json()
    # keep only the fields we are interested in
    out = {}
    out['subreddit'] = subreddit
    out['total_users'] = json_response['data']['subscribers']
    out['active_users'] = json_response['data']['active_user_count']
    return out
    
get_usercount('science')

{'subreddit': 'science', 'total_users': 27867629, 'active_users': 6752}

In [None]:
get_usercount('skateboarding')

{'subreddit': 'skateboarding', 'total_users': 451981, 'active_users': 174}

In [None]:
get_usercount('marketing')


{'subreddit': 'marketing', 'total_users': 420332, 'active_users': 200}

### 2.4 Retrieving data for multiple subreddits

__Importance__

Remember the website scraper above? At the end of the scraping exercises, we iterated through a list of books to retrieve data. Here, we do the same for the API `about` endpoint. Iterating through an API endpoint for multiple "seeds" (or sampling units) is at the core of each data extraction task. 

__Try it out__

Run the following cell to see how looping through a set of subreddits works. Do you see similarities to the scraping example introduced earlier? Exactly: the concept is entirely the same.


In [None]:
subreddits = ['science', 'skateboarding', 'marketing']

for sub in subreddits:
    print(get_usercount(sub))

{'subreddit': 'science', 'total_users': 27867638, 'active_users': 6035}
{'subreddit': 'skateboarding', 'total_users': 451982, 'active_users': 161}
{'subreddit': 'marketing', 'total_users': 420334, 'active_users': 222}


#### Exercise 2.4

1. Extend the list of subreddits to at least 10 (by browsing the site and taking some subreddits you're interested in)
2. Write a function to return the current timestamp in UNIX/Epoch time and readable time. Append the timestamp information to each retrieved JSON object. Storing the timestamp of retrieval, along with the actual data, will help you to later match the data to other datasets across time.
```
# retrieving current timestamp (from https://www.geeksforgeeks.org/get-current-timestamp-using-python/)
import datetime
# ct stores current time
ct = datetime.datetime.now()
print("current time:", ct)
# ts store timestamp of current time
ts = ct.timestamp()
print("timestamp:", int(ts))
```
3. Write a `for` loop that iterates through your subreddits, and store all of the retrieved data in a file called `subreddits.json`. When rerunning the code, the existing data must not be overwritten! (So, try running the code a couple of times.)

In [None]:
# your solutions here

#### Solutions

In [None]:
# Q1 (added just a few more here...)
subreddits = ['science', 'skateboarding', 'marketing', 'music', 'beach']

# Q2
import datetime

def get_time():
    # ct stores current time
    ct = datetime.datetime.now()
    # ts store timestamp of current time
    ts = ct.timestamp()
    return({'current_time': ct,
           'timestamp': ts})

get_time()
get_time()['timestamp'] # e.g., for only the UNIX timestamp

# Q3

import json

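# open in append mode ('a') so that earlier data is kept when rerunning the code;
# the file is closed automatically at the end of the `with` block
with open('subreddits.json', 'a', encoding='utf-8') as f:
    for sub in subreddits:
        data = get_usercount(sub)
        data['retrieval_timestamp'] = int(get_time()['timestamp'])
        # write one JSON object per line
        f.write(json.dumps(data))
        f.write('\n')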
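
Because the file is opened in append mode (`'a'`) and every record is written on its own line, each run adds rows instead of replacing them. The sketch below shows this on a temporary file with made-up records (the numbers are placeholders, not retrieved data) and reads the rows back with the built-in `json` module. The helper `append_record()` and the file name `subreddits_demo.json` are hypothetical and only used for this illustration.

```python
import json
import os
import tempfile

# Write to a temporary folder so that the real subreddits.json stays untouched
path = os.path.join(tempfile.mkdtemp(), 'subreddits_demo.json')

def append_record(record):
    # 'a' = append: existing lines are kept, the new line is added at the end
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record) + '\n')

# Two "runs" with placeholder values
append_record({'subreddit': 'science', 'active_users': 1, 'retrieval_timestamp': 100})
append_record({'subreddit': 'science', 'active_users': 2, 'retrieval_timestamp': 200})

# Read the file back: one JSON object per line
with open(path, encoding='utf-8') as f:
    records = [json.loads(line) for line in f]

print(len(records))  # 2
```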

### 2.5 Wrap up: APIs versus web scrapers

Now that you understand what APIs are, you may rightfully wonder: why should I learn APIs when I could scrape the elements from the website instead (like the book webshop)?

- One of the major advantages of APIs is that you can directly access the data you need *without all the hassle of selecting the right HTML tags*. 

- Another advantage is that you can often customize your API request (e.g., the first 100 comments or only posts about science), which may not always be possible in the web interface. 

- Using an API is an officially supported way to get access to website data (although you often have to pay a license fee to use it!). This makes it a more stable and legitimate way to retrieve web data than web scraping, which is why we recommend using an API whenever possible. 

- In practice, though, APIs often cannot give you all the data you want, and web scraping allows you to access complementary data (e.g., viewable on a website or somewhere hidden in the source code).
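
Customizing a request, as mentioned in the second point, usually means adding parameters to the URL. The sketch below only builds such a URL and does not send it. The endpoint and the `limit` parameter are assumptions for illustration; check the documentation of the API you use for the parameters it actually supports.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint and parameter, for illustration only
base_url = 'https://www.reddit.com/r/science/hot.json'
params = {'limit': 100}

# urlencode() turns the dictionary into a query string
url = base_url + '?' + urlencode(params)
print(url)  # https://www.reddit.com/r/science/hot.json?limit=100

# The parameters can be recovered from the URL again
print(parse_qs(urlsplit(url).query))  # {'limit': ['100']}
```

With the `requests` library, the same dictionary can be passed via the `params` argument of `requests.get()`, which builds the query string for you.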

More commonalities and differences are shown in the Web Appendix of ["Fields of Gold"](https://doi.org/10.1177%2F00222429221100750), Table W1.

Happy scraping!


## After-class exercises


### Exercise 1

Up to this moment, we have only parsed some attributes from the book pages. In this exercise, please try to parse all relevant data from the book pages, e.g.

- product description
- UPC
- number of reviews

Use the code written for exercise 1.6 in web data for dummies as a starter.

Then, add the timestamp of data collection (as used in the API part of the web data for dummies tutorial) to the generated dataset.


### Exercise 2

So far, we have used Jupyter Notebook to execute our code. But, what if you wanted to *schedule and automatically run* your data extraction (e.g., even when you are asleep)?

1. Copy your code written in (3) of exercise 2.4 in web data for dummies to a `.py` file, and execute it from the terminal (`python myscript.py`). 

2. Work through the scheduling tutorial on [Tilburg Science Hub](https://tilburgsciencehub.com/schedule/task/?utm_campaign=referral-short).

3. Combine 1 & 2 to automatically schedule the extraction of the API data every 10 minutes for a duration of 2 hours.

4. Open the downloaded `.json` data using the `pandas` package and provide some summary statistics:
    
    - number of unique subreddits
    - number of times each subreddit was scraped
    - start and end timestamp of the scraper
    - average active users per subreddit

```
# snippet to load the data into Python
import pandas as pd
pd.read_json('subreddits.json', lines = True)
```