# Web Scraping and APIs

In this notebook, we learn how to scrape data from the Web and get an idea of what Application Programming Interfaces (APIs) are.

## Web Scraping

**Web Scraping** is a technique for the extraction of information from websites by transforming unstructured data (HTML pages) into structured data (databases or spreadsheets). 

Although scraping can be performed manually by a user, it is usually automated by means of a **web crawler**. For larger-scale scraping see, e.g., [Scrapy](https://scrapy.org).

The process is an alternative to using already available **API**s (Application Programming Interfaces), such as those provided by the major platforms, like *Facebook*, *Google* and *Twitter*. **More below.**

### Basics of HTML

The **HyperText Markup Language (HTML)** is the standard **descriptive markup** language for web pages.


- **Markup** language: a human-readable, explicit system for annotating the content of a document. Markdown is another markup language.


- **Descriptive** markup languages (e.g. HTML, XML) are used to annotate the structure or the contents of a document, as opposed to **procedural** markup languages (e.g. TeX, PostScript), whose main goal is to describe how a document should be processed.

HTML provides a means to annotate the <strong>structural</strong> elements of documents, such as (different kinds of) headings, paragraphs, lists, links, images, quotes, tables and so forth. Markdown does the same, although with fewer options (we are <em>using</em> it *here*: check the code!).

HTML tags **do not mark the logical structure** of a document, but only its format (e.g. *this is a table*, *this is a h3-type heading*...). It is up to the browser to then use HTML (plus other information, such as *Cascading Style Sheets*), to render a webpage appropriately.

HTML markup relies on a **fixed inventory of tags**, written by using angle brackets. Some tags, e.g. `<p>...</p>`, surround the marked text, and may include subelements. Other tags, e.g. `<br>` or `<img>` introduce content directly.

The following is an example of a web page:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>The Adventures of Pinocchio</title>
  </head>
  <body>
    <h2>Carlo Collodi</h2>
    <h1>The Adventures of Pinocchio</h1>
    <hr>
    <h4>CHAPTER 1</h4>
    <br>
    <p><i>How it happened that Mastro Cherry, carpenter, found a piece of wood that wept and laughed like a child</i></p>
    <br>
    <p>Centuries ago there lived--</p>
    <p>"A king!" my little readers will say immediately.</p>
  </body>
</html>
```
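To make the nesting of tags visible without a browser, here is a minimal sketch that walks a shortened version of the page above with `html.parser` from the Python Standard Library. The class name `TagLister` and the abbreviated page are assumptions made for illustration; the rest of the notebook relies on BeautifulSoup instead.

```python
from html.parser import HTMLParser

# shortened version of the example page above
PAGE = """<!DOCTYPE html>
<html>
  <head><title>The Adventures of Pinocchio</title></head>
  <body>
    <h1>The Adventures of Pinocchio</h1>
    <hr>
    <p>Centuries ago there lived--</p>
  </body>
</html>"""


class TagLister(HTMLParser):
    """Record every opening tag together with its nesting depth."""

    VOID = {"br", "hr", "img"}  # tags that have no closing counterpart

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((self.depth, tag))
        if tag not in self.VOID:
            self.depth += 1  # the following tags are nested inside this one

    def handle_endtag(self, tag):
        if tag not in self.VOID:
            self.depth -= 1


lister = TagLister()
lister.feed(PAGE)
for depth, tag in lister.seen:
    print("  " * depth + tag)
```

Tags such as `<p>` increase the depth until their closing tag is met, whereas `<hr>` introduces content directly and leaves the depth unchanged.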

### Scraping Web Pages

>The following notes are roughly based on the **Chapters 1-3** of: Mitchell, R. (2015). [Web Scraping with Python](http://shop.oreilly.com/product/0636920034391.do), O'Reilly

#### Modules and Packages Required for Web Scraping

**BeautifulSoup**: this library defines [classes and functions](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to pull data (e.g. table, lists, paragraphs) out of HTML and XML files. It provides idiomatic ways of navigating, searching, and modifying the parse tree.


**lxml**: to function, BeautifulSoup relies on external HTML/XML parsers. Several options are available, including html5lib and Python's built-in `html.parser`. We'll rely on the [lxml](http://lxml.de/) parser, due to its high performance, reliability and flexibility.


**urllib**: BeautifulSoup does not fetch the web page for us. To do this, we'll rely on the [urllib](https://docs.python.org/3.7/library/urllib.html#module-urllib) package from the Python Standard Library, which provides classes and functions that help in opening URLs (authentication, redirections, cookies and so on). We will see another option, **requests**, below.

In [1]:
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup

#### Retrieve and Parse an HTML page

`urllib.request.urlopen()` allows us to retrieve our target HTML page:

In [2]:
html = urlopen("http://www.pythonscraping.com/pages/page1.html")

What if the page doesn't exist?

In [3]:
try:
    html = urlopen("http://www.pythonscraping.com/pages/page.html")
except Exception as e:
    print(e)

HTTP Error 404: Not Found


Well, let's handle this properly...

In [4]:
try:
    html = urlopen("http://www.pythonscraping.com/pages/page.html")
except urllib.error.URLError as e:
    pass # code your plan B here (HTTPError, e.g. a 404, is a subclass of URLError)
except Exception as e:
    raise # raise any other exception
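As a slightly more complete sketch of the same idea, the helper below separates the case in which the server answers with an error code (`HTTPError`) from the case in which it cannot be reached at all (`URLError`). The function name `fetch_html` and the `timeout` value are assumptions for illustration; the demo call uses a `file://` URL pointing to a path assumed not to exist, so that it runs without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes found at `url`, or None if they cannot be retrieved."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # must come first: HTTPError is a subclass of URLError
        print("The server answered with an error:", e.code)
    except URLError as e:
        print("The resource could not be reached:", e.reason)
    return None


# hypothetical path, assumed not to exist on the local machine
missing = fetch_html("file:///this/path/does/not/exist.html")
print(missing)
```

Returning `None` lets the calling code decide what the "plan B" should be, e.g. skipping the page or retrying later.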

We use `BeautifulSoup()` in conjunction with `lxml` to parse our `html` page and store it in the Beautiful Soup format:

In [5]:
# you might need to do the following:
#!pip install lxml

In [6]:
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
soup_page1 = BeautifulSoup(html, "lxml")

In [7]:
#Let's scrape another couple of pages we'll need in our examples
soup_page3 = BeautifulSoup(urlopen("http://www.pythonscraping.com/pages/page3.html"), "lxml")
soup_wap = BeautifulSoup(urlopen("http://www.pythonscraping.com/pages/warandpeace.html"), "lxml")

#### Let's look at the nested structure of the page

The `prettify()` method allows us to have a look at the structure of the HTML page

In [8]:
print(soup_page1)

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>



In [9]:
print(soup_page1.prettify())

<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <div>
   Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  </div>
 </body>
</html>



#### Let's play with a HTML tag

The notation `soup.<tag>` allows us to retrieve the content marked by a tag (opening and closing tags included)

In [10]:
# note that the first "<div>" tag is nested two layers deep (html → body → div).
soup_page1.div

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

If the text is the only thing you're interested in, the `soup.<tag>.string` attribute comes in handy:

In [11]:
soup_page1.div.string

'\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n'
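One caveat is worth a small sketch: `.string` is only defined when a tag has a single child, and it is `None` otherwise, in which case `get_text()` can be used. The snippet below is a made-up example and uses Python's built-in `"html.parser"` so that it does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# made-up snippet: the <div> contains text, a nested tag, and more text
snippet = "<div>Hello <b>world</b>!</div>"
soup = BeautifulSoup(snippet, "html.parser")

print(soup.b.string)        # the <b> tag has a single text child
print(soup.div.string)      # the <div> has three children, so .string is None
print(soup.div.get_text())  # get_text() concatenates all the nested text
```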

The HTML markup generated by Beautiful Soup can be modified:

In [12]:
# let's change the content of our div
soup_page1.div.string = "this content has been changed"
# let's change the name of the tag
soup_page1.div.name = "new_div"

In [13]:
print(soup_page1.prettify())

<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <new_div>
   this content has been changed
  </new_div>
 </body>
</html>



In its simplest use, the `find()` method is an alternative to the `soup.<tag>` notation...

In [14]:
soup_page1.find("new_div")

<new_div>this content has been changed</new_div>

In [15]:
soup_page1.new_div

<new_div>this content has been changed</new_div>

...but this function allows for the searching of nodes by exploiting cues in the markup, such as a given **class attribute** value:

In [16]:
print(soup_wap.prettify())

<html>
 <head>
  <style>
   .green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
  </style>
 </head>
 <body>
  <h1>
   War and Peace
  </h1>
  <h2>
   Chapter 1
  </h2>
  <div id="text">
   "
   <span class="red">
    Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
   </span>
   "
   <p>
   </p>
   It was in July, 1805, and the speaker was the well-known
   <span class="green">
    Anna
Pavlovna Scherer
   </span>
   , maid of honor and favorite of the
   <span class="green">
    Empress Marya
Fedorovna
   </span>
   . With these words she greeted
   <span 

In [17]:
soup_wap.find("span", attrs = {"class":"green"})

<span class="green">Anna
Pavlovna Scherer</span>

The values of an attribute for a given tag instance can be retrieved by using the `get("ATTRIBUTE")` method. For instance, if we want to retrieve the URL of an image we can extract the `src` value from the corresponding `<img>` tag:

In [18]:
soup_page3.img.get("src")

'../img/gifts/logo.jpg'

If we want to know all the attributes associated with a given tag, the `attrs` attribute is convenient:

In [19]:
soup_page3.img.attrs

{'src': '../img/gifts/logo.jpg', 'style': 'float:left;'}

In [20]:
# since "attrs" is a dictionary, it can be used as an alternative to "get()"
soup_page3.img.attrs["src"]

'../img/gifts/logo.jpg'

In [21]:
# if you fancy another way to do the same thing...
soup_page3.img["src"]

'../img/gifts/logo.jpg'
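The `src` value retrieved above is a relative path, so it cannot be fetched as it is. A minimal sketch of how it could be resolved against the URL of the page it comes from, using `urljoin` from the Python Standard Library (the variable names are illustrative):

```python
from urllib.parse import urljoin

page_url = "http://www.pythonscraping.com/pages/page3.html"
src = "../img/gifts/logo.jpg"  # the value returned by soup_page3.img.get("src")

# resolve the relative path against the address of the page
image_url = urljoin(page_url, src)
print(image_url)
```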

#### Dealing with multiple HTML tags at once

When the same tag is used multiple times in the same page, however, both the `soup.<tag>` notation and the `find()` method allow you to access **only one instance** (i.e. the first):

In [22]:
# slice the string before printing it: print() returns None, which cannot be subscripted
print(soup_wap.prettify()[180:1190])

In [23]:
soup_wap.span

<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>

In order to extract the **sequence of all the instances of a tag** in a file, we can use the `find_all()` method (previously known as `findAll()` and `findChildren()` in BS 3 and BS 2, respectively)

In [24]:
soup_wap.find_all("span")

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">St. Petersburg</span>,
 <span class="red">If you have nothing better to do, Count [or Prince], and if the
 prospect of spending an evening with a poor invalid is not too
 terrible, I shall be very charmed to see you tonight between 7 and 10-
 Annette Scherer.</span>,
 <span clas

The `find_all()` method also allows us to extract all the tags that match cues in the markup, such as a given **class attribute** value:

In [25]:
soup_wap.find_all("span",  attrs = {"class":"green"})

[<span class="green">Anna
 Pavlovna Scherer</span>, <span class="green">Empress Marya
 Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="green">the prince</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">Prince Vasili</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">Wintzingerode</span>, <span class="green">King of Prussia</span>, <span class="green">le Vicomte de Mortemart</span>, <span class="green">Montmorencys</span>, <span class="green">Rohans</span>, <span class="green">Abbe Morio</span>, <span class="green">the Emperor</span>, <span class="green">the prince</span>, <span class="green">Pri
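Once the matching tags are collected, they can be processed like any Python list. The sketch below counts how often each name occurs in a made-up excerpt modeled on the War and Peace page; it uses the `class_` keyword of `find_all()` as a shorthand for the `attrs` dictionary, and Python's built-in `"html.parser"` so that it does not depend on `lxml`.

```python
from collections import Counter

from bs4 import BeautifulSoup

# made-up excerpt, modeled on the War and Peace page used above
snippet = """
<div id="text">
  <span class="red">Well, Prince, so Genoa and Lucca...</span>
  <span class="green">Anna
Pavlovna</span> greeted <span class="green">the prince</span>.
  <span class="green">Anna Pavlovna</span> smiled.
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

# class_ (with a trailing underscore) is a shorthand for attrs={"class": ...}
green = soup.find_all("span", class_="green")

# collapse line breaks and repeated spaces inside each name
names = [" ".join(tag.get_text().split()) for tag in green]
counts = Counter(names)
print(counts.most_common())
```

Normalizing the whitespace matters here because the same name can be split across lines in the source, as happens with "Anna Pavlovna Scherer" in the output above.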

### Web Crawling

Web crawlers are programs designed to collect pages from the Web. In essence, they recursively implement the following steps: 

- they start by retrieving the page content for a URL 


- they then parse it to retrieve other URLs of interest


- they then focus on these new URLs, for each of which they repeat the whole process, ad infinitum

For instance, if you want to crawl an **entire site**:

- start with a top-level page


- parse the page (retrieve the data your application needs) and extract all the internal links, ignoring already visited URLs


- for each new link, move to the corresponding page and repeat the previous step
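The steps above can be sketched without touching the network by replacing the Web with a small dictionary that maps each page to the links found on it. The `SITE` dictionary and the `crawl` function are hypothetical; a real crawler would fetch and parse each page where the comment indicates.

```python
from collections import deque

# hypothetical site: each page maps to the internal links found on it
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/", "/team"],
    "/blog": ["/blog/post-1", "/about"],
    "/team": [],
    "/blog/post-1": ["/"],
}


def crawl(start):
    """Visit every page reachable from `start` exactly once (breadth-first)."""
    visited = set()
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # ignore already visited URLs
        visited.add(url)
        order.append(url)  # a real crawler would fetch and parse the page here
        for link in SITE.get(url, []):
            if link not in visited:
                queue.append(link)
    return order


print(crawl("/"))
```

Keeping a set of visited URLs is what prevents the crawler from looping forever between pages that link to each other.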

#### A Random walk through Wikipedia

Let's set our starting page URL, fetch it and parse its HTML:

In [26]:
starting_page = urlopen("https://en.wikipedia.org/wiki/Chris_Cornell")
soup = BeautifulSoup(starting_page, "lxml")

At this point, it should be easy to extract all the links in the page:

In [27]:
# links are defined by <a> tag
for link in soup.find_all("a")[:10]:
    print(link)

<a id="top"></a>
<a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected until June 9, 2020 at 11:57 UTC."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/></a>
<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
<a class="mw-jump-link" href="#p-search">Jump to search</a>
<a class="image" href="/wiki/File:Chris_Cornell.jpg"><img alt="Chris Cornell.jpg" data-file-height="900" data-file-width="600" decoding="async" height="330" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Chris_Cornell.jpg/220px-Chris_Cornell.jpg" 

Let's ignore all the "a" tags without an "href" attribute:

In [28]:
for link in [tag for tag in soup.find_all("a") if 'href' in tag.attrs][:10]:
    print(link.attrs['href'])

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/File:Chris_Cornell.jpg
/wiki/Seattle,_Washington
/wiki/Detroit,_Michigan
/wiki/Suicide_by_hanging
/wiki/Hollywood_Forever_Cemetery
/wiki/Susan_Silver
/wiki/List_of_awards_and_nominations_received_by_Chris_Cornell


Wikipedia is full of sidebar, footer, and header links that appear on every page, along with links to the category pages, talk pages, and other pages that do not contain different articles:

```
/wiki/Template_talk:Chris_Cornell
```

```
#cite_note-147
```

Moreover, we don't want to visit pages outside of Wikipedia:

```
http://www.chriscornell.com/
```

Relevant links have three things in common:

- they reside within the `div` with the `id` set to `bodyContent`


- the URLs do not contain colons


- the URLs begin with `/wiki/`
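The last two criteria can be encoded in a single regular expression, which is the one used in the following cell. As a quick check, the sketch below applies it to the example links listed above (the list `candidates` is put together for illustration):

```python
import re

# starts with /wiki/ and contains no colon afterwards
article_link = re.compile(r"^(/wiki/)((?!:).)*$")

candidates = [
    "/wiki/Seattle,_Washington",          # article: keep
    "/wiki/Template_talk:Chris_Cornell",  # contains a colon: drop
    "#cite_note-147",                     # in-page anchor: drop
    "http://www.chriscornell.com/",       # external site: drop
]

kept = [href for href in candidates if article_link.match(href)]
print(kept)
```

The negative lookahead `(?!:)` is checked before every character, so a single colon anywhere after `/wiki/` is enough to reject the URL.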

In [29]:
import re

# the "href" filter guarantees that every matching tag has an "href" attribute
for link in soup.find("div", {"id": "bodyContent"}).find_all("a", href=re.compile(r"^(/wiki/)((?!:).)*$")):
    print(link.attrs['href'])

/wiki/Seattle,_Washington
/wiki/Detroit,_Michigan
/wiki/Suicide_by_hanging
/wiki/Hollywood_Forever_Cemetery
/wiki/Susan_Silver
/wiki/List_of_awards_and_nominations_received_by_Chris_Cornell
/wiki/Alternative_metal
/wiki/Heavy_metal_music
/wiki/Grunge
/wiki/Alternative_rock
/wiki/Hard_rock
/wiki/SST_Records
/wiki/Sub_Pop
/wiki/A%26M_Records
/wiki/Epic_Records
/wiki/Interscope_Records
/wiki/Mosley_Music_Group
/wiki/Soundgarden
/wiki/Audioslave
/wiki/Pearl_Jam
/wiki/Temple_of_the_Dog
/wiki/Center_for_Disease_Control_Boys
/wiki/Alice_Mudgarden
/wiki/Heart_(band)
/wiki/Mad_Season_(band)
/wiki/N%C3%A9
/wiki/Rock_music
/wiki/Soundgarden
/wiki/Audioslave
/wiki/Temple_of_the_Dog
/wiki/Andrew_Wood_(singer)
/wiki/Grunge
/wiki/Octave
/wiki/Belting_(music)
/wiki/Euphoria_Morning
/wiki/Carry_On_(Chris_Cornell_album)
/wiki/Scream_(Chris_Cornell_album)
/wiki/Higher_Truth
/wiki/Songbook_(Chris_Cornell_album)
/wiki/The_Roads_We_Choose_%E2%80%93_A_Retrospective
/wiki/Chris_Cornell_(album)
/wiki/Golden_Gl

This code returns the list of all the Wikipedia articles linked to our starting page. 

This is not enough: we want to recursively repeat this process for all these links. That is, we need a function that takes as input a Wikipedia article URL of the form `/wiki/<Article_Name>` and returns a list of all linked articles.

In [30]:
def getLinks(articleUrl):
    """Return the <a> tags that link to other articles from a Wikipedia article."""
    page = urlopen("https://en.wikipedia.org" + articleUrl)
    soup = BeautifulSoup(page, "lxml")
    links = soup.find("div", {"id":"bodyContent"}).find_all("a", href=re.compile(r"^(/wiki/)((?!:).)*$"))
    return links

Let's test our function by calling it in a script that, at each iteration, selects a random link, and that stops after 10 URLs have been retrieved (or when it bumps into a page without links):

In [31]:
import random

links = getLinks("/wiki/Chris_Cornell")

for _ in range(10):
    if len(links) > 0:
        newArticle = random.choice(links).attrs["href"]
        print(newArticle)
        links = getLinks(newArticle)
    else:
        print("no links in this page")
        break

/wiki/Miami_Vice_(film)
/wiki/The_Washington_Post
/wiki/Joan_Vennochi
/wiki/David_Barstow
/wiki/The_Providence_Journal
/wiki/The_Daily_Republican
/wiki/The_Ithaca_Journal
/wiki/The_American_News
/wiki/Kirksville_Daily_Express
/wiki/Daily_newspaper


---

### Exercise 1.

Write code to retrieve the official address of the Internationally Ranked Universities in the Netherlands by starting from the following Wikipedia article:

https://en.wikipedia.org/wiki/List_of_universities_in_the_Netherlands

In [32]:
# your code here

---

## Working with APIs

An **Application Programming Interface** is a set of protocols that defines how software programs communicate with each other. Without APIs, we have to scrape the Web or get the data directly. With APIs, we can often get structured data, which is much more convenient to work with.

APIs are a great option in that they implement extensively tested routines (**high reliability**). However, you need to spend time learning how they work and, in some cases, they do not give you access to the piece of information you need (**low flexibility**).

In [33]:
import requests

In [34]:
# Example of a Google search

In [35]:
query = "Tesla"
r = requests.get('https://www.google.com/search', params={'q': query})

In [36]:
r.status_code

200

In [37]:
print(r.headers['content-type'])
print(r.encoding)
print(r.url)

text/html; charset=ISO-8859-1
ISO-8859-1
https://www.google.com/search?q=Tesla
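The query string in the URL printed above is built from the `params` dictionary. A minimal sketch of that step with the Python Standard Library, assuming `requests` encodes parameters in a comparable way; the multi-word query and the `hl` parameter are illustrative assumptions.

```python
from urllib.parse import urlencode

base_url = "https://www.google.com/search"
params = {"q": "Tesla Model 3", "hl": "en"}  # "hl" is an illustrative extra parameter

# urlencode escapes each value and joins the key=value pairs with "&"
url = base_url + "?" + urlencode(params)
print(url)
```

Passing a dictionary instead of concatenating strings by hand avoids mistakes with spaces and other characters that are not allowed in URLs.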


In [38]:
r.text[:1000]

'<!doctype html><html lang="nl"><head><meta charset="UTF-8"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Tesla - Google zoeken</title><script nonce="v0ZtdUW0V1jl5ahhFaacow==">(function(){\ndocument.documentElement.addEventListener("submit",function(b){var a;if(a=b.target){var c=a.getAttribute("data-submitfalse");a="1"==c||"q"==c&&!a.elements.q.value?!0:!1}else a=!1;a&&(b.preventDefault(),b.stopPropagation())},!0);document.documentElement.addEventListener("click",function(b){var a;a:{for(a=b.target;a&&a!=document.documentElement;a=a.parentElement)if("A"==a.tagName){a="1"==a.getAttribute("data-nohref");break a}a=!1}a&&b.preventDefault()},!0);}).call(this);(function(){\nvar a=window.performance;window.start=Date.now();a:{var b=window;if(a){var c=a.timing;if(c){var d=c.navigationStart,f=c.responseStart;if(f>d&&f<=window.start){window.start=f;b.wsrt=f-d;break a}}a.now&&(b.wsrt=Math.floor(a.now()))}}window.google=window.google||{};var h

---

### Exercise 2.

1. Inspect the Google search results page and understand how results are displayed.


2. Use BeautifulSoup to get the link of the first 10 results of this search out.

---

What about using `requests` to query APIs? It is easy: pass the query parameters as a dictionary via `params`. Responses then follow the standard format of the API (or the one you request, if alternatives are available).

In [39]:
r = requests.get('https://api.github.com')

# raw
r.content

b'{"current_user_url":"https://api.github.com/user","current_user_authorizations_html_url":"https://github.com/settings/connections/applications{/client_id}","authorizations_url":"https://api.github.com/authorizations","code_search_url":"https://api.github.com/search/code?q={query}{&page,per_page,sort,order}","commit_search_url":"https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}","emails_url":"https://api.github.com/user/emails","emojis_url":"https://api.github.com/emojis","events_url":"https://api.github.com/events","feeds_url":"https://api.github.com/feeds","followers_url":"https://api.github.com/user/followers","following_url":"https://api.github.com/user/following{/target}","gists_url":"https://api.github.com/gists{/gist_id}","hub_url":"https://api.github.com/hub","issue_search_url":"https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}","issues_url":"https://api.github.com/issues","keys_url":"https://api.github.com/user/keys","label_sea

In [45]:
# json
r.json()

{'current_user_url': 'https://api.github.com/user',
 'current_user_authorizations_html_url': 'https://github.com/settings/connections/applications{/client_id}',
 'authorizations_url': 'https://api.github.com/authorizations',
 'code_search_url': 'https://api.github.com/search/code?q={query}{&page,per_page,sort,order}',
 'commit_search_url': 'https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}',
 'emails_url': 'https://api.github.com/user/emails',
 'emojis_url': 'https://api.github.com/emojis',
 'events_url': 'https://api.github.com/events',
 'feeds_url': 'https://api.github.com/feeds',
 'followers_url': 'https://api.github.com/user/followers',
 'following_url': 'https://api.github.com/user/following{/target}',
 'gists_url': 'https://api.github.com/gists{/gist_id}',
 'hub_url': 'https://api.github.com/hub',
 'issue_search_url': 'https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}',
 'issues_url': 'https://api.github.com/issues',
 'keys_url': '

### Twitter API (OPT)

Two main APIs:

* **Streaming API**: a sample of public tweets and events as they are published on Twitter. It provides real-time data only, without limits.


* **REST API**: allows you to search, follow trends, read author profiles and follower data, and post or modify content. It provides historical data going back up to a week (for the free account; more by paying), works through one-time requests and is rate limited (the limits vary across requests and subscriptions).


REST is a widely used style for developing Web services: https://en.wikipedia.org/wiki/Representational_state_transfer

Some more basic info: https://developer.twitter.com/en/docs/basics/things-every-developer-should-know

Tutorials: https://developer.twitter.com/en/docs/tutorials

#### Using the API: authentication

**For this part, you will need credentials from the Twitter dev website.**

A good way to store your keys is using `.conf` files and `configparser`.

In [41]:
import configparser
config = configparser.ConfigParser()
config.read("stuff/conf.conf")

['stuff/conf.conf']

In [42]:
config['twitter']['api_key']

'hBn3fPoa7TXlL4fEEbZ5l1cbd'

This is what my `conf.conf` file looks like (also in `stuff/conf_public.conf`):

```
[twitter]
api_key = YOURS
api_secret_key = YOURS
access_token = YOURS
access_secret_token = YOURS
```
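To see how `configparser` handles a file with this layout, here is a minimal sketch that parses the same structure from a string instead of a file on disk. The content of `conf_text` is a shortened, hypothetical copy of the file above.

```python
import configparser

# shortened, hypothetical copy of the file shown above
conf_text = """
[twitter]
api_key = YOURS
api_secret_key = YOURS
"""

config = configparser.ConfigParser()
config.read_string(conf_text)  # read_string() parses text instead of a file on disk

api_key = config["twitter"]["api_key"]
# get() with a fallback avoids a KeyError when an option is missing
token = config.get("twitter", "access_token", fallback=None)
print(api_key, token)
```

Keeping the keys in a separate file makes it easier to share the notebook without sharing the credentials, provided that the file itself is not published.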

#### A useful package: Tweepy

https://tweepy.readthedocs.io/en/latest/index.html

In [43]:
import tweepy

In [44]:
# Tweepy Hello World

# authentication (OAuth)
auth = tweepy.OAuthHandler(config['twitter']['api_key'], config['twitter']['api_secret_key'])
auth.set_access_token(config['twitter']['access_token'], config['twitter']['access_secret_token'])

api = tweepy.API(auth)

public_tweets = api.home_timeline()
for tweet in public_tweets[:5]:
    print(tweet.text)

Great work by our friends @scite https://t.co/TtVZcQoWSL
RT @mjczies: Good morning and happy weekend! @Eckyo8 inspired me this morning! If we want to stand tall, we must grow and thrive even throu…
RT @SoManyBooks24: “The novel’s polyphony builds into a panoramic picture of late 13th-century England, divided by wealth and status but su…
In dit artikel vatten we het belangrijkste coronanieuws van de afgelopen uren voor je samen.
https://t.co/5M6d8qlLVp
RT @Nisa00: Still a few days to submit nominations to EASST Amsterdamska, Freeman and Ziman prizes! @STSeasst Full information at https://t…


#### Interlude: JSON

The Twitter API returns data structured in the JSON format. [JSON](https://www.json.org) (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. **Once parsed in Python, it maps naturally onto nested dictionaries and lists.**


Minimal example:

```json
{
  "firstName": "John",
  "lastName": "Doe",
  "age": 21
}
```

Extended example:

```json
{
  "$id": "https://example.com/person.schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Person",
  "type": "object",
  "properties": {
    "firstName": {
      "type": "string",
      "description": "The person's first name."
    },
    "lastName": {
      "type": "string",
      "description": "The person's last name."
    },
    "age": {
      "description": "Age in years which must be equal to or greater than zero.",
      "type": "integer",
      "minimum": 0
    }
  }
}
```


Online viewer: http://jsonviewer.stack.hu
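A short sketch of how JSON text is converted to and from Python objects with the `json` module of the Standard Library. The input extends the minimal example above with a few made-up fields (`tags`, `active`, `email`) to show how arrays, booleans and `null` are mapped.

```python
import json

# minimal example from above, extended with made-up fields
text = '{"firstName": "John", "lastName": "Doe", "age": 21, "tags": ["a", "b"], "active": true, "email": null}'

person = json.loads(text)  # JSON text -> Python objects
print(type(person))        # a JSON object becomes a dict
print(person["tags"])      # a JSON array becomes a list
print(person["active"], person["email"])  # true becomes True, null becomes None

back = json.dumps(person, indent=2)  # Python objects -> JSON text
print(back)
```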

#### Using the API: search

All the most recent Tweets from a given hashtag.

In [49]:
# queries

tweets = tweepy.Cursor(api.search, q="#nlproc")

for item in tweets.items(2):
    print(item._json)

{'created_at': 'Sat Jun 06 08:49:10 +0000 2020', 'id': 1269189578423820288, 'id_str': '1269189578423820288', 'text': 'RT @Takuma_Kato51: #acl2020nlp の併設ワークショップ(SRW)に採択された論文を公開しました！\nhttps://t.co/aEVnBqij7N\nNERデータセットのラベルを複数の要素に分解(例：B-person -&gt; B, Person)し，各要素…', 'truncated': False, 'entities': {'hashtags': [{'text': 'acl2020nlp', 'indices': [19, 30]}], 'symbols': [], 'user_mentions': [{'screen_name': 'Takuma_Kato51', 'name': 'Takuma Kato', 'id': 1267657369200128000, 'id_str': '1267657369200128000', 'indices': [3, 17]}], 'urls': [{'url': 'https://t.co/aEVnBqij7N', 'expanded_url': 'https://arxiv.org/abs/2006.01372v2', 'display_url': 'arxiv.org/abs/2006.01372…', 'indices': [63, 86]}]}, 'metadata': {'iso_language_code': 'ja', 'result_type': 'recent'}, 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name':
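Since each tweet is a nested dictionary, the pieces of interest can be reached by chaining keys. The sketch below works on an abbreviated copy of the tweet printed above (most fields are omitted). It also shows that `json.dumps` escapes non-ASCII characters by default, and that `ensure_ascii=False` keeps them readable.

```python
import json

# abbreviated copy of the tweet structure printed above (most fields omitted)
tweet = {
    "id_str": "1269189578423820288",
    "entities": {
        "hashtags": [{"text": "acl2020nlp", "indices": [19, 30]}],
        "user_mentions": [{"screen_name": "Takuma_Kato51", "name": "Takuma Kato"}],
    },
    "metadata": {"iso_language_code": "ja", "result_type": "recent"},
}

hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]
mentions = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]
# chained .get() calls tolerate missing keys instead of raising a KeyError
language = tweet.get("metadata", {}).get("iso_language_code")
print(hashtags, mentions, language)

escaped = json.dumps({"text": "論文"})                       # default: \uXXXX escapes
readable = json.dumps({"text": "論文"}, ensure_ascii=False)  # keeps the original characters
print(escaped)
print(readable)
```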

In [50]:
# same query, with each tweet pretty-printed as indented JSON
import json

tweets = tweepy.Cursor(api.search, q="#nlproc")

for item in tweets.items(2):
    print(json.dumps(item._json, indent=4, sort_keys=False))

{
    "created_at": "Sat Jun 06 08:49:10 +0000 2020",
    "id": 1269189578423820288,
    "id_str": "1269189578423820288",
    "text": "RT @Takuma_Kato51: #acl2020nlp \u306e\u4f75\u8a2d\u30ef\u30fc\u30af\u30b7\u30e7\u30c3\u30d7(SRW)\u306b\u63a1\u629e\u3055\u308c\u305f\u8ad6\u6587\u3092\u516c\u958b\u3057\u307e\u3057\u305f\uff01\nhttps://t.co/aEVnBqij7N\nNER\u30c7\u30fc\u30bf\u30bb\u30c3\u30c8\u306e\u30e9\u30d9\u30eb\u3092\u8907\u6570\u306e\u8981\u7d20\u306b\u5206\u89e3(\u4f8b\uff1aB-person -&gt; B, Person)\u3057\uff0c\u5404\u8981\u7d20\u2026",
    "truncated": false,
    "entities": {
        "hashtags": [
            {
                "text": "acl2020nlp",
                "indices": [
                    19,
                    30
                ]
            }
        ],
        "symbols": [],
        "user_mentions": [
            {
                "screen_name": "Takuma_Kato51",
                "name": "Takuma Kato",
                "id": 1267657369200128000,
                "id_str

#### Using the API: users

Get some info on a given user, and explore their friends/followers.

In [51]:
user = api.get_user("elonmusk")

print("User:",user.screen_name)
print("------")
print("Friends:",user.friends_count)
print("Followers:",user.followers_count)
print("------")
for friend in user.friends(count=10):
    print(friend.screen_name)
print("------")
for follower in user.followers(count=10):
    print(follower.screen_name)

User: elonmusk
------
Friends: 92
Followers: 35673217
------
jack
TalulahRiley
justinemusk
MKBHD
jk_rowling
joerogan
Blklivesmatter
Erdayastronaut
engineeringvids
TheWeirdHistory
------
x_funeral_x
StandWithPower1
jamesglackin10
wernerfinch5
_katkouture
VASOLAIYAPPAN
lemonthewoozi
karlwreed1
MichaelChao0522
bruce78398060


#### Using the API: tweets from user

In [52]:
user = api.get_user("elonmusk", tweet_mode="extended") # extended tweetmode gets also the longer 280/char tweets
elon_tweets = user.timeline()

for tweet in elon_tweets[:5]:
    print(tweet.text)

@akidesir Thursday
@CodingMark I guess more people need to get more involved in the party primaries (which is a chore)
The gerontocracy is out of touch with the people
Selling weed literally went from major felony to essential business (open during pandemic) in much of America &amp; yet… https://t.co/32J0z4qDTI
This will probably get me into trouble, but I feel I have to say it


---

For those who don't have a Twitter account and app, here are some tweets on and by Boris Johnson!

In [55]:
user = "@BorisJohnson"
tweets_on_user = tweepy.Cursor(api.search, q=user, tweet_mode="extended")

on_boris = list()
for tweet in tweets_on_user.items(100):
    on_boris.append(tweet.full_text)
    
#print("\n------\n")
# from user

user = api.get_user("BorisJohnson", tweet_mode="extended") # extended tweetmode gets also the longer 280/char tweets
tweets_from_user = user.timeline(count=100)

from_boris = list()
for tweet in tweets_from_user:
    from_boris.append(tweet.text)

In [56]:
# save to file
f_on_boris = "stuff/tweets_on_boris.csv"
f_from_boris = "stuff/tweets_from_boris.csv"

# note we are using the "" as text delimiter
with open(f_on_boris, "w") as f:
    for t in on_boris:
        f.write('"'+t+'"\n')
        
with open(f_from_boris, "w") as f:
    for t in from_boris:
        f.write('"'+t+'"\n')
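Writing the delimiters by hand breaks down as soon as a tweet itself contains a double quote or a line break. A possible alternative, sketched below under the assumption that one tweet per row is wanted, is to let the `csv` module of the Standard Library take care of quoting. The two example tweets are made up, and an in-memory buffer stands in for the file.

```python
import csv
import io

# made-up tweets that would break the hand-written format
tweets = [
    'He said "hello" to everyone',  # contains the text delimiter itself
    "first line\nsecond line",      # contains a line break
]

# stands in for open(path, "w", newline="", encoding="utf-8")
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
for t in tweets:
    writer.writerow([t])  # one tweet per row; embedded quotes are escaped automatically

# read the rows back to check that nothing was lost
buffer.seek(0)
restored = [row[0] for row in csv.reader(buffer)]
print(restored == tweets)
```

Reading the data back with `csv.reader` returns each tweet intact, including the embedded quotes and line breaks.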

### Exercise 3.

1. Download the last 100 (or another number) tweets mentioning a user you are interested in and the last 100 tweets posted by that user. Alternatively, use the tweets in the `on_boris` and `from_boris` files.


2. Create a minimal pipeline to normalize the tweets into lists of tokens.


3. For each of the two datasets, count and compare the most frequent (top 10):
    - tokens
    - hashtags
    - other user mentions

In [None]:
# your code here