# Crawling Twitter in Python

by 

[__Michael Granitzer__ (michael.granitzer@uni-passau.de)]( http://www.mendeley.com/profiles/michael-granitzer/)

<p>
<p>
<br>
__Licences__
<br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/" align="left"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>


# RESTful APIs

RESTful APIs (REST = Representational State Transfer) are the dominant style of API on the web. We will briefly review the underlying technology, namely: 
- HTTP as protocol
- JSON as return format
- OAuth as authentication method



## HTTP in a nutshell

<div class="alert alert-warning">
The **HyperText Transfer Protocol** is a generic, stateless, request-response (pull) application protocol for distributed and collaborative hypermedia information systems (Fielding et al. 1999)
</div>

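This section describes HTTP/1.1. The newer HTTP/2 and HTTP/3 keep the same methods, status codes, and header semantics, but change how messages are transported.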


### Format

**Request-Format**

- Request line (e.g. `GET /index.html HTTP/1.1`)
- Request header fields as `Key: Value1, Value2 ...` line (e.g. `Accept-Language:en`)
- empty line
- Optional message body


**Response-Format**

- Status line with status code and reason message (e.g. `HTTP/1.1 200 OK`)
- Response header fields `Key: Value1, Value2 ...` line (e.g. `Content-Type:text/html`)
- empty line
- Optional message body

*Header Examples*

|Key     | Description|
|------|-|
|Accept  | Content type accepted as response|
|Authorization | Credentials for authentication|
|Content-Type| The MIME Type of the Body|
|User-Agent| A string identifying the user agent|
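
The message format above can be sketched in a few lines of Python. This is a minimal illustration and not a full HTTP parser: the helper names `build_request` and `parse_response` are made up for this example, and the sample response is hand-written rather than captured from a server.

```python
def build_request(method, path, headers, body=""):
    """Assemble a raw HTTP/1.1 request: request line, header lines, empty line, body."""
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{key}: {value}" for key, value in headers.items()]
    return "\r\n".join(lines) + "\r\n\r\n" + body


def parse_response(raw):
    """Split a raw HTTP/1.x response into status line parts, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        key, _, value = line.partition(":")
        headers[key.strip().lower()] = value.strip()  # header names are case-insensitive
    return version, int(code), reason, headers, body


request = build_request("GET", "/index.html",
                        {"Host": "example.org", "Accept-Language": "en"})
print(repr(request))

# Hand-written sample response (an assumption for illustration)
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 5\r\n"
    "\r\n"
    "hello"
)
version, code, reason, headers, body = parse_response(raw_response)
print(version, code, reason, headers["content-type"], repr(body))
```

In practice a library such as `requests` does this work; the sketch only makes the line structure of the messages visible.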



**Status Codes**

|Code-Group| Description|
|----------|------------|
| 1xx      | Informational|
| 2xx      | Success|
| 3xx      | Redirection|
| 4xx      | Client Error|
| 5xx      | Server Error |


Details: [IANA](http://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml)
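
As a small illustration of the code groups, the first digit of a status code determines its group. The sketch below uses the standard library's `http.HTTPStatus` for the reason phrases; the function name `status_group` is an assumption of this example.

```python
from http import HTTPStatus

GROUPS = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_group(code):
    """Map a status code to its group by looking at the first digit."""
    return GROUPS.get(code // 100, "Unknown")


for code in (200, 302, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", status_group(code))
```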


### Request Methods

HTTP addresses a resource via a URL (host + file part) and can apply several [methods](http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html), also known as verbs, to this resource.

**Safe Methods:** These methods do not change the resource. They only retrieve information.

- `GET` - retrieve the resource or parts of the resource
- `HEAD` - identical response to a GET resource but without body
- `OPTIONS` - returns the HTTP methods supported by the server
- `TRACE` - echoes back the received request so that changes or additions made by intermediate servers become visible

Of these safe methods, only `GET` and `HEAD` are defined in HTTP/1.0.

**Idempotent Methods:** Making the same request several times has the same effect on a resource as making it once, so the system ends up in the same state. Safe methods are idempotent as well; the following methods change the resource but are still idempotent:

- `PUT` - Stores a resource under the supplied URI
- `DELETE` - Deletes the specified resource

**Non-Idempotent Methods:** Sending two identical requests may produce different results.

- `POST` - posts the entity enclosed in the request to the resource
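
The classification above matters in practice, for example when a client decides whether a failed request may be repeated automatically. A minimal sketch, assuming a simple lookup table (the names `METHOD_PROPERTIES` and `may_retry_automatically` are made up for this example):

```python
# (safe, idempotent) for each method, following the classification above
METHOD_PROPERTIES = {
    "GET": (True, True),
    "HEAD": (True, True),
    "OPTIONS": (True, True),
    "TRACE": (True, True),
    "PUT": (False, True),
    "DELETE": (False, True),
    "POST": (False, False),
}


def is_safe(method):
    return METHOD_PROPERTIES[method.upper()][0]


def is_idempotent(method):
    return METHOD_PROPERTIES[method.upper()][1]


def may_retry_automatically(method):
    """Repeating an idempotent request leaves the server in the same state."""
    return is_idempotent(method)


for method in METHOD_PROPERTIES:
    print(f"{method:8} safe={is_safe(method)!s:6} idempotent={is_idempotent(method)}")
```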


### MIME Types

Multipurpose Internet Mail Extensions types (MIME types) describe the type of media returned. 

A media type (e.g. `text/html;charset=UTF-8`) is composed of
- a top-level type (e.g. `text`)
- a sub-type (e.g. `html`) 
- zero or more optional parameters (e.g. `charset=UTF-8`)

**top-level type names** are `application, audio, example, image, message, model, multipart, text, video`

**sub-types** are composed as `{|vnd.|prs.|x.}subtype-name[+suffix]`

- `{|vnd.|prs.|x.}` refers to the different sub-trees, namely the standard tree (no prefix), the vendor tree `vnd.`, the personal or vanity tree `prs.`, and the unregistered (private) tree `x.`.
- `[+suffix]` refers to an optional structure specification of the media type, namely `+xml, +json, +ber, +der, +fastinfoset, +wbxml, +zip, +cbor` 

**Example**: `'application/vnd.mozilla.xul+xml'`

All types are listed at [IANA](http://www.iana.org/assignments/media-types/media-types.xhtml)
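
To make the composition of a media type concrete, the following sketch splits a media type string into the parts described above. It is a simplified illustration (no quoting rules, no validation), and `parse_media_type` is a name chosen for this example.

```python
TREES = {"vnd.": "vendor", "prs.": "personal", "x.": "unregistered"}


def parse_media_type(media_type):
    """Split a media type into top-level type, sub-type, tree, suffix, and parameters."""
    type_part, *param_parts = [part.strip() for part in media_type.split(";")]
    top_level, _, subtype = type_part.partition("/")
    name, _, suffix = subtype.partition("+")
    tree = "standard"
    for prefix, label in TREES.items():
        if name.startswith(prefix):
            tree = label
            break
    params = dict(part.split("=", 1) for part in param_parts if "=" in part)
    return {
        "top_level": top_level,
        "subtype": subtype,
        "tree": tree,
        "suffix": suffix or None,
        "params": params,
    }


print(parse_media_type("text/html;charset=UTF-8"))
print(parse_media_type("application/vnd.mozilla.xul+xml"))
```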

### HTTP Example

    telnet www.uni-passau.de 80
    Trying 132.231.51.59...
    Connected to www.uni-passau.de.
    Escape character is '^]'.
    GET / HTTP/1.1

    HTTP/1.1 302 Found
    Server: Apache/2.2.22 (Ubuntu)
    X-Powered-By: PHP/5.3.10-1ubuntu3.14
    Location: http://www.uni-passau.de/404fehlerseite/
    Vary: Accept-Encoding
    Cache-Control: max-age=3600
    Expires: Sun, 12 Oct 2014 15:37:00 GMT
    Content-Type: text/html
    Transfer-Encoding: chunked
    Date: Sun, 12 Oct 2014 14:37:00 GMT
    X-Varnish: 1997371000
    Age: 0
    Via: 1.1 varnish
    Connection: keep-alive

## REST Architecture in a nutshell

<div class="alert alert-warning">
**Representational state transfer (REST)** describes an architectural style of a distributed hypermedia information system like the WWW. Its basic properties are:
<br>
<ul>
 <li> Scalability
 <li> Performance
 <li> Simplicity
 <li> Reliability
 <li> Modifiability
 <li> Portability
</ul>
</div>

It was developed by Roy Fielding in his PhD thesis at UC Irvine in 2000.

### Architectural constraints

- **Client-Server Model:** The client is separated from the server through a uniform interface (separation of concerns)
- **Uniform Interface:** REST defines four interface constraints:
  - Identification of resources (on the Web through URIs) in requests, decoupled from the representation that is returned (e.g. XML, JSON, HTML)
  - Manipulation of resources through these representations
  - Self-descriptive messages
  - Hypermedia as the engine of application state (HATEOAS). Given a fixed entry point, the server returns the possible state transitions to the client. The client must not make any assumptions beyond that.
- **Stateless:** Client context must not be stored on the server. Every client request contains all the information necessary to process it.
- **Cacheable:** Clients can cache responses if a response declares itself cacheable.
- **Layered System:** Clients can connect to intermediate/proxy servers.
- **Code on demand (optional):** Servers may transfer executable code to the client. 


### REST over HTTP

REST over HTTP consists of:
- a **URI** representing the resource
- an **Internet Media Type (MIME Type)** indicating the format of the data/resource.
- **HTTP Methods:** `GET`, `PUT`, `POST`, `DELETE`
- **Hyperlinks** indicating server **state transitions** (HATEOAS)
- **Hyperlinks** to reference **related resources**

|Resource|Get | Put | Delete |Post|
|--------|----|-----|--------|----|
|Collection URI|List URIs in Collection| Replace entire collection|Delete collection| Create new entry|
|Element URI|Get representation of element|Replace or create element|Delete element|Not generally used. Treat element as collection and create new entry|
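
The table can be illustrated with a tiny in-memory collection whose methods mirror the four verbs. This is a sketch of the semantics only, with no HTTP involved; the class `Collection` and its integer ids are assumptions of this example.

```python
class Collection:
    """In-memory stand-in for a collection resource and its elements."""

    def __init__(self):
        self._items = {}
        self._next_id = 1

    def get(self, item_id=None):
        if item_id is None:
            return sorted(self._items)       # collection: list the element ids
        return self._items[item_id]          # element: return its representation

    def post(self, representation):
        item_id = self._next_id              # collection: create a new entry
        self._next_id += 1
        self._items[item_id] = representation
        return item_id

    def put(self, item_id, representation):
        self._items[item_id] = representation            # replace or create element
        self._next_id = max(self._next_id, item_id + 1)  # keep generated ids unique

    def delete(self, item_id=None):
        if item_id is None:
            self._items.clear()              # collection: delete everything
        else:
            self._items.pop(item_id, None)   # element: deleting twice is harmless


courses = Collection()
first = courses.post({"Name": "Web Mining"})
second = courses.post({"Name": "Web Mining"})   # identical POST, yet a second entry
courses.put(first, {"Name": "Data Mining"})
courses.put(first, {"Name": "Data Mining"})     # identical PUT, same state as before
courses.delete(second)
courses.delete(second)                          # identical DELETE, same state as before
print(courses.get(), courses.get(first))
```

The two identical `post` calls create two entries, while repeating `put` or `delete` does not change the outcome, which is the idempotency distinction made earlier.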


## JSON in a nutshell

<div class ="alert alert-warning" align="centred"> "JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language" [1]
</div>
[1]:http://json.org/

JSON builds on two structures that can be mixed with each other:

- A collection of key/value pairs (e.g. a dictionary in Python)
  - Format: `{Key1: Value, ...., KeyN: Value}`
- An ordered list of values (e.g. a list in Python)
  - Format: `[Value1, ...., ValueM]`

Compared to XML, JSON is usually more readable and easier to parse in JavaScript.


**Example:**

     { "FirstName"  : "Michael",
       "LastName"   : "Granitzer",
       "Courses"    : [{"Name" : "Web Mining", "ID" :1234}, 
                       {"Name" : "Data Mining", "ID" :1235}]
     }
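
In Python, the standard library module `json` maps these two structures to `dict` and `list`. A short sketch using the example document above:

```python
import json

text = """
{ "FirstName"  : "Michael",
  "LastName"   : "Granitzer",
  "Courses"    : [{"Name" : "Web Mining", "ID" :1234},
                  {"Name" : "Data Mining", "ID" :1235}]
}
"""

person = json.loads(text)                   # JSON text -> Python objects
course_names = [course["Name"] for course in person["Courses"]]
print(type(person).__name__, type(person["Courses"]).__name__, course_names)

serialized = json.dumps(person, indent=2)   # Python objects -> JSON text
round_trip = json.loads(serialized)
print(round_trip == person)
```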
    
  

## OAuth in a nutshell

<div class="alert alert-warning">
"An open protocol to allow secure authorization in a simple and standard method from web, mobile and desktop applications." [1]
</div>

[1]: http://oauth.net/

The current version is OAuth 2.0, which enables a third-party application to obtain limited access to an HTTP service. 

The user does not need to share credentials with a consumer application (e.g. an e-learning system) that consumes user-authenticated web services (e.g. Twitter, for posting tweets through the e-learning system). 

**Example:** "Login via Google"


### A Brief Overview on the Protocol

We will take a very brief look at the OAuth protocol, version 2.0. This is by no means exhaustive, but should be sufficient for our purpose of web mining. 

For details on the protocol please see the corresponding [standard](https://tools.ietf.org/html/draft-ietf-oauth-v2-31), which was published in its final form as RFC 6749. 

For a quick introduction see [this tutorial](http://aaronparecki.com/articles/2012/07/29/1/oauth2-simplified), which has been used as the basis for the description here.

#### Roles
- **Client:** The third party application attempting to get access to a user's account on the resource server
- **Resource Server:** The API that the client wants to access on behalf of the user
- **Resource Owner:** The user, i.e. the person who grants access to some portion of the services that he or she is allowed to use on the resource server

#### Preliminaries: Application-based access

- **Registering an Application:** The OAuth process requires the **Client** to register an **application** at the **resource-server**. 

- **Redirect URI:** The URI to which the user is redirected after authorizing the **client**. It must be registered with the **application** on the **resource-server**.
- **Client ID and Secret:** They form the credentials for your application. The client ID is public information and is used to build the login URLs. The client secret must be kept confidential, since it authenticates the application towards the resource server.



#### Authorization for Web Server based Clients

For Web Server based Clients, the following OAuth dance is performed:

<img src="images/OAuth.svg"/>


**Example:**
    
1. Redirect User to Resource Server for Login
    https://resource-server.com/auth?response_type=code&client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=photos
    
2. The resource server prompts the user to grant access to the registered app

3. On acceptance, the user is redirected to `REDIRECT_URI` with an auth code:
   https://client.com/cb?code=AUTH_CODE_HERE
   
4. The client exchanges the auth code for an access token by sending a `POST` request to the resource server:
   <pre><code>
     POST https://api.oauth2server.com/token
     grant_type=authorization_code&
     code=AUTH_CODE_HERE&
     redirect_uri=REDIRECT_URI&
     client_id=CLIENT_ID&
     client_secret=CLIENT_SECRET
  </code></pre>
  
  The server replies with the access token, which has to be used in all further calls: 
   <pre><code> 
    {
    "access_token":"RsT5OjbzRn430zqMLgV3Ia"
    }
    </code></pre>
    
5. Make authenticated requests by sending the token in the `Authorization` HTTP header field (use HTTPS to keep the token secure!):
    <pre><code>     
    Authorization: Bearer RsT5OjbzRn430zqMLgV3Ia
    </code></pre>
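
The messages exchanged in these steps can be assembled with the standard library alone. The sketch below only builds the URLs, form body, and header and sends nothing over the network; the endpoint is the placeholder host from the example, and the function names are assumptions of this illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

AUTH_ENDPOINT = "https://resource-server.com/auth"   # placeholder from the example


def authorization_url(client_id, redirect_uri, scope):
    """Step 1: the URL the user is redirected to for login."""
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
    })
    return f"{AUTH_ENDPOINT}?{query}"


def extract_auth_code(callback_url):
    """Step 3: read the auth code from the redirect back to the client."""
    return parse_qs(urlparse(callback_url).query)["code"][0]


def token_request_body(code, client_id, client_secret, redirect_uri):
    """Step 4: form-encoded body of the POST that exchanges the code for a token."""
    return urlencode({
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "client_id": client_id,
        "client_secret": client_secret,
    })


def bearer_header(access_token):
    """Step 5: header for authenticated requests."""
    return {"Authorization": f"Bearer {access_token}"}


login_url = authorization_url("CLIENT_ID", "https://client.com/cb", "photos")
code = extract_auth_code("https://client.com/cb?code=AUTH_CODE_HERE")
body = token_request_body(code, "CLIENT_ID", "CLIENT_SECRET", "https://client.com/cb")
print(login_url)
print(body)
print(bearer_header("RsT5OjbzRn430zqMLgV3Ia"))
```

Note that `urlencode` percent-encodes the redirect URI, which the shortened example URLs above leave out for readability.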

**Other Request Types**

- **Password:** Exchanges the user's real username and password directly for an access token. Used for clients from the same vendor as the resource server (e.g. the native Twitter app)
- **Application Access:** Lets the application access or change its own registration information without any user context

#### Differences to OAuth V 1.0
- **Simplicity:** reduced requirements on signing requests/cryptographic signatures
- **User Experience:** Better user experience for native applications
- **Performance:** Separation of client ID/authorization code and access token enhances scalability for resource servers.

## HTTP Requests in Python: The `requests` Package

Although the Python standard library comes with an HTTP client library ([`urllib2`](https://docs.python.org/2/library/urllib2.html) in Python 2, `urllib.request` in Python 3), its API is hard to use. 

The `requests` module provides an easy and handy API to send HTTP requests over the network.

This section briefly summarizes the most important calls with `requests`. For details see the [documentation](http://docs.python-requests.org/en/latest/).


### Installation 

The following install procedure works from within the notebook (the module may only become importable after a kernel restart). Useful links for installing modules and managing packages:
- [Python Setup Tools](https://pypi.python.org/pypi/setuptools):
Package managers are `easy_install` and `pip`
- [Installing Python Modules](https://docs.python.org/2/install/index.html): 
The standard command after downloading the source is `python setup.py install`

In [1]:
def installed():
    try:
        import requests
        return True
    except ImportError:
        return False

In [3]:
print (installed())

True


In [4]:
# Note: "!" runs a shell command from the notebook. We call pip with --user so that
# the package is installed into the user's site directory, which avoids permission issues.
if not installed():
    !pip install --user requests
if not installed():  # pip might be missing or broken, so install from source locally
    import tempfile
    td = tempfile.mkdtemp()  # tempfile.tempdir is None until gettempdir() has been called
    # Each "!" line runs in its own subshell, so a separate "!cd" would not persist.
    # GitHub no longer serves the unauthenticated git:// protocol, hence https://.
    !cd $td && git clone https://github.com/kennethreitz/requests.git && cd requests && python setup.py install --user

### Usage

A brief example (based on the [requests documentation](http://docs.python-requests.org/en/latest/user/quickstart/)):

In [7]:
import requests
r = requests.get('http://en.wikipedia.org/w/api.php')
print ("Status of request (200=success): %d"%r.status_code)
print ("HTTP Headers %s"%str(r.headers['content-type']))
print ("Encoding %s"% r.encoding)
print ("Returned text %s"%r.text.encode(r.encoding)[0:100])

Status of request (200=success): 200
HTTP Headers text/html; charset=utf-8
Encoding utf-8
Returned text b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'


#### The request object

The `requests` module is the main entry point for sending HTTP requests. For every request method (the verbs), a separate function exists.


In [8]:
r = requests.post("http://httpbin.org/post")
r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")

#### Passing Parameters

Parameters are passed as a dictionary via the `params` argument.

In [9]:
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)
print(r.url)

http://httpbin.org/get?key1=value1&key2=value2
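
What happens to the dictionary can be approximated with the standard library: keys and values are URL-encoded and appended to the URL as a query string. This is a sketch of the idea using `urllib.parse.urlencode`, not a description of how `requests` is implemented internally; the second value is chosen to show the escaping of special characters.

```python
from urllib.parse import urlencode, urlparse, parse_qs

payload = {"key1": "value1", "key2": "value 2 & more"}

# Encode the parameters and append them as the query string
query = urlencode(payload)
url = "http://httpbin.org/get?" + query
print(url)

# Decoding the query string recovers the original values
decoded = {key: values[0] for key, values in parse_qs(urlparse(url).query).items()}
print(decoded)
```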


#### Response Content

A `Response` automatically decodes the content from the server, guessing the encoding from the HTTP headers. The decoded content is available through the `text` attribute. If you change the encoding by setting the `encoding` attribute, subsequent accesses to `text` use the new encoding.

In [10]:
r = requests.get('https://api.github.com/events')
r.text[1:1000]

'{"id":"5910216938","type":"PushEvent","actor":{"id":17426470,"login":"yangbin194","display_login":"yangbin194","gravatar_id":"","url":"https://api.github.com/users/yangbin194","avatar_url":"https://avatars.githubusercontent.com/u/17426470?"},"repo":{"id":91169208,"name":"yangbin194/yangbin194.github.io","url":"https://api.github.com/repos/yangbin194/yangbin194.github.io"},"payload":{"push_id":1751296109,"size":1,"distinct_size":1,"ref":"refs/heads/master","head":"4011808816e45506d17f3973a9ad223ee16ef857","before":"90c29d0ff8cf59c1ff4891c1309e8bc66028d492","commits":[{"sha":"4011808816e45506d17f3973a9ad223ee16ef857","author":{"email":"yangbinyhbn@gmail.com","name":"YangBin"},"message":"Site updated: 2017-05-20 16:42:51","distinct":true,"url":"https://api.github.com/repos/yangbin194/yangbin194.github.io/commits/4011808816e45506d17f3973a9ad223ee16ef857"}]},"public":true,"created_at":"2017-05-20T08:42:58Z"},{"id":"5910216930","type":"PushEvent","actor":{"id":22369369,"login":"Jie1102","di

In [11]:
print ("The encoding of the response is", r.encoding)

The encoding of the response is utf-8


#### Binary Response Content

The raw bytes of the response body are available through the `content` attribute. `gzip` and `deflate` transfer-encodings are decoded automatically.

In [12]:
c = r.content
print (type(c), ": ", c[:1000])


<class 'bytes'> :  b'[{"id":"5910216938","type":"PushEvent","actor":{"id":17426470,"login":"yangbin194","display_login":"yangbin194","gravatar_id":"","url":"https://api.github.com/users/yangbin194","avatar_url":"https://avatars.githubusercontent.com/u/17426470?"},"repo":{"id":91169208,"name":"yangbin194/yangbin194.github.io","url":"https://api.github.com/repos/yangbin194/yangbin194.github.io"},"payload":{"push_id":1751296109,"size":1,"distinct_size":1,"ref":"refs/heads/master","head":"4011808816e45506d17f3973a9ad223ee16ef857","before":"90c29d0ff8cf59c1ff4891c1309e8bc66028d492","commits":[{"sha":"4011808816e45506d17f3973a9ad223ee16ef857","author":{"email":"yangbinyhbn@gmail.com","name":"YangBin"},"message":"Site updated: 2017-05-20 16:42:51","distinct":true,"url":"https://api.github.com/repos/yangbin194/yangbin194.github.io/commits/4011808816e45506d17f3973a9ad223ee16ef857"}]},"public":true,"created_at":"2017-05-20T08:42:58Z"},{"id":"5910216930","type":"PushEvent","actor":{"id":22369369,

#### JSON Response Content

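A JSON response body can be decoded with the `json()` method of the response (here `r.json()`), which returns Python lists and dictionaries.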

In [15]:
import json
print (type(r.json()))
#pretty print the json using json.dumps
print (json.dumps(r.json(), indent =  4))

<class 'list'>
[
    {
        "id": "5910216938",
        "type": "PushEvent",
        "actor": {
            "id": 17426470,
            "login": "yangbin194",
            "display_login": "yangbin194",
            "gravatar_id": "",
            "url": "https://api.github.com/users/yangbin194",
            "avatar_url": "https://avatars.githubusercontent.com/u/17426470?"
        },
        "repo": {
            "id": 91169208,
            "name": "yangbin194/yangbin194.github.io",
            "url": "https://api.github.com/repos/yangbin194/yangbin194.github.io"
        },
        "payload": {
            "push_id": 1751296109,
            "size": 1,
            "distinct_size": 1,
            "ref": "refs/heads/master",
            "head": "4011808816e45506d17f3973a9ad223ee16ef857",
            "before": "90c29d0ff8cf59c1ff4891c1309e8bc66028d492",
            "commits": [
                {
                    "sha": "4011808816e45506d17f3973a9ad223ee16ef857",
                    "aut

### Custom Headers

You can set custom headers by passing a `dict` that maps header names to values via the `headers` argument.

In [16]:
import json
url = 'https://api.github.com/events'
#add some additional payload, i.e. post parameters
payload = {'some': 'data'} 
#set custom headers here
headers = {'content-type': 'application/json'}
r = requests.post(url, data=json.dumps(payload), headers=headers)
#pretty print the json this time
print (json.dumps(r.json(), sort_keys=True, indent=4))

{
    "documentation_url": "https://developer.github.com/v3",
    "message": "Not Found"
}


### Response Headers

We can view the server's response headers:

In [17]:
r.headers

{'Server': 'GitHub.com', 'Date': 'Sat, 20 May 2017 08:44:23 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Status': '404 Not Found', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '58', 'X-RateLimit-Reset': '1495273378', 'X-GitHub-Media-Type': 'github.v3; format=json', 'Access-Control-Expose-Headers': 'ETag, Link, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval', 'Access-Control-Allow-Origin': '*', 'Content-Security-Policy': "default-src 'none'", 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'deny', 'X-XSS-Protection': '1; mode=block', 'Content-Encoding': 'gzip', 'X-GitHub-Request-Id': '8C10:6EB1:1D95C57:230D5AF:592001E7'}

## Exercise I: OAuth Authentication with Twitter 

For details, see "Exercises Folder", "Exercise I" in "Exercise Twitter with Python" ([local](exercises/Exercises%20Crawling%20Twitter%20with%20Python.ipynb)|[online](http://nbviewer.ipython.org/urls/raw.github.com/mgrani/LODA-lecture-notes-on-data-analysis/master/V.Web-Mining-Applications/exercises/Exercises%20Crawling%20Twitter%20with%20Python.ipynb))

# Twitter 

<div class="alert alert-warning">
Twitter is an online social networking service where users can send and read 140-character messages called tweets. It can be compared to a mobile phone Short Message Service (SMS) for the internet.
</div>

## Conceptual Model

- A **Twitter user** creates Tweets, messages of up to 140 characters
- **Tweets** are public by default, but can be protected.
- **Private Messages** are allowed between users.
- **Retweeting** refers to a Twitter user forwarding another user's tweet to his or her own followers.
- **Following** refers to one Twitter user subscribing to the tweets of another Twitter user. It can be seen as filtering tweets on a per-user basis. Mutual following is often considered an indicator of friendship, but there is no explicit way of stating friendship.
- **Tweet Metadata** refers to metadata associated with a tweet, for example the author's location. 
- **Tweet Special Characters:** Due to the limited amount of characters, certain "special" characters have been introduced by the community. Technically they are treated as characters, but add meaning for users:
  - `@` followed by a username is used for mentioning or replying to the named user
  - `#` followed by a word indicates a topic description

<img src="images/twitter-example.png"/>

## API

All Twitter functionality and data relating to the conceptual models are provided via the API. We will take a brief look at the API and its possibilities. This also clarifies the details of the conceptual Twitter model.


There are different APIs/Tools on Twitter:

- **Fabric** - a mobile app development kit to include Twitter functionality.
- **Twitter for Websites** - a set of embeddable widgets, buttons etc. to integrate Twitter in a website
- **Cards** - display additional content alongside a Tweet (e.g. photos, videos, summary HTML pages)
- **OAuth** - endpoints for authentication on behalf of a user or on behalf of an application
- **REST API** - read and write access to Twitter data. 
- **Streaming API** - delivers a stream of updates to REST API queries over a long-lived HTTP connection
- **Ads API** - integrates Twitter advertising management.

We will focus on the REST API and here only on *reading* Twitter data.

### REST API V1.1

- Standard REST calls
- Authentication either with user context or without (application-only)
- Details under https://dev.twitter.com/rest/public



**Searching Twitter**

- URL: https://api.twitter.com/1.1/search/tweets.json
- Method: `GET`
- Parameters:
  - `q=<Query>` where `Query` is a URL-encoded query similar to the standard Twitter search. Operators are:
     - `word1 word2` - AND search (the default)
     - `"word1 word2"` - phrase search, enclosed in `"`
     - `word1 OR word2` - OR search
     - `word1 -word2` - excludes tweets containing the word prefixed with `-`
     - `#hashtag` - searches for a hashtag
     - `from:User1` - tweets posted by User1
     - `to:User1` - tweets sent to User1
     - further operators, e.g. for positive/negative attitude, date ranges, questions, or types of feeds
  - *Result Type:* choose between recent or popular tweets (`result_type`)
  - *Geolocation:* restrict the query by location (`geocode`)
  - *Language:* language of a tweet (`lang`)
  - *Iterating over a result set:* through `count`, `until`, `since_id`, and `max_id`
- Rate limit: 
  - 180 requests per 15 minutes with a user context 
  - 450 requests per 15 minutes with application-only authentication
- Details for rate limits: https://dev.twitter.com/rest/public/rate-limits
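
As an illustration, the following sketch builds a search URL from a query and derives a pause between requests from the rate limits listed above. Nothing is sent over the network, authentication is left out, and the function names are assumptions of this example.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def search_url(query, **params):
    """URL-encode the query and any further parameters into a search URL."""
    return SEARCH_URL + "?" + urlencode({"q": query, **params})


def min_seconds_between_requests(limit, window_minutes=15):
    """Pause that keeps a steady stream of requests within the rate limit."""
    return window_minutes * 60 / limit


url = search_url("#python from:mgrani", count=100, result_type="recent", lang="en")
print(url)
print(min_seconds_between_requests(180), "seconds per request with a user context")
print(min_seconds_between_requests(450), "seconds per request application-only")
```

Spacing requests evenly is only one possible strategy; a crawler can also send requests in bursts and wait for the 15-minute window to reset.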
     
     


**Get Requests which are interesting for Mining Twitter**

`GET` requests return resources of interest, while `POST` requests update/manipulate the resources. Hence, we only look into `GET` requests for now.

*Some Details*

- prefix: https://api.twitter.com/1.1/
- Documentation: https://dev.twitter.com/rest/public 

*Requests*

|Request|Description|
|-------|-----------|
|statuses/user_timeline.json | returns most recent tweets posted by a user|
|statuses/home_timeline.json | returns the most recent Tweets and Retweets posted by a user and by the users he/she follows|
|statuses/retweets\*|a collection of functions for accessing retweets|
|statuses/show/:id|Returns the details of a single Tweet including information on the Tweet Author|
|friends/ids| Returns the users the specified user is following|
|followers/ids| Returns the followers of a specified user|
|friendships/\*| Returns incoming and outgoing pending follow requests|
|users/search|search public user accounts on Twitter|
|users/show|returns the details of a user|
|lists/\*|a collection of list related functions| 
|geo/\*|a collection of geo related functions| 

**Working with Timelines**

Due to the real-time nature of Twitter, working with timelines is not straightforward (see https://dev.twitter.com/rest/public/timelines).


Naive paging does not work, because new Tweets arrive during processing and shift the pages, so that tweets are delivered twice:

<img src="images/twitter-paging-problem.png"/>
(Image-Source: Twitter)

**Solution using `max_id`** 

Instead of working relative to the top of the constantly changing list (page 0, count 5), work relative to the ids of the tweets already processed.

Procedure:

1. Call Twitter with the `count` parameter to retrieve the `count` most recent tweets.
2. In subsequent calls, set the `max_id` parameter to the lowest tweet id retrieved so far. Note that `max_id` is inclusive, so the tweet with that id is returned again; subtract 1 from the lowest id to avoid the duplicate.
3. Twitter returns the tweets with ids lower than or equal to `max_id`.

<img src="images/twitter-paging-max_id.png"/>
(Image Source Twitter)
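
The procedure can be tried out without network access by simulating the timeline with a list of integer ids. In this sketch `fetch_timeline` is a stand-in for the real endpoint (newest first, `max_id` inclusive), and new tweets are appended between calls to mimic arrivals during processing; all names are assumptions of this example.

```python
def fetch_timeline(timeline, count, max_id=None):
    """Stand-in for a timeline request: newest first, max_id inclusive."""
    selected = [tweet_id for tweet_id in sorted(timeline, reverse=True)
                if max_id is None or tweet_id <= max_id]
    return selected[:count]


def read_backwards(timeline, count, arrivals=()):
    """Page through the timeline with max_id while new tweets keep arriving."""
    arrivals = list(arrivals)
    collected, max_id = [], None
    while True:
        page = fetch_timeline(timeline, count, max_id=max_id)
        if not page:
            return collected
        collected.extend(page)
        max_id = min(page) - 1        # inclusive parameter: step below the lowest id seen
        if arrivals:                  # simulate a tweet arriving between two calls
            timeline.append(arrivals.pop(0))


timeline = list(range(1, 11))         # tweets with ids 1..10
result = read_backwards(timeline, count=3, arrivals=[11, 12])
print(result)
```

The tweets that arrive during processing (ids 11 and 12) neither shift the pages nor cause duplicates; they are simply left for a later pass.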

**Solution using `since_id`**

For timelines, newly arrived tweets can be processed efficiently using the `since_id` parameter. 

With `since_id` set, Twitter returns only tweets with ids greater than (i.e. newer than) the given id. Note that `since_id` is exclusive.

Using `max_id` alone would be too inefficient, because tweets that have already been processed are retrieved again:
<img src="images/twitter-maxid-problem.png"/>
(Image Source Twitter)

So we combine `since_id` and `max_id`: `since_id` is set to the highest id already processed, and `max_id` pages backwards through the new tweets.
<img src="images/twitter-paging-since_id.png"/>
(Image source Twitter)
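
The combination of both parameters can again be simulated with integer ids. In this sketch `fetch_timeline` stands in for the real endpoint (`max_id` inclusive, `since_id` exclusive); the names and the sample data are assumptions of this example.

```python
def fetch_timeline(timeline, count, max_id=None, since_id=None):
    """Stand-in for a timeline request: newest first, max_id inclusive, since_id exclusive."""
    selected = [tweet_id for tweet_id in sorted(timeline, reverse=True)
                if (max_id is None or tweet_id <= max_id)
                and (since_id is None or tweet_id > since_id)]
    return selected[:count]


def read_new_tweets(timeline, count, since_id=None):
    """Collect every tweet newer than since_id, paging backwards with max_id."""
    collected, max_id = [], None
    while True:
        page = fetch_timeline(timeline, count, max_id=max_id, since_id=since_id)
        if not page:
            return collected
        collected.extend(page)
        max_id = min(page) - 1


timeline = list(range(1, 11))                 # tweets with ids 1..10
first_pass = read_new_tweets(timeline, count=4)
newest_seen = max(first_pass)                 # remember the highest id processed

timeline.extend(range(11, 19))                # eight new tweets arrive later
second_pass = read_new_tweets(timeline, count=4, since_id=newest_seen)
print(second_pass)
```

The second pass stops as soon as it reaches tweets that were already processed, so none of the ids 1 to 10 is fetched again.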

### Tweet Metadata

A nice summary of the metadata available for a tweet has been given by Raffi Krikorian: 

http://online.wsj.com/public/resources/documents/TweetMetadata.pdf

Roughly, tweets include

- Tweet information like text, entities, polarity, mentioned geo locations, hash tags, creation date
- Author information like creation date, profile, location of the user

## The Python Twitter Module for accessing the API

There is a [`twitter`](https://pypi.python.org/pypi/twitter) module (version 1.15) that wraps the Twitter API and provides convenience functions for accessing Twitter. Be careful which version you use, since the API might change between versions.

### Installing the Python Twitter Module

In [23]:
def installed_twitter():
    try:
        import twitter
        return True
    except ImportError:
        return False

# Note: "!" runs a shell command from the notebook. We call pip with --user so that
# the package is installed into the user's site directory, which avoids permission issues.
if not installed_twitter():
    print ("Twitter module is not installed. Continuing with pip and installing it to current user.")
    !pip install twitter --user
if not installed_twitter(): # pip might be missing or broken, so install from source locally
    print ("Pip installation failed. Trying to build from source") 
    import tempfile
    td = tempfile.mkdtemp()  # tempfile.tempdir is None until gettempdir() has been called
    # Each "!" line runs in its own subshell, so the steps are chained with &&.
    # The `twitter` package lives in the sixohsix/twitter repository; bear/python-twitter
    # is the separate `python-twitter` package, which has a different API.
    !cd $td && git clone https://github.com/sixohsix/twitter.git && cd twitter && python setup.py install --user
    
if installed_twitter():
    print ("Twitter module is now installed. Restart the kernel and use import twitter to access it.")
else:
    print ("Building from source failed. Install the module manually.")


In [30]:
%%bash
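# Note: bear/python-twitter is the `python-twitter` package, not the `twitter` package used in the examples below.
git clone https://github.com/bear/python-twitter.git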
cd python-twitter
pip install -Ur requirements.testing.txt
pip install -Ur requirements.txt

Requirement already up-to-date: future in /home/nadim/anaconda3/lib/python3.6/site-packages (from -r requirements.testing.txt (line 1))
Requirement already up-to-date: requests in /home/nadim/anaconda3/lib/python3.6/site-packages (from -r requirements.testing.txt (line 2))
Requirement already up-to-date: requests_oauthlib in /home/nadim/anaconda3/lib/python3.6/site-packages (from -r requirements.testing.txt (line 3))
Collecting responses (from -r requirements.testing.txt (line 4))
  Downloading responses-0.5.1-py2.py3-none-any.whl
Collecting pytest (from -r requirements.testing.txt (line 6))
  Downloading pytest-3.0.7-py2.py3-none-any.whl (172kB)
Collecting pytest-cov (from -r requirements.testing.txt (line 7))
  Downloading pytest_cov-2.5.1-py2.py3-none-any.whl
Collecting pytest-runner (from -r requirements.testing.txt (line 8))
  Downloading pytest_runner-2.11.1-py2.py3-none-any.whl
Collecting mccabe (from -r requirements.testing.txt (line 9))
  Downloading mccabe-0.6.1-py2.py3-none-

fatal: destination path 'python-twitter' already exists and is not an empty directory.


### Some examples with the Twitter module

In [1]:
import twitter
help(twitter)

Help on package twitter:

NAME
    twitter - The minimalist yet fully featured Twitter API and Python toolset.

DESCRIPTION
    The Twitter and TwitterStream classes are the key to building your own
    Twitter-enabled applications.
    
    
    The Twitter class
    -----------------
    
    The minimalist yet fully featured Twitter API class.
    
    Get RESTful data by accessing members of this class. The result
    is decoded python objects (lists and dicts).
    
    The Twitter API is documented at:
    
      http://dev.twitter.com/doc
    
    
    Examples::
    
        from twitter import *
    
        t = Twitter(
            auth=OAuth(token, token_key, con_secret, con_secret_key))
    
        # Get your "home" timeline
        t.statuses.home_timeline()
    
        # Get a particular friend's timeline
        t.statuses.user_timeline(screen_name="billybob")
    
        # to pass in GET/POST parameters, such as `count`
        t.statuses.home_timeline(count=5)
    


In principle, attribute accesses on an instance of the `Twitter` class are translated into Twitter REST requests.
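
The principle can be sketched with Python's `__getattr__` hook: each attribute access extends the request path, and the final call turns keyword arguments into query parameters. This is an illustration of the idea only, not the implementation of the `twitter` module; the class `RestPath` is made up for this example, and it returns the URL instead of sending a request.

```python
from urllib.parse import urlencode


class RestPath:
    """Builds a REST URL from chained attribute accesses."""

    def __init__(self, base_url, parts=()):
        self._base_url = base_url
        self._parts = tuple(parts)

    def __getattr__(self, name):
        # Only called for attributes that do not exist, i.e. the path segments
        if name.startswith("_"):
            raise AttributeError(name)
        return RestPath(self._base_url, self._parts + (name,))

    def __call__(self, **params):
        url = f"{self._base_url}/{'/'.join(self._parts)}.json"
        if params:
            url += "?" + urlencode(params)
        return url


api = RestPath("https://api.twitter.com/1.1")
print(api.statuses.home_timeline())
print(api.statuses.user_timeline(screen_name="mgrani", count=5))
```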

### Authentication First

The `twitter` module can do the OAuth dance using the application's key/secret and the user's token/secret.

In [2]:
import twitter
    
consumer_key = "IhrU9UwHys1IBnVXh2tLt5dDW"
consumer_secret ="H4kL6FHOJlZ0wiCHUhqD2u3FmRm2q8vKJRqoqCwJzEN5t9Fsfw" #the secret has been scrambled. Use your own.
token = "2843747566-333S6Hv9clIrvfb0Ne8FRpY5OuYFqLowrUJksmv"
token_secret = "DKZAFwKfc63SSf7XnyVVw8IyT0YtWQjOmh8YklfYL6OKu"
auth = twitter.oauth.OAuth(token, token_secret,
                           consumer_key, consumer_secret)
twitter_api = twitter.Twitter(auth = auth)
print (twitter_api)


<twitter.api.Twitter object at 0x7f2e5846f550>
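
Keys that are written directly into a notebook are easily leaked when the notebook is shared. A minimal sketch of one alternative, assuming the four values are provided as environment variables; the variable names and the helper `load_credentials` are hypothetical and not part of the `twitter` module.

```python
import os

# Hypothetical variable names, chosen for this example
REQUIRED = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_TOKEN",
    "TWITTER_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Read all four credentials or fail with a message naming the missing ones."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED}


# Demonstration with a fake environment so that the example runs anywhere
fake_environ = {name: "dummy-" + name.lower() for name in REQUIRED}
credentials = load_credentials(fake_environ)
print(sorted(credentials))
```

The resulting values can then be passed to `twitter.oauth.OAuth` in the same order as in the cell above (token, token secret, consumer key, consumer secret).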


### Getting timelines

In [3]:
# Get your "home" timeline
[t["text"] for t in twitter_api.statuses.home_timeline()]


['RT @egghat: Warte mal …\n\nDie Türkei ruft den US-Botschafter ein, weil Erdowahns Bodyguardsin den USA Demonstranten verprügelt haben?!?\n\nWhu…',
 'Reformen in Frankreich: Merkel verspricht Hilfe für Macron https://t.co/mAKLYwGwmk',
 'GE, Lockheed Martin, Exxon Mobil: Diese Deals hat Trump in Saudi-Arabien eingefädelt https://t.co/ZFl93E7Qrn',
 'Umfrage: Nur 16 Prozent der Deutschen verschlüsseln ihre E-Mails https://t.co/efmGQ4US97 #PGP #EMail',
 'Insolvenz-Prozess: Zeuge belastet Anton Schlecker schwer... https://t.co/llMVSKkLfK',
 'Apple beerdigt Gratis-Testphase für "Music": Nutzer in der Schweiz, Spanien sowie Australien müssen künftig zahlen - https://t.co/MRklkW58tm',
 'Schweizer Bank muss Drogerie-Gründer Müller 45 Millionen zahlen https://t.co/twVLqwbeBG https://t.co/i6Gv4ZEndP',
 'RT @FelixHoltermann: Der Hammer: Jede 5. #Bank könnte dichtmachten. https://t.co/m8XLcraEYA',
 'Horrorfilme mit Aliens: Das sind die besten https://t.co/HYofEt85gL https://t.co/dsC9AyiP6t',
 'Es 

In [4]:
# Get a particular friend's timeline
[t["text"] for t in twitter_api.statuses.user_timeline(screen_name="mgrani")]

['RT @Kasparov63: Is he insane? AI job replacement is top topic everywhere I speak for 2 years, more than security or strategy! https://t.co/…',
 'RT @itsgreateu: 🇦🇹 Our Foreign Minister is about the age of your son Barron. #BFF Our Chancellor is Humphrey Bogart. Come on, Mr. President…',
 'RT @ktochtermann: Our #openscience conference https://t.co/fCbKHcSzGl almost fully booked. Already 167 registrations from 33 different coun…',
 'RT @Qofficiel: Nous souhaitions simplement savoir si le garde du corps de Marine Le Pen avait eu un emploi fictif au Parlement européen ou…',
 'RT @AndrewYNg: Othello/Checkers/Chess/Go theoretically solvable w/tree search+sheer computation. Poker/bluffing needs more. Congrats CMU Li…',
 'RT @kdnuggets: #DeepLearning and Neuromorphic Chips #KDN https://t.co/eYEnNbkRn2',
 'RT @ktochtermann: Great explainity video why the #EOSC and access to heterogeneous and FAIR data sources is the future https://t.co/gSaKOFS…',
 'RT @kdnuggets: Lessons from 2 Million #Machi

In [5]:
# to pass in GET/POST parameters, such as `count`
[t["text"] for t in twitter_api.statuses.home_timeline(count=5)]


['RT @egghat: Warte mal …\n\nDie Türkei ruft den US-Botschafter ein, weil Erdowahns Bodyguardsin den USA Demonstranten verprügelt haben?!?\n\nWhu…',
 'Reformen in Frankreich: Merkel verspricht Hilfe für Macron https://t.co/mAKLYwGwmk',
 'GE, Lockheed Martin, Exxon Mobil: Diese Deals hat Trump in Saudi-Arabien eingefädelt https://t.co/ZFl93E7Qrn',
 'Umfrage: Nur 16 Prozent der Deutschen verschlüsseln ihre E-Mails https://t.co/efmGQ4US97 #PGP #EMail',
 'Insolvenz-Prozess: Zeuge belastet Anton Schlecker schwer... https://t.co/llMVSKkLfK']
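A single call returns only one page of tweets. The v1.1 timeline endpoints accept a `max_id` parameter for walking further back in time; the following is a minimal sketch of that loop. The helper `collect_timeline` and the fake data are assumptions for illustration, and `fetch_page` stands in for the real API call so that the example runs without credentials.

```python
def collect_timeline(fetch_page, count=100, max_pages=5):
    """Collect tweets page by page, walking backwards with max_id."""
    tweets = []
    max_id = None
    for _ in range(max_pages):
        kwargs = {"count": count}
        if max_id is not None:
            kwargs["max_id"] = max_id
        page = fetch_page(**kwargs)
        if not page:
            break  # nothing older is available
        tweets.extend(page)
        # max_id is inclusive, so continue just below the oldest tweet seen
        max_id = min(t["id"] for t in page) - 1
    return tweets


# Stand-in for the API: ten fake tweets with ids 10..1, newest first.
FAKE_TWEETS = [{"id": i, "text": "tweet %d" % i} for i in range(10, 0, -1)]


def fake_timeline(count, max_id=None):
    older = [t for t in FAKE_TWEETS if max_id is None or t["id"] <= max_id]
    return older[:count]


all_tweets = collect_timeline(fake_timeline, count=4)
print([t["id"] for t in all_tweets])  # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

With the real client, `fetch_page` could be something like `lambda **kw: twitter_api.statuses.user_timeline(screen_name="mgrani", **kw)`; keep `max_pages` small to stay within the rate limits.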

### User centric functions

In [6]:
# Calling followers.ids corresponds to the REST endpoint https://dev.twitter.com/rest/reference/get/followers/ids
twitter_api.followers.ids(screen_name="mgrani")

{'ids': [37162329,
  1057480268,
  850720269878652928,
  3218983397,
  268243115,
  537464987,
  839125519945850880,
  293527251,
  3329492973,
  2864070897,
  742968466198650881,
  219476668,
  4704572095,
  87548625,
  779794135,
  710038354788745216,
  4924187376,
  142280807,
  4629350069,
  779225865997676544,
  18115444,
  715153720082882560,
  204297410,
  15801598,
  522096738,
  393893442,
  2650004130,
  742667964583972864,
  423417020,
  190261638,
  4361474962,
  174294231,
  747096993432899584,
  724006529238732800,
  742834629116301316,
  3391225835,
  284553,
  76336279,
  2969452439,
  267670313,
  722729696203509760,
  1840276716,
  2603486526,
  704225667727101952,
  25481671,
  2926998515,
  4793796869,
  14305608,
  951950401,
  2693600486,
  272008339,
  3118188136,
  328926928,
  18676745,
  337018821,
  3916694415,
  75246856,
  70470604,
  2325373176,
  3320945609,
  1933696789,
  534302269,
  78867746,
  2555501227,
  350164545,
  2345198082,
  2732416056,
  31
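For accounts with many followers, `followers/ids` delivers the ids in pages that are linked by cursors: a request with `cursor=-1` asks for the first page, and a `next_cursor` of `0` in the response marks the last one. A minimal sketch of that loop follows; `collect_ids`, the fake pages and the cursor value `1001` are made up so the example runs offline.

```python
def collect_ids(fetch_page, max_pages=10):
    """Follow next_cursor until the API reports the last page."""
    ids = []
    cursor = -1  # -1 requests the first page
    for _ in range(max_pages):
        page = fetch_page(cursor=cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
        if cursor == 0:  # 0 means there are no further pages
            break
    return ids


# Stand-in for the API: two fake pages keyed by the cursor that requests them.
FAKE_PAGES = {
    -1: {"ids": [37162329, 1057480268], "next_cursor": 1001},
    1001: {"ids": [3218983397, 268243115], "next_cursor": 0},
}


def fake_followers_page(cursor):
    return FAKE_PAGES[cursor]


print(collect_ids(fake_followers_page))
# [37162329, 1057480268, 3218983397, 268243115]
```

With the real client, `fetch_page` could wrap `twitter_api.followers.ids(screen_name="mgrani", cursor=cursor)`.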

In [7]:
# Examine the resulting exception to see how the call is constructed.
# This is a good example of Python magic: the library overrides __getattr__ to build
# the URL path from the attribute names and __call__ to send the request.
twitter_api.url.path.none.exists.followers.ids(screen_name="mgrani")

TwitterHTTPError: Twitter sent status 404 for URL: 1.1/url/path/none/exists/followers/ids.json using parameters: (oauth_consumer_key=IhrU9UwHys1IBnVXh2tLt5dDW&oauth_nonce=12480754203401261202&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1495463737&oauth_token=2843747566-333S6Hv9clIrvfb0Ne8FRpY5OuYFqLowrUJksmv&oauth_version=1.0&screen_name=mgrani&oauth_signature=gg1I7PvreOWZDN7pkGfj3jnBRtE%3D)
details: {'errors': [{'message': 'Sorry, that page does not exist', 'code': 34}]}
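The parameter string in the error message shows that the library sends the user-supplied parameters together with the OAuth parameters it adds itself. As a small sketch, the standard library can split an excerpt of that string into the two groups (the consumer key and token are left out of the excerpt; the variable names are chosen for illustration):

```python
from urllib.parse import parse_qs

# Excerpt of the parameter string from the error message above.
sent = ("oauth_nonce=12480754203401261202&oauth_signature_method=HMAC-SHA1"
        "&oauth_timestamp=1495463737&oauth_version=1.0&screen_name=mgrani"
        "&oauth_signature=gg1I7PvreOWZDN7pkGfj3jnBRtE%3D")

# parse_qs returns a list per key; each key occurs once here.
sent_params = {key: values[0] for key, values in parse_qs(sent).items()}
oauth_params = {k: v for k, v in sent_params.items() if k.startswith("oauth_")}
user_params = {k: v for k, v in sent_params.items() if not k.startswith("oauth_")}

print(sorted(oauth_params))
print(user_params)  # {'screen_name': 'mgrani'}
```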

### Accessing trends

In [8]:
# See https://dev.twitter.com/docs/api/1.1/get/trends/place
# Locations are identified by their Yahoo! Where On Earth ID (WOEID)
# Use http://woeid.rosselliot.co.nz/ for lookups
WORLD_WOE_ID = 1
GER_WOE_ID = 23424829
twitter_api.trends.place(_id=WORLD_WOE_ID)

[{'as_of': '2017-05-22T14:35:51Z',
  'created_at': '2017-05-22T14:32:37Z',
  'locations': [{'name': 'Worldwide', 'woeid': 1}],
  'trends': [{'name': '#FelizLunes',
    'promoted_content': None,
    'query': '%23FelizLunes',
    'tweet_volume': 42393,
    'url': 'http://twitter.com/search?q=%23FelizLunes'},
   {'name': '#mondaymotivation',
    'promoted_content': None,
    'query': '%23mondaymotivation',
    'tweet_volume': 96637,
    'url': 'http://twitter.com/search?q=%23mondaymotivation'},
   {'name': '#DiaDoAbraço',
    'promoted_content': None,
    'query': '%23DiaDoAbra%C3%A7o',
    'tweet_volume': None,
    'url': 'http://twitter.com/search?q=%23DiaDoAbra%C3%A7o'},
   {'name': '#DTBYRomance',
    'promoted_content': None,
    'query': '%23DTBYRomance',
    'tweet_volume': 423722,
    'url': 'http://twitter.com/search?q=%23DTBYRomance'},
   {'name': '#ToDeSacoCheioDe',
    'promoted_content': None,
    'query': '%23ToDeSacoCheioDe',
    'tweet_volume': None,
    'url': 'http://twi

In [9]:
twitter_api.trends.place(_id=GER_WOE_ID)

[{'as_of': '2017-05-22T14:35:55Z',
  'created_at': '2017-05-22T14:32:37Z',
  'locations': [{'name': 'Germany', 'woeid': 23424829}],
  'trends': [{'name': '#Schmidt',
    'promoted_content': None,
    'query': '%23Schmidt',
    'tweet_volume': None,
    'url': 'http://twitter.com/search?q=%23Schmidt'},
   {'name': '#Artenvielfalt',
    'promoted_content': None,
    'query': '%23Artenvielfalt',
    'tweet_volume': None,
    'url': 'http://twitter.com/search?q=%23Artenvielfalt'},
   {'name': '#TrumpInIsrael',
    'promoted_content': None,
    'query': '%23TrumpInIsrael',
    'tweet_volume': 14888,
    'url': 'http://twitter.com/search?q=%23TrumpInIsrael'},
   {'name': '#BeFo17',
    'promoted_content': None,
    'query': '%23BeFo17',
    'tweet_volume': None,
    'url': 'http://twitter.com/search?q=%23BeFo17'},
   {'name': 'Untersuchungsausschuss',
    'promoted_content': None,
    'query': 'Untersuchungsausschuss',
    'tweet_volume': None,
    'url': 'http://twitter.com/search?q=Untersu
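The trend responses are plain nested lists and dictionaries, so they can be processed with ordinary Python. The following sketch ranks trends by `tweet_volume` (which may be `None`) and looks for trends that two locations have in common. The sample data is an abbreviated copy of the outputs above; the helper names `by_volume` and `trend_names` are made up.

```python
# Abbreviated copies of the two responses shown above.
world = [{"trends": [
    {"name": "#FelizLunes", "tweet_volume": 42393},
    {"name": "#mondaymotivation", "tweet_volume": 96637},
    {"name": "#DiaDoAbraço", "tweet_volume": None},
    {"name": "#DTBYRomance", "tweet_volume": 423722},
]}]
germany = [{"trends": [
    {"name": "#Schmidt", "tweet_volume": None},
    {"name": "#TrumpInIsrael", "tweet_volume": 14888},
]}]


def by_volume(response):
    """Sort trends by tweet volume, treating a missing volume as 0."""
    trends = response[0]["trends"]
    return sorted(trends, key=lambda t: t["tweet_volume"] or 0, reverse=True)


def trend_names(response):
    """Return the set of trend names in one response."""
    return {t["name"] for t in response[0]["trends"]}


print([t["name"] for t in by_volume(world)])
# ['#DTBYRomance', '#mondaymotivation', '#FelizLunes', '#DiaDoAbraço']
print(trend_names(world) & trend_names(germany))  # set()
```

With live data, `world` and `germany` would simply be the return values of the two `twitter_api.trends.place(...)` calls.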

## Time to do Exercise II in Exercises - Crawling Twitter in Python

Go to the Exercise folder and complete Exercise II in "Exercises - Crawling Twitter in Python".

# Literature

- **Matthew A. Russell, ["Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More"](http://miningthesocialweb.com/), O'Reilly Media, 2013**
- Fielding, Roy T.; Gettys, James; Mogul, Jeffrey C.; Nielsen, Henrik Frystyk; Masinter, Larry; Leach, Paul J.; Berners-Lee, Tim (June 1999). ["Hypertext Transfer Protocol -- HTTP/1.1"](https://tools.ietf.org/html/rfc2616). IETF. RFC 2616.
- Fielding, Roy Thomas (2000). ["Architectural Styles and the Design of Network-based Software Architectures"](https://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm). Dissertation. University of California, Irvine.
- Hardt, D., Ed. (2012). ["The OAuth 2.0 Authorization Framework, draft-ietf-oauth-v2-31"](https://tools.ietf.org/html/draft-ietf-oauth-v2-31). OAuth Working Group, Internet Draft.