In [1]:
from IPython.display import display
import pandas as pd
pd.options.display.max_columns = None # Display all columns of a dataframe
pd.options.display.max_rows = 700
from pprint import pprint
import re

# Week 2: Working with Big Datasets

## Motivation

**Publication of crawling papers by year**

![Publication of crawling papers by year](images/publication_crawling_papers_by_year.png)

*Source*: Claussen, Jörg and Peukert, Christian, **Obtaining Data from the Internet: A Guide to Data Crawling in Management Research** (June 2019). Available at SSRN: https://ssrn.com/abstract=3403799 or http://dx.doi.org/10.2139/ssrn.3403799
    

**General objective of the notebook**: construct a dataset with the **tweets** of the current U.S. members of Congress (Senate + House) with information on their party **affiliation**

**Three sources of data**:
 1. **List of U.S. representatives**: **webscraped** from [ballotpedia](https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress)
 2. **Twitter accounts** of the U.S. representatives, from a [hand-labeled dataset](https://github.com/vegetable68/Midterm-2018-candidates) compiled by Yiqing Hua for all candidates.
 3. **Tweets** published on these Twitter accounts
 
**2 merge operations**:
- 1+2: select only the elected representatives among the candidates present in 2
- 3+2+1: tweets associated with their author + party affiliation

## Screen scraping

What is webscraping?

<img src="images/screenscraping.png">

Source: [SICSS](https://compsocialscience.github.io) 

Points to keep in mind:
- It may or may not be legal
- Webscraping is tedious and frustrating

Main challenges:
- Variety of websites and webpages
- Durability of code, as websites constantly change

## Typical Steps of Webscraping

### Exploring the Website

We will scrape the list of current members of the U.S. Congress because it will be useful later in the class!
<img src="images/ballotpedia.png">

Source: [ballotpedia website](https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress) 

### Understanding URLs
- Base URL: https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress
- More complex URL with query parameter https://ballotpedia.org/wiki/index.php?search=jerry&searchToken=elnan6bftyqukadgu8xb2rtbg
    - query parameters: `search=jerry` and `searchToken=elnan6bftyqukadgu8xb2rtbg`
    - can be used to crawl websites if you have a list of queries that you want to loop over (e.g. dates, localities...)
    - query structure:
        - *Start*: `?`
        - *Information*: pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (key=value). 
        - *Separator*: `&`, placed between query parameters when there are several
        
Another example of a URL: https://opendata.swiss/en/dataset?political_level=commune&q=health. Try changing the search and selection parameters on the website and observe how that affects the URL.
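As a minimal sketch of the query structure described above, Python's standard `urllib.parse` module can assemble a URL from key-value pairs and take it apart again. The list `search_terms` is made up for illustration; it only shows how one could loop over a set of queries.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://opendata.swiss/en/dataset"

# Key-value pairs are joined with "=" and separated by "&"
params = {"political_level": "commune", "q": "health"}
url = base_url + "?" + urlencode(params)
print(url)

# Going the other way: split a URL and read its query parameters
query = parse_qs(urlsplit(url).query)
print(query)

# Hypothetical list of search terms: build one URL per query
search_terms = ["health", "education", "transport"]
urls = [base_url + "?" + urlencode({"q": term}) for term in search_terms]
print(urls)
```

`urlencode` also escapes characters that are not allowed in a URL (spaces, accents), which is easy to get wrong when concatenating strings by hand.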

### Inspect the site Using Developer Tools
We use the browser's `Inspect` tool (right click on the page) to explore the underlying HTML interactively. 
<img src="images/ballotpedia_inspect.png">

**R users** 

The logic shown hereafter has its direct equivalent in `R`. See [this post](https://towardsdatascience.com/web-scraping-tutorial-in-r-5e71fd107f32) for examples of the most useful functions. 

### HTML parsing
In this example, we scrape **static HTML content**: the server that hosts the site sends back HTML documents that already contain all the data you’ll get to see as a user.

In [2]:
import urllib.request # Python's module for opening URLs (importing `urllib` alone does not guarantee that `urllib.request` is loaded)
url='https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress'

In [3]:
page = urllib.request.urlopen(url) # open the web page
html = page.read() # read web page contents as bytes (note the b'...' prefix in the output)
print("-- first 400 characters --", html[:400]) 
print("-- last 400 characters --", html[-400:])
print("-- length of string --", len(html))

-- first 400 characters -- b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of current members of the U.S. Congress - Ballotpedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace'
-- last 400 characters -- b'h(url, {method: \'POST\', mode: \'no-cors\',body: JSON.stringify(p_json_final),credentials: \'include\',headers: {\'Content-Type\': \'text/plain\',}});\ndocument.write(\'<scri\'+\'pt src="https://fkrkkmxsqeb5bj9r.s3.amazonaws.com/384.js" type="text/javascript"></scri\'+\'pt>\');\n}\n</script><script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgBackendResponseTime":347});});</script>\t</body>\n</html>\n'
-- length of string -- 279358
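The `b'...'` prefix in the output above indicates that `page.read()` returns bytes rather than a text string. The sketch below uses a short, made-up byte string as a stand-in for the downloaded page and shows how decoding turns it into text. It assumes the page is encoded in UTF-8, as declared by the `<meta charset="UTF-8"/>` tag.

```python
# Made-up stand-in for the bytes returned by page.read(); the real page is much longer
html_bytes = b'<html>\n<head>\n<meta charset="UTF-8"/>\n<title>Caf\xc3\xa9</title>\n</head>\n</html>\n'

# bytes -> str, using the encoding declared in the page
html_text = html_bytes.decode("utf-8")

print(type(html_bytes).__name__, len(html_bytes))  # length counted in bytes
print(type(html_text).__name__, len(html_text))    # length counted in characters
```

The two lengths differ as soon as the page contains non-ASCII characters, because such characters take more than one byte in UTF-8. `BeautifulSoup` accepts either bytes or text, so decoding is only needed when you work on the raw content yourself.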


In [4]:
# Parse raw HTML
from bs4 import BeautifulSoup # package for parsing HTML
soup = BeautifulSoup(html, 'html.parser') # parse html of web page
print("-- title item:", soup.title) 

-- title item: <title>List of current members of the U.S. Congress - Ballotpedia</title>


In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of current members of the U.S. Congress - Ballotpedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_current_members_of_the_U.S._Congress","wgTitle":"List of current members of the U.S. Congress","wgCurRevisionId":8033620,"wgRevisionId":8033620,"wgArticleId":180048,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Unique congress pages"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMont

In [6]:
# extract text
text = soup.get_text() # get text (remove HTML markup)
lines = text.splitlines() # split string into separate lines
print("-- Number of lines:", len(lines))

-- Number of lines: 10369


In [7]:
lines = [line for line in lines if line != ''] # drop empty lines
print("-- Number of lines (after dropping empty lines):", len(lines))
print("-- The first 20 lines:", lines[:20])

-- Number of lines (after dropping empty lines): 2564
-- The first 20 lines: ['List of current members of the U.S. Congress - Ballotpedia', 'document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );', '(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_current_members_of_the_U.S._Congress","wgTitle":"List of current members of the U.S. Congress","wgCurRevisionId":8033620,"wgRevisionId":8033620,"wgArticleId":180048,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Unique congress pages"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","Augu

### Scraping a table

#### Find Elements by ID
using the `find` function 

In [8]:
print(soup.find(id="mw-content-text"))

<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en"><p>The <b>United States Congress</b> is the <a href="/Bicameralism" title="Bicameralism">bicameral</a> legislature of the United States of America's federal government. It consists of two houses, the <a href="/United_States_Senate" title="United States Senate">Senate</a> and the <a href="/United_States_House_of_Representatives" title="United States House of Representatives">House of Representatives</a>, with members chosen through direct <a href="/Elections" title="Elections">election</a>.
</p><p><a href="/United_States_Congress" title="United States Congress">Congress</a> has 535 voting members. The Senate has 100 voting officials, and the House has 435 voting officials, along with five delegates and one resident commissioner. 
<!-- /1011927/BP_BTFWindow -->
<div id="div-gpt-ad-1548351617089-0">
<script>
googletag.cmd.push(function() { googletag.display('div-gpt-ad-1548351617089-0'); });
</script>
</div>
</p>
<div cl

In [9]:
results=soup.find(id="mw-content-text")
print(results.prettify())

<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en">
 <p>
  The
  <b>
   United States Congress
  </b>
  is the
  <a href="/Bicameralism" title="Bicameralism">
   bicameral
  </a>
  legislature of the United States of America's federal government. It consists of two houses, the
  <a href="/United_States_Senate" title="United States Senate">
   Senate
  </a>
  and the
  <a href="/United_States_House_of_Representatives" title="United States House of Representatives">
   House of Representatives
  </a>
  , with members chosen through direct
  <a href="/Elections" title="Elections">
   election
  </a>
  .
 </p>
 <p>
  <a href="/United_States_Congress" title="United States Congress">
   Congress
  </a>
  has 535 voting members. The Senate has 100 voting officials, and the House has 435 voting officials, along with five delegates and one resident commissioner.
  <!-- /1011927/BP_BTFWindow -->
  <div id="div-gpt-ad-1548351617089-0">
   <script>
    googletag.cmd.push(function

#### Find Elements by HTML Class Name

In [10]:
soup.find('table', class_='wikitable sortable jquery-tablesorter')

<table border="1" class="wikitable sortable jquery-tablesorter" style="font-size:90%" width="70%">
<tr>
<th>Officeholder name</th>
<th>Office title</th>
<th>Date assumed office</th>
<th>Party affiliation</th>
</tr>
<tr>
<td>
<p><a href="https://ballotpedia.org/Jerry_Moran">Jerry Moran</a>
</p>
</td>
<td>
<p> <a href="https://ballotpedia.org/Jerry_Moran">U.S. Senate Kansas</a>
</p>
</td>
<td>
<p>					2011-01-05
</p>
</td>
<td>
<p>					Republican Party
</p>
</td>
</tr>
<tr>
<td>
<p><a href="https://ballotpedia.org/Pat_Roberts">Pat Roberts</a>
</p>
</td>
<td>
<p> <a href="https://ballotpedia.org/Pat_Roberts">U.S. Senate Kansas</a>
</p>
</td>
<td>
<p>					1997-01-07
</p>
</td>
<td>
<p>					Republican Party
</p>
</td>
</tr>
<tr>
<td>
<p><a href="https://ballotpedia.org/Gary_Peters">Gary Peters</a>
</p>
</td>
<td>
<p> <a href="https://ballotpedia.org/Gary_Peters">U.S. Senate Michigan</a>
</p>
</td>
<td>
<p>					2015-01-06
</p>
</td>
<td>
<p>					Democratic Party
</p>
</td>
</tr>
<tr>
<td>
<p

`find_all` is often more useful than `find`.

In [11]:
tb = soup.find_all('table', class_='wikitable sortable jquery-tablesorter')
len(tb)

2

In [12]:
senate=tb[0] # first element 
print(senate.find_all('tr')[1]) # a row of the table

<tr>
<td>
<p><a href="https://ballotpedia.org/Jerry_Moran">Jerry Moran</a>
</p>
</td>
<td>
<p> <a href="https://ballotpedia.org/Jerry_Moran">U.S. Senate Kansas</a>
</p>
</td>
<td>
<p>					2011-01-05
</p>
</td>
<td>
<p>					Republican Party
</p>
</td>
</tr>


#### Extract Text From HTML Elements

In [13]:
import re # for regular expressions

Cleaning the row: using the `get_text` function

In [14]:
row=senate.find_all('tr')[1]

print("-- Number of rows: {}".format(len(senate.find_all('tr'))))

# The first 2 cells each contain a link: use the <a> tags
for cell in row.find_all('a'):
    print(cell.get_text())

# For the 2 last cells:
for cell in row.find_all('p')[2:4]:
    print(re.sub('\n', '', cell.get_text().lstrip())) # little text trick: wait for the class on text-as-data!

-- Number of rows: 101
Jerry Moran
U.S. Senate Kansas
2011-01-05
Republican Party
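The cleaning step above can be tried on a single string. This is a small sketch in which `raw_cell` is typed by hand to imitate the text of a date cell (leading tabs, trailing line break). It compares the `re.sub` approach used in the notebook with the built-in `str.strip`.

```python
import re

# Hand-typed imitation of the text of a date cell in the table
raw_cell = "\t\t\t\t\t2011-01-05\n"

cleaned_with_regex = re.sub('\n', '', raw_cell.lstrip())  # approach used above
cleaned_with_strip = raw_cell.strip()                      # removes whitespace on both ends

print(repr(cleaned_with_regex))
print(repr(cleaned_with_strip))
```

Both give the same result here. They differ only when a line break sits in the middle of the text: `re.sub` removes it, while `strip` leaves it in place.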


Loop over all rows:

In [15]:
df_senate=pd.DataFrame() # empty dataframe in which the cleaned rows will be stored

for row in senate.find_all('tr'):
    row_dict=dict() # empty dictionary in which the cleaned cells are stored
    i=0
    # The first 2 cells each contain a link: use the <a> tags
    for cell in row.find_all('a'):
        row_dict[i]=[cell.get_text()]
        i=i+1
    # For the 2 last cells:
    for cell in row.find_all('p')[2:4]:
        row_dict[i]=[re.sub('\n', '', cell.get_text().lstrip())]
        i=i+1
    df_row=pd.DataFrame.from_dict(row_dict, orient='columns') # row_dict -> dataframe
    df_senate=pd.concat([df_senate, df_row]) # append the df_row 

In [16]:
df_senate=df_senate.rename(columns={0:'Officeholder name', 1: 'Office title', 2: 'Date assumed office', 3: 'Party affiliation'})
df_senate

Unnamed: 0,Officeholder name,Office title,Date assumed office,Party affiliation
0,Jerry Moran,U.S. Senate Kansas,2011-01-05,Republican Party
0,Pat Roberts,U.S. Senate Kansas,1997-01-07,Republican Party
0,Gary Peters,U.S. Senate Michigan,2015-01-06,Democratic Party
0,Debbie Stabenow,U.S. Senate Michigan,2001-01-03,Democratic Party
0,Tim Kaine,U.S. Senate Virginia,2013-01-03,Democratic Party
0,Mark Warner,U.S. Senate Virginia,2009-01-06,Democratic Party
0,Chris Van Hollen,U.S. Senate Maryland,2017-01-03,Democratic Party
0,Ben Cardin,U.S. Senate Maryland,2007-01-04,Democratic Party
0,Dianne Feinstein,U.S. Senate California,1993,Democratic Party
0,Kamala D. Harris,U.S. Senate California,2017-01-03,Democratic Party


In [17]:
df_senate['Party affiliation'].value_counts()

Republican Party    53
Democratic Party    45
Independent          2
Name: Party affiliation, dtype: int64
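An alternative way to write the row loop, given here as a sketch rather than as the notebook's method, is to read the `<td>` cells directly and collect each row as a dictionary. The HTML snippet is a shortened copy of two rows of the Senate table shown above, and the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Header row plus two rows, shortened from the Senate table above
html_table = """
<table>
<tr><th>Officeholder name</th><th>Office title</th><th>Date assumed office</th><th>Party affiliation</th></tr>
<tr>
<td><p><a href="https://ballotpedia.org/Jerry_Moran">Jerry Moran</a>\n</p></td>
<td><p> <a href="https://ballotpedia.org/Jerry_Moran">U.S. Senate Kansas</a>\n</p></td>
<td><p>\t\t2011-01-05\n</p></td>
<td><p>\t\tRepublican Party\n</p></td>
</tr>
<tr>
<td><p><a href="https://ballotpedia.org/Gary_Peters">Gary Peters</a>\n</p></td>
<td><p> <a href="https://ballotpedia.org/Gary_Peters">U.S. Senate Michigan</a>\n</p></td>
<td><p>\t\t2015-01-06\n</p></td>
<td><p>\t\tDemocratic Party\n</p></td>
</tr>
</table>
"""

columns = ['Officeholder name', 'Office title', 'Date assumed office', 'Party affiliation']
table = BeautifulSoup(html_table, 'html.parser').find('table')

rows = []
for tr in table.find_all('tr'):
    # One cleaned string per <td> cell; the header row has only <th> cells
    cells = [td.get_text().strip() for td in tr.find_all('td')]
    if len(cells) == len(columns):
        rows.append(dict(zip(columns, cells)))

print(rows)
```

Passing `rows` to `pd.DataFrame` would then build the dataframe in one step, with a regular 0, 1, 2, ... index instead of the repeated 0 visible in the output above.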

#### Exercise: construct a dataframe containing the table on the House

The house composition is the second table of the page

In [18]:
house=tb[1] # second element 
print(house.find_all('tr')[1]) # a row of the table

<tr>
<td>
<p><a href="https://ballotpedia.org/Jennifer_Wexton">Jennifer Wexton</a>
</p>
</td>
<td>
<p> <a href="https://ballotpedia.org/Jennifer_Wexton">U.S. House Virginia District 10</a>
</p>
</td>
<td>
<p>					2019-01-03
</p>
</td>
<td>
<p>					Democratic Party
</p>
</td>
</tr>


In [19]:
row=house.find_all('tr')[1]

print("-- Number of rows: {}".format(len(house.find_all('tr'))))

# The first 2 cells each contain a link: use the <a> tags
for cell in row.find_all('a'):
    print(cell.get_text())

# For the 2 last cells:
for cell in row.find_all('p')[2:4]:
    print(re.sub('\n', '', cell.get_text().lstrip())) # little text trick: wait for the class on text-as-data!

-- Number of rows: 436
Jennifer Wexton
U.S. House Virginia District 10
2019-01-03
Democratic Party


Loop over all rows:

In [20]:
df_house=pd.DataFrame() # empty dataframe in which the cleaned rows will be stored

for row in house.find_all('tr'):
    row_dict=dict() # empty dictionary in which the cleaned cells are stored
    i=0
    # The first 2 cells each contain a link: use the <a> tags
    for cell in row.find_all('a'):
        row_dict[i]=[cell.get_text()]
        i=i+1
    # For the 2 last cells:
    for cell in row.find_all('p')[2:4]:
        row_dict[i]=[re.sub('\n', '', cell.get_text().lstrip())]
        i=i+1
    df_row=pd.DataFrame.from_dict(row_dict, orient='columns') # row_dict -> dataframe
    df_house=pd.concat([df_house, df_row]) # append the df_row 

In [21]:
df_house=df_house.rename(columns={0:'Officeholder name', 1: 'Office title', 2: 'Date assumed office', 3: 'Party affiliation'})
df_house

Unnamed: 0,Officeholder name,Office title,Date assumed office,Party affiliation
0,Jennifer Wexton,U.S. House Virginia District 10,2019-01-03,Democratic Party
0,Michael F.Q. San Nicolas,U.S. House Guam At-large District,2019-01-03,Democratic Party
0,Nanette Barragán,U.S. House California District 44,2017-01-03,Democratic Party
0,Andy Biggs,U.S. House Arizona District 5,2017-01-03,Republican Party
0,Elise Stefanik,U.S. House New York District 21,2015-01-06,Republican Party
0,Brett Guthrie,U.S. House Kentucky District 2,2009-01-06,Republican Party
0,Jim Cooper,U.S. House Tennessee District 5,2003-01-07,Democratic Party
0,Hank Johnson,U.S. House Georgia District 4,2007-01-04,Democratic Party
0,John Curtis,U.S. House Utah District 3,2017-11-13,Republican Party
0,Jackie Walorski,U.S. House Indiana District 2,2013-01-03,Republican Party


In [22]:
df_house['Party affiliation'].value_counts()

Democratic Party         235
Republican Party         197
Independent                1
Libertarian Party          1
New Progressive Party      1
Name: Party affiliation, dtype: int64

### Going further

There are also **dynamic websites**: the server does not send back the full HTML content. Instead, your browser receives JavaScript code and executes it to build the page. This content cannot be retrieved from the static HTML, so parsing with `BeautifulSoup` alone is not enough: the JavaScript has to be executed, as a browser does. 

Solutions: 
- Use `requests-html` 
- Simulate a browser using [selenium](https://selenium-python.readthedocs.io/) 

## Data on politicians with info on party and Twitter accounts

We need to find (or build from scratch) a dataset with information on the politicians. Most importantly, we need a link to their Twitter account and their party affiliation. 

Such a dataset has been constructed by Yiqing Hua (Cornell Tech) for the candidates in the 2018 U.S. midterm elections, including their Twitter handles. The data come from https://github.com/vegetable68/Midterm-2018-candidates

Data = full list of candidates running for House and Senate, as well as gubernatorial candidates from Ballotpedia

In [23]:
# read file with pandas (stored on github)
df = pd.read_csv('https://raw.githubusercontent.com/vegetable68/Midterm-2018-candidates/master/candidates.csv')
df.head()

Unnamed: 0,candidate_name,created_at,description,district,followers_count,friends_count,location,party,position,state,url,gender,twitter id,twitter handle,twitter name
0,Jackie Speier,2009-03-17 17:02:38,Represents California's 14th Congressional Dis...,14,123863,22020,,democratic,house,california,https://t.co/kqVDVfirna,female,24913070.0,RepSpeier,Jackie Speier
1,Jackie Walorski,2008-04-27 23:31:05,Honored to represent Indiana's Second District...,2,3404,1089,"Elkhart, IN",republican,house,indiana,https://t.co/r6FCJCAzyx,female,14562870.0,jackiewalorski,Jackie Walorski
2,Jackie Walorski,2013-01-06 15:41:40,Representing Indiana's Second Congressional Di...,2,15894,697,,republican,house,indiana,http://t.co/1c3JNm0kJ5,female,1065995000.0,RepWalorski,Jackie Walorski
3,Frankie Robbins,2012-06-19 20:26:31,SUPPORT FRANKIE ROBBINS FOR OKLAHOMA'S THIRD C...,3,14,4,,democratic,house,oklahoma,,male,612838200.0,robbinsforok,Frankie Robbins
4,Sri Preston Kulkarni,2011-05-30 9:58:58,"Democratic nominee for US Congress-TX 22, Hous...",22,3732,226,"Sugar Land-Pearland-Katy, TX",democratic,house,texas,https://t.co/CIOOdDaRoW,male,307808400.0,SriPKulkarni,Sri Preston Kulkarni


In [24]:
df['party'].value_counts()

democratic     687
republican     614
third party      2
Name: party, dtype: int64

In [25]:
df['gender'].value_counts()

male      945
female    358
Name: gender, dtype: int64

In [26]:
df_candidates=df[['candidate_name', 'party', 'twitter handle']]

Merge with the House and Senate data: this keeps only the elected candidates

In [27]:
df_congress= pd.concat([df_house, df_senate])
df_congress.shape

(535, 4)

In [28]:
print(df_candidates.shape)
print("Number of unique candidate names", len(df_candidates['candidate_name'].unique()))

(1303, 3)
Number of unique candidate names 924


In [29]:
df_merged_all=pd.merge(df_congress, df_candidates, right_on='candidate_name', left_on='Officeholder name')

In [30]:
print("--Result of the merge:")
print("Number of twitter accounts from candidates:", df_candidates.shape[0])
print("Number of twitter accounts from US representative:", df_merged_all.shape[0])
print("Correspond to {} politicians (often having 2 accounts)".format(len(df_merged_all['Officeholder name'].unique())))
print("Number US representative:", df_congress.shape[0])
print("Share of US representative with a twitter account:", len(df_merged_all['Officeholder name'].unique())/df_congress.shape[0])

--Result of the merge:
Number of twitter accounts from candidates: 1303
Number of twitter accounts from US representative: 613
Correspond to 373 politicians (often having 2 accounts)
Number US representative: 535
Share of US representative with a twitter account: 0.697196261682243
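About 30% of the members of Congress find no match in the candidate list. One way to see who they are, sketched below on toy data, is a left merge with `indicator=True`. The two small dataframes are stand-ins for `df_congress` and `df_candidates`, with names taken from the tables above. Whether a given member is really absent from the full candidate file is not checked here.

```python
import pandas as pd

# Toy stand-ins for df_congress and df_candidates
congress = pd.DataFrame({
    'Officeholder name': ['Jackie Walorski', 'Jerry Moran'],
    'Party affiliation': ['Republican Party', 'Republican Party'],
})
candidates = pd.DataFrame({
    'candidate_name': ['Jackie Walorski', 'Jackie Walorski', 'Frankie Robbins'],
    'twitter handle': ['jackiewalorski', 'RepWalorski', 'robbinsforok'],
})

# how='left' keeps every member of Congress; indicator=True adds a `_merge` column
merged = pd.merge(congress, candidates, how='left',
                  left_on='Officeholder name', right_on='candidate_name',
                  indicator=True)
print(merged[['Officeholder name', 'twitter handle', '_merge']])

# Members of Congress without any matching candidate row
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Officeholder name'].tolist()
print(unmatched)
```

Because the merge key is the name as a plain string, a member is also left unmatched when the two sources spell the name differently (a middle initial, an accent), so the list of unmatched names is worth inspecting by eye.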


List of Twitter accounts, useful for the following task

In [31]:
account_list = df_merged_all['twitter handle'].tolist()
print('First 3 elements:', account_list[:3])
print('Number of twitter account studied:', len(account_list))

First 3 elements: ['JenniferWexton', 'Nanette4CA', 'RepBarragan']
Number of twitter account studied: 613


## Application Programming Interfaces (API)
### What Is an API?

**APIs are tools for building apps or other forms of software that help people access certain parts of large databases**

The website [Programmable Web](https://www.programmableweb.com/apis/directory) lists more than 225,353 APIs from sites as diverse as Google, Amazon, YouTube, the New York Times, del.icio.us, LinkedIn, and many others.

<img src="images/growth_in_web_api.png">

Source: [Programmable Web](https://www.programmableweb.com/news/apis-show-faster-growth-rate-2019-previous-years/research/2019/07/17) 


### How Does an API Work?

Better than webscraping if possible because: 
- More stable than webpages
- No HTML but already structured data (e.g. in `json`)
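To illustrate what "already structured" means, here is a minimal sketch that parses a JSON response body with Python's standard `json` module. The string `response_body` is invented for the example and only loosely imitates what an API might return.

```python
import json

# Invented response body, as an API could send it over the network (plain text)
response_body = '{"id": 1, "text": "Hello", "user": {"screen_name": "example_user", "followers_count": 10}}'

record = json.loads(response_body)  # JSON text -> Python dictionary

print(type(record).__name__)
print(record['text'])
print(record['user']['screen_name'])
```

No tag searching or text cleaning is needed: each field is reached directly by its key.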

### API Credentials
To prevent software developers from collecting huge amounts of individual data, many APIs require you to obtain “credentials”: codes or passwords that identify you and determine which types of data you are allowed to access. 

### Rate Limiting
The credentials not only define what type of information we are allowed to access, but also how often we are allowed to make requests for such data. 
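A common way to cope with rate limits is to pause and retry when the quota is exhausted. The sketch below shows the idea in plain Python. `RateLimitReached`, `call_with_wait` and `flaky_request` are hypothetical names, not part of any library, and the default pause of 900 seconds assumes a 15-minute quota window.

```python
import time

class RateLimitReached(Exception):
    """Hypothetical error raised by a client when the request quota is exhausted."""

def call_with_wait(request, wait_seconds=900, max_attempts=3, sleep=time.sleep):
    """Call `request()`; on a rate-limit error, pause and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except RateLimitReached:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(wait_seconds)

# Fake request that fails twice before succeeding, to exercise the function
attempts = []

def flaky_request():
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimitReached()
    return {"status": "ok"}

result = call_with_wait(flaky_request, wait_seconds=0)  # no real pause in this demo
print(result, "after", len(attempts), "attempts")
```

Client libraries often provide this behaviour as a built-in option, in which case there is no need to write the loop yourself.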

## Why Use Twitter's API?

- Increasingly used in political science and economics 
    - Allyson L. Benton & Andrew Q. Philips, 2020. **"Does the @realDonaldTrump Really Matter to Financial Markets?,"** *American Journal of Political Science*, John Wiley & Sons, vol. 64(1), pages 169-190, January. [Website](https://onlinelibrary.wiley.com/doi/10.1111/ajps.12491)
    - Petrova, Maria, Sen, Ananya and Yildirim, Pinar, **Social Media and Political Donations: New Technology and Incumbency Advantage in the United States** (September 8, 2016). [SSRN](https://ssrn.com/abstract=2836323)
    - **Analyzing Polarization in Social Media: Method and Application to Tweets on 21 Mass Shootings** by Dorottya Demszky, Nikhil Garg, Matthew Gentzkow,  Rob Voigt, James Zou, Jesse M. Shapiro, and Dan Jurafsky, 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). June 2019. [arxiv](https://arxiv.org/abs/1904.01596)
- As an example for using an API


## An Example with Twitter’s API

### How to apply for a developer account
Developers first need a Twitter account: this tutorial assumes that you already have one. 

How do you obtain credentials from Twitter that allow you to make API calls? 

1. Create an account (https://apps.twitter.com) in order to receive credentials.
2. Create a developer account by clicking "Apply for a developer account".
3. Confirm your email address or add a mobile phone number. Two-factor authentication helps Twitter prevent people from obtaining a large number of different credentials through multiple accounts, which could be used to collect large amounts of data without being rate limited, or for other nefarious purposes such as creating armies of bots that produce spam or attempt to influence elections.
4. Answer a series of questions about how you want to use Twitter’s API and accept the terms of service.
5. Once you accept the terms, your developer request is reviewed by Twitter. This takes time (from 1-2 days to a week).

### Create an Application & get your authentication details
1. Once the developer account is approved, go to your profile tab and select Apps. Create an app and fill in the details.
2. Click on `details`
3. Click on `Keys and tokens`. This is where you get the relevant keys (you will have to regenerate and copy the tokens):
    - API key
    - API secret key
    - Access token
    - Access token secret


After registering to the Twitter API, you get:

In [32]:
# these four values are issued when you create an application on Twitter as a developer
consumer_key="7X8q1LteL1qOLRg4DSoiI0lyk"
consumer_secret="Bb08vV5XoxDEP4SGLJfkvEpuwvxEDvzVSRGTptzQCjC9XVmquP"
access_token="1230145588659998721-EQb8DoMYnorBtqAuEPMZtZcXumzuAe"
access_token_secret="mzurpOu1LWhj0NU9pctgYZ6OBzhQvyqw8hvRNpe3yI4Qc"
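Credentials written directly in a notebook are easy to leak when the notebook is shared. A common alternative, sketched here under assumptions, is to read them from environment variables. The variable names (`TWITTER_CONSUMER_KEY`, ...) and the function `load_credentials` are made up for this example; any naming scheme works as long as the variables are set before starting Jupyter.

```python
import os

def load_credentials(env=os.environ):
    """Read the four Twitter credentials from environment variables (hypothetical names)."""
    names = ['TWITTER_CONSUMER_KEY', 'TWITTER_CONSUMER_SECRET',
             'TWITTER_ACCESS_TOKEN', 'TWITTER_ACCESS_TOKEN_SECRET']
    missing = [name for name in names if name not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)

# Demonstration with a plain dictionary standing in for os.environ
fake_env = {
    'TWITTER_CONSUMER_KEY': 'key',
    'TWITTER_CONSUMER_SECRET': 'secret',
    'TWITTER_ACCESS_TOKEN': 'token',
    'TWITTER_ACCESS_TOKEN_SECRET': 'token-secret',
}
demo_credentials = load_credentials(fake_env)
print(len(demo_credentials), "credentials loaded")
```

In real use one would call `load_credentials()` without argument and unpack the result into `consumer_key`, `consumer_secret`, `access_token` and `access_token_secret`.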

## Accessing the Twitter API using `tweepy`

We use the `tweepy` package (documentation: https://tweepy.readthedocs.io/en/latest/). Tweepy is *an easy-to-use Python library for accessing the Twitter API*.

R users can use [rtweet](https://rtweet.info/), a similar package. 

Twitter requires all requests to use `OAuth` for authentication

In [33]:
import tweepy
from tweepy import OAuthHandler

Authenticate to Twitter

In [34]:
auth = OAuthHandler(consumer_key, consumer_secret) #creating an OAuthHandler instance
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

We specify `wait_on_rate_limit=True` and `wait_on_rate_limit_notify=True`: once you have reached your rate limit, the API method waits until the limit resets and prints a message.

In [35]:
# test authentication
try:
    api.verify_credentials()
    print("Authentication OK")
except Exception as error: # a bare `except:` would also swallow KeyboardInterrupt
    print("Error during authentication:", error)

Authentication OK


For an extensive list of the methods available, see the [API Reference page](https://tweepy.readthedocs.io/en/latest/api.html#api-reference). There are several types of methods. The following methods enable you to access twitter content:
- Timeline methods return a list of  `status` objects
- Status methods return a `status` object
- User methods return `user` object or a list of `user` objects. 
- Favorite methods: return a list of  `status` objects

Some methods let you interact with Twitter:
- Friendship methods return a `user` object, for example:
    - `create_friendship`: creates a new friendship with the specified user
    
Let's review some useful methods:

### Methods returning a `status` object (or a list of objects)
#### Search method

Use this method if you are looking for conversations on a particular topic: it returns a collection of relevant public tweets matching a specified query.

In [36]:
# most recent tweets about ETH 
tweets = api.search(q="ETH Zürich", lang="en")
for tweet in tweets:
    print(tweet.text) # print the text of each tweet

RT @RobertFinger1: In case you missed it: the Agricultural Economics and Policy Group at ETH Zürich @ETH_en is looking for a 

Postdoctoral…
RT @RobertFinger1: In case you missed it: the Agricultural Economics and Policy Group at ETH Zürich @ETH_en is looking for a 

Postdoctoral…
RT @RobertFinger1: In case you missed it: the Agricultural Economics and Policy Group at ETH Zürich @ETH_en is looking for a 

Postdoctoral…
RT @RobertFinger1: In case you missed it: the Agricultural Economics and Policy Group at ETH Zürich @ETH_en is looking for a 

Postdoctoral…
RT @RobertFinger1: In case you missed it: the Agricultural Economics and Policy Group at ETH Zürich @ETH_en is looking for a 

Postdoctoral…
RT @RobertFinger1: In case you missed it: the Agricultural Economics and Policy Group at ETH Zürich @ETH_en is looking for a 

Postdoctoral…
RT @RobertFinger1: In case you missed it: the Agricultural Economics and Policy Group at ETH Zürich @ETH_en is looking for a 

Postdoctoral…
RT @RobertFin

The `status` object:

In [37]:
pprint(tweets[0]) # for the first tweet

Status(_api=<tweepy.api.API object at 0x00000207C8D2DF48>, _json={'created_at': 'Fri May 01 20:23:45 +0000 2020', 'id': 1256318414731313158, 'id_str': '1256318414731313158', 'text': 'RT @RobertFinger1: In case you missed it: the Agricultural Economics and Policy Group at ETH Zürich @ETH_en is looking for a \n\nPostdoctoral…', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'RobertFinger1', 'name': 'Robert Finger', 'id': 534117142, 'id_str': '534117142', 'indices': [3, 17]}, {'screen_name': 'ETH_en', 'name': 'ETH Zurich', 'id': 204279080, 'id_str': '204279080', 'indices': [100, 107]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 63787059, 'id_str'

In [38]:
pprint(tweets[0]._api) # for the first tweet

<tweepy.api.API object at 0x00000207C8D2DF48>


In [39]:
pprint(tweets[0]._json) # for the first tweet

{'contributors': None,
 'coordinates': None,
 'created_at': 'Fri May 01 20:23:45 +0000 2020',
 'entities': {'hashtags': [],
              'symbols': [],
              'urls': [],
              'user_mentions': [{'id': 534117142,
                                 'id_str': '534117142',
                                 'indices': [3, 17],
                                 'name': 'Robert Finger',
                                 'screen_name': 'RobertFinger1'},
                                {'id': 204279080,
                                 'id_str': '204279080',
                                 'indices': [100, 107],
                                 'name': 'ETH Zurich',
                                 'screen_name': 'ETH_en'}]},
 'favorite_count': 0,
 'favorited': False,
 'geo': None,
 'id': 1256318414731313158,
 'id_str': '1256318414731313158',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_repl

#### Small introduction to `json` format and dictionaries

`JSON` (`JavaScript Object Notation`) is a popular data format used for representing structured data. See the chapter in the [Hitchhiker’s Guide to Python](https://docs.python-guide.org/scenarios/json/)

The object is a **dictionary**. Dictionaries are Python objects associating keys with values. Values can be any Python object: scalar, string, list, dictionary... Keys must be hashable objects, typically strings or numbers. If a value is itself a dictionary, the overall dictionary has a hierarchical structure.

See chapter 3.1 of [Python for Data Analysis](https://learning.oreilly.com/library/view/python-for-data/9781491957653/) for more on built-in data structures, including dictionaries. 

In [40]:
empty_dict={} # dict are defined by curly braces
d1 = {'a' : 'some value', 'b' : [1, 2, 3, 4], 'c' : {'c1': 10, 'c2':20}}
d1['d']='more' # add a key-value pair in d1
print(d1)

{'a': 'some value', 'b': [1, 2, 3, 4], 'c': {'c1': 10, 'c2': 20}, 'd': 'more'}


In [41]:
#navigating in the dictionary using the keys:
print(d1['a'])

some value


In [42]:
print(d1['c']['c2']) # keys can be chained: a handy way to reach a nested element

20


In [43]:
print(tweets[0]._json.keys()) # Keys of dictionary (for the first tweet)

dict_keys(['created_at', 'id', 'id_str', 'text', 'truncated', 'entities', 'metadata', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang'])
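Tweet dictionaries mix nested dictionaries and lists. The sketch below copies part of the `entities` entry printed above into a small dictionary (`tweet_json` is a name chosen for the example) and shows how to walk through it, as well as how `.get` avoids an error when a key is absent.

```python
# Excerpt of the tweet dictionary printed above (only part of `entities` is kept)
tweet_json = {
    'entities': {
        'hashtags': [],
        'user_mentions': [
            {'id': 534117142, 'name': 'Robert Finger', 'screen_name': 'RobertFinger1'},
            {'id': 204279080, 'name': 'ETH Zurich', 'screen_name': 'ETH_en'},
        ],
    },
}

# `user_mentions` is a list of dictionaries: loop over it to collect one field
mentions = tweet_json['entities']['user_mentions']
mentioned_names = [mention['screen_name'] for mention in mentions]
print(mentioned_names)

# .get returns a default value instead of raising KeyError when the key is absent
place = tweet_json.get('place', 'no place recorded')
print(place)
```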


**Exercise**: access the screen name of the user who posted the first tweet

#### home_timeline
Returns the 20 most recent statuses, including retweets, posted by the authenticating user and that user’s friends. This is the equivalent of /timeline/home on the Web.

In [44]:
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)

#PlantBasedParty 🥳 https://t.co/UBL0h0DXwp
RT @Earth_Burger: It’s a #PlantBasedParty! 😍🎉 Buy any @BeyondMeat Burger from @Earth_Burger and help feed those who need it the most! 🧡🚀 Th…
RT @mollyelwood: Aah, a good excuse to get takeout this weekend. 🎉

1. Purchase a vegan meal for takeout/delivery
2. Post a photo of it to…
Ain't no party like a #plantbasedparty 🎉 https://t.co/xLNLBxfhXt
Help us FEED A MILLION+
May 1-3, buy a plant-based takeout/delivery meal &amp; post a pic, tagging #PlantBasedParty.… https://t.co/Y8WpWC1cyc
We are grateful for you and all of the other essential workers/ heroes out there. Thank YOU! https://t.co/u4jgpEDdPp
Though our everyday feels a bit different right now, our unwavering passion and drive to be the brand that lets peo… https://t.co/tNPhgM1jyM
RT @JavSloko: Beyond meat burgers are really good 10/10 would recommend @BeyondMeat
RT @ashanti: I’m so proud to be apart of this!! 😆@beyondmeat has made a pledge to donate over 1 million Beyond Burgers to those

#### user_timeline
Returns the most recent statuses posted by the specified user. The overall rate limit for this method is 100,000 calls during any single 24-hour period. This translates to the timelines of 100,000 users (up to the 200 most recent posts each).

In [45]:
timeline = api.user_timeline(user_id=46182536, count=2)
print(len(timeline))

2
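
As a rough back-of-the-envelope check on the daily limit quoted above, the sketch below computes the smallest constant pause between calls that stays within 100,000 calls per day. The account count `n_accounts` is a hypothetical workload, and the calculation ignores any shorter rate-limit windows the API may also enforce.

```python
# Number taken from the text above: 100,000 calls per 24-hour period
CALLS_PER_DAY = 100_000
SECONDS_PER_DAY = 24 * 60 * 60

# Smallest constant pause between calls that stays within the daily quota
min_delay = SECONDS_PER_DAY / CALLS_PER_DAY
print(f"Minimum average delay between calls: {min_delay:.3f} s")

# Hypothetical workload: one call per account (n_accounts is an assumption)
n_accounts = 613
quota_share = n_accounts / CALLS_PER_DAY
print(f"Share of the daily quota used: {quota_share:.2%}")
```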


### Methods returning a `user` object (or a list of objects)
- `me` returns the authenticated user's information

In [46]:
api.me()

User(_api=<tweepy.api.API object at 0x00000207C8D2DF48>, _json={'id': 1230145588659998721, 'id_str': '1230145588659998721', 'name': 'Andreas Eckmann', 'screen_name': 'AndreasEckmann', 'location': '', 'profile_location': None, 'description': '', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 0, 'friends_count': 1, 'listed_count': 0, 'created_at': 'Wed Feb 19 15:02:28 +0000 2020', 'favourites_count': 0, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 0, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': None, 'profile_background_image_url_https': None, 'profile_background_tile': False, 'profile_image_url': 'http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png', 'profile_image_url_https': 'https://abs.twimg.com/sticky/default_profile_images/default_

- The `get_user` method returns information about the specified user.

In [47]:
target=account_list[0] #'JenniferWexton'
user = api.get_user(target) # argument = id, user_id, screen_name
pprint(user._json)

{'contributors_enabled': False,
 'created_at': 'Sun Nov 24 18:38:14 +0000 2013',
 'default_profile': True,
 'default_profile_image': False,
 'description': 'Mom to two boys & two rescue labs. Former state Senator, '
                'prosecutor, and advocate for abused children. Congresswoman '
                'for #VA10.',
 'entities': {'description': {'urls': []},
              'url': {'urls': [{'display_url': 'jenniferwexton.com',
                                'expanded_url': 'http://jenniferwexton.com',
                                'indices': [0, 23],
                                'url': 'https://t.co/S9LraqZK1g'}]}},
 'favourites_count': 2698,
 'follow_request_sent': False,
 'followers_count': 24884,
 'following': False,
 'friends_count': 479,
 'geo_enabled': True,
 'has_extended_profile': True,
 'id': 2212905492,
 'id_str': '2212905492',
 'is_translation_enabled': False,
 'is_translator': False,
 'lang': None,
 'listed_count': 432,
 'location': 'Leesburg, VA',
 'name': 'Jen

Some attributes of the `user` object:

In [48]:
print("Name:", user.name)
print("Screen name:", user.screen_name)
print("Number of followers:" ,  user.followers_count)
pprint("description: " + user.description)
pprint("Number of tweets published: " + str(user.statuses_count))
pprint("friends_count: " + str(user.friends_count))

Name: Jennifer Wexton
Screen name: JenniferWexton
Number of followers: 24884
('description: Mom to two boys & two rescue labs. Former state Senator, '
 'prosecutor, and advocate for abused children. Congresswoman for #VA10.')
'Number of tweets published: 3651'
'friends_count: 479'


- `followers` returns the user's followers
- `search_users` searches for users

### A Friendship method: followers_ids
This method returns the IDs of the most recent followers of a particular user (pass `screen_name` as a parameter).
It is useful if you want to get all the tweets on the timeline of a particular user. 

In [49]:
followers=api.followers_ids(screen_name=target)
print(followers[0:10])

[707746840918102016, 1254202320671440896, 2743430429, 1248060370058534913, 967303105, 1181387942578798592, 41480222, 1022595171672895490, 503908482, 161032197]


Fetch the 10 most recent tweets published by this account:

In [50]:
tweets = api.user_timeline(screen_name = target, count = 10, include_rts = True)

### Looping over `account_list` 

Handling the rate limit imposed by the API

In [51]:
import time
time.sleep(3) # wait for three seconds
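
A fixed pause is the simplest approach. A common alternative is to retry a failed call after a pause that grows with each attempt. The sketch below illustrates the idea with a toy function; `FakeRateLimitError`, `call_with_retry` and `flaky` are hypothetical names invented for this example and are not part of `tweepy`.

```python
import time

class FakeRateLimitError(Exception):
    """Stand-in for an API rate-limit exception (hypothetical)."""

def call_with_retry(func, max_retries=3, base_delay=0.01):
    """Call func(); on a rate-limit error, wait and retry with a doubled delay."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return func()
        except FakeRateLimitError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            time.sleep(delay)
            delay *= 2

# Toy function that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimitError("slow down")
    return "ok"

result = call_with_retry(flaky)
print(result, "after", calls["n"], "calls")
```

With a real API the delays would be seconds or minutes rather than hundredths of a second. `tweepy` can also wait automatically when the `API` object is created with `wait_on_rate_limit=True`.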

In [52]:
nb_tweets_by_target=2
print("We aim at fetching {} tweets".format(len(account_list)* nb_tweets_by_target))

We aim at fetching 1226 tweets


In [53]:
%%time 
# to get an idea of how long it takes

df_tweets=pd.DataFrame() # empty dataframe where the tweets will be saved

if len(account_list) > 0:
    
    # Restricting the search for the first 10 accounts
    for target in account_list[:10]:
        
        # try the following:
        try:
            # Fetch nb_tweets_by_target for target
            tweets = api.user_timeline(screen_name = target, count = nb_tweets_by_target, include_rts = False)
            
            # Put the tweets into a dataframe object
            tweet_count=0
            for tweet in tweets:
                # 1. Transform the json into a dataframe
                df_tweet=pd.DataFrame.from_dict(tweet._json, orient='index', columns=[tweet_count]) # , sleep_on_rate_limit=True
                # 2. adds screen name as a row
                df_tweet=df_tweet.append(pd.DataFrame({tweet_count:[target]}, 
                                                      index=['twitter handle']))
                # 3. Add the tweet dataframe to the df_tweets dataframe
                df_tweets=pd.concat([df_tweet, df_tweets], axis=1)
                
                # counting the number of tweets fetched for this target
                tweet_count += 1 
                
            time.sleep(0.5)
            
        # RateLimitError is a subclass of TweepError, so it must be caught first
        except tweepy.RateLimitError:
            print("resource usage limit: {} skipped".format(target))
            time.sleep(0.3)
            
        # any other TweepError, e.g. when the user has protected tweets
        except tweepy.TweepError:
            print("Failed to run the command on user {}, Skipping...".format(target))

Wall time: 8.14 s


In [54]:
df_tweets=df_tweets.transpose() # Transpose the dataset
print(df_tweets.columns)
print(df_tweets.shape)

Index(['contributors', 'coordinates', 'created_at', 'entities',
       'extended_entities', 'favorite_count', 'favorited', 'geo', 'id',
       'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'place',
       'possibly_sensitive', 'quoted_status', 'quoted_status_id',
       'quoted_status_id_str', 'retweet_count', 'retweeted', 'source', 'text',
       'truncated', 'twitter handle', 'user'],
      dtype='object')
(16, 29)
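
An alternative to concatenating inside the loop and transposing afterwards is to collect one dictionary per tweet in a list and build the dataframe once at the end. The sketch below uses invented dictionaries as stand-ins for `tweet._json`; the names `fake_tweets`, `records` and `df_sketch` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical stand-ins for tweet._json dictionaries (invented values)
fake_tweets = {
    "handle_a": [{"id": 1, "text": "first", "retweet_count": 0},
                 {"id": 2, "text": "second", "retweet_count": 3}],
    "handle_b": [{"id": 3, "text": "third", "retweet_count": 1}],
}

records = []
for handle, tweets_json in fake_tweets.items():
    for tweet_json in tweets_json:
        row = dict(tweet_json)          # copy so the original dict is untouched
        row["twitter handle"] = handle  # same extra field as in the loop above
        records.append(row)

# One row per tweet, one column per key: no transpose needed
df_sketch = pd.DataFrame(records)
print(df_sketch.shape)
print(df_sketch)
```

Building the dataframe once avoids copying the growing dataframe at every iteration, which matters when many tweets are fetched.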


## Merge tweets and party affiliation on `twitter handle` 

In [55]:
df_tweets_small=df_tweets[['text', 'created_at', 'retweet_count', 'favorite_count', 'twitter handle']] # 'user'
df_tweets_small.head()

| | text | created_at | retweet_count | favorite_count | twitter handle |
|---|---|---|---|---|---|
| 1 | Local resources during tough times: https://t.... | Fri May 01 13:19:02 +0000 2020 | 1 | 1 | ReElectHank |
| 0 | @CDCgov has added several new symptoms to its ... | Fri May 01 13:48:57 +0000 2020 | 0 | 0 | ReElectHank |
| 1 | I appreciated the opportunity to talk to KY-02... | Fri May 01 19:19:45 +0000 2020 | 1 | 2 | RepGuthrie |
| 0 | Nearly 1.7 billion loans have been approved fo... | Fri May 01 20:00:03 +0000 2020 | 3 | 3 | RepGuthrie |
| 0 | ICYMI: My interview with WLKY News about my ba... | Tue Nov 06 15:25:16 +0000 2018 | 2 | 4 | brettguthrie |


In [56]:
df_merged=pd.merge(df_tweets_small, df_merged_all,on='twitter handle')
df_merged.shape

(16, 11)
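
By default `pd.merge` performs an inner join, so tweets whose handle has no match in the other table are dropped silently. The sketch below, on toy data invented for illustration, shows one way to check for such rows with the `how`, `indicator` and `validate` parameters.

```python
import pandas as pd

# Toy data, invented for illustration
tweets_toy = pd.DataFrame({
    "text": ["t1", "t2", "t3"],
    "twitter handle": ["handle_a", "handle_a", "handle_c"],
})
parties_toy = pd.DataFrame({
    "twitter handle": ["handle_a", "handle_b"],
    "party": ["D", "R"],
})

# how='left' keeps every tweet; indicator=True adds a '_merge' column;
# validate checks that each handle appears at most once on the right side
checked = pd.merge(tweets_toy, parties_toy, on="twitter handle",
                   how="left", indicator=True, validate="many_to_one")
print(checked)

# Tweets whose handle has no match in the party table
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["twitter handle"].tolist())
```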

<div class="alert alert-block alert-warning">
<i class="fa fa-warning"></i>&nbsp;<code>os</code> package
    <ul>
        <li> <code>os.getcwd()</code>: returns the current working directory
        </li>
        <li> <code>os.path.dirname()</code>: returns the parent directory of a path
        </li>
        <li> <code>os.path.join()</code>: joins several path components
        </li>
    </ul>
</div>

In [57]:
import os
parent_path=os.path.dirname(os.getcwd()) # os.getcwd() returns the current working directory; dirname() gives its parent
data_path=os.path.join(parent_path, 'data')
print(data_path)

C:\Users\Andi Eckmann\Desktop\ETH Studium\MASTERSTUDIUM\2. Semester\Big Data for Public Policy\big_data_policy_2020\data


<div class="alert alert-block alert-warning">
<i class="fa fa-warning"></i>&nbsp;<code>pickle</code> format
    <ul>
        <li> Useful to store <code>python</code> objects 
        </li>
        <li> Well integrated in  <code>pandas</code> (using <code>to_pickle</code> and <code>read_pickle</code>)
        </li>
        <li> When the object is not a pandas <code>DataFrame</code>, use the <code>pickle</code> package
        </li>
    </ul>
</div>
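
For an object that is not a dataframe, a minimal sketch of the round trip with the `pickle` package could look as follows. The dictionary `obj` and the file name `example.pkl` are invented for illustration, and the file is written to a temporary directory so nothing is left behind.

```python
import os
import pickle
import tempfile

# Any Python object works; this nested dictionary is invented for illustration
obj = {"handles": ["handle_a", "handle_b"], "counts": {"handle_a": 2}}

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")
    with open(path, "wb") as f:   # pickle files are binary: note the 'b'
        pickle.dump(obj, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == obj)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.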


In [58]:
df_merged.to_pickle(os.path.join(data_path, 'tweet_labeled.pkl'))
df_merged.to_csv(os.path.join(data_path, 'tweet_labeled.csv'))

## Other example using API

Forecasts from the **Carbon Intensity API**: https://carbonintensity.org.uk/ (includes CO2 emissions related to electricity generation only).
See the API [documentation](https://carbon-intensity.github.io/api-definitions/#carbon-intensity-api-v2-0-0).

In [59]:
import requests
headers = {
  'Accept': 'application/json'
}

In [60]:
# Get Carbon Intensity data for current half hour
r = requests.get('https://api.carbonintensity.org.uk/intensity', params={}, headers = headers) 
print(r.json())

{'data': [{'from': '2020-05-02T10:30Z', 'to': '2020-05-02T11:00Z', 'intensity': {'forecast': 131, 'actual': 124, 'index': 'low'}}]}
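
The nested response can be flattened into a dataframe for further analysis. The sketch below copies the payload from the output above instead of calling the API, and assumes a pandas version that provides `pd.json_normalize` (1.0 or later); the name `df_intensity` is made up for this example.

```python
import pandas as pd

# Payload copied from the output shown above
payload = {'data': [{'from': '2020-05-02T10:30Z', 'to': '2020-05-02T11:00Z',
                     'intensity': {'forecast': 131, 'actual': 124, 'index': 'low'}}]}

# json_normalize flattens nested dictionaries into dotted column names
df_intensity = pd.json_normalize(payload['data'])
print(df_intensity.columns.tolist())

# Parse the timestamp strings into timezone-aware datetimes
df_intensity['from'] = pd.to_datetime(df_intensity['from'])
print(df_intensity[['from', 'intensity.forecast', 'intensity.actual']])
```

The same call should work on the longer responses below, since each element of `'data'` has the same structure in the outputs shown.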


In [61]:
# Get Carbon Intensity data for today
r = requests.get('https://api.carbonintensity.org.uk/intensity/date', params={}, headers = headers)
pprint(r.json())

{'data': [{'from': '2020-05-01T23:00Z',
           'intensity': {'actual': 125, 'forecast': 148, 'index': 'low'},
           'to': '2020-05-01T23:30Z'},
          {'from': '2020-05-01T23:30Z',
           'intensity': {'actual': 119, 'forecast': 138, 'index': 'low'},
           'to': '2020-05-02T00:00Z'},
          {'from': '2020-05-02T00:00Z',
           'intensity': {'actual': 115, 'forecast': 120, 'index': 'low'},
           'to': '2020-05-02T00:30Z'},
          {'from': '2020-05-02T00:30Z',
           'intensity': {'actual': 113, 'forecast': 120, 'index': 'low'},
           'to': '2020-05-02T01:00Z'},
          {'from': '2020-05-02T01:00Z',
           'intensity': {'actual': 112, 'forecast': 109, 'index': 'low'},
           'to': '2020-05-02T01:30Z'},
          {'from': '2020-05-02T01:30Z',
           'intensity': {'actual': 112, 'forecast': 111, 'index': 'low'},
           'to': '2020-05-02T02:00Z'},
          {'from': '2020-05-02T02:00Z',
           'intensity': {'actual': 112, 'f

In [62]:
# Get Carbon Intensity factors for each fuel type
r = requests.get('https://api.carbonintensity.org.uk/intensity/factors', params={}, headers = headers)
pprint(r.json())

{'data': [{'Biomass': 120,
           'Coal': 937,
           'Dutch Imports': 474,
           'French Imports': 53,
           'Gas (Combined Cycle)': 394,
           'Gas (Open Cycle)': 651,
           'Hydro': 0,
           'Irish Imports': 458,
           'Nuclear': 0,
           'Oil': 935,
           'Other': 300,
           'Pumped Storage': 0,
           'Solar': 0,
           'Wind': 0}]}


In [63]:
# Get Carbon Intensity data for current half hour for GB regions
r = requests.get('https://api.carbonintensity.org.uk/regional', params={}, headers = headers)
pprint(r.json())

{'data': [{'from': '2020-05-02T10:30Z',
           'regions': [{'dnoregion': 'Scottish Hydro Electric Power '
                                     'Distribution',
                        'generationmix': [{'fuel': 'biomass', 'perc': 0},
                                          {'fuel': 'coal', 'perc': 0},
                                          {'fuel': 'imports', 'perc': 0},
                                          {'fuel': 'gas', 'perc': 0},
                                          {'fuel': 'nuclear', 'perc': 0},
                                          {'fuel': 'other', 'perc': 0},
                                          {'fuel': 'hydro', 'perc': 24.2},
                                          {'fuel': 'solar', 'perc': 0},
                                          {'fuel': 'wind', 'perc': 75.8}],
                        'intensity': {'forecast': 0, 'index': 'very low'},
                        'regionid': 1,
                        'shortname': 'North Scotland'},
          

## Class survey
Please fill in this [short survey](https://framaforms.org/keep-start-stop-survey-1583156515) about the class. 

## What is not covered in the notebook

- If you struggle with something that you need for your project, tell us and we can spend some time on it. For example:
    - Scraping dynamically-generated content