<img src='images/header.png' style='height: 50px; float: left'>

## Introduction to Computational Social Science methods with Python

# Session B1: API harvesting and Twitter API

**Data collection** is a procedure of gathering information from subjects (all relevant sources), measuring and analyzing accurate insights for research using various techniques. Researchers can evaluate their research questions and hypotheses on the basis of collected data. In most cases, data collection is the primary and most important step for research, irrespective of the field of study. The approach of data collection varies for different fields of study, depending on the required information.

The ease of access to the technology has made various social media platforms more popular as communication tools, therefore as a source of data. With this rise of social media use as a data source, data collection using APIs has become a demanding skill. Here, in this session, we aim to teach how to collect data from various social media platforms such as Twitter.

<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn how to collect digital behavioral data via API harvesting throughout this session. In subsession **B1.1**, we start with how to collect data programmatically using webbased application programming interfaces (APIs). In subsession **B1.2**, we introduce how to collect data from Twitter such as rehydrating tweet identifiers, getting information on users, and keyword search. In subsession **B1.3**, More APIs are introduced. Finally, Subsession **B1.4** is dedicated to how data collection through API can be challenging.
</div>

## B1.1. Collecting data from API

<img src="./images/database.png"  width="150" height = "150" align="right"/>


>APIs provide a number of benefits when it comes to collecting data from the web. With rare exceptions, rules about what data we can collect and how we can collect it using APIs are explicitly stated, which removes some of the legal uncertainties that can
complicate other data collection methods. Widely used APIs are generally well documented and maintained, and the data returned is usually well-structured and easy to work with.(<a href='#mclevey_doing_2022'>McLevey, 2022</a>, ch.4)

In order to access APIs, you first need to create an account and apply to have a developer account on the platform that you want to work on. With this developer account, platforms provide you KEYS (e.g., secret, public, or access) to authenticate their system.

While web scraping is one of the common ways of collecting data from websites, a lot of websites offer APIs to access the public data that they host on their website. This is to avoid unnecessary traffic on the websites.

However, even though we have access to these APIs, as researchers, we should not forget to respect API access rules and always read the documents before collecting data.


## B1.2. Getting practical with Twitter API 

Twitter is one of the most used social media platforms in the academic research. This microblogging and social networking service hosts users who can post and interact with messages known as "tweets". Registered users can post, like, and retweet tweets, but unregistered users can only read those that are publicly available. As of January 2023, Twitter has 556 million active users worldwide (<a href='#statista'>Statista, 2023</a>). 

<img src="./images/twitter.png"  width="200" height = "200" align="left"/>


Different access options for different purposes:

- Twitter Developer: https://developer.twitter.com/
- APIs: https://developer.twitter.com/en/docs
- GNIP: http://support.gnip.com/apis/
- Twitter Enterprise: https://developer.twitter.com/en/enterprise

IMPORTANT to note that free APIs cover 7 days Tweets; Premium APIs exist for 30-day search and beyond. If you have an Academic Research access level, you can access even more data with full-archive search endpoint. There are changes to APIs policies over time, such as functionalities and user agreements. Also, limitations on volume and functions should be considered. 

Before we start with our first project on Twitter, first you need to sign up for Twitter and then, create a Developer account: 

- Sign up from [here.](https://help.twitter.com/en/using-twitter/create-twitter-account)
- Create a Developer Account from [here.](https://developer.twitter.com)

### B1.2.1. Setting up Twitter API keys 

In [1]:
apikey = 'YOURapikey' #25 alphanumeric characters
apisecretkey = 'YOURapisecretkey'
accesstoken = 'YOURaccesstoken'
accesstokensecret = 'YOURaccesstokensecret'
bearertoken = 'YOURbearertoken'


<img src="./images/developer_portal.png"  width="500" height = "500" align="center"/>

### B1.2.2. Rehydrating tweets

Researchers share large tweet data sets with **tweet identifiers** as Twitter Terms of Service does not allow researchers to share the full tweets data. In order to get the tweets used in a research work, we need to retrieve/reconstruct the tweets data using those tweet identifiers (tweet ids). This process is called hydrating/rehydrating tweets.

Since some of the tweets used in a research work migh have been deleted in time, we may not be able to access every single tweet used at the time when that research work has been done. We will see about that in more details later in this subsession.

In order to rehydrate tweets, we will be using Twarc library, which is a python wrapper for twitter API. You can install it with `pip`.

In [2]:
from twarc import Twarc2, expansions

We will rehydrate tweets from The Twitter Parliamentarian Database (<a href='#van_Vliet'>van Vliet et al., 2020</a>) for our teaching purposes here. Download the `2021.csv` data set from [this](https://figshare.com/articles/dataset/The_Twitter_Parliamentarian_Database/10120685) link and read it like the following:

In [3]:
import pandas as pd

data = pd.read_csv('./data/2021.csv', header = None, low_memory=False)
data.columns = ['country', 'party', 'author', 'author_id', 'district','date','tweet_id']

In [4]:
data.head()

Unnamed: 0,country,party,author,author_id,district,date,tweet_id
0,United States,Republican,thom tillis,2964174789,,2021-01-01 05:01:00,1344871222073458690
1,United States,Republican,thom tillis,2964174789,,2021-01-03 22:19:37,1345857376155545601
2,United States,Republican,thom tillis,2964174789,,2021-01-04 19:31:13,1346177383225815052
3,United States,Republican,thom tillis,2964174789,,2021-01-05 19:46:42,1346543667570495488
4,United States,Republican,thom tillis,2964174789,,2021-01-06 17:29:55,1346871631101259776


We will take the tweets for Turkey and keep their IDs to rehydrate. We'll try rehydrating a random sample of 1000 of them:

In [5]:
turkey = data[data['country'] == 'Turkey']

tweet_ids = list(turkey.sample(1000, random_state = 2023)['tweet_id'])

We will use the following `rehydrate` [function](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6a-labs-code-academic-python.md) to rehydrate the tweets and keep their data in the `tweets` list:

In [6]:
import json

# Use your bearer token here
client = Twarc2(bearer_token=bearertoken)

tweets = []

def rehydrate(ids: list):
    # List of Tweet IDs you want to lookup
    tweet_ids = ids
    # The tweet_lookup function from twarc 
    lookup = client.tweet_lookup(tweet_ids=tweet_ids)
    for page in lookup:
        # The Twitter API v2 returns the Tweet information and the user, media etc.  separately
        # so we use expansions.flatten to get all the information in a single JSON
        result = expansions.flatten(page)
        for tweet in result:
            tweets.append(tweet)

Running the function for the 1000 tweet IDs takes around 30 seconds, since twarc sends a GET request for 100 tweet IDs every 3 seconds. More on twitter rate limits [here](https://developer.twitter.com/en/docs/twitter-api/rate-limits).

In [7]:
rehydrate(tweet_ids)

To see the information returned for each tweet ID, we can check the first item in `tweets` list:

In [8]:
tweets[0]

{'reply_settings': 'everyone',
 'id': '1367480557412769793',
 'referenced_tweets': [{'type': 'replied_to',
   'id': '1367480542002876420',
   'reply_settings': 'everyone',
   'entities': {'urls': [{'start': 252,
      'end': 275,
      'url': 'https://t.co/bUXK4Wygi6',
      'expanded_url': 'https://twitter.com/lutfikasikci/status/1367480542002876420/photo/1',
      'display_url': 'pic.twitter.com/bUXK4Wygi6',
      'media_key': '3_1367480531148042248'}]},
   'possibly_sensitive': False,
   'lang': 'tr',
   'conversation_id': '1367480542002876420',
   'edit_history_tweet_ids': ['1367480542002876420'],
   'author_id': '215618996',
   'public_metrics': {'retweet_count': 155,
    'reply_count': 32,
    'like_count': 609,
    'quote_count': 21,
    'impression_count': 0},
   'text': '1- TBMM ‘ne dün itibari ile gelen fezlekelerden birinin de bana ait olduğunu öğrenmiş bulunmaktayım.\nHiç vakit kaybedilmeden ve bekletilmeden dokunulmazlığımın kaldırılması ile ilgili dilekçemi bugün itibari 

We can also check to see how many tweets could have been rehydrated from the IDs:

In [9]:
len(tweets)

918

As you can see, only 92 percent of tweets could have been rehydrated; others are not available anymore.

In [10]:
tweets[0]

{'reply_settings': 'everyone',
 'id': '1367480557412769793',
 'referenced_tweets': [{'type': 'replied_to',
   'id': '1367480542002876420',
   'reply_settings': 'everyone',
   'entities': {'urls': [{'start': 252,
      'end': 275,
      'url': 'https://t.co/bUXK4Wygi6',
      'expanded_url': 'https://twitter.com/lutfikasikci/status/1367480542002876420/photo/1',
      'display_url': 'pic.twitter.com/bUXK4Wygi6',
      'media_key': '3_1367480531148042248'}]},
   'possibly_sensitive': False,
   'lang': 'tr',
   'conversation_id': '1367480542002876420',
   'edit_history_tweet_ids': ['1367480542002876420'],
   'author_id': '215618996',
   'public_metrics': {'retweet_count': 155,
    'reply_count': 32,
    'like_count': 609,
    'quote_count': 21,
    'impression_count': 0},
   'text': '1- TBMM ‘ne dün itibari ile gelen fezlekelerden birinin de bana ait olduğunu öğrenmiş bulunmaktayım.\nHiç vakit kaybedilmeden ve bekletilmeden dokunulmazlığımın kaldırılması ile ilgili dilekçemi bugün itibari 

We can show some of the useful information of tweets in a dataframe:

In [11]:
author_id = []
created_at = []
text = []
reply_count = []
like_count = []

for i in tweets:
    
    author_id.append(i['author_id'])
    created_at.append(i['created_at'])
    text.append(i['text'])
    reply_count.append(i['public_metrics']['reply_count'])
    like_count.append(i['public_metrics']['like_count'])
    
tweets_df = pd.DataFrame(data=[tweet_ids, author_id, created_at, text, reply_count, like_count]).transpose()

tweets_df.columns = ["tweet id","author id", "created at", "text", "reply count","like count"]

tweets_df.head()

Unnamed: 0,tweet id,author id,created at,text,reply count,like count
0,1367480557412769793,215618996,2021-03-04T14:22:26.000Z,2-2010 yılında Dörtyol ilçemizde 4 polis memur...,5,140
1,1375139722712977414,228446708,2021-03-25T17:37:13.000Z,"AK Parti MKYK Üyemiz, Genel Merkez Teşkilat Ba...",16,727
2,1367095084571852801,4472008409,2021-03-03T12:50:42.000Z,"İP’in Başkanı, yalanı bırak, tezviratı geç; aç...",44,1219
3,1363355151914983424,601912159,2021-07-27T17:29:45.000Z,#IBANGönderKurz https://t.co/WWD9n5zBS0,16,419
4,1420073937552252929,2161571388,2021-04-17T07:07:25.000Z,"RT @AKKADINGM: 81 İlde ""Kadın Emeği Türkiye'ni...",0,0


| Stretch/Untouched | ProbDistribution | Accuracy |
| --- | --- | --- |
| Stretched | Gaussian | .843 |

### B1.2.3. Getting users info

Consider the `turkey` dataframe that we created from the `2021.csv` data in the previous section. We want to get the profile information of a fraction of the politicians whose tweets are in that dataframe. First, we get the unique politicians in the data:

In [12]:
# Getting all IDs in the dataframe 
author_ids = turkey['author_id']

# Getting unique IDs
unique_ids = list(author_ids.unique())

With the following `get_user()` [function](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6a-labs-code-academic-python.md), we can get the users' information based on their IDs (It's a bit like tweet rehydration, but we use users' IDs this time), and save them in the `users_list` list.

In [13]:
from twarc import Twarc2, expansions
import json

users_list = []

# Replace your bearer token below
client = Twarc2(bearer_token=bearertoken)

def get_user(ids):
    # List of user IDs to lookup, add the ones you would like to lookup
    users = ids
    # The user_lookup function gets the hydrated user information for specified users
    lookup = client.user_lookup(users=users)
    for page in lookup:
        result = expansions.flatten(page)
        for user in result:
            # Here we are printing the full Tweet object JSON to the console
            users_list.append(user)

In [14]:
import random

some_ids = random.sample(unique_ids, 50)

get_user(some_ids)

Now we can make a dataframe and put some of this useful information of users' in it:

In [15]:
username = []
screen_name = []
profile_pic = []
followers = []
followings = []

for i in users_list:
    
    username.append(i['username'])
    screen_name.append(i['name'])
    profile_pic.append(i['profile_image_url'])
    followers.append(i['public_metrics']['followers_count'])
    followings.append(i['public_metrics']['following_count'])
    
users_df = pd.DataFrame(data=[some_ids, username, screen_name, profile_pic, followers, followings]).transpose()

users_df.columns = ["user id", "user name","screen name", "profile picture", "followers","followings"]

users_df.head()

Unnamed: 0,user id,user name,screen name,profile picture,followers,followings
0,1002483544197824513,nusrettin_macin,Nusrettin Maçin,https://pbs.twimg.com/profile_images/126021185...,15426,714
1,243716292,avukatayhanerel,Ayhan EREL,https://pbs.twimg.com/profile_images/143139316...,18578,714
2,2783856919,ibrahimkaboglu,İbrahim Özden Kaboğlu,https://pbs.twimg.com/profile_images/138960714...,88611,493
3,2233223984,yasartuzun06,Yaşar Tüzün,https://pbs.twimg.com/profile_images/164894481...,36944,1975
4,3055572752,bakisimsekmhp,Baki Şimşek,https://pbs.twimg.com/profile_images/161616258...,54816,2350


### B1.2.4. Keyword search limited to a time window

We can use the the `search_all()` function of twarc to search for tweets in any time window of our choice. The following `search()` [function](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6a-labs-code-academic-python.md) looks for any tweet containing the query it takes, limited to a beginning and end time, and saves them into the `tweets` list.


You can find more information on writing search queries [here](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/5-how-to-write-search-queries.md).

In [1]:
from keys import *

In [2]:
from twarc import Twarc2, expansions
import datetime
import json


# Replace your bearer token below
client = Twarc2(bearer_token=bearertoken)

def search(beginning, end, q):
    
    # Specify the start time in UTC for the time period you want Tweets from
    start_time = beginning

    # Specify the end time in UTC for the time period you want Tweets from
    end_time = end

    # This is where we specify our query
    query = q

    # The search_all method call the full-archive search endpoint to get Tweets based on the query, start and end times
    search_results = client.search_all(query=query, start_time=beginning, end_time=end, max_results=100)

    # Twarc returns all Tweets for the criteria set above, so we page through the results
    for page in search_results:
        # The Twitter API v2 returns the Tweet information and the user, media etc.  separately
        # so we use expansions.flatten to get all the information in a single JSON
        result = expansions.flatten(page)
        for tweet in result:
            # Here we are printing the full Tweet object JSON to the console
            tweets.append(tweet)

In [3]:
tweets = []

# Beginning time for the time window
beginning = datetime.datetime(2008, 1, 5, 0, 0, 0, 0, datetime.timezone.utc)

# End time for the time window
end = datetime.datetime(2023, 1, 8, 0, 0, 0, 0, datetime.timezone.utc)

# The query for searching
q = "(Computational Social Science OR ComputationalSocialScience OR Quantitative Social Science OR QuantitativeSocialScience) Turkey"

search (beginning, end, q)

You can take a look at the number of retrieved tweets and the first tweet like this:

In [4]:
# Number of retrieved tweets
len(tweets)

105

In [6]:
# The information available for each tweet
tweets[0].keys()

dict_keys(['conversation_id', 'id', 'reply_settings', 'text', 'edit_history_tweet_ids', 'created_at', 'referenced_tweets', 'edit_controls', 'author_id', 'possibly_sensitive', 'lang', 'entities', 'public_metrics', 'author', '__twarc'])

In [15]:
tweets[0]['entities']['mentions'][0].keys()

dict_keys(['start', 'end', 'username', 'id', 'verified', 'name', 'public_metrics', 'url', 'entities', 'protected', 'description', 'created_at', 'profile_image_url', 'pinned_tweet_id'])

In [13]:
# The overall information of the first tweet 
tweets[0]['author']

{'public_metrics': {'followers_count': 112,
  'following_count': 685,
  'tweet_count': 268,
  'listed_count': 1},
 'name': 'Selim Balcısoy',
 'username': 'BalcSoy',
 'location': 'Boston, MA',
 'profile_image_url': 'https://pbs.twimg.com/profile_images/1173996771393069056/Wqykyzqt_normal.png',
 'protected': False,
 'id': '1173993379698552832',
 'created_at': '2019-09-17T16:13:43.000Z',
 'verified': False,
 'description': 'Science, Cultural Heritage, Opera, Wine and no BS'}

We can show some of the useful information of tweets in a dataframe:

In [21]:
tweet_id = []
author_id = []
created_at = []
text = []
lang = []

for i in tweets:
    
    tweet_id.append(i['id'])
    author_id.append(i['author_id'])
    created_at.append(i['created_at'])
    text.append(i['text'])
    lang.append(i['lang'])
    
search_df = pd.DataFrame(data=[tweet_id, author_id, created_at, text, lang,]).transpose()

search_df.columns = ["tweet id","author id", "created at", "text", "language"]

search_df.head()

Unnamed: 0,tweet id,author id,created at,text,language
0,1611472338427744256,1173993379698552832,2023-01-06T21:18:39.000Z,RT @SzassTam: Hesaplamalı sosyal bilimler (#Co...,tr
1,1611412573605466114,14519511,2023-01-06T17:21:10.000Z,RT @SzassTam: Hesaplamalı sosyal bilimler (#Co...,tr
2,1611406068873052178,3377132271,2023-01-06T16:55:19.000Z,RT @SzassTam: Hesaplamalı sosyal bilimler (#Co...,tr
3,1611333811354181633,2400010513,2023-01-06T12:08:12.000Z,RT @SzassTam: Hesaplamalı sosyal bilimler (#Co...,tr
4,1611295709545783296,1474656871,2023-01-06T09:36:48.000Z,RT @SzassTam: Hesaplamalı sosyal bilimler (#Co...,tr


### [Tweets data dictionary](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet):

| Field value | Type | Description | How it can be used |
|-------|-----------|----------|----------|
| `id` | string | The unique identifier of the requested Tweet. `"id": "1050118621198921728"` | Use this to programmatically retrieve a specific Tweet. |
| `text` | string | The actual UTF-8 text of the Tweet. See [twitter-text](https://github.com/twitter/twitter-text/) for details on what characters are currently considered valid. `"text": "To make room for more expression, we will now count all emojis as equal—including those with gender‍‍‍ ‍‍and skin tone modifiers 👍🏻👍🏽👍🏿. This is now reflected in Twitter-Text, our Open Source library. \n\nUsing Twitter-Text? See the forum post for detail: https://t.co/Nx1XZmRCXA"` | Keyword extraction and sentiment analysis/classification. |
| `edit_history_tweet_ids` | object | Unique identifiers indicating all versions of a Tweet. For Tweets with no edits, there will be one ID. For Tweets with an edit history, there will be multiple IDs, arranged in ascending order reflecting the order of edits. The most recent version is the last position of the array. `"edit_history_tweet_ids": ["1584717154800521216"]` | Use this information to find the edit history of a Tweet. |
| `author_id` | string | The unique identifier of the User who posted this Tweet. `"author_id": "2244994945"` | Hydrating User object, sharing dataset for peer review |
| `conversation_id` | string | The Tweet ID of the original Tweet of the conversation (which includes direct replies, replies of replies). `"conversation_id": "1050118621198921728"` | Use this to reconstruct the conversation from a Tweet. |
| `created_at` | date (ISO 8601) | Creation time of the Tweet. `"created_at": "2019-06-04T23:12:08.000Z"` | This field can be used to understand when a Tweet was created and used for time-series analysis etc. |
| `edit_controls` | object | When present, this indicates how much longer the Tweet can be edited and the number of remaining edits. Tweets are only editable for the first 30 minutes after creation and can be edited up to five times. `"edit_controls": {"edits_remaining": 5, "is_edit_eligible": true, "editable_until": "2022-10-25T01:53:06.000Z"}` | Use this to determine if a Tweet is eligible for editing. |
| `lang` | string | Language of the Tweet, if detected by Twitter. Returned as a BCP47 language tag. `"lang": "en"` | Classify Tweets by spoken language. |
| ‍‍‍`possibly_sensitive` | boolean | This field indicates content may be recognized as sensitive. The Tweet author can select within their own account preferences and choose “Mark media you tweet as having material that may be sensitive” so each Tweet created after has this flag set. This may also be judged and labeled by an internal Twitter support agent. `"possibly_sensitive": false` | Studying circulation of certain types of content. |
| `public_metrics` | object | Public engagement metrics for the Tweet at the time of the request. `"public_metrics" : {"retweet_count": 8, "reply_count": 2, "like_count": 39, "quote_count": 1}` | Use this to measure Tweet engagement. |
| `referenced_tweets` | array | A list of Tweets this Tweet refers to. For example, if the parent Tweet is a Retweet, a Retweet with comment (also known as Quoted Tweet) or a Reply, it will include the related Tweet referenced to by its parent. `"referenced_tweets": [{"type": "replied_to", "id": "1242125486844604425"}]` | This field can be used to understand conversational aspects of retweets etc. |
| `reply_settings` | string | Shows you who can reply to a given Tweet. Fields returned are "everyone", "mentioned_users", and "followers". `"reply_settings": "everyone"` | This field allows you to determine whether conversation reply settings have been set for the Tweet and if so, what settings have been set. |

### [Authors data dictionary](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user):

| Field value | Type | Description | How it can be used |
|-------|-----------|----------|----------|
| `id` | string | The name of the user, as they’ve defined it on their profile. Not necessarily a person’s name. Typically capped at 50 characters, but subject to change. `"name": "Twitter Dev"` | - |
| `username` | string | The Twitter screen name, handle, or alias that this user identifies themselves with. Usernames are unique but subject to change. Typically a maximum of 15 characters long, but some historical accounts may exist with longer names. `"username": "TwitterDev"` | - |
| `created_at` | date (ISO 8601) | The UTC datetime that the user account was created on Twitter. `"created_at": "2013-12-14T04:35:55.000Z"` | Can be used to determine how long a someone has been using Twitter. |
| `description` | string | The text of this user's profile description (also known as bio), if the user provided one. `"description": "The voice of Twitter's #DevRel team, and your official source for updates, news, & events about Twitter's API. \n\n#BlackLivesMatter"` | - |
| `location` | string | The location specified in the user's profile, if the user provided one. As this is a freeform value, it may not indicate a valid location, but it may be fuzzily evaluated when performing searches with location queries. `"location": "127.0.0.1"` | - |
| `profile_image_url` | string | The URL to the profile image for this user, as shown on the user's profile. `"profile_image_url": "https://pbs.twimg.com/profile_images/1267175364003901441/tBZNFAgA_normal.jpg"` | Can be used to download this user's profile image. |
| `protected` | boolean | Indicates if this user has chosen to protect their Tweets (in other words, if this user's Tweets are private). `"protected": false` | - |
| `public_metrics` | object | Contains details about activity for this user. `"public_metrics": {"followers_count": 507902, "following_count": 1863, "tweet_count": 3561, "listed_count": 1550}` | Can potentially be used to determine a Twitter user’s reach or influence, quantify the user’s range of interests, and the user’s level of engagement on Twitter. |
| `verified` | boolean | Indicates if this user is a verified Twitter User. `"verified": true` | Indicates whether or not this Twitter user has a verified account. A verified account lets people know that an account of public interest is authentic. |

### [Mentions data dictionary:](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet)

| Name | Type | Description |
|-------|-----------|----------|
| `entities.mentions` | array | Contains details about text recognized as a user mention. |
| `entities.mentions.start` | integer | The start position (zero-based) of the recognized user mention within the Tweet. All start indices are inclusive. |
| `entities.mentions.end` | integer | The end position (zero-based) of the recognized user mention within the Tweet. This end index is exclusive. |
| `entities.mentions.username` | string | The part of text recognized as a user mention. You can obtain the expanded object in `includes.users` by adding `expansions=entities.mentions.username` in the request's query parameter. |


TES-D “Computational Social Science Tweets 2023” 

**General Characteristics** 

1. Who collected the dataset and who funded the process? 



2. Where is the dataset hosted? Is the dataset distributed under a copyright or license? 

 

3. What do the instances that comprise the dataset represent? What data does each instance consist of? 

 

4. How many instances are there in total in each category (as defined by the instances’ label), and - if applicable - in each recommended data split? 

 

5. In which contexts and publications has the dataset been used already? 

 

6. Are there alternative datasets that could be used for the measurement of the same or similar constructs? Could they be a better fit? How do they differ? 

 

7. Can the dataset collection be readily reproduced given the current data access, the general context and other potentially interfering developments? 

 

8. Were any ethical review processes conducted? 

 

9. Did any ethical considerations limit the dataset creation? 

 

10. Are there any potential risks for individuals using the data? 

 

**Construct Definition** 

Validity 

1. For the measurement of what construct was the dataset created? 

 

2. How is the construct operationalized? Can the dataset fully grasp the construct? If not, what dimensions are left out? Have there been any attempts to evaluate the validity of the construct's operationalization? 

 

3. What related constructs could (not) be measured through the dataset? What should be considered when measuring other constructs with the dataset? 

 

4. What is the target population? 

 

5. How does the dataset handle subpopulations? 

**Platform Selection**

Platform Affordances Error 

1. What are the key characteristics of the platform at the time of data collection? 

 

What are the effects of the platform's ToS on the collected data? 

 

What are the effects of the platform's sociocultural norms on the collected data? 

 

How were the relevant traces collected from the platform? Are there any technical constraints of the data collection method? If yes, how did those limit the dataset design? 

 

In case multiple data sources were used, what errors might occur through their merger or combination? 

Platform Coverage Error 

What is known about the platform/s population? 

**Data Collection** 

Trace Selection Error 

How was the data associated with each instance acquired? On what basis were the trace selection criteria chosen? 

 

Was there any data that could not be adequately collected? 

 

Is any information missing from individual instances? Could there be a systematic bias? 

 

Does the dataset include sensitive or confidential information? 

User Selection Error 

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample from a larger set, what was the sampling strategy? 

 

What is known about the dataset population? Are there user groups systematically in- or excluded in/from the dataset in direct consequence of the trace selection criteria? 

 

Over what timeframe was the data collected, and how might that timeframe have affected the collected data? 

 

If the dataset relates to people, how did they consent to collecting and using their data? 

 

Does the data include information on minors? 

**Data Preprocessing and Data Augmentation**

Trace Augmentation and Trace Measurement Error 

Is there a label or target associated with each instance? If so, how were the labels or targets generated? 

 

If automated methods were used, how does the methods’ performance impact the correctness of the augmentations? 

 

If human annotations were used, who were the annotators that created the labels? How were they recruited or chosen? How were they instructed? 

 

If the final gold label was derived from different annotations, how was this done? 

 

Have there been any attempts to validate the labels? 

 

How could the data be misused? 

 

Can the dataset in any way unintendedly contribute to the reinforcement of social inequality? 

User Augmentation Error 

Have attributes and characteristics of individuals been inferred? 

 

Is it possible to identify individuals either directly or indirectly from the data? 

Trace Reduction Error 

Have traces been excluded? Why and by what criteria? 

User Reduction Error 

Have users been excluded? Why and by what criteria? 

Adjustment Error 

Does the dataset provide information to adjust the results to a target population? If so, is this information inferred or self-reported? 

## B1.3. Getting practical with Wikipedia API 

<img src='./images/wikipedia_logo.png' style='height: 190px; float: right; margin-left: 50px' >

Wikipedia is a rich source of data for social science research. Although we can access its data through other techniques like web scraping, there are also useful APIs that could ease collecting data from the website.

Since Wikipedia is built on [MediaWiki](https://en.wikipedia.org/wiki/MediaWiki), we will be using python wrappers written for its [API](
https://wikipedia.readthedocs.io/en/latest/code.html#api),
[Mediawiki Action API](https://www.mediawiki.org/wiki/API:Main_page). Each of these wrappers provide some useful methods, and we will try to go through the ones that are the most important to our data collection tasks.

We will also introduce two useful parsers for the Wikipedia markup language, and will see how they could be used for extracting clean data from the raw markup code.


Installation and importing: 

In [22]:
!pip install wikipedia



In [1]:
import wikipedia

Searching a query:

In [2]:
wikipedia.search("Barack")

['Barack Obama',
 'Barack Obama Sr.',
 'Presidency of Barack Obama',
 'Barack (disambiguation)',
 'Family of Barack Obama',
 'Barack (brandy)',
 'Zach Barack',
 'Barack (name)',
 'Barack Obama "Hope" poster',
 'Barack Obama religion conspiracy theories']

In [3]:
wikipedia.suggest("Barak Obama")

'barack obama'

Fewer or more results with a specific number:

In [4]:
wikipedia.search("Ford", results=3)

['Ford Motor Company', 'Ford', 'Gerald Ford']

Getting the summary of an article:

In [5]:
wikipedia.summary("Barack Obama")

"Barack Hussein Obama II ( (listen) bə-RAHK hoo-SAYN oh-BAH-mə; born August 4, 1961) is an American former politician who served as the 44th president of the United States from 2009 to 2017. A member of the Democratic Party, he was the first African-American  president of the United States. Obama previously served as a U.S. senator representing Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004, and worked as a civil rights lawyer before holding public office. \nObama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago. In 1988, he enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review. After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. Turning to elective politics, he represented the 13th district in the Illinois Senate from 1997 until 2004

In [6]:
wikipedia.summary("Barack Obama", sentences=1)

'Barack Hussein Obama II ( (listen) bə-RAHK hoo-SAYN oh-BAH-mə; born August 4, 1961) is an American former politician who served as the 44th president of the United States from 2009 to 2017.'

wikipedia.summary will raise a DisambiguationError if the page is a disambiguation page, or a PageError if the page doesn’t exist (although by default, it tries to find the page you meant with suggest and search.)

In [7]:
wikipedia.summary("Mercury")



  lis = BeautifulSoup(html).find_all('li')


DisambiguationError: "Mercury" may refer to: 
Mercury (planet)
Mercury (element)
Mercury (mythology)
Mercury (toy manufacturer)
Mercury Communications
Mercury Drug
Mercury Energy
Mercury Filmworks
Mercury General
Mercury Interactive
Mercury Marine
Mercury Systems
Mercury (programming language)
Mercury (metadata search system)
Ferranti Mercury
Mercury Browser
Mercury Mail Transport System
Mercury (film)
Mercury (TV series)
Young Adult
Mercury Black
Sailor Mercury
Mercury (Marvel Comics)
Makkari (comics)
Metal Men
Cerebro's X-Men
Amalgam Comics character
Mercury (magazine)
The American Mercury
The Mercury (Hobart)
The Mercury (South Africa)
The Mercury (Pennsylvania)
Mercury (Newport)
Reading Mercury
List of newspapers named Mercury
Mercury (Bova novel)
Mercury (Livesey novel)
Anna Kavan
Mercury Nashville
Mercury Records
Mercury Prize
Mercury, the Winged Messenger
Mercury (American Music Club album)
Mercury (Longview album)
Mercury (Madder Mortem album)
Mercury – Act 1
Mercury – Acts 1 & 2
"Mercury" (song)
Recovering the Satellites
Failer
Planetarium
Operation Mercury
Boeing E-6 Mercury
Miles Mercury
HMS Mercury
USS Mercury
Russian brig Mercury
Mercury (pigeon)
Mercury (name)
Mercury, Savoie
Mercury Bay
place in Alabama
Mercury, Nevada
Mercury, Texas
Mercury (plant)
Annual mercury
English mercury
Mercury FM
Mercury 96.6
Edmonton Mercurys
Fujita Soccer Club Mercury
Memphis Mercury
Phoenix Mercury
Toledo Mercurys
Blackburn Mercury
Bristol Mercury
Mercury (automobile)
Mercury (cyclecar)
Mercury (train)
Mercury (ship)
Cape Cod Mercury 15
Mercury 18
Project Mercury
Mercury
Mercury (satellite)
Archer Maclean's Mercury
Mercury (cipher machine)
Mercury Boulevard
Mercury Cinema
Shuttle America
The Mercury Mall
All pages with titles beginning with Mercury 
The American Mercury
Mercuri
Mercury 1 (disambiguation)
Mercury 2 (disambiguation)
Mercury 3 (disambiguation)
Mercury 4 (disambiguation)
Mercury 5 (disambiguation)
Mercury 6 (disambiguation)
Mercury 7 (disambiguation)
Mercury 8 (disambiguation)
Mercury City (disambiguation)
Mercury FM (disambiguation)
Mercury House (disambiguation)
Mercury mission (disambiguation)
Mercury program (disambiguation)
Mercury project (disambiguation)
All pages with titles containing Mercury

In [8]:
try:
    mercury = wikipedia.summary("Mercury")
except wikipedia.exceptions.DisambiguationError as e:
    print (e.options)

['Mercury (planet)', 'Mercury (element)', 'Mercury (mythology)', 'Mercury (toy manufacturer)', 'Mercury Communications', 'Mercury Drug', 'Mercury Energy', 'Mercury Filmworks', 'Mercury General', 'Mercury Interactive', 'Mercury Marine', 'Mercury Systems', 'Mercury (programming language)', 'Mercury (metadata search system)', 'Ferranti Mercury', 'Mercury Browser', 'Mercury Mail Transport System', 'Mercury (film)', 'Mercury (TV series)', 'Young Adult', 'Mercury Black', 'Sailor Mercury', 'Mercury (Marvel Comics)', 'Makkari (comics)', 'Metal Men', "Cerebro's X-Men", 'Amalgam Comics character', 'Mercury (magazine)', 'The American Mercury', 'The Mercury (Hobart)', 'The Mercury (South Africa)', 'The Mercury (Pennsylvania)', 'Mercury (Newport)', 'Reading Mercury', 'List of newspapers named Mercury', 'Mercury (Bova novel)', 'Mercury (Livesey novel)', 'Anna Kavan', 'Mercury Nashville', 'Mercury Records', 'Mercury Prize', 'Mercury, the Winged Messenger', 'Mercury (American Music Club album)', 'Merc

wikipedia.page enables you to load and access data from full Wikipedia pages. Initialize with a page title (keep in mind the errors listed above), and then access most properties using property methods:

In [9]:
bo = wikipedia.page("Barack Obama")

Getting the title of the page:

In [10]:
bo.title

'Barack Obama'

Getting the url of the page:

In [11]:
bo.url

'https://en.wikipedia.org/wiki/Barack_Obama'

Getting the full text of the page:

In [12]:
bo.content

'Barack Hussein Obama II ( (listen) bə-RAHK hoo-SAYN oh-BAH-mə; born August 4, 1961) is an American former politician who served as the 44th president of the United States from 2009 to 2017. A member of the Democratic Party, he was the first African-American  president of the United States. Obama previously served as a U.S. senator representing Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004, and worked as a civil rights lawyer before holding public office. \nObama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago. In 1988, he enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review. After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. Turning to elective politics, he represented the 13th district in the Illinois Senate from 1997 until 2004

Getting the images of the page:

In [13]:
bo.images[0:5]

['https://upload.wikimedia.org/wikipedia/commons/1/17/Balance%2C_by_David.svg',
 'https://upload.wikimedia.org/wikipedia/commons/f/f1/BarackObamaportrait.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/3/32/Barack_Obama_addresses_joint_session_of_Congress_2009-02-24.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/1/11/Barack_Obama_signature.svg',
 'https://upload.wikimedia.org/wikipedia/commons/1/17/Barack_Obama_talks_with_Benjamin_Netanyahu_%288637772147%29.jpg']

Getting the links in the page:

In [14]:
bo.links[:10]

['109th United States Congress',
 '110th United States Congress',
 '14th Dalai Lama',
 '1828 United States presidential election',
 '1832 Democratic National Convention',
 '1835 Democratic National Convention',
 '1840 Democratic National Convention',
 '1844 Democratic National Convention',
 '1848 Democratic National Convention',
 '1852 Democratic National Convention']

To change the language of the Wikipedia you are accessing, use wikipedia.set_lang. Remember to search for page titles in the language that you have set, not English:

In [15]:
wikipedia.set_lang("fr")

In [16]:
wikipedia.summary("Francois Hollande")

"François Hollande [fʁɑ̃swa ʔɔlɑ̃d] , né le 12 août 1954 à Rouen (Seine-Maritime), est un haut fonctionnaire et homme d'État français. Il est président de la République française du 15 mai 2012 au 14 mai 2017.\nMagistrat à la Cour des comptes et brièvement avocat, il est élu pour la première fois député en 1988. Il exerce la fonction de premier secrétaire du Parti socialiste (PS) de 1997 à 2008, pendant la troisième cohabitation puis dans l'opposition. Au niveau local, il est maire de Tulle de 2001 à 2008 et président du conseil général de Corrèze de 2008 à 2012.\nDésigné candidat du PS à l'élection présidentielle de 2012 à l'issue d'une primaire à gauche, il est élu chef de l'État face au président sortant, Nicolas Sarkozy, avec 51,6 % des suffrages exprimés au second tour. Sa présidence est marquée par une augmentation de la fiscalité puis un virage social-libéral (le « pacte de responsabilité »), la loi sur le mariage homosexuel, la tenue de la Conférence de Paris sur le climat, des

In [20]:
wikipedia.set_lang("en")

List of URLs of the external links:

In [21]:
bo.references[1:]

['http://www.theaustralian.com.au/archive/news/obama-launches-afghanistan-surge/story-e6frg6t6-1111118893671',
 'http://www.abc.net.au/news/stories/2008/09/09/2360240.htm',
 'http://www.bncatalogo.cl/F?func=direct&local_base=red10&doc_number=000710697',
 'http://blogs.abcnews.com/politicalpunch/2010/09/president-obama-i-am-a-christian-by-choicethe-precepts-of-jesus-spoke-to-me.html',
 'http://www.aljazeera.com/news/middleeast/2014/09/obama-strike-wherever-it-exists-2014910223935601193.html',
 'http://corporate.ancestry.com/press/press-releases/2012/07/ancestry.com-discovers-president-obama-related-to-first-documented-slave-in-america/',
 'http://www.baltimoresun.com/news/nationworld/politics/bal-te.obama02mar02,0,3453027.story',
 'http://www.barackobama.com/2002/10/02/remarks_of_illinois_state_sen.php',
 'http://bloomberg.com/apps/news?pid=20601087&sid=aw4F_L7E4xYg',
 'http://archive.boston.com/news/local/articles/2007/01/28/at_harvard_law_a_unifying_voice/',
 'http://articles.boston.c

Getting the plain text content of a section in the page:

In [22]:
bo.section('Early life and career')

'Obama was born on August 4, 1961, at Kapiolani Medical Center for Women and Children in Honolulu, Hawaii. He is the only president born outside the contiguous 48 states. He was born to an American mother and a Kenyan father. His mother, Ann Dunham (1942–1995), was born in Wichita, Kansas and was of English, Welsh, German, Swiss, and Irish descent. In 2007 it was discovered her great-great-grandfather Falmouth Kearney emigrated from the village of Moneygall, Ireland to the US in 1850. In July 2012, Ancestry.com found a strong likelihood that Dunham was descended from John Punch, an enslaved African man who lived in the Colony of Virginia during the seventeenth century. Obama\'s father, Barack Obama Sr. (1934–1982), was a married Luo Kenyan from Nyang\'oma Kogelo. Obama\'s parents met in 1960 in a Russian language class at the University of Hawaii at Manoa, where his father was a foreign student on a scholarship. The couple married in Wailuku, Hawaii, on February 2, 1961, six months bef

List of section titles: an example of a bug!

In [23]:
bo.sections

[]

### B1.3.2. Getting Wikipedia page information

POURIA: add these links to the rerference please! Thank you!

https://doc.wikimedia.org/pywikibot/stable/

https://mwparserfromhell.readthedocs.io/en/latest/index.html

https://github.com/5j9/wikitextparser

Using pywikibot to get the wikipedia markup code and then parse it with parsers like mwparserfromhell and wikitextparser.

Installation and importing:

In [40]:
pip install pywikibot

Note: you may need to restart the kernel to use updated packages.


In [41]:
pip install mwparserfromhell

Note: you may need to restart the kernel to use updated packages.


In [42]:
pip install wikitextparser

Note: you may need to restart the kernel to use updated packages.


In [20]:
import pywikibot
import mwparserfromhell as mwp
import wikitextparser as wtp
import pandas as pd

In [34]:
print(pywikibot.__version__)

7.6.0


Getting the markup code of the page [List of political parties in Germany]('https://en.wikipedia.org/wiki/List_of_political_parties_in_Germany'):

In [21]:
site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, "List of political parties in Germany")
text = page.get()

In [3]:
revs = page.revisions()

In [4]:
wikicode = mwp.parse(text)

In [5]:
wikicode.get_sections()

["{{Short description|Political parties in Germany}}\n{{Politics of Germany|elections}}This article '''lists [[political party|political parties]] in [[politics of Germany|Germany]]'''.\n\nThe [[Federal Republic of Germany]] has a plural [[Multi-party system|multi party system]]. The largest by members and parliament seats are the [[Christian Democratic Union (Germany)|Christian Democratic Union]] (CDU), with its sister party, the [[Christian Social Union of Bavaria|Christian Social Union]] (CSU) and [[Social Democratic Party of Germany]] (SPD).\n\nGermany also has a number of other parties, in recent history most importantly the [[Free Democratic Party (Germany)|Free Democratic Party]] (FDP), [[Alliance 90/The Greens]], [[The Left (Germany)|The Left]], and more recently the [[Alternative for Germany]] (AfD), founded in 2013. The federal government of Germany often consisted of a [[German governing coalition|coalition]] of a major and a minor party, specifically CDU/CSU and FDP or SPD 

In [6]:
revsl = []
for i in revs:
    revsl.append(i)

In [7]:
revsl[0]

Revision({'revid': 1151494554, 'parentid': 1151471307, 'user': 'Helper201', 'userid': 24323093, 'timestamp': Timestamp(2023, 4, 24, 12, 23, 8), 'size': 89206, 'sha1': '3ab5ccaddb097e3ed6842a470584614c66018ead', 'roles': ['main'], 'slots': {'main': {'contentmodel': 'wikitext', 'contentformat': 'text/x-wiki', '*': '{{Short description|Political parties in Germany}}\n{{Politics of Germany|elections}}This article \'\'\'lists [[political party|political parties]] in [[politics of Germany|Germany]]\'\'\'.\n\nThe [[Federal Republic of Germany]] has a plural [[Multi-party system|multi party system]]. The largest by members and parliament seats are the [[Christian Democratic Union (Germany)|Christian Democratic Union]] (CDU), with its sister party, the [[Christian Social Union of Bavaria|Christian Social Union]] (CSU) and [[Social Democratic Party of Germany]] (SPD).\n\nGermany also has a number of other parties, in recent history most importantly the [[Free Democratic Party (Germany)|Free Demo

In [8]:
rev1 = revsl[0].text
# page = wtp.parse(rev1)

In [9]:
rev1

'{{Short description|Political parties in Germany}}\n{{Politics of Germany|elections}}This article \'\'\'lists [[political party|political parties]] in [[politics of Germany|Germany]]\'\'\'.\n\nThe [[Federal Republic of Germany]] has a plural [[Multi-party system|multi party system]]. The largest by members and parliament seats are the [[Christian Democratic Union (Germany)|Christian Democratic Union]] (CDU), with its sister party, the [[Christian Social Union of Bavaria|Christian Social Union]] (CSU) and [[Social Democratic Party of Germany]] (SPD).\n\nGermany also has a number of other parties, in recent history most importantly the [[Free Democratic Party (Germany)|Free Democratic Party]] (FDP), [[Alliance 90/The Greens]], [[The Left (Germany)|The Left]], and more recently the [[Alternative for Germany]] (AfD), founded in 2013. The federal government of Germany often consisted of a [[German governing coalition|coalition]] of a major and a minor party, specifically CDU/CSU and FDP or

In [11]:
revsl[1000]['timestamp']

Timestamp(2010, 10, 23, 18, 26, 9)

In [12]:
text

'{{Short description|Political parties in Germany}}\n{{Politics of Germany|elections}}This article \'\'\'lists [[political party|political parties]] in [[politics of Germany|Germany]]\'\'\'.\n\nThe [[Federal Republic of Germany]] has a plural [[Multi-party system|multi party system]]. The largest by members and parliament seats are the [[Christian Democratic Union (Germany)|Christian Democratic Union]] (CDU), with its sister party, the [[Christian Social Union of Bavaria|Christian Social Union]] (CSU) and [[Social Democratic Party of Germany]] (SPD).\n\nGermany also has a number of other parties, in recent history most importantly the [[Free Democratic Party (Germany)|Free Democratic Party]] (FDP), [[Alliance 90/The Greens]], [[The Left (Germany)|The Left]], and more recently the [[Alternative for Germany]] (AfD), founded in 2013. The federal government of Germany often consisted of a [[German governing coalition|coalition]] of a major and a minor party, specifically CDU/CSU and FDP or

Parsing the page with wikitextparser, by first making a page object:

In [23]:
page = wtp.parse(text)

In [24]:
page

WikiText('{{Short description|Political parties in Germany}}\n{{Politics of Germany|elections}}This article \'\'\'lists [[political party|political parties]] in [[politics of Germany|Germany]]\'\'\'.\n\nThe [[Federal Republic of Germany]] has a plural [[Multi-party system|multi party system]]. The largest by members and parliament seats are the [[Christian Democratic Union (Germany)|Christian Democratic Union]] (CDU), with its sister party, the [[Christian Social Union of Bavaria|Christian Social Union]] (CSU) and [[Social Democratic Party of Germany]] (SPD).\n\nGermany also has a number of other parties, in recent history most importantly the [[Free Democratic Party (Germany)|Free Democratic Party]] (FDP), [[Alliance 90/The Greens]], [[The Left (Germany)|The Left]], and more recently the [[Alternative for Germany]] (AfD), founded in 2013. The federal government of Germany often consisted of a [[German governing coalition|coalition]] of a major and a minor party, specifically CDU/CSU a

Getting page templates:

In [15]:
page.templates[:10]

[Template('{{Short description|Political parties in Germany}}'),
 Template('{{Politics of Germany|elections}}'),
 Template('{{Cite web |url=http://www.dw-world.de/popups/popup_printcontent/0,,1647406,00.html |title=Chronik: Bundestagswahlen von 1949 bis 2002 &#124; Deutschland &#124; Deutsche Welle &#124; 02.10.2005 |access-date=2009-10-10 |archive-url=https://web.archive.org/web/20081108114315/http://www.dw-world.de/popups/popup_printcontent/0,,1647406,00.html |archive-date=2008-11-08 |url-status=dead }}'),
 Template('{{cite web |url=http://www.dw-world.de/dw/article/0,,4541120,00.html |title=Political parties form colorful spectrum in Germany |date=2009-08-18 |access-date=2009-09-12 |publisher=[[Deutsche Welle]] }}'),
 Template('{{citation |url=http://www.dw-world.de/dw/article/0,,4582700,00.html |title=The Green party: Getting used to opposition |date=2009-08-24 |access-date=2009-10-12 |publisher=[[Deutsche Welle]] |quote=This made a so-called [[Jamaica coalition (politics)|Jamaica 

Like in the previous section, we can get the links in the page, this time with a different order:

In [16]:
page.wikilinks[:10]

[WikiLink('[[political party|political parties]]'),
 WikiLink('[[politics of Germany|Germany]]'),
 WikiLink('[[Federal Republic of Germany]]'),
 WikiLink('[[Multi-party system|multi party system]]'),
 WikiLink('[[Christian Democratic Union (Germany)|Christian Democratic Union]]'),
 WikiLink('[[Christian Social Union of Bavaria|Christian Social Union]]'),
 WikiLink('[[Social Democratic Party of Germany]]'),
 WikiLink('[[Free Democratic Party (Germany)|Free Democratic Party]]'),
 WikiLink('[[Alliance 90/The Greens]]'),
 WikiLink('[[The Left (Germany)|The Left]]')]

Getting sections, no bugs with wikitexmtparser!

In [17]:
page.sections[0]

Section("{{Short description|Political parties in Germany}}\n{{Politics of Germany|elections}}This article '''lists [[political party|political parties]] in [[politics of Germany|Germany]]'''.\n\nThe [[Federal Republic of Germany]] has a plural [[Multi-party system|multi party system]]. The largest by members and parliament seats are the [[Christian Democratic Union (Germany)|Christian Democratic Union]] (CDU), with its sister party, the [[Christian Social Union of Bavaria|Christian Social Union]] (CSU) and [[Social Democratic Party of Germany]] (SPD).\n\nGermany also has a number of other parties, in recent history most importantly the [[Free Democratic Party (Germany)|Free Democratic Party]] (FDP), [[Alliance 90/The Greens]], [[The Left (Germany)|The Left]], and more recently the [[Alternative for Germany]] (AfD), founded in 2013. The federal government of Germany often consisted of a [[German governing coalition|coalition]] of a major and a minor party, specifically CDU/CSU and FDP 

Tables data:

In [18]:
data = page.tables[1].data()
data

[['Name',
  'Name',
  'Name',
  '{{tooltip|Abbr.|Abbreviation}}',
  'Leader(s)',
  'Ideology',
  'Political position',
  '[[Landtag|MdLs]]',
  'State'],
 ['',
  '[[File:CSU Logo since 2016.svg|center|75px]]',
  "[[Christian Social Union in Bavaria]]<br /><small>''Christlich-Soziale Union in Bayern''</small>",
  'CSU',
  '[[Markus Söder]]',
  '{{ubl|\n |[[Christian democracy]]\n |[[Conservatism]]\n |[[Bavaria]]n&nbsp;[[Regionalism (politics)|regionalism]]}}',
  '[[Centre-right politics|Centre-right]]',
  '{{Composition bar|82|205|{{party color|Christian Social Union of Bavaria}}}}',
  '[[Bavaria]]'],
 ['',
  '[[File:Logo-BVB-FREIE-WAEHLER.svg|center|75px]]',
  "[[Brandenburg United Civic Movements/Free Voters]]<br /><small>''Brandenburger Vereinigte Bürgerbewegungen / Freie Wähler''</small>",
  'BVB/FW',
  '[[Péter Vida]]',
  '[[Regionalism (politics)|Regionalism]]',
  "<!-- Please do NOT add a political position here unless it can be or is cited with a reliable third-party source eithe

Putting the data in a dataframe:

In [19]:
df = pd.DataFrame(data[1:])
df.columns = data[0]
df

Unnamed: 0,Name,Name.1,Name.2,{{tooltip|Abbr.|Abbreviation}},Leader(s),Ideology,Political position,[[Landtag|MdLs]],State
0,,[[File:CSU Logo since 2016.svg|center|75px]],[[Christian Social Union in Bavaria]]<br /><sm...,CSU,[[Markus Söder]],{{ubl|\n |[[Christian democracy]]\n |[[Conserv...,[[Centre-right politics|Centre-right]],{{Composition bar|82|205|{{party color|Christi...,[[Bavaria]]
1,,[[File:Logo-BVB-FREIE-WAEHLER.svg|center|75px]],[[Brandenburg United Civic Movements/Free Vote...,BVB/FW,[[Péter Vida]],[[Regionalism (politics)|Regionalism]],<!-- Please do NOT add a political position he...,{{Composition bar|5|88|{{party color|Brandenbu...,[[Brandenburg]]
2,,[[File:Ssw-logo.svg|center|75px]],[[South Schleswig Voters' Association]]<br /><...,SSW,[[Christian Dirschauer]],[[Social liberalism]]<br>[[Regionalism (politi...,[[Centre-left politics|Centre-left]],{{Composition bar|4|69|{{party color|South Sch...,[[Schleswig-Holstein]]
3,,[[File:Bürger in Wut Logo.svg|center|75px]],[[Citizens in Rage]]<br /><small>''Bürger in W...,BIW,[[Jan Timke]],[[Right-wing populism]],[[Right-wing politics|Right-wing]],{{Composition bar|2|84|{{party color|Citizens ...,[[Bremen (state)|Bremen]]
4,,[[File:Bürger für Thüringen Logo.png|center|75...,[[Citizens for Thuringia]]<br /><small>''Bürge...,BfTh,[[Ute Bergner]],[[Right-wing populism]],[[Right-wing politics|Right-wing]],{{Composition bar|2|90|#800080}},[[Thuringia]]


Parsing each cells data with mwparserfromhell and then making the dataframe:

In [44]:
for i in range(len(data)):
    for j in range(len(data[i])):
        wikicode = mwp.parse(data[i][j])
        data[i][j] = wikicode.strip_code(data[i][j])

In [45]:
df = pd.DataFrame(data[1:])
df.columns = data[0]
df

Unnamed: 0,Name,Name.1,Name.2,Unnamed: 4,Leader(s),Ideology,Political position,MdLs,State
0,,center|75px,Christian Social Union in BavariaChristlich-So...,CSU,Markus Söder,,Centre-right,,Bavaria
1,,center|75px,Brandenburg United Civic Movements/Free Voters...,BVB/FW,Péter Vida,Regionalism,,,Brandenburg
2,,center|75px,South Schleswig Voters' AssociationSüdschleswi...,SSW,Christian Dirschauer,Social liberalismRegionalismDanish minority in...,Centre-left,,Schleswig-Holstein
3,,center|75px,Citizens in RageBürger in Wut,BIW,Jan Timke,Right-wing populism,Right-wing,,Bremen
4,,center|75px,Citizens for ThuringiaBürger für Thüringen,BfTh,Ute Bergner,Right-wing populism,Right-wing,,Thuringia


### B1.3.2. Alternative ways for extracting tables data

**1. wikitables library:** Small bugs need to be handled by hand:


POURIA: wikitables should be mentioned in environment as well?

In [30]:
from wikitables import import_tables

tables = import_tables('List of political parties in Germany')

List of political parties in Germany[0][0]: dropping field from unknown column: S&D
List of political parties in Germany[0][0]: dropping field from unknown column: 404305
List of political parties in Germany[0][1]: dropping field from unknown column: EPP
List of political parties in Germany[0][1]: dropping field from unknown column: 399110
List of political parties in Germany[0][2]: dropping field from unknown column: EPP
List of political parties in Germany[0][2]: dropping field from unknown column: 137010
List of political parties in Germany[0][3]: dropping field from unknown column: Greens/EFA
List of political parties in Germany[0][3]: dropping field from unknown column: 106000
List of political parties in Germany[0][4]: dropping field from unknown column: RE
List of political parties in Germany[0][4]: dropping field from unknown column: 73000
List of political parties in Germany[0][5]: dropping field from unknown column: ID
List of political parties in Germany[0][5]: dropping fiel

List of political parties in Germany[2][26]: dropping field from unknown column: Right-wing to Far-right
List of political parties in Germany[2][26]: dropping field from unknown column: 
List of political parties in Germany[2][27]: dropping field from unknown column: Right-wing
List of political parties in Germany[2][27]: dropping field from unknown column: 
List of political parties in Germany[2][28]: dropping field from unknown column: 
List of political parties in Germany[2][28]: dropping field from unknown column: 
List of political parties in Germany[2][29]: dropping field from unknown column: Right-wing
List of political parties in Germany[2][29]: dropping field from unknown column: 
List of political parties in Germany[2][30]: dropping field from unknown column: Centre-right to right-wing
List of political parties in Germany[2][30]: dropping field from unknown column: 
List of political parties in Germany[2][31]: dropping field from unknown column: Far-left
List of political par

List of political parties in Germany[2][75]: dropping field from unknown column: Centre
List of political parties in Germany[2][75]: dropping field from unknown column: 
List of political parties in Germany[2][76]: dropping field from unknown column: Left-wing
List of political parties in Germany[2][76]: dropping field from unknown column: 
List of political parties in Germany[2][77]: dropping field from unknown column: 
List of political parties in Germany[2][77]: dropping field from unknown column: 
List of political parties in Germany[2][78]: dropping field from unknown column: 
List of political parties in Germany[2][78]: dropping field from unknown column: 
List of political parties in Germany[2][79]: dropping field from unknown column: 
List of political parties in Germany[2][79]: dropping field from unknown column: 
List of political parties in Germany[2][80]: dropping field from unknown column: 
List of political parties in Germany[2][80]: dropping field from unknown column: 
L

List of political parties in Germany[3][75]: dropping field from unknown column: 
List of political parties in Germany[3][76]: dropping field from unknown column: 
List of political parties in Germany[3][77]: dropping field from unknown column: 
List of political parties in Germany[3][78]: dropping field from unknown column: 
List of political parties in Germany[3][79]: dropping field from unknown column: 
List of political parties in Germany[3][80]: dropping field from unknown column: 
List of political parties in Germany[3][81]: dropping field from unknown column: Banned in 1952
List of political parties in Germany[3][82]: dropping field from unknown column: Split from SED
List of political parties in Germany[3][83]: dropping field from unknown column: 
List of political parties in Germany[3][84]: dropping field from unknown column: Split from KPD (RO)
List of political parties in Germany[3][85]: dropping field from unknown column: Merged into ISO
List of political parties in Germany

List of political parties in Germany[8][7]: dropping field from unknown column: 
List of political parties in Germany[8][8]: dropping field from unknown column: 
List of political parties in Germany[8][9]: dropping field from unknown column: 
List of political parties in Germany[8][10]: dropping field from unknown column: Short-lived merger of the DDP and Young German Order
List of political parties in Germany[8][11]: dropping field from unknown column: Predecessor of the Nazi Party
List of political parties in Germany[8][12]: dropping field from unknown column: 
List of political parties in Germany[8][13]: dropping field from unknown column: 
List of political parties in Germany[8][14]: dropping field from unknown column: 
List of political parties in Germany[9][0]: dropping field from unknown column: 1893–1933
List of political parties in Germany[9][1]: dropping field from unknown column: 1870–
List of political parties in Germany[9][2]: dropping field from unknown column: 1878–1918


In [31]:
tables

[<WikiTable 'List of political parties in Germany[0]'>,
 <WikiTable 'List of political parties in Germany[1]'>,
 <WikiTable 'List of political parties in Germany[2]'>,
 <WikiTable 'List of political parties in Germany[3]'>,
 <WikiTable 'List of political parties in Germany[4]'>,
 <WikiTable 'List of political parties in Germany[5]'>,
 <WikiTable 'List of political parties in Germany[6]'>,
 <WikiTable 'List of political parties in Germany[7]'>,
 <WikiTable 'List of political parties in Germany[8]'>,
 <WikiTable 'List of political parties in Germany[9]'>]

In [33]:
print(tables[0].rows[0])

{'Party': , '': center|75px, 'Leader': Social Democratic Party of Germany Sozialdemokratische Partei Deutschlands, 'Ideology': SPD, 'Political position': Lars Klingbeil , Saskia Esken, 'MdBs': Social democracy Pro-Europeanism, 'MEPs': Centre-left, 'EP group': 206 736 {{party color|Social Democratic Party of Germany}}, 'Membership': 16 96 {{party color|Social Democratic Party of Germany}}}


**2. Introducing DBpedia:** www.dbpedia.org

### ### B1.3.3. Alternative ways for extracting main text of different revisions

Extracting the main text of the first revision of an article in each year since the beginning:

In [25]:
import pywikibot
import mwparserfromhell

In [26]:
site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, "Koç University")

In [27]:
revisions = page.revisions(content=True)

In [28]:
revisions_list = []
years = []

for i in revisions:
    revisions_list.append(i)
    years.append(int(str(i['timestamp'])[:4]))
years.reverse()
revisions_list.reverse()

In [29]:
# years

In [80]:
# revisions_list[-1]

In [81]:
yearly_revisions = []
for i in range(years[0], years[-1]+1):
    index = years.index(i)
    yearly_revisions.append(revisions_list[index])

In [82]:
# yearly_revisions[-1]

In [83]:
text = yearly_revisions[-1].text

In [84]:
parsed = mwparserfromhell.parse(text)

In [85]:
print(parsed.strip_code())

Koç University () is a non-profit private university in Istanbul, Turkey. It started education in temporary buildings in İstinye in 1993, and moved to its current Rumelifeneri campus near Sarıyer in 2000. Koç University is ranked highest in Turkey according to the 2022 Times Higher Education World University Rankings and 2022 QS World University Rankings. Koç University currently consists of Colleges of Social Sciences and Humanities, Administrative Sciences and Economics, Science, Engineering, Law, Nursing and Medicine. Koç University offers 22 undergraduate, 29 graduate and 30 PhD programs. The university is home to around 7,000 students. The university accepts international students from various countries and has an extensive network of over 250 partner-universities including University of California and other universities such as Northwestern University, Cornell University and Georgetown University.

Founded in 1993, Koç University has become one of the most prestigious universitie

## B1.4. More APIs and precollected datasets 

<img src="./images/datasets.jpg" width="500" height = "900" align="left"/>  

- __More APIs__

    [Facebook for Developers](https://developers.facebook.com/)  
    [Facebook Ads API](https://developers.facebook.com/docs/marketing-apis/)  
    [Instagram Developer](https://developers.facebook.com/docs/instagram-basic-display-api)  
    [YouTube Developers](https://developers.google.com/youtube/)  
    [Weibo API](http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3/en)  
    [CrowdTangle](https://www.crowdtangle.com/request)  
    [4chan](https://github.com/4chan/4chan-API)  
    [Gab](https://github.com/a-tal/gab)  
    [Github REST API](https://docs.github.com/en/rest)  
    [Github GraphQL](https://docs.github.com/en/graphql)  
    [Stackoverflow](https://api.stackexchange.com/docs)  
    [Facepager](https://github.com/strohne/Facepager)  


- __Precollected datasets__  
    https://datasetsearch.research.google.com  
    https://www.kaggle.com/datasets  
    https://data.gesis.org/sharing/#!Search  


- __Locating or Requesting Social Media Data__
    https://www.programmableweb.com

## B1.5. Challenges

 >Two main downsides to working with APIs. First, there may be restrictions on what data is provided, and if there are, those restrictions are often grounded in business interests rather than technical requirements and limitations. That said, the data that API providers choose to include is almost always complete because it is often generated programmatically as a by-product of their platform, and applications depend on it (<a href='#mclevey_doing_2022'>McLevey, 2022</a>, ch.4). Second, APIs change, sometimes unexpectedly (see <a href='#Freelon'>Freelon, 2018</a>; <a href='#Hogan'>Hogan, 2018</a>; <a href='#Jünger'>Jünger, 2021</a>). 

For example, Facebook completely closed down many of it’s APIs and it is not very hard to get Facebook data besides CrowdTangle or FB Ads.

Twitter’s API now has the version 2 with substantial changes. These challanges make us stay vigilant and continuously update our code to keep up with the APIs. Also, good to keep ourselves up to date with tech companies like with this news [Why Twitter ending free access to its APIs should be a ‘wake-up call’](https://www.theguardian.com/technology/2023/feb/07/techscape-elon-musk-twitter-api).

More on Social Media data collection and data quality, please visit this [source](https://www.slideshare.net/suchprettyeyes/working-with-socialmedia-data-ethics-good-practice-around-collecting-using-and-storing-data).

## Commented references

<a id='mclevey_doing_2022'></a>
McLevey, J. (2022). *Doing Computational Social Science: A Practical Introduction*. SAGE. https://us.sagepub.com/en-us/nam/doing-computational-social-science/book266031. *A rather complete introduction to the field with well-structured and insightful chapters also on using Pandas. The [website](https://github.com/UWNETLAB/dcss_supplementary) offers the code used in the book.*

Zenk-Möltgen, Wolfgang (GESIS - Leibniz Institute for the Social Sciences), Python Script to rehydrate Tweets from Tweet IDs https://doi.org/10.7802/1504

Pfeffer, Morstatter (2016): Geotagged Twitter posts from the United States: A tweet collection to investigate representativeness. Dataset. http://dx.doi.org/10.7802/1166

<a id='statista'></a>
Statista, 2023. https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/. Retrieved 26.04.2023.

<a id='van_Vliet'></a>
van Vliet, L., Törnberg, P., & Uitermark, J. (2020) "The Twitter parliamentarian database: Analyzing Twitter politics across 26 countries". PLoS ONE 15(9): e0237073. https://doi.org/10.1371/journal.pone.0237073.

<a id='Freelon'></a>
Freelon, D. (2018) "Computational Research in the Post-API Age". Political
Communication, 35 (4): 665–668. https://doi.org/10.1080/10584609.2018.1477506

<a id='Hogan'></a>
Hogan, B. (2018) "Social Media Giveth, Social Media Taketh Away: Facebook,
friendships, and APIs". International Journal of Communication, 12: 592–611. https://ssrn.com/abstract=3084159

<a id='Jünger'></a>
Jünger, J. (2021) "A brief history of APIs: Limitations and opportunities for online
research", in U. Engle and A. Quan-Haase (eds),. Handbook of Computational Social
Science Abingdon: Routledge. https://doi.org/10.4324/9781003025245

<a id='Sen'></a>
Sen, I., Flöck, F., Weller, K., Weiß, B., & Wagner, C. (2021). *A Total Error Framework for Digital Traces of Human Behavior on Online Platforms*. Public Opinion Quarterly, 85(S1), 399–422. https://doi.org/10.1093/poq/nfab018

<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main author: N. Gizem Bacaksizlar Turbic & Pouria Mirelmi 

Contributors: Haiko Lietz

Acknowledgements: 

Version date: 25. April 2023

License: ...
</div>