<img src='images/header.png' style='height: 50px; float: left'>

## Introduction to Computational Social Science methods with Python

# Session B1: API harvesting and Twitter API

**Data collection** is the process of gathering and measuring information from all relevant sources, using a variety of techniques, so that it can be analyzed for research. Researchers evaluate their research questions and hypotheses on the basis of the collected data. In most cases, data collection is the primary and most important step of research, irrespective of the field of study. The approach to data collection varies across fields, depending on the information required.

Easy access to technology has made social media platforms popular communication tools and, therefore, a popular source of data. With the rise of social media as a data source, collecting data through APIs has become an in-demand skill. In this session, we teach how to collect data from social media platforms such as Twitter.

<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn how to collect digital behavioral data via API harvesting. In subsession **B1.1**, we start with how to collect data programmatically using web-based application programming interfaces (APIs). In subsession **B1.2**, we introduce how to collect data from Twitter, including rehydrating tweet identifiers, getting information on users, and keyword search. In subsession **B1.3**, more APIs are introduced. Finally, subsession **B1.4** is dedicated to the challenges of collecting data through APIs.
</div>

## B1.1. Collecting data from API

<img src="./images/database.png"  width="150" height = "150" align="right"/>


>APIs provide a number of benefits when it comes to collecting data from the web. With rare exceptions, rules about what data we can collect and how we can collect it using APIs are explicitly stated, which removes some of the legal uncertainties that can complicate other data collection methods. Widely used APIs are generally well documented and maintained, and the data returned is usually well-structured and easy to work with. (<a href='#mclevey_doing_2022'>McLevey, 2022</a>, ch.4)

In order to access an API, you usually first need to create an account on the platform you want to work with and then apply for a developer account. With this developer account, the platform provides you with KEYS (e.g., secret, public, or access keys) that you use to authenticate with its system.

While web scraping is one of the common ways of collecting data from websites, a lot of websites offer APIs to access the public data that they host on their website. This is to avoid unnecessary traffic on the websites.

However, even when we have access to these APIs, as researchers we should respect the API access rules and always read the documentation before collecting data.
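To make the role of the keys more concrete, here is a minimal sketch of how a bearer token typically travels with a request: it is sent in an `Authorization` header. The endpoint `https://api.example.com/v1/posts`, its parameters, and the token are hypothetical placeholders, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_api_request(base_url, params, bearer_token):
    """Build (but do not send) a GET request that carries a bearer token."""
    url = f"{base_url}?{urlencode(params)}"
    return Request(url, headers={"Authorization": f"Bearer {bearer_token}"})


# Hypothetical endpoint, parameters, and token, for illustration only
request = build_api_request(
    "https://api.example.com/v1/posts",
    {"query": "social science", "limit": 10},
    "YOUR_TOKEN",
)

print(request.full_url)
print(request.get_header("Authorization"))
```

Wrappers such as the ones used later in this session take care of this step for you; you only hand them the key.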


## B1.2. Getting practical with Twitter API 

Twitter is one of the most widely used social media platforms in academic research. This microblogging and social networking service hosts users who can post and interact with messages known as "tweets". Registered users can post, like, and retweet tweets, but unregistered users can only read those that are publicly available. As of January 2023, Twitter has 556 million active users worldwide (<a href='#statista'>Statista, 2023</a>).

<img src="./images/twitter.png"  width="200" height = "200" align="left"/>


Different access options for different purposes:

- Twitter Developer: https://developer.twitter.com/
- APIs: https://developer.twitter.com/en/docs
- GNIP: http://support.gnip.com/apis/
- Twitter Enterprise: https://developer.twitter.com/en/enterprise

It is IMPORTANT to note that the free APIs cover only Tweets from the last 7 days; Premium APIs exist for 30-day search and beyond. If you have the Academic Research access level, you can access even more data with the full-archive search endpoint. API policies, such as functionalities and user agreements, change over time. Limitations on volume and functions should also be considered.

Before we start with our first project on Twitter, you need to sign up for Twitter and then create a Developer account:

- Sign up from [here.](https://help.twitter.com/en/using-twitter/create-twitter-account)
- Create a Developer Account from [here.](https://developer.twitter.com)

### B1.2.1. Setting up Twitter API keys 

In [None]:
apikey = 'YOURapikey' #25 alphanumeric characters
apisecretkey = 'YOURapisecretkey'
accesstoken = 'YOURaccesstoken'
accesstokensecret = 'YOURaccesstokensecret'
bearertoken = 'YOURbearertoken'
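Hard-coding keys in a notebook makes it easy to share them by accident. One common alternative, sketched here, is to read them from environment variables. The variable name `TWITTER_BEARER_TOKEN` and the helper `load_credentials()` are assumptions made for this illustration, not something the Twitter API or twarc prescribes.

```python
import os


def load_credentials(names, env=os.environ):
    """Return the requested credentials, failing early if any are missing."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Demonstration with a toy environment; in practice, omit `env` so that the
# real environment is used (e.g. after `export TWITTER_BEARER_TOKEN=...`)
demo = load_credentials(
    ["TWITTER_BEARER_TOKEN"],
    env={"TWITTER_BEARER_TOKEN": "dummy-token"},
)
print(demo["TWITTER_BEARER_TOKEN"])
```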


<img src="./images/developer_portal.png"  width="500" height = "500" align="center"/>

### B1.2.2. Rehydrating tweets

Researchers share large tweet data sets as lists of **tweet identifiers**, because the Twitter Terms of Service do not allow researchers to share the full tweet data. In order to get the tweets used in a research work, we need to retrieve/reconstruct the tweet data using those tweet identifiers (tweet IDs). This process is called hydrating/rehydrating tweets.

Since some of the tweets used in a research work might have been deleted over time, we may not be able to access every single tweet that was available when that research was done. We will look at this in more detail later in this subsession.

In order to rehydrate tweets, we will use the twarc library, a Python wrapper for the Twitter API. You can install it with `pip`.

In [None]:
from twarc import Twarc2, expansions

We will rehydrate tweets from The Twitter Parliamentarian Database (<a href='#van_Vliet'>van Vliet et al., 2020</a>) for our teaching purposes here. Download the `2021.csv` data set from [this](https://figshare.com/articles/dataset/The_Twitter_Parliamentarian_Database/10120685) link, put it into a `data` folder in the current directory, and read it like the following:

In [None]:
import pandas as pd

data = pd.read_csv('./data/2021.csv', header = None, low_memory=False)
data.columns = ['country', 'party', 'author', 'author_id', 'district','date','tweet_id']

In [None]:
data.head()

We will take the tweets for Turkey and keep their IDs to rehydrate. We'll try rehydrating a random sample of 1000 of them:

In [None]:
turkey = data[data['country'] == 'Turkey']

tweet_ids = list(turkey.sample(1000, random_state = 2023)['tweet_id'])

We will use the following [`rehydrate()`](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6a-labs-code-academic-python.md) function to rehydrate the tweets and keep their data in the `tweets` list. It uses twarc's [`tweet_lookup()`](https://twarc-project.readthedocs.io/en/latest/api/client2/#twarc.client2.Twarc2.tweet_lookup) function:

In [None]:
import json

# Use your bearer token here
client = Twarc2(bearer_token=bearertoken)

tweets = []

def rehydrate(ids: list):
    # List of Tweet IDs you want to lookup
    tweet_ids = ids
    # The tweet_lookup function from twarc 
    lookup = client.tweet_lookup(tweet_ids=tweet_ids)
    for page in lookup:
        # The Twitter API v2 returns the Tweet information and the user, media etc.  separately
        # so we use expansions.flatten to get all the information in a single JSON
        result = expansions.flatten(page)
        for tweet in result:
            tweets.append(tweet)

Running the function for the 1000 tweet IDs takes around 30 seconds, since twarc sends a GET request for 100 tweet IDs every 3 seconds. More on twitter rate limits [here](https://developer.twitter.com/en/docs/twitter-api/rate-limits).
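The arithmetic behind this estimate can be sketched as follows. twarc does the batching internally, so `batch_ids()` and `estimated_seconds()` are purely illustrative helpers; the batch size of 100 and the 3 seconds per request are taken from the description above.

```python
import math


def batch_ids(ids, batch_size=100):
    """Split a list of IDs into consecutive batches of at most batch_size."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


def estimated_seconds(n_ids, batch_size=100, seconds_per_request=3):
    """Rough duration if one request is sent every seconds_per_request."""
    return math.ceil(n_ids / batch_size) * seconds_per_request


example_ids = [str(i) for i in range(1000)]
batches = batch_ids(example_ids)

# 1000 IDs make 10 requests of 100 IDs each, i.e. about 30 seconds
print(len(batches), estimated_seconds(len(example_ids)))
```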

In [None]:
rehydrate(tweet_ids)

To see the information returned for each tweet ID, we can check the first item in `tweets` list:

In [None]:
tweets[0]

We can also check how many tweets could be rehydrated from the IDs:

In [None]:
len(tweets)

As you can see, only about 92 percent of the tweets could be rehydrated; the others are no longer available.
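To find out which IDs were lost, we can compare the requested IDs with the IDs of the returned tweets. The sketch below uses toy data; `rehydration_report()` is a hypothetical helper. It converts everything to strings because IDs read from a CSV file are typically integers, while the API returns them as strings.

```python
def rehydration_report(requested_ids, rehydrated_tweets):
    """Return the share of rehydrated IDs and the sorted list of missing IDs."""
    requested = {str(i) for i in requested_ids}
    returned = {str(t["id"]) for t in rehydrated_tweets}
    missing = sorted(requested - returned)
    share = len(requested & returned) / len(requested) if requested else 0.0
    return share, missing


# Toy data: four requested IDs, three tweets returned
toy_requested = [101, 102, 103, 104]
toy_tweets = [{"id": "101"}, {"id": "103"}, {"id": "104"}]

share, missing = rehydration_report(toy_requested, toy_tweets)
print(f"{share:.0%} rehydrated; missing: {missing}")
```

With the real data, the call would be `rehydration_report(tweet_ids, tweets)`.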

We can show some of the useful information of tweets in a dataframe:

In [None]:
tweet_id = []
author_id = []
created_at = []
text = []
reply_count = []
like_count = []

for i in tweets:
    
    # Take the ID from the returned tweet itself: deleted tweets are missing from `tweets`,
    # so the requested `tweet_ids` list would not line up with the other columns
    tweet_id.append(i['id'])
    author_id.append(i['author_id'])
    created_at.append(i['created_at'])
    text.append(i['text'])
    reply_count.append(i['public_metrics']['reply_count'])
    like_count.append(i['public_metrics']['like_count'])
    
tweets_df = pd.DataFrame(data=[tweet_id, author_id, created_at, text, reply_count, like_count]).transpose()

tweets_df.columns = ["tweet id","author id", "created at", "text", "reply count","like count"]

tweets_df.head()
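Not every tweet object necessarily carries every field. A slightly more defensive variant, sketched here with toy data, builds one dictionary per tweet and falls back to `None` when a field is absent. `tweet_to_row()` is a hypothetical helper; the field names are the ones used above.

```python
def tweet_to_row(tweet):
    """Pick a few fields from a tweet dictionary, tolerating missing keys."""
    metrics = tweet.get("public_metrics", {})
    return {
        "tweet id": tweet.get("id"),
        "author id": tweet.get("author_id"),
        "created at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "reply count": metrics.get("reply_count"),
        "like count": metrics.get("like_count"),
    }


# Toy tweets; the second one lacks 'created_at' and 'public_metrics'
toy_tweets = [
    {
        "id": "1",
        "author_id": "9",
        "created_at": "2021-01-01T00:00:00.000Z",
        "text": "hello",
        "public_metrics": {"reply_count": 2, "like_count": 5},
    },
    {"id": "2", "author_id": "9", "text": "no metrics here"},
]

rows = [tweet_to_row(t) for t in toy_tweets]
print(rows[1])
```

A list of such dictionaries can be passed directly to `pd.DataFrame(rows)`.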

### B1.2.3. Getting users info

Consider the `turkey` dataframe that we created from the `2021.csv` data in the previous section. We want to get the profile information of a fraction of the politicians whose tweets are in that dataframe. First, we get the unique politicians in the data:

In [None]:
# Getting all IDs in the dataframe 
author_ids = turkey['author_id']

# Getting unique IDs
unique_ids = list(author_ids.unique())

With the following [`get_user()`](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6a-labs-code-academic-python.md) function, we can get the users' information based on their IDs (It's a bit like tweet rehydration, but we use users' IDs this time), and save them in the `users_list` list. 

`get_user()` uses twarc's [`user_lookup()`](https://twarc-project.readthedocs.io/en/latest/api/client2/#twarc.client2.Twarc2.user_lookup) function:

In [None]:
from twarc import Twarc2, expansions
import json

users_list = []

# Replace your bearer token below
client = Twarc2(bearer_token=bearertoken)

def get_user(ids):
    # List of user IDs to lookup, add the ones you would like to lookup
    users = ids
    # The user_lookup function gets the hydrated user information for specified users
    lookup = client.user_lookup(users=users)
    for page in lookup:
        result = expansions.flatten(page)
        for user in result:
            # Append each flattened user object to the users_list list
            users_list.append(user)

In [None]:
import random

some_ids = random.sample(unique_ids, 50)

get_user(some_ids)

Now we can put some of the useful user information into a dataframe:

In [None]:
user_id = []
username = []
display_name = []
profile_pic = []
followers = []
followings = []

for i in users_list:
    
    # Take the ID from the returned user object: accounts that no longer exist are not
    # returned, so the requested `some_ids` list may not line up with the other columns
    user_id.append(i['id'])
    username.append(i['username'])
    display_name.append(i['name'])
    profile_pic.append(i['profile_image_url'])
    followers.append(i['public_metrics']['followers_count'])
    followings.append(i['public_metrics']['following_count'])
    
users_df = pd.DataFrame(data=[user_id, username, display_name, profile_pic, followers, followings]).transpose()

users_df.columns = ["user id", "user name", "display name", "profile picture", "followers", "followings"]

users_df.head()

### B1.2.4. Keyword search limited to a time window

We can use the [`search_all()`](https://twarc-project.readthedocs.io/en/latest/api/client2/#twarc.client2.Twarc2.search_all) function of twarc to search for tweets in any time window of our choice. The following [`search()`](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6a-labs-code-academic-python.md) function looks for any tweet matching the query it takes, limited to a beginning and end time, and saves the results into the `tweets` list.


You can find more information on writing search queries [here](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/5-how-to-write-search-queries.md).
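As a small illustration of the query syntax, the sketch below assembles a query string from a list of alternatives and a list of required terms. It relies on three rules of Twitter search queries: terms separated by a space must all match, `OR` separates alternatives, and double quotes match an exact phrase. `build_query()` is a hypothetical helper, and quoting multi-word terms is a choice made here; an unquoted multi-word term matches tweets that contain all of its words in any position.

```python
def build_query(alternatives, required=()):
    """Combine alternative terms with OR and append terms that must match."""

    def quote(term):
        # Wrap multi-word terms in double quotes so they match as exact phrases
        return f'"{term}"' if " " in term else term

    group = " OR ".join(quote(term) for term in alternatives)
    parts = [f"({group})"] + [quote(term) for term in required]
    return " ".join(parts)


q_example = build_query(
    ["Computational Social Science", "ComputationalSocialScience"],
    required=["Turkey"],
)
print(q_example)
```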

In [None]:
from twarc import Twarc2, expansions
import datetime
import json


# Replace your bearer token below
client = Twarc2(bearer_token=bearertoken)

def search(beginning, end, q):
    
    # beginning and end are the start and end times (in UTC) of the period you want Tweets from;
    # q is the search query

    # The search_all method calls the full-archive search endpoint to get Tweets based on the query, start and end times
    search_results = client.search_all(query=q, start_time=beginning, end_time=end, max_results=100)

    # Twarc returns all Tweets for the criteria set above, so we page through the results
    for page in search_results:
        # The Twitter API v2 returns the Tweet information and the user, media etc.  separately
        # so we use expansions.flatten to get all the information in a single JSON
        result = expansions.flatten(page)
        for tweet in result:
            # Append each flattened Tweet to the tweets list
            tweets.append(tweet)

In [None]:
tweets = []

# Beginning time for the time window
beginning = datetime.datetime(2008, 1, 5, 0, 0, 0, 0, datetime.timezone.utc)

# End time for the time window
end = datetime.datetime(2023, 1, 8, 0, 0, 0, 0, datetime.timezone.utc)

# The query for searching
q = "(Computational Social Science OR ComputationalSocialScience OR Quantitative Social Science OR QuantitativeSocialScience) Turkey"

search(beginning, end, q)

You can take a look at the number of retrieved tweets and the first tweet like this:

In [None]:
# Number of retrieved tweets
len(tweets)

In [None]:
# The information available for each tweet
tweets[0].keys()

In [None]:
tweets[0]['entities']['mentions'][0].keys()

In [None]:
# The overall information of the first tweet 
tweets[0]

We can show some of the useful information of tweets in a dataframe:

In [None]:
tweet_id = []
author_id = []
created_at = []
text = []
lang = []

for i in tweets:
    
    tweet_id.append(i['id'])
    author_id.append(i['author_id'])
    created_at.append(i['created_at'])
    text.append(i['text'])
    lang.append(i['lang'])
    
search_df = pd.DataFrame(data=[tweet_id, author_id, created_at, text, lang]).transpose()

search_df.columns = ["tweet id","author id", "created at", "text", "language"]

search_df.head()

#### Data dictionaries of the tweets

As you saw, the returned data for the tweets are in the form of large JSONs. Twitter uses these sorts of data dictionaries to store all the information about the tweets. You can find an overview of this information in the following tables.

#### [Tweets data dictionary](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet):

| Field value | Type | Description | How it can be used |
|-------|-----------|----------|----------|
| `id` | string | The unique identifier of the requested Tweet. `"id": "1050118621198921728"` | Use this to programmatically retrieve a specific Tweet. |
| `text` | string | The actual UTF-8 text of the Tweet. See [twitter-text](https://github.com/twitter/twitter-text/) for details on what characters are currently considered valid. `"text": "To make room for more expression, we will now count all emojis as equal—including those with gender‍‍‍ ‍‍and skin tone modifiers 👍🏻👍🏽👍🏿. This is now reflected in Twitter-Text, our Open Source library. \n\nUsing Twitter-Text? See the forum post for detail: https://t.co/Nx1XZmRCXA"` | Keyword extraction and sentiment analysis/classification. |
| `edit_history_tweet_ids` | object | Unique identifiers indicating all versions of a Tweet. For Tweets with no edits, there will be one ID. For Tweets with an edit history, there will be multiple IDs, arranged in ascending order reflecting the order of edits. The most recent version is the last position of the array. `"edit_history_tweet_ids": ["1584717154800521216"]` | Use this information to find the edit history of a Tweet. |
| `author_id` | string | The unique identifier of the User who posted this Tweet. `"author_id": "2244994945"` | Hydrating User object, sharing dataset for peer review |
| `conversation_id` | string | The Tweet ID of the original Tweet of the conversation (which includes direct replies, replies of replies). `"conversation_id": "1050118621198921728"` | Use this to reconstruct the conversation from a Tweet. |
| `created_at` | date (ISO 8601) | Creation time of the Tweet. `"created_at": "2019-06-04T23:12:08.000Z"` | This field can be used to understand when a Tweet was created and used for time-series analysis etc. |
| `edit_controls` | object | When present, this indicates how much longer the Tweet can be edited and the number of remaining edits. Tweets are only editable for the first 30 minutes after creation and can be edited up to five times. `"edit_controls": {"edits_remaining": 5, "is_edit_eligible": true, "editable_until": "2022-10-25T01:53:06.000Z"}` | Use this to determine if a Tweet is eligible for editing. |
| `lang` | string | Language of the Tweet, if detected by Twitter. Returned as a BCP47 language tag. `"lang": "en"` | Classify Tweets by spoken language. |
| `possibly_sensitive` | boolean | This field indicates content may be recognized as sensitive. The Tweet author can select within their own account preferences and choose “Mark media you tweet as having material that may be sensitive” so each Tweet created after has this flag set. This may also be judged and labeled by an internal Twitter support agent. `"possibly_sensitive": false` | Studying circulation of certain types of content. |
| `public_metrics` | object | Public engagement metrics for the Tweet at the time of the request. `"public_metrics" : {"retweet_count": 8, "reply_count": 2, "like_count": 39, "quote_count": 1}` | Use this to measure Tweet engagement. |
| `referenced_tweets` | array | A list of Tweets this Tweet refers to. For example, if the parent Tweet is a Retweet, a Retweet with comment (also known as Quoted Tweet) or a Reply, it will include the related Tweet referenced to by its parent. `"referenced_tweets": [{"type": "replied_to", "id": "1242125486844604425"}]` | This field can be used to understand conversational aspects of retweets etc. |
| `reply_settings` | string | Shows you who can reply to a given Tweet. Fields returned are "everyone", "mentioned_users", and "followers". `"reply_settings": "everyone"` | This field allows you to determine whether conversation reply settings have been set for the Tweet and if so, what settings have been set. |
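As an example of working with these fields, the `created_at` timestamps can be parsed with the standard library and counted per day as a first step towards a time-series analysis. This is a minimal sketch on toy timestamps; the format string assumes the millisecond-precision form shown in the table, and `parse_created_at()` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_created_at(value):
    """Parse a timestamp such as '2019-06-04T23:12:08.000Z' as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


# Toy timestamps in the same format as the 'created_at' field
toy_timestamps = [
    "2019-06-04T23:12:08.000Z",
    "2019-06-04T08:00:00.000Z",
    "2019-06-05T10:30:00.000Z",
]

tweets_per_day = Counter(
    parse_created_at(ts).date().isoformat() for ts in toy_timestamps
)
print(tweets_per_day)
```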

#### [Mentions data:](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet)

Within the main tweet data dictionary, you can access data about the tweet's mentions via the nested `entities` and `mentions` fields:

| Name | Type | Description |
|-------|-----------|----------|
| `entities.mentions` | array | Contains details about text recognized as a user mention. |
| `entities.mentions.start` | integer | The start position (zero-based) of the recognized user mention within the Tweet. All start indices are inclusive. |
| `entities.mentions.end` | integer | The end position (zero-based) of the recognized user mention within the Tweet. This end index is exclusive. |
| `entities.mentions.username` | string | The part of text recognized as a user mention. You can obtain the expanded object in `includes.users` by adding `expansions=entities.mentions.username` in the request's query parameter. |
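The sketch below shows how these fields can be read from a tweet dictionary. It uses a toy tweet, and the helper names are hypothetical. Tweets without mentions may have no `entities` or `mentions` key at all, so the helpers fall back to an empty list. Using `start` and `end` directly as Python slice indices is an assumption that holds for plain text like the toy example; texts containing emoji may need extra care.

```python
def mentioned_usernames(tweet):
    """Return the usernames mentioned in a tweet (empty list if none)."""
    mentions = tweet.get("entities", {}).get("mentions", [])
    return [m["username"] for m in mentions]


def mention_spans(tweet):
    """Return the text slices covered by each mention's start/end indices."""
    mentions = tweet.get("entities", {}).get("mentions", [])
    return [tweet["text"][m["start"]:m["end"]] for m in mentions]


# Toy tweet: start is inclusive and end is exclusive, as described in the table
toy_tweet = {
    "text": "Thanks @TwitterDev for the API docs",
    "entities": {"mentions": [{"start": 7, "end": 18, "username": "TwitterDev"}]},
}

print(mentioned_usernames(toy_tweet))
print(mention_spans(toy_tweet))
print(mentioned_usernames({"text": "no mentions here"}))
```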

#### [Users data dictionary](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user):

There are similar dictionaries for keeping the users (authors) data. You can see an overview of that here:

| Field value | Type | Description | How it can be used |
|-------|-----------|----------|----------|
| `name` | string | The name of the user, as they’ve defined it on their profile. Not necessarily a person’s name. Typically capped at 50 characters, but subject to change. `"name": "Twitter Dev"` | - |
| `username` | string | The Twitter screen name, handle, or alias that this user identifies themselves with. Usernames are unique but subject to change. Typically a maximum of 15 characters long, but some historical accounts may exist with longer names. `"username": "TwitterDev"` | - |
| `created_at` | date (ISO 8601) | The UTC datetime that the user account was created on Twitter. `"created_at": "2013-12-14T04:35:55.000Z"` | Can be used to determine how long someone has been using Twitter. |
| `description` | string | The text of this user's profile description (also known as bio), if the user provided one. `"description": "The voice of Twitter's #DevRel team, and your official source for updates, news, & events about Twitter's API. \n\n#BlackLivesMatter"` | - |
| `location` | string | The location specified in the user's profile, if the user provided one. As this is a freeform value, it may not indicate a valid location, but it may be fuzzily evaluated when performing searches with location queries. `"location": "127.0.0.1"` | - |
| `profile_image_url` | string | The URL to the profile image for this user, as shown on the user's profile. `"profile_image_url": "https://pbs.twimg.com/profile_images/1267175364003901441/tBZNFAgA_normal.jpg"` | Can be used to download this user's profile image. |
| `protected` | boolean | Indicates if this user has chosen to protect their Tweets (in other words, if this user's Tweets are private). `"protected": false` | - |
| `public_metrics` | object | Contains details about activity for this user. `"public_metrics": {"followers_count": 507902, "following_count": 1863, "tweet_count": 3561, "listed_count": 1550}` | Can potentially be used to determine a Twitter user’s reach or influence, quantify the user’s range of interests, and the user’s level of engagement on Twitter. |
| `verified` | boolean | Indicates if this user is a verified Twitter User. `"verified": true` | Indicates whether or not this Twitter user has a verified account. A verified account lets people know that an account of public interest is authentic. |

### B1.2.5. Documentation of digital behavioral data(sets) collected from online platforms

In the following, we show how to systematically describe digital behavioral data. For this purpose, we use the TES-D template (ADD citation; <a href='#Fröhling'>Fröhling et al., 2023</a>; <a href='#Sen'>Sen et al., 2021</a>). For more details, refer to the TES-D Manual (ADD citation).

**TES-D “Computational Social Science Turkey Tweets 2008-2023”**

**General Characteristics** 

1. *Who collected the dataset and who funded the process?*

The dataset has been collected by the "Social ComQuant" project team (Gizem Bacaksizlar Turbic, Haiko Lietz, Pouria Mirelmi, Olga Zagovora) at GESIS - Leibniz Institute for the Social Sciences, Computational Social Science department. The dataset collection was funded by the European Commission as a part of [the Social ComQuant Project](https://socialcomquant.ku.edu.tr/).

2. *Where is the dataset hosted? Is the dataset distributed under a copyright or license?* 

The dataset is hosted in the open-access [GitHub repository](https://github.com/gesiscss/css_methods_python) of the CSS department at GESIS. ADD LICENSE

3. *What do the instances that comprise the dataset represent? What data does each instance consist of?*

Each line of the dataset represents a distinct Tweet posted on Twitter in the period between 5th January 2008 and 8th January 2023. Each instance consists of: the unique identifier of the Tweet, the unique identifier of the User who posted this Tweet, the creation time of the Tweet (in ISO 8601 format), the actual UTF-8 text of the Tweet, and the language of the Tweet, if detected by Twitter (returned as a BCP47 language tag). The data was not preprocessed and is represented in the formats provided by the API.

4. *How many instances are there in total in each category (as defined by the instances’ label), and - if applicable - in each recommended data split?*

There are 105 instances in the dataset. The instances are homogeneous, i.e., each of them represents a Tweet.

5. *In which contexts and publications has the dataset been used already?* 

The dataset has been used in the online materials of the [Introduction to Computational Social Science methods with Python](https://github.com/gesiscss/css_methods_python) course.

6. *Are there alternative datasets that could be used for the measurement of the same or similar constructs? Could they be a better fit? How do they differ?* 

The dataset has been created for teaching purposes, namely as an exercise in getting data using an API. No similar dataset is known.

7. *Can the dataset collection be readily reproduced given the current data access, the general context and other potentially interfering developments?*

The [Jupyter Notebook](https://github.com/gesiscss/css_methods_python/blob/main/b_data_collection_methods/1_API_harvesting.ipynb), subsection B1.2.4, provides Python code that explains how to obtain the dataset. Be aware that the Twitter API might be deprecated due to changes in the policies on free access to the API. All relevant information can be found in the [documentation](https://developer.twitter.com/en/docs) or in the news article [Why Twitter ending free access to its APIs should be a ‘wake-up call’](https://www.theguardian.com/technology/2023/feb/07/techscape-elon-musk-twitter-api).

8. *Were any ethical review processes conducted?* 

No ethical review processes have been conducted. The dataset does not contain any private data.

9. *Did any ethical considerations limit the dataset creation?* 

We have not stored any data related to the user accounts that posted the relevant Tweets. Storing such data could raise additional ethical considerations.

10. *Are there any potential risks for individuals using the data?* 

Theoretically, some Tweets' texts can include usernames. Thus, to achieve complete anonymisation, one might need to postprocess the data and remove these names.
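One possible way to do such postprocessing is sketched below: a regular expression replaces `@username` mentions with a placeholder. The pattern and the placeholder `@USER` are assumptions made for this illustration. The approach only catches names written as mentions, so names that appear without a leading `@` remain in the text.

```python
import re

# An '@' that does not follow a word character (which skips e-mail addresses),
# followed by letters, digits, or underscores
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")


def redact_mentions(text, placeholder="@USER"):
    """Replace every @username mention in the text with a placeholder."""
    return MENTION_PATTERN.sub(placeholder, text)


example = "Great talk by @jane_doe and @bob99! Mail me: x@example.com"
print(redact_mentions(example))
```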

**Construct Definition** 

Validity 

1. For the measurement of what construct was the dataset created? 

 

2. How is the construct operationalized? Can the dataset fully grasp the construct? If not, what dimensions are left out? Have there been any attempts to evaluate the validity of the construct's operationalization? 

 

3. What related constructs could (not) be measured through the dataset? What should be considered when measuring other constructs with the dataset? 

 

4. What is the target population? 

 

5. How does the dataset handle subpopulations? 



**Platform Selection**

Platform Affordances Error 

1. What are the key characteristics of the platform at the time of data collection? 

 

2. What are the effects of the platform's ToS on the collected data? 

 

3. What are the effects of the platform's sociocultural norms on the collected data? 

 

4. How were the relevant traces collected from the platform? Are there any technical constraints of the data collection method? If yes, how did those limit the dataset design? 

 

5. In case multiple data sources were used, what errors might occur through their merger or combination? 


Platform Coverage Error 

1. What is known about the platform/s population? 

**Data Collection** 

Trace Selection Error 

1. How was the data associated with each instance acquired? On what basis were the trace selection criteria chosen? 

 

2. Was there any data that could not be adequately collected? 

 

3. Is any information missing from individual instances? Could there be a systematic bias? 

 

4. Does the dataset include sensitive or confidential information? 

User Selection Error 

1. Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample from a larger set, what was the sampling strategy? 

 

2. What is known about the dataset population? Are there user groups systematically in- or excluded in/from the dataset in direct consequence of the trace selection criteria? 

 

3. Over what timeframe was the data collected, and how might that timeframe have affected the collected data? 

 

4. If the dataset relates to people, how did they consent to collecting and using their data? 

 

5. Does the data include information on minors? 

**Data Preprocessing and Data Augmentation**

Trace Augmentation and Trace Measurement Error 

Is there a label or target associated with each instance? If so, how were the labels or targets generated? 

 

If automated methods were used, how does the methods’ performance impact the correctness of the augmentations? 

 

If human annotations were used, who were the annotators that created the labels? How were they recruited or chosen? How were they instructed? 

 

If the final gold label was derived from different annotations, how was this done? 

 

Have there been any attempts to validate the labels? 

 
How could the data be misused? 

 
Can the dataset in any way unintendedly contribute to the reinforcement of social inequality? 

User Augmentation Error 

Have attributes and characteristics of individuals been inferred? 

 

Is it possible to identify individuals either directly or indirectly from the data? 

Trace Reduction Error 

Have traces been excluded? Why and by what criteria? 

User Reduction Error 

Have users been excluded? Why and by what criteria? 

Adjustment Error 

Does the dataset provide information to adjust the results to a target population? If so, is this information inferred or self-reported? 

## B1.3. Getting practical with Wikipedia API 

<img src='./images/wikipedia_logo.png' style='height: 190px; float: right; margin-left: 50px' >

Wikipedia is a rich source of data for social science research. Although we can access its data through other techniques like web scraping, there are also useful APIs that could ease collecting data from the website.

Since Wikipedia is built on [MediaWiki](https://en.wikipedia.org/wiki/MediaWiki), we will be using Python wrappers written for its API, the [MediaWiki Action API](https://www.mediawiki.org/wiki/API:Main_page). Each of these wrappers provides some useful methods, and we will go through the ones that are most important for our data collection tasks.
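To give an idea of what such wrappers do under the hood, here is a minimal sketch that builds the URL of a search request to the Action API. It only constructs the URL and sends nothing. `build_search_url()` is a hypothetical helper; `action=query`, `list=search`, `srsearch`, `srlimit`, and `format=json` are parameters of the Action API.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(term, lang="en", limit=5):
    """Build the URL of a full-text search request to the Action API."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


url = build_search_url("Barack Obama")
print(url)

# The query string can be decoded again to inspect the parameters
print(parse_qs(urlsplit(url).query))
```

Opening such a URL in a browser returns the search results as JSON; the wrappers introduced below send requests of this kind and parse the responses for us.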

We will also introduce two useful parsers for the Wikipedia markup language, and will see how they could be used for extracting clean data from the raw markup code.

### B1.3.1. wikipedia

The first wrapper we introduce here is simply called [wikipedia](https://wikipedia.readthedocs.io/en/latest/code.html#api). You can install it via `pip`.

In [None]:
import wikipedia

Searching a query with `wikipedia` can be done using the [`search()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function:

In [None]:
wikipedia.search("Barack")

Wikipedia's suggested query can be accessed with the [`suggest()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function:

In [None]:
wikipedia.suggest("Barak Obama")

You can get fewer or more results with a specific number like this:

In [None]:
wikipedia.search("Ford", results=3)

For getting the summary of an article, you can use the [`summary()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function:

In [None]:
wikipedia.summary("Barack Obama")

In [None]:
wikipedia.summary("Barack Obama", sentences=1)

`summary()` will raise a `DisambiguationError` if the page is a disambiguation page, or a `PageError` if the page doesn’t exist (although by default, it tries to find the page you meant with suggest and search.)

In [None]:
wikipedia.summary("Mercury")

In [None]:
try:
    mercury = wikipedia.summary("Mercury")
except wikipedia.exceptions.DisambiguationError as e:
    print(e.options)

The [`page()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function enables you to load and access data from full Wikipedia pages. Initialize it with a page title (keep in mind the errors listed above), then you can easily access most properties of the page:

In [None]:
bo = wikipedia.page("Barack Obama")

You can get information like title of the page, its url etc. In order to get the title of the page, you can use the `title` attribute:

In [None]:
bo.title

Using the `url` attribute, you can get the url of the page:

In [None]:
bo.url

To get the full text of the page, you can use the `content` attribute:

In [None]:
bo.content

You can access the images in the page using `.images`. The URLs of the first five images are retrieved like this:

In [None]:
bo.images[0:5]

You can get the titles of the Wikipedia pages that the page links to using `.links`:

In [None]:
bo.links[:10]

In order to get the URLs of the external links of the page, you can use `.references`:

In [None]:
bo.references[1:]

In order to access the plain text content of a section in the page, you can call the `section()` method with the section title (the `sections` attribute lists the available titles):

In [None]:
bo.section('Early life and career')

In order to change the language of the Wikipedia pages you are accessing, you can use the [`set_lang()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function. Remember to search for page titles in the language that you have set, not English:

In [None]:
wikipedia.set_lang("fr")

In [None]:
wikipedia.summary("Francois Hollande")

In [None]:
wikipedia.set_lang("en")

### B1.3.2. Getting Wikipedia tables' information

The `wikipedia` package that we introduced in B1.3.1 cannot always help us with all the tasks we may want to do in order to collect data from Wikipedia.

For getting data other than what `wikipedia` can give us, we can use other libraries to access the markup code of Wikipedia, and then parse it to get the information we want. We will introduce [pywikibot](https://doc.wikimedia.org/pywikibot/stable/), a wrapper that can give us the markup, together with two parsers [mwparserfromhell](https://mwparserfromhell.readthedocs.io/en/latest/index.html) and [wikitextparser](https://wikitextparser.readthedocs.io/en/latest/), in order to parse the markup code. You can install all of them with `pip`.

In [None]:
import pywikibot
import mwparserfromhell as mwp
import wikitextparser as wtp
import pandas as pd

#### Getting tables data

We will begin with an example page: [List of political parties in Germany](https://en.wikipedia.org/wiki/List_of_political_parties_in_Germany). We want to extract the data of the tables on that page. Using pywikibot, we can get the markup code of the page, and then parse it with wikitextparser:

In [None]:
site = pywikibot.Site('en', 'wikipedia')
pwb_page = pywikibot.Page(site, "List of political parties in Germany")
text = pwb_page.get()
page = wtp.parse(text)

We can get the tables data with `page.tables`. Let's say we want to get the second table's data:

In [None]:
data = page.tables[1].data()
data

By putting the data in a dataframe, we can have a better overview of it:

In [None]:
second_table = pd.DataFrame(data[1:])
second_table.columns = data[0]
second_table

As you can see, the cell contents still contain raw markup and do not look as clean as they do in the rendered Wikipedia page. We can parse each cell's data with mwparserfromhell, and then create the dataframe:

In [None]:
for i in range(len(data)):
    for j in range(len(data[i])):
        # Parse the raw markup of the cell, then strip templates, links and formatting
        wikicode = mwp.parse(data[i][j])
        data[i][j] = wikicode.strip_code()

In [None]:
second_table = pd.DataFrame(data[1:])
second_table.columns = data[0]
second_table

Now the table looks pretty much the same as the table in the original page.
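To make concrete what "cleaning" a cell means, the stdlib-only sketch below strips two common markup constructs (internal links and bold/italic quotes) with regular expressions. It is a rough illustration and not how `mwparserfromhell` works internally; the function name `strip_simple_markup` and the sample rows are made up for this example, and real tables with templates or references need a proper parser.

```python
import re


def strip_simple_markup(cell):
    """Very rough stand-in for stripping wiki markup from one table cell."""
    if cell is None:  # treat missing cells as empty strings
        return ""
    # [[Target|Label]] -> Label, [[Target]] -> Target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", cell)
    # '''bold''' and ''italic'' -> plain text
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


# Made-up sample in the same header-plus-rows shape as `data` above
sample_rows = [
    ["Name", "Abbr."],
    ["[[Social Democratic Party of Germany]]", "'''SPD'''"],
    ["[[Christian Democratic Union of Germany|CDU]]", None],
]

clean_rows = [[strip_simple_markup(cell) for cell in row] for row in sample_rows]
for row in clean_rows:
    print(row)
```

The nested list comprehension plays the same role as the double `for` loop above, but builds a new list instead of modifying the original in place.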

#### An alternative for extracting tables data: wikitables library

In order to get table data, you can also use the `wikitables` library. It simplifies some of the steps for accessing the table data, but you need to watch out for small bugs or mistakes in the resulting tables. Let's say we want to extract the second table's data:

In [None]:
from wikitables import import_tables

In [None]:
tables = import_tables('List of political parties in Germany')

In [None]:
tables

In [None]:
second_table_wt = pd.DataFrame(tables[1].rows)
second_table_wt

As you can see, the cell data are retrieved cleanly, but the column names do not match the ones in the original table. You need to correct this if you want to use `wikitables`.

### B1.3.3. Extracting main text of different revisions

There may be multiple different revisions available for each Wikipedia page. In this section, we will demonstrate how you can extract the main text of the first revision of an article in each year since the beginning, using `pywikibot` and `mwparserfromhell`:

In [None]:
import pywikibot
import mwparserfromhell

Like before, you can first get the page using pywikibot's [`.Site()`](https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#module-site) and [`.Page()`](https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#module-site):

In [None]:
site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, "Koç University")

Then, you can get all the revisions of the page using `page.revisions()`. Depending on how old/rich the page is, this may take a few seconds:

In [None]:
revisions = page.revisions(content=True)

Now we can make a list of all of the revisions, and put the **year** in which each revision has been written into a `years` list. Each revision is in the form of a dictionary, and we can get the *years* using the `timestamp` key in those dictionaries:

In [None]:
revisions_list = []
years = []

for i in revisions:
    revisions_list.append(i)
    years.append(int(str(i['timestamp'])[:4]))
years.reverse()
revisions_list.reverse()

Since revisions are sorted from the newest to the oldest, we have to reverse the `years` and `revisions_list` lists to have their items in ascending order. By printing the `years` list, you can see an overview of how many revisions there are in each year for the page:

In [None]:
print(years)
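A long list of years is hard to read at a glance. As a small sketch, `collections.Counter` from the standard library can summarize such a list as revisions per year. The list `sample_years` below is made-up data standing in for `years`, so the cell runs on its own.

```python
from collections import Counter

# Made-up stand-in for the `years` list (already in ascending order)
sample_years = [2005, 2005, 2006, 2008, 2008, 2008]

revisions_per_year = Counter(sample_years)
for year, count in sorted(revisions_per_year.items()):
    print(year, count)
```

Years without any revision (2007 in this sample) simply do not appear in the counter.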

We want to put the first revision of each year into a `yearly_revisions` list. In order to do that, we first get the indices of the first appearances of each year in the `years` list, and get the revisions with those indices in the `revisions_list` list:

In [None]:
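yearly_revisions = []
for year in range(years[0], years[-1] + 1):
    # Skip years without any revision; years.index() would raise a ValueError
    if year in years:
        index = years.index(year)
        yearly_revisions.append(revisions_list[index])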
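A dictionary keyed by year is another way to pick the first revision of each year, and it naturally copes with years in which the page was not edited. The sketch below uses made-up revision dictionaries (`sample_revisions`, with hypothetical `revid` values) that are sorted from oldest to newest, like `revisions_list` after reversing.

```python
# Made-up revisions, oldest first; only the fields needed here are included
sample_revisions = [
    {"revid": 1, "timestamp": "2005-03-01T10:00:00Z"},
    {"revid": 2, "timestamp": "2005-07-15T08:30:00Z"},
    {"revid": 3, "timestamp": "2006-01-20T12:00:00Z"},
    {"revid": 4, "timestamp": "2008-02-02T09:45:00Z"},
    {"revid": 5, "timestamp": "2008-11-11T17:10:00Z"},
]

first_by_year = {}
for rev in sample_revisions:
    year = int(str(rev["timestamp"])[:4])
    # setdefault only stores the revision if the year is not present yet,
    # so the earliest revision of each year is kept
    first_by_year.setdefault(year, rev)

for year, rev in first_by_year.items():
    print(year, rev["revid"])
```

Because the input is in ascending order, the dictionary keys are in ascending order as well, and `list(first_by_year.values())` gives the yearly revisions as a list.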

In order to get the clean main text of each revision, we can use the `text` attribute of the revisions, and have the result parsed using `mwparserfromhell`. Take the last revision as an example; we first put the un-parsed code into the `text` variable:

In [None]:
text = yearly_revisions[-1].text

Now we can parse it with `mwparserfromhell` like this:

In [None]:
parsed = mwparserfromhell.parse(text)
print(parsed.strip_code())

## B1.4. More APIs and precollected datasets 

<img src="./images/datasets.jpg" width="500" height = "900" align="left"/>  

- __More APIs__

    [Facebook for Developers](https://developers.facebook.com/)  
    [Facebook Ads API](https://developers.facebook.com/docs/marketing-apis/)  
    [Instagram Developer](https://developers.facebook.com/docs/instagram-basic-display-api)  
    [YouTube Developers](https://developers.google.com/youtube/)  
    [Weibo API](http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3/en)  
    [CrowdTangle](https://www.crowdtangle.com/request)  
    [4chan](https://github.com/4chan/4chan-API)  
    [Gab](https://github.com/a-tal/gab)  
    [Github REST API](https://docs.github.com/en/rest)  
    [Github GraphQL](https://docs.github.com/en/graphql)  
    [Stackoverflow](https://api.stackexchange.com/docs)  
    [Facepager](https://github.com/strohne/Facepager)  


- __Precollected datasets__  
    https://datasetsearch.research.google.com  
    https://www.kaggle.com/datasets  
    https://data.gesis.org/sharing/#!Search  


- __Locating or Requesting Social Media Data__
    https://www.programmableweb.com

## B1.5. Challenges

 >Two main downsides to working with APIs. First, there may be restrictions on what data is provided, and if there are, those restrictions are often grounded in business interests rather than technical requirements and limitations. That said, the data that API providers choose to include is almost always complete because it is often generated programmatically as a by-product of their platform, and applications depend on it (<a href='#mclevey_doing_2022'>McLevey, 2022</a>, ch.4). Second, APIs change, sometimes unexpectedly (see <a href='#Freelon'>Freelon, 2018</a>; <a href='#Hogan'>Hogan, 2018</a>; <a href='#Jünger'>Jünger, 2021</a>). 

For example, Facebook completely closed down many of its APIs, and it is now very hard to get Facebook data other than through CrowdTangle or the Facebook Ads API.

Twitter’s API is now at version 2, which brought substantial changes. These challenges require us to stay vigilant and continuously update our code to keep up with the APIs. It is also good to keep up to date with news about tech companies, such as [Why Twitter ending free access to its APIs should be a ‘wake-up call’](https://www.theguardian.com/technology/2023/feb/07/techscape-elon-musk-twitter-api).

For more on social media data collection and data quality, please visit this [source](https://www.slideshare.net/suchprettyeyes/working-with-socialmedia-data-ethics-good-practice-around-collecting-using-and-storing-data).

## Commented references

<a id='mclevey_doing_2022'></a>
McLevey, J. (2022). *Doing Computational Social Science: A Practical Introduction*. SAGE. https://us.sagepub.com/en-us/nam/doing-computational-social-science/book266031. *A rather complete introduction to the field with well-structured and insightful chapters also on using Pandas. The [website](https://github.com/UWNETLAB/dcss_supplementary) offers the code used in the book.*

Zenk-Möltgen, Wolfgang (GESIS - Leibniz Institute for the Social Sciences), Python Script to rehydrate Tweets from Tweet IDs https://doi.org/10.7802/1504

Pfeffer, Morstatter (2016): Geotagged Twitter posts from the United States: A tweet collection to investigate representativeness. Dataset. http://dx.doi.org/10.7802/1166

<a id='statista'></a>
Statista, 2023. https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/. Retrieved 26.04.2023.

<a id='van_Vliet'></a>
van Vliet, L., Törnberg, P., & Uitermark, J. (2020) "The Twitter parliamentarian database: Analyzing Twitter politics across 26 countries". PLoS ONE 15(9): e0237073. https://doi.org/10.1371/journal.pone.0237073.

<a id='Freelon'></a>
Freelon, D. (2018) "Computational Research in the Post-API Age". Political
Communication, 35 (4): 665–668. https://doi.org/10.1080/10584609.2018.1477506

<a id='Hogan'></a>
Hogan, B. (2018) "Social Media Giveth, Social Media Taketh Away: Facebook,
friendships, and APIs". International Journal of Communication, 12: 592–611. https://ssrn.com/abstract=3084159

<a id='Jünger'></a>
Jünger, J. (2021) "A brief history of APIs: Limitations and opportunities for online
research", in U. Engel and A. Quan-Haase (eds), Handbook of Computational Social
Science. Abingdon: Routledge. https://doi.org/10.4324/9781003025245

<a id='Fröhling'></a>
Fröhling, L., Sen, I., Soldner, F., Steinbrinker, L., Zens, M., & Weller, K. (2023). *Total Error Sheets for Datasets (TES-D)—A Critical Guide to Documenting Online Platform Datasets* (arXiv:2306.14219). arXiv. https://doi.org/10.48550/arXiv.2306.14219


<a id='Sen'></a>
Sen, I., Flöck, F., Weller, K., Weiß, B., & Wagner, C. (2021). *A Total Error Framework for Digital Traces of Human Behavior on Online Platforms*. Public Opinion Quarterly, 85(S1), 399–422. https://doi.org/10.1093/poq/nfab018

<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main authors: N. Gizem Bacaksizlar Turbic & Pouria Mirelmi

Contributors: Haiko Lietz

Acknowledgements: 

Version date: 25. April 2023

License: ...
</div>