<img src='images/header.png' style='height: 50px; float: left'>

## Introduction to Computational Social Science methods with Python

# Session B1: API harvesting

In January 2023, the most popular social media platforms were Facebook, YouTube, WhatsApp, Instagram, and WeChat (Statista, 2023). The web and its **online platforms** provide vast amounts of data that are highly relevant and interesting for social research. Whether generated in social media platforms, search engines (*e.g.*, Google), or knowledge production platforms (*e.g.*, Wikipedia, GitHub), the data resembles **digital traces** of behavior that are, as a first approximation, unobtrusive (*i.e.*, not influenced by observational or memory effects), complete (*i.e.*, a non-probabilistic sample), and highly resolved (*i.e.*, in real time and at scale). This provides a unique opportunity to study human behaviour in naturalistic settings (<a href='#lazer_computational_2009'>Lazer *et al.*, 2009</a>).

Obtaining Digital Behavioral Data (DBD) from online platforms, however, is a whole different story. If you are lucky, the dataset you need for your research has already been collected and dumped on a website or stored in a data archive. If your dream dataset has not been pre-collected, you must collect it yourself. For that purpose, you can use the [APIs](https://en.wikipedia.org/wiki/API) which large online platforms typically provide. In general, an **Application Programming Interface** (API) is a piece of software that facilitates communication among computer programs. When you type a domain name in your browser, you use an API that helps you obtain information from a computer far away and spares you from having to type its IP address. Even Python packages like Pandas are APIs because they speed up coding by providing a set of pre-programmed functions that perform commonly needed operations, so that programmers do not have to write them from scratch.

Operators of large digital platforms typically offer APIs to which you can apply for access as a researcher. In many cases, applicants must undergo a vetting procedure in which they describe the goals and procedures of their project to the operator. Once you have access, you usually interact with an API through a wrapper written in a programming language like Python. Essentially, **wrappers** are overlays that communicate with the API for us and are more convenient because they make it easier to automate requests. There are two main downsides to using APIs. First, most APIs restrict the types and the amount of data they provide. Such restrictions are often tiered: free access provides the least data, while higher tiers provide a wider variety of data types and larger amounts of data. Second, APIs change (<a href='#junger_a_2022'>Jünger, 2022</a>; <a href='#mclevey_doing_2022'>McLevey, 2022</a>, ch. 4).
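Because most APIs cap how many requests you may send, harvesting scripts usually pace themselves. The following is a minimal sketch of that idea, not any platform's official client: `harvest`, `fetch`, and `min_interval` are hypothetical names, and the appropriate interval depends on the rate limit of the API you are using.

```python
import time


def harvest(fetch, items, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(item) for every item, keeping calls at least min_interval seconds apart."""
    results = []
    last_call = None
    for item in items:
        if last_call is not None:
            # Wait only for the part of the interval that has not yet elapsed
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(fetch(item))
    return results


# Stand-in for a real API call: here we merely upper-case the query
print(harvest(str.upper, ["seattle", "tacoma"], min_interval=0.1))
```

In a real project, `fetch` would be the function that sends one request to the API, and `min_interval` would be derived from the documented request quota.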

Recently, API change has become a tremendous problem for social research. In 2009, it still seemed that collaborations between platform operators and academic institutions could guarantee both data access and user privacy (<a href='#lazer_computational_2009'>Lazer *et al.*, 2009</a>). The two giants Meta (the company running Facebook) and Twitter had both set up **academic APIs**, but after years of experiments both terminated them in 2022 and 2023, respectively. We have definitely arrived in the "Post-API Age" (<a href='#freelon_computational_2018'>Freelon, 2018</a>). Platform operators are private companies whose business models do not align well with free data access for researchers, journalists, or those who openly work in the public interest. Twitter has since been renamed X and now charges \\$42,000 for 50M requests. These developments have plunged Computational Social Science into a **reproducibility crisis** (<a href='#davidson_social_2023'>Davidson *et al.*, 2023</a>) and are causing the field to invest more into other data collection methods like data scraping techniques, browser extensions, or user data donations, potentially provided by centralized infrastructures (<a href='#lazer_computational_2020'>Lazer *et al.*, 2020</a>).

Nevertheless, given the many digital platforms and APIs out there, API harvesting stays an important data collection method. This is particularly the case for non-commercial platforms like Wikipedia. To ensure reproducibility and **data quality**, the characteristics of collected – not just harvested – datasets should be transparently documented and communicated. Just like in survey research, computational scientists are advised to reflect upon the challenges associated with the collection of digital traces, the underlying population that produced them, the meaning encoded in these traces, and the role of the platform in the trace generation process (<a href='#sen_a_2021'>Sen *et al.*, 2021</a>). DBD can only be complete with respect to the trace-producing population, and platform effects render it obtrusive in its very own meaning. The Total Error Sheets for Datasets (TES-D) framework is a critical guide to documenting online platform datasets (<a href='#frohling_total_2023'>Fröhling *et al.*, 2023</a>).

<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn how to collect Digital Behavioral Data via API harvesting. In subsession **B1.1**, we will list resources for social media APIs and for datasets that have already been collected. In subsession **B1.2**, we will dive into harvesting Wikipedia, introducing a few APIs that help with collecting various parts of Wikipedia pages. Finally, in subsession **B1.3**, we will discuss the Total Error Sheets for Datasets (TES-D) framework to document a Twitter dataset.
</div>

## B1.1. APIs and precollected datasets

<img src="./images/datasets.jpg" width="500" height = "900" align="left"/>  

- __Awesome list__
- __More APIs__

    [Facebook for Developers](https://developers.facebook.com/)  
    [Facebook Ads API](https://developers.facebook.com/docs/marketing-apis/)  
    [Instagram Developer](https://developers.facebook.com/docs/instagram-basic-display-api)  
    [YouTube Developers](https://developers.google.com/youtube/)  
    [Weibo API](http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3/en)  
    [CrowdTangle](https://www.crowdtangle.com/request)  
    [4chan](https://github.com/4chan/4chan-API)  
    [Gab](https://github.com/a-tal/gab)  
    [Github REST API](https://docs.github.com/en/rest)  
    [Github GraphQL](https://docs.github.com/en/graphql)  
    [Stackoverflow](https://api.stackexchange.com/docs)  
    [Facepager](https://github.com/strohne/Facepager)  


- __Precollected datasets__  
    https://datasetsearch.research.google.com  
    https://www.kaggle.com/datasets  
    https://data.gesis.org/sharing/#!Search  


- __Locating or Requesting Social Media Data__
    https://www.programmableweb.com

## B1.2. Harvesting Wikipedia

<img src='./images/wikipedia_logo.png' style='height: 190px; float: right; margin-left: 50px' >

Wikipedia is a rich source of data for social science research. Although we can access its data through other techniques like web scraping, there are also useful APIs that could ease collecting data from the website.

Since Wikipedia is built on [MediaWiki](https://en.wikipedia.org/wiki/MediaWiki), we will be using Python wrappers written for its API, the
[MediaWiki Action API](https://www.mediawiki.org/wiki/API:Main_page). Each of these wrappers provides some useful methods, and we will go through the ones that are most important for our data collection tasks.
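To give an impression of what the wrappers do for us, the following sketch builds the kind of URL that a search request to the MediaWiki Action API consists of. It only constructs the URL and does not send it. `build_search_url` is a hypothetical helper name, and the parameter selection is an assumption about what a simple search request needs, not a description of how any particular wrapper is implemented.

```python
from urllib.parse import urlencode

# Endpoint of the Action API for the English Wikipedia
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_search_url(query, limit=3):
    """Return the URL of a full-text search request (nothing is sent)."""
    params = {
        "action": "query",    # we want to query data
        "list": "search",     # use the search module
        "srsearch": query,    # the search term
        "srlimit": limit,     # maximum number of results
        "format": "json",     # ask for a JSON response
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(build_search_url("seattle"))
```

Pasting the printed URL into a browser shows the raw JSON response that a wrapper would parse into Python objects.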

We will also introduce two useful parsers for the Wikipedia markup language, and will see how they could be used for extracting clean data from the raw markup code.

### B1.2.1. wikipedia

The first wrapper we introduce here is simply called [wikipedia](https://wikipedia.readthedocs.io/en/latest/code.html#api).

In [None]:
import wikipedia as wp

Searching a query with `wikipedia` can be done using the [`search()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function:

In [None]:
wp.search("seattle")

You can get fewer or more results with a specific number like this:

In [None]:
wp.search("seattle", results=3)

Wikipedia's suggested query can be accessed with the [`suggest()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function:

In [None]:
wp.suggest("seattle")  # returns a suggested query string, or None if there is no suggestion

For getting the summary of an article, you can use the [`summary()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function:

In [None]:
print(wp.summary("Chief Seattle"))

In [None]:
print(wp.summary("Chief Seattle", sentences=1))

`summary()` raises a `DisambiguationError` if the title leads to a disambiguation page, and a `PageError` if the page does not exist (although by default it first tries to find the page you meant using suggest and search).

In [None]:
#print(wp.summary("Mercury"))

In [None]:
try:
    wp_summary = wp.summary("Mercury")
    print(wp_summary)
except wp.exceptions.DisambiguationError as e:
    # "Mercury" is ambiguous; list the candidate page titles instead
    print(e.options)

The [`page()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function enables you to load and access data from full Wikipedia pages. Initialize it with a page title (keep in mind the errors listed above), and you can easily access most properties of the page:

In [None]:
wp_page = wp.page("Chief Seattle")
wp_page

The rendered HTML of the page is available through the `html()` method:

In [None]:
from IPython.display import HTML

HTML(wp_page.html())

You can get information like title of the page, its url etc. In order to get the title of the page, you can use the `title` attribute:

In [None]:
wp_page.title

Using the `url` attribute, you can get the url of the page:

In [None]:
wp_page.url

To get the full text of the page, you can use the `content` attribute:

In [None]:
print(wp_page.content)

The `sections` attribute is meant to list the section titles of the page, and the `section()` method returns the plain text content of a single section:

In [None]:
wp_page.sections # should list the section titles, but may return an empty list

In [None]:
print(wp_page.section('Biography'))

You can access the images in the page using `.images`. The URLs of the first five images are retrieved like this:

In [None]:
wp_page.images[0:5]

In order to get the URLs of the external links of the page, you can use `.references`:

In [None]:
wp_page.references[:5]

You can get the texts of the links in the page using `.links`:

In [None]:
wp_page.links[:10]

The categories to which the page belongs can be accessed using `.categories`:

In [None]:
wp_page.categories[:5]

Finally, we can collect these properties in a Pandas dataframe:

In [None]:
import pandas as pd

In [None]:
pd.DataFrame(
    data = [[wp_page.title, wp_page.url, wp_page.content, wp_page.images, wp_page.references, wp_page.links, wp_page.categories]], 
    columns = ['Title', 'URL', 'Content', 'Images', 'References', 'Links', 'Categories']
)

In order to change the language of the Wikipedia pages you are accessing, you can use the [`set_lang()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function. Remember to search for page titles in the language that you have set, not English:

In [None]:
wp.set_lang("es")

In [None]:
print(wp.summary("Chief Seattle"))

In [None]:
wp.set_lang("en")

### B1.2.2. Harvesting tables

The `wikipedia` package that we introduced in B1.2.1 cannot always help us with all the tasks we may want to do in order to collect data from Wikipedia.

For getting data other than what `wikipedia` can give us, we can use other libraries to access the markup code of Wikipedia, and then parse it to get the information we want. We will introduce [pywikibot](https://doc.wikimedia.org/pywikibot/stable/), a wrapper that can give us the markup, together with two parsers [mwparserfromhell](https://mwparserfromhell.readthedocs.io/en/latest/index.html) and [wikitextparser](https://wikitextparser.readthedocs.io/en/latest/), in order to parse the markup code.

In [None]:
import pywikibot as pwb
import wikitextparser as wtp

We will begin with an example page: [List of political parties in Germany](https://en.wikipedia.org/wiki/List_of_political_parties_in_Germany). We want to extract the data of the tables on that page. Using pywikibot, we can get the markup code of the page, and then parse it with wikitextparser:

In [None]:
pwb_site = pwb.Site('en', 'wikipedia') # The site we want to run our bot on
pwb_page = pwb.Page(pwb_site, "List of political parties in Germany")
pwb_text = pwb_page.text
print(pwb_text)

To parse the markup:

In [None]:
wtp_text = wtp.parse(pwb_text)
wtp_text

We can get the tables with `wtp_text.tables`. Let's say we want to get the first table's data:

In [None]:
wtp_first_table = wtp_text.tables[0].data()

By putting the data in a dataframe, we can have a better overview of it:

In [None]:
first_table = pd.DataFrame(wtp_first_table[1:])
first_table.columns = wtp_first_table[0]
first_table.head()

As you can see, the cell data are not displayed as cleanly as in the original Wikipedia page because they still contain markup. We can parse each cell's data with mwparserfromhell, and then create the dataframe again:

In [None]:
import mwparserfromhell as mwp

In [None]:
# Strip the wiki markup from every cell, keeping only the visible text
for i in range(len(wtp_first_table)):
    for j in range(len(wtp_first_table[i])):
        wikicode = mwp.parse(wtp_first_table[i][j])
        wtp_first_table[i][j] = wikicode.strip_code()

In [None]:
first_table = pd.DataFrame(wtp_first_table[1:])
first_table.columns = wtp_first_table[0]
first_table.head()

Now the table looks pretty much the same as the table in the original page.
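To make concrete what "stripping" a cell means, here is a deliberately simplified stand-in that uses only regular expressions. It is not how `mwparserfromhell` works internally and it covers far fewer cases (only wiki links and bold/italic markers); `strip_simple_markup` is a hypothetical helper name and the example cells are made up.

```python
import re


def strip_simple_markup(cell):
    """Reduce wiki links and bold/italic markers to their visible text."""
    # [[target|label]] becomes label, [[target]] becomes target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", cell)
    # Two or more apostrophes mark italic or bold text; drop them
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


for raw in ["[[Christian Democratic Union of Germany|CDU]]", "'''[[Berlin]]'''"]:
    print(repr(raw), "becomes", repr(strip_simple_markup(raw)))
```

For real pages, a dedicated parser remains the safer choice because wiki markup also contains templates, references, and nested constructs that simple patterns like these do not handle.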

#### An alternative for extracting table data: the wikitables library

To get table data, you can also use the `wikitables` library. It simplifies some steps of accessing the table data, but you need to watch out for small bugs or mistakes in the resulting tables. Let's say we want to extract the first table's data:

In [None]:
from wikitables import import_tables

In [None]:
tables = import_tables('List of political parties in Germany')

In [None]:
first_table_wt = pd.DataFrame(tables[0].rows)
first_table_wt

As you can see, ... This needs to be taken care of, in case you want to use `wikitables`.

### B1.2.3. Extracting main text of different revisions

There may be multiple different revisions available for each Wikipedia page. In this section, we will demonstrate how you can extract the main text of the first revision of an article in each year since the beginning, using `pywikibot` and `mwparserfromhell`:

In [None]:
import pywikibot
import mwparserfromhell

Like before, you can first get the page using pywikibot's [`.Site()`](https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#module-site) and [`.Page()`](https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#module-site):

In [None]:
site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, "Koç University")

Then, you can get all the revisions of the page using `page.revisions()`. Depending on how old/rich the page is, this may take a few seconds:

In [None]:
revisions = page.revisions(content=True)

Now we can make a list of all of the revisions, and put the **year** in which each revision has been written into a `years` list. Each revision is in the form of a dictionary, and we can get the *years* using the `timestamp` key in those dictionaries:

In [None]:
revisions_list = []
years = []

for i in revisions:
    revisions_list.append(i)
    years.append(int(str(i['timestamp'])[:4]))
years.reverse()
revisions_list.reverse()

Since revisions are sorted from the newest to the oldest, we have to reverse the `years` and `revisions_list` lists to have their items in ascending order. By printing the `years` list, you can get an overview of how many revisions there are for the page in each year:

In [None]:
print(years)
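Counting the revisions per year makes this overview easier to read than the raw list. The sketch below uses a made-up list, `example_years`, in place of the `years` list built above; with real data you would pass `years` to `Counter` instead.

```python
from collections import Counter

# Made-up stand-in for the `years` list (one entry per revision)
example_years = [2005, 2005, 2006, 2008, 2008, 2008]

# Count how often each year occurs, i.e., the number of revisions per year
revisions_per_year = Counter(example_years)

for year in sorted(revisions_per_year):
    print(year, revisions_per_year[year])
```

Years without any revision (2007 in the made-up list) simply do not appear in the counter, which is worth keeping in mind for the next step.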

We want to put the first revision of each year into a `yearly_revisions` list. In order to do that, we first get the indices of the first appearances of each year in the `years` list, and get the revisions with those indices in the `revisions_list` list:

In [None]:
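yearly_revisions = []
for year in range(years[0], years[-1] + 1):
    # Skip years without revisions; years.index() would raise a ValueError for them
    if year in years:
        index = years.index(year)  # position of the first revision in that year
        yearly_revisions.append(revisions_list[index])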
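The same selection can also be written without index lookups by remembering only the first revision seen for each year. This is a sketch under simplifying assumptions: it works on made-up `(timestamp, text)` pairs rather than on pywikibot revision objects, and `first_revision_per_year` is a hypothetical helper name.

```python
from datetime import datetime


def first_revision_per_year(revisions):
    """Map each year to its earliest (timestamp, text) pair."""
    first = {}
    # Sort by timestamp so that the earliest revision of a year comes first
    for timestamp, text in sorted(revisions, key=lambda rev: rev[0]):
        # setdefault keeps the existing entry, so later revisions are ignored
        first.setdefault(timestamp.year, (timestamp, text))
    return first


# Made-up revisions in arbitrary order; there is no revision in 2011
example_revisions = [
    (datetime(2010, 5, 1), "second version"),
    (datetime(2010, 1, 1), "first version"),
    (datetime(2012, 3, 3), "third version"),
]

for year, (timestamp, text) in first_revision_per_year(example_revisions).items():
    print(year, timestamp.date(), text)
```

Because the dictionary is keyed by year, years without revisions are skipped automatically and the result states explicitly which year each revision belongs to.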

In order to get the clean main text of each revision, we can use the `text` attribute of the revisions, and have the result parsed using `mwparserfromhell`. Take the last revision as an example; we first put the un-parsed code into the `text` variable:

In [None]:
text = yearly_revisions[-1].text

Now we can parse it with `mwparserfromhell` like this:

In [None]:
parsed = mwparserfromhell.parse(text)
print(parsed.strip_code())

## B1.3. Documentation of datasets collected from online platforms

In the following, we show you how to systematically describe digital behavioral data. For this purpose, we use the TES-D template (<a href='#frohling_total_2023'>Fröhling *et al.*, 2023</a>; <a href='#sen_a_2021'>Sen *et al.*, 2021</a>). For more details, you can refer to the TES-D Manual (ADD citation).

**TES-D “Computational Social Science Turkey Tweets 2008-2023”**

**General Characteristics** 

1. *Who collected the dataset and who funded the process?*

The dataset has been collected by the "Social ComQuant" project team (Gizem Bacaksizlar Turbic, Haiko Lietz, Pouria Mirelmi, Olga Zagovora) at GESIS - Leibniz Institute for the Social Sciences, Computational Social Science department. The data collection was funded by the European Commission as part of [the Social ComQuant Project](https://socialcomquant.ku.edu.tr/).

2. *Where is the dataset hosted? Is the dataset distributed under a copyright or license?* 

The dataset is hosted in the open-access [GitHub repository](https://github.com/gesiscss/css_methods_python) of the CSS department at GESIS. ADD LICENSE

3. *What do the instances that comprise the dataset represent? What data does each instance consist of?*

Each line of the dataset represents a distinct Tweet posted on Twitter between 5 January 2008 and 8 January 2023. Each instance consists of the unique identifier of the Tweet, the unique identifier of the user who posted it, the creation time of the Tweet (in ISO 8601 format), the UTF-8 text of the Tweet, and the language of the Tweet if detected by Twitter (returned as a BCP 47 language tag). The data was not preprocessed and is represented in the formats provided by the API.

4. *How many instances are there in total in each category (as defined by the instances’ label), and - if applicable - in each recommended data split?*

There are 105 instances in the dataset. The instances are homogeneous, i.e., each of them represents a Tweet.

5. *In which contexts and publications has the dataset been used already?* 

The dataset has been used in the online materials of the [Introduction to Computational Social Science methods with Python](https://github.com/gesiscss/css_methods_python) course.

6. *Are there alternative datasets that could be used for the measurement of the same or similar constructs? Could they be a better fit? How do they differ?* 

The dataset was created for teaching purposes, namely as an exercise in obtaining data via an API. We are not aware of any similar dataset.

7. *Can the dataset collection be readily reproduced given the current data access, the general context and other potentially interfering developments?*

The [Jupyter Notebook](https://github.com/gesiscss/css_methods_python/blob/main/b_data_collection_methods/1_API_harvesting.ipynb), subsection B1.2.4, provides Python code that shows how to obtain the dataset. Be aware that the Twitter API might be deprecated due to changes in the policies on free access to the API. All relevant information can be found in the [documentation](https://developer.twitter.com/en/docs) or in the news article [Why Twitter ending free access to its APIs should be a ‘wake-up call’](https://www.theguardian.com/technology/2023/feb/07/techscape-elon-musk-twitter-api).

8. *Were any ethical review processes conducted?* 

No ethical review processes have been conducted. The dataset does not contain any private data.

9. *Did any ethical considerations limit the dataset creation?* 

We have not stored any data related to the user accounts that posted the relevant Tweets. Storing such data could raise additional ethical considerations.

10. *Are there any potential risks for individuals using the data?* 

In theory, the texts of some Tweets can include usernames. Thus, to achieve complete anonymisation, one might need to postprocess the data and remove these names.
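Such a postprocessing step could be sketched as follows. The pattern assumes that usernames are mentioned as `@` followed by up to 15 letters, digits, or underscores; `mask_mentions` and the placeholder `@USER` are hypothetical choices, and the example text is made up rather than taken from the dataset.

```python
import re

# "@" followed by 1-15 word characters, not preceded by a word character
# (the lookbehind keeps e-mail addresses such as name@example.org intact)
MENTION = re.compile(r"(?<![A-Za-z0-9_])@[A-Za-z0-9_]{1,15}")


def mask_mentions(text, placeholder="@USER"):
    """Replace every username mention in a Tweet text by a placeholder."""
    return MENTION.sub(placeholder, text)


print(mask_mentions("Thanks @gesis_org and @KocUniversity for the workshop!"))
```

Masking mentions does not remove names that are written out in plain text, so this step reduces the risk of identification but does not guarantee anonymity.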

**Construct Definition** 

Validity 

1. For the measurement of what construct was the dataset created? 

 

2. How is the construct operationalized? Can the dataset fully grasp the construct? If not, what dimensions are left out? Have there been any attempts to evaluate the validity of the construct's operationalization? 

 

3. What related constructs could (not) be measured through the dataset? What should be considered when measuring other constructs with the dataset? 

 

4. What is the target population? 

 

5. How does the dataset handle subpopulations? 



**Platform Selection**

Platform Affordances Error 

1. What are the key characteristics of the platform at the time of data collection? 

 

2. What are the effects of the platform's ToS on the collected data? 

 

3. What are the effects of the platform's sociocultural norms on the collected data? 

 

4. How were the relevant traces collected from the platform? Are there any technical constraints of the data collection method? If yes, how did those limit the dataset design? 

 

5. In case multiple data sources were used, what errors might occur through their merger or combination? 


Platform Coverage Error 

1. What is known about the platform/s population? 

**Data Collection** 

Trace Selection Error 

1. How was the data associated with each instance acquired? On what basis were the trace selection criteria chosen? 

 

2. Was there any data that could not be adequately collected? 

 

3. Is any information missing from individual instances? Could there be a systematic bias? 

 

4. Does the dataset include sensitive or confidential information? 

User Selection Error 

1. Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample from a larger set, what was the sampling strategy? 

 

2. What is known about the dataset population? Are there user groups systematically in- or excluded in/from the dataset in direct consequence of the trace selection criteria? 

 

3. Over what timeframe was the data collected, and how might that timeframe have affected the collected data? 

 

4. If the dataset relates to people, how did they consent to collecting and using their data? 

 

5. Does the data include information on minors? 

**Data Preprocessing and Data Augmentation**

Trace Augmentation and Trace Measurement Error 

Is there a label or target associated with each instance? If so, how were the labels or targets generated? 

 

If automated methods were used, how does the methods’ performance impact the correctness of the augmentations? 

 

If human annotations were used, who were the annotators that created the labels? How were they recruited or chosen? How were they instructed? 

 

If the final gold label was derived from different annotations, how was this done? 

 

Have there been any attempts to validate the labels? 

 
How could the data be misused? 

 
Can the dataset in any way unintendedly contribute to the reinforcement of social inequality? 

User Augmentation Error 

Have attributes and characteristics of individuals been inferred? 

 

Is it possible to identify individuals either directly or indirectly from the data? 

Trace Reduction Error 

Have traces been excluded? Why and by what criteria? 

User Reduction Error 

Have users been excluded? Why and by what criteria? 

Adjustment Error 

Does the dataset provide information to adjust the results to a target population? If so, is this information inferred or self-reported? 

## References

### Recommended readings

<a id='junger_a_2022'></a>
Jünger, J. (2022). "A brief history of APIs: Limitations and opportunities for online research." In: Engel, U. & Quan-Haase, A. (eds), *Handbook of Computational Social Science*, vol. 2, p. 17–32. Abingdon: Routledge. https://doi.org/10.4324/9781003025245.

<a id='mclevey_doing_2022'></a>
McLevey, J. (2022). *Doing Computational Social Science: A Practical Introduction*. SAGE. https://us.sagepub.com/en-us/nam/doing-computational-social-science/book266031. *A rather complete introduction to the field with well-structured and insightful chapters also on using Pandas. The [website](https://github.com/UWNETLAB/dcss_supplementary) offers the code used in the book.*

### Complementary readings

<a id='davidson_social_2023'></a>
Davidson, B. I., Wischerath, D., Racek, D., Parry, D. A., Godwin, E., Hinds, J., Linden, D. v. d., Roscoe, J. F., & Ayravainen, L. (2023). "Social media APIs: A quiet threat to the advancement of science." *PsyArXiv*:ps32z. https://doi.org/10.31234/osf.io/ps32z.

<a id='freelon_computational_2018'></a>
Freelon, D. (2018). "Computational Research in the Post-API Age." *Political Communication* 35:665–668. https://doi.org/10.1080/10584609.2018.1477506.

<a id='frohling_total_2023'></a>
Fröhling, L., Sen, I., Soldner, F., Steinbrinker, L., Zens, M., & Weller, K. (2023). "Total Error Sheets for Datasets (TES-D) -- A critical guide to documenting online platform datasets." *arXiv*:2306.14219. https://doi.org/10.48550/arXiv.2306.14219.

<a id='lazer_computational_2009'></a>
Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A.-L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D., Van Alstyne, M. (2009). "Computational Social Science." *Science* 323:721–723. https://doi.org/10.1126/science.1167742.

<a id='lazer_computational_2020'></a>
Lazer, D. M. J., Pentland, A., Watts, D. J., Aral, S., Athey, S., Contractor, N., Freelon, D., Gonzalez-Bailon, S., King, G., Margetts, H., Nelson, A., Salganik, M. J., Strohmaier, M., Vespignani, A., & Wagner, C. (2020). "Computational Social Science: Obstacles and opportunities." *Science* 369:1060–1062. https://doi.org/10.1126/science.aaz8170.

<a id='sen_a_2021'></a>
Sen, I., Flöck, F., Weller, K., Weiß, B., & Wagner, C. (2021). "A total error framework for digital traces of human behavior on online platforms." *Public Opinion Quarterly* 85:399–422. https://doi.org/10.1093/poq/nfab018.

<a id='statista_most_2023'></a>
Statista (2023). "Most popular social networks worldwide as of January 2023, ranked by number of monthly active users." https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/. Retrieved 18 August 2023.

<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main author: N. Gizem Bacaksizlar Turbic & Pouria Mirelmi 

Contributors: Felix Beck-Soldner & Haiko Lietz

Version date: 18 August 2023

License: Creative Commons Attribution 4.0 International (CC BY 4.0)
</div>