<img src='images/header.png' style='height: 50px; float: left'>

## Introduction to Computational Social Science methods with Python

# Session B1: API harvesting

The web and its **online platforms** provide vast amounts of data that are highly relevant and interesting for social research. Whether generated on social media platforms (*e.g.*, Facebook, Reddit, or X, formerly Twitter), search engines (*e.g.*, Google), or knowledge production platforms (*e.g.*, Wikipedia, GitHub), these data constitute **digital traces** of behavior that are, as a first approximation, unobtrusive (*i.e.*, not influenced by observational or memory effects), complete (*i.e.*, a non-probabilistic sample), and highly resolved (*i.e.*, in real time and at scale). This provides a unique opportunity to study human behavior in naturalistic settings (<a href='#lazer_computational_2009'>Lazer *et al.*, 2009</a>).

Obtaining Digital Behavioral Data (DBD) from online platforms, however, is a whole different story. If you are lucky, the dataset you need for your research has already been collected and published on a website or stored in a data archive. If your dream dataset has not been pre-collected, you must collect it yourself. For that purpose, you can use the [APIs](https://en.wikipedia.org/wiki/API) that large online platforms typically provide. In general, an **Application Programming Interface** (API) is a piece of software that facilitates communication among computer programs. When you type a domain name in your browser, you use an API that helps you obtain information from a computer far away and spares you from having to type its IP address. Even Python packages like Pandas are APIs because they speed up coding by providing a set of pre-programmed functions that perform commonly needed operations, so that programmers need not write them from scratch.

Operators of large digital platforms typically provide APIs to which you can apply for access as a researcher. In many cases, applicants must undergo a vetting procedure in which they describe the goals and procedures of their project to the provider. Once you have access, you will usually work with the API through a wrapper written in a programming language like Python. Essentially, **wrappers** are layers that communicate with the API for us and are more convenient because they make it easier to automate requests. There are two main downsides to using APIs. First, most APIs restrict the types and the amount of data they provide. Such restrictions are often tiered: free access provides the least data, while higher tiers provide a wider variety of data types and larger amounts of data. Second, APIs change (<a href='#junger_a_2022'>Jünger, 2022</a>; <a href='#mclevey_doing_2022'>McLevey, 2022</a>, ch. 4).

Recently, API change has become a tremendous problem for social research. In 2009, it still seemed that collaborations between platform operators and academic institutions could guarantee both data access and user privacy (<a href='#lazer_computational_2009'>Lazer *et al.*, 2009</a>). The two giants Meta (the company running Facebook) and Twitter had both set up **academic APIs**, but after years of experiments both terminated them in 2022 and 2023, respectively. Platform operators are private companies whose business models do not align well with free data access for researchers, journalists, or others who work in the public interest. Twitter has since been renamed X and now charges \\$42,000 for 50M requests. Reddit, another company that is trying to monetize its API, "just" charges \\$12,000. These developments have plunged Computational Social Science into a **reproducibility crisis** (<a href='#davidson_social_2023'>Davidson *et al.*, 2023</a>) and are causing the field to invest more in other data collection methods like data scraping techniques, browser extensions, or user data donations, potentially provided by centralized infrastructures (<a href='#lazer_computational_2020'>Lazer *et al.*, 2020</a>).

Nevertheless, given the many digital platforms and APIs out there, API harvesting remains an important data collection method. This is particularly the case for non-commercial platforms like Wikipedia. To ensure reproducibility and **data quality**, the characteristics of collected – not just harvested – datasets should be transparently documented and communicated. Just like in survey research, computational scientists are advised to reflect upon the challenges associated with the collection of digital traces, the underlying population that produced them, the meaning encoded in these traces, and the role of the platform in the trace generation process (<a href='#sen_a_2021'>Sen *et al.*, 2021</a>). DBD can only be complete with respect to the trace-producing population, and platform effects render it obtrusive in its very own meaning. The Total Error Sheets for Datasets (TES-D) framework is a critical guide to documenting online platform datasets (<a href='#frohling_total_2023'>Fröhling *et al.*, 2023</a>).

<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn how to collect Digital Behavioral Data via API harvesting. In subsession **B1.1**, we will list resources for social media APIs and for datasets that have already been collected. In subsession **B1.2**, we will dive into harvesting Wikipedia, introducing a few APIs that help with collecting various parts of Wikipedia pages. Finally, in subsession **B1.3**, we will discuss the Total Error Sheets for Datasets (TES-D) framework to document a Twitter dataset.
</div>

## B1.1. APIs and precollected datasets

<img src="./images/datasets.jpg" width="500" height = "900" align="left"/>  

- __Awesome list__
- __More APIs__

    [Facebook for Developers](https://developers.facebook.com/)  
    [Facebook Ads API](https://developers.facebook.com/docs/marketing-apis/)  
    [Instagram Developer](https://developers.facebook.com/docs/instagram-basic-display-api)  
    [YouTube Developers](https://developers.google.com/youtube/)  
    [Weibo API](http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3/en)  
    [CrowdTangle](https://www.crowdtangle.com/request)  
    [4chan](https://github.com/4chan/4chan-API)  
    [Gab](https://github.com/a-tal/gab)  
    [Github REST API](https://docs.github.com/en/rest)  
    [Github GraphQL](https://docs.github.com/en/graphql)  
    [Stackoverflow](https://api.stackexchange.com/docs)  
    [Facepager](https://github.com/strohne/Facepager)  


- __Precollected datasets__  
    https://datasetsearch.research.google.com  
    https://www.kaggle.com/datasets  
    https://data.gesis.org/sharing/#!Search  


- __Locating or Requesting Social Media Data__
    https://www.programmableweb.com

## B1.2. Harvesting Wikipedia

<img src='./images/wikipedia_logo.png' style='height: 190px; float: right; margin-left: 50px' >

Wikipedia is a rich source of data for social science research. Although we can access its data through other techniques like web scraping, there are also useful APIs that could ease collecting data from the website.

Since Wikipedia is built on [MediaWiki](https://en.wikipedia.org/wiki/MediaWiki), we will be using Python wrappers written for its API, the [MediaWiki Action API](https://www.mediawiki.org/wiki/API:Main_page). Each of these wrappers provides some useful methods, and we will go through the ones that are most important to our data collection tasks.
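
To make the idea of a wrapper more concrete, the following minimal sketch builds by hand roughly the kind of request URL that a wrapper assembles for us. It assumes the public English Wikipedia endpoint and the Action API's full-text search parameters; the helper name `build_search_url` is made up for this illustration and is not part of any package. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Public endpoint of the MediaWiki Action API for English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_search_url(query, limit=10):
    """Return the URL of a full-text search request (hypothetical helper)."""
    params = {
        "action": "query",   # the Action API module to call
        "list": "search",    # run a full-text search
        "srsearch": query,   # the search string
        "srlimit": limit,    # maximum number of hits to return
        "format": "json",    # ask for a JSON response
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


search_url = build_search_url("seattle", limit=3)
print(search_url)
```

Opening the printed URL in a browser should return the raw JSON response. A wrapper hides this step and hands back Python objects instead.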

We will also introduce two useful parsers for the Wikipedia markup language, and will see how they could be used for extracting clean data from the raw markup code.

### B1.2.1. wikipedia

The first wrapper we introduce here is simply called [wikipedia](https://wikipedia.readthedocs.io/en/latest/code.html#api).

In [1]:
import wikipedia as wp

Searching a query with `wikipedia` can be done using the [`search()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function:

In [2]:
wp.search("seattle")

['Seattle',
 'Seattle Seahawks',
 'OL Reign',
 'Seattle Sounders FC',
 'Seattle Mariners',
 'Seattle Kraken',
 'Chief Seattle',
 'Sleepless in Seattle',
 'Seattle metropolitan area',
 'The Seattle Times']

You can get fewer or more results with a specific number like this:

In [3]:
wp.search("seattle", results=3)

['Seattle', 'Seattle Seahawks', 'OL Reign']

Wikipedia's suggested query can be accessed with the [`suggest()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function:

In [4]:
wp.suggest("seattle") # returns a suggested query, or None if there is no suggestion

For getting the summary of an article, you can use the [`summary()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function:

In [5]:
print(wp.summary("Chief Seattle"))

Chief Seattle (c. 1786 – June 7, 1866) was a Suquamish and Duwamish chief. A leading figure among his people, he pursued a path of accommodation to white settlers, forming a personal relationship with "Doc" Maynard. The city of Seattle, in the U.S. state of Washington, was named after him. A widely publicized speech arguing in favour of ecological responsibility and respect of Native Americans' land rights had been attributed to him.
The name Seattle is an Anglicization of the modern Duwamish conventional spelling Si'ahl, equivalent to the modern Lushootseed spelling siʔaɫ IPA: [ˈsiʔaːɬ] and also rendered as Sealth, Seathl or See-ahth. According to elder taqʷšəbluʔ, his name was traditionally pronounced siʔaƛ̕.




In [6]:
print(wp.summary("Chief Seattle", sentences=1))

Chief Seattle (c. 1786 – June 7, 1866) was a Suquamish and Duwamish chief.


`summary()` raises a `DisambiguationError` if the page is a disambiguation page, or a `PageError` if the page doesn't exist (although, by default, it first tries to find the page you meant using suggest and search).

In [7]:
print(wp.summary("Mercury"))



DisambiguationError: "Mercury" may refer to: 
Mercury (planet)
Mercury (element)
Mercury (mythology)
Mercury (company)
Mercury (toy manufacturer)
Mercury Communications
Mercury Corporation
Mercury Cyclecar Company
Mercury Drug
Mercury Energy
Mercury Filmworks
Mercury General
Mercury Interactive
Mercury Marine
Mercury Systems
Mercury Truck & Tractor Company
TP-Link
Mercury (programming language)
Mercury (metadata search system)
Ferranti Mercury
Mercury Browser
Mercury Mail Transport System
Mercury (film)
Mercury (TV series)
Mercury Black
Sailor Mercury
Mercury (Marvel Comics)
Makkari (comics)
Metal Men
Cerebro's X-Men
Amalgam Comics character
Mercury (magazine)
The American Mercury
The Mercury (Hobart)
The Mercury (South Africa)
The Mercury (Pennsylvania)
Mercury (Newport)
Reading Mercury
The Mercury News
List of newspapers named Mercury
Mercury (Bova novel)
Mercury (Livesey novel)
Anna Kavan
Mercury Nashville
Mercury Records
Mercury Prize
Mercury, the Winged Messenger
Mercury (American Music Club album)
Mercury (Longview album)
Mercury (Madder Mortem album)
Mercury – Act 1
Mercury – Acts 1 & 2
"Mercury" (song)
Recovering the Satellites
Failer
Planetarium
Operation Mercury
Boeing E-6 Mercury
Miles Mercury
HMS Mercury
USS Mercury
Russian brig Mercury
Mercury (pigeon)
Mercury (name)
Mercury, Savoie
Mercury Bay
place in Alabama
Mercury, Nevada
Mercury, Texas
Mercury (plant)
Annual mercury
English mercury
Mercury FM
Mercury 96.6
Edmonton Mercurys
Fujita Soccer Club Mercury
Memphis Mercury
Phoenix Mercury
Toledo Mercurys
Mercury Theatre (disambiguation)
Blackburn Mercury
Bristol Mercury
Mercury (automobile)
List of Mercury vehicles
Mercury (cyclecar)
Mercury (train)
Mercury (ship)
Cape Cod Mercury 15
Mercury 18
Project Mercury
Mercury (satellite)
Archer Maclean's Mercury
Mercury (cipher machine)
Mercury Boulevard
Mercury Cinema
Shuttle America
The Mercury Mall
All pages with titles beginning with Mercury 
The American Mercury
Mercuri
Mercury 1 (disambiguation)
Mercury 2 (disambiguation)
Mercury 3 (disambiguation)
Mercury 4 (disambiguation)
Mercury 5 (disambiguation)
Mercury 6 (disambiguation)
Mercury 7 (disambiguation)
Mercury 8 (disambiguation)
Mercury City (disambiguation)
Mercury FM (disambiguation)
Mercury House (disambiguation)
Mercury mission (disambiguation)
Mercury program (disambiguation)
Mercury project (disambiguation)
All pages with titles containing Mercury

In [8]:
try:
    wp_summary = wp.summary("Mercury") # store the summary itself, not the return value of print()
    print(wp_summary)
except wp.exceptions.DisambiguationError as e:
    print(e.options)

['Mercury (planet)', 'Mercury (element)', 'Mercury (mythology)', 'Mercury (company)', 'Mercury (toy manufacturer)', 'Mercury Communications', 'Mercury Corporation', 'Mercury Cyclecar Company', 'Mercury Drug', 'Mercury Energy', 'Mercury Filmworks', 'Mercury General', 'Mercury Interactive', 'Mercury Marine', 'Mercury Systems', 'Mercury Truck & Tractor Company', 'TP-Link', 'Mercury (programming language)', 'Mercury (metadata search system)', 'Ferranti Mercury', 'Mercury Browser', 'Mercury Mail Transport System', 'Mercury (film)', 'Mercury (TV series)', 'Mercury Black', 'Sailor Mercury', 'Mercury (Marvel Comics)', 'Makkari (comics)', 'Metal Men', "Cerebro's X-Men", 'Amalgam Comics character', 'Mercury (magazine)', 'The American Mercury', 'The Mercury (Hobart)', 'The Mercury (South Africa)', 'The Mercury (Pennsylvania)', 'Mercury (Newport)', 'Reading Mercury', 'The Mercury News', 'List of newspapers named Mercury', 'Mercury (Bova novel)', 'Mercury (Livesey novel)', 'Anna Kavan', 'Mercury Na
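
When harvesting many titles in a loop, it can help to catch both error types in one place. The following is a minimal sketch, not part of the `wikipedia` package: the helper `summary_or_options` and its parameter names are made up for illustration, and the fetch function and the two exception classes are passed in as arguments. So that the sketch runs offline, it is demonstrated with stand-in classes that merely mimic the behaviour shown above.

```python
def summary_or_options(title, fetch, ambiguous_error, missing_error):
    """Fetch a summary; fall back to candidate titles if the title is ambiguous.

    Hypothetical helper: `fetch` is any function mapping a title to a summary,
    and the two exception classes are supplied by the caller.
    """
    try:
        return {"title": title, "summary": fetch(title), "options": []}
    except ambiguous_error as error:
        # A disambiguation page: keep the candidate titles for manual review.
        return {"title": title, "summary": None, "options": list(error.options)}
    except missing_error:
        # No page with this title exists.
        return {"title": title, "summary": None, "options": []}


# Offline stand-ins with made-up behaviour (these are not real API calls).
class FakeAmbiguous(Exception):
    def __init__(self, options):
        super().__init__("ambiguous title")
        self.options = options


class FakeMissing(Exception):
    pass


def fake_fetch(title):
    if title == "Mercury":
        raise FakeAmbiguous(["Mercury (planet)", "Mercury (element)"])
    if title == "No such page":
        raise FakeMissing()
    return f"Summary of {title}"


for t in ["Chief Seattle", "Mercury", "No such page"]:
    print(summary_or_options(t, fake_fetch, FakeAmbiguous, FakeMissing))
```

With the real package, the call would presumably look like `summary_or_options("Mercury", wp.summary, wp.exceptions.DisambiguationError, wp.exceptions.PageError)`.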

The [`page()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function enables you to load and access data from full Wikipedia pages. Initialize it with a page title (keeping in mind the errors described above), and you can then easily access most properties of the page:

In [9]:
wp_page = wp.page("Chief Seattle")
wp_page

<WikipediaPage 'Chief Seattle'>

The rendered HTML of the page can be retrieved with the `html()` method and displayed in the notebook:

In [10]:
from IPython.display import HTML # importing from IPython.core.display is deprecated

HTML(wp_page.html())

Chief Seattle,Chief Seattle.1
siʔaɬ,siʔaɬ
"Only known photograph of Seattle, 1864","Only known photograph of Seattle, 1864"
,
Suquamish & Duwamish leader,Suquamish & Duwamish leader
Personal details,Personal details
Born,c. 1780[1][2] Blake Island
Died,"June 7, 1866 (aged 85–86) Port Madison"
Resting place,"Port Madison, Washington, U.S."
Spouses,LadailaOwiyahl[3]
Relations,Doc Maynard

Authority control,Authority control.1
International,FAST ISNI 2 VIAF WorldCat 2
National,Norway Chile Spain France BnF data Catalonia Germany Israel United States Latvia Japan Czech Republic Australia Greece Korea Netherlands Portugal
Artists,MusicBrainz
People,Trove
Other,SNAC IdRef

vteAmerican frontier,vteAmerican frontier.1
1776 to 1912,1776 to 1912
Native Nations,Apache Arapaho Arikara Assiniboine (Nakota) Blackfoot Cahuilla Cayuse Cheyenne Chinook Chippewa (Ojibwe) Caddo Cocopah Comanche Crow Dakota Five Civilized Tribes Hidatsa Hopi Hualapai Kickapoo Kiowa Kumeyaay Kutenai Lakota Lenape (Delaware) Mandan Maricopa Modoc Mohave Muscogee Navajo Nez Perce Northern Paiute Nootka (Nuu-chah-nulth) Pawnee Pend d'Oreilles Pequots Pima Pueblo Seminoles Shoshone Sioux Southern Paiute Tohono Oʼodham Tonkawa Umpqua Ute Washoe Yakama Yaqui Yavapai Yuma (Quechan)
Notable people,"Native Americans Black Hawk Black Kettle Bloody Knife Chief Joseph Cochise Crazy Bear Crazy Horse Crazy Snake Dasoda-hae Geronimo Irataba Kiliahote Manuelito Massai Plenty Coups Quanah Parker Red Cloud Sacagawea Seattle Sitting Bull Smallwood Snapping Turtle Standing Bear Ten Bears Touch the Clouds Tuvi Victorio Washakie Explorers and pioneers John Bozeman Jim Bridger Tomás Vélez Cachupín William Clark Davy Crockett John C. Frémont Liver-Eating Johnson Meriwether Lewis Joe Mayer William John Murphy John Wesley Powell Juan Rivera Levi Ruggles Jedediah Smith Jack Swilling Trinidad Swilling Ora Rush Weed Richens Lacey Wootton Henry Wickenburg ""Old Bill"" Williams Brigham Young Lawmen Elfego Baca Charlie Bassett Roy Bean Morgan Earp Virgil Earp Wyatt Earp Henry Garfias Pat Garrett Jack Helm ""Wild Bill"" Hickok Bat Masterson ""Mysterious Dave"" Mather Bass Reeves George Scarborough John Selman John Horton Slaughter William ""Bill"" Tilghman James Timberlake Harry C. Wheeler Outlaws Billy the Kid Black Bart ""Curly Bill"" Brocius Butch Cassidy Billy Clanton Ike Clanton Dalton Brothers (Grat, Bill, Bob, Emmett) Bill Doolin Bill Downing John Wesley Hardin Johnny Ringo Jesse James Frank James Tom Ketchum Frank McLaury Tom McLaury Joaquin Murrieta Belle Starr Soapy Smith Sundance Kid Younger Brothers (Cole, Bob, Jim, John) Soldiers and scouts Frederick Russell Burnham Kit Carson ""Buffalo Bill"" Cody Texas Jack Omohundro James C. Cooney George Crook George Armstrong Custer Alexis Godey Samuel P. Heintzelman Tom Horn Calamity Jane Luther Kelly Ranald S. Mackenzie Charley Reynolds Philip Sheridan Al Sieber Others John Jacob Astor William H. Boring Jonathan R. Davis George Flavel C. S. Fly John Joel Glanton George E. Goodfellow Doc Holliday Andrew Jackson Zephaniah Kingsley Seth Kinman Octaviano Larrazolo Nat Love Sylvester Mowry Emperor Norton Annie Oakley Sedona Schnebly Thomas William Sweeny Peter Lebeck"
Native Americans,Black Hawk Black Kettle Bloody Knife Chief Joseph Cochise Crazy Bear Crazy Horse Crazy Snake Dasoda-hae Geronimo Irataba Kiliahote Manuelito Massai Plenty Coups Quanah Parker Red Cloud Sacagawea Seattle Sitting Bull Smallwood Snapping Turtle Standing Bear Ten Bears Touch the Clouds Tuvi Victorio Washakie
Explorers and pioneers,"John Bozeman Jim Bridger Tomás Vélez Cachupín William Clark Davy Crockett John C. Frémont Liver-Eating Johnson Meriwether Lewis Joe Mayer William John Murphy John Wesley Powell Juan Rivera Levi Ruggles Jedediah Smith Jack Swilling Trinidad Swilling Ora Rush Weed Richens Lacey Wootton Henry Wickenburg ""Old Bill"" Williams Brigham Young"
Lawmen,"Elfego Baca Charlie Bassett Roy Bean Morgan Earp Virgil Earp Wyatt Earp Henry Garfias Pat Garrett Jack Helm ""Wild Bill"" Hickok Bat Masterson ""Mysterious Dave"" Mather Bass Reeves George Scarborough John Selman John Horton Slaughter William ""Bill"" Tilghman James Timberlake Harry C. Wheeler"
Outlaws,"Billy the Kid Black Bart ""Curly Bill"" Brocius Butch Cassidy Billy Clanton Ike Clanton Dalton Brothers (Grat, Bill, Bob, Emmett) Bill Doolin Bill Downing John Wesley Hardin Johnny Ringo Jesse James Frank James Tom Ketchum Frank McLaury Tom McLaury Joaquin Murrieta Belle Starr Soapy Smith Sundance Kid Younger Brothers (Cole, Bob, Jim, John)"
Soldiers and scouts,"Frederick Russell Burnham Kit Carson ""Buffalo Bill"" Cody Texas Jack Omohundro James C. Cooney George Crook George Armstrong Custer Alexis Godey Samuel P. Heintzelman Tom Horn Calamity Jane Luther Kelly Ranald S. Mackenzie Charley Reynolds Philip Sheridan Al Sieber"
Others,John Jacob Astor William H. Boring Jonathan R. Davis George Flavel C. S. Fly John Joel Glanton George E. Goodfellow Doc Holliday Andrew Jackson Zephaniah Kingsley Seth Kinman Octaviano Larrazolo Nat Love Sylvester Mowry Emperor Norton Annie Oakley Sedona Schnebly Thomas William Sweeny Peter Lebeck
Frontier culture,American bison Barbed wire Boot Hill Cattle drive Cowboy poetry Cattle rustling Cow town Fast draw Ghost town Gunfights Homesteading Land rush Manifest destiny Moonshine One-room schoolhouse Rodeo Stagecoach Train robbery Vigilante justice Western saloon Tack piano Westward expansion Wild West shows


0,1
Places,Alaska Anchorage Iditarod Nome Seward Skagway Arizona Territory Canyon Diablo Fort Grant Prescott Phoenix Tombstone Tucson Window Rock Yuma California Bakersfield Fresno Jamestown Los Angeles Sacramento San Diego San Francisco Colorado Creede Denver Telluride Trinidad Dakota Territory Bismarck Deadwood Fargo Fort Yates Pine Ridge Rapid City Standing Rock Yankton Florida Territory Angola Negro Fort Pensacola Prospect Bluff St. Augustine St. Marks Tallahassee Idaho Territory Fort Boise Fort Hall Illinois Fort Dearborn Kansas Abilene Dodge City Ellsworth Hays Leavenworth Wichita Missouri Independence Kansas City St. Louis Montana Territory Billings Bozeman Deer Lodge Fort Benton Fort Peck Helena Livingston Missoula Virginia City Nebraska Chadron Fort Atkinson Fort Robinson Nebraska City Ogallala Omaha Valentine Whiteclay Nevada Carson City Virginia City Reno New Mexico Territory Alamogordo Albuquerque Cimarron Fort Sumner Gallup Las Vegas Lincoln Mesilla Mogollon Roswell Santa Fe Tucumcari Oklahoma Territory and Indian Territory Broken Arrow Fort Gibson Fort Sill Oklahoma City Okmulgee Pawhuska Tahlequah Tishomingo Tuskahoma Wewoka Oregon Territory Astoria The Dalles La Grande McMinnville Oregon City Portland Salem Vale Texas Austin Abilene El Paso Fort Worth Gonzales Lubbock San Antonio Utah Territory Salt Lake City Washington Territory Everett Port Townsend Seattle Vancouver Wyoming Territory Fort Bridger Fort Laramie
Alaska,Anchorage Iditarod Nome Seward Skagway
Arizona Territory,Canyon Diablo Fort Grant Prescott Phoenix Tombstone Tucson Window Rock Yuma
California,Bakersfield Fresno Jamestown Los Angeles Sacramento San Diego San Francisco
Colorado,Creede Denver Telluride Trinidad
Dakota Territory,Bismarck Deadwood Fargo Fort Yates Pine Ridge Rapid City Standing Rock Yankton
Florida Territory,Angola Negro Fort Pensacola Prospect Bluff St. Augustine St. Marks Tallahassee
Idaho Territory,Fort Boise Fort Hall
Illinois,Fort Dearborn
Kansas,Abilene Dodge City Ellsworth Hays Leavenworth Wichita

Missouri,Independence Kansas City St. Louis


Page objects expose information such as the title and the URL of the page. To get the title, use the `title` attribute:

In [11]:
wp_page.title

'Chief Seattle'

Using the `url` attribute, you can get the url of the page:

In [12]:
wp_page.url

'https://en.wikipedia.org/wiki/Chief_Seattle'

To get the full text of the page, you can use the `content` attribute:

In [13]:
print(wp_page.content)

Chief Seattle (c. 1786 – June 7, 1866) was a Suquamish and Duwamish chief. A leading figure among his people, he pursued a path of accommodation to white settlers, forming a personal relationship with "Doc" Maynard. The city of Seattle, in the U.S. state of Washington, was named after him. A widely publicized speech arguing in favour of ecological responsibility and respect of Native Americans' land rights had been attributed to him.
The name Seattle is an Anglicization of the modern Duwamish conventional spelling Si'ahl, equivalent to the modern Lushootseed spelling siʔaɫ IPA: [ˈsiʔaːɬ] and also rendered as Sealth, Seathl or See-ahth. According to elder taqʷšəbluʔ, his name was traditionally pronounced siʔaƛ̕.


== Biography ==
Seattle's mother Sholeetsa was dxʷdəwʔabš (Duwamish) and his father Shweabe was chief of the suq̓ʷabš (Suquamish). Seattle was born some time between 1780 and 1786 on Blake Island, Washington. One source cites his mother's name as Wood-sho-lit-sa. The Duwamish 

The `sections` attribute is meant to list the section titles of the page, and the `section()` method returns the plain-text content of a given section:

In [14]:
wp_page.sections # expected to list the section titles, but returns an empty list here

[]

In [15]:
print(wp_page.section('Biography'))

Seattle's mother Sholeetsa was dxʷdəwʔabš (Duwamish) and his father Shweabe was chief of the suq̓ʷabš (Suquamish). Seattle was born some time between 1780 and 1786 on Blake Island, Washington. One source cites his mother's name as Wood-sho-lit-sa. The Duwamish tradition is that Seattle was born at his mother's village of stukʷ on the Black River, in what is now the city of Kent, Washington, and that Seattle grew up speaking both the Duwamish and Suquamish dialects of Lushootseed. Because Native descent among the Salish peoples was not solely patrilineal, Seattle inherited his position as chief of the Duwamish Tribe from his maternal uncle.Seattle earned his reputation at a young age as a leader and a warrior, ambushing and defeating groups of tribal enemy raiders coming up the Green River from the Cascade foothills.
Like many of his contemporaries, he owned slaves captured during his raids. He was tall and broad, standing nearly six feet (1.8 m) tall; Hudson's Bay Company traders gave 
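
Since `sections` comes back empty here, one possible workaround is to read the headings directly from `content`, where they appear in the form `== Biography ==`. The sketch below is only an illustration under that assumption: the helper `list_headings` is made up, it presumes that deeper heading levels use more `=` signs, and the sample text (including the subsection title) is invented so that the code runs offline.

```python
import re

# A heading line looks like "== Title ==", with the same number of "=" on both sides.
HEADING_PATTERN = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", flags=re.MULTILINE)


def list_headings(content):
    """Return (level, title) pairs for every heading line in `content`."""
    return [
        (len(match.group(1)) - 1, match.group(2))  # "==" is level 1, "===" is level 2
        for match in HEADING_PATTERN.finditer(content)
    ]


# Made-up sample that imitates the layout of `wp_page.content`.
sample_content = (
    "Chief Seattle was a Suquamish and Duwamish chief.\n"
    "\n\n== Biography ==\n"
    "Seattle was born some time between 1780 and 1786.\n"
    "\n\n=== Hypothetical subsection ===\n"
    "Placeholder text.\n"
)

print(list_headings(sample_content))
```

Applied to the real page, `list_headings(wp_page.content)` should yield titles that can then be passed to `wp_page.section()`.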

You can access the images in the page using `.images`. The URLs of the first five images are retrieved like this:

In [16]:
wp_page.images[0:5]

['https://upload.wikimedia.org/wikipedia/commons/9/95/Angeline%2C_daughter_of_Chief_Seattle_%284951162943%29.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/8/8f/B-17E_41-2656_435th_Bomb_Squadron_Chief_Seattle.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/7/74/Chief_Seattle%27s_bust.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/4/46/Chief_Seattle_gravesite.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/4/44/Chief_Seattle_tombstone.jpg']

In order to get the URLs of the external links of the page, you can use `.references`:

In [17]:
wp_page.references[:5]

['http://content.lib.washington.edu/aipnw/buerge2.html',
 'http://www.suquamish.nsn.us/',
 'http://www.traillink.com/trail/chief-sealth-trail.aspx',
 'http://www.chiefseattle.com/history/chiefseattle/chief.htm',
 'http://kuow.org/post/rare-move-chief-seattle-changed-future-city']

You can get the titles of the Wikipedia pages that the page links to using `.links`:

In [18]:
wp_page.links[:10]

['435th Bombardment Squadron',
 'Abilene, Kansas',
 'Abilene, Texas',
 'Agate Passage',
 'Al Sieber',
 'Alamogordo, New Mexico',
 'Albuquerque, New Mexico',
 'Alexis Godey',
 'Allen Wright',
 'American Indian Wars']

The categories that the page belongs to are available via `.categories`:

In [19]:
wp_page.categories[:5]

['1780s births',
 '1866 deaths',
 '18th-century Native Americans',
 '19th-century Native Americans',
 'Articles with BIBSYS identifiers']

Finally, we can collect these properties in a pandas dataframe:

In [20]:
import pandas as pd

In [21]:
pd.DataFrame(
    data = [[wp_page.title, wp_page.url, wp_page.content, wp_page.images, wp_page.references, wp_page.links, wp_page.categories]], 
    columns = ['Title', 'URL', 'Content', 'Images', 'References', 'Links', 'Categories']
)

| | Title | URL | Content | Images | References | Links | Categories |
|---|---|---|---|---|---|---|---|
| 0 | Chief Seattle | https://en.wikipedia.org/wiki/Chief_Seattle | Chief Seattle (c. 1786 – June 7, 1866) was a S... | [https://upload.wikimedia.org/wikipedia/common... | [http://content.lib.washington.edu/aipnw/buerg... | [435th Bombardment Squadron, Abilene, Kansas, ... | [1780s births, 1866 deaths, 18th-century Nativ... |
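
To build such a dataframe for several pages, it may be convenient to first turn each page into a dictionary. The following is a minimal sketch: the helper `page_to_record` and the mapping `FIELDS` are made up for illustration, and the page objects are stand-ins with invented values so that the code runs offline.

```python
from types import SimpleNamespace

# Attributes of a page object to keep, mapped to column names.
FIELDS = {
    "title": "Title",
    "url": "URL",
    "content": "Content",
    "images": "Images",
    "references": "References",
    "links": "Links",
    "categories": "Categories",
}


def page_to_record(page):
    """Turn a page-like object into a dict with one entry per column."""
    return {column: getattr(page, attribute) for attribute, column in FIELDS.items()}


# Stand-in objects with made-up values (no request is sent).
fake_pages = [
    SimpleNamespace(title="Page A", url="https://example.org/A", content="Text A",
                    images=[], references=[], links=["Page B"], categories=["Demo"]),
    SimpleNamespace(title="Page B", url="https://example.org/B", content="Text B",
                    images=[], references=[], links=["Page A"], categories=["Demo"]),
]

records = [page_to_record(page) for page in fake_pages]
print(records[0]["Title"], list(records[0]))
```

With real pages, something like `pd.DataFrame([page_to_record(wp.page(t)) for t in titles])` should then give one row per title, where `titles` is a list of page titles of your choice.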


In order to change the language of the Wikipedia pages you are accessing, you can use the [`set_lang()`](https://wikipedia.readthedocs.io/en/latest/code.html#api) function. Remember to search for page titles in the language that you have set, not English:

In [22]:
wp.set_lang("es")

In [23]:
print(wp.summary("Chief Seattle"))

La estatua del Jefe Seattle es una escultura de tamaño natural al aire libre del Jefe Seattle del artista local James Wehn está instalada en Tilikum Place en Seattle, Washington, en los Estados Unidos.[1]​[2]​[3]​[4]​


In [24]:
wp.set_lang("en")

### B1.2.2. Harvesting tables

The `wikipedia` package introduced in B1.2.1 does not cover every data collection task on Wikipedia.

To get data beyond what `wikipedia` offers, we can use other libraries to access the markup code of Wikipedia pages and then parse it to extract the information we want. We will introduce [pywikibot](https://doc.wikimedia.org/pywikibot/stable/), a wrapper that can give us the markup, together with two parsers for that markup, [mwparserfromhell](https://mwparserfromhell.readthedocs.io/en/latest/index.html) and [wikitextparser](https://wikitextparser.readthedocs.io/en/latest/).

In [25]:
import pywikibot as pwb
import wikitextparser as wtp

We will begin with an example page: [List of political parties in Germany](https://en.wikipedia.org/wiki/List_of_political_parties_in_Germany). We want to extract the data of the tables on that page. Using pywikibot, we can get the markup code of the page, and then parse it with wikitextparser:

In [45]:
pwb_site = pwb.Site('en', 'wikipedia') # the wiki we want to query
pwb_page = pwb.Page(pwb_site, "List of political parties in Germany")
pwb_page

Page('List of political parties in Germany')

In [46]:
pwb_text = pwb_page.text # the raw markup (wikitext) of the page
print(pwb_text)

{{Short description|Political parties in Germany}}
{{Politics of Germany|elections}}This article '''lists [[political party|political parties]] in [[politics of Germany|Germany]]'''.

The [[Federal Republic of Germany]] has a plural [[Multi-party system|multi party system]]. The largest by members and parliament seats are the [[Christian Democratic Union (Germany)|Christian Democratic Union]] (CDU), with its sister party, the [[Christian Social Union of Bavaria|Christian Social Union]] (CSU) and [[Social Democratic Party of Germany]] (SPD).

Germany also has a number of other parties, in recent history most importantly the [[Free Democratic Party (Germany)|Free Democratic Party]] (FDP), [[Alliance 90/The Greens]], [[The Left (Germany)|The Left]], and more recently the [[Alternative for Germany]] (AfD), founded in 2013. The federal government of Germany often consisted of a [[German governing coalition|coalition]] of a major and a minor party, specifically CDU/CSU and FDP or SPD and FDP

To parse the markup with wikitextparser:

In [52]:
wtp_text = wtp.parse(pwb_text)
wtp_text

WikiText('{{Short description|Political parties in Germany}}\n{{Politics of Germany|elections}}This article \'\'\'lists [[political party|political parties]] in [[politics of Germany|Germany]]\'\'\'.\n\nThe [[Federal Republic of Germany]] has a plural [[Multi-party system|multi party system]]. The largest by members and parliament seats are the [[Christian Democratic Union (Germany)|Christian Democratic Union]] (CDU), with its sister party, the [[Christian Social Union of Bavaria|Christian Social Union]] (CSU) and [[Social Democratic Party of Germany]] (SPD).\n\nGermany also has a number of other parties, in recent history most importantly the [[Free Democratic Party (Germany)|Free Democratic Party]] (FDP), [[Alliance 90/The Greens]], [[The Left (Germany)|The Left]], and more recently the [[Alternative for Germany]] (AfD), founded in 2013. The federal government of Germany often consisted of a [[German governing coalition|coalition]] of a major and a minor party, specifically CDU/CSU a

We can access the tables of the parsed page with `wtp_text.tables`. Let's say we want to get the first table's data:

In [74]:
wtp_first_table = wtp_text.tables[0].data()

By putting the data in a dataframe, we can have a better overview of it:

In [83]:
first_table = pd.DataFrame(wtp_first_table[1:])
first_table.columns = wtp_first_table[0]
first_table.head()

Unnamed: 0,Party,Party.1,Party.2,{{tooltip|Abbr.|Abbreviation}},Leader(s),Ideology,Political position,[[Bundestag|MdBs]],[[2019 European Parliament election in Germany|MEPs]],EP group,Membership
0,,[[File:SPD logo.svg|center|75px]],[[Social Democratic Party of Germany]]<br /><s...,SPD,"[[Lars Klingbeil]],<br />[[Saskia Esken]]",[[Social democracy]]<br />[[Pro-Europeanism]],[[Centre-left politics|Centre-left]],{{Composition bar|206|736|{{party color|Social...,{{Composition bar|16|96|{{party color|Social D...,[[Progressive Alliance of Socialists and Democ...,404305
1,,[[File:Cdu-logo.svg|center|75px]],[[Christian Democratic Union of Germany]]<br /...,CDU,[[Friedrich Merz]],{{ubl|class=nowrap|\n |[[Christian&nbsp;democr...,[[Centre-right politics|Centre-right]],{{Composition bar|152|736|{{party color|Christ...,{{Composition bar|23|96|{{party color|Christia...,[[European People's Party Group|EPP]],399110
2,,[[File:CSU Logo since 2016.svg|center|75px]],[[Christian Social Union in Bavaria]]<br /><sm...,CSU,[[Markus Söder]],{{ubl|\n |[[Christian democracy]]\n |[[Conserv...,[[Centre-right politics|Centre-right]] to [[Ri...,{{Composition bar|45|736|{{party color|Christi...,{{Composition bar|6|96|{{party color|Christian...,[[European People's Party Group|EPP]],137010
3,,[[File:Bündnis 90 - Die Grünen Logo (transpare...,[[Alliance 90/The Greens]]<br /><small>''Bündn...,GRÜNE,"[[Ricarda Lang]],<br />[[Omid Nouripour]]",{{ubl|class=nowrap|\n |[[Green politics]]\n |[...,[[Centre-left politics|Centre-left]]{{efn|The ...,{{Composition bar|118|736|{{party color|Allian...,{{Composition bar|21|96|{{party color|Alliance...,[[The Greens–European Free Alliance|Greens/EFA]],106000
4,,[[File:Logo der Freien Demokraten.svg|center|7...,[[Free Democratic Party (Germany)|Free Democra...,FDP,[[Christian Lindner]],[[Liberalism in Germany|Liberalism]]<br />[[Cl...,[[Centre-right politics|Centre-right]],{{Composition bar|92|736|{{party color|Free De...,{{Composition bar|5|96|{{party color|Free Demo...,[[Renew Europe|RE]],73000


As you can see, the cells still contain raw wikitext markup and are not displayed as cleanly as in the original Wikipedia page. We can parse each cell's data with `mwparserfromhell` and then create the dataframe again:

In [84]:
import mwparserfromhell as mwp

In [85]:
for i in range(len(wtp_first_table)):
    for j in range(len(wtp_first_table[i])):
        # Parse the raw wikitext of the cell and keep only its plain text
        wikicode = mwp.parse(wtp_first_table[i][j])
        wtp_first_table[i][j] = wikicode.strip_code()

In [86]:
first_table = pd.DataFrame(wtp_first_table[1:])
first_table.columns = wtp_first_table[0]
first_table.head()

Unnamed: 0,Party,Party.1,Party.2,Unnamed: 4,Leader(s),Ideology,Political position,MdBs,MEPs,EP group,Membership
0,,center|75px,Social Democratic Party of GermanySozialdemokr...,SPD,"Lars Klingbeil,Saskia Esken",Social democracyPro-Europeanism,Centre-left,,,S&D,404305
1,,center|75px,Christian Democratic Union of GermanyChristlic...,CDU,Friedrich Merz,,Centre-right,,,EPP,399110
2,,center|75px,Christian Social Union in BavariaChristlich-So...,CSU,Markus Söder,,Centre-right to right-wing,,,EPP,137010
3,,center|75px,Alliance 90/The GreensBündnis 90/Die Grünen,GRÜNE,"Ricarda Lang,Omid Nouripour",,Centre-left,,,Greens/EFA,106000
4,,center|75px,Free Democratic PartyFreie Demokratische Partei,FDP,Christian Lindner,LiberalismClassical liberalismConservative lib...,Centre-right,,,RE,73000


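Now the table looks much closer to the table in the original page. Note, however, that cells consisting only of templates (for example, the `{{Composition bar|...}}` entries in the MdBs and MEPs columns) end up empty, because `strip_code()` removes templates.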
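One remaining difference is visible in the cleaned output: values that were separated by `<br />` in the wikitext are glued together (e.g., `Social democracyPro-Europeanism`). A minimal sketch, assuming you want a visible separator between such values, replaces the break tags before a cell is handed to `mwp.parse()`. The function name `separate_line_breaks` and the separator `"; "` are hypothetical choices, not part of any library:

```python
import re

# Matches <br>, <br/> and <br /> in any letter case
BREAK_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def separate_line_breaks(cell, sep="; "):
    """Replace HTML line-break tags in a wikitext cell with a separator."""
    return BREAK_TAG.sub(sep, cell)


cell = "[[Social democracy]]<br />[[Pro-Europeanism]]"
print(separate_line_breaks(cell))
# [[Social democracy]]; [[Pro-Europeanism]]
```

Applying this to each cell before parsing should then yield `Social democracy; Pro-Europeanism` after `strip_code()`.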

#### An alternative for extracting table data: the wikitables library

To get a table's data, you can also use the `wikitables` library. It simplifies some steps of accessing table data, but you need to watch out for small bugs or mistakes in the resulting tables. Let's say we want to extract the first table's data:

In [87]:
from wikitables import import_tables

In [88]:
tables = import_tables('List of political parties in Germany')

List of political parties in Germany[0][0]: dropping field from unknown column: S&D
List of political parties in Germany[0][0]: dropping field from unknown column: 404305
List of political parties in Germany[0][1]: dropping field from unknown column: EPP
List of political parties in Germany[0][1]: dropping field from unknown column: 399110
List of political parties in Germany[0][2]: dropping field from unknown column: EPP
List of political parties in Germany[0][2]: dropping field from unknown column: 137010
List of political parties in Germany[0][3]: dropping field from unknown column: Greens/EFA
List of political parties in Germany[0][3]: dropping field from unknown column: 106000
List of political parties in Germany[0][4]: dropping field from unknown column: RE
List of political parties in Germany[0][4]: dropping field from unknown column: 73000
List of political parties in Germany[0][5]: dropping field from unknown column: ID
List of political parties in Germany[0][5]: dropping fiel

List of political parties in Germany[2][27]: dropping field from unknown column: 
List of political parties in Germany[2][27]: dropping field from unknown column: 
List of political parties in Germany[2][28]: dropping field from unknown column: Right-wing
List of political parties in Germany[2][28]: dropping field from unknown column: 
List of political parties in Germany[2][29]: dropping field from unknown column: 
List of political parties in Germany[2][29]: dropping field from unknown column: 
List of political parties in Germany[2][30]: dropping field from unknown column: Single-issue
List of political parties in Germany[2][30]: dropping field from unknown column: 
List of political parties in Germany[2][31]: dropping field from unknown column: 
List of political parties in Germany[2][31]: dropping field from unknown column: 
List of political parties in Germany[2][32]: dropping field from unknown column: Right-wing to Far-right
List of political parties in Germany[2][32]: dropping

List of political parties in Germany[2][76]: dropping field from unknown column: 
List of political parties in Germany[2][77]: dropping field from unknown column: Far-right
List of political parties in Germany[2][77]: dropping field from unknown column: 
List of political parties in Germany[2][78]: dropping field from unknown column: 
List of political parties in Germany[2][78]: dropping field from unknown column: 
List of political parties in Germany[2][79]: dropping field from unknown column: 
List of political parties in Germany[2][79]: dropping field from unknown column: 
List of political parties in Germany[2][80]: dropping field from unknown column: Right-wing
List of political parties in Germany[2][80]: dropping field from unknown column: 
List of political parties in Germany[2][81]: dropping field from unknown column: Far-right
List of political parties in Germany[2][81]: dropping field from unknown column: 
List of political parties in Germany[2][82]: dropping field from unkno

List of political parties in Germany[3][64]: dropping field from unknown column: 
List of political parties in Germany[3][65]: dropping field from unknown column: 
List of political parties in Germany[3][66]: dropping field from unknown column: 
List of political parties in Germany[3][67]: dropping field from unknown column: 
List of political parties in Germany[3][68]: dropping field from unknown column: 
List of political parties in Germany[3][69]: dropping field from unknown column: Commonly known as Schill-Partei
List of political parties in Germany[3][70]: dropping field from unknown column: Merged into Alliance C
List of political parties in Germany[3][71]: dropping field from unknown column: Merged into Alliance C
List of political parties in Germany[3][72]: dropping field from unknown column: Merged into The Left
List of political parties in Germany[3][73]: dropping field from unknown column: 
List of political parties in Germany[3][74]: dropping field from unknown column: 
Lis

List of political parties in Germany[7][1]: dropping field from unknown column: Unofficial counterpart of the CSU
List of political parties in Germany[7][2]: dropping field from unknown column: Counterpart of the KPD
List of political parties in Germany[7][3]: dropping field from unknown column: Later merged into the FDP
List of political parties in Germany[7][4]: dropping field from unknown column: Later merged with the DFU
List of political parties in Germany[7][5]: dropping field from unknown column: Split from the SPS , later merged with the SPD
List of political parties in Germany[7][6]: dropping field from unknown column: Counterpart of the SPD
List of political parties in Germany[7][7]: dropping field from unknown column: 
List of political parties in Germany[8][0]: dropping field from unknown column: 
List of political parties in Germany[8][1]: dropping field from unknown column: 
List of political parties in Germany[8][2]: dropping field from unknown column: 
List of political

In [91]:
first_table_wt = pd.DataFrame(tables[0].rows)
first_table_wt

Unnamed: 0,Party,Unnamed: 2,Leader(s),Ideology,Political position,MdBs,MEPs,EP group,Membership
0,,center|75px,Social Democratic Party of Germany Sozialdemok...,SPD,"Lars Klingbeil , Saskia Esken",Social democracy Pro-Europeanism,Centre-left,206 736 {{party color|Social Democratic Party ...,16 96 {{party color|Social Democratic Party of...
1,,center|75px,Christian Democratic Union of Germany Christli...,CDU,Friedrich Merz,[[Christian&nbsp;democracy]]\n [[Liberal cons...,Centre-right,152 736 {{party color|Christian Democratic Uni...,23 96 {{party color|Christian Democratic Union...
2,,center|75px,Christian Social Union in Bavaria Christlich-S...,CSU,Markus Söder,[[Christian democracy]]\n [[Conservatism]]\n ...,Centre-right to right-wing,45 736 {{party color|Christian Social Union of...,6 96 {{party color|Christian Social Union of B...
3,,center|75px,Alliance 90/The Greens Bündnis 90/Die Grünen,GRÜNE,"Ricarda Lang , Omid Nouripour",[[Green politics]]\n [[Social liberalism]]\n ...,Centre-left,118 736 {{party color|Alliance 90/The Greens}},21 96 {{party color|Alliance 90/The Greens}}
4,,center|75px,Free Democratic Party Freie Demokratische Partei,FDP,Christian Lindner,Liberalism Classical liberalism Conservative l...,Centre-right,92 736 {{party color|Free Democratic Party (Ge...,5 96 {{party color|Free Democratic Party (Germ...
5,,center|75px,Alternative for Germany Alternative für Deutsc...,AfD,"Tino Chrupalla , Alice Weidel",Right-wing populism National conservatism Germ...,Far-right A [[Right-wing politics|right-wing]]...,80 736 {{party color|Alternative for Germany}},9 96 {{party color|Alternative for Germany}}
6,,center|75px,The Left Die Linke,LINKE,"Martin Schirdewan , Janine Wissler",Democratic socialism Left-wing populism,Left-wing,39 736 {{party color|The Left (Germany)}},5 96 {{party color|The Left (Germany)}}
7,,center|75px,South Schleswig Voters' Association Südschlesw...,SSW,Christian Dirschauer,[[Social liberalism]]<br>[[Regionalism (politi...,Centre-left,1 736 {{party color|South Schleswig Voters' As...,0 96 {{party color|South Schleswig Voters' Ass...
8,,center|75px,Free Voters Freie Wähler,FW,Hubert Aiwanger,Liberal conservatism Regionalism,Centre-right,0 736 #FF8000,2 96 #FF8000
9,,center|75px,"Die PARTEI Partei für Arbeit, Rechtsstaat, Tie...",Die PARTEI,Martin Sonneborn,[[Political satire]] [[Humanism]]\n [[Anti-fa...,Left-wing,0 736 {{party color|Die PARTEI}},1 96 {{party color|Die PARTEI}}


As you can see, ... This needs to be taken care of, in case you want to use `wikitables`.

### B1.2.3. Extracting main text of different revisions

A Wikipedia page usually has many revisions. In this section, we demonstrate how to extract the main text of the first revision of an article in each year since its creation, using `pywikibot` and `mwparserfromhell`:

In [None]:
import pywikibot
import mwparserfromhell

Like before, you can first get the page using pywikibot's [`.Site()`](https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#module-site) and [`.Page()`](https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#module-site):

In [None]:
site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, "Koç University")

Then, you can get all the revisions of the page using `page.revisions()`. Depending on how old/rich the page is, this may take a few seconds:

In [None]:
revisions = page.revisions(content=True)

Now we can make a list of all of the revisions, and put the **year** in which each revision has been written into a `years` list. Each revision is in the form of a dictionary, and we can get the *years* using the `timestamp` key in those dictionaries:

In [None]:
revisions_list = []
years = []

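for revision in revisions:
    revisions_list.append(revision)
    # The timestamp is a datetime-like object, so the year is directly available
    years.append(revision['timestamp'].year)
# Revisions arrive newest first; reverse both lists into chronological order
years.reverse()
revisions_list.reverse()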

Since revisions are sorted from the newest to the oldest, we have to reverse the `years` and `revisions_list` lists to put their items in ascending order. By printing the `years` list, you can get an overview of how many revisions the page has in each year:

In [None]:
print(years)

We want to put the first revision of each year into a `yearly_revisions` list. In order to do that, we first get the indices of the first appearances of each year in the `years` list, and get the revisions with those indices in the `revisions_list` list:

In [None]:
yearly_revisions = []
for year in range(years[0], years[-1]+1):
    # Skip years without any revision; years.index() would raise a ValueError
    if year in years:
        yearly_revisions.append(revisions_list[years.index(year)])
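
The selection of one revision per year can also be written as a single pass over the chronologically sorted revisions. The sketch below uses plain dictionaries with a `timestamp` key as stand-ins for pywikibot revisions (an assumption made so that the example runs without network access); the function name `first_revision_per_year` and the toy data are hypothetical:

```python
from datetime import datetime


def first_revision_per_year(revisions):
    """Return {year: revision} holding the earliest revision of each year.

    `revisions` must be sorted from oldest to newest.
    """
    first = {}
    for revision in revisions:
        year = revision["timestamp"].year
        # setdefault keeps the first revision seen for a year and ignores later ones
        first.setdefault(year, revision)
    return first


# Toy stand-ins for pywikibot revisions (hypothetical data, oldest first)
toy_revisions = [
    {"revid": 1, "timestamp": datetime(2019, 3, 1)},
    {"revid": 2, "timestamp": datetime(2019, 9, 15)},
    {"revid": 3, "timestamp": datetime(2021, 1, 5)},  # no revision in 2020
]

yearly = first_revision_per_year(toy_revisions)
print({year: rev["revid"] for year, rev in yearly.items()})
# {2019: 1, 2021: 3}
```

Keying the result by year makes gaps explicit: a year without revisions is simply absent from the dictionary.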

In order to get the clean main text of each revision, we can use the `text` attribute of the revisions, and have the result parsed using `mwparserfromhell`. Take the last revision as an example; we first put the un-parsed code into the `text` variable:

In [None]:
text = yearly_revisions[-1].text

Now we can parse it with `mwparserfromhell` like this:

In [None]:
parsed = mwparserfromhell.parse(text)
print(parsed.strip_code())
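
To process all yearly revisions rather than only the last one, the same step can be wrapped in a loop. The following is a minimal sketch under stated assumptions: revisions are represented by plain dictionaries with `timestamp` and `text` keys, a missing text is assumed to show up as `None`, and the parser is passed in as a function so that the example runs without `mwparserfromhell`. The names `plain_text_by_year` and `strip_bold` are hypothetical:

```python
from datetime import datetime


def plain_text_by_year(yearly_revisions, to_plain_text):
    """Return {year: plain text} for a list of revisions.

    `to_plain_text` is any function that turns wikitext into plain text.
    """
    texts = {}
    for revision in yearly_revisions:
        # Revisions without retrievable content are treated as empty (assumption)
        wikitext = revision["text"] or ""
        texts[revision["timestamp"].year] = to_plain_text(wikitext)
    return texts


def strip_bold(wikitext):
    """Toy stand-in for a real wikitext parser: only removes bold markers."""
    return wikitext.replace("'''", "")


# Hypothetical toy data standing in for the yearly revisions
toy_revisions = [
    {"timestamp": datetime(2019, 3, 1), "text": "'''Example''' article, first version."},
    {"timestamp": datetime(2021, 1, 5), "text": None},
]

texts = plain_text_by_year(toy_revisions, strip_bold)
print(texts[2019])
# Example article, first version.
```

In the notebook, `to_plain_text` could be `lambda wikitext: mwparserfromhell.parse(wikitext).strip_code()`, with `yearly_revisions` being the list built earlier in this section.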

## B1.3. Documentation of datasets collected from online platforms

In the following, we would like to show you how to systematically describe digital behavioral data. For this purpose, we will use the TES-D template (<a href='#frohling_total_2023'>Fröhling et al., 2023</a>; <a href='#sen_a_2021'>Sen et al., 2021</a>). For more details, you can refer to the TES-D Manual (ADD citation).

**TES-D “Computational Social Science Turkey Tweets 2008-2023”**

**General Characteristics** 

1. *Who collected the dataset and who funded the process?*

The dataset was collected by the "Social ComQuant" project team (Gizem Bacaksizlar Turbic, Haiko Lietz, Pouria Mirelmi, Olga Zagovora) at the Computational Social Science department of GESIS - Leibniz Institute for the Social Sciences. The data collection was funded by the European Commission as part of [the Social ComQuant Project](https://socialcomquant.ku.edu.tr/).

2. *Where is the dataset hosted? Is the dataset distributed under a copyright or license?* 

The dataset is hosted in the open-access [GitHub repository](https://github.com/gesiscss/css_methods_python) of the CSS department at GESIS. ADD LICENSE   

3. *What do the instances that comprise the dataset represent? What data does each instance consist of?*

Each line of the dataset represents a distinct Tweet posted on Twitter between 5 January 2008 and 8 January 2023. Each instance consists of: the unique identifier of the Tweet, the unique identifier of the user who posted it, the creation time of the Tweet (in ISO 8601 format), the UTF-8 text of the Tweet, and the language of the Tweet, if detected by Twitter (returned as a BCP 47 language tag). The data was not preprocessed and is represented in the formats provided by the API. 

4. *How many instances are there in total in each category (as defined by the instances’ label), and - if applicable - in each recommended data split?*

There are 105 instances in the dataset. The instances are homogeneous, i.e., each of them represents a Tweet. 

5. *In which contexts and publications has the dataset been used already?* 

The dataset has been used in the online materials of the course [Introduction to Computational Social Science methods with Python](https://github.com/gesiscss/css_methods_python). 

6. *Are there alternative datasets that could be used for the measurement of the same or similar constructs? Could they be a better fit? How do they differ?* 

The dataset was created for teaching purposes, namely as an exercise in collecting data through an API. We are not aware of any similar dataset. 

7. *Can the dataset collection be readily reproduced given the current data access, the general context and other potentially interfering developments?*

The [Jupyter Notebook](https://github.com/gesiscss/css_methods_python/blob/main/b_data_collection_methods/1_API_harvesting.ipynb), subsection B1.2.4, provides Python code that explains how to obtain the dataset. Be aware that the Twitter API might be deprecated due to changes in the policies on free access to the API. All relevant information can be found in the [documentation](https://developer.twitter.com/en/docs) or in the news article [Why Twitter ending free access to its APIs should be a ‘wake-up call’](https://www.theguardian.com/technology/2023/feb/07/techscape-elon-musk-twitter-api).   

8. *Were any ethical review processes conducted?* 

No ethical review processes were conducted. The dataset does not contain any private data.    

9. *Did any ethical considerations limit the dataset creation?* 

We have not stored any data related to the user accounts that posted the relevant Tweets. Storing such data would raise additional ethical considerations. 

10. *Are there any potential risks for individuals using the data?* 

Theoretically, some Tweet texts can include usernames. Thus, to achieve complete anonymisation, one might need to postprocess the data and remove these names.    
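
The instance description and the anonymisation step mentioned above can be illustrated with a short sketch. It assumes a record with the five documented fields (the field names in `TweetRecord` are hypothetical, not the column names of the actual dataset), parses the ISO 8601 creation time, and replaces @-mentions in the text with a placeholder. The regular expression is a simple heuristic and not an official Twitter rule:

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Heuristic: "@" followed by word characters, not preceded by a word
# character (so e-mail addresses such as name@example.org are left alone)
MENTION = re.compile(r"(?<!\w)@\w+")


@dataclass
class TweetRecord:
    """One dataset instance; field names are hypothetical."""
    tweet_id: str
    author_id: str
    created_at: datetime
    text: str
    lang: str


def parse_created_at(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is read as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def mask_mentions(text, placeholder="@USER"):
    """Replace @-mentions in a Tweet text with a placeholder."""
    return MENTION.sub(placeholder, text)


# Made-up example values, not taken from the dataset
record = TweetRecord(
    tweet_id="1",
    author_id="2",
    created_at=parse_created_at("2023-01-08T12:00:00.000Z"),
    text=mask_mentions("Thanks @some_user for the CSS workshop!"),
    lang="en",
)
print(record.text)
# Thanks @USER for the CSS workshop!
```

Masking mentions does not guarantee anonymity, since a Tweet text can still be found through a full-text search; it only removes the most direct identifiers.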

**Construct Definition** 

Validity 

1. For the measurement of what construct was the dataset created? 

 

2. How is the construct operationalized? Can the dataset fully grasp the construct? If not, what dimensions are left out? Have there been any attempts to evaluate the validity of the construct's operationalization? 

 

3. What related constructs could (not) be measured through the dataset? What should be considered when measuring other constructs with the dataset? 

 

4. What is the target population? 

 

5. How does the dataset handle subpopulations? 



**Platform Selection**

Platform Affordances Error 

1. What are the key characteristics of the platform at the time of data collection? 

 

2. What are the effects of the platform's ToS on the collected data? 

 

3. What are the effects of the platform's sociocultural norms on the collected data? 

 

4. How were the relevant traces collected from the platform? Are there any technical constraints of the data collection method? If yes, how did those limit the dataset design? 

 

5. In case multiple data sources were used, what errors might occur through their merger or combination? 


Platform Coverage Error 

1. What is known about the platform/s population? 

**Data Collection** 

Trace Selection Error 

1. How was the data associated with each instance acquired? On what basis were the trace selection criteria chosen? 

 

2. Was there any data that could not be adequately collected? 

 

3. Is any information missing from individual instances? Could there be a systematic bias? 

 

4. Does the dataset include sensitive or confidential information? 

User Selection Error 

1. Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample from a larger set, what was the sampling strategy? 

 

2. What is known about the dataset population? Are there user groups systematically in- or excluded in/from the dataset in direct consequence of the trace selection criteria? 

 

3. Over what timeframe was the data collected, and how might that timeframe have affected the collected data? 

 

4. If the dataset relates to people, how did they consent to collecting and using their data? 

 

5. Does the data include information on minors? 

**Data Preprocessing and Data Augmentation**

Trace Augmentation and Trace Measurement Error 

Is there a label or target associated with each instance? If so, how were the labels or targets generated? 

 

If automated methods were used, how does the methods’ performance impact the correctness of the augmentations? 

 

If human annotations were used, who were the annotators that created the labels? How were they recruited or chosen? How were they instructed? 

 

If the final gold label was derived from different annotations, how was this done? 

 

Have there been any attempts to validate the labels? 

 

How could the data be misused? 

 

Can the dataset in any way unintentionally contribute to the reinforcement of social inequality? 

User Augmentation Error 

Have attributes and characteristics of individuals been inferred? 

 

Is it possible to identify individuals either directly or indirectly from the data? 

Trace Reduction Error 

Have traces been excluded? Why and by what criteria? 

User Reduction Error 

Have users been excluded? Why and by what criteria? 

Adjustment Error 

Does the dataset provide information to adjust the results to a target population? If so, is this information inferred or self-reported? 

## References

### Recommended readings

<a id='junger_a_2022'></a>
Jünger, J. (2022). "A brief history of APIs: Limitations and opportunities for online research." In: Engel, U. & Quan-Haase, A. (eds), *Handbook of Computational Social Science* 2 (p. 17–32). Abingdon: Routledge. https://doi.org/10.4324/9781003025245

<a id='mclevey_doing_2022'></a>
McLevey, J. (2022). *Doing Computational Social Science: A Practical Introduction*. SAGE. https://us.sagepub.com/en-us/nam/doing-computational-social-science/book266031. *A rather complete introduction to the field with well-structured and insightful chapters also on using Pandas. The [website](https://github.com/UWNETLAB/dcss_supplementary) offers the code used in the book.*

### Complementary readings

<a id='davidson_social_2023'></a>
Davidson, B. I., Wischerath, D., Racek, D., Parry, D. A., Godwin, E., Hinds, J., Linden, D. v. d., Roscoe, J. F., & Ayravainen, L. (2023). "Social media APIs: A quiet threat to the advancement of science." *PsyArXiv*:ps32z. https://doi.org/10.31234/osf.io/ps32z.

<a id='frohling_total_2023'></a>
Fröhling, L., Sen, I., Soldner, F., Steinbrinker, L., Zens, M., & Weller, K. (2023). "Total Error Sheets for Datasets (TES-D) -- A critical guide to documenting online platform datasets." *arXiv*:2306.14219. https://doi.org/10.48550/arXiv.2306.14219.

<a id='lazer_computational_2009'></a>
Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A.-L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D., Van Alstyne, M. (2009). "Computational Social Science." *Science* 323:721–723. https://doi.org/10.1126/science.1167742.

<a id='lazer_computational_2020'></a>
Lazer, D. M. J., Pentland, A., Watts, D. J., Aral, S., Athey, S., Contractor, N., Freelon, D., Gonzalez-Bailon, S., King, G., Margetts, H., Nelson, A., Salganik, M. J., Strohmaier, M., Vespignani, A., & Wagner, C. (2020). "Computational Social Science: Obstacles and opportunities." *Science* 369:1060–1062. https://doi.org/10.1126/science.aaz8170.

<a id='sen_a_2021'></a>
Sen, I., Flöck, F., Weller, K., Weiß, B., & Wagner, C. (2021). "A total error framework for digital traces of human behavior on online platforms." *Public Opinion Quarterly* 85:399–422. https://doi.org/10.1093/poq/nfab018.

___

Zenk-Möltgen, Wolfgang (GESIS - Leibniz Institute for the Social Sciences), Python Script to rehydrate Tweets from Tweet IDs https://doi.org/10.7802/1504

Pfeffer, Morstatter (2016): Geotagged Twitter posts from the United States: A tweet collection to investigate representativeness. Dataset. http://dx.doi.org/10.7802/1166

<a id='statista'></a>
Statista, 2023. https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/. Retrieved 26.04.2023.

<a id='van_Vliet'></a>
van Vliet, L., Törnberg, P., & Uitermark, J. (2020) "The Twitter parliamentarian database: Analyzing Twitter politics across 26 countries". PLoS ONE 15(9): e0237073. https://doi.org/10.1371/journal.pone.0237073.

<a id='Freelon'></a>
Freelon, D. (2018) "Computational Research in the Post-API Age". Political Communication, 35 (4): 665–668. https://doi.org/10.1080/10584609.2018.1477506

<a id='Hogan'></a>
Hogan, B. (2018) "Social Media Giveth, Social Media Taketh Away: Facebook, friendships, and APIs". International Journal of Communication, 12: 592–611. https://ssrn.com/abstract=3084159



<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main authors: N. Gizem Bacaksizlar Turbic & Pouria Mirelmi 

Contributors: Haiko Lietz

Acknowledgements: Felix Beck-Soldner

Version date: 25. April 2023

License: ...
</div>