<img style="float: right;" src="https://github.com/gesiscss/orc/blob/f8fecffc6816085caa379653ea3cb0c902be4081/load_balancer/static/images/logo/logo_text.png?raw=true">

<h1> Workflow-Integrated Data Documentation </h1>
<h2> A demonstration on the "Call me sexist, but..." data collection </h2>

<div class="alert alert-block alert-warning"> This notebook serves as a prototype to demonstrate the Workflow-Integrated Data Documentation idea. Using the collection of the "Call me sexist, but..." dataset as an example for a typical Computational Social Science research process, we are going to show how a researcher might use the existing functionality of Jupyter Notebooks to collect, explore and pre-process Twitter data. On top of this core functionality, we implemented a very basic sketch of what the Workflow-Integrated Data Documentation project's outcome might look like.<br><br>
The "Call me sexist, but..." dataset was originally collected by colleagues at GESIS to study the nuances of sexism on online platforms like Twitter. The rationale behind the data collection was quite intuitive: by collecting Tweets that contained the keyphrase "Call me sexist, but..." from the Twitter API, the resulting dataset would most likely contain a number of sexist Tweets, as the keyphrase may be seen as a disclaimer for the sexist content that follows.<br><br>
The original dataset was collected for the publication <a href="https://ojs.aaai.org/index.php/ICWSM/article/view/18085">"Call me sexist, but...": Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples</a> and is publicly available from the GESIS Datorium (https://doi.org/10.7802/2251). This notebook attempts a basic, less complex replication of the original data collection. </div>

Loading packages

In [17]:
%pip install ipywidgets




In [19]:
import requests, os, json, re, pandas as pd

Loading the Twitter API credentials (important: do not share them with anyone!) and handling authentication

In [20]:
# handling twitter authentication
with open('twitter_tokens.json') as rd:
    twitter_tokens = json.load(rd)
bearer_token = twitter_tokens['bearer_token']

def bearer_oauth(r):
    r.headers['Authorization'] = f'Bearer {bearer_token}'
    r.headers['User-Agent'] = 'v2FilteredStreamPython'
    return r
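
As a hedged alternative to keeping the credentials in a file next to the notebook, the token could also be read from an environment variable, which lowers the risk of sharing it by accident. This is only a sketch: the function name `load_bearer_token` and the variable name `TWITTER_BEARER_TOKEN` are assumptions made for illustration and are not part of the original notebook.

```python
import json
import os

def load_bearer_token(path='twitter_tokens.json', env_var='TWITTER_BEARER_TOKEN'):
    """Return the bearer token from the environment, else from the JSON file.

    Both the environment variable name and this helper are hypothetical.
    """
    token = os.environ.get(env_var)
    if token:
        return token
    # fall back to the JSON file used in the cell above
    with open(path) as rd:
        return json.load(rd)['bearer_token']
```

With this helper, `bearer_token = load_bearer_token()` could replace the file-only loading step in the cell above.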

Setting parameters for connecting to the Twitter API v2 /tweets/search/all endpoint

In [21]:
# setting connection parameters
endpoint_url = 'https://api.twitter.com/2/tweets/search/all'

params = {'query': '"call me sexist, but" -is:retweet',
          'start_time': '2021-07-14T18:30:00Z',
          'max_results': 500,
          'tweet.fields': 'created_at,author_id,text'}
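
The parameters above only fix the start of the collection window; the end is implicitly the moment the request is sent. A minimal sketch of how the window could be closed explicitly with the endpoint's `end_time` parameter is shown below. The helper `build_search_params` is hypothetical, and the end timestamp of 18:30 UTC on 14.07.2022 is an assumption chosen to mirror the start time, not a value taken from the original collection.

```python
from datetime import datetime, timezone

def build_search_params(query, start, end, max_results=500):
    """Build query parameters with an explicit, closed collection window."""
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    return {'query': query,
            'start_time': start.astimezone(timezone.utc).strftime(fmt),
            'end_time': end.astimezone(timezone.utc).strftime(fmt),
            'max_results': max_results,
            'tweet.fields': 'created_at,author_id,text'}

# assumed window: exactly one year after the start time used above
example_params = build_search_params(
    '"call me sexist, but" -is:retweet',
    start=datetime(2021, 7, 14, 18, 30, tzinfo=timezone.utc),
    end=datetime(2022, 7, 14, 18, 30, tzinfo=timezone.utc))
```

Stating both ends of the window makes a later re-run return the same period regardless of when the cell is executed.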

In [5]:
# ==============================
# @hidden
import ipywidgets as widgets
questions = {
    'General Characteristics':{
        'Who collected the dataset and who funded the process?': 'Who collected the dataset and who funded the process?',
        'Where is the dataset hosted?': 'Where is the dataset hosted? Is the dataset distributed under a copyright or license?',
        'What do the instances that comprise the dataset represent?': 'What do the instances that comprise the dataset represent? What data does each instance consist of?',
        'How many instances are there in total in each category?': 'How many instances are there in total in each category (as defined by the instances’ label), and - if applicable - in each recommended data split?',
        'In which contexts and publications has the dataset been used already?': 'In which contexts and publications has the dataset been used already?',
        'Are there alternative datasets that could be used for the measurement of the same or similar constructs?': 'Are there alternative datasets that could be used for the measurement of the same or similar constructs? Could they be a better fit? How do they differ?',
        'Can the dataset collection be readily reproduced?': 'Can the dataset collection be readily reproduced given the current data access, the general context and other potentially interfering developments?',
        'Were any ethical review processes conducted?': 'Were any ethical review processes conducted?',
        'Did any ethical considerations limit the dataset creation?': 'Did any ethical considerations limit the dataset creation?',
        'Are there any potential risks for individuals using the data?': 'Are there any potential risks for individuals using the data?'
    },
    'Platform Affordances':{
        'How were the relevant traces collected from the platform?': 'How were the relevant traces collected from the platform? Are there any technical constraints of the data collection method? If yes, how did those limit the dataset design?'
    },
    'Platform Coverage':{
        'What is known about the platform/s population?': 'What is known about the platform/s population?'
    },
    'Trace Selection':{
        'How was the data associated with each instance acquired?': 'How was the data associated with each instance acquired? On what basis were the trace selection criteria chosen?',
        'Over what timeframe was the data collected?': 'Over what timeframe was the data collected, and how might that timeframe have affected the collected data?'
    },
    'User Selection':{
        'What is known about the dataset population?': 'What is known about the dataset population? Are there user groups systematically in- or excluded in/from the dataset in direct consequence of the trace selection criteria?'
    },
    'Trace Augmentation and Trace Measurement':{
        'Is there a label or target associated with each instance?': 'Is there a label or target associated with each instance? If so, how were the labels or targets generated?'
    },
    'User Augmentation':{
        'Have attributes and characteristics of individuals been inferred?': 'Have attributes and characteristics of individuals been inferred?'
    },
    'Trace Reduction':{
        'Have traces been excluded?': 'Have traces been excluded? Why and by what criteria?'
    },
    'User Reduction':{
        'Have users been excluded?': 'Have users been excluded? Why and by what criteria?'
    },
    'Adjustment':{
        'Does the dataset provide information to adjust the results to a target population?': 'Does the dataset provide information to adjust the results to a target population? If so, is this information inferred or self-reported?'
    }
}

# flat lookup from short question to full question text, derived from `questions`
# so that every category (including Trace Reduction, User Reduction and Adjustment) is covered
all_questions = {short: full
                 for category in questions.values()
                 for short, full in category.items()}

from IPython.display import HTML

def track_question(change):
    # the '---' placeholder has no question text, so ignore it
    if change['new'] not in all_questions:
        return
    print(all_questions[change['new']])
    text_answer = widgets.Textarea(value='',
                                   placeholder='Type something',
                                   description='Answer:',
                                   disabled=False,
                                   layout=widgets.Layout(width='100%', height='80px'))
    display(text_answer)

def track_category(change):
    # the '---' placeholder is not a category, so ignore it
    if change['new'] not in questions:
        return
    category_questions = list(questions[change['new']].keys())
    select_question = widgets.Select(options=['---']+category_questions,
                                     value='---',
                                     rows=5,
                                     layout=widgets.Layout(width='100%', height=f'{20+20*len(category_questions)}px'))
    select_question.observe(track_question, names='value')
    display(select_question)

def select_category(b):
    select_category = widgets.Select(options=['---']+list(questions.keys()), 
                                     value='---',
                                     rows=len(list(questions.keys())))

    select_category.observe(track_category, names='value')
    display(select_category)
    
btn = widgets.Button(description='Add Data Documentation Cell',
                    layout=widgets.Layout(width='25%', height='80px'))
display(btn)
btn.on_click(select_category)
# ==============================

HTML('''<script>
  code_show=true; 
  function code_toggle() {
    if (code_show){
        $('.cm-comment:contains(@hidden)').closest('div.input').hide();
    } else {
        $('.cm-comment:contains(@hidden)').closest('div.input').show();
    }
    code_show = !code_show
  } 
  $( document ).ready(code_toggle);
</script>
<a href="javascript:code_toggle()">Show hidden code</a>''')

Button(description='Add Data Documentation Cell', layout=Layout(height='80px', width='25%'), style=ButtonStyle…

Select(options=('---', 'General Characteristics', 'Platform Affordances', 'Platform Coverage', 'Trace Selectio…

Select(layout=Layout(height='220px', width='100%'), options=('---', 'Who collected the dataset and who funded …

Select(layout=Layout(height='40px', width='100%'), options=('---', 'How were the relevant traces collected fro…

How were the relevant traces collected from the platform? Are there any technical constraints of the data collection method? If yes, how did those limit the dataset design?


Textarea(value='', description='Answer:', layout=Layout(height='80px', width='100%'), placeholder='Type someth…

<div class="alert alert-info">

*Platform Affordances Error*
    
**How were the relevant traces collected from the platform? Are there any technical constraints of the data collection method? If yes, how did those limit the dataset design?** 

The Tweets were collected from the Twitter API v2 /2/tweets/search/all endpoint. The query used to collect the Tweets was '"call me sexist, but" -is:retweet'. The data collection covers the period from 14.07.2021 to 14.07.2022, thus one full year.

Collecting data from the /tweets/search/all endpoint means that - in theory - every Tweet that was posted during the specified time period and that matches the specified keyword or keyphrase can be collected. However, this form of retrospective collection does not cover any Tweets that were removed by Twitter, deleted by their author, or made private by their author.

For this data collection, this could mean that not all Tweets that were originally posted with the phrase 'Call me sexist, but' between 14.07.2021 and 14.07.2022 are included in the dataset. Since the phrase was chosen as an indicator of potentially sexist content following it, it is rather likely that some Tweets containing the phrase were removed afterwards: either Twitter considered the Tweet to violate its platform rules and removed it, or the author deleted it, because of backlash they experienced or feared or because of a change in attitude.
    
</div>

<div class="alert alert-info">

*Trace Selection Error*
    
**How was the data associated with each instance acquired? On what basis were the trace selection criteria chosen?**
    
The Tweets for this dataset were collected through the Twitter search API, matching Tweets that contained the keyphrase “call me sexist, but”. With the idea of collecting sexist posts on Twitter, the rationale behind this choice of query was that any content following this phrase would most likely be sexist. The keyphrase was thus treated as a disclaimer that signals sexist posts.
    
</div>

In [6]:
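# collect tweets (a single request, which returns at most 500 results)
response = requests.request('GET', url=endpoint_url, auth=bearer_oauth, params=params)
response.raise_for_status()  # fail early on authentication or rate-limit errors
# keep id, author, timestamp and text of every returned tweet
tweets = [[tweet['id'],tweet['author_id'],tweet['created_at'],tweet['text']] for tweet in response.json()['data']]
pd_tweets = pd.DataFrame(tweets, columns=['tweet_id','author_id','tweet_created_at','tweet_text'])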
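
The request above fetches a single page, which is sufficient here because fewer than 500 Tweets match the query. For a more frequent keyphrase, the API returns a `next_token` in the `meta` object of each response, which can be passed back as the `next_token` parameter to fetch the following page. The sketch below illustrates that loop under the assumption that the caller supplies a `fetch_page` function; `collect_all_pages` and `fetch_page` are hypothetical names, not part of the original notebook.

```python
def collect_all_pages(fetch_page, params):
    """Follow next_token until the API reports no further pages.

    fetch_page is any callable that takes a params dict and returns the
    decoded JSON body of one response (hypothetical helper).
    """
    collected = []
    page_params = dict(params)  # copy, so the caller's params stay untouched
    while True:
        body = fetch_page(page_params)
        collected.extend(body.get('data', []))
        next_token = body.get('meta', {}).get('next_token')
        if next_token is None:
            break
        page_params['next_token'] = next_token
    return collected

# In this notebook, fetch_page could be defined as:
# fetch_page = lambda p: requests.get(endpoint_url, auth=bearer_oauth, params=p).json()
```

Passing the fetch function in as an argument keeps the loop itself free of network code, so it can be tried out with canned responses before spending API quota.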

In [7]:
# 306 tweets collected in total
pd_tweets

Unnamed: 0,tweet_id,author_id,tweet_created_at,tweet_text
0,1546971137162285058,516572282,2022-07-12T21:33:55.000Z,@LoveHerMo @taritaribobari Call me sexist but ...
1,1546939984527204353,1505593407229595650,2022-07-12T19:30:07.000Z,@Jeanna350 She made that movie!! Call me sexis...
2,1546852110712594434,1504479095433834504,2022-07-12T13:40:56.000Z,@Kidmonger Y'all will call me sexist but there...
3,1546578485954727936,730203963375689728,2022-07-11T19:33:39.000Z,Because that doesn’t even sound right. Call me...
4,1546220247551713287,1544376727996170243,2022-07-10T19:50:09.000Z,@gailkimITSME Call me sexist but men and women...
...,...,...,...,...
301,1417877712551698435,2416797894,2021-07-21T16:02:44.000Z,@ANTHONYBLOGAN Call me sexist....but I cant th...
302,1417739720927371268,1394699798641528836,2021-07-21T06:54:24.000Z,Call me sexist but Barri Eid to larkon ki hai ...
303,1417150090045583360,1024065971986870272,2021-07-19T15:51:25.000Z,call me sexist but i dont think i'm taking the...
304,1416741034269237248,1416738573248782341,2021-07-18T12:45:59.000Z,@GoddessLuciaaa you can call me misogynistic y...


<div class="alert alert-info">

*General Characteristics*
    
**What do the instances that comprise the dataset represent? What data does each instance consist of?**

The instances in the dataset represent Tweets and have the data fields: 'tweet_id', 'author_id', 'tweet_created_at', 'tweet_text'.
    
</div>

In [9]:
# 293 different authors, with one author having written 13 tweets, one author having written 2 tweets, and the rest of the authors 1 tweet each 
pd_tweets.groupby('author_id').count().sort_values('tweet_id', ascending=False)

Unnamed: 0_level_0,tweet_id,tweet_created_at,tweet_text
author_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
3303192821,13,13,13
145554242,2,2,2
1009994417397043200,1,1,1
215303514,1,1,1
24622214,1,1,1
...,...,...,...
1359189395946041352,1,1,1
1355850989119676420,1,1,1
1354493816913461251,1,1,1
1351635026383937543,1,1,1


<div class="alert alert-info">

*User Selection Error*
    
**What is known about the dataset population? Are there user groups systematically in- or excluded in/from the dataset in direct consequence of the trace selection criteria?**
    
The only user-related information collected in the dataset is the Twitter ID of a Tweet's author. This ID is used by Twitter to uniquely identify every one of its users but does not encode any further user characteristics. For this dataset, no additional demographics have been collected.
    
Based on the number of unique author IDs present in the dataset, it may be observed that the 306 Tweets collected in total have been posted by 293 different authors, with the most productive author being responsible for 13 Tweets, the second most productive author for 2 Tweets, and the remaining 291 authors having contributed only 1 Tweet each to the dataset.
    
As the only inclusion criterion for the dataset was the use of the keyphrase "Call me sexist, but...", any type of Twitter user - independent of characteristics or demographics - could have ended up in the dataset. Systematic exclusion from the dataset could then in theory still be an issue, if certain groups of users are less prone to use the keyphrase in their public postings on Twitter.
</div>
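
The author counts reported in the cell above can be recomputed directly from the `author_id` column, which guards against arithmetic slips when the numbers are copied into the documentation by hand. The following is a minimal sketch on toy data; the helper name `tweets_per_author_summary` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def tweets_per_author_summary(df):
    """Summarise how the tweets in df are distributed over authors."""
    per_author = df['author_id'].value_counts()
    return {'n_tweets': int(per_author.sum()),
            'n_authors': int(per_author.size),
            'max_tweets_by_one_author': int(per_author.max()),
            'authors_with_one_tweet': int((per_author == 1).sum())}

# toy data: author 'a' wrote 3 tweets, 'b' wrote 2, 'c' and 'd' one each
toy = pd.DataFrame({'author_id': ['a', 'a', 'a', 'b', 'b', 'c', 'd']})
summary = tweets_per_author_summary(toy)
```

Calling `tweets_per_author_summary(pd_tweets)` would produce the corresponding figures for the collected dataset.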

In [10]:
# first and last tweet collected (the search results are ordered from newest to oldest)
print('First Tweet: ',list(pd_tweets['tweet_created_at'])[-1])
print('Last Tweet:  ',list(pd_tweets['tweet_created_at'])[0])

First Tweet:  2021-07-15T18:40:16.000Z
Last Tweet:   2022-07-12T21:33:55.000Z


<div class="alert alert-info">

*Trace Selection Error*
    
**Over what timeframe was the data collected, and how might that timeframe have affected the collected data?**
    
The first Tweet collected in the dataset was posted on 15.07.2021, the last Tweet on 12.07.2022. The collection period as specified in the search query was from 14.07.2021 to 14.07.2022, thus covering one whole year.
</div>
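
The first and last timestamps can also be derived without relying on the order in which the API returned the Tweets, by parsing the `created_at` strings and taking their minimum and maximum. The sketch below uses three timestamps that appear in the outputs of this notebook; the helper name `collection_span` is an assumption.

```python
import pandas as pd

def collection_span(created_at):
    """Return earliest timestamp, latest timestamp and the span between them."""
    ts = pd.to_datetime(created_at, utc=True)
    return ts.min(), ts.max(), ts.max() - ts.min()

first, last, span = collection_span(pd.Series([
    '2022-07-12T21:33:55.000Z',
    '2021-11-30T10:42:46.000Z',
    '2021-07-15T18:40:16.000Z']))
```

On the full dataset, `collection_span(pd_tweets['tweet_created_at'])` would give the same first and last Tweet as the cell above, plus the covered span as a `Timedelta`.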

In [11]:
# preprocessing tweets for crowdworker-annotation - removing the call-me-sexist prompt
tweets_processed = [re.sub('[Cc]all me sexist[,.]* [Bb]ut','',tweet[3]) for tweet in tweets]
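
The pattern above requires a single space directly before "but", so variants such as "Call me sexist....but" (which occurs in the collected Tweets) are left untouched. A more tolerant pattern is sketched below as one possible alternative; it is an assumption about what the preprocessing could look like, not the pattern used for the original dataset.

```python
import re

# allow any mix of whitespace, commas and periods between "sexist" and "but",
# ignore case, and also drop punctuation that directly follows the prompt
PROMPT = re.compile(r'call me sexist[\s,.]*but\b[\s,.:]*', flags=re.IGNORECASE)

def strip_prompt(text):
    """Remove the first occurrence of the call-me-sexist prompt from text."""
    return PROMPT.sub('', text, count=1).strip()
```

Applied to the collected Tweets, this would read `tweets_processed = [strip_prompt(tweet[3]) for tweet in tweets]`.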

In [12]:
# dummy-annotation of tweets
import random

def dummy_annotation(row):
    return random.choice(['sexist','non-sexist'])

pd_tweets['label'] = pd_tweets.apply(dummy_annotation, axis=1)
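
Because the dummy labels are drawn at random, the class counts change every time the cell is re-run. If the documented counts should stay stable across runs, a seeded random generator could be used instead. This is a sketch under that assumption; the function name `dummy_annotation_seeded` and the seed value are illustrative only.

```python
import random

def dummy_annotation_seeded(n, seed=0):
    """Return n reproducible dummy labels drawn from a seeded generator."""
    rng = random.Random(seed)  # local generator, leaves the global state alone
    return [rng.choice(['sexist', 'non-sexist']) for _ in range(n)]
```

In this notebook, `pd_tweets['label'] = dummy_annotation_seeded(len(pd_tweets))` would assign the same labels on every run.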

In [13]:
# class balance
pd_tweets.groupby('label').count()

Unnamed: 0_level_0,tweet_id,author_id,tweet_created_at,tweet_text
label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
non-sexist,149,149,149,149
sexist,157,157,157,157


<div class="alert alert-info">

*General Characteristics*
    
**How many instances are there in total in each category (as defined by the instances’ label), and - if applicable - in each recommended data split?**
    
A total of 306 Tweets have been collected in the dataset, of which 149 are labelled as 'non-sexist' and 157 as 'sexist'. In this prototype, the labels stem from the random dummy annotation above. There are no splits recommended for the dataset.
</div>

In [14]:
# artifact: tweets discussing the publication of the original dataset are now also included
pd_tweets[pd_tweets['author_id']=='145554242']

Unnamed: 0,tweet_id,author_id,tweet_created_at,tweet_text,label
197,1465632388445310976,145554242,2021-11-30T10:42:46.000Z,"📖Read also the blog post about the ""'Call me s...",non-sexist
225,1450442868359897094,145554242,2021-10-19T12:45:02.000Z,#Blog post by Dr. Mattia Samory @hide_yourself...,non-sexist


In [15]:
for i,tweet in enumerate(pd_tweets[pd_tweets['author_id']=='145554242']['tweet_text']):
    print(f'Tweet {i+1} from user with author_id 145554242:\n',tweet,'\n_____________________________________________________')

Tweet 1 from user with author_id 145554242:
 📖Read also the blog post about the "'Call me sexist but' Dataset",https://t.co/d0hrPfUT4Q @hide_yourself https://t.co/Ien0zUvmMG 
_____________________________________________________
Tweet 2 from user with author_id 145554242:
 #Blog post by Dr. Mattia Samory @hide_yourself: The “Call Me #Sexist But” #Dataset - An interesting piece about a dataset of over 10.000 #tweets - curated by a group of researchers - to study the multiple ways sexism appears in day-to-day communication. https://t.co/AgPzSrsiCS 
_____________________________________________________


<div class="alert alert-info">

*General Characteristics*
    
**Can the dataset collection be readily reproduced given the current data access, the general context and other potentially interfering developments?**
    
The original version of this dataset was published in 2021, together with a study that introduces a novel approach to the challenge of defining sexism online in its many nuances. As the study is titled "Call me sexist, but...": Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples, any Tweet that discusses the study or the dataset by its title matches the search query. Such Tweets, like the two shown above, would therefore end up in any attempt to recreate or update the dataset.
</div>

In [16]:
# ==============================
# @hidden
import ipywidgets as widgets
from IPython.display import HTML
    
def export(b):
    print('Successfully exported the Data Documentation Cells to TES-D.docx')

btn = widgets.Button(description='Export Data Documentation Cells',
                    layout=widgets.Layout(width='25%', height='80px'))
display(btn)
btn.on_click(export)
# ==============================

HTML('''<script>
  code_show=true; 
  function code_toggle() {
    if (code_show){
        $('.cm-comment:contains(@hidden)').closest('div.input').hide();
    } else {
        $('.cm-comment:contains(@hidden)').closest('div.input').show();
    }
    code_show = !code_show
  } 
  $( document ).ready(code_toggle);
</script>
<a href="javascript:code_toggle()">Show hidden code</a>''')

Button(description='Export Data Documentation Cells', layout=Layout(height='80px', width='25%'), style=ButtonS…