# Bots in Science 🧪

In this notebook, we process mention data to:
+ (1) classify Altmetric papers by ESI field 📚
+ (2) identify bots 🤖 by combining the tweeters' botscores with their activity 🕵️‍♂️

## Libraries

In [3]:
import pandas as pd
import numpy as np
from functions import outliers

## 1. Altmetric papers by ESI field

The Web of Science papers (2017-2021) indexed in Altmetric are each assigned one or more Web of Science categories (WC). Each category is then mapped to its ESI field using the schema by Arroyo-Machado & Torres-Salinas (2021).

In [4]:
データ_alt_sub = pd.read_csv('data/altmetric_subjects_cat.tsv', sep='\t')
データ_alt_sub.DOI = データ_alt_sub.DOI.str.lower()
データ_alt_sub.shape

(6763070, 2)

In [5]:
データ_esi = pd.read_csv('data/mapping.csv', sep=';')
データ_esi

Unnamed: 0,WC,SC,ESI,Category
0,Agricultural Economics & Policy,Agriculture,Agricultural Sciences,Life Sciences & Biomedicine
1,Agricultural Engineering,Agriculture,Agricultural Sciences,Life Sciences & Biomedicine
2,"Agriculture, Dairy & Animal Science",Agriculture,Agricultural Sciences,Life Sciences & Biomedicine
3,"Agriculture, Multidisciplinary",Agriculture,Agricultural Sciences,Life Sciences & Biomedicine
4,Agronomy,Agriculture,Agricultural Sciences,Life Sciences & Biomedicine
...,...,...,...,...
249,Urban Studies,Urban Studies,"Social Sciences, General",Social Sciences
250,Women's Studies,Women's Studies,"Social Sciences, General",Social Sciences
251,Astronomy & Astrophysics,Astronomy & Astrophysics,Space Sciences,Physical Sciences
252,Tropical Medicine,Tropical Medicine,Clinical Medicine,Life Sciences & Biomedicine


In [6]:
データ_alt_sub = データ_alt_sub.merge(データ_esi[['WC', 'ESI']], how='inner', left_on='subject_category', right_on='WC')
データ_alt_sub.drop('WC', axis=1, inplace=True)
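
To illustrate what the inner merge above does, here is a minimal sketch on toy data (the DOIs and the `Unmapped Category` label are made up for illustration). A paper with several Web of Science categories yields one row per category, and rows whose category is absent from the mapping are silently dropped by `how='inner'`. A left merge with `indicator=True` is one way to see which categories would be lost.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are illustrative).
toy_sub = pd.DataFrame({
    'DOI': ['10.1/a', '10.1/a', '10.1/b'],
    'subject_category': ['Agronomy', 'Urban Studies', 'Unmapped Category'],
})
toy_esi = pd.DataFrame({
    'WC': ['Agronomy', 'Urban Studies'],
    'ESI': ['Agricultural Sciences', 'Social Sciences, General'],
})

# Inner merge: keeps only the categories present in the mapping.
toy_inner = toy_sub.merge(toy_esi, how='inner',
                          left_on='subject_category', right_on='WC').drop(columns='WC')

# Left merge with an indicator column exposes the rows the inner merge drops.
toy_check = toy_sub.merge(toy_esi, how='left',
                          left_on='subject_category', right_on='WC', indicator=True)
toy_unmapped = toy_check.loc[toy_check['_merge'] == 'left_only', 'subject_category'].unique()

print(toy_inner)
print(list(toy_unmapped))
```

In this toy case, paper `10.1/a` keeps two rows (one per ESI field) and `10.1/b` disappears from the merged table.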

<div class="alert-warning">
    <strong>Warning:</strong> The line of code below is commented out to avoid generating new versions of the file when reviewing the code.
</div>

In [7]:
#データ_alt_sub.to_csv('data/altmetric_subjects_cat_esi.tsv', index=False, sep='\t', encoding='UTF-8')
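
An alternative to commenting the export in and out is to guard it with a flag. This is only a sketch: `WRITE_OUTPUTS`, `maybe_write` and the `toy_output.tsv` path are hypothetical names, not part of the original notebook.

```python
import pandas as pd

WRITE_OUTPUTS = False  # hypothetical switch; set to True to regenerate files

def maybe_write(df, path, enabled=WRITE_OUTPUTS):
    """Write df as a tab-separated file only when enabled; return whether it wrote."""
    if enabled:
        df.to_csv(path, index=False, sep='\t', encoding='UTF-8')
    return enabled

# With the switch off, nothing is written to disk.
wrote = maybe_write(pd.DataFrame({'DOI': ['10.1/a']}), 'toy_output.tsv')
print(wrote)
```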

## 2. Bot detection

Based on the botscore and the frequency of mentions of the tweeters, bots are identified.

In [8]:
データ_botscore = pd.read_csv('data/full_botometer_results.tsv', sep='\t', dtype={'user_id':str})
データ_botscore.shape

(4872369, 2)

In [9]:
データ_tw_men = pd.read_csv('data/final_mentions_full.tsv', sep='\t',
                         dtype={'Outlet or Author':str, 'External Mention ID':str},
                         encoding='UTF-8')
データ_tw_men.shape

(51999245, 5)

Botscores could be obtained for 4,872,369 of the 4,983,251 tweeters (98%).

In [10]:
データ_tw_men[['Outlet or Author']].drop_duplicates().shape

(4983251, 1)

In [11]:
データ_tw_men[データ_tw_men['Outlet or Author'].isin(データ_botscore.user_id.tolist())][['Outlet or Author']].drop_duplicates().shape

(4872369, 1)
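
The coverage figure follows directly from the two counts. The sketch below recomputes it from the reported totals and then shows, on toy data, how the same share can be obtained with `drop_duplicates` and `isin`. The toy IDs and botscores are made up.

```python
import pandas as pd

# Totals reported in the notebook output.
n_tweeters = 4983251
n_scored = 4872369
coverage = n_scored / n_tweeters
print(f'{coverage:.1%}')

# Same computation on toy data (IDs are illustrative strings).
toy_mentions = pd.DataFrame({'Outlet or Author': ['1', '1', '2', '3', '4']})
toy_scores = pd.DataFrame({'user_id': ['1', '2', '3'], 'botscore': [0.1, 0.7, 0.3]})

# Share of unique tweeters that appear in the botscore table.
toy_users = toy_mentions['Outlet or Author'].drop_duplicates()
toy_coverage = toy_users.isin(toy_scores['user_id']).mean()
print(toy_coverage)
```

Both ID columns need the same dtype for `isin` to match, which is why the notebook reads them as strings.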

62 publications have no Web of Science category, possibly because they were not correctly indexed when the data were retrieved.

In [12]:
len(データ_tw_men.loc[~データ_tw_men.DOI.isin(データ_alt_sub.DOI.tolist()), 'DOI'].drop_duplicates())

62
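
One way to list, rather than only count, the DOIs without a category is an anti-join on the unique DOIs. A minimal sketch on toy data (the DOIs are made up):

```python
import pandas as pd

toy_mentions = pd.DataFrame({'DOI': ['10.1/a', '10.1/b', '10.1/b', '10.1/c']})
toy_subjects = pd.DataFrame({'DOI': ['10.1/a', '10.1/c'],
                             'ESI': ['Clinical Medicine', 'Space Sciences']})

# Unique DOIs that are mentioned but have no row in the subjects table.
mentioned = toy_mentions['DOI'].drop_duplicates()
missing = mentioned[~mentioned.isin(toy_subjects['DOI'])]
print(missing.tolist())
```

Note that the DOIs in the subjects table were lowercased earlier in the notebook, so this comparison assumes the DOIs in the mentions table are lowercase as well.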

### 2.1. Outliers detection

To identify bots, each user's number of mentions is taken into account. Therefore, the outlier thresholds of this distribution are calculated, using 1.5 × IQR for outliers and 3 × IQR for extreme outliers.
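
The `outliers` helper is imported from a local `functions` module that is not shown in this notebook. The sketch below is an assumption about what such a helper could look like, based on the standard Tukey upper fences (`Q3 + 1.5 * IQR`, and `Q3 + 3 * IQR` for extreme outliers). `upper_fence` is a hypothetical name, not the notebook's implementation.

```python
import numpy as np

def upper_fence(values, extreme=False):
    """Return the upper Tukey fence: Q3 + k * IQR, with k = 3 if extreme else 1.5."""
    q1, q3 = np.percentile(values, [25, 75])
    k = 3.0 if extreme else 1.5
    return q3 + k * (q3 - q1)

# Toy frequencies whose quartiles are Q1 = 1 and Q3 = 4 (IQR = 3).
toy = np.array([1, 1, 1, 2, 4, 4, 4, 50])
print(upper_fence(toy))                # 4 + 1.5 * 3 = 8.5
print(upper_fence(toy, extreme=True))  # 4 + 3 * 3 = 13.0
```

These toy quartiles give 8.5 and 13.0, the same values the notebook reports for mentions. That is consistent with the helper using this rule, but it does not confirm it.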

#### 2.1.1. Mentions

In [13]:
データ_tw_men_freq = データ_tw_men[['Outlet or Author']].value_counts().reset_index(name='mentions')
データ_tw_men_freq

Unnamed: 0,Outlet or Author,mentions
0,46854930,160761
1,253847599,64403
2,1191732113676161028,52549
3,3178688418,51251
4,2797948887,43538
...,...,...
4983245,1241283714602151943,1
4983246,1241283639863726081,1
4983247,1241283284568399872,1
4983248,124128298,1


Users with more than 8.5 mentions are outliers, and those with more than 13 are extreme outliers.

In [14]:
outliers(データ_tw_men_freq.mentions)

8.5

In [15]:
outliers(データ_tw_men_freq.mentions, extreme=1)

13.0

#### 2.1.2. Tweets

In [18]:
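# Sum the Original flag per user (assumed 1 = original tweet, 0 = retweet).
# The column is selected explicitly: the first argument of sum() is numeric_only, not a column name.
データ_tw_men_freq = データ_tw_men.groupby('Outlet or Author')['Original'].sum().reset_index()
データ_tw_men_freq.rename({'Original':'tweets'}, axis=1, inplace=True)
データ_tw_men_freq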

Unnamed: 0,Outlet or Author,tweets
0,1000000214193893378,4
1,1000000362735120384,3
2,1000000548450525190,1
3,100000075,0
4,1000001158130388992,1
...,...,...
4983245,99999896,0
4983246,999999486,0
4983247,999999663150387200,0
4983248,999999699838005251,0


Users with more than 2.5 tweets are outliers, and those with more than 4 are extreme outliers.

In [20]:
outliers(データ_tw_men_freq.tweets)

2.5

In [22]:
outliers(データ_tw_men_freq.tweets, extreme=1)

4.0

#### 2.1.3. Retweets

In [23]:
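# Work on a copy so the Original column of データ_tw_men is not overwritten
データ_tw_men_freq = データ_tw_men[['Outlet or Author', 'Original']].copy()
# Invert the flag: 1 now marks a retweet, 0 an original tweet
データ_tw_men_freq['Original'] = 1 - データ_tw_men_freq['Original']
データ_tw_men_freq = データ_tw_men_freq.groupby('Outlet or Author')['Original'].sum().reset_index()
データ_tw_men_freq.rename({'Original':'retweets'}, axis=1, inplace=True)
データ_tw_men_freq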

Unnamed: 0,Outlet or Author,retweets
0,1000000214193893378,0
1,1000000362735120384,5
2,1000000548450525190,0
3,100000075,1
4,1000001158130388992,0
...,...,...
4983245,99999896,1
4983246,999999486,1
4983247,999999663150387200,1
4983248,999999699838005251,2


Users with more than 6 retweets are outliers, and those with more than 9 are extreme outliers.

In [24]:
outliers(データ_tw_men_freq.retweets)

6.0

In [25]:
outliers(データ_tw_men_freq.retweets, extreme=1)

9.0
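
The per-user tweet and retweet counts can also be derived in a single pass with named aggregation. The sketch below assumes, as the cells above suggest, that `Original` is 1 for an original tweet and 0 for a retweet. The toy data are made up.

```python
import pandas as pd

toy_men = pd.DataFrame({
    'Outlet or Author': ['1', '1', '1', '2', '2', '3'],
    'Original':         [1,   0,   0,   1,   1,   0],
})

# Named aggregation: tweets = sum of the flag, mentions = number of non-missing flags.
toy_freq = (toy_men.groupby('Outlet or Author')
            .agg(tweets=('Original', 'sum'), mentions=('Original', 'count'))
            .reset_index())
# Retweets are the mentions that are not original tweets.
toy_freq['retweets'] = toy_freq['mentions'] - toy_freq['tweets']
print(toy_freq)
```

Here user `1` has three mentions, of which one is a tweet and two are retweets.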

### 2.2. Bots detection

After some tests, which are available in the robustness check, the thresholds were set at a botscore above 0.6 and more than 8 mentions (that is, at least 9, in line with the outlier threshold of 8.5).

In [26]:
データ_tw_men_freq = データ_tw_men[['Outlet or Author']].value_counts().reset_index(name='mentions')
データ_tw_men_freq = データ_tw_men_freq[データ_tw_men_freq.mentions>8]
データ_tw_men_freq

Unnamed: 0,Outlet or Author,mentions
0,46854930,160761
1,253847599,64403
2,1191732113676161028,52549
3,3178688418,51251
4,2797948887,43538
...,...,...
729487,1375577394157879301,9
729488,1186803282154405888,9
729489,1061934299832897537,9
729490,84113375,9


In [27]:
データ_bots = データ_tw_men_freq.merge(データ_botscore, how='inner', left_on='Outlet or Author', right_on='user_id')
データ_bots = データ_bots[データ_bots.botscore>0.6].copy()
データ_bots = データ_bots[['Outlet or Author']].copy()
データ_bots

Unnamed: 0,Outlet or Author
4,2797948887
8,2527490466
9,1331133278
10,207581304
11,1477921047034929156
...,...
716233,851717988780621825
716286,1391120434683432960
716320,1401509102
716337,1384566755830415368
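
The two-step filter above can be condensed into a single helper, which makes the thresholds explicit. This is a sketch on toy data: `flag_bots` and its parameter names are hypothetical, and the comparisons mirror the strict `>` used in the cells above.

```python
import pandas as pd

def flag_bots(mentions, botscores, min_mentions=8, min_botscore=0.6):
    """Return user IDs with more than min_mentions mentions and a botscore above min_botscore."""
    freq = (mentions['Outlet or Author'].value_counts()
            .rename_axis('user_id').reset_index(name='mentions'))
    # The inner merge drops users for whom no botscore is available.
    merged = freq.merge(botscores, how='inner', on='user_id')
    keep = (merged['mentions'] > min_mentions) & (merged['botscore'] > min_botscore)
    return merged.loc[keep, 'user_id'].tolist()

# Toy data: user 1 (9 mentions), user 2 (9), user 3 (8), user 4 (20, no botscore).
toy_mentions = pd.DataFrame(
    {'Outlet or Author': ['1'] * 9 + ['2'] * 9 + ['3'] * 8 + ['4'] * 20})
toy_scores = pd.DataFrame({'user_id': ['1', '2', '3'], 'botscore': [0.9, 0.4, 0.95]})

toy_bots = flag_bots(toy_mentions, toy_scores)
print(toy_bots)
```

Only user `1` is flagged. User `2` has a low botscore, user `3` has exactly 8 mentions, and user `4` is excluded because it has no botscore, however active it is.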


<div class="alert-warning">
    <strong>Warning:</strong> The line of code below is commented out to avoid generating new versions of the file when reviewing the code.
</div>

In [111]:
#データ_bots.to_csv('results/bots_list.tsv', index=False, sep='\t')

# References

Arroyo-Machado, W., & Torres-Salinas, D. (2021). *Web of Science categories (WC, SC, main categories) and ESI disciplines mapping*. https://doi.org/10.6084/m9.figshare.14695176.v2
