This notebook does some rudimentary analysis of tweets in [Twitter's Iran disinformation datasets](https://about.twitter.com/en_us/values/elections-integrity.html#data). The [Makefile](/edit/Makefile) in this project's directory downloads the Twitter data, unpacks it, and then runs some [basic filtering](/edit/process.py) on it to extract interactions with the users listed in [seeds.csv](/edit/seeds.csv).

In [1]:
import csv
import pandas
import collections

## Read Data

We'll start by loading our seed list of users that are of interest. We take care to read the user ids as strings rather than numbers: pandas stores a numeric column that contains missing (NaN) values as floats, and floats cannot represent every large user id exactly.

In [2]:
users = pandas.read_csv("seeds.csv", dtype={'user_id': str})
users.head(5)

Unnamed: 0,screen_name,user_id,followers_count
0,mdubowitz,48252327,35224
1,SGhasseminejad,1113100093,28524
2,AlirezaNader,488772327,15891
3,HeshmatAlavi,2554131522,36278
4,IranDisinfo,1073301577828655105,4608


Next we will load the tweets that matched those users, which were produced by running [process.py](/edit/process.py) on the full dataset. Again we read the user id columns as strings so that they will match the ids in the seed list we just loaded. If we didn't use strings these columns would load as floats, which would break the matching.

In [3]:
tweets = pandas.read_csv("results/tweets.csv", dtype={
    'userid': str,
    'in_reply_to_userid': str,
    'retweet_userid': str
})
tweets.tweetid.count()

3232
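
The reason for forcing strings can be seen on a tiny made-up CSV. This is only a sketch: the two-row `sample_csv` below is invented for illustration and is not part of the Twitter data. Because one row has no reply target, pandas falls back to floats for that column by default, and a 19-digit id does not survive the round trip.

```python
import io
import pandas

# Hypothetical two-row sample: the second tweet is not a reply, so its id is missing.
sample_csv = "tweetid,in_reply_to_userid\n1,1073301577828655105\n2,\n"

as_default = pandas.read_csv(io.StringIO(sample_csv))
as_string = pandas.read_csv(io.StringIO(sample_csv), dtype={'in_reply_to_userid': str})

# With the default dtype the missing value forces the column to a float type.
print(as_default['in_reply_to_userid'].dtype)

# A 64-bit float cannot hold this 19-digit integer exactly, so the id changes.
round_tripped = str(int(as_default['in_reply_to_userid'].iloc[0]))
print(round_tripped == '1073301577828655105')

# Read as a string, the id is preserved character for character.
print(as_string['in_reply_to_userid'].iloc[0] == '1073301577828655105')
```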

## Replies

Now we should be able to join the two datasets and create a table of replies. Replies are special because they occur when a user directly responds to another user's tweet. A reply requires the responding user to have in some way read the tweet, crafted a response, and sent it. This is unlike retweets, which are just a single click, and mentions, which occur whenever a screen_name happens to be in the text of the tweet. Note, we need to use the user identifier because `in_reply_to_userid` is what we are given in the Twitter dataset.

This table will represent tweets in the Twitter dataset that were sent by a *suspended user* that were replies to an *active user* in our seed list. To avoid confusion we will make this relationship clearer by renaming the appropriate columns in the resulting dataframe.

In [4]:
replies = pandas.merge(users, tweets, left_on='user_id', right_on='in_reply_to_userid')
replies = replies.rename(columns={
    'screen_name': 'active_screen_name',
    'in_reply_to_userid': 'active_user_id',
    'user_screen_name': 'suspended_screen_name'
})
replies.tweetid.count()

1201
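
To see which rows survive the join, here is a minimal sketch on made-up data. The three toy tweets are invented for illustration; only the column names follow the notebook. `pandas.merge` performs an inner join by default, so a tweet whose reply target is missing or not in the seed list is dropped.

```python
import pandas

# Made-up rows for illustration; the column names follow the notebook.
toy_users = pandas.DataFrame({
    'screen_name': ['JZarif', 'AlinejadMasih'],
    'user_id': ['47813521', '947924373029171200'],
})
toy_tweets = pandas.DataFrame({
    'tweetid': [1, 2, 3],
    'user_screen_name': ['IranTalks', 'saramosavi8', 'IranTalks'],
    'in_reply_to_userid': ['47813521', '947924373029171200', None],
})

# The default inner join keeps only tweets that reply to a seed user.
toy_replies = pandas.merge(
    toy_users, toy_tweets, left_on='user_id', right_on='in_reply_to_userid'
).rename(columns={
    'screen_name': 'active_screen_name',
    'in_reply_to_userid': 'active_user_id',
    'user_screen_name': 'suspended_screen_name',
})
print(toy_replies[['active_screen_name', 'suspended_screen_name']])
```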

So there are 1,201 replies! Let's create a new table that maps which user is replying to whom, and how many times they replied.

In [5]:
reply_network = replies.groupby(by=['active_screen_name', 'active_user_id', 'suspended_screen_name']).size().reset_index()
reply_network = reply_network.rename(columns={0: 'replies'})
reply_network = reply_network.sort_values(by=['replies'], ascending=False)
reply_network.head(25)

Unnamed: 0,active_screen_name,active_user_id,suspended_screen_name,replies
250,AlinejadMasih,947924373029171200,saramosavi8,58
293,JZarif,47813521,3fXrw02Ese7Cy5Al59Z9Zf9KPTlvSOgTv86yEyd1bkY=,36
476,amiretemadi,62855930,QEXa7uNjy0XzvdxunFbo80Nr39Sd+NE+yPBp9iJLk=,31
305,JZarif,47813521,IranTalks,30
329,JZarif,47813521,ePJ2NXdd3hAPqhsfINNthxrES0xBY9YwrBOJ38hoPr0=,22
181,AlinejadMasih,947924373029171200,eykx70J3DZtCdSgoSQ6wNcTgust5E0cHTwWgo919ct4=,20
41,AlinejadMasih,947924373029171200,9awFHt+htU+xsDluiT0ZsDcXrFXfP5osvTIquKf06P8=,20
118,AlinejadMasih,947924373029171200,QEXa7uNjy0XzvdxunFbo80Nr39Sd+NE+yPBp9iJLk=,19
272,AlinejadMasih,947924373029171200,zpeGzossV+mtHsRZ3wAVivJPmEK2M8F2ZLI2TiMIZSA=,17
558,iranfarashgard,1029347598249943040,Q40fHILuLsb1wkquKa8q7QsfuXN8Vx5FjUKdmNsu8+Q=,12
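
The counting step can also be written as one chain. The sketch below uses a made-up table (`userA` is a hypothetical account) and relies on `Series.reset_index(name=...)`, which names the count column directly instead of renaming column `0` afterwards.

```python
import pandas

# Made-up replies for illustration; 'userA' is a hypothetical account.
toy_replies = pandas.DataFrame({
    'active_screen_name': ['JZarif', 'JZarif', 'JZarif', 'AlinejadMasih'],
    'suspended_screen_name': ['IranTalks', 'IranTalks', 'userA', 'IranTalks'],
})

toy_network = (
    toy_replies
    .groupby(['active_screen_name', 'suspended_screen_name'])
    .size()                          # number of replies per pair
    .reset_index(name='replies')     # name the count column in one step
    .sort_values(by='replies', ascending=False)
)
print(toy_network)
```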


AlinejadMasih (947924373029171200) is quite prominent here. How many suspended users interacted with that account?

In [6]:
len(reply_network.query('active_screen_name == "AlinejadMasih"'))

276

So 276 suspended users interacted with AlinejadMasih! Shall we look at some of the replies? Remember, we need to use AlinejadMasih's user identifier because the original data only includes the user id in the reply column.

In [7]:
AlinejadMasih = replies.query('active_user_id == "947924373029171200"')
for i, row in AlinejadMasih.iterrows():
    print(row['tweet_text'])

@AlinejadMasih چرا ماکارانی ها رو ریختی موهات؟
@AlinejadMasih https://t.co/3EjurAqc6Q
@AlinejadMasih https://t.co/QC0WpF6S2C
@AlinejadMasih همه طرفداران تو این گونه اند 
چه مرد چه زن
همه بی ادب
چه قدر بهم میاید
@AlinejadMasih https://t.co/WbauL96nl8
@AlinejadMasih خوب اینو از اول میگفتی 
حجاب زنان بهانست 
اصل نظام نشانست 

خاک تو‌ سرت تو‌‌ به این نظام که نه به مردم کشورت بد کردی 
البته اگر حالیت باشه
@AlinejadMasih https://t.co/WbauL96nl8
@AlinejadMasih https://t.co/3EjurAqc6Q
@AlinejadMasih تو خودت خدای تناقضی 
دنیا زن ایرانی را به نجابت و حیا می شناسد و اما تو برعکسی . آیا به تناقض تو نمی خندن
@AlinejadMasih بنده به عنوان یه مانتویی به آن خانم چادری و بقیه می گم حق ندارید قانون کشور عزیزم ایران را زیر پا بذارید
@AlinejadMasih خب تبلیغ بی قانونی می کنی که چی بشه 
دوست داری ایران بشه جنگل 
هرگز مردم ما به امثال تو محل نخواهند گذاشت و کارهای حقیر و کوچکی که در این فیلم و امثالهم هست تاثیری بر مردم ما نداره که اتفاقا بیدار تر نمی کنه مردم ما رو
@AlinejadMasih @john_lucckk @MarietjeSchaak

Let's drop those tweets into a separate CSV for inspection. 

In [8]:
AlinejadMasih.to_csv('results/AlinejadMasih.csv')

What does the network of replies look like?

In [9]:
import networkx
replies_g = networkx.DiGraph()

for index, row in reply_network.iterrows():
    replies_g.add_node(row.active_screen_name, data={"active": True})
    replies_g.add_node(row.suspended_screen_name, data={"active": False})
    replies_g.add_edge(row.suspended_screen_name, row.active_screen_name, weight=row.replies)
    
print(len(replies_g), 'nodes')
print(replies_g.size(), 'edges')

414 nodes
597 edges
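
As an aside, networkx can also build a graph straight from an edge table. The following is a sketch on made-up rows, not the notebook's method: it renames the count column to `weight` so that `networkx.from_pandas_edgelist` stores it as the edge weight. Unlike the loop above, it does not attach any attributes to the nodes.

```python
import networkx
import pandas

# Made-up edge table for illustration; 'userA' is a hypothetical account.
toy_network = pandas.DataFrame({
    'active_screen_name': ['JZarif', 'JZarif', 'AlinejadMasih'],
    'suspended_screen_name': ['IranTalks', 'userA', 'IranTalks'],
    'replies': [30, 1, 2],
})

# Edges point from the suspended (replying) account to the active account.
toy_g = networkx.from_pandas_edgelist(
    toy_network.rename(columns={'replies': 'weight'}),
    source='suspended_screen_name',
    target='active_screen_name',
    edge_attr='weight',
    create_using=networkx.DiGraph,
)
print(len(toy_g), 'nodes')
print(toy_g.size(), 'edges')
```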


In [10]:
import pyvis

replies_v = pyvis.network.Network(notebook=True, width='100%', directed=True)
replies_v.from_nx(replies_g)
replies_v.show('results/replies.html')

Eek, that's a bit of a hairball. What if we remove the weakest interactions, where a pair of users interacted only once or twice? To do that let's create a function that takes a networkx graph and an edge weight, and removes every edge whose weight is at or below it.

In [11]:
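def trim(g1, n):
    """Return a copy of g1 without the edges whose weight is <= n."""
    g2 = g1.copy()
    # In-degrees are taken from the original graph and are not updated below.
    in_degree = dict(g1.in_degree())

    for src, dst, data in g1.edges(data=True):
        if data['weight'] <= n:
            if in_degree[dst] == 1:
                # dst's only incoming edge is being cut, so drop the node too.
                g2.remove_node(dst)
            else:
                g2.remove_edge(src, dst)
    return g2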
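
A different way to get a similar effect is to build a fresh graph from only the edges worth keeping. This is a sketch of an alternative, not the function above: `keep_heavy_edges` is a hypothetical helper, the toy graph is made up, and the result carries no node attributes. Nodes left without any edge simply never appear in the new graph.

```python
import networkx

def keep_heavy_edges(g, n):
    """Hypothetical helper: a new DiGraph with only the edges heavier than n."""
    h = networkx.DiGraph()
    h.add_edges_from(
        (src, dst, data) for src, dst, data in g.edges(data=True)
        if data['weight'] > n
    )
    return h

# Made-up graph for illustration; 'userA' and 'userB' are hypothetical accounts.
toy = networkx.DiGraph()
toy.add_edge('userA', 'JZarif', weight=1)
toy.add_edge('IranTalks', 'JZarif', weight=30)
toy.add_edge('userB', 'AlinejadMasih', weight=1)

heavy = keep_heavy_edges(toy, 1)
print(sorted(heavy.nodes()))
print(list(heavy.edges(data=True)))
```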

To make it easier to display the graph in a customized way we'll create a function to do that with pyvis. We'll also create a function to shorten the long anonymized ids that Twitter assigned to users with fewer than 5,000 followers at the time of suspension. We'll color a node green if it is an active user, and pink if it has been suspended.

In [16]:
def label(s):
    if len(s) > 40:
        return s[0:10]
    return s

def vis(g, html_path):
    degree = g.degree(g)
        
    v = pyvis.network.Network(notebook=True, width='100%', height=600, directed=True)
    for src, dst, data in g.edges(data=True):
        v.add_node(src, label(src), title=src, mass=degree[src], color='pink')
        v.add_node(dst, label(dst), title=dst, color='lightgreen')
        v.add_edge(src, dst, weight=data['weight'])
        
    
    v.set_options('{"edges":{"color":{"inherit":true},"smooth":false},"physics":{"forceAtlas2Based":{"springLength":50},"minVelocity":0.75,"solver":"forceAtlas2Based"}}')
    return v.show(html_path)
    

vis(trim(replies_g, 2), "results/replies-trimmed.html")
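
The `vis` function above colors nodes by their position on an edge (source pink, target green). Another option, sketched here under the assumption that each node carries an `active` attribute, is to derive the color from the node itself. `node_color` is a hypothetical helper and the two-node graph is made up.

```python
import networkx

toy = networkx.DiGraph()
# Keyword arguments to add_node become node attributes directly.
toy.add_node('JZarif', active=True)
toy.add_node('IranTalks', active=False)
toy.add_edge('IranTalks', 'JZarif', weight=30)

def node_color(attrs):
    """Hypothetical helper: green for active accounts, pink for suspended ones."""
    return 'lightgreen' if attrs.get('active') else 'pink'

colors = {node: node_color(attrs) for node, attrs in toy.nodes(data=True)}
print(colors)
```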

## Retweets

We can do the same kind of thing for retweets:

In [13]:
retweets = pandas.merge(users, tweets, left_on='user_id', right_on='retweet_userid')
retweets.tweetid.count()


1569

In [14]:
retweets[['userid', 'retweet_userid']]

Unnamed: 0,userid,retweet_userid
0,0NT4B9WQrz8ApHuMaWQdgjkiIISIgc+1VnzNZYXDh8g=,48252327
1,0NT4B9WQrz8ApHuMaWQdgjkiIISIgc+1VnzNZYXDh8g=,48252327
2,0NT4B9WQrz8ApHuMaWQdgjkiIISIgc+1VnzNZYXDh8g=,48252327
3,3fXrw02Ese7Cy5Al59Z9Zf9KPTlvSOgTv86yEyd1bkY=,48252327
4,X6EbvZXkH8sAimFRU7Mw1YYkAwdqnqLkQ3LENmt8HL0=,1113100093
5,b1FBnHRadyhA6q1aTLaPri5ixvwpr48FibomPmBFcWM=,1113100093
6,i1m5wcVTbrRsd4ykyomWv9dPVBIYunKFIqny4yj9F7s=,1113100093
7,nQD1ryjPgH6CQNZTg6RWYEPwX7ECZHF8wEEML5H4Zo8=,1113100093
8,7I5d3hsJqwKQpgBD8shjkGTW1CYeanAZ0Dey3gVyuKw=,1113100093
9,qdXwNLaTGS5TxJiBjrxcwP+TBybEFwbS2Aj7jQZxw8=,1113100093
