# How to run?

1. Create a folder called "project" and put this file and the datasets (all-rnr-annotated-threads and threads). Put source_tweets.csv file inside all-rnr-annotated-threads folder.
2. If it complains that you need pip install the imports, do that before continue.
3. Remember that some of the parts takes some time (especially graph calculation), so be patient. With less data, it is quicker, so maybe for the visualization, we can choose a subset of dataset. But, it is better to do calculation on whole data in my opinion. We can ask this to Mirela on Wednesday.

IDEA: We can create a network by assigning weight to the edges, and see whether unverified news are closer to being true or false. We can also use who follows whom information (we can gather all of them by doing preprocessing).

## Here is some useful links and resources. Those can be directly used for analysis:


NetworkX documentation: https://networkx.org/documentation/stable/reference/index.html
https://github.com/swkasica/pheme-rnr-knowledge-discovery/blob/master/exploratory_data_analysis.ipynb
https://github.com/shakshi12/Rumor-Spreaders-using-GNN-approach-PHEME-dataset-/tree/master/code

In [1]:
import os
import json
import time
import pytz
from datetime import datetime, timezone, timedelta
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
import networkx as nx
import networkx.algorithms.community as nx_comm
from convert_veracity_annotations import convert_annotations

In [2]:
%cd ../project/all-rnr-annotated-threads

/Users/aycavci/Desktop/social-network-analysis/project/all-rnr-annotated-threads


In [3]:
# This is for source tweets

source = pd.read_csv("source_tweets.csv")

In [6]:
source.head()

Unnamed: 0,tweet_id,text,tweet_date,fav_count,retweet_count,user_id,username,account_date,followers,followings,tweet_count,event,is_rumour,target
0,553529101659566080,BREAKING: Armed man takes hostage in kosher gr...,Fri Jan 09 12:29:56 +0000 2015,14,177,24506246,haaretzcom,Sun Mar 15 09:43:07 +0000 2009,193798,614,63575,charliehebdo-all-rnr-threads,1,unverified
1,553587735613952001,"#CharlieHebdo killers dead, confirmed by genda...",Fri Jan 09 16:22:55 +0000 2015,30,134,501768982,AgnesCPoirier,Fri Feb 24 13:18:20 +0000 2012,4709,375,1076,charliehebdo-all-rnr-threads,1,true
2,552816932643405824,"Top French cartoonists Charb, Cabu, Wolinski, ...",Wed Jan 07 13:20:02 +0000 2015,23,148,47624589,bouckap,Tue Jun 16 13:36:07 +0000 2009,20401,592,7182,charliehebdo-all-rnr-threads,1,unverified
3,553515399438811136,Police have surrounded the area where the #Cha...,Fri Jan 09 11:35:29 +0000 2015,329,684,759251,CNN,Fri Feb 09 00:35:02 +0000 2007,15405096,1038,54427,charliehebdo-all-rnr-threads,1,true
4,552808620187217920,PHOTO: Armed gunmen face police officers near ...,Wed Jan 07 12:47:00 +0000 2015,27,113,64643056,RT_com,Tue Aug 11 06:12:45 +0000 2009,842236,460,104998,charliehebdo-all-rnr-threads,1,true


In [7]:
folds = ['charliehebdo-all-rnr-threads', 'ottawashooting-all-rnr-threads', 'ebola-essien-all-rnr-threads',
         'prince-toronto-all-rnr-threads', 'ferguson-all-rnr-threads', 'putinmissing-all-rnr-threads',
         'germanwings-crash-all-rnr-threads', 'sydneysiege-all-rnr-threads']

In [8]:
# This is for reactions. I couldn't manage to write them to .csv file since it is huge
# Below are our candidate features, some of them will be dropped

tweet_ids = []
in_reply_to_tweet_ids = []
texts = []
tweet_dates = []
fav_counts = []
retweet_counts = []

user_ids = []
in_reply_to_user_ids = []
usernames = []
user_mentions = []
mentions = []
account_dates = []
protected = []
verified = []
followers = []
followings = []
tweet_counts = []

hashtags = []
urls = []

events = []
is_rumour = []

for f in folds:
    path1 = os.path.join(f, 'rumours')
    for dir1 in os.listdir(path1):
        if '_' not in dir1:
            path2 = os.path.join(path1, dir1,'reactions')
            for dir2 in os.listdir(path2):
                if '_' not in dir2:
                    path3  = os.path.join(path2, dir2)
                    file = open(path3)
                    data = json.load(file)

                    # tweet features
                    tweet_id = int(data['id'])
                    in_reply_to_tweet_id = data['in_reply_to_status_id']
                    text = data['text']
                    tweet_date = data['created_at']
                    favs = int(data['favorite_count'])
                    retweets = int(data['retweet_count'])

                    # user features
                    user_id = int(data['user']['id'])
                    in_reply_to_user_id = data['in_reply_to_user_id']
                    username = data['user']['screen_name']
                    user_mention = data['entities']['user_mentions']
                    account_date = data['user']['created_at']
                    is_protected = data['user']['protected']
                    is_verified = data['user']['verified']
                    no_followers = int(data['user']['followers_count'])
                    no_followings = int(data['user']['friends_count'])
                    no_tweets = int(data['user']['statuses_count'])

                    # entities
                    no_hashtags = int(len(data['entities']['hashtags']))      
                    has_url = data['entities']['urls']
                    
                    # append
                    tweet_ids.append(tweet_id)
                    in_reply_to_tweet_ids.append(in_reply_to_tweet_id)
                    texts.append(text)
                    tweet_dates.append(tweet_date)
                    fav_counts.append(favs)
                    retweet_counts.append(retweets)

                    user_ids.append(user_id)
                    in_reply_to_user_ids.append(in_reply_to_user_id)
                    usernames.append(username)
                    for mention in user_mention:
                        mentions.append((mention['id'], mention['screen_name']))
                    user_mentions.append(mentions)
                    account_dates.append(account_date)
                    protected.append(is_protected)
                    verified.append(is_verified)
                    followers.append(no_followers)
                    followings.append(no_followings)
                    tweet_counts.append(no_tweets)

                    hashtags.append(no_hashtags)
                    urls.append(has_url)

                    is_rumour.append(0)
                    events.append(f)
                    
    path4 = os.path.join(f, 'non-rumours')
    for dir3 in os.listdir(path4):
        if '_' not in dir3:
            path5 = os.path.join(path4, dir3,'reactions')
            for dir4 in os.listdir(path5):
                if '_' not in dir4:
                    path6  = os.path.join(path5, dir4)
                    file = open(path6)
                    data = json.load(file)

                    # tweet features
                    tweet_id = int(data['id'])
                    in_reply_to_tweet_id = data['in_reply_to_status_id']
                    text = data['text']
                    tweet_date = data['created_at']
                    favs = int(data['favorite_count'])
                    retweets = int(data['retweet_count'])

                    # user features
                    user_id = int(data['user']['id'])
                    in_reply_to_user_id = data['in_reply_to_user_id']
                    username = data['user']['screen_name']
                    user_mention = data['entities']['user_mentions']
                    account_date = data['user']['created_at']
                    is_protected = data['user']['protected']
                    is_verified = data['user']['verified']
                    no_followers = int(data['user']['followers_count'])
                    no_followings = int(data['user']['friends_count'])
                    no_tweets = int(data['user']['statuses_count'])

                    # entities
                    no_hashtags = int(len(data['entities']['hashtags']))      
                    has_url = data['entities']['urls']

                    # append
                    tweet_ids.append(tweet_id)
                    in_reply_to_tweet_ids.append(in_reply_to_tweet_id)
                    texts.append(text)
                    tweet_dates.append(tweet_date)
                    fav_counts.append(favs)
                    retweet_counts.append(retweets)

                    user_ids.append(user_id)
                    in_reply_to_user_ids.append(in_reply_to_user_id)
                    usernames.append(username)
                    for mention in user_mention:
                        mentions.append((mention['id'], mention['screen_name']))
                    user_mentions.append(mentions)
                    account_dates.append(account_date)
                    protected.append(is_protected)
                    verified.append(is_verified)
                    followers.append(no_followers)
                    followings.append(no_followings)
                    tweet_counts.append(no_tweets)

                    hashtags.append(no_hashtags)
                    urls.append(has_url)

                    is_rumour.append(0)
                    events.append(f)

In [9]:
reactions = pd.DataFrame([tweet_ids, texts, in_reply_to_tweet_ids, tweet_dates, fav_counts, retweet_counts,
                          user_ids, in_reply_to_user_ids, usernames, user_mentions, account_dates, protected,
                         verified, followers, followings, tweet_counts, hashtags, urls, events, is_rumour], 
                          ['tweet_id', 'text', 'in_reply_to_tweet_id', 'tweet_date', 'fav_count', 'retweet_count', 
                           'user_id', 'in_reply_to_user_id', 'username', 'user_mentions', 'account_date', 'protected', 
                           'verified', 'followers', 'followings', 'tweet_count', 'no_hashtags', 'urls', 'event', 
                           'is_rumour']).transpose()

reactions = reactions.infer_objects()

# drop categorical data and protected which has 0 var
reactions.drop(["protected", "verified", "urls"], axis=1, inplace=True)

reactions = reactions.dropna().reset_index(drop=True)

# convert boolen features into numerical
reactions = reactions.astype({"in_reply_to_tweet_id":'int64', "in_reply_to_user_id":'int64'})

# reactions.dtypes
# reactions.isna().sum()

In [None]:
# reactions.to_csv('reactions.csv', index=True)
# reactions.to_parquet('reactions.parquet.gzip', compression='gzip')

In [11]:
reactions.head()

Unnamed: 0,tweet_id,text,in_reply_to_tweet_id,tweet_date,fav_count,retweet_count,user_id,in_reply_to_user_id,username,user_mentions,account_date,followers,followings,tweet_count,no_hashtags,event,is_rumour
0,553530890908098561,“@haaretzcom: BREAKING: Armed man takes hostag...,553529101659566080,Fri Jan 09 12:37:03 +0000 2015,0,0,269112578,24506246,CapSpaulding01,"[(24506246, haaretzcom), (27238485, andreinett...",Sun Mar 20 03:45:28 +0000 2011,1296,1997,102425,0,charliehebdo-all-rnr-threads,0
1,553547184000339968,@haaretzcom @AhmetHez to kill is right? How da...,553529101659566080,Fri Jan 09 13:41:47 +0000 2015,0,0,2962507555,24506246,NodrugsRussia,"[(24506246, haaretzcom), (27238485, andreinett...",Wed Jan 07 07:00:41 +0000 2015,163,1998,2297,0,charliehebdo-all-rnr-threads,0
2,553546643941761024,@haaretzcom @AhmetHez play back infront of ur ...,553529101659566080,Fri Jan 09 13:39:38 +0000 2015,0,0,2962507555,24506246,NodrugsRussia,"[(24506246, haaretzcom), (27238485, andreinett...",Wed Jan 07 07:00:41 +0000 2015,163,1998,2297,0,charliehebdo-all-rnr-threads,0
3,553530583583047680,@ohohyesyesnono @haaretzcom Bots will conquest...,553529101659566080,Fri Jan 09 12:35:49 +0000 2015,0,0,2856401319,24506246,KingMasterbot,"[(24506246, haaretzcom), (27238485, andreinett...",Sun Nov 02 09:16:18 +0000 2014,339,3,113848,0,charliehebdo-all-rnr-threads,0
4,553547846612295681,@haaretzcom @AhmetHez you must be paranoid to ...,553529101659566080,Fri Jan 09 13:44:25 +0000 2015,0,0,2962507555,24506246,NodrugsRussia,"[(24506246, haaretzcom), (27238485, andreinett...",Wed Jan 07 07:00:41 +0000 2015,163,1998,2297,0,charliehebdo-all-rnr-threads,0


In [None]:
# Adjust this according to your path before running!

%cd /Users/aycavci/Desktop/social-network-analysis/project/threads/en

In [None]:
folders = ['charliehebdo', 'ottawashooting', 'ebola-essien', 'prince-toronto', 'ferguson', 'putinmissing',
           'germanwings-crash', 'sydneysiege']

In [None]:
who_follows_whom = []
count=0

for f in folders:
    path1 = os.path.join(f)
    for dir1 in os.listdir(path1):
        if '_' not in dir1:
            print(path2)
            count+=1
            path2  = os.path.join(path1, dir1,'who-follows-whom.dat')
            try:
                tuples = list(pd.read_table(path2, header=None).itertuples(name=None, index=False))
                who_follows_whom.extend(tuples)
            except:
                continue
print(count)

In [None]:
# Create directed graph
# G = nx.Graph()
G = nx.DiGraph()

# Assign edge
# G.add_weighted_edges_from(edges)
G.add_edges_from(who_follows_whom)

In [None]:
# -- Network statistics and metrics --

print("Number of nodes: ", G.number_of_nodes())
print("Number of edges: ", G.number_of_edges())
print("Degree distribution: ", G.degree())
print("In-degree: ", G.in_degree())
print("Out-degree: ", G.out_degree())

In [None]:
# Centrality indices
print("Degree centralityt: ", nx.degree_centrality(G))
print("In-degree centrality: ", nx.in_degree_centrality(G))
print("Out-degree centrality: ", nx.out_degree_centrality(G))
print("Betweenness centrality: ", nx.betweenness_centrality(G))
print("Closeness centrality: ", nx.closeness_centrality(G))
centrality = nx.eigenvector_centrality(G)
print("Eigenvector centrality: ", sorted((v, f"{c:0.2f}") for v, c in centrality.items()))

In [None]:
# Clustering coefficient
print("Clustering coefficient: ", nx.clustering(G))
print("Average clustering coefficient: ", nx.average_clustering(G))

In [None]:
# Network diameter and density
print("Network diameter: ", nx.diameter(G))
print("Network density: ", nx.density(G))

In [None]:
# Number of connected components and size of connected components
print("Number of connected components: ", nx.number_connected_components(G))
# TODO: We should calculate size of the connected components by ourselves

In [None]:
# -- Communities --

# TODO: Cliques: seems like only for undirected graphs -> https://networkx.org/documentation/stable/reference/algorithms/clique.html
# print("All cliques: ", list(nx.enumerate_all_cliques(G)))
# TODO: Homophily analysis: we need to implement by ourselves -> https://stackoverflow.com/questions/69482619/homophily-in-a-social-network-using-python
# Important nodes acting as Bridges
print("Bridges: ", list(nx.bridges(G)))
# Partitioning Algorithm: Girvan-Newman -> https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.community.centrality.girvan_newman.html
print("Girvan-Newman: ", list(nx_comm.girvan_newman(G)))

In [None]:
# HITS: rank nodes by hub and authority scores
h, a = nx.hits(G)
print("Hub score: ", h)
print("Authority score: ", a)

In [None]:
# TODO: Plotting: takes too long. Needs styling. Maybe just skip this for now.
plt.figure(figsize=(12, 12))
nx.draw(G, with_labels=True, pos=nx.kamada_kawai_layout(G))

In [None]:
# Longitudinal analysis

# Lets take 5 rumours source tweets from each events and see how reactions to those tweets changes in time 

events = ['charliehebdo-all-rnr-threads', 'ottawashooting-all-rnr-threads', 'ferguson-all-rnr-threads',
          'putinmissing-all-rnr-threads', 'germanwings-crash-all-rnr-threads']

source_tweet_list = []
time_list = []

for event in events:
    tweet = source.loc[source.event==event].iloc[0]
    tweet['tweet_date'] = datetime.strptime(tweet['tweet_date'], '%a %b %d %H:%M:%S %z %Y').replace(tzinfo=None)
    source_tweet_list.append(tweet)

In [None]:
# Before running this function for below three cells, you need to change timedelta=minutes for minutes,
# timedelta=hours for hours and timedelta=days for days, and run this cell and cell related above.

def tweet_counts_by_time(source_tweet_list, reactions, time_begin, time_end):
    tweet_reaction_list = []
    for tweet in source_tweet_list:
        count = 0
        for i in range(len(reactions)):
            reaction = reactions.iloc[i]
            if tweet.event == reaction.event:
                time = datetime.strptime(reaction.tweet_date, '%a %b %d %H:%M:%S %z %Y').replace(tzinfo=None)
                diff = time-tweet.tweet_date
                if (diff > timedelta(minutes=time_begin)) and (diff <= timedelta(minutes=time_end)):
                    count+=1
        tweet_reaction_list.append(count)
    
    return tweet_reaction_list

In [None]:
minute = tweet_counts_by_time(source_tweet_list, reactions, 0, 1)
five_min = tweet_counts_by_time(source_tweet_list, reactions, 0, 5)
ten_min = tweet_counts_by_time(source_tweet_list, reactions, 0, 10)
thirty_min = tweet_counts_by_time(source_tweet_list, reactions, 0, 30)

In [None]:
hour = tweet_counts_by_time(source_tweet_list, reactions, 0, 1)
three_hrs = tweet_counts_by_time(source_tweet_list, reactions, 0, 3)
eight_hrs = tweet_counts_by_time(source_tweet_list, reactions, 0, 8)
twelve_hrs = tweet_counts_by_time(source_tweet_list, reactions, 0, 12)

In [None]:
day = tweet_counts_by_time(source_tweet_list, reactions, 0, 1)
two_days = tweet_counts_by_time(source_tweet_list, reactions, 0, 2)
five_days = tweet_counts_by_time(source_tweet_list, reactions, 0, 5)

In [None]:
time_list.append(minute)
time_list.append(five_min)
time_list.append(ten_min)
time_list.append(thirty_min)
time_list.append(hour)
time_list.append(three_hrs)
time_list.append(eight_hrs)
time_list.append(twelve_hrs)
time_list.append(day)
time_list.append(two_days)
time_list.append(five_days)

In [None]:
import matplotlib.pyplot as plt

charlie_x = ['1 min', '5 min', '10 min', '30 min', '1 hr', '3 hrs', '8 hrs', '12 hrs', '1 day', '2 days', '5 days']
ottawa_x = ['1 min', '5 min', '10 min', '30 min', '1 hr', '3 hrs', '8 hrs', '12 hrs', '1 day', '2 days', '5 days']
ferguson_x = ['1 min', '5 min', '10 min', '30 min', '1 hr', '3 hrs', '8 hrs', '12 hrs', '1 day', '2 days', '5 days']
putin_x = ['1 min', '5 min', '10 min', '30 min', '1 hr', '3 hrs', '8 hrs', '12 hrs', '1 day', '2 days', '5 days']
germanwings_x = ['1 min', '5 min', '10 min', '30 min', '1 hr', '3 hrs', '8 hrs', '12 hrs', '1 day', '2 days', '5 days']

charlie_y = [time[0] for time in time_list]
ottawa_y = [time[1] for time in time_list]
ferguson_y= [time[2] for time in time_list]
putin_y = [time[3] for time in time_list]
germanwings_y = [time[4] for time in time_list]

plt.figure(figsize=(10, 10))

plt.plot(charlie_x, charlie_y, label = "charliehebdo")
plt.plot(ottawa_x, ottawa_y, label = "ottawashooting")
plt.plot(ferguson_x, ferguson_y, label = "ferguson")
plt.plot(putin_x, putin_y, label = "putinmissing")
plt.plot(germanwings_x, germanwings_y, label = "germanwings-crash")

plt.xlabel('Timespan')
plt.ylabel('Reaction counts (only comments)')
plt.title('How reactions to those tweets change in time')
  
plt.legend()
plt.show()