Dataset source = http://snap.stanford.edu/data/wiki-RfA.html <br>
Dataset title = "Wikipedia Requests for Adminship (with text)"<br>
Dataset public = Yes

#### Dataset statistics
(median/mean: 19/34 tokens)<br>
Nodes	10,835<br>
Edges	159,388<br>
Triangles	956,428<br>
Type of graph: Directed Signed Network

#### Dataset format
*Fields*<br>
SRC: user name of source, i.e., voter<br>
TGT: user name of target, i.e., the user running for election<br>
VOT: the source's vote on the target (-1 = oppose; 0 = neutral; 1 = support)<br>
RES: the outcome of the election (-1 = target was rejected as admin; 1 = target was accepted)<br>
YEA: the year in which the election was started<br>
DAT: the date and time of this vote<br>
TXT: the comment written by the source, in wiki markup<br>
<br>
<br>
*Example*<br>
SRC:Guettarda<br>
TGT:Lord Roem<br>
VOT:1<br>
RES:1<br>
YEA:2013<br>
DAT:19:53, 25 January 2013<br>
TXT:'''Support''' per [[WP:DEAL]]: clueful, and unlikely to break Wikipedia.<br>
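
The record layout above can be read with a small parser. This is a minimal sketch, not the preprocessing that produced the data used later in this notebook: the function name `parse_rfa_records` and the `sample` list are hypothetical, and it assumes every record ends with its `TXT` line, as in the example.

```python
from typing import Dict, Iterable, Iterator

FIELDS = ("SRC", "TGT", "VOT", "RES", "YEA", "DAT", "TXT")


def parse_rfa_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per vote record from 'KEY:value' lines (hypothetical helper)."""
    record: Dict[str, str] = {}
    for line in lines:
        # Split on the first colon only: DAT and TXT values contain colons.
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            record[key] = value
        # Assumption: TXT is the last field of a record, as in the example.
        if sep and key == "TXT":
            yield record
            record = {}


# The example record from above, as it would appear in the raw text file.
sample = [
    "SRC:Guettarda",
    "TGT:Lord Roem",
    "VOT:1",
    "RES:1",
    "YEA:2013",
    "DAT:19:53, 25 January 2013",
    "TXT:'''Support''' per [[WP:DEAL]]: clueful, and unlikely to break Wikipedia.",
    "",
]

records = list(parse_rfa_records(sample))
print(records[0]["DAT"])  # 19:53, 25 January 2013
```

Splitting on the first colon only matters here: a plain split on every `:` would cut the timestamp down to `19`.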

In [1]:
import networkx as nx
import  pickle
import pandas as pd
import numpy as np

First, let's set up the directed graph instance.

In [2]:
G = nx.DiGraph()

Load the preprocessed data.

In [3]:
with open("myData.pickle", "rb") as f:
    dataDF = pickle.load(f)

Get a snapshot of our data

In [4]:
dataDF.head()

Unnamed: 0,DAT,RES,SRC,TGT,TXT,VOT,YEA
0,"23:13, 19 April 2013",1,Steel1943,BDD,'''Support''' as co-nom.,1,2013
1,"01:04, 20 April 2013",1,Cuchullain,BDD,'''Support''' as nominator.--,1,2013
2,"23:43, 19 April 2013",1,INeverCry,BDD,'''Support''' per noms.,1,2013
3,"00:11, 20 April 2013",1,Cncmaster,BDD,'''Support''' per noms. BDD is a strong contri...,1,2013
4,"00:56, 20 April 2013",1,Miniapolis,BDD,"'''Support''', with great pleasure. I work wit...",1,2013


In [5]:
# Add one edge per vote; the remaining columns become edge attributes.
# A DiGraph keeps a single edge per (SRC, TGT) pair, so a repeated pair
# overwrites the attributes stored earlier.
for _, row in dataDF.iterrows():
    G.add_edge(row.SRC, row.TGT, VOT=row.VOT, RES=row.RES,
               YEA=row.YEA, DAT=row.DAT, TXT=row.TXT)

Let's save the graph so we don't have to rebuild it every time.

In [6]:
with open("fullGraph.pickle", "wb") as f:
    pickle.dump(G, f)

Now load the pickled graph back.

In [4]:
with open("fullGraph.pickle", "rb") as f:
    G = pickle.load(f)

In [5]:
tmplist = list(G.nodes())
print(tmplist[0:5])

['Steel1943', 'BDD', 'Cuchullain', 'INeverCry', 'Cncmaster']


In [6]:
G.number_of_edges()

52514

In [7]:
G.number_of_nodes()

4249

The counts don't match the dataset statistics (4,249 nodes here vs. 10,835 reported); let's see why.
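
One way to investigate a gap like this is to compare the graph's node set with the set of names found in the raw file. The sketch below uses toy data; `compare_node_sets` is a hypothetical helper, not part of networkx.

```python
from typing import Iterable, Set, Tuple


def compare_node_sets(
    graph_nodes: Iterable[str], raw_nodes: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (names only in the raw data, names only in the graph)."""
    graph_set = set(graph_nodes)
    raw_set = set(raw_nodes)
    return raw_set - graph_set, graph_set - raw_set


# Toy data standing in for the graph's nodes and the names in the raw file.
missing, extra = compare_node_sets(
    ["Steel1943", "BDD"],
    ["Steel1943", "BDD", "Guettarda", "Lord Roem"],
)
print(sorted(missing))  # ['Guettarda', 'Lord Roem']
print(sorted(extra))    # []
```

In this notebook the first argument would be `G.nodes()` and the second a list of user names read from the raw file.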

In [3]:
import glob
import os
import re
import pandas as pd
import numpy as np
from collections import Counter
files = glob.glob("./*.txt")
files

['./wiki-RfA.txt']

In [4]:
# Read every line of the raw file(s).
content = []
for file in files:
    with open(file, mode="r") as f:
        content.extend(f.read().splitlines())

# Collect every user name that appears as a voter (SRC) or a candidate (TGT).
srcNodes = []
for text in content:
    # Split on the first colon only, so a value containing ':' is kept whole.
    key, _, value = text.partition(":")
    if key in ('SRC', 'TGT'):
        srcNodes.append(value)

In [5]:
len(srcNodes)

396550

Let's get the number of unique nodes, first by using `Counter.most_common()`.

In [6]:
from collections import Counter
cntr = Counter(srcNodes)
result=cntr.most_common()

In [7]:
print("Unique nodes version 1 is ", len(result))

Unique nodes version 1 is  11380


Now let's do the same by converting the list to a set, which removes duplicates.

In [8]:
# Build the set from the raw names rather than from the (name, count) pairs.
tst = set(srcNodes)
print(len(tst))

11380


In [9]:
print("Unique nodes version 2 is ",len(tst))

Unique nodes version 2 is  11380
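
Both counts agree with each other but not with the node count in the dataset statistics. One thing worth checking, as an assumption rather than a confirmed cause, is whether some names are empty or differ only by surrounding whitespace. A minimal sketch on toy data, with a hypothetical helper `summarize_names`:

```python
from typing import Dict, Iterable


def summarize_names(names: Iterable[str]) -> Dict[str, int]:
    """Count unique names before and after basic cleaning (hypothetical helper)."""
    names = list(names)
    stripped = [n.strip() for n in names]
    return {
        "unique_raw": len(set(names)),
        "unique_stripped": len(set(stripped)),
        "unique_nonempty": len({n for n in stripped if n}),
        "empty_occurrences": sum(1 for n in stripped if not n),
    }


# Toy list with a trailing-space variant and two empty names.
toy = ["BDD", "BDD ", "", "Steel1943", ""]
summary = summarize_names(toy)
print(summary)
```

Applied to `srcNodes`, a difference between `unique_raw` and `unique_nonempty` would show how many of the unique names come from empty or whitespace-padded entries.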


In [10]:
# Rebuild the graph from the raw file, one edge per vote record.
# As in the example record, fields arrive in the order
# SRC, TGT, VOT, RES, YEA, DAT, TXT, so a TXT line closes the record.
fields = ('SRC', 'TGT', 'VOT', 'RES', 'YEA', 'DAT', 'TXT')
record = {}
count = 1  # starts at 1, so the printed value is the line count plus one
for text in content:
    count = count + 1
    # Split on the first colon only: DAT and TXT values contain colons.
    key, _, value = text.partition(":")
    if key in fields:
        record[key] = value
    if key == 'TXT':
        # Missing fields fall back to an empty string.
        attrs = {name: record.get(name, "") for name in fields[2:]}
        # A DiGraph keeps one edge per (SRC, TGT) pair, so a repeated pair
        # overwrites the attributes stored earlier.
        G.add_edge(record.get('SRC', ""), record.get('TGT', ""), **attrs)
        record = {}
print(count)

1586201


In [12]:
G.number_of_nodes()  # not shown: a notebook cell displays only its last expression
G.number_of_edges()

189003
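
A `DiGraph` holds at most one edge per ordered (source, target) pair, so its edge count equals the number of distinct pairs, not the number of vote records. The sketch below shows how the two numbers can be compared with the standard library alone; `pair_stats` and the toy pairs are hypothetical, and why a pair would repeat in the real data is not verified here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def pair_stats(pairs: Iterable[Pair]) -> Tuple[int, int, Dict[Pair, int]]:
    """Return (total records, distinct ordered pairs, pairs seen more than once)."""
    counts = Counter(pairs)
    repeated = {pair: n for pair, n in counts.items() if n > 1}
    return sum(counts.values()), len(counts), repeated


# Toy (SRC, TGT) pairs: ("A", "B") appears twice; ("B", "A") is a different edge.
toy_pairs = [("A", "B"), ("A", "B"), ("B", "A"), ("C", "B")]
total, distinct, repeated = pair_stats(toy_pairs)
print(total, distinct)  # 4 3
print(repeated)         # {('A', 'B'): 2}
```

If every record must be kept as its own edge, networkx's `MultiDiGraph` allows parallel edges between the same pair of nodes.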