# Create Graph Dataset

In this notebook, we convert the [BuzzFeed dataset](https://github.com/KaiDMML/FakeNewsNet/tree/old-version/Data/BuzzFeed) from the 2018 version of FakeNewsNet into a format that can be loaded into an Amazon Neptune cluster. To get the raw data:
1. Clone the [FakeNewsNet repository](https://github.com/KaiDMML/FakeNewsNet) from GitHub.
2. Check out the `old-version` branch.
3. Change directory to `Data/BuzzFeed`.

Once we have created `nodes` and `edges` CSV files that are compatible with Amazon Neptune, we upload them to a staging S3 bucket and then bulk load them into our Neptune database.

## Setup

In [60]:
# import required libraries
import pandas as pd
import scipy.io
import json
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import boto3
import utils.neptune_ml_utils as neptune_ml

## Read Data

This notebook assumes BuzzFeed data from the 2018 version of FakeNewsNet are located under `./Data/BuzzFeed/` relative to this notebook.
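Before reading anything, it can help to confirm that the expected inputs are in place. The following is a minimal sketch: the file and folder names are the ones read by the cells in this notebook, while the helper name `missing_inputs` is made up for illustration.

```python
from pathlib import Path

# File and folder names are taken from the read calls in this notebook.
EXPECTED_FILES = [
    "User.txt",
    "News.txt",
    "BuzzFeedNewsUser.txt",
    "BuzzFeedUserUser.txt",
    "UserFeature.mat",
]
EXPECTED_DIRS = ["RealNewsContent", "FakeNewsContent"]


def missing_inputs(data_dir):
    """Return the expected files and folders that are absent from data_dir."""
    root = Path(data_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


# An empty list means every expected input was found.
print(missing_inputs("./Data/BuzzFeed"))
```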

In [2]:
# read raw data for users 
users = pd.read_csv('./Data/BuzzFeed/User.txt', header=None)

In [3]:
users.head()

Unnamed: 0,0
0,98d2b98ce305174e2f6c10b8f8a1a9d5
1,a273d0fd07c18a884ce2aa425813eb06
2,ac091e92df9e854a07563ffb397925d4
3,d2ded2de054f2ceb43dff7f80fc46774
4,3f2b23abf0e842f6bc97eed85596ff50


Each row in the above DataFrame provides a unique ID for the corresponding user in the dataset.

In [4]:
users.shape

(15257, 1)

We have a total of 15,257 users in this dataset!

In [5]:
# read raw data for news 
news = pd.read_csv('./Data/BuzzFeed/News.txt', header=None)

In [6]:
news.head()

Unnamed: 0,0
0,BuzzFeed_Real_1
1,BuzzFeed_Real_2
2,BuzzFeed_Real_3
3,BuzzFeed_Real_4
4,BuzzFeed_Real_5


Each row in the above DataFrame provides the name of the corresponding news article. The name also serves as its ID and encodes its label (`Real` or `Fake`).

In [7]:
news.shape

(182, 1)

We have a total of 182 news articles in this dataset!

In [8]:
# read data about news_user relationships
news_user = pd.read_csv('./Data/BuzzFeed/BuzzFeedNewsUser.txt', sep='\t', header=None)

In [9]:
news_user.head()

Unnamed: 0,0,1,2
0,45,1,1
1,127,2,1
2,115,3,1
3,180,3,1
4,140,4,1


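In the above DataFrame, each row states that the news article in the first column (news_id) was posted or spread by the user in the second column (user_id) n times, where n is the value in the third column. The IDs are 1-based row numbers, which is why the node IDs created later use `row_num + 1`.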

In [10]:
news_user.shape

(22779, 3)

In [11]:
news_user[2].sum()

25240

There are 22,779 unique news_user relationships in the dataset, and 25,240 in total once we account for users who spread the same news article more than once.
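The difference between the two numbers is simply row count versus the sum of the count column. A small sketch with made-up rows in the same `(news_id, user_id, count)` layout:

```python
# Toy rows in the same (news_id, user_id, count) layout as BuzzFeedNewsUser.txt.
# The values are made up for illustration.
rows = [
    (45, 1, 1),
    (127, 2, 1),
    (115, 3, 2),
    (180, 3, 3),
]

# One row per (news, user) pair.
unique_relationships = len(rows)
# Each pair weighted by how many times the user spread the article.
total_relationships = sum(count for _, _, count in rows)

print(unique_relationships, total_relationships)
```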

In [12]:
# read data about user_user relationships
user_user = pd.read_csv('./Data/BuzzFeed/BuzzFeedUserUser.txt', sep='\t', header=None)

In [13]:
user_user.head()

Unnamed: 0,0,1
0,48,1
1,899,1
2,6781,1
3,10097,1
4,100,2


In the above DataFrame, user_id in the first column follows the user_id in the second column.

In [14]:
user_user.shape

(634750, 2)

There are a total of 634,750 user_user relationships (i.e. social links) in the dataset!

In [15]:
# read raw data about user features
user_features = scipy.io.loadmat('./Data/BuzzFeed/UserFeature.mat')['X'].toarray()

In [16]:
user_features.shape

(15257, 109626)

There are 109,626 features for each user! We will reduce the dimensionality of the user features to 100 principal components using PCA.

In [17]:
# reduce dimensionality of user_features using PCA
X = user_features
n = 100  # number of principal components
pca = PCA(n_components=n)
X_pca = pca.fit_transform(X)
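As a reminder of what this projection computes, the textbook formulation of PCA can be written as follows. This is a sketch rather than a description of scikit-learn's internals: the library chooses its own SVD solver and sign convention, and the symbols $m$, $d$, $k$ and $\mu$ are notation introduced here.

$$
\begin{aligned}
X &\in \mathbb{R}^{m \times d}, \quad m = 15257,\; d = 109626,\; k = 100 \\
\mu &= \tfrac{1}{m} X^{\top} \mathbf{1}_m, \qquad X_c = X - \mathbf{1}_m \mu^{\top} \\
X_c &= U \Sigma V^{\top} \quad \text{(singular value decomposition)} \\
X_{\mathrm{pca}} &= X_c V_k = U_k \Sigma_k \in \mathbb{R}^{m \times k}
\end{aligned}
$$

Here $V_k$ holds the first $k$ right singular vectors (the principal directions), and $U_k$, $\Sigma_k$ are the matching left singular vectors and singular values. Each user is therefore described by 100 numbers instead of 109,626.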

## Create Nodes Table

In this section we create a DataFrame that will define nodes and their properties in the graph, in a format that is compatible with Amazon Neptune (with Apache TinkerPop Gremlin).
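For orientation, the Gremlin CSV layout used below has an `~id` and `~label` column for every node, plus one column per property named `name:Type`. Array properties use `name:Type[]` with values separated by semicolons. The following sketch writes a tiny nodes file in that layout using only the standard library; the rows are hypothetical toy values, not data from BuzzFeed.

```python
import csv
import io

# Hypothetical toy rows; the IDs and values are made up for illustration.
header = ["~id", "~label", "news_type:String", "user_features:Double[]"]
rows = [
    {"~id": "news_1", "~label": "news", "news_type:String": "Real"},
    {
        "~id": "user_1",
        "~label": "user",
        # array values are joined into one semicolon-separated cell
        "user_features:Double[]": ";".join(str(v) for v in [0.5, -1.25, 3.0]),
    },
]

buffer = io.StringIO()
# restval="" leaves a cell empty when a node does not have that property
writer = csv.DictWriter(buffer, fieldnames=header, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

nodes_csv = buffer.getvalue()
print(nodes_csv)
```

Properties that do not apply to a node type are left empty, which is how the combined `nodes` DataFrame in this notebook ends up with mostly blank columns per row.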

In [18]:
# create ~id and ~label for user nodes
users['row_num'] = users.index
users['~id'] = users.apply(lambda x: 'user_'+str(x['row_num']+1), axis=1)
users['~label'] = 'user'
# add user_features as a property for each user node,
# stored as one semicolon-separated string per user
users['user_features:Double[]'] = [";".join(str(val) for val in row) for row in X_pca]

In [19]:
users.head()

Unnamed: 0,0,row_num,~id,~label,user_features:Double[]
0,98d2b98ce305174e2f6c10b8f8a1a9d5,0,user_1,user,-6.5335629813853275;-2.539568976534804;-0.0862...
1,a273d0fd07c18a884ce2aa425813eb06,1,user_2,user,3.7452664109727127;-2.3658738836989692;0.37171...
2,ac091e92df9e854a07563ffb397925d4,2,user_3,user,-6.169060740831697;-0.4554150240050502;-2.0617...
3,d2ded2de054f2ceb43dff7f80fc46774,3,user_4,user,11.19523153278209;-1.182848759491067;0.3934020...
4,3f2b23abf0e842f6bc97eed85596ff50,4,user_5,user,-6.182361512218626;0.4971749514792787;-0.91511...


In [20]:
# create ~id and ~label for news nodes
news['row_num'] = news.index
news['~id'] = news.apply(lambda x: 'news_'+str(x['row_num']+1), axis=1)
news['~label'] = 'news'
# specify news_type as a property for news nodes
news['news_type:String'] = news.apply(lambda x: x[0].split('_')[1], axis=1)

In [21]:
news.head()

Unnamed: 0,0,row_num,~id,~label,news_type:String
0,BuzzFeed_Real_1,0,news_1,news,Real
1,BuzzFeed_Real_2,1,news_2,news,Real
2,BuzzFeed_Real_3,2,news_3,news,Real
3,BuzzFeed_Real_4,3,news_4,news,Real
4,BuzzFeed_Real_5,4,news_5,news,Real


In [22]:
news.tail()

Unnamed: 0,0,row_num,~id,~label,news_type:String
177,BuzzFeed_Fake_87,177,news_178,news,Fake
178,BuzzFeed_Fake_88,178,news_179,news,Fake
179,BuzzFeed_Fake_89,179,news_180,news,Fake
180,BuzzFeed_Fake_90,180,news_181,news,Fake
181,BuzzFeed_Fake_91,181,news_182,news,Fake


In [23]:
# entries that appear in the authors field of the dataset but are not actual author names
# we will filter them out when creating author nodes from NewsContent data
non_authors = ['View All Posts', 'Cnn National Politics Reporter', 'Cnn White House Producer',
                'Senior Political Reporter', 'Cnn Pentagon Correspondent', 'Cnn Senior Congressional Producer']

In [24]:
# initialize news_title column in news dataframe with missing values
# (None keeps the column as object dtype, so titles can be assigned as strings)
news['news_title:String'] = None

In [25]:
# extract list of authors and publishers from NewsContent files (i.e. authors and publishers nodes)
authors_list = []
publishers_list = []
for nwz in news[0]:
    
    if nwz.split('_')[1]=='Real':
        path = './Data/BuzzFeed/RealNewsContent/'+nwz+'-Webpage.json'
    else:
        path = './Data/BuzzFeed/FakeNewsContent/'+nwz+'-Webpage.json'
    
    with open(path) as fp:
        
        webpage = json.load(fp)
        
        if 'title' in webpage:
            news_title = webpage.get('title')
            # populate news_title column in news dataframe
            news.loc[news[0]==nwz, 'news_title:String'] = news_title
        
        
        if 'source' in webpage:
            publisher = webpage.get('source')
            if publisher not in publishers_list:
                publishers_list.append(publisher)

        if 'authors' in webpage: 
            for author in webpage.get('authors'):
                if author not in authors_list and author not in non_authors:
                    authors_list.append(author)
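The path construction above is repeated in the later cell that extracts relationships. One option is to factor it into a small helper. This is a sketch: `content_path` is a hypothetical name and is not used elsewhere in this notebook.

```python
def content_path(news_name, data_dir="./Data/BuzzFeed"):
    """Build the path to a news item's Webpage JSON file from its name."""
    # Names look like 'BuzzFeed_Real_1' or 'BuzzFeed_Fake_87'.
    news_type = news_name.split("_")[1]
    folder = "RealNewsContent" if news_type == "Real" else "FakeNewsContent"
    return f"{data_dir}/{folder}/{news_name}-Webpage.json"


print(content_path("BuzzFeed_Real_1"))
print(content_path("BuzzFeed_Fake_87"))
```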

In [26]:
news.head()

Unnamed: 0,0,row_num,~id,~label,news_type:String,news_title:String
0,BuzzFeed_Real_1,0,news_1,news,Real,Another Terrorist Attack in NYC…Why Are we STI...
1,BuzzFeed_Real_2,1,news_2,news,Real,Hillary Clinton on police shootings: 'too many...
2,BuzzFeed_Real_3,2,news_3,news,Real,"Critical counties: Wake County, NC, could put ..."
3,BuzzFeed_Real_4,3,news_4,news,Real,NFL Superstar Unleashes 4 Word Bombshell on Re...
4,BuzzFeed_Real_5,4,news_5,news,Real,Obama in NYC: 'We all have a role to play' in ...


In [27]:
len(publishers_list)

28

There are 28 publishers in the dataset!

In [28]:
publishers_list[:5]

['http://eaglerising.com',
 'http://cnn.it',
 'http://conservativebyte.com',
 'http://politi.co',
 'http://abcn.ws']

In [29]:
len(authors_list)

126

There are 126 authors in the dataset!

In [30]:
authors_list[:5]

['Leonora Cravotta', 'Mj Lee', 'Joyce Tseng', 'Eli Watkins', 'Kevin Liptak']

In [31]:
# extract author_publisher, author_news and publisher_news relationships
# from NewsContent files (i.e. author_publisher, author_news and publisher_news edges)
author_publisher = []
author_news = []
publisher_news = []

for news_id, nwz in enumerate(news[0]):
    
    if nwz.split('_')[1]=='Real':
        path = './Data/BuzzFeed/RealNewsContent/'+nwz+'-Webpage.json'
    else:
        path = './Data/BuzzFeed/FakeNewsContent/'+nwz+'-Webpage.json'
    
    with open(path) as fp:
        
        webpage = json.load(fp)
        
        # reset per article so a missing 'source' cannot reuse the previous article's publisher
        publisher_id = None
        if 'source' in webpage:
            publisher = webpage.get('source')
            publisher_id = publishers_list.index(publisher)
            # publisher ==> "published" ==> news
            publisher_news.append((publisher_id+1, news_id+1))

        if 'authors' in webpage: 
            for author in webpage.get('authors'):
                if author not in non_authors:
                    author_id = authors_list.index(author)
                    # author ==> "wrote_for" ==> publisher (only when the publisher is known)
                    if publisher_id is not None:
                        author_publisher.append((author_id+1, publisher_id+1))
                    # author ==> "wrote" ==> news
                    author_news.append((author_id+1, news_id+1))
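The IDs used in these edges are 1-based positions in `authors_list` and `publishers_list`, found with `list.index`. An equivalent approach keeps a dictionary from name to ID, which avoids scanning the list on every lookup. The helper name `assign_ids` and the sample values below are assumptions for illustration only.

```python
def assign_ids(items):
    """Map each distinct item to a 1-based ID in first-seen order."""
    ids = {}
    for item in items:
        if item not in ids:
            ids[item] = len(ids) + 1
    return ids


# Made-up sequence of 'source' values as they might appear across articles.
sources = ["http://eaglerising.com", "http://cnn.it", "http://eaglerising.com"]
publisher_ids = assign_ids(sources)

print(publisher_ids)
```

The resulting IDs match `publishers_list.index(publisher) + 1` as long as both are built in the same first-seen order.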

In [32]:
# create dataframe for author nodes
authors_df = pd.DataFrame(authors_list)
authors_df['row_num'] = authors_df.index
authors_df['~id'] = authors_df.apply(lambda x: 'author_'+str(x['row_num']+1), axis=1)
authors_df['~label'] = 'author'
authors_df['author_name:String'] = authors_df[0]

In [33]:
authors_df.head()

Unnamed: 0,0,row_num,~id,~label,author_name:String
0,Leonora Cravotta,0,author_1,author,Leonora Cravotta
1,Mj Lee,1,author_2,author,Mj Lee
2,Joyce Tseng,2,author_3,author,Joyce Tseng
3,Eli Watkins,3,author_4,author,Eli Watkins
4,Kevin Liptak,4,author_5,author,Kevin Liptak


In [34]:
# create dataframe for publisher nodes
publishers_df = pd.DataFrame(publishers_list)
publishers_df['row_num'] = publishers_df.index
publishers_df['~id'] = publishers_df.apply(lambda x: 'publisher_'+str(x['row_num']+1), axis=1)
publishers_df['~label'] = 'publisher'
publishers_df['publisher_website:String'] = publishers_df[0]

In [35]:
publishers_df.head()

Unnamed: 0,0,row_num,~id,~label,publisher_website:String
0,http://eaglerising.com,0,publisher_1,publisher,http://eaglerising.com
1,http://cnn.it,1,publisher_2,publisher,http://cnn.it
2,http://conservativebyte.com,2,publisher_3,publisher,http://conservativebyte.com
3,http://politi.co,3,publisher_4,publisher,http://politi.co
4,http://abcn.ws,4,publisher_5,publisher,http://abcn.ws


In [36]:
# concatenate all nodes dataframes to create an overall nodes (i.e. vertices) dataframe
nodes = pd.concat([news, users, publishers_df, authors_df], sort=True, ignore_index=True)

In [37]:
# drop helper columns by name: the raw value column (0) and row_num
nodes = nodes.drop(columns=[0, 'row_num'])

In [38]:
nodes.shape

(15593, 7)

We have a total of 15,593 nodes in the graph!
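This total can be cross-checked against the per-type counts reported earlier in this notebook:

```python
# Counts reported earlier in this notebook.
node_counts = {"news": 182, "user": 15257, "publisher": 28, "author": 126}

total_nodes = sum(node_counts.values())
print(total_nodes)
```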

In [39]:
# user nodes
nodes.loc[nodes['~label']=='user'].head()

Unnamed: 0,~id,~label,news_type:String,news_title:String,user_features:Double[],publisher_website:String,author_name:String
182,user_1,user,,,-6.5335629813853275;-2.539568976534804;-0.0862...,,
183,user_2,user,,,3.7452664109727127;-2.3658738836989692;0.37171...,,
184,user_3,user,,,-6.169060740831697;-0.4554150240050502;-2.0617...,,
185,user_4,user,,,11.19523153278209;-1.182848759491067;0.3934020...,,
186,user_5,user,,,-6.182361512218626;0.4971749514792787;-0.91511...,,


In [40]:
# news nodes
nodes.loc[nodes['~label']=='news'].head()

Unnamed: 0,~id,~label,news_type:String,news_title:String,user_features:Double[],publisher_website:String,author_name:String
0,news_1,news,Real,Another Terrorist Attack in NYC…Why Are we STI...,,,
1,news_2,news,Real,Hillary Clinton on police shootings: 'too many...,,,
2,news_3,news,Real,"Critical counties: Wake County, NC, could put ...",,,
3,news_4,news,Real,NFL Superstar Unleashes 4 Word Bombshell on Re...,,,
4,news_5,news,Real,Obama in NYC: 'We all have a role to play' in ...,,,


In [41]:
# publisher nodes
nodes.loc[nodes['~label']=='publisher'].head()

Unnamed: 0,~id,~label,news_type:String,news_title:String,user_features:Double[],publisher_website:String,author_name:String
15439,publisher_1,publisher,,,,http://eaglerising.com,
15440,publisher_2,publisher,,,,http://cnn.it,
15441,publisher_3,publisher,,,,http://conservativebyte.com,
15442,publisher_4,publisher,,,,http://politi.co,
15443,publisher_5,publisher,,,,http://abcn.ws,


In [42]:
# author nodes
nodes.loc[nodes['~label']=='author'].head()

Unnamed: 0,~id,~label,news_type:String,news_title:String,user_features:Double[],publisher_website:String,author_name:String
15467,author_1,author,,,,,Leonora Cravotta
15468,author_2,author,,,,,Mj Lee
15469,author_3,author,,,,,Joyce Tseng
15470,author_4,author,,,,,Eli Watkins
15471,author_5,author,,,,,Kevin Liptak


## Create Edges Table

In [43]:
# create a list of edges from all edge types including edge labels 
edges_list = []

for i, r in user_user.iterrows():
    edges_list.append(('user_user_'+str(i+1), 'user_'+str(r[0]), 'user_'+str(r[1]), 'follows', np.nan))
    
for i, r in news_user.iterrows():
    edges_list.append(('news_user_'+str(i+1), 'news_'+str(r[0]), 'user_'+str(r[1]), 'spread_by', r[2]))
    
for i, item in enumerate(author_news):
    edges_list.append(('author_news_'+str(i+1), 'author_'+str(item[0]), 'news_'+str(item[1]), 'wrote', np.nan))
    
for i, item in enumerate(publisher_news):
    edges_list.append(('publisher_news_'+str(i+1), 'publisher_'+str(item[0]), 'news_'+str(item[1]), 'published', np.nan))
    
for i, item in enumerate(author_publisher):
    edges_list.append(('author_publisher_'+str(i+1), 'author_'+str(item[0]), 'publisher_'+str(item[1]), 'wrote_for', np.nan))

In [44]:
# convert edges_list to a dataframe
edges = pd.DataFrame(edges_list, columns=['~id', '~from', '~to', '~label', 'weight:Int'])

In [45]:
edges.head()

Unnamed: 0,~id,~from,~to,~label,weight:Int
0,user_user_1,user_48,user_1,follows,
1,user_user_2,user_899,user_1,follows,
2,user_user_3,user_6781,user_1,follows,
3,user_user_4,user_10097,user_1,follows,
4,user_user_5,user_100,user_2,follows,


In [46]:
edges.loc[edges['~label']=='spread_by'].head()

Unnamed: 0,~id,~from,~to,~label,weight:Int
634750,news_user_1,news_45,user_1,spread_by,1.0
634751,news_user_2,news_127,user_2,spread_by,1.0
634752,news_user_3,news_115,user_3,spread_by,1.0
634753,news_user_4,news_180,user_3,spread_by,1.0
634754,news_user_5,news_140,user_4,spread_by,1.0
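The `weight:Int` values display as `1.0` because the `NaN` placeholders on the other edge types force the column to floating point, so the saved CSV would contain `1.0` rather than `1`. Whether the bulk loader accepts that for an `Int` column is worth checking against the Neptune documentation. If integer literals are needed, the cell text could be normalized before writing. A minimal sketch, where `format_int_cell` is a hypothetical helper and not part of this notebook:

```python
import math


def format_int_cell(value):
    """Render a weight as CSV cell text: '' when missing, else an integer literal."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(int(value))


print([format_int_cell(v) for v in [1.0, 2, float("nan"), None]])
```

In pandas, a similar effect can usually be obtained by converting the column to the nullable `Int64` dtype before calling `to_csv`.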


In [47]:
edges.shape

(658203, 5)

We have a total of 658,203 edges across all edge types!

## Save Nodes and Edges to File

In [49]:
!mkdir -p ./Data/upload

In [50]:
nodes.to_csv('./Data/upload/nodes.csv', index=False)

In [51]:
edges.to_csv('./Data/upload/edges.csv', index=False)

## Upload to S3 Bucket

In [57]:
bucket = '<bucket-name>'
prefix = 'fake-news-detection/data'
s3_client = boto3.client('s3')

In [58]:
# upload_file returns None and raises an exception on failure, so there is no response to capture
s3_client.upload_file('./Data/upload/nodes.csv', bucket, f"{prefix}/nodes.csv")
s3_client.upload_file('./Data/upload/edges.csv', bucket, f"{prefix}/edges.csv")

## Bulk Load to Neptune 

We use the `%load` magic command, which is available as part of the AWS `graph-notebook`, to bulk load the data into our Neptune database. You can use the `%graph_notebook_config` magic command to see information about the Neptune cluster associated with your graph-notebook, and the `%status` magic command to see the status of that cluster.

Note: Use [these CloudFormation templates](https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-quick-start.html) to quickly spin up a `graph-notebook` and an associated Neptune cluster, with all the configuration needed to work with Neptune ML in a `graph-notebook`.

In [63]:
s3_uri = f"s3://{bucket}/{prefix}"

In [None]:
%load -s {s3_uri} -f csv -p OVERSUBSCRIBE --run
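Behind the scenes, `%load` submits a job to Neptune's bulk loader. As a rough sketch of the kind of JSON body such a request carries, the following builds one by hand. The IAM role, the Region and the `failOnError` value are placeholders chosen for illustration and do not come from this notebook; the magic command fills in its own values from the notebook configuration.

```python
import json

bucket = "<bucket-name>"  # same placeholder as in the upload cell
prefix = "fake-news-detection/data"

loader_request = {
    "source": f"s3://{bucket}/{prefix}",      # folder holding nodes.csv and edges.csv
    "format": "csv",                          # Gremlin CSV, matches `-f csv`
    "iamRoleArn": "<neptune-load-role-arn>",  # placeholder, not from this notebook
    "region": "<aws-region>",                 # placeholder, not from this notebook
    "failOnError": "TRUE",                    # illustrative value
    "parallelism": "OVERSUBSCRIBE",           # matches `-p OVERSUBSCRIBE`
}

print(json.dumps(loader_request, indent=2))
```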

Once the above cell has completed, the data has been loaded into the cluster. We verify the data loaded correctly by running the traversals below to see the count of nodes and edges by label:

In [67]:
%%gremlin
g.V().groupCount().by(label).unfold().order().by(keys)

In [68]:
%%gremlin
g.E().groupCount().by(label).unfold().order().by(keys)

We can see that all nodes and edges have been loaded to the Neptune cluster!
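For reference, the counts to expect from these traversals follow from the numbers reported earlier in this notebook. The sketch below compares them with a query result. The helper `count_mismatches` and the `loaded_nodes` dictionary are stand-ins for illustration; in practice the actual counts come from the Gremlin output.

```python
# Counts reported earlier in this notebook.
expected_nodes = {"author": 126, "news": 182, "publisher": 28, "user": 15257}
expected_edges = {"follows": 634750, "spread_by": 22779}

# The other three edge labels (wrote, published, wrote_for) are only reported as part
# of the overall total of 658,203 edges, so only their combined count is known here.
remaining_edges = 658203 - sum(expected_edges.values())


def count_mismatches(expected, actual):
    """Return {label: (expected, actual)} for every label whose count differs."""
    return {
        label: (count, actual.get(label))
        for label, count in expected.items()
        if actual.get(label) != count
    }


# Stand-in for the result of the node groupCount query.
loaded_nodes = {"author": 126, "news": 182, "publisher": 28, "user": 15257}

print(remaining_edges)
print(count_mismatches(expected_nodes, loaded_nodes))
```

An empty dictionary from `count_mismatches` means every expected label was loaded with the expected count.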