# Ditchley S2DS project August 2020 - Pipeline D

## Team: Adam Hawken, Luca Lamoni, Elizabeth Nicholson, Robert Webster

This notebook (D_pipeline) will be dedicated to:
D1: Initializing the Neo4j Graph Database 
D2: Importing files into the Neo4j Graph Database
D3: Remove statistical outliers
D4: Graph Boosting
D5: Filter by Topic and keyword


## D.1 Initialize graph database

The database must be active before running this notebook; it can be started from Neo4j Desktop.

### D.1.1 Open database and set keyword

In [2]:
import numpy as np
import pandas as pd
import sys

# Set up working directory
# The working directory should reflect the structure of the Github repository https://github.com/S2DSLondon/Aug20_Ditchley
sys.path.insert(1, '/Users/adam/S2DS/GitHub/Aug20_Ditchley')
from src.graph_database import graphdb as gdb

#Set the keyword of interest
keyword = 'cybersecurity'

# load / declare the database
graph = gdb.get_graph(new_graph = True)

Neo4j import files need to be in a specific folder, but the CSV files saved earlier are in a different folder. To get around this on Windows machines, create a shortcut between the two folders; on Linux/macOS, create a symbolic link.
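
As a rough illustration of the symbolic-link approach, the sketch below links a data folder into an import folder using only the Python standard library. Both folders are temporary stand-ins created for the example; the real Neo4j import path depends on your installation and is not assumed here.

```python
import os
import tempfile

# temporary stand-ins for the real folders (both paths are hypothetical)
workspace = tempfile.mkdtemp()
data_dir = os.path.join(workspace, 'data', 'processed')
import_dir = os.path.join(workspace, 'neo4j_import')
os.makedirs(data_dir)
os.makedirs(import_dir)

# make the processed data visible inside the import folder as 'processed/'
link_path = os.path.join(import_dir, 'processed')
os.symlink(data_dir, link_path, target_is_directory=True)

print(os.path.islink(link_path))
```

With a link like this in place, relative file names such as `'processed/'+keyword+'_user_profiles.csv'` resolve from the import folder. On Windows, creating symbolic links may require elevated privileges.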

## D.2 Load files into the database

### D.2.1 Load in journalists

Journalists exist as (Person) nodes on the graph.

In [3]:
# load in user information
print('Loading in user information and drawing (Person) nodes')
fn_users = 'processed/'+keyword+'_user_profiles.csv'
gdb.load_users(fn_users ,graph)

Loading in user information and drawing (Person) nodes


### D.2.2 Load in journalists' friends

Friends exist as (Person) nodes on the graph. Journalists connect to friends by [FOLLOWS] edges.

In [4]:
# load in friend information
print('Loading in friends info and drawing [FOLLOWS] edges')
fn_friends = 'processed/'+keyword+'_journalist_friends.csv'
gdb.load_friends(fn_friends,graph,new=True)

Loading in friends info and drawing [FOLLOWS] edges


In [5]:
# upload profile information of friends
fn = 'processed/'+keyword+'_user_friends_profiles.csv'
gdb.load_existing_users(fn,graph) 
#gdb.load_existing_users('processed/'+keyword+'_all_profiles.csv',graph) 

### D.2.3 Load in tweets

Tweets exist as (Tweet) nodes on the graph. They are connected to the users who tweeted them via [POSTS] edges. If they mention someone in the graph then they connect to that user via a [MENTIONS] edge. If the tweet is a reply to another tweet in the graph then it is connected to that tweet via a [REPLIES_TO] edge.

In [6]:
# load in tweet information from twint
print('Loading in tweets and drawing (Tweet) nodes')
fn_tweets = 'processed/'+keyword+'_standard_tweets_twint.csv'
gdb.load_tweets(fn_tweets ,graph) 

Loading in tweets and drawing (Tweet) nodes


In [7]:
# load in tweet information from API
print('Loading in tweets and drawing (Tweet) nodes')
fn_tweets = 'processed/'+keyword+'_standard_tweets_api.csv'
gdb.load_tweets(fn_tweets ,graph) 

Loading in tweets and drawing (Tweet) nodes


In [8]:
# draw edges between users and their tweets
print('Drawing [POSTS] edges')
gdb.get_posts(graph)

Drawing [POSTS] edges


In [9]:
# load in mentions information
print('Loading in mentions and drawing [MENTIONS] edges')
fn_mentions = 'processed/'+keyword+'_mentions_twint.csv'
gdb.load_mentions(fn_mentions,graph)

Loading in mentions and drawing [MENTIONS] edges


From MENTIONS information we can draw [TALKS_ABOUT] edges between users. These have a weight equal to the number of times one user mentions another.

In [10]:
# Count mentions to draw [TALKS_ABOUT] edges
gdb.get_talk_about_edges(graph)
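
To make the [TALKS_ABOUT] weighting concrete, here is a minimal pandas sketch of the counting step, done outside the graph. The column names `screen_name` and `mentioned` are hypothetical, and `gdb.get_talk_about_edges` may well compute the same quantity differently (for example, directly in Cypher).

```python
import pandas as pd

# hypothetical mentions table: one row per (author, mentioned user) occurrence
mentions = pd.DataFrame({
    'screen_name': ['alice', 'alice', 'bob', 'alice'],
    'mentioned':   ['bob',   'bob',   'carol', 'carol'],
})

# weight = number of times one user mentions another
talks_about = (mentions
               .groupby(['screen_name', 'mentioned'])
               .size()
               .reset_index(name='weight'))
print(talks_about)
```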

### D.2.4 Load in topics

If we have some results from the topic modelling then we can include them in the graph.

In [11]:
# file containing a list of users and their associated topics
fn_topics = 'processed/user_name_topics_summed_10.csv'

# minimum threshold to link a user with a topic
threshhold = 0.02

Load in topics as (Topic) nodes and draw [TWEETS_ABOUT] edges between topics and the users whose weight for that topic passes the threshold.

In [None]:
gdb.load_topics(fn_topics,graph,threshhold)
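
The thresholding step can be pictured as follows. This is only a sketch: the layout of the topics file is not shown in this notebook, so the wide table below (one row per user, one column per topic weight) and the names `min_weight`, `topic` and `weight` are assumptions, as is the choice of a strict `>` comparison.

```python
import pandas as pd

# hypothetical wide table: one row per user, one column per topic weight
topics = pd.DataFrame({
    'screen_name':   ['alice', 'bob'],
    'Cybersecurity': [0.30, 0.01],
    'Politics':      [0.01, 0.50],
})
min_weight = 0.02

# reshape to one row per (user, topic) pair and keep pairs above the threshold;
# each surviving row corresponds to one [TWEETS_ABOUT] edge
pairs = topics.melt(id_vars='screen_name', var_name='topic', value_name='weight')
edges = pairs[pairs['weight'] > min_weight].reset_index(drop=True)
print(edges)
```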

## D.3 Remove statistical outliers

Celebrities and public figures may have millions of followers but only a handful of friends. Conversely, inactive or irregular Twitter users may have very few friends and followers. These profiles, which are not of interest to us, are often outliers in the statistical distribution of friends and followers.

### D.3.1 Friends & followers

Assuming that friend and follower counts are log-normally distributed, calculate the chi-squared of each user and remove the outliers.

In [None]:
# load in user metrics from file, alternatively one could download them from the graph
user_profiles = pd.read_csv('../data/processed/'+keyword+'_user_profiles.csv' )
user_friends_profiles = pd.read_csv('../data/processed/'+keyword+'_user_friends_profiles.csv' )
users_df = pd.concat([user_profiles,user_friends_profiles])
users_df = users_df.drop_duplicates().reset_index(drop=True)

In [None]:
# This function calculates the chi squared for each user
no_loners = gdb.get_chi2(users_df)

# We can then classify each user as an inlier or outlier based on their chi-squared
chi2_lim = 6.18
inliers = no_loners[no_loners['chi2']<chi2_lim]
# use >= so that a user sitting exactly on the limit is not left unclassified
outliers = no_loners[no_loners['chi2']>=chi2_lim]
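
The internals of `gdb.get_chi2` are not shown here. As a sketch of one way such a statistic could be built, the code below takes the logarithm of each count, standardises it, and sums the squared z-scores, treating the two counts as independent. The function name, the column names `friends_count` and `followers_count`, and the independence assumption are all hypothetical. If the statistic does have two degrees of freedom, a limit of 6.18 corresponds to roughly the 2-sigma (95.4%) level.

```python
import numpy as np
import pandas as pd

def lognormal_chi2(df, columns):
    """Sum of squared z-scores of log10(count) over the given columns."""
    # drop users with a zero count, whose logarithm is undefined
    positive = df[(df[columns] > 0).all(axis=1)].copy()
    logs = np.log10(positive[columns])
    z = (logs - logs.mean()) / logs.std(ddof=0)
    positive['chi2'] = (z ** 2).sum(axis=1)
    return positive

demo = pd.DataFrame({
    'screen_name':     ['a', 'b', 'c', 'd', 'e'],
    'friends_count':   [100, 120, 90, 0, 110],
    'followers_count': [200, 180, 220, 50, 5_000_000],
})
scored = lognormal_chi2(demo, ['friends_count', 'followers_count'])
print(scored[['screen_name', 'chi2']])
```

In this toy table the user with millions of followers receives the largest chi-squared, and the user with no friends is dropped before the fit.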

In [None]:
# add chi2 as a property to each node
gdb.add_property('chi2',no_loners,graph)

In [None]:
#excise outliers from database
gdb.excise_outliers(outliers['screen_name'],graph)

### D.3.2 H-index

Profiles with a very high H-index are often high-profile generalist accounts. Profiles with an H-index of zero or only a few do not elicit much interaction from other Twitter users and so are not interesting to us. Again, assuming that the H-index is log-normally distributed, we can calculate each user's position in the distribution and excise any outliers.
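
The H-index values used below are read from a file and their calculation is not part of this notebook. Assuming they follow the usual definition (the largest h such that h tweets each received at least h interactions, here taken to be likes plus retweets), a minimal sketch looks like this; the function name and the input format are hypothetical.

```python
def h_index(interaction_counts):
    """Largest h such that h tweets each have at least h interactions."""
    ranked = sorted(interaction_counts, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break
    return h

# likes + retweets for five tweets by one (made-up) user
print(h_index([10, 8, 5, 4, 3]))
```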

In [None]:
# load in data file containing H-index information
h_index = pd.read_csv('../data/processed/'+keyword+'_h_index_users.csv')

In [None]:
# add H-index as a property on the graph
# some entries have an H-index of -1, which is meaningless; keeping only positive values also drops zeros
h_index = h_index[h_index['h_index_like_retweets']>0]
gdb.add_property('h_index_like_retweets',h_index,graph)

In [None]:
# calculate chi2 for the H-index distribution
with_h_chi2 = gdb.get_chi2_H_index(h_index)

In [None]:
# get list of outliers
chi2_lim = 4.0
black_list = with_h_chi2[with_h_chi2['chi2']>chi2_lim]['screen_name']

In [None]:
black_list

In [None]:
# excise outliers from the graph
gdb.excise_outliers(black_list,graph)

## D.4 Graph Boosting and Graph Algorithms 

### D.4.1 Run PageRank

In [None]:
# run PageRank using [FOLLOWS] and [TALKS_ABOUT] edges
print('running page rank')
nodelist = ['Person']
edgelist = ['FOLLOWS','TALKS_ABOUT']
page_rank = gdb.run_pagerank(nodelist,edgelist,graph,new_native_graph=True)

In [None]:
print(page_rank[:10])
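
For readers unfamiliar with PageRank, the sketch below implements the textbook power-iteration form on a tiny directed edge list. It is purely illustrative: the function name and parameters are hypothetical, and the Neo4j implementation called through `gdb.run_pagerank` may differ in normalisation, edge weighting and convergence criteria.

```python
import numpy as np

def pagerank(edges, damping=0.85, n_iter=100):
    """Textbook PageRank by power iteration on a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)

    # M[j, i] is the probability of stepping from node i to node j
    M = np.zeros((n, n))
    for source, target in edges:
        M[index[target], index[source]] += 1.0
    out_degree = M.sum(axis=0)
    dangling = out_degree == 0
    M[:, ~dangling] /= out_degree[~dangling]
    # nodes with no outgoing edges spread their score uniformly
    M[:, dangling] = 1.0 / n

    rank = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        rank = (1 - damping) / n + damping * M @ rank
    return dict(zip(nodes, rank))

# 'a' and 'b' both follow 'c'; 'c' follows 'a'
ranks = pagerank([('a', 'c'), ('b', 'c'), ('c', 'a')])
print(sorted(ranks, key=ranks.get, reverse=True))
```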

It is possible to run the Neo4j graph algorithms in such a way that they automatically write new properties to the nodes. However, here we shall write these properties manually using our `add_property` function.

In [None]:
gdb.add_property('rank',page_rank,graph)

### D.4.2 Monte Carlo Graph Boosting

Attempting to find all the friends of friends may result in downloading hundreds of thousands or millions of profiles, because the network grows roughly exponentially with each additional degree of separation. We can avoid this by selecting a random sample of users in our database and checking whether they follow anyone else in our database. We can weight this random selection by, for example, their previously determined rank or the number of friends or followers they have. By repeating this process several times we can build complexity into our graph.

In [None]:
# run boosting

# number of boosting iterations
niter = 5

# number of samples to be drawn on each iteration
nsample = 5

# field(s) to be weighted 
fields = ['rank']

# strength of weights (-ve to downweight)
exponents = [2]

# arguments for twint
kwargs = {'n_retries':2,
         'suppress':False}

# package pagerank parameters into tuple
pagerank_params = nodelist, edgelist, graph

# run boosting, now would be a good time to make a cup of tea
gdb.boost_graph(niter,nsample,fields,exponents,pagerank_params,keyword,kwargs)
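
The weighted random selection at the heart of each boosting iteration can be sketched as below. This is an assumption about the general idea rather than a description of `gdb.boost_graph`: the function name and signature are hypothetical, and weights are taken to be the product of each field raised to its exponent.

```python
import numpy as np

def weighted_sample(names, field_values, exponents, nsample, seed=None):
    """Draw nsample distinct users with probability proportional to prod(field**exponent)."""
    rng = np.random.default_rng(seed)
    weights = np.ones(len(names))
    for values, exponent in zip(field_values, exponents):
        # a positive exponent favours large values, a negative one downweights them
        weights *= np.asarray(values, dtype=float) ** exponent
    probabilities = weights / weights.sum()
    return rng.choice(names, size=nsample, replace=False, p=probabilities)

names = ['a', 'b', 'c', 'd']
ranks = [0.4, 0.3, 0.2, 0.1]
sample = weighted_sample(names, [ranks], [2], nsample=2, seed=0)
print(sample)
```

In a real iteration, the friends of each sampled user would then be fetched and any [FOLLOWS] edges to users already in the graph would be drawn before recomputing the rank.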

## D.5 Filter by topic and keyword

### D.5.1 Filter graph by keywords

Look for keywords in the bio and screen name of friends, and identify the users who do not have any of these keywords so that they can be removed.

This is a brute-force approach to identifying users associated with a topic. It can be used in conjunction with or instead of the topic modelling. For example, one may select the list of keywords based on an analysis of hashtags.

In [None]:
keywords = ['tech','security','artificial','machine', 'cyber', 'computer','code','hack']
not_techies = gdb.filter_users_by_keywords(keywords,graph,without=True)
print(len(not_techies))
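
Outside the graph, the same kind of keyword filter can be sketched with pandas. The function name and the column name `description` (standing in for the bio) are hypothetical, and the matching here is a simple case-insensitive substring test, which may not be what `gdb.filter_users_by_keywords` does.

```python
import re
import pandas as pd

def users_without_keywords(users, keywords, columns=('screen_name', 'description')):
    """Return the rows where none of the keywords appear in any of the given columns."""
    pattern = '|'.join(re.escape(keyword) for keyword in keywords)
    has_keyword = pd.Series(False, index=users.index)
    for column in columns:
        text = users[column].fillna('')
        has_keyword |= text.str.contains(pattern, case=False, regex=True)
    return users[~has_keyword]

demo_users = pd.DataFrame({
    'screen_name': ['cyberjane', 'bob', 'carol'],
    'description': [None, 'Ethical HACKER', 'gardening'],
})
print(users_without_keywords(demo_users, ['cyber', 'hack'])['screen_name'].tolist())
```

Substring matching is deliberately loose: a keyword such as `'tech'` also matches `'technician'`, so the keyword list deserves some care.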

In [None]:
# excise uninteresting profiles
gdb.excise_outliers(not_techies['screen_name'],graph)

### D.5.2 Filter by topic

Use the results of the topic modelling to get a list of users who tweet about a given topic regularly. Users who don't regularly tweet about this topic can be excised from the database.

In [None]:
# get list of users who DO talk about a topic
topic = 'Cybersecurity'
topical = gdb.filter_by_topic(topic,graph)

In [None]:
# have a look at a few entries
print(topical[:10])
print(len(topical))

In [None]:
# get list of users who DON'T talk about this topic
untopical = users_df[~users_df['screen_name'].isin(topical['screen_name'])]['screen_name']

In [None]:
# have a look at a few entries
print(untopical[:10])

In [None]:
#excise untopical users
gdb.excise_outliers(untopical,graph)

### D.5.3 Run PageRank again

Run PageRank again on the filtered graph to rank the remaining users within the topic.

In [None]:
# run Page rank using follower edges
print('running page rank')
nodelist = ['Person']
edgelist = ['FOLLOWS']
page_rank = gdb.run_pagerank(nodelist,edgelist,graph,new_native_graph=True)


In [None]:
print(page_rank[:20])