# Creation of the graph database and production of statistics

In this notebook we will create a graph database from Twitter data and run ranking and clustering algorithms. The notebook is written in Python 3, but the Python is just a wrapper for Cypher, the Neo4j query language.

## Install and configure Neo4j

Before running this notebook, you need to have the Neo4j Desktop development environment up and running. Neo4j Desktop can be downloaded from here:

https://neo4j.com/download/?ref=try-neo4j-lp

There is also an online "sandbox" version, which is quite useful because it contains an example of Twitter analytics. Neo4j can also be run from a Cypher terminal, which comes with the desktop installation. We need to install the APOC and Graph Data Science Library plugins. This can be done via the desktop app.

On the desktop, create a new database by clicking "add database" and selecting "create local database". We shall name our database "Cyber Journalists" and give it the password "tweetoftheday". Then click "create", then "start"; a new window will open.

If your installation is anything like mine this will create a database in an obscure location on your computer. On my machine it has this address:

/Users/adam/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-31caf974-2ca6-4ba8-96ad-0adde23efd92/installation-4.1.0

For convenience the first thing we'll do is move into this directory. If anyone can figure out how to tell it to put the database in a sensible location please let me know :)

In [20]:
%cd "/Users/adam/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-31caf974-2ca6-4ba8-96ad-0adde23efd92/installation-4.1.0"

/Users/adam/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-31caf974-2ca6-4ba8-96ad-0adde23efd92/installation-4.1.0


Later we will want to be able to read in and write to different file formats. To do this we need to locate the configuration file and add the following couple of lines to location/conf/neo4j.conf

Next we need to install py2neo in order to wrap cypher commands inside a python script.

In [21]:
!pip install py2neo



In theory, pretty much everything can be done with this package, but I did not find the syntax to be particularly useful. Instead we will give Neo4j commands in Cypher, the graph database query language. The first thing we need to do is to get Python to log into the database.

In [35]:
# import standard libraries
import numpy as np
import pandas as pd

from py2neo import Graph
from py2neo.data import Node, Relationship

# load / declare the database
graph = Graph("bolt://localhost:7687", user="neo4j", password="tweetoftheday")
graph.begin()

<py2neo.database.Transaction at 0x11519a978>

In [23]:
# start with an empty graph, obviously don't run this if you already have stuff in there you don't want to delete
graph.delete_all()

Now we will start giving Cypher commands. In this notebook we will write them as strings and then pass them to Neo4j using py2neo. We could also run them in the desktop or the Cypher shell.

First we want to load in our tweet information. To do this we need to put the file containing the tweets in the location/import directory.

In [24]:
!mv ~/Downloads/standardised_cyber_tweets.csv import/

mv: /Users/adam/Downloads/standardised_cyber_tweets.csv: No such file or directory


## Create nodes representing tweets and users

Below, we use three Cypher clauses. The first loads the file. The second creates nodes representing tweets. The third creates nodes representing people. Note that "CREATE" and "MERGE" behave differently. "CREATE" always makes a new node, even if an identical node already exists, so re-running the cell duplicates the tweets. "MERGE" first looks for a node matching the label and every property in the pattern and creates one only if no such node exists; it does not update an existing node unless an "ON MATCH SET" clause is added.
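
To make the CREATE/MERGE distinction concrete, here is a toy in-memory model of the two behaviours. This is plain Python, not Neo4j or py2neo code: the `create` and `merge` helpers and the dict-based "nodes" are hypothetical stand-ins, and the sketch only mimics how the two clauses treat an existing node.

```python
# Toy model of CREATE vs MERGE semantics (illustration only, not Neo4j code).
# `create` and `merge` are hypothetical helpers; nodes are plain dicts.

def create(nodes, label, props):
    """CREATE: always append a new node, even if an identical one exists."""
    node = {"label": label, **props}
    nodes.append(node)
    return node


def merge(nodes, label, props):
    """MERGE: reuse a node whose label and listed properties all match,
    otherwise create one. A matched node is returned unchanged."""
    for node in nodes:
        if node["label"] == label and all(node.get(k) == v for k, v in props.items()):
            return node
    return create(nodes, label, props)


nodes = []
create(nodes, "Tweet", {"tweet_id": "1"})
create(nodes, "Tweet", {"tweet_id": "1"})          # a second, duplicate node
merge(nodes, "Person", {"screen_name": "alice"})
merge(nodes, "Person", {"screen_name": "alice"})   # matched, nothing added
# A pattern with an extra property finds no full match, so a new node appears.
merge(nodes, "Person", {"screen_name": "alice", "name": "Alice"})

tweet_count = sum(1 for n in nodes if n["label"] == "Tweet")
person_count = sum(1 for n in nodes if n["label"] == "Person")
print(tweet_count, person_count)  # 2 2
```

The last call is the one to watch out for: merging on a long list of properties only finds an existing node if every one of them matches, which is why the later cells merge people on `screen_name` alone.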

In [25]:
# load in tweets and twitter user information
query_string = '''
   LOAD CSV WITH HEADERS FROM "file:///standardised_cyber_tweets.csv" AS row
   
   CREATE (t:Tweet {tweet_id: row.tweet_id, conversation_id: row.conversation_id, user_id: row.user_id, 
   reply_to: row.reply_to, tweet_created_at_date: row.tweet_created_at_date, 
   tweet_created_at_time: row.tweet_created_at_time, text: row.text, replies_count: row.replies_count, 
   retweets_count: row.retweets_count, favourite_count: row.favourite_count, likes_count: row.likes_count,
   hashtags: row.hashtags, topics: row.topics})
   
   MERGE (p:Person {user_id: row.user_id, screen_name: row.screen_name, name: row.name, 
   user_description: row.user_description, user_friends_n: row.user_friends_n, user_followers_n: row.user_followers_n, 
   prof_created_at: row.prof_created_at, favourites_count: row.favourites_count, verified: row.verified, 
   statuses_count: row.statuses_count});
   '''
# run cypher query
graph.run(query_string)

<py2neo.database.Cursor at 0x1151950b8>

## Draw edges 

Now we will use "MATCH" to find who tweeted what and draw edges between people and tweets.

In [26]:
query_string = '''// Create edges linking tweeters and tweets
            MATCH (t:Tweet), (p:Person)
            WHERE t.user_id = p.user_id
            MERGE (p)-[:POSTS]->(t)'''
graph.run(query_string)

<py2neo.database.Cursor at 0x115195550>

## Load follower information

Move the friends file to the import directory. This file should be a CSV file containing two columns with headings:

screen_name,friend

Depending on your machine this file may be too big to load in all at once. This can be avoided by splitting the file up into smaller files. Just make sure you put the headings at the top of each file.
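
If you would rather not split the file by hand, a short script can do it. The following is a minimal sketch using only the standard library; the function name `split_csv`, the default of 100,000 rows per file and the `<name>_<n>.csv` naming scheme are my own choices for illustration, not something Neo4j requires.

```python
import csv
import tempfile
from itertools import islice
from pathlib import Path


def split_csv(src, out_dir, rows_per_file=100_000):
    """Split `src` into numbered CSV files, repeating the header in each one."""
    src, out_dir = Path(src), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with src.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        while True:
            # Take the next block of rows; an empty block means we are done.
            chunk = list(islice(reader, rows_per_file))
            if not chunk:
                break
            path = out_dir / f"{src.stem}_{len(written) + 1}.csv"
            with path.open("w", newline="", encoding="utf-8") as out:
                writer = csv.writer(out)
                writer.writerow(header)
                writer.writerows(chunk)
            written.append(path)
    return written


# Demo on a throwaway file: 5 data rows split into chunks of 2 rows.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "friends.csv"
    rows = "".join(f"user{i},friend{i}\n" for i in range(5))
    src.write_text("screen_name,friend\n" + rows, encoding="utf-8")
    parts = split_csv(src, Path(tmp) / "chunks", rows_per_file=2)
    part_names = [p.name for p in parts]
    first_lines = [p.read_text(encoding="utf-8").splitlines()[0] for p in parts]

print(part_names)   # ['friends_1.csv', 'friends_2.csv', 'friends_3.csv']
print(first_lines)  # every chunk starts with 'screen_name,friend'
```

Each output file can then be loaded with its own `LOAD CSV` query, one after the other.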

In [27]:
query_string = '''// match links between users and friends in the database; if a friend does not exist then create it
            LOAD CSV WITH HEADERS FROM "file:///cyber_journalist_friends_2.csv" AS row
            MATCH (p:Person) WHERE p.screen_name = row.screen_name
            MERGE (n:Person {screen_name: row.friend})
            MERGE (p)-[:FOLLOWS]->(n)'''
graph.run(query_string)

<py2neo.database.Cursor at 0x115195898>

## Load mentions

Move the mentions file to the import directory. This file should be a CSV file containing two columns with headings:

tweet_id,mentions

In [30]:
query_string = '''// match link between tweets and mentions
            LOAD CSV WITH HEADERS FROM "file:///tst.csv" AS row
            MATCH (t:Tweet) WHERE t.tweet_id = row.tweet_id
            MERGE (p:Person {screen_name: row.mentions})
            MERGE (t)-[:MENTIONS]->(p)'''
graph.run(query_string)

<py2neo.database.Cursor at 0x1151985c0>

We now have everything we need to start running our network analysis algorithms. If you'd like to visualize the graph then please scroll down to the graph visualization section.

## Centrality and clustering

In this section we will run the PageRank algorithm to measure the centrality of users and the Louvain algorithm to look for communities within the graph. These algorithms are part of the Graph Data Science plugin that we installed earlier. If the plugins are not installed for this database, you will need to install them and restart the database.

To start with we must create a "named graph", which lists the components of the graph that we want to consider when running our algorithms. We'll start by just looking at the followers, but this can be extended later.

In [36]:
query_string = '''// to run gds algorithms one needs to create a named graph
                CALL gds.graph.create(
                'my-native-graph',
                'Person',
                'FOLLOWS'
                )
                YIELD graphName, nodeCount, relationshipCount, createMillis'''
graph.run(query_string)

<py2neo.database.Cursor at 0x11519a860>

### PageRank

All the graph algorithms in GDS have the same syntax(ish) which makes them easy to use. 

By running PageRank on the Person nodes linked only by the FOLLOWS edges, we rank the journalists' friends by how many journalists follow them, with each follower's contribution weighted by how well followed that journalist is within the ensemble of journalists.
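
To see what this weighting does, here is a small pure-Python sketch of PageRank on a toy follower graph. It assumes the non-normalised formula quoted in the GDS documentation, score = (1 - d) + d * sum(score(u) / out_degree(u)) over followers u, with damping factor d = 0.85 and 20 iterations. The function and the four account names are hypothetical, and this is an illustration rather than the library's implementation.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Non-normalised PageRank by repeated substitution.

    `edges` is a list of (follower, followed) pairs.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    followers = {n: [] for n in nodes}
    for src, dst in edges:
        out_degree[src] += 1
        followers[dst].append(src)

    # Every node starts at the baseline score 1 - d.
    scores = {n: 1.0 - damping for n in nodes}
    for _ in range(iterations):
        scores = {
            n: (1.0 - damping)
            + damping * sum(scores[u] / out_degree[u] for u in followers[n])
            for n in nodes
        }
    return scores


# ann and bob follow carol; carol follows dave.
edges = [("ann", "carol"), ("bob", "carol"), ("carol", "dave")]
scores = pagerank(edges)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:6s} {score:.5f}")
# dave   0.49425
# carol  0.40500
# ann    0.15000
# bob    0.15000
```

Note that dave comes out on top with a single follower, because that follower is herself well followed, while accounts that nobody in the graph follows stay at the baseline 1 - d = 0.15.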

In [89]:
query_string = '''// run pagerank and return the ten highest scoring accounts
CALL gds.pageRank.stream('my-native-graph') 
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).screen_name AS screen_name, score
ORDER BY score DESC, screen_name ASC LIMIT 10'''
ans = graph.run(query_string)

We may want to do some Python analysis on these results, so let's put them in a dataframe.

In [90]:
df = pd.DataFrame.from_records(ans, columns=['screen name', 'rank'])

print(df)

       screen name      rank
0      jimwaterson  0.150876
1  MalwareTechBlog  0.150768
2     ProfWoodward  0.150723
3        ruskin147  0.150723
4    openDemocracy  0.150684
5       DanRaywood  0.150676
6        DaveLeeFT  0.150623
7       ForbesTech  0.150623
8        LeoKelion  0.150623
9      Scott_Helme  0.150623


If we want to look for people for whom we don't yet have the full information, we can slightly modify this code to exclude users already in the database.

In [100]:
query_string = '''// run pagerank and get the 100 highest scoring accounts without user information in the database
                CALL gds.pageRank.stream('my-native-graph') 
                YIELD nodeId, score
                WHERE NOT EXISTS(gds.util.asNode(nodeId).name)
                RETURN gds.util.asNode(nodeId).screen_name AS screen_name, score, gds.util.asNode(nodeId).name AS name
                ORDER BY score DESC, screen_name ASC LIMIT 100'''
ans = graph.run(query_string)

### Louvain communities

The Louvain method is an algorithm to detect communities in large networks. It maximizes a modularity score for each community, where the modularity quantifies the quality of an assignment of nodes to communities. This means evaluating how much more densely connected the nodes within a community are, compared to how connected they would be in a random network.

The Louvain algorithm is a hierarchical clustering algorithm that recursively merges communities into a single node and executes the modularity clustering on the condensed graphs.
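
The modularity score itself is easy to compute by hand for a small graph. The sketch below uses the standard undirected definition, Q = sum over communities of L_c / m - (d_c / 2m)^2, where m is the number of edges, L_c the number of edges inside community c and d_c the total degree of its nodes. It only scores a given partition and does not perform the Louvain optimisation; the toy graph and the function name are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    `edges` is a list of (u, v) pairs, `community` maps node -> community id.
    """
    m = len(edges)
    internal = defaultdict(int)  # L_c: edges with both ends in community c
    degree = defaultdict(int)    # d_c: summed degree of the nodes in c
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles (a-b-c and d-e-f) joined by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
lumped = {node: 0 for node in split}

q_split = modularity(edges, split)
q_lumped = modularity(edges, lumped)
print(round(q_split, 4))   # 0.3571
print(round(q_lumped, 4))  # 0.0
```

Splitting the graph into its two triangles scores higher than lumping everything together, which is the kind of assignment Louvain searches for.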

In [106]:
query_string = '''// run Louvain algorithm to identify communities, return 10
CALL gds.louvain.stream('my-native-graph', {includeIntermediateCommunities: true}) 
YIELD nodeId, communityId, intermediateCommunityIds
RETURN gds.util.asNode(nodeId).screen_name AS screen_name, communityId, intermediateCommunityIds
ORDER BY communityId ASC, screen_name ASC LIMIT 10'''
ans = graph.run(query_string)
df = pd.DataFrame.from_records(ans, columns=['screen_name', 'communityId','intermediateCommunityIds'])
print(df)

      screen_name  communityId                  intermediateCommunityIds
0       leokelion            2            [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
1    sophiafurber            7            [7, 7, 7, 7, 7, 7, 7, 7, 7, 7]
2    scfgallagher            8            [8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
3  mshannahmurphy            9            [9, 9, 9, 9, 9, 9, 9, 9, 9, 9]
4   jesscahaworth           10  [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
5    ad_nauseum74           11  [11, 11, 11, 11, 11, 11, 11, 11, 11, 11]
6   DaveColePhoto           24  [21, 21, 21, 21, 21, 21, 21, 21, 23, 24]
7      EliseKapNM           24  [15, 15, 15, 15, 16, 18, 20, 21, 23, 24]
8  HashemOsseiran           24  [13, 13, 14, 15, 16, 18, 20, 21, 23, 24]
9    RobaHusseini           24  [12, 13, 14, 15, 16, 18, 20, 21, 23, 24]


We can sort the groups by the number of members.

In [107]:
query_string = '''// count how many members each community has
CALL gds.louvain.stream('my-native-graph')
YIELD nodeId, communityId
RETURN communityId, COUNT(DISTINCT nodeId) AS members
ORDER BY members DESC LIMIT 10'''
ans = graph.run(query_string)

df = pd.DataFrame.from_records(ans, columns=['communityId', 'number of members'])
print(df)

   communityId  number of members
0         2856                 11
1          100                 11
2           24                 11
3          214                 11
4         2904                 11
5          192                 11
6            2                  1
7            9                  1
8            8                  1
9            7                  1


## Other bits and bobs

### Ordering tweets

Depending on the source of data (the Twitter API or twint) tweets either contain a "conversation_id" or a "reply_to_status_id". These are different but related numbers. The "conversation_id" is the id of the first tweet in the conversation, the "reply_to_status_id" is the id of the tweet in the conversation preceding the current tweet. Tweets also contain timestamp information, thus even if we don't have the "reply_to_status_id" we can order tweets in the conversation.
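
The ordering idea can be sketched in plain Python before we express it in Cypher. The example assumes that dates and times are stored as ISO-style strings ("YYYY-MM-DD" and "HH:MM:SS"), so that sorting them as text is the same as sorting them chronologically; the function name and the sample tweets are made up for illustration.

```python
from collections import defaultdict


def conversation_chains(tweets):
    """Return (earlier_id, later_id) pairs linking consecutive tweets
    within each conversation, ordered by date and then time."""
    by_conversation = defaultdict(list)
    for tweet in tweets:
        by_conversation[tweet["conversation_id"]].append(tweet)

    links = []
    for thread in by_conversation.values():
        # Date and time must be compared together as one sort key.
        thread.sort(key=lambda t: (t["tweet_created_at_date"],
                                   t["tweet_created_at_time"]))
        links.extend((a["tweet_id"], b["tweet_id"])
                     for a, b in zip(thread, thread[1:]))
    return links


tweets = [
    {"tweet_id": "3", "conversation_id": "1",
     "tweet_created_at_date": "2020-07-02", "tweet_created_at_time": "08:15:00"},
    {"tweet_id": "1", "conversation_id": "1",
     "tweet_created_at_date": "2020-07-01", "tweet_created_at_time": "21:30:00"},
    {"tweet_id": "2", "conversation_id": "1",
     "tweet_created_at_date": "2020-07-01", "tweet_created_at_time": "22:05:00"},
    {"tweet_id": "9", "conversation_id": "9",
     "tweet_created_at_date": "2020-07-01", "tweet_created_at_time": "10:00:00"},
]
links = conversation_chains(tweets)
print(links)  # [('1', '2'), ('2', '3')]
```

Tweet 3 was posted the next morning, so its time of day is earlier than that of tweets 1 and 2 even though it comes last; comparing the time on its own would put it in the wrong place.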

In [None]:
# start by finding tweets in the same conversation by matching conversation_id
query_string = '''// Find tweets in the same conversation
            MATCH (t1:Tweet), (t2:Tweet)
            WHERE t1.conversation_id = t2.conversation_id
            AND t1.tweet_id <> t2.tweet_id
            AND (t1.tweet_created_at_date < t2.tweet_created_at_date
                 OR (t1.tweet_created_at_date = t2.tweet_created_at_date
                     AND t1.tweet_created_at_time < t2.tweet_created_at_time))
            MERGE (t1)-[:CONVERSATION]->(t2)'''
graph.run(query_string)

In [None]:
# conversations should be a chain rather than a mesh, although very occasionally tweets may have the same time stamp
query_string = '''// Find the direction of the conversation
            MATCH (t1)-[c1:CONVERSATION]->(t2), (t1)-[c2:CONVERSATION]->(t3) 
            WHERE t2.tweet_created_at_date < t3.tweet_created_at_date
            DELETE c2;'''
graph.run(query_string)

query_string = '''MATCH (t1)-[c1:CONVERSATION]->(t2), (t1)-[c2:CONVERSATION]->(t3) 
            WHERE t2.tweet_created_at_time < t3.tweet_created_at_time
            DELETE c2;'''
graph.run(query_string)

### Weighted interactions

As discussed, we may want to develop our ranking algorithm using, for example, weighted edges. Here we will add edges representing interactions between users weighted by the number of times they are in the same conversation.
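
As a sanity check on what the weight means, the same count can be written in a few lines of Python. This sketch assumes we already have the conversation links as (earlier_tweet, later_tweet) pairs and a lookup from tweet id to user id; both inputs and the function name are hypothetical.

```python
from collections import Counter


def interaction_weights(links, author_of):
    """Count conversation links between each unordered pair of distinct users."""
    weights = Counter()
    for earlier, later in links:
        a, b = author_of[earlier], author_of[later]
        if a != b:  # ignore people replying to themselves
            weights[tuple(sorted((a, b)))] += 1
    return weights


links = [("1", "2"), ("2", "3"), ("3", "4")]
author_of = {"1": "u1", "2": "u2", "3": "u1", "4": "u1"}
weights = interaction_weights(links, author_of)
print(dict(weights))  # {('u1', 'u2'): 2}
```

Sorting each pair makes the count undirected, in the same spirit as the undirected INTERACTION edge.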

In [None]:
query_string = '''// make edges between journalists, weighted by the number of interactions
            MATCH path=(t1)-[:CONVERSATION]-(t2)
            WHERE t1.user_id <> t2.user_id
            MATCH (p1:Person), (p2:Person)
            WHERE p1.user_id = t1.user_id AND p2.user_id = t2.user_id
            WITH p1,p2, COUNT(path) AS weight
            MERGE (p1)-[i:INTERACTION]-(p2)
            SET i.strength = weight'''
graph.run(query_string)