# Part D

In [130]:
import sys
import random
from pprint import pprint as pp
random.seed(42)
sys.version

'3.7.4 (default, Oct 15 2019, 22:29:14) \n[GCC 7.4.0]'

In [131]:
import neo4j
import py2neo
print(neo4j.__version__)
print(py2neo.__version__)

1.7.6
4.3.0


In [111]:
%load_ext cypher
# https://ipython-cypher.readthedocs.io/en/latest/

The cypher extension is already loaded. To reload it, use:
  %reload_ext cypher


In [132]:
from neo4j import GraphDatabase
from py2neo import Graph

# instantiate drivers
NEO4J_URI="bolt://localhost:7687"
gdb = GraphDatabase.driver(uri=NEO4J_URI, auth=None)
graph = Graph(NEO4J_URI)

The graph has the following structure:

![graph](./schemas/dblp_slim_after/graph.png)

In [133]:
graph.run("CALL algo.list();").data()[0]

{'name': 'algo.allShortestPaths.stream',
 'description': "CALL algo.allShortestPaths.stream(weightProperty:String{nodeQuery:'labelName', relationshipQuery:'relationshipName', defaultValue:1.0, concurrency:4}) YIELD sourceNodeId, targetNodeId, distance - yields a stream of {sourceNodeId, targetNodeId, distance}",
 'signature': 'algo.allShortestPaths.stream(propertyName :: STRING?, config = {} :: MAP?) :: (sourceNodeId :: INTEGER?, targetNodeId :: INTEGER?, distance :: FLOAT?)',
 'type': 'procedure'}

### Define research communities

The first thing to do is to find/define the research communities. A community is
defined by a set of keywords. Assume that the database community is defined through
the following keywords: 

> data management, indexing, data modeling, big data, data
processing, data storage and data querying.

Since these keywords are not yet in the database, we have to create them.

In [135]:
# Splitting on commas only keeps the final "data storage and data querying" as a single keyword
dbms_kw = [kw.strip() for kw in 'data management, indexing, data modeling, big data, data processing, data storage and data querying'.split(',')]

In [136]:
dbms_kw

['data management',
 'indexing',
 'data modeling',
 'big data',
 'data processing',
 'data storage and data querying']
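
The list above has six entries because the last two keywords are joined by "and" rather than a comma, while the Cypher queries later in this notebook filter on 'data storage' and 'data querying' separately. A minimal sketch of one way to get seven separate keywords, assuming that splitting on the standalone word "and" is acceptable for this particular string (the names `RAW_KEYWORDS` and `split_kw` are illustrative and not part of the original notebook):

```python
import re

RAW_KEYWORDS = (
    "data management, indexing, data modeling, big data, "
    "data processing, data storage and data querying"
)

# Split on commas and on the standalone word "and", then drop empty pieces.
split_kw = [kw.strip() for kw in re.split(r",|\band\b", RAW_KEYWORDS) if kw.strip()]
print(split_kw)
```

This only works here because no keyword itself contains the word "and"; a hand-written list would be the safer choice in general.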

In [137]:
for kw in dbms_kw:
    print(graph.run("CREATE (kw:Keyword {name: $kw}) RETURN kw", kw=kw).data())

[{'kw': (_32301:Keyword {name: 'data management'})}]
[{'kw': (_32328:Keyword {name: 'indexing'})}]
[{'kw': (_32366:Keyword {name: 'data modeling'})}]
[{'kw': (_32367:Keyword {name: 'big data'})}]
[{'kw': (_32372:Keyword {name: 'data processing'})}]
[{'kw': (_32425:Keyword {name: 'data storage and data querying'})}]


#### Assign papers to these keywords randomly

In [138]:
article_ids = graph.run("MATCH (a:Article) RETURN a.id").data()
article_ids[:5]

[{'a.id': '8fb9c95bf34a0f28dc05819cb4aada0cb94fe555'},
 {'a.id': '5cfdb256b6ae968374469bd36702ed341cfe9485'},
 {'a.id': '0fbe46932967ec0db80b18e70fa199fb652313ea'},
 {'a.id': 'c4062742b4e0d13cfa0e992fdf2cebf2eb71c415'},
 {'a.id': 'f218ce53248d756db61726985f73e6e8c109b3e2'}]

In [139]:
for aid in article_ids:
    for kw in random.sample(dbms_kw, random.randint(0, len(dbms_kw))):
        graph.run(
            """MATCH (a:Article), (kw:Keyword)
            WHERE a.id = $aid AND kw.name = $kwname
            MERGE (a)-[:CONTAINS]->(kw)
            ON CREATE SET kw.name = $kwname, kw.fake = true
            ON MATCH SET kw.name = $kwname, kw.fake = true""",
            aid = aid["a.id"],
            kwname = kw
        )
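
The loop above issues one query per (article, keyword) pair. As a hedged alternative, the pairs could be sampled in Python first and then written in a single `UNWIND` query. The sketch below assumes `graph` is a py2neo `Graph`; the helper names `sample_assignments` and `write_assignments` and the toy ids are hypothetical.

```python
import random

def sample_assignments(article_ids, keywords, rng):
    """Return one {'aid', 'kw'} row per (article, keyword) pair to create."""
    rows = []
    for aid in article_ids:
        # randint is inclusive on both ends, so an article may get no keyword at all
        k = rng.randint(0, len(keywords))
        # sample() draws without replacement, so no keyword repeats per article
        for kw in rng.sample(keywords, k):
            rows.append({"aid": aid, "kw": kw})
    return rows

BATCH_QUERY = """
UNWIND $rows AS row
MATCH (a:Article {id: row.aid}), (kw:Keyword {name: row.kw})
MERGE (a)-[:CONTAINS]->(kw)
SET kw.fake = true
"""

def write_assignments(graph, rows):
    # Assumption: `graph` is a py2neo Graph; not called in this sketch
    graph.run(BATCH_QUERY, rows=rows)

# Toy ids and keywords, only to show the shape of the rows
rows = sample_assignments(["a1", "a2", "a3"], ["indexing", "big data"], random.Random(42))
print(rows)
```

Sampling separately from writing also makes the random assignment easy to inspect before anything touches the database.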

Next, we need to find the conferences and journals related to the database community
(i.e., those specific to the field of databases). We assume that a conference or journal
is related to the community if at least 90% of the papers published in it contain one of
the community's keywords.
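
The 90% rule can be illustrated without the database. The following is a small sketch on made-up data, assuming each venue maps to a list of per-article keyword lists; `related_venues` and the toy venues are hypothetical and only mirror the threshold described above.

```python
def related_venues(venue_articles, community_keywords, threshold=0.9):
    """Return venues where the share of articles with a community keyword meets the threshold."""
    keywords = set(community_keywords)
    related = []
    for venue, articles in venue_articles.items():
        if not articles:
            continue  # a venue without articles cannot satisfy the rule
        # An article counts once, however many community keywords it contains
        hits = sum(1 for article_kws in articles if keywords & set(article_kws))
        if hits / len(articles) >= threshold:
            related.append(venue)
    return related

toy = {
    "Venue A": [["indexing"]] * 9 + [[]],  # 9 of 10 articles match
    "Venue B": [["big data"], []],         # 1 of 2 articles match
    "Venue C": [],                         # no articles
}
result = related_venues(toy, ["indexing", "big data"])
print(result)
```

Note that the ratio is taken over distinct articles, which is why the graph version of this check needs a distinct article count rather than a count of keyword relationships.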

#### Create a research community

In [140]:
graph.run("MATCH (rc:ResearchCommunity) DETACH DELETE rc").stats()

constraints_added: 0
constraints_removed: 0
contained_updates: False
indexes_added: 0
indexes_removed: 0
labels_added: 0
labels_removed: 0
nodes_created: 0
nodes_deleted: 0
properties_set: 0
relationships_created: 0
relationships_deleted: 0

In [141]:
graph.run("MERGE (rc:ResearchCommunity {name: 'databases', fake: true}) RETURN rc").stats()

constraints_added: 0
constraints_removed: 0
contained_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 1
labels_removed: 0
nodes_created: 1
nodes_deleted: 0
properties_set: 2
relationships_created: 0
relationships_deleted: 0

In [129]:
%%cypher
MATCH (kw:Keyword)
WHERE kw.name IN ['data management', 'indexing', 'data modeling', 'big data', 'data processing', 'data storage', 'data querying']
WITH kw
MATCH (rc:ResearchCommunity)
WHERE rc.name = "databases"
MERGE p=(kw)-[r:RELATED_TO]-(rc)
RETURN p
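
The keyword list in the query above is hard-coded and differs from the `dbms_kw` list built earlier, where 'data storage and data querying' is a single entry. One hedged way to keep the two in sync is to pass the list as a query parameter. The sketch assumes `graph` is a py2neo `Graph`; `LINK_QUERY` and `link_keywords` are hypothetical names.

```python
LINK_QUERY = """
MATCH (kw:Keyword)
WHERE kw.name IN $keywords
MATCH (rc:ResearchCommunity {name: $community})
MERGE (kw)-[:RELATED_TO]-(rc)
RETURN count(*) AS links
"""

def link_keywords(graph, keywords, community="databases"):
    """Hypothetical helper: link existing Keyword nodes to one community."""
    # Assumption: `graph` exposes run(query, **params) like a py2neo Graph
    return graph.run(LINK_QUERY, keywords=list(keywords), community=community).data()
```

With the notebook's objects this would be called as `link_keywords(graph, dbms_kw)`.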

Get the count of all the articles published in each journal:

In [115]:
%%cypher
MATCH (jour:Journal)<-[jc:OF]-(vol:Volume)<-[:PUBLISHED_IN]-(a:Article)
RETURN jour.name AS JOURNAL, COUNT(jc) AS njc
ORDER BY njc DESC
LIMIT 5

5 rows affected.


| JOURNAL | njc |
|---|---|
| ACS applied materials & interfaces | 4 |
| Journal of virology | 2 |
| Basic Research in Cardiology | 2 |
| Proceedings of the National Academy of Sciences of the United States of America | 2 |
| The American journal of surgical pathology | 1 |


In [None]:
%%cypher
MATCH (article:Article)-[:PUBLISHED_IN]->(instance)-[:OF]->(publication)
WHERE (publication:Journal OR publication:Conference) AND (instance:Volume OR instance:Edition)
// total number of articles per journal/conference
WITH publication, count(DISTINCT article) AS total_articles
MATCH (rc:ResearchCommunity {name: 'databases'})-[:RELATED_TO]-(:Keyword)<-[:CONTAINS]-(a:Article)-[:PUBLISHED_IN]->(v)-[:OF]->(publication)
WHERE v:Volume OR v:Edition
// articles of that publication containing at least one community keyword
WITH publication, rc, total_articles, count(DISTINCT a) AS related_articles
WHERE toFloat(related_articles) / total_articles >= 0.9
// MERGE instead of CREATE so that re-running the cell adds no duplicate relationships
MERGE (publication)-[:RELATED_TO]->(rc)
RETURN publication;


In [128]:
%%cypher
MATCH (jour:Journal)<-[jc:OF]-(:Volume)<-[:PUBLISHED_IN]-(a:Article)
WITH jour, COUNT(a) AS total_publications
MATCH (jour)<-[jc2:OF]-(:Volume)<-[:PUBLISHED_IN]-(a)-[:CONTAINS]->(kw: Keyword)
WHERE kw.name IN ['data management', 'indexing', 'data modeling', 'big data', 'data processing', 'data storage', 'data querying']
RETURN a, total_publications, COUNT(jc2) as njc2
ORDER BY njc2 DESC
LIMIT 5

5 rows affected.


| a | total_publications | njc2 |
|---|---|---|
| {'doi_url': 'https://doi.org/10.1097/SIH.0b013e31825e8bcf', 'year': 2012, 'id': 'c104539460e4e4eac8b630ac81ef04d45a683286', 'title': '"Bump": using a mobile app to enhance learning in simulation scenarios.', 'doi': '10.1097/SIH.0b013e31825e8bcf'} | 1 | 5 |
| {'doi_url': 'https://doi.org/10.1016/j.ijcard.2009.06.058', 'year': 2010, 'id': '864ca4044d2ebf88ae4bf45df730f571039488b3', 'title': 'Heart rate dynamics in different levels of Zen meditation.', 'doi': '10.1016/j.ijcard.2009.06.058'} | 1 | 5 |
| {'doi_url': 'https://doi.org/10.1007/BF02270828', 'year': 2005, 'id': 'ed6f7f0d65f4c8bde43f13667c406ff3403f9814', 'title': 'Predation, seed size partitioning and the evolution of body size in seed-eating finches', 'doi': '10.1007/BF02270828'} | 1 | 5 |
| {'doi_url': 'https://doi.org/10.1177/0306624X08323158', 'year': 2009, 'id': '193ae042e6c11bd5f19b3f24c99cfda1e7fd658b', 'title': 'The role of violence in street crime: a qualitative study of violent offenders.', 'doi': '10.1177/0306624X08323158'} | 1 | 5 |
| {'doi_url': 'https://doi.org/10.1177/1591019915623558', 'year': 2016, 'id': 'ed26a77ce2ffaa0ea2edd17afdd52f5642574b6f', 'title': 'Pipeline embolization device deployment via an envoy distal access XB guiding catheter-biaxial platform: A technical note.', 'doi': '10.1177/1591019915623558'} | 1 | 5 |


In [126]:
%%cypher
MATCH (a)-[r:CONTAINS]->(kw:Keyword)
WHERE kw.name IN ['data management', 'indexing', 'data modeling', 'big data', 'data processing', 'data storage', 'data querying']
return a.title AS title, kw.name as kw, COUNT(r) as count
limit 5

5 rows affected.


| title | kw | count |
|---|---|---|
| False-positivity of mediastinal lymph nodes has negative effect on survival in potentially resectable non-small cell lung cancer. | data management | 1 |
| Membrane-organizing protein moesin controls Treg differentiation and antitumor immunity via TGF-β signaling. | data management | 1 |
| Acetic acid as a sclerosing agent for renal cysts: Comparison with ethanol in follow-up results | data management | 1 |
| Sliding mode control of a hydraulic servo system position using adaptive sliding surface and adaptive gain | data management | 1 |
| Evidence for maize (Zea mays) in the Late Archaic (3000-1800 B.C.) in the Norte Chico region of Peru. | data management | 1 |


In [122]:
%%cypher
MATCH (a:Article)-[r:CONTAINS]->(kw:Keyword)
WHERE kw.name IN ['data management', 'indexing', 'data modeling', 'big data', 'data processing', 'data storage', 'data querying']
RETURN a.title AS title, COUNT(r) as count
ORDER BY count DESC
limit 5

5 rows affected.


| title | count |
|---|---|
| "Bump": using a mobile app to enhance learning in simulation scenarios. | 5 |
| Heart rate dynamics in different levels of Zen meditation. | 5 |
| Predation, seed size partitioning and the evolution of body size in seed-eating finches | 5 |
| The role of violence in street crime: a qualitative study of violent offenders. | 5 |
| Pipeline embolization device deployment via an envoy distal access XB guiding catheter-biaxial platform: A technical note. | 5 |
