In [2]:
import sys
sys.version

'3.7.4 (default, Oct 15 2019, 22:29:14) \n[GCC 7.4.0]'

In [19]:
import neo4j
import py2neo
print(neo4j.__version__)
print(py2neo.__version__)

1.7.6
4.3.0


In [20]:
from neo4j import GraphDatabase

# instantiate driver
NEO4J_URI="bolt://localhost:7687"
gdb = GraphDatabase.driver(uri=NEO4J_URI, auth=None)

## A.2 Loading Data


We will read data from dblp.uni-trier.de. According to the description of the XML data at https://dblp.org/faq/16154937.html, the following elements are represented:

> - article – An article from a journal or magazine.
> - inproceedings – A paper in a conference or workshop proceedings.
> - proceedings – The proceedings volume of a conference or workshop.
> - book – An authored monograph or an edited collection of articles.
> - incollection – A part or chapter in a monograph.
> - phdthesis – A PhD thesis.
> - mastersthesis – A Master's thesis. There are only very few Master's theses in dblp.
> - www – A web page. There are only very few web pages in dblp. See also the notes on person records.

We will rely on the script provided in https://github.com/ThomHurks/dblp-to-csv, after removing some of the elements from the `dtd` file. In particular, we remove:

- book
- incollection
- phdthesis
- mastersthesis
- www

The script is then executed as

```bash
#!/bin/bash
./XMLToCSV.py --annotate --neo4j dblp-raw/dblp.xml dblp-raw/dblp_slim.dtd output_slim/output.csv --relations author:authored_by journal:published_in publisher:published_by school:submitted_at editor:edited_by cite:has_citation
```

and the `neo4j-admin import` command is

```bash
#!/bin/bash
neo4j-admin import --mode=csv --database=dblp_slim.db --delimiter ";" --array-delimiter "|" --id-type INTEGER --nodes:inproceedings "output_slim/output_inproceedings_header.csv,output_slim/output_inproceedings.csv" --nodes:article "output_slim/output_article_header.csv,output_slim/output_article.csv" --nodes:proceedings "output_slim/output_proceedings_header.csv,output_slim/output_proceedings.csv" --nodes:editor "output_slim/output_editor.csv" --relationships:edited_by "output_slim/output_editor_edited_by.csv" --nodes:publisher "output_slim/output_publisher.csv" --relationships:published_by "output_slim/output_publisher_published_by.csv" --nodes:journal "output_slim/output_journal.csv" --relationships:published_in "output_slim/output_journal_published_in.csv" --nodes:author "output_slim/output_author.csv" --relationships:authored_by "output_slim/output_author_authored_by.csv" --nodes:cite "output_slim/output_cite.csv" --relationships:has_citation "output_slim/output_cite_has_citation.csv"
```
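
Because the import command repeats the same file-naming pattern for every label, it can also be generated rather than typed by hand. The following is a minimal sketch, assuming the file layout shown above; the helper `build_import_args` and the names `ENTITIES` and `RELATIONS` are hypothetical and not part of dblp-to-csv or Neo4j.

```python
import shlex


def build_import_args(entities, relations, out_dir="output_slim", database="dblp_slim.db"):
    """Return the argument list for `neo4j-admin import` (hypothetical helper).

    entities:  labels whose CSV files come with a separate header file
    relations: maps a node label to the relationship type exported with it
    """
    args = ["neo4j-admin", "import", "--mode=csv", f"--database={database}",
            "--delimiter", ";", "--array-delimiter", "|", "--id-type", "INTEGER"]
    for label in entities:
        prefix = f"{out_dir}/output_{label}"
        args += [f"--nodes:{label}", f"{prefix}_header.csv,{prefix}.csv"]
    for label, rel_type in relations.items():
        args += [f"--nodes:{label}", f"{out_dir}/output_{label}.csv",
                 f"--relationships:{rel_type}", f"{out_dir}/output_{label}_{rel_type}.csv"]
    return args


# Labels and relationship types taken from the command above
ENTITIES = ["inproceedings", "article", "proceedings"]
RELATIONS = {"editor": "edited_by", "publisher": "published_by", "journal": "published_in",
             "author": "authored_by", "cite": "has_citation"}

import_args = build_import_args(ENTITIES, RELATIONS)
# Print a shell-ready command line; quoting protects ";" and "|"
print(" ".join(shlex.quote(arg) for arg in import_args))
```

The printed line can be pasted into a shell script, or the list can be passed to `subprocess.run` directly.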



By modifying the `dtd` file, we obtain a smaller graph:
- node count from 9,985,270 to 7,338,701
- relationship count from 19,917,751 to 17,079,387
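
As a quick check of how much the slimmed `dtd` saves, the relative reduction can be computed from the counts above. This is only a small arithmetic sketch; the function name `reduction` is an assumption. It works out to roughly 26.5% fewer nodes and 14.3% fewer relationships.

```python
def reduction(before: int, after: int) -> float:
    """Fraction of elements removed when going from `before` to `after`."""
    return (before - after) / before


# Counts reported above for the full and the slimmed graph
node_cut = reduction(9_985_270, 7_338_701)
rel_cut = reduction(19_917_751, 17_079_387)

print(f"nodes: -{node_cut:.1%}, relationships: -{rel_cut:.1%}")
```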

After running both scripts, the resulting schema looks like this:

![schema1](images/graph.png)

## Missing nodes and relationships

The imported graph is still missing the following nodes:

- topics
- keywords
- journals
- volumes

and the following relationships:

- topic -> has -> keywords
- article -> cited_by -> article
- author -> reviews -> article (stored below as article -> reviewed_by -> author)

## Faking citations

Citations are hard to parse from the XML data, so we will link articles to each other at random using the `cited_by` relationship.

Creating a relationship:

```cypher
MATCH (a:article),(b:article)
WHERE ID(a) = 12 AND ID(b) = 13
CREATE (a)-[r:cited_by]->(b)
RETURN type(r)
```

Deleting all `cited_by` relationships:

```cypher
MATCH (:article)-[r:cited_by]->(:article) DELETE r
```

Querying relationships:

```cypher
MATCH p=(:article)-[r:cited_by]->(:article) RETURN p LIMIT 25
```

### Fake citations


Fetch existing article IDs

In [5]:
q = "MATCH (n:article) RETURN ID(n) LIMIT 10000"

with gdb.session() as session:
    article_ids = [v[0] for v in session.run(q).values()]
    
article_ids[:5]

[2508965, 2508966, 2508967, 2508968, 2508969]

Optionally, proceedings and journals could also be added as citable elements.

In [6]:
len(article_ids)

10000

#### Delete citations

In [577]:
# delete existing citations before inserting new ones
with gdb.session() as session:
    session.run("MATCH p=(:article)-[r:cited_by]->(:article) DELETE r ")

#### Create `cited_by` relationships

In [578]:
# https://neo4j.com/docs/driver-manual/1.7/sessions-transactions/#driver-transactions-transaction-functions

q_add_citation_rel_id = """MATCH (a:article),(b:article)
WHERE ID(a) = $id_a AND ID(b) = $id_b
CREATE (a)-[r:cited_by]->(b)
RETURN a, b"""


def add_citation_rel(driver, id_a, id_b):
    with driver.session() as session:
        # Caller for transactional unit of work
        return session.write_transaction(create_citation_rel, id_a, id_b)

# Simple implementation of the unit of work
def create_citation_rel(tx, id_a, id_b):
    return tx.run(q_add_citation_rel_id, id_a=id_a, id_b = id_b)

Add `cited_by` relationships: three for each of 500 sampled articles

In [579]:
import random

random.seed(42)
# pick a sample of 500 articles and link each one to three other articles chosen at random
created = []

for article in random.sample(article_ids, 500):
    for citation in random.sample(article_ids, 3):
        if article != citation:  # an article can't cite itself
            created.append((article, citation))
            add_citation_rel(gdb, id_a=article, id_b=citation)
        else:
            print(article, citation)  # report the skipped self-citation
        

2542972 2542972


In [580]:
created[:5]

[(2527960, 2560941),
 (2527960, 2536901),
 (2527960, 2552365),
 (2509374, 2556357),
 (2509374, 2553167)]
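
The loop above skips a self-citation when one is drawn, so the affected article ends up with two links instead of three. A minimal sketch of a sampler that always yields the requested number of distinct partners is given below; `sample_citation_pairs` is a hypothetical helper, and the integer IDs are stand-ins for the real `article_ids`.

```python
import random


def sample_citation_pairs(ids, n_articles, n_citations, seed=42):
    """Return (article, other) pairs with no self-links and no repeated partner.

    Assumes `ids` holds unique values and at least n_citations + 1 of them.
    """
    rng = random.Random(seed)  # local generator, leaves the global seed untouched
    pairs = []
    for article in rng.sample(ids, n_articles):
        # Draw one spare candidate so a self-draw can be dropped without losing a link
        candidates = rng.sample(ids, n_citations + 1)
        partners = [c for c in candidates if c != article][:n_citations]
        pairs.extend((article, partner) for partner in partners)
    return pairs


# Stand-in IDs; in the notebook this would be `article_ids`
demo_pairs = sample_citation_pairs(list(range(10_000)), n_articles=500, n_citations=3)
print(len(demo_pairs))
```

Each pair can then be passed to `add_citation_rel` exactly as in the cell above.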

#### Make articles `UNIQUE`

In [581]:
# no repeated articles
with gdb.session() as session:
    session.run("CREATE CONSTRAINT ON (n:article) ASSERT n.article IS UNIQUE")

#### Query citations

In [582]:
with gdb.session() as session:
    out = session.run("MATCH p=(:article)-[r:cited_by]->(:article) RETURN r LIMIT 5").values()
    for elem in out:
        print(elem)

[<Relationship id=17082515 nodes=(<Node id=2508971 labels=set() properties={}>, <Node id=2548988 labels=set() properties={}>) type='cited_by' properties={}>]
[<Relationship id=17082514 nodes=(<Node id=2508971 labels=set() properties={}>, <Node id=2557826 labels=set() properties={}>) type='cited_by' properties={}>]
[<Relationship id=17082513 nodes=(<Node id=2508971 labels=set() properties={}>, <Node id=2555890 labels=set() properties={}>) type='cited_by' properties={}>]
[<Relationship id=17081563 nodes=(<Node id=2508974 labels=set() properties={}>, <Node id=2557687 labels=set() properties={}>) type='cited_by' properties={}>]
[<Relationship id=17081562 nodes=(<Node id=2508974 labels=set() properties={}>, <Node id=2545595 labels=set() properties={}>) type='cited_by' properties={}>]


### Fake keywords
It's not trivial to parse keywords and topics from data, so we will fake some topics and random keywords using the `faker` library, as explained in http://www.jexp.de/blog/html/create_random_data.html

In [7]:
from faker import Faker
from faker.providers import lorem

fake = Faker()
fake.seed_instance(42)
fake.add_provider(lorem)

In [8]:
# roughly 100 fake keywords; the set removes duplicates, so the count can fall below 100
fake_keywords = [(ix, word) for ix, word in enumerate(list({fake.sentence(nb_words=3).rstrip('.') for _ in range(100)}), start=1)]
len(fake_keywords)

100

In [9]:
fake_keywords[:10]

[(1, 'Account stage federal'),
 (2, 'Stop peace'),
 (3, 'Behavior benefit'),
 (4, 'Tree that fear'),
 (5, 'Term herself'),
 (6, 'East organization people'),
 (7, 'Why'),
 (8, 'Bag control organization'),
 (9, 'Interest level rock'),
 (10, 'Discover detail audience')]
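
Two caveats apply to the set comprehension above: duplicates can leave fewer than 100 keywords, and the iteration order of a set of strings can change between interpreter runs, so the numeric IDs are not guaranteed to be reproducible. A minimal sketch that keeps drawing until the target is reached and numbers the keywords in first-seen order follows; `unique_phrases` and `toy_phrase` are hypothetical, and `toy_phrase` stands in for `fake.sentence(nb_words=3).rstrip('.')`.

```python
import random


def unique_phrases(make_phrase, count, max_tries=10_000):
    """Call make_phrase() until `count` (>= 1) distinct phrases were seen.

    Returns (id, phrase) tuples numbered from 1 in first-seen order.
    """
    seen = {}  # dicts preserve insertion order (Python 3.7+)
    for _ in range(max_tries):
        seen.setdefault(make_phrase(), None)
        if len(seen) == count:
            break
    else:
        raise RuntimeError(f"only {len(seen)} distinct phrases after {max_tries} tries")
    return list(enumerate(seen, start=1))


# Toy generator standing in for the Faker call used in the notebook
rng = random.Random(42)
WORDS = ["graph", "query", "index", "node", "edge", "path"]


def toy_phrase():
    return " ".join(rng.sample(WORDS, 2)).capitalize()


keywords = unique_phrases(toy_phrase, count=10)
print(keywords[:3])
```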

In [583]:
# delete keywords; DETACH also removes any `has_keyword` relationships left from a previous run
with gdb.session() as session:
    session.run("MATCH (n:keyword) DETACH DELETE n")

In [584]:
q_create_keyword = "CREATE (:keyword {id:$id, keyword:$keyword})"

with gdb.session() as session:
    for ix, keyword in fake_keywords:
        session.run(q_create_keyword, id=ix, keyword=keyword)

In [585]:
# create UNIQUE constraint
with gdb.session() as session:
    session.run("CREATE CONSTRAINT ON (n:keyword) ASSERT n.keyword IS UNIQUE")

#### Assign 5 keywords to 1000 articles, at random

In [586]:
# delete relationships
with gdb.session() as session:
    session.run("MATCH p=(:article)-[r:has_keyword]->(:keyword) DELETE r")

In [587]:
# assign 5 keywords randomly to 1000 articles
random.seed(42)

q_add_keywords = """MATCH (a:article),(b:keyword)
WHERE ID(a) = $article_id AND b.keyword = $keyword
CREATE (a)-[r:has_keyword]->(b)
RETURN a, b"""

created_kw_rel = []

with gdb.session() as session:
    for article_id in random.sample(article_ids, 1000):
        for ix, keyword in random.sample(fake_keywords, 5):
            created_kw_rel.append((article_id, keyword))
            session.run(q_add_keywords, article_id=article_id, keyword=keyword)

In [588]:
created_kw_rel[:5]

[(2527960, 'Least'),
 (2527960, 'Author technology amount'),
 (2527960, 'Message board mean'),
 (2527960, 'Spend prove stock'),
 (2527960, 'Quickly appear piece')]

In [589]:
with gdb.session() as session:
    out = session.run("MATCH p=(:article)-[r:has_keyword]->(:keyword) RETURN r LIMIT 5").values()
    for elem in out:
        print(elem)

[<Relationship id=17081012 nodes=(<Node id=2557967 labels=set() properties={}>, <Node id=7340970 labels=set() properties={}>) type='has_keyword' properties={}>]
[<Relationship id=17080942 nodes=(<Node id=2553350 labels=set() properties={}>, <Node id=7340970 labels=set() properties={}>) type='has_keyword' properties={}>]
[<Relationship id=17080857 nodes=(<Node id=2556280 labels=set() properties={}>, <Node id=7340970 labels=set() properties={}>) type='has_keyword' properties={}>]
[<Relationship id=17080765 nodes=(<Node id=2540985 labels=set() properties={}>, <Node id=7340970 labels=set() properties={}>) type='has_keyword' properties={}>]
[<Relationship id=17080704 nodes=(<Node id=2527663 labels=set() properties={}>, <Node id=7340970 labels=set() properties={}>) type='has_keyword' properties={}>]
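
Issuing one `session.run` per relationship means 5,000 round trips for this step. A common alternative is to send the pairs in batches and let Cypher iterate over them with `UNWIND`. The sketch below assumes the same `article`/`keyword` schema as above; `chunked`, `add_keywords_in_batches` and the batch size are hypothetical choices, and the `session` argument is any object with a `run(query, **params)` method, such as a driver session.

```python
Q_ADD_KEYWORDS_BATCH = """
UNWIND $rows AS row
MATCH (a:article), (b:keyword)
WHERE ID(a) = row.article_id AND b.keyword = row.keyword
CREATE (a)-[:has_keyword]->(b)
"""


def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def add_keywords_in_batches(session, pairs, batch_size=1000):
    """Send (article_id, keyword) pairs in batches; return the number of queries run."""
    queries = 0
    for batch in chunked(pairs, batch_size):
        rows = [{"article_id": article_id, "keyword": keyword}
                for article_id, keyword in batch]
        session.run(Q_ADD_KEYWORDS_BATCH, rows=rows)
        queries += 1
    return queries
```

In the notebook this could be called as `add_keywords_in_batches(session, created_kw_rel)` inside a `with gdb.session() as session:` block.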


### Faking reviewers 

We will create a `reviewed_by` relationship between authors and articles

In [471]:
# delete relationships
with gdb.session() as session:
    session.run("MATCH p=(:article)-[r:reviewed_by]->(:author) DELETE r")

In [10]:
q = "MATCH (n:author) RETURN ID(n) LIMIT 10000"

with gdb.session() as session:
    author_ids = [v[0] for v in session.run(q).values()]
    
author_ids[:5]

[4771927, 4771928, 4771929, 4771930, 4771931]

In [473]:
# assign 3 reviewers at random to each of 1000 articles
random.seed(42)

q_add_reviewers = """MATCH (a:article),(b:author)
WHERE ID(a) = $article_id AND ID(b) = $author_id
CREATE (a)-[r:reviewed_by]->(b)"""

created_review_rel = []

with gdb.session() as session:
    for article_id in random.sample(article_ids, 1000):
        for author_id in random.sample(author_ids, 3):
            created_review_rel.append((article_id, author_id))
            session.run(q_add_reviewers, article_id=article_id, author_id=author_id)

In [11]:
created_review_rel[:5]

NameError: name 'created_review_rel' is not defined

## A.3 Evolving the graph 

The requirements are

#### Store the review and the approval sent by each reviewer

Since the queries in part B don't need this information, we can store `review_accepted` and `review_text` as _edge_ attributes of the `reviewed_by` relationship.


Alternatively, if these reviews were required in part B, they could be modeled by creating a `Review` node and linking it to articles in a four-step process:

1. Create a `Review` node for each article that has reviewers (i.e. is connected to authors by `reviewed_by` edges). Each `Review` node will have an attribute named `review_contents`.
2. Attach the review to the article.
3. Attach the reviewers to the review node.
4. Optionally, delete the `reviewed_by` edges used in step 1.
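
The four steps could be expressed as three Cypher statements run in order. This is only a sketch of the alternative design, not what the notebook does: the `review` label and the `has_review` and `written_by` relationship types are hypothetical names, and the statements use `CREATE`, so they are meant to be run once.

```python
REVIEW_STEPS = [
    # Steps 1 and 2: one review node per reviewed article, attached on creation
    ("""MATCH (a:article)-[:reviewed_by]->(:author)
WITH DISTINCT a
CREATE (a)-[:has_review]->(:review {review_contents: $contents})""",
     {"contents": ""}),
    # Step 3: link every reviewer of the article to the article's review node
    ("""MATCH (rev:review)<-[:has_review]-(:article)-[:reviewed_by]->(u:author)
CREATE (rev)-[:written_by]->(u)""",
     {}),
    # Step 4 (optional): drop the original edges
    ("""MATCH (:article)-[r:reviewed_by]->(:author) DELETE r""",
     {}),
]


def evolve_reviews(session, drop_old_edges=False):
    """Run the statements in order; the last one only when drop_old_edges is True."""
    steps = REVIEW_STEPS if drop_old_edges else REVIEW_STEPS[:2]
    for query, params in steps:
        session.run(query, **params)
    return len(steps)
```

Keeping step 4 optional leaves the `reviewed_by` edges in place for any query that still relies on them.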

In [12]:
with gdb.session() as session:
    relationships = session.run("MATCH p=(:article)-[r:reviewed_by]->(:author) return ID(r)").values()

In [717]:
for x in out[0][0].relationships:
    print(x["review_text"])

Lorem Ipsum Donor


In [737]:
for x in out[:5]:
    for i in x:
        for r in i.relationships:
            print(r.id, dict(r))

17088004 {'review_text': 'Lorem Ipsum Donor', 'review_accepted': True}
17088003 {}
17088002 {}
17087050 {}
17087049 {}


In [738]:
with gdb.session() as session:
    out = session.run("MATCH p=(:article)-[r:reviewed_by]->(:author) return ID(r)")
    rel_ids = [r[0] for r in out.values()]

In [739]:
rel_ids[:5]

[17088004, 17088003, 17088002, 17087050, 17087049]

In [743]:
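# `+=` merges the review attributes into the relationship instead of replacing all of its properties
q_merge_review_attributes = """MATCH p=(:article)-[r:reviewed_by]->(:author)
WHERE ID(r) = $rel_id
SET r += {review_accepted: $review_accepted, review_text: $review_text}
"""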

created_review_rel_attr = []

random.seed(42)

with gdb.session() as session:
    for rel_id in rel_ids:
        created_review_rel_attr.append(rel_id)
        session.run(q_merge_review_attributes, 
                    rel_id=rel_id,
                    review_accepted=fake.pybool(), 
                    review_text=fake.texts(nb_texts=1, max_nb_chars=500)[0])

In [745]:
created_review_rel_attr[:5]

[17088004, 17088003, 17088002, 17087050, 17087049]

#### Store a reviewing policy for each Journal or Conference

This can also be implemented as an attribute, `review_policy_min_count`, on the `journal` and `proceedings` nodes.

In [21]:
with gdb.session() as session:
    # MATCH (rather than MERGE) updates the existing nodes and never creates an empty one
    journal_ids = session.run("MATCH (n:journal) SET n.review_policy_min_count = 3 RETURN n").values()
    proceeding_ids = session.run("MATCH (n:proceedings) SET n.review_policy_min_count = 3 RETURN n").values()

In [22]:
journal_ids[:5]

[[<Node id=4742771 labels={'journal'} properties={'review_policy_min_count': 3, 'journal': 'meltdownattack.com'}>],
 [<Node id=4742772 labels={'journal'} properties={'review_policy_min_count': 3, 'journal': 'GTE Laboratories Incorporated'}>],
 [<Node id=4742773 labels={'journal'} properties={'review_policy_min_count': 3, 'journal': 'University of California at Berkeley'}>],
 [<Node id=4742774 labels={'journal'} properties={'review_policy_min_count': 3, 'journal': 'ANSI X3H2'}>],
 [<Node id=4742775 labels={'journal'} properties={'review_policy_min_count': 3, 'journal': 'ANSI X2H2'}>]]

In [24]:
proceeding_ids[:5]

[[<Node id=4687702 labels={'proceedings'} properties={'ee': ['https://doi.org/10.1007/978-3-642-32940-1'], 'editor': ['Irek Ulidowski', 'Maciej Koutny'], 'year': 2012, 'isbn': ['978-3-642-32939-5'], 'series-href': ['db/series/lncs/index.html'], 'title': 'CONCUR 2012 - Concurrency Theory - 23rd International Conference, CONCUR 2012, Newcastle upon Tyne, UK, September 4-7, 2012. Proceedings', 'url': 'db/conf/concur/concur2012.html', 'review_policy_min_count': 3, 'volume': '7454', 'mdate': neotime.Date(2019, 5, 14), 'series': ['Lecture Notes in Computer Science'], 'publisher': ['Springer'], 'proceedings': 4180529, 'booktitle': 'CONCUR', 'key': 'conf/concur/2012'}>],
 [<Node id=4687703 labels={'proceedings'} properties={'ee': ['https://doi.org/10.1007/11817949'], 'editor': ['Christel Baier', 'Holger Hermanns'], 'year': 2006, 'isbn': ['3-540-37376-4'], 'series-href': ['db/series/lncs/index.html'], 'title': 'CONCUR 2006 - Concurrency Theory, 17th International Conference, CONCUR 2006, Bonn, 

#### Display the affiliation of the author to an organization or a company 

This can be solved by adding two attributes to the `author` nodes: `affiliation_institution_type` and `affiliation_institution_name`. We could have modeled institutions as nodes, but the queries from part B do not refer to this information.


In [42]:
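# `+=` adds the affiliation properties and keeps the existing ones (e.g. the author name);
# a plain `=` would replace every property on the node
q_set_affiliations = """MATCH (n:author)
WHERE ID(n) = $author_id
SET n += {affiliation_institution_type: $affiliation_institution_type, affiliation_institution_name: $affiliation_institution_name}
RETURN n
"""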

institution_types = ['University', 'Company', 'NGO']
created_affiliations = []

random.seed(42)

with gdb.session() as session:
    for author_id in author_ids:
        out = session.run(q_set_affiliations, 
                    author_id=author_id,
                    affiliation_institution_name=fake.company(), 
                    affiliation_institution_type=random.choice(institution_types))
        
        created_affiliations.append(out.values())
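
When setting several properties from a map, note that Cypher's `SET n = {map}` replaces all existing properties of the node, whereas `SET n += {map}` only adds or overwrites the listed keys. The difference can be mimicked with plain dictionaries. This is an illustration only: the helper names and the sample property values are made up, and the handling of `null` values is ignored.

```python
def set_replace(existing, new):
    """Mimics `SET n = $new`: the old properties are discarded."""
    return dict(new)


def set_merge(existing, new):
    """Mimics `SET n += $new`: old properties survive unless overwritten."""
    merged = dict(existing)
    merged.update(new)
    return merged


# Made-up author properties, for illustration only
author = {"author": "Jane Doe"}
affiliation = {"affiliation_institution_type": "University",
               "affiliation_institution_name": "Example Institute"}

replaced = set_replace(author, affiliation)
merged = set_merge(author, affiliation)

print("author" in replaced)  # the name is gone after a plain `=`
print("author" in merged)    # the name is kept with `+=`
```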
        