# Plot Generation - Paper/Topic Statistics

Importing libraries.

In [20]:
import random
import pandas as pd  # needed for the DataFrame built below
import plotly.express as px
from neo4j import GraphDatabase
from pyvis.network import Network

Neo4j location and authentication data.

In [21]:
URI = "neo4j://localhost:7687/"
AUTH = ("neo4j", "neo4j")
driver = GraphDatabase.driver(URI, auth=AUTH)

Function to run a Cypher query and return all of its records as a list.

In [22]:
def run_query(query, parameters=None):
    # Open a short-lived session and materialize every record before it closes.
    with driver.session() as session:
        result = session.run(query, parameters)
        return list(result)

## Barplot Generation

We obtain the 30 topics with the largest number of publications associated with them.

In [23]:
query_all_topics = """
MATCH (n:ns0__Topic)-[r:ns0__hasPublication]->(i:ns0__Publication)
RETURN n.uri AS topicUri, COUNT(i) AS paperCount
ORDER BY paperCount DESC
LIMIT 30
"""

In [24]:
result = run_query(query_all_topics)
result = [dict(r) for r in result]
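
The `parameters` argument of `run_query` is not used by the query above. As a minimal sketch, assuming one wants to vary the number of topics without editing the query string, the limit can be passed as a Cypher parameter. The names `QUERY_TOP_TOPICS` and `top_topics_request` are hypothetical and not part of the original notebook.

```python
# Hypothetical variant of query_all_topics: the row limit is a Cypher parameter ($limit).
QUERY_TOP_TOPICS = """
MATCH (n:ns0__Topic)-[:ns0__hasPublication]->(i:ns0__Publication)
RETURN n.uri AS topicUri, COUNT(i) AS paperCount
ORDER BY paperCount DESC
LIMIT $limit
"""


def top_topics_request(limit=30):
    """Return the (query, parameters) pair that would be handed to run_query."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return QUERY_TOP_TOPICS, {"limit": limit}


query, parameters = top_topics_request(30)
# With a live Neo4j connection this would be: result = run_query(query, parameters)
```

Keeping the value out of the query text avoids string formatting and lets the same query be reused for different values of `limit`.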

Organizing the results in a DataFrame to facilitate plotting.

In [25]:
df = pd.DataFrame(result)
df["topicName"] = df["topicUri"].apply(lambda x: x.split("/")[-1])
df.head()

Unnamed: 0,topicUri,paperCount,topicName
0,https://www.wikidata.org/wiki/Q7020694,4940,Q7020694
1,https://www.wikidata.org/wiki/Q6517860,3530,Q6517860
2,https://www.wikidata.org/wiki/Q5330456,3004,Q5330456
3,https://www.wikidata.org/wiki/Q4619,2847,Q4619
4,https://www.wikidata.org/wiki/Q7153055,2678,Q7153055
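
The `topicName` column is simply the last path segment of each URI. A small sketch of that step in plain Python, assuming URIs shaped like the Wikidata and DBpedia ones in this dataset; the helper name `topic_name` and the trailing-slash guard are additions for illustration, not part of the notebook.

```python
def topic_name(uri):
    """Return the last path segment of a topic URI."""
    # rstrip("/") is an extra guard (an assumption) against URIs ending in a slash.
    return uri.rstrip("/").rsplit("/", 1)[-1]


examples = [
    "https://www.wikidata.org/wiki/Q7020694",
    "http://dbpedia.org/resource/Corollary",
]
names = [topic_name(uri) for uri in examples]
print(names)  # ['Q7020694', 'Corollary']
```

Since the segment is used on its own, two URIs from different sources that happen to end in the same segment would map to the same name.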


In [38]:
df

Unnamed: 0,topicUri,paperCount,topicName
0,https://www.wikidata.org/wiki/Q7020694,4940,Q7020694
1,https://www.wikidata.org/wiki/Q6517860,3530,Q6517860
2,https://www.wikidata.org/wiki/Q5330456,3004,Q5330456
3,https://www.wikidata.org/wiki/Q4619,2847,Q4619
4,https://www.wikidata.org/wiki/Q7153055,2678,Q7153055
5,https://www.wikidata.org/wiki/Q4884546,2626,Q4884546
6,https://www.wikidata.org/wiki/Q7019560,2375,Q7019560
7,https://www.wikidata.org/wiki/Q6752025,2203,Q6752025
8,https://www.wikidata.org/wiki/Q7211696,1928,Q7211696
9,https://www.wikidata.org/wiki/Q8790102,1924,Q8790102


Generating the barplot.

In [26]:
fig = px.bar(df, x="topicName", y="paperCount", color="paperCount", color_continuous_scale="viridis")
fig.write_html("barplot.html")

## Graph Generation

First, we obtain the number of publications associated with each topic.

In [39]:
first_query = """
MATCH (t:ns0__Topic)-[:ns0__hasPublication]->(p:ns0__Publication)
RETURN DISTINCT t.uri AS topicUri, COUNT(p) AS paperCount
"""

result = run_query(first_query)
result = [dict(r) for r in result]
result[:10]

[{'topicUri': 'https://www.wikidata.org/wiki/Q9205879', 'paperCount': 49},
 {'topicUri': 'https://www.wikidata.org/wiki/Q7479291', 'paperCount': 463},
 {'topicUri': 'https://www.wikidata.org/wiki/Q7022256', 'paperCount': 66},
 {'topicUri': 'https://www.wikidata.org/wiki/Q8168518', 'paperCount': 59},
 {'topicUri': 'https://www.wikidata.org/wiki/Q466', 'paperCount': 112},
 {'topicUri': 'https://www.wikidata.org/wiki/Q7214259', 'paperCount': 1677},
 {'topicUri': 'http://dbpedia.org/resource/Corollary', 'paperCount': 22},
 {'topicUri': 'http://dbpedia.org/resource/Usability_goals', 'paperCount': 1},
 {'topicUri': 'https://www.wikidata.org/wiki/Q1343123', 'paperCount': 1749},
 {'topicUri': 'http://dbpedia.org/resource/Individual', 'paperCount': 78}]

In [40]:
len([entry for entry in result if entry["paperCount"] > 100])

479

This second query returns, for every pair of topics, the number of publications they have in common. Since the query handles a lot of data, a few techniques were needed:

- `db.awaitIndexes()` ensures that all indexes are available before the rest of the query is executed.
- Unwinding `topics` twice, as `topic1` and as `topic2`, builds the Cartesian product of each publication's topic list with itself, which generates the pairs of topics.
- `CASE WHEN topic1 < topic2` combined with `WHERE filteredSharedPapers > 0` ensures that each pair is considered only once and that a topic is never paired with itself.
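
The pairing logic described in the list above can be sketched in plain Python on toy data. This is only an illustration of what the query computes, not a replacement for it; the publication ids, topic names, and the function name `count_shared_papers` are made up.

```python
from collections import Counter
from itertools import combinations

# Toy data (made up for illustration): publication id -> topics it belongs to.
publication_topics = {
    "p1": ["A", "B", "C"],
    "p2": ["A", "B"],
    "p3": ["B", "C", "B"],  # duplicate topic, removed like collect(distinct ...)
}


def count_shared_papers(publication_topics):
    """Count, for every unordered topic pair, the publications containing both."""
    shared = Counter()
    for topics in publication_topics.values():
        # sorted(set(...)) deduplicates; combinations then yields each pair once
        # with topic1 < topic2, which mirrors the CASE WHEN filter in the query.
        for topic1, topic2 in combinations(sorted(set(topics)), 2):
            shared[(topic1, topic2)] += 1
    return shared


shared = count_shared_papers(publication_topics)
print(shared.most_common())
```

Here `combinations` generates only the pairs with `topic1 < topic2`, whereas the Cypher query builds the full product first and filters afterwards.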

In [28]:
second_query = """
CALL db.awaitIndexes()
MATCH (p:ns0__Publication)-[:ns0__belongToTopic]->(t:ns0__Topic)
WITH p, collect(distinct t.uri) as topics
UNWIND topics as topic1
UNWIND topics as topic2
WITH topic1, topic2, count(p) as sharedPapers
WITH topic1, topic2, sharedPapers, CASE WHEN topic1 < topic2 THEN sharedPapers ELSE 0 END AS filteredSharedPapers
WITH topic1, topic2, filteredSharedPapers
WHERE filteredSharedPapers > 0
RETURN topic1, topic2, filteredSharedPapers AS sharedPapers
ORDER BY sharedPapers DESC
"""

result2 = run_query(second_query)
result2 = [dict(r) for r in result2]
result2[:10]

[{'topic1': 'https://www.wikidata.org/wiki/Q5330456',
  'topic2': 'https://www.wikidata.org/wiki/Q6517860',
  'sharedPapers': 2812},
 {'topic1': 'https://www.wikidata.org/wiki/Q6517860',
  'topic2': 'https://www.wikidata.org/wiki/Q7020694',
  'sharedPapers': 2806},
 {'topic1': 'https://www.wikidata.org/wiki/Q4619',
  'topic2': 'https://www.wikidata.org/wiki/Q7020694',
  'sharedPapers': 2622},
 {'topic1': 'https://www.wikidata.org/wiki/Q7020694',
  'topic2': 'https://www.wikidata.org/wiki/Q7153055',
  'sharedPapers': 2587},
 {'topic1': 'https://www.wikidata.org/wiki/Q4884546',
  'topic2': 'https://www.wikidata.org/wiki/Q5330456',
  'sharedPapers': 2482},
 {'topic1': 'https://www.wikidata.org/wiki/Q4884546',
  'topic2': 'https://www.wikidata.org/wiki/Q6517860',
  'sharedPapers': 2385},
 {'topic1': 'https://www.wikidata.org/wiki/Q7019560',
  'topic2': 'https://www.wikidata.org/wiki/Q7020694',
  'sharedPapers': 2346},
 {'topic1': 'https://www.wikidata.org/wiki/Q5330456',
  'topic2': 'https

In [36]:
len([entry for entry in result2 if entry["sharedPapers"] > 100])

6921

In total there are almost 30k topics and around 3 million connections, which is far too many to plot in a single graph.

In [41]:
len(result), len(result2)

(29882, 3007169)

The following function helps us choose which nodes to plot in the graph. It selects the top `k` most popular topics and the relationships between them. If `include_random` is set to True, it also adds `k` random topics to the graph, together with their relationships to each other and to the top `k` topics.

In [30]:
def create_graph_lists(sorted_result, k=25, include_random=False):
    # Top-k topics by paper count; sorted_result must be sorted in descending order.
    # Sets keep the membership tests below cheap across the ~3M relations.
    top_k_topics = {entry['topicUri'] for entry in sorted_result[:k]}
    topic_paper_counts = sorted_result[:k]

    # Relations between two top-k topics (reads the global result2).
    shared_papers = [
        relation for relation in result2
        if relation['topic1'] in top_k_topics and relation['topic2'] in top_k_topics
    ]

    if include_random:
        # Sample k further topics from outside the top k.
        all_topic_uris = [entry['topicUri'] for entry in sorted_result]
        remaining_topic_uris = [uri for uri in all_topic_uris if uri not in top_k_topics]
        random_k_topics = set(random.sample(remaining_topic_uris, k))
        random_k_topics_full = [entry for entry in sorted_result if entry["topicUri"] in random_k_topics]

        # Relations among the random topics.
        filtered_relations_random_k = [
            relation for relation in result2
            if relation['topic1'] in random_k_topics and relation['topic2'] in random_k_topics
        ]

        # Relations between one top-k topic and one random topic.
        filtered_relations_mixed = [
            relation for relation in result2
            if (relation['topic1'] in top_k_topics and relation['topic2'] in random_k_topics) or
               (relation['topic1'] in random_k_topics and relation['topic2'] in top_k_topics)
        ]

        topic_paper_counts += random_k_topics_full
        shared_papers += filtered_relations_random_k + filtered_relations_mixed

    return topic_paper_counts, shared_papers
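
The same selection can be tried on toy data. The sketch below is a hypothetical variant, `select_subgraph`, that takes the relations as an explicit argument instead of reading `result2`, and accepts a `seed` so the random sample is repeatable. The `seed` parameter, the `min(k, len(rest))` guard, and the toy URIs are assumptions added for illustration.

```python
import random


def select_subgraph(sorted_counts, relations, k=2, include_random=False, seed=None):
    """Hypothetical variant of create_graph_lists with explicit inputs."""
    chosen = list(sorted_counts[:k])
    if include_random:
        rng = random.Random(seed)  # seeded generator makes the sample repeatable
        rest = sorted_counts[k:]
        # random.sample raises ValueError if asked for more items than exist.
        chosen += rng.sample(rest, min(k, len(rest)))
    chosen_uris = {entry["topicUri"] for entry in chosen}
    # Keep every relation whose two endpoints were both selected.
    edges = [
        r for r in relations
        if r["topic1"] in chosen_uris and r["topic2"] in chosen_uris
    ]
    return chosen, edges


# Toy data (made up), already sorted by paperCount in descending order.
counts = [
    {"topicUri": "t/A", "paperCount": 5},
    {"topicUri": "t/B", "paperCount": 4},
    {"topicUri": "t/C", "paperCount": 1},
]
relations = [
    {"topic1": "t/A", "topic2": "t/B", "sharedPapers": 3},
    {"topic1": "t/A", "topic2": "t/C", "sharedPapers": 1},
]

nodes, edges = select_subgraph(counts, relations, k=2)
nodes_r, edges_r = select_subgraph(counts, relations, k=2, include_random=True, seed=0)
```

Filtering on the union of selected URIs yields the same edge set as collecting the top-top, random-random, and mixed relations separately.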

We will choose the top 20 most popular topics.

In [31]:
sorted_result = sorted(result, key=lambda x: x['paperCount'], reverse=True)
topic_paper_counts, shared_papers = create_graph_lists(sorted_result, k = 20)

Now we can generate the graph. It is displayed in the notebook and also written to the HTML file `nx.html`.

In [33]:
nt = Network(notebook=True, height="750px", width="100%", select_menu=True, cdn_resources='remote')

color_palette = ['#ff9b9b', '#ffc19b', '#ffefac', '#c0f1ab', '#b1d4e0']
for i, topic in enumerate(topic_paper_counts):
    nt.add_node(topic['topicUri'].split("/")[-1], 
                label = topic['topicUri'].split("/")[-1],
                value = topic['paperCount'],
                title = "<a href={link}>{title}</a>".format(link=topic['topicUri'], title=topic['topicUri'].split("/")[-1]) + "\nPaper Count: {paperCount}".format(paperCount=topic["paperCount"]),
                color=color_palette[i % len(color_palette)])

for connection in shared_papers:
    nt.add_edge(connection['topic1'].split("/")[-1], connection['topic2'].split("/")[-1], value=connection['sharedPapers'], label = str(connection['sharedPapers']))

nt.set_options("""
const options = {
  "edges": {
    "color": {
      "inherit": true,
      "opacity": 0.8
    },
    "font": {
      "size": 8
    },
    "scaling": {
      "label": {
        "min": 5,
        "max": 10
      }
    },
    "selfReferenceSize": null,
    "selfReference": {
      "angle": 0.7853981633974483
    },
    "smooth": {
      "forceDirection": "none"
    }
  },
  "physics": {
    "enabled": false,
    "minVelocity": 0.75
  }
}
""")

nt.show('nx.html')

nx.html
