# MODAL 473G
_Gaëtan Ecrepont, Ilyas Glaib_

In [1]:
# useful libraries for manipulating and visualizing the data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## 3. Graph construction

### 3.1 Querying Wikidata
Compute the number of distinct mathematicians:
<pre style="font-family: monospace">
SELECT (COUNT(DISTINCT(?mathematician)) as ?count)
WHERE
{
  ?mathematician wdt:P106 wd:Q170790.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
</pre>
Output : 37324
<br>
<hr>
<br>
Compute the number of distinct mathematicians with attributes dob, dod, university, field of work and country:
<pre style="font-family: monospace">
SELECT (COUNT(DISTINCT(?mathematician)) as ?count)
WHERE
{
  ?mathematician wdt:P106 wd:Q170790;
                 wdt:P69 ?university;
                 wdt:P569 ?dob;
                 wdt:P570 ?dod;
                 wdt:P101 ?fieldOfWork;
                 wdt:P27 ?country.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
</pre>
Output : 3768
<br>
<hr>
<br>
Retrieve all mathematicians with attributes dob, dod, university, field of work and country and optional attribute doctoral students:
<pre style="font-family: monospace">
SELECT ?mathematicianLabel ?universityLabel ?dob ?dod ?fieldOfWorkLabel ?countryLabel ?doctoralStudentLabel
WHERE 
{
  ?mathematician wdt:P106 wd:Q170790;
                 wdt:P69 ?university;
                 wdt:P569 ?dob;
                 wdt:P570 ?dod;
                 wdt:P101 ?fieldOfWork;
                 wdt:P27 ?country.
  OPTIONAL { ?mathematician wdt:P185 ?doctoralStudent. }
                  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
</pre>
Output : our main CSV file with over 200,000 rows.
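
The queries above can be pasted into the Wikidata Query Service web interface. As a minimal sketch, assuming the public endpoint `https://query.wikidata.org/sparql` and CSV output requested through the `Accept` header, the same kind of query could also be sent from Python with the standard library only. The helper names `build_request` and `fetch_csv` and the User-Agent string are hypothetical, not part of our pipeline.

```python
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata SPARQL endpoint

COUNT_QUERY = """
SELECT (COUNT(DISTINCT ?mathematician) AS ?count)
WHERE { ?mathematician wdt:P106 wd:Q170790. }
"""

def build_request(query, user_agent="modal-473g-notebook/0.1 (hypothetical contact)"):
    """Build (but do not send) a GET request asking for CSV results."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "text/csv", "User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)

def fetch_csv(query, timeout=60):
    """Send the request and return the CSV body as text (needs network access)."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as response:
        return response.read().decode("utf-8")

# Only the request object is built here; nothing is sent over the network.
request = build_request(COUNT_QUERY)
```

Calling `fetch_csv(COUNT_QUERY)` would then return the CSV text, which can be written to disk or passed to `pd.read_csv` through `io.StringIO`.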

### 3.2 Preprocessing the data

In [2]:
df = pd.read_csv("../data/raw.csv")
print(f"Our main CSV file has {df.shape[0]} rows and {df.shape[1]} columns.")
print("Here is a preview of the data frame:")
df.sample(5)

Our main CSV file has 207926 rows and 7 columns.
Here is a preview of the data frame:


Unnamed: 0,mathematicianLabel,universityLabel,dob,dod,fieldOfWorkLabel,countryLabel,doctoralStudentLabel
52169,Dmitri Egorov,Moscow State University,1869-12-22T00:00:00Z,1931-09-10T00:00:00Z,calculus of variations,Russian Socialist Federative Soviet Republic,Ivan Privalov
48565,Rolf Nevanlinna,Helsingin Suomalainen Yhteiskoulu,1895-10-22T00:00:00Z,1980-05-28T00:00:00Z,complex analysis,Finland,Inkeri Simola
39430,John Edensor Littlewood,Trinity College,1885-06-09T00:00:00Z,1977-09-06T00:00:00Z,mathematics,United Kingdom,R. Cooper
165013,Ernest Vinberg,Moscow State University,1937-07-26T00:00:00Z,2020-05-12T00:00:00Z,algebra,Russia,Alexander Vladimirovich Smirnov
29417,Nikolai Chebotaryov,"3 Shevchenka Street, Kropyvnytskyi",1894-03-15T00:00:00Z,1947-07-02T00:00:00Z,number theory,Russian Empire,Wladimir Wladimirowitsch Morosow


In [3]:
df.columns = [str.replace(col, "Label", "") for col in df.columns] # cleaner names
df = df.drop(index=[61005, 110655]) # remove two mathematicians with no death date, causing errors in the code below
df["dob"] = df["dob"].apply(lambda s: int(s[:4])) # only keep year
df["dod"] = df["dod"].apply(lambda s: int(s[:4])) # only keep year

print("After cleaning, our data frame looks like the following:")
df.sample(5)


After cleaning, our data frame looks like the following:


Unnamed: 0,mathematician,university,dob,dod,fieldOfWork,country,doctoralStudent
58498,Kazimierz Kuratowski,University of Warsaw,1896,1980,topology,Poland,Stanislaw Mrowka
32275,Nikolai Chebotaryov,Kamianets-Podilskyi Boys Gymnasium,1894,1947,function theory,Russian Empire,Aleksandr Yakovlevich Povzner
20428,George Pólya,Eötvös Loránd University,1887,1985,education,United States of America,Florian Eggenberger
200301,Robert V. Hogg,University of Iowa,1924,2014,statistics,United States of America,Elliot Alan Tanis
81523,Ivan Vidav,University of Ljubljana,1918,2015,algebra,Slovenia,Jože Grasselli
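
The hard-coded index labels in the cleaning cell depend on the row order of this particular export. As a minimal sketch of a more defensive variant, assuming timestamps shaped like `1869-12-22T00:00:00Z`, the hypothetical helper `extract_year` below returns `None` for anything it cannot parse instead of raising an error.

```python
import math

def extract_year(timestamp):
    """Return the year of a Wikidata-style timestamp, or None if it is missing.

    Assumes strings shaped like '1869-12-22T00:00:00Z'; a leading '-'
    (a BCE date) is kept as a negative year.
    """
    if timestamp is None or (isinstance(timestamp, float) and math.isnan(timestamp)):
        return None
    text = str(timestamp).strip()
    sign = -1 if text.startswith("-") else 1
    year_part = text.lstrip("+-").split("-", 1)[0]
    return sign * int(year_part) if year_part.isdigit() else None

# Sample values made up for illustration
samples = ["1869-12-22T00:00:00Z", "0965-01-01T00:00:00Z", float("nan"), "-0300-01-01T00:00:00Z"]
years = [extract_year(value) for value in samples]
```

With such a helper, the two `apply` calls could become `df["dob"].map(extract_year)` and `df["dod"].map(extract_year)`, followed by `df.dropna(subset=["dob", "dod"])` to discard the rows with missing dates.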


### 3.3 Quick exploration of the data

In [4]:
mathematicians = df["mathematician"].drop_duplicates()
print(f"There are {mathematicians.count()} mathematicians in the dataset.")

fow = df["fieldOfWork"].drop_duplicates()
print(f"There are {fow.count()} fields of work in the dataset.")

universities = df["university"].drop_duplicates()
print(f"There are {universities.count()} universities in the dataset.")

countries = df["country"].drop_duplicates()
print(f"There are {countries.count()} countries in the dataset.")

doctoralStudents = df["doctoralStudent"].drop_duplicates()
print(f"There are {doctoralStudents.count()} doctoral students in the dataset.")
# intersection between mathematicians and doctoral students
print(f"There are {doctoralStudents[doctoralStudents.isin(mathematicians)].count()} doctoral students who are also in our mathematicians list.")

There are 3753 mathematicians in the dataset.
There are 902 fields of work in the dataset.
There are 1309 universities in the dataset.
There are 184 countries in the dataset.
There are 22033 doctoral students in the dataset.
There are 1866 doctoral students who are also in our mathematicians list.


### 3.4 The issue of mathematicians with several dob/dod

In [5]:
n1 = len(df[["mathematician", "dob"]].drop_duplicates())
print(f"Number of distinct (mathematician, dob) rows: {n1}")
n2 = len(df[["mathematician"]].drop_duplicates())
print(f"Number of distinct mathematicians: {n2}")

Number of distinct (mathematician, dob) rows: 3769
Number of distinct mathematicians: 3753


Thus, some mathematicians have several dob! Who are they?

In [6]:
_df = df[["mathematician", "dob"]].drop_duplicates()
_df = _df.groupby("mathematician").count()
several_dob = _df[_df["dob"] > 1]
print(f"There are {len(several_dob)} mathematicians with several dob.")
several_dob

There are 15 mathematicians with several dob.


mathematician,dob
Adam Freytag,2
Edward Waring,2
Eugen Netto,2
J. N. Srivastava,2
Jeremiah Horrocks,2
Johannes Stabius,3
Joseph Raphson,2
Korneliy Karastelev,2
Lee Stiff,2
Mabel Gweneth Humphreys,2


The same goes for dod!

In [7]:
_df = df[["mathematician", "dod"]].drop_duplicates()
_df = _df.groupby("mathematician").count()
several_dod = _df[_df["dod"] > 1]
print(f"There are {len(several_dod)} mathematicians with several dod.")
several_dod

There are 10 mathematicians with several dod.


mathematician,dod
Adam Freytag,2
Christoph Rudolff,2
Dmitry Raikov,2
Johannes de Sacrobosco,2
Joseph Raphson,2
Karel Hruša,2
Mathukumalli V. Subbarao,2
Mikhail Lavrentyev,2
Roger Bacon,2
Truman Lee Kelley,2


Let's get rid of the extra rows by keeping only the lowest dob/dod (this is an arbitrary choice but doesn't have much impact since the uncertainty is always plus or minus one year).

In [8]:
_df = df.groupby("mathematician").nunique()
names = _df[_df["dob"]>1].index

for name in names:
    dobs = sorted(df[df["mathematician"]==name]["dob"].drop_duplicates().to_list()) # ascending order, so dobs[0] is the lowest dob (the one we keep)
    df = df[~((df["mathematician"]==name) & (df["dob"].isin(dobs[1:])))]

_df = df.groupby("mathematician").nunique()
names = _df[_df["dod"]>1].index

for name in names:
    dods = sorted(df[df["mathematician"]==name]["dod"].drop_duplicates().to_list()) # ascending order, so dods[0] is the lowest dod (the one we keep)
    df = df[~((df["mathematician"]==name) & (df["dod"].isin(dods[1:])))]
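
The rule stated above (keep only the lowest dob, then the lowest dod) can also be written without explicit loops. This is a minimal sketch on made-up toy data, assuming the same column names as our data frame; `keep_lowest_dates` is a hypothetical helper, not the code we actually ran.

```python
import pandas as pd

def keep_lowest_dates(frame):
    """Keep, for each mathematician, only the rows with the lowest dob, then the lowest dod."""
    # Step 1: keep rows whose dob equals the smallest dob of that mathematician.
    lowest_dob = frame.groupby("mathematician")["dob"].transform("min")
    frame = frame[frame["dob"] == lowest_dob]
    # Step 2: among the remaining rows, do the same for dod.
    lowest_dod = frame.groupby("mathematician")["dod"].transform("min")
    return frame[frame["dod"] == lowest_dod]

# Toy data (made up for illustration): "A" has two candidate birth years.
toy = pd.DataFrame({
    "mathematician": ["A", "A", "B"],
    "dob": [1648, 1647, 1900],
    "dod": [1715, 1715, 1980],
})
cleaned = keep_lowest_dates(toy)
```

Filtering in two steps matters: the dod minimum is computed only on the rows that survived the dob filter, so a mathematician can never lose all of their rows.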

Let's check!

In [9]:
_df = df[["mathematician", "dob"]].drop_duplicates()
_df = _df.groupby("mathematician").count()
several_dob = _df[_df["dob"] > 1]
print(f"There are now {len(several_dob)} mathematicians with several dob.")

_df = df[["mathematician", "dod"]].drop_duplicates()
_df = _df.groupby("mathematician").count()
several_dod = _df[_df["dod"] > 1]
print(f"There are now {len(several_dod)} mathematicians with several dod.")

There are now 0 mathematicians with several dob.
There are now 0 mathematicians with several dod.


Let's save this clean version of the data!

In [10]:
df.to_csv("../data/main.csv", index=False)

### 3.5 Building the graph using Neo4j

#### 1. Building the nodes


In [11]:
# CSV with all mathematicians and their dates of birth and death
df[["mathematician", "dob", "dod"]].drop_duplicates().to_csv("../data/nodes/mathematicians_dates.csv", index=False)

# for each attribute university, fieldOfWork, country, we create a CSV file with one row for each value of the attribute, as well as the number of mathematicians with this value
for colname in ["university", "fieldOfWork", "country"]:
    c = df.groupby(colname)["mathematician"].nunique().sort_values(ascending=False)
    c = c.rename_axis(colname).reset_index(name="count")
    c.to_csv(f"../data/nodes/{colname}Count.csv", index=False)

We can now load this data into Neo4j and create our 4 types of nodes using the following CYPHER queries:

<pre style="font-family: monospace">
LOAD CSV WITH HEADERS FROM "file:///nodes/mathematicians_dates.csv" AS row
CREATE (:Mathematician {name:row.mathematician, dob:toInteger(row.dob), dod:toInteger(row.dod)})

LOAD CSV WITH HEADERS FROM "file:///nodes/universityCount.csv" AS row
CREATE (:University {name:row.university})

LOAD CSV WITH HEADERS FROM "file:///nodes/fieldOfWorkCount.csv" AS row
CREATE (:FOW {name:row.fieldOfWork})

LOAD CSV WITH HEADERS FROM "file:///nodes/countryCount.csv" AS row
CREATE (:Country {name:row.country})
</pre>

#### 2. Building the edges

In [12]:
# for each type of relationship source->target, we project our main CSV file onto the two corresponding columns, drop duplicates, and save the result as a CSV file
df[["mathematician", "university"]].drop_duplicates().to_csv("../data/edges/mathematicians_universities.csv", index=False)
df[["mathematician", "doctoralStudent"]].drop_duplicates().to_csv("../data/edges/mathematicians_doctoralStudents.csv", index=False)
df[["mathematician", "fieldOfWork"]].drop_duplicates().to_csv("../data/edges/mathematicians_fow.csv", index=False)
df[["mathematician", "country"]].drop_duplicates().to_csv("../data/edges/mathematicians_countries.csv", index=False)

We can now load this data into Neo4j and create our first 3 relationships (STUDIED_AT, DOCTORAL_ADVISOR_OF and HOME_COUNTRY) using the following CYPHER queries:

<pre style="font-family: monospace">
LOAD CSV WITH HEADERS FROM "file:///edges/mathematicians_universities.csv" AS row
MATCH (m:Mathematician), (u:University)
WHERE m.name=row.mathematician AND u.name=row.university
CREATE (m)-[:STUDIED_AT]->(u)

LOAD CSV WITH HEADERS FROM "file:///edges/mathematicians_doctoralStudents.csv" AS row
MATCH (m:Mathematician), (student:Mathematician)
WHERE m.name=row.mathematician AND student.name=row.doctoralStudent
CREATE (m)-[:DOCTORAL_ADVISOR_OF]->(student)

LOAD CSV WITH HEADERS FROM "file:///edges/mathematicians_countries.csv" AS row
MATCH (m:Mathematician), (c:Country)
WHERE m.name=row.mathematician AND c.name=row.country
CREATE (m)-[:HOME_COUNTRY]->(c)
</pre>

### Complementing our graph using Wikipedia's data

For the code, see the separate Python files.

The two outputs that we are interested in are
- edges_mathematicians.csv, which links every mathematician to the mathematical theories they have contributed to
- theory_theory.csv which links every theory to its parent theories

In [13]:
_df = pd.read_csv("../data/edges/edges_mathematicians.csv")
display(_df)
n_mathematicians = _df["Source"].nunique()
n_theories = _df["Target"].nunique()
print(f"There are {n_mathematicians} mathematicians and {n_theories} theories in the dataset.")

Unnamed: 0,Source,Target
0,Carl Friedrich Gauss,abstract algebra
1,Carl Friedrich Gauss,algebra
2,Carl Friedrich Gauss,algebraic number theory
3,Carl Friedrich Gauss,analysis
4,Carl Friedrich Gauss,arithmetic
...,...,...
8273,Geneviève Raugel,perturbation theory
8274,Elizabeth Meckes,applied mathematics
8275,Elizabeth Meckes,matrix theory
8276,Elizabeth Meckes,probability theory


There are 2132 mathematicians and 294 theories in the dataset.


We can now use this CSV file to create the CONTRIBUTED_TO relationship.

<pre style="font-family: monospace">
LOAD CSV WITH HEADERS FROM "file:///edges/edges_mathematicians.csv" AS row
MATCH (m:Mathematician), (t:Theory)
WHERE m.name=row.Source AND t.name=row.Target
CREATE (m)-[:CONTRIBUTED_TO]->(t)
</pre>

In [14]:
_df = pd.read_csv("../data/edges/theory_theory.csv", encoding="latin-1") # encoding is in latin-1, because the Wikidata CSV was encoded in latin-1 as well, probably because Wikidata detected that our system is in French
display(_df)

Unnamed: 0,Source,Target
0,absolute differential calculus,geometry
1,absolute differential calculus,applied mathematics
2,absolute differential calculus,calculus
3,absolute differential calculus,differential geometry
4,absolute differential calculus,linear algebra
...,...,...
1654,wavelets,functional analysis
1655,wavelets,chaos theory
1656,wavelets,computational mathematics
1657,wavelets,fourier analysis


We can use this CSV to create our last relationship: SUBTHEORY_OF.
<pre style="font-family: monospace">
LOAD CSV WITH HEADERS FROM "file:///edges/theory_theory.csv" AS row
MATCH (t1:Theory), (t2:Theory)
WHERE t1.name=row.Source AND t2.name=row.Target
CREATE (t1)-[:SUBTHEORY_OF]->(t2)
</pre>

Our graph is now built!

Here is a sample of what it looks like:

<img src="../data/images/sample_graph.png">

And here is a graph that represents the structure of our graph.

<img src="../data/images/graph_structure.png">

## 4. Analysis and visualization of the graph

### 4.1 Querying the graph

Which 10 countries have the most mathematicians?
<pre style="font-family: monospace">
MATCH (m:Mathematician), (c:Country)
WHERE (m)-[:HOME_COUNTRY]->(c)
WITH COUNT(m) AS count, c.name as country
ORDER BY count DESC
LIMIT 10
RETURN country, count
</pre>
Output : countries_most_mathematicians.csv

In [15]:
pd.read_csv("../data/results/countries_most_mathematicians.csv")

Unnamed: 0,country,count
0,"""United States of America""",785
1,"""Soviet Union""",706
2,"""Germany""",389
3,"""Russian Empire""",344
4,"""Russia""",280
5,"""France""",266
6,"""United Kingdom""",259
7,"""Poland""",137
8,"""United Kingdom of Great Britain and Ireland""",132
9,"""Kingdom of Italy""",116


Which 10 universities have the most mathematicians?
<pre style="font-family: monospace">
MATCH (m:Mathematician), (u:University)
WHERE (m)-[:STUDIED_AT]->(u)
WITH COUNT(m) AS count, u.name as university
ORDER BY count DESC
LIMIT 10
RETURN university, count
</pre>
Output : universities_most_mathematicians.csv

In [16]:
pd.read_csv("../data/results/universities_most_mathematicians.csv")

Unnamed: 0,university,count
0,"""Moscow State University""",252
1,"""University of Göttingen""",210
2,"""MSU Faculty of Mechanics and Mathematics""",180
3,"""University of Cambridge""",171
4,"""Harvard University""",135
5,"""University of Paris""",127
6,"""Humboldt University of Berlin""",120
7,"""Princeton University""",107
8,"""École Normale Supérieure""",100
9,"""Trinity College""",89


Interestingly, we find that the U.S. is the country with the most mathematicians, and yet the top universities where future mathematicians get their education are mostly outside the U.S. We see that Russia, the U.K., Germany and France have more mathematics students than the U.S., yet fewer mathematicians.

Maybe that is because the U.S. is more attractive to talented mathematicians: they get their education in their birth country and then move to the U.S. to live there. Also, European universities had a better reputation in the first half of the 20th century!

<pre style="font-family: monospace">
MATCH (m:Mathematician), (u:University)
WHERE (m)-[:STUDIED_AT]->(u) AND m.dob<=1920
WITH COUNT(m) AS count, u.name as university
ORDER BY count DESC
LIMIT 10
RETURN university, count
</pre>
Output : top_universities_before_1920.csv

<pre style="font-family: monospace">
MATCH (m:Mathematician), (u:University)
WHERE (m)-[:STUDIED_AT]->(u) AND m.dob>1920
WITH COUNT(m) AS count, u.name as university
ORDER BY count DESC
LIMIT 10
RETURN university, count
</pre>
Output : top_universities_after_1920.csv

In [17]:
print("Top universities before 1920:")
display(pd.read_csv("../data/results/top_universities_before_1920.csv"))
print("Top universities after 1920:")
display(pd.read_csv("../data/results/top_universities_after_1920.csv"))

Top universities before 1920:


Unnamed: 0,university,count
0,"""University of Göttingen""",200
1,"""University of Cambridge""",116
2,"""Moscow State University""",112
3,"""Humboldt University of Berlin""",109
4,"""University of Paris""",97
5,"""École Normale Supérieure""",78
6,"""Faculty of Physics and Mathematics of Moscow ...",75
7,"""University of Vienna""",74
8,"""Harvard University""",73
9,"""Trinity College""",69


Top universities after 1920:


Unnamed: 0,university,count
0,"""MSU Faculty of Mechanics and Mathematics""",142
1,"""Moscow State University""",137
2,"""Harvard University""",62
3,"""Princeton University""",58
4,"""University of Cambridge""",49
5,"""Saint Petersburg State University""",36
6,"""University of California, Berkeley""",34
7,"""University of Chicago""",33
8,"""Massachusetts Institute of Technology""",32
9,"""University of Paris""",29


We went from only 1 U.S. university in the top 10 for mathematicians born in or before 1920 to 5 for those born after 1920!

This is consistent with our hypothesis that U.S. universities became more attractive in the 2nd half of the 20th century.
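
The same cohort comparison can be cross-checked on the flat table, without going through Neo4j. This is a minimal sketch on made-up toy rows, assuming a data frame with the `mathematician`, `university` and `dob` columns of our main CSV file; the function name `top_universities` is hypothetical.

```python
import pandas as pd

def top_universities(frame, born_after=None, born_up_to=None, k=10):
    """Rank universities by number of distinct mathematicians in a birth cohort."""
    cohort = frame
    if born_after is not None:
        cohort = cohort[cohort["dob"] > born_after]
    if born_up_to is not None:
        cohort = cohort[cohort["dob"] <= born_up_to]
    # nunique avoids counting a mathematician once per (field, country, ...) row
    counts = cohort.groupby("university")["mathematician"].nunique()
    return counts.sort_values(ascending=False).head(k)

# Toy rows, made up for illustration
toy = pd.DataFrame({
    "mathematician": ["A", "A", "B", "C", "D"],
    "university": ["U1", "U1", "U1", "U2", "U2"],
    "dob": [1890, 1890, 1900, 1930, 1950],
})
before = top_universities(toy, born_up_to=1920)
after = top_universities(toy, born_after=1920)
```

On the real data, `top_universities(df, born_up_to=1920)` and `top_universities(df, born_after=1920)` should reproduce the two rankings above, up to ties.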

Which ten fields of mathematics (excluding "mathematics" itself) are most popular (i.e. have the most mathematicians who contributed to them)?

<pre style="font-family: monospace">
MATCH (m:Mathematician), (f:FOW)
WHERE (m)-[:CONTRIBUTED_TO]->(f) AND f.name<>"mathematics"
WITH COUNT(m) AS count, f.name as fieldOfWork
ORDER BY count DESC
LIMIT 10
RETURN fieldOfWork, count
</pre>
Output : fow_most_mathematicians.csv

In [18]:
pd.read_csv("../data/results/fow_most_mathematicians.csv")

Unnamed: 0,fieldOfWork,count
0,"""mathematical analysis""",357
1,"""number theory""",279
2,"""physics""",234
3,"""geometry""",229
4,"""algebra""",223
5,"""topology""",200
6,"""astronomy""",187
7,"""theory of differential equations""",184
8,"""probability theory""",178
9,"""mechanics""",151


As expected, we find large fields of mathematics like analysis, geometry, algebra, probability, etc. Interestingly, we also find non-mathematical domains such as physics, astronomy and mechanics. This shows that many mathematicians applied their skills to other fields of science.

Which ten mathematicians have had the most doctoral students?
<pre style="font-family: monospace">
MATCH (da:Mathematician)-[:DOCTORAL_ADVISOR_OF]->(:Mathematician)
WITH COUNT(da) AS count, da.name as name
ORDER BY count DESC
LIMIT 10
RETURN name, count
</pre>
Output : top_doctoral_advisors.csv

In [19]:
pd.read_csv("../data/results/top_doctoral_advisors.csv")

Unnamed: 0,name,count
0,"""David Hilbert""",26
1,"""Felix Klein""",25
2,"""Ernst Kummer""",20
3,"""Andrey Kolmogorov""",19
4,"""Nikolai Luzin""",17
5,"""Issai Schur""",16
6,"""Karl Weierstraß""",16
7,"""Charles Émile Picard""",15
8,"""Pavel Aleksandrov""",14
9,"""Rolf Nevanlinna""",14


Which universities are best in probability theory?
<pre style="font-family: monospace">
MATCH (m:Mathematician)-[:STUDIED_AT]->(u:University)
WHERE (m)-[:CONTRIBUTED_TO]->(:FOW {name:"probability theory"})
WITH COUNT(m) AS count, u.name as university
ORDER BY count DESC
LIMIT 3
RETURN university, count
</pre>
Output: top_universities_probability_theory.csv

In [20]:
pd.read_csv("../data/results/top_universities_probability_theory.csv")

Unnamed: 0,university,count
0,"""Moscow State University""",24
1,"""Saint Petersburg State University""",13
2,"""MSU Faculty of Mechanics and Mathematics""",13


All Russian, as expected!

Now, which universities are best in number theory?
<pre style="font-family: monospace">
MATCH (m:Mathematician)-[:STUDIED_AT]->(u:University)
WHERE (m)-[:CONTRIBUTED_TO]->(:FOW {name:"number theory"})
WITH COUNT(m) AS count, u.name as university
ORDER BY count DESC
LIMIT 3
RETURN university, count
</pre>
Output: top_universities_number_theory.csv

In [21]:
pd.read_csv("../data/results/top_universities_number_theory.csv")

Unnamed: 0,university,count
0,"""University of Göttingen""",26
1,"""Humboldt University of Berlin""",20
2,"""Moscow State University""",19


This time Germany is better! Note that Russian universities are leading the way on most theories though...

In [22]:
pd.read_csv("../data/nodes/nodes_mathematicians.csv")

Unnamed: 0,Node
0,Carl Friedrich Gauss
1,Gerardus Mercator
2,Sophie Germain
3,Leonhard Euler
4,Galileo Galilei
...,...
2127,Anna Tramontano
2128,Emma Previato
2129,Garnik Karapetyan
2130,Geneviève Raugel


### 4.2 Using graph algorithms on projected graphs

#### 1. The graph of mathematical theories and how they relate to each other
We wanted to build a graph that would represent the interactions between mathematical subtheories. We were expecting to see some huge theories like algebra or analysis dominate and have many nodes pointing towards them through the SUBTHEORY_OF relationship, and these nodes would be intermediary theories, themselves pointed towards by even smaller theories, etc.

#### First approach
Our first approach was to simply take our CSV file containing the SUBTHEORY_OF (directed) edges and load it straight into Gephi.
Note that our graph's family tree structure makes it unfit for certain graph algorithms, such as PageRank or betweenness centrality. We did attempt running these algorithms and obtained poor results as expected.

Although this graph is quite basic in its construction, running modularity inference on it proved quite successful, as the visualizations below show. However, it was not able to cluster theories into large domains like algebra and instead created smaller clusters, like computational mathematics. We thus have 63 communities and a modularity score of 0.477, which is average.

Also, graph density is very low (0.008) and we have a few lone nodes, which correspond to niche mathematical theories such as systolic hyperbolic geometry, stratified morse theory, bolyai-lobachevskian geometry, etc.
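
For reference, the density of a directed graph without self-loops is the number of edges divided by N(N-1), where N is the number of nodes. The sketch below illustrates this formula and the notion of lone node on made-up toy data; the function names are hypothetical and we assume Gephi reports the same quantity.

```python
def graph_density(n_nodes, n_edges, directed=True):
    """Fraction of possible edges that are present (self-loops excluded)."""
    if n_nodes < 2:
        return 0.0
    possible = n_nodes * (n_nodes - 1)
    if not directed:
        possible //= 2
    return n_edges / possible

def lone_nodes(nodes, edges):
    """Nodes that appear in no edge, neither as source nor as target."""
    touched = {source for source, _ in edges} | {target for _, target in edges}
    return sorted(set(nodes) - touched)

# Toy example (made-up theory names)
nodes = ["algebra", "group theory", "ring theory", "niche theory"]
edges = [("group theory", "algebra"), ("ring theory", "algebra")]
density = graph_density(len(nodes), len(edges))
isolated = lone_nodes(nodes, edges)
```

Note that lone nodes can only be found from a separate node list: a node that has no edge never appears in an edge CSV.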


Here is a visualization using Force Atlas layout:

<img src="../graphs/theory_theory_naive/force_atlas.png">

Here is another using Fruchterman Reingold layout:

<img src="../graphs/theory_theory_naive/fruchterman_reingold.png">

#### Second approach
We felt like this first approach was a bit naive and that we could do better. We thus created another graph, with the same Theory nodes but this time weighting the edges t1->t2 with the number of mathematicians who had contributed to both theories t1 and t2. We achieved this using the following CYPHER query:
<pre style="font-family: monospace">
MATCH (t1:Theory), (t2:Theory), (m:Mathematician) WHERE (m)-[:CONTRIBUTED_TO]->(t1) AND (m)-[:CONTRIBUTED_TO]->(t2) AND t1.name<>t2.name RETURN t1.name AS Source, t2.name AS Target, COUNT(DISTINCT(m)) as Weight
</pre>
Output : theory_theory_2.csv
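
To make the weighting explicit, here is a minimal Python sketch of the same computation, starting from (mathematician, theory) pairs such as the rows of edges_mathematicians.csv. The function name `co_contribution_weights` and the toy pairs are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations

def co_contribution_weights(contributions):
    """Count, for each ordered pair of distinct theories, the mathematicians who contributed to both.

    `contributions` is an iterable of (mathematician, theory) pairs.
    """
    theories_of = defaultdict(set)
    for mathematician, theory in contributions:
        # a set, so that a duplicated row does not count a mathematician twice
        theories_of[mathematician].add(theory)

    weights = Counter()
    for theories in theories_of.values():
        # permutations yields both (t1, t2) and (t2, t1), like the t1.name<>t2.name query
        for t1, t2 in permutations(sorted(theories), 2):
            weights[(t1, t2)] += 1
    return weights

# Toy data, made up for illustration
pairs = [
    ("M1", "algebra"), ("M1", "number theory"),
    ("M2", "algebra"), ("M2", "number theory"), ("M2", "geometry"),
]
weights = co_contribution_weights(pairs)
```

Like the query, this produces each pair in both directions with the same weight, so the resulting graph is effectively undirected.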

We then loaded this CSV file into Gephi and this time the results were a lot more satisfying. Indeed, the modularity inference algorithm detected four classes only, which we identified as algebra, analysis, geometry and combinatorics, as shown below:

<img src="../graphs/theory_theory_2/4_modularity_classes.png">

Overall modularity is 0.136, which is rather low, as expected, because mathematics is not clustered: all domains are linked one way or another. However, we were quite surprised to find combinatorics alongside algebra, analysis and geometry, which are much larger domains. We thought it was some sort of algorithmic artifact and we tweaked the modularity algorithm slightly (using a resolution of 1.05 instead of 1) so as to end up with only 3 classes. This time combinatorics disappeared and we ended up with the algebra/analysis/geometry triad, as hoped!

<img src="../graphs/theory_theory_2/3_modularity_classes.png">

#### 2. The graph of mathematicians and how they are linked to each other through the doctoral student/advisor relationship

To build this graph, we only kept Mathematician nodes and we used directed edges. The edge m1->m2 means that m1 is a doctoral advisor of m2.

The following CYPHER code does exactly this projection:
<pre style="font-family: monospace">
MATCH (m:Mathematician)-[:DOCTORAL_ADVISOR_OF]->(s:Mathematician) RETURN m.name as Source, s.name as Target
</pre>
Output : mathematician_mathematician.csv

We can then load this CSV into Gephi to visualize our projected graph and run graph algorithms on it.

Let's first visualize it using Yifan Hu layout.

<img src="../graphs/mathematician_mathematician/modularity_classes.png">

We immediately see three interesting patterns:
1. the modularity algorithm was able to cluster mathematicians into lineages where a mathematician's parent is their doctoral advisor
2. it would appear that most lineages have common ancestors (corresponding to the large nodes in the center of the graph), meaning ancient mathematicians whose students became prominent mathematicians themselves, and so on
3. some lineages (the ones on the borders of the circle) seem to be independent; this could be due to a lack of data (i.e. we're missing some edges) or to the fact that some mathematical communities are more closed than others (e.g. the Japanese for a long time)

The modularity score for the overall graph is 0.906, which is very high. This is consistent with what we've seen: mathematicians are clustered into lineages which don't interact much with each other.
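
To give an intuition of what this score measures, here is a minimal sketch of the standard (Newman) modularity of a given partition, for an undirected and unweighted graph. Gephi's implementation may differ in its details (edge direction, weights, resolution), so this illustrates the definition and is not meant to reproduce the number above. The toy graph is made up.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Newman modularity of an undirected, unweighted graph for a given partition.

    Q = sum over communities c of  L_c / m  -  (d_c / (2 m))**2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    inside = defaultdict(int)   # L_c
    degree = defaultdict(int)   # d_c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single edge: two well-separated "lineages"
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, partition)
```

Here `q` equals 5/14 (about 0.357). The score grows when more edges fall inside communities than expected at random, which is what happens when lineages barely interact.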

The community size distribution below demonstrates that most of the lineages are small (fewer than 10 mathematicians), although other lineages have over 50 members.

<img src="../graphs/mathematician_mathematician/communities-size-distribution.png">

Another interesting metric with such a graph is eccentricity, which is defined as the distance from a given starting node to the farthest node reachable from it in the network. Since edges are directional, a node's eccentricity is simply the length of its lineage, starting from itself. The eccentricity histogram below shows that most mathematicians have an eccentricity of 0 or 1, i.e. they have no doctoral students in the graph, or their lineage stops after a single generation. It's a bit surprising that most mathematicians have no doctoral students at all. Our hypothesis is that most do have one or more doctoral students, but these students are not famous enough and therefore do not appear on the graph as they do not exist on Wikidata.

<img src="../graphs/mathematician_mathematician/eccentricity.png">
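
As an illustration of the definition, directed eccentricity can be computed with one breadth-first search per node. This is a minimal sketch on a made-up toy graph, assuming edges are (advisor, student) pairs like in mathematician_mathematician.csv; it is not how Gephi computes it internally.

```python
from collections import defaultdict, deque

def eccentricities(edges):
    """Directed eccentricity of every node: the largest BFS distance to a reachable node."""
    children = defaultdict(list)
    nodes = set()
    for advisor, student in edges:
        children[advisor].append(student)
        nodes.update((advisor, student))

    result = {}
    for start in nodes:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for child in children.get(node, ()):
                if child not in distance:  # first visit gives the shortest distance
                    distance[child] = distance[node] + 1
                    queue.append(child)
        result[start] = max(distance.values())
    return result

# Toy lineage: A advises B and D, B advises C
ecc = eccentricities([("A", "B"), ("B", "C"), ("A", "D")])
diameter = max(ecc.values())
```

A node without students gets eccentricity 0, and the largest eccentricity over all nodes is the diameter of the graph.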

Note that the graph diameter (defined as the longest shortest path between any two nodes) is 14. This is quite impressive, as it means that there is a lineage spanning 14 advisor/student links, i.e. 15 mathematicians. Let's take a look at it.

We first retrieve the corresponding path using the following CYPHER command:
<pre style="font-family: monospace">
MATCH path=(:Mathematician)-[:DOCTORAL_ADVISOR_OF*14]->(:Mathematician) RETURN path
</pre>

Visualizing it directly in the Neo4j browser, we obtain the following:

<img src="../graphs/mathematician_mathematician/longest_path.png">

We see that this lineage starts with Christian August Hausen, a German mathematician and physicist born in 1693, and ends in several leaves, including Marta Bunge, an Argentinian-Canadian mathematician who died last year, in 2022! We also see that some very famous mathematicians are actually "grandparents" of other very famous mathematicians through the doctoral student/advisor link. In that regard, Gauss is Felix Klein's great-grandfather, and Klein himself is David Hilbert's grandfather!

One additional question that comes to mind naturally is to determine the size of Christian August Hausen's lineage, i.e. the number of mathematicians who have him as their ancestor. We can figure this out quite simply using the following CYPHER query:
<pre style="font-family: monospace">
MATCH path=(:Mathematician {name:"Christian August Hausen"})-[:DOCTORAL_ADVISOR_OF*1..]->(m:Mathematician) RETURN COUNT(DISTINCT(m))
</pre>
Output : 978

In other words, over one quarter of the mathematicians in our graph have Christian August Hausen as their ancestor!
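
Indeed, 978 / 3753 ≈ 0.26. The same count can be obtained outside Neo4j with a simple graph traversal that visits each descendant once. This is a minimal sketch on made-up toy edges, assuming (advisor, student) pairs; the function name `descendants` is hypothetical.

```python
from collections import defaultdict

def descendants(edges, root):
    """All nodes reachable from `root` by following advisor -> student edges."""
    children = defaultdict(list)
    for advisor, student in edges:
        children[advisor].append(student)

    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    seen.discard(root)  # guard against a cycle leading back to the root
    return seen

# Toy edges: C has two advisors but is counted only once
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E")]
lineage = descendants(edges, "A")
share = len(lineage) / 5  # 5 nodes in this toy graph
```

Because visited nodes are stored in a set, a student with several advisors in the lineage is counted once, like with `COUNT(DISTINCT(m))` in the query.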

Finally, if we look at the number of connected components, we find 112 separate connected components. The histogram below confirms our intuition that there is one massive mathematical community with over 1600 nodes, and then a handful of much smaller, isolated lineages with no more than 50 nodes.

<img src="../graphs/mathematician_mathematician/cc-size-distribution.png">
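
For illustration, component sizes can be computed from the edge list with a union-find structure. This is a minimal sketch on made-up toy edges, assuming the components are weakly connected (i.e. edge direction is ignored); nodes that appear in no edge are not counted here.

```python
from collections import Counter

def component_sizes(edges):
    """Sizes of the weakly connected components, largest first (edge direction ignored)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components

    sizes = Counter(find(node) for node in list(parent))
    return sorted(sizes.values(), reverse=True)

# Toy edges: one lineage of 3 mathematicians and one of 2
sizes = component_sizes([("A", "B"), ("B", "C"), ("D", "E")])
```

The length of the returned list is the number of components, and its first element is the size of the largest one.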

Also, note that again our graph's family tree structure makes it unfit for certain graph algorithms, such as PageRank or betweenness centrality. We did attempt running these algorithms and obtained poor results as expected.