# Course network data

In this notebook we will analyze the similarity between different courses. <br>
Each course corresponds to the set of students who were enrolled in it. We define the 'similarity' between two courses as the Jaccard coefficient between the two sets of students enrolled in those courses. <br>


---
<b> An example: </b> <br>
Take two courses: Machine learning (ML) and Applied data analysis (ADA). <br>
We define the students enrolled in the ML course as the set A, and the ones enrolled in ADA as the set B. <br>
Then, the similarity between those two courses is equal to the Jaccard coefficient between A and B. <br>
<img src="../images/jaccard.png"> <br>
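
As a minimal sketch of the definition above, the coefficient can be computed directly on Python sets. The student IDs and the helper name `jaccard_coefficient` are invented for illustration and do not come from the data set.

```python
# Toy student IDs, invented for illustration only.
ml_students = {1, 2, 3, 4}   # set A: students enrolled in ML
ada_students = {3, 4, 5}     # set B: students enrolled in ADA


def jaccard_coefficient(a, b):
    """Return |a ∩ b| / |a ∪ b| for two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# 2 shared students out of 5 distinct students in total
similarity = jaccard_coefficient(ml_students, ada_students)
print(similarity)
```

The coefficient is 1 when both courses have exactly the same students and 0 when they share none.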

---
<br>
For our final visualization, we will build a network in which the courses are nodes and each course is connected to its most 'similar' courses. <br>
In this notebook, we will build the underlying network. We will apply a heuristic that creates a planar graph, in order to help us in a later phase of our visualization.


## 1. Load the data set

First, we will import a set of libraries that will be useful for our analysis.

In [31]:
import pandas as pd
from matplotlib import pyplot as plt
import plotly.express as px
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import networkx as nx

%matplotlib inline

Now, we will read the data. They are divided into 3 main data sets:
* <b> Student: </b> contains information about each student enrolled at EPFL
* <b> Courses: </b> contains information about all courses offered by EPFL at master level
* <b> Enrollment: </b> contains information about student enrollment in courses

In [5]:
df_student = pd.read_csv('../../data/csv/student.csv')
df_course = pd.read_csv('../../data/csv/courses.csv')
df_enrollment = pd.read_csv('../../data/csv/enrollment.csv')

In [7]:
df_course.head(2)

Unnamed: 0.1,Unnamed: 0,course_id,course_name,year
0,0,0,Biological and physiological transport,2006-2007
1,1,1,Biological and physiological transport,2007-2008
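
Columns named `Unnamed: 0` usually appear when a CSV file was written together with its row index and then read back without declaring that index. The sketch below reproduces the effect on a toy frame (the values are illustrative) and shows two common ways to avoid it. It is a side note and does not change the files used in this notebook.

```python
import io

import pandas as pd

# Toy frame, for illustration only.
df = pd.DataFrame({"course_id": [0, 1], "year": ["2006-2007", "2007-2008"]})

# By default, to_csv() also writes the row index as an unnamed first column.
csv_text = df.to_csv()

# Reading it back naively turns that index into an 'Unnamed: 0' column.
naive = pd.read_csv(io.StringIO(csv_text))
print(list(naive.columns))

# Option 1: declare the first column as the index when reading.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))

# Option 2: do not write the index in the first place.
no_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(list(no_index.columns))
```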


In [8]:
df_enrollment.head()

Unnamed: 0.1,Unnamed: 0,student_id,course_id,semester
0,0,3692,0,Master semestre 2


In [8]:
df_student.head()

Unnamed: 0.1,Unnamed: 0,student_name,section,student_id
0,0,Aabid Fouad,Génie mécanique,0
1,1,Aamodt Simen,Génie mécanique,1
2,2,Aanhaanen Simone,Architecture,2
3,3,Aapro Laurent,Systèmes de communication - master,3
4,4,Aapro Niccolò,Informatique,4


## 2. Set of students enrolled in each course

For each course we will now compute the set of students attending that course. We will save this information in a file 'jaccard.csv', as we will use it often in the visualization.

In [13]:
df_complete = df_course.merge(df_enrollment, on='course_id')
# For each course name, collect the set of ids of the students enrolled in it (over all years)
df_jaccard = df_complete.groupby('course_name')['student_id'].apply(set).reset_index(name='set')
df_jaccard.head(2)

Unnamed: 0,course_name,set
0,Numerical approximation of PDE's II,"{33792, 18439, 36362, 37773, 15374, 20879, 359..."
1,3D Electron Microscopy and FIB-Nanotomography,"{35843, 34309, 5650, 539, 32805, 7720, 7733, 2..."


In [14]:
df_jaccard.to_csv('../../data/csv/jaccard.csv')

In particular, we will use this file to compute the Jaccard coefficient between two courses, based on the students enrolled over the years.
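
One point to keep in mind when reusing such a file: a Python set stored in a CSV cell is written as its text representation, so it comes back as a string. The sketch below, on a toy frame with invented values, shows one possible way to parse it back with `ast.literal_eval`. It assumes every cell holds a non-empty set literal such as `{1, 2, 3}`.

```python
import ast
import io

import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({
    "course_name": ["Course A", "Course B"],
    "set": [{1, 2, 3}, {3, 4}],
})

# Round trip through an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After loading, each cell is a string such as "{1, 2, 3}".
print(type(loaded["set"].iloc[0]))

# Parse the text back into real Python sets.
loaded["set"] = loaded["set"].apply(ast.literal_eval)
print(loaded["set"].iloc[0] & loaded["set"].iloc[1])
```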

## 3. The course network

Now, we will compute the connections between the courses in our database. We will again use the Jaccard coefficient as the similarity measure between courses, and to do so we first build every pair of courses.
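
Since the Jaccard coefficient is symmetric, an alternative sketch is to enumerate each unordered pair only once with `itertools.combinations`. The toy frame below uses invented values. This is not the approach taken in this notebook, which keeps both directions of every pair so that edges can later be grouped by `course_name_x`.

```python
from itertools import combinations

import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({
    "course_name": ["A", "B", "C"],
    "set": [{1, 2, 3}, {2, 3, 4}, {9}],
})

rows = []
courses = zip(toy["course_name"], toy["set"])
# combinations(..., 2) yields each unordered pair exactly once, without self-pairs.
for (name_x, set_x), (name_y, set_y) in combinations(courses, 2):
    coefficient = len(set_x & set_y) / len(set_x | set_y)
    rows.append((name_x, name_y, coefficient))

pairs = pd.DataFrame(rows, columns=["course_name_x", "course_name_y", "jaccard"])
print(pairs)
```

For n courses this produces n(n-1)/2 rows instead of n², at the cost of having to mirror the pairs if a per-course view is needed afterwards.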

In [15]:
df_jaccard['key'] = 1
df_product = df_jaccard.merge(df_jaccard, on='key')
df_product.drop('key', axis=1, inplace=True)
df_product.head(2)

Unnamed: 0,course_name_x,set_x,course_name_y,set_y
0,Numerical approximation of PDE's II,"{33792, 18439, 36362, 37773, 15374, 20879, 359...",Numerical approximation of PDE's II,"{33792, 18439, 36362, 37773, 15374, 20879, 359..."
1,Numerical approximation of PDE's II,"{33792, 18439, 36362, 37773, 15374, 20879, 359...",3D Electron Microscopy and FIB-Nanotomography,"{35843, 34309, 5650, 539, 32805, 7720, 7733, 2..."


In [16]:
def jaccard(s1, s2):
    # Jaccard coefficient: |s1 ∩ s2| / |s1 ∪ s2|
    common = len(s1 & s2)
    return common / (len(s1) + len(s2) - common)

df_edges = df_product.copy()
df_edges['jaccard'] = df_product.apply(lambda x: jaccard(x['set_x'], x['set_y']), axis=1)
df_edges.drop(['set_x', 'set_y'], axis=1, inplace=True)
df_edges.head()

Unnamed: 0,course_name_x,course_name_y,jaccard
0,Numerical approximation of PDE's II,Numerical approximation of PDE's II,1.0
1,Numerical approximation of PDE's II,3D Electron Microscopy and FIB-Nanotomography,0.0
2,Numerical approximation of PDE's II,A History of Evolutionary Theory,0.0
3,Numerical approximation of PDE's II,A Political History of Urban Form,0.0
4,Numerical approximation of PDE's II,A guided tour for engineers in applied stochas...,0.0


Save all the computed Jaccard coefficients.

In [17]:
df_edges.to_csv('../../data/csv/all_edges.csv')

Now, we will apply a heuristic method in order to keep only the most relevant edges. For each node we will keep its 5 most similar neighbours (among those with Jaccard > 0).
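
The per-node filtering can be sketched on a toy edge list. Here the number of neighbours kept, `k`, is set to 2 and all values are invented for illustration. The sketch sorts once and takes the first rows of every group, which is one of several equivalent ways to select the strongest neighbours of each course.

```python
import pandas as pd

# Toy edge list, for illustration only.
toy_edges = pd.DataFrame({
    "course_name_x": ["A", "A", "A", "B", "B"],
    "course_name_y": ["B", "C", "D", "A", "C"],
    "jaccard": [0.30, 0.10, 0.20, 0.30, 0.05],
})

k = 2  # number of neighbours to keep per course (illustrative value)

# Sort by decreasing similarity, then keep the first k rows of every course.
top_k = (
    toy_edges.sort_values("jaccard", ascending=False)
    .groupby("course_name_x")
    .head(k)
)
print(top_k)
```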

In [103]:
df_count = df_edges[df_edges['jaccard'] > 0].groupby('course_name_x').count().drop('course_name_y', axis=1).reset_index()
df_count.rename(columns={"course_name_x": "course_name", "jaccard": "neighbours"}, inplace=True)
df_count.head()

Unnamed: 0,course_name,neighbours
0,Numerical approximation of PDE's II,93
1,3D Electron Microscopy and FIB-Nanotomography,146
2,A History of Evolutionary Theory,2
3,A Political History of Urban Form,117
4,A guided tour for engineers in applied stochas...,98


In [104]:
# Map each course name to its number of neighbours
size_map = dict(zip(df_count['course_name'], df_count['neighbours']))
len(size_map) # Should be 2893

2867

In [105]:
# How many isolated nodes? Looks like none 
df_count[df_count['neighbours'] == 0]

Unnamed: 0,course_name,neighbours


In [106]:
# Remove self-loops
df_edges = df_edges[df_edges['course_name_x'] != df_edges['course_name_y']]
# Remove edges with zero value
df_edges = df_edges[df_edges['jaccard'] > 0]
df_edges.head(5)

Unnamed: 0,course_name_x,course_name_y,jaccard
63,Numerical approximation of PDE's II,Advanced methods in computational solid mechanics,0.017857
65,Numerical approximation of PDE's II,Advanced multiprocessor architecture,0.007407
67,Numerical approximation of PDE's II,Advanced numerical analysis,0.025
74,Numerical approximation of PDE's II,Advanced regression,0.02439
76,Numerical approximation of PDE's II,Advanced scientific computing,0.018182


In [107]:
def keep_friends(g):
    # g holds all the edges of one course: keep the 5 with the highest Jaccard coefficient
    return g.nlargest(5, "jaccard")

df_edges_planar = df_edges.groupby('course_name_x', group_keys=False).apply(keep_friends)
df_edges_planar.head(20)

Unnamed: 0,course_name_x,course_name_y,jaccard
1768,Numerical approximation of PDE's II,Numerical approximation of PDE's I,0.242105
1773,Numerical approximation of PDE's II,Numerical integration of dynamical systems,0.219512
532,Numerical approximation of PDE's II,Computational linear algebra,0.176471
1774,Numerical approximation of PDE's II,Numerical methods for conservation laws,0.12
844,Numerical approximation of PDE's II,Elliptic partial differential equations,0.116279
4644,3D Electron Microscopy and FIB-Nanotomography,Non-destructive evaluation methods,0.153846
3562,3D Electron Microscopy and FIB-Nanotomography,Céramiques,0.08589
5447,3D Electron Microscopy and FIB-Nanotomography,Thermodynamique,0.075556
3303,3D Electron Microscopy and FIB-Nanotomography,"Ceramics, properties",0.074257
3659,3D Electron Microscopy and FIB-Nanotomography,Déformation et rupture à basse température,0.068182


In [146]:
df_edges_planar.to_csv('../../data/csv/server/course_network_v0.csv')

In [108]:
len(df_edges_planar)

14041

In [109]:
2893*3

8679

In [133]:
# Sort the edges so that the least important ones come first
df_edges_planar = df_edges_planar.sort_values('jaccard', ascending=True)
df_edges_planar.head()

Unnamed: 0,course_name_x,course_name_y,jaccard
1761406,Contact mechanics and nonsmooth tribology,Systèmes mécaniques,0.000602
56830,Advanced Topics on Privacy Protection,Pattern classification and machine learning,0.00119
6988052,Summer School on Reproducibility in Computatio...,Machine learning,0.001252
1760608,Contact mechanics and nonsmooth tribology,Mécanique des fluides incompressibles,0.001344
7840052,Topics in Language-based Software Security,Advanced algorithms,0.001395


Let's take a look at our graph. Is it connected?

In [134]:
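# Build the graph from the filtered edge list before inspecting it
graph = nx.from_pandas_edgelist(df_edges_planar, source='course_name_x', target='course_name_y')
nx.is_connected(graph)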

False

It is not connected. We can print the size of each connected component to get an idea of its topology.

In [137]:
[len(c) for c in sorted(nx.connected_components(graph), key=len, reverse=True)]

[2755, 39, 25, 8, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

Now, we will try to generate a planar graph.

In [138]:
graph = nx.from_pandas_edgelist(df_edges_planar, source='course_name_x', target='course_name_y')
nx.algorithms.planarity.check_planarity(graph)

(False, None)

In [139]:
lst_edges = df_edges_planar.values.tolist()
lst_edges[:3]

[['Contact mechanics and nonsmooth tribology',
  'Systèmes mécaniques',
  0.0006024096385542169],
 ['Advanced Topics on Privacy Protection',
  'Pattern classification and machine learning',
  0.0011904761904761906],
 ['Summer School on Reproducibility in Computational Sciences 2019',
  'Machine learning',
  0.0012515644555694619]]

We will now remove the least important edges according to the following criteria:
* Take the edge with the smallest value which has not been examined yet
* If removing the edge does not disconnect the graph, remove it. Otherwise keep it and continue with the next edge
* Check if the graph is planar; if not, repeat
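
The steps above can be sketched on a small graph. The example uses the complete graph on 5 nodes, which is known not to be planar, with invented edge weights standing in for the Jaccard values. It illustrates the idea under these assumptions and is not the exact code run on the course network.

```python
import itertools

import networkx as nx

check_planarity = nx.algorithms.planarity.check_planarity

# Complete graph on 5 nodes (K5) with invented weights 1, 2, 3, ...
graph = nx.Graph()
for weight, (u, v) in enumerate(itertools.combinations(range(5), 2), start=1):
    graph.add_edge(u, v, weight=weight)

# Candidate edges, weakest first.
edges_by_weight = sorted(graph.edges(data="weight"), key=lambda e: e[2])

# Number of connected components that must be preserved.
components = nx.number_connected_components(graph)

removed = 0
for u, v, w in edges_by_weight:
    if check_planarity(graph)[0]:
        break  # the graph is planar: stop pruning
    graph.remove_edge(u, v)
    if nx.number_connected_components(graph) != components:
        # Removing this edge split a component: put it back.
        graph.add_edge(u, v, weight=w)
    else:
        removed += 1

print(removed, graph.number_of_edges(), check_planarity(graph)[0])
```

Recounting the connected components after every removal is simple but expensive, which is consistent with the long running time noted for the full network.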

In [140]:
# WARNING: long running time
index = 0
removed = 0
# Number of connected components, which must stay the same
cc = nx.number_connected_components(graph)
while not nx.algorithms.planarity.check_planarity(graph)[0]:
    source = lst_edges[index][0]
    target = lst_edges[index][1]

    if graph.has_edge(source, target):
        graph.remove_edge(source, target)
        # Put the edge back if removing it split a component
        if nx.number_connected_components(graph) != cc:
            graph.add_edge(source, target)
        else:
            removed += 1
    index += 1
    if index % 1000 == 0:
        print(index, removed)

nx.algorithms.planarity.check_planarity(graph)

1000 826
2000 1601
3000 2311
4000 2995
5000 3591
6000 4143
7000 4683
8000 5184
9000 5656
10000 6132
11000 6563
12000 7002
13000 7358


(True, <networkx.algorithms.planarity.PlanarEmbedding at 0x1a2a9db7d0>)

Now, how many edges are left in our network? We look at the average degree, i.e. twice the number of edges divided by the number of nodes.

In [143]:
2 * len(graph.edges) / len(graph.nodes)

1.9916288803627484

The average degree is about 2, which is quite low!
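
To put this value in perspective, two reference points may help: a tree with n nodes has n - 1 edges, so its average degree is 2(n - 1)/n, just below 2, while a simple planar graph with n ≥ 3 nodes has at most 3n - 6 edges, so its average degree stays below 6. The sketch below checks both on small standard graphs. The helper `average_degree` is a name introduced here for illustration.

```python
import networkx as nx


def average_degree(g):
    """Average degree of an undirected graph: 2 * |E| / |V|."""
    return 2 * g.number_of_edges() / g.number_of_nodes()


# A path on 6 nodes is a tree: n - 1 = 5 edges.
tree = nx.path_graph(6)

# The octahedron is planar and reaches the bound 3n - 6 = 12 edges for n = 6.
triangulation = nx.octahedral_graph()

print(average_degree(tree))           # just below 2
print(average_degree(triangulation))  # 4, below the planar limit of 6
```

An average degree close to 2 therefore suggests that the pruned network is not far from a tree in each component.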

In [None]:
nx.draw(graph)

So, we decided to try to re-add those edges which do not break planarity, starting from the most important ones.
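
This greedy re-insertion can be sketched on a toy example: start from a sparse planar graph (a path on 5 nodes) and offer every edge of the complete graph, strongest first, keeping an edge only if the graph stays planar. The weights are invented and the setup is an illustration, not the notebook's actual data.

```python
import itertools

import networkx as nx

check_planarity = nx.algorithms.planarity.check_planarity

# Candidate edges of the complete graph on 5 nodes, with invented weights.
candidates = [
    (u, v, w)
    for w, (u, v) in enumerate(itertools.combinations(range(5), 2), start=1)
]

# Sparse planar starting point: the path 0-1-2-3-4.
graph = nx.Graph()
graph.add_edges_from([(0, 1), (1, 2), (2, 3), (3, 4)])

added = 0
for u, v, w in sorted(candidates, key=lambda e: e[2], reverse=True):
    if graph.has_edge(u, v):
        continue
    graph.add_edge(u, v)
    if check_planarity(graph)[0]:
        added += 1
    else:
        # This edge would break planarity: undo the insertion.
        graph.remove_edge(u, v)

print(added, graph.number_of_edges())
```

The result depends on the order in which edges are offered, so this procedure gives a planar graph to which no further candidate edge can be added, not necessarily the one with the largest total weight.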

In [147]:
df_edges_planar = df_edges_planar.sort_values('jaccard', ascending=False)
incr_edge_list = df_edges_planar.values.tolist()
incr_edge_list[:3]

[['Immunology and microbiology II', 'Immunology and microbiology I', 1.0],
 ['Images de la nature I', 'Images de la nature II', 1.0],
 ['Cours UNIL - Faculté des hautes études commerciales HEC (printemps 13)',
  "Cours UNIL - Faculté des géosciences et de l'environnement GSE  (printemps 13)",
  1.0]]

In [148]:
# WARNING: long running time
index = 0
added = 0
for edge in incr_edge_list:
    source = edge[0]
    target = edge[1]
    if not graph.has_edge(source, target):
        graph.add_edge(source, target)
        if not nx.algorithms.planarity.check_planarity(graph)[0]:
            graph.remove_edge(source, target)
        else:
            added += 1
    index += 1
    if index % 1000 == 0:
        print(index, added)

nx.algorithms.planarity.check_planarity(graph)[0]
        

1000 161
2000 411
3000 709
4000 1001
5000 1278
6000 1575
7000 1865
8000 2200
9000 2512
10000 2821
11000 3119
12000 3390
13000 3638
14000 3853


True

In [149]:
df_edges_planar = nx.to_pandas_edgelist(graph)
df_edges_planar.head()

Unnamed: 0,source,target
0,Contact mechanics and nonsmooth tribology,Rhéologie des matériaux
1,Contact mechanics and nonsmooth tribology,Mécanique des fluides incompressibles
2,Systèmes mécaniques,Mécanique des structures (pour GM)
3,Systèmes mécaniques,Écoulement des fluides
4,Systèmes mécaniques,Procédés de production


Save these edges to a CSV file.

In [150]:
df_edges_planar.to_csv('../../data/csv/server/course_network_v1.csv')

In [151]:
def keep_3_friends(g):
    # g holds all the edges of one course: keep the 3 with the highest Jaccard coefficient
    return g.nlargest(3, "jaccard")

df_edges_planar = df_edges.groupby('course_name_x', group_keys=False).apply(keep_3_friends)

In [152]:
df_edges_planar.to_csv('../../data/csv/server/course_network_v2.csv')