# 1 Graph Centralities

In [42]:
# Packages
import numpy as np
import pandas as pd
from zipfile import ZipFile
import networkx as nx
from fastprogress import master_bar, progress_bar
from networkx.algorithms import centrality

In [2]:
import matplotlib.pyplot as plt

Mount google drive

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [4]:
PATH_FOLDER = '/content/gdrive/MyDrive/Colab Notebooks/graph_based_recommendation_system'
%cd {PATH_FOLDER}

/content/gdrive/.shortcut-targets-by-id/1Qu7UOLxDtaHg6JbrDo0q-3e4ePr807CH/graph_based_recommendation_system


In [5]:
%ls

DataExploration.ipynb  GraphCentralities.ipynb  project_proposal.docx
dataset.zip            graph_features.adjlist   project_proposal.pdf
[0m[01;34mfigures[0m/               graph_features.edgelist  README.md


Load graph

In [6]:
G = nx.readwrite.edgelist.read_weighted_edgelist('./graph_features.edgelist')

Top elements to retrieve:

In [7]:
num_tops = 10

## 1.1 Degree centrality
Without considering that the graph is bipartite

In [101]:
%%time
degree_centrality = centrality.degree_centrality(G)

CPU times: user 110 ms, sys: 4 µs, total: 110 ms
Wall time: 113 ms


In [102]:
# Sort dictionary by value
degree_centrality = {k: v for k, v in sorted(degree_centrality.items(), key=lambda item: item[1], reverse=True)}

In [103]:
u = 0 #top users
a = 0 #top anime

top_degreeU = []
top_degreeA = []

for k in degree_centrality.keys():
  if k[:4] == 'user' and u < num_tops:
    u += 1
    top_degreeU.append((k, degree_centrality[k]))
  elif k[:5] == 'anime' and a < num_tops:
    a += 1
    top_degreeA.append((k, degree_centrality[k]))
  if u == num_tops and a == num_tops:
    break

In [104]:
print('Top users:')
for k,v in top_degreeU:
  print('{}    {}'.format(k,v))

print('--------------\n')

print('Top anime:')
for k,v in top_degreeA:
  print('{}    {}'.format(k,v))

Top users:
user_42635    0.0471172587236718
user_53698    0.0365293932725558
user_57620    0.03381326626846903
user_59643    0.03309651053127947
user_51693    0.03295818924866394
user_45659    0.03095881798176674
user_7345    0.030543854133920148
user_12431    0.029525306507387612
user_65840    0.0278528764539453
user_22434    0.025036152153410877
--------------

Top anime:
anime_1535    0.4303803835271927
anime_11757    0.3308267840301792
anime_16498    0.3180006287331028
anime_1575    0.30336372209996854
anime_6547    0.2963219113486325
anime_226    0.2958566488525621
anime_20    0.27753536623703234
anime_5114    0.27027978623074506
anime_121    0.26824269097767994
anime_2904    0.2656271612700409


## 1.2 Closeness centrality
Without considering that the graph is bipartite.  
Note: Basic implementation of **Eppstein-Wang Algorithm** for computing an approximate closeness centralic metric.  
_Wang, David Eppstein Joseph. "Fast approximation of centrality." Graph algorithms and applications 5.5 (2006): 39._  

For weighted graph, Dijkstra's algorithm will be used.

**Without weights** (ratings in this case)

In [26]:
def cost_path(G, edges_list, weight=None):
  result = 0
  if weight != None:
    for i in range(1, len(edges_list)):
      v = edges_list[i-1]
      u = edges_list[i]
      result += G[v][u][weight]
  else:
    result = len(edges_list) - 1
  return result

In [48]:
from collections import defaultdict
import random as rd

def apprimateClosenessCentralities(G, k, weight=None):
  rd.seed(1)
  sum_v = defaultdict()
  for v in G.nodes:
    sum_v[v] = 0

  for i in range(k):
    v_i = rd.choice(list(G.nodes))
    if weight != None: sssp = nx.algorithms.shortest_paths.generic.shortest_path(G, source=v_i,weight=weight, method='dijkstra')
    else: sssp = nx.algorithms.shortest_paths.generic.shortest_path(G, source=v_i,weight=None)
    for v in G.nodes:
      if weight != None: sum_v[v] += cost_path(G, sssp[v_j], weight)
      else: sum_v[v] += cost_path(G, sssp[v_j])

  cc = defaultdict() #closeness centrality
  n = len(G.nodes)
  for v in G.nodes:
    cc[v] = 1 / ((n * sum_v[v]) / (k *(n-1)))

  return cc

In [71]:
%%time
k = 150
closeness_centrality = apprimateClosenessCentralities(G, 150)

CPU times: user 8min 29s, sys: 1.25 s, total: 8min 30s
Wall time: 8min 30s


In [72]:
# Sort dictionary by value
closeness_centrality = {k: v for k, v in sorted(closeness_centrality.items(), key=lambda item: item[1], reverse=True)}

In [73]:
u = 0 #top users
a = 0 #top anime

top_closenessU = []
top_closenessA = []

for k in closeness_centrality.keys():
  if k[:4] == 'user' and u < num_tops:
    u += 1
    top_closenessU.append((k, closeness_centrality[k]))
  elif k[:5] == 'anime' and a < num_tops:
    a += 1
    top_closenessA.append((k, closeness_centrality[k]))
  if u == num_tops and a == num_tops:
    break

In [74]:
print('Top users:')
for k,v in top_closenessU:
  print('{}    {}'.format(k,v))

print('--------------\n')

print('Top anime:')
for k,v in top_closenessA:
  print('{}    {}'.format(k,v))

Top users:
user_1    0.39266521943562727
user_2    0.39266521943562727
user_3    0.39266521943562727
user_5    0.39266521943562727
user_7    0.39266521943562727
user_8    0.39266521943562727
user_9    0.39266521943562727
user_10    0.39266521943562727
user_11    0.39266521943562727
user_12    0.39266521943562727
--------------

Top anime:
anime_8074    0.39266521943562727
anime_11617    0.39266521943562727
anime_11757    0.39266521943562727
anime_15451    0.39266521943562727
anime_11771    0.39266521943562727
anime_20    0.39266521943562727
anime_154    0.39266521943562727
anime_170    0.39266521943562727
anime_199    0.39266521943562727
anime_225    0.39266521943562727


**With weights** (ratings in this case)

In [75]:
%%time
k = 150
closeness_centrality = apprimateClosenessCentralities(G, 150, 'weight')

CPU times: user 43min 27s, sys: 6.08 s, total: 43min 33s
Wall time: 43min 35s


In [76]:
# Sort dictionary by value
closeness_centrality = {k: v for k, v in sorted(closeness_centrality.items(), key=lambda item: item[1], reverse=True)}

In [77]:
u = 0 #top users
a = 0 #top anime

top_closenessU = []
top_closenessA = []

for k in closeness_centrality.keys():
  if k[:4] == 'user' and u < num_tops:
    u += 1
    top_closenessU.append((k, closeness_centrality[k]))
  elif k[:5] == 'anime' and a < num_tops:
    a += 1
    top_closenessA.append((k, closeness_centrality[k]))
  if u == num_tops and a == num_tops:
    break

In [78]:
print('Top users:')
for k,v in top_closenessU:
  print('{}    {}'.format(k,v))

print('--------------\n')

print('Top anime:')
for k,v in top_closenessA:
  print('{}    {}'.format(k,v))

Top users:
user_1    0.07295628104300078
user_2    0.07295628104300078
user_3    0.07295628104300078
user_5    0.07295628104300078
user_7    0.07295628104300078
user_8    0.07295628104300078
user_9    0.07295628104300078
user_10    0.07295628104300078
user_11    0.07295628104300078
user_12    0.07295628104300078
--------------

Top anime:
anime_8074    0.07295628104300078
anime_11617    0.07295628104300078
anime_11757    0.07295628104300078
anime_15451    0.07295628104300078
anime_11771    0.07295628104300078
anime_20    0.07295628104300078
anime_154    0.07295628104300078
anime_170    0.07295628104300078
anime_199    0.07295628104300078
anime_225    0.07295628104300078


## 1.3 Degree centrality - Bipartite
Considering that the graph is bipartite

In [85]:
# Loading dataset from .zip file
path_dataset = "dataset.zip"

with ZipFile(path_dataset, 'r') as zip_ref:
    all_path = zip_ref.namelist()
    print('Paths: ', all_path)
    
    df_anime = pd.read_csv(zip_ref.open(all_path[0]))
    df_rating = pd.read_csv(zip_ref.open(all_path[1]))

Paths:  ['anime.csv', 'rating.csv']


### Pre-processing

Elimination of rows that have Nan values

In [86]:
#df_anime.dropna(inplace=True)
df_rating.dropna(inplace=True)

Elimination of all rating values whose anime doesn't have a description in file 'anime.csv'

In [87]:
for i in df_rating.anime_id.unique():
  if df_anime[df_anime['anime_id'] == i].empty:
    df_rating = df_rating[df_rating['anime_id'] != i]

Elimination of all user-item iteraction when a rate has not been given (rating = -1)

In [88]:
# number of user that have seen an anime, but didn't give a rate (meaning rating = -1)
df_rating[df_rating['rating'] == -1].rating.value_counts()

-1    1476488
Name: rating, dtype: int64

In [89]:
# new dataframe without user-item itercations with rating = -1
new_df_rating = df_rating[df_rating['rating'] != -1]
assert(new_df_rating.shape[0] == df_rating.shape[0] - 1476488)

df_rating = new_df_rating
del new_df_rating

Number of possible nodes and edges after the pre-processing phase

In [90]:
num_users = len(df_rating['user_id'].unique())
num_items = len(df_rating['anime_id'].unique())
num_nodes = num_users + num_items

print('Number of nodes: ', num_nodes)
print('Number of edges: ', df_rating.shape[0])

Number of nodes:  79526
Number of edges:  6337239


### Computation

In [91]:
# Add user node features
for i in df_rating.user_id.unique():
  attrs = {'user_' + str(i): {'node_type':'user'}}
  nx.set_node_attributes(G, attrs)

# Add anime node features
for i in df_rating.anime_id.unique():
  attrs = {'anime_' + str(i): {'node_type':'anime'}}
  nx.set_node_attributes(G, attrs)

In [96]:
anime_nodes = []
for id,t in G.nodes(data='node_type'):
  if t == 'anime':
    anime_nodes.append(id)

In [97]:
%%time
degree_centrality = nx.algorithms.bipartite.centrality.degree_centrality(G, anime_nodes)

CPU times: user 186 ms, sys: 999 µs, total: 187 ms
Wall time: 188 ms


In [98]:
# Sort dictionary by value
degree_centrality = {k: v for k, v in sorted(degree_centrality.items(), key=lambda item: item[1], reverse=True)}

In [99]:
u = 0 #top users
a = 0 #top anime

top_degreeU = []
top_degreeA = []

for k in degree_centrality.keys():
  if k[:4] == 'user' and u < num_tops:
    u += 1
    top_degreeU.append((k, degree_centrality[k]))
  elif k[:5] == 'anime' and a < num_tops:
    a += 1
    top_degreeA.append((k, degree_centrality[k]))
  if u == num_tops and a == num_tops:
    break

In [100]:
print('Top users:')
for k,v in top_degreeU:
  print('{}    {}'.format(k,v))

print('--------------\n')

print('Top anime:')
for k,v in top_degreeA:
  print('{}    {}'.format(k,v))

Top users:
user_42635    0.3774934515414064
user_53698    0.2926657263751763
user_57620    0.27090469474108403
user_59643    0.2651622002820875
user_51693    0.2640539995970179
user_45659    0.24803546242192223
user_7345    0.2447108603667137
user_12431    0.23655047350392908
user_65840    0.2231513197662704
user_22434    0.2005843239975821
--------------

Top anime:
anime_1535    0.4917528735632184
anime_11757    0.3780028735632184
anime_16498    0.3633477011494253
anime_1575    0.3466235632183908
anime_6547    0.33857758620689654
anime_226    0.33804597701149425
anime_20    0.31711206896551725
anime_5114    0.30882183908045974
anime_121    0.3064942528735632
anime_2904    0.3035057471264368


## 1.4 Top anime Comparison

In [105]:
df_anime.sort_values(by=['rating'], ascending=False).iloc[:10, 5]

10464    10.00
10400     9.60
9595      9.50
0         9.37
9078      9.33
1         9.26
2         9.25
10786     9.25
3         9.17
4         9.16
Name: rating, dtype: float64

**Some final considerations**:  
A first consideration is that while degree centrality metric (in both versions) seems to be effective, on the other hand closeness centrality metric doesn't produce significative results either considering the graph as weighted and unweighted. This is probably due to the fact that given the task under consideration and thus the graph structure where only user-item itercations are considered as edges, a lot of nodes have more or less the same value of closeness. It's important to rembember that graph G is huge, therefore computing the exact closeness metric is really difficult in terms of time complexity. For this reason, an approximeted version has been implemented with k=150 empiricaly chosen. It may be that increasing k value, so will happen for the effectiveness of closeness centrality. This is a possibile suggestion when more hardware resources are available.  
Nevertheless, a basic graph analysis can be done thanks degree centrality metric, which is computed considering with and without the property of the graph to be bipartite.
Interesting is the fact that the ranking output given with degree metric is not the same as sorting the anime by global avarage rating. A possible reason is that degree metric considers only the number of user-item iteractions regarding the fact that the relative rating is good or bad. What matters is simply if the anime has been whatched or not. At the same time an anime watched by a small comunity that really likes it and so having a high global average rating, it may happen that it's not actually one of the most popular items.