# Creation and visualization of a graph of books and readers using networkx and pyvis

## Part 1: Networkx creation 

**There will be two kinds of nodes: books and readers. Edges will connect readers with the books they rated.**

In [2]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from pyvis.network import Network
import random

import warnings
warnings.filterwarnings("ignore")

### Checking the data

The three CSV files contain the following information:

* Users --> User-ID, Age, Location
* Books --> ISBN, Title, Author, Year, Publisher
* Ratings --> User-ID, ISBN, Book-Rating (0-10 scale)

In [3]:
users=pd.read_csv("Users.csv")
ratings=pd.read_csv("Ratings.csv")
books=pd.read_csv("Books.csv",usecols=[0,1,2,3,4])

The ratings DataFrame has the information needed to create the links between nodes: (reader)--\[rate\]-->(book)

In [None]:
ratings.info()

In [None]:
users.info()

In [None]:
books.info()

In [4]:
print(f"Number of readers: {len(ratings['User-ID'].unique())}")
print(f"Number of books: {len(ratings['ISBN'].unique())}")

Number of readers: 105283
Number of books: 340556


Several books in ratings have no metadata: ratings references 340556 ISBNs, but the books DataFrame only has information on 271360 books. Let's keep only the ratings for books that appear in the books DataFrame, and only the readers that appear in those ratings.

In [5]:
ratings=ratings[ratings['ISBN'].isin(books['ISBN'])].copy() #copy so that columns can be added later without SettingWithCopyWarning
users=pd.merge(pd.DataFrame(ratings['User-ID'].sort_values().unique(),columns=['User-ID']),users,how='inner')
books=pd.merge(pd.DataFrame(ratings['ISBN'].sort_values().unique(),columns=['ISBN']),books,how='inner')
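
The filtering step can be tried on a toy example. The sketch below uses made-up data (the `ratings_toy` and `books_toy` frames are hypothetical, not rows from the real CSV files) to show how `isin` keeps only the ratings whose ISBN has metadata.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real CSV files
ratings_toy = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["A", "B", "C"],
    "Book-Rating": [5, 0, 8],
})
books_toy = pd.DataFrame({"ISBN": ["A", "C"], "Book-Title": ["Book A", "Book C"]})

# Boolean mask: True where the rated ISBN also appears in the books table
known = ratings_toy["ISBN"].isin(books_toy["ISBN"])
ratings_kept = ratings_toy[known].copy()

print(ratings_kept["ISBN"].tolist())
print(f"Dropped {len(ratings_toy) - len(ratings_kept)} rating(s) without book metadata")
```

Here the rating for ISBN "B" is dropped because no row in `books_toy` describes it.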

Cleaning the data

In [6]:
ratings['User-ID']=ratings['User-ID'].astype(str)
users['User-ID']=users['User-ID'].astype(str)
users['City']=users['Location'].str.split(',').str[0]
users['Country']=users['Location'].str.split(',').str[2]
users['Age']=users['Age'].fillna('Unknown')
books['Book-Author']=books['Book-Author'].fillna('Unknown')
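
Splitting `Location` on `','` keeps the space that follows each comma, and locations with fewer than three parts have no country at all. The sketch below shows both effects on hypothetical sample strings; appending `.str.strip()` is one possible way to remove the extra whitespace, not something the cell above does.

```python
import pandas as pd

# Hypothetical sample locations in the "city, state, country" format
locations = pd.Series(["duluth, minnesota, usa", "n/a, n/a, n/a", "london"])

parts = locations.str.split(",")
raw_country = parts.str[2]              # keeps the leading space, e.g. ' usa'
country = parts.str[2].str.strip()      # whitespace removed; NaN when there is no third part

print(raw_country.tolist())
print(country.tolist())
```

Rows such as `"london"` end up with a missing country, so any later code that concatenates or compares the `Country` column should expect missing values.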

### Creating the network
**The network will be created with networkx from the pandas dataFrame ratings**

G is a directed graph, while H is an undirected graph built from the same edges. We define both because the visualization uses the directed graph, while some important network properties are computed with networkx functions that are only defined for undirected graphs.

In [7]:
G=nx.from_pandas_edgelist(ratings,source='User-ID',target='ISBN',edge_attr=['Book-Rating'],create_using=nx.DiGraph) #directed graph
H=nx.from_pandas_edgelist(ratings,source='User-ID',target='ISBN',edge_attr=['Book-Rating']) #simple graph
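
The need for an undirected copy can be checked on a toy graph. In this sketch (the edge list is hypothetical), `nx.number_connected_components` raises `NetworkXNotImplemented` for the directed graph and works for the undirected one.

```python
import networkx as nx

# Hypothetical toy ratings: (reader, book) pairs
edges = [("u1", "b1"), ("u1", "b2"), ("u2", "b2"), ("u3", "b3")]

G_toy = nx.DiGraph(edges)   # directed: reader -> book
H_toy = nx.Graph(edges)     # undirected graph with the same edges

try:
    nx.number_connected_components(G_toy)
    directed_supported = True
except nx.NetworkXNotImplemented:
    directed_supported = False

n_components = nx.number_connected_components(H_toy)
print(directed_supported, n_components)
```

An undirected version can also be derived from an existing directed graph with `G_toy.to_undirected()`, which avoids reading the edge list a second time.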

**Adding node attributes**

Let's add some attributes to the nodes of G

In [8]:
nx.set_node_attributes(G,dict(zip(books['ISBN'],books['Book-Title'])), "Book-Title") 
nx.set_node_attributes(G,dict(zip(users['User-ID'],users['Country'])), "Country") 

In [None]:
#To check the newly added attributes
#G.nodes(data=True)

**Some information of our network**

In [9]:
print(H)
print(G) #Same nodes and edges as H, but directed

Graph with 362257 nodes and 1031136 edges
DiGraph with 362257 nodes and 1031136 edges


In [10]:
print(f"Number of connected components: {nx.number_connected_components(H)}")
print(f"Nodes in largest connected component: {len(max(nx.connected_components(H), key=len))}")

Number of connected components: 6117
Nodes in largest connected component: 348278


1. Number of readers and books

In [11]:
print(f"Number of readers: {len(nx.get_node_attributes(G,'Country').keys())}")
print(f"Number of books: {len(nx.get_node_attributes(G,'Book-Title').keys())}")

Number of readers: 92106
Number of books: 270151


2. Top five readers by number of rated books

In [12]:
out_degree= sorted(G.out_degree(), key=lambda item: item[1], reverse=True)

In [13]:
out_degree[0:5] #(User-ID,rated books)

[('11676', 11144),
 ('198711', 6456),
 ('153662', 5814),
 ('98391', 5779),
 ('35859', 5646)]
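
When only the top few entries are needed, sorting the full degree list is not required. As a sketch on a hypothetical toy graph, `heapq.nlargest` from the standard library selects the highest degrees directly from the `(node, degree)` pairs returned by `out_degree()` and `in_degree()`.

```python
import heapq
import networkx as nx

# Hypothetical toy graph: readers u1..u3 rating books b1..b3
G_toy = nx.DiGraph([
    ("u1", "b1"), ("u1", "b2"), ("u1", "b3"),
    ("u2", "b1"),
    ("u3", "b1"), ("u3", "b2"),
])

# Two readers with the highest out-degree (number of rated books)
top_readers = heapq.nlargest(2, G_toy.out_degree(), key=lambda item: item[1])
# Two books with the highest in-degree (number of ratings received)
top_books = heapq.nlargest(2, G_toy.in_degree(), key=lambda item: item[1])

print(top_readers)
print(top_books)
```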

In [14]:
users[users['User-ID'].isin([i[0] for i in out_degree[0:5]])]

| | User-ID | Location | Age | City | Country |
|---|---|---|---|---|---|
| 3629 | 11676 | n/a, n/a, n/a | Unknown | | |
| 11848 | 35859 | duluth, minnesota, usa | Unknown | duluth | usa |
| 32581 | 98391 | morrow, georgia, usa | 52.0 | morrow | usa |
| 50848 | 153662 | ft. stewart, georgia, usa | 44.0 | ft. stewart | usa |
| 65332 | 198711 | little canada, minnesota, usa | 62.0 | little canada | usa |


3. Top five books by number of ratings

In [15]:
in_degree= sorted(G.in_degree(), key=lambda item: item[1], reverse=True)

In [16]:
in_degree[0:5]

[('0971880107', 2502),
 ('0316666343', 1295),
 ('0385504209', 883),
 ('0060928336', 732),
 ('0312195516', 723)]

In [17]:
books[books['ISBN'].isin([i[0] for i in in_degree[0:5]])]

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher |
|---|---|---|---|---|---|
| 7344 | 0060928336 | Divine Secrets of the Ya-Ya Sisterhood: A Novel | Rebecca Wells | 1997 | Perennial |
| 32370 | 0312195516 | The Red Tent (Bestselling Backlist) | Anita Diamant | 1998 | Picador USA |
| 38570 | 0316666343 | The Lovely Bones: A Novel | Alice Sebold | 2002 | Little, Brown |
| 70798 | 0385504209 | The Da Vinci Code | Dan Brown | 2003 | Doubleday |
| 215952 | 0971880107 | Wild Animus | Rich Shapero | 2004 | Too Far |


## Part 2: Pyvis Visualization

### Attributes for visualization

Node and edge attributes can be used to show important information in the visualization of our graph. *pyvis* uses attributes with specific names (such as *shape*, *color*, *size* and *title*) to customize how the graph is drawn. Let's create these attributes.

**Node attributes**

*icon*

For icons to work, the "shape" attribute needs to be set to "icon".

In [38]:
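country_attributes = nx.get_node_attributes(G, 'Country')
#Only reader nodes carry a 'Country' attribute, so membership tells readers and books apart
node_icon={key:{'face':'FontAwesome','code':'\uf007','color':'blue'} if key in country_attributes #user icon for readers
           else {'face':'FontAwesome','code':'\uf02d','color':'brown'} #book icon for books
           for key in G.nodes()}
nx.set_node_attributes(G,node_icon, 'icon')
nx.set_node_attributes(G,'icon', 'shape') #icons are only drawn when shape is 'icon'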
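
The two `nx.set_node_attributes` calls in this cell use two different forms of the `values` argument. The sketch below illustrates the difference on a hypothetical two-node graph: a dict keyed by node assigns one value per node, while a single non-dict value is assigned to every node.

```python
import networkx as nx

# Hypothetical toy graph with one reader and one book
G_toy = nx.DiGraph([("u1", "b1")])

# A dict keyed by node assigns a different value to each node
nx.set_node_attributes(G_toy, {"u1": "blue", "b1": "brown"}, "color")
# A single non-dict value is assigned to every node
nx.set_node_attributes(G_toy, "icon", "shape")

print(dict(G_toy.nodes(data=True)))
```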

If we don't want icons, we can keep the default dot shape for the nodes and change their color and size instead.

*color*

In [None]:
#node_colors={(key): ('gray' if country_attributes.get(key) is not None else 'brown') for key in G.nodes()}
#nx.set_node_attributes(G,node_colors, "color")

*size*

In [32]:
#node_size={(key): (round(np.log(item+10)*10,2)) for key,item in out_degree}
#nx.set_node_attributes(G,node_size, "size")
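
The size formula in the commented-out cell can be checked with a few worked values. This sketch reproduces it with `math.log` instead of `np.log` (the helper name `node_size` is hypothetical): adding 10 keeps the size positive for nodes with degree 0, and the logarithm compresses the range so that the most active reader (out-degree 11144) is only about four times larger than a node with no outgoing edges.

```python
import math

def node_size(degree):
    """Log-scaled node size: round(log(degree + 10) * 10, 2)."""
    return round(math.log(degree + 10) * 10, 2)

# Degree 0, a moderately active reader, and the largest out-degree in the data
for degree in (0, 100, 11144):
    print(degree, node_size(degree))
```

Because the commented-out cell uses `out_degree`, every book node (which has no outgoing edges) would receive the same minimum size.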

*title* (shown as a tooltip when hovering over a node)

In [19]:
books['info']=books.apply(lambda x: 'Title: '+ x['Book-Title']+'\n'+'Author: '+x['Book-Author']+'\n'+'Year: '+str(x['Year-Of-Publication']),axis=1)
users['info']=users.apply(lambda x: 'ID: '+x['User-ID']+'\n'+'Age: '+str(x['Age'])+'\n'+'City: ' +x['City'],axis=1)

In [20]:
nx.set_node_attributes(G,dict(zip(books['ISBN'],books['info'])), "title")
nx.set_node_attributes(G,dict(zip(users['User-ID'],users['info'])), "title")

**Edge attributes**

*color*

In [21]:
edge_color_map={10:'red',9:'red',8:'orange',7:'orange',6:'yellow',5:'yellow',4:'green',3:'green',2:'blue',1:'blue',0:'gray'}
ratings['color']=ratings['Book-Rating'].map(edge_color_map)

In [22]:
#{key: {'color':group['color']} for key, group in ratings.groupby(['User-ID', 'ISBN'])}
edge_colors={(row['User-ID'], row['ISBN']): {'color':row['color']} for _, row in ratings.iterrows()}
nx.set_edge_attributes(G,edge_colors)
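
With about a million ratings, `iterrows` can be slow. One possible alternative, sketched here on hypothetical toy data, builds the same `(User-ID, ISBN)` keys by zipping the columns and passes the attribute name to `nx.set_edge_attributes`.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy ratings with a precomputed 'color' column
ratings_toy = pd.DataFrame({
    "User-ID": ["u1", "u1", "u2"],
    "ISBN": ["b1", "b2", "b2"],
    "color": ["red", "gray", "orange"],
})
G_toy = nx.from_pandas_edgelist(
    ratings_toy, source="User-ID", target="ISBN", create_using=nx.DiGraph
)

# Build {(user, isbn): color} by zipping columns instead of iterating over rows
edge_colors = dict(
    zip(zip(ratings_toy["User-ID"], ratings_toy["ISBN"]), ratings_toy["color"])
)
nx.set_edge_attributes(G_toy, edge_colors, "color")

print(list(G_toy.edges(data=True)))
```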

*title*

In [23]:
for u, v, data in G.edges(data=True):
    data['title'] = data['Book-Rating']

**For visualization purposes, we are going to plot only small samples of the complete network**

In [49]:
components=sorted(nx.connected_components(H), key=len, reverse=True)
g=nx.subgraph(G,components[4]) #a subgraph can be a whole connected component (here, the fifth largest)

In [40]:
nodes = list(nx.subgraph(G,components[0]))
num_nodes_to_sample = 1000 # Example: select 1000 random nodes from the largest component
random_nodes = random.sample(nodes, num_nodes_to_sample)
h=nx.subgraph(G,random_nodes) #random subgraph
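
A uniform random sample of nodes may leave few edges between the selected nodes, since two sampled nodes are only connected if one rated the other. One alternative, sketched here on a hypothetical toy graph, is to sample a neighborhood around a random reader with `nx.ego_graph`; the radius of 2 and the fixed seed are illustrative choices.

```python
import random
import networkx as nx

random.seed(0)  # fixed seed so the sample is reproducible

# Hypothetical toy graph: readers u1..u3 rating books b1..b4
G_toy = nx.DiGraph([
    ("u1", "b1"), ("u1", "b2"),
    ("u2", "b2"), ("u2", "b3"),
    ("u3", "b4"),
])
H_toy = G_toy.to_undirected()

# Pick a random reader (a node with outgoing edges) and keep everything within two hops
readers = [node for node, degree in G_toy.out_degree() if degree > 0]
center = random.choice(readers)
neighborhood = nx.ego_graph(H_toy, center, radius=2)

# Directed subgraph on the same nodes, ready for visualization
h_toy = G_toy.subgraph(neighborhood.nodes())
print(center, sorted(h_toy.nodes()))
```

The resulting subgraph always contains the chosen reader, the books they rated, and the other readers of those books.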

**Function to create pyvis graph**

In [57]:
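def pyvis_vis(networkx_graph):
    """Render a networkx graph with pyvis and save it to graph_visualization.html."""
    #font_color has alpha 00 (fully transparent), so node labels are not visible
    my_network = Network(height='400px',directed=True,notebook=True,font_color='#00000000')
    my_network.from_nx(networkx_graph) #copies nodes, edges and their attributes
    
    return my_network.show('graph_visualization.html')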

In [58]:
pyvis_vis(g)

graph_visualization.html
