# Creation and visualization of a graph of books and readers using networkx and pyvis

## Part 1: Networkx creation 

**There will be two kinds of nodes: books and readers. Edges will connect readers to the books they rated.**

In [2]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from pyvis.network import Network
import random

import warnings
warnings.filterwarnings("ignore")

### Checking the data

The data, stored in three CSV files, contain the following information:

* Users --> User-ID, Location, Age
* Books --> ISBN, Title, Author, Year of publication, Publisher
* Ratings --> User-ID, ISBN, Book-Rating (0-10 scale)

In [3]:
users=pd.read_csv("Users.csv")
ratings=pd.read_csv("Ratings.csv")
books=pd.read_csv("Books.csv",usecols=[0,1,2,3,4])

The ratings DataFrame has the information needed to create the links between nodes: (reader)--\[rate\]-->(book)

In [4]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [5]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User-ID   278858 non-null  int64  
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [6]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 5 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271358 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
dtypes: object(5)
memory usage: 10.4+ MB


In [7]:
print(f"Number of readers: {len(ratings['User-ID'].unique())}")
print(f"Number of books: {len(ratings['ISBN'].unique())}")

Number of readers: 105283
Number of books: 340556


Several books in ratings have no information: ratings references 340556 distinct ISBNs, while the books DataFrame describes only 271360 books. Let's keep only the ratings of books that appear in the books DataFrame, and then keep only the readers and books that appear in those ratings.

In [8]:
ratings=ratings[ratings['ISBN'].isin(books['ISBN'])].copy() #copy so that columns can be modified later without a SettingWithCopyWarning

In [9]:
users=pd.merge(pd.DataFrame(ratings['User-ID'].sort_values().unique(),columns=['User-ID']),users,how='inner')
books=pd.merge(pd.DataFrame(ratings['ISBN'].sort_values().unique(),columns=['ISBN']),books,how='inner')
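The filtering step can be checked on a toy example. This is a minimal sketch, assuming two made-up frames (`toy_books` and `toy_ratings` are illustrative names, not part of the dataset); it only shows how `isin` drops ratings whose ISBN has no entry in the books table.

```python
import pandas as pd

# Hypothetical toy data: the rated ISBN "X" has no entry in the books table.
toy_books = pd.DataFrame({"ISBN": ["A", "B"], "Book-Title": ["Alpha", "Beta"]})
toy_ratings = pd.DataFrame(
    {
        "User-ID": [1, 1, 2, 3],
        "ISBN": ["A", "X", "B", "X"],
        "Book-Rating": [5, 0, 8, 7],
    }
)

# Keep only the ratings whose ISBN is described in the books table.
known = toy_ratings["ISBN"].isin(toy_books["ISBN"])
filtered = toy_ratings[known].copy()  # copy so later column assignments are safe

print(f"Dropped {len(toy_ratings) - len(filtered)} of {len(toy_ratings)} ratings")
print(f"Readers left: {filtered['User-ID'].nunique()}")
```

Reader 3 only rated the unknown book, so it disappears from the filtered ratings. This is why the number of readers can shrink after the filter as well.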

Cleaning the info

In [10]:
ratings['User-ID']=ratings['User-ID'].astype(str)
users['User-ID']=users['User-ID'].astype(str)
users['City']=users['Location'].str.split(',').str[0]
users['Country']=users['Location'].str.split(',').str[2]
users['Age']=users['Age'].fillna('Unknown')

In [11]:
users.head()

| | User-ID | Location | Age | City | Country |
|---|---|---|---|---|---|
| 0 | 2 | stockton, california, usa | 18.0 | stockton | usa |
| 1 | 8 | timmins, ontario, canada | Unknown | timmins | canada |
| 2 | 9 | germantown, tennessee, usa | Unknown | germantown | usa |
| 3 | 10 | albacete, wisconsin, spain | 26.0 | albacete | spain |
| 4 | 12 | fort bragg, california, usa | Unknown | fort bragg | usa |
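Splitting `Location` on `','` keeps the space that follows each comma, and strings with fewer than three fields have no third element. A possible alternative is sketched below. `split_location` is a hypothetical helper, and treating `"n/a"` as missing is an assumption, not something the notebook does.

```python
def split_location(location):
    """Return (city, country) from a 'city, state, country' string.

    Hypothetical helper: every field is stripped of surrounding spaces,
    the country is read from the last field, and empty or 'n/a' fields
    are returned as None (an assumption made for this sketch).
    """
    parts = [part.strip() for part in location.split(",")]

    def clean(value):
        return None if value in ("", "n/a") else value

    city = clean(parts[0])
    country = clean(parts[-1]) if len(parts) > 1 else None
    return city, country


print(split_location("stockton, california, usa"))
print(split_location("n/a, n/a, n/a"))
```

On the real data it could be applied row by row, for example with `users['Location'].apply(split_location)`.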


### Creating the network
**The network will be created with networkx from the pandas DataFrame ratings**

G is a directed multigraph, while H is an undirected graph built from the same edges. We define both because the directed graph is used for the visualization (and for in- and out-degrees), while some important features of the network, such as connected components, rely on networkx functions that are only defined for undirected graphs.

In [12]:
G=nx.from_pandas_edgelist(ratings,source='User-ID',target='ISBN',edge_attr=['Book-Rating'],create_using=nx.MultiDiGraph) #directed multigraph: reader -> book
H=nx.from_pandas_edgelist(ratings,source='User-ID',target='ISBN',edge_attr=['Book-Rating']) #undirected simple graph
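The difference between the two graphs can be seen on a toy edge list. This is a minimal sketch with made-up reader and book names; it shows that direction-aware degrees come from the directed graph, while `nx.number_connected_components` is only available for the undirected one.

```python
import networkx as nx

# Toy (reader, book) edges; the names are made up for illustration.
edges = [("r1", "b1"), ("r1", "b2"), ("r2", "b2"), ("r3", "b3")]

directed = nx.MultiDiGraph(edges)  # keeps the reader -> book direction
undirected = nx.Graph(edges)       # direction is dropped

# Books rated by r1, and readers who rated b2.
print(directed.out_degree("r1"), directed.in_degree("b2"))

# Connected components are defined for undirected graphs only.
print(nx.number_connected_components(undirected))
try:
    nx.number_connected_components(directed)
except nx.NetworkXNotImplemented as err:
    print(f"Not available for directed graphs: {err}")
```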

**Adding node attributes**

Let's add some attributes to the nodes of G

In [13]:
nx.set_node_attributes(G,dict(zip(books['ISBN'],books['Book-Title'])), "Book-Title") 
nx.set_node_attributes(G,dict(zip(users['User-ID'],users['Country'])), "Country") 

In [None]:
#To check the new added attributes
#G.nodes(data=True)

**Some information of our network**

In [14]:
print(H)
print(G) #Same number of nodes and edges

Graph with 362257 nodes and 1031136 edges
MultiDiGraph with 362257 nodes and 1031136 edges


In [15]:
print(f"Number of connected components: {nx.number_connected_components(H)}")
print(f"Nodes in largest connected component: {len(max(nx.connected_components(H), key=len))}")

Number of connected components: 6117
Nodes in largest connected component: 348278


1. Number of readers and books

In [16]:
print(f"Number of readers: {len(nx.get_node_attributes(G,'Country').keys())}")
print(f"Number of books: {len(nx.get_node_attributes(G,'Book-Title').keys())}")

Number of readers: 92106
Number of books: 270151


2. Top five readers by number of rated books

In [17]:
out_degree= sorted(G.out_degree(), key=lambda item: item[1], reverse=True)

In [18]:
out_degree[0:5]

[('11676', 11144),
 ('198711', 6456),
 ('153662', 5814),
 ('98391', 5779),
 ('35859', 5646)]

In [19]:
users[users['User-ID'].isin([i[0] for i in out_degree[0:5]])]

| | User-ID | Location | Age | City | Country |
|---|---|---|---|---|---|
| 3629 | 11676 | n/a, n/a, n/a | Unknown | | |
| 11848 | 35859 | duluth, minnesota, usa | Unknown | duluth | usa |
| 32581 | 98391 | morrow, georgia, usa | 52.0 | morrow | usa |
| 50848 | 153662 | ft. stewart, georgia, usa | 44.0 | ft. stewart | usa |
| 65332 | 198711 | little canada, minnesota, usa | 62.0 | little canada | usa |


3. Top five books by number of ratings

In [20]:
in_degree= sorted(G.in_degree(), key=lambda item: item[1], reverse=True)

In [21]:
in_degree[0:5]

[('0971880107', 2502),
 ('0316666343', 1295),
 ('0385504209', 883),
 ('0060928336', 732),
 ('0312195516', 723)]

In [22]:
books[books['ISBN'].isin([i[0] for i in in_degree[0:5]])]

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher |
|---|---|---|---|---|---|
| 7344 | 0060928336 | Divine Secrets of the Ya-Ya Sisterhood: A Novel | Rebecca Wells | 1997 | Perennial |
| 32370 | 0312195516 | The Red Tent (Bestselling Backlist) | Anita Diamant | 1998 | Picador USA |
| 38570 | 0316666343 | The Lovely Bones: A Novel | Alice Sebold | 2002 | Little, Brown |
| 70798 | 0385504209 | The Da Vinci Code | Dan Brown | 2003 | Doubleday |
| 215952 | 0971880107 | Wild Animus | Rich Shapero | 2004 | Too Far |
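Sorting every (node, degree) pair works, but only the first five entries are used. A possible alternative, sketched here on a made-up toy graph, is `heapq.nlargest`, which returns the top entries without sorting the whole list.

```python
import heapq

import networkx as nx

# Toy reader -> book graph; the names are made up for illustration.
toy = nx.MultiDiGraph(
    [("r1", "b1"), ("r1", "b2"), ("r1", "b3"),
     ("r2", "b1"),
     ("r3", "b1"), ("r3", "b2")]
)

# Readers with the most rated books (highest out-degree).
top_readers = heapq.nlargest(2, toy.out_degree(), key=lambda item: item[1])

# Book with the most ratings (highest in-degree).
top_books = heapq.nlargest(1, toy.in_degree(), key=lambda item: item[1])

print(top_readers)
print(top_books)
```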


## Part 2: Pyvis Visualization

### Attributes for visualization

Node attributes can be used to show important information in the visualization of our graph. In *pyvis*, attributes with specific names (such as `color` and `size`) customize how the graph is drawn, so let's create those attributes.

**Node attributes**

In [23]:
country_attributes = nx.get_node_attributes(G, 'Country')
node_colors={(key): ('gray' if country_attributes.get(key) is not None else 'brown') for key in G.nodes()}
nx.set_node_attributes(G,node_colors, "color")

In [38]:
1*10000/11144

0.8973438621679828

In [39]:
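#Scale sizes so that the most active reader (out-degree 11144) gets 10000; nodes with out-degree 0 (the books) get 0.8
node_size={(key): (round(item*10000/11144,2) if item!=0 else 0.8) for key,item in out_degree}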
nx.set_node_attributes(G,node_size, "size")
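The constant 11144 is the out-degree of the most active reader. A variant that reads the maximum from the data is sketched below. `scale_sizes` and its parameters are hypothetical names, and the formula is the same linear scaling used above.

```python
def scale_sizes(degrees, max_size=10000, min_size=0.8):
    """Map (node, degree) pairs to sizes; the largest degree gets max_size.

    Nodes with degree 0 receive min_size, as in the notebook cell above.
    """
    degrees = list(degrees)
    top = max((deg for _, deg in degrees), default=0)
    if top == 0:
        # No edges at all: give every node the minimum size.
        return {node: min_size for node, _ in degrees}
    return {
        node: round(deg * max_size / top, 2) if deg != 0 else min_size
        for node, deg in degrees
    }


sizes = scale_sizes([("11676", 11144), ("115558", 1), ("0704349647", 0)])
print(sizes)
```

With the real graph, the call would be `scale_sizes(G.out_degree())`, so the scaling stays correct if the ratings are filtered differently.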

In [None]:
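title_attributes = nx.get_node_attributes(G, 'Book-Title') #assumed definition, by analogy with country_attributes; the original cell used this name without defining it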

**Edges attributes**

In [25]:
edge_color_map={10:'red',9:'red',8:'orange',7:'orange',6:'yellow',5:'yellow',4:'green',3:'green',2:'blue',1:'blue',0:'gray'} #two ratings per color; 0 is gray
ratings['color']=ratings['Book-Rating'].map(edge_color_map)

In [None]:
{(row['User-ID'],row['ISBN']): row['color'] for _,row in ratings.iterrows()} #iterrows yields (index, row) pairs; slow on about a million rows

In [None]:
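#G is a MultiDiGraph, so each edge is identified by (source, target, key); every (reader, book) pair occurs once, so the key is 0
edge_colors={(u,v,0): c for u,v,c in zip(ratings['User-ID'],ratings['ISBN'],ratings['color'])}
nx.set_edge_attributes(G, edge_colors,name='color')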
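In a multigraph, `nx.set_edge_attributes` expects dictionary keys of the form `(source, target, key)`. The sketch below shows this on a made-up two-edge graph; `color_of` is a hypothetical two-entry map used only for the illustration.

```python
import networkx as nx

toy = nx.MultiDiGraph()
toy.add_edge("r1", "b1", **{"Book-Rating": 9})  # first edge between a pair gets key 0
toy.add_edge("r2", "b1", **{"Book-Rating": 0})

# Hypothetical color map covering only the toy ratings.
color_of = {9: "red", 0: "gray"}

# Build {(source, target, key): color} from the edges themselves.
edge_colors = {
    (u, v, key): color_of[data["Book-Rating"]]
    for u, v, key, data in toy.edges(keys=True, data=True)
}
nx.set_edge_attributes(toy, edge_colors, name="color")

print(list(toy.edges(keys=True, data="color")))
```

Reading the keys from `edges(keys=True, data=True)` avoids assuming that every key is 0, which matters if the same reader and book ever appear twice.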

**For visualization purposes, we are going to plot just small samples of the complete network**

In [48]:
components=sorted(nx.connected_components(H), key=len, reverse=True)
g=nx.subgraph(G,components[1]) #subgraph induced by the second-largest connected component

In [49]:
nodes = list(H.nodes()) 
num_nodes_to_sample = 1000 # Example: select 1000 random nodes
random_nodes = random.sample(nodes, num_nodes_to_sample)
h=nx.subgraph(G,random_nodes) #random subgraph
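An edge survives in a random induced subgraph only if both of its endpoints are sampled, so a uniform sample of nodes from a large sparse graph tends to contain few edges. One possible alternative, sketched here under the assumption that the reader nodes are known, is to sample readers and keep every book they rated. `sample_reader_neighborhood` is a hypothetical helper, not part of networkx.

```python
import random

import networkx as nx


def sample_reader_neighborhood(graph, readers, k, seed=None):
    """Return the subgraph induced by k random readers and the books they rated."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    chosen = rng.sample(sorted(readers), k)
    keep = set(chosen)
    for reader in chosen:
        keep.update(graph.successors(reader))  # books rated by this reader
    return graph.subgraph(keep)


# Toy reader -> book graph; the names are made up for illustration.
toy = nx.MultiDiGraph([("r1", "b1"), ("r1", "b2"), ("r2", "b2"), ("r3", "b3")])
sub = sample_reader_neighborhood(toy, readers=["r1", "r2", "r3"], k=2, seed=0)
print(sub.number_of_nodes(), sub.number_of_edges())
```

With the real data, `readers` could be taken from `users['User-ID']`; every sampled reader keeps all of its ratings in the resulting subgraph.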

In [41]:
h.nodes(data=True)

NodeDataView({'0704349647': {'Book-Title': 'DIRTY PLANET', 'color': 'brown', 'size': 0.8}, '0679023755': {'Book-Title': "Hiking a Celebration of the Sport and the World's Best Places to Enjoy It (Fodor's sports)", 'color': 'brown', 'size': 0.8}, '115558': {'Country': ' usa', 'color': 'gray', 'size': 0.9}, '0394505557': {'Book-Title': 'Space', 'color': 'brown', 'size': 0.8}, '225315072X': {'Book-Title': "Les chenes d'or", 'color': 'brown', 'size': 0.8}, '247830': {'Country': ' usa', 'color': 'gray', 'size': 0.9}, '0520047605': {'Book-Title': 'Kino-Eye: The Writings of Dziga Vertov', 'color': 'brown', 'size': 0.8}, '0961551410': {'Book-Title': 'The best places to kiss in the northwest: A romantic travel guide', 'color': 'brown', 'size': 0.8}, '0394419081': {'Book-Title': 'Designing Your Face', 'color': 'brown', 'size': 0.8}, '57072': {'Country': ' usa', 'color': 'gray', 'size': 29.61}, '0312195168': {'Book-Title': 'Night Talk : A Novel', 'color': 'brown', 'size': 0.8}, '132180': {'Countr

**Creating the pyvis graph**

In [50]:
my_network = Network(height='500px',directed=True,notebook=True)
my_network.from_nx(g)
my_network.show('nx.html')

nx.html
