# Creation and visualization of a graph of books and readers using networkx and pyvis

## Part 1: Networkx creation

**There are two kinds of nodes: books and readers. Each edge connects a reader to a book that the reader rated.**

In [2]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from pyvis.network import Network

import warnings
warnings.filterwarnings("ignore")

### Checking the data

The data comes in three CSV files with the following information:

* Users --> User-ID, Location, Age
* Books --> ISBN, title, author, year of publication, publisher
* Ratings --> User-ID, ISBN, and the rating the reader gave the book

In [3]:
users=pd.read_csv("Users.csv")
ratings=pd.read_csv("Ratings.csv")
books=pd.read_csv("Books.csv",usecols=[0,1,2,3,4])

The `ratings` DataFrame has the information needed to create the links between nodes: (reader)--\[rate\]-->(book)

In [4]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [5]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User-ID   278858 non-null  int64  
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [6]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 5 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271358 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
dtypes: object(5)
memory usage: 10.4+ MB


In [7]:
print(f"Number of readers: {len(ratings['User-ID'].unique())}")
print(f"Number of books: {len(ratings['ISBN'].unique())}")

Number of readers: 105283
Number of books: 340556


Information is missing for several of the books in `ratings` (the `books` DataFrame only describes 271360 books). Let's keep only the ratings whose book appears in the `books` DataFrame, and then keep only the readers and books that appear in those ratings.

In [8]:
ratings=ratings[ratings['ISBN'].isin(list(books['ISBN'].unique()))]

In [9]:
users=pd.merge(pd.DataFrame(ratings['User-ID'].sort_values().unique(),columns=['User-ID']),users,how='inner')
books=pd.merge(pd.DataFrame(ratings['ISBN'].sort_values().unique(),columns=['ISBN']),books,how='inner')
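
The filtering above can be sketched on a few invented rows. The `*_toy` frames and their values are assumptions made for illustration, and `isin` is used here instead of `merge` to keep the example short.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are invented).
ratings_toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "B", "Z"],  # "Z" is not described in books_toy
    "Book-Rating": [5, 0, 8, 7],
})
books_toy = pd.DataFrame({"ISBN": ["A", "B", "C"],
                          "Book-Title": ["t1", "t2", "t3"]})
users_toy = pd.DataFrame({"User-ID": [1, 2, 3, 4],
                          "Age": [18.0, None, 26.0, 40.0]})

# Keep only ratings whose ISBN is described in books_toy.
ratings_toy = ratings_toy[ratings_toy["ISBN"].isin(books_toy["ISBN"])]

# Keep only the users and books that still appear in the filtered ratings.
users_toy = users_toy[users_toy["User-ID"].isin(ratings_toy["User-ID"])]
books_toy = books_toy[books_toy["ISBN"].isin(ratings_toy["ISBN"])]

print(len(ratings_toy), len(users_toy), len(books_toy))
```

After this step every edge endpoint has a row in the corresponding table, which is what the attribute assignment later relies on.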

Cleaning the info

In [10]:
users['User-ID']=users['User-ID'].astype(str)
users['City']=users['Location'].str.split(',').str[0]
users['Country']=users['Location'].str.split(',').str[2]
users['Age']=users['Age'].fillna('Unknown')

In [22]:
users.head()

| | User-ID | Location | Age | City | Country |
|---|---|---|---|---|---|
| 0 | 2 | stockton, california, usa | 18.0 | stockton | usa |
| 1 | 8 | timmins, ontario, canada | Unknown | timmins | canada |
| 2 | 9 | germantown, tennessee, usa | Unknown | germantown | usa |
| 3 | 10 | albacete, wisconsin, spain | 26.0 | albacete | spain |
| 4 | 12 | fort bragg, california, usa | Unknown | fort bragg | usa |
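
Splitting `Location` on `','` alone keeps the space that follows each comma, so the second and third fields start with a blank. A minimal sketch of a variant that strips it, using two rows copied from the table above (`users_toy` and `parts` are names introduced only for this example):

```python
import pandas as pd

users_toy = pd.DataFrame({
    "Location": ["stockton, california, usa", "timmins, ontario, canada"],
})

# expand=True returns one column per comma-separated field.
parts = users_toy["Location"].str.split(",", expand=True)

# Strip surrounding whitespace so " usa" becomes "usa".
users_toy["City"] = parts[0].str.strip()
users_toy["Country"] = parts[2].str.strip()

print(users_toy[["City", "Country"]])
```

This assumes every location has the form "city, state, country"; rows with fewer fields would get a missing value in `Country`.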


### Creating the network
**The network will be created with networkx from the `ratings` pandas DataFrame**

`G` is a directed multigraph, while `H` is an undirected graph. We define both because the directed one is used for the visualization, while some important features of the network are computed with networkx functions that are only defined for undirected graphs.

In [12]:
G=nx.from_pandas_edgelist(ratings,source='User-ID',target='ISBN',edge_attr=['Book-Rating'],create_using=nx.MultiDiGraph) #directed multigraph: reader -> book
H=nx.from_pandas_edgelist(ratings,source='User-ID',target='ISBN',edge_attr=['Book-Rating']) #simple undirected graph
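
To make the difference between the two graphs concrete, here is a small sketch on three invented ratings (`edges`, `G_toy` and `H_toy` are illustrative names, not part of the notebook):

```python
import pandas as pd
import networkx as nx

# Three invented ratings: reader 1 rated books A and B, reader 2 rated B.
edges = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["A", "B", "B"],
    "Book-Rating": [5, 0, 8],
})

G_toy = nx.from_pandas_edgelist(
    edges, source="User-ID", target="ISBN",
    edge_attr=["Book-Rating"], create_using=nx.MultiDiGraph,
)
H_toy = nx.from_pandas_edgelist(
    edges, source="User-ID", target="ISBN", edge_attr=["Book-Rating"],
)

# In G_toy the edge only exists from reader to book; in H_toy it has no direction.
print(G_toy.has_edge(1, "A"), G_toy.has_edge("A", 1))
print(H_toy.has_edge(1, "A"), H_toy.has_edge("A", 1))

# The rating is stored as an edge attribute in both graphs.
print(H_toy[1]["A"]["Book-Rating"])
```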

**Adding node attributes**

Let's add some attributes to the nodes of G

In [13]:
nx.set_node_attributes(G,dict(zip(books['ISBN'],books['Book-Title'])), "Book-Title") 
nx.set_node_attributes(G,dict(zip(users['User-ID'],users['Country'])), "Country") 

In [None]:
#To check the new added attributes
#G.nodes(data=True)

**Some information about our network**

In [14]:
print(H)
print(G)

Graph with 362257 nodes and 1031136 edges
MultiDiGraph with 362257 nodes and 1031136 edges


In [15]:
print(f"Number of connected components: {nx.number_connected_components(H)}")
print(f"Nodes in largest connected component: {len(max(nx.connected_components(H), key=len))}")

Number of connected components: 6117
Nodes in largest connected component: 348278
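
A minimal sketch of the same component calculations on an invented graph. It also shows `nx.number_weakly_connected_components`, which networkx defines for directed graphs and which ignores edge direction; whether that would fit this notebook better than keeping `H` is left open.

```python
import networkx as nx

# Invented reader/book edges forming two separate groups.
toy_edges = [(1, "A"), (1, "B"), (2, "B"), (3, "C")]

H_toy = nx.Graph()
H_toy.add_edges_from(toy_edges)

# Components as a list of node sets, largest first.
components_toy = sorted(nx.connected_components(H_toy), key=len, reverse=True)
print(nx.number_connected_components(H_toy))
print(len(components_toy[0]))

# The directed counterpart: weak connectivity ignores edge direction.
G_toy = nx.MultiDiGraph()
G_toy.add_edges_from(toy_edges)
print(nx.number_weakly_connected_components(G_toy))
```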


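Number of readers and books, counted through the node attributes. The reader count comes out as 0 because `users['User-ID']` was cast to `str` during cleaning, while the reader nodes built from `ratings` keep their integer IDs, so `nx.set_node_attributes` matched no node and attached no `Country` attribute.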

In [16]:
print(f"Number of readers: {len(nx.get_node_attributes(G, 'Country').keys())}")
print(f"Number of books: {len(nx.get_node_attributes(G, 'Book-Title').keys())}")

Number of readers: 0
Number of books: 270151
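
A reader count of 0 is what a key-type mismatch would produce: `nx.set_node_attributes` silently ignores dictionary keys that are not nodes of the graph. The sketch below reproduces this on a one-edge toy graph, assuming integer reader IDs in the graph and string IDs in the attribute dictionary (the country value is invented).

```python
import networkx as nx

G_toy = nx.MultiDiGraph()
G_toy.add_edge(8, "0394546407")  # integer reader ID, string ISBN

# The string key "8" is not the integer node 8, so nothing is attached.
nx.set_node_attributes(G_toy, {"8": "canada"}, "Country")
mismatched = len(nx.get_node_attributes(G_toy, "Country"))

# The integer key matches the node, so the attribute is stored.
nx.set_node_attributes(G_toy, {8: "canada"}, "Country")
matched = len(nx.get_node_attributes(G_toy, "Country"))

print(mismatched, matched)
```

Under that assumption, building the attribute dictionary from integer IDs (for example by not casting `users['User-ID']` to `str`) would let the `Country` values attach to the reader nodes.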


## Part 2: Pyvis Visualization

All connected components of `H` are stored in a list of sets, sorted from largest to smallest.

In [17]:
components=sorted(nx.connected_components(H), key=len, reverse=True)

In [18]:
components[0]

{'0394546407',
 8,
 9,
 10,
 12,
 14,
 '0140185631',
 16,
 17,
 '3746612152',
 ...}

In [19]:
g=nx.subgraph(G,components[1]) #second-largest component (index 1)

### Attributes for visualization

Node attributes can carry important information into the visualization of our graph. In *pyvis*, attributes with specific names are used to customize how the graph is drawn, so let's create attributes for the visualization.

**Node attributes**
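
As a hedged sketch of what such attributes could look like: pyvis passes node attributes through to the underlying vis.js network, where `title` is shown as a hover tooltip and `color` and `group` control styling. The toy graph, the colours and the `render` helper below are assumptions for illustration, not part of the original notebook.

```python
import networkx as nx

# Toy graph: one reader rating one book (the title is invented).
g_toy = nx.MultiDiGraph()
g_toy.add_edge(8, "0394546407", **{"Book-Rating": 7})
nx.set_node_attributes(g_toy, {"0394546407": "Some title"}, "Book-Title")

# Add pyvis-specific attributes: books and readers get different styling.
for node, data in g_toy.nodes(data=True):
    if "Book-Title" in data:
        data["group"] = "book"
        data["title"] = data["Book-Title"]  # hover tooltip
        data["color"] = "orange"
    else:
        data["group"] = "reader"
        data["title"] = f"Reader {node}"
        data["color"] = "steelblue"


def render(graph, path="toy.html"):
    """Hypothetical helper: draw `graph` with pyvis and write it to `path`."""
    from pyvis.network import Network  # imported here so the rest runs without pyvis

    net = Network(height="500px", directed=True, notebook=True)
    net.from_nx(graph)
    net.show(path)
```

Calling `render(g_toy)` would write the HTML file; the same loop could be applied to the subgraph `g` before passing it to `from_nx`.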

In [39]:
print(g)

MultiDiGraph with 30 nodes and 29 edges


**Creating the pyvis graph**

In [20]:
my_network = Network(height='500px',directed=True,notebook=True)
my_network.from_nx(g)
my_network.show('nx.html')

nx.html
