<a target="_blank" href="https://colab.research.google.com/github/alejandrogtz/cccs630-fall2023/blob/main/module09/network_models.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Network Models

## Introduction

So far in the course, we have studied various structures to model the location of different elements and relationships between them. For example, we employed 2D grids and cell proximity to represent the locations and connections between elements in the cellular automaton models and the agent-based simulations reviewed. This module will explore a different structure to represent more complex relationships: network models. 

In preparation for the live session, please watch the following video to learn about the importance of networks in our lives.

In [None]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/RfgjHoVCZwU" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

## Concepts

Below is a list of the important concepts we will review in this module.

- Centrality metrics
- Connections
- Edges
- Networks
- Network models
- Network structures
- Nodes
- Vertices

## Interaction

In this interaction, we will create a network model to explore and analyze web browsing patterns, focusing on the transitions between different websites (domains).

We will use a sample data set extracted from a research study that tracked the web browsing behaviour of 2,148 German users for one month in 2018. The data has been anonymized, and the participants authorized its collection via tracking software installed on their personal computers.

If you are interested, you can read more about the <a href="https://arxiv.org/abs/2012.15112">research study</a> and access the <a href="https://zenodo.org/records/4757574">research data</a> published. <b>Note that the research data contains links to sensitive content (adult websites, hate speech, etc.), so please be aware, and do not access or search unfamiliar websites</b>.

### Instructions

- Load the sample data into Jupyter.
- Explore the loaded data.
- Convert the data into a network model.
- Explore and analyze the network model.

### Initial Conditions, Assumptions, and Limitations

- The research data follows a sequential order organized by user.
- The network model focuses on the transitions between domains.
- The network model represents a single user.

In [None]:
import pandas as pd

In [None]:
import networkx as nx

In [None]:
import matplotlib.pyplot as plt

Load and explore the raw data. 

In [None]:
# Load the data into Jupyter
data = pd.read_excel('module09_data.xlsx', sheet_name='data')

In [None]:
data

In [None]:
# Group users by ID
panelist_ids = data.groupby('panelist_id').size().reset_index(name='counts')

In [None]:
panelist_ids

In [None]:
"""
Use the panelist_id value to select the user you want to analyze
"""
# Filter a single user
user_data = data[data['panelist_id'] == 1]

In [None]:
user_data

In [None]:
# Group records by category1
visited_categories = user_data.groupby('category1').size().reset_index(name='counts')

In [None]:
visited_categories

In [None]:
# Group records by website (domain)
visited_domains = user_data.groupby(['top_level_domain']).agg({
    'active_seconds': 'sum',
    'top_level_domain': 'count'
}).rename(columns={'top_level_domain': 'total_visits','active_seconds':'total_active_seconds'}).reset_index()

In [None]:
visited_domains

In [None]:
visited_domains.to_excel('visited_domains.xlsx', index=False, header=True)

Convert the raw data into a data structure that can be used to create a network model.
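
Before working through the real data, the following minimal sketch shows the target structure on a tiny made-up table. It assumes, as in the research data, that each record carries an `id` and a `prev_id` pointing to the preceding visit. The toy values and the names `toy`, `pairs`, and `toy_connections` are hypothetical. Note that it uses a self-merge instead of a row-by-row loop, so if several records share the same `prev_id` it keeps all of them rather than only the first.

```python
import pandas as pd

# Hypothetical toy records: prev_id refers to the id of the previous visit (0 = no previous visit)
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'prev_id': [0, 1, 2, 3],
    'top_level_domain': ['a.com', 'b.com', 'a.com', 'b.com'],
})

# Pair every visit (start) with the visit that names it as its predecessor (end)
pairs = toy.merge(toy, left_on='id', right_on='prev_id', suffixes=('_start', '_end'))

# Count how often each start -> end transition occurs
toy_connections = (
    pairs.groupby(['top_level_domain_start', 'top_level_domain_end'])
    .size()
    .reset_index(name='count')
    .rename(columns={
        'top_level_domain_start': 'start_node',
        'top_level_domain_end': 'end_node',
    })
)

print(toy_connections)
```

The result has one row per distinct transition, which is the same shape the cells below produce for the selected user.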

In [None]:
# Create a dictionary with empty lists for the start and end nodes
connections = {
    'start_node': [],
    'end_node': []
}

In [None]:
# Identify and organize the start and end nodes
for index, row in user_data.iterrows():
    
    linked_rows = user_data.loc[user_data['prev_id'] == row['id']]
    
    if (len(linked_rows)>0):
        connections['start_node'].append(row['top_level_domain'])
        connections['end_node'].append(linked_rows.iloc[0]['top_level_domain']) # Select the first row of the dataframe

In [None]:
connections

In [None]:
# Convert a dictionary into a dataframe
connections = pd.DataFrame.from_dict(connections)

In [None]:
connections

In [None]:
# Group the data by start and end nodes
connections = connections.groupby(['start_node','end_node']).size().reset_index(name='count')

In [None]:
connections

In [None]:
connections.to_excel('connections.xlsx', index=False, header=True)

Create the network model.
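
As a small illustration of the idea, the sketch below builds a weighted directed graph from a handful of hypothetical transition counts. The domain names and the variables `toy_transitions` and `toy_graph` are made up for this example. It relies on the fact that `add_edge` creates any missing nodes automatically.

```python
import networkx as nx

# Hypothetical transition counts: (start domain, end domain, number of transitions)
toy_transitions = [
    ('a.com', 'b.com', 2),
    ('b.com', 'a.com', 1),
    ('b.com', 'c.com', 3),
    ('c.com', 'c.com', 4),  # self-loop, skipped below
]

toy_graph = nx.DiGraph()
for start, end, count in toy_transitions:
    if start != end:  # ignore self-loops
        toy_graph.add_edge(start, end, weight=count)  # missing nodes are added automatically

print('Nodes:', toy_graph.number_of_nodes())
print('Edges:', toy_graph.number_of_edges())
print('Weight b.com -> c.com:', toy_graph['b.com']['c.com']['weight'])
```

Because the graph is directed, `a.com -> b.com` and `b.com -> a.com` are stored as two separate edges, each with its own weight.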

In [None]:
# Create a directed graph
G = nx.DiGraph()

In [None]:
# Add nodes and connections to the graph
for index, row in connections.iterrows():
    if (row['start_node']!=row['end_node']): # Ignore self-loops (edges that connect a node to itself)
        if (not G.has_node(row['start_node'])):
            G.add_node(row['start_node'])
        if (not G.has_node(row['end_node'])):
            G.add_node(row['end_node'])
        G.add_edge(row['start_node'],row['end_node'], weight=row['count'])

In [None]:
# Print the number of nodes
print('Nodes: ',G.number_of_nodes())

In [None]:
# Print the number of edges (connections)
print('Edges: ',G.number_of_edges())

Explore the network model.
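
The drawing cells in this part depend on a layout, which is a dictionary that maps each node to an (x, y) position. The sketch below, using a hypothetical three-node graph, suggests why a fixed `seed` is passed to `nx.spring_layout`: with the same seed, the layout should be reproduced exactly, so separate plots of the same graph stay comparable.

```python
import networkx as nx
import numpy as np

# Hypothetical toy graph, used only to illustrate layout reproducibility
toy_graph = nx.DiGraph([('a.com', 'b.com'), ('b.com', 'c.com'), ('c.com', 'a.com')])

# Compute the layout twice with the same seed
first_layout = nx.spring_layout(toy_graph, seed=2)
second_layout = nx.spring_layout(toy_graph, seed=2)

# Check that every node received the same coordinates in both runs
same_positions = all(
    np.allclose(first_layout[node], second_layout[node]) for node in toy_graph
)
print(same_positions)
```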

In [None]:
# Visualize the network.
nx.draw(G,with_labels=True)

In [None]:
pos = nx.spring_layout(G, seed=2)

In [None]:
fig = plt.figure(1, figsize=(50, 50), dpi=200)
nx.draw(G, pos, with_labels=True, node_color='lightblue', font_weight='normal', node_size=1500, width=1)

In [None]:
def get_node_labels(G):
    return {node: data.get('label', node) for node, data in G.nodes(data=True)}

In [None]:
node_labels = get_node_labels(G)

In [None]:
fig = plt.figure(1, figsize=(50, 50), dpi=200)

labels = nx.get_edge_attributes(G,'weight')

nx.draw_networkx_edge_labels(G,pos,edge_labels=labels)
nx.draw_networkx_labels(G, pos, labels=node_labels)
nx.draw_networkx_nodes(G, pos, node_color='lightblue', node_size=1000)
nx.draw_networkx_edges(G, pos, alpha=0.5, arrows=True, arrowstyle='-|>', arrowsize=20)

plt.show()

Analyze the network using three fundamental centrality metrics.
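
To get a feel for what the three metrics report before applying them to the browsing network, here is a minimal sketch on a hypothetical three-node chain (`a.com -> b.com -> c.com`). The graph and variable names are assumptions made for this illustration.

```python
import networkx as nx

# Hypothetical chain of transitions: a.com -> b.com -> c.com
toy_graph = nx.DiGraph([('a.com', 'b.com'), ('b.com', 'c.com')])

# Degree centrality: (incoming + outgoing edges) divided by the number of other nodes
toy_degree = nx.degree_centrality(toy_graph)

# Betweenness centrality: share of shortest paths between other nodes that pass through the node
toy_betweenness = nx.betweenness_centrality(toy_graph)

# Closeness centrality: for directed graphs, NetworkX uses distances *to* the node
toy_closeness = nx.closeness_centrality(toy_graph)

for node in toy_graph:
    print(node, toy_degree[node], toy_betweenness[node], round(toy_closeness[node], 3))
```

In this chain, `b.com` has the highest degree and betweenness because it sits between the other two nodes, while `a.com` has a closeness of zero because no other node leads to it.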

In [None]:
# Degree centrality measures how many connections a node has, normalized by the number of other nodes. The higher the value, the more central the node is.
nx.degree_centrality(G)

In [None]:
# Create a dataframe from a dictionary
degree_centrality = pd.DataFrame.from_dict(nx.degree_centrality(G), orient='index',columns=['degree_centrality'])

In [None]:
# Convert the dataframe index to a column and create a new index
degree_centrality = degree_centrality.rename_axis('domain').reset_index()

In [None]:
# Betweenness measures the node's importance in the flow of information through a network
nx.betweenness_centrality(G)

In [None]:
betweenness_centrality = pd.DataFrame.from_dict(nx.betweenness_centrality(G), orient='index',columns=['betweenness_centrality'])

In [None]:
betweenness_centrality = betweenness_centrality.rename_axis('domain').reset_index()

In [None]:
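# Closeness centrality measures how close a node is to all other nodes in the network
# Reciprocal of the average shortest-path distance between the node and the nodes connected to it
# (for directed graphs, NetworkX uses the distance to the node from the nodes that can reach it)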
nx.closeness_centrality(G)

In [None]:
closeness_centrality = pd.DataFrame.from_dict(nx.closeness_centrality(G), orient='index',columns=['closeness_centrality'])

In [None]:
closeness_centrality = closeness_centrality.rename_axis('domain').reset_index()

In [None]:
network_metrics = pd.merge(pd.merge(degree_centrality,betweenness_centrality,on='domain'),closeness_centrality,on='domain')

In [None]:
def get_connections(node):
    return(G.edges(node))

In [None]:
# List the node's outgoing connections (on a DiGraph, G.edges(node) returns only out-edges)
get_connections('gmx.net')

In [None]:
network_metrics['connections'] = network_metrics.apply(lambda row: len(get_connections(row['domain'])), axis = 1)

In [None]:
def get_visits(node):
    row = visited_domains[visited_domains.top_level_domain == node].iloc[0]
    return(row['total_active_seconds'],row['total_visits'])

In [None]:
network_metrics['total_active_seconds'], network_metrics['total_visits'] = zip(*network_metrics.apply(lambda row: get_visits(row['domain']), axis = 1))

In [None]:
network_metrics = network_metrics.sort_values(by=['degree_centrality'], ascending=False)

In [None]:
network_metrics

In [None]:
network_metrics.to_excel('network_metrics.xlsx', index=False, header=True)

Extra Material - Information Dissemination Simulation
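
As a warm-up, the following minimal sketch runs the same kind of two-state spread on a hypothetical four-node chain, where information travels one hop along outgoing edges per step. The function `spread_steps` and the toy domains are illustrative assumptions, not part of the module's code. Unlike the cells below, it returns the set of informed nodes instead of storing a state on each node.

```python
import networkx as nx

def spread_steps(graph, initial_spreader, steps):
    """Return the set of informed nodes after `steps` rounds of spreading along outgoing edges."""
    informed = {initial_spreader}
    for _ in range(steps):
        # Collect the successors of all informed nodes that are not informed yet
        newly_informed = {
            neighbor
            for node in informed
            for neighbor in graph.successors(node)
        } - informed
        if not newly_informed:
            break  # nothing left to reach
        informed |= newly_informed
    return informed

# Hypothetical chain: a.com -> b.com -> c.com -> d.com
chain = nx.DiGraph([('a.com', 'b.com'), ('b.com', 'c.com'), ('c.com', 'd.com')])

print(sorted(spread_steps(chain, 'a.com', 1)))
print(sorted(spread_steps(chain, 'a.com', 2)))
print(sorted(spread_steps(chain, 'd.com', 3)))
```

Starting from `d.com` reaches no other node because it has no outgoing edges, which shows how edge direction limits the spread.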

In [None]:
"""
Adjust the initial spreader and the simulation steps
"""
initial_spreader = 'gmx.net'
simulation_steps = 1

In [None]:
# Reset the graph states
for node in G.nodes():
    G.nodes[node]['state'] = 'S' # Susceptible

G.nodes[initial_spreader]['state'] = 'I' # Infected

# Information spread based on the two-state model, infected and susceptible
def spread_info(G):
    new_spreaders = []
    for node in G.nodes():
        if G.nodes[node]['state'] == 'I':
            for neighbor in G.neighbors(node):
                if G.nodes[neighbor]['state'] == 'S':
                    new_spreaders.append(neighbor)
    
    for new_spreader in new_spreaders:
        G.nodes[new_spreader]['state'] = 'I'

for _ in range(simulation_steps):
    spread_info(G)

# Visualization
fig = plt.figure(1, figsize=(50, 50), dpi=50)
color_map = {'S': 'lightblue', 'I': 'lightcoral'}
colors = [color_map[G.nodes[node]['state']] for node in G.nodes()]
nx.draw(G, pos, node_color=colors, with_labels=True, node_size=1500, width=1)
plt.show()

## Assignment 

### Conceptual Option

Research how network modelling and analysis can be used to solve a problem or understand a phenomenon you would like to study. Use Google Scholar and the McGill Library to identify previous studies (research articles, book chapters, etc.) that have explored the same problem or phenomenon. Explain the problem or phenomenon of interest and briefly summarize two of the identified studies. Submit a 2-3 page Word document with the summaries, descriptions, and insights. Reference the consulted sources using the APA format.

### Hands-on Option

Select the user ID 1137 and recreate the analysis followed in class. Use the metrics covered (connections, centrality metrics, active time, etc.) in the module to identify the important nodes and connections in the network. Briefly summarize the web browsing pattern of user 1137 and describe your insights. Submit a 1-2 page Word document with the produced graphs, descriptions, and insights. Reference the consulted sources using the APA format.

## Recommended Reading

- Chapter 2 - Graphs. Downey, A. (2018). Think complexity: Complexity science and computational modeling (Second). O’Reilly Media. https://mcgill.on.worldcat.org/oclc/1043913738

## Optional Readings

Below are additional resources in case you would like to continue exploring the topics covered in this module.

- Chapter 3 - Small World Graphs. Downey, A. (2018). Think complexity: Complexity science and computational modeling (Second). O’Reilly Media. https://mcgill.on.worldcat.org/oclc/1043913738
- Chapter 4 - Scale-Free Networks. Downey, A. (2018). Think complexity: Complexity science and computational modeling (Second). O’Reilly Media. https://mcgill.on.worldcat.org/oclc/1043913738