<a href="https://colab.research.google.com/github/abnormalPotassium/DATA620/blob/main/Assignment%203/Assignment3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 3: Graph Visualization
By: Al Haque, Taha Ahmad


---
## Goal

This week's assignment is to:
1. Load a graph database of your choosing from a text file or other source. If you take a
large network dataset from the web (such as from https://snap.stanford.edu/data/), please
feel free at this point to load just a small subset of the nodes and edges.
2. Create basic analysis on the graph, including the graph’s diameter, and at least one other
metric of your choosing. You may either code the functions by hand (to build your
intuition and insight), or use functions in an existing package.
3. Use a visualization tool of your choice (Neo4j, Gephi, etc.) to display information.
4. Please record a short video (~ 5 minutes), and submit a link to the video as part of your
homework submission.

---
## Conda Initialization

Installing Conda for Google Colab to save and replicate environments.

Note that installing condacolab restarts the kernel, so this code block should be run separately.

In [5]:
!pip install condacolab
import condacolab
condacolab.install()

Collecting condacolab
  Using cached condacolab-0.1.8-py3-none-any.whl (7.2 kB)
Installing collected packages: condacolab
Successfully installed condacolab-0.1.8
✨🍰✨ Everything looks OK!


Checking if Conda was successfully installed

In [1]:
!conda --version

conda 23.1.0


Copying our conda environment with networkx, nltk, and pyvis included


In [2]:
!wget https://raw.githubusercontent.com/abnormalPotassium/DATA620/main/Environment/data620conda.yml -O data620conda.yml
!conda env update -f data620conda.yml -n base

--2024-02-05 22:36:51--  https://raw.githubusercontent.com/abnormalPotassium/DATA620/main/Environment/data620conda.yml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3331 (3.3K) [text/plain]
Saving to: ‘data620conda.yml’


2024-02-05 22:36:51 (25.2 MB/s) - ‘data620conda.yml’ saved [3331/3331]

Collecting package metadata (repodata.json):

---
## Loading a Graph Dataset

The graph dataset we showcase in this assignment is the Twitch social network dataset, which captures the connections between streamers who were active in May 2018. Nodes are streamers and edges are mutual friendships.

The dataset is sourced from B. Rozemberczki, C. Allen and R. Sarkar. Multi-scale Attributed Node Embedding. 2019
https://snap.stanford.edu/data/twitch-social-networks.html


### Downloading and unpacking

We retrieve the zipped dataset files from where they are hosted and download them to our environment.

In [6]:
!wget https://snap.stanford.edu/data/twitch.zip
!unzip -j twitch.zip

--2024-02-05 22:53:34--  https://snap.stanford.edu/data/twitch.zip
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2842994 (2.7M) [application/zip]
Saving to: ‘twitch.zip’


2024-02-05 22:53:39 (692 KB/s) - ‘twitch.zip’ saved [2842994/2842994]

Archive:  twitch.zip
  inflating: musae_DE.json           
  inflating: musae_DE_edges.csv      
  inflating: musae_DE_target.csv     
  inflating: musae_ENGB_edges.csv    
  inflating: musae_ENGB_features.json  
  inflating: musae_ENGB_target.csv   
  inflating: musae_ES_edges.csv      
  inflating: musae_ES_features.json  
  inflating: musae_ES_target.csv     
  inflating: musae_FR_edges.csv      
  inflating: musae_FR_features.json  
  inflating: musae_FR_target.csv     
  inflating: musae_PTBR_edges.csv    
  inflating: musae_PTBR_features.json  
  inflating: musae_PTBR_target.csv   
  inf

### Loading as a NetworkX Graph

Note that after unzipping we have separate files for each language, along with extra features for the nodes. For now we only want to visualize a basic graph of English-language streamers, so we will load `musae_ENGB_edges.csv` into a NetworkX graph object.

In [26]:
import networkx as nx
import pandas as pd

file = 'musae_ENGB_edges.csv'

df = pd.read_csv(file)
G = nx.from_pandas_edgelist(
    df,
    source="from",
    target="to"
)

print(f"A graph has been created with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")

A graph has been created with 7126 nodes and 35324 edges


With our network graph data successfully loaded, we see that we are working with 7,126 different streamers who share 35,324 mutual friendships.

---
## Analyzing the Graph

With the graph loaded, we can now tackle a basic analysis of its attributes. We'll start with the diameter of the graph: the length of the longest shortest path between any two nodes. Thankfully, NetworkX has a built-in function to compute the diameter.

In [27]:
nx.diameter(G)

10

We see that any two English streamers in our Twitch sample are separated by at most 10 mutual friendships. It might be interesting to see which streamers are the most separated when we visualize our graph.
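To build intuition for what `nx.diameter` reports, the sketch below computes the same quantity by hand on a small made-up graph. It runs a breadth-first search from every node, records each node's eccentricity (its distance to the farthest node), and takes the maximum. The toy adjacency list and the helper names `bfs_distances` and `diameter_and_periphery` are illustrative assumptions, not part of the Twitch dataset or of NetworkX, and the sketch assumes a connected graph.

```python
from collections import deque

# Hypothetical toy graph (undirected adjacency list), not the Twitch data.
toy_graph = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

def bfs_distances(graph, start):
    """Return shortest-path lengths (in edges) from start to every reachable node."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

def diameter_and_periphery(graph):
    """Return (diameter, nodes whose eccentricity equals the diameter)."""
    eccentricity = {}
    for node in graph:
        distances = bfs_distances(graph, node)
        if len(distances) != len(graph):
            raise ValueError("graph is not connected, so the diameter is undefined")
        # Eccentricity: distance from this node to the node farthest from it.
        eccentricity[node] = max(distances.values())
    diameter = max(eccentricity.values())
    periphery = sorted(n for n, ecc in eccentricity.items() if ecc == diameter)
    return diameter, periphery

diameter, periphery = diameter_and_periphery(toy_graph)
print(f"diameter = {diameter}, most separated nodes = {periphery}")
```

On the real graph, `nx.periphery(G)` returns the nodes whose eccentricity equals the diameter, which is one way to identify the most separated streamers without writing the search by hand.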

Next, we want to see who is the most and least well connected by finding the largest and smallest node degrees. The `degree` attribute of a NetworkX graph gives us a degree view of (node, degree) pairs, and a list comprehension over it extracts just the degree values, from which we take the minimum and maximum.

In [61]:
dlis = [degree[1] for degree in G.degree]
print(f"The streamers with the least connections only have {min(dlis)} mutual friends on Twitch while the streamer with the most connections has {max(dlis)} mutual friends.")

The streamers with the least connections only have 1 mutual friends on twitch while the streamer with the most connections has 720 mutual friends
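The minimum and maximum above give the degree values but not which streamers hold them. Below is a small sketch of how one might recover the node IDs, shown on a made-up graph. The edges and the names `toy`, `degrees`, `most_connected`, and `least_connected` are illustrative assumptions.

```python
import networkx as nx

# Hypothetical toy graph standing in for the Twitch network.
toy = nx.Graph()
toy.add_edges_from([(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)])

# The degree view yields (node, degree) pairs, so dict() gives a node -> degree map.
degrees = dict(toy.degree)

max_degree = max(degrees.values())
min_degree = min(degrees.values())

# Several nodes can tie, so collect all of them rather than a single node.
most_connected = sorted(n for n, d in degrees.items() if d == max_degree)
least_connected = sorted(n for n, d in degrees.items() if d == min_degree)

print(f"max degree {max_degree}: nodes {most_connected}")
print(f"min degree {min_degree}: nodes {least_connected}")
```

The same lines should work on `G` unchanged, since only the graph object differs.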


---
## Data Visualization

We will attempt to use PyVis, a Python graph visualization tool, to display our very large network graph as an interactive visualization.

Note that "attempt" is the key word here: our network appears to be too complex for PyVis to display, with the HTML display stuck at 0%. Thus, we will move on to visualizing our network in Gephi, an open-source network viewer.

In [66]:
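from pyvis.network import Network
from IPython.display import display, HTML

# Build the interactive PyVis network from the NetworkX graph
nt = Network(notebook = True)
nt.from_nx(G)
# Write the visualization to an HTML file, then embed that file in the notebook
nt.show("twitchnetwork.html")
display(HTML(filename='twitchnetwork.html'))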

Local cdn resources have problems on chrome/safari when used in jupyter-notebook. 
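Since the full graph is too heavy for PyVis, one option is to shrink it before plotting, or to hand it to Gephi as a file. The sketch below keeps only the highest-degree nodes and writes the result as GEXF, a format Gephi can open. The helper `top_degree_subgraph`, the cutoff `n`, the toy graph, and the file name are assumptions for illustration; the notebook itself does not show how the graph was moved into Gephi.

```python
import os
import tempfile

import networkx as nx

def top_degree_subgraph(graph, n):
    """Return a copy of the subgraph induced by the n highest-degree nodes."""
    ranked = sorted(graph.degree, key=lambda pair: pair[1], reverse=True)
    keep = [node for node, _ in ranked[:n]]
    return graph.subgraph(keep).copy()

# Hypothetical toy graph; with the real data this would be the Twitch graph G.
toy = nx.Graph()
toy.add_edges_from([(0, 1), (0, 2), (1, 2), (0, 3), (1, 4), (2, 5), (0, 6)])

small = top_degree_subgraph(toy, 3)

# Write to a temporary directory so the sketch leaves no files behind,
# then read the file back to confirm the export round-trips.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_subset.gexf")
    nx.write_gexf(small, path)
    reloaded = nx.read_gexf(path, node_type=int)

print(small.number_of_nodes(), small.number_of_edges())
```

With the real data, calling `nx.write_gexf` on `G` with a permanent path would produce a file for Gephi, and a reduced graph like `small` could be passed to `nt.from_nx` if a lighter PyVis view is wanted.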


We showcase visualizing the graph in the YouTube video below, embedded through IPython's built-in `YouTubeVideo` class:

In [78]:
from IPython.display import YouTubeVideo
YouTubeVideo('jTuUVN57f7U')