<div style="font-size:30px" align="center"> <b> Using Node Embeddings to Study Open Source Software Collaborations </b> </div>

<div style="font-size:18px" align="center"> <b> Brandon Kramer, UVA Biocomplexity Institute, OSS DSPG 2021 </b> </div>

<br>

### Setup  

In this notebook, we use `node2vec` to study open source software collaborations. First, we load our modules, pull the edge data from the PostgreSQL database, and read the node data from an RDS file.

In [6]:
%matplotlib inline

# load modules 
import warnings
from datetime import datetime
from text_unidecode import unidecode
from collections import deque
import os
import multiprocessing
import psycopg2 as pg
import pandas.io.sql as psql
import pandas as pd
from sklearn.manifold import TSNE
import numpy as np
import networkx as nx
from gensim.models import Word2Vec
from node2vec import Node2Vec
import altair as alt
warnings.filterwarnings('ignore')

# connect to the database, download data 
connection = pg.connect(host = 'postgis1', database = 'sdad', 
                        user = os.environ.get('db_user'), 
                        password = os.environ.get('db_pwd'))

edgelist_data = '''SELECT slug1, slug2, weight FROM gh_2007_2020.sna_repos_subnet_edges'''
edgelist_data = pd.read_sql_query(edgelist_data, con=connection)

# convert the edgelist to a graph 
graph = nx.from_pandas_edgelist(edgelist_data, source='slug1', target='slug2', edge_attr='weight')

print("Node count:", graph.number_of_nodes(), "- Edge count:", graph.number_of_edges())

Node count: 416 - Edge count: 5237


In [23]:
# pip install pyreadr
import pyreadr
nodelist_data = pyreadr.read_r('/project/class/bii_sdad_dspg/uva_2021/dspg21oss/descriptions_classified.rds')
nodelist_data = nodelist_data[None]
nodelist_data.head()

Unnamed: 0,slug,description,language,commits,prog_go,prog_ruby,prog_cpp,prog_csharp,prog_clang,prog_objc,...,sys_emulapi,sys_grouping,sys_other,system_all,topics_ai,topics_dataviz,app_cryptocurrency,app_blockchain_all,app_business_all,app_database_all
0,otus-kuber-2019-12/OLEGIM_platform,olegim platform repository,C#,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,alex/line-counter,"like `wc -l`, but in rust and maybe faster",Rust,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,arlm/exercism,http://exercism.io/ practice,C#,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ksundong/TIL,today i learned // 그날 그날 공부한 내용을 정리하여 관리한다.,,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ianmartinez/AsciiStudio,a cross-platform program that converts both st...,Java,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7259288,tripolskypetr/material-ui-umd,разработка ui на react используя как систему с...,TypeScript,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7259289,LeuAlmeida/fastfeet.api,:truck: fastfeet is a fictitious logistic comp...,JavaScript,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
7259290,chromeos/chromeos.dev,chromeos.dev is the digital home for all thing...,JavaScript,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
7259291,Dr-AlaaKhamis/Graph-Search-Algorithms,graph search methods include blind search meth...,Jupyter Notebook,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Training node2vec

Next, we use `node2vec` to embed the network subset and save the model for later visualization.

In [7]:
cores_available = multiprocessing.cpu_count() - 1

# train the graph with node2vec 
print("Started at:", datetime.now())
# node2vec
node2vec = Node2Vec(graph, dimensions=20, walk_length=16, num_walks=100, workers=cores_available)
# extract model
model = node2vec.fit(window=10, min_count=1)
print("Finished at:", datetime.now())

os.chdir('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/')
model.save("node2vec_subnet.model")

# Running large graphs 
# https://github.com/eliorc/node2vec
# https://github.com/eliorc/node2vec/blob/master/example.py
#node2vec = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=200, workers=4, temp_folder="/mnt/tmp_data")

Started at: 2021-07-23 11:01:18.931333


Computing transition probabilities:   0%|          | 0/416 [00:00<?, ?it/s]

Finished at: 2021-07-23 11:01:57.304867
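
To make the training step less of a black box, here is a minimal sketch of the kind of second-order biased random walk that `node2vec` samples before feeding the walks to a skip-gram model. It is an illustration under stated assumptions, not the library's implementation: `biased_walk`, `toy_adj`, and the repo names and weights are all hypothetical. The return parameter `p` and in-out parameter `q` follow the usual node2vec convention; the `Node2Vec` call above does not set them.

```python
import random

def biased_walk(adj, start, walk_length, p=1.0, q=1.0, rng=None):
    """Generate one second-order (node2vec-style) random walk.

    adj maps each node to a dict {neighbor: edge weight}.
    p is the return parameter, q is the in-out parameter.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = list(adj[current])
        if not neighbors:
            break  # dead end: the current node has no neighbors
        if len(walk) == 1:
            # first step: there is no previous node, so use edge weights only
            weights = [adj[current][n] for n in neighbors]
        else:
            previous = walk[-2]
            weights = []
            for n in neighbors:
                if n == previous:
                    bias = 1.0 / p   # step back to the previous node
                elif n in adj[previous]:
                    bias = 1.0       # stay one hop away from the previous node
                else:
                    bias = 1.0 / q   # move farther away from the previous node
                weights.append(adj[current][n] * bias)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# toy undirected, weighted graph (hypothetical repos and weights)
toy_adj = {
    "a/repo": {"b/repo": 2.0, "c/repo": 1.0},
    "b/repo": {"a/repo": 2.0, "c/repo": 1.0, "d/repo": 1.0},
    "c/repo": {"a/repo": 1.0, "b/repo": 1.0},
    "d/repo": {"b/repo": 1.0},
}

walk = biased_walk(toy_adj, "a/repo", walk_length=16, p=1.0, q=0.5,
                   rng=random.Random(0))
print(walk)
```

With `q` below 1 the walk is nudged outward (closer to depth-first exploration); with `q` above 1 it tends to stay near its starting neighborhood.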


### Incorporating Node Attributes

Once we have the node embeddings (in a multidimensional space), we can make some interesting visualizations. Before we do that, we run some basic network centrality measures (degree centrality, betweenness centrality, and PageRank) to supplement our visualizations. Next, we reduce the node embeddings to two dimensions using t-distributed stochastic neighbor embedding (`TSNE` from `scikit-learn`). Lastly, we join our node attributes (companies, countries and sectors) to help our final analyses.

In [17]:
# import the node2vec model 
os.chdir('/project/biocomplexity/sdad/projects_data/ncses/oss/dspg_2021/')
model = Word2Vec.load("node2vec_subnet.model")

# run centrality measures 
deg_cent = nx.degree_centrality(graph)
btw_cent = nx.betweenness_centrality(graph, normalized = True, endpoints = False)
page_rank = nx.pagerank(graph, alpha = 0.8)
deg_cent_df = pd.DataFrame(deg_cent.items(), columns=['slug', 'deg_cent'])
btw_cent_df = pd.DataFrame(btw_cent.items(), columns=['slug', 'btw_cent'])
page_rank_df = pd.DataFrame(page_rank.items(), columns=['slug', 'page_rank'])
deg_cent_df = btw_cent_df.join(deg_cent_df.set_index('slug'), on='slug', how='left')
cent_measures = page_rank_df.join(deg_cent_df.set_index('slug'), on='slug', how='left')
cent_measures

# join all of the node attributes together for data viz 
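vocab = list(model.wv.vocab)  # gensim < 4.0; on gensim >= 4.0 use list(model.wv.index_to_key)
model_x = model.wv[vocab]     # index through .wv; model[vocab] is deprecated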
model_tsne = TSNE(n_components=2)
model_tsne_x = model_tsne.fit_transform(model_x)
model_tsne_x
tsne_df = pd.DataFrame(model_tsne_x, index=vocab, columns=['x', 'y'])
tsne_df["slug"] = vocab
#tsne_df = tsne_df.join(nodelist_data.set_index('login'), on='login', how='left')
tsne_df = cent_measures.join(tsne_df.set_index('slug'), on='slug', how='left')
tsne_df

Unnamed: 0,slug,page_rank,btw_cent,deg_cent,x,y
0,elixir-lang/elixir,0.006308,0.002009,0.069880,-22.105881,-8.102344
1,phoenixframework/phoenix,0.007031,0.008009,0.077108,-22.327425,-8.097646
2,Homebrew/brew,0.008249,0.020599,0.132530,4.595654,9.032990
3,Homebrew/homebrew-cask,0.025073,0.186809,0.356627,4.283966,9.311499
4,angular/angular,0.006497,0.003586,0.089157,-12.499209,10.287621
...,...,...,...,...,...,...
411,typicode/json-server,0.000521,0.000000,0.002410,-1.678538,-3.109298
412,domnikl/DesignPatternsPHP,0.000545,0.000000,0.002410,-20.062460,12.840777
413,bilibili/flv.js,0.000766,0.000000,0.002410,-3.231424,20.323439
414,shadowsocks/shadowsocks-windows,0.001039,0.000000,0.002410,-14.384509,-12.656199
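
For intuition about two of the centrality columns above, the following sketch computes degree centrality and PageRank by hand on a toy graph. It is a simplified illustration, not what `networkx` does internally: the graph is hypothetical, edge weights are ignored (whereas `nx.pagerank` uses the `weight` attribute by default), and every node is assumed to have at least one neighbor.

```python
def degree_centrality(adj):
    """Fraction of the other nodes that each node is connected to."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

def pagerank(adj, alpha=0.8, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank on an undirected, unweighted graph.

    Assumes every node has at least one neighbor (no dangling nodes).
    """
    n = len(adj)
    rank = {node: 1.0 / n for node in adj}
    for _ in range(max_iter):
        new_rank = {}
        for node in adj:
            # each neighbor splits its current rank evenly among its own neighbors
            incoming = sum(rank[nbr] / len(adj[nbr]) for nbr in adj[node])
            new_rank[node] = (1.0 - alpha) / n + alpha * incoming
        change = sum(abs(new_rank[v] - rank[v]) for v in adj)
        rank = new_rank
        if change < tol:
            break
    return rank

# toy undirected graph (hypothetical): one hub plus a small triangle
toy_graph = {
    "hub": {"a", "b", "c"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
}

toy_deg = degree_centrality(toy_graph)
toy_rank = pagerank(toy_graph, alpha=0.8)
for node in toy_graph:
    print(node, round(toy_deg[node], 3), round(toy_rank[node], 3))
```

The `alpha=0.8` value mirrors the damping factor passed to `nx.pagerank` in the cell above.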


### Node Embedding Visualizations 

First, we can visualize these embeddings by sector. Given that most of the nodes are from the

In [18]:
#domain = ['academic', 'business', 'non-profit', 'government']# , 'not classified', 'null/missing']
#range_ = ['crimson', 'teal', 'darkorange', 'darkblue'] #, 'lightgrey', 'lightgrey']

alt.Chart(tsne_df,title="Node Embedding of OSS Repo Networks").mark_circle().encode(
   x='x', y='y', 
    #color=alt.Color('sector', scale=alt.Scale(domain=domain, range=range_)),
    size=alt.Size('page_rank'),
    tooltip=['slug']
).interactive().properties(
    width=700,
    height=500
)

In [10]:
domain = ['microsoft', 'google', 'red hat', 'ibm', 'facebook', 'intel', 'thoughtworks', 'alibaba', 'amazon', 'databricks']

alt.Chart(tsne_df, title="Node Embedding of OSS Collaboration Networks (Top Companies)").mark_circle().encode(
   x='x', y='y', 
    color=alt.Color('company_cleaned', scale=alt.Scale(domain=domain)),
    size=alt.Size('page_rank'),
    tooltip=['login', 'sector', 'city_info', 'cc_viz', 'company_original', 'company_cleaned']
).interactive().properties(
    width=700,
    height=500
)

In [58]:
alt.Chart(tsne_df, title="Node Embedding of OSS Collaboration Networks (by Company)").mark_circle(size=150).encode(
   x='x', y='y', 
    color='company_cleaned',#alt.Color('sector', scale=alt.Scale(domain=domain, range=range_)),
    size=alt.Size('page_rank'),
    tooltip=['login', 'sector', 'city_info', 'cc_viz', 'company_original', 'company_cleaned']
).interactive().properties(
    width=700,
    height=500
)

In [59]:
alt.Chart(tsne_df,title="Node Embedding of OSS Collaboration Networks (by Country)").mark_circle(size=150).encode(
   x='x', y='y', 
    color='cc_viz',
    size=alt.Size('page_rank'),
    tooltip=['login', 'sector', 'city_info', 'cc_viz', 'company_original', 'company_cleaned']
).interactive().properties(
    width=700,
    height=500
)

In [4]:
for node, _ in model.wv.most_similar('rasbt'):
    # keep only node names longer than three characters
    if len(node) > 3:
        print(node)

buaaliyi
rinugun
yuanbyu
EronWright
charlesnicholson
sugyan
darrengarvey
markmcd
zafartahirov
vahidk
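
The lookup above ranks nodes by the cosine similarity between their embedding vectors. A minimal pure-Python sketch of that ranking is shown here; the two-dimensional vectors and node names are made up for illustration, and the helper names are hypothetical rather than part of `gensim`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(embeddings, query, topn=10):
    """Return the topn (node, similarity) pairs closest to the query node."""
    query_vec = embeddings[query]
    scores = [
        (node, cosine_similarity(query_vec, vec))
        for node, vec in embeddings.items()
        if node != query  # never return the query node itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# hypothetical 2-d embeddings; real node2vec vectors here have 20 dimensions
toy_embeddings = {
    "node_a": [1.0, 0.0],
    "node_b": [0.9, 0.1],
    "node_c": [0.0, 1.0],
    "node_d": [-1.0, 0.0],
}

neighbors = most_similar(toy_embeddings, "node_a", topn=2)
print(neighbors)
```

Because the ranking depends only on vector direction, two nodes can be close neighbors in embedding space even when they share no direct edge in the graph.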


### Future directions

https://github.com/shenweichen/GraphEmbedding

node2vec 
struc2vec

[Node2Vec Tutorial](https://github.com/eliorc/Medium/blob/master/Nod2Vec-FIFA17-Example.ipynb)
