# coNNect: Friend Recommendation with GNNs
*Created by Horváth Szilárd (MZ7VX5) and Szarvas Dániel (A85UKT)*

## Overview
This is our solution for our selected homework topic for the BME-VIK MSc "**Deep Learning**" course (Mélytanulás, BMEVITMMA19), which is named "**Friend recommendation with graph neural networks**".

Our task was to build a friend recommendation system based on graph neural networks (GNNs). We worked with anonymized data from major social media platforms (Facebook, Google+ and Twitter, now X), which includes rich user profiles and connection circles.

In [2]:
import torch
print(torch.__version__)

2.4.1+cu121


In [3]:
!pip install -q torch_geometric


## Download the data
Our primary data source is SNAP from Stanford (https://snap.stanford.edu/).

- Facebook (10 friend networks): https://snap.stanford.edu/data/ego-Facebook.html
- Google+ (123 friend networks): https://snap.stanford.edu/data/ego-Gplus.html
- Twitter (973 friend networks): https://snap.stanford.edu/data/ego-Twitter.html

In [4]:
def download_data(social_media: str):
  """Download and unpack the SNAP ego-network files for one platform into data/<name>/."""
  social_media = social_media.lower()
  social_media_options = ["facebook", "gplus", "twitter"]
  if social_media not in social_media_options:
    raise ValueError(f"Invalid social media name. Select from {social_media_options}.")

  # Combined edge list of all ego networks (IPython shell magics, notebook-only).
  !wget https://snap.stanford.edu/data/{social_media}_combined.txt.gz -P data/{social_media}
  !gunzip data/{social_media}/{social_media}_combined.txt.gz -f
  # Per-ego files (.edges, .circles, .feat, .egofeat, .featnames).
  !wget https://snap.stanford.edu/data/{social_media}.tar.gz -P data/{social_media}
  !tar -xzvf data/{social_media}/{social_media}.tar.gz -C data/{social_media}
  !rm data/{social_media}/{social_media}.tar.gz
  # Description of the file formats.
  !wget https://snap.stanford.edu/data/readme-Ego.txt -P data/{social_media}

In [5]:
SOCIAL_MEDIA = "facebook"

download_data(SOCIAL_MEDIA)

--2024-10-06 22:51:52--  https://snap.stanford.edu/data/facebook_combined.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 218576 (213K) [application/x-gzip]
Saving to: ‘data/facebook/facebook_combined.txt.gz’


2024-10-06 22:51:54 (223 KB/s) - ‘data/facebook/facebook_combined.txt.gz’ saved [218576/218576]

--2024-10-06 22:51:54--  https://snap.stanford.edu/data/facebook.tar.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 732104 (715K) [application/x-gzip]
Saving to: ‘data/facebook/facebook.tar.gz’


2024-10-06 22:51:56 (374 KB/s) - ‘data/facebook/facebook.tar.gz’ saved [732104/732104]

facebook/
facebook/3980.egofeat
facebook/0.featnames
facebook/698.egofeat
facebook/3437

## Load dataset

All of the datasets include connection "circles", ego networks (the friend network of a single user) and node features (user profiles). Edges are undirected for Facebook and directed for Google+ and Twitter.

We can make sense of the file structure from the `readme-Ego.txt` file, which describes the following five file types (`nodeId` is replaced by the actual ID of the ego node):

- `nodeId.edges` (**node structure**): The edges in the ego network for the node 'nodeId'. Edges are undirected for facebook, and directed (a follows b) for twitter and gplus. The 'ego' node does not appear, but it is assumed that they follow every node id that appears in this file.

- `nodeId.circles` (**node structure**): The set of circles for the ego node. Each line contains one circle, consisting of a series of node ids. The first entry in each line is the name of the circle.

- `nodeId.feat` (**node features**): The features for each of the nodes that appears in the edge file.

- `nodeId.egofeat` (**node features**): The features for the ego user.

- `nodeId.featnames` (**metadata for features**): The names of each of the feature dimensions. Features are '1' if the user has this property in their profile, and '0' otherwise. This file has been anonymized for facebook users, since the names of the features would reveal private data.

First, to illustrate the data, we dive into the files. There are 10 friend networks, and we will be looking into number 0.

### `nodeId.edges` (**node structure**):

In [6]:
!head data/{SOCIAL_MEDIA}/{SOCIAL_MEDIA}/0.edges

236 186
122 285
24 346
271 304
176 9
130 329
204 213
252 332
82 65
276 26


These are the edges of the friend network: each line holds the IDs of two connected nodes.
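As a minimal sketch of how such an edge file could be read without any library, the snippet below turns lines of the form `a b` into an undirected adjacency map. The helper name `parse_edges` is hypothetical, and the sample is just the first three lines printed above.

```python
from collections import defaultdict


def parse_edges(lines):
    """Build an undirected adjacency map from 'a b' edge lines (hypothetical helper)."""
    adjacency = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        a, b = int(parts[0]), int(parts[1])
        adjacency[a].add(b)
        adjacency[b].add(a)  # Facebook edges are undirected
    return adjacency


# First three lines of the 0.edges output shown above.
sample_lines = ["236 186", "122 285", "24 346"]
adjacency = parse_edges(sample_lines)
print(sorted(adjacency[236]))
```

For the directed Google+ and Twitter files, the second `add` call would be dropped so that only the "a follows b" direction is stored.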

### `nodeId.circles` (**node structure**):

In [7]:
!head data/{SOCIAL_MEDIA}/{SOCIAL_MEDIA}/0.circles

circle0	71	215	54	61	298	229	81	253	193	97	264	29	132	110	163	259	183	334	245	222
circle1	173
circle2	155	99	327	140	116	147	144	150	270
circle3	51	83	237
circle4	125	344	295	257	55	122	223	59	268	280	84	156	258	236	250	239	69
circle5	23
circle6	337	289	93	17	111	52	137	343	192	35	326	310	214	32	115	321	209	312	41	20
circle7	225	46
circle8	282
circle9	336	204	74	206	292	146	154	164	279	73


Each line in this file defines one circle: the first entry is the circle's name and the remaining entries are the IDs of its member nodes. These are the first ten circles of ego network 0; the file lists all of its circles, and there are not many in this case.
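To make the layout concrete, here is a small sketch that maps each circle name to its member IDs. The function `parse_circles` is a hypothetical helper, and the sample lines are three of the circles printed above.

```python
def parse_circles(lines):
    """Map each circle name to the list of its member node IDs (hypothetical helper)."""
    circles = {}
    for line in lines:
        parts = line.split()  # entries are tab-separated; split() handles any whitespace
        if not parts:
            continue  # skip empty lines
        name, members = parts[0], [int(p) for p in parts[1:]]
        circles[name] = members
    return circles


# Three of the circles from the 0.circles output shown above.
sample_lines = ["circle1\t173", "circle3\t51\t83\t237", "circle7\t225\t46"]
circles = parse_circles(sample_lines)
print(circles["circle3"])
```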

### `nodeId.featnames` (**metadata for features**):

In [8]:
!head -n 15 data/{SOCIAL_MEDIA}/{SOCIAL_MEDIA}/0.featnames
!echo -e "\nNumber of features:"
!wc -l data/{SOCIAL_MEDIA}/{SOCIAL_MEDIA}/0.featnames

0 birthday;anonymized feature 0
1 birthday;anonymized feature 1
2 birthday;anonymized feature 2
3 birthday;anonymized feature 3
4 birthday;anonymized feature 4
5 birthday;anonymized feature 5
6 birthday;anonymized feature 6
7 birthday;anonymized feature 7
8 education;classes;id;anonymized feature 8
9 education;classes;id;anonymized feature 9
10 education;classes;id;anonymized feature 10
11 education;classes;id;anonymized feature 11
12 education;classes;id;anonymized feature 12
13 education;concentration;id;anonymized feature 13
14 education;concentration;id;anonymized feature 14

Number of features:
224 data/facebook/facebook/0.featnames


These are the names of the features that describe both the regular nodes and the ego node. Ego network 0 has 224 features (indices 0 to 223).
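The lines follow a regular pattern: an index, then a semicolon-separated category path, then the anonymized name. As a rough sketch, assuming every line has this shape, they can be split and grouped by category as follows (`parse_featname` is a hypothetical helper, and the sample lines are copied from the output above):

```python
from collections import Counter


def parse_featname(line):
    """Split a featnames line into (index, category, anonymized name); hypothetical helper."""
    index_str, description = line.strip().split(" ", 1)
    *category_parts, anonymized = description.split(";")
    return int(index_str), ";".join(category_parts), anonymized


sample_lines = [
    "0 birthday;anonymized feature 0",
    "1 birthday;anonymized feature 1",
    "8 education;classes;id;anonymized feature 8",
]
parsed = [parse_featname(line) for line in sample_lines]

# Count how many feature dimensions belong to each category.
per_category = Counter(category for _, category, _ in parsed)
print(per_category)
```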

### `nodeId.feat` (**node features**):

In [9]:
!head data/{SOCIAL_MEDIA}/{SOCIAL_MEDIA}/0.feat
!echo -e "\nNumbers in a line (a.k.a. number of features):"
!head -n 1 data/{SOCIAL_MEDIA}/{SOCIAL_MEDIA}/0.feat | tr -cd ' \t' | wc -c

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

Each line starts with the node's ID (not a row number), followed by one binary value per feature listed in `0.featnames`. Note that the command above counts separators, which is one less than the number of values on the line, so a line holding the node ID plus 224 features yields 224. The vector is binary (multi-hot) rather than one-hot, since a user can have several features set to 1.
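The sketch below shows one way to split such a line into a node ID and a feature vector, and why counting separators gives one less than the number of values. The helper `parse_feat_line` is hypothetical and the short line is made up for illustration; real lines are much longer.

```python
def parse_feat_line(line):
    """Split a .feat line into (node_id, feature list); hypothetical helper."""
    values = [int(v) for v in line.split()]
    return values[0], values[1:]


# Made-up short line: node ID 7 followed by four binary features.
line = "7 0 1 0 1"
node_id, features = parse_feat_line(line)

n_values = len(line.split())   # number of values on the line
n_separators = line.count(" ")  # what `tr -cd ' \t' | wc -c` would count
print(node_id, features, n_values, n_separators)
```

For a `.egofeat` line, which has no leading ID column, the whole list of values would be the feature vector.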

### `nodeId.egofeat` (**node features**):

In [10]:
!head data/{SOCIAL_MEDIA}/{SOCIAL_MEDIA}/0.egofeat
!echo -e "\nNumbers in a line (a.k.a. number of features):"
!head -n 1 data/{SOCIAL_MEDIA}/{SOCIAL_MEDIA}/0.egofeat | tr -cd ' \t' | wc -c

0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0

Numbers in a line (a.k.a. number of features):
223


The ego node's features follow the same layout, except that the leading node ID column is omitted, since the file holds a single line for a known node. The command counts separators, so the printed 223 corresponds to 224 values, matching the 224 feature names in `0.featnames`.

### Working with the dataset

Since this dataset is quite specific, with its ego networks and their circles, we did not want parsing mistakes to cause errors in our final solution. We examined the data ourselves, but rely on an existing, widely used parser instead of writing our own.

This is why we decided to download and parse the dataset with PyTorch Geometric (`torch_geometric`), using the `SNAPDataset` class from `torch_geometric.datasets` (the source code is available at [this link](https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/datasets/snap_dataset.html)).

Our philosophy is that if something already exists, works correctly, and we couldn't write it better ourselves, then we, as engineers, should use it.

In [11]:
from torch_geometric.datasets import SNAPDataset

facebook_dataset = SNAPDataset(root=".", name="ego-facebook")

Downloading https://snap.stanford.edu/data/facebook.tar.gz
Processing...
100%|██████████| 10/10 [00:00<00:00, 15.21it/s]
Done!


## Analyze dataset

We already took a first, static look at the data a few cells back when we inspected the files themselves. However, we want more than that. Analyzing larger graphs through adjacency matrices alone is nearly impossible, so we need to visualize the graph to analyze it.

Visualizing a graph dataset as large as the ones we are working with can also be challenging. Finding the right package and data format isn't trivial. We chose `pyvis` in our case.

In [12]:
!pip install pyvis -q


In [13]:
import pandas as pd
from pyvis.network import Network

We use the combined .txt file that consists of all the existing connections inside the graph. This enables us to analyze the data without the need to use other complex data structures.

In [14]:
connection_df = pd.read_csv(f"/content/data/{SOCIAL_MEDIA}/{SOCIAL_MEDIA}_combined.txt", delimiter=" ")
len(connection_df)

88233

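The dataframe has 88,233 rows, each representing one friendship. Note that `read_csv` treats the first line of the file (the edge `0 1`) as a header by default, which is why the columns are named `0` and `1` below; counting that line, the file contains 88,234 edges.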

In [15]:
connection_df = connection_df.rename(columns={"0": "person_A", "1": "person_B"})
connection_df.head()

|   | person_A | person_B |
|---|----------|----------|
| 0 | 0        | 2        |
| 1 | 0        | 3        |
| 2 | 0        | 4        |
| 3 | 0        | 5        |
| 4 | 0        | 6        |
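The table starts at `0 2` because the first line of the file (`0 1`) was consumed as the header. A small sketch of the difference, using an in-memory three-line sample instead of the real file (the sample text and variable names are made up for illustration):

```python
import io

import pandas as pd

# Made-up miniature edge list in the same space-separated format.
text = "0 1\n0 2\n0 3\n"

# Default behaviour: the first line becomes the header, so one edge is lost.
with_header = pd.read_csv(io.StringIO(text), sep=" ")

# With header=None and explicit names, every line is kept as data.
without_header = pd.read_csv(
    io.StringIO(text), sep=" ", header=None, names=["person_A", "person_B"]
)

print(len(with_header), list(with_header.columns))
print(len(without_header), list(without_header.columns))
```

Reading the real file this way would also make the later `rename` step unnecessary, since the columns are named at load time.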


In [16]:
net = Network(
    notebook = True,
    cdn_resources = "remote",
    bgcolor = "#111111",
    font_color = "white",
    height = "780px",
    width = "100%",
    select_menu = True,
    filter_menu = True,
)

Since plotting all 88,233 edges in a notebook wouldn't work, we sample 500 edges at random and build the visualization from them and their endpoint nodes.

In [17]:
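from IPython.display import HTML, display

# Sample 500 edges at random and collect their endpoint nodes.
sample = connection_df.sample(n=500)
nodes = list(set([*sample["person_A"], *sample["person_B"]]))
edges = sample.values.tolist()
net.add_nodes(nodes)
net.add_edges(edges)
# Write the interactive graph to an HTML file and render that file inline.
net.show("graph.html", notebook=True)
display(HTML(filename="graph.html"))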

graph.html


With this interactive plot we can see that if we select one of the ego nodes (specific number from the name of the files, e.g. 0 or 107), there is a section that is more dense in connections, centered around the selected ego node.

### Average neighbors

In [None]:
# TODO in this iteration
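
A possible starting point for this step, not the final analysis: for an undirected graph, the average number of neighbors is the mean node degree. The sketch below computes it from a list of edge pairs; `average_degree` is a hypothetical helper and the toy edge list is made up, so it only illustrates the calculation.

```python
from collections import defaultdict


def average_degree(edges):
    """Mean number of distinct neighbors per node in an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    if not neighbors:
        return 0.0
    return sum(len(n) for n in neighbors.values()) / len(neighbors)


# Toy graph: degrees are 2, 2, 3 and 1 for nodes 0, 1, 2 and 3.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
avg = average_degree(toy_edges)
print(avg)
```

On the real data, the edge pairs could come from the two columns of `connection_df`.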

### Distribution of features

In [None]:
# TODO in this iteration

## Clean and prepare data

Since this dataset comes directly from working social media sites, we expect that little or no cleaning will be needed.

In [None]:
# TODO in this iteration

## Evaluation criteria definition

*TODO*

In [None]:
# TODO in this iteration (?)

## Baseline model

*TODO*

## Model definition

*TODO* (incremental model refinement)

## Advanced evaluation

*TODO*

## Containerization

*TODO*

## ML-as-a-Service

Our solution is hosted online with the help of a local Gradio server (https://www.gradio.app/).

*TODO* (hosting the model for prediction)