### Introduction (A lot of work left here)

So far we have seen GCN and DeepWalk, which, given a graph, produce an embedding for each node in it. These embeddings can then be used for other purposes such as link prediction or classification. Let's move on to a slightly different problem: we need embeddings for every node of a graph to which new nodes are continuously being added. One possible way to do this is to rerun the entire model (GCN or DeepWalk) on the new graph, but that is computationally expensive. Today we will cover a method that lets us obtain embeddings for such graphs much more cheaply.

### The start
In the GCN or DeepWalk setting, the graph was fixed beforehand. Take the Zachary karate club: a model was trained on that graph, and we could then predict which part of the club each person joined after the split.
![Zachary Karate Club](img/karate_club.png "Karate Club")

In that problem the nodes of the graph were fixed from the beginning, and all predictions were made on those same nodes. In contrast, take YouTube videos as the nodes, assume there is an edge between related videos, and say we have to classify the videos into categories based on their content. With the same kind of model as before we can classify all the existing videos, but when a new video is added to YouTube, classifying it means retraining the model on the entire updated dataset. This is not feasible, as there are far too many videos to retrain on every time.



To solve this issue, instead of learning an embedding for each node, we can learn a function that, given a node's features and the edges joining it to the rest of the graph, produces the embedding for that node.

### The embedding function

The intuition behind this is to map the nodes to d-dimensional embeddings such that similar nodes in the graph are embedded close together.
![Embedding_function](img/embedding_function.png "Embedding")


//TODO: Adding an animation showing similarity mapping

### Inductive Learning



# Aggregating Neighbours

The idea is to generate embeddings based on the neighbourhood of a given node. In other words, the embedding of a node depends on the embeddings of the nodes it is connected to. In the graph below, for example, nodes 1 and 2 are likely to be more similar than nodes 1 and 5.
![Simple_graph](img/example_graph_1.png "Simple Graph")

Now let's see a simple method for generating embeddings that depend on neighbours.

First we assign random values to the embeddings. Then, at each step, we set each node's embedding to the average of the embeddings of all the nodes it shares an edge with. The following example shows this on a simple linear graph.

![Mean_Embeddings](img/animation.gif "Mean Embeddings")
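The averaging update described above can be sketched in a few lines of Python. This is a minimal sketch, not the code behind the animation: the path graph, the starting values and the `include_self` flag (whether a node's own value takes part in the average, as it does in the formulas later on) are all assumptions made for illustration.

```python
def mean_step(values, neighbours, include_self=True):
    """One synchronous update: every node takes the mean over its neighbourhood."""
    new_values = {}
    for node, nbrs in neighbours.items():
        pool = [values[n] for n in nbrs]
        if include_self:
            # Assumption: the node's own value is part of the average.
            pool.append(values[node])
        new_values[node] = sum(pool) / len(pool)
    return new_values


# Hypothetical linear graph 0 - 1 - 2 - 3 with made-up scalar embeddings.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
values = {0: 0.0, 1: 1.0, 2: 0.0, 3: 1.0}

# Keep every intermediate state so the smoothing can be inspected step by step.
history = [values]
for _ in range(3):
    history.append(mean_step(history[-1], path_graph))

for step, snapshot in enumerate(history):
    print(step, {n: round(v, 3) for n, v in snapshot.items()})
```

All nodes are updated from the previous snapshot rather than in place, so the order in which nodes are visited does not change the result.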

This is a very simple idea, which can be generalized by representing it in the following way:
![Simple_Neighbours](img/simple_neighbours.png "Simple Neighbours")

Here the black box joining A with B, C and D represents some function of A, B, C and D (in the animation above it was the mean). We can replace this box with any other function, such as sum or max. This function is known as the aggregator function.
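To make the "black box" concrete, here is a small sketch in which the aggregator is just a parameter that can be swapped. The node values are made up for illustration and are not the ones used in the figures; `aggregate_step` is a hypothetical helper name.

```python
def mean(xs):
    return sum(xs) / len(xs)


def aggregate_step(values, neighbours, aggregator):
    """Apply the aggregator to each node's own value together with its neighbours' values."""
    return {
        node: aggregator([values[node]] + [values[n] for n in nbrs])
        for node, nbrs in neighbours.items()
    }


# Node A joined to B, C and D, as in the diagram; the numbers are made up.
star = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
values = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 6.0}

# The same update with three different aggregator functions.
results = {
    name: aggregate_step(values, star, fn)["A"]
    for name, fn in [("mean", mean), ("sum", sum), ("max", max)]
}
print(results)
```

Only the function passed as `aggregator` changes between the three runs; the graph traversal stays the same.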

Now let's make this more general by using not only the neighbours of a node but also the neighbours of its neighbours. The first question is how to make use of them. The approach we will take is to first generate each node's embedding from its immediate neighbours, just as we did above, and then in a second step use these new embeddings to generate the next ones. Take a look at the following example:

![One_Layer_Aggregation](img/aggregation_1.png "Aggregation Demo")

The numbers written next to the nodes are the values of the embeddings at time T=0.

The values of the embeddings after one step are computed as follows:

![Animation_aggregation_layer_1](img/animation_2.gif "Aggregation Layer 1")

So after one iteration the values are as follows:

![Second_Layer_Aggregation](img/aggregation_2.png "Aggregation After One Layer")

Repeating the same procedure on this new graph we get (try verifying it yourself):

![Third_Layer_Aggregation](img/aggregation_3.png "Aggregation After Two Layer")

Let's analyze this aggregation. Denote by $A^{(0)}$ the initial value of the embedding of A (i.e. 0.1), by $A^{(1)}$ its value after one layer (i.e. 0.25), and similarly define $A^{(2)}$, $B^{(0)}$, $B^{(1)}$ and all the other values.

Clearly 

$$A^{(1)} = \frac{(A^{(0)} + B^{(0)} + C^{(0)} + D^{(0)})}{4}$$

Similarly

$$A^{(2)} = \frac{(A^{(1)} + B^{(1)} + C^{(1)} + D^{(1)})}{4}$$

Writing all the values on the RHS in terms of the initial embeddings we get

$$A^{(2)} = \frac{\frac{(A^{(0)} + B^{(0)} + C^{(0)} + D^{(0)})}{4} + \frac{A^{(0)}+B^{(0)}+C^{(0)}}{3} + \frac{A^{(0)}+B^{(0)}+C^{(0)}+E^{(0)} +F^{(0)}}{5} + \frac{A^{(0)}+D^{(0)}}{2}}{4}$$

If you look closely you will see that all the nodes that are either neighbours of A or neighbours of neighbours of A are present in this term. This is equivalent to saying that every node within a distance of 2 edges from A influences this term. Had there been a node G connected only to node F, it would be at a distance of 3 from A and hence would not influence this term.

Generalizing this, we can say that if we repeat this procedure N times, then all the nodes (and only those nodes) that are within a distance N of a given node will influence its value.
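We can check this claim with a small sketch that tracks, for every node, how much each initial embedding contributes to its current value. The edges around A, B, C and D are read off the expression for $A^{(2)}$ above; the remaining neighbours of E and F, and the extra node G attached only to F, are assumptions added for illustration.

```python
def mean_step_symbolic(coeffs, neighbours):
    """One mean-aggregation step on coefficient dictionaries.

    coeffs[v][s] is the weight of the initial embedding of s in the current value of v.
    """
    new_coeffs = {}
    for node, nbrs in neighbours.items():
        pool = [node] + nbrs  # the node itself plus its neighbours
        combined = {}
        for member in pool:
            for source, weight in coeffs[member].items():
                combined[source] = combined.get(source, 0.0) + weight / len(pool)
        new_coeffs[node] = combined
    return new_coeffs


graph = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B", "E", "F"],
    "D": ["A"],
    "E": ["C"],        # assumption: only the edge to C is modelled
    "F": ["C", "G"],   # assumption: hypothetical node G hangs off F
    "G": ["F"],
}

# At step 0 every node depends only on its own initial embedding.
step0 = {node: {node: 1.0} for node in graph}
step1 = mean_step_symbolic(step0, graph)
step2 = mean_step_symbolic(step1, graph)
step3 = mean_step_symbolic(step2, graph)

print("influences A after 2 steps:", sorted(step2["A"]))
print("influences A after 3 steps:", sorted(step3["A"]))
```

After two steps G has no coefficient in A's value, and it appears only once a third step is taken, which matches its distance of 3 from A.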

If we replace the mean with some other function, say $F$, then we can write

$$A^{(1)} = F(A^{(0)} , B^{(0)} , C^{(0)} , D^{(0)})$$

Or more generally

$$A^{(k)} = F(A^{(k-1)} , B^{(k-1)} , C^{(k-1)} , D^{(k-1)})$$

If we denote by $N(v)$ the set of neighbours of $v$, so $N(A)=\{B, C, D\}$ and $N(A)^{(k)}=\{B^{(k)}, C^{(k)}, D^{(k)}\}$, the above equation can be simplified as

$$A^{(k)} = F(A^{(k-1)}, N(A)^{(k-1)} )$$

This process can be visualized as:

![Sampling](img/showing_1.png "Showing one")

Generalizing further, we can replace the single function $F$ with a different function for each layer, $F1$ in the first layer, $F2$ in the second and so on, after fixing the number of layers we want, say $k$.

So our embeddings will now be computed like this:


![Sampling_2](img/showing_2.png "Showing one")

There might be some doubts at the moment, so let's work through a detailed example illustrating how this type of model works.




Let's formalize our notation a bit now, so that things are easier to follow.

1. Instead of writing $A^{(k)}$ we will write $h_{A}^{k}$.
2. Rename the functions $F1$, $F2$ and so on to $AGGREGATE_{1}$, $AGGREGATE_{2}$ and so on, i.e. $Fk$ becomes $AGGREGATE_{k}$.
3. There will be a total of $K$ aggregation functions.
4. Let our graph be represented by $G(V,E)$, where $V$ is the set of vertices and $E$ is the set of edges.

## What does GraphSAGE propose?

What we have been doing so far can be written as

Initialise($h_{v}^{0}$) $\forall v \in V$ <br>
for $k=1..K$ do <br>
&nbsp;&nbsp;&nbsp;&nbsp;for $v\in V$ do<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$h_{v}^{k}=AGGREGATE_{k}(h_{v}^{k-1}, \{h_{u}^{k-1} \forall u \in N(v)\})$

After the final iteration, $h_{v}^{K}$ contains the embedding of node $v$.
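The loop above translates almost line by line into Python. The sketch below assumes scalar embeddings and two hand-picked aggregators (mean in the first layer, max in the second); the graph, the initial values and the helper names are hypothetical.

```python
def mean_aggregate(own, neighbour_values):
    pool = [own] + neighbour_values
    return sum(pool) / len(pool)


def max_aggregate(own, neighbour_values):
    return max([own] + neighbour_values)


def embed(graph, h0, aggregators):
    """Run K aggregation rounds, where K is the number of aggregators supplied."""
    h = dict(h0)  # h^0: the initial embeddings
    for aggregate_k in aggregators:  # k = 1..K
        # Build h^k from h^(k-1) only, so every node sees the previous layer's values.
        h = {
            v: aggregate_k(h[v], [h[u] for u in nbrs])
            for v, nbrs in graph.items()
        }
    return h


# Hypothetical three-node graph with made-up initial embeddings.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
h0 = {"A": 1.0, "B": 2.0, "C": 6.0}

final = embed(graph, h0, [mean_aggregate, max_aggregate])
print(final)
```

Passing a list of aggregators mirrors the idea of having one $AGGREGATE_{k}$ per layer; supplying the same function K times recovers the single-function case.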

### Some issues with this

Take a look at the sample graph that we discussed above. Even though the initial embeddings of $E$ and $F$ were different, they ended up with exactly the same embedding because their neighbours were the same. This is not a good thing, as there should be at least some difference between their embeddings.

GraphSAGE proposes an interesting idea to deal with this. Rather than passing the node and its neighbours into the same aggregating function, we pass only the neighbours into the aggregating function and then concatenate the resulting vector with the node's own vector. This can be written as:

$h_{v}^{k}=CONCAT(h_{v}^{k-1},AGGREGATE_{k}( \{h_{u}^{k-1} \forall u \in N(v)\}))$

In this way two nodes with different previous embeddings no longer collapse to exactly the same embedding.
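The effect of the concatenation can be seen on a tiny example. The sketch below assumes a hypothetical triangle of nodes C, E and F with made-up one-dimensional embeddings, and compares pooling a node together with its neighbours against concatenating the node's own vector with the mean of its neighbours.

```python
def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Assumption: E and F are connected to each other and to C.
graph = {"C": ["E", "F"], "E": ["C", "F"], "F": ["C", "E"]}
h0 = {"C": [0.5], "E": [0.25], "F": [0.75]}

# Variant 1: the node and its neighbours go through the same aggregator.
pooled = {
    v: mean_vec([h0[v]] + [h0[u] for u in nbrs])
    for v, nbrs in graph.items()
}

# Variant 2: aggregate the neighbours only, then concatenate with the node's own vector.
concatenated = {
    v: h0[v] + mean_vec([h0[u] for u in nbrs])
    for v, nbrs in graph.items()
}

print("pooled:      ", pooled["E"], pooled["F"])
print("concatenated:", concatenated["E"], concatenated["F"])
```

With pooling, E and F end up identical because they average over the same set of nodes. With concatenation their own starting values are kept in the first component, so they stay distinct.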

Let's now add some non-linearity to make it more expressive, so it becomes

$h_{v}^{k}=\sigma[W^{(k)} \cdot CONCAT(h_{v}^{k-1},AGGREGATE_{k}( \{h_{u}^{k-1} \forall u \in N(v)\}))]$

where $\sigma$ is some non-linear function (e.g. ReLU, sigmoid, etc.) and $W^{(k)}$ is a weight matrix; each layer has one such matrix.

One more thing we will add is to normalize the value of $h$ after each iteration, i.e. divide each embedding by its L2 norm. Our complete algorithm then becomes:

![GraphSAGE_Algorithm](img/graphsage_algorithm.png "GraphSAGE")
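Putting the pieces together, a forward pass could be sketched as below. This is only an illustration under several assumptions: the aggregator is the mean over all neighbours, $\sigma$ is ReLU, the weight matrices are random and untrained, and the graph, features and layer sizes are made up. It is not a reference implementation of GraphSAGE, and training is not covered.

```python
import math
import random


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


def l2_normalise(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else vector


def graphsage_forward(graph, features, weights):
    """One pass over all layers; weights[k] maps CONCAT(h_v, aggregated) to the next h_v."""
    h = dict(features)
    for W in weights:
        new_h = {}
        for v, nbrs in graph.items():
            aggregated = mean_vec([h[u] for u in nbrs])  # neighbours only
            combined = h[v] + aggregated                 # CONCAT as list concatenation
            new_h[v] = l2_normalise(relu(matvec(W, combined)))
        h = new_h
    return h


rng = random.Random(0)


def random_matrix(rows, cols):
    # Untrained placeholder weights; in practice these would be learned.
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical graph in which every node has at least one neighbour.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0], "D": [0.5, 0.5]}

# Layer 1: concat size 2 + 2 = 4 -> 3. Layer 2: concat size 3 + 3 = 6 -> 2.
weights = [random_matrix(3, 4), random_matrix(2, 6)]

embeddings = graphsage_forward(graph, features, weights)
for node, vector in embeddings.items():
    print(node, [round(x, 3) for x in vector])
```

Because the weights are functions of the layer and not of any particular node, the same `graphsage_forward` can be applied to a graph that contains nodes not seen before, which is the property we wanted from the start.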



## What are the 'AGGREGATE' functions?

So far we have taken the AGGREGATE functions to be fixed functions like 'mean', 'sum' or 'max'. But if we use arbitrary functions and run them only a fixed number of times, how can we say that the embeddings will converge? We will tackle these questions shortly.

The task to be solved on a given graph might be supervised, semi-supervised or unsupervised. We will discuss the details of aggregation for each of them one by one.

1. **SEMI-SUPERVISED** Take the example of the Zachary karate club. Here we are given information about two nodes, the instructor and the owner: they belong to two different groups. Based on this, we have to assign each of the other nodes to one of the two categories.
