### Recap
- how to best describe the feature
- can we get away without manual feature engineering
- automatically learn the features - Representation Learning
- no manual feature engineering is needed
- Goal: efficient task-independent feature learning for ML with graphs
- learn the mapping automatically in the form of $f: u \rightarrow \mathbb{R}^{d}$, where $u$ is a node
- the resulting vector is called the __feature representation__ or __embedding__

### Why embedding
- task is to map nodes into an embedding space
- similarity between embeddings should reflect similarity between nodes in the network
- if two nodes are close in the network, they should be close in the embedding space as well
- the embeddings automatically encode network structure information
- they can be used for different types of downstream tasks
  - classify nodes
  - predict links
  - classify graphs
  - anomalous node detection
  - clustering
- in 2014, the DeepWalk paper learned social representations (embeddings) of the karate club network

### Encoder and Decoder
- how to formulate this as a task using encoder and decoder
- represent graph as adjacency matrix
  - no node features or extra information are assumed
  - entries are binary
  - for simplicity, assume the graph is undirected
- goal is to encode nodes in the form of embeddings
  - some notion of similarity in the network is approximated by similarity in the embedding space
    - nodes become points in a coordinate system (the embedding space)
    - the dot product of two node embeddings in this coordinate system measures their similarity in the embedding space
  - goal is to define similarity in the original network and to map nodes into the embedding space so that this similarity is preserved
    - the dot product of two vectors is the product of their lengths and the cosine of the angle between them
    - if two vectors point in the same or a similar direction, they have a high dot product
    - if two vectors are orthogonal, their dot product is zero, which represents dissimilarity
    > similarity(u,v) $\approx z_{v}^{T}z_{u}$

<img src="./images/03_encoderEmbedding.png" width=400 height=400>  
$\tiny{\text{YouTube-Stanford-CS224W-Jure Leskovec}}$
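
The dot-product decoder can be illustrated with a minimal sketch. The 2-D vectors below are hand-picked, hypothetical embeddings (not learned from any graph), and the helper names `decoder` and `cosine` are assumptions made for this example only.

```python
import numpy as np

# Hypothetical 2-D embeddings, chosen by hand purely for illustration.
z_u = np.array([1.0, 0.0])
z_v = np.array([0.8, 0.6])   # points in a direction close to z_u
z_w = np.array([0.0, 1.0])   # orthogonal to z_u

def decoder(z_a, z_b):
    """Dot-product decoder: similarity score z_a^T z_b."""
    return float(z_a @ z_b)

def cosine(z_a, z_b):
    """Cosine of the angle between two vectors (dot product divided by the norms)."""
    return decoder(z_a, z_b) / (np.linalg.norm(z_a) * np.linalg.norm(z_b))

sim_uv = decoder(z_u, z_v)   # similar direction: large score
sim_uw = decoder(z_u, z_w)   # orthogonal: zero score
print(sim_uv, sim_uw, cosine(z_u, z_v))
```

Because the dot product also depends on vector length, it equals the cosine of the angle only when both vectors have unit norm, as they do here.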

### Learning node embeddings
- Encoder maps from node to embedding
- Define node similarity function 
- Decoder maps from embeddings to similarity score
- Optimize the parameters of the encoder so that similarity in the original network is approximated by similarity in the embedding space
  - the decoder, on the right-hand side of this approximation, is simply the dot product
- Encoder
  - maps each node to low-dimensional vector
  - $\text{ENC}(v) = z_{v}$, where $z_{v} \in \mathbb{R}^{d}$ is the $d$-dimensional embedding of node $v$
  - d is generally between 64 and 1000
  - d also depends on size of network and other factors
- Similarity function specifies how relationships in the embedding space map to relationships in the original network

#### Shallow Encoder 
- Encoder is just an embedding lookup, which is why it is called shallow
- the encoding of a given node is simply a vector of numbers, looked up as one column of a big matrix
> $\text{ENC}(v) = z_{v} = Z \cdot v$
- goal is to learn/estimate the matrix $Z \in \mathbb{R}^{d \times |V|}$
  - $Z$ has $d$ rows (the embedding dimension) and $|V|$ columns (the number of nodes)
  - in this matrix, each column is a node embedding
  - $v \in \mathbb{I}^{|V|}$ is an indicator vector: all zeros except a one at the position of node $v$
  - this method is not very scalable; it works up to, say, a million nodes
  - for every node we have to estimate $d$ parameters
- Methods used: DeepWalk, node2vec
  
<img src="./images/03_encoderEmbeddingMatrix.png" width=400 height=400> 
$\tiny{\text{YouTube-Stanford-CS224W-Jure Leskovec}}$
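
The lookup $z_{v} = Z \cdot v$ can be sketched in NumPy as follows. The sizes, the random initialization, and the helper names `one_hot` and `encode` are assumptions for illustration; in practice $Z$ would be learned rather than sampled.

```python
import numpy as np

d, num_nodes = 4, 5                      # toy sizes, assumed for illustration
rng = np.random.default_rng(0)
Z = rng.normal(size=(d, num_nodes))      # one column per node; these are the parameters to learn

def one_hot(node, n):
    """Indicator vector: all zeros except a one at the node's index."""
    v = np.zeros(n)
    v[node] = 1.0
    return v

def encode(node):
    """Shallow encoder: ENC(v) = Z . v, which selects column `node` of Z."""
    return Z @ one_hot(node, num_nodes)

z_2 = encode(2)
print(np.allclose(z_2, Z[:, 2]))         # the product is the same as reading column 2
print(Z.size)                            # d * num_nodes parameters in total
```

Multiplying by the indicator vector is only the mathematical description; an implementation would normally index the column directly (`Z[:, node]`), which avoids building the one-hot vector.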

### Framework summary
- Shallow encoder
- Parameters to optimize: the embedding matrix $Z$
- Deep encoders (GNNs) are another variation
  - they do not use a simple embedding lookup
- Decoder 
  - is very simple
  - it is based on node similarity, computed as the dot product
  - the objective is to maximize the dot product $z_{v}^{T}z_{u}$ for node pairs $(u,v)$ that are similar
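
The encoder/decoder framework can be tried end to end on a toy graph. The sketch below assumes, purely for illustration, that two nodes count as similar exactly when they share an edge, and it fits $Z$ by plain gradient descent on a squared-error loss. The notes define similarity through random walks instead, so the graph, the loss, the learning rate, and the step count here are all assumptions, not the method used by DeepWalk or node2vec.

```python
import numpy as np

# Hypothetical toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n, d = 6, 2

# Binary, undirected adjacency matrix, used here as the similarity target (an assumption).
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

rng = np.random.default_rng(0)
Z = 0.1 * rng.normal(size=(d, n))        # shallow encoder parameters, one column per node

def residual(Z):
    """Decoder scores minus targets for all pairs u != v (diagonal ignored)."""
    E = Z.T @ Z - A                      # Z.T @ Z holds every pairwise dot product
    np.fill_diagonal(E, 0.0)
    return E

def loss(Z):
    """Sum over u != v of (z_u^T z_v - A_uv)^2."""
    return float((residual(Z) ** 2).sum())

initial_loss = loss(Z)
lr = 0.01                                # assumed step size
for _ in range(500):
    grad = 4.0 * Z @ residual(Z)         # gradient of the loss with respect to Z
    Z -= lr * grad
final_loss = loss(Z)

S = Z.T @ Z
print(initial_loss, final_loss)
print(S[0, 1], S[0, 5])                  # same-triangle pair vs. cross-triangle pair
```

After training, nodes in the same triangle should score higher with each other than with nodes in the other triangle, which is the sense in which the embedding preserves network similarity.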

### How to define node similarity
- we will define similarity based on random walks
- then optimize the embeddings for such a similarity measure

### Node embedding
- learning node embeddings this way is called unsupervised/self-supervised
- the node labels are not used
- the node features/attributes are not used or captured
- the goal is only to capture some notion of network similarity
- for example, if the nodes are people, attributes attached to a node such as location, gender, or age are not needed or captured
- goal is to directly estimate a set of coordinates for each node, so that some aspect of network structure is preserved
- in this sense, these embeddings are task independent: they are not trained on a given prediction task, a node labeling, or a given subset of links, only on the network itself