# Lab: Link Analysis
Data Mining 2021/2022  
Danny Plenge and Gosia Migut  
Revised by Bianca Cosma

**WHAT** This *optional* lab consists of several programming exercises and insight questions. 
These exercises are meant to let you practice with the theory covered in [Chapter 5][1] of "Mining of Massive Datasets" by J. Leskovec, A. Rajaraman, J. D. Ullman.

**WHY** Practicing, both through programming and answering the insight questions, aims at deepening your knowledge and preparing you for the exam.

**HOW** Follow the exercises in this notebook either on your own or with a friend. Use [StackOverflow][2]
to discuss the questions with your peers. For additional questions and feedback please consult the TAs during the assigned lab session. The answers to these exercises will not be provided.

[1]: http://infolab.stanford.edu/~ullman/mmds/ch5.pdf
[2]: https://stackoverflow.com/c/tud-cs/questions

#### Summary

You will develop an algorithm which will let you rank Internet pages (and other objects) based on their relative importance.

## Exercise 1: PageRank

PageRank, named after both _web pages_ and Google co-founder Larry Page, was designed to combat the growing number of term spammers. In this exercise we will look at the algorithm and some of its adaptations. In the end, we will use PageRank to compute which airports in the US are most important.

We will start this exercise with a small network, simulating the entire Internet with a few sites. Then we will simulate what a random surfer would do on this network and where it is most likely to end up.

### Step 1: Investigate the data

Investigate the data of transitions from one vertex to the other in the example below. The data is of the form:

```
source|destination|weight
```

In this case, all weights are set to 1, meaning that all transitions are equally likely to happen.

**`example` data:**

```
A|C|1  
A|D|1  
B|A|1  
B|D|1  
C|A|1   
D|B|1  
D|C|1  
```
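
As an illustration of the format above, here is a minimal sketch of how a single line could be split into its three fields. The helper name `parse_edge` is hypothetical and not part of the lab template; the sample line is made up.

```python
def parse_edge(line):
    """Split one 'source|destination|weight' line into typed fields."""
    # strip() removes the newline and any trailing spaces before splitting.
    source, destination, weight = line.strip().split("|")
    return source, destination, int(weight)

# Made-up sample line, not taken from the lab data.
print(parse_edge("X|Y|3  "))
```

Converting the weight to an integer here is a design choice for the sketch; it keeps later arithmetic on the weights straightforward.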

$\textbf{Question 1}$: Draw the directed graph based on this data.

$\textbf{Question 2}$: Write out the transition matrix for this network. Verify that all columns sum up to 1.

$\textbf{Question 3}$: If we initialize a random surfer at a random location, what are the chances for this random surfer to be at a certain location after one iteration? Manually calculate the probabilities for all locations.

### Step 2: Parse the data

Complete the `import_data` function and use it to import the data from the given example. Print the data object to see how the data is stored.

In [None]:
from collections import OrderedDict

# This path might be different on your local machine.
example = 'data/example.txt'

def import_data(example): 
    """
    This function loads the given datasets in an OrderedDict Object and
    can be used for the next steps in this assignment.
    :param example: The input file containing the (example) data.
    :return: An OrderedDict containing an OrderedDict for each data point.
    """
    
    # Extract data.
    with open(example) as f:
        lines = [line.rstrip('\n') for line in f]
    
    # Initialize data structure.
    data = OrderedDict()
    for l in lines:
        line = l.split("|")
        data[line[0]] = OrderedDict()
    
    # START ANSWER 
    # END ANSWER
    
    return data
 

data = import_data(example)
# Check that your code works on the example.
assert data == OrderedDict([('A', OrderedDict([('A', 0), ('B', 0), ('C', 1), ('D', 1)])),
                            ('B', OrderedDict([('A', 1), ('B', 0), ('C', 0), ('D', 1)])),
                            ('C', OrderedDict([('A', 1), ('B', 0), ('C', 0), ('D', 0)])),
                            ('D', OrderedDict([('A', 0), ('B', 1), ('C', 1), ('D', 0)]))])
data

### Step 3: Implement `construct_transition_matrix`

Next, construct the transition matrix by completing the function `construct_transition_matrix`.
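
To make the idea of a column-stochastic matrix concrete, the sketch below builds one for a separate toy graph (deliberately not the lab data). The names `nodes`, `edges`, `index`, `weights` and `toy_matrix` are assumptions made for this illustration, and the sketch assumes every node has at least one outgoing edge.

```python
import numpy as np

# Hypothetical toy graph: X -> Y, X -> Z, Y -> Z, Z -> X (all weights 1).
nodes = ["X", "Y", "Z"]
edges = [("X", "Y", 1), ("X", "Z", 1), ("Y", "Z", 1), ("Z", "X", 1)]

index = {name: i for i, name in enumerate(nodes)}
weights = np.zeros((len(nodes), len(nodes)))
for source, destination, weight in edges:
    # Column = source vertex, row = destination vertex.
    weights[index[destination], index[source]] = weight

# Divide every column by its total outgoing weight so each column sums to 1.
toy_matrix = weights / weights.sum(axis=0)
print(toy_matrix)
```

Each column then describes where a surfer standing on that column's vertex goes next, which is why the columns (not the rows) must sum to 1.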

In [None]:
import numpy as np

def construct_transition_matrix(data):
    """
    This function returns a transition_matrix based on the given data.
    Note: you can convert an OrderedDict object to a list of (key, value) tuples with OrderedDict_Object.items().
    :param data: The OrderedDict containing the input data.
    :return: A two-dimensional array representing the transition matrix.
    """
    matrix = np.zeros((len(data), len(data)))
    
    # START ANSWER
    # END ANSWER           
    return matrix

trans_matrix = construct_transition_matrix(data)
# Check that all columns of the matrix sum up to 1.
column_sums = np.sum(trans_matrix, axis=0)
assert np.all(np.isclose(column_sums, 1.0))
trans_matrix

$\textbf{Question 4}$: Is the output matrix from the function `construct_transition_matrix` the same as the matrix you calculated in Question 2?

### Step 4: Implement `get_random_surfer`

Finish the `get_random_surfer` function, which should create a column vector of length equal to the number of vertices in the data. Each element should represent an equal probability, and the vector elements should sum up to 1. In other words, it should construct the following vector:

<center>$v = \begin{bmatrix}\dfrac{1}{n} \\ \dfrac{1}{n} \\ \vdots \\ \dfrac{1}{n}\end{bmatrix}$</center>  
  
Where $n$ is the number of vertices in the data, and $dim(v) = n$.

In [None]:
def get_random_surfer(data):
    """
    This function returns a column vector of length equal to the number of vertices in the given data.
    :param data: The OrderedDict containing the input data.
    :return: An array where each value has the same probability summing up to 1.
    """
    result = np.zeros((len(data), 1))
    
    # START ANSWER
    # END ANSWER   
    
    return result

random_surfer = get_random_surfer(data)
# Check that all probabilities are equal and sum up to 1.
assert np.all(np.isclose(random_surfer, random_surfer[0])) and np.isclose(np.sum(random_surfer), 1.0)

random_surfer

### Step 5: Implement `calculate_page_rank`

Now complete the `calculate_page_rank` function. This function should take the given transition matrix, get a random surfer vector, and repeatedly multiply the matrix with the vector for a number of iterations. The iterative step is:  

<center>$v' = Mv$</center>  

Where $M$ is the transition matrix.
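
The iterative step can be sketched on a small toy matrix that is unrelated to the lab data. The names `toy_matrix` and `toy_v`, the graph itself (X -> Y, X -> Z, Y -> Z, Z -> X) and the choice of 50 iterations are assumptions made purely for illustration.

```python
import numpy as np

# Column-stochastic matrix of a hypothetical graph: X -> Y, X -> Z, Y -> Z, Z -> X.
toy_matrix = np.array([[0.0, 0.0, 1.0],
                       [0.5, 0.0, 0.0],
                       [0.5, 1.0, 0.0]])

# Start with the surfer equally likely to be at any vertex.
toy_v = np.full((3, 1), 1.0 / 3.0)

# Apply v' = Mv repeatedly.
for _ in range(50):
    toy_v = toy_matrix @ toy_v

print(toy_v.ravel())
```

Because every column of the matrix sums to 1, the entries of the vector keep summing to 1 after every step; for this particular toy graph the vector settles close to (0.4, 0.2, 0.4).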

Run the `calculate_page_rank` function on the example dataset with 10 iterations. Verify that the result is approximately as follows:  

<center>$v_{10} = \begin{bmatrix}A \\ B \\ C \\ D\end{bmatrix} = \begin{bmatrix}0.354 \\ 0.119 \\ 0.294 \\ 0.233\end{bmatrix}$</center>

In [None]:
def calculate_page_rank(data, trans_matrix, iterations):
    """
    This function calculates the page rank based on the given data,
    a given transition matrix (trans_matrix) and a given amount of iterations.
    :param data: The OrderedDict containing the input data.
    :param trans_matrix: The transition matrix.
    :param iterations: The amount of iterations.
    :return: A dictionary containing the PageRank for each data item.
    """
    
    # Initialize result.
    result = dict()
    
    # START ANSWER    
    # END ANSWER   

    return result

page_ranks = calculate_page_rank(data, trans_matrix, 10)
# Check that the page ranks are approximately equal to the given values.
assert np.isclose(page_ranks['A'], 0.354, atol=0.001) and np.isclose(page_ranks['B'], 0.119, atol=0.001) \
        and np.isclose(page_ranks['C'], 0.294, atol=0.001) and np.isclose(page_ranks['D'], 0.233, atol=0.001)

page_ranks

### Step 6: Find the PageRank for given data

Now run the `calculate_page_rank` function on the `data/example2.txt` dataset with 10 iterations.   
  
**`example2` data:**  
```
A|C|1  
A|D|1  
B|A|1  
B|D|1  
C|C|1   
D|B|1  
D|C|1  
```

As you can see, this dataset is slightly different. The edge from C to A is replaced by an edge from C to C itself.

In [None]:
# This path might be different on your local machine.
example2 = 'data/example2.txt'

data2 = None
trans_matrix2 = None
# START ANSWER
# END ANSWER
new_page_rank = calculate_page_rank(data2, trans_matrix2, 10)
# Check that the page rank of C changes accordingly.
assert np.isclose(new_page_rank['C'], 1.0, atol = 0.05)

new_page_rank

$\textbf{Question 5}$: Explain the results you now get from the PageRank algorithm.

### Step 7: Add taxation to PageRank

In order to make sure nodes like C in `example2` do not corrupt our results, we can use taxation to allow the random surfer to randomly jump from one page to another. This comes down to changing our iterative step to:

<center>$v' = \beta Mv + \dfrac{(1 - \beta)e}{n}$</center>  

Where $e$ is a vector of all ones, $n$ is the number of vertices in the data and $\beta$ is a constant between 0 and 1: the probability that the surfer follows a link instead of jumping to a random page.  
Implement the function `taxation_page_rank` which calculates this modified PageRank value using the iterative step. You may set $\beta$ to 0.8.
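
To see the effect of the extra term, the sketch below compares the plain and the taxed iteration on a hypothetical two-node graph (X -> Y, Y -> Y), which is not part of the lab data. The names `trap_matrix`, `plain` and `taxed` and the choice of 20 iterations are assumptions for this illustration.

```python
import numpy as np

# Hypothetical two-node graph: X -> Y and Y -> Y (Y only links to itself).
trap_matrix = np.array([[0.0, 0.0],
                        [1.0, 1.0]])
n = trap_matrix.shape[0]
beta = 0.8
e = np.ones((n, 1))

plain = np.full((n, 1), 1.0 / n)
taxed = np.full((n, 1), 1.0 / n)
for _ in range(20):
    # Plain step: v' = Mv.
    plain = trap_matrix @ plain
    # Taxed step: v' = beta * Mv + (1 - beta) * e / n.
    taxed = beta * (trap_matrix @ taxed) + (1 - beta) * e / n

print("plain:", plain.ravel())
print("taxed:", taxed.ravel())
```

In this toy case the plain iteration hands all probability to Y, while the taxed iteration leaves X with a share of $(1 - \beta)/n = 0.1$.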

In [None]:
def taxation_page_rank(data, trans_matrix, beta, iterations):
    """
    This function calculates the page rank using taxation based on the initial data
    of import_data, a given transition matrix (trans_matrix), a given beta for the
    taxation and a given amount of iterations.
    :param data: The OrderedDict containing the input data.
    :param trans_matrix: The transition matrix.
    :param beta: The probability of following a link (1 - beta is the probability of a random jump).
    :param iterations: The amount of iterations.
    :return: A dictionary containing the PageRank for each data item.
    """
    
    # Initialize result.
    result = dict()
    
    # START ANSWER    
    # END ANSWER   
    return result

taxed_page_rank = taxation_page_rank(data2, trans_matrix2, 0.8, 10)
# Check that taxation lowers the page rank of C.
assert taxed_page_rank['C'] < new_page_rank['C']

taxed_page_rank

$\textbf{Question 6}$: Are the results better using the `taxation_page_rank` function? What happens if we lower the beta? What happens if we increase the beta?

### Step 8: Use PageRank on the airport network

Check out the `data/flight_data.txt` file.  

First 10 rows of `flight_data`:

```
Cincinnati, OH|Omaha, NE|1
Cincinnati, OH|Los Angeles, CA|56
Cincinnati, OH|Milwaukee, WI|26
Cincinnati, OH|Charlotte, NC|123
Cincinnati, OH|Raleigh/Durham, NC|50
Cincinnati, OH|Nashville, TN|50
Cincinnati, OH|Chicago, IL|353
Cincinnati, OH|Fort Myers, FL|34
Cincinnati, OH|Orlando, FL|87
Cincinnati, OH|San Francisco, CA|25
```

This file contains information regarding airports in the US and flights between them. Each line represents a connection from one airport to another with the weight equal to the number of flights in January 2013. Run `taxation_page_rank` with $\beta = 0.8$ on this dataset for 10 iterations.
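
Unlike the earlier examples, the weights here are not all 1, so transitions out of an airport are no longer equally likely. A minimal sketch of how flight counts could be turned into transition probabilities is shown below; the airports `Q` and `R` and their counts are invented for illustration and are not taken from `flight_data.txt`.

```python
import numpy as np

# Hypothetical flight counts out of a single airport.
outgoing_flights = {"Q": 30, "R": 10}

counts = np.array(list(outgoing_flights.values()), dtype=float)
# Each destination gets a share proportional to its number of flights.
probabilities = counts / counts.sum()
toy_transition = dict(zip(outgoing_flights, probabilities))

print(toy_transition)
```

Under this reading, a destination with three times as many flights is three times as likely to be the surfer's next stop.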

In [None]:
from itertools import islice

# This path might be different on your local machine.
example3 = 'data/flight_data.txt'

data3 = None
trans_matrix3 = None
# START ANSWER
# END ANSWER

flights_page_rank = taxation_page_rank(data3, trans_matrix3, 0.8, 10)
expected_result = ['Pago Pago, TT', 'Rockford, IL', 'Trenton, NJ', 'Staunton, VA', 'North Bend/Coos Bay, OR']
# The five airports with the lowest page rank, in ascending order.
result = list(islice(sorted(flights_page_rank, key=flights_page_rank.get), 5))
assert result == expected_result

$\textbf{Question 7}$: What is the most important airport according to the results?

In [None]:
# START ANSWER
# END ANSWER