# Homework 8 – Data Science, Conclusion

## History of Data Science, Winter 2022

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import networkx as nx

## Question 1: PageRank

In this question, you'll implement the PageRank algorithm for a general network of webpages.

The matrix $A$ that we developed in lecture is known as an **adjacency matrix**. An adjacency matrix contains a description of all of the edges (links) between nodes (webpages) in a network, along with weights for each edge. The adjacency matrix from Lecture 8 is given below.

In [None]:
# Run this cell.
A = np.array([[0, 1/2, 1/2, 1/3],
              [1, 0, 0, 1/3],
              [0, 0, 0, 1/3],
              [0, 1/2, 1/2, 0]])
A

Note that throughout this assignment, unlike in lecture, we will number our webpages 0, 1, 2, 3, and so on.

To interpret the numbers in $A$:
- Column 0 of $A$ describes the movement out of Page 0. The only page that Page 0 links to is Page 1, so that link has a weight of 1. As a result, the element in row 1 and column 0 of $A$ is equal to 1.
- Page 1 links to Page 0 and Page 3, so a weight of 1/2 is assigned to each link. As a result, the element in row 0 and column 1 is equal to 1/2, and the element in row 3 and column 1 is equal to 1/2.
- The same interpretations hold true for Page 2 and Page 3. 

Each column of an adjacency matrix describes the movement **from** a given page; **the sums of the columns in an adjacency matrix are equal to 1**. Each row of an adjacency matrix describes the movement **into** a given page; the sums of the rows in an adjacency matrix don't necessarily add to 1.
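As a quick sanity check of the column and row sums described above, the sketch below recomputes them for the lecture matrix. The names `column_sums` and `row_sums` are introduced here purely for illustration.

```python
import numpy as np

# The adjacency matrix from lecture (column j describes links out of Page j).
A = np.array([[0, 1/2, 1/2, 1/3],
              [1, 0, 0, 1/3],
              [0, 0, 0, 1/3],
              [0, 1/2, 1/2, 0]])

column_sums = A.sum(axis=0)  # total weight leaving each page
row_sums = A.sum(axis=1)     # total weight arriving at each page

print(column_sums)  # every entry is 1, up to floating-point rounding
print(row_sums)     # these generally differ from 1
```

Because of floating-point rounding, `np.allclose(column_sums, 1)` is a safer check than comparing with `==`.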

Below, we've defined a function that uses the `networkx` library to draw a graph of a network given an adjacency matrix.

In [None]:
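def plot_from_adjacency(adjacency_matrix, labels_dict=None, node_sizes=0.25):
    # Fix the seed so the spring layout looks the same on every run.
    np.random.seed(25)
    plt.figure(figsize=(8, 5))
    # networkx reads entry (i, j) as an edge from i to j, so we transpose our
    # column-based matrix. (from_numpy_matrix was removed in networkx 3.0.)
    G = nx.from_numpy_array(adjacency_matrix.T, create_using=nx.DiGraph)
    layout = nx.spring_layout(G)
    nx.draw(G, layout,
            node_size=15000 * node_sizes, labels=labels_dict, with_labels=True,
            font_color='white', font_weight='bold', font_size=15,
            connectionstyle='arc3, rad = 0.1')
    plt.show()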

The result of calling it on `A` is as follows:

In [None]:
plot_from_adjacency(A)

### Question 1.1

Having to specify an adjacency matrix manually is slightly cumbersome. It is more natural and convenient for us to describe the links between webpages using a dictionary. One such example is as follows:

In [None]:
example_net = {
    0: [1],
    1: [0, 3],
    2: [0, 3],
    3: [0, 1, 2]
}

In the above "network dictionary", we are told that:
- Page 0 links to Page 1,
- Page 1 links to Pages 0 and 3,
- Page 2 links to Pages 0 and 3, and
- Page 3 links to Pages 0, 1, and 2.

**Note that this dictionary describes the same network that the adjacency matrix `A` does.**
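To connect the dictionary back to the weights in `A`, the sketch below computes the weight each page assigns to every one of its outgoing links, namely 1 divided by its number of links. The name `link_weights` is hypothetical and used only for this illustration.

```python
# The network dictionary from above: each key is a page, each value is the
# list of pages it links to.
example_net = {
    0: [1],
    1: [0, 3],
    2: [0, 3],
    3: [0, 1, 2]
}

# A page with k outgoing links gives each link a weight of 1 / k.
link_weights = {page: 1 / len(links) for page, links in example_net.items()}
print(link_weights)
```

These are the same nonzero values that appear in columns 0 through 3 of `A`.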

Below, complete the implementation of the function `create_adjacency`, which takes in a network dictionary (similar to `example_net`) and returns an adjacency matrix. A few notes:
- It is **not** guaranteed that there are 4 pages in the network. 
- It **is** guaranteed that all pages link to at least one other page.

***Hint:*** Look into `np.zeros`.
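As a rough illustration of the hint, the sketch below creates an empty matrix with `np.zeros` and fills in a single column for a hypothetical 3-page network in which Page 0 links to Pages 1 and 2. The names `n_pages` and `M` are made up for this example, and this is not a full solution.

```python
import numpy as np

n_pages = 3  # hypothetical toy network, not one from the assignment

# Start from an all-zero square matrix with one row and column per page.
M = np.zeros((n_pages, n_pages))

# Suppose Page 0 links to Pages 1 and 2: column 0 gets weight 1/2 in rows 1 and 2.
M[[1, 2], 0] = 1 / 2

print(M)
```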

In [None]:
def create_adjacency(network):
    # YOUR CODE HERE
    ...

# Should evaluate to the same matrix as A
A_with_function = create_adjacency(example_net)
A_with_function

Run the following cell. It tests your `create_adjacency` function on a larger example. The output you should see is given below.

```
[[0.         0.25       0.         1.         0.33333333 0.        ]
 [0.         0.         0.33333333 0.         0.33333333 0.5       ]
 [0.         0.25       0.         0.         0.         0.        ]
 [0.         0.25       0.         0.         0.         0.5       ]
 [0.         0.         0.33333333 0.         0.         0.        ]
 [1.         0.25       0.33333333 0.         0.33333333 0.        ]]
```

In [None]:
net_1 = {
    0: [5],
    1: [0, 2, 3, 5],
    2: [1, 4, 5],
    3: [0],
    4: [0, 1, 5],
    5: [1, 3],
}

matrix_1 = create_adjacency(net_1)
print(matrix_1)

**In your PDF writeup, include a screenshot of all of the code you wrote plus the outputs of all of the above code cells.**

### Question 1.2

Complete the implementation of the function `scores`, which takes in an adjacency matrix (`matrix`) and returns the array containing the "scores" of each page. Recall from lecture that the score of a page can be interpreted as the long-run probability that a random internet user is on that page. As a result, the elements in the array that is returned must all be **non-negative and must sum to 1**.

***Hint:*** This was done almost exactly in lecture; you just need to generalize the calculation to any adjacency matrix. There are two approaches you can use:
- `np.linalg.matrix_power` (**recommended**). Watch the lecture recording to see how this works. If you use this approach, use an exponent of 100.
- `np.linalg.eig`. The eigenvector you find might have `+0j` at the end of each element. This is because some of the other eigenvectors of the adjacency matrix may contain complex numbers, and so the entire result of `np.linalg.eig` contains complex numbers (rather than real numbers). The eigenvector corresponding to the eigenvalue of 1 for our adjacency matrices will always contain only real numbers, so you can safely convert the eigenvector you find to an array of floats by using `.astype(np.float64)`. Since `np.linalg.eig` returns eigenvectors scaled to unit length, you will also need to rescale the eigenvector so that its elements sum to 1.

Either way, **do not** use a `for`-loop! The example in lecture involving a `for`-loop was purely for demonstration purposes.
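To make the two approaches concrete, here is a minimal sketch on a hypothetical 2-page network (Page 0 links to Page 1; Page 1 links to Page 0 and to itself). The matrix `P` and all variable names are assumptions made for illustration. The sketch uses `np.real` in place of `.astype(np.float64)` to drop the imaginary part, which is a swapped-in alternative rather than the approach named above.

```python
import numpy as np

# Hypothetical toy matrix: column j holds the weights of links out of Page j.
P = np.array([[0, 1/2],
              [1, 1/2]])
n = P.shape[0]

# Approach 1: apply a high power of the matrix to a uniform starting vector.
start = np.ones(n) / n
power_scores = np.linalg.matrix_power(P, 100) @ start

# Approach 2: take the eigenvector whose eigenvalue is closest to 1,
# keep its real part, and rescale it so the entries sum to 1.
eigenvalues, eigenvectors = np.linalg.eig(P)
idx = np.argmin(np.abs(eigenvalues - 1))
v = np.real(eigenvectors[:, idx])
eig_scores = v / v.sum()

print(power_scores)
print(eig_scores)
```

For this toy matrix both approaches agree on scores of roughly 1/3 and 2/3. Dividing by `v.sum()` also takes care of the case where the eigenvector comes back with all-negative entries.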

In [None]:
def scores(matrix):
    # YOUR CODE HERE
    ...

# Should be close to array([0.30769231, 0.38461538, 0.07692308, 0.23076923])
scores(A_with_function)

Once you've completed `scores`, run the following cell. The output should be close to

```
array([0.25      , 0.17647059, 0.04411765, 0.20098039, 0.01470588,
       0.31372549])
```

In [None]:
scores(matrix_1)

**In your PDF writeup, include a screenshot of all of the code you wrote plus the outputs of all of the above code cells.**

We can change the sizes of the pages in our network to be proportional to their PageRank scores. To do this, use the `node_sizes` argument in `plot_from_adjacency`.

In [None]:
plot_from_adjacency(A_with_function, node_sizes=scores(A_with_function))

In [None]:
plot_from_adjacency(matrix_1, node_sizes=scores(matrix_1))

### Question 1.3

Complete the implementation of the function `pagerank`, which takes in an adjacency matrix (`matrix`) and returns the **numbers** of the pages in the matrix, in **decreasing** order of score. For example, since `scores(A_with_function)` evaluates to `array([0.30769231, 0.38461538, 0.07692308, 0.23076923])`, `pagerank(A_with_function)` should evaluate to `array([1, 0, 3, 2])`. This is because Page 1 has the highest score, Page 0 has the next highest score, Page 3 has the next highest score, and Page 2 has the lowest score.

***Hint:*** Look into `np.argsort`. This should only take 2-3 lines; do not write a `for`-loop.
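As a small illustration of the hint, the sketch below shows what `np.argsort` returns for a made-up array of scores and how reversing the result gives a decreasing order. The array `toy_scores` is hypothetical.

```python
import numpy as np

toy_scores = np.array([0.2, 0.5, 0.3])  # made-up scores for Pages 0, 1, 2

# argsort returns the indices that would sort the array in increasing order.
ascending = np.argsort(toy_scores)   # array([0, 2, 1])

# Reversing that array lists the indices from highest score to lowest.
descending = ascending[::-1]         # array([1, 2, 0])

print(ascending, descending)
```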

In [None]:
def pagerank(matrix):
    # YOUR CODE HERE
    ...

# Should evaluate to array([1, 0, 3, 2])
pagerank(A_with_function)

Once you've completed `pagerank`, run the following cell. The output should be

```
array([5, 0, 3, 1, 2, 4])
```

In [None]:
pagerank(matrix_1)

**In your PDF writeup, include a screenshot of all of the code you wrote plus the outputs of all of the above code cells.**

### Question 1.4

Consider the following network:

In [None]:
weird_net = {
    0: [0],
    1: [0, 2, 4],
    2: [1, 3],
    3: [0, 1, 2],
    4: [2]
}

Note that Page 0 links to itself and to no other pages. Practically speaking, we can interpret Page 0 as being a "dead end", with no outgoing links.

Run the cells below to compute the adjacency matrix and scores for the above network.

In [None]:
weird_matrix = create_adjacency(weird_net)
weird_scores = scores(weird_matrix)
weird_scores

In addition, run the cell below.

In [None]:
plot_from_adjacency(weird_matrix, node_sizes=weird_scores)

It appears that Page 0's score is 1, and all of the other pages' scores are 0!

**In your PDF writeup, include your answer to the following two questions.**

- Why do you think Page 0's score is so high, and the other pages' scores are so low? (***Hint:*** Think about how we interpret the score of a page.)
- Read the [Damping factor](https://en.wikipedia.org/wiki/PageRank#Damping_factor) section of the Wikipedia article on PageRank. In two sentences, describe (to the best of your ability) how using damping would prevent the score of Page 0 from becoming 1. (If you read the article closely, the answer is there – describe it in your own words.)
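If you would like to experiment numerically before writing your answer, the sketch below applies one common form of damping to the adjacency matrix of `weird_net`, written out by hand here. The damping factor `d = 0.85` and the formula `d * M + (1 - d) / n` are assumptions based on the usual presentation of PageRank, not something defined in this assignment, and all variable names are hypothetical.

```python
import numpy as np

# Adjacency matrix for weird_net, typed out by hand
# (column j holds the weights of links out of Page j).
weird = np.array([[1, 1/3, 0,   1/3, 0],
                  [0, 0,   1/2, 1/3, 0],
                  [0, 1/3, 0,   1/3, 1],
                  [0, 0,   1/2, 0,   0],
                  [0, 1/3, 0,   0,   0]])
n = weird.shape[0]

d = 0.85  # assumed damping factor

# With probability d follow a link; otherwise jump to a uniformly random page.
damped = d * weird + (1 - d) / n * np.ones((n, n))

start = np.ones(n) / n
undamped_scores = np.linalg.matrix_power(weird, 100) @ start
damped_scores = np.linalg.matrix_power(damped, 100) @ start

print(undamped_scores)
print(damped_scores)
```

Comparing the two printed arrays shows how the scores change once random jumps are allowed; explaining why is left to your writeup.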