# Practice Session 06: PageRank

We will compute PageRank on a graph that represents the UK web around 2007. Each node is a host, and there is a link from one host to another if a web page in the first host points to a web page in the second. This network is weighted: the weight is the number of pages that point from one host to the other.

The collection we will use, [WEBSPAM-UK2007](http://chato.cl/webspam/datasets/uk2007/), has been used in multiple studies on the effect of web spam. Feel free to decompress these files to inspect them, **but your code must read only these files in compressed form**:

* ``webspam_uk2007-nodes.csv.gz`` contains (``nodeid``, ``hostname``, ``label``) records
* ``webspam_uk2007-edges.csv.gz`` contains (``source``, ``destination``, ``weight``) records

Your task is to compute PageRank twice: first considering all the links, and then ignoring links from or to a known spam host.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 0. Code snippets you may need

## 0.1. Read a CSV file with a header

Suppose ``FILENAME`` points to a file with the following contents:

```
a,b,c,d
1,2,3,4
5,6,7,8
```

The following code:

```python
with gzip.open(FILENAME, "rt", encoding="utf-8") as input_file:
    reader = csv.DictReader(input_file, delimiter=',', quotechar='"')
    for record in reader:
        print(record["b"])
```

Prints:

```
2
6
```

## 0.2. Sort a list of scores

You can use the `enumerate()` function, which converts a list `[a, b, c]` into the pairs `(0, a), (1, b), (2, c)`, and then `sorted()` as follows. Suppose ``score`` contains ``[0.2, 0.7, 0.4]``:

```python
hosts_by_score = sorted(enumerate(score), key=lambda x: x[1], reverse=True)
```

This returns the list `[(1, 0.7), (2, 0.4), (0, 0.2)]`.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 1. Read host names

Read the names of the nodes into a dictionary `id2name`, and the labels into another dictionary `id2label`. The keys (node ids) should be converted to integers using ``int(...)``. Remember that in this file each record contains ``nodeid``, ``hostname``, and ``label``.
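As an illustration of the reading pattern, here is a minimal sketch on a toy file. The file `toy-nodes.csv.gz`, its three rows, and the `toy_`-prefixed names are made up for this example; only the column names come from the description above.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical toy file with the same columns as the real nodes file.
toy_path = os.path.join(tempfile.mkdtemp(), "toy-nodes.csv.gz")
with gzip.open(toy_path, "wt", encoding="utf-8", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["nodeid", "hostname", "label"])
    writer.writerow([0, "a.example.co.uk", "nonspam"])
    writer.writerow([1, "b.example.co.uk", "spam"])
    writer.writerow([2, "c.example.co.uk", "unlabeled"])

toy_id2name = {}
toy_id2label = {}
# Read the file in compressed form, one record at a time.
with gzip.open(toy_path, "rt", encoding="utf-8") as input_file:
    reader = csv.DictReader(input_file, delimiter=",", quotechar='"')
    for record in reader:
        nodeid = int(record["nodeid"])  # DictReader yields strings
        toy_id2name[nodeid] = record["hostname"]
        toy_id2label[nodeid] = record["label"]

print("%s: %s" % (toy_id2name[1], toy_id2label[1]))
```

Because `csv.DictReader` returns every field as a string, looking up `toy_id2name[1]` only works after the explicit `int(...)` conversion.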

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

```python
import io
import gzip
import csv
```

```python
INPUT_NODES_FILENAME = "webspam_uk2007-nodes.csv.gz"
INPUT_EDGES_FILENAME = "webspam_uk2007-edges.csv.gz"
```

<font size="+1" color="red">Replace this cell with your code to read the nodes file into id2name and id2label.</font>

Verify that you are reading the file correctly:
    
```python
print("%s: %s" % (id2name[873], id2label[873]))
print("%s: %s" % (id2name[105715], id2label[105715]))
print("Number of hosts: %s" % len(id2name))
```

Should print:

```
bbc.co.uk: nonspam
www.top-mobile-phones.co.uk: spam
Number of hosts: 114529
```

If you get a `KeyError`, most likely you did not convert the ids to integers.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

Count how many hosts are labeled as spam, how many as nonspam, and how many are unlabeled, which should be the majority.
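One possible way to do this kind of counting is with `collections.Counter`. The sketch below uses a made-up dictionary `toy_id2label` rather than the real labels, so the counts it prints are not the ones you should expect from the dataset.

```python
from collections import Counter

# Hypothetical labels for four hosts; the real dictionary comes from the nodes file.
toy_id2label = {0: "nonspam", 1: "spam", 2: "unlabeled", 3: "unlabeled"}

# Counter tallies how many times each label value occurs.
toy_label_counts = Counter(toy_id2label.values())
for label, count in toy_label_counts.most_common():
    print("%s: %d" % (label, count))
```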

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to print how many hosts are spam, how many are nonspam, and how many are unlabeled (this should be the large majority).</font>

# 2. Compute the degree of each node

Compute the out-degree of each node and store it in the dictionary `id2degree`. For this, you will need to read the edges file once. This file contains (``source``, ``destination``, ``weight``) records.
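A minimal sketch of the counting step on a toy edge list follows. It assumes that every record counts as one out-link of its source and that the weight is ignored; that reading is an assumption, so check it against the expected values given for the real file. The names `toy_edges_csv`, `toy_N`, and `toy_id2degree` are hypothetical.

```python
import csv
import io

# Hypothetical toy edge list with the same columns as the edges file.
toy_edges_csv = "source,destination,weight\n0,1,3\n0,2,1\n1,2,5\n"

toy_N = 3
# Every node starts at 0 so that nodes without out-links still have an entry.
toy_id2degree = {nodeid: 0 for nodeid in range(toy_N)}

reader = csv.DictReader(io.StringIO(toy_edges_csv))
for record in reader:
    source = int(record["source"])
    # Assumption: one record is one out-link; the weight column is not used.
    toy_id2degree[source] += 1

print(toy_id2degree)  # {0: 2, 1: 1, 2: 0}
```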

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

```python
# Initialization of id2degree

id2degree = {}
N = len(id2name)
for nodeid in range(N):
    id2degree[nodeid] = 0
```

<font size="+1" color="red">Replace this cell with your code to read the degrees of nodes into id2degree.</font>

Verify that you are reading the file correctly:
    
```python
print("%s: %s" % (id2name[890], id2degree[890]))
print("%s: %s" % (id2name[1469], id2degree[1469]))
print("%s: %s" % (id2name[105715], id2degree[105715]))
```

Should print:

```
bc1.org.uk: 16
candycaine.skinthesun.co.uk: 22
www.top-mobile-phones.co.uk: 0
```

If you get a `KeyError`, most likely you did not convert the ids to integers or you did not initialize `id2degree`.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 3. Compute PageRank

Perform `iterations=20` iterations with `alpha=0.85`. In each iteration, you will read the file of the graph, **without loading the entire graph in memory**. This means each iteration involves opening (and implicitly, closing) the edges file.

Your code should do the following:

* At the beginning, initialize the vector `pagerank` as a vector of 1/N and the vector `pagerank_aux` as a vector of 0s.
* For `iterations` iterations:
   * Read the graph and for every link from *source* to *destination*:
      * Add to `pagerank_aux[destination]` the value `pagerank[source]/degree`, where *degree* is the out-degree of the source node (i.e., its number of out-links).
   * Set *pagerank* of every node to *alpha x pagerank_aux + (1.0-alpha) x (1.0/N)*.
   * Set every entry of `pagerank_aux` back to 0.0.

Remember: do not keep the graph in memory, because that will limit the size of the graphs your code can handle. At every iteration you must read the file again.
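The steps above can be sketched on a toy graph as follows. This is an illustration under stated assumptions, not a reference solution: the file `toy-edges.csv.gz`, the four edges, and all `toy_`-prefixed names are invented, and the out-degrees are written by hand instead of being computed from the file.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical toy graph: 0->1, 0->2, 1->2, 2->0 (every node has an out-link).
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
toy_path = os.path.join(tempfile.mkdtemp(), "toy-edges.csv.gz")
with gzip.open(toy_path, "wt", encoding="utf-8", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["source", "destination", "weight"])
    for source, destination in toy_edges:
        writer.writerow([source, destination, 1])

toy_N = 3
toy_degree = {0: 2, 1: 1, 2: 1}  # out-degrees of the toy graph, set by hand
toy_alpha = 0.85
toy_iterations = 20

toy_pagerank = [1.0 / toy_N] * toy_N
toy_pagerank_aux = [0.0] * toy_N

for _ in range(toy_iterations):
    # Re-open the compressed file in every iteration; edges are never stored.
    with gzip.open(toy_path, "rt", encoding="utf-8") as input_file:
        for record in csv.DictReader(input_file):
            source = int(record["source"])
            destination = int(record["destination"])
            toy_pagerank_aux[destination] += toy_pagerank[source] / toy_degree[source]
    # Combine the propagated score with the uniform random-jump term.
    toy_pagerank = [toy_alpha * aux + (1.0 - toy_alpha) * (1.0 / toy_N)
                    for aux in toy_pagerank_aux]
    # Reset the accumulator for the next iteration.
    toy_pagerank_aux = [0.0] * toy_N

for nodeid, score in enumerate(toy_pagerank):
    print("%d: %.6f" % (nodeid, score))
```

In this toy graph every node has at least one out-link, so the scores keep adding up to 1. On the real graph, hosts with out-degree 0 do not pass their score on, so the sum can be smaller than 1.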

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

```python
ITERATIONS = 20
ALPHA = 0.85
```

<font size="+1" color="red">Replace this cell with your code to compute PageRank.</font>

# 4. Nodes with largest values of PageRank

Print the top 20 hosts by PageRank, including the host id, host name, label, and PageRank value with 6 decimals.
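A small sketch of ranking and formatting, using made-up scores and names (`toy_pagerank`, `toy_id2name`, `toy_id2label`, and `toy_top_k` are hypothetical), might look like this. The `%.6f` format specifier is what produces the 6 decimals.

```python
# Hypothetical toy data; the real values come from the earlier steps.
toy_id2name = {0: "a.example.co.uk", 1: "b.example.co.uk", 2: "c.example.co.uk"}
toy_id2label = {0: "nonspam", 1: "spam", 2: "unlabeled"}
toy_pagerank = [0.2, 0.7, 0.1]
toy_top_k = 2

# Pair each score with its node id, then sort by score in descending order.
toy_by_score = sorted(enumerate(toy_pagerank), key=lambda x: x[1], reverse=True)

toy_top_lines = []
for nodeid, score in toy_by_score[:toy_top_k]:
    line = "%d %s %s %.6f" % (nodeid, toy_id2name[nodeid], toy_id2label[nodeid], score)
    toy_top_lines.append(line)
    print(line)
```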

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to print the 20 hosts having the largest PageRank. Print the host id, host name, label, and score with 6 decimals.</font>

<font size="+1" color="red">Include a brief commentary on what you see here. Which are the hosts with the largest values of PageRank? Why?</font>

# 5. Non-spam PageRank

Now, write code and run non-spam PageRank. For this, compute PageRank as before but ignore any link in which either the source or the destination is a known spam host, i.e., any node for which ``id2label[nodeid] == "spam"``. Consider only the edges that start and end in a ``nonspam`` or ``unlabeled`` node.

This will change the degree of the nodes: the degree should not consider the links that are being ignored. Hence, we must first re-compute the degree.
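The filtering rule can be illustrated on a toy edge list. As before, this sketch assumes one out-link per record, and the names `toy_id2label`, `toy_edges`, and `toy_id2nsdegree` are invented for the example.

```python
# Hypothetical toy data: node 1 is a known spam host.
toy_id2label = {0: "nonspam", 1: "spam", 2: "unlabeled"}
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 0)]

toy_id2nsdegree = {nodeid: 0 for nodeid in toy_id2label}
for source, destination in toy_edges:
    # Skip any link that starts or ends in a known spam host.
    if toy_id2label[source] == "spam" or toy_id2label[destination] == "spam":
        continue
    toy_id2nsdegree[source] += 1

print(toy_id2nsdegree)  # {0: 1, 1: 0, 2: 1}
```

Node 0 loses the link it had to the spam host, and the spam host itself ends up with degree 0 because all of its links are ignored. The same test has to be applied again when propagating scores, so that the degrees and the links that are followed stay consistent.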

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to compute id2nsdegree (ns stands for no-spam).</font>

Verify that you are computing the non-spam degree correctly.
    
```python
print("%s: %s" % (id2name[890], id2nsdegree[890]))
print("%s: %s" % (id2name[1469], id2nsdegree[1469]))
print("%s: %s" % (id2name[105715], id2nsdegree[105715]))
```

Should print:

```
bc1.org.uk: 16
candycaine.skinthesun.co.uk: 20
www.top-mobile-phones.co.uk: 0
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to compute nspagerank (ns stands for no-spam).</font>

<font size="+1" color="red">Replace this cell with code to print the 20 hosts having the largest no-spam PageRank scores. Print the host id, host name, label, and score with 6 decimals.</font>

<font size="+1" color="red">Include a brief commentary on what you see here. Which are the hosts with the largest values of non-spam PageRank? Is this list the same as or different from the regular PageRank list? Why do you think this happens?</font>

# 6. Compute spam gain

Finally, compute the *PageRank gain* of every host as *(Normal PageRank) / (No spam PageRank)*, and print the 20 hosts with the largest *PageRank gain*.

Among the top hosts you might find both "spam" sites (businesses that look illegitimate or that tend to rely on spam, such as gambling, pornography, counterfeits, and scams) and "normal" sites (i.e., websites that look legitimate), because spammers also point to legitimate sites to disguise their actions.
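As a sketch of the ratio and the ranking, with made-up score vectors (`toy_pagerank`, `toy_nspagerank`, and `toy_gain` are hypothetical names and values):

```python
# Hypothetical scores for three hosts under both computations.
toy_pagerank = [0.30, 0.50, 0.20]
toy_nspagerank = [0.30, 0.25, 0.45]

# Gain above 1 means the host scores higher when spam links are counted.
toy_gain = [pr / nspr for pr, nspr in zip(toy_pagerank, toy_nspagerank)]

toy_by_gain = sorted(enumerate(toy_gain), key=lambda x: x[1], reverse=True)
for nodeid, gain in toy_by_gain:
    print("%d %.6f %.6f %.6f" % (nodeid, toy_pagerank[nodeid], toy_nspagerank[nodeid], gain))
```

With the update rule used in this session, every score is at least *(1.0-alpha) x (1.0/N)*, so the denominator should never be zero.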

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to compute pagerank_gain.</font>

<font size="+1" color="red">Replace this cell with code to print the 20 hosts with the largest PageRank gain. Print the host id, host name, label, pagerank, nspagerank, and pagerank_gain.</font>

<font size="+1" color="red">Include a brief commentary on the types of sites you see in this list, and what you can conclude from this.</font>

# Deliver (individually)

A .zip file containing:

* This notebook.


## Extra points available

If you would like to go for extra points (+2, so your maximum grade can be a 12 in this assignment), include a Cytoscape drawing of a sample of hosts (e.g., the top ones by PageRank, or the top ones by degree), painting the nodes that are spam in one color and the nodes that are nonspam in another color. Exclude the nodes that are *unlabeled*.

Include in your sample at least a few hundred hosts; as many as possible without crashing Cytoscape or having to wait an unreasonable amount of time for the layout to be completed.

Remember that the `subgraph` function in NetworkX allows you to select a sub-graph given a list of nodes.

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: spam/nonspam visualization</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>