# Networks: structure, evolution & processes
**Internet Analytics - Lab 2**

---

**Group:** *H*

**Names:**

* *BAFFOU Jérémy*
* *BASSETO Antoine*
* *PINTO Andrea*

---

#### Instructions

*This is a template for part 4 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [55]:
import numpy as np
import random

---

## 2.4 PageRank

### 2.4.1 Random Surfer Model

#### Exercise 2.12

In [72]:
def file_to_directed_graph(filename):
    
    with open("../data/" + filename) as f:
        content = f.read().splitlines()
    
    graph = {}
    for line in content:
        c = list(map(int, line.split()))
        graph[int(c[0])] = c[1:]
        
    return graph

In [73]:
def page_rank_naive(graph, nb_iterations):
    scores = np.zeros(len(graph))
    current = random.choice(list(graph))
    scores[current] += 1
    
    for i in range(nb_iterations):
        if not graph[current]:
            return scores / np.sum(scores)
    
        current = int(random.choice(graph[current]))
        scores[current] += 1
        
    return scores / np.sum(scores)

In [74]:
filename = "absorbing.graph"
graph = file_to_directed_graph(filename)
print(graph)
page_rank_naive(graph, 20)

{0: [1, 4], 1: [], 2: [3], 3: [0, 1, 2], 4: [1]}


array([0. , 0.2, 0.4, 0.4, 0. ])

Explain result here

In [76]:
filename = "components.graph"
graph = file_to_directed_graph(filename)
page_rank_naive(graph, 20)

array([0.28571429, 0.28571429, 0.33333333, 0.0952381 , 0.        ,
       0.        , 0.        , 0.        ])

Explain result here

#### Exercise 2.13

In [95]:
def page_rank(graph, nb_iterations, damping_factor=0.15):
    scores = np.zeros(len(graph))
    current = random.choice(list(graph))
    scores[current] += 1
    
    for i in range(nb_iterations):
        if np.random.choice([True, False], p=[damping_factor, 1-damping_factor]):
            current = random.choice(list(graph))
        elif not graph[current]:
            current = random.choice(list(graph))
        else:
            current = int(random.choice(graph[current]))
        scores[current] += 1
        
    return scores / np.sum(scores)

In [96]:
filename = "absorbing.graph"
graph = file_to_directed_graph(filename)
page_rank(graph, 20)

array([0.04761905, 0.19047619, 0.33333333, 0.38095238, 0.04761905])

Explain result here

In [97]:
filename = "components.graph"
graph = file_to_directed_graph(filename)
page_rank(graph, 20)

array([0.0952381 , 0.0952381 , 0.0952381 , 0.04761905, 0.19047619,
       0.0952381 , 0.19047619, 0.19047619])

Explain result here

---

### 2.4.2 Power Iteration Method

#### Exercise 2.14: Power Iteration method

In [98]:
def file_to_google_matrix(filename, theta=0.85):
    
    with open("../data/" + filename) as f:
        content = f.read().splitlines()
    
    nb_nodes = len(content)
    g = np.zeros((nb_nodes, nb_nodes))
    
    for line in content:
        c = list(map(int, line.split()))
        outgoing_degree = len(c) - 1
        
        if outgoing_degree == 0:
            g[c[0]] = np.ones(nb_nodes) / nb_nodes
        else:
            for i in c[1:]:
                g[c[0]][i] = 1 / outgoing_degree
        
    return theta * g + (1 - theta) * np.ones((nb_nodes, nb_nodes)) / nb_nodes

In [99]:
def power_iteration(g, nb_iterations):
    v = np.ones(np.shape(g)[0]) / np.shape(g)[0]

    for i in range(nb_iterations):
        v = v @ g

    return v

In [119]:
def get_wikipedia_pages(id_array):
    
    with open("../data/wikipedia_titles.tsv") as f:
        # Strip away the first line, because it corresponds to a legend and not to a page
        content = np.array(f.read().splitlines()[1:])
        
    # Return the selected ids and format the result to only contain page titles (and not their id)
    return list(map(lambda x: x.split(None, 1)[1], content[id_array]))

In [120]:
filename = "wikipedia.graph"
g = file_to_google_matrix(filename)
scores = power_iteration(g, 50)
get_wikipedia_pages(np.argsort(scores)[-1:-11:-1])

['United States',
 'United Kingdom',
 'France',
 'Europe',
 'Germany',
 'England',
 'World War II',
 'Latin',
 'India',
 'English language']

Explain here

---

### 2.4.3 Gaming the system *(Bonus)*

#### Exercise 2.15 *(Bonus)*

In [149]:
def get_score_and_rank_of_page(scores, page_id):
    nb_of_pages = len(scores)
    
    page_title = get_wikipedia_pages([page_id])[0]
    score = scores[page_id]
    rank = np.nonzero(np.argsort(scores)[::-1] == page_id)[0][0]
    
    # Return a formated strings, with the rank going from 1 for the best to the number of pages for the worst
    return f"Page \"{page_title}\":\n\tScore: {score}\n\tRank: {rank + 1} out of {nb_of_pages}"

In [150]:
filename = "wikipedia.graph"
g = file_to_google_matrix(filename)
scores = power_iteration(g, 50)

# ID of page "United States" as a check
page_id = 5210
print(get_score_and_rank_of_page(scores, page_id))

# ID of page "History of mathematics"
page_id = 2463
print(get_score_and_rank_of_page(scores, page_id))

Page "United States":
	Score: 0.007459087286658076
	Rank: 1 out of 5540
Page "History of mathematics":
	Score: 9.846341053223444e-05
	Rank: 2530 out of 5540


In [159]:
def cheating_google(g, scores, page_id, new_edge_budget, theta=0.85):
    nb_nodes = g.shape[0]
    h_hat = g - (1 - theta) * np.ones((nb_nodes, nb_nodes)) / nb_nodes
    ranking = np.argsort(scores)[::-1]
    
    for i in range(new_edge_budget):
        outgoing_degree = len(np.nonzero(h_hat[ranking[i]])[0])
        
        # If the page is a dangling node (i.e. in h_hat the page is connected to all other pages)
        # Does not work correctly if it is actually a page that links to every single other page, because it will consider those to be dangling nodes.
        if outgoing_degree == nb_nodes:
            line = np.zeros(nb_nodes)
            line[page_id] = 1
            h_hat[ranking[i]] = line
        else:
            h_hat[ranking[i]] *= outgoing_degree / (outgoing_degree + 1)
            h_hat[ranking[i]][page_id] += 1 / (outgoing_degree + 1)
            
    return h_hat + (1 - theta) * np.ones((nb_nodes, nb_nodes)) / nb_nodes

In [161]:
# ID of page "History of mathematics"
page_id = 2463
new_edge_budget = 300
filename = "wikipedia.graph"

g = file_to_google_matrix(filename)
scores = power_iteration(g, 50)
print("Before cheating :")
print(get_score_and_rank_of_page(scores, page_id))

print("\n")

g_cheat = cheating_google(g, scores, page_id, new_edge_budget)
scores = power_iteration(g_cheat, 50)
print(f"After cheating (by adding {new_edge_budget} new edges):")
print(get_score_and_rank_of_page(scores, page_id))

Before cheating :
Page "History of mathematics":
	Score: 9.846341053223444e-05
	Rank: 2530 out of 5540


After cheating (by adding 300 new edges):
Page "History of mathematics":
	Score: 0.0057024291432487256
	Rank: 2 out of 5540
