In [None]:
# SETUP

# These lines import the NumPy, datascience, and prob140 modules.
import numpy as np
from datascience import *
from prob140 import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

from scipy import stats
from client.api.assignment import load_assignment
autograder = load_assignment('main.ok')

# Lab 5: Steady State Markov Chains

#If the server has not been updated yet, turn this into a code cell and run it. Then restart the kernel.

!pip install -U setuptools --ignore-installed

!pip install prob140 --upgrade

In [None]:
import prob140
prob140.__version__
# This should return 0.2.4.2
# Talk to lab instructor if it doesn't

The following cell imports data that we will be using for today's lab.

In [None]:
# Use to load data
import pickle

def load_data():
    # Open in binary mode and close the file once the data is loaded.
    with open('prob140.data', 'rb') as f:
        return pickle.load(f)

website_data = load_data()

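def load_country_data():
    # The pickle holds three objects: the country list, per-country link totals,
    # and pairwise link counts.
    with open('country_data', 'rb') as f:
        rrc, ntn, cl = pickle.load(f)
    return rrc, ntn, cl

real_real_countries, new_total_num, country_links = load_country_data()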

## Part 1: The Tsetlin Library

Suppose we have a library bookshelf with 3 books: `'A'`, `'B'`, and `'C'`. Each day, exactly one book is borrowed: book `'A'`, `'B'`, or `'C'` with probability `p_A`, `p_B`, or `p_C` respectively. The borrower returns the book to the leftmost position before the end of the day. For example, if the order is `ABC` at the beginning of the day, we could end up with `ABC`, `BAC`, or `CAB` depending on which book was borrowed.
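
The daily move described above can be sketched in a few lines. This is only an illustration of the shelf mechanics, not part of the lab's required code; the helper name `move_to_front` is an assumption introduced here.

```python
def move_to_front(shelf, book):
    """Return the shelf order after `book` is borrowed and put back leftmost."""
    # Remove the borrowed book from its current spot, then place it first.
    return book + shelf.replace(book, '')

# Every shelf reachable from 'ABC' in one day, keyed by the borrowed book.
next_shelves = {book: move_to_front('ABC', book) for book in 'ABC'}
print(next_shelves)  # {'A': 'ABC', 'B': 'BAC', 'C': 'CAB'}
```

Borrowing the book that is already leftmost leaves the shelf unchanged, which is why `ABC` appears among the possible outcomes.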

We will investigate the long-run distribution of the leftmost book using a Markov chain.

### Question 1.1

In the following cell, we have defined `p_A`, `p_B`, and `p_C`.

In [None]:
p_A = 0.6
p_B = 0.3
p_C = 0.1

If the current state is `ABC`, what is the one-step probability of moving to each of `ABC`, `BAC`, `CAB`, and `CBA`?

In [None]:
p_ABC = ...
p_BAC = ...
p_CAB = ...
p_CBA = ...

### Question 1.2

What are the possible states for the books on the library bookshelf? Keep them in alphabetical order.

In [None]:
s = make_array(...)

### Question 1.3

What are the states that can transition to the state `ABC`? For each of those states, what is the probability that it ends up at `ABC`?

*Provide your answer and reasoning in this Markdown cell.*

### Question 1.4

Define the transition function `trans_left` that finds the probability of going from the state `perm1` to the state `perm2`. For example, `trans_left('ABC', 'BAC')` should return `p_B` while `trans_left('ABC', 'BCA')` should return 0. You may find the following useful:

* `string[0]` returns the first letter of a string. For example:
```python
>>> 'HELLO'[0]
'H'
```
* `string[1:]` returns everything except the first letter of a string. For example:
```python
>>> 'HELLO'[1:]
'ELLO'
```
* `string.replace(old, new)` replaces all instances of the substring `old` with the substring `new`. For example:
```python
>>> 'HELLO'.replace('E', '')
'HLLO'
```
* The helper function `prob`, defined in the cell below, returns the probability that a given book is borrowed.

In [None]:
def trans_left(perm1, perm2):
    first = perm2[0] # This extracts the first letter of perm2
    if ... :
        return ...
    else:
        return 0
    
def prob(x):
    if x == 'A':
        return p_A
    if x == 'B':
        return p_B
    if x == 'C':
        return p_C

## Part 2: The Limiting Behavior of the Tsetlin Library

Using a Markov chain, we will investigate the long-run distribution of the leftmost book of the Tsetlin library after many days.

### Question 2.1

Set `mc` to be the Markov chain corresponding to the books on the library shelf. Make sure every row sums to 1.

In [None]:
mc = ...
mc

### Question 2.2

At the end of one day, the books are arranged as `CBA`. Find the distribution of the states 5 days later.
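
As a reminder of what "the distribution 5 days later" means, here is a minimal sketch in plain Python on a hypothetical two-state chain (not the library chain). The matrix `P` and the helper names `step` and `distribution_after` are assumptions made for illustration; each day the distribution is multiplied on the right by the transition matrix.

```python
def step(dist, P):
    """One day of evolution: new[j] = sum over i of dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def distribution_after(dist, P, days):
    """Apply `step` repeatedly to get the distribution after `days` days."""
    for _ in range(days):
        dist = step(dist, P)
    return dist

# Hypothetical two-state chain (states 0 and 1); each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

start = [0.0, 1.0]  # the chain is known to be in state 1 today
after_5 = distribution_after(start, P, 5)
print(after_5)
```

Starting from a known state corresponds to an initial distribution that puts probability 1 on that state and 0 elsewhere.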

In [None]:

mc_5 = ...
mc_5

### Question 2.3

Suppose someone knocked the books off the bookshelf and placed them back in a uniformly random order. Find the distribution of the states 5 days later.

In [None]:

initial = Table()
mc_unif_5 = ...
mc_unif_5

### Question 2.4

Find the steady state distribution of the Tsetlin library. 
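
Recall that a steady-state distribution $\pi$ is one that a step of the chain leaves unchanged: $\pi P = \pi$. The sketch below checks this condition exactly, using fractions, on a hypothetical two-state chain. The values of `a` and `b` and the helper `is_stationary` are assumptions for illustration and are unrelated to the library chain.

```python
from fractions import Fraction

def is_stationary(pi, P):
    """Check the balance condition pi P = pi exactly."""
    n = len(P)
    return all(sum(pi[i] * P[i][j] for i in range(n)) == pi[j]
               for j in range(n))

# Hypothetical two-state chain: leave state 0 w.p. a, leave state 1 w.p. b.
a = Fraction(1, 10)
b = Fraction(1, 2)
P = [[1 - a, a],
     [b, 1 - b]]

# For a two-state chain, solving pi P = pi gives this closed form.
pi = [b / (a + b), a / (a + b)]
print(pi, is_stationary(pi, P))
```

For larger chains the same condition is a system of linear equations in the entries of $\pi$, together with the requirement that the entries sum to 1.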

In [None]:
mc_long_run = ...
mc_long_run

### Question 2.5

What do you notice about the probability that book A, B, or C ends up on the left?


Explain why these numbers make sense.

*Provide your answer and reasoning in this Markdown cell.*

## Part 3: Surfing the Prob140 website

In the 1990s, web search engine companies began to develop bots that crawled and indexed webpages so that users could search for any word on any webpage in their databases. These early search engines ranked webpages by the number of occurrences of the keywords in each webpage. An obvious problem is that websites could easily cram popular terms into their webpages to rank higher and attract more views.

Larry Page and Sergey Brin, the founders of Google, noted that important and reputable webpages are likely to have more incoming hyperlinks. Thus, PageRank, the algorithm that powers Google search, orders the search results by modeling visits as a Markov chain. Imagine a web "surfer" who jumps from page to page by randomly clicking links, selecting any link on a page with equal probability. Under this model, the most important pages are the ones that the surfer is likely to spend the most time on in the long run. Thus, by taking advantage of the limiting behavior of Markov chains and the properties of the steady-state distribution, PageRank assigns a relative importance to each webpage.
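
The random-surfer model can be illustrated with a short simulation. This is a sketch on a hypothetical three-page site, not the Prob140 data; the dictionary `links`, the function `surf`, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical site: each page maps to the list of links it contains
# (a target appears twice if the page links to it twice).
links = {
    'home': ['a', 'a', 'b'],
    'a': ['home'],
    'b': ['home'],
}

def surf(links, start, clicks, seed=0):
    """Follow uniformly random links and return the share of visits per page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    page = start
    visits = Counter()
    for _ in range(clicks):
        page = rng.choice(links[page])  # every link is equally likely
        visits[page] += 1
    return {p: visits[p] / clicks for p in links}

time_share = surf(links, 'home', 10_000)
ranking = sorted(time_share, key=time_share.get, reverse=True)
print(ranking)
```

In this toy site every page links back to `home`, so the surfer returns there on every other click and `home` receives the largest share of visits.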


To better visualize PageRank and investigate its effectiveness, we will run PageRank on the pages of the Prob140 website.

### Question 3.1

`website_data` contains the counts for the number of hyperlinks between each webpage on the Prob140 website:
* http://prob140.org
* http://prob140.org/about
* http://prob140.org/instructors
* http://prob140.org/latex
* http://prob140.org/logistics
* http://prob140.org/materials_placeholder
* http://prob140.org/references
* http://prob140.org/textbook_placeholder
* http://prob140.org/weekly


Links to external webpages are ignored, so we are only considering links between the webpages listed above. Which webpage do you think should be ranked the highest? Why?

*Provide your answer and reasoning in this Markdown cell.*

### Question 3.2

In the cell below, we define the transition probability of going from the `source` webpage to the `target` webpage. `website_data[source][target]` returns the number of hyperlinks that go from `source` to `target`, while `sum(website_data[source].values())` sums up the total number of outbound links from the `source` webpage.
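
To see the normalization on a small scale, here is a sketch with made-up link counts. The dictionary `toy_links` and the function `toy_prob` are hypothetical stand-ins that mirror the shape described above; they are not the lab's data.

```python
# Hypothetical link counts: toy_links[source][target] = number of links.
toy_links = {
    'home':  {'home': 0, 'about': 2, 'notes': 1},
    'about': {'home': 1, 'about': 0, 'notes': 0},
    'notes': {'home': 3, 'about': 1, 'notes': 0},
}

def toy_prob(source, target):
    """Share of the source page's outbound links that point to target."""
    return toy_links[source][target] / sum(toy_links[source].values())

# Each row of transition probabilities should sum to 1.
for source in toy_links:
    row = [toy_prob(source, target) for target in toy_links]
    print(source, row, 'row sum =', sum(row))
```

A page with no outbound links would make the denominator zero, so this sketch assumes every page links to at least one page in the set.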

In [None]:
def prob_site1_site2(source,target):
    return website_data[source][target] / sum(website_data[source].values())

Based on our model of random surfing, explain why the number of page links from $\text{source}_{i}$ to $\text{target}_{j}$ is proportional to the probability $p_{ij}$.

*Provide your answer and reasoning in this Markdown cell.*

### Question 3.3

The cell below constructs a table of transition probabilities for every pair of webpages on the Prob140 website. Define `mc_prob140` to be a Markov chain that represents a random surfer on the Prob140 website. Since we have 9 states, confirm that the transition matrix is 9x9.

In [None]:
t = Table().states(list(website_data.keys())).transition_function(prob_site1_site2)

In [None]:
mc_prob140 = ...
mc_prob140

### Question 3.4

Find the PageRank of each of the 9 webpages, sorted from highest to lowest.

### Question 3.5

What does the page ranking tell us, asymptotically, about which page a random surfer would spend the most time on? Would this be the page you spend the most time on if you were to manually surf the Prob140 site on your own?

*Provide your answer and reasoning in this Markdown cell.*

### Question 3.6

Explain why the webpage with the second highest ranking is ranked so highly.

*Provide your answer and reasoning in this Markdown cell.*

### Question 3.7

Now, we want to know whether this steady-state distribution changes if we choose the starting page uniformly at random. That is, we choose each page on the site with probability 1/9 and run our chain from there. Below, run the chain with the uniform distribution as the initial distribution, and use ``emperical_distribution`` with 1000 simulations to find the empirical distribution that approximates the steady state.

In [None]:
domain = make_array('http://prob140.org',
 'http://prob140.org/about',
 'http://prob140.org/instructors',
 'http://prob140.org/latex',
 'http://prob140.org/logistics',
 'http://prob140.org/materials_placeholder',
 'http://prob140.org/references',
 'http://prob140.org/textbook_placeholder',
 'http://prob140.org/weekly')

What does the empirical distribution of the chain after 1000 simulations tell you about the relation between the initial and steady-state distributions? Can we pass a different initial distribution to the chain and get the same results? What might change?

*Provide your answer and reasoning in this Markdown cell.*

## Part 4: Surfing Wikipedia

### Question 4.1

Similar to the Prob140 data, `country_table` is built from the hyperlink counts between every pair of countries' Wikipedia articles. Any hyperlinks to webpages other than a country's Wikipedia page are excluded.

Using the provided transition function, rank the countries by the PageRank of their Wikipedia articles.


In [None]:
def transition_prob(sc,tc):
    return country_links[(sc,tc)]/new_total_num[sc]
country_table = Table().states(real_real_countries).transition_function(transition_prob)
country_table

In [None]:
mc_country = ...
mc_country

### Question 4.2 (optional)

We'd love to hear your explanation for why the top-ranked country is ranked so high.

*Provide your answer and reasoning in this Markdown cell.*

## Part 5: The Tsetlin Library Continued

In Parts 1 and 2 we used the prob140 library to analyze the Tsetlin library chain and saw that, in the stationary distribution, the probability that each book is leftmost is exactly the initial weight we gave for that book to be borrowed.

Now we will derive this stationary distribution, using pen and paper.

### Question 5.1

Derive the stationary distribution for a general `p_A`, `p_B`, and `p_C`. A single expression for $P(XYZ)$, where X, Y, Z is any ordering of A, B, and C, is sufficient.


*Provide your answer and reasoning in this Markdown cell.*

### Question 5.2

Find all such conditional probabilities for the 3-book case:

$$ P(\text{Book j next leftmost} \mid \text{Book i current leftmost}) ~~~ i \neq j $$

Then analyze the ratio of these conditionals, for example $\frac{P(\text{Book B next leftmost} \mid \text{Book A current leftmost})}{P(\text{Book C next leftmost} \mid \text{Book A current leftmost})}$, for all such combinations. How do these compare to the same ratios of the initial weights?


#SOLUTION

In [None]:
_ = autograder.grade('q1')

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [autograder.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
import gsExport
gsExport.generateSubmission()