In [39]:
from datascience import *
from prob140 import *
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
%matplotlib inline
import numpy as np
from scipy import stats

The course staff would like to know which topics you feel least prepared for going into the last week of class. Please complete this [survey](http://bit.ly/2p0sA6l) so that we can see which topics most need better review material. Feel free to fill out the survey after lab or at home, but please finish it before the end of the weekend!

# Lab 13: Chinese Restaurant Process, Part II #
This lab is a continuation of Lab 12. Please review the description of the Chinese Restaurant process provided at the beginning of that lab.

In Lab 12 you studied the distribution of the number of tables formed by $N$ people in a Chinese Restaurant process with parameter $\theta$. You noticed that with high probability there are not many tables compared to $N$, and you showed that the expected number of tables is roughly $\theta \log(N)$. 
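As a quick numerical reminder of that approximation, here is a minimal sketch that assumes the seating rule coded in `cr` below: a person who arrives when $i$ people are seated starts a new table with chance $\theta/(i + \theta)$. The function name `expected_tables` is hypothetical and is not part of either lab.

```python
import math

def expected_tables(N, theta):
    """Exact expected number of tables after N people are seated.

    A person arriving when i people are seated starts a new table
    with chance theta/(i + theta), so the expectation is the sum of
    these chances for i = 0, 1, ..., N-1.
    """
    return sum(theta / (i + theta) for i in range(N))

# Compare the exact expectation with theta * log(N) for theta = 1
for N in (100, 1000, 10000):
    exact = expected_tables(N, 1)
    approx = 1 * math.log(N)
    print(N, round(exact, 2), round(approx, 2))
```

The two columns differ by a term that stays bounded as $N$ grows, which is why the approximation is described as rough.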

You also saw by simulation that the distribution of people across tables was typically quite uneven, with a few tables proving to be more popular than others.

In this lab you will study the long run behavior of the proportion of people at Table 1.

For ease of reference, here is the definition of the function `cr` from Lab 12. You defined it to take `N` and `theta` as its arguments, run the Chinese Restaurant process with parameter `theta` until `N` people have been seated, and return an array of the counts of people at the tables in the order of table formation.

In [42]:
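def cr(N, theta):
    tables = make_array()    # labels of the tables formed so far: 1, 2, ...
    people = make_array()    # people[j] is the count at table j+1
    
    for i in range(N):
        n = sum(people)                # number of people seated so far
        new_table = len(tables) + 1    # label the next new table would get
        
        # Existing table j+1 is chosen with chance people[j]/(n+theta);
        # a new table is started with chance theta/(n+theta)
        tbl_choices = np.append(tables, new_table)
        tbl_probs = np.append(people, theta)/(n+theta)
    
        choice = int(np.random.choice(tbl_choices, p = tbl_probs))
    
        if choice == new_table:
            tables = tbl_choices
            people = np.append(people, 1)
        else:
            people[choice-1] = people[choice-1]+1 
    
    return people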

### Part 1. The Rich Get Richer ###

In our analysis of the Chinese Restaurant stochastic process, we will say that Person $n$ enters the system at time $n$. Thus time is always equal to the total number of people in the system at that time.

In this Part you will follow the number of people at Table 1 as the process evolves.

- At time 1 the number of people at Table 1 is 1 because there is only one person and that person sits at Table 1. 
- At time 2 the number is either 2 (if Person 2 chooses Table 1) or 1 (if Person 2 starts a new table). 
- And so on.
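The time 2 case in the list above can be made concrete. The sketch below assumes the seating rule used in `cr`: with one person already seated, Person 2 joins Table 1 with chance $1/(1 + \theta)$ and starts a new table with chance $\theta/(1 + \theta)$. The helper name `w2_distribution` is hypothetical.

```python
from fractions import Fraction

def w2_distribution(theta):
    """Distribution of the number of people at Table 1 at time 2.

    When Person 2 arrives, one person is seated, so Person 2 joins
    Table 1 with chance 1/(1 + theta) and starts Table 2 otherwise.
    """
    theta = Fraction(theta)
    join = 1 / (1 + theta)
    return {1: 1 - join, 2: join}

# The three values of theta used throughout this lab
for theta in (Fraction(1, 2), 1, 2):
    print(theta, w2_distribution(theta))
```

Exact fractions are used so that the effect of $\theta$ is easy to read: the smaller $\theta$ is, the more likely Person 2 is to join Table 1.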

### a) ###

Modify the definition of `cr` to define a function `t1_counts` that takes `N` and `theta` as arguments and does the following:

- Runs the Chinese Restaurant Process with parameter `theta` till time `N`.
- Then returns an array of length `N` such that the $i$th element of the array is the number of people at Table 1 at time $i$. 

Thus the first element of the returned array should always be 1.

In [None]:
def t1_counts(N, theta):
    
    t1 = make_array()    # array of Table 1 counts
    
    ...
    ...
    
    return t1

Run the following cell several times and check that the output makes sense. By now you have probably discovered that this goes much faster if you run the cell by using Control-Return instead of Shift-Return.

Change the value of $\theta$ to 0.5 and also to 2, run the cell a few times with each $\theta$, and make sure that the output still makes sense.

In [26]:
t1_counts(2, 1)

Choose from the three options below to complete the sentence:

For every number of people `N` and positive parameter `theta`, the sequence of entries in the array `t1_counts(N, theta)` should be

(i) non-increasing.

(ii) non-decreasing.

(iii) not necessarily either non-increasing or non-decreasing.

*Provide your answer and reasoning in this Markdown cell.*

### b) ###

Now define a function `plot_t1_counts` that takes `N`, `theta`, and `repetitions` as its arguments and for each repetition does the following:

- Runs the Chinese Restaurant process with parameter `theta` till time `N` and keeps track of the number of people at Table 1 at each time 1 through $N$.
- Displays a graph of the counts at Table 1 versus time.

We will call each graph a *path* of the number of people at Table 1 as people enter the system.

In [44]:
def plot_t1_counts(N, theta, repetitions):
    n = ...
    for i in range(repetitions):
        plt.plot(n, ..., lw=2)  

### c) ###

Run the following cell several times. Then change $N$ and $\theta$ and run the cell again. Make sure you include $\theta = 0.5$ and $\theta = 2$. If you change the number of paths, keep it fairly small so that you can see the individual paths.

In [46]:
N = 100
theta = 1
plot_t1_counts(N, theta, 10)
plt.xlabel('Time')
plt.title('Number of People at Table 1');

### d) ###
Briefly summarize what the paths are likely to look like, based on your observations in (c).

*Provide your answer and reasoning in this Markdown cell.*

### e) ###
Let $W_n$ be the number of people at Table 1 at time $n$, and consider the rate of change of $W_n$ as a function of $n$. Based on your observations in (c), for paths that have a high rate of change when $n$ is small, is the rate of change typically high or typically low as $n$ gets larger?

*Provide your answer and reasoning in this Markdown cell.*

### f) ###
The Chinese Restaurant process is said to have the property that "the rich get richer." Briefly explain this in light of your answer to (e).

*Provide your answer and reasoning in this Markdown cell.*

### Part 2. Long Run Proportion at Table 1 ###
Now track the *proportion* of people at Table 1, instead of the number of people at the table.

### a) ###
Define a function `plot_t1_proportions` that's just like `plot_t1_counts` except that it plots the proportions of people at Table 1 instead of the counts.

In [None]:
def plot_t1_proportions(N, theta, repetitions):
    ...
    ...

### b) ###
Run the cell below several times, then change $N$ and $\theta$ as in Part 1(c) and run it again several times.

In [50]:
N = 100
theta = 1
plot_t1_proportions(N, theta, 10)

### c) ###
What feature of the paths helps confirm the following result?

If $W_n$ is the number of people at Table 1 at time $n$, then the proportion $\frac{W_n}{n}$ converges with probability 1 as $n \to \infty$.

*Provide your answer and reasoning in this Markdown cell.*

### Part 3. Limit Distribution of the Proportion ###

Let $W = \lim_{n \to \infty} \frac{W_n}{n}$ be the limit of the proportion of people at Table 1.

Clearly the possible values of $W$ lie in the interval $(0, 1)$. In this part you will identify the distribution of $W$ over the unit interval.

To do this, you will simulate the distribution of $\frac{W_N}{N}$ for a large $N$, and compare it with a known distribution.

### a) ###
Define a function `t1_prop_at_fixed_time` that takes `N`, `theta` and `repetitions` as its arguments. In each repetition, the function runs the Chinese Restaurant process with parameter `theta` till time `N`, and computes the proportion of people at Table 1 at time `N`. The function returns an array of the simulated proportions.

In [None]:
def t1_prop_at_fixed_time(N, theta, repetitions):
    
    t1_proportion = make_array()
    for i in np.arange(repetitions):
        ...
        ...
    return t1_proportion

Run the cell below to check that the output is an array of proportions and that the array has the correct length.

In [66]:
t1_prop_at_fixed_time(100, 1, 5)
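If you would like to automate that check, one possible sketch follows. It assumes only that Table 1 holds between 1 and `N` people at time `N`, so each proportion times `N` should be a whole number in that range. The helper `check_proportions` is hypothetical and is not part of the lab or the autograder.

```python
def check_proportions(props, N, repetitions):
    """Return True if props looks like valid Table 1 proportions at time N."""
    if len(props) != repetitions:
        return False
    for p in props:
        count = p * N
        # Table 1 always holds between 1 and N people
        if not (1 <= round(count) <= N):
            return False
        # p * N should be a whole number, up to floating-point error
        if abs(count - round(count)) > 1e-9:
            return False
    return True

# Example with made-up values, not simulation output
print(check_proportions([0.37, 1.0, 0.01, 0.5, 0.82], 100, 5))
```

The function accepts any sequence of numbers, so it should work on a list as well as on the array returned by your function.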

### b) ###
Complete the function definition in the cell below. The function should display the empirical histogram of $\frac{W_N}{N}$ with the beta $(1, \theta)$ density overlaid.

In [67]:

# Empirical distribution of W_N/N
# with beta (1, theta) density overlaid

def plot_limit_t1_proportion(N, theta, repetitions):
    t1_props = ...
    Table().with_column('Proportion at Table 1', t1_props).hist(bins=20)
    x = np.arange(0, 1.01, 0.01)
    plt.plot(x, ..., color='red', lw=2)
    plt.title('Overlaid Density: Beta'+ r'$(1, \theta)$');

### c) ###
Use `plot_limit_t1_proportion` to display the empirical histogram of $\frac{W_{100}}{100}$ based on 2000 repetitions of the following:

- Run the Chinese Restaurant process with $\theta = 1$ till time 100, and compute $\frac{W_{100}}{100}$

How does the empirical distribution compare with the overlaid beta $(1, \theta)$ density? 

[Note: If you want to experiment with making $N$ larger than 100, be prepared to be patient as the code chugs along.]

### d) ###
Repeat Part (c) keeping everything the same but changing $\theta$ from 1 to 0.5.

### e) ###
Give a brief intuitive explanation why the long run proportion of people at Table 1 is more likely to be high than low when $\theta = 0.5$. It will help to think about what happens when the first few people enter the system, and also to keep in mind Part 1 of this lab.

*Provide your answer and reasoning in this Markdown cell.*

### f) ###
Repeat Part (c) keeping everything the same but changing $\theta$ from 1 to 2.
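
As a numerical companion to the three values of $\theta$ used in this part, the sketch below evaluates the beta $(1, \theta)$ distribution in closed form. The density is $\theta(1-x)^{\theta - 1}$ on the unit interval, so the cdf is $1 - (1-x)^\theta$. The function name `beta_1_theta_cdf` is hypothetical.

```python
def beta_1_theta_cdf(x, theta):
    """CDF of the beta (1, theta) distribution at x in [0, 1]."""
    return 1 - (1 - x) ** theta

# Chance that a beta (1, theta) variable exceeds 1/2
for theta in (0.5, 1, 2):
    upper_half = 1 - beta_1_theta_cdf(0.5, theta)
    print(theta, round(upper_half, 3))
```

Under the beta $(1, \theta)$ density, the chance of a value above $1/2$ is $(1/2)^\theta$, which is greater than $1/2$ when $\theta < 1$ and less than $1/2$ when $\theta > 1$. You can compare these chances with the shapes of your histograms.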

### Part 4. Deriving the Beta Limit ###
Part 3 uses simulation to give a pretty convincing demonstration that in the limit as the total number of people gets large, the distribution of the proportion of people at Table 1 is beta $(1, \theta)$. In this Part your goal is to use theory to give a pretty convincing demonstration of the same result.

Define the "Beta $(1, \theta)$ Binomial" process as follows:
- $X$ has beta $(1, \theta)$ distribution.
- Given $X = p$, there is a sequence of i.i.d. Bernoulli $(p)$ trials $I_1, I_2, \ldots $.
- $S_n = I_1 + I_2 + \cdots + I_n$ is the number of successes in the first $n$ trials.

This is the process we studied extensively in class, with a general beta prior in place of beta $(1, \theta)$.
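
The three steps of the definition can be sketched in code as follows. This is only an illustration that uses Python's standard `random` module; the function name `beta_binomial_path` and its `rng` argument are hypothetical.

```python
import random

def beta_binomial_path(theta, n, rng=random):
    """Simulate S_1, ..., S_n for the Beta (1, theta) Binomial process."""
    p = rng.betavariate(1, theta)      # X = p drawn from beta (1, theta)
    successes = 0
    path = []
    for _ in range(n):
        successes += rng.random() < p  # one Bernoulli (p) trial
        path.append(successes)         # S_j after j trials
    return p, path

p, path = beta_binomial_path(theta=1, n=20)
print(p, path)
```

The function returns the drawn value of $X$ along with the path so that you can see how the successes accumulate for that particular $p$.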


### a) ###
Return to the [Long Run Proportion of Successes](https://textbook.prob140.org/ch22/Long_Run_Proportion_of_Successes.html) discussion in Chapter 22 of the Course Notes, and review "Beta Prior, Beta Limit" at the end of the section. 

Consider the Beta $(1, \theta)$ Binomial process as defined above.

Fill in the blank with the name of a distribution and the appropriate parameters:

As $n$ gets large, with probability 1 the proportion $\frac{S_n}{n}$ approaches a limit. The distribution of this limit is $\underline{~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~}$.

*Provide your answer and reasoning in this Markdown cell.*

### b) ###
Return to the [Beta-Binomial](https://textbook.prob140.org/ch22/Beta_Binomial.html) discussion of Section 22.2, and review "Prediction: The Distribution of $S_{n+1}$ given $S_n$" at the end of the section.

Consider the Beta $(1, \theta)$ Binomial process.

Find the transition probability $P(S_n = k+1 \mid S_{n-1} = k)$.

Given $S_{n-1} = k$, what are the other possible values of $S_n$ and what are their probabilities?

*Provide your answer and reasoning in this Markdown cell.*

### c) ###
Back to the Chinese Restaurant process with parameter $\theta$. Let $W_n$ be the number of people at Table 1 at time $n$. The goal is to figure out the transition behavior of $W_n$.

Because Table 1 starts deterministically with Person 1 at time 1, we have to be a bit careful about what's random and what's constant. Let's consider $V_n = W_n - 1$, the random number of people *other than Person 1* who are at Table 1 at time $n$.

Find $P(V_{n+1} = k+1 \mid V_n = k)$. This is a straightforward application of the rules of the Chinese Restaurant process, but you have to be careful about how many people are at Table 1 when Person $n+1$ enters the system.

*Provide your answer and reasoning in this Markdown cell.*

### d) ###
Given $V_n = k$, what are the possible values of $V_{n+1}$? Compare the transition behavior of the following two sequences:
- $V_2, V_3, \ldots $ of the Chinese Restaurant process with parameter $\theta$
- $S_1, S_2, \ldots $ of the Beta $(1, \theta)$ Binomial process

*Provide your answer and reasoning in this Markdown cell.*

### e) ###
Explain why the distribution of the long run proportion of people at Table 1 of the Chinese Restaurant process is beta $(1, \theta)$ where $\theta$ is the parameter of the process. That is, explain why the limit of $\frac{W_n}{n}$ as $n \to \infty$ has the beta $(1, \theta)$ distribution.

*Provide your answer and reasoning in this Markdown cell.*

## Survey

Don't forget to fill out the survey!

http://bit.ly/2p0sA6l

In [None]:
_ = autograder.grade('q1')

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [autograder.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
import gsExport
gsExport.generateSubmission()