In [1]:
from datascience import *
from prob140 import *
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
%matplotlib inline
import numpy as np
from scipy import stats

# Lab 12: Chinese Restaurant Process, Part I #

The *Chinese Restaurant Process* is a discrete time stochastic process that can be used as a model for clustering. The model does not assume a fixed number of clusters. Instead, the clusters evolve randomly as individuals enter, according to a specified probabilistic structure.

We have distributed the analysis of this process across two labs so that you can reasonably complete them and have time to think about the ideas without feeling rushed. In the two labs you will study:
- the distribution of the random number of clusters
- the "rich get richer" nature of the clustering, and the long run behavior of the fraction of individuals in the first cluster

In both labs, part of the analysis will be by simulation and part by probability calculations. The calculations of probabilities and expectations in the first lab involve only discrete distributions. The second lab will contain long run analysis involving beta distributions and some of the beta-binomial calculations seen in class.

Before we describe the process, here is some history. The process has its origins in the work of [Warren Ewens](https://en.wikipedia.org/wiki/Warren_Ewens) in the early 1970's, in particular the [Ewens Sampling Formula](https://en.wikipedia.org/wiki/Ewens%27s_sampling_formula) of population genetics. Since then, the development of the theory of the stochastic process and its use in machine learning has been very much a Berkeley enterprise: Jim Pitman, David Aldous, and Mike Jordan are among the people involved. The restaurant analogy is due to Jim Pitman and our late colleague Lester Dubins, during one of their regular cafe meetings decades ago. Though there is a popular story that they came up with it while eating at a Chinese restaurant, in fact they were at the cafe Strada, which you will recognize as the cafe you see when you come out of the building after Prob140 class. 

### The Process ###
In keeping with the Chinese Restaurant analogy, think of clusters as groups of people sitting at the same table. We will only consider occupied tables, so the number of tables is equal to the number of clusters. 

The process evolves according to the following rules.
- There is a positive parameter $\theta$.
- People enter the restaurant one at a time.
- Person 1 enters and sits at a table which we will call Table 1.
- Each subsequent person 
    - either joins an existing table with probability proportional to the number of people already at that table, or
    - starts a new table with probability proportional to $\theta$.
- People choose tables independently of each other.
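As a concrete illustration of the rules above, the sketch below works out the seating probabilities for one hypothetical configuration: 10 people seated as 6, 3, and 1, with $\theta = 1$. The names `counts`, `weights`, `total`, and `probs` are illustrative and not part of the lab. The weights are "proportional to" the probabilities, so dividing by their sum turns them into a distribution.

```python
from fractions import Fraction

# Hypothetical configuration: 6, 3, and 1 people at Tables 1, 2, and 3
counts = [6, 3, 1]
theta = 1  # assumed parameter value for this illustration

# Weights: the table sizes for the existing tables, then theta for a new table
weights = counts + [theta]
total = sum(weights)

# Divide each weight by the total so that the probabilities sum to 1
probs = [Fraction(w, total) for w in weights]
print(probs)
```

In this example the 11th person joins Table 1 with chance 6/11, Table 2 with chance 3/11, Table 3 with chance 1/11, and starts Table 4 with chance 1/11.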

Don't worry about running out of room. The restaurant has infinite capacity and each table is infinitely large. You can imagine infinitely many such tables at the start, or imagine new tables appearing magically each time a person's random choice is to start a new table. We prefer the second image because it consists only of the occupied tables.

Note that the tables are not numbered at the start. We number them according to their order of formation. Thus Table 1 is the table at which Person 1 sits. Table 2 is the next new table to be formed. We can't say exactly who starts it, because that's random. Table 3 is the third new table to be formed. And so on.

In Parts 1 and 2 below, you will write code to simulate the Chinese Restaurant process. In Parts 3 through 5 you will study the distribution of the total number of occupied tables after $N$ people have been seated. In other words, you will study the distribution of the number of clusters produced by this process when it has run for $N$ steps.

In this lab you will do a brief qualitative analysis of the distribution of people across tables. In the next lab you will take a closer look at that aspect of the process.

### Part 1. Steps towards Simulating the Process ###
Start by running the cell below to set the parameter $\theta$.

In [2]:
theta = 1

The goal is to create an array `people` that shows the number of people at each table. If there are 10 people, of whom 6 are at Table 1, 3 at Table 2, and 1 at Table 3, then the array created should be [6, 3, 1]. 

All entries in the array must be positive. The length of the array will be the total number of tables formed, which is random.

Note that we are using [] to denote arrays, to reduce writing; but they are intended as arrays, not lists.

The cell below starts off the array `people` when the first person enters and sits at Table 1. Run the cell.

In [3]:
people = make_array(1)

To help code the selection of the next person who enters, we will also keep track of the labels of the tables formed, in the array `tables`. That will be an array of consecutive integers starting at 1. If `people` is [6, 3, 1] then `tables` is [1, 2, 3]. If `people` is [10] then `tables` is [1]. 

Run the cell below to create `tables` when just one person is in the system.

In [4]:
tables = make_array(1)

To write the code below it might help you to keep the array [6, 3, 1] in mind as an example of what `people` might look like after 10 people are seated.

### a) ###
How can you use `people` to determine how many people are already seated when a new person comes in? Assign this to `n` in the cell below. Your code should work not just for the "starter" array `people` created above, but for any `people` array that shows the number of people at each table.

In [None]:
n = ...

### b) ###
Use `tables` to create an array that contains the choices of tables that the next newly entering person will have, by filling out the cell below. There are no probabilities involved, yet. Just create an array of choices, that is, all the possible tables at which the new person could sit.

In [None]:
# What is the label of the new table that this person might start?
new_table = ...

# tbl_choices is the array of all possible table choices that this new person has:
tbl_choices = ...

### c) ###
The array `tbl_choices` contains all the possible tables the new person can select. Create an array `tbl_probs` containing the corresponding probabilities of selection. That is, the $i$th element of `tbl_probs` should be the probability that the new person selects the $i$th element of `tbl_choices`.

In [None]:
# Array of probabilities of selecting the different tables
tbl_probs = ...

Do a ballpark check that `tbl_probs` is a probability distribution, by running the cell below.

In [103]:
sum(tbl_probs)

### d) ###
Now make the new person's choice of table, and assign it to the name `choice`. Use `np.random.choice` to do so. The call `np.random.choice(values, p = probabilities)` makes one random draw from the distribution with possible values in the array `values` and the corresponding probabilities in the array `probabilities`.

One technicality: You have to force the choice to be of the `int` type, because you will have to use it as an array index.
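Here is a small sketch of how `np.random.choice` and the `int` conversion behave, using made-up values and probabilities that have nothing to do with the restaurant. The names `values`, `probabilities`, `draw`, and `draw_as_int` are illustrative only.

```python
import numpy as np

# Hypothetical values and probabilities, unrelated to the restaurant
values = np.array([10, 20, 30])
probabilities = np.array([0.5, 0.3, 0.2])

# One random draw from the distribution on `values`
draw = np.random.choice(values, p=probabilities)

# The draw comes back as a NumPy integer; int() converts it to a plain Python int
draw_as_int = int(draw)
print(draw, type(draw).__name__, type(draw_as_int).__name__)
```

The probabilities passed as `p` must sum to 1 and must line up, position by position, with the entries of the first argument.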

In [None]:
# The random choice made by the new person

choice = int(...)

### e) ###
Write code that updates `tables` and `people` appropriately based on the choice made in Part (d). Ask yourself the following questions before writing your code:
- Under what circumstances should `tables` be updated, and how should it be updated?
- Under what circumstances should `people` be updated, and how should it be updated?

Your code can use any of the quantities calculated in earlier cells: `theta`, `people`, `tables`, `new_table`, `tbl_choices`, `tbl_probs`, `choice`.

After you run the cell, both `tables` and `people` should be consistent with `choice`. If you need to troubleshoot, remember to Run All Above first, so that all the variables get reset.

Run the cell below to check that your code gives consistent answers for Person 2.

In [106]:
# Person 2's choice of table

choice, tables, people

### Part 2. Code the Simulation ###

### a) ###
Collect the code in Part 1 and use it to define a function `cr` that takes `N` and `theta` as its arguments, runs the Chinese Restaurant Process till `N` people have been seated, and returns an array of the number of people at each table in the order of table formation.

- If you call your function with the arguments 1 and any positive $\theta$, it should return the array [1]. That represents the one person seated at Table 1.
- If you call your function with the arguments 2 and any positive $\theta$, it should return either the array [2] if both people sit at Table 1, or [1, 1] if Person 2 starts a new table.
- If you call your function with the arguments 3 and any positive $\theta$, then it should return one of the following arrays: [3], [2, 1], [1, 2], [1, 1, 1].
- And so on.

### b) ###
Perform a basic ballpark check that your function is working right: run the cell below and say whether the expression is evaluating to the right answer.

In [111]:
sum(cr(1000, 1))

### Part 3. Run the Process ###

### a) ###
Run the process with $\theta = 1$ and 100 people. Display the simulated results in a table that has two columns:
- `Table`: The labels of the occupied tables, in order of formation, so that the first entry is 1 and the last entry is the number of tables at the end of the run.
- `People at Table`: The number of people seated at the corresponding table.

Run the cell several times to get a sense of the variability of the results.

In [None]:
N = 100
theta = 1

...
...
simulated_process = Table().with_columns(
    'Table', ...,
    'People at Table', ...
)

simulated_process.show()

### b) ### 
Repeat Part (a), this time displaying a bar graph of the distribution of the number of people at the different tables.

In [None]:
...
...

simulated_process.barh(...)
plt.xlim(0, N);

### c) ###
Repeat Part (b) for varying values of $N$ and $\theta$. Be sure to try $N = 100, 1000$, and $10000$ with $\theta = 0.5, 1,$ and $2$. Keep your eye on the number of tables as well as the distribution of people across tables.

### d) ###
Give a brief qualitative description of what you have seen in the simulations for $N = 100$ and $\theta = 0.5, 1$, and $2$. Address questions such as:
- Are there lots of tables, not many, or is it not possible to tell?
- Is the distribution of the number of people pretty uniform across the tables? If not, describe what you see about the number of big and small clusters.
- In what way does $\theta$ make a difference? Or is it not possible to tell?

### Part 4. Distribution of the Number of Tables ###

### a) ###

Define a function `num_tables` that takes `N`, `theta`, and `repetitions` (the number of times to simulate the process) as its arguments. In each repetition, the function runs the Chinese Restaurant Process with `N` people and counts the number of tables occupied by those people once they are all seated. The function should return an array of length `repetitions` consisting of the simulated table counts.

In [15]:
def num_tables(N, theta, repetitions):
    ...
    ...
    return tbls

### b) ###
Draw the empirical histogram of the number of tables, in the case $N = 100$ and $\theta = 1$, based on 2000 repetitions. Be prepared to wait as the simulation chugs.

In [None]:
...
Plot(emp_dist(simulated_table_counts))
plt.xlabel('Number of Tables');

### c) ###
Draw the empirical histogram of the number of tables in the case $N = 100$ and $\theta = 0.5$, based on 2000 repetitions. 

### d) ###
Draw the empirical histogram of the number of tables in the case $N = 100$ and $\theta = 2$, based on 2000 repetitions. Save the array of simulated values because you will need it in Part 5.

### e) ###
Are the empirical histograms above consistent with your answers to Part 3(d)?

### Part 5. Expectation and SD of the Number of Tables ###

### a) ###
Fix a positive integer $N$ and suppose the process is run till $N$ people have been seated. Let $T_N$ be the number of tables. Find $E(T_N)$.

[It is helpful to note that the number of tables is equal to the number of people who started new tables.]

*Provide your answer and reasoning in this Markdown cell.*

### b) ###
Calculate your answer in Part (a) in the case $\theta = 2$ and $N = 100$, and compare it with the mean of the simulated values in Part 4(d).

### c) ###
The sum $\sum_{i=1}^k 1/i$ grows slowly with $k$, and is roughly $\log(k)$ for large $k$. Set $\theta = 2$. On the same graph, plot as a function of $N$ in the range 100 to 5000:
- Your answer to Part (a).
- $\theta \log(N)$.

The graphs should justify the statement, "The number of tables grows like $\theta \log(N)$, to a rough approximation."
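The claim that $\sum_{i=1}^k 1/i$ is roughly $\log(k)$ can be checked numerically. The sketch below is only an illustration of that approximation, not the plot requested in this part; the names `harmonic` and `gaps` are made up for the example.

```python
import numpy as np

# Compare the sum 1 + 1/2 + ... + 1/k with log(k) for a few values of k
gaps = {}
for k in [10, 100, 1000, 10000]:
    harmonic = np.sum(1 / np.arange(1, k + 1))
    gaps[k] = harmonic - np.log(k)
    print(k, round(harmonic, 3), round(np.log(k), 3), round(gaps[k], 3))
```

The difference between the two quantities does not vanish but settles near a constant (about 0.577), which is small compared with $\log(k)$ once $k$ is large. That is the sense in which the approximation is "rough".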

### d) ###
For $N$ any positive integer and $\theta > 0$, find $SD(T_N)$. 

*Provide your answer and reasoning in this Markdown cell.*

### e) ###
Calculate your answer in Part (d) in the case $\theta = 2$ and $N = 100$, and compare it with the SD of the simulated values in Part 4(d).

$T_N$ is said to have a [Poisson Binomial](https://en.wikipedia.org/wiki/Poisson_binomial_distribution) distribution. The shape of the histograms should remind you of both the binomial and the Poisson histograms.
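A Poisson binomial distribution is the distribution of a sum of independent indicators whose success probabilities need not be equal. As a minimal sketch, the code below computes the exact distribution for three hypothetical indicators with probabilities 0.9, 0.5, and 0.2. These numbers are made up for illustration and are not the probabilities that arise in the restaurant.

```python
import numpy as np

# Hypothetical success probabilities for three independent indicators
p = np.array([0.9, 0.5, 0.2])

# Build the exact distribution of the sum by convolving the
# two-point distributions [1 - p_i, p_i] one indicator at a time
pmf = np.array([1.0])
for p_i in p:
    pmf = np.convolve(pmf, [1 - p_i, p_i])

# pmf[k] is the chance that exactly k of the indicators equal 1
for k, prob in enumerate(pmf):
    print(k, round(prob, 4))
```

When all the success probabilities are equal, the same construction gives a binomial distribution.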

In [1]:
_ = autograder.grade('q1')

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [autograder.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
import gsExport
gsExport.generateSubmission()