<a href="https://colab.research.google.com/github/danmackinlay/nextgen-course/blob/main/NG_01_problems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercises in probability

![](CSIRO-Data61-logo.svg)

In this notebook we use some of the tools introduced in the lecture to solve problems in probability.
There are more problems, including more advanced ones, at the end of chapter 2 of Chris Bishop’s _Pattern Recognition and Machine Learning_.

This workbook is for self-assessment; we invite you to solve it.



## Setup

First we need to upgrade Google Colab's elderly versions of the numerical libraries.

In [13]:
# Quote the version specifiers; unquoted, the shell treats ">" as output redirection.
!pip install "scipy>=1.10" "numpy>=1.24"

Did that work? Let's check!

In [37]:
import scipy as sp
import numpy as np

print(f"""
scipy version {sp.version.full_version}
numpy version {np.version.full_version}""")


scipy version 1.7.3
numpy version 1.21.6


Nope, that does _not_ work for me in Google Colab, and probably not for you either! 🤪 Now you have developed two useful skills before even starting!

1. Shouting at a Google product for behaving in an unexpected way.
2. Shouting at Python because the wrong version of something is installed and it is not clear how to update it.

Nothing to do for now, but be aware that some of these exercises might be easier if you learn to install Python on your own machine rather than relying on Google Colab.

Let us continue despondently and set up the notebook.
First we will seed the random number generators so that the generated data are reproducible.



In [22]:
from random import seed, shuffle

SEED = 15

seed(SEED)
rng = np.random.default_rng(SEED)
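
As a quick sanity check on what seeding buys us, here is a minimal sketch (not part of the original notebook) that builds two NumPy generators from the same seed and confirms that they produce identical draws. The names `rng_a` and `rng_b` are illustrative only.

```python
import numpy as np

SEED = 15

# Two independent generators built from the same seed...
rng_a = np.random.default_rng(SEED)
rng_b = np.random.default_rng(SEED)

draws_a = rng_a.normal(size=5)
draws_b = rng_b.normal(size=5)

# ...yield exactly the same sequence of draws.
same_draws = np.array_equal(draws_a, draws_b)
print(same_draws)
```

This is why the utility functions in this notebook build a fresh generator from their `seed` argument: calling them twice with the same seed should give the same data.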


## Some utility functions

You might want to use these to solve some of the problems, which is why they are displayed here.

OTOH, they are confusing to look at, so you might want to hide them for now and read the rest of the notebook.



In [38]:
import scipy.stats

def get_sp_dist(version='xi'):
    """
    The salmon pox distribution

    Coding: 
    0 well and negative
    1 sick and negative
    2 well and positive
    3 sick and positive

    This coding is automated in the next function
    """
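    xk = np.arange(4)
    if version is None:
        # NOTE: these values sum to 1.01, not 1, so rv_discrete is expected
        # to reject them as written; check them against the lecture.
        pk = (0.82, 0.01, 0.09, 0.09)
    elif version == 'xi':
        pk = (0.7, 0.02, 0.08, 0.2)
    else:
        # Fail loudly instead of hitting an undefined `pk` below.
        raise ValueError(f"unknown version: {version!r}")
    return sp.stats.rv_discrete(0, 3, name='sp', values=(xk, pk))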


def sp_coding(c):
    """
    convert a salmon pox integer rv into a tuple of strings.
    """
    if c == 0:
        return 'well', 'negative'
    elif c == 1:
        return 'sick', 'negative'
    elif c == 2:
        return 'well', 'positive'
    elif c == 3:
        return 'sick', 'positive'


def get_sp_data(n=100, version='xi', seed=15):
    """
    Generate some data from the salmon pox distribution
    """
    rng = np.random.default_rng(seed)
    
    sp_dist = get_sp_dist(version=version)
    data = sp_dist.rvs(size=n, random_state=rng)
    return [sp_coding(c) for c in data]


def get_wh_dist():
    """
    armspan/height distribution
    """
    mean = np.array([2, 3.5]).reshape(-1,1)
    covar = np.array(
        [1**2, 1*1.5*0.8, 1*1.5*0.8, 1.5**2]
        ).reshape(2,2) * 0.25
    return sp.stats.multivariate_normal(mean=mean.ravel(), cov=covar)


def get_wh_data(n=200, seed=15):
    """
    Generate some data from the armspan/height distribution.
    """
    rng = np.random.default_rng(seed)
    wh_dist = get_wh_dist()
    return wh_dist.rvs(size=n, random_state=rng)

## Exercise 01

In this exercise we revisit the Salmon Pox problem from the lecture.
There is a new variant of the disease, the $\xi$ variant, so we re-test the village and analyse the results.
Our test results are as follows:

In [39]:
sickness_status = get_sp_data()

What is the false negative rate in the village?
What does that tell us about the false negative rate in the wider population?
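
To get started, it may help to tally how often each (health, test result) combination occurs. The sketch below uses a small hand-made list, `toy_status`, as a hypothetical stand-in for `sickness_status` (the real data come from `get_sp_data()`), and counts the combinations with `collections.Counter`. Turning the counts into rates is left as the exercise.

```python
from collections import Counter

# Hypothetical stand-in with the same shape as `sickness_status`:
# a list of (health, test result) tuples.
toy_status = [
    ("well", "negative"),
    ("well", "negative"),
    ("sick", "positive"),
    ("well", "positive"),
    ("sick", "negative"),
    ("sick", "positive"),
]

counts = Counter(toy_status)

# Print a 2x2 contingency table of counts.
for health in ("well", "sick"):
    for result in ("negative", "positive"):
        print(f"{health:>4} / {result:<8}: {counts[(health, result)]}")
```

Replacing `toy_status` with `sickness_status` should give the corresponding table for the village.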

Given a positive test result for an individual, what is the probability that the individual is infected?

Suppose you visit a new town, on a new continent, and you know nothing about the prevalence of $\xi$-Salmon Pox in the population.
You apply the diagnostic test to a new individual, and discover that they test positive.
What can you say about the probability that they are infected with $\xi$-Salmon Pox?

## Exercise 02

We move to a new village. The village has 200 people, and they want a commemorative kilt to mark their successful weathering of the Salmon Pox outbreak.
The height and arm-span measurements of the people are recorded in the following dataset:

In [40]:
D = get_wh_data()

The cost of the kilt is  $w^2\sqrt{h}$. What is $\mathbb{E}[\text{cost}]$?
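
One way to approach this is to note that an expectation under the empirical distribution of a dataset is the average of the function over the rows. The sketch below illustrates that on made-up numbers with a deliberately different function, $g(w,h)=h/w$. The array `toy` and the function `g` are hypothetical, and the assumption that column 0 holds $w$ and column 1 holds $h$ should be checked against `get_wh_dist`. It also shows that averaging a nonlinear function is generally not the same as applying the function to the averages.

```python
import numpy as np

# Hypothetical data: one row per person, columns assumed to be (w, h).
toy = np.array([
    [1.0, 2.0],
    [2.0, 2.0],
    [4.0, 8.0],
])


def g(w, h):
    """An illustrative nonlinear function; not the kilt cost."""
    return h / w


w, h = toy[:, 0], toy[:, 1]

# Empirical expectation: average g over the rows.
mean_of_g = np.mean(g(w, h))

# Plugging in the averages instead gives a different number in general.
g_of_means = g(np.mean(w), np.mean(h))

print(mean_of_g, g_of_means)
```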

## Exercise 03

We mentioned in the lecture the trade-off between binning data and distributing the whole dataset.
How can we quantify this?
Write $\mathcal{D}$ for the original dataset, and $\mathcal{D}(B)$ for that same dataset binned into a set of $B$ bins.
We write $(W,H)\sim \operatorname{Law}(\mathcal{D})$ to mean that $(W,H)$ is simulated by picking random $w$ and $h$ values (with replacement) from the distribution of the data in $\mathcal{D}$, and $X\sim \operatorname{Law}(\mathcal{D}(B))$ to mean that $X$ is simulated in the same way by picking random _binned_ values from the distribution of the _binned_ data in $\mathcal{D}(B)$.
We handle binned data the same way we did in the lecture, i.e. each datapoint is summarised by the centre of the bin it falls into.
If we want to write about the expected (i.e. average) value of a function of $W$ and $H$ we write $\mathbb{E}_{W,H\sim L}[f(W,H)]$ where $L$ is the law that governs the distribution of $W$ and $H$.

Plot the accuracy of the binned approximation to the data for the value $f(W,H)$, where the horizontal axis is the number of bins and the vertical axis is the root-mean-squared error of the approximation, i.e. $\sqrt{\left(\mathbb{E}_{(W,H)\sim \operatorname{Law}(\mathcal{D}(B))}[f(W,H)]-\mathbb{E}_{(W,H)\sim \operatorname{Law}(\mathcal{D})}[f(W,H)]\right)^2}$.

Oh wait! We have not said what $f(W,H)$ is. Start with something simple like $f:(W,H)\mapsto W\cdot H$.


HINT:
You can save yourself some labour using [scipy.stats.binned\_statistic](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic.html) and/or the classic combination of [numpy.digitize](https://numpy.org/devdocs/reference/generated/numpy.digitize.html#numpy.digitize) and [numpy.histogram](https://numpy.org/devdocs/reference/generated/numpy.histogram.html#numpy.histogram).
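
As a rough illustration of the `numpy.digitize` route, the sketch below replaces each value in a one-dimensional array by the centre of the equal-width bin it falls into. The helper name `bin_to_centres` and the choice of equal-width bins spanning the data range are assumptions made for illustration, not necessarily the binning used in the lecture. For two-dimensional data, one option would be to apply it to each column separately.

```python
import numpy as np


def bin_to_centres(x, n_bins):
    """Replace each value in 1-D array x by the centre of its bin.

    Hypothetical helper: uses n_bins equal-width bins from x.min() to x.max().
    """
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    # digitize returns 1-based bin indices; shift to 0-based and clip so
    # that the maximum value lands in the last bin rather than past it.
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    return centres[idx]


x = np.array([0.0, 0.1, 0.9, 1.0])
binned = bin_to_centres(x, n_bins=2)
print(binned)
```

With a helper like this, the binned and unbinned expectations can be compared by averaging $f$ over the binned and the original values respectively.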

In [41]:
from matplotlib import pyplot as plt
# Load the dataset itself (get_wh_dist would return the distribution object, not data).
D = get_wh_data()

OK, now try that exercise again, but this time set $f:(W,H)\mapsto 10^W \cos(15H)$.
How does the plotted curve differ?

Can you find a better way than the root-mean-squared error of measuring the “distance” between these approximated expectations and the expectations under the original data?