# Probability and statistics

In one form or another, machine learning is all about making predictions.  
We might want to predict the probability of a patient suffering a heart attack in the next year, given their clinical history.  
In anomaly detection, we might want to assess how likely a set of readings from an airplane’s jet engine would be, were it operating normally.  
In reinforcement learning, we want an agent to act intelligently in an environment.  
This means we need to think about the probability of getting a high reward under each of the available actions.  
And when we build recommender systems we also need to think about probability.  
For example, if we hypothetically worked for a large online bookseller, we might want to estimate the probability that a particular user would buy a particular book, if prompted.  

For this we need to use the language of probability and statistics.  
Entire courses, majors, theses, careers, and even departments, are devoted to probability.  
So our goal here isn’t to teach the whole subject.  
Instead, we hope to get you off the ground: to teach you just enough to start building your first machine learning models, and to give you enough of a flavor for the subject that you can begin to explore it on your own if you wish.  
We’ve talked a lot about probabilities so far without articulating what precisely they are or giving a concrete example.  
Let’s get more serious by considering the problem of distinguishing cats and dogs based on photographs.  
This might sound simple, but it’s actually a formidable challenge.  
To start with, the difficulty of the problem may depend on the resolution of the image.

![](img/cats_and_dogs.png)

While it’s easy for humans to recognize cats and dogs at 320 pixel resolution, it becomes challenging at 40 pixels and next to impossible at 20 pixels.  
In other words, our ability to tell cats and dogs apart at a large distance (and thus low resolution) might approach uninformed guessing.  
Probability gives us a formal way of reasoning about our level of certainty.  
If we are completely sure that the image depicts a cat, we say that the probability that the corresponding label $l$ is cat, denoted $P(l=\mathrm{cat})$, equals $1.0$.  
If we had no evidence to suggest that $l=\mathrm{cat}$ or that $l=\mathrm{dog}$, then we might say that the two possibilities were equally likely, expressing this as $P(l=\mathrm{cat})=0.5$.  
If we were reasonably confident, but not sure, that the image depicted a cat, we might assign a probability $0.5<P(l=\mathrm{cat})<1.0$.

Now consider a second case: given some weather monitoring data, we want to predict the probability that it will rain in Taipei tomorrow.  
If it’s summertime, the rain might come with probability $0.5$.  
In both cases, we have some value of interest.  
And in both cases we are uncertain about the outcome.  
But there’s a key difference between the two cases.  
In the first case, the image is in fact either a dog or a cat; we just don’t know which.  
In the second case, the outcome may actually be a random event, if you believe in such things (and most physicists do).  
So probability is a flexible language for reasoning about our level of certainty, and it can be applied effectively in a broad set of contexts.  

## Basic probability theory

Say that we cast a die and want to know what the chance is of seeing a $1$ rather than another digit.  
If the die is fair, all six outcomes $\mathcal{X} = \{1, \ldots, 6\}$ are equally likely to occur, hence we would see a $1$ in $1$ out of $6$ cases.  
Formally, we state that $1$ occurs with probability $\frac{1}{6}$.  
For a real die that we receive from a factory, we might not know those proportions, and we would need to check whether it is biased.  
The only way to investigate the die is by casting it many times and recording the outcomes.  
For each cast of the die, we’ll observe a value in $\{1, 2, \ldots, 6\}$.  

Given these outcomes, we want to investigate the probability of observing each outcome.  
One natural approach for each value is to take the individual count for that value and to divide it by the total number of tosses.  
This gives us an estimate of the probability of a given event.  
The law of large numbers tells us that as the number of tosses grows, this estimate will draw closer and closer to the true underlying probability.  
Before going into the details of what’s going on here, let’s try it out.  
We can start by importing the necessary packages:

```python
import mxnet as mx
from mxnet import nd
```

Next, we’ll want to be able to cast the die.  
In statistics we call this process of drawing examples from probability distributions *sampling*.  
The distribution which assigns probabilities to a number of discrete choices is called the *multinomial distribution*.  
We’ll give a more formal definition of *distribution* later, but at a high level, think of it as just an assignment of probabilities to events.  
In MXNet, we can sample from the multinomial distribution via the aptly named `nd.sample_multinomial` function.  
The function can be called in many ways, but we’ll focus on the simplest.  
To draw a single sample, we simply pass in a vector of probabilities.

```python
probabilities = nd.ones(6) / 6
print(probabilities)
nd.sample_multinomial(probabilities)
```

```
[ 0.16666667  0.16666667  0.16666667  0.16666667  0.16666667  0.16666667]
<NDArray 6 @cpu(0)>

[2]
<NDArray 1 @cpu(0)>
```
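
The counting estimate described in this section (the count for each face divided by the total number of casts) can also be sketched without MXNet, using only Python's standard library. This is a minimal illustration rather than the chapter's own code: the function name `estimate_die_probabilities`, its `weights` and `seed` parameters, and the chosen numbers of casts are assumptions made for this example.

```python
import random
from collections import Counter


def estimate_die_probabilities(num_casts, weights=None, seed=0):
    """Estimate P(face) for a six-sided die as count / num_casts.

    Hypothetical helper for illustration only.
    weights=None simulates a fair die; a list of six numbers simulates
    a biased die with those relative proportions.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    faces = [1, 2, 3, 4, 5, 6]
    casts = rng.choices(faces, weights=weights, k=num_casts)
    counts = Counter(casts)  # faces never observed get a count of 0
    return {face: counts[face] / num_casts for face in faces}


# As the number of casts grows, the estimates should settle near 1/6.
for n in (10, 1000, 100000):
    estimates = estimate_die_probabilities(n)
    worst_error = max(abs(p - 1 / 6) for p in estimates.values())
    print(f"{n:>6} casts: largest deviation from 1/6 = {worst_error:.4f}")
```

Passing unequal `weights`, for example `[5, 1, 1, 1, 1, 1]`, simulates a biased die, and the same counting procedure should then recover proportions close to the ones used to generate the casts.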