# CHEM 1000 - Spring 2025
Prof. Geoffrey Hutchison, University of Pittsburgh

## 9 From Probability to Statistics

These lecture notes on probability and statistics include substantial material not found in our text.

By the end of this session, you should be able to:
- Understand mean, median, mode, standard deviation, and standard error
    - Know when you might want mean or median
    - Know the difference between standard deviation and standard error of the mean
- Know a little bit about false positives and false negatives thanks to Bayes

### Statistics from Probability Distributions

As a reminder, last time we talked about probability distributions and several important properties, all related to the shape of the distribution:
- the total (e.g., for a probability it should be 1 = 100%)
- the [mean](https://en.wikipedia.org/wiki/Expected_value) $\mu$ (i.e., the center or "expected value")
- the [variance](https://en.wikipedia.org/wiki/Variance) $\sigma^2$ (i.e., the width)
    - you're probably more familiar with the standard deviation $\sigma$
- the [skewness](https://en.wikipedia.org/wiki/Skewness) (i.e., the asymmetry of the distribution)
- the [kurtosis](https://en.wikipedia.org/wiki/Kurtosis) (i.e., how thin or thick the "tail" of the distribution)

For normal (Gaussian) distributions, there are useful rules about how much of the data is found within certain intervals (e.g., $\pm 2\sigma$) around the mean.
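As a quick check of those interval rules, here is a minimal sketch using only the Python standard library. For a Gaussian, the fraction of the distribution within $\pm k\sigma$ of the mean is $\mathrm{erf}(k/\sqrt{2})$. The helper name `fraction_within` is made up for this illustration.

```python
import math

def fraction_within(k):
    """Fraction of a Gaussian distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k}σ: {fraction_within(k):.4f}")
```

This gives roughly 68%, 95%, and 99.7% for 1, 2, and 3 standard deviations.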

### Real (Finite) Data

There are a few very important points.

1. For real data, we almost never know the **true** distribution, much less the true mean, true standard deviation, etc.
    - To paraphrase my college statistics professor: unless you hear a deep voice say "Your data is truly from a Gaussian distribution with mean 0.5 and standard deviation 0.01...", you don't know the true distribution.
2. What we have instead is a finite approximation, called the "sampling distribution".
3. We indicate this using regular (Latin) letters. So if the true mean is $\mu$ and the true standard deviation is $\sigma$, we'll use $\bar{x}$ and $s$ respectively - these are approximations to the true values.
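To make the distinction concrete, here is a small sketch that pretends we do know the true distribution (the Gaussian from the quote above, with $\mu = 0.5$ and $\sigma = 0.01$), draws a finite sample, and calculates $\bar{x}$ and $s$. The sample size of 50 and the random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the example is repeatable

true_mu, true_sigma = 0.5, 0.01  # "true" values we normally would not know
sample = rng.normal(true_mu, true_sigma, size=50)

x_bar = np.mean(sample)          # sample mean, estimates mu
s = np.std(sample, ddof=1)       # sample standard deviation, estimates sigma

print("x_bar =", x_bar)
print("s     =", s)
```

The results should be close to, but not exactly, 0.5 and 0.01. Note that `ddof=1` asks NumPy for the sample standard deviation (dividing by $n-1$); the default `ddof=0` divides by $n$.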

### Non-Normal Distributions, Medians, and Robust Stats

Consider some bond lengths - these are gathered from the [Crystallography Open Database](https://www.crystallography.net):

<img src="../images/C2n2.png" width="300" />

Notice that this distribution is somewhat asymmetric. Do you think the mean falls exactly at the peak?

<img src="../images/C2N2-mean.png" width="300" />

A few other "central measures" are useful for cases like this:

- The *mode* is the most frequent value. In this case, you'd average the two central columns and get 1.345 Å - shorter than the average.

- The *median* is the middle value. You sort all the values and find the one in the middle. If there's an even number of values, you average the two in the middle.

- The *trimmed mean* - ignore the top and bottom 5% and calculate the mean of the rest.
    - Other trimming values can be used, especially if you have a lot of data (e.g., toss the top 10% and bottom 10%, take the mean of the remaining 80%).

For this data, both the median and the mode are shorter than the mean (i.e., to its left) because the data is skewed toward longer lengths.

These are examples of so-called "robust statistics" - that is, statistics which are less affected by outlier points or skew.

Another good example is net worth. What if Bill Gates or Jeff Bezos decided to sit in and audit our class? (Hi Bill!) If we take the mean net worth, it goes way, way up. Clearly that's not a very representative value.
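Here is a rough sketch of the net worth example. All of the numbers are made up, and `trimmed_mean` is a small helper written just for this illustration (it drops the lowest and highest `cut` fraction of the values, rounding the number of dropped points down).

```python
import numpy as np

# hypothetical net worths in dollars for a small class (made-up numbers)
class_worth = np.array([5_000, 8_000, 12_000, 15_000, 20_000,
                        25_000, 30_000, 40_000, 55_000, 80_000], dtype=float)
# the same class with one hypothetical billionaire auditing
with_guest = np.append(class_worth, 100e9)

def trimmed_mean(values, cut=0.05):
    """Mean after dropping the lowest and highest `cut` fraction of values."""
    ordered = np.sort(values)
    k = int(len(ordered) * cut)  # number of points dropped from each end
    return ordered[k:len(ordered) - k].mean()

for label, data in [("class only", class_worth), ("with guest", with_guest)]:
    print(label,
          "| mean:", np.mean(data),
          "| median:", np.median(data),
          "| trimmed mean (10%):", trimmed_mean(data, cut=0.10))
```

The mean jumps into the billions once the guest is added, while the median and the trimmed mean stay in the tens of thousands.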

In [None]:
import numpy as np

# we can load in the c-n bond length data set ~8000 bonds
cn_lengths = np.loadtxt('../data/c2n2-bonds.csv', skiprows=1)

In [None]:
print('mean', np.mean(cn_lengths))
print('std dev.', np.std(cn_lengths))

### Errors and Variance

Why would anyone use the mean if it's distorted by an outlier or skew in the data?

Let's think about three possible data sets:
- A set of 8 bond lengths
- My 8,000 bond length example
- A set of 8,000,000 C-N bond lengths across millions of molecules

Which one do you think will have more accurate statistics?

It's something of a trick question. Clearly it's possible to have lots of bad data, but usually having more data improves the accuracy of predictions.

This leads us to the concept of the [**standard error**](https://en.wikipedia.org/wiki/Standard_error):

(From Wikipedia)
>Put simply, the standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean, whereas the standard deviation of the sample is the degree to which individuals within the sample differ from the sample mean

In other words, our intuition reflects the standard error - the more data we gather, the smaller the error bars around the mean of the data.

Generally, we use the estimate:

$$
\sigma_{\bar{x}} \approx \frac{s}{\sqrt{n}}
$$

where $n$ is the number of data points and $s$ is the sample standard deviation. If we want to cut the error bars in half, we need 4x as many data points.
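A small simulation can illustrate this scaling. The sketch below assumes Gaussian data with a made-up mean (1.35) and standard deviation (0.02), generates many data sets of size $n$, and measures how much the sample means vary. The function name `spread_of_means` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable

def spread_of_means(n, trials=2000, sigma=0.02):
    """Standard deviation of the sample mean across many simulated data sets of size n."""
    samples = rng.normal(1.35, sigma, size=(trials, n))
    return samples.mean(axis=1).std(ddof=1)

for n in (100, 400):
    predicted = 0.02 / np.sqrt(n)
    print(n, "simulated:", spread_of_means(n), "predicted sigma/sqrt(n):", predicted)
```

Going from 100 to 400 points should cut the spread of the means roughly in half, matching the formula.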

<div class="alert alert-block alert-success">

In science, it's important to understand the distinction between the standard error and standard deviation (and when to use one or the other).
    
</div>

But what about the median? Isn't there a standard error of the median?

Yes, but it's not always trivial to calculate - usually you've picked the median because the data isn't from a Gaussian distribution, so the error bars on the median aren't clear.

Instead, you can use a trick called "bootstrapping." Basically, you repeatedly resample random chunks of your data, calculate the median of each chunk, and then see how much those medians vary.

In [None]:
from sklearn.utils import resample
from statistics import mean, median

# bootstrap confidence intervals
# repeatedly resample the data into random subsets
# .. take the mean and medians of the subset
def boot_sample(data, iterations=1000, fraction=0.75):
    # 'data' rather than 'list' so we don't shadow the Python built-in
    size = int(len(data) * fraction) # how big is each subset
    means = []
    medians = []
    for i in range(iterations):
        subset = resample(data, n_samples=size) # grab a random chunk (with replacement)
        means.append(mean(subset)) # calculate the mean of the subset
        medians.append(median(subset)) # calc. the median
    return means, medians # return sets of means and medians

In [None]:
means, medians = boot_sample(cn_lengths, iterations=500)

med = np.median(cn_lengths)
low_cir = med - np.percentile(medians, 2.5)
high_cir = np.percentile(medians, 97.5) - med
print(low_cir, med, high_cir)

So we might say that the median is 1.3531 ± 0.0007 Å, which seems pretty good.

What about the mean?

In [None]:
stderr = np.std(cn_lengths) / np.sqrt(len(cn_lengths))
print(stderr, np.mean(cn_lengths), stderr)

In other words, we might choose the mean because the standard error of the mean is much smaller than error bars around the median.

### Bayesian Probability and Statistics

[Thomas Bayes](https://en.wikipedia.org/wiki/Thomas_Bayes) became very interested in probability. (Ironically, the portrait usually shown of him is probably not actually him.)

In particular, Bayes was interested in conditional probability and inverse probability.

<img src="../images/bayesian.png" width=250 />

Let's imagine we put 3 blue balls and 4 red balls into the basket.

What's the probability we pull out a blue ball?
- Obviously 3 out of 7

If I make 10 observations, what can I infer about the distribution?
- What if you pull out 10 blue balls?
- What if you pull out 10 red balls?

Put another way, if you're gambling, at what point do you decide a coin flip is no longer fair?
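One way to put numbers on the ball-drawing questions above is to ask how likely 10 identical draws would be if the basket really held 3 blue and 4 red balls. This sketch assumes each ball is put back before the next draw (an assumption not stated above), so the draws are independent.

```python
from fractions import Fraction

p_blue = Fraction(3, 7)   # 3 blue balls out of 7 total
p_red = 1 - p_blue        # 4 red balls out of 7 total

# assumption: each ball is returned to the basket, so draws are independent
p_ten_blue = p_blue ** 10
p_ten_red = p_red ** 10

print("P(10 blue in a row) =", float(p_ten_blue))
print("P(10 red in a row)  =", float(p_ten_red))
```

Both are well under 1% (about 0.02% and 0.4%), so after either observation you'd have good reason to doubt that the basket holds 3 blue and 4 red balls.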

Here's [Bayes's Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem):

$$
P(A \mid B)=\frac{P(B \mid A) P(A)}{P(B)}
$$

* $P(A\mid B)$ is a conditional probability: the likelihood of event A occurring given that B is true.
* $P(B\mid A)$ is also a conditional probability: the likelihood of event B occurring given that A is true.
* $P(A)$ and $P(B)$ are the probabilities of observing A and B.

Why is this useful?

Imagine we want to do some sort of drug test. (The same reasoning applies to tests for COVID, cancer, etc.)

- the test is 90% *sensitive* - it gives a positive result 90% of the time when someone is positive. (For COVID, you'd probably want much higher sensitivity.) That means there's a 10% **false negative** rate.
- the test is 80% *specific* - it gives a negative result 80% of the time when someone is negative, which means 20% of non-users get a **false positive**. (This gets into ethics, but presumably it's better to err on the side of a false positive than a false negative, e.g. COVID)

Let's say hypothetically there's a 5% *prevalence* of drug use in a population.

What's the chance that someone with a positive drug test is actually a drug user?

$$
P(\text { User } \mid \text { Positive })=\frac{P(\text { Positive } \mid \text { User }) P(\text { User })}{P(\text { Positive })}
$$

$$
P(\text { User } \mid \text { Positive }) = \frac{0.90 \times 0.05}{0.90 \times 0.05+0.20 \times 0.95} = 0.191
$$
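The denominator in the last equation comes from adding up the two ways to get a positive result: a user who tests positive (true positive) and a non-user who tests positive (false positive). With the numbers above:

$$
P(\text { Positive }) = P(\text { Positive } \mid \text { User }) P(\text { User }) + P(\text { Positive } \mid \text { Non-user }) P(\text { Non-user }) = 0.90 \times 0.05 + 0.20 \times 0.95 = 0.235
$$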

In other words, roughly four out of five positive test results are false positives, because prevalence is low.

In [None]:
# find better specificity
for false_positive in [0.2, 0.1, 0.05, 0.01]:
    print('rate:', 0.9*0.05 / (0.9*.05 + false_positive*0.95))

In other words, Bayes's theorem tells us that if prevalence is low, we want really good specificity - i.e. very few false positives.
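To see the effect of prevalence directly, here is a minimal sketch that keeps the same test (90% sensitivity, 80% specificity) and varies the prevalence instead. The function name `prob_user_given_positive` and the prevalence values other than 5% are made up for illustration.

```python
def prob_user_given_positive(prevalence, sensitivity=0.90, specificity=0.80):
    """Bayes's theorem: P(User | Positive) for a test with the given properties."""
    true_pos = sensitivity * prevalence               # users who test positive
    false_pos = (1 - specificity) * (1 - prevalence)  # non-users who test positive
    return true_pos / (true_pos + false_pos)

for prevalence in [0.01, 0.05, 0.20, 0.50]:
    print(f"prevalence {prevalence:.0%}: P(User | Positive) = "
          f"{prob_user_given_positive(prevalence):.3f}")
```

At 5% prevalence this reproduces the 0.191 from the worked example, and the probability climbs as prevalence increases.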

Of course it depends on the consequences:
- for a drug test, presumably you don't want to accuse someone who's innocent.
- for COVID, presumably you really, really don't want a false negative - they can spread it
    - a false positive is presumably not as bad..
    
### Final Words about Bayes

Another way to look at Bayes and so-called "Bayesian Statistics" is that it's a way to look at your hypothesis:

<img src="../images/bayes-hypothesis.png" width="300" />

In other words, Bayes's theorem gives us the likelihood our hypothesis is true, given the evidence.

Isn't that what we want in science?

-------
This notebook is from Prof. Geoffrey Hutchison, University of Pittsburgh
https://github.com/ghutchis/chem1000

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a>