In [1]:
import empiricaldist
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from empiricaldist import FreqTab
from thinkstats import decorate

# PMFs

A `Pmf` object is like a `FreqTab` that contains probabilities instead of frequencies.

In [3]:
ftab = FreqTab.from_seq([1, 2, 2, 4, 5])
ftab

,freqs
1,1
2,2
4,1
5,1


- The sum of the frequencies is the size of the original sequence.

In [5]:
n = ftab.sum()
n

np.int64(5)

- Dividing the frequencies by `n` turns them into proportions rather than counts.

In [10]:
pmf = ftab/n
pmf

,probs
1,0.2
2,0.4
4,0.2
5,0.2


- The result above indicates:
  - 20% of the values in the sequence are 1
  - 40% are 2, and so on
- These proportions can also be read as probabilities: **if we choose a random value from the original sequence, the probability that we choose 1 is 0.2**.
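The random-draw reading of these proportions can be checked by simulation. The following is a minimal sketch using only the standard library; the seed and the number of draws are arbitrary choices made for illustration, not values from the text.

```python
import random

seq = [1, 2, 2, 4, 5]

# Fixed seed so the run is repeatable (an arbitrary choice).
rng = random.Random(0)
n_draws = 100_000

# Draw values uniformly at random from the original sequence.
draws = [rng.choice(seq) for _ in range(n_draws)]

# Fraction of draws equal to 1; this should land close to 0.2.
frac_ones = draws.count(1) / n_draws
print(round(frac_ones, 2))
```

With this many draws the observed fraction should differ from 0.2 only by a small sampling error.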

- The probabilities sum to 1, which means that this distribution is **normalized**.

In [11]:
pmf.sum()

np.float64(1.0)

- A normalized `FreqTab` object represents a **probability mass function** (PMF).
- Probabilities associated with discrete values are also called **probability masses**.
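To make the normalization step concrete, here is a minimal sketch that builds the same table without `empiricaldist`, using `collections.Counter` from the standard library. The name `pmf_dict` is introduced only for this illustration.

```python
from collections import Counter

seq = [1, 2, 2, 4, 5]

# Count how often each distinct value occurs (the frequency table).
freqs = Counter(seq)
n = sum(freqs.values())

# Divide every frequency by the total to get probability masses.
pmf_dict = {q: f / n for q, f in sorted(freqs.items())}
total = sum(pmf_dict.values())

print(pmf_dict)  # {1: 0.2, 2: 0.4, 4: 0.2, 5: 0.2}
```

A plain dictionary is enough to hold the mapping from values to probabilities; the `Pmf` class adds convenience methods on top of the same idea.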

- `Pmf.from_seq` builds the normalized table directly from a sequence (here a slightly different one, containing 3 instead of 4).

In [12]:
from empiricaldist import Pmf

pmf = Pmf.from_seq([1, 2, 2, 3, 5])
pmf

,probs
1,0.2
2,0.4
3,0.2
5,0.2


In [14]:
# A normalized Pmf has a total probability of 1
pmf.sum()

np.float64(1.0)

In [15]:
# Look up the probability of 2 with brackets, or by calling the Pmf like a function
print(pmf[2], pmf(2))

0.4 0.4
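The two lookup styles agree for quantities that are present. To my understanding of `empiricaldist`, they differ for a quantity that is missing: the bracket operator raises `KeyError`, while calling the `Pmf` returns 0. The sketch below imitates that behaviour with a hypothetical `TinyPmf` class, which is not part of `empiricaldist` and exists only for this illustration.

```python
class TinyPmf(dict):
    """Hypothetical stand-in for a Pmf: maps quantities to probabilities."""

    def __call__(self, q):
        # Calling the object returns 0 for a quantity that is not present.
        return self.get(q, 0)


tiny = TinyPmf({1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2})

print(tiny[2], tiny(2))  # both look up the probability of 2
print(tiny(4))           # 4 is not in the table, so the call gives 0

try:
    tiny[4]              # bracket lookup of a missing quantity
except KeyError:
    print("KeyError")
```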


- Probabilities can be assigned and updated in place with the usual operators, much like list items.

In [21]:
pmf[2] = 0.2
pmf(2)

np.float64(0.2)

In [22]:
pmf[2] += 0.3
pmf

,probs
1,0.2
2,0.5
3,0.2
5,0.2


In [24]:
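# Halve the probability of 2. The output below shows 0.125 rather than 0.25,
# which suggests this cell was run twice (0.5 -> 0.25 -> 0.125).
pmf[2] *= 0.5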
pmf

,probs
1,0.2
2,0.125
3,0.2
5,0.2


In [25]:
pmf.sum()

np.float64(0.7250000000000001)

In [26]:
# Renormalize the Pmf: normalize() rescales the probabilities in place
# and returns the total probability before rescaling
pmf.normalize()

np.float64(0.7250000000000001)
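The value returned above is the total before rescaling, not the new total. A minimal sketch of the same idea on a plain dictionary follows; the `normalize` function here is written for illustration and is not the library's implementation.

```python
def normalize(probs):
    """Rescale the values of `probs` in place so they sum to 1.

    Returns the total before rescaling.
    """
    total = sum(probs.values())
    for q in probs:
        probs[q] /= total
    return total


# The unnormalized probabilities from the cells above.
unnormalized = {1: 0.2, 2: 0.125, 3: 0.2, 5: 0.2}
old_total = normalize(unnormalized)

print(round(old_total, 3))
print({q: round(p, 6) for q, p in unnormalized.items()})
```

Each probability is divided by 0.725, so 0.2 becomes about 0.275862 and 0.125 becomes about 0.172414.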

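- `copy` returns a copy of the `Pmf`, so the copy can be modified without affecting the original. The values shown are the renormalized probabilities.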

In [27]:
pmf.copy()

,probs
1,0.275862
2,0.172414
3,0.275862
5,0.275862


# Summarizing a PMF

- Suppose we compute the PMF of the values in a sequence. For comparison, first compute the mean directly from the sequence.

In [33]:
seq = [1, 2, 2, 3, 5]
mean = np.sum(seq)/len(seq)
mean

np.float64(2.6)

In [35]:
pmf = Pmf.from_seq(seq)

# Computing the Mean using PMF

In [31]:
mean = np.sum(pmf.ps * pmf.qs)
mean

np.float64(2.6)

In [36]:
# using the mean method
pmf.mean()

np.float64(2.6)
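The two computations agree because each distinct quantity `q` that occurs `f` times contributes `q * f / n` to the mean, and `f / n` is its probability. A minimal standard-library sketch of that equivalence, with `pmf_dict` as an illustrative name:

```python
from collections import Counter

seq = [1, 2, 2, 3, 5]
n = len(seq)

# Mean computed directly from the sequence.
mean_seq = sum(seq) / n

# Mean computed from the PMF: sum of quantity * probability.
pmf_dict = {q: f / n for q, f in Counter(seq).items()}
mean_pmf = sum(q * p for q, p in pmf_dict.items())

print(mean_seq, round(mean_pmf, 10))
```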

- Given a `Pmf`, we can compute the variance by first computing the deviation of each quantity from the mean.

In [37]:
deviations = pmf.qs - mean
deviations

array([-1.6, -0.6,  0.4,  2.4])

In [39]:
var = np.sum(pmf.ps * deviations**2)
var

np.float64(1.84)

- Variance using `var` method:

In [40]:
pmf.var()

np.float64(1.84)

- Computing the standard deviation as the square root of the variance.

In [41]:
np.sqrt(var)

np.float64(1.3564659966250536)

- Standard deviation using `std` method

In [42]:
pmf.std()

np.float64(1.3564659966250536)
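The variance and standard deviation steps can be reproduced on a plain dictionary. This is a minimal sketch of the probability-weighted formula used above; it corresponds to the population variance (dividing by `n`), not the sample variance (dividing by `n - 1`).

```python
import math

pmf_dict = {1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2}

# Mean: probability-weighted sum of the quantities.
mean = sum(q * p for q, p in pmf_dict.items())

# Variance: probability-weighted average of squared deviations from the mean.
var = sum(p * (q - mean) ** 2 for q, p in pmf_dict.items())

# Standard deviation: square root of the variance.
std = math.sqrt(var)

print(round(var, 2), round(std, 4))
```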

- Computing the mode, the quantity with the highest probability, using the `mode` method

In [43]:
pmf.mode()

np.int64(2)
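On a plain dictionary, the mode is the key with the largest probability. A minimal sketch, again using an illustrative `pmf_dict`:

```python
pmf_dict = {1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2}

# Pick the quantity whose probability is largest.
mode = max(pmf_dict, key=pmf_dict.get)

print(mode)  # 2
```

If several quantities tie for the highest probability, `max` returns the first one it encounters, so a distribution with ties needs a more careful definition.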

# The class size paradox

- Suppose that a college offers 65 classes in a given semester, and we are given the number of classes in each of the following size ranges.

In [44]:
ranges = pd.interval_range(start=5, end=50, freq=5, closed="left")
ranges.name = "class size"

data = pd.DataFrame(index=ranges)
data["count"] = [8, 8, 14, 4, 6, 12, 8, 3, 2]
data

class size,count
"[5, 10)",8
"[10, 15)",8
"[15, 20)",14
"[20, 25)",4
"[25, 30)",6
"[30, 35)",12
"[35, 40)",8
"[40, 45)",3
"[45, 50)",2
