# Unit 5: Statistical Analysis
------------------------------

- Calculate and plot the histogram for a dataset
- Calculate random samples from a distribution
- Calculate a confidence interval
- Central limit theorem
- Create an x-bar R run chart

## 5.1. Calculating and plotting a histogram

The histogram is a common visualization taught in statistics classes. It summarizes a large set of values so that we can see how the data are spread. To generate a histogram, we first define a set of *bins* that are relevant for our dataset. For example, if we were looking at ages of people, we might consider bins with a width of 10 years (0-9 years, 10-19 years, 20-29 years, ...). Then, we go through the data and count the number of entries that fall within each bin. The results are typically visualized as a bar graph with no space between the bars.
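The counting procedure described above can be sketched by hand before we turn to the plotting functions. The ages below are made up purely for illustration, and the variable names (`ages`, `bin_width`, `edges`) are our own choices, not part of any library.

```python
import numpy as np

# Hypothetical ages, invented for this illustration only
ages = np.array([3, 12, 15, 21, 27, 29, 34, 38, 41, 45, 47, 52, 68])

bin_width = 10
# Edges 0, 10, 20, ..., 70 define the bins 0-9, 10-19, ..., 60-69
edges = np.arange(0, 80, bin_width)

# One counter per bin (there is one fewer bin than there are edges)
counts = np.zeros(len(edges) - 1, dtype=int)

# Go through the data and add one to the bin that each age falls in
for age in ages:
    index = age // bin_width  # integer division gives the bin number
    counts[index] += 1

for left, count in zip(edges[:-1], counts):
    print(f"{left}-{left + bin_width - 1} years: {count}")
```

Integer division only works here because the bins start at 0 and all have the same width. For bins with arbitrary edges, a library function is the safer choice.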

### 5.1.1. Plotting the histogram in `matplotlib`

In the example below, we load a set of tensile modulus measurements for polymer films. The `matplotlib` package provides a function `hist(x)` that can be used to easily plot a histogram for a set of data. By default, this function chooses the bins on its own (10 equal-width bins spanning the range of the data). By viewing the sample below, we can quickly identify that this dataset appears to have two separate distributions, which may indicate that it represents two different polymer films.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')

modulus = np.loadtxt('../../data/bopet_modulus-MPa.csv')

fig, ax = plt.subplots() 

ax.hist(modulus)
ax.set_xlabel('Tensile Modulus (MPa)')
ax.set_ylabel('Count')

We have options in how we set up the bins. In addition to the data array, the `hist()` function accepts a parameter called `bins` that can be used either to set the number of bins or to fix the bin edges. In the example below, we set the bin count to 50. We also pass `density=True`, which rescales the bar heights so that the total area of the bars equals 1. While it might seem like more bins will always improve resolution, there is a point at which the bins become too narrow: most bins then hold only a few values, and you lose sight of the distribution.

In [None]:
fig, ax = plt.subplots() 

ax.hist(modulus, bins=50, density=True)
ax.set_xlabel('Tensile Modulus (MPa)')
ax.set_ylabel('Density')
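
To see what `density=True` does to the bar heights, here is a minimal sketch using `np.histogram()`, which accepts the same `bins` and `density` arguments. The data are synthetic random values standing in for the modulus measurements (an assumption made so that the example runs without the data file).

```python
import numpy as np

# Synthetic stand-in for the modulus data (NOT the real measurements)
rng = np.random.default_rng(0)
values = rng.normal(loc=5000, scale=200, size=60)

# Raw counts and density-normalized heights for the same 10 bins
counts, edges = np.histogram(values, bins=10)
density, _ = np.histogram(values, bins=10, density=True)

# Each density value is the count divided by (total count * bin width)
widths = np.diff(edges)
manual_density = counts / (counts.sum() * widths)

print(np.allclose(density, manual_density))

# Summing height * width over all bins gives the total area
total_area = (density * widths).sum()
print(total_area)
```

Because the heights are divided by the bin width, the y-axis values depend on the units of the data and are usually much smaller than the raw counts.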

Additionally, you can set the actual bin edges. Recall the `np.arange()` function that we learned in Unit 2.

In [None]:
bin_edges = np.arange(4300, 5700, 100)

fig, ax = plt.subplots() 

ax.hist(modulus, bins=bin_edges)
ax.set_xlabel('Tensile Modulus (MPa)')
ax.set_ylabel('Count')
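
Two details of this approach are easy to miss: `np.arange()` does not include the stop value, and values that fall outside the edges are left out of the histogram. The short sketch below checks both, using a handful of made-up values rather than the real dataset.

```python
import numpy as np

bin_edges = np.arange(4300, 5700, 100)

# The stop value (5700) is excluded, so the last edge is 5600
print(bin_edges[0], bin_edges[-1], len(bin_edges))

# Made-up values: 4250 and 5650 lie outside the edges
values = np.array([4250, 4300, 4999, 5000, 5600, 5650])
counts, _ = np.histogram(values, bins=bin_edges)

# Only the values between the first and last edge are counted.
# The last bin includes its right edge, so 5600 is counted.
print(counts.sum(), "of", len(values), "values were counted")
```

If the histogram seems to be missing data, comparing `counts.sum()` with the length of the array is a quick way to check whether the edges cover the full range.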

### Calculating the histogram values in `numpy`

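Sometimes it is helpful to have the actual counts and bin edges, so that you can do further calculations or plotting with them. The `np.histogram()` function returns these values without drawing a plot, and `ax.hist()` returns them alongside the bars that it draws.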

In [None]:
counts, bins = np.histogram(modulus, bins=bin_edges)

counts, bins
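
As a small sketch of how these two arrays fit together, the example below uses synthetic data (an assumption, standing in for the modulus file) to show that there is always one more edge than there are counts, and to look up the most populated bin.

```python
import numpy as np

# Synthetic stand-in for the modulus data (NOT the real measurements)
rng = np.random.default_rng(1)
values = rng.normal(loc=5000, scale=200, size=60)

bin_edges = np.arange(4300, 5700, 100)
counts, bins = np.histogram(values, bins=bin_edges)

# N bins are bounded by N + 1 edges
print(len(counts), len(bins))

# Bin i spans bins[i] to bins[i + 1]
peak = np.argmax(counts)
print(f"Most populated bin: {bins[peak]} to {bins[peak + 1]} MPa "
      f"({counts[peak]} values)")
```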

In [None]:
fig, ax = plt.subplots() 

counts, bins, bars = ax.hist(modulus, bins=bin_edges)
ax.set_xlabel('Tensile Modulus (MPa)')
ax.set_ylabel('Count')

counts, bins

### 5.1.2. Adding the cumulative sum

In [None]:
fig, ax = plt.subplots() 

counts, bins, bars = ax.hist(modulus, bins=bin_edges)
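# Plot the running total at each bin center
# (left edge + 50, which is half of the 100 MPa bin width)
ax.plot(bins[0:-1] + 50, counts.cumsum())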

ax.set_xlabel('Tensile Modulus (MPa)')
ax.set_ylabel('Count')

In [None]:
fig, ax = plt.subplots() 

counts, bins, bars = ax.hist(modulus, bins=bin_edges)
# Add a second y-axis that shares the x-axis,
# so the cumulative sum gets its own scale
ax2 = ax.twinx()
ax2.plot(bins[0:-1] + 50, counts.cumsum())

ax.set_xlabel('Tensile Modulus (MPa)')
ax.set_ylabel('Count')

In [None]:
fig, ax = plt.subplots() 

counts, bins, bars = ax.hist(modulus, bins=bin_edges, alpha=0.5)
ax.set_xlabel('Tensile Modulus (MPa)')
ax.set_ylabel('Count')
ax.set_ylim((0, 12))

ax2 = ax.twinx()
ax2.plot(bins[0:-1] + 50, counts.cumsum())
ax2.set_ylabel('Cumulative Sum')
ax2.set_ylim((0, 60))
ax2.grid(False)
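
The cells above add a hard-coded 50 to the left bin edges, which is only correct for 100 MPa bins. A slightly more general sketch computes the bin centers from the edges themselves. It also converts the cumulative sum to a fraction of the total, which is one possible variation rather than something the plots above require. The data are synthetic, and the names `centers` and `fraction` are our own.

```python
import numpy as np

# Synthetic stand-in for the modulus data (NOT the real measurements)
rng = np.random.default_rng(2)
values = rng.normal(loc=5000, scale=200, size=60)

bin_edges = np.arange(4300, 5700, 100)
counts, bins = np.histogram(values, bins=bin_edges)

# Center of each bin = average of its left and right edges
centers = (bins[:-1] + bins[1:]) / 2

# Running total of the counts, bin by bin
cumulative = counts.cumsum()

# Same curve expressed as a fraction of all counted values
fraction = cumulative / counts.sum()

print(centers[:3])
print(cumulative[-1], fraction[-1])
```

Passing `centers` and `cumulative` to `ax2.plot()` would reproduce the line from the cells above, and the calculation keeps working if the bin width changes.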


### 5.1.3. Overlaying the normal distribution



In [None]:
from scipy import stats

fig, ax = plt.subplots() 

counts, bins, bars = ax.hist(modulus, bins=bin_edges, density=True, alpha=0.5)
ax.set_xlabel('Tensile Modulus (MPa)')
ax.set_ylabel('Density')

x = np.linspace(modulus.min()-modulus.std(), modulus.max()+modulus.std(), 100)
pdf = stats.norm.pdf(x, loc=modulus.mean(), scale=modulus.std())

ax.plot(x, pdf)



In [None]:
fig, ax = plt.subplots() 

counts, bins, bars = ax.hist(modulus, bins=bin_edges, density=True, alpha=0.5)
ax.set_xlabel('Tensile Modulus (MPa)')
ax.set_ylabel('Density')

x = np.linspace(modulus.min()-modulus.std(), modulus.max()+modulus.std(), 100)
pdf = stats.norm.pdf(x, loc=modulus.mean(), scale=modulus.std())


# Limits of the region within one standard deviation of the mean
lower_limit = modulus.mean() - modulus.std()
upper_limit = modulus.mean() + modulus.std()

ax.plot(x, pdf)
ax.fill_between(x, 0, pdf, where=((x>lower_limit) & (x<upper_limit)), alpha=0.8)
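
The shaded region covers one standard deviation on either side of the mean. As a rough check on how much of a normal distribution that region contains, we can take the difference of the cumulative distribution function, `stats.norm.cdf()`, at the two limits. The mean and standard deviation below are illustrative placeholders, not values calculated from the modulus data.

```python
from scipy import stats

# Illustrative placeholder values (NOT calculated from the dataset)
mean = 5000.0
std = 200.0

lower_limit = mean - std
upper_limit = mean + std

# Area under the normal curve between the two limits
area = (stats.norm.cdf(upper_limit, loc=mean, scale=std)
        - stats.norm.cdf(lower_limit, loc=mean, scale=std))

print(round(area, 4))  # 0.6827
```

The result does not depend on the particular mean and standard deviation chosen. Note that it describes the fitted normal curve, so a dataset that is not normally distributed may have a different fraction of its values inside the shaded region.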



In [None]:
np.random.random()

--------------
## Next Steps:

1. Complete the [Unit 5 Problems](./unit05-solutions.ipynb) to test your understanding
2. Advance to [Unit 6](../06-regression-classification/unit06-lesson.ipynb) when you're ready for the next step