# Solutions: Unit 5
-------------------

Complete the problems below in your copy of the Jupyter Notebook.

## Problem 5.1.

For this problem, assume $\mu=15$, $\sigma=3.5$

1. Randomly sample 20 values from the normal distribution
2. Plot a histogram of these randomly sampled values as a probability density
3. Overlay the normal probability density function
4. Using the t distribution, calculate the 95% confidence interval for the mean of the random sample, and identify this range with dashed vertical lines

In [None]:
# problem 5.1. solution

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

plt.style.use('ggplot')

# part 1: generate the random sample
n = 20
mu = 15
sigma = 3.5

random_values = stats.norm.rvs(loc=mu, scale=sigma, size=n)

# part 2: plot the histogram
first_color = plt.rcParams['axes.prop_cycle'].by_key()['color'][0]

fig, ax = plt.subplots()
ax.hist(random_values, density=True, alpha=0.5, color=first_color)

# part 3: overlay the normal probability function
x = np.linspace(mu-3*sigma, mu+3*sigma, 100)
y = stats.norm.pdf(x, loc=mu, scale=sigma)
ax.plot(x, y, c=first_color)

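# part 4: compute the interval
x_bar = random_values.mean()
# sample standard deviation (ddof=1 divides by n-1)
stdev = random_values.std(ddof=1)

# the t distribution for the mean of n values has n-1 degrees of freedom
interval = stats.t.interval(0.95, loc=x_bar, scale=stdev/np.sqrt(n), df=n-1)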
ax.axvline(interval[0], ls='--')
ax.axvline(interval[1], ls='--')

The confidence interval is computed from a random sample, so its location will differ from this example and will change every time the cell is run. Because a 95% confidence interval is constructed, it should contain the true mean of the population in about 95% of the runs when the cell is run many times.
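
The 95% claim can be checked numerically. The cell below is a minimal sketch, not part of the original solution: it repeats the sampling many times and counts how often the t-based interval contains $\mu$. The function name `coverage_fraction`, the number of trials, and the seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def coverage_fraction(mu=15, sigma=3.5, n=20, trials=5000, confidence=0.95, seed=0):
    """Fraction of t-based confidence intervals that contain the true mean."""
    rng = np.random.default_rng(seed)
    # one row per simulated run of the cell
    samples = rng.normal(loc=mu, scale=sigma, size=(trials, n))
    x_bar = samples.mean(axis=1)
    # standard error of the mean, using the sample standard deviation
    sem = samples.std(axis=1, ddof=1) / np.sqrt(n)
    # two-sided critical value of the t distribution with n-1 degrees of freedom
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    lower = x_bar - t_crit * sem
    upper = x_bar + t_crit * sem
    return np.mean((lower <= mu) & (mu <= upper))

coverage = coverage_fraction()
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The printed fraction should land close to 0.95, with some scatter that shrinks as `trials` increases.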

## Problem 5.2.

The `physical_properties` worksheet in the `film_testing.xlsx` file contains long-form data on the physical properties of a number of polymer films. Use this data to compare the dart impact energy of BOPP and BOPET films. Plot the histograms, and test the data to determine if there is statistical evidence that these two film chemistries have a different impact strength. Assume that the data for each film type are sampled from a normal distribution, but that the variance for each film type may not be the same.

In [None]:
# problem 5.2. solution

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd

# open and review the data set to get the column headings and format
film_df = pd.read_excel('../../data/film_testing.xlsx', sheet_name='physical_properties')
film_df.head()

In [None]:
# select the two datasets to make the code easier to follow
dart_bopet = film_df[(film_df['FilmType']=='BOPET') & (film_df['Property']=='Dart Impact')]
dart_bopp = film_df[(film_df['FilmType']=='BOPP') & (film_df['Property']=='Dart Impact')]

# compute bin edges that span the entire range of data
dart_min = film_df[film_df['Property']=='Dart Impact']['Measurement'].min()
dart_max = film_df[film_df['Property']=='Dart Impact']['Measurement'].max()

dart_bin_width = 0.05
dart_bin_min = dart_bin_width * np.floor(dart_min / dart_bin_width)
dart_bin_max = dart_bin_width * (np.ceil(dart_max / dart_bin_width) + 1)

dart_bins = np.arange(dart_bin_min, dart_bin_max, dart_bin_width)

# plot the histograms of the two data sets
fig, ax = plt.subplots()
ax.hist(dart_bopet['Measurement'], bins=dart_bins, alpha=0.5, label='BOPET')
ax.hist(dart_bopp['Measurement'], bins=dart_bins, alpha=0.5, label='BOPP')

# show the labels so the two film types can be told apart
ax.legend()

In [None]:
# assume a null hypothesis that the means of these two distributions are equal
stats.ttest_ind(dart_bopet['Measurement'], dart_bopp['Measurement'], equal_var=False)

If we use 0.05 as the limit for significance in the t-test, we find that the observed p-value << 0.05, so we reject the null hypothesis that these two film types have the same mean dart impact strength. There is a statistically significant difference in the dart impact of BOPP and BOPET films.
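
Setting `equal_var=False` selects Welch's t-test, which does not pool the two variances. As a rough sketch of what that computes, the cell below evaluates the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `stats.ttest_ind`. The helper `welch_t_test` and the two synthetic samples are assumptions for illustration only; they are not the film data.

```python
import numpy as np
from scipy import stats

def welch_t_test(a, b):
    """Two-sided Welch t-test for two independent samples (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # squared standard error of each sample mean
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, df, p_value

# synthetic samples with different means and variances (made-up values)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.60, scale=0.05, size=12)
group_b = rng.normal(loc=0.45, scale=0.10, size=15)

t_stat, df, p_value = welch_t_test(group_a, group_b)
reference = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"manual: t = {t_stat:.4f}, df = {df:.2f}, p = {p_value:.3g}")
print(f"scipy:  t = {reference.statistic:.4f}, p = {reference.pvalue:.3g}")
```

The two lines should agree, which confirms that the manual formulas match the test used in the solution.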

## Problem 5.3.

The *basis weight* (mass / unit area) is a common quality measurement in the production of polymer films. The file `basis_weight.txt` contains a series of basis weight values in grams per square meter, with no column heading. Load this data set and construct an $\bar{x}$ run chart of the data, using $n=4$ and setting limits of $\pm 1.5 \sigma$. Use the first 15 $\bar{x}$ points to estimate the mean and standard deviation of the data. Plot any out of control points with an "x" marker and comment on what is observed.

In [None]:
# problem 5.3. solution

import matplotlib.pyplot as plt
import numpy as np

# load the data array
bw = np.loadtxt('../../data/basis_weight.txt')

# compute the x-bar values, using n=4
bw_matrix = bw.reshape(-1, 4)
bw_xbar = bw_matrix.mean(axis=1)

# compute the mean and standard deviation for the first 15 x-bar values
xbar_mean = bw_xbar[:15].mean()
xbar_std = bw_xbar[:15].std(ddof=1)

# compute the upper and lower control limits
ucl = xbar_mean + 1.5 * xbar_std
lcl = xbar_mean - 1.5 * xbar_std

# plot the x-bar run chart, including limits at +/- 1.5 sigma
fig, ax = plt.subplots()

# plot the observed x-bar values
ax.plot(np.arange(len(bw_xbar)), bw_xbar, c='blue', lw=1)

# plot the calculated control limits
ax.axhline(ucl, label='UCL', c='red')
ax.axhline(xbar_mean, label='CL', c='gray', ls='--')
ax.axhline(lcl, label='LCL', c='red')

# get the index values where the x-bar points are outside the control limits
# (np.flatnonzero returns a 1-D array that works for both plotting and indexing)
outofcontrol_index = np.flatnonzero((bw_xbar > ucl) | (bw_xbar < lcl))

# plot the out of control points, using the index values as the x-coordinates
ax.scatter(outofcontrol_index, bw_xbar[outofcontrol_index], marker='x', c='blue',
           label='Out of Control')

ax.legend()

There are several observations that exceed the upper control limit, with the first being at index 17. This indicates that there may have been a disruption in the manufacturing process at this time.
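
To see how a process shift shows up in this kind of chart, the cell below applies the same limit calculation to synthetic data with a known step change. This is only a sketch: the function `xbar_out_of_control`, the simulated values, and the size of the shift are made-up assumptions and do not come from `basis_weight.txt`.

```python
import numpy as np

def xbar_out_of_control(values, n=4, n_baseline=15, k=1.5):
    """Return x-bar values, (lcl, ucl), and indices outside the limits (hypothetical helper)."""
    xbar = np.asarray(values, dtype=float).reshape(-1, n).mean(axis=1)
    # estimate the center line and spread from the first n_baseline x-bar points
    center = xbar[:n_baseline].mean()
    spread = xbar[:n_baseline].std(ddof=1)
    lcl, ucl = center - k * spread, center + k * spread
    index = np.flatnonzero((xbar > ucl) | (xbar < lcl))
    return xbar, (lcl, ucl), index

# synthetic process: 20 in-control subgroups, then 10 subgroups after a +5 shift
rng = np.random.default_rng(2)
in_control = rng.normal(loc=50.0, scale=1.0, size=80)
shifted = rng.normal(loc=55.0, scale=1.0, size=40)
values = np.concatenate([in_control, shifted])

xbar, (lcl, ucl), index = xbar_out_of_control(values)
print(f"LCL = {lcl:.2f}, UCL = {ucl:.2f}")
print("out of control subgroup indices:", index)
```

Every subgroup after the shift (index 20 onward) falls above the upper limit. Because the limits are only $\pm 1.5 \sigma$, a few in-control subgroups may also be flagged by chance, so a run of consecutive flagged points is stronger evidence of a disruption than a single one.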

--------------
## Next Steps:

1. Advance to [Unit 6](../06-regression-classification/unit06-lesson.ipynb) when you're ready for the next step