# Exercise set 6

> The first goal of this exercise is to analyze a fractional factorial design and use **confounding**, **defining contrast(s)**, **generators** and the **resolution**.
>
> In connection with experimental design, we have seen two approaches for checking whether
determined effects are important. These two approaches are based on creating a probability
plot and performing ANOVA. The second goal of this exercise is to use these two approaches.
>
> Finally, we are also going to analyze the results of a more complex experimental design.

## Exercise 6.1

You have recently started a new job in a chocolate bar company. A new and supposedly tasty
chocolate is under development, and the main ingredients that you can vary are:


*  The amount of cocoa ($A$).

*  The number of pecan nuts ($B$).

*  The amount of caramel ($C$).

*  The amount of milk powder ($D$).

*  The amount of sugar ($E$).

*  The amount of vanilla ($F$).

You are tasked with carrying out a maximum of $16$
experiments (limited due to cost
and time constraints) in which the best mixture of
the main ingredients ($A$--$F$)
is found. ("Best" is here determined by a tasting panel of $30$ people.)
For this task, you decide on making a two-level
fractional factorial design.


### 6.1(a)
How many experiments would you have to carry out in a full factorial design?

In [None]:
# Full factorial with 6 components:
print(f"Full factorial, 2 levels & 6 components = 2**6 = {2**6}")

#### Your answer to question 6.1(a):
With 6 factors in a two-level design, the number of experiments would be $2^6 = 64$.

### 6.1(b)
As stated, you can only carry out $16$ experiments. 
Explain what confounding is and why the set up with
$16$ experiments will lead to confounding.

#### Your answer to question 6.1(b):

When we carry out a full factorial design, we do enough
experiments to estimate every main effect and every interaction
separately. In a fractional factorial design, we do too few experiments
to determine all the effects uniquely. When we write out the estimates,
the expressions for some of the effects are identical, and we cannot
distinguish between them: these effects are confounded.

In the present case, we need $64$ experiments (see the [answer to 6.1(a)](#Your-answer-to-question-6.1(a):))
but we can only do $16$. We therefore know that some of the effects
will be confounded with each other.

Luckily, we have some choice and we can, for instance, make it so
that the main effects are confounded with higher-order interactions.
This is useful as we often can neglect higher-order interactions
compared to the main effects.
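
To make the idea concrete, here is a minimal sketch that builds a 16-run design and compares the sign columns of a few effects. It assumes the generators $E = ABC$ and $F = BCD$ that are chosen in 6.1(c); the helper `column` is a name made up for this illustration.

```python
from itertools import product

# Build the 16-run design: A-D form a full factorial, while E and F come
# from the generators E = ABC and F = BCD (assumed here, see 6.1(c)).
design = []
for a, b, c, d in product([-1, 1], repeat=4):
    design.append({"A": a, "B": b, "C": c, "D": d, "E": a * b * c, "F": b * c * d})


def column(effect):
    """Return the +/-1 column for an effect such as 'A' or 'AD'."""
    result = []
    for run in design:
        sign = 1
        for letter in effect:
            sign *= run[letter]
        result.append(sign)
    return result


# The column used to estimate AD is identical to the one for EF,
# so these two interactions cannot be separated with the 16 runs.
print(column("AD") == column("EF"))  # True
# The main effect A does not share its column with the interaction BC.
print(column("A") == column("BC"))  # False
```

Two effects with identical columns give identical estimates, which is exactly what confounding means.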

### 6.1(c)
After talking with the chocolate design team, you
decide on the following generators:

*  $E = ABC$.

*  $F = BCD$.

What is a defining contrast, and what are the
defining contrasts in this case?

#### Your answer to question 6.1(c):

A defining contrast is a
relation that can be used to deduce
what variables are confounded.
In this case, we decided on two generators which can be used
to generate the defining contrasts,


* $E = ABC \implies E^2 = 1 = ABCE$.
* $F = BCD \implies F^2 = 1 = BCDF$.


In addition, we have that,

\begin{equation*}
1 \times 1 = 1 = ABCE \times BCDF = ADEF.
\end{equation*}
  
Thus, we have $3$ defining contrasts here:

* $1 = ABCE$.
* $1 = BCDF$.
* $1 = ADEF$.
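
The multiplication rule used above (a squared letter equals $1$ and drops out) can be sketched in a few lines of Python. The function name `multiply` is only an illustration; it treats each effect as a set of letters and keeps the letters that appear an odd number of times.

```python
def multiply(word1, word2):
    """Multiply two effect 'words'; repeated letters cancel since X * X = 1."""
    letters = set(word1) ^ set(word2)  # symmetric difference
    return "".join(sorted(letters)) or "1"


# Generators E = ABC and F = BCD, multiplied by E and F respectively:
words = [multiply("E", "ABC"), multiply("F", "BCD")]
# The product of the two defining contrasts is also a defining contrast:
words.append(multiply(words[0], words[1]))
print(words)  # ['ABCE', 'BCDF', 'ADEF']
```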


### 6.1(d)
Find the resolution for this design and write out the
short-hand representation of the design in the
form $2^{N-p}_R$. Are any of the main effects
confounded with $2$-factor
interactions? Hint: Consult the summary tables
at [https://www.itl.nist.gov/div898/handbook/pri/section3/pri3347.htm](https://www.itl.nist.gov/div898/handbook/pri/section3/pri3347.htm).

#### Your answer to question 6.1(d):

We can find the present setup in the summary table as the setup with 6 factors that uses 16 experiments. This is a
$2_\text{IV}^{6-2}$ design. Since the resolution is 4, we know that the main effects are confounded with effects
of order $4-1=3$ and higher, so no, they are not confounded with 2-factor interactions.
We can also check this by hand (see the PDF-version of the solution).
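
As a rough sketch of such a check, the code below takes the three defining contrasts from 6.1(c), reads off the resolution as the length of the shortest word, and multiplies each main effect by every word to list its aliases. The helper `multiply` is a made-up name for this illustration.

```python
def multiply(word1, word2):
    """Multiply two effect words; repeated letters cancel since X * X = 1."""
    return "".join(sorted(set(word1) ^ set(word2)))


defining_words = ["ABCE", "BCDF", "ADEF"]

# The resolution is the length of the shortest word in the defining relation.
resolution = min(len(word) for word in defining_words)
print("Resolution:", resolution)  # Resolution: 4

# Each main effect is confounded with its product with every defining word.
aliases = {}
for factor in "ABCDEF":
    aliases[factor] = [multiply(factor, word) for word in defining_words]
    print(factor, "=", " = ".join(aliases[factor]))

shortest_alias = min(len(word) for words in aliases.values() for word in words)
print("Shortest alias of a main effect has", shortest_alias, "letters")
```

Since no alias of a main effect has fewer than three letters, this agrees with the conclusion above. The same words also show that some 2-factor interactions are confounded with each other, for instance $AD = EF$.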

### 6.1(e)
Construct the design matrix for the current design but show only the columns for the main effects.

In [None]:
from itertools import product
import numpy as np
import pandas as pd

def design_matrix_16():
    # Generate for A, B, C, D first:
    ABCD = np.array(list(product([-1, 1], repeat=4)))
    A = ABCD[:, 0]
    B = ABCD[:, 1]
    C = ABCD[:, 2]
    D = ABCD[:, 3]
    # Use the generators for getting E and F
    # E = ABC:
    E = A * B * C
    # F = BCD:
    F = B * C * D

    names = ["A", "B", "C", "D", "E", "F"]

    design = {}

    for key, val in zip(names, [A, B, C, D, E, F]):
        design[key] = ["+" if i > 0 else "-" for i in val]
    design = pd.DataFrame(design)
    return design

design_matrix_16()

#### Your answer to question 6.1(e):

(See the table above.)

### 6.1(f)
Another member of your team suggests doing just $8$ experiments
as this will cut time and cost.
Do you think this is a good idea? Why/why not? What would
the design matrix look like in this case?

In [None]:
from itertools import product
import numpy as np
import pandas as pd

def design_matrix_8():
    # Generate for A, B, C first:
    ABC = np.array(list(product([-1, 1], repeat=3)))
    A = ABC[:, 0]
    B = ABC[:, 1]
    C = ABC[:, 2]
    # Use generators for getting D, E and F
    # D = AB:
    D = A * B
    # E = AC:
    E = A * C
    # F = BC:
    F = B * C

    names = ["A", "B", "C", "D", "E", "F"]

    design = {}

    for key, val in zip(names, [A, B, C, D, E, F]):
        design[key] = ["+" if i > 0 else "-" for i in val]
    design = pd.DataFrame(design)
    return design

design_matrix_8()

#### Your answer to question 6.1(f):

Consulting the [summary tables](https://www.itl.nist.gov/div898/handbook/pri/section3/eqns/2to6m3.txt) we
find that the design with 8 experiments is a $2_\text{III}^{6-3}$ setup. In this case, the main effects
are confounded with second order effects. In general, we want our main effects to not be confounded with
interactions lower than $3$-factor interactions. We would then
advise against doing only $8$ experiments, and keep to the
original plan with $16$ experiments to be able to better
assess the main effects.

The design matrix is shown above. Note that you might get a different table, depending on the
generators you have used.
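
The resolution can also be checked directly from the generators. The sketch below assumes the generators used in `design_matrix_8` ($D = AB$, $E = AC$, $F = BC$); other generators would give other words. The helper `multiply` is a made-up name for this illustration.

```python
from functools import reduce
from itertools import combinations


def multiply(word1, word2):
    """Multiply two effect words; repeated letters cancel since X * X = 1."""
    return "".join(sorted(set(word1) ^ set(word2)))


# D = AB, E = AC and F = BC give the words 1 = ABD, 1 = ACE and 1 = BCF.
generator_words = ["ABD", "ACE", "BCF"]

# The full defining relation contains every product of these words.
defining_words = [
    reduce(multiply, subset)
    for size in range(1, len(generator_words) + 1)
    for subset in combinations(generator_words, size)
]
print(defining_words)
# ['ABD', 'ACE', 'BCF', 'BCDE', 'ACDF', 'ABEF', 'DEF']

resolution = min(len(word) for word in defining_words)
print("Resolution:", resolution)  # Resolution: 3

# 2-factor interactions that are confounded with the main effect A:
aliases_of_A = [multiply("A", word) for word in defining_words]
two_factor_aliases_of_A = [word for word in aliases_of_A if len(word) == 2]
print("A =", " = ".join(two_factor_aliases_of_A))  # A = BD = CE
```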

## Exercise 6.2

After running a set of experiments, you determine the effects
given in Table 1 for $4$ factors: A, B, C, and D.
Use a normal probability plot to identify the important effects among
the ones listed in this table.


**Table 1:** *Effects determined in a set of experiments. Data for [Exercise 6.2](#Exercise-6.2)*


| A    | B    | C     | D     | AB  | AC   | AD   | BC    | BD   | CD    | ABC   | ACD   | BCD   | ABCD  | ABD   |
|:-----|:-----|:------|:------|:----|:-----|:-----|:------|:-----|:------|:------|:------|:------|:------|:------|
| -8.0 | 24.0 | -2.25 | -5.50 | 1.0 | 0.75 | 0.00 | -1.25 | 4.50 | -0.25 | -0.75 | -0.25 | -0.75 | -0.25 | 0.50  |

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import statsmodels.api as sm
import seaborn as sns  # Styling of plots

%matplotlib notebook
sns.set_theme(style="ticks", context="notebook", palette="muted")

effects = [
    ("A", -8.00),
    ("B", 24.00),
    ("C", -2.25),
    ("D", -5.50),
    ("AB", 1.00),
    ("AC", 0.75),
    ("AD", 0.00),
    ("BC", -1.25),
    ("BD", 4.50),
    ("CD", -0.25),
    ("ABC", -0.75),
    ("ACD", -0.25),
    ("BCD", -0.75),
    ("ABCD", -0.25),
    ("ABD", 0.50),
]

table1 = pd.DataFrame(effects, columns=["factor", "effect"])

# Create a normal probability plot with statsmodels:
plot = sm.ProbPlot(table1["effect"].to_numpy(), fit=False)
fig, ax = plt.subplots(constrained_layout=True)
plot.qqplot(line="q", ax=ax)

x = plot.theoretical_quantiles  # The expected locations
y = plot.sample_quantiles  # The sorted effects

# Also get the name of factors from sorting the table:
sorted_factors = table1.sort_values("effect")
for xi, yi, factor in zip(x, y, sorted_factors["factor"]):
    ax.text(
        xi,
        yi,
        factor,
        va="center",
        ha="center",
        bbox={
            "facecolor": "w",
            "edgecolor": "#4C72B0",
            "boxstyle": "round",
            "lw": 1.5,
        },
        fontsize="xx-small"
    )
sns.despine(fig=fig)

#### Your answer to question 6.2:

In the normal probability plot generated above, four effects deviate clearly from the other effects
(and from the straight line): A, D, BD, and B. We thus conclude
that these four effects are the important ones in this case.
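
For reference, the coordinates behind a normal probability plot can be computed by hand. The sketch below sorts the effects from Table 1 and pairs each with a normal score. It assumes the plotting positions $(i - 0.5)/n$, which is one common convention; the statsmodels plot may use slightly different positions, so the horizontal coordinates need not match exactly.

```python
from statistics import NormalDist

effects = {
    "A": -8.00, "B": 24.00, "C": -2.25, "D": -5.50, "AB": 1.00,
    "AC": 0.75, "AD": 0.00, "BC": -1.25, "BD": 4.50, "CD": -0.25,
    "ABC": -0.75, "ACD": -0.25, "BCD": -0.75, "ABCD": -0.25, "ABD": 0.50,
}

# Sort the effects from smallest to largest:
ordered = sorted(effects.items(), key=lambda item: item[1])
n = len(ordered)

# Expected position of the i-th smallest value of n draws from a
# standard normal distribution (assumed convention: (i - 0.5) / n).
scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

for (name, value), score in zip(ordered, scores):
    print(f"{name:>5s} {value:7.2f} {score:7.3f}")
```

Effects that are only noise should fall roughly on a straight line when the sorted values are plotted against these scores; the important effects sit at the two ends, away from that line.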

## Exercise 6.3

**Only do this exercise if you have time, or did not do [Exercise 6.2](#Exercise-6.2).**

From a $2^2$ factorial experiment replicated three times you have obtained
the data given in Table 2. We use here a short-hand notation
for the $4$ possible combinations of the variables: $(1)$, $a$, $b$, and $ab$.
In this notation, $(1)$ is the experiment where all factors were at their low levels. For the other cases, the absence of a letter means that the corresponding factor was at a low level, and
the presence of a letter indicates that the corresponding factor was at a high level (e.g., "$a$" is the
same as saying that factor A was at the high level and B at the low level).

**Table 2:** *Results from a $2^2$ factorial experiment, repeated $3$ times. Data for [Exercise 6.3](#Exercise-6.3)*

|**Experiment** | **Replicate 1** | **Replicate 2** | **Replicate 3** |
|:---:|:---:|:---:|:---:|
|$(1)$ | $9$  | $10$ | $11$ |
|$a$   | $30$ | $31$ | $29$ |
|$b$   | $19$ | $20$ | $21$ |
|$ab$  | $5$  | $6$  | $4$  |


### 6.3(a)
Calculate the effects (A, B, and AB).

In [None]:
data = {
    "(1)": np.array([9, 10, 11]),
    "a": np.array([30, 31, 29]),
    "b": np.array([19, 20, 21]),
    "ab": np.array([5, 6, 4]),
}

data_values = np.array(
    [
        [9, 10, 11],
        [30, 31, 29],
        [19, 20, 21],
        [5, 6, 4],
    ]
)

In [None]:
# In this notation:
# (1): A=-, B=-, AB=+
# a: A=+, B=-, AB=-
# b: A=-, B=+, AB=-
# ab: A=+, B=+, AB=+

# The mean response is:
mean_data = {key: np.mean(val) for key, val in data.items()}
# And the effects are:
effect = {}
effect["A"] = 0.5 * (
    (mean_data["a"] + mean_data["ab"]) - (mean_data["(1)"] + mean_data["b"])
)
effect["B"] = 0.5 * (
    (mean_data["b"] + mean_data["ab"]) - (mean_data["(1)"] + mean_data["a"])
)
effect["AB"] = 0.5 * (
    (mean_data["(1)"] + mean_data["ab"]) - (mean_data["b"] + mean_data["a"])
)
print(effect)

#### Your answer to question 6.3(a):
The average effects are (see the calculation above):

* A: 2.5
* B: -7.5
* AB: -17.5 
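
The same numbers can be obtained from a table of signs, which scales more easily to designs with more factors. This is a minimal sketch using the data from Table 2; the names `sign_A`, `sign_B` and `sign_AB` are made up for this illustration.

```python
from statistics import mean

replicates = {
    "(1)": [9, 10, 11],
    "a": [30, 31, 29],
    "b": [19, 20, 21],
    "ab": [5, 6, 4],
}

# A letter in the label means that the factor is at its high level (+1).
sign_A = {run: (1 if "a" in run else -1) for run in replicates}
sign_B = {run: (1 if "b" in run else -1) for run in replicates}
# The sign for the interaction is the product of the factor signs.
sign_AB = {run: sign_A[run] * sign_B[run] for run in replicates}

n_factors = 2
effects = {}
for name, sign in (("A", sign_A), ("B", sign_B), ("AB", sign_AB)):
    signed_sum = sum(sign[run] * mean(values) for run, values in replicates.items())
    effects[name] = signed_sum / 2 ** (n_factors - 1)

print(effects)  # {'A': 2.5, 'B': -7.5, 'AB': -17.5}
```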

### 6.3(b)
Use ANOVA to investigate which effects are important in this case.
Use a significance level of $\alpha = 0.01$. At this level,
the critical $f$-value with $1$ and $8$ degrees
of freedom is $f_{\alpha=0.01}(1, 8) = 11.259$. (Note: The numbers in Table 2
are the same as for the example on
page 96 in the textbook.)

In [None]:
# Calculate the contrasts first, since we already have the effects, we can get it as:
k = 3  # 3 repeated experiments
N = 2  # 2 factors.
denom = k * 2**(N-1)
contrasts = {key: val*denom for key, val in effect.items()}
contrasts

In [None]:
# Next, find the sum of squares:
sum_of_squares = {key: val * val / (k*2**N) for key, val in contrasts.items()}
sum_of_squares

In [None]:
# And total sum of squares:
SST = np.sum((data_values - data_values.mean()) ** 2)
print(f"SST = {SST}")
# And the SSE:
SSE = SST - sum(sum_of_squares.values())
print(f"SSE = {SSE}")
df_sse = 2**2 * (k - 1)
print("Degrees of freedom SSE:", df_sse)
sigma = SSE / df_sse  # Estimate for variance from SSE

In [None]:
import scipy.stats
# Calculate f-values and p-values:
for factor, ssi in sum_of_squares.items():
    fval = ssi / sigma
    pval = 1 - scipy.stats.f.cdf(fval, 1, df_sse)
    print(f"ANOVA for: {factor}")
    print(f"\tf = {fval}")
    print(f"\tp = {pval}")

In [None]:
# Print out some critical f-values at different significance levels:
print("\nCritical f-values:")
for alpha in (0.001, 0.01, 0.05, 0.1):
    f_critical = scipy.stats.f.ppf(1 - alpha, 1, df_sse)
    print(f"\tAt alpha {alpha:5.3f}: f-critical = {f_critical:6.3f}")

#### Your answer to question 6.3(b):

At the given significance level ($\alpha = 0.01$), all three effects are significant: each $f$-value is larger than
the critical value of $11.259$. This is also reflected in the calculated $p$-values. All of these are small, which
means that if an effect were truly zero, the probability of observing an $f$-value at least as large as ours would be small.
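
For reference, the arithmetic can be written out by hand. This is a sketch that uses the effects from 6.3(a) with $k = 3$ replicates and $N = 2$ factors, and the same formulas as the code cells above:

\begin{equation*}
\begin{aligned}
\text{contrast} &= k \, 2^{N-1} \times \text{effect}
  &&\implies 15 \ (\text{A}), \quad -45 \ (\text{B}), \quad -105 \ (\text{AB}), \\
SS &= \frac{\text{contrast}^2}{k \, 2^{N}}
  &&\implies 18.75 \ (\text{A}), \quad 168.75 \ (\text{B}), \quad 918.75 \ (\text{AB}), \\
SSE &= SST - SS_\text{A} - SS_\text{B} - SS_\text{AB}
  &&= 1114.25 - 1106.25 = 8, \\
s^2 &= \frac{SSE}{2^{N} (k - 1)}
  &&= \frac{8}{8} = 1, \\
f &= \frac{SS}{s^2}
  &&\implies 18.75 \ (\text{A}), \quad 168.75 \ (\text{B}), \quad 918.75 \ (\text{AB}).
\end{aligned}
\end{equation*}

All three $f$-values are larger than $f_{\alpha=0.01}(1, 8) = 11.259$.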

## Exercise 6.4

A chemical company is producing the
compound B from compound A in the reaction $\text{A} \to \text{B}$
and the company is running a project to 
maximize the yield of compound B.

The yield for the conversion from A to B is low
at low temperatures. The yield increases at
higher temperatures, but
a competing secondary reaction transforms B into C, an unwanted product.
The secondary reaction, $\text{A} \to \text{B} \to \text{C}$,
is more prominent at higher temperatures, especially when the reaction is allowed to run for a long time.
To determine what combinations of temperature and
reaction time result in the best yield, the company has
carried out a set of experiments. The experimental space and the experimental points
are shown in the figure below (in the Python cell).

Here, the company first
defined a feasibility region (based on chemical knowledge)
and then created a *D*-optimal
design for a cubic model. The experimental results are given
in the Excel file [Data/yield.xls](Data/yield.xls). Use the experimental data to create a cubic model and locate the settings (temperature & time)
that give the highest yield.

In [None]:
# Example code to get you started:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import statsmodels.api as sm
import seaborn as sns  # Styling of plots

%matplotlib notebook
sns.set_theme(style="ticks", context="notebook", palette="muted")

data = pd.read_excel("Data/yield.xls")  # for .xls files, you might have to first do: pip install xlrd
data.head()

In [None]:
# Plot experimental region
from matplotlib.patches import Polygon
fig, ax = plt.subplots(constrained_layout=True)
ax.set(xlabel="Time (s)", ylabel="Temperature (K)")
points = [
    (660, 520),
    (360, 529),
    (360, 550), 
    (420, 550),
    (720, 523),
    (720, 520),
]
region = Polygon(points, edgecolor="k", facecolor="0.95", linestyle="--", label="Feasible settings")
ax.add_artist(region)
ax.scatter(data["Time (s)"], data["Temperature (K)"], label="Experimental points")
ax.legend()
ax.set_title("Experimental space", loc="left")
sns.despine(fig=fig)
fig.savefig("space1.pdf")

Generating all the possible terms for a cubic polynomial can be a hassle (and it is easy to
forget some terms!). Here we need: time, temperature, time², temperature², time × temperature, time³, temperature³, time × temperature² and time² × temperature.
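
As a quick check on this list, the terms can be enumerated as pairs of exponents whose sum is between 1 and 3. The sketch below uses only the standard library; the short labels `time` and `temperature` are placeholders and not the column names in the data file.

```python
from itertools import product

names = ("time", "temperature")
max_degree = 3

terms = []
for exponents in product(range(max_degree + 1), repeat=len(names)):
    # Keep all terms of total degree 1, 2 or 3 (degree 0 is the constant).
    if not 1 <= sum(exponents) <= max_degree:
        continue
    parts = [
        name if power == 1 else f"{name}^{power}"
        for name, power in zip(names, exponents)
        if power > 0
    ]
    terms.append(" * ".join(parts))

print(len(terms))  # 9
print(terms)
```

Together with the constant term, a full cubic model in two variables therefore has $10$ coefficients.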

Below is some Python code that will do this for you:

In [None]:
# Here are some hints for creating the model:
# 1) We can transform the variables to -1, 1 with a MinMaxScaler from sklearn.
# 2) We can generate all cubic terms for a polynomial with PolynomialFeatures from sklearn
# 3) We can join these two operations with a so-called pipeline.
# Here is an example:

from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.pipeline import Pipeline

# Strictly speaking, we need neither the scaler nor the pipeline here.
# The pipeline is convenient if we want to redo this preprocessing on new data
# (for instance, if we want to use the model in the future for predictions).
# The MinMaxScaler can make it easier to compare
# regression coefficients.

X_raw = data[["Time (s)", "Temperature (K)"]]
y = data["Yield"]

# Make a pipeline for preprocessing (first scale Time & Temperature, then add the polynomial terms)
preprocess = Pipeline(
    steps=[
        ("minmax-scale", MinMaxScaler(feature_range=(-1, 1))),  # Scale to -1 to 1
        ("cubic", PolynomialFeatures(degree=3, include_bias=True)),  # Add x1, x1^2, x2, x1*x2, ...
        # include_bias=True also adds a constant term (the intercept)
    ]
)
preprocess.fit(X_raw)
# The line above is fitting the pipeline to the data (this sets up the scaler and polynomial transformation).
# This does not apply it yet!

X = pd.DataFrame(
    preprocess.transform(X_raw),  # Transform the data = scale and make the polynomial
    columns=preprocess.get_feature_names_out(),  # Store the variables
)
X

In [None]:
# Your code here: fit the polynomial and locate the maximum (you can do this graphically)
import statsmodels.api as sm

model = sm.OLS(y, X).fit()
print(model.summary())

The fit above seems pretty good, but we note that two of the terms do not seem to be needed (they have a high $p$-value). Let us remake the model without these two:

In [None]:
X2 = X.drop(columns=["Time (s)^3", "Time (s)^2 Temperature (K)"], inplace=False)
model2 = sm.OLS(y, X2).fit()
print(model2.summary())

The model found above seems to fit the data very well. Let us use it to draw a contour diagram to show
the yield. To do this, we have to evaluate the yield on a grid of time and temperature points:

In [None]:
time = np.linspace(350, 725, 100)  # 100 points
temperature = np.linspace(518, 552, 100)  # 100 points

# Create the grid (this is for plotting) as all combinations of time and temperature
time_grid, temperature_grid = np.meshgrid(time, temperature)  # 100 x 100 points
# To evaluate on the grid, we first make a new table with all the points, here
# we convert the 2D grid to a 1D list of points with flatten()
X_evaluate = pd.DataFrame(
    np.column_stack((time_grid.flatten(), temperature_grid.flatten())),
    columns=["Time (s)", "Temperature (K)"],
)
# Transform to the cubic model:
X_evaluate = pd.DataFrame(
    preprocess.transform(X_evaluate),  # Transform the data = scale and make the polynomial
    columns=preprocess.get_feature_names_out(),  # Store the variables
)
X_evaluate.drop(columns=["Time (s)^3", "Time (s)^2 Temperature (K)"], inplace=True)
# Evaluate the model on the grid:
y_hat = model2.predict(X_evaluate)
y_hat_grid = y_hat.to_numpy().reshape(time_grid.shape)

y_hat_grid[y_hat_grid < 0] = float("nan")

In [None]:
# Now we can plot it:
fig, ax = plt.subplots(constrained_layout=True)
cont = ax.contourf(time_grid, temperature_grid, y_hat_grid, levels=30)
ax.set(xlabel="Time (s)", ylabel="Temperature (K)")
points = [
    (660, 520),
    (360, 529),
    (360, 550),
    (420, 550),
    (720, 523),
    (720, 520),
]
region = Polygon(points, edgecolor="k", facecolor="None", linestyle="-", label="Feasible settings")
ax.add_artist(region)
ax.legend()
cbar = fig.colorbar(cont, ax=ax)
cbar.ax.set_title("Yield")
sns.despine(fig=fig)

In [None]:
# To find the optimum, it is here maybe easiest to just pick out the largest value from the grid above.
# We can also try to do it numerically. It is then easiest to define a new function:

def evaluate_model(x):
    time, temperature = x
    X = pd.DataFrame({"Time (s)": [time], "Temperature (K)": [temperature]})
    X = pd.DataFrame(
        preprocess.transform(X),
        columns=preprocess.get_feature_names_out(),
    )
    X.drop(columns=["Time (s)^3", "Time (s)^2 Temperature (K)"], inplace=True)
    # Return a scalar; the minus sign is there because scipy minimizes the function:
    return -model2.predict(X).to_numpy()[0]

evaluate_model((360, 542.5))

In [None]:
from scipy.optimize import Bounds, minimize
bounds = Bounds([360, 520], [720, 550])
# These bounds form a simple box and do not match the feasible region exactly. We could define
# the region with equations, but that is a lot of work, and judging from the drawing above,
# we are not going to end up outside the feasible region.
x0 = (360, 542.5)  # Guess from the figure
res = minimize(evaluate_model, x0, method='trust-constr', options={'verbose': 1}, bounds=bounds)
yield_max = -evaluate_model(res.x)
print(f"Optimum at {res.x}, yield = {yield_max}")

In [None]:
fig, ax = plt.subplots(constrained_layout=True)
cont = ax.contourf(time_grid, temperature_grid, y_hat_grid, levels=30)
ax.set(xlabel="Time (s)", ylabel="Temperature (K)")
points = [
    (660, 520),
    (360, 529),
    (360, 550),
    (420, 550),
    (720, 523),
    (720, 520),
]
region = Polygon(points, edgecolor="k", facecolor="None", linestyle="-", label="Feasible settings")
ax.add_artist(region)
cbar = fig.colorbar(cont, ax=ax)
cbar.ax.set_title("Yield")
ax.scatter(res.x[0], res.x[1], marker="*", label="Optimum", color="k", s=100)
ax.legend()
sns.despine(fig=fig)

#### Your answer to question 6.4: What settings give you the highest yield?
Within the constraints, a time of 360 s and a temperature of 543 K give the highest yield.

If we think outside the box, the model indicates that an even shorter reaction time (maybe 300 s) and a slightly higher temperature (545 K) may give an even better yield. This is an extrapolation beyond the experimental region, so it would have to be confirmed by new experiments.
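
Whether a suggested setting lies inside the feasible region can be checked with a few lines of code. The sketch below assumes that the corner points used for the plots describe a convex polygon listed clockwise (which holds for these six points); the function `is_feasible` is a made-up helper and not part of the original solution.

```python
# Corner points of the feasible region as (time in s, temperature in K),
# copied from the plotting cells and listed clockwise.
CORNERS = [(660, 520), (360, 529), (360, 550), (420, 550), (720, 523), (720, 520)]


def is_feasible(time, temperature, tol=1e-9):
    """Return True if the point is inside (or on the edge of) the region.

    Assumes a convex polygon with clockwise corners, so a point is
    inside when it is never to the left of any edge.
    """
    n = len(CORNERS)
    for i in range(n):
        x0, y0 = CORNERS[i]
        x1, y1 = CORNERS[(i + 1) % n]
        # z-component of the cross product (edge) x (point - corner);
        # a positive value means the point is to the left of this edge.
        cross = (x1 - x0) * (temperature - y0) - (y1 - y0) * (time - x0)
        if cross > tol:
            return False
    return True


print(is_feasible(360, 543))  # True: the optimum within the constraints
print(is_feasible(300, 545))  # False: the "outside the box" suggestion
print(is_feasible(700, 545))  # False: inside the box bounds, outside the region
```

Each edge corresponds to one linear inequality in time and temperature, so the same six conditions could in principle replace the simple box bounds in the optimization.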