# Exercise set 3


**Learning Objectives:**

After completing this exercise set, you will be able to:

- Calculate effects from full and fractional factorial experimental designs.
- Create and interpret normal probability plots to assess the importance of effects.
- Create least squares models from experimental design results and use them to estimate effects.


**To get the exercise approved, complete the following problems:**

- [3.2(a)](#3.2(a)), [3.2(b)](#3.2(b)), and [3.2(c)](#3.2(c)): To show that you can analyse a full factorial design, including creating the normal probability plot.

- [3.3(b)](#3.3(b)) and [3.3(c)](#3.3(c)): To show that you can calculate effects from a fractional factorial design and create a least squares model.

## Exercise 3.1

> **Note:** This problem is more of an example. The code given here will show you how to create a normal probability plot. Run the code, create the plot and interpret it.


After running a set of experiments, you determine the effects
given in the table below for 4 factors: A, B, C, and D:



| A    | B    | C     | D     | AB  | AC   | AD   | BC    | BD   | CD    | ABC   | ACD   | BCD   | ABCD  | ABD   |
|:-----|:-----|:------|:------|:----|:-----|:-----|:------|:-----|:------|:------|:------|:------|:------|:------|
| -8.0 | 24.0 | -2.25 | -5.50 | 1.0 | 0.75 | 0.00 | -1.25 | 4.50 | -0.25 | -0.75 | -0.25 | -0.75 | -0.25 | 0.50  |



**Task:** Use the example code below to create a normal probability plot. Interpret this plot and decide what the important effects are in this case.

In [None]:
# Example code to create a probability plot.

# We collect the effects from the table above:
import pandas as pd

effects = [
    ("A", -8.00),
    ("B", 24.00),
    ("C", -2.25),
    ("D", -5.50),
    ("AB", 1.00),
    ("AC", 0.75),
    ("AD", 0.00),
    ("BC", -1.25),
    ("BD", 4.50),
    ("CD", -0.25),
    ("ABC", -0.75),
    ("ACD", -0.25),
    ("BCD", -0.75),
    ("ABCD", -0.25),
    ("ABD", 0.50),
]
table1 = pd.DataFrame(effects, columns=["factor", "effect"])

In [None]:
# We create the probability plot using statsmodels:
from matplotlib import pyplot as plt  # Needed for plotting
import seaborn as sns  # Used to style plots for a Jupyter Notebook
import statsmodels.api as sm  # Needed to create the probability plot

sns.set_context("notebook")

# Create the probability plot object using the "effect" column from table1.
# By default, ProbPlot compares the data to a standard normal distribution.
# This can also be selected by setting fit=False.
plot = sm.ProbPlot(table1["effect"], fit=False)

# Create a figure and an axes object. This allows for more control over the plot,
# such as setting axis labels, titles, and so on. fig represents the entire figure,
# while ax represents the specific axes where the plot will be drawn.
fig, ax = plt.subplots()

# Generate the plot and draw it on the specified axes (ax).
_ = plot.qqplot(ax=ax)

The plot above is somewhat hard to read since we do not see the names of the effects.
Let us try to make it easier to read by adding text showing the factor names:

In [None]:
# Create a new figure:
fig, ax = plt.subplots()
plot = sm.ProbPlot(table1["effect"], fit=False)
_ = plot.qqplot(ax=ax)  # Add line="q" here to draw a reference line
# Extract the theoretical and sample quantiles from the ProbPlot object.
# These will be used as the x and y coordinates for the text labels.
x = plot.theoretical_quantiles  # Quantiles of the standard normal distribution.
y = plot.sample_quantiles  # Ordered (sorted) values of the 'effect' column.

# Get the factor names, sorted according to the 'effect' values.  This ensures
# the labels are placed correctly corresponding to the sorted effects on the plot.
sorted_factors = table1.sort_values("effect")

# Add text labels to the plot, one for each data point.
# We iterate through the x (theoretical quantiles), y (sample quantiles), and
# factor names simultaneously using zip().
for i, (xi, yi, factor) in enumerate(zip(x, y, sorted_factors["factor"])):
    # Put the text slightly away from the points. Offset determines
    # how far away from the point the text will be placed. You can
    # experiment with values different from 3.
    offset = 3 if i % 2 == 0 else -3
    # Use annotate to show the text with a line connecting to the effect:
    ax.annotate(
        factor,  # Use the factor text
        (xi, yi),  # Point we are adding text to
        xytext=(
            xi,
            yi + offset,
        ),  # Shift the text along the y-axis to make it more visible
        ha="center",  # Center the text horizontally on the point
        va="center",  # Center the text vertically on the point
        arrowprops={
            "arrowstyle": "-"
        },  # Add a line connecting the text to the point
        fontsize="small",  # Make the font slightly smaller
    )
sns.despine(fig=fig)

In plots like the one above, data from a normal distribution with a mean of zero and a variance $\sigma^2$ will tend to fall along a straight line passing through the origin, with a slope equal to $\sigma$. It is therefore helpful to add a reference line to the plot to aid in visual assessment.
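
As a quick numerical check of this claim, the sketch below draws a hypothetical sample from a normal distribution with $\sigma = 3$ and fits a straight line to the points of the probability plot. The sample size, the seed, and the value of $\sigma$ are arbitrary choices made for illustration, and `scipy.stats.probplot` is used here only because it returns the fitted line directly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
sigma = 3.0  # Hypothetical standard deviation, chosen for illustration
sample = rng.normal(loc=0.0, scale=sigma, size=2000)

# probplot returns the plot coordinates and a least squares line fitted to them.
(theoretical, ordered), (slope, intercept, r) = stats.probplot(sample, dist="norm")

print(f"slope = {slope:.2f} (sigma = {sigma})")
print(f"intercept = {intercept:.2f} (expected to be close to 0)")
```

The fitted slope should be close to $\sigma$ and the intercept close to zero, which is why points that fall far from such a line are candidates for important effects.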

When checking if numbers are from a standard normal distribution (with $\sigma = 1$), adding the line $x=y$ to the plot is common. This line represents the expected location of points from a standard normal distribution. In the code above this can be achieved by using the `line="45"` option:

```python
plot.qqplot(ax=ax, line="45")
```

For the effects in this exercise, the variance is not known in advance, so the line $x=y$ is of limited use. Instead, fitting a line through the observed data points can be helpful. A challenge is that potential outliers (which might represent important effects) can influence the fitted line. `statsmodels` offers a robust option based on fitting through quartiles. This can be achieved using the `line="q"` option:

```python
plot.qqplot(ax=ax, line="q")
```

In some cases, even `line="q"` might be affected by outliers. A second option is to **use your own judgment to identify the linear region** and potentially fit a line manually, excluding any obvious outliers. This visual inspection might be easier to perform on a plot without added annotations (effect names), as those can sometimes distract from the overall pattern.
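
One possible way to do such a manual fit is sketched below. The effects are made up for illustration (they are not the effects from the table above), and the decision to exclude the smallest and the largest point is an assumption standing in for your own visual judgment.

```python
import numpy as np
from scipy import stats

# Hypothetical effects: twelve small "noise" effects and two large ones.
effects = np.array(
    [-0.9, -0.6, -0.5, -0.3, -0.2, -0.1, 0.1, 0.2, 0.4, 0.5, 0.7, 0.8, -6.0, 9.0]
)

# Coordinates of the normal probability plot (no line is fitted here).
theoretical, ordered = stats.probplot(effects, dist="norm", fit=False)

# Assumption for this sketch: after inspecting the plot, we judge that the
# smallest and the largest point are outliers and leave them out of the fit.
keep = slice(1, -1)
slope, intercept = np.polyfit(theoretical[keep], ordered[keep], deg=1)

# Distance from each point to the fitted line; large values stand out.
residuals = ordered - (slope * theoretical + intercept)
for value, residual in zip(ordered, residuals):
    print(f"effect = {value:5.1f}, distance from line = {residual:5.1f}")
```

The fitted line can be added to an existing plot with `ax.plot(theoretical, slope * theoretical + intercept)`.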

#### Your answer to question 3.1: What are the important effects?

*Double click here*

## Exercise 3.2

To optimize the filtration rate of a chemical product manufactured in a pressure
vessel, a 2⁴ factorial experiment was conducted in a pilot plant.
The experiment investigated the effects of temperature (A), pressure (B),
formaldehyde concentration (C), and stirring rate (D), each at two levels.

The goal was to determine how these factors influence filtration rate (volume filtered per unit time), aiming to maximize it for increased throughput and potentially lower costs.

Results from this experimental design can be found below:

In [None]:
import pandas as pd

data32 = {
    "A": [-1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1],
    "B": [-1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1],
    "C": [-1, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1],
    "D": [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1],
    "Filtration Rate (L/hour)": [
        45,
        71,
        48,
        65,
        68,
        60,
        80,
        65,
        43,
        100,
        45,
        104,
        75,
        86,
        70,
        96,
    ],
}
table32 = pd.DataFrame(data32)
table32

### 3.2(a)

Calculate all main effects and interaction effects (second-, third-, and fourth-order). 
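
As a reminder of the calculation, an effect is the difference between the mean response at the high level and the mean response at the low level of the corresponding coded column. The sketch below applies this to a small, made-up $2^2$ design; the response values are hypothetical and unrelated to the filtration data.

```python
import pandas as pd

# A made-up 2^2 design with hypothetical responses (not the filtration data).
design = pd.DataFrame(
    {
        "A": [-1, 1, -1, 1],
        "B": [-1, -1, 1, 1],
        "y": [20.0, 30.0, 24.0, 42.0],
    }
)
# The interaction column is the elementwise product of the factor columns.
design["AB"] = design["A"] * design["B"]

effects = {}
for column in ["A", "B", "AB"]:
    high = design.loc[design[column] == 1, "y"].mean()
    low = design.loc[design[column] == -1, "y"].mean()
    effects[column] = high - low

print(effects)
```

The same pattern extends to four factors: build one product column per interaction and repeat the calculation for each column.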

In [None]:
# Your code here

#### Your answer to question 3.2(a): What are the calculated effects?

*Double click here*

### 3.2(b)
Construct a normal probability plot and use it to identify the important effects.

In [None]:
# Your code here

#### Your answer to question 3.2(b): What are the important effects?
*Double click here*

### 3.2(c)
Based on your results, at what level (high or low) should you set each of the three factors A, C, and D to maximize the filtration rate?

#### Your answer to question 3.2(c): How should you set the factors A, C, and D?

*Double click here*

## Exercise 3.3

Reconsider the experiment described in [Exercise 3.2](#Exercise-3.2). Due to limited resources, we can only do 8 runs. We will therefore use a $2^{4-1}$ fractional factorial design. We select the following generator: $D=ABC$.
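
To illustrate how a generator defines a half fraction, the sketch below builds a smaller $2^{3-1}$ design from a full $2^2$ design in A and B. The generator $C = AB$ is a hypothetical choice for this illustration and gives a different design from the one in this exercise.

```python
from itertools import product

import pandas as pd

# Full 2^2 design in the basic factors A and B.
# reversed(...) makes A the fastest varying column (standard order).
runs = [tuple(reversed(levels)) for levels in product([-1, 1], repeat=2)]
design = pd.DataFrame(runs, columns=["A", "B"])

# Hypothetical generator for this illustration: C = AB.
design["C"] = design["A"] * design["B"]

# With C = AB, the defining relation is I = ABC: the product of the three
# columns is +1 in every run.
design["ABC"] = design["A"] * design["B"] * design["C"]
print(design)
```

Since the columns for C and AB are identical, their effects cannot be separated in this small design; they are aliased.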

### 3.3(a)

Find the defining contrast and resolution of this design.  Are any main effects aliased with two-factor interactions?

#### Your answer to question 3.3(a): What is the resolution and are any main effects aliased with two-factor interactions?

*Double click here*

### 3.3(b)
The results of a fractional factorial experiment are shown below.
Verify that $D = ABC$ and add columns for the interaction effects $AB$, $AC$, and $BC$ to the table.

In [None]:
# Your code here

#### Your answer to question 3.3(b): Is $D=ABC$?

*Double click here*

### 3.3(c)
Calculate all effects ($A$, $B$, $C$, $AB$, $AC$, $BC$, and $D=ABC$). Do this by creating a least squares model, for instance, using [statsmodels](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html) or [scikit-learn](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py).
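
The link between a least squares model and the effects can be sketched on a small, made-up $2^2$ design with hypothetical responses. To keep the example short, it uses `numpy.linalg.lstsq` rather than statsmodels or scikit-learn; the same design matrix could be passed to either of those libraries.

```python
import numpy as np

# Made-up 2^2 design with hypothetical responses (not the exercise data).
A = np.array([-1, 1, -1, 1])
B = np.array([-1, -1, 1, 1])
y = np.array([20.0, 30.0, 24.0, 42.0])

# Design matrix: intercept, main effects, and the interaction column.
X = np.column_stack([np.ones_like(y), A, B, A * B])
coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)

# For factors coded as -1/+1, an effect is twice the regression coefficient.
for name, coefficient in zip(["A", "B", "AB"], coefficients[1:]):
    print(f"{name}: coefficient = {coefficient:.2f}, effect = {2 * coefficient:.2f}")
```

The factor of two appears because moving a coded factor from -1 to +1 changes it by two units, while a regression coefficient describes the change per unit.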

In [None]:
# Your code here

#### Your answer to question 3.3(c): What are the effects you calculated?

*Double click here*

### 3.3(d)

Identify the unimportant effect(s) based on their calculated values.  Confirm their insignificance by fitting a new least squares model without the unimportant effects. (Removing the unimportant effects should not change $R^2$  significantly.)
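
The comparison of $R^2$ values can be sketched as follows, again with made-up data: a hypothetical $2^3$ design in which factor C has almost no influence on the response. The helper `r_squared` is defined here for illustration and is not part of any library.

```python
import numpy as np


def r_squared(X, y):
    """Return R^2 for a least squares fit of y on the columns of X."""
    coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coefficients
    total = y - y.mean()
    return 1.0 - (residuals @ residuals) / (total @ total)


# Made-up 2^3 design with hypothetical responses (not the exercise data).
A = np.array([-1, 1, -1, 1, -1, 1, -1, 1])
B = np.array([-1, -1, 1, 1, -1, -1, 1, 1])
C = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
y = np.array([10.2, 19.8, 14.1, 24.3, 9.9, 20.4, 13.8, 23.9])

ones = np.ones_like(y)
full = np.column_stack([ones, A, B, C])
reduced = np.column_stack([ones, A, B])  # C is left out

r2_full = r_squared(full, y)
r2_reduced = r_squared(reduced, y)
print(f"R^2 with C:    {r2_full:.4f}")
print(f"R^2 without C: {r2_reduced:.4f}")
```

Removing a term can never increase $R^2$, so the question is only whether the drop is negligible.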

In [None]:
# Your code here

#### Your answer to question 3.3(d): What are the unimportant effects?

*Double click here*

## Exercise 3.4

[Teixeira et al. (2018)](https://doi.org/10.1016/j.scitotenv.2018.07.204) investigated the removal of two antibiotics from water using walnut shell-based activated carbon and a Box-Behnken experimental design. Their study examined the effects of pH, temperature (T), and initial antibiotic concentration (C) on antibiotic removal (measured as mg of antibiotic adsorbed per gram of activated carbon).

We will investigate if we can reproduce their results for one of the antibiotics, [Sulfamethoxazole](https://en.wikipedia.org/wiki/Sulfamethoxazole), namely that the optimal conditions for removal of Sulfamethoxazole are obtained at a temperature of 30 °C, initial concentration of 40 mg/L and a pH value of 5.5.

**Task:** Create a linear regression model using the provided experimental data (see [antibiotic.csv](./antibiotic.csv)) to predict the conditions (pH, T, and C) that maximize the removal of Sulfamethoxazole from water. The data includes measurements of Sulfamethoxazole adsorption (mg/g) at various pH (ranging from 2 to 8), temperature (ranging from 10 to 30 °C), and initial concentration (ranging from 20 to 40 mg/L) levels.

The linear model should include all main effects, all squared (second-order) terms, and all two-factor interaction effects. That is, your model for the adsorption of the antibiotic ($y$) should be of the form: 

$y = b_0 + \sum_i b_i x_i + \sum_i b_{ii} x_i^2 + \sum_{i <j} b_{ij} x_i x_j$

where $x_i$ represents one of the factors (pH, T, or C).


**Suggested steps:**
1. Load the data using [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).
2. Generate terms needed for the linear model using scikit-learn's [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).
3. Fit the linear regression model using [statsmodels](https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html) or [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). Evaluate the model's performance by calculating $R^2$.
4. Identify the conditions (pH, T, and C) that maximize Sulfamethoxazole removal within the experimental region.  This can be done by using methods from [scipy.optimize](https://docs.scipy.org/doc/scipy/tutorial/optimize.html), for instance [minimize](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html) (you can maximize $y$ by minimizing $-y$).
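
Steps 2 to 4 can be sketched on synthetic data as shown below. Everything about the data is an assumption made for illustration: the response surface, the noise level, and the use of a full three-level grid (the real data set uses a Box-Behnken design). The function name `negative_prediction` is likewise hypothetical.

```python
from itertools import product

import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data in coded units (hypothetical, not the antibiotic data).
rng = np.random.default_rng(seed=0)
X = np.array(list(product([-1.0, 0.0, 1.0], repeat=3)))
y = 50 - 4 * (X[:, 0] - 0.5) ** 2 - 2 * (X[:, 1] + 0.25) ** 2 + 3 * X[:, 2]
y = y + rng.normal(scale=0.1, size=len(y))

# Step 2: generate linear, squared, and interaction terms.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Step 3: fit the model and calculate R^2.
model = LinearRegression().fit(X_poly, y)
r2 = model.score(X_poly, y)


# Step 4: maximize the prediction by minimizing its negative.
def negative_prediction(x):
    return -model.predict(poly.transform(x.reshape(1, -1)))[0]


result = minimize(
    negative_prediction,
    x0=np.zeros(3),  # Start from the center of the design
    bounds=[(-1, 1)] * 3,  # Stay inside the experimental region
    method="L-BFGS-B",
)
print(f"R^2 = {r2:.3f}")
print("Optimum (coded units):", np.round(result.x, 2))
```

The bounds keep the optimizer inside the region covered by the experiments; a fitted polynomial should not be trusted outside it.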

In [2]:
# To load the raw data:
import pandas as pd

data = pd.read_csv("antibiotic.csv")

The raw data is provided in the following format:

|    |   pH |   T (°C) |   C (mg/L) |   x1 |   x2 |   x3 |   Sulfamethoxazole (mg/g) |
|---:|-----:|---------:|-----------:|-----:|-----:|-----:|--------------------------:|
|  0 |    2 |       10 |         30 |   -1 |   -1 |    0 |                      57.2 |
|  1 |    2 |       10 |         30 |   -1 |   -1 |    0 |                      54.9 |
|  2 |    2 |       10 |         30 |   -1 |   -1 |    0 |                      54.3 |

Where the columns are:

*   **pH:**  The pH of the solution.
*   **T (°C):** The temperature of the solution in degrees Celsius.
*   **C (mg/L):** The initial concentration of Sulfamethoxazole in mg/L.
*   **x1:** The pH value scaled to the range [-1, 1] using the formula:  `x1 = (pH - 5) / 3`
*   **x2:** The temperature scaled to the range [-1, 1] using the formula: `x2 = (T - 20) / 10`
*   **x3:** The initial concentration scaled to the range [-1, 1] using the formula: `x3 = (C - 30) / 10`
*   **Sulfamethoxazole (mg/g):** The amount of Sulfamethoxazole adsorbed per gram of activated carbon, measured in mg/g.
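
The scaling formulas above, and their inverses, can be written as two small helper functions. The function names are hypothetical; the centers and half-ranges are taken from the formulas in the list.

```python
def to_coded(value, center, half_range):
    """Scale a value in natural units to the coded range [-1, 1]."""
    return (value - center) / half_range


def to_natural(coded, center, half_range):
    """Convert a coded value back to natural units."""
    return coded * half_range + center


print(to_coded(2, center=5, half_range=3))  # pH = 2 in coded units
print(to_natural(1, center=20, half_range=10))  # x2 = 1 as a temperature
print(to_natural(1, center=30, half_range=10))  # x3 = 1 as a concentration
```

The inverse transformation is useful when an optimum is found in coded units and has to be reported as pH, T, and C.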


In [None]:
# Your code here

#### Your answer to question 3.4: What settings give you the optimum adsorption?

*Double click here*