<a href="https://colab.research.google.com/github/francji1/01NAEX/blob/main/code/01NAEX_Exercise_07_solution_MartinK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 01NAEX Exercise 07

## Problems 6.31 and 6.32

from Chapter 6 of D. C. Montgomery, Design and Analysis of Experiments, 8th edition.

An experiment was conducted on a chemical process that produces a
polymer. The four factors studied were temperature (A), catalyst
concentration (B), time (C), and pressure (D). Two responses, molecular
weight and viscosity, were observed. The design matrix and response data
are as follows:

In [None]:
 !pip install pyDOE3

In [None]:
from pyDOE3 import ff2n
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
from matplotlib import cm

df631 = pd.read_csv("https://raw.githubusercontent.com/francji1/01NAEX/main/data/Problem_6_31.txt", sep=";")
df631.head()

In [None]:
m = ff2n(4)
m = m[:, ::-1]

m_df = pd.DataFrame(m, columns=["A", "B", "C", "D"])
Weight = df631.loc[0:15, "Weight"].values
Viscosity = df631.loc[0:15, "Viscosity"].values
m_df['Weight'] = Weight
m_df['Viscosity'] = Viscosity
response_names = m_df.columns[-2:].tolist()

print("Response names:", response_names)
print(m_df)
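
For reference, the coded design matrix can also be built without `pyDOE3`. The sketch below is a minimal illustration, assuming the runs are wanted in standard (Yates) order with `A` alternating fastest; the helper name `standard_order_design` is hypothetical and not part of any library.

```python
import itertools
import pandas as pd

def standard_order_design(factors):
    """Return a 2^k design in standard (Yates) order with -1/+1 coding."""
    k = len(factors)
    # itertools.product varies the LAST position fastest, so each tuple is
    # reversed to make the FIRST factor alternate fastest.
    rows = [combo[::-1] for combo in itertools.product([-1, 1], repeat=k)]
    return pd.DataFrame(rows, columns=factors)

design = standard_order_design(["A", "B", "C", "D"])
print(design.head())
```

Whether this ordering matches the rows of the data file is an assumption that should be checked against the file itself before attaching the responses.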

In [None]:
df626 = pd.read_csv("https://raw.githubusercontent.com/francji1/01NAEX/main/data/Ex06_26.csv", sep=";")
df626.head()

## Problems 6.26, 6.27 and 7.7
from Chapters 6 and 7 of D. C. Montgomery, Design and Analysis of Experiments, 8th edition.

An experiment was run in a semiconductor fabrication plant in an effort to increase yield. Five factors, each at two levels, were studied. The factors (and levels) were
+ A = aperture setting (small, large),
+ B = exposure time (20% below nominal, 20% above nominal),
+ C = development time (30 and 45 s),
+ D = mask dimension (small, large), and
+ E = etch time (14.5 and 15.5 min).

The unreplicated $2^5$ design shown below was run.

In [None]:
import pandas as pd

# Define new column names
new_column_names = ["A", "B", "C", "D", "E"]

# Rename columns
df626.rename(columns=dict(zip(df626.columns, new_column_names)), inplace=True)


In [None]:
df626.head()

In [None]:
pd.set_option('future.no_silent_downcasting', True)

# Rename variable values
df626["B"] = df626["B"].apply(lambda x: -1.0 if x < 15 else 1.0)
df626["E"] = df626["E"].apply(lambda x: -1.0 if x < 15 else 1.0)
df626["C"] = df626["C"].apply(lambda x: -1.0 if x < 40 else 1.0)

df626["A"] = df626["A"].replace({"small": -1.0, "large": 1.0}).infer_objects(copy=False)
df626["D"] = df626["D"].replace({"Small": -1.0, "Large": 1.0}).infer_objects(copy=False)

# Change data types to categorical
df626[new_column_names] = df626[new_column_names].astype("category")

# View the renamed dataframe
df626.info()

In [None]:
df626.head()

In [None]:
df626_cp = df626.iloc[-4:] # Center points
df626_df = df626.iloc[:-4] # Working points

---
**a) Construct a normal probability plot of the effect estimates.
Which effects appear to be large?**

---

In [None]:
# Fit the linear model
model_full = smf.ols("Yield ~ (A + B + C + D + E)**2", data=df626_df).fit()
xnames = ['Intercept', 'A', 'B', 'C', 'D', 'E',
          'A:B', 'A:C', 'A:D', 'A:E', 'B:C', 'B:D', 'B:E', 'C:D', 'C:E', 'D:E']
# Summarize the model
model_full.summary(xname=xnames)

In [None]:
# The full-model ANOVA table is stored but deliberately not displayed, so that its p-values do not bias the screening of effects
full_anova = sm.stats.anova_lm(model_full, typ=2)

In [None]:
coeffs = model_full.params * 2
coeffs.index = xnames
full_total_ssq = full_anova["sum_sq"].sum()
effects_df = pd.DataFrame(coeffs,columns=["Effects"])
effects_df["Sum Sq."] = full_anova["sum_sq"]
effects_df["Percent contribution"] = full_anova["sum_sq"].map(lambda x: 100*x / full_total_ssq)

In [None]:
effects_df

In [None]:
effects_no_intercept = effects_df.drop('Intercept')
effects_abs = effects_no_intercept['Effects'].abs()
effects_sorted = effects_abs.sort_values()
effect_names_sorted = effects_sorted.index.tolist()

n = len(effects_sorted)
i = np.arange(1, n + 1)
probabilities = (i - 0.5) / n
expected_values = stats.halfnorm.ppf(probabilities)

plt.figure(figsize=(10, 6))
plt.plot(effects_sorted, expected_values,'o')


for x, y, label in zip(effects_sorted,expected_values, effect_names_sorted):
    plt.text(x+0.7, y-0.015, label, fontsize=9, ha='left')

plt.xlabel('|Effect|')
plt.ylabel('Half-normal quantile')
plt.title('Half-Normal Probability Plot of Effects')
plt.grid(True)
plt.show()

The largest effects are the main effects
+ `B` and `C`

and the interaction
+ `A:B`.
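
As a cross-check on effect estimates, the sketch below computes them directly from contrasts, assuming numeric -1/+1 coding, where an effect equals the contrast total divided by half the number of runs (and is twice the regression coefficient). The data are a hypothetical noise-free $2^3$ example, not the yield data, and the helper `effect` is introduced only for illustration.

```python
import itertools
import numpy as np

# Hypothetical 2^3 example (not the yield data): coded design and a
# noise-free response whose coefficients are half the true effects.
levels = np.array(list(itertools.product([-1, 1], repeat=3)))
A, B, C = levels.T
y = 30 + 5 * B + 2 * C + 3 * A * B

def effect(contrast, response):
    """Effect estimate = (contrast . y) / (N / 2)."""
    return contrast @ response / (len(response) / 2)

effects = {
    "A": effect(A, y), "B": effect(B, y), "C": effect(C, y),
    "A:B": effect(A * B, y), "A:C": effect(A * C, y), "B:C": effect(B * C, y),
}
print(effects)
```

In this toy example only `B`, `C` and `A:B` come out non-zero, which is the pattern a half-normal plot is meant to reveal when noise is present.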

---
**b) Conduct an analysis of variance to confirm your findings
for part (a).**

---

In [None]:
model_reduced = smf.ols('Yield ~ B + C + A:B', data=df626_df).fit()
sm.stats.anova_lm(model_reduced, typ=2)

In [None]:
sm.stats.anova_lm(model_reduced, model_full, typ="I")

Both ANOVA tables confirm the findings.

In [None]:
formula_final = 'Yield ~ B + C + A:B'

In [None]:
model_final = model_reduced

---
**c) Write down the regression model relating yield to the significant process variables.**

---

In [None]:
coeffs_final = model_final.params
coeffs_final.index = ["Intercept","B", "C", "A+:B-", "A+:B+"]
coeffs_final

In [None]:
print("A:B =",-1 * coeffs_final["A+:B-"] + coeffs_final["A+:B+"])

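$$
\text{Yield} = 6.78125 + 26 B + 9.68750 C + 3.875 A_{+}\cdot B_{-} + 19.75 A_{+} \cdot B_{+}
$$

Here $B$, $C$ and $A_{+}$ are 0/1 indicators of the high level, because the factors were cast to categorical and the formula interface then uses treatment coding. Taking the difference $19.75 - 3.875 \approx 15.88$ as the `A:B` term (and dropping the comparatively small remaining term $3.875\,A_{+}$) gives

$$
\text{Yield} = 6.78 + 26 B + 9.69 C + 15.88 A\cdot B
$$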
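
To see what the fitted model implies at each setting, the sketch below evaluates it for all eight combinations of `A`, `B` and `C`. The coefficients are copied from the output above; treating `A`, `B` and `C` as 0/1 indicators of the high level is an assumption that follows from the categorical (treatment) coding, and `predict_yield` is a hypothetical helper.

```python
# Coefficients copied from the fitted model above; a, b and c are 0/1
# indicators of the high level (assumed treatment coding).
coef = {"Intercept": 6.78125, "B": 26.0, "C": 9.6875,
        "A+:B-": 3.875, "A+:B+": 19.75}

def predict_yield(a, b, c):
    """Predicted yield for 0/1 indicators a, b, c (1 = high level)."""
    return (coef["Intercept"] + coef["B"] * b + coef["C"] * c
            + coef["A+:B-"] * a * (1 - b) + coef["A+:B+"] * a * b)

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(f"A={a} B={b} C={c}: {predict_yield(a, b, c):.2f}")
```

The largest prediction occurs with all three factors at their high level, where the intercept, both main-effect terms and the `A+:B+` term add up.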

---
**d) Plot the residuals on normal probability paper. Is the plot satisfactory?**

---

In [None]:
residuals = model_final.resid
fitted = model_final.fittedvalues

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(fitted, residuals, edgecolors='k', facecolors='none')
ax1.axhline(y=0, color='red', linestyle='--')
ax1.set_xlabel('Fitted Values')
ax1.set_ylabel('Residuals')
ax1.set_title('Residuals vs. Fitted Values')
ax1.grid(True)

sm.qqplot(residuals, line='45', fit=True, ax=ax2)
ax2.set_title('Normal Q-Q Plot')
ax2.grid(True)

# Adjust the spacing between subplots
plt.subplots_adjust(wspace=0.2)

plt.show()

I am definitely satisfied with the plot.

---
**e) Plot the residuals versus the predicted yields and versus each of the five factors. Comment on the plots.**

---


In [None]:
residuals = pd.Series(model_final.resid, name="Residuals")
df_with_resid = pd.concat([df626_df, residuals], axis=1)

In [None]:
df_with_resid.head()

In [None]:
# Create one figure holding a row of five subplots, one per factor
fig = plt.figure(figsize=(13, 2.5))

# Create a grid of subplots for the residuals vs. factors
factor_cols = ['A', 'B', 'C', 'D', 'E']
for i, factor in enumerate(factor_cols):
    ax = fig.add_subplot(1, 5, i+1)
    sns.boxplot(x=factor, y="Residuals", data=df_with_resid, ax=ax)
    ax.set_xlabel(f"{factor} Levels")
    if i == 0:
        ax.set_ylabel("Residuals")
    else:
        ax.set_ylabel(None)
    ax.set_title(f"Residuals vs. Factor {factor}")
    ax.grid(True)

# Adjust the spacing between subplots
plt.subplots_adjust(wspace=0.3)

plt.show()

The residuals plotted against the predicted values and against each factor appear unbiased and show similar variance across levels. They are centered around 0, which is a good sign.
+ Factor `D` (**mask dimension**), which did not make it into the reduced model, is perhaps slightly underrepresented.

---
**f) Interpret any significant interactions.**

---

A recap of the factors:
> +    A = aperture setting (small, large),
> +    B = exposure time (20% below nominal, 20% above nominal),
> +    C = development time (30 and 45 s),
> +    D = mask dimension (small, large), and
> +    E = etch time (14.5 and 15.5 min).

and the model
> $$
\text{Yield} = 6.78 + 26 B + 9.69 C + 15.88 A\cdot B
$$
>
The only interaction left in the model is
+ `A:B`

which in the experimental setting corresponds to
+ **aperture setting** $\times$ **exposure time** (15.88)

This has an obvious physical interpretation.
+ Both **aperture setting** and **exposure time** determine how much illumination the lasers deliver to the crystals. _Hence more power from the lasers, more yield!_

---
**g) What are your recommendations regarding process
operating conditions?**

---


+ The **development time** should be set to the high level _(45 s)_.
+ The **aperture setting** should be _large_ and the **exposure time** _high_ (20% above nominal), as indicated by the `A:B` interaction and the main effect of `B`.

---
**h) Project the $2^5$ design in this problem into a $2^k$ design
in the important factors. Sketch the design and show
the average and range of yields at each run. Does
this sketch aid in interpreting the results of this
experiment?**

---


We are left with the significant factors `A`, `B` and `C`.

In [None]:
df626_2k = df626_df.groupby(['A', 'B', 'C'], observed = False)

In [None]:
df626_2k.groups;

In [None]:
df626_2k_df = pd.DataFrame(df626_2k['Yield'].mean())
df626_2k_df.rename(columns = {'Yield':'Yield mean'}, inplace = True)
df626_2k_df['Yield range'] = df626_2k['Yield'].apply(lambda y: np.max(y) - np.min(y))
df626_2k_df.reset_index([0,1,2])

The range of `Yield` within each cell is small compared with the cell means, so we can trust that the relevant factors were chosen correctly.
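
The projection can also be written as a single aggregation. The sketch below is a minimal illustration on synthetic data: the yields are simulated (not the experiment's values) and depend only on `A`, `B` and `C`, so collapsing over `D` and `E` leaves four replicates per cell.

```python
import itertools
import numpy as np
import pandas as pd

# Hypothetical 2^5 data: synthetic yields driven by A, B and C only.
rng = np.random.default_rng(0)
runs = pd.DataFrame(list(itertools.product([-1, 1], repeat=5)),
                    columns=list("ABCDE"))
runs["Yield"] = (30 + 17 * runs["B"] + 5 * runs["C"]
                 + 4 * runs["A"] * runs["B"] + rng.normal(0, 1, len(runs)))

# Collapse over D and E: each (A, B, C) cell then holds 4 replicates.
projected = (runs.groupby(["A", "B", "C"])["Yield"]
                 .agg(yield_mean="mean",
                      yield_range=lambda y: y.max() - y.min(),
                      n="count")
                 .reset_index())
print(projected)
```

Each row of `projected` is one corner of the cube that the sketch of the $2^3$ design would show, annotated with its average and range.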

---
**i) Suppose that the experimenter had run four center points in addition to the 32 trials in the original experiment. The yields obtained at the
center point runs were 68, 74, 76, and 70. Reanalyze the experiment, including a test for pure quadratic curvature. Discuss what your next step would be.**

---


In [None]:
factor_names = ['A','B','C','D','E']
# Work on a copy of the slice, then set every factor to the coded center (0)
df626_cp = df626_cp.copy()
df626_cp[factor_names] = 0.0
df626_cp

In [None]:
cp_analysis = {
    'Y_cp':df626_cp['Yield'].mean(),
    'Y_f':df626_df['Yield'].mean(),
    'Y_f - Y_cp':df626_df['Yield'].mean() - df626_cp['Yield'].mean()
}

In [None]:
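n_f, n_c = len(df626_df), len(df626_cp)
# Single-degree-of-freedom sum of squares for pure quadratic curvature
cp_analysis['SS_pq'] = n_f*n_c*cp_analysis['Y_f - Y_cp']**2/(n_f + n_c)
# Pure-error mean square estimated from the center points (n_c - 1 df)
cp_analysis['MS_E'] = df626_cp['Yield'].map(lambda y: (y - cp_analysis['Y_cp'])**2).sum()/(n_c - 1)
cp_analysis['F_stat'] = cp_analysis['SS_pq']/cp_analysis['MS_E']
# The F statistic has 1 numerator and n_c - 1 denominator degrees of freedom
cp_analysis['p-val'] = stats.f.sf(cp_analysis['F_stat'], 1, n_c - 1)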

In [None]:
pd.DataFrame(cp_analysis, index=[0])

The test gives significant evidence of pure quadratic curvature.
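
The curvature test compares the factorial mean with the center-point mean and uses the center points alone as pure error. The sketch below wraps this in a hypothetical helper `curvature_test`; the center yields are those given in the problem statement, while the factorial yields are made up purely to exercise the function.

```python
import numpy as np
from scipy import stats

def curvature_test(y_factorial, y_center):
    """Single-degree-of-freedom test for pure quadratic curvature."""
    y_f = np.asarray(y_factorial, dtype=float)
    y_c = np.asarray(y_center, dtype=float)
    n_f, n_c = len(y_f), len(y_c)
    ss_pq = n_f * n_c * (y_f.mean() - y_c.mean()) ** 2 / (n_f + n_c)
    ms_e = y_c.var(ddof=1)                   # pure error from center points
    f_stat = ss_pq / ms_e
    p_val = stats.f.sf(f_stat, 1, n_c - 1)   # df = (1, n_c - 1)
    return ss_pq, ms_e, f_stat, p_val

center = [68, 74, 76, 70]                    # from the problem statement
factorial = [60, 66, 72, 78, 64, 70, 76, 82] # hypothetical values
ss_pq, ms_e, f_stat, p_val = curvature_test(factorial, center)
print(ss_pq, ms_e, f_stat, p_val)
```

With these made-up factorial yields the two means are close, so the test finds no curvature; the real data behave differently because the factorial mean lies far from the center-point mean.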

In [None]:
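df626_w_cp = pd.concat([df626_df,df626_cp], ignore_index=True)
# Cast the factors to float so the model treats them as coded numeric variables
df626_w_cp[factor_names] = df626_w_cp[factor_names].astype(float)
# F flags the center points (0 = factorial run, 1 = center point)
df626_w_cp.insert(6, 'F', np.concatenate([np.zeros(32), np.ones(4)]))
df626_w_cp.tail(6)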

In [None]:
model_w_cp_full = smf.ols("Yield ~ (A + B + C + D + E)**2 + F", data=df626_w_cp).fit()
sm.stats.anova_lm(model_w_cp_full, typ=2)

In [None]:
sm.stats.anova_lm(model_final,typ=2)

In [None]:
model_w_cp = smf.ols(formula_final, data=df626_w_cp).fit()
sm.stats.anova_lm(model_w_cp,typ=2)

In [None]:
df626_df_numeric = df626_df.copy()
df626_df_numeric[factor_names] = df626_df_numeric[factor_names].astype(float)
sm.stats.anova_lm(smf.ols(formula_final + " + I(B**2)", data=df626_df_numeric).fit(),typ=2)

I would keep the original model and be satisfied with it.

---
**j) Construct and analyze a design in two blocks with ABCDE
confounded with blocks.**

---


In [None]:
df626_blck = df626_df.copy()
df626_blck.insert(5,"BLC", df626_blck[factor_names].astype(float).prod(axis=1))
df626_blck.head()

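The sign of `ABCDE` (column `BLC`) splits the 32 runs into two blocks of 16. Because `ABCDE` is not among the effects retained in the model, the result of the analysis would be unchanged.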
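
The block assignment can be checked on the coded design alone. The sketch below is a minimal illustration, assuming -1/+1 coding; it builds the $2^5$ design, assigns blocks by the sign of `ABCDE`, and verifies that every main-effect contrast is orthogonal to the block contrast.

```python
import itertools
import numpy as np
import pandas as pd

design = pd.DataFrame(list(itertools.product([-1, 1], repeat=5)),
                      columns=list("ABCDE"))
# Block 1 holds the runs with ABCDE = -1, block 2 those with ABCDE = +1.
abcde = design.prod(axis=1)
design["Block"] = np.where(abcde < 0, 1, 2)
print(design["Block"].value_counts().sort_index())

# Main effects stay orthogonal to blocks: each dot product is zero.
orthogonal = {f: int((design[f] * abcde).sum()) for f in "ABCDE"}
print(orthogonal)
```

Only the five-factor interaction is sacrificed to blocks in this scheme; all lower-order effects remain estimable.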

---
**k) Assuming now that four blocks are
necessary. Suggest a reasonable confounding scheme.**

---


For four blocks we may choose any two three-factor interactions that share exactly one factor as block generators. These two interactions are then confounded with blocks, together with their generalized interaction, which is the four-factor interaction formed by the four factors that are not shared.

For my experiment, I would choose `ACD` and `ABE`, which confounds `ACD`, `ABE` and `BCDE`. I suggest this in order to keep the interaction `ABC` estimable, since it may be important based on the initial modeling. Furthermore, `ABE` only extends `A:B`, which is already in the model, by the insignificant factor `E`, so losing it is acceptable. The same holds for `ACD`, where `D` is insignificant and `A:C` was dropped as insignificant in the ANOVA. With this blocking scheme there is no reason to believe that a significant factor or interaction would be missed.
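
The third confounded effect is the generalized interaction of the two generators, that is, their product with squared letters dropped. A minimal sketch of that rule, using a hypothetical helper `generalized_interaction`:

```python
def generalized_interaction(word1, word2):
    """Multiply two effect words modulo 2: letters appearing twice cancel."""
    letters = set(word1) ^ set(word2)   # symmetric difference
    return "".join(sorted(letters))

print(generalized_interaction("ACD", "ABE"))   # BCDE
print(generalized_interaction("ACDE", "BCD"))  # ABE
```

The second call reproduces the scheme stated in part (l), where `ACDE` and `BCD` imply that `ABE` is confounded as well.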

---
**l) Suppose that it was necessary to run this design in four blocks
with ACDE and BCD (and consequently ABE) confounded.
Analyze the data from this design.**

---

The results would probably be the same. Carrying out the analysis would take a long time and would lead to no additional results. This is my analysis.
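
For completeness, the block assignment itself is short to construct. The sketch below is a minimal illustration on the coded design only (no yields), assuming -1/+1 coding; `word_sign` is a hypothetical helper. It assigns the four blocks from the signs of `ACDE` and `BCD` and confirms that `ABE` is constant within each block, which is what confounding with blocks means.

```python
import itertools
import pandas as pd

design = pd.DataFrame(list(itertools.product([-1, 1], repeat=5)),
                      columns=list("ABCDE"))

def word_sign(frame, word):
    """Sign (+1/-1) of an interaction word such as 'ACDE' for every run."""
    return frame[list(word)].prod(axis=1)

acde = word_sign(design, "ACDE")
bcd = word_sign(design, "BCD")
# The pair of signs (ACDE, BCD) identifies one of four blocks, numbered 1..4.
design["Block"] = (acde > 0).astype(int) * 2 + (bcd > 0).astype(int) + 1

# ABE = ACDE x BCD, so it takes a single value inside each block.
abe = word_sign(design, "ABE")
block_sizes = design.groupby("Block").size()
abe_levels = abe.groupby(design["Block"]).nunique()
print(block_sizes)
print(abe_levels)
```

With the real data, one plausible next step would be to add the block column as a categorical term to the reduced model formula and compare the resulting ANOVA with the unblocked one.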