<a href="https://colab.research.google.com/github/francji1/01NAEX/blob/main/code/01NAEX_Exercise_04_python_sol_GG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 01NAEX Exercise 04

## Setup

In [None]:
!pip install rpy2

In [None]:
%load_ext rpy2.ipython

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm, t
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
import scipy.stats as stats


In [None]:
# Recap of the Lecture in Python

# Read the data from the URL
rocket_path = "https://raw.githubusercontent.com/francji1/01NAEX/refs/heads/main/data/rocket2.txt"
rocket = pd.read_csv(rocket_path, sep=";")


# Renaming columns for consistency
rocket.rename(columns={'op': 'operator', 'y': 'Propellant'}, inplace=True)

# Converting columns to factors (categorical variables)
rocket['operator'] = rocket['operator'].astype('category')
rocket['batch'] = rocket['batch'].astype('category')

# Latin Square Design Plotting
sns.boxplot(x='operator', y='Propellant', data=rocket)
plt.show()

sns.boxplot(x='batch', y='Propellant', data=rocket)
plt.show()

sns.boxplot(x='treat', y='Propellant', data=rocket)
plt.show()

# Latin Square Design - Linear Model
rocket_lm = smf.ols('Propellant ~ operator + batch + treat', data=rocket).fit()
print(sm.stats.anova_lm(rocket_lm))


# Without considering batch as a factor
rocket_lm2 = smf.ols('Propellant ~ operator + treat', data=rocket).fit()
print(sm.stats.anova_lm(rocket_lm2))


## Problem 4.23
From Chapter 4 of D. C. Montgomery, Design and Analysis of Experiments, 8th edition.

An industrial engineer is investigating the effect of
four assembly methods (A, B, C, D) on the assembly time for
a color television component. Four operators are selected for
the study. Furthermore, the engineer knows that each assembly
method produces such fatigue that the time required for
the last assembly may be greater than the time required for the
first, regardless of the method. That is, a trend develops in the
required assembly time. To account for this source of variability,
the engineer uses the Latin square design shown below.
Analyze the data from this experiment (use $\alpha = 0.05$) and draw
appropriate conclusions.



In [None]:
# Read the data from the URL
url_4_23 = "https://raw.githubusercontent.com/francji1/01NAEX/main/data/Problem_4_23.txt"
df_4_23 = pd.read_csv(url_4_23, sep=";")

# Display the first few rows of the dataframe
df_4_23.head(10)

Inspect the unique values of the columns to check which are categorical.

In [None]:
for col in df_4_23.columns:
    print(f"{col}: {df_4_23[col].unique()}")

_Time_ is, obviously, a quantitative variable. Cast the _Operator_ and _Method_ variables to categorical.

In [None]:
df_4_23['Method'] = df_4_23['Method'].astype('category')
df_4_23['Operator'] = df_4_23['Operator'].astype('category')

Perform ANOVA with _Time_ as the response.

In [None]:
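# Order is not wrapped in C(): if it is stored as integers, it enters as a numeric
# covariate (linear trend, 1 df). C(Order) would treat it as a blocking factor.
model_4_23 = smf.ols('Time ~ C(Method) + C(Operator) + Order', data=df_4_23).fit()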
anova_tab_4_23 = sm.stats.anova_lm(model_4_23, typ=2)
anova_tab_4_23

From the ANOVA table, it's apparent that the assembly method (_Method_) and operator (_Operator_) have a significant effect on the assembly time (_Time_), while the order (_Order_) in which the assembly was performed does not.
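The partition behind a Latin square ANOVA table can be sketched by hand. The snippet below is only an illustration: the 4x4 layout, the effect sizes and the simulated response are all hypothetical and unrelated to the problem data. It treats both blocking factors (rows and columns) as categorical, which leaves $(p-1)(p-2)$ error degrees of freedom.

```python
import numpy as np

# Hypothetical cyclic 4x4 Latin square: rows = order, columns = operator,
# entries = treatment index (0..3). Every index appears once per row and column.
treat = np.array([[0, 1, 2, 3],
                  [1, 2, 3, 0],
                  [2, 3, 0, 1],
                  [3, 0, 1, 2]])
p = treat.shape[0]

# Simulated response (made-up effects plus noise), for illustration only.
rng = np.random.default_rng(0)
row_eff = np.array([0.0, 1.0, -1.0, 0.5])
col_eff = np.array([0.0, 2.0, -0.5, 1.0])
trt_eff = np.array([0.0, 3.0, 1.0, -1.0])
y = 10 + row_eff[:, None] + col_eff[None, :] + trt_eff[treat] + rng.normal(0, 1, (p, p))

grand = y.mean()
ss_total = ((y - grand) ** 2).sum()
ss_rows = p * ((y.mean(axis=1) - grand) ** 2).sum()
ss_cols = p * ((y.mean(axis=0) - grand) ** 2).sum()
treat_means = np.array([y[treat == k].mean() for k in range(p)])
ss_treat = p * ((treat_means - grand) ** 2).sum()
# The three factors are mutually orthogonal, so the error SS is what remains.
ss_error = ss_total - ss_rows - ss_cols - ss_treat

df_factor = p - 1
df_error = (p - 1) * (p - 2)
f_treat = (ss_treat / df_factor) / (ss_error / df_error)
print(ss_rows, ss_cols, ss_treat, ss_error, f_treat)
```

If one of the blocking factors were entered as a single numeric covariate instead, it would use 1 degree of freedom rather than $p-1$, and the error term would absorb the rest.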

Let's check a more detailed model summary.

In [None]:
model_4_23.summary()

In the summary, each coefficient is a contrast with the reference level of its factor. The _Method_ coefficients for levels C and D are not significantly different from zero.

Among the operators, only the coefficient for the second one is significant.



Let's check the model assumptions.

- Visual inspection of the residuals

In [None]:
resid_4_23 = model_4_23.resid

fig, ax = plt.subplots(1, 3, figsize=(15, 7), dpi=100)
sns.histplot(resid_4_23, kde=False, ax=ax[0])
stats.probplot(resid_4_23, plot=ax[1])
sns.residplot(x=model_4_23.fittedvalues, y=resid_4_23, lowess=True, ax=ax[2], line_kws={'color': 'red'})
plt.show()

- Normality test

In [None]:
# Shapiro-Wilk
shap_results = stats.shapiro(resid_4_23)
print(shap_results)


We fail to reject the null hypothesis that the residuals are normally distributed.

- Homoscedasticity test

In [None]:
import statsmodels.stats.api as sms

bp = sms.het_breuschpagan(model_4_23.resid, model_4_23.model.exog)
# het_breuschpagan returns (LM statistic, LM p-value, F statistic, F p-value)
for statval, pval in zip(bp[0::2], bp[1::2]):
    print(statval, pval)

Again, we're failing to reject the null hypothesis. The residuals are most likely homoscedastic.
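For intuition, the LM statistic of the Breusch-Pagan test can be reproduced by hand: regress the squared residuals on the model's design matrix and multiply the resulting $R^2$ by the sample size. The sketch below uses simulated data (hypothetical, homoscedastic by construction) and the studentized (Koenker) form; that this matches the default behaviour of `het_breuschpagan` is an assumption worth checking against the statsmodels documentation.

```python
import numpy as np

# Simulated data, for illustration only: errors have constant variance.
rng = np.random.default_rng(1)
n = 40
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)

# Main regression and its residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Auxiliary regression of squared residuals on the same design matrix.
u2 = resid ** 2
gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
fitted = X @ gamma
r2 = 1 - ((u2 - fitted) ** 2).sum() / ((u2 - u2.mean()) ** 2).sum()

# LM statistic; under the null it is approximately chi-square with
# (number of columns of X minus 1) degrees of freedom.
lm_stat = n * r2
print(lm_stat)
```

A large statistic means the squared residuals are predictable from the regressors, which is evidence against constant variance.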

Since different _Method_ and _Operator_ variable levels seem to influence the response to a varying extent, we should run a post-hoc test to determine which level pairs are significantly different from each other.

In [None]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Tukey HSD
tukey_res_4_23 = pairwise_tukeyhsd(df_4_23['Time'], df_4_23['Method'], alpha=0.05)

# Display results
print(tukey_res_4_23)

# Plotting Tukey HSD results
tukey_res_4_23.plot_simultaneous()
plt.show()

The Tukey HSD test shows that the only significant difference in assembly time is between methods A and C.

Running the same test for the _Operator_ variable.

In [None]:
tukey_res_4_23_oper = pairwise_tukeyhsd(df_4_23['Time'], df_4_23['Operator'], alpha=0.05)

print(tukey_res_4_23_oper)

tukey_res_4_23_oper.plot_simultaneous()
plt.show()

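No pair of operators differs significantly. Keep in mind that `pairwise_tukeyhsd` treats the data as a one-way layout: its error variance still contains the _Method_ and _Order_ variation, so the comparison is less sensitive than the ANOVA above.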

## Problem 4.40
From Chapter 4 of D. C. Montgomery, Design and Analysis of Experiments, 8th edition.



An engineer is studying the mileage performance
characteristics of five types of gasoline additives. In the road
test he wishes to use cars as blocks; however, because of a time constraint, he must use an incomplete block design. He
runs the balanced design with the five blocks that follow.
Analyze the data from this experiment (use $\alpha = 0.05$) and
draw conclusions.


In [None]:
# Read the data from the URL
url_4_40 = "https://raw.githubusercontent.com/francji1/01NAEX/main/data/Problem_4_40.txt"
df_4_40 = pd.read_csv(url_4_40, sep=";")

# Display the first few rows of the dataframe
df_4_40.head()

Inspect the unique values of the columns to check which are categorical.

In [None]:
for col in df_4_40.columns:
    print(f"{col}: {df_4_40[col].unique()}")


- Casting _Additive_ to categorical.

In [None]:
df_4_40['Additive'] = df_4_40['Additive'].astype('category')

Perform ANOVA with _Mileage_ as the response.

In [None]:
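# Car is the block. If it is stored as integers, it enters here as a numeric
# covariate (1 df); C(Car) would treat it as a categorical blocking factor.
model_4_40 = smf.ols('Mileage ~ C(Additive) + Car', data=df_4_40).fit()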
anova_tab_4_40 = sm.stats.anova_lm(model_4_40, typ=2)
anova_tab_4_40

From the ANOVA table, we see that both independent variables have a significant effect on the _Mileage_.

In [None]:
model_4_40.summary()

Model assumptions.

- Visual inspection of the residuals

In [None]:
resid_4_40 = model_4_40.resid
fig, ax = plt.subplots(1, 3, figsize=(15, 7), dpi=100)
sns.histplot(resid_4_40, kde=False, ax=ax[0])
stats.probplot(resid_4_40, plot=ax[1])
sns.residplot(x=model_4_40.fittedvalues, y=resid_4_40, lowess=True, ax=ax[2], line_kws={'color': 'red'})
plt.show()

- Normality test

In [None]:
shapiro_test = stats.shapiro(model_4_40.resid)
print(shapiro_test)

We do not reject the normality null hypothesis.

- Homoscedasticity test

In [None]:
bp = sms.het_breuschpagan(model_4_40.resid, model_4_40.model.exog)
# het_breuschpagan returns (LM statistic, LM p-value, F statistic, F p-value)
for statval, pval in zip(bp[0::2], bp[1::2]):
    print(statval, pval)

Neither do we reject the homoscedasticity hypothesis.

Check if there are any significant differences between the _Additive_ level pairs.

In [None]:
tukey_4_40 = pairwise_tukeyhsd(df_4_40['Mileage'], df_4_40['Additive'], alpha=0.05)

# Display results
print(tukey_4_40)

# Plotting Tukey HSD results
tukey_4_40.plot_simultaneous()
plt.show()

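No significant differences between the pairs were found. Since `pairwise_tukeyhsd` compares the raw _Additive_ means without adjusting for _Car_, this does not contradict the significant ANOVA result.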
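In a balanced incomplete block design the raw treatment means are confounded with the blocks each treatment happened to appear in, so the textbook analysis works with adjusted treatment totals $Q_i = y_{i\cdot} - \frac{1}{k}\sum_j n_{ij}\, y_{\cdot j}$. The sketch below illustrates the computation for a design with 5 treatments in 5 blocks of size 4. The layout (block $j$ omits treatment $j$), the effects and the simulated responses are hypothetical and are not the problem data.

```python
import numpy as np

a, b, k = 5, 5, 4                    # treatments, blocks, block size
# Incidence matrix: n_inc[i, j] = 1 if treatment i appears in block j.
# Hypothetical layout in which block j omits treatment j.
n_inc = np.ones((a, b), dtype=int) - np.eye(a, dtype=int)
r = int(n_inc[0].sum())              # replicates per treatment
lam = r * (k - 1) // (a - 1)         # times each treatment pair shares a block

# Simulated responses (made-up effects plus noise); missing cells are set to 0.
rng = np.random.default_rng(2)
trt_eff = np.array([0.0, 1.0, 2.0, 0.0, -1.0])
blk_eff = np.array([0.0, 3.0, -2.0, 1.0, -1.0])
y = (12 + trt_eff[:, None] + blk_eff[None, :] + rng.normal(0, 0.5, (a, b))) * n_inc

trt_totals = y.sum(axis=1)           # y_i.
blk_totals = y.sum(axis=0)           # y_.j

# Adjusted treatment totals: remove the average of the blocks containing i.
Q = trt_totals - (n_inc @ blk_totals) / k
ss_treat_adj = k * (Q ** 2).sum() / (lam * a)
tau_hat = k * Q / (lam * a)          # block-adjusted treatment effect estimates
print(Q, ss_treat_adj, tau_hat)
```

The adjusted totals always sum to zero, and `tau_hat` is what a pairwise comparison should be based on when blocks matter.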

    
## Problem 4.42
From Chapter 4 of D. C. Montgomery, Design and Analysis of Experiments, 8th edition.


Seven different hardwood concentrations are being studied to determine their effect on the strength of the paper produced. However, the pilot plant can only produce three runs each day. As days may differ, the analyst uses the balanced incomplete block design that follows. Analyze the data from this experiment (use $\alpha = 0.05$) and draw conclusions.


In addition to the BIBD ANOVA, try fitting a linear model with concentration as a quantitative variable (provided there is no day effect).
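Whether a layout with 7 treatments in 7 blocks of size 3 is balanced can be checked from its incidence matrix $N$: in a BIBD, $N N^\top$ has the replicate count $r$ on the diagonal and a constant $\lambda$ everywhere else. The sketch below uses a hypothetical cyclic layout (not necessarily the one in the textbook) to illustrate the check.

```python
import numpy as np

# Hypothetical cyclic layout: block d contains treatments {d, d+1, d+3} mod 7.
blocks = [[(s + d) % 7 for s in (0, 1, 3)] for d in range(7)]

# Incidence matrix: N[i, j] = 1 if treatment i appears in block j.
N = np.zeros((7, 7), dtype=int)
for j, blk in enumerate(blocks):
    N[blk, j] = 1

C = N @ N.T                           # concurrence matrix
r = int(C[0, 0])                      # replicates of treatment 0
lam = int(C[0, 1])                    # blocks shared by treatments 0 and 1
off_diag = C[~np.eye(7, dtype=bool)]
is_balanced = bool(np.all(np.diag(C) == r) and np.all(off_diag == lam))
print(r, lam, is_balanced)
```

The same check can be run on real data by building `N` from a cross-tabulation of the treatment and block columns.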


In [None]:
# Read the data from the URL
url_4_42 = "https://raw.githubusercontent.com/francji1/01NAEX/main/data/Problem_4_42.txt"
df_4_42 = pd.read_csv(url_4_42, sep=";")


# Display the first few rows of the dataframe
df_4_42.head(10)

In [None]:
for col in df_4_42.columns:
    print(f"{col}: {df_4_42[col].unique()}")

In [None]:
df_4_42['Concentration'] = df_4_42['Concentration'].astype('category')

Run ANOVA with _Strength_ as the response.

In [None]:
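# Days is the block. If it is stored as integers, it enters here as a numeric
# covariate (1 df); C(Days) would treat it as a categorical blocking factor.
model_4_42 = smf.ols('Strength ~ C(Concentration) + Days', data=df_4_42).fit()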

anova_tab_4_42 = sm.stats.anova_lm(model_4_42, typ=2)
anova_tab_4_42


The ANOVA table shows that _Days_ doesn't have a significant effect on _Strength_, while _Concentration_ does.

In [None]:
model_4_42.summary()

Model assumptions: visual inspection.

In [None]:
resid_4_42 = model_4_42.resid
fig, ax = plt.subplots(1, 3, figsize=(15, 7), dpi=100)
sns.histplot(resid_4_42, kde=False, ax=ax[0])
stats.probplot(resid_4_42, plot=ax[1])
sns.residplot(x=model_4_42.fittedvalues, y=resid_4_42, lowess=True, ax=ax[2], line_kws={'color': 'red'})
plt.show()


- Normality test

In [None]:
shapiro_test = stats.shapiro(model_4_42.resid)
print(shapiro_test)

- Homoscedasticity test

In [None]:
bp = sms.het_breuschpagan(model_4_42.resid, model_4_42.model.exog)
# het_breuschpagan returns (LM statistic, LM p-value, F statistic, F p-value)
for statval, pval in zip(bp[0::2], bp[1::2]):
    print(statval, pval)

Neither test rejects its null hypothesis, so the model assumptions appear to be met.

Let's try to fit a linear model with _Concentration_ treated as a quantitative variable.

In [None]:
# Cast the Concentration to quantitative.
df_4_42['Concentration'] = df_4_42['Concentration'].astype(float)

In [None]:
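# Note the direction: Concentration is regressed on Strength. The slope t-test is
# the same as for 'Strength ~ Concentration', but the residuals are not.
model_linear = smf.ols('Concentration~Strength', data=df_4_42).fit()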
anova_tab = sm.stats.anova_lm(model_linear, typ=2)
anova_tab

In [None]:
model_linear.summary()

The slope is not significant, so there is no evidence of a linear relationship between _Strength_ and _Concentration_.
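Because the model above regresses _Concentration_ on _Strength_, it is worth checking that the slope test does not depend on the direction of a simple linear regression. The sketch below computes the slope t statistic both ways on simulated data; the concentration levels, the coefficients and the helper `slope_t` are hypothetical and for illustration only.

```python
import numpy as np

def slope_t(x, y):
    """t statistic of the slope in the simple regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sxx = ((x - x.mean()) ** 2).sum()
    sxy = ((x - x.mean()) * (y - y.mean())).sum()
    slope = sxy / sxx
    intercept = y.mean() - slope * x.mean()
    sse = ((y - intercept - slope * x) ** 2).sum()
    se = np.sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    return slope / se

# Simulated data: 7 made-up concentration levels, 3 runs each.
rng = np.random.default_rng(3)
conc = np.repeat([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0], 3)
strength = 120 + 1.5 * conc + rng.normal(0, 8, conc.size)

t_fwd = slope_t(conc, strength)   # Strength ~ Concentration
t_rev = slope_t(strength, conc)   # Concentration ~ Strength
print(t_fwd, t_rev)
```

Both statistics equal $r\sqrt{n-2}/\sqrt{1-r^2}$, where $r$ is the sample correlation, so the p-values agree. The residuals, and hence the diagnostic checks, do depend on which variable is the response.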

Model assumptions: visual inspection of the residuals.

In [None]:
resid_linear = model_linear.resid
fig, ax = plt.subplots(1, 3, figsize=(15, 7), dpi=100)
sns.histplot(resid_linear, kde=False, ax=ax[0])
stats.probplot(resid_linear, plot=ax[1])
sns.residplot(x=model_linear.fittedvalues, y=resid_linear, lowess=True, ax=ax[2], line_kws={'color': 'red'})
plt.show()


- Normality test

In [None]:
print(stats.shapiro(resid_linear))

The residuals are not normal...

- Homoscedasticity test

In [None]:
print(sms.het_breuschpagan(resid_linear, model_linear.model.exog))

... nor are they homoscedastic.