# Treatments and timepoints

$$
y = \mu + \beta_1 \text{timepoint} + \beta_2 \text{treatment} + e
$$


## Read the data

Data from [The effects of lateral line ablation and regeneration in schooling giant danios](https://journals.biologists.com/jeb/article/221/8/jeb175166/300/The-effects-of-lateral-line-ablation-and)

(data repo [here](https://zenodo.org/record/4999506))

The data describe the lateral line system of fish (*Devario aequipinnatus*): the effect of a chemical treatment (gentamycin, or a sham treatment as control) measured at different timepoints.

In [None]:
import numpy as np ## arrays
import pandas as pd ## dataframes
import seaborn as sns ## plots
import statsmodels.api as sm ## statistical models
import matplotlib.pyplot as plt ## plots


In [None]:
## tab-separated text data
url= "https://zenodo.org/records/4999506/files/JEXBIO-2017-175166-Processed-Data-Master.txt"
danios = pd.read_csv(url, sep = "\t")

Dataset on *giant danios*: how the lateral line system responds to chemical treatments.

![giant danio](https://drive.google.com/uc?export=view&id=1kBpKQEg5Q6edFaSUsUCKZTi-XKA6Gva0)

In [None]:
danios

In [None]:
## converting Week (timepoint) to string
danios['Week'] = danios['Week'].astype(str)

The target variables can be:

1.  nearest neighbor distance or NND (unit: body length),
2.  time in school (percentage),
3.  angular bearing (unit: degrees),
4.  angular elevation (unit: degrees),
5.  speed (body length per second).

Explanatory variables include:

-   `Treatment`: gentamycin / sham (control)
-   `Week`: time point in subsequent weeks (from week -1 to week 8)

## EDA

In [None]:
danios.describe()

In [None]:
danios['Treatment'].value_counts()

In [None]:
danios['Week'].value_counts()

In [None]:
freq_table = pd.crosstab(danios['Treatment'], danios['Week'])
freq_table

In [None]:
import warnings
warnings.filterwarnings('ignore')
#warnings.filterwarnings(action='once')

In [None]:
# Reshape 'danios' to long format: one row per record and target variable
mD = pd.melt(danios, id_vars=["Treatment", "Week"], var_name="target", value_name="value")
mD["Week"] = pd.Categorical(mD["Week"])

# Group and calculate mean
df_mean = mD.groupby(["Week", "Treatment", "target"], as_index=False).agg(avg=("value", "mean"))
df_mean.head()
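To make the wide-to-long reshaping concrete, here is a minimal sketch on a tiny made-up table. The values and the two measurement columns are invented for illustration and are not taken from the danios data; `wide`, `long_df` and `toy_mean` are names introduced only for this example.

```python
import pandas as pd

# Toy wide table: one row per record, one column per measurement (invented values)
wide = pd.DataFrame({
    "Treatment": ["Sham", "Gentamycin"],
    "Week": ["1", "1"],
    "NND": [1.0, 2.0],
    "Speed": [3.0, 4.0],
})

# Long format: one row per (record, measurement) pair
long_df = pd.melt(wide, id_vars=["Treatment", "Week"], var_name="target", value_name="value")

# Mean per Week / Treatment / target, mirroring how df_mean is built
toy_mean = long_df.groupby(["Week", "Treatment", "target"], as_index=False).agg(avg=("value", "mean"))

print(long_df)
print(toy_mean)
```

Each measurement column of the wide table becomes a block of rows in the long table, which is what lets one plotting call facet over all target variables at once.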

In [None]:
# Set plot style and color palette
palette = ["#00AFBB", "#E7B800", "#FC4E07"]
sns.set(style="whitegrid")

# Create a FacetGrid for each 'target'
g = sns.FacetGrid(mD, col="target", col_wrap=3, sharey=False, height=4, palette=palette)

# Boxplot
g.map_dataframe(sns.boxplot, x="Week", y="value", hue="Treatment", palette=palette, fliersize=0, boxprops=dict(alpha=0.3))

# Overlay points and lines for the mean
for ax, target in zip(g.axes.flat, mD["target"].unique()):
    subset = df_mean[df_mean["target"] == target]
    sns.pointplot(data=subset, x="Week", y="avg", hue="Treatment", ax=ax,
                  color="black", markers="o", linestyles="", dodge=True, legend=False)
    sns.lineplot(data=subset, x="Week", y="avg", hue="Treatment", ax=ax,
                 linewidth=1.5, palette=palette, legend=False)

# Adjust legend and layout
g.add_legend()
plt.tight_layout()
plt.show()

From the article, we expect the treatment to show an effect shortly after the application of gentamycin (vs control/sham), followed by recovery once the cells of the lateral line system have regenerated.

------------------------------------------------------------------------

**Q: which target variables show this expected pattern most clearly?**

------------------------------------------------------------------------

### Pick target variable

We select `Time in School` (based on the EDA above).

In [None]:
dd = danios.groupby(["Week", "Treatment"], as_index=False).agg(
    avg=("Time in School", "mean"),
    std=("Time in School", "std")
)

In [None]:
# Step 1: Pivot to wide format, calculate absolute difference, and drop columns
temp = dd.pivot(index="Week", columns="Treatment", values="avg").reset_index()
temp["diff"] = (temp["Gentamycin"] - temp["Sham"]).abs()
temp = temp[["Week", "diff"]]  # Keep only Week and diff

# Step 2: Merge back with original dd
dd = dd.merge(temp, on="Week", how="inner")

In [None]:
dd

In [None]:
def highlight_greaterthan(s, threshold, column):
    is_max = pd.Series(data=False, index=s.index)
    is_max[column] = s.loc[column] >= threshold
    return ['background-color: yellow' if is_max.any() else '' for v in is_max]


dd.style.apply(highlight_greaterthan, threshold=20.0, column=['diff'], axis=1)

## Models of analysis

1.  treatment within timepoint
2.  treatment + timepoint
3.  treatment + timepoint + (treatment x timepoint)

### Within timepoint

This is the simplest approach: we split the data by timepoint and compare treatments within each timepoint.

As an example we take week 2 only, which gives a much simpler dataset with records from `Gentamycin`-treated and control (`Sham`) fish.

In [None]:
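## !! REMEMBER THAT WE CHOSE ONE TARGET VARIABLE, Time in School !!
## .copy() gives independent DataFrames, so later column assignments do not act on a slice of mD
school_time = mD.loc[(mD['target'] == "Time in School")].copy()
temp = school_time.loc[(school_time['Week'] == '2')].copy()
temp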

Likewise, we can apply a very simple model:

$$
\text{Time in School} = \mu + \beta \cdot \text{Treatment} + e
$$

In [None]:
# Define the independent variables and add a constant for the intercept
## BEWARE: our independent variable is categorical

## here we are forcing Sham to be the reference category (0) and Gentamycin the alternative (1)
temp["Treatment"] = pd.Categorical(temp["Treatment"], categories=["Sham", "Gentamycin"], ordered=True)

treat_d = pd.get_dummies(temp['Treatment'], prefix='Treatment', drop_first=True, dtype=int)
treat_d.head(3)

X = sm.add_constant(treat_d)  # Adds the intercept term

# Define the dependent variable
y = temp['value']

X.head()

In [None]:
# Fit the linear model
model = sm.OLS(y, X).fit()

# Print the summary of the regression
print(model.summary())
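With a single 0/1 dummy predictor, the intercept is the mean of the reference group (Sham) and the slope is the difference between the two group means. The sketch below checks this on simulated data; the numbers are made up, and only the coding scheme (0 = Sham, 1 = Gentamycin) mirrors the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "time in school" values for two groups of 25 fish (invented numbers)
sham = rng.normal(loc=80, scale=10, size=25)
genta = rng.normal(loc=60, scale=10, size=25)

y_sim = np.concatenate([sham, genta])
dummy = np.concatenate([np.zeros(25), np.ones(25)])  # 0 = Sham, 1 = Gentamycin
X_sim = np.column_stack([np.ones(50), dummy])        # intercept column + dummy

# Least-squares fit of y = intercept + slope * dummy
b_sim, *_ = np.linalg.lstsq(X_sim, y_sim, rcond=None)

print("intercept:", b_sim[0], "| mean of Sham:", sham.mean())
print("slope:", b_sim[1], "| difference of means:", genta.mean() - sham.mean())
```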

---

#### From matrix algebra

$$
\mathbf{y} = \mathbf{Xb} + \mathbf{e}
$$

In [None]:
y = np.array(y)
X = np.array(X)


$$
\mathbf{X'y} = \mathbf{X'Xb}
$$

-   **X**: (n,m) = (50, 2) [50 records, 2 parameters: intercept and slope]
-   **y**: (n,1)
-   **X'y**: (m,1) = (2,1)
-   **X'X**: (m,m) = (2,2)
-   **b**: (m,1) = (2,1)

In [None]:
Xy = np.dot(X.transpose(), y)
XX = np.dot(X.transpose(), X)

$$
\mathbf{b} = (\mathbf{X'X})^{-1} \mathbf{X'y}
$$

In [None]:
invXX = np.linalg.inv(XX)
b = np.dot(invXX, Xy)

b

We see that this involves matrix inversion. Since this is a 2x2 matrix, we could do it by hand (for fun! But don't worry: Python will take care of matrix inversion for this and -much- larger matrices).

$$
\begin{bmatrix}
a & b \\
c & d
\end{bmatrix} ^ {-1} = \frac{1}{ad-bc} \cdot
\begin{bmatrix}
d & -b \\
-c & a
\end{bmatrix}
$$

In [None]:
XX

In [None]:
## reciprocal of the determinant and adjugate matrix for X'X = [[50, 25], [25, 25]]
multiplicative_factor = 1/(50*25 - 25*25)
M = np.array([[25, -25], [-25, 50]])
print(M)

In [None]:
## we get the inverse by scalar * array multiplication
invMatrix = multiplicative_factor * M
print(invMatrix)

In [None]:
## this is equal to what we obtained before using the numpy linalg.inv() method
invXX
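The 2x2 inversion can also be written without hard-coding the numbers. This is a sketch that assumes a balanced design with 50 records, 25 per treatment, as in week 2. `inv_2x2` is a helper name introduced here for illustration, and the right-hand side `Xy_demo` is made up. The last step uses `np.linalg.solve`, which solves the normal equations directly without forming the inverse.

```python
import numpy as np

def inv_2x2(m):
    """Invert a 2x2 matrix with the closed-form formula (illustrative helper)."""
    a, b_, c, d = m[0, 0], m[0, 1], m[1, 0], m[1, 1]
    det = a * d - b_ * c
    if det == 0:
        raise ValueError("matrix is singular")
    return np.array([[d, -b_], [-c, a]]) / det

# X'X for 50 records, 25 of them in the treated group (assumed balanced design)
XX_demo = np.array([[50.0, 25.0], [25.0, 25.0]])

by_hand = inv_2x2(XX_demo)
by_numpy = np.linalg.inv(XX_demo)
print(by_hand)
print(np.allclose(by_hand, by_numpy))

# Solve X'X b = X'y directly; this right-hand side is invented for the example
Xy_demo = np.array([3500.0, 1500.0])
b_solve = np.linalg.solve(XX_demo, Xy_demo)
print(b_solve)
```

For larger or ill-conditioned problems, solving the system is generally preferred over explicit inversion, since it is cheaper and numerically more stable.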

#### And the p-value?

First, we need to estimate the residual (error) variance of the model:

$$
\hat{\sigma}^2 = \frac{1}{(n-2)}\sum(y_i-\hat{y}_i)^2
$$

The $(n-2)$ comes from $n - (k+1)$, where $k$ is the number of parameters in $\mathbf{b}$ excluding the intercept (here $k = 1$, the Treatment coefficient), so that $k+1$ is the length of $\mathbf{b}$. Significance is then assessed one coefficient at a time.

In [None]:
n = temp.shape[0] ## sample size
y_hat = np.dot(X, b) ## predictions/fitted values
residuals = y-y_hat
variance = np.square(residuals).sum()/(n-2)
print("Variance is", variance)

$$
\text{Var}(\hat{\beta}) = \frac{\hat{\sigma}^2}{\sum(x_i-\overline{x})^2}
$$

In [None]:
X[:,1]

In [None]:
x_avg = X[:,1].mean()
var_beta = variance/np.square(X[:,1]-x_avg).sum()

Now, the standard error of the estimate is the square root of its variance.
You can compare this value with the results from the `OLS()` function above.

In [None]:
std_err_beta = np.sqrt(var_beta)
print(std_err_beta)
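The single-predictor formula is a special case of the matrix expression $\text{Var}(\mathbf{b}) = \hat{\sigma}^2 (\mathbf{X'X})^{-1}$, whose diagonal holds the variances of all coefficients at once. The sketch below compares the two on simulated data; the numbers are invented and the variable names are used only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: 25 control and 25 treated records (invented numbers)
x = np.repeat([0.0, 1.0], 25)
y_sim = 80 - 20 * x + rng.normal(scale=10, size=50)
X_sim = np.column_stack([np.ones(50), x])

# Least-squares estimates and residual variance with n - (k+1) degrees of freedom
b_sim = np.linalg.solve(X_sim.T @ X_sim, X_sim.T @ y_sim)
resid = y_sim - X_sim @ b_sim
sigma2 = (resid ** 2).sum() / (len(y_sim) - X_sim.shape[1])

# Single-predictor formula for the standard error of the slope
se_slope_formula = np.sqrt(sigma2 / ((x - x.mean()) ** 2).sum())

# Matrix form: covariance matrix of all coefficients
cov_b = sigma2 * np.linalg.inv(X_sim.T @ X_sim)
se_all = np.sqrt(np.diag(cov_b))

print(se_slope_formula, se_all[1])
```

The matrix form is the one that carries over unchanged to models with several predictors, such as the timepoint plus treatment model.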

In [None]:
tstat = b[1]/std_err_beta
df = n - len(b) ## degrees of freedom
tstat

Finally, with the value of our Student's t statistic, we can get the p-value for the Treatment coefficient:

In [None]:
from scipy.stats import t

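# For a **two-tailed** p-value: twice the upper-tail probability of |t|
# (using abs() keeps the result valid for both positive and negative t)
pval = 2 * t.sf(abs(tstat), df=df) ## from the Student's t survival function (1 - CDF)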
print(pval)
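A two-tailed p-value is twice the upper-tail probability of $|t|$, so it does not depend on the sign of the statistic and never exceeds 1. The sketch below wraps this in a small helper (`two_tailed_p` is a name made up for this example) and cross-checks it against `scipy.stats.ttest_ind` with equal variances, which tests the same hypothesis as the single-dummy regression. The data are simulated.

```python
import numpy as np
from scipy import stats

def two_tailed_p(t_value, dof):
    """Two-tailed p-value from a Student's t statistic (illustrative helper)."""
    return 2 * stats.t.sf(abs(t_value), df=dof)

# The sign of t does not matter
print(two_tailed_p(-2.5, 48), two_tailed_p(2.5, 48))

# Cross-check on two simulated groups (invented numbers)
rng = np.random.default_rng(2)
sham = rng.normal(80, 10, size=25)
genta = rng.normal(60, 10, size=25)
t_ref, p_ref = stats.ttest_ind(genta, sham, equal_var=True)

print(p_ref, two_tailed_p(t_ref, len(sham) + len(genta) - 2))
```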

#### ANOVA

Yet another way to solve our within-timepoint model is to use **analysis of variance**:

In [None]:
from statsmodels.formula.api import ols

res = ols('value ~ Treatment', data=temp).fit()
sm.stats.anova_lm(res, typ=2)

**IMPORTANT: ANOVA and linear regression are equivalent!**
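With two groups, this equivalence can be checked numerically: the ANOVA F statistic equals the square of the t statistic, and the two p-values coincide. The sketch below uses `scipy.stats` on simulated data (invented numbers) rather than the statsmodels calls used in the notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Two simulated groups of 25 records each (invented numbers)
sham = rng.normal(80, 10, size=25)
genta = rng.normal(60, 10, size=25)

# Two-sample t-test with pooled variance (same test as the dummy-variable regression)
t_stat, p_t = stats.ttest_ind(genta, sham, equal_var=True)

# One-way ANOVA on the same two groups
f_stat, p_f = stats.f_oneway(genta, sham)

print("t^2 =", t_stat ** 2, "| F =", f_stat)
print("p (t-test) =", p_t, "| p (ANOVA) =", p_f)
```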

---

#### Apply the within-timepoint analysis to all timepoints

We now take the model used for week 2 and apply it to each week (timepoint) in turn: `ols('value ~ Treatment', data=group)`, where `group` holds the records of one week.

In [None]:
import statsmodels.formula.api as smf

school_time["Treatment"] = pd.Categorical(school_time["Treatment"], categories=["Sham", "Gentamycin"], ordered=True)

In [None]:
results = []

# Group by 'Week'
for week, group in school_time.groupby('Week'):
    # Fit linear model
    model = smf.ols('value ~ Treatment', data=group).fit()
    # Get summary as DataFrame
    summary_df = model.summary2().tables[1].reset_index()
    summary_df = summary_df.rename(columns={'index': 'term'})
    summary_df['Week'] = week
    # Filter out intercept
    summary_df = summary_df[summary_df['term'] != 'Intercept']
    results.append(summary_df)

# Combine all results
final_results = pd.concat(results, ignore_index=True)

In [None]:
## now we want to highlight significant results (lower than threshold)
def highlight_lowerthan(s, threshold, column):
    is_max = pd.Series(data=False, index=s.index)
    is_max[column] = s.loc[column] < threshold
    return ['background-color: yellow' if is_max.any() else '' for v in is_max]

final_results.style.apply(highlight_lowerthan, threshold=0.05, column=['P>|t|'], axis=1)

------------------------------------------------------------------------

### Exercise [optional]

Pick another target variable and apply the within-timepoint linear regression analysis:

In [None]:
## TASK 1: get the data for a different target variable

## your code here!

In [None]:
## TASK 2: run the model for each timepoint (within-timepoint analysis)

## your code here!

------------------------------------------------------------------------

### Across-timepoint analysis

We now use a more complex model of analysis, which uses all the data at once and includes both the effect of timepoint and the effect of treatment:

$$
\text{Time in School} = \mu + \beta_1 \text{Timepoint} + \beta_2 \text{Treatment} + e
$$

Again, we use `Time in School` as target:

In [None]:
## starting point: we (re)get the data
school_time = mD.loc[(mD['target'] == "Time in School")].copy() ## .copy() so the assignment below does not act on a slice of mD
school_time["Treatment"] = pd.Categorical(school_time["Treatment"], categories=["Sham", "Gentamycin"], ordered=True)

In [None]:
week_d = pd.get_dummies(school_time['Week'], prefix='Week', drop_first=True, dtype=int) ## week -1 is the reference category value
treat_d = pd.get_dummies(school_time['Treatment'], prefix='Treatment', drop_first=True, dtype=int)

X = sm.add_constant(week_d)  # Adds the intercept term

# Define the dependent variable
y = school_time['value']

X = pd.concat([X, treat_d['Treatment_Gentamycin']], axis=1)
X.head()

In [None]:
# Fit the linear model
model = sm.OLS(y, X).fit()

# Print the summary of the regression
print(model.summary())

The output above shows:

-   overall $R^2$ of the model
-   overall p-value of the model
-   estimates of single coefficients (with respect to the reference class)
-   p-values for the single coefficients (under the null hypothesis that they're equal to zero)
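To see how the design matrix of the additive model is laid out, and how a fitted value is assembled from the coefficients, here is a small sketch on invented data with three week labels. It uses `np.linalg.lstsq` instead of `sm.OLS`, and all names (`toy`, `X_toy`, `b_toy`) are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy data: 3 weeks x 2 treatments, 2 records per cell (invented values)
toy = pd.DataFrame({
    "Week": ["-1", "-1", "-1", "-1", "1", "1", "1", "1", "2", "2", "2", "2"],
    "Treatment": ["Sham", "Sham", "Gentamycin", "Gentamycin"] * 3,
    "value": [80, 82, 79, 81, 78, 80, 60, 62, 81, 79, 65, 63],
})
toy["Week"] = pd.Categorical(toy["Week"], categories=["-1", "1", "2"])
toy["Treatment"] = pd.Categorical(toy["Treatment"], categories=["Sham", "Gentamycin"])

# Dummy columns; the first category of each variable is the reference
week_d = pd.get_dummies(toy["Week"], prefix="Week", drop_first=True, dtype=int)
treat_d = pd.get_dummies(toy["Treatment"], prefix="Treatment", drop_first=True, dtype=int)
const = pd.Series(1, index=toy.index, name="const")
X_toy = pd.concat([const, week_d, treat_d], axis=1)
print(X_toy.columns.tolist())

# Least-squares fit of the additive model
coef, *_ = np.linalg.lstsq(
    X_toy.to_numpy(dtype=float), toy["value"].to_numpy(dtype=float), rcond=None
)
b_toy = pd.Series(coef, index=X_toy.columns)
print(b_toy)

# Fitted value for a Gentamycin fish in week 2:
# intercept + week-2 shift + treatment shift
pred = b_toy["const"] + b_toy["Week_2"] + b_toy["Treatment_Gentamycin"]
print(pred)
```

Because the model has no interaction term, the treatment shift is the same in every week; in this balanced toy table it equals the difference between the overall treatment means.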

------------------------------------------------------------------------

**Q: how do we interpret the model coefficients?**

------------------------------------------------------------------------