# Longitudinal data

In this *Python* notebook we introduce examples of longitudinal data, i.e. data with a **time component**.

## Read data

Data from:
- [Spatiotemporally explicit model averaging for forecasting of Alaskan groundfish catch](https://onlinelibrary.wiley.com/doi/10.1002/ece3.4488)
- data repository [here](https://zenodo.org/record/4987796#.ZHcLL9JBxhE)

The data describe fish catch (for multiple fish species) over time in different regions of Alaska.

In [None]:
import numpy as np
import pandas as pd

In [None]:
url = "https://zenodo.org/records/4987796/files/stema_data.csv"
fish = pd.read_csv(url)

In [None]:
## data size (tabular)
fish.shape

In [None]:
fish

-   **CPUE**: target variable, "catch per unit effort"
-   **SST**: sea surface temperature
-   **CV**: the coefficient of variation of SST is used instead of its mean. It is a better measure of seasonal SST because it standardizes scale and lets us consider changes in the variation of SST alongside changes in its mean over time (Hannah Correia, 2018, Ecology and Evolution)
-   **SST_cvW1-5**: CPUE is influenced by survival in the first year of life. Water temperature affects survival, and juvenile fish are more susceptible to environmental changes than adults. Therefore, CPUE for a given year is likely linked to the winter SST at the juvenile stage. Since this survey targets waters during the summer and the four species covered reach maturity at 5--8 years, SST was lagged by one to five years to capture the effect of SST on the juvenile stages. All five lagged SST measures were included for modeling.
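The lagging idea can be sketched on toy data. This is an illustration only, not the procedure used to build `stema_data.csv`: the data frame `sst`, its made-up values and the helper `add_lags` are hypothetical. The sketch computes a per-year coefficient of variation (standard deviation divided by mean) and then shifts it by one to three years, so that each row also carries the values of earlier winters.

```python
import pandas as pd

# Toy winter SST readings (made-up numbers), three per year
sst = pd.DataFrame({
    "Year": [2000] * 3 + [2001] * 3 + [2002] * 3 + [2003] * 3,
    "SST": [4.0, 5.0, 6.0, 3.0, 5.0, 7.0, 5.0, 5.5, 6.0, 2.0, 4.0, 6.0],
})

# Coefficient of variation per year: standard deviation / mean
cv = (
    sst.groupby("Year")["SST"]
    .agg(lambda s: s.std() / s.mean())
    .rename("SST_cv")
    .reset_index()
)

def add_lags(df, column, n_lags):
    """Add columns holding `column` shifted by 1..n_lags rows.

    Assumes one row per year and no missing years, so that a shift
    of k rows corresponds to a lag of k years.
    """
    out = df.sort_values("Year").copy()
    for k in range(1, n_lags + 1):
        out[f"{column}{k}"] = out[column].shift(k)
    return out

lagged = add_lags(cv, "SST_cv", n_lags=3)
print(lagged)
```

The first rows of each lagged column are missing (`NaN`) because no earlier year is available for them.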

### Data preprocessing

In [None]:
fish.columns

In [None]:
fish = fish.drop(['Unnamed: 0', 'Latitude', 'Longitude'], axis=1)

In [None]:
fish

Note: in the subset below, **CPUE values are identical**.

To accommodate variation in SST among stations, each CPUE value has been replicated multiple times. This defeats our purpose of analysing data by group (fish species) over space and time: with a single distinct value per group there is no within-group variation, which makes a statistical analysis hard to perform. Therefore, we add to the original CPUE values some random noise, with standard deviation proportional to the group average (by species, area, year):


In [None]:
fish.loc[(fish['Species'] == "Pacific cod") & (fish['Area'] == "West Yakutat") & (fish['Year'] == 1990)]

In [None]:
## mutate variable
# Assuming fish is a pandas DataFrame
fish['avg'] = fish.groupby(['Species', 'Area', 'Year'])['CPUE'].transform('mean')
fish['std'] = 0.1 * fish['avg']

In [None]:
fish['noise'] = np.random.normal(loc=0, scale=fish['std'])
fish['CPUE'] = fish['CPUE'] + fish['noise']

In [None]:
fish.loc[(fish['Species'] == "Pacific cod") & (fish['Area'] == "West Yakutat") & (fish['Year'] == 1990)]
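Because the noise is drawn without a fixed seed, the perturbed CPUE values change at every run. A minimal sketch of the same idea with a seeded generator is given below; the `toy` data frame, its values and the seed `42` are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): CPUE is replicated within each group
toy = pd.DataFrame({
    "Species": ["cod"] * 4 + ["sablefish"] * 4,
    "Year": [1990] * 8,
    "CPUE": [10.0] * 4 + [200.0] * 4,
})

rng = np.random.default_rng(42)  # fixed seed so the noise is reproducible

# Group mean, repeated on every row of the group
avg = toy.groupby(["Species", "Year"])["CPUE"].transform("mean")

# Gaussian noise with standard deviation equal to 10% of the group mean
noise = rng.normal(loc=0.0, scale=0.1 * avg.to_numpy())
toy["CPUE_noisy"] = toy["CPUE"] + noise

# Before: no spread within groups; after: non-zero spread in every group
print(toy.groupby("Species")[["CPUE", "CPUE_noisy"]].std())
```

Rerunning the block with the same seed returns the same noisy values, which makes downstream results easier to compare between runs.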

### EDA (Exploratory Data Analysis)

Let's start by looking at the raw data. As we already saw, for each combination of species, area and year we have multiple observations (see, for instance, `Pacific cod` from `West Yakutat` in year `2000` below). A boxplot is therefore a good way to plot these data:

In [None]:
fish.loc[(fish['Species'] == "Pacific cod") & (fish['Area'] == "West Yakutat") & (fish['Year'] == 2000)]

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set up the FacetGrid
g = sns.catplot(
    data=fish,
    x='Year',
    y='CPUE',
    hue='Area',
    col='Area',
    row='Species',
    kind='box',
    height=4,
    aspect=1.2
)

# Rotate x-axis labels
for ax in g.axes.flatten():
    for label in ax.get_xticklabels():
        label.set_rotation(90)

plt.tight_layout()
plt.show()

First, we note large variation in scale between fish species. Let's allow the y-axis scale to change by `Species`:

In [None]:
g = sns.catplot(
    data=fish,
    x='Year',
    y='CPUE',
    hue='Area',
    col='Area',
    col_order = ['East Yakutat/Southeast', 'West Yakutat', 'Central Gulf of Alaska', 'Western Gulf of Alaska'],
    row='Species',
    kind='box',
    height=4,
    aspect=1.2,
    sharey=False  # allow individual y-axis, we'll manually sync per row
)

# Get the species (row) levels
species_levels = fish['Species'].unique()

# Sync y-axis within each row
for i, species in enumerate(species_levels):
    # Get all axes in the current row
    axes_row = g.axes[i]
    # Find the min and max y across this row
    y_mins, y_maxs = zip(*(ax.get_ylim() for ax in axes_row))
    common_ylim = (min(y_mins), max(y_maxs))
    # Set the same ylim for all axes in this row
    for ax in axes_row:
        ax.set_ylim(common_ylim)

# Rotate x-axis labels
for ax in g.axes.flatten():
    for label in ax.get_xticklabels():
        label.set_rotation(90)

plt.tight_layout()
plt.show()

We now see CPUE oscillations over time and between geographical areas, but again this varies by fish species. What if we rescale CPUE?

In [None]:
# Define the rescale function: min-max scaling to the range [0, 100]
def rescale01(x):
    rng = (np.nanmin(x), np.nanmax(x))
    return 100 * (x - rng[0]) / (rng[1] - rng[0]) if rng[1] != rng[0] else np.zeros_like(x)

# Assuming 'fish' is a pandas DataFrame with columns: 'Species', 'CPUE'
# Group by 'Species' and apply the rescaling to each group

fish['rescaled_cpue'] = (
    fish.groupby('Species')['CPUE']
    .transform(rescale01)
)

In [None]:
fish.groupby('Species').agg({'rescaled_cpue':['min','max']})
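To make the rescaling concrete: within each species, a value $x$ is mapped to $100 \cdot (x - \min) / (\max - \min)$, so the smallest value becomes 0 and the largest becomes 100. The sketch below applies this to a small made-up data frame; `toy` and `rescale_0_100` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

def rescale_0_100(x):
    """Min-max rescale a numeric Series to [0, 100], ignoring NaN."""
    lo, hi = np.nanmin(x), np.nanmax(x)
    if hi == lo:
        # A constant group has no spread: map every value to 0
        return pd.Series(0.0, index=x.index)
    return 100 * (x - lo) / (hi - lo)

# Two species on very different scales (made-up values)
toy = pd.DataFrame({
    "Species": ["cod", "cod", "cod", "sablefish", "sablefish", "sablefish"],
    "CPUE": [2.0, 4.0, 6.0, 100.0, 300.0, 500.0],
})

# Rescale within each species separately
toy["rescaled"] = toy.groupby("Species")["CPUE"].transform(rescale_0_100)
print(toy)
```

After rescaling, both species span the same 0 to 100 range, even though their raw values differ by roughly two orders of magnitude.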

In [None]:
g = sns.catplot(
    data=fish,
    x='Year',
    y='rescaled_cpue',
    hue='Area',
    col='Area',
    col_order = ['East Yakutat/Southeast', 'West Yakutat', 'Central Gulf of Alaska', 'Western Gulf of Alaska'],
    row='Species',
    kind='box',
    height=4,
    aspect=1.2,
    sharey=True  # rescaled values all lie in [0, 100], so one common y-axis works
)

# Rotate x-axis labels
for ax in g.axes.flatten():
    for label in ax.get_xticklabels():
        label.set_rotation(90)

plt.tight_layout()
plt.show()

### Trends

A simple way to look at a trend is to average the variable of interest within each time point and follow these averages over time:

In [None]:
dd = (
    fish.groupby(['Species', 'Area', 'Year'])['rescaled_cpue']
    .mean()
    .round(2)
    .reset_index()
    .pivot(index=['Species', 'Area'], columns='Year', values='rescaled_cpue')
    .reset_index()  # Optional: flatten the multi-index
)

dd

In [None]:
temp = dd.melt(id_vars=['Species', 'Area'], var_name='Year', value_name='CPUE')

-   grouping: we have only one observation per group (average by Species, Area, Year), so we must tell the plot which variable defines the lines, in this case `Area` (passed as `hue` below)
-   `Year` is not a number now, and this is reflected in the x axis: no intervals, all values are plotted (so we can, for example, rotate the labels vertically and make them smaller, to avoid overlap)
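The difference between treating the year as text and as a number can be checked on a small example. The sketch below uses made-up values and hypothetical names (`long`, `wide`, `back`). It repeats the long-to-wide-to-long reshaping and then converts `Year` both ways.

```python
import pandas as pd

# Toy long-format averages (made-up values), with a gap between years
long = pd.DataFrame({
    "Area": ["A", "A", "B", "B"],
    "Year": [1990, 1993, 1990, 1993],
    "CPUE": [1.0, 2.0, 3.0, 4.0],
})

# Long -> wide -> long
wide = long.pivot(index="Area", columns="Year", values="CPUE").reset_index()
back = wide.melt(id_vars="Area", var_name="Year", value_name="CPUE")

# As strings, years are labels: categories with no notion of distance
as_text = back["Year"].astype(str)
# As integers, years are numbers: 1990 and 1993 are three units apart
as_number = back["Year"].astype(int)

print(as_text.unique())
print(as_number.max() - as_number.min())
```

When the year is text, a gap between survey years is not visible on the x axis; when it is a number, the gap is drawn to scale.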

In [None]:
# Ensure 'Year' is treated as a string or categorical for proper x-axis handling
temp['Year'] = temp['Year'].astype(str)

# Set up FacetGrid: one subplot per Species
g = sns.FacetGrid(temp, col='Species', col_wrap=2, height=4, sharey=True,
                  legend_out=True)

# Add lineplot to each facet
g.map_dataframe(sns.lineplot, x='Year', y='CPUE', hue='Area', estimator=None)

# Rotate x-axis labels and adjust text size
for ax in g.axes.flatten():
    ax.tick_params(axis='x', rotation=90, labelsize=6)

# Add legend (optional customization)
g.add_legend(title='Area')
plt.tight_layout()
plt.show()

What if `Year` (the x-axis variable) were a number (an integer)?

In [None]:
temp['Year'] = temp['Year'].astype(int)

In [None]:
# Set up FacetGrid: one subplot per Species
g = sns.FacetGrid(temp, col='Species', col_wrap=2, height=4, sharey=True,
                  legend_out=True)

# Add lineplot to each facet
g.map_dataframe(sns.lineplot, x='Year', y='CPUE', hue='Area', estimator=None)

# Rotate x-axis labels and adjust text size
for ax in g.axes.flatten():
    ax.tick_params(axis='x', rotation=90, labelsize=6)

# Add legend (optional customization)
g.add_legend(title='Area')
plt.tight_layout()
plt.show()

------------------------------------------------------------------------

**Q: Do we have a trend?**

------------------------------------------------------------------------

What about the standard deviation? Let's see if we have a trend there (!! remember, we introduced artificial random variation, no trend is actually expected, except by chance !!):

In [None]:
dd = (
    fish.groupby(['Species', 'Area', 'Year'])['rescaled_cpue']
    .std()
    .round(2)
    .reset_index()
    .pivot(index=['Species', 'Area'], columns='Year', values='rescaled_cpue')
    .reset_index()  # Optional: flatten the multi-index
)

dd

In [None]:
temp = dd.melt(id_vars=['Species', 'Area'], var_name='Year', value_name='sd(CPUE)')

In [None]:
# Ensure 'Year' is treated as a string or categorical for proper x-axis handling
# temp['Year'] = temp['Year'].astype(str)

# Set up FacetGrid: one subplot per Species
g = sns.FacetGrid(temp, col='Species', col_wrap=2, height=4, sharey=True,
                  legend_out=True)

# Add lineplot to each facet
g.map_dataframe(sns.lineplot, x='Year', y='sd(CPUE)', hue='Area', estimator=None)

# Rotate x-axis labels and adjust text size
for ax in g.axes.flatten():
    ax.tick_params(axis='x', rotation=90, labelsize=6)

# Add legend (optional customization)
g.add_legend(title='Area')
plt.tight_layout()
plt.show()

------------------------------------------------------------------------

**Q: What do you notice?**

------------------------------------------------------------------------

### Model-based adjustments

We can use a model to adjust the response variable (here, rescaled CPUE) for known sources of variation:

In [None]:
import statsmodels.api as sm

# Assuming 'fish' is a pandas DataFrame already loaded with your data

# Define the independent variables and add a constant for the intercept
X = fish[['SST_cvW', 'SST_cvW1', 'SST_cvW2', 'SST_cvW3', 'SST_cvW4', 'SST_cvW5']]
X = sm.add_constant(X)  # Adds the intercept term

# Define the dependent variable
y = fish['rescaled_cpue']

# Fit the linear model
model = sm.OLS(y, X).fit()

# Print the summary of the regression
print(model.summary())

We then **focus on model residuals**:

In [None]:
## residuals are stored in the attribute <object>.resid
model.resid.describe()
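Why residuals can be read as "adjusted" values is easier to see on simulated data. The sketch below is an assumption-laden illustration, not part of the original analysis: it uses plain NumPy least squares rather than `statsmodels`, and the variables `x`, `signal` and `y` are simulated. The residuals keep the part of `y` that the covariate does not explain.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data (hypothetical): y depends on a known covariate x
# plus a remaining signal that we want to keep
x = rng.normal(size=n)
signal = rng.normal(size=n)
y = 5.0 + 2.0 * x + signal

# Least-squares fit of y on an intercept and x
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# With an intercept in the model, residuals average to zero and are
# uncorrelated with x, while remaining close to the leftover signal
print("mean of residuals:", residuals.mean())
print("corr(residuals, x):", np.corrcoef(residuals, x)[0, 1])
print("corr(residuals, signal):", np.corrcoef(residuals, signal)[0, 1])
```

The same logic applies to the notebook's model: whatever the SST covariates explain linearly is removed, and the residuals carry the remaining variation in CPUE.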

In [None]:
# Histogram of the model residuals
data = model.resid
plt.hist(data, bins=30, color='skyblue', edgecolor='black')

# Adding labels and title
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of model residuals')

# Display the plot
plt.show()

In [None]:
fish['residuals'] = model.resid
fish['fitted_values'] = model.fittedvalues

In [None]:
fish

In [None]:
dd = (
    fish.groupby(['Species', 'Area', 'Year'])['residuals']
    .mean()
    .round(2)
    .reset_index()
    .pivot(index=['Species', 'Area'], columns='Year', values='residuals')
    .reset_index()  # Optional: flatten the multi-index
)

dd

In [None]:
temp = dd.melt(id_vars=['Species', 'Area'], var_name='Year', value_name='adjusted CPUE')

In [None]:
# Ensure 'Year' is treated as a string or categorical for proper x-axis handling
# temp['Year'] = temp['Year'].astype(str)

# Set up FacetGrid: one subplot per Species
g = sns.FacetGrid(temp, col='Species', col_wrap=2, height=4, sharey=True,
                  legend_out=True)

# Add lineplot to each facet
g.map_dataframe(sns.lineplot, x='Year', y='adjusted CPUE', hue='Area', estimator=None)

# Rotate x-axis labels and adjust text size
for ax in g.axes.flatten():
    ax.tick_params(axis='x', rotation=90, labelsize=6)

# Add legend (optional customization)
g.add_legend(title='Area')
plt.tight_layout()
plt.show()