# Longitudinal data

In this *Python* notebook we introduce examples of longitudinal data, i.e. data with a **time component**.

## Read data

Data from:
- [Spatiotemporally explicit model averaging for forecasting of Alaskan groundfish catch](https://onlinelibrary.wiley.com/doi/10.1002/ece3.4488)
- (data repo [here](https://zenodo.org/record/4987796#.ZHcLL9JBxhE))

It's data on fish catch (multiple fish species) over time in different regions of Alaska.

In [None]:
import numpy as np
import pandas as pd

In [None]:
url = "https://zenodo.org/records/4987796/files/stema_data.csv"
fish = pd.read_csv(url)

In [None]:
## data size (tabular)
fish.shape

In [None]:
fish

-   **CPUE**: target variable, "catch per unit effort"
-   **SST**: sea surface temperature
-   **CV**: the coefficient of variation of SST is used rather than the mean: it is an improved measure of seasonal SST because it standardizes scale and lets us consider changes in the variation of SST together with changes in its mean over time (Hannah Correia, 2018 - Ecology and Evolution)
-   **SSTcvW1-5**: CPUE is influenced by survival in the first year of life. Water temperature affects survival, and juvenile fish are more susceptible to environmental changes than adults. Therefore, CPUE for a given year is likely linked to the winter SST at the juvenile stage. Since this survey targets waters during the summer and the four species covered reach maturity at 5--8 years, SST was lagged by one to five years to capture the effect of SST on the juvenile stages. All five lagged SST measures were included for modeling.
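
To make the two ideas above concrete, here is a minimal sketch on made-up numbers of how a coefficient of variation and its lagged copies could be computed with pandas. The data frame, the column names (`SST_cv`, `SST_cv_lag1`, ...) and the values are assumptions for illustration only, and this is not necessarily how the authors built the columns in the dataset.

```python
import pandas as pd

# Toy winter SST readings (values are made up), three readings per year
sst = pd.DataFrame({
    "Year": [1990] * 3 + [1991] * 3 + [1992] * 3,
    "SST": [4.0, 5.0, 6.0, 3.0, 5.0, 7.0, 4.5, 5.0, 5.5],
})

# Coefficient of variation per year: standard deviation divided by the mean
by_year = sst.groupby("Year")["SST"].agg(["mean", "std"])
by_year["SST_cv"] = by_year["std"] / by_year["mean"]

# Lagged copies: the value from k rows earlier, i.e. k years earlier here
# because there is exactly one row per consecutive year (NaN where unavailable)
for k in (1, 2):
    by_year[f"SST_cv_lag{k}"] = by_year["SST_cv"].shift(k)

print(by_year.round(3))
```

All three toy years have the same mean SST but different spread, so only the coefficient of variation tells them apart.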

### Data preprocessing

In [None]:
fish.columns

In [None]:
fish = fish.drop(['Unnamed: 0', 'Latitude', 'Longitude'], axis=1)

In [None]:
fish

Note: in the subset below, **CPUE values are identical**

We see that, to accommodate variation in SST among stations, the CPUE value has been replicated multiple times. This would defeat our purpose of analysing data by group (fish species) over space and time: with only one distinct value per group there is no within-group variation, which makes a statistical analysis hard to perform. Therefore, we add to the original CPUE values some random noise whose standard deviation is proportional to the group average (by species, area, year):


In [None]:
fish.loc[(fish['Species'] == "Pacific cod") & (fish['Area'] == "West Yakutat") & (fish['Year'] == 1990)]

In [None]:
## add the group mean and the noise scale (10% of the group mean)
fish['avg'] = fish.groupby(['Species', 'Area', 'Year'])['CPUE'].transform('mean')
fish['std'] = 0.1 * fish['avg']

In [None]:
# draw one noise value per row; no random seed is set, so values change at every run
fish['noise'] = np.random.normal(loc=0, scale=fish['std'])
fish['CPUE'] = fish['CPUE'] + fish['noise']

In [None]:
fish.loc[(fish['Species'] == "Pacific cod") & (fish['Area'] == "West Yakutat") & (fish['Year'] == 1990)]

### EDA (Exploratory Data Analysis)

Let's start by looking at the raw data. As we already saw, for each combination of species, area and year we have multiple observations (see, for instance, `Pacific cod` from `West Yakutat` in year `2000`). A boxplot is therefore a good way to plot these data:

In [None]:
fish.loc[(fish['Species'] == "Pacific cod") & (fish['Area'] == "West Yakutat") & (fish['Year'] == 2000)]

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set up the FacetGrid
g = sns.catplot(
    data=fish,
    x='Year',
    y='CPUE',
    hue='Area',
    col='Area',
    row='Species',
    kind='box',
    height=4,
    aspect=1.2
)

# Rotate x-axis labels
for ax in g.axes.flatten():
    for label in ax.get_xticklabels():
        label.set_rotation(90)

plt.tight_layout()
plt.show()

First, we note large variation in scale between fish species. Let's try to allow the scale to change by `Species`:

In [None]:
g = sns.catplot(
    data=fish,
    x='Year',
    y='CPUE',
    hue='Area',
    col='Area',
    col_order = ['East Yakutat/Southeast', 'West Yakutat', 'Central Gulf of Alaska', 'Western Gulf of Alaska'],
    row='Species',
    kind='box',
    height=4,
    aspect=1.2,
    sharey=False  # allow individual y-axis, we'll manually sync per row
)

# Get the species (row) levels
species_levels = fish['Species'].unique()

# Sync y-axis within each row
for i, species in enumerate(species_levels):
    # Get all axes in the current row
    axes_row = g.axes[i]
    # Find the min and max y across this row
    y_mins, y_maxs = zip(*(ax.get_ylim() for ax in axes_row))
    common_ylim = (min(y_mins), max(y_maxs))
    # Set the same ylim for all axes in this row
    for ax in axes_row:
        ax.set_ylim(common_ylim)

# Rotate x-axis labels
for ax in g.axes.flatten():
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90)

plt.tight_layout()
plt.show()

We now see CPUE oscillations over time and between geographical areas, but again this varies by fish species. What if we rescale CPUE?

In [None]:
# Define the rescale function (min-max scaling to the 0-100 range)
def rescale01(x):
    rng = (np.nanmin(x), np.nanmax(x))
    return 100 * (x - rng[0]) / (rng[1] - rng[0]) if rng[1] != rng[0] else np.zeros_like(x)

# Group by 'Species' and apply the rescaling to each group
fish['rescaled_cpue'] = (
    fish.groupby('Species')['CPUE']
    .transform(rescale01)
)

In [None]:
fish.groupby('Species').agg({'rescaled_cpue':['min','max']})

In [None]:
g = sns.catplot(
    data=fish,
    x='Year',
    y='rescaled_cpue',
    hue='Area',
    col='Area',
    col_order = ['East Yakutat/Southeast', 'West Yakutat', 'Central Gulf of Alaska', 'Western Gulf of Alaska'],
    row='Species',
    kind='box',
    height=4,
    aspect=1.2,
    sharey=True  # rescaled values are on a common 0-100 scale, so share the y-axis
)

# Rotate x-axis labels
for ax in g.axes.flatten():
    for label in ax.get_xticklabels():
        label.set_rotation(90)

plt.tight_layout()
plt.show()

### Trends

A simple way to look at a trend is to average the target variable at each time point and follow these averages over time:

In [None]:
dd = (
    fish.groupby(['Species', 'Area', 'Year'])['rescaled_cpue']
    .mean()
    .round(2)
    .reset_index()
    .pivot(index=['Species', 'Area'], columns='Year', values='rescaled_cpue')
    .reset_index()  # Optional: flatten the multi-index
)

dd

In [None]:
temp = dd.melt(id_vars=['Species', 'Area'], var_name='Year', value_name='CPUE')

-   `hue`: we have only one observation per group (average by Species, Area, Year), so we must specify the grouping variable, in this case `Area`, to draw one line per area
-   `Year` is converted to a string (it is no longer a number), and this is reflected in the x axis: no intervals, all values are plotted (so we can, for example, place the labels vertically and make them smaller, to avoid overlap)

In [None]:
# Ensure 'Year' is treated as a string or categorical for proper x-axis handling
temp['Year'] = temp['Year'].astype(str)

# Set up FacetGrid: one subplot per Species
g = sns.FacetGrid(temp, col='Species', col_wrap=2, height=4, sharey=True,
                  legend_out=True)

# Add lineplot to each facet
g.map_dataframe(sns.lineplot, x='Year', y='CPUE', hue='Area', estimator=None)

# Rotate x-axis labels and adjust text size
for ax in g.axes.flatten():
    ax.tick_params(axis='x', rotation=90, labelsize=6)

# Add legend (optional customization)
g.add_legend(title='Area')
plt.tight_layout()
plt.show()

What if `Year` (x axis variable) was a number (an integer)?

In [None]:
temp['Year'] = temp['Year'].astype(int)

In [None]:
# Set up FacetGrid: one subplot per Species
g = sns.FacetGrid(temp, col='Species', col_wrap=2, height=4, sharey=True,
                  legend_out=True)

# Add lineplot to each facet
g.map_dataframe(sns.lineplot, x='Year', y='CPUE', hue='Area', estimator=None)

# Rotate x-axis labels and adjust text size
for ax in g.axes.flatten():
    ax.tick_params(axis='x', rotation=90, labelsize=6)

# Add legend (optional customization)
g.add_legend(title='Area')
plt.tight_layout()
plt.show()

------------------------------------------------------------------------

**Q: Do we have a trend?**

------------------------------------------------------------------------
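
One hedged way to go beyond eyeballing the plots is to summarise each line by its least-squares slope over time. The sketch below uses a made-up table (`toy`, with hypothetical areas `A` and `B`) and `numpy.polyfit`; a straight-line slope is only a crude summary and assumes the trend is roughly linear.

```python
import numpy as np
import pandas as pd

# Toy long-format averages (made-up values), one row per group and year
toy = pd.DataFrame({
    "Area": ["A"] * 4 + ["B"] * 4,
    "Year": [1990, 1991, 1992, 1993] * 2,
    "CPUE": [10.0, 12.0, 14.0, 16.0, 20.0, 19.0, 21.0, 20.0],
})

def linear_slope(group):
    """Least-squares slope of CPUE on Year (units of CPUE per year)."""
    slope, _intercept = np.polyfit(group["Year"].to_numpy(),
                                   group["CPUE"].to_numpy(), deg=1)
    return slope

# One slope per area: clearly positive for A, close to zero for B
slopes = {area: linear_slope(g) for area, g in toy.groupby("Area")}
print(slopes)
```

A slope close to zero relative to the year-to-year fluctuations suggests no clear linear trend for that group.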

What about the standard deviation? Let's see if we have a trend there (!! remember, we introduced artificial random variation, so no trend is actually expected, except by chance !!):

In [None]:
dd = (
    fish.groupby(['Species', 'Area', 'Year'])['rescaled_cpue']
    .std()
    .round(2)
    .reset_index()
    .pivot(index=['Species', 'Area'], columns='Year', values='rescaled_cpue')
    .reset_index()  # Optional: flatten the multi-index
)

dd

In [None]:
temp = dd.melt(id_vars=['Species', 'Area'], var_name='Year', value_name='sd(CPUE)')

In [None]:
# Ensure 'Year' is treated as a string or categorical for proper x-axis handling
# temp['Year'] = temp['Year'].astype(str)

# Set up FacetGrid: one subplot per Species
g = sns.FacetGrid(temp, col='Species', col_wrap=2, height=4, sharey=True,
                  legend_out=True)

# Add lineplot to each facet
g.map_dataframe(sns.lineplot, x='Year', y='sd(CPUE)', hue='Area', estimator=None)

# Rotate x-axis labels and adjust text size
for ax in g.axes.flatten():
    ax.tick_params(axis='x', rotation=90, labelsize=6)

# Add legend (optional customization)
g.add_legend(title='Area')
plt.tight_layout()
plt.show()

------------------------------------------------------------------------

**Q: What do you notice?**

------------------------------------------------------------------------

### Model-based adjustments

We can use a model to adjust the target variable for known sources of variation:

In [None]:
import statsmodels.api as sm

# Define the independent variables and add a constant for the intercept
X = fish[['SST_cvW', 'SST_cvW1', 'SST_cvW2', 'SST_cvW3', 'SST_cvW4', 'SST_cvW5']]
X = sm.add_constant(X)  # Adds the intercept term

# Define the dependent variable
y = fish['rescaled_cpue']

# Fit the linear model
model = sm.OLS(y, X).fit()

# Print the summary of the regression
print(model.summary())

We then **focus on model residuals**:

In [None]:
## residuals are stored in the attribute <object>.resid
model.resid.describe()

In [None]:
# Histogram of the model residuals
data = model.resid
plt.hist(data, bins=30, color='skyblue', edgecolor='black')

# Adding labels and title
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of model residuals')

# Display the plot
plt.show()

In [None]:
fish['residuals'] = model.resid
fish['fitted_values'] = model.fittedvalues

In [None]:
fish

In [None]:
dd = (
    fish.groupby(['Species', 'Area', 'Year'])['residuals']
    .mean()
    .round(2)
    .reset_index()
    .pivot(index=['Species', 'Area'], columns='Year', values='residuals')
    .reset_index()  # Optional: flatten the multi-index
)

dd

In [None]:
temp = dd.melt(id_vars=['Species', 'Area'], var_name='Year', value_name='adjusted CPUE')

In [None]:
# Ensure 'Year' is treated as a string or categorical for proper x-axis handling
# temp['Year'] = temp['Year'].astype(str)

# Set up FacetGrid: one subplot per Species
g = sns.FacetGrid(temp, col='Species', col_wrap=2, height=4, sharey=True,
                  legend_out=True)

# Add lineplot to each facet
g.map_dataframe(sns.lineplot, x='Year', y='adjusted CPUE', hue='Area', estimator=None)

# Rotate x-axis labels and adjust text size
for ax in g.axes.flatten():
    ax.tick_params(axis='x', rotation=90, labelsize=6)

# Add legend (optional customization)
g.add_legend(title='Area')
plt.tight_layout()
plt.show()

------------------------------------------------------------------------

**Q: We see that the trends of residuals are almost the same as the trends of CPUE: why do you think it is so?**

------------------------------------------------------------------------
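
As a hint for exploring this question, the sketch below uses simulated data (not the fish data) in which the covariate explains only a small share of the target. For ordinary least squares with an intercept, the correlation between the target and the residuals equals $\sqrt{1 - R^2}$, so a low $R^2$ means the residuals follow the original values closely. All names and numbers here are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data: the covariate x explains only a small part of y
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)

# Ordinary least squares with an intercept, via numpy
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# R-squared and the correlation between residuals and the original target
r2 = 1 - resid.var() / y.var()
corr = np.corrcoef(y, resid)[0, 1]
print(f"R2 = {r2:.3f}, corr(y, residuals) = {corr:.3f}")
```

Comparing this with the R-squared reported in the model summary for the fish data may help in answering the question.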

### Baseline adjustment

An approach which is often used with temporal data is to express group trends as differences from a reference time point. This is often used in econometrics, e.g. to show variation in GDP in different countries from a starting year. Also in medicine this approach is often used, and is usually referred to as **baseline adjustment** (for data-viz and EDA; model-based adjustments have been mentioned earlier).
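
Before applying this to the fish data, the idea can be sketched on a made-up table with a single value per group and year. The names `toy`, `group`, `value`, `bsl` and `chg` are placeholders for illustration; with several records per group and year, as in the fish data, the baseline has to be an average first.

```python
import pandas as pd

# Toy yearly values for two groups (made-up numbers)
toy = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "year": [1990, 1991, 1992, 1990, 1991, 1992],
    "value": [10.0, 12.0, 9.0, 50.0, 55.0, 60.0],
})

# Sort so the reference (earliest) year comes first within each group
toy = toy.sort_values(["group", "year"])

# Baseline = each group's value at its first time point, repeated on all its rows
toy["bsl"] = toy.groupby("group")["value"].transform("first")

# Change from baseline: 0 at the reference year by construction
toy["chg"] = toy["value"] - toy["bsl"]
print(toy)
```

After the adjustment both groups start from zero, so their trajectories can be compared even though their original levels differ by a factor of five.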

We start by identifying the reference time point: 1990 is the initial year for the entire dataset. In `fish`, the column `Year` is still numeric, so we can take its minimum directly:

In [None]:
oldest_year = fish['Year'].min()
print("The oldest year in our records is", oldest_year)

Since we have multiple values per group at baseline (multiple records per Species, Area and Year), we need to take **averages**:

In [None]:
fish_bsl = fish[fish['Year'] == oldest_year].copy()

In [None]:
# Group by Species and Area, then calculate group-wise mean of rescaled_cpue
fish_bsl['bsl'] = fish_bsl.groupby(['Species', 'Area'])['rescaled_cpue'].transform('mean')

In [None]:
# Select required columns and drop duplicates
fish_bsl = fish_bsl[['Year', 'Area', 'Species', 'bsl']].drop_duplicates()

In [None]:
fish_bsl = fish_bsl.drop(columns=['Year'])

In [None]:
fish_bsl

The average values at baseline are then subtracted from the target variable by "group" (Species and Area in this example):

In [None]:
# Perform left join with fish_bsl on Species and Area
fish_chg = fish.merge(fish_bsl, on=["Species", "Area"], how="left")

In [None]:
fish_chg

In [None]:
fish_chg['chg'] = fish_chg['rescaled_cpue'] - fish_chg['bsl']

In [None]:
fish_chg

Now we have the data ready to visualize trends expressed as differences from baseline:

In [None]:
dd = (
    fish_chg.groupby(['Species', 'Area', 'Year'])['chg']
    .mean()
    .round(2)
    .reset_index()
    .pivot(index=['Species', 'Area'], columns='Year', values='chg')
    .reset_index()  # Optional: flatten the multi-index
)

dd

In [None]:
temp = dd.melt(id_vars=['Species', 'Area'], var_name='Year', value_name='chg')

In [None]:
temp

In [None]:
# Ensure 'Year' is treated as a string or categorical for proper x-axis handling
# temp['Year'] = temp['Year'].astype(str)

# Set up FacetGrid: one subplot per Species
g = sns.FacetGrid(temp, col='Species', col_wrap=2, height=4, sharey=True,
                  legend_out=True)

# Add lineplot to each facet
g.map_dataframe(sns.lineplot, x='Year', y='chg', hue='Area', estimator=None)

# Rotate x-axis labels and adjust text size
for ax in g.axes.flatten():
    ax.tick_params(axis='x', rotation=90, labelsize=6)

# Add legend (optional customization)
g.add_legend(title='Area')
plt.tight_layout()
plt.show()

## Exercise

Data from the article "[Age-dependent trait variation: the relative contribution of within-individual change, selective appearance and disappearance in a long-lived seabird](https://besjournals.onlinelibrary.wiley.com/doi/10.1111/1365-2656.12321)" (data repo [here](https://zenodo.org/record/5010983))

In [None]:
import ssl

# workaround for SSL certificate errors on download: this disables certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

In [None]:
#url= "https://www.jackdellequerce.com/data/seabird/data_traits.xlsx"
url = "https://zenodo.org/records/5010983/files/data_traits.xlsx"
seabirds = pd.read_excel(url)

In [None]:
seabirds

1. Within populations, the expression of phenotypic traits typically varies with age.
Such age-dependent trait variation can be caused by within-individual change (improvement, senescence, terminal effects) and/or selective (dis)appearance of certain phenotypes among older age classes.

2. In this study we applied two methods (decomposition and mixed-modelling) to attribute age-dependent variation in seven phenological and reproductive traits to within-individual change and
selective (dis)appearance, in a long-lived seabird, the common tern (*Sterna hirundo*).

3. At the population level, all traits, except the probability to breed, improved with age (i.e., phenology advanced and reproductive output increased).
Both methods identified within-individual change as the main responsible process, and within individuals, performance improved until age 6-13, before levelling off.
In contrast, within individuals, breeding probability decreased to age 10, then levelled off.

4. Effects of selective appearance and disappearance were small, but showed that longer-lived individuals had a higher breeding probability and bred earlier,
and that younger recruits performed better throughout life than older recruits in terms of both phenology and reproductive performance.
In the year prior to death, individuals advanced reproduction, suggesting terminal investment.

5. The decomposition method attributed more age-dependent trait variation to selective disappearance than the mixed-modelling method: 14-36% versus 0-8%, respectively,
which we identify to be due to covariance between rates of within-individual change and selective (dis)appearance leading to biased results from the decomposition method.

6. We conclude that the decomposition method is ideal for visualising processes underlying population change in performance from one age class to the next,
but that a mixed-modelling method is required to investigate the significance and relative contribution of age-effects.

7. Considerable variation in the contribution of the different age-processes between the seven phenotypic traits studied,
as well as notable differences between species in patterns of age-dependent trait expression, calls for better predictions regarding optimal allocation strategies with age.

<u>OPTIONS</u>:

- look at males vs females across time
- look at males vs females in two/three groups: young, intermediate, old birds

<u>Target variables can be</u>: e.g. laying date or egg volume (approximately continuous variables)

- `afb`: age at first breeding
- `NoFledglings`: n. of young birds (offspring)
- `ClutchSize`: n. of eggs laid in the same nest
- `BroodSize`: n. of young birds in the same nest
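
If you choose the option with age groups, one possible way to build them is `pd.cut`. The sketch below runs on a made-up table: the column names (`sex`, `age`, `trait`), the values and the cut points (5 and 12 years) are assumptions for illustration, so check the actual column names in `seabirds` and choose cut points that make sense for the data.

```python
import pandas as pd

# Toy records; column names and values are placeholders, not the real dataset
birds = pd.DataFrame({
    "sex": ["f", "m", "f", "m", "f", "m"],
    "age": [3, 4, 8, 9, 15, 17],
    "trait": [140.0, 142.0, 136.0, 137.0, 135.0, 134.0],
})

# Bin age into three classes: (0, 5], (5, 12], (12, inf); cut points are arbitrary
birds["age_class"] = pd.cut(
    birds["age"],
    bins=[0, 5, 12, float("inf")],
    labels=["young", "intermediate", "old"],
)

# Mean trait by age class and sex (observed=True keeps only combinations present)
summary = (
    birds.groupby(["age_class", "sex"], observed=True)["trait"]
    .mean()
    .unstack()
)
print(summary)
```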

## EDA

In [None]:
## TASK 1: find number of records and number of variables in the dataset

## your code here

In [None]:
## TASK 2: find type of data for each column (hint: use the dtypes attribute)

## your code here


In [None]:
## TASK 3: get descriptive statistics of the data (hint: use the describe method)

## your code here

------------------------------------------------------------------------

**Q: What do you notice in this summary description? Are all variables included?**

------------------------------------------------------------------------

In [None]:
## TASK 4: get descriptive statistics for non-numerical variables

## your code here

In [None]:
## TASK 5: count n. of missing values per column (hint: use the isnull and sum methods)

## your code here

## Data preprocessing

In [None]:
## TASK 6: remove the column(s) with too many missing data (hint: use the drop method)

## your code here


In [None]:
## TASK 7: remove rows with missing data (hint: use the dropna method)

## your code here

## Trends

### Crude trends

Data as they are:
- by `year`
- by `age`

#### 1) by year

In [None]:
## TASK 8: calculate mean of target variable by year
## 1) pick one target variable (hint: subset dataframe with the df['column'] syntax); 2) calculate mean (hint: use the mean method)

## your code here

In [None]:
## TASK 9: make simple line plot
## (hint1: use matplotlib plot function for two numerical variables)
## (hint2: assign x and y to two variables: x = df['column1']; y = df['column2'])

## your code here

#### 2) by age

In [None]:
## TASK 10: calculate mean of target variable by age
## 1) pick one target variable (hint: subset dataframe with the df['column'] syntax); 2) calculate mean (hint: use the mean method)

## your code here

In [None]:
## TASK 11: make simple line plot
## (hint1: use matplotlib plot function for two numerical variables)
## (hint2: assign x and y to two variables: x = df['column1']; y = df['column2'])

## your code here

### Model-adjusted trends

**IMPORTANT!**: at first, select only some numerical variables into X. The function OLS() does not accept categorical variables as strings, e.g. "f"/"m".

In [None]:
import statsmodels.api as sm
## TASK 12: define dataframes of X variables (only numeric) and y target
## remember to add the intercept (function add_constant())

## your code here

If you want to add the sex variable to the model, you can use the code below (uncomment):

In [None]:
#sex_d = pd.get_dummies(seabirds['sex'], prefix='sex', drop_first=True, dtype=float)
#sex_d.head(3)

In [None]:
#X = pd.concat([X, sex_d['sex_m']], axis=1)
#X

Now, we have the X and y data arrays/series to fit a linear model:

In [None]:
## TASK 13: fit a linear model to the data and print the summary of results (hint: use the OLS and fit methods)

## your code here

Now, we are ready to plot the residuals (adjusted target values) over time to check visually for trends:

In [None]:
## TASK 14: get the model residuals (hint: attribute resid) and add them to our dataset (seabirds)

In [None]:
## TASK 15: calculate mean of target variable by age
## 1) the target variable is now the residuals; 2) calculate the mean (hint: use the mean method)

In [None]:
## TASK 16: make a simple line plot
## (hint1: use matplotlib plot function for two numerical variables)
## (hint2: assign x and y to two variables: x = df['column1']; y = df['column2'])

------------------------------------------------------------------------

**Q: What do you notice in the results? How can you interpret them given your choice of target and predictor variables?**

------------------------------------------------------------------------