# Longitudinal data

In this *Python* notebook we look at examples of longitudinal data, i.e. data with a **time component**.

## Read data

Data from:
- [Spatiotemporally explicit model averaging for forecasting of Alaskan groundfish catch](https://onlinelibrary.wiley.com/doi/10.1002/ece3.4488)
- data repository [here](https://zenodo.org/record/4987796#.ZHcLL9JBxhE)

The data record fish catch for several fish species over time in different regions of Alaska.

In [1]:
import numpy as np
import pandas as pd

In [3]:
url = "https://zenodo.org/records/4987796/files/stema_data.csv"
fish = pd.read_csv(url)

In [7]:
## data size (tabular)
fish.shape

(6716, 14)

In [6]:
fish

Unnamed: 0.1,Unnamed: 0,Station,Year,Area,Species,Latitude,Longitude,CPUE,SST_cvW,SST_cvW5,SST_cvW4,SST_cvW3,SST_cvW2,SST_cvW1
0,2092,62,1990,Western Gulf of Alaska,Pacific cod,52.663,-168.988,1.212,0.222324,0.252917,0.209706,0.187889,0.195080,0.296625
1,2093,62,1991,Western Gulf of Alaska,Pacific cod,52.663,-168.988,0.645,0.236036,0.209706,0.187889,0.195080,0.296625,0.222324
2,2094,62,1992,Western Gulf of Alaska,Pacific cod,52.663,-168.988,2.661,0.252917,0.187889,0.195080,0.296625,0.222324,0.236036
3,2095,62,1993,Western Gulf of Alaska,Pacific cod,52.663,-168.988,1.947,0.209706,0.195080,0.296625,0.222324,0.236036,0.252917
4,2096,62,1994,Western Gulf of Alaska,Pacific cod,52.663,-168.988,1.767,0.187889,0.296625,0.222324,0.236036,0.252917,0.209706
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6711,39103,149,2008,East Yakutat/Southeast,Sablefish,54.598,-133.023,6.827,0.164408,0.143692,0.144263,0.147234,0.133766,0.120992
6712,39113,149,2009,East Yakutat/Southeast,Sablefish,54.598,-133.023,7.116,0.209535,0.144263,0.147234,0.133766,0.120992,0.164408
6713,39123,149,2010,East Yakutat/Southeast,Sablefish,54.598,-133.023,10.756,0.110823,0.147234,0.133766,0.120992,0.164408,0.209535
6714,39133,149,2011,East Yakutat/Southeast,Sablefish,54.598,-133.023,9.177,0.173239,0.133766,0.120992,0.164408,0.209535,0.110823


-   **CPUE**: target variable, "catch per unit effort"
-   **SST**: sea surface temperature
-   **CV**: the coefficient of variation of SST is used rather than the mean. The coefficient of variation is an improved measure of seasonal SST over the mean, because it standardizes scale and allows us to consider changes in the variation of SST together with changes in the mean over time (Hannah Correia, 2018 - Ecology and Evolution)
-   **SST_cvW1** to **SST_cvW5**: CPUE is influenced by survival in the first year of life. Water temperature affects survival, and juvenile fish are more susceptible to environmental changes than adults. Therefore, CPUE for a given year is likely linked to the winter SST at the juvenile stage. Since this survey targets waters during the summer and the four species covered reach maturity at 5-8 years, SST was lagged by one through five years to capture the effect of SST on the juvenile stages. All five lagged SST measures were included for modeling.
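
To make the coefficient of variation and the lagging concrete, here is a minimal sketch on a hypothetical toy table (the `toy` frame, its values and the `SST_cv_lag*` column names are made up for illustration). It is not the procedure used by the authors of the dataset, only one way such variables could be built with pandas: the CV is the standard deviation divided by the mean within each station and year, and the lags are obtained by shifting the yearly CV within each station.

```python
import pandas as pd

# Hypothetical toy data: three winter SST readings per station and year
toy = pd.DataFrame({
    "Station": [1] * 9 + [2] * 9,
    "Year": ([1990] * 3 + [1991] * 3 + [1992] * 3) * 2,
    "SST": [4.0, 5.0, 6.0, 3.0, 5.0, 7.0, 5.0, 5.0, 5.0,
            8.0, 10.0, 12.0, 9.0, 10.0, 11.0, 10.0, 10.0, 10.0],
})

# Coefficient of variation = standard deviation / mean, per station and year
sst_cv = toy.groupby(["Station", "Year"])["SST"].agg(["mean", "std"]).reset_index()
sst_cv["SST_cv"] = sst_cv["std"] / sst_cv["mean"]

# Lag the yearly CV within each station (rows must be ordered by year)
sst_cv = sst_cv.sort_values(["Station", "Year"])
for k in (1, 2):
    sst_cv[f"SST_cv_lag{k}"] = sst_cv.groupby("Station")["SST_cv"].shift(k)

print(sst_cv)
```

The first years of each station get missing values for the lags, because no earlier observation is available in the toy table.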

### Data preprocessing

In [8]:
fish.columns

Index(['Unnamed: 0', 'Station', 'Year', 'Area', 'Species', 'Latitude',
       'Longitude', 'CPUE', 'SST_cvW', 'SST_cvW5', 'SST_cvW4', 'SST_cvW3',
       'SST_cvW2', 'SST_cvW1'],
      dtype='object')

In [11]:
fish = fish.drop(['Unnamed: 0', 'Latitude', 'Longitude'], axis=1)

In [12]:
fish

Unnamed: 0,Station,Year,Area,Species,CPUE,SST_cvW,SST_cvW5,SST_cvW4,SST_cvW3,SST_cvW2,SST_cvW1
0,62,1990,Western Gulf of Alaska,Pacific cod,1.212,0.222324,0.252917,0.209706,0.187889,0.195080,0.296625
1,62,1991,Western Gulf of Alaska,Pacific cod,0.645,0.236036,0.209706,0.187889,0.195080,0.296625,0.222324
2,62,1992,Western Gulf of Alaska,Pacific cod,2.661,0.252917,0.187889,0.195080,0.296625,0.222324,0.236036
3,62,1993,Western Gulf of Alaska,Pacific cod,1.947,0.209706,0.195080,0.296625,0.222324,0.236036,0.252917
4,62,1994,Western Gulf of Alaska,Pacific cod,1.767,0.187889,0.296625,0.222324,0.236036,0.252917,0.209706
...,...,...,...,...,...,...,...,...,...,...,...
6711,149,2008,East Yakutat/Southeast,Sablefish,6.827,0.164408,0.143692,0.144263,0.147234,0.133766,0.120992
6712,149,2009,East Yakutat/Southeast,Sablefish,7.116,0.209535,0.144263,0.147234,0.133766,0.120992,0.164408
6713,149,2010,East Yakutat/Southeast,Sablefish,10.756,0.110823,0.147234,0.133766,0.120992,0.164408,0.209535
6714,149,2011,East Yakutat/Southeast,Sablefish,9.177,0.173239,0.133766,0.120992,0.164408,0.209535,0.110823


Note: in the subset below, **CPUE values are identical**

We see that, in order to accommodate variation in SST among stations, the CPUE value has been replicated multiple times. This would defeat our purpose of analysing the data by group (fish species) over space and time: with only one distinct value per group there is no within-group variation, which makes a statistical analysis difficult. Therefore, we will add to the original CPUE values some random noise proportional to the group average (by species, area, year). First, the subset for `Pacific cod` from `West Yakutat` in year `1990` illustrates the replication:


In [16]:
fish.loc[(fish['Species'] == "Pacific cod") & (fish['Area'] == "West Yakutat") & (fish['Year'] == 1990)]

Unnamed: 0,Station,Year,Area,Species,CPUE,SST_cvW,SST_cvW5,SST_cvW4,SST_cvW3,SST_cvW2,SST_cvW1
621,89,1990,West Yakutat,Pacific cod,0.257,0.184981,0.170131,0.187907,0.204168,0.181633,0.232143
644,90,1990,West Yakutat,Pacific cod,0.257,0.182921,0.150192,0.179461,0.186885,0.185428,0.228893
667,91,1990,West Yakutat,Pacific cod,0.257,0.180274,0.160873,0.171338,0.199175,0.201185,0.234303
690,92,1990,West Yakutat,Pacific cod,0.257,0.146539,0.122142,0.158705,0.193011,0.196952,0.216233
713,93,1990,West Yakutat,Pacific cod,0.257,0.159055,0.119703,0.150706,0.194416,0.181282,0.210157
736,94,1990,West Yakutat,Pacific cod,0.257,0.177652,0.13132,0.159346,0.193603,0.164746,0.199447
759,95,1990,West Yakutat,Pacific cod,0.257,0.177032,0.148819,0.169562,0.196527,0.174598,0.198033
782,96,1990,West Yakutat,Pacific cod,0.257,0.194973,0.134543,0.158762,0.204315,0.178914,0.21014
1449,136,1990,West Yakutat,Pacific cod,0.257,0.146539,0.122142,0.158705,0.193011,0.196952,0.216233
1472,137,1990,West Yakutat,Pacific cod,0.257,0.14573,0.117977,0.15501,0.192141,0.197596,0.21124
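
The subset above shows the replication for a single group. As a rough sketch, the same check can be run for all groups at once by counting the distinct CPUE values per species, area and year. The `toy` frame below is hypothetical and only mimics the layout of `fish`, so that the example runs on its own; with the real data one would group `fish` in the same way.

```python
import pandas as pd

# Hypothetical toy frame mimicking the layout of `fish`
toy = pd.DataFrame({
    "Species": ["Pacific cod"] * 4 + ["Sablefish"] * 2,
    "Area": ["West Yakutat"] * 6,
    "Year": [1990, 1990, 1991, 1991, 1990, 1990],
    "Station": [89, 90, 89, 90, 89, 90],
    "CPUE": [0.257, 0.257, 0.300, 0.300, 1.5, 1.7],
})

# Number of distinct CPUE values within each species/area/year group
n_distinct = toy.groupby(["Species", "Area", "Year"])["CPUE"].nunique()

# Share of groups in which CPUE is a single replicated value
share_replicated = (n_distinct == 1).mean()

print(n_distinct)
print(f"Groups with a single replicated CPUE value: {share_replicated:.0%}")
```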


In [17]:
## add helper columns: group average of CPUE (by species, area, year)
## and a noise standard deviation equal to 10% of that average
fish['avg'] = fish.groupby(['Species', 'Area', 'Year'])['CPUE'].transform('mean')
fish['std'] = 0.1 * fish['avg']

In [19]:
## draw Gaussian noise with a row-specific standard deviation and add it to CPUE
fish['noise'] = np.random.normal(loc=0, scale=fish['std'])
fish['CPUE'] = fish['CPUE'] + fish['noise']

In [20]:
fish

Unnamed: 0,Station,Year,Area,Species,CPUE,SST_cvW,SST_cvW5,SST_cvW4,SST_cvW3,SST_cvW2,SST_cvW1,avg,std,noise
0,62,1990,Western Gulf of Alaska,Pacific cod,1.113481,0.222324,0.252917,0.209706,0.187889,0.195080,0.296625,1.212,0.1212,-0.098519
1,62,1991,Western Gulf of Alaska,Pacific cod,0.755270,0.236036,0.209706,0.187889,0.195080,0.296625,0.222324,0.645,0.0645,0.110270
2,62,1992,Western Gulf of Alaska,Pacific cod,2.642664,0.252917,0.187889,0.195080,0.296625,0.222324,0.236036,2.661,0.2661,-0.018336
3,62,1993,Western Gulf of Alaska,Pacific cod,2.323607,0.209706,0.195080,0.296625,0.222324,0.236036,0.252917,1.947,0.1947,0.376607
4,62,1994,Western Gulf of Alaska,Pacific cod,1.930758,0.187889,0.296625,0.222324,0.236036,0.252917,0.209706,1.767,0.1767,0.163758
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6711,149,2008,East Yakutat/Southeast,Sablefish,7.012555,0.164408,0.143692,0.144263,0.147234,0.133766,0.120992,6.827,0.6827,0.185555
6712,149,2009,East Yakutat/Southeast,Sablefish,5.776981,0.209535,0.144263,0.147234,0.133766,0.120992,0.164408,7.116,0.7116,-1.339019
6713,149,2010,East Yakutat/Southeast,Sablefish,7.821571,0.110823,0.147234,0.133766,0.120992,0.164408,0.209535,10.756,1.0756,-2.934429
6714,149,2011,East Yakutat/Southeast,Sablefish,9.449958,0.173239,0.133766,0.120992,0.164408,0.209535,0.110823,9.177,0.9177,0.272958
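
Because the noise is random, the CPUE values shown above will differ every time the notebook is run. If reproducible values are wanted, one option is a seeded NumPy generator. The sketch below illustrates this on a hypothetical toy frame; the seed `42`, the `toy` frame and the `CPUE_noisy` column are arbitrary choices for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator: the same seed gives the same noise on every run
rng = np.random.default_rng(42)

# Hypothetical toy frame with CPUE replicated within each group
toy = pd.DataFrame({
    "Species": ["Pacific cod"] * 6,
    "Area": ["West Yakutat"] * 6,
    "Year": [1990] * 3 + [1991] * 3,
    "CPUE": [0.257] * 3 + [0.400] * 3,
})

# Group average and noise scale (10% of the average), as in the cells above
avg = toy.groupby(["Species", "Area", "Year"])["CPUE"].transform("mean")
noise = rng.normal(loc=0.0, scale=0.1 * avg.to_numpy())

# Store the perturbed values in a new column so the original CPUE is kept
toy["CPUE_noisy"] = toy["CPUE"] + noise

print(toy)
```

Writing the result to a separate column also avoids adding noise a second time if the cell is executed again.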


### EDA (Exploratory Data Analysis)

Let's start by looking at the raw data. As we already saw, for each combination of species, area and year we have multiple observations (for instance, `Pacific cod` from `West Yakutat` in year `2000`). A boxplot is therefore a good way to plot these data: