# Survival Analysis of Salmon in the Salish Sea
> Jenny Lee, Arturo Rey, Rafe Chang, Riya Eliza

## Contents
1. [Executive Summary](#executive_summary)
2. [Introduction](#introduction)
<br> 2.1. [Motivation](#motivation)
3. [Data Products](#data_product)
<br> 3.1. [Survival Analysis](#survival_analysis)
<br> 3.2. [Outmigration Model](#further-analysis)
<br> 3.3. [Species Prediction Model](#further-analysis)
<br> 3.4. [PL/Python Pipeline](#further-analysis)
4. [Conclusion and Recommendation](#further-analysis)
5. References

<a id='executive_summary'></a>
## 1.0 Executive Summary
Our collaboration with the Pacific Salmon Foundation on the Bottleneck project will yield comprehensive visual analysis tools and advanced statistical and machine learning models, directly enhancing biologists' ability to understand salmon survival trends. By leveraging techniques from the Master of Data Science program, we aim to gain a deeper understanding of salmon survival probability, considering factors such as predation, body size, and origin site, among others.

<a id='introduction'></a>
## 2.0 Introduction
Salmon are critical to the ecosystem: they are a food source for 137 species (Rahr, 2023), such as grizzly bears. In British Columbia, there are over 9,000 distinct salmon populations ("State of Salmon", 2022). However, due to climate change and industrial development over the past 150 years, the population of Pacific salmon in BC has declined and their habitats face unprecedented pressures.

The Pacific Salmon Foundation (PSF) is a non-profit organization committed to guiding the sustainable future of Pacific salmon and their habitat. Its work ranges from community investments to salmon health. As part of the organization’s marine science efforts, the Bottlenecks to Survival Project investigates survival bottlenecks for salmon and steelhead throughout the Salish Sea and southern BC regions. A population bottleneck occurs when a population's size is reduced for at least one generation ("Understanding Evolution").

<a id='motivation'></a>
### 2.1 Motivation
This project interests us deeply because survival rates are so critical to salmon populations. We are conducting a survival analysis to better understand the salmon life cycle, and building two main models to assist the data collection process.

## 3.0 Data Products
<a id='survival_analysis'></a>
### 3.1 Survival Analysis
#### 3.1.1 Objective
In ecological terms, a bottleneck refers to a specific event that results in a sharp decline in a population over a period of time (Pacific Salmon Foundation, 2021). Identifying these bottleneck points throughout the various stages of a salmon’s life cycle is crucial, as it provides valuable insights into potential interventions to improve survival rates. Our study aimed to observe the survival and detection probabilities across five stages of the salmon’s outmigration-return path, as well as the cumulative survival probability from the first stage to the last. 

#### 3.1.2 Preprocessing
The data was extracted stage by stage from the Strait of Georgia data center using SQL queries. The five stages identified were facility (hatchery), downstream, estuary, field (or microtroll), and return. For each stage, we extracted data for all wild and hatchery-origin fish. The extracted columns were the unique tag_id of the fish, the date on which it was detected, the stage the record comes from, the origin of the fish (hatchery/wild), the fork length, the action (tag/detect), and the species of the detected fish.

![sample_table](img/survival_analysis_fig1.png)

After the data from all five stages is extracted in this manner, the datasets are combined in a preprocessing.R file. At this stage, we harmonize any naming differences that arose during data collection. Finally, we output a CSV file (survival_analysis.csv) containing all detections from all five stages.
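
The combination step can be illustrated with a minimal sketch. The project does this in R (preprocessing.R); the version below uses Python with pandas purely for illustration. The sample rows, the column name `fork_length`, and the helper `combine_stages` are hypothetical.

```python
import pandas as pd

# Hypothetical per-stage extracts; in practice each comes from a SQL query.
facility = pd.DataFrame({
    "tag_id": ["A1", "A2"],
    "date": ["2022-04-01", "2022-04-01"],
    "stage": ["facility", "facility"],
    "origin": ["Hatchery", "hatchery "],
    "fork_length": [85.0, 90.0],
    "action": ["tag", "tag"],
    "species": ["Coho", "coho"],
})
downstream = pd.DataFrame({
    "tag_id": ["A1"],
    "date": ["2022-05-10"],
    "stage": ["downstream"],
    "origin": ["HATCHERY"],
    "fork_length": [92.0],
    "action": ["detect"],
    "species": ["COHO"],
})


def combine_stages(frames):
    """Stack per-stage extracts and normalise label spelling."""
    combined = pd.concat(frames, ignore_index=True)
    combined["date"] = pd.to_datetime(combined["date"])
    # Harmonise case and stray whitespace in the categorical columns.
    for col in ["stage", "origin", "action", "species"]:
        combined[col] = combined[col].str.strip().str.lower()
    return combined.sort_values(["tag_id", "date"]).reset_index(drop=True)


detections = combine_stages([facility, downstream])
# To write the combined file:
# detections.to_csv("survival_analysis.csv", index=False)
```

Sorting by tag and date keeps each fish's detections in chronological order, which is convenient when building per-fish detection histories later.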

#### 3.1.3 Data Science Method
Our study implemented the Cormack-Jolly-Seber (CJS) model (Cormack, 1964; Jolly, 1965; Seber, 1965), which was originally designed for studying bird migration. However, due to the model's heavy reliance on recapture rates, we incorporated Bayesian modeling to address concerns arising from the lack of recapture events. By utilizing prior knowledge through Bayesian modeling, we aim to enhance the analytical power of our sparse recapture data. This approach allows us to obtain more precise and accurate parameter estimates, thereby improving the reliability of our findings.

Our data is modeled using hierarchical modeling to estimate survival and detection probabilities across multiple stages of the fish's outmigration-return path. The model combines elements of the CJS model with Bayesian techniques to handle sparse recapture data effectively.

Our prior distributions are as follows, where $j$ indexes the stages of the outmigration-return path. $\phi_j$ is the survival probability of salmon at stage $j$, and $p_j$ is the detection probability of salmon at stage $j$. Both priors are $\text{Beta}(1,1)$, which is the uniform distribution on $[0,1]$.

$$\phi_j \sim \text{Beta}(1,1)$$
$$p_j \sim \text{Beta}(1,1)$$

Our likelihood is as follows, where $i$ indexes individual salmon. $z_{i,j}$ is a binary value ($0$ or $1$) indicating whether fish $i$ is alive at stage $j$, and $y_{i,j}$ is a binary value ($0$ or $1$) indicating whether fish $i$ was detected at stage $j$. A fish can only be alive at stage $j$ if it was alive at stage $j-1$, and it can only be detected at stage $j$ if it is alive there.

$$z_{i,j} \sim \text{Bernoulli}(\phi_j \, z_{i,j-1})$$
$$y_{i,j} \sim \text{Bernoulli}(p_j \, z_{i,j})$$
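
To make the generative process concrete, here is a minimal simulation sketch in Python with NumPy. It assumes the standard CJS convention that detection at a stage is conditional on the fish being alive at that stage, and that every fish is alive and observed at tagging. The function name `simulate_cjs` and all probability values are hypothetical.

```python
import numpy as np


def simulate_cjs(phi, p, n_fish, seed=0):
    """Simulate alive states z and detections y for a CJS-style process.

    phi[j-1] plays the role of the survival probability into stage j,
    and p[j-1] the detection probability at stage j (stage 0 is tagging).
    """
    rng = np.random.default_rng(seed)
    n_stages = len(phi) + 1
    z = np.zeros((n_fish, n_stages), dtype=int)
    y = np.zeros((n_fish, n_stages), dtype=int)
    z[:, 0] = 1  # assumption: every fish is alive when tagged
    y[:, 0] = 1  # assumption: every fish is observed at tagging
    for j in range(1, n_stages):
        # A dead fish (z = 0) has survival probability 0, so it stays dead.
        z[:, j] = rng.binomial(1, phi[j - 1] * z[:, j - 1])
        # Only fish alive at stage j can be detected there.
        y[:, j] = rng.binomial(1, p[j - 1] * z[:, j])
    return z, y


# Hypothetical survival and detection probabilities for four transitions.
z, y = simulate_cjs(phi=[0.8, 0.6, 0.5, 0.3], p=[0.9, 0.5, 0.2, 0.7], n_fish=1000)
```

In real data only `y` is observed; `z` is latent, which is why low detection probabilities make the survival estimates depend more heavily on the priors.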

Lastly, to capture the cumulative survival of salmon across all stages, we recursively multiply the cumulative probability at the previous stage by the survival probability of the current stage. Here $k$ runs over the stages up to and including the current stage $j$.
$$\text{Survship}_j = \prod_{k=1}^{j} \phi_k$$
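
As a worked illustration, the cumulative survival is a running product of the stage survival probabilities. The sketch below uses NumPy. The survival values and the Beta parameters standing in for posterior draws are hypothetical, not results from our model.

```python
import numpy as np

# Hypothetical point estimates of stage survival probabilities.
phi = np.array([0.8, 0.6, 0.5, 0.3])
survivorship = np.cumprod(phi)  # running product across stages

# With posterior draws (one row per draw), apply the same product per draw,
# then summarise each stage with a median and a 95% interval.
rng = np.random.default_rng(0)
phi_draws = rng.beta(a=[8, 6, 5, 3], b=[2, 4, 5, 7], size=(2000, 4))
survivorship_draws = np.cumprod(phi_draws, axis=1)
summary = np.quantile(survivorship_draws, [0.025, 0.5, 0.975], axis=0)
```

Computing the product per draw, rather than multiplying the summary statistics, carries the uncertainty in each stage through to the cumulative estimate.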

#### 3.1.4 Conclusion and Future Recommendations
We are actively working towards incorporating origin as a covariate and observing its effect on the survival probability of salmon. 

### 3.2 Outmigration Model
#### 3.2.1 Objective
The goal of this model is to accurately predict the outmigration timing of salmon. The model takes river flow, level, and temperature as predictors. With this model, we hope to better inform the biologists at the Pacific Salmon Foundation about which days of the year they should go out to the field to tag salmon.

#### 3.2.2 Preprocessing
- Merge the four datasets (salmon tagging date, river temperature, river flow, river level)
- Convert the date column to the right format
- Compute rolling means of river flow, river temperature, and river level over 5-, 10-, and 15-day windows
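
The rolling-mean step can be sketched as follows with pandas. The column names (`flow`, `temperature`, `level`), the output naming scheme, and the sample data are hypothetical. The sketch also assumes one row per day with no gaps, so that a window of `w` rows corresponds to `w` days.

```python
import numpy as np
import pandas as pd


def add_rolling_means(df, columns=("flow", "temperature", "level"),
                      windows=(5, 10, 15)):
    """Add trailing rolling means for each river variable and window."""
    out = df.sort_values("date").copy()
    for col in columns:
        for w in windows:
            # The first w-1 rows have no full window and are left as NaN.
            out[f"{col}_mean_{w}d"] = out[col].rolling(window=w, min_periods=w).mean()
    return out


# Hypothetical daily river data.
dates = pd.date_range("2022-03-01", periods=20, freq="D")
river = pd.DataFrame({
    "date": dates,
    "flow": np.arange(20.0),
    "temperature": np.linspace(4.0, 8.0, 20),
    "level": np.ones(20),
})
features = add_rolling_means(river)
```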

#### 3.2.3 Data Science Method
- We used an XGBoost model to predict salmon outmigration on a given date, followed by a Lasso regression.
- The model returns a range of dates in a given year, based on the lower and upper quartiles chosen by the user.
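
The text does not spell out how the quartiles map to dates. One plausible reading, sketched below as an assumption, is to take the dates on which the cumulative share of predicted outmigration first reaches the lower and upper quantiles. The function `outmigration_window` and the predicted counts are hypothetical.

```python
import numpy as np
import pandas as pd


def outmigration_window(dates, predicted_counts, lower=0.25, upper=0.75):
    """Return the dates where cumulative predicted outmigration first
    reaches the lower and upper quantiles (assumed interpretation)."""
    counts = np.clip(np.asarray(predicted_counts, dtype=float), 0, None)
    cumulative = np.cumsum(counts)
    if cumulative[-1] == 0:
        raise ValueError("predicted counts sum to zero")
    cumulative = cumulative / cumulative[-1]  # last value is exactly 1.0
    start = dates[int(np.searchsorted(cumulative, lower))]
    end = dates[int(np.searchsorted(cumulative, upper))]
    return start, end


# Hypothetical daily predictions for ten days.
dates = pd.date_range("2022-04-01", periods=10, freq="D")
predicted = [0, 1, 3, 6, 10, 10, 6, 3, 1, 0]
start, end = outmigration_window(dates, predicted)
```

With these sample values, the cumulative share reaches 25% on the fourth day and 75% on the sixth, so the returned window runs from 2022-04-04 to 2022-04-06.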

#### 3.2.4 Conclusion and Future Recommendations
- Before choosing XGBoost, we experimented with SARIMAX, logistic regression, and linear regression. These models did not predict as well as XGBoost in the time we worked on them.
- So far, we have trained on 2014 to 2021 data and tested on 2022 and 2023 data. The model is able to predict the starting date for 2022 but not for 2023. It could benefit from training on more data when available.

### 3.3 Species Prediction Model
#### 3.3.1 Imputation Model (Historical Data)
##### 3.3.1.a Objective
The goal of this model was to make the data stored in the data center as complete as possible by imputing the species of the fish wherever possible and/or necessary. This was to be done on data that had been collected in the past by experts doing field work. The model had two aims:
1. To impute data in places where the species of a fish was not recorded
2. To detect and correct mislabelled species if any

##### 3.3.1.b Data Collection
The training data had to be confirmed data. Out of the 57k data points (fish) in the field table, about 5,000 had their species confirmed by the genetics lab. These became our training set. The dataset extracted from the data center had the following columns:

![species_prediction1](img/species_prediction_fig1.png)

Here, annotated_species is the species identified in the field and confirmed_species is the result from the genetics lab. The training set had 5,014 rows.

##### 3.3.1.c Preprocessing
1. Removed null values (if any)
2. Added two new features extracted from the date: day of the year (a whole number between 1 and 365) and the year (2021-2023)
3. Removed tag_id (since it is unique), applied standard scaling to fork length and day of the year, and one-hot encoded the rest of the features to put the data in a format the model can ingest.
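
The steps above can be sketched with pandas as follows. This is an illustration under assumptions, not our actual preprocessing code: the sample rows are hypothetical, the raw date column is assumed to be dropped after feature extraction, and the label column is assumed to be handled separately.

```python
import pandas as pd


def preprocess(df, numeric=("fork_length", "day_of_year")):
    """Drop nulls, derive date features, scale numerics, one-hot the rest."""
    out = df.dropna().copy()
    dates = pd.to_datetime(out["date"])
    out["day_of_year"] = dates.dt.dayofyear
    out["year"] = dates.dt.year.astype(str)  # treated as a category
    # tag_id is unique per fish; the raw date is assumed to be dropped too.
    out = out.drop(columns=["tag_id", "date"])
    for col in numeric:
        # Standard scaling: zero mean, unit (population) standard deviation.
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    categorical = [c for c in out.columns if c not in numeric]
    return pd.get_dummies(out, columns=categorical, dtype=float)


# Hypothetical rows; the last one has a missing date and is removed.
raw = pd.DataFrame({
    "tag_id": ["A1", "A2", "A3", "A4"],
    "date": ["2021-05-01", "2022-06-15", "2023-07-04", None],
    "fork_length": [80.0, 120.0, 100.0, 95.0],
    "annotated_species": ["coho", "chinook", "coho", "coho"],
})
X = preprocess(raw)
```

In a train/validation setting, the scaling mean and standard deviation would normally be computed on the training rows only and reused for new data.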

##### 3.3.1.d Data Science Method
The model we finally settled on was a deep neural network built with TensorFlow. We chose it for its validation accuracy (95%) and for its ability to pick up subtle patterns in the dataset.

The model had four layers: one input, two hidden, and one output layer. The layers and parameters of the model are shown below:

![species_prediction2](img/species_prediction_fig2.png)

The model architecture consists of the input layer with a ReLU activation function that takes in all the features, followed by two hidden layers and finally a softmax layer to output prediction probabilities for the species of each fish.
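
To show how data flows through such an architecture, here is a framework-free sketch of the forward pass in NumPy. It is not the TensorFlow model itself: the weights are random and untrained, the layer widths and the number of species are hypothetical, and the two hidden layers are assumed to use ReLU as well.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def softmax(x):
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def init_params(layer_sizes, seed=0):
    """Random (untrained) weights and zero biases for each dense layer."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]


def forward(x, params):
    """ReLU on the input and hidden layers, softmax on the output layer."""
    h = x
    for w, b in params[:-1]:
        h = relu(h @ w + b)
    w, b = params[-1]
    return softmax(h @ w + b)


# Hypothetical sizes: 12 features, widths 64/32/16, 6 species.
params = init_params([12, 64, 32, 16, 6])
x = np.random.default_rng(1).normal(size=(5, 12))
probs = forward(x, params)  # one row of species probabilities per fish
```

Each row of `probs` sums to 1, so the largest entry can be read as the predicted species and its value as the predicted probability.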

The diagram below shows the model output for one fish whose features have been identified and fed into the model.

![species_prediction3](img/species_prediction_fig3.png)

The model achieved 95% accuracy in testing, where accuracy is defined as the proportion of predictions that match the results from the genetics lab. Training was steady over 20 epochs. The accuracy and loss for both train and validation sets can be seen here.

##### 3.3.1.e Result
The results of this model are delivered as a CSV file containing the tag ID of each fish, the field-identified label, the predicted label, and the predicted probability.

![species_prediction4](img/species_prediction_fig4.png)
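
A minimal sketch of how such a results table could be assembled from the model's class probabilities is shown below. The column names, species list, and probabilities are hypothetical. The extra `disagrees_with_field` flag is our illustrative addition for spotting possible mislabels and is not part of the delivered file.

```python
import numpy as np
import pandas as pd


def build_results(tag_ids, field_labels, probabilities, class_names):
    """Pick the most probable species per fish and tabulate the outcome."""
    probabilities = np.asarray(probabilities)
    best = probabilities.argmax(axis=1)
    results = pd.DataFrame({
        "tag_id": tag_ids,
        "field_label": field_labels,
        "predicted_label": [class_names[k] for k in best],
        "predicted_probability": probabilities.max(axis=1),
    })
    # Illustrative addition: flag rows where the model and field label differ.
    results["disagrees_with_field"] = (
        results["field_label"] != results["predicted_label"]
    )
    return results


class_names = ["chinook", "coho", "steelhead"]
results = build_results(
    tag_ids=["A1", "A2"],
    field_labels=["coho", "coho"],
    probabilities=[[0.10, 0.85, 0.05], [0.70, 0.20, 0.10]],
    class_names=class_names,
)
# To write the file: results.to_csv("species_predictions.csv", index=False)
```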

#### 3.3.2 Current Model (New Data)
##### 3.3.2.a Objective
The goal of this model is to predict the species of a fish given its physical features and the location and site at which it was detected.

##### 3.3.2.b Preprocessing
The data needed to train this model combines deterministic features, i.e., the physical features of the fish, with non-deterministic features such as the location, site method and locality of the fish.

The deterministic features for each species of fish are as follows:
![species_prediction5](img/species_prediction_fig5.png)

The non-deterministic features are:
![species_prediction6](img/species_prediction_fig6.png)

These two sets of features are then merged to create a complete dataset that is processed and fed into the probabilistic models. For the deterministic decision trees, only the deterministic features of the fish were used.

The preprocessing steps are as follows:
1. **Probabilistic Model**:
   - The numerical features were kept as they were for the decision trees but were standard-scaled for the deep learning model.
   - All the categorical features were one-hot encoded.
   - In total, we had 61 features for the probabilistic models.
2. **Deterministic Model**:
   - Since all the deterministic features are categorical and non-ordinal, the only processing required is to transform them into binary features using one-hot encoding.

##### 3.3.2.c Data Science Method
The probabilistic models used were a deep neural network with 4 layers, built with TensorFlow, and a decision tree classifier. Both models were trained on the combined data frame consisting of the physical features of the fish and its location features.

The final product delivered to the partners is the pipeline code, alongside the cron file. The pipeline code will be delivered as two scripts per project.

### 3.4 General Pipelining

Pipelines are important for automating the process of extracting (ingesting), transforming (processing), and loading data into a database. They are mostly used when migrating data from database A to database B. For this project, the pipelines will be used to extract, transform, predict using the trained models, and load the data into the same database.

At the partners' request, the pipelines are written in PL/Python (Procedural Language Python). Orchestration is handled with cron.

The pipeline diagram is as follows:
![pipeline](img/pipeline.png)

Where,
- Raw tables are the tables the pipelines read from. These tables come from the Marine Science database.
- Process is the processing script, which performs different transformations of the data depending on the pipeline.
- Staging table is the table where all the transformed data is stored so that it can be read by the Predict script. A staging table is standard practice in industry because it separates extraction and transformation from the loading part of the pipeline, which makes debugging easier.
- Predict is the script that runs on the transformed data. It runs the models and produces predictions, which are then stored in the Results table.
- Results table is where all the results from the model predictions are stored.
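
The raw, staging, and results flow can be sketched end to end as follows. In the project these steps are PL/Python scripts running in the partners' database and scheduled with cron. Here Python's built-in `sqlite3` module with an in-memory database stands in purely so the sketch is runnable. The table layouts, column names, and `toy_model` are hypothetical.

```python
import sqlite3


def process(conn):
    """Raw -> staging: clean rows and store them for the predict step."""
    conn.execute("DELETE FROM staging")  # start each run from a clean table
    conn.execute(
        "INSERT INTO staging (tag_id, fork_length) "
        "SELECT tag_id, fork_length FROM raw WHERE fork_length IS NOT NULL"
    )


def predict(conn, model):
    """Staging -> results: run the model on each transformed row."""
    rows = conn.execute("SELECT tag_id, fork_length FROM staging").fetchall()
    conn.executemany(
        "INSERT INTO results (tag_id, prediction) VALUES (?, ?)",
        [(tag_id, model(fork_length)) for tag_id, fork_length in rows],
    )


def toy_model(fork_length):
    # Stand-in for a trained model; the threshold is arbitrary.
    return "large" if fork_length >= 100 else "small"


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw (tag_id TEXT, fork_length REAL);
CREATE TABLE staging (tag_id TEXT, fork_length REAL);
CREATE TABLE results (tag_id TEXT, prediction TEXT);
INSERT INTO raw VALUES ('A1', 85.0), ('A2', NULL), ('A3', 120.0);
""")
process(conn)
predict(conn, toy_model)
results = conn.execute(
    "SELECT tag_id, prediction FROM results ORDER BY tag_id"
).fetchall()
```

Because the staging table holds exactly what the model saw, a surprising prediction can be traced back to either the transformation or the model without rerunning the extraction.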

## 4.0 Conclusion and Future Recommendation
The number of adult salmon returning, based on observed and recorded data, is critically low. This poses a significant threat not only to the species that rely on the nutritional value of salmon but also to British Columbia's seafood industry, which exports $1.38 billion annually. Through survival analysis, we examine survival across different stages of the salmon's life cycle to understand at which stage a bottleneck may occur. With the outmigration model, we aim to reduce the time, cost, and effort the PSF team spends deciding when to go to the field to observe fish. Finally, with the species prediction model, we improve the completeness and accuracy of the database to support clear future analysis. We aim to empower scientists with robust data to better test their hypotheses and make efficient decisions.

---
## References
- Rahr, Guido. “Why Protect Salmon.” Wild Salmon Center, 7 Nov. 2023, wildsalmoncenter.org/why-protect-salmon/.
- “State of Salmon.” Pacific Salmon Foundation, 13 Apr. 2022, psf.ca/salmon/.
- “Understanding Evolution.” https://evolution.berkeley.edu/bottlenecks-and-founder-effects/. Accessed 7 May 2024.
- “Breaking the Bottlenecks: A PSF Initiative Seeks to Identify the Danger Zones in the Salmon Life Cycle.” Pacific Salmon Foundation, 2021, https://psf.ca/salmon-steward/breaking-the-bottlenecks/#:~:text=A%20salmon%20survival%20bottleneck%20is,time%2C%20ultimately%20limiting%20future%20production.