# World Data League 2022

## 🎯 Challenge
Avencas Marine Protected Area: Predict the future of the local ecosystem and its species


<img src="wdl_2023.png" alt="MoMoneyMoModels_Badge_whitebackground.png" style="width:35%">

## Team: Mo Money, Mo Models
## 👥 Authors
* David Raposo
* Duarte Pereira
* Martim Chaves
* Paulo Sousa


## 💻 Development

### Introduction

#### The place

Avencas, near Lisbon, Portugal, was classified as a Biophysical Interest Zone (ZIBA) in 1998 due to its **high intertidal biodiversity**.

This classification **sparked controversy and conflict with locals**, leading to **non-compliance** with regulations. In 2016, after concerted efforts from local authorities, Avencas was reclassified as a Marine Protected Area (MPA). MPAs are coastal management tools that aim to mitigate threats to the functioning of the areas they cover, and they can be planned around different specific objectives.

Along with the reclassification, **public participation sessions and environmental awareness** activities were carried out, **improving regulation compliance**, particularly within the **fishing community**.

Certain activities, including aquaculture, water motor sports, fishing, and collection of animals, are prohibited unless authorized for scientific studies. (From Ferreira et al. 2017)

#### A Success Story

"[...] user management actions have been created including visitors’ pathways through the rocky platforms and information spots displaying signs with area specific rules at the entrance to the beach. Positive results point to the success of this approach, as visitors either agreed or respected the various management actions implemented. A survey showed that 84% of visitors look favorable upon the information spots and 76% agree with the location of the access pathways. " (Challenge Brief)

#### A Problem Arises...

But... "The local usages are now under control, [...] allowing for a decrease in the anthropogenic stress of this small MPA, so in theory the intertidal ecosystem should be recovering at a faster rate than what is being recorded by our teams of biologists." In other words, locals are no longer adding stress to the MPA, yet it is still not recovering as fast as expected. Other factors are probably at play, likely related to the global climate changes we are witnessing.

#### The Solution

Knowing this, a great tool to further improve biodiversity and ensure the health and sustainability of the Avencas MPA would be an **interactive app** that visually represents sessile species coverage in the MPA and informs locals and the general public of how important sessile species are for it.

This could be in the form of an **interactive, sea-side screen**, and an **online web app**.

The **goal** of this app would be to reach **beyond the specific Avencas MPA**: to show people, locals and non-locals alike, how important marine life is, and to encourage them to become **champions of environmentally friendly causes**, so that they may pressure policy makers and contribute towards meaningful, large-scale change.

Sessile species contribute to MPAs in several ways. Here are some that could be emphasized:

* Biodiversity: Our ocean is amazing - it's a home for all sorts of sessile organisms like algae, coral, and even sponges! All of these different beings create complex neighbourhoods that host a wild variety of life. Some call corals "rainforests of the sea", as they're teeming with fascinating biodiversity!

* Food Source: Each being plays a part in these complex neighbourhoods under the sea. Some, like algae, mussels, and barnacles, are treats for larger beings, while others, like the amazing fungi, act as a clean-up crew. Each being is important, as each is a key element in sustaining marine areas and keeping the ocean clean and healthy!

* Carbon Sequestration: Another major role that algae and coral play is sucking up carbon dioxide and giving oxygen in return, aiding us in fighting climate change. Besides being pretty, they're also working hard to keep our planet cool.

("Marine Biology: An Ecological Approach" by Nybakken and Bertness, "Dynamics of Marine Ecosystems: Biological-Physical Interactions in the Oceans" by Mann and Lazier)

But first... let's look at the data that we have at hand!

### EDA

#### 1.1. Data Profiling

Examining the structure, contents, and basic statistics of the dataset. 

In [2]:
# Relevant Imports
import pandas as pd
import os

In [5]:
data_path = os.path.join("data", "raw", "files_WDL", "cascais_data", "AMPA_Data_Sample.xlsx")

with pd.ExcelFile(data_path) as xls:
    sheet_names = xls.sheet_names          

print(sheet_names) 

['Sessil (% Coverage)', ' Mobil (nº individuals)', 'Invasive_conservat species list']


Our main dataset has 3 sheets.

Opening each sheet, visually, we can understand that:
1. 'Sessil (% Coverage)' focuses on sessile species, i.e. species that attach themselves to a substrate. It contains the percentage of coverage of each species being analysed, measured in a random "quadrat" in one of five sectors of the MPA. These measurements seem to be taken sporadically, around a couple of times per month, over the span of many years, from 2011 to 2020.
2. ' Mobil (nº individuals)' is similar to the previous sheet, but covers only mobile species and, as its name indicates, records numbers of individuals rather than coverage.
3. 'Invasive_conservat species list' lists which species are invasive, and which are at risk of extinction.
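The impression that samples are taken "a couple of times per month" can be checked rather than eyeballed. The sketch below is a minimal illustration on made-up dates: `toy_df` is a hypothetical stand-in for the `Date` column of the sessile sheet, and it assumes that rows sharing a date belong to the same field visit.

```python
import pandas as pd

# Toy stand-in for the 'Date' column of the sessile sheet (dates are made up).
toy_df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2011-11-10", "2011-11-10", "2011-11-24",
        "2011-12-09", "2012-01-12", "2012-01-12",
    ])
})

# Assumption: rows sharing a date come from the same visit,
# so we count distinct days rather than rows.
sampling_days = toy_df["Date"].dropna().dt.normalize().drop_duplicates()

# Number of distinct sampling days per calendar month.
days_per_month = sampling_days.groupby(sampling_days.dt.to_period("M")).size()

print(days_per_month)
print(f"Span: {sampling_days.min().date()} to {sampling_days.max().date()}")
```

Running the same three lines on the real `Date` column would show whether the sampling cadence is stable across years or has gaps.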

In [26]:
sessil_df = pd.read_excel(data_path, sheet_name=sheet_names[0])
mobile_df = pd.read_excel(data_path, sheet_name=sheet_names[1])
invsve_df = pd.read_excel(data_path, sheet_name=sheet_names[2])

Inspecting the sheets shows that some columns are not relevant to us and can be dropped: the one recording who took the sample, plus a few extra columns that contain no information. So, let's get rid of them! :)

In [27]:
columns_to_drop_sessil = ['Sampler', 'Coluna1', 'Coluna2']
columns_to_drop_mobile = ['Sampler', 'Column1', 'Column2', 'Column3']

sessil_df = sessil_df.drop(columns_to_drop_sessil, axis=1)
mobile_df = mobile_df.drop(columns_to_drop_mobile, axis=1)
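The drop above relies on visual inspection. As a hedged alternative, the columns that "contain no information" could be detected programmatically before dropping them. The sketch below uses a hypothetical `toy_df` with made-up values; only the column names mirror the sheet.

```python
import numpy as np
import pandas as pd

# Toy frame: column names mirror the sheet, values are made up.
toy_df = pd.DataFrame({
    "Sampler": ["A", "B", "A"],
    "Tide": [0.6, 0.7, 0.9],
    "Coluna1": [np.nan, np.nan, np.nan],
    "Coluna2": [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing carry no information.
is_empty = toy_df.isna().all().to_numpy()
empty_columns = toy_df.columns[is_empty].tolist()
print(empty_columns)

# errors="ignore" skips any listed name that is absent from the frame.
cleaned_df = toy_df.drop(columns=["Sampler", *empty_columns], errors="ignore")
print(cleaned_df.columns.tolist())
```

This avoids hard-coding names such as `Coluna1` or `Column3`, which differ between the two sheets.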

Let's get a general idea of what we're dealing with...

In [6]:
def initial_data_profiling(data: pd.DataFrame):
    # Get the data types of each column
    data_types = data.dtypes

    # Get the summary statistics of the DataFrame
    summary_stats = data.describe()

    # Get the number of missing values in each column
    missing_values = data.isnull().sum()

    # Get the number of unique values in each column
    unique_values = data.nunique()

    # Display the data profiling information
    print("Data Types:")
    print(data_types)
    print("\nSummary Statistics:")
    print(summary_stats)
    print("\nMissing Values:")
    print(missing_values)
    print("\nUnique Values:")
    print(unique_values)

In [19]:
print("Sessile")
initial_data_profiling(sessil_df)

Sessile
Data Types:
Date                                        datetime64[ns]
Hour                                                object
Tide                                               float64
Weather Condition                                   object
Water temperature (ºC)                              object
                                                 ...      
Sphacelaria rigidula (pompons castanhos)           float64
Cystoseira sp.                                      object
Laminaria sp.                                      float64
TOTAL2                                              object
observações                                         object
Length: 102, dtype: object

Summary Statistics:
              Tide  Siphonaria algesirae  Gibbula sp.  Monodonta lineata  \
count  2010.000000           2010.000000  2010.000000        2010.000000   
mean      0.729035              0.237736     0.290647           0.128532   
std       0.184047              0.914012     0.882766  

In [23]:
print(f"Number of columns (sessile): {len(sessil_df.columns)}")
print(f"Number of columns (mobile): {len(mobile_df.columns)}")
print(f"Number of columns (invasive): {len(invsve_df.columns)}")

Number of columns (sessile): 102
Number of columns (mobile): 67
Number of columns (invasive): 10


This isn't particularly useful... There are a lot of different species, so it's difficult to get an accurate idea of the data from a simple `describe`-style summary. We're dealing with 102 columns for the sessile df, and 67 columns for the mobile one.

To deal with this, let's separate the dataframes into information about the samples, and the information about the species in the samples themselves.

This way, we'll be able to get an idea of the meta-information of the samples (how often they were taken, at what hours of the day, and so on), and an idea of the species information itself (a general picture, not a species-by-species one).

We'll leave the analysis of the invasive species list for last.

In [24]:
# List of columns that represent sample information (excluding species columns)
sessil_meta_info_columns = ['Date', 'Hour', 'Tide', 'Weather Condition',
                            'Water temperature (ºC)', 'Zone',
                            'Supratidal/Middle Intertidal', 'Substrate',
                            'TOTAL2', 'observações']

mobile_meta_info_columns = ['Date', 'Hour', 'Tide', 'Weather Condition',
                            'Water temperature (ºC)', 'Zone',
                            'Supratidal/Middle Intertidal', 'Substrate',
                            'TOTAL', 'Abundance (ind/m2)']

In [28]:
# Create the meta_info_df DataFrame with sample information
# .copy() makes the subsets independent of the originals, which are deleted below
sessil_meta_info_df = sessil_df[sessil_meta_info_columns].copy()
mobile_meta_info_df = mobile_df[mobile_meta_info_columns].copy()

# Create the species_info_df DataFrame with species information
sessil_species_info_df = sessil_df.drop(columns=sessil_meta_info_columns)
mobile_species_info_df = mobile_df.drop(columns=mobile_meta_info_columns)

# Delete original df to free up memory
del sessil_df
del mobile_df
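With the species columns isolated, a general (not species-by-species) picture can be obtained by summarising every species with the same two numbers. The sketch below is one possible way to do that, shown on a hypothetical `toy_species_df` with placeholder species names and made-up coverage values. It coerces columns to numeric first, since the profiling output shows that some species columns (for example `Cystoseira sp.`) were read as `object`.

```python
import pandas as pd

# Toy stand-in for sessil_species_info_df: percent coverage per quadrat.
# Species names and values are placeholders, not real data.
toy_species_df = pd.DataFrame({
    "Species A": [0.0, 12.5, 30.0, 0.0],
    "Species B": [0.0, 0.0, 0.0, 5.0],
    "Species C": [0.0, 0.0, 0.0, 0.0],
})

# Coerce to numeric; entries that cannot be parsed become NaN.
numeric_df = toy_species_df.apply(pd.to_numeric, errors="coerce")

# One row per species: average coverage and share of samples where it appears.
summary = pd.DataFrame({
    "mean_coverage": numeric_df.mean(),
    "share_of_samples_present": (numeric_df > 0).mean(),
}).sort_values("mean_coverage", ascending=False)

never_observed = int((summary["share_of_samples_present"] == 0).sum())

print(summary)
print(f"Species never observed: {never_observed}")
```

A table like `summary` stays readable even with around a hundred species, and it makes rarely observed species easy to spot.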

In [29]:
print("Analysis of Sessil samples' meta info")
initial_data_profiling(sessil_meta_info_df)

Analysis of Sessil samples' meta info
Data Types:
Date                            datetime64[ns]
Hour                                    object
Tide                                   float64
Weather Condition                       object
Water temperature (ºC)                  object
Zone                                    object
Supratidal/Middle Intertidal            object
Substrate                               object
TOTAL2                                  object
observações                             object
dtype: object

Summary Statistics:
              Tide
count  2010.000000
mean      0.729035
std       0.184047
min       0.300000
25%       0.600000
50%       0.700000
75%       0.900000
max       2.400000

Missing Values:
Date                               1
Hour                               1
Tide                               1
Weather Condition                  1
Water temperature (ºC)             9
Zone                               1
Supratidal/Middle Intertidal       
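The profiling output shows `Water temperature (ºC)` stored as `object` rather than as a number, which suggests that some entries are not plain numeric values. The sketch below shows one hedged way to coerce such a column and list the entries that fail to parse. `toy_meta_df` is hypothetical, and the kinds of bad entries in it (a decimal comma, a text placeholder) are assumptions, not observations from the real sheet.

```python
import pandas as pd

# Toy values; the kinds of non-numeric entries shown here are assumptions.
toy_meta_df = pd.DataFrame({
    "Water temperature (ºC)": [15, "16,5", "n/a", None, 17.2],
})

raw = toy_meta_df["Water temperature (ºC)"]

# Normalise decimal commas, then coerce anything unparseable to NaN.
as_text = raw.astype(str).str.replace(",", ".", regex=False)
parsed = pd.to_numeric(as_text, errors="coerce")

# Entries that were present but could not be parsed deserve a manual look.
unparseable = raw[parsed.isna() & raw.notna()]

print(parsed.tolist())
print(unparseable.tolist())
```

Listing `unparseable` before overwriting the column keeps the cleaning step auditable: each odd entry can be fixed or discarded deliberately instead of silently becoming missing.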

## 🖼️ Visualisations
Copy here the most important visualizations (graphs, charts, maps, images, etc). You can refer to them in the Executive Summary.

Technical note: If not all the visualisations are visible, you can still include them as an image or link - in this case please upload them to your own repository.

## 👓 References
List all of the external links (even if they are already linked above), such as external datasets, papers, blog posts, code repositories and any other materials.

## ⏭️ Appendix
Add here any code, images or text that you still find relevant, but that was too long to include in the main report. This section is optional.
