# Download time-series data from the BICEP project

**Last updated: 29/04/2024**

This script downloads global-ocean time-series data from the European Space Agency's [**BICEP project**](https://www.bicep-project.org/Deliverables). It runs in the conda environment `bashenv` because the command-line tool `wget` is needed to download data from the **CEDA Archive**, where the BICEP data are stored. CEDA services do not support subsetting the arrays, so the entire **global ocean** dataset must be downloaded.

The global ocean BICEP datasets downloaded in this script are:
- Particulate organic carbon (**[POC](https://catalogue.ceda.ac.uk/uuid/5006f2c553cd4f26a6af0af2ee6d7c94)**)
- Phytoplankton carbon (**[Cphyto](https://catalogue.ceda.ac.uk/uuid/6a6ccbb8ef2645308a60dc47e9b8b5fb)**)
- Net primary production (**[NPP](https://catalogue.ceda.ac.uk/uuid/69b2c9c6c4714517ba10dab3515e4ee6)**)

To create this script, I followed the CEDA Archive instructions: (1) visit https://dap.ceda.ac.uk/neodc/bicep/data and select the datasets of interest (POC, Cphyto and NPP for this script); (2) copy the **URLs** provided for download and paste them into the cell immediately below. Instructions for bulk downloading with `wget` are available [here](https://help.ceda.ac.uk/article/5061-bulk-download-wget).
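
Because the download depends on `wget`, it can help to confirm that the command is on the `PATH` of the active environment before running the cells below. A minimal sketch, where `require_command` is a hypothetical helper name that is not part of the original notebook:

```shell
# Hypothetical helper: succeed only if the named command is on the PATH.
require_command() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "Missing required command: $1" >&2
        return 1
    fi
}

# Report whether wget is available without aborting the session.
if require_command wget; then
    echo "wget found at: $(command -v wget)"
else
    echo "Install wget in the active environment before running the download cells."
fi
```
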

In [None]:
%%bash

URL_LIST=(
    "https://dap.ceda.ac.uk/neodc/bicep/data/particulate_organic_carbon/v5.0/monthly/GEO/"
    "https://dap.ceda.ac.uk/neodc/bicep/data/phytoplankton_carbon/v5.0/monthly/"
    "https://dap.ceda.ac.uk/neodc/bicep/data/marine_primary_production/v4.2/monthly/"
)
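
The three URLs do not sit at the same depth: the POC path ends in an extra `GEO` level. Since the download step passes a single `--cut-dirs` value to `wget`, it may be useful to count the directory components of each URL. The sketch below does this with plain parameter expansion; `count_url_dirs` is a hypothetical helper introduced here only for illustration.

```shell
# Hypothetical helper: count the directory components in a URL path,
# i.e. the levels that wget's --cut-dirs option can strip.
count_url_dirs() {
    local path count=0 part
    path="${1#*://}"              # drop the scheme, e.g. "https://"
    case "$path" in
        */*) path="${path#*/}" ;; # drop the hostname
        *)   path="" ;;           # no path after the hostname
    esac
    path="${path%/}"              # drop a trailing slash, if any
    local IFS='/'
    for part in $path; do
        [ -n "$part" ] && count=$((count + 1))
    done
    echo "$count"
}

# POC URL: prints 7 (neodc, bicep, data, particulate_organic_carbon, v5.0, monthly, GEO)
count_url_dirs "https://dap.ceda.ac.uk/neodc/bicep/data/particulate_organic_carbon/v5.0/monthly/GEO/"
# Cphyto URL: prints 6
count_url_dirs "https://dap.ceda.ac.uk/neodc/bicep/data/phytoplankton_carbon/v5.0/monthly/"
```

With `--cut-dirs=6`, as used in the download cell, the six leading levels are removed, so the POC files end up inside a local `GEO` subfolder while the other two datasets keep only what lies below `monthly/`.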

In [None]:
%%bash

# Parameters to define the data download directories

ROOT_DIR="../.."  # navigate two directories up
DATA_DIR="data/raw/BICEP_data"
OUTPUT_DIRECTORY_LIST=(
    "BICEP_POC_nc"
    "BICEP_Cphyto_nc"
    "BICEP_NPP_nc"
)
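
The download step pairs each URL with an output directory by position, so the two arrays need the same length and the same order. A small sketch of that check follows. The arrays are repeated from the cells above only so the snippet runs on its own, and `check_same_length` is a hypothetical helper.

```shell
# Arrays repeated from the cells above so the snippet is self-contained.
URL_LIST=(
    "https://dap.ceda.ac.uk/neodc/bicep/data/particulate_organic_carbon/v5.0/monthly/GEO/"
    "https://dap.ceda.ac.uk/neodc/bicep/data/phytoplankton_carbon/v5.0/monthly/"
    "https://dap.ceda.ac.uk/neodc/bicep/data/marine_primary_production/v4.2/monthly/"
)
OUTPUT_DIRECTORY_LIST=("BICEP_POC_nc" "BICEP_Cphyto_nc" "BICEP_NPP_nc")

# Hypothetical helper: succeed only if the two counts are equal.
check_same_length() {
    if [ "$1" -ne "$2" ]; then
        echo "Length mismatch: $1 URLs vs $2 output directories" >&2
        return 1
    fi
}

if check_same_length "${#URL_LIST[@]}" "${#OUTPUT_DIRECTORY_LIST[@]}"; then
    # Print each pairing by index so the order can be checked by eye.
    for i in "${!URL_LIST[@]}"; do
        echo "${OUTPUT_DIRECTORY_LIST[$i]} <- ${URL_LIST[$i]}"
    done
fi
```
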

In [2]:
%%bash

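# Construct paths.
# Note: each %%bash cell runs in a separate shell, so URL_LIST, ROOT_DIR, DATA_DIR and
# OUTPUT_DIRECTORY_LIST must be defined in this same cell for the code below to see them.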

output_directory_list_path=()
for output_directory in "${OUTPUT_DIRECTORY_LIST[@]}"; do
    # Join the path components with printf
    data_subdir_path=$(printf '%s/%s/%s/' "$ROOT_DIR" "$DATA_DIR" "$output_directory")
    # Append the full path to output_directory_list_path
    output_directory_list_path+=("${data_subdir_path}")
    # Create the directory at the specified path if it doesn't already exist
    if [ ! -d "${data_subdir_path}" ]; then
        mkdir -p "${data_subdir_path}"
        echo "Directory created: ${data_subdir_path}"
    fi
done
    
# Pair each URL with its output directory using paste (space delimiter), then let xargs pass
# two arguments at a time (-n 2) to wget, running at most two wget processes in parallel
# (-P 2). xargs returns only after the last spawned process has finished. wget options:
# -q runs quietly (no output to the terminal); -r downloads recursively; --no-parent stays
# below the given URL; -nH avoids creating a directory named after the host; --cut-dirs=N skips
# the first N remote directory levels when creating the local directory structure. Here N=6
# removes neodc/bicep/data/<dataset>/<version>/monthly, so only deeper levels are kept locally.
paste -d ' ' <(printf "%s\n" "${URL_LIST[@]}") <(printf "%s\n" "${output_directory_list_path[@]}") | \
xargs -n 2 -P 2 bash -c 'echo "URL: $1"; echo "Output directory path: $2"; wget -q -e robots=off --no-parent -P "$2" -nH --cut-dirs=6 -r "$1"' --

echo "Download completed!"

Download completed!
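
To see how `paste` and `xargs -n 2` hand one URL and one directory to each command, the same pipeline can be run with `echo` in place of `wget`, so that nothing is downloaded. This is a dry-run sketch: the `urls` and `dirs` arrays hold placeholder values, not the real BICEP paths, and `pair_and_print` is a hypothetical helper.

```shell
# Placeholder inputs (hypothetical, for illustration only).
urls=("https://example.org/a/" "https://example.org/b/")
dirs=("out/a/" "out/b/")

# Hypothetical helper: pair line i of the first list with line i of the second,
# then pass each pair to a child shell as $1 and $2 (the trailing -- becomes $0).
pair_and_print() {
    paste -d ' ' <(printf '%s\n' "${urls[@]}") <(printf '%s\n' "${dirs[@]}") |
        xargs -n 2 bash -c 'echo "would fetch $1 into $2"' --
}

pair_and_print
```

Because `xargs` splits its input on whitespace, this pairing assumes that neither the URLs nor the directory paths contain spaces.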


In [8]:
%%bash

# Notice that the POC dataset comes with an intermediate folder GEO. Let's remove it.
cd "../../data/raw/BICEP_data/BICEP_POC_nc/" || exit 1 # stop if the directory does not exist
mv GEO/* . # move the contents of the GEO folder to the parent directory
rmdir GEO # remove the empty GEO folder
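
The `mv`/`rmdir` step above fails if it is run a second time, because `GEO` no longer exists. A more defensive variant, sketched below, only acts when the subfolder is present. `flatten_subdir` is a hypothetical helper, and the demonstration runs in a temporary directory rather than on the real data.

```shell
# Hypothetical helper: move the contents of <parent>/<child> up into <parent>
# and remove the emptied <child>; do nothing if <child> is absent.
flatten_subdir() {
    local parent="$1" child="$2"
    if [ ! -d "$parent/$child" ]; then
        echo "Nothing to do: $parent/$child not found"
        return 0
    fi
    # find avoids glob pitfalls when the folder is empty or holds hidden files.
    find "$parent/$child" -mindepth 1 -maxdepth 1 -exec mv {} "$parent"/ \;
    rmdir "$parent/$child"
}

# Demonstration on a throwaway directory tree.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/GEO"
touch "$demo_dir/GEO/file_a.nc" "$demo_dir/GEO/file_b.nc"
flatten_subdir "$demo_dir" GEO
ls "$demo_dir"
```

Under the directory layout used in this notebook, the call would be `flatten_subdir "../../data/raw/BICEP_data/BICEP_POC_nc" GEO`.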