In [9]:
from hodor_python.dataset import HODOR_Dataset, Species

# need to allow nested event loops in notebooks to enable async download
import nest_asyncio
nest_asyncio.apply()

## Downloading Data from HODOR Dataset

The `HODOR_Dataset` class provides download functions for retrieving complete sequences, video data only, or sonar data only.

In the example below, we create a HODOR dataset instance with a local folder called "HODOR" where all downloaded files will be stored. The dataset will automatically skip downloading files that already exist locally, making it safe to re-run download commands without duplicating data.

```python
# Example: Create dataset instance and download sequence 1
hodor = HODOR_Dataset(dataset_folder="HODOR")
hodor.download_sequence(1)
```

All downloaded files (videos, sonar data, and metadata) will be organized within the specified dataset folder. The downloader checks for existing files and only downloads what's missing.

### Download Options

- **Complete Sequences**: Download both video and sonar data for entire sequences using sequence IDs
    - Single sequence: Pass an individual sequence ID
    - Multiple sequences: Pass a list of sequence IDs

- **Video Data Only**: Download just the video files for specified sequences

- **Sonar Data Only**: Download just the sonar files for specified sequences
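
The options above can be wrapped in a small helper. This is a minimal sketch, not part of the library: `download_selected` is a hypothetical name, and `download_sonar` is an assumed method name chosen by analogy with `download_video` (only `download_sequence` and `download_video` appear in this notebook). The helper calls the download methods once per ID, so it does not rely on them accepting lists.

```python
from numbers import Integral
from typing import Iterable, List, Union


def download_selected(
    dataset,
    sequence_ids: Union[int, Iterable[int]],
    video: bool = True,
    sonar: bool = True,
) -> List[int]:
    """Download video and/or sonar data for one or more sequence IDs.

    `dataset` is expected to behave like a HODOR_Dataset instance.
    Returns the list of sequence IDs that were requested.
    """
    # Accept a single ID (including NumPy integers) or any iterable of IDs.
    if isinstance(sequence_ids, Integral):
        ids = [sequence_ids]
    else:
        ids = list(sequence_ids)

    for seq_id in ids:
        if video and sonar:
            dataset.download_sequence(seq_id)
        elif video:
            dataset.download_video(seq_id)
        elif sonar:
            # Assumption: the sonar-only method is named download_sonar.
            dataset.download_sonar(seq_id)
    return ids


# Usage with a real dataset instance (not executed here):
# hodor = HODOR_Dataset(dataset_folder="HODOR")
# download_selected(hodor, [1464, 1788], sonar=False)  # video only
```

If the sonar-only method has a different name in your installed version, adjust the one marked line accordingly.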

### Downloading Strategy

Downloading single sequences does not require a Pangaea account. Downloading the entire dataset at once requires registration at Pangaea and is not recommended because of the dataset's size. A better approach is to:

1. Examine the included activity counts data to identify sequences with interesting species activity.
2. Use the resulting sequence IDs to download only the sequences relevant to your research.

This targeted approach avoids unnecessary data transfer and storage.

### Important Note about Pangaea Storage

The HODOR dataset is hosted on [Pangaea](https://doi.pangaea.de/10.1594/PANGAEA.980000), which uses a hierarchical storage system. If files haven't been accessed recently, they may be stored on tape storage rather than immediately available disk storage. When you request such files, Pangaea will automatically retrieve them from tape storage, which can take some time (depending on the file size and system load).

The downloader handles this automatically: it waits for the retrieval to complete and then downloads the files once they are available on disk storage. Expect longer download times for files that must first be retrieved from tape.
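
The waiting behaviour can be pictured as a polling loop with a growing delay. The following is a minimal sketch under stated assumptions, not the library's actual implementation: `fetch_with_linear_backoff`, `try_fetch`, `step_seconds`, and `max_attempts` are hypothetical names, and the linear 30/60/90-second schedule is only inferred from the retry messages printed in this notebook.

```python
import time
from typing import Callable, Optional


def fetch_with_linear_backoff(
    try_fetch: Callable[[], Optional[bytes]],
    step_seconds: int = 30,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call `try_fetch` until it returns data, waiting longer after each miss.

    `try_fetch` should return the file content, or None while the file is
    still being staged from tape. The wait after the n-th failed attempt is
    n * step_seconds.
    """
    for attempt in range(1, max_attempts + 1):
        data = try_fetch()
        if data is not None:
            return data
        if attempt < max_attempts:
            delay = step_seconds * attempt
            print(f"File is being retrieved from tape. Retrying in {delay} seconds...")
            sleep(delay)
    raise TimeoutError(f"File not available after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the loop easy to test without actually waiting.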

In [2]:
# Example: Create dataset instance and download sequence 1
hodor = HODOR_Dataset(dataset_folder="HODOR")
hodor.download_sequence(1)

cam2_0001.mp4 is being retrieved from tape. Retrying in 30 seconds...
Downloaded cam1_0001.txt successfully!
Downloaded cam2_0001.txt successfully!
Downloaded cam1_0001.mp4 successfully!
cam2_0001.mp4 is being retrieved from tape. Retrying in 60 seconds...
cam2_0001.mp4 is being retrieved from tape. Retrying in 90 seconds...
Downloaded cam2_0001.mp4 successfully!
Downloaded sonar_0001.txt successfully!
Downloaded sonar_0001.mp4 successfully!


In the example above, `cam2_0001.mp4` first had to be retrieved from tape, so the downloader retried with increasing wait times before the download succeeded.

If the same sequence is downloaded again, the files will be skipped:

In [3]:
hodor.download_sequence(1)

File cam2_0001.txt already exists, skipping.
File cam1_0001.txt already exists, skipping.
File cam1_0001.mp4 already exists, skipping.
File cam2_0001.mp4 already exists, skipping.
File sonar_0001.txt already exists, skipping.
File sonar_0001.mp4 already exists, skipping.
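
The skip behaviour amounts to checking the target path before fetching. Here is a minimal sketch, assuming files are stored directly inside the dataset folder; `download_missing` and the `fetch` callable are hypothetical stand-ins, not the library's API.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def download_missing(
    filenames: Iterable[str],
    dataset_folder: str,
    fetch: Callable[[str], bytes],
) -> List[str]:
    """Fetch only the files that are not yet present in `dataset_folder`.

    `fetch` takes a file name and returns its content as bytes.
    Returns the names of the files that were actually downloaded.
    """
    folder = Path(dataset_folder)
    folder.mkdir(parents=True, exist_ok=True)

    downloaded = []
    for name in filenames:
        target = folder / name
        if target.exists():
            print(f"File {name} already exists, skipping.")
            continue
        target.write_bytes(fetch(name))
        print(f"Downloaded {name} successfully!")
        downloaded.append(name)
    return downloaded
```

Because the check relies only on the file's existence, a partially written file from an interrupted run would also be skipped in this sketch; deleting such a file forces a fresh download.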


## Finding Interesting Sequences

The HODOR dataset includes activity count data that allows you to identify sequences with specific species of interest. This is particularly useful for targeted downloading rather than retrieving the entire dataset.

### Example: Finding Sequences with High Cod Activity

You can filter sequences based on species occurrence counts. For example, to find sequences with significant cod activity (more than 10 detections):

```python
# Find all sequences with more than 10 cod detected
interesting_sequences = list(hodor.counts[hodor.counts[Species.FISH_COD] > 10].SeqID)
print(f"Found {len(interesting_sequences)} sequences with >10 cod detections")
print(f"Sequence IDs: {interesting_sequences}")
```

This approach allows you to:
- Identify sequences with high biological activity
- Focus your analysis on relevant data
- Minimize download time and storage requirements
- Target specific species of research interest

You can then use these sequence IDs to download only the data you need for your research.

In [4]:
# Find all sequences with more than 10 cod detected
interesting_sequences = list(hodor.counts[hodor.counts[Species.FISH_COD] > 10].SeqID)
print(f"Found {len(interesting_sequences)} sequences with >10 cod detections")
print(f"Sequence IDs: {interesting_sequences}")

Found 4 sequences with >10 cod detections
Sequence IDs: [1464, 1788, 2892, 2895]


In [8]:
# Inspect the activity counts for the first matching sequence (1464), including its cod count
hodor.counts.loc[interesting_sequences[0]]

SeqID                                          1464
sequenceStartUnix                        1631898552
sequenceEndUnix                          1631898779
DateTimeStart            2021-09-17 17:09:12.139000
DateTimeEnd              2021-09-17 17:12:58.529000
sequence_length              0 days 00:03:46.390000
anguilla_anguilla                                 0
bird_cormorant                                    0
bird_unspecified                                  0
crab_crustacea                                    0
fish_clupeidae                                    0
fish_cod                                         14
fish_mackerel                                     0
fish_mugilidae                                    0
fish_oncorhynchus                                 0
fish_pipefish                                     0
fish_plaice                                       0
fish_salmonidae                                   0
fish_scad                                         0
fish_unspeci

In [6]:
# download video data for one of these sequences
hodor.download_video(interesting_sequences[0])

cam2_1464.txt is being retrieved from tape. Retrying in 30 seconds...
cam2_1464.mp4 is being retrieved from tape. Retrying in 30 seconds...
cam1_1464.mp4 is being retrieved from tape. Retrying in 30 seconds...
cam1_1464.txt is being retrieved from tape. Retrying in 30 seconds...
Downloaded cam2_1464.txt successfully!
Downloaded cam1_1464.txt successfully!
Downloaded cam1_1464.mp4 successfully!
Downloaded cam2_1464.mp4 successfully!
