# Getting Started

This notebook installs the required dependencies and prepares the dataset for use with fast.ai.

In [1]:
# Check python version
import sys
sys.version

'3.7.13 (default, Mar 29 2022, 02:18:16) \n[GCC 7.5.0]'

The next cell installs `libsndfile1`, a system library for reading and writing audio files. It is only needed on systems that don't already have it installed.

In [2]:
!apt-get install -y libsndfile1

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libflac8 libogg0 libvorbis0a libvorbisenc2
The following NEW packages will be installed:
  libflac8 libogg0 libsndfile1 libvorbis0a libvorbisenc2
0 upgraded, 5 newly installed, 0 to remove and 17 not upgraded.
Need to get 557 kB of archives.
After this operation, 2051 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 libogg0 amd64 1.3.2-1 [17.2 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 libflac8 amd64 1.3.2-1 [213 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/main amd64 libvorbis0a amd64 1.3.5-4.2 [86.4 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/main amd64 libvorbisenc2 amd64 1.3.5-4.2 [70.7 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libsndfile1 amd64 1.0.28-4ubuntu0.18.04.2 [170 kB]
Fetched 557 kB in 1s (571 kB/s)     
debconf: delayi
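
To see whether the install was needed or succeeded, we can ask Python's dynamic loader whether it can locate the library. This is a minimal sketch and not part of the original notebook; `libsndfile_available` is a hypothetical helper name, and the check only tells us that a library named `sndfile` is visible, not which version it is.

```python
import ctypes.util


def libsndfile_available() -> bool:
    """Return True if the dynamic loader can locate a library named sndfile."""
    # find_library returns a library name or path when found, or None otherwise
    return ctypes.util.find_library("sndfile") is not None


print("libsndfile found:", libsndfile_available())
```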

Setting up system dependencies that work well together is difficult. Different libraries specify different version constraints, which makes it hard to resolve a compatible set of versions. In the cloud, libraries like `torch` are typically preinstalled with CUDA enabled. When a dependency specifies a different version of `torch` and the installer downloads a replacement, the new build may not be CUDA enabled. The same can happen with other libraries.
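
One way to tell which situation we are in is to ask `torch` directly, both before and after installing anything. The sketch below is not part of the original notebook: `torch_cuda_status` is a hypothetical helper, and it assumes only that `torch`, when present, exposes `torch.cuda.is_available()` and `torch.version.cuda`.

```python
import importlib.util


def torch_cuda_status() -> dict:
    """Report whether torch is installed and whether it can use CUDA."""
    status = {
        "installed": False,
        "version": None,
        "cuda_build": None,       # CUDA version torch was built against, None for CPU-only builds
        "cuda_available": False,  # True only if a usable CUDA device is visible right now
    }
    if importlib.util.find_spec("torch") is None:
        return status  # torch is not installed in this environment

    import torch

    status["installed"] = True
    status["version"] = torch.__version__
    status["cuda_build"] = torch.version.cuda
    status["cuda_available"] = torch.cuda.is_available()
    return status


print(torch_cuda_status())
```

If `cuda_build` is `None` after an install, the CUDA-enabled build was probably replaced by a CPU-only one.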

`conda` ships with the Anaconda distribution of Python and should be used instead of `pip` here, because certain channels curate sets of libraries that work well together. `fastchan` is a channel created by the fast.ai team that curates the most common versions of data science packages that work well together and support GPU acceleration. When a package isn't on `fastchan`, we use `conda-forge`, a community-maintained channel that is a good alternative.

In [4]:
# Install fastai
!conda install -c fastchan fastai torchaudio --yes
!conda install -c conda-forge kaggle librosa --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - fastai
    - torchaudio


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    bottleneck-1.3.4           |   py37hda87dfa_1         125 KB  fastchan
    brotli-1.0.9               |       h166bdaf_7          18 KB  fastchan
    brotli-bin-1.0.9           |       h166bdaf_7          19 KB  fastchan
    catalogue-2.0.8            |   py37h89c1867_0          32 KB  fastchan
    certifi-2022.9.24          |     pyhd8ed1ab_0         155 KB  fastchan
    click-8.1.3                |   py37h89c1867_0         145 KB  fastchan
    conda-22.9.0               |   py37h89c1867_0         959 KB  fastchan
    cymem-2.0.6                |   py37hd23a5d3_3          42 KB  fastchan
    cython-blis-0.7.7          |   py37hda87dfa_1         

expat-2.4.8          | 187 KB    | ##################################### | 100% 
libbrotlidec-1.0.9   | 33 KB     | ##################################### | 100% 
gst-plugins-base-1.1 | 4.8 MB    | ##################################### | 100% 
murmurhash-1.0.7     | 26 KB     | ##################################### | 100% 
glib-2.68.4          | 447 KB    | ##################################### | 100% 
gstreamer-1.14.0     | 3.2 MB    | ##################################### | 100% 
unicodedata2-14.0.0  | 496 KB    | ##################################### | 100% 
libxcb-1.13          | 391 KB    | ##################################### | 100% 
pydantic-1.8.2       | 2.2 MB    | ##################################### | 100% 
pathy-0.6.2          | 38 KB     | ##################################### | 100% 
sip-4.19.8           | 274 KB    | ##################################### | 100% 
cython-blis-0.7.7    | 4.0 MB    | ##################################### | 100% 
libglib-2.68.4       | 3.0 M
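
After the install finishes, it can be worth confirming that the packages are visible to the notebook kernel. The following is a rough sketch rather than part of the original notebook; `missing_packages` is a hypothetical helper, and the import names are assumed to match the conda package names used above.

```python
import importlib.util

# Assumed import names for the packages installed above
REQUIRED = ["fastai", "torchaudio", "kaggle", "librosa"]


def missing_packages(names):
    """Return the names that cannot be located on the import path."""
    # find_spec locates a top-level module without importing it
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing:", missing_packages(REQUIRED))
```

An empty list means every package was found. If some are missing, restarting the kernel is a reasonable first thing to try.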

# Download Dataset

To use Kaggle’s public API, you must first authenticate using an API token. From the site header, click on your user profile picture, then on “My Account” from the dropdown menu. This will take you to your account settings at https://www.kaggle.com/account. Scroll down to the section of the page labelled API.

To create a new token, click on the “Create New API Token” button. This will download a fresh authentication token onto your machine as a file named `kaggle.json`.

### Accept the rules

Visit the rules page and accept the competition rules before downloading the data: https://www.kaggle.com/competitions/whale-detection-challenge/rules


### Upload your `kaggle.json` to the same folder as this notebook then run the cell below

In [5]:
!mkdir -p ~/.kaggle; mv kaggle.json ~/.kaggle/kaggle.json
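
Before calling the API, we can check that the token landed where the Kaggle client looks for it. This is a minimal sketch under a few assumptions: `check_kaggle_token` is a hypothetical helper, the token is assumed to be a JSON object with `username` and `key` fields, and the permission check reflects the common advice to keep credential files private (for example with `chmod 600`).

```python
import json
import os
import stat


def check_kaggle_token(path="~/.kaggle/kaggle.json"):
    """Return a list of problems found with the token file (empty if none)."""
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        return [f"{path} does not exist"]

    problems = []
    with open(path) as f:
        token = json.load(f)
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing field: {field}")

    # Flag files accessible by group or others, since the token is a credential
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"permissions {oct(mode)} are too open; consider chmod 600")
    return problems


print(check_kaggle_token())
```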

### Download Dataset from Kaggle

In [6]:
!kaggle competitions download -c whale-detection-challenge

Downloading whale-detection-challenge.zip to /workspace/whale_detection_fast-ai
100%|███████████████████████████████████████▉| 508M/509M [00:15<00:00, 39.8MB/s]
100%|████████████████████████████████████████| 509M/509M [00:15<00:00, 34.1MB/s]


### Prepare Dataset for Use

In [7]:
!apt-get install unzip

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Suggested packages:
  zip
The following NEW packages will be installed:
  unzip
0 upgraded, 1 newly installed, 0 to remove and 17 not upgraded.
Need to get 168 kB of archives.
After this operation, 567 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 unzip amd64 6.0-21ubuntu1.1 [168 kB]
Fetched 168 kB in 0s (575 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package unzip.
(Reading database ... 12708 files and directories currently installed.)
Preparing to unpack .../unzip_6.0-21ubuntu1.1_amd64.deb ...
Unpacking unzip (6.0-21ubuntu1.1) ...
Setting up unzip (6.0-21ubuntu1.1) ...
Processing triggers for mime-support (3.60ubuntu1) ...


Let's clean up the workspace so we can start over if necessary, and set up the directories we will need.

In [8]:
!rm -rf full_data; rm -rf sample_data; rm -rf tmp_data; rm -rf full_image_data; #remove any existing extracted data
!unzip -q whale-detection-challenge.zip -d data/ #unzip main file
!unzip -q data/small_data_sample_revised.zip -d sample_data/ #unzip sample data
!unzip -q data/whale_data.zip -d tmp_data/ #unzip full data
!rm -rf data/; rm -rf tmp_data/data/test; #remove unneeded files. official test data isn't used because we don't have labels
!mkdir full_data; mv tmp_data/data/train full_data/audio; #move stuff around
!mv tmp_data/data/train.csv full_data/labels.csv #rename labels
!rm -rf tmp_data #remove tmp directory
!mkdir -p full_data/whale; mkdir -p full_data/not_whale; #create necessary folders
!mkdir -p full_image_data/whale; mkdir -p full_image_data/not_whale; #create necessary folders
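
On systems where `unzip` cannot be installed, the extraction steps could also be done from Python. This is a sketch of an alternative using the standard library's `zipfile` module, not the method used in this notebook; `extract_zip` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Example mirroring the first unzip call above:
# extract_zip("whale-detection-challenge.zip", "data")
```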

In [10]:
# Libraries used to read the labels and move the audio files
import pandas as pd
import os
import shutil

DATA_ROOT_DIR=os.path.normpath(os.path.join(os.getcwd(), 'full_data'))
DATA_META_FILE=os.path.join(DATA_ROOT_DIR, 'labels.csv')
DATA_AUDIO_DIR=os.path.join(DATA_ROOT_DIR, 'audio')
DATA_WHALE_AUDIO_DIR=os.path.join(DATA_ROOT_DIR, 'whale')
DATA_NOT_WHALE_AUDIO_DIR=os.path.join(DATA_ROOT_DIR, 'not_whale')

df = pd.read_csv(DATA_META_FILE)
df.head()

for _, row in df.iterrows():
    clip_name = row['clip_name']

    # path to the file described in labels.csv
    source_path = os.path.join(DATA_AUDIO_DIR, clip_name)

    # the destination depends on whether it is a whale sound or not
    if row['label'] == 0:  # not whale
        aiff_dst_path = os.path.join(DATA_NOT_WHALE_AUDIO_DIR, clip_name)
    else:  # whale
        aiff_dst_path = os.path.join(DATA_WHALE_AUDIO_DIR, clip_name)

    # perform the move, this is pretty fast
    shutil.move(source_path, aiff_dst_path)

Verify that every file was moved to the correct folder.

In [11]:
df = pd.read_csv(DATA_META_FILE)

for _, row in df.iterrows():
    clip_name = row['clip_name']
    label = row['label']
    not_whale_destination_path = os.path.join(DATA_NOT_WHALE_AUDIO_DIR, clip_name)
    whale_destination_path = os.path.join(DATA_WHALE_AUDIO_DIR, clip_name)

    if label == 0:  # not whale
        assert os.path.exists(not_whale_destination_path), f"{clip_name} should be in {not_whale_destination_path}"
        assert not os.path.exists(whale_destination_path), f"{clip_name} should not be in {whale_destination_path}"
    else:  # whale
        assert not os.path.exists(not_whale_destination_path), f"{clip_name} should not be in {not_whale_destination_path}"
        assert os.path.exists(whale_destination_path), f"{clip_name} should be in {whale_destination_path}"
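
As a complementary check, we can compare the number of files in each folder against the label counts. The sketch below uses only the standard library and assumes the `clip_name` and `label` columns seen above with integer labels; `label_counts` and `file_count` are hypothetical helper names.

```python
import csv
import os
from collections import Counter


def label_counts(csv_path):
    """Count rows per label in a CSV with `clip_name` and `label` columns."""
    with open(csv_path, newline="") as f:
        return Counter(int(row["label"]) for row in csv.DictReader(f))


def file_count(directory):
    """Count regular files directly inside `directory`."""
    return sum(1 for entry in os.scandir(directory) if entry.is_file())


# Example with the paths defined earlier (label 0 is not whale, anything else is whale):
# counts = label_counts(DATA_META_FILE)
# assert file_count(DATA_NOT_WHALE_AUDIO_DIR) == counts[0]
# assert file_count(DATA_WHALE_AUDIO_DIR) == sum(counts.values()) - counts[0]
```

This would catch stray files already present in a destination folder, which the per-file assertions would not.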

Remove `labels.csv`, which is no longer needed now that the files are organized in folders according to their labels. The cell also removes the `audio` folder, whose files have been moved out.
If you skip this step, you will run into a dataloader issue later.

In [12]:
!rm -rf full_data/audio; rm -rf full_data/labels.csv
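
To confirm the cleanup left only the class folders behind, we can list anything else that remains. This is a small sketch; `unexpected_entries` is a hypothetical helper, and the expected folder names are taken from the directories created earlier.

```python
import os


def unexpected_entries(root, expected=("whale", "not_whale")):
    """Return names directly under `root` that are not expected class folders."""
    return sorted(name for name in os.listdir(root) if name not in expected)


# Example: an empty list means only the class folders remain
# unexpected_entries("full_data")
```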