# Getting Started

This notebook installs the required dependencies and prepares the dataset for use with fast.ai.

In [None]:
# Check python version
import sys
sys.version

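The next cell installs `libsndfile1`, a system library for reading audio files that `librosa` relies on. It is only needed on systems that don't already have it installed.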

In [None]:
!apt-get install -y libsndfile1

Setting up system dependencies that work well together is difficult. Different libraries specify different version constraints, which makes it hard to resolve a compatible set of versions. In the cloud, libraries like `torch` are typically preinstalled with CUDA enabled. When a dependency requires a different version of `torch` and your system downloads a new one, the replacement may not be CUDA-enabled. The same can happen with other libraries.
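A quick way to see whether the currently installed `torch` can still use CUDA is to ask it directly, both before and after installing new packages. The sketch below is one way to do that; `torch_cuda_status` is a hypothetical helper name, not part of fast.ai, and it assumes only that `torch`, if present, exposes `torch.__version__` and `torch.cuda.is_available()`.

```python
import importlib.util


def torch_cuda_status():
    """Report whether torch is importable and whether it can see a CUDA device."""
    # Hypothetical helper: avoid an ImportError on systems without torch.
    if importlib.util.find_spec("torch") is None:
        return {"installed": False, "version": None, "cuda_available": False}

    import torch

    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_available": bool(torch.cuda.is_available()),
    }


print(torch_cuda_status())
```

If `cuda_available` changes from `True` to `False` after an install, a dependency most likely pulled in a CPU-only build of `torch`.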

`conda` ships with the Anaconda distribution of Python and should be used here instead of `pip`, because certain channels curate which libraries work well together. `fastchan` is a channel created by the fast.ai team that curates the most common versions of data science packages that work well together and have GPU acceleration. When a package isn't on `fastchan`, we use `conda-forge`, a community-maintained channel that is a good alternative.

In [9]:
# Install fastai
!conda install -c fastchan fastai --yes
!conda install -c conda-forge kaggle librosa --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - fastai


The following NEW packages will be INSTALLED:

  python_abi         fastchan/linux-64::python_abi-3.7-2_cp37m None

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            pkgs/main/linux-64::certifi-2022.9.24~ --> fastchan/noarch::certifi-2022.9.24-pyhd8ed1ab_0 None
  conda              pkgs/main::conda-22.9.0-py37h06a4308_0 --> fastchan::conda-22.9.0-py37h89c1867_0 None


Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - kaggle
    - librosa


The following packages will be downloaded:

    package                    |          

# Download Dataset

To use Kaggle’s public API, you must first authenticate using an API token. From the site header, click your user profile picture, then “My Account” in the dropdown menu. This takes you to your account settings at https://www.kaggle.com/account. Scroll down to the section of the page labelled API.

To create a new token, click on the “Create New API Token” button. This will download a fresh authentication token onto your machine.

### Accept the rules

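Before downloading, accept the competition rules at https://www.kaggle.com/competitions/whale-detection-challenge/rules. Kaggle rejects the download request until the rules have been accepted.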


### Upload your `kaggle.json` to the same folder as this notebook then run the cell below

In [None]:
!mkdir -p ~/.kaggle; mv kaggle.json ~/.kaggle/kaggle.json; chmod 600 ~/.kaggle/kaggle.json
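If the download later fails with an authentication error, it can help to confirm that the token file is where the Kaggle client expects it and that it looks sane. The sketch below is one way to check that; `check_kaggle_token` is a hypothetical helper, and it assumes the token is a JSON object with `username` and `key` fields.

```python
import json
import os
import stat


def check_kaggle_token(path=os.path.expanduser("~/.kaggle/kaggle.json")):
    """Return a list of problems found with the token file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]

    try:
        with open(path) as f:
            token = json.load(f)
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]

    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # Assumed token layout: both fields present and non-empty.
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing field: {field}")

    # On POSIX systems, flag a token that group or other users can access.
    if os.name == "posix" and stat.S_IMODE(os.stat(path).st_mode) & 0o077:
        problems.append("token is accessible to other users; consider chmod 600")

    return problems


print(check_kaggle_token() or "kaggle.json looks fine")
```

An empty list means the file exists, parses, and has both fields. It does not prove that the key itself is still valid on Kaggle's side.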

### Download Dataset from Kaggle

In [None]:
!kaggle competitions download -c whale-detection-challenge
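Before extracting anything, you may want to confirm that the download produced a valid archive and see what it contains. The following is a minimal sketch using only the standard library; `list_archive` is a hypothetical helper, and the file name is the one used by the `unzip` commands in this notebook.

```python
import os
import zipfile


def list_archive(path):
    """Return the member names of a zip archive, or [] if missing or invalid."""
    if not os.path.isfile(path) or not zipfile.is_zipfile(path):
        return []
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


members = list_archive("whale-detection-challenge.zip")
print(members or "archive not found or not a zip file")
```

The listing should include the nested archives that the preparation step extracts, `small_data_sample_revised.zip` and `whale_data.zip`.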

### Prepare Dataset for Use

In [None]:
!apt-get install -y unzip

Let's clean up our workspace so we can start over if necessary, and set up the directories we will need.

In [None]:
!rm -rf full_data; rm -rf sample_data; rm -rf tmp_data; rm -rf full_image_data; #remove any existing extracted data
!unzip -q whale-detection-challenge.zip -d data/ #unzip main file
!unzip -q data/small_data_sample_revised.zip -d sample_data/ #unzip sample data
!unzip -q data/whale_data.zip -d tmp_data/ #unzip full data
!rm -rf data/; rm -rf tmp_data/data/test; #remove unneeded files. official test data isn't used because we don't have labels
!mkdir full_data; mv tmp_data/data/train full_data/audio; #move stuff around
!mv tmp_data/data/train.csv full_data/labels.csv #rename labels
!rm -rf tmp_data #remove tmp directory
!mkdir -p full_data/whale; mkdir -p full_data/not_whale; #create necessary folders
!mkdir -p full_image_data/whale; mkdir -p full_image_data/not_whale; #create necessary folders

In [None]:
# Imports used by the cells below
import os
import shutil  # needed for shutil.move further down

import pandas as pd

DATA_ROOT_DIR=os.path.normpath(os.path.join(os.getcwd(), 'full_data'))
DATA_META_FILE=os.path.join(DATA_ROOT_DIR, 'labels.csv')
DATA_AUDIO_DIR=os.path.join(DATA_ROOT_DIR, 'audio')
DATA_WHALE_AUDIO_DIR=os.path.join(DATA_ROOT_DIR, 'whale')
DATA_NOT_WHALE_AUDIO_DIR=os.path.join(DATA_ROOT_DIR, 'not_whale')

df = pd.read_csv(DATA_META_FILE)
df.head()

for index, row in df.iterrows():
    clip_name = row['clip_name']
    label = row['label']
    
    # path to file described in labels.csv
    source_path = os.path.join(DATA_AUDIO_DIR, clip_name)

    # this is the directory we will move it to
    aiff_dst_path = None
    
    # path will be dependent on whether it is a whale sound or not
    if(label==0): #not whale
        aiff_dst_path = os.path.join(DATA_NOT_WHALE_AUDIO_DIR, clip_name)
    else: #whale
        aiff_dst_path = os.path.join(DATA_WHALE_AUDIO_DIR, clip_name)

    # perform the move, this is pretty fast
    shutil.move(source_path, aiff_dst_path)

Test that the files moved correctly.

In [None]:
df = pd.read_csv(DATA_META_FILE)
df.head()

for index, row in df.iterrows():
    clip_name = row['clip_name']
    label = row['label']
    source_path = os.path.join(DATA_AUDIO_DIR, clip_name)
    not_whale_destination_path = os.path.join(DATA_NOT_WHALE_AUDIO_DIR, clip_name)
    whale_destination_path = os.path.join(DATA_WHALE_AUDIO_DIR, clip_name)

    if(label==0): #not whale
        assert(os.path.exists(not_whale_destination_path)), f"{clip_name} should be in {not_whale_destination_path}"
        assert not(os.path.exists(whale_destination_path)), f"{clip_name} should not be in {whale_destination_path}"
    else: #whale
        assert not(os.path.exists(not_whale_destination_path)), f"{clip_name} should not be in {not_whale_destination_path}"
        assert(os.path.exists(whale_destination_path)), f"{clip_name} should be in {whale_destination_path}"
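Once the clips are sorted into folders, the folder contents are the only record of the class balance. A short sketch like the one below can report how many clips ended up in each class; `count_clips` is a hypothetical helper, and it assumes the `full_data/whale` and `full_data/not_whale` layout created above.

```python
import os


def count_clips(root, classes=("whale", "not_whale")):
    """Count regular files directly inside each class folder under root."""
    counts = {}
    for name in classes:
        folder = os.path.join(root, name)
        if not os.path.isdir(folder):
            # A missing folder is reported as zero clips rather than an error.
            counts[name] = 0
            continue
        counts[name] = sum(1 for entry in os.scandir(folder) if entry.is_file())
    return counts


counts = count_clips("full_data")
total = sum(counts.values())
for name, n in counts.items():
    share = n / total if total else 0.0
    print(f"{name}: {n} clips ({share:.1%})")
```

A heavily skewed split is worth knowing about before training, since it affects how accuracy should be interpreted.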

Remove the `audio` directory and `labels.csv`. The files are now organized in folders according to their label, so neither is needed any longer.
If you skip this step, you will run into a dataloader issue later.

In [None]:
!rm -rf full_data/audio; rm -rf full_data/labels.csv