<a href="https://colab.research.google.com/github/andraspalasti/deeplearning-hw/blob/main/notebooks/data_prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setting up the project in Google Colab

In [None]:
# Cloning repository into current folder
!git clone https://github.com/andraspalasti/deeplearning-hw.git
!mv deeplearning-hw/* .
!rm -rf deeplearning-hw/

# Install the packages used
%pip install -r requirements.txt

## Preparing raw dataset

To download the dataset from Kaggle you need to be signed in. To sign in, fill in the credentials in the cell below.

What do these cells do?
1. Download the raw dataset from Kaggle
1. Examine the share of images that contain ships
1. Unzip the selected part of the raw dataset
1. Divide the dataset into train, val, test datasets
1. Optionally save the dataset to Google Drive


In [None]:
import os
from pathlib import Path

# Set the environment variables for authentication
if 'KAGGLE_USERNAME' not in os.environ:
    os.environ['KAGGLE_USERNAME'] = "xxxx"
    os.environ['KAGGLE_KEY'] = "xxxx"

from tqdm import tqdm
import pandas as pd
from zipfile import ZipFile
from src.data import download_dataset, filter_missing

In [None]:
# Set up directories to work in
data_dir = Path('data')

raw_dir = data_dir / 'raw'
# parents=True also creates data_dir, exist_ok=True makes the cell safe to re-run
raw_dir.mkdir(parents=True, exist_ok=True)

proc_dir = data_dir / 'processed'

In [None]:
# download_dataset returns the location of the downloaded zip file
dataset_path = download_dataset(raw_dir)
print(dataset_path)

In [None]:
zip = ZipFile(dataset_path)

# Extract the CSV file containing the segmentations
csv_path = Path(zip.extract('train_ship_segmentations_v2.csv', raw_dir))
segmentations = pd.read_csv(csv_path)
# Images without ships have no mask; represent that as an empty string
segmentations['EncodedPixels'] = segmentations['EncodedPixels'].fillna('')
# One row per image: join the masks of all ships in that image
segmentations = segmentations.groupby('ImageId').agg({'EncodedPixels': ' '.join})

print(f'There are {len(segmentations)} images in the dataset')
segmentations
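
To make the aggregation step concrete, here is a small sketch on made-up data. The file names and mask strings are invented for illustration; only the column names and the `fillna` / `groupby` / `agg` calls mirror the cell above.

```python
import pandas as pd

# Toy annotations: a.jpg has two ships (two rows), b.jpg has none (missing mask)
toy = pd.DataFrame({
    'ImageId': ['a.jpg', 'a.jpg', 'b.jpg'],
    'EncodedPixels': ['1 3', '10 2', None],
})

# Missing masks become empty strings so they can be joined like any other string
toy['EncodedPixels'] = toy['EncodedPixels'].fillna('')

# Collapse to one row per image, joining the masks with a space
per_image = toy.groupby('ImageId').agg({'EncodedPixels': ' '.join})
print(per_image)
```

After grouping, an image without ships is exactly a row whose `EncodedPixels` is the empty string, which is what the next cell tests for.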

In [None]:
imgs_with_ships = segmentations[segmentations['EncodedPixels'] != '']
print(f'There are {len(imgs_with_ships)} images that contain ships')

ratio = len(imgs_with_ships) / len(segmentations)
print(f'Share of images that contain ships: {ratio*100:.2f}%')

Most of the images do not contain ships, so for now they are less
valuable to us. We are going to build our own dataset of 60,000 images
(using all the images would be too much for us anyway). It will include
every image that contains ships, and the remaining slots will be filled
with images that do not contain ships.

In [None]:
dataset_size = 60_000

image_ids = list(imgs_with_ships.index)
print(f'{len(image_ids)} images contain ships')

# Fill the rest of the dataset with images that do not contain ships
print(f'{dataset_size-len(image_ids)} images do not contain ships')
imgs_without_ships = segmentations[segmentations['EncodedPixels'] == '']
image_ids.extend(imgs_without_ships[:dataset_size-len(image_ids)].index)
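
The selection rule (all images with ships, then images without ships up to the target size) can be sketched as a small helper. `select_image_ids` and the toy ids are hypothetical and not part of this project; the sketch mainly adds two guards, because a negative fill count would make a slice like `[:n]` silently drop items from the end instead of failing.

```python
def select_image_ids(with_ships, without_ships, dataset_size):
    """Keep every image with ships, then pad with images without ships."""
    if len(with_ships) > dataset_size:
        raise ValueError('more images with ships than the requested dataset size')
    n_fill = dataset_size - len(with_ships)
    if n_fill > len(without_ships):
        raise ValueError('not enough images without ships to fill the dataset')
    return list(with_ships) + list(without_ships[:n_fill])

# Toy example with made-up ids
ids = select_image_ids(['a.jpg', 'b.jpg'], ['c.jpg', 'd.jpg', 'e.jpg'], 4)
print(ids)  # ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg']
```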

In [None]:
# Extract only the images that are in our dataset
for image_id in (t := tqdm(image_ids)):
    zip.extract(f'train_v2/{image_id}', path=raw_dir)
    t.set_description(f'Extracting: {image_id}')
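
Extracting tens of thousands of files is slow, so it can help to make the step safe to re-run. The following is a minimal sketch using only the standard `zipfile` module; `extract_selected` is a hypothetical helper, not something provided by this repository, and silently skipping unknown members is a design assumption.

```python
from pathlib import Path
from zipfile import ZipFile

def extract_selected(archive, members, dest):
    """Extract only the given members, skipping ones already on disk."""
    dest = Path(dest)
    available = set(archive.namelist())
    extracted = []
    for name in members:
        if name not in available:
            continue  # member is not in the archive (assumption: skip silently)
        if (dest / name).exists():
            continue  # already extracted by an earlier run
        archive.extract(name, path=dest)
        extracted.append(name)
    return extracted
```

In this notebook it could be called as `extract_selected(zip, [f'train_v2/{i}' for i in image_ids], raw_dir)`.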

In [None]:
!echo "Number of images in raw dataset: $(ls -1 data/raw/train_v2/ | wc -l)"

In [None]:
from math import floor

train_size = floor(dataset_size * 0.9)
val_size = floor(dataset_size * 0.05)
test_size = floor(dataset_size * 0.05)
print(f'Full size of dataset: {dataset_size}')
print(f'\tTrain set size: {train_size}')
print(f'\tValidation set size: {val_size}')
print(f'\tTest set size: {test_size}')
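
One way to keep the three subsets from overlapping is to slice a single shuffled list of ids. The sketch below is a minimal illustration in plain Python; `split_ids`, its parameters and the toy file names are hypothetical and not part of this project.

```python
import random

def split_ids(image_ids, train_frac=0.9, val_frac=0.05, seed=0):
    """Return disjoint (train, val, test) lists; test gets whatever remains."""
    ids = sorted(image_ids)           # fixed starting order
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

# Toy example with 100 made-up file names
train, val, test = split_ids([f'{i:04d}.jpg' for i in range(100)])
print(len(train), len(val), len(test))  # 90 5 5
```

Because each id appears exactly once in the shuffled list, no image can end up in more than one subset.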

In [None]:
# Split the raw images into disjoint train, val, test sets.
# The file list is sorted so every command sees the same order, and
# tail skips the files already assigned to an earlier split.
!mkdir -p data/processed/

# Create training dataset
!mkdir -p data/processed/train/
!find data/raw/train_v2/ -name "*.jpg" \
    | sort \
    | head -n {train_size} \
    | tr '\n' '\0' \
    | xargs -0 cp -t data/processed/train/
!cp data/raw/train_ship_segmentations_v2.csv data/processed/train_ship_segmentations.csv

# Create validation dataset
!mkdir -p data/processed/val/
!find data/raw/train_v2/ -name "*.jpg" \
    | sort \
    | tail -n +{train_size + 1} \
    | head -n {val_size} \
    | tr '\n' '\0' \
    | xargs -0 cp -t data/processed/val/
!cp data/raw/train_ship_segmentations_v2.csv data/processed/val_ship_segmentations.csv

# Create test dataset
!mkdir -p data/processed/test/
!find data/raw/train_v2/ -name "*.jpg" \
    | sort \
    | tail -n +{train_size + val_size + 1} \
    | head -n {test_size} \
    | tr '\n' '\0' \
    | xargs -0 cp -t data/processed/test/
!cp data/raw/train_ship_segmentations_v2.csv data/processed/test_ship_segmentations.csv

In [None]:

# Filter missing annotations
# (loop variable renamed so it does not overwrite dataset_path from the download cell)
for split in ['train', 'val', 'test']:
    filter_missing(proc_dir / f'{split}_ship_segmentations.csv',
                   proc_dir / f'{split}')
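
The implementation of `filter_missing` lives in `src.data` and is not shown here. As a rough illustration only, and assuming its job is to drop annotation rows whose image file is absent from the given directory, the idea could be sketched with the standard library as follows. `keep_rows_with_images` is a hypothetical name, not the project's function.

```python
import csv
from pathlib import Path

def keep_rows_with_images(csv_path, image_dir):
    """Rewrite csv_path so it only lists images that exist in image_dir.

    Assumption: the CSV has an 'ImageId' column holding the image file name.
    """
    csv_path, image_dir = Path(csv_path), Path(image_dir)
    present = {p.name for p in image_dir.glob('*.jpg')}

    # Read everything first, since the same file is overwritten below
    with csv_path.open(newline='') as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = [row for row in reader if row['ImageId'] in present]

    with csv_path.open('w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```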

In [None]:
!echo "Number of images in dataset: $(find data/processed/*/ -name "*.jpg" | wc -l)"
!echo "Size of dataset on disk: $(du -sh data/processed)"

In [None]:
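# Optionally save the dataset to Google Drive (only works inside Colab)
try:
    from google.colab import drive
except ImportError:
    drive = None  # not running in Colab, skip the upload

if drive is not None:
    drive.mount('/content/gdrive')
    # Absolute path, so the archive lands in Drive regardless of the working directory
    !zip -r /content/gdrive/MyDrive/airbus-dataset.zip data/processed/*
    drive.flush_and_unmount()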