<a href="https://colab.research.google.com/github/andraspalasti/deeplearning-hw/blob/main/notebooks/data_prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setting up the project in Google Colab

In [None]:
# Cloning repository into current folder
!git clone https://github.com/andraspalasti/deeplearning-hw.git
!mv deeplearning-hw/* .
!rm -rf deeplearning-hw/

# Install the packages used
%pip install -q -r requirements.txt

## Preparing raw dataset

Downloading the dataset from Kaggle requires authentication. To sign in, fill in your credentials in the cell below.

What do these cells do?
1. Download the raw dataset from Kaggle
1. Examine the ratio of images that contain ships
1. Unzip the selected part of the raw dataset
1. Divide the dataset into train, val, and test sets
1. Optionally save the dataset to Google Drive
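
As an alternative to typing the credentials into the notebook, they can be read from a `kaggle.json` API token file (the file Kaggle offers for download, with `username` and `key` fields). The following is a minimal sketch under that assumption. The helper name `load_kaggle_credentials` is hypothetical and not part of this repository.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(config_path):
    """Copy Kaggle credentials from a kaggle.json file into os.environ.

    Hypothetical helper: returns True if credentials are available
    afterwards and False if the file does not exist.
    """
    if 'KAGGLE_USERNAME' in os.environ and 'KAGGLE_KEY' in os.environ:
        return True  # Already configured, leave the existing values alone

    config_path = Path(config_path)
    if not config_path.exists():
        return False

    # kaggle.json is a small JSON object with "username" and "key" fields
    config = json.loads(config_path.read_text())
    os.environ['KAGGLE_USERNAME'] = config['username']
    os.environ['KAGGLE_KEY'] = config['key']
    return True
```

Calling `load_kaggle_credentials(Path.home() / '.kaggle' / 'kaggle.json')` before the download step would keep the key out of the notebook itself, assuming the token file has been uploaded to that location.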


In [1]:
import os
from pathlib import Path

# Set the environment variables for authentication
if 'KAGGLE_USERNAME' not in os.environ:
    os.environ['KAGGLE_USERNAME'] = "xxxx"
    os.environ['KAGGLE_KEY'] = "xxxx"

from tqdm import tqdm
import pandas as pd
from zipfile import ZipFile
from src.data import download_dataset, filter_missing

In [2]:
# Set up directories to work in
data_dir = Path('data')
raw_dir = data_dir / 'raw'
# Create data/ and data/raw/ in one call if they do not exist yet
raw_dir.mkdir(parents=True, exist_ok=True)

proc_dir = data_dir / 'processed'

In [4]:
# download_dataset returns the location of the downloaded zip file
dataset_path = download_dataset(raw_dir)
print(dataset_path)

Downloading airbus-ship-detection.zip to data/raw
... resuming from 277872640 bytes (30412638106 bytes left) ...


100%|██████████| 28.6G/28.6G [14:53<00:00, 34.1MB/s] 


data/raw/airbus-ship-detection.zip





In [5]:
zip = ZipFile(dataset_path)

# Export csv file containing segmentations
csv_path = Path(zip.extract('train_ship_segmentations_v2.csv', raw_dir))
segmentations = pd.read_csv(csv_path)
segmentations['EncodedPixels'] = segmentations['EncodedPixels'].fillna('')
segmentations = segmentations.groupby('ImageId').agg({'EncodedPixels': ' '.join})

print(f'There are {len(segmentations)} images in the dataset')
segmentations

There are 192556 images in the dataset


ImageId,EncodedPixels
00003e153.jpg,
0001124c7.jpg,
000155de5.jpg,264661 17 265429 33 266197 33 266965 33 267733...
000194a2d.jpg,360486 1 361252 4 362019 5 362785 8 363552 10 ...
0001b1832.jpg,
...,...
fffedbb6b.jpg,
ffff2aa57.jpg,
ffff6e525.jpg,
ffffc50b4.jpg,


In [6]:
imgs_with_ships = segmentations[segmentations['EncodedPixels'] != '']
print(f'There are {len(imgs_with_ships)} images that contain ships')

ratio = len(imgs_with_ships) / len(segmentations)
print(f'Share of images that contain ships: {ratio*100:.2f}%')

There are 42556 images that contain ships
Share of images that contain ships: 22.10%


Most images do not contain ships, so for now they have less value for
us. We are going to build our own dataset of 60000 images (using all
the images would be too much for us anyway). It will include all 42556
images that contain ships, and the remaining 17444 slots will be filled
with images that do not contain ships.
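
The selection rule can be sketched in isolation as follows. This sketch pads with a seeded random sample of the empty images, which is a swapped-in choice for illustration: the notebook cell below simply takes the first rows instead. The function name `select_image_ids` and the toy image ids are hypothetical.

```python
import random


def select_image_ids(ship_ids, empty_ids, dataset_size, seed=42):
    """Keep every ship image and pad with randomly sampled empty images."""
    n_fill = dataset_size - len(ship_ids)
    if n_fill < 0:
        raise ValueError('dataset_size is smaller than the number of ship images')
    if n_fill > len(empty_ids):
        raise ValueError('not enough empty images to reach dataset_size')

    rng = random.Random(seed)  # Fixed seed makes the selection repeatable
    return list(ship_ids) + rng.sample(list(empty_ids), n_fill)


# Toy example with made-up image ids
ship_ids = ['a.jpg', 'b.jpg', 'c.jpg']
empty_ids = ['d.jpg', 'e.jpg', 'f.jpg', 'g.jpg']
selected = select_image_ids(ship_ids, empty_ids, dataset_size=5)
print(len(selected))  # 5
```

With the real data, `ship_ids` and `empty_ids` would be the indexes of the two filtered DataFrames and `dataset_size` would be 60000.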

In [7]:
dataset_size = 60_000

image_ids = list(imgs_with_ships.index)
print(f'{len(image_ids)} images contain ships')

# Fill the rest of the dataset with images that do not contain ships
print(f'{dataset_size-len(image_ids)} images do not contain ships')
imgs_without_ships = segmentations[segmentations['EncodedPixels'] == '']
image_ids.extend(imgs_without_ships[:dataset_size-len(image_ids)].index)

42556 images contain ships
17444 images do not contain ships


In [8]:
# Extract only the images that are part of our dataset
for image_id in (t := tqdm(image_ids)):
    zip.extract(f'train_v2/{image_id}', path=raw_dir)
    t.set_description(f'Extracting: {image_id}')

Extracting: 1dbeec1ea.jpg: 100%|██████████| 60000/60000 [01:40<00:00, 598.23it/s]


In [9]:
!echo "Number of images in raw dataset: $(ls -1 data/raw/train_v2/ | wc -l)"

Number of images in raw dataset: 60000


In [10]:
from math import floor

train_size = floor(dataset_size * 0.9)
val_size = floor(dataset_size * 0.05)
test_size = floor(dataset_size * 0.05)
print(f'Full size of dataset: {dataset_size}')
print(f'\tTrain set size: {train_size}')
print(f'\tValidation set size: {val_size}')
print(f'\tTest set size: {test_size}')

Full size of dataset: 60000
	Train set size: 54000
	Validation set size: 3000
	Test set size: 3000


In [11]:
# Split the images from the raw dataset into train, val, and test sets.
# mv removes each selected file from data/raw/train_v2/, so the later picks cannot overlap with earlier ones.
!mkdir -p data/processed/

# Create training dataset
!mkdir -p data/processed/train/
!find data/raw/train_v2/ -name "*.jpg" \
    | sort -R \
    | head -n {train_size} \
    | tr '\n' '\0' \
    | xargs -0 mv -t data/processed/train/
!cp data/raw/train_ship_segmentations_v2.csv data/processed/train_ship_segmentations.csv

# Create validation dataset
!mkdir -p data/processed/val/
!find data/raw/train_v2/ -name "*.jpg" \
    | sort -R \
    | head -n {val_size} \
    | tr '\n' '\0' \
    | xargs -0 mv -t data/processed/val/
!cp data/raw/train_ship_segmentations_v2.csv data/processed/val_ship_segmentations.csv

# Create test dataset
!mkdir -p data/processed/test/
!find data/raw/train_v2/ -name "*.jpg" \
    | sort -R \
    | head -n {test_size} \
    | tr '\n' '\0' \
    | xargs -0 mv -t data/processed/test/
!cp data/raw/train_ship_segmentations_v2.csv data/processed/test_ship_segmentations.csv
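
The shell pipeline above uses `sort -R`, so each run produces a different split. If a repeatable split is needed, the same 90/5/5 partition can be computed in Python with a fixed seed. This is a minimal sketch, not the method used in this notebook. The function name `split_ids` and the generated file names are hypothetical.

```python
import random
from math import floor


def split_ids(image_ids, train_frac=0.9, val_frac=0.05, seed=42):
    """Shuffle ids with a fixed seed and cut them into train/val/test lists."""
    ids = sorted(image_ids)  # Sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)

    train_size = floor(len(ids) * train_frac)
    val_size = floor(len(ids) * val_frac)
    return {
        'train': ids[:train_size],
        'val': ids[train_size:train_size + val_size],
        'test': ids[train_size + val_size:],  # Test receives whatever remains
    }


# Toy example with 100 made-up file names
splits = split_ids([f'{i:05d}.jpg' for i in range(100)])
print({name: len(ids) for name, ids in splits.items()})
```

Each list could then be moved into its folder with `shutil.move`, which would replace the `find | sort -R | head | xargs mv` chain.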

In [12]:
# Filter missing annotations
# Use a separate loop variable so the zip path stored in dataset_path is not overwritten
for split in ['train', 'val', 'test']:
    filter_missing(proc_dir / f'{split}_ship_segmentations.csv',
                   proc_dir / f'{split}')

In [13]:
!echo "Number of images in dataset: $(find data/processed/*/ -name "*.jpg" | wc -l)"
!echo "Size of dataset on disk: $(du -sh data/processed)"

Number of images in dataset: 60000
Size of dataset on disk: 8.7G	data/processed


In [14]:
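# Optionally save the dataset to Google Drive (only works inside Colab)
try:
    from google.colab import drive
except ImportError:
    drive = None  # Not running in Colab, skip the upload

if drive is not None:
    drive.mount('/content/gdrive')
    !zip -r /content/gdrive/MyDrive/airbus-dataset.zip data/processed/*
    drive.flush_and_unmount()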