# Ingest xView2 Dataset

This notebook ingests the [xView2](https://xview2.org) dataset. Because the dataset is quite large, it is stored in AWS S3 under `s3://alivio`.

* Runtime: ~7 minutes
* Compute: 16-32 GB memory

In [0]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import requests
import os
import json
import boto3
from loguru import logger
from typing import Literal
from smart_open import open
from concurrent.futures import ThreadPoolExecutor

from src.utils.config import LOCAL_DATA_DIR, S3_DATA_DIR, S3_BUCKET
from xview2_dataset_links import challenge_links as xview2_links

In [0]:
DESTINATION: Literal["LOCAL", "S3"] = "S3"

DATASET_PREFIX: str = "/xview2"

BASE_DIR: str = (
    LOCAL_DATA_DIR + DATASET_PREFIX
    if DESTINATION == "LOCAL"
    else S3_DATA_DIR + DATASET_PREFIX
    if DESTINATION == "S3"
    else None
)

print(f"{BASE_DIR=}")

assert BASE_DIR, "Download destination not set"

BASE_DIR='s3://alivio/datasets/xview2'


In [0]:
s3 = boto3.client("s3")


### Ephemeral Links

The xView2 website provides ephemeral download links that expire, so they must be refreshed before each new download. Substitute the links in the `src.01_data_ingestion.xview2_dataset_links.py` file when re-downloading. Since each dataset link points to about 10 GB of data, we stream each download programmatically instead of loading it into memory.

In [0]:
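# Key prefix inside the bucket: drops the "s3://<bucket>/" part (assumes BASE_DIR is an S3 URI)
prefix_path: str = "/".join(BASE_DIR.split("/")[3:])


def stream_to_s3(url: str, object_name: str) -> None:
    """Stream the response body at `url` into S3 under the key `object_name`."""
    print(f"Downloading {object_name}")
    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # fail fast on an expired link (403)
        s3.upload_fileobj(r.raw, S3_BUCKET, object_name)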
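
To illustrate why streaming keeps memory usage flat, here is a minimal sketch, independent of the notebook's code, that copies a file-like source to a destination in fixed-size chunks. `copy_in_chunks` and its `chunk_size` default are hypothetical names chosen for illustration. boto3's `upload_fileobj` reads from the file-like object itself, and this sketch is not its implementation.

```python
import io
from typing import BinaryIO


def copy_in_chunks(src: BinaryIO, dst: BinaryIO, chunk_size: int = 8 * 1024 * 1024) -> int:
    """Copy src to dst one chunk at a time and return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Usage with in-memory stand-ins for the HTTP response body and the destination
source = io.BytesIO(b"x" * 1000)
sink = io.BytesIO()
copied = copy_in_chunks(source, sink, chunk_size=256)
```

At most `chunk_size` bytes are held in memory at a time, regardless of how large the archive is.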


## Concurrent Downloads

Each ephemeral download link from the xView2 website lives for only about 5-10 minutes. A single link takes about 3-5 minutes to download on a datacenter network, so downloading the links serially means that some of them expire before their turn comes. An expired link results in a `403` error.

As such, multithreading is used to fetch data from all links simultaneously. This ensures that a copy of the data is available in our own storage within the time window of the ephemeral link timeout.

In [0]:
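with ThreadPoolExecutor(max_workers=6) as executor:
    # Consume the iterator so exceptions raised in worker threads (e.g. a 403) are re-raised here
    list(
        executor.map(
            lambda x: stream_to_s3(*x),
            [(v, prefix_path + "/" + k) for k, v in xview2_links.items()],
        )
    )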

Downloading datasets/xview2/xview2_train
Downloading datasets/xview2/xview2_test
Downloading datasets/xview2/xview2_holdout
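
One way to see which links failed, for example with a `403` after expiry, is to submit each download as its own future and collect errors per dataset. The sketch below assumes a `download(url, name)` callable with the same shape as `stream_to_s3`. `run_all` and `fake_download` are hypothetical names, and the stand-in download avoids any network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple


def run_all(
    links: Dict[str, str],
    download: Callable[[str, str], None],
    max_workers: int = 6,
) -> Tuple[List[str], Dict[str, Exception]]:
    """Run download(url, name) for every link concurrently; return (succeeded, failed)."""
    succeeded: List[str] = []
    failed: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its dataset name so failures can be reported by name
        futures = {executor.submit(download, url, name): name for name, url in links.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()  # re-raises any exception from the worker thread
                succeeded.append(name)
            except Exception as exc:
                failed[name] = exc
    return succeeded, failed


# Stand-in download function: one link "expires" to show how the failure is captured
def fake_download(url: str, name: str) -> None:
    if "expired" in url:
        raise RuntimeError(f"403 for {name}")


ok, bad = run_all(
    {"train": "https://example.com/a", "test": "https://example.com/expired"},
    fake_download,
)
```

Datasets that end up in the failed mapping can then be retried after refreshing their links.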
