# Unofficial TACO downloader

As of Oct 01 2022, the original author's Python downloader throws errors
when downloading the unofficial TACO images. That is, the following call fails:

`python3 download.py --dataset_path ./data/annotations_unofficial.json`

This notebook contains a modified image downloader based on the original author's script.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%%bash
wget https://raw.githubusercontent.com/pedropro/TACO/master/data/annotations_unofficial.json
wget https://raw.githubusercontent.com/pedropro/TACO/master/download.py

--2022-10-01 06:20:40--  https://raw.githubusercontent.com/pedropro/TACO/master/data/annotations_unofficial.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3846384 (3.7M) [text/plain]
Saving to: ‘annotations_unofficial.json’


In [None]:
import json
import os.path
import re
from io import BytesIO

import requests
from PIL import Image
from tqdm import tqdm

In [None]:
%rm -rf UNOF_TACO
%mkdir UNOF_TACO

In [None]:
anno_path = './annotations_unofficial.json'
dataset_dir = './UNOF_TACO'

In [None]:
# Load the annotation file and pull out the list of image records
with open(anno_path, 'r') as f:
  annotations = json.load(f)

images = annotations['images']
nr_images = len(images)

In [None]:
# The original author stipulates that all images be saved as .jpg. Check it:
for image in images:
  extension = image['file_name'].rsplit('.', 1)[-1]
  if extension != 'jpg':
    print('NOT JPG:', image['file_name'])
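
The check above prints one line per offending file. A minimal sketch of a summary view, assuming each record has a `file_name` key as in the annotation file; `extension_counts` is a hypothetical helper name, and unlike the check above it lower-cases the extension, so `.JPG` and `.jpg` are counted together.

```python
import os
from collections import Counter


def extension_counts(images):
    """Count lower-cased file extensions over a list of image records."""
    return Counter(
        # splitext returns ('path/name', '.ext'); drop the dot and normalise case
        os.path.splitext(image['file_name'])[1].lower().lstrip('.')
        for image in images
    )
```

Calling `extension_counts(images)` on the loaded annotations would show at a glance whether anything other than `jpg` is present.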

In [None]:
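for image in tqdm(images):
  # Re-id, continuing from 1500, as 1499 is the end of the "official" TACO
  img_id = str(int(image['id']) + 1500) + '.jpg'
  url_original = image['flickr_url']

  file_path = os.path.join(dataset_dir, img_id)

  # Skip images that are already on disk, so the cell can be re-run after a failure
  if not os.path.isfile(file_path):
    # Load and save the image
    response = requests.get(url_original)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    img = Image.open(BytesIO(response.content))
    rgb_im = img.convert('RGB')

    # Keep the EXIF block when present. The private img._getexif() is not
    # available on every image format, so read it from img.info instead.
    exif = img.info.get('exif')
    if exif:
      rgb_im.save(file_path, exif=exif)
    else:
      rgb_im.save(file_path)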

100%|██████████| 3736/3736 [1:49:38<00:00,  1.76s/it]
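
Since the download takes close to two hours, it can be useful to confirm afterwards that every image actually landed on disk. A minimal sketch, assuming the same `id + 1500` naming used in the loop above; `missing_images` and its `id_offset` parameter are hypothetical names, not part of the original author's script.

```python
import os


def missing_images(images, dataset_dir, id_offset=1500):
    """Return the expected file names that are not present in dataset_dir."""
    missing = []
    for image in images:
        # Same re-id rule as the download loop: original id plus the offset
        file_name = str(int(image['id']) + id_offset) + '.jpg'
        if not os.path.isfile(os.path.join(dataset_dir, file_name)):
            missing.append(file_name)
    return missing
```

An empty list from `missing_images(images, dataset_dir)` would indicate that the download is complete; otherwise the download cell can simply be run again, because it skips files that already exist.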


In [None]:
%%capture
# %rm UNOF_TACO.zip
!zip -r UNOF_TACO.zip ./UNOF_TACO/*
%mv UNOF_TACO.zip ./drive/MyDrive
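
The archive step above relies on the `zip` command-line tool. Where that tool is not available, the standard library can produce a similar archive. This is a sketch under that assumption, with `zip_dataset` as a hypothetical helper name; it keeps the folder name as the top-level entry inside the archive, like the `zip -r` call above.

```python
import os
import shutil


def zip_dataset(dataset_dir, archive_base):
    """Create <archive_base>.zip containing dataset_dir and return the archive path."""
    parent, name = os.path.split(os.path.abspath(dataset_dir))
    # root_dir/base_dir make the archive paths start with the folder name itself
    return shutil.make_archive(archive_base, 'zip', root_dir=parent, base_dir=name)
```

For example, `zip_dataset('./UNOF_TACO', './drive/MyDrive/UNOF_TACO')` would write the archive straight to the mounted Drive folder, assuming that folder exists.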
