`2018-Jan-15 10:40`
`Wayne Nixalo`

Trying to find out why a loop in `cropper.py` keeps skipping filenames `000001` and `000003` when just running through, but prints them later.

In [1]:
import numpy as np
import pandas as pd
import os

In [2]:
tpath = 'data/train/'
tempath = 'data/tmp/'
# rejectpath = tempath + 'reject/'
folders = os.listdir(tpath)
folders.sort()  # subfolders are numerically ordered
if '.DS_Store' in folders:
    folders.remove('.DS_Store')

# create destination folder if needed
if not os.path.exists(tempath):
    os.mkdir(tempath)
    clean_start = True
    last_fname = -1
else:
    # find starting point if quit before
    #NOTE: requires deletion of CSVs if tmp/ data deleted! otherwise will skip
    clean_start = False
    interstage_csvs = [csv_fname for csv_fname in os.listdir('data/') if 'interstage_labels-' in csv_fname]
    interstage_csvs.sort()
    last_csv = pd.read_csv('data/' + max(interstage_csvs))
    # find last recorded filename
    last_fpath = last_csv['id'].iloc[-1]
    last_folder, last_fname = last_fpath.split('/')
    # remove all folders before last
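    # (rebuild the list instead of popping in place: popping from `folders`
    # while iterating over it skips the entry after each removed one)
    folders = [folder for folder in folders if folder >= last_folder]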

In [10]:
print(f'clean_start: {clean_start}, \n'
      f'interstage_csvs: {interstage_csvs}, \n'
      f'last_csv: {last_csv}, \n'
      f'last_fpath: {last_fpath}, \n'
      f'last_folder: {last_folder}, \n'
      f'last_fname: {last_fname}, \n'
      f'folders: {folders}')

clean_start: False, 
interstage_csvs: ['interstage_labels-000000-000002.csv', 'interstage_labels-000003-000003.csv'], 
last_csv:                          id   x1   y1   x2   y2
0  000000-000412/000001.jpg  200  272  244  377
1  000000-000412/000003.jpg   12  122  147  388, 
last_fpath: 000000-000412/000003.jpg, 
last_folder: 000000-000412, 
last_fname: 000003.jpg, 
folders: ['000000-000412', '000413-000569', '000570-001189', '001190-001434', '001435-001882', '001883-002438', '002439-003316', '003317-003603', '003604-003904', '003905-004151', '004152-004401', '004402-004684', '004685-005105', '005106-005451', '005452-005591', '005592-006111', '006112-006241', '006242-006439', '006440-006548', '006549-006672', '006673-006860', '006861-007364', '007365-007636']


So, looking at this, the last filename is `000003.jpg`, so all files 0-3 should be removed from the filenames list.

In [11]:
# folder of interest is the first
folder = folders[0]

In [12]:
# build filenames list by looking at the folder's directory
fnames = os.listdir(tpath + folder)
fnames.sort()

In [18]:
fnames[:5] # we can see the first 5 filenames, including 0-3

['000000.jpg', '000001.jpg', '000002.jpg', '000003.jpg', '000004.jpg']

In [20]:
# remove all filenames before last in the 1st folder if not a fresh start
if not clean_start and folder == last_folder:
    for idx, fname in enumerate(fnames):
#         print(fname, last_fname)
        if fname <= last_fname:
            print(fname, last_fname, idx)
#             fnames.pop(idx)

000000.jpg 000003.jpg 0
000001.jpg 000003.jpg 1
000002.jpg 000003.jpg 2
000003.jpg 000003.jpg 3


Okay, that's how it's supposed to work. But that's not what I've seen.

In [21]:
if not clean_start and folder == last_folder:
    for idx, fname in enumerate(fnames):
        if fname <= last_fname:
            print(fname, last_fname, idx)
            fnames.pop(idx)
print(fnames[:5])

000000.jpg 000003.jpg 0
000002.jpg 000003.jpg 1
['000001.jpg', '000003.jpg', '000004.jpg', '000005.jpg', '000006.jpg']


That's the issue. So are `000001.jpg` and `000003.jpg` not seen as less than `000003.jpg`?

In [22]:
'000001.jpg' <= '000003.jpg', '000003.jpg' <= '000003.jpg'

(True, True)

They are. So then the condition should trigger and they should be removed from the list... Why not?

In [24]:
for idx, fname in enumerate(fnames[:5]):
    if fname <= last_fname:
        print(f'{fname} ≤ {last_fname} | {idx}')
    else:
        print(f'{fname} > {last_fname} | {idx}')

000001.jpg ≤ 000003.jpg | 0
000003.jpg ≤ 000003.jpg | 1
000004.jpg > 000003.jpg | 2
000005.jpg > 000003.jpg | 3
000006.jpg > 000003.jpg | 4


So it works now. But it didn't work just before. And I did the same thing. I am missing something.

I'm going to restart and step through this carefully.

In [28]:
fnames = os.listdir(tpath+folder)
fnames.sort()
fnames[:5]

['000000.jpg', '000001.jpg', '000002.jpg', '000003.jpg', '000004.jpg']

In [34]:
# copy of fnames
fnames_copy = fnames.copy()

if not clean_start and folder == last_folder:
    for idx, fname in enumerate(fnames[:20]):
        if fname <= last_fname:
            popped = fnames_copy.pop(idx)
            print(f'{fname} ≤ {last_fname} | {idx} -- removing {popped}')
        else:
            print(f'{fname} > {last_fname} | {idx}')

000000.jpg ≤ 000003.jpg | 0 -- removing 000000.jpg
000001.jpg ≤ 000003.jpg | 1 -- removing 000002.jpg
000002.jpg ≤ 000003.jpg | 2 -- removing 000004.jpg
000003.jpg ≤ 000003.jpg | 3 -- removing 000006.jpg
000004.jpg > 000003.jpg | 4
000005.jpg > 000003.jpg | 5
000006.jpg > 000003.jpg | 6
000007.jpg > 000003.jpg | 7
000008.jpg > 000003.jpg | 8
000009.jpg > 000003.jpg | 9
000010.jpg > 000003.jpg | 10
000011.jpg > 000003.jpg | 11
000012.jpg > 000003.jpg | 12
000013.jpg > 000003.jpg | 13
000014.jpg > 000003.jpg | 14
000015.jpg > 000003.jpg | 15
000016.jpg > 000003.jpg | 16
000017.jpg > 000003.jpg | 17
000018.jpg > 000003.jpg | 18
000019.jpg > 000003.jpg | 19


In [32]:
fnames_copy = fnames.copy()
fnames_copy[:5]

['000000.jpg', '000001.jpg', '000002.jpg', '000003.jpg', '000004.jpg']

In [33]:
fnames_copy.pop(3)

'000003.jpg'

I am a colossal fool. How exactly did I expect `.pop(index)` to work when the list shrinks each time it's called? The removal-by-index method won't work when your indices are changing. I have to use `.remove(element)` instead.
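
To convince myself, here is a minimal sketch of the same index drift on a made-up 8-name list (the names `names`, `work`, and `removed` are just for illustration, they are not in `cropper.py`). The indices come from the untouched list, but they are applied to a list that shrinks on every `pop`:

```python
# hypothetical stand-in for the sorted directory listing
names = [f'{i:06d}.jpg' for i in range(8)]
last = '000003.jpg'

work = names.copy()   # the list we pop from
removed = []          # what .pop(idx) actually returned

for idx, name in enumerate(names):   # indices refer to the ORIGINAL positions
    if name <= last:
        # after k pops, position idx in `work` holds the original item idx + k
        removed.append(work.pop(idx))

print(removed)  # ['000000.jpg', '000002.jpg', '000004.jpg', '000006.jpg']
print(work)     # ['000001.jpg', '000003.jpg', '000005.jpg', '000007.jpg']
```

Same pattern as the `In [34]` output: every second file gets removed, not the first four.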

**NOTE**. Ohh. Calling `.remove` inside the same loop won't work either, because it also shrinks the list while the loop is walking it. `enumerate(.)` doesn't hold its own copy of `fnames`; it wraps the list's iterator, which just remembers a position and asks the list for the item at that position on each step. Holding the data is `fnames`'s job. So when an item is removed, everything after it shifts one slot to the left, the iterator moves on to the next position anyway, and the item that slid into the vacated slot is never visited. That is exactly why `000001.jpg` and `000003.jpg` survived in `In [21]`.
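
A small sketch of that, again on a made-up list (`names`, `visited`, and `names_fixed` are illustration-only names). The first loop removes from the list it is iterating over; the second iterates over a copy and removes from the original:

```python
last = '000003.jpg'

# 1) remove while iterating over the same list: items get skipped
names = [f'{i:06d}.jpg' for i in range(6)]
visited = []
for name in names:
    visited.append(name)
    if name <= last:
        names.remove(name)   # shifts later items left under the iterator

print(visited)  # ['000000.jpg', '000002.jpg', '000004.jpg', '000005.jpg']
print(names)    # ['000001.jpg', '000003.jpg', '000004.jpg', '000005.jpg']

# 2) iterate over a copy, remove from the original: nothing is skipped
names_fixed = [f'{i:06d}.jpg' for i in range(6)]
for name in names_fixed.copy():
    if name <= last:
        names_fixed.remove(name)

print(names_fixed)  # ['000004.jpg', '000005.jpg']
```

The first result matches what `In [21]` printed: `000001.jpg` and `000003.jpg` are never even looked at.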

This cost me about 30-40 minutes this morning, and at least 3, maybe 4+ hours last night. It could *all* be fixed by first working out what to remove and only then removing it. Careful with indices, though: a list of indices works for masking a NumPy ndarray or for reassignment by list comprehension, but popping those indices one by one in a simple for-loop breaks the same way, because each pop shifts the later indices (unless you pop from the highest index down). Here I can just loop over a copy of `fnames` and call `remove` on the original, an `O(N*r)` operation that removes `r` items from an `N`-long list.
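
For the record, two alternatives that avoid in-place removal entirely. This is only a sketch, not what `cropper.py` does; `drop_processed` and `drop_processed_sorted` are hypothetical helper names, and the second one assumes the list is already sorted (which `fnames` is, after `fnames.sort()`):

```python
from bisect import bisect_right

def drop_processed(names, last_name):
    """Keep only names that sort strictly after last_name (one O(N) pass)."""
    return [name for name in names if name > last_name]

def drop_processed_sorted(sorted_names, last_name):
    """Same result for an already-sorted list: binary-search the cut point, then slice."""
    return sorted_names[bisect_right(sorted_names, last_name):]

# made-up stand-in for the sorted directory listing
names = [f'{i:06d}.jpg' for i in range(8)]
remaining = drop_processed(names, '000003.jpg')

print(remaining)  # ['000004.jpg', '000005.jpg', '000006.jpg', '000007.jpg']
print(drop_processed_sorted(names, '000003.jpg') == remaining)  # True
```

Both build a new list, so the original is never modified mid-iteration; the result would be assigned back, e.g. `fnames = drop_processed(fnames, last_fname)`.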