# Compare results from IRC image search and SBI 

## Background
The Image Registry Catalog (IRC) API is used to identify misuse of licensed images, so it should still recognize an image after it has been cropped, rotated, or flipped horizontally. This notebook tests whether results from Search by Image (SBI) are accurate enough for SBI to be considered as a replacement for IRC.

### Save test Getty images: original, cropped, rotated, flipped horizontally

In [245]:
import requests
from io import BytesIO
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import csv

image_filenames = []

def save_test_images():
    with open('data/assetids.csv', 'r') as csv_file:
        assetids = csv.reader(csv_file, delimiter=',')
        counter = 0
        for assetid in assetids:
            assetid = assetid[0]
            url = 'http://media.gettyimages.com/photos/-id{}?s=170667a'.format(assetid)
            response = requests.get(url)
            if (response.status_code != 200):
                print('Error downloading for assetid {}'.format(assetid))
                continue 
            original_filename = assetid + '-original.jpg'
            with open(original_filename, 'wb') as handler:
                handler.write(response.content)
                image_filenames.append(original_filename)
            # crop   
            image = Image.open('./' + assetid + '-original.jpg')
            width, height = image.size   # Get dimensions
            left = width/4
            top = height/4
            right = 3 * width/4
            bottom = 3 * height/4
            cropped = image.crop((left, top, right, bottom))
            cropped_filename = assetid + '-cropped.jpg'
            cropped.save('./' + cropped_filename)
            image_filenames.append(cropped_filename)

            # rotate
            rotated  = image.transpose(Image.ROTATE_90)
            rotated_filename = assetid + '-rotated.jpg'
            rotated.save('./' + rotated_filename)
            image_filenames.append(rotated_filename)

            # flip
            flipped = image.transpose(Image.FLIP_LEFT_RIGHT)
            flipped_filename = assetid + '-flipped.jpg'
            flipped.save('./' + flipped_filename)
            image_filenames.append(flipped_filename)
            
            counter = counter + 1
            if counter % 100 == 0:
                print('Completed processing {} images.'.format(counter))

    print('Finished saving images.')

save_test_images()

Completed processing 100 images.
Completed processing 200 images.
Completed processing 300 images.
Completed processing 400 images.
Completed processing 500 images.
Completed processing 600 images.
Completed processing 700 images.
Completed processing 800 images.
Completed processing 900 images.
Completed processing 1000 images.
Finished saving images.
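
As a rough illustration of what the cropped variant keeps, the sketch below recomputes the crop box used above for a hypothetical 612x408 image. The helper names and the image size are assumptions made for this example, and integer division is used here where the notebook passes float coordinates to `image.crop`.

```python
def center_crop_box(width, height):
    """Return the (left, top, right, bottom) box for the central crop."""
    return (width // 4, height // 4, 3 * width // 4, 3 * height // 4)


def kept_area_fraction(width, height):
    """Fraction of the original pixel area that survives the crop."""
    left, top, right, bottom = center_crop_box(width, height)
    return ((right - left) * (bottom - top)) / (width * height)


# Hypothetical image size, for illustration only.
print(center_crop_box(612, 408))     # (153, 102, 459, 306)
print(kept_area_fraction(612, 408))  # 0.25
```

The crop keeps the central half of each dimension, so a matcher sees only about a quarter of the original pixels for every cropped test image.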


In [246]:
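def display_test_images():  
    # Recover the asset IDs from the saved '<assetid>-original.jpg' filenames.
    suffix = '-original.jpg'
    getty_images = [name[:-len(suffix)] for name in image_filenames if name.endswith(suffix)]
    i = 0
    columns = 5

    for key in getty_images: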
        
        if (i%columns == 0):
            i = 0
            fig, axs = plt.subplots(nrows=4, ncols=columns)
            fig.set_size_inches(20, 20)

        axs[0, i].set_title("{}".format(key), fontsize=20)
        filename = key + '-original.jpg'
        image = plt.imread(filename)
        axs[0, i].imshow(image)
        axs[0, i].axis('off')

        axs[1, i].set_title("Cropped", fontsize=20)
        filename = key + '-cropped.jpg'
        image = plt.imread(filename)
        axs[1, i].imshow(image)
        axs[1, i].axis('off')

        axs[2, i].set_title("Rotated", fontsize=20)
        filename = key + '-rotated.jpg'
        image = plt.imread(filename)
        axs[2, i].imshow(image)
        axs[2, i].axis('off')

        axs[3, i].set_title("Flipped", fontsize=20)
        filename = key + '-flipped.jpg'
        image = plt.imread(filename)
        axs[3, i].imshow(image)
        axs[3, i].axis('off')

        i = i + 1

    plt.tight_layout()
    plt.show()
    
# display_test_images()

### Call IRC with original, cropped, rotated, and flipped Getty images

In [272]:
import time

irc_original_matches = 0
irc_cropped_matches = 0
irc_rotated_matches = 0
irc_flipped_matches = 0
irc_total_matches = 0

def find_matches(filenames, irc_total_matches, irc_original_matches, irc_cropped_matches, irc_rotated_matches, irc_flipped_matches):
    print('Finding IRC matches for {} images...'.format(len(filenames)))
    
    if (len(filenames) == 0):
        print('Filename list is empty.')
        return
    
    total_time = 0
    image_count = 0 
    
    for filename in filenames:
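        # Open the image in a context manager so the file handle is closed after the upload.
        with open(filename, 'rb') as image_file:
            start = time.time()
            r = requests.post('https://api.picscout.com/v1/Search?key=<>', files={'file': image_file})
            end = time.time()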
        time_taken = int((end - start) * 1000)
        total_time = total_time + time_taken
        
        resp = r.json()
        
        if len(resp['ids']) > 0:
            if 'original' in filename:
                irc_original_matches = irc_original_matches + 1
            elif 'cropped' in filename:
                irc_cropped_matches = irc_cropped_matches + 1
            elif 'rotated' in filename:
                irc_rotated_matches = irc_rotated_matches + 1
            elif 'flipped' in filename :
                irc_flipped_matches = irc_flipped_matches + 1
            else:
                print('File {} cannot be categorized.'.format(filename))
    
#             print('{0:<30} {1:<37} {2:>10}ms'.format(filename, resp['ids'][0], time_taken))
        image_count = image_count + 1
        if image_count % 500 == 0:
            print('Finished processing {} images in {} seconds.'.format(image_count, round(total_time/1000, 1)))
            
    print()
    irc_total_matches = irc_original_matches + irc_cropped_matches + irc_rotated_matches + irc_flipped_matches
    print('IRC found matches {}% of the time. Average time {}ms'.format(round(irc_total_matches/len(filenames)*100, 2), int(total_time/len(filenames))))
    print('Original matches: {}%'.format(round((irc_original_matches/(len(filenames)/4))*100, 2)))
    print('Cropped matches: {}%'.format(round((irc_cropped_matches/(len(filenames)/4))*100, 2)))
    print('Rotated matches: {}%'.format(round((irc_rotated_matches/(len(filenames)/4))*100, 2)))
    print('Flipped matches: {}%'.format(round((irc_flipped_matches/(len(filenames)/4))*100, 2)))
    
    
image_filenames.sort()
find_matches(image_filenames, irc_total_matches, irc_original_matches, irc_cropped_matches, irc_rotated_matches, irc_flipped_matches) 

Finding IRC matches for 4000 images...
Finished processing 500 images in 351.8 seconds.
Finished processing 1000 images in 714.0 seconds.
Finished processing 1500 images in 1080.0 seconds.
Finished processing 2000 images in 1440.0 seconds.
Finished processing 2500 images in 1776.7 seconds.
Finished processing 3000 images in 2115.0 seconds.
Finished processing 3500 images in 2458.0 seconds.
Finished processing 4000 images in 2819.4 seconds.

IRC found matches 69.58% of the time. Average time 704ms
Original matches: 97.6%
Cropped matches: 83.4%
Rotated matches: 0.0%
Flipped matches: 97.3%
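
The per-variant bookkeeping in `find_matches` can be sketched more compactly with `collections.Counter`. This is only a sketch under assumptions: `variant_of`, `match_rates`, and the sample asset IDs are hypothetical names and data introduced for illustration, and the set of matched filenames stands in for the IRC responses.

```python
from collections import Counter

VARIANTS = ('original', 'cropped', 'rotated', 'flipped')


def variant_of(filename):
    """Return the variant encoded in '<assetid>-<variant>.jpg', or None."""
    stem = filename.rsplit('.', 1)[0]
    suffix = stem.rsplit('-', 1)[-1]
    return suffix if suffix in VARIANTS else None


def match_rates(filenames, matched):
    """Percentage of files per variant that appear in the matched set."""
    totals = Counter(variant_of(name) for name in filenames)
    hits = Counter(variant_of(name) for name in filenames if name in matched)
    return {v: round(100 * hits[v] / totals[v], 2) for v in VARIANTS if totals[v]}


# Hypothetical asset IDs and match outcomes, for illustration only.
names = ['{}-{}.jpg'.format(a, v) for a in ('111', '222') for v in VARIANTS]
matched = {'111-original.jpg', '222-original.jpg', '111-cropped.jpg',
           '111-flipped.jpg', '222-flipped.jpg'}
print(match_rates(names, matched))
# {'original': 100.0, 'cropped': 50.0, 'rotated': 0.0, 'flipped': 100.0}
```

Dividing by the actual number of files per variant, rather than by `len(filenames)/4`, keeps the percentages correct even if some downloads failed and the variants are not perfectly balanced.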


### Upload Getty images to S3 in order to extract fingerprints

In [253]:
import boto3
import urllib.parse

def upload_to_s3(bucket, files):
    
    # create bucket in sandbox env and authenticate locally using oktad
    session = boto3.Session(profile_name='oktad')
    client = session.client('s3')
    
    count = 0
    for file in files:
        client.upload_file(file, bucket, file, ExtraArgs={'ACL': "public-read"})
        count = count + 1
        if count % 500 == 0:
            print('Completed uploading {} images.'.format(count))
        
    print('Done uploading.')
    
    
upload_to_s3('visint-image-match-test', image_filenames)  

Completed uploading 500 images.
Completed uploading 1000 images.
Completed uploading 1500 images.
Completed uploading 2000 images.
Completed uploading 2500 images.
Completed uploading 3000 images.
Completed uploading 3500 images.
Completed uploading 4000 images.
Done uploading.


In [261]:
filename_to_fingerprint = {}

def extract_fingerprints(files):
    
    print('Finding fingerprints for {} images...'.format(len(files)))
    
    fingerprint_count = 0
    total_time = 0
    
    for file in files:
        url = 'https://s3-us-west-2.amazonaws.com/visint-image-match-test/{}'.format(file)
        encoded_url = urllib.parse.quote(url, safe='')
        
        start = time.time()
        r = requests.get('http://usw2-prod-search-fingerprint-extractor.prod-getty.cloud/fingerprint/{}'.format(encoded_url))  
        end = time.time()
        total_time = total_time + (end - start)
        
        if (r.status_code != 200):
            print('Error retrieving fingerprint for {}'.format(file))
            continue
        resp = r.json()
        
        if len(resp['results']) == 0:
            print('No fingerprint extracted for {}'.format(file))
        else:
            filename_to_fingerprint[file] = resp['results'][0]['fp']['data']  
        
        fingerprint_count = fingerprint_count + 1
        if fingerprint_count % 500 == 0:
            print('Found {} fingerprints in {} seconds.'.format(fingerprint_count, round(total_time, 1)))
            
    print('Finished finding {} fingerprints'.format(len(filename_to_fingerprint)))    
    
    
extract_fingerprints(image_filenames)

Finding fingerprints for 4000 images...
Found 500 fingerprints in 370.5 seconds.
Found 1000 fingerprints in 708.1 seconds.
Found 1500 fingerprints in 1047.2 seconds.
Found 2000 fingerprints in 1393.3 seconds.
Found 2500 fingerprints in 1749.2 seconds.
Found 3000 fingerprints in 2097.9 seconds.
Found 3500 fingerprints in 2469.7 seconds.
Found 4000 fingerprints in 2803.4 seconds.
Finished finding 4000 fingerprints
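
The extractor call above places the whole S3 URL inside a single path segment, which is why it is quoted with `safe=''`. A minimal sketch of that encoding step follows; the extractor host and the asset ID are placeholders invented for the example, not the real service or data.

```python
import urllib.parse

# Placeholder extractor host and asset ID, for illustration only.
EXTRACTOR = 'http://fingerprint-extractor.example/fingerprint/{}'
image_url = 'https://s3-us-west-2.amazonaws.com/visint-image-match-test/12345-original.jpg'

# safe='' also escapes '/' and ':', so the URL stays one path segment.
encoded = urllib.parse.quote(image_url, safe='')
print(encoded)
# https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fvisint-image-match-test%2F12345-original.jpg

request_url = EXTRACTOR.format(encoded)
print(urllib.parse.unquote(encoded) == image_url)  # True
```

With the default `safe='/'`, the slashes in the image URL would be left unescaped and would read as extra path components of the request URL.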


### Find SBI matches for Getty images

In [271]:
sbi_original_matches = 0
sbi_cropped_matches = 0
sbi_rotated_matches = 0
sbi_flipped_matches = 0
sbi_total_matches = 0

def get_sbi_matches(files, sbi_total_matches, sbi_original_matches, sbi_cropped_matches, sbi_rotated_matches, sbi_flipped_matches):   
    print('Finding SBI matches for {} images...'.format(len(files)))
    
    if (len(files) == 0):
        print('Filename list is empty.')
        return

    total_time = 0
    
    image_count = 0
    
    for filename, fingerprint in filename_to_fingerprint.items():
        fingerprint_as_strings = ','.join([str(num) for num in fingerprint])
        
        start = time.time()
        sbi_url = 'http://usw2-prod-search-fingerprint-search.prod-getty.cloud/search?k=50&c=1&v={}'.format(fingerprint_as_strings)
        r = requests.get(sbi_url)
        end = time.time()
        time_taken = int((end - start) * 1000)
        total_time = total_time + time_taken
        resp = r.json()  
        
        if len(resp.items()) > 0:
            assetid = next(iter(resp))
#             match = ''
            if (assetid in filename):
#                 match = 'MATCH'
                if 'original' in filename:
                    sbi_original_matches = sbi_original_matches + 1
                elif 'cropped' in filename:
                    sbi_cropped_matches = sbi_cropped_matches + 1
                elif 'rotated' in filename:
                    sbi_rotated_matches = sbi_rotated_matches + 1
                elif 'flipped' in filename :
                    sbi_flipped_matches = sbi_flipped_matches + 1
                else:
                    print('File {} cannot be categorized.'.format(filename))
                
#             print('{0:<30} {1:<15} {2:<20} {3:<6} {4:>8}ms'.format(filename, assetid, resp[assetid], match, time_taken))
        image_count = image_count + 1
        if image_count % 500 == 0:
            print('Finished processing {} images in {} seconds.'.format(image_count, round(total_time/1000, 1)))
    
    print()
    sbi_total_matches = sbi_original_matches + sbi_cropped_matches + sbi_rotated_matches + sbi_flipped_matches
    print('SBI found matches {}% of the time. Average time {}ms'.format(round(sbi_total_matches/len(files) * 100, 2), int(total_time/len(files))))
    print('Original matches: {}%'.format(round((sbi_original_matches/(len(files)/4))*100, 2)))
    print('Cropped matches: {}%'.format(round((sbi_cropped_matches/(len(files)/4))*100, 2)))
    print('Rotated matches: {}%'.format(round((sbi_rotated_matches/(len(files)/4))*100, 2)))
    print('Flipped matches: {}%'.format(round((sbi_flipped_matches/(len(files)/4))*100, 2)))
            
                  
get_sbi_matches(image_filenames, sbi_total_matches, sbi_original_matches, sbi_cropped_matches, sbi_rotated_matches, sbi_flipped_matches)

Finding SBI matches for 4000 images...
Finished processing 500 images in 29.3 seconds.
Finished processing 1000 images in 61.7 seconds.
Finished processing 1500 images in 93.1 seconds.
Finished processing 2000 images in 123.1 seconds.
Finished processing 2500 images in 151.5 seconds.
Finished processing 3000 images in 178.4 seconds.
Finished processing 3500 images in 222.1 seconds.
Finished processing 4000 images in 260.7 seconds.

SBI found matches 40.58% of the time. Average time 65ms
Original matches: 99.4%
Cropped matches: 1.1%
Rotated matches: 2.1%
Flipped matches: 59.7%


### Testing non-Getty images for false positives

In [326]:
def call_irc(file):
    print('Calling IRC...')
    with open(file, 'r') as csv_file:
        image_urls = csv.reader(csv_file, delimiter=',')
        image_count = 0
        matches = 0
        for url in image_urls:
            image_count = image_count + 1
            image_url = url[0]
            # Pass the image URL to IRC in the url query parameter.
            r = requests.get('https://api.picscout.com/v1/Search?key=<test-key>&url={}'.format(image_url))

            if (r.status_code != 200):
                print('Error calling IRC for {}'.format(image_url))
                continue

            resp = r.json()
            if len(resp['ids']) > 0:
                print('Match found for {}: {}'.format(image_url, resp['ids'][0]))
                matches = matches + 1
            
    print('IRC false positives rate: {} for {} images'.format(round(matches/image_count, 1), image_count))

    
def upload_to_s3(bucket, file):
    print('Uploading images to S3...')
    # create bucket in sandbox env and authenticate locally using oktad
    session = boto3.Session(profile_name='oktad')
    client = session.client('s3')
    with open(file, 'r') as csv_file:
        image_urls = csv.reader(csv_file, delimiter=',')
        count = 0
        for url in image_urls:
            response = requests.get(url[0])
            if (response.status_code != 200):
                print('Error downloading for url {}'.format(url[0]))
                continue 
                
            image_filename = 'nongetty{}.jpg'.format(count + 1)
            count = count + 1
            with open(image_filename, 'wb') as handler:
                handler.write(response.content)
            client.upload_file(image_filename, bucket, image_filename, ExtraArgs={'ACL': "public-read"})

    print('Done uploading {} non-Getty images.'.format(count))
    
    
def call_sbi(file):
    with open(file, 'r') as csv_file:
        image_urls = csv.reader(csv_file, delimiter=',')
        image_count = sum(1 for image in image_urls)
    print('Calling SBI for {} images...'.format(image_count))
    
    matches = 0 
    min_distance = 100
    for i in range(image_count):
        url = 'https://s3-us-west-2.amazonaws.com/visint-image-match-test/nongetty{}.jpg'.format(i + 1)
        encoded_url = urllib.parse.quote(url, safe='')
        r = requests.get('http://usw2-prod-search-fingerprint-extractor.prod-getty.cloud/fingerprint/{}'.format(encoded_url))  
        
        if (r.status_code != 200):
            print('Error retrieving fingerprint for {}'.format(encoded_url))
            continue
            
        resp = r.json()
        fingerprint = resp['results'][0]['fp']['data']
        
        fingerprint_as_strings = ','.join([str(num) for num in fingerprint])
        sbi_url = 'http://usw2-prod-search-fingerprint-search.prod-getty.cloud/search?k=50&c=1&v={}'.format(fingerprint_as_strings)
        r = requests.get(sbi_url)
        resp = r.json()  
        
        if len(resp.items()) > 0:
            distance = next(iter(resp.values()))
            min_distance = min([distance, min_distance])
            if next(iter(resp.values())) <= .35:
                print('Match found for nongetty{}.jpg: {}'.format(i + 1, next(iter(resp))))
                matches = matches + 1
            
    print('SBI false positives rate: {}'.format(round(matches/image_count, 1)))
    print('The closest match was {} away.'.format(min_distance))
    
    
# call_irc('data/nongettyimages.csv')
# upload_to_s3('visint-image-match-test', 'data/nongettyimages.csv')  
call_sbi('data/nongettyimages.csv')

Calling SBI for 100 images...
SBI false positives rate: 0.0
The closest match was 0.4218774735927582 away.
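
One caveat when reading the rate printed above: it is rounded to one decimal place, so on 100 images a handful of matches would still print as 0.0. Here no "Match found" lines were printed, so the count really is zero. The sketch below shows a report format that keeps small rates visible; `false_positive_report` is a hypothetical helper, not part of the notebook.

```python
def false_positive_report(matches, image_count):
    """Format a false-positive summary that keeps small rates visible."""
    if image_count == 0:
        return 'No images tested.'
    rate = 100 * matches / image_count
    return '{} of {} images matched ({:.1f}%)'.format(matches, image_count, rate)


print(round(4 / 100, 1))              # 0.0, which hides four matches
print(false_positive_report(4, 100))  # 4 of 100 images matched (4.0%)
print(false_positive_report(0, 100))  # 0 of 100 images matched (0.0%)
```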


### Results

Using 1000 randomly chosen images from the Getty library and testing the two image-matching systems with the original images as well as cropped, rotated, and horizontally flipped versions, we see that IRC overall returns matches 70% of the time and SBI returns matches 41% of the time. 

### IRC

IRC performs almost equally well on original and horizontally flipped images, returning matches about 97% of the time for each. It works reasonably well on cropped images (83%) and does not work at all (0%) on images rotated by 90 degrees:

| Image type  | Rate of matches |
| ----------- | --------------- |
| Original    | 97.6%           |
| Cropped     | 83.4%           |
| Rotated     | 0.0%            |
| Flipped     | 97.3%           |
| **Overall** | **69.6%**       |

There was at least one false positive in the IRC test: image 53289274-cropped.jpg returned an asset attributed to a licensor called Media Bakery, but the corresponding image was not found on that licensor's site.


### SBI

SBI performs very well at identifying original images, returning the correct result 99% of the time. However, it performs poorly on every manipulation, returning matches 1% of the time for cropped images, 2% for rotated images, and 60% for flipped images:


| Image type  | Rate of matches |
| ----------- | --------------- |
| Original    | 99.4%           |
| Cropped     | 1.1%            |
| Rotated     | 2.1%            |
| Flipped     | 59.7%           |
| **Overall** | **40.6%**       |

## Analysis

If we assume that detecting rotations is not necessary (as publishers are unlikely to be able to use an image after a 90 degree rotation), then IRC's overall accuracy across the remaining three image types is closer to 93% and SBI's is about 53%.
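
The rotation-excluded figures can be checked by averaging the per-variant rates from the two result tables. Because each variant has the same number of test images, the plain mean of the rates equals the pooled match rate. The `mean_rate` helper is a name introduced for this sketch.

```python
# Per-variant match rates (%) copied from the two result tables.
rates = {
    'IRC': {'original': 97.6, 'cropped': 83.4, 'rotated': 0.0, 'flipped': 97.3},
    'SBI': {'original': 99.4, 'cropped': 1.1, 'rotated': 2.1, 'flipped': 59.7},
}


def mean_rate(system, exclude=()):
    """Mean match rate for a system, optionally leaving out some variants."""
    kept = [r for v, r in rates[system].items() if v not in exclude]
    return round(sum(kept) / len(kept), 1)


for system in rates:
    print(system, mean_rate(system), mean_rate(system, exclude=('rotated',)))
# IRC 69.6 92.8
# SBI 40.6 53.4
```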

In terms of speed, IRC is slightly faster than SBI. Getting a match from SBI takes around 765ms per image (about 700ms to extract the fingerprint and 65ms to find a match) versus an average of 704ms per call for IRC.
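
The per-image timings can be re-derived from the averages and totals printed by the runs earlier in this notebook. The variable names below are chosen for this sketch only.

```python
# Figures taken from the run output in this notebook.
fingerprint_total_s = 2803.4   # total time to extract 4000 fingerprints
image_count = 4000
sbi_search_ms = 65             # average SBI search time per image
irc_ms = 704                   # average IRC call time per image

fingerprint_ms = round(fingerprint_total_s / image_count * 1000)
sbi_end_to_end_ms = fingerprint_ms + sbi_search_ms
print(fingerprint_ms, sbi_end_to_end_ms, sbi_end_to_end_ms - irc_ms)
# 701 766 62
```

Almost all of the SBI latency comes from fingerprint extraction rather than from the search itself.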

SBI performs slightly better than IRC in detecting original images, but this may be because we used watermarked images for testing: the SBI index is built from watermarked images and the IRC index is not. Nonetheless, for SBI to be considered a candidate to replace IRC, it needs added functionality to identify horizontally flipped and cropped images.

When tested with a set of 100 non-Getty stock images, neither system returned false positive results. For SBI, a returned image counts as an exact match when its distance is <= .35; the closest result SBI returned for a non-Getty image had a distance of .42.
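
The distance rule can be written as a small classifier over an SBI response. This is a sketch under assumptions: it treats the response as a mapping of asset ID to distance with the nearest result first, which is how the notebook code reads it, and the function name, labels, and sample responses are hypothetical.

```python
EXACT_MATCH_MAX_DISTANCE = 0.35


def classify_top_hit(resp, expected_asset_id=None,
                     max_distance=EXACT_MATCH_MAX_DISTANCE):
    """Classify the nearest SBI result.

    Assumes resp maps asset ID -> distance with the nearest result first.
    """
    if not resp:
        return 'no_result'
    asset_id, distance = next(iter(resp.items()))
    if distance > max_distance:
        return 'too_far'
    if expected_asset_id is not None and asset_id != expected_asset_id:
        return 'wrong_asset'
    return 'exact_match'


# Hypothetical responses, for illustration only.
print(classify_top_hit({'999': 0.4218774735927582}))       # too_far
print(classify_top_hit({'123': 0.02, '456': 0.6}, '123'))  # exact_match
print(classify_top_hit({}))                                # no_result
```

Note that the Getty test above counts a match whenever the top result has the expected asset ID, without applying the distance threshold, so the two tests use slightly different match criteria.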