## Feature Extraction from Images

This notebook will:

    1. Create a robust function that accepts an image URL and returns a 1000-element float vector of features.
    2. Read the renthop CSV into a dataframe, and create a new series containing the URL of the first image for each of the roughly 50K rows.
    3. Iterate through the series and build a new numpy array with one row per picture and 1000 columns, one for each feature value.
    4. Convert the numpy array to a dataframe, and then append it on the right to the renthop dataframe.
    5. Output this new dataframe to a CSV file.
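
Step 4 amounts to a column-wise concatenation. Here is a minimal sketch with toy data, assuming the feature rows are in the same order as the listing rows. The frame names and column names are illustrative and are not taken from the renthop data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 3 listings and a 3 x 4 feature array (the real one would be N x 1000).
listings = pd.DataFrame({"listing_id": [101, 102, 103], "price": [2500, 3100, 1900]})
features = np.arange(12, dtype=float).reshape(3, 4)

# Name the feature columns so they cannot collide with listing columns.
img_features = pd.DataFrame(
    features, columns=[f"img_{i}" for i in range(features.shape[1])]
)

# axis=1 places the feature columns on the right; both frames share the default 0..N-1 index.
combined = pd.concat([listings, img_features], axis=1)
print(combined.shape)
```

Because `pd.concat` aligns on the index, both frames need matching indices for the rows to line up.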

Let's import all the necessary libraries.

In [152]:
import requests
from requests.exceptions import ConnectionError
from keras.applications.resnet50 import ResNet50
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
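# Both modules export preprocess_input and decode_predictions, so importing them from each
# module in turn would leave only the last import in effect. We keep the ResNet50 versions,
# which are the ones used throughout this notebook.
from keras.applications.resnet50 import preprocess_input, decode_predictions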
import numpy as np
import time
import pandas as pd

Now we create two neural network models, VGG16 and ResNet50, each loaded with weights (coefficient values) pretrained on ImageNet.

In [18]:
vgg16_model = VGG16(weights='imagenet', include_top=True)
resnet50_model = ResNet50(weights='imagenet')

Now we create an array of length 1000 with all zeros. This will be returned as the feature vector if there is an error in processing, such as a listing with no pictures or a bad URL.

In [74]:
error_array = np.zeros(1000)
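
Because failures are encoded as an all-zero vector, they can be recognized later without a separate status flag. Here is a small sketch of that idea. The helper name `is_error_vector` is hypothetical and not part of the notebook.

```python
import numpy as np

error_array = np.zeros(1000)

def is_error_vector(features):
    """Return True when a feature vector is the all-zero failure placeholder."""
    return not np.any(features)

# A made-up "prediction": class probabilities sum to 1, so they are not all zero.
fake_prediction = np.full(1000, 1.0 / 1000)

print(is_error_vector(error_array), is_error_vector(fake_prediction))
```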

Now we create the main function 'ProcessImage'.  This function reads a url and makes predictions using a specified model.  See the comments in the function for a description of each section.

In [309]:
def ProcessImage(input_url, model):
    ### first we get the image from the url and then write it to a temporary file 'buffer.jpg'
    url = input_url
    if url == ']':
        return error_array  # the listing has no pictures (an empty photo list reduces to ']')
    try:  # try to get the url
        r = requests.get(url, allow_redirects=True)
    except Exception:  # e.g. ConnectionError or an invalid url
        return error_array  # the url is not formed correctly or points to a dead link

    data_type = r.headers.get('Content-Type')
    if data_type != 'image/jpeg':
        return error_array  # the response is not a JPEG image
    try:
        with open('buffer.jpg', 'wb') as file:
            file.write(r.content)  # write the image to a buffer file
    except OSError:
        return error_array  # 'buffer.jpg' cannot be opened or written

    ### next, we read 'buffer.jpg' and resize and preprocess the image
    img_path = 'buffer.jpg'
    img = image.load_img(img_path, target_size=(224, 224))  # read the image from the buffer file and resize it
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)  # add a batch dimension: (1, 224, 224, 3)
    x = preprocess_input(x)

    ## next we extract the 1000 features (the model's ImageNet class probabilities)
    features = model.predict(x)

    ## next, let's change all feature values < .001 to 0
    features[features < .001] = 0

    ## next we return the features as a 1-D array of length 1000
    return features[0]
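
The thresholding step relies on NumPy boolean-mask assignment. This toy sketch uses a made-up 5-class probability vector (the real vector has 1000 entries) to show the effect. Note that the surviving values no longer sum to 1, which would explain why the `total` values recorded later in this notebook are below 1.

```python
import numpy as np

# Made-up probabilities with shape (1, 5), mimicking the (1, 1000) output of model.predict.
features = np.array([[0.9000, 0.0990, 0.0004, 0.0003, 0.0003]])

# Boolean-mask assignment: every entry below the cutoff is set to 0 in place.
features[features < .001] = 0

row = features[0]   # drop the batch dimension, as ProcessImage does
print(row)          # the three small entries are now exactly 0
print(row.sum())    # below 1 because the small values were discarded
```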
    
    
    

Now let's try both the VGG16 model and the ResNet50 model and see which is faster.

In [150]:
start = time.time()
ProcessImage('https://photos.renthop.com/2/6878465_d3ff71ae16aca93c88865a8151c05340.jpg', resnet50_model)
end = time.time()
print(end - start)

Predicted: [('n03761084', 'microwave', 0.09829877), ('n03782006', 'monitor', 0.095307335), ('n02906734', 'broom', 0.08151369)]
1.027466058731079


In [151]:
start = time.time()
ProcessImage('https://photos.renthop.com/2/6878465_d3ff71ae16aca93c88865a8151c05340.jpg', vgg16_model)
end = time.time()
print(end - start)

Predicted: [('n02977058', 'cash_machine', 0.13933799), ('n04125021', 'safe', 0.06972619), ('n04239074', 'sliding_door', 0.038677044)]
0.7340161800384521


In this single timed call, which includes the image download, the VGG16 model returned its prediction faster (about 0.73 seconds) than the ResNet50 model (about 1.03 seconds), so we'll use the VGG16 model.
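
A single timed call is noisy, and here it also includes the network download. One hedged way to make the comparison fairer is to repeat the call and report the median. In this sketch `time_call` is a hypothetical helper, and the lambda is a stand-in workload for `ProcessImage`.

```python
import statistics
import time

def time_call(fn, *args, repeats=5):
    """Run fn(*args) `repeats` times and return the median wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - t0)
    return statistics.median(durations)

# Stand-in workload; in the notebook this would be, for example,
# time_call(ProcessImage, url, vgg16_model) versus time_call(ProcessImage, url, resnet50_model).
median_seconds = time_call(lambda n: sum(range(n)), 100_000)
print(f"median: {median_seconds:.6f} s")
```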

Let's read in the master csv and put it into a dataframe.

In [153]:
master_df = pd.read_csv('master_data.csv')

In [170]:
len(master_df)

49352

In [165]:
photo_series = master_df['photos']

In [171]:
len(photo_series)

49352

Each entry in this series holds the list of photo URLs for one listing, so let's create a series with just the URL of the first image.

In [172]:
first_photo_series = photo_series.apply(lambda x: x.split(',')[0].replace('[',''))
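
To see what this expression yields, here is a sketch on three made-up `photos` strings. It assumes each cell holds a bracketed, comma-separated list stored as text, and the example.com URLs are placeholders. An empty list collapses to `']'`, which is the value `ProcessImage` checks for.

```python
import pandas as pd

photo_series = pd.Series([
    "[https://example.com/a.jpg, https://example.com/b.jpg]",  # two photos
    "[https://example.com/c.jpg]",                              # one photo
    "[]",                                                       # no photos
])

# Same expression as in the notebook: keep the text before the first comma, drop '['.
first_photo_series = photo_series.apply(lambda x: x.split(',')[0].replace('[', ''))

print(first_photo_series.tolist())
```

Under this assumed format, a cell with a single photo keeps its trailing `]`, so such a URL would likely fail to download and fall back to the error array.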

In [177]:
len(first_photo_series)

49352

Now we build a function to loop through the series of first images, create the feature vector for each one, and put that feature vector into a dataframe. The most important property of this function is fault tolerance, so that if it fails, we don't lose work. Two choices accomplish this: the results accumulate in a global dataframe, so rows already processed survive an exception or interrupt, and the function takes start and end indices, so a run can be resumed from the point where it stopped.

In [376]:
## Function to extract features for a specific range of pictures
## start is the first index to process (beginning at 0)
## end is the last index to process (inclusive); interval is the reporting interval

def Iterate_images(start, end, interval):
    global feature_df
    stop = end + 1 # range() excludes its last argument, so pass last index + 1
    ## extract features for each image, and append to the bottom of feature_df
    for i in range(start, stop):
        url = first_photo_series[i]
        features = ProcessImage(url, vgg16_model) # extract the 1000 features
        total = features.sum() # sum of the retained feature values
        slice_df = pd.DataFrame(features).transpose() # convert into a 1 row, 1000 column dataframe
        slice_df.insert(0, 'url', url) # insert a column into this dataframe that contains the image url
        slice_df.insert(0, 'number', i) # insert a column into this dataframe that contains the record number
        slice_df.insert(0, 'total', total) # insert a column into this dataframe that contains the total weights
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        feature_df = pd.concat([feature_df, slice_df], ignore_index=True)
        if (i % interval == 0):
            print ('Completed interation: ' + str(i))
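
Growing a dataframe inside the loop copies the whole frame on every iteration, which gets slow over tens of thousands of rows. One alternative, sketched here under the assumption that the extractor returns a 1-D NumPy array, collects plain rows in a list and builds the frame once. `extract_range` and `fake_extractor` are hypothetical names, and the real extractor would be `ProcessImage`.

```python
import numpy as np
import pandas as pd

def fake_extractor(url):
    """Stand-in for ProcessImage: returns a small deterministic feature vector."""
    return np.full(4, len(url) / 100.0)

def extract_range(urls, start, end, extractor):
    """Extract features for urls[start..end] inclusive and return one dataframe."""
    rows = []
    for i in range(start, end + 1):
        features = extractor(urls[i])
        # One flat record per image: bookkeeping fields first, then the features.
        rows.append([features.sum(), i, urls[i], *features])
    n_features = len(rows[0]) - 3 if rows else 0
    columns = ["total", "number", "url"] + list(range(n_features))
    return pd.DataFrame(rows, columns=columns)

urls = ["https://example.com/%d.jpg" % k for k in range(10)]
chunk = extract_range(urls, 2, 5, fake_extractor)
print(chunk.shape)
```

Each chunk could then be written to its own CSV file, so that a crash loses at most one chunk of work.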
 

In [387]:
## create an empty dataframe with 1001 columns labeled 0 through 1000
## (the features fill columns 0-999, so column 1000 stays empty)
feature_df = pd.DataFrame(index=range(0,1001)).transpose()

Now, let's process all 49,352 records and report every 200 iterations.

In [388]:
start = time.time()
Iterate_images(0, 49351, 200)
end = time.time()
print(end-start)

Completed interation: 0
Completed interation: 200
Completed interation: 400
Completed interation: 600
Completed interation: 800
Completed interation: 1000
Completed interation: 1200
Completed interation: 1400
Completed interation: 1600
Completed interation: 1800
Completed interation: 2000
Completed interation: 2200
Completed interation: 2400
Completed interation: 2600
Completed interation: 2800
Completed interation: 3000
Completed interation: 3200
Completed interation: 3400
Completed interation: 3600
Completed interation: 3800
Completed interation: 4000
Completed interation: 4200
Completed interation: 4400
Completed interation: 4600
Completed interation: 4800
Completed interation: 5000
Completed interation: 5200
Completed interation: 5400
Completed interation: 5600
Completed interation: 5800
Completed interation: 6000
Completed interation: 6200
Completed interation: 6400
Completed interation: 6600
Completed interation: 6800
Completed interation: 7000
Completed interation: 7200
...

I wrote this out to a CSV file, 'image_features.csv'. To test it, I'll read it back in and verify that the rows are in the correct order by comparing the 'number' column with the index on the far left. They match, so we're good to go.
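
The write step is not shown in this notebook. Here is a minimal sketch of the round trip and the order check, using a toy frame and an in-memory buffer in place of 'image_features.csv'. Passing `index=False` is an assumption about how to avoid extra `Unnamed: 0` columns when the file is read back.

```python
import io
import pandas as pd

# Toy stand-in for feature_df, with the bookkeeping column used for the check.
toy_df = pd.DataFrame({"number": [0, 1, 2], "total": [0.95, 0.91, 0.99]})

# index=False keeps the row index out of the file, so no 'Unnamed: 0' column appears on re-read.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

toy_test_df = pd.read_csv(buffer)

# The rows are in order if 'number' equals the positional index on every row.
in_order = (toy_test_df["number"] == toy_test_df.index).all()
print(in_order)
```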

In [391]:
test_df = pd.read_csv('image_features.csv')

In [393]:
test_df.tail()

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8,...,994,995,996,997,998,999,1000,total,number,url
49347,49347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.001527,,0.949521,49347.0,https://photos.renthop.com/2/7098690_18396d32e...
49348,49348,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,,0.9534,49348.0,https://photos.renthop.com/2/6822449_b429587b7...
49349,49349,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,,0.913889,49349.0,https://photos.renthop.com/2/6881461_20a865305...
49350,49350,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,,0.990323,49350.0,https://photos.renthop.com/2/6841891_124c9c446...
49351,49351,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,,0.929024,49351.0,https://photos.renthop.com/2/6858245_c4380bde9...


In [394]:
58894/3600


16.359444444444446

Total process time was 58,894 seconds, or approximately 16.4 hours.
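
Dividing by the number of rows gives the average cost per image. This sketch assumes 58,894 is the elapsed time in seconds printed by the full run.

```python
elapsed_seconds = 58894   # assumed to be the printed end - start value
n_images = 49352          # len(master_df)

hours = elapsed_seconds / 3600
seconds_per_image = elapsed_seconds / n_images

print(f"{hours:.1f} hours, {seconds_per_image:.2f} s per image")
```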