# Predicting Pawpularity

#### Objective
PetFinder.my is Malaysia’s leading animal welfare platform, featuring over 180,000 animals with 54,000 happily adopted. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare.

Currently, PetFinder.my uses a basic Cuteness Meter to rank pet photos. It analyzes picture composition and other factors compared to the performance of thousands of pet profiles. While this basic tool is helpful, it's still in an experimental stage and the algorithm could be improved.

In this competition, you’ll analyze raw images and metadata to predict the “Pawpularity” of pet photos. You'll train and test your model on PetFinder.my's thousands of pet profiles.

#### Description of data
- CSV file with 9912 rows and 14 columns of metadata (no nulls)
- Folder with 9912 jpeg files linked to metadata via id in file name
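
Since the only link between the two sources is the file name, it may be worth confirming that every Id has a matching image before loading anything. The sketch below is one way to do that; `find_missing_images` is a hypothetical helper (not part of the notebook), and the `.jpg` extension is assumed from the loading code further down.

```python
import os

def find_missing_images(ids, image_dir, ext=".jpg"):
    """Return the ids that have no matching image file in image_dir."""
    # list the folder once instead of probing the disk per id
    present = set(os.listdir(image_dir))
    return [i for i in ids if f"{i}{ext}" not in present]
```

Called with the Id values from the CSV and the image folder, an empty list would confirm that the metadata and the images line up.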

#### Issues:
- The method used to select the data is unclear
- It is unclear whether the photos are the pets' profile photos
- The Pawpularity score is derived from web traffic on the pet profile, not from the metadata

#### Noise:
- The metadata does not indicate whether the featured pet is a dog or a cat
- The metadata does not include the pet's location

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('/Users/arnet/Desktop/Ironhack/Final_Project/petfinder-pawpularity-score/train.csv')

In [3]:
data.head()

Unnamed: 0,Id,Subject Focus,Eyes,Face,Near,Action,Accessory,Group,Collage,Human,Occlusion,Info,Blur,Pawpularity
0,0007de18844b0dbbb5e1f607da0606e0,0,1,1,1,0,0,1,0,0,0,0,0,63
1,0009c66b9439883ba2750fb825e1d7db,0,1,1,0,0,0,0,0,0,0,0,0,42
2,0013fd999caf9a3efe1352ca1b0d937e,0,1,1,1,0,0,0,0,1,1,0,0,28
3,0018df346ac9c1d8413cfcc888ca8246,0,1,1,1,0,0,0,0,0,0,0,0,15
4,001dc955e10590d3ca4673f034feeef2,0,0,0,1,0,0,1,0,0,0,0,0,72


In [4]:
data = data.set_index('Id')
data = data.rename_axis(None)

In [5]:
data.head()

Unnamed: 0,Subject Focus,Eyes,Face,Near,Action,Accessory,Group,Collage,Human,Occlusion,Info,Blur,Pawpularity
0007de18844b0dbbb5e1f607da0606e0,0,1,1,1,0,0,1,0,0,0,0,0,63
0009c66b9439883ba2750fb825e1d7db,0,1,1,0,0,0,0,0,0,0,0,0,42
0013fd999caf9a3efe1352ca1b0d937e,0,1,1,1,0,0,0,0,1,1,0,0,28
0018df346ac9c1d8413cfcc888ca8246,0,1,1,1,0,0,0,0,0,0,0,0,15
001dc955e10590d3ca4673f034feeef2,0,0,0,1,0,0,1,0,0,0,0,0,72


## Process metadata

### Scale pawpularity score (ALSO TRY WITHOUT SCALING)

In [6]:
# get target score
y = data["Pawpularity"]

In [7]:
# Scale and ensure that score is between 0 and 1
maxScore = y.max()
y_scaled= y / maxScore

In [8]:
y_scaled

0007de18844b0dbbb5e1f607da0606e0    0.63
0009c66b9439883ba2750fb825e1d7db    0.42
0013fd999caf9a3efe1352ca1b0d937e    0.28
0018df346ac9c1d8413cfcc888ca8246    0.15
001dc955e10590d3ca4673f034feeef2    0.72
                                    ... 
ffbfa0383c34dc513c95560d6e1fdb57    0.15
ffcc8532d76436fc79e50eb2e5238e45    0.70
ffdf2e8673a1da6fb80342fa3b119a20    0.20
fff19e2ce11718548fa1c5d039a5192a    0.20
fff8e47c766799c9e12f3cb3d66ad228    0.30
Name: Pawpularity, Length: 9912, dtype: float64
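
The output above (63 becomes 0.63) suggests the largest score in this data is 100. Any prediction made on the scaled target has to be multiplied by the same constant before it can be compared with raw scores, and an error such as RMSE scales by that constant too. A minimal sketch, assuming a fixed maximum of 100 rather than the observed `y.max()`; both helper names are hypothetical:

```python
def scale_score(score, max_score=100.0):
    """Map a raw score to [0, 1]; also works element-wise on a pandas Series."""
    return score / max_score

def unscale_score(scaled, max_score=100.0):
    """Map a scaled score (or a scaled error such as RMSE) back to raw units."""
    return scaled * max_score

# a scaled RMSE of 0.2502 is roughly 25 points on the raw scale
rmse_raw = unscale_score(0.2502)
```

Using a fixed constant instead of the sample maximum keeps the scaling identical across train, test and any later data, at the cost of assuming the score range.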

In [9]:
# Drop the target column ('Pawpularity') from the features
data = data.drop('Pawpularity', axis=1)
data.head()

Unnamed: 0,Subject Focus,Eyes,Face,Near,Action,Accessory,Group,Collage,Human,Occlusion,Info,Blur
0007de18844b0dbbb5e1f607da0606e0,0,1,1,1,0,0,1,0,0,0,0,0
0009c66b9439883ba2750fb825e1d7db,0,1,1,0,0,0,0,0,0,0,0,0
0013fd999caf9a3efe1352ca1b0d937e,0,1,1,1,0,0,0,0,1,1,0,0
0018df346ac9c1d8413cfcc888ca8246,0,1,1,1,0,0,0,0,0,0,0,0
001dc955e10590d3ca4673f034feeef2,0,0,0,1,0,0,1,0,0,0,0,0


In [10]:
data

Unnamed: 0,Subject Focus,Eyes,Face,Near,Action,Accessory,Group,Collage,Human,Occlusion,Info,Blur
0007de18844b0dbbb5e1f607da0606e0,0,1,1,1,0,0,1,0,0,0,0,0
0009c66b9439883ba2750fb825e1d7db,0,1,1,0,0,0,0,0,0,0,0,0
0013fd999caf9a3efe1352ca1b0d937e,0,1,1,1,0,0,0,0,1,1,0,0
0018df346ac9c1d8413cfcc888ca8246,0,1,1,1,0,0,0,0,0,0,0,0
001dc955e10590d3ca4673f034feeef2,0,0,0,1,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
ffbfa0383c34dc513c95560d6e1fdb57,0,0,0,1,0,0,0,0,0,0,0,1
ffcc8532d76436fc79e50eb2e5238e45,0,1,1,1,0,0,0,0,0,0,0,0
ffdf2e8673a1da6fb80342fa3b119a20,0,1,1,1,0,0,0,0,1,1,0,0
fff19e2ce11718548fa1c5d039a5192a,0,1,1,1,0,0,0,0,1,0,0,0


## Load image data

In [18]:
import cv2
import os
import numpy as np

In [19]:
def load_house_images(data, inputPath):
    # initialize our images array (i.e., the pet images themselves)
    images = []

    # get image files based on index column in dataframe
    for i in data.index.values:
        basePath = os.path.join(inputPath, f'{i}.jpg').replace('\\', '/')
        # note: OpenCV loads images in BGR channel order
        image = cv2.imread(basePath)
        # cv2.imread returns None instead of raising when a file cannot be read
        if image is None:
            raise FileNotFoundError(f'could not read image: {basePath}')
        image = cv2.resize(image, (64, 64)) # best image size?
        images.append(image)

    # return our set of images
    return np.array(images)


In [20]:
inputPath = '/Users/arnet/Desktop/Ironhack/Final_Project/petfinder-pawpularity-score/train'

# make input path the current directory
os.chdir(inputPath)

images = load_house_images(data, inputPath)

In [21]:
# scale pixel intensities from [0, 255] to [0, 1]
images = images / 255.0
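
Dividing a `uint8` array by `255.0` yields `float64`, which for 9912 images of 64x64x3 is a fairly large array. The sketch below shows the same scaling in `float32` and the memory arithmetic behind it; the small random array is only a stand-in for the loaded images.

```python
import numpy as np

# stand-in for the loaded images: uint8 pixels, shape (n, 64, 64, 3)
images_uint8 = np.random.randint(0, 256, size=(4, 64, 64, 3), dtype=np.uint8)

images_f64 = images_uint8 / 255.0                      # float64, as in the cell above
images_f32 = images_uint8.astype(np.float32) / 255.0   # same values in float32

# memory needed for the full data set at each precision
n_values = 9912 * 64 * 64 * 3
print(f"float64: {n_values * 8 / 1e6:.0f} MB")
print(f"float32: {n_values * 4 / 1e6:.0f} MB")
```

Keras casts inputs to `float32` by default, so the lower precision should not change what the model sees.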

In [22]:
len(images)

9912

In [23]:
type(images)

numpy.ndarray

In [24]:
images[0]

array([[[0.76078431, 0.70588235, 0.72156863],
        [0.69803922, 0.64313725, 0.65882353],
        [0.78431373, 0.73333333, 0.7372549 ],
        ...,
        [0.69411765, 0.69019608, 0.69803922],
        [0.72941176, 0.73333333, 0.7254902 ],
        [0.74901961, 0.74117647, 0.7372549 ]],

       [[0.75294118, 0.70588235, 0.69803922],
        [0.8       , 0.75294118, 0.74509804],
        [0.77647059, 0.72941176, 0.72156863],
        ...,
        [0.56078431, 0.65882353, 0.6       ],
        [0.56470588, 0.62352941, 0.58039216],
        [0.59607843, 0.61960784, 0.59607843]],

       [[0.7254902 , 0.67843137, 0.67058824],
        [0.74509804, 0.69803922, 0.69019608],
        [0.7372549 , 0.69019608, 0.68235294],
        ...,
        [0.57647059, 0.54509804, 0.54509804],
        [0.59215686, 0.57647059, 0.58431373],
        [0.52941176, 0.54117647, 0.54509804]],

       ...,

       [[0.63137255, 0.63529412, 0.61568627],
        [0.55686275, 0.61176471, 0.62352941],
        [0.41960784, 0

## Split into train and test data

In [25]:
from sklearn.model_selection import train_test_split

In [46]:
# Split data for METADATA, IMAGES AND PREDICTION VALUE 'y'
split = train_test_split(data, images, y, test_size=0.25, random_state=42)
(X_train, X_test, XImages_train, XImages_test, y_train, y_test) = split

In [27]:
X_train

Unnamed: 0,Subject Focus,Eyes,Face,Near,Action,Accessory,Group,Collage,Human,Occlusion,Info,Blur
5e1a107f1f714592cd96d3e1329cab27,0,1,1,1,0,0,0,0,0,0,0,0
e8d7c28145cfbdd2349acd8f2a28949f,0,1,1,0,0,0,0,1,0,1,1,0
20ebe45894e740a627da088335f9ddea,0,0,0,0,0,0,1,0,0,0,0,0
7c041fd9ed4c83ac88a3e9e0487a02ec,0,1,1,1,0,0,0,0,1,0,0,0
803bb1db4f00079a0be91f3319de78f9,0,1,1,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
94773ee6f04891c99bc4d37e8f20ff6f,0,1,1,1,0,0,0,0,1,1,0,0
85e7146eeb13644b1bac74b684ccf51f,0,1,1,1,0,0,0,0,0,0,0,0
8afb263d779be24c94f01046e5ec3e81,0,1,1,1,0,0,0,0,0,0,0,0
15bda3335526d2ab18834c65f93add56,0,1,1,1,0,0,0,0,1,0,0,0


## Create model

In [28]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
import tensorflow as tf

### Mixed model

In [47]:
# MLP model FOR METADATA
def create_mlp(dim, regress=False):
    # define our MLP network
    model = Sequential()
    model.add(Dense(8, input_dim=dim, activation="relu"))
    model.add(Dense(4, activation="relu"))
    # check to see if the regression node should be added
    if regress:
        model.add(Dense(1, activation="linear"))
    # return our model
    return model
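
To get a feel for how small this network is, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helpers below are illustrative only and mirror the 8-4-(1) layout of `create_mlp`.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

def mlp_params(dim, regress=False):
    """Parameter count of the 8 -> 4 -> (1) MLP defined above."""
    total = dense_params(dim, 8) + dense_params(8, 4)
    if regress:
        total += dense_params(4, 1)
    return total

# 12 metadata columns remain after dropping the target
print(mlp_params(12), mlp_params(12, regress=True))
```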

In [48]:
# CNN for regression prediction FOR IMAGE DATA
def create_cnn(width, height, depth, filters=(16, 32, 64), regress=False):
    # initialize the input shape and channel dimension, assuming
    # TensorFlow/channels-last ordering
    inputShape = (height, width, depth)
    chanDim = -1

    # define the model input
    inputs = Input(shape=inputShape)
    # loop over the number of filters
    for (i, f) in enumerate(filters):
        # if this is the first CONV layer then set the input
        # appropriately
        if i == 0:
            x = inputs
        # CONV => RELU => BN => POOL
        x = Conv2D(f, (3, 3), padding="same")(x)
        x = Activation("relu")(x)
        x = BatchNormalization(axis=chanDim)(x)
        x = MaxPooling2D(pool_size=(2, 2))(x)
    # flatten the volume, then FC => RELU => BN => DROPOUT
    x = Flatten()(x)
    x = Dense(16)(x)
    x = Activation("relu")(x)
    x = BatchNormalization(axis=chanDim)(x)
    x = Dropout(0.5)(x)
    # apply another FC layer, this one to match the number of nodes
    # coming out of the MLP
    x = Dense(4)(x)
    x = Activation("relu")(x)
    # check to see if the regression node should be added
    if regress:
        x = Dense(1, activation="linear")(x)
    # construct the CNN
    model = Model(inputs, x)
    # return the CNN
    return model
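
The size of the flattened volume follows from the layer settings: with `padding="same"` each convolution keeps the spatial size, and each 2x2 max-pooling halves it (rounding down). A small sketch of that arithmetic; `cnn_flatten_size` is a hypothetical helper, not a Keras function.

```python
def cnn_flatten_size(height, width, filters=(16, 32, 64)):
    """Number of values entering Flatten() for the CNN defined above."""
    for _ in filters:
        # one CONV => RELU => BN => POOL block halves height and width
        height //= 2
        width //= 2
    # the channel count equals the filter count of the last conv layer
    return height * width * filters[-1]

print(cnn_flatten_size(64, 64))
```

For 64x64 inputs this gives an 8x8x64 volume, so the first dense layer after `Flatten()` holds most of the CNN's weights.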

In [31]:
# import the necessary packages
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import concatenate
import numpy as np
import os

In [49]:
# CREATE the MLP and CNN models
mlp = create_mlp(X_train.shape[1], regress=False) # X_train.shape[1] = number of METADATA columns i.e. inputs
cnn = create_cnn(64, 64, 3, regress=False)

# create the input to our final set of layers as the *output* of both
# the MLP and CNN
combinedInput = concatenate([mlp.output, cnn.output])

# our final FC layer head will have two dense layers, the final one
# being our regression head
x = Dense(4, activation="relu")(combinedInput)
x = Dense(1, activation="linear")(x)

# our final model will accept the metadata on the MLP input and the
# images on the CNN input, outputting a single value (the predicted
# Pawpularity score)
model = Model(inputs=[mlp.input, cnn.input], outputs=x)

In [50]:
# compile the model using mean squared error as our loss and track
# MAE, MAPE and RMSE as additional metrics
# opt = Adam(learning_rate=1e-3, decay=1e-3 / 200)  # not used: the default 'adam' is passed below
model.compile(loss="mse",
              metrics=['mse',
                       'mean_absolute_error',
                       'mean_absolute_percentage_error',
                       tf.keras.metrics.RootMeanSquaredError()],
              optimizer='adam')
# train the model
print("[INFO] training model...")
model.fit(
    x=[X_train, XImages_train], y=y_train,
    validation_data=([X_test, XImages_test], y_test),
    epochs=5, batch_size=8)


[INFO] training model...
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1e593444df0>

In [None]:
# make predictions on the testing data
print("[INFO] predicting Pawpularity scores...")
preds = model.predict([X_test, XImages_test])

[INFO] training model...
Epoch 1/5
930/930 [==============================] - 75s 80ms/step - loss: 65.1758 - val_loss: 56.7548
Epoch 2/5
930/930 [==============================] - 95s 102ms/step - loss: 57.6539 - val_loss: 56.2901
Epoch 3/5
930/930 [==============================] - 77s 82ms/step - loss: 57.4500 - val_loss: 56.4440
Epoch 4/5
930/930 [==============================] - 75s 80ms/step - loss: 57.4485 - val_loss: 56.5238
Epoch 5/5
930/930 [==============================] - 77s 83ms/step - loss: 57.4599 - val_loss: 56.1432
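
The `930/930` in the log is the number of batches per epoch, which follows from the split and the batch size. The arithmetic below is a sketch; it assumes scikit-learn rounds the test share up and gives the remainder to the training set.

```python
import math

n_samples = 9912
test_size = 0.25
batch_size = 8

n_test = math.ceil(test_size * n_samples)    # assumption: test share is rounded up
n_train = n_samples - n_test                 # rows left for training
steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch may be partial

print(n_train, steps_per_epoch)
```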

In [33]:
# create our Convolutional Neural Network FOR IMAGES and then compile the model
# using mean absolute percentage error as our loss, implying that we
# seek to minimize the absolute percentage difference between the
# *predicted* and the *actual* Pawpularity scores
model = create_cnn(64, 64, 3, regress=True)
#opt = Adam(lr=1e-3, decay=1e-3 / 200)
model.compile(loss="mean_absolute_percentage_error", 
              optimizer='adam', 
              metrics=['mse', tf.metrics.RootMeanSquaredError()])

In [34]:
# train the model
print("[INFO] training model...")
model.fit(x=X_train, y=y_train, 
    validation_data=(X_test, y_test),
    epochs=5, batch_size=8)

[INFO] training model...
Epoch 1/5


ValueError: in user code:

    C:\Users\arnet\anaconda3\lib\site-packages\tensorflow\python\keras\engine\training.py:806 train_function  *
        return step_function(self, iterator)
    C:\Users\arnet\anaconda3\lib\site-packages\tensorflow\python\keras\engine\training.py:796 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    C:\Users\arnet\anaconda3\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:1211 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    C:\Users\arnet\anaconda3\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2585 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    C:\Users\arnet\anaconda3\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2945 _call_for_each_replica
        return fn(*args, **kwargs)
    C:\Users\arnet\anaconda3\lib\site-packages\tensorflow\python\keras\engine\training.py:789 run_step  **
        outputs = model.train_step(data)
    C:\Users\arnet\anaconda3\lib\site-packages\tensorflow\python\keras\engine\training.py:747 train_step
        y_pred = self(x, training=True)
    C:\Users\arnet\anaconda3\lib\site-packages\tensorflow\python\keras\engine\base_layer.py:985 __call__
        outputs = call_fn(inputs, *args, **kwargs)
    C:\Users\arnet\anaconda3\lib\site-packages\tensorflow\python\keras\engine\functional.py:385 call
        return self._run_internal_graph(
    C:\Users\arnet\anaconda3\lib\site-packages\tensorflow\python\keras\engine\functional.py:508 _run_internal_graph
        outputs = node.layer(*args, **kwargs)
    C:\Users\arnet\anaconda3\lib\site-packages\tensorflow\python\keras\engine\base_layer.py:975 __call__
        input_spec.assert_input_compatibility(self.input_spec, inputs,
    C:\Users\arnet\anaconda3\lib\site-packages\tensorflow\python\keras\engine\input_spec.py:191 assert_input_compatibility
        raise ValueError('Input ' + str(input_index) + ' of layer ' +

    ValueError: Input 0 of layer conv2d_3 is incompatible with the layer: : expected min_ndim=4, found ndim=2. Full shape received: [None, 12]
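
The error message points at the cause: the first convolution expects a 4-dimensional batch of images, but `X_train` (the 12 metadata columns) was passed. A small pre-flight check can surface this with a clearer message before training starts. `check_cnn_input` is a hypothetical helper, and the zero arrays merely stand in for the real data.

```python
import numpy as np

def check_cnn_input(x, height=64, width=64, depth=3):
    """Return x as an array if it has shape (n, height, width, depth), else raise."""
    x = np.asarray(x)
    if x.ndim != 4 or x.shape[1:] != (height, width, depth):
        raise ValueError(
            f"expected shape (n, {height}, {width}, {depth}), got {x.shape}")
    return x

metadata_like = np.zeros((5, 12))          # same layout as X_train: rejected
images_like = np.zeros((5, 64, 64, 3))     # same layout as XImages_train: accepted
check_cnn_input(images_like)
```

In the notebook's terms, the fix would presumably be to pass `XImages_train` and `XImages_test` to `model.fit` in place of `X_train` and `X_test`.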


Note:
* Metrics show no further improvement after 10 epochs
* loss (Mean Absolute Percentage Error): 56.9354 - mse: 0.0626 - root_mean_squared_error: 0.2502 - val_loss: 57.6136 - val_mse: 0.0662 - val_root_mean_squared_error: 0.2573

In [31]:
# import the necessary packages
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
# additional modules (not used)
import numpy as np
import os

## Train model

In [14]:
# import the necessary packages
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

In [15]:
# import the necessary packages
from tensorflow.keras.optimizers import Adam
import numpy as np
import locale
import os

In [16]:
# function for creating SEQUENTIAL model
def create_mlp(dim, regress=False):
    # define our MLP network
    model = Sequential()
    model.add(Dense(8, input_dim=dim, activation="relu"))
    model.add(Dense(4, activation="relu"))
    # check to see if the regression node should be added
    if regress:
        model.add(Dense(1, activation="linear"))
    # return our model
    return model

In [17]:
X_train.shape[1]

12

In [18]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7434 entries, 8462 to 9771
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Subject Focus  7434 non-null   int64
 1   Eyes           7434 non-null   int64
 2   Face           7434 non-null   int64
 3   Near           7434 non-null   int64
 4   Action         7434 non-null   int64
 5   Accessory      7434 non-null   int64
 6   Group          7434 non-null   int64
 7   Collage        7434 non-null   int64
 8   Human          7434 non-null   int64
 9   Occlusion      7434 non-null   int64
 10  Info           7434 non-null   int64
 11  Blur           7434 non-null   int64
dtypes: int64(12)
memory usage: 755.0 KB


In [20]:
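import math
epox = 100  # number of training epochs
# candidate batch size derived from the epoch count; it is not used
# below, because batch_size is commented out in the fit call
b_size = math.floor(7434 / epox)
b_size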

74

In [24]:
from tensorflow.keras import metrics
# create our MLP and then compile the model using mean absolute
# percentage error as our loss, implying that we seek to minimize
# the absolute percentage difference between the *predicted* and
# the *actual* Pawpularity scores
model = create_mlp(X_train.shape[1], regress=True)
# `decay` applies time-based learning-rate decay in the TF 2.x Keras
# optimizers: lr_t = lr / (1 + decay * step)
opt = Adam(learning_rate=1e-3, decay=1e-3 / 200)
model.compile(loss="mean_absolute_percentage_error", 
              optimizer=opt,
              metrics = ['mse', tf.metrics.RootMeanSquaredError()] # added metrics
             ) 
# train the model
print("[INFO] training model...")
model.fit(x=X_train, y=y_train, 
    validation_data=(X_test, y_test),
    epochs=epox)
#batch_size=b_size)

[INFO] training model...
Epoch 1/100
...
Epoch 100/100


<tensorflow.python.keras.callbacks.History at 0x1883a5d9eb0>

In [28]:
# make predictions on the testing data
print("[INFO] predicting Pawpularity scores...")
preds = model.predict(X_test)
# compute the difference between the *predicted* and the *actual*
# Pawpularity scores, then compute the percentage difference and
# the absolute percentage difference
diff = preds.flatten() - y_test
percentDiff = (diff / y_test) * 100
absPercentDiff = np.abs(percentDiff)
# compute the mean and standard deviation of the absolute percentage
# difference
mean = np.mean(absPercentDiff)
std = np.std(absPercentDiff)
# finally, show some statistics on our model
# ('Pawpularity' was dropped from `data` earlier, so use the target series `y`)
print("[INFO] avg. Pawpularity: {}, std Pawpularity: {}".format(y.mean(), y.std()))
print("[INFO] mean: {:.2f}%, std: {:.2f}%".format(mean, std))

[INFO] predicting Pawpularity scores...
[INFO] avg. Pawpularity: 38.03904358353511, std Pawpularity: 20.591990105774546
[INFO] mean: 58.15%, std: 131.29%
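
A standard deviation more than twice the mean suggests that a few very large percentage errors dominate. One plausible reason is that percentage error divides by the true score, so low scores inflate it. The sketch below uses made-up scores and a constant prediction near the data mean to contrast this with RMSE, which the notebook already tracks as a metric.

```python
import math

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - t) / abs(t)
                       for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return math.sqrt(sum((p - t) ** 2
                         for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [2, 40, 80]     # made-up scores: one very low, one average, one high
y_pred = [38, 38, 38]    # constant prediction close to the mean score

print(f"MAPE: {mape(y_true, y_pred):.1f}%")
print(f"RMSE: {rmse(y_true, y_pred):.1f}")
```

Here the low score contributes 1800% on its own, while the high score, which is missed by a larger absolute margin, contributes only 52.5%. Under a percentage loss the model is therefore pushed toward underestimating.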
