# Predicting Pawpularity

#### Objective
PetFinder.my is Malaysia’s leading animal welfare platform, featuring over 180,000 animals with 54,000 happily adopted. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare.

Currently, PetFinder.my uses a basic Cuteness Meter to rank pet photos. It analyzes picture composition and other factors compared to the performance of thousands of pet profiles. While this basic tool is helpful, it's still in an experimental stage and the algorithm could be improved.

In this competition, you’ll analyze raw images and metadata to predict the “Pawpularity” of pet photos. You'll train and test your model on PetFinder.my's thousands of pet profiles.

#### Description of data
- CSV file with 9912 rows and 14 columns of metadata (no nulls)
- Folder with 9912 jpeg files linked to metadata via id in file name
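
Because each photo is linked to its metadata row through the Id in the file name, a small helper can resolve a row to its image. This is a minimal sketch: the `.jpg` extension, the folder name `train`, and the helper name `image_path` are assumptions for illustration, not details confirmed by the notebook.

```python
from pathlib import Path

# Hypothetical folder name; point this at wherever the JPEG files live.
IMAGE_DIR = Path("train")


def image_path(pet_id: str, image_dir: Path = IMAGE_DIR) -> Path:
    """Return the expected JPEG path for one metadata row's Id.

    Assumes files are named '<Id>.jpg'; adjust if the folder uses '.jpeg'.
    """
    return image_dir / f"{pet_id}.jpg"


# Example with the first Id shown in the metadata preview.
print(image_path("0007de18844b0dbbb5e1f607da0606e0"))
```

Applied to the whole frame, something like `data["Id"].map(image_path)` would give one path per row, which is a convenient starting point once the images are used alongside the metadata.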

#### Issues
- The method used to select the data is unclear
- It is unclear whether the photos are the pets' profile photos
- The Pawpularity score is derived from web traffic on the pet's profile, not from the metadata

#### Noise
- The metadata does not indicate whether the featured pet is a dog or a cat
- The metadata does not include the pet's location

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('/Users/arnet/Desktop/Ironhack/Final_Project/petfinder-pawpularity-score/train.csv')

In [3]:
data.head()

Unnamed: 0,Id,Subject Focus,Eyes,Face,Near,Action,Accessory,Group,Collage,Human,Occlusion,Info,Blur,Pawpularity
0,0007de18844b0dbbb5e1f607da0606e0,0,1,1,1,0,0,1,0,0,0,0,0,63
1,0009c66b9439883ba2750fb825e1d7db,0,1,1,0,0,0,0,0,0,0,0,0,42
2,0013fd999caf9a3efe1352ca1b0d937e,0,1,1,1,0,0,0,0,1,1,0,0,28
3,0018df346ac9c1d8413cfcc888ca8246,0,1,1,1,0,0,0,0,0,0,0,0,15
4,001dc955e10590d3ca4673f034feeef2,0,0,0,1,0,0,1,0,0,0,0,0,72


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9912 entries, 0 to 9911
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Id             9912 non-null   object
 1   Subject Focus  9912 non-null   int64 
 2   Eyes           9912 non-null   int64 
 3   Face           9912 non-null   int64 
 4   Near           9912 non-null   int64 
 5   Action         9912 non-null   int64 
 6   Accessory      9912 non-null   int64 
 7   Group          9912 non-null   int64 
 8   Collage        9912 non-null   int64 
 9   Human          9912 non-null   int64 
 10  Occlusion      9912 non-null   int64 
 11  Info           9912 non-null   int64 
 12  Blur           9912 non-null   int64 
 13  Pawpularity    9912 non-null   int64 
dtypes: int64(13), object(1)
memory usage: 1.1+ MB


In [5]:
for column in data:
    print(column.upper())
    print(data[column].value_counts(), '\n')

ID
0007de18844b0dbbb5e1f607da0606e0    1
aa90aad818852ca47177213e5be26709    1
aa5d26353816ded3a8a6625aeeabe88b    1
aa635b767ed19b26ac40a8882f3331e8    1
aa6fae51f270df4093831d11d7dd61d2    1
                                   ..
554b19adc114ac107175c2115347136a    1
554bda115618f06c0cccc09c9ec549c3    1
55521992ad6215df8b423a767cc7a3c6    1
555b694f51ae06552493692a94cf9167    1
fff8e47c766799c9e12f3cb3d66ad228    1
Name: Id, Length: 9912, dtype: int64 

SUBJECT FOCUS
0    9638
1     274
Name: Subject Focus, dtype: int64 

EYES
1    7658
0    2254
Name: Eyes, dtype: int64 

FACE
1    8960
0     952
Name: Face, dtype: int64 

NEAR
1    8540
0    1372
Name: Near, dtype: int64 

ACTION
0    9813
1      99
Name: Action, dtype: int64 

ACCESSORY
0    9240
1     672
Name: Accessory, dtype: int64 

GROUP
0    8630
1    1282
Name: Group, dtype: int64 

COLLAGE
0    9420
1     492
Name: Collage, dtype: int64 

HUMAN
0    8264
1    1648
Name: Human, dtype: int64 

OCCLUSION
0    8207
1    1705


In [6]:
# 'Id' is a string column, so drop it before computing correlations
corr_matrix = data.drop(columns='Id').corr()
corr_matrix['Pawpularity'].sort_values(ascending=False)

Pawpularity      1.000000
Group            0.016469
Accessory        0.013287
Face             0.008018
Human            0.003983
Occlusion        0.001979
Collage          0.001732
Near             0.001001
Action          -0.001373
Info            -0.004735
Eyes            -0.006686
Subject Focus   -0.009853
Blur            -0.023540
Name: Pawpularity, dtype: float64

## Process metadata

### Scale Pawpularity score (also try without scaling)

In [7]:
# get target score
y = data["Pawpularity"]

In [8]:
# Scale by the maximum score so that the target lies between 0 and 1
maxScore = y.max()
y_scaled = y / maxScore

In [9]:
maxScore

100

In [10]:
y_scaled

0       0.63
1       0.42
2       0.28
3       0.15
4       0.72
        ... 
9907    0.15
9908    0.70
9909    0.20
9910    0.20
9911    0.30
Name: Pawpularity, Length: 9912, dtype: float64

### Prepare 'X'

In [11]:
# drop unnecessary columns
df_trimmed = data.drop(['Pawpularity', 'Id'], axis=1)
df_trimmed.head()

Unnamed: 0,Subject Focus,Eyes,Face,Near,Action,Accessory,Group,Collage,Human,Occlusion,Info,Blur
0,0,1,1,1,0,0,1,0,0,0,0,0
1,0,1,1,0,0,0,0,0,0,0,0,0
2,0,1,1,1,0,0,0,0,1,1,0,0
3,0,1,1,1,0,0,0,0,0,0,0,0
4,0,0,0,1,0,0,1,0,0,0,0,0


## Split into train and test data

In [12]:
from sklearn.model_selection import train_test_split

In [20]:
# try with both 'y' and 'y_scaled' as the target
X_train, X_test, y_train, y_test = train_test_split(df_trimmed, y, test_size=0.25, random_state=32)

## Train model

In [21]:
# import the necessary packages
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

In [22]:
# import the necessary packages
from tensorflow.keras.optimizers import Adam
import numpy as np
import locale
import os

In [35]:
# function for creating SEQUENTIAL model
def create_mlp(dim, regress=False):
    # define our MLP network
    model = Sequential()
    model.add(Dense(8, input_dim=dim, activation="relu"))
    model.add(Dense(4, activation="relu"))
    # check to see if the regression node should be added
    if regress:
        model.add(Dense(1, activation="linear"))
    # return our model
    return model
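
To get a feel for how small this network is, the number of trainable parameters can be worked out by hand: each Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below does that arithmetic for the 12 metadata features; the helper names `dense_params` and `mlp_params` are made up for illustration and are not part of Keras.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


def mlp_params(input_dim: int, hidden=(8, 4), regress: bool = True) -> int:
    """Total trainable parameters of the 8-4(-1) MLP defined above."""
    sizes = [input_dim, *hidden] + ([1] if regress else [])
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))


# 12 metadata features: (12*8 + 8) + (8*4 + 4) + (4*1 + 1)
print(mlp_params(12))  # 145
```

Calling `model.summary()` on the Keras model should report the same total, which makes this a quick sanity check on the architecture.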

In [36]:
X_train.shape

(7434, 12)

In [37]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7434 entries, 8462 to 9771
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Subject Focus  7434 non-null   int64
 1   Eyes           7434 non-null   int64
 2   Face           7434 non-null   int64
 3   Near           7434 non-null   int64
 4   Action         7434 non-null   int64
 5   Accessory      7434 non-null   int64
 6   Group          7434 non-null   int64
 7   Collage        7434 non-null   int64
 8   Human          7434 non-null   int64
 9   Occlusion      7434 non-null   int64
 10  Info           7434 non-null   int64
 11  Blur           7434 non-null   int64
dtypes: int64(12)
memory usage: 755.0 KB


In [38]:
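import math
epox = 100
# batch size derived from the training-set size; dividing by `epox` gives
# roughly 100 batches per epoch (batch size and epoch count are independent)
b_size = math.floor(len(X_train) / epox)
b_size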

74
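
The batch size only determines how many gradient updates happen per epoch; it does not depend on the number of epochs. As a rough sketch, the updates per epoch for the training set above can be computed as follows. The helper name `steps_per_epoch` is hypothetical, and 32 is the batch size Keras falls back to when `batch_size` is not passed to `fit`.

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_train = 7434  # rows in X_train

print(steps_per_epoch(n_train, 74))  # with b_size from the cell above -> 101
print(steps_per_epoch(n_train, 32))  # with the Keras default batch size -> 233
```

A smaller batch size therefore means more, noisier updates per epoch, which is worth keeping in mind when comparing runs with and without `batch_size=b_size`.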

In [40]:
from tensorflow.keras import metrics
# create our MLP and compile it using mean absolute percentage error (MAPE)
# as the loss, i.e. we minimize the absolute percentage difference between
# the *predicted* and the *actual* Pawpularity scores
model = create_mlp(X_train.shape[1], regress=True)
# 1e-3 is also Adam's default learning rate; the earlier `decay` argument was
# never applied because the optimizer was passed by name, so it is left out
opt = Adam(learning_rate=1e-3)
model.compile(loss="mean_absolute_percentage_error",
              optimizer=opt,
              # extra metrics are only reported; they do not change the loss
              metrics=["mse", "mean_absolute_error", metrics.RootMeanSquaredError()])
# train the model (add batch_size=b_size to override the Keras default)
print("[INFO] training model...")
model.fit(x=X_train, y=y_train,
          validation_data=(X_test, y_test),
          epochs=10)

[INFO] training model...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x17f11db7130>

In [28]:
# make predictions on the testing data
print("[INFO] predicting Pawpularity scores...")
preds = model.predict(X_test)
# compute the difference between the *predicted* and the *actual*
# Pawpularity scores, then the percentage difference and the absolute
# percentage difference
diff = preds.flatten() - y_test
percentDiff = (diff / y_test) * 100
absPercentDiff = np.abs(percentDiff)
# compute the mean and standard deviation of the absolute percentage
# difference
mean = np.mean(absPercentDiff)
std = np.std(absPercentDiff)
# finally, show some statistics on our model
print("[INFO] avg. Pawpularity: {}, std Pawpularity: {}".format(data["Pawpularity"].mean(),
      data["Pawpularity"].std()))
print("[INFO] mean: {:.2f}%, std: {:.2f}%".format(mean, std))

[INFO] predicting Pawpularity scores...
[INFO] avg. Pawpularity: 38.03904358353511, std Pawpularity: 20.591990105774546
[INFO] mean: 58.15%, std: 131.29%
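
The standard deviation of about 20.59 reported above is a useful yardstick: a model that always predicts the mean score has an RMSE equal to the (population) standard deviation of the target, so any trained model should beat that figure to be worth keeping. The sketch below computes such constant baselines on a toy sample. The function name `constant_baseline` is hypothetical, and the ten scores are simply values visible in this notebook's output, not a representative sample.

```python
import math
import statistics


def constant_baseline(y_true, constant):
    """Error metrics for a model that predicts `constant` for every row."""
    errors = [constant - y for y in y_true]
    return {
        "mae": statistics.fmean([abs(e) for e in errors]),
        "rmse": math.sqrt(statistics.fmean([e * e for e in errors])),
        "mape": 100 * statistics.fmean([abs(e) / y for e, y in zip(errors, y_true)]),
    }


# Ten Pawpularity scores visible in the notebook output, used only as a toy sample.
sample = [63, 42, 28, 15, 72, 15, 70, 20, 20, 30]

for name, value in [("mean", statistics.fmean(sample)),
                    ("median", statistics.median(sample))]:
    report = constant_baseline(sample, value)
    print(name, {k: round(v, 2) for k, v in report.items()})
```

On the real data the same idea would be applied with `y_test` as the targets and `y_train.mean()` or `y_train.median()` as the constant, giving reference values to compare the logged losses against.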


### Best scores
#### Mean absolute percentage error
- val_loss: 58.1714
- loss: 55.9041

Training without an explicit batch_size (Keras default) produced slightly better scores
- loss: 55.7859 - val_loss: 57.6528

Removing the scaling of y did not change the results (approximately equal), though training may have been slightly slower
- loss: 56.5528 - val_loss: 57.6922
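
That scaling makes little difference is what one would expect for this loss: MAPE divides each error by the true value, so a constant factor applied to both targets and predictions cancels out. The sketch below illustrates this with made-up predictions; the `mape` helper is written here for illustration and is not the Keras implementation.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [63, 42, 28, 15, 72]   # first five Pawpularity scores
y_pred = [40, 40, 40, 40, 40]   # made-up predictions, for illustration only
scale = 100                     # maxScore in this notebook

raw = mape(y_true, y_pred)
scaled = mape([t / scale for t in y_true], [p / scale for p in y_pred])
print(raw, scaled)  # equal up to floating-point rounding
```

The metric itself is scale-invariant; any remaining differences between runs would come from how the optimizer behaves with smaller target values, not from the loss definition.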

200 epochs seem to be unnecessary; will reduce to 100

#### Mean squared error
100 epochs (the logged loss equals mean_squared_error, so this run optimized MSE rather than mean absolute error)
- loss: 418.9434 - val_loss: 432.4336
- mean_squared_error: 418.9434 - val_mean_squared_error: 432.4336
- mean_absolute_error: 15.2811 - val_mean_absolute_error: 15.5180
- mean_absolute_percentage_error: 78.0882 - val_mean_absolute_percentage_error: 78.1331
- root_mean_squared_error: 20.4681 - val_root_mean_squared_error: 20.7950
- scores worsened significantly after adding metrics
    - mean absolute percentage error worsened from 58 to 78 percent, although there appears to be no overfitting
    - since the loss in this run is no longer MAPE, the change of loss rather than the added metrics is the likely cause
    - tried working with scaled data again; scaling made no difference
    
Restarting the kernel led to an improvement
- to do: determine what impact including RMSE as a metric has
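
On the RMSE question: metrics passed to `compile` are only evaluated and reported, they do not enter the gradient, so including RMSE should not by itself change training. RMSE is also just the square root of MSE, which can be checked against the values logged above. This is a small sketch using only the numbers from that log.

```python
import math

# Values copied from the 100-epoch log above.
logged = {
    "train": {"mse": 418.9434, "rmse": 20.4681},
    "val": {"mse": 432.4336, "rmse": 20.7950},
}

for split, values in logged.items():
    recomputed = math.sqrt(values["mse"])
    print(f"{split}: sqrt(mse)={recomputed:.4f}, logged rmse={values['rmse']:.4f}")
```

Both recomputed values agree with the logged RMSE to within rounding, so the RMSE metric adds no information beyond MSE here; it is only easier to read because it is on the same scale as the Pawpularity score.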

In [None]:
model= 