
# Petfinder.my Adoption Prediction
How cute is that doggy in the shelter?

## Competition Description:

Millions of stray animals suffer on the streets or are euthanized in shelters every day around the world. If homes can be found for them, many precious lives can be saved — and more happy families created.

PetFinder.my has been Malaysia’s leading animal welfare platform since 2008, with a database of more than 150,000 animals. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare.

Animal adoption rates are strongly correlated to the metadata associated with their online profiles, such as descriptive text and photo characteristics. As one example, PetFinder is currently experimenting with a simple AI tool called the Cuteness Meter, which ranks how cute a pet is based on qualities present in their photos.

In this competition you will be developing algorithms to predict the adoptability of pets - specifically, how quickly is a pet adopted? If successful, they will be adapted into AI tools that will guide shelters and rescuers around the world on improving their pet profiles' appeal, reducing animal suffering and euthanization.

Top participants may be invited to collaborate on implementing their solutions into AI tools for assessing and improving pet adoption performance, which will benefit global animal welfare.

## Objective:
you will predict the speed at which a pet is adopted, based on the pet’s listing on PetFinder. Sometimes a profile represents a group of pets. In this case, the speed of adoption is determined by the speed at which all of the pets are adopted. The data included text, tabular, and image data. See below for details. 

#### AdoptionSpeed

Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way: 
<br>0 - Pet was adopted on the same day as it was listed. 
<br>1 - Pet was adopted between 1 and 7 days (1st week) after being listed. 
<br>2 - Pet was adopted between 8 and 30 days (1st month) after being listed. 
<br>3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed. 
<br>4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).



__File descriptions:__
<br>train.csv - Tabular/text data for the training set
<br>test.csv - Tabular/text data for the test set
<br>sample_submission.csv - A sample submission file in the correct format
<br>breed_labels.csv - Contains Type, and BreedName for each BreedID. Type 1 is dog, 2 is cat.
<br>color_labels.csv - Contains ColorName for each ColorID
<br>state_labels.csv - Contains StateName for each StateID


Data Fields

- PetID - Unique hash ID of pet profile
- AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
- Type - Type of animal (1 = Dog, 2 = Cat)
- Name - Name of pet (Empty if not named)
- Age - Age of pet when listed, in months
- Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
- Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
- Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
- Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
- Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
- Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
- MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
- Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
- Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
- Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- Quantity - Number of pets represented in profile
- Fee - Adoption fee (0 = Free)
- State - State location in Malaysia (Refer to StateLabels dictionary)
- RescuerID - Unique hash ID of rescuer
- VideoAmt - Total uploaded videos for this pet
- PhotoAmt - Total uploaded photos for this pet
- Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.

### Evaluation:

scored based on the quadratic weighted kappa, which measures the agreement between two ratings. This metric typically varies from 0 (random agreement between raters) to 1 (complete agreement between raters). In the event that there is less agreement between the raters than expected by chance, the metric may go below 0. The quadratic weighted kappa is calculated between the scores which are expected/known and the predicted scores.

Results have 5 possible ratings, 0,1,2,3,4.  The quadratic weighted kappa is calculated as follows. First, an N x N histogram matrix O is constructed, such that Oi,j corresponds to the number of adoption records that have a rating of i (actual) and received a predicted rating j. An N-by-N matrix of weights, w, is calculated based on the difference between actual and predicted rating scores:

$$\begin{equation*}
w_{ij} = \frac{(i-j)^2}{(N-1)^2}
\end{equation*} $$


An N-by-N histogram matrix of expected ratings, E, is calculated, assuming that there is no correlation between rating scores.  This is calculated as the outer product between the actual rating's histogram vector of ratings and the predicted rating's histogram vector of ratings, normalized such that E and O have the same sum.

From these three matrices, the quadratic weighted kappa is calculated as: 

$$\begin{equation*}
\kappa = 1 - \frac{\sum_(w_{ij} O_{ij})}{\sum_(w_{ij} E_{ij})}
\end{equation*} $$



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os, json

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import FeatureUnion

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import ExtraTreesClassifier as ET
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import cross_val_predict

  from numpy.core.umath_tests import inner1d


In [2]:
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999

In [3]:
from sklearn.metrics import make_scorer, cohen_kappa_score

# Metric used for this competition (Quadratic Weigthed Kappa aka Quadratic Cohen Kappa Score)
def metric(y1,y2):
    return cohen_kappa_score(y1,y2, weights='quadratic')

# Make scorer for scikit-learn
scorer = make_scorer(metric)

## Features to add from train dataset
- number of colors
- pure bred or mixed breed
- Description length

In [4]:
color_cols = ['Color1', 'Color2', 'Color3']

In [9]:
def set_breed(row):
    """ this function sets the Mixed Breed Indicator for the pet
        a dog is set as Mixed Breed if either Breed1 or Breed2 = 370
        or both Breed1 and Breed2 is set to some value greater than 0.
        A cat is set as Mixed Breed if either Breed1 or Breed2 = 266
        or both Breed1 and Breed2 is set to some value greater than 0.
    """
    if (row['Type'] == 'Dog'): 
        if (((row['Breed1'] == 370) | (row['Breed2'] == 370)) |
            ((row['Breed1'] > 0) & (row['Breed2'] > 0))):
            return 1
        else:
            return 0
    else: # if 'Cat'
        if (((row['Breed1'] == 266) | (row['Breed2'] == 266)) |
            ((row['Breed1'] > 0) & (row['Breed2'] > 0))):
            return 1 #mixed
        else:
            return 0 #Pure

In [10]:
def get_desc_len(row):
    """ return length of Pet's profile Description"""
    return len(row['Description'])

In [11]:
def get_name_len(row):
    """ return length of Pet's Name"""
    return len(row['Name'])

In [12]:
def get_color_count(row):
    """ return the number of colors of the pet"""
    color_count = 0
    
    for color in color_cols:
        if row[color] > 0:
            color_count +=1
            
    return color_count

In [13]:
def get_age_cat(row):
    """ return the age category for dogs and cats """
    if row['Type'] == 'Dog':
        if (row['Age'] >= 0) & (row['Age'] < 24):
            return 'dog_0_23'
        elif (row['Age'] >= 24) & (row['Age'] < 72):
            return 'dog_24_71'
        elif (row['Age'] >= 72) & (row['Age'] < 120):
            return 'dog_72_119'
        elif row['Age'] >= 120:
            return 'dog_120_above'
    else:
        if (row['Age'] >= 0) & (row['Age'] < 2):
            return 'cat_0_1'
        elif (row['Age'] >= 2) & (row['Age'] < 6):
            return 'cat_2_5'
        elif (row['Age'] >= 6) & (row['Age'] < 12):
            return 'cat_6_11'
        elif row['Age'] >= 12:
            return 'cat_12_above'        

In [14]:
def process_sentiment(filetype):
    """ This function processes the description sentiment file and creates a sentiment score dataframe.
        the filetype (train or test) will be a required parameter"""
    
    path_to_json = '../data/{}_sentiment/'.format(filetype)
    json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
    json_dict = {'pet_id':[],'magnitude':[], 'score':[]}

    for index, js in enumerate(json_files):
     
        with open(os.path.join(path_to_json, js)) as json_file:
            json_text = json.load(json_file)
            json_dict['pet_id'].append(js.split('.')[0])
            json_dict['magnitude'].append(json_text['documentSentiment']['magnitude'])
            json_dict['score'].append(json_text['documentSentiment']['score'])
            
    sentiDF = pd.DataFrame(json_dict)
    return sentiDF

In [22]:
def add_features(df, filetype):
    """ This function adds new features to our dataFrame as part of pre-processing
        of train and test data.  The parameter filetype will either be (train or test)"""
    
    dfAdd = df.copy()

    #Handle Missing Values
    dfAdd['Name'] = dfAdd.Name.fillna('').values
    dfAdd['Description'] = dfAdd.Name.fillna('').values
    
    
    #Add Features
    dfAdd['Name_Ind'] = dfAdd['Name'].apply(lambda x: 1 if len(x) > 0 else 0)
    dfAdd['Name_Len'] = dfAdd.apply(lambda x: get_name_len(x), 1)
    dfAdd['Breed_Ind'] = dfAdd.apply(lambda x: set_breed(x), 1)
    dfAdd['Desc_Len'] = dfAdd.apply(lambda x: get_desc_len(x), 1)
    dfAdd['Color_count'] = dfAdd.apply(lambda x: get_color_count(x), 1)
    dfAdd['age_cat'] = dfAdd.apply(lambda x:get_age_cat(x), 1)  
  
    #Get descriptions for the different numeric categorical features
    colors_dict = {k: v for k, v in zip(color['ColorID'], color['ColorName'])}
    dfAdd['Color1_name'] = dfAdd['Color1'].apply(lambda x: '_'.join(colors_dict[x].split()) if x in colors_dict else 'Unknown')
    dfAdd['Color2_name'] = dfAdd['Color2'].apply(lambda x: '_'.join(colors_dict[x]) if x in colors_dict else '-')
    dfAdd['Color3_name'] = dfAdd['Color3'].apply(lambda x: '_'.join(colors_dict[x]) if x in colors_dict else '-')
    
    breeds_dict = {k: v for k, v in zip(breed['BreedID'], breed['BreedName'])}
    dfAdd['Breed1_name'] = dfAdd['Breed1'].apply(lambda x: '_'.join(breeds_dict[x].split()) if x in breeds_dict else 'Unknown')
    dfAdd['Breed2_name'] = dfAdd['Breed2'].apply(lambda x: '_'.join(breeds_dict[x]) if x in breeds_dict else '-')
    
    
    #After all the new columns have been added to the dataframe.  Join dfAll with desc_df to populate
    #the dataframe with the description sentiment score
    desc_df = process_sentiment(filetype)
    desc_df.rename(index=str, columns={'pet_id': 'PetID'}, inplace=True)
    dfAdd = dfAdd.merge(desc_df, on=['PetID'], how='left')
    dfAdd.fillna(0, inplace=True)
    
    dfAdd['sentiment'] = dfAdd.apply(lambda x: x['magnitude'] * x['score'], 1)
    
    return dfAdd

In [16]:
#load files

breed = pd.read_csv('../input/breed_labels.csv')
color = pd.read_csv('../input/color_labels.csv')
state = pd.read_csv('../input/state_labels.csv')

train = pd.read_csv('../input/train/train.csv')

test = pd.read_csv('../input/test/test.csv')

### Pre-processing

In [23]:
corr_num_cols = ['Type','Gender','sentiment', 'score', 'Color_count', 'Desc_Len', 'Name_Len', 
                 'PhotoAmt', 'Quantity', 'Age', 'MaturitySize', 'FurLength',
                 'Sterilized','Dewormed', 'Vaccinated','Health','Fee','State', 'Name_Ind',
                 'Breed_Ind','Color1', 'Color2', 'Color3','Breed1', 'Breed2',]
corr_cat_cols = ['age_cat']

In [24]:

# Create a class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

In [25]:
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(corr_num_cols)),
    ('std_scaler', StandardScaler())
])

## Pre-process Train file

In [26]:
dfAll = add_features(train, 'train')

In [27]:
Xnum = num_pipeline.fit_transform(dfAll)

In [28]:
petFinder_cat = dfAll[corr_cat_cols]
PF_cat_encoded, PF_categories = petFinder_cat['age_cat'].factorize()

sc = StandardScaler()
sc.fit(PF_cat_encoded.reshape(-1,1))
cat_scaled = sc.transform(PF_cat_encoded.reshape(-1,1))



In [32]:
dfnew = pd.DataFrame(np.c_[Xnum, cat_scaled], columns=corr_num_cols + corr_cat_cols)

In [33]:
X  = dfnew.as_matrix()
y = dfAll['AdoptionSpeed'].values

print(X.shape, y.shape)

(14993, 26) (14993,)


## Pre-Process Test File

In [34]:
dfTest = add_features(test, 'test')

In [36]:
Test_X.shape

(3948, 25)

In [35]:
Test_X = num_pipeline.fit_transform(dfTest)

In [37]:
PF_cat = dfTest[corr_cat_cols]
PF_cat_encoded, PF_categories = PF_cat['age_cat'].factorize()

sc = StandardScaler()
sc.fit(PF_cat_encoded.reshape(-1,1))
cat_scaled = sc.transform(PF_cat_encoded.reshape(-1,1))



In [2]:
Test_df = pd.DataFrame(np.c_[Test_X, cat_scaled], columns=corr_num_cols + corr_cat_cols)
Test_df.head()

NameError: name 'pd' is not defined

In [1]:
Test_X = Test_df.values

NameError: name 'Test_df' is not defined

In [41]:
Test_X.shape

(3948, 26)

In [29]:
X.shape

(14993, 425)

## Train-Predict Models

In [42]:
SEED = 42

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)


### SVM Classifier (kernel='poly')

In [44]:
poly_kernel_svm_clf = SVC(kernel="poly", degree=10, coef0=1, C=5)
poly_kernel_svm_clf.fit(X_train, y_train)


SVC(C=5, cache_size=200, class_weight=None, coef0=1,
  decision_function_shape=None, degree=10, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [45]:
y_pred = poly_kernel_svm_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(metric(y_test, y_pred))

0.32147621164962203
0.17943174936913753


### Bagging Classifier with Random Forest

In [46]:
# Train Random Forest Classifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=3, criterion='entropy')
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(metric(y_test, y_pred))

0.41329479768786126
0.3665934991490325


In [47]:
from sklearn.ensemble import BaggingClassifier

bag_clf = BaggingClassifier(rf, random_state=42, n_estimators=100)

In [48]:
bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=3, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=100, n_jobs=1, oob_score=False,
         random_state=42, verbose=0, warm_start=False)

In [49]:
y_bag_pred = bag_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_bag_pred)
score = metric(y_test, y_bag_pred)
print(accuracy, score)

0.4266340595820365 0.37426304220949824


### Gradient Boosting 
[hyperparameter tuning](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)



In [50]:
from sklearn.ensemble import GradientBoostingClassifier

In [57]:
gbt_clf = GradientBoostingClassifier(n_estimators=300, max_depth=1, random_state=SEED)

In [58]:
gbt_clf.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=1,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=300, presort='auto', random_state=42,
              subsample=1.0, verbose=0, warm_start=False)

In [59]:
y_pred = gbt_clf.predict(X_test)
score = metric(y_test, y_pred)
print(score)  # original score: 0.29297979606827806

0.3325686769219627


In [None]:
sgbt_clf = GradientBoostingClassifier(n_estimators=300, max_depth=1, 
                                      subsample=0.8,
                                      max_features=0.7,
                                      random_state=SEED)

In [None]:
sgbt_clf.fit(X_train, y_train)
y_pred = sgbt_clf.predict(X_test)
score = metric(y_test, y_pred)
print(score) #original score: 0.30550830837124776

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_list = {
              'n_estimators': [300,400,500],
              'max_depth': [4, 6, 8],
              'min_samples_split': [50, 100, 120],
              'min_samples_leaf': [5, 10, 15],
              'learning_rate': [0.1, 0.3, 0.5]
            
    
}

In [None]:
grid_gbt = GridSearchCV(estimator=gbt_clf, param_grid=param_list, cv=3,
                       scoring='accuracy',verbose=1,n_jobs=-1)

In [None]:
grid_gbt.fit(X_train, y_train)

In [None]:
grid_gbt.best_estimator_

In [None]:
y_gs_pred = grid_gbt.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, y_gs_pred)

In [None]:
accuracy

In [None]:
score = metric(y_test, y_gs_pred)
print(score) 

In [51]:
sgbt_clf2 = GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=2,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=3,
              min_samples_split=100, min_weight_fraction_leaf=0.0,
              n_estimators=200, presort='auto', random_state=42,
              subsample=1.0, verbose=0, warm_start=False)

In [52]:
sgbt_clf2.fit(X_train, y_train)
y_sgbt2_pred = sgbt_clf2.predict(X_test)
accuracy = accuracy_score(y_test, y_sgbt2_pred)
score = metric(y_test, y_sgbt2_pred)
print(accuracy, score) 

0.41751889728768343 0.3915879000654472


### Prepare Submission

In [60]:
petid = list(dfTest.PetID.values)

In [62]:
len(petid)

3948

In [53]:
Test_Pred = sgbt_clf2.predict(Test_X)

In [64]:
sub_df = pd.DataFrame({'PetID': petid, 'AdoptionSpeed': list(Test_Pred)})

In [66]:
sub_df = sub_df[['PetID', 'AdoptionSpeed']]
sub_df.head()

Unnamed: 0,PetID,AdoptionSpeed
0,378fcc4fc,2
1,73c10e136,4
2,72000c4c5,4
3,e147a4b9f,4
4,43fbba852,4


In [67]:
sub_df.to_csv('submission.csv', index=False)