When a stray animal is brought to a shelter, its breed is not always apparent. Either an expert must assess the animal to determine its breed, or the staff on hand simply guess, sometimes labelling it as "mixed breed" when no primary category is apparent. We believe that an accurately labelled breed is important for boosting an animal's chance of adoption, and that automating the process would save experts' time and let shelters without easy access to an expert label breeds confidently. We decided to build a classifier for dog breeds only, to narrow the scope to a problem that a single classifier could handle.

The dataset we use to train our classifier comes from the Petfinder API. The API exposes data from petfinder.com, a website that aggregates pet listings from animal shelters, and lets a client search for and retrieve listings based on an animal's characteristics. To get our data, we requested all animals and filtered out the ones that weren't dogs. Each response has many properties, but we kept only the ones we felt would be most useful to the classifier. The properties we removed include media links and references to other animals or organizations in the API.
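The filtering and trimming step described above can be sketched roughly as follows. This is a minimal illustration, not the code we ran: the nested field names (`type`, `breeds`, `colors`, `environment`, `photos`) and both helper functions are assumptions about the shape of a listing, chosen so that the flattened keys match the columns used later in this notebook.

```python
# Hypothetical helper: flatten one listing into the columns we keep.
# The nested keys below are assumed, not taken from the API documentation.
def flatten_animal(animal):
    breeds = animal.get('breeds') or {}
    colors = animal.get('colors') or {}
    env = animal.get('environment') or {}
    return {
        'id': animal.get('id'),
        'mixed_breed': breeds.get('mixed'),
        'unknown_breed': breeds.get('unknown'),
        'primary_breed': breeds.get('primary'),
        'secondary_breed': breeds.get('secondary'),
        'primary_color': colors.get('primary'),
        'secondary_color': colors.get('secondary'),
        'tertiary_color': colors.get('tertiary'),
        'age': animal.get('age'),
        'size': animal.get('size'),
        'gender': animal.get('gender'),
        'coat': animal.get('coat'),
        'good_with_children': env.get('children'),
        'good_with_other_dogs': env.get('dogs'),
        'good_with_cats': env.get('cats'),
    }

# Hypothetical helper: keep only dogs and drop everything we do not need
# (media links, organization references, and so on).
def keep_dogs(animals):
    return [flatten_animal(a) for a in animals if a.get('type') == 'Dog']

# Made-up sample records, for illustration only.
sample = [
    {'id': 1, 'type': 'Dog', 'age': 'Young', 'size': 'Small',
     'gender': 'Female', 'coat': 'Medium',
     'breeds': {'primary': 'Terrier', 'secondary': None,
                'mixed': True, 'unknown': False},
     'colors': {'primary': 'Yellow / Tan / Blond / Fawn',
                'secondary': None, 'tertiary': None},
     'environment': {'children': True, 'dogs': True, 'cats': True},
     'photos': ['placeholder']},
    {'id': 2, 'type': 'Cat', 'age': 'Adult'},
]

rows = keep_dogs(sample)
print(rows)
```

Flattening the nested objects into one level keeps each listing a single row, which is what makes it straightforward to store in a database table.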

In [97]:
import pandas as pd
import numpy as np
import sqlite3 as lite
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

*Setup*

We stored the data from the API in a SQLite database, so here we load it into a DataFrame.

In [154]:
conn = lite.connect('pets.db')
data = pd.read_sql_query('SELECT * FROM pet', conn)
data

Unnamed: 0,id,mixed_breed,primary_color,secondary_color,tertiary_color,age,size,gender,coat,good_with_children,good_with_other_dogs,good_with_cats,unknown_breed,primary_breed,secondary_breed
0,1,1,Yellow / Tan / Blond / Fawn,,,Young,Small,Female,Medium,1.0,1.0,1.0,0,Terrier,
1,2,0,,,,Baby,Small,Male,Short,1.0,1.0,1.0,0,Chihuahua,
2,3,1,Yellow / Tan / Blond / Fawn,,,Adult,Medium,Female,Short,,,,0,Golden Retriever,Shepherd
3,4,0,,,,Baby,Small,Female,Short,1.0,1.0,1.0,0,Chihuahua,
4,5,1,Yellow / Tan / Blond / Fawn,,,Adult,Medium,Female,,1.0,1.0,,0,Retriever,Hound
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1239,1240,1,,,,Senior,Small,Female,,1.0,1.0,1.0,0,Bichon Frise,Poodle
1240,1241,1,,,,Baby,Large,Female,,1.0,1.0,1.0,0,Labrador Retriever,
1241,1242,0,,,,Baby,Small,Female,,,,,0,Mixed Breed,
1242,1243,0,,,,Young,Large,Male,,,,,0,German Shepherd Dog,


*Data exploration*

The first thing we looked at was the distribution of our data. The breed counts appear to have a long tail of breeds that occur only once and to be heavily concentrated at the head, which suggests a Zipf-like distribution.
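One rough way to check the Zipf impression is to plot or fit count against rank on a log-log scale, where a Zipf-like distribution is close to a straight line with negative slope. The sketch below uses made-up labels rather than our data, and `zipf_slope` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def zipf_slope(labels):
    """Fit log(count) against log(rank) and return (counts, slope)."""
    counts = pd.Series(labels).value_counts()      # sorted most to least common
    ranks = np.arange(1, len(counts) + 1)
    # Least-squares line through the log-log points; polyfit returns
    # the coefficients highest degree first, so slope comes before intercept.
    slope, _intercept = np.polyfit(np.log(ranks), np.log(counts.to_numpy()), 1)
    return counts, slope

# Made-up breed labels whose counts fall off roughly as 1 / rank.
toy_labels = (['Labrador Retriever'] * 12 + ['Pit Bull Terrier'] * 6
              + ['Chihuahua'] * 4 + ['Beagle'] * 3 + ['Pug'] * 2)

counts, slope = zipf_slope(toy_labels)
print(counts)
print('log-log slope:', slope)
```

On the real data this would be called as `zipf_slope(data['primary_breed'])`. A slope near -1 is the textbook Zipf case, but with only about 1,200 rows the fit should be read as a rough indication rather than a test.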

We also looked at the distribution of the columns.

We noticed that some of the non-binary categorical columns, such as the three color columns, had significant percentages of NA values.




In [161]:
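def print_counts(counts):
    # Print a column's value counts and the share of rows that are non-null.
    # value_counts() skips NA values, so counts.sum() is the number of filled rows.
    print('***', counts.name, '***\n',
          counts, '\n'
          '% filled:', counts.sum() / data.shape[0] * 100, '\n')

# apply() is used only for its printing side effect; it returns a Series of None.
data.apply(lambda col: print_counts(col.value_counts()))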

*** id ***
 1244    1
417     1
410     1
411     1
412     1
       ..
831     1
832     1
833     1
834     1
1       1
Name: id, Length: 1244, dtype: int64 
% filled: 100.0 

*** mixed_breed ***
 1    979
0    265
Name: mixed_breed, dtype: int64 
% filled: 100.0 

*** primary_color ***
 Black                               141
Tricolor (Brown, Black, & White)     53
White / Cream                        50
Brown / Chocolate                    44
Bicolor                              42
Brindle                              25
Apricot / Beige                      24
Red / Chestnut / Orange              21
Yellow / Tan / Blond / Fawn          18
Gray / Blue / Silver                 12
Golden                                7
Merle (Blue)                          3
Harlequin                             2
Sable                                 1
Name: primary_color, dtype: int64 
% filled: 35.61093247588424 

*** secondary_color ***
 White / Cream                  119
Yellow / Tan / Blond / F

id                      None
mixed_breed             None
primary_color           None
secondary_color         None
tertiary_color          None
age                     None
size                    None
gender                  None
coat                    None
good_with_children      None
good_with_other_dogs    None
good_with_cats          None
unknown_breed           None
primary_breed           None
secondary_breed         None
dtype: object
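If only the fill percentages are needed, they can be computed for every column at once instead of being printed one column at a time. This is a small sketch on a made-up frame; on our data the same expression would be applied to `data`.

```python
import pandas as pd

# Made-up frame with one sparsely filled column.
toy = pd.DataFrame({
    'primary_color': ['Black', None, None, 'Golden'],
    'size': ['Small', 'Small', 'Large', 'Medium'],
})

# notna() marks filled cells; the mean of a boolean column is its filled share.
pct_filled = toy.notna().mean() * 100
print(pct_filled)
```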

*Data cleaning*

We clean the data by setting all NA values to the mode of their column, except in the secondary and tertiary color columns, which we handle separately below. We also drop the 'id' column, as it carries no information about breed (and keeping it lets the classifier cheat a little!).

In [153]:
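# Replace missing values in a column with that column's most frequent value.
fill_mode = lambda col: col.fillna(col.mode()[0])

# Drop 'id', mode-fill everything except the secondary/tertiary colors,
# then re-attach those two columns untouched.
data = data.drop(['id', 'secondary_color', 'tertiary_color'], axis=1) \
    .apply(fill_mode, axis=0) \
    .join(data[['secondary_color', 'tertiary_color']])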

*Feature engineering*

We also tried adding columns such as `breed_counts` and `breed_group`, which massively increased our accuracy. We then realized that, because these columns are derived from the labels, we were indirectly giving the classifier the answers. In real-world use we also couldn't determine these properties for an animal of unknown breed, so the columns would have been useless anyway.
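The leakage problem can be illustrated with a toy example. The sketch below scores each feature by how well a simple lookup (predict the most common breed for each feature value) reproduces the labels. The data, the breed-to-group mapping, and the `lookup_accuracy` helper are all made up for illustration, and the score is computed on the same rows it was fitted on, so it is only a rough indicator.

```python
import pandas as pd

# Made-up rows; 'breed_group' is derived directly from the label.
toy = pd.DataFrame({
    'primary_breed': ['Chihuahua', 'Chihuahua', 'Pug', 'Beagle',
                      'Beagle', 'Boxer', 'Boxer', 'Boxer'],
    'size': ['Small', 'Small', 'Small', 'Small',
             'Medium', 'Medium', 'Large', 'Large'],
})
group_of = {'Chihuahua': 'Toy', 'Pug': 'Toy', 'Beagle': 'Hound', 'Boxer': 'Working'}
toy['breed_group'] = toy['primary_breed'].map(group_of)

def lookup_accuracy(df, feature, label):
    """Share of rows matched by predicting each feature value's most common label."""
    # For each feature value, count the rows that carry its most frequent label.
    best = df.groupby(feature)[label].agg(lambda s: s.value_counts().iloc[0])
    return best.sum() / len(df)

print('size:       ', lookup_accuracy(toy, 'size', 'primary_breed'))
print('breed_group:', lookup_accuracy(toy, 'breed_group', 'primary_breed'))
```

The label-derived column scores noticeably higher than the honest feature, which is the same effect we saw as a jump in accuracy.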

In [130]:
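# Split a color string such as 'Yellow / Tan / Blond / Fawn' into a list of
# individual color names; missing values (stringified as 'None') become [].
split_colors = lambda index: (data[index].astype(str)
                              .apply(lambda s: s.split(' / '))
                              .apply(lambda o: [i for i in o if i != 'None']))

# Concatenate the three per-row lists and keep the unique color names.
data['colors'] = (split_colors('primary_color') \
                  + split_colors('secondary_color') \
                  + split_colors('tertiary_color') \
                 ).apply(lambda o: np.unique(o))

# Number of distinct color names per animal.
data['color_counts'] = data['colors'].apply(lambda o: len(o))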
# data['color_counts'] = data['primary_color'].apply(lambda x: len(x) if x else 0) + data['secondary_color'].apply(lambda x: len(x) if x else 0) + data['tertiary_color'].apply(lambda x: len(x) if x else 0)

# data['breed_counts'] = data['primary_breed'].apply(lambda x: 1 if x else 0) + data['secondary_breed'].apply(lambda x: 1 if x else 0)
# data['Mixed'] = data['primary_breed'].apply(lambda x: 1 if x=='Labrador Retriever' or x=='Husky' else 0)
# data['Terrier'] = data['primary_breed'].apply(lambda x: 1 if x=='Pit Bull Terrier' or x=='Terrier' or x=='American Staffordshire Terrier' or x=='Staffordshire Bull Terrier' or x=='Jack Russell Terrier' or x=='Cairn Terrier' or x=='Border Terrier' else 0)
# data['Toy'] = data['primary_breed'].apply(lambda x: 1 if x=='Chihuahua' or x=='Shih Tzu' or x =='Miniature Pinscher' or x=='Parson Russell Terrier' or x=='Rat Terrier' or x=='Maltese' or x=='Pug' or x=='Yorkshire Terrier' else 0)
# data['Herding'] = data['primary_breed'].apply(lambda x: 1 if x=='German Shepherd Dog' or x=='Shepherd' or x=='Border Collie' or x=='Australian Cattle Dog / Blue Heeler' or x=='Catahoula Leopard Dog' or x=='Australian Shepherd' or x=='Cattle Dog' or x=='Corgi' or x=='Collie' or x=='Belgian Shepherd / Malinois' else 0)
# data['Working'] = data['primary_breed'].apply(lambda x: 1 if x=='Boxer' or x=='Siberian Husky' or x=='American Bulldog' or x=='Great Pyrenees' or x=='Doberman Pinscher' or x=='Schnauzer' or x=='Rottweiler' or x=='Mastiff' or x=='Akita' or x=='Alaskan Malamute' or x=='Newfoundland Dog' else 0)
# data['Hound'] = data['primary_breed'].apply(lambda x: 1 if x=='Beagle' or x=='Hound' or x=='Dachshund' or x=='Basset Hound' or x=='Plott Hound' or x=='Treeing Walker Coonhound' or x=='Coonhound' or x=='Basenji' else 0)
# data['Non-sporting'] = data['primary_breed'].apply(lambda x: 1 if x=='Poodle' or x=='American Eskimo Dog' or x=='Boston Terrier' else 0)
# data['Sporting'] = data['primary_breed'].apply(lambda x: 1 if x=='Golden Retriever' or x=='Pointer' or x=='Retriever' or x=='Black Labrador Retriever' or x=='Cocker Spaniel' else 0)
# data['Hound'] = data['primary_breed'].apply(lambda x: 1 if x=='Greyhound' else 0)

data
# data['primary_breed'].value_counts().head(60)

Unnamed: 0,id,mixed_breed,primary_color,age,size,gender,coat,good_with_children,good_with_other_dogs,good_with_cats,unknown_breed,primary_breed,secondary_breed,secondary_color,tertiary_color,colors,color_counts
0,1,1,Yellow / Tan / Blond / Fawn,Young,Small,Female,Medium,1.0,1.0,1.0,0,Terrier,Mixed Breed,,,"[Blond, Fawn, Tan, Yellow]",4
1,2,0,Black,Baby,Small,Male,Short,1.0,1.0,1.0,0,Chihuahua,Mixed Breed,,,[Black],1
2,3,1,Yellow / Tan / Blond / Fawn,Adult,Medium,Female,Short,1.0,1.0,1.0,0,Golden Retriever,Shepherd,,,"[Blond, Fawn, Tan, Yellow]",4
3,4,0,Black,Baby,Small,Female,Short,1.0,1.0,1.0,0,Chihuahua,Mixed Breed,,,[Black],1
4,5,1,Yellow / Tan / Blond / Fawn,Adult,Medium,Female,Short,1.0,1.0,1.0,0,Retriever,Hound,,,"[Blond, Fawn, Tan, Yellow]",4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1239,1240,1,Black,Senior,Small,Female,Short,1.0,1.0,1.0,0,Bichon Frise,Poodle,,,[Black],1
1240,1241,1,Black,Baby,Large,Female,Short,1.0,1.0,1.0,0,Labrador Retriever,Mixed Breed,,,[Black],1
1241,1242,0,Black,Baby,Small,Female,Short,1.0,1.0,1.0,0,Mixed Breed,Mixed Breed,,,[Black],1
1242,1243,0,Black,Young,Large,Male,Short,1.0,1.0,1.0,0,German Shepherd Dog,Mixed Breed,,,[Black],1


Here is where we balance the data. After filling in NaN values with the mode of each column, we realized we had a very unbalanced sample, which followed something of a Zipf distribution. To fix this, we used random oversampling, raising every minority breed to the sample count of the most common breed.
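To make the idea concrete, random oversampling can be sketched in plain pandas: every class smaller than the largest one gets extra rows drawn from itself with replacement. This is an illustration of the technique on made-up data, not the `RandomOverSampler` implementation, and `oversample_to_max` is a hypothetical helper.

```python
import pandas as pd

def oversample_to_max(df, label_col, random_state=0):
    """Duplicate rows of minority classes until every class matches the largest."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = [df]                      # keep every original row
    for label, n in counts.items():
        if n < target:
            # Draw the missing rows from this class, with replacement.
            extra = df[df[label_col] == label].sample(
                n=target - n, replace=True, random_state=random_state)
            parts.append(extra)
    return pd.concat(parts, ignore_index=True)

# Made-up, unbalanced sample.
toy = pd.DataFrame({
    'size': ['Small', 'Small', 'Small', 'Small', 'Large', 'Large', 'Medium'],
    'primary_breed': ['Chihuahua'] * 4 + ['Labrador Retriever'] * 2 + ['Beagle'],
})

balanced = oversample_to_max(toy, 'primary_breed')
print(balanced['primary_breed'].value_counts())
```

One caveat: oversampling before splitting puts copies of the same row into both the training and the test folds, so cross-validated scores computed afterwards can be optimistic. Resampling only the training portion of each fold avoids this.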

In [77]:

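Y = data['primary_breed']
# 'id' may already have been dropped during cleaning, so ignore it if missing.
X = data.drop(columns=['primary_breed', 'id'], errors='ignore')

# value_counts() sorts descending, so the first entry is the most common breed.
counts = Y.value_counts()
sample_threshold = counts.iloc[0]
minorities = counts[counts.values < sample_threshold]

# Oversample every minority breed up to the size of the most common breed.
ros = RandomOverSampler(sampling_strategy={key: sample_threshold for key in minorities.index})
# fit_resample replaces fit_sample, which newer imbalanced-learn releases removed.
X_ros, Y_ros = ros.fit_resample(X, Y)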

data = data.join(data.primary_color.str.join('|').str.get_dummies())
print(data)
# pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], 
#           axis=1).iloc[4:10]
# enc = OneHotEncoder(handle_unknown='ignore')
# enc.fit(X_ros)
# X_hot = enc.transform(X_ros).toarray()

# le = LabelEncoder()
# le.fit(Y_ros)
# Y_hot = le.transform(Y_ros)

# X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ValueError: columns overlap but no suffix specified: Index(['Beige', 'Bicolor', 'Black', 'Brindle', 'Chocolate', 'Cream', 'Fawn',
       'Golden', 'Harlequin', 'Merle (Blue)', 'Orange', 'Sable', 'Silver',
       'Tricolor (Brown, Black, & White)'],
      dtype='object')
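The `columns overlap` error is raised by `join` when the frame already contains columns with the same names as the new dummy columns, which happens, for example, when the cell is run a second time. A hedged sketch of one way around it is shown below: build the dummies from the list-valued `colors` column, give them a prefix, and drop any copies left over from an earlier run before joining. The prefix and the `add_color_dummies` helper are assumptions for illustration.

```python
import pandas as pd

def add_color_dummies(df, list_col='colors', prefix='color_'):
    """Join one 0/1 column per color name; safe to call more than once."""
    # Lists such as ['Black', 'White'] become 'Black|White', which
    # str.get_dummies() splits on '|' into indicator columns.
    dummies = df[list_col].str.join('|').str.get_dummies().add_prefix(prefix)
    # Drop columns left over from an earlier run so join sees no overlap.
    stale = [c for c in dummies.columns if c in df.columns]
    return df.drop(columns=stale).join(dummies)

# Made-up rows; the last animal has no recorded color.
toy = pd.DataFrame({'colors': [['Black', 'White'], ['Black'], []]})

once = add_color_dummies(toy)
twice = add_color_dummies(once)   # a second call no longer raises
print(twice)
```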

In [41]:
from sklearn.multiclass import OneVsRestClassifier

params = {'max_depth': [25, 30, 35, 40],
          'max_features': range(1, 28, 8),
          'min_samples_leaf': [1, 2, 3, 4]}

tree = DecisionTreeClassifier()
grid_search = GridSearchCV(tree, params, cv=5, scoring='accuracy')
# grid_search.fit(X_train, Y_train)
# print(grid_search.best_params_)

accuracy = cross_val_score(grid_search, X_hot, Y_hot, cv=10)
print("Tree accuracy:", accuracy.mean() * 100.0)

nb = GaussianNB()
accuracy = cross_val_score(nb, X, Y, cv=10)
print("NB accuracy:", accuracy.mean() * 100.0)


Tree accuracy: 75.28581367254756
NB accuracy: nan


ValueError: could not convert string to float: 'Apricot / Beige'

ValueError: could not convert string to float: 'Yellow / Tan / Blond / Fawn'

ValueError: could not convert string to float: 'Yellow / Tan / Blond / Fawn'

ValueError: could not convert string to float: 'Yellow / Tan / Blond / Fawn'

ValueError: could not convert string to float: 'Yellow / Tan / Blond / Fawn'

ValueError: could not convert string to float: 'Yellow / Tan / Blond / Fawn'

ValueError: could not convert string to float: 'Yellow / Tan / Blond / Fawn'

ValueError: could not convert string to float: 'Yellow / Tan / Blond / Fawn'

ValueError: could not convert string to float: 'Yellow / Tan / Blond / Fawn'

ValueError: could not convert string to float: 'Yellow / Tan / Blond / Fawn'
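The `could not convert string to float` errors come from passing raw string columns to `GaussianNB`, which needs numeric input. A minimal sketch of encoding the features first is shown below, on made-up rows and with only two feature columns; it mirrors the one-hot approach in the commented-out code above rather than reproducing our final setup.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'size': ['Small', 'Small', 'Large', 'Large', 'Medium', 'Small'],
    'coat': ['Short', 'Medium', 'Short', 'Short', 'Medium', 'Short'],
    'primary_breed': ['Chihuahua', 'Terrier', 'Labrador Retriever',
                      'German Shepherd Dog', 'Golden Retriever', 'Chihuahua'],
})
X_toy = toy[['size', 'coat']]
Y_toy = toy['primary_breed']

# One 0/1 column per category value; unseen categories encode as all zeros.
enc = OneHotEncoder(handle_unknown='ignore')
X_hot = enc.fit_transform(X_toy).toarray()

nb = GaussianNB().fit(X_hot, Y_toy)
pred = nb.predict(enc.transform(X_toy.head(2)).toarray())
print(X_hot.shape, pred)
```

`GaussianNB` models each feature as continuous, so a variant designed for binary or categorical inputs (such as `BernoulliNB`) may be a better match for one-hot columns.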



In [10]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report

Y_pred = cross_val_predict(grid_search, X, Y)
cm = confusion_matrix(Y, Y_pred)

# Keep the confusion matrix in `cm` and print the per-class report separately.
report = classification_report(Y, Y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       151
           1       1.00      0.72      0.83       151
           2       0.72      0.42      0.53       151
           3       0.99      1.00      0.99       151
           4       1.00      0.29      0.45       151
           5       1.00      1.00      1.00       151
           6       0.91      0.60      0.73       151
           7       0.76      0.66      0.70       151
           8       0.99      1.00      1.00       151
           9       0.98      0.53      0.69       151
          10       0.99      1.00      1.00       151
          11       1.00      0.36      0.53       151
          12       0.87      0.74      0.80       151
          13       0.86      0.67      0.75       151
          14       0.97      0.75      0.85       151
          15       0.99      1.00      1.00       151
          16       0.87      1.00      0.93       151
          17       0.82      0.89      0.

In [7]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Probability of the class in column 1; roc_curve expects a binary target.
prob_pos = grid_search.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(Y_test, prob_pos)

plt.plot([0,1],[0,1],'k--') #plot the diagonal line
plt.plot(fpr, tpr, label='Decision Tree') #plot the ROC curve
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('ROC Decision Tree')
plt.show()

auc = roc_auc_score(Y_test, prob_pos)
print("AUC Score:", auc)

ValueError: multiclass format is not supported
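`roc_curve` only handles binary targets, which is why it rejects our multiclass labels. One common workaround is a one-vs-rest treatment: compute a separate curve per class and let `roc_auc_score` average the per-class AUCs. The sketch below uses a synthetic three-class dataset and an untuned tree, so the numbers it prints say nothing about our classifier.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the encoded breed features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_redundant=0,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

tree_demo = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
proba = tree_demo.predict_proba(X_te)      # one probability column per class

# Macro-averaged one-vs-rest AUC over all classes.
auc_ovr = roc_auc_score(y_te, proba, multi_class='ovr')

# One binary ROC curve per class: that class against all the others.
y_bin = label_binarize(y_te, classes=tree_demo.classes_)
curves = {cls: roc_curve(y_bin[:, i], proba[:, i])[:2]
          for i, cls in enumerate(tree_demo.classes_)}

print('One-vs-rest AUC:', auc_ovr)
```

Each entry of `curves` holds an `(fpr, tpr)` pair that can be passed to `plt.plot` in the same way as the binary case.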