When a stray animal is brought to a shelter, its breed is not always apparent. Either an expert must assess the animal, or the staff on hand simply guess, sometimes labelling it "mixed breed" when no primary breed is apparent. We believe that an accurate breed label improves an animal's chance of adoption, and that automating the process would save experts' time and let shelters without easy access to an expert label breeds with confidence. To keep the problem tractable for a single classifier, we narrowed the scope to dog breeds.

The dataset we use to train our classifier comes from the Petfinder API, which serves data from petfinder.com, a website that aggregates pet listings from animal shelters. The API lets a client search for and retrieve pet listings based on an animal's characteristics. To get our data, we requested all animals and filtered out those that weren't dogs. Each response has many properties, but we kept only the ones we felt would be most useful to the classifier. The properties we removed included media links and references to other animals or organizations in the API.
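As a rough illustration of the filtering step described above, the sketch below keeps only dog listings and only a handful of fields. The record layout and the `species` and `photos` field names are assumptions made for this example, not the Petfinder API's documented schema; the kept fields reuse column names from our table.

```python
# Hypothetical raw listings: the field names "species" and "photos" are
# assumptions for illustration, not the Petfinder API's actual schema.
KEPT_FIELDS = ["age", "size", "gender", "coat", "primary_breed"]

def keep_dog_fields(listings):
    """Keep only dog listings, reduced to the fields in KEPT_FIELDS."""
    return [
        {field: animal.get(field) for field in KEPT_FIELDS}
        for animal in listings
        if animal.get("species") == "Dog"
    ]

listings = [
    {"species": "Dog", "age": "Young", "size": "Small", "gender": "Female",
     "coat": "Medium", "primary_breed": "Terrier", "photos": ["..."]},
    {"species": "Cat", "age": "Adult", "size": "Small", "gender": "Male",
     "coat": "Short", "primary_breed": "Tabby", "photos": []},
]
dogs = keep_dog_fields(listings)
print(dogs)
```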

In [2]:
import pandas as pd
import numpy as np
import sqlite3 as lite
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB, CategoricalNB

*Setup*

We stored the data from the API in a SQLite database, so here we load it into a DataFrame.

In [3]:
conn = lite.connect('pets.db')
data = pd.read_sql_query('SELECT * FROM pet', conn)
data

Unnamed: 0,id,mixed_breed,primary_color,secondary_color,tertiary_color,age,size,gender,coat,good_with_children,good_with_other_dogs,good_with_cats,unknown_breed,primary_breed,secondary_breed
0,1,1,Yellow / Tan / Blond / Fawn,,,Young,Small,Female,Medium,1.0,1.0,1.0,0,Terrier,
1,2,0,,,,Baby,Small,Male,Short,1.0,1.0,1.0,0,Chihuahua,
2,3,1,Yellow / Tan / Blond / Fawn,,,Adult,Medium,Female,Short,,,,0,Golden Retriever,Shepherd
3,4,0,,,,Baby,Small,Female,Short,1.0,1.0,1.0,0,Chihuahua,
4,5,1,Yellow / Tan / Blond / Fawn,,,Adult,Medium,Female,,1.0,1.0,,0,Retriever,Hound
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13177,13178,1,,,,Young,Large,Female,,,,,0,Boxer,Mixed Breed
13178,13179,1,,,,Adult,Large,Male,,,,,0,German Shepherd Dog,
13179,13180,1,,,,Adult,Large,Male,,,,,0,Boxer,Mixed Breed
13180,13181,0,,,,Adult,Medium,Male,,,,,0,Bullmastiff,


*Data exploration*

The first thing we looked at was the distribution of our data. The breed counts have a long tail of breeds that appear only once and are heavily concentrated at the head, which suggests a Zipf distribution.
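One rough way to check the Zipf observation is to fit a line to log(count) against log(rank): under Zipf's law the frequency is roughly proportional to 1/rank, so the slope should be near -1. The helper `zipf_slope` and the toy labels below are hypothetical and only stand in for the `primary_breed` column; this is a sketch, not the analysis we ran.

```python
import numpy as np
import pandas as pd

def zipf_slope(labels):
    """Slope of log(count) against log(rank); a value near -1 is Zipf-like."""
    counts = pd.Series(labels).value_counts().to_numpy()  # sorted descending
    ranks = np.arange(1, len(counts) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
    return slope

# Toy labels standing in for primary_breed, with counts 60, 30, 20, 15, 12
# (exactly 60 / rank).
toy_labels = ["A"] * 60 + ["B"] * 30 + ["C"] * 20 + ["D"] * 15 + ["E"] * 12
slope = zipf_slope(toy_labels)
print(round(slope, 2))
```

On the real data, `zipf_slope(data['primary_breed'])` would give a comparable summary, although a slope near -1 is only suggestive rather than a formal test.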

We also looked at the distribution of the columns.

We noticed that some of the non-binary categorical columns, such as the three color columns, had a large share of NA values (`primary_color` is only about 35% filled), which we address in feature engineering.




In [4]:
print_counts = lambda counts: print('***', counts.name, '***\n', \
                                    counts, '\n' \
                                   '% filled:', counts.sum() / data.shape[0] * 100, '\n')

data.apply(lambda col: print_counts(col.value_counts()))

*** id ***
 13182    1
4390     1
4400     1
4399     1
4398     1
        ..
8786     1
8785     1
8784     1
8783     1
1        1
Name: id, Length: 13182, dtype: int64 
% filled: 100.0 

*** mixed_breed ***
 1    9178
0    4004
Name: mixed_breed, dtype: int64 
% filled: 100.0 

*** primary_color ***
 Black                               1219
Tricolor (Brown, Black, & White)     513
Brown / Chocolate                    502
White / Cream                        496
Brindle                              316
Bicolor                              314
Apricot / Beige                      267
Yellow / Tan / Blond / Fawn          261
Red / Chestnut / Orange              255
Gray / Blue / Silver                 204
Golden                               122
Merle (Blue)                          30
Merle (Red)                           25
Sable                                 22
Harlequin                             18
Name: primary_color, dtype: int64 
% filled: 34.62297071764527 

*** secondary_c

id                      None
mixed_breed             None
primary_color           None
secondary_color         None
tertiary_color          None
age                     None
size                    None
gender                  None
coat                    None
good_with_children      None
good_with_other_dogs    None
good_with_cats          None
unknown_breed           None
primary_breed           None
secondary_breed         None
dtype: object

*Data cleaning*

We clean the data by filling NA values with the mode of each column, except for the secondary and tertiary colors, which we handle in feature engineering below. We also drop the `id` column and every breed-related column other than `primary_breed`, so the classifier cannot cheat. We chose the mode because all of our data is categorical, there are too many NA values for dropping rows to be viable, and we have only a few columns, so we wanted to preserve as many as we could.

In [5]:
fill_mode = lambda col: col.fillna(col.mode()[0])

data_cleaned = data.drop(['id', 'secondary_color', 'tertiary_color', 'secondary_breed', 'unknown_breed', 'mixed_breed'], axis=1) \
.apply(fill_mode, axis=0) \
.join(data[['secondary_color', 'tertiary_color']])

data_cleaned

Unnamed: 0,primary_color,age,size,gender,coat,good_with_children,good_with_other_dogs,good_with_cats,primary_breed,secondary_color,tertiary_color
0,Yellow / Tan / Blond / Fawn,Young,Small,Female,Medium,1.0,1.0,1.0,Terrier,,
1,Black,Baby,Small,Male,Short,1.0,1.0,1.0,Chihuahua,,
2,Yellow / Tan / Blond / Fawn,Adult,Medium,Female,Short,1.0,1.0,1.0,Golden Retriever,,
3,Black,Baby,Small,Female,Short,1.0,1.0,1.0,Chihuahua,,
4,Yellow / Tan / Blond / Fawn,Adult,Medium,Female,Short,1.0,1.0,1.0,Retriever,,
...,...,...,...,...,...,...,...,...,...,...,...
13177,Black,Young,Large,Female,Short,1.0,1.0,1.0,Boxer,,
13178,Black,Adult,Large,Male,Short,1.0,1.0,1.0,German Shepherd Dog,,
13179,Black,Adult,Large,Male,Short,1.0,1.0,1.0,Boxer,,
13180,Black,Adult,Medium,Male,Short,1.0,1.0,1.0,Bullmastiff,,


*Feature engineering*

We did our feature engineering on the color columns. Of these, only `primary_color` had enough data to be useful, but we still wanted to make use of the other two. Filling the secondary and tertiary colors with the mode would still leave garbage data, because around 90% of the rows would share the same value. Instead, we count the number of colors and store it in a new column, `color_count`. While doing this, we noticed that a single color field can list several values separated by ` / `, but because the field is a string, it was treated as one value. We fixed this by splitting each string into a list of its colors and removing `None` values. This gives a more accurate `color_count` and helps with encoding the data later. Finally, we combined the three color columns into `colors`, the union of the lists from `primary_color`, `secondary_color`, and `tertiary_color`. Because `secondary_color` and `tertiary_color` were too sparse to be useful on their own, combining the three let us get rid of the sparse features without throwing away their data.
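The splitting and union described above can be sketched on a single row as follows. The helper `merge_colors` is hypothetical and written for illustration; the notebook itself does the same work column-wise with pandas.

```python
def merge_colors(*color_fields):
    """Split each ' / '-separated field and return the sorted unique colors."""
    colors = set()
    for field in color_fields:
        if field is None:  # missing secondary or tertiary color
            continue
        colors.update(part.strip() for part in field.split(" / "))
    return sorted(colors)

# One row with a primary and a secondary color but no tertiary color.
colors = merge_colors("Yellow / Tan / Blond / Fawn", "White / Cream", None)
color_count = len(colors)
print(colors, color_count)
```

Note that a value such as `Tricolor (Brown, Black, & White)` contains no ` / ` separator, so it stays a single color under this scheme.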

We also tried adding columns such as `breed_counts` and `breed_group`, which massively increased our accuracy. We then realized that, because these columns are derived from the labels, we were indirectly giving the classifier the answers. In real-world use we also couldn't determine these properties for an animal of unknown breed, so they would be useless anyway.

In [6]:
data = data_cleaned

# Split a ' / '-separated color column into lists, dropping missing ('None') entries.
def split_colors(column):
    return data[column].astype(str).apply(
        lambda s: [c for c in s.split(' / ') if c != 'None'])

# Work on a copy so data_cleaned itself is left untouched.
data_engineered = data.copy()

# Union of the colors listed in the three color columns.
data_engineered['colors'] = (split_colors('primary_color')
                             + split_colors('secondary_color')
                             + split_colors('tertiary_color')
                            ).apply(np.unique)

data_engineered['color_count'] = data_engineered['colors'].apply(len)

data = data_engineered.drop(['primary_color', 'secondary_color', 'tertiary_color'], axis=1)
"""
"Cheating" code for breed groups.
data['breed_counts'] = data['primary_breed'].apply(lambda x: 1 if x else 0) + data['secondary_breed'].apply(lambda x: 1 if x else 0)
data['Mixed'] = data['primary_breed'].apply(lambda x: 1 if x=='Labrador Retriever' or x=='Husky' else 0)
data['Terrier'] = data['primary_breed'].apply(lambda x: 1 if x=='Pit Bull Terrier' or x=='Terrier' or x=='American Staffordshire Terrier' or x=='Staffordshire Bull Terrier' or x=='Jack Russell Terrier' or x=='Cairn Terrier' or x=='Border Terrier' else 0)
data['Toy'] = data['primary_breed'].apply(lambda x: 1 if x=='Chihuahua' or x=='Shih Tzu' or x =='Miniature Pinscher' or x=='Parson Russell Terrier' or x=='Rat Terrier' or x=='Maltese' or x=='Pug' or x=='Yorkshire Terrier' else 0)
data['Herding'] = data['primary_breed'].apply(lambda x: 1 if x=='German Shepherd Dog' or x=='Shepherd' or x=='Border Collie' or x=='Australian Cattle Dog / Blue Heeler' or x=='Catahoula Leopard Dog' or x=='Australian Shepherd' or x=='Cattle Dog' or x=='Corgi' or x=='Collie' or x=='Belgian Shepherd / Malinois' else 0)
data['Working'] = data['primary_breed'].apply(lambda x: 1 if x=='Boxer' or x=='Siberian Husky' or x=='American Bulldog' or x=='Great Pyrenees' or x=='Doberman Pinscher' or x=='Schnauzer' or x=='Rottweiler' or x=='Mastiff' or x=='Akita' or x=='Alaskan Malamute' or x=='Newfoundland Dog' else 0)
data['Hound'] = data['primary_breed'].apply(lambda x: 1 if x=='Beagle' or x=='Hound' or x=='Dachshund' or x=='Basset Hound' or x=='Plott Hound' or x=='Treeing Walker Coonhound' or x=='Coonhound' or x=='Basenji' or x=='Greyhound' else 0)
data['Non-sporting'] = data['primary_breed'].apply(lambda x: 1 if x=='Poodle' or x=='American Eskimo Dog' or x=='Boston Terrier' else 0)
data['Sporting'] = data['primary_breed'].apply(lambda x: 1 if x=='Golden Retriever' or x=='Pointer' or x=='Retriever' or x=='Black Labrador Retriever' or x=='Cocker Spaniel' else 0)
"""

data_engineered[['primary_color', 'secondary_color', 'tertiary_color', 'colors', 'color_count']]

Unnamed: 0,primary_color,secondary_color,tertiary_color,colors,color_count
0,Yellow / Tan / Blond / Fawn,,,"[Blond, Fawn, Tan, Yellow]",4
1,Black,,,[Black],1
2,Yellow / Tan / Blond / Fawn,,,"[Blond, Fawn, Tan, Yellow]",4
3,Black,,,[Black],1
4,Yellow / Tan / Blond / Fawn,,,"[Blond, Fawn, Tan, Yellow]",4
...,...,...,...,...,...
13177,Black,,,[Black],1
13178,Black,,,[Black],1
13179,Black,,,[Black],1
13180,Black,,,[Black],1


Next we do some more cleaning to address our imbalanced sample (the Zipfian distribution of classes). We used random oversampling because more sophisticated oversampling methods are geared toward continuous data, while all of our features are categorical.
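To make the idea concrete, here is a minimal pandas-only sketch of threshold-based random oversampling: every class with fewer rows than the threshold is topped up by resampling its own rows with replacement. The helper `oversample_to_threshold` and the toy data are hypothetical; the notebook itself relies on imbalanced-learn's `RandomOverSampler`.

```python
import pandas as pd

def oversample_to_threshold(df, label_col, threshold, seed=0):
    """Resample (with replacement) every class below `threshold` rows up to it."""
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) < threshold:
            # Draw the missing rows from this class's existing rows.
            extra = group.sample(n=threshold - len(group), replace=True,
                                 random_state=seed)
            group = pd.concat([group, extra])
        parts.append(group)
    return pd.concat(parts, ignore_index=True)

toy = pd.DataFrame({
    "size": ["Small"] * 10 + ["Large"] * 3 + ["Medium"],
    "breed": ["Chihuahua"] * 10 + ["Boxer"] * 3 + ["Bullmastiff"],
})
# Same rule as in the notebook: one fifth of the most common class's count.
threshold = toy["breed"].value_counts().iloc[0] // 5
balanced = oversample_to_threshold(toy, "breed", threshold)
print(balanced["breed"].value_counts().to_dict())
```

Because the new rows are exact copies, this adds no new information; it only stops the rarest breeds from being drowned out during training.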

In [21]:
from sklearn.utils import shuffle

data_shuf = shuffle(data).head(1000)
Y = data_shuf['primary_breed']
X = data_shuf.drop(['primary_breed'], axis=1)

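# Randomly oversample every breed with fewer samples than one fifth of the
# most common breed's count, bringing each up to that threshold.
counts = Y.value_counts()
sample_threshold = counts.iloc[0] // 5
minorities = counts[counts.values < sample_threshold]

ros = RandomOverSampler(sampling_strategy={ key: sample_threshold for key in minorities.index })
# fit_resample is the current imbalanced-learn method name (formerly fit_sample)
X_ros, Y_ros = ros.fit_resample(X, Y)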

# encoding
gle = LabelEncoder()
encode_field = lambda df, field: df.drop(field, axis=1).join(pd.DataFrame(gle.fit_transform(df[field]), columns=[field]))

X_encoded = X_ros.drop('colors', axis=1).join(X_ros['colors'].str.join('|').str.get_dummies())
X_encoded = encode_field(X_encoded, 'age')
X_encoded = encode_field(X_encoded, 'size')
X_encoded = encode_field(X_encoded, 'gender')
X_encoded = encode_field(X_encoded, 'coat')

X_encoded
# enc = OneHotEncoder(handle_unknown='ignore')
# enc.fit(X_ros)
# X_hot = enc.transform(X_ros).toarray()

# Y_enc = gle.fit_transform(Y_ros)

# Y_enc

# X_train, X_test, Y_train, Y_test = train_test_split(X_encoded, Y_ros, test_size=0.2)

Unnamed: 0,good_with_children,good_with_other_dogs,good_with_cats,color_count,Apricot,Beige,Bicolor,Black,Blond,Blue,...,Sable,Silver,Tan,"Tricolor (Brown, Black, & White)",White,Yellow,age,size,gender,coat
0,1.0,1.0,1.0,6,0,0,0,0,1,0,...,0,0,1,0,1,1,3,2,1,3
1,1.0,1.0,1.0,2,1,1,0,0,0,0,...,0,0,0,0,0,0,0,3,1,3
2,1.0,1.0,1.0,2,1,1,0,0,0,0,...,0,0,0,0,0,0,0,3,0,3
3,1.0,1.0,1.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,3
4,1.0,1.0,1.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,2,2,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3085,1.0,1.0,1.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,2,3,0,3
3086,1.0,1.0,1.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,3,3,0,3
3087,1.0,1.0,1.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,3,3,0,3
3088,1.0,1.0,1.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,2,3,0,3


In [26]:
from sklearn.multiclass import OneVsRestClassifier

# Hyperparameter grid for the decision tree. The names carry no 'clf__' prefix
# because the tree is passed to GridSearchCV directly, not inside a Pipeline.
params = {'max_depth': [25, 30, 35, 40],
          'max_features': range(1, 28, 8),
          'min_samples_leaf': [1, 2, 3, 4]}

tree = DecisionTreeClassifier()
grid_search = GridSearchCV(tree, params, cv=5, scoring='accuracy')
# pipe = Pipeline([('smpl', RandomOverSampler(sampling_strategy=0.2)), ])
ovr = OneVsRestClassifier(grid_search)

accuracy = cross_val_score(ovr, X_encoded, Y_ros, cv=10)
print("Tree accuracy:", accuracy.mean() * 100.0)

nb = GaussianNB()
ovr = OneVsRestClassifier(nb, n_jobs=4)
# accuracy = cross_val_score(ovr, X_encoded, Y_ros, cv=10)
# print("NB accuracy:", accuracy.mean() * 100.0)

In [10]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report

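# Cross-validated predictions on the encoded, oversampled data
# (the raw X still contains unencoded string columns).
Y_pred = cross_val_predict(grid_search, X_encoded, Y_ros)
cm = confusion_matrix(Y_ros, Y_pred)

report = classification_report(Y_ros, Y_pred)
print(report)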

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       151
           1       1.00      0.72      0.83       151
           2       0.72      0.42      0.53       151
           3       0.99      1.00      0.99       151
           4       1.00      0.29      0.45       151
           5       1.00      1.00      1.00       151
           6       0.91      0.60      0.73       151
           7       0.76      0.66      0.70       151
           8       0.99      1.00      1.00       151
           9       0.98      0.53      0.69       151
          10       0.99      1.00      1.00       151
          11       1.00      0.36      0.53       151
          12       0.87      0.74      0.80       151
          13       0.86      0.67      0.75       151
          14       0.97      0.75      0.85       151
          15       0.99      1.00      1.00       151
          16       0.87      1.00      0.93       151
          17       0.82      0.89      0.

In [7]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# roc_curve expects a binary target, so this only works for a two-class problem.
# X_test and Y_test come from the train_test_split that is commented out above,
# and grid_search must already be fitted.
prob_pos = grid_search.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(Y_test, prob_pos)

# Plot the ROC curve against the chance diagonal.
plt.plot([0,1],[0,1],'k--') #plot the diagonal line
plt.plot(fpr, tpr, label='Decision Tree') #plot the ROC curve
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('ROC Decision Tree')
plt.show()

auc = roc_auc_score(Y_test, prob_pos)
print("AUC Score:", auc)

ValueError: multiclass format is not supported
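
The error above arises because `roc_curve` only handles binary targets, while breed is a multiclass label. One possible workaround, sketched here on synthetic data rather than our breed features, is to pass the full probability matrix to `roc_auc_score` with `multi_class='ovr'`, which averages one-vs-rest AUC scores over the classes. The variable names and the synthetic dataset are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data stands in for the encoded breed features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # shape (n_samples, n_classes)

# One-vs-rest AUC, macro-averaged over the classes by default.
auc = roc_auc_score(y_te, proba, multi_class="ovr")
print("Macro one-vs-rest AUC:", auc)
```

This gives a single summary number instead of a curve. Plotting per-class curves would require binarizing the labels and calling `roc_curve` once per class.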