For this assignment you will need to predict whether or not a pet will get adopted. Note
that the “Outcome” column contains multiple categories. For this assignment you will
need to convert the data into only two categories: 1 == adopted, 0 == not adopted.
● Check for missing values. If there are any missing values, deal with them appropriately.
● Provide written justification explaining why you selected particular methods for dealing
with missing values
● Check for outliers. Do we keep them or do we drop them? Why?
● Provide written justification explaining why you decided to keep or drop outliers.
● Center and scale data as needed
○ Generate a density plot for every field that contains continuous data
○ Review distributions
○ Chose centering and scaling approach
○ Provide written justification explaining why you needed (or did not need) to center
and/or scale the data.
● Transform data as needed
○ Choose transformation approach
○ Provide written justification explaining why you needed (or did not need) to
transform the data
● Think about which features make sense as predictors. DO NOT use all features as
predictors in your model.
● Provide a written justification explaining why you selected certain features and excluded
others
● Create and train four classification models (using Naive Bayes, SVM, KNN, Random
Forest) that predict whether or not a pet will get adopted
● Use 10-fold cross validation to validate your models.
● Report each model’s accuracy score
● Report each model’s AUC score
● Compare accuracy for the four models and make a recommendation as to which model
performed best
● Write a paragraph (either as comments or as markdown) explaining whether or not your
best model is “good” and why

In [1]:
#load tools
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import cross_val_score, cross_val_predict

In [2]:
#load file
shelter_data = pd.read_csv("aac_shelter_outcomes.csv")
shelter_data

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,Neutered Male
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,,Rabies Risk,Euthanasia,Unknown
...,...,...,...,...,...,...,...,...,...,...,...,...
78251,1 month,A764894,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04T00:00:00,2018-02-01T18:26:00,2018-02-01T18:26:00,,Foster,Adoption,Spayed Female
78252,3 years,A764468,Dog,Mastiff Mix,Blue/White,2014-12-30T00:00:00,2018-02-01T18:06:00,2018-02-01T18:06:00,Max,,Adoption,Neutered Male
78253,,A766098,Other,Bat Mix,Brown,2017-02-01T00:00:00,2018-02-01T18:08:00,2018-02-01T18:08:00,,Rabies Risk,Euthanasia,Unknown
78254,2 months,A765858,Dog,Standard Schnauzer,Red,2017-11-13T00:00:00,2018-02-01T18:32:00,2018-02-01T18:32:00,,,Adoption,Spayed Female


In [3]:
#check for null values in the columns
shelter_data.isnull().sum()

age_upon_outcome        8
animal_id               0
animal_type             0
breed                   0
color                   0
date_of_birth           0
datetime                0
monthyear               0
name                23886
outcome_subtype     42293
outcome_type           12
sex_upon_outcome        2
dtype: int64

In [4]:
#note that age_upon_outcome is an object, and mean/median cannot be performed on it
#there are 8 null values in the age column, 
#we can estimate and fill it in with the most common occurring age, so the
#shelter animals can be assumed to be at least 1 year of age
shelter_data.age_upon_outcome.describe()

count      78248
unique        46
top       1 year
freq       14355
Name: age_upon_outcome, dtype: object

In [5]:
#fill missing values for age
shelter_data["age_upon_outcome"].fillna("1 year", inplace=True)

In [6]:
#we can't just carelessly drop the rows with no names, as those are all valid data
#but we also can't just fill with the most common name
#instead, let's fill it with "unknown" as a filler
shelter_data["name"].fillna("Unknown", inplace=True)

We only care about if the animal was adopted or not, so the last three columns have to be
filtered through; everything not marked "adoption" must be converted to the numeric 0

In [7]:
#let's drop the unnecessary outcome columns
shelter_data.drop('outcome_subtype', axis = 'columns', inplace = True)
shelter_data.drop('sex_upon_outcome', axis = 'columns', inplace = True)

In [8]:
shelter_data.outcome_type.describe()

count        78244
unique           9
top       Adoption
freq         33112
Name: outcome_type, dtype: object

In [9]:
#check the different outcome types
shelter_data.outcome_type.unique()

array(['Transfer', 'Adoption', 'Euthanasia', 'Return to Owner', 'Died',
       'Disposal', 'Relocate', 'Missing', nan, 'Rto-Adopt'], dtype=object)

In [10]:
#there are null values, we can fill them with "not adopted" or simply drop them
#since there are only 12 null values, dropping would not have an adverse effect on the data
#i will fill the null values, since i don't feel too comfortable with dropping data

shelter_data["outcome_type"].fillna("Missing", inplace=True)

In [11]:
#convert the data into only two categories: 1 == adopted, 0 == not adopted
#we could use get_dummies, but since this method is not specified, I'll do it by using numpy 
#to replace values for convenience
shelter_data['outcome'] = np.where(shelter_data ['outcome_type'].str.contains('Adoption'), 1, 0)
shelter_data

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_type,outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,Unknown,Transfer,0
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Transfer,0
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,Adoption,1
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Transfer,0
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,Unknown,Euthanasia,0
...,...,...,...,...,...,...,...,...,...,...,...
78251,1 month,A764894,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04T00:00:00,2018-02-01T18:26:00,2018-02-01T18:26:00,Unknown,Adoption,1
78252,3 years,A764468,Dog,Mastiff Mix,Blue/White,2014-12-30T00:00:00,2018-02-01T18:06:00,2018-02-01T18:06:00,Max,Adoption,1
78253,1 year,A766098,Other,Bat Mix,Brown,2017-02-01T00:00:00,2018-02-01T18:08:00,2018-02-01T18:08:00,Unknown,Euthanasia,0
78254,2 months,A765858,Dog,Standard Schnauzer,Red,2017-11-13T00:00:00,2018-02-01T18:32:00,2018-02-01T18:32:00,Unknown,Adoption,1


We have no need to deal with outliers, since these are all significant data and there will be significant differences between each individual. In addition, most of these values are not numerical.

Centering and scaling the data is also unnecessary here, due to the lack of numerical values. Density plots, in turn, also cannot be generated. In addition, the data cannot be transformed.

So our task is to predict whether or not a pet will get adopted. For this, it is reasonable to take into account columns age, type, date_time, and outcome. Age and type may affect the animal's chance of being adopted due to preference, and date_time may also be a factor due to adoption seasons. The outcome will clearly be relevant in analysis of whether or not a pet will be adopted. While breed and color may affect the chances of an animal being adopted, there are too many variances in such detail where it would be more efficient just to look at the animal type when predicting adoptions. Date of birth is irrelevant, since we already have the age upon outcome. Name and animal id will most likely not affect adoption at all, so we can move past those.

Naive Bayes

In [12]:
#apply label encoder to columns
def label_encode(shelter_data, columns):
    for col in columns:
        le = LabelEncoder()
        col_values_unique = list(shelter_data[col].unique())
        le_fitted = le.fit(col_values_unique)
 
        col_values = list(shelter_data[col].values)
        #le.classes_
        col_values_transformed = le.transform(col_values)
        shelter_data[col] = col_values_transformed

In [13]:
#naive bayes
x = shelter_data[['age_upon_outcome','animal_type','datetime']]
label_encode(x, x.columns.values)

y = shelter_data[['outcome']]
label_encode(y, y.columns.values)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.33, random_state = 10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()


In [14]:
#apply model
nb = GaussianNB()
nb.fit(x_train, y_train.values.ravel())
y_pred = nb.predict(x_test)
y_pred

array([1, 0, 0, ..., 0, 0, 0])

In [15]:
#accuracy (yikes)
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))

0.5803291384317522


In [16]:
#10 fold cross validation
scores = cross_val_score(nb, x, y.values.ravel(), cv=10)
print(scores)
print('Cross-validated score:', scores.mean())

[0.57692308 0.57692308 0.57692308 0.64656274 0.55954511 0.56286737
 0.56894569 0.56217252 0.57162939 0.56741214]
Cross-validated score: 0.5769904186013852


In [17]:
# make class predictions for the testing set
y_pred_class = nb.predict(x_test)

print(metrics.accuracy_score(y_test, y_pred_class))
print(metrics.roc_auc_score(y_test, y_pred_class))

0.5803291384317522
0.5447305108345873


SVM

In [18]:
from sklearn import svm

# instantiate model
model = svm.SVC() 

# fit model
model.fit(x_train, y_train.values.ravel())

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [19]:
# make class predictions for the testing set
y_pred_class = model.predict(x_test)

In [20]:
print(metrics.accuracy_score(y_test, y_pred_class))
print(metrics.roc_auc_score(y_test, y_pred_class))

0.5756437560503388
0.5


In [21]:
#cross validation was making it run slow, so i commented it out
#scores = cross_val_score(model, x, y.values.ravel(), cv=10)
#print(scores.mean())

KNN

In [33]:
from sklearn.neighbors import KNeighborsRegressor

# instantiate model
model = KNeighborsRegressor(n_neighbors=5)

# fit model
model.fit(x_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [34]:
#cross validation was making it run slow, so i commented it out
#scores = cross_val_score(model, x, y.values.ravel(), cv=10)
#print(scores.mean())

In [35]:
# make class predictions for the testing set
y_pred_class = model.predict(x_test)

In [36]:
# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class.round()))
print(metrics.roc_auc_score(y_test, y_pred_class))

0.6476282671829623
0.6842645993842711


Random Forest

In [26]:
#import the model 
from sklearn.ensemble import RandomForestRegressor

#instantiate model 
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)

#train the model on training data
rf.fit(x_train, y_train.values.ravel());

# Use the forest's predict method on the test data
y_pred_class = rf.predict(x_test)

print(metrics.accuracy_score(y_test, y_pred_class.round()))
print(metrics.roc_auc_score(y_test, y_pred_class))

0.6973088092933204
0.7704702723347336


In [28]:
#cross validation was making it run slow, so i commented it out
#scores = cross_val_score(model, x, y.values.ravel(), cv=10)
#print(scores.mean())

Of the four models, the RFM performed the best, with a 70% accuracy score compared to 58%, 57%, and 64% of the other models. When we look at the AUC values, we see that the NBM and SVM are both on the low end, near 0.5, making them bad classifiers. When comparing the KNN and RFM, 0.77 of the RM is more reliable and indicates a better classifier. This further supports that the Random Forest Model is the one that performed the best. At a 0.77 AUC score and 0.69 accuracy score, the RFM works as a good classifier, but it definitely could be more accurate. Predicting adoption at a 70% rate may not be as precise and reliable as most shelters would want, but this imprecision can also be chalked up to unpredictable preferences as each adopter may have different goals when adopting.