In [51]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt

## Group Members ##
Cecilia Vu,
Julianne Vu,
An Vi Nguyen,
Emaan Haseem

## Introduction ##
For our project, we are trying to predict whether or not an animal will be adopted based their characteristics. This problem is important as there are thousands of animals placed in shelters each year and our model could help these animals get adopted and avoid euthanasia. The dataset that we are using from for model is from the Austin Animal Center. This dataset provides the characteristics of the animals when they arrived at the center as well as their characteristics when they leave the center. These features includes their condition, breed, age, etc. We will use this dataset to build a model to predict whether or not an animal will be adopted. 

## Data Collection ##
We are merging the data based on their Animal ID

In [52]:
data1 = pd.read_csv('Austin_Animal_Center_Intakes.csv')
data2 = pd.read_csv('Austin_Animal_Center_Outcomes.csv')

data = pd.merge(data1, data2, on='Animal ID', how='inner')
data_copy = data.copy()
#print(data.columns)

## Feature Selection ##
For feature selection, we dropped the following labels/columns from our merged dataset:
* **Name_x, Name_y**
    * We used the row indices to identify the animals instead of the names, this way it would be numeric and unique.
* **MonthYear_x, MonthYear_y**
    * We already have DateTime_x and DateTime_y, so having MonthYear was not necessary. 
* **Found Location**
    * Majority of the found location was in Austin, TX and we didn't think Found Location was relevant enough to affect our predictions.
* **Animal Type_y, Color_y, Breed_y**
    * Since we already got Animal Type_x, Color_x, and Breed_x, these three y columns were duplicates, so we didn't need them.
* **Outcome Subtype**
    * We already have Outcome Type, and we only cared whether an animal was adopted or not. Dropping the subtype would simplify our prediction.
* **Date of Birth, Age upon Outcome**
    * We already have Age upon Intake and how long they had been at the center, so date of birth and age upon outcome was not necessary. 
* **Sex upon Intake**
    * They could've been neutered or spayed during their time at the shelter, so their sex upon outcome was more relevant for our dataset and predictions.

In [53]:
data = data.drop(columns=['Name_x','MonthYear_x','MonthYear_y','Found Location','Name_y','Animal Type_y','Color_y','Breed_y','Outcome Subtype','Date of Birth', 'Age upon Outcome', 'Sex upon Intake'])
#print(data.head())
data.shape

(178310, 11)

## Data Cleaning/Prep ##

For data cleaning, we dropped all rows that contained N/A, NaN or Unknown. We also decided to drop all rows with duplicate animal IDs because they could have been adopted out and returned back to the center multiple times, and it would be difficult to track how long they were actually at the center. 

In [54]:
data = data.dropna()
data = data.loc[data['Sex upon Outcome'] != 'Unknown']
data.drop_duplicates(subset='Animal ID', keep = False, inplace = True)
print(data.shape)

(100607, 11)


Since the ages were formatted as "X years", "X months" or "X weeks," we decided to standardized it to weeks so we can better can compare the data. 

In [55]:
# converting age into number of weeks
for i in range(len(data)):
    age = str(data.iloc[i,5])
    age = age.split()
    if len(age) < 2:
        data.drop(i)
        continue
    if (age[1] == 'years' or age[1] == 'year'):
        data.iloc[i,5] = int(age[0]) * 52
    elif (age[1] == 'months' or age[1] == 'month'):
        data.iloc[i,5] = int(age[0]) * 4
    else:
        data.iloc[i,5] = int(age[0])

We split our data into two data sets, features and labels. We used that 'Outcome Type' as the label dataframe. However, there more than 2 categories in the label dataset and we only cared about predicting whether or not the animals were adopted. We decided to change the label set to a binary and set all of the 'Adoption' labels to 1 and the others to 0. We thought that this would help simplify our problem. 

In [57]:
label_df = data['Outcome Type']
feature_df = data.drop(columns=['Outcome Type', 'Animal ID'])
label_df = label_df.values.ravel()
print("shape of feature data frame: ", feature_df.shape)
print("length of label data frame: " , len(label_df))

for i in range(len(label_df)):
    if label_df[i] == 'Adoption' or label_df[i] == 1:
        label_df[i] = 1
    else:
        label_df[i] = 0
print(sum(label_df))
print(label_df)
#feature_df.head()

shape of feature data frame:  (100607, 9)
length of label data frame:  100607
48288
[0 0 0 ... 0 0 0]


After that, we decided to factorize all the categorical data into numeric data. We attempted to do one-hot-encoding; however, our dimensionality increase by a lot since there were so many categories for breed and colors which would make it harder to efficiently generalize. 

In [58]:
feature_df['Animal Type_x'] = pd.factorize(feature_df['Animal Type_x'])[0]
feature_df['Intake Type'] = pd.factorize(feature_df['Intake Type'])[0]
feature_df['Intake Condition'] = pd.factorize(feature_df['Intake Condition'])[0]
feature_df['Breed_x'] = pd.factorize(feature_df['Breed_x'])[0]
feature_df['Color_x'] = pd.factorize(feature_df['Color_x'])[0]

We decided to compare the columns DateTime_x and DateTime_y, so we could calculate how long the animal was at the shelter. We did this by converting the DateTimes from strings to DateTime objects and subtracting DateTime_x from DateTime_y. After we got the duration times for each animal, we removed the DateTime columns and added a Duration column because it was more relevant for our predictions.

In [59]:
from datetime import datetime
from datetime import timedelta

durations = []

for i in range(len(feature_df)):
    into_shelter = datetime.strptime(feature_df.iloc[i, 0], "%m/%d/%Y %H:%M:%S %p")
    out_shelter = datetime.strptime(feature_df.iloc[i, 7], "%m/%d/%Y %H:%M:%S %p")
    duration = out_shelter - into_shelter
    duration = duration.days
    duration = duration if duration >= 0 else 0
    durations.append(duration)
    
feature_df = feature_df.drop(columns=['DateTime_x', 'DateTime_y'])


In [60]:
feature_df['Duration_days'] = durations
feature_df.head()

Unnamed: 0,Intake Type,Intake Condition,Animal Type_x,Age upon Intake,Breed_x,Color_x,Sex upon Outcome,Duration_days
0,0,0,0,104,0,0,Neutered Male,4
1,0,0,0,416,1,1,Spayed Female,0
2,0,0,0,44,2,2,Neutered Male,6
3,0,1,1,4,3,3,Intact Female,0
4,0,0,0,208,4,4,Neutered Male,2


Sex upon Outcome gave us the neuter/spay status of the animal as well as the sex, but we thought it would more useful to split up this column into two separate columns to analyze trends better. There could be trends from solely the sex of the animal or from solely the neuter/spay status of the animal.

In [61]:
# Splitting "Sex Upon Outcome" into 2 columns
fertility = []
gender = []

feature_df = feature_df.loc[feature_df['Sex upon Outcome'] != 'Unknown']
for i in range(len(feature_df)):
    sex = feature_df.iloc[i, 6].split()
    sex[0] = 0 if sex[0] == 'Intact' else 1
    sex[1] = 0 if sex[1] == 'Male' else 1
    fertility.append(sex[0])
    gender.append(sex[1])
        
feature_df = feature_df.drop(columns=['Sex upon Outcome'])

In [62]:
feature_df['Spayed/Neutered'] = fertility
feature_df['Gender'] = gender
print(feature_df.shape)
feature_df.head()

(100607, 9)


Unnamed: 0,Intake Type,Intake Condition,Animal Type_x,Age upon Intake,Breed_x,Color_x,Duration_days,Spayed/Neutered,Gender
0,0,0,0,104,0,0,4,1,0
1,0,0,0,416,1,1,0,1,1
2,0,0,0,44,2,2,6,1,0
3,0,1,1,4,3,3,0,0,1
4,0,0,0,208,4,4,2,1,0


## Data Analysis ##
We needed to split our data into two sets, the training set and the testing set, to build our models. We decided to build our first model using decision trees and we received an accuracy of 81%. However, we realized that we used factorization and instead of one-hot-encoding for our categorical variables, therefore, decision trees would not really work for our dataset. 

In [63]:
# Decision Tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
train_feat, test_feat, train_label, test_label = train_test_split(feature_df, label_df, train_size = .8, test_size=.2)
print(len(train_feat))
print(len(test_feat))

clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(train_feat, train_label.astype(int))
pred = clf.predict(test_feat)
print("Accuracy = ", accuracy_score(test_label.astype(int), pred))

80485
20122
Accuracy =  0.8051883510585429


In [64]:
# Cross validation for decision tree
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier(criterion="entropy")
accuracy = cross_val_score(clf,feature_df, label_df.astype(int), cv=10)
print(accuracy.mean())

0.8056198590190297


We needed to see how our model would look if we used other classification techniques. We also realized that we should be checking the precision and recall of the model as well. For our next model, we decided to used K-nearest neighbors. 

In [67]:
# KNN
from sklearn import neighbors
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

ss = StandardScaler()
pca = PCA()
knn = KNeighborsClassifier(n_neighbors=7)
pl = Pipeline(steps=[("ss", ss), ("pca", pca), ("knn", knn)])

scores = cross_val_score(pl, feature_df.astype(int), label_df.astype(int), cv=5)
predictions = cross_val_predict(knn, feature_df.astype(int), label_df.astype(int), cv=5)
cf = confusion_matrix(label_df.astype(int), predictions)
print(confusion_matrix(label_df.astype(int), predictions))
print(classification_report(label_df.astype(int), predictions))
print(scores.mean())

[[38619 13700]
 [12089 36199]]
              precision    recall  f1-score   support

           0       0.76      0.74      0.75     52319
           1       0.73      0.75      0.74     48288

    accuracy                           0.74    100607
   macro avg       0.74      0.74      0.74    100607
weighted avg       0.74      0.74      0.74    100607

0.8285805055934802


For our final model, we decided to use Naive Bayes with a 10 fold cross validation. This model produced our lowest accuracy but it had the highest recall. Here were our results: 

In [44]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
nb_scores = cross_val_score(classifier, feature_df.astype(int), label_df.astype(int), cv=10)
print("Average accuracy for 10 fold cv with Naive Bayes: ", nb_scores.mean())

Average accuracy for 10 fold cv with Naive Bayes:  0.7957498029416188


In [47]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report

predictions = cross_val_predict(classifier, feature_df.astype(int), label_df.astype(int), cv=10)
print(confusion_matrix(label_df.astype(int), predictions))
print(classification_report(label_df.astype(int), predictions))

[[38182 14137]
 [ 6412 41876]]
              precision    recall  f1-score   support

           0       0.86      0.73      0.79     52319
           1       0.75      0.87      0.80     48288

    accuracy                           0.80    100607
   macro avg       0.80      0.80      0.80    100607
weighted avg       0.80      0.80      0.80    100607



## Conclusion ##
We decided recall would be more important for our dataset compared to precision and accuracy because recall in this scenario is out of the animals that were adopted, how many were actually classified as adopted with our models. This is important because if we accidentally labelled an animal that could of had a high chance at adoption as 'Not Adopted', this could lead to bad consequences for that animal (i.e. euthanasia). Because of this reason, we decided that our naive bayes model was the best as it gave us the highest recall of .87.

After finishing this project, we realized that we spent the most amount of time on feature selection and data preperation because it was really hard for us to figure out the justification of getting rid our of some of our features. We spent a lot of time debating which ones we should keep and which ones we should get rid of. The most challenging part of this project was figuring out how to deal with the categorical data because the majority of our features were categorical. We attempted to one hot encoding first; however, there were so many categories for some features (ex. breed, color). We realized that using one hot encoding just increased our dimensionality which is something we wanted to avoid. After long debate, we decided that factorization would be best for our dataset. Something interesting that we found looking at our dataset was that there was a bat at the shelter! 