# Shelter Animal Outcomes

## Introduction


Every year, approximately 7.6 million companion animals end up in US shelters. Many animals are given up as unwanted by their owners, while others are picked up after getting lost or taken out of cruelty situations. Many of these animals find forever families to take them home, but just as many are not so lucky. 2.7 million dogs and cats are euthanized in the US every year.

Using a dataset that includes breed, color, sex, and age, we need to predict the outcome for each animal. Submissions are evaluated using the multi-class logarithmic loss.
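To make the evaluation metric concrete, here is a minimal NumPy sketch of the multi-class logarithmic loss: the mean negative log of the probability assigned to the true class. The function name `multiclass_log_loss`, the clipping constant `eps`, and the toy inputs are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability of the true class.

    y_true: 0-based integer class indices, shape (n,)
    proba:  predicted probabilities, shape (n, n_classes)
    """
    # Clip to avoid log(0), then renormalise so each row sums to 1 again
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)
    rows = np.arange(len(y_true))
    return -np.mean(np.log(proba[rows, np.asarray(y_true)]))

# Toy example: both animals get probability 0.5 on their true class
y_true = [0, 2]
proba = [[0.5, 0.25, 0.25],
         [0.25, 0.25, 0.5]]
loss = multiclass_log_loss(y_true, proba)
print(loss)  # equals log(2), about 0.693
```

A lower value is better; a confident wrong prediction is penalised much more heavily than an uncertain one.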

In [103]:
# Import the required libraries
from collections import Counter  # used later to count color and breed words

import numpy as np
import pandas as pd
from sklearn.metrics import log_loss
# Imputer was unused here and has been removed from recent scikit-learn releases
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

We load the training and test datasets as dataframes using the read_csv function. Because the DateTime column holds both date and time, we pass its name to parse_dates so that it is stored as a datetime dtype rather than as strings.

In [104]:
df_raw = pd.read_csv('train.csv',parse_dates=['DateTime'])
df_test = pd.read_csv('test.csv',parse_dates=['DateTime'])

For each dataframe, we print its shape, a summary of the DateTime column, and the number of null values per column. Since the training and test datasets cover the same time period (October 2013 to February 2016), we can split df_raw into training and validation sets randomly.

In [105]:
for df in [df_raw,df_test]:
    print(df.shape)
    print(df.DateTime.describe())
    print('\n')
    print(df.isnull().sum()[df.isnull().sum() != 0])
    print('\n')

(26729, 10)
count                   26729
unique                  22918
top       2015-08-11 00:00:00
freq                       19
first     2013-10-01 09:31:00
last      2016-02-21 19:17:00
Name: DateTime, dtype: object


Name               7691
OutcomeSubtype    13612
SexuponOutcome        1
AgeuponOutcome       18
dtype: int64


(11456, 8)
count                   11456
unique                  10575
top       2014-10-20 09:00:00
freq                        8
first     2013-10-01 10:44:00
last      2016-02-21 18:37:00
Name: DateTime, dtype: object


Name              3225
AgeuponOutcome       6
dtype: int64




We combine the training and test datasets so that the same preprocessing is applied to both. The Name column is not considered useful, so it is dropped.

In [106]:
df_comb = pd.concat([df_raw,df_test],sort = True)

df_comb.drop('Name',axis = 1,inplace=True)

In [107]:
df_comb.SexuponOutcome.value_counts(dropna = False)

Neutered Male    14014
Spayed Female    12633
Intact Female     5004
Intact Male       4985
Unknown           1548
NaN                  1
Name: SexuponOutcome, dtype: int64

The value_counts() method gives the count of each unique value in a column. The single row with a missing (NaN) SexuponOutcome value is in the training set, so we drop it. The Unknown value is replaced with the most frequent value of the SexuponOutcome column, Neutered Male.

Similarly, for the AgeuponOutcome column the mode and the median were found to be the same (1 year), so the NaN values are filled with the mode.

Next, we split each SexuponOutcome string into two parts: the second part is assigned to Sex and the first part to Fertility. Example: Neutered Male -> Neutered + Male.
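The split can be tried on a few values in isolation. This is a small sketch on a toy Series; the names `sex`, `parts`, and `toy` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy values taken from the categories listed above
sex = pd.Series(['Neutered Male', 'Spayed Female', 'Intact Male'])

# expand=True returns a DataFrame: column 0 is the first word, column 1 the second
parts = sex.str.split(expand=True)
toy = pd.DataFrame({'Fertility': parts[0], 'Sex': parts[1]})
print(toy)
```

Using `expand=True` splits each string once and yields both parts together, which is equivalent to taking `.str[0]` and `.str[1]` from two separate splits.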

The AgeuponOutcome column is likewise split into two parts. Using age_dict, the unit (years, months, weeks, or days) is converted into its equivalent number of weeks and then multiplied by the numeric part to get the total age in weeks. Example: 2 years -> 2 + years -> 2*52 -> 104 weeks.

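The Age_weeks column is then binned into up to 10 quantile-based groups (roughly equal in number of animals, not in width) using the qcut function; duplicates='drop' merges bins whose edges coincide.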
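The age conversion and binning described above can be checked on a handful of values. The sketch below reuses the same unit-to-weeks factors as age_dict; the toy ages, the variable names, and the choice of 3 bins (instead of 10) are assumptions made to keep the example small.

```python
import pandas as pd

# Same factors as the notebook's age_dict (a month is approximated as 4.5 weeks)
unit_weeks = {'year': 52, 'years': 52, 'month': 4.5, 'months': 4.5,
              'week': 1, 'weeks': 1, 'day': 1/7, 'days': 1/7}

age = pd.Series(['2 years', '3 weeks', '1 month', '7 days', '1 year', '5 months'])

# '2 years' -> ['2', 'years'] -> 2.0 * 52 -> 104.0 weeks
parts = age.str.split(expand=True)
age_weeks = parts[0].astype(float) * parts[1].map(unit_weeks)
print(age_weeks.tolist())

# qcut places bin edges at quantiles, so each bin holds about the same number of rows
age_bins = pd.qcut(age_weeks, 3, labels=False, duplicates='drop')
print(age_bins.tolist())
```

Because the bins are quantile-based, the youngest third of the animals lands in group 0 regardless of how wide that age range is in weeks.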


In [108]:
df_comb.dropna(subset = ['SexuponOutcome'],inplace = True)
df_comb['SexuponOutcome'] = df_comb['SexuponOutcome'].replace('Unknown','Neutered Male')

df_comb['AgeuponOutcome'].value_counts(dropna = False)
df_comb['AgeuponOutcome'] = df_comb['AgeuponOutcome'].fillna(df_comb['AgeuponOutcome'].mode()[0])

# Split e.g. 'Neutered Male' into sex (second word) and fertility status (first word)
df_comb['Sex'] = df_comb.SexuponOutcome.str.split().str[1]

df_comb['Fertility'] = df_comb.SexuponOutcome.str.split().str[0]

age_dict = {'year':52,'years':52,'month':4.5,'months':4.5,'day':1/7,'days':1/7,'weeks':1,'week': 1}

df_comb['Age_weeks'] =  df_comb.AgeuponOutcome.str.split().str[0].astype(float)*\
df_comb.AgeuponOutcome.str.split().str[1].map(age_dict)

df_comb['Age_groups'],y = pd.qcut(df_comb['Age_weeks'],10,duplicates='drop',labels =False,retbins = True)   

df_comb['Age_norm'] = (df_comb.Age_weeks - df_comb.Age_weeks.min())/(df_comb.Age_weeks.max() - df_comb.Age_weeks.min())

df_comb['Age_std']  = (df_comb.Age_weeks - df_comb.Age_weeks.mean())/df_comb.Age_weeks.std()

Columns containing strings are mapped to integers using the map function on the respective columns. In addition, several attributes such as weekday, year, and month are derived from the DateTime column.

In [109]:
df_comb['Animal_map'] = df_comb['AnimalType'].map({'Dog':1,'Cat':0})

df_comb['Sex_map'] = df_comb['Sex'].map({'Male':1,'Female':0})

df_comb['Fertility_map'] = df_comb['Fertility'].map({'Neutered':0,'Spayed':0,'Intact':1})

df_comb['Weekday'] = df_comb.DateTime.dt.dayofweek

df_comb['Month'] = df_comb.DateTime.dt.month

df_comb['Year'] = df_comb.DateTime.dt.year

df_comb['Day_of_Year'] = df_comb.DateTime.dt.dayofyear

df_comb['Month_end'] = df_comb.DateTime.dt.is_month_end*1

df_comb['Month_start'] = df_comb.DateTime.dt.is_month_start*1

df_comb['Quarter_start'] = df_comb.DateTime.dt.is_quarter_start*1

df_comb['Quarter_end'] = df_comb.DateTime.dt.is_quarter_end*1
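To see what the date attributes above evaluate to, here is a short sketch on three hand-picked timestamps. The dataframe `ts` and its dates are toy assumptions; the `.dt` accessors are the same ones used in the cell above.

```python
import pandas as pd

ts = pd.DataFrame({'DateTime': pd.to_datetime(['2014-02-12 18:22:00',
                                               '2015-01-31 12:28:00',
                                               '2014-07-01 09:00:00'])})

ts['Weekday'] = ts.DateTime.dt.dayofweek                  # Monday=0 ... Sunday=6
ts['Month'] = ts.DateTime.dt.month
ts['Month_end'] = ts.DateTime.dt.is_month_end.astype(int)         # 1 on the last day of a month
ts['Quarter_start'] = ts.DateTime.dt.is_quarter_start.astype(int)  # 1 on Jan 1, Apr 1, Jul 1, Oct 1
print(ts)
```

Here `.astype(int)` plays the same role as the `*1` in the notebook: it turns the boolean flags into 0/1 integers.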

For the Color and Breed columns, each '/' is replaced with a space and each string is then split into a list of words using the split function. The lists for all rows are concatenated into a single list of every word occurring in the column, and Counter is applied to that list to extract the most frequent words (the top 20 for colors, the top 30 for dog breeds). We then create a new indicator column for each color and breed word selected from these lists.
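The word-counting step can be sketched on a few Color strings taken from the sample rows shown later. This version flattens the per-row lists with `itertools.chain` instead of summing them, which is an assumed alternative and not the notebook's own code; it should produce the same counts without repeatedly concatenating lists.

```python
from collections import Counter
from itertools import chain

import pandas as pd

color = pd.Series(['Brown/White', 'Cream Tabby', 'Blue/White', 'Blue Cream', 'Tan'])

# 'Brown/White' -> 'Brown White' -> ['Brown', 'White']
words = color.str.replace('/', ' ', regex=False).str.split()

# Flatten the per-row lists and count every word
top = Counter(chain.from_iterable(words)).most_common(3)
print(top)
```

On the full dataset, `most_common(20)` applied the same way gives the ranked lists shown in the cells below.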

In [110]:
df_dog = df_comb[df_comb.AnimalType=='Dog']
print(df_dog.shape[0])
Counter(df_dog.Color.str.replace('/',' ').str.split().sum()).most_common(20)

22250


[('White', 11730),
 ('Black', 7439),
 ('Brown', 5654),
 ('Tan', 4370),
 ('Brindle', 1487),
 ('Red', 1396),
 ('Tricolor', 1279),
 ('Blue', 1216),
 ('Chocolate', 687),
 ('Sable', 459),
 ('Merle', 457),
 ('Buff', 431),
 ('Gray', 419),
 ('Yellow', 414),
 ('Cream', 397),
 ('Fawn', 294),
 ('Tick', 156),
 ('Silver', 113),
 ('Gold', 113),
 ('Apricot', 45)]

In [111]:
df_cat = df_comb[df_comb.AnimalType=='Cat']
print(df_cat.shape[0])
Counter(df_cat.Color.str.replace('/',' ').str.split().sum()).most_common(20)

15934


[('Tabby', 7404),
 ('White', 5632),
 ('Black', 4128),
 ('Brown', 3895),
 ('Blue', 2175),
 ('Orange', 1986),
 ('Tortie', 878),
 ('Point', 852),
 ('Calico', 802),
 ('Torbie', 567),
 ('Cream', 520),
 ('Lynx', 284),
 ('Seal', 236),
 ('Gray', 135),
 ('Flame', 122),
 ('Smoke', 110),
 ('Silver', 63),
 ('Lilac', 56),
 ('Chocolate', 54),
 ('Buff', 9)]

In [112]:
colors = ['White','Black','Brown','Tan','Brindle','Red','Tricolor','Blue','Tabby','Orange','Tortie','Point','Calico','Torbie',\
         'Cream','Lynx','Seal','Gray','Flame','Smoke','Chocolate','Sable','Merle','Buff','Yellow','Tick','Silver','Gold',]

In [113]:
for x in colors:
    df_comb[x] = df_comb.Color.str.contains(x)*1

In [114]:
print(df_dog.shape[0])
Counter(df_dog.Breed.str.replace('/',' ').str.split().sum()).most_common(30)

22250


[('Mix', 16330),
 ('Chihuahua', 3690),
 ('Retriever', 3539),
 ('Bull', 3529),
 ('Shorthair', 3460),
 ('Pit', 3458),
 ('Labrador', 3280),
 ('Terrier', 2435),
 ('Shepherd', 2001),
 ('Australian', 1516),
 ('German', 1427),
 ('Miniature', 1126),
 ('Dachshund', 1115),
 ('Dog', 996),
 ('Cattle', 905),
 ('Poodle', 716),
 ('Border', 713),
 ('Collie', 708),
 ('Boxer', 605),
 ('American', 541),
 ('Hound', 462),
 ('Beagle', 434),
 ('Russell', 430),
 ('Schnauzer', 404),
 ('Chow', 398),
 ('Jack', 389),
 ('Yorkshire', 383),
 ('Rat', 380),
 ('Catahoula', 371),
 ('Great', 352)]

In [115]:
dog_breed =['Mix','Chihuahua','Terrier','Retriever','Bull','Shepherd','Australian','German','Miniature',\
            'Cattle','Poodle','Border','Collie','Boxer','American','Hound','Beagle','Russell','Schnauzer','Chow','Jack',\
            'Yorkshire','Rat','Catahoula','Great','Labrador','Dachshund',]
cat_breed = ['Domestic','Shorthair','Medium','Longhair','Siamese','Snowshoe']
breeds = dog_breed + cat_breed

In [116]:
for x in breeds:
    df_comb[x] = df_comb.Breed.str.contains(x)*1
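One thing worth knowing about the indicator columns above: `str.contains` matches substrings, so a word such as 'Bull' also flags 'Bulldog'. The sketch below contrasts that with a whole-word match using a regex word boundary. The toy breeds and the names `substring_flag` and `word_flag` are assumptions for illustration; the notebook itself uses the plain substring form.

```python
import re

import pandas as pd

breed = pd.Series(['Pit Bull Mix', 'Bulldog Mix', 'Rat Terrier Mix'])

# Substring match, as in the notebook: 'Bulldog' also contains 'Bull'
substring_flag = breed.str.contains('Bull').astype(int)

# Whole-word match: \b marks a word boundary, so 'Bulldog' no longer matches
word_flag = breed.str.contains(r'\b' + re.escape('Bull') + r'\b').astype(int)

print(substring_flag.tolist(), word_flag.tolist())
```

Whether the broader substring match helps or hurts the model is an empirical question; the sketch only shows how the two variants differ.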

In [117]:
df_comb.head()

| | AgeuponOutcome | AnimalID | AnimalType | Breed | Color | DateTime | ID | OutcomeSubtype | OutcomeType | SexuponOutcome | ... | Catahoula | Great | Labrador | Dachshund | Domestic | Shorthair | Medium | Longhair | Siamese | Snowshoe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 year | A671945 | Dog | Shetland Sheepdog Mix | Brown/White | 2014-02-12 18:22:00 | | | Return_to_owner | Neutered Male | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 year | A656520 | Cat | Domestic Shorthair Mix | Cream Tabby | 2013-10-13 12:44:00 | | Suffering | Euthanasia | Spayed Female | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2 years | A686464 | Dog | Pit Bull Mix | Blue/White | 2015-01-31 12:28:00 | | Foster | Adoption | Neutered Male | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 weeks | A683430 | Cat | Domestic Shorthair Mix | Blue Cream | 2014-07-11 19:09:00 | | Partner | Transfer | Intact Male | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 4 | 2 years | A667013 | Dog | Lhasa Apso/Miniature Poodle | Tan | 2013-11-15 12:52:00 | | Partner | Transfer | Neutered Male | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [118]:
df_comb.isnull().sum()[df_comb.isnull().sum()!=0]

AnimalID          11456
ID                26728
OutcomeSubtype    25067
OutcomeType       11456
dtype: int64

Since OutcomeType is known only for rows that came from the training set, we use it to separate the combined dataset back into training and test sets. The outcome type in the training set is then converted into integers using the map function.

In [119]:
# .copy() makes independent dataframes, so adding a column below does not raise SettingWithCopyWarning
df_test = df_comb[df_comb['OutcomeType'].isnull()].copy()
df_raw = df_comb[~(df_comb['OutcomeType'].isnull())].copy()

In [120]:
df_raw['out_map'] = df_raw.OutcomeType.map({'Adoption':1,'Died':2,'Euthanasia':3, 'Transfer':4,'Return_to_owner':5 })


In [121]:
date_ftr = ['Weekday','Month','Year','Day_of_Year','Month_end','Month_start','Quarter_start','Quarter_end',]
features = ['Animal_map','Sex_map','Fertility_map','Age_weeks']+breeds+colors+date_ftr

A features_all list is now created, containing the names of all the features. A logistic regression classifier is fitted on successively longer prefixes of this list (the first feature, then the first two, and so on), and the quality of each fit is measured using log_loss on the training and validation sets.


In [125]:

features_all = (['Animal_map','Sex_map','Fertility_map','Age_groups']+breeds+colors+date_ftr)
for i in range(len(features_all)):
    features = features_all[:i+1]
    df_trn,df_val,label_trn,label_val = train_test_split(df_raw[features],df_raw[['AnimalID','out_map']],
                                                stratify = df_raw['out_map'].values,test_size = 0.3,random_state = 123)

    #clf = RandomForestClassifier(n_estimators=100,max_features=0.5)
    clf = LogisticRegression(penalty = 'l2',C=1)
    clf.fit(df_trn.values, label_trn['out_map'].values)
    
    log_loss_prob_trn = log_loss(label_trn['out_map'].values,clf.predict_proba(df_trn.values),labels = [1,2,3,4,5])
    log_loss_prob_val = log_loss(label_val['out_map'].values,clf.predict_proba(df_val.values),labels = [1,2,3,4,5])
    
 
    print('Training_score = '+ str(round(log_loss_prob_trn,5)) + '     Validation_score = ' + str(round(log_loss_prob_val,5)))
    #print('Validation_score = ' + str(log_loss_prob_val))

   

Training_score = 1.18059     Validation_score = 1.17359
Training_score = 1.17809     Validation_score = 1.17232
Training_score = 1.03907     Validation_score = 1.03805
Training_score = 0.97087     Validation_score = 0.97593
Training_score = 0.97023     Validation_score = 0.97553
Training_score = 0.96819     Validation_score = 0.9741
Training_score = 0.96805     Validation_score = 0.97366
Training_score = 0.96794     Validation_score = 0.9741
Training_score = 0.96402     Validation_score = 0.96946
Training_score = 0.96373     Validation_score = 0.96907
Training_score = 0.96367     Validation_score = 0.96906
Training_score = 0.9636     Validation_score = 0.96903
Training_score = 0.96313     Validation_score = 0.9691
Training_score = 0.96311     Validation_score = 0.96913
Training_score = 0.96311     Validation_score = 0.96929
Training_score = 0.9631     Validation_score = 0.96924
Training_score = 0.96306     Validation_score = 0.96926
Training_score = 0.96295     Validation_score = 0.969
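The same incremental procedure can be reproduced end to end on synthetic data. In this sketch, the random feature matrix, the two-class labels, and `max_iter=1000` are assumptions chosen to keep the example self-contained and fast; the split is also done once before the loop, which gives the same partition as re-splitting with a fixed random_state on every iteration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))                   # synthetic stand-in for the feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int) + 1     # synthetic labels 1/2, driven by the first two columns

X_trn, X_val, y_trn, y_val = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=123)

val_scores = []
for k in range(1, X.shape[1] + 1):
    # Fit on the first k features only, then score on the held-out rows
    clf = LogisticRegression(C=1, max_iter=1000)
    clf.fit(X_trn[:, :k], y_trn)
    score = log_loss(y_val, clf.predict_proba(X_val[:, :k]), labels=[1, 2])
    val_scores.append(score)
    print(k, round(score, 5))
```

A feature is worth keeping when adding it lowers the validation score; in the notebook's output above, the third and fourth features give the largest drops.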

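After selecting features and tuning the penalty and C values, the following logistic regression classifier was chosen as the best of those tried. The selected feature list drops Terrier, Month_start, Quarter_start, and Quarter_end from features_all.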
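The notebook does not show the tuning code itself. One possible way to tune C against the competition metric is a cross-validated grid search with the 'neg_log_loss' scorer, sketched below on synthetic data. The grid values, the 3-fold setting, and the data are assumptions, and only C is searched here (with the default l2 penalty).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))       # synthetic features
y = (X[:, 0] > 0).astype(int)       # synthetic binary labels

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1, 10]},   # hypothetical grid
    scoring='neg_log_loss',                 # scikit-learn maximises scores, hence the negated loss
    cv=3,
)
search.fit(X, y)

# Flip the sign back to read the score as an ordinary log loss
print(search.best_params_, -search.best_score_)
```

Scoring with the negated log loss keeps the search aligned with the metric used to evaluate submissions, rather than with accuracy.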

In [123]:
features = ['Animal_map', 'Sex_map', 'Fertility_map', 'Age_groups', 'Mix', 'Chihuahua', 'Labrador', 'Retriever', 'Bull',\
            'Shepherd', 'Australian', 'German', 'Miniature', 'Dachshund', 'Cattle', 'Poodle', 'Border', 'Collie', 'Boxer',\
            'American', 'Hound', 'Beagle', 'Russell', 'Schnauzer', 'Chow', 'Jack', 'Yorkshire', 'Rat', 'Catahoula', 'Great',\
            'Domestic', 'Shorthair', 'Medium', 'Longhair', 'Siamese', 'Snowshoe', 'White', 'Black', 'Brown', 'Tan', 'Brindle',\
            'Red', 'Tricolor', 'Blue', 'Tabby', 'Orange', 'Tortie', 'Point', 'Calico', 'Torbie', 'Cream', 'Lynx', 'Seal',\
            'Gray', 'Flame', 'Smoke', 'Chocolate', 'Sable', 'Merle', 'Buff', 'Yellow', 'Tick', 'Silver', 'Gold', 'Weekday',\
            'Month', 'Year', 'Day_of_Year', 'Month_end']
df_trn,df_val,label_trn,label_val = train_test_split(df_raw[features],df_raw[['AnimalID','out_map']],
                                            stratify = df_raw['out_map'].values,test_size = 0.3,random_state = 123)

#clf = RandomForestClassifier(n_estimators=100,max_features=0.5)
clf = LogisticRegression(penalty = 'l2',C=1)
clf.fit(df_trn.values, label_trn['out_map'].values)

log_loss_prob_trn = log_loss(label_trn['out_map'].values,clf.predict_proba(df_trn.values),labels = [1,2,3,4,5])
log_loss_prob_val = log_loss(label_val['out_map'].values,clf.predict_proba(df_val.values),labels = [1,2,3,4,5])

print('Training_score = '+ str(round(log_loss_prob_trn,5)) + '     Validation_score = ' + str(round(log_loss_prob_val,5)))

Training_score = 0.95042     Validation_score = 0.96092


Using the predict_proba method, the probability of each test animal belonging to each outcome class is computed, and a dataframe is created from those values.
Finally, the to_csv method writes the dataframe to a CSV file.

In [124]:
# Column order must match clf.classes_ (1..5), i.e. the order used in the out_map mapping
out_cols=['Adoption','Died','Euthanasia', 'Transfer','Return_to_owner' ]

#df_output = pd.get_dummies(clf.predict(df_tst.values))
# Predict on .values, matching how the classifier was fitted
df_output = pd.DataFrame(clf.predict_proba(df_test[features].values),columns=out_cols)
df_output['ID'] = (df_output.index+1)

df_output.head()

df_output.to_csv('logistic_best.csv',index = False)
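The columns returned by predict_proba follow the order of `clf.classes_`, so the column names can be derived from the fitted classifier rather than typed by hand. The sketch below shows this on synthetic data; `label_names`, `submission`, and the random inputs are assumptions for illustration, and the integer codes mirror the out_map mapping used earlier.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Inverse of the integer mapping used for out_map
label_names = {1: 'Adoption', 2: 'Died', 3: 'Euthanasia', 4: 'Transfer', 5: 'Return_to_owner'}

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))            # synthetic features
y = np.tile([1, 2, 3, 4, 5], 20)         # synthetic labels covering all five classes

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Name each probability column after the class it actually belongs to
proba = clf.predict_proba(X[:4])
submission = pd.DataFrame(proba, columns=[label_names[c] for c in clf.classes_])
submission.insert(0, 'ID', np.arange(1, len(submission) + 1))
print(submission.round(3))
```

Deriving the names from `clf.classes_` guards against silently mislabelled columns if the integer mapping is ever changed.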