# Project 2 - Shelter Animal Outcomes


## Submitted by: Dror Vered

### June 2018

The goal of this project is to predict the outcome of the dogs and the cats as they leave the Austin Animal Center.
These outcomes include: Adoption, Died, Euthanasia, Return to owner, and Transfer.

### Part 1:
Understand and analyze the problem, explore the data, perform data visualization, provide insights and plan the next stages.

-----------------------------------------------------------------------------------------------------------------------------

### My Work Plan:

1. After reading the "train" file, I'll get familiar with the data and its features.
2. I'll perform some general analysis and visualization regarding OutcomeTypes.
3. Exploring the features:
   1. For each given feature, I will:
      - Look for empty and special values. 
        - Are they important? (e.g. do empty pets' names relate somehow to OutcomeType?)
        - should they be ignored/removed/updated?
      - Explore and analyze the given features
   2. As of now, it seems that there are features that are missing and could have been of use:
      - Typical characteristics of the pet, such as size, level of friendliness, trainability, etc. 
        Based on the pet's breed, I will try to obtain some data from external sources.
      - Condition of the pet upon arrival at the shelter, level of training when entering the shelter (especially for dogs),
        and reason/circumstance of arrival at the shelter. I guess this data will remain unknown.
4. Summary:
   - Summary of the features added to the given data structure
   - Summary of insights based on given and collected data
   - Plans for the next part of the project

In [1]:
import pandas as pd
%matplotlib inline

## 1. Read the train file and get familiar with the data. Show unique values in each column

In [2]:
outcomes = pd.read_csv('train.csv', index_col='AnimalID', parse_dates=['DateTime'])
# outcomes.head()

In [3]:
# outcomes.describe()

In [4]:
animal_types = outcomes.groupby('AnimalType')['DateTime'].count()
# animal_types

## 2. Perform some general analysis and visualization regarding OutcomeTypes

In [5]:
type_value_counts = outcomes.OutcomeType.value_counts()

In [6]:
# print(type_value_counts)
# type_value_counts.plot(kind='bar')

In [7]:
# outcomes.OutcomeType.value_counts(normalize=True)

### Insight:
According to the description of the case on the Kaggle web page, more than 35% of the pets arriving at shelters
in the US every year end up dead (2.7 million out of 7.6 million). However, strangely enough, according to the dataset of intake
information received from the AAC (Austin Animal Center), the share of dogs and cats that died or were euthanized (over 2.5 years)
is only 6.5% (5.8% Euthanasia + 0.7% Died; 1,752 out of 26,729).
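
The two rates compared here can be re-derived with a few lines of arithmetic. This is only a sketch: the combined count of 1,752 died or euthanized animals is an assumed reading of the figure quoted in the text (it is consistent with roughly 6.5% of 26,729), and the variable names are made up for the illustration.

```python
# National figures quoted from the Kaggle description (in millions of animals).
us_deaths, us_intakes = 2.7, 7.6
# AAC figures; 1,752 is an assumed reading of the combined Died + Euthanasia count.
aac_deaths, aac_total = 1752, 26729

us_rate = us_deaths / us_intakes      # share of shelter pets that die, nationwide
aac_rate = aac_deaths / aac_total     # same share within the AAC dataset

print(f"US shelters: {us_rate:.1%}")
print(f"AAC dataset: {aac_rate:.1%}")
print(f"The national rate is about {us_rate / aac_rate:.1f} times the AAC rate")
```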

Trying to explain the above finding, I'd like to suggest that the "final outcome" of many transferred pets might be
"not as good as expected".

In [8]:
# outcomes.groupby(['AnimalType', 'OutcomeType'])['Breed'].count()

In [9]:
# Looking separately at Dogs and Cats:

dogs_outcomes = outcomes[outcomes.AnimalType == 'Dog']
cats_outcomes = outcomes[outcomes.AnimalType == 'Cat']

In [10]:
dogs_outcomes_grpby_type = dogs_outcomes.groupby(['OutcomeType'])['Breed'].count()
# dogs_outcomes_grpby_type.plot(kind='bar', title='Dogs Outcomes')

In [11]:
cats_outcomes_grpby_type = cats_outcomes.groupby(['OutcomeType'])['Breed'].count()
# cats_outcomes_grpby_type.plot(kind='bar', title='Cats Outcomes')

In [12]:
animal_type_crstb = pd.crosstab(outcomes.OutcomeType, outcomes.AnimalType, normalize='columns')
# animal_type_crstb

In [13]:
# animal_type_crstb.plot(kind='bar', title='Outcomes by Animal Type')

### Insights:
- The percentage of dogs being adopted is slightly higher than that of cats
- The percentage of RTO (returned to owner) dogs is *much* higher than that of cats
- The percentage of cats that were Euthanized, Transferred or Died is higher than that of dogs

## 3. For each given feature:
    a. Looking for empty and special values
    b. Exploring and analyzing the given features

In [14]:
# outcomes.info()

**The following features have empty values:**
- *Name:* there are 26,729-19,038=7,691 pets with no name. I assume these animals were picked up from the street, or maybe they were born in the shelter and were not given a name.
  I will later check whether there is a correlation between the outcome of the pet and the fact that its name is unknown.
- *OutcomeSubtype:* there are 26,729-13,117=13,612 empty values. However, most of these are probably of little importance. I will explore that later on.
- *SexuponOutcome:* there is only 1 empty value (negligible)
- *AgeuponOutcome:* there are 26,729-26,711=18 empty values. Although this is an important feature, the amount of these null values is negligible.
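
The counts above were derived by subtracting the non-null counts reported by `info()` from the total number of rows. A more direct way to get the same kind of summary is sketched below on a small invented dataframe (`toy` and its values are made up for the illustration, not taken from the train file).

```python
import pandas as pd

# Toy stand-in for the train file; the values are invented for illustration.
toy = pd.DataFrame({
    'Name': ['Max', None, 'Bella', None],
    'OutcomeSubtype': [None, 'Partner', None, None],
    'AgeuponOutcome': ['1 year', '2 weeks', None, '3 months'],
})

missing = toy.isna().sum()           # number of empty cells per column
missing_share = toy.isna().mean()    # the same, as a fraction of all rows
summary = pd.DataFrame({'missing': missing, 'share': missing_share})
print(summary)
```

Applied to `outcomes`, the same two calls would list the empty values of every column at once.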

## 3.1. Name:

### 3.1.a. looking for "special" names (such as "unknown" or "John Doe")

In [15]:
# print(outcomes.Name.nunique())
# outcomes.Name.value_counts(normalize=True).head(10)

*It seems that there's no significant group of pets that has a "special name"*

In [16]:
# print(len(outcomes[outcomes.Name=='Unknown']))
# print(len(outcomes[outcomes.Name=='John Doe']))

*It seems that there are no special names such as the above*

In [17]:
# I will now convert empty names to "Unknown", for later use.

# assigning the result back is more robust than an in-place fillna on a selected column
outcomes['Name'] = outcomes['Name'].fillna('Unknown')

In [18]:
# outcomes.Name.value_counts().head()

### 3.1.b. Exploring and analyzing the Name feature

- Do 'Unknown' pets' names relate somehow to OutcomeType?
- To explore this question, I'll now add a column called 'Named' (0="unknown", 1=otherwise)

In [19]:
outcomes['Named'] = outcomes.Name.apply(lambda name: 0 if name == 'Unknown' else 1)
# outcomes.head()

In [20]:
type_named_crstb = pd.crosstab(outcomes.OutcomeType, outcomes.Named)
# type_named_crstb

In [21]:
# visualization - bar chart:

# type_named_crstb.plot(kind='bar', title='OutcomeTypes of Named/UnNamed animals')

In [22]:
# another way to show it - stacked bar:

# type_named_crstb.plot(kind='bar', stacked=True, title='OutcomeTypes of Named/UnNamed animals')

In [23]:
# normalize over each row

type_named_crstb = pd.crosstab(outcomes.OutcomeType, outcomes.Named, normalize='index')
# type_named_crstb

In [24]:
# type_named_crstb.plot(kind='bar', stacked=True, title='OutcomeTypes of Named/UnNamed animals (Percent Stacked)')

In [25]:
# normalize over each column

# pd.crosstab(outcomes.OutcomeType, outcomes.Named, margins=True, normalize='columns')
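
The two normalizations used in the cells above answer different questions, which is easy to mix up when reading the percentages. The sketch below uses a tiny invented dataset (`toy` is hypothetical) to show the difference.

```python
import pandas as pd

# Invented toy data: 5 named animals and 1 unnamed animal.
toy = pd.DataFrame({
    'OutcomeType': ['Adoption', 'Adoption', 'Adoption', 'Transfer', 'Transfer', 'Adoption'],
    'Named':       [1, 1, 1, 1, 1, 0],
})

# Each row sums to 1: "of all adopted animals, what share is named?"
by_outcome = pd.crosstab(toy.OutcomeType, toy.Named, normalize='index')
# Each column sums to 1: "of all named animals, what share is adopted?"
by_named = pd.crosstab(toy.OutcomeType, toy.Named, normalize='columns')

print(by_outcome.loc['Adoption', 1])   # 3 of the 4 adopted animals are named
print(by_named.loc['Adoption', 1])     # 3 of the 5 named animals are adopted
```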

### Insights:
- 84% of adopted animals are named
- Almost 97% of animals that are Returned To their Owners (RTO) are named
- 61% of the animals that died are unnamed
- Unnamed animals tend to be euthanized and transferred more than named animals
- 47% of named animals are adopted, whereas the overall adoption rate is only 40%
- 24% of named animals are RTO, whereas the overall RTO rate is only 18%
- Likewise, the percentages of named animals that Died, were Euthanized or were Transferred are lower than the overall percentages

## 3.2. DateTime:

### 3.2.a. Empty/special values:

As shown above, there are no empty values in this column, and all dates are in the same format (yyyy/dd/mm hh:mm:ss)

### 3.2.b Exploring and analyzing the DateTime feature

Although the header of the column doesn't say "OutcomeDateTime", as in the case of all other "outcome" features, by looking
at the order of the DateTime values, it is noticeable that they are not sorted like the AnimalIDs (which are given during
the intakes). That indicates that the DateTime values are not the intake dates, but the outcome dates.
As such, it seems that this feature will not be of use when trying to predict the next outcome. However, it might suggest
that the shelter should put in some extra effort during certain periods of the year (weekends? holidays?...)

What I'm going to look for, is whether the 'volume of activity' of the shelter is somehow related to:

    i.   specific day(s) of the week
    ii.  specific season(s)/month(s)
    iii. holidays

#### i. specific day(s) of the week

In [26]:
# To explore this issue, I'll add the column 'DayOfWeek', as follows:

# .dt.weekday_name was removed from newer pandas versions; .dt.day_name() returns the same day names
outcomes['DayOfWeek'] = outcomes['DateTime'].dt.day_name()

In [27]:
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
outcomes['DayOfWeek'] = pd.Categorical(outcomes['DayOfWeek'], categories=days, ordered=True)

In [28]:
# outcomes.head()

In [29]:
OutcomeType_DOW_crstb = pd.crosstab(outcomes.DayOfWeek.sort_values(), outcomes.OutcomeType, normalize='columns')
# OutcomeType_DOW_crstb

In [30]:
# OutcomeType_DOW_crstb.plot(kind='bar', title='Outcomes by Days of Week', figsize=(8,7))

In [31]:
# Another type of visualization: Heatmap

import seaborn as sns

In [32]:
# sns.heatmap(OutcomeType_DOW_crstb, linewidths=0.5, annot=True, center=0.3)

### Insight:
- Not surprisingly, it seems that most adoptions take place during the weekends. Also, it seems that animals are euthanized
  mostly on Mondays, supposedly after they were not adopted during the preceding weekend.
- As I mentioned above, this feature will not contribute to predicting the next outcome. However, it might suggest that the
  shelter should put in some extra effort during weekends, as people tend to adopt animals on Saturdays and Sundays.

#### ii. seasons
To make it simple, I'll use the attribute 'quarter' (1 = Jan to Mar, 2 = Apr to Jun, etc.)

In [33]:
# I'll add the column 'Quarter' as follows:

outcomes['Quarter'] = outcomes['DateTime'].dt.quarter

In [34]:
# outcomes.head()

In [35]:
type_Q_crstb = pd.crosstab(outcomes.OutcomeType, outcomes.Quarter, margins=True, normalize='index')

In [36]:
# print(type_Q_crstb)
# sns.heatmap(type_Q_crstb, linewidths=0.5, annot=True, center=0.3)

### Insight:
It seems that more activity takes place during Q4, and less during Q1.

#### iii. holidays

I'll try to explore whether people tend to adopt animals as a present for the holidays. It will probably not help predict
the outcome of the next dog/cat, but it might indicate that the shelter should put in some extra effort before holidays.

For that purpose, I built a csv file with most of the holidays that took place during the relevant period (Oct-13 to Mar-16).

In [37]:
holidays = pd.read_csv('austin holidays.csv', sep=',', parse_dates=['HolidayDate'], encoding='latin-1')

In [38]:
# holidays.info()

In [39]:
# holidays.head()

In [40]:
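def before_holiday(date_time):
    '''
    Finds out whether a given date falls 5 days or less before a holiday (an adoption that took place up to 5 days before
    the holiday is considered here a "holiday gift").
    If yes, returns the holiday's name (first column of the holidays dataframe), otherwise returns 'No Holiday'.
    The early return below relies on the holidays dataframe being sorted by HolidayDate in ascending order.
    '''
    for i, holiday_date in enumerate(holidays.HolidayDate):
        dates_diff = holiday_date - date_time
        if dates_diff.days > 5:      # the next upcoming holiday is more than 5 days away
            return 'No Holiday'
        if dates_diff.days >= 0:     # the holiday is 0 to 5 days ahead
            return holidays.iloc[i, 0]
    return 'No Holiday'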

outcomes['Holiday'] = outcomes['DateTime'].apply(before_holiday)
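
The lookup above scans the holiday list from the start for every row. The sketch below shows the same rule (first holiday that is 0 to 5 days ahead) with a binary search from the standard library. The holiday table and the function name `upcoming_holiday` are hypothetical, since the contents of the holidays file are not shown in this notebook, and the table is assumed to be sorted by date.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical holiday table, sorted by date (invented for illustration).
HOLIDAYS = [
    (datetime(2014, 12, 25), 'Christmas'),
    (datetime(2015, 1, 1), "New Year's Day"),
]
HOLIDAY_DATES = [date for date, _ in HOLIDAYS]

def upcoming_holiday(moment, window_days=5):
    """Return the name of the first holiday 0..window_days days ahead, else 'No Holiday'."""
    pos = bisect_left(HOLIDAY_DATES, moment)    # first holiday at or after `moment`
    if pos == len(HOLIDAYS):                    # no holiday left in the table
        return 'No Holiday'
    date, name = HOLIDAYS[pos]
    return name if (date - moment).days <= window_days else 'No Holiday'

print(upcoming_holiday(datetime(2014, 12, 22, 15, 30)))
print(upcoming_holiday(datetime(2014, 12, 25, 10, 0)))
```

As in the function above, an outcome recorded later on the holiday itself is not counted as "before the holiday", because the holiday date is taken at midnight.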

In [41]:
# print(outcomes[outcomes.OutcomeType == 'Adoption'].DateTime.count())
# outcomes[outcomes.OutcomeType == 'Adoption'].Holiday.value_counts(dropna=False)

In [42]:
# outcomes[outcomes.OutcomeType == 'Adoption'].Holiday.value_counts(dropna=False, normalize=True).head()

In [43]:
# outcomes[(outcomes.OutcomeType == 'Adoption') & (outcomes.Holiday != 'No Holiday')].Holiday.value_counts().plot(kind='bar', title='Adoptions before Holidays')

### Insight: 
About 20% of all adoptions take place 5 days or less before a holiday (mostly before Christmas).

## 3.3 OutcomeSubtype:

As of now, it is not clear whether this feature is important for us, or not. I will now explore this feature.

### 3.3.a. Empty/special values:
As shown above, there are many empty values in this column. However, this is not much of a problem.

### 3.3.b Exploring and analyzing the OutcomeSubtype feature

In [44]:
subtype_type_crosstab = pd.crosstab(outcomes.OutcomeSubtype, outcomes.OutcomeType, dropna=False)
# subtype_type_crosstab

In [45]:
# sns.heatmap(subtype_type_crosstab, linewidths=0.5, annot=True, fmt="d", center=8000)

### Insights:
- It seems that if an animal is suffering, aggressive, badly behaved, at risk of rabies or has a medical problem, it will be
  euthanized.
- Main subtypes of 'Transfer' are: 'Partner' and 'SCRP':
    - Partner: is probably another shelter for pets
    - SCRP: just by searching the web, I couldn't find the meaning of this abbreviation. However, after exploring the data further, I assume that it means "Street Cats Rescue Program". Here's why: 
          
          i. SCRP is relevant only for cats (as shown below)
          ii. Almost all SCRP rows are relevant for pets with no name (as shown below) ==> i.e. street cats

In [46]:
# i. The following output clearly shows that SCRP is relevant only for cats
# pd.crosstab(outcomes.OutcomeSubtype, outcomes.AnimalType)

In [47]:
# ii. The following output shows that 96% of SCRP samples are relevant for pets with no name
# outcomes[outcomes.OutcomeSubtype == 'SCRP'].Name.value_counts(normalize=True, dropna=False).head()

## 3.4 SexuponOutcome:

### 3.4.a. Empty/special values:

As mentioned earlier, there's one empty value in this feature. I'll now replace it with 'Unknown' value.

In [48]:
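# assigning the result back is more robust than an in-place fillna on a selected column
outcomes['SexuponOutcome'] = outcomes['SexuponOutcome'].fillna('Unknown')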

### 3.4.b Exploring and analyzing the SexuponOutcome feature

It is pretty obvious that this feature actually consists of two separate important features:

- Sex of pet (Male/Female)
- Neuter status (Neutered, Spayed, or Intact)

I will deal with this separation as follows:

In [49]:
# outcomes.SexuponOutcome.value_counts()

It seems that for more than a thousand pets the sex is "Unknown" and/or it is unknown whether or not they were spayed/neutered.

In [50]:
# pd.crosstab(outcomes.OutcomeType, outcomes.SexuponOutcome)

### Insight: 
Amazingly, NONE of the animals whose SexuponOutcome is "Unknown" were adopted!
Only a few of them were RTO.

I will now create 2 new separated features: Sex of pet (Male/Female) and Neuter status (Neutered, Spayed, or Intact).

The values of the new column 'NeuterStatus' will reflect the SexuponOutcome values as follows:
- the values 'Intact Female' and 'Intact Male' will be mapped to the new value 'Intact'
- the values 'Spayed Female' and 'Neutered Male' will be mapped to the new value 'N/S' (I see no point separating these two)
- the value 'Unknown' will be mapped as-is
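
The mapping just described can also be expressed as a small function that splits the string instead of listing every value. This is only an alternative sketch; the name `split_sex_and_status` is hypothetical and it assumes that every value is either 'Unknown' or of the form '<status> <sex>'.

```python
def split_sex_and_status(value):
    """Split a SexuponOutcome string into a (sex, neuter_status) pair."""
    parts = value.split()
    if len(parts) != 2:                 # e.g. 'Unknown'
        return 'Unknown', 'Unknown'
    status, sex = parts
    # 'Spayed' and 'Neutered' are merged into a single 'N/S' value.
    status = 'Intact' if status == 'Intact' else 'N/S'
    return sex, status

for raw in ['Intact Female', 'Neutered Male', 'Spayed Female', 'Unknown']:
    print(raw, '->', split_sex_and_status(raw))
```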

In [51]:
outcome_sex_map_dict = {
    'Intact Female': 'Female',
    'Intact Male': 'Male',
    'Spayed Female': 'Female',
    'Neutered Male': 'Male',
    'Unknown': 'Unknown'    
}

outcome_neuter_map_dict = {
    'Intact Female': 'Intact',
    'Intact Male': 'Intact',
    'Spayed Female': 'N/S',
    'Neutered Male': 'N/S',
    'Unknown': 'Unknown'    
}

outcomes['Sex'] = outcomes.SexuponOutcome.replace(outcome_sex_map_dict)
outcomes['NeuterStatus'] = outcomes.SexuponOutcome.replace(outcome_neuter_map_dict)
# outcomes.head()

In [52]:
# pd.crosstab(outcomes.OutcomeType, outcomes.NeuterStatus)

In [53]:
# from the OutcomeType direction (normalized by rows)
# sns.heatmap(pd.crosstab(outcomes.OutcomeType, outcomes.NeuterStatus, normalize='index'), linewidths=0.5, annot=True, center=1)

### Insights regarding NeuterStatus:
- 97% of adopted pets were spayed/neutered
- 83% of RTO pets were spayed/neutered
- 69% of the pets that Died and 56% of the Euthanized pets were Intact

In [54]:
# from the N/S status direction (normalized by columns)
# sns.heatmap(pd.crosstab(outcomes.OutcomeType, outcomes.NeuterStatus, normalize='columns'), \
#             linewidths=0.5, annot=True, center=1)

### More insights regarding NeuterStatus:
- 77% (56 + 21) of N/S pets ended well (either adopted or returned to their owners)
- 69% of Intact pets were transferred
- 87% of pets whose neutered/spayed status was not clear, were transferred. None of them were adopted

In [55]:
# pd.crosstab(outcomes.OutcomeType, outcomes.Sex, margins=True)

In [56]:
# pd.crosstab(outcomes.OutcomeType, outcomes.Sex, margins=True, normalize='index')

### Insights regarding Sex:
- It seems that adoption is not related to the sex of the pet.
- Males Die and are Euthanized more often than females.
- Males are RTO more often than females

## 3.5 AgeuponOutcome:

### 3.5.a. Empty/special values:

In [57]:
print(outcomes.AgeuponOutcome.nunique())
# outcomes.AgeuponOutcome.value_counts(dropna=False).tail(10)

44


There are 18 empty values, and 22 pets with age '0 years'. 

In [58]:
# outcomes[outcomes.AgeuponOutcome == '0 years'].groupby(['AnimalType', 'OutcomeType'])['Name'].count()

Most of the '0 years' pets are cats. Most of them were transferred (none of them were adopted).

In [59]:
# outcomes[pd.isna(outcomes.AgeuponOutcome)].groupby(['AnimalType', 'OutcomeType'])['Name'].count()

Most of the pets with an empty age are cats. Most of them were transferred (none of them were adopted).

Since the number of empty values of AgeuponOutcome is very small, I'll drop these samples from the dataframe.

In [60]:
outcomes.dropna(subset=['AgeuponOutcome'], inplace=True)

### 3.5.b Exploring and analyzing the AgeuponOutcome feature

It is noticeable that the values of this feature are presented in various formats: day(s) / week(s) / month(s) / year(s).
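
To compare ages given in different units, they can first be brought to a common unit. The sketch below converts an age string to an approximate number of days. The function name `age_to_days` and the conversion factors (30 days per month, 365 days per year) are assumptions made for the illustration, and empty values are assumed to have been dropped already.

```python
# Rough, assumed conversion factors to days.
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 365}

def age_to_days(age):
    """Convert strings such as '2 years' or '3 weeks' to an approximate number of days."""
    number, unit = age.split()
    unit = unit.rstrip('s')             # 'years' -> 'year', 'day' stays 'day'
    return int(number) * DAYS_PER_UNIT[unit]

for raw in ['5 days', '3 weeks', '1 month', '2 years', '0 years']:
    print(raw, '->', age_to_days(raw))
```

A numeric age like this could serve as a continuous feature alongside a grouped one.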

Furthermore, since there are lots of "ages" in the dataset (44 to be precise, as shown above), I will "group" them into 7
"life stages", as follows.

In [61]:
def calc_life_stage(age):
    '''
    Returns a "life stage" to which a given age belongs.
    The defined life stages are:
        - neonatal:     birth - 4 weeks
        - puppy/kitten: 1 month - 6 months
        - junior:       7 months - 2 years
        - prime:        3 - 6 years
        - mature:       7 - 10 years
        - senior:       11 - 14 years
        - geriatric:    15+ years    
    '''
    try:
        age_val = int(age[0:2])
    except (TypeError, ValueError):     # empty or malformed age value
        return 'Neonatal'        
    if age_val == 0:
        return 'Neonatal'
    if 'day' in age or 'week' in age:
        return 'Neonatal'
    elif 'month' in age:
        if age_val <= 6:
            return 'Puppy/Kitten'
        else:
            return 'Junior'
    elif 'year' in age:
        if age_val <= 2:
            return 'Junior'
        elif age_val <= 6:
            return 'Prime'
        elif age_val <= 10:
            return 'Mature'
        elif age_val <= 14:
            return 'Senior'
        else:
            return 'Geriatric'
    else:
        return 'Other'

outcomes['LifeStage'] = outcomes['AgeuponOutcome'].apply(calc_life_stage)
# outcomes.LifeStage.value_counts(dropna=False)

In [62]:
# Converting LifeStage to a categorical and specifying an order on the categories (for using logical order)

life_stages = ['Neonatal', 'Puppy/Kitten', 'Junior', 'Prime', 'Mature', 'Senior', 'Geriatric']
outcomes['LifeStage'] = pd.Categorical(outcomes['LifeStage'], categories=life_stages, ordered=True)

In [63]:
# pd.crosstab(outcomes.LifeStage.sort_values(), outcomes.OutcomeType)

In [64]:
# Normalization by OutcomeType
outcome_age_crstb = pd.crosstab(outcomes.LifeStage.sort_values(), outcomes.OutcomeType, normalize='columns')

In [65]:
# outcome_age_crstb

In [66]:
# outcome_age_crstb.plot(kind='bar', title='Outcome Types by Life Stages', grid=True, figsize=(6,9), subplots=True)

### Insights:
- Adoption:
    - Pets are not adopted until they're at least 1 month old. 
    - Pets are almost not adopted in the Mature stage or older
    - Pets are mostly adopted when they're at the Puppy/Kitten stage. After that - as Juniors
- RTO:
    - Pets are returned to their owners mostly as Junior or Prime
- Died:
    - Pets mostly die in the Neonatal and Puppy/Kitten stages. 
- Euthanasia:
    - Pets are mostly euthanized in the Junior and Prime stages

In [67]:
# Normalization by LifeStage
age_outcome_crstb = pd.crosstab(outcomes.OutcomeType, outcomes.LifeStage.sort_values(), normalize='columns', margins=True)

In [68]:
# age_outcome_crstb

In [69]:
# age_outcome_crstb.plot(kind='bar', title='Outcome Types by Life Stages', grid=True, figsize=(4,14), subplots=True)

### Insights:
- Neonatal:
    - Pets are not adopted during their Neonatal period (as seen earlier) 
    - Being in the Neonatal life stage, there's almost 93% chance to be transferred
    - Chances to Die are high
- Puppy/Kitten:
    - As a Puppy/Kitten, there's 61% chance to be Adopted
    - Chances to Die are also high
- Junior:
    - As a Junior, there's almost 40% chance to be Adopted; 34% chance to be transferred
- Mature:
    - As a Mature pet, there's 38% chance to be RTO
- Senior:
    - As a Senior, there's 45% chance to be RTO
- Geriatric:
    - As a Geriatric pet, there's 52% chance to be RTO, 24% to be Euthanized and only 11% chance to be adopted

Separating between Dogs and Cats

In [70]:
dogs_outcomes = outcomes[outcomes.AnimalType == 'Dog']
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.LifeStage, normalize='columns', margins=True). \
#             plot(kind='bar', title='Dogs - Life Stages by Outcome Types', grid=True, figsize=(6,4))

In [71]:
cats_outcomes = outcomes[outcomes.AnimalType == 'Cat']
# pd.crosstab(cats_outcomes.OutcomeType, cats_outcomes.LifeStage, normalize='columns', margins=True). \
#             plot(kind='bar', title='Cats - Life Stages by Outcome Types', grid=True, figsize=(6,4))

### Insights about main differences between Dogs and Cats regarding Age:
- While Junior dogs' chances to be Adopted are rising, Junior cats' chances are falling
- Senior dogs' chances to be Adopted are much lower than Senior cats'
- Senior and Geriatric cats' chances to be Euthanized are higher than Senior and Geriatric dogs'
- Junior cats' chances to be Transferred are higher than Junior dogs'

## 3.6 Breed:

#### First - some exploration of this feature

In [72]:
# outcomes.Breed.nunique()

In [73]:
dogs_outcomes = outcomes[outcomes.AnimalType == 'Dog']
cats_outcomes = outcomes[outcomes.AnimalType == 'Cat']

#### Dogs

In [74]:
# dogs_outcomes.Breed.nunique()

In [75]:
dogs_popularity = dogs_outcomes['Breed'].value_counts()
# dogs_popularity

In [76]:
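# smallest number of most popular breeds that together cover at least 80% of all dogs
# (count the breeds whose cumulative total is still below the 80% mark, then add the one that crosses it)
number_of_popular_dogs = int((dogs_popularity.cumsum() < 0.8 * animal_types.loc['Dog']).sum()) + 1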
# number_of_popular_dogs

#### Cats

In [77]:
# cats_outcomes.Breed.nunique()

In [78]:
cats_popularity = cats_outcomes['Breed'].value_counts()
# cats_popularity

In [79]:
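# smallest number of most popular breeds that together cover at least 80% of all cats
# (count the breeds whose cumulative total is still below the 80% mark, then add the one that crosses it)
number_of_popular_cats = int((cats_popularity.cumsum() < 0.8 * animal_types.loc['Cat']).sum()) + 1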
# number_of_popular_cats
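
The "how many breeds cover 80% of the animals" question from the cells above can be illustrated without pandas. The sketch below uses an invented list of breed labels; `breeds_covering` is a hypothetical helper name.

```python
from collections import Counter
from itertools import accumulate

def breeds_covering(breeds, share=0.8):
    """Smallest number of most frequent breeds that together cover at least `share` of the animals."""
    counts = [n for _, n in Counter(breeds).most_common()]   # frequencies, descending
    target = share * len(breeds)
    for k, running_total in enumerate(accumulate(counts), start=1):
        if running_total >= target:
            return k
    return len(counts)

toy_breeds = ['A'] * 6 + ['B'] * 3 + ['C']      # invented sample of 10 animals
print(breeds_covering(toy_breeds))              # 'A' and 'B' together cover 9 of the 10 animals
```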

As shown above, there are 1,380 different values in the Breed column. Some of the values include the word 'Mix', some of them
consist of two breeds (separated by '/') and some consist of one "pure" breed.

I'll try to deal with this feature and its correlation with the pets OutcomeTypes, in several different ways:

    a. Creating a new 'BreedPurity' column, whose values would relate to the "purity" of the breed
    b. Narrowing the variety of values by adding a new 'MainBreed' column which will hold the "main breed" of each pet
    c. Presenting additional data that I've found in external sources, based on the pets' breeds

First, I'd like to check whether there are values which contain more than one '/'

In [80]:
check_slash_breed = outcomes.Breed.apply(lambda breed: breed if breed.count('/') > 1 else None)

In [81]:
# check_slash_breed.value_counts()

I'll now convert the 'Black/Tan Hound' string to 'Black and Tan Hound'

In [82]:
outcomes['Breed'] = outcomes['Breed'].str.replace('Black/Tan Hound', 'Black and Tan Hound', regex=False)

### 3.6.a Creating a new 'BreedPurity' column:

In [83]:
def calc_breed_type(breed):
    '''
    Returns a "breed purity" of a given breed.
    The defined values of breed purity are:
        - Mix: in case the word 'Mix' exists in the breed
        - Crossbreeds: in case the sign '/' exists in the breed
        - Purebred: in case the value consists of one breed only (no 'mix' and no '/')    
    '''
    if 'Mix' in breed:
        return 'Mix'
    if '/' in breed:
        return 'Crossbreeds'
    return 'Purebred'

outcomes['BreedPurity'] = outcomes['Breed'].apply(calc_breed_type)

In [84]:
# outcomes.head()

In [85]:
# outcomes.BreedPurity.value_counts(dropna=False)

In [86]:
# pd.crosstab(outcomes.OutcomeType, outcomes.BreedPurity, normalize='columns', margins=True)

In [87]:
# Looking separately at Dogs and Cats:

dogs_outcomes = outcomes[outcomes.AnimalType == 'Dog']
cats_outcomes = outcomes[outcomes.AnimalType == 'Cat']

#### Dogs

In [88]:
# dogs_outcomes.BreedPurity.value_counts(dropna=False)

In [89]:
dogs_breed_outcome_crstb = pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.BreedPurity, normalize='columns', margins=True)

In [90]:
# dogs_breed_outcome_crstb

In [91]:
# dogs_breed_outcome_crstb.plot(kind='bar', title='Outcome Types by Breed Purity', grid=True, figsize=(6,4), subplots=False)

### Insights regarding dogs breeds:
- As a Crossbreeds dog, your chances of being Adopted are rising, and your chances of being RTO are falling
- As a Purebred dog, your chances of being Adopted are falling, and your chances of being RTO are rising (which is not
  surprising, as it might be an expensive or rare dog, whom the owner would try hard to find)    

#### Cats

In [92]:
# cats_outcomes.BreedPurity.value_counts(dropna=False)

In [93]:
cats_breed_outcome_crstb = pd.crosstab(cats_outcomes.OutcomeType, cats_outcomes.BreedPurity, normalize='columns', margins=True)

In [94]:
# cats_breed_outcome_crstb

In [95]:
# cats_breed_outcome_crstb.plot(kind='bar', title='Outcome Types by Breed Purity', grid=True, figsize=(6,4))

### Insights regarding Cats:
- As a Crossbreeds cat, your chances of being Adopted are rising, and your chances of being RTO are falling
- As a Purebred cat, your chances of being Adopted are falling, and your chances of being RTO are rising (same as dogs). Your
  chances of being Transferred are rising a bit

Note: Percentage of 'Mix' cats is very high (97.5%), so the contribution of this feature to the Model is doubtful

### 3.6.b Narrowing the variety of values by adding a new column which will hold the "main breed" of each pet. 

In [96]:
def calc_main_breed(breed):
    """Returns the first breed of the original value, by taking the letters from the beginning till 'Mix' or '/'"""
    # drop the ' Mix' suffix, then keep only the part before the first '/' (if any)
    return breed.replace(' Mix', '').split('/')[0]
    
outcomes['MainBreed'] = outcomes['Breed'].apply(calc_main_breed)


In [97]:
# outcomes.head()

In [98]:
outcomes.MainBreed.nunique()

220

In [99]:
# Number of unique breeds went down from 1,380 to 220 (still high...)

In [100]:
# Looking separately at Dogs and Cats:

dogs_outcomes = outcomes[outcomes.AnimalType == 'Dog']
cats_outcomes = outcomes[outcomes.AnimalType == 'Cat']

#### Dogs

In [101]:
# dogs_outcomes.MainBreed.nunique()

In [102]:
# dogs_outcomes.MainBreed.value_counts().head()

In [103]:
dogs_breed_outcome_crstb = pd.crosstab(dogs_outcomes.MainBreed, dogs_outcomes.OutcomeType, normalize='index', margins=True)

In [104]:
# dogs_breed_outcome_crstb

#### Cats

In [105]:
# cats_outcomes.MainBreed.nunique()

In [106]:
# cats_outcomes.MainBreed.value_counts().head()

In [107]:
cats_breed_outcome_crstb = pd.crosstab(cats_outcomes.MainBreed, cats_outcomes.OutcomeType, normalize='index', margins=True)

In [108]:
# cats_breed_outcome_crstb

### 3.6.c Presenting additional data from external sources, based on the pets' breeds:

- **CATS:**
  - As shown above, the 3 most frequent cats' "MainBreeds" are: Domestic Shorthair (8,958), Domestic Medium Hair (883) and
    Domestic Longhair (547). Actually, these are not "real" breeds (they are cats of mixed ancestry, thus not belonging to any
    particular recognised cat breed) and as such, it was impossible to find external statistical data about them. 
    Since these 3 "breeds" together constitute about 93% of all cats' samples, there's no use in trying to elaborate on
    the cats' breeds.

- **DOGS:**
  - As mentioned above, it seems that there are features, that are missing and could be useful, such as dog's size and other typical characteristics. While seeking these data, I found the following:
      
        i.   a CSV file with dog breeds' "families", such as: Herding, Hound, Toy, etc.
        ii.  a json file which summarizes significant characteristics of numerous breeds of dogs, such as:
             size, friendliness, level of shedding, etc.
          
    I will now elaborate on that, as follows.

#### i. Families of Breeds

In [109]:
families = pd.read_csv('breeds_families.csv', index_col='Breed')

In [110]:
# families.head()

In [111]:
# families.Family.value_counts()

In [112]:
# families.describe()

In [113]:
def find_dog_family(row):
    """
    If Cat - return 'Irrelevant'.
    If the dog's MainBreed exists in the families dataframe - return the family, else return 'Unknown'
    """
    if row.loc['AnimalType'] == 'Cat':
        return 'Irrelevant'
    try:
        # the first column of the matching row holds the family
        return families.loc[row.loc['MainBreed']].iloc[0]
    except KeyError:                    # breed not found in the families dataframe
        return 'Unknown'
   
outcomes['DogFamily'] = outcomes.apply(find_dog_family, axis=1)
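
The row-by-row lookup above can also be written as a vectorized lookup. The sketch below uses small invented dataframes (`families_toy` and `animals` are hypothetical) and assumes that every breed appears only once in the index of the families table.

```python
import pandas as pd

# Invented stand-ins for the families file and the outcomes dataframe.
families_toy = pd.DataFrame(
    {'Family': ['Herding', 'Toy']},
    index=pd.Index(['Border Collie', 'Chihuahua Shorthair'], name='Breed'),
)
animals = pd.DataFrame({
    'AnimalType': ['Dog', 'Dog', 'Cat'],
    'MainBreed': ['Border Collie', 'Some Rare Breed', 'Domestic Shorthair'],
})

# Look up each MainBreed in the index of families_toy; breeds that are absent become NaN.
family = animals['MainBreed'].map(families_toy['Family']).fillna('Unknown')
# Cats get a fixed label regardless of the lookup result.
animals['DogFamily'] = family.where(animals['AnimalType'] == 'Dog', 'Irrelevant')
print(animals)
```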

In [114]:
# outcomes.head()

In [115]:
# outcomes.DogFamily.value_counts(dropna=False)

In [116]:
dogs_outcomes = outcomes[(outcomes.AnimalType == 'Dog') & (outcomes.DogFamily != 'Unknown')]

In [117]:
# dogs_outcomes.DogFamily.value_counts(dropna=False)

In [118]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogFamily, margins=True)

In [119]:
dogs_outcome_family_crstb = pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogFamily, normalize='columns', margins=True)

In [120]:
# dogs_outcome_family_crstb

In [121]:
# dogs_outcome_family_crstb.plot(kind='bar', title='Breed Families by Outcome Types', grid=True, figsize=(7,5))

In [122]:
dogs_family_outcome_crstb = pd.crosstab(dogs_outcomes.DogFamily, dogs_outcomes.OutcomeType, normalize='index', margins=True)

In [123]:
# dogs_family_outcome_crstb

In [124]:
# dogs_family_outcome_crstb.plot(kind='bar', title='Outcome Types by Breed Families', grid=True, figsize=(8,6))

### Insights regarding Families of Dog breeds:
- As a Herding/Hound/Terrier/Sporting/Toy dog, your chances of being Adopted are rising
- As a Companion/Pit Bull dog, your chances of being Adopted are significantly falling
- As a Pit Bull, your chances of being Euthanized are significantly rising
- As a Hound dog, your chances of being Euthanized are significantly falling
- As a Toy/Companion dog, your chances to Die are rising
- As a Companion/Working/Pit Bull dog, your chances of being RTO are rising. These chances are falling for Herding/Hound dogs

#### ii. Characteristics of Dogs' Breeds

In [125]:
import json

In [126]:
# Reading a json file, which contains main characteristics of dogs' breeds. Each characteristic is given as a grade from 1 to 5

with open('dogs_characteristics.json', encoding='utf-8') as f:
    dogs_chars = pd.DataFrame.from_dict(json.load(f))

In [127]:
# dogs_chars.info()

In [128]:
# dogs_chars.name.nunique()

In [129]:
dogs_chars.set_index('name', inplace=True)

In [130]:
# 'id' is not necessary. The 'name' column represents the dog's breed
dogs_chars.drop(columns='id', inplace=True)         

In [131]:
# rename dogs_chars' columns to "standard form", with indication that these features are relevant only to Dogs
dogs_chars.rename(columns=lambda col_name: 'Dog' + col_name[0].upper() + col_name[1:], inplace=True)

In [132]:
# dogs_chars.columns

In [133]:
# dogs_chars.head()

In [134]:
# Joining the dogs_chars df with the outcomes df (dogs_chars.name and outcomes.MainBreed are the "keys" for this join)

outcomes = outcomes.join(dogs_chars, on='MainBreed')
# outcomes.head()

In [135]:
# outcomes.info()

In [195]:
cols = list(dogs_chars.columns)                    # a list of the new columns

# Fill missing values with the sentinel 999 and convert float numbers to int.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which may not modify the original DataFrame.
outcomes[cols] = outcomes[cols].fillna(value=999).astype(int)
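
The join-then-fill steps above can be sketched on a tiny example. The two frames below are hypothetical miniatures with invented values; only the column names mirror the notebook:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames (values are invented)
pets = pd.DataFrame(
    {'AnimalType': ['Dog', 'Dog', 'Cat'],
     'MainBreed':  ['Beagle', 'Unknown Breed', 'Domestic Shorthair']},
    index=pd.Index(['A1', 'A2', 'A3'], name='AnimalID'))
chars = pd.DataFrame({'DogSize': [2]}, index=pd.Index(['Beagle'], name='name'))

# join(on=...) matches the caller's column against the other frame's index;
# rows without a match get NaN, which turns the column into float
joined = pets.join(chars, on='MainBreed')

# Replace NaN with the sentinel 999 and restore the integer dtype
joined['DogSize'] = joined['DogSize'].fillna(999).astype(int)

# The sentinel has to be filtered out before any per-size analysis
known = joined[(joined['AnimalType'] == 'Dog') & (joined['DogSize'] != 999)]
print(joined)
print(known)
```

Note that cats and dogs of unlisted breeds both end up with 999, so the sentinel alone does not tell the two cases apart.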

In [196]:
# outcomes.info()

In [197]:
# outcomes.head()

#### Investigating the new features

In [198]:
# outcomes['DogSize'].value_counts()

In [199]:
dogs_outcomes = outcomes[(outcomes['AnimalType'] == 'Dog') & (outcomes['DogSize'] != 999)]

In [200]:
# dogs_outcomes['DogSize'].value_counts()

In [201]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogSize, margins=True)

In [202]:
dogs_outcome_size_crstb = pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogSize, normalize='columns', margins=True)
# dogs_outcome_size_crstb

In [203]:
# dogs_outcome_size_crstb.plot(kind='bar', title='Dog Sizes by Outcome Types', grid=True, figsize=(6,4))

### Insights regarding Dogs' sizes:
- As a very big dog (size=5), your chances to Die or to be Transferred are rising. Your chances of being Adopted are falling
- As a dog of size = 2, your chances of being RTO are significantly rising. Your chances of being Adopted are falling
- As a very small dog (size=1), your chances to be Transferred are rising. Your chances of being RTO are falling

In [204]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogDogFriendly, normalize='columns', margins=True). \
#             plot(kind='bar', title='Friendliness to Other Dogs by Outcome Types', grid=True, figsize=(6,4))

### Insights regarding Dogs' level of friendliness to other dogs:
- As an unfriendly dog (to other dogs), your chances to Die or to be Euthanized are significantly rising. Your chances of
  being Adopted are falling

In [205]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogKidFriendly, normalize='columns', margins=True). \
#             plot(kind='bar', title='Friendliness to Kids by Outcome Types', grid=True, figsize=(6,4))

### Insights regarding Dogs' level of friendliness to kids (very similar to dogFriendly):
- As an unfriendly dog (to kids), your chances to Die or to be Euthanized are significantly rising. Your chances of
  being Adopted are falling

In [206]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogEasyToGroom, normalize='columns', margins=True). \
#             plot(kind='bar', title='Easy to Groom by Outcome Types', grid=True, figsize=(6,4))

### Insights regarding Dogs' ease of grooming:
- As a dog with low levels of ease-of-grooming, your chances to be Adopted are falling, while your chances of being 
  Euthanized or Transferred are rising

In [207]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogLowShedding, normalize='columns', margins=True). \
#             plot(kind='bar', title='Shedding by Outcome Types', grid=True, figsize=(6,4))

### Insights regarding Dogs' level of Shedding:
- As a dog with level of shedding = 2, your chances to be Adopted are falling, while your chances of being 
  Euthanized or RTO are rising (can't find an explanation for this insight...)

In [208]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogEasyToTrain, normalize='columns', margins=True). \
#             plot(kind='bar', title='Ease of Training by Outcome Types', grid=True, figsize=(6,4))

### Insights regarding Dogs' ease of training:
- As a dog with level of easeToTrain = 3, your chances to be Adopted are falling, while your chances of being 
  Transferred are rising (can't find an explanation for this insight...)

In [209]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogIntelligence, normalize='columns', margins=True). \
#             plot(kind='bar', title='Intelligence by Outcome Types', grid=True, figsize=(6,4))

### Insights regarding Dogs' intelligence:
- As a dog with low level of intelligence, your chances to be Adopted are significantly falling, while your chances of being 
  Euthanized are significantly rising

In [210]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogHighEnergy, normalize='columns', margins=True). \
#             plot(kind='bar', title='Level of Energy by Outcome Types', grid=True, figsize=(6,4))

### Insights regarding Dogs' level of Energy:
- As a low-energy dog, your chances of being Adopted are significantly falling, while your chances of being 
  Euthanized/RTO/Transferred are rising

In [211]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogLowBarking, normalize='columns', margins=True). \
#             plot(kind='bar', title='Level of Barking by Outcome Types', grid=True, figsize=(6,4))

### No specific insights regarding Dogs' level of Barking.

In [212]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogGoodHealth, normalize='columns', margins=True). \
#             plot(kind='bar', title='General Health by Outcome Types', grid=True, figsize=(6,4))

### Insights regarding Dogs' level of Health:
- As a dog with very low level of general health, your chances of being Adopted are falling, while your chances of being 
  RTO are rising
- As a dog with the highest level of general health, your chances of being Adopted are rising

In [213]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogToleratesCold, normalize='columns', margins=True). \
#             plot(kind='bar', title='Tolerance to Cold by Outcome Types', grid=True, figsize=(6,4))

### No specific insights regarding Dogs' Tolerance to Cold

In [214]:
# pd.crosstab(dogs_outcomes.OutcomeType, dogs_outcomes.DogToleratesHot, normalize='columns', margins=True). \
#             plot(kind='bar', title='Tolerance to Heat by Outcome Types', grid=True, figsize=(6,4))

### Insights regarding Dogs' Tolerance to Heat:
- As a dog with very low tolerance to heat, your chances of being Adopted are significantly falling. Your chances of being 
  Transferred are rising

It's hot in Texas! :)
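
The per-characteristic cells above all follow the same pattern, so they could be folded into one helper. This is only a sketch: `outcome_lift` is a hypothetical name, the toy data is invented, and dividing by the `All` column is one possible way to quantify "rising" and "falling":

```python
import pandas as pd

def outcome_lift(df, feature, sentinel=999):
    """Per-level outcome rates divided by the overall outcome rates.

    Values above 1 mean the outcome is more common at that level than overall.
    """
    known = df[df[feature] != sentinel]               # drop rows with no breed data
    rates = pd.crosstab(known['OutcomeType'], known[feature],
                        normalize='columns', margins=True)
    return rates.drop(columns='All').div(rates['All'], axis=0)

# Toy data, invented for illustration
toy = pd.DataFrame({
    'OutcomeType': ['Adoption', 'Adoption', 'Transfer',
                    'Adoption', 'Transfer', 'Transfer',
                    'Adoption', 'Transfer'],
    'DogSize':     [1, 1, 1, 5, 5, 5, 999, 999],
})

for feature in ['DogSize']:                           # in the notebook: the Dog* columns
    lift = outcome_lift(toy, feature)
    print(feature)
    print(lift.round(2))
```

In the notebook the loop could run over `dogs_chars.columns`, which would replace the twelve near-identical cells with one.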

## 3.7 Color:

**First - some exploration of this feature**

In [215]:
# outcomes.Color.nunique()

In [216]:
dogs_outcomes = outcomes[outcomes.AnimalType == 'Dog']
cats_outcomes = outcomes[outcomes.AnimalType == 'Cat']

#### Dogs

In [217]:
# dogs_outcomes.Color.nunique()

In [218]:
dogs_color_popularity = dogs_outcomes['Color'].value_counts()
# dogs_color_popularity

In [219]:
# number of colors that constitute 80% of all dogs' samples
dogs_number_of_popular_colors = dogs_color_popularity.cumsum().searchsorted(0.8*animal_types.loc['Dog'])[0]
# dogs_number_of_popular_colors

#### Cats

In [220]:
# cats_outcomes.Color.nunique()

In [221]:
cats_color_popularity = cats_outcomes['Color'].value_counts()
# cats_color_popularity

In [222]:
# number of colors that constitute 80% of all cats' samples
cats_number_of_popular_colors = cats_color_popularity.cumsum().searchsorted(0.8*animal_types.loc['Cat'])[0]
# cats_number_of_popular_colors
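
The "colors that constitute 80% of the samples" count can also be computed from cumulative shares. The sketch below uses an invented color list and a boolean mask instead of `searchsorted`; depending on the pandas version, `searchsorted` may return a scalar rather than an array, in which case the trailing `[0]` in the cells above would need to be dropped. Adding 1 here counts the color that crosses the threshold, which is a matter of definition:

```python
import pandas as pd

# Invented sample: 10 animals in 4 colors
colors = pd.Series(['Black'] * 5 + ['White'] * 2 + ['Brown'] * 2 + ['Red'] * 1)

popularity = colors.value_counts()                    # most frequent first
cum_share = popularity.cumsum() / popularity.sum()    # 0.5, 0.7, 0.9, 1.0

# Colors strictly below the threshold, plus the one that crosses it
n_popular = int((cum_share < 0.8).sum()) + 1
print(n_popular)
```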

As shown above, there are 366 unique values in the Color column. Some of these values include a "coat pattern" (especially
with cats), some of them consist of two colors (separated by '/') and some consist of one color.
I'll try to deal with this feature and its correlation with the pets' OutcomeTypes, in 2 different ways:

    a. Creating a new 'ColorCoat' column, whose values would relate to the "coat pattern" of the Color feature
    b. Creating a new 'ColorGroup' column, whose values would reflect the colors' "families"
    
**Note:** Unlike the "Breed" feature, narrowing the variety of colors by adding a new 'MainColor' column might not
      serve us well, because there's a big difference between, for instance, 'Black/White' and 'Black'. For that reason, 
      I will not add such a column.

### 3.7.a Creating a new 'ColorCoat' column:

In [223]:
coats = ['Agouti', 'Brindle', 'Calico', 'Merle', 'Point', 'Smoke', 'Tabby', 'Tick', 'Tiger', 'Torbie', 'Tortie', 'Tricolor']

In [224]:
def set_color_coat(color):
    """
    Returns a "color coat" of a given color. 
    If no coat was found, return 'Bicolor' if given color consists of two colors (separated by '/'), else return 'Solid' 
    """
    if color == 'Black/White':     # Black/White is not really a "coat", 
        return color               #                    however it is very popular and I want to define it specifically
    try:
        first_color, second_color = color.split('/')
        for coat in coats:                # look for coat in prime color
            if coat in first_color:
                return coat
        for coat in coats:                # look for coat in secondary color
            if coat in second_color:
                return coat
        return 'Bicolor'
    except ValueError:                    # color does not split into exactly two parts
        for coat in coats:                # look for coat in color
            if coat in color:
                return coat
        return 'Solid'

outcomes['ColorCoat'] = outcomes['Color'].apply(set_color_coat)
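
To make the rule easier to follow, here is a compact restatement with a few worked inputs. It is a sketch, not the function used above: `color_coat` and the shortened `COATS` list are hypothetical, and it assumes a color contains at most one '/':

```python
# Shortened coat list, for illustration only
COATS = ['Brindle', 'Merle', 'Tabby', 'Tick', 'Tricolor']

def color_coat(color):
    """Return the coat pattern found in a color string, else 'Bicolor' or 'Solid'."""
    if color == 'Black/White':            # kept as its own category
        return color
    parts = color.split('/')
    for part in parts:                    # primary color first, then secondary
        for coat in COATS:
            if coat in part:              # substring match, e.g. 'Tabby' in 'Brown Tabby'
                return coat
    return 'Bicolor' if len(parts) == 2 else 'Solid'

for example in ['Brown Tabby/White', 'White/Blue Merle', 'Brown/White',
                'Black/White', 'Black', 'Blue Tick']:
    print(example, '->', color_coat(example))
```

Because the match is a substring test, a coat in the primary color always wins over a coat in the secondary color.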

In [225]:
# outcomes.head()

In [226]:
# outcomes['ColorCoat'].value_counts()

In [227]:
# pd.crosstab(outcomes.ColorCoat, outcomes.OutcomeType, margins=True)

Separating between Dogs and Cats

#### DOGS

In [228]:
dogs_outcomes = outcomes[outcomes.AnimalType == 'Dog']

In [229]:
# pd.crosstab(dogs_outcomes.ColorCoat, dogs_outcomes.OutcomeType, margins=True)

In [230]:
dogs_outcome_coat_crosstab = pd.crosstab(dogs_outcomes.ColorCoat, dogs_outcomes.OutcomeType, margins=True, normalize='index')
# dogs_outcome_coat_crosstab

In [231]:
# sns.heatmap(dogs_outcome_coat_crosstab, linewidths=0.5, annot=True, center=1)

### Insights regarding Dogs color coats (ignoring the very few samples of Tabby/Smoke/Tiger dogs):
- As a Brindle dog, your chances to Die or to be Euthanized are significantly rising
- As a dog with a Tick coat, your chances of being Adopted are rising. Your chances to be Euthanized are falling
- As a Tricolor dog, your chances to be Euthanized are falling

#### CATS

In [232]:
cats_outcomes = outcomes[outcomes.AnimalType == 'Cat']

In [233]:
# pd.crosstab(cats_outcomes.ColorCoat, cats_outcomes.OutcomeType, margins=True)

In [234]:
cats_outcome_coat_crosstab = pd.crosstab(cats_outcomes.ColorCoat, cats_outcomes.OutcomeType, margins=True, normalize='index')
# cats_outcome_coat_crosstab

In [235]:
# sns.heatmap(cats_outcome_coat_crosstab, linewidths=0.5, annot=True, center=1)

### Insights regarding Cats color coats (ignoring the very few samples of Tricolor/Smoke/Tiger cats):
- As a Calico/Bicolor cat, your chances to Die are falling
- As a Tortie/Torbie cat, your chances to Die or to be Euthanized are falling

In [236]:
# 3.7.b Creating a new 'ColorGroup' column, whose values would reflect the colors' "families", as follows:

color_groups_dict = {
    'Black'   : ['Black'],
    'White'   : ['White'],
    'Brown'   : ['Brown', 'Chocolate', 'Liver', 'Agouti', 'Tortie', 'Torbie', 'Seal'],
    'Red'     : ['Red', 'Orange', 'Ruddy', 'Sable', 'Pink'],
    'Gray'    : ['Gray', 'Silver', 'Silver Lynx'],
    'Yellow'  : ['Yellow', 'Apricot', 'Tan', 'Gold', 'Fawn', 'Buff', 'Lynx', 'Flame'],
    'Cream'   : ['Cream'],
    'Blue'    : ['Blue', 'Lilac', 'Blue Cream'],
    'Tricolor': ['Tricolor', 'Calico']}

To the above items, I'll now add 'Black/White' and the 10 most popular "Bicolor" combinations (as created earlier)

In [237]:
for bicolor in list(outcomes.Color[outcomes.ColorCoat == 'Bicolor'].value_counts().head(10).index) + ['Black/White']:
    color_groups_dict[bicolor] = [bicolor]

In [238]:
def remove_coat(color):
    '''remove "coat" from color, if exists'''
    for coat in coats:                
        if coat in color:                                      # remove coat from color, unless color is a coat by itself
            color = color.replace(' ' + coat, '') if coat != color else color
    return color

def set_color_group(color):
    '''In: color, Out: color-group'''
    try:
        first_color, second_color = color.split('/')
        first_color = remove_coat(first_color)
        second_color = remove_coat(second_color)
        for group in color_groups_dict:
            if first_color + '/' + second_color == group:         # first, look for the exact Bi-color group
                return group
        for group in color_groups_dict:
            if first_color in color_groups_dict[group]:           # if not found, look for the prime color's group
                return group
    except ValueError:                                            # color does not split into exactly two parts
        color = remove_coat(color)
        for group in color_groups_dict:
            if color in color_groups_dict[group]:
                return group
    return 'Other'

outcomes['ColorGroup'] = outcomes['Color'].apply(set_color_group)
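
A few worked inputs may help to see how coat removal and group lookup interact. The sketch below is a compact restatement under stated assumptions: `strip_coat`, `color_group`, and the shortened `COATS` and `GROUPS` tables are hypothetical, 'Brown/White' stands in for one of the popular bicolor combinations, and a color is assumed to contain at most one '/':

```python
COATS = ['Brindle', 'Merle', 'Point', 'Tabby', 'Tick', 'Tortie']    # shortened list
GROUPS = {                                                           # shortened mapping
    'Black': ['Black'],
    'Brown': ['Brown', 'Chocolate', 'Tortie'],
    'Blue': ['Blue'],
    'Brown/White': ['Brown/White'],   # stand-in for a popular bicolor combination
}

def strip_coat(color):
    """Remove a coat word from a color, unless the color is a coat by itself."""
    for coat in COATS:
        if coat in color and coat != color:
            color = color.replace(' ' + coat, '')
    return color

def color_group(color):
    parts = [strip_coat(part) for part in color.split('/')]
    if len(parts) == 2 and '/'.join(parts) in GROUPS:   # exact bicolor group first
        return '/'.join(parts)
    for group, members in GROUPS.items():               # otherwise the primary color decides
        if parts[0] in members:
            return group
    return 'Other'

for example in ['Brown Tabby/White', 'Blue Merle', 'Black/Tan',
                'Tortie', 'Tortie Point', 'Green']:
    print(example, '->', color_group(example))
```

One consequence worth noticing: after the coat is stripped, 'Brown Tabby/White' lands in the same group as plain 'Brown/White'.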

In [239]:
# outcomes['ColorGroup'].value_counts()

In [240]:
# pd.crosstab(outcomes.ColorGroup, outcomes.OutcomeType, margins=True)

Separating between Dogs and Cats

#### DOGS

In [241]:
dogs_outcomes = outcomes[outcomes.AnimalType == 'Dog']

In [242]:
# pd.crosstab(dogs_outcomes.ColorGroup, dogs_outcomes.OutcomeType, margins=True)

In [243]:
dogs_outcome_group_crosstab = pd.crosstab(dogs_outcomes.ColorGroup, dogs_outcomes.OutcomeType, margins=True, normalize='index')
# dogs_outcome_group_crosstab

In [244]:
# sns.heatmap(dogs_outcome_group_crosstab, linewidths=0.5, annot=True, center=0.5)

### Insights regarding Dogs color groups:
- As a Gray dog, your chances to be Adopted or Euthanized are falling, while your chances to Die or to be RTO are rising
- As a Blue/White dog, your chances to be Euthanized are significantly rising

#### CATS

In [245]:
cats_outcomes = outcomes[outcomes.AnimalType == 'Cat']

In [246]:
# pd.crosstab(cats_outcomes.ColorGroup, cats_outcomes.OutcomeType, margins=True)

In [247]:
cats_outcome_group_crosstab = pd.crosstab(cats_outcomes.ColorGroup, cats_outcomes.OutcomeType, margins=True, normalize='index')
# cats_outcome_group_crosstab

In [248]:
# sns.heatmap(cats_outcome_group_crosstab, linewidths=0.5, annot=True, center=1)

### Insights regarding Cats color groups (ignoring the very few samples of Black/Brown and Brown/Black cats):
- As a Gray cat, your chances to be Euthanized are significantly rising
- As a Yellow cat, your chances of being Adopted are rising. Your chances to be Transferred are rising

## 4. Summary:

### 4.a Summary of the features added to the given data structure

- **Named:** 1 if the pet has a name, 0 if not
- **DayOfWeek:** DOW (Sun-Sat) of outcome
- **Quarter:** Quarter (1-4) of outcome
- **Holiday:** A holiday (if any) that occurs 0-5 days after an outcome
- **Sex:** The pet's gender (Male/Female)
- **NeuterStatus:** Intact / N/S (Neutered/Spayed) / Unknown
- **LifeStage:** Based on the pet's age: neonatal (birth-4 weeks), puppy/kitten (1-6 months), junior (7 months-2 years), prime (3-6 years), mature (7-10 years), senior (11-14 years), geriatric (15+ years)
- **BreedPurity:** Based on the pet's breed: Mix/Crossbreeds/Purebred 
- **MainBreed:** The "prime" breed of the pet
- **DogFamily:** Relevant for dogs only. Based on the dog's breed - Working/Hound/Herding/Terrier/Sporting/Toy/Companion/Pit Bull
- **DogDogFriendly:** Relevant for dogs only. Based on the dog's breed. From 1 to 5, or 999 if unknown
- **DogEasyToGroom:** Relevant for dogs only. Based on the dog's breed. From 1 to 5, or 999 if unknown
- **DogEasyToTrain:** Relevant for dogs only. Based on the dog's breed. From 1 to 5, or 999 if unknown
- **DogGoodHealth:** Relevant for dogs only. Based on the dog's breed. From 1 to 5, or 999 if unknown
- **DogHighEnergy:** Relevant for dogs only. Based on the dog's breed. From 1 to 5, or 999 if unknown
- **DogIntelligence:** Relevant for dogs only. Based on the dog's breed. From 1 to 5, or 999 if unknown
- **DogKidFriendly:** Relevant for dogs only. Based on the dog's breed. From 1 to 5, or 999 if unknown
- **DogLowBarking:** Relevant for dogs only. Based on the dog's breed. From 1 to 5, or 999 if unknown
- **DogLowShedding:** Relevant for dogs only. Based on the dog's breed. From 1 to 5, or 999 if unknown
- **DogSize:** Relevant for dogs only. Based on the dog's breed. From 1 to 5, or 999 if unknown
- **DogToleratesCold:** Relevant for dogs only. Based on the dog's breed. From 1 to 5, or 999 if unknown
- **DogToleratesHot:** Relevant for dogs only. Based on the dog's breed. From 1 to 5, or 999 if unknown
- **ColorCoat:** Based on the pet's color: Agouti/Brindle/Calico/Merle/Point/Smoke/Tabby/Tick/Tiger/Torbie/Tortie/Tricolor, or 'Black/White', Bicolor or Solid when no coat pattern is found
- **ColorGroup:** Based on the pet's color: Black, White, Brown, Red, Gray, Yellow, Cream, Blue, Tricolor, 'Black/White', the 10 most popular "Bicolor" combinations, or Other

### 4.b Summary of insights based on given and collected data:

- According to the case description on the Kaggle web page, more than 35% of the pets arriving at shelters in the US every year end up dead (2.7 million out of 7.6 million). Strangely enough, however, according to the dataset of intake information received from the AAC (Austin Animal Center), the share of dead and euthanized dogs and cats (over 2.5 years) is only 6.5% (5.8% Euthanasia + 0.7% Died; 1,752 out of 26,729). To explain this finding, I'd like to suggest that the final outcome of many transferred pets might be "not as good as expected".


- The percentage of dogs being adopted is a bit higher than that of cats
- The percentage of RTO dogs is *much* higher than that of cats
- The percentage of cats that are Euthanized, Transferred or Died is higher than that of dogs


- Almost 85% of adopted animals are named
- Almost 97% of animals which are Returned To their Owners (RTO) are named
- Almost 70% of the animals that died are un-named
- Un-named animals tend to be euthanized and transferred more than named animals
- 47% of named animals are adopted, whereas the overall adoption rate is only 40%
- 24% of named animals are RTO, whereas the overall RTO rate is only 18%
- Likewise, the percentages of named animals which Died or were Euthanized/Transferred are lower than the overall percentages
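
The two kinds of statements above ("85% of adopted animals are named" versus "47% of named animals are adopted") come from normalizing the same table in two directions. A minimal sketch with invented numbers, assuming a `Named` column coded 1/0 as described in 4.a:

```python
import pandas as pd

# Invented toy data: 6 named and 4 un-named animals
toy = pd.DataFrame({
    'Named':       [1] * 6 + [0] * 4,
    'OutcomeType': ['Adoption'] * 3 + ['Transfer'] * 3 + ['Adoption'] + ['Transfer'] * 3,
})

# "What share of adopted animals is named?"  Each column sums to 1
named_within_outcome = pd.crosstab(toy.Named, toy.OutcomeType, normalize='columns')

# "What share of named animals is adopted?"  Each row sums to 1
outcome_within_named = pd.crosstab(toy.Named, toy.OutcomeType, normalize='index')

print(named_within_outcome)
print(outcome_within_named)
```

In this toy table 75% of the adopted animals are named, while only 50% of the named animals are adopted, which shows why the two figures should not be read interchangeably.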


- Not surprisingly, it seems that most adoptions take place during the weekends. Also, it seems that animals are euthanized
  mostly on Mondays, supposedly after they had not been adopted during the past weekend.
- As I mentioned above, this feature will not contribute to predicting the next outcome, however it might suggest that the
  shelter could put in some extra effort during weekends, as people tend to adopt animals on Sat and Sun.


- It seems that more activities take place during Q4, and fewer during Q1.


- About 20% of all adoptions take place 5 days or less before a holiday (mostly before Christmas)


- It seems that if an animal is suffering, aggressive, badly behaved, a Rabies Risk or has a medical problem, it will be
  euthanized.
  
  
- Main subtypes of Transfer are: 'Partner' and 'SCRP':
    - Partner: is probably another shelter for pets
    - SCRP: I assume that it means: "Street Cats Rescue Program"

- Amazingly, NONE of the animals whose SexuponOutcome is "Unknown" were adopted! Only a few of them were RTO.


- 97% of adopted pets were spayed/neutered
- 83% of RTO pets were spayed/neutered
- 69% of Died pets and 56% of Euthanized pets were Intact


- 77% (56 + 21) of N/S pets ended well (either adopted or returned to their owners)
- 69% of Intact pets were transferred
- 87% of pets whose neutered/spayed status was not clear, were transferred. None of them were adopted


- It seems that adoption is not related to the sex of the pet
- Males Die and are Euthanized more than females
- Males are RTO more than females


- Adoption:
    - Pets are not adopted until they're at least 1 month old. 
    - Pets are almost not adopted at the Mature stage or older
    - Pets are mostly adopted when they're at the Puppy/Kitten stage. After that - as Juniors
- RTO:
    - Pets are returned to their owners mostly as Junior or Prime
- Die:
    - Pets mostly Die in the Neonatal and Puppy/Kitten stages. 
- Euthanasia:
    - Pets are mostly Euthanized in the Junior and Prime stages

- Neonatal:
    - Pets are not adopted during their Neonatal period (as seen earlier) 
    - Being in the Neonatal life stage, there's an almost 93% chance of being transferred
    - Chances to Die are high
- Puppy/Kitten:
    - As a Puppy/Kitten, there's 61% chance to be Adopted
    - Chances to Die are also high
- Junior:
    - As a Junior, there's almost 40% chance to be Adopted; 34% chance to be transferred
- Mature:
    - As a Mature pet, there's 38% chance to be RTO
- Senior:
    - As a Senior, there's 45% chance to be RTO
- Geriatric:
    - As a Geriatric pet, there's 52% chance to be RTO, 24% to be Euthanaised and only 11% chance to be adopted


- While Junior dogs' chances to be Adopted are rising, Junior cats' chances are falling
- Senior dogs' chances to be Adopted are much lower than Senior cats'
- Senior and Geriatric cats' chances to be Euthanized are higher than Senior and Geriatric dogs'
- Junior cats' chances to be Transferred are higher than Junior dogs'


- As a Crossbreeds dog, your chances of being Adopted are rising, and your chances of being RTO are falling
- As a Purebred dog, your chances of being Adopted are falling, and your chances of being RTO are rising (which is not
  surprising, as it might be an expensive or rare dog, whom the owner would try hard to find)


- As a Crossbreeds cat, your chances of being Adopted are rising, and your chances of being RTO are falling
- As a Purebred cat, your chances of being Adopted are falling, and your chances of being RTO are rising (same as dogs). Your
  chances of being Transferred are rising slightly
- Note: Percentage of 'Mix' cats is very high (97.5%), so the contribution of this feature to the Model is doubtful


- As a Herding/Hound/Terrier/Sporting/Toy dog, your chances of being Adopted are rising
- As a Companion/Pit Bull dog, your chances of being Adopted are significantly falling
- As a Pit Bull, your chances of being Euthanized are significantly rising
- As a Hound dog, your chances of being Euthanized are significantly falling
- As a Toy/Companion dog, your chances to Die are rising
- As a Companion/Working/Pit Bull dog, your chances of being RTO are rising. These chances are falling for Herding/Hound dogs


- As a very big dog (size=5), your chances to Die or to be Transferred are rising. Your chances of being Adopted are falling
- As a dog of size = 2, your chances of being RTO are significantly rising. Your chances of being Adopted are falling
- As a very small dog (size=1), your chances to be Transferred are rising. Your chances of being RTO are falling


- As an unfriendly dog (to other dogs), your chances to Die or to be Euthanized are significantly rising. Your chances of
  being Adopted are falling


- As an unfriendly dog (to kids), your chances to Die or to be Euthanized are significantly rising. Your chances of
  being Adopted are falling


- As a dog with low levels of ease-of-grooming, your chances to be Adopted are falling, while your chances of being 
  Euthanized or Transferred are rising


- As a dog with level of shedding = 2, your chances to be Adopted are falling, while your chances of being 
  Euthanized or RTO are rising (can't find an explanation for this insight...)


- As a dog with level of easeToTrain = 3, your chances to be Adopted are falling, while your chances of being 
  Transferred are rising (can't find an explanation for this insight...)


- As a dog with low level of intelligence, your chances to be Adopted are significantly falling, while your chances of being 
  Euthanized are significantly rising


- As a low-energy dog, your chances of being Adopted are significantly falling, while your chances of being 
  Euthanized/RTO/Transferred are rising


- No specific insights regarding Dogs' level of Barking.


- As a dog with very low level of general health, your chances of being Adopted are falling, while your chances of being 
  RTO are rising
- As a dog with the highest level of general health, your chances of being Adopted are rising


- No specific insights regarding Dogs' Tolerance to Cold


- As a dog with very low tolerance to heat, your chances of being Adopted are significantly falling. Your chances of being 
  Transferred are rising


- As a Brindle dog, your chances to Die or to be Euthanized are significantly rising
- As a dog with a Tick coat, your chances of being Adopted are rising. Your chances to be Euthanized are falling
- As a Tricolor dog, your chances to be Euthanized are falling


- As a Calico/Bicolor cat, your chances to Die are falling
- As a Tortie/Torbie cat, your chances to Die or to be Euthanized are falling


- As a Gray dog, your chances to be Adopted or Euthanized are falling, while your chances to Die or to be RTO are rising
- As a Blue/White dog, your chances to be Euthanized are significantly rising


- As a Gray cat, your chances to be Euthanized are significantly rising
- As a Yellow cat, your chances of being Adopted are rising. Your chances to be Transferred are rising

### 4.c Plans for the next part of the project:
- I'll prepare the DataFrame to be used by various models:
    - Feature selection: I'll try to select the appropriate features (and drop the others), in order to improve the
      models' scores
    - I'll transform some of the features into dummy variables, using pandas.get_dummies()
    - Some of the features I added are relevant only for dogs. I'll have to deal with that (I prefer not to split the given
      data into 2 separate dataframes (dogs/cats))
- As the given "test" dataset is not complete, I'll prepare the data for the Train/Test split
- I'll run several models, with various hyperparameters, and compare them while analysing the results
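
As a rough illustration of the dummy-variable step, here is a sketch on invented data. Treating the dog-only grades as categories, so that the 999 sentinel becomes its own "unknown" indicator, is just one possible way to keep dogs and cats in a single frame; it is an assumption, not a decision already made in this notebook:

```python
import pandas as pd

# Invented toy rows; 999 marks "no breed data" (always the case for cats)
toy = pd.DataFrame({
    'AnimalType': ['Dog', 'Cat', 'Dog'],
    'ColorCoat':  ['Brindle', 'Tabby', 'Solid'],
    'DogSize':    [3, 999, 999],
})

# One-hot encode the categorical columns; other columns pass through unchanged
encoded = pd.get_dummies(toy, columns=['AnimalType', 'ColorCoat'])

# Encode the grade as a category too, so 999 is an indicator and not a huge number
encoded = pd.get_dummies(encoded, columns=['DogSize'], prefix='DogSize')

print(list(encoded.columns))
```

With this encoding a model sees `DogSize_999` as a flag for "size unknown" instead of interpreting 999 as a very large dog.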

In [249]:
# save outcomes DF for part2 of the project
import pickle

In [250]:
with open('part1outcomes', 'wb') as f:
    pickle.dump(outcomes, f)
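
For completeness, a sketch of how the saved frame could be read back at the start of part 2. The toy frame and the temporary path are placeholders; only the file name 'part1outcomes' comes from the cell above:

```python
import os
import pickle
import tempfile

import pandas as pd

# Placeholder frame standing in for `outcomes`
df = pd.DataFrame({'ColorGroup': ['Black', 'Other']},
                  index=pd.Index(['A1', 'A2'], name='AnimalID'))

# Write to a temporary directory so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), 'part1outcomes')
with open(path, 'wb') as f:
    pickle.dump(df, f)

# Part 2 would start by loading the same file
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored.equals(df))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.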