Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency > 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

In [95]:
import pandas as pd
adoption_url = 'https://data.austintexas.gov/resource/9t4d-g238.csv?$limit=100000'
adoption = pd.read_csv(adoption_url)
# in order to see all of the columns:
pd.options.display.max_columns = 100

# Target

In [96]:
adoption.head()

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color
0,A808084,*Dori,2019-11-13T16:56:00.000,2019-11-13T16:56:00.000,2018-11-03T00:00:00.000,Adoption,,Dog,Spayed Female,1 year,Australian Cattle Dog Mix,Brown/Tricolor
1,A808715,Kanga,2019-11-13T16:41:00.000,2019-11-13T16:41:00.000,2019-04-12T00:00:00.000,Adoption,,Dog,Intact Female,7 months,Dutch Shepherd/Belgian Malinois,Brown Brindle
2,A802292,George,2019-11-13T16:39:00.000,2019-11-13T16:39:00.000,2017-08-16T00:00:00.000,Adoption,,Dog,Neutered Male,2 years,Rhod Ridgeback/Labrador Retriever,Brown
3,A697397,Blue,2019-11-13T16:33:00.000,2019-11-13T16:33:00.000,2006-11-12T00:00:00.000,Euthanasia,Suffering,Cat,Neutered Male,13 years,Domestic Shorthair Mix,Blue
4,A802452,Hobbey,2019-11-13T16:29:00.000,2019-11-13T16:29:00.000,2019-02-18T00:00:00.000,Return to Owner,,Dog,Neutered Male,8 months,German Shepherd Mix,Black/Tan


In [97]:
adoption.shape

(100000, 12)

The goal of this project is to predict whether an animal will be adopted or not (that is, transferred to another shelter or, sadly, euthanized), so that animal shelters, though overwhelmed, can give some extra love to, or use unique methods for, the animals with the worst odds of finding forever homes.

In [98]:
adoption['outcome_type'].value_counts(dropna=False)

Adoption           44383
Transfer           29862
Return to Owner    17514
Euthanasia          6297
Died                 969
Rto-Adopt            507
Disposal             387
Missing               61
Relocate              17
NaN                    3
Name: outcome_type, dtype: int64

In [161]:
# this might be a feature that can create leakage, will come back to it.
adoption['outcome_subtype'].value_counts(dropna=False)

NaN                    54836
Partner                24944
Foster                  7950
Rabies Risk             2790
SCRP                    2640
Suffering               2503
Snr                     2275
In Kennel                508
Aggressive               360
Offsite                  281
Medical                  254
In Foster                243
At Vet                   171
Behavior                  82
Enroute                   65
Underage                  29
Court/Investigation       21
In Surgery                19
Possible Theft            15
Field                      8
Barn                       4
Prc                        1
Customer S                 1
Name: outcome_subtype, dtype: int64

# Classification or Regression?

There are 9 classes of outcomes, but for this project I'd like to focus on whether animals were adopted or not, so this is a binary classification problem. A large number of animals were returned to their owners, most likely because they got out, but they already have a home, so I will not include them in my project. "Rto-Adopt" (return to owner through adoption) will be grouped with "Return to Owner" and excluded as well.

For animals that were adopted I will consider that to be:  
-adoption  


I will combine the following for not adopted:  
-transfer  
-euthanasia  
-relocate  
-missing (animals that went missing from the shelter, so still unsuccessful in getting them homes)  

"Died" and "disposal" are animals that may have died while at the shelter, or that were brought in needing to be properly disposed of, so I will not include these either, as they may have been very ill when brought in.

# How is the target distributed?
## Are the classes imbalanced?

In [194]:
adoption['outcome_type'].value_counts(normalize=True)

Adoption           0.443843
Transfer           0.298629
Return to Owner    0.175145
Euthanasia         0.062972
Died               0.009690
Rto-Adopt          0.005070
Disposal           0.003870
Missing            0.000610
Relocate           0.000170
Name: outcome_type, dtype: float64

I will need to drop the rows whose outcome type is one of those excluded above:  
-Return to owner  
-Rto-adopt  
-Died  
-Disposal


In [195]:
# adoption updated to drop the outcomes we are excluding:
adoption_upd = adoption[~adoption['outcome_type'].isin(['Return to Owner', 
                                                        'Rto-Adopt', 
                                                        'Died', 
                                                        'Disposal'])]

In [196]:
adoption_upd['outcome_type'].value_counts(dropna=False)

Adoption      44383
Transfer      29862
Euthanasia     6297
Missing          61
Relocate         17
NaN               3
Name: outcome_type, dtype: int64

In [197]:
# am later going to drop all animals except cats and dogs. 2 of the unknown
# outcome types are dogs and 1 also has a lot of other missing info, so
# will drop these 3 rows with a missing outcome type.
adoption_upd = adoption_upd.dropna(subset = ['outcome_type'])


In [198]:
# need to redefine classes as binary. Adoption as 'Adopted' and 
# the rest as 'Not adopted'.
def new_status(outcome):
    if outcome in ('Transfer', 'Euthanasia', 'Missing', 'Relocate'):
        return 'Not adopted'
    return 'Adopted'


In [199]:
adoption_upd = adoption_upd.copy()
adoption_upd['new_outcome_type'] = adoption_upd['outcome_type'].apply(new_status)

In [200]:
adoption_upd['new_outcome_type'].value_counts(normalize=True)

Adopted        0.550521
Not adopted    0.449479
Name: new_outcome_type, dtype: float64

The classes now are combined into a binary classification and the classes are not imbalanced.
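
A useful reference point for any accuracy score is the majority-class baseline: the accuracy obtained by always guessing the most common class. The sketch below is only an illustration on hypothetical toy labels that roughly mirror the 55/45 split above; `majority_baseline` and `toy_labels` are names made up for this example.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_class, accuracy) for always guessing the most common class."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Hypothetical toy labels, not the real data.
toy_labels = ['Adopted'] * 55 + ['Not adopted'] * 45
baseline_class, baseline_accuracy = majority_baseline(toy_labels)
print(baseline_class, baseline_accuracy)
```

A model is only interesting if its validation accuracy clearly beats this number.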

In [201]:
# drop the original 'outcome_type' column:
adoption_upd = adoption_upd.drop(columns='outcome_type')

# Choose Observations

As mentioned above, since the focus of my project is predicting if animals that are in need of forever homes will be adopted or not, I have already excluded the following observations from my model:  
-Return to Owner  
-Rto-Adopt  
-Died  
-Disposal  

There were 3 missing values for the outcome. Those rows were dropped above, although the outcome subtype could be used to try to determine what happened to them. The remaining animals have been categorized as 'adopted' or 'not adopted.'

# How to Split Data:

In [202]:
adoption_upd.dtypes

animal_id           object
name                object
datetime            object
monthyear           object
date_of_birth       object
outcome_subtype     object
animal_type         object
sex_upon_outcome    object
age_upon_outcome    object
breed               object
color               object
new_outcome_type    object
dtype: object

In [203]:
adoption_upd['datetime'] = pd.to_datetime(adoption_upd['datetime'], infer_datetime_format=True)

In [204]:
adoption_upd['datetime'].dt.year.value_counts()

2015    14792
2019    14339
2016    14119
2017    14058
2018    13361
2014     9951
Name: datetime, dtype: int64

The description of the data set said that Austin is becoming a more pet-friendly city, so there may be more animals going in and out of shelters in the more recent data than in the earlier data. I will therefore split the data based on time, with the most recent data (2019) as the test set, and then create train and validation sets from the remaining data.  

test = adoption_upd[(adoption_upd['datetime'].dt.year == 2019)]  
val =  adoption_upd[(adoption_upd['datetime'].dt.year == 2018)]  
train = adoption_upd[(adoption_upd['datetime'].dt.year < 2018)]
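
The year-based split above can be sanity-checked on a small example. This is a minimal sketch, assuming a hypothetical toy frame (`toy`) standing in for `adoption_upd`; it checks that every row lands in exactly one split and that the splits do not overlap in time.

```python
import pandas as pd

# Hypothetical toy frame standing in for adoption_upd.
toy = pd.DataFrame({
    'datetime': pd.to_datetime(['2016-05-01', '2017-08-15', '2018-03-10',
                                '2018-11-30', '2019-01-20', '2019-07-04']),
    'new_outcome_type': ['Adopted', 'Not adopted', 'Adopted',
                         'Adopted', 'Not adopted', 'Adopted'],
})

year = toy['datetime'].dt.year
train = toy[year < 2018]
val = toy[year == 2018]
test = toy[year == 2019]

# Every row lands in exactly one split.
assert len(train) + len(val) + len(test) == len(toy)
# Each split ends before the next one begins, so no future rows reach training.
assert train['datetime'].max() < val['datetime'].min()
assert val['datetime'].max() < test['datetime'].min()
```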

In [17]:
# how big to make test set? 2019:
# 14292 observations

# Evaluation Metrics

Since the classes aren't imbalanced I can use accuracy, but I will also explore precision and recall for this problem.

precision positive: of all the animals we predicted would be adopted, how many actually were?  

\begin{align}
precision = \frac{accurately \ predicted \ adopted}{total\ predicted \ adopted}
\end{align}


recall positive: of all the animals that were adopted, how many were we able to identify?
\begin{align}
recall = \frac{accurately \ predicted \ adopted}{actually \ adopted}
\end{align}
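
To make the two formulas concrete, here is a minimal sketch in plain Python on hypothetical toy labels. The helper name `precision_recall` and the example lists are assumptions for illustration; in practice a library routine would normally be used instead.

```python
def precision_recall(y_true, y_pred, positive='Adopted'):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical toy labels: 3 animals actually adopted, 3 predicted adopted.
y_true = ['Adopted', 'Adopted', 'Adopted', 'Not adopted', 'Not adopted']
y_pred = ['Adopted', 'Adopted', 'Not adopted', 'Adopted', 'Not adopted']
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here 2 of the 3 predicted adoptions are correct (precision 2/3) and 2 of the 3 actual adoptions are found (recall 2/3).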


# Begin to clean data and feature selection

In [205]:
adoption_upd.isnull().sum()

animal_id               0
name                30055
datetime                0
monthyear               0
date_of_birth           0
outcome_subtype     36347
animal_type             0
sex_upon_outcome        2
age_upon_outcome       24
breed                   0
color                   0
new_outcome_type        0
dtype: int64

In [206]:
# the name column has a lot of missing values, will change NaN's to Unknown
adoption_upd['name'] = adoption_upd['name'].fillna("Unknown")

In [207]:
# the outcome subtype is missing 36347 values, almost half of all of our 
# data. It is also only recorded once the outcome is known, so certain 
# outcomes can be deduced from it and it would leak the target into the 
# model. I will drop that entire column.

adoption_upd = adoption_upd.drop(columns='outcome_subtype')
adoption_upd.shape


(80620, 11)

In [208]:
# the sex_upon_outcome column has 2 missing values but noticed an 'unknown'
# value so check to see if that's common:
adoption_upd['sex_upon_outcome'].value_counts()
# will one hot encode this column.

Neutered Male    27562
Spayed Female    26009
Intact Female    10101
Intact Male       9375
Unknown           7571
Name: sex_upon_outcome, dtype: int64

In [209]:
# age_upon_outcome has 24 missing values but date_of_birth has none, let's
# look at how many times 'Unknown' shows up in the data:
adoption_upd.isin(['Unknown']).sum()
# the name already had 18 unknowns, so will stick to changing NaN's to Unknown.

animal_id               0
name                30073
datetime                0
monthyear               0
date_of_birth           0
animal_type             0
sex_upon_outcome     7571
age_upon_outcome        0
breed                   0
color                   0
new_outcome_type        0
dtype: int64

In [210]:
# to see if 'Other' is a common entry as well
adoption_upd.isin(['Other']).sum()

animal_id              0
name                   0
datetime               0
monthyear              0
date_of_birth          0
animal_type         4450
sex_upon_outcome       0
age_upon_outcome       0
breed                  0
color                  0
new_outcome_type       0
dtype: int64

In [211]:
# to see what kind of animals come in to the shelter
adoption_upd['animal_type'].value_counts()

Dog          39752
Cat          35977
Other         4450
Bird           432
Livestock        9
Name: animal_type, dtype: int64

In [212]:
# initially thinking of only using cats and dogs.
adoption_upd = adoption_upd[~adoption_upd['animal_type'].isin(['Bird', 
                                                        'Other', 
                                                        'Livestock'])]

In [213]:
# check for missing values now:
adoption_upd.isnull().sum()

animal_id           0
name                0
datetime            0
monthyear           0
date_of_birth       0
animal_type         0
sex_upon_outcome    2
age_upon_outcome    9
breed               0
color               0
new_outcome_type    0
dtype: int64

In [214]:
adoption_upd['date_of_birth'] = pd.to_datetime(adoption_upd['date_of_birth'], infer_datetime_format=True)

In [215]:
adoption_upd.loc[adoption_upd['age_upon_outcome'].isnull()]

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,new_outcome_type
18,A808636,Unknown,2019-11-13 13:46:00,2019-11-13T13:46:00.000,2004-11-11,Cat,Intact Female,,Siamese,Seal Point,Not adopted
55,A808702,Keepers,2019-11-12 15:20:00,2019-11-12T15:20:00.000,2009-11-12,Dog,Neutered Male,,Golden Retriever,Gold,Not adopted
90,A808649,Unknown,2019-11-11 17:49:00,2019-11-11T17:49:00.000,2019-10-11,Cat,Intact Male,,Domestic Shorthair,Brown Tabby,Not adopted
91,A738697,Boots,2019-11-11 17:48:00,2019-11-11T17:48:00.000,2007-11-17,Dog,,,Miniature Schnauzer Mix,Black,Not adopted
116,A807543,Unknown,2019-11-11 15:22:00,2019-11-11T15:22:00.000,2016-10-26,Cat,Neutered Male,,Domestic Shorthair,Brown Tabby/White,Not adopted
120,A808626,Unknown,2019-11-11 13:57:00,2019-11-11T13:57:00.000,2019-09-11,Cat,Intact Female,,Domestic Shorthair,Brown Tabby,Not adopted
173,A808466,Unknown,2019-11-10 09:35:00,2019-11-10T09:35:00.000,2019-09-09,Cat,Intact Male,,Domestic Shorthair,Brown/Black,Not adopted
306,A808352,Unknown,2019-11-07 16:15:00,2019-11-07T16:15:00.000,2011-11-07,Dog,Intact Male,,Dachshund Mix,Tan,Not adopted
6608,A752967,Gray,2019-07-24 11:42:00,2019-07-24T11:42:00.000,2015-06-29,Dog,,,Pit Bull Mix,Blue/White,Not adopted


In [216]:
# 2 of the rows that are missing the age are also missing the sex so I will
# drop those:
adoption_upd = adoption_upd.dropna(subset = ['sex_upon_outcome'])

In [217]:
# there are no missing or unknown values for DOB, which still seems strange,
# especially for stray animals that were found, but maybe they approximated.
# We can estimate the missing ages as the outcome date minus the date of birth.

# age in whole months at the time of the outcome:
age_months = ((adoption_upd['datetime'].dt.year - adoption_upd['date_of_birth'].dt.year) * 12
              + (adoption_upd['datetime'].dt.month - adoption_upd['date_of_birth'].dt.month))

def calculate_age(months):
    # format like the existing age_upon_outcome strings, e.g. '1 year', '7 months'
    if months >= 12:
        years = months // 12
        return f'{years} year' if years == 1 else f'{years} years'
    return f'{months} month' if months == 1 else f'{months} months'

# fill NaN's in age upon outcome column with calculated age
adoption_upd['age_upon_outcome'] = adoption_upd['age_upon_outcome'].fillna(
    age_months.apply(calculate_age))

In [218]:
adoption_upd.isnull().sum()
# no more missing values!

animal_id           0
name                0
datetime            0
monthyear           0
date_of_birth       0
animal_type         0
sex_upon_outcome    0
age_upon_outcome    0
breed               0
color               0
new_outcome_type    0
dtype: int64
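
The `age_upon_outcome` values are strings such as '1 year' or '7 months', which a model cannot compare directly. One possible way to make them numeric is sketched below. `age_to_days` is a hypothetical helper, the day and week units are an assumption (only months and years appear in the rows shown here), and the conversion factors (30 days per month, 365 per year) are rough approximations.

```python
import re

# Approximate conversion factors; 'day' and 'week' are assumed units.
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 365}

def age_to_days(age):
    """Convert a string like '7 months' to an approximate number of days."""
    match = re.fullmatch(r'(\d+) (day|week|month|year)s?', age.strip())
    if match is None:
        return None  # unrecognized format
    count, unit = match.groups()
    return int(count) * DAYS_PER_UNIT[unit]

print(age_to_days('1 year'), age_to_days('7 months'), age_to_days('13 years'))
```

It could be applied with something like `adoption_upd['age_upon_outcome'].apply(age_to_days)`, after checking how many values come back as `None`.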

In [219]:
adoption_upd.head()

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,new_outcome_type
0,A808084,*Dori,2019-11-13 16:56:00,2019-11-13T16:56:00.000,2018-11-03,Dog,Spayed Female,1 year,Australian Cattle Dog Mix,Brown/Tricolor,Adopted
1,A808715,Kanga,2019-11-13 16:41:00,2019-11-13T16:41:00.000,2019-04-12,Dog,Intact Female,7 months,Dutch Shepherd/Belgian Malinois,Brown Brindle,Adopted
2,A802292,George,2019-11-13 16:39:00,2019-11-13T16:39:00.000,2017-08-16,Dog,Neutered Male,2 years,Rhod Ridgeback/Labrador Retriever,Brown,Adopted
3,A697397,Blue,2019-11-13 16:33:00,2019-11-13T16:33:00.000,2006-11-12,Cat,Neutered Male,13 years,Domestic Shorthair Mix,Blue,Not adopted
5,A808465,Unknown,2019-11-13 16:28:00,2019-11-13T16:28:00.000,2019-08-29,Dog,Spayed Female,2 months,Chihuahua Shorthair/Yorkshire Terrier,Black/Brown,Adopted


In [220]:
# looking at redundant columns, date of birth can be dropped since the ages are all accounted for.
# also datetime and monthyear are the same; will keep both for now and drop them once the month
# and year have been extracted.
adoption_upd.drop(columns=['date_of_birth'], inplace=True)

In [221]:
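# going to create a new column for the season to see if certain seasons have higher adoption
# rates. Note that monthyear holds the outcome timestamp, so 'month_arrived' is really the
# month of the outcome rather than the month the animal came into the shelter: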
adoption_upd['monthyear'] = pd.to_datetime(adoption_upd['monthyear'], infer_datetime_format=True)
adoption_upd['month_arrived'] = adoption_upd['monthyear'].dt.month
adoption_upd.head()


Unnamed: 0,animal_id,name,datetime,monthyear,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,new_outcome_type,month_arrived
0,A808084,*Dori,2019-11-13 16:56:00,2019-11-13 16:56:00,Dog,Spayed Female,1 year,Australian Cattle Dog Mix,Brown/Tricolor,Adopted,11
1,A808715,Kanga,2019-11-13 16:41:00,2019-11-13 16:41:00,Dog,Intact Female,7 months,Dutch Shepherd/Belgian Malinois,Brown Brindle,Adopted,11
2,A802292,George,2019-11-13 16:39:00,2019-11-13 16:39:00,Dog,Neutered Male,2 years,Rhod Ridgeback/Labrador Retriever,Brown,Adopted,11
3,A697397,Blue,2019-11-13 16:33:00,2019-11-13 16:33:00,Cat,Neutered Male,13 years,Domestic Shorthair Mix,Blue,Not adopted,11
5,A808465,Unknown,2019-11-13 16:28:00,2019-11-13 16:28:00,Dog,Spayed Female,2 months,Chihuahua Shorthair/Yorkshire Terrier,Black/Brown,Adopted,11


In [222]:
def season(arrival_month):
    # December, January and February are winter; March belongs to spring only
    if arrival_month in (12, 1, 2):
        return 'Winter'
    if arrival_month in (3, 4, 5):
        return 'Spring'
    if arrival_month in (6, 7, 8):
        return 'Summer'
    return 'Fall'  # months 9, 10, 11
adoption_upd['season_arrived'] = adoption_upd['month_arrived'].apply(season)

In [223]:
# want to change the datetime column to year only since we have month_arrived now:
adoption_upd['year_arrived'] = adoption_upd['datetime'].dt.year

In [225]:
# going to drop datetime and monthyear now:
adoption_upd.drop(columns=['datetime', 'monthyear'], inplace=True)

In [227]:
adoption_upd.head(10)

Unnamed: 0,animal_id,name,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,new_outcome_type,month_arrived,season_arrived,year_arrived
0,A808084,*Dori,Dog,Spayed Female,1 year,Australian Cattle Dog Mix,Brown/Tricolor,Adopted,11,Fall,2019
1,A808715,Kanga,Dog,Intact Female,7 months,Dutch Shepherd/Belgian Malinois,Brown Brindle,Adopted,11,Fall,2019
2,A802292,George,Dog,Neutered Male,2 years,Rhod Ridgeback/Labrador Retriever,Brown,Adopted,11,Fall,2019
3,A697397,Blue,Cat,Neutered Male,13 years,Domestic Shorthair Mix,Blue,Not adopted,11,Fall,2019
5,A808465,Unknown,Dog,Spayed Female,2 months,Chihuahua Shorthair/Yorkshire Terrier,Black/Brown,Adopted,11,Fall,2019
6,A804539,Bubbles,Cat,Spayed Female,3 months,Domestic Shorthair Mix,Brown Tabby,Adopted,11,Fall,2019
9,A807697,Lucy,Dog,Spayed Female,1 year,Basenji Mix,Red,Adopted,11,Fall,2019
10,A799663,Cookie,Dog,Neutered Male,2 years,Rat Terrier,Black/Tricolor,Adopted,11,Fall,2019
11,A808495,Unknown,Cat,Intact Male,7 months,Domestic Shorthair,Black/White,Not adopted,11,Fall,2019
12,A808494,Unknown,Cat,Intact Male,7 months,Domestic Shorthair,Black,Not adopted,11,Fall,2019


In [53]:
# how many different breeds are there:
adoption_upd['breed'].value_counts()
# 2042...very high cardinality

Domestic Shorthair Mix                            26070
Pit Bull Mix                                       4722
Labrador Retriever Mix                             4373
Chihuahua Shorthair Mix                            3974
Domestic Shorthair                                 3552
Domestic Medium Hair Mix                           2549
German Shepherd Mix                                1816
Domestic Longhair Mix                              1184
Siamese Mix                                        1022
Australian Cattle Dog Mix                           968
Dachshund Mix                                       627
Border Collie Mix                                   594
Boxer Mix                                           572
Domestic Medium Hair                                478
Miniature Poodle Mix                                454
Catahoula Mix                                       432
Labrador Retriever                                  405
Staffordshire Mix                               

In [None]:
# how many colors are there
adoption_upd['color'].value_counts()[:10]
#477 different values, high cardinality as well.
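
One common way to deal with high-cardinality columns like `breed` and `color` is to keep only the most frequent values and group the rest into a single category. The sketch below shows the idea on hypothetical toy data; `keep_top_n`, the `'Other'` label, and the choice of `n` are assumptions for illustration, not decisions made in this notebook.

```python
import pandas as pd

def keep_top_n(series, n, other_label='Other'):
    """Keep the n most frequent values and replace everything else with other_label."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other_label)

# Hypothetical toy breeds, not the real data.
toy_breeds = pd.Series(['Domestic Shorthair Mix'] * 5 + ['Pit Bull Mix'] * 3
                       + ['Labrador Retriever Mix'] * 2
                       + ['Catahoula Mix', 'Boxer Mix'])
reduced = keep_top_n(toy_breeds, n=3)
print(reduced.value_counts())
```

To avoid leaking information, the top values would be chosen from the training set only and then applied to the validation and test sets.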