Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency > 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

In [6]:
import pandas as pd
adoption_url = 'https://data.austintexas.gov/resource/9t4d-g238.csv?$limit=100000'
adoption = pd.read_csv(adoption_url)
# in order to see all of the columns:
pd.options.display.max_columns = 100

# Target

In [7]:
adoption.head()

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color
0,A775354,Charley,2019-11-12T13:29:00.000,2019-11-12T13:29:00.000,2013-06-28T00:00:00.000,Adoption,,Dog,Neutered Male,6 years,Cocker Spaniel Mix,Black/White
1,A775130,*Slinky,2019-11-12T13:17:00.000,2019-11-12T13:17:00.000,2016-06-25T00:00:00.000,Adoption,Foster,Dog,Spayed Female,3 years,Pit Bull Mix,Blue/White
2,A808385,,2019-11-12T13:16:00.000,2019-11-12T13:16:00.000,2018-11-08T00:00:00.000,Transfer,Partner,Dog,Neutered Male,1 year,Pug/Chihuahua Shorthair,White/Brown
3,A799709,*Deeogee,2019-11-12T13:15:00.000,2019-11-12T13:15:00.000,2018-07-11T00:00:00.000,Adoption,Foster,Dog,Neutered Male,1 year,Beagle Mix,Brown Brindle
4,A713661,Coco,2019-11-12T12:46:00.000,2019-11-12T12:46:00.000,2013-10-10T00:00:00.000,Return to Owner,,Dog,Spayed Female,6 years,Labrador Retriever Mix,Black/White


In [8]:
adoption.shape

(100000, 12)

The target for this project is whether or not an animal will be adopted (as opposed to being transferred to another shelter or, sadly, euthanized). With that prediction, animal shelters, though overwhelmed, could give some extra love or use unique methods to find forever homes for the animals with the lowest odds.

In [82]:
adoption['outcome_type'].value_counts(dropna=False)

Adoption           44389
Transfer           29848
Return to Owner    17520
Euthanasia          6300
Died                 968
Rto-Adopt            507
Disposal             387
Missing               61
Relocate              17
NaN                    3
Name: outcome_type, dtype: int64

In [81]:
# this feature might create leakage; will come back to it.
adoption['outcome_subtype'].value_counts(dropna=False)

NaN                    54846
Partner                24931
Foster                  7953
Rabies Risk             2791
SCRP                    2642
Suffering               2504
Snr                     2272
In Kennel                508
Aggressive               360
Offsite                  281
Medical                  254
In Foster                242
At Vet                   171
Behavior                  82
Enroute                   65
Underage                  29
Court/Investigation       21
In Surgery                19
Possible Theft            15
Field                      8
Barn                       4
Prc                        1
Customer S                 1
Name: outcome_subtype, dtype: int64
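
One way to check whether this column leaks the target is to cross-tabulate it against the outcome type. The sketch below is only an illustration: it uses five rows copied from the adoption.head() output above rather than the full data, and the '(missing)' label is a placeholder I chose for NaN. If most subtypes show up under a single outcome type, the subtype effectively gives the answer away.

```python
import pandas as pd

# Five rows copied from adoption.head(); not the full dataset.
demo = pd.DataFrame({
    'outcome_type': ['Adoption', 'Adoption', 'Transfer', 'Adoption', 'Return to Owner'],
    'outcome_subtype': [None, 'Foster', 'Partner', 'Foster', None],
})

# '(missing)' is a placeholder label for NaN so those rows are counted too.
subtype = demo['outcome_subtype'].fillna('(missing)')
table = pd.crosstab(subtype, demo['outcome_type'])

# Number of distinct outcome types seen for each subtype.
outcomes_per_subtype = (table > 0).sum(axis=1)
print(table)
print(outcomes_per_subtype)
```

On the real data, the same two lines with adoption in place of demo would show how many subtypes map to exactly one outcome.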

# Classification or Regression?

There are 9 classes of outcomes, but for this project I'd like to focus on whether animals were adopted or not. A large number of animals were returned to their owners, but that is most likely because they got out and were found. They already have a home, so I will not include them in my project. "Rto-Adopt" (return-to-owner adoption) will be grouped with "Return to Owner" and excluded as well.

For animals that were adopted I will consider that to be:

- adoption

I will combine the following for not adopted:

- transfer
- euthanasia
- relocate
- missing (animals that went missing from the shelter, so still unsuccessful in getting them homes)

"Died" and "disposal" are animals that may have died while at the shelter, or that were brought in and needed to be properly disposed of. I will not include these either, as they may have been very ill when brought in.

# How is the target distributed?
## Are the classes imbalanced?

In [137]:
adoption['outcome_type'].value_counts(normalize=True)

Adoption           0.443903
Transfer           0.298489
Return to Owner    0.175205
Euthanasia         0.063002
Died               0.009680
Rto-Adopt          0.005070
Disposal           0.003870
Missing            0.000610
Relocate           0.000170
Name: outcome_type, dtype: float64

I will need to drop the rows whose outcome type is one of those excluded above:

- Return to Owner
- Rto-Adopt
- Died
- Disposal


In [138]:
# adoption updated to drop the outcomes we are excluding:
adoption_upd = adoption[~adoption['outcome_type'].isin(['Return to Owner', 
                                                        'Rto-Adopt', 
                                                        'Died', 
                                                        'Disposal'])]

In [139]:
print(adoption_upd.shape)
adoption_upd['outcome_type'].value_counts()

(80618, 12)


Adoption      44389
Transfer      29848
Euthanasia     6300
Missing          61
Relocate         17
Name: outcome_type, dtype: int64

In [140]:
# Redefine the classes as binary: 'Adopted' for adoptions and
# 'Not adopted' for the rest.
def new_status(outcome):
    not_adopted = ('Transfer', 'Euthanasia', 'Missing', 'Relocate')
    if outcome in not_adopted:
        return 'Not adopted'
    # Everything else, including the 3 rows with a missing (NaN)
    # outcome_type, is labelled 'Adopted'; those rows are dropped later.
    return 'Adopted'


In [141]:
adoption_upd = adoption_upd.copy()
adoption_upd['new_outcome_type'] = adoption_upd['outcome_type'].apply(new_status)

In [142]:
adoption_upd['new_outcome_type'].value_counts(normalize=True)

Adopted        0.550646
Not adopted    0.449354
Name: new_outcome_type, dtype: float64

The outcomes are now combined into a binary target, and the two classes are roughly balanced (about 55% adopted vs. 45% not adopted).
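
The class proportions also give the baseline to beat: always predicting the majority class. As a small sketch, the counts below are derived from the outputs above (44,389 adoptions plus the 3 rows with a missing outcome that new_status labels 'Adopted', and the four 'not adopted' outcome types summed); nothing here is a model result.

```python
import pandas as pd

# Counts derived from the value_counts() outputs above.
counts = pd.Series({'Adopted': 44389 + 3,
                    'Not adopted': 29848 + 6300 + 61 + 17})

majority_class = counts.idxmax()
baseline_accuracy = counts.max() / counts.sum()

print(majority_class, round(baseline_accuracy, 4))
```

Any model worth keeping should score above this baseline accuracy on the validation set.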

In [143]:
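# preview the data without the original 'outcome_type' column.
# The result is not assigned, so adoption_upd still keeps 'outcome_type',
# which is used again further down.
adoption_upd.drop(columns='outcome_type')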

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,new_outcome_type
0,A775354,Charley,2019-11-12T13:29:00.000,2019-11-12T13:29:00.000,2013-06-28T00:00:00.000,,Dog,Neutered Male,6 years,Cocker Spaniel Mix,Black/White,Adopted
1,A775130,*Slinky,2019-11-12T13:17:00.000,2019-11-12T13:17:00.000,2016-06-25T00:00:00.000,Foster,Dog,Spayed Female,3 years,Pit Bull Mix,Blue/White,Adopted
2,A808385,,2019-11-12T13:16:00.000,2019-11-12T13:16:00.000,2018-11-08T00:00:00.000,Partner,Dog,Neutered Male,1 year,Pug/Chihuahua Shorthair,White/Brown,Not adopted
3,A799709,*Deeogee,2019-11-12T13:15:00.000,2019-11-12T13:15:00.000,2018-07-11T00:00:00.000,Foster,Dog,Neutered Male,1 year,Beagle Mix,Brown Brindle,Adopted
5,A808367,Daily,2019-11-12T12:38:00.000,2019-11-12T12:38:00.000,2016-11-07T00:00:00.000,Partner,Dog,Intact Female,3 years,Australian Cattle Dog/Labrador Retriever,Cream,Not adopted
...,...,...,...,...,...,...,...,...,...,...,...,...
99995,A679740,,2014-05-26T16:55:00.000,2014-05-26T16:55:00.000,2014-03-25T00:00:00.000,Partner,Dog,Intact Male,2 months,Catahoula Mix,Brown Brindle/White,Not adopted
99996,A679715,,2014-05-26T16:54:00.000,2014-05-26T16:54:00.000,2014-03-25T00:00:00.000,Partner,Dog,Intact Female,2 months,Catahoula Mix,Tan/White,Not adopted
99997,A677532,*Taco,2014-05-26T16:52:00.000,2014-05-26T16:52:00.000,2014-03-15T00:00:00.000,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,White,Adopted
99998,A677530,*Chimichanga,2014-05-26T16:51:00.000,2014-05-26T16:51:00.000,2014-03-15T00:00:00.000,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black,Adopted


# Choose Observations

As mentioned above, since the focus of my project is predicting if animals that are in need of forever homes will be adopted or not, I have already excluded the following observations from my model:  
-Return to Owner  
-Rto-Adopt  
-Died  
-Disposal  

There are 3 missing values for the outcome, so we can look at the outcome subtype to see if we can determine what happened to those animals. The remaining animals have been categorized as 'adopted' or 'not adopted'.

In [144]:
adoption_upd.isnull().sum()

animal_id               0
name                30058
datetime                0
monthyear               0
date_of_birth           0
outcome_type            3
outcome_subtype     36354
animal_type             0
sex_upon_outcome        3
age_upon_outcome       25
breed                   0
color                   0
new_outcome_type        0
dtype: int64

# How to Split Data:

The description of the data set says that Austin is becoming a more pet-friendly city, so there may be more animals going in and out of shelters in the more recent data than in the earlier data. I will therefore split the data based on time, with the most recent data (2019) as the test set, and then create train and validation sets from the remaining data.

test = adoption_upd[(adoption_upd['datetime'].dt.year == 2019)]  
val =  adoption_upd[(adoption_upd['datetime'].dt.year == 2018)]  
train = adoption_upd[(adoption_upd['datetime'].dt.year < 2018)]

In [145]:
adoption_upd.dtypes

animal_id           object
name                object
datetime            object
monthyear           object
date_of_birth       object
outcome_type        object
outcome_subtype     object
animal_type         object
sex_upon_outcome    object
age_upon_outcome    object
breed               object
color               object
new_outcome_type    object
dtype: object

In [146]:
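# convert the timestamp strings to datetimes so the .dt accessor works
# (infer_datetime_format is deprecated since pandas 2.0 and not needed here)
adoption_upd['datetime'] = pd.to_datetime(adoption_upd['datetime'])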

In [147]:
adoption_upd['datetime'].dt.year.value_counts()

2015    14792
2019    14292
2016    14120
2017    14058
2018    13361
2014     9995
Name: datetime, dtype: int64

In [38]:
# how big to make test set? 2019:
# 14292 observations
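
Now that the datetime column can be parsed, the time-based split described earlier can be sketched as a small helper. This is a sketch, not the final code: split_by_year is a hypothetical helper name, and the four-row demo frame is made up so the block runs on its own. Going by the yearly counts above, the same split on adoption_upd would give roughly 52,965 training rows (2014 to 2017), 13,361 validation rows (2018), and 14,292 test rows (2019), before the 3 rows with a missing outcome are dropped.

```python
import pandas as pd

def split_by_year(df, date_col='datetime', val_year=2018, test_year=2019):
    """Hypothetical helper: train on years before val_year,
    validate on val_year, test on test_year."""
    years = pd.to_datetime(df[date_col]).dt.year
    train = df[years < val_year]
    val = df[years == val_year]
    test = df[years == test_year]
    return train, val, test

# Made-up stand-in rows in the same timestamp format as the dataset.
demo = pd.DataFrame({
    'datetime': ['2014-05-26T16:55:00.000', '2017-01-01T00:00:00.000',
                 '2018-06-01T12:00:00.000', '2019-11-12T13:29:00.000'],
    'new_outcome_type': ['Not adopted', 'Adopted', 'Adopted', 'Adopted'],
})

train, val, test = split_by_year(demo)
print(len(train), len(val), len(test))
```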

# Evaluation Metrics

Since the classes aren't imbalanced, I can use accuracy, but I will also explore precision and recall for this problem.

Precision (positive class): of all the animals we predicted would be adopted, how many actually were adopted?

\begin{align}
\text{precision} = \frac{\text{correctly predicted adopted}}{\text{total predicted adopted}}
\end{align}


Recall (positive class): of all the animals that were adopted, how many were we able to identify?

\begin{align}
\text{recall} = \frac{\text{correctly predicted adopted}}{\text{actually adopted}}
\end{align}
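
To make the two formulas concrete, here is a small worked example in plain Python. The six labels and predictions are made up purely for illustration; they are not model output. 'Adopted' is treated as the positive class.

```python
# Made-up labels and predictions, only to illustrate the formulas.
y_true = ['Adopted', 'Adopted', 'Adopted', 'Not adopted', 'Not adopted', 'Adopted']
y_pred = ['Adopted', 'Not adopted', 'Adopted', 'Adopted', 'Adopted', 'Adopted']

positive = 'Adopted'
pairs = list(zip(y_true, y_pred))

# Correctly predicted adopted (true positives).
true_pos = sum(t == positive and p == positive for t, p in pairs)
predicted_pos = sum(p == positive for p in y_pred)   # total predicted adopted
actual_pos = sum(t == positive for t in y_true)      # actually adopted

precision = true_pos / predicted_pos
recall = true_pos / actual_pos
accuracy = sum(t == p for t, p in pairs) / len(pairs)

print(precision, recall, accuracy)
```

Here 3 of the 5 animals predicted as adopted really were adopted (precision 0.6), and 3 of the 4 adopted animals were found (recall 0.75).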


# Begin to clean data and feature selection

In [148]:
# let's look at the 3 missing values for the outcome type:
adoption_upd[adoption_upd['outcome_type'].isnull()]
# both the outcome type and outcome subtype are missing (new_status labelled
# these rows 'Adopted' by default). Since there are only 3 rows,
# will drop these observations from the data.

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,new_outcome_type
2565,A803963,Little Bit,2019-09-27 17:59:00,2019-09-27T17:59:00.000,2017-09-09T00:00:00.000,,,Dog,Intact Male,,Miniature Schnauzer,Gray/Black,Adopted
53725,A737705,*Heddy,2016-11-19 16:35:00,2016-11-19T16:35:00.000,2013-11-02T00:00:00.000,,,Dog,,,Labrador Retriever Mix,Black/White,Adopted
94975,A686025,,2014-08-16 08:35:00,2014-08-16T08:35:00.000,2013-08-15T00:00:00.000,,,Other,Unknown,1 year,Bat Mix,Brown,Adopted


I will create a wrangle function to do the cleaning steps all at once. The list so far:  

Drop the 3 rows with a missing outcome type:  
adoption_upd = adoption_upd.dropna(subset = ['outcome_type'])  

The name column has a lot of missing values, so I will change the NaNs to "unknown":  
adoption_upd['name'] = adoption_upd['name'].fillna("unknown")  
  
The outcome subtype is missing 36354 values, almost half of our data. Certain outcomes can also be deduced from the outcome subtype, so keeping it may leak the target into the model. I will drop that entire column:  
adoption_upd = adoption_upd.drop(columns='outcome_subtype')
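
The three steps listed above could be collected into one function along these lines. This is a sketch under assumptions: the function name wrangle comes from the note above, but its exact contents are my guess at a first version, and the three demo rows are made up so the block runs without downloading the data.

```python
import pandas as pd

def wrangle(df):
    """Sketch of a first cleaning pass; returns a new DataFrame."""
    df = df.copy()
    # Drop the rows where the outcome type itself is missing.
    df = df.dropna(subset=['outcome_type'])
    # Replace missing names with a placeholder value.
    df['name'] = df['name'].fillna('unknown')
    # Drop the subtype column, which may leak the target.
    df = df.drop(columns='outcome_subtype')
    return df

# Made-up rows with the same column names as the dataset.
demo = pd.DataFrame({
    'name': ['Charley', None, 'Little Bit'],
    'outcome_type': ['Adoption', 'Transfer', None],
    'outcome_subtype': [None, 'Partner', None],
})

clean = wrangle(demo)
print(clean)
```

Working on a copy and returning the result keeps the original frame untouched, which makes the function safe to rerun as more steps are added.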


In [149]:
# the sex_upon_outcome column has 3 missing values, 1 of which is a row that
# will be dropped because the outcome type is missing, but noticed an 'unknown'
# value so check to see if that's common:
adoption_upd['sex_upon_outcome'].value_counts()
# will one hot encode this column.

Neutered Male    27563
Spayed Female    26009
Intact Female    10091
Intact Male       9376
Unknown           7576
Name: sex_upon_outcome, dtype: int64
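
One-hot encoding this column could look like the sketch below, using pandas' get_dummies. Two things here are assumptions on my part: filling the few missing values with the existing 'Unknown' category, and the five demo rows, which are made up for illustration.

```python
import pandas as pd

# Made-up rows using category values from the output above.
demo = pd.DataFrame({
    'sex_upon_outcome': ['Neutered Male', 'Spayed Female', None,
                         'Unknown', 'Intact Male'],
})

# Assumption: treat missing values like the existing 'Unknown' category.
demo['sex_upon_outcome'] = demo['sex_upon_outcome'].fillna('Unknown')

# One indicator column per category.
encoded = pd.get_dummies(demo, columns=['sex_upon_outcome'])
print(encoded.columns.tolist())
```

With only five categories, this adds five columns on the real data, which is small enough that one-hot encoding stays manageable.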

In [150]:
# age_upon_outcome has 25 missing values but date_of_birth has none, let's
# look at how many times 'Unknown' shows up in the data:
adoption_upd.isin(['Unknown']).sum()
# the name already has 18 unknowns, so will stick to changing NaN's to unknown.

animal_id              0
name                  18
datetime               0
monthyear              0
date_of_birth          0
outcome_type           0
outcome_subtype        0
animal_type            0
sex_upon_outcome    7576
age_upon_outcome       0
breed                  0
color                  0
new_outcome_type       0
dtype: int64

In [151]:
adoption_upd.isin(['Other']).sum()

animal_id              0
name                   0
datetime               0
monthyear              0
date_of_birth          0
outcome_type           0
outcome_subtype        0
animal_type         4455
sex_upon_outcome       0
age_upon_outcome       0
breed                  0
color                  0
new_outcome_type       0
dtype: int64

In [152]:
# to see what kind of animals come in to the shelter
adoption_upd['animal_type'].value_counts()
# initially was thinking of only using cats and dogs but am curious to see the other types.

Dog          39760
Cat          35962
Other         4455
Bird           432
Livestock        9
Name: animal_type, dtype: int64

There are no missing or unknown values for date of birth, which seems strange, 
especially for stray animals that were found, but maybe the shelter approximated. We can 
create a column where we subtract the birth year from 2019 to get the age and see 
whether it differs much from age_upon_outcome.

In [153]:
# convert the date-of-birth strings to datetimes
# (infer_datetime_format is deprecated since pandas 2.0 and not needed here)
adoption_upd['date_of_birth'] = pd.to_datetime(adoption_upd['date_of_birth'])

In [155]:
# let's see what happens when we subtract the date of birth from today (11-2019)
now = pd.Timestamp('now')
adoption_upd['calculated_age']=(now.year - adoption_upd['date_of_birth'].dt.year) - ((now.month - adoption_upd['date_of_birth'].dt.month) < 0)
# note: this is the age as of today, not at the time of the outcome,
# so it will run higher than 'age_upon_outcome' for older records.
# let's compare the 'calculated_age' column to the 'age_upon_outcome'

0        6
1        3
2        1
3        1
5        3
        ..
99995    5
99996    5
99997    5
99998    5
99999    8
Name: calculated_age, Length: 80618, dtype: int64
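
Because calculated_age is measured from today, the 2014 records come out around 5 years old even though age_upon_outcome lists them as 2 months. A closer comparison might be the age at the time of the outcome, computed from the datetime and date_of_birth columns. The sketch below uses two rows copied from the outputs above; the new column names are hypothetical.

```python
import pandas as pd

# Two rows copied from the outputs above (first and one of the last).
demo = pd.DataFrame({
    'datetime': ['2019-11-12T13:29:00.000', '2014-05-26T16:55:00.000'],
    'date_of_birth': ['2013-06-28T00:00:00.000', '2014-03-25T00:00:00.000'],
    'age_upon_outcome': ['6 years', '2 months'],
})

outcome_time = pd.to_datetime(demo['datetime'])
born = pd.to_datetime(demo['date_of_birth'])

# Age at the moment of the outcome, in whole days and approximate years.
demo['age_days_at_outcome'] = (outcome_time - born).dt.days
demo['age_years_at_outcome'] = demo['age_days_at_outcome'] / 365.25

print(demo[['age_upon_outcome', 'age_days_at_outcome', 'age_years_at_outcome']])
```

For these two rows the result lines up with age_upon_outcome: a little over 6 years for the first and about 2 months (62 days) for the second.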

In [124]:
# how many different breeds are there:
adoption_upd['breed'].value_counts()
# 2042...very high cardinality, may need to focus on cats and dogs after all.

Domestic Shorthair Mix                 26076
Pit Bull Mix                            4723
Labrador Retriever Mix                  4375
Chihuahua Shorthair Mix                 3977
Domestic Shorthair                      3537
                                       ...  
West Highland/Patterdale Terr              1
Alaskan Husky/Australian Shepherd          1
Dachshund Longhair/Miniature Poodle        1
Schipperke/Catahoula                       1
Queensland Heeler/Dachshund                1
Name: breed, Length: 2042, dtype: int64

In [125]:
# how many colors are there
adoption_upd['color'].value_counts()
# 512 different values, high cardinality as well.

Black/White               8567
Black                     7240
Brown Tabby               5508
Brown                     3377
Brown Tabby/White         2806
                          ... 
Tricolor/Brown Brindle       1
Agouti/Cream                 1
Cream/Blue Point             1
Gray/Blue Merle              1
Tricolor/Orange              1
Name: color, Length: 512, dtype: int64
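
One possible way to tame high-cardinality columns like breed and color is to keep only the most frequent values and lump everything else into a single 'Other' bucket. This is just one option among several, not a decision made in this notebook. keep_top_categories is a hypothetical helper, and the demo values are a handful of breeds taken from the output above with made-up counts.

```python
import pandas as pd

def keep_top_categories(series, n=10, other_label='Other'):
    """Hypothetical helper: keep the n most frequent values and
    replace everything else (including NaN) with other_label."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other_label)

# Breed names from the output above, with made-up counts.
demo = pd.Series(['Domestic Shorthair Mix'] * 3
                 + ['Pit Bull Mix'] * 2
                 + ['Schipperke/Catahoula', 'Queensland Heeler/Dachshund'])

reduced = keep_top_categories(demo, n=2)
print(reduced.value_counts())
```

The cutoff n would need tuning on the training set only, so that the choice of top categories does not peek at validation or test data.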