Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency > 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

In [21]:
import pandas as pd
adoption_url = 'https://query.data.world/s/mn7f4fy3qnjwkgh7yk3ch2lc26eqwo'
adoption = pd.read_csv(adoption_url)
# in order to see all of the columns:
pd.options.display.max_columns = 100

# Target

In [22]:
adoption.head()

Unnamed: 0.1,Unnamed: 0,Animal ID,Name_intake,DateTime_intake,MonthYear_intake,Found_Location,Intake_Type,IntakeCondition,Animal_Type_intake,Sex,Age,Breed_intake,Color_intake,Name_outcome,DateTime_outcome,MonthYear_outcome,Outcome_Type,Outcome_Subtype,Sex_upon_Outcome,Age_upon_Outcome,gender_intake,gender_outcome,fixed_intake,fixed_outcome,fixed_changed,Age_Bucket,retriever,shepherd,beagle,terrier,boxer,poodle,rottweiler,dachshund,chihuahua,pit bull,DateTime_length,Days_length
0,0,A730601,,2016-07-07 12:11:00,07/07/2016 12:11:00 PM,1109 Shady Ln in Austin (TX),Stray,Normal,Cat,Intact Male,7 months,Domestic Shorthair Mix,Blue Tabby,,2016-07-08 09:00:00,07/08/2016 09:00:00 AM,Transfer,SCRP,Neutered Male,7 months,Male,Male,Intact,Neutered,1,7-12 months,0,0,0,0,0,0,0,0,0,0,0 days 20:49:00.000000000,0-7 days
1,1,A683644,*Zoey,2014-07-13 11:02:00,07/13/2014 11:02:00 AM,Austin (TX),Owner Surrender,Nursing,Dog,Intact Female,4 weeks,Border Collie Mix,Brown/White,*Zoey,2014-11-06 10:06:00,11/06/2014 10:06:00 AM,Adoption,Foster,Spayed Female,4 months,Female,Female,Intact,Spayed,1,1-6 weeks,0,0,0,0,0,0,0,0,0,0,115 days 23:04:00.000000000,12 weeks - 6 months
2,2,A676515,Rico,2014-04-11 08:45:00,04/11/2014 08:45:00 AM,615 E. Wonsley in Austin (TX),Stray,Normal,Dog,Intact Male,2 months,Pit Bull Mix,White/Brown,Rico,2014-04-14 18:38:00,04/14/2014 06:38:00 PM,Return to Owner,,Neutered Male,3 months,Male,Male,Intact,Neutered,1,1-6 months,0,0,0,0,0,0,0,0,0,1,3 days 09:53:00.000000000,0-7 days
3,3,A742953,,2017-01-31 13:30:00,01/31/2017 01:30:00 PM,S Hwy 183 And Thompson Lane in Austin (TX),Stray,Normal,Dog,Intact Male,2 years,Saluki,Sable/Cream,,2017-02-04 14:17:00,02/04/2017 02:17:00 PM,Transfer,Partner,Intact Male,2 years,Male,Male,Intact,Intact,0,1-3 years,0,0,0,0,0,0,0,0,0,0,4 days 00:47:00.000000000,0-7 days
4,4,A679549,*Gilbert,2014-05-22 15:43:00,05/22/2014 03:43:00 PM,124 W Anderson in Austin (TX),Stray,Normal,Cat,Intact Male,1 month,Domestic Shorthair Mix,Black/White,*Gilbert,2014-06-16 13:54:00,06/16/2014 01:54:00 PM,Transfer,Partner,Neutered Male,2 months,Male,Male,Intact,Neutered,1,1-6 months,0,0,0,0,0,0,0,0,0,0,24 days 22:11:00.000000000,3-6 weeks


The goal of this project is to predict whether an animal will be adopted or not (for example, transferred to another shelter or, sadly, euthanized). With that prediction, animal shelters, though overwhelmed, could give some extra love or use unique methods to help the animals with the lowest odds find forever homes.

In [23]:
adoption['Outcome_Type'].value_counts()

Adoption           32408
Transfer           20799
Return to Owner    17396
Euthanasia          5470
Died                 553
Disposal             257
Missing               51
Rto-Adopt             23
Relocate              13
Name: Outcome_Type, dtype: int64

In [24]:
# This feature may leak the outcome; I will come back to it.
adoption['Outcome_Subtype'].value_counts()

Partner                17367
Foster                  4786
SCRP                    3430
Suffering               2208
Rabies Risk             2062
Aggressive               620
Offsite                  325
In Kennel                287
Medical                  217
Behavior                 144
In Foster                131
At Vet                    32
Enroute                   30
Court/Investigation       28
Underage                  26
Possible Theft            16
In Surgery                11
Barn                       3
Name: Outcome_Subtype, dtype: int64
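
One way to check whether `Outcome_Subtype` leaks the target is to cross-tabulate it against `Outcome_Type`. The sketch below uses a small made-up frame (the rows in `toy` are illustrative, not taken from the dataset) and flags subtypes that only ever occur with a single outcome.

```python
import pandas as pd

# Toy rows standing in for the shelter data (values are illustrative only).
toy = pd.DataFrame({
    "Outcome_Type": ["Adoption", "Adoption", "Transfer", "Transfer", "Euthanasia"],
    "Outcome_Subtype": ["Foster", None, "Partner", "SCRP", "Suffering"],
})

# Rows are subtypes, columns are outcomes. Missing subtypes get their own
# label so they appear as a row instead of being dropped.
leak_check = pd.crosstab(
    toy["Outcome_Subtype"].fillna("(missing)"),
    toy["Outcome_Type"],
)
print(leak_check)

# A subtype whose row has exactly one non-zero cell maps to a single outcome,
# which suggests it is recorded together with the outcome and would leak it.
single_outcome = (leak_check > 0).sum(axis=1) == 1
print(single_outcome)
```

On the real data the same call would be `pd.crosstab(adoption['Outcome_Subtype'].fillna('(missing)'), adoption['Outcome_Type'])`. If most subtypes line up with one outcome, dropping the column from the features is probably the safer choice.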

# Classification or Regression?

There are 9 outcome classes, but this project focuses on whether an animal was adopted or not, so it is a binary classification problem. A large number of animals were returned to their owners, most likely because they had gotten out; they already have a home, so I will not include them in my project. "Rto-Adopt" (return-to-owner adoption) is grouped with "Return to Owner" and is excluded as well.

The adopted class consists of the "Adoption" outcome only.

I will combine the following into the not-adopted class:
- Transfer
- Euthanasia
- Relocate
- Missing (animals that went missing from the shelter; the shelter was still unsuccessful in getting them homes)

"Died" and "Disposal" cover animals that died while at the shelter or that were brought in to be properly disposed of. I will not include these either, as they may have been very ill when brought in.

# How is the target distributed?
## Are the classes imbalanced?

In [25]:
adoption['Outcome_Type'].value_counts(normalize=True)

Adoption           0.421047
Transfer           0.270222
Return to Owner    0.226010
Euthanasia         0.071067
Died               0.007185
Disposal           0.003339
Missing            0.000663
Rto-Adopt          0.000299
Relocate           0.000169
Name: Outcome_Type, dtype: float64

I will need to drop the rows whose outcome type is one of those listed above for exclusion:
- Return to Owner
- Rto-Adopt
- Died
- Disposal

In [26]:
# adoption updated to drop the outcomes we are excluding:
adoption_upd = adoption[~adoption['Outcome_Type'].isin(['Return to Owner', 
                                                        'Rto-Adopt', 
                                                        'Died', 
                                                        'Disposal'])]

In [27]:
adoption_upd['Outcome_Type'].value_counts()

Adoption      32408
Transfer      20799
Euthanasia     5470
Missing          51
Relocate         13
Name: Outcome_Type, dtype: int64

In [28]:
# Redefine the classes as binary: 'Adoption' becomes 'Adopted' and
# the remaining outcomes become 'Not adopted'.
def new_status(outcome):
    not_adopted = ('Transfer', 'Euthanasia', 'Missing', 'Relocate')
    if outcome in not_adopted:
        return 'Not adopted'
    # Note: any other value, including a blank (NaN) outcome, is labeled 'Adopted'.
    return 'Adopted'


In [29]:
adoption_upd = adoption_upd.copy()
adoption_upd['New_Outcome_Type'] = adoption_upd['Outcome_Type'].apply(new_status)

In [30]:
adoption_upd['New_Outcome_Type'].value_counts(normalize=True)

Adopted        0.551763
Not adopted    0.448237
Name: New_Outcome_Type, dtype: float64
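
The relabeling above sends every value that is not in the not-adopted list to 'Adopted', which includes blank outcomes. A lookup table is one possible alternative. The sketch below is only an illustration on a made-up Series: the name `OUTCOME_GROUPS` is hypothetical, and leaving unlisted outcomes unlabeled is an assumption about the desired behavior, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table built from the grouping described in the text.
# Outcomes that are excluded from the project are deliberately not listed.
OUTCOME_GROUPS = {
    "Adoption": "Adopted",
    "Transfer": "Not adopted",
    "Euthanasia": "Not adopted",
    "Relocate": "Not adopted",
    "Missing": "Not adopted",
}

# Made-up values, including an excluded outcome and a blank one.
toy = pd.Series(["Adoption", "Transfer", "Died", np.nan, "Missing"],
                name="Outcome_Type")

# Series.map returns NaN for any value that is not a key in the dict,
# so excluded and blank outcomes stay unlabeled instead of being guessed.
binary_target = toy.map(OUTCOME_GROUPS)
print(binary_target)
```

With this approach, rows that end up unlabeled can be reviewed or dropped explicitly before modeling.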

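The classes are now combined into a binary target. At roughly 55% adopted versus 45% not adopted, the classes are not heavily imbalanced, and the majority class falls within the 50% to 70% range where accuracy is a reasonable evaluation metric.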
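
The majority-class share also gives a baseline: a model that always predicts the most common class would reach that accuracy. The sketch below shows the calculation on made-up labels chosen to mimic a 55/45 split; in the notebook the input would be `adoption_upd['New_Outcome_Type']`.

```python
import pandas as pd

# Illustrative labels only, chosen to give a 55/45 split.
labels = pd.Series(["Adopted"] * 55 + ["Not adopted"] * 45)

# Share of each class, then the most common class and its share.
class_share = labels.value_counts(normalize=True)
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(f"Always predicting '{majority_class}' gives accuracy {baseline_accuracy:.2f}")
```

Any model worth keeping should beat this number on the validation set.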

In [31]:
# Drop the original 'Outcome_Type' column (assign the result so the change persists):
adoption_upd = adoption_upd.drop(columns='Outcome_Type')
adoption_upd

Unnamed: 0.1,Unnamed: 0,Animal ID,Name_intake,DateTime_intake,MonthYear_intake,Found_Location,Intake_Type,IntakeCondition,Animal_Type_intake,Sex,Age,Breed_intake,Color_intake,Name_outcome,DateTime_outcome,MonthYear_outcome,Outcome_Subtype,Sex_upon_Outcome,Age_upon_Outcome,gender_intake,gender_outcome,fixed_intake,fixed_outcome,fixed_changed,Age_Bucket,retriever,shepherd,beagle,terrier,boxer,poodle,rottweiler,dachshund,chihuahua,pit bull,DateTime_length,Days_length,New_Outcome_Type
0,0,A730601,,2016-07-07 12:11:00,07/07/2016 12:11:00 PM,1109 Shady Ln in Austin (TX),Stray,Normal,Cat,Intact Male,7 months,Domestic Shorthair Mix,Blue Tabby,,2016-07-08 09:00:00,07/08/2016 09:00:00 AM,SCRP,Neutered Male,7 months,Male,Male,Intact,Neutered,1,7-12 months,0,0,0,0,0,0,0,0,0,0,0 days 20:49:00.000000000,0-7 days,Not adopted
1,1,A683644,*Zoey,2014-07-13 11:02:00,07/13/2014 11:02:00 AM,Austin (TX),Owner Surrender,Nursing,Dog,Intact Female,4 weeks,Border Collie Mix,Brown/White,*Zoey,2014-11-06 10:06:00,11/06/2014 10:06:00 AM,Foster,Spayed Female,4 months,Female,Female,Intact,Spayed,1,1-6 weeks,0,0,0,0,0,0,0,0,0,0,115 days 23:04:00.000000000,12 weeks - 6 months,Adopted
3,3,A742953,,2017-01-31 13:30:00,01/31/2017 01:30:00 PM,S Hwy 183 And Thompson Lane in Austin (TX),Stray,Normal,Dog,Intact Male,2 years,Saluki,Sable/Cream,,2017-02-04 14:17:00,02/04/2017 02:17:00 PM,Partner,Intact Male,2 years,Male,Male,Intact,Intact,0,1-3 years,0,0,0,0,0,0,0,0,0,0,4 days 00:47:00.000000000,0-7 days,Not adopted
4,4,A679549,*Gilbert,2014-05-22 15:43:00,05/22/2014 03:43:00 PM,124 W Anderson in Austin (TX),Stray,Normal,Cat,Intact Male,1 month,Domestic Shorthair Mix,Black/White,*Gilbert,2014-06-16 13:54:00,06/16/2014 01:54:00 PM,Partner,Neutered Male,2 months,Male,Male,Intact,Neutered,1,1-6 months,0,0,0,0,0,0,0,0,0,0,24 days 22:11:00.000000000,3-6 weeks,Not adopted
5,5,A683798,Mustachala,2016-07-21 12:16:00,07/21/2016 12:16:00 PM,3118 Windsor Rd in Austin (TX),Stray,Normal,Cat,Spayed Female,3 years,Domestic Medium Hair Mix,White/Black,Mustachala,2016-10-18 10:55:00,10/18/2016 10:55:00 AM,Foster,Spayed Female,3 years,Female,Female,Spayed,Spayed,0,1-3 years,0,0,0,0,0,0,0,0,0,0,88 days 22:39:00.000000000,12 weeks - 6 months,Adopted
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76970,76970,A746717,,2017-04-07 18:39:00,04/07/2017 06:39:00 PM,10014 Fm973 in Manor (TX),Stray,Normal,Cat,Intact Male,1 year,Domestic Shorthair Mix,White/Orange,,2017-04-07 19:26:00,04/07/2017 07:26:00 PM,SCRP,Intact Male,,Male,Male,Intact,Intact,0,1-3 years,0,0,0,0,0,0,0,0,0,0,0 days 00:47:00.000000000,0-7 days,Not adopted
76972,76972,A746725,,2017-04-08 11:28:00,04/08/2017 11:28:00 AM,Austin (TX),Stray,Normal,Cat,Unknown,3 weeks,Domestic Shorthair Mix,Blue/White,,2017-04-08 11:42:00,04/08/2017 11:42:00 AM,Suffering,Unknown,,,,Unknown,Unknown,0,1-6 weeks,0,0,0,0,0,0,0,0,0,0,0 days 00:14:00.000000000,0-7 days,Not adopted
76974,76974,A746466,Wilson Fitzg,2017-04-03 15:02:00,04/03/2017 03:02:00 PM,4858 Yager Ln in Travis (TX),Stray,Normal,Dog,Intact Male,2 months,Basset Hound Mix,White/Brown,Wilson Fitzg,2017-04-08 12:21:00,04/08/2017 12:21:00 PM,Offsite,Neutered Male,2 months,Male,Male,Intact,Neutered,1,1-6 months,0,0,0,0,0,0,0,0,0,0,4 days 21:19:00.000000000,0-7 days,Adopted
76975,76975,A746072,Ace,2017-03-28 16:49:00,03/28/2017 04:49:00 PM,9318 Ih 35 in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,German Shepherd Mix,Black/Tan,Ace,2017-04-01 18:50:00,04/01/2017 06:50:00 PM,,Neutered Male,2 years,Male,Male,Neutered,Neutered,0,1-3 years,0,1,0,0,0,0,0,0,0,0,4 days 02:01:00.000000000,0-7 days,Adopted


# Choose Observations

As mentioned above, the focus of my project is predicting whether animals in need of forever homes will be adopted, so I have already excluded observations with the following outcomes:
- Return to Owner
- Rto-Adopt
- Died
- Disposal

There are some missing values: 7 rows have no recorded outcome, so I will look at the outcome subtype to see whether it shows what happened to them. Note that `new_status` labels these rows 'Adopted', because a blank outcome falls through to its default branch. The remaining animals have been categorized as 'adopted' or 'not adopted.'
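
A minimal sketch of that check, using a made-up frame with the same column names (the rows in `toy` are illustrative). It selects the rows with a blank `Outcome_Type` and counts their `Outcome_Subtype` values, keeping blanks in the count.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names match the dataset.
toy = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A3", "A4"],
    "Outcome_Type": ["Adoption", np.nan, "Transfer", np.nan],
    "Outcome_Subtype": ["Foster", np.nan, "Partner", "Partner"],
})

# Rows where no outcome was recorded.
missing_outcome = toy[toy["Outcome_Type"].isna()]

# Subtype counts for those rows; dropna=False keeps blank subtypes visible.
subtype_counts = missing_outcome["Outcome_Subtype"].value_counts(dropna=False)
print(subtype_counts)
```

On the real data this would need a frame that still has the `Outcome_Type` column, such as the original `adoption`. Rows whose subtype is also blank give no hint about the outcome and are candidates for dropping.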

# How to Split Data:

The description of the dataset says that Austin is becoming a more pet-friendly city, so the more recent data may show more animals going in and out of shelters than the earlier data. I will therefore split the data by time: the most recent data will be the test set, and the remaining data will be divided into training and validation sets.
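
A time-based split along these lines could look like the sketch below. It runs on made-up rows; the cutoff dates `val_start` and `test_start` are hypothetical, and splitting on `DateTime_intake` rather than the outcome date is an assumption.

```python
import pandas as pd

# Made-up rows; only the column names match the dataset.
toy = pd.DataFrame({
    "DateTime_intake": pd.to_datetime([
        "2014-04-11", "2014-07-13", "2015-03-02", "2015-11-20",
        "2016-07-07", "2016-10-01", "2017-01-31", "2017-04-03",
    ]),
    "New_Outcome_Type": ["Adopted", "Adopted", "Not adopted", "Adopted",
                         "Not adopted", "Adopted", "Not adopted", "Adopted"],
})

# Hypothetical cutoffs: everything before val_start trains the model,
# the next period validates it, and the most recent period is the test set.
val_start = pd.Timestamp("2016-01-01")
test_start = pd.Timestamp("2017-01-01")

intake = toy["DateTime_intake"]
train = toy[intake < val_start]
val = toy[(intake >= val_start) & (intake < test_start)]
test = toy[intake >= test_start]

print(len(train), len(val), len(test))
```

On the real data, `DateTime_intake` is read in as text, so it would first need `pd.to_datetime`. The cutoffs should be chosen after looking at how many rows fall in each period.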