Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency > 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

In [236]:
import pandas as pd
adoption_url = 'https://data.austintexas.gov/resource/9t4d-g238.csv?$limit=100000'
adoption = pd.read_csv(adoption_url)
# in order to see all of the columns:
pd.options.display.max_columns = 100

# Target

In [237]:
adoption.head()

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color
0,A808320,Kyha,2019-11-14T15:37:00.000,2019-11-14T15:37:00.000,2017-11-07T00:00:00.000,Rto-Adopt,,Dog,Spayed Female,2 years,German Shepherd Mix,Sable
1,A781697,Pookie,2019-11-14T15:33:00.000,2019-11-14T15:33:00.000,2017-10-05T00:00:00.000,Adoption,,Dog,Spayed Female,2 years,Cairn Terrier,White/Brown
2,A808382,,2019-11-14T14:57:00.000,2019-11-14T14:57:00.000,2014-11-08T00:00:00.000,Transfer,Partner,Cat,Intact Male,5 years,Domestic Shorthair,Orange Tabby
3,A806701,*Emerald,2019-11-14T14:41:00.000,2019-11-14T14:41:00.000,2019-09-02T00:00:00.000,Adoption,Foster,Cat,Neutered Male,2 months,Domestic Shorthair,Blue Tabby/White
4,A804553,*Wendy,2019-11-14T14:36:00.000,2019-11-14T14:36:00.000,2019-08-09T00:00:00.000,Adoption,Foster,Cat,Spayed Female,3 months,Domestic Shorthair,Calico


In [238]:
adoption.shape

(100000, 12)

The target in this project will be to predict whether or not an animal will be adopted or not (transferred to another shelter or, sadly, euthanized) so that perhaps animal shelters, though overwhelmed, can give some extra love or use unique methods to get those animals that may not have the best odds forever homes.

In [239]:
adoption['outcome_type'].value_counts(dropna=False)

Adoption           44390
Transfer           29861
Return to Owner    17510
Euthanasia          6294
Died                 969
Rto-Adopt            508
Disposal             387
Missing               61
Relocate              17
NaN                    3
Name: outcome_type, dtype: int64

In [240]:
# this might be a feature that can create leakage, will come back to it.
adoption['outcome_subtype'].value_counts(dropna=False)

NaN                    54837
Partner                24943
Foster                  7953
Rabies Risk             2789
SCRP                    2636
Suffering               2503
Snr                     2279
In Kennel                508
Aggressive               358
Offsite                  281
Medical                  254
In Foster                243
At Vet                   171
Behavior                  82
Enroute                   65
Underage                  29
Court/Investigation       21
In Surgery                19
Possible Theft            15
Field                      8
Barn                       4
Customer S                 1
Prc                        1
Name: outcome_subtype, dtype: int64

# Classification or Regression?

There are 9 classes of outcomes but for this project I'd like to focus on the animals that were adopted or not. There are a large number of animals that were returned to owners but that would just be due to them getting out etc but they do have a home so I will not include those in my project. "Rto-adopt" or return to owner adoption will also be included with "return to owner."

For animals that were adopted I will consider that to be:  
-adoption  
-rto-adopt (return to owner through adoption)  


I will combine the following for not adopted:  
-transfer  
-euthanasia  
-relocate  
-missing (animals that went missing from the shelter--still unsuccessful in getting them homes)  
"Died" and "disposal" are animals that may have died while at the shelter or were brought in that needed to be properly disposed of so I will not include these either as they may have been very ill when brought in.

# How is the target distributed?
## Are the classes imbalanced?

In [241]:
adoption['outcome_type'].value_counts(normalize=True)

Adoption           0.443913
Transfer           0.298619
Return to Owner    0.175105
Euthanasia         0.062942
Died               0.009690
Rto-Adopt          0.005080
Disposal           0.003870
Missing            0.000610
Relocate           0.000170
Name: outcome_type, dtype: float64

will need to drop the rows where the outcome type are the ones listed above to be excluded:  
-Return to owner  
-Rto-adopt  
-Died  
-Disposal


In [242]:
# adoption updated to drop the outcomes we are excluding:
adoption_upd = adoption[~adoption['outcome_type'].isin(['Return to Owner', 
                                                        'Rto-Adopt', 
                                                        'Died', 
                                                        'Disposal'])]

In [243]:
adoption_upd['outcome_type'].value_counts(dropna=False)

Adoption      44390
Transfer      29861
Euthanasia     6294
Missing          61
Relocate         17
NaN               3
Name: outcome_type, dtype: int64

In [244]:
# am going to drop all animals except cats and dogs, 2 of the unknown
# outcome types are dogs and 1 also has a lot of other missing info so
# will drop these rows.
adoption_upd = adoption_upd.dropna(subset = ['outcome_type'])


In [245]:
# need to redefine classes as binary. Adoption as 'adopted' and 
# the rest as 'not adopted'.
def new_status(outcome):
    if outcome == 'Transfer' or outcome == 'Euthanasia' or outcome == 'Missing' or outcome == 'Relocate':
      return 'Not adopted'
    else:
      return 'Adopted'


In [246]:
adoption_upd = adoption_upd.copy()
adoption_upd['new_outcome_type'] = adoption_upd['outcome_type'].apply(new_status)

In [247]:
adoption_upd['new_outcome_type'].value_counts(normalize=True)

Adopted        0.550587
Not adopted    0.449413
Name: new_outcome_type, dtype: float64

The classes now are combined into a binary classification and the classes are not imbalanced.

In [248]:
# drop the original 'Outcome_Type' column:
adoption_upd = adoption_upd.drop(columns='outcome_type')

# Choose Observations

As mentioned above, since the focus of my project is predicting if animals that are in need of forever homes will be adopted or not, I have already excluded the following observations from my model:  
-Return to Owner  
-Rto-Adopt  
-Died  
-Disposal  

There are 3 missing values for the outcome ,so we can try to look at the outcome subtype to see if we can determine what happened to them but the remaining animals have been categorized into 'adopted' or 'not adopted.'

# How to Split Data:

In [249]:
adoption_upd.dtypes

animal_id           object
name                object
datetime            object
monthyear           object
date_of_birth       object
outcome_subtype     object
animal_type         object
sex_upon_outcome    object
age_upon_outcome    object
breed               object
color               object
new_outcome_type    object
dtype: object

In [250]:
adoption_upd['datetime'] = pd.to_datetime(adoption_upd['datetime'], infer_datetime_format=True)

In [251]:
adoption_upd['datetime'].dt.year.value_counts()

2015    14792
2019    14373
2016    14119
2017    14058
2018    13361
2014     9920
Name: datetime, dtype: int64

The description of the data set said that Austin is becoming a more pet-friendly city so there may be more animals going in and out of shelters in the more recent data vs the earlier data. I will therefore split the data based on time with the most recent data being the test set and then create a test and validation set with the remaining data.  

test = adoption_upd[(adoption_upd['datetime'].dt.year == 2019)]  
val =  adoption_upd[(adoption_upd['datetime'].dt.year == 2018)]  
train = adoption_upd[(adoption_upd['datetime'].dt.year < 2018)]

In [252]:
# how big to make test set? 2019:
# 14292 observations

# Evaluation Metrics

since the classes aren't imbalanced I can use accuracy but will also explore the precision and recall for this problem.

precision positive: correctly predict all the animals that were adopted.  

\begin{align}
precision = \frac{accurately \ predicted \ adopted}{total\ predicted \ adopted}
\end{align}


recall positive: of all the animals that were adopted, how many were we able to identify?
\begin{align}
recall = \frac{accurately \ predicted \ adopted}{actually \ adopted}
\end{align}


# Begin to clean data and feature selection

In [253]:
adoption_upd.isnull().sum()

animal_id               0
name                30056
datetime                0
monthyear               0
date_of_birth           0
outcome_subtype     36351
animal_type             0
sex_upon_outcome        2
age_upon_outcome       24
breed                   0
color                   0
new_outcome_type        0
dtype: int64

In [254]:
# the name column has a lot of missing values, will change NaN's to Unknown
adoption_upd['name'].fillna("Unknown", inplace = True)  

In [255]:
# the outcome subtype is missing 36353 values, almost half of all of our 
# data, since we will know all of the outcome types and this may cause 
# leakage into the test set because certain outcomes can be deduced from 
# the outcome subtype, I will drop that entire column.

adoption_upd = adoption_upd.drop(columns='outcome_subtype')
adoption_upd.shape


(80623, 11)

In [256]:
# the sex_upon_outcome column has 2 missing values but noticed an 'unknown'
# value so check to see if that's common:
adoption_upd['sex_upon_outcome'].value_counts()
# will one hot encode this column.

Neutered Male    27561
Spayed Female    26019
Intact Female    10098
Intact Male       9372
Unknown           7571
Name: sex_upon_outcome, dtype: int64

In [257]:
# age_upon_outcome has 25 missing values but date_of_birth has none, lets
# look at how many times 'unknown' shows up in the data:
adoption_upd.isin(['Unknown']).sum()
# the name already has 18 unknowns, so will stick to changing NaN's to unknown.

animal_id               0
name                30074
datetime                0
monthyear               0
date_of_birth           0
animal_type             0
sex_upon_outcome     7571
age_upon_outcome        0
breed                   0
color                   0
new_outcome_type        0
dtype: int64

In [258]:
# to see if 'Other is a common entry as well'
adoption_upd.isin(['Other']).sum()

animal_id              0
name                   0
datetime               0
monthyear              0
date_of_birth          0
animal_type         4450
sex_upon_outcome       0
age_upon_outcome       0
breed                  0
color                  0
new_outcome_type       0
dtype: int64

In [259]:
# to see what kind of animals come in to the shelter
adoption_upd['animal_type'].value_counts()

Dog          39758
Cat          35974
Other         4450
Bird           432
Livestock        9
Name: animal_type, dtype: int64

In [260]:
# initially thinking of only using cats and dogs.
adoption_upd = adoption_upd[~adoption_upd['animal_type'].isin(['Bird', 
                                                        'Other', 
                                                        'Livestock'])]

In [261]:
# check for missing values now:
adoption_upd.isnull().sum()

animal_id           0
name                0
datetime            0
monthyear           0
date_of_birth       0
animal_type         0
sex_upon_outcome    2
age_upon_outcome    8
breed               0
color               0
new_outcome_type    0
dtype: int64

In [262]:
adoption_upd['date_of_birth'] = pd.to_datetime(adoption_upd['date_of_birth'], infer_datetime_format=True)

In [263]:
adoption_upd.loc[adoption_upd['age_upon_outcome'].isnull()]

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,new_outcome_type
59,A808636,Unknown,2019-11-13 13:46:00,2019-11-13T13:46:00.000,2004-11-11,Cat,Intact Female,,Siamese,Seal Point,Not adopted
96,A808702,Keepers,2019-11-12 15:20:00,2019-11-12T15:20:00.000,2009-11-12,Dog,Neutered Male,,Golden Retriever,Gold,Not adopted
131,A808649,Unknown,2019-11-11 17:49:00,2019-11-11T17:49:00.000,2019-10-11,Cat,Intact Male,,Domestic Shorthair,Brown Tabby,Not adopted
132,A738697,Boots,2019-11-11 17:48:00,2019-11-11T17:48:00.000,2007-11-17,Dog,,,Miniature Schnauzer Mix,Black,Not adopted
161,A808626,Unknown,2019-11-11 13:57:00,2019-11-11T13:57:00.000,2019-09-11,Cat,Intact Female,,Domestic Shorthair,Brown Tabby,Not adopted
214,A808466,Unknown,2019-11-10 09:35:00,2019-11-10T09:35:00.000,2019-09-09,Cat,Intact Male,,Domestic Shorthair,Brown/Black,Not adopted
347,A808352,Unknown,2019-11-07 16:15:00,2019-11-07T16:15:00.000,2011-11-07,Dog,Intact Male,,Dachshund Mix,Tan,Not adopted
6649,A752967,Gray,2019-07-24 11:42:00,2019-07-24T11:42:00.000,2015-06-29,Dog,,,Pit Bull Mix,Blue/White,Not adopted


In [264]:
# 2 of the rows that are missing the age are also missing the sex so I will
# drop those:
adoption_upd = adoption_upd.dropna(subset = ['sex_upon_outcome'])

In [265]:
# since there are no missing or unknown values for DOB which still seems strange 
# especially for stray animals that were found but maybe they approximated. We can 
# create a column where we subtract 2019 from the born on year to get the age and see 
# if there are a lot of differences.

now = pd.Timestamp('now')
# first, get DOB year column and DOB month columns:
adoption_upd['DOB_month'] = adoption_upd['date_of_birth'].dt.month
# now, subtract current year from DOB year:
adoption_upd['DOB_year'] = now.year - adoption_upd['date_of_birth'].dt.year
# now turn DOB_year into months by multiplying by 12
adoption_upd['DOB_year'] = adoption_upd['DOB_year'] * 12
# add DOB_year which is now in months to DOB month and divide by 12 to get years
adoption_upd['calculated_age'] = adoption_upd['DOB_year'] + adoption_upd['DOB_month']
def calculate_age(age):
        if age >= 12:
            return(f'{age // 12} years')
        else:
            return(f'{age} months') 
adoption_upd['calculated_age'] = adoption_upd['calculated_age'].apply(calculate_age)
# fill NaN's in age upon outcome column with calculated age
adoption_upd['age_upon_outcome'].fillna(adoption_upd['calculated_age'], inplace=True)
# drop DOB month, DOB year, calculated age:
adoption_upd.drop(columns =['DOB_month', 'DOB_year', 'calculated_age'], inplace=True)

In [266]:
adoption_upd.isnull().sum()
# no more missing values!

animal_id           0
name                0
datetime            0
monthyear           0
date_of_birth       0
animal_type         0
sex_upon_outcome    0
age_upon_outcome    0
breed               0
color               0
new_outcome_type    0
dtype: int64

In [267]:
adoption_upd.head()

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,new_outcome_type
1,A781697,Pookie,2019-11-14 15:33:00,2019-11-14T15:33:00.000,2017-10-05,Dog,Spayed Female,2 years,Cairn Terrier,White/Brown,Adopted
2,A808382,Unknown,2019-11-14 14:57:00,2019-11-14T14:57:00.000,2014-11-08,Cat,Intact Male,5 years,Domestic Shorthair,Orange Tabby,Not adopted
3,A806701,*Emerald,2019-11-14 14:41:00,2019-11-14T14:41:00.000,2019-09-02,Cat,Neutered Male,2 months,Domestic Shorthair,Blue Tabby/White,Adopted
4,A804553,*Wendy,2019-11-14 14:36:00,2019-11-14T14:36:00.000,2019-08-09,Cat,Spayed Female,3 months,Domestic Shorthair,Calico,Adopted
5,A804552,*Tinkerbell,2019-11-14 14:35:00,2019-11-14T14:35:00.000,2019-08-09,Cat,Spayed Female,3 months,Domestic Shorthair,Calico,Adopted


In [268]:
# looking at redundant columns, date of birth can be dropped since the ages are all accounted for.
# also datetime and monthyear are the same, I will get rid of datetime for now.
adoption_upd.drop(columns=['date_of_birth'], inplace=True)

In [269]:
# going to create a new column for the season the animal came into the shelter to see if certain
# seasons have higher adoption rates:
adoption_upd['monthyear'] = pd.to_datetime(adoption_upd['monthyear'], infer_datetime_format=True)
adoption_upd['month_arrived'] = adoption_upd['monthyear'].dt.month
adoption_upd.head()


Unnamed: 0,animal_id,name,datetime,monthyear,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,new_outcome_type,month_arrived
1,A781697,Pookie,2019-11-14 15:33:00,2019-11-14 15:33:00,Dog,Spayed Female,2 years,Cairn Terrier,White/Brown,Adopted,11
2,A808382,Unknown,2019-11-14 14:57:00,2019-11-14 14:57:00,Cat,Intact Male,5 years,Domestic Shorthair,Orange Tabby,Not adopted,11
3,A806701,*Emerald,2019-11-14 14:41:00,2019-11-14 14:41:00,Cat,Neutered Male,2 months,Domestic Shorthair,Blue Tabby/White,Adopted,11
4,A804553,*Wendy,2019-11-14 14:36:00,2019-11-14 14:36:00,Cat,Spayed Female,3 months,Domestic Shorthair,Calico,Adopted,11
5,A804552,*Tinkerbell,2019-11-14 14:35:00,2019-11-14 14:35:00,Cat,Spayed Female,3 months,Domestic Shorthair,Calico,Adopted,11


In [270]:
def season(arrival_month):
    if arrival_month == 1 or arrival_month == 2 or arrival_month == 3:
        return 'Winter'
    if arrival_month == 3 or arrival_month == 4 or arrival_month ==5:
        return 'Spring'
    if arrival_month == 6 or arrival_month == 7 or arrival_month ==8:
        return 'Summer'
    if arrival_month == 9 or arrival_month == 10 or arrival_month ==11:
        return 'Fall'
adoption_upd['season_arrived'] = adoption_upd['month_arrived'].apply(season)

In [271]:
# want to change the datetime column to year only since we have month_arrived now:
adoption_upd['year_arrived'] = adoption_upd['datetime'].dt.year

In [272]:
# going to drop datetime and monthyear now:
adoption_upd.drop(columns=['datetime', 'monthyear'], inplace=True)

In [273]:
adoption_upd.head(10)

Unnamed: 0,animal_id,name,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,new_outcome_type,month_arrived,season_arrived,year_arrived
1,A781697,Pookie,Dog,Spayed Female,2 years,Cairn Terrier,White/Brown,Adopted,11,Fall,2019
2,A808382,Unknown,Cat,Intact Male,5 years,Domestic Shorthair,Orange Tabby,Not adopted,11,Fall,2019
3,A806701,*Emerald,Cat,Neutered Male,2 months,Domestic Shorthair,Blue Tabby/White,Adopted,11,Fall,2019
4,A804553,*Wendy,Cat,Spayed Female,3 months,Domestic Shorthair,Calico,Adopted,11,Fall,2019
5,A804552,*Tinkerbell,Cat,Spayed Female,3 months,Domestic Shorthair,Calico,Adopted,11,Fall,2019
6,A808132,Daryl,Cat,Neutered Male,5 years,Domestic Shorthair,Brown Tabby/White,Adopted,11,Fall,2019
7,A805655,*Fezzik,Cat,Neutered Male,2 months,Domestic Shorthair,Orange Tabby,Adopted,11,Fall,2019
9,A808526,Jugez,Dog,Spayed Female,13 years,Shih Tzu,Brown/Cream,Not adopted,11,Fall,2019
10,A787448,Watermelon,Dog,Spayed Female,3 years,Labrador Retriever Mix,Brown Tiger/White,Adopted,11,Fall,2019
11,A808522,Unknown,Cat,Intact Female,4 years,Domestic Shorthair,Brown Tabby,Not adopted,11,Fall,2019


In [274]:
# how many different breeds are there:
adoption_upd['breed'].value_counts()
# 2042...very high cardinality

Domestic Shorthair Mix                         26063
Pit Bull Mix                                    4720
Labrador Retriever Mix                          4372
Chihuahua Shorthair Mix                         3974
Domestic Shorthair                              3566
Domestic Medium Hair Mix                        2548
German Shepherd Mix                             1816
Domestic Longhair Mix                           1184
Siamese Mix                                     1019
Australian Cattle Dog Mix                        969
Dachshund Mix                                    625
Border Collie Mix                                594
Boxer Mix                                        572
Domestic Medium Hair                             478
Miniature Poodle Mix                             453
Catahoula Mix                                    432
Labrador Retriever                               403
Staffordshire Mix                                401
Chihuahua Shorthair                           

In [275]:
# create subsets for dogs and cats:
dogs = adoption_upd[adoption_upd.animal_type=='Dog']
cats = adoption_upd[adoption_upd.animal_type=='Cat']

In [276]:
# Reduce cardinality for breed feature ...
dogs =dogs.copy()
# Get a list of the top 75 breeds
top75 = dogs['breed'].value_counts()[:75].index
# Breeds that are NOT in the top 100,
# replace the breed with 'OTHER'
dogs.loc[~dogs['breed'].isin(top75), 'breed'] = 'OTHER'
dogs['breed'].value_counts()

OTHER                                       9368
Pit Bull Mix                                4720
Labrador Retriever Mix                      4372
Chihuahua Shorthair Mix                     3974
German Shepherd Mix                         1816
Australian Cattle Dog Mix                    969
Dachshund Mix                                625
Border Collie Mix                            594
Boxer Mix                                    572
Miniature Poodle Mix                         453
Catahoula Mix                                432
Labrador Retriever                           403
Chihuahua Shorthair                          401
Staffordshire Mix                            401
Pit Bull                                     384
Australian Shepherd Mix                      381
Great Pyrenees Mix                           367
Pointer Mix                                  367
Rat Terrier Mix                              364
Beagle Mix                                   358
German Shepherd     

In [277]:
# Reduce cardinality for breed feature ...cats don't have as many
# breeds so will only use top 30
cats =cats.copy()
# Get a list of the top 30 breeds
top30 = cats['breed'].value_counts()[:30].index
# Breeds that are NOT in the top 30,
# replace the breed with 'OTHER'
cats.loc[~cats['breed'].isin(top30), 'breed'] = 'OTHER'
cats['breed'].value_counts()

Domestic Shorthair Mix         26063
Domestic Shorthair              3566
Domestic Medium Hair Mix        2548
Domestic Longhair Mix           1184
Siamese Mix                     1019
Domestic Medium Hair             478
American Shorthair Mix           201
Snowshoe Mix                     143
Siamese                          130
OTHER                            125
Domestic Longhair                103
Maine Coon Mix                    81
Manx Mix                          75
Russian Blue Mix                  46
Ragdoll Mix                       34
American Shorthair                29
Himalayan Mix                     27
Persian Mix                       16
Balinese Mix                      14
American Curl Shorthair Mix       12
Japanese Bobtail Mix               9
Persian                            9
Russian Blue                       8
Tonkinese Mix                      8
Turkish Van Mix                    7
Siamese/Domestic Shorthair         7
Manx                               7
H

In [278]:
# concatenate the cats and dogs subsets, matches shape of adoption_upd
adoption_final = pd.concat([dogs, cats])
adoption_final.shape

(75730, 11)

In [279]:
# lets do something similar to color:
adoption_final['color'].value_counts()

Black/White                  8297
Black                        6815
Brown Tabby                  5511
Brown Tabby/White            2809
Orange Tabby                 2642
White                        2320
Tan/White                    2258
Brown/White                  2151
Blue/White                   2113
White/Black                  2067
Tan                          1869
Tortie                       1664
Calico                       1636
Brown                        1620
Tricolor                     1612
Blue                         1540
Black/Tan                    1514
Blue Tabby                   1381
Black/Brown                  1377
White/Brown                  1302
Orange Tabby/White           1289
Brown Brindle/White          1196
White/Tan                    1106
Torbie                       1035
Brown/Black                  1025
Red                           748
Red/White                     704
Blue Tabby/White              675
Tan/Black                     675
Brown Brindle 

In [280]:
adoption_final = adoption_final.copy()
# Get a list of the top 75 colors
top75 = adoption_final['color'].value_counts()[:75].index
# colors that are NOT in the top 75,
# replace the color with 'OTHER'
adoption_final.loc[~adoption_final['color'].isin(top75), 'color'] = 'OTHER'

In [281]:
adoption_final.head()

Unnamed: 0,animal_id,name,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,new_outcome_type,month_arrived,season_arrived,year_arrived
1,A781697,Pookie,Dog,Spayed Female,2 years,Cairn Terrier,White/Brown,Adopted,11,Fall,2019
9,A808526,Jugez,Dog,Spayed Female,13 years,Shih Tzu,OTHER,Not adopted,11,Fall,2019
10,A787448,Watermelon,Dog,Spayed Female,3 years,Labrador Retriever Mix,OTHER,Adopted,11,Fall,2019
17,A808589,Lola,Dog,Spayed Female,5 years,OTHER,Brown Brindle,Not adopted,11,Fall,2019
18,A808590,Cheyenne,Dog,Spayed Female,7 years,Australian Shepherd Mix,Brown/White,Not adopted,11,Fall,2019


In [227]:
# # change animal type to numerical:
# # 1 for dog, 2 for cat
# def change_type(animal):
#     if animal == 'dog':
#         return 1
#     else:
#         return 2


In [228]:
# adoption_final['animal_type'] = adoption_final['animal_type'].apply(change_type).astype(int)

In [282]:
adoption_final['sex_upon_outcome'].value_counts()

Neutered Male    27406
Spayed Female    25892
Intact Female     9822
Intact Male       8973
Unknown           3637
Name: sex_upon_outcome, dtype: int64

In [283]:
# do the same thing with sex_upon_outcome:
def change_sex(animal):
    if animal == 'Neutered Male':
        return 1
    if animal == 'Spayed Female':
        return 2
    if animal == 'Intact Male':
        return 3
    if animal == 'Intact Female':
        return 4
    if animal == 'Unknown':
        return 5
adoption_final['sex_upon_outcome'] = adoption_final['sex_upon_outcome'].apply(change_sex).astype(int)

In [284]:
#need to bin ages:
adoption_final['age_upon_outcome'].value_counts()

1 year       12232
2 months     11998
2 years       9739
3 months      4480
1 month       4321
3 years       3684
4 months      2890
4 years       2126
5 years       2050
5 months      1979
6 months      1818
3 weeks       1756
2 weeks       1664
8 months      1321
6 years       1302
8 years       1191
4 weeks       1180
10 months     1158
7 years       1064
7 months      1000
10 years       955
9 months       794
1 weeks        656
9 years        582
1 week         547
12 years       466
11 months      461
11 years       315
2 days         270
3 days         262
13 years       251
1 day          184
6 days         182
4 days         157
14 years       156
15 years       137
5 days         108
0 years        103
5 weeks         92
16 years        39
17 years        28
18 years        15
19 years         8
20 years         7
-1 years         1
22 years         1
Name: age_upon_outcome, dtype: int64

In [286]:
# This is difficult because of the age units, there are days, weeks, months
# and years. Best way I could think of it is to bin the ages that are under 1 year
# and then do the rest of the data in years since that's the majority and it's numerical data.

def bin_ages(age):
    if age == '11 months' or age == '12 months' or age == '1 year':
        return '1 years'
    if age == '8 months' or age == '9 months' or age == '10 months':
        return '.75 years'
    if age == '6 months' or age == '7 months' or age =='5 months' or age == '4 months':
        return '.5 years'
    if age == '1 month' or age == '3 weeks' or age == '2 weeks' or age == '4 weeks' or age == '1 weeks' or age == '1 week' or age == '2 days' or age == '3 days' or age == '1 day' or age == '6 days' or age == '4 days' or age == '5 days' or age == '0 years' or age == '5 weeks' or age == '2 months' or age == '3 months' or age == '2 month' or age == '-1 years':
        return '.25 years'
    else:
        return age 
adoption_final['age_upon_outcome'] = adoption_final['age_upon_outcome'].apply(bin_ages)

In [287]:
adoption_final['age_upon_outcome'].value_counts()

.25 years    27961
1 years      12693
2 years       9739
.5 years      7687
3 years       3684
.75 years     3273
4 years       2126
5 years       2050
6 years       1302
8 years       1191
7 years       1064
10 years       955
9 years        582
12 years       466
11 years       315
13 years       251
14 years       156
15 years       137
16 years        39
17 years        28
18 years        15
19 years         8
20 years         7
22 years         1
Name: age_upon_outcome, dtype: int64

In [288]:
# now for the rest of the ages, need to strip the 'years' from the values:
# also going to drop the row that says -1 because not sure what that means.
def age_to_int(age):
  return float(age.strip('years'))

adoption_final['age_upon_outcome'] = adoption_final['age_upon_outcome'].apply(age_to_int)

In [292]:
adoption_final.dtypes

animal_id            object
name                 object
animal_type          object
sex_upon_outcome      int64
age_upon_outcome    float64
breed                object
color                object
new_outcome_type     object
month_arrived         int64
season_arrived       object
year_arrived          int64
dtype: object

In [293]:
adoption_final.head()

Unnamed: 0,animal_id,name,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,new_outcome_type,month_arrived,season_arrived,year_arrived
1,A781697,Pookie,Dog,2,2.0,Cairn Terrier,White/Brown,Adopted,11,Fall,2019
9,A808526,Jugez,Dog,2,13.0,Shih Tzu,OTHER,Not adopted,11,Fall,2019
10,A787448,Watermelon,Dog,2,3.0,Labrador Retriever Mix,OTHER,Adopted,11,Fall,2019
17,A808589,Lola,Dog,2,5.0,OTHER,Brown Brindle,Not adopted,11,Fall,2019
18,A808590,Cheyenne,Dog,2,7.0,Australian Shepherd Mix,Brown/White,Not adopted,11,Fall,2019


In [294]:
# going to make a copy without the age upon outcome column since it has high cardinality and isn't numeric:
adoption_use = adoption_final.drop(columns=['age_upon_outcome'])

In [295]:
adoption_use.head()

Unnamed: 0,animal_id,name,animal_type,sex_upon_outcome,breed,color,new_outcome_type,month_arrived,season_arrived,year_arrived
1,A781697,Pookie,Dog,2,Cairn Terrier,White/Brown,Adopted,11,Fall,2019
9,A808526,Jugez,Dog,2,Shih Tzu,OTHER,Not adopted,11,Fall,2019
10,A787448,Watermelon,Dog,2,Labrador Retriever Mix,OTHER,Adopted,11,Fall,2019
17,A808589,Lola,Dog,2,OTHER,Brown Brindle,Not adopted,11,Fall,2019
18,A808590,Cheyenne,Dog,2,Australian Shepherd Mix,Brown/White,Not adopted,11,Fall,2019


Think I'm ready to start fitting models, may need to come back to features:

In [296]:
# split data into train, val, test
adoption_use['year_arrived'].value_counts()

2015    13964
2019    13644
2016    13141
2017    13035
2018    12513
2014     9433
Name: year_arrived, dtype: int64

In [297]:
test = adoption_use[(adoption_use['year_arrived'] == 2019)]  
val =  adoption_use[(adoption_use['year_arrived']== 2018)]  
train = adoption_upd[(adoption_upd['year_arrived'] < 2018)]

In [298]:
test.shape, val.shape, train.shape

((13644, 10), (12513, 10), (49573, 11))

In [58]:
# %matplotlib inline
# import matplotlib.pyplot as plt
# import seaborn as sns

# for col in adoption_upd.columns:
#     if adoption_upd[col].nunique() < 10:
#         try:
#             sns.catplot(x=col, y='new_outcome_type', data=adoption_upd, kind='bar', color='grey')
#             plt.show()
#         except:
#             pass

In [59]:
# numeric = train.select_dtypes('number')
# def change_type(animal):
#     if animal == 'Adopted':
#         return 1
#     else:
#         return 2
# train['new_outcome_type'] = train['new_outcome_type'].apply(change_type)
# for col in sorted(numeric.columns):
#     sns.lmplot(x=col, y='new_outcome_type', data=train, scatter_kws=dict(alpha=0.05))
#     plt.show()

In [60]:
train.dtypes

animal_id           object
name                object
animal_type         object
sex_upon_outcome    object
age_upon_outcome    object
breed               object
color               object
new_outcome_type    object
month_arrived        int64
season_arrived      object
year_arrived         int64
dtype: object