## III: Primary Dataset Construction

The Animal IDs that the Austin Animal Center provides in each dataset offer a convenient key for merging all the information into a single dataframe and, for animals that have been taken in multiple times, for charting their history.

One area of exploration for this project was duration of stay, and merging the Intakes dataset with the Outcomes dataset allowed for this analysis.

Initial imports and file read-ins:

In [1]:
import pandas as pd
import numpy as np

In [2]:
outcomes = pd.read_csv('../../datasets/outcomes_initial.csv').drop(columns=['Unnamed: 0']).sort_values(['animal_id', 'datetime']).reset_index(drop=True)
intakes = pd.read_csv('../../datasets/intakes_initial.csv').drop(columns=['Unnamed: 0']).sort_values(['animal_id', 'datetime']).reset_index(drop=True)

A brief display to make sure everything is in order:

In [3]:
outcomes.head()

,animal_id,name,datetime,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,is_named,year,month,day
0,A006100,Scamp,2014-03-08 17:10:00,07/09/2007,Return to Owner,Unknown,Dog,Neutered Male,6.0,Spinone Italiano Mix,Yellow/White,1,2014,3,Saturday
1,A006100,Scamp,2014-12-20 16:35:00,07/09/2007,Return to Owner,Unknown,Dog,Neutered Male,7.0,Spinone Italiano Mix,Yellow/White,1,2014,12,Saturday
2,A006100,Scamp,2017-12-07 00:00:00,07/09/2007,Return to Owner,Unknown,Dog,Neutered Male,1.0,Spinone Italiano Mix,Yellow/White,1,2017,12,Thursday
3,A047759,Oreo,2014-04-07 15:12:00,04/02/2004,Transfer,Partner,Dog,Neutered Male,1.0,Dachshund,Tricolor,1,2014,4,Monday
4,A134067,Bandit,2013-11-16 11:54:00,10/16/1997,Return to Owner,Unknown,Dog,Neutered Male,1.0,Shetland Sheepdog,Brown/White,1,2013,11,Saturday


In [4]:
intakes.head()

,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,is_named,year,month,day
0,A006100,Scamp,2014-03-07 14:26:00,8700 Research in Austin (TX),Public Assist,Normal,Dog,Neutered Male,6.0,Spinone Italiano Mix,Yellow/White,1,2014,3,Friday
1,A006100,Scamp,2014-12-19 10:21:00,8700 Research Blvd in Austin (TX),Public Assist,Normal,Dog,Neutered Male,7.0,Spinone Italiano Mix,Yellow/White,1,2014,12,Friday
2,A006100,Scamp,2017-12-07 14:07:00,Colony Creek And Hunters Trace in Austin (TX),Stray,Normal,Dog,Neutered Male,1.0,Spinone Italiano Mix,Yellow/White,1,2017,12,Thursday
3,A047759,Oreo,2014-04-02 15:55:00,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,1.0,Dachshund,Tricolor,1,2014,4,Wednesday
4,A134067,Bandit,2013-11-16 09:02:00,12034 Research Blvd in Austin (TX),Public Assist,Injured,Dog,Neutered Male,1.0,Shetland Sheepdog,Brown/White,1,2013,11,Saturday


First, a display of all possible outcomes based on the available data:

In [5]:
outcomes['outcome_type'].value_counts()

Adoption           58148
Transfer           37887
Return to Owner    22019
Euthanasia          8701
Died                1225
Rto-Adopt            762
Disposal             595
Missing               70
Relocate              25
Name: outcome_type, dtype: int64

Again, many animals go through the shelter system more than once, often with different outcomes each time. Features are generated below to track this history and initially are set to zero for the tallying that follows:

In [6]:
outcomes['prev_adoption'] = 0
outcomes['prev_transfer'] = 0
outcomes['prev_ret_to_owner'] = 0
outcomes['prev_rto_adopt'] = 0
outcomes['prev_disposal'] = 0
outcomes['prev_missing'] = 0
outcomes['prev_relocate'] = 0

Tickers for the relevant outcome types are likewise initialized to zero. With the Outcomes dataset sorted by animal and then by timestamp, a single pass over the rows tallies each animal's history into the corresponding column of its next entry. (For example, an animal that has been returned to its owner twice will, on its third time through the shelter system, have `2` in the `prev_ret_to_owner` field.) This makes it possible to test whether such history influences outcome.

In [7]:
adoptions = 0
transfers = 0
ret_to_owners = 0
rto_adopts = 0
disposals = 0
missings = 0
relocates = 0

for i in range(len(outcomes) - 1):

    # Rows are sorted by animal and timestamp, so a matching ID on the next
    # row means the same animal came back to the shelter.
    if outcomes['animal_id'][i] == outcomes['animal_id'][i + 1]:

        # .loc writes to the dataframe itself; chained indexing such as
        # outcomes[col][i + 1] = ... raises pandas' SettingWithCopyWarning.
        if outcomes['outcome_type'][i] == 'Adoption':
            outcomes.loc[i + 1, 'prev_adoption'] = adoptions + 1
            adoptions += 1
        elif outcomes['outcome_type'][i] == 'Transfer':
            outcomes.loc[i + 1, 'prev_transfer'] = transfers + 1
            transfers += 1
        elif outcomes['outcome_type'][i] == 'Return to Owner':
            outcomes.loc[i + 1, 'prev_ret_to_owner'] = ret_to_owners + 1
            ret_to_owners += 1
        elif outcomes['outcome_type'][i] == 'Rto-Adopt':
            outcomes.loc[i + 1, 'prev_rto_adopt'] = rto_adopts + 1
            rto_adopts += 1
        elif outcomes['outcome_type'][i] == 'Disposal':
            outcomes.loc[i + 1, 'prev_disposal'] = disposals + 1
            disposals += 1
        elif outcomes['outcome_type'][i] == 'Missing':
            outcomes.loc[i + 1, 'prev_missing'] = missings + 1
            missings += 1
        elif outcomes['outcome_type'][i] == 'Relocate':
            outcomes.loc[i + 1, 'prev_relocate'] = relocates + 1
            relocates += 1
    else:
        # The next row belongs to a different animal: reset every ticker.
        adoptions = 0
        transfers = 0
        ret_to_owners = 0
        rto_adopts = 0
        disposals = 0
        missings = 0
        relocates = 0

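The next few code cells spot-check the tallies. Counts increment as expected for consecutive outcomes of the same type. Note, however, that each row records only the counter for the immediately preceding outcome type: for animal `A774102` below, `prev_transfer` falls back to `0` once the previous outcome is a Return to Owner, even though three transfers came earlier.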

In [8]:
outcomes[['animal_id', 'outcome_type', 'prev_adoption', 'prev_transfer', 'prev_ret_to_owner', 'prev_rto_adopt', 'prev_disposal', 'prev_missing', 'prev_relocate' ]]

,animal_id,outcome_type,prev_adoption,prev_transfer,prev_ret_to_owner,prev_rto_adopt,prev_disposal,prev_missing,prev_relocate
0,A006100,Return to Owner,0,0,0,0,0,0,0
1,A006100,Return to Owner,0,0,1,0,0,0,0
2,A006100,Return to Owner,0,0,2,0,0,0,0
3,A047759,Transfer,0,0,0,0,0,0,0
4,A134067,Return to Owner,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
129427,A840385,Transfer,0,0,0,0,0,0,0
129428,A840386,Transfer,0,0,0,0,0,0,0
129429,A840402,Euthanasia,0,0,0,0,0,0,0
129430,A840404,Euthanasia,0,0,0,0,0,0,0


In [9]:
outcomes['prev_adoption'].value_counts()

0    121614
1      6660
2       932
3       174
4        38
5        10
6         3
7         1
Name: prev_adoption, dtype: int64

In [10]:
# checking new features
outcomes.query('animal_id == "A754989"')[['animal_id', 'outcome_type', 'prev_adoption', 'prev_transfer', 'prev_ret_to_owner', 'prev_rto_adopt', 'prev_disposal', 'prev_missing', 'prev_relocate' ]]

,animal_id,outcome_type,prev_adoption,prev_transfer,prev_ret_to_owner,prev_rto_adopt,prev_disposal,prev_missing,prev_relocate
74019,A754989,Adoption,0,0,0,0,0,0,0
74020,A754989,Adoption,1,0,0,0,0,0,0
74021,A754989,Adoption,2,0,0,0,0,0,0
74022,A754989,Adoption,3,0,0,0,0,0,0
74023,A754989,Adoption,4,0,0,0,0,0,0
74024,A754989,Adoption,5,0,0,0,0,0,0
74025,A754989,Adoption,6,0,0,0,0,0,0
74026,A754989,Adoption,7,0,0,0,0,0,0


In [11]:
outcomes.query('animal_id == "A774102" or animal_id == "A809074"')[['animal_id', 'outcome_type', 'prev_adoption', 'prev_transfer', 'prev_ret_to_owner', 'prev_rto_adopt', 'prev_disposal', 'prev_missing', 'prev_relocate' ]]

,animal_id,outcome_type,prev_adoption,prev_transfer,prev_ret_to_owner,prev_rto_adopt,prev_disposal,prev_missing,prev_relocate
87613,A774102,Transfer,0,0,0,0,0,0,0
87614,A774102,Transfer,0,1,0,0,0,0,0
87615,A774102,Transfer,0,2,0,0,0,0,0
87616,A774102,Return to Owner,0,3,0,0,0,0,0
87617,A774102,Return to Owner,0,0,1,0,0,0,0
87618,A774102,Return to Owner,0,0,2,0,0,0,0
87619,A774102,Return to Owner,0,0,3,0,0,0,0
113857,A809074,Transfer,0,0,0,0,0,0,0
113858,A809074,Transfer,0,1,0,0,0,0,0
113859,A809074,Transfer,0,2,0,0,0,0,0


In [12]:
intakes[intakes['sex_upon_intake'].str.contains('Unknown')].shape

(10368, 15)

In [13]:
'''intakes['is_male'] = intakes['sex_upon_intake'].str.split(' ').str[1]
intakes['is_neutered'] = intakes['sex_upon_intake'].str.split(' ').str[0]

outcomes['is_male'] = outcomes['sex_upon_outcome'].str.split(' ').str[1]
outcomes['is_neutered'] = outcomes['sex_upon_outcome'].str.split(' ').str[0]

intakes['is_male'] = intakes['sex_upon_intake'].str.contains('Male').astype(int)
intakes['is_female'] = intakes['sex_upon_intake'].str.contains('Female').astype(int)
intakes['is_neutered'] = intakes['sex_upon_intake'].str.contains(r'Neutered|Spayed').astype(int)

#intakes['is_unknown'] = intakes['sex_upon_intake'].str.contains('Unknown').astype(int)

#outcomes['is_male'] = (outcomes['is_male'] == 'Male').astype(int)
#outcomes['is_neutered'] = (outcomes['is_neutered'] != 'Intact').astype(int)
#outcomes['is_unknown'] = outcomes['sex_upon_intake'].str.contains('Unknown').astype(int)
'''


In [14]:
intakes[intakes['sex_upon_intake'] == 'Unknown']

,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,is_named,year,month,day
8,A169438,,2018-04-04 20:37:00,Dessau in Austin (TX),Stray,Normal,Bird,Unknown,1.000,Dove Mix,Gray/White,0,2018,4,Wednesday
4900,A663372,Nutmeg,2013-10-19 11:16:00,Austin (TX),Owner Surrender,Normal,Other,Unknown,0.167,Rabbit Sh Mix,Brown,1,2013,10,Saturday
4901,A663373,Snowball,2013-10-19 11:16:00,Austin (TX),Owner Surrender,Normal,Other,Unknown,0.167,Rabbit Sh Mix,White/Tan,1,2013,10,Saturday
4902,A663374,Cinnimen,2013-10-19 11:16:00,Austin (TX),Owner Surrender,Normal,Other,Unknown,0.167,Rabbit Sh Mix,Brown,1,2013,10,Saturday
4903,A663375,Itzie,2013-10-19 11:16:00,Austin (TX),Owner Surrender,Normal,Other,Unknown,0.167,Rabbit Sh Mix,Brown,1,2013,10,Saturday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
128793,A840375,Chase,2021-08-06 15:14:00,Cripple Creek Dr in Austin (TX),Stray,Normal,Dog,Unknown,2.000,German Shepherd,Black/Tan,1,2021,8,Friday
128795,A840377,,2021-08-06 15:14:00,Cripple Creek Dr in Austin (TX),Stray,Normal,Dog,Unknown,2.000,German Shepherd Mix,Black/Tan,0,2021,8,Friday
128812,A840402,,2021-08-06 18:46:00,11306 Parkfield Drive in Austin (TX),Wildlife,Normal,Other,Unknown,2.000,Bat,Brown,0,2021,8,Friday
128814,A840404,,2021-08-06 20:16:00,1950 Webberville Road in Austin (TX),Wildlife,Normal,Other,Unknown,2.000,Bat,Brown,0,2021,8,Friday


In [15]:
# converting for rank
intakes['datetime'] = pd.to_datetime(intakes['datetime'])
outcomes['datetime'] = pd.to_datetime(outcomes['datetime'])

Because many animals are entered in the system multiple times, each unique stay must be tracked so that every outcome lines up with the intake for that stay. Below (with the datetime columns converted above), a column is created that appends a stay number to each animal's ID, ranked from most recent to oldest, so `_1` marks an animal's latest stay. The two datasets are then merged on this tracking ID into one dataframe for future modeling.

In [16]:
intakes['intake_num'] = intakes.groupby(['animal_id'])['datetime'].rank(method='dense', ascending=False)
intakes['tracking_id'] = intakes['animal_id'] + '_' + intakes['intake_num'].astype('int').astype('str')
outcomes['outcome_num'] = outcomes.groupby(['animal_id'])['datetime'].rank(method='dense', ascending=False)
outcomes['tracking_id'] = outcomes['animal_id'] + '_' + outcomes['outcome_num'].astype('int').astype('str')

In [17]:
outcomes.set_index('tracking_id', inplace=True)
intakes.set_index('tracking_id', inplace=True)

full_df = pd.merge(outcomes, intakes, how='inner', 
                  right_index=True, left_index=True, suffixes=['_out', '_in'])
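Pairing stays by rank assumes that the n-th intake and the n-th outcome of an animal belong together. The sketch below uses made-up rows and hypothetical helpers (`stay_number`, `pair_stays`) to show how the ranking direction matters when an animal has an intake with no outcome yet. Whether this situation occurs in the real data is not confirmed here; it is offered only as something worth checking.

```python
import pandas as pd

# Made-up example: three intakes, but only two outcomes so far
# (the animal is still in the shelter after its latest intake).
intakes_demo = pd.DataFrame({
    'animal_id': ['A1', 'A1', 'A1'],
    'datetime': pd.to_datetime(['2014-03-07', '2014-12-19', '2017-12-07']),
})
outcomes_demo = pd.DataFrame({
    'animal_id': ['A1', 'A1'],
    'datetime': pd.to_datetime(['2014-03-08', '2014-12-20']),
})


def stay_number(df, ascending):
    """Dense rank of each event within its animal, as an integer."""
    ranks = df.groupby('animal_id')['datetime'].rank(
        method='dense', ascending=ascending)
    return ranks.astype(int)


def pair_stays(intakes, outcomes, ascending):
    """Inner-join intakes and outcomes on (animal_id, stay number)."""
    i = intakes.assign(stay=stay_number(intakes, ascending))
    o = outcomes.assign(stay=stay_number(outcomes, ascending))
    return i.merge(o, on=['animal_id', 'stay'], suffixes=('_in', '_out'))


newest_first = pair_stays(intakes_demo, outcomes_demo, ascending=False)
oldest_first = pair_stays(intakes_demo, outcomes_demo, ascending=True)

# Newest-first pairs the open 2017 intake with the 2014-12-20 outcome,
# so every paired outcome precedes its intake in this toy case.
print(newest_first)
# Oldest-first pairs each outcome with the intake that preceded it.
print(oldest_first)
```

If the real data contains such open stays, ranking in ascending order (or counting with `groupby(...).cumcount()`) would keep earlier stays aligned and let the inner join drop only the unmatched latest intake.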

Several columns are duplicated between the two initial datasets. These are dropped here, along with a few columns that are no longer needed:

In [18]:
full_df.drop(columns=['date_of_birth','animal_id_out','breed_out','color_out','intake_num','outcome_num','animal_type_out'], inplace=True)
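Before dropping one side of a duplicated pair, it can be worth confirming that the two sides actually agree. The sketch below is a minimal check on made-up rows; `mismatch_counts` is a hypothetical helper, and the `_out`/`_in` suffixes are the ones used in the merge above.

```python
import pandas as pd

# Made-up merged rows: breed agrees on both sides, color differs once.
merged = pd.DataFrame({
    'breed_out': ['Dachshund', 'Bat'],
    'breed_in': ['Dachshund', 'Bat'],
    'color_out': ['Tricolor', 'Brown'],
    'color_in': ['Tricolor', 'Black'],
})


def mismatch_counts(df, stems):
    """Count rows where the _out and _in versions of a column differ.

    Note: missing values compare as unequal, so NaN pairs count as mismatches.
    """
    return {s: int((df[f'{s}_out'] != df[f'{s}_in']).sum()) for s in stems}


counts = mismatch_counts(merged, ['breed', 'color'])
print(counts)
```

A count of zero for a stem suggests either copy can be dropped safely; a nonzero count points at rows to inspect first.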

In [19]:
full_df.columns = [
    'name_out', 'datetime_out', 'outcome_type', 'outcome_subtype',
   'sex_upon_outcome', 'age_upon_outcome', 'is_named_out', 'year_out',
   'month_out', 'day_out', 'prev_adoption', 'prev_transfer',
   'prev_ret_to_owner', 'prev_rto_adopt', 'prev_disposal', 'prev_missing',
   'prev_relocate', 'animal_id_in', 'name_in', 'datetime_in',
   'found_location', 'intake_type', 'intake_condition', 'animal_type',
   'sex_upon_intake', 'age_upon_intake', 'breed', 'color',
   'is_named_in', 'year_in', 'month_in', 'day_in'
]

In [20]:
full_df.head()

tracking_id,name_out,datetime_out,outcome_type,outcome_subtype,sex_upon_outcome,age_upon_outcome,is_named_out,year_out,month_out,day_out,...,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,is_named_in,year_in,month_in,day_in
A006100_1,Scamp,2017-12-07 00:00:00,Return to Owner,Unknown,Neutered Male,1.0,1,2017,12,Thursday,...,Normal,Dog,Neutered Male,1.0,Spinone Italiano Mix,Yellow/White,1,2017,12,Thursday
A006100_2,Scamp,2014-12-20 16:35:00,Return to Owner,Unknown,Neutered Male,7.0,1,2014,12,Saturday,...,Normal,Dog,Neutered Male,7.0,Spinone Italiano Mix,Yellow/White,1,2014,12,Friday
A006100_3,Scamp,2014-03-08 17:10:00,Return to Owner,Unknown,Neutered Male,6.0,1,2014,3,Saturday,...,Normal,Dog,Neutered Male,6.0,Spinone Italiano Mix,Yellow/White,1,2014,3,Friday
A047759_1,Oreo,2014-04-07 15:12:00,Transfer,Partner,Neutered Male,1.0,1,2014,4,Monday,...,Normal,Dog,Neutered Male,1.0,Dachshund,Tricolor,1,2014,4,Wednesday
A134067_1,Bandit,2013-11-16 11:54:00,Return to Owner,Unknown,Neutered Male,1.0,1,2013,11,Saturday,...,Injured,Dog,Neutered Male,1.0,Shetland Sheepdog,Brown/White,1,2013,11,Saturday


In [21]:
full_df.shape

(127912, 32)

Columns are rearranged in the cells below for a more intuitive ordering:

In [22]:
full_df.columns

Index(['name_out', 'datetime_out', 'outcome_type', 'outcome_subtype',
       'sex_upon_outcome', 'age_upon_outcome', 'is_named_out', 'year_out',
       'month_out', 'day_out', 'prev_adoption', 'prev_transfer',
       'prev_ret_to_owner', 'prev_rto_adopt', 'prev_disposal', 'prev_missing',
       'prev_relocate', 'animal_id_in', 'name_in', 'datetime_in',
       'found_location', 'intake_type', 'intake_condition', 'animal_type',
       'sex_upon_intake', 'age_upon_intake', 'breed', 'color', 'is_named_in',
       'year_in', 'month_in', 'day_in'],
      dtype='object')

In [23]:
full_df_test = full_df[[
    'animal_id_in', 'animal_type', 'color', 'breed', 'intake_type', 
    'outcome_type', 'intake_condition', 'outcome_subtype', 'datetime_in', 'datetime_out', 
    'year_in', 'month_in', 'day_in', 'year_out', 'month_out', 
    'day_out', 'prev_adoption', 'prev_transfer', 'prev_ret_to_owner', 'prev_rto_adopt', 
    'prev_disposal', 'prev_missing', 'prev_relocate', 'age_upon_outcome', 'age_upon_intake',
    'sex_upon_intake', 'sex_upon_outcome', 'is_named_in', 'is_named_out', 'found_location',
    'name_in', 'name_out'
]]

In [24]:
full_df_test.shape

(127912, 32)

A handful of entries have outcome datetimes that are earlier than their intake datetimes (possibly because midnight is entered as a default outcome time when the real time is unknown). Below, these entries are dropped.

In [26]:
full_df_test = full_df_test[~(full_df_test['datetime_out'] < full_df_test['datetime_in'])]
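With intake and outcome timestamps side by side, duration of stay follows from a simple subtraction. The sketch below uses two made-up rows patterned on the displays above; the `stay_days` column name is an assumption and does not appear in the notebook.

```python
import pandas as pd

# Two made-up stays; the second has an outcome time earlier than its intake.
stays = pd.DataFrame({
    'datetime_in': pd.to_datetime(['2014-03-07 14:26', '2017-12-07 14:07']),
    'datetime_out': pd.to_datetime(['2014-03-08 17:10', '2017-12-07 00:00']),
})

# Same filter as above: drop rows whose outcome precedes the intake.
valid = stays[~(stays['datetime_out'] < stays['datetime_in'])].copy()

# Length of stay in fractional days (hypothetical column name).
valid['stay_days'] = (
    (valid['datetime_out'] - valid['datetime_in']).dt.total_seconds() / 86400
)
print(valid)
```

Taking `.copy()` after the filter keeps later column assignments from operating on a slice of the original dataframe.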

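Many breed combinations and many color combinations are included in the original datasets, often with words in different orders and with slashes separating distinguishing features. Below, the strings in these columns are lowercased, split into words, and reduced to their unique words, so that the same breed or color written in a different order collapses to a single label. Note that because `set` is applied after `sorted`, the joined words are not in alphabetical order (for example `retriever labrador mix` in the output below), and the exact order can vary between Python sessions. The purpose is to give the later model as few spurious categories as possible.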

In [154]:
full_df_test['breed'] = full_df_test['breed'].map(lambda x: ' '.join(set(sorted(x.replace('/', ' ').lower().split(' ')))))

In [155]:
full_df_test['color'] = full_df_test['color'].map(lambda x: ' '.join(set(sorted(x.replace('/', ' ').lower().split(' ')))))

In [156]:
full_df_test['breed'].value_counts()

domestic shorthair mix                    30788
pit bull mix                               8431
domestic shorthair                         8306
retriever labrador mix                     6907
chihuahua shorthair mix                    6223
                                          ...  
hound spaniel cocker basset                   1
newfoundland                                  1
pomeranian dachshund                          1
golden retriever fox wire hair terrier        1
yorkshire miniature pinscher terrier          1
Name: breed, Length: 2108, dtype: int64

In [157]:
full_df_test['color'].value_counts()

white black              16795
black                    10756
brown tabby               7226
brown white               6714
tan white                 5777
                         ...  
lynx blue point              1
blue gray smoke              1
lynx gray point tabby        1
blue tortie point            1
brown merle cream            1
Name: color, Length: 373, dtype: int64
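If a reproducible, alphabetical label is wanted, deduplicating first and sorting last is one option. This is a minimal sketch with a hypothetical helper `normalize_label`; it is not the code that produced the counts above, and its labels would be ordered differently from those shown.

```python
def normalize_label(text):
    """Lowercase, split on slashes and whitespace, dedupe, then sort the words."""
    # str.split() with no argument also discards empty strings from repeated spaces.
    words = text.replace('/', ' ').lower().split()
    # Sorting after the set() call guarantees the same order in every session.
    return ' '.join(sorted(set(words)))


print(normalize_label('Labrador Retriever Mix'))
print(normalize_label('White/Black'))
print(normalize_label('Black/White'))
```

Applied with `.map(normalize_label)` on the `breed` and `color` columns, this should group rows the same way as the cells above while keeping the written labels stable across runs.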

In [159]:
full_df_test.columns

Index(['animal_id_in', 'animal_type', 'color', 'breed', 'intake_type',
       'outcome_type', 'intake_condition', 'outcome_subtype', 'datetime_in',
       'datetime_out', 'year_in', 'month_in', 'day_in', 'year_out',
       'month_out', 'day_out', 'prev_adoption', 'prev_transfer',
       'prev_ret_to_owner', 'prev_rto_adopt', 'prev_disposal', 'prev_missing',
       'prev_relocate', 'age_upon_outcome', 'age_upon_intake',
       'sex_upon_intake', 'sex_upon_outcome', 'is_named_in', 'is_named_out',
       'found_location', 'name_in', 'name_out'],
      dtype='object')

Finally, the single merged dataset, cleaned and ready for modeling, is written to the `datasets` folder:

In [161]:
full_df_test.to_csv('../../datasets/main.csv', index = False)