# **CS 363M Final Project Spring 2025**

## Chenyi Wang, Bhuvan Kannaeganti, Suyog Valsangkar

### **Overview**

For the project in this class, you will participate in a machine learning competition where you’ll apply your ML skills to a real-world dataset. You may work individually or in teams of up to 3 students. 

The dataset for this competition comes from the Austin Animal Center, the largest no-kill animal shelter in the United States. It contains historical records of animals that have entered the shelter, including details such as species, breed, age, intake type, medical condition, and other attributes. Each animal in the dataset has a recorded outcome, which represents what eventually happened to the animal after entering the shelter.

Your goal in this competition is to build a machine learning model that predicts the final outcome of each animal admitted to the shelter, based on its intake characteristics. The possible outcomes are:

**- Adopted**: The animal was placed into a new home.<br>
**- Return to Owner**: The animal was reclaimed by its original owner.<br>
**- Euthanasia**: The animal was humanely euthanized due to medical or behavioral concerns.<br>
**- Died**: The animal passed away while in the shelter’s care.<br>
**- Transfer**: The animal was moved to another shelter or rescue organization.<br>

By accurately predicting these outcomes, your model can help identify factors that influence an animal's journey through the shelter system and potentially aid in improving adoption and survival rates, shelter policies, or allocation of resources.


## **Code and Analysis Below:**

1. We first go through the dataset and examine the existing features, looking for patterns and for ways to engineer features that improve our final predictions.

In [1965]:
import pandas as pd

animal_data = pd.read_csv('train.csv')
animal_test = pd.read_csv('test.csv')

animal_data.sample(5) # sample some data

Unnamed: 0,Id,Name,Intake Time,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Outcome Time,Date of Birth,Outcome Type
37512,A749656,,05/17/2017 04:30:00 PM,2124 Burton Dr in Austin (TX),Stray,Injured,Cat,Unknown,1 month,Domestic Shorthair Mix,Orange Tabby/White,05/18/2017 10:44:00 AM,03/17/2017,Euthanasia
54169,A876051,*Penelope,03/08/2023 04:27:00 PM,8502 Elroy Road in Travis (TX),Stray,Normal,Dog,Intact Female,2 years,Chihuahua Shorthair/Dachshund,Brown/White,05/08/2023 03:12:00 PM,03/08/2021,Adoption
91452,A791479,,03/26/2019 11:57:00 AM,14906 Zurick Drive in Travis (TX),Stray,Normal,Cat,Intact Female,2 weeks,Domestic Shorthair Mix,Blue Tabby,03/26/2019 01:47:00 PM,03/12/2019,Transfer
40233,A689632,,10/07/2014 12:47:00 PM,13349 Indian Oak Bend in Manor (TX),Stray,Normal,Dog,Intact Male,6 months,Border Terrier Mix,Chocolate/Tan,10/11/2014 07:00:00 PM,04/07/2014,Adoption
51730,A800836,Sugar,07/27/2019 02:25:00 PM,Austin (TX),Owner Surrender,Normal,Dog,Intact Female,2 years,Pit Bull/Labrador Retriever,Brown/White,08/17/2019 04:22:00 PM,07/27/2017,Return to Owner


2. **Data Cleaning**: After observing the initial dataset, we see that some cleaning is needed. The Sex upon Intake feature has to be split, since it currently also encodes sterilization status. The Age upon Intake column mixes several time units (days, weeks, months, years) and has to be converted to a single unit. The Name feature does not exist in the test set, so we must be careful to train the model only on features that the test set also has. The same applies to Id: the training Ids do not correspond to the Ids in the test set. We can therefore drop both Name and Id before training. The cell below also drops Outcome Time and Date of Birth.
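One way to make the "only train on features that exist in the test set" check explicit is to compare the column sets directly. The sketch below uses two tiny made-up frames; the values and the names `train` and `test` are illustrative stand-ins, not the competition files.

```python
import pandas as pd

# toy stand-ins for train.csv / test.csv (illustrative values only)
train = pd.DataFrame({
    "Id": ["A1", "A2"],
    "Name": ["Sugar", None],
    "Intake Type": ["Stray", "Owner Surrender"],
    "Outcome Time": ["05/18/2017", "08/17/2019"],
    "Outcome Type": ["Euthanasia", "Adoption"],
})
test = pd.DataFrame({
    "Id": [1, 2],
    "Intake Type": ["Stray", "Stray"],
})

target = "Outcome Type"

# columns that appear only in the training data (excluding the label)
train_only = [c for c in train.columns if c not in test.columns and c != target]

# columns shared by both frames: candidates for model features
shared = [c for c in train.columns if c in test.columns]

print("train-only columns:", train_only)
print("shared columns:", shared)
```

A shared name is not sufficient on its own: Id appears in both files, but the identifiers are unrelated, so it still has to be dropped by hand.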

In [1966]:
# drop name feature entirely
animal_data.drop("Name", axis=1, inplace=True)

# drop id feature entirely
animal_data.drop("Id", axis=1, inplace=True)

# drop outcome time
animal_data.drop("Outcome Time", axis=1, inplace=True)

# drop birth date
animal_data.drop("Date of Birth", axis=1, inplace=True)

# transform Animal Type into a binary column: Cat -> True, Dog -> False
animal_data['Cat'] = animal_data['Animal Type'].apply(lambda x: True if x == 'Cat' else False)
animal_data.drop('Animal Type', axis=1, inplace=True)

# split "Sex upon Intake" (e.g. "Intact Female") into sterilization status and sex
split_cols = animal_data['Sex upon Intake'].str.split(' ', n=1, expand=True)

# map the first token to sterilization status; "Unknown" and any unexpected
# or missing values are assumed not sterilized
# note: these are the strings 'True'/'False'; they are converted to real booleans in the next cell
animal_data['Sterilized'] = split_cols[0].map({
    'Neutered': 'True',
    'Spayed': 'True',
    'Intact': 'False',
    'Unknown': 'False'
}).fillna('False')

# encode sex as a binary "Male" feature: Female -> False, anything else
# (Male, or a missing second token as in "Unknown") -> True
animal_data['Male'] = split_cols[1].apply(lambda x: False if x == 'Female' else True)
animal_data.drop("Sex upon Intake", axis=1, inplace=True)

# helper to convert age upon intake to years
def age_to_years(age_str):
    if pd.isna(age_str):
        return None
    
    number, unit = age_str.split()
    number = float(number)
    
    if "year" in unit:
        return number
    elif "month" in unit:
        return number / 12
    elif "week" in unit:
        return number / 52
    elif "day" in unit:
        return number / 365
    else:
        return None  # in case of an unexpected format

# convert age to years
animal_data['Age'] = animal_data['Age upon Intake'].apply(age_to_years)
animal_data.drop('Age upon Intake', axis=1, inplace=True)

pd.set_option('display.max_columns', None)
animal_data.sample(5) # sample some data after transformations

Unnamed: 0,Intake Time,Found Location,Intake Type,Intake Condition,Breed,Color,Outcome Type,Cat,Sterilized,Male,Age
35079,02/28/2014 01:57:00 PM,8410 Beech St in Austin (TX),Stray,Normal,Chinese Crested Mix,White,Adoption,False,True,True,11.0
84131,10/12/2022 02:28:00 PM,6900 Mira Loma Ln in Austin (TX),Stray,Normal,Domestic Shorthair,Torbie,Transfer,True,False,False,0.583333
62120,06/12/2021 05:17:00 PM,Gregg Manor Road in Travis (TX),Stray,Other,Domestic Shorthair,White/Orange Tabby,Adoption,True,False,True,0.0
39499,06/08/2019 04:00:00 PM,Austin (TX),Owner Surrender,Normal,Cane Corso Mix,Black/White,Return to Owner,False,False,False,2.0
1075,05/23/2017 12:20:00 PM,Austin (TX),Owner Surrender,Normal,Domestic Shorthair Mix,Orange Tabby,Adoption,True,True,True,6.0
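As a quick sanity check on the age conversion, the same unit logic can be run on a few strings by hand. This is a standalone copy of the helper without the pandas missing-value check, and the sample strings are chosen for illustration.

```python
def age_to_years(age_str):
    """Convert strings such as '2 weeks' or '7 months' to a float number of years."""
    number, unit = age_str.split()
    number = float(number)
    if "year" in unit:
        return number
    if "month" in unit:
        return number / 12
    if "week" in unit:
        return number / 52
    if "day" in unit:
        return number / 365
    return None  # unexpected unit

samples = ["11 years", "7 months", "2 weeks", "2 days"]
converted = {s: age_to_years(s) for s in samples}

for s, years in converted.items():
    print(f"{s:>9} -> {years:.6f}")
```

Here "7 months" maps to about 0.583333 and "11 years" to 11.0, the same kind of Age values that appear in the sampled rows above.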


In [1967]:
# drop birth date
animal_test.drop("Date of Birth", axis=1, inplace=True)

# transform Animal Type into a binary column: Cat -> True, Dog -> False
animal_test['Cat'] = animal_test['Animal Type'].apply(lambda x: True if x == 'Cat' else False)
animal_test.drop('Animal Type', axis=1, inplace=True)

# split "Sex upon Intake" (e.g. "Intact Female") into sterilization status and sex
split_cols = animal_test['Sex upon Intake'].str.split(' ', n=1, expand=True)

# map the first token to sterilization status; "Unknown" and any unexpected
# or missing values are assumed not sterilized
animal_test['Sterilized'] = split_cols[0].map({
    'Neutered': True,
    'Spayed': True,
    'Intact': False,
    'Unknown': False
}).fillna(False)

# the training column was mapped to the strings 'True'/'False' in the previous cell;
# convert it to real booleans so train and test share the same dtype
animal_data['Sterilized'] = animal_data['Sterilized'].map({'True': True, 'False': False})

# encode sex as a binary "Male" feature: Female -> False, anything else
# (Male, or a missing second token as in "Unknown") -> True
animal_test['Male'] = split_cols[1].apply(lambda x: False if x == 'Female' else True)
animal_test.drop("Sex upon Intake", axis=1, inplace=True)

# reuse the age_to_years helper defined in the previous cell

# convert age to years
animal_test['Age'] = animal_test['Age upon Intake'].apply(age_to_years)
animal_test['Age'] = animal_test['Age'].fillna(0)  # fill NaN with 0
animal_test.drop('Age upon Intake', axis=1, inplace=True)

pd.set_option('display.max_columns', None)
animal_test.sample(5) # sample some data after transformations

Unnamed: 0,Id,Intake Time,Found Location,Intake Type,Intake Condition,Breed,Color,Cat,Sterilized,Male,Age
20081,20082,6/3/14 11:00,S 2Nd St And Cardinal Lane in Austin (TX),Stray,Normal,Domestic Shorthair Mix,Brown Tabby,True,False,False,0.083333
16368,16369,5/9/24 16:59,6709 Wentworth Drive in Austin (TX),Stray,Normal,Domestic Shorthair,Cream Tabby,True,False,True,0.038462
11962,11963,8/18/20 10:44,Austin (TX),Owner Surrender,Normal,Pit Bull,White,False,True,True,2.0
17888,17889,10/29/13 13:26,Cameron And 290 in Austin (TX),Stray,Normal,Labrador Retriever Mix,Black/White,False,False,True,0.666667
12276,12277,2/20/19 13:03,2700 Barton Creek Boulevard in Austin (TX),Stray,Normal,Domestic Shorthair Mix,Black/White,True,False,True,6.0


3. **Feature Engineering**: We need to one-hot encode the categorical features. However, some categoricals have many labels, several with very few data points. We can compare the outcome distributions of the labels and group similar ones to reduce dimensionality.

Let's look at Intake Condition and list all of its labels.

In [1968]:
# Count the occurrences of each intake condition label
intake_condition_counts = animal_data['Intake Condition'].value_counts()

# Print the counts for each intake condition
for condition, count in intake_condition_counts.items():
    print(f"{condition}: {count}")

Normal: 95010
Injured: 6394
Sick: 4295
Nursing: 2957
Neonatal: 1240
Aged: 373
Medical: 298
Other: 247
Pregnant: 111
Feral: 104
Med Attn: 48
Behavior: 42
Unknown: 12
Neurologic: 10
Med Urgent: 7
Parvo: 5
Space: 2
Agonal: 1
Congenital: 1


We see above that Intake Condition has 19 distinct values, so we can merge some of the rarer labels together.

Let's explore the outcome percentages for each intake condition so we can group the conditions sensibly and reduce the number of labels we need to one-hot encode (each extra label adds a dimension).

In [1969]:
# Iterate through each unique intake condition and calculate outcome percentages
intake_conditions = animal_data['Intake Condition'].unique()

for condition in intake_conditions:
    condition_data = animal_data[animal_data['Intake Condition'] == condition]
    total_count = len(condition_data)
    if total_count > 0:
        outcome_percentages = condition_data['Outcome Type'].value_counts(normalize=True) * 100
        print(f"Intake Condition: {condition}")
        print(outcome_percentages)
        print("-" * 50)
    else:
        print(f"Intake Condition: {condition} has no entries.")
        print("-" * 50)

Intake Condition: Normal
Outcome Type
Adoption           52.596569
Transfer           29.360067
Return to Owner    15.985686
Euthanasia          1.513525
Died                0.544153
Name: proportion, dtype: float64
--------------------------------------------------
Intake Condition: Injured
Outcome Type
Adoption           34.813888
Transfer           30.669378
Euthanasia         18.720676
Return to Owner    12.605568
Died                3.190491
Name: proportion, dtype: float64
--------------------------------------------------
Intake Condition: Pregnant
Outcome Type
Transfer           54.954955
Adoption           39.639640
Return to Owner     4.504505
Died                0.900901
Name: proportion, dtype: float64
--------------------------------------------------
Intake Condition: Neonatal
Outcome Type
Transfer           69.919355
Adoption           25.564516
Died                3.064516
Return to Owner     0.967742
Euthanasia          0.483871
Name: proportion, dtype: float64
-------
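The per-condition loop can also be written as a single cross-tabulation, which puts every condition side by side and makes similar rows easier to spot. A minimal sketch on made-up rows follows; the toy frame `df` is an assumption, and on the real data it would be `animal_data`.

```python
import pandas as pd

# made-up rows, not the shelter data
df = pd.DataFrame({
    "Intake Condition": ["Normal", "Normal", "Normal", "Normal",
                         "Injured", "Injured", "Neonatal", "Neonatal"],
    "Outcome Type": ["Adoption", "Adoption", "Transfer", "Return to Owner",
                     "Euthanasia", "Adoption", "Transfer", "Transfer"],
})

# rows: intake condition, columns: outcome, values: row-wise percentages
outcome_pct = pd.crosstab(
    df["Intake Condition"], df["Outcome Type"], normalize="index"
) * 100

print(outcome_pct.round(1))
```

Each row sums to 100, so two conditions with similar rows are candidates for the same group.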

Groupings based on similar outcome percentages

In [1970]:
# [Group]       [Categories]
# Normal        Normal
# Neonatal      Neonatal, Nursing
# Med_Minor     Injured, Medical (more similar adoption %)
# Med_Major     Med Attn, Med Urgent, Neurologic, Pregnant, Sick (more similar transfer %)
# Behavioral    Feral, Behavior
# Rare          Agonal, Aged, Congenital, Parvo, Space, Other, Unknown (fallback for unmapped labels)

We see that the labels merged into each group have relatively similar outcome percentages. This reduces dimensionality and should help training without losing the distinctions between the groups.

In [1971]:
# intake condition into grouped categories
condition_map = {
    'Normal': 'Normal',
    'Injured': 'Med_Minor',
    'Sick': 'Med_Major',
    'Nursing': 'Neonatal',
    'Neonatal': 'Neonatal',
    'Med Attn': 'Med_Major',
    'Med Urgent': 'Med_Major',
    'Medical': 'Med_Minor',
    'Neurologic': 'Med_Major',
    'Pregnant': 'Med_Major',
    'Feral': 'Behavioral',
    'Behavior': 'Behavioral',
}

# map with a fallback to 'Rare'
animal_data['Condition'] = animal_data['Intake Condition'].map(condition_map).fillna('Rare')
animal_data = pd.get_dummies(animal_data, columns=['Condition'])
animal_data.drop('Intake Condition', axis=1, inplace=True)

animal_test['Condition'] = animal_test['Intake Condition'].map(condition_map).fillna('Rare')
animal_test = pd.get_dummies(animal_test, columns=['Condition'])
animal_test.drop('Intake Condition', axis=1, inplace=True)

animal_data.sample(5)  # sample some data after transformation

Unnamed: 0,Intake Time,Found Location,Intake Type,Breed,Color,Outcome Type,Cat,Sterilized,Male,Age,Condition_Behavioral,Condition_Med_Major,Condition_Med_Minor,Condition_Neonatal,Condition_Normal,Condition_Rare
94590,12/11/2020 12:26:00 PM,Austin (TX),Owner Surrender,Domestic Shorthair,Black/White,Adoption,True,False,False,0.5,False,False,False,False,True,False
22487,10/25/2021 12:06:00 PM,Austin (TX),Owner Surrender,Domestic Shorthair,White/Brown Tabby,Adoption,True,False,False,0.166667,False,False,False,False,True,False
64301,08/09/2023 04:05:00 PM,3702 Balcones Dr in Austin (TX),Stray,Domestic Shorthair,Calico,Transfer,True,False,False,0.038462,False,False,False,False,True,False
69578,09/11/2022 08:00:00 AM,William Cannon And Ih 35 in Austin (TX),Stray,Domestic Shorthair,Orange Tabby,Transfer,True,False,True,0.005479,False,False,False,False,False,True
32218,06/19/2019 12:48:00 PM,3508 Breckenridge Drive in Austin (TX),Stray,American Bulldog Mix,Tan/White,Return to Owner,False,True,True,7.0,False,False,True,False,False,False
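Because `pd.get_dummies` is called on the training and test frames separately, a group that happens to be missing from one of them would leave the two frames with different columns. Below is a minimal sketch of one way to guard against that, using toy data and a shortened mapping (both are assumptions for illustration).

```python
import pandas as pd

# shortened, illustrative version of the grouping map
condition_map = {"Normal": "Normal", "Injured": "Med_Minor", "Feral": "Behavioral"}

train = pd.DataFrame({"Intake Condition": ["Normal", "Injured", "Feral", "Agonal"]})
test = pd.DataFrame({"Intake Condition": ["Normal", "Injured"]})

def encode_condition(frame):
    # unmapped labels fall back to 'Rare'
    grouped = frame["Intake Condition"].map(condition_map).fillna("Rare")
    return pd.get_dummies(grouped, prefix="Condition")

train_enc = encode_condition(train)
test_enc = encode_condition(test)

# give the test frame exactly the training columns, filling absent ones with False
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=False)

print(list(train_enc.columns))
print(test_enc)
```

In this toy case the test frame has no Behavioral or Rare rows, so without the `reindex` step it would be missing two of the four columns the model was trained on.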


Let's do the same for Intake Type and check the outcome percentages of each intake type.

In [1972]:
# observe outcome percentages for each intake type so we can group these categoricals, reduces dimensionality
intake_types = animal_data['Intake Type'].unique()

for intake_type in intake_types:
    type_data = animal_data[animal_data['Intake Type'] == intake_type]
    total_count = len(type_data)
    if total_count > 0:
        outcome_percentages = type_data['Outcome Type'].value_counts(normalize=True) * 100
        print(f"Intake Type: {intake_type}")
        print(outcome_percentages)
        print("-" * 50)
    else:
        print(f"Intake Type: {intake_type} has no entries.")
        print("-" * 50)

Intake Type: Stray
Outcome Type
Adoption           47.774546
Transfer           34.219783
Return to Owner    13.905286
Euthanasia          3.056120
Died                1.044266
Name: proportion, dtype: float64
--------------------------------------------------
Intake Type: Public Assist
Outcome Type
Return to Owner    63.545914
Adoption           18.528788
Transfer           14.222802
Euthanasia          3.262111
Died                0.440385
Name: proportion, dtype: float64
--------------------------------------------------
Intake Type: Owner Surrender
Outcome Type
Adoption           64.671050
Transfer           26.502541
Return to Owner     5.399357
Euthanasia          2.736980
Died                0.690073
Name: proportion, dtype: float64
--------------------------------------------------
Intake Type: Abandoned
Outcome Type
Adoption           63.017032
Transfer           26.845093
Return to Owner     9.083536
Euthanasia          0.648824
Died                0.405515
Name: proportion, 

These would be the groupings that are most similar to each other based on the outcome percentages.

In [1973]:
# [Group]          [Categories]
# Public Assist    Public Assist
# Stray            Stray, Wildlife
# Owner Initiated  Abandoned, Owner Surrender
# Euthanasia       Euthanasia Request

In [1974]:
# Define the mapping for grouping intake types
intake_type_map = {
    'Stray': 'Stray',
    'Public Assist': 'Public Assist',
    'Wildlife': 'Stray',
    'Abandoned': 'Owner Initiated',
    'Owner Surrender': 'Owner Initiated',
    'Euthanasia Request': 'Euthanasia'
}

# Map the intake types to their respective groups
animal_data['Intake'] = animal_data['Intake Type'].map(intake_type_map).fillna('Other')
animal_data = pd.get_dummies(animal_data, columns=['Intake'])
animal_data.drop('Intake Type', axis=1, inplace=True)

animal_test['Intake'] = animal_test['Intake Type'].map(intake_type_map).fillna('Other')
animal_test = pd.get_dummies(animal_test, columns=['Intake'])
animal_test.drop('Intake Type', axis=1, inplace=True)

animal_data.sample(5)  # sample some data after transformation


Unnamed: 0,Intake Time,Found Location,Breed,Color,Outcome Type,Cat,Sterilized,Male,Age,Condition_Behavioral,Condition_Med_Major,Condition_Med_Minor,Condition_Neonatal,Condition_Normal,Condition_Rare,Intake_Euthanasia,Intake_Owner Initiated,Intake_Public Assist,Intake_Stray
78154,03/31/2020 10:45:00 AM,Decker Lake & Fm 973 in Travis (TX),Queensland Heeler Mix,White/Black,Adoption,False,False,False,0.083333,False,False,False,False,True,False,False,False,False,True
37020,09/03/2018 12:14:00 PM,Austin (TX),Domestic Shorthair Mix,Brown Tabby,Transfer,True,True,False,5.0,False,False,False,False,True,False,False,True,False,False
99943,07/18/2017 02:25:00 PM,2204 Curtis Avenue in Austin (TX),Domestic Shorthair Mix,Black/White,Transfer,True,False,True,1.0,False,False,False,False,True,False,False,False,False,True
75899,06/30/2014 01:05:00 PM,1301 Crossing Place in Austin (TX),Labrador Retriever Mix,Chocolate/White,Adoption,False,False,False,0.057692,False,False,False,False,True,False,False,False,False,True
69391,12/19/2020 10:23:00 AM,Austin (TX),Lhasa Apso Mix,White,Return to Owner,False,False,True,2.0,False,False,False,False,True,False,False,True,False,False


We still need to find a way to encode:
- Intake Time
- Found Location
- Breed
- Color

Do these features affect the outcome at all, and is there any correlation? Are some colors or breeds more or less desirable than others, and do they therefore change adoption rates?
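One low-dimensional way to start answering these questions is to derive a few simple columns instead of one-hot encoding every distinct value. The sketch below is an assumption about what such features could look like; the column names `Intake Month`, `Intake Hour`, `Is Mix`, and `Num Colors` are hypothetical, and the rows are made up.

```python
import pandas as pd

# made-up rows in the same shape as the training columns
df = pd.DataFrame({
    "Intake Time": ["05/17/2017 04:30:00 PM", "03/08/2023 04:27:00 PM",
                    "10/07/2014 12:47:00 PM"],
    "Breed": ["Domestic Shorthair Mix", "Chihuahua Shorthair/Dachshund", "Pit Bull"],
    "Color": ["Orange Tabby/White", "Black", "White"],
})

# parse the timestamp using the format seen in the training file
intake = pd.to_datetime(df["Intake Time"], format="%m/%d/%Y %I:%M:%S %p")
df["Intake Month"] = intake.dt.month
df["Intake Hour"] = intake.dt.hour

# flag mixed breeds: either labelled "Mix" or listed as "A/B"
df["Is Mix"] = df["Breed"].str.contains("mix", case=False) | df["Breed"].str.contains("/")

# number of colors listed, e.g. "Orange Tabby/White" -> 2
df["Num Colors"] = df["Color"].str.count("/") + 1

print(df[["Intake Month", "Intake Hour", "Is Mix", "Num Colors"]])
```

The test file's timestamps appear in a different format (for example `6/3/14 11:00`), so the `format` string would need to change there.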

In [1975]:

# Filter data for breeds with "mix"
mix_breeds = animal_data[animal_data['Breed'].str.contains('mix', case=False, na=False)]

# Filter data for breeds without "mix"
non_mix_breeds = animal_data[~animal_data['Breed'].str.contains('mix', case=False, na=False)]

# Calculate percentages of Outcome Type for breeds with "mix"
mix_outcome_percentages = mix_breeds['Outcome Type'].value_counts(normalize=True) * 100

# Calculate percentages of Outcome Type for breeds without "mix"
non_mix_outcome_percentages = non_mix_breeds['Outcome Type'].value_counts(normalize=True) * 100

# Top 20 breeds with "mix" and their frequencies
top_mix_breeds = mix_breeds['Breed'].value_counts().head(20)

# Top 20 breeds without "mix" and their frequencies
top_non_mix_breeds = non_mix_breeds['Breed'].value_counts().head(20)

# Print the results
print("Top 20 breeds with 'mix' and their frequencies:")
print(top_mix_breeds)

print("\nTop 20 breeds without 'mix' and their frequencies:")
print(top_non_mix_breeds)

# Print the results
print("Outcome Type percentages for breeds with 'mix':")
print(mix_outcome_percentages)
print("\nOutcome Type percentages for breeds without 'mix':")
print(non_mix_outcome_percentages)

# Get the top 3 mixed breeds
top_3_mixed_breeds = mix_breeds['Breed'].value_counts().head(3).index

# Get the top 3 purebred breeds
top_3_purebred_breeds = non_mix_breeds['Breed'].value_counts().head(3).index

# Calculate and print outcome percentages for the top 3 mixed breeds
print("Outcome percentages for top 3 mixed breeds:")
for breed in top_3_mixed_breeds:
    breed_data = mix_breeds[mix_breeds['Breed'] == breed]
    outcome_percentages = breed_data['Outcome Type'].value_counts(normalize=True) * 100
    print(f"\nBreed: {breed}")
    print(outcome_percentages)

# Calculate and print outcome percentages for the top 3 purebred breeds
print("\nOutcome percentages for top 3 purebred breeds:")
for breed in top_3_purebred_breeds:
    breed_data = non_mix_breeds[non_mix_breeds['Breed'] == breed]
    outcome_percentages = breed_data['Outcome Type'].value_counts(normalize=True) * 100
    print(f"\nBreed: {breed}")
    print(outcome_percentages)

Top 20 breeds with 'mix' and their frequencies:
Breed
Domestic Shorthair Mix       25361
Pit Bull Mix                  6042
Labrador Retriever Mix        5654
Chihuahua Shorthair Mix       4896
German Shepherd Mix           2637
Domestic Medium Hair Mix      2564
Australian Cattle Dog Mix     1337
Domestic Longhair Mix         1254
Siamese Mix                   1106
Dachshund Mix                  852
Border Collie Mix              769
Boxer Mix                      749
Miniature Poodle Mix           646
Siberian Husky Mix             613
Australian Shepherd Mix        612
Catahoula Mix                  544
Yorkshire Terrier Mix          523
Great Pyrenees Mix             515
Miniature Schnauzer Mix        478
Rat Terrier Mix                475
Name: count, dtype: int64

Top 20 breeds without 'mix' and their frequencies:
Breed
Domestic Shorthair                    16046
Pit Bull                               2117
Domestic Medium Hair                   1436
Chihuahua Shorthair           

Let's try taking these four features out. From a surface-level look at the data, they do not seem to be very relevant to the final outcome.

Let's try building a KNN classifier with the remaining features. We can try techniques we have learned to find a good k-value.
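A minimal sketch of one way to pick k, assuming 5-fold cross-validation over a handful of candidate values. The data here is synthetic (generated with `make_classification`) and only stands in for the cleaned shelter features; the candidate list is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the cleaned feature matrix (not the shelter data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

candidate_ks = [1, 3, 5, 9, 15, 25]
mean_scores = {}
for k in candidate_ks:
    # scale features first: KNN distances are sensitive to feature ranges
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores[k] = scores.mean()

# keep the k with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print("mean CV accuracy per k:", {k: round(v, 3) for k, v in mean_scores.items()})
print("best k:", best_k)
```

Scaling matters for this dataset in particular, since Age is a continuous value in years while the remaining features are booleans.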

In [1976]:
animal_data.drop('Intake Time', axis=1, inplace=True)
animal_data.drop('Found Location', axis=1, inplace=True)
animal_data.drop('Breed', axis=1, inplace=True)
animal_data.drop('Color', axis=1, inplace=True)

animal_test.drop('Intake Time', axis=1, inplace=True)
animal_test.drop('Found Location', axis=1, inplace=True)
animal_test.drop('Breed', axis=1, inplace=True)
animal_test.drop('Color', axis=1, inplace=True)

animal_data.sample(5)

Unnamed: 0,Outcome Type,Cat,Sterilized,Male,Age,Condition_Behavioral,Condition_Med_Major,Condition_Med_Minor,Condition_Neonatal,Condition_Normal,Condition_Rare,Intake_Euthanasia,Intake_Owner Initiated,Intake_Public Assist,Intake_Stray
102611,Adoption,False,False,False,0.416667,False,False,False,False,True,False,False,False,False,True
32722,Return to Owner,False,False,True,0.166667,False,True,False,False,False,False,False,False,False,True
44809,Adoption,True,False,False,3.0,False,False,False,False,True,False,False,True,False,False
103669,Adoption,False,True,True,4.0,False,False,False,False,True,False,False,False,False,True
107710,Adoption,True,False,True,0.083333,False,False,False,False,True,False,False,False,False,True


In [1977]:
# Print the type of each feature in the training set
print("Feature types in the training set:")
print(animal_data.dtypes)

# Print the type of each feature in the test set
print("Feature types in the test set:")
print(animal_test.dtypes)

Feature types in the training set:
Outcome Type               object
Cat                          bool
Sterilized                   bool
Male                         bool
Age                       float64
Condition_Behavioral         bool
Condition_Med_Major          bool
Condition_Med_Minor          bool
Condition_Neonatal           bool
Condition_Normal             bool
Condition_Rare               bool
Intake_Euthanasia            bool
Intake_Owner Initiated       bool
Intake_Public Assist         bool
Intake_Stray                 bool
dtype: object
Feature types in the test set:
Id                          int64
Cat                          bool
Sterilized                   bool
Male                         bool
Age                       float64
Condition_Behavioral         bool
Condition_Med_Major          bool
Condition_Med_Minor          bool
Condition_Neonatal           bool
Condition_Normal             bool
Condition_Rare               bool
Intake_Euthanasia            bool
In

In [1978]:
# Print the frequency of each Outcome Type
outcome_frequencies = animal_data['Outcome Type'].value_counts() / len(animal_data) * 100
print("Frequency of each Outcome Type:")
print(outcome_frequencies)

Frequency of each Outcome Type:
Outcome Type
Adoption           49.519149
Transfer           31.508587
Return to Owner    14.932933
Euthanasia          3.102819
Died                0.936513
Name: count, dtype: float64


Let's try training a histogram-based gradient boosting model, which bins the feature values and builds an ensemble of decision trees. The scikit-learn library provides HistGradientBoostingClassifier, which we can use to implement this approach. By combining the ensemble with per-class sample weights, we can better balance predictions and mitigate the class imbalance.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score
from collections import Counter

# === 1. Load your dataset (replace this with actual loading) ===
# animal_data = pd.read_csv("animal_data.csv")  # or load however needed

# === 2. Separate labels and features ===
y = animal_data["Outcome Type"]
X = animal_data.drop(columns=["Outcome Type"])
X = X.fillna(0)

# === 3. Encode class labels ===
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# === 4. Train/Test split ===
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# === 5. Train classifier ===
# Assume you have label-encoded classes
class_counts = Counter(y_encoded)
total = sum(class_counts.values())

class_freqs = {cls: count / total for cls, count in class_counts.items()}
class_to_index = {label: i for i, label in enumerate(label_encoder.classes_)}

# we can play with these class weights to better balance the model and reduce the effects of class imbalance
class_weights = {
    class_to_index['Died']: 5.0,
    class_to_index['Euthanasia']: 4.0,
    class_to_index['Return to Owner']: 1.5,
    class_to_index['Transfer']: 1.0,
    class_to_index['Adoption']: 0.8
}

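# per-row weights for the training split only
sample_weight = np.array([class_weights[label] for label in y_train])

clf = HistGradientBoostingClassifier(
    max_iter=200,
    max_depth=15,
    random_state=42
)

# fit on the training split only so X_test stays held out; fitting on all of X
# would also train on the evaluation rows and inflate the scores
# note: the scores recorded below came from a run fitted on all of X, so treat them as optimistic
clf.fit(X_train, y_train, sample_weight=sample_weight)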

# === 6. Predict and evaluate ===
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))


Accuracy: 0.6038143216984527
                 precision    recall  f1-score   support

       Adoption       0.66      0.74      0.70     10981
           Died       0.22      0.08      0.12       201
     Euthanasia       0.26      0.52      0.35       731
Return to Owner       0.52      0.63      0.57      3309
       Transfer       0.65      0.40      0.50      7010

       accuracy                           0.60     22232
      macro avg       0.46      0.47      0.45     22232
   weighted avg       0.62      0.60      0.60     22232



Hyperparameter tuning (grid search over a random forest, currently commented out):

In [1980]:
# from sklearn.model_selection import GridSearchCV
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.preprocessing import LabelEncoder
# import pandas as pd
# import numpy as np

# # === 1. Prepare training data ===
# y = animal_data["Outcome Type"]
# X = animal_data.drop(columns=["Outcome Type"])
# X = X.replace([np.inf, -np.inf], np.nan).fillna(0)

# # === 2. Encode class labels ===
# label_encoder = LabelEncoder()
# y_encoded = label_encoder.fit_transform(y)

# # === 3. Define parameter grid
# param_grid = {
#     'n_estimators': [50, 100, 150, 200],
#     'max_depth': [5, 10, 15, 20],
#     'min_samples_split': [2, 3, 4, 5],
#     'min_samples_leaf': [1, 2, 3, 4, 5]
# }

# # === 4. Setup grid search with 5-fold CV
# grid_search = GridSearchCV(
#     estimator=RandomForestClassifier(random_state=42, class_weight='balanced'),
#     param_grid=param_grid,
#     cv=5,
#     scoring='accuracy',
#     n_jobs=-1,
#     verbose=2
# )

# # === 5. Run the grid search
# grid_search.fit(X, y_encoded)

# # === 6. View best params and score
# print("Best Parameters:", grid_search.best_params_)
# print("Best Cross-Validated Accuracy:", grid_search.best_score_)


Now let's train the final model on the entire training set, predict on the test set, and write the final predictions to a CSV file.

In [None]:
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import HistGradientBoostingClassifier

# === 1. Prepare training data ===
y = animal_data["Outcome Type"]
X = animal_data.drop(columns=["Outcome Type"])
X = X.replace([np.inf, -np.inf], np.nan).fillna(0)

# === 2. Prepare test data ===
X_test_final = animal_test.replace([np.inf, -np.inf], np.nan).fillna(0)
X_test_final.drop("Id", axis=1, inplace=True)

# === 3. Encode class labels ===
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# === 4. Train on full dataset ===
class_counts = Counter(y_encoded)
total = sum(class_counts.values())
class_freqs = {cls: count / total for cls, count in class_counts.items()}
class_to_index = {label: i for i, label in enumerate(label_encoder.classes_)}
class_weights = {
    class_to_index['Died']: 5.0,
    class_to_index['Euthanasia']: 4.0,
    class_to_index['Return to Owner']: 1.5,
    class_to_index['Transfer']: 1.0,
    class_to_index['Adoption']: 0.8
}

sample_weight = np.array([class_weights[label] for label in y_encoded])

clf = HistGradientBoostingClassifier(
    max_iter=200,
    max_depth=15,
    random_state=42
)

clf.fit(X, y_encoded, sample_weight=sample_weight)

# === 5. Predict on test set ===
y_test_pred_encoded = clf.predict(X_test_final)
y_test_pred = label_encoder.inverse_transform(y_test_pred_encoded)

# === 6. Save predictions to CSV ===
submission = pd.DataFrame({
    "Id": animal_test["Id"],  # Ids provided with the test set
    "Outcome Type": y_test_pred
})
submission.to_csv("animal_test_predictions.csv", index=False)

# === 7. Show prediction distribution as percentages ===
prediction_distribution = submission["Outcome Type"].value_counts(normalize=True) * 100
print("Prediction Distribution (%):")
print(prediction_distribution.round(2).to_string())

# These distribution percentages match closer to the training set's outcome percentages.

Prediction Distribution (%):
Outcome Type
Adoption           56.03
Transfer           19.34
Return to Owner    17.75
Euthanasia          6.55
Died                0.33
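To check how closely the predicted distribution tracks the training distribution, the two sets of percentages printed in this notebook can be compared directly. The sketch below hard-codes those figures, rounded to two decimals, and assumes the test set's true outcome mix resembles the training set's.

```python
# outcome percentages in the training set (rounded from the frequencies printed earlier)
train_pct = {
    "Adoption": 49.52, "Transfer": 31.51, "Return to Owner": 14.93,
    "Euthanasia": 3.10, "Died": 0.94,
}
# predicted percentages on the test set (from the output above)
pred_pct = {
    "Adoption": 56.03, "Transfer": 19.34, "Return to Owner": 17.75,
    "Euthanasia": 6.55, "Died": 0.33,
}

# difference in percentage points: positive means over-predicted relative to training
diff = {cls: round(pred_pct[cls] - train_pct[cls], 2) for cls in train_pct}

# list the classes with the largest gaps first
for cls, delta in sorted(diff.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{cls:<16} {delta:+.2f}")
```

The largest gap is Transfer, predicted about 12 percentage points less often than it occurs in training, while Euthanasia is predicted roughly twice as often as it occurs.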


Work in progress: would training separate models for cats and dogs help?

In [1983]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score
from collections import Counter

# === 1. Prepare features and labels ===
y = animal_data["Outcome Type"]
X = animal_data.drop(columns=["Outcome Type"])
X = X.fillna(0)

# === 2. Encode labels ===
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# === 3. Split by species
cat_mask = X["Cat"] == True
dog_mask = X["Cat"] == False

X_cat = X[cat_mask]
y_cat = y_encoded[cat_mask]

X_dog = X[dog_mask]
y_dog = y_encoded[dog_mask]

# === 4. Train/Test split for both sets
X_cat_train, X_cat_test, y_cat_train, y_cat_test = train_test_split(X_cat, y_cat, test_size=0.2, random_state=42)
X_dog_train, X_dog_test, y_dog_train, y_dog_test = train_test_split(X_dog, y_dog, test_size=0.2, random_state=42)

# === 5. Define class weights (same logic for both models)
class_to_index = {label: i for i, label in enumerate(label_encoder.classes_)}
class_weights = {
    class_to_index['Died']: 5.0,
    class_to_index['Euthanasia']: 4.0,
    class_to_index['Return to Owner']: 1.5,
    class_to_index['Transfer']: 1.0,
    class_to_index['Adoption']: 0.8
}

# === 6. Create sample weights
sample_weight_cat = np.array([class_weights[label] for label in y_cat_train])
sample_weight_dog = np.array([class_weights[label] for label in y_dog_train])

# === 7. Train models
clf_cat = HistGradientBoostingClassifier(max_iter=200, max_depth=15, random_state=42)
clf_dog = HistGradientBoostingClassifier(max_iter=200, max_depth=15, random_state=42)

clf_cat.fit(X_cat_train, y_cat_train, sample_weight=sample_weight_cat)
clf_dog.fit(X_dog_train, y_dog_train, sample_weight=sample_weight_dog)

# === 8. Predict & Evaluate
y_cat_pred = clf_cat.predict(X_cat_test)
y_dog_pred = clf_dog.predict(X_dog_test)

print("=== 🐱 Cat Model Evaluation ===")
print("Accuracy:", accuracy_score(y_cat_test, y_cat_pred))
print(classification_report(y_cat_test, y_cat_pred, target_names=label_encoder.classes_))

print("=== 🐶 Dog Model Evaluation ===")
print("Accuracy:", accuracy_score(y_dog_test, y_dog_pred))
print(classification_report(y_dog_test, y_dog_pred, target_names=label_encoder.classes_))


=== 🐱 Cat Model Evaluation ===
Accuracy: 0.6311771795901968
                 precision    recall  f1-score   support

       Adoption       0.71      0.70      0.70      4922
           Died       0.08      0.03      0.05       124
     Euthanasia       0.26      0.61      0.36       399
Return to Owner       0.39      0.44      0.41       440
       Transfer       0.66      0.59      0.63      4071

       accuracy                           0.63      9956
      macro avg       0.42      0.47      0.43      9956
   weighted avg       0.65      0.63      0.64      9956

=== 🐶 Dog Model Evaluation ===
Accuracy: 0.5839850114043662
                 precision    recall  f1-score   support

       Adoption       0.61      0.79      0.69      5977
           Died       0.05      0.03      0.04        65
     Euthanasia       0.20      0.28      0.24       294
Return to Owner       0.58      0.65      0.61      2974
       Transfer       0.52      0.14      0.22      2966

       accuracy     