# **CS 363M Final Project Spring 2025**

## Chenyi Wang, Bhuvan Kannaeganti, Suyog Valsangkar

### **Overview**

For the project in this class, you will participate in a machine learning competition where you’ll apply your ML skills to a real-world dataset. You may work individually or in teams of up to 3 students. 

The dataset for this competition comes from the Austin Animal Center, the largest no-kill animal shelter in the United States. It contains historical records of animals that have entered the shelter, including details such as species, breed, age, intake type, medical condition, and other attributes. Each animal in the dataset has a recorded outcome, which represents what eventually happened to the animal after entering the shelter.

Your goal in this competition is to build a machine learning model that predicts the final outcome of each animal admitted to the shelter, based on its intake characteristics. The possible outcomes are:

- **Adopted**: The animal was placed into a new home.
- **Return to Owner**: The animal was reclaimed by its original owner.
- **Euthanasia**: The animal was humanely euthanized due to medical or behavioral concerns.
- **Died**: The animal passed away while in the shelter’s care.
- **Transfer**: The animal was moved to another shelter or rescue organization.

By accurately predicting these outcomes, your model can help identify factors that influence an animal's journey through the shelter system and potentially aid in improving adoption and survival rates, shelter policies, or allocation of resources.


## **Code and Analysis Below:**

## 1. Import Datasets

Import the training and test sets from their respective CSV files.

In [913]:
import pandas as pd

animal_data = pd.read_csv('train.csv')
animal_test = pd.read_csv('test.csv')

pd.set_option('display.max_columns', None) # show all columns
animal_data.sample(5) # sample some data

Unnamed: 0,Id,Name,Intake Time,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Outcome Time,Date of Birth,Outcome Type
88487,A700010,,04/06/2015 12:30:00 PM,2205 Voyageurs in Austin (TX),Stray,Nursing,Cat,Intact Male,3 weeks,Domestic Medium Hair Mix,Blue Tabby,04/06/2015 02:45:00 PM,03/15/2015,Transfer
43506,A822423,,09/02/2020 10:33:00 AM,Austin (TX),Owner Surrender,Normal,Dog,Intact Male,1 month,Labrador Retriever,Brown,05/18/2021 10:17:00 AM,07/26/2020,Adoption
80947,A662558,Shenron,05/03/2017 04:36:00 PM,E 12Th And Airport in Austin (TX),Stray,Normal,Dog,Neutered Male,5 years,Dutch Shepherd Mix,Black/Brown,05/03/2017 05:14:00 PM,03/05/2012,Return to Owner
47986,A750904,*Sonic,06/02/2017 09:20:00 AM,7707 Ih 35 in Austin (TX),Stray,Normal,Dog,Intact Male,2 years,Pit Bull Mix,Brown Brindle/White,06/08/2017 06:04:00 PM,06/02/2015,Return to Owner
52311,A694191,,12/22/2014 11:48:00 AM,Brodie Lane And Alexandria Drive in Austin (TX),Stray,Normal,Dog,Intact Male,8 months,Shiba Inu Mix,Tan,12/27/2014 12:00:00 AM,04/22/2014,Transfer


## 2. Data Cleaning

Sampling the data shows that the training set contains features that are absent from the test set (Name and Outcome Time). We cannot train on information that will not be available at prediction time, so we drop them. Id and Date of Birth do exist in the test set, but Id is misaligned between the two files and Date of Birth is irrelevant to training, so we drop those from the training set as well.

In [914]:
# drop name feature entirely
animal_data.drop("Name", axis=1, inplace=True)

# drop id feature entirely
animal_data.drop("Id", axis=1, inplace=True)

# drop outcome time
animal_data.drop("Outcome Time", axis=1, inplace=True)

# drop birth date
animal_data.drop("Date of Birth", axis=1, inplace=True)

# verify dropped features
animal_data.sample(5)

Unnamed: 0,Intake Time,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Outcome Type
46101,01/11/2017 11:03:00 AM,Austin (TX),Owner Surrender,Normal,Dog,Spayed Female,2 years,Carolina Dog Mix,Sable,Adoption
41214,02/13/2017 11:21:00 AM,1701 Webberwood Dr in Travis (TX),Stray,Normal,Dog,Intact Female,3 years,Labrador Retriever Mix,Black,Adoption
25378,04/29/2014 05:30:00 AM,2100 Barton Springs Rd in Austin (TX),Stray,Normal,Dog,Intact Male,2 years,Greater Swiss Mountain Dog Mix,Tricolor,Adoption
2427,06/16/2020 03:40:00 PM,Austin (TX),Owner Surrender,Normal,Cat,Spayed Female,4 years,Domestic Shorthair,Black/White,Adoption
50258,06/08/2021 01:36:00 PM,17709 Powder Creek Drive in Manor (TX),Stray,Normal,Cat,Intact Male,0 years,Domestic Shorthair,Orange Tabby,Adoption
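The check for training-only columns can also be done programmatically rather than by eye. The sketch below is illustrative only: `train` and `test` are toy stand-ins whose column lists are assumed from the samples shown in this notebook, not the real CSV files.

```python
import pandas as pd

# Toy frames standing in for train.csv / test.csv (columns assumed from the samples above).
train = pd.DataFrame(columns=["Id", "Name", "Intake Time", "Outcome Time",
                              "Date of Birth", "Outcome Type"])
test = pd.DataFrame(columns=["Id", "Intake Time", "Date of Birth"])

# Columns that exist only in the training set are unavailable at prediction time.
train_only = [c for c in train.columns if c not in test.columns]
print(train_only)
```

Note that the target column (Outcome Type) will always show up in this list and must of course be kept.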


We see that there are only two types of animals in both the training and test sets. Let's make this a binary feature so we don't unnecessarily one-hot encode something that a single boolean column can represent.

In [915]:
print("Unique values in 'Animal Type' in TRAINING SET:", animal_data['Animal Type'].unique())
print("Unique values in 'Animal Type' in TEST SET:", animal_test['Animal Type'].unique())

Unique values in 'Animal Type' in TRAINING SET: ['Dog' 'Cat']
Unique values in 'Animal Type' in TEST SET: ['Dog' 'Cat']


In [916]:
# transform Animal Type into a binary column: Cat -> True, Dog -> False
animal_data['Cat'] = animal_data['Animal Type'].apply(lambda x: True if x == 'Cat' else False)
animal_data.drop('Animal Type', axis=1, inplace=True)

animal_data.sample(5)

Unnamed: 0,Intake Time,Found Location,Intake Type,Intake Condition,Sex upon Intake,Age upon Intake,Breed,Color,Outcome Type,Cat
82980,04/10/2017 12:21:00 AM,1012 Bird Creek in Austin (TX),Owner Surrender,Normal,Intact Female,7 months,Chihuahua Shorthair/Dachshund,Tricolor,Adoption,False
63915,10/01/2019 03:23:00 PM,Austin (TX),Owner Surrender,Normal,Spayed Female,1 year,Great Pyrenees,White,Adoption,False
44434,02/08/2018 08:42:00 PM,915 W Ben White in Austin (TX),Public Assist,Normal,Intact Female,7 years,English Bulldog Mix,White/Tan,Return to Owner,False
85573,04/25/2017 02:08:00 PM,2329 Bitter Creek Dr in Austin (TX),Stray,Normal,Intact Female,3 weeks,Domestic Shorthair Mix,Black/White,Transfer,True
53724,07/08/2019 11:54:00 AM,1430 Frontier Valley in Austin (TX),Stray,Normal,Intact Male,1 month,Australian Cattle Dog,White,Transfer,False


We see that Sex upon Intake encodes two attributes at once (sex and sterilization status). We can break it into two separate binary features, again avoiding one-hot encoding.

In [917]:
animal_data["Sex upon Intake"].sample(5)

37435      Intact Male
69761    Neutered Male
7822       Intact Male
67229    Intact Female
46961      Intact Male
Name: Sex upon Intake, dtype: object

In [918]:
# split "Sex upon Intake" into reproductive status (first word) and sex (second word)
sex_sterile_status = animal_data['Sex upon Intake'].str.split(' ', n=1, expand=True)

# map reproductive status to a sterilized flag; "Unknown" is treated as not sterilized
# (stored as the strings 'True'/'False' here and converted to booleans in a later cell)
animal_data['Sterilized'] = sex_sterile_status[0].map({
    'Neutered': 'True',
    'Spayed': 'True',
    'Intact': 'False',
    'Unknown': 'False'
}).fillna('False')  # unexpected or missing values: assume not sterilized

# keep sex binary with a "Male" feature: Female -> False, anything else (including Unknown) -> True
animal_data['Male'] = sex_sterile_status[1].apply(lambda x: False if x == 'Female' else True)
animal_data.drop("Sex upon Intake", axis=1, inplace=True)

animal_data.sample(5)

Unnamed: 0,Intake Time,Found Location,Intake Type,Intake Condition,Age upon Intake,Breed,Color,Outcome Type,Cat,Sterilized,Male
97486,06/08/2021 03:21:00 PM,124 West Anderson Lane in Austin (TX),Stray,Nursing,1 day,Domestic Shorthair,Gray,Transfer,True,False,False
12294,01/19/2024 02:41:00 PM,9500 S Ih 35 in Austin (TX),Stray,Normal,1 month,Labrador Retriever,Black,Adoption,False,False,False
72780,03/08/2022 10:09:00 AM,9704 Giles Lane in Austin (TX),Stray,Injured,2 years,Mastiff Mix,Brown Brindle/White,Return to Owner,False,False,True
91527,10/04/2014 11:29:00 AM,6701 Galindo St in Austin (TX),Stray,Normal,3 years,Yorkshire Terrier/Chihuahua Shorthair,Black/Tan,Transfer,False,False,False
105808,07/14/2022 09:06:00 AM,10101 S 1St St in Austin (TX),Stray,Injured,1 month,Domestic Shorthair,Brown Tabby,Transfer,True,False,True
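One detail of the split above: a value of "Unknown" has no second word, so those rows end up with Male = True. If that is not desired, a possible variation (not what this notebook does) is to keep a separate flag for unknown sex. The sketch below uses toy data, and the `sex_unknown` name is made up for illustration.

```python
import pandas as pd

toy = pd.Series(["Intact Male", "Neutered Male", "Spayed Female", "Unknown", None],
                name="Sex upon Intake")

# First word = reproductive status, second word = sex (missing for "Unknown" / NaN).
parts = toy.str.split(" ", n=1, expand=True)

sterilized = parts[0].isin(["Neutered", "Spayed"])
male = parts[1].eq("Male")                          # unknown sex -> False here
sex_unknown = ~parts[1].isin(["Male", "Female"])    # hypothetical extra flag

print(pd.DataFrame({"Sterilized": sterilized, "Male": male, "SexUnknown": sex_unknown}))
```

Whether the extra column helps is an empirical question; it simply avoids silently counting unknown animals as male.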


We see that the Age upon Intake feature uses several different units (years, months, weeks, days) to describe the age of the animal when it entered the shelter. We need a single common unit so that ages can be compared directly during training.

In [919]:
animal_data['Age upon Intake'].head(10)

0      8 years
1    11 months
2      2 years
3      2 years
4      6 years
5     6 months
6      2 years
7      4 weeks
8      4 weeks
9     5 months
Name: Age upon Intake, dtype: object

We can write a helper function that parses the age string and converts it to years, expressed as a float, so the feature is consistent across all rows.

In [920]:
# helper to convert age upon intake to years
def age_to_years(age_str):
    if pd.isna(age_str):
        return None
    
    number, unit = age_str.split()
    number = float(number)
    
    if "year" in unit:
        return number
    elif "month" in unit:
        return number / 12
    elif "week" in unit:
        return number / 52
    elif "day" in unit:
        return number / 365
    else:
        return None  # in case of an unexpected format

# convert age to years
animal_data['Age'] = animal_data['Age upon Intake'].apply(age_to_years)
animal_data.drop('Age upon Intake', axis=1, inplace=True)

# sample some data after transformations, verify age conversion
animal_data.sample(5)

Unnamed: 0,Intake Time,Found Location,Intake Type,Intake Condition,Breed,Color,Outcome Type,Cat,Sterilized,Male,Age
73800,10/17/2015 09:39:00 AM,1000 Ellingson in Austin (TX),Owner Surrender,Normal,Domestic Shorthair Mix,Blue,Adoption,True,True,True,5.0
71060,10/22/2020 10:51:00 AM,Manor (TX),Owner Surrender,Normal,Domestic Shorthair Mix,Cream Tabby,Adoption,True,False,False,1.0
48779,02/25/2022 04:22:00 PM,14020 Briarcreek Loop in Travis (TX),Stray,Normal,Shih Tzu/Miniature Poodle,Black/White,Transfer,False,False,True,2.0
66108,12/13/2013 06:04:00 PM,5315 Apple Orchard Ln in Austin (TX),Stray,Normal,Chihuahua Shorthair Mix,White/Brown,Return to Owner,False,True,False,4.0
110213,03/10/2014 05:45:00 PM,1200 Lily in Austin (TX),Stray,Injured,Domestic Shorthair Mix,Brown Tabby/White,Transfer,True,False,True,0.583333
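For reference, the same conversion can be written without a row-by-row `apply`. This is a minimal sketch on toy data that assumes every non-missing value has the form "<number> <unit>"; `ages_to_years` and `UNIT_DIVISOR` are names introduced here, and the divisors mirror those in `age_to_years` above.

```python
import pandas as pd

# Same divisors as age_to_years: months / 12, weeks / 52, days / 365.
UNIT_DIVISOR = {"year": 1, "month": 12, "week": 52, "day": 365}

def ages_to_years(ages: pd.Series) -> pd.Series:
    # Split "8 years" into ["8", "years"]; reindex keeps both columns even if a split is short.
    parts = ages.str.split(expand=True).reindex(columns=[0, 1])
    number = pd.to_numeric(parts[0], errors="coerce")
    unit = parts[1].str.rstrip("s")          # "years" -> "year", "day" stays "day"
    # Unknown units map to NaN, so the result is NaN for unexpected formats.
    return number / unit.map(UNIT_DIVISOR)

toy_ages = pd.Series(["8 years", "11 months", "4 weeks", "1 day", None])
toy_years = ages_to_years(toy_ages)
print(toy_years)
```

Missing or unparseable ages come out as NaN, which leaves the choice of imputation explicit.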


Repeat the process for the test set.

In [921]:
# drop birth date
animal_test.drop("Date of Birth", axis=1, inplace=True)

# split "Sex upon Intake" into reproductive status (first word) and sex (second word)
sex_sterile_status = animal_test['Sex upon Intake'].str.split(' ', n=1, expand=True)

# map reproductive status to a sterilized flag; "Unknown" is treated as not sterilized
animal_test['Sterilized'] = sex_sterile_status[0].map({
    'Neutered': True,
    'Spayed': True,
    'Intact': False,
    'Unknown': False
}).fillna(False)  # unexpected or missing values: assume not sterilized
# the training set stored this flag as the strings 'True'/'False'; convert it to real booleans
animal_data['Sterilized'] = animal_data['Sterilized'].map({'True': True, 'False': False})

# keep sex binary with a "Male" feature: Female -> False, anything else (including Unknown) -> True
animal_test['Male'] = sex_sterile_status[1].apply(lambda x: False if x == 'Female' else True)
animal_test.drop("Sex upon Intake", axis=1, inplace=True)

# transform Animal Type into a binary column: Cat -> True, Dog -> False
animal_test['Cat'] = animal_test['Animal Type'].apply(lambda x: True if x == 'Cat' else False)
animal_test.drop('Animal Type', axis=1, inplace=True)

# convert age to years
animal_test['Age'] = animal_test['Age upon Intake'].apply(age_to_years)
animal_test['Age'] = animal_test['Age'].fillna(0)  # fill NaN with 0
animal_test.drop('Age upon Intake', axis=1, inplace=True)

# sample some data after transformations, verify age conversion
animal_test.sample(5)

Unnamed: 0,Id,Intake Time,Found Location,Intake Type,Intake Condition,Breed,Color,Sterilized,Male,Cat,Age
27020,27021,11/28/22 11:35,5001 Crainway Dr in Austin (TX),Stray,Normal,Black Mouth Cur Mix,Tan/Black,False,False,False,0.166667
26158,26159,1/25/15 20:18,3505 Lafayette Ave. in Austin (TX),Stray,Normal,Boxer Mix,Brown Brindle/White,False,True,False,1.0
2709,2710,3/16/17 11:58,Little Loop And Ridgeview Dr in Lago Vista (TX),Stray,Normal,Pit Bull Mix,White/Brown Brindle,True,False,False,2.0
2672,2673,10/26/18 18:15,907 West Slaughter Lane in Austin (TX),Stray,Normal,Domestic Shorthair Mix,Black,False,False,True,0.166667
23295,23296,1/21/15 11:38,1800 E Stassney Ln in Austin (TX),Stray,Normal,Boxer/Australian Cattle Dog,Brown/White,False,False,False,0.666667
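Since the same transformations are applied to both frames, one way to keep them from drifting apart is to wrap the steps in a single function and call it twice. The sketch below is a hypothetical helper (`add_binary_features` is not part of this notebook) run on toy frames, and it follows the same rules as the cells above.

```python
import pandas as pd

def add_binary_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive Sterilized, Male and Cat identically for any frame."""
    out = df.copy()
    # reindex guarantees columns 0 and 1 exist even if no row has a second word
    parts = (
        out["Sex upon Intake"]
        .str.split(" ", n=1, expand=True)
        .reindex(columns=[0, 1])
    )
    out["Sterilized"] = parts[0].isin(["Neutered", "Spayed"])
    out["Male"] = parts[1].ne("Female")   # same rule as above: anything but Female -> True
    out["Cat"] = out["Animal Type"].eq("Cat")
    return out.drop(columns=["Sex upon Intake", "Animal Type"])

toy_train = pd.DataFrame({
    "Sex upon Intake": ["Spayed Female", "Intact Male", "Unknown"],
    "Animal Type": ["Cat", "Dog", "Cat"],
})
toy_test = pd.DataFrame({
    "Sex upon Intake": ["Neutered Male"],
    "Animal Type": ["Dog"],
})

toy_train = add_binary_features(toy_train)
toy_test = add_binary_features(toy_test)
print(toy_train)
```

With a shared function, both frames get real booleans from the start, so no later string-to-boolean conversion is needed.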


## **3. Feature Engineering:**

We need to one-hot encode the categorical features. However, some of them have many labels, and several of those labels have very few data points. We can compare the outcome distributions of the labels and group similar ones together to reduce dimensionality.

Let's look at Intake Condition and see all the possible labels for that class.

In [922]:
# Count the occurrences of each intake condition label
intake_condition_counts = animal_data['Intake Condition'].value_counts()

# Print the counts for each intake condition
for condition, count in intake_condition_counts.items():
    print(f"{condition}: {count}")

Normal: 95010
Injured: 6394
Sick: 4295
Nursing: 2957
Neonatal: 1240
Aged: 373
Medical: 298
Other: 247
Pregnant: 111
Feral: 104
Med Attn: 48
Behavior: 42
Unknown: 12
Neurologic: 10
Med Urgent: 7
Parvo: 5
Space: 2
Agonal: 1
Congenital: 1


We see above that there are 19 distinct values for the Intake Condition feature. We can merge some of the rarer ones together.

Let's explore the outcome percentages for each intake condition so we can group the conditions sensibly and reduce the number of labels we need to one-hot encode (every extra label adds a dimension).

In [923]:
# Iterate through each unique intake condition and calculate outcome percentages
intake_conditions = animal_data['Intake Condition'].unique()

for condition in intake_conditions:
    condition_data = animal_data[animal_data['Intake Condition'] == condition]
    total_count = len(condition_data)
    if total_count > 0:
        outcome_percentages = condition_data['Outcome Type'].value_counts(normalize=True) * 100
        print(f"Intake Condition: {condition}")
        print(outcome_percentages)
        print("-" * 50)
    else:
        print(f"Intake Condition: {condition} has no entries.")
        print("-" * 50)

Intake Condition: Normal
Outcome Type
Adoption           52.596569
Transfer           29.360067
Return to Owner    15.985686
Euthanasia          1.513525
Died                0.544153
Name: proportion, dtype: float64
--------------------------------------------------
Intake Condition: Injured
Outcome Type
Adoption           34.813888
Transfer           30.669378
Euthanasia         18.720676
Return to Owner    12.605568
Died                3.190491
Name: proportion, dtype: float64
--------------------------------------------------
Intake Condition: Pregnant
Outcome Type
Transfer           54.954955
Adoption           39.639640
Return to Owner     4.504505
Died                0.900901
Name: proportion, dtype: float64
--------------------------------------------------
Intake Condition: Neonatal
Outcome Type
Transfer           69.919355
Adoption           25.564516
Died                3.064516
Return to Owner     0.967742
Euthanasia          0.483871
Name: proportion, dtype: float64
-------
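The same per-category outcome percentages can be produced as a single table with `pd.crosstab`, which makes it easier to compare conditions side by side. A minimal sketch on made-up rows (the counts below are not from the shelter data):

```python
import pandas as pd

toy = pd.DataFrame({
    "Intake Condition": ["Normal", "Normal", "Injured", "Injured", "Neonatal", "Normal"],
    "Outcome Type": ["Adoption", "Transfer", "Euthanasia", "Adoption", "Transfer", "Adoption"],
})

# normalize="index" turns each row into proportions that sum to 1; scale to percent.
pct = pd.crosstab(toy["Intake Condition"], toy["Outcome Type"], normalize="index") * 100
print(pct.round(1))
```

Outcomes that never occur for a condition appear as 0 rather than being omitted, which keeps the rows directly comparable.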

Groupings based on similar outcome percentages

In [924]:
'''
   [Group]	       [Categories]
    Normal	        Normal
    Neonatal 	    Neonatal, Nursing
    Med_Minor	    Injured, Medical (more similar outcome adopt %)
    Med_Major       Med Attn, Med Urgent, Neurologic, Pregnant, Sick (more similar outcome transfer %)
    Behavioral	    Feral, Behavior
    Critical    	Agonal, Aged, Congenital, Parvo, Space, Other, Unknown
'''


The merged groups have broadly similar outcome distributions. Combining them reduces the number of one-hot columns for this feature from 19 to 6, which should help training, while keeping the distinctions between conditions that lead to different outcomes.

In [925]:
# intake condition into grouped categories
condition_map = {
    'Normal': 'Normal',
    'Injured': 'Med_Minor',
    'Sick': 'Med_Major',
    'Nursing': 'Neonatal',
    'Neonatal': 'Neonatal',
    'Med Attn': 'Med_Major',
    'Med Urgent': 'Med_Major',
    'Medical': 'Med_Minor',
    'Neurologic': 'Med_Major',
    'Pregnant': 'Med_Major',
    'Feral': 'Behavioral',
    'Behavior': 'Behavioral',
}

# make copy to make transformations (in case it doesn't improve accuracy)
reduced_animal_data = animal_data.copy()
reduced_animal_test = animal_test.copy()

# map with a fallback to 'Rare' for any condition not listed above (the "Critical" group)
reduced_animal_data['Condition'] = reduced_animal_data['Intake Condition'].map(condition_map).fillna('Rare')
reduced_animal_data = pd.get_dummies(reduced_animal_data, columns=['Condition'])
reduced_animal_data.drop('Intake Condition', axis=1, inplace=True)

reduced_animal_test['Condition'] = reduced_animal_test['Intake Condition'].map(condition_map).fillna('Rare')
reduced_animal_test = pd.get_dummies(reduced_animal_test, columns=['Condition'])
reduced_animal_test.drop('Intake Condition', axis=1, inplace=True)

reduced_animal_data.sample(5)  # sample some data after transformation

Unnamed: 0,Intake Time,Found Location,Intake Type,Breed,Color,Outcome Type,Cat,Sterilized,Male,Age,Condition_Behavioral,Condition_Med_Major,Condition_Med_Minor,Condition_Neonatal,Condition_Normal,Condition_Rare
39251,04/22/2015 06:11:00 PM,5300 Apple Orchard in Austin (TX),Stray,Rat Terrier Mix,White/Tan,Return to Owner,False,True,False,9.0,False,False,False,False,True,False
46253,08/24/2017 04:34:00 PM,Cherico Street in Austin (TX),Stray,Chihuahua Shorthair Mix,Brown,Transfer,False,True,True,1.0,False,False,False,False,True,False
103065,06/20/2016 09:22:00 AM,14100 Morgan Creek in Austin (TX),Stray,Lhasa Apso Mix,Buff/Gray,Transfer,False,True,False,4.0,False,False,False,False,True,False
106267,01/06/2023 12:54:00 PM,Travis (TX),Stray,Domestic Shorthair,Blue,Adoption,True,False,False,2.0,False,False,False,False,True,False
50179,01/14/2016 11:15:00 AM,Austin (TX),Owner Surrender,Miniature Poodle Mix,White,Transfer,False,False,False,15.0,False,False,False,False,True,False
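One thing to watch when calling `pd.get_dummies` separately on the training and test frames: if a category never appears in one of them, the two frames end up with different columns. A minimal sketch of one way to align them, using toy frames (with the real data, the target column would need to be excluded before aligning):

```python
import pandas as pd

train = pd.DataFrame({"Condition": ["Normal", "Rare", "Neonatal"]})
test = pd.DataFrame({"Condition": ["Normal", "Neonatal"]})   # no 'Rare' rows

train_enc = pd.get_dummies(train, columns=["Condition"])
test_enc = pd.get_dummies(test, columns=["Condition"])

# Give the test frame exactly the training columns, in the same order;
# categories missing from the test set become all-False columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=False)
print(test_enc)
```

Reindexing also drops any dummy column that exists only in the test set, which is what a model trained on the training columns expects.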


Let's do the same for Intake Type and check the outcome percentages for each intake type.

In [926]:
# observe outcome percentages for each intake type so we can group these categoricals, reduces dimensionality
intake_types = animal_data['Intake Type'].unique()

for intake_type in intake_types:
    type_data = animal_data[animal_data['Intake Type'] == intake_type]
    total_count = len(type_data)
    if total_count > 0:
        outcome_percentages = type_data['Outcome Type'].value_counts(normalize=True) * 100
        print(f"Intake Type: {intake_type}")
        print(outcome_percentages)
        print("-" * 50)
    else:
        print(f"Intake Type: {intake_type} has no entries.")
        print("-" * 50)

Intake Type: Stray
Outcome Type
Adoption           47.774546
Transfer           34.219783
Return to Owner    13.905286
Euthanasia          3.056120
Died                1.044266
Name: proportion, dtype: float64
--------------------------------------------------
Intake Type: Public Assist
Outcome Type
Return to Owner    63.545914
Adoption           18.528788
Transfer           14.222802
Euthanasia          3.262111
Died                0.440385
Name: proportion, dtype: float64
--------------------------------------------------
Intake Type: Owner Surrender
Outcome Type
Adoption           64.671050
Transfer           26.502541
Return to Owner     5.399357
Euthanasia          2.736980
Died                0.690073
Name: proportion, dtype: float64
--------------------------------------------------
Intake Type: Abandoned
Outcome Type
Adoption           63.017032
Transfer           26.845093
Return to Owner     9.083536
Euthanasia          0.648824
Died                0.405515
Name: proportion, 

These would be the groupings that are most similar to each other based on the outcome percentages.

In [927]:
'''
    [Group]	         [Categories]
     Public Assist    Public Assist
     Stray            Stray, Wildlife
     Owner-Initiated  Abandoned, Owner Surrender
     Euthanasia       Euthanasia Request
'''


In [928]:
# Define the mapping for grouping intake types
intake_type_map = {
    'Stray': 'Stray',
    'Public Assist': 'Public Assist',
    'Wildlife': 'Stray',
    'Abandoned': 'Owner Initiated',
    'Owner Surrender': 'Owner Initiated',
    'Euthanasia Request': 'Euthanasia'
}

# Map the intake types to their respective groups (anything unmapped becomes 'Other')
reduced_animal_data['Intake'] = reduced_animal_data['Intake Type'].map(intake_type_map).fillna('Other')
reduced_animal_data = pd.get_dummies(reduced_animal_data, columns=['Intake'])
reduced_animal_data.drop('Intake Type', axis=1, inplace=True)

reduced_animal_test['Intake'] = reduced_animal_test['Intake Type'].map(intake_type_map).fillna('Other')
reduced_animal_test = pd.get_dummies(reduced_animal_test, columns=['Intake'])
reduced_animal_test.drop('Intake Type', axis=1, inplace=True)

reduced_animal_data.sample(5)  # sample some data after transformation


Unnamed: 0,Intake Time,Found Location,Breed,Color,Outcome Type,Cat,Sterilized,Male,Age,Condition_Behavioral,Condition_Med_Major,Condition_Med_Minor,Condition_Neonatal,Condition_Normal,Condition_Rare,Intake_Euthanasia,Intake_Owner Initiated,Intake_Public Assist,Intake_Stray
24347,08/22/2016 11:38:00 AM,7201 Levander Loop in Austin (TX),Chihuahua Shorthair/Italian Greyhound,Brown/White,Adoption,False,True,False,8.0,False,False,False,False,True,False,False,False,False,True
85983,09/12/2021 01:48:00 PM,6200 Loyola Ln Unit 1113 in Austin (TX),Anatol Shepherd,White,Adoption,False,False,True,2.0,False,False,False,False,True,False,False,False,True,False
52069,06/08/2024 07:53:00 AM,3307 Kim St in Austin (TX),Alaskan Malamute,Black/White,Return to Owner,False,True,True,4.0,False,False,False,False,True,False,False,True,False,False
87936,10/09/2019 11:40:00 AM,11207 N Lamar in Austin (TX),Domestic Shorthair,Black,Died,True,False,True,2.0,False,True,False,False,False,False,False,False,False,True
70155,06/07/2022 04:55:00 PM,3600 Presidential Blvd in Austin (TX),Domestic Shorthair,Black/White,Transfer,True,False,False,0.010959,False,False,False,False,True,False,False,False,False,True


We still need to find a way to encode:
- Intake Time
- Found Location
- Breed
- Color

Do these features affect the outcome at all, and if so, how strongly? For example, are some colors or breeds more or less desirable than others, and does that change adoption rates?
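For Intake Time, one common option is to parse the timestamp and extract calendar components. The sketch below is only a suggestion: the two format strings are inferred from the samples shown earlier (the training and test files appear to use different timestamp formats), and `time_features` and its column names are made up for illustration.

```python
import pandas as pd

# Sample values copied from the training and test previews above.
train_times = pd.Series(["04/06/2015 12:30:00 PM", "09/02/2020 10:33:00 AM"])
test_times = pd.Series(["11/28/22 11:35", "1/25/15 20:18"])

def time_features(raw: pd.Series, fmt: str) -> pd.DataFrame:
    # errors="coerce" turns unparseable values into NaT instead of raising
    ts = pd.to_datetime(raw, format=fmt, errors="coerce")
    return pd.DataFrame({
        "IntakeYear": ts.dt.year,
        "IntakeMonth": ts.dt.month,
        "IntakeDayOfWeek": ts.dt.dayofweek,   # Monday = 0
        "IntakeHour": ts.dt.hour,
    })

train_feats = time_features(train_times, "%m/%d/%Y %I:%M:%S %p")   # assumed train format
test_feats = time_features(test_times, "%m/%d/%y %H:%M")           # assumed test format
print(train_feats)
print(test_feats)
```

Passing an explicit format per file avoids relying on automatic inference, which could interpret the two files differently.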

In [929]:
# Show top full color strings
top_colors = animal_data["Color"].value_counts().nlargest(10).index
filtered = animal_data[animal_data["Color"].isin(top_colors)]

print("=== Outcome Distribution by Full Color (Top 10 Only) ===")
color_outcome = (
    filtered.groupby("Color")["Outcome Type"]
    .value_counts(normalize=False)
    .unstack()
    .fillna(0)
)

# Add totals
color_outcome["Count"] = color_outcome.sum(axis=1)

# Convert to percentage 
color_percent = color_outcome.div(color_outcome["Count"], axis=0) * 100

# Round and attach Count column
final = color_percent.round(1)
final["Count"] = color_outcome["Count"].astype(int)

print(final.to_string())
# # Extract primary color before "/"
# animal_data["PrimaryColorOnly"] = animal_data["Color"].fillna("Unknown").apply(lambda x: x.split("/")[0])

# # Display outcome distribution by primary color
# print("=== Outcome Distribution by Primary Color ===")
# primary_color_outcome = animal_data.groupby("PrimaryColorOnly")["Outcome Type"].value_counts(normalize=True).unstack().fillna(0) * 100
# print(primary_color_outcome.round(1).to_string())


=== Outcome Distribution by Full Color (Top 10 Only) ===
Outcome Type       Adoption  Died  Euthanasia  Return to Owner  Transfer  Count
Color                                                                          
Black                  46.8   1.4         3.6             10.6      37.5   9674
Black/White            52.4   1.0         3.0             14.4      29.1  11620
Blue/White             51.8   0.5         3.9             17.0      26.8   3003
Brown Tabby            49.5   1.3         3.7              3.7      41.7   7708
Brown Tabby/White      51.7   1.5         3.3              4.5      38.9   3862
Brown/White            49.3   0.5         2.5             23.9      23.9   3457
Orange Tabby           49.0   1.2         3.4              4.8      41.7   3673
Tan/White              50.8   0.5         2.5             22.0      24.2   3178
White                  40.1   0.8         3.4             24.6      31.2   3945
White/Black            49.0   0.7         2.9             17.4 

There appear to be some noticeable differences between colors, for example between Black and White. Let's look at Color in more depth.

In [930]:
from scipy.stats import chi2_contingency

# Build contingency table
contingency_table = pd.crosstab(animal_data["Color"], animal_data["Outcome Type"])

# Run chi-squared test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-squared: {chi2:.2f}")
print(f"P-value: {p:.10f}")

Chi-squared: 10816.40
P-value: 0.0000000000
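To make the statistic above less of a black box, here is what the chi-squared test computes, done by hand on a small made-up table (these counts are illustrative, not from the shelter data). Expected counts assume color and outcome are independent; the statistic sums the squared deviations from them.

```python
import pandas as pd

# Toy contingency table: rows = color, columns = outcome.
observed = pd.DataFrame(
    [[30, 10, 10], [20, 40, 10]],
    index=["Black", "White"],
    columns=["Adoption", "Transfer", "Return to Owner"],
)

obs = observed.to_numpy()
row_totals = obs.sum(axis=1)[:, None]
col_totals = obs.sum(axis=0)[None, :]
n = obs.sum()

# Expected count under independence: row total * column total / grand total.
expected = row_totals * col_totals / n
chi2_stat = ((obs - expected) ** 2 / expected).sum()
print(round(chi2_stat, 3))
```

Because the statistic grows with the number of rows, a dataset of this size will produce very small p-values even for modest differences, so an effect-size measure such as Cramér's V is a useful complement.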


The test indicates a statistically significant relationship between color and outcome type. Let's process the color data further before we use it.

In [931]:
def extract_primary_color(color):
    if pd.isna(color):
        return "Unknown"
    if "/" in color:
        return color.split("/")[0]
    if color.lower() in ["tricolor", "calico", "torbie", "tortie"]:
        return "Multi"
    return color.strip()

# Apply to both datasets
animal_data["PrimaryColor"] = animal_data["Color"].apply(extract_primary_color)
animal_test["PrimaryColor"] = animal_test["Color"].apply(extract_primary_color)

# Binary pattern flags
def color_flags(color):
    color = str(color).lower()
    return pd.Series({
        "has_tabby": "tabby" in color,
        "has_tortie": "tortie" in color,
        "has_calico": "calico" in color,
        "has_torbie": "torbie" in color,
        "has_tricolor": "tricolor" in color
    })

color_features_data = animal_data["Color"].apply(color_flags)
color_features_test = animal_test["Color"].apply(color_flags)

# Attach pattern flags to data
animal_data = pd.concat([animal_data, color_features_data], axis=1)
animal_test = pd.concat([animal_test, color_features_test], axis=1)

# Encode PrimaryColor
from sklearn.preprocessing import LabelEncoder

color_encoder = LabelEncoder()
animal_data["PrimaryColorEncoded"] = color_encoder.fit_transform(animal_data["PrimaryColor"])
animal_test["PrimaryColorEncoded"] = color_encoder.transform(animal_test["PrimaryColor"])

# Drop unused original color fields
animal_data.drop(["Color", "PrimaryColor"], axis=1, inplace=True)
animal_test.drop(["Color", "PrimaryColor"], axis=1, inplace=True)

# # === Quick Stats ===
# print("\nTop 10 Primary Colors:")
# print(animal_data["PrimaryColorEncoded"].value_counts().head(10))

# print("\nhas_tabby Outcome Breakdown:")
# print(animal_data[animal_data["has_tabby"] == True]["Outcome Type"].value_counts(normalize=True).round(2) * 100)

# === Color Summary Tables ===

# Decode color labels for readability
animal_data["PrimaryColorLabel"] = color_encoder.inverse_transform(animal_data["PrimaryColorEncoded"])

# Get top 20 most frequent primary colors
top_colors = animal_data["PrimaryColorLabel"].value_counts().head(20).index
top_color_data = animal_data[animal_data["PrimaryColorLabel"].isin(top_colors)]

# Group outcome % and counts by primary color
color_outcome = (
    top_color_data.groupby("PrimaryColorLabel")["Outcome Type"]
    .value_counts(normalize=False)
    .unstack(fill_value=0)
)

color_outcome["Count"] = color_outcome.sum(axis=1)
color_percent = color_outcome.div(color_outcome["Count"], axis=0) * 100
final_color_table = color_percent.round(1)
final_color_table["Count"] = color_outcome["Count"]

print("\n=== Top 20 Primary Colors: Outcome % and Counts ===")
print(final_color_table.to_string())

# === Pattern Feature Breakdown ===
print("\n=== Pattern Flags (Tabby, Tortie, etc.): Outcome % and Counts ===")
for col in ["has_tabby", "has_tortie", "has_calico", "has_torbie", "has_tricolor"]:
    subset = animal_data[animal_data[col]]
    if len(subset) == 0:
        continue
    outcome_counts = subset["Outcome Type"].value_counts()
    outcome_percent = subset["Outcome Type"].value_counts(normalize=True) * 100
    summary = pd.DataFrame({
        "Count": outcome_counts,
        "Percent": outcome_percent.round(1)
    })
    print(f"\n-- {col} --")
    print(summary.to_string())


=== Top 20 Primary Colors: Outcome % and Counts ===
Outcome Type       Adoption  Died  Euthanasia  Return to Owner  Transfer  Count
PrimaryColorLabel                                                              
Black                  50.4   1.1         3.1             14.9      30.5  27150
Blue                   50.3   0.9         3.7             13.5      31.6   5386
Blue Tabby             54.6   1.3         3.2              3.4      37.5   2937
Brown                  48.2   0.6         2.8             23.6      24.7   8626
Brown Brindle          50.9   0.3         3.3             22.8      22.6   2788
Brown Tabby            50.1   1.4         3.6              4.0      40.9  11694
Buff                   45.1   0.8         2.5             22.7      28.9    634
Chocolate              49.3   0.6         2.9             24.6      22.5   1462
Cream                  44.7   0.8         1.9             18.9      33.7   1070
Cream Tabby            55.4   1.7         2.4              2.5     
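A caveat on the encoding step above: `LabelEncoder.transform` raises an error if the test set contains a label that was not seen during `fit`. One possible workaround, sketched here on toy data as an alternative rather than the approach this notebook uses, is to fix the category list from the training set and let unseen values map to -1.

```python
import pandas as pd

train_colors = pd.Series(["Black", "Brown Tabby", "White", "Black"])
test_colors = pd.Series(["White", "Lilac Point", "Black"])   # 'Lilac Point' is unseen

# Categories come from the training set only.
categories = sorted(train_colors.unique())

train_codes = pd.Categorical(train_colors, categories=categories).codes
test_codes = pd.Categorical(test_colors, categories=categories).codes   # unseen -> -1

print(train_codes.tolist())
print(test_codes.tolist())
```

Either way, the integer codes are arbitrary labels with no meaningful order, which matters more for linear models than for tree-based ones.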

Let's look at correlations.

In [932]:
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(confusion_matrix):
    # Cramér's V: effect size for a chi-squared test, from 0 (no association) to 1
    chi2, _, _, _ = chi2_contingency(confusion_matrix)
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    return np.sqrt(phi2 / min(k - 1, r - 1))

# For primary color
cramer_v_color = cramers_v(pd.crosstab(animal_data["PrimaryColorEncoded"], animal_data["Outcome Type"]))
print(f"\nCramér’s V for PrimaryColorEncoded: {cramer_v_color:.3f}")

# Color-related binary features (named so it does not shadow the color_flags() helper defined earlier)
color_flag_cols = ["has_tabby", "has_tortie", "has_calico", "has_torbie", "has_tricolor"]

print("=== Chi-squared Test Results for Color Pattern Flags ===")
for col in color_flag_cols:
    table = pd.crosstab(animal_data[col], animal_data["Outcome Type"])
    try:
        chi2, p, dof, expected = chi2_contingency(table)
        print(f"{col:<15} | Chi2: {chi2:8.2f} | p: {p:.5f}")
    except Exception as e:
        print(f"{col:<15} | Error: {e}")


Cramér’s V for PrimaryColorEncoded: 0.119
=== Chi-squared Test Results for Color Pattern Flags ===
has_tabby       | Chi2:  3183.38 | p: 0.00000
has_tortie      | Chi2:   293.01 | p: 0.00000
has_calico      | Chi2:   241.65 | p: 0.00000
has_torbie      | Chi2:   217.69 | p: 0.00000
has_tricolor    | Chi2:   188.13 | p: 0.00000


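All five pattern flags show a statistically significant relationship with outcome type, and the Cramér’s V of 0.119 for primary color points to a weak but non-negligible association. We therefore keep these color features for the predictions that follow.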

In [933]:

# Filter data for breeds with "mix"
mix_breeds = animal_data[animal_data['Breed'].str.contains('mix', case=False, na=False)]

# Filter data for breeds without "mix"
non_mix_breeds = animal_data[~animal_data['Breed'].str.contains('mix', case=False, na=False)]

# Calculate percentages of Outcome Type for breeds with "mix"
mix_outcome_percentages = mix_breeds['Outcome Type'].value_counts(normalize=True) * 100

# Calculate percentages of Outcome Type for breeds without "mix"
non_mix_outcome_percentages = non_mix_breeds['Outcome Type'].value_counts(normalize=True) * 100

# Top 20 breeds with "mix" and their frequencies
top_mix_breeds = mix_breeds['Breed'].value_counts().head(20)

# Top 20 breeds without "mix" and their frequencies
top_non_mix_breeds = non_mix_breeds['Breed'].value_counts().head(20)

# Print the results
print("Top 20 breeds with 'mix' and their frequencies:")
print(top_mix_breeds)

print("\nTop 20 breeds without 'mix' and their frequencies:")
print(top_non_mix_breeds)

# Print the results
print("Outcome Type percentages for breeds with 'mix':")
print(mix_outcome_percentages)
print("\nOutcome Type percentages for breeds without 'mix':")
print(non_mix_outcome_percentages)

# Get the top 3 mixed breeds
top_3_mixed_breeds = mix_breeds['Breed'].value_counts().head(3).index

# Get the top 3 purebred breeds
top_3_purebred_breeds = non_mix_breeds['Breed'].value_counts().head(3).index

# Calculate and print outcome percentages for the top 3 mixed breeds
print("Outcome percentages for top 3 mixed breeds:")
for breed in top_3_mixed_breeds:
    breed_data = mix_breeds[mix_breeds['Breed'] == breed]
    outcome_percentages = breed_data['Outcome Type'].value_counts(normalize=True) * 100
    print(f"\nBreed: {breed}")
    print(outcome_percentages)

# Calculate and print outcome percentages for the top 3 purebred breeds
print("\nOutcome percentages for top 3 purebred breeds:")
for breed in top_3_purebred_breeds:
    breed_data = non_mix_breeds[non_mix_breeds['Breed'] == breed]
    outcome_percentages = breed_data['Outcome Type'].value_counts(normalize=True) * 100
    print(f"\nBreed: {breed}")
    print(outcome_percentages)

Top 20 breeds with 'mix' and their frequencies:
Breed
Domestic Shorthair Mix       25361
Pit Bull Mix                  6042
Labrador Retriever Mix        5654
Chihuahua Shorthair Mix       4896
German Shepherd Mix           2637
Domestic Medium Hair Mix      2564
Australian Cattle Dog Mix     1337
Domestic Longhair Mix         1254
Siamese Mix                   1106
Dachshund Mix                  852
Border Collie Mix              769
Boxer Mix                      749
Miniature Poodle Mix           646
Siberian Husky Mix             613
Australian Shepherd Mix        612
Catahoula Mix                  544
Yorkshire Terrier Mix          523
Great Pyrenees Mix             515
Miniature Schnauzer Mix        478
Rat Terrier Mix                475
Name: count, dtype: int64

Top 20 breeds without 'mix' and their frequencies:
Breed
Domestic Shorthair                    16046
Pit Bull                               2117
Domestic Medium Hair                   1436
Chihuahua Shorthair           

Let's try taking these four features out. From studying the data at a surface level, they don't seem to have much relevance to the final outcome.

Let's try building a KNN classifier with the remaining features. We can try techniques we have learned to find a good k-value.

In [934]:
# import pandas as pd
# import numpy as np
# from datetime import datetime

# # === 1. Load intake and weather data ===
# animal_data["Intake Time"] = pd.to_datetime(animal_data["Intake Time"], errors="coerce")
# animal_test["Intake Time"] = pd.to_datetime(animal_test["Intake Time"], errors="coerce")
# weather = pd.read_csv("austin_weather_data.csv")

# # === 2. Parse datetime in weather data
# weather["datetime"] = pd.to_datetime(weather["datetime"], errors="coerce")
# weather = weather.set_index("datetime").sort_index()

# # === 3. Create mapping from date to temperature
# feelslike_map = weather["feelslike"].to_dict()

# # === 4. Normalize intake times to date only
# animal_data["Intake Date"] = pd.to_datetime(animal_data["Intake Time"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce")
# animal_test["Intake Date"] = pd.to_datetime(animal_test["Intake Time"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# # === 5. Function to get matching or fallback temperature
# def get_temperature(intake_date):
#     if pd.isnull(intake_date):
#         return np.nan
#     # Try exact match first
#     if intake_date in feelslike_map:
#         return feelslike_map[intake_date]
#     # Try previous years for same MM-DD
#     for years_back in range(1, 6):
#         try_date = intake_date - pd.DateOffset(years=years_back)
#         if try_date in feelslike_map:
#             return feelslike_map[try_date]
#     return np.nan

# # === 6. Apply mapping
# animal_data["Intake Temperature"] = animal_data["Intake Date"].apply(get_temperature)
# animal_test["Intake Temperature"] = animal_test["Intake Date"].apply(get_temperature)

# # === 7. Fill missing with mean (optional)
# animal_data["Intake Temperature"] = animal_data["Intake Temperature"].fillna(animal_data["Intake Temperature"].mean())
# animal_test["Intake Temperature"] = animal_test["Intake Temperature"].fillna(animal_test["Intake Temperature"].mean())

# # === 8. Drop helper column
# animal_data.drop(columns=["Intake Date"], inplace=True)
# animal_test.drop(columns=["Intake Date"], inplace=True)

# # ✅ Done: Intake Temperature column is now added to animal_data
# animal_data.sample(5)


In [935]:
animal_data.drop('Intake Time', axis=1, inplace=True)
animal_data.drop('Found Location', axis=1, inplace=True)
animal_data.drop('Breed', axis=1, inplace=True)
animal_data.drop('PrimaryColorLabel', axis=1, inplace=True, errors='ignore')

animal_test.drop('Intake Time', axis=1, inplace=True)
animal_test.drop('Found Location', axis=1, inplace=True)
animal_test.drop('Breed', axis=1, inplace=True)
animal_test.drop('PrimaryColorLabel', axis=1, inplace=True, errors='ignore')

reduced_animal_data.drop('Intake Time', axis=1, inplace=True)
reduced_animal_data.drop('Found Location', axis=1, inplace=True)
reduced_animal_data.drop('Breed', axis=1, inplace=True)
reduced_animal_data.drop('Color', axis=1, inplace=True)

reduced_animal_test.drop('Intake Time', axis=1, inplace=True)
reduced_animal_test.drop('Found Location', axis=1, inplace=True)
reduced_animal_test.drop('Breed', axis=1, inplace=True)
reduced_animal_test.drop('Color', axis=1, inplace=True, errors='ignore')

animal_data.sample(5)

Unnamed: 0,Intake Type,Intake Condition,Outcome Type,Cat,Sterilized,Male,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded
63729,Stray,Normal,Transfer,True,False,True,0.076923,True,False,False,False,False,18
18070,Stray,Normal,Return to Owner,False,False,True,0.5,False,False,False,False,False,57
104444,Stray,Normal,Adoption,False,False,False,0.166667,False,False,False,False,False,46
9971,Stray,Normal,Died,True,False,True,0.038462,True,False,False,False,False,39
29076,Stray,Injured,Adoption,True,False,True,2.0,True,False,False,False,False,12


Lastly, we just need to one-hot encode Intake Type and Intake Condition in the original training and test sets (the versions where we did not group those features into fewer categories to reduce dimensionality).

In [936]:
outcome = animal_data["Outcome Type"] # separate outcome type
one_hot_encoded = pd.get_dummies(animal_data.drop(columns=["Outcome Type"]), drop_first=True) # one-hot encode the rest
animal_data = pd.concat([one_hot_encoded, outcome], axis=1) # add back outcome type for training later
animal_test = pd.get_dummies(animal_test, drop_first=True)

Let's make sure that our different data sets are ready. We have two versions: one that keeps the original Intake Type and Intake Condition categories (one-hot encoded) alongside the engineered color features, and one where we grouped the categoricals into fewer buckets to reduce dimensionality.

In [937]:
animal_data.sample(2) # original categoricals one-hot encoded; color as pattern flags plus PrimaryColorEncoded

Unnamed: 0,Cat,Sterilized,Male,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,Intake Type_Euthanasia Request,Intake Type_Owner Surrender,Intake Type_Public Assist,Intake Type_Stray,Intake Type_Wildlife,Intake Condition_Agonal,Intake Condition_Behavior,Intake Condition_Congenital,Intake Condition_Feral,Intake Condition_Injured,Intake Condition_Med Attn,Intake Condition_Med Urgent,Intake Condition_Medical,Intake Condition_Neonatal,Intake Condition_Neurologic,Intake Condition_Normal,Intake Condition_Nursing,Intake Condition_Other,Intake Condition_Parvo,Intake Condition_Pregnant,Intake Condition_Sick,Intake Condition_Space,Intake Condition_Unknown,Outcome Type
66164,False,False,False,0.083333,False,False,False,False,False,20,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,Adoption
51343,False,False,True,6.0,False,False,False,False,False,31,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,Return to Owner


Corresponding test set:

In [938]:
animal_test.sample(2) # original categoricals one-hot encoded; color as pattern flags plus PrimaryColorEncoded

Unnamed: 0,Id,Sterilized,Male,Cat,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,Intake Type_Euthanasia Request,Intake Type_Owner Surrender,Intake Type_Public Assist,Intake Type_Stray,Intake Condition_Agonal,Intake Condition_Behavior,Intake Condition_Feral,Intake Condition_Injured,Intake Condition_Med Attn,Intake Condition_Med Urgent,Intake Condition_Medical,Intake Condition_Neonatal,Intake Condition_Normal,Intake Condition_Nursing,Intake Condition_Other,Intake Condition_Panleuk,Intake Condition_Parvo,Intake Condition_Pregnant,Intake Condition_Sick,Intake Condition_Space,Intake Condition_Unknown
25456,25457,False,False,False,0.333333,False,False,False,False,False,42,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
27683,27684,False,True,True,1.0,False,False,False,False,False,36,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
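The two samples above do not have identical columns: the training frame has `Intake Type_Wildlife`, `Intake Condition_Congenital`, and `Intake Condition_Neurologic`, while the test frame has `Intake Condition_Panleuk` instead. A model fitted on one layout cannot score the other directly, so the columns have to be aligned first. Here is a minimal sketch of one way to do that with `DataFrame.reindex`, using toy rows that are purely illustrative:

```python
import pandas as pd

# Toy frames (hypothetical rows) whose categories only partly overlap.
train = pd.DataFrame({"Intake Condition": ["Normal", "Injured", "Neurologic"]})
test = pd.DataFrame({"Intake Condition": ["Normal", "Panleuk", "Injured"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Reindex the test frame to the training columns: categories seen only in
# training are added as all-zero columns, and categories seen only in test
# are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

Dropping a test-only category such as Panleuk means those rows look like "none of the known conditions" to the model. That is a reasonable fallback, but it is a design choice worth keeping in mind.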


Training set with reduced dimensionality on categorical features using grouping:

In [939]:
reduced_animal_data.sample(2) # reduced intake condition and type

Unnamed: 0,Outcome Type,Cat,Sterilized,Male,Age,Condition_Behavioral,Condition_Med_Major,Condition_Med_Minor,Condition_Neonatal,Condition_Normal,Condition_Rare,Intake_Euthanasia,Intake_Owner Initiated,Intake_Public Assist,Intake_Stray
60519,Adoption,True,False,False,0.083333,False,False,False,False,True,False,False,False,False,True
40776,Adoption,False,False,False,0.25,False,False,False,False,True,False,False,False,False,True


Corresponding test set:

In [940]:
reduced_animal_test.sample(2) # reduced intake condition and type

Unnamed: 0,Id,Sterilized,Male,Cat,Age,Condition_Behavioral,Condition_Med_Major,Condition_Med_Minor,Condition_Neonatal,Condition_Normal,Condition_Rare,Intake_Euthanasia,Intake_Owner Initiated,Intake_Public Assist,Intake_Stray
264,265,False,True,True,0.038462,False,False,False,True,False,False,False,False,False,True
27118,27119,True,True,True,8.0,False,False,False,False,True,False,False,True,False,False


We have confirmed that our data sets are cleaned and ready for training.

Let's take a look at the overall outcome type frequencies in the training set. We expect our predictions to follow a similar distribution, with adoptions the most common and died and euthanasia the least common. However, this raises a concern about class imbalance: almost 50% of the training rows are adoptions, while fewer than 1% are deaths. It may therefore be difficult to get high recall on euthanasia and died, and we will have to explore ways to make the learning algorithm pay more attention to those classes.

In [941]:
# Print the frequency of each Outcome Type
outcome_frequencies = animal_data['Outcome Type'].value_counts() / len(animal_data) * 100
print("Frequency of each Outcome Type:")
print(outcome_frequencies)

Frequency of each Outcome Type:
Outcome Type
Adoption           49.519149
Transfer           31.508587
Return to Owner    14.932933
Euthanasia          3.102819
Died                0.936513
Name: count, dtype: float64
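One common starting point for countering this imbalance is to weight each class inversely to its frequency. The sketch below applies the "balanced" heuristic, weight = n_samples / (n_classes * class_count), to the percentages printed above. This is the same formula scikit-learn documents for `class_weight='balanced'`, but its use here is an illustration, not the weighting used in the model cells, which rely on milder hand-picked values.

```python
# Class shares (in percent) copied from the frequency table above.
outcome_share = {
    "Adoption": 49.519149,
    "Transfer": 31.508587,
    "Return to Owner": 14.932933,
    "Euthanasia": 3.102819,
    "Died": 0.936513,
}

n_classes = len(outcome_share)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count),
# which in terms of percentage shares is 100 / (n_classes * share_percent).
balanced_weights = {
    label: 100.0 / (n_classes * share) for label, share in outcome_share.items()
}

# Print from the lightest to the heaviest class weight.
for label, weight in sorted(balanced_weights.items(), key=lambda kv: kv[1]):
    print(f"{label:<16} {weight:6.2f}")
```

Fully balanced weights are far more aggressive on the rare classes than weights of 4 or 5, so they tend to trade precision on the common classes for recall on the rare ones. Values in between can be treated as a tuning knob.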


## **4. Model implementation comparisons and tuning:**

Let's try histogram-based gradient boosting, which bins the feature values and builds an ensemble of decision trees on the binned data. The scikit-learn library provides HistGradientBoostingClassifier for this. Because each tree in the ensemble corrects the errors of the previous ones, and because we can pass per-class sample weights, this approach gives us a way to push back against the class imbalance.

In [942]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score
from collections import Counter

# === 2. Separate labels and features ===
y = reduced_animal_data["Outcome Type"]
X = reduced_animal_data.drop(columns=["Outcome Type"])
X = X.fillna(0)

# === 3. Encode class labels ===
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# === 4. Train/Test split ===
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# === 5. Train classifier ===
# Assume you have label-encoded classes
class_counts = Counter(y_encoded)
total = sum(class_counts.values())

class_freqs = {cls: count / total for cls, count in class_counts.items()}
class_to_index = {label: i for i, label in enumerate(label_encoder.classes_)}

# we can play with these class weights to better balance the model and reduce the effects of class imbalance
class_weights = {
    class_to_index['Died']: 5.0,
    class_to_index['Euthanasia']: 4.0,
    class_to_index['Return to Owner']: 1.5,
    class_to_index['Transfer']: 1.0,
    class_to_index['Adoption']: 0.8
}

sample_weight = np.array([class_weights[label] for label in y_encoded])

clf = HistGradientBoostingClassifier(
    max_iter=200,
    max_depth=15,
    random_state=42
)

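# Note: this fits on all rows, including the held-out split evaluated below,
# so the reported scores are optimistic. Fitting on X_train / y_train (with
# sample weights built from y_train) would give an unbiased estimate.
clf.fit(X, y_encoded, sample_weight=sample_weight)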

# === 6. Predict and evaluate ===
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))


Accuracy: 0.6038143216984527
                 precision    recall  f1-score   support

       Adoption       0.66      0.74      0.70     10981
           Died       0.22      0.08      0.12       201
     Euthanasia       0.26      0.52      0.35       731
Return to Owner       0.52      0.63      0.57      3309
       Transfer       0.65      0.40      0.50      7010

       accuracy                           0.60     22232
      macro avg       0.46      0.47      0.45     22232
   weighted avg       0.62      0.60      0.60     22232



In [943]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score
from collections import Counter

# === 2. Separate labels and features ===
y = animal_data["Outcome Type"]
X = animal_data.drop(columns=["Outcome Type"])
X = X.fillna(0)

# === 3. Encode class labels ===
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# === 4. Train/Test split ===
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# === 5. Train classifier ===
# Assume you have label-encoded classes
class_counts = Counter(y_encoded)
total = sum(class_counts.values())

class_freqs = {cls: count / total for cls, count in class_counts.items()}
class_to_index = {label: i for i, label in enumerate(label_encoder.classes_)}

# we can play with these class weights to better balance the model and reduce the effects of class imbalance
class_weights = {
    class_to_index['Died']: 5.0,
    class_to_index['Euthanasia']: 4.0,
    class_to_index['Return to Owner']: 1.5,
    class_to_index['Transfer']: 1.0,
    class_to_index['Adoption']: 0.8
}

sample_weight = np.array([class_weights[label] for label in y_encoded])

clf = HistGradientBoostingClassifier(
    max_iter=200,
    max_depth=15,
    random_state=42
)

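# Note: this fits on all rows, including the held-out split evaluated below,
# so the reported scores are optimistic. Fitting on X_train / y_train (with
# sample weights built from y_train) would give an unbiased estimate.
clf.fit(X, y_encoded, sample_weight=sample_weight)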

# === 6. Predict and evaluate ===
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))


Accuracy: 0.6084023029866859
                 precision    recall  f1-score   support

       Adoption       0.66      0.74      0.70     10981
           Died       0.42      0.12      0.19       201
     Euthanasia       0.27      0.54      0.36       731
Return to Owner       0.52      0.64      0.57      3309
       Transfer       0.66      0.41      0.51      7010

       accuracy                           0.61     22232
      macro avg       0.50      0.49      0.46     22232
   weighted avg       0.62      0.61      0.60     22232



Hyperparameter tuning:

In [944]:
# from sklearn.model_selection import GridSearchCV
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.preprocessing import LabelEncoder
# import pandas as pd
# import numpy as np

# # === 1. Prepare training data ===
# y = animal_data["Outcome Type"]
# X = animal_data.drop(columns=["Outcome Type"])
# X = X.replace([np.inf, -np.inf], np.nan).fillna(0)

# # === 2. Encode class labels ===
# label_encoder = LabelEncoder()
# y_encoded = label_encoder.fit_transform(y)

# # === 3. Define parameter grid
# param_grid = {
#     'n_estimators': [50, 100, 150, 200],
#     'max_depth': [5, 10, 15, 20],
#     'min_samples_split': [2, 3, 4, 5],
#     'min_samples_leaf': [1, 2, 3, 4, 5]
# }

# # === 4. Setup grid search with 5-fold CV
# grid_search = GridSearchCV(
#     estimator=RandomForestClassifier(random_state=42, class_weight='balanced'),
#     param_grid=param_grid,
#     cv=5,
#     scoring='accuracy',
#     n_jobs=-1,
#     verbose=2
# )

# # === 5. Run the grid search
# grid_search.fit(X, y_encoded)

# # === 6. View best params and score
# print("Best Parameters:", grid_search.best_params_)
# print("Best Cross-Validated Accuracy:", grid_search.best_score_)


Now let's train the final model on the full training set, predict on the test set, and write the final predictions to a CSV file.

In [945]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import HistGradientBoostingClassifier
from collections import Counter

# === 1. Encode categorical features and prepare training data ===
X = pd.get_dummies(animal_data.drop(columns=["Outcome Type"]), dtype=np.uint8)
y = animal_data["Outcome Type"]

# Save the column structure of training features
feature_columns = X.columns

# === 2. Prepare test data and reindex to match training columns ===
X_test_final = pd.get_dummies(animal_test.drop(columns=["Id"]), dtype=np.uint8)
X_test_final = X_test_final.reindex(columns=feature_columns, fill_value=0)

# === 3. Encode class labels ===
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# === 4. Class weights and mappings ===
class_counts = Counter(y_encoded)
total = sum(class_counts.values())
class_to_index = {label: i for i, label in enumerate(label_encoder.classes_)}
class_weights = {
    class_to_index['Died']: 5.0,
    class_to_index['Euthanasia']: 4.0,
    class_to_index['Return to Owner']: 1.5,
    class_to_index['Transfer']: 1.0,
    class_to_index['Adoption']: 0.8
}
sample_weight = np.array([class_weights[label] for label in y_encoded])

# === 5. Train classifier ===
clf = HistGradientBoostingClassifier(
    max_iter=200,
    max_depth=15,
    random_state=42
)
clf.fit(X, y_encoded, sample_weight=sample_weight)

# === 6. Predict on test set ===
y_test_pred_encoded = clf.predict(X_test_final)
y_test_pred = label_encoder.inverse_transform(y_test_pred_encoded)

# === 7. Save predictions to CSV ===
submission = pd.DataFrame({
    "Id": animal_test["Id"],
    "Outcome Type": y_test_pred
})
submission = submission.sort_values("Id").reset_index(drop=True)
submission.to_csv("boosting_predictions.csv", index=False)
print("✅ Saved predictions to 'boosting_predictions.csv'")

# === 8. Show prediction distribution as percentages ===
prediction_distribution = submission["Outcome Type"].value_counts(normalize=True) * 100
print("\nPrediction Distribution (%):")
print(prediction_distribution.round(2).sort_index().to_string())

X.sample(5)


✅ Saved predictions to 'boosting_predictions.csv'

Prediction Distribution (%):
Outcome Type
Adoption           55.06
Died                0.33
Euthanasia          6.54
Return to Owner    18.29
Transfer           19.78


Unnamed: 0,Cat,Sterilized,Male,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,Intake Type_Euthanasia Request,Intake Type_Owner Surrender,Intake Type_Public Assist,Intake Type_Stray,Intake Type_Wildlife,Intake Condition_Agonal,Intake Condition_Behavior,Intake Condition_Congenital,Intake Condition_Feral,Intake Condition_Injured,Intake Condition_Med Attn,Intake Condition_Med Urgent,Intake Condition_Medical,Intake Condition_Neonatal,Intake Condition_Neurologic,Intake Condition_Normal,Intake Condition_Nursing,Intake Condition_Other,Intake Condition_Parvo,Intake Condition_Pregnant,Intake Condition_Sick,Intake Condition_Space,Intake Condition_Unknown
78940,False,True,True,3.0,False,False,False,False,False,15,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
73349,True,False,False,0.057692,True,False,False,False,False,18,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
26061,True,True,True,2.0,False,False,False,False,False,7,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False
7224,True,False,True,0.038462,False,False,False,False,False,2,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
50914,True,True,False,10.0,False,False,False,False,False,7,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False


Work in Progress: Potentially train cats and dogs separately?

In [946]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score
from collections import Counter

# === 1. Prepare features and labels ===
y = animal_data["Outcome Type"]
X = animal_data.drop(columns=["Outcome Type"])
X = X.fillna(0)

# === 2. Encode labels ===
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# === 3. Split by species
cat_mask = X["Cat"] == True
dog_mask = X["Cat"] == False

X_cat = X[cat_mask]
y_cat = y_encoded[cat_mask]

X_dog = X[dog_mask]
y_dog = y_encoded[dog_mask]

# === 4. Train/Test split for both sets
X_cat_train, X_cat_test, y_cat_train, y_cat_test = train_test_split(X_cat, y_cat, test_size=0.2, random_state=42)
X_dog_train, X_dog_test, y_dog_train, y_dog_test = train_test_split(X_dog, y_dog, test_size=0.2, random_state=42)

# === 5. Define class weights (same logic for both models)
class_to_index = {label: i for i, label in enumerate(label_encoder.classes_)}
class_weights = {
    class_to_index['Died']: 5.0,
    class_to_index['Euthanasia']: 4.0,
    class_to_index['Return to Owner']: 1.5,
    class_to_index['Transfer']: 1.0,
    class_to_index['Adoption']: 0.8
}

# === 6. Create sample weights
sample_weight_cat = np.array([class_weights[label] for label in y_cat_train])
sample_weight_dog = np.array([class_weights[label] for label in y_dog_train])

# === 7. Train models
clf_cat = HistGradientBoostingClassifier(max_iter=200, max_depth=15, random_state=42)
clf_dog = HistGradientBoostingClassifier(max_iter=200, max_depth=15, random_state=42)

clf_cat.fit(X_cat_train, y_cat_train, sample_weight=sample_weight_cat)
clf_dog.fit(X_dog_train, y_dog_train, sample_weight=sample_weight_dog)

# === 8. Predict & Evaluate
y_cat_pred = clf_cat.predict(X_cat_test)
y_dog_pred = clf_dog.predict(X_dog_test)

print("=== 🐱 Cat Model Evaluation ===")
print("Accuracy:", accuracy_score(y_cat_test, y_cat_pred))
print(classification_report(y_cat_test, y_cat_pred, target_names=label_encoder.classes_))

print("=== 🐶 Dog Model Evaluation ===")
print("Accuracy:", accuracy_score(y_dog_test, y_dog_pred))
print(classification_report(y_dog_test, y_dog_pred, target_names=label_encoder.classes_))


=== 🐱 Cat Model Evaluation ===
Accuracy: 0.6329851345922057
                 precision    recall  f1-score   support

       Adoption       0.71      0.70      0.70      4922
           Died       0.07      0.04      0.05       124
     Euthanasia       0.27      0.60      0.37       399
Return to Owner       0.40      0.42      0.41       440
       Transfer       0.66      0.60      0.63      4071

       accuracy                           0.63      9956
      macro avg       0.42      0.47      0.43      9956
   weighted avg       0.65      0.63      0.64      9956

=== 🐶 Dog Model Evaluation ===
Accuracy: 0.5845552297165201
                 precision    recall  f1-score   support

       Adoption       0.62      0.79      0.69      5977
           Died       0.06      0.03      0.04        65
     Euthanasia       0.21      0.29      0.24       294
Return to Owner       0.57      0.66      0.61      2974
       Transfer       0.54      0.14      0.23      2966

       accuracy     
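To judge whether the split helps overall, the two per-species accuracies can be pooled into a single number by weighting each with its held-out support. The worked example below uses only figures from the two reports above. The dog support is not printed directly, so it is taken as the sum of the per-class supports.

```python
# Accuracy and held-out support copied from the two reports above.
# Dog support is the sum of the per-class supports in the dog report.
cat_accuracy, cat_support = 0.6329851345922057, 9956
dog_accuracy, dog_support = 0.5845552297165201, 5977 + 65 + 294 + 2974 + 2966

# Correctly classified rows per species, pooled over both held-out sets.
correct = cat_accuracy * cat_support + dog_accuracy * dog_support
combined_accuracy = correct / (cat_support + dog_support)

print(f"Held-out rows: {cat_support + dog_support}")
print(f"Combined accuracy: {combined_accuracy:.4f}")
```

The pooled figure is only a rough comparison with the single-model accuracy, because the per-species splits are drawn separately and do not contain exactly the same held-out rows.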

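Feature cleaning and engineering, take #2. The model we had earlier didn't perform too poorly, but class imbalance was an obvious issue, so let's build on it: we train one-vs-rest (OvR) classifiers for the rare Died and Euthanasia classes and let them override the multiclass prediction whenever their predicted probability exceeds a tuned threshold.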

In [947]:
animal_data.sample(5) # sample some data after transformations

Unnamed: 0,Cat,Sterilized,Male,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,Intake Type_Euthanasia Request,Intake Type_Owner Surrender,Intake Type_Public Assist,Intake Type_Stray,Intake Type_Wildlife,Intake Condition_Agonal,Intake Condition_Behavior,Intake Condition_Congenital,Intake Condition_Feral,Intake Condition_Injured,Intake Condition_Med Attn,Intake Condition_Med Urgent,Intake Condition_Medical,Intake Condition_Neonatal,Intake Condition_Neurologic,Intake Condition_Normal,Intake Condition_Nursing,Intake Condition_Other,Intake Condition_Parvo,Intake Condition_Pregnant,Intake Condition_Sick,Intake Condition_Space,Intake Condition_Unknown,Outcome Type
55154,False,True,False,16.0,False,False,False,False,False,43,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,Return to Owner
40643,True,True,False,0.583333,True,False,False,False,False,18,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,Transfer
15916,False,False,True,1.0,False,False,False,False,False,57,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,Adoption
39618,False,True,True,10.0,False,False,False,False,False,51,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,Adoption
29598,True,False,False,0.166667,False,False,False,False,False,2,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,Adoption


In [948]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score

# === 1. Prepare data ===
y = animal_data["Outcome Type"]
X = animal_data.drop(columns=["Outcome Type"]).fillna(0)

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
class_labels = label_encoder.classes_
class_to_index = {label: i for i, label in enumerate(class_labels)}

X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# === 2. Train multiclass model ===
clf_multi = HistGradientBoostingClassifier(max_iter=200, max_depth=15, random_state=42)
clf_multi.fit(X_train, y_train)
probs_multi = clf_multi.predict_proba(X_test)

# === 3. Train OvR classifiers ===
def train_ovr(target_label):
    y_binary = (y_train == class_to_index[target_label]).astype(int)
    clf = HistGradientBoostingClassifier(max_iter=200, max_depth=15, random_state=42)
    clf.fit(X_train, y_binary)
    return clf

clf_died = train_ovr("Died")
clf_euth = train_ovr("Euthanasia")

# === 4. Get OvR probabilities ===
prob_died = clf_died.predict_proba(X_test)[:, 1]
prob_euth = clf_euth.predict_proba(X_test)[:, 1]

# === 5. Define threshold search function ===
def find_best_ovr_thresholds(probs_multi, prob_died, prob_euth, y_true, class_to_index, class_labels):
    thresholds = np.arange(0.01, 0.51, 0.01)
    best_thresholds = {}
    best_f1 = 0
    best_combo = (0.15, 0.18)

    for t_died in thresholds:
        for t_euth in thresholds:
            y_pred = np.argmax(probs_multi, axis=1)
            for i in range(len(y_pred)):
                if prob_died[i] > t_died:
                    y_pred[i] = class_to_index["Died"]
                elif prob_euth[i] > t_euth:
                    y_pred[i] = class_to_index["Euthanasia"]
            score = f1_score(y_true, y_pred, average='macro')
            if score > best_f1:
                best_f1 = score
                best_combo = (t_died, t_euth)

    return {
        "best_thresholds": {
            "Died": round(best_combo[0], 2),
            "Euthanasia": round(best_combo[1], 2)
        },
        "best_macro_f1": round(best_f1, 4)
    }

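# === 6. Search for optimal thresholds ===
# Note: the thresholds are tuned on the same split that is used for the final
# evaluation below, so the reported scores are somewhat optimistic.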
result = find_best_ovr_thresholds(
    probs_multi=probs_multi,
    prob_died=prob_died,
    prob_euth=prob_euth,
    y_true=y_test,
    class_to_index=class_to_index,
    class_labels=class_labels
)

print("Best Thresholds:")
print(result["best_thresholds"])
print("Best Macro F1 Score:")
print(result["best_macro_f1"])

# === 7. Apply optimal thresholds to predictions ===
optimal_died_thresh = result["best_thresholds"]["Died"]
optimal_euth_thresh = result["best_thresholds"]["Euthanasia"]

y_pred_final = np.argmax(probs_multi, axis=1)
for i in range(len(y_pred_final)):
    if prob_died[i] > optimal_died_thresh:
        y_pred_final[i] = class_to_index["Died"]
    elif prob_euth[i] > optimal_euth_thresh:
        y_pred_final[i] = class_to_index["Euthanasia"]

# === 8. Final evaluation ===
print("\n=== Final Evaluation with Optimized Thresholds ===")
print("Accuracy:", accuracy_score(y_test, y_pred_final))
print(classification_report(y_test, y_pred_final, target_names=class_labels))


Best Thresholds:
{'Died': np.float64(0.08), 'Euthanasia': np.float64(0.2)}
Best Macro F1 Score:
0.4361

=== Final Evaluation with Optimized Thresholds ===
Accuracy: 0.6114159769701332
                 precision    recall  f1-score   support

       Adoption       0.63      0.79      0.70     10981
           Died       0.09      0.08      0.09       201
     Euthanasia       0.35      0.33      0.34       731
Return to Owner       0.59      0.53      0.56      3309
       Transfer       0.63      0.41      0.50      7010

       accuracy                           0.61     22232
      macro avg       0.46      0.43      0.44     22232
   weighted avg       0.61      0.61      0.60     22232



In [949]:
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder

# 1. Label encode main target
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(animal_data["Outcome Type"])
class_to_index = {label: i for i, label in enumerate(label_encoder.classes_)}

# 2. One-hot encode data
X = pd.get_dummies(animal_data.drop(columns=["Outcome Type"]), dtype=np.uint8)
X_test_final = pd.get_dummies(animal_test.drop(columns=["Id"]), dtype=np.uint8)
X_test_final = X_test_final.reindex(columns=X.columns, fill_value=0)

# 3. Train multiclass model (no class weights)
clf_multi = HistGradientBoostingClassifier(max_iter=200, max_depth=15, random_state=42)
clf_multi.fit(X, y_encoded)
probs_multi = clf_multi.predict_proba(X_test_final)

# 4. Train OvR binary classifiers
def train_ovr_classifier(class_name):
    y_binary = (animal_data["Outcome Type"] == class_name).astype(int)
    clf = HistGradientBoostingClassifier(max_iter=200, max_depth=15, random_state=42)
    clf.fit(X, y_binary)
    return clf

clf_died = train_ovr_classifier("Died")
clf_euth = train_ovr_classifier("Euthanasia")

# 5. Predict probabilities from OvR
prob_died = clf_died.predict_proba(X_test_final)[:, 1]
prob_euth = clf_euth.predict_proba(X_test_final)[:, 1]

# 6. Multiclass prediction (initial)
y_test_pred_encoded = np.argmax(probs_multi, axis=1)
y_test_pred = label_encoder.inverse_transform(y_test_pred_encoded)

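# 7. Override the multiclass prediction when an OvR probability exceeds its threshold
#    ("Died" takes priority). The thresholds are hard-coded, slightly below the values
#    found by the validation search above (0.08 and 0.20).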
thresholds = {
    "Died": 0.07,
    "Euthanasia": 0.18
}
for i in range(len(y_test_pred)):
    if prob_died[i] > thresholds["Died"]:
        y_test_pred[i] = "Died"
    elif prob_euth[i] > thresholds["Euthanasia"]:
        y_test_pred[i] = "Euthanasia"

# 8. Save to CSV
submission = pd.DataFrame({
    "Id": animal_test["Id"],
    "Outcome Type": y_test_pred
})
submission = submission.sort_values("Id").reset_index(drop=True)
submission.to_csv("ovr_predictions.csv", index=False)
print("✅ Saved combined OvR + multiclass predictions to 'ovr_predictions.csv'")

# 9. Show distribution
distribution = submission["Outcome Type"].value_counts(normalize=True) * 100
print("\nPrediction Distribution (%):")
print(distribution.sort_index().round(2).to_string())


✅ Saved combined OvR + multiclass predictions to 'ovr_predictions.csv'

Prediction Distribution (%):
Outcome Type
Adoption           62.43
Died                1.24
Euthanasia          3.40
Return to Owner    13.39
Transfer           19.54
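
Step 2 of the cell above one-hot encodes the training and test sets separately and then aligns the test columns with `reindex`. The sketch below shows what that alignment does on a toy example; the small frames and their values are hypothetical, only the `get_dummies` and `reindex` pattern is taken from the cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames: the test set contains a category ("Livestock")
# never seen in training and lacks "Cat" and "Bird".
toy_train = pd.DataFrame({"Animal Type": ["Dog", "Cat", "Bird"], "Age": [2, 5, 1]})
toy_test = pd.DataFrame({"Animal Type": ["Dog", "Livestock"], "Age": [3, 7]})

toy_X_train = pd.get_dummies(toy_train, dtype=np.uint8)
toy_X_test = pd.get_dummies(toy_test, dtype=np.uint8)

# Keep exactly the training columns, in training order:
# - columns missing from the test set are added and filled with 0
# - columns that exist only in the test set are dropped
toy_X_test_aligned = toy_X_test.reindex(columns=toy_X_train.columns, fill_value=0)

print(toy_X_test_aligned)
```

Note that a category seen only in the test set is silently dropped, so such a row ends up with all zeros for that feature.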


In [950]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from imblearn.over_sampling import SMOTE

# === 1. Prepare data ===
y = animal_data["Outcome Type"]
X = animal_data.drop(columns=["Outcome Type"]).fillna(0)

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
class_labels = label_encoder.classes_

# === 2. Train/Test split ===
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# === 3. Apply SMOTE to the training set only ===
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print("Original training distribution:", np.bincount(y_train))
print("After SMOTE:", np.bincount(y_train_resampled))

# === 4. Train a RandomForest classifier on the SMOTE-resampled training set ===
clf = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42)
clf.fit(X_train_resampled, y_train_resampled)

# === 5. Evaluate on the original (non-resampled) test split ===
y_pred = clf.predict(X_test)
print("\nAccuracy:", accuracy_score(y_test, y_pred))
print("\n=== Classification Report (SMOTE) ===")
print(classification_report(y_test, y_pred, target_names=class_labels))


Original training distribution: [44063   840  2718 13290 28014]
After SMOTE: [44063 44063 44063 44063 44063]

Accuracy: 0.5032385750269881

=== Classification Report (SMOTE) ===
                 precision    recall  f1-score   support

       Adoption       0.70      0.53      0.60     10981
           Died       0.04      0.29      0.06       201
     Euthanasia       0.21      0.50      0.29       731
Return to Owner       0.38      0.80      0.52      3309
       Transfer       0.63      0.33      0.43      7010

       accuracy                           0.50     22232
      macro avg       0.39      0.49      0.38     22232
   weighted avg       0.61      0.50      0.52     22232
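
The gap between the macro and weighted averages in the report above comes from how each one treats class size. As a worked check, the sketch below recomputes both F1 averages from the per-class values printed in the report; the variable names are illustrative.

```python
# Per-class F1 scores and supports copied from the SMOTE classification report
f1_scores = {
    "Adoption": 0.60,
    "Died": 0.06,
    "Euthanasia": 0.29,
    "Return to Owner": 0.52,
    "Transfer": 0.43,
}
supports = {
    "Adoption": 10981,
    "Died": 201,
    "Euthanasia": 731,
    "Return to Owner": 3309,
    "Transfer": 7010,
}

# Macro average: every class counts equally, regardless of size
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class is weighted by its number of samples
total_support = sum(supports.values())
weighted_f1 = sum(f1_scores[c] * supports[c] for c in f1_scores) / total_support

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

Because "Died" and "Euthanasia" together make up only about 4% of the samples, their low F1 scores pull the macro average down much more than the weighted one.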



In [951]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

# === 1. Prepare training data ===
y = animal_data["Outcome Type"]
X = animal_data.drop(columns=["Outcome Type"]).fillna(0)

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
class_labels = label_encoder.classes_

# === 2. Apply SMOTE to the full training data ===
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y_encoded)

print("Original training distribution:", np.bincount(y_encoded))
print("After SMOTE:", np.bincount(y_resampled))

# === 3. Prepare test data ===
X_test_final = animal_test.drop(columns=["Id"], errors="ignore").fillna(0)

# Align test columns with the training columns; any column missing from the test set is filled with 0
X_test_final = X_test_final.reindex(columns=X.columns, fill_value=0)

# === 4. Train the classifier on full SMOTE-augmented data ===
clf = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42)
clf.fit(X_resampled, y_resampled)

# === 5. Predict and format results ===
y_test_pred_encoded = clf.predict(X_test_final)
y_test_pred = label_encoder.inverse_transform(y_test_pred_encoded)

submission = pd.DataFrame({
    "Id": animal_test["Id"],
    "Outcome Type": y_test_pred
})
submission = submission.sort_values("Id").reset_index(drop=True)
submission.to_csv("smote_predictions.csv", index=False)
print("✅ Predictions saved to 'smote_predictions.csv'")

# === 6. Show prediction distribution ===
distribution = submission["Outcome Type"].value_counts(normalize=True) * 100
print("\n=== Prediction Distribution (%): ===")
print(distribution.sort_index().round(2).to_string())


Original training distribution: [55044  1041  3449 16599 35024]
After SMOTE: [55044 55044 55044 55044 55044]
✅ Predictions saved to 'smote_predictions.csv'

=== Prediction Distribution (%): ===
Outcome Type
Adoption           36.88
Died                7.55
Euthanasia          8.42
Return to Owner    31.49
Transfer           15.67


The best-performing approach was SMOTE without the engineered temperature feature.