# **CS 363M Final Project Spring 2025**

## Chenyi Wang, Bhuvan Kannaeganti, Suyog Valsangkar

### **Overview**

For the project in this class, you will participate in a machine learning competition where you’ll apply your ML skills to a real-world dataset. You may work individually or in teams of up to 3 students. 

The dataset for this competition comes from the Austin Animal Center, the largest no-kill animal shelter in the United States. It contains historical records of animals that have entered the shelter, including details such as species, breed, age, intake type, medical condition, and other attributes. Each animal in the dataset has a recorded outcome, which represents what eventually happened to the animal after entering the shelter.

Your goal in this competition is to build a machine learning model that predicts the final outcome of each animal admitted to the shelter, based on its intake characteristics. The possible outcomes are:

- **Adopted**: The animal was placed into a new home.
- **Return to Owner**: The animal was reclaimed by its original owner.
- **Euthanasia**: The animal was humanely euthanized due to medical or behavioral concerns.
- **Died**: The animal passed away while in the shelter’s care.
- **Transfer**: The animal was moved to another shelter or rescue organization.

By accurately predicting these outcomes, your model can help identify factors that influence an animal's journey through the shelter system and potentially aid in improving adoption and survival rates, shelter policies, or allocation of resources.


## **Code and Analysis Below:**

## 1. Import Datasets

Import the training and test set from their respective csv files.

In [390]:
import pandas as pd

animal_data = pd.read_csv('train.csv')
animal_test = pd.read_csv('test.csv')

pd.set_option('display.max_columns', None) # show all columns
animal_data.sample(5) # sample some data

Unnamed: 0,Id,Name,Intake Time,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Outcome Time,Date of Birth,Outcome Type
6335,A680906,*Barney,06/09/2014 04:45:00 PM,E Martin Luther King Jr Blvd & Tillery St in A...,Stray,Normal,Dog,Intact Male,1 year,Yorkshire Terrier Mix,Black/Tan,06/17/2014 01:26:00 PM,06/09/2013,Transfer
87395,A876877,Milo,03/21/2023 01:11:00 PM,18809 Littig Road in Manor (TX),Stray,Normal,Dog,Intact Male,1 year,Labrador Retriever,White,06/11/2023 06:14:00 PM,03/21/2022,Adoption
27330,A668501,Pierre,12/04/2013 02:52:00 PM,10502 Ponder Lane in Austin (TX),Stray,Normal,Dog,Intact Male,2 years,Miniature Poodle Mix,White,12/09/2013 03:47:00 PM,12/04/2011,Adoption
22456,A751455,Carlos,06/08/2017 01:35:00 PM,1802 Tartar Way in Austin (TX),Owner Surrender,Normal,Dog,Intact Male,6 years,Mastiff Mix,Brown/White,06/20/2017 06:27:00 PM,05/28/2011,Adoption
39620,A841053,*Duchess,08/19/2021 04:10:00 PM,Von Quintus And Maha Loop in Travis (TX),Stray,Normal,Cat,Intact Female,3 weeks,Domestic Shorthair,Torbie,11/20/2021 07:16:00 AM,07/29/2021,Adoption


## 2. Data Cleaning

Sampling the data shows that some training features, such as Name and Outcome Time, do not exist in the test set. We cannot train on information that will be unavailable at prediction time, so we drop them. Id and Date of Birth do exist in the test set, but we drop them from the training data as well: the Ids are misaligned between the two files, and Date of Birth is irrelevant to training.
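
The comparison described above can also be made explicit by diffing the column sets instead of reading them off a sample. The following is a minimal sketch: `columns_missing_from_test` is a hypothetical helper, and the two small frames are invented stand-ins that only borrow a few column names from the real files.

```python
import pandas as pd

def columns_missing_from_test(train, test, target):
    """Return training columns (other than the target) that the test set lacks."""
    return [c for c in train.columns if c not in test.columns and c != target]

# Invented stand-in rows; only the column names mirror the real data.
toy_train = pd.DataFrame({
    "Id": ["A1"],
    "Name": ["Rex"],
    "Intake Time": ["01/01/2020 10:00:00 AM"],
    "Outcome Time": ["01/05/2020 10:00:00 AM"],
    "Outcome Type": ["Adoption"],
})
toy_test = pd.DataFrame({"Id": [1], "Intake Time": ["1/2/20 9:30"]})

missing = columns_missing_from_test(toy_train, toy_test, target="Outcome Type")
print(missing)
```

On the real data the same call would take `animal_data` and `animal_test` right after loading, before any columns are dropped.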

In [391]:
# drop name feature entirely
animal_data.drop("Name", axis=1, inplace=True)

# drop id feature entirely
animal_data.drop("Id", axis=1, inplace=True)

# drop outcome time
animal_data.drop("Outcome Time", axis=1, inplace=True)

# drop birth date
animal_data.drop("Date of Birth", axis=1, inplace=True)

# verify dropped features
animal_data.sample(5)

Unnamed: 0,Intake Time,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Outcome Type
60066,05/14/2014 11:57:00 PM,2009 Kirksey Dr in Austin (TX),Stray,Nursing,Cat,Intact Female,3 days,Domestic Medium Hair Mix,Tortie,Transfer
10687,10/11/2014 05:49:00 PM,Austin (TX),Euthanasia Request,Sick,Dog,Neutered Male,7 years,Pit Bull,Blue/White,Euthanasia
5686,02/12/2017 03:01:00 PM,4527 Ave C in Austin (TX),Stray,Normal,Dog,Neutered Male,11 months,Australian Cattle Dog,Black/White,Return to Owner
110530,07/27/2024 08:36:00 AM,Austin (TX),Stray,Neonatal,Cat,Intact Male,1 weeks,Domestic Shorthair,Brown Tabby,Transfer
37609,02/18/2023 06:48:00 PM,Moore Road And S Fm 973 in Travis (TX),Stray,Other,Dog,Intact Male,1 month,Australian Shepherd/Plott Hound,Brown Brindle/White,Adoption


Both the training and test sets contain only two animal types. Let's make this a single binary feature so we don't one-hot encode something that one boolean column can represent.

In [392]:
print("Unique values in 'Animal Type' in TRAINING SET:", animal_data['Animal Type'].unique())
print("Unique values in 'Animal Type' in TEST SET:", animal_test['Animal Type'].unique())

Unique values in 'Animal Type' in TRAINING SET: ['Dog' 'Cat']
Unique values in 'Animal Type' in TEST SET: ['Dog' 'Cat']


In [393]:
# transform Animal Type into a binary column: Cat -> True, Dog -> False
animal_data['Cat'] = animal_data['Animal Type'].apply(lambda x: True if x == 'Cat' else False)
animal_data.drop('Animal Type', axis=1, inplace=True)

animal_data.sample(5)

Unnamed: 0,Intake Time,Found Location,Intake Type,Intake Condition,Sex upon Intake,Age upon Intake,Breed,Color,Outcome Type,Cat
23718,09/08/2021 11:33:00 AM,4705 Leather Leaf Trail in Austin (TX),Stray,Normal,Neutered Male,10 years,Domestic Shorthair Mix,Orange Tabby,Transfer,True
59818,01/22/2023 11:19:00 AM,5806 Encinal Cv in Austin (TX),Owner Surrender,Normal,Intact Female,1 year,Domestic Shorthair,Brown Tabby/White,Transfer,True
58905,02/05/2014 10:58:00 AM,Austin (TX),Owner Surrender,Normal,Intact Male,1 month,Domestic Shorthair Mix,Orange Tabby,Adoption,True
76763,08/31/2019 01:31:00 PM,Austin (TX),Stray,Nursing,Intact Female,1 weeks,German Shepherd/Australian Kelpie,White/Tricolor,Adoption,False
20309,07/17/2014 12:00:00 PM,Austin (TX),Owner Surrender,Normal,Intact Male,1 year,Siberian Husky/German Shepherd,Red/White,Adoption,False


Sex upon Intake packs two pieces of information into one string: sex and sterilization status. We can break it into two separate binary features, again avoiding one-hot encoding.

In [394]:
animal_data["Sex upon Intake"].sample(5)

40612    Intact Female
52951    Neutered Male
58007    Intact Female
43479    Neutered Male
3437     Intact Female
Name: Sex upon Intake, dtype: object

In [395]:
# split "Sex upon Intake" into sterilization status (first token) and sex (second token)
sex_sterile_status = animal_data['Sex upon Intake'].str.split(' ', n=1, expand=True)

# map the status token to a flag; "Unknown" and any unexpected value count as not sterilized
# (stored as the strings 'True'/'False' here and converted to booleans in the test-set cell below)
animal_data['Sterilized'] = sex_sterile_status[0].map({
    'Neutered': 'True',
    'Spayed': 'True',
    'Intact': 'False',
    'Unknown': 'False'
}).fillna('False')  # unexpected values, let's assume not sterilized

# keep sex binary with a "Male" feature: Female -> False, everything else -> True
# (rows with unknown sex have no second token, so they also end up as True)
animal_data['Male'] = sex_sterile_status[1].apply(lambda x: False if x == 'Female' else True)
animal_data.drop("Sex upon Intake", axis=1, inplace=True)

animal_data.sample(5)

Unnamed: 0,Intake Time,Found Location,Intake Type,Intake Condition,Age upon Intake,Breed,Color,Outcome Type,Cat,Sterilized,Male
82017,10/19/2021 12:44:00 PM,Austin (TX),Owner Surrender,Normal,1 year,Domestic Shorthair,Tortie,Adoption,True,False,False
8992,02/06/2017 09:36:00 AM,5910 Ed Bluestein Boulevard in Austin (TX),Stray,Normal,5 months,Labrador Retriever Mix,Cream,Adoption,False,False,False
29848,04/04/2015 11:46:00 AM,812 And Hwy 183 in Austin (TX),Stray,Normal,3 months,German Shepherd Mix,Sable,Transfer,False,False,True
32862,11/17/2023 03:57:00 PM,6601 Rialto Blvd in Austin (TX),Stray,Normal,3 years,Domestic Shorthair,Blue,Adoption,True,True,True
1522,06/11/2014 01:15:00 PM,Colony Creek And Galewood in Austin (TX),Stray,Normal,2 years,Chihuahua Shorthair Mix,Tan,Adoption,False,False,True


The Age upon Intake feature mixes units (years, months, weeks, days) when describing the age of the animal at the time it entered the shelter. We need a single unit so that ages can be compared directly during training.

In [396]:
animal_data['Age upon Intake'].head(10)

0      8 years
1    11 months
2      2 years
3      2 years
4      6 years
5     6 months
6      2 years
7      4 weeks
8      4 weeks
9     5 months
Name: Age upon Intake, dtype: object

We can write a helper function that parses the age string and converts it to years, expressed as a float, to keep this feature consistent.

In [397]:
# helper to convert age upon intake to years
def age_to_years(age_str):
    if pd.isna(age_str):
        return None
    
    number, unit = age_str.split()
    number = float(number)
    
    if "year" in unit:
        return number
    elif "month" in unit:
        return number / 12
    elif "week" in unit:
        return number / 52
    elif "day" in unit:
        return number / 365
    else:
        return None  # in case of an unexpected format

# convert age to years
animal_data['Age'] = animal_data['Age upon Intake'].apply(age_to_years)
animal_data.drop('Age upon Intake', axis=1, inplace=True)

# sample some data after transformations, verify age conversion
animal_data.sample(5)

Unnamed: 0,Intake Time,Found Location,Intake Type,Intake Condition,Breed,Color,Outcome Type,Cat,Sterilized,Male,Age
101161,07/19/2021 02:28:00 PM,209 West Wheeler Street in Austin (TX),Stray,Normal,German Shepherd,Black/Brown,Adoption,False,True,False,1.0
47786,10/28/2014 01:52:00 PM,Montopolis And Porter St in Austin (TX),Stray,Normal,Labrador Retriever/Anatol Shepherd,Tan/White,Transfer,False,False,False,0.416667
67880,04/18/2017 11:48:00 AM,Austin (TX),Stray,Nursing,Chihuahua Shorthair Mix,Tan,Adoption,False,False,False,0.005479
80700,11/10/2020 09:52:00 AM,Decker Lane And Canoga Avenue in Austin (TX),Stray,Normal,Maltese Mix,Cream,Adoption,False,False,True,2.0
3860,05/07/2022 06:57:00 PM,Austin (TX),Owner Surrender,Normal,Snowshoe Mix,Lynx Point,Adoption,True,True,True,7.0


Repeat the process for the test set.
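
Repeating the steps by hand makes it easy for the two sets to drift apart, so one option is to wrap the cleaning in a single function and call it on both frames. The sketch below is an assumption about how that could look, not the notebook's actual code: `clean_intake_frame` is a hypothetical helper, the toy rows are invented, and unparseable ages become NaN rather than being filled.

```python
import pandas as pd

def age_to_years(age_str):
    """Convert strings such as '3 weeks' to a float number of years (NaN if unparseable)."""
    if pd.isna(age_str):
        return float("nan")
    number, unit = age_str.split()
    number = float(number)
    if "year" in unit:
        return number
    if "month" in unit:
        return number / 12
    if "week" in unit:
        return number / 52
    if "day" in unit:
        return number / 365
    return float("nan")

def clean_intake_frame(df):
    """Hypothetical helper: apply the same cleaning steps to any intake frame."""
    out = df.copy()
    # First token is the sterilization status, second token is the sex.
    # reindex guarantees both columns exist even if no row has a second token.
    parts = out["Sex upon Intake"].str.split(" ", n=1, expand=True).reindex(columns=[0, 1])
    out["Sterilized"] = parts[0].isin(["Neutered", "Spayed"])
    out["Male"] = parts[1] != "Female"  # unknown sex ends up True, as in the cells above
    out["Cat"] = out["Animal Type"] == "Cat"
    out["Age"] = out["Age upon Intake"].apply(age_to_years)
    # Columns that only exist in one of the sets are skipped silently.
    to_drop = ["Name", "Outcome Time", "Date of Birth",
               "Sex upon Intake", "Animal Type", "Age upon Intake"]
    return out.drop(columns=to_drop, errors="ignore")

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Animal Type": ["Dog", "Cat", "Cat"],
    "Sex upon Intake": ["Neutered Male", "Intact Female", "Unknown"],
    "Age upon Intake": ["2 years", "6 months", None],
})
cleaned = clean_intake_frame(toy)
print(cleaned)
```

The `Id` column is left alone here because the notebook drops it from the training set only.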

In [398]:
# drop birth date
animal_test.drop("Date of Birth", axis=1, inplace=True)

# split "Sex upon Intake" into sterilization status (first token) and sex (second token)
sex_sterile_status = animal_test['Sex upon Intake'].str.split(' ', n=1, expand=True)

# map the status token to a flag; "Unknown" and any unexpected value count as not sterilized
animal_test['Sterilized'] = sex_sterile_status[0].map({
    'Neutered': True,
    'Spayed': True,
    'Intact': False,
    'Unknown': False
}).fillna(False)  # unexpected values, let's assume not sterilized

# the training flags were stored as the strings 'True'/'False' earlier; convert them to booleans
# (astype(str) keeps this line safe to re-run once the column is already boolean)
animal_data['Sterilized'] = animal_data['Sterilized'].astype(str).map({'True': True, 'False': False})

# keep sex binary with a "Male" feature: Female -> False, everything else -> True
animal_test['Male'] = sex_sterile_status[1].apply(lambda x: False if x == 'Female' else True)
animal_test.drop("Sex upon Intake", axis=1, inplace=True)

# transform Animal Type into a binary column: Cat -> True, Dog -> False
animal_test['Cat'] = animal_test['Animal Type'].apply(lambda x: True if x == 'Cat' else False)
animal_test.drop('Animal Type', axis=1, inplace=True)

# convert age to years
animal_test['Age'] = animal_test['Age upon Intake'].apply(age_to_years)
animal_test['Age'] = animal_test['Age'].fillna(0)  # fill NaN with 0
animal_test.drop('Age upon Intake', axis=1, inplace=True)

# sample some data after transformations, verify age conversion
animal_test.sample(5)

Unnamed: 0,Id,Intake Time,Found Location,Intake Type,Intake Condition,Breed,Color,Sterilized,Male,Cat,Age
24547,24548,7/2/23 12:36,Austin (TX),Stray,Normal,Domestic Shorthair,Black,False,True,True,0.0
9435,9436,4/8/21 10:43,Travis (TX),Owner Surrender,Normal,American Bulldog Mix,Brown Brindle/White,True,True,False,6.0
25543,25544,9/22/18 17:19,Manor (TX),Owner Surrender,Normal,Labrador Retriever Mix,Tan/White,False,True,False,0.083333
4369,4370,3/22/19 7:49,1315 E 13Th St in Austin (TX),Stray,Injured,Domestic Shorthair Mix,Black,False,True,True,3.0
7419,7420,4/3/19 13:29,Manor (TX),Owner Surrender,Nursing,Pit Bull Mix,Brown Brindle/White,False,False,False,0.0


## **3. Feature Engineering:**

Let's look at how color relates to outcome. Many Color values have the form primary/secondary (for example Black/White), which we can potentially split.

In [399]:
# find top colors
top_colors = animal_data["Color"].value_counts().nlargest(10).index
filtered = animal_data[animal_data["Color"].isin(top_colors)]

print("     Outcome Distribution by Full Color (Top 10 Only)        ")
color_outcome = (
    filtered.groupby("Color")["Outcome Type"]
    .value_counts(normalize=False)
    .unstack()
    .fillna(0)
)

color_outcome["Count"] = color_outcome.sum(axis=1)
color_percent = color_outcome.div(color_outcome["Count"], axis=0) * 100
final = color_percent.round(1)
final["Count"] = color_outcome["Count"].astype(int)

print(final.to_string())

     Outcome Distribution by Full Color (Top 10 Only)        
Outcome Type       Adoption  Died  Euthanasia  Return to Owner  Transfer  Count
Color                                                                          
Black                  46.8   1.4         3.6             10.6      37.5   9674
Black/White            52.4   1.0         3.0             14.4      29.1  11620
Blue/White             51.8   0.5         3.9             17.0      26.8   3003
Brown Tabby            49.5   1.3         3.7              3.7      41.7   7708
Brown Tabby/White      51.7   1.5         3.3              4.5      38.9   3862
Brown/White            49.3   0.5         2.5             23.9      23.9   3457
Orange Tabby           49.0   1.2         3.4              4.8      41.7   3673
Tan/White              50.8   0.5         2.5             22.0      24.2   3178
White                  40.1   0.8         3.4             24.6      31.2   3945
White/Black            49.0   0.7         2.9             

There are some noticeable differences between colors, for example between Black and White. Let's take a more in-depth look at Color.

In [400]:
from scipy.stats import chi2_contingency

# Build contingency table
contingency_table = pd.crosstab(animal_data["Color"], animal_data["Outcome Type"])

# Run chi-squared test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-squared: {chi2:.2f}")
print(f"P-value: {p:.10f}")

Chi-squared: 10816.40
P-value: 0.0000000000


The test indicates a statistically significant relationship between color and outcome type. Let's process the color data further before we use it.
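
One caveat when reading the chi-squared result: the statistic grows with sample size, so on a dataset this large even a modest difference in proportions produces a p-value that prints as zero. The sketch below illustrates this with invented counts. It uses a plain NumPy implementation of the Pearson statistic (no continuity correction) rather than `chi2_contingency`, and the helper names are hypothetical.

```python
import numpy as np

def pearson_chi2(table):
    """Plain Pearson chi-squared statistic of a contingency table (no continuity correction)."""
    table = np.asarray(table, dtype=float)
    row_totals = table.sum(axis=1, keepdims=True)
    col_totals = table.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

def cramers_v_from_table(table):
    """Effect size derived from the chi-squared statistic, table size, and sample size."""
    table = np.asarray(table, dtype=float)
    r, k = table.shape
    return float(np.sqrt(pearson_chi2(table) / (table.sum() * min(r - 1, k - 1))))

# Invented counts: the same proportions at two different sample sizes.
small = np.array([[30, 10, 10],
                  [20, 20, 10]])
large = small * 100

print("chi2 small:", pearson_chi2(small), "| chi2 large:", pearson_chi2(large))
print("V small:", cramers_v_from_table(small), "| V large:", cramers_v_from_table(large))
```

The chi-squared statistic is 100 times larger for the scaled table, while the effect size is unchanged, which is why an effect-size measure such as Cramér’s V is a useful companion to the p-value.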

In [401]:
# helper to extract the primary color from the string
def extract_primary_color(color):
    if pd.isna(color):
        return "Unknown"
    if "/" in color:
        return color.split("/")[0]
    if color.lower() in ["tricolor", "calico", "torbie", "tortie"]:
        return "Multi"
    return color.strip()

# Apply to both datasets
animal_data["PrimaryColor"] = animal_data["Color"].apply(extract_primary_color)
animal_test["PrimaryColor"] = animal_test["Color"].apply(extract_primary_color)

# Binary pattern flags
def color_flags(color):
    color = str(color).lower()
    return pd.Series({
        "has_tabby": "tabby" in color,
        "has_tortie": "tortie" in color,
        "has_calico": "calico" in color,
        "has_torbie": "torbie" in color,
        "has_tricolor": "tricolor" in color
    })

color_features_data = animal_data["Color"].apply(color_flags)
color_features_test = animal_test["Color"].apply(color_flags)

# Attach pattern flags to data
animal_data = pd.concat([animal_data, color_features_data], axis=1)
animal_test = pd.concat([animal_test, color_features_test], axis=1)

# Encode PrimaryColor
from sklearn.preprocessing import LabelEncoder

color_encoder = LabelEncoder()
animal_data["PrimaryColorEncoded"] = color_encoder.fit_transform(animal_data["PrimaryColor"])
animal_test["PrimaryColorEncoded"] = color_encoder.transform(animal_test["PrimaryColor"])

# Drop unused original color fields
animal_data.drop(["Color", "PrimaryColor"], axis=1, inplace=True)
animal_test.drop(["Color", "PrimaryColor"], axis=1, inplace=True)

# Decode color labels for readability
animal_data["PrimaryColorLabel"] = color_encoder.inverse_transform(animal_data["PrimaryColorEncoded"])

# Get top 20 most frequent primary colors
top_colors = animal_data["PrimaryColorLabel"].value_counts().head(20).index
top_color_data = animal_data[animal_data["PrimaryColorLabel"].isin(top_colors)]

# Group outcome % and counts by primary color
color_outcome = (
    top_color_data.groupby("PrimaryColorLabel")["Outcome Type"]
    .value_counts(normalize=False)
    .unstack(fill_value=0)
)

color_outcome["Count"] = color_outcome.sum(axis=1)
color_percent = color_outcome.div(color_outcome["Count"], axis=0) * 100
final_color_table = color_percent.round(1)
final_color_table["Count"] = color_outcome["Count"]

print("\n   Top 20 Primary Colors: Outcome Percent and Counts   ")
print(final_color_table.to_string())

# check these print outs for patterns within the colors
print("\n   Pattern Flags: Outcome Percent and Counts   ")
for col in ["has_tabby", "has_tortie", "has_calico", "has_torbie", "has_tricolor"]:
    subset = animal_data[animal_data[col]]
    if len(subset) == 0:
        continue
    outcome_counts = subset["Outcome Type"].value_counts()
    outcome_percent = subset["Outcome Type"].value_counts(normalize=True) * 100
    summary = pd.DataFrame({
        "Count": outcome_counts,
        "Percent": outcome_percent.round(1)
    })
    print(f"\n-- {col} --")
    print(summary.to_string())


   Top 20 Primary Colors: Outcome Percent and Counts   
Outcome Type       Adoption  Died  Euthanasia  Return to Owner  Transfer  Count
PrimaryColorLabel                                                              
Black                  50.4   1.1         3.1             14.9      30.5  27150
Blue                   50.3   0.9         3.7             13.5      31.6   5386
Blue Tabby             54.6   1.3         3.2              3.4      37.5   2937
Brown                  48.2   0.6         2.8             23.6      24.7   8626
Brown Brindle          50.9   0.3         3.3             22.8      22.6   2788
Brown Tabby            50.1   1.4         3.6              4.0      40.9  11694
Buff                   45.1   0.8         2.5             22.7      28.9    634
Chocolate              49.3   0.6         2.9             24.6      22.5   1462
Cream                  44.7   0.8         1.9             18.9      33.7   1070
Cream Tabby            55.4   1.7         2.4              2.5 

Let's look at correlations.

In [402]:
import numpy as np

def cramers_v(confusion_matrix):
    chi2, _, _, _ = chi2_contingency(confusion_matrix)
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    return np.sqrt(phi2 / min(k - 1, r - 1))

# For primary color
cramer_v_color = cramers_v(pd.crosstab(animal_data["PrimaryColorEncoded"], animal_data["Outcome Type"]))
print(f"\nCramér’s V for PrimaryColorEncoded: {cramer_v_color:.3f}")

# Color-related binary features (named so the list does not shadow the color_flags() helper)
color_flag_cols = ["has_tabby", "has_tortie", "has_calico", "has_torbie", "has_tricolor"]

print("=== Chi-squared Test Results for Color Pattern Flags ===")
for col in color_flag_cols:
    table = pd.crosstab(animal_data[col], animal_data["Outcome Type"])
    try:
        chi2, p, dof, expected = chi2_contingency(table)
        print(f"{col:<15} | Chi2: {chi2:8.2f} | p: {p:.5f}")
    except Exception as e:
        print(f"{col:<15} | Error: {e}")


Cramér’s V for PrimaryColorEncoded: 0.119
=== Chi-squared Test Results for Color Pattern Flags ===
has_tabby       | Chi2:  3183.38 | p: 0.00000
has_tortie      | Chi2:   293.01 | p: 0.00000
has_calico      | Chi2:   241.65 | p: 0.00000
has_torbie      | Chi2:   217.69 | p: 0.00000
has_tricolor    | Chi2:   188.13 | p: 0.00000
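
For reference, the `cramers_v` helper above computes the following quantity, where $\chi^2$ is the chi-squared statistic of the contingency table, $n$ is the total number of observations, and $r$ and $k$ are the numbers of rows and columns:

$$
V = \sqrt{\frac{\chi^2 / n}{\min(k - 1,\; r - 1)}}
$$

$V$ ranges from 0 (no association) to 1 (perfect association), so a value of 0.119 sits toward the weak end of that range even though the chi-squared test is highly significant.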


Let's look at Intake Condition and list all the possible values of that feature.

In [403]:
# Count the occurrences of each intake condition label
intake_condition_counts = animal_data['Intake Condition'].value_counts()

# Print the counts for each intake condition
for condition, count in intake_condition_counts.items():
    print(f"{condition}: {count}")

Normal: 95010
Injured: 6394
Sick: 4295
Nursing: 2957
Neonatal: 1240
Aged: 373
Medical: 298
Other: 247
Pregnant: 111
Feral: 104
Med Attn: 48
Behavior: 42
Unknown: 12
Neurologic: 10
Med Urgent: 7
Parvo: 5
Space: 2
Agonal: 1
Congenital: 1


The output above shows 19 distinct values for the Intake Condition feature, many of them rare. We can merge some of the rarer categories together.

Let's explore the outcome percentages for each intake condition so we can group similar conditions and reduce the number of labels we need to one-hot encode (each extra label adds a dimension).
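
The same percentages can also be viewed as one table, which makes similar rows easier to spot than a sequence of printed Series. This is a small sketch using `pd.crosstab` with `normalize="index"`; the rows below are invented, and only the column names come from the dataset.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Intake Condition": ["Normal", "Normal", "Normal", "Normal", "Injured", "Injured"],
    "Outcome Type": ["Adoption", "Adoption", "Transfer", "Return to Owner",
                     "Euthanasia", "Adoption"],
})

# Row-normalized contingency table: each row sums to 100%.
outcome_pct = (
    pd.crosstab(toy["Intake Condition"], toy["Outcome Type"], normalize="index") * 100
).round(1)

# Keep the raw row counts alongside, since percentages of rare conditions are noisy.
outcome_pct["Count"] = toy["Intake Condition"].value_counts()

print(outcome_pct.to_string())
```

Outcomes that never occur for a condition show up as 0.0 instead of being omitted, so every row has the same columns.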

In [404]:
# Iterate through each unique intake condition and calculate outcome percentages
intake_conditions = animal_data['Intake Condition'].unique()

for condition in intake_conditions:
    condition_data = animal_data[animal_data['Intake Condition'] == condition]
    total_count = len(condition_data)
    if total_count > 0:
        outcome_percentages = condition_data['Outcome Type'].value_counts(normalize=True) * 100
        print(f"Intake Condition: {condition}")
        print(outcome_percentages)
        print("-" * 50)
    else:
        print(f"Intake Condition: {condition} has no entries.")
        print("-" * 50)

Intake Condition: Normal
Outcome Type
Adoption           52.596569
Transfer           29.360067
Return to Owner    15.985686
Euthanasia          1.513525
Died                0.544153
Name: proportion, dtype: float64
--------------------------------------------------
Intake Condition: Injured
Outcome Type
Adoption           34.813888
Transfer           30.669378
Euthanasia         18.720676
Return to Owner    12.605568
Died                3.190491
Name: proportion, dtype: float64
--------------------------------------------------
Intake Condition: Pregnant
Outcome Type
Transfer           54.954955
Adoption           39.639640
Return to Owner     4.504505
Died                0.900901
Name: proportion, dtype: float64
--------------------------------------------------
Intake Condition: Neonatal
Outcome Type
Transfer           69.919355
Adoption           25.564516
Died                3.064516
Return to Owner     0.967742
Euthanasia          0.483871
Name: proportion, dtype: float64
-------

Groupings based on similar outcome percentages

In [405]:
'''
    [Group]       [Categories]
    Normal        Normal
    Neonatal      Neonatal, Nursing
    Med_Minor     Injured, Medical (more similar outcome adopt %)
    Med_Major     Med Attn, Med Urgent, Neurologic, Pregnant, Sick (more similar outcome transfer %)
    Behavioral    Feral, Behavior
    Rare          Agonal, Aged, Congenital, Parvo, Space, Other, Unknown (fallback group)
'''

The merged groups have relatively similar outcome percentages, so we can use them to reduce dimensionality. This raises the question of whether dimensionality reduction is actually necessary here. To find out, we apply the grouping to a copy of the original dataframe so we can later compare model accuracy with and without it.

In [406]:
# intake condition into grouped categories
condition_map = {
    'Normal': 'Normal',
    'Injured': 'Med_Minor',
    'Sick': 'Med_Major',
    'Nursing': 'Neonatal',
    'Neonatal': 'Neonatal',
    'Med Attn': 'Med_Major',
    'Med Urgent': 'Med_Major',
    'Medical': 'Med_Minor',
    'Neurologic': 'Med_Major',
    'Pregnant': 'Med_Major',
    'Feral': 'Behavioral',
    'Behavior': 'Behavioral',
}

# make copy to make transformations (in case it doesn't improve accuracy)
reduced_animal_data = animal_data.copy()
reduced_animal_test = animal_test.copy()

# map intake conditions to their groups, with a fallback to 'Rare' for anything unmapped
reduced_animal_data['Condition'] = reduced_animal_data['Intake Condition'].map(condition_map).fillna('Rare')
reduced_animal_data = pd.get_dummies(reduced_animal_data, columns=['Condition'])
reduced_animal_data.drop('Intake Condition', axis=1, inplace=True)

reduced_animal_test['Condition'] = reduced_animal_test['Intake Condition'].map(condition_map).fillna('Rare')
reduced_animal_test = pd.get_dummies(reduced_animal_test, columns=['Condition'])
reduced_animal_test.drop('Intake Condition', axis=1, inplace=True)

reduced_animal_data.sample(5)  # sample some data after transformation

Unnamed: 0,Intake Time,Found Location,Intake Type,Breed,Outcome Type,Cat,Sterilized,Male,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,PrimaryColorLabel,Condition_Behavioral,Condition_Med_Major,Condition_Med_Minor,Condition_Neonatal,Condition_Normal,Condition_Rare
29259,12/03/2014 11:26:00 AM,Austin (TX),Owner Surrender,Australian Cattle Dog/Border Collie,Adoption,False,False,False,0.083333,False,False,False,False,False,2,Black,False,False,False,False,True,False
6822,05/27/2021 01:36:00 PM,Austin (TX),Stray,Domestic Shorthair,Transfer,True,False,False,0.666667,True,False,False,False,False,18,Brown Tabby,False,False,False,False,True,False
57456,11/06/2023 12:09:00 PM,8104 Taza Trail in Austin (TX),Stray,Domestic Shorthair,Transfer,True,False,True,2.0,False,False,False,False,False,7,Blue,False,False,False,False,True,False
30952,05/14/2024 05:49:00 PM,1620 East 6Th Street in Austin (TX),Stray,Pit Bull,Adoption,False,False,True,0.583333,False,False,False,False,False,2,Black,False,False,False,False,True,False
53579,05/04/2021 05:12:00 PM,Blake Manor Road in Manor (TX),Stray,German Shepherd,Adoption,False,False,True,0.076923,False,False,False,False,False,2,Black,False,True,False,False,False,False
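
Because `pd.get_dummies` is called separately on the training and test copies above, a group that happens to be absent from one set produces no column there, and the two frames can end up with different features. One common remedy is to reindex the test dummies to the training columns. The sketch below shows the idea on invented toy frames.

```python
import pandas as pd

# Invented toy frames: the test set contains no 'Rare' rows.
train = pd.DataFrame({"Condition": ["Normal", "Rare", "Neonatal"]})
test = pd.DataFrame({"Condition": ["Normal", "Neonatal", "Normal"]})

train_dummies = pd.get_dummies(train, columns=["Condition"])
test_dummies = pd.get_dummies(test, columns=["Condition"])

# Align the test frame to the training columns: categories missing from the
# test set become all-False columns, and test-only categories are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=False)

print(list(test_dummies.columns))
print(list(test_aligned.columns))
```

On the real frames the training columns would first need the target (`Outcome Type`) and any train-only helper columns removed before being used as the reindex template.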


Let's do the same for Intake Type and check the outcome percentages for each type.

In [407]:
# observe outcome percentages for each intake type so we can group these categoricals, reduces dimensionality
intake_types = animal_data['Intake Type'].unique()

for intake_type in intake_types:
    type_data = animal_data[animal_data['Intake Type'] == intake_type]
    total_count = len(type_data)
    if total_count > 0:
        outcome_percentages = type_data['Outcome Type'].value_counts(normalize=True) * 100
        print(f"Intake Type: {intake_type}")
        print(outcome_percentages)
        print("-" * 50)
    else:
        print(f"Intake Type: {intake_type} has no entries.")
        print("-" * 50)

Intake Type: Stray
Outcome Type
Adoption           47.774546
Transfer           34.219783
Return to Owner    13.905286
Euthanasia          3.056120
Died                1.044266
Name: proportion, dtype: float64
--------------------------------------------------
Intake Type: Public Assist
Outcome Type
Return to Owner    63.545914
Adoption           18.528788
Transfer           14.222802
Euthanasia          3.262111
Died                0.440385
Name: proportion, dtype: float64
--------------------------------------------------
Intake Type: Owner Surrender
Outcome Type
Adoption           64.671050
Transfer           26.502541
Return to Owner     5.399357
Euthanasia          2.736980
Died                0.690073
Name: proportion, dtype: float64
--------------------------------------------------
Intake Type: Abandoned
Outcome Type
Adoption           63.017032
Transfer           26.845093
Return to Owner     9.083536
Euthanasia          0.648824
Died                0.405515
Name: proportion, 

These would be the groupings that are most similar to each other based on the outcome percentages.

In [408]:
'''
    [Group]          [Categories]
    Public Assist    Public Assist
    Stray            Stray, Wildlife
    Owner Initiated  Abandoned, Owner Surrender
    Euthanasia       Euthanasia Request
'''

In [409]:
# Define the mapping for grouping intake types
intake_type_map = {
    'Stray': 'Stray',
    'Public Assist': 'Public Assist',
    'Wildlife': 'Stray',
    'Abandoned': 'Owner Initiated',
    'Owner Surrender': 'Owner Initiated',
    'Euthanasia Request': 'Euthanasia'
}

# Map the intake types to their respective groups, with a fallback to 'Other' for anything unmapped
reduced_animal_data['Intake'] = reduced_animal_data['Intake Type'].map(intake_type_map).fillna('Other')
reduced_animal_data = pd.get_dummies(reduced_animal_data, columns=['Intake'])
reduced_animal_data.drop('Intake Type', axis=1, inplace=True)

reduced_animal_test['Intake'] = reduced_animal_test['Intake Type'].map(intake_type_map).fillna('Other')
reduced_animal_test = pd.get_dummies(reduced_animal_test, columns=['Intake'])
reduced_animal_test.drop('Intake Type', axis=1, inplace=True)

reduced_animal_data.sample(5)  # sample some data after transformation


Unnamed: 0,Intake Time,Found Location,Breed,Outcome Type,Cat,Sterilized,Male,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,PrimaryColorLabel,Condition_Behavioral,Condition_Med_Major,Condition_Med_Minor,Condition_Neonatal,Condition_Normal,Condition_Rare,Intake_Euthanasia,Intake_Owner Initiated,Intake_Public Assist,Intake_Stray
50860,10/02/2013 09:19:00 PM,132 Karen Hill Dr. in Austin (TX),Domestic Shorthair Mix,Adoption,True,False,True,0.166667,True,False,False,False,False,18,Brown Tabby,False,False,False,False,True,False,False,False,False,True
105489,04/27/2024 05:15:00 PM,5971 Hiline Rd in Austin (TX),Domestic Shorthair,Adoption,True,False,True,0.083333,True,False,False,False,False,18,Brown Tabby,False,False,False,False,True,False,False,False,False,True
11294,11/22/2016 02:01:00 PM,8600 Cretys Cove in Austin (TX),Basenji Mix,Adoption,False,False,False,0.833333,False,False,False,False,False,42,Red,False,False,False,False,True,False,False,False,False,True
10914,04/13/2018 11:12:00 AM,Georgian Drive And West Powell Lane in Austin ...,Staffordshire Mix,Adoption,False,False,True,1.0,False,False,False,False,False,2,Black,False,False,False,False,True,False,False,False,False,True
3008,12/06/2015 12:54:00 PM,Outside Jurisdiction,Chihuahua Shorthair Mix,Adoption,False,True,True,1.0,False,False,False,False,False,51,Tan,False,False,False,False,True,False,False,True,False,False


The outcome distributions above differ enough across intake conditions and intake types that we keep both features for the later predictions.

Let's also consider the intake time. We found a CSV online with the `feelslike` temperatures for Austin from 2013 to 2023. We can add temperature to our dataset to see whether it correlates with any of the outcomes, particularly for animals taken in as strays.
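
As a rough illustration of the lookup, the daily temperatures can be attached with a vectorized `Series.map` on the normalized intake date. This sketch uses invented rows; the `datetime` and `feelslike` column names match the weather file used in the next cell, and it assumes one weather row per day. It does not include the previous-year fallback, so unmatched dates simply come back as NaN.

```python
import pandas as pd

# Invented rows for illustration only.
weather = pd.DataFrame({
    "datetime": ["2020-01-01", "2020-01-02"],
    "feelslike": [50.0, 55.5],
})
intakes = pd.DataFrame({
    "Intake Time": [
        "01/01/2020 04:45:00 PM",
        "01/02/2020 09:10:00 AM",
        "01/03/2020 11:00:00 AM",  # no weather row for this day
    ],
})

weather["datetime"] = pd.to_datetime(weather["datetime"])
intakes["Intake Time"] = pd.to_datetime(
    intakes["Intake Time"], format="%m/%d/%Y %I:%M:%S %p"
)

# Lookup Series keyed by midnight of each day (requires unique dates).
daily_feelslike = pd.Series(
    weather["feelslike"].values, index=weather["datetime"].dt.normalize()
)

# Strip the time of day from each intake and look up that day's temperature.
intakes["Intake Temperature"] = intakes["Intake Time"].dt.normalize().map(daily_feelslike)

print(intakes)
```

Passing an explicit `format` also avoids the format-inference ambiguity that arises when the training and test files store timestamps differently.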

In [410]:
# make a duplicate of each dataset so we can keep the temperature experiment separate
temp_animal_data = animal_data.copy()
temp_animal_test = animal_test.copy()

# convert Intake Time to pandas datetime
temp_animal_data["Intake Time"] = pd.to_datetime(temp_animal_data["Intake Time"], errors="coerce")
temp_animal_test["Intake Time"] = pd.to_datetime(temp_animal_test["Intake Time"], errors="coerce")

# read in weather data csv
weather = pd.read_csv("austin_weather_data.csv")

# parse to the date time format
weather["datetime"] = pd.to_datetime(weather["datetime"], errors="coerce")
weather["date_only"] = weather["datetime"].dt.date
feelslike_map = dict(zip(weather["date_only"], weather["feelslike"]))
temp_animal_data["Intake Date"] = temp_animal_data["Intake Time"].dt.date
temp_animal_test["Intake Date"] = temp_animal_test["Intake Time"].dt.date

# helper to find the matching temperature for a given date
def get_temperature(intake_date):
    if pd.isnull(intake_date):
        return np.nan
    # try an exact match first
    if intake_date in feelslike_map:
        return feelslike_map[intake_date]
    # otherwise fall back to the same calendar day in up to 5 previous years,
    # since the weather data does not cover every intake date
    for years_back in range(1, 6):
        try:
            fallback_date = (pd.to_datetime(intake_date) - pd.DateOffset(years=years_back)).date()
            if fallback_date in feelslike_map:
                return feelslike_map[fallback_date]
        except Exception:
            continue
    return np.nan

# apply mapping
temp_animal_data["Intake Temperature"] = temp_animal_data["Intake Date"].apply(get_temperature)
temp_animal_test["Intake Temperature"] = temp_animal_test["Intake Date"].apply(get_temperature)

# impute with mean
temp_animal_data["Intake Temperature"] = temp_animal_data["Intake Temperature"].fillna(temp_animal_data["Intake Temperature"].mean())
temp_animal_test["Intake Temperature"] = temp_animal_test["Intake Temperature"].fillna(temp_animal_test["Intake Temperature"].mean())

# drop the original date column
temp_animal_data.drop(columns=["Intake Date"], inplace=True)
temp_animal_test.drop(columns=["Intake Date"], inplace=True)

temp_animal_data.sample(5)




Unnamed: 0,Intake Time,Found Location,Intake Type,Intake Condition,Breed,Outcome Type,Cat,Sterilized,Male,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,PrimaryColorLabel,Intake Temperature
67602,2016-05-05 11:41:00,505 Bowery Ln in Austin (TX),Stray,Normal,German Shepherd Mix,Return to Owner,False,False,True,3.0,False,False,False,False,False,2,Black,72.4
61091,2023-10-14 15:09:00,Austin (TX),Owner Surrender,Aged,Pit Bull,Euthanasia,False,False,True,12.0,False,False,False,False,False,7,Blue,76.3
110925,2024-08-16 15:28:00,7201 Levander Loop in Austin (TX),Stray,Normal,Standard Schnauzer Mix,Adoption,False,False,False,0.333333,False,False,False,False,False,57,White,87.6
107669,2015-02-16 13:21:00,S Capital Of Texas Hwy And S Mopac Expy in Aus...,Stray,Normal,Australian Cattle Dog/Labrador Retriever,Adoption,False,False,False,0.833333,False,False,False,False,False,51,Tan,41.1
57900,2024-06-18 12:47:00,17505 Hamilton Pool Rd in Travis (TX),Public Assist,Injured,Carolina Dog Mix,Return to Owner,False,True,True,2.0,False,False,False,False,False,15,Brown,97.6
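
The row-by-row lookup above could also be written in a vectorized form. The sketch below is a minimal illustration on a made-up weather table, not the code used for the results in this notebook; `lookup_temperature` and the toy values are hypothetical. It maps each intake date to a `feelslike` value and, where no match exists, retries the same calendar day in earlier years.

```python
import pandas as pd

# hypothetical toy weather table (the real one is read from austin_weather_data.csv)
weather = pd.DataFrame({
    "datetime": pd.to_datetime(["2022-06-18", "2023-06-18", "2023-10-14"]),
    "feelslike": [95.0, 97.6, 76.3],
})
# Series indexed by calendar day, so it can be used as a lookup table
temps_by_date = pd.Series(
    weather["feelslike"].to_numpy(),
    index=weather["datetime"].dt.normalize(),
)

def lookup_temperature(intake_times, temps_by_date, max_years_back=5):
    """Map intake timestamps to temperatures, falling back to earlier years."""
    dates = intake_times.dt.normalize()      # drop the time of day
    result = dates.map(temps_by_date)        # exact-date matches
    for years_back in range(1, max_years_back + 1):
        missing = result.isna() & dates.notna()
        if not missing.any():
            break
        shifted = dates[missing] - pd.DateOffset(years=years_back)
        result.loc[missing] = shifted.map(temps_by_date)
    return result

intake = pd.Series(pd.to_datetime(
    ["2023-10-14 15:09", "2024-06-18 12:47", "2030-01-01 08:00", None]
))
temperatures = lookup_temperature(intake, temps_by_date)
print(temperatures)
```

Here the 2024 intake falls back to the 2023 reading, while the 2030 date and the missing timestamp stay `NaN` and would still need imputing.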


Now that we have found a use case for `Intake Time`, we can experiment with different models to see how they perform with our different feature engineering strategies.

We need to drop the remaining columns that we do not want to train on. More specifically, we chose not to train on the breed, as it was too specific and would likely add noise to our training. As for location, there wasn't a reliable way to parse and use the information efficiently, and much of the data was malformed or nonspecific.

In [411]:
drop_cols = ['Intake Time', 'Found Location', 'Breed']

# the training sets also carry the color label column, which we drop as well
for df in (animal_data, reduced_animal_data, temp_animal_data):
    df.drop(columns=drop_cols + ['PrimaryColorLabel'], inplace=True)
    df.fillna(0, inplace=True)

# the test sets only need the three shared columns dropped
for df in (animal_test, reduced_animal_test, temp_animal_test):
    df.drop(columns=drop_cols, inplace=True)
    df.fillna(0, inplace=True)

animal_data.sample(5)

Unnamed: 0,Intake Type,Intake Condition,Outcome Type,Cat,Sterilized,Male,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded
24202,Owner Surrender,Normal,Adoption,False,True,False,0.166667,False,False,False,False,False,15
15707,Stray,Normal,Return to Owner,False,True,False,5.0,False,False,False,False,False,25
47058,Public Assist,Normal,Return to Owner,False,True,False,5.0,False,False,False,False,False,15
21759,Stray,Injured,Adoption,False,False,False,2.0,False,False,False,False,False,15
85687,Stray,Normal,Adoption,False,False,False,0.083333,False,False,False,False,False,2


Lastly, we just need to one-hot encode `Intake Type` and `Intake Condition` in the original training and test sets (the ones where we didn't group those features into smaller groups to reduce dimensionality), and in the temperature training set.

In [412]:
outcome = animal_data["Outcome Type"] # separate outcome type
one_hot_encoded = pd.get_dummies(animal_data.drop(columns=["Outcome Type"]), drop_first=True) # one-hot encode the rest
temp_one_hot_encoded = pd.get_dummies(temp_animal_data.drop(columns=["Outcome Type"]), drop_first=True) # one-hot encode the rest
animal_data = pd.concat([one_hot_encoded, outcome], axis=1) # add back outcome type for training later
temp_animal_data = pd.concat([temp_one_hot_encoded, outcome], axis=1) # add back outcome type for training later
animal_test = pd.get_dummies(animal_test, drop_first=True)
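
One-hot encoding the training and test sets separately can produce different columns when a category appears in only one of them. Below is a minimal sketch of one way to align them, using made-up rows; the toy frames are hypothetical, and `reindex` is the same pandas call this notebook uses when building the final test matrix.

```python
import pandas as pd

# hypothetical toy frames: each set has one category the other lacks
train = pd.DataFrame({"Intake Condition": ["Normal", "Injured", "Congenital"]})
test = pd.DataFrame({"Intake Condition": ["Normal", "Panleuk", "Injured"]})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

# keep exactly the training columns, in the training order:
# categories unseen in training are dropped, missing ones are filled with False
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=False)

print(list(train_encoded.columns))
print(list(test_aligned.columns))
```

A row whose category was never seen in training (here `Panleuk`) ends up with all indicator columns set to False, so the model treats it like the baseline category.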

Let's make sure that our different data sets are ready. We have three variants: the original set with one-hot encoded categoricals, a set where we grouped the categoricals to reduce dimensionality, and a set with the added temperature feature.

In [413]:
animal_data.sample(2) # regular categoricals, color one hot encoded

Unnamed: 0,Cat,Sterilized,Male,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,Intake Type_Euthanasia Request,Intake Type_Owner Surrender,Intake Type_Public Assist,Intake Type_Stray,Intake Type_Wildlife,Intake Condition_Agonal,Intake Condition_Behavior,Intake Condition_Congenital,Intake Condition_Feral,Intake Condition_Injured,Intake Condition_Med Attn,Intake Condition_Med Urgent,Intake Condition_Medical,Intake Condition_Neonatal,Intake Condition_Neurologic,Intake Condition_Normal,Intake Condition_Nursing,Intake Condition_Other,Intake Condition_Parvo,Intake Condition_Pregnant,Intake Condition_Sick,Intake Condition_Space,Intake Condition_Unknown,Outcome Type
21017,False,True,True,3.0,False,False,False,False,False,2,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,Transfer
88011,True,False,True,0.666667,False,False,False,False,False,2,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,Adoption


Corresponding test set:

In [414]:
animal_test.sample(2) # regular categoricals, color one hot encoded

Unnamed: 0,Id,Sterilized,Male,Cat,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,Intake Type_Euthanasia Request,Intake Type_Owner Surrender,Intake Type_Public Assist,Intake Type_Stray,Intake Condition_Agonal,Intake Condition_Behavior,Intake Condition_Feral,Intake Condition_Injured,Intake Condition_Med Attn,Intake Condition_Med Urgent,Intake Condition_Medical,Intake Condition_Neonatal,Intake Condition_Normal,Intake Condition_Nursing,Intake Condition_Other,Intake Condition_Panleuk,Intake Condition_Parvo,Intake Condition_Pregnant,Intake Condition_Sick,Intake Condition_Space,Intake Condition_Unknown
26861,26862,True,False,False,3.0,False,False,False,False,False,57,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
21308,21309,False,True,False,0.583333,False,False,False,False,False,15,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False


Training set with reduced dimensionality on categorical features using grouping:

In [415]:
reduced_animal_data.sample(2) # reduced intake condition and type

Unnamed: 0,Outcome Type,Cat,Sterilized,Male,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,Condition_Behavioral,Condition_Med_Major,Condition_Med_Minor,Condition_Neonatal,Condition_Normal,Condition_Rare,Intake_Euthanasia,Intake_Owner Initiated,Intake_Public Assist,Intake_Stray
18702,Transfer,False,False,False,0.083333,False,False,False,False,False,57,False,False,False,False,True,False,False,True,False,False
69856,Transfer,False,False,True,0.083333,False,False,False,False,False,15,False,False,False,False,True,False,False,False,False,True


Corresponding test set:

In [416]:
reduced_animal_test.sample(2) # reduced intake condition and type

Unnamed: 0,Id,Sterilized,Male,Cat,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,Condition_Behavioral,Condition_Med_Major,Condition_Med_Minor,Condition_Neonatal,Condition_Normal,Condition_Rare,Intake_Euthanasia,Intake_Owner Initiated,Intake_Public Assist,Intake_Stray
20928,20929,False,True,True,0.076923,True,False,False,False,False,57,False,False,False,False,True,False,False,False,False,True
12750,12751,False,False,False,7.0,False,False,False,False,False,51,False,False,False,False,True,False,False,False,False,True


In [417]:
temp_animal_data.sample(2) # regular categoricals plus intake temperature

Unnamed: 0,Cat,Sterilized,Male,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,Intake Temperature,Intake Type_Euthanasia Request,Intake Type_Owner Surrender,Intake Type_Public Assist,Intake Type_Stray,Intake Type_Wildlife,Intake Condition_Agonal,Intake Condition_Behavior,Intake Condition_Congenital,Intake Condition_Feral,Intake Condition_Injured,Intake Condition_Med Attn,Intake Condition_Med Urgent,Intake Condition_Medical,Intake Condition_Neonatal,Intake Condition_Neurologic,Intake Condition_Normal,Intake Condition_Nursing,Intake Condition_Other,Intake Condition_Parvo,Intake Condition_Pregnant,Intake Condition_Sick,Intake Condition_Space,Intake Condition_Unknown,Outcome Type
52526,True,False,False,0.083333,False,False,False,False,False,2,88.8,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,Adoption
37990,False,True,False,6.0,False,False,False,False,False,42,71.6,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,Return to Owner


In [418]:
temp_animal_data.sample(2) # regular categoricals plus intake temperature

Unnamed: 0,Cat,Sterilized,Male,Age,has_tabby,has_tortie,has_calico,has_torbie,has_tricolor,PrimaryColorEncoded,Intake Temperature,Intake Type_Euthanasia Request,Intake Type_Owner Surrender,Intake Type_Public Assist,Intake Type_Stray,Intake Type_Wildlife,Intake Condition_Agonal,Intake Condition_Behavior,Intake Condition_Congenital,Intake Condition_Feral,Intake Condition_Injured,Intake Condition_Med Attn,Intake Condition_Med Urgent,Intake Condition_Medical,Intake Condition_Neonatal,Intake Condition_Neurologic,Intake Condition_Normal,Intake Condition_Nursing,Intake Condition_Other,Intake Condition_Parvo,Intake Condition_Pregnant,Intake Condition_Sick,Intake Condition_Space,Intake Condition_Unknown,Outcome Type
11221,False,False,True,6.0,False,False,False,False,False,31,86.9,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,Return to Owner
31481,False,False,False,0.333333,False,False,False,False,False,16,62.1,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,Transfer


We have confirmed that our data sets are cleaned and ready for training.

Let's take a look at the outcome type frequencies in the training set. We want our predictions to follow a similar trend, with adoptions being the most common and died and euthanasia the least common. However, this raises a concern about class imbalance: almost 50% of the training data are adoption outcomes, and fewer than 1% are deaths. This means it might be difficult to get high recall on euthanasia and died, and we will have to explore options to improve our learning algorithm.

In [419]:
# Print the frequency of each Outcome Type
outcome_frequencies = animal_data['Outcome Type'].value_counts() / len(animal_data) * 100
print("Frequency of each Outcome Type:")
print(outcome_frequencies)

Frequency of each Outcome Type:
Outcome Type
Adoption           49.519149
Transfer           31.508587
Return to Owner    14.932933
Euthanasia          3.102819
Died                0.936513
Name: count, dtype: float64
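
To make the imbalance concern concrete, here is a small sketch of a baseline that always predicts the majority class. The 10,000-row sample is hypothetical and only mirrors the class shares printed above. Accuracy looks passable while macro recall collapses, which is why we track both metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# hypothetical sample of 10,000 animals with roughly the class shares printed above
classes = ["Adoption", "Transfer", "Return to Owner", "Euthanasia", "Died"]
counts = [4952, 3151, 1493, 310, 94]
y_true = np.repeat(classes, counts)

# a "model" that always predicts the majority class
y_pred = np.full(y_true.shape, "Adoption")

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_macro_recall = recall_score(y_true, y_pred, average="macro")
print(f"accuracy = {baseline_accuracy:.4f}, macro recall = {baseline_macro_recall:.4f}")
# accuracy = 0.4952, macro recall = 0.2000
```

Macro recall is the unweighted mean of the per-class recalls, so one perfectly recalled class out of five gives exactly 0.2.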


## **4. Model implementation comparisons and tuning:**

Let's try a histogram-based gradient boosting model, which bins the feature values and builds an ensemble of decision trees. The scikit-learn library provides `HistGradientBoostingClassifier`, which we can use to implement this approach. By using an ensemble, we hope to balance predictions more effectively and work through the class imbalance.

Add necessary imports:

In [420]:
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from imblearn.over_sampling import SMOTE
from collections import Counter

We cross-validate over the training set: we train on different segments, evaluate on the held-out fold, and average the accuracies and macro recalls. This lets us determine whether we are headed in the right direction.

In [421]:
# DATASET 1 = reduced_animal_data, DATASET 2 = animal_data, DATASET 3 = temp_animal_data
for i, df in enumerate([reduced_animal_data, animal_data, temp_animal_data]):
    print(f"\nEvaluating Dataset {i + 1}/3 with 5-Fold CV")

    labels = df["Outcome Type"]
    data = df.drop(columns=["Outcome Type"]).fillna(0)

    label_encoder = LabelEncoder()
    labels_encoded = label_encoder.fit_transform(labels)

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=363)
    acc_scores = []
    recall_scores = []

    for fold, (train_index, test_index) in enumerate(skf.split(data, labels_encoded), start=1):
        x_train, x_test = data.iloc[train_index], data.iloc[test_index]
        y_train, y_test = labels_encoded[train_index], labels_encoded[test_index]

        clf = HistGradientBoostingClassifier(max_iter=200, max_depth=15, random_state=363)
        clf.fit(x_train, y_train)

        predictions = clf.predict(x_test)
        acc = accuracy_score(y_test, predictions)
        recall = recall_score(y_test, predictions, average='macro')

        acc_scores.append(acc)
        recall_scores.append(recall)

        print(f"  Fold {fold}: Accuracy = {acc:.4f}, Macro Recall = {recall:.4f}")

    print(f"- Average Accuracy: {np.mean(acc_scores):.4f}")
    print(f"- Average Macro Recall: {np.mean(recall_scores):.4f}")


Evaluating Dataset 1/3 with 5-Fold CV
  Fold 1: Accuracy = 0.6176, Macro Recall = 0.3785
  Fold 2: Accuracy = 0.6179, Macro Recall = 0.3855
  Fold 3: Accuracy = 0.6253, Macro Recall = 0.3908
  Fold 4: Accuracy = 0.6218, Macro Recall = 0.3810
  Fold 5: Accuracy = 0.6199, Macro Recall = 0.3921
- Average Accuracy: 0.6205
- Average Macro Recall: 0.3856

Evaluating Dataset 2/3 with 5-Fold CV
  Fold 1: Accuracy = 0.6180, Macro Recall = 0.3822
  Fold 2: Accuracy = 0.6181, Macro Recall = 0.3882
  Fold 3: Accuracy = 0.6247, Macro Recall = 0.3915
  Fold 4: Accuracy = 0.6226, Macro Recall = 0.3868
  Fold 5: Accuracy = 0.6195, Macro Recall = 0.3934
- Average Accuracy: 0.6206
- Average Macro Recall: 0.3884

Evaluating Dataset 3/3 with 5-Fold CV
  Fold 1: Accuracy = 0.6228, Macro Recall = 0.3884
  Fold 2: Accuracy = 0.6237, Macro Recall = 0.3915
  Fold 3: Accuracy = 0.6280, Macro Recall = 0.3954
  Fold 4: Accuracy = 0.6274, Macro Recall = 0.3970
  Fold 5: Accuracy = 0.6254, Macro Recall = 0.3977
- 

Looking at the results above, we see that the `reduced_animal_data` data frame (Dataset 1) was essentially as accurate as the ungrouped `animal_data` (Dataset 2), 0.6205 versus 0.6206, but had a slightly lower macro recall, 0.3856 versus 0.3884. In prediction problems such as this one, higher recall is sometimes a bigger priority than accuracy, so grouping the categories did not help.

We get the best performance after adding the temperature feature in `temp_animal_data` (Dataset 3), which scored higher on both accuracy and macro recall in every fold. Let's use that as our data set in future training.

We still have the issue of class imbalance within `Outcome Type`: there are very few occurrences of `Euthanasia` and `Died`. To account for the rarer outcomes, we can weight each training sample by its class. The following code computes inverse frequency weights from the frequency of each outcome in the training set, normalized so that the weights average to 1. Higher weights tell the model to pay more attention to the rare outcomes, which affects how the trees choose their splits. We can then retrain the same model with these weights and check whether accuracy and recall improve.

In [422]:
# normalized class frequencies
class_counts = temp_animal_data["Outcome Type"].value_counts()
class_freqs = class_counts / class_counts.sum()

# inverse frequency weights
inverse_freq = 1 / class_freqs

# normalize to mean
normalized_weights = inverse_freq / inverse_freq.mean()

print("Suggested Class Weights (normalized inverse frequency):\n")
weight_df = pd.DataFrame({
    "Outcome Type": normalized_weights.index,
    "Weight": normalized_weights.round(2).values,
    "Count": class_counts.values,
    "Percentage": (class_freqs * 100).round(2).values
})
print(weight_df.to_string(index=False))

Suggested Class Weights (normalized inverse frequency):

   Outcome Type  Weight  Count  Percentage
       Adoption    0.07  55044       49.52
       Transfer    0.11  35024       31.51
Return to Owner    0.22  16599       14.93
     Euthanasia    1.07   3449        3.10
           Died    3.54   1041        0.94
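
As a cross-check, scikit-learn's `compute_class_weight` with `"balanced"` yields weights proportional to the same inverse frequencies. The sketch below rebuilds a label array from the counts in the table above and rescales the balanced weights to a mean of 1. It is an independent sanity check under that assumption, not part of the training pipeline.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# class counts copied from the table above
classes = np.array(["Adoption", "Transfer", "Return to Owner", "Euthanasia", "Died"])
counts = np.array([55044, 35024, 16599, 3449, 1041])
y = np.repeat(classes, counts)

# "balanced" weights are n_samples / (n_classes * count), i.e. proportional to 1 / frequency
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# dividing by the mean gives the normalization used in this notebook
normalized = balanced / balanced.mean()
for name, weight in zip(classes, normalized):
    print(f"{name:<17}: {weight:.2f}")
```

Rounded to two decimals, the rescaled values agree with the table: the rarest class (`Died`) is weighted roughly 50 times more heavily than `Adoption`.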


Let's implement these weights into our previous model to see if it improves its predictions.

In [423]:
labels = temp_animal_data["Outcome Type"]
data = temp_animal_data.drop(columns=["Outcome Type"]).fillna(0)
label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)

# compute the inverse frequency weights
class_counts = pd.Series(labels_encoded).value_counts().sort_index()
class_freqs = class_counts / class_counts.sum()
inv_freq_weights = 1 / class_freqs
normalized_weights = inv_freq_weights / inv_freq_weights.mean()

print("\nClass Weights (Normalized Inverse Frequency):")
for i, label in enumerate(label_encoder.classes_):
    print(f"{label:<17}: {normalized_weights[i]:.2f}")

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=363)
accs = []
recalls = []

# cross validation
for fold, (train_idx, test_idx) in enumerate(skf.split(data, labels_encoded), 1):
    x_train, x_test = data.iloc[train_idx], data.iloc[test_idx]
    y_train, y_test = labels_encoded[train_idx], labels_encoded[test_idx]
    sw_train = np.array([normalized_weights[y] for y in y_train])

    clf = HistGradientBoostingClassifier(max_iter=200, max_depth=15, random_state=363)
    clf.fit(x_train, y_train, sample_weight=sw_train)

    predictions = clf.predict(x_test)
    acc = accuracy_score(y_test, predictions)
    recall = recall_score(y_test, predictions, average='macro')

    accs.append(acc)
    recalls.append(recall)

    print(f"\nFold {fold} Accuracy     : {acc:.4f}")
    print(f"Fold {fold} Macro Recall : {recall:.4f}")

print("\n   Cross-Validation Summary    ")
print(f"Average Accuracy     : {np.mean(accs):.4f}")
print(f"Average Macro Recall : {np.mean(recalls):.4f}")


Class Weights (Normalized Inverse Frequency):
Adoption         : 0.07
Died             : 3.54
Euthanasia       : 1.07
Return to Owner  : 0.22
Transfer         : 0.11

Fold 1 Accuracy     : 0.5067
Fold 1 Macro Recall : 0.4977

Fold 2 Accuracy     : 0.5019
Fold 2 Macro Recall : 0.5015

Fold 3 Accuracy     : 0.5034
Fold 3 Macro Recall : 0.5101

Fold 4 Accuracy     : 0.5043
Fold 4 Macro Recall : 0.5154

Fold 5 Accuracy     : 0.5079
Fold 5 Macro Recall : 0.5050

   Cross-Validation Summary    
Average Accuracy     : 0.5048
Average Macro Recall : 0.5059


We can see that this model scored lower on accuracy (0.5048, down from about 0.62) but improved macro recall to 0.5059 (up from about 0.39). This suggests that class imbalance is a big issue for our model, and that we need to consider other approaches or other ways to engineer our features so the model can learn to predict the rarer classes.

Another approach would be to fine-tune these weights to potentially achieve higher recall. However, for the sake of time, we simply use the inverse frequency as the weight for each outcome.
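
One simple way such tuning could look, sketched here as an assumption rather than something we ran: raise the inverse frequencies to a power `alpha` between 0 and 1 before normalizing. The function `tempered_class_weights` and the parameter `alpha` are hypothetical names; `alpha = 1` reproduces the weights above and `alpha = 0` removes the weighting entirely.

```python
import numpy as np

def tempered_class_weights(counts, alpha):
    """Inverse-frequency weights raised to the power alpha, normalized to mean 1."""
    counts = np.asarray(counts, dtype=float)
    freqs = counts / counts.sum()
    raw = (1.0 / freqs) ** alpha
    return raw / raw.mean()

# training counts in label-encoded (alphabetical) order:
# Adoption, Died, Euthanasia, Return to Owner, Transfer
counts = [55044, 1041, 3449, 16599, 35024]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, tempered_class_weights(counts, alpha).round(2))
```

Intermediate values of `alpha` shrink the gap between the largest and smallest weight, which could be searched over with the same cross-validation loop.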

Now let's train the model on the entire training set, predict on the test set, and write the final predictions to a CSV file.

In [424]:
X = pd.get_dummies(temp_animal_data.drop(columns=["Outcome Type"]), dtype=np.uint8)
y = temp_animal_data["Outcome Type"]

feature_columns = X.columns
X_test_final = pd.get_dummies(temp_animal_test.drop(columns=["Id"]), dtype=np.uint8)
X_test_final = X_test_final.reindex(columns=feature_columns, fill_value=0)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# compute inverse frequency weights
class_counts = pd.Series(y_encoded).value_counts().sort_index()
class_freqs = class_counts / class_counts.sum()
inv_freq_weights = 1 / class_freqs
normalized_weights = inv_freq_weights / inv_freq_weights.mean()

# TRAIN ON THE FULL SET THIS TIME
sample_weight = np.array([normalized_weights[label] for label in y_encoded])
clf = HistGradientBoostingClassifier(max_iter=200, max_depth=15, random_state=42)
clf.fit(X, y_encoded, sample_weight=sample_weight)

# PREDICT ON THE TEST SET
y_test_pred_encoded = clf.predict(X_test_final)
y_test_pred = label_encoder.inverse_transform(y_test_pred_encoded)

# save output to csv
submission = pd.DataFrame({
    "Id": temp_animal_test["Id"],
    "Outcome Type": y_test_pred
})
submission = submission.sort_values("Id").reset_index(drop=True)
submission.to_csv("boosting_predictions.csv", index=False)
print("Saved predictions to 'boosting_predictions.csv'")

# check the distribution just to see and ensure general trends are met
prediction_distribution = submission["Outcome Type"].value_counts(normalize=True) * 100
print("\nPrediction Distribution (%):")
print(prediction_distribution.round(2).sort_index().to_string())

Saved predictions to 'boosting_predictions.csv'

Prediction Distribution (%):
Outcome Type
Adoption           37.49
Died                8.63
Euthanasia          9.97
Return to Owner    27.34
Transfer           16.57


We see in the prediction distribution that we are substantially over-predicting died and euthanasia relative to their training frequencies: 8.63% versus 0.94% for died, and 9.97% versus 3.10% for euthanasia. Even though our recall improved, the model became too sensitive to these classes with these weights. It may be better to experiment with tuning the weights to balance accuracy and recall.

It may be that `HistGradientBoostingClassifier` isn't the best classifier for our data. One possible weakness is high-dimensional one-hot encoding, which describes our current data set. Even when we reduced the number of features by grouping, recall did not improve. So trying a different strategy may be better.

Let's try handling the class imbalance with SMOTE paired with a `RandomForestClassifier`. SMOTE oversamples the minority classes with synthetic examples, which will hopefully improve the recall of our `Died` and `Euthanasia` outcomes.
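
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbors. The sketch below is a simplified NumPy illustration of that interpolation step only, not imblearn's implementation; `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Generate n_new points on segments between minority points and their neighbors."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances between minority points
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(minority), size=n_new)        # starting points
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor each
    gap = rng.random((n_new, 1))                              # position along the segment
    return minority[base] + gap * (minority[chosen] - minority[base])

# toy minority class: the four corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=6)
print(synthetic.round(2))
```

Because each synthetic point lies between two real points, this kind of interpolation treats every feature as continuous, which is worth keeping in mind for our boolean and one-hot columns.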

In [425]:
labels = temp_animal_data["Outcome Type"]
data = temp_animal_data.drop(columns=["Outcome Type"]).fillna(0)

label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)
class_labels = label_encoder.classes_

# set up cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=363)
accs = []
recalls = []

# cross-validation
for fold, (train_idx, test_idx) in enumerate(skf.split(data, labels_encoded), 1):
    # split sets
    x_train, x_test = data.iloc[train_idx], data.iloc[test_idx]
    y_train, y_test = labels_encoded[train_idx], labels_encoded[test_idx]

    # apply smote to the training set
    smote = SMOTE(random_state=363)
    x_resampled, y_resampled = smote.fit_resample(x_train, y_train)

    print(f"\nFold {fold} - Resampling Info:")
    print("  Original training distribution:", np.bincount(y_train))
    print("  After SMOTE:", np.bincount(y_resampled))

    # train the random forest classifier using the resampled data
    clf = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42)
    clf.fit(x_resampled, y_resampled)

    # evaluate on test set
    predictions = clf.predict(x_test)
    acc = accuracy_score(y_test, predictions)
    recall = recall_score(y_test, predictions, average='macro')

    accs.append(acc)
    recalls.append(recall)

    print(f"  Fold {fold} Accuracy     : {acc:.4f}")
    print(f"  Fold {fold} Macro Recall : {recall:.4f}")

print("\n=== Cross-Validation Summary ===")
print(f"Average Accuracy     : {np.mean(accs):.4f}")
print(f"Average Macro Recall : {np.mean(recalls):.4f}")



Fold 1 - Resampling Info:
  Original training distribution: [44035   833  2759 13279 28019]
  After SMOTE: [44035 44035 44035 44035 44035]
  Fold 1 Accuracy     : 0.5546
  Fold 1 Macro Recall : 0.4644

Fold 2 - Resampling Info:
  Original training distribution: [44035   833  2759 13279 28019]
  After SMOTE: [44035 44035 44035 44035 44035]
  Fold 2 Accuracy     : 0.5550
  Fold 2 Macro Recall : 0.4733

Fold 3 - Resampling Info:
  Original training distribution: [44036   832  2760 13279 28019]
  After SMOTE: [44036 44036 44036 44036 44036]
  Fold 3 Accuracy     : 0.5539
  Fold 3 Macro Recall : 0.4763

Fold 4 - Resampling Info:
  Original training distribution: [44035   833  2759 13279 28020]
  After SMOTE: [44035 44035 44035 44035 44035]
  Fold 4 Accuracy     : 0.5592
  Fold 4 Macro Recall : 0.4691

Fold 5 - Resampling Info:
  Original training distribution: [44035   833  2759 13280 28019]
  After SMOTE: [44035 44035 44035 44035 44035]
  Fold 5 Accuracy     : 0.5530
  Fold 5 Macro Recall

We see that, compared with the weighted boosting classifier, the random forest with SMOTE has a higher cross-validated accuracy (about 0.55 versus 0.50) and a somewhat lower macro recall (0.46 to 0.48 in the folds shown, versus about 0.51). However, macro recall alone does not tell us which classes each model gets right, so the two numbers are hard to compare directly. Let's train it on the full training set and predict on the test set anyway, to see how the predicted outcome distribution trends.
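
One way to look behind a single macro recall number is to break it down per class. The sketch below uses made-up labels purely for illustration; it relies on `recall_score` with `average=None` and on `confusion_matrix`, both from scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# hypothetical labels for illustration only (0 = common class, 2 = rare class)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

per_class = recall_score(y_true, y_pred, average=None)   # one recall per class
macro = recall_score(y_true, y_pred, average="macro")    # unweighted mean of the above

print("per-class recall:", per_class.round(3))
print("macro recall    :", round(macro, 3))
print(confusion_matrix(y_true, y_pred))  # rows = true class, columns = predicted class
```

Two models with the same macro recall can differ a lot in these per-class values, for example one recalling the rare class well and the other missing it entirely.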

In [426]:
X = pd.get_dummies(temp_animal_data.drop(columns=["Outcome Type"]), dtype=np.uint8)
y = temp_animal_data["Outcome Type"]
feature_columns = X.columns
X_test_final = pd.get_dummies(temp_animal_test.drop(columns=["Id"]), dtype=np.uint8)
X_test_final = X_test_final.reindex(columns=feature_columns, fill_value=0)

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
class_labels = label_encoder.classes_
smote = SMOTE(random_state=363)
X_resampled, y_resampled = smote.fit_resample(X, y_encoded)

print("Original training distribution:", np.bincount(y_encoded))
print("After SMOTE:", np.bincount(y_resampled))

# train the random forest classifier using the resampled data
clf = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=363)
clf.fit(X_resampled, y_resampled)

# predict on the test set
y_test_pred_encoded = clf.predict(X_test_final)
y_test_pred = label_encoder.inverse_transform(y_test_pred_encoded)

# save output to csv
submission = pd.DataFrame({
    "Id": temp_animal_test["Id"],
    "Outcome Type": y_test_pred
})
submission = submission.sort_values("Id").reset_index(drop=True)
submission.to_csv("smote_predictions.csv", index=False)
print("Saved predictions to 'smote_predictions.csv'")

# check distribution to see trends
distribution = submission["Outcome Type"].value_counts(normalize=True) * 100
print("\nPrediction Distribution (%):")
print(distribution.sort_index().round(2).to_string())

Original training distribution: [55044  1041  3449 16599 35024]
After SMOTE: [55044 55044 55044 55044 55044]
Saved predictions to 'smote_predictions.csv'

Prediction Distribution (%):
Outcome Type
Adoption           45.80
Died                4.16
Euthanasia          6.26
Return to Owner    26.18
Transfer           17.60


Upon seeing the outcome distribution, we see that it is trending better than our previous weight-balanced boosting classifier, with lower percentages for died (4.16% versus 8.63%) and euthanasia (6.26% versus 9.97%). Upon submitting this to Kaggle, we achieved our highest score of 0.47682.