# cool-title-here

**Group Members:** Aaron Go, John Alonzo, Sean Olores, Sean Cardeno 

**Deadline:** March 21, 2026 (Saturday), 8:00 AM

**Task:** Predict **the Challenge Rating of a monster** based on **the monster's features**

**Dataset Source:** [DnD 5e Monsters](https://www.kaggle.com/datasets/mrpantherson/dnd-5e-monsters/data)

**Justification:** In Dungeons & Dragons, a monster's *Challenge Rating* (CR) is a numerical estimate of how dangerous it is in combat. A model that learns to predict CR from a monster's raw stats would let Game Masters balance custom *homebrew* content quickly, without working through the manual design formulas.


## Section 1: Data Preparation

In this section, we will:
- Load the dataset
- Inspect the data structure and basic statistics
- Handle missing values and clean the data
- Identify data types and any inconsistencies
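
As a rough sketch of the inspection step, the helper below collects the dtype, missing count, and unique count of every column in one table. The `inspect` helper and the `toy` frame are assumptions made for illustration; the three rows are reconstructed from the preview printed later in this notebook, not loaded from the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dnd_monsters.csv (three rows reconstructed from the notebook preview)
toy = pd.DataFrame({
    "name": ["aarakocra", "abjurer", "aboleth"],
    "cr": ["1/4", "9", "10"],            # raw CR is text because of fractions
    "size": ["Medium", "Medium", "Large"],
    "hp": [13, 84, 135],
    "str": [10.0, np.nan, 21.0],
})

def inspect(df):
    """Return one row per column: dtype, number of missing values, number of unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })

summary = inspect(toy)
print(summary)
print(toy.describe())  # basic statistics for the numeric columns
```

On the real data, `inspect(df)` would be called right after `pd.read_csv`.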

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('dnd_monsters.csv') 
print("Original shape of data:", df.shape)

Original shape of data: (762, 17)


In [36]:
# Data Cleaning

# convert fraction strings in challenge rating column to floats
def convert_cr_to_float(cr_val):
    if pd.isna(cr_val): # If the cell is completely empty, leave it
        return cr_val
    
    cr_str = str(cr_val).strip()
    
    if '/' in cr_str:
        numerator, denominator = cr_str.split('/')
        return float(numerator) / float(denominator)
    
    return float(cr_str)

df['cr'] = df['cr'].apply(convert_cr_to_float)

# Aldani fix: the CR for 'aldani' (row 22) is null in the dataset, so we fill it in with the value listed on 5e.tools
df.loc[df['name'].str.lower() == 'aldani', 'cr'] = 1.0

# ordinal-encode the size (strip whitespace first, otherwise capitalize() would not fix labels with leading spaces)
df['size'] = df['size'].astype(str).str.strip().str.capitalize()
size_mapping = {'Tiny': 0, 'Small': 1, 'Medium': 2, 'Large': 3, 'Huge': 4, 'Gargantuan': 5}
df['size'] = df['size'].map(size_mapping)

# double-check CR for any stray nulls
missing_cr_count = df['cr'].isna().sum()
print(f"Monsters missing CR: {missing_cr_count}")

print("New shape of data:", df.shape)
df.head()

Monsters missing CR: 0
New shape of data: (762, 17)


Unnamed: 0,name,url,cr,type,size,ac,hp,speed,align,legendary,source,str,dex,con,int,wis,cha
0,aarakocra,https://www.aidedd.org/dnd/monstres.php?vo=aar...,0.25,humanoid (aarakocra),2,12,13,fly,neutral good,,Monster Manual (BR),10.0,14.0,10.0,11.0,12.0,11.0
1,abjurer,,9.0,humanoid (any race),2,12,84,,any alignment,,Volo's Guide to Monsters,,,,,,
2,aboleth,https://www.aidedd.org/dnd/monstres.php?vo=abo...,10.0,aberration,3,17,135,swim,lawful evil,Legendary,Monster Manual (SRD),21.0,9.0,15.0,18.0,15.0,18.0
3,abominable-yeti,,9.0,monstrosity,4,15,137,,chaotic evil,,Monster Manual,,,,,,
4,acererak,,23.0,undead,2,21,285,,neutral evil,,Adventures (Tomb of Annihilation),,,,,,
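
One thing the cleaning cell does not check is whether every size label actually appears in `size_mapping`; any label that does not will silently become NaN after `.map()`. A minimal sketch of such a check follows. The `encode_size` helper and the sample labels (including "Colossal") are hypothetical and only illustrate the idea.

```python
import pandas as pd

size_mapping = {"Tiny": 0, "Small": 1, "Medium": 2, "Large": 3, "Huge": 4, "Gargantuan": 5}

def encode_size(raw):
    """Normalise size labels, map them to ordinals, and list labels that did not map."""
    cleaned = raw.astype(str).str.strip().str.capitalize()
    encoded = cleaned.map(size_mapping)
    unmapped = sorted(cleaned[encoded.isna()].unique())
    return encoded, unmapped

# Hypothetical raw labels with stray whitespace, mixed case, and one unknown size
sizes = pd.Series([" medium", "LARGE", "Tiny ", "Colossal"])
encoded, unmapped = encode_size(sizes)
print(encoded.tolist())
print("Unmapped labels:", unmapped)
```

If `unmapped` is not empty on the real data, the mapping needs another entry or the rows need a manual fix before the column is overwritten.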


In [37]:
# Calculate the total missing values for every single column
missing_data = df.isna().sum()

# Filter to ONLY show columns that actually have missing data (greater than 0)
print("Columns with missing data before fixing:")
print(missing_data[missing_data > 0])

Columns with missing data before fixing:
url          361
speed        514
legendary    719
str          361
dex          361
con          361
int          361
wis          361
cha          361
dtype: int64
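
All six ability scores (and `url`) are missing exactly 361 times, which suggests, but does not prove, that the same rows are missing all of them at once. A quick way to check is to count how many stats are missing per row. The `missing_pattern` helper and the toy frame below are assumptions for illustration; the two complete rows are taken from the preview above.

```python
import numpy as np
import pandas as pd

stat_cols = ["str", "dex", "con", "int", "wis", "cha"]

def missing_pattern(df, cols):
    """Count rows by how many of `cols` are missing in that row."""
    return df[cols].isna().sum(axis=1).value_counts().sort_index()

# Toy frame: two complete rows (aarakocra, aboleth) and two rows with no stats at all
toy = pd.DataFrame(
    [[10, 14, 10, 11, 12, 11],
     [np.nan] * 6,
     [21, 9, 15, 18, 15, 18],
     [np.nan] * 6],
    columns=stat_cols,
)
pattern = missing_pattern(toy, stat_cols)
print(pattern)
```

If the real data only shows the values 0 and 6 in this table, every affected monster has no ability scores at all, which matters for how the imputation behaves.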


In [38]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
# TODO: maybe split this into multiple cells and add more comments about why we are doing each step, especially the scaling part since it might not be intuitive to everyone

# 1. Drop columns that don't help prediction
df = df.drop(columns=['url', 'source', 'align'], errors='ignore')

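# 2. Fill text NAs
# 514 missing speeds become 'none'; lowercase on purpose, since newer pandas versions
# read the literal string 'None' back from a CSV as a missing value
df['speed'] = df['speed'].fillna('none')
# In the preview, 'legendary' is either the string 'Legendary' or NaN, so encode it as 1/0
# (a plain fillna(0) would leave a mix of strings and integers in the column)
df['legendary'] = df['legendary'].notna().astype(int)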

# 3. Scaled KNN imputation for the 361 missing core stats. The idea comes from another notebook.
# TODO: find out why these values are missing, why that notebook chose this approach over others, and whether we can do better (maybe ask sir in advance)
cols_to_impute = ['str', 'dex', 'con', 'int', 'wis', 'cha']

# Step A: Scale the data to a 0-1 range so the imputer isn't biased by features with large values
# TODO: explain this in the cell above; maybe show a quick graph of how large values can bias the imputer and why scaling helps
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df[cols_to_impute])

# Step B: KNN-impute the 361 missing values based on the 5 most similar monsters
imputer = KNNImputer(n_neighbors=5)
df_imputed = imputer.fit_transform(df_scaled)

# Step C: Inverse-scale back to normal D&D numbers for our dataset
df[cols_to_impute] = scaler.inverse_transform(df_imputed)

# round the stats back to whole numbers (ability scores in D&D 5e are integers, so 15.4 Strength is not valid)
df[cols_to_impute] = df[cols_to_impute].round()

# 4. Final check & Export 
print("Remaining Missing Data:")
print(df.isna().sum()[df.isna().sum() > 0]) 

# checkpoint file 
df.to_csv('cleaned_monsters.csv', index=False)
print("\nSection 1 Complete! Data exported to cleaned_monsters.csv")

Remaining Missing Data:
Series([], dtype: int64)

Section 1 Complete! Data exported to cleaned_monsters.csv
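
A caveat on the imputation above: if a monster is missing all six stats, `KNNImputer` has nothing to measure distance with when only those six columns are passed in, and scikit-learn then falls back to the column mean of the training data. The sketch below contrasts that with an imputation that also sees context columns. The `knn_impute` helper, the choice of `hp` and `ac` as context, and the toy numbers are all assumptions; `cr` is deliberately left out because it is the prediction target.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

def knn_impute(df, target_cols, context_cols, n_neighbors=2):
    """Scale to 0-1, KNN-impute target_cols using context_cols as extra signal, scale back."""
    cols = list(context_cols) + list(target_cols)
    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(df[cols])  # NaNs are ignored in fit and kept in transform
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(scaled)
    restored = pd.DataFrame(scaler.inverse_transform(imputed), columns=cols, index=df.index)
    return restored[list(target_cols)].round()

# Toy data: two weak monsters, two strong ones, and one strong monster with no stats
toy = pd.DataFrame({
    "hp":  [10, 12, 200, 210, 205],
    "ac":  [11, 12, 18, 19, 18],
    "str": [8, 10, 24, 26, np.nan],
    "con": [10, 12, 22, 24, np.nan],
})

stats_only = knn_impute(toy, ["str", "con"], [])              # no usable distances: column mean
with_context = knn_impute(toy, ["str", "con"], ["hp", "ac"])  # neighbours found via hp and ac
print(stats_only.loc[4].tolist(), with_context.loc[4].tolist())
```

In the toy example the stats-only version gives the fifth monster average scores, while the version with context borrows from the two monsters with similar `hp` and `ac`. Whether this helps on the real dataset would need to be validated.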


## Section 2: Exploratory Data Analysis (EDA)

In this section, we will:
- Analyze the distribution of features
- Identify relationships between features and the target variable
- Generate visualizations to understand the data better
- Compute statistical summaries

In [39]:
# Exploratory Data Analysis
# TODO
# Don't forget to export the visualizations for the poster
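
As a starting point for the EDA, here is a minimal sketch that ranks numeric features by their Pearson correlation with `cr` and saves one scatter plot to disk for the poster. The five rows are the ones shown in the Section 1 preview; the output file name `hp_vs_cr.png` and the figure settings are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the figure can be saved without a display
import matplotlib.pyplot as plt
import pandas as pd

# The five preview rows from Section 1 (a stand-in for cleaned_monsters.csv)
preview = pd.DataFrame({
    "cr": [0.25, 9.0, 10.0, 9.0, 23.0],
    "size": [2, 2, 3, 4, 2],
    "ac": [12, 12, 17, 15, 21],
    "hp": [13, 84, 135, 137, 285],
})

# Pearson correlation of each feature with the target, strongest first
corr_with_cr = preview.corr()["cr"].drop("cr").sort_values(ascending=False)
print(corr_with_cr)

# One scatter plot, exported at poster-friendly resolution
fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(preview["hp"], preview["cr"])
ax.set_xlabel("hp")
ax.set_ylabel("cr")
ax.set_title("Hit points vs. Challenge Rating")
fig.savefig("hp_vs_cr.png", dpi=200, bbox_inches="tight")
plt.close(fig)
```

Five rows are far too few to draw conclusions from; the same calls would be run on the full cleaned dataset.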

## Section 3: Feature Engineering & Preprocessing

In this section, we will:
- Separate features and target variable
- Encode categorical variables
- Split data into training, validation, and test sets
- Normalize/standardize features if needed
- Perform feature selection if necessary
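
One way the split and scaling steps could look is sketched below. The 60/20/20 ratio, the seed, and the `split_indices` helper are assumptions, and the feature frame is synthetic. The important part is that the scaling statistics come from the training rows only, so nothing leaks from the validation or test sets.

```python
import numpy as np
import pandas as pd

def split_indices(n_rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle row positions once and cut them into train / validation / test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_frac))
    n_val = int(round(n_rows * val_frac))
    return order[n_test + n_val:], order[n_test:n_test + n_val], order[:n_test]

train_idx, val_idx, test_idx = split_indices(762)  # 762 rows, as in the dataset

# Synthetic feature frame, only to demonstrate the scaling step
X = pd.DataFrame({"hp": np.arange(762, dtype=float), "ac": np.arange(762, dtype=float) % 20})

# Standardize with statistics from the training rows only
mu = X.iloc[train_idx].mean()
sigma = X.iloc[train_idx].std(ddof=0)
X_train = (X.iloc[train_idx] - mu) / sigma
X_val = (X.iloc[val_idx] - mu) / sigma
X_test = (X.iloc[test_idx] - mu) / sigma
print(len(train_idx), len(val_idx), len(test_idx))
```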

In [40]:
# TODO: Define the target variable (cr) and the features

# TODO: Parse the 'speed' column into has_fly and has_swim flags

# TODO: Extract the sub-race from the 'type' column (e.g. "aarakocra" from "humanoid (aarakocra)")

# TODO: Standardize/scale the numeric features for the neural network
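
A possible sketch for the speed and type TODOs is shown below. The new column names (`has_fly`, `has_swim`, `base_type`, `subtype`) and the `add_engineered_features` helper are assumptions; the three sample rows come from the Section 1 preview. It assumes the text in parentheses in `type` is the sub-race, which should be confirmed against the full dataset.

```python
import pandas as pd

def add_engineered_features(df):
    """Add movement flags from 'speed' and split 'type' into base type and subtype."""
    out = df.copy()
    speed = out["speed"].fillna("").str.lower()
    out["has_fly"] = speed.str.contains("fly", regex=False).astype(int)
    out["has_swim"] = speed.str.contains("swim", regex=False).astype(int)
    # "humanoid (aarakocra)" -> base_type "humanoid", subtype "aarakocra"
    parts = out["type"].str.extract(r"^\s*([^(]+?)\s*(?:\((.*)\))?\s*$")
    out["base_type"] = parts[0]
    out["subtype"] = parts[1]  # NaN when the type has no parentheses
    return out

preview = pd.DataFrame({
    "name": ["aarakocra", "abjurer", "aboleth"],
    "type": ["humanoid (aarakocra)", "humanoid (any race)", "aberration"],
    "speed": ["fly", None, "swim"],
})
features = add_engineered_features(preview)
print(features[["name", "has_fly", "has_swim", "base_type", "subtype"]])
```

Using `str.contains` instead of an equality check means a value that lists several movement types would still set both flags.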

## Section 4: Model Selection and Training

In this section, we will:
- Select and train **Classical Model 1** 
- Select and train **Classical Model 2** 
- Select and train **Neural Network Model** 
- Provide justification for model choices
- Train models and monitor performance on validation set
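
Before picking the three models, it may help to have a baseline to beat. The sketch below compares a predict-the-mean baseline against an ordinary least squares fit on a validation split. The data is synthetic (not the real dataset), and the linear model is only a hypothetical candidate, not one of the group's chosen models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the real dataset): CR loosely tied to hp and ac
hp = rng.uniform(5, 300, size=200)
ac = rng.uniform(10, 22, size=200)
cr = 0.06 * hp + 0.3 * (ac - 10) + rng.normal(0, 1.0, size=200)

X = np.column_stack([np.ones_like(hp), hp, ac])  # leading column of ones = intercept
train, val = slice(0, 150), slice(150, 200)

# Baseline: always predict the mean CR of the training rows
baseline_pred = np.full(50, cr[train].mean())

# Hypothetical candidate: ordinary least squares fitted on the training rows only
coef, *_ = np.linalg.lstsq(X[train], cr[train], rcond=None)
linear_pred = X[val] @ coef

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_pred - y_true)))

mae_baseline = mae(cr[val], baseline_pred)
mae_linear = mae(cr[val], linear_pred)
print(f"baseline MAE: {mae_baseline:.2f}, linear MAE: {mae_linear:.2f}")
```

Any model that cannot beat the mean baseline on the validation set is not learning anything useful from the features.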

### Classical Model 1: 
Justification: 

### Classical Model 2: 
Justification: 

### Neural Network Model: 
Justification: 

## Section 5: Error Analysis and Model Tuning

In this section, we will:
- Analyze errors made by the models
- Identify patterns in prediction errors (residuals, since this is a regression task)
- Perform hyperparameter tuning
- Improve model performance through adjustments
- Select the best model based on validation performance

In [41]:
# Error Analysis
# TODO
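
Since this is a regression task, one simple way to look for error patterns is to group the residuals by CR band and see where the model over- or under-predicts. The band edges, the `residuals_by_cr_band` helper, and the predictions below are assumptions for illustration; the true CR values are the five from the Section 1 preview.

```python
import numpy as np
import pandas as pd

def residuals_by_cr_band(y_true, y_pred, edges=(0, 1, 5, 10, 20, 30)):
    """Summarise residuals (prediction minus truth) per band of true CR."""
    frame = pd.DataFrame({
        "cr": np.asarray(y_true, dtype=float),
        "residual": np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float),
    })
    frame["band"] = pd.cut(frame["cr"], bins=list(edges), include_lowest=True)
    return frame.groupby("band", observed=True)["residual"].agg(
        n="count",
        mean_error="mean",                 # sign shows over- vs. under-prediction
        mae=lambda r: r.abs().mean(),
    )

y_true = [0.25, 9.0, 10.0, 9.0, 23.0]   # CR values from the preview
y_pred = [1.0, 8.0, 12.0, 9.5, 18.0]    # hypothetical model predictions
report = residuals_by_cr_band(y_true, y_pred)
print(report)
```

A strongly negative `mean_error` in the highest band, for example, would mean the model underestimates the most dangerous monsters.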

## Section 6: Model Evaluation

In this section, we will:
- Evaluate all models on the test set
- Compare performance metrics across models
- Generate detailed evaluation reports
- Visualize performance comparisons
- Identify the best-performing model
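
For the comparison itself, a minimal sketch of the usual regression metrics (MAE, RMSE, and R²) is given below. The `regression_report` helper and the example numbers are assumptions; the same function would be applied to each model's test-set predictions.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    ss_res = float(np.sum(err ** 2))                        # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical example: three perfect predictions and one that is off by 2
scores = regression_report([1, 2, 3, 4], [1, 2, 3, 6])
print(scores)
```

RMSE penalises large misses more than MAE does, so reporting both shows whether a model's error comes from a few big mistakes or many small ones.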

## Section 7: Conclusions and Findings


### **Summary of Results:**
- 

### **Best Model Details:**
- 

### **Future Improvements:**
- 

### **References:**
- 

### **AI Disclosure:**
- 