# 🐾 Canine Health Prediction App – Project Documentation

This Django web application predicts canine health (Healthy or Not Healthy) based on lifestyle and demographic inputs. It features a trained machine learning model, user authentication, and a dashboard for data visualization.

---

## 📊 Dataset Overview

This synthetic dataset simulates a wide range of dog breeds and their health-related characteristics. It is designed for **binary classification**, where the target variable is whether a dog is considered healthy (`"Yes"`) or not healthy (`"No"`).

The data was generated using realistic distributions of age, breed sizes, weight, diet, and lifestyle factors. A simple rule-based logic was applied to determine the `"Healthy"` label through feature interactions.

**What's Included:**
- 10,000 rows of synthetic data
- 21 columns in total (input features plus the `Healthy` target), covering breed size, age, diet, daily activity, medications, sleep, vet visits, etc.
- Binary target column: `Healthy` (`Yes`/`No`)
- ~3% missing values per feature
- Outcome skewed toward healthy dogs: roughly 74% of the held-out test rows (1,424 of 1,936) carry the `Yes` label
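
The missing-value rate and class balance listed above can be checked with a few lines of pandas. The sketch below uses a tiny stand-in DataFrame rather than the real CSV; the `Diet` values and all numbers are made up for illustration, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real CSV; every value here is illustrative only.
df = pd.DataFrame({
    "Age": [3.0, 12.0, np.nan, 8.0, 2.0],
    "Diet": ["Wet food", None, "Hard food", "Hard food", "Home cooked"],
    "Healthy": ["Yes", "Yes", "No", None, "Yes"],
})

# Fraction of missing values in each column.
missing_share = df.isna().mean()

# Class balance of the target; value_counts skips missing labels by default.
class_share = df["Healthy"].value_counts(normalize=True)

print(missing_share)
print(class_share)
```

Running the same two expressions on the full dataset would show whether the per-feature missing rate is close to 3% and how far the target leans toward `Yes`.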

---

## 🧠 1. Machine Learning Model Training Steps

**Steps:**
1. **Data Cleaning:**
   - Dropped rows with missing target (`Healthy`)
   - Filled numeric nulls with median
   - Filled categorical nulls with mode

2. **Dropped Columns:**
   - `ID`, `Sex`, `Food Brand`, `Daily Walk Distance (miles)`, `Other Pets in Household`,  
     `Medications`, `Seizures`, `Owner Activity Level`, `Average Temperature (F)`, `Synthetic`

3. **Feature Selection:**
   - Excluded `Breed` due to form-model mismatch

4. **Encoding:**
   - All remaining categorical columns were label-encoded
   - `Healthy` column label-encoded as the binary target

5. **Model Training:**
   - Trained with `RandomForestClassifier` (100 estimators)
   - 80/20 train-test split
   - Evaluation using accuracy score and classification report

6. **Artifacts Saved:**
   - `canine_model.pkl` – trained model
   - `canine_label_encoders.pkl` – encoders for categorical features
   - `canine_target_encoder.pkl` – target encoder
   - `canine_feature_order.pkl` – column order reference for prediction
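
The saved column order matters because the model expects features in the same order it was trained on. The following is a minimal sketch of the round trip, assuming the artifact is written and read with `joblib`; it uses a temporary directory and only three of the real feature names so that it runs on its own.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd

# A subset of the real feature names, in training order.
feature_order = ["Breed Size", "Age", "Weight (lbs)"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "canine_feature_order.pkl"
    joblib.dump(feature_order, path)   # what the training script does
    loaded_order = joblib.load(path)   # what the web app would do at startup

# Incoming data may arrive with its columns in a different order.
row = pd.DataFrame([{"Age": 3.0, "Weight (lbs)": 60.0, "Breed Size": 1}])

# Reorder the columns to match training before calling model.predict().
row = row[loaded_order]
print(list(row.columns))
```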

---

## 🔐 2. Authentication

Authentication uses Django’s built-in auth system:

- `UserCreationForm` for new user registration
- `AuthenticationForm` for login
- Views for `register`, `login`, and `logout`
- `@login_required` used to protect `predict/` and `dashboard/` routes

---

## 🔌 3. Integration Steps

### 🔄 Prediction Flow:
1. Authenticated user submits form data on `/predict/`
2. Form fields are renamed to model feature names with a `field_mapping` dictionary and arranged in the column order stored in `canine_feature_order.pkl`
3. Categorical values are encoded with the saved `LabelEncoder` objects
4. Numeric values are converted to `float`
5. Model predicts and inverse-transforms the output using `target_encoder`
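
The flow above can be sketched as a single function. This is an illustration under stated assumptions, not the app's actual view code: the form field names (`breed_size`, `age`), the category labels, and the helper name `predict_health` are hypothetical, and the artifacts are rebuilt in memory from a six-row toy dataset instead of being loaded from the `.pkl` files.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# --- Stand-ins for the saved artifacts (the app would load these with joblib) ---
train = pd.DataFrame({
    "Breed Size": ["Small", "Large", "Medium", "Large", "Small", "Medium"],
    "Age": [3.0, 12.0, 7.0, 11.0, 2.0, 8.0],
    "Healthy": ["Yes", "No", "Yes", "No", "Yes", "No"],
})
label_encoders = {"Breed Size": LabelEncoder().fit(train["Breed Size"])}
target_encoder = LabelEncoder().fit(train["Healthy"])
feature_order = ["Breed Size", "Age"]

X = train[feature_order].copy()
X["Breed Size"] = label_encoders["Breed Size"].transform(X["Breed Size"])
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X, target_encoder.transform(train["Healthy"]))

# Hypothetical form field names mapped to model feature names.
field_mapping = {"breed_size": "Breed Size", "age": "Age"}


def predict_health(form_data):
    """Turn submitted form values into a 'Yes'/'No' prediction."""
    # Step 2: rename form fields to the feature names used in training.
    features = {field_mapping[key]: value for key, value in form_data.items()}

    row = {}
    for name in feature_order:
        value = features[name]
        if name in label_encoders:
            # Step 3: encode categoricals; str() normalizes NumPy string types.
            row[name] = label_encoders[name].transform([str(value)])[0]
        else:
            # Step 4: numeric form values arrive as text.
            row[name] = float(value)

    # Step 5: predict, then map the encoded class back to its label.
    X_new = pd.DataFrame([row], columns=feature_order)
    encoded = model.predict(X_new)
    return target_encoder.inverse_transform(encoded)[0]


result = predict_health({"breed_size": "Small", "age": "3"})
print(result)
```

Building the single-row DataFrame with `columns=feature_order` keeps the column order identical to training regardless of the order in which the form fields were submitted.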

### 📈 Dashboard:
- `matplotlib` is used to generate a health distribution bar chart
- The chart is base64-encoded and embedded in a Django HTML template
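
One way to produce such a chart is sketched below. The function name `health_chart_base64` is hypothetical, and the counts reuse the test-set support from the training report purely as sample input; the app's real view may gather its counts differently.

```python
import base64
import io

import matplotlib

matplotlib.use("Agg")  # headless backend, so no display is required on the server
import matplotlib.pyplot as plt


def health_chart_base64(counts):
    """Render a bar chart of label counts and return it as a base64 PNG string."""
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(list(counts.keys()), list(counts.values()))
    ax.set_xlabel("Healthy")
    ax.set_ylabel("Number of dogs")
    ax.set_title("Health distribution")

    buffer = io.BytesIO()
    fig.savefig(buffer, format="png", bbox_inches="tight")
    plt.close(fig)  # release the figure so memory does not grow across requests
    return base64.b64encode(buffer.getvalue()).decode("ascii")


chart = health_chart_base64({"Yes": 1424, "No": 512})
```

The returned string can be passed to the template context and rendered with `<img src="data:image/png;base64,{{ chart }}">`.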

---

## ⚠️ 4. Challenges Encountered

| Challenge | Solution |
|----------|----------|
| NumPy strings caused encoder issues | Converted all values with `str(value)` |
| Feature/form mismatches | Used a `field_mapping` dictionary |
| Null values in data | Replaced using median (numerical) and mode (categorical) |

---

## ✅ Outcome

- Fully functional Django ML prediction app
- Clean integration between model and user form
- User authentication and session management
- Visual dashboard and reliable prediction pipeline


```python
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load dataset
file_path = "/xampp/htdocs/APPDEV_FINALPROJ/canine_health/predictor/data/synthetic_dog_breed_health_data.csv"
df = pd.read_csv(file_path)

# Drop rows with missing target
df_clean = df.dropna(subset=['Healthy']).copy()

# Drop irrelevant columns first (before encoding)
df_clean = df_clean.drop(columns=[
    'ID',
    'Sex',
    'Food Brand',
    'Daily Walk Distance (miles)',
    'Other Pets in Household',
    'Medications',
    'Seizures',
    'Owner Activity Level',
    'Average Temperature (F)',
    'Synthetic'
], errors='ignore')

# Drop 'Breed' if it exists
if 'Breed' in df_clean.columns:
    df_clean = df_clean.drop(columns=['Breed'])

# Fill missing numeric values with median
numeric_cols = df_clean.select_dtypes(include=['float64', 'int64']).columns
df_clean[numeric_cols] = df_clean[numeric_cols].fillna(df_clean[numeric_cols].median())

# Detect categorical columns after dropping unwanted ones
categorical_cols = df_clean.select_dtypes(include=['object']).columns.drop('Healthy')

# Fill missing categorical values with mode
df_clean[categorical_cols] = df_clean[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))

# Encode only remaining categorical features
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df_clean[col] = le.fit_transform(df_clean[col])
    label_encoders[col] = le

# Encode target
target_encoder = LabelEncoder()
df_clean['Healthy'] = target_encoder.fit_transform(df_clean['Healthy'])

# Define features and target
X = df_clean.drop('Healthy', axis=1)
y = df_clean['Healthy']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Save model and encoders
joblib.dump(model, "canine_model.pkl")
joblib.dump(label_encoders, "canine_label_encoders.pkl")
joblib.dump(target_encoder, "canine_target_encoder.pkl")
joblib.dump(list(X.columns), "canine_feature_order.pkl")

# Evaluate model
y_pred = model.predict(X_test)
print("✅ Training complete")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification report:\n", classification_report(y_test, y_pred))


# Encode future test data with the fitted encoders.
# Note: LabelEncoder.transform raises ValueError for labels not seen during fit.
def encode_test_data(test_data, label_encoders, target_encoder):
    # Work on a copy so the caller's DataFrame is not modified
    test_data = test_data.copy()

    # Apply the label encoders to categorical columns
    for col, le in label_encoders.items():
        test_data[col] = le.transform(test_data[col])

    # Encode the 'Healthy' column only when the data is labeled
    if 'Healthy' in test_data.columns:
        test_data['Healthy'] = target_encoder.transform(test_data['Healthy'])

    return test_data
```
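
Because `LabelEncoder.transform` raises a `ValueError` on any label it did not see during fitting, new data containing an unfamiliar category will fail in `encode_test_data`. One possible workaround is sketched below; the helper name `safe_transform`, the `-1` sentinel, and the example labels are assumptions for illustration and are not part of the project.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def safe_transform(encoder, values, unknown_code=-1):
    """Encode values, mapping labels the encoder never saw to unknown_code."""
    values = np.array([str(v) for v in values])
    known = np.isin(values, encoder.classes_)

    codes = np.full(len(values), unknown_code, dtype=int)
    if known.any():
        # Only the known labels are passed to the encoder, so it cannot raise.
        codes[known] = encoder.transform(values[known])
    return codes


encoder = LabelEncoder().fit(["Large", "Medium", "Small"])
codes = safe_transform(encoder, ["Small", "Giant", "Large"])
print(codes)  # [ 2 -1  0]
```

Since the model was never trained on the sentinel value, rejecting the input with a validation error may be the safer choice in the web form; the sketch only shows how to detect the unseen labels.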


```plaintext
✅ Training complete
Accuracy: 0.7975206611570248
Classification report:
               precision    recall  f1-score   support

           0       0.64      0.53      0.58       512
           1       0.84      0.89      0.87      1424

    accuracy                           0.80      1936
   macro avg       0.74      0.71      0.72      1936
weighted avg       0.79      0.80      0.79      1936
```
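
The two averaged rows of the report can be recomputed from the per-class rows, which helps when reading it: the macro average treats both classes equally, while the weighted average scales each class by its support. The sketch below uses the rounded values printed above, so the last digit may differ slightly from scikit-learn's output, which averages the unrounded scores.

```python
# Per-class rows copied from the report: (precision, recall, f1, support).
report = {
    0: (0.64, 0.53, 0.58, 512),
    1: (0.84, 0.89, 0.87, 1424),
}

total = sum(row[3] for row in report.values())


def macro(metric_index):
    """Unweighted mean of a metric across classes."""
    return sum(row[metric_index] for row in report.values()) / len(report)


def weighted(metric_index):
    """Mean of a metric across classes, weighted by class support."""
    return sum(row[metric_index] * row[3] for row in report.values()) / total


macro_avg = [macro(i) for i in range(3)]        # precision, recall, f1
weighted_avg = [weighted(i) for i in range(3)]  # precision, recall, f1
healthy_share = report[1][3] / total            # share of class 1 in the test set

print("macro avg:   ", [round(v, 3) for v in macro_avg])
print("weighted avg:", [round(v, 3) for v in weighted_avg])
print("class 1 share:", round(healthy_share, 3))
```

Support-weighted recall is mathematically equal to accuracy, which is why the `weighted avg` recall and the `accuracy` row agree. The gap between the two classes (recall 0.53 for class 0 versus 0.89 for class 1) shows that the model misses nearly half of the dogs in the minority class.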



# 🐶 Canine Health Prediction Model – Training Documentation

## 📊 Dataset Overview

- **Source**: https://www.kaggle.com/datasets/aaronisomaisom3/canine-wellness-dataset-synthetic-10k-samples
- **Target Variable**: `Healthy` – indicating the overall health status of the dog (binary classification)
- **Size**: 10,000 samples and 21 columns
- **Initial Features**: Included demographic, lifestyle, environmental, and medical information
- **Dropped Columns**:  
  `ID`, `Sex`, `Food Brand`, `Daily Walk Distance (miles)`,  
  `Other Pets in Household`, `Medications`, `Seizures`,  
  `Owner Activity Level`, `Average Temperature (F)`, `Synthetic`, `Breed`

---

## 🧹 Preprocessing Steps

1. **Remove rows with missing target values** (`Healthy`)
2. **Drop irrelevant features** listed above
3. **Impute Missing Values**:
   - Numeric columns: filled using the **median**
   - Categorical columns: filled using the **mode**
4. **Encode Categorical Features**:
   - Used `LabelEncoder` for all categorical variables
   - Saved encoders for future use
5. **Encode Target (`Healthy`)** using `LabelEncoder`
6. **Train-Test Split**:
   - 80% Training, 20% Testing
   - `random_state=42` for reproducibility

---

## 🧠 Model Architecture / Algorithm

- **Algorithm**: Random Forest Classifier
- **Library**: `sklearn.ensemble.RandomForestClassifier`
- **Key Parameters**:
  - `n_estimators=100`: Number of decision trees
  - `random_state=42`: Seed for reproducibility
- **Training Input**: Preprocessed features (`X_train`) and encoded target labels (`y_train`)

---

## 📈 Training Results

Model evaluation on test set:

```plaintext
✅ Training complete
Accuracy: 0.7975206611570248
Classification report:
               precision    recall  f1-score   support

           0       0.64      0.53      0.58       512
           1       0.84      0.89      0.87      1424

    accuracy                           0.80      1936
   macro avg       0.74      0.71      0.72      1936
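weighted avg       0.79      0.80      0.79      1936
```

The first ten rows of the preprocessed `df_clean`:

```python
df_clean.head(10)
```

| Index | Breed Size | Age | Weight (lbs) | Spay/Neuter Status | Daily Activity Level | Diet | Hours of Sleep | Play Time (hrs) | Annual Vet Visits | Healthy |
|-------|------------|-----|--------------|--------------------|----------------------|------|----------------|-----------------|-------------------|---------|
| 0 | 1 | 3.0 | 60.0 | 0 | 0 | 3 | 12.0 | 1.0 | 1.0 | 1 |
| 2 | 2 | 12.0 | 67.0 | 0 | 0 | 1 | 10.0 | 1.0 | 0.0 | 1 |
| 3 | 1 | 13.0 | 35.0 | 1 | 3 | 3 | 12.0 | 2.0 | 1.0 | 1 |
| 4 | 1 | 13.0 | 35.0 | 1 | 3 | 3 | 9.0 | 1.0 | 0.0 | 0 |
| 5 | 1 | 2.0 | 42.0 | 1 | 3 | 3 | 11.0 | 3.0 | 2.0 | 1 |
| 6 | 0 | 8.0 | 50.0 | 1 | 0 | 2 | 8.0 | 1.0 | 0.0 | 1 |
| 7 | 1 | 11.0 | 66.0 | 0 | 0 | 1 | 9.0 | 2.0 | 2.0 | 0 |
| 8 | 1 | 12.0 | 28.0 | 0 | 2 | 2 | 11.0 | 2.0 | 0.0 | 0 |
| 9 | 1 | 11.0 | 49.0 | 0 | 0 | 3 | 11.0 | 2.0 | 0.0 | 0 |
| 10 | 0 | 2.0 | 64.0 | 1 | 3 | 0 | 11.0 | 2.0 | 1.0 | 1 |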
