# ML Mini-Project: Demonstration (Wine Quality dataset)

This notebook demonstrates the steps required by the project assignment: problem description, dataset, preprocessing, model training (two algorithms), evaluation (RMSE & MAE), visualizations, and conclusions. The training script is available at `backend/ml/train_model.py` and will download a public dataset (UCI Wine Quality - red) by default if no dataset is provided.

## Dataset
We use the UCI Wine Quality (red) dataset, a public dataset with 1,599 rows and 11 numeric physicochemical features plus an integer `quality` score. The script saves reports under `backend/ml/reports/` and models under `backend/ml/`.
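
As a quick check of the file layout, the sketch below parses data in the format UCI uses for `winequality-red.csv`: semicolon-separated, with quoted column names and a `quality` column. The helper name `load_wine_csv` and the abbreviated two-row sample are illustrative only and are not taken from `train_model.py`; in practice you would pass the path of the downloaded file instead of the inline sample.

```python
import io

import pandas as pd


def load_wine_csv(source):
    """Read a Wine Quality CSV; the UCI files use ';' as the separator."""
    return pd.read_csv(source, sep=';')


# Abbreviated inline sample in the UCI layout (illustration only, not the full file).
sample = io.StringIO(
    '"fixed acidity";"volatile acidity";"alcohol";"quality"\n'
    '7.4;0.7;9.4;5\n'
    '7.8;0.88;9.8;5\n'
)

wine = load_wine_csv(sample)
# Split into features and the quality score (assumed here to be the regression target).
X, y = wine.drop(columns='quality'), wine['quality']
print(wine.shape, list(X.columns))
```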

In [None]:
# Run the training script (this will download the dataset if missing and train two models).
# It's executed as a subprocess to keep the notebook environment aligned with the training script behavior.
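import subprocess, sys, os
# Resolve the workspace root whether the notebook runs from the root itself or from a subfolder.
root = os.getcwd() if os.path.isdir('backend') else os.path.abspath('..')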
backend_dir = os.path.join(root, 'backend')
cmd = [sys.executable, os.path.join(backend_dir, 'ml', 'train_model.py')]
print('Running:', ' '.join(cmd))
proc = subprocess.run(cmd, cwd=backend_dir, capture_output=True, text=True)
print(proc.stdout)
if proc.returncode != 0:
    print('ERROR', proc.returncode)
    print(proc.stderr)

In [None]:
# Load and show results summary (if produced)
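import os
import pandas as pd
from IPython.display import display
# Resolve the workspace root whether the notebook runs from the root itself or from a subfolder.
root = os.getcwd() if os.path.isdir('backend') else os.path.abspath('..')
reports = os.path.join(root, 'backend', 'ml', 'reports')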
csv_path = os.path.join(reports, 'results_summary.csv')
print('Looking for', csv_path)
if os.path.exists(csv_path):
    df = pd.read_csv(csv_path)
    display(df)
else:
    print('Results CSV not found; run the training script first.')

In [None]:
# Display run summaries and key plots for processed datasets
import os
import pandas as pd
from IPython.display import display, Image

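runs = ['pesticides_run', 'plant_vase1_run', 'soil_moisture_run']
# Resolve the workspace root whether the notebook runs from the root itself or from a subfolder.
root = os.getcwd() if os.path.isdir('backend') else os.path.abspath('..')
base = os.path.join(root, 'backend', 'ml')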
for r in runs:
    print('\n--- Run:', r, '---')
    rep = os.path.join(base, r, 'reports')
    if not os.path.exists(rep):
        print('  Reports folder not found for', r)
        continue
    csv = os.path.join(rep, 'results_summary.csv')
    if os.path.exists(csv):
        df = pd.read_csv(csv)
        display(df)
    else:
        print('  Results CSV not found for', r)
    # show a few representative images if present
    for prefix in ['RandomForest_pred_vs_actual', 'GradientBoosting_pred_vs_actual', 'target_distribution', 'correlation_heatmap']:
        matches = sorted(f for f in os.listdir(rep) if f.startswith(prefix))
        if matches:
            # display the last match in sorted order (the latest one only if filenames embed a sortable timestamp)
            img_path = os.path.join(rep, matches[-1])
            print('  Showing', os.path.basename(img_path))
            display(Image(filename=img_path))


## Conclusions & Next Steps

- The repository now contains a reusable training pipeline (`backend/ml/train_model.py`) that accepts CSVs, validates dataset size/features, preprocesses (imputation + scaling + encoding), trains two regressors (RandomForest, GradientBoosting), evaluates with RMSE & MAE, and saves models and a `reports/` folder per run.
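
The steps named above (imputation, scaling, encoding, two regressors, RMSE and MAE) can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the contents of `train_model.py`: the function name `build_pipeline`, the column names, the median imputation strategy, and the hyperparameters are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols, regressor):
    """Impute and scale numeric columns, one-hot encode categoricals, then fit the regressor."""
    numeric = Pipeline([
        ('impute', SimpleImputer(strategy='median')),  # assumed strategy
        ('scale', StandardScaler()),
    ])
    pre = ColumnTransformer([
        ('num', numeric, numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ])
    return Pipeline([('pre', pre), ('model', regressor)])


# Synthetic data (hypothetical columns) with a few missing numeric values.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'kind': rng.choice(['a', 'b'], size=n),
})
df.loc[::25, 'x1'] = np.nan
y = 3.0 * df['x2'] + (df['kind'] == 'a').astype(float) + rng.normal(scale=0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=0)

models = [
    ('RandomForest', RandomForestRegressor(n_estimators=50, random_state=0)),
    ('GradientBoosting', GradientBoostingRegressor(random_state=0)),
]
results = {}
for name, reg in models:
    pipe = build_pipeline(['x1', 'x2'], ['kind'], reg).fit(X_train, y_train)
    pred = pipe.predict(X_test)
    # RMSE is the square root of the mean squared error; MAE is the mean absolute error.
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    mae = float(mean_absolute_error(y_test, pred))
    results[name] = {'RMSE': rmse, 'MAE': mae}
    print(f'{name}: RMSE={rmse:.3f} MAE={mae:.3f}')
```

Fitting the preprocessing inside the pipeline means the imputer and scaler learn their statistics from the training split only, which avoids leaking test data into the evaluation.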

Summary of runs performed on user-provided CSVs:

1) `pesticides.csv` (saved to `backend/ml/pesticides_run/`):
   - Rows: 4349, Columns: 7
   - Results (see `backend/ml/pesticides_run/reports/results_summary.csv`): RandomForest RMSE ≈ 3840, MAE ≈ 1037; GradientBoosting performed worse on this dataset.
   - Artifacts: models and plots saved under `backend/ml/pesticides_run/` and `backend/ml/pesticides_run/reports/`.

2) `plant_vase1.CSV` (saved to `backend/ml/plant_vase1_run/`):
   - Rows: 4117, Columns: 12
   - Results: both models reported 0.0 RMSE/MAE. An exactly zero error most likely indicates that the chosen target column is constant (no variance) or takes the same values across the train and test splits, so it should not be read as a good fit.
   - Recommendation: open `backend/data/plant_vase1.CSV`, choose a meaningful numeric target column, and supply it with `--target <colname>` when re-running.

3) `soil_moisture.csv` (saved to `backend/ml/soil_moisture_run/`):
   - Rows: 679, Columns: 129
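   - Results: very small RMSE/MAE (on the order of 1e-3). This is consistent with the sensor's measurement units, so the errors should be judged against the target's range rather than in absolute terms. Models trained successfully and artifacts were saved.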
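
A zero-error result like the one for `plant_vase1.CSV` can be caught before training with a quick check of which numeric columns actually vary. The following is a minimal sketch, assuming the CSV is loaded into a pandas DataFrame; the helper name `candidate_targets` and the toy columns are hypothetical and not part of `train_model.py`.

```python
import pandas as pd


def candidate_targets(df, min_unique=2):
    """Return numeric columns with at least `min_unique` distinct non-null values,
    ordered from most to fewest distinct values."""
    numeric = df.select_dtypes(include='number')
    counts = numeric.nunique(dropna=True)
    usable = counts[counts >= min_unique]
    return usable.sort_values(ascending=False).index.tolist()


# Toy frame with hypothetical columns, for illustration only.
toy = pd.DataFrame({
    'moisture': [0.31, 0.35, 0.40, 0.38],
    'pump_on': [0, 0, 0, 0],        # constant: useless as a regression target
    'label': ['a', 'b', 'a', 'b'],  # non-numeric: skipped
})
print(candidate_targets(toy))  # ['moisture']
```

Any column this returns is only a candidate; whether it is a meaningful target still depends on what the data represents.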

Files not processed due to validation:

- `rainfall.csv` and `temp.csv` both have only 3 columns; the pipeline requires at least 4 features by default (to encourage richer feature sets). You can re-run them by relaxing the validation (example below).
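
The kind of check that rejects these files can be sketched as below. This is an illustration under stated assumptions, not the script's actual validation: the function name `validate_dataset`, the `min_rows` default of 100, and the toy column names are hypothetical, and every column is counted toward the minimum (which matches the `--min-features 3` usage for 3-column files, although the real script may count differently).

```python
import pandas as pd


def validate_dataset(df, min_rows=100, min_features=4):
    """Return a list of validation problems; an empty list means the dataset passes."""
    problems = []
    if len(df) < min_rows:
        problems.append(f'only {len(df)} rows (minimum {min_rows})')
    if df.shape[1] < min_features:
        problems.append(f'only {df.shape[1]} columns (minimum {min_features})')
    return problems


# A 3-column frame with hypothetical column names.
three_cols = pd.DataFrame({
    'year': range(200),
    'month': [1] * 200,
    'rain_mm': [0.0] * 200,
})
print(validate_dataset(three_cols))                  # ['only 3 columns (minimum 4)']
print(validate_dataset(three_cols, min_features=3))  # []
```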

How to reproduce / re-run specific datasets:

From the workspace root (PowerShell):
```powershell
cd backend
python -u ml\train_model.py --data data\pesticides.csv --outdir ml\pesticides_run
python -u ml\train_model.py --data data\plant_vase1.CSV --target <your_target_col> --outdir ml\plant_vase1_run
python -u ml\train_model.py --data data\soil_moisture.csv --outdir ml\soil_moisture_run
# To accept 3-column CSVs (relax min features):
python -u ml\train_model.py --data data\rainfall.csv --min-features 3 --outdir ml\rainfall_run
python -u ml\train_model.py --data data\temp.csv --min-features 3 --outdir ml\temp_run
```
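
To re-run from a notebook cell instead of PowerShell, the same invocations can be assembled in Python. This sketch only builds the argument list from the flags shown above (`--data`, `--target`, `--outdir`, `--min-features`); the helper name `build_train_cmd` is hypothetical, and the paths assume the command is later executed with `backend` as the working directory.

```python
import os
import sys


def build_train_cmd(data, outdir, target=None, min_features=None):
    """Build the argument list for one training run (paths relative to `backend`)."""
    cmd = [sys.executable, '-u', os.path.join('ml', 'train_model.py'),
           '--data', data, '--outdir', outdir]
    if target is not None:
        cmd += ['--target', target]
    if min_features is not None:
        cmd += ['--min-features', str(min_features)]
    return cmd


cmd = build_train_cmd(os.path.join('data', 'rainfall.csv'),
                      os.path.join('ml', 'rainfall_run'),
                      min_features=3)
print(' '.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, cwd='backend')` from the workspace root, mirroring the `cd backend` step in the PowerShell commands.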

Suggested next actions (I can do these for you):

- I can embed further result images and tables directly into this notebook (a cell above already displays the per-run summaries and key plots).
- I can re-run the two 3-column CSVs with `--min-features 3` or with explicit `--target` values if you provide them.
- I can open `plant_vase1.CSV`, identify candidate numeric targets and re-run with the best candidate.

If you want me to proceed, tell me which of the next actions to take and I will execute them.
