<a href="https://colab.research.google.com/github/bgoueti/BandersnatchStarter/blob/main/Machine_Learning_BandersnatchStarter_Project_Sprint_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Machine Learning BandersnatchStarter Project - Sprint 2**




## **Description**

This notebook demonstrates the visualization work for Sprint 2:
- Load monster data exported from the app `/data` endpoint
- Clean & transform the data into a Pandas DataFrame
- Explore with summary statistics and tables
- Build interactive Altair charts (scatter, histogram, boxplot)
- Save an Altair chart as JSON (for Flask `/view` integration)

**How to use:**  
1. If you're using Colab, run the "Upload data" cell and upload the JSON or CSV file exported from your local app (the run recorded here used `monsters.csv`).  
2. Run cells top to bottom.

# **Install & Imports**

In [1]:
%%capture
import sys

if 'google.colab' in sys.modules:
  # One install call covers every package; the repeated installs were redundant
  !pip install altair pandas scikit-learn matplotlib==3.7.1 openai category_encoders pdpbox matplotlib-venn

In [2]:
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression,LinearRegression
from sklearn.inspection import PartialDependenceDisplay
from sklearn.metrics import accuracy_score

In [3]:
from datetime import datetime
from glob import glob
from IPython.display import Image
from typing import Tuple, Dict, Any
from pprint import pprint
from altair import Chart
from pandas import DataFrame
from sklearn import metrics

In [4]:
import numpy as np  # needed for np.nan in the cleaning cell
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import altair as alt
import joblib
import requests
import json
import os

In [5]:
# Altair display helpers for Colab/Jupyter
# For Colab this usually works; if rendering issues occur use renderer 'default'.
# alt.renderers.enable('default')

print("pandas:", pd.__version__, "altair:", alt.__version__)

pandas: 2.2.2 altair: 5.5.0


# **Upload data (Colab) and Load the JSON/CSV file into a DataFrame**

In [6]:
# If you're in Colab: upload the JSON or CSV you exported from http://127.0.0.1:5000/data
# If you're in local Jupyter, make sure monster_data.json or .csv is in the working dir.

try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    print("Colab detected — please upload your data file when prompted.")
    uploaded = files.upload()  # click the chooser and upload monster_data.json or monster_data.csv
    # Choose the first filename uploaded
    fn = list(uploaded.keys())[0]
else:
    # If running locally, set the filename here:
    # fn = "monster_data.json" or "monster_data.csv"
    fn = "monsters.json"  # change if necessary

print("Using file:", fn)


Colab detected — please upload your data file when prompted.


Saving monsters.csv to monsters.csv
Using file: monsters.csv


In [19]:
# # Load JSON or CSV automatically based on extension
# ext = os.path.splitext(fn)[1].lower()
# if ext in [".json"]:
#     df = pd.read_json(fn)
# elif ext in [".csv"]:
#     df = pd.read_csv(fn)
# else:
#     raise ValueError("Unsupported file type: " + ext)

# print("Loaded rows:", len(df))
# df.head()


In [8]:
# # Upload file into Google Colab
# from google.colab import files
# uploaded = files.upload()


In [9]:
# Load the uploaded file into a DataFrame, choosing the reader by extension
df = pd.read_json(fn) if fn.lower().endswith(".json") else pd.read_csv(fn)
df.head()

| | Name | Type | Level | Rarity | Damage | Health | Energy | Sanity | Timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Skeletal Villager | Undead | 11.0 | Rank 1 | 11d4+1 | 44.63 | 43.26 | 43.77 | 10/29/2025 15:13 |
| 1 | Ghostly Villager | Undead | 3.0 | Rank 1 | 3d4+2 | 11.11 | 12.42 | 13.53 | 10/29/2025 15:13 |
| 2 | Lightning Elemental | Elemental | 2.0 | Rank 3 | 2d8 | 19.89 | 18.19 | 19.8 | 10/29/2025 15:13 |
| 3 | Djinni | Elemental | 2.0 | Rank 2 | 2d6 | 10.27 | 14.09 | 13.63 | 10/29/2025 15:13 |
| 4 | Demilich | Undead | 11.0 | Rank 0 | 11d2+3 | 22.14 | 22.96 | 22.91 | 10/29/2025 15:13 |


# **Basic cleaning & expected columns**

In [10]:
# Ensure the expected columns exist and the numeric ones have numeric dtypes
expected = ["Level", "Health", "Energy", "Sanity", "Rarity"]
print("Columns present:", list(df.columns))

# If an expected column is missing, create it filled with NaN
for col in expected:
    if col not in df.columns:
        df[col] = np.nan

# Convert numeric fields; values that cannot be parsed become NaN
for col in ["Level", "Health", "Energy", "Sanity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Keep only rows that have a Rarity label, then renumber the index
df = df.dropna(subset=["Rarity"])
df = df.reset_index(drop=True)

print("After cleaning rows:", len(df))
df[expected].head()


Columns present: ['Name', 'Type', 'Level', 'Rarity', 'Damage', 'Health', 'Energy', 'Sanity', 'Timestamp']
After cleaning rows: 1000


| | Level | Health | Energy | Sanity | Rarity |
|---|---|---|---|---|---|
| 0 | 11.0 | 44.63 | 43.26 | 43.77 | Rank 1 |
| 1 | 3.0 | 11.11 | 12.42 | 13.53 | Rank 1 |
| 2 | 2.0 | 19.89 | 18.19 | 19.8 | Rank 3 |
| 3 | 2.0 | 10.27 | 14.09 | 13.63 | Rank 2 |
| 4 | 11.0 | 22.14 | 22.96 | 22.91 | Rank 0 |
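
The cleaning steps in this section could be wrapped in a small helper so they can be rerun on a fresh export. The sketch below is only an illustration: `clean_monsters` and the sample rows are hypothetical and not part of the original notebook, and it assumes the same column names as the data shown here.

```python
import numpy as np
import pandas as pd

NUMERIC_COLS = ["Level", "Health", "Energy", "Sanity"]  # numeric fields used in this notebook


def clean_monsters(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning cell: add missing columns,
    coerce numeric fields, and drop rows without a Rarity label."""
    df = raw.copy()
    for col in NUMERIC_COLS + ["Rarity"]:
        if col not in df.columns:
            df[col] = np.nan
    for col in NUMERIC_COLS:
        # Values that cannot be parsed (e.g. "oops") become NaN instead of raising
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.dropna(subset=["Rarity"]).reset_index(drop=True)


# Made-up sample rows, only for illustration
sample = pd.DataFrame(
    {
        "Level": ["11", "3", "oops"],
        "Health": [44.63, 11.11, 19.89],
        "Rarity": ["Rank 1", None, "Rank 3"],
    }
)
cleaned = clean_monsters(sample)
print(cleaned)
```

Working on a copy keeps the raw frame untouched, which makes it easier to compare row counts before and after cleaning.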


# **Quick EDA (statistics / class counts)**

In [11]:
# Summary statistics
display(df[["Level", "Health", "Energy", "Sanity"]].describe().T)

# Count by Rarity
counts = df["Rarity"].value_counts()
print("Rarity distribution:")
display(counts)


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Level | 1000.0 | 7.894 | 4.41863 | 1.0 | 4.0 | 7.0 | 11.0 | 20.0 |
| Health | 1000.0 | 39.03665 | 33.347495 | 1.05 | 16.295 | 27.77 | 51.32 | 220.71 |
| Energy | 1000.0 | 39.28469 | 33.634929 | 1.73 | 16.3325 | 28.075 | 51.7775 | 220.17 |
| Sanity | 1000.0 | 39.12219 | 33.323662 | 1.0 | 15.67 | 27.895 | 51.41 | 220.69 |


Rarity distribution:


| Rarity | count |
|---|---|
| Rank 0 | 316 |
| Rank 1 | 257 |
| Rank 2 | 175 |
| Rank 3 | 133 |
| Rank 4 | 89 |
| Rank 5 | 30 |
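
The counts above show that the classes are imbalanced, which matters for the later modeling sprint. As a rough sketch, the class shares can be computed from those counts; the numbers below are copied from the output above, and in the notebook itself `df["Rarity"].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the Rarity distribution printed above
counts = pd.Series(
    {"Rank 0": 316, "Rank 1": 257, "Rank 2": 175, "Rank 3": 133, "Rank 4": 89, "Rank 5": 30},
    name="count",
)

shares = counts / counts.sum()            # fraction of all rows in each class
imbalance = counts.max() / counts.min()   # largest class relative to the smallest

print(shares.round(3))
print(f"Largest-to-smallest class ratio: {imbalance:.1f}")
```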


# **Scatterplot: Health vs Energy colored by Rarity (Altair)**

In [12]:
chart1 = (
    alt.Chart(df, title="Health vs Energy (colored by Rarity)")
    .mark_circle(size=80, opacity=0.8)
    .encode(
        x=alt.X("Health:Q", title="Health"),
        y=alt.Y("Energy:Q", title="Energy"),
        color=alt.Color("Rarity:N", title="Rarity"),
        tooltip=["Level", "Health", "Energy", "Sanity", "Rarity"]
    )
    .properties(width=700, height=420, background="#0d1117", padding={"top":10,"left":10,"right":10,"bottom":10})
    .configure_axis(labelColor="white", titleColor="white")
    .configure_title(color="white")
    .configure_legend(labelColor="white", titleColor="white")
).interactive()

chart1.display()


# **Scatterplot: Level vs Health**

In [13]:
chart2 = (
    alt.Chart(df, title="Level vs Health")
    .mark_circle(size=60)
    .encode(
        x=alt.X("Level:Q", title="Level"),
        y=alt.Y("Health:Q", title="Health"),
        color=alt.Color("Rarity:N"),
        tooltip=["Level", "Health", "Energy", "Sanity", "Rarity"]
    )
    .properties(width=700, height=420)
).interactive()

chart2.display()


# **Histogram: Distribution of Health**

In [14]:
hist = (
    alt.Chart(df, title="Health distribution")
    .mark_bar()
    .encode(
        alt.X("Health:Q", bin=alt.Bin(maxbins=40), title="Health"),
        y="count()",
        color=alt.Color("Rarity:N", legend=None),
        tooltip=["count()"]
    )
    .properties(width=700, height=300)
)

hist.display()


# **Boxplot: Sanity by Rarity**

In [15]:
box = (
    alt.Chart(df, title="Sanity distribution by Rarity")
    .mark_boxplot(size=40, extent=1.5)
    .encode(
        x=alt.X("Rarity:N", title="Rarity"),
        y=alt.Y("Sanity:Q", title="Sanity"),
        color=alt.Color("Rarity:N", legend=None),
        tooltip=["min(Sanity)", "median(Sanity)", "max(Sanity)"]
    )
    .properties(width=700, height=420)
)

box.display()
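
In the boxplot above, `extent=1.5` asks for Tukey-style whiskers: they reach the most extreme points that lie within 1.5 times the interquartile range of the quartiles, and anything beyond is drawn as an outlier. The sketch below reproduces that rule with pandas on made-up values; `whisker_bounds` is a hypothetical helper, and the quartile interpolation used by pandas may differ slightly from the one used by the chart renderer.

```python
import pandas as pd


def whisker_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) whisker ends for a Tukey-style boxplot."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - k * iqr, q3 + k * iqr
    inside = values[(values >= lo_fence) & (values <= hi_fence)]
    # Whiskers stop at the most extreme observations still inside the fences
    return float(inside.min()), float(inside.max())


# Made-up values with one obvious outlier
sanity = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 100.0])
low, high = whisker_bounds(sanity)
outliers = sanity[(sanity < low) | (sanity > high)]
print(low, high, list(outliers))
```

To apply the same check per class in the notebook, the helper could be passed to `df.groupby("Rarity")["Sanity"].apply(...)`.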





# **Small multiples: Health vs Energy faceted by Rarity**

In [16]:
facet = (
    alt.Chart(df, title="Health vs Energy by Rarity (facets)")
    .mark_circle(size=60, opacity=0.8)
    .encode(
        x="Health:Q",
        y="Energy:Q",
        color="Rarity:N",
        tooltip=["Level", "Health", "Energy", "Sanity"]
    )
    .properties(width=200, height=200)
    .facet("Rarity:N", columns=3)
)

facet.display()


# **Save the main chart JSON (for Flask /view)**



In [17]:
# Save chart1's JSON spec to file so you can test embedding in Flask
chart_json = chart1.to_json()

with open("chart_health_energy.json", "w", encoding="utf-8") as f:
    f.write(chart_json)

print("Saved chart JSON to chart_health_energy.json")


Saved chart JSON to chart_health_energy.json


To use the chart in Flask, move `chart_health_energy.json` into the app directory, or pass the JSON string to the code that renders the `/view` page.
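
One way to embed a saved spec in a page is to hand it to vega-embed in the browser. The sketch below builds such a page as a plain string using only the standard library; `chart_spec_to_html` and the stand-in spec are hypothetical, and the CDN script versions are an assumption that should be matched to the Vega-Lite version Altair emits. A Flask view could return a string like this, though the app's actual `/view` code is not shown in this notebook.

```python
import json


def chart_spec_to_html(spec_json: str, element_id: str = "vis") -> str:
    """Hypothetical helper: wrap a Vega-Lite JSON spec in a minimal HTML page."""
    spec = json.loads(spec_json)  # fails early if the text is not valid JSON
    return f"""<!DOCTYPE html>
<html>
<head>
  <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
</head>
<body>
  <div id="{element_id}"></div>
  <script>vegaEmbed("#{element_id}", {json.dumps(spec)});</script>
</body>
</html>"""


# Stand-in spec; in the notebook this would be the contents of chart_health_energy.json
demo_spec = json.dumps({"mark": "circle", "data": {"values": [{"Health": 1, "Energy": 2}]}})
html = chart_spec_to_html(demo_spec)
print(html[:60])
```

Altair can also produce a standalone page itself via `chart1.to_html()`, which may be simpler when no custom layout is needed.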

# **Export cleaned dataset for reproducible modeling**

In [18]:
# Save cleaned data to CSV for later modeling / Sprint 3 notebook
clean_fn = "monsters_cleaned_data.csv"
df.to_csv(clean_fn, index=False)
print("Saved cleaned dataset to", clean_fn)


Saved cleaned dataset to monsters_cleaned_data.csv
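
Before relying on the exported file in a later notebook, it can be worth confirming that it reloads unchanged. The sketch below does a round trip through a temporary directory with made-up rows; `df_clean` is a stand-in for the cleaned DataFrame and is not taken from the real data.

```python
import os
import tempfile

import pandas as pd

# Made-up rows with a subset of the cleaned data's columns
df_clean = pd.DataFrame(
    {"Level": [11.0, 3.0], "Health": [44.63, 11.11], "Rarity": ["Rank 1", "Rank 1"]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "monsters_cleaned_data.csv")
    df_clean.to_csv(path, index=False)  # index=False avoids an extra "Unnamed: 0" column on reload
    reloaded = pd.read_csv(path)

# Raises AssertionError if the reloaded frame differs from the one written
pd.testing.assert_frame_equal(df_clean, reloaded)
print("Round trip OK:", list(reloaded.columns))
```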


# **Short writeup / conclusions**


## Conclusions & Next Steps

- I loaded and cleaned the monster dataset and created multiple visualizations:
  - Health vs Energy scatterplot (interactive with tooltips)
  - Level vs Health scatterplot
  - Health histogram
  - Sanity boxplot by Rarity
  - Faceted scatterplots per Rarity

- I saved one chart's Altair JSON (`chart_health_energy.json`) and a cleaned CSV (`monsters_cleaned_data.csv`) for use in Sprint 3 (modeling).

**Next steps (Sprint 3):**
- Use `monsters_cleaned_data.csv` to train and tune ML models (Random Forest, Logistic Regression, KNN / Gradient Boosting).
- Persist the best model and integrate it into the `/model` endpoint.
