
# 🧬 NuMark Orb – Real Data AI Pipeline for Cancer Risk Detection

This notebook contains the **complete AI pipeline** for NuMark Orb. It trains on **real RNA gene expression data** (no mock data)
and simulates receiving input from a saliva test device.

## 🔁 Workflow Overview
- Upload real training data (with labels) to train the model
- Upload new, unlabeled test samples (such as those from a saliva kit) and predict their risk
- Trigger alerts for high-risk samples
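
Both uploads need to share the same gene columns, since the model can only score samples that have the features it was trained on. Below is a minimal sketch of such a check. The helper name `check_gene_columns` and the two tiny frames are hypothetical and serve only to illustrate the call; in the notebook itself the inputs would be the uploaded CSVs, and the `label` column name is taken from the expected training format.

```python
import pandas as pd


def check_gene_columns(train_df: pd.DataFrame, test_df: pd.DataFrame,
                       label_col: str = "label") -> pd.DataFrame:
    """Return test_df restricted to, and ordered like, the training features.

    Raises ValueError if the label column is missing from the training data
    or if any training feature is absent from the test data.
    """
    if label_col not in train_df.columns:
        raise ValueError(f"training data has no '{label_col}' column")
    feature_cols = [c for c in train_df.columns if c != label_col]
    missing = [c for c in feature_cols if c not in test_df.columns]
    if missing:
        raise ValueError(f"test data is missing gene columns: {missing}")
    # Drop any extra columns and enforce the training column order.
    return test_df[feature_cols]


# Placeholder values, only to show the call (not real expression data).
train_example = pd.DataFrame({"gene_1": [0.1, 0.9], "gene_2": [1.2, 0.3], "label": [0, 1]})
test_example = pd.DataFrame({"gene_2": [0.7], "gene_1": [0.4]})
aligned = check_gene_columns(train_example, test_example)
print(list(aligned.columns))  # ['gene_1', 'gene_2']
```

Running a check like this before prediction turns a silent column mismatch into an explicit error.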


In [None]:

# 📥 Upload your REAL training dataset (CSV with labels)
from google.colab import files
import pandas as pd
import io

uploaded = files.upload()
filename = list(uploaded.keys())[0]
train_df = pd.read_csv(io.BytesIO(uploaded[filename]))

# Expecting a format like:
# gene_1,gene_2,...,gene_n,label
train_df.head()


In [None]:

# 🔧 Train the model using real data
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import joblib

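# Separate the gene expression features from the target column
X = train_df.drop("label", axis=1)
y = train_df["label"]

# Split first, then fit the scaler on the training portion only,
# so no statistics from the held-out samples leak into training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

joblib.dump(model, "cancer_risk_rf_model.pkl")
print("✅ Model saved as cancer_risk_rf_model.pkl")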

# Evaluate
y_pred = model.predict(X_test)
print("📊 Evaluation Report:")
print(classification_report(y_test, y_pred))


In [None]:

# 📥 Upload new saliva test input (device-like format, no label)
uploaded_test = files.upload()
test_file = list(uploaded_test.keys())[0]
test_df = pd.read_csv(io.BytesIO(uploaded_test[test_file]))

test_df.head()


In [None]:

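# 🧠 Predict cancer risk and generate alerts
# Select the training feature columns in their original order before scaling
X_new = scaler.transform(test_df[X.columns])
preds = model.predict(X_new)
# Columns of probs follow model.classes_; the code below assumes labels 0 (low) and 1 (high)
probs = model.predict_proba(X_new)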

results = pd.DataFrame({
    "Sample": test_df.index + 1,
    "Low Risk Prob": probs[:, 0],
    "High Risk Prob": probs[:, 1],
    "Prediction": preds
})
results["Alert"] = results["Prediction"].apply(
    lambda x: "🚨 HIGH RISK – Notify physician!" if x == 1 else "✅ Low risk")

results


In [None]:

# 📊 Visual 1: High vs Low Risk Prediction Count
import matplotlib.pyplot as plt

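# Reindex so both classes are present even if only one was predicted;
# otherwise the bar heights and the two labels would not line up
risk_counts = results["Prediction"].value_counts().reindex([0, 1], fill_value=0)
labels = ["Low Risk", "High Risk"]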

plt.figure(figsize=(6, 4))
plt.bar(labels, risk_counts, color=["green", "red"])
plt.title("Prediction Distribution")
plt.ylabel("Number of Samples")
plt.tight_layout()
plt.show()


In [None]:

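# 📈 Visual 2: Probability Distribution for High Risk
plt.figure(figsize=(6, 4))
# Fix the bin range to [0, 1] so the x-axis always spans the full probability scale
plt.hist(probs[:, 1], bins=10, range=(0, 1), color="tomato", edgecolor="black")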
plt.title("Distribution of High Risk Probabilities")
plt.xlabel("Probability of High Risk")
plt.ylabel("Number of Samples")
plt.tight_layout()
plt.show()
