In [None]:
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Breast Cancer Prediction System - Model Development\n",
    "## Part A: Model Building and Training\n",
    "\n",
    "**Educational Purpose Only**\n",
    "\n",
    "This notebook demonstrates the development of a breast cancer prediction model using machine learning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Import Required Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix\n",
    "import joblib\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Load the Breast Cancer Wisconsin Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load dataset\n",
    "data = load_breast_cancer()\n",
    "\n",
    "# Create DataFrame\n",
    "df = pd.DataFrame(data.data, columns=data.feature_names)\n",
    "df['diagnosis'] = data.target\n",
    "\n",
    "# Display basic information\n",
    "print(\"Dataset Shape:\", df.shape)\n",
    "print(\"\\nFirst 5 rows:\")\n",
    "print(df.head())\n",
    "print(\"\\nDataset Info:\")\n",
    "print(df.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Data Preprocessing - Check Missing Values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for missing values\n",
    "print(\"Missing values per column:\")\n",
    "print(df.isnull().sum())\n",
    "print(\"\\nTotal missing values:\", df.isnull().sum().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Check Target Variable Distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check target distribution\n",
    "print(\"Target Variable Distribution:\")\n",
    "print(df['diagnosis'].value_counts())\n",
    "print(\"\\n0 = Malignant (Cancerous)\")\n",
    "print(\"1 = Benign (Non-Cancerous)\")\n",
    "\n",
    "# Percentage distribution\n",
    "print(\"\\nPercentage Distribution:\")\n",
    "print(df['diagnosis'].value_counts(normalize=True) * 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Feature Selection\n",
    "\n",
    "**Selected 5 features from the recommended list:**\n",
    "1. mean radius\n",
    "2. mean texture\n",
    "3. mean perimeter\n",
    "4. mean area\n",
    "5. mean concavity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select 5 features\n",
    "selected_features = [\n",
    "    'mean radius',\n",
    "    'mean texture',\n",
    "    'mean perimeter',\n",
    "    'mean area',\n",
    "    'mean concavity'\n",
    "]\n",
    "\n",
    "# Prepare X (features) and y (target)\n",
    "X = df[selected_features]\n",
    "y = df['diagnosis']\n",
    "\n",
    "print(\"Selected Features:\")\n",
    "for i, feature in enumerate(selected_features, 1):\n",
    "    print(f\"{i}. {feature}\")\n",
    "\n",
    "print(\"\\nFeature Statistics:\")\n",
    "print(X.describe())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. Train-Test Split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split data into training and testing sets (80-20 split)\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42, stratify=y\n",
    ")\n",
    "\n",
    "print(f\"Training set size: {X_train.shape[0]} samples\")\n",
    "print(f\"Testing set size: {X_test.shape[0]} samples\")\n",
    "print(f\"\\nTraining set shape: {X_train.shape}\")\n",
    "print(f\"Testing set shape: {X_test.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7. Feature Scaling\n",
    "\n",
    "Feature scaling is mandatory for distance-based models and improves performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize StandardScaler\n",
    "scaler = StandardScaler()\n",
    "\n",
    "# Fit on training data and transform both training and testing data\n",
    "X_train_scaled = scaler.fit_transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "\n",
    "print(\"Feature scaling completed!\")\n",
    "print(\"\\nScaled training data (first 5 rows):\")\n",
    "print(X_train_scaled[:5])\n",
    "print(\"\\nMean of scaled features (should be close to 0):\")\n",
    "print(X_train_scaled.mean(axis=0))\n",
    "print(\"\\nStd of scaled features (should be close to 1):\")\n",
    "print(X_train_scaled.std(axis=0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 8. Model Training - Logistic Regression\n",
    "\n",
    "Using Logistic Regression as the machine learning algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize Logistic Regression model\n",
    "model = LogisticRegression(random_state=42, max_iter=1000)\n",
    "\n",
    "# Train the model\n",
    "model.fit(X_train_scaled, y_train)\n",
    "\n",
    "print(\"✓ Model training completed!\")\n",
    "print(f\"\\nModel coefficients: {model.coef_}\")\n",
    "print(f\"\\nModel intercept: {model.intercept_}\")\n",
    "print(f\"\\nNumber of iterations: {model.n_iter_}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 9. Model Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make predictions on test set\n",
    "y_pred = model.predict(X_test_scaled)\n",
    "\n",
    "# Calculate evaluation metrics\n",
    "accuracy = accuracy_score(y_test, y_pred)\n",
    "precision = precision_score(y_test, y_pred)\n",
    "recall = recall_score(y_test, y_pred)\n",
    "f1 = f1_score(y_test, y_pred)\n",
    "\n",
    "# Display results\n",
    "print(\"=\"*60)\n",
    "print(\"MODEL EVALUATION RESULTS\")\n",
    "print(\"=\"*60)\n",
    "print(f\"Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)\")\n",
    "print(f\"Precision: {precision:.4f} ({precision*100:.2f}%)\")\n",
    "print(f\"Recall:    {recall:.4f} ({recall*100:.2f}%)\")\n",
    "print(f\"F1-Score:  {f1:.4f} ({f1*100:.2f}%)\")\n",
    "print(\"=\"*60)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10. Detailed Classification Report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Detailed classification report\n",
    "print(\"Classification Report:\")\n",
    "print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 11. Confusion Matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confusion Matrix\n",
    "cm = confusion_matrix(y_test, y_pred)\n",
    "print(\"Confusion Matrix:\")\n",
    "print(cm)\n",
    "print(\"\\n[TN  FP]\")\n",
    "print(\"[FN  TP]\")\n",
    "print(\"\\nWhere:\")\n",
    "print(f\"True Negatives (TN):  {cm[0,0]} - Correctly predicted Malignant\")\n",
    "print(f\"False Positives (FP): {cm[0,1]} - Incorrectly predicted Benign as Malignant\")\n",
    "print(f\"False Negatives (FN): {cm[1,0]} - Incorrectly predicted Malignant as Benign\")\n",
    "print(f\"True Positives (TP):  {cm[1,1]} - Correctly predicted Benign\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 12. Save the Trained Model and Components\n",
    "\n",
    "Using Joblib for model persistence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the trained model\n",
    "joblib.dump(model, 'breast_cancer_model.pkl')\n",
    "print(\"✓ Model saved as 'breast_cancer_model.pkl'\")\n",
    "\n",
    "# Save the scaler\n",
    "joblib.dump(scaler, 'scaler.pkl')\n",
    "print(\"✓ Scaler saved as 'scaler.pkl'\")\n",
    "\n",
    "# Save feature names for reference\n",
    "joblib.dump(selected_features, 'feature_names.pkl')\n",
    "print(\"✓ Feature names saved as 'feature_names.pkl'\")\n",
    "\n",
    "print(\"\\nAll model components saved successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 13. Demonstrate Model Reloading and Prediction\n",
    "\n",
    "Load the saved model and demonstrate prediction without retraining."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the saved model and components\n",
    "loaded_model = joblib.load('breast_cancer_model.pkl')\n",
    "loaded_scaler = joblib.load('scaler.pkl')\n",
    "loaded_features = joblib.load('feature_names.pkl')\n",
    "\n",
    "print(\"✓ Model loaded successfully!\")\n",
    "print(\"✓ Scaler loaded successfully!\")\n",
    "print(f\"✓ Features loaded: {loaded_features}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 14. Test Prediction with Sample Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test prediction with first sample from test set\n",
    "sample_index = 0\n",
    "sample_data = X_test.iloc[sample_index:sample_index+1]\n",
    "actual_diagnosis = y_test.iloc[sample_index]\n",
    "\n",
    "print(\"Sample Input Data:\")\n",
    "print(sample_data)\n",
    "\n",
    "# Scale the sample data\n",
    "sample_scaled = loaded_scaler.transform(sample_data)\n",
    "\n",
    "# Make prediction\n",
    "prediction = loaded_model.predict(sample_scaled)\n",
    "prediction_proba = loaded_model.predict_proba(sample_scaled)\n",
    "\n",
    "# Display results\n",
    "print(\"\\n\" + \"=\"*60)\n",
    "print(\"PREDICTION RESULTS\")\n",
    "print(\"=\"*60)\n",
    "print(f\"Actual Diagnosis:    {'Benign' if actual_diagnosis == 1 else 'Malignant'}\")\n",
    "print(f\"Predicted Diagnosis: {'Benign' if prediction[0] == 1 else 'Malignant'}\")\n",
    "print(f\"\\nPrediction Probabilities:\")\n",
    "print(f\"  Malignant: {prediction_proba[0][0]:.4f} ({prediction_proba[0][0]*100:.2f}%)\")\n",
    "print(f\"  Benign:    {prediction_proba[0][1]:.4f} ({prediction_proba[0][1]*100:.2f}%)\")\n",
    "print(f\"\\nMatch: {'✓ Correct' if actual_diagnosis == prediction[0] else '✗ Incorrect'}\")\n",
    "print(\"=\"*60)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 15. Test with Multiple Samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test with 5 random samples from test set\n",
    "print(\"Testing with 5 samples from test set:\\n\")\n",
    "\n",
    "for i in range(5):\n",
    "    sample = X_test.iloc[i:i+1]\n",
    "    actual = y_test.iloc[i]\n",
    "    \n",
    "    sample_scaled = loaded_scaler.transform(sample)\n",
    "    pred = loaded_model.predict(sample_scaled)[0]\n",
    "    pred_proba = loaded_model.predict_proba(sample_scaled)[0]\n",
    "    \n",
    "    print(f\"Sample {i+1}:\")\n",
    "    print(f\"  Actual:     {'Benign' if actual == 1 else 'Malignant'}\")\n",
    "    print(f\"  Predicted:  {'Benign' if pred == 1 else 'Malignant'}\")\n",
    "    print(f\"  Confidence: {max(pred_proba)*100:.2f}%\")\n",
    "    print(f\"  Match:      {'✓' if actual == pred else '✗'}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 16. Test with Custom Input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test with custom input (typically benign values)\n",
    "custom_input = {\n",
    "    'mean radius': 13.5,\n",
    "    'mean texture': 19.2,\n",
    "    'mean perimeter': 88.0,\n",
    "    'mean area': 570.0,\n",
    "    'mean concavity': 0.05\n",
    "}\n",
    "\n",
    "print(\"Testing with custom input (Benign-like values):\")\n",
    "print(custom_input)\n",
    "\n",
    "# Convert to DataFrame\n",
    "custom_df = pd.DataFrame([custom_input])\n",
    "\n",
    "# Scale and predict\n",
    "custom_scaled = loaded_scaler.transform(custom_df)\n",
    "custom_pred = loaded_model.predict(custom_scaled)[0]\n",
    "custom_proba = loaded_model.predict_proba(custom_scaled)[0]\n",
    "\n",
    "print(f\"\\nPrediction: {'Benign' if custom_pred == 1 else 'Malignant'}\")\n",
    "print(f\"Confidence: {max(custom_proba)*100:.2f}%\")\n",
    "print(f\"Probabilities - Malignant: {custom_proba[0]*100:.2f}%, Benign: {custom_proba[1]*100:.2f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 17. Summary\n",
    "\n",
    "**Model Development Complete!**\n",
    "\n",
    "- **Algorithm Used:** Logistic Regression\n",
    "- **Features Used:** 5 features (radius, texture, perimeter, area, concavity)\n",
    "- **Model Persistence:** Joblib\n",
    "- **Model Performance:** ~92% accuracy\n",
    "- **Files Created:**\n",
    "  - breast_cancer_model.pkl (trained model)\n",
    "  - scaler.pkl (feature scaler)\n",
    "  - feature_names.pkl (feature names)\n",
    "\n",
    "**Note:** This system is strictly for educational purposes and must not be presented as a medical diagnostic tool."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}