# Math 5750/6880 – Project 1 Notebook

This notebook contains three parts required for the **Progress Report**:
1. Project Euler Problem 1
2. California Housing Regression (baseline Linear Regression)
3. Breast Cancer Classification (baseline SVM)

Run all cells in order. Figures will be saved as PNG files for inclusion in your LaTeX report:
- `reg_pred_vs_true.png`
- `reg_error_hist.png`
- `cls_roc.png`
- `cls_pr.png`


## 0. Environment check
Confirm that the required packages import cleanly and note their versions. In Colab, all of them are preinstalled.

In [None]:
import sys, numpy, pandas, sklearn, matplotlib
print(sys.version)
print('numpy', numpy.__version__)
print('pandas', pandas.__version__)
print('sklearn', sklearn.__version__)
print('matplotlib', matplotlib.__version__)

3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
numpy 2.0.2
pandas 2.2.2
sklearn 1.6.1
matplotlib 3.10.0


## 1. Project Euler Problem 1
Find the sum of all natural numbers below 1000 that are multiples of 3 or 5.

**Expected answer:** 233168

In [None]:
def pe1(n=1000):
    """Sum of the natural numbers below n that are multiples of 3 or 5."""
    return sum(x for x in range(n) if x % 3 == 0 or x % 5 == 0)

ans_pe1 = pe1()
print('Project Euler Problem 1 answer =', ans_pe1)

Project Euler Problem 1 answer = 233168
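
The brute-force sum can be cross-checked in closed form. By inclusion-exclusion, the multiples of 3 or 5 are the multiples of 3 plus the multiples of 5 minus the multiples of 15 (which would otherwise be counted twice), and each of those is an arithmetic series. The sketch below is one way to write this; the helper names `sum_multiples_below` and `pe1_closed_form` are illustrative and not part of the original notebook.

```python
def sum_multiples_below(k, n):
    """Sum of the positive multiples of k that are strictly less than n."""
    m = (n - 1) // k          # number of positive multiples of k below n
    return k * m * (m + 1) // 2


def pe1_closed_form(n=1000):
    """Inclusion-exclusion: multiples of 3, plus multiples of 5, minus multiples of 15."""
    return (sum_multiples_below(3, n)
            + sum_multiples_below(5, n)
            - sum_multiples_below(15, n))


print(pe1_closed_form())  # 233168, matching the expected answer above
```

Unlike the generator version, this runs in constant time, which only matters for much larger `n`.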


## 2. California Housing Regression (Linear Regression baseline)
We will:
- Load the dataset
- Split into train/test
- Fit Linear Regression
- Report Train/Test R², MAE, RMSE
- Save two figures: Predicted vs True (test), and Error Histogram

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression

# Load dataset
data = fetch_california_housing(as_frame=True)
df = data.frame
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']

# Train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit Linear Regression
lr = LinearRegression().fit(X_tr, y_tr)
pred_tr = lr.predict(X_tr)
pred_te = lr.predict(X_te)


def metrics(y_true, y_pred):
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    return r2, mae, rmse

r2_tr, mae_tr, rmse_tr = metrics(y_tr, pred_tr)
r2_te, mae_te, rmse_te = metrics(y_te, pred_te)

print('Train  R2/MAE/RMSE:', r2_tr, mae_tr, rmse_tr)
print('Test   R2/MAE/RMSE:', r2_te, mae_te, rmse_te)

# Figure 1: Predicted vs True (test)
plt.figure()
plt.scatter(y_te, pred_te, s=8)
plt.xlabel('True MedHouseVal')
plt.ylabel('Predicted')
plt.title('Predicted vs True (Test)')
plt.tight_layout()
plt.savefig('reg_pred_vs_true.png', dpi=150)
plt.close()

# Figure 2: Error histogram (test)
# Error = predicted - true, so positive values are over-predictions
err = pred_te - y_te
plt.figure()
plt.hist(err, bins=40)
plt.xlabel('Error')
plt.ylabel('Count')
plt.title('Error Histogram (Test)')
plt.tight_layout()
plt.savefig('reg_error_hist.png', dpi=150)
plt.close()

# Print a clean summary block for copy-paste into LaTeX (MAE and RMSE are test-set values)
print('\n===== REGRESSION METRICS (copy these into your LaTeX) =====')
print(f'Training R^2: {r2_tr:.4f}')
print(f'Test R^2: {r2_te:.4f}')
print(f'MAE: {mae_te:.4f}')
print(f'RMSE: {rmse_te:.4f}')

Train  R2/MAE/RMSE: 0.6125511913966952 0.5286283596581922 0.7196757085831575
Test   R2/MAE/RMSE: 0.5757877060324508 0.5332001304956553 0.7455813830127764

===== REGRESSION METRICS (copy these into your LaTeX) =====
Training R^2: 0.6126
Test R^2: 0.5758
MAE: 0.5332
RMSE: 0.7456
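
To make clear what these numbers measure, the three metrics can be computed directly from their definitions: R² is one minus the ratio of the residual sum of squares to the total sum of squares, MAE is the mean absolute residual, and RMSE is the square root of the mean squared residual. The sketch below uses a small made-up array rather than the housing data, and `metrics_by_hand` is a hypothetical helper name.

```python
import numpy as np


def metrics_by_hand(y_true, y_pred):
    """Return (R^2, MAE, RMSE) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mae = np.mean(np.abs(resid))
    rmse = np.sqrt(np.mean(resid ** 2))
    return r2, mae, rmse


# Toy values for illustration only (not from the housing dataset)
r2, mae, rmse = metrics_by_hand([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(f'R^2: {r2:.4f}  MAE: {mae:.4f}  RMSE: {rmse:.4f}')
```

Passing `y_te` and `pred_te` from the regression cell should reproduce the values returned by `metrics`.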


## 3. Breast Cancer Classification (SVM baseline)
We will:
- Load the dataset
- Split into train/test (stratified)
- Fit SVM (RBF) with standardization pipeline
- Report Accuracy, ROC-AUC, Average Precision, and the confusion matrix
- Save ROC and PR curves

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, roc_auc_score, average_precision_score,
                             confusion_matrix, RocCurveDisplay, PrecisionRecallDisplay)

bc = load_breast_cancer(as_frame=True)
X, y = bc.data, bc.target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

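# probability=True enables predict_proba (fitted via internal cross-validation, so training is slower)
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', probability=True, random_state=42))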
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
pred  = clf.predict(X_te)

acc = accuracy_score(y_te, pred)
roc = roc_auc_score(y_te, proba)
ap  = average_precision_score(y_te, proba)
cm  = confusion_matrix(y_te, pred)

print('Accuracy:', acc)
print('ROC-AUC:', roc)
print('Average Precision:', ap)
print('Confusion Matrix:\n', cm)

# ROC curve
RocCurveDisplay.from_predictions(y_te, proba)
plt.title('ROC Curve (SVM, RBF)')
plt.tight_layout()
plt.savefig('cls_roc.png', dpi=150)
plt.close()

# Precision-Recall curve
PrecisionRecallDisplay.from_predictions(y_te, proba)
plt.title('Precision-Recall (SVM, RBF)')
plt.tight_layout()
plt.savefig('cls_pr.png', dpi=150)
plt.close()

print('\n===== CLASSIFICATION METRICS (copy these into your LaTeX) =====')
print(f'Accuracy: {acc:.4f}')
print(f'ROC-AUC: {roc:.4f}')
print(f'Average Precision: {ap:.4f}')

Accuracy: 0.9824561403508771
ROC-AUC: 0.9950396825396826
Average Precision: 0.9969313924914238
Confusion Matrix:
 [[41  1]
 [ 1 71]]

===== CLASSIFICATION METRICS (copy these into your LaTeX) =====
Accuracy: 0.9825
ROC-AUC: 0.9950
Average Precision: 0.9969
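
The reported accuracy can be recovered from the confusion matrix printed above. scikit-learn puts true labels on rows and predicted labels on columns, and in this dataset label 0 is malignant and label 1 is benign, so the matrix reads as [[TN, FP], [FN, TP]] when benign is treated as the positive class. The precision and recall figures in this sketch are derived here for illustration and are not part of the original output.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[41, 1],
               [1, 71]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 112 / 114
precision = tp / (tp + fp)        # 71 / 72
recall = tp / (tp + fn)           # 71 / 72

print(f'Accuracy:  {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall:    {recall:.4f}')
```

The two misclassified test samples are one malignant case predicted benign and one benign case predicted malignant.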


In [None]:
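# Download the saved figures (Colab only; elsewhere they are already in the working directory)
try:
    from google.colab import files
except ImportError:
    files = None

if files is not None:
    for name in ("reg_pred_vs_true.png", "reg_error_hist.png", "cls_roc.png", "cls_pr.png"):
        files.download(name)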



## 4. Next steps (for your Final Report)
- Try Ridge/Lasso for regression, or a tree-based model, and compare metrics.
- For classification, tune `C` and `gamma`, or compare with Logistic Regression / Random Forest.
- Add a brief interpretation of the important predictors or decision boundaries.
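
As a starting point for the `C` and `gamma` suggestion above, one possible approach is a cross-validated grid search. This is a minimal sketch: the grid values, the 5-fold setting, and the choice of ROC-AUC as the selection metric are assumptions made here, not course requirements.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# make_pipeline names each step after its lowercased class, hence the 'svc__' prefix
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {
    'svc__C': [0.1, 1, 10],               # illustrative values
    'svc__gamma': ['scale', 0.01, 0.1],   # illustrative values
}

search = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=5)
search.fit(X_tr, y_tr)

# Score the refit best model on the held-out test set using decision_function scores
test_auc = roc_auc_score(y_te, search.decision_function(X_te))
print('Best params:', search.best_params_)
print(f'CV ROC-AUC: {search.best_score_:.4f}')
print(f'Test ROC-AUC: {test_auc:.4f}')
```

Because the scaler lives inside the pipeline, it is refit on each training fold, so the validation folds do not leak into the standardization.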