
# Linear Regression: Impact of Mean Subtraction on Fit Stability

## Introduction

In this notebook, we will illustrate how performing a linear fit on a dataset far from the origin can lead to a strong correlation between the **slope** and **intercept** parameters. We will then demonstrate how **subtracting the means** of \(x\) and \(y\) improves numerical stability by reducing this correlation.

We will:

1. **Generate synthetic data** far from the origin.
2. **Fit a linear model** and visualize the confidence intervals of the fit parameters.
3. **Compute and plot the correlation between slope and intercept** before and after subtracting means.
4. **Demonstrate how centering the data reduces correlation**, making the fit numerically more stable.

---


## Step 1: Generate Data and Perform Initial Fit

In [None]:

import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data far from the origin
N = 50  # Number of points
x = np.linspace(1000, 1100, N)  # X values far from the origin
true_slope = 2.0
true_intercept = 500.0
y = true_slope * x + true_intercept + np.random.normal(0, 10, N)  # Add noise

# Define the linear model
def linear_model(x, slope, intercept):
    return slope * x + intercept

# Perform the linear fit
popt, pcov = opt.curve_fit(linear_model, x, y)
slope_fit, intercept_fit = popt
slope_err, intercept_err = np.sqrt(np.diag(pcov))  # Extract standard errors

# Compute correlation coefficient
cov_m_b = pcov[0, 1]
corr_original = cov_m_b / (slope_err * intercept_err)

# Generate confidence contours using covariance matrix
slope_range = np.linspace(slope_fit - 3 * slope_err, slope_fit + 3 * slope_err, 100)
intercept_range = np.linspace(intercept_fit - 3 * intercept_err, intercept_fit + 3 * intercept_err, 100)
S, I = np.meshgrid(slope_range, intercept_range)

# Evaluate the (unnormalized) 2D Gaussian density implied by the covariance matrix
delta = np.stack([S - slope_fit, I - intercept_fit], axis=-1)
cov_inv = np.linalg.inv(pcov)
Z = np.exp(-0.5 * np.einsum('...i,ij,...j', delta, cov_inv, delta))

# Plot data and fit
fig, axs = plt.subplots(1, 2, figsize=(14, 5))

# Left: Data with best-fit line
axs[0].scatter(x, y, label="Data", color="blue")
axs[0].plot(x, linear_model(x, slope_fit, intercept_fit), 'r-', 
            label=f"Fit: y = {slope_fit:.2f}x + {intercept_fit:.2f}")
axs[0].set_xlabel("x")
axs[0].set_ylabel("y")
axs[0].set_title("Linear Fit to Data (Before Mean Subtraction)")
axs[0].legend()
axs[0].grid()

# Right: Confidence contours of the fit parameters
contour = axs[1].contourf(S, I, Z, levels=10, cmap="viridis")
axs[1].axvline(slope_fit, linestyle="--", color="white", alpha=0.7)
axs[1].axhline(intercept_fit, linestyle="--", color="white", alpha=0.7)
axs[1].set_xlabel("Slope")
axs[1].set_ylabel("Intercept")
axs[1].set_title(f"Confidence Contours (Correlation: {corr_original:.3f})")
fig.colorbar(contour, ax=axs[1])

plt.show()

# Print results
print("Before Mean Subtraction:")
print(f"Best-fit Slope: {slope_fit:.3f} ± {slope_err:.3f}")
print(f"Best-fit Intercept: {intercept_fit:.3f} ± {intercept_err:.3f}")
print(f"Correlation between Slope & Intercept: {corr_original:.3f}")



## Step 2: Observing Correlation in Fit Parameters

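- The confidence contour plot shows a **narrow, tilted ellipse**, meaning the slope and intercept estimates are **highly correlated**.
- The correlation coefficient has a **magnitude close to 1** and is **negative** here, because all \(x\) values are positive: a small increase in slope must be compensated by a large decrease in intercept to keep the line passing through the data.
- This happens because the intercept is the fitted value at \(x = 0\), far outside the data range (\(x\) between 1000 and 1100), so any error in the slope is amplified by a lever arm of about 1000. The same near-degeneracy between the two parameters makes the fit less numerically stable.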
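
As a rough check on where this correlation comes from, the sketch below assumes ordinary least squares with equal weights, for which the parameter covariance is proportional to \((X^T X)^{-1}\) with design matrix columns \(x\) and 1. Under that assumption the slope-intercept correlation depends only on the \(x\) values and reduces to \(-\bar{x} / \sqrt{\overline{x^2}}\). The variable names are illustrative and not part of the cells above.

```python
import numpy as np

# Same x grid as in Step 1; the noise in y does not enter the correlation
x = np.linspace(1000, 1100, 50)

# Design matrix for the model y = slope * x + intercept
X = np.column_stack([x, np.ones_like(x)])

# Parameter covariance is proportional to (X^T X)^{-1};
# the overall scale cancels in the correlation coefficient
C = np.linalg.inv(X.T @ X)
corr_matrix = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])

# Closed-form expression: -mean(x) / sqrt(mean(x^2))
corr_closed = -np.mean(x) / np.sqrt(np.mean(x**2))

print(f"Correlation from (X^T X)^-1: {corr_matrix:.5f}")
print(f"Correlation from closed form: {corr_closed:.5f}")
```

Because the spread of \(x\) (about 30) is tiny compared with its mean (1050), the ratio is within a fraction of a percent of \(-1\).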

---

## Step 3: Subtract Means and Repeat Fit

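Now we **subtract the means** from \(x\) and \(y\) and repeat the fit. With \(\bar{x} = 0\), the off-diagonal term of the parameter covariance vanishes, so the slope and intercept become essentially uncorrelated and the fit is better conditioned. The slope is unchanged by this shift; the intercept now describes the fitted value at the center of the data, so it is close to zero.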
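
One way to make the stability claim concrete is to compare the condition number of the design matrix before and after centering. The following is a minimal sketch, assuming the same \(x\) grid as in Step 1; `design_matrix` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

x = np.linspace(1000, 1100, 50)
x_centered = x - np.mean(x)

def design_matrix(x_values):
    """Hypothetical helper: columns [x, 1] for y = slope * x + intercept."""
    return np.column_stack([x_values, np.ones_like(x_values)])

# A large condition number means the two columns are nearly parallel,
# so small perturbations in the data can produce large parameter changes
cond_raw = np.linalg.cond(design_matrix(x))
cond_centered = np.linalg.cond(design_matrix(x_centered))

print(f"Condition number (raw x):      {cond_raw:.1f}")
print(f"Condition number (centered x): {cond_centered:.1f}")
```

After centering, the two columns are orthogonal, which is why the condition number drops by several orders of magnitude.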


In [None]:

# Subtract means from x and y
x_centered = x - np.mean(x)
y_centered = y - np.mean(y)

# Perform the fit again on centered data
popt_centered, pcov_centered = opt.curve_fit(linear_model, x_centered, y_centered)
slope_centered, intercept_centered = popt_centered
slope_err_centered, intercept_err_centered = np.sqrt(np.diag(pcov_centered))

# Compute correlation after centering
cov_m_b_centered = pcov_centered[0, 1]
corr_centered = cov_m_b_centered / (slope_err_centered * intercept_err_centered)

# Generate confidence contours using new covariance matrix
slope_range_c = np.linspace(slope_centered - 3 * slope_err_centered, slope_centered + 3 * slope_err_centered, 100)
intercept_range_c = np.linspace(intercept_centered - 3 * intercept_err_centered, intercept_centered + 3 * intercept_err_centered, 100)
S_c, I_c = np.meshgrid(slope_range_c, intercept_range_c)

# Evaluate the (unnormalized) 2D Gaussian density for the centered, decorrelated fit
delta_c = np.stack([S_c - slope_centered, I_c - intercept_centered], axis=-1)
cov_inv_c = np.linalg.inv(pcov_centered)
Z_c = np.exp(-0.5 * np.einsum('...i,ij,...j', delta_c, cov_inv_c, delta_c))

# Plot results after mean subtraction
fig, axs = plt.subplots(1, 2, figsize=(14, 5))

# Left: Data and refit after centering
axs[0].scatter(x_centered, y_centered, label="Data (Centered)", color="blue")
axs[0].plot(x_centered, linear_model(x_centered, slope_centered, intercept_centered), 'r-', 
            label=f"Fit: y = {slope_centered:.2f}x + {intercept_centered:.2f}")
axs[0].set_xlabel("x (Centered)")
axs[0].set_ylabel("y (Centered)")
axs[0].set_title("Linear Fit to Centered Data")
axs[0].legend()
axs[0].grid()

# Right: Confidence contours of the fit parameters after centering
contour_c = axs[1].contourf(S_c, I_c, Z_c, levels=10, cmap="viridis")
axs[1].axvline(slope_centered, linestyle="--", color="white", alpha=0.7)
axs[1].axhline(intercept_centered, linestyle="--", color="white", alpha=0.7)
axs[1].set_xlabel("Slope")
axs[1].set_ylabel("Intercept")
axs[1].set_title(f"Confidence Contours (Correlation: {corr_centered:.3f})")
fig.colorbar(contour_c, ax=axs[1])

plt.show()

# Print results
print("After Mean Subtraction:")
print(f"Best-fit Slope: {slope_centered:.3f} ± {slope_err_centered:.3f}")
print(f"Best-fit Intercept: {intercept_centered:.3f} ± {intercept_err_centered:.3f}")
print(f"Correlation between Slope & Intercept: {corr_centered:.3f}")
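
If the intercept in the original coordinates is still needed, it can be recovered from the centered fit by undoing the shift: intercept = \(\bar{y}\) + centered intercept - slope \(\cdot\) \(\bar{x}\). The sketch below illustrates this under the same data-generation settings as Step 1. It uses `np.polyfit` instead of `opt.curve_fit` to stay dependency-light, and the variable names are assumptions made for this example.

```python
import numpy as np

# Recreate the Step 1 data (same seed and parameters)
np.random.seed(42)
N = 50
x = np.linspace(1000, 1100, N)
y = 2.0 * x + 500.0 + np.random.normal(0, 10, N)

x_mean, y_mean = np.mean(x), np.mean(y)

# Straight-line fit to the centered data; np.polyfit returns [slope, intercept]
slope_c, intercept_c = np.polyfit(x - x_mean, y - y_mean, 1)

# Undo the shift: the slope is unchanged, the intercept is moved back to x = 0
slope_back = slope_c
intercept_back = y_mean + intercept_c - slope_c * x_mean

# Direct fit to the uncentered data, for comparison
slope_direct, intercept_direct = np.polyfit(x, y, 1)

print(f"Slope:     centered fit {slope_back:.6f}, direct fit {slope_direct:.6f}")
print(f"Intercept: centered fit {intercept_back:.4f}, direct fit {intercept_direct:.4f}")
```

The two routes give the same line. Centering does not change the fitted model; it only changes the parameterization, so the large uncertainty of the original intercept reappears once the slope error is multiplied by \(\bar{x}\) in the back-transformation.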
