# Interactive Exploration of Fitting Data to Models:<br> Gaussian Distributions, Reduced Chi-Squared, Linear vs. Non-Linear Fits

## Introduction

This notebook was created by [Jupyter AI](https://github.com/jupyterlab/jupyter-ai) with the following prompt:

> /generate act as an expert in hand-on STEM learning. I want you to generate an interactive Jupyter notebook that does the following:
>
> - introduce the concept of a Gaussian distribution as a model for an observation and associated uncertainty,
> - Ask students to evaluate how likely is is that 'truth' lies more than one sigma away from the observation,
> - produces a random set of 10 data points that lie along a line, with associated noise. Plots them with error bars. make the x axis range from 1 to 10.
> - Make an interactive cell that has sliders to adjust slope and intercept of a fit, and computes the reduced chi-squared for the fit. Add a tutorial cell that talks about minimzing negative log likelihood.
> - include a 3-d plot of the reduced chi-squared surface (which should be parabolic) and show the chosen value as it evolves.

--------------------------------

This Jupyter notebook offers an immersive exploration into the world of Gaussian distributions, uncertainty quantification, and linear fitting, centering on the reduced chi-squared statistic. It begins with an introduction to Gaussian distributions, highlighting the roles of mean and standard deviation as indicators of central tendency and spread, respectively, and representing uncertainty through standard deviation. Students are then guided to calculate the probability of the true value lying within one standard deviation from the mean using the Gaussian cumulative distribution function. Further, the notebook generates a set of 10 random data points, akin to a line with Gaussian noise, and uses Python's numpy and matplotlib libraries to visualize these points with error bars across an x-axis ranging from 1 to 10. An interactive component is included, featuring Jupyter widgets that allow users to adjust the slope and intercept of a fitted line and calculate the reduced chi-squared statistic for this fit. Additionally, a tutorial demonstrates the concept of maximum likelihood estimation (MLE), emphasizing how minimizing the negative log likelihood relates to effective data fitting and model selection. The notebook culminates with a 3D plot showcasing the reduced chi-squared surface over varying slope and intercept values using matplotlib's mplot3d, illustrating the dynamics of chosen fit values with evolving markers on the surface.

## Likelihood of the Truth Being a Distance $X$ from the Measured Value

In scenarios where measurements have Gaussian (normally distributed) errors, it's important to understand how likely it is that the "true" value lies a certain distance $X$ away from the measured value. This knowledge is vital for error estimation and confidence analysis in experimental data.

### Gaussian Distribution with Non-zero Mean

We will model the uncertainty associated with a measurement as a Gaussian with mean $\mu$ and standard deviation $\sigma$. The probability density function (PDF) of the Gaussian distribution gives the probability density of obtaining a given measurement $x$:

$$
f(x) = \frac{1}{\sqrt{2 \pi} \sigma} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right)
$$

where $x$ is the observed measurement.

The mean $\mu$ is our estimator of some 'true' underlying value, and $\sigma$ reports the 'noise' associated with a single measurement.

### Likelihood of Truth Being a Distance $X$ from $\mu$

The probability that the true value lies within a distance $X$ of the measured value is the integral of the PDF over an interval around the measurement. It can be calculated with the cumulative distribution function (CDF) for $\Delta x$ between $-X$ and $X$:

$$
P(-X < \Delta x < X) = \int_{\mu - X}^{\mu + X} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) \, dx
$$

where:
$\Delta x = x_{\text{true}} - x_{\text{measured}}$

### Interpretation

- **Distribution Centered on $\mu$:** The mean of the distribution $\mu$ represents the central tendency of the measurements, and shifts the center of our interval of integration.
- **Confidence Intervals:** This analysis can inform confidence intervals around a measured value, providing insights into the precision and accuracy of measurements by quantifying the probability that the true value lies within a certain range of the measured value.
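
As a quick numerical check of the integral above, the sketch below evaluates $P(-X < \Delta x < X)$ for $X = 1\sigma$, $2\sigma$ and $3\sigma$. It relies on the identity $P = \mathrm{erf}\left(X / (\sigma\sqrt{2})\right)$ and the standard-library `math.erf`; the helper name `prob_within` is made up for this illustration.

```python
import math

def prob_within(X, sigma):
    """Probability that a Gaussian deviate lies within +/- X of its mean."""
    return math.erf(X / (sigma * math.sqrt(2.0)))

sigma = 1.0  # illustrative value; the result depends only on X / sigma
for k in (1, 2, 3):
    p_in = prob_within(k * sigma, sigma)
    print(f"within {k} sigma: {p_in:.4f}   outside: {1.0 - p_in:.4f}")
```

For $X = \sigma$ the probability is about 0.683, so the truth lies more than one sigma away from the measurement roughly 32% of the time.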


## Interactive demonstration

The cell below lets you select a mean $\mu$ and standard deviation $\sigma$. Think of this as corresponding to a data point and its associated uncertainty.
You can also use a slider to choose a 'truth' value $T$. The cell will plot the Gaussian probability density function (PDF) and report the probability that the truth lies at least that far from $\mu$.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import ipywidgets as widgets
from IPython.display import display

def plot_pdf(mean, std, x_value):
    # Create the x values
    x = np.linspace(mean - 4*std, mean + 4*std, 1000)

    # Calculate the PDF values
    pdf_values = norm.pdf(x, mean, std)

    # Plot the PDF
    plt.figure(figsize=(8, 5))
    plt.plot(x, pdf_values, label='PDF', color='blue')
    plt.axvline(x=x_value, color='red', linestyle='--', label=f'Truth T: {x_value:.2f}')
    plt.title('Gaussian Distribution PDF')
    plt.xlabel('X')
    plt.ylabel('Probability Density')
    plt.legend()
    plt.grid(True)
    plt.show()

    # Calculate the discrepancy (z-score)
    z_score = np.abs(x_value - mean) / std

    # Calculate the range bounds
    boundary = np.abs(x_value - mean)

    # Calculate the likelihood of being more discrepant (two-tailed)
    prob_within_boundaries = norm.cdf(mean + boundary, mean, std) - norm.cdf(mean - boundary, mean, std)
    discrepant_prob = 1 - prob_within_boundaries

    print(f"Truth value {x_value:.2f} is {z_score:.2f} sigma away from the mean")
    print(f"Probability of a deviation larger than {z_score:.2f} sigma (two-tailed): {discrepant_prob:.6f}")

# Create widgets for mean, std, and x_value
mean_widget = widgets.FloatSlider(value=0, min=-10, max=10, step=0.1, description='Mean')
std_widget = widgets.FloatSlider(value=1, min=0.1, max=5, step=0.1, description='Std Dev')
x_value_widget = widgets.FloatSlider(value=0, min=-10, max=10, step=0.05, description='T Value')

# Use the interactive function to link the plot to the sliders
interactive_plot = widgets.interactive(plot_pdf, mean=mean_widget, std=std_widget, x_value=x_value_widget)

# Display the interactive plot
display(interactive_plot)

## Fitting Model Parameters with Uncertainties

With this in hand, we can now move on to finding the best-fit parameters of a model for a given data set with values and associated uncertainties. <br>
We'll start by making a random data set, and then do both a manual and an automated linear fit to the data points.

## Generate Random Data and Plot with Error Bars

In [None]:
np.random.seed(42) # seed the random number generator so the results are reproducible

In [None]:
# pick some parameters
true_slope = 2.0
true_intercept = 5.0
num_points = 10

In [None]:
# make x, y, and uncertainty vectors

x_values = np.arange(1, num_points + 1) # sets up an array of 10 equally spaced x values
true_y_values = true_slope * x_values + true_intercept # makes y values according to parameters above

# leave this commented out until later on please
# true_y_values = true_slope * x_values + true_intercept + 0.5*x_values*x_values # adds a quadratic term to the straight line

noise = np.random.normal(0, 1, num_points) # create Gaussian noise vector with sigma=1 and mean 0.
y_values = true_y_values + noise # add noise to the data

y_errors = np.ones(num_points) # create a vector of measurement uncertainties, with sigma = 1.


In [None]:
# It's ALWAYS a good idea to plot your data, so let's do that.

plt.errorbar(x_values, y_values, yerr=y_errors, fmt='o')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simulated Noisy Data Points with Error Bars')
plt.grid(True)
plt.show()

## Least Squares Fitting and Its Connection to Negative Log Likelihood

In statistical modeling, especially when dealing with Gaussian or normally distributed errors, the process of fitting a model to data often involves minimizing a specific cost function. This approach can be linked to both maximum likelihood estimation and the minimization of the reduced chi-squared statistic.

### Gaussian Assumption

If the errors in the measurements are assumed to be Gaussian and the measurements are independent (uncorrelated), then the likelihood $L$ of observing the data set given a model is the product of the probability densities of the individual data points:

$$
L = \prod_{i=1}^{N} \frac{1}{\sqrt{2 \pi \sigma_i^2}} \exp\left(-\frac{(y_i - y_{\text{model}, i})^2}{2 \sigma_i^2} \right),
$$

where:
- $y_i$ are the observed data points,
- $y_{\text{model}, i}$ are the model predictions,
- $\sigma_i$ are the standard deviations of the measurement errors,
- $N$ is the number of data points.

If we want to find a mathematical model that is a best-fit to the data, we seek to maximize this likelihood.   

### Negative Log Likelihood

To simplify the calculation, we often work with the negative log likelihood:

$$
-\log{L} = \sum_{i=1}^{N} \frac{(y_i - y_{\text{model}, i})^2}{2 \sigma_i^2} + \text{constant}
$$

Minimizing the negative log likelihood is equivalent to minimizing the sum of squared, normalized residuals, commonly known as the least squares criterion:

$$
\sum_{i=1}^{N} \frac{(y_i - y_{\text{model}, i})^2}{\sigma_i^2}
$$
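
The equivalence can be checked numerically. The sketch below uses a made-up data set (the names `x`, `y`, `sigma` and both helper functions are assumptions for illustration, not part of the notebook's cells). It shows that the negative log likelihood and half the sum of squared, normalized residuals differ by the same constant for any choice of slope and intercept, so both are minimized by the same parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 11, dtype=float)
sigma = np.ones_like(x)                      # assumed measurement uncertainties
y = 2.0 * x + 5.0 + rng.normal(0.0, sigma)   # hypothetical noisy straight line

def neg_log_likelihood(m, b):
    """Full Gaussian negative log likelihood, including the normalization term."""
    resid = y - (m * x + b)
    return np.sum(0.5 * (resid / sigma) ** 2 + 0.5 * np.log(2 * np.pi * sigma ** 2))

def weighted_sq_residuals(m, b):
    """Sum of squared, normalized residuals."""
    return np.sum(((y - (m * x + b)) / sigma) ** 2)

trial_params = [(2.0, 5.0), (1.0, 0.0), (-3.0, 12.0)]
offsets = [neg_log_likelihood(m, b) - 0.5 * weighted_sq_residuals(m, b)
           for m, b in trial_params]
print(offsets)  # the same constant for every trial (m, b)
```

With ten points and $\sigma_i = 1$, the constant is $10 \times \tfrac{1}{2}\ln(2\pi) \approx 9.19$.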

### Connection to Reduced Chi-Squared

The least squares approach leads directly to the definition of the chi-squared statistic:

$$
\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y_{\text{model}, i})^2}{\sigma_i^2}
$$

In practice, the reduced chi-squared statistic is used to account for the degrees of freedom in the model, providing a normalized measure:

$$
\chi^2_{\text{reduced}} = \frac{\chi^2}{\text{dof}}
$$

where:
- $\text{dof} = N - p$ is the number of degrees of freedom,
- $p$ is the number of parameters in the model.
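
A small worked example of these definitions, using invented residuals (ten points that each sit exactly one sigma from a two-parameter model); all numbers here are illustrative assumptions:

```python
# Hypothetical residuals y_i - y_model,i and their uncertainties
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
sigmas = [1.0] * len(residuals)
num_params = 2  # e.g. slope and intercept

chi2 = sum((r / s) ** 2 for r, s in zip(residuals, sigmas))
dof = len(residuals) - num_params
chi2_reduced = chi2 / dof
print(chi2, dof, chi2_reduced)  # 10.0 8 1.25
```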

### Interpretation

- **Likelihood Maximization:** By minimizing the negative log likelihood, we effectively maximize the likelihood function, leading to the best-fit parameters under Gaussian assumptions.
- **Goodness of Fit:** The reduced chi-squared provides a measure of how well the model fits the data relative to the expected statistical variation. A value close to 1 usually indicates a good fit, assuming the model is correct and the errors are well estimated.

This relationship between least squares fitting, likelihood estimation, and chi-squared statistics is a fundamental concept in statistical regression analysis, especially in the context of Gaussian-distributed measurement errors.

### What this interactive $\chi_{\rm{reduced}}^2$ surface shows

This section visualizes the **reduced chi-squared ($\chi_{\rm{reduced}}^2$)** surface for a straight-line model with Gaussian errors, as a function of the model's parameters (slope and intercept).  
The surface plot helps you see where the fit is a *better* (lower $\chi_{\rm{reduced}}^2$) or *worse* (higher $\chi_{\rm{reduced}}^2$) description of the data, and how parameters may be correlated.

**How the plot is built:**

1. A grid over the chosen parameter pair (slope and intercept) is created.
2. For each grid point, the model is evaluated and a $\chi_{\rm{reduced}}^2$ value is computed against the current dataset.
3. The resulting 2D array of $\chi_{\rm{reduced}}^2$ values is rendered as a **3D surface**: the height is $\log_{10}\chi_{\rm{reduced}}^2$ (a logarithmic scale, because the values span several orders of magnitude over the slider ranges), letting you see how well the model fits the data for different combinations of slope and intercept. A marker on the surface shows the current combination of parameters.
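
Steps 1 and 2 can also be written with NumPy broadcasting instead of a loop over grid points. This is a minimal sketch on its own synthetic data (the arrays `x`, `y`, `sigma` and the grid sizes are assumptions chosen for illustration), not the code used in the interactive cell:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(1, 11, dtype=float)
sigma = np.ones_like(x)
y = 2.0 * x + 5.0 + rng.normal(0.0, sigma)   # hypothetical noisy straight line

# Step 1: grid over slope and intercept
slopes = np.linspace(-5, 5, 101)
intercepts = np.linspace(-20, 20, 101)
M, B = np.meshgrid(slopes, intercepts)       # each has shape (101, 101)

# Step 2: evaluate the model at every grid point; the trailing axis runs over data points
model = M[..., None] * x + B[..., None]      # shape (101, 101, 10)
chi2_red = np.sum(((y - model) / sigma) ** 2, axis=-1) / (x.size - 2)

# Locate the grid point with the smallest reduced chi-squared
i, j = np.unravel_index(np.argmin(chi2_red), chi2_red.shape)
print(f"grid minimum: slope = {M[i, j]:.2f}, intercept = {B[i, j]:.2f}, "
      f"reduced chi-squared = {chi2_red[i, j]:.2f}")
```

The grid minimum is only as precise as the grid spacing, which is one reason an optimizer is preferred for the final fit.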

**Move the slope and intercept sliders to find the combination of parameters that minimizes $\chi_{\rm{reduced}}^2$.**  What values do you find yield the best fit?

In [None]:
from ipywidgets import interact, FloatSlider

In [None]:
# Sample data: x values and corresponding y values with associated uncertainties
x_data = x_values
y_data = y_values

In [None]:
# optional centering of x and y data about their respective means
centered = 0
if centered == 1:
    x_data = x_data - np.mean(x_data)
    y_data = y_data - np.mean(y_data)

# optional large offset
# x_data=x_data+10

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact
from mpl_toolkits.mplot3d import Axes3D

# Function to calculate reduced chi-squared
def reduced_chi_squared(y_obs, y_exp, errors, num_params):
    """
    Calculate the reduced chi-squared statistic.

    y_obs: Observed data
    y_exp: Expected data (model predictions)
    errors: Errors in the observed data
    num_params: Number of fitted parameters

    Returns: Reduced chi-squared value
    """
    chi_squared = np.sum(((y_obs - y_exp) / errors) ** 2)
    degrees_of_freedom = len(y_obs) - num_params
    return chi_squared / degrees_of_freedom

# Interactive function to plot both 2D and 3D
def interactive_fit(slope, intercept):

    y_model = slope * x_data + intercept
    red_chi_squared = reduced_chi_squared(y_data, y_model, y_errors, num_params=2)

    # Create subplots side by side
    fig = plt.figure(figsize=(15, 6))

    # Plotting the 2D fit
    ax1 = fig.add_subplot(121)
    ax1.errorbar(x_data, y_data, yerr=y_errors, fmt='o', label='Data with errors')
    ax1.plot(x_data, y_model, label=f'Fit: y = {slope:.2f}x + {intercept:.2f}')
    ax1.set_xlabel('x')
    ax1.set_ylabel('y')
    ax1.set_title(f'Linear Fit with Reduced Chi-Squared: {red_chi_squared:.2f}')
    ax1.legend()
    ax1.grid(True)

    # Define meshgrid for slope and intercept values
    slope_range = np.linspace(-5, 5, 100)  # Match slider range
    intercept_range = np.linspace(-20, 20, 100)  # Match slider range
    slope_grid, intercept_grid = np.meshgrid(slope_range, intercept_range)

    # Calculate the chi-squared surface
    chi_squared_surface = np.array([
        reduced_chi_squared(y_data, m * x_data + b, y_errors, num_params=2)
        for m, b in zip(np.ravel(slope_grid), np.ravel(intercept_grid))
    ]).reshape(slope_grid.shape)

    # 3D surface plot
    ax2 = fig.add_subplot(122, projection='3d')
    surf = ax2.plot_surface(slope_grid, intercept_grid, np.log10(chi_squared_surface), cmap='viridis', alpha=0.8)
    ax2.scatter(slope, intercept, np.log10(red_chi_squared), color='red', s=100, label='Trial point',zorder=100,alpha=1)
    ax2.set_xlabel('Slope')
    ax2.set_ylabel('Intercept')
    ax2.set_zlabel('log Reduced Chi-Squared')
    ax2.set_xlim([-5, 5])
    ax2.set_ylim([-20, 20])
    ax2.set_title('log Reduced Chi-Squared Surface')
    ax2.legend()
    fig.colorbar(surf, ax=ax2, shrink=0.5, aspect=5)

    plt.tight_layout()
    plt.show()

# Use interactive to adjust the slope and intercept
interact(interactive_fit, slope=(-5.0, 5.0, 0.01), intercept=(-20.0, 20.0, 0.01))

### Moving from "chi-by-eye" to an automated fit: linear regression with uncertainties

Now that we’ve explored the general behavior of this $\chi_{\rm{reduced}}^2$ surface, let’s apply these ideas to a **real curve fitting example**.  
We will fit a straight line

$$
y = m x + b
$$

to the dataset, taking into account the uncertainties on each data point.

**What happens in this cell:**

1. We define a simple linear model and use `scipy.optimize.curve_fit` to estimate the best-fit slope $m$ and intercept $b$.
2. The uncertainties on these parameters are extracted from the covariance matrix of the fit.
3. The data are plotted with error bars, along with the best-fit line.
4. We compute the **residuals** (data – model), the overall χ² statistic, and the **reduced χ²** to evaluate the quality of the fit.

**Why this matters:**

- If the reduced χ² is close to 1, it suggests the model and error estimates are consistent with the data.
- Larger values (≫1) indicate the model is not capturing the data trends or the error bars are too small.
- Smaller values (≪1) suggest the errors may be overestimated or the model is overfitting.
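
The effect of mis-estimated error bars can be illustrated with a short sketch. It assumes synthetic data whose true noise has $\sigma = 1$ and then evaluates the reduced χ² with three different *claimed* uncertainties; the variable names and the use of `np.polyfit` are choices made for this example only. It does not demonstrate overfitting.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(1, 11, dtype=float)
y = 2.0 * x + 5.0 + rng.normal(0.0, 1.0, x.size)   # true noise sigma = 1

# Unweighted straight-line fit; identical to a weighted fit when all errors are equal
m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)
dof = x.size - 2

results = {}
for claimed_sigma in (0.5, 1.0, 2.0):
    results[claimed_sigma] = np.sum((residuals / claimed_sigma) ** 2) / dof
    print(f"claimed sigma = {claimed_sigma}: reduced chi-squared = {results[claimed_sigma]:.2f}")
```

Halving the claimed error bars multiplies the reduced χ² by four, and doubling them divides it by four, even though the fitted line is unchanged.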

This provides a concrete link between the $\chi_{\rm{reduced}}^2$ surface we visualized above and the practical task of fitting a model to experimental data.

**How close is this best-fit model to the parameters you selected by interacting with the chi-squared surface?**


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Define a linear function for fitting
def linear(x, m, b):
    return m * x + b

# Perform the curve fit
popt, pcov = curve_fit(linear, x_data, y_data, sigma=y_errors, absolute_sigma=True)

# Extract fit parameters and their uncertainties
m_best_fit, b_best_fit = popt
m_uncertainty, b_uncertainty = np.sqrt(np.diag(pcov))

# Plot data
plt.errorbar(x_data, y_data, yerr=y_errors, fmt='o', label='Data with errors', color='blue')
x_fit = np.linspace(min(x_data) - 1, max(x_data) + 1, 100)
y_fit = linear(x_fit, m_best_fit, b_best_fit)
plt.plot(x_fit, y_fit, '-', label=f'Fit: y = ({m_best_fit:.2f} ± {m_uncertainty:.2f})x + ({b_best_fit:.2f} ± {b_uncertainty:.2f})', color='red')

# Add plot legend
plt.legend(loc='best')

# Add plot labels
plt.xlabel('X Data')
plt.ylabel('Y Data')
plt.title('Linear Fit with Uncertainties')

# Add fit equation in a box
fit_eq = (f"y = ({m_best_fit:.2f} ± {m_uncertainty:.2f})x + "
          f"({b_best_fit:.2f} ± {b_uncertainty:.2f})")
plt.annotate(fit_eq, xy=(0.05, 0.95), xycoords='axes fraction',
             fontsize=10, ha='left', va='top',
             bbox=dict(boxstyle="round,pad=0.3", edgecolor='gray', facecolor='lightyellow', alpha=0.5))

plt.grid(True)
plt.show()

# Calculate the residuals and chi-squared
residuals = y_data - linear(x_data, *popt)
chi_squared = np.sum((residuals / y_errors) ** 2)

# Calculate degrees of freedom
dof = len(y_data) - len(popt)  # Number of data points minus number of fit parameters

# Calculate reduced chi-squared
# (stored under a new name so the reduced_chi_squared() function used by the interactive cell above is not overwritten)
fit_reduced_chi_squared = chi_squared / dof


# Print the fit results
print(f"Slope (m) = {m_best_fit:.2f} ± {m_uncertainty:.2f}")
print(f"Intercept (b) = {b_best_fit:.2f} ± {b_uncertainty:.2f}")
print(f"Reduced Chi-Squared: {fit_reduced_chi_squared:.2f}")


In [None]:
# always a good idea to plot the residuals of the fit

# Plot data
plt.errorbar(x_data, residuals, yerr=y_errors, fmt='o', label='Residuals', color='blue')

# Add plot labels
plt.xlabel('X')
plt.ylabel('Residuals')
plt.title('Fit Residuals with Uncertainties')

plt.grid(True)
plt.show()

# Linear vs. Non-Linear Fitting

## Linear Fitting

**Definition:**  
Linear fitting applies when the model is linear in the fit parameters. One simple example is $y(x) = mx + b$, where:
- $m$ is the slope, and
- $b$ is the y-intercept.

Another example of a linear fit is $y(x) = A \sin(x) + B \cos(x)$, where A and B are fit parameters.
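
Because $A$ and $B$ enter linearly, this model can be fit in one step with ordinary linear least squares even though the curve itself is not a straight line. The sketch below assumes synthetic data with made-up true values and uses `numpy.linalg.lstsq` on a design matrix whose columns are $\sin(x)$ and $\cos(x)$:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 2.0 * np.pi, 50)
A_true, B_true = 1.5, -0.7                   # hypothetical values for illustration
y = A_true * np.sin(x) + B_true * np.cos(x) + rng.normal(0.0, 0.1, x.size)

# Each column of the design matrix multiplies one fit parameter
design = np.column_stack([np.sin(x), np.cos(x)])
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
A_fit, B_fit = coeffs
print(f"A = {A_fit:.3f}, B = {B_fit:.3f}")
```

No initial guess and no iteration are needed.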

**Characteristics:**
- **Simple & Direct:** The relationship between variables is straightforward.
- **Analytical Solutions:** Solving a linear fit requires only matrix algebra and leads to a unique solution. The $\chi^2$ surface is a paraboloid with a single minimum.
- **Efficient:** Linear regression is computationally simple and fast, making it practical even for large datasets.
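
As a sketch of the analytical solution for a straight line with uncertainties, the weighted normal equations give both the best-fit parameters and their covariance matrix in closed form: $\hat{\beta} = (A^T W A)^{-1} A^T W y$ with $W = \mathrm{diag}(1/\sigma_i^2)$. The data and variable names below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.arange(1, 11, dtype=float)
sigma = np.full(x.size, 1.0)
y = 2.0 * x + 5.0 + rng.normal(0.0, sigma)   # hypothetical noisy straight line

A = np.column_stack([x, np.ones_like(x)])    # columns: slope term, intercept term
W = np.diag(1.0 / sigma ** 2)                # weights from the measurement uncertainties

cov = np.linalg.inv(A.T @ W @ A)             # parameter covariance matrix
m_fit, b_fit = cov @ A.T @ W @ y             # best-fit slope and intercept
m_err, b_err = np.sqrt(np.diag(cov))         # one-sigma parameter uncertainties
print(f"m = {m_fit:.2f} ± {m_err:.2f}, b = {b_fit:.2f} ± {b_err:.2f}")
```

For these ten equally weighted points the slope uncertainty is about 0.11 and the intercept uncertainty about 0.68, regardless of the particular noise realization.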

**Applications:**  
- Used when data sets exhibit a clear linear trend.
- Widely employed in fields like economics (e.g., forecasting), biology (e.g., determining growth rates), and physics (e.g., relationships following Newtonian mechanics).


## Non-Linear Fitting

**Definition:**  
Non-linear fitting is necessary when the model is non-linear in its fit parameters, so the best-fit values cannot be obtained by solving a set of linear equations. Examples include exponential growth with an unknown rate, power laws with an unknown exponent, sigmoidal curves, etc.

**Characteristics:**
- **Complex Models:** Non-linear models often have forms like $y = ae^{bx}$ or $y = ax^b$, or sinusoids with an unknown frequency or phase. (Polynomials, by contrast, are linear in their coefficients.)
- **Iterative Solutions:** Non-linear fitting requires iterative numerical algorithms (like Gauss-Newton or Levenberg-Marquardt) due to the absence of simple algebraic solutions.
- **Initial Guesses:** Requires initial parameter estimates to start the iterative process, which influence the convergence and outcome.
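
A minimal sketch of an iterative non-linear fit, assuming synthetic data drawn from $y = ae^{bx}$ with made-up true values. It uses `scipy.optimize.curve_fit`, which needs a starting point `p0` for the iteration; the model function and data here are illustrative only:

```python
import numpy as np
from scipy.optimize import curve_fit

def exponential(x, a, b):
    """Model that is non-linear in the parameter b."""
    return a * np.exp(b * x)

rng = np.random.default_rng(5)
x = np.linspace(0.0, 2.0, 20)
sigma = np.full(x.size, 0.05)
y = exponential(x, 2.0, 0.8) + rng.normal(0.0, sigma)   # hypothetical data

p0 = [1.0, 1.0]   # initial guess that starts the iteration
popt, pcov = curve_fit(exponential, x, y, p0=p0, sigma=sigma, absolute_sigma=True)
a_fit, b_fit = popt
a_err, b_err = np.sqrt(np.diag(pcov))
print(f"a = {a_fit:.3f} ± {a_err:.3f}, b = {b_fit:.3f} ± {b_err:.3f}")
```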

**Pitfalls & Challenges:**
- **Convergence Issues:** Algorithmic convergence is not guaranteed; poor initial guesses can lead to non-convergence or convergence to local minima instead of a global solution.
- **Computational Intensity:** The iterative nature can be computationally expensive, especially for complex models or large datasets.
- **Sensitivity & Instability:** Non-linear models can be sensitive to small changes in data, leading to instability in estimates, especially if data quality is poor or models are overparametrized.
- **Overfitting:** Adding too many parameters can lead to modeling noise rather than the underlying trend, reducing predictive power on new data.
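
The sensitivity to initial guesses can be explored with a sinusoid of unknown frequency. The sketch below is an illustration under assumed synthetic data: it runs the same fit from a starting frequency near the true value and from one far away, then prints the resulting parameters and $\chi^2$. The outcome for the distant guess depends on the data and the optimizer, so no particular result is claimed for it; the helper `try_fit` is hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def sinusoid(x, amplitude, omega):
    return amplitude * np.sin(omega * x)

rng = np.random.default_rng(6)
x = np.linspace(0.0, 10.0, 50)
sigma = np.full(x.size, 0.1)
y = sinusoid(x, 1.0, 2.0) + rng.normal(0.0, sigma)   # true frequency omega = 2

def try_fit(p0):
    """Return (best-fit parameters, chi-squared), or None if the optimizer fails."""
    try:
        popt, _ = curve_fit(sinusoid, x, y, p0=p0, sigma=sigma, absolute_sigma=True)
    except RuntimeError:
        return None
    chi2 = np.sum(((y - sinusoid(x, *popt)) / sigma) ** 2)
    return popt, chi2

good = try_fit([1.0, 2.1])   # starting frequency close to the truth
poor = try_fit([1.0, 5.0])   # starting frequency far from the truth
print("near guess:", good)
print("far guess: ", poor)
```

Comparing the two $\chi^2$ values is a simple way to spot a fit that has settled in a local minimum.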

## Conclusion

The choice between linear and non-linear fitting should depend on:
- **Nature of Data:** Data described by a model that is linear in its parameters (a straight line, a polynomial) can use linear regression, while models that are non-linear in their parameters require non-linear methods.
- **Predictive Goal:** Consider whether you need straightforward predictions from clear trends or need to capture intricate patterns.
- **Model Simplicity vs. Accuracy:** Balance overfitting risks by choosing a model complex enough to capture significant patterns without excessively fitting noisy data points.

Understanding these distinctions helps ensure the selected model appropriately represents the data, leveraging either the strengths of linear simplicity or non-linear flexibility, while remaining cautious of their respective pitfalls and challenges.