# Intro
This technique was proposed by Cleveland in 1979 for modeling and smoothing two-dimensional data. It provides a general and flexible approach to approximating such data. The approach is computationally demanding, but it deserves the attention of researchers who are concerned about outliers in their data. In particular, it is widely used in biology, especially in genetic research.

In statistics, the term LOWESS stands for "locally weighted scatterplot smoothing": the process of fitting a smooth curve through the data points of a scatter plot.

# The problem of prediction

For forecasting tasks, the simplest regression model is a straight line fitted to the trend of the data. However, data with periodicity or volatility cannot be fitted well by a single straight line; such a model will deviate strongly from the data. Locally weighted regression (lowess) copes better with this problem, and the curve that follows the general trend can then be used for forecasts.

# The problem of smoothing

Locally weighted regression (lowess) is also well suited to the smoothing problem. Smoothing reveals trends or seasonal patterns in the data. For such data we cannot remove outliers with a simple rule such as the mean plus or minus three standard deviations; we need to take the trend into account. With locally weighted regression you can fit a trend line and use it as a baseline: the further a point lies from the baseline, the more likely it is to be an outlier.
In practice, locally weighted regression is used mainly for smoothing, because for prediction other models are often more accurate. For smoothing, however, lowess is intuitive and convincing.

This notebook introduces the LOWESS smoother in the nonparametric package. LOWESS performs weighted local linear fits.

We generate some non-linear data and perform a LOWESS fit, then compute a 95% confidence interval around the fit by bootstrap resampling.

In [None]:
import numpy as np
import pylab
import seaborn as sns
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt

sns.set_style("white")
pylab.rc("figure", figsize=(12, 8))
pylab.rc("font", size=14)

In [None]:
# Seed for consistency
np.random.seed(0)

In [None]:
# Generate data looking like cosine
x = np.random.uniform(0, 4 * np.pi, size=200)
y = np.cos(x) + np.random.random(size=len(x))

# Compute a lowess smoothing of the data
smoothed = sm.nonparametric.lowess(exog=x, endog=y, frac=0.2)

The `lowess` function returns smoothed estimates of `endog` at the given `exog` values, computed from the points (exog, endog).

`exog`: 1-D numpy array. The x-values of the observed points.

`endog`: 1-D numpy array. The y-values of the observed points.

`frac`: float between 0 and 1. The fraction of the data used when estimating each y-value.

The returned array is two-dimensional if return_sorted is True (the default), and one-dimensional if return_sorted is False. If return_sorted is True, the result is a numpy array with two columns: the first column contains the sorted x (exog) values and the second column the associated estimated y (endog) values. If return_sorted is False, only the fitted values are returned, in the same order as the input arrays. If xvals is provided, return_sorted is ignored and the returned array is always one-dimensional, containing the y values fitted at the x values given by xvals.

**Notes**

This lowess function implements the algorithm given in the reference below using local linear estimates.

Suppose the input data has N points. The algorithm estimates the smooth y_i by taking the frac*N points closest to (x_i,y_i), based on their x values, and fitting a weighted linear regression to them. The weight for (x_j,y_j) is the tricube function applied to abs(x_i-x_j), scaled by the distance to the farthest point in the neighbourhood.
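
To make this step concrete, here is a minimal NumPy sketch of a single local fit. It is not the statsmodels implementation: the helper names `tricube` and `local_linear_estimate`, the rounding of frac*N, and the use of `np.polyfit` are assumptions made for illustration.

```python
import numpy as np


def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
    u = np.clip(np.abs(u), 0.0, 1.0)
    return (1.0 - u**3) ** 3


def local_linear_estimate(x, y, x0, frac=0.5):
    """Estimate the smooth value at x0 from roughly frac*N nearest points."""
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))  # neighbourhood size (sketch's rounding)
    dist = np.abs(x - x0)
    nearest = np.argsort(dist)[:k]
    bandwidth = dist[nearest].max()  # assumed > 0, i.e. not all x identical
    weights = tricube(dist[nearest] / bandwidth)
    # np.polyfit multiplies the unsquared residuals by w, so pass sqrt(weights)
    slope, intercept = np.polyfit(
        x[nearest], y[nearest], deg=1, w=np.sqrt(weights)
    )
    return slope * x0 + intercept


# On exactly linear data the local fit reproduces the line
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0
estimate = local_linear_estimate(x, y, x0=4.0, frac=0.3)
print(estimate)
```

Repeating this estimate at every x_i gives the initial (non-robust) smooth.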

If `it` > 1, then further weighted local linear regressions are performed, where the weights are the same as above multiplied by the bisquare function of the residuals. Each iteration takes approximately the same amount of time as the original fit, so these iterations are expensive. They are most useful when the noise has extremely heavy tails, such as Cauchy noise. Noise with lighter tails, such as t-distributions with df>2, is less problematic. The weights reduce the influence of points with large residuals. In the extreme case, points whose residuals are larger than 6 times the median absolute residual are given weight 0.
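
The robustness weights can be sketched as below. The function name `bisquare_robustness_weights` is hypothetical, and the code assumes a non-zero median absolute residual; it illustrates the rule in the text rather than reproducing the library's internal routine.

```python
import numpy as np


def bisquare_robustness_weights(residuals):
    """Bisquare weights: (1 - u^2)^2 with u = r / (6 * median|r|), 0 if |u| >= 1."""
    residuals = np.asarray(residuals, dtype=float)
    scale = 6.0 * np.median(np.abs(residuals))  # assumed to be non-zero
    u = residuals / scale
    weights = (1.0 - u**2) ** 2
    weights[np.abs(u) >= 1.0] = 0.0
    return weights


# Four small residuals and one very large one
residuals = np.array([0.1, -0.1, 0.1, -0.1, 10.0])
weights = bisquare_robustness_weights(residuals)
print(weights)
```

In each robustifying iteration these weights multiply the tricube weights, so the point with the large residual no longer affects the next local fits.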

delta can be used to save computations. For each x_i, regressions are skipped for points closer than delta. The next regression is fit for the farthest point within delta of x_i and all points in between are estimated by linearly interpolating between the two regression fits.

Judicious choice of delta can cut computation time considerably for large data (N > 5000). A good choice is delta = 0.01 * range(exog).

If xvals is provided, the regression is then computed at those points and the fit values are returned. Otherwise, the regression is run at points of exog.

Some experimentation is likely required to find a good choice of `frac` and `it` for a particular dataset.

In [None]:
# Plot the fit line
fig, ax = pylab.subplots()

ax.scatter(x, y)
ax.plot(smoothed[:, 0], smoothed[:, 1], c="k")
pylab.autoscale(enable=True, axis="x", tight=True)

# Confidence interval

Now that we have performed a fit, we may want to know how precise it is. Bootstrap resampling gives one way of estimating confidence intervals around a LOWESS fit by recomputing the LOWESS fit for a large number of random resamplings from our data.
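
The resampling idea can be sketched independently of LOWESS. To keep the example dependency-light, a straight-line fit with `np.polyfit` stands in for the smoother here; that substitution, the synthetic data, and the use of `np.percentile` to read off the band are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic data: a noisy straight line
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + rng.normal(scale=1.0, size=x.size)
grid = np.linspace(0, 10, 21)  # fixed points at which every fit is evaluated

n_boot = 500
curves = np.empty((n_boot, grid.size))
for b in range(n_boot):
    # Resample (x, y) pairs with replacement
    idx = rng.integers(0, x.size, size=x.size)
    # Stand-in smoother: a straight-line fit instead of LOWESS
    coeffs = np.polyfit(x[idx], y[idx], deg=1)
    curves[b] = np.polyval(coeffs, grid)

# Pointwise 95% band from the 2.5th and 97.5th percentiles
lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)
print(lower[:3], upper[:3])
```

Evaluating every resampled fit on the same grid is what makes the pointwise percentiles comparable across resamples.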

# Create a bootstrap confidence interval around the LOWESS fit

In [None]:
# Bootstrap confidence interval around the LOWESS fit


def lowess_with_confidence_bounds(
    x, y, eval_x, N=200, conf_interval=0.95, lowess_kw=None
):
    """
    Perform Lowess regression and determine a confidence interval by bootstrap resampling
    """
    # Treat a missing keyword dict as "no extra arguments" so ** unpacking works
    lowess_kw = {} if lowess_kw is None else lowess_kw

    # Lowess smoothing
    smoothed = sm.nonparametric.lowess(exog=x, endog=y, xvals=eval_x, **lowess_kw)

    # Perform bootstrap resamplings of the data
    # and evaluate the smoothing at a fixed set of points
    smoothed_values = np.empty((N, len(eval_x)))
    for i in range(N):
        sample = np.random.choice(len(x), len(x), replace=True)
        sampled_x = x[sample]
        sampled_y = y[sample]

        smoothed_values[i] = sm.nonparametric.lowess(
            exog=sampled_x, endog=sampled_y, xvals=eval_x, **lowess_kw
        )

    # Get the confidence interval from the sorted bootstrap fits
    # (assumes N is large enough that bound >= 1)
    sorted_values = np.sort(smoothed_values, axis=0)
    bound = int(N * (1 - conf_interval) / 2)
    bottom = sorted_values[bound - 1]
    top = sorted_values[-bound]

    return smoothed, bottom, top



# Compute the 95% confidence interval

In [None]:
eval_x = np.linspace(0, 4 * np.pi, 31)
smoothed, bottom, top = lowess_with_confidence_bounds(
    x, y, eval_x, lowess_kw={"frac": 0.1}
)

# Plot the confidence interval and fit

In [None]:
fig, ax = pylab.subplots()
ax.scatter(x, y)
ax.plot(eval_x, smoothed, c="k")
ax.fill_between(eval_x, bottom, top, alpha=0.4, color="y")
pylab.autoscale(enable=True, axis="x", tight=True)

# Conclusion
We have considered the LOWESS smoother. The idea of the method is to smooth a series of values using a simple linear or polynomial dependence of y on x, fitted not to the entire data series but to its individual parts. This makes it possible to use simple regressions on series whose behaviour changes along x, since only the most relevant (nearby) data are used when calculating the coefficients at each point.

# References


* Harrell, Frank E. Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis.
* https://www.statsmodels.org/stable/examples/notebooks/generated/lowess.html
* Cleveland, William S. Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association, Vol. 74, No. 368, pp. 829-836, 1979.