<div class="notebook-buttons" style="display:flex; padding-top: 5rem;padding-bottom: 2.5rem;line-height: 2.15;">
    <a href="https://colab.research.google.com/github/magdasalatka/fantasticfeatures/blob/main/noise-on-y_vs_noise-on-y.ipynb">
        <div id="colab-link" style="display: flex;padding-right: 3.5rem;padding-bottom: 0.625rem;border-bottom: 1px solid #ececed; align-items: center;">
            <img class="call-to-action-img" src="img/colab.svg" width="30" height="30" style="margin-right: 10px;margin-top: auto;margin-bottom: auto;">
            <div class="call-to-action-txt">Run in Google Colab</div>
        </div>
    </a>
    <a href="https://raw.githubusercontent.com/magdasalatka/fantasticfeatures/main/noise-on-y_vs_noise-on-y.ipynb" download>
        <div id="download-link" style="display: flex;padding-right: 3.5rem;padding-bottom: 0.625rem;border-bottom: 1px solid #ececed; height: auto;align-items: center;">
            <img class="call-to-action-img" src="img/download.svg" width="22" height="30" style="margin-right: 10px;margin-top: auto;margin-bottom: auto;">
            <div class="call-to-action-txt">Download Notebook</div>
        </div>
    </a>
    <a href="https://github.com/magdasalatka/fantasticfeatures/blob/main/noise-on-y_vs_noise-on-y.ipynb">
        <div id="github-link" style="display: flex;padding-right: 3.5rem;padding-bottom: 0.625rem;border-bottom: 1px solid #ececed; height: auto;align-items: center;">
            <img class="call-to-action-img" src="img/github.svg" width="25" height="30" style="margin-right: 10px;margin-top: auto;margin-bottom: auto;">
            <div class="call-to-action-txt">View on GitHub</div>
        </div>
    </a>
</div>

# How does noise on x and y influence the results we get from Linear Regression?

by [Teresa Kubacka](http://teresa-kubacka.com/), [Magdalena Surówka](https://datali.ch)

AMLD 2021, 26.10.2021

In [None]:
!pip install -U git+https://github.com/magdasalatka/fantasticfeatures.git

In [1]:
from fantasticfeatures.dataset_noise_generator import *
from fantasticfeatures.plotting import *



In [2]:
import numpy as np

# Step 1 

We start with ideal, noise-free X and y: X is drawn from some distribution and y is an exact linear function of it.


$ y_i = \beta_0 + \beta_1 x_i$

In [3]:
num_indep = 1
n_sample = 1000

# X_base, y_base, coeffs = make_regression_custom(n_samples=n_sample, n_features=num_indep, n_informative=num_indep,
#                    tail_strength=0, bias=0, n_targets=1, noise=0, 
#                        shuffle=False, coef=True, random_state=2, custom_coef=[1])


Here, let's draw X from a uniform distribution on [0, 10] and set y equal to x. In other words, we want to fit a line with slope = 1 and intercept = 0. That's our ground truth.

In [4]:
X_base = 10*np.random.uniform(0,1,n_sample).reshape((n_sample,1))
y_base = np.squeeze(X_base)

In [5]:
X_train, y_train, model, fitted = train_xy(X_base,y_base)

In [6]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 1.293e+34 |
| Date: | Sun, 24 Oct 2021 | Prob (F-statistic): | 0.0 |
| Time: | 19:41:33 | Log-Likelihood: | 26759.0 |
| No. Observations: | 800 | AIC: | -53510.0 |
| Df Residuals: | 798 | BIC: | -53510.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.406e-15 | 5.13e-17 | -27.406 | 0.000 | -1.51e-15 | -1.31e-15 |
| x1 | 1.0000 | 8.79e-18 | 1.14e+17 | 0.000 | 1.000 | 1.000 |

| | | | |
|---|---|---|---|
| Omnibus: | 5073.128 | Durbin-Watson: | 0.976 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 82.594 |
| Skew: | 0.108 | Prob(JB): | 1.16e-18 |
| Kurtosis: | 1.441 | Cond. No. | 12.0 |


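So the model is really sure: it recovers slope 1 and intercept 0 (up to floating-point error) with practically zero standard errors. But we also measured our x and y perfectly.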
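
For readers who want to reproduce this baseline without the `fantasticfeatures` helpers, here is a minimal NumPy-only sketch. The helper `fit_ols` is hypothetical and is not the notebook's `train_xy`. It fits on all 1000 points, whereas the summary above reports 800 observations, which suggests (an assumption) that `train_xy` holds part of the data out.

```python
import numpy as np

def fit_ols(x, y):
    """Hypothetical helper: OLS of y on [1, x]; returns coefficients and standard errors."""
    X = np.column_stack([np.ones_like(x), x])      # design matrix with an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # beta = [intercept, slope]
    resid = y - X @ beta
    dof = len(y) - X.shape[1]                      # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # estimate of the residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)          # nonrobust coefficient covariance
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)                     # arbitrary fixed seed
x = rng.uniform(0, 10, size=1000)
y = x.copy()                                       # ground truth: slope 1, intercept 0

beta, se = fit_ols(x, y)
print("intercept, slope:", beta)
print("standard errors: ", se)
```

With no noise anywhere, the residuals are at the level of floating-point rounding, so both standard errors come out as essentially zero.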

# Step 2

What happens if we have an imperfect y instead of the ground-truth y? For the moment, let's say we still have perfect values of x.


$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

In [7]:
y1 = y_base + gen_noise(y_base,1,'normal')

In [8]:
_, _, model1, _ = train_xy(X_base,y1)
model1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.892 |
| Model: | OLS | Adj. R-squared: | 0.892 |
| Method: | Least Squares | F-statistic: | 6594.0 |
| Date: | Sun, 24 Oct 2021 | Prob (F-statistic): | 0.0 |
| Time: | 19:41:33 | Log-Likelihood: | -1131.2 |
| No. Observations: | 800 | AIC: | 2266.0 |
| Df Residuals: | 798 | BIC: | 2276.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.1253 | 0.071 | 1.765 | 0.078 | -0.014 | 0.265 |
| x1 | 0.9878 | 0.012 | 81.206 | 0.000 | 0.964 | 1.012 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.42 | Durbin-Watson: | 1.976 |
| Prob(Omnibus): | 0.492 | Jarque-Bera (JB): | 1.405 |
| Skew: | -0.102 | Prob(JB): | 0.495 |
| Kurtosis: | 2.987 | Cond. No. | 12.0 |


Still good! The confidence intervals now have non-zero width, but they still contain our "real" values (intercept 0 and slope 1).
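
A single fit cannot show how often the interval contains the true value, so here is a rough sketch that repeats the noisy-y experiment many times. It assumes that `gen_noise(y_base, 1, 'normal')` adds zero-mean normal noise with standard deviation 1. The helper `slope_with_se` is hypothetical, and the interval uses the normal quantile 1.96 as an approximation.

```python
import numpy as np

def slope_with_se(x, y):
    """Hypothetical helper: simple-regression slope and its standard error (closed form)."""
    xc = x - x.mean()
    sxx = xc @ xc
    slope = xc @ (y - y.mean()) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - intercept - slope * x
    sigma2 = resid @ resid / (len(x) - 2)            # residual variance estimate
    return slope, np.sqrt(sigma2 / sxx)

rng = np.random.default_rng(1)                       # arbitrary fixed seed
n, n_runs, true_slope = 1000, 1000, 1.0
x = rng.uniform(0, 10, size=n)                       # x is still measured perfectly

covered = 0
for _ in range(n_runs):
    y = true_slope * x + rng.normal(0, 1, size=n)    # fresh noise on y in every run
    slope, se = slope_with_se(x, y)
    covered += abs(slope - true_slope) <= 1.96 * se  # is the true slope inside the interval?

coverage = covered / n_runs
print(f"intervals containing the true slope: {coverage:.1%}")
```

With noise only on y, the share of intervals that contain the true slope should come out close to the nominal 95%.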

# Step 3

Now what happens if we have imperfect x? 

The original, pure x (`X_base`, written $\xi_i$ below) created the effect visible in `y_base`. What we measure now is x with an error $\delta_i$, not the value that actually created y. Let's fit a linear regression to see what happens.

$ x_i = \xi_i + \delta_i$

$ y_i = \beta_0 + \beta_1 x_i + (\epsilon_i - \beta_1 \delta_i)$
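
To see where the second equation comes from, here is a short derivation. It assumes that $\delta_i$ has mean zero and is independent of $\xi_i$ and $\epsilon_i$. These assumptions, and the symbol $\sigma^2_\delta$ for the variance of $\delta_i$, are not stated in the original. Since y is generated by the true value $\xi_i$, substituting $\xi_i = x_i - \delta_i$ gives

$$
\begin{aligned}
y_i &= \beta_0 + \beta_1 \xi_i + \epsilon_i \\
    &= \beta_0 + \beta_1 (x_i - \delta_i) + \epsilon_i \\
    &= \beta_0 + \beta_1 x_i + (\epsilon_i - \beta_1 \delta_i)
\end{aligned}
$$

The new error term is correlated with the observed regressor:

$$
\mathrm{Cov}(x_i,\ \epsilon_i - \beta_1 \delta_i) = \mathrm{Cov}(\xi_i + \delta_i,\ \epsilon_i - \beta_1 \delta_i) = -\beta_1 \sigma^2_\delta
$$

This violates the OLS assumption that regressors and errors are uncorrelated, so we should expect a biased fit.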

In [9]:
x1 = X_base + gen_noise(X_base,1,'normal',seed=None).reshape(X_base.shape)

_, _, model2, _ = train_xy(x1,y1)
model2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.802 |
| Model: | OLS | Adj. R-squared: | 0.802 |
| Method: | Least Squares | F-statistic: | 3235.0 |
| Date: | Sun, 24 Oct 2021 | Prob (F-statistic): | 5.75e-283 |
| Time: | 19:41:33 | Log-Likelihood: | -1373.6 |
| No. Observations: | 800 | AIC: | 2751.0 |
| Df Residuals: | 798 | BIC: | 2761.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.7073 | 0.091 | 7.755 | 0.000 | 0.528 | 0.886 |
| x1 | 0.8766 | 0.015 | 56.876 | 0.000 | 0.846 | 0.907 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.834 | Durbin-Watson: | 1.897 |
| Prob(Omnibus): | 0.659 | Jarque-Bera (JB): | 0.923 |
| Skew: | -0.062 | Prob(JB): | 0.63 |
| Kurtosis: | 2.89 | Cond. No. | 11.6 |


Let's repeat it a few times, drawing new noise on x each time.

In [31]:
params_list = []
params_bse_list = []
for i in range(10):
    x1 = X_base + gen_noise(X_base,1,'normal',seed=None).reshape(X_base.shape)
    _, _, model2, _ = train_xy(x1,y1)
    params_list.append(model2.params)
    params_bse_list.append(model2.bse)

In [57]:
all_params = np.array(list(zip(params_list,params_bse_list)))

slopes_1 = all_params[:,0,1]


In [60]:
# [[intercept, slope], [intercept_std, slope_std]]
all_params

array([[[0.54620752, 0.90273043],
        [0.09381086, 0.01590378]],

       [[0.63200709, 0.88344542],
        [0.09211689, 0.01549366]],

       [[0.69503742, 0.87953356],
        [0.09082578, 0.01536312]],

       [[0.53945512, 0.90539472],
        [0.09167751, 0.01555104]],

       [[0.63882165, 0.88468451],
        [0.09970203, 0.01689142]],

       [[0.61359332, 0.88693336],
        [0.09171542, 0.01543442]],

       [[0.68032216, 0.8743432 ],
        [0.09327286, 0.0156678 ]],

       [[0.67031106, 0.87195559],
        [0.09453345, 0.01582109]],

       [[0.70293885, 0.88498326],
        [0.09512337, 0.01625597]],

       [[0.76521843, 0.86046224],
        [0.09522377, 0.0159976 ]]])
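
The indexing `all_params[:,0,1]` used for `slopes_1` can be hard to read, so here is a small illustration of the array layout, using the first two runs copied from the output above. The names `intercepts`, `slopes` and `slope_se` are chosen here for illustration only.

```python
import numpy as np

# First two runs, copied from the output above
all_params = np.array([
    [[0.54620752, 0.90273043], [0.09381086, 0.01590378]],
    [[0.63200709, 0.88344542], [0.09211689, 0.01549366]],
])

# Axis 0: run; axis 1: 0 = coefficients, 1 = standard errors;
# axis 2: 0 = intercept, 1 = slope
print(all_params.shape)

intercepts = all_params[:, 0, 0]   # fitted intercept of every run
slopes = all_params[:, 0, 1]       # fitted slope of every run
slope_se = all_params[:, 1, 1]     # standard error of the slope of every run
print(slopes, slope_se)
```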

So the fitted parameters are consistently off from the expected slope = 1 and intercept = 0: the slope is too low and the intercept is too high. What happened?

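We expect the estimator to be biased in this case. With $\sigma^2_{errx}$ denoting the variance of the measurement error $\delta_i$ and $\sigma^2_{realx}$ the variance of the true $\xi_i$, the expected slope is attenuated towards zero: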

$$ E \hat\beta_1 \approx \beta_1 \frac{1}{1 + \sigma^2_{errx} / \sigma^2_{realx}}  $$

(See Faraway book for the derivation!)
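
As a rough sketch of where this factor comes from (not a replacement for the full derivation), assume that $\delta_i$ has mean zero and is independent of $\xi_i$ and $\epsilon_i$. For large samples the OLS slope approaches the ratio of the covariance of x and y to the variance of x:

$$
\hat\beta_1 \;\longrightarrow\; \frac{\mathrm{Cov}(\xi + \delta,\ \beta_0 + \beta_1 \xi + \epsilon)}{\mathrm{Var}(\xi + \delta)}
= \frac{\beta_1 \sigma^2_{realx}}{\sigma^2_{realx} + \sigma^2_{errx}}
= \beta_1 \frac{1}{1 + \sigma^2_{errx} / \sigma^2_{realx}}
$$

The intercept shifts as well. Since $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$ and the mean of $\delta$ is zero,

$$
\hat\beta_0 \;\longrightarrow\; \beta_0 + (\beta_1 - E \hat\beta_1)\, E\,\xi
$$

In this notebook $E\,\xi = 5$, so a slope of about 0.89 implies an intercept of roughly 0.5 to 0.6 instead of 0. This is in line with the fitted intercepts above.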

Does it hold here? 

In [71]:
# single example (last from the loop)
1 * (1 / (1 + (np.var(x1-X_base)/np.var(X_base))))

0.8872607267757571

In [93]:
# draw new noise every time 
1 * (1 / (1 + (np.var(gen_noise(X_base,1,'normal',seed=None).reshape(X_base.shape))/np.var(X_base))))

0.8852903719792332

Both values, about 0.89, agree quite well with the slopes we got numerically!

In [97]:
# These were the results from our experiment 
slopes_1

array([0.90273043, 0.88344542, 0.87953356, 0.90539472, 0.88468451,
       0.88693336, 0.8743432 , 0.87195559, 0.88498326, 0.86046224])
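
The whole comparison can also be run without the `fantasticfeatures` helpers. The sketch below is a NumPy-only approximation of the experiment, not the notebook's own code. It assumes N(0, 1) noise on both x and y, uses `np.polyfit` in place of `train_xy`, and, unlike the loop above, redraws the noise on y in every run as well.

```python
import numpy as np

rng = np.random.default_rng(2)                   # arbitrary fixed seed
n, n_runs = 1000, 200
xi = rng.uniform(0, 10, size=n)                  # true x, never observed directly

slopes, intercepts = [], []
for _ in range(n_runs):
    y = xi + rng.normal(0, 1, size=n)            # y is generated by the true x
    x_obs = xi + rng.normal(0, 1, size=n)        # what we measure: x with error
    slope, intercept = np.polyfit(x_obs, y, deg=1)
    slopes.append(slope)
    intercepts.append(intercept)

sigma2_err = 1.0                                 # variance of the N(0, 1) error on x
predicted_slope = 1.0 / (1.0 + sigma2_err / np.var(xi))
predicted_intercept = (1.0 - predicted_slope) * xi.mean()

print(f"mean fitted slope:     {np.mean(slopes):.3f}  (predicted {predicted_slope:.3f})")
print(f"mean fitted intercept: {np.mean(intercepts):.3f}  (predicted {predicted_intercept:.3f})")
```

Averaged over many runs, the fitted slope should sit close to the attenuation formula, and the intercept should sit close to the shift implied by the reduced slope.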

Play with this example to get a feeling for how the estimates change. What happens with more variables? What happens with correlated noise?
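
One possible starting point for the first question is a sketch with two correlated features, where only the first one is measured with error. Everything here is an illustrative assumption: the 0.7 correlation factor, the true coefficients of 1, the N(0, 1) noise, and the variable names.

```python
import numpy as np

rng = np.random.default_rng(3)                   # arbitrary fixed seed
n = 5000
xi1 = rng.uniform(0, 10, size=n)
xi2 = 0.7 * xi1 + rng.uniform(0, 10, size=n)     # second feature, correlated with the first
y = xi1 + xi2 + rng.normal(0, 1, size=n)         # true intercept 0, true coefficients 1 and 1

x1_obs = xi1 + rng.normal(0, 1, size=n)          # only the first feature is measured with error
X = np.column_stack([np.ones(n), x1_obs, xi2])   # the second feature is observed exactly
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, coef x1, coef x2:", np.round(coef, 3))
```

In this setup the coefficient of the noisy feature shrinks below 1, while the coefficient of the cleanly measured but correlated feature rises above 1, because it absorbs part of the effect. Changing the correlation factor or adding noise to the second feature shows how the bias spreads across coefficients.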