 # Comparing a trained linear regression model with semi-uniform bias
 Omer trained a linear regression model and tested its performance on a test sample of 500 objects. On 400 of those, the model returned a prediction higher than expected by 0.5, and on the remaining 100, the model returned a prediction lower than expected by 0.7.

What is the MSE for his model?

Limor claims that the linear regression model wasn't trained correctly, and that we can improve it by shifting all the predictions by a constant value. What will her MSE be?

You can assume that Limor found the smallest error under her constraints.

**Return two values - Omer's and Limor's MSE.**

# Derivation of answer

The mean squared error (MSE) is a measure of the quality of a regression model. It is calculated as the average of the squared differences between the predicted and actual values.

Omer's model made predictions on 500 objects, with 400 of them having a prediction higher than expected by 0.5, and the remaining 100 having a prediction lower than expected by 0.7. Let's denote the actual values by $y_i$ and the predicted values by $\hat{y_i}$. Then, the MSE for Omer's model is:

$$\text{MSE}_\text{Omer} = \frac{1}{500}\sum_{i=1}^{500}(\hat{y_i} - y_i)^2$$

For the 400 objects where the prediction was higher than expected by 0.5, the difference between the predicted and actual values is $0.5$, and for the 100 objects where the prediction was lower than expected by 0.7, the difference is $-0.7$. Therefore, we have:

$$\text{MSE}_\text{Omer} = \frac{1}{500}\left[400\cdot (0.5)^2 + 100\cdot(-0.7)^2\right] = 0.298$$


In [3]:
(400 * (0.5)**2 + 100 * (-0.7)**2 ) / 500

0.298

Limor claims that the model can be improved by changing all the answers by a constant value. Let's denote the constant value by $c$. Then, the new predicted values are $\hat{y_i} + c$, and the MSE for the new model is:

$$\text{MSE}_\text{Limor} = \frac{1}{500}\sum_{i=1}^{500}(\hat{y_i} + c - y_i)^2$$

We want to find the value of $c$ that minimizes the MSE. Taking the derivative of the MSE with respect to $c$ and setting it to zero, we get:

$$\frac{d}{dc}\text{MSE}_\text{Limor} = \frac{2}{500}\sum_{i=1}^{500}(\hat{y_i} + c - y_i) = 0$$

$$\frac{d^2}{dc^2}\text{MSE}_\text{Limor} = \frac{2}{500}\sum_{i=1}^{500} 1 = 2 > 0$$


The second derivative is positive for every $c$, so the stationary point is a minimum. Solving the first-order condition for $c$, we get:

$$c = \frac{1}{500}\sum_{i=1}^{500}(y_i - \hat{y_i})$$

* for 400 data points: $\hat{y_i} - y_i = 0.5 $
* for 100 data points: $\hat{y_i} - y_i = -0.7 $

Note that $c$ averages $y_i - \hat{y_i}$, so the signs are reversed:

$$ c = \frac{1}{500} [ 400 \cdot(-0.5) + 100 \cdot(0.7) ] = -0.26 $$


In [13]:
# most of the bias is caused by overshooting: compare the total signed errors
400 * 0.5 > 100 * 0.7

# that's why c is negative, to compensate

True

In [6]:
c = (400 * (-0.5) + 100 * (0.7) ) / 500
print(f"the proposed bias is c={c}")

# c is negative because the 400 overshoots (+0.5) outweigh the 100 undershoots (-0.7)

the proposed bias is c=-0.26
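
As a sanity check on the closed-form value of $c$, the sketch below compares it with a brute-force search over candidate shifts. The names `errors`, `mse_after_shift`, `c_closed`, and `c_grid`, as well as the grid range and step, are illustrative assumptions and not part of the original exercise.

```python
# Signed errors (prediction minus target), taken from the problem statement.
errors = [0.5] * 400 + [-0.7] * 100


def mse_after_shift(c):
    """MSE when every prediction is shifted by the constant c."""
    return sum((e + c) ** 2 for e in errors) / len(errors)


# Closed-form optimum: minus the mean signed error.
c_closed = -sum(errors) / len(errors)

# Brute-force check over candidate shifts in [-1, 1] with step 0.01 (assumed grid).
grid = [k / 100 for k in range(-100, 101)]
c_grid = min(grid, key=mse_after_shift)

print(round(c_closed, 4), c_grid)
```

Both approaches should agree on a shift of about $-0.26$; the grid search is only as precise as its step size, so it is a check on the derivation and not a replacement for it.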


Therefore, the new predicted values are:

$$\hat{y_i} + c = \hat{y_i} + \frac{1}{500}\sum_{i=1}^{500}(y_i - \hat{y_i}) = \hat{y_i} -0.26$$


The new MSE is:

$$\text{MSE}_\text{Limor} = \frac{1}{500}\sum_{i=1}^{500}(\hat{y_i}- y_i  -0.26  )^2$$

Recall the initial facts:
* for 400 data points: $\hat{y_i} - y_i = 0.5 $
* for 100 data points: $\hat{y_i} - y_i = -0.7 $

$$\text{MSE}_\text{Limor} =
    \frac{1}{500}[ 400 \cdot (0.5 -0.26)^2 + 100\cdot (-0.7 -0.26)^2 ] = 0.2304$$

In [14]:
( 400 * (0.5 - 0.26)**2 + 100 * (-0.7 -0.26)**2 ) / 500

0.23039999999999997
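
A useful cross-check, sketched here under the same setup, is that shifting by the optimal constant removes exactly the squared mean error: $\text{MSE}_\text{Limor} = \text{MSE}_\text{Omer} - c^2 = 0.298 - 0.26^2 = 0.2304$. The variable names below (`errors`, `mean_error`, `mse_omer`, `mse_limor`) are assumptions chosen for illustration.

```python
import math

# Signed errors (prediction minus target), taken from the problem statement.
errors = [0.5] * 400 + [-0.7] * 100
n = len(errors)

mean_error = sum(errors) / n                 # bias of Omer's model
mse_omer = sum(e ** 2 for e in errors) / n   # MSE before any shift
# Errors after the optimal shift c = -mean_error are centered around zero.
mse_limor = sum((e - mean_error) ** 2 for e in errors) / n

# The optimal constant shift removes exactly the squared bias.
assert math.isclose(mse_limor, mse_omer - mean_error ** 2)

print(round(mse_omer, 4), round(mse_limor, 4))
```

The values are rounded before printing because floating-point accumulation leaves tiny residues, as in the `0.23039999999999997` output above.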

$$ \text{MSE}_\text{Omer} = 0.298$$

$$\text{MSE}_\text{Limor} = 0.2304$$

