# Least Squares Using Simulated Data

Using the simulated data from the previous step, let’s estimate the optimal values of the coefficients ɑ and β. We will calculate them from the predictor variable, `X`, and the output variable, `yact`, using the Least Squares method described in the “Understanding the maths” step.

The cell below creates the same dataframe as in the last step. Run the cell to get started!

In [None]:
# Import pandas, numpy, and matplotlib.pyplot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Generate same data as in previous step
np.random.seed(0)
X = 2.5 * np.random.randn(100) + 1.5
ypred = 2 + 0.3 * X
res = 0.5 * np.random.randn(100)
yact = 2 + 0.3 * X + res
df = pd.DataFrame(
    {'X': X,
     'ypred': ypred,
     'yact': yact}
)

# Show the first five rows of our dataframe
df.head()

As a reminder, here are the formulas for ɑ and β:

![](https://latex.codecogs.com/gif.latex?%5Cbegin%7Balign*%7D%20%26%20%5Cbeta%20%3D%20%5Cfrac%7B%5Csum%5Climits_%7Bi%3D1%7D%5E%7Bn%7D%28X_i%20-%20%5Cbar%20X%29%28Y_i%20-%20%5Cbar%20Y%29%7D%7B%5Csum%5Climits_%7Bi%3D1%7D%5E%7Bn%7D%28X_i%20-%20%5Cbar%20X%29%5E2%7D%20%5C%5C%20%26%20%5Calpha%20%3D%20%5Cbar%20Y%20-%20%5Cbeta*%20%5Cbar%20X%20%5Cend%7Balign*%7D)
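To make the formulas concrete, here is a minimal sketch that applies them to three made-up points, small enough to check by hand. The names `xs`, `ys`, `alpha_toy`, and `beta_toy` are illustrative only and are not part of the tutorial's data.

```python
# Three made-up points (hypothetical, not the simulated data)
xs = [1.0, 2.0, 3.0]
ys = [2.0, 3.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n   # mean of X = 2
y_bar = sum(ys) / n   # mean of Y = 10/3

# Numerator of beta: sum of products of deviations from the means
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
# Denominator of beta: sum of squared deviations of X
denominator = sum((x - x_bar) ** 2 for x in xs)

beta_toy = numerator / denominator      # 3 / 2
alpha_toy = y_bar - beta_toy * x_bar    # 10/3 - 3

print(f'alpha = {alpha_toy:.4f}, beta = {beta_toy:.4f}')
# alpha = 0.3333, beta = 1.5000
```

By hand: the deviations of X are -1, 0, 1 and those of Y are -4/3, -1/3, 5/3, so the numerator is 4/3 + 0 + 5/3 = 3 and the denominator is 1 + 0 + 1 = 2.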

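To calculate these coefficients, we will add a few more columns to our `df` dataframe. We first calculate `xmean` and `ymean`, then use them to build the per-row terms for the numerator of β (`xycov`, the products of the deviations of X and Y from their means) and for its denominator (`xvar`, the squared deviations of X from its mean). Summing these two columns and dividing gives `beta`, from which we can work out `alpha`.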

In [None]:
# Calculate the mean of X and Y
xmean = np.mean(X)
ymean = np.mean(yact)

# Calculate the terms needed for the numerator and denominator of beta
df['xycov'] = (df['X'] - xmean) * (df['yact'] - ymean)
df['xvar'] = (df['X'] - xmean)**2

# Calculate beta and alpha
beta = df['xycov'].sum() / df['xvar'].sum()
alpha = ymean - (beta * xmean)
print(f'alpha = {alpha}\nbeta = {beta}')

The snippet outputs the values of `alpha` and `beta`: ![](https://latex.codecogs.com/gif.latex?%24%5Calpha%20%3D%202.003%5Ctext%7B%2C%20%7D%20%5Cbeta%20%3D%200.323%24)

As we can see, the estimates are close to the values we assumed when generating the data (ɑ = 2 and β = 0.3).
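As a sanity check, the manual calculation can be compared with NumPy's built-in routines. The sketch below regenerates the same simulated data so that it runs on its own, then computes the coefficients three ways: with the formulas above, as sample covariance divided by sample variance, and with `np.polyfit`. The variable names `slope`, `intercept`, and `beta_cov` are illustrative choices.

```python
import numpy as np

# Recreate the simulated data used in this notebook
np.random.seed(0)
X = 2.5 * np.random.randn(100) + 1.5
res = 0.5 * np.random.randn(100)
yact = 2 + 0.3 * X + res

# 1. Manual least squares, as in the cell above
xmean, ymean = X.mean(), yact.mean()
beta = ((X - xmean) * (yact - ymean)).sum() / ((X - xmean) ** 2).sum()
alpha = ymean - beta * xmean

# 2. Same slope as sample covariance over sample variance
#    (np.cov uses ddof=1 by default, so np.var must match it)
beta_cov = np.cov(X, yact)[0, 1] / np.var(X, ddof=1)

# 3. np.polyfit with degree 1 returns [slope, intercept]
slope, intercept = np.polyfit(X, yact, 1)

print(f'manual:  alpha = {alpha:.3f}, beta = {beta:.3f}')
print(f'cov/var: beta  = {beta_cov:.3f}')
print(f'polyfit: alpha = {intercept:.3f}, beta = {slope:.3f}')
```

All three approaches should agree to within floating-point error, since each one solves the same least squares problem.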

Let’s see how the value of *R<sup>2</sup>* changes if we use the new values of ɑ and β. 

The equation for the new model can be written as:

![](https://latex.codecogs.com/gif.latex?y%20%3D%202.003%20+%200.323*x)

Let’s create a new column in `df`, called `ypred2`, to hold the values generated by this equation, and then calculate the new *R<sup>2</sup>*.

In [None]:
# Create new column to store new predictions
df['ypred2'] = alpha + beta * df['X']

# Calculate new SSR with new predictions of Y.
# Note that SST remains the same since yact and ymean do not change.
df['SSR2'] = (df['ypred2'] - ymean)**2
df['SST'] = (df['yact'] - ymean)**2
SSR2 = df['SSR2'].sum()
SST = df['SST'].sum()

# Calculate new R2
R22 = SSR2 / SST
print(f'New R2 = {R22}.')

The new value of *R<sup>2</sup>* comes out to be 0.715, an improvement on the previous value of 0.57.
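For a least squares line with an intercept, *R<sup>2</sup>* can be computed in several equivalent ways. The sketch below, which regenerates the data so that it is self-contained, compares the SSR / SST ratio used above with 1 - SSE / SST and with the squared correlation between `X` and `yact`. The names `SSE`, `r2_from_sse`, and `r2_from_corr` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

# Recreate the simulated data used in this notebook
np.random.seed(0)
X = 2.5 * np.random.randn(100) + 1.5
res = 0.5 * np.random.randn(100)
yact = 2 + 0.3 * X + res

# Least squares coefficients and the resulting predictions
xmean, ymean = X.mean(), yact.mean()
beta = ((X - xmean) * (yact - ymean)).sum() / ((X - xmean) ** 2).sum()
alpha = ymean - beta * xmean
ypred2 = alpha + beta * X

SST = ((yact - ymean) ** 2).sum()    # total sum of squares
SSR = ((ypred2 - ymean) ** 2).sum()  # sum of squares explained by the model
SSE = ((yact - ypred2) ** 2).sum()   # sum of squared residuals

r2_from_ssr = SSR / SST
r2_from_sse = 1 - SSE / SST
r2_from_corr = np.corrcoef(X, yact)[0, 1] ** 2

print(f'SSR / SST     = {r2_from_ssr:.3f}')
print(f'1 - SSE / SST = {r2_from_sse:.3f}')
print(f'corr(X, y)^2  = {r2_from_corr:.3f}')
```

The three values coincide because SST = SSR + SSE holds for a least squares fit with an intercept. That decomposition is not guaranteed for an arbitrary line such as the guessed model, so the two ratios can differ there.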

Let’s also plot our new prediction model against the actual values and our earlier assumed model, just to get a better visual understanding. 

In [None]:
# Plot first prediction as a blue line (the default color), second prediction
# as a magenta line, and actual values of Y as red markers
plt.figure(figsize=(12, 6))
plt.plot(X, ypred)
plt.plot(X, df['ypred2'], 'm')
plt.plot(X, yact, 'ro')

plt.title('''Actual vs Predicted with guessed parameters
vs Predicted with calculated parameters''')

plt.show()

As we can see, the lines for `ypred2` and `ypred` more or less overlap, since their respective values of ɑ and β are very similar.
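To put a number on how closely the two lines overlap, one option is to measure the gap between the two sets of predictions directly. The sketch below regenerates the data and reports the largest and the average absolute difference; `gap`, `max_gap`, and `mean_gap` are illustrative names.

```python
import numpy as np

# Recreate the simulated data used in this notebook
np.random.seed(0)
X = 2.5 * np.random.randn(100) + 1.5
res = 0.5 * np.random.randn(100)
yact = 2 + 0.3 * X + res

# Guessed model and least squares model
ypred = 2 + 0.3 * X
xmean, ymean = X.mean(), yact.mean()
beta = ((X - xmean) * (yact - ymean)).sum() / ((X - xmean) ** 2).sum()
alpha = ymean - beta * xmean
ypred2 = alpha + beta * X

# Absolute difference between the two predictions at each X
gap = np.abs(ypred2 - ypred)
max_gap = gap.max()
mean_gap = gap.mean()

print(f'largest gap between the two lines = {max_gap:.3f}')
print(f'average gap between the two lines = {mean_gap:.3f}')
```

Because both models are straight lines, the gap grows linearly as `X` moves away from the point where the lines cross, so the largest gap occurs at one of the extreme values of `X`.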

Next, we will explore other methods of determining model efficacy. Go back to the notebook directory in Jupyter by pressing `File` > `Open…` in the toolbar at the top, then open the notebook called `1.3 Result Parameters.ipynb`.