# **Linear Regression for Random Data**

This Jupyter Notebook generates a random linear dataset by adding noise to the equation y = 5 + 2X<sub>1</sub>. It uses the [numpy.random.rand()](https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html) function from the [NumPy](https://numpy.org/) library to generate the noise. This function draws samples from a uniform distribution between 0 and 1 (not a Gaussian distribution), so the noise has a mean of 0.5 and the fitted intercept is expected to be close to 5.5 rather than 5. The code then uses the [LinearRegression](https://scikit-learn.org/stable/modules/linear_model.html) class from the
 [Scikit-Learn](https://scikit-learn.org/stable/) library
([Pedregosa et al., 2011](https://doi.org/10.48550/arXiv.1201.0490)) to
build a regression model for this toy dataset. The [LinearRegression](https://scikit-learn.org/stable/modules/linear_model.html) class relies on the [scipy.linalg.lstsq](https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lstsq.html) function from the [SciPy](https://scipy.org/) linear algebra module ([scipy.linalg](https://docs.scipy.org/doc/scipy/reference/linalg.html)), which solves the least-squares problem through the Moore-Penrose pseudoinverse ([Moore, 1920](https://scholar.google.com/scholar_lookup?title=On+reciprocal+of+the+general+algebraic+matrix&author=E.+H.+Moore&publication_year=1920&journal=Bull.+Amer.+Math.+Soc.&pages=394), [Penrose, 1955](https://doi.org/10.1017/s0305004100030401)) of the data matrix <strong>X</strong>. The pseudoinverse is computed with a standard matrix factorization technique, Singular Value Decomposition (SVD) ([Pedregosa et al., 2011](https://doi.org/10.48550/arXiv.1201.0490)). The following code is similar to the Jupyter Notebooks discussed by
[Géron, 2023](https://www.isbns.net/isbn/9781098125974/).
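
The pseudoinverse route described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the internal implementation of LinearRegression: the seed and the variable names are hypothetical, and `np.linalg.pinv` and `np.linalg.lstsq` stand in for the SciPy routine. It builds the pseudoinverse explicitly from the SVD of the data matrix and checks that all three routes give the same parameter vector.

```python
import numpy as np

# Hypothetical seed and sample, chosen only for this illustration
rng = np.random.default_rng(42)
X = 4 * rng.random((100, 1))               # feature values, uniform on [0, 4)
y = 5 + 2 * X + rng.random((100, 1))       # y = 5 + 2*X plus uniform noise
X_b = np.c_[np.ones((100, 1)), X]          # add x0 = 1 to each instance

# Route 1: pseudoinverse built explicitly from the SVD, X_b = U diag(s) Vt
U, s, Vt = np.linalg.svd(X_b, full_matrices=False)
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T     # valid here because X_b has full column rank
theta_svd = X_pinv @ y

# Route 2: NumPy's built-in pseudoinverse
theta_pinv = np.linalg.pinv(X_b) @ y

# Route 3: NumPy's least-squares solver
theta_lstsq, residuals, rank, sing_vals = np.linalg.lstsq(X_b, y, rcond=None)

print("theta (explicit SVD):", theta_svd.ravel())
print("theta (pinv):        ", theta_pinv.ravel())
print("theta (lstsq):       ", theta_lstsq.ravel())
```

Inverting the singular values directly is safe only when none of them is zero or very small. Library routines such as `np.linalg.pinv` apply a cutoff to small singular values, which is why they are preferable in practice.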
<br> </br>
<img src="https://drive.usercontent.google.com/download?id=1t7CGbCH4V1NkzKbfNx114ks7us_X9gVr&export=view&authuser=0" width=520 alt="Dice">
<br>This image shows a pair of dice as an analogy of random data. Source: Pixabay: https://pixabay.com/pt/photos/dados-toque-aleat%C3%B3ria-por-sorte-2777809/</br>
<br> </br>
**References**
<br> </br>
Géron A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 3rd ed.; O’Reilly Media: Sebastopol, CA, 2023.
[ISBN: 978-1-098-12597-4](https://www.isbns.net/isbn/9781098125974/)
<br> </br>
Moore EH. On the reciprocal of the general algebraic matrix. Bull Amer Math Soc., 1920; 26:394–395.
[Google Scholar](https://scholar.google.com/scholar_lookup?title=On+reciprocal+of+the+general+algebraic+matrix&author=E.+H.+Moore&publication_year=1920&journal=Bull.+Amer.+Math.+Soc.&pages=394)
<br> </br>
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: Machine Learning in Python. J Mach Learn Res., 2011; 12:2825–2830.
[DOI: 10.48550/arXiv.1201.0490](https://doi.org/10.48550/arXiv.1201.0490)
<br> </br>
Penrose R. A generalized inverse for matrices. Math Proc Camb Philos Soc., 1955; 51(3):406–413.
[DOI: 10.1017/s0305004100030401](https://doi.org/10.1017/s0305004100030401)
<br> </br>
The code follows.

In [None]:
#!/usr/bin/env python3
#
################################################################################
# Dr. Walter F. de Azevedo, Jr.                                                #
# [Scopus](https://www.scopus.com/authid/detail.uri?authorId=7006435557)       #
# [GitHub](https://github.com/azevedolab)                                      #
# July 20, 2024                                                                #
################################################################################
#
################################################################################
# Import section                                                               #
################################################################################
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

################################################################################
# Randomly generated linear dataset                                            #
################################################################################
msg_out = "\nGenerating and plotting random linear data..."
print(msg_out,end="...")
np.random.seed(1123581321)          # Set up a random seed
X = 4*np.random.rand(100,1)         # 100 feature values, uniform on [0, 4)
y = 5 + np.random.rand(100,1) + 2*X # y = 5 + 2*X plus uniform noise on [0, 1)
X_b = np.c_[np.ones((100,1)), X]    # Add x0 = 1 to each instance

# Plotting randomly generated linear dataset
plt.title("Randomly Generated Linear Dataset (y = 5 + 2*X$_1$)")
plt.plot(X,y, "b.")
plt.xlabel("X$_{1}$")
plt.ylabel("y")
plt.axis([0,4,0,15])
plt.grid()
plt.savefig("linear_regression_random_data.pdf",dpi=1500)
plt.close()
print("done!")

################################################################################
# Generate a regression model                                                  #
################################################################################
msg_out = "\nGenerating regression model and plotting it..."
print(msg_out,end="...")
lin_reg = LinearRegression()
lin_reg.fit(X,y)

# Build the theta vector [intercept, slope] from the fitted model
theta_vector = np.array([[lin_reg.intercept_[0]],[lin_reg.coef_[0][0]]])

# Plotting regression model and data
X_in = np.array([[0],[4]])
X_in_b = np.c_[np.ones((2,1)), X_in] # Add x0 = 1 to each instance
y_predict = X_in_b.dot(theta_vector)
plt.plot(X_in,y_predict, "r-")
plt.legend(["Predicted"])
plt.title("Regression Model for a Random Dataset (y = 5 + 2*X$_1$)")
plt.plot(X,y, "b.")
plt.xlabel("X$_{1}$")
plt.ylabel("y")
plt.axis([0,4,0,15])
plt.grid()
plt.savefig("linear_regression_5_plus_2x.pdf",dpi=1500)
plt.close()
print("done!")

# Show theta vector and calculate MSE
print("\nTheta vector: ",theta_vector)
y_predict = X_b.dot(theta_vector)
print("MSE: {:.4f}".format(mean_squared_error(y, y_predict)))
################################################################################


Generating and plotting random linear data......done!

Generating regression model and plotting it......done!

Theta vector:  [[5.50622115]
 [1.97442538]]
MSE: 0.0864
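
The fitted intercept near 5.5 and the MSE near 0.086 in the output above are consistent with the noise coming from `numpy.random.rand()`, which samples a uniform distribution between 0 and 1 with mean 1/2 and variance 1/12 (about 0.0833). The sketch below checks this on a much larger sample. The seed, the sample size, and the use of `np.polyfit` in place of LinearRegression are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)      # hypothetical seed, not the notebook's
n = 100_000                         # larger sample than the notebook's 100 points

X = 4 * rng.random(n)               # feature values, uniform on [0, 4)
noise = rng.random(n)               # uniform noise on [0, 1)
y = 5 + 2 * X + noise

# Degree-1 least-squares fit; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(X, y, deg=1)
mse = np.mean((y - (intercept + slope * X)) ** 2)

print(f"noise mean:     {noise.mean():.4f}  (theory: 0.5)")
print(f"noise variance: {noise.var():.4f}  (theory: {1/12:.4f})")
print(f"intercept:      {intercept:.4f}  (theory: 5.5)")
print(f"slope:          {slope:.4f}  (theory: 2.0)")
print(f"MSE:            {mse:.4f}  (theory: {1/12:.4f})")
```

The mean of the noise is absorbed into the intercept, and the remaining spread around the fitted line sets the MSE. Replacing `rng.random(n)` with zero-mean noise, for example `rng.standard_normal(n)`, would move the expected intercept back to 5.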
