In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def linear_regression(x_train, y_train):
    # Add a column of ones to x_train for the intercept term
    ones = np.ones((x_train.shape[0], 1))
    x_train = np.concatenate((ones, x_train), axis=1)

    # Calculate the coefficients using the normal equation
    x_transpose = np.transpose(x_train)
    x_transpose_dot_x = np.dot(x_transpose, x_train)
    x_transpose_dot_y = np.dot(x_transpose, y_train)
    coefficients = np.dot(np.linalg.inv(x_transpose_dot_x), x_transpose_dot_y)

    # Predict the values
    y_pred = np.dot(x_train, coefficients)

    # Calculate the mean squared error
    mse = np.mean((y_pred - y_train) ** 2)

    # Plot the data points and the fitted line against the first feature column
    plt.scatter(x_train[:, 1], y_train, color='b', label='Actual')
    plt.plot(x_train[:, 1], y_pred, color='r', label='Predicted')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.legend()
    plt.show()

    return coefficients, mse

# Example usage
# Assume we have a pandas DataFrame with two columns: 'X' (the feature) and 'Y' (the target)
# df = pd.read_csv('data.csv')
# x_train = df[['X']].values
# y_train = df['Y'].values
# coefficients, mse = linear_regression(x_train, y_train)


The linear_regression function takes two arguments: x_train, a 2-D array of features with one row per sample, and y_train, the corresponding output values.

It builds a column of ones with NumPy's ones function and prepends it to x_train with concatenate. This column lets the first coefficient act as the intercept term in the linear regression equation.
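
As a minimal sketch of this step, the snippet below builds the augmented matrix for a small made-up array. The toy values and the names x and design are assumptions for illustration only, not part of the original function.

```python
import numpy as np

# Hypothetical toy data: three samples with one feature each.
x = np.array([[1.0], [2.0], [3.0]])

# Prepend a column of ones so the first coefficient acts as the intercept.
ones = np.ones((x.shape[0], 1))
design = np.concatenate((ones, x), axis=1)

print(design.shape)
print(design)
```

Each row of design now starts with 1, so the dot product of a row with the coefficients adds the intercept automatically.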

The coefficients are calculated using the normal equation, coefficients = (X^T X)^(-1) X^T y, where X is x_train with the column of ones added. The code takes the transpose of X, multiplies it with X and with y_train, inverts X^T X, and multiplies that inverse by X^T y. This is done using NumPy's transpose, dot, and linalg.inv functions.
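
The explicit inverse works for small, well-conditioned problems, but it can be numerically fragile when X^T X is close to singular. The sketch below, on made-up noiseless data from y = 2 + 3x, compares the inverse-based formula used in the function above with two alternatives NumPy provides: np.linalg.solve and np.linalg.lstsq. The data and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical noiseless data generated from y = 2 + 3x.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 + 3.0 * x[:, 0]

# Design matrix with a leading column of ones for the intercept.
X = np.concatenate((np.ones((x.shape[0], 1)), x), axis=1)

# Normal equation with an explicit inverse, as in linear_regression.
beta_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Same linear system solved without forming the inverse.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver; it also handles a rank-deficient X.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_inv, beta_solve, beta_lstsq)
```

All three approaches should agree on this data up to floating-point error. Swapping linalg.inv for linalg.solve or linalg.lstsq is a common choice when the features may be strongly correlated.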

Next, the predicted values y_pred are calculated by multiplying the augmented x_train (including the column of ones) by the calculated coefficients.

The mean squared error (MSE) is calculated by taking the mean of the squared differences between y_pred and y_train. Because it is computed on the same data used for fitting, it measures training error only.
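
To make the arithmetic concrete, here is a small worked example with made-up values. The arrays y_true and y_hat are hypothetical and stand in for y_train and y_pred.

```python
import numpy as np

# Hypothetical targets and predictions.
y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])

# Errors are 0.5, 0.0 and -1.0; their squares are 0.25, 0.0 and 1.0.
squared_errors = (y_hat - y_true) ** 2

# The MSE is the average of the squared errors: 1.25 / 3.
mse = np.mean(squared_errors)
print(mse)
```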

Matplotlib then draws the actual data points as a blue scatter plot and the predicted regression line in red, using the first feature column on the x-axis.

The function returns the calculated coefficients (intercept first, followed by one coefficient per feature) and the MSE.
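
Because linear_regression also opens a plot window, it can be awkward to call from scripts or tests. The sketch below is one possible plot-free variant, run on synthetic data. The name fit_linear_regression, the use of np.linalg.solve, and the generated data are all assumptions for illustration rather than part of the original code.

```python
import numpy as np

def fit_linear_regression(x_train, y_train):
    """Hypothetical plot-free variant: returns (coefficients, mse)."""
    # Prepend a column of ones for the intercept term.
    X = np.concatenate((np.ones((x_train.shape[0], 1)), x_train), axis=1)
    # Solve the normal equation without forming an explicit inverse.
    coefficients = np.linalg.solve(X.T @ X, X.T @ y_train)
    # Mean squared error on the training data.
    mse = np.mean((X @ coefficients - y_train) ** 2)
    return coefficients, mse

# Synthetic data: y = 1.5 + 0.8 * x plus a little Gaussian noise.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 10.0, size=(200, 1))
y_train = 1.5 + 0.8 * x_train[:, 0] + rng.normal(0.0, 0.1, size=200)

coefficients, mse = fit_linear_regression(x_train, y_train)
print(coefficients, mse)
```

The recovered coefficients should land close to the intercept 1.5 and slope 0.8 used to generate the data, with an MSE near the noise variance.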
