# Regression Principles 

## Regression vs Classification

Supervised learning problems can be broadly divided into regression and classification. Regression models the relationship between a continuous target variable (often denoted as 𝑌) and a set of independent variables (often denoted as 𝐗), while classification applies when the target variable is categorical, with the binary (two-class) case being the most common example.

Summary:
The aim of regression is to find the mathematical relationship between a continuous dependent variable and one or more independent variables. When the target variable is categorical (for example, binary), we use classification models such as classification trees and logistic regression.

In [None]:
# Dataset example creation
set.seed(123)
df <- data.frame(dependent_var = rnorm(100),
                 independent_var1 = rnorm(100),
                 independent_var2 = rnorm(100))

# This is a regression model because the dependent variable is continuous
# We use the `lm()` function in R to fit the regression
model <- lm(dependent_var ~ independent_var1 + independent_var2, data = df)

# Check the summary of the model
summary(model)

The above code first creates a dataset with a continuous dependent variable and two independent variables. We then use R's built-in `lm()` function to fit a linear regression model to this dataset. The `summary()` function provides detailed information on the fitted model.
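
For contrast with the regression example above, the following is a minimal sketch of the classification case using logistic regression. The data set `clf_df`, its columns, and the rule that generates the binary target are hypothetical and exist only for illustration. The 0.5 cutoff is a common default, not a requirement.

```r
# Hypothetical binary target: 1 when a noisy linear score is positive
set.seed(123)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
binary_target <- as.integer(1.5 * x1 - x2 + rnorm(n) > 0)
clf_df <- data.frame(binary_target, x1, x2)

# family = binomial makes glm() fit a logistic regression
clf_model <- glm(binary_target ~ x1 + x2, data = clf_df, family = binomial)

# Predicted probabilities of class 1, then a 0.5 cutoff for class labels
predicted_prob <- predict(clf_model, type = "response")
predicted_class <- as.integer(predicted_prob > 0.5)

# Share of training observations classified correctly
mean(predicted_class == clf_df$binary_target)
```

Unlike `lm()`, the fitted values here are probabilities between 0 and 1 rather than predictions of a continuous quantity.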

## Linear Regression

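Linear regression is a widely used regression technique. When we represent 𝑓(𝐗) as a linear function of the independent variables, we obtain the multiple linear regression model, in which 𝑌 is an intercept plus a weighted sum of the independent variables plus an error term.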
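
Written out, the linear form of 𝑓(𝐗) is commonly expressed as follows. The notation is an assumption introduced here rather than taken from the text: $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients of the $p$ independent variables, and $\varepsilon$ is the error term.

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon
$$

With a single independent variable ($p = 1$) this reduces to simple linear regression.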

Summary:
Linear regression estimates the relationship between the dependent variable (Y) and one or more independent variables (X) by fitting a linear equation to the observed data.

In [None]:
# Simple Linear Regression
# Here we consider only one independent variable
# We use the `lm()` function in R to fit the regression
simple_model <- lm(dependent_var ~ independent_var1, data = df)

# Check the summary of the model
summary(simple_model)

The code above fits a simple linear regression model that uses only `independent_var1` to explain `dependent_var`, ignoring the second independent variable in the dataset. The `lm()` function fits the model, and the `summary()` function provides a detailed summary of it.
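
In the one-predictor case the fitted line can also be computed directly from sample moments. The sketch below recreates the data and model so that it runs on its own, then checks the hand-computed values against `lm()`. The names `slope` and `intercept` are introduced here for illustration.

```r
# Recreate the example data and the simple model
set.seed(123)
df <- data.frame(dependent_var = rnorm(100),
                 independent_var1 = rnorm(100),
                 independent_var2 = rnorm(100))
simple_model <- lm(dependent_var ~ independent_var1, data = df)

# Slope: sample covariance of x and y divided by the sample variance of x
slope <- cov(df$independent_var1, df$dependent_var) / var(df$independent_var1)

# Intercept: the fitted line passes through the point of means
intercept <- mean(df$dependent_var) - slope * mean(df$independent_var1)

# Compare with the coefficients estimated by lm()
all.equal(c(intercept, slope), unname(coef(simple_model)))
```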

## Model Fitting

Model fitting involves choosing the model parameters to minimize the sum of squared errors over all observations. The procedure of estimating these parameters is called Ordinary Least Squares (OLS) regression.

Summary:
Model fitting means determining the optimal set of parameters for a model. In OLS this is done by minimising the Residual Sum of Squares (RSS), which measures the variation in the dependent variable left unexplained by the model.
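
The minimisation described above can be stated as a formula. The notation is assumed here for illustration: $y_i$ is the observed target for observation $i$, $x_{ij}$ is the value of the $j$-th independent variable for that observation, $n$ is the number of observations, and $p$ is the number of independent variables.

$$
\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2,
\qquad
\hat{\beta} = \arg\min_{\beta} \mathrm{RSS}(\beta)
$$

The OLS estimates $\hat{\beta}$ are the parameter values at which this sum is smallest.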

In [None]:
# residuals() returns the model residuals: observed Y minus fitted (predicted) Y
model_residuals <- residuals(model)

The above code calculates the residuals of the model that we fitted earlier. Residuals are the differences between the observed and predicted values of the target variable, and they are used to understand the discrepancy between the model predictions and the actual data.
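
To connect the residuals to the least squares criterion, the following sketch solves the normal equations by hand and computes the RSS. It recreates the data and model so that it is self-contained. Note that `lm()` itself relies on a QR decomposition rather than the normal equations, so the hand calculation is for illustration only. The names `X`, `y`, `beta_hat`, and `rss` are introduced here.

```r
# Recreate the example data and model
set.seed(123)
df <- data.frame(dependent_var = rnorm(100),
                 independent_var1 = rnorm(100),
                 independent_var2 = rnorm(100))
model <- lm(dependent_var ~ independent_var1 + independent_var2, data = df)

# Design matrix (intercept column plus the two predictors) and response
X <- model.matrix(model)
y <- df$dependent_var

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# The hand-computed estimates should agree with lm()
all.equal(as.vector(beta_hat), unname(coef(model)))

# Residual sum of squares: the quantity that OLS minimises
rss <- sum(residuals(model)^2)
all.equal(rss, deviance(model))
```

For a model fitted with `lm()`, `deviance()` returns the residual sum of squares, which is why the last comparison holds.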

## Principles of Least Squares and Prediction

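To make predictions, we substitute the Ordinary Least Squares (OLS) coefficient estimates into the regression equation and evaluate it at the values of the independent variables we are interested in.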

Summary:
Once we have the model fitted, we can make predictions by plugging the observed values of X into the estimated model. The estimated model minimizes the sum of squared residuals.

In [None]:
# Predicting values with the model
newdata <- data.frame(independent_var1 = rnorm(10),
                      independent_var2 = rnorm(10))
predicted_y <- predict(model, newdata)

The above code first creates a new dataset containing the same independent variables as the original one (the dependent variable is not needed for prediction). We then use the `predict()` function to predict the dependent variable for the new data using the fitted model.
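
The sketch below shows what `predict()` does for a linear model by repeating the calculation by hand. It recreates the data and model so that it runs on its own. The new observations are hypothetical values chosen only for illustration.

```r
# Recreate the example data and model
set.seed(123)
df <- data.frame(dependent_var = rnorm(100),
                 independent_var1 = rnorm(100),
                 independent_var2 = rnorm(100))
model <- lm(dependent_var ~ independent_var1 + independent_var2, data = df)

# Hypothetical new observations (values chosen only for illustration)
newdata <- data.frame(independent_var1 = c(-1, 0, 1),
                      independent_var2 = c(0.5, 0, -0.5))

# predict() evaluates the fitted equation at the new predictor values
predicted_y <- predict(model, newdata = newdata)

# Same calculation by hand: intercept + b1 * x1 + b2 * x2
b <- coef(model)
manual_y <- b["(Intercept)"] +
  b["independent_var1"] * newdata$independent_var1 +
  b["independent_var2"] * newdata$independent_var2

all.equal(unname(predicted_y), unname(manual_y))
```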

## Use of p-values in Hypothesis Testing

P-values are used in hypothesis testing to decide whether to reject the null hypothesis. The null hypothesis usually states that there is no relationship between the independent and dependent variables.

Summary:
A p-value is used to assess the statistical significance of each estimated coefficient of the model. If the p-value is less than the chosen level of significance (commonly 0.05), we reject the null hypothesis, indicating that there is evidence of a relationship between the variables.

In [None]:
# Extract the p-value column from the coefficient table in the model summary
pvalues <- summary(model)$coefficients[, "Pr(>|t|)"]

The above code snippet extracts the p-values of the model coefficients from the model summary. These p-values are used to test the null hypothesis that the respective coefficient is zero, i.e., the corresponding variable has no effect on the dependent variable.
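
As a sketch of where these p-values come from, the code below recomputes them from the t-statistics using a two-sided test with the residual degrees of freedom. It recreates the data and model so that it is self-contained. The names `coef_table`, `t_stats`, and `manual_p` are introduced here for illustration.

```r
# Recreate the example data and model
set.seed(123)
df <- data.frame(dependent_var = rnorm(100),
                 independent_var1 = rnorm(100),
                 independent_var2 = rnorm(100))
model <- lm(dependent_var ~ independent_var1 + independent_var2, data = df)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- summary(model)$coefficients

# t-statistic for H0: coefficient = 0 is estimate / standard error
t_stats <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-value from the t distribution with n - p - 1 degrees of freedom
manual_p <- 2 * pt(abs(t_stats), df = df.residual(model), lower.tail = FALSE)

all.equal(manual_p, coef_table[, "Pr(>|t|)"])
```

Because the example data are independent random draws, there is no true relationship for the test to detect, so small p-values here would arise only by chance.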

## Interpretation of Coefficients

Each regression coefficient represents the average change in the dependent variable for a one-unit change in the corresponding independent variable, holding the other independent variables constant.

Summary:
The coefficients of the model capture the quantitative effect of the independent variables on the dependent variable.

In [None]:
# Extract regression coefficients
coefficients <- coef(model)

The above code snippet extracts the intercept and the coefficients of the independent variables from the fitted model. Each coefficient describes the relationship between one independent variable and the dependent variable, holding the other variables constant.
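
The "holding other variables constant" interpretation can be checked numerically. In this sketch, two hypothetical observations differ by exactly one unit in `independent_var1` and are identical otherwise. The difference between their predictions should equal the coefficient of `independent_var1`. The data and model are recreated so that the block runs on its own.

```r
# Recreate the example data and model
set.seed(123)
df <- data.frame(dependent_var = rnorm(100),
                 independent_var1 = rnorm(100),
                 independent_var2 = rnorm(100))
model <- lm(dependent_var ~ independent_var1 + independent_var2, data = df)
b <- coef(model)

# Hypothetical observation, and a copy with independent_var1 raised by one unit
base_point <- data.frame(independent_var1 = 0.2, independent_var2 = -0.4)
shifted_point <- transform(base_point, independent_var1 = independent_var1 + 1)

# Change in the prediction caused by the one-unit increase
change <- predict(model, shifted_point) - predict(model, base_point)

all.equal(unname(change), unname(b["independent_var1"]))
```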

## Assessing Model Performance

R-squared (R^2) is a statistic that provides a measure of how well the regression line approximates the actual data points.

Summary:
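R-squared measures the proportion of the variability in the dependent variable that is explained by the model. It ranges from 0 to 1 for a linear model with an intercept, and higher values indicate a closer fit.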

In [None]:
# Extract R-squared from the summary
r_squared <- summary(model)$r.squared

The code above extracts the R-squared statistic from the summary of the fitted model. The R-squared statistic indicates how well the model fits the data.
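
R-squared can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this and compares the result with the value reported by `summary()`. It recreates the data and model so that it is self-contained. The names `rss`, `tss`, and `manual_r_squared` are introduced here for illustration.

```r
# Recreate the example data and model
set.seed(123)
df <- data.frame(dependent_var = rnorm(100),
                 independent_var1 = rnorm(100),
                 independent_var2 = rnorm(100))
model <- lm(dependent_var ~ independent_var1 + independent_var2, data = df)

y <- df$dependent_var

# Variation left unexplained by the model
rss <- sum(residuals(model)^2)

# Total variation of the dependent variable around its mean
tss <- sum((y - mean(y))^2)

# Proportion of variation explained by the model
manual_r_squared <- 1 - rss / tss

all.equal(manual_r_squared, summary(model)$r.squared)
```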