# R-squared, Clearly Explained!

In this notebook, we will explore R-squared, a metric used in regression analysis to measure the goodness of fit. R-squared is intuitive and provides a quantitative measure of how well a model explains the variability of the dependent variable.

## What is Correlation?

Before diving into R-squared, let's briefly discuss correlation. Correlation measures the strength and direction of the linear relationship between two variables:

- **Positive Correlation**: As one variable increases, the other also increases.
- **Negative Correlation**: As one variable increases, the other decreases.
- **No Correlation**: No discernible linear relationship between the variables.

The (Pearson) correlation coefficient ranges from -1 to 1:
- **1**: Perfect positive linear correlation.
- **-1**: Perfect negative linear correlation.
- **0**: No linear correlation (a non-linear relationship may still exist).
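
The ranges above can be checked numerically. The following is a minimal sketch, assuming the Pearson coefficient is the one meant here and using `np.corrcoef` to compute it. The variable names and the seed are illustrative and not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = np.linspace(0, 10, 50)

# Exact linear relationships give coefficients of +1 and -1
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]
r_neg = np.corrcoef(x, -3 * x + 5)[0, 1]

# Independent noise: the coefficient should be small, though not exactly 0
r_noise = np.corrcoef(x, rng.normal(size=x.size))[0, 1]

# A strong but non-linear (quadratic) relationship on a symmetric range
# also gives a coefficient of essentially 0
x_sym = np.linspace(-5, 5, 51)
r_quad = np.corrcoef(x_sym, x_sym ** 2)[0, 1]

print(f"positive: {r_pos:.3f}, negative: {r_neg:.3f}")
print(f"noise: {r_noise:.3f}, quadratic: {r_quad:.3f}")
```

The quadratic case is a reminder that a coefficient near 0 rules out a linear relationship only, not every relationship.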

## Understanding R-squared

R-squared, also known as the coefficient of determination, measures the proportion of variance in the dependent variable that is predictable from the independent variable(s). It is a proportion, often reported as a percentage:

$$ R^2 = 1 - \frac{SS_{residual}}{SS_{total}} $$

Where:

- $ SS_{residual} $: Sum of squared residuals (errors).
- $ SS_{total} $: Total sum of squares (variation in the dependent variable).

### Key Insights:
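- $ R^2 $ ranges from 0 to 1 for a least-squares linear model with an intercept, evaluated on the data it was fitted to. On other data, or for models fitted differently, it can be negative.
- An $ R^2 $ of 1 indicates the model's predictions match the data exactly.
- An $ R^2 $ of 0 indicates the model explains none of the variance: it does no better than always predicting the mean of the dependent variable.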
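
To make the formula concrete, here is a small sketch that evaluates $ R^2 $ directly from its definition on a four-point toy dataset. The helper `r_squared_manual` and the toy values are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
import numpy as np

def r_squared_manual(y_true, y_hat):
    """Compute R^2 = 1 - SS_residual / SS_total from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_residual = np.sum((y_true - y_hat) ** 2)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_residual / ss_total)

y_obs = np.array([2.0, 4.0, 6.0, 8.0])  # mean is 5, SS_total is 20

# Predictions identical to the data: SS_residual = 0
perfect = r_squared_manual(y_obs, y_obs)

# Always predicting the mean: SS_residual = SS_total
baseline = r_squared_manual(y_obs, np.full_like(y_obs, y_obs.mean()))

# Predictions in reverse order: SS_residual = 80, worse than the mean
worse = r_squared_manual(y_obs, y_obs[::-1])

print(perfect, baseline, worse)  # 1.0 0.0 -3.0
```

The last case shows how predictions that are worse than the mean baseline push the value below 0.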

In [5]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


### Example: Simple Linear Regression

Let's generate a dataset and calculate the R-squared value for a simple linear regression model.

In [7]:
# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10  # Independent variable
y = 3 * X.squeeze() + np.random.randn(100) * 2  # Dependent variable with some noise

# Fit a linear regression model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Calculate R-squared
r_squared = r2_score(y, y_pred)
r_squared

0.958272869425565

In [26]:
%matplotlib qt

In [51]:
np.mean(y)

14.103261581011564

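In [60]:
# Show each observation, its fitted value, and the mean of y
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.scatter(X, y, 150, c='k', zorder=5)           # observed data
ax.scatter(X, y_pred, 50, c='orange')            # fitted values
ax.plot(X, y_pred, c='orange', alpha=0.1, lw=5)  # regression line
ax.set_xlabel("X", fontsize=25)
ax.set_ylabel("Y", fontsize=25)
ax.axhline(np.mean(y), c='darkblue', lw=5)       # baseline: always predict the mean
n = len(X)

SSR = 0  # sum of squared residuals (SS_residual)
SST = 0  # total sum of squares (SS_total)
for i in range(n):
    ax.plot([X[i], X[i]], [y_pred[i], y[i]], c='r')  # residual for point i
    # Both sums must use squared differences; the raw differences sum to ~0
    SSR += (y[i] - y_pred[i]) ** 2
    SST += (y[i] - np.mean(y)) ** 2
ax.grid(False)

In [61]:
# Manual R-squared, rounded; agrees with r2_score above
round(float(1 - SSR/SST), 4)

0.9583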

### Visualizing the Regression

The plot below shows the original data points and the regression line. The R-squared value, shown in the legend, indicates how well the line fits the data.

In [63]:
# Plot the data and regression line
plt.figure(figsize=(8, 5))
plt.scatter(X, y, color='blue', label='Data')
plt.plot(X, y_pred, color='red', label=f'Regression Line (R^2 = {r_squared:.2f})')
plt.title('Simple Linear Regression')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Dependent Variable (y)')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

### Interpreting R-squared

- A high R-squared (close to 1) suggests a strong fit.
- A low R-squared (close to 0) suggests a poor fit.

In this example, $ R^2 \approx 0.96 $, so the fitted line accounts for about 96% of the variance in $ y $. The rest comes from the noise added when the data was generated.
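
For a simple linear regression with one predictor and an intercept, R-squared equals the square of the Pearson correlation between $ X $ and $ y $, which ties this section back to the correlation discussion at the top. The sketch below checks that identity with NumPy only. It generates its own data with `np.random.default_rng`, so the numbers will not match the notebook's `r_squared`, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)  # separate generator; data differs from the notebook's
x_demo = rng.uniform(0, 10, size=100)
y_demo = 3 * x_demo + rng.normal(scale=2, size=100)

# Least-squares line; np.polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)
y_fit = slope * x_demo + intercept

# R^2 from the definition
ss_residual = np.sum((y_demo - y_fit) ** 2)
ss_total = np.sum((y_demo - y_demo.mean()) ** 2)
r2_demo = 1 - ss_residual / ss_total

# Pearson correlation between predictor and response
pearson_r = np.corrcoef(x_demo, y_demo)[0, 1]

print(f"R^2 = {r2_demo:.4f}, r^2 = {pearson_r ** 2:.4f}")
```

The two printed values should agree up to floating-point error. The identity does not carry over unchanged to models with several predictors.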

## Limitations of R-squared

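- **Overfitting**: Adding independent variables never lowers R-squared on the training data, even when the variables are irrelevant, so a high value can reflect model size rather than fit quality. Adjusted R-squared penalizes extra predictors.
- **Non-linearity**: R-squared from a linear model can be low even when the variables are strongly related in a non-linear way, and a high value does not prove the relationship is linear.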
- **Context**: A high R-squared does not guarantee causation or model validity.
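
The first two limitations can be illustrated with a short experiment. This is a sketch under stated assumptions: the helpers `fit_r2` and `adjusted_r2`, the synthetic data, and the choice of 20 irrelevant predictors are all hypothetical and not part of the original notebook. The adjusted R-squared formula used is the standard $ 1 - (1 - R^2)\frac{n - 1}{n - p - 1} $.

```python
import numpy as np

def fit_r2(features, target):
    """Training R^2 of an ordinary least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    ss_residual = np.sum(residuals ** 2)
    ss_total = np.sum((target - target.mean()) ** 2)
    return float(1 - ss_residual / ss_total)

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2, which penalizes each additional predictor."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

rng = np.random.default_rng(0)  # arbitrary seed
n = 60
signal = rng.uniform(0, 10, size=(n, 1))
target = 3 * signal[:, 0] + rng.normal(scale=2, size=n)
junk = rng.normal(size=(n, 20))  # predictors unrelated to the target

# Overfitting: irrelevant predictors cannot lower the training R^2
r2_small = fit_r2(signal, target)
r2_big = fit_r2(np.column_stack([signal, junk]), target)
print(f"1 predictor:   R^2 = {r2_small:.4f}, adjusted = {adjusted_r2(r2_small, n, 1):.4f}")
print(f"21 predictors: R^2 = {r2_big:.4f}, adjusted = {adjusted_r2(r2_big, n, 21):.4f}")

# Non-linearity: y = x^2 on a symmetric range is perfectly predictable,
# yet a straight line explains essentially none of its variance
x_sym = np.linspace(-5, 5, 101).reshape(-1, 1)
y_quad = x_sym[:, 0] ** 2
r2_line = fit_r2(x_sym, y_quad)
r2_quadratic = fit_r2(np.column_stack([x_sym, x_sym ** 2]), y_quad)
print(f"line: R^2 = {r2_line:.4f}, with x^2 term: R^2 = {r2_quadratic:.4f}")
```

Comparing the plain and adjusted values across the two models is one way to judge whether extra predictors are earning their place. Evaluating on held-out data is another.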

## Conclusion

R-squared is a valuable metric for assessing the fit of a regression model. It provides insights into the proportion of variability explained by the model, but it should be interpreted cautiously alongside other metrics and domain knowledge.