### Regression analysis
Regression analysis is used to quantify the relationship between a dependent variable and one or more explanatory variables.

For example, say there's a relationship between the size of a house and its price. In this case, the size of the house is the explanatory (independent) variable, while the price is the dependent variable (as it depends on the size of the house).

Two kinds of regression:
1. Simple Regression
    - Only uses one independent variable (house size in our example) for predicting the other
    - Assumes a linear relationship exists between the two variables
    - Uses the form: y = a + bx (alpha plus beta times x), where a is the intercept and b is the slope
    - Finds the line of best fit by choosing the line with the smallest total squared error
        - Errors are the differences between the actual data points and the predictions given by the line
2. Multivariate Regression
    - Contains multiple independent variables that work together to predict the dependent variable
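
To make the idea of "errors" concrete, here is a minimal sketch in plain Python. The four data points and the two candidate lines are made up for illustration (they are not taken from Housing.xlsx), and `predict` and `errors` are hypothetical helper names.

```python
# Hypothetical data: house sizes (sq. ft.) and prices; not from Housing.xlsx.
sizes = [1000, 1500, 2000, 2500]
prices = [200_000, 290_000, 410_000, 500_000]


def predict(a, b, x):
    """Prediction of a simple regression line y = a + b*x."""
    return a + b * x


def errors(a, b, xs, ys):
    """Differences between the actual values and the line's predictions."""
    return [y - predict(a, b, x) for x, y in zip(xs, ys)]


# Two candidate lines, each given as (a, b).
line_1 = (0, 200)
line_2 = (50_000, 150)

# The line with the smaller sum of squared errors fits the data better.
for a, b in (line_1, line_2):
    sse = sum(e ** 2 for e in errors(a, b, sizes, prices))
    print(f"a={a}, b={b}: sum of squared errors = {sse:,.0f}")
```

On this toy data the first line has a much smaller sum of squared errors than the second, so it is the better fit of the two. It is not necessarily the best possible line.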

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm
from scipy import stats

ImportError: cannot import name '_lazywhere' from 'scipy._lib._util' (/Users/adammoore/Documents/Development/PythonForFinance/venv/lib/python3.13/site-packages/scipy/_lib/_util.py)

In [None]:
housing_data = pd.read_excel('PythonForFinanceCourseMaterials/S13 - Part II Finance - Regressions for Fin. Analysis/Housing.xlsx')
housing_data

In [None]:
housing_data[['House Price', 'House Size (sq.ft.)']]

### Univariate Regression

In [5]:
X = housing_data['House Size (sq.ft.)'] # Independent variable
Y = housing_data['House Price'] # Dependent variable

In [None]:
plt.scatter(X, Y)

In [None]:
# If you want to scale the graph to the data:
plt.scatter(X, Y)
plt.axis([0, 2500, 0, 1500000])
plt.ylabel('House Price')
plt.xlabel('House Size (sq.ft.)')
plt.show()

In [None]:
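X1 = sm.add_constant(X)  # add a column of ones so the model estimates an intercept (alpha)
reg = sm.OLS(Y, X1).fit()  # fit the ordinary least squares regression of Y on X1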

### How to distinguish Good Regressions:
Usually, more than one variable determines house prices, for example:
- Location
- Neighborhood
- Year of Construction
- Etc

So when simple regressions are used, they often omit relevant factors, which results in an estimation error (a larger distance between the model's predictions and the actual data). This doesn't make simple regressions useless, as they may still have high predictive power; they are just imperfect. So really, regression models can be written as:
- Y = a + bx + error (Y equals alpha plus beta times x plus error)

People often use the word 'residuals' to refer to the errors, i.e. the distances from the data points to the model's predictions.

The line of best fit is the line that minimizes the SSR, the sum of squared residuals (i.e. Sum(error<sup>2</sup>)).

Thus, the coefficients found with this technique (the alpha and beta in y = a + bx + error) are called OLS (ordinary least squares) estimates.
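
For a single explanatory variable, the OLS estimates have a well-known closed form: beta is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and alpha is chosen so the line passes through the point of means. The sketch below applies it in plain Python. The data is made up for illustration, and `ols_fit` and `ssr` are hypothetical helper names, not statsmodels functions.

```python
def ols_fit(xs, ys):
    """Return (a, b) that minimize the sum of squared residuals for y = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


def ssr(a, b, xs, ys):
    """Sum of squared residuals of the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: house sizes (sq. ft.) and prices; not from Housing.xlsx.
sizes = [1000, 1500, 2000, 2500]
prices = [200_000, 290_000, 410_000, 500_000]

a, b = ols_fit(sizes, prices)
print(f"alpha = {a:,.0f}, beta = {b:,.0f}, SSR = {ssr(a, b, sizes, prices):,.0f}")

# Nudging either coefficient away from the OLS estimate increases the SSR.
print(ssr(a + 1000, b, sizes, prices) > ssr(a, b, sizes, prices))
print(ssr(a, b + 5, sizes, prices) > ssr(a, b, sizes, prices))
```

On a real dataset, the values found this way should match, up to rounding, the coefficients that a fitted `sm.OLS` model reports for the same single-variable regression.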

Not all regressions are equally good, and intuition helps here: some variables are better predictors than others. For instance, the age of the owner is probably a worse predictor of a house's price than its size. The R<sup>2</sup> metric quantifies this:
- To compute R<sup>2</sup>, we begin with the TSS, or total sum of squares
    - TSS = Sum((y - mean of y)<sup>2</sup>), taken over the dependent variable
    - It's like the variance (s<sup>2</sup>) formula without the denominator
- Then R<sup>2</sup> = 1 - SSR/TSS
- R<sup>2</sup> varies between 0% and 100%. The higher it is, the more of the variation in the dependent variable the model explains
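
As a rough illustration of the R<sup>2</sup> formula, the sketch below computes it in plain Python for two lines on made-up data (not from Housing.xlsx). `r_squared` is a hypothetical helper name, and the coefficients -7000 and 204 are assumed to be the least-squares estimates for this toy dataset.

```python
def r_squared(a, b, xs, ys):
    """R^2 = 1 - SSR/TSS for the line y = a + b*x."""
    y_mean = sum(ys) / len(ys)
    # SSR: squared distances between actual values and the line's predictions
    ssr = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # TSS: squared distances between actual values and their mean
    tss = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ssr / tss


# Hypothetical data: house sizes (sq. ft.) and prices.
sizes = [1000, 1500, 2000, 2500]
prices = [200_000, 290_000, 410_000, 500_000]

# Least-squares line for this toy data (assumed: a = -7000, b = 204).
r2_fitted = r_squared(-7000, 204, sizes, prices)

# A flat line at the mean price ignores house size entirely.
mean_price = sum(prices) / len(prices)
r2_flat = r_squared(mean_price, 0, sizes, prices)

print(f"fitted line:  R^2 = {r2_fitted:.4f}")
print(f"mean-only:    R^2 = {r2_flat:.4f}")
```

The flat line has SSR equal to TSS, so its R<sup>2</sup> is 0. The fitted line leaves only a small share of the variation unexplained, so its R<sup>2</sup> is close to 1.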