# Introduction to Regression Analysis

## How regression analysis works

- **Relationship Between Variables**: Regression analysis examines the relationship between a dependent variable (Y) and one or more explanatory variables (X).
- **Graphical Representation**: Data can be visualized on a graph to show the relationship between Y and X. The slope of the line of best fit represents this relationship.<img src="imgs/reg1.png" width="650">
- **Simple vs. Multiple Regression**: Simple regression involves one X variable, while multiple regression involves multiple X variables, allowing for more complex analysis.
- **Lines and Planes of Best Fit**: Regression analysis fits lines (in simple regression) or planes (in multiple regression) to the data, summarizing the relationships between variables.<img src="imgs/reg2.png" width="650">
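
As a rough illustration of lines and planes of best fit, the sketch below fits a line to one X variable and a plane to two X variables. The data are simulated and the true coefficients (2.0, 0.5, -0.3) are assumptions chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simple regression: one X variable, so the fit is a line Y = intercept + slope * X
x1 = rng.uniform(0, 10, n)
y_simple = 2.0 + 0.5 * x1 + rng.normal(0, 1, n)
slope, intercept = np.polyfit(x1, y_simple, deg=1)  # highest power first

# Multiple regression: two X variables, so the fit is a plane
x2 = rng.uniform(0, 10, n)
y_multi = 2.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x1, x2])  # column of ones = intercept
coefs, *_ = np.linalg.lstsq(X, y_multi, rcond=None)

print("line :", intercept, slope)
print("plane:", coefs)
```

The estimated slopes should land close to the assumed values, which is all the "line of best fit" summarizes.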


## What regression analysis looks like

- **Regression Output Sections**: Regression results are typically split into two sections: diagnostic information (e.g., sum of squares, R-squared, observation count) and the results (e.g., coefficients, standard errors).
- **Key Statistics**: Important statistics include R-squared (indicates model fit), the F statistic (whether the explanatory variables are jointly significant, i.e., overall predictive power), and root mean square error (size of residuals).
- **Coefficients and Standard Errors**: The coefficient shows the relationship between each variable and the dependent variable (Y). The standard error indicates the precision of each coefficient estimate.
- **Statistical Significance**: Variables are statistically significant at the 5% level if the absolute value of the t statistic exceeds 1.96, the p-value is below 0.05, or the 95% confidence interval does not include zero.
- **Presentation Formats**: Regression results can be presented in various formats, such as tables in reports or outputs from statistical software like Stata.<img src="imgs/reg3.png" width="650">

## Types of regression analysis

- **Ordinary Least Squares (OLS)**: The most common regression method, used for continuous dependent variables and cross-sectional data.
- **Non-Linear Methods**: Includes logit, probit, ordered logit, and multinomial logit models, used when the dependent variable is not continuous.
- **Panel Regression Models**: Used for data collected repeatedly over time, revealing time dynamics.
- **Count Data Models**: Such as Poisson and negative binomial regression, used for non-negative integer data.
- **Cox Proportional-Hazard Regression**: Used when the dependent variable is time, common in health sciences for survival analysis.
  <img src="imgs/reg4.png" width="650">

## Correlation is not causation

- **Correlation vs. Causation**: Regression analysis shows correlation, not causation. It helps understand relationships between variables but doesn't prove one causes the other.
- **Interpreting Causality**: Causality is often inferred through theoretical reasoning and common sense, not just statistics.
- **Simultaneity**: Causality can flow in both directions, making it complex to determine which variable affects the other.
- **Importance of Logic**: Logical reasoning, care, and sometimes even philosophy are essential to attribute causality in regression analysis.

# Ordinary Least Squares

## Fitting lines on a scatter plot

- **Parametric vs. Non-Parametric Methods**: Parametric methods impose a functional form with a small number of parameters on the data, which makes them suitable for multidimensional data and easy to communicate. Non-parametric methods let the data speak for itself; they require fewer assumptions but are less effective in multidimensional settings.<br>
**Advantages and Disadvantages**:
    - **Parametric**: Works well with many variables and is easily transposable but requires strong assumptions about data shape.
    - **Non-Parametric**: Requires fewer assumptions but is harder to communicate and less effective with many variables.
- **Application in Regression**: Most regression methods use parametric fits to estimate parameters that indicate relationships between variables, making it easier to understand without visualizing the data.

<figure>
    <img src="imgs/reg5.png" width=650>
    <figcaption align="center">Non-Parametric fit</figcaption>
</figure>

Non-parametric methods will use something called a *bandwidth* to compute a local average value in a small data space. This *bandwidth* is then traced across the data, and each average is stitched together.<br>
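
The bandwidth idea can be sketched as a simple local-average smoother. This is one basic variant (a uniform window), not necessarily the method used to draw the figure; the function name `local_average` and the simulated sine-shaped data are assumptions for illustration.

```python
import numpy as np

def local_average(x, y, grid, bandwidth):
    """Average y over the points whose x lies within +/- bandwidth of each grid value."""
    fitted = np.full(len(grid), np.nan)
    for i, g in enumerate(grid):
        in_window = np.abs(x - g) <= bandwidth
        if in_window.any():
            fitted[i] = y[in_window].mean()
    return fitted

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 400)
y = np.sin(x) + rng.normal(0, 0.3, 400)

# Slide the window across the data and stitch the local averages together
grid = np.linspace(0.5, 9.5, 50)
smooth = local_average(x, y, grid, bandwidth=0.5)
print(smooth[:5])
```

A wider bandwidth gives a smoother curve that tracks the data less closely; a narrower one does the opposite.
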
***

<figure>
    <img src="imgs/reg6.png" width=650>
    <figcaption>Parametric fit</figcaption>
</figure>

- Regression estimates **parameters** that **indicate the relationship between Y and X**.
- The **slopes are often called coefficients**, which show the relationship between Y and X without having to look at the graph.

This is the power of parametric fits. You don't need to visualize them. A parameter tells you what you need to know.<br>
- Non-parametric methods should be used for graphical analysis of two variables.
- Parametric methods should be used when you want to explore many variables at once in a regression framework.

## OLS Regression

- **OLS Regression**: It's a method to fit a line through data points by **minimizing the sum of squared residuals** (the differences between observed and predicted values).
- **Residuals and Least Squares**: Residuals are the distances between the observed data points and the fitted line. Squaring these residuals removes negative values, and minimizing their sum helps find the best-fitting line.
- **Parametric Method**: **OLS is a parametric method** that works with one or multiple variables, providing coefficients that describe the relationships between variables.
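
A minimal sketch of the least-squares idea: compute the residual sum of squares (RSS) for the fitted line and check that nudging the line in any direction makes the RSS larger. The simulated data and the helper name `rss` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])  # intercept column plus x

# OLS coefficients: the values that minimize the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def rss(beta):
    """Residual sum of squares for a candidate (intercept, slope)."""
    residuals = y - X @ beta
    return float(np.sum(residuals ** 2))

best = rss(beta_hat)
steeper = rss(beta_hat + np.array([0.0, 0.1]))   # slightly steeper line
higher = rss(beta_hat + np.array([0.5, 0.0]))    # line shifted upward

print(best, steeper, higher)
```

Both perturbed lines have a larger RSS than the OLS line, which is what "least squares" means.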

<figure>
    <img src="imgs/reg7.png" width=650>
    <figcaption>Residuals</figcaption>
</figure>

<figure>
    <img src="imgs/reg8.png" width=650>
    <figcaption>Residual sum of squares</figcaption>
</figure>

<figure>
    <img src="imgs/reg9.png" width=650>
    <figcaption>Least squares</figcaption>
</figure>

## BLUE

**BLUE Definition**: BLUE stands for **Best Linear Unbiased Estimator**.<br>
It means that under certain conditions, the *ordinary least squares (OLS)* estimator is the best method for estimating the parameters of a regression model.<br>

**Conditions for BLUE**: Four main conditions must be met for OLS to be BLUE:
1. **Linearity**: The model must be linear in its parameters, and the dependent variable must be a continuously measured variable.
    - **Dependent variable**: This is the outcome you're trying to predict or explain. In regression, it's often called the **target** or **label**.
    - **Continuously measured variable**: This means the variable can take on any value within a range, including decimals. Examples include height, temperature, price, or time.<br>
If the dependent variable is **categorical** (e.g., "yes"/"no", or "low"/"medium"/"high"), you'd need a different type of model, such as **logistic regression** or another **classification algorithm**.
2. **Zero Conditional Mean (Exogeneity)**: There should be no correlation between explanatory variables and the error term.
    - **Explanatory variables** (also called **independent variables** or **features**) are the inputs used to predict the outcome.
    - The **error term** represents all the factors that **affect the dependent variable but are not included in the model**.
    - **Endogeneity** means the variable is correlated with the error term, which may include unseen or omitted variables.
3. **Homoscedasticity**: The variation of noise in the data must remain stable across explanatory variables.
    - **Noise** = the error term (difference between actual and predicted values).
    - **Stable variation** = the variance of the errors should not increase or decrease systematically with the explanatory variables.
4. **No Collinearity**: Explanatory variables should not be highly correlated with each other.

**Practical Application**: In practice, these conditions are rarely fully met, which means that OLS results should be interpreted with caution and in context rather than being taken as absolute truth.


### Simulation example

1. Take a small sample from infinite data
2. Estimate a particular relationship
3. Use two hypothetical estimators
4. The true parameter is one
5. Plot the estimates on a graph

<figure>
    <img src="imgs/reg10.png" width=650 align="center">
    <figcaption>Efficiency graph</figcaption>
</figure>

Both estimators, *on average*, estimate the correct value of one.<br>
**However, the estimates from the inefficient estimator are, *on average*, further from the true value than those from the efficient estimator**.<br>This is the concept of efficiency.<br>
Normally we don't have an infinite amount of data, but in the real world, this concept is visible through higher or lower standard errors. Lower standard errors mean more certainty around results.
***

<figure>
    <img src="imgs/reg11.png" width=650>
    <figcaption>Bias graph</figcaption>
</figure>

**The biased estimator, *on average*, does not estimate the true parameter in this data.**<br>
*On average*, it's wrong.
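
The simulation steps above can be sketched roughly as follows. The three estimators are hypothetical stand-ins chosen for illustration: OLS on the full sample (efficient), OLS on every fifth observation (unbiased but inefficient), and an OLS slope shrunk by an arbitrary factor of 0.8 (biased). They are not necessarily the estimators behind the graphs.

```python
import numpy as np

rng = np.random.default_rng(4)
true_slope = 1.0
x = np.linspace(0, 10, 30)   # fixed X values for every simulated sample
n_sims = 2000

def ols_slope(x, y):
    """Slope of the least-squares line through (x, y)."""
    return np.polyfit(x, y, 1)[0]

efficient, inefficient, biased = [], [], []
for _ in range(n_sims):
    # Draw a small sample from the assumed data-generating process
    y = true_slope * x + rng.normal(0, 2, x.size)
    efficient.append(ols_slope(x, y))              # uses all 30 points
    inefficient.append(ols_slope(x[::5], y[::5]))  # throws away most points
    biased.append(0.8 * ols_slope(x, y))           # systematically too small

efficient = np.array(efficient)
inefficient = np.array(inefficient)
biased = np.array(biased)

for name, est in [("efficient", efficient), ("inefficient", inefficient), ("biased", biased)]:
    print(f"{name:12s} mean={est.mean():.3f}  spread={est.std():.3f}")
```

The first two estimators should average close to one, with the inefficient one showing a wider spread, while the biased one averages close to 0.8 no matter how many samples are drawn.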

## Which conditions matter?

**Gauss Markov Assumptions**: These are conditions that make OLS regression the best linear unbiased estimator (BLUE). **Not all assumptions are equally important in practice**.
- **Homoscedasticity**: This assumption is less critical in real life. It assumes constant noise around the regression line. Modern statistical software can test and correct for violations of this assumption.
- **No Collinearity**: This assumption is more significant. It requires that explanatory variables are not highly correlated. High collinearity can lead to noisier estimates and higher standard errors.
    - It matters most in small data sets (fewer than about 100 observations). Larger data sets are less likely to be affected.
- **Linear in Parameters**: **This must be true for OLS to work**. It means that a one-unit change in an explanatory variable should consistently cause a change in the dependent variable - coefficients should have the same meaning across the regression space.
- **Exogeneity**: **This is the most critical assumption**. It requires no correlation between explanatory variables and the error term. Violations can lead to biased estimates and significant issues in regression analysis.

### Why exogeneity matters most

- **Exogeneity Assumption**: This is crucial for regression models. It requires that explanatory variables are not correlated with unseen variables outside the model.
- **Endogeneity Consequences**: When the exogeneity assumption fails (endogeneity), it leads to biased estimates, meaning the coefficients do not have a causal interpretation, which can result in incorrect conclusions.
- **Practical Implications**: Endogeneity is particularly problematic in applied work involving human behavior. Good regression models rely on theoretical frameworks, prior literature, and rational thought to address this issue.
- **Testing**: Exogeneity cannot be tested for directly, because the error term is never observed.

Suppose we're modeling student test scores based on hours studied:

$TestScore = \beta_0 + \beta_1 \cdot HoursStudied + \varepsilon$

Here, $\varepsilon$ is the error term, which captures all other factors that influence $TestScore$ but are not included in the model.

$\varepsilon$ could include:
- Student's prior knowledge or IQ
- Quality of instruction or tutoring
- Sleep quality before the test
- Test anxiety
- Nutrition or health on test day

If any of these omitted factors are correlated with $HoursStudied$ (e.g., smarter students tend to study more), then $HoursStudied$ becomes endogenous, and the OLS estimate of $\beta_1$ will be biased.

Variables that are likely to be endogenous are those that resemble choices made by firms, individuals, or nations. For example:

- **Advertising Expenditure**: This can be influenced by various factors such as price changes and consumer preferences.
- **Sales Performance**: This is driven by multiple factors, including advertising quality and external market conditions.

These variables are often influenced by unseen factors, making them endogenous and potentially leading to biased estimates in regression models.
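
The test-score example can be simulated to show what endogeneity does to an estimate. Everything in this sketch is an assumption made for illustration: a single omitted factor called `ability`, a true effect of 2.0 points per hour studied, and higher-ability students studying more.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000

ability = rng.normal(0, 1, n)                      # unseen by the analyst
hours = 5 + 1.0 * ability + rng.normal(0, 1, n)    # correlated with ability
score = 50 + 2.0 * hours + 3.0 * ability + rng.normal(0, 1, n)

def ols(y, *columns):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

short_model = ols(score, hours)            # ability sits in the error term
long_model = ols(score, hours, ability)    # ability controlled for

print("hours coefficient, ability omitted :", short_model[1])
print("hours coefficient, ability included:", long_model[1])
```

With ability left in the error term, the coefficient on hours absorbs part of the ability effect and overstates the assumed true value of 2.0; including ability recovers it. In real data the omitted factor is usually not available, which is why the problem cannot simply be fixed this way.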

## Goodness-of-fit statistics

- **R-squared Value**: This measures the goodness of fit in a regression model, ranging from 0 (poor fit) to 1 (perfect fit). It indicates **how much of the variation in the dependent variable is explained by the model**.
- **Limitations of R-squared**: High R-squared values can be misleading if the model is overloaded with variables. Low R-squared values do not necessarily invalidate the model's coefficients.
- **Contextual Use**: Use R-squared for contextual purposes, comparing it with typical values in your field, but focus more on the significance and magnitude of the coefficients.
    - **Macro and Time Series Data Models**: These often have high R-squared values, typically around 0.8 or 0.9.
    - **Micro and Cross-Sectional Data Models**: These usually have lower R-squared values, often around 0.2 or 0.4.
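
To illustrate why a high R-squared can mislead, this sketch computes R-squared as one minus the ratio of the residual sum of squares to the total sum of squares, then adds 30 columns of pure noise to the model. The data and the helper name `r_squared` are assumptions for illustration.

```python
import numpy as np

def r_squared(y, X):
    """Share of the variation in y explained by an OLS fit on X (X includes an intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

rng = np.random.default_rng(6)
n = 60
x = rng.normal(size=n)
y = 1 + 0.5 * x + rng.normal(size=n)

X_base = np.column_stack([np.ones(n), x])
junk = rng.normal(size=(n, 30))                 # variables unrelated to y
X_overloaded = np.column_stack([X_base, junk])

r2_base = r_squared(y, X_base)
r2_overloaded = r_squared(y, X_overloaded)
print(r2_base, r2_overloaded)
```

R-squared rises even though the extra variables carry no information about Y, so a higher value alone does not indicate a better model.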

When evaluating the coefficients in a regression model, consider asking the following questions:

- **Do they make sense?**: Assess whether the coefficients align with your expectations and theoretical understanding.
- **Are they significant?**: Check if the coefficients are statistically significant. If not, investigate further to understand why.
- **Are they large?**: Evaluate the magnitude of the coefficients. Unexpectedly large coefficients might indicate a modeling error.

## Using functional forms

- **Flexibility of Linear Regression**: Linear regression can fit many kinds of non-linear relationships, known as functional forms.
- **Quadratic Relationships**: Introducing quadratic terms (e.g., mileage squared) can improve the model fit for non-linear data.
- **Visual Representation**: It's important to display estimates visually to understand complex relationships better.
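
A minimal sketch of a quadratic functional form: the model stays linear in its parameters, but a squared term lets the fitted curve bend. The variable names `mileage` and `price` and all numeric values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
mileage = rng.uniform(10, 40, n)
# Assumed curved relationship between mileage and price
price = 20000 - 900 * mileage + 12 * mileage**2 + rng.normal(0, 500, n)

def fit(y, X):
    """Return OLS coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, float(np.sum((y - X @ beta) ** 2))

ones = np.ones(n)
beta_lin, rss_lin = fit(price, np.column_stack([ones, mileage]))
beta_quad, rss_quad = fit(price, np.column_stack([ones, mileage, mileage**2]))

print("linear RSS   :", rss_lin)
print("quadratic RSS:", rss_quad)
print("coefficient on mileage squared:", beta_quad[2])
```

Because the effect of mileage now depends on its level, plotting the fitted curve is usually easier to read than the two coefficients on their own.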

# OLS Tips and Tricks

## Visualizing coefficients

### Simplify Tables

Ensure regression tables are easy to read by clearly identifying dependent variables, presenting only important results, and using appropriate formatting.

 - **Clear Identification**: Clearly identify the dependent variable in the title of the table.
 - **Relevant Results**: Present only the important results that matter and that you want to discuss.
 - **Simplified Controls**: Use text to state that other controls are included, rather than listing them all.
 - **Decimal Places**: Present results to two or three decimal places only.
 - **Standard Errors**: Report standard errors only, not t statistics, confidence intervals, or p-values. Use asterisks to indicate significance.
 - **Formatting**: Use appropriate border lines: heavy for the top and bottom, thin to separate output from diagnostics.
 - **Naming Conventions**: Keep the naming conventions simple and relevant.

<figure>
    <img src="imgs/reg12.png" width=650>
    <figcaption>Correctly formatted table</figcaption>
</figure>

 - **Use Visuals**: Besides tables, visualize regression results through plots to better understand relationships and coefficients.

<figure>
    <img src="imgs/reg13.png" width=650>
    <figcaption>Regression plot</figcaption>
</figure>

## Rule of thumb significance check

- **Statistical Testing**: Statistical tests determine whether a coefficient is significantly different from zero or from another value.
- **Quick Determination**: A shortcut for large samples (above 100) involves using a confidence interval to quickly assess statistical significance.
- **Confidence Interval**: By doubling the standard error and adding/subtracting it from the coefficient, you can determine if a value is statistically different from zero or another value.

To quickly determine if a coefficient is statistically significant, use the following method:

1. **Estimate the Coefficient**: Identify the estimated coefficient and its standard error.
2. **Calculate the Confidence Interval**: Multiply the standard error by 2 (a rounding of 1.96, the multiplier for a 95% confidence interval in large samples) and add/subtract this value from the estimated coefficient.
3. **Assess Significance**: If the confidence interval does not include zero, the coefficient is statistically significant.<br>
**Practical Use**: This method is useful when scanning regression outputs without diving into exact p-values.<br>
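
The rule of thumb is simple enough to write as two small helper functions. The function names and example numbers are assumptions for illustration, and the shortcut is only meant for large samples.

```python
def rule_of_thumb_ci(coef, std_err, multiplier=2.0):
    """Approximate 95% confidence interval: coefficient +/- 2 standard errors."""
    return coef - multiplier * std_err, coef + multiplier * std_err

def is_significant(coef, std_err, null_value=0.0):
    """True if the approximate interval excludes the null value."""
    lower, upper = rule_of_thumb_ci(coef, std_err)
    return not (lower <= null_value <= upper)

# Interval runs from 0.5 to 1.5 and excludes zero
print(rule_of_thumb_ci(1.0, 0.25), is_significant(1.0, 0.25))

# Interval runs from -0.5 to 1.5 and includes zero
print(rule_of_thumb_ci(0.5, 0.5), is_significant(0.5, 0.5))
```

Passing a different `null_value` checks whether the coefficient differs from some value other than zero.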

**Key Points**:

**Overlapping Confidence Intervals ≠ No Difference**<br>
Overlap between two confidence intervals does not, by itself, show that the coefficients are statistically indistinguishable. Each interval only describes the values that its own coefficient is not significantly different from; it says nothing precise about the difference between the two coefficients.

**Comparing Coefficients Requires a Different Test**<br>
If you want to test whether two coefficients are significantly different from each other, you need to perform a *contrast test* or a *hypothesis test* on the difference between them. This is more precise than just eyeballing overlapping intervals.<br>
In Python, using `statsmodels`, you can do this with a *Wald test* or by manually computing the standard error of the difference.

<figure>
    <img src="imgs/reg14.png" width=650>
    <figcaption>Statistical significance assumption</figcaption>
</figure>

## Reading indicator variables

- **Indicator Variables**: Also known as dummy variables, these take values of 0 or 1 and are used to analyze non-continuous data.
 - **Interpretation**: The coefficient of an indicator variable shows the effect relative to a reference category, shifting the regression line up or down.
 - **Multiple Categories**: When dealing with variables with multiple categories (e.g., regions), they need to be recoded into separate dummy variables, with one excluded as the reference category.
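
A minimal sketch of recoding a multi-category variable into dummies with one reference category, using `pandas.get_dummies` with `drop_first=True`. The region names and the average outcome per region are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n = 900
region = rng.choice(["north", "south", "west"], size=n)

# Assumed average outcome in each region
region_mean = {"north": 10.0, "south": 12.0, "west": 9.0}
y = np.array([region_mean[r] for r in region]) + rng.normal(0, 1, n)

# One 0/1 column per category; the first category ("north") is dropped
# and becomes the reference category
dummies = pd.get_dummies(pd.Series(region), prefix="region", drop_first=True, dtype=float)

X = np.column_stack([np.ones(n), dummies.to_numpy()])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(list(dummies.columns))
print("intercept (north average):", beta[0])
print("south relative to north  :", beta[1])
print("west relative to north   :", beta[2])
```

Each dummy coefficient is a shift relative to the excluded category, so changing the reference category changes the coefficients but not the fitted values.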

## Make use of time

 - **Cross-Sectional Data**: Data collected at a single point in time across multiple subjects (e.g., people, companies, regions).
    - **Example**: Income levels of 100 households in 2025.
 - **Time Series Data**: Unlike cross-sectional data, time series data involves repeated measurements over time, allowing for the investigation of time dynamics.
     - **Example**: Daily temperature in Tomaszkowo for the year 2025.
 - **Lagged Variables**: These are past values of variables used to understand how past events affect current outcomes. For example, living in an urban area last year might impact wages today.
 - **Time Dynamics**: Introducing time into regression models helps analyze cause and effect more accurately, showing that some effects, like wage increases from moving to an urban area, take time to materialize.
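
Creating a lagged variable is mostly a data-preparation step. The sketch below builds a one-year lag of an `urban` indicator in a tiny hypothetical panel; the column names and values are assumptions for illustration.

```python
import pandas as pd

panel = pd.DataFrame({
    "person": ["a", "a", "a", "b", "b", "b"],
    "year":   [2023, 2024, 2025, 2023, 2024, 2025],
    "urban":  [0, 1, 1, 1, 1, 0],
})

# Sort so that each person's rows are in time order, then shift within person
panel = panel.sort_values(["person", "year"])
panel["urban_lag1"] = panel.groupby("person")["urban"].shift(1)

print(panel)
```

The first year for each person has no previous value, so the lag is missing there; those rows drop out of any regression that includes the lagged variable.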

### Example

In this regression, we control for various demographic factors and different regions. There are also two variables for living in an urban area:
 - A **contemporaneous variable** called *urban time zero*
 - A **lagged variable** called *urban time minus one*.<br>
Here, time is measured in years, so *T minus one* refers to last year.<br>
These results say that, once demographic and regional factors are controlled for, living in an urban area today has no impact on current hourly wages, but living in an urban area last year increases today's wages by around 5.5%.<br>
This is evidence that the effect of moving from a rural area to an urban area takes one year to materialize. It's not instantaneous.

<figure>
    <img src="imgs/reg15.png" width=650>
    <figcaption>Lagged time regression</figcaption>
</figure>

## Interpreting elasticities

 - **Elasticity Concept**: Elasticity measures the proportional change of one variable in response to a change in another variable. It's commonly used in economic and business modeling.
 - **Log Transformations**: Log transformations in regression analysis allow results to be interpreted in terms of elasticities. There are four types of models:
     - **Linear Model**: No transformation; coefficients relate to unit changes.
     - **Log-Linear Model**: Regresses log Y against X; a one-unit change in X is associated with an approximate (100 × coefficient) percent change in Y.
     - **Linear-Log Model**: Regresses Y against log X; a 1% change in X is associated with a change of about (coefficient / 100) units in Y.
     - **Log-Log Model**: Regresses log Y against log X; a 1% change in X is associated with a (coefficient) percent change in Y, so the coefficient is the elasticity.

**Practical Application**: Understanding these models helps in interpreting how changes in one variable affect another, which is crucial for business and economic analysis.
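
As a small illustration of the log-log case, the sketch below simulates demand with a constant price elasticity and recovers it as the slope of a regression of log quantity on log price. The variable names and the elasticity of -1.2 are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 1000
true_elasticity = -1.2

price = rng.uniform(1, 20, n)
# Constant-elasticity relationship with multiplicative noise
quantity = 100 * price**true_elasticity * np.exp(rng.normal(0, 0.1, n))

# Log-log model: log(quantity) = intercept + elasticity * log(price)
elasticity_hat, intercept_hat = np.polyfit(np.log(price), np.log(quantity), 1)

print("estimated elasticity:", elasticity_hat)
```

An estimate near -1.2 reads as: a 1% increase in price is associated with roughly a 1.2% fall in quantity.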

<figure>
    <img src="imgs/reg16.png" width=650>
    <figcaption>Common regression transformations</figcaption>
</figure>

# Applied Regression Analysis