# Linear Regression - Wooldridge Beauty Dataset

#### Import the basic libraries:

1. Pandas, to handle the data tables we are going to download
2. Statsmodels linear regression, which will help us fit and interpret the regression
3. Wooldridge, to give us a dataset to work with (https://github.com/spring-haru/wooldridge)

In [89]:
import pandas as pd
from statsmodels.api import OLS, add_constant
import wooldridge

#### Get Data to Dataframe

This is where we are going to get the data for this project. Jeffrey Wooldridge is a Professor of Economics at Michigan State University and is known for his contributions to the field of econometrics, particularly in the areas of panel data analysis and microeconometrics. Wooldridge has also developed several important econometric models and estimation techniques. One of his most well-known contributions is the linear fixed effects model for panel data analysis, which is now a standard tool in empirical microeconomics.

As for datasets, Wooldridge has used a variety of them in his research. In this notebook we're going to use the Beauty dataset.
* Beauty was used to investigate the relationship between physical attractiveness and labor market outcomes.

The `data` function in the `wooldridge` package loads a specific dataset into memory. The `description=True` argument prints a description of the dataset to the console, listing the variables it contains and what each one measures.

In [90]:
wooldridge.data('Beauty', description=True)

name of dataset: beauty
no of variables: 17
no of observations: 1260

+----------+-------------------------------+
| variable | label                         |
+----------+-------------------------------+
| wage     | hourly wage                   |
| lwage    | log(wage)                     |
| belavg   | =1 if looks <= 2              |
| abvavg   | =1 if looks >=4               |
| exper    | years of workforce experience |
| looks    | from 1 to 5                   |
| union    | =1 if union member            |
| goodhlth | =1 if good health             |
| black    | =1 if black                   |
| female   | =1 if female                  |
| married  | =1 if married                 |
| south    | =1 if live in south           |
| bigcity  | =1 if live in big city        |
| smllcity | =1 if live in small city      |
| service  | =1 if service industry        |
| expersq  | exper^2                       |
| educ     | years of schooling            |
+----------+-------------------------------+

In [6]:
df = wooldridge.data('Beauty')

df.head()

Unnamed: 0,wage,lwage,belavg,abvavg,exper,looks,union,goodhlth,black,female,married,south,bigcity,smllcity,service,expersq,educ
0,5.73,1.745715,0,1,30,4,0,1,0,1,1,0,0,1,1,900,14
1,4.28,1.453953,0,0,28,3,0,1,0,1,1,1,0,1,0,784,12
2,7.96,2.074429,0,1,35,4,0,1,0,1,0,0,0,1,0,1225,10
3,11.57,2.448416,0,0,38,3,0,1,0,0,1,0,1,0,1,1444,16
4,11.42,2.435366,0,0,27,3,0,1,0,0,1,0,0,1,0,729,16


Now that we have our data inside a pandas dataframe, we can start building our linear regression model! The model we are going to use is Ordinary Least Squares (OLS) linear regression, a commonly used method in econometrics. The goal is to estimate the coefficients of a linear equation that describes the relationship between a dependent variable (Y) and one or more independent variables (X1, X2, ..., Xn).

The OLS model works by minimizing the sum of the squared errors between the predicted values of Y and the actual values of Y. This is done by estimating the coefficients of the linear equation in such a way that the sum of the squared errors is as small as possible. This can be expressed by:

$$\min \; SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

and this is important for later:

$$Variance = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}$$
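
To make the two quantities above concrete, here is a minimal sketch that computes the sum of squared residuals and the variance as defined above (SSR divided by n). The observations, predictions, and variable names are made up for illustration and are not taken from the Beauty dataset.

```python
# Hypothetical observed values and model predictions (not from the Beauty data).
y_actual = [2.0, 2.5, 3.1, 3.9]
y_predicted = [2.1, 2.4, 3.3, 3.7]

# Residuals: the gap between each observation and its prediction.
residuals = [y - y_hat for y, y_hat in zip(y_actual, y_predicted)]

# Sum of squared residuals (SSR), the quantity OLS minimizes.
ssr = sum(e ** 2 for e in residuals)

# Variance as defined above: SSR divided by the number of observations.
variance = ssr / len(y_actual)

print(f"SSR = {ssr:.4f}, variance = {variance:.4f}")
```

Squaring the residuals keeps positive and negative errors from cancelling out and penalizes large misses more heavily than small ones.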

Mathematically, the OLS linear regression model can be represented as follows:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n + u$$

where:
1. Y is the dependent variable, 
2. X1, X2, ..., Xn are the independent variables, 
3. β0 is the intercept or constant term, 
4. β1, β2, ..., βn are the coefficients or slopes, 
5. u is the error term or disturbance term.

The OLS method estimates the values of β0, β1, β2, ..., βn that minimize the sum of the squared errors. This is done by finding the values of the coefficients that make the partial derivative of the sum of the squared errors with respect to each coefficient equal to zero.
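
As a rough illustration of that idea, the sketch below fits a regression with a single regressor using the standard closed-form solution and then checks that both partial derivatives of the sum of squared errors are numerically zero at the estimates. The data are hypothetical, and the single-regressor case is chosen only to keep the arithmetic visible; it is not how statsmodels computes the fit internally.

```python
# Hypothetical data: one regressor x and an outcome y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form OLS solution for a single regressor.
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
intercept = y_mean - slope * x_mean

# Residuals at the estimated coefficients.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# First-order conditions: both partial derivatives of the SSR should be ~0.
d_ssr_d_intercept = -2 * sum(residuals)
d_ssr_d_slope = -2 * sum(xi * ei for xi, ei in zip(x, residuals))

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"dSSR/d(intercept) = {d_ssr_d_intercept:.2e}")
print(f"dSSR/d(slope)     = {d_ssr_d_slope:.2e}")
```

Any other choice of intercept and slope would leave at least one derivative away from zero and give a larger sum of squared errors.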

Once the coefficients have been estimated, they can be used to predict the value of Y for a given set of values of the independent variables. OLS linear regression is a widely used method in econometrics and is often used to estimate the causal effect of one or more independent variables on a dependent variable.


Wait up! Before we create our model, we first need to define our dependent (y) and independent (X) variables. Note that we are only going to use some of the variables available in the dataset. This can be done by selecting columns from our Beauty dataframe in the following manner:

In [85]:
y = df.lwage
X = df[['educ','exper','belavg', 'abvavg', 'female', 'union', 'black', 'married','bigcity','goodhlth']]

It's also important to include the intercept (β0) in our model, which we do by adding a constant column to the X variables.

In [87]:
X = add_constant(X)
X.head()

Unnamed: 0,const,educ,exper,belavg,abvavg,female,union,black,married,bigcity,goodhlth
0,1.0,14,30,0,1,1,0,0,1,0,1
1,1.0,12,28,0,0,1,0,0,1,0,1
2,1.0,10,35,0,1,1,0,0,0,0,1
3,1.0,16,38,0,0,0,0,0,1,1,1
4,1.0,16,27,0,0,0,0,0,1,0,1


Great! Now that we have our variables, we need to create the linear model itself. This is very simple: instantiate the model object and then use the `.fit` method to train the model on the variables we created, like this:

In [88]:
linear_model = OLS(y,X)
linear_model = linear_model.fit()
linear_model.summary()

0,1,2,3
Dep. Variable:,lwage,R-squared:,0.378
Model:,OLS,Adj. R-squared:,0.373
Method:,Least Squares,F-statistic:,76.04
Date:,"Thu, 11 May 2023",Prob (F-statistic):,1.48e-121
Time:,16:10:35,Log-Likelihood:,-832.59
No. Observations:,1260,AIC:,1687.0
Df Residuals:,1249,BIC:,1744.0
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.5485,0.094,5.805,0.000,0.363,0.734
educ,0.0674,0.005,12.618,0.000,0.057,0.078
exper,0.0127,0.001,10.533,0.000,0.010,0.015
belavg,-0.1377,0.042,-3.287,0.001,-0.220,-0.056
abvavg,0.0059,0.030,0.193,0.847,-0.054,0.065
female,-0.4201,0.030,-13.897,0.000,-0.479,-0.361
union,0.1818,0.030,5.992,0.000,0.122,0.241
black,-0.1113,0.052,-2.122,0.034,-0.214,-0.008
married,0.0603,0.031,1.936,0.053,-0.001,0.121

0,1,2,3
Omnibus:,70.28,Durbin-Watson:,1.861
Prob(Omnibus):,0.0,Jarque-Bera (JB):,186.389
Skew:,0.274,Prob(JB):,3.36e-41
Kurtosis:,4.803,Cond. No.,185.0


### Model Interpretation
---


### Parameters:

From the obtained results, we can derive some interpretations. The "coef" column in the summary table represents the estimated β coefficients. Using these coefficients, we can construct the regression equation, which in this case would be:

$$
lwage = 0.55 + 0.07educ + 0.01exper - 0.14belavg + 0.006abvavg - 0.42female + 0.18union - 0.11black + 0.06married + 0.18bigcity + 0.07goodhlth
$$

This equation indicates that each independent variable is associated with a particular change in the dependent variable, lwage, when all other independent variables are held constant.

The estimated parameters have varying signs, indicating that an independent variable can be associated with either a higher or a lower value of the dependent variable, lwage. For instance, the coefficient on educ is positive at 0.07, implying that one additional year of education is associated with a 0.07 increase in lwage, or roughly a 7% higher hourly wage, since the dependent variable is in logs. Conversely, the negative β values for female and black suggest that being female or black is linked with lower wages.
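
Because the dependent variable is log(wage), a coefficient β is only approximately the proportional change in wage; the exact percentage change implied by a one-unit increase is 100 · (exp(β) − 1). The sketch below compares the two for a few coefficients copied from the summary table. The dictionary and variable names are introduced here for illustration only.

```python
import math

# Coefficients copied from the "coef" column of the summary table above.
coefficients = {"educ": 0.0674, "belavg": -0.1377, "female": -0.4201}

approx_pct_change = {}
exact_pct_change = {}
for name, beta in coefficients.items():
    # Quick reading of a log-level coefficient: 100 * beta percent.
    approx_pct_change[name] = 100 * beta
    # Exact percentage change in wage for a one-unit increase.
    exact_pct_change[name] = 100 * (math.exp(beta) - 1)
    print(
        f"{name}: approx {approx_pct_change[name]:.1f}%, "
        f"exact {exact_pct_change[name]:.1f}%"
    )
```

The approximation is close for small coefficients such as educ and drifts further from the exact figure as the coefficient grows in magnitude, as with female.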

##### Intercept: 
$\beta_0$ is the predicted value of Y (lwage) when $x_1, x_2, ..., x_n = 0$

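In our model, the intercept is the predicted log wage (about 0.55) for a person whose regressors are all zero: no schooling, no experience, and every dummy variable equal to 0. It is not the minimum wage in the sample, and because such a profile is far from typical, it has little practical meaning here. In general, the intercept may or may not have a meaningful interpretation.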

##### Other Coefficients:

The other parameters measure the partial effect on y of a change in one regressor when all other regressors are held constant, or ceteris paribus. These partial effects are the partial derivatives of the regression equation with respect to each independent variable. This can be expressed as:
$$
\frac{\partial y}{\partial x_j}
\approx
\frac{\Delta y}{\Delta x_j}
$$
$$
\hat{\beta}_j \approx \frac{\Delta y}{\Delta x_j}
$$

$$\hat{\beta}_j \approx \Delta y \quad \text{when} \quad \Delta x_j = 1$$

In this practical example:

$$
\frac{\partial lwage}{\partial educ} = 0.07
$$
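
The same partial effect can be checked numerically. The sketch below predicts lwage for a hypothetical worker, adds one year of schooling while holding everything else fixed, and confirms that the prediction changes by exactly the educ coefficient. Only a subset of the regressors is used for brevity (the omitted terms would cancel in the difference anyway), and `predict_lwage` is a helper defined here for illustration, not part of statsmodels.

```python
# Coefficients copied from the summary table (subset of regressors for brevity).
coef = {"const": 0.5485, "educ": 0.0674, "exper": 0.0127, "female": -0.4201}


def predict_lwage(person):
    """Predicted log wage for a dict of regressor values (hypothetical helper)."""
    return coef["const"] + sum(coef[name] * value for name, value in person.items())


# Hypothetical worker, and the same worker with one more year of schooling.
base = {"educ": 12, "exper": 10, "female": 1}
more_educ = {**base, "educ": base["educ"] + 1}

# Ceteris paribus change in the prediction when educ rises by one unit.
delta_lwage = predict_lwage(more_educ) - predict_lwage(base)
print(f"Change in predicted lwage: {delta_lwage:.4f}")
```

Because the model is linear in educ, the result does not depend on the starting values chosen for the worker.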