# Week 6: From cross tabs to multiple regression

## By Hyunsu Oh and Charlie Eaton

Human capital theorists argue that higher incomes are caused by investment in human capital (i.e., education).

Other sociologists argue that income is influenced by gender and racial discrimination in the labor market.
 
We will test these theories with OLS multiple regression models using data from the General Social Survey 2018.

## Read the data in and describe to see what variables we have

In [None]:
set more off
capture log close
log using w6lecture_multreg_log20200226.log, replace
use realrinc age sex race educ yearsjob paeduc PASEI10 using
describe
summarize realrinc age sex race educ yearsjob paeduc

In [None]:
%head if _n<=5

## What do you think the variables measure?

1. What is the dependent variable?
2. Which variables measure causes of income variation suggested by human capital theory? Why?
3. Which variables measure causes suggested by theories of labor market discrimination? Why?
4. Which variables control for other factors? Why do they matter?

## Develop hypotheses using variables from the data

With your neighbor, make at least one hypothesis for each theory that includes an intervening or spurious relationship involving more than one independent variable.

## Now, let's examine our dependent variable using a histogram

In [7]:
%set graph_format svg

In [None]:
[your code here]

## Then, let's take a look at the relationships between our DV and IVs.

Let's generate scatter plots of our DV against the appropriate IVs.

In [None]:
* Draw one scatter plot with a fitted line for each IV (each variable listed once)
foreach x of varlist age sex race educ yearsjob paeduc PASEI10 {
  graph twoway (scatter realrinc `x') (lfit realrinc `x', color(red)), name(`x', replace) legend(off) ytitle(income) scheme(plotplainblind)
}

### FYI, we can combine these plots into a single figure. For instance:

In [None]:
graph combine age sex race educ yearsjob paeduc PASEI10, col(3) 

## What do you think the numeric values represent for sex and race?

Write some code to find out.

In [None]:
[your code here]

You may want to run a correlation analysis to see the linear relationships among the variables.

In [None]:
correlate realrinc age sex race educ paeduc yearsjob

In [None]:
pwcorr realrinc age sex race educ paeduc yearsjob, sig

### Why do you think we get "1.0000" across the diagonal?

### Should we change any of our hypotheses based on the correlation matrix?

## OLS multiple regression accounts for correlations between IVs

Here is the extension of the regression equation to multiple regression:

$\hat{Y} = \alpha + \beta \cdot x$

$\hat{Y} = \alpha + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \dots + \beta_K \cdot x_K$

$x_1$ is the first independent variable

$x_2$ is the second independent variable

$K$ is the number of independent variables
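
As a small illustration of the general form above, here is what the equation might look like with $K = 2$. The choice of `educ` and `age` as predictors, and the subscripts $e$ and $a$, are only an example, not the model we will end up estimating.

$\widehat{realrinc} = \alpha + \beta_e \cdot educ + \beta_a \cdot age$

Here $\beta_e$ is the expected change in `realrinc` for one more year of schooling, holding age constant.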

### Write a multiple regression equation in LaTeX

Cut and paste LaTeX code from the above cell.

Include all 6 of the independent variables.

Label each independent variable with a subscript made of the first one or two letters of the variable name.

### Write the regression equation in Stata and estimate the model

Replace the numbers for X and for beta with the first initial of the variables you want to include.

## How would you interpret the coefficients?

[your answer here]

## Does this regression analysis really test our hypotheses?

How can we test if there are any of the intervening or spurious relationships we hypothesized?

Write two regression equations that together test one such hypothesis.
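
One common way to set this up, sketched here with generic variables rather than our actual ones, is a pair of nested models. The labels A and B and the primes on the second model's parameters are notation introduced only for this illustration.

Model A: $\hat{Y} = \alpha + \beta_1 \cdot x_1$

Model B: $\hat{Y} = \alpha' + \beta_1' \cdot x_1 + \beta_2 \cdot x_2$

If $\beta_1'$ is much closer to zero than $\beta_1$, the result is consistent with $x_2$ either intervening between $x_1$ and $Y$ or making the $x_1$ relationship spurious. The regressions alone cannot tell these two apart; that depends on the causal ordering of $x_1$ and $x_2$ that your theory assumes.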

In [None]:
[your code here]

## How would you interpret the coefficients of the two models together?

## It's way easier to interpret models together if we combine their results in one table

We use -eststo- with regress and -esttab- to do this.
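
Here is a minimal sketch of that pattern, using Stata's built-in `auto` example data so it does not give away the models you will write yourself. It assumes the `estout` package (which provides `eststo` and `esttab`) is already installed; the variables `price`, `mpg`, and `weight` come from the example data, not from the GSS.

```stata
* Sketch only: uses the built-in auto data, not our GSS extract
preserve                          // keep the GSS data safe in memory
sysuse auto, clear
est clear                         // drop any previously stored estimates

* Store two nested models without printing their output
quietly eststo: regress price mpg
quietly eststo: regress price mpg weight

* Show both models side by side with standard errors and R-squared
esttab, se r2 mlabels("Model 1" "Model 2")
restore                           // bring the GSS data back
```

Note that `est clear` also removes any estimates you stored earlier, so run the sketch before storing your own models.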

In [81]:
est clear

quietly eststo: reg 
quietly eststo: reg 
quietly eststo: reg 

In [None]:
%html
esttab, stats(r2 N, labels("R-squared" "N")) cells(b(star fmt(3)) se(fmt(3) par)) ///
  nobase mlabels("Model 1" "Model 2" "Model 3") starlevels(* .05 ** .01 *** .001) ///
  coeflabels(_cons "Constant" 2.sex "female" 2.race "black" 3.race "other race" ///
  educ "years of schooling" yearsjob "job tenure" paeduc "father's education" ///
  PASEI10 "father's socioeconomic index") html

In [None]:
quietly esttab, stats(r2 N, labels("R-squared" "N")) cells(b(star fmt(3)) se(fmt(3) par)) ///
  nobase mlabels("Model 1" "Model 2" "Model 3") starlevels(* .05 ** .01 *** .001) ///
  coeflabels(_cons "Constant" 2.sex "female" 2.race "black" 3.race "other race" ///
  educ "years of schooling" yearsjob "job tenure" paeduc "father's education" ///
  PASEI10 "father's socioeconomic index") rtf