# Analysis for China wage vs. education data

#### Since R cannot be run in standard VS Code, I had to install Visual Studio Code Insiders for it.

## Datasets used

From Dropbox: China-Education-Wages -> Data -> CFPS Data 2010-2016

Downloaded:
- 2010 English -> ecfps2010adult_112014.dta
  - Renamed 2010adult.dta
- 2012 English -> ecfps2012adultcombine…015.dta
  - Renamed 2012adult.dta
- 2014 English -> ecfps2014adult_170630.dta     
  - Renamed 2014adult.dta
- Mincer16 -> Mincer16.csv 

All four datasets were placed in one folder named CFPSdata, in the same directory as this Jupyter notebook.
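Before reading the data, it can help to confirm that the four files are where the notebook expects them. The following is a minimal sketch, assuming the renamed files sit in a folder called CFPSdata next to the notebook; the helper name `check_cfps_files` is hypothetical.

```r
# Hypothetical helper: report which of the expected data files are present.
check_cfps_files <- function(dir = "CFPSdata") {
  expected <- c("2010adult.dta", "2012adult.dta", "2014adult.dta", "Mincer16.csv")
  present <- file.exists(file.path(dir, expected))
  names(present) <- expected
  present
}

status <- check_cfps_files()
if (!all(status)) {
  message("Missing files: ", paste(names(status)[!status], collapse = ", "))
}
```

If any file is reported missing, the `read_dta()` and `read.csv()` calls further down will fail on that path.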

In [2]:
library(plm)
library(knitr)
library(xtable)
library(broom)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(stargazer)
library(lubridate)
library(haven)


Attaching package: 'dplyr'


The following objects are masked from 'package:plm':

    between, lag, lead


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


-- Attaching packages --------------------------------------- tidyverse 1.3.0 --

v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.4     v stringr 1.4.0
v tidyr   1.1.2     v forcats 0.5.0
v readr   1.4.0     

-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::between() masks plm::between()
x dplyr::filter()  masks stats::filter()
x dplyr::lag()     masks plm::lag(), stats::lag()

## Manipulating data

### Read Data

In [3]:
# Read each wave and keep only rows with a non-negative education code.
data10 = read_dta("./CFPSdata/2010adult.dta")
data10 = filter(data10, qc1 >= 0)
data12 = read_dta("./CFPSdata/2012adult.dta")
data12 = filter(data12, sch2012 >= 0)
data14 = read_dta("./CFPSdata/2014adult.dta")
data14 = filter(data14, pw1r >= 0)
# The 2016 file is a CSV that already contains the dummy variables.
data16 = read.csv("./CFPSdata/Mincer16.csv")
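The filters in this cell keep rows whose education code is not negative. Such a condition can be written as `!x < 0` or as `x >= 0`; in R the two select the same rows, because `!` binds more loosely than `<`, so `!x < 0` parses as `!(x < 0)`. A small sketch on a made-up vector (the values are illustrative, not CFPS data):

```r
# Toy education codes; negative values stand in for special or missing codes.
edu <- c(1, 3, -8, 5, -1, 2)

keep_a <- !edu < 0      # parsed as !(edu < 0)
keep_b <- !(edu < 0)
keep_c <- edu >= 0

identical(keep_a, keep_b)   # TRUE
identical(keep_b, keep_c)   # TRUE
edu[keep_c]                 # 1 3 5 2
```

dplyr's `filter()` keeps only rows where the condition is TRUE, so rows whose code is NA are dropped in either form.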

### Adding education and year dummies to the data, then combining them into one dataframe

In [8]:
mincer10 = data.frame(
  income = data10$income, 
  age = data10$qa1age,
  gender = data10$gender,
  urban = data10$urban,
  prov = data10$provcd,
  ethnic = data10$qa5code,
  married = 0,
  party = 0,
  postsecondary = 0,
  seniorsecondary = 0, 
  juniorsecondary = 0,
  primary = 0, 
  illiterate = 0,
  y10 = 1,
  y12 = 0,
  y14 = 0,
  y16 = 0)

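# Fill the dummies row by row from the 2010 survey codes.
for (row in 1:nrow(data10)) {
  # Marital status: code 2 is flagged as married.
  marriage = data10[row, "qe1"]
  if (marriage == 2) {
    mincer10[row, "married"] = 1
  }
  # Party membership: code 1 sets the party dummy.
  party = data10[row, "qa7_s_1"]
  if (party == 1) {
    mincer10[row, "party"] = 1
  }
  # Education: codes 1-4 map to illiterate through senior secondary;
  # any code above 4 counts as postsecondary.
  edu = data10[row, "qc1"]
  if (edu == 1) {
    mincer10[row, "illiterate"] = 1
  } else if (edu == 2) {
    mincer10[row, "primary"] = 1
  } else if (edu == 3) {
    mincer10[row, "juniorsecondary"] = 1
  } else if (edu == 4) {
    mincer10[row, "seniorsecondary"] = 1
  } else if (edu > 4) {
    mincer10[row, "postsecondary"] = 1
  }
}
# Drop missing incomes, then express income relative to the wave mean.
mincer10 = filter(mincer10, !is.na(income))
mincer10$rinc = mincer10$income / mean(mincer10$income)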
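The row-by-row loop in this cell can also be written with vectorised comparisons, which is usually much faster on survey-sized data. Below is a minimal sketch on a made-up data frame; the column names mirror the 2010 wave and the function name `make_dummies` is hypothetical. One behavioural difference to be aware of: `if (x == 2)` stops with an error when `x` is NA, whereas `%in%` treats NA as no match and yields 0.

```r
# Toy stand-in for the survey columns used in the loop (not real CFPS data).
toy <- data.frame(
  qe1 = c(1, 2, 2, NA),   # marital status code; 2 is flagged as married
  qc1 = c(1, 3, 6, 4)     # education code
)

# Hypothetical helper: build the 0/1 dummies without a row loop.
make_dummies <- function(marriage, edu) {
  data.frame(
    married         = as.integer(marriage %in% 2),
    illiterate      = as.integer(edu %in% 1),
    primary         = as.integer(edu %in% 2),
    juniorsecondary = as.integer(edu %in% 3),
    seniorsecondary = as.integer(edu %in% 4),
    postsecondary   = as.integer(!is.na(edu) & edu > 4)
  )
}

dummies <- make_dummies(toy$qe1, toy$qc1)
dummies
```

The resulting columns could be attached to the wave's data frame with `cbind()`; whether treating a missing code as 0 is appropriate depends on how missing values should be handled in the analysis.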

In [9]:
mincer12 = data.frame(
  income = data12$income, 
  age = data12$cfps2012_age,
  gender = data12$cfps2012_gender,
  urban = data12$urban12,
  prov = data12$provcd,
  ethnic = data12$qa701code,
  married = 0,
  party = data12$sn401,
  postsecondary = 0,
  seniorsecondary = 0, 
  juniorsecondary = 0,
  primary = 0, 
  illiterate = 0,
  y10 = 0,
  y12 = 1,
  y14 = 0,
  y16 = 0)

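# Fill the dummies row by row from the 2012 survey codes.
for (row in 1:nrow(data12)) {
  # Marital status: code 2 is flagged as married.
  marriage = data12[row, "qe104"]
  if (marriage == 2) {
    mincer12[row, "married"] = 1
  }
  # Education: codes 1-4 map to illiterate through senior secondary;
  # any code above 4 counts as postsecondary.
  edu = data12[row, "sch2012"]
  if (edu == 1) {
    mincer12[row, "illiterate"] = 1
  } else if (edu == 2) {
    mincer12[row, "primary"] = 1
  } else if (edu == 3) {
    mincer12[row, "juniorsecondary"] = 1
  } else if (edu == 4) {
    mincer12[row, "seniorsecondary"] = 1
  } else if (edu > 4) {
    mincer12[row, "postsecondary"] = 1
  }
}
# Drop missing incomes, then express income relative to the wave mean.
mincer12 = filter(mincer12, !is.na(income))
mincer12$rinc = mincer12$income / mean(mincer12$income)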

In [10]:
mincer14 = data.frame(
  income = data14$p_income, 
  age = data14$cfps2014_age,
  gender = data14$cfps_gender,
  urban = data14$urban14,
  prov = data14$provcd14,
  ethnic = data14$cfps_minzu,
  married = 0,
  party = data14$pn401a,
  postsecondary = 0,
  seniorsecondary = 0, 
  juniorsecondary = 0,
  primary = 0, 
  illiterate = 0,
  y10 = 0,
  y12 = 0,
  y14 = 1,
  y16 = 0)

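# Fill the dummies row by row from the 2014 survey codes.
for (row in 1:nrow(data14)) {
  # Marital status: code 2 is flagged as married.
  marriage = data14[row, "qea0"]
  if (marriage == 2) {
    mincer14[row, "married"] = 1
  }
  # Education: codes 1-4 map to illiterate through senior secondary;
  # any code above 4 counts as postsecondary.
  edu = data14[row, "pw1r"]
  if (edu == 1) {
    mincer14[row, "illiterate"] = 1
  } else if (edu == 2) {
    mincer14[row, "primary"] = 1
  } else if (edu == 3) {
    mincer14[row, "juniorsecondary"] = 1
  } else if (edu == 4) {
    mincer14[row, "seniorsecondary"] = 1
  } else if (edu > 4) {
    mincer14[row, "postsecondary"] = 1
  }
}
# Drop missing incomes, then express income relative to the wave mean.
mincer14 = filter(mincer14, !is.na(income))
mincer14$rinc = mincer14$income / mean(mincer14$income)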

In [11]:
mincer16 = data.frame(
  income = data16$income, 
  age = data16$age,
  gender = data16$gender,
  urban = data16$urban16,
  prov = data16$provcd16,
  ethnic = data16$ethnic,
  married = data16$married,
  party = data16$party,
  postsecondary = data16$postsecondary,
  seniorsecondary = data16$seniorsecondary, 
  juniorsecondary = data16$juniorsecondary, 
  primary = data16$primary, 
  illiterate = data16$illiterate,
  y10 = 0,
  y12 = 0,
  y14 = 0,
  y16 = 1)
mincer16 = filter(mincer16, !is.na(income))
mincer16$rinc = mincer16$income / mean(mincer16$income)
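Each wave computes relative income the same way: rows with missing income are dropped, then income is divided by that wave's mean income. A small sketch of that step as a reusable function; `add_rinc` is a hypothetical name and the incomes are made up.

```r
# Hypothetical helper: drop missing incomes and add relative income.
add_rinc <- function(df) {
  df <- df[!is.na(df$income), , drop = FALSE]
  df$rinc <- df$income / mean(df$income)
  df
}

toy <- data.frame(income = c(1000, 3000, NA, 2000))
out <- add_rinc(toy)
out$rinc        # 0.5 1.5 1.0
mean(out$rinc)  # 1
```

By construction, rinc averages to 1 within each wave, so it measures income relative to the wave's mean rather than in currency units.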

In [12]:
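# Stack the four waves. The year dummies differ across waves, so joining on
# all shared columns never matches rows across waves and every row is kept.
combined = full_join(mincer10, full_join(mincer12, full_join(mincer14, mincer16)))
# Logs of zero incomes are -Inf and logs of negative incomes are NaN.
combined$lninc = log(combined$income)
combined$lnrinc = log(combined$rinc)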

Joining, by = c("income", "age", "gender", "urban", "prov", "ethnic", "married", "party", "postsecondary", "seniorsecondary", "juniorsecondary", "primary", "illiterate", "y10", "y12", "y14", "y16", "rinc")

Joining, by = c("income", "age", "gender", "urban", "prov", "ethnic", "married", "party", "postsecondary", "seniorsecondary", "juniorsecondary", "primary", "illiterate", "y10", "y12", "y14", "y16", "rinc")

Joining, by = c("income", "age", "gender", "urban", "prov", "ethnic", "married", "party", "postsecondary", "seniorsecondary", "juniorsecondary", "primary", "illiterate", "y10", "y12", "y14", "y16", "rinc")

"NaNs produced"
"NaNs produced"


In [13]:
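# Count observations at each distinct rinc value, sorted ascending.
categorize <-
  combined %>%
  group_by(rinc) %>%
  tally()
categorize = categorize[order(categorize$rinc), ]
key = categorize$rinc
pinc = c()
sum_so_far = 0
total = nrow(combined)
categorize$pinc = 0
# pinc is the midpoint percentile of each group of tied rinc values:
# (cumulative count through the group - half the group's count) / total rows.
for (row in 1:nrow(categorize)) {
  sum_so_far = sum_so_far + categorize[row, 'n']
  categorize[row, 'pinc'] = (sum_so_far - (0.5 * categorize[row, 'n'])) / total
}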

In [14]:
# Sort by rinc and copy each group's pinc onto its rows, moving to the next
# row of categorize whenever the rinc value changes.
ordered_combined = combined[order(combined$rinc), ]
ordered_combined$pinc = 0
current_rinc = 0
current_pinc = 0
current_row = 0
for (row in 1:nrow(ordered_combined)) {
  rinc = ordered_combined[row, 'rinc']
  if (rinc != current_rinc) {
    current_row = current_row + 1
    current_rinc = rinc
    current_pinc = categorize[current_row, 'pinc']
  }
  ordered_combined[row, 'pinc'] = current_pinc
}
combined = ordered_combined
# log(pinc) is -Inf wherever pinc is 0.
combined$lnpinc = log(combined$pinc)
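The two loops above assign each observation the midpoint percentile of its group of tied rinc values: the cumulative count up to and including the group, minus half the group's size, divided by the total number of rows. The same quantity can be obtained from average ranks, because the average rank of a tie group equals that cumulative count minus half the group size plus 0.5. A minimal sketch with made-up values; `midpoint_pct` is a hypothetical name.

```r
# Hypothetical helper: midpoint percentile via average ranks.
midpoint_pct <- function(x) {
  (rank(x, ties.method = "average") - 0.5) / length(x)
}

rinc_toy <- c(2, 1, 4, 2)
midpoint_pct(rinc_toy)   # 0.500 0.125 0.875 0.500

# Cross-check against the cumulative-count definition used in the loops.
tab <- table(rinc_toy)
by_group <- (cumsum(tab) - 0.5 * tab) / length(rinc_toy)
as.numeric(by_group[as.character(rinc_toy)])   # 0.500 0.125 0.875 0.500
```

The rank-based form also does not depend on the loop's starting value `current_rinc = 0`; in the loop, if the smallest rinc were exactly 0, the first group would keep pinc = 0 and every later group would receive the previous group's value.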

In [16]:
urban_combined = filter(combined, urban == 1)
rural_combined = filter(combined, urban == 0)
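Two details of the combining step in this section can be illustrated on toy data. First, because the four wave frames share the same columns, stacking them row-wise is a possible alternative to the nested `full_join()` (base `rbind()` is used here; `dplyr::bind_rows()` behaves similarly for frames with matching columns). Second, taking logs of zero or negative incomes yields -Inf or NaN, which is where the "NaNs produced" warnings come from. The data below are made up.

```r
# Made-up wave frames with identical columns (not CFPS data).
w10 <- data.frame(income = c(500, 0),   y10 = 1, y12 = 0)
w12 <- data.frame(income = c(800, -50), y10 = 0, y12 = 1)

stacked <- rbind(w10, w12)
nrow(stacked)   # 4

# log(0) is -Inf; the log of a negative number is NaN (with a warning).
stacked$lninc <- suppressWarnings(log(stacked$income))
is.infinite(stacked$lninc)   # FALSE  TRUE FALSE FALSE
is.nan(stacked$lninc)        # FALSE FALSE FALSE  TRUE
```

A filter on `!is.infinite()` therefore removes only the zero-income rows. NaN rows remain in the data and are dropped by `lm()` as missing values, which is consistent with the "observations deleted due to missingness" notes in the regression output.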

## Regressions of lninc, rinc, lnrinc, pinc, and lnpinc on the combined data and on the urban and rural subsamples (married and party variables not included)

### Regression for lninc

In [17]:
# Drop rows where lninc is infinite (zero income), then split by urban/rural.
lnincReg = filter(combined, !is.infinite(lninc))
urban_reg = filter(lnincReg, urban == 1)
rural_reg = filter(lnincReg, urban == 0)
# Full sample, then urban and rural subsamples.
summary(lm(lninc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + urban + y10 + y12 + y14, data=lnincReg))
summary(lm(lninc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + y10 + y12 + y14, data=urban_reg))
summary(lm(lninc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + y10 + y12 + y14, data=rural_reg))
# Split by gender code. The female/male names assume gender == 1 means female,
# which is worth verifying against the codebook.
female_reg = filter(lnincReg, gender == 1)
summary(lm(lninc ~  postsecondary + seniorsecondary + juniorsecondary + primary + urban + y10 + y12 + y14, data=female_reg))
male_reg = filter(lnincReg, gender == 0)
summary(lm(lninc ~  postsecondary + seniorsecondary + juniorsecondary + primary + urban + y10 + y12 + y14, data=male_reg))
# Gender-by-location subsamples.
female_urban_reg = filter(urban_reg, gender == 1)
female_rural_reg = filter(rural_reg, gender == 1)
male_urban_reg = filter(urban_reg, gender == 0)
male_rural_reg = filter(rural_reg, gender == 0)
summary(lm(lninc ~  postsecondary + seniorsecondary + juniorsecondary + primary + y10 + y12 + y14, data=female_urban_reg))
summary(lm(lninc ~  postsecondary + seniorsecondary + juniorsecondary + primary + y10 + y12 + y14, data=female_rural_reg))
summary(lm(lninc ~  postsecondary + seniorsecondary + juniorsecondary + primary + y10 + y12 + y14, data=male_urban_reg))
summary(lm(lninc ~  postsecondary + seniorsecondary + juniorsecondary + primary + y10 + y12 + y14, data=male_rural_reg))


Call:
lm(formula = lninc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + urban + y10 + y12 + y14, data = lnincReg)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.5493  -0.5264   0.2281   0.8410   5.3336 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      8.797265   0.024534 358.578  < 2e-16 ***
postsecondary    1.732932   0.024212  71.574  < 2e-16 ***
seniorsecondary  1.265809   0.021949  57.670  < 2e-16 ***
juniorsecondary  1.050874   0.018734  56.093  < 2e-16 ***
primary          0.618875   0.020639  29.985  < 2e-16 ***
gender           0.283885   0.009443  30.064  < 2e-16 ***
urban            0.019081   0.005930   3.217  0.00129 ** 
y10             -1.089056   0.021248 -51.255  < 2e-16 ***
y12             -0.446978   0.022283 -20.059  < 2e-16 ***
y14             -1.361638   0.054684 -24.900  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.38


Call:
lm(formula = lninc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + y10 + y12 + y14, data = urban_reg)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.4944  -0.3971   0.1990   0.6733   4.4564 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      8.99563    0.03223  279.08   <2e-16 ***
postsecondary    1.49882    0.02997   50.02   <2e-16 ***
seniorsecondary  1.05557    0.02905   36.33   <2e-16 ***
juniorsecondary  0.88624    0.02711   32.70   <2e-16 ***
primary          0.55341    0.03122   17.73   <2e-16 ***
gender           0.24503    0.01227   19.96   <2e-16 ***
y10             -0.80633    0.02533  -31.84   <2e-16 ***
y12             -0.26977    0.02643  -10.21   <2e-16 ***
y14             -1.71081    0.07923  -21.59   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.242 on 23150 degrees of freedom
  (873 observations deleted due to missingness)


Call:
lm(formula = lninc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + y10 + y12 + y14, data = rural_reg)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8980 -0.6364  0.2286  0.8992  5.5796 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      8.85449    0.03690  239.93   <2e-16 ***
postsecondary    1.33277    0.05009   26.61   <2e-16 ***
seniorsecondary  1.04356    0.03552   29.38   <2e-16 ***
juniorsecondary  0.94789    0.02609   36.33   <2e-16 ***
primary          0.56980    0.02716   20.98   <2e-16 ***
gender           0.38078    0.01434   26.55   <2e-16 ***
y10             -1.40198    0.03401  -41.23   <2e-16 ***
y12             -0.66265    0.03585  -18.48   <2e-16 ***
y14             -2.65111    0.09400  -28.20   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.448 on 22284 degrees of freedom
  (985 observations deleted due to missingness)
Multipl


Call:
lm(formula = lninc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + urban + y10 + y12 + y14, data = female_reg)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.3722  -0.4926   0.2175   0.7411   5.3627 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      9.29209    0.03067 302.979  < 2e-16 ***
postsecondary    1.49283    0.03120  47.848  < 2e-16 ***
seniorsecondary  1.05032    0.02805  37.439  < 2e-16 ***
juniorsecondary  0.89743    0.02445  36.701  < 2e-16 ***
primary          0.47959    0.02663  18.011  < 2e-16 ***
urban            0.02977    0.00716   4.159 3.21e-05 ***
y10             -1.02714    0.02558 -40.163  < 2e-16 ***
y12             -0.45637    0.02674 -17.068  < 2e-16 ***
y14             -1.06519    0.06668 -15.974  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.257 on 25734 degrees of freedom
  (917 observations deleted due to missingness)


Call:
lm(formula = lninc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + urban + y10 + y12 + y14, data = male_reg)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.5806  -0.5946   0.2925   0.9042   4.7678 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      8.65403    0.03860 224.195  < 2e-16 ***
postsecondary    1.89540    0.03767  50.320  < 2e-16 ***
seniorsecondary  1.38428    0.03486  39.707  < 2e-16 ***
juniorsecondary  1.06821    0.02912  36.678  < 2e-16 ***
primary          0.62668    0.03242  19.332  < 2e-16 ***
urban            0.03121    0.01006   3.104  0.00191 ** 
y10             -1.15165    0.03518 -32.738  < 2e-16 ***
y12             -0.48242    0.03710 -13.004  < 2e-16 ***
y14             -1.64882    0.08966 -18.390  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.499 on 20260 degrees of freedom
  (952 observations deleted due to missingness)


Call:
lm(formula = lninc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + y10 + y12 + y14, data = female_urban_reg)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.4420  -0.3562   0.1734   0.6363   4.4840 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      9.48845    0.04226 224.499   <2e-16 ***
postsecondary    1.25608    0.04057  30.958   <2e-16 ***
seniorsecondary  0.81127    0.03938  20.601   <2e-16 ***
juniorsecondary  0.72650    0.03710  19.580   <2e-16 ***
primary          0.35789    0.04175   8.572   <2e-16 ***
y10             -0.73316    0.03114 -23.543   <2e-16 ***
y12             -0.30250    0.03248  -9.314   <2e-16 ***
y14             -1.33373    0.09561 -13.950   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.132 on 12387 degrees of freedom
  (430 observations deleted due to missingness)
Multiple R-squared:  0.1498,	Adjusted R-squared:  0.1493


Call:
lm(formula = lninc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + y10 + y12 + y14, data = female_rural_reg)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5980 -0.5356  0.2405  0.8424  5.6423 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      9.36113    0.04409  212.31   <2e-16 ***
postsecondary    1.14427    0.06011   19.04   <2e-16 ***
seniorsecondary  0.94628    0.04176   22.66   <2e-16 ***
juniorsecondary  0.85130    0.03236   26.30   <2e-16 ***
primary          0.48013    0.03405   14.10   <2e-16 ***
y10             -1.31387    0.03962  -33.16   <2e-16 ***
y12             -0.61447    0.04153  -14.80   <2e-16 ***
y14             -2.29392    0.11752  -19.52   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.303 on 12909 degrees of freedom
  (482 observations deleted due to missingness)
Multiple R-squared:  0.1981,	Adjusted R-squared:  0.1977 
F-statis


Call:
lm(formula = lninc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + y10 + y12 + y14, data = male_urban_reg)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.4816  -0.3791   0.2467   0.7425   4.3309 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      8.84775    0.04812 183.863  < 2e-16 ***
postsecondary    1.63390    0.04424  36.936  < 2e-16 ***
seniorsecondary  1.18685    0.04296  27.625  < 2e-16 ***
juniorsecondary  0.89749    0.03975  22.577  < 2e-16 ***
primary          0.62488    0.04677  13.361  < 2e-16 ***
y10             -0.87725    0.04052 -21.649  < 2e-16 ***
y12             -0.26280    0.04234  -6.207 5.61e-10 ***
y14             -2.19469    0.12995 -16.888  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.332 on 10693 degrees of freedom
  (443 observations deleted due to missingness)
Multiple R-squared:  0.2153,	Adjusted R-squared:  0.2148 



Call:
lm(formula = lninc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + y10 + y12 + y14, data = male_rural_reg)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8023 -0.7819  0.3052  1.0189  5.3141 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      8.81971    0.06140  143.65   <2e-16 ***
postsecondary    1.48068    0.08454   17.52   <2e-16 ***
seniorsecondary  0.98258    0.06398   15.36   <2e-16 ***
juniorsecondary  0.89573    0.04347   20.60   <2e-16 ***
primary          0.53316    0.04422   12.06   <2e-16 ***
y10             -1.52397    0.05906  -25.80   <2e-16 ***
y12             -0.83170    0.06301  -13.20   <2e-16 ***
y14             -3.00963    0.14965  -20.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.593 on 9300 degrees of freedom
  (503 observations deleted due to missingness)
Multiple R-squared:  0.1971,	Adjusted R-squared:  0.1965 
F-statistic
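The lninc regressions above all use the same education and year dummies; they differ only in which of gender and urban are included and in which subsample is used. One compact way to organise such runs is to build the formula with `reformulate()` and fit it over a named list of subsamples. The following is a sketch on simulated data; the data, the coefficients, and the helper name `run_spec` are made up for illustration and only a subset of the regressors is used.

```r
set.seed(42)
n <- 400
toy <- data.frame(
  postsecondary = rbinom(n, 1, 0.3),
  gender        = rbinom(n, 1, 0.5),
  urban         = rbinom(n, 1, 0.5),
  y10           = rbinom(n, 1, 0.5)
)
toy$lninc <- 9 + 1.5 * toy$postsecondary + 0.3 * toy$gender + rnorm(n)

base_terms <- c("postsecondary", "y10")

# Hypothetical helper: regress lninc on the base terms plus extra controls.
run_spec <- function(data, extra = character(0)) {
  lm(reformulate(c(base_terms, extra), response = "lninc"), data = data)
}

fits <- list(
  all   = run_spec(toy, c("gender", "urban")),
  urban = run_spec(toy[toy$urban == 1, ], "gender"),
  rural = run_spec(toy[toy$urban == 0, ], "gender")
)

# One row of coefficients per subsample specification.
t(sapply(fits[c("urban", "rural")], coef))
```

Calling `lapply(fits, summary)` would print the same kind of tables as the individual `summary(lm(...))` calls.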

### Regression for rinc and lnrinc

In [18]:
# rinc in levels: full sample, then urban and rural subsamples.
summary(lm(rinc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + urban + y10 + y12 + y14, data=combined))
summary(lm(rinc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + y10 + y12 + y14, data=urban_combined))
summary(lm(rinc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + y10 + y12 + y14, data=rural_combined))

# lnrinc: drop rows where the log is infinite (rinc of 0), then rebuild the subsamples.
lnrincReg = filter(combined, !is.infinite(lnrinc))
urban_reg = filter(lnrincReg, urban == 1)
rural_reg = filter(lnrincReg, urban == 0)
summary(lm(lnrinc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + urban + y10 + y12 + y14, data=lnrincReg))
summary(lm(lnrinc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + y10 + y12 + y14, data=urban_reg))
summary(lm(lnrinc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + y10 + y12 + y14, data=rural_reg))


Call:
lm(formula = rinc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + urban + y10 + y12 + y14, data = combined)

Residuals:
   Min     1Q Median     3Q    Max 
 -3.07  -0.82  -0.30   0.33 467.73 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.145090   0.037147   3.906 9.40e-05 ***
postsecondary   2.432263   0.040480  60.085  < 2e-16 ***
seniorsecondary 0.958014   0.033858  28.295  < 2e-16 ***
juniorsecondary 0.645283   0.027834  23.183  < 2e-16 ***
primary         0.284256   0.030755   9.243  < 2e-16 ***
gender          0.265572   0.014158  18.758  < 2e-16 ***
urban           0.048417   0.009626   5.030 4.92e-07 ***
y10             0.157319   0.034876   4.511 6.47e-06 ***
y12             0.069476   0.034995   1.985   0.0471 *  
y14             0.181437   0.084498   2.147   0.0318 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.829 on 75874 degrees of freedom
Mu


Call:
lm(formula = rinc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + y10 + y12 + y14, data = urban_combined)

Residuals:
    Min      1Q  Median      3Q     Max 
 -3.344  -1.061  -0.416   0.522 272.301 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.11389    0.05901   1.930   0.0536 .  
postsecondary    2.51051    0.05679  44.204  < 2e-16 ***
seniorsecondary  0.97668    0.05245  18.622  < 2e-16 ***
juniorsecondary  0.64459    0.04787  13.466  < 2e-16 ***
primary          0.29945    0.05551   5.394 6.92e-08 ***
gender           0.38530    0.02332  16.523  < 2e-16 ***
y10              0.33397    0.05083   6.570 5.09e-11 ***
y12              0.30205    0.05131   5.886 3.99e-09 ***
y14             -0.01884    0.14324  -0.132   0.8954    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.922 on 34502 degrees of freedom
Multiple R-squared:  0.0745,	Adjusted R-squared:  0


Call:
lm(formula = rinc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + y10 + y12 + y14, data = rural_combined)

Residuals:
   Min     1Q Median     3Q    Max 
 -2.14  -0.60  -0.25   0.12 468.43 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.28455    0.04838   5.881 4.10e-09 ***
postsecondary    1.64419    0.07706  21.337  < 2e-16 ***
seniorsecondary  0.68586    0.04926  13.923  < 2e-16 ***
juniorsecondary  0.54701    0.03494  15.654  < 2e-16 ***
primary          0.24471    0.03647   6.709 1.98e-11 ***
gender           0.20746    0.01797  11.544  < 2e-16 ***
y10             -0.03041    0.04771  -0.637   0.5239    
y12             -0.13346    0.04765  -2.801   0.0051 ** 
y14             -0.16933    0.12554  -1.349   0.1774    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.725 on 40329 degrees of freedom
Multiple R-squared:  0.02141,	Adjusted R-squared:  0.02122 
F


Call:
lm(formula = lnrinc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + urban + y10 + y12 + y14, data = lnrincReg)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.5493  -0.5264   0.2281   0.8410   5.3336 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -1.196444   0.024534 -48.767  < 2e-16 ***
postsecondary    1.732932   0.024212  71.574  < 2e-16 ***
seniorsecondary  1.265809   0.021949  57.670  < 2e-16 ***
juniorsecondary  1.050874   0.018734  56.093  < 2e-16 ***
primary          0.618875   0.020639  29.985  < 2e-16 ***
gender           0.283885   0.009443  30.064  < 2e-16 ***
urban            0.019081   0.005930   3.217  0.00129 ** 
y10             -0.268313   0.021248 -12.628  < 2e-16 ***
y12              0.239660   0.022283  10.755  < 2e-16 ***
y14             -0.711340   0.054684 -13.008  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.


Call:
lm(formula = lnrinc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + y10 + y12 + y14, data = urban_reg)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.4944  -0.3971   0.1990   0.6733   4.4564 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.99808    0.03223 -30.964   <2e-16 ***
postsecondary    1.49882    0.02997  50.017   <2e-16 ***
seniorsecondary  1.05557    0.02905  36.331   <2e-16 ***
juniorsecondary  0.88624    0.02711  32.696   <2e-16 ***
primary          0.55341    0.03122  17.727   <2e-16 ***
gender           0.24503    0.01227  19.965   <2e-16 ***
y10              0.01442    0.02533   0.569    0.569    
y12              0.41687    0.02643  15.772   <2e-16 ***
y14             -1.06051    0.07923 -13.385   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.242 on 23150 degrees of freedom
  (873 observations deleted due to missingness)


Call:
lm(formula = lnrinc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + y10 + y12 + y14, data = rural_reg)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8980 -0.6364  0.2286  0.8992  5.5796 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -1.13922    0.03690 -30.869   <2e-16 ***
postsecondary    1.33277    0.05009  26.608   <2e-16 ***
seniorsecondary  1.04356    0.03552  29.379   <2e-16 ***
juniorsecondary  0.94789    0.02609  36.330   <2e-16 ***
primary          0.56980    0.02716  20.977   <2e-16 ***
gender           0.38078    0.01434  26.549   <2e-16 ***
y10             -0.58123    0.03401 -17.092   <2e-16 ***
y12              0.02399    0.03585   0.669    0.503    
y14             -2.00081    0.09400 -21.286   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.448 on 22284 degrees of freedom
  (985 observations deleted due to missingness)
Multip
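In the output above, the education, gender, and urban coefficients in the lnrinc regressions match those in the corresponding lninc regressions; only the intercept and the year dummies differ. This is expected: lnrinc equals lninc minus the log of that wave's mean income, a per-wave constant that the intercept and year dummies absorb. A small simulation with two waves illustrates the point (all data and coefficients here are made up):

```r
set.seed(1)
n <- 200
toy <- data.frame(
  year = rep(c(2010, 2012), each = n / 2),
  edu  = rbinom(n, 1, 0.5)
)
toy$y10    <- as.integer(toy$year == 2010)
toy$income <- exp(8 + 0.5 * toy$edu - 0.4 * toy$y10 + rnorm(n))

# Relative income: income divided by the mean income of the same wave.
toy$rinc <- toy$income / ave(toy$income, toy$year)

fit_ln  <- lm(log(income) ~ edu + y10, data = toy)
fit_lnr <- lm(log(rinc)   ~ edu + y10, data = toy)

# The slope on edu agrees; the intercept and year dummy do not.
all.equal(coef(fit_ln)[["edu"]], coef(fit_lnr)[["edu"]])   # TRUE
```

The equivalence relies on the year dummies being in the model; without them, the two dependent variables would generally give different slopes.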

### Regression for pinc and lnpinc

In [19]:
# pinc in levels: full sample, then urban and rural subsamples.
summary(lm(pinc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + urban + y10 + y12 + y14, data=combined))
summary(lm(pinc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + y10 + y12 + y14, data=urban_combined))
summary(lm(pinc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + y10 + y12 + y14, data=rural_combined))

# lnpinc: drop rows where the log is infinite (pinc of 0), then rebuild the subsamples.
lnpincReg = filter(combined, !is.infinite(lnpinc))
urban_reg = filter(lnpincReg, urban == 1)
rural_reg = filter(lnpincReg, urban == 0)
summary(lm(lnpinc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + urban + y10 + y12 + y14, data=lnpincReg))
summary(lm(lnpinc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + y10 + y12 + y14, data=urban_reg))
summary(lm(lnpinc ~  postsecondary + seniorsecondary + juniorsecondary + primary + gender + y10 + y12 + y14, data=rural_reg))


Call:
lm(formula = pinc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + urban + y10 + y12 + y14, data = combined)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77112 -0.21207  0.00658  0.20923  0.95514 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.3737159  0.0034250 109.115  < 2e-16 ***
postsecondary    0.3246912  0.0037323  86.995  < 2e-16 ***
seniorsecondary  0.1759524  0.0031217  56.364  < 2e-16 ***
juniorsecondary  0.1366893  0.0025663  53.263  < 2e-16 ***
primary          0.0755914  0.0028356  26.658  < 2e-16 ***
gender           0.0465633  0.0013053  35.671  < 2e-16 ***
urban            0.0050757  0.0008875   5.719 1.07e-08 ***
y10              0.0318290  0.0032156   9.898  < 2e-16 ***
y12             -0.0526646  0.0032266 -16.322  < 2e-16 ***
y14             -0.0283937  0.0077908  -3.645 0.000268 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard e


Call:
lm(formula = pinc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + y10 + y12 + y14, data = urban_combined)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.78212 -0.24837  0.05726  0.22830  0.94429 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.414640   0.005576  74.365  < 2e-16 ***
postsecondary    0.306903   0.005366  57.191  < 2e-16 ***
seniorsecondary  0.166726   0.004956  33.644  < 2e-16 ***
juniorsecondary  0.126478   0.004523  27.964  < 2e-16 ***
primary          0.076421   0.005245  14.570  < 2e-16 ***
gender           0.051959   0.002203  23.582  < 2e-16 ***
y10              0.019381   0.004803   4.035 5.46e-05 ***
y12             -0.034489   0.004848  -7.113 1.15e-12 ***
y14             -0.113082   0.013534  -8.355  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2761 on 34502 degrees of freedom
Multiple R-squared:  0.1213,	A


Call:
lm(formula = pinc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + y10 + y12 + y14, data = rural_combined)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.69766 -0.19495 -0.02802  0.16987  0.95811 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.355691   0.004258  83.529  < 2e-16 ***
postsecondary    0.261606   0.006782  38.573  < 2e-16 ***
seniorsecondary  0.133588   0.004336  30.812  < 2e-16 ***
juniorsecondary  0.121965   0.003076  39.656  < 2e-16 ***
primary          0.067525   0.003210  21.035  < 2e-16 ***
gender           0.047464   0.001582  30.007  < 2e-16 ***
y10              0.043655   0.004199  10.397  < 2e-16 ***
y12             -0.061921   0.004194 -14.764  < 2e-16 ***
y14             -0.085759   0.011049  -7.761 8.59e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2398 on 40329 degrees of freedom
Multiple R-squared:  0.121,	Ad


Call:
lm(formula = lnpinc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + urban + y10 + y12 + y14, data = lnpincReg)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1687 -0.5175  0.2313  0.5502  1.9388 

Coefficients:
                 Estimate Std. Error  t value Pr(>|t|)    
(Intercept)     -1.051729   0.010432 -100.816  < 2e-16 ***
postsecondary    0.620287   0.011368   54.563  < 2e-16 ***
seniorsecondary  0.231028   0.009508   24.297  < 2e-16 ***
juniorsecondary  0.193919   0.007817   24.808  < 2e-16 ***
primary          0.120545   0.008637   13.957  < 2e-16 ***
gender           0.101872   0.003976   25.622  < 2e-16 ***
urban           -0.007539   0.002703   -2.789 0.005290 ** 
y10             -0.033663   0.009794   -3.437 0.000589 ***
y12             -0.160249   0.009828  -16.305  < 2e-16 ***
y14             -0.411285   0.023730  -17.332  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0


Call:
lm(formula = lnpinc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + y10 + y12 + y14, data = urban_reg)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1626 -0.6228  0.2951  0.5584  1.8153 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.953394   0.016533 -57.667   <2e-16 ***
postsecondary    0.568601   0.015912  35.735   <2e-16 ***
seniorsecondary  0.216612   0.014694  14.742   <2e-16 ***
juniorsecondary  0.162430   0.013411  12.112   <2e-16 ***
primary          0.131901   0.015552   8.481   <2e-16 ***
gender           0.104086   0.006533  15.932   <2e-16 ***
y10             -0.088588   0.014240  -6.221    5e-10 ***
y12             -0.124937   0.014376  -8.691   <2e-16 ***
y14             -0.661635   0.040131 -16.487   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8187 on 34502 degrees of freedom
Multiple R-squared:  0.05743,	Adjusted R-sq


Call:
lm(formula = lnpinc ~ postsecondary + seniorsecondary + juniorsecondary + 
    primary + gender + y10 + y12 + y14, data = rural_reg)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0516 -0.4573  0.1741  0.5215  2.0244 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -1.111976   0.013557 -82.022   <2e-16 ***
postsecondary    0.502659   0.021592  23.280   <2e-16 ***
seniorsecondary  0.114916   0.013803   8.325   <2e-16 ***
juniorsecondary  0.162498   0.009792  16.595   <2e-16 ***
primary          0.095430   0.010220   9.337   <2e-16 ***
gender           0.109670   0.005036  21.778   <2e-16 ***
y10              0.019277   0.013368   1.442    0.149    
y12             -0.168667   0.013353 -12.631   <2e-16 ***
y14             -0.635042   0.035178 -18.052   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7636 on 40329 degrees of freedom
Multiple R-squared:  0.04978,	Adjusted R-sq