For this problem we will use the dataset wages.csv. This dataset contains information on about 300 American workers. 
It includes their average monthly wage (wage), gender (male) and completed years of formal education (educ). 
You suspect that people with higher educational attainment earn more on average.

In [None]:
library(tidyverse)
library(haven)
library(ggplot2)

# read the dataset
data<-read.csv("https://raw.githubusercontent.com/ds-modules/ECON-140-FA22-RDE/main/Sections/103-Daniela/wages.csv")

In [None]:
# check variable names and first observations/cells to understand the dataset
colnames(data)
head(data)


In [None]:
# how many observations and variables are in the dataset? 
dim(data)


In [None]:
# how would you characterize this dataset (think about the sample, unit of analysis, time frame, etc.)?
# View(data)
summary(data)
plot(data)

In [None]:
## are any values missing?
# View(data)
colSums(is.na(data))           # number of missing values in each column
data[!complete.cases(data), ]  # rows with at least one missing value

In [None]:
# extra: are any of the columns categorical?
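One possible way to approach this question is to look at each column's storage class and its number of distinct values. The sketch below uses a small made-up data frame (`toy`, not the real wages.csv) so that it runs on its own; the same calls could be applied to `data`.

```r
# Toy stand-in for the wages data (values are made up for illustration).
toy <- data.frame(
  wage = c(1200, 950, 1800, 2100, 700, 1500),
  male = c(1, 0, 1, 1, 0, 0),
  educ = c(12, 10, 16, 18, 8, 12)
)

# Storage class of each column. A 0/1 dummy is stored as numeric,
# so class() alone will not flag it as categorical.
col_classes <- sapply(toy, class)
print(col_classes)

# Number of distinct values per column. Very few distinct values
# suggest that a column encodes categories rather than quantities.
n_distinct_values <- sapply(toy, function(col) length(unique(col)))
print(n_distinct_values)

# Frequency table for the suspected categorical column.
male_counts <- table(toy$male)
print(male_counts)
```

Here `male` is numeric in storage but takes only two values, which is the usual signature of a dummy variable.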


In [None]:
## extra: how do you transform a categorical variable into a continuous one?
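A common answer is dummy coding: a two-level category becomes a 0/1 numeric column (strictly speaking numeric rather than continuous). The sketch below assumes a hypothetical text column `gender`, which is not part of wages.csv, where the variable already arrives as the dummy `male`.

```r
# Hypothetical categorical column (not in wages.csv).
gender <- c("male", "female", "male", "male", "female", "female")

# Option 1: build the 0/1 dummy by hand (1 if male, 0 otherwise).
male_dummy <- ifelse(gender == "male", 1, 0)

# Option 2: convert to a factor and use its integer codes.
# With levels ordered (female, male) the codes are 1 and 2, so subtract 1.
gender_f <- factor(gender, levels = c("female", "male"))
male_from_factor <- as.integer(gender_f) - 1L

# Option 3: let model.matrix() do the dummy coding, as lm() does internally.
design <- model.matrix(~ gender_f)
print(design)
```

All three options yield the same 0/1 column; with more than two categories, `model.matrix()` creates one dummy per level except the reference level.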


In [None]:
## what is the proportion of males in the sample?
mean(data$male)

In [None]:
## what is the mean of the education variable? 
summary(data$educ)
mean(data$educ)

round(mean(data$educ),1)



In [None]:
## what is the standard deviation of wage? 
summary(data$wage)
sd(data$wage)

round(sd(data$wage),2)

In [None]:
## extra: do you see outliers? What would you do with outliers in wage? 
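One simple screening device is the 1.5 x IQR rule used by box plots. The sketch below applies it to a made-up wage vector (not the real data); whether flagged values should be dropped, winsorized, or kept (for example after taking `log(wage)`) is a judgment call that depends on why they are extreme.

```r
# Made-up monthly wages with one extreme value (illustration only).
wage <- c(900, 1000, 1100, 1200, 1300, 1400, 1500, 9000)

# First and third quartiles and the interquartile range.
q <- unname(quantile(wage, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences of the 1.5 x IQR rule.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Observations outside the fences are flagged as potential outliers.
outliers <- wage[wage < lower | wage > upper]
print(outliers)

# The mean is pulled up by the extreme value; the median is not.
print(c(mean = mean(wage), median = median(wage)))
```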


In [None]:
## plot a scatter diagram of the average monthly wage against the male dummy. 


## scatter plot of wage and gender
x <- data$male
y <- data$wage

plot(x, y, main = "Wage by Gender",
     xlab = "Male", ylab = "Wage",
     pch = 1)


In [None]:
## what differences do you see? Explain. 

In [None]:
## run a linear model that regresses wage on gender

wage_male  <- lm(wage~male, data)
summary(wage_male)
print(wage_male)

In [None]:
## how do you interpret the coefficients (see section slides or class 4 takeaways)? 

mean(data[data$male==0, "wage"])
mean(data[data$male==1, "wage"])
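The link between the two group means and the regression output can be checked directly. In a regression on a single 0/1 dummy, the intercept equals the mean of the group coded 0 and the slope equals the difference in group means. The sketch below verifies this on a small made-up data frame (`toy`), not on the real data.

```r
# Made-up data: three women (male = 0) and three men (male = 1).
toy <- data.frame(
  wage = c(1000, 1200, 1400, 1500, 1700, 1900),
  male = c(0, 0, 0, 1, 1, 1)
)

fit <- lm(wage ~ male, data = toy)
b <- coef(fit)

# Group means computed by hand.
mean_female <- mean(toy$wage[toy$male == 0])
mean_male   <- mean(toy$wage[toy$male == 1])

# Intercept = mean wage of women; slope = male mean minus female mean.
print(b)
print(c(mean_female = mean_female, gap = mean_male - mean_female))
```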




In [None]:
## is this evidence of discrimination or something else? How would you test this? 

mean(data[data$male==0, "educ"])
mean(data[data$male==1, "educ"])
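One way to probe the question above is to compare the coefficient on `male` with and without education as a control. The sketch below uses simulated data in which, by construction, gender has no direct effect on wages but men have more schooling on average; all numbers are assumptions chosen for illustration and say nothing about the real dataset.

```r
set.seed(1)
n <- 2000

# Simulated sample: half women (0), half men (1).
male <- rep(c(0, 1), each = n / 2)

# Assumption: men have two more years of schooling on average.
educ <- 10 + 2 * male + rnorm(n)

# Assumption: wages depend on education only, not on gender.
wage <- 500 + 100 * educ + rnorm(n, sd = 50)
sim <- data.frame(wage, male, educ)

# Raw gap: picks up the education difference between the groups.
raw_gap <- coef(lm(wage ~ male, data = sim))[["male"]]

# Gap after holding education fixed: close to zero by construction.
adj_gap <- coef(lm(wage ~ male + educ, data = sim))[["male"]]

print(c(raw_gap = raw_gap, adj_gap = adj_gap))
```

A raw gap that shrinks once education is controlled for points to composition differences rather than unequal pay for equal schooling, although omitted variables other than education could still matter.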

In [None]:
## let's repeat the exercise and now create a scatter plot of wage and education
x_2 <- data$educ
y <- data$wage

plot(x_2, y, main = "Wage by Education",
     xlab = "Education", ylab = "Wage",
     pch = 1)
abline(lm(wage ~ educ, data = data), col = "blue")  # add the fitted regression line

In [None]:
## let's perform a linear model that regresses wage on education
wage_educ        <- lm(wage~educ, data)
summary(wage_educ)
print(wage_educ)


In [None]:
## how do you interpret the constant in this case? (see section 2 slides or class 4 takeaways)

## how do you interpret the coefficient on the education variable?

## extra: how would you change this model to evaluate the theory of the "diploma effect"?

## do you think our regressions reflect the causal effect of schooling on wages? (think about confounders, sampling strategy, outliers)
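For the "diploma effect" question, one possible specification adds a dummy for having completed a degree on top of years of education, so that the dummy captures a discrete jump in wages at graduation. The sketch below uses simulated data; the 12-year threshold for a high-school diploma and the variable name `hs_diploma` are assumptions made for illustration.

```r
set.seed(2)
n <- 1000

# Simulated years of education between 8 and 16.
educ <- sample(8:16, n, replace = TRUE)

# Assumption: a diploma is earned at 12 completed years.
hs_diploma <- as.integer(educ >= 12)

# Simulated wages: a linear return of 50 per year plus a jump of 300
# at graduation (both numbers are made up).
wage <- 500 + 50 * educ + 300 * hs_diploma + rnorm(n, sd = 100)
sim <- data.frame(wage, educ, hs_diploma)

# The coefficient on hs_diploma estimates the jump at the threshold,
# holding years of education fixed.
fit_diploma <- lm(wage ~ educ + hs_diploma, data = sim)
print(coef(fit_diploma))
```

A coefficient on the dummy that is clearly different from zero would be consistent with a diploma effect, whereas a purely linear return would leave it near zero.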




In [None]:
## run linear model that regresses wage on all variables
wage_all        <- lm(wage~., data)
summary(wage_all)
print(wage_all)


In [None]:
## final discussion of section 2: let's return to the question of how to measure discrimination and the returns to education


## Ggplot
(good resource: https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html)

In [None]:
ggplot(data = data, aes(x = educ, y = wage)) +
  geom_point()

In [None]:
ggplot(data = data, aes(x = educ, y = wage)) +
  geom_point() +
  labs(title = "Returns to Schooling",
       x = "Education Years",
       y = "Wage")

In [None]:
ggplot(data = data, aes(x = educ, y = wage)) +
    geom_point(alpha = 0.1, aes(color = factor(male)))  # factor() gives two discrete colors instead of a gradient


## Replicating figures of a paper using ggplot
Trafficking Networks and the Mexican Drug War by Melissa Dell

## Faceting 
ggplot2 supports faceting, which splits one plot into several panels, one for each level of a categorical variable (factor) in the dataset.
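A minimal sketch of faceting with `facet_wrap()`, using a small made-up data frame (`toy`) rather than the real data; with the wages dataset the same call would split the education/wage scatter plot into one panel for women and one for men.

```r
library(ggplot2)

# Made-up data for illustration only.
toy <- data.frame(
  wage = c(1000, 1200, 1400, 1500, 1700, 1900),
  educ = c(10, 12, 16, 10, 12, 16),
  male = c(0, 0, 0, 1, 1, 1)
)

# One panel per value of male; label_both prints "male: 0" and "male: 1"
# as panel titles.
p <- ggplot(toy, aes(x = educ, y = wage)) +
  geom_point() +
  facet_wrap(~ male, labeller = label_both)

p
```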

## Replicating tables of a paper: stargazer

## Resources:
regressions: https://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
dplyr: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf