# Earnings Survey Analysis

This notebook analyzes the results of a survey on earnings carried out in the United States during the 1990s.

Dataset: https://vincentarelbundock.github.io/Rdatasets/doc/Ecdat/CPSch3.html


The survey responses are stored in a CSV file.

In [1]:
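# stringsAsFactors = TRUE keeps `sex` as a factor (the default before R 4.0.0),
# so that summary() reports the number of rows per level.
responses <- read.csv("CPSch3.csv", stringsAsFactors = TRUE)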

rows <- nrow(responses)
cols <- ncol(responses)

cat("Rows: ", rows, "\n")
cat("Columns: ", cols, "\n")

Rows:  11130 
Columns:  4 


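Let's take a quick look at the data. Each row holds a row index (`X`), the survey year, the average hourly earnings (`ahe`), and the respondent's sex.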

In [2]:
head(responses)

X,year,ahe,sex
1,1992,12.999118,male
2,1992,11.617962,male
3,1992,17.377293,male
4,1992,10.061266,female
5,1992,16.756676,male
6,1992,9.216171,female


Before analyzing the results, it's good practice to check the distribution of the data and look for potential outliers.

In [3]:
summary(responses)

       X              year           ahe             sex      
 Min.   :    1   Min.   :1992   Min.   : 2.136   female:5174  
 1st Qu.: 2783   1st Qu.:1992   1st Qu.:11.281   male  :5956  
 Median : 5566   Median :1994   Median :14.984                
 Mean   : 5566   Mean   :1995   Mean   :16.263                
 3rd Qu.: 8348   3rd Qu.:1996   3rd Qu.:20.000                
 Max.   :11130   Max.   :1998   Max.   :52.443                
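
Beyond `summary()`, a complementary check could count missing values and flag earnings outside the usual 1.5 × IQR fences. The sketch below is only an illustration under stated assumptions: `iqr_fences` is a hypothetical helper, not part of the original notebook, and it is applied to the six `ahe` values printed by `head(responses)` rather than to the full dataset.

```r
# Hypothetical helper: returns the 1.5 * IQR fences for a numeric vector.
iqr_fences <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - 1.5 * spread, upper = q[2] + 1.5 * spread)
}

# Illustrative input: the six `ahe` values shown by head(responses).
ahe_sample <- c(12.999118, 11.617962, 17.377293, 10.061266, 16.756676, 9.216171)

fences <- iqr_fences(ahe_sample)
n_missing <- sum(is.na(ahe_sample))
n_outside <- sum(ahe_sample < fences["lower"] | ahe_sample > fences["upper"])

cat("Missing: ", n_missing, "\n")
cat("Outside fences: ", n_outside, "\n")
```

On the full data the same helper could be called as `iqr_fences(responses$ahe)`. Values flagged this way are not necessarily errors, since the mean of `ahe` lies above its median, which suggests a right-skewed distribution.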

Everything seems fine: `summary()` reports no missing values and the ranges look plausible, so we can proceed and calculate some statistics.

The first interesting thing to compute is the mean earnings by gender.

In [4]:
mean_ahe_genre <- aggregate(responses["ahe"], by = list(genre = responses$sex), FUN = mean)
mean_ahe_genre

genre,ahe
female,15.03801
male,17.32658


We can see a difference between the wages of men and women. However, the raw difference alone gives no sense of scale. To get one, we can calculate the mean earnings across all survey responses and use it as a reference.

Before proceeding, let's check the mean earnings for each year. This matters when inflation or deflation is high, because salaries may then change significantly from one year to the next.

In [5]:
mean_ahe_year <- aggregate(responses["ahe"], by = list(year = responses$year), FUN = mean)
mean_ahe_year

year,ahe
1992,16.48495
1994,16.04378
1996,15.71825
1998,16.8041


As we can see, the yearly means are quite similar (between 15.72 and 16.80), so we can ignore this factor and calculate the mean using the entire dataset.

In [6]:
mean_ahe <- mean(responses[, "ahe"])
cat("Mean AHE: ", mean_ahe, "\n")

Mean AHE:  16.2627 


Next, let's compute the earnings by gender as a proportion of the mean.

In [7]:
within(mean_ahe_genre, ahe_scale <- mean_ahe_genre$ahe / mean_ahe)

genre,ahe,ahe_scale
female,15.03801,0.9246937
male,17.32658,1.0654189


The calculations show that men earned around 6.5% more than the average, while women earned around 7.5% less than the average.

As a final step, let's check this metric for each individual year, to see whether the earnings difference is only an artifact of averaging over all years or whether it occurred every year.
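
The percentages follow directly from the `ahe_scale` column: subtract 1 and multiply by 100. The sketch below reproduces that arithmetic from the values printed above, and also computes the female-to-male ratio as an extra reference point. The names `ahe_by_genre`, `pct_vs_mean`, and `female_to_male` are introduced here only for illustration.

```r
# Values copied from the outputs of In [4] and In [6] above.
mean_ahe <- 16.2627
ahe_by_genre <- c(female = 15.03801, male = 17.32658)

# Percentage above (+) or below (-) the overall mean.
pct_vs_mean <- (ahe_by_genre / mean_ahe - 1) * 100

# Female mean expressed as a fraction of the male mean.
female_to_male <- ahe_by_genre[["female"]] / ahe_by_genre[["male"]]

print(round(pct_vs_mean, 1))
cat("Female-to-male ratio: ", round(female_to_male, 3), "\n")
```

With these inputs, women's mean hourly earnings come out at roughly 87% of men's.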

In [8]:
mean_ahe_year_genre <- aggregate(responses["ahe"], by = list(year = responses$year, genre = responses$sex), FUN = mean)
colnames(mean_ahe_year)[colnames(mean_ahe_year) == 'ahe'] <- 'ahe_mean'
results <- merge(mean_ahe_year_genre, mean_ahe_year, by = "year")
within(results, ahe_scale <- results$ahe / results$ahe_mean)

year,genre,ahe,ahe_mean,ahe_scale
1992,female,15.22047,16.48495,0.9232952
1992,male,17.57457,16.48495,1.0660982
1994,female,15.00655,16.04378,0.93535
1994,male,16.92523,16.04378,1.0549403
1996,female,14.42531,15.71825,0.9177425
1996,male,16.8804,15.71825,1.073936
1998,female,15.49195,16.8041,0.921915
1998,male,17.94387,16.8041,1.0678269


The results show that the difference in earnings between men and women appeared in every survey year, in approximately the same proportion.
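
One way to make "approximately the same proportion" concrete is to divide the female mean by the male mean within each year. The sketch below is a minimal example and not part of the original analysis: it rebuilds the `year`, `genre`, and `ahe` columns by hand from the table above, and the names `yearly`, `ahe_table`, and `female_to_male` are assumptions made for illustration.

```r
# Values copied from the table printed by In [8]; `yearly` is an illustrative name.
yearly <- data.frame(
  year  = rep(c(1992, 1994, 1996, 1998), each = 2),
  genre = rep(c("female", "male"), times = 4),
  ahe   = c(15.22047, 17.57457, 15.00655, 16.92523,
            14.42531, 16.88040, 15.49195, 17.94387)
)

# Cross-tabulate the means: one row per year, one column per group.
ahe_table <- tapply(yearly$ahe, list(yearly$year, yearly$genre), mean)

# Female mean as a fraction of the male mean, per year.
female_to_male <- ahe_table[, "female"] / ahe_table[, "male"]
print(round(female_to_male, 3))
```

With these values the ratio stays between roughly 0.85 and 0.89 in all four years.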