In [3]:
install.packages("tidyverse", dependencies = TRUE)

Installing package into ‘/home/nbuser/R’
(as ‘lib’ is unspecified)
also installing the dependency ‘stringr’



In [4]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.4.1     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.4.0
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


In [9]:
# Dataset
mydata = read_csv("survey.csv")
head(mydata) # shows the first six rows of the data set
spec(mydata) # shows the type readr assigned to each column

Parsed with column specification:
cols(
  .default = col_character(),
  Timestamp = col_datetime(format = ""),
  Age = col_double()
)
See spec(...) for full column specifications.


Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,
2014-08-27 11:31:22,33,Male,United States,TN,,Yes,No,Sometimes,6-25,...,Don't know,No,No,Yes,Yes,No,Maybe,Don't know,No,


cols(
  Timestamp = col_datetime(format = ""),
  Age = col_double(),
  Gender = col_character(),
  Country = col_character(),
  state = col_character(),
  self_employed = col_character(),
  family_history = col_character(),
  treatment = col_character(),
  work_interfere = col_character(),
  no_employees = col_character(),
  remote_work = col_character(),
  tech_company = col_character(),
  benefits = col_character(),
  care_options = col_character(),
  wellness_program = col_character(),
  seek_help = col_character(),
  anonymity = col_character(),
  leave = col_character(),
  mental_health_consequence = col_character(),
  phys_health_consequence = col_character(),
  coworkers = col_character(),
  supervisor = col_character(),
  mental_health_interview = col_character(),
  phys_health_interview = col_character(),
  mental_vs_physical = col_character(),
  obs_consequence = col_character(),
  comments = col_character()
)
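
The column specification printed above can also be passed back to `read_csv()` through its `col_types` argument, which makes the parsing explicit. The following is only a sketch: it writes a small temporary file with three of the survey's columns (the file and the `toy` object are made up for illustration) instead of reading `survey.csv`.

```r
library(readr)

# Write a tiny stand-in file; the two rows are copied from the first rows of the survey
tmp <- tempfile(fileext = ".csv")
writeLines(c("Timestamp,Age,Gender",
             "2014-08-27 11:29:31,37,Female",
             "2014-08-27 11:29:37,44,M"), tmp)

# Same specification that readr guessed, now stated explicitly
toy <- read_csv(tmp,
                col_types = cols(
                  .default = col_character(),
                  Timestamp = col_datetime(format = ""),
                  Age = col_double()
                ))

spec(toy) # confirms the types that were applied
```

With the types spelled out, readr does not need to guess them from the file contents.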

## Once we obtain the data, we see the following:

```
cols(
  Timestamp = col_datetime(format = ""),
  Age = col_double(),
  Gender = col_character(),
  Country = col_character(),
  state = col_character(),
  self_employed = col_character(),
  family_history = col_character(),
  treatment = col_character(),
  work_interfere = col_character(),
  no_employees = col_character(),
  remote_work = col_character(),
  tech_company = col_character(),
  benefits = col_character(),
  care_options = col_character(),
  wellness_program = col_character(),
  seek_help = col_character(),
  anonymity = col_character(),
  leave = col_character(),
  mental_health_consequence = col_character(),
  phys_health_consequence = col_character(),
  coworkers = col_character(),
  supervisor = col_character(),
  mental_health_interview = col_character(),
  phys_health_interview = col_character(),
  mental_vs_physical = col_character(),
  obs_consequence = col_character(),
  comments = col_character()
)
```


## Let us make the data manageable.

1. We do not need the timestamp or the US state, so we will drop both.
2. We will need Age, Gender, self_employed, family_history, treatment, work_interfere, no_employees, and all of the other columns for our work.
3. We will not need the comments variable.

So let's start by dropping these three variables.

In [18]:
mydata2 = mydata %>%
  select(Age:obs_consequence, -state) # keep Age through obs_consequence (this leaves out Timestamp and comments), then drop state

In [19]:
head(mydata2) # shows the first six rows

Age,Gender,Country,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,...,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
37,Female,United States,,No,Yes,Often,6-25,No,Yes,...,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No
44,M,United States,,No,No,Rarely,More than 1000,No,No,...,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No
32,Male,Canada,,No,No,Rarely,6-25,No,Yes,...,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No
31,Male,United Kingdom,,Yes,Yes,Often,26-100,No,Yes,...,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes
31,Male,United States,,No,No,Never,100-500,Yes,Yes,...,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No
33,Male,United States,,Yes,No,Sometimes,6-25,No,Yes,...,Don't know,Don't know,No,No,Yes,Yes,No,Maybe,Don't know,No


In [14]:
# let's check the Age information
mydata2 %>%
 group_by(Gender) %>%
 summarise(mean_age = mean(Age, na.rm = TRUE)) # mean age for each value of Gender


Gender,mean_age
A little about you,8.0
Agender,21.0
All,100000000000.0
Androgyne,28.0
Cis Female,27.0
cis male,38.0
Cis Male,38.0
Cis Man,24.0
cis-female/femme,30.0
Enby,31.0
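
Gender was entered as free text, so the same answer shows up under several spellings (the table above lists `cis male`, `Cis Male` and `Cis Man` as separate groups). A minimal sketch of collapsing such spellings in R follows. The toy data, the column names `gender_key` and `gender_rec`, and the three target categories are assumptions made for illustration; the real column contains more variants than are handled here.

```r
library(dplyr)
library(stringr)

# Toy data: spellings taken from the outputs shown above
toy <- tibble(Gender = c("Female", "M", "Male", "cis male", "Cis Female", "Enby"))

toy_clean <- toy %>%
  mutate(
    gender_key = str_to_lower(str_trim(Gender)), # normalise case and whitespace
    gender_rec = case_when(
      gender_key %in% c("m", "male", "cis male", "cis man") ~ "Male",
      gender_key %in% c("f", "female", "cis female", "cis-female/femme") ~ "Female",
      TRUE ~ "Other" # everything not listed above
    )
  )

toy_clean %>%
  count(gender_rec)
```

Lower-casing first means that `cis male` and `Cis Male` need only one entry in the lookup.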


## The output above is not what we want. We will need to clean up the Age and Gender information.
You can do this in a spreadsheet, here in R, or, better yet, with a combination of the two: take the table from here, open the data in a spreadsheet, and clean up the gender information first, then the age information. But first, check what's going on with age:

In [20]:
mydata2 %>%
 summary() # summarises every column; Age is the only numeric one

      Age                Gender            Country          self_employed     
 Min.   :-1.726e+03   Length:1259        Length:1259        Length:1259       
 1st Qu.: 2.700e+01   Class :character   Class :character   Class :character  
 Median : 3.100e+01   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 7.943e+07                                                           
 3rd Qu.: 3.600e+01                                                           
 Max.   : 1.000e+11                                                           
 family_history      treatment         work_interfere     no_employees      
 Length:1259        Length:1259        Length:1259        Length:1259       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                              

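Something is wrong with the Age variable: the minimum is negative and the maximum is 1e+11, which pulls the mean up to about 7.9e+07. Convert Age into a categorical variable using cut_number(), which comes from ggplot2 and is loaded with the tidyverse.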

In [36]:
mydata3 = mydata2 %>%
 mutate(age_rec = cut_number(Age, n = 4)) # cut Age into 4 groups with roughly equal numbers of observations
head(mydata3) # check that the new column was added

Age,Gender,Country,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,age_rec
37,Female,United States,,No,Yes,Often,6-25,No,Yes,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,"(36,1e+11]"
44,M,United States,,No,No,Rarely,More than 1000,No,No,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,"(36,1e+11]"
32,Male,Canada,,No,No,Rarely,6-25,No,Yes,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,"(31,36]"
31,Male,United Kingdom,,Yes,Yes,Often,26-100,No,Yes,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,"(27,31]"
31,Male,United States,,No,No,Never,100-500,Yes,Yes,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,"(27,31]"
33,Male,United States,,Yes,No,Sometimes,6-25,No,Yes,...,Don't know,No,No,Yes,Yes,No,Maybe,Don't know,No,"(31,36]"
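
Before relabelling the groups, it can help to look at the intervals that `cut_number()` produced and at how extreme values are handled. The sketch below uses made-up ages (the vector `ages` is not the survey data) that include two implausible values of the kind seen in the summary.

```r
library(ggplot2) # cut_number() lives in ggplot2

# Made-up ages, including one negative and one absurdly large value
ages <- c(-1726, 18, 23, 27, 29, 31, 33, 36, 40, 45, 60, 1e11)

age_rec <- cut_number(ages, n = 4) # 4 bins with roughly equal counts

levels(age_rec) # the intervals, lowest first; new labels must follow this order
table(age_rec)  # number of observations per bin
```

Because the breaks are quantiles, the extreme values only stretch the outer intervals; they do not shift the middle ones.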


In [37]:
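# relabel the four intervals in order; upper bounds are inclusive, so age 27 falls in the first group
levels(mydata3$age_rec) = c("lt 27", "27-31", "32-36", "gte 37")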
mydata3 %>%
 count(age_rec) %>%
 mutate(pct = n * 100 / sum(n) ) # tabulate to see you have done the right thing

age_rec,n,pct
lt 27,369,29.30898
27-31,283,22.47816
32-36,309,24.54329
gte 37,298,23.66958


## In this way, we keep every row even though Age contained absurd values: those values fall into the lowest and highest categories, and Age is now a categorical variable
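
If the implausible ages should be treated as missing instead of being absorbed into the outer categories, one possible alternative is sketched here. The plausible range of 15 to 100, the toy data, and the column name `Age_clean` are assumptions made for illustration, not values from the survey documentation.

```r
library(dplyr)

# Toy data with two implausible values of the kind seen in the summary
toy <- tibble(Age = c(37, 44, 32, -1726, 1e11))

toy_checked <- toy %>%
  mutate(Age_clean = if_else(Age >= 15 & Age <= 100, Age, NA_real_)) # out-of-range ages become NA

summary(toy_checked$Age_clean) # the NA count is reported alongside the quartiles
```

Functions such as `mean()` then need `na.rm = TRUE`, as in the grouped summary earlier.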

In [54]:
# we do not need the Age variable any more, so we get rid of it

mydata4 = mydata3 %>%
   select(-Age) # remove the Age variable but keep the others
# What happens with the number of employees?
# tally that variable first and see what we get

mydata4 %>%
 count(no_employees) %>%
 arrange(desc(no_employees))
# The categories sort as text, not by company size, so this has to be fixed.
# Do not fix this in Microsoft Excel: it can auto-convert ranges such as 1-5 or 6-25 to dates
# Change no_employees to a factor variable
mydata4$no_employees = as.factor(mydata4$no_employees)
levels(mydata4$no_employees)
# The levels are still sorted as text, which is not quite what we want
# We save it as a different variable and recode

mydata4$employee = mydata4$no_employees # creates a new variable


In [71]:
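# recode() matches a factor by its level labels, so recode the integer codes instead.
# Check levels(mydata4$employee) to confirm which code belongs to which category.
mydata4$employee1 = recode(as.integer(mydata4$employee),
       '1' = 1,
       '2' = 4,
       '3' = 3,
       '4' = 5,
       '5' = 2,
       '6' = 6)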

mydata4 %>%
count(employee1)

mydata4$employee2 = as.factor(mydata4$employee1)
levels(mydata4$employee2) = c('1-5', '6-25', '26-100', '100-500', '500-1000', 'More than 1000')
levels(mydata4$employee2)

mydata4 %>%
 count(employee2)

employee1,n
1,162
2,60
3,289
4,290
5,176
6,282


employee2,n
1-5,162
6-25,60
26-100,289
100-500,290
500-1000,176
More than 1000,282
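
A more direct route to the same ordering is to build the factor with its levels listed explicitly, which avoids working with integer codes altogether. This is a sketch on toy data, not the notebook's method: `size_levels`, `toy` and `toy2` are names introduced here, and the labels are the six categories used above.

```r
library(dplyr)

# Categories in the order we want them displayed, smallest company first
size_levels <- c("1-5", "6-25", "26-100", "100-500", "500-1000", "More than 1000")

# Toy data in arbitrary order
toy <- tibble(no_employees = c("6-25", "More than 1000", "1-5", "26-100",
                               "100-500", "500-1000", "6-25"))

toy2 <- toy %>%
  mutate(employee = factor(no_employees, levels = size_levels)) # labels not in size_levels would become NA

levels(toy2$employee)
toy2 %>%
  count(employee) # rows come out in size order
```

Since each label is matched by name, the result does not depend on how R happens to sort the original text values.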


## Now, we are in a position to do some more analyses.