**Paper**: [Wrangling categorical data in R](https://peerj.com/preprints/3163/)

In [2]:
library(tidyverse)

In [3]:
GSS <- read_csv('https://raw.githubusercontent.com/dsscollection/factor-mgmt/master/data/GSScleaned.csv')
GSS %>% head()


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  Year = col_double(),
  ID = col_double(),
  LaborStatus = col_character(),
  OccupationalPrestigeScore = col_double(),
  MaritalStatus = col_character(),
  NumChildren = col_double(),
  Age = col_character(),
  HighestSchoolCompleted = col_double(),
  Sex = col_character(),
  Race = col_character(),
  ChildhoodFamilyIncome = col_character(),
  TotalFamilyIncome = col_character(),
  RespondentIncome = col_character(),
  PoliticalParty = col_character(),
  OpinionOfIncome = col_character(),
  SexualOrientation = col_character()
)



Year,ID,LaborStatus,OccupationalPrestigeScore,MaritalStatus,NumChildren,Age,HighestSchoolCompleted,Sex,Race,ChildhoodFamilyIncome,TotalFamilyIncome,RespondentIncome,PoliticalParty,OpinionOfIncome,SexualOrientation
2014,1,Working fulltime,0,Divorced,0,53.0,16,Male,White,Below average,$25000 or more,$25000 or more,Not str republican,Above average,Heterosexual or straight
2014,2,Working fulltime,0,Married,0,26.0,16,Female,White,Average,$25000 or more,$25000 or more,Not str republican,Above average,Heterosexual or straight
2014,3,"Unempl, laid off",0,Divorced,1,59.0,13,Male,White,Below average,$25000 or more,Not applicable,Strong republican,Below average,Heterosexual or straight
2014,4,Working parttime,0,Married,2,56.0,16,Female,White,Below average,$25000 or more,$10000 - 14999,Not str republican,Above average,Heterosexual or straight
2014,5,Retired,0,Married,3,74.0,17,Female,White,Above average,Refused,Not applicable,Independent,Average,Heterosexual or straight
2014,6,Working fulltime,0,Married,1,56.0,17,Female,White,Above average,$25000 or more,$25000 or more,Strong republican,Above average,No answer


In [4]:
GSS %>% glimpse()

Rows: 2,540
Columns: 16
$ Year                      <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014,...
$ ID                        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
$ LaborStatus               <chr> "Working fulltime", "Working fulltime", "...
$ OccupationalPrestigeScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ MaritalStatus             <chr> "Divorced", "Married", "Divorced", "Marri...
$ NumChildren               <dbl> 0, 0, 1, 2, 3, 1, 2, 2, 4, 3, 2, 0, 5, 2,...
$ Age                       <chr> "53.000000", "26.000000", "59.000000", "5...
$ HighestSchoolCompleted    <dbl> 16, 16, 13, 16, 17, 17, 12, 17, 10, 15, 5...
$ Sex                       <chr> "Male", "Female", "Male", "Female", "Fema...
$ Race                      <chr> "White", "White", "White", "White", "Whit...
$ ChildhoodFamilyIncome     <chr> "Below average", "Average", "Below averag...
$ TotalFamilyIncome         <chr> "$25000 or more", "$25000 or more", "$250...
$ RespondentIncome          

# Changing the labels of factor levels

Our first example works with the `LaborStatus` variable. It is a categorical variable with 9 levels. Most of the labels are spelled out fully, but a few are strangely formatted. We want to change this.
This is a specific case of the more general problem of changing the text of factor labels so they appear more nicely formatted in a plot.

In [22]:
LaborStatus <- as.factor(GSS$LaborStatus)

In [28]:
levels(LaborStatus)

In [32]:
LaborStatus %>% summary()

Spell out some labels fully:

In [38]:
# fct_recode() takes 'new label' = 'old label' pairs; new labels have no trailing spaces
LaborStatus1 <- LaborStatus %>%
  fct_recode('Temporarily not working' = 'Temp not working',
             'Unemployed, laid off' = 'Unempl, laid off',
             'Working full time' = 'Working fulltime',
             'Working part time' = 'Working parttime')

In [39]:
LaborStatus1 %>% levels()

In [40]:
LaborStatus1 %>% summary()
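
As a quick check of the recoding pattern, the sketch below applies `fct_recode()` to a small made-up vector and confirms that renaming changes the labels but not the counts. The `status` values are illustrative stand-ins, not taken from the GSS file.

```r
library(forcats)

# Hypothetical toy data standing in for GSS$LaborStatus
status <- factor(c("Working fulltime", "Temp not working",
                   "Working fulltime", "Retired"))

# fct_recode() takes "new label" = "old label" pairs
status_recoded <- fct_recode(status,
  "Working full time"       = "Working fulltime",
  "Temporarily not working" = "Temp not working"
)

levels(status_recoded)

# Renaming should leave the number of observations per level unchanged
counts_before <- as.vector(table(status))
counts_after  <- as.vector(table(status_recoded))
stopifnot(identical(counts_before, counts_after))
```

Comparing the count vectors before and after is a cheap way to catch a recode that accidentally merged or dropped a level.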

### Aside - Editing whitespace out of levels

A more general problem sometimes arises due to extra spaces included when data are ingested. Such whitespace can be dealt with when data are read, or addressed later using string operations.

In [42]:
gender <- factor(c("male ", "male  ", "male    ", "male"))

levels(gender)

In [45]:
gender_trim <- gender %>% fct_relabel(str_trim)

gender_trim

levels(gender_trim)
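
The aside mentions that whitespace can also be handled when the data are read. A minimal base R sketch of that option follows, assuming the values arrive as unquoted CSV fields. The inline `csv_lines` data are made up for illustration.

```r
# Hypothetical CSV content with trailing spaces, supplied inline
csv_lines <- c("gender", "male ", "male  ", "male    ", "male")

# Default behaviour keeps the padding, so each variant becomes its own level
raw <- read.csv(text = csv_lines, stringsAsFactors = TRUE)
nlevels(raw$gender)

# strip.white = TRUE trims leading and trailing whitespace
# from unquoted character fields as the data are read
clean <- read.csv(text = csv_lines, stringsAsFactors = TRUE,
                  strip.white = TRUE)
levels(clean$gender)
```

Trimming at read time keeps the padded variants from ever becoming separate levels, whereas `fct_relabel(str_trim)` repairs a factor that already exists.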

# Reordering factor levels

In [46]:
income <- as.factor(GSS$OpinionOfIncome)

In [47]:
levels(income)

In [48]:
summary(income)

In [50]:
income1 <- income %>% fct_relevel('Far above average', 'Above average', 'Average', 'Below average', 'Far below average')

In [51]:
levels(income1)

In [52]:
summary(income1)
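
For reference, `fct_relevel()` moves the named levels to the front in the order given and leaves any unnamed levels after them in their existing order. A small sketch on made-up responses (the `opinion` vector is illustrative only):

```r
library(forcats)

# Hypothetical toy responses; levels are alphabetical by default
opinion <- factor(c("Average", "Below average", "Above average",
                    "No answer", "Average"))
levels(opinion)

# Named levels go first, in the order listed; "No answer" stays at the end
opinion_ordered <- fct_relevel(opinion,
                               "Below average", "Average", "Above average")
levels(opinion_ordered)

# Reordering changes the level order only, not the underlying values
stopifnot(identical(as.character(opinion), as.character(opinion_ordered)))
```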

# Combine several levels into one

### Combine discrete levels

In [53]:
MaritalStatus <- as.factor(GSS$MaritalStatus)

In [55]:
levels(MaritalStatus)

In [56]:
summary(MaritalStatus)

In [58]:
MaritalStatus1 <- MaritalStatus %>% fct_collapse('Not married' = c('Divorced', 'Never married', 'Widowed', 'Separated'))

In [60]:
levels(MaritalStatus1)

In [61]:
summary(MaritalStatus1)
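
The same collapse can be tried on a small made-up vector to see how the counts of the merged levels add up. The `marital` values below are hypothetical, not drawn from the GSS file.

```r
library(forcats)

# Hypothetical toy data standing in for GSS$MaritalStatus
marital <- factor(c("Married", "Divorced", "Never married",
                    "Widowed", "Married", "Separated"))

# Every level listed on the right-hand side is merged into "Not married"
marital_collapsed <- fct_collapse(marital,
  "Not married" = c("Divorced", "Never married", "Widowed", "Separated")
)

table(marital_collapsed)

# Collapsing merges levels but should not drop any observations
stopifnot(length(marital_collapsed) == length(marital))
```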

### Combine numeric-type levels

In this data, age is provided as an integer for respondents 18-88, but also includes the possible answers `89 or older`, `No answer` and `NA`. A common data wrangling task might be to turn this into a factor variable with two levels: 18-65, and over 65. In this case, it would be easier to deal with a conditional statement about the numeric values, rather than writing out each of the numbers as a character vector.

In [66]:
Age <- as.factor(GSS$Age)

In [67]:
levels(Age)

In [68]:
summary(Age)

In [74]:
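# parse_number() keeps the leading number ('89 or older' becomes 89);
# 'No answer' cannot be parsed and becomes NA with a warning
GSS %>% mutate(tidyAge = parse_number(Age)) %>%
  mutate(tidyAge = if_else(tidyAge <= 65, '18-65', 'Over 65')) %>%
  mutate(tidyAge = as.factor(tidyAge)) %>%
  pull(tidyAge) %>% summary()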

Problem with `mutate()` input `tidyAge`.
i 9 parsing failures.
 row col expected    actual
  31  -- a number No answer
  99  -- a number No answer
 835  -- a number No answer
1267  -- a number No answer
1320  -- a number No answer
.... ... ........ .........
See problems(...) for more details.
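
The parsing failures reported above come from the `No answer` responses. One way to avoid them, sketched here on a made-up vector rather than the GSS file, is to declare `No answer` as a missing-value string through the `na` argument of `parse_number()`, so those entries are read as `NA` up front. The `age_raw` values and the group labels are assumptions for illustration.

```r
library(readr)
library(dplyr)

# Hypothetical toy data mimicking the kinds of values in GSS$Age
age_raw <- c("53.000000", "26.000000", "65.000000",
             "89 or older", "No answer", NA)

# Treat "No answer" as missing instead of letting it fail to parse;
# "89 or older" still yields 89 because parse_number() keeps the first number
age_num <- parse_number(age_raw, na = c("", "NA", "No answer"))

# Conditional on the numeric value; missing ages stay NA
age_group <- if_else(age_num <= 65, "18-65", "Over 65")
age_group <- factor(age_group, levels = c("18-65", "Over 65"))

summary(age_group)
```

Setting the factor levels explicitly keeps both groups in the result even if one of them happens to be empty in a given subset.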

# Creating derived categorical variables

# Defensive Coding