# Dataframe manipulation with tidyr


Researchers often want to convert their data from the ‘wide’ to the ‘long’ format, or vice versa. In the ‘long’ format:

- each column is a variable
- each row is an observation

In the ‘long’ format, you usually have one column for the observed values, and the other columns are ID variables.

In the ‘wide’ format, each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observations of multiple variables (or both). Many (but not all!) of R’s functions have been designed assuming you have ‘long’ format data. **This tutorial will help you efficiently transform your data regardless of original format.**

For humans, format affects readability. Wide format is often more intuitive since we can see more of the data on the screen. Long format is more machine readable and is closer to the formatting of databases. The ID variables in our dataframes are similar to the fields in a database and observed variables are like the database values.
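As a minimal sketch of the two layouts, the same four measurements can be stored either way. The `site`, `year`, and `temp` names and all values below are made up for illustration and are not part of gapminder.

```r
# Hypothetical toy data: two sites measured in two years.

# Wide: one row per site, one column per year of the same measurement.
wide <- data.frame(
  site      = c("A", "B"),
  temp_2001 = c(10.5, 12.1),
  temp_2002 = c(11.0, 12.4),
  stringsAsFactors = FALSE
)

# Long: one row per observation; site and year are ID variables,
# temp is the single observed variable.
long <- data.frame(
  site = c("A", "B", "A", "B"),
  year = c(2001L, 2001L, 2002L, 2002L),
  temp = c(10.5, 12.1, 11.0, 12.4),
  stringsAsFactors = FALSE
)

dim(wide)  # 2 rows, 3 columns
dim(long)  # 4 rows, 3 columns
```

Both objects hold the same information; only the layout differs.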

![alt text](14-tidyr-fig1.png)

## Getting started

First install the packages if you haven’t already done so (you probably installed dplyr in the previous lesson); then load the packages.

In [None]:
install.packages(c("gapminder"), repos='http://cran.us.r-project.org')

In [None]:
#install.packages("tidyr")
#install.packages("dplyr")
library("tidyr")
library("dplyr")
library("gapminder")  # provides the gapminder dataframe used below

First, let’s look at the structure of our original gapminder dataframe using str():

In [None]:
str(gapminder)

### Question 1
**Is gapminder a purely long, purely wide, or some intermediate format?**  
Ans. The original gapminder data.frame is in an intermediate format. It is not purely long since it has multiple observation variables (e.g. pop, lifeExp, gdpPercap).



Sometimes, as with the gapminder dataset, we have multiple types of observed data. It is somewhere in between the purely ‘long’ and ‘wide’ data formats. We have 3 “ID variables” (continent, country, year) and 3 “Observation variables” (pop, lifeExp, gdpPercap). This intermediate format is often preferred because it is readable, flexible to analyze, and keeps observation variables with different units in separate columns.

Many functions in R operate on whole columns, and you usually do not want to do mathematical operations on values with different units. For example, using the purely long format, a single mean for all of the values of population, life expectancy, and GDP would not be meaningful. The solution is either to restructure your dataframe or to group it by observation type with a dplyr function before summarizing.
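A minimal sketch of the grouping approach, using a made-up purely long dataframe (`toy_long` and its values are hypothetical, not gapminder data):

```r
library(dplyr)

# Hypothetical long-format data mixing two observation types.
toy_long <- data.frame(
  obs_type   = c("pop", "pop", "lifeExp", "lifeExp"),
  obs_values = c(1000, 3000, 60, 70),
  stringsAsFactors = FALSE
)

# A single mean mixes head counts and years, so it is not meaningful.
overall_mean <- mean(toy_long$obs_values)

# Grouping first gives one mean per observation type.
type_means <- toy_long %>%
  group_by(obs_type) %>%
  summarize(mean_value = mean(obs_values))

type_means
```

Grouping keeps each unit separate without having to restructure the dataframe.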

## From wide to long format with tidyr::gather()

Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. our own research data) will never be so well organized. Here let’s start with the wide format version of the gapminder dataset. 

**Download the wide version of the gapminder data.**


We’ll load the data file and look at it. Note: we don’t want our continent and country columns to be factors, so we use the stringsAsFactors argument for read.csv() to disable that.

In [None]:
gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
str(gap_wide)

![alt text](14-tidyr-fig2.png)

The first step towards getting our nice intermediate data format is to convert from the wide to the long format. The tidyr function gather() will ‘gather’ your observation variables into a single variable.



![alt text](14-tidyr-fig3.png)

In [None]:
gap_long <- gap_wide %>%
    gather(obstype_year, obs_values, starts_with('pop'),
           starts_with('lifeExp'), starts_with('gdpPercap'))
str(gap_long)

Here we have used piping syntax which is similar to what we were doing in the previous lesson with dplyr. In fact, these are compatible and you can use a mix of tidyr and dplyr functions by piping them together.

Inside gather() we first name the new column for the new ID variable (obstype_year), then the new amalgamated observation variable (obs_values), then the names of the old observation variables. We could have typed out all the observation variables, but as in the select() function (see dplyr lesson), we can use the starts_with() helper to select all variables that start with the desired character string. gather() also allows the alternative syntax of using the - symbol to identify which variables are not to be gathered (i.e. ID variables).

![alt text](14-tidyr-fig4.png)

In [None]:
gap_long <- gap_wide %>% gather(obstype_year,obs_values,-continent,-country)
str(gap_long)

That may seem trivial with this particular dataframe, but sometimes you have 1 ID variable and 40 observation variables with irregular variable names. The flexibility is a huge time saver!
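The two selection styles can be compared on a small made-up dataframe. This is only a sketch: `toy_wide` and its values are hypothetical, and its columns are deliberately ordered so that both selections pick them up in the same order.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data: one ID variable, four observation variables.
toy_wide <- data.frame(
  country      = c("X", "Y"),
  pop_2002     = c(100, 200),
  pop_2007     = c(110, 220),
  lifeExp_2002 = c(70, 75),
  lifeExp_2007 = c(71, 76),
  stringsAsFactors = FALSE
)

# Select the columns to gather by naming them (via a helper)...
by_inclusion <- toy_wide %>%
  gather(obstype_year, obs_values,
         starts_with("pop"), starts_with("lifeExp"))

# ...or by excluding the ID variable.
by_exclusion <- toy_wide %>%
  gather(obstype_year, obs_values, -country)

# Compare the two results.
all.equal(by_inclusion, by_exclusion)
```

With many irregularly named observation variables, the exclusion form is usually the shorter one to write.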

Now obstype_year actually contains 2 pieces of information: the observation type (pop, lifeExp, or gdpPercap) and the year. We can use the separate() function to split the character strings into multiple variables.

In [None]:
gap_long <- gap_long %>% separate(obstype_year,into=c('obs_type','year'),sep="_")
gap_long$year <- as.integer(gap_long$year)
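As a possible alternative to converting the year in a second step, separate() has a `convert` argument that runs type conversion on the new columns. The sketch below uses a made-up two-row dataframe (`toy_long` is hypothetical) rather than gap_long.

```r
library(tidyr)

# Hypothetical long data with a combined "type_year" key column.
toy_long <- data.frame(
  obstype_year = c("pop_2002", "lifeExp_2007"),
  obs_values   = c(100, 76),
  stringsAsFactors = FALSE
)

# Split on "_" and let separate() convert the new columns where it can,
# so year becomes an integer while obs_type stays a character vector.
toy_split <- separate(toy_long, obstype_year,
                      into = c("obs_type", "year"),
                      sep = "_", convert = TRUE)

str(toy_split)
```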

### Question 2
**Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions we learned in the dplyr lesson.**  
Ans.

In [None]:
gap_long %>% group_by(continent, obs_type) %>%
    summarize(means = mean(obs_values))

## From long to intermediate format with tidyr::spread()


Now let’s use the opposite of gather() to spread our observation variables back out with the aptly named spread(). We can spread gap_long back to the original intermediate format or to the widest format. Let’s start with the intermediate format.

In [None]:
gap_normal <- gap_long %>% spread(obs_type,obs_values)
dim(gap_normal)
dim(gapminder)


In [None]:
names(gap_normal)
names(gapminder)


Now we’ve got an intermediate dataframe gap_normal with the same dimensions as the original gapminder, but the order of the variables is different. Let’s fix that before checking if they are all.equal().

In [None]:
gap_normal <- gap_normal[,names(gapminder)]
all.equal(gap_normal,gapminder)

In [None]:
head(gap_normal)
head(gapminder)

We’re almost there; the original was sorted by country, continent, then year.

In [None]:
gap_normal <- gap_normal %>% arrange(country,continent,year)
all.equal(gap_normal,gapminder)

That’s great! We’ve gone from the longest format back to the intermediate and we didn’t introduce any errors in our code.
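The same round trip can be sketched on a small made-up dataframe, which makes each step easy to inspect by eye. `toy` and its values are hypothetical; the steps mirror the ones used above (gather, spread, arrange, reorder columns, compare).

```r
library(tidyr)
library(dplyr)

# Hypothetical intermediate-format data: 2 ID variables, 2 observation variables.
toy <- data.frame(
  country = c("X", "X", "Y", "Y"),
  year    = c(2002L, 2007L, 2002L, 2007L),
  lifeExp = c(70, 71, 75, 76),
  pop     = c(100, 110, 200, 220),
  stringsAsFactors = FALSE
)

# Intermediate -> long
toy_long <- toy %>% gather(obs_type, obs_values, lifeExp, pop)

# Long -> intermediate, then restore the original row order
toy_back <- toy_long %>%
  spread(obs_type, obs_values) %>%
  arrange(country, year)

# Restore the original column order before comparing
toy_back <- toy_back[, names(toy)]
all.equal(toy_back, toy)
```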

Now let’s convert the long format all the way back to the wide. In the wide format, we will keep country and continent as ID variables and spread the observations across the 3 metrics (pop, lifeExp, gdpPercap) and time (year). First we need to create appropriate labels for all our new variables (time * metric combinations), and we also need to unify our ID variables to simplify the process of defining gap_wide.

In [None]:
gap_temp <- gap_long %>% unite(ID_var,continent,country,sep="_")
str(gap_temp)

In [None]:
gap_temp <- gap_long %>%
    unite(ID_var,continent,country,sep="_") %>%
    unite(var_names,obs_type,year,sep="_")
str(gap_temp)

Using unite() we now have a single ID variable, which is a combination of continent and country, and we have defined the variable names. We’re now ready to pipe in spread().

In [None]:
gap_wide_new <- gap_long %>%
    unite(ID_var,continent,country,sep="_") %>%
    unite(var_names,obs_type,year,sep="_") %>%
    spread(var_names,obs_values)
str(gap_wide_new)
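To see what unite() followed by spread() does to the column names, here is a minimal sketch on made-up data (`toy_long` and its values are hypothetical):

```r
library(tidyr)
library(dplyr)

# Hypothetical long data: one metric observed in two years for two countries.
toy_long <- data.frame(
  country    = c("X", "X", "Y", "Y"),
  obs_type   = c("pop", "pop", "pop", "pop"),
  year       = c(2002L, 2007L, 2002L, 2007L),
  obs_values = c(100, 110, 200, 220),
  stringsAsFactors = FALSE
)

toy_wide <- toy_long %>%
  unite(var_names, obs_type, year, sep = "_") %>%  # builds "pop_2002", "pop_2007"
  spread(var_names, obs_values)                    # one column per united name

names(toy_wide)  # "country" "pop_2002" "pop_2007"
```

Each distinct value of the united column becomes one new column, and each country ends up on a single row.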

### Question 3
**Take this 1 step further and create a gap_ludicrously_wide format dataframe by spreading over countries, year and the 3 metrics. Hint: this new dataframe should only have 5 rows.**  
Ans.

In [None]:
gap_ludicrously_wide <- gap_long %>%
    unite(var_names,obs_type,year,country,sep="_") %>%
    spread(var_names,obs_values)

Now we have a great ‘wide’ format dataframe, but the ID_var could be more usable. Let’s separate it into 2 variables with separate().

In [None]:
# Option 1: split the ID column of the dataframe we already built
gap_wide_betterID <- separate(gap_wide_new,ID_var,c("continent","country"),sep="_")
# Option 2: build the same result as a single pipeline starting from gap_long
gap_wide_betterID <- gap_long %>%
    unite(ID_var, continent,country,sep="_") %>%
    unite(var_names, obs_type,year,sep="_") %>%
    spread(var_names, obs_values) %>%
    separate(ID_var, c("continent","country"),sep="_")
str(gap_wide_betterID)

In [None]:
all.equal(gap_wide, gap_wide_betterID)

## Key Points

Use the **tidyr** package to change the layout of dataframes.  
Use **gather()** to go from wide to long format.  
Use **spread()** to go from long to wide format.