# From wide to long and vice versa

Mark Klik & Misja Mikkers

# Packages


In [30]:
library(tidyverse)

# Introduction

A dataset often comes in a _wide_ format, with one column per year:

In [31]:
data1 <- data.frame(
  ID = c(1, 2, 3),
  Year_1 = c("a", "e", "i"),
  Year_2 = c("b", "f", "j"),
  Year_3 = c("c", "g", "k"),
  Year_4 = c("d", "h", "l"), stringsAsFactors = FALSE)
data1

ID,Year_1,Year_2,Year_3,Year_4
1,a,b,c,d
2,e,f,g,h
3,i,j,k,l


But often we need the data in a _long_ format, with one row per ID and year:

In [32]:
data1a <- gather(data1, Year, Value, Year_1:Year_4) %>%
    arrange(ID)
data1a

ID,Year,Value
1,Year_1,a
1,Year_2,b
1,Year_3,c
1,Year_4,d
2,Year_1,e
2,Year_2,f
2,Year_3,g
2,Year_4,h
3,Year_1,i
3,Year_2,j
3,Year_3,k
3,Year_4,l


For the creation of figures, you often need a long format. In this notebook you will learn how to change dataframes from wide to long and vice versa.

# Example

## Dataframe

First, we create a wide dataframe:


In [33]:
data1 <- data.frame(
  ID = c(1, 2, 3),
  Year_1 = c("a", "e", "i"),
  Year_2 = c("b", "f", "j"),
  Year_3 = c("c", "g", "k"),
  Year_4 = c("d", "h", "l"),
  stringsAsFactors = FALSE)
data1


ID,Year_1,Year_2,Year_3,Year_4
1,a,b,c,d
2,e,f,g,h
3,i,j,k,l


## From _wide_ to _long_

If we want to change the format from _wide_ to _long_, we use the function `gather()` with the following syntax:


`gather(data, key = "key", value = "value", ...)`


The parameter _key_ is the name you want to give to the _key-column_ (in our example _Year_). The parameter _value_ is the name you want to give to the _value-column_ (in our example _Value_). In place of the dots you fill in the columns that contain the values (in our case _Year_1_ to _Year_4_).

The command works like this:


In [34]:
data2 <- data1 %>%
  gather(Year, Value, Year_1:Year_4)
print(data2)

   ID   Year Value
1   1 Year_1     a
2   2 Year_1     e
3   3 Year_1     i
4   1 Year_2     b
5   2 Year_2     f
6   3 Year_2     j
7   1 Year_3     c
8   2 Year_3     g
9   3 Year_3     k
10  1 Year_4     d
11  2 Year_4     h
12  3 Year_4     l
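The range `Year_1:Year_4` is only one way to tell `gather()` which columns to collect. As a small sketch, assuming the same `data1` as above (recreated here so the block runs on its own), the columns can also be selected by exclusion or with the selection helper `starts_with()`. The names `long_range`, `long_minus` and `long_prefix` are made up for this illustration.

```r
library(tidyverse)

# Same wide dataframe as in the example above
data1 <- data.frame(
  ID = c(1, 2, 3),
  Year_1 = c("a", "e", "i"),
  Year_2 = c("b", "f", "j"),
  Year_3 = c("c", "g", "k"),
  Year_4 = c("d", "h", "l"),
  stringsAsFactors = FALSE)

# 1. A range of columns, as used above
long_range <- data1 %>% gather(Year, Value, Year_1:Year_4)

# 2. Every column except ID
long_minus <- data1 %>% gather(Year, Value, -ID)

# 3. Every column whose name starts with "Year_"
long_prefix <- data1 %>% gather(Year, Value, starts_with("Year_"))

# All three selections should give the same long dataframe
all(long_range == long_minus)
all(long_range == long_prefix)
```

Selecting by exclusion is convenient when there are many value columns and only a few identifier columns.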


We can sort the dataframe by ID.


In [35]:
data2 <- data2 %>%
  arrange(ID)
print(data2)

   ID   Year Value
1   1 Year_1     a
2   1 Year_2     b
3   1 Year_3     c
4   1 Year_4     d
5   2 Year_1     e
6   2 Year_2     f
7   2 Year_3     g
8   2 Year_4     h
9   3 Year_1     i
10  3 Year_2     j
11  3 Year_3     k
12  3 Year_4     l


## From _long_ to _wide_

We can reverse the process with the function `spread()`. This function has the following syntax:

`spread(data, key, value, ...)`

The parameter _key_ is the name of the column whose values become the new column names (in our example _Year_). The parameter _value_ is the name of the column whose values fill those new columns (in our example _Value_).

In [36]:
data3 <- data2 %>%
  spread(Year, Value)
print(data3)

  ID Year_1 Year_2 Year_3 Year_4
1  1      a      b      c      d
2  2      e      f      g      h
3  3      i      j      k      l


With `spread()` you achieve the opposite of `gather()`, so applying both in sequence returns the original dataframe:

In [37]:
data_test <- data1 %>%
  gather(Year, Value, Year_1:Year_4) %>%
  spread(Year, Value)

data_test == data1

ID,Year_1,Year_2,Year_3,Year_4
TRUE,TRUE,TRUE,TRUE,TRUE
TRUE,TRUE,TRUE,TRUE,TRUE
TRUE,TRUE,TRUE,TRUE,TRUE
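In tidyr 1.0.0 and later, `gather()` and `spread()` are marked as superseded by `pivot_longer()` and `pivot_wider()`. The older functions used in this notebook still work. The sketch below only shows what the same round trip might look like with the newer functions, assuming a tidyr version that provides them. `data1` is recreated so the block runs on its own, and `data_long` and `data_wide` are names chosen for this illustration.

```r
library(tidyverse)

# Same wide dataframe as in the example above
data1 <- data.frame(
  ID = c(1, 2, 3),
  Year_1 = c("a", "e", "i"),
  Year_2 = c("b", "f", "j"),
  Year_3 = c("c", "g", "k"),
  Year_4 = c("d", "h", "l"),
  stringsAsFactors = FALSE)

# Wide to long: names_to plays the role of key, values_to the role of value
data_long <- data1 %>%
  pivot_longer(cols = Year_1:Year_4, names_to = "Year", values_to = "Value")

# Long to wide: names_from plays the role of key, values_from the role of value
data_wide <- data_long %>%
  pivot_wider(names_from = Year, values_from = Value)

# pivot_longer() and pivot_wider() return tibbles,
# so convert to a plain dataframe before comparing with the original
all(as.data.frame(data_wide) == data1)
```

In this example the result of `pivot_longer()` is already grouped by ID, so no separate `arrange(ID)` step is needed.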


# Assignment

1. Read the file _growth.csv_. Please note: because the column names in the _csv_ file are numbers, R will put an X in front of them. Because we don't want the X, you need to add the argument `check.names = FALSE`. The command will then look something like this: `read.csv2("../Sourcedata/your_file_name.csv", check.names = FALSE)`
2.  Change the file to a _long_ format. Please note that you can't use bare numbers as column names in the function `gather()`, because R would read e.g. 2006 as column position 2006 and not as the column named 2006. How can you solve this?

In [39]:
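# Method 1: select the year columns by position (columns 2 to 12)
oecd1 <- read.csv2("../sourcedata/growth.csv", check.names = FALSE) %>%
  gather(Year, Growth, 2:12)
head(oecd1)

# Method 2: select the year columns by name, passed as character strings
oecd2 <- read.csv2("../sourcedata/growth.csv", check.names = FALSE) %>%
  gather(Year, Growth, as.character(2006:2016))
head(oecd2)

# Method 3: select every column except Country
oecd3 <- read.csv2("../sourcedata/growth.csv", check.names = FALSE) %>%
  gather(Year, Growth, -Country)
head(oecd3)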

Country,Year,Growth
Belgium,2006,0.7
France,2006,0.3
Germany,2006,2.4
Netherlands,2006,2.4
Sweden,2006,2.6
United Kingdom,2006,3.8


Country,Year,Growth
Belgium,2006,0.7
France,2006,0.3
Germany,2006,2.4
Netherlands,2006,2.4
Sweden,2006,2.6
United Kingdom,2006,3.8


Country,Year,Growth
Belgium,2006,0.7
France,2006,0.3
Germany,2006,2.4
Netherlands,2006,2.4
Sweden,2006,2.6
United Kingdom,2006,3.8
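A further way to refer to columns whose names are numbers, not used in the answer above, is to wrap the names in backticks. The sketch below uses a small made-up dataframe (`growth_demo`, with invented countries and values) instead of the real file, so everything in it is an assumption for illustration only.

```r
library(tidyverse)

# Hypothetical stand-in for the growth data; the values are invented
growth_demo <- data.frame(
  Country = c("A", "B"),
  `2006` = c(1.0, 2.0),
  `2007` = c(1.5, 2.5),
  check.names = FALSE)

# Backticks tell R that 2006 and 2007 are column names, not positions
long_backtick <- growth_demo %>%
  gather(Year, Growth, `2006`:`2007`)

# For comparison: select every column except Country
long_minus <- growth_demo %>%
  gather(Year, Growth, -Country)

# Both selections should give the same long dataframe
all(long_backtick == long_minus)
```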


You can check whether the first two methods deliver the same result with:



In [40]:
sum(oecd1 != oecd2, na.rm = TRUE)

No differences!
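The expression `sum(oecd1 != oecd2, na.rm = TRUE)` counts the cells that differ, so a result of 0 means the two dataframes match. A slightly stricter sketch could also compare the dimensions first and include the third method. Because the real file is not available here, the example uses a small made-up dataframe (`growth_demo`, with invented values) and a hypothetical helper `same_cells()`; neither is part of the original notebook.

```r
library(tidyverse)

# Hypothetical stand-in for the growth data; the values are invented
growth_demo <- data.frame(
  Country = c("A", "B"),
  `2006` = c(1.0, 2.0),
  `2007` = c(1.5, 2.5),
  check.names = FALSE)

# The three selection methods from the assignment, applied to the demo data
demo1 <- growth_demo %>% gather(Year, Growth, 2:3)
demo2 <- growth_demo %>% gather(Year, Growth, as.character(2006:2007))
demo3 <- growth_demo %>% gather(Year, Growth, -Country)

# Hypothetical helper: TRUE when both dataframes have the same shape
# and no cell differs
same_cells <- function(a, b) {
  identical(dim(a), dim(b)) && sum(a != b, na.rm = TRUE) == 0
}

same_cells(demo1, demo2)
same_cells(demo1, demo3)
```

Checking the dimensions first avoids an error from `!=` when the two dataframes do not have the same shape.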

End of Notebook