In [None]:
library('ggplot2')
library('reshape2')
library('curl')

# Wide versus Long Data Format - Using R

Let's take the example of a hypothetical library and look at the statistics the front desk might collect. The following CSV file is an example of such data, collected over six time periods each day and submitted some time later. Each submission records the date, the time-of-day period, and one column per enquiry category containing a randomly generated tally.

So, let's begin by importing the raw statistics and looking at their format.

In [None]:
df <- read.csv(curl('https://raw.githubusercontent.com/drewfrobot/data-literacy-workshops/master/Hypothetical_Library_Queries.csv'))
df[is.na(df)] <- 0
head(df,10)

The data is a classic example of a wide format, typical of form or survey responses: easily read by people, not so easily read by machines (e.g. Skynet).

Each interaction category has its own column. This is a presentation format: easy for people to read, but before a machine can analyse it, and before we can take advantage of powerful analysis tools, the data needs to be converted to long format (also called tall or narrow format), with one row per observation.

This can easily be achieved using, for example, the 'melt' function in the reshape2 package.

In [None]:
# reshape2's melt takes id.vars (with a dot); id_vars is silently ignored
df2 <- melt(df, id.vars=c('Timestamp','Date','Time'), na.rm=TRUE)
head(df2,10)
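
To see what the melt step does on a small scale, here is a minimal sketch using a toy data frame. The tallies and time period labels are made up for illustration and are not taken from the CSV file; only the category names Catalogue and Reference appear in the real data.

```r
library(reshape2)

# Toy wide table (hypothetical values): one column per enquiry category
wide <- data.frame(
  Date      = c("01/03/2018", "01/03/2018"),
  Time      = c("9am-11am", "11am-1pm"),
  Catalogue = c(3, 1),
  Reference = c(0, 2)
)

# Melt to long format: one row per Date/Time/category combination.
# The category names move into a 'variable' column, the tallies into 'value'.
long <- melt(wide, id.vars = c("Date", "Time"))
print(long)
```

The two rows by two category columns become four rows, each holding a single tally. The identifier columns (Date and Time) are repeated on every row, which is what makes the long format easy to filter and aggregate.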


Now we can perform an analysis and present results. Let's say we would like to know which time periods on which weekdays are, on average, the busiest at the enquiries desk. Firstly, let's convert the Date column to Date format and add a column which gives the day of the week. Here 0 is Sunday and 6 is Saturday.

In [None]:
df2$Date <- as.Date(df2$Date,"%d/%m/%Y")
df2$dayofweek <- format(as.Date(df2$Date),"%w")
head(df2,5)
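
As a quick check of the conversion, the sketch below applies the same two calls to three hand-picked dates in March 2018. The lookup vector of day names is an addition for readability and is not part of the original analysis.

```r
# Dates stored as day/month/year text, matching the "%d/%m/%Y" format
raw <- c("01/03/2018", "03/03/2018", "04/03/2018")
d <- as.Date(raw, format = "%d/%m/%Y")

# "%w" gives the weekday as a character digit: "0" = Sunday ... "6" = Saturday
w <- format(d, "%w")

# Optional: map the digits to short names via a lookup vector (illustrative)
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
print(data.frame(date = d, dayofweek = w, name = day_names[as.integer(w) + 1]))
```

Note that format() returns text, so dayofweek holds the characters "0" to "6" rather than numbers. That suits grouping and plot legends, but it needs as.integer() before any arithmetic.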

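Now we can perform an aggregation to find the average (median) number of interactions per day of week and time period. We first sum across categories to get the total for each date and time period, then take the median of those totals across dates. Finally, we pivot or recast the data back into a wide format for people to view the summary or result.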

In [None]:
df3 <- aggregate(value~Date+Time+dayofweek,df2,sum)
df4 <- aggregate(value~Time+dayofweek,df3,median)
df5 <- dcast(df4,Time~dayofweek,value.var="value")
df5
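
The two-step aggregation can be illustrated on toy data. This is a minimal sketch with made-up tallies for two Thursdays; it uses only base R and leaves out the final dcast step.

```r
# Toy long-format data (hypothetical values): two dates, both Thursdays
long <- data.frame(
  Date      = as.Date(c("2018-03-01", "2018-03-01", "2018-03-08", "2018-03-08")),
  Time      = "9am-11am",
  dayofweek = "4",
  variable  = c("Catalogue", "Reference", "Catalogue", "Reference"),
  value     = c(3, 2, 5, 4)
)

# Step 1: total interactions per date and time period, summed over categories
per_slot <- aggregate(value ~ Date + Time + dayofweek, long, sum)

# Step 2: median of those totals for each time period and weekday
typical <- aggregate(value ~ Time + dayofweek, per_slot, median)
print(typical)
```

The order of the steps matters. Summing first gives totals of 5 and 9 for the two Thursdays, so the median is 7. Taking the median directly on the long data would instead give 3.5, the median of the individual category tallies, which answers a different question.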

Let's do a quick, no-nonsense graph to display the table.

In [None]:
p<-ggplot(df4, aes(x=Time, y=value,fill=dayofweek)) +
  geom_bar(stat="identity",position=position_dodge())+coord_fixed(ratio = 0.1)
print(p)

What if we wanted to show the interactions for each category over a particular month, say March 2018?

The raw data is hidden away in a table, and there's no real need to look at it constantly. We can simply pose different questions and run any subsequent analysis without changing the raw data.

Here is a table view of the interactions per category for March 2018. Here we once again recast the data back into a wide format to view the results.

In [None]:
# Keep only rows dated March 2018 (month "03", two-digit year "18")
df6 <- df2[format(df2$Date, "%m")=="03" & format(df2$Date, "%y")=="18",]
df7 <- aggregate(value~Date+variable,df6,sum)
df8 <- dcast(df7,variable~Date,value.var="value")
df8
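
The month filter can be checked in isolation. The sketch below uses a handful of made-up dates around the month boundary and compares the two-condition test used above with a single year-month comparison, which is offered here only as an alternative.

```r
# Hypothetical dates around the month boundary, plus March of another year
d <- as.Date(c("2018-02-28", "2018-03-01", "2018-03-31",
               "2018-04-01", "2019-03-15"))

# Approach used above: compare month and two-digit year separately
keep_a <- format(d, "%m") == "03" & format(d, "%y") == "18"

# Alternative: compare a single four-digit-year and month string
keep_b <- format(d, "%Y-%m") == "2018-03"

print(d[keep_b])
```

Both tests select the same rows for these dates. The single-string version is a little shorter, and using the four-digit "%Y" avoids any ambiguity between centuries that the two-digit "%y" would allow.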


Once again, let's do a quick, no-nonsense graph to display the table.

In [None]:
df9 <-rbind(df7[df7$variable=='Catalogue',],df7[df7$variable=='Reference',])
p<-ggplot(df9, aes(x=Date, y=value,group=variable)) +
  geom_line(aes(linetype=variable, color=variable),size=1.5) + coord_fixed(ratio = 0.2)
print(p)

So in summary, wide format, whilst ideal for survey responses and displaying data to people, is not an ideal format to work with when using data analysis tools, such as R.  Converting to long data format allows the use of very powerful tools, and then the results can be pivoted or recast back into a wide format which is then easy for people to read and ponder.