Getting data
----

In biomedical contexts, data most often come from external files such as spreadsheets. Here we will look at how to import such data into R as a data frame. For a spreadsheet to be read correctly, you need to follow some simple rules when constructing the table:

### Do this

- A table has column headers and a number of rows and nothing else – it is RECTANGULAR

<img src="good_spreadsheet.png" width="600">

### Not this

- Do not put more than 1 table in a worksheet
- Do not use non-rectangular tables

<img src="bad_spreadsheet.png" width="600">

### Do this

- One cell = one value
- Easy to filter by tube, sample or subject
- Easy to write validation rules or lookup table

<img src="simple_information.png" width="600">

### Not this

- ID column has 3 different values
- Need to do text parsing to recover information – very error prone

<img src="complex_information.png" width="600">
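
To illustrate why packing several values into one ID is fragile, here is a minimal sketch. The `subject-sample-tube` ID format and all values are hypothetical, invented for this example only.

```r
# Hypothetical compound IDs of the form subject-sample-tube
ids <- c("S01-A-T1", "S01-B-T2", "S02-A-T1")

# Recovering the fields requires text parsing, which breaks
# as soon as one ID deviates from the expected pattern
parts <- do.call(rbind, strsplit(ids, "-", fixed = TRUE))
tidy <- data.frame(
  subject = parts[, 1],
  sample  = parts[, 2],
  tube    = parts[, 3],
  stringsAsFactors = FALSE
)

# With one value per cell, filtering is a plain comparison
tidy[tidy$subject == "S01", ]
```

If the spreadsheet had stored subject, sample and tube in separate columns from the start, the parsing step would not be needed at all.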

### Round-trip from Excel to CSV and back to Excel

#### Before

- Information in highlighting
- Information in comment notes
- Information in font color
- Merged cells

<img src="before.png" width="600">

#### After

- Comments are lost
- Highlighting is lost
- Bad cell formatting is lost
- Merged cells become missing information

<img src="after.png" width="600">
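
The same point can be seen from R: a CSV file stores only cell values as plain text, so anything carried by formatting has nowhere to go. The sketch below uses a small made-up data frame (the column names `subject` and `weight_kg` are assumptions) and a temporary file.

```r
# Made-up data for illustration
df <- data.frame(
  subject   = c("S01", "S02"),
  weight_kg = c(70.5, 82.1),
  stringsAsFactors = FALSE
)

# Write to CSV and read it back
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)

# The file is plain text: a header line plus one line per row
readLines(path)

# The values survive the round trip
all.equal(df, back)
```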


Other suggestions
----

- Use a lookup table rather than typing if possible to avoid errors due to typos
- Use a special marker to indicate missing values - do not use 0 or 999 etc
- Do not keep multiple copies of the same spreadsheet
- If you must keep multiple copies, make sure you version them clearly in the filename
- Excel is OK if you use almost NONE of its features!
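
As a small sketch of the missing-value suggestion, the example below reads a hypothetical CSV in which `NA` marks a missing measurement. The data and column names are made up; `na.strings` is the `read.csv` argument that tells R which cell contents to treat as missing.

```r
# Hypothetical CSV content; "NA" marks a missing measurement
csv_text <- "subject,weight_kg
S01,70.5
S02,NA
S03,82.1"

measurements <- read.csv(text = csv_text, na.strings = "NA",
                         stringsAsFactors = FALSE)

# The missing value is excluded explicitly rather than silently
# distorting the result, as a 0 or 999 placeholder would
mean(measurements$weight_kg, na.rm = TRUE)
```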

In [58]:
# Don't show warning messages to keep interface clean
options(warn = -1)

Reading data from a spreadsheet
----

In [59]:
library(RCurl)
library(gdata)

In [60]:
songs.url <- "http://www.acclaimedmusic.net/Current/top_6000_songs_140727.xls"

In [79]:
songs <- read.xls(songs.url)

In [81]:
dim(songs)

### Cleaning the song and album dataframes

In [103]:
head(songs, 3)

Unnamed: 0,ID,Place,Artist,Song,Year
1,991,5,Bob Dylan,Like a Rolling Stone,1965-06-24
2,182,1116,Nirvana,Smells Like Teen Spirit,1991-06-24
3,961,2227,The Beach Boys,Good Vibrations,1966-06-24


In [84]:
colnames(songs)

#### Let's just keep the first 5 columns

In [85]:
keep <- c('ID', 'PLACE.2014.JUL.27', 'Artist', 'Song', 'Year')
songs <- songs[,keep]

#### Shorten the 2nd column name to 'Place'

In [86]:
colnames(songs)[2] <- 'Place'

In [87]:
head(songs, 3)

Unnamed: 0,ID,Place,Artist,Song,Year
1,991,1,Bob Dylan,Like a Rolling Stone,1965
2,182,2,Nirvana,Smells Like Teen Spirit,1991
3,961,3,The Beach Boys,Good Vibrations,1966


#### There are only 6000 ranked songs - get rid of the rest

In [88]:
songs[5998:6003,]

Unnamed: 0,ID,Place,Artist,Song,Year
5998,6538,5998.0,George Michael,Too Funky,1992
5999,6196,5999.0,Eddie Cochran,Three Steps to Heaven,1960
6000,6672,6000.0,Daryl Hall & John Oates,Kiss on My List,1981
6001,5057,,!!!,Hello? Is This Thing On?,2004
6002,5058,,!!!,Pardon My Freedom,2004
6003,8720,,"Tennessee" Ernie Ford,Shot Gun Boogie,1950


In [89]:
songs <- songs[1:6000,]
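
Slicing by row number works here because the ranked songs come first. A sketch of an alternative that does not depend on row order is to keep only the rows that have a rank. The toy data frame below is a stand-in for `songs`, and it assumes the empty `Place` cells were read as `NA`, which may not hold if the column was imported as text or as a factor.

```r
# Toy stand-in for the songs data frame; only a few rows, for illustration
toy <- data.frame(
  ID    = c(991, 182, 5057, 5058),
  Place = c(1, 2, NA, NA),
  Song  = c("Like a Rolling Stone", "Smells Like Teen Spirit",
            "Hello? Is This Thing On?", "Pardon My Freedom"),
  stringsAsFactors = FALSE
)

# Keep only rows that have a rank
ranked <- toy[!is.na(toy$Place), ]
nrow(ranked)
```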

In [102]:
tail(songs, 3)

Unnamed: 0,ID,Place,Artist,Song,Year
5998,6538,5558,George Michael,Too Funky,1992-06-24
5999,6196,5559,Eddie Cochran,Three Steps to Heaven,1960-06-24
6000,6672,5563,Daryl Hall & John Oates,Kiss on My List,1981-06-24


#### Take a closer look at the type of each column

In [104]:
sapply(songs, class)

#### Change types to more appropriate classes so that we can manipulate them

In [105]:
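# Go through character first: as.numeric() on a factor returns level codes, not the values
songs$Place <- as.numeric(as.character(songs$Place))
songs$ID <- as.numeric(as.character(songs$ID))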
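
One pitfall is worth a short illustration: if a column was imported as a factor, `as.numeric()` returns the internal level codes rather than the numbers shown on screen. The vector below is made up for the example.

```r
# A made-up factor whose labels look like numbers
place <- factor(c("10", "2", "33"))

# Level codes: positions in the sorted set of labels, not the values
codes <- as.numeric(place)

# Converting to character first recovers the actual numbers
values <- as.numeric(as.character(place))

codes
values
```

Checking a few rows with `head()` after a conversion is a cheap way to catch this kind of silent change.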

In [100]:
songs$Year <- as.Date(songs$Year, format='%Y')
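
Note that a year alone is not a complete date. When only `%Y` is supplied, R fills in the month and day from the current date, so the result depends on when the code is run. A sketch of one alternative, using made-up years, is to pin the date to 1 January; keeping the year as a plain number is another reasonable choice.

```r
years <- c(1965, 1991, 1966)

# Parsing with only '%Y' leaves month and day to be taken from today's date
as.Date(as.character(years), format = "%Y")

# Pinning month and day makes the result reproducible
year_dates <- as.Date(paste0(years, "-01-01"))
format(year_dates, "%Y")
```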

In [107]:
i <- sapply(songs, is.factor)
songs[i] <- lapply(songs[i], as.character)

In [108]:
sapply(songs, class)

Work!
----

There is a spreadsheet of the top 3,000 albums available at http://www.acclaimedmusic.net/Current/top_3000_albums_140727.xls

Create a dataframe called albums that holds the contents of this spreadsheet.

How many rows and columns are there in the dataframe?

Trim the albums dataframe so that it only contains the top 3,000 albums and the first 5 columns.

Convert the types of each column similar to what was done for the songs dataframe.

Selecting and summarizing dataframes
----