Merging data
----

A common task in data analysis is to link data from two or more data sets, for example, to relate assay data to clinical phenotype.

Here we will work through a typical example in which the genotype and phenotype information come from two different data sets, and the ID information needed to link the two comes from a third data set.

In [94]:
phenodat <- read.csv("phenodat.csv") 
gdat1 <- read.csv("gdat1.csv") 
gdat2 <- read.csv("gdat2.csv")
iddat <- read.csv("iddat.csv")

Eyeball data sets
----

A quick sanity check to see what the data look like.

In [100]:
(dim(phenodat))
(dim(gdat1))
(dim(gdat2))
(dim(iddat))
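
Besides `dim`, a few other base R calls help with this kind of sanity check. The sketch below runs them on a small made-up data frame; `toy` and its values are hypothetical and only stand in for the files read above.

```r
# Hypothetical toy data standing in for one of the files read above
toy <- data.frame(pid = c("pid1", "pid2", "pid3"),
                  trt = c(0, 1, 0),
                  stringsAsFactors = FALSE)

toy_dim   <- dim(toy)     # number of rows, then number of columns
toy_names <- names(toy)   # column names, useful for spotting the ID columns
str(toy)                  # compact overview of each column's type

# Count missing values per column
na_counts <- colSums(is.na(toy))
```

Comparing `names()` across data sets is a quick way to see which columns could serve as the linking ID.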

In [101]:
head(phenodat, 3)

Unnamed: 0,pid,trt
1,pid6,0
2,pid15,1
3,pid8,0


In [102]:
head(gdat1, 3)

Unnamed: 0,expid,gene1,gene2
1,100020,-0.4321298,-0.2288958
2,100018,-1.318938,0.7935853
3,100013,1.242919,-1.334354


In [103]:
head(gdat2, 3)

Unnamed: 0,expid,gene1,gene2
1,100009,-1.220512,-0.2416898
2,100008,0.2865486,1.685887
3,100007,-0.7717918,-1.070068


In [104]:
head(iddat, 3)

Unnamed: 0,pid,expid
1,pid20,100020
2,pid9,100009
3,pid13,100013


Combine gene data from two data sets
----

Often, we have the same type of data stored in multiple data sets, for example, one per batch. In this case, we want to combine **rows**.

In [97]:
gdat <- rbind(gdat1, gdat2)
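
For data frames, `rbind` matches columns by name, so it is worth confirming that the batches share the same columns before stacking them. A minimal sketch with hypothetical batches (`batch1` and `batch2` are made up for illustration):

```r
# Hypothetical batches with the same columns
batch1 <- data.frame(expid = c(1, 2), gene1 = c(0.1, 0.2))
batch2 <- data.frame(expid = c(3, 4), gene1 = c(0.3, 0.4))

# Stop early if the column names differ between batches
stopifnot(identical(names(batch1), names(batch2)))

# Stack the rows of batch2 underneath the rows of batch1
combined <- rbind(batch1, batch2)
n_combined <- nrow(combined)  # rows in batch1 plus rows in batch2
```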

Checking for duplicates
----   

In [107]:
# Return the rows of df that exactly repeat an earlier row
show.dups <- function(df) {
    return(df[duplicated(df), ])
}

In [108]:
show.dups(phenodat)

Unnamed: 0,pid,trt


In [109]:
show.dups(iddat)

Unnamed: 0,pid,expid


In [110]:
show.dups(gdat)

Unnamed: 0,expid,gene1,gene2
13,100008,0.2865486,1.685887
15,100003,0.8867361,0.2760235
17,100004,-0.151396,-1.048976
18,100018,-1.318938,0.7935853
20,100011,0.8001769,-0.7729782
21,100001,-0.5996083,1.689873
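
`duplicated` flags a row only if an identical row appeared earlier, so the first copy of each repeated row is not listed. It can also be applied to a single column to find repeated IDs whose measurements differ. A small sketch on hypothetical data (the `toy` values are made up):

```r
# Hypothetical data: expid 2 is repeated exactly, expid 3 with different values
toy <- data.frame(expid = c(1, 2, 2, 3, 3),
                  gene1 = c(0.5, 0.7, 0.7, 0.9, 1.1))

# Whole-row duplicates: only the second copy of the expid 2 row is flagged
full_dups <- duplicated(toy)

# Key-only duplicates: the second rows for expid 2 and expid 3 are flagged
key_dups <- duplicated(toy$expid)

# Repeated IDs whose measurements disagree with the earlier row
conflicts <- toy[key_dups & !full_dups, ]
```

Rows in `conflicts` would not be removed by dropping exact duplicates, so they may need a separate decision about which measurement to keep.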


Remove duplicates
----

In [111]:
gdat <- unique(gdat)

In [113]:
dim(gdat)

In [112]:
show.dups(gdat)

Unnamed: 0,expid,gene1,gene2
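
The empty result confirms that no fully duplicated rows remain. For a data frame, `unique` keeps the first copy of each row, which is the same as indexing with `!duplicated(...)`. The sketch below illustrates this on hypothetical data, along with a variant that deduplicates on the ID column only:

```r
# Hypothetical data with one exact duplicate row
toy <- data.frame(expid = c(1, 2, 2, 3),
                  gene1 = c(0.5, 0.7, 0.7, 0.9))

dedup_a <- unique(toy)               # drop repeated rows
dedup_b <- toy[!duplicated(toy), ]   # same idea, written explicitly

# Deduplicate on the ID column only, keeping the first row per expid
dedup_key <- toy[!duplicated(toy$expid), ]
```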


Merging
----

To combine columns from different data sets, we can perform a `merge` operation. Rows in the different data sets need some common identifier to be merged, typically information from one or more "ID" columns.
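
As a minimal sketch of the idea, the toy example below merges two hypothetical data frames on a shared `pid` column. The names `pheno`, `ids`, `ids2` and `patient` are made up for illustration. By default, `merge` keeps only rows whose key appears in both inputs.

```r
# Hypothetical data: pid "c" has no ID record, pid "d" has no phenotype
pheno <- data.frame(pid = c("a", "b", "c"), trt = c(0, 1, 0))
ids   <- data.frame(pid = c("a", "b", "d"), expid = c(101, 102, 104))

# Default merge: only pids present in both data frames survive
both <- merge(pheno, ids, by = "pid")

# If the key columns have different names, use by.x and by.y
ids2  <- data.frame(patient = c("a", "b", "d"), expid = c(101, 102, 104))
both2 <- merge(pheno, ids2, by.x = "pid", by.y = "patient")
```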

### Merge all rows with information for both phenotype and gene

#### First we merge phenotype data with the ID data

In [114]:
(df1 <- merge(phenodat, iddat, by="pid", all.x=TRUE))

Unnamed: 0,pid,trt,expid
1,pid1,0,100001
2,pid12,1,100012
3,pid15,1,100015
4,pid16,0,100016
5,pid17,0,100017
6,pid18,1,100018
7,pid20,0,100020
8,pid6,0,100006
9,pid7,0,100007
10,pid8,0,100008


#### Then we merge with gene data

In [116]:
(df2 <- merge(gdat, df1, by="expid"))

Unnamed: 0,expid,gene1,gene2,pid,trt
1,100001,-0.5996083,1.689873,pid1,0
2,100007,-0.7717918,-1.070068,pid7,0
3,100008,0.2865486,1.685887,pid8,0
4,100015,0.3937087,1.233976,pid15,1
5,100017,-0.8864367,0.4120223,pid17,0
6,100018,-1.318938,0.7935853,pid18,1
7,100020,-0.4321298,-0.2288958,pid20,0


Note that there are now only 7 rows because 3 of the 10 phenotype records did not have matching gene data.
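
One way to see which records drop out of such a merge is to compare the key columns directly with `setdiff`. The sketch below uses hypothetical data frames `left` and `right` rather than the data above:

```r
# Hypothetical data: keys 1 and 3 exist only in `left`
left  <- data.frame(expid = c(1, 2, 3, 4), trt = c(0, 1, 0, 1))
right <- data.frame(expid = c(2, 4, 5), gene1 = c(0.3, -1.2, 0.8))

# Keys present in `left` but absent from `right`: these rows are lost
dropped <- setdiff(left$expid, right$expid)

inner <- merge(left, right, by = "expid")

# With unique keys, the lost rows account for the difference in row counts
counts_agree <- nrow(left) - nrow(inner) == length(dropped)
```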

### What if we want to show all genes even if there is no matching phenotype data?

In [117]:
(df3 <- merge(gdat, df1, by="expid", all.x=TRUE))

Unnamed: 0,expid,gene1,gene2,pid,trt
1,100001,-0.5996083,1.689873,pid1,0.0
2,100002,-0.1294107,1.228393,,
3,100003,0.8867361,0.2760235,,
4,100004,-0.151396,-1.048976,,
5,100005,0.3297912,-0.5208693,,
6,100007,-0.7717918,-1.070068,pid7,0.0
7,100008,0.2865486,1.685887,pid8,0.0
8,100009,-1.220512,-0.2416898,,
9,100011,0.8001769,-0.7729782,,
10,100013,1.242919,-1.334354,,


### What if we want to show all phenotypes even if there is no matching gene data?

In [118]:
(df4 <- merge(gdat, df1, by="expid", all.y=TRUE))

Unnamed: 0,expid,gene1,gene2,pid,trt
1,100001,-0.5996083,1.689873,pid1,0
2,100006,,,pid6,0
3,100007,-0.7717918,-1.070068,pid7,0
4,100008,0.2865486,1.685887,pid8,0
5,100012,,,pid12,1
6,100015,0.3937087,1.233976,pid15,1
7,100016,,,pid16,0
8,100017,-0.8864367,0.4120223,pid17,0
9,100018,-1.318938,0.7935853,pid18,1
10,100020,-0.4321298,-0.2288958,pid20,0


### What if we want to show everything?

In [119]:
(df5 <- merge(gdat, df1, by="expid", all.x=TRUE, all.y=TRUE))

Unnamed: 0,expid,gene1,gene2,pid,trt
1,100001,-0.5996083,1.689873,pid1,0.0
2,100002,-0.1294107,1.228393,,
3,100003,0.8867361,0.2760235,,
4,100004,-0.151396,-1.048976,,
5,100005,0.3297912,-0.5208693,,
6,100006,,,pid6,0.0
7,100007,-0.7717918,-1.070068,pid7,0.0
8,100008,0.2865486,1.685887,pid8,0.0
9,100009,-1.220512,-0.2416898,,
10,100011,0.8001769,-0.7729782,,
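
The four variants above differ only in which unmatched rows are kept. The sketch below puts them side by side on hypothetical data (`x` and `y` are made up); `all = TRUE` is shorthand for `all.x = TRUE, all.y = TRUE`.

```r
# Hypothetical data: key 1 only in x, key 4 only in y
x <- data.frame(key = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- data.frame(key = c(2, 3, 4), b = c("y2", "y3", "y4"))

inner <- merge(x, y, by = "key")                 # keys 2, 3
left  <- merge(x, y, by = "key", all.x = TRUE)   # keys 1, 2, 3
right <- merge(x, y, by = "key", all.y = TRUE)   # keys 2, 3, 4
full  <- merge(x, y, by = "key", all = TRUE)     # keys 1, 2, 3, 4

# Cells with no match are filled with NA
missing_b <- is.na(left$b[left$key == 1])
```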


Rearrange column order
-----

In [56]:
df2[, c(4,1,2,3,5)]

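Unnamed: 0,pid,expid,gene1,gene2,trt
1,pid1,100001,-0.5996083,1.689873,0
2,pid7,100007,-0.7717918,-1.070068,0
3,pid8,100008,0.2865486,1.685887,0
4,pid15,100015,0.3937087,1.233976,1
5,pid17,100017,-0.8864367,0.4120223,0
6,pid18,100018,-1.318938,0.7935853,1
7,pid20,100020,-0.4321298,-0.2288958,0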
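
Positional indices break silently if the column order changes, so selecting by name can be more robust. A minimal sketch on hypothetical data (`toy` and `id_cols` are made up for illustration):

```r
# Hypothetical data with ID columns scattered among the measurements
toy <- data.frame(expid = c(101, 102), gene1 = c(0.5, -0.3),
                  pid = c("a", "b"), trt = c(0, 1))

# Select columns by name in the desired order
by_name <- toy[, c("pid", "expid", "gene1", "trt")]

# Move the ID columns to the front and keep the rest in their existing order
id_cols <- c("pid", "expid")
front   <- toy[, c(id_cols, setdiff(names(toy), id_cols))]
```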


Sorting data
---

In [120]:
df2

Unnamed: 0,expid,gene1,gene2,pid,trt
1,100001,-0.5996083,1.689873,pid1,0
2,100007,-0.7717918,-1.070068,pid7,0
3,100008,0.2865486,1.685887,pid8,0
4,100015,0.3937087,1.233976,pid15,1
5,100017,-0.8864367,0.4120223,pid17,0
6,100018,-1.318938,0.7935853,pid18,1
7,100020,-0.4321298,-0.2288958,pid20,0


#### Sort by expid

In [122]:
df2[order(df2$expid),]

Unnamed: 0,expid,gene1,gene2,pid,trt
1,100001,-0.5996083,1.689873,pid1,0
2,100007,-0.7717918,-1.070068,pid7,0
3,100008,0.2865486,1.685887,pid8,0
4,100015,0.3937087,1.233976,pid15,1
5,100017,-0.8864367,0.4120223,pid17,0
6,100018,-1.318938,0.7935853,pid18,1
7,100020,-0.4321298,-0.2288958,pid20,0


#### Sort by pid

In [123]:
df2[order(df2$pid),]

Unnamed: 0,expid,gene1,gene2,pid,trt
1,100001,-0.5996083,1.689873,pid1,0
4,100015,0.3937087,1.233976,pid15,1
5,100017,-0.8864367,0.4120223,pid17,0
6,100018,-1.318938,0.7935853,pid18,1
7,100020,-0.4321298,-0.2288958,pid20,0
2,100007,-0.7717918,-1.070068,pid7,0
3,100008,0.2865486,1.685887,pid8,0
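
Because `pid` holds text, the sort compares values character by character, which is why `pid15` comes before `pid7` above. If numeric order is wanted, one option is to sort on the number embedded in the ID. The sketch below assumes every ID has the form `pid` followed by digits:

```r
# A few IDs of the same form as above
pid <- c("pid1", "pid15", "pid17", "pid7", "pid8")

# Strip the "pid" prefix and convert the remainder to an integer
pid_num <- as.integer(sub("^pid", "", pid))

# Reorder the IDs by their numeric part
numeric_sorted <- pid[order(pid_num)]
```

For a data frame, the same idea would be to pass the extracted numbers to `order` when indexing its rows.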


#### Sort by pid, then by expid

In [125]:
df2[order(df2$pid, df2$expid),]

Unnamed: 0,expid,gene1,gene2,pid,trt
1,100001,-0.5996083,1.689873,pid1,0
4,100015,0.3937087,1.233976,pid15,1
5,100017,-0.8864367,0.4120223,pid17,0
6,100018,-1.318938,0.7935853,pid18,1
7,100020,-0.4321298,-0.2288958,pid20,0
2,100007,-0.7717918,-1.070068,pid7,0
3,100008,0.2865486,1.685887,pid8,0
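
The result matches the sort by `pid` alone because every `pid` here is unique, so the second key never has to break a tie. A second key only matters when the first has repeated values, as in this hypothetical example:

```r
# Hypothetical data where the first sort key (trt) has ties
toy <- data.frame(trt = c(1, 0, 1, 0),
                  expid = c(104, 103, 101, 102))

# Sort by trt first; expid decides the order only within equal trt values
sorted <- toy[order(toy$trt, toy$expid), ]
```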


#### Sort by gene1 in decreasing order

In [128]:
df2[order(df2$gene1, decreasing = TRUE),]

Unnamed: 0,expid,gene1,gene2,pid,trt
4,100015,0.3937087,1.233976,pid15,1
3,100008,0.2865486,1.685887,pid8,0
7,100020,-0.4321298,-0.2288958,pid20,0
1,100001,-0.5996083,1.689873,pid1,0
2,100007,-0.7717918,-1.070068,pid7,0
5,100017,-0.8864367,0.4120223,pid17,0
6,100018,-1.318938,0.7935853,pid18,1
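
`decreasing = TRUE` applies to every sort key at once. To mix directions, one common trick for numeric keys is to negate the column that should run in decreasing order. A sketch on hypothetical data:

```r
# Hypothetical data with a grouping column and a numeric measurement
toy <- data.frame(trt = c(0, 1, 0, 1),
                  gene1 = c(0.2, -0.5, 1.1, 0.7))

# Ascending trt, then descending gene1 within each trt group
mixed <- toy[order(toy$trt, -toy$gene1), ]

# For a single numeric key, negation gives a descending sort as well
desc <- toy[order(-toy$gene1), ]
```

Negation only works for numeric columns; character keys need a different approach.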
