### Goal
Parse and clean the *recount2* gene counts and convert them to AnnData format.

First run `../../data/raw/gtex/recount/splitGtexData.R` to split the data into counts, rowData and colData (the data was originally available as an .RData object).

### TODO
- Get GTEx metadata:
  - check whether individuals have WGS/WXS or whole blood RNA-Seq
  - integrate GTEx metadata (Assay, Gender, Sample Id, Individual ...) into the recount2 sample_description
- Create our custom recount2 metadata containing: Assay, Gender, Sample id, Tissue, Parent Tissues, Individual ...
- Create the summarized experiment (AnnData object) and store the data in HDF5 format

### Conclusions
`GTEx` metadata - short analysis:
- Total number of individuals: 723
- Individuals with WGS or WES: 679
- Individuals with whole blood RNA-Seq: 427
- Individuals with at least 2 tissues available: 544
- Individuals with at least 2 tissues available and WGS or WES: 501

`recount2` Summarized experiment: `/s/project/rep/processed/gtex/recount/recount_gtex.h5ad`


In [1]:
import sys
import os

import pandas as pd
import numpy as np

from rep import preprocessing as p

#### 1. GTEx metadata - short analysis

In [2]:
# load patient-related metadata (taken from GTEx v7)
metadata = '/s/project/gtex-processed/gene_counts_v7/SraRunTable.txt'
gtex_meta = p.load_df(metadata, delimiter="\t", header=0, index_col=None)

In [214]:
# get individual information (work on an explicit copy to avoid SettingWithCopyWarning)
gtex_meta_short = gtex_meta[['Sample_Name_s','sex_s','Assay_Type_s','Instrument_s','body_site_s','histological_type_s']].copy()
# sample name without anything after the first underscore
gtex_meta_short['Sample_Name'] = gtex_meta_short['Sample_Name_s'].str.split("_",expand=True)[0]
# individual id = first two dash-separated fields, e.g. GTEX-WEY5
aux = gtex_meta_short['Sample_Name'].str.split("-",expand=True)
gtex_meta_short['Individual'] = aux[0].map(str) + "-" + aux[1]
# drop the raw Sample_Name_s column and remove duplicated rows
gtex_meta_unique = gtex_meta_short.iloc[:,1:].drop_duplicates()
gtex_meta_unique.iloc[:3,:]

Unnamed: 0,sex_s,Assay_Type_s,Instrument_s,body_site_s,histological_type_s,Sample_Name,Individual
0,female,RNA-Seq,Illumina MiSeq,Skin - Sun Exposed (Lower leg),Skin,GTEX-WEY5-1826-SM-5CHRT,GTEX-WEY5
1,male,RNA-Seq,Illumina MiSeq,Testis,Testis,GTEX-SUCS-1326-SM-5CHQI,GTEX-SUCS
2,male,RNA-Seq,Illumina MiSeq,Thyroid,Thyroid,GTEX-SUCS-0226-SM-5CHQG,GTEX-SUCS
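
The ID parsing above can be wrapped in a small helper that works on a copy of its input. This is a minimal sketch rather than part of the `rep` package: the function name `add_individual` and the toy sample names (including the `_rep1` suffix) are assumptions made for illustration.

```python
import pandas as pd

def add_individual(df, sample_col="Sample_Name_s"):
    """Return a copy of df with 'Sample_Name' and 'Individual' columns added."""
    out = df.copy()
    # Keep only the part before the first underscore
    out["Sample_Name"] = out[sample_col].str.split("_", n=1).str[0]
    # Individual id = first two dash-separated fields, e.g. GTEX-WEY5
    parts = out["Sample_Name"].str.split("-", expand=True)
    out["Individual"] = parts[0] + "-" + parts[1]
    return out

# Toy input; the "_rep1" suffix is made up to exercise the underscore split
toy = pd.DataFrame(
    {"Sample_Name_s": ["GTEX-WEY5-1826-SM-5CHRT", "GTEX-SUCS-1326-SM-5CHQI_rep1"]}
)
result = add_individual(toy)
print(result[["Sample_Name", "Individual"]])
```

Because the helper copies its input, the original frame is left untouched and pandas has no ambiguous slice to warn about.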


- Count total number of individuals

In [215]:
# drop_duplicates() returns a new frame, so no SettingWithCopyWarning is raised
subset = gtex_meta_unique[['body_site_s','Assay_Type_s','Individual']].drop_duplicates()
subset[:3]


Unnamed: 0,body_site_s,Assay_Type_s,Individual
0,Skin - Sun Exposed (Lower leg),RNA-Seq,GTEX-WEY5
1,Testis,RNA-Seq,GTEX-SUCS
2,Thyroid,RNA-Seq,GTEX-SUCS


In [216]:
indiv = subset['Individual'].drop_duplicates()
indiv.count()

723

- Count individuals having WGS or WES

In [217]:
count_dnaseq_individuals = subset[subset['Assay_Type_s'].isin(['WGS','WXS'])]['Individual'].drop_duplicates()
count_dnaseq_individuals.count()

679

- Count individuals having whole blood RNA-Seq

In [218]:
c = subset[subset['Assay_Type_s'].isin(['RNA-Seq'])]
count_rnaseq_individuals = c[c['body_site_s'].isin(['Whole Blood'])]['Individual'].drop_duplicates()
count_rnaseq_individuals.count()

427

- Count individuals having at least two tissues available

In [219]:
c = subset[subset['Assay_Type_s'].isin(['RNA-Seq'])]
count_tissues = c.groupby(['Individual','Assay_Type_s'], sort=True).size()
count_filter = count_tissues[count_tissues > 1]
len(count_filter)

544

- Count individuals having at least two tissues and WGS/WES

In [220]:
tokeep = list(set(count_dnaseq_individuals.tolist()) & set(count_filter.index.get_level_values('Individual')))
len(tokeep)

501
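
The five counts computed in this section can also be gathered in one function. The sketch below assumes a frame shaped like `subset` (unique rows of `body_site_s`, `Assay_Type_s`, `Individual`); the function name `summarize` and the toy individuals are hypothetical, and the numbers it produces are for the toy data only, not for GTEx.

```python
import pandas as pd

def summarize(subset):
    """Summary counts for unique (body_site_s, Assay_Type_s, Individual) rows."""
    is_dna = subset["Assay_Type_s"].isin(["WGS", "WXS"])
    is_rna = subset["Assay_Type_s"] == "RNA-Seq"
    is_blood = subset["body_site_s"] == "Whole Blood"

    dna = set(subset.loc[is_dna, "Individual"])
    blood = set(subset.loc[is_rna & is_blood, "Individual"])
    # Number of distinct RNA-Seq tissues per individual
    tissues = subset.loc[is_rna].groupby("Individual")["body_site_s"].nunique()
    multi = set(tissues[tissues >= 2].index)

    return {
        "total": subset["Individual"].nunique(),
        "dna": len(dna),
        "blood_rna": len(blood),
        "multi_tissue": len(multi),
        "multi_tissue_dna": len(multi & dna),
    }

# Toy data: A has DNA, blood RNA and two tissues; B has DNA only one tissue;
# C has two tissues but no DNA
toy = pd.DataFrame(
    [
        ("Whole Blood", "RNA-Seq", "A"),
        ("Lung", "RNA-Seq", "A"),
        ("Whole Blood", "WGS", "A"),
        ("Lung", "RNA-Seq", "B"),
        ("Whole Blood", "WXS", "B"),
        ("Lung", "RNA-Seq", "C"),
        ("Thyroid", "RNA-Seq", "C"),
    ],
    columns=["body_site_s", "Assay_Type_s", "Individual"],
)
counts = summarize(toy)
print(counts)
```

Counting distinct tissues with `nunique()` gives the same result as `size()` in the cells above as long as the input rows are already unique.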

In [221]:
# this DataFrame will be used later to complete the recount2 sample information
gtex_tokeep = gtex_meta_unique[gtex_meta_unique['Individual'].isin(tokeep)].copy()
gtex_tokeep.rename(columns={"histological_type_s": "Parent_Tissue", "body_site_s": "Tissue"},inplace=True)
gtex_tokeep[:3]

Unnamed: 0,sex_s,Assay_Type_s,Instrument_s,Tissue,Parent_Tissue,Sample_Name,Individual
0,female,RNA-Seq,Illumina MiSeq,Skin - Sun Exposed (Lower leg),Skin,GTEX-WEY5-1826-SM-5CHRT,GTEX-WEY5
1,male,RNA-Seq,Illumina MiSeq,Testis,Testis,GTEX-SUCS-1326-SM-5CHQI,GTEX-SUCS
2,male,RNA-Seq,Illumina MiSeq,Thyroid,Thyroid,GTEX-SUCS-0226-SM-5CHQG,GTEX-SUCS


- Individuals without DNA-seq

In [222]:
# list individuals without WGS or WXS
black = list(set(indiv.tolist())-set(count_dnaseq_individuals.tolist()))

#### 2. GTEx to AnnData object
From GTEx, keep only individuals with WES/WGS that have at least 2 tissues available.
Parse the data to the recommended format.

In [223]:
data_path = os.readlink(os.path.join("..","..","data"))
path = os.path.join(data_path,"raw","gtex","recount","version2")
counts_file = "recount_gene_counts.csv"
rowdata_file = "recount_rowdata.csv"
coldata_file = "recount_coldata.csv"

In [224]:
# creating a large object might take a while
annobj = p.create_anndata(os.path.join(path,counts_file), sep="\t", samples_anno=os.path.join(path,coldata_file),genes_anno=os.path.join(path,rowdata_file))

In [225]:
# parse the sample metadata to our standardized form: Individual, Parent_Tissue, Tissue, Indiv_Seq_Assay, Gender
sample_description = annobj.var
sample_description[:3]

Unnamed: 0,project,sample,experiment,mapped_read_count,avg_read_length,sampid,smatsscr,smts,smtsd
SRR660824,SRP012682,SRS389722,SRX222703,170790002,152,GTEX-QMR6-1926-SM-32PL9,3.0,Lung,Lung
SRR2166176,SRP012682,SRS1036203,SRX1152700,191059974,500,GTEX-T5JC-0011-R11A-SM-5S2RX,,Brain,Brain - Cerebellar Hemisphere
SRR606939,SRP012682,SRS333474,SRX199032,159714774,136,GTEX-POMQ-0326-SM-2I5FO,1.0,Heart,Heart - Left Ventricle


In [226]:
aux = pd.Series(sample_description['sampid']).str.split("-",expand=True)
sample_description.loc[:,'Individual'] = aux[0].map(str) + "-" + aux[1]
sample_description.rename(index=str, columns={"smts": "Parent_Tissue", "smtsd": "Tissue","sampid":"Sample_Name"},inplace=True)
sample_description[:3]

Unnamed: 0,project,sample,experiment,mapped_read_count,avg_read_length,Sample_Name,smatsscr,Parent_Tissue,Tissue,Individual
SRR660824,SRP012682,SRS389722,SRX222703,170790002,152,GTEX-QMR6-1926-SM-32PL9,3.0,Lung,Lung,GTEX-QMR6
SRR2166176,SRP012682,SRS1036203,SRX1152700,191059974,500,GTEX-T5JC-0011-R11A-SM-5S2RX,,Brain,Brain - Cerebellar Hemisphere,GTEX-T5JC
SRR606939,SRP012682,SRS333474,SRX199032,159714774,136,GTEX-POMQ-0326-SM-2I5FO,1.0,Heart,Heart - Left Ventricle,GTEX-POMQ


In [227]:
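# left join on the columns shared by both frames (Sample_Name, Parent_Tissue, Tissue, Individual);
# samples without a match get empty strings, and the original SRR index is restored afterwards
new = sample_description.merge(gtex_tokeep,how='left').fillna("").set_index(sample_description.index)
new.rename(index=str, columns={"sex_s": "Gender"},inplace=True)
new[:3]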

Unnamed: 0,project,sample,experiment,mapped_read_count,avg_read_length,Sample_Name,smatsscr,Parent_Tissue,Tissue,Individual,Gender,Assay_Type_s,Instrument_s
SRR660824,SRP012682,SRS389722,SRX222703,170790002,152,GTEX-QMR6-1926-SM-32PL9,3.0,Lung,Lung,GTEX-QMR6,male,RNA-Seq,Illumina HiSeq 2000
SRR2166176,SRP012682,SRS1036203,SRX1152700,191059974,500,GTEX-T5JC-0011-R11A-SM-5S2RX,,Brain,Brain - Cerebellar Hemisphere,GTEX-T5JC,male,RNA-Seq,Illumina HiSeq 2000
SRR606939,SRP012682,SRS333474,SRX199032,159714774,136,GTEX-POMQ-0326-SM-2I5FO,1.0,Heart,Heart - Left Ventricle,GTEX-POMQ,female,RNA-Seq,Illumina HiSeq 2000


In [228]:
new.shape, len(list(set(new['Individual'].tolist()) & set(tokeep))), len(list(set(new['Individual'].tolist())))

((9662, 13), 501, 551)
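
Restoring the index after the merge only works if the left join keeps exactly one row per sample. A hedged way to check this is to name the join keys explicitly and use the `validate` and `indicator` parameters of `DataFrame.merge`. The frames below are toy data, not the recount2 tables, and the key list is an assumption made for the example.

```python
import pandas as pd

samples = pd.DataFrame(
    {"Sample_Name": ["S1", "S2", "S3"], "Individual": ["I1", "I1", "I2"]},
    index=["SRR1", "SRR2", "SRR3"],
)
meta = pd.DataFrame(
    {"Sample_Name": ["S1", "S3"], "Individual": ["I1", "I2"], "sex_s": ["male", "female"]}
)

merged = samples.merge(
    meta,
    how="left",
    on=["Sample_Name", "Individual"],  # explicit keys instead of implicit shared columns
    validate="many_to_one",            # raises if the keys are duplicated in meta
    indicator=True,                    # adds a _merge column: both / left_only
)
# Safe only because many_to_one guarantees that no left row was duplicated
merged.index = samples.index

unmatched = merged.index[merged["_merge"] == "left_only"].tolist()
print(unmatched)
```

Samples listed in `unmatched` have no metadata row, which corresponds to the rows that end up with empty strings after `fillna("")` in the cell above.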

In [232]:
# assign the DNA sequencing assay of each individual to its experiments (default: WGS)
new.loc[:,'Indiv_Seq_Assay'] = pd.Series('WGS', index=new.index)
for indiv in tokeep:
    assay = list(set(gtex_tokeep[gtex_tokeep['Individual'] == indiv]['Assay_Type_s'].tolist()))
    if 'WGS' not in assay and 'WXS' in assay:
        # only whole exome available: set the assay to WES
        new.loc[new['Individual']  == indiv, 'Indiv_Seq_Assay'] = 'WES'
    elif 'WGS' not in assay and 'WXS' not in assay:
        print("Error - individual has no DNA-seq: ",indiv)
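
The per-individual loop can also be expressed with a `groupby`. The sketch below is an alternative, not the notebook's method: the function name `seq_assay_per_individual` and the toy frames are assumptions. Unlike the loop, it leaves individuals without any DNA assay as missing values instead of defaulting them to `WGS`.

```python
import pandas as pd

def seq_assay_per_individual(meta):
    """Map each individual with DNA-seq to 'WGS' (preferred) or 'WES'."""
    dna = meta[meta["Assay_Type_s"].isin(["WGS", "WXS"])]
    # True if the individual has at least one WGS entry
    has_wgs = dna.groupby("Individual")["Assay_Type_s"].agg(lambda s: (s == "WGS").any())
    return has_wgs.map({True: "WGS", False: "WES"})

# Toy metadata: I1 has WGS and WXS, I2 has WXS only, I3 has no DNA assay
meta = pd.DataFrame(
    {
        "Individual": ["I1", "I1", "I1", "I2", "I2", "I3"],
        "Assay_Type_s": ["RNA-Seq", "WGS", "WXS", "RNA-Seq", "WXS", "RNA-Seq"],
    }
)
assay = seq_assay_per_individual(meta)

samples = pd.DataFrame({"Individual": ["I1", "I2", "I3"]})
# Individuals absent from `assay` become NaN
samples["Indiv_Seq_Assay"] = samples["Individual"].map(assay)
print(samples)
```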

In [275]:
# keep only the experiments of individuals that have WGS/WES
new_filtered = new[new['Individual'].isin(tokeep)]

In [276]:
annobj.var = new
filtered_annobj = annobj[:,new_filtered.index.tolist()]
filtered_annobj.var_names = filtered_annobj.var.index.tolist()

# make the gender consistent within each individual (some rows have an empty value)
for indiv in tokeep:
    gender = filtered_annobj.var.loc[filtered_annobj.var['Individual'] == indiv][['Individual','Gender']].drop_duplicates()
    if gender.shape[0] != 1:
        genders = gender['Gender'].tolist()
        if 'male' in genders and 'female' in genders:
            print("Error - conflicting genders for individual: ", indiv)
        elif 'male' in genders:
            filtered_annobj.var.loc[filtered_annobj.var['Individual'] == indiv,'Gender'] = 'male'
        elif 'female' in genders:
            filtered_annobj.var.loc[filtered_annobj.var['Individual'] == indiv,'Gender'] = 'female'
        else:
            print("could not find gender ", genders, " ", indiv)
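
The same gender clean-up can be sketched on a plain DataFrame without touching the AnnData object. The helper `resolve_gender` and the toy rows are hypothetical. It fills empty values from the known label of the same individual and reports individuals that carry both labels instead of changing them.

```python
import pandas as pd

def resolve_gender(var):
    """Fill missing genders per individual; return (fixed frame, conflicting ids)."""
    out = var.copy()
    known = out.loc[out["Gender"].isin(["male", "female"])]
    # Individuals annotated with both labels are reported, not modified
    n_labels = known.groupby("Individual")["Gender"].nunique()
    conflicting = sorted(n_labels[n_labels > 1].index)
    consensus = known.groupby("Individual")["Gender"].first().drop(index=conflicting)
    fill = out["Individual"].map(consensus)
    # Use the consensus where one exists, otherwise keep the original value
    out["Gender"] = fill.where(fill.notna(), out["Gender"])
    return out, conflicting

# Toy rows: I1 has one empty value, I2 is conflicting, I3 has no label at all
toy = pd.DataFrame(
    {
        "Individual": ["I1", "I1", "I2", "I2", "I3"],
        "Gender": ["male", "", "male", "female", ""],
    }
)
fixed, conflicts = resolve_gender(toy)
print(fixed)
print(conflicts)
```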

In [277]:
filtered_annobj.var.shape, new.shape, new_filtered.shape, len(tokeep), filtered_annobj.var['Individual'].drop_duplicates().shape

((8911, 14), (9662, 14), (8911, 14), 501, (501,))

In [278]:
# save AnnData object to HDF5 format
output_file = os.path.join(data_path,"processed","gtex","recount","recount_gtex.h5ad")
p.save(filtered_annobj, outname=output_file)

/s/project/rep/processed/gtex/recount/recount_gtex.h5ad


... storing 'symbol' as categorical
... storing 'project' as categorical
... storing 'smatsscr' as categorical
... storing 'Parent_Tissue' as categorical
... storing 'Tissue' as categorical
... storing 'Individual' as categorical
... storing 'Gender' as categorical
... storing 'Assay_Type_s' as categorical
... storing 'Instrument_s' as categorical
... storing 'Indiv_Seq_Assay' as categorical


'/data/nasif12/home_if12/giurgiu/rep_gagneur/rep/notebooks/preprocessing/tmp1547640828.h5ad'