2. Assignment 1: Data set selection and initial processing

Objective

Select human RNAseq data from GEO and process the data to prepare for further analyses.

Duration

Expected duration: 3h
Actual duration: 13h

Progress

Tasks

Select an expression dataset.
Clean the data and map to HUGO symbols.
Normalize data using edgeR.
Interpret the dataset and add comments to notebook.
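
The mapping step in the task list can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes the count matrix is keyed by Ensembl gene IDs, and it uses a small hand-written lookup table (`id_map`, a hypothetical name) in place of a live biomaRt query so that it runs offline. The last row ID is a made-up placeholder that stands for an identifier with no HUGO symbol.

```r
# In practice the lookup table would come from biomaRt (needs network access), e.g.
# ensembl <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# id_map  <- biomaRt::getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
#                           filters = "ensembl_gene_id",
#                           values = rownames(counts), mart = ensembl)

# Hand-written stand-in for the biomaRt result (assumption, for illustration only)
id_map <- data.frame(
  ensembl_gene_id = c("ENSG00000141510", "ENSG00000012048", "ENSG00000139618"),
  hgnc_symbol     = c("TP53", "BRCA1", "BRCA2"),
  stringsAsFactors = FALSE
)

# Toy count matrix; "ENSG00000000000" is a made-up ID with no symbol
counts <- matrix(c(10, 0, 5, 3, 12, 1, 7, 2), nrow = 4,
                 dimnames = list(c("ENSG00000141510", "ENSG00000012048",
                                   "ENSG00000139618", "ENSG00000000000"),
                                 c("s1", "s2")))

# Look up a symbol for every row, then drop rows that could not be mapped
symbols  <- id_map$hgnc_symbol[match(rownames(counts), id_map$ensembl_gene_id)]
unmapped <- is.na(symbols) | symbols == ""
mapped_counts <- counts[!unmapped, , drop = FALSE]
rownames(mapped_counts) <- symbols[!unmapped]

mapped_counts
```

Using `match()` keeps the rows in their original order and makes the unmapped identifiers easy to count and report before they are dropped.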

Dataset selection

I used the following code to select my dataset (related to cancer):

# Load the database interface packages used below
library(DBI)
library(RSQLite)

# Get the metadata database (downloaded once, then reused)
if (!file.exists("GEOmetadb.sqlite")) GEOmetadb::getSQLiteFile()

# Connect to the database
con <- dbConnect(SQLite(), "GEOmetadb.sqlite")

# Check out the tables
geo_tables <- dbListTables(con)
geo_tables
dbListFields(con, "gse")

# Run an SQL query to find datasets
sql <- paste("SELECT DISTINCT gse.title,gse.gse, gpl.title,",
             " gse.submission_date,",
             " gse.supplementary_file",
             "FROM",
             " gse JOIN gse_gpl ON gse_gpl.gse=gse.gse",
             " JOIN gpl ON gse_gpl.gpl=gpl.gpl",
             "WHERE",
             " gse.submission_date > '2017-01-01' AND",
             " gpl.organism LIKE '%Homo sapiens%' AND",
             " gse.title LIKE '%cancer%' AND",
             " gpl.technology LIKE '%high-throughput sequencing%' ",
             " ORDER BY gse.submission_date DESC", sep = " ")
rs <- dbGetQuery(con, sql)

# Split each supplementary_file entry on ";", keep the txt entries,
# and reduce the last of them to its bare file name
sfile_names <- unlist(lapply(rs$supplementary_file,
              FUN = function(x) {
                x <- unlist(strsplit(x, ";"))
                x <- x[grep(x, pattern = "txt", ignore.case = TRUE)]
                tail(unlist(strsplit(x, "/")), n = 1)
              }))

# Keep the file names that contain "count"
counts_files <- sfile_names[grep(sfile_names, pattern = "count", ignore.case = TRUE)]

# Close the connection once the query results are in memory
dbDisconnect(con)

The results are:

 [1] "GSE166697_Urinary_miRNA_counts.txt.gz"                                  "GSE165452_ACY241_PRL_RNAseq_no_rRNA_Kallisto_Gene_Counts_matrix.txt.gz"
 [3] "GSE165247_RawCounts_Matrix_2019.txt.gz"                                 "GSE165115_Transcriptome_counts.txt.gz"                                 
 [5] "GSE165115_Transcriptome_counts.txt.gz"                                  "GSE164531_NEDD9_count_table.txt.gz"                                    
 [7] "GSE163374_rat.raw.counts.txt.gz"                                        "GSE162515_RNAseq_rawCounts.txt.gz"                                     
 [9] "GSE162564_22RV1_CountTable.txt.gz"                                      "GSE162285_gene_raw_counts_matrix.txt.gz"                               
[11] "GSE162104_count.txt.gz"                                                 "GSE161691_TMM_normalized_reads_count_per_million_LNCaP.txt.gz"         
[13] "GSE161502_Raw_Counts.txt.gz"                                            "GSE161349_Counts_noadj_exons_condense.txt.gz"                          
[15] "GSE161243_compiled_counts.txt.gz"                                       "GSE160693_Normalized_Gene_Counts_Matrix.txt.gz"                        
[17] "GSE160723_read_counts.txt.gz"                                           "GSE160314_BM_vs_ES_normalized_count_matrix.txt.gz"                     
[19] "GSE160252_miRNA.canonical.rawcounts.txt.gz"                             "GSE160252_miRNA.canonical.rawcounts.txt.gz"                            
[21] "GSE159493_raw_gene_counts_matrix.txt.gz"                                "GSE158945_Raw_counts_per_million.txt.gz"                               
[23] "GSE158949_voomNormalizedCountsMatrix.txt.gz"                            "GSE158722_P24.counts.txt.gz"                                           
[25] "GSE158722_P24.counts.txt.gz"                                            "GSE158724_icell8.counts.txt.gz"                                        
[27] "GSE158724_icell8.counts.txt.gz"                                         "GSE158317_cell_line_mRNA_raw_counts.txt.gz"                            
[29] "GSE157927_htseqcounts_mp2_tpl_thz1.txt.gz"                              "GSE157927_htseqcounts_mp2_tpl_thz1.txt.gz"

I selected the 15th dataset in this list: GSE161243 (file GSE161243_compiled_counts.txt.gz).
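
As a small sketch of the next step, the GEO accession can be recovered from the chosen file name and then used to fetch the supplementary file. The download lines are commented out because they need network access, and reading the file with `read.delim` assumes it is tab-delimited, which is not confirmed by this page.

```r
# File name of the 15th entry in the result list above
chosen_file <- "GSE161243_compiled_counts.txt.gz"

# The accession is everything before the first underscore
gse_id <- sub("_.*$", "", chosen_file)
gse_id

# Download the supplementary files for that accession (needs network access):
# sfiles <- GEOquery::getGEOSuppFiles(gse_id)
# counts <- read.delim(rownames(sfiles)[1], header = TRUE, check.names = FALSE)
```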

Data processing and analyses

See R Notebook: Assignment 1 html notebook

Conclusions and Outlook

Data need to be normalized to reduce technical variation that might confound true biological variation.
For gene expression data, we need to filter out the low-count genes.
Density plots and boxplots can be used to visualize the distribution of the data in each sample.
An MDS plot can be used to assess distances between samples.
The techniques learned with edgeR, biomaRt, and GEOmetadb can be reused in future work.
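
The points above can be illustrated with a short sketch on simulated counts. Nothing here comes from the actual notebook: the matrix, the group labels, and the threshold of 1 CPM in at least as many samples as the smallest group are assumptions chosen for illustration. The filtering and plots use base R only; the edgeR part (TMM normalization and the MDS plot) runs only if edgeR is installed.

```r
# Simulated counts stand in for the real GSE161243 matrix (assumption)
set.seed(42)
n_genes   <- 1000
n_samples <- 6
counts <- matrix(rnbinom(n_genes * n_samples, mu = 50, size = 2),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_len(n_samples))))
# Make the first 200 genes lowly expressed so the filter has something to drop
counts[1:200, ] <- rnbinom(200 * n_samples, mu = 0.2, size = 2)
group <- factor(rep(c("control", "treated"), each = 3))

# Counts per million by hand: scale each column by its library size
cpm_manual <- t(t(counts) / colSums(counts)) * 1e6

# Keep genes above 1 CPM in at least as many samples as the smallest group
min_samples <- min(table(group))
keep     <- rowSums(cpm_manual > 1) >= min_samples
filtered <- counts[keep, , drop = FALSE]
cat("Genes kept:", nrow(filtered), "of", nrow(counts), "\n")

# Per-sample distributions on the log scale: boxplot and density curves
log_cpm <- log2(cpm_manual[keep, , drop = FALSE] + 1)
boxplot(log_cpm, las = 2, ylab = "log2(CPM + 1)",
        main = "Per-sample distributions")
dens <- lapply(seq_len(ncol(log_cpm)), function(j) density(log_cpm[, j]))
plot(dens[[1]], main = "Density of log2(CPM + 1)", xlab = "log2(CPM + 1)",
     ylim = c(0, max(sapply(dens, function(d) max(d$y)))))
for (j in seq_along(dens)[-1]) lines(dens[[j]], col = j)

# TMM normalization and MDS plot with edgeR, if it is available
if (requireNamespace("edgeR", quietly = TRUE)) {
  suppressPackageStartupMessages(library(edgeR))
  dge <- DGEList(counts = filtered, group = group)
  dge <- calcNormFactors(dge)            # TMM is the default method
  norm_log_cpm <- cpm(dge, log = TRUE)   # normalized log2 CPM
  plotMDS(dge, col = as.integer(group))  # samples coloured by group
}
```

Comparing the boxplots before and after normalization, and checking whether samples from the same group sit close together in the MDS plot, is one way to judge whether technical variation has been reduced.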
