2. Assignment 1: Data set selection and initial processing

Yuzi Li edited this page Feb 15, 2022 · 9 revisions

Objective

Select human RNAseq data from GEO and process the data to prepare for further analyses.

Duration

Expected duration: 3h
Actual duration: 13h

Progress

Tasks

  1. Select an expression dataset.
  2. Clean the data and map to HUGO symbols.
  3. Normalize data using edgeR.
  4. Interpret the dataset and add comments to notebook.

Dataset selection

  • I used the following code to select my dataset (related to cancer):
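# Load the database interface packages used below
library(DBI)
library(RSQLite)

# Download the GEO metadata database once, if it is not already present
if(!file.exists('GEOmetadb.sqlite')) GEOmetadb::getSQLiteFile()

# Connect to db
con <- dbConnect(SQLite(),'GEOmetadb.sqlite')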

# Check out tables
geo_tables <- dbListTables(con)
geo_tables
dbListFields(con,'gse')

# Run SQL queries to find datasets
sql <- paste("SELECT DISTINCT gse.title,gse.gse, gpl.title,",
             " gse.submission_date,",
             " gse.supplementary_file",
             "FROM",
             " gse JOIN gse_gpl ON gse_gpl.gse=gse.gse",
             " JOIN gpl ON gse_gpl.gpl=gpl.gpl",
             "WHERE",
             " gse.submission_date > '2017-01-01' AND",
             " gpl.organism LIKE '%Homo sapiens%' AND",
             " gse.title LIKE '%cancer%' AND", 
             " gpl.technology LIKE '%high-throughput sequencing%' ",
             " ORDER BY gse.submission_date DESC",sep=" ")
rs <- dbGetQuery(con,sql)
#break the file names up and just get the actual file name
sfile_names <- unlist(lapply(rs$supplementary_file,
              FUN = function(x){x <- unlist(strsplit(x,";")) ;
              x <- x[grep(x,pattern="txt",ignore.case = TRUE)];
              tail(unlist(strsplit(x,"/")),n=1)}))
# Get file names that contain count
counts_files <- sfile_names[grep(sfile_names,pattern = "count",ignore.case = TRUE)]
  • The results are:
 [1] "GSE166697_Urinary_miRNA_counts.txt.gz"                                  "GSE165452_ACY241_PRL_RNAseq_no_rRNA_Kallisto_Gene_Counts_matrix.txt.gz"
 [3] "GSE165247_RawCounts_Matrix_2019.txt.gz"                                 "GSE165115_Transcriptome_counts.txt.gz"                                 
 [5] "GSE165115_Transcriptome_counts.txt.gz"                                  "GSE164531_NEDD9_count_table.txt.gz"                                    
 [7] "GSE163374_rat.raw.counts.txt.gz"                                        "GSE162515_RNAseq_rawCounts.txt.gz"                                     
 [9] "GSE162564_22RV1_CountTable.txt.gz"                                      "GSE162285_gene_raw_counts_matrix.txt.gz"                               
[11] "GSE162104_count.txt.gz"                                                 "GSE161691_TMM_normalized_reads_count_per_million_LNCaP.txt.gz"         
[13] "GSE161502_Raw_Counts.txt.gz"                                            "GSE161349_Counts_noadj_exons_condense.txt.gz"                          
[15] "GSE161243_compiled_counts.txt.gz"                                       "GSE160693_Normalized_Gene_Counts_Matrix.txt.gz"                        
[17] "GSE160723_read_counts.txt.gz"                                           "GSE160314_BM_vs_ES_normalized_count_matrix.txt.gz"                     
[19] "GSE160252_miRNA.canonical.rawcounts.txt.gz"                             "GSE160252_miRNA.canonical.rawcounts.txt.gz"                            
[21] "GSE159493_raw_gene_counts_matrix.txt.gz"                                "GSE158945_Raw_counts_per_million.txt.gz"                               
[23] "GSE158949_voomNormalizedCountsMatrix.txt.gz"                            "GSE158722_P24.counts.txt.gz"                                           
[25] "GSE158722_P24.counts.txt.gz"                                            "GSE158724_icell8.counts.txt.gz"                                        
[27] "GSE158724_icell8.counts.txt.gz"                                         "GSE158317_cell_line_mRNA_raw_counts.txt.gz"                            
[29] "GSE157927_htseqcounts_mp2_tpl_thz1.txt.gz"                              "GSE157927_htseqcounts_mp2_tpl_thz1.txt.gz"
  • I selected the 15th dataset: GSE161243
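
The listing above contains repeated entries and some files whose names suggest already-normalized values. A minimal sketch of how the list could be screened before choosing a dataset is shown here, using a few file names copied from the output. The `normalized`/`per_million` pattern is an assumed heuristic, not part of the original workflow, and a name check alone cannot confirm the organism or the data type (the `rat` file still passes), so the GEO series record still needs to be read.

```r
# A few file names copied from the query output above
counts_files <- c(
  "GSE165115_Transcriptome_counts.txt.gz",
  "GSE165115_Transcriptome_counts.txt.gz",
  "GSE163374_rat.raw.counts.txt.gz",
  "GSE161691_TMM_normalized_reads_count_per_million_LNCaP.txt.gz",
  "GSE161243_compiled_counts.txt.gz",
  "GSE160693_Normalized_Gene_Counts_Matrix.txt.gz"
)

# Drop exact duplicates (one series can match the query more than once)
unique_files <- unique(counts_files)

# Assumed heuristic: names mentioning normalization are probably not raw counts
looks_processed <- grepl("normali[sz]ed|per_million", unique_files,
                         ignore.case = TRUE)
candidates <- unique_files[!looks_processed]

# The GSE accession is the part of the file name before the first underscore
gse_ids <- sub("_.*$", "", candidates)
gse_ids
```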

Data processing and analyses

Conclusions and Outlook

  • Data need to be normalized to reduce technical variation that might otherwise be confused with true biological variation
  • For gene expression data, genes with low counts need to be filtered out
  • Density plots and boxplots can be used to visualize the distribution of each sample
  • An MDS plot can be used to assess distances between samples
  • The techniques learned with edgeR, biomaRt, and GEOmetadb can be reused in future work
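
The filtering and sample-distance points above can be illustrated with a small base R sketch. It is not the notebook's actual code. The count matrix is made-up toy data, the rule "CPM greater than 1 in at least as many samples as the smallest group" is an assumed threshold, and the classical MDS from `cmdscale` stands in for edgeR's own MDS plot, which uses a different distance.

```r
# Toy count matrix (hypothetical data): 6 genes x 4 samples
counts <- matrix(
  c(    500,     620,     480,     510,
          0,       1,       0,       2,
        150,     170,     160,     140,
          3,       0,       1,       0,
    2000000, 1900000, 2100000, 2000000,
         40,      35,      50,      45),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("gene", 1:6),
                  c("ctrl_1", "ctrl_2", "treat_1", "treat_2"))
)
group <- factor(c("ctrl", "ctrl", "treat", "treat"))

# Counts per million: divide each sample (column) by its library size
lib_sizes <- colSums(counts)
cpm_manual <- t(t(counts) / lib_sizes) * 1e6

# Assumed rule: keep genes with CPM > 1 in at least min(group size) samples
min_samples <- min(table(group))
keep <- rowSums(cpm_manual > 1) >= min_samples
filtered <- counts[keep, , drop = FALSE]

# log2 CPM of the retained genes, the scale usually shown in density plots
# and boxplots
log_cpm <- log2(cpm_manual[keep, , drop = FALSE] + 1)

# Classical MDS on Euclidean distances between samples (rows of t(log_cpm))
mds <- cmdscale(dist(t(log_cpm)), k = 2)

rownames(filtered)
mds
```

In this toy example `gene2` and `gene4` are dropped, because their CPM exceeds 1 in fewer than two samples. Note that this sketch only rescales by library size; it does not apply the TMM normalization that edgeR provides.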