2. Assignment 1: Data set selection and initial processing
Yuzi Li edited this page Feb 15, 2022
Select human RNAseq data from GEO and process the data to prepare for further analyses.
Expected duration: 3h
Actual duration: 13h
- Select an expression dataset.
- Clean the data and map to HUGO symbols.
- Normalize data using edgeR.
- Interpret the dataset and add comments to notebook.
- I used the following code to select my dataset (related to cancer):
# Load the packages used below: GEOmetadb for the metadata file, DBI/RSQLite for the queries
library(GEOmetadb)
library(DBI)
library(RSQLite)
# Get meta data (downloaded only if the file is not already present)
if (!file.exists("GEOmetadb.sqlite")) GEOmetadb::getSQLiteFile()
# Connect to db
con <- dbConnect(SQLite(), "GEOmetadb.sqlite")
# Check out tables
geo_tables <- dbListTables(con)
geo_tables
dbListFields(con, "gse")
# Run SQL queries to find datasets:
# human, high-throughput sequencing, "cancer" in the title, submitted after 2017-01-01
sql <- paste("SELECT DISTINCT gse.title,gse.gse, gpl.title,",
             " gse.submission_date,",
             " gse.supplementary_file",
             "FROM",
             " gse JOIN gse_gpl ON gse_gpl.gse=gse.gse",
             " JOIN gpl ON gse_gpl.gpl=gpl.gpl",
             "WHERE",
             " gse.submission_date > '2017-01-01' AND",
             " gpl.organism LIKE '%Homo sapiens%' AND",
             " gse.title LIKE '%cancer%' AND",
             " gpl.technology LIKE '%high-throughput sequencing%' ",
             " ORDER BY gse.submission_date DESC", sep = " ")
rs <- dbGetQuery(con, sql)
# Break the supplementary file entries up and keep just the actual file name
sfile_names <- unlist(lapply(rs$supplementary_file,
  FUN = function(x) {
    x <- unlist(strsplit(x, ";"))                        # one entry per file
    x <- x[grep(x, pattern = "txt", ignore.case = TRUE)] # keep text files only
    tail(unlist(strsplit(x, "/")), n = 1)                # last path component
  }))
# Get file names that contain count
counts_files <- sfile_names[grep(sfile_names, pattern = "count", ignore.case = TRUE)]
# Close the connection once the query results are in memory
dbDisconnect(con)
- The results are:
[1] "GSE166697_Urinary_miRNA_counts.txt.gz" "GSE165452_ACY241_PRL_RNAseq_no_rRNA_Kallisto_Gene_Counts_matrix.txt.gz"
[3] "GSE165247_RawCounts_Matrix_2019.txt.gz" "GSE165115_Transcriptome_counts.txt.gz"
[5] "GSE165115_Transcriptome_counts.txt.gz" "GSE164531_NEDD9_count_table.txt.gz"
[7] "GSE163374_rat.raw.counts.txt.gz" "GSE162515_RNAseq_rawCounts.txt.gz"
[9] "GSE162564_22RV1_CountTable.txt.gz" "GSE162285_gene_raw_counts_matrix.txt.gz"
[11] "GSE162104_count.txt.gz" "GSE161691_TMM_normalized_reads_count_per_million_LNCaP.txt.gz"
[13] "GSE161502_Raw_Counts.txt.gz" "GSE161349_Counts_noadj_exons_condense.txt.gz"
[15] "GSE161243_compiled_counts.txt.gz" "GSE160693_Normalized_Gene_Counts_Matrix.txt.gz"
[17] "GSE160723_read_counts.txt.gz" "GSE160314_BM_vs_ES_normalized_count_matrix.txt.gz"
[19] "GSE160252_miRNA.canonical.rawcounts.txt.gz" "GSE160252_miRNA.canonical.rawcounts.txt.gz"
[21] "GSE159493_raw_gene_counts_matrix.txt.gz" "GSE158945_Raw_counts_per_million.txt.gz"
[23] "GSE158949_voomNormalizedCountsMatrix.txt.gz" "GSE158722_P24.counts.txt.gz"
[25] "GSE158722_P24.counts.txt.gz" "GSE158724_icell8.counts.txt.gz"
[27] "GSE158724_icell8.counts.txt.gz" "GSE158317_cell_line_mRNA_raw_counts.txt.gz"
[29] "GSE157927_htseqcounts_mp2_tpl_thz1.txt.gz" "GSE157927_htseqcounts_mp2_tpl_thz1.txt.gz"
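The file-name extraction step in the code above can be checked on a small example. This is a minimal sketch using two made-up `supplementary_file` strings; the URLs, the series IDs, and the helper name `extract_txt_name` are hypothetical and are not taken from GEOmetadb.

```r
# Hypothetical supplementary_file values: ";"-separated URLs, one string per series.
example_files <- c(
  "ftp://example.org/GSE000001/suppl/GSE000001_RAW.tar;ftp://example.org/GSE000001/suppl/GSE000001_raw_counts.txt.gz",
  "ftp://example.org/GSE000002/suppl/GSE000002_RAW.tar"
)

# Same logic as the anonymous function in the query code, given a name for clarity.
extract_txt_name <- function(x) {
  urls <- unlist(strsplit(x, ";"))                     # one URL per element
  urls <- urls[grep("txt", urls, ignore.case = TRUE)]  # keep text files only
  tail(unlist(strsplit(urls, "/")), n = 1)             # last path component of the last match
}

example_names <- unlist(lapply(example_files, extract_txt_name))
example_counts <- example_names[grep("count", example_names, ignore.case = TRUE)]
print(example_counts)
```

A series without any text file drops out of the result, and a series with several text files contributes only the last one, so the position of a file in `counts_files` does not correspond to a row number in `rs`.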
- I selected the 15th dataset: GSE161243
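One way to fetch the selected series is sketched below with `GEOquery::getGEOSuppFiles()`. The helper name `fetch_counts` is hypothetical, and the file layout (tab-delimited text with a header row) is an assumption that should be checked against the downloaded file; the notebook may load the data differently.

```r
# Sketch only: needs the GEOquery package and network access when called.
fetch_counts <- function(gse = "GSE161243", pattern = "count") {
  # getGEOSuppFiles() downloads the supplementary files into a folder named
  # after the series and returns a data frame whose row names are the local paths.
  sfiles <- GEOquery::getGEOSuppFiles(gse)
  paths <- rownames(sfiles)
  # Take the first supplementary file whose name contains the pattern.
  counts_path <- paths[grep(pattern, basename(paths), ignore.case = TRUE)][1]
  # Assumption: tab-delimited text with a header row (gzip is read transparently).
  read.delim(counts_path, header = TRUE, check.names = FALSE)
}

# Usage (not run here because it downloads data):
# counts <- fetch_counts()
# dim(counts)
```

Setting `check.names = FALSE` keeps the sample names exactly as they appear in the file, which helps when matching them to the sample metadata later.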
- See R Notebook: Assignment 1 html notebook
- Data need to be normalized to reduce technical variation that might confound true biological variation
- For gene expression data, we need to filter out low-count genes
- Density plots and boxplots can be used to visualize the distribution of each sample
- An MDS plot can be used to assess distances between samples
- Techniques learned in using edgeR, biomaRt, and GEOmetadb can be used in future work
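The points above can be sketched on simulated data. This is a minimal illustration, not the notebook's actual analysis: the count matrix, sample names, and groups are made up, and the filtering threshold (more than 1 CPM in at least 3 samples, the size of the smaller group) is an assumption chosen for the example.

```r
library(edgeR)

set.seed(1)
# Toy count matrix: 500 hypothetical genes x 6 hypothetical samples.
n_genes <- 500
toy_counts <- matrix(rnbinom(n_genes * 6, mu = 50, size = 2), nrow = n_genes,
                     dimnames = list(paste0("gene", seq_len(n_genes)),
                                     paste0("sample", 1:6)))
toy_counts[1:50, ] <- 0  # make the first 50 genes unexpressed
group <- factor(rep(c("control", "treated"), each = 3))

# Filter out low-count genes: keep genes with > 1 CPM in at least 3 samples.
keep <- rowSums(cpm(toy_counts) > 1) >= 3
filtered <- toy_counts[keep, ]

# TMM normalization: calcNormFactors() uses method = "TMM" by default.
d <- DGEList(counts = filtered, group = group)
d <- calcNormFactors(d)
norm_cpm <- cpm(d)  # CPM computed with the normalized library sizes

# Boxplot and density plot of the per-sample distributions (log2 scale).
log_cpm <- log2(norm_cpm + 1)
boxplot(log_cpm, las = 2, ylab = "log2 CPM", main = "Normalized toy counts")
plot(density(log_cpm[, 1]), main = "Density of log2 CPM", xlab = "log2 CPM")
for (i in 2:ncol(log_cpm)) lines(density(log_cpm[, i]), col = i)

# MDS plot: distances between samples, coloured by group.
plotMDS(d, labels = colnames(d), col = c("blue", "red")[d$samples$group])
```

On real data, the same steps would start from the cleaned count matrix and the groups defined by the experiment, and the plots before and after normalization can be compared side by side.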