Assignment 1: Data set selection and initial Processing

Metyu Melkonyan edited this page Apr 6, 2023 · 11 revisions

Part 1: Finding the Expression Dataset from GEO

Objective

  • To find an RNA expression dataset for Part 1 of Assignment 1
  • To become familiar with the RNA expression dataset
  • To explore the dataset for non-redundant genes and for further genes that may be associated with other cancer types
  • To do further research on pancreatic cancer
  • To do further research on RNA-seq and other methods for quantifying the expression of gene sets

Duration

Time estimated: 2 hours
Time taken: 3 hours
Date started: 2023-02-02
Completed: 2023-02-02

Microarray explanation: Probes on a chip hybridize to the sample. An illumination method is used to measure the fluorescence, and later image analysis allows expression to be measured.

RNA-seq:

  • RNA-seq sampling
  • RNA extraction and target enrichment
  • Fragmentation of the RNA molecules and cDNA library assembly
  • Sequencing and FASTQ file generation
  • Transcriptome mapping using the sequencing data
  • Bioinformatics: differential expression analysis, variant calling, annotation, novel transcript discovery and RNA editing analysis, using different computational methods
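To make the FASTQ step above more concrete, here is a minimal base-R sketch of what a FASTQ record looks like and how its quality string can be decoded. The two reads are made up for illustration, and the sketch assumes Phred+33 quality encoding. Real FASTQ files would normally be handled by dedicated tools rather than parsed by hand.

```r
# Two made-up FASTQ records; each record is four lines:
# @identifier, sequence, "+" separator, quality string.
fastq_lines <- c(
  "@read1", "GATTACA", "+", "IIIIHH#",
  "@read2", "ACGT",    "+", "II5!"
)

# A well-formed file has a multiple of four lines.
stopifnot(length(fastq_lines) %% 4 == 0)

# Index of the first line of every record.
starts <- seq(1, length(fastq_lines), by = 4)

reads <- data.frame(
  id       = sub("^@", "", fastq_lines[starts]),
  sequence = fastq_lines[starts + 1],
  quality  = fastq_lines[starts + 3],
  stringsAsFactors = FALSE
)

# Assumption: Phred+33 encoding, so the score is the ASCII code minus 33.
reads$mean_quality <- vapply(
  reads$quality,
  function(q) mean(utf8ToInt(q) - 33),
  numeric(1),
  USE.NAMES = FALSE
)

print(reads)
```

Each base has exactly one quality character, so the sequence and quality strings of a record always have the same length.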

Bulk RNA-seq:

  • We are using bulk RNA-seq for this assignment because the dataset is small and provides HTSeq raw counts, which facilitates the normalization process
  • The GEO database, along with GEOmetadb, is used to retrieve the gene expression data
  • SQLite is used to retrieve information from the GEOmetadb database
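As a rough illustration of why raw counts make normalization straightforward, the sketch below converts a toy count matrix to counts per million (CPM) in base R. The gene names, sample names and values are invented, and CPM is only one simple library-size normalization, not necessarily the method used later in this assignment.

```r
# Toy raw count matrix (hypothetical values): rows are genes, columns are samples.
counts <- matrix(
  c(10, 20, 30,
     0,  5, 15,
    90, 75, 55),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("sample1", "sample2", "sample3"))
)

# Library size: total number of counted reads per sample.
lib_sizes <- colSums(counts)

# Counts per million: divide each column by its library size, then scale by 1e6.
cpm <- sweep(counts, 2, lib_sizes, FUN = "/") * 1e6

# log2-CPM with an offset of 1 so that zero counts stay finite.
log_cpm <- log2(cpm + 1)

print(round(cpm))
```

After this scaling every column sums to one million, so samples sequenced at different depths become comparable.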

Conclusion

  • The template code was structured around the query search
  • A potential dataset, GSE164730, was found and used
  • I became more familiar with the RNA-seq procedure and its methods
  • Other research on pancreatic cancer and other cancers is promising

The expression dataset is retrieved from the GEOmetadb database. The code from here until the data processing part is only for finding the expression dataset.
if(!file.exists('GEOmetadb.sqlite')) 
  GEOmetadb::getSQLiteFile()

Connect to the GEOmetadb SQLite database so that its tables can be queried
con <- DBI::dbConnect(RSQLite::SQLite(),'GEOmetadb.sqlite')

To inspect the connection, the tables available in the database are listed
Geo_tables <- DBI::dbListTables(con)
Geo_tables

Retrieve the first rows of the gpl table to see what the platform metadata looks like
results <- DBI::dbGetQuery(con,'select * from gpl limit 5')
knitr::kable(head(results[,1:10]), format = "html")


The following query was used as a template for finding the expression data
sql <- paste("SELECT DISTINCT gse.title,gse.gse, gpl.title,",
" gse.submission_date,",
" gse.supplementary_file",
"FROM",
" gse JOIN gse_gpl ON gse_gpl.gse=gse.gse",
" JOIN gpl ON gse_gpl.gpl=gpl.gpl",
"WHERE",
" gse.submission_date > '2014-01-01' AND",
" gse.title LIKE '%Cancer%' AND", 
" gpl.organism LIKE '%Homo sapiens%' AND",
" gpl.technology LIKE '%High-throughput sequencing%' ",
" ORDER BY gse.submission_date DESC",sep=" ")

The search results from the query can be obtained from the following command
result_query <- DBI::dbGetQuery(con,sql)
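Once the results are in a data frame, a typical next step is to narrow them down to series that ship a counts file. The sketch below shows one way to do this in base R. The data frame is a toy stand-in for result_query (the GSE accessions, titles and file URLs are invented), and the assumption that a usable file has "count" in its name is only a heuristic.

```r
# Toy stand-in for result_query (hypothetical rows; real rows come from GEOmetadb).
result_query <- data.frame(
  title = c("Cancer study A", "Cancer study B", "Cancer study C"),
  gse = c("GSE000001", "GSE000002", "GSE000003"),
  supplementary_file = c(
    "ftp://example.org/GSE000001_raw_counts.txt.gz",
    "ftp://example.org/GSE000002_RAW.tar",
    NA
  ),
  stringsAsFactors = FALSE
)

# Keep series whose supplementary file name mentions "count" (case-insensitive).
# grepl() returns FALSE for NA, so series without a file are dropped as well.
has_counts <- grepl("count", result_query$supplementary_file, ignore.case = TRUE)
counts_candidates <- result_query[has_counts, ]

print(counts_candidates$gse)
```

The remaining candidates still need to be checked by hand, for example for the number of samples and the quality of the associated publication.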

Part 2: Normalization & Data Cleaning

Objective

  • To clean the gene expression dataset of GSE131222
  • To normalize the expression values
  • To analyze the normalized values
  • To validate whether the normalized values make sense
  • To compare the normalized values with the results obtained from the final concluding analysis (do they make sense?)
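The cleaning objectives above can be sketched in base R as follows. The table is a toy example with invented gene names and counts. Summing rows that share an identifier and requiring at least 10 reads in at least 2 samples are assumptions chosen for illustration, not the thresholds used in this assignment.

```r
# Toy raw count table (hypothetical values); geneB appears twice.
raw <- data.frame(
  gene = c("geneA", "geneB", "geneB", "geneC", "geneD"),
  s1 = c(50, 3, 4, 0, 120),
  s2 = c(40, 2, 6, 1, 100),
  s3 = c(60, 5, 1, 0, 90),
  stringsAsFactors = FALSE
)

# Find identifiers that occur more than once.
dup_table <- table(raw$gene)
duplicated_genes <- names(dup_table[dup_table > 1])

# One possible policy (an assumption): sum the counts of rows sharing an identifier.
# An alternative is to keep only the first row: raw[!duplicated(raw$gene), ].
collapsed <- aggregate(. ~ gene, data = raw, FUN = sum)

# Move the identifiers into row names to get a numeric count matrix.
count_mat <- as.matrix(collapsed[, -1])
rownames(count_mat) <- collapsed$gene

# Filter: keep genes with at least min_count reads in at least min_samples samples.
min_count <- 10
min_samples <- 2
keep <- rowSums(count_mat >= min_count) >= min_samples
filtered <- count_mat[keep, , drop = FALSE]

print(duplicated_genes)
print(filtered)
```

In this toy example geneB is collapsed into one row and then removed together with geneC, because neither reaches the count threshold.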

Duration

Time estimated: 3 hours
Time taken: 5 hours
Date started: 2023-02-08
Completed: 2023-02-12

Conclusion

  • Different data sets were cleaned
  • Replicated gene expression rows were eliminated
  • Normalization was used to see the difference between the unreplicated and replicated values
  • The normalized values make sense because they agree with the expected analysis result (the next part validates this)
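One simple way to check whether normalization "makes sense" is to compare how far apart the per-sample distributions are before and after. The sketch below does this on a toy matrix in which three samples share the same expression profile but differ in sequencing depth. The values are invented, and library-size scaling is used here only as a stand-in for whatever normalization method is actually applied.

```r
# Toy counts (hypothetical): the same expression profile sequenced at 1x, 2x and 4x depth.
profile <- c(5, 20, 80, 320, 1280)
counts <- cbind(s1 = profile, s2 = profile * 2, s3 = profile * 4)
rownames(counts) <- paste0("gene", seq_along(profile))

# Distance between the lowest and highest per-sample median on the log2 scale.
median_spread <- function(m) {
  sample_medians <- apply(log2(m + 1), 2, median)
  diff(range(sample_medians))
}

# Stand-in normalization: scale every sample to one million reads.
normalized <- t(t(counts) / colSums(counts)) * 1e6

spread_before <- median_spread(counts)
spread_after <- median_spread(normalized)

cat("Spread of sample medians before:", spread_before, "\n")
cat("Spread of sample medians after: ", spread_after, "\n")
```

If the spread shrinks after normalization, depth differences between samples have been reduced. Box plots or density plots of the log2 values show the same thing visually.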

Part 3: Interpretation of the Expression Data

Objectives

  • To interpret the data and understand what the actual case study is investigating
  • To use the HUGO symbols provided by the dataset to sort and map the identifiers
  • To prevent any inconsistencies in the data that HUGO symbol conversion can introduce
  • To find the difference between the normalized and the converted data
  • To analyze whether the HUGO symbols make sense and whether they correlate with the previous symbols
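The mapping step above, and the inconsistencies it can introduce, can be sketched in base R as follows. The identifiers, symbols and values are placeholders invented for this example. A real mapping table would come from an annotation resource such as Ensembl.

```r
# Toy normalized expression table keyed by gene identifier (hypothetical values).
expr <- data.frame(
  id = c("ID1", "ID2", "ID3", "ID4"),
  s1 = c(10.5, 3.2, 7.7, 1.1),
  s2 = c(11.0, 2.9, 8.1, 0.9),
  stringsAsFactors = FALSE
)

# Toy identifier-to-symbol table; ID4 is missing and two IDs share a symbol.
id_map <- data.frame(
  id = c("ID1", "ID2", "ID3"),
  symbol = c("GENE_A", "GENE_B", "GENE_B"),
  stringsAsFactors = FALSE
)

# match() keeps the row order of expr and yields NA where no symbol exists.
expr$symbol <- id_map$symbol[match(expr$id, id_map$id)]

# Inconsistency 1: identifiers without any symbol.
unmapped_ids <- expr$id[is.na(expr$symbol)]

# Inconsistency 2: symbols assigned to more than one identifier.
dup_symbols <- unique(expr$symbol[duplicated(expr$symbol) & !is.na(expr$symbol)])

print(unmapped_ids)
print(dup_symbols)
```

Reporting both lists before dropping or merging any rows makes it easier to document how many genes were affected by the conversion.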

Duration

Time estimated: 4 hours
Time taken: 8 hours
Date started: unknown (Please time yourself next time!)
Completed: around the time of submission

Conclusion

  • The data has been mapped to the correct identifiers with normalized values
  • The normalized and cleaned data has been visualized using different plots
  • The divergence and variance have been calculated and shown
  • Important: the HUGO symbols were validated by matching them against the HUGO symbol data, which allowed me to confirm the identity of the HUGO symbols that I ended up with
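The validation described in the last point could look roughly like the sketch below, which compares the symbols provided by a dataset with the symbols from a reference mapping. All identifiers and symbols are placeholders, and the three status labels are an assumption made for this example.

```r
# Symbols as provided by the dataset (hypothetical); ID4 has none.
provided <- c(ID1 = "GENE_A", ID2 = "GENE_B", ID3 = "GENE_X", ID4 = NA)

# Symbols from a reference mapping (hypothetical).
reference <- c(ID1 = "GENE_A", ID2 = "GENE_B", ID3 = "GENE_C", ID4 = "GENE_D")

# Compare only identifiers present in both sources.
ids <- intersect(names(provided), names(reference))
comparison <- data.frame(
  id = ids,
  provided = unname(provided[ids]),
  reference = unname(reference[ids]),
  stringsAsFactors = FALSE
)

# Classify every identifier as missing, match or mismatch.
comparison$status <- ifelse(
  is.na(comparison$provided), "missing",
  ifelse(comparison$provided == comparison$reference, "match", "mismatch")
)

print(comparison)
print(table(comparison$status))
```

Mismatches are not necessarily errors, since a symbol may simply be an older alias, so they are worth inspecting individually rather than discarding.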

References

Adams, C. R., Htwe, H. H., Marsh, T., Wang, A. L., Montoya, M. L., Subbaraj, L., Tward, A. D., Bardeesy, N., & Perera, R. M. (2019). Transcriptional control of subtype switching ensures adaptation and growth of pancreatic cancer. eLife, 8. https://doi.org/10.7554/elife.45313

Bioconductor - home. (n.d.). Bioconductor.org. Retrieved February 13, 2023, from https://www.bioconductor.org/

edgeR. (n.d.). Bioconductor. Retrieved February 13, 2023, from https://bioconductor.org/packages/release/bioc/html/edgeR.html

Ensembl genome browser 109. (n.d.). Ensembl.org. Retrieved February 12, 2023, from http://useast.ensembl.org/index.html

GEO overview. (n.d.). Nih.gov. Retrieved February 13, 2023, from https://www.ncbi.nlm.nih.gov/geo/info/overview.html

National center for biotechnology information. (n.d.). Nih.gov. Retrieved February 12, 2023, from https://www.ncbi.nlm.nih.gov/

Xie, Y., Allaire, J. J., & Grolemund, G. (2018). R markdown: The definitive guide. CRC Press.