<a href="https://colab.research.google.com/github/fogg-lab/transcriptomics-data-query-and-retrieval/blob/main/notebooks/GEO_search_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

In [1]:
# Clone repository
!git clone https://github.com/fogg-lab/transcriptomics-data-query-and-retrieval.git

# Install package
!pip install ./transcriptomics-data-query-and-retrieval

Cloning into 'transcriptomics-data-query-and-retrieval'...
remote: Enumerating objects: 75, done.
remote: Counting objects: 100% (75/75), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 75 (delta 30), reused 42 (delta 10), pack-reused 0
Receiving objects: 100% (75/75), 19.53 KiB | 3.91 MiB/s, done.
Resolving deltas: 100% (30/30), done.
Processing ./transcriptomics-data-query-and-retrieval
  Preparing metadata (setup.py) ... done
Collecting biopython (from transcriptomics-data-query==0.1)
  Downloading biopython-1.81-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
     3.1/3.1 MB 9.4 MB/s eta 0:00:00
Collecting GEOparse (from transcriptomics-data-query==0.1)
  Downloading GEOparse-2.0.3.tar.gz (278 kB)
     278.5/278.5 kB 11.9 MB/s eta 0:00:00
  Preparin

In [2]:
# Privately configure email address for the NCBI API
from getpass import getpass
import subprocess
# Discard the command's output so it does not appear in the notebook
subprocess.run(
    ["configure-ncbi-email", getpass("Enter your email address: ")],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
    check=True,
)

Enter your email address: ··········


## Examples

In [3]:
""" Example 1: Basic keyword search """
import transcriptomics_data_query as tdq

# Note: The NCBI API can fail intermittently. If you get HTTP errors, try running this cell again.

query = "heart disease homo sapiens"
# Limit to 20 results (can be increased)
max_results = 20

id_list = tdq.geo.search_geo(query, max_results=max_results)

descriptions = tdq.geo.get_descriptions_from_ids(id_list)
for accession, study_description in descriptions.items():
    print(f"{accession}: {study_description}")

Hits: 20794
GSE199647: Torsion of the heart tube by shortage of progenitor cells: identification of Greb1l as a genetic determinant of criss-cross hearts in mice
GSE240321: Identification of host endotypes using peripheral blood transcriptomics in a prospective cohort of patients with endocarditis
GSE240320: Identification of host endotypes using peripheral blood transcriptomics in a prospective cohort of patients with endocarditis [pre-post_qx]
GSE240319: Identification of host endotypes using peripheral blood transcriptomics in a prospective cohort of patients with endocarditis [pre]
GSE240215: MYSM1 acts as a novel co-activator of ERα via histone and non-histone deubiquitination to confer antiestrogen resistance in breast cancer
GSE229229: Cardiac Myofibrillogenesis 1 is Spatiotemporally Modulated by the Molecular Chaperone UNC45B
GSE228966: Phosphorylated nuclear DICER1 promotes open chromatin state and gastric cell fate in lung adenocarcinomas [Nanostring]
GSE228965: Phosphorylate
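
Because Example 1 notes that the NCBI API can return intermittent HTTP errors, re-running the call automatically may be more convenient than re-running the cell by hand. The following is a minimal sketch of a generic retry wrapper. `call_with_retries` and its `attempts` and `delay_seconds` parameters are hypothetical names introduced here for illustration; they are not part of `transcriptomics_data_query`. The wrapper catches any exception because the specific error type raised by the package is not shown in this notebook.

```python
import time


def call_with_retries(func, *args, attempts=3, delay_seconds=2.0, **kwargs):
    """Call func(*args, **kwargs), retrying if it raises.

    Hypothetical helper (not part of transcriptomics_data_query).
    The last failure is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as error:  # broad on purpose: error type is an unknown here
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({error!r}); retrying in {delay_seconds} s")
            time.sleep(delay_seconds)
```

With the package installed, the search in Example 1 could then be written as `id_list = call_with_retries(tdq.geo.search_geo, query, max_results=max_results)`.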

In [4]:
""" Example 2: Search by gene symbol and organism """
import transcriptomics_data_query as tdq

query = "BRCA1[Gene Name] AND Homo sapiens[Organism]"

# Uses the default max_results, which is 25
id_list = tdq.geo.search_geo(query)

descriptions = tdq.geo.get_descriptions_from_ids(id_list)
for accession, study_description in descriptions.items():
    print(f"{accession}: {study_description}")

Hits: 10436
GSE237361: Effects of PARP inhibition on the transcriptome in BRCA1 wild-type and BRCA1 deficient ovarian cancer cell lines
GSE235980: RNA-Seq of PARPi-resistant ovarian cancer cells
GSE226018: Role of ALDH1A1 in PARPi resistance in ovarian cancer
GSE205366: FBL promotes cancer cell proliferation and DNA damage repair via YBX1
GSE180865: GNL3/nucleostemin links DNA replication homeostasis and replication forks stability
GSE218171: β-Trcp and RSK2-mediated ubiquitination of FOXN3 facilitates BRCA1-dependent DNA damage repair in lung cancer
GSE215908: Single cell RNA sequence reveals C5aR1 inhibition selectively targets pro-tumorigenic M2 macrophages reversing PARP inhibitor resistance
GSE234482: DNA Repair Function Scores for 2172 Variants in the BRCA1 Amino-Terminus
GSE205221: A genome-wide CRISPR screen identifies ZNF251 critical for resistance to PARP inhibitors
GSE226445: Identification of BRCA1/2 mutation carriers using circulating microRNA profiles 
GSE173223: Excessiv
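
The field-tagged queries in these examples all follow the same pattern: each term is written as `value[Field]`, and terms are joined with `AND`. As a rough sketch, a small helper can assemble such strings from pairs. `build_geo_query` is a hypothetical name introduced here, not a function provided by `transcriptomics_data_query`.

```python
def build_geo_query(terms):
    """Join (value, field) pairs into a query string using AND.

    Hypothetical helper for illustration. A field of None produces a
    bare keyword with no field tag.
    """
    parts = []
    for value, field in terms:
        parts.append(f"{value}[{field}]" if field else value)
    return " AND ".join(parts)


# Rebuild the query string used in Example 2
query = build_geo_query([("BRCA1", "Gene Name"), ("Homo sapiens", "Organism")])
print(query)  # BRCA1[Gene Name] AND Homo sapiens[Organism]
```

The resulting string can be passed to `tdq.geo.search_geo` exactly like a hand-written query.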

In [5]:
""" Example 3: Search by platform technology """
import transcriptomics_data_query as tdq

query = "GPL10558[Platform] AND cancer"

# Limit to 50 results
max_results = 50

id_list = tdq.geo.search_geo(query, max_results=max_results)

# Set the default accession to "unknown"
descriptions = tdq.geo.get_descriptions_from_ids(id_list, default_accession="unknown")
for accession, study_description in descriptions.items():
    print(f"{accession}: {study_description}")

Hits: 24959
GSE235641: Bi-directional interactions of omental adipocytes and SKOV3ip1 cells
GSE211913: C99R mutation in IRF4 drives a novel gain of function binding and gene upregulation in classical Hodgkin lymphoma [Illumina BeadChip]
GSE211445: C99R mutation in IRF4 drives a novel gain of function binding and gene upregulation in classical Hodgkin lymphoma
GSE218172: KLF10 modulates stem cell phenotypes of pancreatic adenocarcinoma by transcriptionally regulating notch receptors
GSE233350: Identification of KMT5A regulated genes in prostate cancer
GSE61883: Distinct molecular profiles for histological subtypes of epithelial ovarian adenocarcinomas
GSE113865: Genome-wide gene expression analysis of triple negative breast cancer tissue and matched normal tissue
GSE107636: MacroH2A1.2 inhibits prostate cancer-induced osteoclastogensis through cooperation with HP1a and H1.2
GSE221311: Expression data from HDLEC treated with melanosome
GSE207647: Microarray gene expression analyses of HC

In [6]:
""" Example 4: Search by publication date range """
import transcriptomics_data_query as tdq

# Liver disease studies with a GEO publication date in the first week of 2023
query = "2023/01/01[PDAT] : 2023/01/07[PDAT] AND liver disease"

id_list = tdq.geo.search_geo(query)

descriptions = tdq.geo.get_descriptions_from_ids(id_list, default_accession="unknown")
for accession, study_description in descriptions.items():
    print(f"{accession}: {study_description}")

Hits: 130
GSE203329: Gene expression profiles of  METTL5-Wild type(WT) knockout(KO) in HCC cells
GSE164359: Integrated multiomic analysis reveals comprehensive tumor heterogeneity in primary and recurrent hepatocellular carcinomas
GSE222171: Liver-specific FGFR4 knockdown in mice on a HFD increases bile acid synthesis and improves hepatic steatosis
GSE212363: mRNA expression data from livers of wild-type (WT) mice, mice that lack ribosomal protein S6 (Rps6) (DS6), mice that overexpress c-Myc (Myc) and livers that lack Rps6 and also overexpress c-Myc (DS6 Myc)
GSE215909: A CRISPR/Cas9 library screening identified CARM1 as a critical inhibitor for Sorafenib-induced ferroptosis in hepatocellular carcinoma cells [CRISPR screen]
GSE215263: A CRISPR/Cas9 library screening identified CARM1 as a critical inhibitor for Sorafenib-induced ferroptosis in hepatocellular carcinoma cells
GSE207758: Lysine 117 Residue is Essential for the Function of the Hepatocyte nuclear factor 1α
GSE207303: HepG2 H
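
The date range in Example 4 is written by hand in `YYYY/MM/DD` form. When the dates come from variables, formatting them programmatically avoids typos. The sketch below assumes the same `start[PDAT] : end[PDAT]` syntax used in Example 4; `pdat_range` is a hypothetical helper name, not part of `transcriptomics_data_query`.

```python
from datetime import date


def pdat_range(start, end):
    """Format two dates as a publication-date range clause.

    Hypothetical helper for illustration; mirrors the query syntax
    used in Example 4.
    """
    if start > end:
        raise ValueError("start date must not be after end date")
    return f"{start:%Y/%m/%d}[PDAT] : {end:%Y/%m/%d}[PDAT]"


# Rebuild the query string used in Example 4
query = f"{pdat_range(date(2023, 1, 1), date(2023, 1, 7))} AND liver disease"
print(query)  # 2023/01/01[PDAT] : 2023/01/07[PDAT] AND liver disease
```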

In [7]:
""" Example 5: Search by study type """
import transcriptomics_data_query as tdq

# Microarray studies related to diabetes
query = "Expression profiling by array[Study Type] AND diabetes"

# Show up to 75 results
id_list = tdq.geo.search_geo(query, max_results=75)

descriptions = tdq.geo.get_descriptions_from_ids(id_list)
for accession, study_description in descriptions.items():
    print(f"{accession}: {study_description}")

Hits: 1270
GSE161750: Acute and long-term exercise adaptation of adipose tissue and skeletal muscle in humans: a matched transcriptomics approach after 8-week training-intervention
GSE161749: Acute and long-term exercise adaptation of adipose tissue and skeletal muscle in humans: a matched transcriptomics approach after 8-week training-intervention
GSE156908: Reducing NADPH synthesis counteracts diabetic nephropathy through restoration of AMPK activity
GSE179768: The link between diabetes main metabolic alterations and breast cancer progression
GSE210517: Protective role of gut insulin action in the development of nonalcoholic steatohepatitis and hepatocellular carcinoma associated with diabetes in mice
GSE189005: RNA expression profiles of whole blood cells from a Han Chinese population with or without Type-2 Diabetes Mellitus or/and its complications in nephropathy and retinopathy
GSE101820: Cdk4-E2F3 signals enhance skeletal muscle oxidative function and improve whole  body metaboli
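
Every example above iterates over `descriptions.items()`, so `descriptions` behaves like a mapping from accession to study title. One way to keep results for later review is to serialize that mapping as tab-separated text. The sketch below uses only the standard library; `descriptions_to_tsv` is a hypothetical helper, and the sample dictionary reuses two entries from the Example 2 output rather than calling the API.

```python
import csv
import io


def descriptions_to_tsv(descriptions):
    """Return an accession/description mapping as tab-separated text.

    Hypothetical helper for illustration (not part of
    transcriptomics_data_query).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["accession", "description"])
    for accession, study_description in descriptions.items():
        writer.writerow([accession, study_description])
    return buffer.getvalue()


# Sample entries copied from the Example 2 output
sample = {
    "GSE235980": "RNA-Seq of PARPi-resistant ovarian cancer cells",
    "GSE226018": "Role of ALDH1A1 in PARPi resistance in ovarian cancer",
}
tsv_text = descriptions_to_tsv(sample)
print(tsv_text)
```

To save the text to disk, write `tsv_text` to a file opened in text mode, for example with `open("results.tsv", "w", encoding="utf-8")`.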