<a href="https://colab.research.google.com/github/fogg-lab/transcriptomics-data-query-and-retrieval/blob/main/notebooks/GEO_data_retrieval_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

### 1) Install the Python package

In [3]:
# Clone repository
!git clone https://github.com/fogg-lab/transcriptomics-data-query-and-retrieval.git

# Install package
!pip install ./transcriptomics-data-query-and-retrieval

Cloning into 'transcriptomics-data-query-and-retrieval'...
remote: Enumerating objects: 266, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 266 (delta 21), reused 28 (delta 19), pack-reused 231
Receiving objects: 100% (266/266), 74.89 KiB | 3.12 MiB/s, done.
Resolving deltas: 100% (141/141), done.
Processing ./transcriptomics-data-query-and-retrieval
  Preparing metadata (setup.py) ... done
Collecting biopython (from transcriptomics-data-query==0.1)
  Downloading biopython-1.81-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Collecting GEOparse (from transcriptomics-data-query==0.1)
  Downloading GEOparse-2.0.3.tar.gz (278 kB)

### 2) Install R packages
*This will take up to 10 minutes*

In [4]:
# Extra preliminary steps to install packages faster in Colab/Ubuntu, using bspm
!sudo add-apt-repository -y ppa:marutter/rrutter4.0
!sudo add-apt-repository -y ppa:c2d4u.team/c2d4u4.0+
!sudo apt-get update && sudo apt-get install -y python3-{dbus,gi,apt}
!wget https://github.com/Enchufa2/bspm/archive/refs/tags/v0.5.4.tar.gz
!sudo R CMD INSTALL v0.5.4.tar.gz
!echo "bspm::enable()" | sudo tee -a /etc/R/Rprofile.site

Repository: 'deb https://ppa.launchpadcontent.net/marutter/rrutter4.0/ubuntu/ jammy main'
Description:
A PPA for the base R package, from version 4.0 and higher
More info: https://launchpad.net/~marutter/+archive/ubuntu/rrutter4.0
Adding repository.
Adding deb entry to /etc/apt/sources.list.d/marutter-ubuntu-rrutter4_0-jammy.list
Adding disabled deb-src entry to /etc/apt/sources.list.d/marutter-ubuntu-rrutter4_0-jammy.list
Adding key to /etc/apt/trusted.gpg.d/marutter-ubuntu-rrutter4_0.gpg with fingerprint C9A7585B49D51698710F3A115E25F516B04C661B
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:6 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu ja

In [6]:
# Install packages using the script, install_r_packages.R
!sudo Rscript ./transcriptomics-data-query-and-retrieval/install_r_packages.R

# Install preprocessCore in single threaded mode to avoid threading bug on Colab
!git clone https://github.com/bmbolstad/preprocessCore.git
!R CMD INSTALL --configure-args="--disable-threading" ./preprocessCore

Loading required package: utils
Tracing function "install.packages" in package "utils"
Hit http://archive.ubuntu.com/ubuntu jammy InRelease
Hit https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit https://ppa.launchpadcontent.net/marutter/rrutter4.0/ubuntu jammy InRelease
Hit https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Fetched 0 B in 0s (0 B/s)
Available system packages as root...
Install system packages as root...
Reading package lists
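
Before moving on to the example, it can help to confirm that the setup steps succeeded. The following is a minimal sketch and not part of the package: `missing_prerequisites` is a hypothetical helper that reports which Python modules cannot be found and which executables are not on the `PATH`. The module names are the ones imported later in this notebook.

```python
import importlib.util
import shutil


def missing_prerequisites(modules, executables):
    """Return names of modules that cannot be found and executables not on PATH."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [exe for exe in executables if shutil.which(exe) is None]
    return missing


# Module names taken from the imports used later in this notebook.
problems = missing_prerequisites(
    modules=["GEOparse", "pandas", "transcriptomics_data_query"],
    executables=["Rscript"],
)
print("Missing:", problems if problems else "nothing")
```

An empty result only shows that the modules and `Rscript` can be located; it does not verify that the R packages themselves installed correctly.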

## GEO retrieval and preprocessing example

Prepare expression data from a GEO study. This process is similar for microarray and RNASeq data, with some minor differences. Refer to the [documentation](https://github.com/fogg-lab/transcriptomics-data-query-and-retrieval/blob/main/DOCUMENTATION.md) for more information.

In [1]:
### 1. Import packages

import GEOparse
import pandas as pd
import transcriptomics_data_query as tdq

In [2]:
### 2. Obtain a GSE object for a GEO series accession using the GEOparse package

# Link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE161750
accession = "GSE161750"
gse = GEOparse.get_GEO(accession)

15-Aug-2023 02:51:27 DEBUG utils - Directory ./ already exists. Skipping.
15-Aug-2023 02:51:27 INFO GEOparse - File already exist: using local version.
15-Aug-2023 02:51:27 INFO GEOparse - Parsing ./GSE161750_family.soft.gz: 
15-Aug-2023 02:51:27 DEBUG GEOparse - DATABASE: GeoMiame
15-Aug-2023 02:51:27 DEBUG GEOparse - SERIES: GSE161750
15-Aug-2023 02:51:27 DEBUG GEOparse - PLATFORM: GPL23159

15-Aug-2023 02:51:30 DEBUG GEOparse - SAMPLE: GSM4913974
15-Aug-2023 02:51:31 DEBUG GEOparse - SAMPLE: GSM4913975
15-Aug-2023 02:51:31 DEBUG GEOparse - SAMPLE: GSM4913976
15-Aug-2023 02:51:31 DEBUG GEOparse - SAMPLE: GSM4913977

In [3]:
### 3. Download the raw data

# If the output_dir argument is not specified, the output directory is created in the working directory.
# For microarray data, this results in a directory of CEL.gz files.
# For an RNASeq accession, a counts file is downloaded instead.
tdq.geo.download_geo_expression_data(gse)
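
The comments above say that a microarray accession yields a directory of CEL.gz files, so a quick way to confirm the download is to list them. This is a minimal sketch using only the standard library: `list_cel_files` is a hypothetical helper, and the directory name is assumed to match the accession (the normalization step later passes the accession as `input_path`).

```python
from pathlib import Path


def list_cel_files(directory):
    """Return sorted names of files ending in .CEL.gz (case-insensitive)."""
    folder = Path(directory)
    if not folder.is_dir():
        return []
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.name.lower().endswith(".cel.gz")
    )


# Assumption: the download directory is named after the accession.
cel_files = list_cel_files("GSE161750")
print(f"Found {len(cel_files)} CEL.gz files")
```

A count of zero suggests that the download failed or that the files were written to a different `output_dir`.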

In [4]:
### 4. Download and parse clinical characteristics

# Save in current working directory, ./GSE161750_clinical.tsv
clinical_file = f"{accession}_clinical.tsv"
tdq.geo.get_geo_clinical_characteristics(gse, output_file=clinical_file)

# View the parsed clinical characteristics.
# The table will probably need manual cleanup, e.g. in a spreadsheet program.
pd.read_csv(clinical_file, sep='\t')

Unnamed: 0,sample_id,tissue,title,description,source_name_ch1
0,GSM4913974,muscle,Muscle_Baseline_Acute_rep1,Gene expression data from muscle biopsy untrai...,Muscle biopsy untrained after exercise derived...
1,GSM4913975,muscle,Muscle_Baseline_Acute_rep2,Gene expression data from muscle biopsy untrai...,Muscle biopsy untrained after exercise derived...
2,GSM4913976,muscle,Muscle_Baseline_Acute_rep3,Gene expression data from muscle biopsy untrai...,Muscle biopsy untrained after exercise derived...
3,GSM4913977,muscle,Muscle_Baseline_Acute_rep4,Gene expression data from muscle biopsy untrai...,Muscle biopsy untrained after exercise derived...
4,GSM4913978,muscle,Muscle_Baseline_Acute_rep5,Gene expression data from muscle biopsy untrai...,Muscle biopsy untrained after exercise derived...
5,GSM4913979,muscle,Muscle_Baseline_Acute_rep6,Gene expression data from muscle biopsy untrai...,Muscle biopsy untrained after exercise derived...
6,GSM4913980,muscle,Muscle_Baseline_Acute_rep7,Gene expression data from muscle biopsy untrai...,Muscle biopsy untrained after exercise derived...
7,GSM4913981,muscle,Muscle_Baseline_Acute_rep8,Gene expression data from muscle biopsy untrai...,Muscle biopsy untrained after exercise derived...
8,GSM4913982,muscle,Muscle_Baseline_Acute_rep9,Gene expression data from muscle biopsy untrai...,Muscle biopsy untrained after exercise derived...
9,GSM4913983,muscle,Muscle_Baseline_Acute_rep10,Gene expression data from muscle biopsy untrai...,Muscle biopsy untrained after exercise derived...
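
Some of the cleanup can be scripted instead of done by hand. In the rows shown above, the `title` column follows a pattern like `Muscle_Baseline_Acute_rep1`. The sketch below splits such a title into separate fields. It assumes every title has four underscore-separated parts ending in `rep<N>`, which holds only for the rows previewed here. The field names `training_phase` and `exercise_state` are guesses at what the middle parts mean, not names taken from the study.

```python
def parse_sample_title(title):
    """Split a title such as 'Muscle_Baseline_Acute_rep1' into labeled fields.

    Assumption: titles have exactly four underscore-separated parts and the
    last part looks like 'rep<N>'. Returns None when the title does not fit.
    """
    parts = title.split("_")
    if len(parts) != 4 or not parts[3].startswith("rep") or not parts[3][3:].isdigit():
        return None
    tissue, training_phase, exercise_state, replicate = parts
    return {
        "tissue": tissue.lower(),
        "training_phase": training_phase,  # hypothetical field name
        "exercise_state": exercise_state,  # hypothetical field name
        "replicate": int(replicate[3:]),
    }


print(parse_sample_title("Muscle_Baseline_Acute_rep1"))
```

Titles that do not match return `None`, which makes it easy to spot samples that still need manual attention.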


In [5]:
### 5. Normalization (RMA for microarray, TMM for RNASeq)

# If the accession is microarray data, input_path should be a directory of CEL.gz files
norm_input_path = accession
norm_save_path = f"{accession}_expression_matrix.tsv"
tdq.preprocess.normalize(input_path=norm_input_path, output_file=norm_save_path)

# For RNASeq, input_path should be the raw counts file, and clinical_file must also be specified.
# norm_input_path = f"{accession}_expression_matrix.tsv"
# norm_save_path = f"{accession}_expression_matrix.tsv"
# tdq.preprocess.normalize(norm_input_path, norm_save_path, clinical_file)

# Read normalized expression matrix from file
expr_df = pd.read_csv(norm_save_path, sep="\t", index_col=0)

# Clean up sample names (extract "GSMxxxx")
expr_df = tdq.geo.clean_geo_sample_columns(expr_df)

# Save expression matrix with cleaned sample names
cleaned_save_path = norm_save_path  # overwrite old file
expr_df.to_csv(cleaned_save_path, sep="\t")

# Preview current expression matrix
print(f"Expression matrix for {accession} after normalization:")
expr_df.iloc[:8, :3]

Executing: Rscript /usr/local/lib/python3.10/dist-packages/transcriptomics_data_query/rscripts/rma_normalization.R GSE161750 GSE161750_expression_matrix.tsv
Normalization complete.
Expression matrix for GSE161750 after normalization:


Unnamed: 0,GSM4913974,GSM4913975,GSM4913976
23064070,8.456688,8.332458,8.521047
23064071,7.780182,7.467921,7.677162
23064072,5.602551,6.618323,6.485193
23064073,7.595782,7.153539,7.549301
23064074,5.948645,5.597582,5.451984
23064075,8.047622,7.734275,8.066836
23064076,6.540991,6.105066,6.668104
23064077,9.316287,9.071693,9.154063
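
The sample-name cleanup step is described above as extracting "GSMxxxx" from each column name. The sketch below illustrates that idea with a regular expression. It is not the implementation of `clean_geo_sample_columns`, and the raw column names are hypothetical examples of the kind that might be derived from CEL file names.

```python
import re

GSM_PATTERN = re.compile(r"GSM\d+")


def extract_gsm_id(column_name):
    """Return the first 'GSM<digits>' token in a column name, or the name unchanged."""
    match = GSM_PATTERN.search(column_name)
    return match.group(0) if match else column_name


# Hypothetical raw column names of the kind derived from CEL file names.
raw_columns = ["GSM4913974_sampleA.CEL.gz", "GSM4913975_sampleB.CEL.gz", "probe_id"]
print([extract_gsm_id(name) for name in raw_columns])
```

Leaving non-matching names unchanged keeps any non-sample columns intact.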


In [6]:
### 5b. Map probes to genes (microarray data only)
expr_df = tdq.geo.map_probes_to_genes(expr_df, gse)

# Overwrite previous expression matrix file
expr_df.to_csv(cleaned_save_path, sep="\t")

# Preview current expression matrix
print(f"Expression matrix for {accession} after mapping probes to genes:")
expr_df.iloc[:8, :3]

15-Aug-2023 02:52:57 DEBUG utils - Directory ./ already exists. Skipping.
15-Aug-2023 02:52:57 INFO GEOparse - File already exist: using local version.
15-Aug-2023 02:52:57 INFO GEOparse - Parsing ./GPL23159.txt: 
15-Aug-2023 02:52:57 DEBUG GEOparse - PLATFORM: GPL23159

Using annotation column SPOT_ID.1 for Ensembl IDs
Expression matrix for GSE161750 after mapping probes to genes:


Unnamed: 0,GSM4913974,GSM4913975,GSM4913976
ENSG00000093100,8.33646,7.974318,8.300854
ENSG00000106540,6.611486,6.178451,6.286634
ENSG00000137808,5.914954,5.773396,5.592572
ENSG00000145063,6.823582,6.514701,7.082004
ENSG00000151303,7.108568,7.324181,7.424647
ENSG00000157306,8.341304,8.440072,8.57301
ENSG00000167355,4.098994,4.081819,4.494833
ENSG00000170089,7.70866,8.472836,8.133657
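
Several probes can map to the same gene, so a probe-to-gene step has to decide how to combine them. The notebook does not say how `map_probes_to_genes` handles this. The sketch below shows one common choice, averaging the probes per gene, using plain dictionaries and made-up values. Treat it as an illustration of the idea, not as the package's behavior.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes_to_genes(probe_rows, probe_to_gene):
    """Average the expression rows of all probes that map to the same gene.

    probe_rows: dict of probe ID -> list of expression values (one per sample)
    probe_to_gene: dict of probe ID -> gene ID; unmapped probes are dropped
    """
    grouped = defaultdict(list)
    for probe, values in probe_rows.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            grouped[gene].append(values)
    # zip(*rows) regroups the values sample by sample before averaging.
    return {gene: [mean(col) for col in zip(*rows)] for gene, rows in grouped.items()}


# Toy values and a made-up mapping, for illustration only.
probe_rows = {"p1": [8.0, 6.0], "p2": [6.0, 4.0], "p3": [5.0, 5.0], "p4": [1.0, 1.0]}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
print(collapse_probes_to_genes(probe_rows, probe_to_gene))
```

With a pandas DataFrame whose index already holds gene IDs, the same averaging can be written as `df.groupby(level=0).mean()`. Taking the maximum or median per gene are other reasonable choices.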


In [12]:
### 6. Further processing

# The annotation table for GSE161750 (platform GPL23159) contains Ensembl IDs.
# Suppose we are interested in the expression of one gene set (here, the KEGG olfactory
# transduction pathway) and want the genes labeled with gene symbols.
# We can use the preprocess module to filter the expression matrix to that gene set
# and convert the IDs to symbols.

In [14]:
### 6.1. Get gene symbols in the KEGG pathway for olfactory transduction
gene_set = "KEGG_OLFACTORY_TRANSDUCTION"
gene_set_symbols = tdq.preprocess.get_genes_from_msig_set(gene_set)
# Alternatively, you can read genes from a text file (one gene per line):
# gene_set_symbols = tdq.preprocess.get_genes_from_file("genes_of_interest.txt")

# Preview gene symbols in the list
gene_set_symbols[:10]

['ADCY3',
 'ARRB2',
 'CALM1',
 'CALM2',
 'CALM3',
 'CALML3',
 'CALML5',
 'CALML6',
 'CAMK2A',
 'CAMK2B']
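
The commented-out alternative above reads genes from a text file with one gene per line. The sketch below shows what reading such a file could look like. `read_gene_list` is a hypothetical stand-in, not the package's `get_genes_from_file`, and skipping blank lines and duplicates is an assumption about what is convenient rather than documented behavior.

```python
import os
import tempfile


def read_gene_list(path):
    """Read gene identifiers from a text file with one gene per line."""
    genes = []
    seen = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            gene = line.strip()
            # Skip blank lines and repeated entries, keeping first-seen order.
            if gene and gene not in seen:
                seen.add(gene)
                genes.append(gene)
    return genes


# Write a small example file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "genes_of_interest.txt")
    with open(example_path, "w", encoding="utf-8") as handle:
        handle.write("ADCY3\nARRB2\n\nADCY3\nCALM1\n")
    print(read_gene_list(example_path))
```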

In [16]:
# 6.2. Obtain the Ensembl IDs for this set of genes
# The convert_genes function uses mygene.info to convert genes from one format to another.
# The function accepts 3 formats: symbol, Ensembl ID (ensembl.gene), and Entrez ID (entrezgene).

# Convert symbols to Ensembl IDs
gene_set_ensembl = tdq.preprocess.convert_genes(
    genes=gene_set_symbols, in_format="symbol", out_format="ensembl.gene", species="human")

# Discard genes that were not found
gene_set_ensembl.dropna(inplace=True)

# Preview Ensembl IDs in the gene set
gene_set_ensembl.head()



query
ADCY3    ENSG00000138031
ARRB2    ENSG00000141480
CALM1    ENSG00000198668
CALM2    ENSG00000143933
CALM3    ENSG00000160014
Name: ensembl.gene, dtype: object

In [18]:
# 6.3. Select rows in `expr_df` that contain these genes
filtered_expression = tdq.preprocess.select_rows(expr_df, gene_set_ensembl)

# Preview the filtered expression matrix
print(f"\nFiltered expression matrix for {accession} (Ensembl IDs)")
print(filtered_expression.iloc[:8, :3])


Filtered expression matrix for GSE161750 (Ensembl IDs)
                 GSM4913974  GSM4913975  GSM4913976
ENSG00000167355    4.098994    4.081819    4.494833
ENSG00000181214    4.359115    4.267042    4.459215
ENSG00000278870    3.904819    3.802722    4.531649
ENSG00000279051    5.921605    5.524552    5.805795
ENSG00000279263    3.919104    3.831666    4.168566
ENSG00000279301    5.101743    4.960626    4.930736
ENSG00000279761    3.926455    3.914708    3.891808


In [19]:
# 6.4. Optionally, convert the Ensembl IDs to gene symbols
genes = filtered_expression.index
filtered_expression.index = tdq.preprocess.convert_genes(genes, "ensembl.gene", "symbol")

# Preview the filtered expression matrix
print(f"\nFiltered expression matrix for {accession} (gene symbols)")
print(filtered_expression.iloc[:8, :3])


Filtered expression matrix for GSE161750 (gene symbols)
        GSM4913974  GSM4913975  GSM4913976
symbol                                    
OR51B5    4.098994    4.081819    4.494833
OR8G2P    4.359115    4.267042    4.459215
OR51G1    3.904819    3.802722    4.531649
OR6Q1     5.921605    5.524552    5.805795
OR2L8     3.919104    3.831666    4.168566
OR2T11    5.101743    4.960626    4.930736
OR5D13    3.926455    3.914708    3.891808
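
When relabeling rows, an ID without a known symbol would otherwise leave a row with a missing label. One option is to fall back to the original ID. The sketch below is a hypothetical helper and not part of the package. The two ID-to-symbol pairs are inferred from the matching rows of the two previews above, and the third ID is made up to show the fallback.

```python
def relabel_with_fallback(ids, id_to_symbol):
    """Replace each ID with its symbol, keeping the original ID when no symbol is known."""
    return [id_to_symbol.get(gene_id) or gene_id for gene_id in ids]


# Pairs inferred from the matching rows of the two previews above.
id_to_symbol = {"ENSG00000167355": "OR51B5", "ENSG00000181214": "OR8G2P"}
ids = ["ENSG00000167355", "ENSG00000181214", "ENSG00000000000"]  # last ID is made up
print(relabel_with_fallback(ids, id_to_symbol))
```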
