In [25]:
import GEOparse
# https://geoparse.readthedocs.io/en/latest/

import pandas as pd
from matplotlib import pylab as pl
import seaborn as sns
pl.rcParams['figure.figsize'] = (14, 10)
pl.rcParams['ytick.labelsize'] = 12
pl.rcParams['xtick.labelsize'] = 11
pl.rcParams['axes.labelsize'] = 23
pl.rcParams['legend.fontsize'] = 20
sns.set_style('ticks')
c1, c2, c3, c4 = sns.color_palette("Set1", 4)

### 1. Read data from GEO using the GEOparse library

We will use the GEOparse library to access GEO data. To read a dataset, we need to know its series accession ID.

In [2]:
# GEO accession ID of the series (experiment) to load
geo_id = "GSE188461" 

# Read data from GEO
geo_data = GEOparse.get_GEO(geo=geo_id)

09-Apr-2024 13:26:28 DEBUG utils - Directory ./ already exists. Skipping.
09-Apr-2024 13:26:28 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE188nnn/GSE188461/soft/GSE188461_family.soft.gz to ./GSE188461_family.soft.gz
100%|██████████| 6.28k/6.28k [00:00<00:00, 11.2kB/s]
09-Apr-2024 13:26:31 DEBUG downloader - Size validation passed
09-Apr-2024 13:26:31 DEBUG downloader - Moving /tmp/tmp_a2smvqe to /home/alopez/3.Aging_and_chemical_reprogramming/2022_04_Guan_Chemical_reprogramming/GSE188461_family.soft.gz
09-Apr-2024 13:26:31 DEBUG downloader - Successfully downloaded ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE188nnn/GSE188461/soft/GSE188461_family.soft.gz
09-Apr-2024 13:26:31 INFO GEOparse - Parsing ./GSE188461_family.soft.gz: 
09-Apr-2024 13:26:31 DEBUG GEOparse - DATABASE: GeoMiame
09-Apr-2024 13:26:31 DEBUG GEOparse - SERIES: GSE188461
09-Apr-2024 13:26:31 DEBUG GEOparse - PLATFORM: GPL24676
09-Apr-2024 13:26:31 DEBUG GEOparse - SAMPLE: GSM5387804
09-Apr-2024 13

In [3]:
geo_data

<SERIES: GSE188461 - 39 SAMPLES, 1 d(s)>

This is a GSE (Series): an original, submitter-supplied record that summarizes a whole study, including its samples and platforms. Each GSE is assigned a unique and stable GEO accession number consisting of 'GSE' followed by digits, e.g. GSE188461.

All GEO objects inherit from the abstract base class GEOparse.BaseGEO. The two main attributes of that class are:

* .name: the accession ID of the object (for a series, the series ID)
* .metadata: a dictionary with the information that appears in the SOFT file on lines beginning with a bang (!). Each value of this dictionary is a list (even when it holds a single element)

https://geoparse.readthedocs.io/en/latest/GEOparse.html
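Because every metadata value is a list, reading a single field always needs an extra `[0]`. The sketch below shows one way to unwrap single-element lists while leaving multi-valued fields untouched. The helper name `unwrap_single_values` is hypothetical (it is not part of GEOparse), and the example dictionary is only an excerpt with values copied from this series' metadata.

```python
def unwrap_single_values(metadata):
    """Return a copy of a GEOparse-style metadata dict.

    Single-element lists are replaced by their only element;
    lists with several elements are kept as lists.
    (Hypothetical helper, not part of GEOparse.)
    """
    flat = {}
    for key, values in metadata.items():
        flat[key] = values[0] if len(values) == 1 else list(values)
    return flat


# Excerpt of the series metadata for GSE188461 (values copied from this notebook)
series_metadata = {
    "geo_accession": ["GSE188461"],
    "pubmed_id": ["35418683"],
    "type": [
        "Expression profiling by high throughput sequencing",
        "Genome binding/occupancy profiling by high throughput sequencing",
    ],
}

flat = unwrap_single_values(series_metadata)
print(flat["geo_accession"])  # a plain string
print(flat["type"])           # still a list, because it has two entries
```

With a loaded series, the same call would be `unwrap_single_values(geo_data.metadata)`.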

In [8]:
geo_data.name

'GSE188461'

In [6]:
geo_data.metadata

{'title': ['Chemical-based external stimulation reprograms human somatic cells into pluripotency'],
 'geo_accession': ['GSE188461'],
 'status': ['Public on Feb 09 2022'],
 'submission_date': ['Nov 09 2021'],
 'last_update_date': ['May 10 2022'],
 'pubmed_id': ['35418683'],
 'summary': ['This SuperSeries is composed of the SubSeries listed below.'],
 'overall_design': ['Refer to individual Series'],
 'type': ['Expression profiling by high throughput sequencing',
  'Genome binding/occupancy profiling by high throughput sequencing'],
 'sample_id': ['GSM5387804',
  'GSM5387805',
  'GSM5387806',
  'GSM5387807',
  'GSM5387808',
  'GSM5387809',
  'GSM5387810',
  'GSM5387811',
  'GSM5387812',
  'GSM5387813',
  'GSM5387814',
  'GSM5387815',
  'GSM5387816',
  'GSM5534140',
  'GSM5534141',
  'GSM5534142',
  'GSM5534143',
  'GSM5534144',
  'GSM5534145',
  'GSM5534146',
  'GSM5534147',
  'GSM5534148',
  'GSM5534149',
  'GSM5534150',
  'GSM5534151',
  'GSM5534152',
  'GSM5534153',
  'GSM5534154',
  

#### 1.1 Samples and platforms

In GEOparse, a Series is represented by a GEOparse.GSE object that has three main attributes:
* metadata – inherited from BaseGEO
* gsms – dict with all samples in this GSE as GSM objects
* gpls – dict with all platforms in this GSE as GPL objects
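Since `gsms` is a plain dict of objects that each carry a `metadata` dict, a quick overview of a series can be built by counting samples per metadata field. The following is a minimal sketch under that assumption; `count_by_field` is a hypothetical helper, and the demo objects are stand-ins with invented IDs so the example runs without downloading anything.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_field(gsms, field):
    """Count how many samples carry each value of a metadata field.

    `gsms` is expected to map sample IDs to objects with a `.metadata`
    dict whose values are lists (as GEOparse sample objects are described above).
    """
    counts = Counter()
    for gsm in gsms.values():
        # A sample without the field is counted under "<missing>"
        for value in gsm.metadata.get(field, ["<missing>"]):
            counts[value] += 1
    return counts


# Stand-in objects with invented IDs; real use would pass geo_data.gsms
demo_gsms = {
    "DEMO_SAMPLE_1": SimpleNamespace(metadata={"type": ["SRA"]}),
    "DEMO_SAMPLE_2": SimpleNamespace(metadata={"type": ["SRA"]}),
    "DEMO_SAMPLE_3": SimpleNamespace(metadata={}),
}

print(count_by_field(demo_gsms, "type"))
```

On the real series one might call `count_by_field(geo_data.gsms, "type")` or use another field such as `"source_name_ch1"`.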

In [13]:
# Access the samples
# gsms -> dict of samples {sample_ID: sample_obj}
samples = geo_data.gsms

# Sample iteration
for gsm_name, gsm in samples.items():
    print("Sample name:", gsm_name)
    print("Sample data:", gsm.table)
    break

Sample name: GSM5387804
Sample data: Empty DataFrame
Columns: []
Index: []


A GSM (Sample) describes the conditions and preparation of a sample. In the GEO database, each sample is assigned a unique and stable GEO accession number composed of 'GSM' followed by digits, e.g. GSM906.

In GEOparse, a Sample is represented by a GEOparse.GSM object that has three main attributes:
* metadata – inherited from BaseGEO
* table – pandas.DataFrame with the data table from the SOFT file
* columns – pandas.DataFrame with a description column that holds information about the columns in the table
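The empty table printed earlier suggests that this sample has no data table in the SOFT file. For sequencing submissions this is common: processed data usually lives in supplementary files referenced from the metadata. The sketch below summarizes a sample under that assumption. The helper `describe_sample` is hypothetical, and the key prefix `supplementary_file` is an assumption about how the SOFT fields are named in `metadata`; check `gsm.metadata.keys()` on the real object.

```python
from types import SimpleNamespace


def describe_sample(gsm, prefix="supplementary_file"):
    """Summarize a sample: row count of its table and any supplementary files.

    `prefix` is an assumed naming convention for metadata keys that
    point at supplementary files.
    """
    supplementary = []
    for key, values in gsm.metadata.items():
        if key.startswith(prefix):
            supplementary.extend(values)
    return {
        "rows": len(gsm.table),  # len() of a DataFrame is its number of rows
        "supplementary_files": supplementary,
    }


# Stand-in sample: an empty list plays the role of an empty DataFrame,
# and the file entry is a placeholder, not a real GEO path.
demo_gsm = SimpleNamespace(
    table=[],
    metadata={
        "type": ["SRA"],
        "supplementary_file_1": ["<placeholder-path>"],
    },
)

print(describe_sample(demo_gsm))
```

A result with `rows` equal to 0 and a non-empty file list would indicate that the expression data has to be fetched from the supplementary files rather than read from `gsm.table`.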

In [19]:
gsm.metadata

{'title': ['scATAC-Seq fibroblast'],
 'geo_accession': ['GSM5387804'],
 'status': ['Public on Feb 08 2022'],
 'submission_date': ['Jun 16 2021'],
 'last_update_date': ['Feb 08 2022'],
 'type': ['SRA'],
 'channel_count': ['1'],
 'source_name_ch1': ['human embryonic fibroblasts'],
 'organism_ch1': ['Homo sapiens'],
 'taxid_ch1': ['9606'],
 'characteristics_ch1': ['cell line: 0330'],
 'treatment_protocol_ch1': ['Human somatic cells were treated with different small molecule combinations during human chemical reprogramming.'],
 'growth_protocol_ch1': ['Human embryonic fibroblasts (HEFs) were cultured in high glucose DMEM supplemented with 15% Fetal Bovine Serum (FBS), 1% GlutaMAX™, 1% MEM Non-Essential Amino Acids Solution (NEAA), 1% Penicillin-Streptomycin and 0.055 mM 2-mercaptoethanol under 21% O2, 5% CO2 at 37 ℃.'],
 'molecule_ch1': ['genomic DNA'],
 'extract_protocol_ch1': ['The nuclei were isolated and washed according to the method supplied by 10X Gemomis: Nuclei Isolation for Singl

#### 1.2 Example: Analyse samples

We are going to analyse two samples:
* scRNA-Seq HEFs-0330: human embryonic fibroblasts -> GSM5387808
* hCiPSCs-1117: human chemically induced pluripotent stem cells derived from human embryonic fibroblasts -> GSM5534156
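Rather than copying accession numbers by hand, the sample IDs can be looked up from the `title` field of each sample's metadata. This is a minimal sketch: `find_by_title` is a hypothetical helper, and the stand-in titles are taken from this notebook (the second one from the list above, so it may not match the GEO title exactly).

```python
from types import SimpleNamespace


def find_by_title(gsms, text):
    """Return the IDs of samples whose title contains `text` (case-insensitive)."""
    needle = text.lower()
    matches = []
    for name, gsm in gsms.items():
        titles = gsm.metadata.get("title", [])
        if any(needle in title.lower() for title in titles):
            matches.append(name)
    return matches


# Stand-in objects; real use would pass geo_data.gsms
demo_gsms = {
    "GSM5387804": SimpleNamespace(metadata={"title": ["scATAC-Seq fibroblast"]}),
    "GSM5387808": SimpleNamespace(metadata={"title": ["scRNA-Seq HEFs-0330"]}),
}

print(find_by_title(demo_gsms, "scRNA-Seq"))
```

The search is a plain substring match, which keeps it predictable; a regular expression could be swapped in if the titles need more flexible matching.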

In [32]:
# Read data from GEO
geo_id = "GSE188461" 
geo_data = GEOparse.get_GEO(geo=geo_id)

09-Apr-2024 15:05:52 DEBUG utils - Directory ./ already exists. Skipping.
09-Apr-2024 15:05:52 INFO GEOparse - File already exist: using local version.
09-Apr-2024 15:05:52 INFO GEOparse - Parsing ./GSE188461_family.soft.gz: 
09-Apr-2024 15:05:52 DEBUG GEOparse - DATABASE: GeoMiame
09-Apr-2024 15:05:52 DEBUG GEOparse - SERIES: GSE188461
09-Apr-2024 15:05:52 DEBUG GEOparse - PLATFORM: GPL24676
09-Apr-2024 15:05:52 DEBUG GEOparse - SAMPLE: GSM5387804
09-Apr-2024 15:05:52 DEBUG GEOparse - SAMPLE: GSM5387805
09-Apr-2024 15:05:52 DEBUG GEOparse - SAMPLE: GSM5387806
09-Apr-2024 15:05:52 DEBUG GEOparse - SAMPLE: GSM5387807
09-Apr-2024 15:05:52 DEBUG GEOparse - SAMPLE: GSM5387808
09-Apr-2024 15:05:52 DEBUG GEOparse - SAMPLE: GSM5387809
09-Apr-2024 15:05:52 DEBUG GEOparse - SAMPLE: GSM5387810
09-Apr-2024 15:05:52 DEBUG GEOparse - SAMPLE: GSM5387811
09-Apr-2024 15:05:52 DEBUG GEOparse - SAMPLE: GSM5387812
09-Apr-2024 15:05:52 DEBUG GEOparse - SAMPLE: GSM5387813
09-Apr-2024 15:05:52 DEBUG GEOpars

In [27]:
# Platform
geo_data.gpls['GPL24676'].columns

Unnamed: 0,description


In [37]:
samples = geo_data.gsms

In [42]:
samples["GSM5534156"].database

AttributeError: 'GSM' object has no attribute 'database'

In [36]:
# Samples data
geo_data.gsms["GSM5534156"].gses

AttributeError: 'GSM' object has no attribute 'gses'
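The two AttributeErrors above come from guessing attribute names. A safer way to explore an unfamiliar object is to list the instance attributes it actually has. The sketch below does this with the built-in `vars`; `public_attributes` is a hypothetical helper, and the stand-in object only mimics the three attributes described earlier plus `name`.

```python
from types import SimpleNamespace


def public_attributes(obj):
    """Return the sorted names of an object's public instance attributes."""
    return sorted(name for name in vars(obj) if not name.startswith("_"))


# Stand-in object mimicking the attributes described for a sample
demo_gsm = SimpleNamespace(name="GSM5534156", metadata={}, table=[], columns=[])

print(public_attributes(demo_gsm))
```

On a real sample this would be `public_attributes(geo_data.gsms["GSM5534156"])`; `dir(obj)` additionally shows methods and inherited names.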

In [35]:
samples = geo_data.gsms

# Sample iteration
for gsm_name, gsm in samples.items():
    print("Sample name:", gsm_name)
    print("Sample data:", gsm.table)

Sample name: GSM5387804
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5387805
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5387806
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5387807
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5387808
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5387809
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5387810
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5387811
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5387812
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5387813
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5387814
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5387815
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5387816
Sample data: Empty DataFrame
Columns: []
Index: []
Sample name: GSM5534140
S

In [33]:
# Select samples
controls = ["GSM5387808","GSM5534156"]

pivoted_control_samples = geo_data.pivot_samples('VALUE')[controls]
pivoted_control_samples.head()

KeyError: 'ID_REF'
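This KeyError is consistent with the empty sample tables seen earlier: pivoting needs an `ID_REF` column (and the requested value column) in each sample's table, and an empty table has neither. A small check before pivoting makes the cause explicit. This is a sketch only: `samples_missing_columns` is a hypothetical helper, and the stand-in objects expose just a `columns` attribute in place of a real DataFrame.

```python
from types import SimpleNamespace


def samples_missing_columns(gsms, required=("ID_REF", "VALUE")):
    """Map each sample ID to the required columns its table lacks.

    Samples whose tables contain every required column are omitted,
    so an empty result means pivoting on these columns is plausible.
    """
    missing = {}
    for name, gsm in gsms.items():
        absent = [col for col in required if col not in gsm.table.columns]
        if absent:
            missing[name] = absent
    return missing


# Stand-ins: one table without columns (like the empty DataFrames above)
# and one invented sample that has both columns.
demo_gsms = {
    "GSM5387808": SimpleNamespace(table=SimpleNamespace(columns=[])),
    "DEMO_WITH_TABLE": SimpleNamespace(
        table=SimpleNamespace(columns=["ID_REF", "VALUE"])
    ),
}

print(samples_missing_columns(demo_gsms))
```

If every selected sample shows up in the result, the series most likely stores its expression data outside the SOFT tables, and `pivot_samples` cannot be used for it.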

In [44]:
# Download and load the GEO dataset
gse = GEOparse.get_GEO("GSE188461")  # Replace the accession ID with that of the GEO dataset you want to load

# Access the data matrix
data_matrix = gse.pivot_samples('VALUE')  # 'VALUE' is the column that holds the gene expression data

# Show the first rows of the data matrix
print(data_matrix.head())

09-Apr-2024 15:18:34 DEBUG utils - Directory ./ already exists. Skipping.
09-Apr-2024 15:18:34 INFO GEOparse - File already exist: using local version.
09-Apr-2024 15:18:34 INFO GEOparse - Parsing ./GSE188461_family.soft.gz: 
09-Apr-2024 15:18:34 DEBUG GEOparse - DATABASE: GeoMiame
09-Apr-2024 15:18:34 DEBUG GEOparse - SERIES: GSE188461
09-Apr-2024 15:18:34 DEBUG GEOparse - PLATFORM: GPL24676
09-Apr-2024 15:18:34 DEBUG GEOparse - SAMPLE: GSM5387804
09-Apr-2024 15:18:34 DEBUG GEOparse - SAMPLE: GSM5387805
09-Apr-2024 15:18:34 DEBUG GEOparse - SAMPLE: GSM5387806
09-Apr-2024 15:18:34 DEBUG GEOparse - SAMPLE: GSM5387807
09-Apr-2024 15:18:34 DEBUG GEOparse - SAMPLE: GSM5387808
09-Apr-2024 15:18:34 DEBUG GEOparse - SAMPLE: GSM5387809
09-Apr-2024 15:18:34 DEBUG GEOparse - SAMPLE: GSM5387810
09-Apr-2024 15:18:34 DEBUG GEOparse - SAMPLE: GSM5387811
09-Apr-2024 15:18:34 DEBUG GEOparse - SAMPLE: GSM5387812
09-Apr-2024 15:18:34 DEBUG GEOparse - SAMPLE: GSM5387813
09-Apr-2024 15:18:34 DEBUG GEOpars

KeyError: 'ID_REF'