# Extract data from GEO accession display on NCBI

The [NCBI][1] website maintains Gene Expression Omnibus (GEO), a public functional genomics data repository, and provides a GEO accession display tool to display [GEO accessions][2]. 

For example, look at the [GEO accession for GSE109816][3].

In this notebook we'll explore how to access that programtically using Python.

[1]: https://www.ncbi.nlm.nih.gov/
[2]: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi
[3]: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109816

## GEOparse

There is a Python library `GEOparse` to access the GEO Database.

You can install it using:

In [1]:
!pip install GEOparse

Defaulting to user installation because normal site-packages is not writeable


Once it is installed, you can import the module.

In [2]:
import GEOparse

In [32]:
geo_id = "GSE109816"

In [138]:
gse = GEOparse.get_GEO(geo=geo_id)

19-May-2023 21:40:54 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:40:54 INFO GEOparse - File already exist: using local version.
19-May-2023 21:40:54 INFO GEOparse - Parsing ./GSE109816_family.soft.gz: 
19-May-2023 21:40:54 DEBUG GEOparse - DATABASE: GeoMiame
19-May-2023 21:40:54 DEBUG GEOparse - SERIES: GSE109816
19-May-2023 21:40:54 DEBUG GEOparse - PLATFORM: GPL18573
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970358
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970359
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970360
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970361
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970362
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970363
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970364
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970365
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970366
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970367
19-May-2023 21:40:54 DEBUG GEOpars

In [132]:
gse

<SERIES: None - 12 SAMPLES, 0 d(s)>

In [133]:
gse.name

In [134]:
gse.geotype

'SERIES'

In [137]:
gse.gpls

{}

## Extracing Information from the record

We are interested to extract the following information from the record:

- Title
- Organism
- Experiment type
- Summary
- Contact name
- Contributor
- Submitter
- Overall Design
- Platform  (available by following the the platform ID link)

The `gse.metatadata` has all these fields. 

In [41]:
title = gse.metadata['title'][0]
expression_type = gse.metadata['type'][0]
summary = gse.metadata['summary'][0]
contact_name = gse.metadata['contact_name'][0]
contributors = gse.metadata['contributor']
overall_design = gse.metadata['overall_design'][0]

print("Title:", title)
print("Contact Name:", contact_name)
print("Contributors:", contributor)
print()

print("Expression Type:", expression_type)
print("Overall Design:", overall_design)
print()

print("Summary:")
print(summary)
print()


Title: Dissecting cell composition and cell-cell interaction network of normal human heart tissue by single-cell sequencing
Contact Name: Li,,Wang
Contributors: ['Li,,Wang', 'Peng,,Yu', 'Zheng,,Li', 'Zongna,,Ren']

Expression Type: Expression profiling by high throughput sequencing
Overall Design: Extract cells from left ateria and ventricle of normal heart and conduct single-cell sequencing of 9994 cells.

Summary:
We studied the cell compositon of normal human heart by single-cell sequencing. Distint subgroups of cardiac muscle, fibroblast cell and endothelial cell were detected. We drawed a cell-cell interaction network using specific expressed ligands and receptors of cells. And we also observed the change of interaction and cell transformation with age.



We got all the fields except the _organism_ and _technology type_. They are references to other geo records.

In [91]:
organism_id = gse.metadata['sample_taxid'][0]
platform_id = gse.metadata['platform_id'][0]

In [92]:
organism_id

'9606'

In [93]:
platform_id

'GPL18573'

In [79]:
from Bio import Entrez
Entrez.email = "anand+pybfx@pipal.in"

def get_tax_data(taxid):
    handle = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
    records = Entrez.read(handle)
    if records:
        return records[0]

def get_scientific_name(tax_id):
    data = get_tax_data(tax_id)
    return data['ScientificName']

In [80]:
get_scientific_name("9606")

'Homo sapiens'

In [94]:
platform_id

'GPL18573'

In [96]:
geo_tech = GEOparse.get_GEO(platform_id)

19-May-2023 21:24:57 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:24:57 INFO GEOparse - File already exist: using local version.
19-May-2023 21:24:57 INFO GEOparse - Parsing ./GPL18573.txt: 
19-May-2023 21:24:57 DEBUG GEOparse - PLATFORM: GPL18573


In [101]:
platform = geo_tech.metadata['title'][0]

In [102]:
platform

'Illumina NextSeq 500 (Homo sapiens)'

### Putting all of this together

In [103]:
import GEOparse
from Bio import Entrez

# replace this with your email
Entrez.email = "anand+pybfx@pipal.in"


def get_tax_data(taxid):
    handle = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
    records = Entrez.read(handle)
    if records:
        return records[0]

def get_scientific_name(tax_id):
    data = get_tax_data(tax_id)
    return data['ScientificName']

def get_geo_title(geo_id):
    return GEOparse.get_GEO(technology_type_id).metadata['title'][0]

geo_id = "GSE109816"

gse = GEOparse.get_GEO(geo=geo_id)

title = gse.metadata['title'][0]
expression_type = gse.metadata['type'][0]

contact_name = gse.metadata['contact_name'][0]
contributors = gse.metadata['contributor']
overall_design = gse.metadata['overall_design'][0]

organization_id = gse.metadata['sample_taxid'][0]
organization = get_scientific_name(organization_id)

platform_id = gse.metadata['platform_id']
platform = get_geo_title(platform_id)

summary = gse.metadata['summary'][0]

print("Title:", title)
print("Contact Name:", contact_name)
print("Contributors:", contributor)
print()

print("Expression Type:", expression_type)
print("Overall Design:", overall_design)
print()

print("Organization:", organization)
print("Platform:", platform)
print()

print("Summary:")
print(summary)
print()


19-May-2023 21:25:26 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:25:26 INFO GEOparse - File already exist: using local version.
19-May-2023 21:25:26 INFO GEOparse - Parsing ./GSE109816_family.soft.gz: 
19-May-2023 21:25:26 DEBUG GEOparse - DATABASE: GeoMiame
19-May-2023 21:25:26 DEBUG GEOparse - SERIES: GSE109816
19-May-2023 21:25:26 DEBUG GEOparse - PLATFORM: GPL18573
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970358
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970359
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970360
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970361
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970362
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970363
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970364
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970365
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970366
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970367
19-May-2023 21:25:26 DEBUG GEOpars

Title: Dissecting cell composition and cell-cell interaction network of normal human heart tissue by single-cell sequencing
Contact Name: Li,,Wang
Contributors: ['Li,,Wang', 'Peng,,Yu', 'Zheng,,Li', 'Zongna,,Ren']

Expression Type: Expression profiling by high throughput sequencing
Overall Design: Extract cells from left ateria and ventricle of normal heart and conduct single-cell sequencing of 9994 cells.

Organization: Homo sapiens
Platform: Illumina NextSeq 500 (Homo sapiens)

Summary:
We studied the cell compositon of normal human heart by single-cell sequencing. Distint subgroups of cardiac muscle, fibroblast cell and endothelial cell were detected. We drawed a cell-cell interaction network using specific expressed ligands and receptors of cells. And we also observed the change of interaction and cell transformation with age.



## Extracting Multiple Records and saving as CSV

We can change the prpgram to download multiple records together and convert the data as a pandas Dataframe and then export to a csv file.

In [110]:
import GEOparse
from Bio import Entrez
import pandas as pd

# replace this with your email
Entrez.email = "anand+pybfx@pipal.in"


In [109]:
def get_tax_data(taxid):
    """Returns the record from the NCBI taxonomy database.
    """
    handle = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
    records = Entrez.read(handle)
    if records:
        return records[0]

def get_scientific_name(tax_id):
    """Returns the scientific name given a taxonomy id.
    """
    data = get_tax_data(tax_id)
    return data['ScientificName']

def get_geo_title(geo_id):
    """Returns the title from the GEO record given the id.
    """
    return GEOparse.get_GEO(technology_type_id).metadata['title'][0]

def get_geo_record(geo_id):
    """Returns the GEO record as a dictionary.
    """
    gse = GEOparse.get_GEO(geo=geo_id)

    title = gse.metadata['title'][0]
    expression_type = gse.metadata['type'][0]

    contact_name = gse.metadata['contact_name'][0]
    
    # we need contributors as a single field, seperating them with |
    contributors = " | ".join(gse.metadata['contributor'])
    
    overall_design = gse.metadata['overall_design'][0]

    organization_id = gse.metadata['sample_taxid'][0]
    organization = get_scientific_name(organization_id)

    platform_id = gse.metadata['platform_id']
    platform = get_geo_title(platform_id)

    summary = gse.metadata['summary'][0]
    return {
        "id": geo_id,
        "title": title,
        "expression_type": expression_type,
        "contact_name": contact_name,
        "contributors": contributors,
        "overall_design": overall_design,
        "organization": organization,
        "platform": platform
    }

Let's see if the `get_geo_record` is working.

In [108]:
get_geo_record("GSE109816")

19-May-2023 21:31:37 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:31:37 INFO GEOparse - File already exist: using local version.
19-May-2023 21:31:37 INFO GEOparse - Parsing ./GSE109816_family.soft.gz: 
19-May-2023 21:31:37 DEBUG GEOparse - DATABASE: GeoMiame
19-May-2023 21:31:37 DEBUG GEOparse - SERIES: GSE109816
19-May-2023 21:31:37 DEBUG GEOparse - PLATFORM: GPL18573
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970358
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970359
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970360
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970361
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970362
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970363
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970364
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970365
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970366
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970367
19-May-2023 21:31:37 DEBUG GEOpars

{'title': 'Dissecting cell composition and cell-cell interaction network of normal human heart tissue by single-cell sequencing',
 'expression_type': 'Expression profiling by high throughput sequencing',
 'contact_name': 'Li,,Wang',
 'contributors': 'Li,,Wang | Peng,,Yu | Zheng,,Li | Zongna,,Ren',
 'overall_design': 'Extract cells from left ateria and ventricle of normal heart and conduct single-cell sequencing of 9994 cells.',
 'organization': 'Homo sapiens',
 'platform': 'Illumina NextSeq 500 (Homo sapiens)'}

That seems to be working. 

Let's see how to process mutliple of the them.

In [139]:
def get_geo_records(geo_ids):
    """Gets multiple geo records from NCBI geo database as a pandas dataframe.
    """
    records = [get_geo_record(geo_id) for geo_id in geo_ids]
    df = pd.DataFrame(records)
    df.set_index("id", inplace=True)
    return df

In [140]:
geo_ids = ["GSE109816", "GSE109817", "GSE109818", "GSE109819", "GSE109820"]

In [147]:
df = get_geo_records(geo_ids)

19-May-2023 21:50:57 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:50:57 INFO GEOparse - File already exist: using local version.
19-May-2023 21:50:57 INFO GEOparse - Parsing ./GSE109816_family.soft.gz: 
19-May-2023 21:50:57 DEBUG GEOparse - DATABASE: GeoMiame
19-May-2023 21:50:57 DEBUG GEOparse - SERIES: GSE109816
19-May-2023 21:50:57 DEBUG GEOparse - PLATFORM: GPL18573
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970358
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970359
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970360
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970361
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970362
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970363
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970364
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970365
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970366
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970367
19-May-2023 21:50:57 DEBUG GEOpars

In [143]:
df

Unnamed: 0_level_0,title,expression_type,contact_name,contributors,overall_design,organization,platform
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
GSE109816,Dissecting cell composition and cell-cell inte...,Expression profiling by high throughput sequen...,"Li,,Wang","Li,,Wang | Peng,,Yu | Zheng,,Li | Zongna,,Ren",Extract cells from left ateria and ventricle o...,Homo sapiens,Illumina NextSeq 500 (Homo sapiens)
GSE109817,RNA-sequencing of mouse adult hippocampal prog...,Expression profiling by high throughput sequen...,"Michael,,Piper","Lachlan,,Harris | Michael,,Piper",Hippocampal nestin+ flox-reporter progenitor c...,Mus musculus,Illumina NextSeq 500 (Homo sapiens)
GSE109818,Changes in gene expression in human skeletal s...,Expression profiling by array,"domenico,,raimondo","Domenico,,Raimondo | Cristina,,Remoli | Letizi...",Human bone marrow stromal cells (hBMSCs) (deri...,Homo sapiens,Illumina NextSeq 500 (Homo sapiens)
GSE109819,Transcriptome analysis of Escherichia coli str...,Expression profiling by high throughput sequen...,"Pablo,Emiliano,Tomatis","Pablo,E,Tomatis | Andreas,,Plueckthun","6 samples, 3 replicates",Escherichia coli,Illumina NextSeq 500 (Homo sapiens)
GSE109820,Early dynamics of ERa and GRHL2 binding on sti...,Genome binding/occupancy profiling by high thr...,"Andrew,Nicholas,Holding","Andrew,N,Holding | Florian,,Markowetz",ChIP-seq data in MCF7 at three time-points for...,Homo sapiens,Illumina NextSeq 500 (Homo sapiens)


In [144]:
df.to_csv("gse.csv")

In [145]:
!cat gse.csv

id,title,expression_type,contact_name,contributors,overall_design,organization,platform
GSE109816,Dissecting cell composition and cell-cell interaction network of normal human heart tissue by single-cell sequencing,Expression profiling by high throughput sequencing,"Li,,Wang","Li,,Wang | Peng,,Yu | Zheng,,Li | Zongna,,Ren",Extract cells from left ateria and ventricle of normal heart and conduct single-cell sequencing of 9994 cells.,Homo sapiens,Illumina NextSeq 500 (Homo sapiens)
GSE109817,RNA-sequencing of mouse adult hippocampal progenitor cells in which Nfix was deleted.,Expression profiling by high throughput sequencing,"Michael,,Piper","Lachlan,,Harris | Michael,,Piper","Hippocampal nestin+ flox-reporter progenitor cells (3 wt, 3 kos - 60 days post tamoxifen adminstration), dcx+ flox-reporter progenitor cells (3 wt, 3 kos - 7 days post administration)",Mus musculus,Illumina NextSeq 500 (Homo sapiens)
GSE109818,Changes in gene expression in human skeletal stem cells transduced w

## Summary

We've seen how to download data from a GEO accession from NCBI GEO profiles dataset using Python. We've also extended it to download multiple records and save it as pandas datafame and export as csv.

Please note that the `GEOparse` library downloads the GEO data into the current directory. See the [Geoparse documentation][1] to see how to use a different directory for storing the downloaded files.

[1]: https://geoparse.readthedocs.io/en/latest/

## References

* [GEOparse documentation][1]
* [How to Retrieve NCBI GEO Information using Python (Bio.Entrez and GEOparse) - Xinzhou Liu][2]
* [Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks][3]

[1]: https://geoparse.readthedocs.io/en/latest/
[2]: https://newarkcaptain.com/how-to-retrieve-ncbi-geo-information-using-apis-part1/
[3]: https://journals.plos.org/ploscompbiol/article?id=10.1371%2Fjournal.pcbi.1007007

