## Download the manifest files

The IDR team curates two metadata files, which include experimental details and downloadable file paths:

1. Experimental details
2. Plate info (including file paths)

This notebook downloads both metadata files and demonstrates how to extract the file paths.

In [1]:
import pathlib
import pandas as pd

In [2]:
# The metadata files are stored on GitHub
repo = "https://github.com/IDR/idr0080-way-perturbation"
commit = "74e537fecaa4690f0c98cb1e9a64b45d103de3e3"

github_dir = f"{repo}/raw/{commit}/screenA"
output_dir = "manifest"

metadata_file = "idr0080-screenA-annotation.csv"
plate_file = "idr0080-screenA-plates.tsv"
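
When URLs are assembled by string interpolation, a directory that ends in `/` combined with a `/` separator yields a doubled slash in the path. The sketch below shows one way to normalize the pieces before joining them. `raw_github_url` is a hypothetical helper, not part of the original notebook; it only assumes the `{repo}/raw/{commit}/{path}` layout used in the cell above.

```python
def raw_github_url(repo: str, commit: str, *parts: str) -> str:
    """Join a GitHub repo URL, a commit hash, and path parts into a raw-file URL."""
    # Strip stray slashes from each piece so the joined path contains no "//"
    pieces = [repo.rstrip("/"), "raw", commit]
    pieces.extend(part.strip("/") for part in parts)
    return "/".join(pieces)


repo = "https://github.com/IDR/idr0080-way-perturbation"
commit = "74e537fecaa4690f0c98cb1e9a64b45d103de3e3"

# A trailing slash on "screenA/" is tolerated and removed
url = raw_github_url(repo, commit, "screenA/", "idr0080-screenA-annotation.csv")
print(url)
```

Pinning the URL to a commit hash, as the notebook does, means the downloaded files stay the same even if the repository is updated later.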

In [3]:
# Load metadata file and write to local disk
metadata_df = pd.read_csv(f"{github_dir}/{metadata_file}")

output_file = pathlib.Path(output_dir, metadata_file)
output_file.parent.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists
metadata_df.to_csv(output_file, sep=",", index=False)

print(metadata_df.shape)
metadata_df.head(2)

(6912, 17)


| | Plate | Well | Characteristics [Organism] | Term Source 1 REF | Term Source 1 Accession | Characteristics [Cell Line] | Term Source 2 REF | Term Source 2 Accession | Reagent Identifier | Sense Sequence | Antisense Sequence | Reagent Design Gene Annotation Build | Gene Identifier | Gene Symbol | Control Type | Channels | Comment [Image] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SQ00014610_illum_corrected | A1 | Homo sapiens | NCBITaxon | NCBITaxon_9606 | A549 | EFO | EFO_0001086 | | | | | | | no reagent | Hoechst 33342 (DNA); Concanavalin A/Alexa 488 ... | images are illumination corrected; empty well |
| 1 | SQ00014610_illum_corrected | A2 | Homo sapiens | NCBITaxon | NCBITaxon_9606 | A549 | EFO | EFO_0001086 | MCL1-5 | CATTCCTGATGCCACCTTCT | GTAAGGACTACGGTGGAAGA | Ensembl release 101 - August 2020 | ENSG00000143384 | MCL1 | | Hoechst 33342 (DNA); Concanavalin A/Alexa 488 ... | images are illumination corrected |


In [4]:
# Load plate file
plate_df = pd.read_csv(f"{github_dir}/{plate_file}", sep="\t", header=None)

plate_df.columns = ["plate", "manifest_path"]

print(plate_df.shape)
plate_df.head(2)

(18, 2)


| | plate | manifest_path |
|---|---|---|
| 0 | SQ00014610 | /uod/idr/filesets/idr0080-way-perturbation/202... |
| 1 | SQ00014611 | /uod/idr/filesets/idr0080-way-perturbation/202... |
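
Judging from the two previews above, the annotation file labels plates with an `_illum_corrected` suffix (`SQ00014610_illum_corrected`), whereas the plate file uses the bare barcode (`SQ00014610`). If the two tables ever need to be matched, a helper along these lines could normalize the names. `plate_barcode` is hypothetical and not part of the notebook, and the assumption that every annotation plate carries exactly this suffix rests only on the rows shown here.

```python
def plate_barcode(annotation_plate: str, suffix: str = "_illum_corrected") -> str:
    """Return the bare plate barcode by dropping a trailing suffix, if present."""
    # Only remove the suffix when it sits at the very end of the name
    if suffix and annotation_plate.endswith(suffix):
        return annotation_plate[: -len(suffix)]
    return annotation_plate


print(plate_barcode("SQ00014610_illum_corrected"))
print(plate_barcode("SQ00014610"))
```

Names without the suffix pass through unchanged, so the helper can be applied to either file's plate column.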


According to the IDR instructions, only part of each manifest path is needed for download:

> After removing the leading /uod/idr/filesets/idrNNN-author-description/, you can then download a subfolder using the same commands as above:

In [5]:
# Strip the leading prefix from each manifest path and store the result in a new column
idr_id = "idr0080-way-perturbation"
strip_id = f"/uod/idr/filesets/{idr_id}/"

# regex=False treats strip_id as a literal string rather than a pattern
plate_df = plate_df.assign(
    download_path=plate_df.manifest_path.str.replace(strip_id, "", regex=False)
)
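
The `str.replace` call above removes the prefix wherever it occurs in the string and silently leaves paths that lack it unchanged. A stricter variant, sketched below in plain Python, strips the prefix only at the start and raises an error otherwise. `strip_idr_prefix` is a hypothetical helper, and the example path is rebuilt from the prefix plus the first `download_path` entry printed later in the notebook.

```python
def strip_idr_prefix(manifest_path: str, idr_id: str) -> str:
    """Remove the leading /uod/idr/filesets/<idr_id>/ from a manifest path."""
    prefix = f"/uod/idr/filesets/{idr_id}/"
    # Fail loudly if the path does not have the expected layout
    if not manifest_path.startswith(prefix):
        raise ValueError(f"unexpected manifest path: {manifest_path!r}")
    return manifest_path[len(prefix):]


example = (
    "/uod/idr/filesets/idr0080-way-perturbation/"
    "20200316-s3/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/"
    "images/SQ00014610__2016-06-16T00_38_35-Measurement2"
)
print(strip_idr_prefix(example, "idr0080-way-perturbation"))
```

With pandas, the same helper could be applied per row, for example via `plate_df.manifest_path.map(lambda p: strip_idr_prefix(p, idr_id))`.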

In [6]:
# Write to local disk (tab-separated, to match the .tsv extension)
output_file = pathlib.Path(output_dir, plate_file)
plate_df.to_csv(output_file, sep="\t", index=False)

plate_df.head(2)

| | plate | manifest_path | download_path |
|---|---|---|---|
| 0 | SQ00014610 | /uod/idr/filesets/idr0080-way-perturbation/202... | 20200316-s3/2015_07_01_Cell_Health_Vazquez_Can... |
| 1 | SQ00014611 | /uod/idr/filesets/idr0080-way-perturbation/202... | 20200316-s3/2015_07_01_Cell_Health_Vazquez_Can... |


In [7]:
plate_df.download_path.tolist()

['20200316-s3/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images/SQ00014610__2016-06-16T00_38_35-Measurement2',
 '20200316-s3/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images/SQ00014611__2016-06-16T02_16_27-Measurement2',
 '20200316-s3/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images/SQ00014612__2016-06-15T19_44_15-Measurement2',
 '20200316-s3/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images/SQ00014613__2016-06-16T07_10_56-Measurement1',
 '20200316-s3/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images/SQ00014614__2016-06-16T08_48_59-Measurement1',
 '20200316-s3/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images/SQ00014615__2016-06-15T21_22_09-Measurement1',
 '20200316-s3/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images/SQ00014616__2016-06-15T23_00_48-Measurement1',
 '20200316-s3/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images/SQ00014617__2016-06-16T