# Load datasets 

The gene expression tables of UMI counts (cells x genes) for each experiment are found at [GEO:GSE173947](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE173947). Information about each sample (including the experimental conditions, the number of OSNs, the 10x kit, the corresponding raw data at the GEO, etc.) can be found in [Table S1](../data/tables/GSE173947_Table_S1_datasets.csv) and in [GSE173947_Dataset_raw_file_names.csv](../data/tables/GSE173947_Dataset_raw_file_names.csv). 

As described in the [README](../README.md), please download the `<expt>_umi_counts.csv.gz` files for the experiments you are interested in into the [data/raw](../data/raw) folder.

This notebook loads those supplemental `<expt>_umi_counts.csv.gz` files from the GEO and converts the raw UMI counts into `AnnData` objects (saved in [data/processed](../data/processed)) for use in downstream analyses.

In [1]:
from pprint import pprint

import pandas as pd

from osn.preprocess import get_data_folders, find_raw_count_files
from osn.preprocess.make_adata import load_expts

## Load supplemental file with information about each dataset 

In [2]:
data_fold = get_data_folders()
df_expt = pd.read_csv(data_fold.tables / 'GSE173947_Dataset_raw_file_names.csv')
print(f"Found {df_expt.GEO_sample.nunique()} unique samples")
display(df_expt.head(6))
expts = dict(zip(df_expt.counts, df_expt.metadata))
pprint(expts)
print("\n# of samples for each condition:")
pprint(df_expt.groupby(['genotype', 'environment', 'odor', 'time']).orig_ident.nunique())

Found 152 unique samples


| Unnamed: 0 | GEO_sample | genotype | environment | odor | time | source | orig_ident | bam | counts | metadata |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | homecage-1 | WT | home-cage |  |  | baseline-cage | baseline-1 | homecage-1_possorted_genome_bam.bam | home_cage_umi_counts.csv.gz | home_cage_metadata.csv.gz |
| 1 | homecage-2 | WT | home-cage |  |  | baseline-cage | baseline-10 | homecage-2_possorted_genome_bam.bam | home_cage_umi_counts.csv.gz | home_cage_metadata.csv.gz |
| 2 | homecage-3 | WT | home-cage |  |  | baseline-cage | baseline-2 | homecage-3_possorted_genome_bam.bam | home_cage_umi_counts.csv.gz | home_cage_metadata.csv.gz |
| 3 | homecage-4 | WT | home-cage |  |  | baseline-cage | baseline-3 | homecage-4_possorted_genome_bam.bam | home_cage_umi_counts.csv.gz | home_cage_metadata.csv.gz |
| 4 | homecage-5 | WT | home-cage |  |  | baseline-cage | baseline-4 | homecage-5_possorted_genome_bam.bam | home_cage_umi_counts.csv.gz | home_cage_metadata.csv.gz |
| 5 | homecage-6 | WT | home-cage |  |  | baseline-cage | baseline-9 | homecage-6_possorted_genome_bam.bam | home_cage_umi_counts.csv.gz | home_cage_metadata.csv.gz |


{'ActSeq_24h_umi_counts.csv.gz': 'ActSeq_24h_metadata.csv.gz',
 'ActSeq_conc_analog_umi_counts.csv.gz': 'ActSeq_conc_analog_metadata.csv.gz',
 'ActSeq_umi_counts.csv.gz': 'ActSeq_metadata.csv.gz',
 'Arrb2KO_umi_counts.csv.gz': 'Arrb2KO_metadata.csv.gz',
 'ChronicOccl_umi_counts.csv.gz': 'ChronicOccl_metadata.csv.gz',
 'OR_swap_umi_counts.csv.gz': 'OR_swap_metadata.csv.gz',
 'RevOccl_umi_counts.csv.gz': 'RevOccl_metadata.csv.gz',
 'chronic_ACE_umi_counts.csv.gz': 'chronic_ACE_metadata.csv.gz',
 'envA_bidirectional_switch_umi_counts.csv': 'envA_bidirectional_switch_metadata.csv',
 'envA_timecourse_umi_counts.csv.gz': 'envA_timecourse_metadata.csv.gz',
 'env_switch_umi_counts.csv.gz': 'env_switch_metadata.csv.gz',
 'home_cage_immature_umi_counts.csv.gz': 'home_cage_immature_metadata.csv.gz',
 'home_cage_umi_counts.csv.gz': 'home_cage_metadata.csv.gz',
 'opto_umi_counts.csv.gz': 'opto_metadata.csv.gz'}

# of samples for each condition:
genotype  environment  odor       time                
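
One thing to keep in mind when reading per-condition counts like the ones above: the home-cage rows have empty `odor` and `time` values, and pandas `groupby` excludes rows with missing values in any grouping key unless `dropna=False` is passed (available in pandas 1.1 and later). The sketch below illustrates this on a toy table; the column names follow `df_expt`, but the rows and the `envA`/`ACE`/`2h` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_expt; the values are invented for illustration only.
toy = pd.DataFrame({
    "genotype": ["WT", "WT", "WT"],
    "environment": ["home-cage", "home-cage", "envA"],
    "odor": [np.nan, np.nan, "ACE"],
    "time": [np.nan, np.nan, "2h"],
    "orig_ident": ["baseline-1", "baseline-2", "sample-x"],
})

keys = ["genotype", "environment", "odor", "time"]

# Default behaviour: rows with NaN in any grouping key are excluded.
default_counts = toy.groupby(keys).orig_ident.nunique()

# dropna=False keeps NaN as its own group, so home-cage samples are counted.
all_counts = toy.groupby(keys, dropna=False).orig_ident.nunique()

print(default_counts.sum())  # 1
print(all_counts.sum())      # 3
```

If the per-condition totals do not add up to the number of unique samples, passing `dropna=False` is one thing worth checking.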

## Make AnnData objects from the downloaded CSV files

To process only the `home_cage` data, run:
```python
load_expts(files, name="home_cage")
```

In [3]:
# look to see which csv files were downloaded
files = find_raw_count_files(data_fold)
for name, (count, meta) in files.items():
    print(name, count.name, meta.name)

ActSeq_conc_analog GSE173947_ActSeq_conc_analog_umi_counts.csv.gz GSE173947_ActSeq_conc_analog_metadata.csv.gz
ChronicOccl GSE173947_ChronicOccl_umi_counts.csv.gz GSE173947_ChronicOccl_metadata.csv.gz
home_cage GSE173947_home_cage_umi_counts.csv.gz GSE173947_home_cage_metadata.csv.gz
env_switch GSE173947_env_switch_umi_counts.csv.gz GSE173947_env_switch_metadata.csv.gz
ActSeq GSE173947_ActSeq_umi_counts.csv.gz GSE173947_ActSeq_metadata.csv.gz
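
The listing above pairs each counts file with its metadata file under a short experiment name. The source does not show how `find_raw_count_files` does this, so the following is only a rough sketch of one way such pairing could work with `pathlib`. The function name `pair_count_files` and its `prefix` argument are hypothetical and are not part of the `osn` package.

```python
from pathlib import Path
import tempfile


def pair_count_files(raw_dir, prefix="GSE173947_"):
    """Map experiment name -> (counts_path, metadata_path) for complete pairs.

    Hypothetical helper for illustration; not the osn implementation.
    """
    suffix = "_umi_counts.csv.gz"
    pairs = {}
    for counts in sorted(Path(raw_dir).glob(f"*{suffix}")):
        stem = counts.name[: -len(suffix)]
        meta = counts.with_name(f"{stem}_metadata.csv.gz")
        if not meta.exists():
            continue  # skip experiments whose metadata was not downloaded
        # Drop the GEO accession prefix to get the short experiment name.
        name = stem[len(prefix):] if stem.startswith(prefix) else stem
        pairs[name] = (counts, meta)
    return pairs


# Demo on empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    for fname in [
        "GSE173947_home_cage_umi_counts.csv.gz",
        "GSE173947_home_cage_metadata.csv.gz",
        "GSE173947_ActSeq_umi_counts.csv.gz",  # metadata deliberately missing
    ]:
        (raw / fname).touch()
    found = pair_count_files(raw)
    found_names = sorted(found)

print(found_names)  # ['home_cage']
```

Skipping incomplete pairs is a design choice in this sketch; raising an error instead would make a partially downloaded experiment easier to notice.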


In [4]:
# if memory allows, you could try `low_memory=False` and experiment with different `chunksize` values
load_expts(files, read_kwargs={"low_memory": True})

2021-12-18 12:37:30,189 - INFO - Already found ActSeq_conc_analog_norm.h5ad file for ActSeq_conc_analog, skipping. To overwrite use `force=True`.
2021-12-18 12:37:30,190 - INFO - Already found ChronicOccl_norm.h5ad file for ChronicOccl, skipping. To overwrite use `force=True`.
2021-12-18 12:37:30,191 - INFO - Already found home_cage_norm.h5ad file for home_cage, skipping. To overwrite use `force=True`.
2021-12-18 12:37:30,192 - INFO - Already found env_switch_norm.h5ad file for env_switch, skipping. To overwrite use `force=True`.
2021-12-18 12:37:30,192 - INFO - Already found ActSeq_norm.h5ad file for ActSeq, skipping. To overwrite use `force=True`.
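
The `low_memory` and `chunksize` options mentioned above are `pandas.read_csv` parameters, so `read_kwargs` is presumably forwarded to that function (an assumption; the source does not show the internals of `load_expts`). As a minimal sketch of what chunked reading looks like in plain pandas, the hypothetical helper `read_counts_chunked` below reads a small toy cells x genes table in row chunks and downcasts each chunk before concatenating.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def read_counts_chunked(path, chunksize=2, dtype=np.int32):
    """Read a cells x genes UMI table in row chunks and concatenate them.

    Hypothetical helper for illustration; not part of the osn package.
    """
    chunks = []
    # With chunksize set, read_csv yields DataFrames of up to `chunksize` rows.
    for chunk in pd.read_csv(path, index_col=0, chunksize=chunksize):
        # Downcasting each chunk keeps peak memory lower than casting at the end.
        chunks.append(chunk.astype(dtype))
    return pd.concat(chunks)


# Demo on a toy table written to a temporary gzip-compressed CSV.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_umi_counts.csv.gz"
    toy = pd.DataFrame(
        np.arange(15).reshape(5, 3),
        index=[f"cell{i}" for i in range(5)],
        columns=["geneA", "geneB", "geneC"],
    )
    toy.to_csv(path)  # compression is inferred from the .gz extension
    counts = read_counts_chunked(path, chunksize=2)

print(counts.shape)  # (5, 3)
```

For real count tables the chunk size would be much larger; the best value depends on the available memory and the number of genes per row.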
