### 3K Rice Genome GWAS Dataset Export Usage

The dataset was exported as a single Hail MatrixTable (`.mt`) and also as separate files for variants (`csv.gz`), samples (`csv`), and calls (`zarr`).

In [10]:
from pathlib import Path
import pandas as pd
import numpy as np
import hail as hl
import zarr
hl.init()

Running on Apache Spark version 2.4.4
SparkUI available at http://8352602c2ab9:4041
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.32-a5876a0a2853
LOGGING: writing to /home/eczech/repos/gwas-analysis/notebooks/organism/rice/hail-20200514-1737-0.2.32-a5876a0a2853.log


In [11]:
path = Path('~/data/gwas/rice-snpseek/1M_GWAS_SNP_Dataset/rg-3k-gwas-export').expanduser()
path

PosixPath('/home/eczech/data/gwas/rice-snpseek/1M_GWAS_SNP_Dataset/rg-3k-gwas-export')

In [15]:
!du -sh {str(path)}/*

582M	/home/eczech/data/gwas/rice-snpseek/1M_GWAS_SNP_Dataset/rg-3k-gwas-export/rg-3k-gwas-export.calls.zarr
336K	/home/eczech/data/gwas/rice-snpseek/1M_GWAS_SNP_Dataset/rg-3k-gwas-export/rg-3k-gwas-export.cols.csv
471M	/home/eczech/data/gwas/rice-snpseek/1M_GWAS_SNP_Dataset/rg-3k-gwas-export/rg-3k-gwas-export.mt
7.5M	/home/eczech/data/gwas/rice-snpseek/1M_GWAS_SNP_Dataset/rg-3k-gwas-export/rg-3k-gwas-export.rows.csv.gz


### Hail

In [16]:
# The entire table with row, col, and call data:
hl.read_matrix_table(str(path / 'rg-3k-gwas-export.mt')).describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'acc_seq_no': int64
    'acc_stock_id': int64
    'acc_gs_acc': float64
    'acc_gs_variety_name': str
    'acc_igrc_acc_src': int64
    'pt_APANTH_REPRO': float64
    'pt_APSH': float64
    'pt_APCO_REV_POST': float64
    'pt_APCO_REV_REPRO': float64
    'pt_AWCO_LREV': float64
    'pt_AWCO_REV': float64
    'pt_AWDIST': float64
    'pt_BLANTHPR_VEG': float64
    'pt_BLANTHDI_VEG': float64
    'pt_BLPUB_VEG': float64
    'pt_BLSCO_ANTH_VEG': float64
    'pt_BLSCO_REV_VEG': float64
    'pt_CCO_REV_VEG': float64
    'pt_CUAN_REPRO': float64
    'pt_ENDO': float64
    'pt_FLA_EREPRO': float64
    'pt_FLA_REPRO': float64
    'pt_INANTH': float64
    'pt_LIGCO_REV_VEG': float64
    'pt_LIGSH': float64
    'pt_LPCO_REV_POST': float64
    'pt_LPPUB': float64
    'pt_LSEN': float64
    'pt_NOANTH': float64
    'pt_PEX_REPRO': float64
    'pt_PTH': float64
 

### Pandas

The sample table holds phenotypes in columns prefixed with `pt_`. Its `s` column (the sample ID) matches `s` in the MatrixTable, and the rows are in the same order:

In [18]:
pd.read_csv(path / 'rg-3k-gwas-export.cols.csv').head()

Unnamed: 0,s,acc_seq_no,acc_stock_id,acc_gs_acc,acc_gs_variety_name,acc_igrc_acc_src,pt_APANTH_REPRO,pt_APSH,pt_APCO_REV_POST,pt_APCO_REV_REPRO,...,pt_LPPUB,pt_LSEN,pt_NOANTH,pt_PEX_REPRO,pt_PTH,pt_SCCO_REV,pt_SECOND_BR_REPRO,pt_SLCO_REV,pt_SPKF,pt_SLLT_CODE
0,IRIS_313-10000,335,387,125907.0,SUWEON 311::IRGC 61890-1,61890,,,,20.0,...,2.0,9.0,,7.0,1.0,10.0,1.0,20.0,5.0,1.0
1,IRIS_313-10001,336,388,125692.0,C 662083::IRGC 62101-1,62101,,,,20.0,...,2.0,7.0,,5.0,2.0,10.0,1.0,20.0,4.0,3.0
2,IRIS_313-10002,103,129,125955.0,BW 295-5::IRGC 63098-1,63098,,,20.0,20.0,...,4.0,7.0,,7.0,3.0,10.0,1.0,20.0,4.0,1.0
3,IRIS_313-10007,337,389,125749.0,GARURA::IRGC 64111-1,64111,,,,10.0,...,4.0,3.0,,5.0,3.0,10.0,1.0,20.0,4.0,3.0
4,IRIS_313-10010,338,390,125818.0,LALKA (LAL DHAN)::IRGC 64946-1,64946,,,,20.0,...,2.0,5.0,,7.0,3.0,10.0,1.0,20.0,4.0,3.0


Variant data shouldn't be needed for much, but it's here:

In [20]:
pd.read_csv(path / 'rg-3k-gwas-export.rows.csv.gz').head()

Unnamed: 0,locus.contig,locus.position,alleles,rsid,cm_position
0,1,1151,"['C', 'A']",1151,0.0
1,1,1178,"['G', 'T']",1178,0.0
2,1,1203,"['T', 'C']",1203,0.0
3,1,1248,"['G', 'A']",1248,0.0
4,1,1282,"['G', 'A']",1282,0.0
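
The `alleles` column comes back from `read_csv` as plain strings such as `"['C', 'A']"`. If separate allele columns are needed, something like the following sketch may help. It uses an inline copy of the first rows shown above instead of the real file, the `ref` and `alt` column names are illustrative, and it assumes Hail's convention that the first allele is the reference:

```python
import ast
import io

import pandas as pd

# Inline stand-in for rg-3k-gwas-export.rows.csv.gz (first rows shown above).
csv_text = '''locus.contig,locus.position,alleles,rsid,cm_position
1,1151,"['C', 'A']",1151,0.0
1,1178,"['G', 'T']",1178,0.0
1,1203,"['T', 'C']",1203,0.0
'''

rows = pd.read_csv(io.StringIO(csv_text))

# Each alleles value is a string; literal_eval turns it back into a list.
alleles = rows['alleles'].apply(ast.literal_eval)

# Illustrative column names; assumes the first allele is the reference.
rows['ref'] = [a[0] for a in alleles]
rows['alt'] = [a[1] for a in alleles]

print(rows[['locus.position', 'ref', 'alt']])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for values read from a file.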


### Zarr

Call data (dense and mean-imputed in this case) can be sliced directly from the zarr array, with variants on the first axis and samples on the second:

In [29]:
gt = zarr.open(str(path / 'rg-3k-gwas-export.calls.zarr'), mode='r')
# Get calls for 10 variants and 5 samples
gt[5:15, 5:10]

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 2, 0, 2, 0],
       [0, 0, 0, 0, 0],
       [0, 2, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 2, 0, 2, 0],
       [0, 0, 0, 0, 0],
       [0, 2, 0, 2, 0]], dtype=int8)

### Selecting Phenotypes

Pick a phenotype:
    
- Definitions are in https://s3-ap-southeast-1.amazonaws.com/oryzasnp-atcg-irri-org/3kRG-phenotypes/3kRG_PhenotypeData_v20170411.xlsx
    - The ">2007 Dictionary" sheet
- Choose one with low sparsity, i.e. one that is non-null for most of the 2,113 samples
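
One way to compare sparsity across phenotypes is to rank the `pt_` columns by their fraction of missing values. The sketch below uses a small made-up frame (`df_demo`, with hypothetical columns `pt_A`, `pt_B`, `pt_C`) in place of the real sample table; the same two lines should apply to the frame loaded from `rg-3k-gwas-export.cols.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sample table; the values are made up.
df_demo = pd.DataFrame({
    's': ['s0', 's1', 's2', 's3'],
    'pt_A': [1.0, np.nan, np.nan, np.nan],
    'pt_B': [1.0, 2.0, 3.0, np.nan],
    'pt_C': [1.0, 2.0, 3.0, 4.0],
})

# Keep only the phenotype columns (prefixed with pt_).
pt_cols = [c for c in df_demo.columns if c.startswith('pt_')]

# Fraction of samples with a missing value, least sparse first.
missing_frac = df_demo[pt_cols].isnull().mean().sort_values()
print(missing_frac)
```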


In [28]:
df = pd.read_csv(path / 'rg-3k-gwas-export.cols.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2113 entries, 0 to 2112
Data columns (total 37 columns):
s                      2113 non-null object
acc_seq_no             2113 non-null int64
acc_stock_id           2113 non-null int64
acc_gs_acc             2113 non-null float64
acc_gs_variety_name    2113 non-null object
acc_igrc_acc_src       2113 non-null int64
pt_APANTH_REPRO        91 non-null float64
pt_APSH                133 non-null float64
pt_APCO_REV_POST       552 non-null float64
pt_APCO_REV_REPRO      2108 non-null float64
pt_AWCO_LREV           133 non-null float64
pt_AWCO_REV            2112 non-null float64
pt_AWDIST              30 non-null float64
pt_BLANTHPR_VEG        133 non-null float64
pt_BLANTHDI_VEG        13 non-null float64
pt_BLPUB_VEG           2112 non-null float64
pt_BLSCO_ANTH_VEG      133 non-null float64
pt_BLSCO_REV_VEG       2111 non-null float64
pt_CCO_REV_VEG         2110 non-null float64
pt_CUAN_REPRO          2111 non-null float64
pt_ENDO     

In [35]:
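# Calls for the first 1k variants, restricted to samples that have a value
# for this phenotype. The mask is indexed by sample, in the same order as
# the columns of the call array.
mask = df['pt_FLA_REPRO'].notnull()
# Slice variants from zarr first (returns a NumPy array), then mask samples
gtp = gt[:1000][:, mask]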
gtp.shape, gtp.dtype

((1000, 2109), dtype('int8'))
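
For a downstream association test, the phenotype values usually need to be filtered with the same mask so that they line up with the genotype columns. A minimal sketch of that alignment, using small made-up arrays (`df_demo`, `gt_demo`, and the column `pt_demo` are all hypothetical) rather than the real export:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 4 samples and 3 variants (variants x samples).
df_demo = pd.DataFrame({
    's': ['s0', 's1', 's2', 's3'],
    'pt_demo': [5.0, np.nan, 7.0, 9.0],
})
gt_demo = np.array([[0, 2, 0, 2],
                    [0, 0, 2, 0],
                    [2, 2, 0, 0]], dtype=np.int8)

# Boolean mask over samples, as a plain NumPy array.
mask = df_demo['pt_demo'].notnull().to_numpy()

# Apply the same mask to genotypes, phenotype values, and sample IDs,
# so column j of gtp corresponds to entry j of y and sample_ids.
gtp = gt_demo[:, mask]
y = df_demo.loc[mask, 'pt_demo'].to_numpy()
sample_ids = df_demo.loc[mask, 's'].to_numpy()

print(gtp.shape, y.shape)
```

This relies on the rows of the sample table being in the same order as the sample axis of the call array, as in the masking step above.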