Fetching and cleaning data from the szdb, the schizophrenia database.

In [98]:
%load_ext autoreload
%autoreload 2
%reload_ext autoreload
import pandas as pd
import re
import szdb

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## CNVs
###  From the szdb based on 77 studies

The locations of the CNVs implicated in schizophrenia are given in terms of cytogenetic bands.  Source URL:

http://www.szdb.org/download/CNV.txt

In [99]:
fpath = '/big/results/bsm/2020-07-24-szdb/downloaded/CNV.txt'
CNV = pd.read_csv(fpath, sep='\t')

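Keep only the rows whose `cytoband` begins with an arm (`p` or `q`) followed by digits, then extract the start cytoband. The end cytoband is initialized to the same value as the start cytoband.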

In [100]:
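# Keep rows whose cytoband starts with p/q + digits; copy to avoid chained-assignment warnings
CNV = CNV.loc[[bool(re.match(r'^[pq][0-9]+', y)) for y in CNV['cytoband']], :].copy()
# Raw strings keep the regex escapes intact; everything after the first band name is dropped
CNV['start cytoband'] = [re.sub(r'([pq][0-9]+(\.[0-9]+)?).*$', r'\1', c) for c in CNV.loc[:, 'cytoband']]
CNV['end cytoband'] = CNV.loc[:, 'start cytoband']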
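
To make the effect of the two regular expressions concrete, here is a minimal sketch that applies them to a few made-up cytoband strings. The sample strings and the helper name `start_band` are assumptions for illustration only; the actual formats in `CNV.txt` may differ.

```python
import re

def start_band(cytoband):
    """Return the leading band name, or None if the string
    does not start with an arm (p or q) followed by digits."""
    if not re.match(r'^[pq][0-9]+', cytoband):
        return None
    # Keep the first band name and drop everything after it
    return re.sub(r'([pq][0-9]+(\.[0-9]+)?).*$', r'\1', cytoband)

# Hypothetical inputs, not taken from CNV.txt
samples = ['q11.21-q11.23', 'p13.1', 'q22', '22q11.2']
extracted = [start_band(s) for s in samples]
print(extracted)
```

The last sample is rejected because it starts with a chromosome number rather than an arm, which is the kind of row the filter in the cell above discards.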

### Cytoband mapping for hg19
Mapping between cytobands and nucleotide base positions for hg19/GRCh37

https://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/cytoBand.txt.gz

First I extend the mapping with GRCh37 style contig names and index the DataFrame.

In [101]:
hgcb = pd.read_csv('/big/data/refgenome/GRCh37/cytoBand.tsv', sep='\t')
# 'chromosome' column will hold contig names in GRCh37 style
hgcb['chromosome'] = [c.replace('chr', '') for c in hgcb['chr']]
# Index with tuples.  Multiindex might be better.
hgcb.index = [(c.replace('chr', ''), n) for c, n in zip(hgcb['chr'], hgcb['name'])] 
hgcb

Unnamed: 0,chr,start,end,name,score,chromosome
"(1, p36.33)",chr1,0,2300000,p36.33,gneg,1
"(1, p36.32)",chr1,2300000,5400000,p36.32,gpos25,1
"(1, p36.31)",chr1,5400000,7200000,p36.31,gneg,1
"(1, p36.23)",chr1,7200000,9200000,p36.23,gpos25,1
"(1, p36.22)",chr1,9200000,12700000,p36.22,gneg,1
...,...,...,...,...,...,...
"(Y, q11.221)",chrY,15100000,19800000,q11.221,gpos50,Y
"(Y, q11.222)",chrY,19800000,22100000,q11.222,gneg,Y
"(Y, q11.223)",chrY,22100000,26200000,q11.223,gpos50,Y
"(Y, q11.23)",chrY,26200000,28800000,q11.23,gneg,Y

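The comment in the cell above mentions a MultiIndex as a possible alternative to tuple labels. Below is a minimal sketch of that variant on a toy frame whose three rows are copied from the output above. The names `toy` and `toy_mi` are introduced here for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy rows copied from the hg19 table shown above
toy = pd.DataFrame({
    'chr': ['chr1', 'chr1', 'chrY'],
    'start': [0, 2300000, 26200000],
    'end': [2300000, 5400000, 28800000],
    'name': ['p36.33', 'p36.32', 'q11.23'],
})
# GRCh37-style contig names: strip the 'chr' prefix
toy['chromosome'] = toy['chr'].str.replace('chr', '', regex=False)
# Explicit two-level index (chromosome, band name), sorted for lookups
toy_mi = toy.set_index(['chromosome', 'name']).sort_index()

# Look up the coordinates of a single band by its (chromosome, name) key
band_start = toy_mi.loc[('1', 'p36.32'), 'start']
band_end = toy_mi.loc[('1', 'p36.32'), 'end']
print(band_start, band_end)
```

With named index levels, all bands of one chromosome can also be selected with `toy_mi.loc['1']`.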

Comparing the set of cytobands used in `CNV` (from szdb) to those in `hgcb` (from the hg19 bundle):

In [102]:
set_hgcb = set(hgcb.index)
set_CNV_start = set(list(zip(CNV['chromosome'], CNV['start cytoband'])))
set_CNV_end = set(list(zip(CNV['chromosome'], CNV['end cytoband'])))

In the previous code chunk, `set_hgcb` is the set of cytoband names in `hgcb`. `set_CNV_start` and `set_CNV_end` are the sets of cytoband names that define the *start* and *end* of some schizophrenia-related CNV, respectively.

Both `set_CNV_start` and `set_CNV_end` contain cytoband names that are missing from `set_hgcb`. The following code shows that the same names are missing for `set_CNV_start` as for `set_CNV_end`.

In [103]:
set_CNV_start - set_hgcb == set_CNV_end - set_hgcb

True
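
The comparison relies on plain set difference: `A - B` keeps the elements of `A` that are absent from `B`. A small sketch with made-up sets shows the idea; the names `reference`, `starts`, and `ends` and their contents are hypothetical stand-ins for `set_hgcb`, `set_CNV_start`, and `set_CNV_end`.

```python
# Hypothetical stand-ins for set_hgcb, set_CNV_start and set_CNV_end
reference = {('22', 'q11.21'), ('22', 'q11.22'), ('22', 'q11.23'), ('16', 'p12.1')}
starts = {('22', 'q11.21'), ('22', 'q11.2'), ('16', 'p12')}
ends = {('22', 'q11.23'), ('22', 'q11.2'), ('16', 'p12')}

# Names used by the CNVs that the reference mapping does not know
missing_starts = starts - reference
missing_ends = ends - reference

print(sorted(missing_starts))
print(missing_starts == missing_ends)
```

In the notebook itself `end cytoband` was initialized as a copy of `start cytoband`, so the two differences are expected to be equal at this stage.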

I refer to the missing names as *incorrect*.  The incorrect names are as follows:

In [104]:
incorr = list(set_CNV_start - set_hgcb)
incorr

[('16', 'p12'),
 ('16', 'p13.1'),
 ('6', 'q22'),
 ('13', 'q31'),
 ('7', 'p36.3'),
 ('22', 'q11.2'),
 ('3', 'p263'),
 ('21', 'q22'),
 ('7', 'q11.2'),
 ('22', 'q11'),
 ('8', 'q11.2'),
 ('16', 'p11'),
 ('15', 'q11'),
 ('6', 'p13.11'),
 ('10', 'q23'),
 ('6', 'q24'),
 ('20', 'q11'),
 ('15', 'q12.1'),
 ('17', 'q21.3'),
 ('3', 'p21')]

Inspecting `hgcb` shows that most of the incorrect cytobands are coarse names composed of multiple, more finely divided sub-cytobands.  So to correct each incorrect name I take the start (5') and end (3') member of its sub-cytobands from `hgcb`.  I will enter these semi-manually into the `corr start` and `corr end` columns of the following data frame.

In [105]:
corr_cb = pd.DataFrame({'incorr': incorr}, index=incorr)
corr_cb['corr start'] = corr_cb.loc[:, 'incorr']
corr_cb['corr end'] = corr_cb.loc[:, 'incorr']
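
As a cross-check on the manual entries, the sub-cytobands of a coarse name could be found by prefix matching on the band name. The sketch below is one possible approach, not the method used in this notebook. The function name `sub_band_span` is hypothetical, and it assumes a frame with `chromosome`, `name`, `start`, and `end` columns like `hgcb`. It orders sub-bands by coordinate, so on a p arm the first sub-band is the one with the highest number, whereas the manual entries in this notebook use name order (for example `p12.1` before `p12.3`).

```python
import pandas as pd

def sub_band_span(cb, chromosome, band):
    """Find the sub-bands of a coarse band name by prefix matching.

    Returns (first_name, last_name, start, end) ordered by coordinate,
    or None if no band on that chromosome starts with `band`.
    """
    on_chrom = cb[cb['chromosome'] == chromosome]
    subs = on_chrom[on_chrom['name'].str.startswith(band)].sort_values('start')
    if subs.empty:
        return None
    return (subs['name'].iloc[0], subs['name'].iloc[-1],
            int(subs['start'].iloc[0]), int(subs['end'].iloc[-1]))

# Toy rows copied from the hg19 table shown earlier
toy = pd.DataFrame({
    'chromosome': ['1'] * 5,
    'name': ['p36.33', 'p36.32', 'p36.31', 'p36.23', 'p36.22'],
    'start': [0, 2300000, 5400000, 7200000, 9200000],
    'end': [2300000, 5400000, 7200000, 9200000, 12700000],
})
span = sub_band_span(toy, '1', 'p36.3')
no_match = sub_band_span(toy, '1', 'p263')
print(span, no_match)
```

A name that matches nothing, such as a typo, returns `None`, so such cases would still need manual review.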

These functions will make manual data entry easier

In [106]:
def correct(incorr, corr, which='start'):
    # Store the corrected name, then warn if it is not a known hg19 cytoband
    corr_cb.loc[incorr, 'corr ' + which] = corr
    if corr not in set_hgcb:
        print('Wrong input:', corr, '\nTry again')
    
def corrs(incorr, corr):
    correct(incorr, corr, which='start')
    
def corre(incorr, corr):
    correct(incorr, corr, which='end')

Consult `/big/data/refgenome/GRCh37/cytoBand.tsv` for the sub-cytoband names used in the following entries.

In [107]:
# I hate manual data entry
corrs(('20', 'q11'), ('20', 'q11.1'))
corre(('20', 'q11'), ('20', 'q11.23'))
corrs(('15', 'q12.1'), ('15', 'q12'))
corre(('15', 'q12.1'), ('15', 'q12'))
corrs(('8', 'q11.2'), ('8', 'q11.21'))
corre(('8', 'q11.2'), ('8', 'q11.23'))
corrs(('16', 'p11'), ('16', 'p11.1'))
corre(('16', 'p11'), ('16', 'p11.2'))
corrs(('6', 'q24'), ('6', 'q24.1'))
corre(('6', 'q24'), ('6', 'q24.3'))
corrs(('22', 'q11.2'), ('22', 'q11.21'))
corre(('22', 'q11.2'), ('22', 'q11.23'))
corrs(('10', 'q23'), ('10', 'q23.1'))
corre(('10', 'q23'), ('10', 'q23.33'))
corrs(('6', 'p13.11'), ('6', 'p13.11'))
corre(('6', 'p13.11'), ('6', 'p13.11'))
corrs(('13', 'q31'), ('13', 'q31.1'))
corre(('13', 'q31'), ('13', 'q31.3'))
corrs(('16', 'p12'), ('16', 'p12.1'))
corre(('16', 'p12'), ('16', 'p12.3'))
corrs(('17', 'q21.3'), ('17', 'q21.31'))
corre(('17', 'q21.3'), ('17', 'q21.33'))
corrs(('3', 'p21'), ('3', 'p21.1'))
corre(('3', 'p21'), ('3', 'p21.33'))
corrs(('3', 'p263'), ('3', 'p26.3'))
corre(('3', 'p263'), ('3', 'p26.3'))
corrs(('15', 'q11'), ('15', 'q11.1'))
corre(('15', 'q11'), ('15', 'q11.2'))
corrs(('7', 'p36.3'), ('7', 'p36.3'))
corre(('7', 'p36.3'), ('7', 'p36.3'))
corrs(('22', 'q11'), ('22', 'q11.1'))
corre(('22', 'q11'), ('22', 'q11.23'))
corrs(('16', 'p13.1'), ('16', 'p13.11'))
corre(('16', 'p13.1'), ('16', 'p13.13'))
corrs(('7', 'q11.2'), ('7', 'q11.21'))
corre(('7', 'q11.2'), ('7', 'q11.23'))
corrs(('6', 'q22'), ('6', 'q22.1'))
corre(('6', 'q22'), ('6', 'q22.33'))
corrs(('21', 'q22'), ('21', 'q22.11'))
corre(('21', 'q22'), ('21', 'q22.13'))

Wrong input: ('6', 'p13.11') 
Try again
Wrong input: ('6', 'p13.11') 
Try again
Wrong input: ('7', 'p36.3') 
Try again
Wrong input: ('7', 'p36.3') 
Try again


The cytobands flagged by `Wrong input`, `('6', 'p13.11')` and `('7', 'p36.3')`, are absent from `hgcb` and are probably typos made by the creators of `/big/results/bsm/2020-07-24-szdb/downloaded/CNV.txt`, so I remove them from `CNV`:

In [108]:
l = list(zip(CNV['chromosome'], CNV['start cytoband']))
print(l.index(('6', 'p13.11')), l.index(('7', 'p36.3')))

748 943


In [110]:
ix_to_drop = CNV.index[[748, 943]]

In [111]:
CNV.drop(index=ix_to_drop, inplace=True)
CNV.index[[748, 943]]

Int64Index([760, 959], dtype='int64')
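
Note that `list.index` returns only the first matching position, so the positional approach above removes one row per typo. If a flagged name could occur in several rows, a boolean mask would remove all of them. Here is a minimal sketch on a toy frame; `toy`, `bad`, and `cleaned` are hypothetical names and the rows are made up.

```python
import pandas as pd

# Made-up rows with non-contiguous index labels, as in a filtered frame
toy = pd.DataFrame({
    'chromosome': ['6', '6', '7', '22', '7'],
    'start cytoband': ['p13.11', 'q24', 'p36.3', 'q11.21', 'p36.3'],
}, index=[10, 11, 12, 13, 14])

# (chromosome, cytoband) pairs judged to be typos
bad = {('6', 'p13.11'), ('7', 'p36.3')}

# Mark every row whose pair is in `bad`, then keep the rest
keys = zip(toy['chromosome'], toy['start cytoband'])
is_bad = pd.Series([k in bad for k in keys], index=toy.index)
cleaned = toy.loc[~is_bad]
print(list(cleaned.index))
```

Because the mask is aligned on index labels, it does not depend on row positions.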

Now expand rows of `CNV` that correspond to CNVs spanning multiple cytogenetic bands.

