<a id='home'></a>

### purpose

create table of locus counts for manuscript using printouts from `04_create_datasets-wza_baypass_random.ipynb`

### notes

the counts in this notebook for baypass loci with significant linear models do not match the [original submission](https://github.com/brandonlind/offset_validation/releases/tag/preprint_release) because we excluded elevation in this round (when elevation was included, it added loci to the marker sets). with the same variables, the counts of loci with significant linear models should have been the same.

### outline

1. [full data sets](#full)
    - how many loci were in the full wza and baypass sets (+ random/pseudo random)?

2. [RONA](#rona)
    - how many loci were used in RONA?

In [1]:
from pythonimports import *

latest_commit()
sinfo(html=True)

##################################################################
Today:	July 10, 2023 - 09:54:37
python version: 3.8.5
conda env: newpy385

Current commit of pythonimports:
commit 03d76f7a992130f4b94ac34a09ad439e918d3044  
Author: Brandon Lind <lind.brandon.m@gmail.com>  
Date:   Fri Jun 9 09:42:21 2023 -0400
##################################################################



<a id='full'></a>
# 1. full data sets

[top](#home)

In [2]:
# # before revision
# text = '''fdc-baypass-real 17516
# fdi-baypass-real 12398
# jp-baypass-real 22635
# combined-baypass-real 25219
# fdc-wza-real 4886
# fdi-wza-real 11434
# jp-wza-real 4788
# wl-wza-real 8496
# combined-wza-real 14760
# fdc-baypass-random 17516
# fdi-baypass-random 12398
# jp-baypass-random 22635
# combined-baypass-random 25219
# fdc-wza-random 4886
# fdi-wza-random 11434
# jp-wza-random 4788
# wl-wza-random 8496
# combined-wza-random 14760
# fdc-baypass-pseudo_random_loci 17516
# fdi-baypass-pseudo_random_loci 12398
# jp-baypass-pseudo_random_loci 22635
# combined-baypass-pseudo_random_loci 25219
# fdc-wza-pseudo_random_loci 4886
# fdi-wza-pseudo_random_loci 11434
# jp-wza-pseudo_random_loci 4788
# wl-wza-pseudo_random_loci 8496
# combined-wza-pseudo_random_loci 14760'''

# after revision
# text from 04_create_datasets-wza_baypass_random.ipynb cell 36
text = '''fdc-baypass-real 17516
fdi-baypass-real 12398
jp-baypass-real 22635
combined-baypass-real 25219
fdc-wza-real 3770
fdi-wza-real 1787
jp-wza-real 8564
wl-wza-real 79
combined-wza-real 4810
fdc-baypass-random 17516
fdi-baypass-random 12398
jp-baypass-random 22635
combined-baypass-random 25219
fdc-wza-random 3770
fdi-wza-random 1787
jp-wza-random 8564
wl-wza-random 79
combined-wza-random 4810'''


locus_counts = wrap_defaultdict(None, 3)
for line in text.split('\n'):
    dataset,count = line.split()
    spp,method,setname = dataset.split('-')
    locus_counts[spp][method][setname] = count

spp,method,setname,count

('combined', 'wza', 'random', '4810')
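
The parsing step above can be reproduced without `pythonimports`. The following is a minimal sketch, assuming `wrap_defaultdict(None, 3)` behaves like a three-level nested dictionary (its implementation is not shown in this notebook). The names `parse_counts` and `example_text` are hypothetical, and the counts are cast to `int`, which the cell above does not do.

```python
from collections import defaultdict

# two lines copied from the `text` block above
example_text = '''fdc-baypass-real 17516
fdc-wza-real 3770'''


def parse_counts(text):
    """Parse 'spp-method-setname count' lines into counts[spp][method][setname]."""
    counts = defaultdict(lambda: defaultdict(dict))
    for line in text.strip().split('\n'):
        dataset, count = line.split()
        spp, method, setname = dataset.split('-')
        # cast to int so the counts can be summed or compared numerically
        counts[spp][method][setname] = int(count)
    return counts


counts = parse_counts(example_text)
print(counts['fdc']['wza']['real'])
```

Set names such as `pseudo_random_loci` use underscores, so splitting the dataset label on `-` still yields exactly three fields.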

In [3]:
for spp,_dict in locus_counts.items():
    print(ColorText(spp).bold())
    display(pd.DataFrame(_dict))

fdc

| | baypass | wza |
|---|---|---|
| real | 17516 | 3770 |
| random | 17516 | 3770 |


fdi

| | baypass | wza |
|---|---|---|
| real | 12398 | 1787 |
| random | 12398 | 1787 |


jp

| | baypass | wza |
|---|---|---|
| real | 22635 | 8564 |
| random | 22635 | 8564 |


combined

| | baypass | wza |
|---|---|---|
| real | 25219 | 4810 |
| random | 25219 | 4810 |


wl

| | wza |
|---|---|
| random | 79 |
| real | 79 |


<a id='rona'></a>
# 2. RONA

[top](#home)

In [4]:
# get files that have loci IDs among sets created in ../04_create_datasets-wza_baypass_random.ipynb#outliers
training_dir = '/data/projects/pool_seq/phenotypic_data/offset_misc_files/training/training_files'
files = fs(training_dir, 'full', endswith='.txt', exclude='envdata')
len(files)

18

In [5]:
# get the loci that were assigned to testing sets in ../04_create_datasets-wza_baypass_random.ipynb#outliers
    # ie union of random/baypass/wza
spploci = defaultdict(list)
grouploci = wrap_defaultdict(None, 3)
for f in files:
    # drop the extension with splitext; str.rstrip('.txt') strips characters, not the suffix
    spp, method, setname, kfold = op.splitext(op.basename(f))[0].split('-')
    
    df = pd.read_table(f, index_col='index', nrows=1)
    # the next line handles files from a previous run where I had accidentally written the txt with multiple index columns
        # (can probably be deprecated, but there is no harm in leaving it)
    loci = [locus for locus in df.columns if 'Unnamed' not in locus and 'level' not in locus]
    
    print(spp, method, setname, len(loci))
    grouploci[spp][method][setname] = loci
    spploci[spp].extend(loci)
    
print('\n')

for spp, loci in spploci.items():
    spploci[spp] = uni(loci)
    print(spp, len(spploci[spp]))

combined baypass random 25219
combined baypass real 25219
combined wza random 4810
combined wza real 4810
fdc baypass random 17516
fdc baypass real 17516
fdc wza random 3770
fdc wza real 3770
fdi baypass random 12398
fdi baypass real 12398
fdi wza random 1787
fdi wza real 1787
jp baypass random 22635
jp baypass real 22635
jp wza random 8564
jp wza real 8564
wl wza random 79
wl wza real 79


combined 57234
fdc 40448
fdi 27540
jp 55362
wl 158
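
The loop above recovers `spp`, `method`, `setname`, and `kfold` from each file name once the `.txt` extension is dropped. A minimal sketch of a suffix-safe way to do that parsing is below. The helper `parse_training_name` and the example path are hypothetical, and the four-field pattern is assumed from the unpacking in the cell above.

```python
import os.path as op


def parse_training_name(path):
    """Split a 'spp-method-setname-kfold.txt' path into its four fields."""
    # splitext removes only the final extension
    stem, _ext = op.splitext(op.basename(path))
    spp, method, setname, kfold = stem.split('-')
    return spp, method, setname, kfold


# hypothetical path following the assumed naming pattern
fields = parse_training_name('/some/dir/fdc-wza-real-full.txt')
print(fields)

# str.rstrip('.txt') removes any trailing '.', 't', or 'x' characters,
# so a stem that ends in one of those characters would be truncated
truncated = 'a-b-c-text.txt'.rstrip('.txt')
print(truncated)
```

Python 3.9 and later also offer `str.removesuffix`, but `op.splitext` works on the Python 3.8.5 environment reported at the top of this notebook.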


In [6]:
ronadir = '/data/projects/pool_seq/phenotypic_data/offset_misc_files/results/rona'

In [7]:
gettimestamp(op.join(ronadir, 'linear_models.pkl'))

'Mon Jun 26 10:10:03 2023'

In [8]:
# retrieve loci with significant linear models as in 09_RONA.ipynb cell 16
results = pklload(op.join(ronadir, 'linear_models.pkl'))

# determine which of the loci had pvals <= 0.05
keep = wrap_defaultdict(list, 2)
for spp, envdict in results.items():
    for env, locusdict in pbar(envdict.items(), desc=spp):
        for locus, (slope, intercept, pval) in locusdict.items():
            if pval <= 0.05:
                keep[spp][env].append(locus)


jp: 100%|███████████████| 19/19 [00:00<00:00, 59.65it/s]
fdi: 100%|███████████████| 19/19 [00:00<00:00, 138.32it/s]
fdc: 100%|███████████████| 19/19 [00:00<00:00, 90.70it/s]
combined: 100%|███████████████| 19/19 [00:00<00:00, 44.67it/s]


In [9]:
keep['jp']['MAP'][0]

'>super81-429557'

In [10]:
keepers = {}
for spp, envdict in keep.items():
    keepers[spp] = uni(flatten(envdict.values()))
    print(spp, len(keepers[spp]))

jp 39852
fdi 20071
fdc 30827
combined 53836
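
The two steps above (keep loci with `pval <= 0.05` per environment, then pool the unique loci across environments per species) can be illustrated on toy data. This is only a sketch: `toy_results`, its locus names, and its values are made up for illustration, and the nested layout `results[spp][env][locus] = (slope, intercept, pval)` is assumed from the unpacking in the loop above. Plain `set` operations stand in for `uni` and `flatten` from `pythonimports`.

```python
from collections import defaultdict

# toy stand-in for linear_models.pkl (all values are made up for illustration)
toy_results = {
    'jp': {
        'MAP': {'locus_a': (0.8, 0.1, 0.001), 'locus_b': (0.0, 0.5, 0.60)},
        'MAT': {'locus_a': (0.2, 0.3, 0.20), 'locus_c': (-0.4, 0.9, 0.05)},
    }
}


def significant_loci(results, alpha=0.05):
    """Return keep[spp][env] = loci whose linear model has pval <= alpha."""
    keep = defaultdict(lambda: defaultdict(list))
    for spp, envdict in results.items():
        for env, locusdict in envdict.items():
            for locus, (_slope, _intercept, pval) in locusdict.items():
                if pval <= alpha:
                    keep[spp][env].append(locus)
    return keep


keep_toy = significant_loci(toy_results)

# pool across environments: a locus counts once per species,
# no matter how many environments it is significant for
keepers_toy = {
    spp: sorted(set().union(*envdict.values()))
    for spp, envdict in keep_toy.items()
}
print(keepers_toy)
```

Because the threshold is inclusive, a locus with a p-value of exactly 0.05 is retained, matching the `pval <= 0.05` test in the cell above.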


In [11]:
# how many unique loci with sig linear models for each locus set?
rona_table = wrap_defaultdict(None, 2)
for (spp, method, setname), loci in unwrap_dictionary(grouploci):
    if spp != 'wl':
        interloci = set(loci).intersection(keepers[spp])
        print(spp, method, setname, len(loci), len(interloci))
        rona_table[spp][f'{method}-{setname}'] = len(interloci)

combined baypass random 25219 22857
combined baypass real 25219 24687
combined wza random 4810 4337
combined wza real 4810 4756
fdc baypass random 17516 9684
fdc baypass real 17516 17433
fdc wza random 3770 2050
fdc wza real 3770 3766
fdi baypass random 12398 5973
fdi baypass real 12398 12262
fdi wza random 1787 873
fdi wza real 1787 1787
jp baypass random 22635 11383
jp baypass real 22635 22570
jp wza random 8564 4281
jp wza real 8564 8563
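
To make the printout above easier to compare across sets, the share of each locus set that has at least one significant linear model can be computed directly. The sketch below copies four rows from the printout. The helper `percent_retained` and the dictionary `set_counts` are hypothetical names introduced only for this illustration.

```python
# (spp, method, setname): (loci in set, loci with a significant linear model)
# values copied from the printout above
set_counts = {
    ('fdc', 'baypass', 'random'): (17516, 9684),
    ('fdc', 'baypass', 'real'): (17516, 17433),
    ('fdi', 'wza', 'random'): (1787, 873),
    ('fdi', 'wza', 'real'): (1787, 1787),
}


def percent_retained(total, kept):
    """Percentage of a locus set with at least one significant linear model."""
    return 100 * kept / total


for (spp, method, setname), (total, kept) in set_counts.items():
    print(f'{spp} {method} {setname}: {percent_retained(total, kept):.1f}%')
```

For these rows, the real sets retain nearly all of their loci, while the random sets retain roughly half.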


In [12]:
for spp,_dict in rona_table.items():
    # the values are scalars, so a single-element index gives one row per species
    df = pd.DataFrame(_dict, index=[spp])
    print(spp)
    display(df)


combined

| | baypass-random | baypass-real | wza-random | wza-real |
|---|---|---|---|---|
| combined | 22857 | 24687 | 4337 | 4756 |


fdc

| | baypass-random | baypass-real | wza-random | wza-real |
|---|---|---|---|---|
| fdc | 9684 | 17433 | 2050 | 3766 |


fdi

| | baypass-random | baypass-real | wza-random | wza-real |
|---|---|---|---|---|
| fdi | 5973 | 12262 | 873 | 1787 |


jp

| | baypass-random | baypass-real | wza-random | wza-real |
|---|---|---|---|---|
| jp | 11383 | 22570 | 4281 | 8563 |
