
# Generating and evaluating validation
Adam Klie<br>
12/08/2019<br>
Script to generate validation sets and then evaluate predictions on them

## Set-up

### Import necessary packages

In [1]:
%matplotlib inline
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Read in key-value pairs and create pandas dataframe

In [3]:
# SRA BioSample key-value pairs
SRS_dir = "../data/sra/allSRS_05_15_2018.pickle"
allSRS = pd.read_pickle(SRS_dir)

In [4]:
SRS_df = pd.DataFrame(allSRS).reset_index()
SRS_df.columns = ['srs', 'attribute', 'value']
SRS_df = SRS_df.set_index('srs')
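
The reshaping above can be illustrated on toy data. This is only a sketch: it assumes the pickle holds a pandas Series with a two-level index (sample accession, attribute name), which is consistent with `reset_index()` producing three columns. The accessions and values below are invented.

```python
import pandas as pd

# Hypothetical stand-in for the pickled object: a Series indexed by
# (sample accession, attribute name). All values are made up.
toy_srs = pd.Series(
    ["RNA-seq of mouse liver tissue from aged animals", "Mus musculus", "liver",
     "HeLa cells", "Homo sapiens"],
    index=pd.MultiIndex.from_tuples([
        ("SRS000001", "TITLE"), ("SRS000001", "species"), ("SRS000001", "tissue"),
        ("SRS000002", "TITLE"), ("SRS000002", "species"),
    ]),
)

# Same three steps as in the cell above: flatten, rename, re-index by sample.
toy_df = pd.DataFrame(toy_srs).reset_index()
toy_df.columns = ["srs", "attribute", "value"]
toy_df = toy_df.set_index("srs")

print(toy_df)
```

The result is a long-format table with one row per (sample, attribute) pair, so a single accession appears on several rows of the index.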

### Keep only samples with a TITLE longer than certain length and have valid data

In [5]:
title_len = 5 # min length of titles to predict on

In [6]:
SRS_df['word_count'] = (SRS_df['value'].str.count(' ') + 1)
validation_srs = SRS_df[(SRS_df['attribute'].isin(['TITLE'])) & (SRS_df['word_count'] >= title_len)].index
validation_all = SRS_df.loc[validation_srs].sample(validation_srs.shape[0])
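
A minimal sketch of the title-length filter on invented data may help show what the selection returns. The frame layout mirrors `SRS_df`; the names `toy_df`, `min_words`, and `kept_rows` are hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "attribute": ["TITLE", "tissue", "TITLE", "tissue"],
        "value": ["RNA-seq of mouse liver tissue", "liver", "HeLa cells", "cervix"],
    },
    index=pd.Index(["SRS000001", "SRS000001", "SRS000002", "SRS000002"], name="srs"),
)

min_words = 5
# Word count is approximated as (number of spaces + 1).
toy_df["word_count"] = toy_df["value"].str.count(" ") + 1
is_long_title = (toy_df["attribute"] == "TITLE") & (toy_df["word_count"] >= min_words)
kept_srs = toy_df[is_long_title].index

# .loc with the kept accessions returns every attribute row of those samples,
# not only the TITLE rows.
kept_rows = toy_df.loc[kept_srs]
print(kept_rows)
```

Because the count is based on single spaces, consecutive spaces in a title inflate the word count.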


In [7]:
# Filter out any samples with non-usable values.
# Note: str.contains does a case-insensitive substring search, so any value that
# merely contains one of these patterns (e.g. a hyphen, "no", or "na") is dropped.
filterTextList = ['not collected','not applicable','missing','n[/]?a','unknown', '-', '--', 'none', 'no']
filterTextRegex = "|".join('(?:{})'.format(myStr) for myStr in filterTextList)
filter_mask = validation_all['value'].str.contains(filterTextRegex, case=False)
validation_all = validation_all[~filter_mask]
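
To see how the placeholder filter behaves, the sketch below applies the same pattern list to a few invented values. It also shows a stricter whole-string variant using `Series.str.fullmatch` (pandas 1.1 or later); that variant is an assumption offered for comparison, not what the notebook does.

```python
import pandas as pd

placeholders = ["not collected", "not applicable", "missing", "n[/]?a",
                "unknown", "-", "--", "none", "no"]
pattern = "|".join("(?:{})".format(p) for p in placeholders)

values = pd.Series(["N/A", "unknown", "normal liver", "RNA-seq", "brain", "female"])

# Substring search, as in the cell above.
substring_hit = values.str.contains(pattern, case=False)

# Stricter alternative (an assumption, not the notebook's behavior):
# flag a value only when the whole string is a placeholder.
whole_hit = values.str.fullmatch(pattern, case=False)

print(pd.DataFrame({"value": values, "contains": substring_hit, "fullmatch": whole_hit}))
```

With the substring search, "normal liver" and "RNA-seq" are flagged along with the true placeholders, so the choice between the two modes changes how many samples survive the filter.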

### Select model type

In [8]:
model_iter = '11_class'
save_dir = '../results/validation/{model}'.format(model=model_iter)

In [9]:
grouping = pd.read_csv('../results/embedding/{model}/entity_merging.csv'.format(model=model_iter), index_col=0)

In [10]:
groups = grouping[grouping["I"] == 0][["attribute", "GroupName"]]

## Process TITLEs into validation sets

### Loop through each class to generate validation sets

In [11]:
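for _, group in groups.iterrows():
    srs_class = group["attribute"]        # attribute whose values are sampled
    predicted_class = group["GroupName"]  # merged class the attribute belongs to
    # All attributes merged into this class (only used by the commented-out alternative below)
    sub_groups = grouping[grouping["GroupName"] == predicted_class]["attribute"].values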
    
    # Get a dataframe with the class values to try to predict for this specific class
    tmp_df = validation_all[validation_all['attribute'] == srs_class]
    #tmp_df = validation_all[validation_all['attribute'].isin(sub_groups)]
    
    # Cap each distinct value at nDupTextMax rows, then draw at most numSamples rows
    nDupTextMax = 10  # number of duplicate values allowed
    numSamples = 1000  # number of samples to evaluate for a given class
    capped_df = tmp_df.groupby(['value']).head(n = nDupTextMax)
    total_samples = capped_df.shape[0]
    if total_samples == 0:
        continue  # nothing to validate for this class
    class_validation = capped_df.sample(min(numSamples, total_samples))

    # Get the TITLES for this validation set
    validation_sample_ids = class_validation.index
    validation_samples = SRS_df.loc[validation_sample_ids]
    validation_titles = validation_samples[
        validation_samples['attribute'].isin(['TITLE'])].reset_index().set_index(['srs', 'attribute'])
    validation_set = validation_titles['value']  # get a series object compatible with prediction script
    
    # Make the class name safe to use in a file name
    predicted_class = predicted_class.replace('/', '_').replace(' ', '_')
        
    # Save as pickle objects
    class_validation.to_pickle('{dir}/{myclass}_validation_values.pickle'.format(dir=save_dir, 
                                                                                 myclass=predicted_class))
    validation_set.to_pickle('{dir}/{myclass}_validation_set.pickle'.format(dir=save_dir, 
                                                                            myclass=predicted_class))
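
The duplicate cap and sampling step inside the loop can be sketched on a small invented frame. The names `max_dups` and `max_samples` are hypothetical stand-ins for `nDupTextMax` and `numSamples`, and the fixed `random_state` is an addition for repeatability that the notebook itself does not set.

```python
import pandas as pd

toy = pd.DataFrame(
    {"attribute": ["tissue"] * 6,
     "value": ["liver", "liver", "liver", "brain", "brain", "lung"]},
    index=pd.Index(["SRS%06d" % i for i in range(1, 7)], name="srs"),
)

max_dups = 2      # keep at most this many rows per distinct value
max_samples = 4   # upper bound on rows drawn for the class

# groupby(...).head(n) keeps the first n rows of each group, in original order.
capped = toy.groupby("value").head(max_dups)   # 2 liver + 2 brain + 1 lung
n_draw = min(max_samples, len(capped))
drawn = capped.sample(n_draw, random_state=0)  # sampling without replacement

print(capped["value"].value_counts())
print(drawn)
```

Capping before sampling keeps very common values from dominating a class's validation set, while `min(...)` avoids asking `sample` for more rows than exist.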

## Pull out the SRS's of all validation examples to holdout from training and testing model
This section can only be run once the validation sets above have been generated.

In [12]:
validations = groups["GroupName"].values
test_SRSs = []
for valid in validations:
    valid = valid.replace('/', '_').replace(' ', '_')
    set_path = '{dir}/{myclass}_validation_set.pickle'.format(dir=save_dir, myclass=valid)
    try:
        curr_df = pd.read_pickle(set_path)
    except FileNotFoundError:
        # Classes with no usable samples were skipped above, so no file was written
        print('{path} does not exist!'.format(path=set_path))
        continue
    curr_SRSs = list(curr_df.index.get_level_values('srs'))
    print(valid, len(curr_SRSs))
    test_SRSs.extend(curr_SRSs)

Species 1000
Strain 1000
Cell_type 732
Genotype 575
Condition_Disease 126
Tissue 1000
Sex 184
Age 1000
Data_type 79
Platform 275
Protocol 26


In [13]:
test_list = set(test_SRSs)

In [14]:
with open('{dir}/validation_SRSs.txt'.format(dir=save_dir), 'w') as f:
    # One accession per line; sorted so the file is stable across runs
    f.write('\n'.join(sorted(test_list)))
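
As a rough illustration of how the holdout file might be consumed downstream, the sketch below writes a list of accessions to a temporary file, reads it back, and excludes those samples from a toy frame. The accessions, the `all_samples` frame, and the `train_pool` name are all invented; the actual training code is not part of this notebook.

```python
import os
import tempfile

import pandas as pd

holdout_ids = {"SRS000002", "SRS000004"}  # made-up accessions

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "validation_SRSs.txt")
    # Write one accession per line.
    with open(path, "w") as f:
        f.write("\n".join(sorted(holdout_ids)))

    # Read the file back, skipping any empty lines.
    with open(path) as f:
        loaded = {line.strip() for line in f if line.strip()}

all_samples = pd.DataFrame(
    {"value": ["a", "b", "c", "d"]},
    index=pd.Index(["SRS000001", "SRS000002", "SRS000003", "SRS000004"], name="srs"),
)

# Drop every row whose accession is in the holdout set.
train_pool = all_samples[~all_samples.index.isin(loaded)]
print(train_pool.index.tolist())
```

Filtering on the index with `isin` removes all rows of a held-out sample at once, which matters when a sample spans several attribute rows.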