## This is a notebook for thinking about which filters to use on the data as well as scratch work for the filtering

### General filters:

* Drop y chr
* Isoforms with fewer than 10 samples with <1 TPM and <6 reads.
* We filtered out genes where the Ensembl gene ID did not uniquely map to a single HGNC gene symbol. 
* Isoform ratio was computed using the annotated isoforms in the GENCODE V19 annotation, and undefined ratios (0/0, when none of the isoforms were expressed) were imputed from the mean ratio per isoform across individuals.
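
The isoform-ratio rule in the last bullet can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the table layout (rows = isoforms, columns = samples), the `gene_of` mapping, and the toy values are all assumptions.

```python
import numpy as np
import pandas as pd

def isoform_ratios(tpm, gene_of):
    """Isoform ratio = isoform TPM / summed TPM of all isoforms of its gene.

    tpm:     DataFrame, rows = isoforms, columns = samples (assumed layout)
    gene_of: Series mapping each isoform (same index as tpm) to its gene
    """
    # Per-sample total expression of each gene, broadcast back to isoform rows
    gene_total = tpm.groupby(gene_of).transform("sum")
    # 0/0 (gene not expressed in that sample) becomes NaN
    ratio = tpm / gene_total.replace(0, np.nan)
    # Impute undefined ratios with that isoform's mean ratio across individuals
    row_mean = ratio.mean(axis=1, skipna=True)
    return ratio.T.fillna(row_mean).T

# Toy data: gene gA has two isoforms, gene gB has one
tpm = pd.DataFrame(
    {"s1": [3.0, 1.0, 2.0], "s2": [0.0, 0.0, 4.0], "s3": [1.0, 1.0, 0.0]},
    index=["iso1", "iso2", "iso3"],
)
gene_of = pd.Series(["gA", "gA", "gB"], index=tpm.index)
ratios = isoform_ratios(tpm, gene_of)
print(ratios)
```

In the toy data, gA is not expressed in `s2`, so both of its isoforms get their across-sample mean ratio there.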

Exon expression:
* Drop y chr
* Mean expression > 0
* Fewer than 10 samples with <1 TPM and <6 reads.

Transcript ratio:
* Drop y chr
* Transcript ratio was computed using the annotated isoforms in the GENCODE V19 annotation, and undefined ratios (0/0, when none of the isoforms were expressed) were imputed from the mean ratio per isoform across individuals.
* Each gene’s least abundant isoform was excluded to avoid linear dependency between isoform ratio values. ??
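
A small sketch of the "drop the least abundant isoform" step, to make the linear-dependency point concrete. It assumes "least abundant" means lowest mean ratio across samples (mean TPM would be another reasonable reading); the function name and toy values are hypothetical.

```python
import pandas as pd

def drop_least_abundant(ratio, gene_of):
    """Drop one isoform per gene: the one with the lowest mean ratio.

    Ratios within a gene sum to 1 in every sample, so any one isoform is
    1 minus the sum of the others and adds no independent information.
    Note that a single-isoform gene loses its only isoform (its ratio is
    constant anyway).
    """
    mean_ratio = ratio.mean(axis=1)
    # Label of the lowest-mean isoform within each gene
    least = mean_ratio.groupby(gene_of).idxmin()
    return ratio.drop(index=least.values)

ratio = pd.DataFrame(
    {"s1": [0.6, 0.3, 0.1, 0.9, 0.1], "s2": [0.5, 0.3, 0.2, 0.8, 0.2]},
    index=["iso1", "iso2", "iso3", "iso4", "iso5"],
)
gene_of = pd.Series(["gA", "gA", "gA", "gB", "gB"], index=ratio.index)
kept = drop_least_abundant(ratio, gene_of)
print(kept.index.tolist())
```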

PSI:
* Drop y chr
* Minimum variance for including an exon: there is little point looking for a relationship with PSI if the exon is not variably spliced. Ended up requiring that >10% of samples have a PSI other than 0% or 100%.
* Minimum of 10 reads (across A+B+C) in at least 20% of samples
* Minimum of 5 reads per sample (?)
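
The first two PSI thresholds above could be applied roughly like this. It is a sketch under assumptions: PSI is stored as a fraction in [0, 1] rather than a percentage, `reads` holds the summed A+B+C counts, both tables are exons × samples, and the uncertain 5-reads-per-sample rule is left out.

```python
import pandas as pd

MIN_VARIABLE_FRAC = 0.10  # >10% of samples with PSI strictly between 0 and 1
MIN_READS = 10            # minimum A+B+C reads ...
MIN_READ_FRAC = 0.20      # ... in at least 20% of samples

def filter_psi(psi, reads):
    """Keep exons that are variably spliced and adequately covered.

    psi, reads: DataFrames, rows = exons, columns = samples (assumed layout)
    """
    n_samples = psi.shape[1]
    # Fraction of samples where the exon is neither fully included nor skipped
    variable = ((psi > 0) & (psi < 1)).sum(axis=1) / n_samples > MIN_VARIABLE_FRAC
    # Fraction of samples with enough supporting reads
    covered = (reads >= MIN_READS).sum(axis=1) / n_samples >= MIN_READ_FRAC
    return psi[variable & covered]

samples = ["s%d" % i for i in range(10)]
exons = ["e1", "e2", "e3"]
psi = pd.DataFrame(
    [[1.0] * 10, [0.5, 0.4] + [1.0] * 8, [0.5] * 10], index=exons, columns=samples
)
reads = pd.DataFrame(
    [[50] * 10, [12, 10] + [2] * 8, [3] * 10], index=exons, columns=samples
)
kept_psi = filter_psi(psi, reads)
print(kept_psi.index.tolist())
```

Here `e1` fails because its PSI never varies and `e3` fails on read depth, so only `e2` survives.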

STRs:
* Minimum 3 genotypes
* Minimum 5 samples per genotype
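
A stand-alone sketch of the two STR rules, useful for testing the logic without a VCF. It assumes each sample's genotype is already summarized as the sum of its two allele values (as in the scratch code further down); the function name and defaults are illustrative only.

```python
from collections import Counter

def filter_str_genotypes(gt_sums, min_genotypes=3, min_samples_per_gt=5):
    """gt_sums: one summed genotype per sample, or None if the call is missing.

    Returns the list with under-represented genotypes masked to None,
    or None if fewer than min_genotypes genotypes are well represented.
    """
    counts = Counter(g for g in gt_sums if g is not None)
    good = {g for g, n in counts.items() if n >= min_samples_per_gt}
    if len(good) < min_genotypes:
        return None
    return [g if g in good else None for g in gt_sums]

# Three genotypes with >= 5 samples each, one rare genotype, one missing call
passing = filter_str_genotypes([0] * 6 + [1] * 5 + [2] * 5 + [3] * 2 + [None])
# Only two genotypes reach 5 samples, so the locus is rejected
failing = filter_str_genotypes([0] * 10 + [1] * 5 + [2] * 2)
print(passing, failing)
```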

### Confounding factors:

How other people have corrected for these factors:

* By Mostafavi et al
>'To correct hidden confounding factors, we applied hidden covariates with prior (HCP) method' [(Mostafavi et al. 2013)](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0068141), whose parameters were selected based on an external signal relevant to regulatory relationships. 'Namely, we selected parameters that produced maximal replication of an independent set of trans-eQTLs from meta-analysis of a large collection of independent whole blood studies (Westra et al. 2013).'

* By Takata et al, who used PSI calculations for spliceQTLs in GTEx brain tissue samples. 
>'To control potential confounding factors, the following parameters were included in the analysis as covariates; gender, age of death, research institute where the samples were collected (Mount Sinai, Pennsylvania or Pittsburgh), post-mortem interval, brain pH, RNA integrity number and sequencing library batch.' [(Takata et al. 2017)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5333373/)
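
For known covariates like those listed by Takata et al., one simple option is to regress them out of the phenotype before testing. The sketch below is an assumption about how that could be done with ordinary least squares; it is not the HCP method and not necessarily what either paper did. Covariate names and simulated values are made up.

```python
import numpy as np

def residualize(y, covariates):
    """Return y with the linear effect of the covariates (plus intercept) removed.

    y:          (n_samples,) phenotype, e.g. PSI or expression for one feature
    covariates: (n_samples, n_covariates) numeric matrix; categorical
                covariates such as batch would need one-hot encoding first
    """
    X = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Simulated example: phenotype driven by two covariates plus noise
rng = np.random.default_rng(0)
age = rng.normal(size=100)
rin = rng.normal(size=100)
y = 2.0 * age - 1.5 * rin + rng.normal(size=100)
resid = residualize(y, np.column_stack([age, rin]))
print(np.corrcoef(resid, age)[0, 1])
```

The alternative is to keep the covariates as extra terms in the association model itself, which avoids testing on pre-adjusted values.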

# Below is scratch work for filtering - not important for other people.

-------------------

## GENOTYPE FILTERS: minimum count and minimum number of genotypes

In [19]:
import pandas as pd
import vcf
GTFIELD = "GB"
MINCOUNT = 3
MINGENOTYPES = 3
MINSAMPLES = 50

In [35]:
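def GetGT(gt):
    if gt is None or gt == "":
        return None
    if str(gt) in (".", "./.", ".|."):
        return None # Missing genotype
    # Return lists rather than map objects: in Python 3 a map is a one-shot
    # iterator and would be empty after the first sum()
    if "|" in gt:
        return [int(a) for a in gt.split("|")]
    elif "/" in gt:
        return [int(a) for a in gt.split("/")]
    else:
        return [int(gt), int(gt)] # haploid call, treated as homozygous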

In [3]:
def checkgt(record,s):
    # Return the GTFIELD value for sample s, or None if the sample or field is absent
    try:
        return(record.genotype(s)[GTFIELD])
    except Exception:
        return None

In [36]:
str_vcf = '/storage/szfeupe/Runs/650GTEx_estr/Filter_Merged_STRs_All_Samples_New.vcf.gz'
vcf_reader = vcf.Reader(open(str_vcf,"rb"))
vcf_reads = vcf_reader.fetch('1')
samples = vcf_reads.samples

for record in vcf_reads:
    genotypes = [GetGT(checkgt(record,s)) for s in samples]
    # Summarize each call as the sum of its two allele values
    gt_sum = [sum(gt) if gt is not None else None for gt in genotypes]
    unique_vals = pd.Series(gt_sum).value_counts(dropna=True)
    # Genotypes seen in at least MINCOUNT samples
    good_gts = unique_vals[unique_vals >= MINCOUNT].index
    if len(good_gts) < MINGENOTYPES:
        print('NOT ENOUGH GENOTYPES')
        continue
    else:
        print('THERE WERE ENOUGH GENOTYPES')
    # Mask calls whose genotype is too rare
    genotypes = [genotypes[i] if gt_sum[i] in good_gts else None for i in range(len(gt_sum))]
    if len([item for item in genotypes if item is not None]) < MINSAMPLES:
        print('NOT ENOUGH SAMPLES')
        break  # stops at the first such locus; a full filtering pass would use continue
    else:
        print('THERE WERE ENOUGH SAMPLES')


NOT ENOUGH GENOTYPES
NOT ENOUGH GENOTYPES
THERE WERE ENOUGH GENOTYPES
NOT ENOUGH SAMPLES


In [33]:
len([item for item in genotypes if item is not None])

31

In [25]:
genotypes = [GetGT(checkgt(record,s)) for s in samples]

In [28]:
gt_sum = [sum(gt) if gt is not None else None for gt in genotypes]
unique_vals = pd.Series(gt_sum).value_counts(dropna=True)
unique_vals

0.0    642
1.0      1
dtype: int64

In [34]:
my_strgt = pd.read_csv('/storage/dana/spliceSTR/genotypes_mincount/Allele_Gentype_chr1.table', sep="\t", low_memory=False)


In [36]:
my_strgt[my_strgt['start'] == 1298161]

Unnamed: 0,chrom,start,GTEX-PLZ4,GTEX-NFK9,GTEX-OHPM,GTEX-X4EO,GTEX-UTHO,GTEX-TMZS,GTEX-WY7C,GTEX-P44H,...,GTEX-1212Z,GTEX-14C39,GTEX-131XF,GTEX-111YS,GTEX-ZXES,GTEX-11WQK,GTEX-ZVP2,GTEX-Y8E4,GTEX-1GN2E,GTEX-14PJM
412,chr1,1298161,"-1,-1","-1,-1","-1,-1","NA,NA","NA,NA","-1,-1","-1,-1","-1,-1",...,"-1,-1","-1,-1","0,-1","-1,-1","-1,-1","NA,NA","-1,-1","-1,-1","-1,-1","-1,-1"


In [30]:
record.genotype('GTEX-PLZ4')

Call(sample=GTEX-PLZ4, CallData(GT=./., GB=0|0))

In [32]:
strgt = pd.read_csv('/storage/szfeupe/Runs/650GTEx_estr/Genotypes/Allele_Gentypes.table', sep="\t", low_memory=False)

In [34]:
test = ["1","2"]
sum(map(int,test))

3