Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Distributions.Empirical.create produces strange propabilities #61

Closed
Freymaurer opened this issue Mar 11, 2020 · 3 comments
Closed
Labels

Comments

@Freymaurer
Copy link
Contributor

Describe the bug
I tried using FSharp.Stats.Distributions.Empirical.create 1. dataPoints to bin my dataPoints with a bandWidth of 1. Afterward i wanted to sample from the distribution via Distributions.Empirical.random but it returned the following error every other time

> System.Collections.Generic.KeyNotFoundException: An index satisfying the predicate was not found in the collection.
   at Microsoft.FSharp.Collections.SeqModule.Find[T](FSharpFunc`2 predicate, IEnumerable`1 source)
   at <StartupCode$FSI_0249>.$FSI_0249.main@() in D:\Freym\Source\Repos\Freymaurer\Jupyter_PraktikumBiotech\Scripts\JP06_Retention_time_and_scan_time.fsx:line 86
Stopped due to error

To Reproduce
Steps to reproduce the behavior:

let file = "Chlamy_JGI5_5(Cp_Mp).fasta" // can be found in BioFSharp.Mz

let sequences = 
    filePath
    |> FastA.fromFile BioArray.ofAminoAcidString
    |> Seq.toArray

let digestedProteins =
    sequences
    |> Array.mapi (fun i fastAItem ->
        Digestion.BioArray.digest Digestion.Table.Trypsin i fastAItem.Sequence
        |> Digestion.BioArray.concernMissCleavages 0 0
    )
    |> Array.concat

let digestedPeptideMasses =
    digestedProteins
    |> Array.choose (fun peptide ->
        let mass = BioSeq.toMonoisotopicMassWith (BioItem.monoisoMass ModificationInfo.Table.H2O) peptide.PepSequence
        if mass < 3000. then Some mass else None
    )

let empHis = FSharp.Stats.Distributions.Empirical.create 1. digestedPeptideMasses

Expected behavior
Should produce a distribution, from which one can randomly sample according to the propabilities.

Additional context
I already looked in the FSharp.Stats.Distribution.Empirical module and think that the Distributions.Empirical.random function is just fine, but that the FSharp.Stats.Distributions.Empirical.create function is not correct when calculating the area subfunction. It seems unintuitive.

For my example above the total of all summed up propabilities was float = 0.6921621451, so everytime the random generated target subfunction of Distributions.Empirical.random produced an value above the ~0.69 it returned an error.

@Freymaurer Freymaurer added the bug label Mar 11, 2020
@bvenn
Copy link
Member

bvenn commented May 10, 2020

Thanks for the report.

You are right, the creation of the probability mass function (PMF) is incorrect.
Since the PMF must sum up to one, all bin sizes are divided by the total count of observations. This area calculation seems to be faulty.

If empty bins are present in the data, the area is overestimated (acc + (abs (x1 - x0)) * ((y0 + y1) / 2.)).

It is sufficient to divide the bin counts by the total number of items. A fix is prepared and will be pushed soon.

@bvenn
Copy link
Member

bvenn commented May 10, 2020

Additional comment:

There are functions to normalize the Map generated by Empirical.create. I think the normalization should be incorporated into the create function to avoid confusion about probabilities that do not reach 100%.

bvenn pushed a commit that referenced this issue May 11, 2020
@bvenn
Copy link
Member

bvenn commented May 11, 2020

closed by 75f5262

@bvenn bvenn closed this as completed May 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants