Should Empirical PMF functions be usable with generic keys? #245

Closed
HarryMcCarney opened this issue Feb 20, 2023 · 4 comments

@HarryMcCarney
Member

I can create this:

#r "nuget: fsharp.stats"
open FSharp.Stats
open FSharp.Stats.Distributions

let letters =
    "mississippi".ToCharArray()
    |> Array.map string
    |> Array.toList
    |> Frequency.createGeneric
    |> Empirical.ofHistogram

But then I can't get the probability for a specific value, because all functions except ofHistogram take a float as the map key.
I can work around this by querying the map directly with letters["i"], but then letters["z"] raises an error instead of returning zero.

I would prefer to use probabilityAt, but it expects a Map<float,float>. Should this function be generic, or have I missed something?
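
A minimal sketch of a safer version of the direct-lookup workaround, using only the F# core library. The map is built by hand as a stand-in for the one produced above, and `probabilityOrZero` is a hypothetical helper name, not an FSharp.Stats function:

```fsharp
// Stand-in for the map from above: relative frequencies of the letters in "mississippi".
let text = "mississippi"

let letterProbabilities : Map<string, float> =
    text
    |> Seq.map string
    |> Seq.countBy id
    |> Seq.map (fun (key, count) -> key, float count / float text.Length)
    |> Map.ofSeq

// Hypothetical helper (not part of FSharp.Stats):
// keys that are missing from the map yield 0.0 instead of raising an exception.
let probabilityOrZero key pmf =
    pmf |> Map.tryFind key |> Option.defaultValue 0.0

printfn "%f" (probabilityOrZero "i" letterProbabilities) // value is 4.0 / 11.0
printfn "%f" (probabilityOrZero "z" letterProbabilities) // value is 0.0
```

Because the helper only relies on `Map.tryFind`, it works for any key type that supports comparison.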

Thanks

@bvenn
Member

bvenn commented Feb 21, 2023

I will have a look at this. Maybe there is a performance advantage if you explicitly restrict it to float. If so, there should be additional "generic" functions. I'll test it and make the functions usable for "non-float" lists as well.

The fact that you cannot look up letters that do not occur in your text is hard to work around inside the module, because many different alphabets could be considered (upper case, lower case, äüö, special characters, numbers). I assume you have to define your desired set of characters separately:

let myAlphabet = 
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ".ToCharArray()

With this at hand, you can use the alphabet as a template and replace only the counts of the characters that occur in your text.

#r "nuget: FSharp.Stats"
#r "nuget: Plotly.NET"

open FSharp.Stats
open FSharp.Stats.Distributions
open Plotly.NET

let myAlphabet = 
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ".ToCharArray()

let myTextMap = 
    "mississippi".ToCharArray()
    |> List.ofArray
    |> Frequency.createGeneric

let myFinalMap = 
    // use your own defined alphabet to include the desired set of characters
    myAlphabet
    |> Array.map (fun key -> 
        // if the text contains the current character, its value is used
        if myTextMap.ContainsKey key then 
            key,myTextMap.[key] 
        // if the text does NOT contain the current character, set its count to 0
        else 
            key,0
        )
    |> Map.ofArray

// access the character counts
myFinalMap.['z'] // 0
myFinalMap.['s'] // 4

// visualization
myFinalMap
|> Map.toArray
|> Chart.Column
|> Chart.withSize (1000.,500.) // quick way to depict all characters
|> Chart.show

[image: column chart of the character counts in myFinalMap]

I'll comment if I have any news.

bvenn added a commit that referenced this issue Feb 21, 2023
@bvenn
Member

bvenn commented Feb 21, 2023

I fixed the issue, tested the Empirical.create function, and added a convenience layer for nominal/categorical inputs.

32fa0c2

  • fix floating point error when handling floats on bin border value 0.3 when binwidth is 0.1

060f696

  • make Frequency functions generic
  • make Empirical functions generic
  • add Empirical.create for nominal data
  • add convenience layer

7c1242d

  • add tests

still missing

  • add documentation
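
The bin-border problem named in the first commit can be reproduced without the library. The sketch below is only an illustration of the rounding issue; both function names are hypothetical, and the rounding-based mitigation is an assumption, not necessarily what the commit does:

```fsharp
// Naive bin index: which bin of width `binWidth` does `value` fall into?
let naiveBinIndex (binWidth: float) (value: float) =
    floor (value / binWidth) |> int

// Hypothetical mitigation (an assumption, not taken from the commit):
// round the quotient to a fixed number of decimals before flooring.
let roundedBinIndex (binWidth: float) (value: float) =
    System.Math.Round(value / binWidth, 10) |> floor |> int

// 0.3 / 0.1 evaluates to 2.9999999999999996 in double precision,
// so the naive version puts the border value 0.3 into bin 2 instead of bin 3.
printfn "%d" (naiveBinIndex 0.1 0.3)   // prints 2
printfn "%d" (roundedBinIndex 0.1 0.3) // prints 3
```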

Usage

You can build the binaries yourself or wait for the next FSharp.Stats release.
(Update: You can use #r "nuget: FSharp.Stats, 0.4.12-preview.1")

Define the set of characters to search for:

#r @"<PathToFSharp.Stats>\FSharp.Stats\src\FSharp.Stats\bin\Release\netstandard2.0\FSharp.Stats.dll"
#r "nuget: Plotly.NET"

open FSharp.Stats
open FSharp.Stats.Distributions
open Plotly.NET

let letters = "Mississippi"

// Define your set of characters that should be checked for
// Any character that is not present in these sets is ignored
let myAlphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" |> Set.ofSeq
let mySmallAlphabet = "abcdefghijklmnopqrstuvwxyz" |> Set.ofSeq

These alphabets can be used to create the probability maps.

//takes the characters and determines their probabilities without considering non-existing characters
let myFrequencies0 = EmpiricalDistribution.createNominal() letters

//takes upper and lower case characters and determines their probability
let myFrequencies1 = EmpiricalDistribution.createNominal(Template=myAlphabet) letters

//takes only lower case characters and determines their probability
let myFrequencies2 = EmpiricalDistribution.createNominal(Template=mySmallAlphabet) letters

The additional Transform argument is useful when it does not matter whether a character is lower case or upper case:

//converts all characters to lower case characters and determines their probability
let myFrequencies3 = EmpiricalDistribution.createNominal(Template=mySmallAlphabet,Transform=System.Char.ToLower) letters

// check the probability of a character that does not occur in the text but is within the search scope (Template alphabet)
myFrequencies3.['z'] //returns 0.0
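
To make the roles of Template and Transform concrete, here is a rough plain-F# sketch of the behavior described in this comment. It is not the FSharp.Stats implementation: `nominalPmfSketch` is a hypothetical name, and it assumes that the transform is applied first, that characters outside the template are ignored, and that probabilities are normalized over the remaining characters.

```fsharp
// Hypothetical re-creation of the described behavior (not the library code).
let nominalPmfSketch (template: Set<'a>) (transform: 'a -> 'a) (data: seq<'a>) =
    // Assumption: transform first, then drop everything outside the template.
    let kept =
        data
        |> Seq.map transform
        |> Seq.filter (fun item -> Set.contains item template)
        |> Seq.toList
    let total = float kept.Length
    let counts = kept |> List.countBy id |> Map.ofList
    // Every template entry gets a key; entries that never occur get probability 0.0.
    template
    |> Seq.map (fun key ->
        let count = counts |> Map.tryFind key |> Option.defaultValue 0
        key, (if total > 0.0 then float count / total else 0.0))
    |> Map.ofSeq

let sketchPmf =
    nominalPmfSketch
        (Set.ofSeq "abcdefghijklmnopqrstuvwxyz")
        (fun c -> System.Char.ToLower c)
        "Mississippi"

printfn "%f" sketchPmf.['z'] // value is 0.0
printfn "%f" sketchPmf.['s'] // value is 4.0 / 11.0
```

Starting from the template rather than from the observed data is what guarantees a key for every character in the search scope.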

Visualization

[
    Chart.Column(myFrequencies0 |> Map.toArray,"noTemplate") |> Chart.withYAxisStyle "probability"
    Chart.Column(myFrequencies1 |> Map.toArray,"bigAlphabet") |> Chart.withYAxisStyle "probability"
    Chart.Column(myFrequencies2 |> Map.toArray,"smallAlphabet") |> Chart.withYAxisStyle "probability"
    Chart.Column(myFrequencies3 |> Map.toArray,"toLower + smallAlphabet") |> Chart.withYAxisStyle "probability"
]
|> Chart.Grid(4,1)
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withTitle letters
|> Chart.withSize(1000.,900.)
|> Chart.show

[image: grid of four column charts, one per probability map]

bvenn added a commit that referenced this issue Feb 21, 2023
@bvenn bvenn closed this as completed Feb 21, 2023
@bvenn
Member

bvenn commented Feb 21, 2023

A prerelease is published and can be used:

#r "nuget: FSharp.Stats, 0.4.12-preview.1"

The documentation that contains the same information as this thread can be found here.

@HarryMcCarney
Member Author

Thanks Benedikt, nice solution!
