[Feature Request] `Frequency` merge operations #263

bvenn · 2023-04-26T11:08:44Z

Merge operations for Maps

Data can be sorted into bins of predefined width using the Frequency or EmpiricalDistribution module. If two datasets are binned and should be merged afterwards, several merging strategies are possible. A simple merge of freqA and freqB is straightforward: the values of keys that are present in both freqA and freqB are replaced with the values from freqB.

let a =
    [("k1",1);("k2",3)]
    |> Map.ofList

let b =
    [("k2",1);("k3",4)]
    |> Map.ofList

merge a b

results in the following combination with ("k2",3) from a being replaced by ("k2",1) from b:

val it: Map<string,int> = map [("k1", 1); ("k2", 1); ("k3", 4)]
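
A minimal sketch of such a replacing merge, assuming a hypothetical helper named `merge` (name and implementation are illustrative and not necessarily what the library ships):

```fsharp
// Hypothetical helper: folds every entry of b into a.
// Map.add overwrites an existing key, so the value from b wins on duplicates.
let merge (a: Map<'k,'v>) (b: Map<'k,'v>) : Map<'k,'v> =
    Map.fold (fun acc key value -> Map.add key value acc) a b

let a = Map.ofList [("k1",1);("k2",3)]
let b = Map.ofList [("k2",1);("k3",4)]

let merged = merge a b
// merged = map [("k1", 1); ("k2", 1); ("k3", 4)]
```

Note that the argument order matters here: `merge b a` would keep ("k2",3) instead.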

Generic formulation of merge operations

I'm in the process of adding a generic function that takes an additional function to handle duplicate keys. E.g.:

add a b

resulting in the combination of a and b with ("k2",3) from a being added to ("k2",1) from b:

val it: Map<string,int> = map [("k1", 1); ("k2", 4); ("k3", 4)]
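
One way such a generic function could look is sketched below. The name `mergeBy` and its signature are assumptions made for illustration; `f` is only applied to keys that occur in both maps, and all other entries are copied unchanged.

```fsharp
// Hypothetical generic merge: f receives the value from a and the value from b
// for every key present in both maps.
let mergeBy (f: 'v -> 'v -> 'v) (a: Map<'k,'v>) (b: Map<'k,'v>) : Map<'k,'v> =
    let folder acc key valueB =
        match Map.tryFind key acc with
        | Some valueA -> Map.add key (f valueA valueB) acc
        | None -> Map.add key valueB acc
    Map.fold folder a b

// add is then the generic merge specialised to (+)
let add a b = mergeBy (+) a b

let a = Map.ofList [("k1",1);("k2",3)]
let b = Map.ofList [("k2",1);("k3",4)]

let summed = add a b
// summed = map [("k1", 1); ("k2", 4); ("k3", 4)]
```

With this formulation the simple replacing merge is the special case `mergeBy (fun _ valueB -> valueB)`.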

While this is trivial, I'm not sure how to handle subtraction. Should subtract a b result in:

a) val it: Map<string,int> = map [("k1", 1); ("k2", 2); ("k3", 4)]
- values from b are subtracted from the corresponding counts in a if the key is present in both maps
- values of a whose keys are not present in b are untouched (and values of b whose keys are not present in a are carried over unchanged)
b) val it: Map<string,int> = map [("k1", 1); ("k2", 2); ("k3", -4)]
- values from b are subtracted from the counts in a, even for keys that are not present in a

The latter option (b) makes no sense to me since frequency counts should not be negative, but I cannot think of applications in which the result of (a) makes sense either. Maybe subtract is not the best function to start with, because in this post they implemented (a) with addition and multiplication examples. Especially for addition, both (a) and (b) would give the correct result, and I think it is intuitive to apply the function only to the values of keys that are present in both maps.
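
To make the two options concrete, here is a rough sketch of both. All function names are hypothetical; option (b) is modelled by treating a missing key as a count of 0, which is an assumption about what (b) is meant to do.

```fsharp
// Option (a): apply f only where a key exists in both maps, copy everything else.
let mergeBy (f: 'v -> 'v -> 'v) (a: Map<'k,'v>) (b: Map<'k,'v>) : Map<'k,'v> =
    let folder acc key valueB =
        match Map.tryFind key acc with
        | Some valueA -> Map.add key (f valueA valueB) acc
        | None -> Map.add key valueB acc
    Map.fold folder a b

// Option (b): apply f to every key of either map, substituting `zero` for missing values.
let mergeWithDefault (zero: 'v) (f: 'v -> 'v -> 'v) (a: Map<'k,'v>) (b: Map<'k,'v>) : Map<'k,'v> =
    let keysOf m = m |> Map.toSeq |> Seq.map fst
    Seq.append (keysOf a) (keysOf b)
    |> Seq.distinct
    |> Seq.map (fun key ->
        let valueA = defaultArg (Map.tryFind key a) zero
        let valueB = defaultArg (Map.tryFind key b) zero
        key, f valueA valueB)
    |> Map.ofSeq

let a = Map.ofList [("k1",1);("k2",3)]
let b = Map.ofList [("k2",1);("k3",4)]

let subtractA = mergeBy (-) a b
// subtractA = map [("k1", 1); ("k2", 2); ("k3", 4)]

let subtractB = mergeWithDefault 0 (-) a b
// subtractB = map [("k1", 1); ("k2", 2); ("k3", -4)]
```

For addition both functions return the same map, because adding the default 0 leaves a value unchanged. The strategies only diverge for operations such as subtraction or multiplication.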

@HarryMcCarney, do you know of use cases for subtract? Do you have any thoughts on this? I would suggest adding version (a) to Frequency as well as EmpiricalDistribution.

Additional remark: when applied to continuous data, the bandwidths must be equal so that counts from overlapping bins are not merged!

bvenn · 2023-04-27T09:31:09Z

A difficulty that I'm not sure how to tackle is the required bandwidth equality on continuous data. If you want to add the following histograms:

let a =
    [(0.1,1);(0.2,1);(0.3,1)] //bandwidth = 0.1
    |> Map.ofList

let b =
    [(0.15,1);(0.3,1)] //bandwidth = 0.15 or 0.05, nobody knows..
    |> Map.ofList

add a b
// result: [(0.1,1);(0.15,1);(0.2,1);(0.3,2)] is not valid!!

Histograms that should be merged (regardless of whether they are Frequencies or EmpiricalDistributions) must have the same bandwidth. For categorical data this is no issue!
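
The ambiguity noted for histogram b can be made concrete with a small sketch. `guessBandwidth` is a hypothetical helper that takes the smallest gap between neighbouring keys as the bandwidth; it is not part of the library.

```fsharp
// Hypothetical helper: smallest distance between neighbouring keys.
// Map.toList yields the entries sorted by key. Fails for maps with fewer than two bins.
let guessBandwidth (hist: Map<float,int>) : float =
    hist
    |> Map.toList
    |> List.map fst
    |> List.pairwise
    |> List.map (fun (lower, upper) -> upper - lower)
    |> List.min

let a = Map.ofList [(0.1,1);(0.2,1);(0.3,1)]
let b = Map.ofList [(0.15,1);(0.3,1)]

let guessA = guessBandwidth a // roughly 0.1 (subject to floating point rounding)
let guessB = guessBandwidth b // roughly 0.15
```

The guess for b is only an upper bound: a histogram with bandwidth 0.05 whose bins at 0.2 and 0.25 are empty would contain exactly the same keys, so the true bandwidth cannot be recovered from the map alone.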

Solution

(A) Introduce a Frequency/EmpiricalDistribution type that contains the frequency map as well as a bandwidth field that can be checked when merging
- downside: when dealing with categorical data this field is totally useless.
(B) Leave it as it is, properly document this behaviour, and trust the user not to merge histograms of differing bandwidth
- downside: can you make the user responsible for this issue?
(C) Check the bandwidth when merging
- this is not possible in the current form because bins can be missing in the map structure, so the bandwidth would be determined incorrectly
(D) Add parameters to the merge functions that require the user to state the bandwidth of both histograms. Nothing is done with these bandwidths except to check them and fail if they do not match. This adds otherwise irrelevant ceremony to the function call, but ensures that histograms of differing bandwidth are not merged.
- downside: for categorical data, however, this parameter would be hard to define
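
As an illustration of option (A), the following sketch wraps the map in a record that carries the bandwidth. The type `BinnedCounts`, its fields, and `addChecked` are hypothetical names; using `None` for categorical data is one possible way to soften the downside mentioned above, not something the issue has settled on.

```fsharp
// Hypothetical wrapper type: None marks categorical data, Some w a fixed bin width.
type BinnedCounts<'k when 'k : comparison> =
    { Bandwidth : float option
      Counts    : Map<'k,int> }

// Adds the counts of b to a, but only if both carry the same bandwidth.
// Bandwidths are compared exactly, which is a simplification for this sketch.
let addChecked (a: BinnedCounts<'k>) (b: BinnedCounts<'k>) : BinnedCounts<'k> =
    if a.Bandwidth <> b.Bandwidth then
        failwithf "Cannot merge: bandwidths differ (%A vs %A)." a.Bandwidth b.Bandwidth
    let folder acc key valueB =
        match Map.tryFind key acc with
        | Some valueA -> Map.add key (valueA + valueB) acc
        | None -> Map.add key valueB acc
    { a with Counts = Map.fold folder a.Counts b.Counts }

let x = { Bandwidth = Some 0.1; Counts = Map.ofList [(0.1,1);(0.2,1)] }
let y = { Bandwidth = Some 0.1; Counts = Map.ofList [(0.2,2)] }
let z = { Bandwidth = Some 0.15; Counts = Map.ofList [(0.15,1);(0.3,1)] }

let combined = addChecked x y
// combined.Counts = map [(0.1, 1); (0.2, 3)]
// addChecked x z raises an exception because the bandwidths differ
```

Option (D) would perform the same check, but with the two bandwidths passed as plain function arguments instead of being stored alongside the map.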

bvenn · 2023-04-27T12:13:18Z

For now I decided to go with an unsatisfactory hybrid of (B) and (C). I added a parameter that requires the user to specify whether the maps are based on equal binning or categorical data. If it's continuous data with unequal binning, the merge fails with a message explaining the issue.
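
A rough sketch of how such a guarded merge might look. The parameter name `equalBinningOrCategorical` and the error message are assumptions for illustration and may differ from the merged implementation.

```fsharp
// Hypothetical signature: the caller vouches that both maps use the same binning
// (or hold categorical data). Nothing is verified beyond this flag.
let mergeBy (equalBinningOrCategorical: bool) (f: 'v -> 'v -> 'v) (a: Map<'k,'v>) (b: Map<'k,'v>) : Map<'k,'v> =
    if not equalBinningOrCategorical then
        failwith "Cannot merge maps with unequal binning: counts from overlapping bins would be combined."
    let folder acc key valueB =
        match Map.tryFind key acc with
        | Some valueA -> Map.add key (f valueA valueB) acc
        | None -> Map.add key valueB acc
    Map.fold folder a b

let a = Map.ofList [(0.1,1);(0.2,1);(0.3,1)]
let b = Map.ofList [(0.1,2);(0.3,1)]

let summed = mergeBy true (+) a b
// summed = map [(0.1, 3); (0.2, 1); (0.3, 2)]
```

The flag does not make the merge safer by itself; it mainly forces the caller to think about the binning before merging.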

In the future, a procedure could be implemented that dissects both maps and creates a new one with a new binning. If my understanding is correct, the new bandwidth must be double the maximal bandwidth observed in the input maps.

bvenn assigned HarryMcCarney and bvenn Apr 26, 2023

bvenn added a commit that referenced this issue Apr 26, 2023

add map merge

44c8071

bvenn added a commit that referenced this issue Apr 26, 2023

add map merge tests

9bd6c08

bvenn added a commit that referenced this issue Apr 26, 2023

add distribution merge documentation

6e0a7ad

bvenn mentioned this issue Apr 26, 2023

Add merge strategies for maps #264

Merged

2 tasks

bvenn added a commit that referenced this issue Apr 27, 2023

add bandwidth test

814ae18

bvenn added a commit that referenced this issue Apr 27, 2023

Merge pull request #264 from fslaborg/#263-add-map-merge

4f71652
