Fix weighted statistics #681

Closed
wants to merge 3 commits into from

Conversation

Contributor

@j08lue j08lue commented Feb 26, 2024

Fixes #680

  • Change the simple tests for get_array_statistics to reflect more details of weighted stats - tests will fail ❌
  • Add tests for more weighted stats
  • Implement correct weighted statistics in get_array_statistics

@j08lue j08lue changed the title Fix weighted statistics WIP: Fix weighted statistics Feb 26, 2024
@j08lue j08lue changed the title WIP: Fix weighted statistics [WIP] Fix weighted statistics Feb 26, 2024
@j08lue j08lue marked this pull request as draft February 27, 2024 12:57
@j08lue j08lue changed the title [WIP] Fix weighted statistics Fix weighted statistics Feb 27, 2024
assert stats[0]["min"] == 2
assert stats[0]["max"] == 3
assert stats[0]["mean"] == (1 * 0 + 2 * 0.25 + 3 * 1.0 + 4 * 0) / 1.25
assert stats[0]["count"] == 1.25
Member

Let's look at the data

data = np.ma.array((1, 2, 3, 4)).reshape((1, 2, 2))
coverage = np.array((0, 0.25, 1, 0)).reshape((2, 2))
data * coverage

>> masked_array(
  data=[[[0. , 0.5],
         [3. , 0. ]]],
  mask=False,
  fill_value=1e+20)

the stats should then be:

min: 0
max: 3
mean: (0 + 0.5 + 3.0 + 0.) / 4  = 0.875
sum: 0 + 0.5 + 3.0 + 0. = 3.5
count: 0 + 0.25 + 1 + 0 = 1.25 

I'm not sure I understand why the mean should be the sum of data * coverage divided by the sum of the coverage. We already apply the coverage factor, so (IMO) we just need to divide by the number of pixels.

Contributor Author

@j08lue j08lue Mar 7, 2024

Because the coverage values do not sum to 1. Their magnitude is arbitrary and unrelated to the overall weight of all cells, so multiplying by coverage without normalizing skews the overall quantity.

A simple case illustrates this: imagine 2 x 2 cells, all containing the pixel value 20, with the same coverage, 0.1, for every cell.

Since the coverage / weight is the same for all cells, you would expect that their weighted average is the same as their simple average, namely 20, right?

Simple average:

(20 + 20 + 20 + 20) / 4 = 20

Weighted sum:

(20 * 0.1 + 20 * 0.1 + 20 * 0.1 + 20 * 0.1) = 8

If you divide that by 4 (the number of cells), you get 2.

You need to divide by the sum of the weights to get the expected result:

8 / (0.1 + 0.1 + 0.1 + 0.1) = 20

This is not a proof, but perhaps it still helps?
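The arithmetic above can be checked with a small NumPy sketch. The variable names are illustrative only and are not taken from the library's implementation; the point is only to compare the two candidate denominators.

```python
import numpy as np

# 2 x 2 cells, all with pixel value 20 and the same coverage of 0.1
values = np.full((2, 2), 20.0)
coverage = np.full((2, 2), 0.1)

# Sum of value * coverage over all cells
weighted_sum = (values * coverage).sum()

# Candidate 1: divide by the number of cells (gives roughly 2)
mean_by_cell_count = weighted_sum / values.size

# Candidate 2: divide by the sum of the weights (gives roughly 20)
mean_by_weight_sum = weighted_sum / coverage.sum()

print(mean_by_cell_count, mean_by_weight_sum)
```

Only the second candidate recovers the simple average of 20 when all cells carry equal weight.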

Contributor Author

https://en.wikipedia.org/wiki/Weighted_arithmetic_mean

$\bar{x} = \frac{w_1 x_1 + w_2 x_2 + ... + w_n x_n}{w_1 + w_2 + ... + w_n}$
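As a hedged illustration of that formula, here is a sketch applying it to the 2 x 2 data and coverage from the test in this thread. It uses `np.average` with its `weights` argument and compares the result with the per-pixel alternative; it is not the code from this PR, and the names are made up for the example.

```python
import numpy as np

# Data and coverage values from the test under review
data = np.array([1, 2, 3, 4], dtype=float).reshape(2, 2)
coverage = np.array([0, 0.25, 1, 0]).reshape(2, 2)

# Weighted arithmetic mean: sum(w * x) / sum(w)
weighted_mean = np.average(data, weights=coverage)

# The same quantity written out by hand
manual = (data * coverage).sum() / coverage.sum()

# Alternative raised in the review: divide by the number of pixels instead
per_pixel = (data * coverage).sum() / data.size

print(weighted_mean, manual, per_pixel)
```

With these inputs the first two values agree at about 2.8, which matches the expression asserted in the test, while the per-pixel division gives about 0.875.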

Member

Well, I'm not sure I fully understand, but in our case we don't use weights; we use coverage, the percentage of each pixel that is covered. So the sum of all the weights should be the number of pixels, not the sum of the coverage.

Successfully merging this pull request may close these issues.

Incorrect array statistics with (coverage) weights
2 participants