Code coverage uploads fail occasionally at the CI #9832

Open
ahukkanen opened this issue Sep 15, 2022 · 20 comments · Fixed by #10344
Labels
type: bug Issues that describe a bug type: internal PRs that aren't necessary to add to the CHANGELOG for implementers

Comments

@ahukkanen
Contributor

ahukkanen commented Sep 15, 2022

Describe the bug
At #9686 we tried to solve the issue with the "flaky" code coverage reports that sometimes report coverage dropped even when it didn't really drop.

The implemented solution was based on simplecov-ruby/simplecov#1019, which explains that simplecov has a bug and that we can avoid it by keeping the reports from parallel runs separate and letting Codecov itself merge them.

After this was implemented, we started getting this kind of error when trying to upload the reports to codecov:

https://codecov.io/upload/v4?package=bash-1.0.6&token=<hidden>&package=bash-1.0.6&token=&branch=develop&commit=c6791f1455d88ba20aedd3c6a93a709d8771b5ae&build=3061503581&build_url=https%3A%2F%2Fgithub.com%2Fdecidim%2Fdecidim%2Factions%2Fruns%2F3061503581&name=decidim-admin&tag=&slug=decidim%2Fdecidim&service=github-actions&flags=decidim-admin&pr=&job=%5BCI%5D%20Admin%20%28system%20tests%29&cmd_args=n,C,F
{'detail': ErrorDetail(string='Unable to locate build via Github Actions API. Please upload with the Codecov repository upload token to resolve issue.', code='not_found')}
404

To Reproduce
This happens randomly, so see #9686 (comment).

That comment explains an actual case where this happened and the reason why it happened.

Expected behavior
It would be expected that the code coverage report uploading succeeds normally.

Screenshots
Coverage report upload failed

Stacktrace

==> Uploading to Codecov
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  448k  100   171  100  448k    881  2313k --:--:-- --:--:-- --:--:-- 2313k
    {'detail': ErrorDetail(string='Unable to locate build via Github Actions API. Please upload with the Codecov repository upload token to resolve issue.', code='not_found')}

Extra data (please complete the following information):

  • Device: The machine running at Microsoft datacenter
  • Device OS: Linux
  • Browser: Any
  • Decidim Version: develop
  • Decidim installation: N/A

Additional context
The codecov uploader we are using (https://codecov.io/bash) would allow passing the "codecov token" using -t TOKEN. It would need to be added here:

bash <(curl -s https://codecov.io/bash) -n $REPORT_NAME -C $SHA -F $REPORT_NAME
else
bash <(curl -s https://codecov.io/bash) -n $REPORT_NAME -C $SHA -P $PRID -F $REPORT_NAME

After investigating the codecov script, this could also be solved simply by providing the CODECOV_TOKEN environment variable to the upload script.

The problem is that I, at least, don't have access to the Codecov account to get that token.

Does anyone know who has access to the Codecov account?
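
As a rough sketch of the two options above (the -t flag or the CODECOV_TOKEN variable), the upload call could be wrapped like this. The helper name with_upload_args is hypothetical, and the rule "pass -P only when PRID is non-empty" is an assumption, since the condition of the existing if/else is not shown here. Only flags that already appear in this issue are used.

```shell
# Hypothetical helper: runs the given command with the uploader flags appended.
# REPORT_NAME, SHA and PRID are the variables used in the existing workflow step;
# CODECOV_TOKEN is assumed to be provided by the CI environment.
with_upload_args() {
  set -- "$@" -n "$REPORT_NAME" -C "$SHA" -F "$REPORT_NAME"
  # Assumption: the PR id is only passed when one is available.
  if [ -n "${PRID:-}" ]; then
    set -- "$@" -P "$PRID"
  fi
  # Pass the token explicitly only when it is set, so runs without a token
  # behave as before.
  if [ -n "${CODECOV_TOKEN:-}" ]; then
    set -- "$@" -t "$CODECOV_TOKEN"
  fi
  "$@"
}

# Usage in the workflow step (bash, not executed here):
#   with_upload_args bash <(curl -s https://codecov.io/bash)
```

Keeping the token optional means the same step can still run for workflows where no token is available, although those runs would presumably still hit the original error.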

@ahukkanen
Contributor Author

Also to note here that the change done at #9686 is very much needed until simplecov-ruby/simplecov#1019 is resolved.

We just found out another problem after implementing that fix.

@ahukkanen
Contributor Author

ahukkanen commented Sep 15, 2022

And just looking at that attached screenshot, I also noticed this:
Token parameter issue

The last URL parameter will always win, so maybe this is a bug somewhere else... It could be that we are already sending the token.

EDIT:
Never mind, this is a bug in the codecov upload script:
Codecov upload script

  # Full query (to display on terminal output)
  query=$(echo "package=$package-$VERSION&token=$token&$query" | tr -d ' ')
  queryNoToken=$(echo "package=$package-$VERSION&token=<hidden>&$query" | tr -d ' ')

They first put the token into the query variable and then append that whole query, token included, to queryNoToken, which is why the token parameter shows up twice in the displayed URL.

So I think the solution would be to set the token correctly.
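
To illustrate, here is a minimal reproduction of those two statements with placeholder values (the branch and service values are made up for the example, and the token is left empty as in our runs). It suggests the doubled parameter only affects the query printed to the terminal, since queryNoToken is the display copy:

```shell
# Placeholder inputs, not real settings.
package="bash"
VERSION="1.0.6"
token=""   # empty: no token configured
query="branch=develop&service=github-actions"

# The same two statements as in the uploader script quoted above.
query=$(echo "package=$package-$VERSION&token=$token&$query" | tr -d ' ')
queryNoToken=$(echo "package=$package-$VERSION&token=<hidden>&$query" | tr -d ' ')

echo "$queryNoToken"
# package=bash-1.0.6&token=<hidden>&package=bash-1.0.6&token=&branch=develop&service=github-actions
```

The printed string has the same package=...&token=<hidden>&package=...&token= prefix as the URL in the original report.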

@ahukkanen ahukkanen added type: bug Issues that describe a bug type: internal PRs that aren't necessary to add to the CHANGELOG for implementers labels Sep 15, 2022
@ahukkanen
Contributor Author

It also seems that the selective running of the specs causes coverage comparison failures.

For example, at #9621, codecov is reporting coverage degradation for the core classes and those specs weren't even run for that PR because it didn't touch any classes that would trigger the core specs to run:
https://app.codecov.io/gh/decidim/decidim/pull/9621

Coverage report

The bottom line is that if any code in a file is covered in the coverage report, Codecov compares that coverage to any previous runs that covered the same file. If the coverage differs, Codecov flags it as a problem, because we are comparing against the "project" coverage, not the "patch" coverage.

@ahukkanen
Contributor Author

Same thing as in the original post happened also at:
https://github.com/decidim/decidim/actions/runs/3098964809/jobs/5017524140

The coverage upload failed at the decidim-admin module causing an invalid report.

One possible solution, which I also found suggested for similar problems, would be to modify the coverage upload script so that it retries, e.g., up to 10 times with a 1 minute wait between attempts, until the upload succeeds or the retry limit is reached.
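
A minimal sketch of such a retry wrapper, assuming the wrapped upload command exits non-zero when the upload fails (this needs to be checked against the uploader's options, otherwise the loop would stop after the first attempt). The function name retry is hypothetical:

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# sleeping DELAY_SECONDS between attempts.
retry() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts" >&2
      return 1
    fi
    echo "Attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Usage in the workflow step (bash, not executed here):
#   retry 10 60 bash <(curl -s https://codecov.io/bash) -n "$REPORT_NAME" -C "$SHA" -F "$REPORT_NAME"
```

With 10 attempts and a 60 second delay, a job would wait at most about 9 minutes before giving up, which seems acceptable compared to rerunning the whole job.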

@ahukkanen
Contributor Author

Even with the official Codecov action we seem to have the same problem (at PR #10344):
https://github.com/decidim/decidim/actions/runs/4107161202/jobs/7086296113

Run log:

Run codecov/codecov-action@v3
  with:
    name: decidim-accountability
    flags: decidim-accountability
  env:
    CI: true
    RUBY_VERSION: 3.1.1
    NODE_VERSION: 16.9.1
    DECIDIM_MODULE: decidim-accountability
    PARALLEL_TEST_PROCESSORS: 2
    DATABASE_USERNAME: postgres
    DATABASE_PASSWORD: postgres
    DATABASE_HOST: localhost
    RUBYOPT: -W:no-deprecated
    VALIDATOR_HTML_URI: http://localhost:8888/
==> linux OS detected
https://uploader.codecov.io/latest/linux/codecov.SHA256SUM
==> SHASUM file signed by key id 806bb28aed779869
==> Uploader SHASUM verified (20f9c9d78483fce977b6cc39e231a734a23bcd36f4d536bb7355222fb88d02bc  codecov)
==> Running version latest
==> Running version v0.3.2
/home/runner/work/_actions/codecov/codecov-action/v3/dist/codecov -n decidim-accountability -Q github-action-3.1.1 -F decidim-accountability -C 19ee9df14477fcb55f74bc2e7844a2eccec4cf9f
[2023-02-06T19:44:22.640Z] ['info'] 
     _____          _
    / ____|        | |
   | |     ___   __| | ___  ___ _____   __
   | |    / _ \ / _` |/ _ \/ __/ _ \ \ / /
   | |___| (_) | (_| |  __/ (_| (_) \ V /
    \_____\___/ \__,_|\___|\___\___/ \_/

  Codecov report uploader 0.3.2
[2023-02-06T19:44:22.658Z] ['info'] => Project root located at: /home/runner/work/decidim/decidim
[2023-02-06T19:44:22.663Z] ['info'] -> No token specified or token is empty
[2023-02-06T19:44:22.691Z] ['info'] Searching for coverage files...
[2023-02-06T19:44:23.827Z] ['info'] Warning: Some files located via search were excluded from upload.
[2023-02-06T19:44:23.827Z] ['info'] If Codecov did not locate your files, please review https://docs.codecov.com/docs/supported-report-formats
[2023-02-06T19:44:23.827Z] ['info'] => Found 1 possible coverage files:
  coverage/coverage.xml
[2023-02-06T19:44:23.827Z] ['info'] Processing /home/runner/work/decidim/decidim/coverage/coverage.xml...
[2023-02-06T19:44:23.906Z] ['info'] Detected GitHub Actions as the CI provider.
[2023-02-06T19:44:23.908Z] ['info'] Pinging Codecov: https://codecov.io/upload/v4?package=github-action-3.1.1-uploader-0.3.2&token=*******&branch=chore%2Fcodecov-action&build=4107161202&build_url=https%3A%2F%2Fgithub.com%2Fdecidim%2Fdecidim%2Factions%2Fruns%2F4107161202&commit=19ee9df14477fcb55f74bc2e7844a2eccec4cf9f&job=%5BCI%5D+Accountability&pr=10344&service=github-actions&slug=decidim%2Fdecidim&name=decidim-accountability&tag=&flags=decidim-accountability&parent=
[2023-02-06T19:44:24.363Z] ['error'] There was an error running the uploader: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 404 - {'detail': ErrorDetail(string='Unable to locate build via Github Actions API. Please upload with the Codecov repository upload token to resolve issue.', code='not_found')}

@ahukkanen
Contributor Author

Related discussions:

So the bottom line is that:

  • It does not matter whether we use the legacy bash uploader or the newer GitHub action
  • The reason why uploads fail is that the public actions are using a shared Codecov token which is reaching its API limits with GitHub and therefore it is failing to locate the correct build
  • This can be fixed by providing our own token
  • But the problem with our own token is that it does not work for workflows triggered from Decidim forks, so external contributors would still hit this issue

Potential solution is provided in the discussion linked above:

codecov/codecov-action#557 (comment)

ssbarnea, what have helped me, and what I thought you suggested was adding token not only to secrets, but to CI yaml as well. Like that:

      uses: codecov/codecov-action@v3
      with:
        token: ${{ secrets.CODECOV_TOKEN }}

Have you tried it?

codecov/codecov-action#557 (comment)

From https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflows-in-forked-repositories-1

secrets are not passed to the runner when a workflow is triggered from a forked repository

See also https://securitylab.github.com/research/github-actions-preventing-pwn-requests/

codecov/codecov-action#557 (comment)

and the proposed workaround of using an upload token won't work for PRs from forks

It works if you make you token public: https://github.com/FerretDB/FerretDB/blob/f4de6e41fc6d0ab9f00ac9e2facdda877fca86ee/.github/workflows/go.yml#L23-L25

So if we expose our Codecov token through the workflow files by hard coding it there, i.e. not through the configured environment variables, it should work for both workflows triggered from the decidim/decidim repo as well as from any forked repositories.

@andreslucena
Member

@ahukkanen great summary.

I guess that if the only risk is that someone could send reports on our behalf and nothing else, then we can live with that. I'll also subscribe to those issues so we can try again with whatever solution they provide.

Go ahead with the change 👍🏽

@ahukkanen
Contributor Author

With about one week of experience, I haven't yet seen CodeCov failing unnecessarily so the workaround seems to work.

We'll keep following the situation. I believe we haven't yet tested it under heavy load, i.e. when there are a large number of workflow runs and a large number of Codecov requests at the same time.

@ahukkanen
Contributor Author

Today I noticed the first error on this run when there was rather high load on the CI actions:
https://github.com/decidim/decidim/actions/runs/4403186821/jobs/7711240421

Codecov returned HTTP 500, which is different from what we used to see before setting the token. Here's the whole log from the upload coverage step:

==> linux OS detected
https://uploader.codecov.io/latest/linux/codecov.SHA256SUM
==> SHASUM file signed by key id 806bb28aed779869
==> Uploader SHASUM verified (080b43eaec3434326bb0f61653a82d27aba15c311ddde9d3f68cb364314f7aae  codecov)
==> Running version latest
==> Running version v0.3.5
/home/runner/work/_actions/codecov/codecov-action/v3/dist/codecov -n decidim-proposals-system-public -Q github-action-3.1.1 -F decidim-proposals-system-public
[2023-03-13T11:15:15.851Z] ['info'] 
     _____          _
    / ____|        | |
   | |     ___   __| | ___  ___ _____   __
   | |    / _ \ / _` |/ _ \/ __/ _ \ \ / /
   | |___| (_) | (_| |  __/ (_| (_) \ V /
    \_____\___/ \__,_|\___|\___\___/ \_/

  Codecov report uploader 0.3.5
[2023-03-13T11:15:15.863Z] ['info'] => Project root located at: /home/runner/work/decidim/decidim
[2023-03-13T11:15:15.868Z] ['info'] ->  Token found by environment variables
[2023-03-13T11:15:15.904Z] ['info'] Searching for coverage files...
[2023-03-13T11:15:16.992Z] ['info'] Warning: Some files located via search were excluded from upload.
[2023-03-13T11:15:16.992Z] ['info'] If Codecov did not locate your files, please review https://docs.codecov.com/docs/supported-report-formats
[2023-03-13T11:15:16.992Z] ['info'] => Found 1 possible coverage files:
  coverage/coverage.xml
[2023-03-13T11:15:16.992Z] ['info'] Processing /home/runner/work/decidim/decidim/coverage/coverage.xml...
[2023-03-13T11:15:17.067Z] ['info'] Detected GitHub Actions as the CI provider.
[2023-03-13T11:15:17.070Z] ['info'] Pinging Codecov: https://codecov.io/upload/v4?package=github-action-3.1.1-uploader-0.3.5&token=*******&branch=develop&build=4403186821&build_url=https%3A%2F%2Fgithub.com%2Fdecidim%2Fdecidim%2Factions%2Fruns%2F4403186821&commit=7504b5956f4e0a872d31a7d943dd98196c441589&job=%5BCI%5D+Proposals+%28system+public%29&pr=&service=github-actions&slug=decidim%2Fdecidim&name=decidim-proposals-system-public&tag=&flags=decidim-proposals-system-public&parent=
[2023-03-13T11:15:19.504Z] ['error'] There was an error running the uploader: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 500 - {"error": "Server Error (500)"}

Not many details are available about what might have caused it, but the good thing is that it's no longer the same error that we used to get. This is likely something internal at Codecov's end.

@ahukkanen
Contributor Author

I posted my findings also at: codecov/codecov-action#926.

Would be really helpful if they provided a retry option since the issues we are facing now seem to be at their end, so there's not much we can do about them.

@ahukkanen
Contributor Author

I think we need to reopen this issue since this is not permanently solved.

Codecov is still failing quite often because one of the actions fails to upload its coverage. Not all of them but if we are missing even one, it has a high impact on the reported coverage.

There is still no activity on this matter from Codecov's end, but in the issue linked above (codecov/codecov-action#926) there are a couple of suggestions for how we could try to solve this issue:

  1. Split the coverage upload to its own step and fail the step
  • In this suggestion, the job would fail but only a single step of it needs to be rerun which would be rather fast.
  • The coverage would be stored as an artifact so it would be available for the separate step.
  • Note that this requires setting the fail_ci_if_error flag to true for the action.
  2. Using the Retry Step action
  • This action is able to retry a certain step in case it fails. I have not investigated it further.

@alecslupu
Contributor

alecslupu commented Oct 30, 2023

@ahukkanen This PR tried to enable fail_ci_if_error, which would fail the pipeline if there is an HTTP error, but as @andreslucena pointed out here #11837 (comment), Codecov still failed even though there was no error when uploading the coverage data.

What I could observe was that there was no upload error, yet Codecov took ages to process the data and display it... Maybe we have a very big codebase? :D

I tried to split the pipeline in the same build in 3802333, yet Codecov was still failing.

As I understand GitHub Actions, there is no way to let GitHub finish all the workflows, generate artifacts containing the coverage data, and then download all the artifacts in one single workflow and send them to Codecov in a more compact request (either a merged result or multiple workflows).

@ahukkanen ahukkanen reopened this Oct 30, 2023
@ahukkanen
Contributor Author

@alecslupu OK, that is actually interesting because it seems the problem may be somewhere else than I originally thought. In my previous investigations, I figured the problem would be that some of the actions fail to upload the coverage report, which can still happen.

But at #11837 I was particularly interested that you said:

Well, at least we know that the bug is on the codecov processing time.

What do you mean by "codecov processing time"?

@alecslupu
Contributor

@alecslupu OK, that is actually interesting because it seems the problem may be somewhere else than I originally thought. In my previous investigations, I figured the problem would be that some of the actions fail to upload the coverage report, which can still happen.

But at #11837 I was particularly interested that you said:

Well, at least we know that the bug is on the codecov processing time.

What do you mean by "codecov processing time"?

I think I have answered here: #9832 (comment)
There were no upload errors, yet Codecov still failed due to the enormous processing time on the Codecov side.

To be honest, even loading https://app.codecov.io/gh/decidim/decidim/pull/11837 on Codecov takes ages for me (i.e. the data never loads).

What is more interesting is that I have this commit (which only updates a JS library) that generates an indirect change: https://app.codecov.io/gh/decidim/decidim/commit/94c3c595ca42ef02073aada193ca70780734c907/indirect-changes

@alecslupu
Contributor

alecslupu commented Oct 30, 2023

I am just saying ... maybe it is worth looking for other apps that provide similar services, for example enabling Code Climate (which is already installed), although Code Climate does not appear to support multiple workflow runs at the same time.
Codecov seems to have been installed since the beginning: 0c61969 (a commit from 2016)

@ahukkanen
Contributor Author

@alecslupu Yes, that is definitely a good idea at this point since it doesn't seem we are able to find a solution to this problem.

In the past it was suggested by @microstudi that we would use Code Climate also for the coverage checks as, in his experience, it has worked well. I don't know if there are any downsides to that.

@alecslupu
Contributor

I am thinking that any of the following would allow us to wait for all the other workflows to finish. If that is the case, we could patch the current workflows so that each uploads its coverage as an artifact, and then in another place download all of them, unzip them, merge them into a single file, and push it to whatever coverage service we use.

https://github.com/mktcode/consecutive-workflow-action
https://github.com/fountainhead/action-wait-for-check
https://github.com/lewagon/wait-on-check-action

@ahukkanen
Contributor Author

@alecslupu From the point above I understood that at least with Codecov we would still have the same problem with that strategy if Codecov reports the coverage to GitHub before it has finished analyzing the whole report.

Just as a side note, we've never seen this problem with any modules where we are using Codecov, since they only upload a single report.

@andreslucena
Member

I am just saying ... maybe it is worth looking for other apps that provide similar services, for example enabling Code Climate (which is already installed), although Code Climate does not appear to support multiple workflow runs at the same time.

This is the only feature that we need, right?

The term that I've seen used for this feature is "carryforward". I see that Coveralls supports it:

https://coveralls.io/better-monorepo-support

(I've never used it)

@microstudi
Contributor

In the past it was suggested by @microstudi that we would use Code Climate also for the coverage checks as, in his experience, it has worked well. I don't know if there are any downsides to that.

Actually, Code Climate was used in the past in Decidim. They decided to change in order to support multi-action uploads. I don't know if Code Climate supports that now.

4 participants