run started to error out #150

Closed
yarikoptic opened this issue Oct 2, 2022 · 33 comments · Fixed by #151

@yarikoptic
Member

Happens consistently on datalad, seemingly always on the same PR:

-N  - 5/549374: Cron Daemon            Cron <datalad@smaug> chronic flock -n -E 0 /home/datalad/.run/tinuous-datalad.lock /mnt/datasets/datalad/ci/logs/tools/cron_job 
...
2022-10-01T18:20:14-0400 [INFO    ] tinuous Found run 5550                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
2022-10-01T18:20:14-0400 [INFO    ] tinuous Logs for test_extensions.yml (Extensions) #5550 already downloaded to 2022/09/15/pr/7039/6ac3eef/github-Extensions-5550-failed; skipping                                                                                                                                                                                                                                                                                                                                                  
2022-10-01T18:20:14-0400 [INFO    ] tinuous Fetching runs for workflow  (Docs)                                                                                                                                                                                                                                                                                                                                                                                                                                                        
Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/requests/adapters.py", line 439, in send                                                                                                                                                                                                                                                                                                                                                                                                                
    resp = conn.urlopen(                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/urllib3/connectionpool.py", line 846, in urlopen                                                                                                                                                                                                                                                                                                                                                                                                        
    return self.urlopen(                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/urllib3/connectionpool.py", line 846, in urlopen                                                                                                                                                                                                                                                                                                                                                                                                        
    return self.urlopen(                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/urllib3/connectionpool.py", line 846, in urlopen                                                                                                                                                                                                                                                                                                                                                                                                        
    return self.urlopen(                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
  [Previous line repeated 9 more times]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/urllib3/connectionpool.py", line 836, in urlopen                                                                                                                                                                                                                                                                                                                                                                                                        
    retries = retries.increment(method, url, response=response, _pool=self)                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment                                                                                                                                                                                                                                                                                                                                                                                                          
    raise MaxRetryError(_pool, url, error or ResponseError(cause))                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: /repos/datalad/datalad/actions/workflows/208727/runs (Caused by ResponseError('too many 500 error responses'))                                                                                                                                                                                                                                                                                                 

During handling of the above exception, another exception occurred:                                                                                                                                                                                                                                                                                                                                                                                                                                                                   

Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
  File "/home/datalad/miniconda3/envs/tinuous-dev/bin/tinuous", line 33, in <module>                                                                                                                                                                                                                                                                                                                                                                                                                                                  
    sys.exit(load_entry_point('tinuous', 'console_scripts', 'tinuous')())                                                                                                                                                                                                                                                                                                                                                                                                                                                             
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/click/core.py", line 829, in __call__                                                                                                                                                                                                                                                                                                                                                                                                                   
    return self.main(*args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/click/core.py", line 782, in main                                                                                                                                                                                                                                                                                                                                                                                                                       
    rv = self.invoke(ctx)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/click/core.py", line 1259, in invoke                                                                                                                                                                                                                                                                                                                                                                                                                    
    return _process_result(sub_ctx.command.invoke(sub_ctx))                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/click/core.py", line 1066, in invoke                                                                                                                                                                                                                                                                                                                                                                                                                    
    return ctx.invoke(self.callback, **ctx.params)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/click/core.py", line 610, in invoke                                                                                                                                                                                                                                                                                                                                                                                                                     
    return callback(*args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/click/decorators.py", line 33, in new_func                                                                                                                                                                                                                                                                                                                                                                                                              
    return f(get_current_context().obj, *args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
  File "/mnt/datasets/datalad/ci/tinuous/src/tinuous/__main__.py", line 112, in fetch                                                                                                                                                                                                                                                                                                                                                                                                                                                 
    for obj in ci.get_build_assets(                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
  File "/mnt/datasets/datalad/ci/tinuous/src/tinuous/github.py", line 118, in get_build_assets                                                                                                                                                                                                                                                                                                                                                                                                                                        
    for run in self.get_runs(wf, self.since):                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
  File "/mnt/datasets/datalad/ci/tinuous/src/tinuous/github.py", line 43, in wrapped                                                                                                                                                                                                                                                                                                                                                                                                                                                  
    return func(gha, *args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
  File "/mnt/datasets/datalad/ci/tinuous/src/tinuous/github.py", line 101, in get_runs                                                                                                                                                                                                                                                                                                                                                                                                                                                
    for r in wf.get_runs():                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/github/PaginatedList.py", line 56, in __iter__                                                                                                                                                                                                                                                                                                                                                                                                          
    newElements = self._grow()                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/github/PaginatedList.py", line 67, in _grow                                                                                                                                                                                                                                                                                                                                                                                                             
    newElements = self._fetchNextPage()                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/github/PaginatedList.py", line 199, in _fetchNextPage                                                                                                                                                                                                                                                                                                                                                                                                   
    headers, data = self.__requester.requestJsonAndCheck(                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/github/Requester.py", line 354, in requestJsonAndCheck                                                                                                                                                                                                                                                                                                                                                                                                  
    *self.requestJson(                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/github/Requester.py", line 454, in requestJson                                                                                                                                                                                                                                                                                                                                                                                                          
    return self.__requestEncode(cnx, verb, url, parameters, headers, input, encode)                                                                                                                                                                                                                                                                                                                                                                                                                                                   
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/github/Requester.py", line 528, in __requestEncode                                                                                                                                                                                                                                                                                                                                                                                                      
    status, responseHeaders, output = self.__requestRaw(                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/github/Requester.py", line 555, in __requestRaw                                                                                                                                                                                                                                                                                                                                                                                                         
    response = cnx.getresponse()                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/github/Requester.py", line 127, in getresponse                                                                                                                                                                                                                                                                                                                                                                                                          
    r = verb
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/requests/sessions.py", line 555, in get                                                                                                                                                                                                                                                                                                                                                                                                                 
    return self.request('GET', url, **kwargs)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/requests/sessions.py", line 542, in request                                                                                                                                                                                                                                                                                                                                                                                                             
    resp = self.send(prep, **send_kwargs)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/requests/sessions.py", line 655, in send                                                                                                                                                                                                                                                                                                                                                                                                                
    r = adapter.send(request, **kwargs)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
  File "/home/datalad/miniconda3/envs/tinuous-dev/lib/python3.9/site-packages/requests/adapters.py", line 507, in send                                                                                                                                                                                                                                                                                                                                                                                                                
    raise RetryError(e, request=request)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
requests.exceptions.RetryError: HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: /repos/datalad/datalad/actions/workflows/208727/runs (Caused by ResponseError('too many 500 error responses'))    

and started to fail on dandi-cli

Subject: Cron <dandi@drogon> cd /mnt/backup/dandi/tinuous-logs/dandi-cli && chronic flock -n -E 0 /home/dandi/.run/tinuous-dandi-cli.lock /mnt/backup/dandi/tinuous-logs/venv/bin/tinuous fetch                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
2022-10-01T18:00:02-0400 [INFO    ] tinuous tinuous 0.5.2                                                                                                                                                                                                                                                                                                                                                                                                             
2022-10-01T18:00:02-0400 [INFO    ] tinuous Fetching resources from github                                                                                                                                                                                                                                                                                                                                                                                            
2022-10-01T18:00:02-0400 [INFO    ] tinuous Fetching runs newer than 2022-09-30 21:03:18+00:00                                                                                                                                                                                                                                                                                                                                                                        
2022-10-01T18:00:02-0400 [INFO    ] tinuous Fetching runs for workflow .github/workflows/test.yml (Tests)                                                                                                                                                                                                                                                                                                                                                             
2022-10-01T18:00:04-0400 [INFO    ] tinuous Found run 4227                                                                                                                                                                                                                                                                                                                                                                                                            
2022-10-01T18:00:04-0400 [INFO    ] tinuous Logs for test.yml (Tests) #4227 already downloaded to 2022/10/01/github/cron/20221001T061504/d5ed594/Tests/4227; skipping                                                                                                                                                                                                                                                                                                 
2022-10-01T18:00:04-0400 [INFO    ] tinuous Fetching runs for workflow  (Tests)                                                                                                                                                                                                                                                                                                                                                                                       
Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                                                                                                                                    
...                                                                                                                                                                                                                                                                                                                                                                                                   
  File "/mnt/backup/dandi/tinuous-logs/venv/lib/python3.8/site-packages/requests/adapters.py", line 510, in send                                                                                                                                                                                                                                                                                                                                                      
    raise RetryError(e, request=request)                                                                                                                                                                                                                                                                                                                                                                                                                              
requests.exceptions.RetryError: HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: /repos/dandi/dandi-cli/actions/workflows/228028/runs (Caused by ResponseError('too many 500 error responses'))     

since it is so consistent on the same PR -- maybe it is indeed some github issue??? It should be investigated and possibly filed with github. I do not think I saw a similar problem from any other cron job -- only from datalad and dandi-cli.

@jwodder
Member

jwodder commented Oct 3, 2022

@yarikoptic The error does not appear to have anything to do with specific PRs; rather, it's tied to specific (broken?) workflows which have empty path fields and whose run listings always return 500.

@jwodder
Member

jwodder commented Oct 3, 2022

@yarikoptic For filling out the GitHub ticket, I need to know when this started.

@yarikoptic
Member Author

hm, fun. Well - the dates are in the logs above pretty much since we run those cron jobs quite regularly. In both cases it started on Oct 01. First email failing for dandi-cli:

Date: Sat, 01 Oct 2022 00:13:02 -0400
From: Cron Daemon <root@drogon.datalad.org>
To: dandi@drogon.datalad.org
Subject: Cron <dandi@drogon> cd /mnt/backup/dandi/tinuous-logs/dandi-cli && chronic flock -n -E 0
/home/dandi/.run/tinuous-dandi-cli.lock /mnt/backup/dandi/tinuous-logs/venv/bin/tinuous fetch

for datalad:

Date: Sat, 01 Oct 2022 02:33:11 -0400
From: Cron Daemon <root@smaug.datalad.org>
To: datalad@smaug.datalad.org
Subject: Cron <datalad@smaug> chronic flock -n -E 0 /home/datalad/.run/tinuous-datalad.lock
        /mnt/datasets/datalad/ci/logs/tools/cron_job

@jwodder
Member

jwodder commented Oct 3, 2022

Ticket filed with GitHub. I associated it with the "con" organization, so you might(?) be able to see it: https://support.github.com/ticket/personal/0/1818408

@yarikoptic
Member Author

Unfortunately can't see it, but that's ok I guess. Is there anything we need to fix in those workflows triggering this issue and/or any way tinuous could work around it ATM?

@jwodder
Member

jwodder commented Oct 3, 2022

@yarikoptic We could make tinuous skip workflows with an empty path, but I'm not entirely sure whether that can legitimately happen.
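A minimal sketch of the "skip workflows with an empty path" idea, assuming the workflow records are plain dicts shaped like the GitHub API listing. The function name `usable_workflows` and the logger name are hypothetical and not part of tinuous:

```python
import logging

# Hypothetical logger name, used only for this illustration.
lgr = logging.getLogger("tinuous-sketch")


def usable_workflows(workflows):
    """Yield workflow records with a non-empty "path"; warn about the rest."""
    for wf in workflows:
        if not wf.get("path"):
            # An empty or missing path is the symptom described above.
            lgr.warning(
                "Skipping workflow %r (id %s): empty path",
                wf.get("name"),
                wf.get("id"),
            )
            continue
        yield wf
```

Whether an empty path can also occur for a healthy workflow is exactly the open question raised above, so a filter like this could hide real runs.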

@yarikoptic
Member Author

ah gotcha - so you suspect that the workflow information github returns is buggy... ok, let's wait a day for a possible response to that github support ticket you filed.

@yarikoptic
Member Author

yarikoptic commented Oct 5, 2022

FTR: it seems to have started to error out for datalad-extensions as well. Let's hope the github folks figure it out and solve it soonish.

edit: and also for heudiconv for which I tried to add con/tinuous

@yarikoptic
Member Author

if the issue is due to github duplicating workflow records in its listing, where the 2nd copy doesn't have a path, couldn't we detect duplicate incomplete entries and just skip them? (after some lgr.warning("Detected duplicate buggy entry of workflow %s, a known github issue", workflow.name) or the like for debugging etc)

@jwodder
Member

jwodder commented Oct 10, 2022

@yarikoptic

  • Note that the only sense in which the broken workflows are "duplicates" is that they have the same name field as another workflow; all the other fields differ.
  • I don't know whether it'll always be the case that broken workflows are duplicates. For one thing, if I use the API to list the workflows for datalad-extensions and heudiconv right now, neither has any workflows with duplicate names or empty paths.

@yarikoptic
Member Author

yarikoptic commented Oct 10, 2022

  • I think it is ok to assume that name should be unique, or am I wrong? (i.e. if we have some case where we have the same name for multiple workflows, do we?)
  • have you reached the crash due to similar 500 issue with datalad-extensions or heudiconv? (for heudiconv I remember getting it only when trying to get it for some older starting date, like in May of this year)

@jwodder
Member

jwodder commented Oct 10, 2022

@yarikoptic

I think it is ok to assume that name should be unique, or am I wrong?

It is not OK to assume that. The "name" is just whatever's filled in for the "name" key at the top of the workflow file, and it's perfectly possible for two or more workflows to have the same name.

have you reached the crash due to similar 500 issue with datalad-extensions or heudiconv?

Using the below script, I get a 500 when trying to list the runs for the first datalad/datalad-extensions workflow (which has a unique name and a path of .github/workflows/test_extensions.yml, which currently does not exist in the repo), and I get no errors for nipy/heudiconv:

#!/usr/bin/env python3
import click
from ghrepo import GHRepo
import requests

@click.command()
@click.argument("repo", type=GHRepo.parse)
def main(repo):
    with requests.Session() as s:
        # List the workflows that GitHub reports for the repository
        r = s.get(f"{repo.api_url}/actions/workflows?per_page=100")
        r.raise_for_status()
        for workflow in r.json()["workflows"]:
            print("Name:", repr(workflow["name"]))
            print("Path:", repr(workflow["path"]))
            # Try to list the runs of each workflow; broken ones return 500
            r = s.get(workflow["url"] + "/runs")
            if not r.ok:
                print("ERROR:", r.status_code)
            else:
                print("Runs:", r.json()["total_count"])
            print()

if __name__ == "__main__":
    main()

@yarikoptic
Member Author

interesting, so -- it 500s on a workflow which no longer exists, although it existed a long time ago:

❯ git log -- .github/workflows/test_extensions.yml
commit 36f3042c0f9fbfef3d6cba82e1176a70c967e913
Author: Yaroslav Halchenko <debian@onerussian.com>
Date:   Mon Feb 24 12:07:29 2020 -0500

    Remove test_extensions.yml with a matrix for all extensions (to be split)

well beyond the date we care about. So, for each one of those 500s we just need to establish (and ideally cache, but maybe we can tolerate querying each time for now) the date when it was removed from the repo. Is there a github API to provide commits for a path? (I couldn't find one quickly)
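For what it's worth, GitHub's REST "list commits" endpoint accepts a `path` query parameter that restricts results to commits touching that path. A minimal sketch that only builds such a URL and makes no request; the helper name `last_commit_url` is hypothetical:

```python
from urllib.parse import urlencode


def last_commit_url(owner, repo, path):
    """Build the URL listing the most recent commit that touched ``path``.

    Assumption: for a deleted file, the newest commit touching the path is
    the one that removed it, so its date approximates the removal date.
    """
    query = urlencode({"path": path, "per_page": 1})
    return f"https://api.github.com/repos/{owner}/{repo}/commits?{query}"


print(last_commit_url("datalad", "datalad-extensions",
                      ".github/workflows/test_extensions.yml"))
```

The response would still need to be fetched with authentication and its commit date read out, which is left out here.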

@jwodder
Member

jwodder commented Oct 10, 2022

@yarikoptic I don't see anything obvious. Note that my script is able to retrieve runs for .github/workflows/upload-windows-extras.yaml, .github/workflows/test-datalad_osf.yaml, and some other workflows that no longer exist in the repo. Interestingly, testing the script on some other repositories seems to suggest that workflows that don't exist anymore shouldn't be returned.

@yarikoptic
Member Author

Interestingly, testing the script on some other repositories seems to suggest that workflows that don't exist anymore shouldn't be returned.

wouldn't that mean that we could potentially lose some logs? e.g. if

  • run tinuous
  • workflowx runs
  • workflowx is removed
  • run tinuous

we would not be able to fetch logs from the workflowx run that happened between the two tinuous runs?

also -- I checked mail archives - it seems that the "Extensions" workflow was still being returned even after its removal, e.g.

2021-03-17T12:05:03-0400 [INFO    ] tinuous Fetching runs for workflow test_extensions.yml (Extensions)
2021-03-17T12:05:04-0400 [INFO    ] tinuous Fetching runs for workflow test_macos.yml (Test on macOS)
2021-03-17T12:05:04-0400 [INFO    ] tinuous Fetching logs from travis
2021-03-17T12:05:04-0400 [INFO    ] tinuous Fetching builds newer than 2021-03-17 02:32:36+00:00

or

2021-03-30T11:00:20-0400 [INFO    ] tinuous Fetching runs newer than 2021-03-30 06:28:07+00:00
2021-03-30T11:00:20-0400 [INFO    ] tinuous Fetching runs for workflow test_crippled.yml (CrippledFS)
2021-03-30T11:00:21-0400 [INFO    ] tinuous Fetching runs for workflow test_extensions.yml (Extensions)
2021-03-30T11:00:21-0400 [INFO    ] tinuous Fetching runs for workflow test_macos.yml (Test on macOS)
2021-03-30T11:00:22-0400 [INFO    ] tinuous Fetching logs from travis

so it seems that they should be returned.

@yarikoptic
Member Author

hey -- we do have all needed information, kinda:

(Pdb) p workflow
{'id': 605374, 'node_id': 'MDg6V29ya2Zsb3c2MDUzNzQ=', 'name': 'Extensions', 'path': '.github/workflows/test_extensions.yml', 'state': 'active', 'created_at': '2020-02-21T01:31:37.000Z', 'updated_at': '2020-02-21T01:31:37.000Z', 'url': 'https://api.github.com/repos/datalad/datalad-extensions/actions/workflows/605374', 'html_url': 'https://github.com/datalad/datalad-extensions/blob/master/.github/workflows/test_extensions.yml', 'badge_url': 'https://github.com/datalad/datalad-extensions/workflows/Extensions/badge.svg'}

so we have 'updated_at': '2020-02-21T01:31:37.000Z' so

  • if we get 500
  • and the file is no longer in the tree
❯ curl -H "Accept: application/vnd.github+json" -H "Authorization: Bearer $(git config hub.oauthtoken)" https://api.github.com/repos/datalad/datalad-extensions/contents/.github/workflows/test_extensions.yml
{
  "message": "Not Found",
  "documentation_url": "https://docs.github.com/rest/reference/repos#get-repository-content"
}
  • and updated_at is before the date we are interested in
    then skip that workflow while issuing a warning.
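The three conditions above could be combined roughly as follows. This is only a sketch: `should_skip` and `parse_github_timestamp` are hypothetical names, the caller is assumed to have already determined whether a 500 occurred and whether the file still exists, and the timestamp format is assumed to match the `updated_at` value shown above:

```python
from datetime import datetime, timezone


def parse_github_timestamp(value):
    """Parse a timestamp such as '2020-02-21T01:31:37.000Z' as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def should_skip(got_500, path_exists, updated_at, since):
    """Return True only if all three conditions from the list above hold."""
    return (
        got_500
        and not path_exists
        and parse_github_timestamp(updated_at) < since
    )


# "since" mirrors the cutoff from the dandi-cli log above.
since = datetime(2022, 9, 30, 21, 3, 18, tzinfo=timezone.utc)
print(should_skip(True, False, "2020-02-21T01:31:37.000Z", since))
```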

@jwodder
Member

jwodder commented Oct 10, 2022

@yarikoptic Detecting whether a 500 error occurred isn't straightforward. We've configured PyGithub to retry all 500 errors (as well as some other 5xx errors) up to 12 times, and only if all attempts fail (after about 12 and a half minutes) is an exception raised. Unfortunately, the requests.RetryError object doesn't seem to store any failed responses or status codes, and the only way to tell what exactly went wrong seems to be to inspect the stringification of the error, which is not robust.
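One possible way to narrow the string inspection, sketched under assumptions: requests appears to wrap urllib3's MaxRetryError as the first argument of RetryError, and that error's `reason` carries the "too many 500 error responses" message seen in the tracebacks above. The helper `exhausted_status` is hypothetical, and it still relies on message text, so it is no more robust than that text is stable:

```python
import re

# Message format taken from the tracebacks above.
_STATUS_RGX = re.compile(r"too many (\d{3}) error responses")


def exhausted_status(exc):
    """Return the HTTP status that exhausted the retries, or None if unknown.

    Assumption: exc.args[0] is the wrapped retry error and has a ``reason``
    attribute; anything else yields None.
    """
    cause = exc.args[0] if exc.args else None
    reason = getattr(cause, "reason", None)
    if reason is None:
        return None
    m = _STATUS_RGX.search(str(reason))
    return int(m.group(1)) if m else None
```

A caller could catch requests.exceptions.RetryError and treat `exhausted_status(e) == 500` as the signal discussed here.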

@yarikoptic
Member Author

We are talking about https://github.com/con/tinuous/blob/master/src/tinuous/github.py#L66 which uses urllib3.util.retry.Retry, right? If so, what about this: you subclass Retry, override increment, do your check there, and if a workflow to skip is found, raise SkipThisWorkflow (a custom exception you add), and then react to that exception in our code to skip it.

@jwodder
Member

jwodder commented Oct 10, 2022

@yarikoptic

overload increment, do your check there

Just check for 500 or for something else? If we only check for 500's, I don't see how that'd be different from not retrying on 500. If we also check details of the workflow at that point, how would increment() be passed anything about the workflow?

@yarikoptic
Member Author

Just check for 500 or for something else? If we only check for 500's, I don't see how that'd be different from not retrying on 500.

only for 500. The difference is that we will perform more checks for that 500 - that the date is still "of interest", etc.

... how would increment() be passed anything about the workflow?

We will have the request URL, wouldn't it carry org/repo/workflow path as part of it within that end-point which is 500ing now?

@jwodder
Member

jwodder commented Oct 10, 2022

@yarikoptic

only for 500. It will be different that we will perform more checks for that 500 - that the date is still "of interest etc".

Assuming that increment() raises an error on 500 and the "more checks" are performed by the code that catches the error, how would we make the code retry on 500's?

We will have the request URL, wouldn't it carry org/repo/workflow path as part of it within that end-point which is 500ing now?

We also need the since timestamp in order to know whether a workflow is old enough to ignore.

@yarikoptic
Member Author

Assuming that increment() raises an error on 500 and the "more checks" are performed by the code that catches the error, how would we make the code retry on 500's?

I envisioned that increment to be overloaded as:

def increment(self, *args, **kwargs):
    ret = super().increment(*args, **kwargs)
    if do_checks_for_the_url_say_it_is_500_to_skip:
        raise SkipThisWorkflow()
    return ret

so it would just behave as the usual Retry otherwise, and raise our ad-hoc exception if it runs into our use case
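A slightly fuller sketch of that idea, under stated assumptions: `SkippingRetry`, `SkipThisWorkflow`, and `skip_policy` are hypothetical names, the URL pattern is inferred from the tracebacks above, and how such an exception would propagate through requests and PyGithub is not verified here:

```python
import re

from urllib3.util.retry import Retry

# Path shape of the failing endpoint, per the tracebacks above.
RUNS_URL = re.compile(r"/actions/workflows/\d+/runs")


class SkipThisWorkflow(Exception):
    """Hypothetical signal that the workflow behind this URL should be skipped."""


class SkippingRetry(Retry):
    # Placeholder policy: never skip. A real policy would need the extra
    # checks discussed above (file removed, old enough relative to "since").
    skip_policy = staticmethod(lambda url: False)

    def increment(self, method=None, url=None, response=None, *args, **kwargs):
        # Let the stock Retry do its bookkeeping first; it raises
        # MaxRetryError itself once the retries are exhausted.
        new_retry = super().increment(method, url, response, *args, **kwargs)
        status = getattr(response, "status", None)
        if status == 500 and url and RUNS_URL.search(url) and self.skip_policy(url):
            raise SkipThisWorkflow(url)
        return new_retry
```

Because the policy is a class attribute, it survives the copies that Retry creates on each increment; a subclass can override `skip_policy` to plug in the real decision.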

We will have the request URL, wouldn't it carry org/repo/workflow path as part of it within that end-point which is 500ing now?

We also need the since timestamp in order to know whether a workflow is old enough to ignore.

we know what workflow we are talking about so we could perform (a possibly duplicate) request to get any information we want to get that since as was shown above, right?

@jwodder
Member

jwodder commented Oct 10, 2022

@yarikoptic

we know what workflow we are talking about so we could perform (a possibly duplicate) request to get any information we want to get that since as was shown above, right?

The since timestamp comes from tinuous' statefile, not from the API.

@yarikoptic
Member Author

The since timestamp comes from tinuous' statefile, not from the API.

so it is in our direct power to access it, although possibly breaking the beauty of the design, right? the ugly workaround to achieve would be to populate some global var or env var accessible to the entire process/all async jobs.

@yarikoptic
Member Author

Let me describe rationale: I am just trying to get all the logs flowing again. I do hope that github team would fix underlying issue, but as we have no estimate on when/if that would happen, I do not want us to hold our breath.

@jwodder
Member

jwodder commented Oct 10, 2022

@yarikoptic I think I have a way to mostly-robustly identify if a workflow run request failed due to a 500, but it still involves waiting to retry all 12 times for 12.5 minutes. Would that be acceptable?

@yarikoptic
Member Author

I think so, unless it would bring us into some other rate limiting or alike problem.

yarikoptic added a commit that referenced this issue Oct 11, 2022
Warn on & skip workflow runs for certain "broken" GitHub workflows
@jwodder
Member

jwodder commented Nov 2, 2022

@yarikoptic GitHub support has replied to my ticket, saying they've fixed the underlying issue. The script I posted above no longer shows any problematic workflows for dandi/dandi-cli, datalad/datalad, or datalad/datalad-extensions.

@yarikoptic
Member Author

cool -- great to see things fixed up! Shouldn't we then revert the skipping added in https://github.com/con/tinuous/pull/151/files#diff-ae556ebe88a885b67c7348cdd53576ce30426c59a5d439d58919251d8865fbb0R116 ?

@jwodder
Member

jwodder commented Nov 2, 2022

@yarikoptic I don't see a need to remove it.

@yarikoptic
Member Author

If github gets broken again, wouldn't we -- as the only force in the universe capable of detecting such cases -- then miss that and not report it to github to get it fixed again?

@jwodder
Member

jwodder commented Nov 2, 2022

@yarikoptic

  • I highly doubt we're the only ones who would notice this kind of problem.
  • The code already emits a warning whenever this sort of problem occurs.
  • If we removed the code and this happened again, we'd end up having to add the code back in while we waited another month for GitHub to fix it. Such churn could be avoided by just keeping the code in.

@yarikoptic
Member Author

@yarikoptic

  • I highly doubt we're the only ones who would notice this kind of problem.

It is ok to doubt, but there is no empirical support for that and we did have to report an issue (so some indirect evidence that issue might have been unknown)

  • The code already emits a warning whenever this sort of problem occurs.

con/tinuous is typically run by a cron job with chronic, so warnings would not alert anyone in time

  • If we removed the code and this happened again, we'd end up having to add the code back in while we waited another month for GitHub to fix it. Such churn could be avoided by just keeping the code in.

true but we would trigger GitHub to fix it, hopefully faster that time. Without this churn we might just keep accumulating missing logs until someone mentions the warning or does some archaeological expedition. In an ideal case -- this error does not re-emerge and we just have assurance that we are not missing any log. So I would prefer to revert the skip.

jwodder added a commit that referenced this issue Nov 2, 2022
This reverts commit ac3116c, reversing
changes made to 655bb1c.
@jwodder jwodder mentioned this issue Nov 2, 2022
2 participants