Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add concurrency for "download-syntax" module & support many args for --only parameter #19

Closed
wants to merge 1 commit into from
Closed

[WIP] Add concurrency for "download-syntax" module & support many args for --only parameter #19

wants to merge 1 commit into from

Conversation

senpos
Copy link
Contributor

@senpos senpos commented Aug 16, 2020

  • added concurrent (ThreadPoolExecutor) downloads for the _download command, had to refactor it a bit and break into smaller pieces
    On 4 cores, python .\bin\download-syntax download:
    Before: 33-47s
    After: 4.9-5s
  • added concurrent processing (ThreadPoolExecutor) for the _update command
    On 4 cores, python .\bin\download-syntax update:
    Before: 75-82s
    After: 37-39s
    Well, not THAT huge difference, but I suppose moby/moby repo is the bottleneck here, it takes forever to process it
  • if one concurrent task fails - log to the stderr, other tasks continue to work; we don't want to waste things we already processed :-)
  • replaced all path strings with pathlib.Path objects when possible; it allows to make reproducible results on both Linux and Windows (slashes/backslashes, uh-oh)
  • added encoding='utf-8' parameter to all open calls, because on some locales on Windows it creates files with incorrect symbols
  • added support for passing multiple arguments to the --only parameter for both _download and _update commands. Now you can do python .\bin\download-syntax download --only MagicStack/MagicPython asottile/language-xml or python .\bin\download-syntax download --only MagicStack/MagicPython --only asottile/language-xml, both "styles" should work (warning: it is a 3.8+ feature (scroll down to the extend action)

Yes, I messed up and all the changes were added to a single commit. Sorry!
If you are interested in something particular from the list - please, let me know and I will create a separate PR for it.

If it looks good enough - I may add changes to the README. Let me know what you think. :-)

P. S. I learned my lesson

@senpos
Copy link
Contributor Author

senpos commented Aug 16, 2020

Oops, there's no cp38m-win_amd64 wheel for editdistance and I don't want to install Build Tools for Visual Studio on this machine. Will boot from my laptop tomorrow and fix it, sorry. :-)

Copy link
Owner

@asottile asottile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be split into 4-5 separate changes instead of all in one

some things:

  • this is only run interactively and speed isn't really an issue (even after converting to a threadpool it looks like it only improved it by ~40% -- I don't think the introduced complexity is worth that). if this were in the critical path of something then sure the change is worth it
  • I personally don't like pathlib, you can accomplish all the same things with os.path.join -- additionally I see some as_posix() which means pathlib is actually harmful to correctness here :)
  • there are far too many changes in this patch to review it properly, I started reviewing it but then realized just how long this is -- ideally keep things in small, reviewable changes (and only move code around when necessary)

_GRAMMAR_DIR = 'share/babi/grammar_v1'
_LICENSE_DIR = 'licenses'
_GRAMMAR_DIR = Path('share/babi/grammar_v1')
_LICENSE_DIR = Path('licenses')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like pathlib sorry

parser.add_argument('--only')
# "extend" is a 3.8+ action, which allows passing multiple arguments
# to the same command, e.g. ./app --foo 1 2 3 -> {'foo': ['1', '2', '3']}
parser.add_argument('--only', action='extend', nargs='+', type=str, default=[])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

action='append' is a better experience anyway

')\n'
'REPOS = (\n' +
''.join(f'{textwrap.indent(repr(r), " " * 4)},\n' for r in up_to_date_repos) +
')\n'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this change indent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, sorry, I probably messed up something in my IDE.

@senpos
Copy link
Contributor Author

senpos commented Aug 16, 2020

Hi Anthony!

Thanks for the comments! :-)

Yeah, I agree, mixing all the things into one huge PR was a mistake.

I will answer your questions anyway and close this PR as it is not unsuitable for the merge. It is a good experience for me though!

it only improved it by ~40%

It is almost x10 boost for the download command (5 seconds vs 40+ seconds). But for the update - yes, it is not necessary, I agree.

additionally I see some as_posix() which means pathlib is actually harmful to correctness here :)

While I understand that it is a personal preference either to use pathlib or not, but could you give me an insight why as_posix is harmful? The goal was to generate the same output on both Linux and Windows machines.
Using the original os.path.join method script produces strings with / and \ mixed: share/babi/grammar_v1\scope.terraform.json

Thanks for your time!

@asottile
Copy link
Owner

oh yeah, as_posix generally means that you're ignoring pathlib's normalization (kinda the point of pathlib in the first place!)

I think a lot of parts of this are salvageable though and could be split out into separate PRs! splitting commits after the fact can be kind of trick -- the way I've done it in the past ~essentially involves:

git checkout -p HEAD^  # interactively choose reversions from the previous patch
git diff --staged > patch  # save the reverted bits to a patch
git commit --amend  # modify the existing commit
git apply --reverse patch  # apply the reverse of the reversions
git commit ...  # make a new commit

@senpos
Copy link
Contributor Author

senpos commented Aug 16, 2020

Thanks, Anthony!

could be split out into separate PRs

That's what I will do. :-)
So, according to your comments, I plan to create PRs for:

  • encoding='utf-8' thingy
  • ThreadPoolExecutor only for the _download command, because it has the most impact (5s vs old 40+s)
  • allow to pass multiple args to only parameter using append action

I am closing this PR ^^

@senpos senpos closed this Aug 16, 2020
@asottile
Copy link
Owner

sounds great!

@senpos senpos deleted the concurrency-and-multiple-only-args branch August 18, 2020 09:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants