
Use multithreading to fetch json faster #41

Merged
17 commits merged on Nov 25, 2022

Conversation

CafeAuLait-CC
Contributor

@CafeAuLait-CC CafeAuLait-CC commented Oct 3, 2022

Hello,

As s3s and stat.ink are both under active development, I'm still using the -o parameter to back up my data to local files quite often. Since fetch_json() for ALL data now takes forever to run, I added multithreading support for it.

I tried setting max_workers to 10, and the total running time with the -o parameter dropped from 4m45s to 45 seconds for me. So I hope this is useful.
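For context, the change described above can be sketched with Python's concurrent.futures. This is a minimal, hypothetical stand-in, not the actual s3s code: fetch_one and URLS are invented names, and the real version would make HTTP requests to SplatNet 3.

```python
# Sketch of capping parallel fetches with a worker pool.
# fetch_one and URLS are hypothetical stand-ins for the s3s internals.
from concurrent.futures import ThreadPoolExecutor

def fetch_one(url):
    # In s3s this would be an HTTP request to SplatNet 3;
    # here we just echo the URL so the sketch is runnable.
    return f"fetched {url}"

URLS = [f"https://example.invalid/battle/{i}" for i in range(6)]

# max_workers bounds how many requests run concurrently.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch_one, URLS))

print(len(results))  # one result per input URL
```

Lowering max_workers is the knob for the rate-limiting concern discussed below: the pool never has more than that many requests in flight at once.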

@clovervidia
Collaborator

I do agree that it does take some time to download the JSONs from SplatNet 3, but I'm not sure if using multithreading to speed it up is a good idea. We're walking a fine line in terms of accessing SplatNet 3, and I think going any faster with our requests might raise some red flags. I don't want to risk anyone's account getting banned.

@CafeAuLait-CC
Contributor Author

CafeAuLait-CC commented Oct 5, 2022

@clovervidia I totally understand your concern. I was thinking that when opening a webpage in a modern web browser, it's a normal strategy for the browser to send dozens of requests in the background and preload the links on that page. So I assumed a small max_number_of_threads would be fine. Even 2 would make it twice as fast as a single for-loop.

But it's up to you to make the decision; I'm absolutely OK with it either way. After all, -o is just a temporary feature, right?

@frozenpandaman
Owner

Sorry for the late reply, I've been traveling.

I agree with @clovervidia that we don't want to do anything that the official app can't do. If we add multithreading, limiting the concurrent threads to 2 would probably be best. I don't think this should necessarily be closed – but @clovervidia and I should probably discuss further.

@clovervidia
Collaborator

Yeah, we should probably discuss this. Limiting it to 2 or 3 threads might be all right.

@CafeAuLait-CC
Contributor Author

Please feel free to reopen this PR or keep it closed for now, whichever you prefer.

@niyari
Contributor

niyari commented Oct 7, 2022

I really like the idea of using threads. After I got the whole thing working, I was preparing a PR to do it that way too.
For example, it would let display and capture be handled separately in any scene.

@frozenpandaman frozenpandaman reopened this Oct 9, 2022
@frozenpandaman frozenpandaman mentioned this pull request Oct 10, 2022
@frozenpandaman
Owner

Thanks a lot for keeping this updated @CafeAuLait-CC! Will probably limit the number of threads to 2 but will plan to take another closer look by this weekend and hopefully get a version of this merged in.

@CafeAuLait-CC
Contributor Author

Cool! I have just updated it and set the number to 2. Feel free to squash the commits when you are merging it.

@frozenpandaman
Owner

frozenpandaman commented Nov 20, 2022

So sorry to make you keep having to fix the merge conflicts. 😅 What is really holding things up is that I have a ton of local changes that aren't public/present in master yet – namely, support for Salmon Run. As soon as Hina adds the stat.ink web UI so you can actually see uploaded SR results (he told me it would take "maybe less than a week"!), I'll update to v0.2.0 and push all those changes. I can then immediately merge this PR in as well, so it can go along with that update. And the areas of code those touch vs. this PR touches are pretty different so I don't think there will be any conflicts then, but if there are I'm happy to resolve them myself.

Just, for now, I'm wary (really due to my own lack of Git confidence, sorry) about merging this and then having to try and pull these new upstream changes into my own non-committed/draft local version of the codebase… I think stuff would just get messy. But as soon as v0.2.0 goes out I will pull this in with it!

@frozenpandaman
Owner

frozenpandaman commented Nov 20, 2022

Also, would you mind explaining what's going on with the "passing in lists the length of battle_ids" parts in these two lines? I'm also curious what just the bare swim does here, like, referring to the function but…?
https://github.com/CafeAuLait-CC/s3s/blob/8d60a6a8ac2b541af318695a41db676798e443ba/s3s.py#L274-L276

Relatedly, why is swim an argument in fetch_detailed_result()?
https://github.com/CafeAuLait-CC/s3s/blob/8d60a6a8ac2b541af318695a41db676798e443ba/s3s.py#L302

Never done anything with concurrency in Python, really, so just for my own understanding if you have the time to just give a quick explanation that'd be cool!

@CafeAuLait-CC
Contributor Author

CafeAuLait-CC commented Nov 20, 2022

Don’t worry and take your time. I’m looking forward to seeing the new features, like support for Salmon Run and Tri-color Turf War (in Hina’s plan), come out. Most of the conflicts are in the import section, so they’re not hard to fix at all.

As for your questions, thread_pool.map() takes iterables as arguments: the function the threads will run, plus one iterable per parameter of that function. So each thread looks like:

thread     function to run         arg1    arg2          arg3
Thread 0   fetch_detailed_result   True    battle_id 1   swim
Thread 1   fetch_detailed_result   True    battle_id 2   swim

The arg columns are the lists passed in; each thread takes one element from each list and uses them as the arguments for the fetch_detailed_result() function.

So Thread 0 runs fetch_detailed_result(arg1[0], arg2[0], arg3[0]), and Thread 1 runs fetch_detailed_result(arg1[1], arg2[1], arg3[1]).

I’m not sure whether my explanation is clear enough. You can find a simple example of using map() in the “Python ThreadPoolExecutor map” section at this link.

And the reason swim is an argument in fetch_detailed_result() is simply that swim() needs to be called right after each detailed record is fetched from the server, to make sure the progress bar works as expected. In your original code, where a loop is used instead of multithreading, swim() is called inside the loop at every iteration. So in the parallelized version, swim() needs to be called in every thread.
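The multi-iterable map() pattern described above can be sketched like this. The names mirror the explanation but are illustrative stand-ins, not the actual s3s implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def swim():
    # Stand-in for the progress-bar tick; called once per fetched record.
    print(".", end="")

def fetch_detailed_result(is_first, battle_id, progress_callback):
    # Stand-in for the real fetch; ticks the progress bar right after
    # "fetching" one record, just as described above.
    result = f"details for {battle_id}"
    progress_callback()
    return result

battle_ids = ["battle-1", "battle-2", "battle-3"]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Each positional iterable supplies one argument per task:
    # task i runs fetch_detailed_result(True, battle_ids[i], swim).
    results = list(pool.map(fetch_detailed_result,
                            [True] * len(battle_ids),
                            battle_ids,
                            [swim] * len(battle_ids)))
```

This is why the lists are "the length of battle_ids": map() zips the iterables into one argument tuple per task, so they must all be the same length, and the bare swim (no parentheses) is passed as a function object so each worker can call it itself.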

@frozenpandaman frozenpandaman changed the base branch from master to multithreading November 25, 2022 17:02
@frozenpandaman frozenpandaman merged commit 08b8872 into frozenpandaman:multithreading Nov 25, 2022
@frozenpandaman
Owner

frozenpandaman commented Nov 25, 2022

ok lol looks like i still made the commit history super gross when merging the multithreading branch into master but whatever. merged!

thank you, @CafeAuLait-CC, and for your explanation above as well, super appreciated! :)

@CafeAuLait-CC
Contributor Author

lol Thank you! And great to see v0.2.0 finally out!
