IP ban during sync #12
I don't think it's a good idea to sync data if your dataset is too old (like starting from that big gap). To be clear, this repo is originally meant to load the original dump.

I have that issue, too. A few days ago I noticed my server kept dying; I found that my account was banned (or something else made my cookies invalid), so the script ran into an infinite loop and ate all the resources. After resolving that issue, my dataset was about 3 days behind, and it took about 1.5 hours to sync. So for you, it may take a long time (maybe about a month) to finish, and during that time the script must not be killed and your RAM must not be eaten up, or you'll have to restart.

#1 asked for the latest database dump a year ago, so I dumped the diff part for him, but that file has expired, and unfortunately my HDD died a few months ago, so the dump data is not kept. If you do need a database dump, please let me know and I'll do that for you. I need to schedule it to make sure the other things on that server won't be significantly affected, and it may take some time since the example site is running on a server with poor performance.
hmmm so, what I do (the changes are stored in a private repo) in general is
for memory, I think the import should be split at certain milestones (let's say every 1000 API calls). so basically, to cater to this issue, I need to slowly update data from page 10k, 9k, 8k, etc.? I added that offsetting to help me start with ID 1.6M/1.5M. also, I thought you did a "regular yearly dump" haha, I see by the other issue's dump thing then
Yep, if you can grab all the needed galleries, it's better to do it ascending, so that you can easily recover your progress when anything goes wrong (and it's much better to write data directly into the database instead of waiting until it finishes and then writing to a file; I use that way since I already wrote an import script and don't want to duplicate it, but it's a big issue when you're syncing a lot of galleries). Or, in another way, you can make a list of pending galleries: when a gallery is done, remove it from the list and write it somewhere safe, and when an error occurs, you can use that list to continue your progress (and yes, there's a

If you need to speed up (or prevent being banned too fast), you can find some HTTP proxy server to forward your requests; I think some scripts support that natively (by putting a

Well, I don't do any scheduled backup (and I didn't expect anyone would use this lol), so when the user asked for a SQL dump, I had to dump it manually (and removed unused fields on the server). BTW, I checked the database just now; in total it's about 1.1 GB (but I added some indexes). If you're just using the data (the gallery search api is very slooooooooow), maybe it's better to optimize the structure or change the database engine (I use MyISAM, but I think it's not a modern choice), or consider using another database engine like Oracle or Elasticsearch (and with a bunch of servers as a cluster lol).
i'll do, say, at most ~1200 (maybe lower) pages a day, and i'm planning to use a proxy after this one, to reduce the "detection". (it's just sad that the source for initializing the DB is indeed that one and only mega link lol) say, does changing gallery category into an integer make it "better" in terms of storage?
It might be. I think E-Hentai stores it as an integer; once you use the category filter on E-Hentai and check the query string, you'll know what I mean, and that's why the api supports querying with a combined bit (though the behaviour is opposite to E-Hentai's; like for doujinshi only, E-Hentai is But I'm not using it since I didn't know that at the time, so I store it as-is.
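For illustration, a bitmask category column could look like the sketch below. The specific bit values mirror E-Hentai's `f_cats` parameter as I understand it, so treat them as an assumption and verify against the query string the site actually produces:

```javascript
// Assumed category → bit mapping (verify against E-Hentai's f_cats values).
const CATEGORY_BITS = {
  'Misc': 1,
  'Doujinshi': 2,
  'Manga': 4,
  'Artist CG': 8,
  'Game CG': 16,
  'Image Set': 32,
  'Cosplay': 64,
  'Asian Porn': 128,
  'Non-H': 256,
  'Western': 512,
};

// Combine several categories into one integer, suitable for a compact
// column and bitwise-AND filtering in queries.
const mask = (...names) => names.reduce((m, n) => m | CATEGORY_BITS[n], 0);

mask('Doujinshi', 'Manga'); // → 6
```

Storage-wise this turns a string column into a small integer, and a query like "doujinshi or manga" becomes a single `category & 6 != 0` check instead of string comparisons.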
Yep,
According to the official wiki, 4-5 requests per ~5 seconds is fine.

By the way, I dumped my database just now; you can import it directly (or modify the structures before importing), so that you won't need to sync again (and stress E-Hentai's servers). I thought it might take a long time to export, but it seems not that long. https://github.com/ccloli/e-hentai-db/releases/tag/v0.3.0-29%2Bg1bac4cf

If you need to import the sql, it's better to back up your database first (or import into a new database), since the sql is using
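To stay under the ~4-5 requests per ~5 seconds mentioned above, a simple fixed-delay throttle is enough. This is a sketch, not the repo's code; the callback is whatever does the actual API request:

```javascript
// At most 4 requests per 5 seconds → wait ~1.25s between requests.
const DELAY_MS = 1250;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run fn over items sequentially, pausing between calls so we stay
// comfortably under the wiki's stated rate limit.
async function throttledMap(items, fn, delayMs = DELAY_MS) {
  const results = [];
  for (const item of items) {
    results.push(await fn(item));
    await sleep(delayMs);
  }
  return results;
}
```

A fixed delay is cruder than a sliding-window limiter, but for a long-running one-machine sync it is the easiest thing to reason about.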
i remapped some of the numbers (i think private is below misc on mine) but dw about that lol, i can handle that on my own. i can change the INSERTs into REPLACE so it's easier to handle if needed. i'm sure the DB format is still based on this repo as a whole, right? i can tinker with some of the dump statements based on that 👌
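The INSERT-to-REPLACE rewrite mentioned above is a one-line transform over the dump text, assuming the dump uses standard statement-leading `INSERT INTO` lines (a sketch, not part of the repo):

```javascript
// Turn `INSERT INTO` statements into MySQL's `REPLACE INTO`, so importing
// over an existing database overwrites matching rows instead of failing
// on duplicate keys.
function insertToReplace(sql) {
  // Anchor at line start so string literals containing "INSERT INTO"
  // inside row data are left untouched.
  return sql.replace(/^INSERT INTO/gm, 'REPLACE INTO');
}
```

For a dump around 1 GB you'd want to stream it line by line (e.g. with Node's `readline` over a file stream) rather than load the whole file into memory.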
BTW, if you'd like to do a scheduled update, here is my crontab, and you can modify it for your case:

```
0 */1 * * * cd /var/www/e-hentai-db/ && npm run sync exhentai.org && npm run torrent-import exhentai.org && npm run torrent-sync exhentai.org 1
30 */6 * * * cd /var/www/e-hentai-db/ && npm run resync 48
15 */12 * * * cd /var/www/e-hentai-db/ && npm run sync exhentai.org 24
```
One thing I forgot to mention is that, for now, I don't have a strong will to improve the current syncing script and SQL query statements; though they're a bit messy, they work for now. So at least for now the repo won't have any big updates, and you can do anything you want without worrying about merging newer features from this repo, because there will be no newer features. I do have some ideas, but for now I don't have the time and energy to do them. If you'd like to do more things in your repo, here are some ideas (these are just ideas, not things you need to finish; I also note them here in case I forget them in the future):
idk if anyone is still working on this but here is an idea to improve the syncing process: I'd implement it myself but I really don't wanna work with js 😭
actually nvm, I was wrong about how pages work; it's probably way easier to just sync a limited amount of ids at a time with the next thing
i know it looks ugly as fuck but i made a thingy here
Hello, I triggered an IP ban error during gallery metadata fetching (I think from ID 2.0M to 1.9M at that point).
Is there any solution aside from syncing the huge thing? I kinda wonder about maintaining this as well. (If we need to talk about this privately, let me know as well, thanks!)