[imgur] grab images from imgur #99
Comments
Wrong link, I think. |
@typhoon71
I can't see any imgur links at https://github.com/dteviot/WebToEpub/issues/new. Did you post the correct link?
The Japtem site has imgur images and slideshows. The existing parser is
able to handle links to individual images, but I have not yet spent time
trying to figure out how to parse a slideshow. i.e. the parser has no problem
with this page http://japtem.com/projects/dd-toc/v1-illustrations/, with the
exception of the slideshow, which just duplicates the other images on
the page.
Please provide link to page that has problem.
…On Mon, Dec 26, 2016 at 8:37 PM, typhoon71 ***@***.***> wrote:
I'm pretty sure there was talk about this but I can't seem to find it any
more so I made a separate issue for sake of visibility.
An example would be "https://github.com/dteviot/WebToEpub/issues/new",
where there are various image -links- like "http://i.imgur.com/AAFDx1B.jpg"
that don't get grabbed (or shown).
There's also the more complex case of linked imgur galleries, where you
have a link to an imgur gallery with all the illustrations before the novel.
|
How I put "https://github.com/dteviot/" as a link is beyond me; obviously it's wrong, sorry. |
@typhoon71 Thanks for the corrected links.
<h6><TL Note: <a href="http://imgur.com/K4CZyyP.jpg">Insert image</a> ></h6>
This is a hyperlink with no image tag, so the parser leaves it alone. In this particular case, I think it's a note for someone to add the image tag. Creating a parser to analyse "bare" hyperlinks and figure out what to do with them in the general case is "very hard and error prone". At this time, I'm not going to attempt to handle this.
"http://skythewoodtl.com/fanfic-gifting/" is a more interesting case. Currently, the plugin analyses the first page given and selects the parser based on that. To solve this problem, the plugin needs to check each page and select the parser on a page-by-page basis. This may require a significant rewrite of the plugin, i.e. the text page links need to be processed with the WordpressBase parser, while the imgur links should be processed with an imgur parser (which I need to create). |
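To illustrate the "select the parser on a page by page basis" idea, here is a minimal sketch. It is not the plugin's actual code: the `HostParserRegistry` class, its methods, and the parser names used as strings are all made up for illustration. The only point is that the lookup happens once per chapter URL instead of once per book.

```javascript
// Hypothetical sketch: class and method names are illustrative only,
// they are NOT WebToEpub's real parserFactory API.
class HostParserRegistry {
    constructor(fallbackName) {
        this.byHost = new Map();
        this.fallbackName = fallbackName;
    }

    // Associate a hostname with the name of the parser that should handle it.
    register(hostName, parserName) {
        this.byHost.set(hostName.toLowerCase(), parserName);
    }

    // Choose a parser for ONE page, rather than once for the whole book.
    parserNameFor(pageUrl) {
        let host = new URL(pageUrl).hostname.toLowerCase();
        return this.byHost.get(host) || this.fallbackName;
    }
}

let registry = new HostParserRegistry("DefaultParser");
registry.register("skythewoodtl.com", "WordpressBaseParser");
registry.register("imgur.com", "ImgurParser");

// Each chapter URL gets its own lookup.
let chapterUrls = [
    "http://skythewoodtl.com/fanfic-gifting/",
    "http://imgur.com/a/f7Ezg?grid"
];
let plan = chapterUrls.map(url => ({ url: url, parser: registry.parserNameFor(url) }));
console.log(plan);
```

With something like this, a chapter list that mixes text pages and imgur galleries would send each URL to a different parser.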
|
Just an idea about the bare imgur links. Since right now they get grabbed as This way I could use the calibre editor's ability to download external resources (automatically). Would this be possible (with some if-then cycle) or would it require too much work? Just asking. |
Yes, it could be done.
However, this isn't a general case fix, it's a specific case one. And in the above example, you'd probably want to remove the h6 tag and the translator note. Also, it won't fix Imgur links to galleries. (Note, the ZirusmusingParser does handle Imgur galleries, but it needs more work, as the HTML doesn't always include all images in the gallery.) |
mmm, then I suppose it's faster replacing the tags in calibre. |
See: #99 Also added SkythewoodtlParser, to fetch the imgur galleries. Note, imgur gallery links need to end with ?grid to make sure all images are included.
@typhoon71
As usual, changes have been checked into Experimental branch. |
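For anyone editing chapter URLs by hand, the "?grid" requirement can be sketched as a small helper. This is an assumption-laden example, not code from the Experimental branch: `ensureGridView` is a made-up name, and it assumes gallery links have a path that starts with "/a/".

```javascript
// Hypothetical helper, not taken from the WebToEpub source.
// Assumes imgur gallery (album) links have a path starting with "/a/".
function ensureGridView(galleryUrl) {
    let url = new URL(galleryUrl);
    let isImgurHost = (url.hostname === "imgur.com") || url.hostname.endsWith(".imgur.com");
    if (!isImgurHost || !url.pathname.startsWith("/a/")) {
        return galleryUrl;   // not a gallery link, leave untouched
    }
    url.search = "?grid";    // replaces any existing query string
    return url.href;
}

console.log(ensureGridView("http://imgur.com/a/f7Ezg"));
```

Links that already end with "?grid", and links that are not imgur galleries, come back unchanged.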
Thanks! A lot! Nice new year present for me.
|
I like to think it's a late Christmas present. Hopefully it will save you some time.
Good question. Probably the simplest fix is to add the Skythewoodtl logic into CrimsonMagicParser.
Add the line
    parserFactory.register("skythewoodtl.com", function() { return new CrimsonMagicParser() });
after the line
    parserFactory.register("crimsonmagic.me", function() { return new CrimsonMagicParser() });

findContent(dom) {
    // A page that is itself an imgur gallery is converted directly.
    if (ImgurParser.isImgurGallery(dom)) {
        return ImgurParser.convertGalleryToConventionalForm(dom);
    }
    let content = super.findContent(dom);
    if (content != null) {
        let that = this;
        // Replace each hyperlink selected by isHyperlinkToReplace with an <img> tag.
        let toReplace = util.getElements(content, "a", that.isHyperlinkToReplace);
        for(let hyperlink of toReplace) {
            that.replaceHyperlinkWithImg(hyperlink);
        }
    }
    return content;
}
Are you talking about removing images from the galleries that are also included in the chapter text? If so, I've been thinking about that. Solution might be to have a node.js script that does custom processing of epubs after they've been created. |
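A rough sketch of what the duplicate detection part of such a node.js script might look like. Everything here is hypothetical: `findDuplicateImages` is a made-up name, the input is assumed to be already extracted from the epub as `{ name, data }` objects, and only byte-identical files are detected (the same picture saved at a different size or quality would not match).

```javascript
const crypto = require("crypto");

// Hypothetical sketch: "images" is an array of { name, data } objects, where
// data is a Buffer holding the file's bytes. Reading and rewriting the epub
// (a zip archive) is deliberately left out.
function findDuplicateImages(images) {
    let firstSeen = new Map();   // content hash -> name of first image with that content
    let duplicates = [];
    for (let image of images) {
        let hash = crypto.createHash("sha256").update(image.data).digest("hex");
        if (firstSeen.has(hash)) {
            duplicates.push({ name: image.name, duplicateOf: firstSeen.get(hash) });
        } else {
            firstSeen.set(hash, image.name);
        }
    }
    return duplicates;
}

// Made-up sample data, standing in for files extracted from an epub.
let sample = [
    { name: "chapter1/img0.jpg", data: Buffer.from("aaa") },
    { name: "gallery/img0.jpg", data: Buffer.from("aaa") },
    { name: "gallery/img1.jpg", data: Buffer.from("bbb") }
];
console.log(findDuplicateImages(sample));
```

The returned list says which files could be dropped and which earlier file they duplicate; fixing up references inside the chapters would still be a separate step.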
@typhoon71 |
First thing, thanks a lot. ;)
Here's what happens:
1) The complete image gallery link is gone from the fetched links: the previous version would save an "Image" chapter at the end, after the epilogue, with all the images based on it.
2) Some images are not fetched, like
3) At the start of the prologue or inside a chapter (volume 9, for example) there's a link to an imgur gallery: this is not processed. If the link is the same as the general gallery then it's fine, one would delete it from the epub.
Did you notice it? Now it gives an error (expected image, got html). |
Argh! I must move that button. |
@typhoon71 |
You will find http://imgur.com/DZYuHnc in http://skythewoodtl.com/g93/. All the images that exhibit that behaviour are in volume 9, located at the end of http://skythewoodtl.com/fanfic-gifting/. |
D'oh! I combined both the "turn imgur link into an image" and the "fetch imgur link as gallery" logic.
Actually, it IS processed, but it's turned into an <img> tag, which the image collector then fetches, thinking it's a single image. The image collector then gets the HTML for a gallery and doesn't know how to handle it. (Which is why you get the error message.) When the "turn imgur link into image" logic is modified as described in the above step, the gallery link isn't converted and you don't get the error. Of course, you don't get the gallery either, but then the existing code only fetches a gallery if the gallery link is in the "chapter URLs". (Because to get all the images to put in the gallery, the logic needs to fetch the HTML from the image gallery and parse it.) The page parsing logic is currently not able to fetch additional pages.
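The mix-up described above comes down to telling three kinds of imgur link apart before deciding what to do with a hyperlink. A minimal sketch, under stated assumptions: `classifyImgurLink` is a made-up name, gallery links are assumed to have a path starting with "/a/" or "/gallery/", and direct image links are assumed to end in a common image extension. It is not the logic that is actually in the parser.

```javascript
// Hypothetical classifier, not the actual WebToEpub code.
// Assumptions: gallery links have a path starting with "/a/" or "/gallery/";
// direct image links end in a common image file extension.
function classifyImgurLink(href) {
    let url;
    try {
        url = new URL(href);
    } catch (e) {
        return "other";      // not an absolute URL
    }
    let isImgurHost = (url.hostname === "imgur.com") || url.hostname.endsWith(".imgur.com");
    if (!isImgurHost) {
        return "other";
    }
    if (url.pathname.startsWith("/a/") || url.pathname.startsWith("/gallery/")) {
        return "gallery";    // needs its HTML fetched and parsed
    }
    if (/\.(jpe?g|png|gif)$/i.test(url.pathname)) {
        return "image";      // safe to turn into an <img> tag
    }
    return "page";           // an HTML page on imgur, not usable as <img src>
}

console.log(classifyImgurLink("http://imgur.com/K4CZyyP.jpg"));
console.log(classifyImgurLink("http://imgur.com/a/f7Ezg?grid"));
console.log(classifyImgurLink("http://imgur.com/DZYuHnc"));
```

Only links classified as "image" would be candidates for replacement with an <img> tag; anything else handed to the image collector would return HTML and trigger the error.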
|
Hmmm.
You know, I'm thinking it might be useful to send the owner of these sites a note about the links and how to fix them. emails are skythewood@gmail.com and re.yun.NS@gmail.com. (Probably easier and faster than waiting for me to update the parser.) |
Will check the dirty fix and report later. |
Not a (late) report actually, I had no time to pack epubs lately (so I didn't read stuff either, sniffle). I found that this imgur album gives an error while trying to fetch it. I don't know if it's caused by the Amazon issue that was around yesterday, but while imgur now works again I can't fetch this gallery. Note that I was able to fetch it in the past: I am redoing a couple of epubs and noticed it. Oh, I will try to do what I said in my 2-month-old post just above... hopefully. |
@typhoon71 |
If I open the page directly I can see the gallery normally (a page with a lot of images), but I can't grab it with webtoepub from Skythewood. Other imgur links on Skythewood are fine, it's just this one. This is the error: Could not find content element for web page 'http://imgur.com/a/f7Ezg?grid'. Error: Could not find content element for web page 'http://imgur.com/a/f7Ezg?grid'. I can't think of anything that could cause this, just that the images seem to be different from what I remember (they weren't 2-page scans before). |
@typhoon71
|
You got mail... |
Thanks. |
Good to know it wasn't me having cache issues for one time... XD |
Correctly recognize imgur host names like s.imgur.com, api.imgur.com etc.
Well, this is embarrassing. The faulty code assumed that imgur used "imgur.com" or "i.imgur.com" as the only hostnames. If the new version doesn't work, there are two things we can try. Second option,
static findImagesList(dom) {
    // Ugly hack, need to find the list of images as image links are created dynamically in HTML.
    // Obviously this will break each time imgur change their scripts.
    console.log("Searching for imgur images from " + dom.baseURI);
    for(let script of util.getElements(dom, "script")) {
        let text = script.innerHTML;
        let index = text.indexOf("\"images\":[{\"hash\"");
        if (index !== -1) {
            console.log("Found start of images in JSON");
            // Skip the 9 characters of "images": so text starts at the array.
            text = text.substring(index + 9);
            let endIndex = text.indexOf("}]");
            if (endIndex !== -1) {
                console.log("Found end of images in JSON");
                return JSON.parse(text.substring(0, endIndex + 2));
            } else {
                console.log("Unable to find end of images in JSON");
                return;
            }
        }
    }
    console.log("Unable to find start of images in JSON");
}
It adds logging to say what's going wrong.
This logging should help me figure out where the problem is. |
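As for the hostname mistake mentioned at the start of this comment, the idea of the fix can be sketched like this. It is only an illustration: `isImgurHostName` is a made-up name and the real change in ImgurParser may be written differently. The check accepts "imgur.com" itself plus any subdomain, rather than a fixed list of two names.

```javascript
// Hypothetical sketch of the hostname check; the real fix may differ.
function isImgurHostName(hostName) {
    let name = hostName.toLowerCase();
    // Exact match, or any subdomain such as i.imgur.com, s.imgur.com, api.imgur.com.
    return name === "imgur.com" || name.endsWith(".imgur.com");
}

for (let host of ["imgur.com", "i.imgur.com", "s.imgur.com", "api.imgur.com", "notimgur.com"]) {
    console.log(host + " -> " + isImgurHostName(host));
}
```

Matching on ".imgur.com" with the leading dot matters, otherwise a host like "notimgur.com" would be accepted too.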
Updated, checked, it doesn't work.
|
That's the problem. It's a blogspot site, so the blogspot parser is used. (Weird thing about Blogspot: depending on where you are in the world, it changes the host. If I try using the URL you gave (http://skythewood.blogspot.it/p/youjo-senki.html) it sends me to "http://skythewood.blogspot.co.nz/p/youjo-senki.html".) You can do a quick hack to the Blogspot parser to get it to support Imgur.
findContent(dom) {
    if (ImgurParser.isImgurGallery(dom)) {
        return ImgurParser.convertGalleryToConventionalForm(dom);
    }
    let content = BlogspotParser.FindContentElement(dom);
    if (content == null) {
        content = util.getElement(dom, "div", e => e.className.startsWith("entry-content"));
    }
    return content;
}
i.e. add this to the start of the function:
    if (ImgurParser.isImgurGallery(dom)) {
        return ImgurParser.convertGalleryToConventionalForm(dom);
    }
|
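Because of the country redirect described above, anything that picks a parser by hostname has to treat "blogspot.it", "blogspot.co.nz" and so on as the same site type. A loose sketch of such a check, with the caveat that `isBlogspotHostName` is a made-up name and the suffix pattern is a guess, not how the plugin really recognises Blogspot:

```javascript
// Hypothetical helper: treats any "<name>.blogspot.<suffix>" host as Blogspot.
// The suffix pattern (one or two labels of 2-3 letters) is an assumption
// and is deliberately loose.
function isBlogspotHostName(hostName) {
    return /^[a-z0-9-]+\.blogspot(\.[a-z]{2,3}){1,2}$/i.test(hostName);
}

console.log(isBlogspotHostName("skythewood.blogspot.it"));
console.log(isBlogspotHostName("skythewood.blogspot.co.nz"));
console.log(isBlogspotHostName("skythewoodtl.com"));
```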
OK, I changed the code in BlogspotParser.js; the ImgurParser.js has updated code too. On a side note, I noticed that if I don't use the "?grid" at the end of the imgur link... it works anyway! |
It depends on the number of images in the gallery. |
Thanks for the explanation; since all 17 images are grabbed, it should be the first 20 images. |
CrimsonMagicParser now replaces imgur hyperlinks with the gallery contents.
@typhoon71
|
Last branch seems fine, also it's nice to have it automatically add "?grid". I did find an issue here "https://bakapervert.wordpress.com/vol-19/": both the Illustration links point to a page with an imgur link inside it; the thing is that even if I manually add that imgur link to the Chapter list it doesn't fetch it. I can't seem to just get those 2 imgur links/galleries alone either. |
That's because only the CrimsonMagicParser currently has handling for Imgur pages/links. And that parser only recognises https://crimsonmagic.me/ and http://skythewoodtl.com/.
    parserFactory.register("bakapervert.wordpress.com", function() { return new CrimsonMagicParser() });
|
OK, binding done. It does indeed grab the images now, which is what I wanted. The only issue is that the pics get cut; I will check later if putting the imgur link in a separate chapter link solves this (removing the dupes from the epub later). --> Checked, it works. Also, thanks a lot for working on this (one should at least say "thanks" sometime, right?) |
I'm not sure what you mean by this. Can you provide more details, please?
You're very welcome. But, yes, it's nice getting thanks. |
More info about the "pics get cut" bit:
This doesn't happen if you grab the imgur gallery directly (putting a direct link to the imgur gallery in the chapter list). [ In fact I just did that for v19. For v20 I kept the original link, added a direct one and removed the dupes manually. ] |
I'm pretty sure there was talk about this but I can't seem to find it any more so I made a separate issue for sake of visibility.
An example would be "https://crimsonmagic.me/2016/11/16/gifting-10-1/" [FIXED LINK], where there are various image -links- like "http://imgur.com/K4CZyyP.jpg" that don't get grabbed (or shown).
There's also the more complex case of linked imgur galleries like on "http://skythewoodtl.com/fanfic-gifting/" [ADDED LINK], where you have a link to an imgur gallery with all the illustrations before the novel ("Images" at the end).