Investigate comic.pixiv.net #2607

Closed
Type-kun opened this issue Jun 7, 2016 · 14 comments
Labels
Source Support Upload/Source support

Comments

@Type-kun
Collaborator

Type-kun commented Jun 7, 2016

This seems to be a relatively new website by pixiv that gives artists a better reader for their short series than regular pixiv provides. However, I don't know if there's an easy way to extract the images. Users can't really do that manually: for example, one can't right-click an image and "open in a new tab", since the reader intercepts right clicks. You can see the links when you monitor the downloaded resources, but that's too hardcore for most uploaders. Also, when opened on their own, those links don't work (probably referer-based protection), which is inconvenient.

Additionally, new images seem to be loaded with AJAX on demand, not when the page is loaded, which makes automated uploads difficult. The reader script can probably be reverse-engineered; at least it's plain JavaScript and not Flash or something. If that's possible, a batch upload strategy for comic.pixiv.net would be great.

The site seems to automatically authenticate users who are logged in on the main pixiv.net website.


Example: https://comic.pixiv.net/viewer/stories/9869

Page 2:
https://img-comic.pximg.net/images/page/9869/V52hshKjl05juBvdbHJ5/2.jpg?20151030104009
Page 3:
https://img-comic.pximg.net/images/page/9869/4Dx6Cl2FiZtOkRRyUCJv/3.jpg?20151030104009

General pattern:
https://img-comic.pximg.net/images/page/<story_id>/<random_key>/<page_number>.jpg?<timestamp>
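
The general pattern above can be sketched as a small parser. This is only an illustration in Python: the names `PAGE_URL_RE`, `PageUrl` and `parse_page_url` are made up for this example, and the regular expression is inferred from the two example URLs, so it assumes `.jpg` pages and a numeric timestamp.

```python
import re
from typing import NamedTuple, Optional

# Inferred from the example URLs in this thread. The key length is not
# assumed, only that the key contains no slash.
PAGE_URL_RE = re.compile(
    r"^https://img-comic\.pximg\.net/images/page/"
    r"(?P<story_id>\d+)/(?P<key>[^/]+)/(?P<page>\d+)\.jpg"
    r"(?:\?(?P<timestamp>\d+))?$"
)


class PageUrl(NamedTuple):
    story_id: int
    key: str
    page: int
    timestamp: Optional[str]  # None when the URL has no query string


def parse_page_url(url: str) -> Optional[PageUrl]:
    """Split a page image URL into its parts, or return None if it doesn't match."""
    m = PAGE_URL_RE.match(url)
    if m is None:
        return None
    return PageUrl(int(m["story_id"]), m["key"], int(m["page"]), m["timestamp"])
```

Parsing only recovers the parts of a known URL. Since the keys look random, it does not help with guessing the URLs of other pages.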

Keys seem to be consistent after reloading, so maybe they are permanently bound to images. Perhaps they are a function of the page and the user session or profile; this will need some further checks. Either way, they seem to be completely random at a glance, so we can't grab all the pages just by knowing the story id and page count.

Not sure if timestamps are mandatory or not, maybe they serve the same purpose as in pixiv, to indicate revisions.

Also, when the story page is loaded, a meta name="viewer-api-url" tag with json info is included in the page html code. In this case, I saw /api/v1/viewer/stories/rXqsSATnBA/9869.json linked, and saw it loaded in the page resources inspector: the json file itself has links to all pages under data.pages. However, an attempt to directly access https://comic.pixiv.net/api/v1/viewer/stories/rXqsSATnBA/9869.json returns an error. Probably some sort of protection is in use, or the pixiv api uses a different authentication method.
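
Reading that meta tag out of the page could look roughly like this in Python, using only the standard library. `MetaCollector` and `extract_meta` are hypothetical names for this sketch. It simply collects every `<meta name="..." content="...">` pair, so the caller can look up `viewer-api-url` afterwards.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) also end up here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            self.meta[name] = attributes.get("content") or ""


def extract_meta(html: str) -> dict:
    """Return a {name: content} dict for all named meta tags in the page."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.meta
```

With the page HTML in `html`, `extract_meta(html).get("viewer-api-url", "")` would then give the relative JSON path, or an empty string when the tag is missing or empty.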

@SD-DAken

SD-DAken commented Jun 8, 2016

> Page 2:
> https://img-comic.pximg.net/images/page/9869/V52hshKjl05juBvdbHJ5/2.jpg?20151030104009
> Page 3:
> https://img-comic.pximg.net/images/page/9869/4Dx6Cl2FiZtOkRRyUCJv/3.jpg?20151030104009

I get exactly the same file URLs, so they seem to be static and not dependent on the user or session.


I've tested this a bit (with curl) and my results are as follows:

To access the images, only the referrer has to be set correctly, e.g.:
curl "https://img-comic.pximg.net/images/page/9869/V52hshKjl05juBvdbHJ5/2.jpg?20151030104009" -H "Referer: https://comic.pixiv.net/viewer/stories/9869" > test.jpg

The json file is quite a bit trickier: the request must look like an "XMLHttpRequest" and the correct session cookies must be set:
curl "https://comic.pixiv.net/api/v1/viewer/stories/93WrK4ZKrs/9869.json" -H "Host: comic.pixiv.net" -H "X-Requested-With: XMLHttpRequest" -H "Cookie: PHPSESSID=<pixiv_session>; _pixiv-comic_session=<comic_session>; "

where:
<pixiv_session>: Your PHP session key created on login to pixiv
<comic_session>: The session key for the comics page, automatically created on first visit

Also note: The json URL apparently depends on the (comic?) session.
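
For illustration, the same JSON request could be prepared in Python roughly like this. It is a sketch that assumes the two cookies plus the `X-Requested-With` header are sufficient, as in the curl command above. `build_viewer_json_request` is a hypothetical helper name, and the request is only constructed here, not sent.

```python
from urllib.request import Request


def build_viewer_json_request(url: str, pixiv_session: str, comic_session: str) -> Request:
    """Build (but do not send) the GET request for the viewer JSON file."""
    cookie = f"PHPSESSID={pixiv_session}; _pixiv-comic_session={comic_session}"
    return Request(
        url,
        headers={
            # Makes the request look like it came from the reader script.
            "X-Requested-With": "XMLHttpRequest",
            "Cookie": cookie,
        },
    )
```

Passing the result to `urllib.request.urlopen` would perform the request. The explicit `Host` header from the curl command is left out because the library derives it from the URL.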

@SD-DAken

SD-DAken commented Jun 8, 2016

> <pixiv_session>: Your PHP session key created on login to pixiv
> <comic_session>: The session key for the comics page, automatically created on first visit

Copying these values from one's browser works, but isn't an option for Danbooru.

Getting the server to correctly initialize the (comic) session programmatically, on the other hand, seems to be really difficult (or I'm missing something obvious).

@r888888888
Collaborator

Danbooru already uses pixiv session tokens that it gets by manually logging into the site. I assume the comic session token would be the same. If the comic session is returned by logging into the comic site, then it's just a matter of storing the token locally and reusing it.

I've noticed Pixiv has been switching up their login options recently though and the newer JS one was tricky for me to decipher. It seems to rely on a CAPTCHA or some sort of key verification and is therefore difficult to automate.

@Type-kun
Collaborator Author

Type-kun commented Jun 9, 2016

> Danbooru already uses pixiv session tokens that it gets by manually logging into the site. I assume the comic session token would be the same. If the comic session is returned by logging into the comic site, then it's just a matter of storing the token locally and reusing it.

Looks like it. If I delete the _pixiv-comic_session cookie, disable javascript and reload the page, the cookie is still there. It comes in a Set-Cookie header, so it should be reusable just like the regular session ID. The pixiv.net cookies should most likely be passed with the request.

I guess that's pretty much it. So the steps are as follows:

When https://comic.pixiv.net/viewer/stories/<story_id> is opened with the batch upload bookmarklet:

  • Get the page HTML, passing the pixiv.net cookies in the request.
  • Store the comic.pixiv.net cookies we get back from the request.
  • Search the HTML for <meta name="viewer-api-url" content="...", and get the json url from the content attribute.
  • Load the json using both the pixiv.net and comic.pixiv.net cookies, along with the "X-Requested-With: XMLHttpRequest" header.
  • Search the json for the data.contents.pages array. Each element in it can have a right and a left object, and both contain the image url in data.url. This gives us all the image urls for the batch upload page.
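
The last step of this list can be sketched in Python as follows. No sample of the JSON appears in this thread, so the exact shape is an assumption: `page_image_urls` is a hypothetical name, `data.contents` is accepted either as one object or as a list of objects holding `pages`, the URL is read from `data.url` inside each `right`/`left` object, and `right` is emitted before `left`.

```python
import json


def page_image_urls(viewer_json: str) -> list:
    """Return the image URLs found under data.contents.pages, in document order."""
    payload = json.loads(viewer_json)
    contents = payload["data"]["contents"]
    # Assumption: `contents` is a single object or a list of such objects.
    blocks = contents if isinstance(contents, list) else [contents]
    urls = []
    for block in blocks:
        for page in block.get("pages", []):
            # Assumption: `right` comes first within one element.
            for side in ("right", "left"):
                entry = page.get(side) or {}
                url = (entry.get("data") or {}).get("url")
                if url:
                    urls.append(url)
    return urls
```

Elements that only have one side are handled by skipping the missing object, so the function never fails on single-page elements.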

When an image from img-comic.pximg.net/images/page/<story_id> is opened in the upload page:

  • Pass it through the image proxy, like with pixiv images.
  • Supply https://comic.pixiv.net/viewer/stories/<story_id> as the referer.
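
A minimal sketch of the referer rule in Python. `referer_for` and `build_image_request` are hypothetical names, and this is not how Danbooru's image proxy is actually written. It only shows deriving the story page from the image URL and attaching it as the `Referer` header.

```python
import re
from typing import Optional
from urllib.request import Request

# The story id is the first path segment after /images/page/.
_IMAGE_RE = re.compile(r"^https://img-comic\.pximg\.net/images/page/(\d+)/")


def referer_for(image_url: str) -> Optional[str]:
    """Return the story page to use as referer, or None for other URLs."""
    m = _IMAGE_RE.match(image_url)
    if m is None:
        return None
    return f"https://comic.pixiv.net/viewer/stories/{m.group(1)}"


def build_image_request(image_url: str) -> Request:
    """Build (but do not send) an image request with the referer set if needed."""
    referer = referer_for(image_url)
    headers = {"Referer": referer} if referer else {}
    return Request(image_url, headers=headers)
```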

@SD-DAken

> Looks like it. If I delete the _pixiv-comic_session cookie, disable javascript and reload the page, the cookie is still there. It comes in a Set-Cookie header, so it should be reusable just like the regular session ID. The pixiv.net cookies should most likely be passed with the request.

The cookie itself is set via a set-cookie header, that's right. But in my tests this isn't enough to completely initialize a working session with the server.

If I copy both PHPSESSID and _pixiv-comic_session from my browser and request https://comic.pixiv.net/viewer/stories/9869 via curl, I get an html document back that contains:

<meta name="app-token" content="" />
<meta name="token-api-url" content="" />
<meta name="viewer-api-url" content="/api/v1/viewer/stories/8buf0QJJPk/9869.json" />
<meta name="works-info-api-url" content="" />

But if I only copy PHPSESSID from my browser and use the _pixiv-comic_session returned in the set-cookie header, I get an html document containing:

<meta name="app-token" content="88c4e9bf9041d3d4ddc9791346598891" />
<meta name="token-api-url" content="/api/v1/viewer/token/88c4e9bf9041d3d4ddc9791346598891.json" />
<meta name="viewer-api-url" content="" />
<meta name="works-info-api-url" content="" />

So there seems to be still one step missing to make this work completely.

@SD-DAken

Investigating this some more (this time via the browser's developer tools) it seems that the /api/v1/viewer/token/<random_value>.json file contains something like:
{"error":null,"data":{"token":"IS15ZcVeGv"}}

This is exactly the comic-session-specific value appearing in the /api/v1/viewer/stories/<token>/<story-id>.json URL.

So (at least on the first request of the comic session) viewer-api-url is not set and token-api-url has to be requested instead to get the required token.
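
A minimal sketch of that last step in Python: take the token response shown above and build the viewer JSON URL from it. `viewer_api_url` is a hypothetical helper name, and treating a non-null `error` field as a failure is an assumption, since only the success case appears in this thread.

```python
import json


def viewer_api_url(token_json: str, story_id: int) -> str:
    """Build the viewer JSON URL from the body of the token-api-url response."""
    payload = json.loads(token_json)
    # Assumption: a non-null "error" means the token request was rejected.
    if payload.get("error") is not None:
        raise ValueError(f"token request failed: {payload['error']!r}")
    token = payload["data"]["token"]
    return f"https://comic.pixiv.net/api/v1/viewer/stories/{token}/{story_id}.json"
```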

@SD-DAken

This was actually quite difficult to figure out. The steps below seem to work.

Prerequisite: The PHPSESSID value is known. (Since Danbooru already logs in to pixiv this should be the case).

  • curl "https://comic.pixiv.net/viewer/stories/9869" -H "Host: comic.pixiv.net" -H "Cookie: PHPSESSID=<pixiv_session>;" -D -

The server responds with Set-Cookie: _pixiv-comic_session=<comic_session>. Send the request again with that cookie added:

  • curl "https://comic.pixiv.net/viewer/stories/9869" -H "Host: comic.pixiv.net" -H "Cookie: PHPSESSID=<pixiv_session>; _pixiv-comic_session=<comic_session>;" -D -

The server responds with Set-Cookie: is_browser=yes and an html document, from which two values are needed:
<meta content="<csrf_token>" name="csrf-token" />
<meta name="token-api-url" content="/api/v1/viewer/token/<random>.json" />
Request the token json file with these parameters filled in. This must be a POST request, hence the Content-Length: 0 and --data "" below. Apparently this request must also be made shortly after the previous one, or it fails (tricky when doing it by hand, but it shouldn't be a problem for a script).

  • curl "https://comic.pixiv.net/api/v1/viewer/token/<random>.json" -H "Host: comic.pixiv.net" -H "X-CSRF-Token: <csrf_token>" -H "X-Requested-With: XMLHttpRequest" -H "Cookie: PHPSESSID=<pixiv_session>; _pixiv-comic_session=<comic_session>; is_browser=yes;" -H "Content-Length: 0" --data "" -D -

A json file that looks like {"error":null,"data":{"token":"<comic_session_token>"}} is returned.
Fill this in at the right position...

  • curl "https://comic.pixiv.net/api/v1/viewer/stories/<comic_session_token>/9869.json" -H "Host: comic.pixiv.net" -H "X-Requested-With: XMLHttpRequest" -H "Cookie: PHPSESSID=<pixiv_session>; _pixiv-comic_session=<comic_session>; is_browser=yes;"

... and the json file (file structure already explained by Type-kun above) with the image urls is returned. Now the images can be downloaded:

  • curl "https://img-comic.pximg.net/images/page/9869/gB6a6Hf9VhHwHUBc1TYY/8.jpg?20151030104009" -H "Referer: https://comic.pixiv.net/viewer/stories/9869" > test.jpg
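
The cookie hand-off between the first two requests can be sketched with Python's standard `http.cookies` module. `cookie_value` and `cookie_header` are hypothetical helper names, and any header strings fed to them would be the `Set-Cookie` values described above. The attributes that pixiv actually sends along with the cookie are not shown in this thread.

```python
from http.cookies import SimpleCookie
from typing import Dict, Optional


def cookie_value(set_cookie_header: str, name: str) -> Optional[str]:
    """Return the value of cookie `name` from one Set-Cookie header, or None."""
    jar = SimpleCookie()
    jar.load(set_cookie_header)
    morsel = jar.get(name)
    return morsel.value if morsel is not None else None


def cookie_header(cookies: Dict[str, str]) -> str:
    """Serialize cookies into the value of a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

A script would call `cookie_value` on the first response to obtain `_pixiv-comic_session`, then pass `cookie_header` with both session cookies (and later `is_browser=yes`) on the following requests.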

@SD-DAken

All subsequent requests must (as long as the comic session stays active) directly use the viewer-api-url value.

@r888888888
Collaborator

commit to store the comic session id in aa77ba3

@r888888888
Collaborator

I'm not sure how much demand for this there is but it would be a fair amount of work I think. I assume it would work like the batch bookmarklet where all the posts in a comic can be uploaded. Automatically creating and tagging a pool could be handled, too.

@Type-kun
Collaborator Author

> I'm not sure how much demand for this there is but it would be a fair amount of work I think. I assume it would work like the batch bookmarklet where all the posts in a comic can be uploaded. Automatically creating and tagging a pool could be handled, too.

I expected it to work directly through the batch upload bookmarklet. There's no need for a new scheme; it fits into the current one perfectly. I'll try to describe the steps again.

When the bookmarklet is used on https://comic.pixiv.net/viewer/stories/9869:

  1. Get the HTML of https://comic.pixiv.net/viewer/stories/9869, passing the pixiv session as the PHPSESSID cookie and the comic session as the _pixiv-comic_session cookie. We seem to already store both.
  2. Parse the resulting HTML for the meta tags viewer-api-url, token-api-url and csrf-token.
    1. If viewer-api-url is not empty, use it.
    2. If viewer-api-url is empty and token-api-url is not empty:
      • Perform an empty POST request to the link specified in token-api-url. Additional headers: X-CSRF-Token: <csrf_token>; X-Requested-With: XMLHttpRequest. Cookies: PHPSESSID and _pixiv-comic_session from before, and also is_browser=yes;
      • Parse the resulting JSON and retrieve the data.token value.
      • Use that value (<token>) to construct viewer-api-url: https://comic.pixiv.net/api/v1/viewer/stories/<token>/9869.json. "9869" is the story ID and can be retrieved from the initial URL passed to the bookmarklet.
  3. Perform a GET request on viewer-api-url. Additional headers: X-Requested-With: XMLHttpRequest. Cookies: PHPSESSID and _pixiv-comic_session from before, and also is_browser=yes;
  4. Parse the resulting JSON. Search for the data.contents.pages array. Each element in that array can have a right and a left object. Get data.url from each of those objects. It is the direct image URL that can be displayed on the batch upload page as Image 0 and so on.

It takes up to 3 requests, but still fits into the current model. Direct image URLs require passing through image proxy, like regular pixiv images. The rule is: when https://img-comic.pximg.net/images/page/<story_id>/... is passed to image proxy, use https://comic.pixiv.net/viewer/stories/<story_id> as referrer.

I'm also not sure how much demand there is, but there's almost no other way to upload from comic.pixiv.net, because image urls are hidden and require digging through page source code or even inspecting network traffic.

@nonamethanks
Member

@r888888888 this should probably be reopened, as several people have requested this recently (most recently for https://comic.pixiv.net/works/5083).

@r888888888 r888888888 reopened this Sep 19, 2018
@stale

stale bot commented Jul 29, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 29, 2019
@evazion evazion added Low Priority Valid but not high importance and removed stale labels Jul 30, 2019
@stale

stale bot commented Oct 28, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 28, 2019
@evazion evazion removed the stale label Oct 28, 2019
@nonamethanks nonamethanks added Source Support Upload/Source support and removed Feature Low Priority Valid but not high importance labels Feb 1, 2023
@evazion evazion closed this as completed in a4d0e9e May 2, 2024