ISSUE-80: WebArchiving my old friend #81

Merged
DiegoPino merged 12 commits into 8.x-1.0-beta3 from ISSUE-80 on Jul 6, 2020

DiegoPino
Member

@DiegoPino DiegoPino commented Jun 24, 2020

See #80

This code actually works!

You will see a few JS alerts, but those are harmless; they are related to how the web replay JS library tries multiple sources for its ui.js library. This adds a new formatter (a proof of concept; it can be way better and will be):

  • Strawberry Warc Formatter using the replay.web embedded player, with simple settings for now
  • JSON Source Key for WARCs (use as:document please), Width, Height and an extra one for remote URLs
  • @TODO: add extra settings to hide the nav bar via "embed=replay" and "url="

The rest is what we do, load some JS, iterate over some stuff, do some checks and render some HTML.

A note: I figured out that the issue with the height of the element/iframe is that it is not automatically treated as a block-level element, so it was not flowing and extending the rest of the layout. Now it should be fine!

One last note: I had to create a controller to serve the sw.js file. It needs to be inside a /replay path inside the main domain and D8 cannot do that via a direct file or the library. I like how i did it!
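
The /replay requirement is presumably about service worker scope: by default a worker only controls pages under the directory its script is served from. A minimal sketch of that rule follows; the example.org host and both helper names are hypothetical and not part of this pull request.

```javascript
// Default scope of a service worker: the directory its script is served from.
function defaultWorkerScope(scriptUrl) {
  return new URL("./", scriptUrl).href;
}

// A page can only be controlled if its URL starts with the worker's scope.
function isInScope(scriptUrl, pageUrl) {
  return new URL(pageUrl).href.startsWith(defaultWorkerScope(scriptUrl));
}

// Hypothetical host, used only for illustration.
const workerScript = "https://example.org/replay/sw.js";
console.log(defaultWorkerScope(workerScript));
console.log(isInScope(workerScript, "https://example.org/replay/index.html"));
console.log(isInScope(workerScript, "https://example.org/do/1234"));
```

Under this rule a worker served from /replay/ covers /replay/index.html but not the object page itself, which is consistent with the replay UI living in an iframe under /replay.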
@giancarlobi @mitchellkeaney @ikreymer

image

First pass. Simple without too many options. Does not read from a remote URL yet. Need to ask @ikreymer if there is an argument to load multiple WARCs (as a collection) into a single embed, or do i need multiple embeds, one per file/remote URL entry
JS workers are local beasts. This remote endpoint allows the other remote JS to be loaded in a specific domain path (see the format_strawberryfield.replayweb route) so the ui.js knows where to find it
Yeah. Hello application/warc
@DiegoPino DiegoPino self-assigned this Jun 24, 2020
@DiegoPino DiegoPino added the enhancement, future roadmap, Javascript and metadata labels Jun 24, 2020
@DiegoPino DiegoPino added this to the 1.0.0-beta3 milestone Jun 24, 2020
@DiegoPino
Member Author

One thing i just noticed is that on a first load in Chrome it failed to fetch the page and actually showed itself inside the iframe. On a reload it worked fine. I wonder if there was some cache around, or is there a race condition?

@ikreymer

> One thing i just noticed is that on a first load in Chrome it failed to fetch the page and actually showed itself inside the iframe. On a reload it worked fine. I wonder if there was some cache around, or is there a race condition?

Yes, this is something I've seen, but not consistently. Trying to track it down. Seems like a race condition where the service worker says it has been registered but doesn't actually take effect until reload. Hope to have a fix for this soon.
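
One common way to guard against this kind of race is to wait until a service worker actually controls the page before adding the embed. The sketch below is not the fix that went into replayweb.page. `waitForController` is a hypothetical helper that assumes it receives an object shaped like `navigator.serviceWorker`. On a first visit the controller is only set without a reload if the worker itself calls `clients.claim()`.

```javascript
// Resolve once a service worker controls the page, or reject after a timeout.
// `container` is expected to look like navigator.serviceWorker.
function waitForController(container, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    // Already controlled: nothing to wait for.
    if (container.controller) {
      resolve(container.controller);
      return;
    }
    const timer = setTimeout(() => {
      container.removeEventListener("controllerchange", onChange);
      reject(new Error("service worker did not take control in time"));
    }, timeoutMs);
    function onChange() {
      clearTimeout(timer);
      container.removeEventListener("controllerchange", onChange);
      resolve(container.controller);
    }
    // Fires when a new worker starts controlling this page.
    container.addEventListener("controllerchange", onChange);
  });
}
```

In a browser this would be called as `waitForController(navigator.serviceWorker)`, inserting the replay tag only after the promise resolves and falling back to a reload prompt if it rejects.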

@DiegoPino
Member Author

@ikreymer thanks! Just reproduced it again. Maybe i can/should put the worker on top of all the other JS? Maybe even a tiny pause?

Attaching Screenshots (first load, self not found)
image

Second load
image

@DiegoPino
Member Author

I created a Public Test Object here https://play.archipelago.nyc/do/29156c97-3c8a-41c1-a4f4-c22816548628
And will be documenting issues/Tiny details i'm finding. @giancarlobi if you have some time, could you check if you experience the same issue of not loading the first time and then second time? Also does the 90 Mbytes large File load for you? I feel i will have some questions tomorrow. Thanks to all

@ikreymer

@DiegoPino probably the simplest thing to try is just to put the existing index.html as ./replay/index.html next to ./replay.sw.js. That's all the service worker is serving (the ui.js is replaced with the one from the embed).

@DiegoPino
Member Author

@ikreymer cool. Thanks. Will try serving that index.html as ./replay/index.html. I tried with a warc.gz file downloaded from Internet Archive and the player complained it could not read the format, so i unarchived it and uploaded the .warc one directly. That worked as you can see there. I wonder if there is content negotiation based on mime type happening, and i was serving the incorrect mime for the .gz? Or is the Internet Archive format actually just a gzipped warc? Feel i need to learn more about the format. What was strange is that FIDO, which does a deeper analysis of the file, recognized the .warc.gz as just a gzipped file and assigned that Pronom ID. I also noticed that .warc files carry little to no exif/extra metadata in the format header. Is that normal for the standard? Thanks a lot @ikreymer! Having fun with this.

@ikreymer

@DiegoPino Hm, it should work with .warc.gz from IA - can you share a link to the .warc.gz that didn't work?

Yes, WARCs generally have very little additional metadata -- yet another reason I'm trying to create this new bundling format.

Here is an example that you can try using this format: https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/etd.wacz

Unlike with raw WARCs, this should load very quickly as it doesn't need to download the whole thing before providing access. This format will probably just be detected as a zip file :)

@DiegoPino
Member Author

@ikreymer great. Will try that wacz. So is that your very own extension? Cool. The warc.gz that failed is this one https://archive.org/download/lego.gizmodo.com-sitemap-2008-20160821/lego.gizmodo.com-sitemap-2008-20160821.warc.gz

Should this fix the double loading needed to get the JS worker warmed up?
This is quite simple. If there is a starting URL, we provide it and the bar goes away. If not, we check if the bar is enabled and show it. If not, we just show the 'pages' selector so the user can decide. It's late and i'm tired, there can be better options for realz
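
The decision described in this commit message can be sketched as a three-way branch. This is an illustration only, not the formatter's actual code: the setting names (`startUrl`, `navBarEnabled`) and the returned `view` labels are hypothetical and do not correspond to real replayweb.page attribute values.

```javascript
// Pick what the embedded viewer should show, following the rules above.
// Setting names and `view` labels are made up for this sketch.
function chooseViewerMode({ startUrl = "", navBarEnabled = false } = {}) {
  if (startUrl) {
    // A starting URL is configured: load it directly, without the bar.
    return { view: "page", url: startUrl, showNavBar: false };
  }
  if (navBarEnabled) {
    // No starting URL, but the bar is enabled: let the user navigate.
    return { view: "browse", url: null, showNavBar: true };
  }
  // Neither: fall back to the 'pages' selector so the user can decide.
  return { view: "pages", url: null, showNavBar: false };
}
```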
@DiegoPino
Member Author

I created a second Demo object with the wacz extension. It works fine and i can now even set the url and remove the bars via some UI facing setting. Not sure if it's that useful if the warc file is not fully tested and has broken links, but it can be done
https://play.archipelago.nyc/do/ec0144ee-5d34-4f88-b093-e48ecc557255
What i do not see is the faster loading/streaming, and i'm starting to wonder if my way of serving these files needs to enforce/pass through something to make sure the JS can make use of streaming. We are intermediating with S3, not serving it directly.

@ikreymer

> I created a second Demo object with the wacz extension. It works fine and i can now even set the url and remove the bars via some UI facing setting. Not sure if it's that useful if the warc file is not fully tested and has broken links, but it can be done
> https://play.archipelago.nyc/do/ec0144ee-5d34-4f88-b093-e48ecc557255

Nice! Yeah, i think maybe default should be with location bar at least.. still working on what the options should be.

> What i do not see is the faster loading/streaming, and i'm starting to wonder if my way of serving these files needs to enforce/pass through something to make sure the JS can make use of streaming. We are intermediating with S3, not serving it directly.

I think I found the issue.. replayweb.page attempts to detect if range requests are supported by checking if a HEAD with Range: bytes=0- request returns a 200 or a 206. In the current setup, it returns a 200 (and same for GET), but otherwise bounded range requests seem to work fine! That's probably not a good check, as it is equivalent to a 200, so maybe should do bytes=0-1, which is what Safari does to determine support.

Thanks for setting this up so quickly!

@giancarlobi
Collaborator

> I created a Public Test Object here https://play.archipelago.nyc/do/29156c97-3c8a-41c1-a4f4-c22816548628
> And will be documenting issues/Tiny details i'm finding. @giancarlobi if you have some time, could you check if you experience the same issue of not loading the first time and then second time? Also does the 90 Mbytes large File load for you? I feel i will have some questions tomorrow. Thanks to all

@DiegoPino the page loads without error but shows this

image

then, after clicking on the chain logo/replayweb.page, the web page is displayed

image

@giancarlobi
Collaborator

> I created a second Demo object with the wacz extension. It works fine and i can now even set the url and remove the bars via some UI facing setting. Not sure if it's that useful if the warc file is not fully tested and has broken links, but it can be done
> https://play.archipelago.nyc/do/ec0144ee-5d34-4f88-b093-e48ecc557255
> What i do not see is the faster loading/streaming, and i'm starting to wonder if my way of serving these files needs to enforce/pass through something to make sure the JS can make use of streaming. We are intermediating with S3, not serving it directly.

@DiegoPino this one loads and displays the web page on first access

image

@DiegoPino
Member Author

DiegoPino commented Jun 25, 2020 via email

@DiegoPino
Member Author

Hi @ikreymer thank you for all this great interaction.

> I think I found the issue.. replayweb.page attempts to detect if range requests are supported by checking if a HEAD with Range: bytes=0- request returns a 200 or a 206. In the current setup, it returns a 200 (and same for GET), but otherwise bounded range requests seem to work fine! That's probably not a good check, as it is equivalent to a 200, so maybe should do bytes=0-1, which is what Safari does to determine support.

Great. Still i feel i can help there. So our S3 is, in PHP terms, a stream wrapper, which means it deals with remotes as locals and allows many of the direct filesystem functions on the files even when they are served via external APIs. But that does not mean that i'm passing headers and dealing with responses the same way S3 would do natively. The reason is that we (Institutional repositories use case) could/want to apply access to files sometimes only in the context of the DO Object use. So if e.g. the Object cannot be seen by the user, the file attached inherits that. So, long explanation to say that HEAD is not managed right now by me, but directly by NGINX. And i could help with giving you a better HEAD response (just need to know what is better)

Also. This commented out code
https://github.com/esmero/format_strawberryfield/blob/ISSUE-80/src/Controller/IiifBinaryController.php#L151-L181

was meant to actually implement streaming manually. Why? Because some users like @giancarlobi are not using S3 and we want also to allow that for Direct Filesystem served files. That said, there is an opportunity there to tune that streaming (like size of the chunks, what headers are needed, etc) to adapt to this need and in general to any other large file need (Video, Sound), taking ideas.
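
To make the tuning mentioned above concrete, this is roughly the bookkeeping a manual streaming endpoint has to do for one byte range. It is a language-neutral sketch written in JavaScript, not the commented-out PHP in IiifBinaryController.php. `parseByteRange` is a hypothetical helper, and only single ranges are handled.

```javascript
// Parse a single "Range: bytes=start-end" header against a known file size.
// Returns the resolved range plus the Content-Range value for a 206 response,
// or null when the header is missing, unsupported, or not satisfiable.
function parseByteRange(rangeHeader, fileSize) {
  const match = /^bytes=(\d*)-(\d*)$/.exec((rangeHeader || "").trim());
  if (!match || (match[1] === "" && match[2] === "")) return null;
  let start;
  let end;
  if (match[1] === "") {
    // Suffix form "bytes=-N": the last N bytes of the file.
    const suffix = Number(match[2]);
    if (suffix === 0) return null;
    start = Math.max(fileSize - suffix, 0);
    end = fileSize - 1;
  } else {
    start = Number(match[1]);
    // Open-ended "bytes=N-" runs to the end; a too-large end is clamped.
    end = match[2] === "" ? fileSize - 1 : Math.min(Number(match[2]), fileSize - 1);
  }
  if (start > end || start >= fileSize) return null;
  return {
    start,
    end,
    length: end - start + 1,
    contentRange: `bytes ${start}-${end}/${fileSize}`
  };
}
```

A server would answer a non-null result with status 206, `Content-Range` set to `contentRange` and `Content-Length` set to `length`, then stream only that slice in chunks of whatever size proves best in testing.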

Other topics and some questions/housekeeping: Embed tag. Do you prefer me to open issues (or one meta issue) in your repo? Main concern is the links that don't work inside the iframe context (like clicking on the logo) v/s the ones that are super useful and make the experience great, like the pages, page resources and replay ones. Also thinking out loud about a button in case you end up with a broken link during exploration and you want to go back to what was before. I mean, right button, back works in the context of the iframe but is not intuitive for, e.g. an iPad or iPhone user.

All this said, i see so much potential in this collaboration and am happy we can move forward. Your work on this is amazing!!

ikreymer added a commit to webrecorder/wabac.js that referenced this pull request Jun 25, 2020
…brecorder/replayweb.page#5)

range support check: even if response is 200 on initial fetch, check for 'Accept-Range: bytes' to determine range support (see: esmero/format_strawberryfield#81)
@ikreymer

@DiegoPino I think I have a fix for it, just added a check for Accept-Ranges: bytes which the server does return, so should avoid loading the whole thing at once. Yes, I think the embed options could be clarified a bit more, feel free to open an issue on https://github.com/webrecorder/replayweb.page and can discuss it further there!
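
The check described here can be sketched as a small decision function. This is not the actual wabac.js code: `supportsRangeRequests` and the plain-object header shape are assumptions made for illustration.

```javascript
// Decide from a probe response whether the server supports byte ranges.
// `status` is the HTTP status code; `headers` is a plain object mapping
// header names to values (a shape chosen only for this sketch).
function supportsRangeRequests(status, headers) {
  // 206 Partial Content: the server honoured the Range header directly.
  if (status === 206) return true;
  if (status !== 200) return false;
  // A 200 can still advertise range support via Accept-Ranges: bytes.
  const entry = Object.entries(headers).find(
    ([name]) => name.toLowerCase() === "accept-ranges"
  );
  return Boolean(entry) && String(entry[1]).trim().toLowerCase() === "bytes";
}
```

With this logic, the setup discussed in this thread (a 200 response that still carries `Accept-Ranges: bytes`) is treated as range-capable, so the whole file is not fetched up front.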

Yeah, I think the current default mode of hiding the location bar can be confusing.. If you add embed="replay" attribute to <replay-web-page> it should show it with location bar..
Maybe that should be the default. Yes, perhaps it's better to discuss the embed options on https://github.com/webrecorder/replayweb.page
Also, I'm seeing some strange behavior with this piece (double nav bar) that doesn't happen when loaded on replayweb.page itself -- a bit odd!
Maybe should try with something simpler..

@DiegoPino
Member Author

DiegoPino commented Jun 25, 2020

> @DiegoPino I think I have a fix for it, just added a check for Accept-Ranges: bytes which the server does return, so should avoid loading the whole thing at once. Yes, I think the embed options could be clarified a bit more, feel free to open an issue on https://github.com/webrecorder/replayweb.page and can discuss it further there!

Thanks! will do. Do i need to change the remote library version? i'm using CDN. Or should i test in a different way?

> Yeah, I think the current default mode of hiding the location bar can be confusing.. If you add embed="replay" attribute to <replay-web-page> it should show it with location bar..
> Maybe that should be the default. Yes, perhaps it's better to discuss the embed options on https://github.com/webrecorder/replayweb.page

Will do thanks

> Also, I'm seeing some strange behavior with this piece (double nav bar) that doesn't happen when loaded on replayweb.page itself -- a bit odd!

Me too. There is double loading. And then the site itself loads again. Maybe it's because index.html is now served locally (in archipelago)? Maybe i need to tune that index.html and make it slightly different?

@ikreymer

> Thanks! will do. Do i need to change the remote library version? i'm using CDN. Or should i test in a different way?

Yes, so instead of:
<script src="https://unpkg.com/replaywebpage@1.0.0/ui.js"></script> use
<script src="https://replayweb.page/ui.js"></script>

and instead of:
importScripts("https://unpkg.com/replaywebpage@1.0.0/sw.js"); use
importScripts("https://replayweb.page/sw.js");

This will use the tip of the main branch from replayweb.page. Even though I recommended against this for production, for development probably makes sense since I'm tweaking the latest. Once things are settled, can do a release and update to the cdn version for the release.

Here's another test case: https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/moabvideos.wacz
This is just a single page, but embeds many youtube videos, everything should download on-demand.
The url to use with this embed is: https://manonabeach.com/

@DiegoPino
Member Author

@giancarlobi hope you had a great day and that you are sleeping now! For tomorrow, look at this one https://play.archipelago.nyc/do/0a7d02bb-928e-403c-b87c-7cf5b9d9d563
Hope it works better for you. Please open an account also at https://play.archipelago.nyc/user/register so i can give you admin permissions there. Thanks!

@DiegoPino
Member Author

@ikreymer thanks! I will disconnect for today (long/zoom calls training day) but will make those changes tomorrow AM in the live site (won't touch this pull because this is the official thing). I will also activate your account in a little while. You good with a full admin one or do you feel shy and want to start with just a user that can create/edit content? let me know. Thanks @mitchellkeaney ping so you know we will have more people stopping by for dinner =)

It was always WACZ not WARCZ
@DiegoPino
Member Author

@ikreymer i had some issues with your updated library and for some strange reason Sumitra's warc stopped working. So i reverted to the CDN until tomorrow, when i have better eyes to check what happened. Probably something tiny somewhere or just my computer refusing to work. Who knows

@DiegoPino
Member Author

Before i leave.. pretty sure it's some CORS issue... just in case you wonder what could be happening, because Chrome changed from "secure" to "insecure" in the header. Good night

@ikreymer

> Before i leave.. pretty sure it's some CORS issue... just in case you wonder what could be happening, because Chrome changed from "secure" to "insecure" in the header. Good night

Hm, yeah, hopefully something simple somewhere.. I checked and CORS seems to be ok for those urls, allowing all domains.

I've also made an update to the embedding UI. The default should now look like this, adding back/forward buttons:
Screen Shot 2020-06-25 at 11 04 19 PM
(Though, unfortunately, no way to prevent going back beyond the first page automatically.. the iframe history is the same as top window history).

I'll also double check the other WARCs with the latest replayweb.page

@dmer
Collaborator

dmer commented Jun 26, 2020

Hi Diego,
This is really amazing and cool thanks for sharing - I saw a demo of this a while back and was waiting to see it in a repository platform - opens so many possibilities!
Also just in case another data point helps, I had the same issue of the player not loading in the page the first time - it shows up fine after a refresh. (Chrome 83.0.4103.106 on Mac OS)

@DiegoPino
Member Author

@dmer thanks a lot! Will be doing some more testing with fresh out of the oven code from @ikreymer today. Also thanks for letting us know about the double loading issue. I saw the same behavior on my phone last night too. Will try to find out what is keeping the JS worker sleeping.

@ikreymer

@DiegoPino regarding the mixed content, i noticed this warning:

Mixed Content: The page at 'https://play.archipelago.nyc/do/0a7d02bb-928e-403c-b87c-7cf5b9d9d563' was loaded over HTTPS, but requested an insecure image 'http://rightsstatements.org/files/buttons/InC-EDU.white.svg'. This content should also be served over HTTPS

Looks like the file redirects to https anyway, but because it starts out http, the page is listed as 'not secure', not sure if that is affecting the replay in the iframe..
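
The warning above comes down to a simple rule: an HTTPS page requesting a plain HTTP resource. The sketch below is a simplification (browsers apply more nuanced rules, such as auto-upgrading some requests), and `isMixedContent` is a hypothetical helper, not a browser API.

```javascript
// True when a page loaded over HTTPS references a resource over plain HTTP.
function isMixedContent(pageUrl, resourceUrl) {
  const page = new URL(pageUrl);
  // Resolve against the page so relative resource URLs are handled too.
  const resource = new URL(resourceUrl, page);
  return page.protocol === "https:" && resource.protocol === "http:";
}

const pageUrl = "https://play.archipelago.nyc/do/0a7d02bb-928e-403c-b87c-7cf5b9d9d563";
console.log(isMixedContent(pageUrl, "http://rightsstatements.org/files/buttons/InC-EDU.white.svg"));
console.log(isMixedContent(pageUrl, "https://rightsstatements.org/files/buttons/InC-EDU.white.svg"));
```

The check looks at the URL as written in the page, which is why a later redirect to HTTPS does not prevent the warning.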

@DiegoPino
Member Author

@ikreymer totally. Makes so much sense. SSL/mixed content is the worst these days. That can be fixed, sure.
I will try to build a better test matrix during the weekend. Will create an experimental second option so we can try both libraries on the same server. Is there anything i can do on my side to ensure better streaming performance that you can think of? I also want to test how much changes between serving the larger files directly from S3 v/s through my code. At least, if that is faster, it could be the default in case the asset is 100% public and has no restrictions. Do i need to totally clear server caches to test multiple options? Or is caching not really happening and every load will be as good as starting from scratch?

I will try to open some issues in your own repo if time permits (long list of other todos today) some of the following will go there:

  • i did some quick research on js browser history and maybe, in this context, it could be manipulated? I fear a back-back situation where the main site page is rendered inside the embedding tag will be complex to explain to an end user as something normal.
  • I want to add a tiny preloading/loading animation while the worker gets started. I can do that directly on our site, np.
  • Last night the 300MB file you uploaded failed to load at 70% and 80%. My internet connection was the worst. I assumed it was that, and your code warned me and gave the option to go back (sorry, cannot remember the label). But that back button (again, i cannot remember the label right now.. gosh) failed once pressed and i had to reload the page. So the question is: when that happens, is reloading the page safer? Or should that button behave differently than what i saw?

Thanks to all the new code and testing! Your ui.js and sw.js both rock. This is great

@ikreymer

> Do i need to totally clear server caches to test multiple options? Or is caching not really happening and every load will be as good as starting from scratch?

Every unique sourceurl for the <replay-web-page> tag is cached separately.. The new UI should have a 'purge cache + full reload' option in the menu, which will clear that cache and do a full reload:

Screen Shot 2020-06-26 at 2 42 36 PM

> i did some quick research on js browser history and maybe, in this context, it could be manipulated? I fear a back-back situation where the main site page is rendered inside the embedding tag will be complex to explain to an end user as something normal.

Oh, it won't load in the page, it is just the equivalent of hitting back in the browser.

> I want to add a tiny preloading/loading animation while the worker gets started. I can do that directly on our site, np.

Is this before the progress bar starts? With WACZ, it should also be pretty quick.

> Last night the 300MB file you uploaded failed to load at 70% and 80%. My internet connection was the worst. I assumed it was that, and your code warned me and gave the option to go back (sorry, cannot remember the label). But that back button (again, i cannot remember the label right now.. gosh) failed once pressed and i had to reload the page. So the question is: when that happens, is reloading the page safer? Or should that button behave differently than what i saw?

Ah, i think it can improve the loading for embeds, and just add a try again/reload if it fails during loading.
Though, with the latest version from replayweb.page, it shouldn't even try to load the entire file at once.

I'm curious if it is still trying to load the full file even with the latest sw.js / ui.js? I'll try to repro locally as well (thanks for the updated instructions!)

@DiegoPino
Member Author

DiegoPino commented Jun 29, 2020

@ikreymer thanks for your reply. Yes, the preload will be for before the progress starts, actually for when the sw.js is still attaching, since on my older computer (but still 16GBytes of RAM) and on my phone there are a few seconds of white screen i want to avoid for end users. Seems pretty trivial to do, think i will reuse the JS i have for the 3D Mesh loader.
Regarding

> Oh, it won't load in the page, it is just the equivalent of hitting back in the browser.

I need to check that later tomorrow. Until now all my tests have been done pretty late in the night (sorry) so sometimes they are hard to reproduce, but during one of those i remember pressing back and back inside the frame and it ended up loading the same page i have as wrapper (e.g. play.archipelago.nyc, the previous one as the Object being visited, so one step behind in the browser history) inside the frame. Which looked funny but unexpected.

I see a lot of cool improvements in your code so will test them out as soon as i push some other code i have. I want to allow different streaming/chunking strategies based on an argument passed to the URL, that way i can test better without having to duplicate so much code. Thanks!

@ikreymer

> > Oh, it won't load in the page, it is just the equivalent of hitting back in the browser.

> I need to check that later tomorrow. Until now all my tests have been done pretty late in the night (sorry) so sometimes they are hard to reproduce, but during one of those i remember pressing back and back inside the frame and it ended up loading the same page i have as wrapper (e.g. play.archipelago.nyc, the previous one as the Object being visited, so one step behind in the browser history) inside the frame. Which looked funny but unexpected.

I think I found the source of confusion! You were referring to the 'Back' button that appears when there is an error, not the arrow back/forward buttons!

I'm adding a fix so it'll have a 'Try Again' button, which will reload the embed, and is about as much as can be done in case of failure to load. (The 'Back' on error was for standalone https://replayweb.page to return to the home page, and doesn't make sense with an embed)

Screen Shot 2020-06-29 at 5 24 39 PM

@DiegoPino
Member Author

DiegoPino commented Jun 30, 2020 via email

ikreymer added a commit to webrecorder/replayweb.page that referenced this pull request Jun 30, 2020
@DiegoPino
Member Author

Merging this folks. There is still work to do on our site, like making sure streaming is piped between S3 and the client and more settings/options for the viewer. But let's keep things discrete in the meantime, always time for another pull! Thanks to all of you!!

@DiegoPino DiegoPino merged commit 1f86796 into 8.x-1.0-beta3 Jul 6, 2020
@DiegoPino DiegoPino deleted the ISSUE-80 branch July 15, 2020 01:54

4 participants