ISSUE-80: WebArchiving my old friend #81
First pass. Simple, without too many options. Does not yet read from a remote URL. Need to ask @ikreymer whether there is an argument to load multiple WARCs (as a collection) into a single embed, or whether I need multiple embeds, one per file/remote URL entry.
JS workers are local beasts. This remote endpoint allows the other remote JS to be loaded in a specific domain path (see the format_strawberryfield.replayweb route) so that ui.js knows where to find it.
Yeah. Hello application/warc
One thing I just noticed is that on a first load in Chrome it failed to fetch the page and actually showed itself inside the iframe. On a reload it worked fine. I wonder if there was some cache around, or is there a race condition?
Yes, this is something I've seen, but not consistently. Trying to track it down. Seems like a race condition where the service worker says it's been registered but doesn't actually take effect until reload. Hope to have a fix for this soon.
@ikreymer thanks! Just reproduced it again. Maybe I can/should put the worker on top of all the other JS? Maybe even a tiny pause?
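One possible alternative to a fixed pause is to wait until the page is actually controlled by the worker before creating the embed. The following is a minimal sketch, not the fix that went into replayweb.page. It assumes the standard service worker container API (`controller` and the `controllerchange` event); `waitForController` is a hypothetical helper, and the container is passed in as an argument (in a browser it would be `navigator.serviceWorker`) so the logic can be exercised outside a page.

```javascript
// Hypothetical helper: resolves with the controlling worker once the given
// ServiceWorkerContainer-like object has one, or rejects after timeoutMs.
function waitForController(container, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    // Already controlled: nothing to wait for.
    if (container.controller) {
      resolve(container.controller);
      return;
    }

    const timer = setTimeout(() => {
      container.removeEventListener('controllerchange', onChange);
      reject(new Error('service worker did not take control in time'));
    }, timeoutMs);

    function onChange() {
      clearTimeout(timer);
      container.removeEventListener('controllerchange', onChange);
      resolve(container.controller);
    }

    // Fires when a newly activated worker starts controlling this page.
    container.addEventListener('controllerchange', onChange);
  });
}
```

If the worker never claims the page on first load, the promise rejects, and that rejection could be used to trigger a single reload instead of showing a broken iframe.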
I created a Public Test Object here https://play.archipelago.nyc/do/29156c97-3c8a-41c1-a4f4-c22816548628
@DiegoPino probably the simplest thing to try is just to put the existing index.html as
@ikreymer cool. Thanks. Will try serving that index.html
@DiegoPino Hm, it should work with .warc.gz from IA - can you share a link to the .warc.gz that didn't work? Yes, WARCs generally have very little additional metadata -- yet another reason I'm trying to create this new bundling format. Here is an example that you can try using this format: https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/etd.wacz Unlike with raw WARCs, this should load very quickly as it doesn't need to download the whole thing before providing access. This format will probably just be detected as a zip file :)
@ikreymer great. Will try that wacz. So is that your very own extension? Cool. The warc.gz that failed is this one https://archive.org/download/lego.gizmodo.com-sitemap-2008-20160821/lego.gizmodo.com-sitemap-2008-20160821.warc.gz
Should this fix the double loading needed to get the JS worker warmed up?
This is quite simple. If there is a starting URL, we provide it and the bar goes away. If not, we check if the bar is enabled and show it. If not, we just show the 'pages' selector so the user can decide. It's late and I'm tired; there can be better options for realz.
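The three-way decision in this commit message can be sketched as a small pure function. This is an illustration only: `chooseEmbedView` and the field names it returns (`view`, `url`, `showNavbar`) are hypothetical and are not the formatter's or replayweb.page's actual settings.

```javascript
// Hypothetical sketch of the fallback order described above:
// 1. a starting URL wins and hides the bar,
// 2. otherwise show the bar if it is enabled,
// 3. otherwise fall back to the 'pages' selector.
function chooseEmbedView({ startUrl = '', navbarEnabled = false } = {}) {
  if (startUrl) {
    return { view: 'replay-only', url: startUrl, showNavbar: false };
  }
  if (navbarEnabled) {
    return { view: 'with-navbar', url: null, showNavbar: true };
  }
  return { view: 'pages-list', url: null, showNavbar: false };
}
```

Keeping the decision in one function makes it easy to change the fallback order later without touching the rendering code.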
I created a second Demo object with the wacz extension. It works fine and I can now even set the url and remove the bars via a UI-facing setting. Not sure if it's that useful if the warc file is not fully tested and has broken links, but it can be done.
Thanks for setting this up so quickly!
@DiegoPino the page loads without error but shows this; then, clicking on the chain logo/replayweb.pages, the web page is displayed.
@DiegoPino this one loads and displays the web page on first access.
Giancarlo. Thanks. The first example i shared is confusing in terms of UI, i agree.
Since it's embedding a full page, `logotype`, `docs`, `help` are all linked with relative links but none of them exist in our site. Pages and the other links do read and show the assets in the correct way. These are all things we can tune. The second object I shared is passed a starting base URL from the metadata (JSON) and thus loads an initial page and hides the links (docs, etc.) that don't work, but in that case having a way of still browsing or playing back a recording could be useful if you get lost with the links.
I feel both examples help refine the experience and we will get there.
The ui.js has all/most of this as inline html so i see we can start by testing/documenting findings and discussing with Ilya. Thanks friend!
Hi @ikreymer thank you for all this great interaction.
Great. Still, I feel I can help there. Our S3 is, in PHP terms, a stream wrapper, which means it deals with remotes as locals and allows many of the direct filesystem functions to be used on the files even when they are served via external APIs. But that does not mean that I'm passing headers and dealing with responses the same way S3 would natively. The reason is that we (institutional repositories use case) could/want to apply access to files sometimes only in the context of the DO Object use. So if, e.g., the Object can not be seen by the user, the attached file inherits that. So, long explanation to say that HEAD is not managed by me right now, but directly by NGINX. And I could help by giving you a better HEAD response (just need to know what is better).

Also, this commented-out code was meant to actually implement streaming manually. Why? Because some users like @giancarlobi are not using S3 and we also want to allow that for files served directly from the filesystem. That said, there is an opportunity there to tune that streaming (size of the chunks, which headers are needed, etc.) to adapt to this need and in general to any other large file need (video, sound), taking ideas.

Other topics and some questions/housekeeping:

Embed tag. Do you prefer me to open issues (or one meta issue) in your repo? Main concern is the links that don't work inside the iframe context (like clicking on the logo) vs. the ones that are super useful and make the experience great, like the

All this said, I see so much potential in this collaboration and am happy we can move forward. Your work on this is amazing!!
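As an illustration of what tuning the manual streaming could involve, an endpoint that streams by hand has to turn a `Range` request header into a byte window and answer with `206` plus `Content-Range`. The sketch below covers only that calculation, for a single `bytes=` range following standard HTTP range semantics. `resolveRange` is a hypothetical helper, and it is written in JavaScript only for consistency with the other sketches in this thread; the module's real streaming code is PHP.

```javascript
// Hypothetical helper: maps a Range header and a file size to the status,
// byte window and headers a manual streaming endpoint would send.
// Returns null when the header is absent or unsupported, so the caller can
// fall back to a plain 200 response with the whole file.
function resolveRange(rangeHeader, size) {
  const match = /^bytes=(\d*)-(\d*)$/.exec((rangeHeader || '').trim());
  if (!match || (match[1] === '' && match[2] === '')) return null;

  let start;
  let end;
  if (match[1] === '') {
    // Suffix form "bytes=-N": the last N bytes of the file.
    start = Math.max(size - Number(match[2]), 0);
    end = size - 1;
  } else {
    start = Number(match[1]);
    if (match[2] === '') {
      end = size - 1; // open-ended "bytes=N-"
    } else {
      end = Number(match[2]);
      if (end < start) return null; // invalid spec: ignore the header
      end = Math.min(end, size - 1);
    }
  }

  if (start >= size) {
    // Nothing to serve for this window.
    return { status: 416, headers: { 'Content-Range': `bytes */${size}` } };
  }
  return {
    status: 206,
    start,
    end,
    headers: {
      'Accept-Ranges': 'bytes',
      'Content-Range': `bytes ${start}-${end}/${size}`,
      'Content-Length': String(end - start + 1),
    },
  };
}
```

The chunk size used while copying bytes from `start` to `end` is independent of this calculation, so it can be tuned separately for WACZ, video or sound.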
…brecorder/replayweb.page#5) range support check: even if response is 200 on initial fetch, check for 'Accept-Range: bytes' to determine range support (see: esmero/format_strawberryfield#81)
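The check named in this commit message can be sketched roughly as follows. This is not the code from replayweb.page; `supportsRanges` is a hypothetical helper that takes a status code and a plain object of response headers. Note that the standard header name is `Accept-Ranges` (plural), which is what the sketch looks up.

```javascript
// Hypothetical sketch: decide whether a server supports byte ranges from
// the status and headers of an initial response.
function supportsRanges(status, headers) {
  // A 206 answer to a ranged request proves support outright.
  if (status === 206) return true;
  // Only a successful full response is worth inspecting further.
  if (status !== 200) return false;

  // Header names are case-insensitive, so normalize before the lookup.
  const lower = {};
  for (const [name, value] of Object.entries(headers || {})) {
    lower[name.toLowerCase()] = String(value);
  }
  const units = (lower['accept-ranges'] || '')
    .split(',')
    .map((unit) => unit.trim().toLowerCase());
  return units.includes('bytes');
}
```

With a check like this, a server that answers `200` but advertises `Accept-Ranges: bytes` can still be read in chunks instead of being downloaded in full.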
@DiegoPino I think I have a fix for it, just added a check for Yeah, I think the current default mode of hiding the location bar can be confusing. If you add
Thanks! will do. Do i need to change the remote library version? i'm using CDN. Or should i test in a different way?
Will do thanks
Me too. There is double loading. And then the site itself loads again. Maybe it's because index.html is now served locally (in Archipelago)? Maybe I need to tune that index.html and make it slightly different?
Yes, so instead of: and instead of: This will use the tip of the main branch from replayweb.page. Even though I recommended against this for production, for development it probably makes sense since I'm tweaking the latest. Once things are settled, I can do a release and update to the CDN version for the release. Here's another test case: https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/moabvideos.wacz
@giancarlobi hope you had a great day and that you are sleeping now! For tomorrow, look at this one https://play.archipelago.nyc/do/0a7d02bb-928e-403c-b87c-7cf5b9d9d563
@ikreymer thanks! I will disconnect for today (long/zoom calls training day) but will make those changes tomorrow AM in the live site (won't touch this pull because this is the official thing). I will also activate your account in a little while. You good with a full admin one or do you feel shy and want to start with just a user that can create/edit content? Let me know. Thanks @mitchellkeaney ping so you know we will have more people stopping by for dinner =)
It was always WACZ not WARCZ
@ikreymer I had some issues with your updated library and for some strange reason Sumitra's warc stopped working. So I reverted to the CDN until tomorrow, when I have better eyes to check what happened. Probably something tiny somewhere or just my computer refusing to work. Who knows.
Before I leave: pretty sure it's some CORS issue, just in case you wonder what could be happening, because Chrome changed from "secure" to insecure in the header. Good night.
Hi Diego,
@DiegoPino regarding the mixed content, i noticed this warning:
Looks like the file redirects to https anyway, but because it starts out http, the page is listed as 'not secure'; not sure if that is affecting the replay in the iframe.
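One way to catch this kind of warning before it reaches the browser is to compare the scheme of the embedding page with that of the archive's source URL. A minimal sketch, assuming only the standard `URL` class; `isMixedContent` is a hypothetical helper and is not part of the formatter.

```javascript
// Hypothetical helper: true when an https page would request the resource
// over plain http, which browsers flag as mixed content.
function isMixedContent(pageUrl, resourceUrl) {
  const page = new URL(pageUrl);
  // Passing the page as base also resolves relative resource URLs.
  const resource = new URL(resourceUrl, page);
  return page.protocol === 'https:' && resource.protocol === 'http:';
}
```

A formatter could run a check like this on the stored source URL and either rewrite it to https or warn the editor when it returns true.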
@ikreymer totally. Makes so much sense. SSL/Non mixed media is the worst these days. That can be fixed, sure. I will try to open some issues in your own repo if time permits (long list of other todos today) some of the following will go there:
Thanks for all the new code and testing! Your ui.js and sw.js both rock. This is great.
Every unique
Oh, it won't load in the page, it is just the equivalent of hitting back in the browser.
Is this before the progress bar starts? With WACZ, it should also be pretty quick.
Ah, I think it can improve the loading for embeds, and just add a try again/reload if it fails during loading. I'm curious if it is still trying to load the full file even with the latest sw.js / ui.js? I'll try to repro locally as well (thanks for the updated instructions!)
@ikreymer thanks for your reply. Yes, the preload will be for before the progress starts, actually for when sw.js is still attaching, since on my older computer (but still 16 GB of RAM) and on my phone there are a few seconds of white screen I want to avoid for end users. Seems pretty trivial to do; I think I will reuse the JS I have for the 3D Mesh loader.
I need to check that later tomorrow. Until now all my tests have been done pretty late in the night (sorry) so they are sometimes hard to reproduce, but during one of those I remember pressing back and back inside the frame and it ended up loading the same page I have as wrapper (e.g. play.archipelago.nyc, the previous one as the Object being visited, so one step behind in the browser history) inside the frame. Which looked funny but unexpected. I see a lot of cool improvements in your code so I will test them out as soon as I push some other code I have; I want to allow different streaming/chunking strategies based on an argument passed to the URL, so that I can test better without having to duplicate so much code. Thanks!
I think I found the source of confusion! You were referring to the 'Back' button that appears when there is an error, not the arrow back/forward buttons! I'm adding a fix so it'll have a 'Try Again', which will reload the embed, and is about as much as can be done in case of failure to load. (The 'Back' on error was for standalone https://replayweb.page to return to the home page, and doesn't make sense with an embed.)
Great!!! Thanks so much
This one goes for Ilya.
Merging this folks. There is still work to do on our site, like making sure streaming is piped between S3 and the client and more settings/options for the viewer. But let's keep things discrete in the meantime, always time for another pull! Thanks to all of you!!
See #80
This code actually works!
You will see a few JS alerts but those are harmless; they are related to how the web replay JS library tries multiple sources for its ui.js library. This adds a new formatter (proof of concept, can be way better and will be):
Strawberry Warc Formatter using replay.web Embedded player
with simple settings for now: JSON Source Key for WARCs (use as:document please), Width, Height and an extra one for remote URLs.
@TODO:
add extra settings to hide nav bar via "embed=replay" and "url="
The rest is what we do: load some JS, iterate over some stuff, do some checks and render some HTML.
A note. I figured out that the issue with the height of the element/iframe is that it is not automatically treated as a block-level element, so it was not flowing and extending the rest of the layout. Now it should be fine!
One last note: I had to create a controller to serve the sw.js file. It needs to be inside a /replay path inside the main domain and D8 can not do that via a direct file or the library. I like how I did it!
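Some background on why the path matters: by default a service worker can only control pages under the directory its script is served from, so a worker file sitting in a module's asset folder cannot cover URLs under /replay. The sketch below illustrates that rule with the standard `URL` class. `defaultScope` and `inScope` are hypothetical helpers and the example.org paths are placeholders, not the module's real routes.

```javascript
// Default scope of a service worker: the directory containing its script.
function defaultScope(scriptUrl) {
  return new URL('./', scriptUrl).href;
}

// A page is controllable only if its URL starts with the worker's scope.
function inScope(pageUrl, scope) {
  return new URL(pageUrl).href.startsWith(scope);
}

// Worker served by a controller under /replay/ (placeholder host).
const replayScope = defaultScope('https://example.org/replay/sw.js');

// Worker served as a static module asset (placeholder path).
const assetScope = defaultScope('https://example.org/modules/some_module/js/sw.js');
```

Here `inScope('https://example.org/replay/w/page', replayScope)` is true while the same page checked against `assetScope` is false, which is the gap a dedicated controller route closes.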
@giancarlobi @mitchellkeaney @ikreymer