Interactive Video #33

Open
AdamSobieski opened this issue Jun 19, 2021 · 16 comments

@AdamSobieski

AdamSobieski commented Jun 19, 2021

Introduction

Interactive videos are videos that support user interaction. They can navigate, or branch, to video content depending upon users’ interactions with menus, users’ settings and configurations, models of users, other variables or data, random numbers, and program logic.

Educational uses of interactive video include, but are not limited to: instructional materials, how-to videos, and interactive visualizations.

Some chatbots or dialogue systems can be stored as interactive videos.

Interactive films are interactive videos or cinematic videogames in which one or more viewers can interact to influence the course of an unfolding story. Interactive films can be described as video forms of choose-your-own-adventure stories, or gamebooks. Contemporary examples include: “Black Mirror: Bandersnatch”, “Puss in Book: Trapped in an Epic Tale”, and “Minecraft: Story Mode”.

One day, some AI systems could be trained using large collections of interactive films.

A Standard Runtime Environment

As envisioned, interactive videos contain JavaScript scripts and/or WebAssembly (WASM) modules. These scripts and modules should be provided with a runtime environment different from the one provided by Web browsers for Web documents. The runtime environment provided for interactive video scripts and modules should include functionality for:

  1. navigation (e.g., seeking to clips, segments, or chapters in interactive videos)
  2. prefetching resources (e.g., files in videos, or video attachments)
    a. there should be a way, e.g., using URL fragments, to locate and prefetch files in videos, or video attachments
  3. opening resources (e.g., files in videos, or video attachments)
    a. there should be a way, e.g., using URL fragments, to locate and retrieve files in videos, or video attachments
  4. presenting menus
    a. perhaps also functionality for presenting image maps or “hotspots” atop video
  5. accessing users’ settings and configurations
  6. storing and accessing local data
  7. storing and accessing remote data
    a. e.g., learner, player, and user models
  8. parsing JSON, XML, RDF, and other data formats
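
To make this concrete, here is a minimal sketch of such a runtime environment's surface, written as a JavaScript global object. Every name below is hypothetical and intended only to illustrate the eight capabilities above; nothing here is standardized.

// Hypothetical global object provided to an interactive video's scripts and
// modules; none of these names are standardized.
const ivideo = {
  async seek(target) {},            // 1. navigate to a clip, segment, or chapter
  async prefetch(fragmentUrl) {},   // 2. prefetch a file in the video, or an attachment
  async open(fragmentUrl) {},       // 3. open a file in the video, or an attachment
  async present(menu, options) {},  // 4. present a menu (see below)
  settings: new Map(),              // 5. users' settings and configurations
  localData: new Map(),             // 6. local data
  async remoteData(url, init) {},   // 7. remote data, e.g., learner, player, and user models
  parse(text, mediaType) {}         // 8. JSON, XML, RDF, and other data formats
};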

There is a need for one standard runtime environment for use by multiple interactive video formats. With a standard runtime environment, interactive video player software (e.g., Web browsers) could more readily play multiple formats of interactive video.

Security Considerations and User Permissions

An interactive video’s JavaScript scripts and WASM modules should have access only to the standard runtime environment intended for them. Containing documents could, perhaps, grant broader access (see also: <iframe>).

Interactive videos could contain hashes for and/or digital signatures of files in videos, or video attachments. WASM modules could be digitally signed.

Interactive video players (e.g., Web browsers) could make use of user permissions systems to protect users’ data privacy while providing users with features.

Menus and Accessibility

As envisioned, the presentation and display of menus are handled by interactive video players through the runtime environment API. A menu could be presented to the user by invoking a present function:

Promise<MenuResponse> present(any menu, optional any options);
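
For illustration only, usage might resemble the following; the menu structure, the shape of MenuResponse, and the seek function are assumptions, not standardized.

// Illustrative only; assumed to run in an async context inside the video's script.
const response = await present({
  prompt: "Which path should the story take?",
  choices: [
    { id: "forest", label: "Enter the forest" },
    { id: "castle", label: "Approach the castle" }
  ]
}, { timeoutMs: 10000, defaultChoice: "forest" });
await seek(response.choiceId); // navigate to the corresponding chapter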

Spoken Language Interaction and Dialogue Systems

Interactive videos could utilize speech recognition grammars (e.g., SRGS) to enable users to select menu options via spoken natural language. In Web browsers, this functionality could be provided via the Web Speech API.

Perhaps interactive videos could also utilize remote speech recognition services.
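
As a rough sketch of the browser-local case, a player might map a recognized utterance onto the current menu's options using the Web Speech API. The matching logic below is naive, and currentMenu and respondToMenu are hypothetical player-provided objects.

// The Web Speech API is real but unevenly supported across browsers;
// currentMenu and respondToMenu() are hypothetical.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = "en-US";
recognition.onresult = (event) => {
  const transcript = event.results[0][0].transcript.toLowerCase();
  // Naively match the utterance against menu option labels; a real player
  // might instead consult an SRGS grammar.
  const choice = currentMenu.choices.find(
    (c) => transcript.includes(c.label.toLowerCase()));
  if (choice) respondToMenu(choice.id);
};
recognition.start();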

Internationalization

Different versions of files in videos, or of video attachments, could exist in interactive videos for specific languages (e.g., menus, speech recognition grammars).

Documents Containing Interactive Videos

A document element for interactive video (e.g., <video>) could expose an event on its interface, raised whenever a menu is to be presented to a user. In this way, the presentation of a menu could be intercepted by a containing document. The containing document could then process the arguments passed to the present function, display a menu to the user in a stylized manner, and provide a response back to the interactive video’s scripts or WASM modules, for example resulting in navigation to, or seeking to, a video clip, segment, or chapter.
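
A sketch of such interception follows. The "menu" event name and its members are assumptions; no such event exists on media elements today.

// Hypothetical: no "menu" event exists on HTMLVideoElement today.
const video = document.querySelector("video");
video.addEventListener("menu", (event) => {
  event.preventDefault();               // suppress the player's default menu UI
  const { menu, respondWith } = event;  // illustrative event members
  showStyledMenu(menu)                  // hypothetical document-provided UI
    .then((choiceId) => respondWith({ choiceId }));
});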

Interactivity via Animated Colored Silhouettes in Secondary Video Tracks

One could use secondary video tracks to provide arbitrarily-shaped interactive regions. These secondary video tracks could each contain multiple animated colored silhouettes, overlay regions, or “hotspots”. The colors of these silhouettes would correspond with arbitrarily-shaped interactive elements. The color black, however, would be reserved for indicating the absence of a silhouette.

Consider an interactive video of an automobile engine, under a hood. Envision a secondary video track where there is a colored silhouette for each part of the engine. While the primary video track would be visible to a user, using the secondary video track(s), the user could, with their mouse, hover over and click on the parts of the engine.

Animated colored silhouettes could also be rectangular and mirror the motion of text or images in videos. This would facilitate traditional hyperlinks in videos.
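
A player or polyfill could implement hit-testing against such a silhouette track by sampling pixel colors. The sketch below assumes the secondary track can be decoded into its own hidden video element, maskVideo, synchronized with the visible primaryVideo; colorToPartId and highlightEnginePart are hypothetical application-provided functions.

// Sample the silhouette track's pixel under the mouse; black means "no silhouette".
const canvas = document.createElement("canvas");
const ctx = canvas.getContext("2d", { willReadFrequently: true });

primaryVideo.addEventListener("click", (event) => {
  canvas.width = maskVideo.videoWidth;
  canvas.height = maskVideo.videoHeight;
  ctx.drawImage(maskVideo, 0, 0);
  // Map the click position from display coordinates to video pixels.
  const rect = primaryVideo.getBoundingClientRect();
  const x = Math.floor((event.clientX - rect.left) * canvas.width / rect.width);
  const y = Math.floor((event.clientY - rect.top) * canvas.height / rect.height);
  const [r, g, b] = ctx.getImageData(x, y, 1, 1).data;
  if (r || g || b) {
    const partId = colorToPartId(r, g, b);  // hypothetical color-to-region lookup
    highlightEnginePart(partId);            // hypothetical application handler
  }
});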

Semantics and Metadata

With semantics and metadata, one could describe the contents of videos (the objects and events which occur in them) and place this information in semantic tracks. One could utilize semantic graphs as well as “semantic deltas”, or “semantic diffs”, which indicate instantaneous changes to semantic graphs.

As envisioned, the animated colored silhouettes in secondary video tracks have unique identifiers and are URI-addressable. In this way, semantics and metadata could more readily describe the silhouetted regions which map to visual contents in videos.

With semantic tracks, users could, for example, utilize queries, via user interfaces, upon the contents of videos and observe the query results, e.g., objects or events, visually selected, outlined, or highlighted in the videos. Also possible is that query results could be presented to users with storyboards or other visualizations using relevant images from videos.

JavaScript and WebVTT

The following syntax example illustrates embedding JavaScript in WebVTT text tracks. It provides two lambda functions for a cue: one to be called when the cue is entered, and the other to be called when the cue is exited.

05:10:00.000 --> 05:12:15.000
enter:()=>{...}
exit:()=>{...}
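
A polyfill might wire such cues up using the standard onenter and onexit events of TextTrackCue. In the sketch below, parseCueScript is hypothetical, and a real design would sandbox the extracted code rather than evaluate it directly.

// "video" is an HTMLMediaElement; onenter/onexit are standard TextTrackCue
// event handlers. parseCueScript() is hypothetical: it would extract the two
// lambdas from the cue text.
const track = video.addTextTrack("metadata");
const cue = new VTTCue(18600, 18735, "enter:()=>{...}\nexit:()=>{...}");
const { enter, exit } = parseCueScript(cue.text);
cue.onenter = enter;
cue.onexit = exit;
track.addCue(cue);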

Polyfills and Prototyping

It appears that to implement a polyfill to prototype interactive video functionality, the HTML5 media API would need to surface the capability to access files in videos, or video attachments.

As envisioned, a polyfill would load JavaScript scripts and WASM modules from videos, implement and provide the standard runtime environment for those scripts and modules, and then run the videos, e.g., perhaps by calling a function like main or raising an event.
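
A sketch of that flow follows, assuming a hypothetical getAttachment() capability, a hypothetical makeRuntimeEnvironment() that builds the standard runtime environment, and a main-function convention.

// getAttachment() and makeRuntimeEnvironment() are hypothetical; the exported
// main() entry point is an assumed convention, not a standard.
async function runInteractiveVideo(video) {
  const wasmAttachment = await getAttachment(video, "main", "application/wasm");
  const { instance } = await WebAssembly.instantiate(
    await wasmAttachment.arrayBuffer(),       // a BufferSource
    { env: makeRuntimeEnvironment(video) });  // the standard runtime environment
  video.play();
  instance.exports.main();
}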

Another approach could make use of HTML5 custom elements. A custom element:

<custom-ivideo src="X" />

could utilize an <iframe> to load a generated HTML5 document:

<iframe src="ivideo.php?src=url_encode(X)" allowfullscreen="true" />

with that generated HTML5 document resembling:

<html>
  <head>
    <script src="ivideo-polyfill.js" />
  </head>
  <body>
    <video src="X" />
  </body>
</html>

This could also be achieved utilizing the srcdoc attribute of the <iframe> element.
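
A minimal sketch of that srcdoc variant follows; "ivideo-polyfill.js" is the hypothetical polyfill from the example above.

// Sketch of the custom-element-plus-srcdoc approach.
class CustomIVideo extends HTMLElement {
  connectedCallback() {
    const src = this.getAttribute("src");
    const iframe = document.createElement("iframe");
    iframe.allowFullscreen = true;
    iframe.srcdoc =
      `<html><head><script src="ivideo-polyfill.js"><\/script></head>` +
      `<body><video src="${src}"></video></body></html>`;
    this.appendChild(iframe);
  }
}
customElements.define("custom-ivideo", CustomIVideo);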

Attachments

Attachments in videos are additional files, such as "related cover art, font files, transcripts, reports, error recovery files, picture or text-based annotations, copies of specifications, or other ancillary files".

One can refer to a comparison of video container formats to see which container formats presently support attachments.

Attachments in videos are a means for adding JavaScript scripts and WASM modules to videos. By placing utilized scripts and modules in interactive videos, the videos can be self-contained and portable.

Interfaces for inspecting and accessing attachments could resemble:

partial interface HTMLMediaElement
{
  [SameObject] readonly attribute AttachmentList attachments;
};
[Exposed=Window]
interface AttachmentList : EventTarget {
  readonly attribute unsigned long length;
  getter Attachment (unsigned long index);

  attribute EventHandler onchange;
  attribute EventHandler onaddattachment;
  attribute EventHandler onremoveattachment;
};

In theory, one could provide arguments including MIME type and natural language when opening a video attachment, per content negotiation.

partial interface AttachmentList
{
  Attachment? getAttachment(DOMString name, optional DOMString type, optional DOMString lang);
};

or

partial interface AttachmentList
{
  Promise<Attachment?> getAttachment(DOMString name, optional DOMString type, optional DOMString lang);
};

for example, getAttachment("main", "application/wasm").

Related specifications include the File API and BufferSource. The BufferSource interface is utilized for loading WASM modules.
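
Under the Promise-returning variant, and assuming an Attachment exposes a Blob-like arrayBuffer() method, loading an embedded module or a language-specific grammar might look like the following (assumed to run in an async context).

// Assumes the Promise-returning variant above and a Blob-like Attachment.
const wasm = await video.attachments.getAttachment("main", "application/wasm");
const module = await WebAssembly.compile(await wasm.arrayBuffer());

// Content negotiation by MIME type and natural language:
const grammar = await video.attachments.getAttachment(
  "menu-grammar", "application/srgs+xml", "en");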

Alternatively, mechanisms for inspecting and accessing video attachments could be encapsulated by Web browsers and the standard runtime environment for interactive video.

Conclusion

A standard runtime environment for interactive videos and standard formats for interactive videos are needed. With new standards, interactive videos would be readily authored, self-contained, portable, secure, accessible, interoperable, readily analyzed, and readily indexed and searched.

There is a need for one standard runtime environment for use by multiple interactive video formats. With a standard runtime environment, interactive video player software (e.g., Web browsers) could more readily play multiple formats of interactive video.

With standard interactive video formats (and perhaps open file formats for project files), extensible content authoring tools could be more readily developed. Authoring interactive stories and producing interactive films are difficult and, with new software tools, we could expect much more content.

With standard interactive video formats, interactive videos would be self-contained and portable.

With a standard runtime environment and standard interactive video formats, interactive videos would be more secure and would access user data and other resources only in accordance with user permissions.

With a standard runtime environment and standard interactive video formats, interactive videos would be accessible.

With standard interactive video formats, interactive videos would be interoperable with other technologies. For example, interactive videos could be played in Web documents and EPUB digital textbooks.

With standard interactive video formats, interactive videos could be better analyzed.

With standard interactive video formats, large collections of interactive videos could be better indexed and searched.

Thank you. I look forward to discussing these ideas with you.

References

[HTML5 Media] https://html.spec.whatwg.org/multipage/media.html
[PLS] https://www.w3.org/TR/pronunciation-lexicon/
[SMIL] https://www.w3.org/TR/SMIL/
[SRGS] https://www.w3.org/TR/speech-grammar/
[V8] https://v8.dev/
[V8 Isolates, Contexts, Worlds and Frames] https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/blink/renderer/bindings/core/v8/V8BindingDesign.md#A-relationship-between-isolates_contexts_worlds-and-frames
[WASM] https://webassembly.org/
[Web Animations] https://www.w3.org/TR/web-animations-1/
[Web Speech API] https://wicg.github.io/speech-api/
[WebVTT] https://www.w3.org/TR/webvtt1/

@spaceribs

I'm the developer of a web extension called Plopdown which adds interactive elements to existing videos on the web. The most complex and annoying problem in this space is the calculations and updates I need to make to ensure the elements related to the video are positioned correctly and show up in fullscreen. If anything should come out of this proposal, it's the ability to add elements to the playing video that are fully embedded.

@AdamSobieski

AdamSobieski commented Jun 21, 2021

@spaceribs, thank you. I wonder whether Plopdown uses rectangular regions for interactive elements or whether it utilizes arbitrary shapes.

On the topic of ways of implementing arbitrary shapes, in theory, one could use video tracks. These video tracks could contain multiple animated silhouettes, overlay regions, or “hotspots”. Consider an interactive video of an automobile engine, under a hood. Envision a secondary video track where there is a colored silhouette for each part of the engine. While the primary video track would be visible to a user, using the secondary video track(s), the user could, with their mouse, hover over and click on the parts of the engine.

@spaceribs

spaceribs commented Jun 21, 2021

For Plopdown, regions were somewhat restrictive and I specifically wanted to implement other kinds of overlays like picture-in-picture, games, etc. The best course of action was to use VTT metadata to store the rendering information and place an absolutely positioned <div> with a renderer which listens to the stream.

The tricky part was the positioning of the "stage" element. Originally I placed it as a sibling of the video, but the controls of Youtube/Netflix would prevent click events from coming through to my menus and elements. I eventually landed on querying via elementsFromPoint as the best solution to make sure the overlay displayed in fullscreen.

The majority of elements are just normal HTML/SVG elements, nothing more complex than that.
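
For illustration, the core of that overlay approach might be sketched as follows; this is a simplification, not Plopdown's actual code.

// Simplified sketch of an absolutely positioned overlay kept in sync with the
// video element; individual widgets re-enable pointer events as needed.
const overlay = document.createElement("div");
overlay.style.position = "absolute";
overlay.style.pointerEvents = "none";
video.parentElement.appendChild(overlay);

function reposition() {
  const rect = video.getBoundingClientRect();
  overlay.style.left = `${rect.left + window.scrollX}px`;
  overlay.style.top = `${rect.top + window.scrollY}px`;
  overlay.style.width = `${rect.width}px`;
  overlay.style.height = `${rect.height}px`;
}
new ResizeObserver(reposition).observe(video);
document.addEventListener("fullscreenchange", reposition);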

@chrisn

chrisn commented Jun 21, 2021

There's some similarity here with BBC R&D's object based media: https://www.ibc.org/trends/object-based-media-exploring-the-user-experience/6889.article

@AdamSobieski

AdamSobieski commented Jun 21, 2021

@chrisn, I approached these interactive and adaptive media topics from an AI and educational technology perspective. Thanks for that article, which showcases some BBC R&D work and brings to mind other interesting scenarios, e.g., journalism.

Here is a hyperlink to the referenced article: Object-Based Media: An Overview of the User Experience by Maxine Glancy, Lauren Ward, Nick Hanson, Andy Brown, and Michael Armstrong.

Here is a hyperlink to the StoryFormer tool: https://www.bbc.co.uk/makerbox/tools/storyformer .

@pchampin

pchampin commented Jul 1, 2021

@AdamSobieski @spaceribs the Media WG is currently being rechartered. Have a look at the proposed charter, I think some of the deliverables will address several of the points raised above.

@chrisn

chrisn commented Jul 1, 2021

It would be interesting to see a gap analysis:

  • what can be done today using existing browser capabilities, by building a player library?
  • what new browser capabilities are needed (taking into account current developments in the Media WG)?
  • what APIs would a standardized interactive video player need to provide, built on top of those capabilities?

@Malvoz

Malvoz commented Jul 1, 2021

These proposals also seem related:

@AdamSobieski

AdamSobieski commented Jul 4, 2021

@pchampin, @chrisn, thank you. If the topics aren't already in scope, I would like to propose adding the discussion and exploration of interactive video, branching video, interactive film, and related topics to the new Media WG charter.

Here is a preliminary gap analysis:

  • what can be done today using existing browser capabilities, by building a player library?

While JS libraries and scripts running in Web documents can add interactivity to included videos, self-contained and portable interactive videos carry their own JS scripts and WASM modules. It appears that polyfills, prototypes, and player libraries would therefore require the capability to inspect and access videos' file attachments, in particular JS scripts and WASM modules. This functionality could instead be implemented natively by browsers and provided only to interactive videos per their standard runtime environment.

Why should interactive videos be self-contained and portable?

  1. Ease of use. Interactive videos would not require JS libraries or JS in <script> elements to use. Websites and digital textbooks would be able to utilize interactive videos as easily as they can images, audio, or video. Users of popular CMSs would be able to simply use interactive videos without installing extensions.

  2. Separation of concerns. Websites or digital textbooks could be revised without having to make any changes to interactive videos. Similarly, interactive videos could be revised without having to make any changes to websites or digital textbooks.

  3. Portability and interoperability. Multiple websites or digital textbooks could (re)utilize interactive videos hosted by third parties.

  4. Security and user permissions. Self-contained interactive videos would also be secure, utilizing a standard runtime environment and user permissions system.

  5. Analysis, indexing and search. With self-contained and portable interactive videos, with a standard runtime environment, and as best practices emerge, interactive videos would be increasingly analyzable, indexable, and searchable.

  • what new browser capabilities are needed (taking into account current developments in the Media WG)?

The new browser capabilities needed include:

  1. JavaScript and WebAssembly in Interactive Videos
    a. Means of inspecting and accessing JS scripts and WASM modules in interactive videos (and attachments in general) could be provided to Web developers either: (1) by extending HTMLMediaElement, or (2) by implementing this functionality natively in browsers and then providing the relevant APIs to interactive videos through the standard runtime environment for interactive videos. In the latter case, only JS scripts and WASM modules in interactive videos would be able to inspect and access file attachments in videos.

  2. Secure Runtime Environment
    a. Placing interactive videos in secure <iframe>-like sandboxes can ensure that the JS scripts and WASM modules in them only have access to the standard runtime environment. It would be useful for browsers to be able to provide secure <iframe>-like environments for interactive videos without the scripting environments provided for Web documents and, instead, with the standard runtime environment for interactive videos.

  3. Secondary Video Tracks
    a. While playing and displaying primary video tracks, browsers should also be able to process synchronized secondary video tracks, e.g., containing one or more layers of colored silhouettes.
    b. Ensuring that these silhouettes have unique identifiers, or are URI-addressable, might require some new video formats or other innovative techniques.
    c. Web developers might desire to have some JavaScript APIs for these new tracks and related scenarios: (1) to inspect and access the set of currently visible silhouettes, to listen to events on this dynamic collection, and to listen to UI events on individual silhouettes, (2) to obtain bitmaps of silhouettes, to perform some operations on these, and to perform compositing with the visual contents of primary video tracks, e.g., to explore custom highlighting, visual outlines, and other custom effects. A rough shape for such APIs is sketched after this list.
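
Per item 3c, a rough shape for a silhouette-track API might be the following; every name below is hypothetical.

// Entirely hypothetical API surface for silhouette tracks.
const track = video.silhouetteTracks[0];
track.addEventListener("change", () => {
  for (const s of track.visibleSilhouettes) {
    console.log(s.id, s.uri);  // unique identifiers, URI-addressable
  }
});
track.addEventListener("click", async (event) => {
  const bitmap = await event.silhouette.getBitmap();  // an ImageBitmap
  // ...composite with the primary track on a canvas for custom highlighting...
});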

  • what APIs would a standardized interactive video player need to provide, built on top of those capabilities?

With respect to the APIs that a standardized interactive video player would need to provide, there are two perspectives to consider.

Firstly, there is the Web document perspective, viewing the interactive video player from the outside. Here, one can envision events, e.g., onmenu for when interactive videos present users with menus.

Secondly, there is the interactive video perspective, viewing the interactive video player from the inside. This is referred to, above, as the standard runtime environment: the APIs available to the JS scripts and WASM modules in interactive videos.

Possible features for the standard runtime environment include: navigation (seeking/branching), prefetching and opening video attachments, prefetching and opening remote resources, presenting menus, accessing users’ settings and configurations, storing and accessing local and remote data, and parsing JSON, RDF, XML, and other data formats.

@AdamSobieski

AdamSobieski commented Jul 9, 2021

@Malvoz, thank you. Combining video annotation with video semantics, end-users could annotate any objects, events, or activities occurring in videos.

Presently, one can select video content by making use of rectangular regions (e.g., media fragments). I am broaching uses of ancillary, or secondary, video tracks with one or more layers of uniquely-identifiable silhouettes. As envisioned, this content would be generated in part or entirely by computer vision algorithms.

Hypertext presentations of annotations (or arrows connected to these) could follow or track objects, events, or activities occurring in videos. Relevant video content could be highlighted when the mouse hovers over or clicks on an annotation.

There are interesting educational applications and scenarios to consider with respect to combinations of video annotation and video semantics.

@chrisn

chrisn commented Jul 19, 2021

> @pchampin, @chrisn, thank you. If the topics aren't already in scope, I would like to propose adding the discussion and exploration of interactive video, branching video, interactive film, and related topics to the new Media WG charter.

The Media WG is focused on standardising the specs listed in its charter, so probably isn't the best forum for exploratory discussion; the Media & Entertainment Interest Group (MEIG) is better suited. That said, if any of your requirements would need changes to any of the Media WG specs, those could be raised as issues in the relevant GitHub repos.

@LJWatson

LJWatson commented Apr 3, 2024

@AdamSobieski, @chrisn, was there any further action with the MEIG and/or filing issues as Chris suggested?

@AdamSobieski

@LJWatson, hello. I remain interested in these interactive video topics, e.g.:

  1. video capable of displaying interactive elements, e.g., menus, atop it,
  2. branching video where paths could be selected including based upon user interactions,
  3. streaming video where content might be routed to viewers or generated on the fly, including in response to user interactions.

I remain interested in brainstorming and discussing how these scenarios could be supported in Web browsers in standard ways, e.g., simply by using the <video> element. It may also be the case that WebRTC technologies could be of use for delivering interactive video scenarios...

@chrisn

chrisn commented Apr 3, 2024

There hasn't really been much further discussion on this issue, at least in a W3C context, and MEIG would still welcome input in its GitHub repo.

BBC R&D continues to do work in this area, see for example some of our pilots and collaborations.

@LJWatson

LJWatson commented Apr 4, 2024

Thanks @AdamSobieski and @chrisn. We'll need implementor interest to be able to move this forward, but before then I think the idea itself will need to be specified a bit more.

@AdamSobieski

AdamSobieski commented Apr 13, 2024

Thank you, @LJWatson. That makes sense.

I recently opened a discussion and brainstorming thread on these topics in the Civic Technology Community Group mailing list.

I am excited about possibilities pertaining to new user experiences with respect to the Web and smart television.

With respect to the concept of a "channel", I am brainstorming about channels 2.0, multi-channels, or multi-stream channels. These would be more than just video streams. These constructs might utilize static or dynamic XML or JSON resources to describe their features while referring to one or more real-time video streams.
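
For illustration, such a descriptive resource might resemble the following object, shown here as JavaScript; it could equally be serialized as JSON or XML, and all names and URLs below are invented.

// Invented example of a multi-stream "channel" description.
const channel = {
  name: "Example News",
  homepage: "https://example.com/news/home",
  subChannels: [
    { id: "weather", label: "Weather",  stream: "https://example.com/news/weather.m3u8" },
    { id: "world",   label: "World",    stream: "https://example.com/news/world.m3u8" },
    { id: "main",    label: "Combined", blendOf: ["weather", "world"] }
  ]
};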

Some ideas:

  1. "Channels" could each provide a homepage and/or a main menu for consumers to make use of, e.g., with remote controls.
  2. "Channels" could provide grid-based navigation widgets, or guides, for browsing the content on multiple, parallel video streams or sub-channels inside of them. Sets of sub-channels could be static, dynamic, or contain both statically and dynamically available sub-channels.
  3. "Channels" could provide UX for receiving customer feedback pertaining to content.
  4. "Channels" or specialized sub-channels could offer semi-personalized content. Customers could receive recommended, personalized content. Content providers could interrupt such content to provide other content, e.g., breaking news content, intended for entire audiences.
  5. "Channels" could provide customers with interactivity with respect to sub-channels, segments, and/or advertisements.
    a. Hyperlinks or hotspots could be provided in streaming interactive video content for navigating between sub-channels.
    b. Surveys and opinion polls could also be conducted via smart televisions using interactive video.
  6. "Channels" could support second-screen and other multi-device interoperability scenarios.
  7. "Channels" could provide settings and configuration for customers.

Per idea 1 (channel-specific homepages and/or main menus) or idea 2 (browsing sub-channels via grid-based navigation widgets, or guides), customers would be able to check their local weather forecasts at any point in time while remaining on a news "channel".

Per idea 2, for news content, sub-channels could be those kinds of contents indicated on websites' main menus and submenus (e.g., weather, national, world, local, business, technology, entertainment, sports, science, health). There could also be a default "combination" sub-channel which would be understood as selecting and blending together segments from those other specialized sub-channels.

In addition to news-related scenarios, another use-case scenario that comes to mind is music video. Different kinds of music could each be streamed in a parallel sub-channel of a music "channel", rather than a content provider having to operate dozens or more traditional television channels, e.g., one per kind of music.

After brainstorming and discussing with the group, towards specifying interactive video ideas a bit more, I am looking forward to providing an update here or in the MEIG repository.
