
Define how grammars work and give examples #57

Open
jlguenego opened this issue Jul 29, 2019 · 30 comments · May be fixed by #58

Comments

@jlguenego

You should provide a fully working example of how to use grammars, because I did not see any use case where I could use them.

I do not understand the purpose of having grammars or how to use them.
A Google search did not help me.

@foolip
Member

foolip commented Aug 1, 2019

Do you mean how to use recognition.grammars.addFromURI(...) and recognition.grammars.addFromString(...)? A while ago I tried to work this out myself by looking at the Chromium source code, but I couldn't get to the bottom of what they actually do.

@gshires do you have any context on what these methods do?

@marcoscaceres when you've looked at the API, could you make sense of this bit?

I'm adding use counters to Chromium to figure out how this is used, and if the grammar stuff isn't really used in the wild it's possible it could be removed.

@jlguenego
Author

Yes, I mean having an understandable use case where recognition.grammars.addFromURI(...) would be useful.

@kdavis-mozilla

I do not understand what is the purpose of having grammar...

The basic idea is to decrease word error rate for STT.

...and how to use it

You'd set grammars on a SpeechRecognition instance before calling start().

An example: say you are doing English STT for a restaurant that serves Indian dishes. You'd want to have a grammar that includes the names of Indian dishes, e.g. palak paneer, to decrease the word error rate for an STT engine unfamiliar with such terms.
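A minimal sketch of how that was meant to look. The JSGF string format and the SpeechGrammarList API are from the spec, but buildJSGF is a helper invented here for illustration, and (as discussed later in this thread) no current browser is known to actually honor the grammar:

```javascript
// Hypothetical helper: build a JSGF grammar string from a word list.
function buildJSGF(ruleName, words) {
  return '#JSGF V1.0; grammar dishes; public <' + ruleName + '> = ' +
    words.join(' | ') + ' ;';
}

const dishes = ['palak paneer', 'chana masala', 'rogan josh'];
const grammar = buildJSGF('dish', dishes);

// Browser wiring (would run only where the prefixed API exists):
// const recognition = new webkitSpeechRecognition();
// const list = new webkitSpeechGrammarList();
// list.addFromString(grammar, 1.0); // second argument is the weight
// recognition.grammars = list;      // set grammars before start()
// recognition.start();
```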

As to whether it's used in Chromium, I do not know.

@marcoscaceres
Collaborator

The spec fails to specify the format the grammar is in (see "ISSUE 3" in the spec). This really should not be in the spec at all, given how poorly specified it all is.

We should get rid of SpeechGrammarList entirely (it looks like more fingerprinting surface).

In Chrome, the src just returns the URL from the web page - so it seems completely useless.

@kdavis-mozilla

@marcoscaceres, here @jlguenego was asking: what is a use case for a grammar?

The issues you mention, while valid, are orthogonal to the "what is a use case for a grammar" question. So maybe they should be in a different GitHub issue?

@marcoscaceres
Collaborator

A new issue for removing SpeechGrammarList and its associated uses would be great.

@marcoscaceres marcoscaceres linked a pull request Aug 2, 2019 that will close this issue
@marcoscaceres
Collaborator

Sent PR #58

@foolip
Member

foolip commented Aug 2, 2019

https://bugs.chromium.org/p/chromium/issues/detail?id=680944 has some clues about what grammars are for, although it's a bug about something that doesn't work:

var recognizer = new SpeechRecognition();   // standard constructor
recognizer.continuous = false;              // stop listening at a pause
recognizer.lang = "en-GB";                  // e.g. "en-GB", "el-GR"
recognizer.interimResults = true;
recognizer.maxAlternatives = 5;

var commands = ['transfer', 'inquiry', 'statement', 'balance'];
var grammar = '#JSGF V1.0; grammar commands; public <command> = ' +
    commands.join(' | ') + ' ;';

var speechRecognitionList = new SpeechGrammarList();
speechRecognitionList.addFromString(grammar, 1);
recognizer.grammars = speechRecognitionList;

This example is about limiting recognition to a small set of words. A banking use case can be inferred from the set of strings in the example: 'transfer', 'inquiry', 'statement', 'balance'.

But again, this doesn't work in Chrome.

@jlguenego I'll go ahead and rename this issue to go beyond just asking for examples, hope you don't mind.

@foolip foolip changed the title grammar example Define how grammars work and give examples Aug 2, 2019
@kdavis-mozilla

A fuller example from MDN is the tutorial we wrote several years ago.

However, I doubt any browser supports the tutorial as written, since it uses JSGF, which as far as I know is not supported in any browser.

@foolip
Member

foolip commented Aug 2, 2019

looks like more fingerprinting surface

Unless the interface works and actually does something useful by changing the outcome of speech recognition, the API surface itself is write-only and doesn't reveal anything; there isn't even a way to feature-detect what's supported :)
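A sketch of the limited detection that is possible: you can check that the interface objects exist (under the spec name or Chrome's prefixed variant), but that says nothing about whether grammars have any effect. The helper name is invented here for illustration:

```javascript
// Check only whether the SpeechGrammarList interface is exposed on a
// global object; this cannot tell you whether grammars actually work.
function grammarInterfaceAvailable(globalObj) {
  return !!globalObj &&
    ('SpeechGrammarList' in globalObj ||
     'webkitSpeechGrammarList' in globalObj);
}

// In a page you would call: grammarInterfaceAvailable(window)
```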

@foolip
Member

foolip commented Aug 2, 2019

I've done some digging in HTTP Archive for pages containing "SpeechRecognition" and ".grammars" and found 381 results. Most are variations of the same script, and all the bits producing grammar strings that I could interpret use JSGF. So that has probably worked to some extent.

@foolip
Member

foolip commented Aug 2, 2019

I found 32 references to "jspeech" which is probably https://github.com/tur-nr/node-jspeech maintained by @tur-nr. @tur-nr, do you know which browsers JSGF works in?

@kdavis-mozilla

@foolip Andre Natal and I implemented JSGF in Firefox and Firefox OS many years ago, both preffed off by default, but both were later removed.

I'd guess most of the HTTP Archive hits you found are variations of the tutorial we made for the now-defunct Firefox OS.

@foolip
Member

foolip commented Aug 2, 2019

@kdavis-mozilla numerically most actually seem to be https://cdn.botframework.com/botframework-webchat/latest/botchat.js or variations on this. botframework.com is a Microsoft framework, so perhaps we could find someone to help shed light on how grammar is being used in this project. @thejohnjansen, are you able to help?

@tur-nr

tur-nr commented Aug 2, 2019

@foolip I wrote that library a few years back when doing a hackathon project with the speech API. I don't know the browser support, unfortunately, but I used Chromium (WebKit) Speech Recognition. I wasn't sure it was doing anything, as Chrome just captured anything I said regardless of the grammar I gave it 🤷‍♀️

Microsoft have used it in their bot framework, yes. You can review their usage on GitHub.

Hope that helps 😬

aarongable pushed a commit to chromium/chromium that referenced this issue Aug 2, 2019
SpeechGrammar's addFromUri method was already measured because the
capitalization doesn't match the spec, but also measure the other parts
of the grammar API surface to learn how widely used it is.

Prompted by spec issue: WICG/speech-api#57

Change-Id: Ib9289f911ad4966d5e0c836924444cd1d3b4be60
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/1732790
Auto-Submit: Philip Jägenstedt <foolip@chromium.org>
Reviewed-by: Henrik Boström <hbos@chromium.org>
Commit-Queue: Philip Jägenstedt <foolip@chromium.org>
Cr-Commit-Position: refs/heads/master@{#683515}
@foolip
Member

foolip commented Aug 2, 2019

Thanks @tur-nr! I tried to find use of grammars in https://github.com/microsoft/BotFramework-WebChat but couldn't find the same code as I see in https://cdn.botframework.com/botframework-webchat/latest/botchat.js.

@billba @danmarshall @corinagum, I see you are among the top contributors to that repo at Microsoft. Can any of you shed light on how https://cdn.botframework.com/botframework-webchat/latest/botchat.js uses the web-exposed APIs in https://w3c.github.io/speech-api/#speechreco-speechgrammar, which we're discussing in this issue? Specifically, what kind of values are you passing to addFromString and have you found it to have an effect on any browser?

@saschanaz
Contributor

@foolip
Member

foolip commented Aug 5, 2019

Yes, that's it, added by @compulim in microsoft/BotFramework-WebChat#937. @compulim, you mentioned "Web Speech API + JSRF" there, did you get that working in any browser at the time?

@foolip
Member

foolip commented Aug 5, 2019

With help from @gshires I've been able to locate what happens with the grammars in Chromium: the weights and URLs are passed along to the speech recognition engine. Since neither is interpreted by Chromium, this is effectively just a way to pass engine-specific configuration or options along.

As with text layout engines or WebRTC, having some controls is reasonable, and standardizing some options across speech engines might be possible.

So, at least as implemented in Chromium, grammars aren't quite what you'd expect them to be, and I would not recommend trying to use them, since the behavior isn't documented and could change.

I've sent https://chromium-review.googlesource.com/c/chromium/src/+/1732790 to measure the usage of these APIs in more detail.

@kdavis-mozilla

kdavis-mozilla commented Aug 5, 2019

So, at least as implemented in Chromium

What does "implemented" mean here?

Is the grammar only retained or is it retained and used to affect STT results?

@compulim

compulim commented Aug 5, 2019

@foolip Agree with your observations.

Although Chromium says it supports JSGF, when I send JSGF with weighted phrases, I don't notice a difference. It feels like the JSGF is simply ignored.

But my observation is very subjective because I am using my own voice to test the engine. Without looking at the source code, it's very hard to say whether the JSGF is working or ignored.

@foolip
Member

foolip commented Aug 5, 2019

@kdavis-mozilla Chromium only exposes the interfaces and passes along the grammar URL and weight as given without interpretation. Any actual effect would be in the speech engine service and that is neither open source nor documented, AFAICT. I wasn't able to find any use of grammars in httparchive that seems to have any effect in Chrome.

@compulim I'm pretty sure at this point that addFromString doesn't do anything at all in Chromium, and the source code doesn't mention JSGF anywhere.

What I'd like to do at this point:

  • Understand what configuration the engine is capable of via the grammar-related APIs
  • Wait for stats from new use counters to be available

I think the outcome will likely be adding other ways to configure the speech recognition, maybe more attributes. But it depends a lot on the feasibility of changing/removing the existing APIs.

@kdavis-mozilla

kdavis-mozilla commented Aug 5, 2019

...maybe more attributes...

As a designer of an STT engine I'd say this is not the way to go.

The only attributes I can think of that might make sense across engines are grammars/language models with weights. Even this is very engine specific.

Generally, there are many other such "attributes" that are STT engine specific and even vary from one version of an STT engine to the next. So exposing them in this API is not going to be a good design decision.

@foolip
Member

foolip commented Aug 5, 2019

It's probably easier to discuss a concrete example than the general idea of adding attributes, but I don't have a concrete example at this time.

@kdavis-mozilla

kdavis-mozilla commented Aug 5, 2019

I can come up with many concrete examples, all of them bad 😊 Off the top of my head, here are two...

Beam Width
For example, when using a language model one usually introduces a beam search in which the beam elements are ordered by language-model and acoustic-model scoring. The width of this beam, the "beam width", could be one of these "attributes".

However, the beam width is usually tuned such that the beam is as large as possible for the given resource budget. Allowing users to set the beam width above this value would increase quality but invalidate the financial calculations that went into deployment of the STT engine. Allowing users to decrease the beam width would decrease the quality of the STT results and frustrate users.

Language Model vs Acoustic Model Weighting
When one creates an STT engine it generally has two components, a language model and an acoustic model. Both models work together to assign probabilities to proposed transcripts: the language model suggests its probability for a transcript, and the acoustic model suggests its probability for the same transcript. The final probability is given by assigning a weight to the language model's probabilities and a weight to the acoustic model's probabilities. These weights are laboriously tuned to optimize performance for particular use cases.

These weights could be some of these "attributes". However, giving users access to these weights would let them tune the engine away from its optimal configuration for the in-browser use case, decreasing quality and frustrating users.
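The weighting described above can be sketched as a simple score combination. The function name and numbers here are illustrative only, not any engine's actual API:

```javascript
// Illustrative only: combine language-model and acoustic-model
// log-probabilities for a candidate transcript using tuned weights.
function combinedLogProb(logPLm, logPAm, wLm, wAm) {
  return wLm * logPLm + wAm * logPAm;
}

// With wLm = wAm = 1 this reduces to multiplying the raw probabilities:
// combinedLogProb(Math.log(0.2), Math.log(0.5), 1, 1) ≈ Math.log(0.1)
```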

If you want more examples I can provide more.

@foolip
Member

foolip commented Aug 6, 2019

No, no, I wasn't looking for examples, of course there are more bad ideas than good ideas, and bad ones are easy to list. I'll open new issues for any actual proposals if they show up.

@guest271314

@compulim

@foolip Agree with your observations.

Although Chromium says it supports JSGF, when I send JSGF with weighted phrases, I don't notice a difference. It feels like the JSGF is simply ignored.

But my observation is very subjective because I am using my own voice to test the engine. Without looking at the source code, it's very hard to say whether the JSGF is working or ignored.

The Chrome/Chromium implementation of SpeechRecognition is essentially a black box.

What is known is that in Chrome/Chromium no permission is requested and no notification is provided that the user's PII biometric data (the user's voice) is being recorded and sent to an undisclosed third-party web service. It is unclear whether the user's voice is stored forever and further used for research and development of proprietary technologies (#56). It is not documented exactly how the third-party web service performs STT.

Additionally, so-called "curse" words should not be censored in the result.

Until the glaring issue at Chrome/Chromium of users not being notified and not being asked permission for their voice to be recorded and sent to an undisclosed web service is resolved, there is no way to practically test or implement grammars.

What can be done now is to: 1) review how https://github.com/cmusphinx/pocketsphinx handles grammars (https://github.com/cmusphinx/pocketsphinx/search?q=grammar&unscoped_q=grammar); 2) start from scratch, converting voice to IPA (https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) and then to words in a given language, essentially the reverse of https://github.com/itinerarium/phoneme-synthesis; see also
@szewai

szewai commented Jan 20, 2021

Hello @foolip, is there any update on the usage report for the grammars API?

@foolip
Member

foolip commented Jan 25, 2021

Hi @szewai!

Here's the data from the use counters we have in Chrome, including the ones added in #57 (comment) and some more:

The new webkitSpeechRecognition() and new webkitSpeechGrammarList() usage is much higher than I would have guessed, but SpeechRecognition.prototype.start() gives a much better idea of the real usage. The addFromString() and addFromUri() usage in particular should be understood in relation to that, and a reasonable interpretation is that addFromUri() is often used when there's real usage of the API happening. (It could be that start() and addFromUri() usage are mutually exclusive, but I see no reason to suspect that.)

However, as stated in #57 (comment), these "grammars" are effectively engine-specific options. If we find that usage in the wild depends on this in some important way, I think we should first try to define what the effect of certain invocations of addFromUri() should be, or if that turns out impractical to implement for other engines, define an alternative way to communicate those settings, and try to migrate the usage in the wild to that standardized mechanism.

kennethkufluk added a commit to kennethkufluk/web-speech-api that referenced this issue Jan 27, 2022
Safari does not currently implement webkitSpeechGrammarList and throws an error at line 2.
I have guarded against this, so the script now works in Safari on desktop and iPhone.

I've also added a comment around the grammar code because no browser currently supports them, as noted in this WICG discussion:
WICG/speech-api#57 

Since we'll check usage stats to determine whether grammars should be removed from the spec, and this script encourages grammar usage (I imagine this code will be copy/pasted on the assumption it has an effect), I added a comment to discourage their use.

I didn't want to remove grammars completely, as this script is used as a demonstration of the spec on MDN.
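The guard this commit describes might look like the following sketch; the function name and the shape of the fallback are invented here for illustration, assuming Chrome's prefixed constructors:

```javascript
// Only touch the grammar API when a (possibly prefixed) constructor
// exists, so the rest of a demo still runs in browsers like Safari
// that don't implement SpeechGrammarList at all.
function attachGrammarIfSupported(recognition, grammarString, globalObj) {
  var Ctor = globalObj.SpeechGrammarList || globalObj.webkitSpeechGrammarList;
  if (!Ctor) return false; // e.g. Safari: skip, keep the demo working

  var list = new Ctor();
  list.addFromString(grammarString, 1); // weight 1, as in the MDN demo
  recognition.grammars = list;
  return true;
}

// In a page: attachGrammarIfSupported(recognizer, grammar, window);
```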

9 participants