Incorrect similar words for words with 'vbz' pos #177

Open
dhowe opened this issue May 21, 2022 · 11 comments

@dhowe
Owner

dhowe commented May 21, 2022

For example, the result lists for 'spreads' contain many incorrect verb forms:

    let word = 'spreads', pos = 'vbz';

    let rhymes = RiTa.rhymes(word, { pos });
    let sounds = RiTa.soundsLike(word, { pos });
    let spells = RiTa.spellsLike(word, { pos });

@KarlieZhao
Collaborator

I think this problem appears because some words in the dict have incorrect pos tags, for example:

    "computerized":["k-ah-m p-y-uw1 t-er ay-z-d","jj nn vb vbn"],
    "discriminated":["d-ih s-k-r-ih1 m-ah n-ey t-ah-d","vbd jj nn vb"],
    "expected":["ih-k s-p-eh1-k t-ah-d","vbn vbd jj vb"]

Words like 'computerized' are treated as base-form verbs (because their pos includes 'vb'), so in this case, where the target pos is 'vbz', the conjugator directly returns 'computerizeds'. The easiest fix might be to just correct the pos of these words in the dict?
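
The failure mode described above can be sketched in isolation. This is a minimal illustration, not RiTa's actual conjugator: `candidatesFor` and `naiveVbz` are hypothetical helpers, and the 'spread' entry is made up for contrast. Only the 'computerized' entry comes from the dict excerpt.

```javascript
// Toy lexicon in the same [phones, pos] shape as the entries quoted above.
const dict = {
  "computerized": ["k-ah-m p-y-uw1 t-er ay-z-d", "jj nn vb vbn"],
  "spread": ["s-p-r-eh1-d", "nn vb vbd vbn vbp"], // illustrative entry, not from the real dict
};

// Hypothetical helper: naive third-person-singular rule (just appends "s").
function naiveVbz(base) {
  return base + "s";
}

// Hypothetical helper mimicking the suspected logic: any word tagged 'vb'
// is assumed to be a base form and is conjugated directly.
function candidatesFor(targetPos, lexicon) {
  const out = [];
  for (const [word, [, tags]] of Object.entries(lexicon)) {
    if (targetPos === "vbz" && tags.split(" ").includes("vb")) {
      out.push(naiveVbz(word));
    }
  }
  return out;
}

console.log(candidatesFor("vbz", dict)); // [ 'computerizeds', 'spreads' ]
```

Under this assumption, removing the stray 'vb' tag from 'computerized' is enough to keep 'computerizeds' out of the candidates.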

@KarlieZhao
Collaborator

#179 might have the same cause.

@dhowe
Owner Author

dhowe commented Jun 1, 2022

Good catch. I wonder if we might be able to remove all the 'vbn' entries from the dictionary, since we can compute them from the base form.

@dhowe
Owner Author

dhowe commented Jun 4, 2022

So we have done this before with verb tenses (see the earlier tickets from @cqx931 below). Once we find a pos that we want to remove from the dict, we need to find all the places in the code that must be updated to deal with that pos (soundsLike, spellsLike, search, pos, conjugate, hasWord, tag, etc.), then add tests (which will fail), then add the code to handle these cases, then remove the words with a script... then re-run the tests and adjust until they pass.

See:
dhowe/RiTaV1#536
dhowe/RiTaV1#366
dhowe/RiTaV1#357
dhowe/RiTaV1#365

dhowe/RiTaJSv1#37
#80

@KarlieZhao
Collaborator

So here's a list of verbs with incorrect pos in the current dict; the tags I think need to be removed (-) or added (+) are in the comments:

    "beat": ["b-iy1-t", "vb jj nn vbd vbn vbp"], //-vbn
    "become": ["b-ih k-ah1-m", "vb vbd vbn vbp"], //-vbd
    "bit": ["b-ih1-t", "nn vbd vbn jj rb vb"], //-vb, -vbn
    "bore": ["b-ao1-r", "vbd vbp jj nn vb"], //-vbd
    "broke": ["b-r-ow1-k", "vbd vbn jj rb vb"], //-vb, -vbn
    "build": ["b-ih1-l-d", "vb vbn vbp nn"], //-vbn
    "called": ["k-ao1-l-d", "vbn vbd vb"], //-vb
    "come": ["k-ah1-m", "vb vbd vbn vbp vbz jj"], //-vbd, -vbz
    "committed": ["k-ah m-ih1 t-ah-d", "vbn jj vb vbd"], //-vb
    "computerized": ["k-ah-m p-y-uw1 t-er ay-z-d", "jj nn vb vbn"], //-vb, -nn
    "concerned": ["k-ah-n s-er1-n-d", "vbn jj vb vbd"], //-vb
    "discriminated": ["d-ih s-k-r-ih1 m-ah n-ey t-ah-d", "vbd jj nn vb"], //-vb, -nn
    "ended": ["eh1-n d-ah-d", "vbd jj vb vbn"], //-vb
    "enter": ["eh1-n t-er", "vb vbn vbp"], //-vbn
    "expected": ["ih-k s-p-eh1-k t-ah-d", "vbn vbd jj vb"], //-vb
    "finished": ["f-ih1 n-ih-sh-t", "vbd jj vb vbn"], //-vb
    "gained": ["g-ey1-n-d", "vbd vbn vb"], //-vb
    "got": ["g-aa1-t", "vbd vbn vbp vb"], //-vb, -vbn
    "have": ["hh-ae1-v", "vbp jj nn vb vbn"], //-vbn
    "include": ["ih-n k-l-uw1-d", "vbp vbn vb"], //-vbn
    "increased": ["ih-n k-r-iy1-s-t", "vbn jj vb vbd"], //-vb
    "involved": ["ih-n v-aa1-l-v-d", "vbn vbd jj vb"], //-vb
    "knit": ["n-ih1-t", "vbn jj nn vb"], //+vbd
    "launched": ["l-ao1-n-ch-t", "vbn vbd vb"], //-vb
    "lead": ["l-eh1-d", "vb vbn vbp jj nn"], //-vbn
    "led": ["l-eh1-d", "vbn vbd vb"], //-vb
    "lived": ["l-ay1-v-d", "vbd vbn vb"], //-vb
    "outpaced": ["aw1-t p-ey-s-t", "vbd nn vb vbn vbp"], //-vb
    "oversaw": ["ow1 v-er s-ao", "vbd vb"], //-vb
    "oversold": ["ow1 v-er s-ow1-l-d", "vbn jj vb"], //-vb
    "own": ["ow1-n", "jj vbn vbp vb"], //-vbn
    "paled": ["p-ey1-l-d", "vbd vb vbn"], //-vb
    "pay": ["p-ey1", "vb vbd vbp nn"], //-vbd
    "plan": ["p-l-ae1-n", "nn vb vbn vbp"], //-vbn
    "post": ["p-ow1-s-t", "nn in jj vb vbd vbp"], //-vbd
    "prepaid": ["p-r-iy p-ey1-d", "jj vbn vb"], //-vb
    "pressured": ["p-r-eh1 sh-er-d", "vbn jj nn vb vbd"], //-vb
    "proliferated": ["p-r-ah l-ih1 f-er ey t-ih-d", "vbn vb vbd"], //-vb
    "remade": ["r-iy m-ey1-d", "vbn nn vb"], //-vb, +vbd
    "rent": ["r-eh1-n-t", "nn vb vbn vbp"], //-vbn
    "reopened": ["r-iy ow1 p-ah-n-d", "vbd vbn vb"], //-vb
    "reported": ["r-iy p-ao1-r t-ah-d", "vbd jj vb vbn vbp"], //-vb
    "repurchase": ["r-iy p-er1 ch-ah-s", "nn vbd vbn jj vb"], //-vbd, -vbn
    "resold": ["r-iy s-ow1-l-d", "vbn vbd vbp vb"], //-vb
    "roast": ["r-ow1-s-t", "nn vb vbn"], //-vbn
    "settled": ["s-eh1 t-ah-l-d", "vbd vbn jj vb"], //-vb
    "spit": ["s-p-ih1-t", "vb nn vbd"], //+vbn
    "started": ["s-t-aa1-r t-ah-d", "vbd jj vbn vb"], //-vb
    "sublet": ["s-ah1 b-l-eh-t", "vb vbn"], //+vbd
    "trouble": ["t-r-ah1 b-ah-l", "nn vbd vbp jj vb"], //-vbd
    "wed": ["w-eh1-d", "vbn vb"], //+vbd
    "were": ["w-er", "vbd vb"], //-vb
    "weren't": ["w-er-ah-n-t", "vbd vb"], //-vb
    "wet": ["w-eh1-t", "jj nn vbd vb vbp"], //+vbn

I suggest that the first step is to remove the 'vb' tag from words that are not in base form, which should fix the problem in this ticket. Then we can consider removing those verbs that have only vb* tags and no other tags, as suggested in dhowe/RiTaV1#357.
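
As a rough sketch of how step 1 could be scripted, the `-tag` / `+tag` annotations from the list above can be applied mechanically to each pos string. `applyEdits` is a hypothetical helper written for this illustration, not part of RiTa; the three entries and their edits are copied from the list.

```javascript
// Entries copied from the list above; edit strings mirror its comments.
const lexicon = {
  "concerned": ["k-ah-n s-er1-n-d", "vbn jj vb vbd"],
  "remade": ["r-iy m-ey1-d", "vbn nn vb"],
  "knit": ["n-ih1-t", "vbn jj nn vb"],
};
const edits = { "concerned": "-vb", "remade": "-vb, +vbd", "knit": "+vbd" };

// Hypothetical helper: applies comma-separated "-tag" / "+tag" edits
// to a space-separated pos string and returns the updated string.
function applyEdits(posString, editString) {
  const tags = posString.split(" ");
  for (const edit of editString.split(",").map((e) => e.trim())) {
    const op = edit[0];
    const tag = edit.slice(1);
    const idx = tags.indexOf(tag); // exact match, so "vb" never hits "vbd"
    if (op === "-" && idx !== -1) tags.splice(idx, 1);
    if (op === "+" && idx === -1) tags.push(tag);
  }
  return tags.join(" ");
}

for (const [word, editString] of Object.entries(edits)) {
  lexicon[word][1] = applyEdits(lexicon[word][1], editString);
}

console.log(lexicon["concerned"][1]); // vbn jj vbd
console.log(lexicon["remade"][1]);    // vbn nn vbd
console.log(lexicon["knit"][1]);      // vbn jj nn vb vbd
```

Matching whole tags rather than substrings matters here, since every inflected verb tag starts with 'vb'.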

For step 1, below are the corresponding tests to be added, taking 'concern' ('concerned') as an example:

    //hasWord
    expect(RiTa.hasWord("concerned")).to.be.true;
    expect(RiTa.hasWord("concerneds")).to.be.false;
    expect(RiTa.hasWord("concerneded")).to.be.false;

    //pos
    eql(RiTa.pos("concerned"), ["vbd"]);
    eql(RiTa.pos("concerned", { simple: 1 }), ["v"]);

    //search
    expect(RiTa.search({ pos: "vb", limit: -1 }).includes("concerned")).to.be.false;
    expect(RiTa.search({ pos: "vbn", limit: -1 }).includes("concerned")).to.be.true;
    expect(RiTa.search('concern', { pos: "vbd", limit: -1 })).eql(['concerned']);
    expect(RiTa.search('concern', { pos: "vbn", limit: -1 })).eql(['concerned']);

    //conjugate
    let opt = {
        number: RiTa.SINGULAR,
        person: RiTa.FIRST,
        tense: RiTa.PAST
    };
    expect(RiTa.conjugate("concern", opt)).eq("concerned");

    //unconjugate
    expect(RiTa.conjugator.unconjugate("concerned")).eq("concern");

    //allTags
    expect(RiTa.tagger.allTags("concerned")).eql(['vbd', 'jj', 'vbn']);

    //tag
    eq(RiTa.tagger.tag(["I", "am", "concerned", "about", "this", "."], { inline: true }), "I/prp am/vbp concerned/jj about/in this/dt .");

    //soundsLike
    expect(RiTa.soundsLike("concern", { pos: 'vb' }).includes("concerned")).to.be.false;

    //spellsLike
    expect(RiTa.spellsLike("concern", { pos: 'vb' }).includes("concerned")).to.be.false;

Please let me know if any part of the list or the tests has problems.

@dhowe
Owner Author

dhowe commented Jun 8, 2022

This looks really good -- I think the ultimate goal is to only have 'vb' for each of the regular verbs (plus all needed forms for irregular verbs) and compute all the other forms when needed... But this is a great first step -- do you want to do a PR in ritajs to start?
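
To make the "compute all the other forms when needed" idea concrete, here is a deliberately simplified sketch of deriving inflected forms from a regular base verb. `regularPast` and `regularVbz` are hypothetical helpers, not RiTa's conjugator: they ignore consonant doubling ('plan' to 'planned') and every irregular verb, which is why irregular forms would still need explicit lexicon entries.

```javascript
// Hypothetical helper: past form (vbd/vbn) of a *regular* verb only.
// Does not handle consonant doubling or irregular verbs.
function regularPast(base) {
  if (base.endsWith("e")) return base + "d";                      // computerize -> computerized
  if (/[^aeiou]y$/.test(base)) return base.slice(0, -1) + "ied";  // carry -> carried
  return base + "ed";                                             // concern -> concerned
}

// Hypothetical helper: third-person singular present (vbz), regular verbs only.
function regularVbz(base) {
  if (/(s|x|z|ch|sh)$/.test(base)) return base + "es";            // launch -> launches
  if (/[^aeiou]y$/.test(base)) return base.slice(0, -1) + "ies";  // carry -> carries
  return base + "s";                                              // concern -> concerns
}

console.log(regularPast("concern"));     // concerned
console.log(regularPast("computerize")); // computerized
console.log(regularVbz("launch"));       // launches
console.log(regularVbz("concern"));      // concerns
```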

@KarlieZhao
Collaborator

Yes, I'll make the tests pass and create a PR.

@dhowe
Owner Author

dhowe commented Jun 9, 2022

great -- also needs to handle:

    RiTa.analyze('concerned')
    RiTa.analyze('concerns')

@dhowe
Owner Author

dhowe commented Jun 18, 2022

@KarlieZhao status ?

@KarlieZhao
Collaborator

> @KarlieZhao status ?

The issue in this ticket should be fixed now. However, I think we can go ahead and try to remove the words with only vb* tags from the lexicon...
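
One possible starting point for that next step is a filter that lists the removal candidates. This sketch assumes "only vb* tags" means every tag is an inflected verb tag (vbd, vbg, vbn, vbp, vbz), so base-form 'vb' entries are kept; `inflectedOnly` is a hypothetical helper, and the sample pos strings are the listed entries with the proposed '-vb' edit already applied.

```javascript
// Sample entries from the list above, after the proposed '-vb' edit.
const sample = {
  "launched": ["l-ao1-n-ch-t", "vbn vbd"],
  "concerned": ["k-ah-n s-er1-n-d", "vbn jj vbd"],
  "oversaw": ["ow1 v-er s-ao", "vbd"],
};

// Assumption: these are the tags that count as "inflected verb" tags.
const INFLECTED = new Set(["vbd", "vbg", "vbn", "vbp", "vbz"]);

// Hypothetical helper: words whose every tag is an inflected verb tag.
function inflectedOnly(lexicon) {
  return Object.keys(lexicon).filter((word) =>
    lexicon[word][1].split(" ").every((tag) => INFLECTED.has(tag))
  );
}

console.log(inflectedOnly(sample)); // [ 'launched', 'oversaw' ]
```

Note that 'oversaw' shows up as a candidate even though it is irregular and cannot be computed by a regular rule, so a list like this would need review before anything is deleted.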

@dhowe
Owner Author

dhowe commented Jun 19, 2022

good - this will take some thought, so first come up with a plan... then we can discuss
