@KenelmQLH
Collaborator

Thanks for sending a pull request!
Please make sure you click the link above to view the contribution guidelines,
then fill out the blanks below.

Description

Use UNK to fix the OOV problem in word2vec.

What does this implement/fix? Explain your changes.

Before using word2vec to infer tokens, check whether each token is in the vocabulary; tokens that are not fall back to UNK.
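
As a rough illustration of that check (a toy sketch, not the code in EduNLP/Vector/gensim_vec.py): `ToyEmbedding`, `unk_vector`, and the plain-dict storage are hypothetical names, and the all-zero fallback is an assumption chosen to mirror the zero vector shown in the "After" output.

```python
from typing import Dict, List, Sequence


class ToyEmbedding:
    """Hypothetical stand-in for a word2vec wrapper with an UNK fallback."""

    def __init__(self, vectors: Dict[str, List[float]], dim: int) -> None:
        self.vectors = vectors
        self.dim = dim
        # Assumption: unknown tokens map to an all-zero vector.
        self.unk_vector = [0.0] * dim

    def __getitem__(self, token: str) -> List[float]:
        # Check membership first instead of letting the lookup raise KeyError.
        if token in self.vectors:
            return self.vectors[token]
        # Return a copy so callers cannot mutate the shared fallback.
        return list(self.unk_vector)

    def infer_tokens(self, items: Sequence[Sequence[str]]) -> List[List[List[float]]]:
        # One vector per token, one list of vectors per item.
        return [[self[token] for token in tokens] for tokens in items]


emb = ToyEmbedding({"known": [1.0, 2.0]}, dim=2)
vectors = emb.infer_tokens([["known", "OOV"]])
```

With this shape, an out-of-vocabulary token no longer aborts inference for the whole batch; it just contributes the fallback vector.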

Pull request type

  • [DATASET] Add a new dataset
  • [BUGFIX] Bugfix
  • [FEATURE] New feature (non-breaking change which adds functionality)
  • [BREAKING] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [STYLE] Code style update (formatting, renaming)
  • [REFACTOR] Refactoring (no functional changes, no api changes)
  • [BUILD] Build related changes
  • [DOC] Documentation content changes
  • [OTHER] Other (please describe):

Changes

Feat1, use UNK to fix the OOV problem in word2vec.

Does this close any currently open issues?

#101

Any relevant logs, error output, etc?

Before

import numpy as np
from EduNLP.Tokenizer import get_tokenizer
from EduNLP.Vector import T2V
# `items` is assumed to be defined earlier (the raw input texts)
print(items)
tokenizer = get_tokenizer("pure_text")
token_items = tokenizer(items)
# ['OOV', '霜飔曈', '曚', '菡萏', '叆', '叇']

path = "./w2v/general_literal_300/general_literal_300.kv"
t2v = T2V('w2v',filepath=path)

t2v.infer_tokens(token_items)
# KeyError: "Key 'OOV' not present"

After

t2v.i2v.key_to_index("OOV")
# 0
np.array_equal(t2v.i2v["OOV"], np.zeros((300,)))
# True
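
The index lookup can be pictured the same way. This is a hypothetical sketch, assuming index 0 is reserved for unknown tokens (matching the output above) and real words are shifted by one; `ToyVocab` and `UNK_INDEX` are illustrative names, not the library's, and the actual implementation may lay out indices differently.

```python
from typing import Dict, Iterable

UNK_INDEX = 0  # assumption: slot 0 is reserved for out-of-vocabulary tokens


class ToyVocab:
    """Hypothetical vocabulary that never raises on unknown words."""

    def __init__(self, words: Iterable[str]) -> None:
        # Shift real words by one so that index 0 stays reserved for UNK.
        self._index: Dict[str, int] = {
            word: position + 1 for position, word in enumerate(words)
        }

    def key_to_index(self, word: str) -> int:
        # dict.get supplies the reserved index when the word is missing.
        return self._index.get(word, UNK_INDEX)


vocab = ToyVocab(["菡萏", "曚"])
oov_index = vocab.key_to_index("OOV")
known_index = vocab.key_to_index("菡萏")
```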

Checklist

Before you submit a pull request, please make sure you have the following:

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [FEATURE], [BREAKING], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage and all tests are passing
  • Code is well-documented (extended the README / documentation, if necessary)
  • If this PR is your first one, add your name and github account to AUTHORS.md

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@KenelmQLH KenelmQLH added the enhancement New feature or request label Nov 13, 2021
@KenelmQLH KenelmQLH self-assigned this Nov 13, 2021
@codecov-commenter
codecov-commenter commented Nov 13, 2021

Codecov Report

Merging #105 (126119f) into dev (376d23c) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##               dev      #105   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           48        48           
  Lines         1488      1489    +1     
=========================================
+ Hits          1488      1489    +1     

| Impacted Files | Coverage Δ |
|---|---|
| EduNLP/Vector/gensim_vec.py | 100.00% <100.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 376d23c...126119f.

@@ -1,3 +1,12 @@
v0.0.7:
Contributor

It is better to submit in another PR.

@tswsxk tswsxk merged commit 9071cca into bigdata-ustc:dev Nov 15, 2021
@tswsxk tswsxk linked an issue Nov 15, 2021 that may be closed by this pull request

Development

Successfully merging this pull request may close these issues.

Add handling of OOV