Large number of zero vectors #32

Closed
shirish93 opened this issue May 27, 2016 · 4 comments

Comments

@shirish93

shirish93 commented May 27, 2016

Hello,

This could be a problem with my own processing, but it appears that 617,129 of the 665,494 English vectors are zero vectors: each has a label, but all of its components are zero (i.e., there are only 48,365 non-zero vectors for English). I found this in the 300-dimensional dataset. Might this be an issue with the uploaded dataset, or should I recheck my methodology? If you can confirm, using the dataset available for download, that the issue is not on your side, I can work on fixing it on mine.

For reference, this is the code I used to count empty vectors:

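import numpy as np

# englishVectors: the sequence of 300-dimensional English vectors (not defined in this snippet)
empty = np.zeros(300)
count = 0
for each in englishVectors:
    # count a vector only if every component is exactly zero
    if np.array_equal(each, empty):
        count += 1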
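The same count can also be taken in one vectorized step, which is a useful cross-check on the loop. This is only a sketch: `count_zero_rows` is a hypothetical helper, and it assumes the vectors can be stacked into a single 2-D NumPy array with one row per word. The toy array stands in for the real data.

```python
import numpy as np

def count_zero_rows(vectors):
    """Return the number of rows in a 2-D array whose entries are all zero."""
    matrix = np.asarray(vectors)
    # any(axis=1) is True for rows with at least one non-zero entry,
    # so its negation marks the all-zero rows
    return int((~matrix.any(axis=1)).sum())

# Toy data standing in for the real embedding matrix (assumption)
toy = np.array([[0.0, 0.0, 0.0],
                [0.1, 0.0, -0.2],
                [0.0, 0.0, 0.0]])
zero_count = count_zero_rows(toy)
print(zero_count)
```

If the loop and the vectorized count disagree on the same data, the difference probably comes from how the vectors were loaded or shaped, not from the counting itself.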

I discovered this while trying to figure out the words closest to semi-common words.

For reference, using your code for 'most similar', the words that seem to be representative of the 'zero vectors' are the following:

['adddresse', 'rudat', 'barhydt', 'weeked', 'inovonics', 'alleppey', 'katten', 'georgievski', 'kopinski', 'waxwing', 'irin_plusnews']
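A zero vector has norm zero, so its cosine similarity to anything is undefined, and how such rows show up in a ranking depends on how the similarity code normalizes. The sketch below is not the repository's 'most similar' code. It is a hypothetical `most_similar_nonzero` helper that masks all-zero rows before ranking, assuming a 2-D matrix with one row per label and a non-zero query vector. The labels and numbers are toy values.

```python
import numpy as np

def most_similar_nonzero(matrix, labels, query, top_n=3):
    """Rank labels by cosine similarity to `query`, skipping all-zero rows."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1)
    keep = norms > 0  # zero rows have no direction, so cosine is undefined
    unit = matrix[keep] / norms[keep, None]  # normalize the remaining rows
    q = np.asarray(query, dtype=float)
    q = q / np.linalg.norm(q)  # assumes the query itself is non-zero
    sims = unit @ q
    order = np.argsort(-sims)[:top_n]  # indices of the highest similarities
    kept_labels = [label for label, k in zip(labels, keep) if k]
    return [(kept_labels[i], float(sims[i])) for i in order]

# Toy data (assumption): the third row is a zero vector
labels = ["cat", "dog", "rudat", "car"]
matrix = np.array([[1.0, 0.0],
                   [0.9, 0.1],
                   [0.0, 0.0],
                   [0.0, 1.0]])
result = most_similar_nonzero(matrix, labels, query=[1.0, 0.0])
print(result)
```

Filtering this way keeps unlabeled-looking words out of the neighbor list, but it hides the zero rows instead of explaining them, so the count is still worth investigating separately.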

@rspeer
Member

rspeer commented May 31, 2016

It might be an issue with the version of the dataset I uploaded. I'll check.

@rspeer
Member

rspeer commented May 31, 2016

I just re-downloaded the 600d dataset and, while there are zero-vectors, there are only 5882 of them, which is identical to the number of zero vectors in the version I evaluated for the paper.

This narrows it down: either I made a mistake truncating the 600d vectors to 300d, and you downloaded the 300d version; or you made a mistake in post-processing the data. Can you tell me more specifically what you did?
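One way to test the truncation hypothesis is to compare zero-row counts before and after truncating. The sketch below assumes that truncation means keeping the first k columns, which the thread does not spell out. Under that assumption, a row that is zero in the full matrix stays zero, and a non-zero row becomes zero only if all of its first k components are zero. The helper name and the random toy matrix are illustrative.

```python
import numpy as np

def zero_row_mask(matrix):
    """Boolean mask that is True for rows whose entries are all zero."""
    return ~np.asarray(matrix).any(axis=1)

# Toy stand-in for the full-width matrix (assumption): 5 words, 6 dimensions
rng = np.random.default_rng(0)
full = rng.normal(size=(5, 6))
full[1] = 0.0        # zero in every dimension
full[3, :3] = 0.0    # non-zero only in the trailing dimensions

truncated = full[:, :3]  # keep the first 3 columns

full_zero = zero_row_mask(full)
trunc_zero = zero_row_mask(truncated)
# Rows that were non-zero before truncation but are zero afterwards
newly_zero = np.flatnonzero(trunc_zero & ~full_zero)

print(int(full_zero.sum()), int(trunc_zero.sum()), newly_zero.tolist())
```

If the two counts match on the real files, truncation did not introduce any zero vectors, which points to a later processing step.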

@rspeer
Member

rspeer commented May 31, 2016

Confirmed that the 300d version, as uploaded, has the same 5882 zero-vectors. The error is in something you did with the data, I'd say.

@rspeer rspeer closed this as completed May 31, 2016
@shirish93
Author

Thanks, I'll work it out!

Thanks for the dataset also! It's extremely interesting to play around with it!
