This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

Binary model that was trained on Common crawl #428

Closed
MrBoor opened this issue Feb 6, 2018 · 15 comments

Comments

@MrBoor

MrBoor commented Feb 6, 2018

Hello!
I enjoy using your library and pretrained vectors. I see that for the vectors trained on Wikipedia you provide both the binary model and the pretrained vectors. However, for the vectors trained on Common Crawl, you provide only the pretrained vectors. Could you publish the binary model for them as well?

Thanks,
Alexander.

@orech

orech commented Feb 12, 2018

That would be very helpful for me as well

@JovanVeljanoski

I would also very much appreciate it if you could publish the binary model. Thanks!

@rboyes

rboyes commented Mar 8, 2018

Yes it would be very useful

@rboyes

rboyes commented Mar 9, 2018

The English link you posted above contains only the word vectors, not the model .bin files, which are what we are asking for.

With the model files, we can create vectors for out-of-vocabulary words, but we can't do that with the word vectors alone.
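To illustrate the limitation, here is a minimal sketch, assuming the usual .vec text layout (a header line with vocabulary size and dimension, then one word and its values per line). The two-word file contents and the `load_vec` helper are made up for this example and are not part of the fastText library. Once the text file is parsed, a word outside the vocabulary simply has no entry.

```python
import io

# Hypothetical two-word .vec content: header "<vocab size> <dimension>",
# followed by one word and its vector values per line.
VEC_TEXT = """2 3
hello 0.1 0.2 0.3
world 0.4 0.5 0.6
"""


def load_vec(handle):
    """Parse .vec-style text into a dict mapping each word to its vector."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == count
    return vectors


vectors = load_vec(io.StringIO(VEC_TEXT))
print(vectors.get("hello"))   # a known word has a stored vector
print(vectors.get("helloo"))  # a misspelled (out-of-vocabulary) word has none
```

A plain lookup table like this has nothing to fall back on for unseen words, which is why the thread asks for the .bin model.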

@phdowling

Also interested in this. The .bin files for English would be very valuable.

@m09

m09 commented Apr 24, 2018

I would also be interested in the binary vectors.

@Schneitzer

Is there a reason why the .bin file will not be made open to the public?

It would be really helpful to be able to generate OOV word vectors for English words, but without the .bin file this is not possible.

@maxfriedrich

I found a link to an English .bin in the comments of #494: https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M-subword.bin.zip

@Schneitzer

Thank you maxfriedrich.

However, I think most of us would like to see the .bin file for the Common Crawl corpus. The link you provided contains only the vectors trained on Wikipedia and news data, not those trained on Common Crawl.

I'm currently working on text classification tasks on tweets, so it would be nice to have the Common Crawl vectors. I hope they will be published later.

@rktamplayo

Any update on this? I hope an admin at least assigns someone to answer our queries...

@thusithaC

This is indeed strange. For non-English languages, the Common Crawl binaries are available, but for English (the most widely used language) they are missing?

@yuchsiao

Just checking back in to see if there is any plan to release the Common Crawl version of the binaries for English. Any update?

@vdpappu

vdpappu commented Aug 14, 2018

Just bumping this up. Checking whether we could get the binaries for Common Crawl.

@EdouardGrave
Contributor

Hi all,

Thank you for raising this issue.

The model trained on the Common Crawl data did not use subwords, and thus the binary model would not contain any more information than the text file that we released. In particular, this binary model could not be used to compute representations for out-of-vocabulary words. This is the reason why we did not release the binary model.

We will likely release a model trained on Common Crawl data with subwords in the near future (both binary and text models will be released).

Best,
Edouard.
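
To make the subword point concrete, here is a rough sketch of how a subword model can build a vector for an unseen word: split the word into character n-grams, look each one up in a table of n-gram vectors, and average them. Everything below is an assumption for illustration: the n-gram lengths (3 to 6), the toy table of random vectors, the table size, and the simple `bucket` hash stand in for whatever the trained model actually stores and are not the library's implementation. A model trained without subwords has no such table, so this step has nothing to work with.

```python
import random

DIM = 4         # toy vector dimension (assumption)
BUCKETS = 1000  # toy n-gram table size (assumption)

# Stand-in for trained n-gram vectors: random values with a fixed seed.
rng = random.Random(0)
ngram_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]


def char_ngrams(word, minn=3, maxn=6):
    """Return character n-grams of the word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def bucket(gram):
    """Toy deterministic hash of an n-gram into a table row (illustrative only)."""
    h = 0
    for ch in gram:
        h = (h * 31 + ord(ch)) % BUCKETS
    return h


def oov_vector(word):
    """Average the n-gram vectors of a word that has no vocabulary entry."""
    rows = [ngram_table[bucket(g)] for g in char_ngrams(word)]
    return [sum(col) / len(rows) for col in zip(*rows)]


print(char_ngrams("cat"))
print(oov_vector("helloo"))
```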

@thusithaC

@EdouardGrave Hi Edo, any update on the subword model trained on Common Crawl?
