This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

How does fastText predict words not in the training set? #475

Closed
bringtree opened this issue Apr 5, 2018 · 1 comment

@bringtree

python3

import fastText as fasttext
import os
import numpy as np

pwd = os.getcwd()
model_bin = "/home/bringtree/data/wiki.zh.bin"
model_vec = "/home/bringtree/data/wiki.zh.vec"
model = fasttext.load_model(model_bin)  # pretrained Chinese Wikipedia model
word_1 = model.get_word_vector('asdhasjhdkajshd')  # a made-up word, not in the vocabulary
print(word_1[:20])
[-0.10704836 -0.5085796  -0.05533567 -0.45416433  0.36912176 -0.04111901
 -0.3435909  -0.13083233  0.07110099 -0.23444724  0.26429185  0.31326798
  0.20615076 -0.23127083 -0.11359369  0.21303149 -0.19785886  0.32893217
 -0.14822693  0.02602408]

"asdhasjhdkajshd" is not in the train set. And i want to know how do the model predict it?


nixphix commented Apr 5, 2018

FT breaks each word down into a bag of character n-grams. The word is first padded with boundary markers < and >, so with minn = maxn = 3:

'awesome' => <aw, awe, wes, eso, som, ome, me>
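The n-gram extraction described above can be sketched in a few lines of Python. This is a simplified illustration, not fastText's actual implementation (the real library additionally hashes n-grams into a fixed number of buckets and keeps the full word as an extra token):

```python
def char_ngrams(word, minn=3, maxn=3):
    """Character n-grams of a word, padded with FT-style boundary markers."""
    padded = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

print(char_ngrams("awesome"))
# ['<aw', 'awe', 'wes', 'eso', 'som', 'ome', 'me>']
```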

Each subword n-gram is assigned a vector during training. When an OOV (out-of-vocabulary) word is encountered, FT builds a vector by combining the vectors of the subwords that make up the word. So if you ask for a vector for awme, you get a combination of its subword vectors, including <aw and me>, which were also seen during training.
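As a rough sketch of that composition step, assuming a hypothetical `ngram_vectors` lookup table in place of FT's trained (and hashed) subword matrix, and averaging the known n-gram vectors:

```python
import numpy as np

def oov_vector(word, ngram_vectors, dim, minn=3, maxn=3):
    """Average the vectors of the word's known character n-grams.

    ngram_vectors: dict mapping n-gram string -> np.ndarray of shape (dim,);
    a stand-in for fastText's trained subword embedding matrix.
    """
    padded = "<" + word + ">"
    vecs = [
        ngram_vectors[padded[i:i + n]]
        for n in range(minn, maxn + 1)
        for i in range(len(padded) - n + 1)
        if padded[i:i + n] in ngram_vectors
    ]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# 'awme' shares the trigrams '<aw' and 'me>' with 'awesome',
# so its OOV vector reuses those trained subword vectors.
table = {"<aw": np.array([1.0, 0.0]), "me>": np.array([0.0, 1.0])}
print(oov_vector("awme", table, dim=2))  # -> [0.5 0.5]
```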

This is what makes FT robust when dealing with misspelled words and internet slang.

Also, a subword vector is not the same as a word vector: <me> != me.

You can get the subwords of a word with model.get_subwords('asdhasjhdkajshd').

The FT unsupervised model is based on this paper: Enriching Word Vectors with Subword Information.
