update find_nearest_neighbor to match command line #552

abulhawa commented Jun 21, 2018

In addition to matching the result returned by ./fasttext nn, the modification returns the top n closest neighbors together with the score, i.e. the cosine of the angle between the vectors.
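For reference, a minimal numpy sketch of the behaviour described above (the exact code and signature in this PR may differ; the names here are assumptions based on the usage shown later in the thread):

import numpy as np

def find_nearest_neighbor(query, vectors, n=10):
    # Cosine similarity between the query vector and every row of `vectors`.
    cossims = vectors @ query
    cossims /= np.linalg.norm(query) * np.linalg.norm(vectors, axis=1)
    # Indices of the n largest similarities, best match first.
    top = np.argpartition(-cossims, range(n))[:n]
    return list(zip(top, cossims[top]))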

facebook-github-bot commented Jun 21, 2018

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

facebook-github-bot commented Jun 21, 2018

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

grahamannett commented Jun 26, 2018

This seems to work somewhat, but I am still getting different results, even though I have checked that my parameters for the Python unsupervised training are the same as for the skipgram training.

For instance, with the fastText executable using nn:

Query word? asparagus
spinach 0.799567
horseradish 0.793891
chickpea 0.77919
chickpeas 0.773833
tomato 0.767592
esculenta 0.766908
beetroot 0.766364
plantain 0.76614
cabbage 0.76239
asparagales 0.759608

versus with the Python function:

>>> for x in find_nearest_neighbor(model.get_word_vector('asparagus'), vectors):
...     print(model.get_words()[x[0]], x[1])

spinach 0.7993550878315109
asparagales 0.7766258347569811
horseradish 0.773088708029314
arrowroot 0.7710282497565528
fennel 0.7620764212893897
cauliflower 0.7606887426817247
cabbages 0.7599950964896218
tomato 0.7596469271417923
walnuts 0.7588274170546428
beetroot 0.757907140329613

Is that just due to randomness in the model training? Also, should I be doing anything special with the vectors or the query term in terms of normalizing? I'm guessing this function could be fleshed out to better match the binary equivalent, or a short tutorial could be written.

abulhawa commented Jun 26, 2018

In another issue (#384 ) I wrote a class to optimize finding the nearest neighbors using the word rather than its vector. I'll copy it here for reference:

import numpy as np


class FastTextNN:

    def __init__(self, ft_model, ft_matrix=None):
        self.ft_model = ft_model
        self.ft_words = ft_model.get_words()
        self.word_frequencies = dict(zip(*ft_model.get_words(include_freq=True)))
        self.ft_matrix = ft_matrix
        if self.ft_matrix is None:
            # Build a (vocab_size x dim) matrix of word vectors.
            self.ft_matrix = np.empty((len(self.ft_words), ft_model.get_dimension()))
            for i, word in enumerate(self.ft_words):
                self.ft_matrix[i, :] = ft_model.get_word_vector(word)

    def find_nearest_neighbor(self, query_word, vectors, n=10, cossims=None):
        """
        vectors is a 2d numpy array of the word vectors you want to consider.

        cossims is a 1d numpy array of size len(vectors) holding the dot products
        with the query; it can be passed in for efficiency.

        Returns the indices of the closest n matches to the query within vectors,
        together with the cosine similarity (the cosine of the angle between the vectors).
        """
        query = self.ft_model.get_word_vector(query_word)
        if cossims is None:
            cossims = np.matmul(vectors, query)

        # Normalize the dot products into cosine similarities.
        norms = np.sqrt((query ** 2).sum() * (vectors ** 2).sum(axis=1))
        cossims = cossims / norms
        if query_word in self.ft_words:
            # The best match is the query word itself, so skip it.
            result_i = np.argpartition(-cossims, range(n + 1))[1:n + 1]
        else:
            result_i = np.argpartition(-cossims, range(n + 1))[0:n]
        return list(zip(result_i, cossims[result_i]))

    def nearest_words(self, word, n=10, word_freq=None):
        result = self.find_nearest_neighbor(word, self.ft_matrix, n=n)
        if word_freq:
            return [(self.ft_words[r[0]], r[1]) for r in result
                    if self.word_frequencies[self.ft_words[r[0]]] >= word_freq]
        else:
            return [(self.ft_words[r[0]], r[1]) for r in result]

I'm assuming you're using the same fastText model file in the CLI and in Python.

Class usage example:

fasttext_nn = FastTextNN(fasttext_model) # pass your fasttext model here
fasttext_nn.nearest_words('word')

Hope this helps.

grahamannett commented Jun 26, 2018

@abulhawa Yeah, that seems really useful to have incorporated into the Python library. I'm not entirely sure, but it seems like the C++ code normalizes the vectors that are passed into the nn function (which I was doing in mine; I haven't tried it without normalization yet). If that's needed (and for kNN I think it generally gives better results), it would just be:

self.ft_matrix[i, :] = ft_model.get_word_vector(word) / np.linalg.norm(ft_model.get_word_vector(word))

Maybe someone with more knowledge than me could weigh in, or it could just be made an optional argument?

Also, it seems like the query vector is normalized as well?
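For what it's worth, a hypothetical sketch of normalizing both sides up front, so the dot products are directly the cosine similarities reported by ./fasttext nn (build_normalized_matrix and nearest are made-up helper names, not library API):

import numpy as np

def build_normalized_matrix(ft_model):
    # L2-normalize every word vector once, at build time.
    words = ft_model.get_words()
    matrix = np.empty((len(words), ft_model.get_dimension()))
    for i, word in enumerate(words):
        vec = ft_model.get_word_vector(word)
        matrix[i, :] = vec / np.linalg.norm(vec)
    return words, matrix

def nearest(ft_model, matrix, query_word, n=10):
    query = ft_model.get_word_vector(query_word)
    query = query / np.linalg.norm(query)
    cossims = matrix @ query                                  # already cosine similarities
    top = np.argpartition(-cossims, range(n + 1))[1:n + 1]    # skip the query word itself
    return list(zip(top, cossims[top]))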

grahamannett commented Jun 27, 2018

Looking at some of the PRs, it seems like tons of the Python library issues are fixed but unmerged (some even commented on as fixing an issue and then left unmerged, e.g. #517). Is there any reason why someone on the Facebook team can't approve the PRs, since they seem to be actively merging stuff related to docs/examples?

It seems pointless for people to submit PRs if they are going to be left in PR limbo.

stefanybedoya commented Aug 21, 2018

@abulhawa
Correct me if I am wrong, but you should change the way you are getting the nearest neighbors. It should be:

result_i = np.argpartition(-cossims, range(n+1))[0:n]

so that you also include the nearest neighbor with the highest similarity score in your list.

abulhawa commented Aug 26, 2018

@stefanybedoya
Usually the most similar word returned is the query word itself, hence there is no added information there. However, if the word is not in the vocabulary, the most similar word would be different, and in that case we should use [0:n]. To integrate this, I modified the FastTextNN class above.
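A tiny illustration of the difference, with made-up scores and assuming the similarity of the query word to itself sits at index 0:

import numpy as np

cossims = np.array([0.99, 0.80, 0.75, 0.70, 0.65])  # index 0 = the query word itself
n = 3
in_vocab = np.argpartition(-cossims, range(n + 1))[1:n + 1]  # -> [1 2 3], skips the query word
oov = np.argpartition(-cossims, range(n + 1))[0:n]           # -> [0 1 2], keeps the top match
print(in_vocab, oov)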

stefanybedoya commented Aug 30, 2018

Thank you @abulhawa. You're right. I was seeing it from the OOV-words perspective.
