
adding spacy-universal-sentence-encoder #5534

Merged
merged 3 commits into explosion:master on Jun 8, 2020

Conversation

MartinoMensio
Contributor

Description

This PR adds to the spaCy Universe the wrapper I created for using the Universal Sentence Encoder, hosted on TensorFlow Hub (https://tfhub.dev/google/collections/universal-sentence-encoder/1), within spaCy.

It uses pipeline components to replace the vectors of documents, spans, and tokens with a hook that computes them from the TensorFlow Hub model.

For more details, see https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub/
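
For readers unfamiliar with the mechanism, below is a minimal sketch (not the project's actual implementation) of how a pipeline component can route doc.vector, span.vector, and token.vector through spaCy's user hooks. The encode function is a hypothetical placeholder standing in for the TensorFlow Hub encoder, and the registration uses the spaCy v2-style add_pipe call with a plain callable.

import numpy
import spacy

def use_vector_hook(doc):
    # Hypothetical stand-in for the Universal Sentence Encoder; the real
    # wrapper delegates to the TensorFlow Hub model instead of returning zeros.
    def encode(text):
        return numpy.zeros(512, dtype="float32")

    # Route vector requests for the Doc, its Spans, and its Tokens through the hook.
    doc.user_hooks["vector"] = lambda d: encode(d.text)
    doc.user_span_hooks["vector"] = lambda span: encode(span.text)
    doc.user_token_hooks["vector"] = lambda token: encode(token.text)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe(use_vector_hook, last=True)  # spaCy v2-style component registration
doc = nlp("Hello world")
print(doc.vector.shape)  # (512,)

Because the hooks are set per Doc, the component has to run on every document, which is why the wrapper installs itself as part of the pipeline rather than patching the vectors globally.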

Types of change

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg added the docs (Documentation and website) label on Jun 2, 2020
@adrianeboyd
Contributor

Thanks for this contribution!

@adrianeboyd merged commit de00f96 into explosion:master on Jun 8, 2020
adrianeboyd pushed a commit that referenced this pull request on Jun 8, 2020
* adding spacy-universal-sentence-encoder

* update affiliation

* updated code example
@ruidaiphd

ruidaiphd commented on Aug 2, 2020

Hello Martino,

I really like the idea of combining spaCy with Google's models, thanks! I am running into a seemingly random deadlock problem. I am an amateur developer and not very sure about the posting rules here, so I hope it doesn't matter much that I also posted the same question on GitHub.

I am basically doing an n*(n-1)/2 pairwise comparison of 50K papers by their titles and abstracts on a server with 32 cores. I tried a single-threaded version (without Pool) and found no problems across a few tens of thousands of comparisons, but the deadlock happens almost right away when I run the following code. Do you have any suggestions? By the way, I also tried processes=1, and it does not work either.

from multiprocessing import Pool
from tqdm import tqdm

def simCount(row):
    # Compute the similarity of two texts (row[1] and row[4]) and keep their identifiers.
    return [row[0], row[3], row[2], row[5], nlp(row[1]).similarity(nlp(row[4]))]

with Pool(processes=25) as p:
    with tqdm(total=count, desc='Testing') as pbar:
        for idx_left, row_left in _sim_tst.iterrows():
            # ... some pandas frame arrangement that builds _4sim ...
            for simscore in p.imap_unordered(simCount, _4sim.values.tolist()):
                ssrn_simscore.append(simscore)
                pbar.update()

Many thanks!

@MartinoMensio
Contributor Author

Hi @ray4wit,

Since this issue relates to a problem with serialisation of Doc extension attributes, which is specific to the added project, I would suggest keeping the discussion at MartinoMensio/spacy-universal-sentence-encoder#6.

Martino
