Is there a runnable example with the given data? #1

Open · ceteri opened this issue Feb 21, 2021 · 20 comments


ceteri commented Feb 21, 2021

This code looks really interesting.
However, is there a runnable example with the given data?

When running the current code using the example shown:

python train.py --embedding-len 64 --batch-size 2048 --epochs 500

The hard-coded data/ location does not appear to be included in this repo, and the named datasets differ from what's in datasets/:

$ python train.py --embedding-len 64 --batch-size 2048 --epochs 500
Traceback (most recent call last):
  File "train.py", line 57, in <module>
    trainer = Trainer(args)
  File "/Users/paco/src/jax-skip-gram-negative-sampling/trainer.py", line 18, in __init__
    self.dataset = PyPILangDataset(args)#, examples_path='data/pypi_examples.pth', dict_path='data/pypi_dict.pth')
  File "/Users/paco/src/jax-skip-gram-negative-sampling/datasets/pypi_lang.py", line 18, in __init__
    self.files = self.tokenize_files()
  File "/Users/paco/src/jax-skip-gram-negative-sampling/datasets/pypi_lang.py", line 29, in tokenize_files
    node_lang_df = pd.read_csv(self.args.dataset_dir, na_filter=False)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 948, in __init__
    self._make_engine(self.engine)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2010, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] No such file or directory: 'data/'

My apologies, I may have missed something in the setup?

@ddehueck (Owner)

Hello! I believe the argument --dataset-dir <path_to_data> needs to be specified. But you're right: I need to update this so there's no longer a hard-coded default path that isn't included in the repo.
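
For example (the CSV path here is hypothetical; point it at wherever your data lives, since datasets/pypi_lang.py passes this value straight to pd.read_csv):

python train.py --dataset-dir data/world_order.csv --embedding-len 64 --batch-size 2048 --epochs 500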

I think you're one of the first people to try out this repo so please keep asking questions and I'll find some time to clean up the repo and add more documentation!

@ddehueck (Owner)

OK, I'm making some updates now. Some changes:

  1. The Dataset class needs to be loaded in trainer.py.
  2. There's a bug in the loss function; I had a fix for it in a local directory on my machine that never got committed.
  3. The data directory should be added. I'll commit the world order example as the default.


ceteri commented Feb 21, 2021

Awesome, thanks @ddehueck !
I'll retry after those commits


ddehueck commented Feb 22, 2021

@ceteri OK, should be good to go now! You should see embeddings like the following (screenshot omitted here) after around 100 epochs using the world order dataset.

@ddehueck (Owner)

Also, I believe there may be a way to speed up the loss as I did here: https://github.com/ddehueck/CrossWalk/blob/master/domains/sgns_loss.py

If you want to compare against an ML framework without a JIT compiler, I have an SGNS implementation in PyTorch here: https://github.com/ddehueck/skip-gram-negative-sampling
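
For readers who want the gist of the vectorized loss, a minimal JAX sketch (array shapes and names are assumptions here, not the exact code from either repo):

import jax.numpy as np
from jax.nn import log_sigmoid

def sgns_loss(center, context, negatives):
    # center:    (batch, dim) embeddings of center words
    # context:   (batch, dim) embeddings of true context words
    # negatives: (batch, k, dim) embeddings of k negative samples per pair
    pos = log_sigmoid(np.sum(center * context, axis=-1))          # log sigma(u . v)
    neg = log_sigmoid(-np.einsum('bd,bkd->bk', center, negatives)).sum(axis=-1)
    return -np.mean(pos + neg)  # minimize the negative log-likelihood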


ceteri commented Feb 22, 2021

Many thanks @ddehueck !
The comparison between JAX and PyTorch impls is really helpful.

That ran fine, although the embeddings that I'm seeing reported are:

Learned embeddings:
word: nuclear neighbors: ['michael', 'oxford', 'm', 'ibid', 'p', 'trans', 'quoted', 'n', 'networks', 'data']
word: mankind neighbors: ['michael', 'oxford', 'm', 'ibid', 'p', 'trans', 'quoted', 'n', 'networks', 'data']
word: khomeini neighbors: ['michael', 'oxford', 'm', 'ibid', 'p', 'trans', 'quoted', 'n', 'networks', 'data']
word: ronald neighbors: ['michael', 'oxford', 'm', 'ibid', 'p', 'trans', 'quoted', 'n', 'networks', 'data']
Beginning epoch: 500/500

Looking at this line:

print(f'word: {word} neighbors: {self.model.nearest_neighbors(word, self.dataset.dictionary, params)}')

should the set of nearest neighbors change for each instance of word through that iteration?

@ddehueck (Owner)

In the line right above the one you linked to, for word in self.dataset.queries:, each dataset class gets to define its own queries.

I think what's happening is that you're using different data in the WorldOrderDataset class without changing self.dataset.queries. So those words may not exist in your dataset, and you may be looking at a token indicating "not in dataset".

So try changing the dataset queries, e.g. as sketched below.
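
For instance (the queries attribute comes from the code above; the subclass and query words are hypothetical), a custom dataset could override the defaults:

class MyCustomDataset(WorldOrderDataset):  # hypothetical subclass
    def __init__(self, args):
        super().__init__(args)
        # Use query words that actually occur in your corpus, so
        # nearest_neighbors looks up real vocabulary entries.
        self.queries = ['nuclear', 'treaty', 'economy', 'war']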


ceteri commented Feb 22, 2021

Thank you.

Checking through the code, I may be encountering a problem in Trainer.update(), since the params and g values are all NaN:

params [[nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]]
EPOCH: 47 | GRAD MAGNITUDE: nan

This warning gets printed 4x through each epoch:

[W ParallelNative.cpp:206] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)

This is with a simple command line:

python train.py --embedding-len 64 --batch-size 2048 --epochs 100

And the dataset is hard-coded to WorldOrderDataset?

@ddehueck (Owner)

Hmm, not sure how much help I can be without seeing your setup. How is your data pipeline set up?
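
One generic way to localize this (a standard JAX facility, not something specific to this repo) is to make JAX raise an error at the first operation that produces a NaN:

import jax

# Fail fast at the first NaN instead of silently propagating
# it through params and gradients during training.
jax.config.update("jax_debug_nans", True)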

@arjun-mani

I seem to be facing the same issue. I just cloned the repo and ran python3 train.py, and it seems my gradient is NaN and I get the exact same embeddings that @ceteri is getting. Were either of you able to find a fix for this?


ddehueck commented Mar 5, 2021

@arjun-mani Are you running the default sample that is in the codebase or have you added a custom dataset?


arjun-mani commented Mar 5, 2021

@ddehueck Default sample. Seems like the problem was with the loss - maybe you never fixed the bug on remote? I'm getting -inf values for the loss.


ddehueck commented Mar 5, 2021

@arjun-mani Damn ok will take a closer look


arjun-mani commented Mar 5, 2021

@ddehueck I just fixed the loss locally and am getting embeddings much closer to what you posted. So pretty sure that's the problem :)


ddehueck commented Mar 5, 2021

@arjun-mani Awesome! Would you mind creating a PR?

@arjun-mani

Not at all, I'll do it today.


ddehueck commented Mar 5, 2021

@arjun-mani Appreciate it


ddehueck commented Apr 1, 2021

@arjun-mani Any chance you can create that PR? Sorry to bother you.

@arjun-mani

Hey @ddehueck - I'm so, so sorry for the delay; it's been a crazy few weeks. I've made a lot of personal changes to the codebase and this is a really small change, so hopefully it's helpful if I just attach the modified bce_loss function (in sgns_loss.py):

def bce_loss_w_logits(x, y):
    # Numerically stable binary cross-entropy with logits:
    # max(x, 0) - x*y + log(1 + exp(-|x|))
    max_val = np.clip(x, 0, None)
    loss = max_val - x * y + np.log(1 + np.exp(-np.abs(x)))
    # Old (buggy) line:
    # loss = x - x * y + max_val + np.log(np.exp(-max_val) + np.exp((-x - max_val)))
    return loss.mean()

(The old line is commented out.) Hope this helps; let me know if you'd like me to clarify anything.
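
Why this should fix the NaNs: both lines compute the same quantity on paper, but the old one evaluates np.exp(-x - max_val), which overflows for large-magnitude logits (inf in the loss, then NaN in the gradients), while the fixed line only ever exponentiates -|x| <= 0. A quick sanity check (values hypothetical; assumes jax.numpy is imported as np, as elsewhere in the repo):

x = np.array([-100.0, 0.0, 100.0])  # extreme logits
y = np.array([0.0, 1.0, 1.0])
print(bce_loss_w_logits(x, y))      # finite with the fixed line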


ceteri commented Apr 10, 2021

Thank you @arjun-mani, that works well.
