The word order of the data contents is missing #1

Open
CarlKilhart opened this issue Oct 21, 2023 · 4 comments

Comments

@CarlKilhart

The README says that the data preserves the word order of the contents, but the words turn out to be ordered by their ID in the vocabulary, not by their real order in the original texts. Could you provide a version with the correct word order?

@cezhang01
Owner

cezhang01 commented Oct 22, 2023

Hi @CarlKilhart ,

Thank you for your interest in our work!

The current contents.txt contains the sequences of words that remain after preprocessing: we removed stop words, punctuation, and other meaningless words. The current vocabulary consists of these remaining words, and in contents.txt they still appear in the same order as in the original raw content.

For example, suppose the vocabulary is [welcome, new, york, best, city, ...] and the original raw content is "welcome to the best city new york!". After preprocessing, this document becomes [0, 3, 4, 1, 2]. Here "to" and "the" are removed because they are stop words, but the remaining 5 words (welcome, best, city, new, york) keep the order they had in the original raw content.
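The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `encode` helper, the regex tokenizer, and the two-word stop list are hypothetical, since the actual preprocessing code is not shown in this thread.

```python
import re

# Values taken from the example above. The real stop-word list and tokenizer
# used to build contents.txt are not shown in this thread (assumption).
vocabulary = ["welcome", "new", "york", "best", "city"]
stop_words = {"to", "the"}


def encode(raw_text, vocabulary, stop_words):
    """Map raw text to vocabulary IDs, keeping the original word order."""
    word_to_id = {word: i for i, word in enumerate(vocabulary)}
    # Lowercase and keep alphabetic tokens only, which drops punctuation.
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    return [
        word_to_id[t]
        for t in tokens
        if t not in stop_words and t in word_to_id
    ]


ids = encode("welcome to the best city new york!", vocabulary, stop_words)
print(ids)  # [0, 3, 4, 1, 2]
```

Note that the output is not sorted: an order-preserving encoding follows the text, not the vocabulary index.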

Does this answer your question? Or do you mean that you need the original raw content of the documents (including stop words, punctuation, etc.)?

@CarlKilhart
Author

Maybe you uploaded the wrong version of contents.txt? Taking the first row of the ml dataset as an example, the sequence 3 16 17 28 34 36 39 45 46 85 111 150 150 151 192 192 192 192 200 201 217 218 269 306 328 351 377 476 477 488 507 623 723 762 898 947 1270 1347 1494 1587 1697 is clearly ordered by word ID, not by the words' positions in the original raw content. I would appreciate it if you could check the data files.
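A quick way to check this observation is to test whether a row is non-decreasing. The sketch below uses the row quoted above; the helper name `is_sorted_by_id` is hypothetical, and reading rows as whitespace-separated integers is an assumption based on how the row is quoted here.

```python
# First row of the ml dataset, as quoted in the comment above.
first_row = (
    "3 16 17 28 34 36 39 45 46 85 111 150 150 151 192 192 192 192 200 201 "
    "217 218 269 306 328 351 377 476 477 488 507 623 723 762 898 947 1270 "
    "1347 1494 1587 1697"
)


def is_sorted_by_id(ids):
    """Return True if the IDs never decrease from one position to the next."""
    return all(a <= b for a, b in zip(ids, ids[1:]))


ids = [int(token) for token in first_row.split()]
print(is_sorted_by_id(ids))  # True
```

A document that kept its natural word order would be very unlikely to pass this check over dozens of tokens, so a file where every row passes is almost certainly sorted by ID.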

@cezhang01
Owner

Hi @CarlKilhart ,

Thank you for the reminder!

Indeed, the current datasets do not preserve word order. However, my model does not use word order for training, so the current datasets are still valid and correct for reproducing the results in the paper.
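To illustrate why the ordering would not matter for an order-insensitive model: if documents are consumed as bag-of-words counts (an assumption; the thread only states that word order is not used), then a sequence and its ID-sorted version are indistinguishable. The IDs below are illustrative, not taken from the dataset.

```python
from collections import Counter

# Illustrative IDs only: one order-preserving sequence and the same
# document sorted by word ID.
in_text_order = [0, 3, 4, 1, 2]
sorted_by_id = sorted(in_text_order)

# A bag-of-words representation keeps per-word counts and discards positions.
same_bag = Counter(in_text_order) == Counter(sorted_by_id)
print(same_bag)  # True
```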

I just processed the datasets again to obtain the word order. You can download all 5 datasets with word order, including the Web dataset, from the Google Drive link below: https://drive.google.com/file/d/10sGsStbutM-e1XfM8uDwP354YcXdpmgj/view?usp=sharing

Please note that for the Aminer dataset, I no longer remember exactly how I preprocessed it last year. I recently rewrote the preprocessing code to produce the Aminer dataset, so the new version may deviate somewhat from the one uploaded to the GitHub repo.

@cezhang01
Owner

Hi @CarlKilhart ,

Did this answer your question? If you have no further questions, may I close this issue?
