[Feature] Basic utils to handle raw data features #2102

Merged: 15 commits merged into dmlc:master on Aug 27, 2020

@classicsong (Contributor) commented on Aug 25, 2020

Description

We provide a set of APIs in data.utils for converting raw data into NumPy arrays:

  • parse_word2vec_node_feature
  • parse_category_single_feat
  • parse_category_multi_feat
  • parse_numerical_feat
  • parse_numerical_multihot_feat

#2088

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • The related issue is referenced in this PR

Changes

@VoVAllen (Collaborator) left a comment

The term "parse" is a bit confusing. Maybe a submodule called transform, with all functions named encode_XXX, would be better?

python/dgl/data/utils.py
    else:
        return feat

def parse_category_multi_feat(category_inputs, norm=None):
Collaborator

Would users need the labels corresponding to the multi-hot vector (e.g., that the first element is for label 'A')?
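One way to address this question is to return the column-to-label mapping together with the features. The sketch below is illustrative only: `encode_multi_category` is a hypothetical name, not a function from this PR, and it builds plain Python lists so that it has no dependencies (passing the result to `numpy.asarray` would give the array form).

```python
def encode_multi_category(category_inputs):
    """Hypothetical helper: multi-hot encode rows of category labels.

    Returns (feat, classes), where column j of feat corresponds to classes[j].
    """
    # Sort the distinct labels so the column order is deterministic.
    classes = sorted({label for row in category_inputs for label in row})
    index = {label: col for col, label in enumerate(classes)}

    feat = []
    for row in category_inputs:
        vec = [0.0] * len(classes)
        for label in row:
            vec[index[label]] = 1.0
        feat.append(vec)
    return feat, classes
```

With this shape, a caller can check `classes[0]` to see which label the first column of the multi-hot vector stands for.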

python/dgl/data/utils.py
    manager = Manager()
    d = manager.dict()
    job = []
    for i in range(num_process):
Collaborator

You can use a multiprocessing pool (pool.map) here.
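A rough sketch of what this suggestion could look like is shown here. The worker `tokenize_chunk` and the helpers around it are hypothetical stand-ins, since the actual per-process work in the PR is not shown on this page. The point is only that `Pool.map` returns results in input order, so a shared `Manager().dict()` and a manual job list are not needed.

```python
from multiprocessing import Pool


def split_into_chunks(items, num_chunks):
    """Split items into at most num_chunks contiguous, near-equal chunks."""
    size, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few items
            chunks.append(items[start:end])
        start = end
    return chunks


def tokenize_chunk(texts):
    """Hypothetical worker: lowercase and whitespace-split each text."""
    return [text.lower().split() for text in texts]


def tokenize_parallel(texts, num_process):
    """Tokenize texts with num_process workers, preserving input order."""
    num_process = max(1, num_process)
    chunks = split_into_chunks(texts, num_process)
    if num_process == 1:
        # No need to start worker processes for a single worker.
        results = [tokenize_chunk(chunk) for chunk in chunks]
    else:
        with Pool(num_process) as pool:
            # map() returns one result per chunk, in the order of chunks.
            results = pool.map(tokenize_chunk, chunks)
    # Flatten the per-chunk results back into one list.
    return [tokens for chunk_result in results for tokens in chunk_result]
```

On platforms that start workers by spawning a fresh interpreter, calls such as `tokenize_parallel(texts, 4)` should sit under an `if __name__ == "__main__":` guard, and the worker has to be a module-level function so it can be pickled.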

@VoVAllen (Collaborator) commented

And for converting features into one-hot/multi-hot vectors, if the category information is needed, I think it might be better to design it as a class, just like sklearn does.
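To make the class-based idea concrete, here is a minimal sketch in the spirit of scikit-learn's fit/transform convention. `MultiCategoryEncoder` is a hypothetical name and is not part of this PR or of DGL. It uses plain lists to stay dependency-free, and silently ignoring unseen categories is an assumption of this sketch rather than a documented behavior.

```python
class MultiCategoryEncoder:
    """Hypothetical encoder that remembers its category order after fit()."""

    def fit(self, category_inputs):
        # classes_[j] is the label represented by column j.
        self.classes_ = sorted({c for row in category_inputs for c in row})
        self._index = {c: col for col, c in enumerate(self.classes_)}
        return self

    def transform(self, category_inputs):
        feat = []
        for row in category_inputs:
            vec = [0.0] * len(self.classes_)
            for c in row:
                col = self._index.get(c)
                if col is not None:  # assumption: unseen categories are ignored
                    vec[col] = 1.0
            feat.append(vec)
        return feat

    def fit_transform(self, category_inputs):
        return self.fit(category_inputs).transform(category_inputs)
```

Keeping the vocabulary on the object means later data can be encoded with the same column layout, for example `enc.transform(test_rows)` after `enc.fit(train_rows)`, and `enc.classes_` answers which label each column stands for.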

@VoVAllen (Collaborator) left a comment

LGTM

@classicsong classicsong merged commit 33a8bb9 into dmlc:master Aug 27, 2020
@classicsong classicsong deleted the graph-loader branch August 27, 2020 03:36
@BarclayII BarclayII mentioned this pull request Aug 29, 2020
classicsong added a commit that referenced this pull request Sep 3, 2020
jermainewang added a commit that referenced this pull request Sep 8, 2020
…2147)

This reverts commit 33a8bb9.

Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>
kingmbc pushed a commit to kingmbc/dgl that referenced this pull request Sep 10, 2020
dmlc#2147)

This reverts commit 33a8bb9.

Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>
zhjwy9343 pushed a commit to zhjwy9343/dgl that referenced this pull request Sep 17, 2020
dmlc#2147)

This reverts commit 33a8bb9.

Co-authored-by: Minjie Wang <wmjlyjemaine@gmail.com>