-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an example of using own dataset #114
Conversation
Codecov Report
@@ Coverage Diff @@
## master #114 +/- ##
=======================================
Coverage 73.27% 73.27%
=======================================
Files 69 69
Lines 2582 2582
=======================================
Hits 1892 1892
Misses 690 690 |
examples/own_dataset/README.md
Outdated
## How to use your own dataset | ||
1. Prepare a CSV file which contains the list of SMILES and the values you want to train. | ||
The first line of the CSV file should be label names. | ||
See `dataset.csv` as an example. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you write that dataset.csv
is made by sampling from the QM9 dataset?
Please add a link from the tutorial to this example. For example, adding the following comment at the bottom of the tutorial: once you completed this tutorial the next step would be to use Chainer Chemistry to model your own dataset. See this example (link) how to do it. |
examples/own_dataset/README.md
Outdated
python train.py dataset.csv --label value1 value2 | ||
``` | ||
|
||
## How to use your own dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It reminds me that this is the only preferred way to use one's own dataset. But using CSVFileParser
is just one way: Chainer Chemistry has SDFFileParser
and users may even deal with the dataset without the parsers, although we think it would be costly. So how about changing the section title to "Procedure" simply?
## Usage | ||
``` | ||
python train.py dataset.csv --label value1 value2 | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain (a part of ) options of the script. At least --label
needs description. But I think the option is the only one treat specially and it would be enough to write as "type python train.py --help
to see complete options".
Could you change the test scripts to check if this example run without errors. |
examples/own_dataset/train.py
Outdated
from chainer_chemistry.dataset.converters import concat_mols | ||
from chainer_chemistry.dataset.parsers import CSVFileParser | ||
from chainer_chemistry.dataset.preprocessors import preprocess_method_dict | ||
from chainer_chemistry.datasets import NumpyTupleDataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You do not have to separate import of chainer_chemisty because Chainer Chemisty is a third party library for this example and therefore can be treated in the same way as Chainer or NumPy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you sort import statements in alphabetical order?
examples/own_dataset/train.py
Outdated
from chainer_chemistry.dataset.preprocessors import preprocess_method_dict | ||
from chainer_chemistry.datasets import NumpyTupleDataset | ||
|
||
from rdkit import Chem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as import of chainer_chemistry
examples/own_dataset/train.py
Outdated
sys.exit("Error: No target label is specified.") | ||
|
||
# Dataset preparation | ||
dataset = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this line?
examples/own_dataset/train.py
Outdated
def postprocess_label(label_list): | ||
return numpy.asarray(label_list, dtype=numpy.float32) | ||
|
||
print('preprocessing dataset...') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to capitalize the first character, as we do in other places.
examples/own_dataset/README.md
Outdated
The first line of the CSV file should be label names. | ||
See `dataset.csv` as an example. | ||
|
||
2. Use `CSVFileParser` of Cheiner Chemistry to feed data to model. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to add a link to the document of CSVFileParser.
examples/own_dataset/train.py
Outdated
preprocessor = preprocess_method_dict[method]() | ||
parser = CSVFileParser(preprocessor, | ||
postprocess_label=postprocess_label, | ||
labels=labels, smiles_col='SMILES') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it intentional that you hard-coded the value of smiles_col
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I want to show that we can specify the column name by smiles_col
in this example.
return numpy.mean(numpy.absolute(diff), axis=0)[0] | ||
|
||
classifier = L.Classifier(model, lossfun=F.mean_squared_error, | ||
accfun=scaled_abs_error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment somewhere that scaled errors are reported as main/accuracy
and validation/main/accuracy
because using accfun
for calculating a (scaled) error is not ordinal usage of Classifier
.
I could not figure out how to add "test scripts to check if this example run without errors". Could you show me an example or how other examples are tested? |
We have scripts that run examples in CPU ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except comments and the example tests.
docs/source/tutorial.rst
Outdated
Using your own dataset | ||
======================== | ||
You can use your own dataset in Chainer Chemistry. | ||
`example/own_dataset <https://github.com/pfnet-research/chainer-chemistry/tree/master/examples/own_dataset/>`_ shows an example code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code is uncountable (cf. here). So, "example code" or "an example" would be better.
examples/own_dataset/README.md
Outdated
@@ -4,10 +4,15 @@ | |||
python train.py dataset.csv --label value1 value2 | |||
``` | |||
|
|||
## How to use your own dataset | |||
The `--label` option specifies which row in `dataset.csv` is trained. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is columns, not rows that we have to specify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, as we can specify multiple columns, "is" should be substituted with "are".
Thank you for the update. I'm running the example test scripts. |
The example failed with this configuration. Could you check that?
log
|
I forgot to specify device number in |
Thank you for fixing the problem. I'll run the scripts again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except small comments.
examples/own_dataset/README.md
Outdated
The first line of the CSV file should be label names. | ||
See `dataset.csv` as an example. | ||
`dataset.csv` is made by sampling from the QM9 dataset. | ||
`value1` is homo and `value2` is lumo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capitalize homo
and lumo
.
examples/own_dataset/README.md
Outdated
Type `python train.py --help` to see complete options. | ||
|
||
## Procedure | ||
1. Prepare a CSV file which contains the list of SMILES and the values you want to train. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "... contains a list of and values ..." would be better.
examples/own_dataset/train.py
Outdated
from chainer_chemistry.dataset.converters import concat_mols | ||
from chainer_chemistry.dataset.parsers import CSVFileParser | ||
from chainer_chemistry.dataset.preprocessors import preprocess_method_dict | ||
from chainer_chemistry.datasets import NumpyTupleDataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you sort import statements in alphabetical order?
examples/own_dataset/train.py
Outdated
sys.exit("Error: No target label is specified.") | ||
|
||
# Dataset preparation | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this empty line.
LGTM |
This PR adds an example of using own dataset.
train.py
shows how to useCSVFileParser
and how to predict from SMILES.dataset.csv
is a sample CSV file. It is generated by extracting 100 molecules of QM9.value1
andvalue2
are homo and lumo.