
Add an example of using own dataset #114

Merged
merged 5 commits into chainer:master from n-yoshikawa:example_own_dataset on Apr 2, 2018

Conversation

n-yoshikawa
Contributor

This PR adds an example of using your own dataset.

  • `train.py` shows how to use `CSVFileParser` and how to predict from SMILES.
  • `dataset.csv` is a sample CSV file, generated by extracting 100 molecules from QM9. `value1` and `value2` are HOMO and LUMO.
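A short sketch of the CSV layout described above may help; the SMILES strings and values below are made-up placeholders, not actual QM9 entries:

```python
import csv

# Hypothetical sketch of the dataset.csv layout: a header row of label
# names, then one molecule per row (SMILES string plus target values).
# The rows below are placeholders, not real QM9 data.
rows = [
    ['SMILES', 'value1', 'value2'],  # first line: label names
    ['C', '-0.25', '0.10'],          # placeholder molecule and values
    ['CO', '-0.27', '0.08'],
]
with open('dataset_example.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```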

@codecov-io

codecov-io commented Mar 13, 2018

Codecov Report

Merging #114 into master will not change coverage.
The diff coverage is n/a.

```
@@           Coverage Diff           @@
##           master     #114   +/-   ##
=======================================
  Coverage   73.27%   73.27%
=======================================
  Files          69       69
  Lines        2582     2582
=======================================
  Hits         1892     1892
  Misses        690      690
```

## How to use your own dataset
1. Prepare a CSV file which contains the list of SMILES and the values you want to train.
The first line of the CSV file should be label names.
See `dataset.csv` as an example.
Member

Could you write that dataset.csv is made by sampling from the QM9 dataset?

@delta2323
Member

Please add a link from the tutorial to this example, for example by adding the following comment at the bottom of the tutorial: "Once you have completed this tutorial, the next step is to use Chainer Chemistry to model your own dataset. See this example (link) for how to do it."

```
python train.py dataset.csv --label value1 value2
```

## How to use your own dataset
Member

This title suggests that this is the only, or preferred, way to use one's own dataset. But using CSVFileParser is just one way: Chainer Chemistry also has SDFFileParser, and users may even handle the dataset without the parsers, although we think that would be costly. So how about simply changing the section title to "Procedure"?

## Usage
```
python train.py dataset.csv --label value1 value2
```
Member

Please explain (at least some of) the options of the script. At minimum, --label needs a description. But I think that option is the only one that needs special treatment, so for the rest it would be enough to write "type `python train.py --help` to see the complete options".
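One hedged way such an option might be declared (this is an argparse sketch, not the actual train.py code):

```python
import argparse

# Hypothetical sketch of how train.py might declare --label:
# nargs='+' accepts one or more column names, e.g. "--label value1 value2".
parser = argparse.ArgumentParser()
parser.add_argument('datafile', help='path to the CSV file')
parser.add_argument('--label', nargs='+', default=[],
                    help='label column name(s) in the CSV to train on')
args = parser.parse_args(['dataset.csv', '--label', 'value1', 'value2'])
```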

@delta2323
Member

Could you change the test scripts to check that this example runs without errors?

from chainer_chemistry.dataset.converters import concat_mols
from chainer_chemistry.dataset.parsers import CSVFileParser
from chainer_chemistry.dataset.preprocessors import preprocess_method_dict
from chainer_chemistry.datasets import NumpyTupleDataset
Member

You do not have to separate the imports of chainer_chemistry, because Chainer Chemistry is a third-party library for this example and can therefore be treated in the same way as Chainer or NumPy.

Member

Could you sort import statements in alphabetical order?

from chainer_chemistry.dataset.preprocessors import preprocess_method_dict
from chainer_chemistry.datasets import NumpyTupleDataset

from rdkit import Chem
Member

Same as for the chainer_chemistry imports.

sys.exit("Error: No target label is specified.")

# Dataset preparation
dataset = None
Member

Do we need this line?

def postprocess_label(label_list):
    return numpy.asarray(label_list, dtype=numpy.float32)

print('preprocessing dataset...')
Member

I prefer to capitalize the first character, as we do in other places.

The first line of the CSV file should be label names.
See `dataset.csv` as an example.

2. Use `CSVFileParser` of Chainer Chemistry to feed data to the model.
Member

It would be better to add a link to the document of CSVFileParser.

preprocessor = preprocess_method_dict[method]()
parser = CSVFileParser(preprocessor,
                       postprocess_label=postprocess_label,
                       labels=labels, smiles_col='SMILES')
Member

Is it intentional that you hard-coded the value of smiles_col?

Contributor Author

Yes, I want to show that we can specify the column name by smiles_col in this example.
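A rough stdlib illustration of the idea behind smiles_col, selecting the SMILES column by header name rather than by position (this mimics the intent only; it is not the CSVFileParser API):

```python
import csv
import io

# Toy illustration of what a smiles_col-style argument selects: the CSV
# column holding SMILES strings, looked up by header name. This is not
# the CSVFileParser API, just the underlying idea.
data = "SMILES,value1,value2\nC,-0.25,0.10\nCO,-0.27,0.08\n"

def read_smiles(text, smiles_col='SMILES'):
    reader = csv.DictReader(io.StringIO(text))
    return [row[smiles_col] for row in reader]
```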

    return numpy.mean(numpy.absolute(diff), axis=0)[0]

classifier = L.Classifier(model, lossfun=F.mean_squared_error,
                          accfun=scaled_abs_error)
Member

Could you add a comment somewhere noting that the scaled errors are reported as main/accuracy and validation/main/accuracy, because using accfun to calculate a (scaled) error is not the ordinary usage of Classifier?
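A pure-Python sketch of the kind of metric being discussed (a mean absolute error routed through the accfun slot, so it is reported under main/accuracy); the scaling used by the example's actual scaled_abs_error is not reproduced here:

```python
# Hypothetical pure-Python sketch of an accfun-style error metric:
# a plain mean absolute error. When passed as accfun to Classifier,
# its value is reported as main/accuracy and validation/main/accuracy,
# even though it is an error, not an accuracy.
def mean_abs_error(predictions, targets):
    diffs = [abs(p - t) for p, t in zip(predictions, targets)]
    return sum(diffs) / len(diffs)
```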

@n-yoshikawa
Contributor Author

I could not figure out how to add "test scripts to check if this example runs without errors". Could you show me an example, or how other examples are tested?

@delta2323
Member

We have scripts that run the examples on CPU (example_test_cpu.sh) and GPU (example_test_gpu.sh) in the examples directory. So I had imagined adding your example to them.

@delta2323 (Member) left a comment

LGTM except comments and the example tests.

Using your own dataset
========================
You can use your own dataset in Chainer Chemistry.
`example/own_dataset <https://github.com/pfnet-research/chainer-chemistry/tree/master/examples/own_dataset/>`_ shows an example code.
Member

Code is uncountable (cf. here). So, "example code" or "an example" would be better.

@@ -4,10 +4,15 @@
```
python train.py dataset.csv --label value1 value2
```

## How to use your own dataset
The `--label` option specifies which row in `dataset.csv` is trained.
Member

I think it is columns, not rows that we have to specify.

Member

Also, as we can specify multiple columns, "is" should be substituted with "are".

@delta2323 delta2323 self-assigned this Mar 20, 2018
@delta2323
Member

Thank you for the update. I'm running the example test scripts.

@delta2323
Member

The example failed with this configuration. Could you check that?

python train.py dataset.csv --method nfp --label value1 --conv-layers 1 --gpu 0 --epoch 1 --unit-num 10

log

```
Preprocessing dataset...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 1944.47it/s]
Train NFP model...
epoch       main/loss   main/accuracy  validation/main/loss  validation/main/accuracy  elapsed_time
1           0.884043    0.0152891      1.07675               0.0166278                 6.00279
Traceback (most recent call last):
  File "train.py", line 207, in <module>
    main()
  File "train.py", line 200, in main
    prediction = model(atoms, adjs).data[0]
  File "train.py", line 55, in __call__
    x = self.graph_conv(atoms, adjs)
  File "/home/delta/dev/chainer-chemistry/chainer_chemistry/models/nfp.py", line 145, in __call__
    h = self.embed(atom_array)
  File "/home/delta/dev/chainer-chemistry/chainer_chemistry/links/embed_atom_id.py", line 47, in __call__
    h = super(EmbedAtomID, self).__call__(x)
  File "/home/delta/.pyenv/versions/anaconda3-4.3.0/envs/anaconda3/lib/python3.6/site-packages/chainer/links/connection/embed_id.py", line 70, in __call__
    return embed_id.embed_id(x, self.W, ignore_label=self.ignore_label)
  File "/home/delta/.pyenv/versions/anaconda3-4.3.0/envs/anaconda3/lib/python3.6/site-packages/chainer/functions/connection/embed_id.py", line 170, in embed_id
    return EmbedIDFunction(ignore_label=ignore_label).apply((x, W))[0]
  File "/home/delta/.pyenv/versions/anaconda3-4.3.0/envs/anaconda3/lib/python3.6/site-packages/chainer/function_node.py", line 245, in apply
    outputs = self.forward(in_data)
  File "/home/delta/.pyenv/versions/anaconda3-4.3.0/envs/anaconda3/lib/python3.6/site-packages/chainer/functions/connection/embed_id.py", line 35, in forward
    .format(type(W), type(x)))
ValueError: numpy and cupy must not be used together
type(W): <class 'cupy.core.core.ndarray'>, type(x): <class 'numpy.ndarray'>
```

@n-yoshikawa
Contributor Author

I forgot to specify the device number in concat_mols. Neither example fails in my environment now.

@delta2323
Member

Thank you for fixing the problem. I'll run the scripts again.

@delta2323 (Member) left a comment

LGTM except small comments.

The first line of the CSV file should be label names.
See `dataset.csv` as an example.
`dataset.csv` is made by sampling from the QM9 dataset.
`value1` is homo and `value2` is lumo.
Member

Capitalize homo and lumo.

Type `python train.py --help` to see complete options.

## Procedure
1. Prepare a CSV file which contains the list of SMILES and the values you want to train.
Member

I think "... contains a list of SMILES and values ..." would be better.

from chainer_chemistry.dataset.converters import concat_mols
from chainer_chemistry.dataset.parsers import CSVFileParser
from chainer_chemistry.dataset.preprocessors import preprocess_method_dict
from chainer_chemistry.datasets import NumpyTupleDataset
Member

Could you sort import statements in alphabetical order?

sys.exit("Error: No target label is specified.")

# Dataset preparation

Member

Remove this empty line.

@delta2323
Member

LGTM

@delta2323 delta2323 merged commit 91681bd into chainer:master Apr 2, 2018
@n-yoshikawa n-yoshikawa deleted the example_own_dataset branch April 3, 2018 08:47
@delta2323 delta2323 added this to the 0.3.0 milestone Apr 23, 2018