Classifier identified as regressor #39

Closed
GemmaTuron opened this issue Apr 8, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@GemmaTuron
Member

Describe the bug
If a binary classification file is passed, with the activity column already in binary, and the following command is run:
zairachem fit -i input.csv -m model_folder
ZairaChem interprets it as a regression, not a binary classification, as indicated by the data/parameters.json file:

{
    "time_budget": 120,
    "task": "regression",
    "presets": "standard",
    "augment": false,
    "assay_id": "ASSAY",
    "assay_type": null,
    "credibility_range": {
        "min": null,
        "max": null
    },

Desktop (please complete the following information):
Ubuntu 22.04 LTS

Additional context
This can be confusing, so we need to add clear instructions.

@HellenNamulinda
Collaborator

Hello @GemmaTuron,
Did you use zairachem example --file_name input.csv to generate the input file?
If yes, by default, that command generates data for a regression task, and we need to correct that in the README.md.

smiles,activity
COc1cc(CCC(C)=O)ccc1O,1.2945919608148597
COc1ccc(CCN)cc1OC,0.669207485449748
C(CN1CCOCC1)Oc1ccc(cc1)-c1cnc2c(cnn2c1)-c1ccccc1,1.1980759270462484
...

zairachem example --classification --file_name input.csv will generate an input file with classification data.

Otherwise, on my end, the task is correctly identified if the input file contains classification data.

{
    "time_budget": 120,
    "task": "classification",
    "presets": "standard",
    "augment": false,
    "assay_id": "ASSAY",
    "assay_type": null,
    "credibility_range": {
        "min": null,
        "max": null
    },
...

@GemmaTuron
Member Author

Hi @HellenNamulinda
No, I am using a file I made myself. What command did you run to fit the classification data you got with the example command?

@miquelduranfrigola
Member

@GemmaTuron what is the column name of your file?

@GemmaTuron
Member Author

bin

@miquelduranfrigola
Member

Thanks. This is surprising and is probably a bug.

miquelduranfrigola added the bug label Apr 15, 2024

@HellenNamulinda
Collaborator

"Hi @HellenNamulinda No, I am using a file I made myself. What command did you run to fit the classified data you got with the example command?"

@GemmaTuron
I used zairachem fit -i train.csv -m model, because I first ran zairachem split -i input.csv to get the train and test sets.

@HellenNamulinda
Collaborator

Hi @GemmaTuron and @miquelduranfrigola,
This is my observation.
If the column name isn't activity (I tried changing it to another name), the split command will fail:

File "/home/hellenah/zaira-chem/zairachem/cli/commands/split.py", line 48, in check_dataset_minimum_size
    fold_num_positives = sum(df[df.fold == fold_id].activity)
  File "/home/hellenah/anaconda3/envs/zairachem/lib/python3.10/site-packages/pandas/core/generic.py", line 6204, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'activity'
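
Since the traceback shows split.py reading df.activity directly, one possible workaround is to rename the label column to activity before running split. The sketch below does that with the standard library only; rename_column and the output file name are hypothetical helpers for illustration and are not part of ZairaChem.

```python
import csv


def rename_column(src_path, dst_path, old_name, new_name="activity"):
    """Copy a CSV file, renaming one header column and leaving all rows untouched.

    Hypothetical helper, not part of ZairaChem.
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        if old_name not in header:
            raise ValueError(f"column {old_name!r} not found in {src_path}")
        # Only the header row changes; data rows are copied as they are.
        writer.writerow([new_name if col == old_name else col for col in header])
        writer.writerows(reader)


# Example (file names are placeholders):
# rename_column("input.csv", "input_activity.csv", old_name="bin")
```

The renamed file can then be passed to zairachem split -i as usual.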

But for the fit command, if the cut-off value isn't specified, the first task assigned in the data/parameters.json file will be regression (regardless of the column name). If you check that file immediately, you see regression as the task.

During the setup step, where data preparation is performed, this file gets updated.
After all the checks and standardization (that is, once "Descriptor calculation and LSH folding done" is logged), the right task is assigned and parameters.json is updated.

If the cut-off is specified (say, zairachem fit -i train.csv -c 0.1 -d low -m model), the parameters.json file will have classification as the default task before any checks on the data are performed. Otherwise, it is regression, which gets updated to classification by the end of the setup step.

So, before the Describe step (calculating the different descriptors), the correct task will be seen in the parameters.json file.
cli/commands/fit.py
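
To check which task is recorded at a given point in the run, the file can be read directly. This is a minimal sketch that assumes data/parameters.json lives inside the model folder passed with -m; read_task is a hypothetical helper, not a ZairaChem function.

```python
import json
from pathlib import Path


def read_task(model_dir):
    """Return the "task" value stored in <model_dir>/data/parameters.json.

    Hypothetical helper; the file location inside the model folder is an assumption.
    """
    params_path = Path(model_dir) / "data" / "parameters.json"
    with open(params_path) as f:
        return json.load(f)["task"]


# Example (folder name is a placeholder):
# print(read_task("model"))
```

Based on the behaviour described above, reading the file right after fit starts may still show regression, while reading it once setup has finished should show the final task.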

@miquelduranfrigola
Member

Thanks @HellenNamulinda

@GemmaTuron
Member Author

Hi @miquelduranfrigola

This issue persists. I have a dataset with only classification data, which I cannot use because I get stuck while ZairaChem tries to do a regression:

Traceback (most recent call last):
  File "/home/gturon/anaconda3/envs/zairachem2/bin/zairachem", line 33, in <module>
    sys.exit(load_entry_point('zairachem', 'console_scripts', 'zairachem')())
  File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/gturon/anaconda3/envs/zairachem2/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/cli/commands/fit.py", line 124, in fit
    s.setup()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/training.py", line 233, in setup
    self._tasks()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/training.py", line 175, in _tasks
    SingleTasks(os.path.join(self.output_dir, DATA_SUBFOLDER)).run()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 401, in run
    reg = reg_tasks.as_dict()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 116, in as_dict
    res["reg_raw_skip"] = self.raw(smoothen=True)
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 82, in raw
    self._raw = self.smoothen(raw)
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/tasks.py", line 71, in smoothen
    return SmoothenY(self.smiles_list, raw).run()
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/utils.py", line 88, in run
    boundaries = self.get_boundaries(y[idxs], repeats, lb, ub)
  File "/home/gturon/github/ersilia-os/zaira-chem/zairachem/setup/utils.py", line 69, in get_boundaries
    boundaries[r] = t
UnboundLocalError: local variable 't' referenced before assignment
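
For context, here is a toy illustration of how this class of error can arise in Python in general. It is not the get_boundaries code from zairachem/setup/utils.py, and whether the same mechanism applies there is an assumption. A variable assigned only inside a conditional branch stays unbound if the branch never runs, which is what happens when every comparison involves NaN.

```python
def first_value_at_or_above(values, target):
    """Toy function: return the first value that is >= target.

    Hypothetical example; NOT the get_boundaries code from ZairaChem.
    """
    for v in values:
        if v >= target:  # any comparison with NaN evaluates to False
            t = v
            break
    # If the branch above never ran, t was never assigned and this line
    # raises UnboundLocalError.
    return t


# Example calls:
# first_value_at_or_above([0.0, 1.0], 0.5)       # fine
# first_value_at_or_above([float("nan")], 0.5)   # raises UnboundLocalError
```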

@HellenNamulinda what do you mean by this: "During the setup step where data preparation is performed, this file will get updated, After all the checks and standardization (once this is logged Descriptor calculation and LSH folding done), the right task is assigned and the parameters.json is updated."

Did you successfully pass classification data (already binarised) and have ZairaChem train a model?

@GemmaTuron
Member Author

Mmmm, I've been doing tests.
I found a NaN that might be making the _is_a_simple_classification function fail. I think there are enough automated tests, provided the user does not have an unexpected value in the dataset. We can close this issue.
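
A quick pre-flight check for this kind of problem could look like the sketch below. It is a standalone illustration using only the standard library and does not reproduce the logic of _is_a_simple_classification; find_non_binary_rows is a hypothetical helper, and the default column name activity is an assumption (my file uses bin).

```python
import csv
import math


def find_non_binary_rows(csv_path, column="activity"):
    """Return (row_number, raw_value) pairs whose value is not exactly 0 or 1.

    Empty cells, NaN, and non-numeric strings are all reported.
    Row numbers count data rows from 1 (the header is not counted).
    Hypothetical helper, not part of ZairaChem.
    """
    bad = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        if column not in (reader.fieldnames or []):
            raise ValueError(f"column {column!r} not found in {csv_path}")
        for i, row in enumerate(reader, start=1):
            raw = (row.get(column) or "").strip()
            try:
                value = float(raw)
            except ValueError:
                # Empty or non-numeric cell.
                bad.append((i, raw))
                continue
            if math.isnan(value) or value not in (0.0, 1.0):
                bad.append((i, raw))
    return bad


# Example (file and column names are placeholders):
# print(find_non_binary_rows("input.csv", column="bin"))
```

An empty result means every label parsed as 0 or 1; anything else lists the rows worth inspecting before running fit.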

@HellenNamulinda
Collaborator

It works just fine on my end.
But we can look at it again if data issues cause the pipeline to break.

3 participants