
feat: Imdb sentiment dataset reader #962

Merged
8 commits merged into deeppavlov:dev on Apr 16, 2020

Conversation

sgrechanik-h
Contributor

This PR implements a dataset reader for the IMDb sentiment classification dataset. It also includes a JSON configuration for BERT (English, cased) that is mostly the same as the rusentiment configuration, except for the max sequence length and batch size (which I set to values that avoid out-of-memory errors on my hardware).

This PR also includes a fix for the sets_accuracy metric, which now works correctly for string labels (i.e. it wraps them in sets instead of converting them to sets, which would split a string into its characters). I also added reporting of cached files in download_decompress.
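As a minimal illustration of the pitfall that fix addresses (the snippet is illustrative, not the metric's actual code):

    # Converting a string label to a set splits it into characters,
    # while wrapping it in a set keeps it as a single label.
    label = "neg"
    print(set(label))  # {'n', 'e', 'g'} -- compares characters, which is wrong
    print({label})     # {'neg'} -- compares the label itself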

@dilyararimovna
Contributor

You should definitely try conversational English BERT to check whether that classifier performs better.

@sgrechanik-h
Contributor Author

@dilyararimovna Conversational BERT seems to be slightly better: ~93.5% accuracy vs. ~92.5% for the original cased English BERT with the same hyperparameters. However, I trained each for only about 3 epochs, so I may try training them longer.

@dilyararimovna
Contributor

dilyararimovna commented Aug 13, 2019

It's time to get rid of the need to wrap a single class into a list of classes.
Please take the following steps (a runnable sketch of the resulting behaviour follows the steps):
Replace https://github.com/Huawei-MRC-OSI/DeepPavlov/blob/pr-imdb-clean/deeppavlov/dataset_iterators/snips_intents_iterator.py#L31 with

result.append((text, intent))

Replace https://github.com/Huawei-MRC-OSI/DeepPavlov/blob/pr-imdb-clean/deeppavlov/dataset_readers/basic_classification_reader.py#L37 with

format: str = "csv", class_sep: str = None,

Replace https://github.com/Huawei-MRC-OSI/DeepPavlov/blob/pr-imdb-clean/deeppavlov/dataset_readers/basic_classification_reader.py#L50 with

class_sep (str): delimiter for classes in the target column. Default: None -> only one class per sample

Replace https://github.com/Huawei-MRC-OSI/DeepPavlov/blob/pr-imdb-clean/deeppavlov/dataset_readers/basic_classification_reader.py#L90-L93 with

                if isinstance(x, list):
                    if class_sep is None:
                        # each sample is a tuple (["text", ...], "label")
                        data[data_type] = [([row[x_] for x_ in x], str(row[y]))
                                           for _, row in df.iterrows()]
                    else:
                        # each sample is a tuple (["text", ...], ["label", "label", ...])
                        data[data_type] = [([row[x_] for x_ in x], str(row[y]).split(class_sep))
                                           for _, row in df.iterrows()]
                else:
                    if class_sep is None:
                        # each sample is a tuple ("text", "label")
                        data[data_type] = [(row[x], str(row[y])) for _, row in df.iterrows()]
                    else:
                        # each sample is a tuple ("text", ["label", "label", ...])
                        data[data_type] = [(row[x], str(row[y]).split(class_sep)) for _, row in df.iterrows()]

Thank you!

@grwlf
Contributor

grwlf commented Aug 14, 2019

It's time to get rid of the need to wrap a single class into a list of classes.

My opinion here is that we should focus on the sentiment reader in this ticket. basic_classification_reader.py is a common component and many configurations may be affected by changes to it, so I think it's better to do that kind of work in a separate PR.

Regarding the labels problem, I would suggest making ImdbReader return singleton lists for now, to stay compatible with other components.
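A minimal sketch of that suggestion (sample data is hypothetical):

    # Wrapping each label in a singleton list keeps the reader's output
    # compatible with components that expect a list of labels per sample.
    samples = [("Great film!", "pos"), ("Dull plot.", "neg")]
    wrapped = [(text, [label]) for text, label in samples]
    print(wrapped)  # [('Great film!', ['pos']), ('Dull plot.', ['neg'])]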

@dilyararimovna
Contributor

@grwlf Then you should revert the proba2labels component to its previous version, because the current version returns inconsistent labels.

sgrechanik-h and others added 2 commits August 14, 2019 15:44
@dilyararimovna
Contributor

@grwlf Are you planning to release pre-trained classification models?

@grwlf
Contributor

grwlf commented Aug 15, 2019

@grwlf Are you planning to release pre-trained classification models?

Do you mean models for this sentiment task? I'm not sure about that. I think we could either release the model or provide instructions on how to fine-tune the existing BERT. What do you think would be better?
@yoptar ?

@yoptar
Contributor

yoptar commented Aug 19, 2019

What do you think would be better?

Most of our configs come with pretrained models, so it would be preferable in this case as well.

@grwlf
Contributor

grwlf commented Aug 20, 2019

What do you think would be better?

Most of our configs come with pretrained models, so it would be preferable in this case as well.

OK, we will train the model.

@sgrechanik-h
Contributor Author

So, how do we upload the weights to files.deeppavlov.ai?

@yoptar
Contributor

yoptar commented Sep 4, 2019

So, how do we upload the weights to files.deeppavlov.ai?

@sgrechanik-h,
You can send me a link to the model at lymar@ipavlov.ai and I'll reply with its URL on files.deeppavlov.ai.
Or you can use a link to some other site in the config.

@grwlf
Contributor

grwlf commented Sep 7, 2019

So, how do we upload the weights to files.deeppavlov.ai?

@sgrechanik-h,
You can send me a link to the model at lymar@ipavlov.ai and I'll reply with its URL on files.deeppavlov.ai.
Or you can use a link to some other site in the config.

Could you please check this Google Drive folder? It should contain the sentiment_imdb_conv_bert_v0.tar.gz and sentiment_imdb_bert_v0.tar.gz checkpoints.

https://drive.google.com/drive/folders/1fXAR02g-drJKOHhSGND6TLNkhzErz9PE?usp=sharing

data_path = Path(data_path)

if url is None:
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
@avmhawk avmhawk Sep 10, 2019

Hidden logic: why should the file be available at this link? Please remove it.

Contributor

Please suggest how to fix.

Now the path must be specified directly in the config. Add "url" to "dataset_reader": {..}

Contributor Author

I'm not convinced: many of DeepPavlov's dataset readers have their URLs hardcoded, and forcing the user to provide the URL in the config file would harm usability. (However, almost all of these hardcoded URLs point to files.deeppavlov.ai/*, so the IMDb dataset should probably be uploaded to files.deeppavlov.ai as well.)
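A minimal sketch of the compromise implied here (the helper name is hypothetical, not the PR's actual code): keep a hardcoded default URL but let an optional "url" key in the dataset_reader config override it.

    from pathlib import Path

    DEFAULT_URL = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

    # Hypothetical helper: a "url" value from the config wins when present,
    # otherwise the hardcoded default is used.
    def resolve_source(data_path: str, url: str = None) -> tuple:
        return Path(data_path), (url if url is not None else DEFAULT_URL)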

@yoptar what do you think about this? Maybe I was wrong.

@@ -192,6 +192,11 @@ def download_decompress(url: str, download_path: [Path, str], extract_paths=None
extracted = extracted_path.exists()
if not extracted and not arch_file_path.exists():
simple_download(url, arch_file_path)
else:
It seems to me that the following logic is implemented here (maybe I'm wrong):

        if extracted_path.exists():
            extracted = True
            log.info(f'Found cached and extracted {url} in {extracted_path}')
        elif arch_file_path.exists():
            log.info(f'Found cached {url} in {arch_file_path}')
        else:
            simple_download(url, arch_file_path)

Contributor Author

Yes, this is much more readable, thanks.

download_decompress(url, data_path)
mark_done(data_path)

alternative_data_path = data_path / "aclImdb"
OK, so you check data_path, download the file, and then try "alternative_data_path"? And if it exists, you never use the original data_path, or have I misunderstood?

Maybe it's a naming problem and you should use "extracted_data_path" for this variable.

Contributor Author

The problem I'm trying to solve here is that the archive contains an aclImdb folder where all the data files are located, but data_path itself probably ends with aclImdb, so the data files end up somewhere like ~/.deeppavlov/downloads/aclImdb/aclImdb/. This may create confusion, and if data_path is provided manually, the user may set it to ~/.deeppavlov/downloads/aclImdb/aclImdb/ instead of the correct ~/.deeppavlov/downloads/aclImdb. This logic forgives that mistake.

OK, it's not an explicit solution, but it works.

for Computational Linguistics (ACL 2011).
"""

@overrides
unnecessary decorator

Contributor Author

Don't we need it to protect us from misspelling the word read, or from future changes to the function's name in the base class?

IMHO, it's a redundant solution. We all write the code and monitor its consistency; relying on a little-known third-party library and metaclass operations could shoot us in the foot in the future.


@yoptar yoptar merged commit db31b59 into deeppavlov:dev Apr 16, 2020
surkovv pushed a commit to surkovv/DeepPavlov that referenced this pull request Aug 24, 2022
* feat: Imdb sentiment dataset reader

* fix: download_decompress: report found cached files

* Use accuracy instead of set_accuracy

* fix: Proba2Labels: return single label for max_proba=True

* Config for conversational BERT for imdb

* Imdb: produce singleton lists of labels instead of labels

* Don't convert data_path to Path twice

Co-Authored-By: Aleksei Lymar <yoptar@gmail.com>

* imdb: always use utf-8

Co-Authored-By: Aleksei Lymar <yoptar@gmail.com>

Co-authored-by: Aleksei Lymar <yoptar@gmail.com>