Friends dataset #4568

chiehminwei · 2022-06-03T00:01:31Z

Patch description
This patch adds a new task friends for generating multiparty conversation training examples from the Friends Corpus on ConvoKit.

Example Training Sample:

Text:
Person A: I like pizza.
Person B: Me too.
Person C: Could someone order pizza.

Label:
	I will do it

The teacher supports 3 command line arguments, --character, --use-silence-token, and --use-start-token.

The --character option specifies which speaker labels to train on. For example, if 'Rachel Green' is specified, then only sentences uttered by Rachel Green will generate a training sample.

The choices for --character are the six main characters from the show, plus an All option that generates training samples for all six main characters (supporting cast excluded). All is the default.

[
  'All',
  'Rachel Green',
  'Monica Geller',
  'Phoebe Buffay',
  'Joey Tribbiani',
  'Chandler Bing',
  'Ross Geller',
]

The --use-silence-token (default: True) flag indicates whether a training sample should be generated with a silence token when it's not the chosen character's turn to speak. For example, the earlier pizza conversation would generate the following training samples:

# Chosen Character = Person C

Text:
Person A: I like pizza.

Label:
    __SILENCE__

===
Text:
Person A: I like pizza.
Person B: Me too.

Label:
      Could someone order pizza.

The --use-start-token (default: False) flag indicates whether a token should be included at the beginning of a conversation. For example, with the flag enabled, the earlier pizza conversation would generate the following training sample for its first utterance:

# --use-start-token = True
# Chosen Character = Person A

Text:
__START__

Label:
  I like pizza.

When --use-start-token = False, the first sentence in a conversation is skipped and does not generate a training sample.

Two further flags, --start-token and --silence-token, set the special symbols used for these tokens. By default, they are __START__ and __SILENCE__.
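The sample-generation rules described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual teacher code: the function name `generate_samples`, the `text` key, and the exact handling of edge cases are assumptions made for the example, and it handles a single chosen character rather than the All option.

```python
from typing import Dict, Iterator, List, Tuple


def generate_samples(
    conversation: List[Dict[str, str]],
    character: str,
    use_silence_token: bool = True,
    use_start_token: bool = False,
    silence_token: str = '__SILENCE__',
    start_token: str = '__START__',
) -> Iterator[Tuple[str, str]]:
    """Yield (text, label) pairs for one conversation (hypothetical helper)."""
    history: List[str] = []
    for index, utterance in enumerate(conversation):
        speaker, text = utterance['speaker'], utterance['text']
        if index == 0 and not use_start_token:
            # No context yet and no start token: the first utterance is skipped.
            history.append(f'{speaker}: {text}')
            continue
        # The context is everything said so far, or the start token if nothing was.
        context = '\n'.join(history) if history else start_token
        if speaker == character:
            yield context, text
        elif use_silence_token:
            # Someone else is speaking, so the chosen character stays silent.
            yield context, silence_token
        history.append(f'{speaker}: {text}')


pizza = [
    {'speaker': 'Person A', 'text': 'I like pizza.'},
    {'speaker': 'Person B', 'text': 'Me too.'},
    {'speaker': 'Person C', 'text': 'Could someone order pizza.'},
]
samples = list(generate_samples(pizza, character='Person C'))
```

With the defaults, `samples` holds the two Person C examples shown earlier: a silence label after Person A's line, then Person C's own utterance as the label.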

Testing steps

pytest parlai/tasks/friends/test.py

Expected Output:

============================================================= test session starts ==============================================================
test.py ...

Other information

To see the examples generated, run:

parlai display_data --task friends

chiehminwei · 2022-06-03T00:06:14Z

@mojtaba-komeili

parlai/tasks/friends/agents.py

mojtaba-komeili · 2022-06-03T14:53:20Z

parlai/tasks/friends/agents.py

+                speaker = utterance['speaker']
+                conversation_id = utterance['conversation_id']
+
+                if conversation_id not in conversations:


Check out Python's defaultdict for a simpler implementation here.
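A rough sketch of what this suggestion could look like: `collections.defaultdict(list)` removes the need for the `if conversation_id not in conversations` check. The `speaker` and `conversation_id` keys come from the diff above; the `text` key, the sample data, and the helper name `group_by_conversation` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_conversation(utterances: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Group utterances by conversation id (hypothetical helper)."""
    # Missing keys start out as an empty list, so no membership check is needed.
    conversations: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for utterance in utterances:
        conversations[utterance['conversation_id']].append(utterance)
    return conversations


utterances = [
    {'speaker': 'Person A', 'conversation_id': 'c1', 'text': 'I like pizza.'},
    {'speaker': 'Person B', 'conversation_id': 'c1', 'text': 'Me too.'},
    {'speaker': 'Person A', 'conversation_id': 'c2', 'text': 'Hello.'},
]
conversations = group_by_conversation(utterances)
```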

mojtaba-komeili

Thanks Jimmy, it looks good. Just a few comments to make it better:

Try our autoformat.sh script for fixing the lint errors.
Make sure tests are passing. There are some failing now.
Add convokit to the installation requirements (maybe that is what is causing the test failures).
It's alright to have a single teacher, but maybe add a couple more for convenience. They may be simple extensions of the current one with minimal changes. For example, a teacher for All and another one that is a single-character POV.
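One way to read the convenience-teacher suggestion is as thin subclasses that only pin the character. The sketch below uses a stand-in base class so it runs on its own; `FriendsTeacherSketch`, `fixed_character`, and the subclass names are hypothetical and are not ParlAI's real teacher classes or the names used in this PR.

```python
from typing import Optional


class FriendsTeacherSketch:
    """Stand-in for the PR's teacher; it is not ParlAI's real base class."""

    # Subclasses set this to pin the character; None means "use the option".
    fixed_character: Optional[str] = None

    def __init__(self, opt: Optional[dict] = None):
        opt = dict(opt or {})
        # A pinned character wins over whatever the options carry.
        self.character = self.fixed_character or opt.get('character', 'All')


class AllCharactersTeacherSketch(FriendsTeacherSketch):
    # Convenience variant covering all main characters.
    fixed_character = 'All'


class RachelTeacherSketch(FriendsTeacherSketch):
    # Convenience variant for a single character's point of view.
    fixed_character = 'Rachel Green'
```

Each variant changes one class attribute, which keeps the extensions minimal as the review asks.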

mojtaba-komeili · 2022-06-03T15:37:39Z

One more thing, could you add the list of existing characters as another entry to the message? It could be a comma-separated list of names.

chiehminwei · 2022-06-04T00:36:15Z

Sounds great, all done! See updated PR.

Used autoformat.sh, got around ConvoKit, added multiple convenience Teacher classes and added corresponding tests, and added cli option to specify the list of characters. All tests passed except the cleaninstall one unrelated to this PR.

mojtaba-komeili · 2022-06-06T15:00:29Z

parlai/core/build_data.py

@@ -380,6 +380,8 @@ def _unzip(path, fname, delete=True):
                PathManager.mkdirs(outpath)
                continue
            logging.debug(f"Extracting to {outpath}")
+            if '__MACOSX' in member:


It logically makes sense to submit this as its own patch. Please remove it from this PR and add it as a separate one.

mojtaba-komeili · 2022-06-06T18:59:55Z

parlai/tasks/friends/build.py

+
+def build(opt):
+    dpath = os.path.join(opt['datapath'], 'Friends')
+    version = '1.02'


You can change this to 1 since it was never published to the public code. But remove your local files to force a rebuild.

mojtaba-komeili · 2022-06-06T19:06:12Z

One more thing, could you add the list of existing characters as another entry to the message? It could be a comma-separated list of names.

You haven't addressed this one yet.

chiehminwei · 2022-06-07T17:41:24Z

parlai/tasks/friends/agents.py

+        agent.add_argument(
+            '--characters',
+            type=str,
+            default='Rachel Green,Monica Geller,Phoebe Buffay,Joey Tribbiani,Chandler Bing,Ross Geller',
+            help='A comma-separated list of characters to train on when `--character` == `All`',
+        )


The list cli arg
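As an illustration of how the two options might interact, here is a sketch using plain `argparse` as a stand-in for ParlAI's parser. The option names and defaults come from the diff and description above; `build_parser`, `resolve_characters`, and the whitespace handling are assumptions.

```python
import argparse
from typing import List, Optional

MAIN_CHARACTERS = [
    'Rachel Green',
    'Monica Geller',
    'Phoebe Buffay',
    'Joey Tribbiani',
    'Chandler Bing',
    'Ross Geller',
]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two character options (argparse stand-in)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--character', type=str, default='All', choices=['All'] + MAIN_CHARACTERS
    )
    parser.add_argument('--characters', type=str, default=','.join(MAIN_CHARACTERS))
    return parser


def resolve_characters(argv: Optional[List[str]] = None) -> List[str]:
    """Return the characters to train on (hypothetical helper)."""
    args = build_parser().parse_args(argv)
    if args.character == 'All':
        # --characters only applies when --character is All.
        return [name.strip() for name in args.characters.split(',') if name.strip()]
    return [args.character]


selected = resolve_characters(['--characters', 'Ross Geller, Joey Tribbiani'])
```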

chiehminwei · 2022-06-07T17:41:49Z

Is this what you mean?
https://github.com/facebookresearch/ParlAI/pull/4568/files#r891537386

mojtaba-komeili · 2022-06-07T23:06:55Z

One more thing I forgot to ask you to add: sample screenshots that show the output when you run parlai display_data -t <your task name>. Just a couple of teachers would be enough.

mojtaba-komeili · 2022-06-08T01:55:47Z

Is this what you mean? https://github.com/facebookresearch/ParlAI/pull/4568/files#r891537386

(For the record) we chatted about this in person.

chiehminwei · 2022-06-09T21:35:58Z

Screenshots

parlai display_data --task friends

parlai display_data --task friends:rachel

parlai display_data --task friends:rachel --use-silence-token=False

parlai display_data --task friends:monica

parlai display_data --task friends --verbose (showing characters in the scene)

mojtaba-komeili

Looks good thanks. Feel free to merge.

* feat: add friends dataset for multiparty convo
* feat: generate examples for all 6 main characters in the friends corpus by default
* fix: remove unused file
* fix: add testing
* fix: add speaker label inside label
* feat: add support for data folds; clean up code
* fix: skip __MACOS folder for zipped files to avoid exception
* fix: formatting with autoformat.sh
* feat: add convenience teacher classes
* feat: add command line option to specify list of characters
* undo changes to build_data.py
* cleanup
* style fix

chiehminwei added 4 commits June 2, 2022 16:24

feat: add friends dataset for multiparty convo

186bcdc

feat: generate examples for all 6 main characters in the friends corpus by default

22ee121

fix: remove unused file

b88a035

fix: add testing

95dfe1e

facebook-github-bot added the CLA Signed label Jun 3, 2022

fix: add speaker label inside label

351dce4

mojtaba-komeili reviewed Jun 3, 2022

mojtaba-komeili reviewed Jun 3, 2022

mojtaba-komeili reviewed Jun 3, 2022

mojtaba-komeili reviewed Jun 3, 2022

mojtaba-komeili suggested changes Jun 3, 2022

chiehminwei added 5 commits June 3, 2022 10:38

feat: add support for data folds; clean up code

e917554

fix: skip __MACOS folder for zipped files to avoid exception

8a74ed4

fix: formatting with autoformat.sh

1736f4c

feat: add convenience teacher classes

a0badae

feat: add command line option to specify list of characters

07ef5de

mojtaba-komeili reviewed Jun 6, 2022

undo changes to build_data.py

b61fa80

mojtaba-komeili reviewed Jun 6, 2022

chiehminwei commented Jun 7, 2022

cleanup

e465518

style fix

bb444da

mojtaba-komeili approved these changes Jun 10, 2022

chiehminwei merged commit 33132e6 into main Jun 10, 2022

chiehminwei deleted the friends_dataset branch June 10, 2022 16:16
