This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Friends dataset #4568

Merged
merged 13 commits into from Jun 10, 2022

@chiehminwei chiehminwei commented Jun 3, 2022

Patch description
This patch adds a new task friends for generating multiparty conversation training examples from the Friends Corpus on ConvoKit.

Example Training Sample:

Text:
Person A: I like pizza.
Person B: Me too.
Person C: Could someone order pizza.

Label:
	I will do it

The teacher supports 3 command line arguments, --character, --use-silence-token, and --use-start-token.

The --character option specifies which speaker labels to train on. For example, if 'Rachel Green' is specified, then only sentences uttered by Rachel Green will generate a training sample.

The choices for --character are the six main characters from the show, or All, which generates training samples for all six main characters and excludes the supporting cast. All is the default.

[
  'All',
  'Rachel Green',
  'Monica Geller',
  'Phoebe Buffay',
  'Joey Tribbiani',
  'Chandler Bing',
  'Ross Geller',
]

The --use-silence-token (default: True) flag indicates whether a training sample should be generated with a silence token when it is not the chosen character's turn to speak. For example, the earlier pizza conversation would generate the following training samples:

# Chosen Character = Person C

Text:
Person A: I like pizza.

Label:
    __SILENCE__

===
Text:
Person A: I like pizza.
Person B: Me too.

Label:
      Could someone order pizza.

The --use-start-token (default: False) flag indicates whether a start token should be included at the beginning of a conversation. For example, the earlier pizza conversation would generate the following training sample:

# --use-start-token = True
# Chosen Character = Person A

Text:
__START__

Label:
  I like pizza.

When --use-start-token is False, the first utterance in a conversation is skipped and does not generate a training sample.

Two further flags, --start-token and --silence-token, let you specify which special symbols represent these tokens. By default, they are __START__ and __SILENCE__.
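The sample-generation rules described above can be sketched in plain Python. This is a minimal illustration, not the teacher code from this PR: the function name `build_samples`, the tuple-based conversation format, and the choice to keep the start token in later contexts are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

START_TOKEN = '__START__'      # default named in the description
SILENCE_TOKEN = '__SILENCE__'  # default named in the description


def build_samples(
    utterances: List[Tuple[str, str]],
    character: str,
    use_silence_token: bool = True,
    use_start_token: bool = False,
    start_token: str = START_TOKEN,
    silence_token: str = SILENCE_TOKEN,
) -> List[Dict[str, str]]:
    """Turn one conversation into text/label samples for `character` (hypothetical helper)."""
    samples: List[Dict[str, str]] = []
    context: List[str] = []
    for index, (speaker, text) in enumerate(utterances):
        if index == 0:
            if use_start_token:
                # The start token stands in for the missing context.
                context.append(start_token)
            else:
                # Without a start token the first utterance only seeds the context.
                context.append(f'{speaker}: {text}')
                continue
        if speaker == character:
            label = text
        elif use_silence_token:
            label = silence_token
        else:
            label = None  # not this character's turn and silence is disabled
        if label is not None:
            samples.append({'text': '\n'.join(context), 'label': label})
        context.append(f'{speaker}: {text}')
    return samples


pizza = [
    ('Person A', 'I like pizza.'),
    ('Person B', 'Me too.'),
    ('Person C', 'Could someone order pizza.'),
]
for sample in build_samples(pizza, 'Person C'):
    print(sample)
```

With Person C chosen and the defaults, the sketch yields one silence sample for Person B's turn and one sample whose label is Person C's utterance, matching the example above.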

Testing steps

pytest parlai/tasks/friends/test.py

Expected Output:

============================================================= test session starts ==============================================================
test.py ...

Other information

To see the examples generated, run:

parlai display_data --task friends

@mojtaba-komeili

speaker = utterance['speaker']
conversation_id = utterance['conversation_id']

if conversation_id not in conversations:

Check out Python's defaultdict for a simpler implementation here.
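A minimal sketch of what this suggestion could look like, assuming each utterance is a dict with the `speaker` and `conversation_id` keys seen in the snippet above. The `text` key, the sample data, and the function name `group_by_conversation` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_conversation(utterances: List[dict]) -> Dict[str, List[dict]]:
    """Group utterances by conversation without an explicit membership check."""
    conversations: Dict[str, List[dict]] = defaultdict(list)
    for utterance in utterances:
        # defaultdict creates the empty list on first access, so the
        # `if conversation_id not in conversations` branch is not needed.
        conversations[utterance['conversation_id']].append(utterance)
    return conversations


utterances = [
    {'speaker': 'Person A', 'conversation_id': 'c1', 'text': 'I like pizza.'},
    {'speaker': 'Person B', 'conversation_id': 'c1', 'text': 'Me too.'},
    {'speaker': 'Person C', 'conversation_id': 'c2', 'text': 'Hello.'},
]
grouped = group_by_conversation(utterances)
print({cid: len(items) for cid, items in grouped.items()})
```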

@mojtaba-komeili mojtaba-komeili left a comment

Thanks Jimmy, it looks good. Just a few comments to make it better:

  1. Try our autoformat.sh script for fixing the lint errors.
  2. Make sure tests are passing. Some are failing now.
  3. Add convokit to the installation requirements (maybe that is what is causing the test failures).
  4. It's alright to have a single teacher, but maybe add a couple more for convenience. They may be simple extensions of the current one with minimal changes. For example, a teacher for All and another one that is a single-character POV.

@mojtaba-komeili

One more thing, could you add the list of existing characters as another entry to the message? It could be a comma-separated list of names.

@chiehminwei

Sounds great, all done! See updated PR.

Used autoformat.sh, got around ConvoKit, added multiple convenience Teacher classes with corresponding tests, and added a CLI option to specify the list of characters. All tests passed except the cleaninstall one, which is unrelated to this PR.

@@ -380,6 +380,8 @@ def _unzip(path, fname, delete=True):
PathManager.mkdirs(outpath)
continue
logging.debug(f"Extracting to {outpath}")
if '__MACOSX' in member:

It logically makes sense to submit this as its own patch. Please remove it from this PR and add it as a separate one.
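For reference, the idea behind the `'__MACOSX' in member` check can be sketched with the standard library alone. This is an illustration under assumptions, not the `_unzip` code from the diff: the function name `extract_skipping_macosx` is hypothetical, and it assumes the check is followed by skipping the member.

```python
import os
import zipfile
from typing import List


def extract_skipping_macosx(zip_path: str, out_dir: str) -> List[str]:
    """Extract an archive while ignoring macOS resource-fork entries."""
    extracted: List[str] = []
    with zipfile.ZipFile(zip_path, 'r') as archive:
        for member in archive.namelist():
            if '__MACOSX' in member:
                # Archives created on macOS can carry a __MACOSX/ metadata folder;
                # skipping it avoids unpacking files that are not part of the dataset.
                continue
            archive.extract(member, out_dir)
            extracted.append(member)
    return extracted
```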


def build(opt):
    dpath = os.path.join(opt['datapath'], 'Friends')
    version = '1.02'

You can change this to 1 since it was never published to the public code. But remove your local files to force a rebuild.
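To illustrate why the version string and the local files interact, here is a simplified stand-in for a version-marker check. It is not ParlAI's `build_data` module: the helper names, the `.built` marker layout, and the `download` callback are assumptions for this sketch.

```python
import os
import shutil
from typing import Callable


def is_built(dpath: str, version: str) -> bool:
    """Return True when the marker file records exactly this version."""
    marker = os.path.join(dpath, '.built')
    if not os.path.isfile(marker):
        return False
    with open(marker) as handle:
        return handle.read().strip() == version


def build_if_needed(dpath: str, version: str, download: Callable[[str], None]) -> bool:
    """Rebuild the data folder when it is missing or stamped with another version."""
    if is_built(dpath, version):
        return False  # cached copy matches, nothing to do
    if os.path.isdir(dpath):
        shutil.rmtree(dpath)  # drop stale data from an older version
    os.makedirs(dpath)
    download(dpath)
    with open(os.path.join(dpath, '.built'), 'w') as handle:
        handle.write(version)
    return True
```

Under this scheme, a folder stamped with one version string is treated as stale once the code asks for a different one, and deleting the folder by hand has the same effect.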

@mojtaba-komeili

One more thing, could you add the list of existing characters as another entry to the message? It could be a comma-separated list of names.

You haven't addressed this one yet.

Comment on lines +113 to +118
agent.add_argument(
    '--characters',
    type=str,
    default='Rachel Green,Monica Geller,Phoebe Buffay,Joey Tribbiani,Chandler Bing,Ross Geller',
    help='A comma-separated list of characters to train on when `--character` == `All`',
)

The list cli arg
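As a rough sketch of how the two options could combine, the helper below resolves `--character` and `--characters` into the list of speakers to train on. The function name `resolve_characters` and the whitespace handling are assumptions, not code from this PR.

```python
from typing import List

# Default taken from the --characters argument shown above.
DEFAULT_CHARACTERS = (
    'Rachel Green,Monica Geller,Phoebe Buffay,'
    'Joey Tribbiani,Chandler Bing,Ross Geller'
)


def resolve_characters(character: str = 'All', characters: str = DEFAULT_CHARACTERS) -> List[str]:
    """Return the speakers whose utterances should become training labels."""
    if character == 'All':
        # The comma-separated list only applies when --character is All.
        return [name.strip() for name in characters.split(',') if name.strip()]
    return [character]


print(resolve_characters())
print(resolve_characters('Rachel Green'))
print(resolve_characters('All', 'Ross Geller, Monica Geller'))
```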

@mojtaba-komeili

One more thing that I forgot to ask you to add: sample screenshots that show the output when you run parlai display_data -t <your task name>. Just a couple of teachers would be enough.

@mojtaba-komeili

Is this what you mean? https://github.com/facebookresearch/ParlAI/pull/4568/files#r891537386

(For record keeping) we chatted about this in person.

@chiehminwei

chiehminwei commented Jun 9, 2022

Screenshots

parlai display_data --task friends
image

parlai display_data --task friends:rachel
image

parlai display_data --task friends:rachel --use-silence-token=False
image

parlai display_data --task friends:monica
image

parlai display_data --task friends --verbose (showing characters in the scene)
image

@mojtaba-komeili mojtaba-komeili left a comment

Looks good, thanks. Feel free to merge.

@chiehminwei chiehminwei merged commit 33132e6 into main Jun 10, 2022
@chiehminwei chiehminwei deleted the friends_dataset branch June 10, 2022 16:16
kushalarora pushed a commit that referenced this pull request Jun 15, 2022
* feat: add friends dataset for multiparty convo

* feat: generate examples for all 6 main characters in the friends corpus by default

* fix: remove unused file

* fix: add testing

* fix: add speaker label inside label

* feat: add support for data folds; clean up code

* fix: skip __MACOS folder for zipped files to avoid exception

* fix: formatting with autoformat.sh

* feat: add convenience teacher classes

* feat: add command line option to specify list of characters

* undo changes to build_data.py

* cleanup

* style fix