eusip/POM

Overview

This dataset is an annotated variant of the Persuasive Opinion Multimedia (POM) corpus. It was developed for the opinion prediction task and includes opinion annotations at the expression and word levels. Expression-level annotations label the textual span of the opinion. Word-level annotations (e.g. holder, target, polarity) label the word components of the opinion. Further details can be found in (Garcia et al. 2019 (1)). As part of preprocessing, punctuation was added to the text of the original corpus. The dataset is stored as a pickled pandas MultiIndex DataFrame.

The hierarchical index structure can be understood according to the tuple which forms the MultiIndex object. The first element of the index is one of the following values: features, labels, level_0, seq_level_labels_lvl1 or words.

Each row in words is indexed by the following tuple of values: (index_text, id_sentence, level_1) where index_text indexes the raw filename for each movie review, id_sentence indexes the sentences in the review, and level_1 indexes each word in each sentence. This same tuple indexes the rows of each of the following pieces of data in the dataframe.
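A minimal sketch of how this index structure might be navigated. It assumes that the top-level values listed above form the first level of a three-level column MultiIndex and that rows are indexed by (index_text, id_sentence, level_1). The toy frame, the review identifiers, and the lower-level column names are illustrative stand-ins, not the real data. With the actual file, one would presumably start from `pd.read_pickle` on the downloaded pickle instead of building the frame by hand.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (assumed layout, not the actual data).
rows = pd.MultiIndex.from_tuples(
    [("review_a", 0, 0), ("review_a", 0, 1), ("review_a", 1, 0), ("review_b", 0, 0)],
    names=["index_text", "id_sentence", "level_1"],
)
cols = pd.MultiIndex.from_tuples(
    [
        ("words", "", ""),  # lower levels of the words column are assumed
        ("labels", "label_video_sentiment", "0"),
        ("seq_level_labels_lvl1", "seq_level_labels_lvl2", "Target"),
    ]
)
df = pd.DataFrame(
    [["it", 5.0, 1], ["is", 5.0, 0], ["awesome", 5.0, 0], ["boring", 2.0, 0]],
    index=rows,
    columns=cols,
)

# Top-level groups present in the column index.
top_level = sorted(df.columns.get_level_values(0).unique())

# All words of sentence 0 of one review, in word order.
first_sentence = df.xs(("review_a", 0), level=("index_text", "id_sentence"))
sentence = first_sentence[("words", "", "")].tolist()

# Number of sentences per review, derived from the row index alone.
sentences_per_review = (
    df.index.to_frame(index=False)
    .drop_duplicates(["index_text", "id_sentence"])
    .groupby("index_text")
    .size()
    .to_dict()
)
```

Selecting by the leading elements of the row tuple (review, then sentence) returns the word-level rows beneath them, which is usually the access pattern needed for sequence models.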

Features

Features consist of the tuple (features, [feature name], dimension), where the number of dimensions is the number of columns that make up a particular feature. This data originates from the original POM corpus (Park et al. 2014) but was re-aligned so that it could be incorporated into this dataset.

| feature name | feature type | dimensions |
| --- | --- | --- |
| feature_COAVAREP | audio | 43 |
| feature_FACET 4.1 | video | 43 |
| feature_FACET 4.2 | video | 36 |
| feature_glove_vectors | text | 300 |
| intervals | word start, word stop | 2 |
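As a rough illustration, one feature block can be pulled out as a word-by-dimension matrix. The sketch below builds a toy frame with the dimensions from the table. The integer third level, the random values, and the interval values are assumptions made for the example only.

```python
import numpy as np
import pandas as pd

# Dimensions taken from the table; the integer third level is an assumption.
dims = {"feature_COAVAREP": 43, "feature_glove_vectors": 300, "intervals": 2}
cols = pd.MultiIndex.from_tuples(
    [("features", name, d) for name, n in dims.items() for d in range(n)]
)
rows = pd.MultiIndex.from_tuples(
    [("review_a", 0, 0), ("review_a", 0, 1), ("review_a", 0, 2)],
    names=["index_text", "id_sentence", "level_1"],
)

# Random values stand in for the real features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(len(rows), len(cols))), index=rows, columns=cols)

# Made-up word start and stop values for the three words.
df[("features", "intervals", 0)] = [0.0, 0.4, 0.9]
df[("features", "intervals", 1)] = [0.4, 0.9, 1.5]

# One (n_words, n_dims) matrix per feature.
glove = df[("features", "feature_glove_vectors")].to_numpy()
audio = df[("features", "feature_COAVAREP")].to_numpy()

# Word durations from the (start, stop) interval columns.
intervals = df[("features", "intervals")].to_numpy()
durations = intervals[:, 1] - intervals[:, 0]
```

Here `glove` has one row per word and 300 columns, and `audio` has 43 columns, matching the table.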

Video labels

Video labels consist of the tuple (labels, [label name], dimension), where the number of dimensions is the number of columns that make up a particular label. This data also originates from the original POM corpus (Park et al. 2014) and was re-aligned.

| label name | dimensions |
| --- | --- |
| label_video_personality | 16 |
| label_video_persuasion | 1 |
| label_video_sentiment | 1 |

Opinion labels

Opinion labels consist of the tuple (seq_level_labels_lvl1, seq_level_labels_lvl2, [label]). The label field consists of all holders, polarities, and targets in the dataset. Each label is boolean. The exception is the sentence-level 4_levels_polarity label, which can take the value '0' (no opinion), '1' (negative opinion), or '2' (positive opinion).

| label | granularity |
| --- | --- |
| 4_levels_polarity | sentence-level |
| Actor | expression-level |
| Atmosphere and mood | expression-level |
| Character design | expression-level |
| Composer - Singer - Soundmaker | expression-level |
| Director | expression-level |
| Music and Sound effects | expression-level |
| Negative | expression-level |
| Negative_levels | expression-level |
| Neutral | expression-level |
| Other | expression-level |
| Other people involved in movie making | expression-level |
| Overall | expression-level |
| Polarity | word-level |
| Positive | expression-level |
| Positive_levels | expression-level |
| Price | expression-level |
| Producer | expression-level |
| Screenplay | expression-level |
| Target | word-level |
| Token | word-level |
| Very_Negative | expression-level |
| Very_Positive | expression-level |
| Vision and Special effect | expression-level |

There are two special expression-level labels: Negative_levels and Positive_levels. Both are aggregate labels. Negative_levels takes the value '1' only if Negative OR Very_Negative takes the value '1' at the expression level, and Positive_levels takes the value '1' only if Positive OR Very_Positive does.
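The aggregation rule amounts to an element-wise logical OR. A minimal sketch on toy values (the five rows below are made up for illustration and do not come from the dataset):

```python
import pandas as pd

# Toy expression-level labels for five words (illustrative values only).
labels = pd.DataFrame(
    {
        "Negative": [0, 1, 0, 0, 0],
        "Very_Negative": [0, 0, 1, 0, 0],
        "Positive": [0, 0, 0, 1, 0],
        "Very_Positive": [0, 0, 0, 0, 0],
    }
).astype(bool)

# Aggregate labels: logical OR of the two intensity levels.
labels["Negative_levels"] = labels["Negative"] | labels["Very_Negative"]
labels["Positive_levels"] = labels["Positive"] | labels["Very_Positive"]
```

The same expression could serve as a consistency check on the real data, by recomputing the aggregates and comparing them with the stored columns.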

An example of a sentence from the dataset is:

This movie came out a few years ago and it is awesome

This sentence has a 4_levels_polarity of '2' because it contains the positive expression "it is awesome". The target word is "it", so this word has a value of '1' for the label Target. Finally, "it is" refers to the overall film, so the words "it" and "is" both have values of '1' for the labels Very_Positive, Positive_levels, and Overall.
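The example can be written out word by word. This sketch encodes only the labels named in the description; every entry not mentioned there is left at 0, which is an assumption of the sketch and not a statement about the actual annotations.

```python
tokens = "This movie came out a few years ago and it is awesome".split()

# Only the labels named in the text are set. All other entries default to 0
# here, which is an assumption of this sketch, not a fact about the data.
stated = {
    "Target": {"it"},
    "Very_Positive": {"it", "is"},
    "Positive_levels": {"it", "is"},
    "Overall": {"it", "is"},
}
word_labels = {
    label: [int(tok in words) for tok in tokens] for label, words in stated.items()
}

# Sentence-level label values and their meaning.
POLARITY_NAMES = {"0": "no opinion", "1": "negative opinion", "2": "positive opinion"}
sentence_polarity = "2"

# Words flagged as the opinion target.
target_words = [tok for tok, flag in zip(tokens, word_labels["Target"]) if flag]
```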

Considerations

Researchers should keep in mind that this dataset differs from the original POM dataset because of the following data processing steps:

  1. Annotators did not take into account the video portion of the dataset during annotation. Only the transcripts of each review were considered.
  2. While the original dataset contained punctuation (e.g. silent pauses), this dataset does not contain punctuation and only provides sentence segmentation. This could be of significant importance for those who want to use certain audio features from the CMU SDK, such as pause (Park et al. 2014).
  3. Because punctuation has been removed, the Levenshtein distance was used to re-match the annotated transcripts with the transcripts of the original dataset.
  4. Finally the annotated transcripts were re-integrated with the remaining features in the original POM dataset.
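The re-matching in step 3 can be sketched as a nearest-neighbour search under edit distance. The source does not say whether the distance was computed over characters or words, or over sentences or whole transcripts. The version below is a character-level assumption, and the example transcripts and the helper name `closest` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # delete ca
                    curr[j - 1] + 1,  # insert cb
                    prev[j - 1] + (ca != cb),  # substitute (free if equal)
                )
            )
        prev = curr
    return prev[-1]


def closest(query, candidates):
    """Return the candidate with the smallest edit distance to the query."""
    return min(candidates, key=lambda c: levenshtein(query, c))


# Hypothetical transcripts: the annotated one lacks punctuation and casing.
original = [
    "It came out a few years ago, and it is awesome.",
    "I did not like the ending.",
]
annotated = "it came out a few years ago and it is awesome"
match = closest(annotated, original)
```

The row-by-row formulation keeps memory proportional to the length of the shorter comparison, which matters when the inputs are full transcripts.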

Download Link

The dataset is available for download through registration at the following link:

http://service.tsi.telecom-paristech.fr/cgi-bin/user-service/subscribe.cgi?form=&license=1&ident=POM

If prompted to sign in, simply click 'Cancel' to navigate to the registration page.

FileZilla is the recommended FTP client. Please make sure to use the following configuration when connecting to the server.

[Image: FTP connection configuration; not reproduced here]

Acknowledgement

The documentation of this dataset and its issues, and code to parse the data were contributed by Tanvi Dinkar.

Contact Information

Please direct any questions or concerns regarding this dataset to Chloé Clavel (chloe.clavel@telecom-paris.fr) or Tanvi Dinkar (T.Dinkar@hw.ac.uk).

Citation information

@article{garcia2019multimodal,
  title={A multimodal movie review corpus for fine-grained opinion mining},
  author={Garcia, Alexandre and Essid, Slim and d'Alch{\'e}-Buc, Florence and Clavel, Chlo{\'e}},
  journal={arXiv preprint arXiv:1902.10102},
  year={2019}
}

@article{garcia2019token,
  title={From the token to the review: A hierarchical multimodal approach to opinion mining},
  author={Garcia, Alexandre and Colombo, Pierre and Essid, Slim and d'Alch{\'e}-Buc, Florence and Clavel, Chlo{\'e}},
  journal={arXiv preprint arXiv:1908.11216},
  year={2019}
}

@inproceedings{park2014computational,
  title={Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach},
  author={Park, Sunghyun and Shim, Han Suk and Chatterjee, Moitreya and Sagae, Kenji and Morency, Louis-Philippe},
  booktitle={Proceedings of the 16th International Conference on Multimodal Interaction},
  pages={50--57},
  year={2014}
}
