This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

How to get the Personality-Captions dataset? #1704

Closed
czhxiaohuihui opened this issue May 10, 2019 · 26 comments

@czhxiaohuihui

czhxiaohuihui commented May 10, 2019

I want to get the Personality-Captions dataset, but I'm confused by the instruction "The Personality-Captions dataset can be accessed via ParlAI, with -t personality_captions" at https://parl.ai/projects/personality_captions/

What does -t personality_captions mean here?
Can you tell me the full command or a download link?

Thanks a lot!

@stephenroller
Contributor

python examples/display_data.py -t personality_captions should download it.

@Yukti-09

Yukti-09 commented Jul 4, 2019

On trying to execute the above command I got an error:
python: can't open file 'examples/display_data.py': [Errno 2] No such file or directory

@stephenroller
Contributor

You need to run it from within the root of the ParlAI directory. Alternatively, if you've installed properly, you can:

python -m parlai.scripts.display_data -t personality_captions

@jaseweston
Contributor

jaseweston commented Jul 4, 2019 via email

@Yukti-09

Yukti-09 commented Jul 5, 2019

Does it mean that 1 image is trained with 1 caption?

@klshuster
Contributor

Each image in the training set has one corresponding caption - there are roughly 186k image/caption/personality triples in the train set. The test set has 10k images, each with 5 captions based on one personality.

@Yukti-09

Yukti-09 commented Jul 6, 2019

Thank you, Sir.

I am facing some difficulty downloading the dataset; I am getting this error:

raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='multimedia-commons.s3-us-west-2.amazonaws.com', port=443): Read timed out. (read timeout=5)

@stephenroller
Contributor

I'm not sure how we can help you there; we're not associated with the multimedia-commons servers, and they seem to load fine for us. Maybe your system has something different about its SSL root certs or a firewall? Can you try running this from your machine:

$ curl -I 'https://multimedia-commons.s3-us-west-2.amazonaws.com/'
HTTP/1.1 200 OK
x-amz-id-2: UU9b8yIu1xfc+Yb2dYHOKk2aAw/laLd276cKTyHj5svrQeqqkkPnhRL7MhNutQtXztyqE3yqn1U=
x-amz-request-id: 65B9B8CCB659F28E
Date: Sat, 06 Jul 2019 16:37:24 GMT
x-amz-bucket-region: us-west-2
Content-Type: application/xml
Transfer-Encoding: chunked
Server: AmazonS3

@Yukti-09

Yukti-09 commented Jul 6, 2019

I tried the above command but I am facing the same issue.

@stephenroller
Contributor

There’s nothing I can do to help you with that; there’s something larger wrong with your machine and its internet setup or similar.

@Yukti-09

Yukti-09 commented Jul 8, 2019

Are there any other means by which I can access the dataset, maybe a google drive link?
I am working on a research project and would be really grateful if I could have access to the dataset. It would be very beneficial to me.

I would be obliged if you could kindly help me out.

@stephenroller
Contributor

stephenroller commented Jul 8, 2019

I’m sorry, but that might constitute copyright infringement, so I cannot help there.

You’re looking for the YFCC100M dataset; it can probably be found elsewhere.

@Yukti-09

Yukti-09 commented Jul 8, 2019

Hello Sir,

The research paper states that the images for training, test and validation have been randomly selected from the YFCC100M dataset and therefore just downloading that particular dataset will not prove to be beneficial for me.

I am interested in using the captions as there are 215 personality traits, it could further really help me with my project.

I would be really grateful if I could be helped out.

Thank you.

@klshuster
Contributor

The examples in the dataset include the corresponding image hash, which you can use to select the matching images from the YFCC100m dataset; e.g. the first example in the dataset is

{'personality': 'Intense',
 'comment': 'The snow will last as long as my sadness',
 'image_hash': '1e22a9cf867d718551386b427c3b6d18'}

This image hash identifies the corresponding image in the YFCC100m dataset.
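As a convenience, mapping a hash to its public multimedia-commons URL can be sketched in a couple of lines. The three-level directory pattern (first 3 hex characters, next 3, then the full hash) is inferred from download URLs pasted elsewhere in this thread; treat it as an assumption, not a documented API:

```python
# Sketch: build the multimedia-commons S3 URL for a Personality-Captions
# image hash. Path layout is an assumption inferred from observed URLs.

def yfcc_image_url(image_hash: str) -> str:
    """Return the (assumed) public URL for an image, given its hash."""
    base = "https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images"
    return f"{base}/{image_hash[:3]}/{image_hash[3:6]}/{image_hash}.jpg"

print(yfcc_image_url("1e22a9cf867d718551386b427c3b6d18"))
```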

@Yukti-09

Yukti-09 commented Jul 9, 2019

Thank you!

I tried downloading the images via:

https://github.com/stefanbirkner/yfcc100m-downloader
but the image hashes are different.

I am not allowed to download images directly from:
https://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67&guccounter=1

@stephenroller
Contributor

https://github.com/stefanbirkner/yfcc100m-downloader/blob/06116b9d33b0b3382fb05329c5c5aebbbde2ef2a/download_files.py#L14

That code uses the same servers that Kurt's code uses. If that script works for you, then ParlAI's download script should work too, no?

@Yukti-09

Yukti-09 commented Jul 9, 2019

I do understand, but the same error persists. I have tried it on two different systems, Mac and Ubuntu.
I want to work on the dataset as soon as possible so that I can move forward with my project.
I have tried everything under the sun but just cannot figure out how to proceed.

@stephenroller
Contributor

stephenroller commented Jul 9, 2019

I have just tested the download script in ParlAI, and it works fine on the servers we use, and there have been multiple proposed solutions for you. Here's a sample of the output:

[ downloading: https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images/355/340/355340eca49a738bff24f6d51b64eede.jpg to /private/home/roller/working/parlai/data/yfcc_images/355340eca49a738bff24f6d51b64eede.jpg ] 208kB/s
  0%| | 83/201858 [00:05<3:50:44, 14.57img/s]
[ downloading: https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images/979/e6a/979e6a8bb69eb697f96dbb34c5094ad.jpg to /private/home/roller/working/parlai/data/yfcc_images/979e6a8bb69eb697f96dbb34c5094ad.jpg ] 266kB/s
  0%| | 84/201858 [00:06<4:08:32, 13.53img/s]
[ downloading: https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images/956/d8f/956d8f222b39b6b8d26154f30d79e47.jpg to /private/home/roller/working/parlai/data/yfcc_images/956d8f222b39b6b8d26154f30d79e47.jpg ] 279kB/s
  0%| | 85/201858 [00:06<4:23:46, 12.75img/s]
[ downloading: https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images/87a/c80/87ac8069fb19f5c58011815ebad6c2e.jpg to /private/home/roller/working/parlai/data/yfcc_images/87ac8069fb19f5c58011815ebad6c2e.jpg ] 167kB/s
  0%| | 86/201858 [00:07<4:38:08, 12.09img/s]
...

The YFCC website says something about needing an AWS account, but I don't know how that's actually enforced. If you've downloaded the YFCC dataset using that other script, then you have the full dataset; Personality-Captions uses a strict subset of it, so you have more than you need. Just putting those files into your parlai/data/yfcc_images/ folder should do the trick. I really don't know how to help you more. I suggest manually going to one of those URLs and making sure it loads.

@Yukti-09

raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:645)>

I do not know how to fix this issue.
Could the images be uploaded to GitHub?
Yes, the AWS account requirement is enforced; even after creating one, a request has to be submitted to download the dataset.

@stephenroller
Contributor

stephenroller commented Jul 10, 2019

Facebook does not own the images, so I cannot redistribute them in any manner. You are responsible for obtaining the images and requisite permission to use them.

Thank you for posting the error. Based on a quick Google search, there are Stack Overflow posts discussing either (1) outdated libraries, but those go back to 2016 and seem unlikely, or (2) issues around proxies.

It looks like urllib obeys the http_proxy environment variable, while requests obeys the HTTPS_PROXY and HTTP_PROXY environment variables. Do you need to set those up to work on your network? Perhaps you need to modify our download code to deal appropriately with your environment? Perhaps those ARE set in your environment and they shouldn't be? Please paste a copy of your env.
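To check which proxy variables are set, a minimal sketch (these are the standard variable names that urllib and requests consult; nothing here is ParlAI-specific):

```python
import os

# The proxy-related environment variables consulted by urllib/requests.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy")

def proxy_settings(env=None):
    """Return whichever proxy variables are set in the given environment."""
    env = os.environ if env is None else env
    return {v: env[v] for v in PROXY_VARS if v in env}

print(proxy_settings())  # an empty dict means no proxy is configured
```

If this prints values you don't expect, unsetting (or correctly setting) them is worth trying before modifying any download code.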

with requests.Session() as session:
    try:
        header = (
            {'Range': 'bytes=%d-' % resume_pos, 'Accept-Encoding': 'identity'}
            if resume
            else {}
        )
        response = session.get(url, stream=True, timeout=5, headers=header)
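One way to make a snippet like the above more tolerant of a slow connection is to raise the timeout and retry with backoff. This is a generic sketch, not ParlAI's actual code; the fetch callable and the exception types to catch are up to you:

```python
import time

def retry_on_timeout(fetch, retries=3, backoff=1.0, exc=(TimeoutError,)):
    """Call fetch(), retrying on timeout-like exceptions with linear backoff."""
    for attempt in range(retries):
        try:
            return fetch()
        except exc:
            if attempt == retries - 1:
                raise  # out of retries; re-raise the last timeout
            time.sleep(backoff * (attempt + 1))

# Usage sketch with requests (catch requests.exceptions.Timeout there):
# resp = retry_on_timeout(
#     lambda: session.get(url, stream=True, timeout=30, headers=header),
#     exc=(requests.exceptions.Timeout,),
# )
```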

@Yukti-09

On Mac, it is anaconda3
On ubuntu, it is venv

@emilydinan
Contributor

Hi @Yukti-09 that isn't quite enough information for us to debug. We've been unable to reproduce these errors on several different environments. I'd suggest looking to the internet to help you debug at this point.

@njucckevin

> Each image in the training set has one corresponding caption - there are roughly 186k image/caption/personality triples in the train set. The test set has 10k images, each with 5 captions based on one personality.

I have downloaded the personality_captions dataset successfully, but where can I find the 5 reference captions in val.json and test.json? These two files have many sentences for one "image_hash".
Thanks a lot!

@klshuster
Contributor

Hi there! If you're looking at the raw test.json file, you'll notice that it's a list of 10k dictionaries; the gold human reference caption can be found in the 'comment' key, and the 4 additional captions can be found in the 'additional_comments' key. The valid set only has 1 caption per image.

Hope that helps!
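The layout described above can be sketched in a few lines (the file path in the usage comment is illustrative; adjust to wherever ParlAI put your data):

```python
import json

def references(example):
    """Gold caption plus the 4 additional ones -> list of 5 reference strings."""
    return [example["comment"]] + example["additional_comments"]

# Usage sketch, assuming a default ParlAI data layout:
# data = json.load(open("data/personality_captions/test.json"))
# print(references(data[0]))
```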

@njucckevin

Thanks for the quick reply! I missed 'additional_comments' before; the data is fine now.

@nr596

nr596 commented May 15, 2021

> Each image in the training set has one corresponding caption - there are roughly 186k image/caption/personality triples in the train set. The test set has 10k images, each with 5 captions based on one personality.
>
> I have downloaded the personality_captions dataset successfully, but where can I find the 5 reference captions in val.json and test.json? These two files have many sentences for one "image_hash".
> Thanks a lot!

Hi Nick, can you please tell me how you downloaded the dataset?
