This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

How to get the Personality-Captions dataset? #1704

Closed
czhxiaohuihui opened this issue May 10, 2019 · 26 comments

@czhxiaohuihui

czhxiaohuihui commented May 10, 2019

I want to get the Personality-Captions dataset, but I'm confused by the instruction "The Personality-Captions dataset can be accessed via ParlAI, with -t personality_captions" at https://parl.ai/projects/personality_captions/

What does -t personality_captions mean here?
Can you tell me the full command or a download link?

Thanks a lot!

@stephenroller
Contributor

python examples/display_data.py -t personality_captions should download it.

@Yukti-09

Yukti-09 commented Jul 4, 2019

On trying to execute the above command I got an error:
python: can't open file 'examples/display_data.py': [Errno 2] No such file or directory

@stephenroller
Contributor

You need to run it from within the root of the ParlAI directory. Alternatively, if you've installed properly, you can:

python -m parlai.scripts.display_data -t personality_captions

@jaseweston
Contributor

jaseweston commented Jul 4, 2019 via email

@Yukti-09

Yukti-09 commented Jul 5, 2019

Does it mean that 1 image is trained with 1 caption?

@klshuster
Contributor

Each image in the training set has one corresponding caption - there are roughly 186k image/caption/personality triples in the train set. The test set has 10k images, each with 5 captions based on one personality.

@Yukti-09

Yukti-09 commented Jul 6, 2019

Thank you, Sir.

I am facing some difficulty downloading the dataset; I am getting this error:

raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='multimedia-commons.s3-us-west-2.amazonaws.com', port=443): Read timed out. (read timeout=5)

@stephenroller
Contributor

I'm not sure how we can help you there; we're not associated with the multimedia-commons servers, and they seem to load fine for us. Maybe your system has something different about its SSL root certs or a firewall? Can you try running this from your machine:

$ curl -I 'https://multimedia-commons.s3-us-west-2.amazonaws.com/'
HTTP/1.1 200 OK
x-amz-id-2: UU9b8yIu1xfc+Yb2dYHOKk2aAw/laLd276cKTyHj5svrQeqqkkPnhRL7MhNutQtXztyqE3yqn1U=
x-amz-request-id: 65B9B8CCB659F28E
Date: Sat, 06 Jul 2019 16:37:24 GMT
x-amz-bucket-region: us-west-2
Content-Type: application/xml
Transfer-Encoding: chunked
Server: AmazonS3

@Yukti-09

Yukti-09 commented Jul 6, 2019

I tried the above command but I am facing the same issue.

@stephenroller
Contributor

There’s nothing I can do to help you with that; there’s something larger wrong with your machine and its internet setup or similar.

@Yukti-09

Yukti-09 commented Jul 8, 2019

Are there any other means by which I can access the dataset, maybe a google drive link?
I am working on a research project and would be really grateful if I could have access to the dataset. It would be very beneficial to me.

I would be obliged if you could kindly help me out.

@stephenroller
Contributor

stephenroller commented Jul 8, 2019

I’m sorry, but that might constitute copyright infringement, so I cannot help there.

You’re looking for the YFCC100M dataset; it can probably be found elsewhere.

@Yukti-09

Yukti-09 commented Jul 8, 2019

Hello Sir,

The research paper states that the images for training, test and validation have been randomly selected from the YFCC100M dataset and therefore just downloading that particular dataset will not prove to be beneficial for me.

I am interested in using the captions as there are 215 personality traits, it could further really help me with my project.

I would be really grateful if I could be helped out.

Thank you.

@klshuster
Contributor

The examples in the dataset include the corresponding image hash, which you can use to select the matching images from the YFCC100m dataset; e.g. the first example in the dataset is

{'personality': 'Intense',
 'comment': 'The snow will last as long as my sadness',
 'image_hash': '1e22a9cf867d718551386b427c3b6d18'}

This image hash identifies the corresponding image in the YFCC100m dataset.
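As a convenience, mapping a hash to its public multimedia-commons URL can be sketched in a couple of lines. The three-level directory pattern (first 3 hex characters, next 3, then the full hash) is inferred from download URLs pasted elsewhere in this thread; treat it as an assumption, not a documented API:

```python
# Sketch: build the multimedia-commons S3 URL for a Personality-Captions
# image hash. Path layout is an assumption inferred from observed URLs.

def yfcc_image_url(image_hash: str) -> str:
    """Return the (assumed) public URL for an image, given its hash."""
    base = "https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images"
    return f"{base}/{image_hash[:3]}/{image_hash[3:6]}/{image_hash}.jpg"

print(yfcc_image_url("1e22a9cf867d718551386b427c3b6d18"))
```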

@Yukti-09

Yukti-09 commented Jul 9, 2019

Thank you!

I tried downloading the images via:

https://github.com/stefanbirkner/yfcc100m-downloader
but the image hashes are different.

I am not allowed to download images directly from:
https://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67&guccounter=1

@stephenroller
Contributor

https://github.com/stefanbirkner/yfcc100m-downloader/blob/06116b9d33b0b3382fb05329c5c5aebbbde2ef2a/download_files.py#L14

That code uses the same servers that Kurt's code uses. If that script works for you, then ParlAI's download script should work too, no?

@Yukti-09

Yukti-09 commented Jul 9, 2019

I do understand, but the same error persists. I have tried it on two different systems, Mac and Ubuntu.
I want to work on the dataset as soon as possible so that I can move forward with my project.
I have tried everything under the sun but just cannot figure out how to proceed.

@stephenroller
Contributor

stephenroller commented Jul 9, 2019

I have just tested the download script in ParlAI, and it works fine on the servers we use, and there have been multiple proposed solutions for you. Here's a sample of the output:

[ downloading: https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images/355/340/355340eca49a738bff24f6d51b64eede.jpg to /private/home/roller/working/parlai/data/yfcc_images/355340eca49a738bff24f6d51b64eede.jpg ] 208kB/s
  0%| | 83/201858 [00:05<3:50:44, 14.57img/s]
[ downloading: https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images/979/e6a/979e6a8bb69eb697f96dbb34c5094ad.jpg to /private/home/roller/working/parlai/data/yfcc_images/979e6a8bb69eb697f96dbb34c5094ad.jpg ] 266kB/s
  0%| | 84/201858 [00:06<4:08:32, 13.53img/s]
[ downloading: https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images/956/d8f/956d8f222b39b6b8d26154f30d79e47.jpg to /private/home/roller/working/parlai/data/yfcc_images/956d8f222b39b6b8d26154f30d79e47.jpg ] 279kB/s
  0%| | 85/201858 [00:06<4:23:46, 12.75img/s]
[ downloading: https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images/87a/c80/87ac8069fb19f5c58011815ebad6c2e.jpg to /private/home/roller/working/parlai/data/yfcc_images/87ac8069fb19f5c58011815ebad6c2e.jpg ] 167kB/s
  0%| | 86/201858 [00:07<4:38:08, 12.09img/s]
...

The YFCC website says something about needing an AWS account, but I don't know how that's actually enforced. If you've downloaded the YFCC dataset using that other script, then you have the full dataset; Personality-Captions uses a strict subset of it, so you have more than you need. Just putting those files into your parlai/data/yfcc_images/ folder should do the trick. I really don't know how to help you more. I suggest manually going to one of those URLs and making sure it loads.

@Yukti-09

raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:645)>

I do not know how to fix this issue.
Could the images be uploaded to GitHub?
Yes, the AWS account requirement is enforced; even after creating one, a request has to be submitted to download the dataset.

@stephenroller
Contributor

stephenroller commented Jul 10, 2019

Facebook does not own the images, so I cannot redistribute them in any manner. You are responsible for obtaining the images and requisite permission to use them.

Thank you for posting the error. Based on a quick Google search, there are Stack Overflow posts discussing either (1) outdated libraries, but those go back to 2016 and seem unlikely, or (2) issues around proxies.

It looks like urllib obeys the http_proxy environment variable, while requests obeys the HTTPS_PROXY and HTTP_PROXY environment variables. Do you need to set those up to work on your network? Perhaps you need to modify our download code to deal appropriately with your environment? Perhaps those ARE set in your environment and they shouldn't be? Please paste a copy of your env.
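To check which proxy variables are set, a minimal sketch (these are the standard variable names that urllib and requests consult; nothing here is ParlAI-specific):

```python
import os

# The proxy-related environment variables consulted by urllib/requests.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy")

def proxy_settings(env=None):
    """Return whichever proxy variables are set in the given environment."""
    env = os.environ if env is None else env
    return {v: env[v] for v in PROXY_VARS if v in env}

print(proxy_settings())  # an empty dict means no proxy is configured
```

If this prints values you don't expect, unsetting (or correctly setting) them is worth trying before modifying any download code.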

with requests.Session() as session:
    try:
        header = (
            {'Range': 'bytes=%d-' % resume_pos, 'Accept-Encoding': 'identity'}
            if resume
            else {}
        )
        response = session.get(url, stream=True, timeout=5, headers=header)
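One way to make a snippet like the above more tolerant of a slow connection is to raise the timeout and retry with backoff. This is a generic sketch, not ParlAI's actual code; the fetch callable and the exception types to catch are up to you:

```python
import time

def retry_on_timeout(fetch, retries=3, backoff=1.0, exc=(TimeoutError,)):
    """Call fetch(), retrying on timeout-like exceptions with linear backoff."""
    for attempt in range(retries):
        try:
            return fetch()
        except exc:
            if attempt == retries - 1:
                raise  # out of retries; re-raise the last timeout
            time.sleep(backoff * (attempt + 1))

# Usage sketch with requests (catch requests.exceptions.Timeout there):
# resp = retry_on_timeout(
#     lambda: session.get(url, stream=True, timeout=30, headers=header),
#     exc=(requests.exceptions.Timeout,),
# )
```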

@Yukti-09

On Mac, it is anaconda3
On ubuntu, it is venv

@emilydinan
Contributor

Hi @Yukti-09 that isn't quite enough information for us to debug. We've been unable to reproduce these errors on several different environments. I'd suggest looking to the internet to help you debug at this point.

@njucckevin

> Each image in the training set has one corresponding caption - there are roughly 186k image/caption/personality triples in the train set. The test set has 10k images, each with 5 captions based on one personality.

I have downloaded the personality_captions dataset successfully, but where can I find the 5 reference captions in val.json and test.json? These two files have many sentences for one "image_hash".
Thanks a lot!

@klshuster
Contributor

Hi there! If you're looking at the raw test.json file, you'll notice that it's a list of 10k dictionaries; the gold human reference caption can be found in the 'comment' key, and the 4 additional captions can be found in the 'additional_comments' key. The valid set only has 1 caption per image.

Hope that helps!
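The layout described above can be sketched in a few lines (the file path in the usage comment is illustrative; adjust to wherever ParlAI put your data):

```python
import json

def references(example):
    """Gold caption plus the 4 additional ones -> list of 5 reference strings."""
    return [example["comment"]] + example["additional_comments"]

# Usage sketch, assuming a default ParlAI data layout:
# data = json.load(open("data/personality_captions/test.json"))
# print(references(data[0]))
```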

@njucckevin

Thanks for the quick reply! I missed 'additional_comments' before; the data is fine now.

@nr596

nr596 commented May 15, 2021

> Each image in the training set has one corresponding caption - there are roughly 186k image/caption/personality triples in the train set. The test set has 10k images, each with 5 captions based on one personality.
>
> I have downloaded the personality_captions dataset successfully, but where can I find the 5 reference captions in val.json and test.json? These two files have many sentences for one "image_hash".
> Thanks a lot!

Hi Nick, can you please tell me how you downloaded the dataset?
