This repo contains instructions for running interactive search on the YFCC100M (Yahoo Flickr Creative Commons 100 Million) image dataset using the Eureka/OpenDiamond (paper) software stack, with AWS EC2 as the back-end.
We have provided:
- A public Amazon Machine Image (AMI) containing an installed Eureka back-end with pre-configured YFCC100M metadata.
- A VirtualBox image and a KVM image containing the pre-configured front-end GUI.
- Launching the Eureka Back-ends on AWS EC2
- Starting the Front-end GUI
- Built-in Predicates
- Security and Privacy Risk
Launching the Eureka Back-ends on AWS EC2
- Region: US West (Oregon), us-west-2
- AMI ID:
- Instance type (recommended):
- Public IP enabled
- Security group
  - Inbound: TCP 22, TCP 5872
  - Outbound: all
- Create a security group eureka-sg with inbound rules TCP 22 and TCP 5872, and outbound rules all.
- Create a launch template using the aforementioned AMI ID, security group, and recommended instance type.
- Use the launch template to create subsequent EC2 instances.
- You can create as many instances as you need.
- Wait for the launched instances to show "running" in Instance State and "2/2 checks passed" in Status Checks before starting the front-end GUI.
- Stop or terminate the EC2 instances when you are done.
- You can use a non-GPU instance type, but GPU filters (e.g., DNN image classification) will be unusable; other filters still work. We recommend instance types with >= 16 vCPUs and >= 64 GiB RAM.
- You should use US West (Oregon), us-west-2, because the YFCC100M S3 bucket is in the same region.
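
The security-group step above can also be scripted with the AWS CLI. The following is a minimal sketch, assuming a default VPC (so the group can be addressed by name) and already-configured credentials. The function name create_eureka_sg and the description string are illustrative and not part of this repo.

```shell
#!/usr/bin/env bash
# Sketch: create the eureka-sg security group with inbound TCP 22 and TCP 5872.
# Assumes a default VPC and configured AWS credentials.
# create_eureka_sg is a hypothetical helper name.

create_eureka_sg() {
    # Source CIDR allowed to connect; defaults to "anywhere".
    local cidr="${1:-0.0.0.0/0}"

    aws ec2 create-security-group \
        --group-name eureka-sg \
        --description "Eureka back-end: SSH and port 5872" || return 1

    # New security groups allow all outbound traffic by default,
    # so only the inbound rules need to be added.
    local port
    for port in 22 5872; do
        aws ec2 authorize-security-group-ingress \
            --group-name eureka-sg \
            --protocol tcp \
            --port "$port" \
            --cidr "$cidr" || return 1
    done
}
```

Calling create_eureka_sg with no argument opens both ports to 0.0.0.0/0; passing a CIDR such as your own address range narrows the rules from the start.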
Starting the Front-end GUI
You need an AWS Access Key ID and Secret Access Key pair. They look like the example values in the aws configure session below.
Whether you run the front-end in a VM or natively, you must configure your AWS credentials so that the scripts can obtain the public IPs of your launched EC2 instances.
    $ aws configure
    AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
    AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    Default region name [None]: us-west-2
    Default output format [None]: json
And, of course, you will pay your own AWS bill.
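
To confirm that the configured credentials work, you can list the public IPs of your running instances by hand. This is a minimal sketch of such a lookup and not necessarily what start-search.sh does internally. The function name list_backend_ips is hypothetical, and the filter matches every running instance in the configured region, not only Eureka back-ends.

```shell
#!/usr/bin/env bash
# Sketch: print the public IP of each running EC2 instance, one per line.
# list_backend_ips is a hypothetical helper name.

list_backend_ips() {
    # --output text prints the addresses separated by whitespace;
    # tr puts each address on its own line.
    aws ec2 describe-instances \
        --filters "Name=instance-state-name,Values=running" \
        --query "Reservations[].Instances[].PublicIpAddress" \
        --output text | tr -s '[:space:]' '\n'
}
```

An empty result usually means no instance is running yet, or the default region is not us-west-2.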
Option 1: Use a VM (VirtualBox or KVM)
Login: ubuntu / Password: ubuntu
    # Configure AWS credentials as shown above
    cd /home/ubuntu/hyperfind/eureka-yfcc100m
    ./start-search.sh
*Tested host: macOS 10.13.6 + VirtualBox 6.0; Ubuntu 18.04 + VirtualBox 6.0
Option 2: Native Installation
This is basically how the VM image is created.
- Install OpenDiamond. At a minimum, the cookiecutter executable from OpenDiamond must be working.
- Download and compile HyperFind. This is the front-end GUI.
- Install the AWS Command Line Interface
- Configure AWS credentials as mentioned above.
- Clone this repo inside the HyperFind directory:
    cd /path/to/hyperfind/dir/
    git clone https://github.com/fzqneo/eureka-yfcc100m.git
As a result, the directory structure looks like:
    /path/to/hyperfind/dir/
    |-- build.xml
    |-- bin/
    |-- edu/
        |-- cmu/
        |-- ...
    |-- hyperfind.jar
    |-- eureka-yfcc100m/      <------ this repo
        |-- README.md         <------ this file
        |-- ...
        |-- start-search.sh
        |-- ...
- Start the front-end GUI after you launch the EC2 back-ends:
    cd /path/to/hyperfind/dir/eureka-yfcc100m
    ./start-search.sh
Security and Privacy Risk
The pre-configured Eureka back-end in the AMI has ScopeCookie verification turned off. This means anyone who knows the IP addresses of your EC2 instances can connect to them with the GUI and run searches on them. Since YFCC100M is a public dataset, the privacy risk should be minimal. To further reduce the risk, you can:
- Stop or Terminate your EC2 instances as soon as you are done with your search.
- Configure your inbound rules to only accept connections from your IP address/range.
- Turn on ScopeCookie verification. This requires setting up a private key on the front-end and a certificate on the back-end. Contact me for instructions.
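
The inbound-rule restriction can be sketched with the AWS CLI as follows, assuming the security group is named eureka-sg, lives in a default VPC, and currently allows port 5872 from 0.0.0.0/0. The function name restrict_eureka_port is hypothetical, and the address you pass should be your own public IP.

```shell
#!/usr/bin/env bash
# Sketch: limit inbound TCP 5872 on eureka-sg to a single address.
# restrict_eureka_port is a hypothetical helper name.

restrict_eureka_port() {
    local my_ip="$1"
    if [ -z "$my_ip" ]; then
        echo "usage: restrict_eureka_port <your-public-ip>" >&2
        return 2
    fi

    # Remove the rule that lets anyone reach port 5872 ...
    aws ec2 revoke-security-group-ingress \
        --group-name eureka-sg --protocol tcp --port 5872 \
        --cidr 0.0.0.0/0 || return 1

    # ... and allow only the given address (/32 = a single host).
    aws ec2 authorize-security-group-ingress \
        --group-name eureka-sg --protocol tcp --port 5872 \
        --cidr "${my_ip}/32"
}
```

The same two calls with --port 22 would restrict SSH as well. If your address changes, the rule has to be updated before the GUI can connect again.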
The progress hangs, not moving forward.
There are several cases when this can happen:
- The first search session after the VMs start. The system may still be starting up, or the Redis cache may still be loading from disk.
- The first time you use a GPU-involving filter. It can take a long time to activate the GPU on EC2 on its first use.
- You use just-in-time (JIT) machine learning filters that train an ML model before filtering images. Depending on the algorithm and the training set size, the JIT training time can be considerable.
The GUI errors with
Make sure you have opened the necessary port (5872) on the EC2 instances.
Wait for the VMs to be in the "running" Instance State and "2/2 checks passed" in Status Checks.
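
Both checks can be run from a terminal. The sketch below assumes bash (for its /dev/tcp pseudo-device) and the coreutils timeout command; the function names port_open and wait_for_backends are hypothetical. The aws ec2 wait instance-status-ok command blocks until the given instances pass their status checks.

```shell
#!/usr/bin/env bash
# Sketch: connectivity and readiness checks for the Eureka back-ends.
# port_open and wait_for_backends are hypothetical helper names.

# Return 0 if a TCP connection to host:port succeeds within 5 seconds.
# The port defaults to 5872, the port the front-end GUI connects to.
port_open() {
    local host="$1" port="${2:-5872}"
    timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Block until the given instance IDs report passing status checks.
wait_for_backends() {
    aws ec2 wait instance-status-ok --instance-ids "$@"
}
```

For example, port_open followed by one of your back-end IPs returns a non-zero status when the security group blocks port 5872 or the instance is not yet accepting connections.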
I can't create GPU instances on EC2.
By default, AWS may allow users to create only 0 or 1 GPU instance. You may need to ask Amazon to increase your limit.
Ziqiang Feng (Carnegie Mellon University)
zf at cs dot cmu dot edu