This repository has been archived by the owner on Apr 20, 2020. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 62
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
1f9b956
commit b66df22
Showing
3 changed files
with
38 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
Code for the pipeline used to setup and run the Twitter Image Similarity search demo. | ||
|
||
Please see the top-level readme and the readmes in `pipeline` and `webapp` for | ||
more details. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
|
||
Workers and scripts for the Twitter Image Similarity search demo. | ||
|
||
## Overview | ||
|
||
Below is a terse overview of the functionality for each program in the pipeline. | ||
See individual programs for more detail. | ||
|
||
1. `ingest_twitter_images.py` ingests tweets from Twitter's streaming API and | ||
saves posted images locally and on S3. This ingests between 500K and 700K images | ||
per day. `twitter-credentials.template.json` should be updated with your Twitter | ||
API credentials to run this program. | ||
2. `stream_produce_image_pointers.py` produces pointers to images to a Kafka topic. | ||
A pointer is simply the S3 bucket and key where the image file is stored. | ||
3. `stream_compute_image_features.py` consumes images pointers and computes | ||
a floating-point feature vector for each image. It stores the feature vectors | ||
on S3 and publishes a pointer to the features to a Kafka topic. This program was | ||
designed such that many instances can be run in parallel to speed up computation. | ||
As long as they are all in the same Kafka consumer group, each one will get | ||
independent chunks of the processing load. | ||
4. `batch_es_aknn_create.py` creates an LSH model in Elasticsearch via the Elasticsearch-Aknn plugin. | ||
5. `batch_es_aknn_index.py` indexes feature vectors in Elasticsearch via the | ||
Elasticsearch-Aknn plugin. | ||
|
||
## Usage | ||
|
||
Install dependencies: `pip3 install -r requirements.txt` | ||
|
||
All Python programs implement an argparse CLI, so you can run `python <name>.py --help` to see the exact parameters. | ||
|
||
Most of the programs require a Kafka cluster and an Elasticsearch cluster. | ||
Instructions to set them up is beyond the scope of this brief documentation, | ||
however the `ec2_es_setup.sh` script should be helpful for Elasticsearch. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters