TMK (for Temporal Match Kernel) is a video-similarity-detection algorithm produced in conjunction with Facebook AI Research ('FAIR').
Full details are located in the hashing.pdf document.
Researchers in video copy-detection algorithms may also be interested in https://github.com/facebookresearch/videoalignment for a research version of TMK, with different parameter-weighting than presented here.
Cross-company sharing involves two workloads. For exports, one needs to compute hashes for a relatively small list of banked videos, then publish those data (over ThreatExchange, for example) -- this is a small-scale endeavor. For imports, one needs to hash all to-be-checked videos and compare them against a hash-list -- this is a large-scale endeavor. At present (2019-01-18) this code is single-machine-only: nothing is connected up to databases or RPC yet.
Integration with FAISS has been successfully prototyped -- we will post information about this soon.
- Make sure you can hash the exact same sample videos (in this repo -- see below) and get the same hashes (within roundoff error)
- Let us know about file-format issues: at FB the upload-checker runs after videos in various formats have been transcoded to MP4, so we have no data on TMK with AVI, MOV, etc. Also note that the floating-point numbers within the .tmk files are all stored little-endian.
- See if you can find matching content on your site, using initial match-threshold numbers.
- Please feel free to contact us with feedback on the code, this documentation, or any other suggestions for additional rollout of this algorithm. And/or simply put up a pull request if you prefer.
- In particular, we already know that several companies preferred a Java port of PDQ; we suspect the same will hold true for TMK.
- For each video file, the hasher channels the output of ffmpeg (which you need to separately install) to produce a
- You can think of the feature-vectors files as opaque binary files if you like. However, please do see ./tmk/tools/tmkdump.py, which shows how to display the binary .tmk files in human-readable format.
- The .tmk files contain:
  - Metadata about how the hash was computed (so we can avoid mixing hashes produced by different, incompatible software).
  - A pure-average feature. Think of a 'frame feature' as a floating-point hash of a single video frame. Then the pure-average feature is the slotwise average over all the frames of the video.
  - Cosine and sine features. These are nothing more than weighted averages of frame features, with various periods and various Fourier-coefficient weights.
- We can compute a level-1 score for a pair of videos using cosine similarity of their pure-average features. This is quick to compute.
- If the level-1 score is less than a threshold, the videos don't match. If it's above, then we compute a level-2 score using all the cosine/sine features. (Details are in tmkfv.cpp.) If this level-2 score is over a second threshold, then the videos do match. (If you want to compute the full TMK level-2 pair-score between an arbitrary pair of videos, then set the level-1 threshold to -1.)
- Possible values for level-1 score range from -1 to 1 (1 being perfect match); possible values for level-2 score range from 0 to 1 (1 being perfect match).
- The suggested value for each of those thresholds is 0.7.
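The two-level decision described above can be sketched in a few lines of Python. This is only an illustration, not the code in tmkfv.cpp: the function names are made up for this example, the real level-2 computation is not reproduced (a caller-supplied function stands in for it), and treating a score exactly equal to a threshold as passing is an assumption.

```python
import math

def pure_average(frame_features):
    """Slotwise average of per-frame feature vectors (equal-length lists of floats)."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[i] for f in frame_features) / n for i in range(dim)]

def cosine_similarity(u, v):
    """Cosine similarity in [-1, 1]; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def two_level_match(avg_a, avg_b, level2_score, c1=0.7, c2=0.7):
    """Return (is_match, level1, level2); level2 is None if level 1 rejected the pair.

    level2_score is a hypothetical zero-argument callable standing in for the
    real (more expensive) level-2 computation, so it only runs when needed.
    """
    level1 = cosine_similarity(avg_a, avg_b)
    if level1 < c1:
        return False, level1, None
    level2 = level2_score()
    # Assumption: a score exactly at the threshold counts as a match.
    return level2 >= c2, level1, level2

# Two toy "videos" of two frames each, with 2-slot frame features.
avg_a = pure_average([[1.0, 0.0], [0.0, 1.0]])
avg_b = pure_average([[0.0, 1.0], [1.0, 0.0]])
result = two_level_match(avg_a, avg_b, lambda: 0.9)
```

Passing c1=-1.0 makes the level-1 check always pass, which mirrors the advice above for computing the level-2 score of an arbitrary pair.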
We need to have ffmpeg (https://www.ffmpeg.org/) downloaded and working.
The sample hashes were generated with ffmpeg version 4.1, so if you have any issues with make regtest, check your ffmpeg version.
First let's compile the code and compute hashes of some sample videos, and make sure we're all getting compatible results. Namely, I've got some sample videos in this repository which I've hashed on my platform. If you hash those exact same video files on your platform, you should get nearly-identical hashes.
Note: these few sample videos I took myself on my phone; there are no data-privacy issues.
Compile the code
There are no external source-code dependencies; you just need a C++14 compiler. Also you'll need the
ffmpeg executable -- for example, located in
    $ cd cpp
    $ make
- We have a simple Makefile in order to reduce the number of dependencies, so people can get up and running quicker.
- FB-internally we use buck; please let us know if you want Buck
- This simple Makefile build does not execute any tests, as buck test would. Hence the importance of computing hashes for sample videos (see next step).
Compute hashes of sample videos and compare to previous outputs
You can simply do make regtest. Or you may prefer to do the following.
    $ for v in ../sample-videos/*.mp4; do
        ./tmk-hash-video -f /usr/local/bin/ffmpeg -i $v -d .
      done

    $ ls *.tmk
    chair-19-sd-bar.tmk               chair-22-with-small-logo-bar.tmk  pattern-hd-no-bar.tmk
    chair-20-sd-bar.tmk               chair-orig-22-fhd-no-bar.tmk      pattern-longer-no-bar.tmk
    chair-22-sd-grey-bar.tmk          chair-orig-22-hd-no-bar.tmk       pattern-sd-grey-bar.tmk
    chair-22-sd-sepia-bar.tmk         chair-orig-22-sd-bar.tmk          pattern-sd-with-large-logo-bar.tmk
    chair-22-with-large-logo-bar.tmk  doorknob-hd-no-bar.tmk            pattern-sd-with-small-logo-bar.tmk

    $ for h in *.tmk; do
        ./tmk-compare-two-tmks $h ../sample-hashes/$h
      done
If there are mismatches, please stop here -- we need to debug.
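To make "the same hashes (within roundoff error)" concrete, here is a minimal Python sketch of a tolerance-based comparison of two feature vectors. It is not what tmk-compare-two-tmks does internally; the function name and the tolerance values are assumptions chosen only for illustration.

```python
import math

def vectors_nearly_equal(u, v, rel_tol=1e-4, abs_tol=1e-6):
    """True if both vectors have the same length and every slot agrees within tolerance.

    rel_tol and abs_tol are illustrative defaults, not values taken from the TMK code.
    """
    if len(u) != len(v):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(u, v)
    )

# A tiny difference in the last digits is accepted; a real difference is not.
same = vectors_nearly_equal([1.0, 2.0], [1.00001, 2.0])
different = vectors_nearly_equal([1.0, 2.0], [1.1, 2.0])
```

An exact byte-for-byte comparison would be too strict here, since different platforms and compilers can round floating-point arithmetic slightly differently.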
Look at all pair-scores
There are n*(n-1)/2 possible pairs for n hashes, so this is a bit verbose
-- but it gives the idea of what pairwise scores look like.
The sample videos in the repo are few, and short. The variants are discussed
    $ ./tmk-two-level-score --c1 -1.0 --c2 0.0 *.tmk | sort -n
    -0.310042 0.017566 doorknob-hd-no-bar.tmk pattern-sd-with-large-logo-bar.tmk
    -0.088088 0.033167 chair-19-sd-bar.tmk pattern-sd-with-large-logo-bar.tmk
    -0.083061 0.033179 chair-20-sd-bar.tmk pattern-sd-with-large-logo-bar.tmk
    -0.077388 0.026434 chair-22-with-small-logo-bar.tmk pattern-sd-with-small-logo-bar.tmk
    -0.074581 0.021536 chair-22-with-small-logo-bar.tmk pattern-sd-with-large-logo-bar.tmk
    -0.071327 0.021766 chair-22-sd-grey-bar.tmk pattern-sd-with-large-logo-bar.tmk
    -0.070220 0.022146 chair-orig-22-sd-bar.tmk pattern-sd-with-large-logo-bar.tmk
    -0.069683 0.019660 chair-22-sd-sepia-bar.tmk pattern-sd-with-large-logo-bar.tmk
    -0.067127 0.025258 chair-22-sd-sepia-bar.tmk pattern-sd-with-small-logo-bar.tmk
    -0.066280 0.023696 chair-orig-22-hd-no-bar.tmk pattern-sd-with-small-logo-bar.tmk
    -0.066129 0.028783 chair-20-sd-bar.tmk pattern-sd-with-small-logo-bar.tmk
    -0.064803 0.023406 chair-orig-22-fhd-no-bar.tmk pattern-sd-with-small-logo-bar.tmk
    -0.063257 0.025060 chair-22-sd-grey-bar.tmk pattern-sd-with-small-logo-bar.tmk
    -0.063251 0.019399 chair-orig-22-hd-no-bar.tmk pattern-sd-grey-bar.tmk
    -0.061950 0.019116 chair-orig-22-fhd-no-bar.tmk pattern-sd-grey-bar.tmk
    -0.061946 0.025864 chair-19-sd-bar.tmk pattern-sd-with-small-logo-bar.tmk
    ...
    0.919339 0.921573 chair-20-sd-bar.tmk chair-22-with-small-logo-bar.tmk
    0.926403 0.953071 chair-22-with-small-logo-bar.tmk chair-orig-22-sd-bar.tmk
    0.926929 0.953268 chair-22-sd-grey-bar.tmk chair-22-with-small-logo-bar.tmk
    0.927289 0.953265 chair-22-sd-sepia-bar.tmk chair-22-with-small-logo-bar.tmk
    0.952329 0.961902 chair-19-sd-bar.tmk chair-orig-22-sd-bar.tmk
    0.952936 0.962041 chair-19-sd-bar.tmk chair-22-sd-sepia-bar.tmk
    0.953346 0.962158 chair-19-sd-bar.tmk chair-22-sd-grey-bar.tmk
    0.981078 0.954990 pattern-hd-no-bar.tmk pattern-longer-no-bar.tmk
    0.985408 0.988313 chair-20-sd-bar.tmk chair-orig-22-sd-bar.tmk
    0.985554 0.988220 chair-20-sd-bar.tmk chair-22-sd-sepia-bar.tmk
    0.985930 0.988437 chair-20-sd-bar.tmk chair-22-sd-grey-bar.tmk
    0.989083 0.991138 chair-19-sd-bar.tmk chair-20-sd-bar.tmk
    0.999750 0.999720 chair-22-sd-sepia-bar.tmk chair-orig-22-sd-bar.tmk
    0.999779 0.999763 chair-22-sd-grey-bar.tmk chair-22-sd-sepia-bar.tmk
    0.999883 0.999877 chair-orig-22-fhd-no-bar.tmk chair-orig-22-hd-no-bar.tmk
    0.999963 0.999964 chair-22-sd-grey-bar.tmk chair-orig-22-sd-bar.tmk
Look for duplicates within sample videos
Here we see, as expected and in a different way, the variants being matched up
-- as discussed in more detail in
    $ tmk-clusterize --c1 0.7 --c2 0.7 -s *.tmk
    clidx=1,clusz=8,filename=chair-19-sd-bar.tmk
    clidx=1,clusz=8,filename=chair-20-sd-bar.tmk
    clidx=1,clusz=8,filename=chair-22-sd-grey-bar.tmk
    clidx=1,clusz=8,filename=chair-22-sd-sepia-bar.tmk
    clidx=1,clusz=8,filename=chair-22-with-small-logo-bar.tmk
    clidx=1,clusz=8,filename=chair-orig-22-fhd-no-bar.tmk
    clidx=1,clusz=8,filename=chair-orig-22-hd-no-bar.tmk
    clidx=1,clusz=8,filename=chair-orig-22-sd-bar.tmk
    clidx=2,clusz=1,filename=chair-22-with-large-logo-bar.tmk
    clidx=3,clusz=1,filename=doorknob-hd-no-bar.tmk
    clidx=4,clusz=4,filename=pattern-hd-no-bar.tmk
    clidx=4,clusz=4,filename=pattern-longer-no-bar.tmk
    clidx=4,clusz=4,filename=pattern-sd-grey-bar.tmk
    clidx=4,clusz=4,filename=pattern-sd-with-small-logo-bar.tmk
    clidx=5,clusz=1,filename=pattern-sd-with-large-logo-bar.tmk
Using a more conservative tolerance:
    $ tmk-clusterize --c1 0.95 --c2 0.95 -s *.tmk
    clidx=1,clusz=5,filename=chair-19-sd-bar.tmk
    clidx=1,clusz=5,filename=chair-20-sd-bar.tmk
    clidx=1,clusz=5,filename=chair-22-sd-grey-bar.tmk
    clidx=1,clusz=5,filename=chair-22-sd-sepia-bar.tmk
    clidx=1,clusz=5,filename=chair-orig-22-sd-bar.tmk
    clidx=2,clusz=1,filename=chair-22-with-large-logo-bar.tmk
    clidx=3,clusz=1,filename=chair-22-with-small-logo-bar.tmk
    clidx=4,clusz=2,filename=chair-orig-22-fhd-no-bar.tmk
    clidx=4,clusz=2,filename=chair-orig-22-hd-no-bar.tmk
    clidx=5,clusz=1,filename=doorknob-hd-no-bar.tmk
    clidx=6,clusz=2,filename=pattern-hd-no-bar.tmk
    clidx=6,clusz=2,filename=pattern-longer-no-bar.tmk
    clidx=7,clusz=1,filename=pattern-sd-grey-bar.tmk
    clidx=8,clusz=1,filename=pattern-sd-with-large-logo-bar.tmk
    clidx=9,clusz=1,filename=pattern-sd-with-small-logo-bar.tmk
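To give a feel for how pair scores and thresholds turn into clusters, here is a small Python sketch that links every pair passing both thresholds and groups the results with union-find. This is one simple way to do it and is not necessarily how tmk-clusterize works; the function name, the transitive linking, and the use of >= at the thresholds are all assumptions. The three sample scores are copied from the pair-score listing earlier in this walkthrough.

```python
def cluster_by_threshold(names, pair_scores, c1, c2):
    """Group names into clusters by linking each pair whose scores pass both thresholds.

    pair_scores maps (name_a, name_b) -> (level1, level2). Linking is transitive
    (union-find), which is an assumption of this sketch.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent pointers to the cluster representative, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), (level1, level2) in pair_scores.items():
        if level1 >= c1 and level2 >= c2:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())

names = [
    "chair-19-sd-bar", "chair-20-sd-bar", "doorknob-hd-no-bar",
    "pattern-hd-no-bar", "pattern-longer-no-bar", "pattern-sd-with-large-logo-bar",
]
pair_scores = {
    ("chair-19-sd-bar", "chair-20-sd-bar"): (0.989083, 0.991138),
    ("pattern-hd-no-bar", "pattern-longer-no-bar"): (0.981078, 0.954990),
    ("doorknob-hd-no-bar", "pattern-sd-with-large-logo-bar"): (-0.310042, 0.017566),
}
clusters = cluster_by_threshold(names, pair_scores, c1=0.95, c2=0.95)
```

With these three pairs and thresholds of 0.95, the two chair variants end up together, the two pattern variants end up together, and the remaining two files are singletons.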
Look for matches between shared hashes and your own hashes
OK, now that we've sanity-checked a few sample videos from the repo, let's broaden our scope. Here we can look at shared hashes -- no videos, only hashes -- which are delivered separately, depending on with whom you're sharing hashes. (See https://github.com/facebook/ThreatExchange/blob/master/hashing/te-tag-query-java/README.md for an example.) Then you can compute hashes of some of your own videos and look to see what (if anything) is in common.
Get shared hashes
Put hash-file names into a text file:
$ find /path/to/hashes -name '*.tmk' > needles.txt
Compute local hashes
This step depends on what you have locally. Here we compute hashes for all the videos, then put the hash-file names into a text file.
    $ mkdir /path/to/your/hashes
    $ find /path/to/your/videos -name '*.mp4' \
      | xargs -L 1 tmk-hash-video -f /usr/local/bin/ffmpeg -d /path/to/your/hashes -i
    $ find /path/to/your/hashes -name '*.tmk' > haystack.txt
Now we look for matches between the shared-hash "needles" and your local "haystack". Your results will vary, of course; here is the output using some of our own test data.
    $ ls *.tmk > haystack.txt
    $ echo chair-orig-22-hd-no-bar.tmk > needles.txt

    $ wc -l needles.txt haystack.txt
           1 needles.txt
          15 haystack.txt
          16 total

    $ tmk-query --c1 0.7 --c2 0.7 needles.txt haystack.txt | sort -n
    0.719880 0.737563 chair-orig-22-hd-no-bar.tmk chair-orig-22-sd-bar.tmk
    0.719910 0.736369 chair-orig-22-hd-no-bar.tmk chair-22-sd-sepia-bar.tmk
    0.720115 0.737112 chair-orig-22-hd-no-bar.tmk chair-22-sd-grey-bar.tmk
    0.727833 0.733890 chair-orig-22-hd-no-bar.tmk chair-19-sd-bar.tmk
    0.732827 0.742627 chair-orig-22-hd-no-bar.tmk chair-20-sd-bar.tmk
    0.999883 0.999877 chair-orig-22-hd-no-bar.tmk chair-orig-22-fhd-no-bar.tmk
    1.000000 1.000000 chair-orig-22-hd-no-bar.tmk chair-orig-22-hd-no-bar.tmk
Look for duplicates within your own hashes
Use --min 2 to only show similarity clusters of size greater than 1.
    $ find /path/to/your/hashes -name '*.tmk' | tmk-clusterize -s --min 2 -i
    ...
Look for duplicates within the shared hashes
$ find /path/to/shared/hashes -name '*.tmk' | tmk-clusterize --min 2 -i
This concludes the walkthrough. Please see README-more.md for more detailed reference information.
Some scoring data
For n videos there are n*(n-1)/2 pairs. For all 2,679 video hashes
within a particular dataset used for this section, that's over 3 million
pair-scores to compute. So we took a random sample of about 400, resulting in
about 85,000 possible distinct pairs. Here we're plotting histograms of the
level-1 scores, the level-2 scores, and the joint density.
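The pair counts above are easy to check. The following Python sketch computes n*(n-1)/2 and shows one way to draw a random sample of hashes and enumerate its pairs; the helper names, the seed, and the made-up file names are illustrative only.

```python
import itertools
import random

def pair_count(n):
    """Number of distinct unordered pairs among n items: n*(n-1)/2."""
    return n * (n - 1) // 2

def sampled_pairs(names, sample_size, seed=0):
    """Pick a random subset of names and return every unordered pair within it."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    subset = rng.sample(names, sample_size)
    return list(itertools.combinations(subset, 2))

print(pair_count(15))    # 105 pairs for the 15 sample hashes in this repo
print(pair_count(2679))  # 3587181, i.e. "over 3 million"
print(pair_count(413))   # 85078, i.e. "about 85,000"

# Hypothetical file names, just to exercise the sampler.
pairs = sampled_pairs(["video-%d.tmk" % i for i in range(100)], 10)
```

Note that a sample of exactly 400 gives 79,800 pairs, so the "about 85,000" figure corresponds to a sample slightly larger than 400.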
Remember from above we use the level-2 score to decide if two videos match or not. There should be a gap between the matches and the non-matches. For this sample dataset, we can see in the middle plot that there clearly is.
Since level-1 and level-2 scores of 1 are best, we can see dots for matching videos in the upper-right-hand corner of the right-hand plot. The big blue blob in that plot is the distribution of the pair-scores of unrelated videos.
Also remember from above that the level-2 score is a bit expensive to compute so we use the level-1 score to see if we should bother. It would be bad if there were pairs of videos with high level-2 score but low level-1 score -- those would be false negatives. For this plot, we're computing the level-2 score regardless of the level-1 score and we can see from the right-hand plot that for this sample dataset we aren't 'missing' any matches this way.
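The false-negative check described above can be written down directly: given (level-1, level-2) scores computed for every pair regardless of the level-1 threshold, look for pairs that would have matched at level 2 but been filtered out at level 1. This Python sketch uses a hypothetical function name and the suggested 0.7 thresholds; the first three score pairs are copied from the sample output earlier in this document, and the last one is made up to show what a miss would look like.

```python
def missed_by_level1(pair_scores, c1=0.7, c2=0.7):
    """Return the (level1, level2) pairs that pass level 2 but fail the level-1 filter."""
    return [(l1, l2) for (l1, l2) in pair_scores if l2 >= c2 and l1 < c1]

# Scores taken from the sample-video output above: no misses expected.
observed = [
    (0.919339, 0.921573),
    (0.981078, 0.954990),
    (-0.310042, 0.017566),
]
misses = missed_by_level1(observed)

# A made-up pair with low level-1 but high level-2 score would be a false negative.
misses_with_fake = missed_by_level1(observed + [(0.5, 0.9)])
```

An empty result on a representative dataset is what justifies using the cheaper level-1 score as a pre-filter.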