# Find Face Matches with Rekognition

This notebook is part of a demo designed to illustrate how to use the [face matching features](https://docs.aws.amazon.com/rekognition/latest/dg/collections.html) of [Amazon Rekognition](https://aws.amazon.com/rekognition/) to identify matching faces in a collection of photos.
***

### Architecture
![Demo Architecture](figures/FaceDuplicates-Page-3.png)

***
### Dataset

For this demo we use the [Labeled Faces in the Wild](http://vis-www.cs.umass.edu/lfw/) dataset \[1\]. From the project website:
> \[...\] a database of face photographs designed for studying the problem of unconstrained face recognition. The data set contains more than 13,000 images of faces collected from the web. Each face has been labeled with the name of the person pictured. 1680 of the people pictured have two or more distinct photos in the data set.

\[1\] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller.
Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments.
University of Massachusetts, Amherst, Technical Report 07-49, October, 2007.

### Prepare the environment
To ensure all the necessary Python libraries are installed, install the packages listed in `requirements.txt` in the same environment as the kernel of this notebook.

```terminal
~$ pip install -r requirements.txt
```

In [None]:
import json
import shutil
from pathlib import Path

import awswrangler as wr
import boto3
import s3fs

from utils import count_hits, inspect_matches

## Prepare the dataset

We download the compressed archive containing all the images in the dataset from the project website, and unpack it to the `data` folder.

In [None]:
images_source = "http://vis-www.cs.umass.edu/lfw/lfw.tgz"
!wget -nc {images_source}

In [None]:
images_path_local = Path("data/lfw/")
if not images_path_local.exists():
    shutil.unpack_archive("lfw.tgz", "data")

The photos are organized by person, one folder per name, and some people have multiple photos.

In [None]:
images_path_local = Path("data/lfw/")

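# Uncomment to regenerate the listing (requires the `tree` command):
# !tree {images_path_local}  > data_tree.txt
with open("data_tree.txt") as f:
    # Print only the first 15 lines of the folder listing
    preview = "".join([f.readline() for _ in range(15)] + ["..."])
    print(preview)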
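
As a complement to the `tree` listing, here is a minimal sketch that counts the photos per person with `pathlib`. It assumes the layout `data/lfw/<person_name>/<image>.jpg` produced by unpacking the archive; `photos_per_person` is a hypothetical helper and is not part of `utils`.

```python
from collections import Counter
from pathlib import Path


def photos_per_person(root: Path) -> Counter:
    """Count the .jpg files found in each person sub-folder of `root`."""
    return Counter(p.parent.name for p in root.glob("*/*.jpg"))


# Example usage (assumes the archive was unpacked to data/lfw):
# counts = photos_per_person(Path("data/lfw"))
# print(counts.most_common(5))
# print(sum(1 for n in counts.values() if n > 1), "people have more than one photo")
```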

## Add faces to the Rekognition collection

We can add faces to the Rekognition collection by PUTting image files in the images S3 bucket created by the demo template.
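
The processing that runs on each upload is not shown in this notebook. As a rough sketch only, a handler of this kind could search the collection for similar faces and then index the new image using the Rekognition `search_faces_by_image` and `index_faces` calls. The function name `search_then_index`, the default threshold of 80, and the use of the file name as `ExternalImageId` are assumptions for illustration, not the template's actual code.

```python
def search_then_index(rekognition_client, collection_id, bucket, key, threshold=80.0):
    """Search the collection for faces similar to s3://bucket/key, then index it.

    Returns a list of (external_image_id, similarity) tuples for the matches.
    """
    image = {"S3Object": {"Bucket": bucket, "Name": key}}

    # Look for faces already in the collection that resemble the new image.
    search = rekognition_client.search_faces_by_image(
        CollectionId=collection_id,
        Image=image,
        FaceMatchThreshold=threshold,
        MaxFaces=10,
    )
    matches = [
        (m["Face"].get("ExternalImageId"), m["Similarity"])
        for m in search["FaceMatches"]
    ]

    # Add the new face so that later uploads can be matched against it.
    rekognition_client.index_faces(
        CollectionId=collection_id,
        Image=image,
        ExternalImageId=key.rsplit("/", 1)[-1],  # assumption: file name as label
        MaxFaces=1,
    )
    return matches
```

Searching before indexing avoids the new image matching itself.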

In [None]:
ssm = boto3.client("ssm")
rekognition = boto3.client("rekognition")

s3 = s3fs.S3FileSystem()

The demo template stores the S3 bucket names and the name of the pre-created Rekognition collection in AWS Systems Manager Parameter Store.

In [None]:
stack_name = "RekognitionBatchDetect"


def get_stack_parameter(name):
    """Read one of the demo stack's values from Parameter Store."""
    return ssm.get_parameter(Name=f"/{stack_name}/{name}")["Parameter"]["Value"]


images_bucket_name = get_stack_parameter("ImageBucket")
output_bucket_name = get_stack_parameter("OutBucket")
collection_id = get_stack_parameter("CollectionId")

print(
    f"Collection ID: {collection_id}\nImages Bucket: {images_bucket_name}\nOutput Bucket: {output_bucket_name}"
)

In [None]:
n_faces_collection = len(rekognition.list_faces(CollectionId=collection_id)["Faces"])
print(f"There are currently {n_faces_collection} faces in the {collection_id} collection")
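
Note that `list_faces` returns its results in pages, so taking `len()` of a single response can undercount a large collection. The sketch below shows two alternatives: walking every page with the boto3 paginator, or reading `FaceCount` from `describe_collection`. The helper names are hypothetical; the client is passed in so the functions can be used with the `rekognition` client created above.

```python
def count_faces(rekognition_client, collection_id):
    """Count faces in a collection by walking every page of ListFaces."""
    paginator = rekognition_client.get_paginator("list_faces")
    return sum(
        len(page["Faces"]) for page in paginator.paginate(CollectionId=collection_id)
    )


def count_faces_from_metadata(rekognition_client, collection_id):
    """Read the face count from the collection metadata (DescribeCollection)."""
    response = rekognition_client.describe_collection(CollectionId=collection_id)
    return response["FaceCount"]


# Example usage with the objects defined in the cells above:
# print(count_faces(rekognition, collection_id))
# print(count_faces_from_metadata(rekognition, collection_id))
```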

### Upload files
To observe how the collection, the matching mechanism, and the notifications behave, we upload the images in batches, selected by the initial letter of the person's name.

This step can be repeated (possibly changing the letter at every iteration) to observe and validate the operation of the Lambda function and the notification settings.


In [None]:
name_initial = "C"  # replace with any capital letter of the English alphabet
# Count the image files inside every person folder starting with the chosen letter
n_upload = len(list(images_path_local.glob(f"{name_initial}*/*.jpg")))

print(f"We are going to upload {n_upload} images to the S3 bucket {images_bucket_name}")

In [None]:
response = s3.put(
    f"data/lfw/{name_initial}*", f"s3://{images_bucket_name}/images/", recursive=True
)

In [None]:
n_faces_collection = len(rekognition.list_faces(CollectionId=collection_id)["Faces"])
print(f"There are currently {n_faces_collection} faces in the {collection_id} collection")
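
If the faces are indexed asynchronously after the upload (an assumption about how the template's Lambda function is triggered), the count above may still be changing when the cell runs. A minimal polling sketch follows; `wait_for_stable_count` and its default intervals are hypothetical and can be adapted as needed.

```python
import time


def wait_for_stable_count(get_count, interval=5.0, stable_checks=3, timeout=300.0):
    """Poll `get_count()` until it returns the same value `stable_checks` times in a row."""
    deadline = time.monotonic() + timeout
    last, streak = get_count(), 1
    while streak < stable_checks:
        if time.monotonic() > deadline:
            raise TimeoutError("face count did not stabilize in time")
        time.sleep(interval)
        current = get_count()
        # Extend the streak on a repeated value, otherwise start over.
        streak = streak + 1 if current == last else 1
        last = current
    return last


# Example usage with the objects defined in the cells above:
# wait_for_stable_count(
#     lambda: rekognition.describe_collection(CollectionId=collection_id)["FaceCount"]
# )
```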

## Check results

To check the results we need to parse the reports uploaded to the `output_bucket` as `json` files. To make the analysis more accessible, we will use the AWS Glue catalog and the Athena table and view created by the template.

For this demonstration, we read the tables into pandas DataFrames for ease of manipulation. This approach works for datasets of a few thousand records; for larger datasets, an Amazon QuickSight dashboard is a more robust and scalable option.

In [None]:
db_name = "face_match_output_db"

In [None]:
message = "Glue catalog ready!"
if db_name not in wr.catalog.databases().values:
    message = "Check that the template is fully deployed"
print(message)

There should be two entries in the database:
- `face_match_output`: a table obtained by querying the output reports directly
- `matchingstats`: a view that unnests the array of matches for entries with at least one match
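
To illustrate what unnesting means here, the sketch below flattens a list of report-like dictionaries into one row per match in plain Python. The column names `customerid`, `suspect_match`, and `similarity` are the ones used later in this notebook; the nested layout of `matches` is an assumption for illustration and may differ from the actual reports.

```python
def unnest_matches(reports):
    """Return one row per (report, match) pair; reports without matches yield no rows."""
    return [
        {
            "customerid": report["customerid"],
            "suspect_match": match["suspect_match"],
            "similarity": match["similarity"],
        }
        for report in reports
        for match in report.get("matches", [])
    ]


# Hypothetical reports: the first has two matches, the second has none.
toy_reports = [
    {
        "customerid": "Ann",
        "matches": [
            {"suspect_match": "Cat", "similarity": 91.0},
            {"suspect_match": "Dan", "similarity": 97.5},
        ],
    },
    {"customerid": "Bob", "matches": []},
]
rows = unnest_matches(toy_reports)
```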

In [None]:
wr.catalog.tables(database=db_name)

In [None]:
df_reports = wr.athena.read_sql_query(
    "SELECT source, customerid FROM face_match_output",
    database=db_name,
    ctas_approach=False,
)

# The `matches` column has a ROW type that does not convert cleanly to pandas,
# so we exclude it here and unnest it through a view in Athena instead
df_reports

In [None]:
df_matches = wr.athena.read_sql_query(
    "SELECT * FROM matchingstats", database=db_name, ctas_approach=False
)
print(f"There are a total of {len(df_matches)} matches")
df_matches.sample(10)

We can look at the distribution of the similarity scores:

In [None]:
_ = df_matches.similarity.hist(bins=25, figsize=(14, 8))

This histogram can be polluted by repeated occurrences of legitimate similarities. It is probably more interesting to look at the maximum similarity for each pair of names:
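
A small worked example on toy data shows the effect of keeping only the best score per pair. The values are made up; the optional last step, which drops pairs where both names are equal, assumes that `customerid` and `suspect_match` hold person names.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "customerid": ["Ann", "Ann", "Ann", "Bob"],
        "suspect_match": ["Ann", "Ann", "Cat", "Dan"],
        "similarity": [99.5, 98.0, 91.0, 96.2],
    }
)

# One row per (customerid, suspect_match) pair, keeping only the best score.
best_per_pair = (
    toy.groupby(["customerid", "suspect_match"])["similarity"].max().reset_index()
)

# Optionally drop pairs where a name matches itself.
cross_person = best_per_pair[best_per_pair.customerid != best_per_pair.suspect_match]
```

Here the two `Ann`/`Ann` rows collapse into one, so the histogram counts each pair once.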

In [None]:
_ = (
    df_matches.groupby(["customerid", "suspect_match"])
    .max()
    .hist(bins=25, figsize=(14, 8))
)

We can also check how many matches were reported for each pair:

In [None]:
_ = (
    df_matches.groupby(["customerid", "suspect_match"])
    .count()
    .sort_values("similarity")
    .rename(columns={"similarity": "# of Matches"})
    .plot.barh(figsize=(14, 18))
)

To clean up the results, it is a good idea to set a minimum similarity score.

In [None]:
threshold = 95
_ = (
    df_matches.groupby(
        ["customerid", "suspect_match", (df_matches.similarity > threshold)]
    )
    .count()
    .rename(columns={"similarity": "# of Matches"})
    .sort_values(["# of Matches"])
    .unstack(level=-1)
    .plot.barh(
        figsize=(14, 18),
        subplots=True,
        sharey=True,
        sharex=False,
        layout=(1, 2),
        title=[f"Similarity below {threshold}", f"Similarity above {threshold}"],
    )
)

### Check duplicates
We can examine one of the identified cases in detail and look at the images themselves.

In [None]:
cases_of_interest = (
    df_matches[df_matches.similarity > threshold]
    .groupby(["customerid", "suspect_match"])
    .max()
)
cases_of_interest

In [None]:
name_to_check = cases_of_interest.sample(1).index[0][0]

name_list = s3.glob(f"{output_bucket_name}/output/{name_to_check}*")
duplicate_list = [k for k in name_list if count_hits(k) > 0]

print(
    f"We will check {name_to_check}, in particular, these matching records\n{duplicate_list}\n\n"
    "An example of the structure of the match record:"
)
with s3.open(duplicate_list[0]) as f:
    example = json.load(f)
example
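
The `count_hits` helper comes from `utils` and its implementation is not shown here. As a hedged illustration of the idea, the sketch below counts the matches in an already-loaded report dictionary. The `matches` key and the per-match `similarity` field are assumptions based on the column names used in this notebook, and `count_hits_in_report` is a hypothetical name, not the `utils` function.

```python
def count_hits_in_report(report: dict, min_similarity: float = 0.0) -> int:
    """Count entries of report["matches"] whose similarity is at least `min_similarity`."""
    return sum(
        1
        for match in report.get("matches") or []
        if match.get("similarity", 0.0) >= min_similarity
    )


# Hypothetical report with two matches of different strength.
toy_report = {
    "customerid": "Ann",
    "matches": [
        {"suspect_match": "Cat", "similarity": 91.0},
        {"suspect_match": "Dan", "similarity": 97.5},
    ],
}
n_all = count_hits_in_report(toy_report)
n_strong = count_hits_in_report(toy_report, min_similarity=95)
```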

We can now inspect the matches.

In [None]:
inspect_matches(images_bucket_name, duplicate_list[0])