## Setup

* AWS region.
* Add the required permissions to the IAM role.
** Rekognition, Textract API and S3 bucket.

* Test environment: SageMaker - Jupyter
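
The permissions listed above could be granted with a policy along the following lines. This is only a sketch, not the policy used for this notebook: the action list is inferred from the API calls made in the cells below (`detect_document_text`, `detect_text`, and S3 reads and writes), and `example-bucket` is a placeholder name.

```python
import json

# Hypothetical bucket name; replace it with the bucket that holds the images.
BUCKET_NAME = "example-bucket"

# Assumed minimal permissions, inferred from the API calls in this notebook.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["textract:DetectDocumentText", "rekognition:DetectText"],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET_NAME}/*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy on the execution role; a production setup would likely scope the resources more tightly.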

In [None]:

import os
import boto3
import re
import json
import sagemaker
from sagemaker import get_execution_role

region = boto3.Session().region_name

role = get_execution_role()

bucket = sagemaker.Session().default_bucket()

In [None]:
prefix = "sagemaker/pii-detection-redaction"
bucket_path = "https://s3-{}.amazonaws.com/{}".format(region, bucket)
# Customize to your bucket where you have stored the data
print(bucket_path)

## Textract

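Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.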

* [Amazon Textract Code Samples](https://github.com/aws-samples/amazon-textract-code-samples)
* [python-textract-textract_wrapper.py](https://docs.aws.amazon.com/ko_kr/code-samples/latest/catalog/python-textract-textract_wrapper.py.html)
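
A `detect_document_text` response is a flat list of `Blocks`, each with a `BlockType` such as `PAGE`, `LINE`, or `WORD`. The sketch below shows one way to pull out the lines together with their confidence scores. The helper name `extract_lines`, the `min_confidence` parameter, and the sample response are assumptions made for illustration; the sample is hand-written, not real Textract output.

```python
def extract_lines(response, min_confidence=0.0):
    """Return (text, confidence) pairs for LINE blocks at or above min_confidence."""
    return [
        (block["Text"], block["Confidence"])
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Hand-written stand-in for a detect_document_text response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Name: Hong Gildong", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Name:", "Confidence": 99.5},
        {"BlockType": "LINE", "Text": "smudged text", "Confidence": 41.0},
    ]
}

for text, confidence in extract_lines(sample_response, min_confidence=90):
    print(f"{confidence:5.1f}  {text}")
```

Filtering on confidence is optional; with the default of `0.0` the helper returns every detected line.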

In [None]:
import boto3

object='sagemaker/pii-detection-redaction/sample1.jpg'

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': bucket,
            'Name': object
        }
    })

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')

## Amazon Rekognition

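Amazon Rekognition provides pre-trained and customizable computer vision (CV) capabilities for extracting information and insights from images and videos. The Amazon Rekognition DetectLabels operation can be used to detect labels in an image.
[Detecting labels in an image](https://docs.aws.amazon.com/ko_kr/rekognition/latest/dg/labels-detect-labels-image.html)

DetectText can detect up to 100 words in an image. Whether more words can be detected in a single image has to be confirmed by opening a support ticket.
[Announcing enhancements to Amazon Rekognition text detection](https://aws.amazon.com/ko/about-aws/whats-new/2021/05/enhancements-to-amazon-rekognition-text-detection-support-for-more-words-higher-accuracy-lower-latency/)

Because the form layout is fixed and the resident registration number sits at the bottom of the document, rotating the image upside down before detection made it possible to find the resident registration number within the 100-word limit.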
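
The rotation workaround could look like the sketch below. It is an illustration under assumptions, not the code used for this test: `rotate_180` expects a Pillow image object, and `unrotate_box` is a hypothetical helper that maps a ratio-based bounding box found in the rotated image back onto the original, which is needed if the redaction box is drawn on the unrotated image.

```python
def rotate_180(pil_image):
    """Turn a Pillow image upside down using the standard Image.rotate method."""
    return pil_image.rotate(180)


def unrotate_box(box):
    """Map a ratio-based box from the 180-degree-rotated image to the original.

    A point (x, y) in the rotated image corresponds to (1 - x, 1 - y) in the
    original, so the far corner of the rotated box becomes the new top-left.
    """
    return {
        "Left": 1.0 - box["Left"] - box["Width"],
        "Top": 1.0 - box["Top"] - box["Height"],
        "Width": box["Width"],
        "Height": box["Height"],
    }


# Made-up box near the top of the rotated image.
rotated_box = {"Left": 0.125, "Top": 0.125, "Width": 0.5, "Height": 0.25}
print(unrotate_box(rotated_box))
```

A box near the top of the rotated image maps to a box near the bottom of the original, which matches where the number sits on the form.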
    

In [None]:
object='sagemaker/pii-detection-redaction/sample2.jpg'

redacted_box_color='red'
dpi = 72
pii_detection_threshold = 0.00


# If the image is in DICOM format, convert it to PNG
if (object.split(".")[-1:][0] == "dcm"):
    ! aws s3 cp s3://{bucket}/{object} .
    ! convert -format png {object.split("/")[-1:][0]} {object.split("/")[-1:][0]}.png
    ! aws s3 cp {object.split("/")[-1:][0]}.png s3://{bucket}/{object}.png
    object=object+'.png'
    print(object)

# Import all of the required libraries
%matplotlib inline
import boto3
import json
import io
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
import numpy as np
import matplotlib as mpl
from imageio import imread

import base64



# Implement AWS Services
rekognition=boto3.client('rekognition')
comprehend = boto3.client(service_name='comprehend')
s3=boto3.resource('s3')

# Download the image from S3 and hold it in memory
img_bucket = s3.Bucket(bucket)
img_object = img_bucket.Object(object)
xray = io.BytesIO()
img_object.download_fileobj(xray)
img = np.array(Image.open(xray), dtype=np.uint8)
print(img.shape)
# Set the image color map to grayscale, turn off axis graphing, and display the image
height, width, channel = img.shape
# What size does the figure need to be in inches to fit the image?
figsize = width / float(dpi), height / float(dpi)
# Create a figure of the right size with one axes that takes up the full figure
fig = plt.figure(figsize=figsize)
ax = fig.add_axes([0, 0, 1, 1])
# Hide spines, ticks, etc.
ax.axis('off')
# Display the image.
ax.imshow(img, cmap='gray')
plt.show()

In [None]:
# Text detected in the image that matches the '6 digits - 7 digits' format is assumed to be a resident registration number.
import re
regex = re.compile(r'\d{6}-\d{7}')

# Use Amazon Rekognition to detect all of the text in the image
response=rekognition.detect_text(Image={'Bytes':xray.getvalue()})
textDetections=response['TextDetections']
print ('Aggregating detected text...')
textblock=""
offsetarray=[]
totallength=0

for text in textDetections:
    if text['Type'] == "LINE":
        match = regex.search(text['DetectedText'])
        if match:
            offsetarray.append(text)
            totallength+=len(text['DetectedText'])+1
            textblock=textblock+text['DetectedText']+" "  

totaloffsets=len(offsetarray)
print(offsetarray)
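
The pattern used in the cell above can be tried on plain strings before running it against OCR output. The sketch below uses made-up sample strings; `find_rrn_candidates` is a hypothetical helper name. Note that the pattern only checks the shape of the text (six digits, a hyphen, seven digits) and does not validate that the number is a real resident registration number.

```python
import re

# Same pattern as in the notebook: six digits, a hyphen, seven digits.
RRN_PATTERN = re.compile(r"\d{6}-\d{7}")


def find_rrn_candidates(lines):
    """Return every substring in the given lines that matches the pattern."""
    return [m.group() for line in lines for m in RRN_PATTERN.finditer(line)]


# Made-up sample lines; none of these are real identifiers.
samples = [
    "RRN: 900101-1234567",   # matches
    "Phone 010-1234-5678",   # no run of six digits before a hyphen
    "900101-123456",         # only six digits after the hyphen
]

print(find_rrn_candidates(samples))
```

Because the pattern is unanchored, it would also match inside a longer digit run; adding lookarounds such as `(?<!\d)` and `(?!\d)` is one way to tighten it if that turns out to matter.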

In [None]:
# This list of bounding boxes is now used to draw red boxes over the PII text.
height, width, channel = img.shape
# What size does the figure need to be in inches to fit the image?
figsize = width / float(dpi), height / float(dpi)
# Create a figure of the right size with one axes that takes up the full figure
fig = plt.figure(figsize=figsize)
ax = fig.add_axes([0, 0, 1, 1])
ax.imshow(img, cmap='gray')
# for box in pii_boxes_list:
for box in offsetarray:
    #The bounding boxes are described as a ratio of the overall image dimensions, so we must multiply them
    #by the total image dimensions to get the exact pixel values for each dimension.
    x = img.shape[1] * box['Geometry']['BoundingBox']['Left']
    y = img.shape[0] * box['Geometry']['BoundingBox']['Top']
    width = img.shape[1] * box['Geometry']['BoundingBox']['Width']
    height = img.shape[0] * box['Geometry']['BoundingBox']['Height']
    rect = patches.Rectangle((x,y),width,height,linewidth=0,edgecolor=redacted_box_color,facecolor=redacted_box_color)
    ax.add_patch(rect)
# Ensure that no axes or whitespace are printed in the image file we want to save.
plt.axis('off')    
plt.gca().xaxis.set_major_locator(plt.NullLocator())
plt.gca().yaxis.set_major_locator(plt.NullLocator())

# Save the redacted image to the same Amazon S3 bucket, in PNG format, under the 'redacted/' key prefix.
img_data = io.BytesIO()
plt.savefig(img_data, bbox_inches='tight', pad_inches=0, format='png')
img_data.seek(0)
# Write the redacted image to S3
#object='sagemaker/pii-detection-redaction/wa-license.png'
img_bucket.put_object(Body=img_data, ContentType='image/png', Key="redacted/"+object)
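
The ratio-to-pixel conversion done inside the loop above can be isolated into a small function, which also avoids reusing the `width` and `height` names that hold the image size. This is a sketch; `to_pixel_rect` is a hypothetical helper and the box values are made up.

```python
def to_pixel_rect(bounding_box, image_width, image_height):
    """Convert a ratio-based BoundingBox dict to (x, y, width, height) in pixels."""
    return (
        bounding_box["Left"] * image_width,
        bounding_box["Top"] * image_height,
        bounding_box["Width"] * image_width,
        bounding_box["Height"] * image_height,
    )


# Made-up box covering the middle of an 800x400 image.
box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
print(to_pixel_rect(box, image_width=800, image_height=400))
# → (200.0, 200.0, 400.0, 50.0)
```

The returned tuple can be passed to `patches.Rectangle` as `(x, y), width, height`.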

## Open-source EasyOCR

EasyOCR supports many languages in addition to Korean. Korean recognition worked well on its demo website, so it was tested here.

* [github](https://github.com/JaidedAI/EasyOCR)
* [document](https://www.jaided.ai/easyocr/documentation/)

If "ImportError: cannot import name _registerMatType" occurs after installing the package, downgrade the opencv-python-headless version to resolve it.
Note that the gpu option can also be used.

In [None]:
object='sagemaker/pii-detection-redaction/sample1.jpg'

redacted_box_color='red'
dpi = 72
pii_detection_threshold = 0.00


# If the image is in DICOM format, convert it to PNG
if (object.split(".")[-1:][0] == "dcm"):
    ! aws s3 cp s3://{bucket}/{object} .
    ! convert -format png {object.split("/")[-1:][0]} {object.split("/")[-1:][0]}.png
    ! aws s3 cp {object.split("/")[-1:][0]}.png s3://{bucket}/{object}.png
    object=object+'.png'
    print(object)

# Import all of the required libraries
%matplotlib inline
import boto3
import json
import io
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
import numpy as np
import matplotlib as mpl
from imageio import imread

import base64



# Implement AWS Services
rekognition=boto3.client('rekognition')
comprehend = boto3.client(service_name='comprehend')
s3=boto3.resource('s3')

# Download the image from S3 and hold it in memory
img_bucket = s3.Bucket(bucket)
img_object = img_bucket.Object(object)
xray = io.BytesIO()
img_object.download_fileobj(xray)
img = np.array(Image.open(xray), dtype=np.uint8)
print(img.shape)
# Set the image color map to grayscale, turn off axis graphing, and display the image
height, width, channel = img.shape
# What size does the figure need to be in inches to fit the image?
figsize = width / float(dpi), height / float(dpi)
# Create a figure of the right size with one axes that takes up the full figure
fig = plt.figure(figsize=figsize)
ax = fig.add_axes([0, 0, 1, 1])
# Hide spines, ticks, etc.
ax.axis('off')
# Display the image.
ax.imshow(img, cmap='gray')
plt.show()

In [None]:
!pip install easyocr

In [None]:
# ImportError: cannot import name '_registerMatType'
# Note: run this cell only if the error above occurs.
!pip install "opencv-python-headless<4.3"

In [None]:
# this needs to run only once to load the model into memory
import easyocr

reader = easyocr.Reader(['ko'])
# reader = easyocr.Reader(['ko', 'en'], gpu=False)

In [None]:
# Use this form to read a local file directly
# textDetections = reader.readtext('sample1.jpg')

# The image was already loaded from S3 in the cell above
textDetections = reader.readtext(img)

In [None]:
import re
regex = re.compile(r'\d{6}-\d{7}')

print ('Aggregating detected text...')
textblock=""
offsetarray=[]

for text in textDetections:
    match = regex.search(text[1])
    if match:
        offsetarray.append(text)
        textblock=textblock+text[1]+" "  

print(offsetarray)

In [None]:
# This list of bounding boxes will be used to draw red boxes over the PII text.
height, width, channel = img.shape
# What size does the figure need to be in inches to fit the image?
figsize = width / float(dpi), height / float(dpi)
# Create a figure of the right size with one axes that takes up the full figure
fig = plt.figure(figsize=figsize)
ax = fig.add_axes([0, 0, 1, 1])
ax.imshow(img, cmap='gray')
for box in offsetarray:
    # EasyOCR returns each bounding box as four corner points in pixel coordinates
    # (top-left, top-right, bottom-right, bottom-left), so no scaling by the image size is needed.
    x = box[0][0][0]
    y = box[0][0][1]
    width = box[0][1][0]-box[0][0][0]
    height = box[0][3][1]-box[0][0][1]
    rect = patches.Rectangle((x,y),width,height,linewidth=0,edgecolor=redacted_box_color,facecolor=redacted_box_color)
    ax.add_patch(rect)
# Ensure that no axes or whitespace are printed in the image file we want to save.
plt.axis('off')    
plt.gca().xaxis.set_major_locator(plt.NullLocator())
plt.gca().yaxis.set_major_locator(plt.NullLocator())

# Save the redacted image to the same Amazon S3 bucket, in PNG format, under the 'redacted/' key prefix.
img_data = io.BytesIO()
plt.savefig(img_data, bbox_inches='tight', pad_inches=0, format='png')
img_data.seek(0)
#Write the redacted image to S3
img_bucket.put_object(Body=img_data, ContentType='image/png',  Key="redacted/"+object)
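
The drawing loop above derives the rectangle from the first, second, and fourth corner points, which works when the detected text is axis-aligned. For slightly tilted text, taking the minimum and maximum over all four corners gives an enclosing rectangle instead. The sketch below assumes that approach; `quad_to_rect` is a hypothetical helper, and the detection tuple is hand-written in the `(corner points, text, confidence)` shape that `readtext` returns.

```python
def quad_to_rect(points):
    """Return (x, y, width, height) of the axis-aligned box enclosing four corners."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


# Hand-written stand-in for one readtext() result, not real OCR output.
detection = ([[189, 75], [469, 75], [469, 165], [189, 165]], "900101-1234567", 0.92)

x, y, w, h = quad_to_rect(detection[0])
print(x, y, w, h)
```

For an axis-aligned box this produces the same rectangle as the loop above; for a tilted box it errs on the side of covering slightly more of the image.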