
## Install dependencies

The following libraries are used:
* pdf2image: converts .pdf files to images (saved here as .jpg).
* pillow: the Python imaging library, used to manipulate images.
* requests: HTTP client library, used to call the endpoints hosted on Azure.
* condacolab: installs Anaconda on Google Colab; needed to install Poppler, which pdf2image depends on.

Finally, we import `drive` from google.colab to mount Google Drive.

Note: the first time this cell runs, the kernel restarts automatically (the restart is triggered by the condacolab installation that Poppler requires).
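
Because pdf2image calls Poppler's command-line tools, one quick way to confirm the installation after the kernel restart is to look for the `pdftoppm` binary on the PATH. This is only a minimal sketch; the helper name `poppler_available` is hypothetical and not part of the notebook:

```python
import shutil

def poppler_available():
    """Return True if Poppler's pdftoppm executable is found on the PATH."""
    return shutil.which("pdftoppm") is not None

print("Poppler found:", poppler_available())
```

If this prints `False`, `convert_from_path` is likely to fail later in the pipeline.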

In [15]:
!pip install pdf2image
!pip install pillow
!pip install requests
!pip install -q condacolab
import condacolab
condacolab.install()
!conda install -c conda-forge poppler

✨🍰✨ Everything looks OK!
Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [16]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Directories
Two directories are defined here: the first contains the source images, and the second is where the processed images are stored.

Remember to end each path with "/".
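
The trailing "/" matters because the helper functions in this notebook build file paths by plain string concatenation (`path + file`). As an alternative sketch, `os.path.join` yields the same path on Colab (Linux) whether or not the directory ends with a separator. The directory and file names below are only illustrative:

```python
import os

# Illustrative directory names, with and without a trailing separator
with_slash = 'imgtestresult/'
without_slash = 'imgtestresult'

# os.path.join inserts the separator only when it is missing
print(os.path.join(with_slash, 'arriba.jpg'))
print(os.path.join(without_slash, 'arriba.jpg'))

# Plain concatenation silently produces a wrong path without the trailing "/"
print(without_slash + 'arriba.jpg')
```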

In [17]:
#path = 'drive/MyDrive/Datos - Hackathon JusticIA/Expedientes/'
path = 'drive/MyDrive/Colab Notebooks/RII AA Hackathon/imgtest/'
pathResult = 'drive/MyDrive/Colab Notebooks/RII AA Hackathon/imgtestresult/'

## Copy files
The files are copied from the first directory to the second.

This avoids altering the original files.

In [18]:
from distutils.dir_util import copy_tree
copy_tree(path, pathResult)

['drive/MyDrive/Colab Notebooks/RII AA Hackathon/imgtestresult/izquierda.jpg',
 'drive/MyDrive/Colab Notebooks/RII AA Hackathon/imgtestresult/abajo.jpg',
 'drive/MyDrive/Colab Notebooks/RII AA Hackathon/imgtestresult/derecha.jpg',
 'drive/MyDrive/Colab Notebooks/RII AA Hackathon/imgtestresult/arriba.jpg',
 'drive/MyDrive/Colab Notebooks/RII AA Hackathon/imgtestresult/jpg2pdf.pdf']
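
`distutils` is deprecated and was removed from the standard library in Python 3.12, so on newer runtimes the import above fails. A possible replacement, sketched here with temporary directories standing in for `path` and `pathResult`, is `shutil.copytree` with `dirs_exist_ok=True` (Python 3.8+), which also tolerates an existing destination:

```python
import os
import shutil
import tempfile

# Temporary directories stand in for `path` and `pathResult`
src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
with open(os.path.join(src, 'arriba.jpg'), 'wb') as f:
    f.write(b'placeholder bytes, not a real image')

# dirs_exist_ok=True lets the copy proceed when the destination already exists
shutil.copytree(src, dst, dirs_exist_ok=True)

copied = sorted(os.listdir(dst))
print(copied)
```

Unlike `copy_tree`, `shutil.copytree` returns the destination path rather than the list of copied files, so the cell output would differ from the one shown above.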

## Verify the copy

We can verify that the files were copied by printing a sample of the second directory.

In [19]:
from glob import glob
from random import sample

files = glob(pathResult+'*.*')
# Sample at most 5 files; sample() raises ValueError if asked for more items than exist
files = sample(files, min(5, len(files)))

for file_name in files:
    print(file_name)

drive/MyDrive/Colab Notebooks/RII AA Hackathon/imgtestresult/arriba.jpg
drive/MyDrive/Colab Notebooks/RII AA Hackathon/imgtestresult/jpg2pdf.pdf
drive/MyDrive/Colab Notebooks/RII AA Hackathon/imgtestresult/derecha.jpg
drive/MyDrive/Colab Notebooks/RII AA Hackathon/imgtestresult/jpg2pdf.jpg
drive/MyDrive/Colab Notebooks/RII AA Hackathon/imgtestresult/resultados.csv


# Required methods
The following helper functions are used in our pipeline.

In [20]:
import os
import sys
from PIL import Image, ImageOps
from pdf2image import convert_from_path
from math import ceil, floor
import requests
import json
import csv
import time

#Return rs[field][i][var], or "" when a key or index is missing from the JSON response
def checkValueField(rs,field,i,var):
    try:
        return rs[field][i][var]
    except (KeyError, IndexError, TypeError):
        return ""

def convertPdf(path, file):
    s = time.perf_counter()
    fullpath = path + file
    images = convert_from_path(fullpath)
    # Swap the .pdf extension for .jpg
    nameJpg = os.path.splitext(file)[0] + '.jpg'
    for image in images:
        # Every page is saved under the same name, so for a multi-page PDF only the last page is kept
        image.save(path + nameJpg, 'JPEG')
    print(f'convertPdf in {time.perf_counter() - s:0.2f} seconds.')
    return nameJpg

def orientationModel(path, file):
    s = time.perf_counter()
    #Request to orientation model
    with open(path+file, 'rb') as f:
        data = f.read()
    req = requests.post(
        url="https://cscvriia-prediction.cognitiveservices.azure.com/customvision/v3.0/Prediction/2bb8fee1-6413-418c-9c6e-3c41629397ce/classify/iterations/Iteration2/image",
        headers={
            "Content-Type": "application/octet-stream",
            "Prediction-Key": "bac8aef998f94f04b527450124a7864a"
        },
        data = data
    )
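    # Fail early on HTTP errors instead of hitting a KeyError on the error payload
    req.raise_for_status()
    result = req.json()
    # The tag name of the first prediction is parsed as an angle in degrees (see rotateFile)
    tag = result['predictions'][0]['tagName']
    print(f'orientationModel in {time.perf_counter() - s:0.2f} seconds.')
    return int(tag)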

def removeExif(path, file):
    s = time.perf_counter()

    fullpath = path + file
    try:
        im = Image.open(fullpath)
        # The next 4 lines strip EXIF metadata by copying only the pixel data into a new image
        data = list(im.getdata())
        im = Image.new(im.mode, im.size)
        im.putdata(data)
        im.save(fullpath)
    except IOError:
        # Skip files that Pillow cannot read or write
        return

    print(f'removeExif in {time.perf_counter() - s:0.2f} seconds.')

def rotateFile(path, file, orientation):
    fullpath = path + file
    # transpose() rotates counter-clockwise by the given angle
    rotations = {90: Image.ROTATE_90, 180: Image.ROTATE_180, 270: Image.ROTATE_270}

    # Any other value (e.g. 0) leaves the file untouched
    if orientation in rotations:
        im = Image.open(fullpath)
        im = im.transpose(rotations[orientation])
        im.save(fullpath,'JPEG')

def getImageSize(path, file):
    # Close the file handle once the size has been read
    with Image.open(path+file) as im:
        return im.width, im.height

def roundNum(num):
    decimal = num % 1
    if decimal >= .5:
        return ceil(num)
    else:
        return floor(num)

def calculateBoundingBox(left, top, width, height, widthImg, heightImg):
    #https://stackoverflow.com/questions/53737055/bounding-box-left-top-height-width-to-php-x1-x2-y1-y2-coordinates
    w = widthImg *  width
    h = heightImg * height

    xmin = left * widthImg
    ymin = top * heightImg
    xmax = xmin + w
    ymax = ymin + h

    return roundNum(xmin), roundNum(ymin), roundNum(xmax), roundNum(ymax)

def objectDetectionModel(path, file, width, height):
    s = time.perf_counter()
    #request to Object Detection Model
    with open(path+file, 'rb') as f:
        data = f.read()
    req = requests.post(
        url="https://cscvriia-prediction.cognitiveservices.azure.com/customvision/v3.0/Prediction/c2380253-45d3-4804-9c76-093802973de6/detect/iterations/Iteration3/image",
        headers={
            "Content-Type": "application/octet-stream",
            "Prediction-Key": "bac8aef998f94f04b527450124a7864a"
        },
        data = data
    )
    result = req.json()
    finalResult = {}
    acum = 0
    #An error payload has no 'predictions' key, which yields an empty result
    for prediction in result.get('predictions', []):
        probability = prediction.get('probability', 0)
        #Ignore predictions with probability below 0.5
        if probability < 0.5:
            continue
        box = prediction['boundingBox']
        xmin, ymin, xmax, ymax = calculateBoundingBox(box['left'], box['top'], box['width'], box['height'], width, height)
        finalResult[acum] = [prediction.get('tagName', ''), probability, xmin, ymin, xmax, ymax]
        acum += 1

    print(f'objectDetectionModel in {time.perf_counter() - s:0.2f} seconds.')
    return finalResult

def createCsv(csvname, prefix):
    with open(prefix+csvname, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['filename','width','height','class','xmin','ymin','xmax','ymax'])

def appendCsv(csvname, filename, width, height, result,prefix):
    with open(prefix+csvname, 'a+', newline='') as file:
        writer = csv.writer(file)
        #For loop each result
        for i in result:
            writer.writerow([filename, width, height, result[i][0], result[i][2], result[i][3], result[i][4], result[i][5]])
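
To make the conversion in `calculateBoundingBox` concrete: the code treats the `left`, `top`, `width` and `height` values returned by the detection model as fractions of the image size and scales them to pixel corner coordinates. The sketch below repeats that arithmetic in self-contained form with made-up values; the names `round_half_up` and `to_pixel_box` are hypothetical. It also shows a likely reason for using a helper such as `roundNum` instead of the built-in `round`, which rounds exact halves to the nearest even integer.

```python
from math import ceil, floor

def round_half_up(num):
    """Round to the nearest integer, with .5 always rounding up (same rule as roundNum)."""
    return ceil(num) if num % 1 >= 0.5 else floor(num)

def to_pixel_box(left, top, width, height, width_img, height_img):
    """Scale a normalized (left, top, width, height) box to pixel corners."""
    xmin = left * width_img
    ymin = top * height_img
    xmax = xmin + width * width_img
    ymax = ymin + height * height_img
    return tuple(round_half_up(v) for v in (xmin, ymin, xmax, ymax))

# Made-up box covering part of a 1000x800 image
box = to_pixel_box(0.25, 0.125, 0.5, 0.25, 1000, 800)
print(box)  # (250, 100, 750, 300)

# Built-in round() uses round-half-to-even, so the two differ on exact halves
print(round(2.5), round_half_up(2.5))  # 2 3
```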

# Main pipeline

In [22]:
s = time.perf_counter()

list_of_files = []
prefix = pathResult

#Read all files in the directory
for root, dirs, files in os.walk(prefix):
    for file in files:
        #Only .jpg files are collected, so the pdf branch below runs only if this filter is widened
        if not file.endswith('.jpg'):
            continue
        list_of_files.append(file)

#Clean the photos (strip EXIF metadata)
for file in list_of_files:
    removeExif(prefix, file)

#The results csv is saved in the same directory
createCsv('resultados.csv',prefix)

for file in list_of_files:
    #If the file is a pdf, convert it to jpg
    if file.endswith('.pdf'):
        file = convertPdf(prefix, file)
    #Get the orientation of the file
    orientation = orientationModel(prefix, file)
    #Rotate the image
    rotateFile(prefix, file, orientation)
    #Get the width and height of the photo
    width, height = getImageSize(prefix, file)
    #Call the object detection model
    result = objectDetectionModel(prefix, file, width, height)
    #Append the detections to the csv
    appendCsv('resultados.csv', file, width, height, result,prefix)
    print(file)
    print("----------")


print(f'finished in {time.perf_counter() - s:0.2f} seconds.')

removeExif in 0.64 seconds.
removeExif in 0.57 seconds.
removeExif in 0.56 seconds.
removeExif in 0.56 seconds.
removeExif in 0.82 seconds.
orientationModel in 0.78 seconds.
objectDetectionModel in 1.82 seconds.
izquierda.jpg
----------
orientationModel in 0.67 seconds.
objectDetectionModel in 2.04 seconds.
abajo.jpg
----------
orientationModel in 0.89 seconds.
objectDetectionModel in 1.79 seconds.
derecha.jpg
----------
orientationModel in 0.70 seconds.
objectDetectionModel in 0.89 seconds.
arriba.jpg
----------
orientationModel in 0.77 seconds.
objectDetectionModel in 2.43 seconds.
jpg2pdf.jpg
----------
finished in 16.43 seconds.
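
Once the run finishes, `resultados.csv` holds one row per detected object, with the columns written by `createCsv`. One way to inspect it is to count detections per class with `csv.DictReader`. The sketch below runs against a small temporary file with made-up rows; the class names `firma` and `sello` are invented for illustration and are not taken from the actual model:

```python
import csv
import os
import tempfile
from collections import Counter

# Write a tiny stand-in for resultados.csv with made-up rows
header = ['filename', 'width', 'height', 'class', 'xmin', 'ymin', 'xmax', 'ymax']
rows = [
    ['arriba.jpg', 1000, 800, 'firma', 250, 100, 750, 300],
    ['arriba.jpg', 1000, 800, 'sello', 10, 20, 110, 120],
    ['abajo.jpg', 1000, 800, 'firma', 30, 40, 230, 140],
]
csv_path = os.path.join(tempfile.mkdtemp(), 'resultados.csv')
with open(csv_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

# Count how many boxes were detected for each class
with open(csv_path, newline='') as f:
    counts = Counter(row['class'] for row in csv.DictReader(f))

print(dict(counts))  # {'firma': 2, 'sello': 1}
```

To use it on a real run, point `csv_path` at `pathResult + 'resultados.csv'` and skip the part that writes the stand-in file.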
