<a href="https://colab.research.google.com/github/giordanovitale/Prado-Museum-CNN/blob/main/Prado_Artists.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



1. [Data Augmentation Techniques](https://medium.com/ymedialabs-innovation/data-augmentation-techniques-in-cnn-using-tensorflow-371ae43d5be9#8be0)
2. [Model Architectures](https://medium.com/@navarai/unveiling-the-diversity-a-comprehensive-guide-to-types-of-cnn-architectures-9d70da0b4521)
3. [EfficientNet](https://towardsdatascience.com/complete-architectural-details-of-all-efficientnet-models-5fd5b736142)

# 0 - Load the necessary libraries

Dataset Source: https://www.kaggle.com/datasets/maparla/prado-museum-pictures

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import tensorflow as tf
import polars as pl


import os
import requests

from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool
# import visualkeras as vk

from scipy.optimize import fsolve
from math import exp
import matplotlib.pyplot as plt

from collections import defaultdict

from google.colab import userdata

import keras.backend as K
from keras.layers import Layer
from tensorflow.keras import Sequential, Model
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D, \
    AveragePooling2D, BatchNormalization, ReLU, PReLU, ZeroPadding2D, \
    GlobalAveragePooling2D, Input, DepthwiseConv2D, Add, Activation, Lambda, RandomFlip
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.callbacks import CSVLogger
from tensorflow.keras.applications.resnet_v2 import ResNet50V2
from tensorflow.keras.applications.resnet_v2 import preprocess_input as resnet_v2_preproccessing
from tensorflow.keras.applications.efficientnet_v2 import preprocess_input as efficientnet_preproccessing
from tensorflow.keras.applications.efficientnet_v2 import EfficientNetV2B3
from tensorflow.keras.applications.vgg19 import VGG19
from tensorflow.keras.applications.vgg19 import preprocess_input as vgg_preproccessing
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input as mobilenet_preprocessing

# 1 - Helper Functions

In [2]:
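def download_url(args):
    """
    Downloads a file from a URL
    :param args: Tuple containing the URL and the destination filename
    :return: None
    """
    url, filename = args
    try:
        # A timeout keeps a stalled connection from blocking a worker forever
        r = requests.get(url, timeout=30)
        # Save only successful responses, so that error pages
        # (403, 500, ...) are not written to disk as images
        if r.ok:
            with open(filename, "wb") as f:
                f.write(r.content)

    except Exception as e:
        print("Exception in download_url():", e)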

In [3]:
def download_parallel(args):
    """
    Downloads urls in parallel
    :param args: List of tuples containing the url and filename
    :return: None
    """
    cpus = cpu_count()
    threadpool = ThreadPool(cpus)
    results = threadpool.imap_unordered(download_url, args)
    threadpool.close()
    threadpool.join()

# 2 - Load the dataset using Kaggle API

My username and key are stored as Colab secrets. Replace `userdata.get('KAGGLE_USERNAME')` and `userdata.get('KAGGLE_KEY')` with your own username and key, respectively.

In [4]:
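# Read the credentials from Colab secrets (no quotes around the call,
# otherwise the literal text would be used as the credential)
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')
os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')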
!kaggle datasets download maparla/prado-museum-pictures -f prado.csv
!unzip prado.csv.zip

Dataset URL: https://www.kaggle.com/datasets/maparla/prado-museum-pictures
License(s): MIT
Downloading prado.csv.zip to /content
 38% 7.00M/18.3M [00:00<00:00, 68.6MB/s]
100% 18.3M/18.3M [00:00<00:00, 118MB/s] 
Archive:  prado.csv.zip
  inflating: prado.csv               
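
Outside Colab, `google.colab.userdata` cannot be imported. The following is a minimal sketch, not part of the original notebook, of a hypothetical helper `get_kaggle_credentials` that tries Colab secrets first and otherwise falls back to environment variables that are assumed to be set already.

```python
import os


def get_kaggle_credentials(environ=os.environ):
    """Return (username, key), preferring Colab secrets when available.

    Hypothetical helper for illustration; not part of the notebook.
    """
    try:
        # This import only succeeds inside Google Colab
        from google.colab import userdata
        return userdata.get("KAGGLE_USERNAME"), userdata.get("KAGGLE_KEY")
    except Exception:
        # Fallback: read the values from the given environment mapping
        username = environ.get("KAGGLE_USERNAME")
        key = environ.get("KAGGLE_KEY")
        if not username or not key:
            raise RuntimeError("Set KAGGLE_USERNAME and KAGGLE_KEY first.")
        return username, key
```

The returned pair can then be assigned to `os.environ["KAGGLE_USERNAME"]` and `os.environ["KAGGLE_KEY"]` before calling the Kaggle CLI.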


Create the dataframe from the unzipped CSV file.

In [5]:
df = pd.read_csv("prado.csv")

Since the project assignment does not define a target class, I have to choose one. After a careful inspection of the columns, I found that the most suitable candidates are `author` and `technical_sheet_tecnica`. The latter looks more promising because its most frequent classes contain more observations, which makes it better suited to big data algorithms.

In [141]:
df['author'].value_counts()

author
Anónimo                                                                       2698
Goya y Lucientes, Francisco de                                                1080
Bayeu y Subías, Francisco                                                      446
Haes, Carlos de                                                                326
Pizarro y Librado, Cecilio                                                     290
                                                                              ... 
Malombra, Pietro                                                                 1
Taller de Bellini, Giovanni                                                      1
Mattioli, Ludovico -Dibujante- (Autor de la obra original: Cignani, Carlo)       1
Ricci, Marco                                                                     1
García, Sergio                                                                   1
Name: count, Length: 2560, dtype: int64

In [142]:
df['technical_sheet_tecnica'].value_counts().sort_values(ascending=False)[:10]

technical_sheet_tecnica
Óleo                    4156
Acuñación               1118
Esculpido                550
Lápiz compuesto          476
Clarión; Lápiz negro     396
Albúmina                 395
Sanguina                 372
Lápiz                    259
Lápiz negro              237
Pluma; Tinta parda       214
Name: count, dtype: int64
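
To compare candidate target columns more systematically, one could measure how much of the data the most frequent classes cover. This is a minimal sketch under that assumption; `top_k_coverage` is a hypothetical helper, not part of the original notebook, and it works on any iterable of labels.

```python
from collections import Counter


def top_k_coverage(labels, k):
    """Fraction of labelled observations covered by the k most frequent classes."""
    # Missing labels given as None are skipped
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(k))
    return top / total
```

For example, `top_k_coverage(df['technical_sheet_tecnica'].dropna(), 4)` would give the share of labelled works that fall into the four most frequent techniques.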

Reduce the dataset by keeping only the observations that belong to the four most frequent classes.

In [6]:
df = df[df['technical_sheet_tecnica'].isin(['Óleo',
                                            'Acuñación',
                                            'Esculpido',
                                            'Lápiz compuesto'])]

In [7]:
df.shape

(6300, 30)

To obtain the JPG images, we start from the URL column `work_image_url` and take the last path segment of each URL as the image file name.

In [7]:
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
# when adding a column to a filtered dataframe
df = df.assign(work_id=df['work_image_url'].str.split('/').str[-1])
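
Splitting on `/` works as long as the URLs carry no query string or fragment. As a hedged alternative, the sketch below uses the standard library to parse the URL first. `filename_from_url` is a hypothetical helper, not part of the original notebook.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    path = urlparse(url).path
    # Decode percent-escapes such as %20 before taking the base name
    return posixpath.basename(unquote(path))
```

It could be applied column-wise with `df['work_image_url'].apply(filename_from_url)`.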


Create the folders into which the images will be stored, according to their respective class.

In [9]:
df.head(3)

Unnamed: 0,work_url,work_image_url,author,author_bio,author_url,author_id,work_title,work_subtitle,work_exposed,work_description,...,inventory,expositions,ubication,technical_sheet_autores,technical_sheet_edicion_/_estado,technical_sheet_materia,technical_sheet_ceca,technical_sheet_autora,technical_sheet_lugar_de_produccion,work_id
2,https://www.museodelprado.es/coleccion/obra-de...,https://content3.cdnprado.net/imagenes/Documen...,"Cronenburch, Adriaen van","Schagen (Países Bajos), 1520 - Bergum (Países ...",https://www.museodelprado.es/coleccion/artista...,26861819-ff88-4fde-8a37-56db9e1c1ba4,Dama con una flor amarilla,"Hacia 1567. Óleo sobre tabla, 107 x 79 cm",No expuesto,"Esta obra, junto a sus compañeras (P02074, P02...",...,"Catálogo Museo del Prado, 1873-1907.\nNúm. 130...","Aaaa[""a mas tres(dri)aes"", jeroglífico de ""Adr...",El retrato del Renacimiento\n ...,,,,,,,4a8bab74-ca91-450a-b5b7-39dd61e2d7f3.jpg
3,https://www.museodelprado.es/coleccion/obra-de...,https://content3.cdnprado.net/imagenes/Documen...,"González Velázquez, Zacarías","Madrid, 1763 - Madrid, 1834\n\nZacarías Joaquí...",https://www.museodelprado.es/coleccion/artista...,a8c659ad-d887-4703-8af3-1832dfc88eb7,"Dos pescadores, uno con caña y otro sentado","1785. Óleo sobre lienzo, 174 x 135 cm",Depósito en otra institución,Forma parte de un conjunto de cartones para lo...,...,Inv. Cartones para Tapices.\nNúm. 5710.\n\n571...,5710\nManuscrito en color anaranjado.\nAnverso...,Madrid - Cuartel General del Ejército (Depósito),,,,,,,9af5b176-b4d3-4930-854b-5b5f252829f1.jpg
4,https://www.museodelprado.es/coleccion/obra-de...,https://content3.cdnprado.net/imagenes/Documen...,"Obra copiada de Cano de la Peña, Eduardo","Madrid, 1823 - Sevilla, 1897\n\nSiendo niño se...",https://www.museodelprado.es/coleccion/artista...,521b82d6-6848-4f7d-96dc-3b8f102dd8b5,Tintoretto pintando a su hija muerta (copia),"Hacia 1856. Óleo sobre cartón, 19 x 24,5 cm",No expuesto,"Marietta (1560-1590), hija mayor del pintor Ja...",...,Inv. Nuevas Adquisiciones (iniciado en 1856).\...,"Recuerdo de la esposicion, dedicado a D. Luis ...",Caballete,,,,,,,4c494f0a-d5ae-45ca-826b-59f4b5fd4398.jpg


The important columns are:
`['work_image_url', 'work_id', 'technical_sheet_tecnica']`

In [8]:
!mkdir data

In [9]:
esp_techniques = ['Óleo', 'Acuñación', 'Esculpido', 'Lápiz compuesto']
eng_techniques = ['oil', 'minting', 'sculpture', 'pencil']

In [10]:
for technique in eng_techniques:
  !mkdir -p data/$technique
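
The same folders can be created without shell commands, which also works outside a notebook. A minimal sketch, assuming the `data/<class>` layout used above; `make_class_dirs` is a hypothetical helper name.

```python
import os


def make_class_dirs(root, class_names):
    """Create one sub-folder per class under root and return the paths."""
    paths = []
    for name in class_names:
        path = os.path.join(root, name)
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths
```

In this notebook the call would be `make_class_dirs("data", eng_techniques)`.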

In [21]:
download_list = []

for esp_technique, eng_technique in zip(esp_techniques, eng_techniques):
  # Select the rows that belong to the current technique
  subset = df.loc[df['technical_sheet_tecnica'] == esp_technique]

  # Pair each image URL with its destination path data/<class>/<file name>
  for url, work_id in zip(subset['work_image_url'], subset['work_id']):
    download_list.append((url, f"data/{eng_technique}/{work_id}"))

In [None]:
# Slicing does not raise an IndexError if the list is shorter than expected
for _, fn in download_list[4000:4200]:
  print(fn)

In [18]:
len(download_list)

702
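
To see how the entries are distributed across the class folders, the list can be tallied by destination directory and compared with the class counts shown earlier. This is a minimal sketch; `count_per_class` is a hypothetical helper, not part of the original notebook, and it assumes file names follow the `data/<class>/<file>` layout.

```python
import posixpath
from collections import Counter


def count_per_class(download_list):
    """Count (url, filename) pairs per class folder."""
    return Counter(
        # The class is the name of the folder that directly contains the file
        posixpath.basename(posixpath.dirname(filename))
        for _, filename in download_list
    )
```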

In [21]:
download_parallel(download_list)

KeyboardInterrupt: 
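
After an interrupted run, the download does not have to start from scratch. The sketch below filters the list down to files that are still missing or empty. `pending_downloads` is a hypothetical helper, not part of the original notebook. A file cut off mid-write would still count as present, so this is only a rough check.

```python
import os


def pending_downloads(download_list):
    """Keep only the (url, filename) pairs whose file is missing or empty."""
    return [
        (url, filename)
        for url, filename in download_list
        if not os.path.exists(filename) or os.path.getsize(filename) == 0
    ]
```

Resuming would then be `download_parallel(pending_downloads(download_list))`.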