Creating Datasets

- Dutch Golden Age -- Age of Vermeer Master Dataset: I downloaded the dataset from the Frick Collection's GitHub page. It was compiled by the Frick's Digital Art History Lab, [DAHL](https://www.frick.org/research/DAHL).
- The dataset contains many files, but not all of them include image URLs that can be scraped, so I first cleaned it up manually and removed the irrelevant files.
- Most of the code in this notebook comes from class. I also made some modifications with the help of ChatGPT.

In [1]:
import json
import zipfile
import os

# Path to the DAHL dataset zip file
zip_path = '../AIM-Project/age-of-vermeer-master.zip'

# Temporary directory to extract files
temp_dir = 'temp_extracted_jsons'

# Create a temporary directory
os.makedirs(temp_dir, exist_ok=True)

# Initialize an empty list to hold merged data
merged_data = []

# Extract files
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    # Extract only .json files
    for file in zip_ref.namelist():
        if file.endswith('.json'):
            zip_ref.extract(file, temp_dir)
            # Read the extracted json file and append its content to merged_data
            with open(os.path.join(temp_dir, file), 'r') as json_file:
                data = json.load(json_file)
                merged_data.append(data)

# Optionally, remove the temporary directory and its contents after merging
import shutil
shutil.rmtree(temp_dir)

# Now, merged_data contains all merged JSON data

# Saving the merged data to a new JSON file
with open('merged.json', 'w') as outfile:
    json.dump(merged_data, outfile, indent=4)

# merged.json now contains all the merged data from the .json files in the zip archive.
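
As an alternative to extracting into a temporary directory, the JSON members can be parsed straight from the archive. The sketch below is only an illustration: `merge_json_from_zip` is a hypothetical helper name, and the in-memory archive stands in for the real dataset zip.

```python
import io
import json
import zipfile


def merge_json_from_zip(zip_source):
    """Return a list holding the parsed content of every .json member."""
    merged = []
    with zipfile.ZipFile(zip_source, "r") as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                # ZipFile.open returns a binary file object that json.load can read
                with archive.open(name) as member:
                    merged.append(json.load(member))
    return merged


# Tiny in-memory archive used only for demonstration (not the DAHL data)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.json", json.dumps([{"Image Link": "https://example.org/1"}]))
    archive.writestr("readme.txt", "not JSON")
    archive.writestr("b.json", json.dumps({"Image Link": "https://example.org/2"}))

demo_merged = merge_json_from_zip(buffer)
print(len(demo_merged))
```

With this approach there is no temporary directory to clean up afterwards; passing `zip_path` instead of `buffer` would read the real archive.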


The dataset's readme did not make clear how the information was organized. Now that the six .json files from the dataset are merged, I want to remove any records with duplicate URLs from merged.json.

In [14]:
# Load the merged JSON data
with open('merged.json', 'r') as file:
    data = json.load(file)

# Initialize a set to keep track of unique URLs
unique_urls = set()

# Initialize a list to store cleaned data without duplicate URLs
cleaned_data = []

for item in data:
    # If item is a dictionary, process it directly
    if isinstance(item, dict):
        url = item.get('Image Link')
        if url not in unique_urls:
            unique_urls.add(url)
            cleaned_data.append(item)
    # If item is a list, iterate through each dictionary in the list
    elif isinstance(item, list):
        for sub_item in item:
            # Ensure sub_item is a dictionary before attempting to access 'Image Link'
            if isinstance(sub_item, dict):
                url = sub_item.get('Image Link')
                if url not in unique_urls:
                    unique_urls.add(url)
                    cleaned_data.append(sub_item)
            else:
                # Handle cases where sub_item is not a dictionary
                print("Encountered a non-dictionary item within a list:", sub_item)
    else:
        # Log any items that are neither dictionaries nor lists
        print("Encountered an unsupported item:", item)

# Save the cleaned data back to a new or the same JSON file
with open('cleaned_merged.json', 'w') as outfile:
    json.dump(cleaned_data, outfile, indent=4)
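
The same idea can be written as two small functions, one that flattens the mixed dict/list structure and one that removes duplicates. This is a sketch under one stated assumption that differs from the cell above: records with no link are skipped entirely, whereas the cell keeps the first such record. The function names and sample data are made up for illustration.

```python
def flatten_records(merged):
    """Yield every dict from a list whose items are dicts or lists of dicts."""
    for item in merged:
        if isinstance(item, dict):
            yield item
        elif isinstance(item, list):
            for sub_item in item:
                if isinstance(sub_item, dict):
                    yield sub_item


def dedupe_by_url(records, key="Image Link"):
    """Keep the first record seen for each URL, preserving order."""
    seen = set()
    unique = []
    for record in records:
        url = record.get(key)
        if not url:
            continue  # assumption: records without a link are dropped here
        if url not in seen:
            seen.add(url)
            unique.append(record)
    return unique


# Made-up sample mirroring the dict / list-of-dicts mix in merged.json
sample = [
    {"Image Link": "https://example.org/a"},
    [
        {"Image Link": "https://example.org/a"},
        {"Image Link": "https://example.org/b"},
        {"Title": "no link"},
    ],
    "unsupported item",
]
demo_unique = dedupe_by_url(flatten_records(sample))
print(demo_unique)
```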


Not all the records in the cleaned_merged.json file have URLs, so I need to clean the dataset again.

In [16]:

# Load the JSON file
with open('cleaned_merged.json', 'r') as file:
    data = json.load(file)

# Filter out any records that don't have "https://" in their "Image Link"
filtered_data = [record for record in data if "https://" in record.get("Image Link", "")]

# Save the filtered dataset to a new file, excluding records without "https://" in the Image Link
with open('filtered_cleaned_merged.json', 'w') as file:
    json.dump(filtered_data, file, indent=4)

print(f"The filtered dataset has been saved. It includes {len(filtered_data)} records, each with 'https://' in the Image Link.")


The filtered dataset has been saved. It includes 92 records, each with 'https://' in the Image Link.
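
The substring test above accepts any value that contains "https://" anywhere in it. A stricter check is sketched below, assuming a usable link is a string whose scheme is `https` and that has a host part; `has_https_link` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlsplit


def has_https_link(record, key="Image Link"):
    """Return True if record[key] is a string that parses as an https URL."""
    value = record.get(key)
    if not isinstance(value, str):
        return False
    try:
        parts = urlsplit(value.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed hosts
        return False
    return parts.scheme == "https" and bool(parts.netloc)


print(has_https_link({"Image Link": "https://www.rijksmuseum.nl/en/collection/SK-A-2344"}))
print(has_https_link({"Image Link": "see https://example.org"}))
```

It could replace the condition in the list comprehension, for example `[r for r in data if has_https_link(r)]`, though the record count may then differ from the one reported above.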


Now, all the URLs are ready to be scraped. 

In [19]:
pip install beautifulsoup4


Collecting beautifulsoup4
  Using cached beautifulsoup4-4.12.3-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Using cached soupsieve-2.5-py3-none-any.whl.metadata (4.7 kB)
Using cached beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
Using cached soupsieve-2.5-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.3 soupsieve-2.5
Note: you may need to restart the kernel to use updated packages.


In [20]:


import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from pathlib import Path

# Define the directory where images will be saved
save_dir = Path("../AIM-Project/age_of_vermeer_master_images")
os.makedirs(save_dir, exist_ok=True)

# Load the filtered records saved in the previous step
with open('filtered_cleaned_merged.json', 'r') as file:
    data = json.load(file)

# Function to download an image given its URL
def download_image(image_url, save_path):
    try:
        # A timeout keeps one unresponsive server from stalling the whole run
        response = requests.get(image_url, timeout=30)
        if response.status_code == 200:
            with open(save_path, 'wb') as f:
                f.write(response.content)
            print(f"Downloaded {image_url}")
        else:
            print(f"Failed to download {image_url}")
    except Exception as e:
        print(f"Error downloading {image_url}: {e}")

# Function to fetch and parse a webpage to find an image URL, then download it
def scrape_and_download_image(page_url, save_dir):
    try:
        response = requests.get(page_url, timeout=30)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            # Example: Find the first <img> element (You might need to adjust this based on your needs)
            img_tag = soup.find('img')
            if img_tag and 'src' in img_tag.attrs:
                img_url = urljoin(page_url, img_tag['src'])
                image_name = os.path.basename(img_url)
                save_path = save_dir / image_name
                download_image(img_url, save_path)
            else:
                print(f"No image found at {page_url}")
        else:
            print(f"Failed to fetch page {page_url}")
    except Exception as e:
        print(f"Error fetching page {page_url}: {e}")

# Iterate through each record and attempt to scrape and download the image
for record in data:
    page_url = record.get("Image Link", "")
    if page_url:
        scrape_and_download_image(page_url, save_dir)


Downloaded https://www.facebook.com/tr?id=962351770479675&ev=PageView&noscript=1
Downloaded https://www.nationalgalleries.org/sites/default/files/styles/thumbnail/public/externals/94784.jpg?itok=KFv92fdy
Downloaded https://skd-online-collection.skd.museum/thumb/302/17dcd117-d81a-438c-aa1e-2b8655acf33e.jpg
Downloaded https://www.metmuseum.org/art/collection/search/437878
Downloaded https://skd-online-collection.skd.museum/thumb/122/719ac509-a48c-4306-8ac1-e3c4e60e492d.jpg
Downloaded https://collections.frick.org:443/internal/media/dispatcher/20123/resize%3Aformat%3Dzoomify;jsessionid=58D1CC327C1401B6FD5423A3B304EE51
No image found at https://www.rijksmuseum.nl/en/collection/SK-A-2344
Downloaded https://collections.frick.org:443/internal/media/dispatcher/18885/resize%3Aformat%3Dzoomify;jsessionid=F6C84EDFCE1D6D3E99257DF1DBF026B2
Failed to fetch page https://www.3landesmuseen.de/Hollaendische-Malerei-17-und-18-Jahrhu.644.0.html
Failed to fetch page https://www.rijksmuseum.nl/en/vermeers-t
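
The log above reports "Downloaded" for anything that returns status 200, including a tracking pixel and an ordinary web page. One way to catch this is to look at the first bytes of the payload before saving it, and to build the file name from the URL path only so that query strings such as `?itok=...` are left out. The sketch below is an assumption about how that could be done; `looks_like_image` and `filename_from_url` are hypothetical helpers, and the signature list covers only a few common formats.

```python
import os
from urllib.parse import unquote, urlsplit

# Leading bytes ("magic numbers") of a few common image formats
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"GIF87a",             # GIF
    b"GIF89a",             # GIF
)


def looks_like_image(payload):
    """Return True if the bytes start with a known image signature."""
    if payload.startswith(IMAGE_SIGNATURES):
        return True
    # WebP files start with "RIFF", a 4-byte size, then "WEBP"
    return payload[:4] == b"RIFF" and payload[8:12] == b"WEBP"


def filename_from_url(url, default="image"):
    """Take the last path segment of a URL, ignoring the query string."""
    path = urlsplit(url).path
    name = os.path.basename(unquote(path))
    return name or default


print(looks_like_image(b"<!DOCTYPE html><html>"))
print(filename_from_url(
    "https://www.nationalgalleries.org/sites/default/files/styles/"
    "thumbnail/public/externals/94784.jpg?itok=KFv92fdy"
))
```

Inside `download_image`, a call such as `looks_like_image(response.content)` could decide whether the file is written at all. A signature match only shows that the payload is an image, not that it is the artwork itself, so small thumbnails and tracking pixels would still need a size check.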

Most of the downloaded files were not usable images, since many of the URLs in the dataset are outdated, so I ended up not using the datasets I found online. Instead, I searched for images on Pinterest and created a Pinterest board of 17th-century Dutch portraits of subjects wearing large lace collars. After installing gallery-dl from the command line into my project's conda environment, I scraped that board to build the first dataset.

In [None]:
!gallery-dl "https://www.pinterest.co.uk/clin0320233/dutch-portrait-project/"

./gallery-dl/pinterest/clin0320233/Dutc…rtraits/pinterest_885661082958496029.jpg
./gallery-dl/pinterest/clin0320233/Dutc…rtraits/pinterest_885661082958496024.jpg
./gallery-dl/pinterest/clin0320233/Dutc…rtraits/pinterest_885661082958496022.jpg
./gallery-dl/pinterest/clin0320233/Dutc…rtraits/pinterest_885661082958496019.jpg
./gallery-dl/pinterest/clin0320233/Dutc…rtraits/pinterest_885661082958496018.jpg
./gallery-dl/pinterest/clin0320233/Dutc…rtraits/pinterest_885661082958496013.jpg
./gallery-dl/pinterest/clin0320233/Dutc…rtraits/pinterest_885661082958496011.jpg
./gallery-dl/pinterest/clin0320233/Dutc…rtraits/pinterest_885661082958496008.jpg
./gallery-dl/pinterest/clin0320233/Dutc…rtraits/pinterest_885661082958496000.jpg
./gallery-dl/pinterest/clin0320233/Dutc…rtraits/pinterest_885661082958495997.jpg
./gallery-dl/pinterest/clin0320233/Dutc…rtraits/pinterest_885661082958495

After DreamBooth training with two pre-trained Stable Diffusion models, I felt the results were not close to what I had originally intended. So I built a second dataset directly from Hendrik Kerstens' photographic work, using Danziger Gallery's artist profile (https://www.danzigergallery.com/artists/hendrik-kerstens2?view=slider#1) as the source. I pinned the photos to a Pinterest board and repeated the code above from class.

In [9]:
!gallery-dl "https://www.pinterest.co.uk/clin0320233/hendrik-kerstens/"

./gallery-dl/pinterest/clin0320233/Hend…erstens/pinterest_885661082958572253.jpg
./gallery-dl/pinterest/clin0320233/Hend…erstens/pinterest_885661082958572250.jpg
./gallery-dl/pinterest/clin0320233/Hend…erstens/pinterest_885661082958572247.jpg
./gallery-dl/pinterest/clin0320233/Hend…erstens/pinterest_885661082958572241.jpg
./gallery-dl/pinterest/clin0320233/Hend…erstens/pinterest_885661082958572240.jpg
./gallery-dl/pinterest/clin0320233/Hend…erstens/pinterest_885661082958572235.jpg
./gallery-dl/pinterest/clin0320233/Hend…erstens/pinterest_885661082958572227.jpg
./gallery-dl/pinterest/clin0320233/Hend…erstens/pinterest_885661082958572225.jpg
./gallery-dl/pinterest/clin0320233/Hend…erstens/pinterest_885661082958572223.jpg
./gallery-dl/pinterest/clin0320233/Hend…erstens/pinterest_885661082958572219.jpg
./gallery-dl/pinterest/clin0320233/Hend…erstens/pinterest_885661082958572
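
Because the log is truncated, a quick count of what actually landed on disk can be useful before training. The sketch below assumes the files sit somewhere under a root folder such as `./gallery-dl/pinterest`, as the paths in the log suggest; `list_downloaded_images` is a hypothetical helper, and the temporary folder only stands in for a real board.

```python
import tempfile
from pathlib import Path


def list_downloaded_images(root, suffixes=(".jpg", ".jpeg", ".png")):
    """Return the sorted image files found anywhere under root."""
    root = Path(root)
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )


# Demonstration on a throwaway folder rather than the real download directory
with tempfile.TemporaryDirectory() as tmp:
    board = Path(tmp) / "pinterest" / "demo-board"
    board.mkdir(parents=True)
    (board / "pin_1.jpg").write_bytes(b"")
    (board / "pin_2.JPG").write_bytes(b"")
    (board / "notes.txt").write_text("not an image")
    demo_names = [path.name for path in list_downloaded_images(tmp)]

print(demo_names)
```

Calling `len(list_downloaded_images("./gallery-dl/pinterest"))` would then give the number of images available across both boards.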