# Data collection

This notebook holds the code to scrape Google image search for image URLs and download the images. You should also be able to use this notebook to regenerate the dataset from the already-scraped URIs if you need to recreate it.

As usual, the first two cells load external libraries via NuGet (in this case canopy, ImageSharp and the native Selenium Chrome WebDriver) and open the namespaces used below.

In [1]:
#!fsharp
#r "nuget: canopy"
#r "nuget: Selenium.WebDriver.ChromeDriver, 87.0.4280.8800"
#r "nuget: SixLabors.ImageSharp, 1.0.2"

Installed package canopy version 2.1.5

Installed package SixLabors.ImageSharp version 1.0.2

Installing package Selenium.WebDriver.ChromeDriver, version 87.0.4280.8800............

In [1]:
#!fsharp
open System
open System.IO // Path and File are used in the cells below
open canopy.configuration
open canopy.classic
open OpenQA.Selenium
open SixLabors.ImageSharp
open SixLabors.ImageSharp.Processing

There are a couple of paths you need: for saving your search results, the downloaded raw files and the results of your download attempts. .NET Interactive resolves paths relative to the kernel location (at least in the current VS Code build) and not to the notebook location. Because of this it makes more sense to keep the paths absolute. Change this according to your setup.

In [1]:
#!fsharp
let rootDirectory = @"C:\Users\grego\source\repos\IsItKrampus.NET" // change this to reflect your setup

let dataDir = Path.Combine(rootDirectory, "data")
let imageSourcesTarget = Path.Combine(dataDir, "image_sources.tsv")
let rawFolder = Path.Combine(dataDir, "raw")
let imageDownloadsPathFile = Path.Combine(dataDir, "image_downloads.tsv")
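
The download cell further down writes into the raw folder, so that folder has to exist before the first file is saved. A minimal sketch of creating it up front follows; `ensureDataFolders` is a hypothetical helper name and not part of the original notebook.

```fsharp
open System.IO

/// Hypothetical helper: creates the data and raw folders below `root` if they are missing.
let ensureDataFolders (root: string) =
    let dataDir = Path.Combine(root, "data")
    let rawFolder = Path.Combine(dataDir, "raw")
    // CreateDirectory also creates missing parent folders and does nothing if the folder exists
    Directory.CreateDirectory rawFolder |> ignore
    dataDir, rawFolder
```

Calling `ensureDataFolders rootDirectory` once after setting the root path should be enough. Running it again is harmless.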

The search URL is pretty much just taken out of my browser. It doesn't really need all the query parameters but it was the URL I used to create the dataset so I wanted to document it as closely as possible.

In [1]:
#!fsharp
let getSearchUrl (query: string) =
    $"https://www.google.com/search?q={query}&sclient=img&source=lnms&tbm=isch&sa=X&ved=2ahUKEwiJwLa-7s_tAhUH9IUKHfwYCaYQ_AUoAXoECBIQAw&biw=1536&bih=719&dpr=1.25"
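
If you don't care about reproducing my exact request, a shorter URL should do. The sketch below assumes that `q` (the search term) and `tbm=isch` (image search) are the only parameters Google needs. That is my assumption and not the URL used to build the dataset. `getMinimalSearchUrl` is a hypothetical name.

```fsharp
open System

/// Hypothetical variant: keeps only the search term and the image-search switch.
let getMinimalSearchUrl (searchTerm: string) =
    // EscapeDataString percent-encodes spaces and reserved characters in the term
    let encoded = Uri.EscapeDataString searchTerm
    $"https://www.google.com/search?q={encoded}&tbm=isch"
```

With this variant you pass the plain search term (`"person in autumn"`) instead of a pre-encoded one (`"person+in+autumn"`).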

Canopy is a nice DSL over Selenium. To use it you'll need the correct Selenium WebDriver for your browser and your operating system. I used Chrome on a 64-bit Windows 10 build. If you want to use something else (like Firefox on Linux) you need to reference the correct NuGet package for your browser and configure the correct native WebDriver directory.

.NET Interactive uses your global NuGet settings by default. This means that packages are cached in your `~/.nuget/packages` directory.

In [1]:
#!fsharp
canopy.configuration.chromeDir <- @"C:\Users\grego\.nuget\packages\selenium.webdriver.chromedriver\87.0.4280.8800\driver\win32" // change this to reflect your system
start chrome

This cell starts a new image search and looks for the thumbnails on the page. I try to get 50 samples for each search term (because that's roughly how many you can get without scrolling and forcing the web app to load more images). The DOM queries aren't very generalized, so they might break when Google deploys a new version with new mangled class names. It shouldn't be too hard to adapt the query to the new markup. Selenium is pretty dependent on the actual viewport of your machine, so it's best to expand your remote-controlled browser window to the full screen size.

In [1]:
#!fsharp
let getImgUrls (n: int) (query: string) =
    let searchUrl = getSearchUrl query
    url searchUrl
    // let the browser load the page before going further
    sleep 1

    let imagesToClick =
        elements "div#islmp a.wXeWr.islib.nfEiy.mM5pbd img"

    let toTake = min (List.length imagesToClick) n

    let getImageUrl (elem : IWebElement) =
        try
            click elem

            sleep 1

            // nah this is not brittle and hacky as hell at all
            elem |> parent |> parent |> fun e -> e.GetAttribute("href")
            |> fun s -> s.Split('?').[1].Split('&').[0].Substring(7)
            |> Uri.UnescapeDataString
            |> Some
        with
        | _ -> None

    imagesToClick
    |> List.take toTake
    |> List.choose getImageUrl

let queryString = "person+in+autumn"
let imgUrls = getImgUrls 50 queryString
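
The `Split`/`Substring(7)` chain in the cell above relies on the image URL being the first query parameter of the link and on its name being seven characters long including the `=`. A slightly less fragile sketch looks the parameter up by name. It assumes the link looks like `.../imgres?imgurl=<escaped url>&...`, which I inferred from the parsing code and did not verify against the current Google markup. `tryGetQueryParam` is a hypothetical helper.

```fsharp
open System

/// Hypothetical helper: returns the unescaped value of the query parameter `key`, if present.
let tryGetQueryParam (key: string) (href: string) =
    match Uri.TryCreate(href, UriKind.Absolute) with
    | true, uri ->
        // uri.Query starts with '?' and keeps the percent-escapes of each value
        uri.Query.TrimStart('?').Split([| '&' |], StringSplitOptions.RemoveEmptyEntries)
        |> Array.tryPick (fun pair ->
            // split only at the first '=' so values containing '=' stay intact
            match pair.Split([| '=' |], 2) with
            | [| k; v |] when k = key -> Some (Uri.UnescapeDataString v)
            | _ -> None)
    | _ -> None
```

Inside `getImageUrl` this could replace the manual splitting by piping the `href` attribute into `tryGetQueryParam "imgurl"`. It returns `None` instead of throwing when the parameter is missing.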

You can always visually inspect the images in your notebook. .NET Interactive uses web technologies, which lets you use most HTML tags, CSS and JavaScript to visualize your data.

In [1]:
#!fsharp
DisplayFunctions.display imgUrls

DisplayFunctions.HTML $"<img src=\"%s{imgUrls |> List.skip 6 |> List.head}\"></img>"

index,value
0,https://cdn.psychologytoday.com/sites/default/files/field_blog_entry_images/2019-09/happy-woman-fall-leaves_istock-1016602340_martinan.jpg
1,https://previews.123rf.com/images/stakhov/stakhov1409/stakhov140900008/31482501-curly-man-in-blue-jacket-with-computer-tablet-in-autumn.jpg
2,https://previews.123rf.com/images/kmphotography/kmphotography1810/kmphotography181000025/110439368-lonely-man-walking-in-park-alone-in-autumn.jpg
3,https://envato-shoebox-0.imgix.net/5ec0/e148-d102-4350-80f4-fef10d587aab/ya+na+ozere+osen+2.jpg?auto=compress%2Cformat&fit=max&mark=https%3A%2F%2Felements-assets.envato.com%2Fstatic%2Fwatermark2.png&markalign=center%2Cmiddle&markalpha=18&w=700&s=184dd8e279f93f3bfeb9c1bb3b90808e
4,https://previews.123rf.com/images/kmphotography/kmphotography1810/kmphotography181000063/110628447-handsome-man-leaning-against-a-tree-in-a-park-in-autumn-while-smiling.jpg
5,https://media1.s-nbcnews.com/i/newscms/2016_43/1169044/autumn-today-161024-tease_902b5b66bed0e272b41c35cf72828389.jpg
6,https://static.urbandaddy.com/uploads/assets/image/articles/standard/77f1f860154b12ce617912ee96ed2286.jpg
7,https://get.pxhere.com/photo/man-tree-person-people-fall-guy-portrait-spring-red-color-autumn-season-avenue-human-action-981181.jpg
8,https://images.snapwi.re/bbfb/5bf2bfa1b9a48b64bfe91216.w800.jpg
9,https://image.shutterstock.com/image-photo/young-handsome-man-posing-autumn-260nw-475992364.jpg
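
To look at more than one image at a time, you can build the markup for several thumbnails in one go. This is only a sketch: `toImageGallery` is a hypothetical helper, and the fixed pixel width is an arbitrary choice.

```fsharp
open System.Net

/// Hypothetical helper: renders one <img> tag per URL, each with the given pixel width.
let toImageGallery (width: int) (urls: string list) =
    urls
    |> List.map (fun u ->
        // HtmlEncode keeps characters such as '&' or '"' in the URL from breaking the attribute
        sprintf "<img src=\"%s\" width=\"%d\" />" (WebUtility.HtmlEncode u) width)
    |> String.concat "\n"
```

The resulting string can be passed to `DisplayFunctions.HTML` in the same way as the single `<img>` tag in the cell above, for example with `imgUrls |> List.truncate 10 |> toImageGallery 200`.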


If you're happy with what you got, you should save your search results (the query string you used and the resulting image URLs). If you lose all your data (or have to create it in the first place, since I'm not allowed to share the images because I don't own them), this helps you to recreate it.

In [1]:
#!fsharp
imgUrls
|> List.map (fun s -> $"{queryString}\t{s}")
|> fun lines -> File.AppendAllLines(imageSourcesTarget, lines)

let urls =
    imgUrls
    |> Array.ofList

Create an HttpClient to use for all web requests. Unless this changed in .NET 5, the "correct" way is still to reuse one client for the lifetime of your kernel instance (rather than creating a new client for each request and disposing of it).

In [1]:
#!fsharp
open System.Net.Http

let httpClient = new HttpClient()
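
Some image hosts reject requests that carry no `User-Agent` header, and a hanging host can stall the whole download loop. The sketch below shows one way to configure the shared client for both cases. The product token and the 30 second timeout are made-up values, and `createClient` is a hypothetical helper that was not used for the original dataset.

```fsharp
open System
open System.Net.Http

/// Hypothetical helper: builds a reusable client with a User-Agent header and a request timeout.
let createClient (userAgent: string) (timeout: TimeSpan) =
    let client = new HttpClient()
    client.Timeout <- timeout
    // ParseAdd validates the value and adds it to the default headers of every request
    client.DefaultRequestHeaders.UserAgent.ParseAdd(userAgent)
    client

let configuredClient =
    createClient "IsItKrampusDataCollection/1.0" (TimeSpan.FromSeconds 30.0)
```

As with the plain client, create it once and reuse it for all requests of the kernel session.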

Download the images as raw data. If the request fails for whatever reason (URLs can be scraped incorrectly, some hosts block you from loading their images without a user agent, etc.), it just logs that it failed and moves on. In practice it really doesn't matter if a small set of images is lost in the process.

In [1]:
#!fsharp
let downloadImage (uri: string) =
    let req =
        try
            httpClient.GetAsync uri
            |> Async.AwaitTask
            |> Async.RunSynchronously
            |> Some
        with e ->
            display $"Req failed. Message: {e.Message}" |> ignore
            None

    match req with
    | Some req when req.IsSuccessStatusCode && (isNull req.Content |> not) ->
        let bytes =
            req.Content.ReadAsByteArrayAsync()
            |> Async.AwaitTask
            |> Async.RunSynchronously

        let format = Image.DetectFormat(bytes)

        let guid = Guid.NewGuid()

        let ext = if isNull format || isNull format.Name then String.Empty else "." + format.Name.ToLower()
        let fileName = $"{guid}{ext}"
        File.WriteAllBytes(Path.Combine(rawFolder, fileName), bytes)

        Some (uri, guid, fileName)
    | _ ->
        display $"{uri}: could not be processed" |> ignore
        None

let processedImages =
    urls
    |> Array.map downloadImage

Req failed. Message: Invalid URI: The hostname could not be parsed.

https://www.h%C3%A4ngemattewelt.at/media/catalog/product/cache/832edbc25b6b9f06432b6b25a7301d05/h/a/hammock-grenada-autumn-2.jpg: could not be processed
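
The failure above comes from a host name that is still percent-escaped after scraping. If you want to rescue such URLs, one option is to retry with the escapes decoded. This is a sketch and not what I did for the dataset. `tryNormalizeUri` is a hypothetical helper, and decoding the whole string can also change escaped characters in the path, so it is only used as a fallback.

```fsharp
open System

/// Hypothetical helper: returns an absolute Uri, retrying once with percent-escapes decoded.
let tryNormalizeUri (raw: string) =
    let tryCreate (s: string) =
        match Uri.TryCreate(s, UriKind.Absolute) with
        | true, uri -> Some uri
        | _ -> None

    // only fall back to the decoded form when the raw string is rejected
    tryCreate raw
    |> Option.orElseWith (fun () -> tryCreate (Uri.UnescapeDataString raw))
```

URLs that come back as `None` can be skipped before they ever reach `downloadImage`.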

For every successfully downloaded image, save the source URL, the associated GUID and the image name including the extension (if there is one). Some of the images might get saved in a format that can't be processed later, but the bulk will be JPEGs anyway.

In [1]:
#!fsharp
processedImages
|> Array.choose id
|> Array.map (fun (uri, id, name) -> $"{uri}\t{id}\t{name}")
|> fun lines -> File.AppendAllLines(imageDownloadsPathFile, lines)

If you want to recreate the dataset, I'd suggest starting with the `image_downloads.tsv` file: download all of the images (URI given in the first column) and save them under the names given in the third column. From there you can use the `image_prep.csv` file to apply the correct crops to all included images (look at `src/IsItKrampus.NET.DataSet.Server/Startup.fs`, especially the `applyProcessing` implementation, if you need a template). If there are images you can't download from the sources (maybe because they were deleted from the host), just remove them from the `image_prep.csv` file. If for some reason you experience major problems recreating the dataset, please get in touch with me. I can't publicly host my training data set because I hold no rights to the images I used, but I'm sure we can find a way to get you going while still respecting the owners' copyright.
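
The first step described above could look roughly like this. It is a sketch under the assumption that every line of `image_downloads.tsv` has exactly the three tab-separated columns written by this notebook (source URI, GUID, file name). `DownloadEntry`, `tryParseEntry` and `redownload` are hypothetical names.

```fsharp
open System
open System.IO
open System.Net.Http

/// One line of image_downloads.tsv (hypothetical record type).
type DownloadEntry = { SourceUri: string; Id: Guid; FileName: string }

let private tryParseGuid (s: string) =
    match Guid.TryParse s with
    | true, guid -> Some guid
    | _ -> None

/// Parses "uri<TAB>guid<TAB>fileName" and returns None for malformed lines.
let tryParseEntry (line: string) =
    match line.Split('\t') with
    | [| uri; rawId; name |] ->
        tryParseGuid rawId
        |> Option.map (fun guid -> { SourceUri = uri; Id = guid; FileName = name })
    | _ -> None

/// Downloads one entry into targetDir under its recorded file name.
/// Returns false if the request or the write fails.
let redownload (client: HttpClient) (targetDir: string) (entry: DownloadEntry) =
    try
        let bytes =
            client.GetByteArrayAsync(entry.SourceUri)
            |> Async.AwaitTask
            |> Async.RunSynchronously
        File.WriteAllBytes(Path.Combine(targetDir, entry.FileName), bytes)
        true
    with _ -> false
```

With the paths and the client from the cells above, something like `File.ReadAllLines imageDownloadsPathFile |> Array.choose tryParseEntry |> Array.filter (redownload httpClient rawFolder >> not)` would give you the entries that could not be fetched, which are the ones to remove from `image_prep.csv`.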