# Python Image Manipulation with thousands of files
## Or, How to Get Computers to Do Tedious Things for You


### Before you start

Most of this session runs in a [browser version of the notebook](https://andre-geldenhuis.github.io/python-image-manipulation-session/), where you can do most of the single-threaded steps without installing anything. (It uses the [JupyterLite Xeus Kernel](https://github.com/jupyterlite/xeus), for those interested.)

Another way to follow along is to install JupyterLab, Git and a terminal locally. See the [Setup instructions](https://github.com/andre-geldenhuis/python-image-manipulation-session/blob/main/SETUP.md).

If you're going to run the session in your own Jupyter environment you'll need:

* A local Python install (the system version is fine for macOS and most Linux distros)
* Git
* A GitHub account
* Git configured with your username and email address
* Jupyter and pillow installed in either a Python virtual environment or a conda environment

## Intro

Sometimes you find yourself performing repetitive tasks on a computer. Since computers excel at repetitive tasks, we'll explore an example that involves manipulating tens of thousands of images. To make this session accessible to everyone, the first half will run in a pure browser JupyterLite session. Afterward, those who have set up a local Jupyter session will switch to it to use multiprocessing.

During this session, we'll cover:

* Using Jupyter notebooks for prototyping.
    * Shifting the code to Python functions for version control and clarity.
* Downloading and working with large datasets
* Using Python virtual environments
* Iteratively writing code
* Version controlling our code as we progress
* Optimising code to run in parallel for increased speed

We'll do most of this using Python in a JupyterLite notebook. If you've completed the setup, you should be able to follow along in your own local Jupyter session. If you haven't managed to get through the setup, the browser JupyterLite session will be good enough for most of the session.

<div style="color: white; background-color: #2196F3; border-left: 6px solid #1976D2; padding: 0.5em;">
    <strong>Note:</strong> For those following along in their own, non-JupyterLite session, make sure you've installed pillow. See the setup instructions.
</div>

## The Dataset

From 2014 to 2018 there was a campaign using automatic cameras (camera traps) in the Wellington Region.  The large number of captured images, 270,450 by 2018, required some pre-processing before being identified with the help of Citizen Scientists.

The images have a bottom bar which contains the date and time, which we will be removing. We might also want to remove the manufacturer logo in case the images are used for machine learning later on.

![Example image](docs/images/wellington_thumb_800.jpg "Example Image")

We will be utilizing images from camera traps to experiment with manipulating images on a small scale - then eventually scale to potentially hundreds of thousands of images. Here are the data and instructions for accessing it: https://lila.science/datasets/wellingtoncameratraps

### Example Images

We will start working with the example images in `example_images`.  Once we have some tidy working code, we will cover how you'd download a large subset of the images and work on that.

## The code
### Let's write some python! - Listing files with *pathlib*

There are many ways to generate a list of files in Python. We're going to use `pathlib` and `glob`. Note that `pathlib` was only introduced in Python 3.4, but most Python versions in use today should be newer than that.

We'll build up our Python program iteratively.


In [None]:
from pathlib import Path

# See what the path to our example_images is represented as
our_path = Path.cwd().joinpath('example_images')
our_path

In [None]:
# let's try to get the list of all the files
image_paths = our_path.glob("*")
image_paths

> A generator in Python is a type of iterable that lazily produces items one at a time and only as needed, which can be more memory efficient than generating all items at once. They are useful when you're working with large datasets or streams of data where you don't want to hold all items in memory simultaneously.


In [None]:
# It's just a generator object, let's iterate through it
for f in image_paths:
    print(f)
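
One consequence of `glob` returning a generator is that it can only be looped over once. The sketch below is illustrative only: instead of `example_images` it builds a throwaway folder with made-up file names (via `tempfile`), then shows that a second pass yields nothing unless the paths are first stored in a list.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for example_images: a temporary folder with three empty files
tmp_dir = Path(tempfile.mkdtemp())
for name in ("a.JPG", "b.JPG", "c.JPG"):
    (tmp_dir / name).touch()

image_paths = tmp_dir.glob("*.JPG")

first_pass = [p.name for p in image_paths]   # consumes the generator
second_pass = [p.name for p in image_paths]  # nothing left to yield

print(len(first_pass))   # 3
print(len(second_pass))  # 0

# If the paths are needed more than once, store them in a (sorted) list
image_list = sorted(tmp_dir.glob("*.JPG"))
print([p.name for p in image_list])  # ['a.JPG', 'b.JPG', 'c.JPG']
```

This is why some later cells call `glob` again before using `image_paths`.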

In [None]:
# What if we had non-image files? Let's make it specific to the JPG files we downloaded
image_paths = our_path.glob("*.JPG")
image_paths = our_path.rglob("*.JPG")  # if we had subdirectories ( rglob - recursive glob)
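
Whether a pattern like `"*.JPG"` also matches `photo.jpg` can depend on the operating system and file system. If a folder might contain a mix of extensions, one option is to list everything and filter on the lower-cased suffix. This is a sketch under assumptions: the folder, the file names and the `wanted` set are all made up for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical folder with mixed-case extensions and one non-image file
tmp_dir = Path(tempfile.mkdtemp())
for name in ("one.JPG", "two.jpg", "three.jpeg", "notes.txt"):
    (tmp_dir / name).touch()

wanted = {".jpg", ".jpeg"}  # assumed set of extensions to keep

# rglob("*") walks subdirectories too; suffix.lower() ignores the case of the extension
image_paths = sorted(
    p for p in tmp_dir.rglob("*")
    if p.is_file() and p.suffix.lower() in wanted
)

print([p.name for p in image_paths])  # ['one.JPG', 'three.jpeg', 'two.jpg']
```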

### Editing an image

Let's get a single image and crop the bottom of it.

In [None]:
from pathlib import Path

our_path = Path.cwd().joinpath('example_images')

# Generate Path objects
image_paths = our_path.glob("*.JPG")

image_path = next(image_paths)  # next() takes the next item out of the generator
image_path

In [None]:
from PIL import Image #we'll need pillow to load the image

im = Image.open(image_path)

In [None]:
#explore the im object
im.height
im.size

In [None]:
im # Show the image to get an idea of what we are doing.

### Get the crop ratios

We want to crop away the bottom of the image, removing most of the Bushnell logo and the time and date. You'll notice that the bottom of the logo is already cropped off: in the original images there was another row below it that showed the GPS locations of the camera traps. These images were used in a citizen science project to identify what was in them, which ran while the camera traps were still in the field, and we didn't want the cameras to get stolen.

We'll get the amount to crop off the bottom by trial and error. It's a little easier this way in a JupyterLab session.  In real life you could use some image editing program to work out the correct number of pixels to remove.

In [None]:
#Store the width and height
width, height = im.size

# Setting the points for cropped image
left = 0
top = 0
right = width
bottom = height-50  # let's take a guess here

# note the new variable name, so we don't overwrite the original, as we may want to make further adjustments
im1 = im.crop((left, top, right, bottom))
im1


In [None]:
# Iterate on above code block until the image is correctly cropped.

bottom = height-90
im1 = im.crop((left, top, right, bottom))
im1
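
The trial-and-error loop above can be wrapped in a small helper so that each guess is a single call. This is only a sketch: `crop_bottom` is a hypothetical name, and a plain grey image created in memory stands in for a real camera trap photo.

```python
from PIL import Image


def crop_bottom(im, pixels):
    """Return a copy of im with `pixels` rows removed from the bottom."""
    width, height = im.size
    # crop takes a box of (left, upper, right, lower) pixel coordinates
    return im.crop((0, 0, width, height - pixels))


# Hypothetical stand-in for a camera trap image: a plain 640x480 grey image
im = Image.new("RGB", (640, 480), (128, 128, 128))

for guess in (50, 90):
    print(guess, crop_bottom(im, guess).size)
# 50 (640, 430)
# 90 (640, 390)
```

Because `crop` returns a new image, `im` keeps its original size and every guess starts from the untouched original.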

## Drawing on images

The next step in our Python image processing journey is to cover the camera vendor's logo by drawing a rectangle over it. This is a useful technique if you're concerned about your machine learning model training on a logo. However, in such cases it would be even better to trim the entire bottom of the image, so that the model doesn't train on a black square instead. We'll go ahead with a black square anyway to demo how to do it.

In [None]:
from PIL import Image, ImageDraw  # Import ImageDraw

In [None]:
# Get the new image width and height to simplify the subsequent calculations
width, height = im1.size

# Prepare an image draw object, we'll call overlay as we'll be using it to cover up the logo
overlay = ImageDraw.Draw(im1)

In [None]:
# The 'rectangle' function takes coordinates as follows: [(x0, y0), (x1, y1)]

# Let's estimate the size of the rectangle we want to draw
w, h = 100, 90

# Establish the rectangle's coordinates
x0 = 0
y0 = height - h
x1 = w
y1 = height

shape = [(x0, y0), (x1, y1)]

# Draw a rectangle on the image with a green outline and a black fill
overlay.rectangle(shape, fill="black", outline="green")

In [None]:
im1 # Note we want to look at im1, as it's what we are applying the ImageDraw object 'overlay' to

In [None]:
# Iterate on above code block until the logo is fully covered.

# Prepare an image draw object, we'll call overlay as we'll be using it to cover up the logo
overlay = ImageDraw.Draw(im1)

# Let's modify the size of the rectangle
w, h = 200, 108

# Update the coordinates
x0 = 0
y0 = height - h
x1 = w
y1 = height

shape = [(x0, y0), (x1, y1)]

# Draw another rectangle, this time with a black outline and black fill
overlay.rectangle(shape, fill="black", outline="black")

im1
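
`ImageDraw.Draw` draws directly onto the image it is given, so each retry in the cells above is painted on top of the previous attempt. One way to keep every attempt independent is to copy the image first and draw on the copy. The sketch below assumes a hypothetical helper name, `cover_corner`, and a plain white test image rather than a real photo.

```python
from PIL import Image, ImageDraw


def cover_corner(im, w, h, fill="black"):
    """Return a copy of im with a w x h rectangle drawn in the lower-left corner."""
    out = im.copy()  # draw on a copy so the input image stays untouched
    width, height = out.size
    ImageDraw.Draw(out).rectangle([(0, height - h), (w, height)], fill=fill)
    return out


# Hypothetical stand-in for the cropped image: plain white, 640x390
base = Image.new("RGB", (640, 390), "white")

attempt_1 = cover_corner(base, 100, 90)
attempt_2 = cover_corner(base, 200, 108)  # starts again from the clean image

print(base.getpixel((10, 380)))       # (255, 255, 255): original unchanged
print(attempt_2.getpixel((10, 380)))  # (0, 0, 0): inside the rectangle
```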

### Saving the output

So far all the changes we have made to the image have been in memory only. It's good practice to save your changed data, images and datasets in a new location. We don't want to overwrite our originals. This lets us back out of any mistakes, and it also keeps our workflow on the path to repeatability.

Going back to how we first got the path for an image, let's modify that to make a path for a new location to save it.

In [None]:
from pathlib import Path

our_path = Path.cwd().joinpath('example_images')

# Generate Path objects
image_paths = our_path.glob("*.JPG")

image_path = next(image_paths) 

In [None]:
our_path # the full path the images are in

In [None]:
image_path # the full path we loaded a particular image from

In [None]:
image_path.name # we can also get just the filename
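
A `Path` exposes its other components in the same way as `.name`. The short sketch below uses `PurePosixPath` so that it behaves identically on any operating system; the folder names are hypothetical and only there to show the attributes.

```python
from pathlib import PurePosixPath

# Hypothetical path, used only to show the attributes
p = PurePosixPath("/data/example_images/010116060142029a3301.JPG")

print(p.name)    # 010116060142029a3301.JPG
print(p.stem)    # 010116060142029a3301
print(p.suffix)  # .JPG
print(p.parent)  # /data/example_images

# Same filename, different folder
print(PurePosixPath("/data/output") / p.name)  # /data/output/010116060142029a3301.JPG
```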

In [None]:
# What we want is an output folder next to the example_images folder, where we can save a file with the same name.
# We can construct this just like we constructed the path to the example images.

output_path = Path.cwd().joinpath('output')

output_image_path = output_path.joinpath(image_path.name)
output_image_path

In [None]:
im1.save(output_image_path) 

In [None]:
# The above will fail the first time as the directory doesn't exist yet, let's create it and try again.

# Create the directory if it doesn't exist
output_path.mkdir(parents=True, exist_ok=True)

im1.save(output_image_path) 
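
The create-the-folder-then-save steps can be bundled into one helper so the folder check is never forgotten. This is a sketch under assumptions: `save_to_output` is a hypothetical name, and the image and folders are created in a temporary location rather than in the session's working directory.

```python
import tempfile
from pathlib import Path
from PIL import Image


def save_to_output(im, source_path, output_dir):
    """Save im into output_dir under the same filename as source_path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = output_dir / Path(source_path).name
    im.save(target)
    return target


# Hypothetical image and folders, created in a temporary location
work_dir = Path(tempfile.mkdtemp())
im = Image.new("RGB", (64, 48), "white")
saved = save_to_output(im, work_dir / "example_images" / "demo.JPG", work_dir / "output")

print(saved.name)      # demo.JPG
print(saved.exists())  # True
```

Because the file is written to a separate folder, the original in `example_images` is never touched.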

## Custom functions for repeatable workflows

In our Jupyter notebook session, we've iteratively developed and tested our code in an interactive environment. This format is excellent for experimentation and learning, as it allows immediate feedback and simplifies complex concepts.

However, as we transition our code into a separate .py file and define our functions there, you might notice that it abstracts away much of the interactivity we had in the notebook. While this might seem like a drawback at first, it actually brings several significant advantages:

* Modularity and Reusability: By placing our code into functions within a .py file, we make our code more modular. This means you can easily reuse these functions in other projects or within different parts of the same project without rewriting them.

* Cleaner Code: Having our code in a separate file encourages cleaner, more organized coding practices. Functions are clearly defined, and the code is structured in a way that’s easy to read and maintain.

* Version Control: Jupyter notebooks can be a bit cumbersome when it comes to version control. The file format includes output cells, metadata, and other elements that can clutter the version history and make merging changes more complex. Pure Python scripts, on the other hand, are straightforward text files that integrate seamlessly with version control systems like Git. This makes tracking changes, collaborating with others, and maintaining a clean development history much easier.


### Building up the function in a cell

We'll write our function in one big cell block, and once we're happy with it, we'll write a separate .py file.  We want to do all the steps we've done so far.



In [None]:
from pathlib import Path
from PIL import Image, ImageDraw

example_image_path = 'example_images'
output_path = 'output'

# Generate Path objects
raw_path = Path.cwd().joinpath(example_image_path)
image_paths = raw_path.glob("*.JPG")

# Get the first image path out of the generator
image_path = next(image_paths)

# Print image path and load
print(image_path)
im = Image.open(image_path)

#Store the width and height
width, height = im.size

# Setting the points for cropped image
left = 0
top = 0
right = width
bottom = height-100

#crop the bottom off the image
im_cropped = im.crop((left, top, right, bottom))

#get new image width and height, this'll make the math easier
width, height = im_cropped.size

#Prepare rectangle
draw = ImageDraw.Draw(im_cropped)

#Size of rectangle
w, h = 200, 100

x0 = 0
y0 = height - h
x1 = w
y1 = height    

shape = [(x0, y0), (x1, y1)]
draw.rectangle(shape, fill ="black",outline ="black")

#how will we save without overwriting?
# current filepath is
# PosixPath('/home/andre/Documents/talks/python_image_manip/test_data/images/010116060142029a3301.JPG')

output_image_path = Path.cwd().joinpath(output_path,image_path.name)

im_cropped.save(output_image_path)

#### Converting to a function


In [None]:
from pathlib import Path
from PIL import Image, ImageDraw

example_image_path = 'example_images'
output_path = 'output'

# Generate Path objects
raw_path = Path.cwd().joinpath(example_image_path)
image_paths = raw_path.glob("*.JPG")

# Get the first image path out of the generator
image_path = next(image_paths)

print(image_path)

def imageprocess_dev(image_path, output_path):
    """
    Prepare an image so it is suitable for machine learning.

    Opens the image at image_path, crops the bottom 100 pixels (the vendor
    info bar) and draws a black rectangle over the vendor logo in the
    lower-left corner of the cropped image. The result is saved under the
    same filename in output_path, which must be provided.

    Args:
        image_path (Path): pathlib.Path to the image file.
        output_path (str): Directory where the processed image will be saved.

    Example:
    >>> imageprocess_dev(Path("/path/to/image.jpg"), "/path/to/output/")
    """
    
    if not output_path:
        raise ValueError("An output path must be provided.")
        
    im = Image.open(image_path)
    
    #Store the width and height
    width, height = im.size
    
    # Setting the points for cropped image
    left = 0
    top = 0
    right = width
    bottom = height-100
    
    #crop the bottom off the image
    im_cropped = im.crop((left, top, right, bottom))
    
    #get new image width and height, this'll make the math easier
    width, height = im_cropped.size
    
    #Prepare rectangle
    draw = ImageDraw.Draw(im_cropped)
    
    #Size of rectangle
    w, h = 200, 100
    
    x0 = 0
    y0 = height - h
    x1 = w
    y1 = height    
    
    shape = [(x0, y0), (x1, y1)]
    draw.rectangle(shape, fill ="black",outline ="black")
    
    # Wrapping in Path() means a plain string path works here as well as a Path object
    output_image_path = Path.cwd().joinpath(output_path, Path(image_path).name)

    # Create the directory if it doesn't exist
    output_image_path.parent.mkdir(parents=True, exist_ok=True)
    
    im_cropped.save(output_image_path)


imageprocess_dev(image_path, output_path)

In [None]:
# We can now use the function to process another image

# Get the next image path out of the generator
image_path = next(image_paths)
image_path

In [None]:
imageprocess_dev(image_path, output_path)

#### `__init__.py` → make your functions into a 'package'

Let's make a folder called `utils` for our utility functions. Inside this folder, create a file called `imagefunctions.py`. This will be a plain Python file rather than a Jupyter notebook (`.ipynb`) file. Put the function we just wrote into it, naming it `imageprocess` (the name we import below).

We also need to create an empty file called `__init__.py` in the same folder. This tells Python to treat the folder as a 'package', which makes importing functions from it much easier.
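
The contents of `utils/imagefunctions.py` are not shown in this notebook. A minimal sketch of what it might contain, assuming it is simply `imageprocess_dev` from above saved under the name `imageprocess`, with the same crop height and rectangle size:

```python
# utils/imagefunctions.py (sketch; assumed to mirror imageprocess_dev above)
from pathlib import Path
from PIL import Image, ImageDraw


def imageprocess(image_path, output_path):
    """Crop the bottom 100 pixels off an image, cover the logo, and save it."""
    if not output_path:
        raise ValueError("An output path must be provided.")

    image_path = Path(image_path)
    im = Image.open(image_path)
    width, height = im.size

    # Crop the info bar off the bottom
    im_cropped = im.crop((0, 0, width, height - 100))
    width, height = im_cropped.size

    # Cover the logo in the lower-left corner
    draw = ImageDraw.Draw(im_cropped)
    draw.rectangle([(0, height - 100), (200, height)], fill="black", outline="black")

    # Save under the same filename in the output folder
    output_image_path = Path.cwd().joinpath(output_path, image_path.name)
    output_image_path.parent.mkdir(parents=True, exist_ok=True)
    im_cropped.save(output_image_path)
```

With this file and an empty `utils/__init__.py` in place, the folder would hold exactly two files.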

In [None]:
ls utils/ # show the contents of the utils folder

In [None]:
# Import the imageprocess function from the imagefunctions file
from utils.imagefunctions import imageprocess

In [None]:
# Test if it works
image_path = next(image_paths) # next image
imageprocess(image_path, output_path)

#### The whole process in one go

Rather than using a bunch of next(image_paths), let's use a for loop and put all the steps together.

In [None]:
from pathlib import Path
from PIL import Image, ImageDraw
from utils.imagefunctions import imageprocess

example_image_path = 'example_images'
output_path = 'output'

# Generate Path objects
raw_path = Path.cwd().joinpath(example_image_path)
image_paths = raw_path.glob("*.JPG")

# Iterate through the image_paths with a for loop
for image_path in image_paths:
    imageprocess(image_path, output_path)

    # Print progress using Python f-strings
    print(f"Processed {image_path.name}")
    


## Estimating runtime

In [None]:
import timeit

In [None]:
run_num = 100
runtime = timeit.timeit('imageprocess(image_path, output_path)', globals=globals(), number=run_num)
ave_runtime = runtime / run_num

total_images = 270450
total_images = 800  # overrides the line above; comment this out to estimate for the full dataset

total_runtime_min = ave_runtime * total_images / 60

print(f"Average runtime over {run_num} runs: {ave_runtime} seconds")
print(f"Total runtime for {total_images} images would be very roughly: {total_runtime_min} minutes")
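
To make the arithmetic explicit: the estimate is the average number of seconds per image multiplied by the number of images, converted to minutes. The sketch below uses a made-up average of 0.05 seconds per image (not a measured value) and a hypothetical helper name, `estimate_minutes`. Dividing by a worker count gives a best-case figure for parallel runs, which real runs are unlikely to reach.

```python
def estimate_minutes(seconds_per_image, n_images, n_workers=1):
    """Rough total runtime in minutes, assuming work splits evenly across workers."""
    return seconds_per_image * n_images / n_workers / 60


ave_runtime = 0.05  # hypothetical value, for illustration only

print(round(estimate_minutes(ave_runtime, 800), 2))        # 0.67
print(round(estimate_minutes(ave_runtime, 270450), 1))     # 225.4
print(round(estimate_minutes(ave_runtime, 270450, 8), 1))  # 28.2
```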

## Multiprocessing

The estimated runtime isn't terrible, but we can do better by using all the processors on your computer.

<div style="color: white; background-color: #2196F3; border-left: 6px solid #1976D2; padding: 0.5em;">
    <strong>Note:</strong> In the jupyterlite environment you only have access to a single processor, or 'cpu'.  The multiprocessing code below will still work, but you won't get any speedup.  You can take this code and try running it locally in a jupyter notebook session to see the effect.
</div>


In [None]:
import multiprocessing as mp

print(f'Python can see {mp.cpu_count()} cpus in this environment') # on my laptop, I get 8

<div style="color: black; background-color: #dcffdb; border-left: 6px solid #1976D2; padding: 0.5em;">
    <strong>Aside:</strong> It is increasingly common for modern processors to have different types of CPU cores. Often there will be a few high-performance cores and several more efficient, but less capable, efficiency cores. If this is the case for you, you might have to take that into account when you divide up the work, as dividing it equally between all CPUs won't give the highest throughput.
</div>
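
One way to cope with cores of different speeds is to avoid fixed, equal shares altogether: hand the work out in many small chunks, so that faster workers simply come back for more. The sketch below assumes this approach, using `Pool.imap_unordered` with a `chunksize`. The helper `pick_chunksize`, the stand-in task `simulated_work` and the figure of four chunks per worker are all illustrative choices, not tuned values.

```python
import multiprocessing as mp


def simulated_work(n):
    """Stand-in for imageprocess: any function of one argument will do."""
    return n * n


def pick_chunksize(n_items, n_workers, chunks_per_worker=4):
    """Aim for several chunks per worker so quick workers can take extra ones."""
    return max(1, n_items // (n_workers * chunks_per_worker))


if __name__ == "__main__":
    items = list(range(800))
    n_workers = mp.cpu_count()
    chunksize = pick_chunksize(len(items), n_workers)

    with mp.Pool(n_workers) as pool:
        # Results arrive in completion order, not input order
        results = list(pool.imap_unordered(simulated_work, items, chunksize=chunksize))

    print(len(results))  # 800
```

On platforms that start worker processes by spawning a fresh interpreter, the worker function generally needs to live in an importable module (such as `utils/imagefunctions.py`) rather than in a notebook cell.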

In [3]:
from pathlib import Path
from PIL import Image, ImageDraw
from utils.imagefunctions import imageprocess
import multiprocessing as mp
import time

image_dir = 'example_images'
image_dir = 'test_data/'  # overrides the line above; point this at whichever folder you want to process
output_path = 'output'

# Generate Path objects
raw_path = Path.cwd().joinpath(image_dir)
image_paths = raw_path.glob("*.JPG")

# imageprocess takes two arguments, so pair every image path with the output path
tasks = [(image_path, output_path) for image_path in image_paths]

# Run the tasks across a pool of worker processes, one per cpu;
# the with block shuts the pool down once the work is done
with mp.Pool(mp.cpu_count()) as pool:
    start_time = time.time()
    pool.starmap(imageprocess, tasks)
    end_time = time.time()

print(f"Runtime of the multiprocessing is {end_time - start_time} seconds")



Runtime of the multiprocessing is 11.744295120239258 seconds
