# Python Image Manipulation with Thousands of Files
## Or, How to Get Computers to Do Tedious Things for You


### Before you start

Most of this session will run in a [browser version of this session](https://andre-geldenhuis.github.io/python-image-manipulation-session/), where you can do most of the single-threaded steps without installing anything. (It uses the [JupyterLite Xeus kernel](https://github.com/jupyterlite/xeus), for those interested.)

Another way to follow along is to install JupyterLab, Git and a terminal locally. See the [setup instructions](https://github.com/andre-geldenhuis/python-image-manipulation-session/blob/main/SETUP.md).

If you're going to run the session in your own Jupyter environment you'll need:

* A local Python install (the system version is fine for macOS and most Linux distros)
* Git
* A GitHub account
* Git configured with your username and email address
* Jupyter and Pillow installed in either a Python virtual environment or a conda environment

## Intro

Sometimes you find yourself performing repetitive tasks on a computer. Since computers excel at repetitive tasks, we'll explore an example that involves manipulating tens of thousands of images. To make this session accessible to everyone, the first half will run in a pure-browser JupyterLite session. Afterward, those who have set up a local Jupyter session will switch to it to use multiprocessing.

During this session, we'll cover:

* Using Jupyter notebooks for prototyping.
    * Shifting the code to Python functions for version control and clarity.
* Downloading and working with large datasets
* Using Python virtual environments
* Iteratively writing code
* Version controlling our code as we progress
* Optimising code to run in parallel for increased speed

We'll do most of this using Python in a JupyterLite notebook. If you've completed the setup, you should be able to follow along in your own local Jupyter session. If you haven't managed to get through the setup, the browser JupyterLite session will be good enough for most of the session.

<div style="color: white; background-color: #2196F3; border-left: 6px solid #1976D2; padding: 0.5em;">
    <strong>Note:</strong> For those following along in their own, non-JupyterLite session, make sure you've installed Pillow. See the setup instructions.
</div>

## The Dataset

From 2014 to 2018 there was a campaign using automatic cameras (camera traps) in the Wellington region. The large number of captured images, 270,450 by 2018, required some preprocessing before they could be identified with the help of citizen scientists.

The images have a bottom bar that contains the date and time, which we will remove. We might also want to remove the manufacturer logo in case the images are later used for machine learning.

![Example image](docs/images/wellington_thumb_800.jpg "Example Image")

We will use camera trap images to experiment with image manipulation on a small scale, then eventually scale up to potentially hundreds of thousands of images. The data and instructions for accessing it are at https://lila.science/datasets/wellingtoncameratraps

### Example Images

We will start working with the example images in `example_images`.  Once we have some tidy working code, we will cover how you'd download a large subset of the images and work on that.

### Let's write some Python! - Listing files with *pathlib*

There are many ways to generate a list of files in Python. We're going to use `pathlib` and the `glob` method of its `Path` objects. Note that `pathlib` was only introduced in Python 3.4, but most Python versions in use today should be newer than that.

We'll build up our Python program iteratively.


In [1]:
from pathlib import Path

# See what the path to our example_images is represented as
our_path = Path.cwd().joinpath('example_images')
our_path

PosixPath('/Users/geldenan/notbacked/python-image-manipulation-session/example_images')

In [2]:
# let's try to get all the files
imfiles = our_path.glob("*")
imfiles

<generator object Path.glob at 0x1040bc6a0>

> A generator in Python is a type of iterable that lazily produces items one at a time, only as needed, which can be more memory efficient than generating all items at once. Generators are useful when you're working with large datasets or streams of data and don't want to hold all items in memory simultaneously.
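
To make the lazy behaviour concrete, here is a minimal sketch using a hand-written generator. The function name `count_up_to` is made up for illustration and is not part of the session code. It shows that a generator hands out items one at a time and is empty once it has been consumed, which is also why a `Path.glob` result can only be looped over once.

```python
def count_up_to(limit):
    """Yield 1, 2, ..., limit lazily, one value per request."""
    n = 1
    while n <= limit:
        yield n
        n += 1


numbers = count_up_to(3)

first = next(numbers)   # pulls a single item from the generator
rest = list(numbers)    # consumes whatever is left
again = list(numbers)   # the generator is now exhausted

print(first, rest, again)  # 1 [2, 3] []
```

If you need to loop over the same files more than once, one option is to call `glob` again, or to wrap the result in `list(...)` when the file list comfortably fits in memory.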


In [3]:
# It's just a generator object, let's iterate through it
for f in imfiles:
    print(f)

/Users/geldenan/notbacked/python-image-manipulation-session/example_images/010116045550034b9963.JPG
/Users/geldenan/notbacked/python-image-manipulation-session/example_images/010116042724043b5173.JPG
/Users/geldenan/notbacked/python-image-manipulation-session/example_images/010116060244029a3302.JPG


In [4]:
# What if we had non-image files? Restrict the pattern to the JPG files we downloaded
imfiles = our_path.glob("*.JPG")
# Alternative: rglob is a recursive glob that also searches subdirectories
imfiles = our_path.rglob("*.JPG")
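
The difference between `glob` and `rglob` is easier to see on a small throwaway directory. The sketch below builds one with `tempfile`. The file and folder names (`a.JPG`, `notes.txt`, `sub/b.JPG`) are invented for the example and are not part of the dataset. The results are sorted because the order in which `glob` yields files is not guaranteed.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Hypothetical layout: one image and one text file at the top level,
    # plus one image inside a subdirectory.
    (root / "a.JPG").touch()
    (root / "notes.txt").touch()
    (root / "sub").mkdir()
    (root / "sub" / "b.JPG").touch()

    # glob only looks in the directory itself
    top_level = sorted(p.name for p in root.glob("*.JPG"))
    # rglob also descends into subdirectories
    recursive = sorted(p.name for p in root.rglob("*.JPG"))

print(top_level)   # ['a.JPG']
print(recursive)   # ['a.JPG', 'b.JPG']
```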

### Editing an image

Let's get a single image and crop the bottom of it.

In [5]:
from pathlib import Path

test_path = 'example_images'

our_path = Path.cwd().joinpath(test_path)

# Generate Path objects
imfiles = our_path.glob("*.JPG")

imfile = next(imfiles)  # next() pulls the next item from a generator
imfile

PosixPath('/Users/geldenan/notbacked/python-image-manipulation-session/example_images/010116045550034b9963.JPG')

In [6]:
from PIL import Image  # we'll need Pillow to load the image

im = Image.open(imfile)

In [7]:
# Explore the im object
im.height
im.size  # (width, height) in pixels; only this last expression is displayed

(3264, 2448)

In [None]:
im.show()
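
With the image size known, cropping the bottom bar comes down to choosing a box. Pillow's `Image.crop` takes a `(left, upper, right, lower)` tuple in pixels, measured from the top-left corner. The sketch below only does the arithmetic, so it runs without Pillow. The helper name `bottom_crop_box` and the bar height of 100 pixels are assumptions for illustration; the real bar height has to be measured on the images.

```python
def bottom_crop_box(size, bar_height):
    """Return a (left, upper, right, lower) box that drops the bottom bar_height pixels."""
    width, height = size
    if not 0 <= bar_height < height:
        raise ValueError("bar_height must be between 0 and the image height")
    return (0, 0, width, height - bar_height)


# (3264, 2448) is the im.size shown above; 100 is a placeholder bar height,
# not a measured value.
box = bottom_crop_box((3264, 2448), bar_height=100)
print(box)  # (0, 0, 3264, 2348)

# With Pillow loaded, the box could then be applied as:
#     cropped = im.crop(box)
```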