# 6. How To Download Multiple Images In Python

## Learning Outcomes

- To learn how to download multiple images in Python using synchronous and asynchronous code.

------------------------------------------------------

Automatically downloading images from a number of your HTML pages is an essential skill. In this guide you'll learn 4 methods for downloading images using Python!

---------------------------------------------------------------

Let's begin with the easiest example. If you already have a list of image URLs, we can follow this process:

1. Change into a directory where we would like to store all of the images.
2. Make a request to download all of the images, one by one.
3. We will also include error handling so that if a URL no longer exists the code will still work.

------------------------------------------------------------------------------------------------

## Python Imports

In [1]:
!pip install tldextract



In [2]:
import requests
import os
import subprocess
import urllib.request
from bs4 import BeautifulSoup
import tldextract

----------------

In [3]:
!mkdir all_images

In [4]:
!ls

all_images
asyncio-aiofiles.py
how-to-download-multiple-images-starter-code.ipynb
how-to-download-multiple-images.ipynb


Next we change into the folder called all_images. This can be done with either the `cd` shell command or Python's `os.chdir()`:

~~~

cd all_images
os.chdir('path')

~~~
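Changing the working directory affects every later relative path in the session. As an alternative sketch (not what this notebook does), the target folder can be created and passed around explicitly. `ensure_directory` and `save_bytes` are hypothetical helper names introduced only for illustration:

```python
from pathlib import Path


def ensure_directory(path):
    """Create the directory (and any missing parents) if it does not exist yet."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


def save_bytes(directory, file_name, content):
    """Write raw bytes to directory/file_name and return the full path."""
    target = Path(directory) / file_name
    target.write_bytes(content)
    return target
```

With these helpers, a download loop could call something like `save_bytes(ensure_directory('all_images'), file_name, r.content)` without ever calling `os.chdir()`.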

In [5]:
os.chdir('all_images')

In [6]:
!pwd

/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images


---------------------

## Method One: How To Download Multiple Images From A Python List

In order to download multiple images, we can use the [requests library](https://requests.readthedocs.io/en/master/). We'll also create a Python list to store any image URLs that didn't return a 200 status code:

In [7]:
broken_images = []

In [10]:
image_urls = ['https://sempioneer.com/wp-content/uploads/2020/05/dataframe-300x84.png',
             'https://sempioneer.com/wp-content/uploads/2020/05/json_format_data-300x72.png']

In [14]:
for img in image_urls:
    
    file_name = img.split('/')[-1]
    
    r = requests.get(img, stream=True)
    
    if r.status_code == 200:
        with open(file_name, 'wb') as f:
            for chunk in r:
                f.write(chunk)
    else:
        broken_images.append(img)
    
    print(file_name)
    
    print(img)

dataframe-300x84.png
https://sempioneer.com/wp-content/uploads/2020/05/dataframe-300x84.png
json_format_data-300x72.png
https://sempioneer.com/wp-content/uploads/2020/05/json_format_data-300x72.png


☝️ See how simple that is! ☝️

If you check your folder, you will now find every image whose request returned a status code of 200!
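One thing to watch with `img.split('/')[-1]` is that it keeps any query string in the file name. A minimal sketch of a more defensive alternative, assuming a hypothetical helper called `file_name_from_url` built on the standard library's `urllib.parse`:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def file_name_from_url(url, default="image"):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    # Fall back to a default when the URL ends with a slash
    return name or default


print(file_name_from_url('https://sempioneer.com/wp-content/uploads/2020/05/dataframe-300x84.png'))
# dataframe-300x84.png
print(file_name_from_url('https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g'))
# 17d8a69424a54d3957e1ce51755c6cfd
```

The second result has no file extension, so in practice you may still want to add one based on the response's content type.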

------------------------------------------------

![downloading images correctly with python](https://sempioneer.com/wp-content/uploads/2020/06/how-to-download-images-with-python.png)

----------------

## Method Two: How To Download Multiple Images From Many HTML Web Pages

If we don't yet have the exact image URLs, we will need to do the following:

1. Download the HTML content of every web page.
2. Extract all of the image URLs for every page.
3. Create the file names.
4. Check to see if the image status code is 200.
5. Write all of the images to your local computer.

This website [internetingishard.com](https://www.internetingishard.com/html-and-css/links-and-images/) has some relative image URLs. Therefore we will need to ensure that our code can handle the following two types of image source URLs:

---

- <strong> Absolute (exact) URL: https://www.internetingishard.com/html-and-css/links-and-images/html-attributes-6f5690.png </strong>
- <strong> Relative URL: /html-and-css/links-and-images/html-attributes-6f5690.png </strong>
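Both forms can be resolved against the page URL with the standard library's `urllib.parse.urljoin`. This is a sketch of an alternative to manual string concatenation, not the approach this notebook takes. The third source is a protocol-relative (`//`) URL of the kind found on understandingdata.com:

```python
from urllib.parse import urljoin

page_url = 'https://www.internetingishard.com/html-and-css/links-and-images/'

sources = [
    # Already absolute: returned unchanged
    'https://www.internetingishard.com/html-and-css/links-and-images/html-attributes-6f5690.png',
    # Relative to the site root: scheme and host are taken from page_url
    '/html-and-css/links-and-images/html-attributes-6f5690.png',
    # Protocol-relative: only the scheme is taken from page_url
    '//understandingdata.com/wp-content/uploads/2019/04/cropped-logo_transparent-1.png',
]

resolved = [urljoin(page_url, src) for src in sources]

for url in resolved:
    print(url)
```

Because `urljoin` takes the full page URL as its base, it keeps the `www.` prefix that a registered domain on its own would drop.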

---------------------------------------------------------------

In [18]:
web_pages = ['https://understandingdata.com/', 
             'https://understandingdata.com/data-engineering-services/',
             'https://www.internetingishard.com/html-and-css/links-and-images/']

We will also extract the domain of every URL whilst we loop over the webpages like so:
    
~~~

for page in web_pages:
    domain_name = tldextract.extract(page).registered_domain

~~~

In [19]:
url_dictionary = {}

In [24]:
for page in web_pages:
    # 1. Extract domain name
    domain_name = tldextract.extract(page).registered_domain
    print(domain_name)
    
    r = requests.get(page)
    
    if r.status_code == 200:
        
        url_dictionary[page] = []
        
        soup = BeautifulSoup(r.content, 'html.parser')
        
        images = soup.find_all('img')
        
        url_dictionary[page].extend(images)

understandingdata.com
understandingdata.com
internetingishard.com


--------------------------------------------------------

Now let's double check and filter our dictionary so that we only look at web pages where there was at least 1 image tag:

In [31]:
for key, value in url_dictionary.items():
    if len(value) > 0:
        print(value)

[<img alt="Just Understanding Data" height="136" src="//understandingdata.com/wp-content/uploads/2019/04/cropped-logo_transparent-1.png" width="1200"/>, <img alt="Just Understanding Data" height="136" src="//understandingdata.com/wp-content/uploads/2019/04/cropped-logo_transparent-1.png" width="1200"/>, <img src="https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg" style="border-radius: 50%; max-width:75%; padding-bottom:15px;"/>, <img class="desktop-image" src="https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg" style="border-radius: 50%; max-width:390px;"/>, <img alt="is web scraping illegal? header image" class="attachment-gutentype-thumb-masonry size-gutentype-thumb-masonry wp-post-image" height="370" sizes="(max-width: 370px) 100vw, 370px" src="https://understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Illegal-370x370.png" srcset="https://understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Il

--------------------------------------------------------

An easier way to write the above code would be via a dictionary comprehension:

In [35]:
cleaned_dictionary = {key: value for key, value in url_dictionary.items() if len(value) > 0}

We can now clean all of the image URLs inside of every dictionary key and change all of the relative URL paths to exact URL paths.

Let's start by printing out all of the different image sources to see how we might need to clean up the data below:

In [39]:
for key, value in cleaned_dictionary.items():
    for item in value:
        print(item.attrs['src'])

//understandingdata.com/wp-content/uploads/2019/04/cropped-logo_transparent-1.png
//understandingdata.com/wp-content/uploads/2019/04/cropped-logo_transparent-1.png
https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg
https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg
https://understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Illegal-370x370.png
https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g
https://understandingdata.com/wp-content/uploads/2020/03/web-scraping-tools-370x192.jpg
https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g
https://understandingdata.com/wp-content/uploads/2020/03/community-detection-370x238.png
https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g
https://understandingdata.com/wp-content/uploads/2020/03/what-is-web-scraping-370x370.png
https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g
ht

----------------------------------------------------------------------

For the scope of this tutorial, I have decided to:
    
- Skip the logo links that start with `//` (protocol-relative URLs).
- Prepend `https://` and the domain to the relative URLs.

In [41]:
all_images = []

for key, images in cleaned_dictionary.items():
    
    domain_name = tldextract.extract(key).registered_domain
    
    for image in images:
        
        # Some <img> tags have no src attribute, so skip those
        source_image_url = image.attrs.get('src')
        if not source_image_url:
            continue
        
        if source_image_url.startswith("//"):
            # Protocol-relative URL (the logo links): skipped in this tutorial
            pass
        elif domain_name not in source_image_url and 'http' not in source_image_url: 
            # Relative URL: prepend the scheme and domain
            url = "https://" + domain_name + source_image_url
            all_images.append(url)
        else:
            all_images.append(source_image_url)


In [42]:
all_images

['https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg',
 'https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg',
 'https://understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Illegal-370x370.png',
 'https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g',
 'https://understandingdata.com/wp-content/uploads/2020/03/web-scraping-tools-370x192.jpg',
 'https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g',
 'https://understandingdata.com/wp-content/uploads/2020/03/community-detection-370x238.png',
 'https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g',
 'https://understandingdata.com/wp-content/uploads/2020/03/what-is-web-scraping-370x370.png',
 'https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g',
 'https://understandingdata.com/wp-content/uploads/2020/03/installing-chromedriver-headless-browser.png',
 'https://secure.gra

-------------------------------------------------------------------------------------
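The cleaned list still contains repeated URLs (the same Gravatar image appears several times), so each one would be downloaded more than once. A small sketch of an order-preserving de-duplication step, using a hypothetical helper name `unique_in_order`:

```python
def unique_in_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


sample = [
    'https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg',
    'https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg',
    'https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g',
    'https://understandingdata.com/wp-content/uploads/2020/03/web-scraping-tools-370x192.jpg',
    'https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g',
]

deduplicated = unique_in_order(sample)
print(len(sample), len(deduplicated))
# 5 3
```
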

After cleaning the image URLs, we can now refer to method one for downloading the images to our computer! 

This time let's convert it into a function:

In [43]:
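def extract_images(image_urls_list:list, directory_path):
    
    os.chdir(directory_path)
    
    for img in image_urls_list:
        file_name = img.split('/')[-1]
        
        # Try the URL as given, then again with a www. prefix
        url_paths_to_try = [img, img.replace('https://', 'https://www.')]
        
        for url_image_path in url_paths_to_try:
            try:
                r = requests.get(url_image_path, stream=True)
                if r.status_code == 200:
                    with open(file_name, 'wb') as f:
                        for chunk in r:
                            f.write(chunk)
                    # Downloaded successfully, so skip the fallback URL
                    break
            except requests.exceptions.RequestException:
                # Request failed, so move on to the next URL variant
                continue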

In [44]:
path = '/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images'

In [45]:
extract_images(image_urls_list=all_images, directory_path=path)

Fantastic! 

There are still some cases that we didn't cover, including:

- http:// only image urls.
- http://www. only image urls.

But for the most part, you'll be able to download images in bulk!

---------------------------------------------

![how to download multiple images within python](https://sempioneer.com/wp-content/uploads/2020/06/all_images.png)

------------------------------------------------------------------------

## How To Speed Up Your Image Downloads

It's important when working with hundreds or thousands of URLs to avoid using a synchronous approach to downloading images. An asynchronous approach means that we can download multiple web pages or multiple images concurrently.

<strong> This means that the overall execution time will be much quicker! </strong>

--------------------

### ThreadPoolExecutor()

The ThreadPoolExecutor, part of the `concurrent.futures` module in Python's standard library, runs function calls concurrently across a pool of threads. In order to utilise it, we will make sure that our function only works on a single URL.

Then we will pass the image URL list into multiple workers ;) 

In [46]:
def extract_single_image(img):
    file_name = img.split('/')[-1]
    
    # Let's try both of these versions in a loop [https:// and https://www.]
    url_paths_to_try = [img, img.replace('https://', 'https://www.')]
    for url_image_path in url_paths_to_try:
        try:
            r = requests.get(url_image_path, stream=True)
            if r.status_code == 200:
                with open(file_name, 'wb') as f:
                    for chunk in r:
                        f.write(chunk)
                return "Completed"
        except Exception as e:
            # Request failed, so move on to the next URL variant
            continue
    # Neither variant returned a 200 response
    return "Failed"

In [47]:
all_images[0:5]

['https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg',
 'https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg',
 'https://understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Illegal-370x370.png',
 'https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g',
 'https://understandingdata.com/wp-content/uploads/2020/03/web-scraping-tools-370x192.jpg']

------------------------------------------------------------------------------------------

The below code will create a new directory and then make it the current active working directory:

In [48]:
os.mkdir('/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_asnyc')

In [49]:
os.chdir('/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_asnyc')

In [50]:
import concurrent.futures
import urllib.request

In [51]:
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Map every future back to the URL it is downloading
    future_to_url = {executor.submit(extract_single_image, image_url): image_url for image_url in all_images}
    
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
    

You should've downloaded the images but at a much faster rate! 
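To see where the speed-up comes from without touching the network, here is a small sketch that replaces the real request with a short sleep. `fake_download` and the example.com URLs are placeholders invented for this illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, delay=0.05):
    """Stand-in for a network request: wait, then report the URL as done."""
    time.sleep(delay)
    return url, "Completed"


urls = [f"https://example.com/image-{i}.png" for i in range(10)]

# One after another: total time is roughly 10 x delay
start = time.perf_counter()
sequential = [fake_download(u) for u in urls]
sequential_seconds = time.perf_counter() - start

# Five worker threads: the waits overlap, so total time is roughly 2 x delay
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    # executor.map returns results in the same order as the input URLs
    threaded = list(executor.map(fake_download, urls))
threaded_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

Threads help here because the work is mostly waiting on I/O. The exact timings will vary from machine to machine.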

-------------------------------------------------------------------------------------

### Async Programming! 

Just like JavaScript, Python has native support for co-routines: the `async`/`await` syntax (Python 3.5+) together with the standard library's [asyncio](https://docs.python.org/3/library/asyncio.html) module. Similar to NodeJS, asyncio runs your async code on an event loop, and it also lets you create and manage event loops yourself.

We will also need to install an asynchronous HTTP client library called [aiohttp](https://docs.aiohttp.org/en/stable/):

In [52]:
!pip install aiohttp



We will also install aiofiles, which allows us to write multiple image files asynchronously:

In [53]:
!pip install aiofiles



In [54]:
import aiohttp
import aiofiles
import asyncio

------------------------------------------------------

In [55]:
os.mkdir('/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_async_event_loop')


In [58]:
os.chdir('/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_async_event_loop')

--------------------------------------------

## How To Download 1 File Asynchronously

In [59]:
print(all_images[0:1])

['https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg']


In [60]:
single_image = all_images[0:1]

'https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg'

In [72]:
# Top-level async/await works in this cell because Jupyter already runs an event loop
async with aiohttp.ClientSession() as session:
    async with session.get(single_image[0]) as resp:
        # 1. Capturing the image file name like we did before:
        single_image_name = single_image[0].split('/')[-1]
        # 2. Only proceed further if the HTTP response is 200 (Ok)
        if resp.status == 200:
            # The async with block closes the file for us when it exits
            async with aiofiles.open(single_image_name, mode='wb') as f:
                await f.write(await resp.read())

![Downloading one image with aiofiles](https://sempioneer.com/wp-content/uploads/2020/06/image_files.png)

---------------------------------------------------------------------------------------------------------


We will need to structure our code slightly differently for the async version to work across multiple files:

1. We will have a fetch function to query every image URL.
2. We will have a main function that creates, then executes a series of co-routines.



In [73]:
async def fetch(session, url):
    async with session.get(url) as resp:
        # 1. Capturing the image file name like we did before:
        url_name = url.split('/')[-1]
        # 2. Only proceed further if the HTTP response is 200 (Ok)
        if resp.status == 200:
            # The async with block closes the file for us when it exits
            async with aiofiles.open(url_name, mode='wb') as f:
                await f.write(await resp.read())

In [75]:
async def main(image_urls:list):
    headers = {
        "user-agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
    async with aiohttp.ClientSession(headers=headers) as session:
        # Build the co-routines without awaiting them, so they can run concurrently
        tasks = [fetch(session, image) for image in image_urls]
        # gather() has to run while the session is still open
        data = await asyncio.gather(*tasks)

In [76]:
main(all_images)

<coroutine object main at 0x10bed09e0>

In [77]:
asyncio.run(main(all_images))

RuntimeError: asyncio.run() cannot be called from a running event loop

☝️☝️☝️ Notice how calling `main(all_images)` on its own doesn't actually run anything; it only produces a [co-routine!](https://docs.python.org/3/library/asyncio-task.html) ☝️☝️☝️

We can then use asyncio as a method for executing all of the fetch co-routines that need to be completed. Inside a notebook, however, `asyncio.run()` raises the error shown above:

![Error with asyncio.run](https://sempioneer.com/wp-content/uploads/2020/06/error-downloading-python-files.png)

If you receive this type of error when running the following command:

~~~

asyncio.run(main(all_images))

~~~


---

<strong> It is likely because asyncio.run() cannot be called while another event loop is already running in the same thread (Jupyter notebook runs in an event loop!). Inside a notebook you can run `await main(all_images)` in a cell instead. </strong>
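The difference between the two environments can be checked from code. This is a minimal sketch using only the standard library; `event_loop_is_running` and `demo` are hypothetical names used for illustration:

```python
import asyncio


def event_loop_is_running():
    """Return True when called from inside an already running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running
        return False
    return True


async def demo():
    # Inside a co-routine a loop is always running
    return event_loop_is_running()


if not event_loop_is_running():
    # Plain script or terminal: asyncio.run() creates and closes the loop for us
    print(asyncio.run(demo()))
# In a notebook cell the equivalent would be: print(await demo())
```

Run as a plain script this prints `True`, because `demo()` executes inside the loop that `asyncio.run()` created.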

---------------------------------------------------------------

## How To Download Multiple Images Inside Of A Python File (.py)

Let's save the variable containing our URLs to a .txt file:

In [78]:
with open('images.txt', 'w') as f:
    for item in all_images:
        f.write(f"{item}\n")

In [79]:
with open('images.txt', 'r') as f:
    data = f.read()

------------------------------------------

### Create A Python File

Then you will need to create a python file and add the following code to it:

------------------------------------------

Then run the Python script in <strong> either your terminal or command line with: </strong>
    
    
~~~

python3 python_file_name.py


~~~

---------------------------------------------------------------

Let's break down what's happening in the above code snippet:
    
1. We are importing all of the relevant packages for async programming with files.
2. Then we create a new directory.
3. After creating the new folder we change that folder to be the active working directory.
4. We then read the image URLs that were previously saved to the file called images.txt.
5. Then we create a series of co-routines and execute them within a main() function with asyncio.
6. As these co-routines are executed every file is asynchronously saved to your computer.


![downloading multiple files with asyncio-aiohttp](https://sempioneer.com/wp-content/uploads/2020/06/asyncio-with-aiofiles.png)
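With hundreds of URLs, creating one co-routine per image means every request starts at once. One common way to cap that is an `asyncio.Semaphore`. The sketch below simulates the downloads with `asyncio.sleep`; `fake_fetch`, `download_all` and the example.com URLs are placeholders, so treat it as an illustration of the pattern, not as part of the original script:

```python
import asyncio


async def fake_fetch(url, semaphore, state):
    """Stand-in for a download that records how many run at the same time."""
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulated network wait
        state["active"] -= 1
    return url


async def download_all(urls, limit=3):
    # At most `limit` co-routines can be inside the semaphore block at once
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(fake_fetch(u, semaphore, state) for u in urls))
    return results, state["peak"]


urls = [f"https://example.com/image-{i}.png" for i in range(10)]
results, peak = asyncio.run(download_all(urls, limit=3))
print(len(results), peak)
# 10 3
```

In a real version, the `async with semaphore:` block would wrap the `session.get(...)` call inside `fetch`.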

------------------------------------------------------

Finally, let's delete all of the folders to clean up our environment:

In [None]:
all_paths = ['/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images',
            '/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_asnyc',
            '/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_async_event_loop'
            ]

In [82]:
!pwd


/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images_async_event_loop


In [None]:
import shutil
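The deletion step itself is not shown in the notebook. A hedged sketch of how it might look with `shutil.rmtree` follows; `remove_directories` is a hypothetical helper, and the demonstration runs on throwaway temporary folders instead of the real download directories:

```python
import os
import shutil
import tempfile


def remove_directories(paths):
    """Delete each directory tree and return the paths that were actually removed."""
    removed = []
    for path in paths:
        # Skip paths that do not exist (or were already deleted)
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(path)
    return removed


# Demonstration on temporary folders so nothing real is deleted
base = tempfile.mkdtemp()
demo_paths = [os.path.join(base, name) for name in ('all_images', 'all_images_async')]
for path in demo_paths:
    os.makedirs(path)

removed = remove_directories(demo_paths + [os.path.join(base, 'missing')])
print(len(removed))
# 2
shutil.rmtree(base)
```

In the notebook you would pass `all_paths` to the helper, ideally after an `os.chdir()` out of the folders being deleted, since the current working directory is one of them.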

---------------------------------------------------------------------------------------------------------

Whether you decide to download images synchronously or asynchronously, it's important to realise that, although you can do this with tools such as ScreamingFrog or Google Chrome extensions, being able to download images with Python extends your automation capabilities and lets you combine that image data with other programs, APIs, etc.!