# Scraping Images

## Introduction

Now that you've had some practice scraping, let's look at another kind of data you're apt to want to pull from the web: images! In this lesson, you'll see how to save images from the web and how to display them in a pandas DataFrame for easy perusal.

## Objectives

You will be able to:

* Save Images from the Web
* Display Images in a Pandas DataFrame

## Grabbing an HTML Page

Start with the same page that you've been working with: books.toscrape.com.

<img src="images/book-section.png">

In [1]:
from bs4 import BeautifulSoup
import requests

In [2]:
html_page = requests.get('http://books.toscrape.com/')  # Make a GET request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser')  # Pass the page contents to Beautiful Soup for parsing
warning = soup.find('div', class_="alert alert-warning")
book_container = warning.next_sibling.next_sibling  # Hop two siblings past the warning to reach the book container
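
The double sibling hop is easy to misread. A minimal sketch, using a made-up HTML fragment rather than the real page, suggests why it can be needed: with `html.parser`, the whitespace between two tags is itself a sibling node, so the first hop lands on a newline and the second on the next element. The fragment and the `id="books"` attribute are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a warning div, a newline, then the container we want
html = """<div class="alert alert-warning">Warning!</div>
<div id="books"><img alt="Example" src="a.jpg"/></div>"""

soup = BeautifulSoup(html, "html.parser")
warning = soup.find("div", class_="alert alert-warning")

first = warning.next_sibling   # the newline text node between the two divs
second = first.next_sibling    # the div that follows the warning

# find_next_sibling skips text nodes and goes straight to the next matching tag
same_div = warning.find_next_sibling("div")
```

Here `find_next_sibling("div")` reaches the same element in one call, which may be less fragile if the page's whitespace changes.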

## Finding Images

First, retrieve a list of images by searching for `img` tags with Beautiful Soup:

In [3]:
images = book_container.find_all('img')
ex_img = images[0]  # Preview an entry
ex_img

<img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/>

In [5]:
# Tab completion on ex_img shows which attributes and methods are available.
# While there are plenty of others to explore, simply select the image URL for now.
ex_img.attrs['src']

'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'
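
The same lookup extends to every image at once. The sketch below runs on a small hand-written HTML fragment, not the live page; the second book entry is invented for illustration. It collects `(alt, src)` pairs with `Tag.get`, which returns `None` for a missing attribute instead of raising a `KeyError` the way `attrs['src']` would.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; the second entry is a hypothetical example
html = """
<ol>
  <li><img alt="A Light in the Attic" class="thumbnail"
           src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></li>
  <li><img alt="Hypothetical Second Book" class="thumbnail"
           src="media/cache/00/00/example.jpg"/></li>
  <li><img class="thumbnail"/></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# One (title, relative URL) pair per img tag; missing attributes become None
pairs = [(img.get("alt"), img.get("src")) for img in soup.find_all("img")]
```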

## Saving Images

Great! Now that you have a URL (more precisely, a relative path that must be appended to the site's base URL), you can download the image locally!

In [6]:
import shutil

In [7]:
url_base = "http://books.toscrape.com/"
url_ext = ex_img.attrs['src']
full_url = url_base + url_ext
r = requests.get(full_url, stream=True)
if r.status_code == 200:
    with open("images/book1.jpg", 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
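
String concatenation works here because the base URL ends in a slash and the image path is relative to the site root. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a relative path against a page URL. The `catalogue/page-2.html` base and the shortened image paths below are used purely as illustrations.

```python
from urllib.parse import urljoin

url_ext = "media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"

# Same result as url_base + url_ext when the base ends with a slash
full_url = urljoin("http://books.toscrape.com/", url_ext)

# Still correct when the trailing slash is missing from the base
no_slash = urljoin("http://books.toscrape.com", "media/x.jpg")

# Resolves "../" segments relative to the page the link was found on
nested = urljoin("http://books.toscrape.com/catalogue/page-2.html",
                 "../media/cache/x.jpg")
```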

## Showing Images in the File Directory

You can also run a simple shell command in a standalone cell to confirm that the image is indeed there:

In [8]:
ls images/

book10.jpg  book14.jpg  book18.jpg  book2.jpg  book6.jpg  book-section.png
book11.jpg  book15.jpg  book19.jpg  book3.jpg  book7.jpg
book12.jpg  book16.jpg  book1.jpg   book4.jpg  book8.jpg
book13.jpg  book17.jpg  book20.jpg  book5.jpg  book9.jpg


## Previewing an Individual Image

In [11]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

In [12]:
img = mpimg.imread('images/book1.jpg')
imgplot = plt.imshow(img)
plt.show()

ValueError: Only know how to handle extensions: ['png']; with Pillow installed matplotlib can handle more images
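
The error message points at the likely cause: in this environment, matplotlib can only read PNG files unless Pillow is installed. As a minimal sketch, the check below uses the standard library to test whether a package can be imported. `has_module` is a hypothetical helper name, and note that Pillow is imported under the name `PIL`.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


# Pillow installs itself as the importable package "PIL"
pillow_available = has_module("PIL")
print("Pillow available:", pillow_available)
```

If this prints `False`, installing Pillow (for example with `pip install pillow`) and restarting the kernel should let `mpimg.imread` load the JPEG.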

In [13]:
!pip freeze

absl-py==0.7.1
alembic==1.0.9
asn1crypto==0.24.0
astor==0.7.1
atomicwrites==1.3.0
attrs==19.1.0
backcall==0.1.0
beautifulsoup4==4.7.1
bleach==1.5.0
boto==2.49.0
boto3==1.9.134
botocore==1.12.134
branca==0.3.1
bz2file==0.98
certifi==2018.8.13
cffi==1.11.5
chardet==3.0.4
cryptography==2.3.1
cryptography-vectors==2.3.1
cycler==0.10.0
Cython==0.28.5
decorator==4.3.0
defusedxml==0.6.0
docutils==0.14
entrypoints==0.3
folium==0.8.3
gast==0.2.2
gensim==3.4.0
graphviz==0.10.1
grpcio==1.20.1
h5py==2.9.0
html5lib==0.9999999
idna==2.7
imbalanced-learn==0.4.3
ipykernel==5.1.0
ipynb==0.5.1
ipython==6.5.0
ipython-genutils==0.2.0
jedi==0.12.1
Jinja2==2.10.1
jmespath==0.9.4
jsonschema==2.6.0
jupyter-client==5.2.4
jupyter-core==4.4.0
jupyterhub==0.8.1
jupyterlab==0.31.12
jupyterlab-launcher==0.10.5
Keras==2.2.1
Keras-Applications==1.0.4
Keras-Preprocessing==1.0.2
kiwisolver==1.1.0
Mako==1.0.9
Markdown==3.1
MarkupSafe==1.1.1
matplotlib==3.0.2
mistune

## Displaying Images in Pandas DataFrames

You can even display images within a pandas DataFrame by using a little HTML yourself!

In [16]:
import pandas as pd
from IPython.display import Image, HTML

In [18]:
row1 = [ex_img.attrs['alt'], '<img src="images/book1.jpg"/>']
df = pd.DataFrame(row1).transpose()
df.columns = ['title', 'cover']
HTML(df.to_html(escape=False))

| | title | cover |
|---|---|---|
| 0 | A Light in the Attic | <img src="images/book1.jpg"/> |
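
The `escape=False` argument is what makes this work. A small sketch, assuming the same single-row data as above, compares the two settings: by default `to_html` escapes the angle brackets so the tag shows up as literal text, while `escape=False` leaves the `<img>` tag intact for the browser to render.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["A Light in the Attic"],
    "cover": ['<img src="images/book1.jpg"/>'],
})

escaped_html = df.to_html()          # default: escape=True, tag becomes &lt;img ...
raw_html = df.to_html(escape=False)  # tag is kept as real HTML

print('&lt;img' in escaped_html)
print('<img src="images/book1.jpg"/>' in raw_html)
```

Only turn escaping off for content you trust, since any HTML in the cells will be rendered as-is.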


## All Together Now

In [22]:
data = []
url_base = "http://books.toscrape.com/"
for n, img in enumerate(images):
    url_ext = img.attrs['src']
    full_url = url_base + url_ext
    r = requests.get(full_url, stream=True)
    path = "images/book{}.jpg".format(n+1)
    title = img.attrs['alt']  # Title of the current image
    if r.status_code == 200:
        with open(path, 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)
        row = [title, '<img src="{}"/>'.format(path)]
        data.append(row)
df = pd.DataFrame(data)
print('Number of rows: ', len(df))
df.columns = ['title', 'cover']
HTML(df.to_html(escape=False))

Number of rows:  20


| | title | cover |
|---|---|---|
| 0 | A Light in the Attic | <img src="images/book1.jpg"/> |
| ... | ... | ... |
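
The loop above hard-codes the `.jpg` extension and assumes the `images/` folder already exists. As a hedged sketch, the hypothetical helper `local_image_path` below derives the extension from the image URL instead and falls back to `.jpg` when the URL has none. It only builds the path; it does not download anything.

```python
import os
from urllib.parse import urlparse


def local_image_path(index, src, out_dir="images"):
    """Build a local file path such as images/book1.jpg from an image URL."""
    ext = os.path.splitext(urlparse(src).path)[1] or ".jpg"
    # Forward slashes keep the path usable inside an <img src="..."> tag
    return "{}/book{}{}".format(out_dir, index, ext)


print(local_image_path(1, "media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"))
print(local_image_path(2, "media/cache/00/00/example.png"))
```

Calling `os.makedirs("images", exist_ok=True)` before the loop would create the folder if it is missing.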


## Summary

Voila! You now know how to find image tags on a page, save the images locally, and display them in a pandas DataFrame. Now go get scraping!