# Image Scraper

### Description
A web scraping utility for collecting album covers and record label board photos.

#### Rough Workflow
* Choose a target URL
* Preview the page HTML to find the relevant ID and class identifiers
* Search the parsed page with BS4 using those identifiers
* Label the images in a batch and save them in the specified directory
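
The identifier-based search in the middle two steps can be sketched on a small inline snippet. This is only an illustration: the markup, the `leadership` id, and the `portrait` and `logo` class names are made up for the example and are not taken from any real page.

```python
import bs4

# Hypothetical markup; the id and class names are assumptions for illustration.
html = """
<div id="leadership">
  <img class="portrait" src="https://example.com/a.jpg">
  <img class="portrait" src="https://example.com/b.png">
</div>
<img class="logo" src="https://example.com/logo.png">
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# CSS selector: <img> tags with class "portrait" inside the element with id "leadership".
portraits = soup.select("#leadership img.portrait")
portrait_urls = [img["src"] for img in portraits]
print(portrait_urls)
```

Narrowing the selector this way keeps unrelated images, such as the logo in the snippet, out of the batch.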

### Sections
1. <a href=#data>Ingest Data</a>
2. <a href=#wrangle>Wrangle Data</a>
3. <a href=#eda>EDA</a>
4. <a href=#process>Process Data</a>
5. <a href=#export>Export Data</a>

<a id=data></a>
## Ingest Data

In [1]:
# Import libraries
import os
import re
import shutil

import bs4
import numpy as np
import pandas as pd
import requests
from PIL import Image  # Image module from the PIL (Pillow) package

In [2]:
# Declare data directories
company = "sony"
cache_dir = os.path.join("data", "img", company)  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists


In [3]:
# Declare search URL

rootURL = "https://www.sonymusic.com/executives/"

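# Fetch the root URL and parse its HTML into a bs4 object
rootURL_data = requests.get(rootURL)
# Name the parser explicitly so bs4 does not have to guess one (and warn about it)
rootURL_soup = bs4.BeautifulSoup(rootURL_data.text, "html.parser")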

In [4]:
page_imgs = rootURL_soup.select('img')

In [5]:
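img_list = []
for img in page_imgs:
    img_str = str(img)
    # Stop at the closing quote of src so later attributes are not captured
    url = re.search(r'src="([^"]*)"', img_str)
    if url:  # skip <img> tags that have no src attribute
        img_list.append(url[1])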
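
As an alternative to searching the tag's string form with a regular expression, BeautifulSoup exposes tag attributes directly. The following is a minimal sketch on an inline snippet rather than the live page; `extract_srcs` and the sample markup are hypothetical names introduced for illustration.

```python
import bs4


def extract_srcs(html):
    """Return the src value of every <img> tag that has a non-empty one."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # Tag.get returns None when the attribute is missing, so those tags are skipped.
    return [img.get("src") for img in soup.select("img") if img.get("src")]


# Hypothetical sample: one image with extra attributes, one without a src.
sample = '<img src="//cdn.example.com/a.jpg" width="310px"><img alt="no source">'
srcs = extract_srcs(sample)
print(srcs)
```

Attribute access returns only the `src` value, so neighbouring attributes such as `width` never end up in the URL.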

In [6]:
for i, image_url in enumerate(img_list):
    # Set up the image URL and filename
    image_format = image_url.split('.')[-1]
    filename = str(i) + '.' + image_format

    try:
        # Request the image with stream=True so the body is exposed as a raw stream.
        r = requests.get(image_url, stream=True)

        # Check if the image was retrieved successfully
        if r.status_code == 200:
            # Have urllib3 decode gzip/deflate-encoded responses so the saved bytes are the image itself.
            r.raw.decode_content = True

            # Open a local file in 'wb' (write binary) mode.
            with open(os.path.join(cache_dir, filename), 'wb') as f:
                shutil.copyfileobj(r.raw, f)
            print('Image successfully downloaded: ', filename)
        else:
            print('Image couldn\'t be retrieved')
    except requests.RequestException:
        print('Bad url: ' + image_url)

Image successfully downloaded:  0.jpg
Image successfully downloaded:  1.jpg
Image successfully downloaded:  2.jpg
Image successfully downloaded:  3.jpg
Image successfully downloaded:  4.jpg
Image successfully downloaded:  5.jpg
Image successfully downloaded:  6.jpg
Image successfully downloaded:  7.jpg
Image successfully downloaded:  8.jpg
Image successfully downloaded:  9.jpg
Image successfully downloaded:  10.jpg
Image successfully downloaded:  11.jpg
Image successfully downloaded:  12.jpg
Image successfully downloaded:  13.jpg
Bad url: //img.youtube.com/vi/9NZvM1918_E/hqdefault.jpg" width="310px
Image successfully downloaded:  15.png
Image successfully downloaded:  16.png
Image successfully downloaded:  17.png
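
The `Bad url` entry in the output is a protocol-relative address: it starts with `//` and so carries no scheme for `requests` to use. One way to handle such entries, sketched here with the standard library only, is to resolve each `src` against the page URL with `urljoin` and to take the file extension from the URL path. The helper names `absolutize` and `extension_of`, and the `jpg` fallback, are assumptions for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

PAGE_URL = "https://www.sonymusic.com/executives/"  # the page the images were found on


def absolutize(src, base=PAGE_URL):
    """Resolve a relative or protocol-relative src against the page URL."""
    return urljoin(base, src)


def extension_of(url, default="jpg"):
    """Take the extension from the URL path, ignoring any query string.

    The "jpg" fallback is an assumption, used when the path has no extension.
    """
    ext = posixpath.splitext(urlsplit(url).path)[1].lstrip(".")
    return ext or default


fixed = absolutize("//img.youtube.com/vi/9NZvM1918_E/hqdefault.jpg")
print(fixed)
print(extension_of(fixed + "?size=large"))
```

A protocol-relative address inherits the scheme of the page it was found on, so here it resolves to an `https://` URL that can be requested like any other.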


## To Do:
1. Edit the image display with PIL
2. Refactor to make this a custom utility class
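
For the second item, one possible shape for such a class is sketched below. It is a rough starting point, not a finished design: the class name `ImageScraper`, its method names, the injectable `fetch` callable, and the `.jpg` fallback extension are all assumptions introduced here.

```python
import os
from urllib.parse import urljoin, urlsplit

import bs4


class ImageScraper:
    """Hypothetical utility class wrapping the notebook's scraping steps."""

    def __init__(self, page_url, out_dir, fetch=None):
        self.page_url = page_url
        self.out_dir = out_dir
        # fetch(url) -> bytes; injectable so the class can be tried offline.
        self.fetch = fetch or self._fetch_with_requests
        os.makedirs(out_dir, exist_ok=True)

    @staticmethod
    def _fetch_with_requests(url):
        import requests  # imported lazily; only needed for real downloads

        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.content

    def image_urls(self, html):
        """Return absolute URLs for every <img> that has a src attribute."""
        soup = bs4.BeautifulSoup(html, "html.parser")
        return [urljoin(self.page_url, img["src"])
                for img in soup.select("img") if img.get("src")]

    def save_all(self, html):
        """Save each image to out_dir as <index>.<ext> and return the paths."""
        saved = []
        for i, url in enumerate(self.image_urls(html)):
            # Fall back to ".jpg" (an assumption) when the path has no extension.
            ext = os.path.splitext(urlsplit(url).path)[1] or ".jpg"
            path = os.path.join(self.out_dir, f"{i}{ext}")
            with open(path, "wb") as f:
                f.write(self.fetch(url))
            saved.append(path)
        return saved
```

Passing the page HTML in explicitly keeps parsing separate from downloading, so `image_urls` can be checked against a saved copy of the page before any files are written.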

<a id=wrangle></a>
## Wrangle Data

<a id=eda></a>
## EDA

<a id=process></a>
## Process Data

<a id=export></a>
## Export Data