## Corrupted images
In this notebook I explore the corrupted images.

In [3]:
import pandas as pd
import os
from PIL import Image
import requests
import urllib.request  # urlretrieve lives in the urllib.request submodule
import csv

### Identifying corrupted images
I will attempt to open every image and catch the exceptions raised, which identifies the corrupted files.

In [4]:
def corrupted_imgs(img_paths):
    # Collect the file names of images that PIL cannot open or verify
    corrupted = []
    for img_path in img_paths:
        try:
            with Image.open(img_path) as img:
                img.verify()
        except (IOError, SyntaxError):
            corrupted.append(os.path.basename(img_path))
    # DataFrame.append was removed in pandas 2.0, so build the frame in one step
    pd.DataFrame({"filename": corrupted}).to_csv("corrupted_photos.csv", index=False)

In [5]:
%%time
path_photos = "./photos/"
img_paths = [path_photos + img for img in os.listdir(path_photos)]
corrupted_imgs(img_paths)

Wall time: 474 ms


In [5]:
# Load the list of corrupted files and join it to the file holding all the URLs
def corrupted_urls(urls, corrupted_ids):
    photos_file = pd.read_table(urls, header=None)
    photos_file = photos_file[0].str.split(pat=",", n=1, expand=True)
    photos_file.columns = ["photo", "url"]
    
    corrupted = pd.read_csv(corrupted_ids)
    print(corrupted.shape[0], "corrupted images")
    
    corrupted["filename"] = corrupted["filename"].str.replace(".jpg", "", regex=False)  # literal match, not a regex
    photos_file["photo"] = photos_file["photo"].str.lstrip("0")
    corrupted_df = pd.merge(corrupted, photos_file, how="inner", left_on=["filename"], right_on=["photo"])
    url_domains = corrupted_df["url"].str.extract(r'(?:^https?:\/\/([^\/]+)(?:[\/,]|$)|^(.*)$)')
    print(url_domains[0].value_counts().sort_values(ascending=False))
    
    return corrupted_df

In [6]:
url_path = "./photos.txt" #path where the photos.txt file is located
corrupted = "./corrupted_photos.csv" #file location from where the previous function's output was saved to
corrupted_df = corrupted_urls(url_path, corrupted)

364 corrupted images
piperlime.gap.com    215
www.forever21.com    148
s3.amazonaws.com       1
Name: 0, dtype: int64
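
The regular expression in `corrupted_urls` pulls the host name out of each URL. As a rough alternative sketch, the standard library's `urllib.parse.urlsplit` can do the same job without a hand-written pattern. The helper name `count_domains` and the sample URLs are hypothetical and only serve as illustration; they are not part of the original notebook.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_domains(urls):
    """Count how many URLs belong to each host name."""
    counts = Counter()
    for url in urls:
        # hostname is None when the string has no scheme/host part,
        # so fall back to the raw string (as the regex above does)
        host = urlsplit(url).hostname or url
        counts[host] += 1
    return counts


# Hypothetical URLs, for illustration only
sample_urls = [
    "http://www.forever21.com/images/a.jpg",
    "http://www.forever21.com/images/b.jpg",
    "http://piperlime.gap.com/c.jpg",
]
print(count_domains(sample_urls).most_common())
```

Inside the notebook this could presumably be applied as `count_domains(corrupted_df["url"])`, assuming the `url` column contains no missing values.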


In [7]:
corrupted_df.head()

| | filename | photo | url |
|---|---|---|---|
| 0 | 100676 | 100676 | http://www.forever21.com/images/7_additional_7... |
| 1 | 100679 | 100679 | http://www.forever21.com/images/7_additional_7... |
| 2 | 100680 | 100680 | http://www.forever21.com/images/2_side_750/001... |
| 3 | 100986 | 100986 | http://www.forever21.com/images/2_side_750/001... |
| 4 | 100987 | 100987 | http://www.forever21.com/images/1_front_750/00... |


### Re-downloading corrupted images with urllib
For some reason the image_extraction function used in the download notebook corrupts some of the images. The same function has now been tweaked to allow downloading with the urllib library, which does not corrupt any images.

In [49]:
# Attempt the download with either requests or urllib
def image_extraction(df, requests_on=True):
    """
    df: DataFrame with the "url" and "photo" columns
    requests_on: download with requests if True, with urllib otherwise
    """

    img_path = "./photos_corrupted/"
    urls = df["url"].tolist()
    photo_ids = df["photo"].tolist()

    # Append a header row to the log of URLs that fail to download
    broken_urls = pd.DataFrame(columns=["photo", "url"])
    with open("broken_urls_corrupted.csv", "a", newline="") as f:
        broken_urls.to_csv(f, index=False)

    for url, photo_id in zip(urls, photo_ids):
        # str() keeps this working when the photo ids have been converted to int
        file_name = img_path + str(photo_id).lstrip("0") + ".jpg"
        try:
            if requests_on:
                r = requests.get(url, timeout=3,
                                 headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36"})
                if r.status_code == requests.codes.ok:
                    with open(file_name, "wb") as f:
                        f.write(r.content)
            else:
                urllib.request.urlretrieve(url, file_name)
        except Exception:
            # Log the failed download so it can be retried later
            with open("broken_urls_corrupted.csv", "a", newline="") as f:
                writer = csv.writer(f)
                writer.writerow([photo_id, url])

In [None]:
start_n = 0 #beware photos ids start at 1 not at 0 as python indexes. ie. start_n = 5216 will download from 5217
#finish_n = 8000 #ie. finish_n = 5218 will download until 5219 included
split_urls = corrupted_df.loc[start_n:]  # finish_n is commented out above, so slice to the end of the frame

In [None]:
%%time
image_extraction(split_urls, requests_on=False) #DO NOT OPEN CSV FILE WHILE SCRIPT IS RUNNING

---------------

### With urllib, some URLs are reported as broken although they are not

Some websites seem to have protection against scraping. In these cases the requests-based download works better, so I switch between urllib and requests until all corrupted images are correctly downloaded.
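
The manual switching described above could also be automated. The following is only a sketch, not the method used in this notebook: `download_with_fallback`, `fetch_plain` and `fetch_with_user_agent` are hypothetical helpers, and both fetchers use the standard library so the block stays self-contained. A requests-based fetcher could be plugged into the same list.

```python
import urllib.request


def fetch_plain(url):
    """Download with urllib and its default headers."""
    with urllib.request.urlopen(url, timeout=3) as resp:
        return resp.read()


def fetch_with_user_agent(url):
    """Download with urllib while sending a browser-like User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=3) as resp:
        return resp.read()


def download_with_fallback(url, fetchers):
    """Try each (name, fetcher) pair in order and return the first success.

    Each fetcher takes a URL and returns bytes, or raises on failure.
    Returns a (name, payload) tuple; raises RuntimeError if all fail.
    """
    errors = []
    for name, fetch in fetchers:
        try:
            return name, fetch(url)
        except Exception as exc:
            # Remember why this fetcher failed, then try the next one
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all fetchers failed for {url}: {errors}")
```

A call might look like `download_with_fallback(url, [("plain", fetch_plain), ("user_agent", fetch_with_user_agent)])`, with the URL logged as broken only when the `RuntimeError` is raised.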

In [None]:
broken_urls = pd.read_csv("./broken_urls_corrupted.csv")
broken_urls = broken_urls.dropna(axis=0)
broken_urls["photo"] = broken_urls["photo"].apply(int)

In [None]:
start_n = 1062 #beware photos ids start at 1 not at 0 as python indexes. ie. start_n = 5216 will download from 5217
# finish_n =  #ie. finish_n = 5218 will download until 5219 included
split_urls = broken_urls.loc[start_n:]

In [None]:
%%time
image_extraction(split_urls, requests_on=True) #DO NOT OPEN CSV FILE WHILE SCRIPT IS RUNNING

## Running again from the beginning

In [None]:
%%time
path_photos = "./photos/"
img_paths = [path_photos + img for img in os.listdir(path_photos)]
corrupted_imgs(img_paths)

In [13]:
url_path = "./photos.txt"
corrupted = "./corrupted_photos.csv"
corrupted_df = corrupted_urls(url_path, corrupted)

364 corrupted images
piperlime.gap.com    215
www.forever21.com    148
s3.amazonaws.com       1
Name: 0, dtype: int64


Images from piperlime and amazon come from broken URLs, while images from forever21 are still corrupted.  
I will re-download the forever21 images.

In [14]:
corrupted_df = corrupted_df[corrupted_df["url"].str.contains("forever21")]

In [18]:
corrupted_df[:2].values

array([['100676', '100676',
        'http://www.forever21.com/images/7_additional_750/00101719-04.jpg'],
       ['100679', '100679',
        'http://www.forever21.com/images/7_additional_750/00118396-03.jpg']],
      dtype=object)

In [33]:
start_n = 0 #beware photos ids start at 1 not at 0 as python indexes. ie. start_n = 5216 will download from 5217
#finish_n = 2 #ie. finish_n = 5218 will download until 5219 included
split_urls = corrupted_df.loc[start_n:]

In [50]:
%%time
image_extraction(split_urls, requests_on=False) #DO NOT OPEN CSV FILE WHILE SCRIPT IS RUNNING

Wall time: 7.01 s


The *piperlime.gap* and *amazon* URLs are genuinely broken, even though they download a jpg file. They will be double-checked in the final reconciliation notebook, so for now we leave them in the dataset folder.
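
A broken URL can still leave a file with a `.jpg` extension on disk. One possible quick check, assuming such files do not contain real JPEG data (for example an HTML error page saved under the image name), is to look at the first bytes of the file. The helper `looks_like_jpeg` is hypothetical and only inspects the signature, so it will not detect a truncated but otherwise genuine JPEG.

```python
JPEG_SIGNATURE = b"\xff\xd8\xff"  # start-of-image marker plus the first byte of the next marker


def looks_like_jpeg(path):
    """Return True if the file begins with the JPEG signature bytes."""
    with open(path, "rb") as f:
        return f.read(len(JPEG_SIGNATURE)) == JPEG_SIGNATURE
```

With the `img_paths` list built earlier, the suspicious files could be listed with `[p for p in img_paths if not looks_like_jpeg(p)]`.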