## Broken URLs
In this notebook I explore the broken URLs.

In [1]:
import pandas as pd
import os
import requests
import urllib.request
import csv
import shutil

I reuse the download function from the previous notebooks.

Some of these images are still available and will download with either urllib or requests. The downloads appear to fail somewhat randomly, possibly because of anti-scraping protection on some of these websites, so multiple attempts are performed.

The main goal is to download as many images as possible and then request the remaining ones from the team behind the original street2shop dataset.
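
The repeated attempts mentioned above could be wrapped in a small helper. This is only a sketch: `with_retries`, `attempts` and `delay` are hypothetical names that do not appear in the original notebooks, and the helper retries on any exception because it is unclear which errors the image hosts actually produce.

```python
import time

def with_retries(fetch, attempts=3, delay=0.0):
    """Call fetch() up to `attempts` times and return its first successful result.

    `fetch` is any zero-argument callable (hypothetical helper, not part of
    the original notebooks). If every attempt fails, the last error is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)  # short pause before the next attempt
    raise last_error
```

A single download could then be attempted as `with_retries(lambda: requests.get(url, timeout=5), attempts=3, delay=1.0)`, assuming `requests` is imported and `url` holds one image address.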

In [2]:
#attempting download with either requests or urllib
def image_extraction(df, requests_on=True):
    """
    Download every image listed in df and log the failures to broken_urls.csv.
    df = needs to be a dataframe with the url and photo columns
    """
    img_path = "./photos_v2/"
    log_path = "broken_urls.csv"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36"}

    #write the csv header only once, so repeated runs do not duplicate it
    if not os.path.exists(log_path):
        pd.DataFrame(columns=["photo", "url"]).to_csv(log_path, index=False)

    for url, photo_id in zip(df["url"], df["photo"]):
        target = img_path + str(photo_id) + ".jpg"
        try:
            if requests_on:
                r = requests.get(url, timeout=5, headers=headers)
                r.raise_for_status() #error responses are logged as broken too
                with open(target, "wb") as f:
                    f.write(r.content)
            else:
                urllib.request.urlretrieve(url, target)
        except Exception:
            with open(log_path, "a", newline="") as f:
                csv.writer(f).writerow([photo_id, url])

## Missing images
Rather than using a csv of broken links, I use the function below to check which images are missing from the dataset folder, which allows for easier and more robust reconciliation.

This process is repeated multiple times. For efficiency, each run targets one url domain at a time, and attempts are made with both requests and urllib.

In [21]:
def missing_images(list_pics, n):
    """Return a dataframe (filename, photo, url) for the ids in range(n) that have no .jpg in list_pics."""
    all_ids = [str(i) + ".jpg" for i in range(n)]
    missing = list(set(all_ids) - set(list_pics)) #compares all expected filenames up to n with the ones present in the folder
    print("Total photos in folder {}, Total photos missing {}".format(len(list_pics), len(missing)))
    
    missing_df = pd.DataFrame(missing, columns=["filename"])
    missing_df["filename"] = missing_df["filename"].str.replace(".jpg", "", regex=False) #literal match, so "." is not treated as a wildcard
    photos_file = pd.read_table("./photos.txt", header=None) #now loading the file with the urls so they can be joined up later
    photos_file = photos_file[0].str.split(pat=",", n=1, expand=True)
    photos_file.columns = ["photo", "url"]
    photos_file["photo"] = photos_file["photo"].str.lstrip("0")

    all_missing_df = pd.merge(missing_df, photos_file, how="inner", left_on=["filename"], right_on=["photo"])
    return all_missing_df

In [22]:
list_pics = os.listdir("./photos")
n_pics = 291050
missing = missing_images(list_pics, n_pics)

Total photos in folder 259430, Total photos missing 33955
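
To make the reconciliation step easier to follow, here is a toy version of the same set difference on a handful of filenames. `missing_ids` is a hypothetical helper written for illustration only; like the function above it starts counting at 0 and ignores anything in the folder that is not an expected `<id>.jpg` file.

```python
def missing_ids(filenames, n):
    """Return the ids in range(n) that have no matching '<id>.jpg' in filenames."""
    # ids that are actually present in the folder listing
    present = {name[:-4] for name in filenames if name.endswith(".jpg")}
    # ids we expect to find, mirroring range(n) in the notebook function
    expected = {str(i) for i in range(n)}
    return sorted(expected - present, key=int)

print(missing_ids(["1.jpg", "2.jpg", "4.jpg", "notes.txt"], 6))
# → ['0', '3', '5']
```

Files in the folder that fall outside the expected range, such as `notes.txt` here, still count toward the folder total but never reduce the missing total, so the two printed numbers do not have to add up to `n`.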


In [23]:
url_domains = missing["url"].str.extract(r'(?:^https?:\/\/([^\/]+)(?:[\/,]|$)|^(.*)$)')
print(url_domains[0].value_counts().sort_values(ascending=False))

g.nordstromimage.com             20376
images.bloomingdales.com          6343
productshots1.modcloth.net        1788
productshots0.modcloth.net        1746
productshots2.modcloth.net        1745
productshots3.modcloth.net        1724
www.forever21.com                  188
media.kohls.com.edgesuite.net       42
images.express.com                   1
ecx.images-amazon.com                1
Name: 0, dtype: int64
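
Since the downloads are targeted one domain at a time, the missing urls can also be grouped by host up front. The sketch below uses `urllib.parse.urlparse` from the standard library instead of the regular expression above. `group_by_domain` is a hypothetical helper and the example urls are made up.

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_domain(rows):
    """Group (photo_id, url) pairs by the host part of each url."""
    groups = defaultdict(list)
    for photo_id, url in rows:
        # netloc is the host, e.g. "images.example.com"; it is "" when the url has no scheme
        groups[urlparse(url).netloc].append((photo_id, url))
    return dict(groups)

# made-up urls, for illustration only
sample = [
    ("1", "http://a.example.com/x.jpg"),
    ("2", "http://b.example.com/y.jpg"),
    ("3", "http://a.example.com/z.jpg"),
]
for domain, pairs in group_by_domain(sample).items():
    print(domain, len(pairs))
```

With the dataframe from this notebook the call would look like `group_by_domain(zip(missing["photo"], missing["url"]))`, and each group could then be passed to the download step separately.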


In [None]:
urls_2 = missing[~missing["url"].str.contains("nordstrom")] #all nordstrom urls are broken, so they are excluded
#urls_modcloth = urls_2[urls_2["url"].str.contains("modcloth")] #downloading modcloth

### Downloading

In [13]:
start_n = 0 #beware: photo ids start at 1, not at 0 like python indexes, i.e. start_n = 5216 will download from 5217
#finish_n = 2 #ie. finish_n = 5218 will download until 5219 included
split_urls = missing.loc[start_n:]

In [1]:
%%time
image_extraction(split_urls, requests_on=False) #DO NOT OPEN CSV FILE WHILE SCRIPT IS RUNNING

--------

Once satisfied with the number of images downloaded, the remaining ones are saved to a csv file to be sent to the street2shop team.

In [29]:
missing.sort_values(by="photo").drop(columns=["filename"]).to_csv("broken_urls.csv", index=False)