# What's that Doggy in the Window?
## Data Gathering - API Calls

By: **Bryan Santos**

Have you ever seen a dog on social media or out with its owner and wanted to know its breed because you liked how it looked, whether tough or cute?

This project aims to build an application that lets users upload an image of a dog and get its breed. The application then assesses whether the breed's characteristics suit the user's lifestyle. If they do, the system redirects the user to dogs of that breed that are up for adoption. If not, the system suggests the five most compatible breeds.

The project uses two kinds of machine learning models to achieve its goals: multi-class image classification and recommendation systems.

The pet industry is a multi-billion dollar industry in the United States alone, and pet ownership is steadily rising. Unfortunately, so is the number of dogs that are left without a permanent home or are euthanized. Many people buy dogs because of fads or appearances and later abandon them, most likely because they do not realize that different breeds have distinct characteristics that may not match their lifestyles.

***

This is the second notebook in the series. The project cannot work without dog breed images to train the classification models on, so this part focuses on gathering the dataset, which in this case consists of thousands of dog breed images.

I will be getting my dataset from two sources. The first is www.dog.ceo, a site that provides API endpoints for retrieving all of its dog breed images. The second is the Udacity dataset stored in AWS.

## 1: Package Imports

Below are the libraries used to process API calls and their responses, which are usually in JSON format. Selenium will again be used to process the www.dog.ceo site.

In [28]:
import pandas as pd
import numpy as np
import json
import requests
import urllib.request as req
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import os, sys, random, string
from os import path

import tqdm
import shutil

In [29]:
%%capture

from tqdm import tqdm_notebook as tqdm
from tqdm import tnrange
tqdm().pandas()

***

## 2: Getting Images from www.dog.ceo

This is part 1 of this notebook. This is where we will download all images stored in www.dog.ceo through API calls.

### Initial Setup

In [4]:
### Initial Selenium and BeautifulSoup setup to get the breed list
url = 'https://dog.ceo/dog-api/breeds-list'
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')

In [5]:
### Collect all breeds that they have
### Note: get_attribute_list returns a list, so every entry is a one-element list
breed_list = []
select = soup.find('select', {'class': 'dog-selector'})
options = select.find_all('option')
for option in options:
    breed = option.get_attribute_list('value')
    breed_list.append(breed)

In [6]:
### Validate list
breed_list

[['affenpinscher'],
 ['african'],
 ['airedale'],
 ['akita'],
 ['appenzeller'],
 ['australian-shepherd'],
 ['basenji'],
 ['beagle'],
 ['bluetick'],
 ['borzoi'],
 ['bouvier'],
 ['boxer'],
 ['brabancon'],
 ['briard'],
 ['buhund-norwegian'],
 ['bulldog-boston'],
 ['bulldog-english'],
 ['bulldog-french'],
 ['bullterrier-staffordshire'],
 ['cairn'],
 ['cattledog-australian'],
 ['chihuahua'],
 ['chow'],
 ['clumber'],
 ['cockapoo'],
 ['collie-border'],
 ['coonhound'],
 ['corgi-cardigan'],
 ['cotondetulear'],
 ['dachshund'],
 ['dalmatian'],
 ['dane-great'],
 ['deerhound-scottish'],
 ['dhole'],
 ['dingo'],
 ['doberman'],
 ['elkhound-norwegian'],
 ['entlebucher'],
 ['eskimo'],
 ['finnish-lapphund'],
 ['frise-bichon'],
 ['germanshepherd'],
 ['greyhound-italian'],
 ['groenendael'],
 ['havanese'],
 ['hound-afghan'],
 ['hound-basset'],
 ['hound-blood'],
 ['hound-english'],
 ['hound-ibizan'],
 ['hound-plott'],
 ['hound-walker'],
 ['husky'],
 ['keeshond'],
 ['kelpie'],
 ['komondor'],
 ['kuvasz'],
 ['la
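
As an alternative to scraping the breed selector, the breed list can probably be read straight from the JSON API. The sketch below assumes the `https://dog.ceo/api/breeds/list/all` endpoint and a response whose `message` field maps each breed to a list of its sub-breeds. The helper names `flatten_breeds` and `fetch_breed_slugs` are hypothetical, and the result may not match the scraped list entry for entry.

```python
import json
import urllib.request

# Assumed endpoint that lists every breed with its sub-breeds as JSON
BREEDS_URL = 'https://dog.ceo/api/breeds/list/all'


def flatten_breeds(message):
    """Turn a {breed: [sub-breeds]} mapping into 'breed' / 'breed-subbreed' names."""
    slugs = []
    for breed, sub_breeds in sorted(message.items()):
        if not sub_breeds:
            # A breed without sub-breeds keeps its plain name
            slugs.append(breed)
        for sub in sub_breeds:
            # Same 'breed-subbreed' form as the scraped list above
            slugs.append(breed + '-' + sub)
    return slugs


def fetch_breed_slugs(url=BREEDS_URL):
    """Download the breed listing and flatten it (requires network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        data = json.load(response)
    return flatten_breeds(data['message'])


# Offline check with a hand-written sample of the assumed response shape
sample = {'beagle': [], 'bulldog': ['boston', 'english', 'french']}
print(flatten_breeds(sample))
# ['beagle', 'bulldog-boston', 'bulldog-english', 'bulldog-french']
```

Calling `fetch_breed_slugs()` would return plain strings rather than one-element lists, so the `item[0]` indexing used when building the endpoints would no longer be needed.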

In [20]:
### Convert breed list into API endpoints
link_list = []

for item in breed_list:
    link = 'https://dog.ceo/api/breed/' + item[0] + '/images'
    link_list.append(link)


In [22]:
### Clean the list
### Note: this loop has no effect, because rebinding `link` does not update link_list
for link in link_list:
    link = link.replace("-","/")

In [23]:
link_list

['https://dog.ceo/api/breed/affenpinscher/images',
 'https://dog.ceo/api/breed/african/images',
 'https://dog.ceo/api/breed/airedale/images',
 'https://dog.ceo/api/breed/akita/images',
 'https://dog.ceo/api/breed/appenzeller/images',
 'https://dog.ceo/api/breed/australian-shepherd/images',
 'https://dog.ceo/api/breed/basenji/images',
 'https://dog.ceo/api/breed/beagle/images',
 'https://dog.ceo/api/breed/bluetick/images',
 'https://dog.ceo/api/breed/borzoi/images',
 'https://dog.ceo/api/breed/bouvier/images',
 'https://dog.ceo/api/breed/boxer/images',
 'https://dog.ceo/api/breed/brabancon/images',
 'https://dog.ceo/api/breed/briard/images',
 'https://dog.ceo/api/breed/buhund-norwegian/images',
 'https://dog.ceo/api/breed/bulldog-boston/images',
 'https://dog.ceo/api/breed/bulldog-english/images',
 'https://dog.ceo/api/breed/bulldog-french/images',
 'https://dog.ceo/api/breed/bullterrier-staffordshire/images',
 'https://dog.ceo/api/breed/cairn/images',
 'https://dog.ceo/api/breed/cattle

In [24]:
### Clean the list
link_list = [link.replace('-', '/') for link in link_list]
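
The loop in In [22] leaves `link_list` unchanged, which is why the hyphens are still present in the output of In [23]. A small standalone example, using made-up list values, shows the difference between rebinding the loop variable and assigning a new list:

```python
links = ['a-b', 'c-d']

# Rebinding the loop variable creates new strings but never touches the list
for link in links:
    link = link.replace('-', '/')
unchanged = list(links)

# A comprehension builds a new list, which is then assigned back to the name
links = [link.replace('-', '/') for link in links]

print(unchanged)  # ['a-b', 'c-d']
print(links)      # ['a/b', 'c/d']
```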

In [26]:
### Prepare final breed list
breed_list = [link.replace('https://dog.ceo/api/breed/', '') for link in link_list]
breed_list = [link.replace('/images', '') for link in breed_list]
breed_list = [link.replace('/', '-') for link in breed_list]

In [27]:
breed_list[:5]

['affenpinscher', 'african', 'airedale', 'akita', 'appenzeller']
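
The conversions between breed names and endpoints used in this setup can be collected into two small helpers. This is only a sketch: the function names `slug_to_endpoint` and `endpoint_to_slug` are hypothetical, and it assumes that every endpoint has the prefix and suffix shown in the cells above.

```python
API_PREFIX = 'https://dog.ceo/api/breed/'
API_SUFFIX = '/images'


def slug_to_endpoint(slug):
    """'australian-shepherd' -> '.../breed/australian/shepherd/images'."""
    return API_PREFIX + slug.replace('-', '/') + API_SUFFIX


def endpoint_to_slug(endpoint):
    """Inverse of slug_to_endpoint; rejects URLs with an unexpected shape."""
    if not (endpoint.startswith(API_PREFIX) and endpoint.endswith(API_SUFFIX)):
        raise ValueError('unexpected endpoint: ' + endpoint)
    # Cut off the fixed prefix and suffix, then restore the hyphen
    core = endpoint[len(API_PREFIX):-len(API_SUFFIX)]
    return core.replace('/', '-')


endpoint = slug_to_endpoint('australian-shepherd')
print(endpoint)                    # https://dog.ceo/api/breed/australian/shepherd/images
print(endpoint_to_slug(endpoint))  # australian-shepherd
```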

### API Calls

Main API calls to download images.

In [97]:
### Go through each endpoint to download images
for i, link in enumerate(tqdm(link_list), start=1):

    ### Print status
    print('[' + str(i) + '] Getting images from ' + link)

    try:
        response = requests.get(link)
        data = response.json()

        ### Download all images
        for image in data.get('message'):
            clean = image.replace('https://images.dog.ceo/breeds/', '')
            breed, filename = clean.split('/')[:2]
            ### Make sure the breed folder exists before saving
            os.makedirs('images/' + breed, exist_ok=True)
            req.urlretrieve(image, 'images/' + breed + '/' + filename)
    except Exception as e:
        ### Skip endpoints that fail, but report why
        print('Skipped ' + link + ': ' + str(e))
        continue

[1] Getting images from https://dog.ceo/api/breed/affenpinscher/images
[2] Getting images from https://dog.ceo/api/breed/african/images
[3] Getting images from https://dog.ceo/api/breed/airedale/images
[4] Getting images from https://dog.ceo/api/breed/akita/images
[5] Getting images from https://dog.ceo/api/breed/appenzeller/images
[6] Getting images from https://dog.ceo/api/breed/australian/shepherd/images
[7] Getting images from https://dog.ceo/api/breed/basenji/images
[8] Getting images from https://dog.ceo/api/breed/beagle/images
[9] Getting images from https://dog.ceo/api/breed/bluetick/images
[10] Getting images from https://dog.ceo/api/breed/borzoi/images
[11] Getting images from https://dog.ceo/api/breed/bouvier/images
[12] Getting images from https://dog.ceo/api/breed/boxer/images
[13] Getting images from https://dog.ceo/api/breed/brabancon/images
[14] Getting images from https://dog.ceo/api/breed/briard/images
[15] Getting images from https://dog.ceo/api/breed/buhund/norwegia

[118] Getting images from https://dog.ceo/api/breed/terrier/bedlington/images
[119] Getting images from https://dog.ceo/api/breed/terrier/border/images
[120] Getting images from https://dog.ceo/api/breed/terrier/dandie/images
[121] Getting images from https://dog.ceo/api/breed/terrier/fox/images
[122] Getting images from https://dog.ceo/api/breed/terrier/irish/images
[123] Getting images from https://dog.ceo/api/breed/terrier/kerryblue/images
[124] Getting images from https://dog.ceo/api/breed/terrier/lakeland/images
[125] Getting images from https://dog.ceo/api/breed/terrier/norfolk/images
[126] Getting images from https://dog.ceo/api/breed/terrier/norwich/images
[127] Getting images from https://dog.ceo/api/breed/terrier/patterdale/images
[128] Getting images from https://dog.ceo/api/breed/terrier/russell/images
[129] Getting images from https://dog.ceo/api/breed/terrier/scottish/images
[130] Getting images from https://dog.ceo/api/breed/terrier/sealyham/images
[131] Getting images f
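
The two parsing steps inside the download loop, reading the image URLs out of the response and deriving a local path from each URL, can be separated from the network calls so that they are easy to check offline. The sketch below makes two assumptions that the notebook does not confirm: that a successful response carries `'status': 'success'` next to the `message` list, and that every image URL starts with the prefix stripped in the loop. The helper names and the sample file name are hypothetical.

```python
import os

IMAGE_PREFIX = 'https://images.dog.ceo/breeds/'


def extract_image_urls(data):
    """Return the image URLs from a decoded API response, or [] on failure."""
    message = data.get('message')
    # Assumption: failed lookups do not report 'success' or return a non-list message
    if data.get('status') != 'success' or not isinstance(message, list):
        return []
    return message


def local_path_for(image_url, root='images'):
    """Map an image URL to root/<breed>/<filename>, as the download loop does."""
    clean = image_url.replace(IMAGE_PREFIX, '')
    breed, filename = clean.split('/')[:2]
    return os.path.join(root, breed, filename)


ok = {'status': 'success', 'message': [IMAGE_PREFIX + 'hound-afghan/sample_1.jpg']}
failed = {'status': 'error', 'message': 'Breed not found'}

print(extract_image_urls(failed))  # []
for url in extract_image_urls(ok):
    print(local_path_for(url))
```

Checking the response before iterating avoids looping over the characters of an error message, which is what happens when `message` is a string instead of a list.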

***

## 3: Getting Udacity / AWS Images

This is the second part of the notebook, where I download an additional dataset from Udacity that is stored in AWS.

### Initial Setup and Download

In [None]:
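### The link where the additional dataset is downloaded
url = 'https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip'
### wget is not defined in this notebook, so use urlretrieve from the imports above
### (the local filename 'dogImages.zip' is an assumption)
req.urlretrieve(url, 'dogImages.zip')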
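
The notebook does not show how the downloaded archive ends up in the `aws/` folder that the next cell lists. One possible way, sketched here with the standard `zipfile` module, is to extract it there directly. The helper name `extract_archive` is hypothetical, and the demonstration uses a tiny archive created on the fly with a made-up folder name rather than the real download.

```python
import os
import tempfile
import zipfile


def extract_archive(zip_path, dest='aws'):
    """Unpack zip_path into dest and return the top-level entries found there."""
    os.makedirs(dest, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return sorted(os.listdir(dest))


# Offline demonstration with a small archive built in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, 'sample.zip')
    with zipfile.ZipFile(zip_path, 'w') as archive:
        archive.writestr('001.Breed_name/photo.jpg', b'')
    extracted = extract_archive(zip_path, dest=os.path.join(tmp, 'aws'))

print(extracted)  # ['001.Breed_name']
```

If the real archive nests its breed folders more deeply, they would still need to be moved so that they sit directly under `aws/`.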

In [36]:
### Setup
folder_list = os.listdir('aws/')

The following code blocks are commented out because they rename folders on the actual file system, so I only want to run them once. The folder names are changed to match the folder structure of the first dataset.

In [30]:
### Rename downloaded folders

#for folder in folder_list:
#    os.rename('aws/' + folder, 'aws/' + folder[4:])    

In [32]:
### Rename downloaded folders

#for folder in folder_list:
#    os.rename('aws/' + folder, 'aws/' + folder.lower())

In [34]:
### Reorder multi-word folder names to match the first dataset (a_b becomes b-a)

# for folder in folder_list:
#     new = folder.split('_')
#     if len(new) == 1:
#         continue
#     elif len(new) == 2:
#         new = new[1] + '-' + new[0]
#         os.rename('aws/' + folder, 'aws/' + new)
#     else :
#         new = new[2] + '-' + new[0] + '-' + new[1]
#         os.rename('aws/' + folder, 'aws/' + new)

In [37]:
### Rename downloaded folders

# for folder in folder_list:
#    os.rename('aws/' + folder, 'aws/' + folder.lower())
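
The renaming steps above can be combined into one pure function that computes the new name without touching the file system, which makes the rule easy to check before running the actual renames. This is a sketch of the same logic, not the author's code: the function name `normalize_folder` is hypothetical, and the sample folder names are illustrative guesses at what the downloaded folders look like.

```python
def normalize_folder(name):
    """Apply the renaming steps from the cells above to one folder name."""
    # Drop the first four characters, then lowercase the rest
    name = name[4:].lower()
    parts = name.split('_')
    if len(parts) == 1:
        # Single-word names stay as they are
        return name
    if len(parts) == 2:
        # 'irish_setter' -> 'setter-irish'
        return parts[1] + '-' + parts[0]
    # Three or more parts: third part first, then the first two
    return parts[2] + '-' + parts[0] + '-' + parts[1]


for sample in ['001.Affenpinscher', '002.Irish_setter', '003.German_shepherd_dog']:
    print(sample, '->', normalize_folder(sample))
```

A dry run like this could be printed for every entry in `folder_list` before uncommenting the `os.rename` calls.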

***

## 4: Data Record and Export

This section records all of the images downloaded from both sources in a dataframe, so that I have a reference of each image filename and its breed. These will be converted into labels later on.

In [40]:
### Browse through root directory and collect filenames and breeds (based on folders)
rows = []
rootdir = 'images/'

for subdir, dirs, files in tqdm(os.walk(rootdir)):
    for file in files:
        ### Skip macOS metadata files
        if file == ".DS_Store":
            continue
        rows.append({'filename': file, 'breed': subdir[len(rootdir):]})

### Create dataframe from the collected rows
### (DataFrame.append was removed in pandas 2.0)
images_df = pd.DataFrame(rows, columns=['filename', 'breed'])

In [42]:
### Validate
images_df.head(5)

|   | filename | breed |
|---|---|---|
| 0 | yr5nolw2d8qzb9i3scg4.jpg | setter-irish |
| 1 | 01g4tqhjevrdzo3pl2ky.jpg | setter-irish |
| 2 | lhyq7io1ek6gbwuc2tn8.jpg | setter-irish |
| 3 | 5971pwrfx0hunc3m4el2.jpg | setter-irish |
| 4 | o6vqbkigf4n32xmzrls5.jpg | setter-irish |


In [43]:
images_df.shape

(28669, 2)
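
The directory walk that produces these rows can also be written as a standalone function and checked on a small temporary folder. The sketch below uses only the standard library. The function name `collect_records` is hypothetical, and it derives the breed with `os.path.relpath` instead of slicing the path by a fixed length, which is a swapped-in technique rather than what the notebook does.

```python
import os
import tempfile


def collect_records(rootdir):
    """Return one {'filename', 'breed'} dict per image found under rootdir."""
    records = []
    for subdir, dirs, files in os.walk(rootdir):
        # The folder path relative to rootdir is used as the breed label
        breed = os.path.relpath(subdir, rootdir)
        for file in sorted(files):
            if file == '.DS_Store':
                # Skip macOS metadata files
                continue
            records.append({'filename': file, 'breed': breed})
    return records


# Offline demonstration on a throwaway folder with one image and one metadata file
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'setter-irish'))
    for name in ('a.jpg', '.DS_Store'):
        open(os.path.join(root, 'setter-irish', name), 'w').close()
    records = collect_records(root)

print(records)  # [{'filename': 'a.jpg', 'breed': 'setter-irish'}]
```

The returned list can be passed to `pd.DataFrame` to get the same two columns as `images_df`.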

In [45]:
### Check number of breeds
len(images_df.breed.unique())

173

The number of breeds above will later be reduced to just the top 50 breeds according to the AKC.

In [46]:
### Export to csv
images_df.to_csv("images_df.csv")