# Macy's Product List Crawling

The task is to collect a list of the products available on macys.com.

In [178]:
import requests 
from bs4 import BeautifulSoup

## Step 1: collect paths

The most straightforward approach is to start from the sitemap (site index page) of macys.com.

The sitemap looks like:
* category 1:
    * sub-category 1
    * sub-category 2
    * ...
* category 2:
    * ...

We can then collect the URL of each category and crawl all the products in each category and sub-category.

In [179]:
url = 'https://www.macys.com/cms/slp/2/Site-Index'
# You may need to change the User-Agent string to match your browser
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

Inspecting the HTML of the sitemap page shows that all the URLs are embedded in div blocks with class `sitelink_container`.

In [180]:
divTags = soup.find_all("div", {"class": "sitelink_container"})

A look at the categories shows:

* some sub-categories are the same as their parent, e.g. Pants has a sub-category Pants
* some sub-categories contain "All" in their name

In these situations the other sub-categories in the same category do not have to be included, which avoids repeated searching.

Categories that are not products, such as store locations, do not need to be included either.

The categories Bath, Shoes, Watches, and Juniors, whose links do not lead to a product list, are edge cases and get hard-coded URLs.

In [181]:
# Edge cases (Bath, Shoes, Watches, Juniors): the sitemap link does not
# lead to a product list, so a known list URL is used instead
edge_cases = {
    'Bathroom Collections': 'https://www.macys.com/shop/bed-bath/shower-accessories?id=8237&edge=hybrid',
    'Shoes': 'https://www.macys.com/shop/mens-clothing/shop-all-mens-footwear?id=55822&edge=hybrid',
    'Watches': 'https://www.macys.com/shop/jewelry-watches/womens-watches?id=57385',
    'Juniors Clothing': 'https://www.macys.com/shop/junior-clothing/shop-all-juniors-apparel?id=60983&edge=hybrid',
}

url_list = []
for divTag in divTags:
    heading = divTag.find('h2').text
    if heading in edge_cases:
        url_list.append(edge_cases[heading])
        continue
    ### General cases
    first_link = divTag.find('a', href=True)
    if heading == first_link.text:
        # The first sub-category repeats the category name: one link is enough
        url_list.append(first_link['href'])
    else:
        sub_list = []
        flag_all = True  # stays True until a link containing "All" is found
        for a in divTag.find_all('a', href=True):
            sub_list.append(a['href'])
            if "All" in a.text:
                url_list.append(a['href'])
                flag_all = False
        if flag_all:
            # No "All" link: every sub-category has to be crawled
            url_list += sub_list

    # Stop after the Rugs block; the remaining blocks are skipped
    if heading == 'Rugs':
        break

In [182]:
len(url_list)

264
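The selection rules used in the loop above can also be written as a small pure function that works on plain `(text, href)` pairs, which makes them easy to check without fetching the site. This is only a sketch of the general cases: the name `select_category_urls` and the sample links are hypothetical, and the hard-coded edge cases are left out.

```python
def select_category_urls(heading, links):
    """Pick the URLs to crawl for one sitemap block.

    heading: text of the block's <h2>.
    links:   list of (text, href) pairs in page order.
    """
    if not links:
        return []
    first_text, first_href = links[0]
    # Rule 1: the first link repeats the heading, so it covers the category.
    if first_text == heading:
        return [first_href]
    # Rule 2: links containing "All" cover their sibling sub-categories.
    all_links = [href for text, href in links if "All" in text]
    if all_links:
        return all_links
    # Otherwise every sub-category has to be crawled.
    return [href for _, href in links]


# Made-up sample data, not taken from the real sitemap
print(select_category_urls("Pants", [("Pants", "/p"), ("Jeans", "/j")]))
print(select_category_urls("Dresses", [("Maxi", "/m"), ("All Dresses", "/all")]))
```

Note that the substring test also matches names such as "Allergy", which is the same behavior as the check in the loop above.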

## Step 2: collect items in each category

On each page, we look for the next-page link so that the search can continue from there.

In [183]:
def searchnext(sp):
    nextli = sp.find("li",class_="nextPage")
    if nextli:
        nextpage = nextli.find("a",href=True)
        if nextpage and nextpage['href']!='#':
            root = "http://www.macys.com"
            if nextpage['href'].find("macys.com")==-1:
                nextlink = root+nextpage['href']
            else:
                nextlink = nextpage['href']
        else:
            nextlink = None
        return nextlink
    else:
        return None
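As an alternative to the substring test on `"macys.com"`, the standard library's `urljoin` can turn a relative `href` into an absolute URL and leaves absolute URLs unchanged. A minimal sketch, assuming the same root as in `searchnext`; the helper name `resolve_next` is hypothetical.

```python
from urllib.parse import urljoin

ROOT = "http://www.macys.com"  # same root as in searchnext


def resolve_next(href, root=ROOT):
    """Return an absolute next-page URL, or None if there is no next page."""
    if not href or href == "#":
        return None
    # urljoin keeps absolute URLs as they are and resolves relative ones
    return urljoin(root, href)


print(resolve_next("/shop/womens-clothing/dresses/Pageindex/2?id=5449"))
print(resolve_next("#"))
```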

* On each page, collect the product names inside the main div block

* Some pages also contain recommendation grids and recently viewed items, which we want to ignore

In [184]:
def prod_page(sp):
    tmp_list=[]
    main_div = sp.find("div", class_="sortableGrid")
    if main_div:
        for div in main_div.find_all("div", class_="productDescription"):
            for a in div.find_all("a", class_="productDescLink", title=True):
                tmp_list.append(a['title'].strip())
    else:
        main_div = sp.find("div", id="macysGlobalLayout")
        if main_div:
            for div in main_div.find_all("div", class_="shortDescription"):
                for a in div.find_all("a", class_="productThumbnailLink", href=True):
                    tmp_list.append(a.text.strip())
        
    return tmp_list

For each category, we follow the next-page links until the last page is reached.

In [185]:
# Example: url = "http://www.macys.com/shop/womens-clothing/dresses/Pageindex/95?id=5449"
def search_cat(url):
    """Collect the product names from every page of one category."""
    cat_list = []
    nextlink = url
    while nextlink:
        response = requests.get(nextlink, headers=headers)
        sp = BeautifulSoup(response.text, 'html.parser')
        cat_list += prod_page(sp)
        nextlink = searchnext(sp)
        # Uncomment to show progress while debugging:
        # print('{0}'.format(nextlink), end='\r')
    return cat_list
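A pagination loop like this runs forever if a site ever links a page back to one that was already visited. The sketch below shows one way to guard against that, with the fetching and parsing passed in as functions so the loop can be tried on fake pages. The name `crawl_pages`, its parameters, and the sample data are all assumptions made for illustration.

```python
import time


def crawl_pages(start_url, fetch, parse_items, find_next, delay=0.0, max_pages=1000):
    """Follow next-page links from start_url and collect items.

    fetch(url) -> page, parse_items(page) -> list, find_next(page) -> url or None.
    Stops on a repeated URL or after max_pages pages.
    """
    items, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch(url)
        items += parse_items(page)
        url = find_next(page)
        if url and delay:
            time.sleep(delay)  # optional pause between requests
    return items


# Fake two-page category whose last page links back to the first one
pages = {
    "p1": {"items": ["a", "b"], "next": "p2"},
    "p2": {"items": ["c"], "next": "p1"},
}
result = crawl_pages("p1", pages.get, lambda p: p["items"], lambda p: p["next"])
print(result)
```

In the notebook, `parse_items` and `find_next` would correspond to `prod_page` and `searchnext`, and `fetch` to a function that downloads a URL and returns the parsed soup.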

Save the product list of each category to its own file.

In [186]:
def save_cat(cat_list,i):
    filename = "products/product_list_macy"+str(i)+".txt"
    with open(filename, "a") as myfile:
        myfile.write('\n'.join(cat_list))
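Two details of `save_cat` are worth keeping in mind: the `products/` directory has to exist already, and because the file is opened in append mode without a trailing newline, running the crawl twice would join the last product of the first run to the first product of the second. The sketch below is one possible variant, assuming that overwriting on a re-run is acceptable; the name `save_lines` is hypothetical.

```python
import os


def save_lines(lines, filename):
    """Write one item per line, creating the parent directory if needed."""
    parent = os.path.dirname(filename)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # "w" overwrites on a re-run (an assumption; the original appends)
    with open(filename, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
```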

In [201]:
for i, url in enumerate(url_list):
    cat_list = search_cat(url)
    save_cat(cat_list, i)
    print('{0}'.format(i), end='\r')
    #if len(cat_list)<1:
    #    print('{0} abnorm'.format(i))

162

There are some bad links:

* http://www1.macys.com/shop/for-the-home/piggy-banks-snow-globes?id=60821
* http://www1.macys.com/shop/kitchen/combination-coffee-machines?id=43081
* http://www1.macys.com/shop/dining-entertaining/individual-bowls?id=61072
* http://www1.macys.com/shop/bed-bath/apartment-bedding?id=60167
* http://www1.macys.com/shop/makeup-and-perfume/false-eyelashes?id=59291

Wedding and Plus-size are top-level directories; we assume that their sub-directories cover them.

## Step 3: aggregate and clean

In [229]:
import os
import pandas as pd

In [240]:
path = 'products/'
frame=[]
for file in os.listdir(path):
    if file.endswith('.txt'):
        df=pd.read_csv(path+file,names=['product'],sep='\t')
        frame.append(df)

In [241]:
res = pd.concat(frame)

In [244]:
len(res)

195311

Some products may appear more than once because categories overlap, so we perform a simple clean-up by keeping only the unique names.

In [245]:
prod_list = res['product'].unique()

In [246]:
len(prod_list)

112744

In [249]:
with open('products/prodcuts_final.csv', "w") as myfile:
        myfile.write('\n'.join(prod_list))

In [252]:
res.to_csv('products/prodcuts_full.csv',index = False, header = False)
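The same aggregation can be done without pandas by reading the files line by line, which avoids any CSV parsing of tabs or quote characters that may occur inside product names. A minimal sketch; the function name `load_unique_products` is hypothetical, and it assumes one product name per line as written by `save_cat`.

```python
import glob
import os


def load_unique_products(folder):
    """Return the distinct non-empty lines of all .txt files, in first-seen order."""
    seen = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with open(path, encoding="utf-8") as f:
            for line in f:
                name = line.strip()
                if name:
                    seen.setdefault(name, None)  # dicts keep insertion order
    return list(seen)
```

Called as `load_unique_products('products/')`, this would play the role of `res['product'].unique()`, although the counts may differ slightly from the pandas result if some lines were parsed differently by `read_csv`.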

## Generalize
The framework can be summarized and generalized as:
* start from the sitemap: a valid sitemap.xml file is ideal if one can be obtained, otherwise parse the sitemap web page

* search each merchandise category from its first page to its last; the parsing rules may need to be adapted for each website, but the framework stays the same

* perform cleaning
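For the first step, when a site does publish a sitemap.xml, its URLs can be read with the standard library. The sketch below assumes the file follows the sitemaps.org 0.9 schema; the sample document and its example.com URLs are made up, and whether a given site offers such a file has to be checked case by case.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_data):
    """Return the <loc> URLs of a sitemap or sitemap index document."""
    root = ET.fromstring(xml_data)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS) if loc.text]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/shop/a</loc></url>
  <url><loc>https://example.com/shop/b</loc></url>
</urlset>"""
print(sitemap_locs(sample))
```

With `requests`, the raw bytes in `response.content` can be passed to `sitemap_locs` directly.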

## Assessment

Some simple checks of the product count for randomly chosen categories:
* (count: 496) http://www1.macys.com/shop/makeup-and-perfume/makeup-brushes-and-makeup-bags?id=56285
* (count: 41) http://www1.macys.com/shop/plus-size-clothing/shop-jumpsuits-rompers?id=43910
* (count: 43) http://www1.macys.com/shop/kitchen/toasters-toaster-ovens?id=7575
* (count: 7) http://www1.macys.com/cms/slp/2/Womens-Board-Shorts
* (count: 40) http://www1.macys.com/shop/kitchen/grills-griddles?id=7569

All of the checks above are correct, as shown below.

However, this assumes that we can reach all the products through the categories in the sitemap.

Some items may not be reachable this way, since we don't know how complete the sitemap is.

A DFS or another traversal algorithm may be needed to visit every single item page for validation.

However, that may be too expensive.
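To make the idea concrete, a depth-first traversal over links could look like the sketch below. It is not part of the crawl performed here: the link extraction and the test for what counts as an item page are passed in as functions, and the name `dfs_collect`, its parameters, and the tiny link graph are all hypothetical.

```python
def dfs_collect(start, get_links, is_product, max_pages=10000):
    """Depth-first walk over the pages reachable from start.

    get_links(url)  -> iterable of URLs linked from that page.
    is_product(url) -> True if the URL is a single-item page.
    Returns the set of product URLs found; stops after max_pages visits.
    """
    visited, products, stack = set(), set(), [start]
    while stack and len(visited) < max_pages:
        url = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        if is_product(url):
            products.add(url)
            continue  # item pages are treated as leaves in this sketch
        stack.extend(link for link in get_links(url) if link not in visited)
    return products


# Made-up link graph standing in for a real site
graph = {
    "home": ["cat1", "cat2"],
    "cat1": ["item/1", "item/2", "home"],
    "cat2": ["item/2", "item/3"],
}
found = dfs_collect("home", lambda u: graph.get(u, []), lambda u: u.startswith("item/"))
print(sorted(found))
```

The `max_pages` cap is what keeps the cost bounded; comparing `found` with the items collected through the sitemap would reveal products that the sitemap misses.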

In [256]:
from random import randint
# randint includes both endpoints, so the upper bound must be len(url_list) - 1
for i in range(5):
    print(randint(0, len(url_list) - 1), end=' ')

145 56 225 19 231 

In [263]:
for i in [145,56,225,19,231]:
    file = 'products/product_list_macy'+str(i)+'.txt'
    df=pd.read_csv(file,names=['product'],sep='\t')
    print(df.shape[0],end=' ')

496 41 43 7 40 