# Capstone Project - Which Shoe is the Best for You?

General Assembly passion project. Scrape or obtain data from online resources to build a dataset for cleaning, EDA, and analysis. Then build a model around a common theme such as:

- Price
- If item is in category A or B
- Cluster and create groups
- Recommender

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import urllib
from bs4 import BeautifulSoup
import requests
from time import sleep, strftime

**Testing one page of the website and finding the appropriate keys**

Then run the same request again on page 2.

In [2]:
result = requests.get('https://stockx.com/api/browse?page=1&category=152')
json_res = result.json()

# print(json_res['Products'][0]['shortDescription'])
# print(json_res['Products'][0]['retailPrice'])

In [3]:
result = requests.get('https://stockx.com/api/browse?page=2&category=152')
json_res = result.json()

print(json_res['Products'][0]['shortDescription'])
print(json_res['Products'][0]['retailPrice'])

Air-Jordan-1-Retro-Black-Blue-2017
160


In [4]:
json_res['Products'][0]['market']

{u'absChangePercentage': 0.023166,
 u'annualHigh': 500,
 u'annualLow': 229,
 u'averageDeadstockPrice': 297,
 u'averageDeadstockPriceRank': 32,
 u'changePercentage': 0.023166,
 u'changeValue': 6,
 u'createdAt': u'2017-01-17T00:28:26+00:00',
 u'deadstockRangeHigh': 275,
 u'deadstockRangeLow': 255,
 u'deadstockSold': 1336,
 u'deadstockSoldRank': 42,
 u'highestBid': 260,
 u'lastHighestBidTime': 1498157136,
 u'lastLowestAskTime': 1498152067,
 u'lastSale': 265,
 u'lastSaleDate': u'2017-06-22T22:33:22+00:00',
 u'lowestAsk': 259,
 u'pricePremium': 0.656,
 u'pricePremiumRank': 55,
 u'productId': 0,
 u'productUuid': u'a51561a1-ec03-47b2-9cdc-82b5b12f6771',
 u'salesLast72Hours': 36,
 u'salesLastPeriod': 0,
 u'salesThisPeriod': 36,
 u'skuUuid': None,
 u'updatedAt': 1498185118,
 u'volatility': 0.037959}
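
The `market` record above mixes two timestamp styles: ISO 8601 strings (`createdAt`, `lastSaleDate`) and integers that look like Unix epoch seconds (`updatedAt`, `lastHighestBidTime`). A minimal sketch of normalising both with pandas, using only values copied from the output above. Treating the integers as epoch seconds is an assumption, and so is the `pricePremium` formula checked at the end; neither is documented in this notebook.

```python
import pandas as pd

# Subset of the `market` record printed above (values copied from that output)
market_record = {
    'lastSale': 265,
    'lastSaleDate': '2017-06-22T22:33:22+00:00',
    'updatedAt': 1498185118,
    'pricePremium': 0.656,
}
retail_price = 160  # retailPrice printed for the same product on page 2

market_df = pd.DataFrame([market_record])

# Assumption: integer timestamps are seconds since the Unix epoch
market_df['updatedAt'] = pd.to_datetime(market_df['updatedAt'], unit='s', utc=True)
# ISO 8601 strings parse directly
market_df['lastSaleDate'] = pd.to_datetime(market_df['lastSaleDate'], utc=True)

# Assumed definition: premium = (last sale - retail price) / retail price
implied_premium = round(
    (market_record['lastSale'] - retail_price) / float(retail_price), 3
)

print(market_df[['lastSaleDate', 'updatedAt']])
print(implied_premium)
```

Under these assumptions `updatedAt` lands a few hours after `lastSaleDate`, and the implied premium matches the reported `pricePremium` of 0.656, which suggests (but does not prove) that the field is measured against the retail price.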

### Functions to scrape the website in a loop

The first function scrapes a given number of pages (default 50) and saves the raw data as a CSV file. The second function then cleans the dataframe by expanding the nested `market` column and dropping the unused columns.

In [5]:
def shoe_scraper(pages=50):
    '''Scrape `pages` pages of results, save the raw data to a file,
    and return all results as one dataframe.'''

    # Request page 1 first to get the appropriate column names
    req = requests.get('https://stockx.com/api/browse?page=1&category=152')
    json_req = req.json()
    df = pd.DataFrame([], columns=list(json_req['Products'][0].keys()))

    # range() excludes its stop value, so add 1 to include the last page
    for i in range(1, pages + 1):
        try:
            url = 'https://stockx.com/api/browse?page=' + str(i) + '&category=152'
            result = requests.get(url)
            json_res = result.json()
            df = pd.concat([df, pd.DataFrame(json_res['Products'])])
            sleep(0.5)  # Pause between requests
        except Exception:
            # Stop at the first page that fails to download or parse
            break

    # Drop duplicate rows
    df.drop_duplicates(['shortDescription', 'urlKey'], inplace=True)

    # Save the raw data (before dropping columns) under a month-day-hour stamp
    def csv_maker(df):
        filename = 'StockX_' + strftime("%m%d%H")
        df.to_csv(path_or_buf='C:\\Users\\Chris\\Desktop\\dsi-atl-3\\project\\Capstone\\datasets\\' + filename, encoding='utf-8')

    csv_maker(df)

    return df
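
As an alternative to growing the dataframe inside the loop, the pages can be collected in a list and concatenated once. The sketch below is not the notebook's scraper: `collect_pages` and `fetch_page` are hypothetical names, and `fetch_page` stands for any callable that maps a page number to a list of product dicts (in this notebook it would wrap `requests.get(...).json()['Products']`). A fake fetcher is used so the example runs offline.

```python
from time import sleep

import pandas as pd


def collect_pages(fetch_page, pages=50, delay=0.5):
    """Fetch pages 1..pages and return one frame with a fresh 0..n-1 index.

    `fetch_page` is a hypothetical callable: page number -> list of dicts.
    """
    frames = []
    for page in range(1, pages + 1):
        products = fetch_page(page)
        if not products:  # an empty page is treated as the end of the results
            break
        frames.append(pd.DataFrame(products))
        sleep(delay)  # pause between requests
    if not frames:
        return pd.DataFrame()
    # One concat at the end; ignore_index avoids repeated row labels
    return pd.concat(frames, ignore_index=True)


# Stand-in for the real request: three pages of two made-up products each
def fake_fetch(page):
    if page > 3:
        return []
    return [
        {'urlKey': 'shoe-%d-%d' % (page, i), 'retailPrice': 100 + i}
        for i in range(2)
    ]


demo = collect_pages(fake_fetch, pages=10, delay=0)
print(demo.shape)
```

Passing `ignore_index=True` gives every row a unique label, which matters for any later step that aligns rows on the index.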

In [6]:
def clean_df(busy_dataframe):
    '''Expand the nested `market` column, drop columns that hold little
    to no information, and reset the index.'''

    # Reset the index first: the scraped pages repeat index labels,
    # and join() aligns rows on the index
    busy_dataframe = busy_dataframe.reset_index(drop=True)

    # Market DataFrame: one column per key of the nested `market` dicts
    market_df = pd.DataFrame(list(busy_dataframe['market']))

    # Add the market columns next to the product columns
    cleaner_dataframe = busy_dataframe.join(market_df)

    # Drop unnecessary columns
    cleanest_dataframe = cleaner_dataframe.drop(['breadcrumbs', 'childId', 'countryOfManufacture', 'type', 
        'uuid', 'dataType', 'doppelgangers', 'condition', 'description', 'hidden', 'ipoDate', 'productCategory', 
        'shoeSize', 'urlKey', 'charityCondition', 'releaseTime', 'shortDescription', 'media', '_highlightResult', 
        'market', '_tags', 'id', 'objectID', 'lastHighestBidTime', 'lastLowestAskTime', 'styleId', 'productId',
        'productUuid', 'skuUuid', 'updatedAt', 'title', 'traits', 'tickerSymbol', 'salesLastPeriod'], axis=1)

    # Remember title = shoe + name

    return cleanest_dataframe
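
The docstring mentions repeated indices. A toy sketch of why they matter when the expanded `market` columns are joined back: `DataFrame.join` aligns rows on index labels, not on row position. The frame below is made-up illustration data, not scraped output.

```python
import pandas as pd

# Two "pages" of two rows each, concatenated without resetting the index
raw = pd.DataFrame(
    {
        'name': ['a', 'b', 'c', 'd'],
        'market': [{'lastSale': 1}, {'lastSale': 2},
                   {'lastSale': 3}, {'lastSale': 4}],
    },
    index=[0, 1, 0, 1],  # repeated labels
)

# Expanding the dicts produces a fresh 0..3 index
market = pd.DataFrame(raw['market'].tolist())

# Joining on the repeated labels pairs 'c' and 'd' with the wrong rows
misaligned = raw.join(market)
# Resetting the index first pairs each row with its own market data
aligned = raw.reset_index(drop=True).join(market)

print(misaligned.set_index('name')['lastSale'].to_dict())
print(aligned.set_index('name')['lastSale'].to_dict())
```

In the misaligned frame, rows `c` and `d` pick up the `lastSale` values of `a` and `b`, so the index needs to be unique before the join for the market columns to line up.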

In [7]:
scraped_shoe = shoe_scraper()
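
The scraper writes its raw output under a month-day-hour stamp. A minimal sketch of the same idea with a portable path, assuming a hypothetical helper `save_raw` and a temporary directory in place of the notebook's dataset folder. The `.csv` extension and `index=False` are choices made for this sketch, not the notebook's behaviour.

```python
import os
import tempfile
from time import strftime

import pandas as pd


def save_raw(df, directory, prefix='StockX_'):
    """Write df to <directory>/<prefix><MMDDHH>.csv and return the path."""
    filename = prefix + strftime('%m%d%H') + '.csv'
    path = os.path.join(directory, filename)  # portable path separator
    df.to_csv(path, encoding='utf-8', index=False)
    return path


# Made-up rows standing in for the scraped frame
demo_dir = tempfile.mkdtemp()
demo_df = pd.DataFrame({'urlKey': ['a', 'b'], 'retailPrice': [160, 190]})

saved_path = save_raw(demo_df, demo_dir)
reloaded = pd.read_csv(saved_path)
print(os.path.basename(saved_path))
```

One caveat when reloading a raw file like this: nested columns such as `market` are written to CSV as text, so they come back as strings and would need parsing before a function like `clean_df` could expand them.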