### Imports

For this scraper we will use only requests, not BeautifulSoup, because Shopify stores return their product data as clean JSON when you append products.json (plus query parameters) to the store url. The product fields are described in the Shopify documentation: "https://shopify.dev/docs/admin-api/rest/reference/products/product"

In [1]:
import requests  # HTTP requests for scraping
import json  # Working with JSON data
import pandas as pd  # Storing and loading the data as a dataframe/csv
import numpy as np  # Used for NaN placeholders
import time  # Pausing between requests in the loop
import sqlalchemy  # SQL database connector
from sqlalchemy import create_engine
import config  # Database credentials

import warnings
warnings.filterwarnings('ignore')

pd.options.display.max_columns = 300
pd.options.display.max_rows = 300
pd.options.display.max_colwidth = 400

### Set up

Let's define a list with each url we will be scraping. This list includes the following brands:
- Scalpers
- Edmmond
- Pompeii
- Muroexe
- Brubaker
- Barner
- Alohas

What these brands have in common is that they all belong to the retail/fashion industry, which makes it easier for us to relate their products to each other.

In [2]:
urls = ['https://scalperscompany.com',
        'https://edmmond.com',
        'https://www.pompeiibrand.com',
        'https://es.muroexe.com',
        'https://thebrubaker.com',
        'https://barnerbrand.com',
        'https://www.alohas.io']

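Next, define the path and query string that we append to each store url so that we request only the products, as JSON. Note that this asks for page 1 only.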

In [3]:
shopify_products_url = 'products.json?limit=500&page=1'
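Because the url above requests a single page, a store with more products than fit on one page would be cut short. Here is a minimal sketch of walking through the pages until an empty one comes back. It assumes the endpoint honours the limit and page parameters as in the url above, and that a store may return fewer items per page than requested, which is why the loop stops on an empty page rather than a short one. The names iter_products, fetch_page, make_requests_fetcher and max_pages are hypothetical, and the default limit of 250 is an assumption, not something taken from this notebook.

```python
from typing import Callable, Dict, Iterator, List


def iter_products(fetch_page: Callable[[int], List[Dict]],
                  max_pages: int = 100) -> Iterator[Dict]:
    """Yield products page by page until an empty page is returned.

    fetch_page is a hypothetical callable: it takes a 1-based page number
    and returns the list found under the 'products' key of that page.
    """
    for page in range(1, max_pages + 1):
        products = fetch_page(page)
        if not products:  # an empty page means there is nothing left
            return
        yield from products


def make_requests_fetcher(base_url: str, limit: int = 250,
                          pause: float = 1.0) -> Callable[[int], List[Dict]]:
    """Build a fetch_page callable backed by requests (imported lazily)."""
    import time
    import requests  # only needed when this fetcher is actually used

    def fetch_page(page: int) -> List[Dict]:
        r = requests.get(f'{base_url}/products.json',
                         params={'limit': limit, 'page': page}, timeout=30)
        r.raise_for_status()  # fail loudly on HTTP errors
        time.sleep(pause)  # rest between requests
        return r.json().get('products', [])

    return fetch_page


# Offline demonstration with a fake store that has two non-empty pages.
fake_pages = {1: [{'id': 1}, {'id': 2}], 2: [{'id': 3}], 3: []}
all_products = list(iter_products(lambda page: fake_pages.get(page, [])))
print(len(all_products))
```

Against a real store, the call would look like list(iter_products(make_requests_fetcher(urls[0]))). Passing the fetcher in as an argument keeps the paging logic testable without touching the network.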

We will save all the scraped data in tabular format in a SQL database. To do that, we read the database credentials from a config file and set up the connection with SQLAlchemy's create_engine function.

In [4]:
dbtype = config.database_new['dbtype']
user = config.database_new['user']
password = config.database_new['password']
ip = config.database_new['ip']
port = config.database_new['port']
name = config.database_new['name']

engine = create_engine(f'{dbtype}://{user}:{password}@{ip}:{port}/{name}')

product_list = []
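One thing to watch when assembling the connection string with an f-string: characters such as @ or / in the user name or password would be read as part of the url structure. A minimal sketch of escaping them with the standard library is shown here. The helper name build_db_url and the credentials in example_cfg are hypothetical stand-ins for config.database_new.

```python
from urllib.parse import quote_plus


def build_db_url(cfg: dict) -> str:
    """Assemble a SQLAlchemy-style url, escaping the user and password."""
    user = quote_plus(cfg['user'])
    password = quote_plus(cfg['password'])
    return (f"{cfg['dbtype']}://{user}:{password}"
            f"@{cfg['ip']}:{cfg['port']}/{cfg['name']}")


# Hypothetical credentials, standing in for config.database_new.
example_cfg = {
    'dbtype': 'postgresql',
    'user': 'scraper',
    'password': 'p@ss/word',
    'ip': '127.0.0.1',
    'port': 5432,
    'name': 'shop',
}
db_url = build_db_url(example_cfg)
print(db_url)  # postgresql://scraper:p%40ss%2Fword@127.0.0.1:5432/shop
```

The resulting string can be handed to create_engine in the same way as the f-string above.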

Scraping the data. Using the "get" function from the requests module, we can fetch the JSON response from the website and parse it into a dictionary.

In [5]:
r = requests.get(f'{urls[0]}/{shopify_products_url}')
data = r.json()
for x in data['products'][0]:
    print(x)
display(data['products'][0])

id
title
handle
body_html
published_at
created_at
updated_at
vendor
product_type
tags
variants
images
options


{'id': 4799506513981,
 'title': 'CAMISETA HALLOWEEN FOSFORESCENTE',
 'handle': '28060-scary-skull-tee-kids-aw2021-grey',
 'body_html': 'Camiseta confeccionada en tejido 100% algodón orgánico. Corte Regular Fit, cuello redondo y manga corta. Detalle de print de calavera fosforescente en la oscuridad.',
 'published_at': '2020-10-23T14:21:05+02:00',
 'created_at': '2020-09-17T12:27:50+02:00',
 'updated_at': '2020-10-25T18:22:00+01:00',
 'vendor': 'scalperscompany',
 'product_type': 'Camiseta',
 'tags': ['28060',
  'AW2021',
  'Camisetas',
  'feed-cl2-aw2021',
  'Infantil',
  'Niño',
  'nopromociones',
  'Nueva Colección',
  'Ropa',
  'scaryskullteekids',
  'Talla Superior_10',
  'Talla Superior_12',
  'Talla Superior_14',
  'Talla Superior_4',
  'Talla Superior_6',
  'Talla Superior_8'],
 'variants': [{'id': 32870380732477,
   'title': 'GREY / 4',
   'option1': 'GREY',
   'option2': '4',
   'option3': None,
   'sku': '8445279061574',
   'requires_shipping': True,
   'taxable': True,
   'f

Looking at the info for this product, there are a lot of fields that could be useful:
- The title is definitely useful.
- handle is not very useful, since it is derived from the title. "A unique human-friendly string for the product. Automatically generated from the product's title. Used by the Liquid templating language to refer to objects." - Shopify docs.
- Ids are always useful. Maybe in the future we will use only the id instead of the title, and it is good practice to keep them in case you have other tables and want to merge.
- We will not be using body_html; I can't think of one useful thing to do with an html body in data science, haha.
- Of the three dates, we will keep created_at and updated_at. The first is when the product was created, and the second is the date and time (ISO 8601 format) when the product was last modified.
- From the variants property we will keep a lot of things: the price, the sku, the availability boolean, whether it requires shipping, and the position. (In ecommerce, position generally refers to the order in which an item is displayed in the store, but not in this case. Here it means the order in which the variant is displayed within its product.)

Now that we know what we will be using, let's write a for loop that iterates through the url list so the code runs for each store. We will append all the results to an empty list defined outside of the for loop.

In [6]:
def get_data(): 
    '''
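    This function returns a dictionary that contains a list of products scraped from a website.
    It reads the url from the global urls list using the global index i, so it must be
    called inside a loop that defines i.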
    '''
    r = requests.get(f'{urls[i]}/{shopify_products_url}')
    data = r.json()
    return data

In [7]:
def transform_data(data): 
    '''
    This function transforms the dictionary scraped by the get_data function, keeping only the useful
    properties that will be used for analysis or machine learning, and appends one row per variant
    to the global product_list.
    The "product" dictionary at the bottom of the function summarizes the data that will be stored.
    '''
    for item in data['products']:
        title = item['title']
        handle = item['handle']
        created = item['created_at']
        updated = item['updated_at']
        product_type = item['product_type']
        vendor = item['vendor']
        imagesrc = np.nan  # default when the product has no images
        for image in item['images']:
            # keeps the src of the last image listed for the product
            imagesrc = image.get('src', np.nan)
        for variant in item['variants']:
            price = variant['price']
            sku = variant['sku']
            available = variant['available']
            require_shipping = variant['requires_shipping']
            position = variant['position']
            # fall back to NaN when a variant has no compare_at_price key
            compare_at_price = variant.get('compare_at_price', np.nan)

            product = {
                'title': title,
                'handle': handle,
                'created': created,
                'updated': updated,
                'product_type':product_type,
                'vendor':vendor,
                'price': price,
                'compare_at_price': compare_at_price,
                'sku': sku,
                'available': available,
                'image': imagesrc,
                'require_shipping': require_shipping,
                'position': position
            }
            product_list.append(product)
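To make the flattening step easier to inspect in isolation, here is a minimal sketch of a side-effect-free version that returns the rows for one product instead of appending to a global list. The function name product_rows is hypothetical, only a subset of the columns is kept for brevity, and the sample product uses made-up values shaped like the output shown earlier.

```python
import math


def product_rows(item: dict) -> list:
    """Return one flat dict per variant of a single product dict."""
    images = item.get('images') or []
    # Mirror the loop above: keep the src of the last image, NaN if none.
    image_src = images[-1].get('src', math.nan) if images else math.nan
    rows = []
    for variant in item.get('variants', []):
        rows.append({
            'title': item['title'],
            'vendor': item['vendor'],
            'sku': variant['sku'],
            'price': variant['price'],
            'compare_at_price': variant.get('compare_at_price', math.nan),
            'available': variant['available'],
            'position': variant['position'],
            'image': image_src,
        })
    return rows


# Hand-made sample with made-up values (not scraped data).
sample = {
    'title': 'SAMPLE TEE',
    'vendor': 'examplebrand',
    'images': [],
    'variants': [
        {'sku': 'A-1', 'price': '19.90', 'available': True, 'position': 1},
        {'sku': 'A-2', 'price': '19.90', 'available': False, 'position': 2,
         'compare_at_price': '25.00'},
    ],
}
rows = product_rows(sample)
print(len(rows))  # one row per variant
```

Returning the rows makes it straightforward to check a single product by hand before running the whole scrape.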

Storing the scraped data as a dataframe will allow us to push it to the database.

Using the same engine we defined at the beginning, we can write the dataframe to a database table with pandas' to_sql method, which runs on top of SQLAlchemy.

In [8]:
def load_data(product_list):
    '''
    Takes the returned list from transform_data, changes some types and loads it to a database.
    '''
    df = pd.DataFrame(product_list)
    df['created'] = pd.to_datetime(df['created'], utc = True)
    df['updated'] = pd.to_datetime(df['updated'], utc = True)
    df['price'] = pd.to_numeric(df['price'])
    df['compare_at_price'] = pd.to_numeric(df['compare_at_price'])
    df.to_sql(name='competitor_products',con=engine, index=False, if_exists='replace', method='multi', chunksize=110)
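The type conversions are worth a quick look on their own. The timestamps come back with different UTC offsets (+02:00 and +01:00 in the sample product above), so passing utc=True puts them on one common timezone. Here is a minimal sketch on a two-row frame; the two timestamps are taken from the sample output above, while the prices are made up for illustration.

```python
import pandas as pd

# Two rows with different UTC offsets; prices are made-up example values.
rows = [
    {'created': '2020-09-17T12:27:50+02:00', 'price': '19.90',
     'compare_at_price': None},
    {'created': '2020-10-25T18:22:00+01:00', 'price': '25.00',
     'compare_at_price': '30.00'},
]
df_demo = pd.DataFrame(rows)

# Normalize mixed offsets to UTC and turn price strings into floats.
df_demo['created'] = pd.to_datetime(df_demo['created'], utc=True)
df_demo['price'] = pd.to_numeric(df_demo['price'])
df_demo['compare_at_price'] = pd.to_numeric(df_demo['compare_at_price'])

print(df_demo.dtypes)
```

After the conversion, the missing compare_at_price becomes NaN and both timestamps are expressed in UTC, so they can be compared and sorted directly.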

Now let's write a little for loop that runs the whole process

In [9]:
for i in range(len(urls)):
    transform_data(get_data())
    time.sleep(1)
load_data(product_list)
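The loop above depends on the global index i and stops at the first store that fails. As a hedged alternative, here is a minimal sketch that iterates over the urls directly, passes each url explicitly, and records failures instead of aborting. The names scrape_all, fake_fetch and fake_transform are hypothetical, and the fetch and transform callables stand in for versions of get_data and transform_data that take and return values rather than using globals.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def scrape_all(urls: Iterable[str],
               fetch: Callable[[str], Dict],
               transform: Callable[[Dict], List[Dict]]
               ) -> Tuple[List[Dict], List[str]]:
    """Run fetch and transform per store, collecting rows and failed urls."""
    rows: List[Dict] = []
    failed: List[str] = []
    for url in urls:  # iterate over the urls directly, no index needed
        try:
            rows.extend(transform(fetch(url)))
        except Exception as exc:  # broad on purpose: one bad store should not stop the run
            print(f'skipping {url}: {exc}')
            failed.append(url)
    return rows, failed


# Offline demonstration with fake stores; 'https://b.example' always fails.
def fake_fetch(url: str) -> Dict:
    if url == 'https://b.example':
        raise RuntimeError('store unreachable')
    return {'products': [{'title': url}]}


def fake_transform(data: Dict) -> List[Dict]:
    return [{'title': p['title']} for p in data['products']]


all_rows, failed_urls = scrape_all(['https://a.example', 'https://b.example'],
                                   fake_fetch, fake_transform)
print(len(all_rows), failed_urls)
```

The list of failed urls can then be retried later or simply logged.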

To check that everything is all right, we can read the competitor_products table back from the database.

In [10]:
df = pd.read_sql_table('competitor_products', engine)

In [12]:
print(f'The shape of the table/dataframe is: {df.shape}')

The shape of the table/dataframe is: (8390, 13)
