# Shopify

[Shopify](https://shopify.com) is a popular e-commerce platform for creating storefronts on the web that can accept payments for goods. According to the Wikipedia page:

> The company reported that it had more than 1,000,000 businesses in approximately 175 countries using its platform as of June 2019, with total gross merchandise volume exceeding $41.1 billion for calendar 2018.

Since we crawled the DomainTools domains and saved their homepages in a WARC file, we can read it back and see which sites look like they have Shopify content.

WARC files record both the HTTP requests and responses. First let's just read the file as a gzipped text file and see if `shopify` appears anywhere in it.

In [6]:
import gzip

found = 0
for line in gzip.open('data/warc/domaintools-2020-04-13.warc.gz'):
    if b'shopify' in line:
        found += 1
        if found > 100:
            break
        print(line)

b'Set-Cookie: _shopify_y=ae45e9c8-4c8f-48c8-b746-3711d72c02c5; path=/; expires=Wed, 13 Apr 2022 22:27:30 GMT\r\n'
b'Report-To: {"group":"network-errors","max_age":2592000,"endpoints":[{"url":"https://monorail-edge.shopifycloud.com/v1/reports/nel/20190325/shopify"}]}\r\n'
b'Set-Cookie: _shopify_y=a705248e-9ab3-4bc6-8428-5b7f69f3a2d6; path=/; expires=Wed, 13 Apr 2022 22:32:08 GMT\r\n'
b'Report-To: {"group":"network-errors","max_age":2592000,"endpoints":[{"url":"https://monorail-edge.shopifycloud.com/v1/reports/nel/20190325/shopify"}]}\r\n'
b'Report-To: {"group":"network-errors","max_age":2592000,"endpoints":[{"url":"https://monorail-edge.shopifycloud.com/v1/reports/nel/20190325/shopify"}]}\r\n'
b'Set-Cookie: _shopify_y=fbbdced4-9e7b-4936-8315-3a1b2ab0736f; path=/; expires=Wed, 13 Apr 2022 22:33:33 GMT\r\n'
b'Report-To: {"group":"network-errors","max_age":2592000,"endpoints":[{"url":"https://monorail-edge.shopifycloud.com/v1/reports/nel/20190325/shopify"}]}\r\n'
b'Report-To: {"group":"net

b'Report-To: {"group":"network-errors","max_age":2592000,"endpoints":[{"url":"https://monorail-edge.shopifycloud.com/v1/reports/nel/20190325/shopify"}]}\r\n'
b'set-cookie: _shopify_y=1196d6be-95e1-4f55-9590-35c81700bc4f; path=/; expires=Thu, 14 Apr 2022 01:06:41 GMT\r\n'
b'Report-To: {"group":"network-errors","max_age":2592000,"endpoints":[{"url":"https://monorail-edge.shopifycloud.com/v1/reports/nel/20190325/shopify"}]}\r\n'
b'Report-To: {"group":"network-errors","max_age":2592000,"endpoints":[{"url":"https://monorail-edge.shopifycloud.com/v1/reports/nel/20190325/shopify"}]}\r\n'
b'Set-Cookie: _shopify_y=b0a755e4-cad8-49b2-afa5-6a23a847a8fa; path=/; expires=Thu, 14 Apr 2022 01:18:08 GMT\r\n'
b'Report-To: {"group":"network-errors","max_age":2592000,"endpoints":[{"url":"https://monorail-edge.shopifycloud.com/v1/reports/nel/20190325/shopify"}]}\r\n'
b'Report-To: {"group":"network-errors","max_age":2592000,"endpoints":[{"url":"https://monorail-edge.shopifycloud.com/v1/reports/nel/20190325
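
The raw output is hard to scan, so a quick tally of which header names carry the match can help decide what to look for next. The following is a minimal sketch, not part of the original notebook: `tally_header_names` and `HEADER_LINE` are hypothetical names, and the sample lines are copied from the output above. It treats any line that starts with `name:` as a header, so the occasional line of HTML that happens to look like that would be miscounted.

```python
import re
from collections import Counter

# a line "looks like" a header if it starts with a token followed by a colon
HEADER_LINE = re.compile(rb'([A-Za-z0-9-]+):')

def tally_header_names(lines, needle=b'shopify'):
    """Count header names (lowercased) on lines that contain `needle`."""
    counts = Counter()
    for line in lines:
        match = HEADER_LINE.match(line)
        if match and needle in line.lower():
            counts[match.group(1).decode('ascii').lower()] += 1
    return counts

# three lines taken from the output above
sample = [
    b'Set-Cookie: _shopify_y=ae45e9c8-4c8f-48c8-b746-3711d72c02c5; path=/; expires=Wed, 13 Apr 2022 22:27:30 GMT\r\n',
    b'Report-To: {"group":"network-errors","max_age":2592000,"endpoints":[{"url":"https://monorail-edge.shopifycloud.com/v1/reports/nel/20190325/shopify"}]}\r\n',
    b'set-cookie: _shopify_y=1196d6be-95e1-4f55-9590-35c81700bc4f; path=/; expires=Thu, 14 Apr 2022 01:06:41 GMT\r\n',
]
print(tally_header_names(sample))
```

For the sample this counts two `set-cookie` lines and one `report-to` line. To run it over the whole crawl, pass `gzip.open('data/warc/domaintools-2020-04-13.warc.gz')` as `lines`.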

Alright, we're in business! Many of the HTTP responses set a Shopify cookie, specifically `_shopify_y`, which Shopify uses for [analytics](https://www.shopify.com/legal/cookies) tracking. Cookies like this are central to online purchases and click tracking. So let's walk through the responses and look for the ones that set it. Let's start with just one, shall we?

We used warcio to write the WARC data and we can [follow the directions](https://github.com/webrecorder/warcio#reading-warc-records) for reading it. The confusing thing is that WARC request and response records themselves have headers, and they contain the actual HTTP request and response records as payloads. The URL that is being archived is available in the `WARC-Target-URI` WARC record header, and we can find our `Set-Cookie` header in the HTTP headers.
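
To make the two layers of headers concrete, here is a rough, standard-library-only sketch that pulls apart a hand-written record. Everything in it is illustrative: `RAW_RECORD` is made-up example data (real records carry more headers, such as `Content-Length`), `parse_headers` is a hypothetical helper, and in practice warcio does this parsing for us.

```python
# A simplified, hand-written WARC response record (hypothetical example data).
RAW_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://shop.example/\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Set-Cookie: _shopify_y=0000; path=/\r\n"
    b"\r\n"
    b"<html></html>"
)

def parse_headers(block):
    """Split a header block into (first line, {lowercased name: value})."""
    lines = block.decode("utf8").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers

# outer layer: the WARC record headers end at the first blank line
warc_block, payload = RAW_RECORD.split(b"\r\n\r\n", 1)
warc_version, warc_headers = parse_headers(warc_block)

# inner layer: the payload is a complete HTTP response with its own headers
http_block, body = payload.split(b"\r\n\r\n", 1)
status_line, http_headers = parse_headers(http_block)

print(warc_headers["warc-target-uri"])  # the URL that was archived
print(http_headers["set-cookie"])       # the cookie set by the HTTP response
```

Note that both layers have a `Content-Type` header: the outer one describes the WARC payload (an HTTP message), the inner one describes the page itself.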

As we find matches we can add them to a list for further processing without needing to go back to the WARC data.

In [21]:
from warcio.archiveiterator import ArchiveIterator

shopify_urls = []

with open('data/warc/domaintools-2020-04-13.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        
        # only look at response records that carry HTTP headers
        if record.rec_type == 'response' and record.http_headers:

            # this is the URL that we collected (based on the DomainTools list)
            url = record.rec_headers.get_header('WARC-Target-URI')

            # now let's see if the response sets a cookie
            cookie = record.http_headers.get_header('Set-Cookie', '')

            # if it's one of our _shopify_y cookies, print the URL and remember it
            if '_shopify_y' in cookie:
                print(url)
                shopify_urls.append(url)
            


https://www.coronamasks.io/
https://www.covid-19masken.de/
https://firstcoronavirusresponseact.com/
https://www.firstcoronavirusresponseact.com/
https://www.covid19-sanitizers.com/
https://www.covid19-stop.de/
https://covid-19preparation.com/
https://www.covid-19preparation.com/
https://covid19-desinfectantes.com/
https://www.covid19-desinfectantes.com/
https://covidphilippines.com/
https://www.covidphilippines.com/
https://www.sars-cov-2-design.de/
https://corona-shop-hamburg.de/
https://www.covidsupplyusa.com/
https://www.covidone9.co.uk/
https://covid-19-t-shirt-shop.myshopify.com/
https://covid-19-t-shirt-shop.myshopify.com/
https://www.fk-covid19.com/
https://www.covidrebel.com/
https://www.customcovid.com/
https://www.co-vid19store.com/
https://freakyshoes.com/
https://approvedcovid19tests.com.au/
https://www.approvedcovid19tests.com.au/
https://freakyshoes.com/
https://covid-19familycare.com/
https://www.covidclub2020.com/
https://theanticoronavirusshop.com/
https://www.theantic

https://letsendcovid19.org/
https://www.letsendcovid19.org/
https://coronavirussocks.com/
https://www.coronavirussocks.com/
https://www.becoronavirusclean.com/
https://coronaviruscontrolcenter.com/
https://www.coronaviruscontrolcenter.com/
https://coronavirus-tees.com/
https://www.coronavirus-tees.com/
https://covid-19mask.org/
https://www.covid-19mask.org/
https://coronatestkit.shop/
https://anticovid19.fr/
https://www.anticovid19.fr/
https://www.covid19ware.com/
https://coronasalvation.com/
https://www.coronasalvation.com/
https://www.coronasafestore.net/
https://coronaoff.me/
https://www.coronaoff.me/
https://coronaantivirus.club/
https://www.coronaantivirus.club/
https://covidkit19.com/
https://www.covidkit19.com/
https://covidsupplier.com/
https://www.covidsupplier.com/
https://coronavirusprotect.net/
https://www.coronavirusprotect.net/
https://udown.com/
https://coronavirusadieu.com/
https://www.coronavirusadieu.com/
https://packanticontagiocoronavirus.com/
https://www.packantico

https://www.doihavecorona.org/
https://www.facemask-coronavirus.com/
https://covidmasks.net/
https://www.covidmasks.net/
https://trumpgotcorona.com/
https://www.trumpgotcorona.com/
https://facemaskcoronaviruses.com/
https://www.facemaskcoronaviruses.com/
https://coronasupplystore.com/
https://www.coronasupplystore.com/
https://coronasuperstore.com/
https://www.coronasuperstore.com/
https://covidclear.com/
https://www.covidclear.com/
https://www.coronaminus.com/
https://coronavirussmasks.com/
https://www.coronavirussmasks.com/
https://stop-corona.info/
https://www.stop-corona.info/
https://goodbyecoronavirus.com/
https://www.goodbyecoronavirus.com/
https://covidprotect.com/
https://www.covidprotect.com/
https://adioscoronavirus.com/
https://www.adioscoronavirus.com/
https://thecoronafacemask.com/
https://www.thecoronafacemask.com/
https://www.coronavirusmart.com/
https://covidheartcompany.org/
https://www.covidheartcompany.org/
https://vermijdcoronavirus.nl/
https://www.vermijdcoronavir
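
One caveat with the loop above: a response can carry several `Set-Cookie` headers, and as far as I can tell `get_header` only returns the first one that matches, so a `_shopify_y` cookie that comes second would be missed. A minimal sketch of a stricter check follows. `sets_cookie` is a hypothetical helper, the example headers are made up apart from one cookie value copied from the earlier output, and it assumes the headers are available as a list of `(name, value)` tuples (which I believe is what the `headers` attribute of warcio's `record.http_headers` holds).

```python
def sets_cookie(headers, cookie_name):
    """Return True if any Set-Cookie header in `headers` sets `cookie_name`.

    `headers` is a list of (name, value) tuples; header names are compared
    case-insensitively because servers send both Set-Cookie and set-cookie.
    """
    for name, value in headers:
        if name.lower() != 'set-cookie':
            continue
        # the cookie's name=value pair comes first, before any attributes
        first_pair = value.split(';', 1)[0]
        if first_pair.split('=', 1)[0].strip() == cookie_name:
            return True
    return False

# made-up response headers where the Shopify cookie is not the first cookie
example_headers = [
    ('Content-Type', 'text/html; charset=utf-8'),
    ('Set-Cookie', 'session=abc123; path=/'),
    ('set-cookie', '_shopify_y=1196d6be-95e1-4f55-9590-35c81700bc4f; path=/'),
]
print(sets_cookie(example_headers, '_shopify_y'))
```

Unlike the substring test in the loop, this compares the cookie name exactly, so it could return slightly different results on the real data.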

You can see a bit of duplication, since some URLs redirect to their `www` equivalent. To join this data up with the DomainTools dataset, let's get the naked domain again.

In [29]:
from urllib.parse import urlparse

def get_domain(url):
    # strip only a leading "www." so the rest of the hostname is left alone
    host = urlparse(url).netloc
    return host[4:] if host.startswith('www.') else host

shopify_domains = set(map(get_domain, shopify_urls))
shopify_domains

{'19covidmask.com',
 'acsienterprises.com',
 'adioscorona.com',
 'adioscoronavirus.com',
 'americanmasks.myshopify.com',
 'anti-corona-mask.com',
 'anti-corona-virus.com',
 'anti-corona.ml',
 'anti-corona.nl',
 'anti-coronakits.com',
 'anti-coronavirus.org',
 'anti-coronavirus.tech',
 'anti-covid19.store',
 'anti-cronavirusmask.com',
 'antiagingbed.com',
 'anticorona-virus.com',
 'anticorona.co',
 'anticoronagermsupply.com',
 'anticoronaviru.com',
 'anticoronavirus-masks.com',
 'anticoronaviruscare.com',
 'anticoronavirusstore.com',
 'anticovid.store',
 'anticovid19.fr',
 'anticovidmask.shop',
 'antiviruscorona.shop',
 'apparelcovid19.com',
 'approvedcovid19tests.com.au',
 'beatthecorona.com',
 'becoronavirusclean.com',
 'bescherming-coronavirus.nl',
 'brovid-19.com',
 'buildcoronavirushospital.com',
 'buycoronavirusgear.com',
 'bye-covid19.com',
 'caronavirusmasks.com',
 'chinese-virus.com',
 'clovid19viruskits.com',
 'co-vid19store.com',
 'combatcoronavirus.org',
 'contrelecoronaviru

Now we can join back up with the DomainTools dataset:

In [44]:
import pandas

df = pandas.read_csv('data/domaintools/2020-04-13.csv.gz',
    parse_dates=['created'], 
    sep='\t',
    names=['domain', 'created', 'risk'],
)
df.head()

|   | domain | created | risk |
|---|--------|---------|------|
| 0 | covidcolour.online | 2020-04-12 | 99 |
| 1 | covid-19-test-kit.info | 2020-04-12 | 99 |
| 2 | quarantinecrew.xyz | 2020-04-12 | 99 |
| 3 | covidmemes.co | 2020-04-12 | 99 |
| 4 | quarantalne.website | 2020-04-12 | 99 |


Now let's add a column of booleans to indicate whether the domain was flagged as being handled by Shopify.

In [45]:
df['shopify'] = df['domain'].isin(shopify_domains)

Now let's get a dataframe of just the Shopify domains with their created dates and risk scores.

In [46]:
shopify = df[df['shopify']]
shopify

|   | domain | created | risk | shopify |
|---|--------|---------|------|---------|
| 6543 | coronamasks.io | 2020-04-07 | 99 | True |
| 9010 | covid-19masken.de | 2020-04-06 | 99 | True |
| 13431 | firstcoronavirusresponseact.com | 2020-04-04 | 99 | True |
| 15400 | covid19-sanitizers.com | 2020-04-03 | 99 | True |
| 17217 | covid19-stop.de | 2020-04-02 | 99 | True |
| ... | ... | ... | ... | ... |
| 121506 | quarante--ten.com | 2020-03-05 | 97 | True |
| 122871 | covidone9.co.uk | 2020-03-30 | 96 | True |
| 123792 | coronakit.com.co | 2020-03-20 | 96 | True |
| 126998 | trackingcorona.com | 2020-02-22 | 95 | True |


It's not entirely clear if the *created* value is when the domain was created or when DomainTools started tracking it (they say it can be either). But it could still be interesting to see how the number of these domains changes over time.

In [50]:
import altair as alt

# number of Shopify domains per created date
counts = shopify.created.value_counts().reset_index()
counts.columns = ['created', 'count']

alt.Chart(counts, width=800, title="Shopify Domains Created/Tracked per Day").mark_bar().encode(
    alt.X("created:T", title="Days"),
    alt.Y('count:Q', title="Domains")
)

## Test Kits

The New York Times [reported](https://www.nytimes.com/2020/03/24/business/coronavirus-ecommerce-sites.html) on March 24 that Shopify had started blocking coronavirus-related sites that were selling test kits. So let's take another pass through the archived data and look for any pages that mention "test kit" in their content.


In [61]:
import re
import sys

test_kit_urls = []
with open('data/warc/domaintools-2020-04-13.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        
        # only look at response records that carry HTTP headers
        if record.rec_type == 'response' and record.http_headers:

            # this is the URL that we collected (based on the DomainTools list)
            url = record.rec_headers.get_header('WARC-Target-URI')

            # limit to HTML content
            if 'html' in record.http_headers.get_header('Content-Type', ''):

                content = record.content_stream().read().decode('utf8', errors='ignore')
                if re.search('test kit', content, re.IGNORECASE):
                    # print a dot as a simple progress indicator
                    sys.stdout.write('.')
                    test_kit_urls.append(url)

..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
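
The plain `'test kit'` search is a blunt instrument: it misses spellings like "test-kit" or "testing kit", and it also fires on unrelated text such as "contest kitchen". If that matters, a somewhat more tolerant pattern could be used instead. This is only a sketch under my own assumptions about which variants are worth catching; `TEST_KIT` and `mentions_test_kit` are hypothetical names, and the results shown in this notebook were produced with the simpler search.

```python
import re

# "test kit", "test-kit", "testkit", "testing kits", ... as whole words only
TEST_KIT = re.compile(r'\btest(?:ing)?[\s-]*kits?\b', re.IGNORECASE)

def mentions_test_kit(text):
    """Return True if `text` contains a test kit mention in one of these forms."""
    return TEST_KIT.search(text) is not None

for sample in ['Buy a COVID-19 Test Kit',
               'rapid test-kits in stock',
               'home testing kit',
               'contest kitchen']:
    print(sample, mentions_test_kit(sample))
```

The first three samples match and the last one does not. Like the original search, this runs over the raw HTML, so mentions inside scripts or markup are counted too.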

In [65]:
test_kit_domains = set(map(get_domain, test_kit_urls))
test_kit_domains

{'about.fb.com',
 'acovid19test.com',
 'adsone.group',
 'alphacheckpoint.com',
 'anti-coronavirusproducts.co.za',
 'antibodiescheck.com',
 'anticoronaviruswholesale.com',
 'anticovid19.top',
 'athenscovid19.com',
 'bcov2019.live',
 'besafe-from-covid.com',
 'bestcovid19test.convertri.com',
 'blokkx-covid19.com',
 'breakingcoronavirus.com',
 'buycovidtest.com',
 'c19medicus.com',
 'canada.ca',
 'cannabisdomainfinder.com',
 'caronavirus.supplies',
 'casesofcoronavirus.com',
 'certifycovidclear.com',
 'chambercovidupdates.ky',
 'cornavirus.app',
 'corona-19.tech',
 'corona-advice.info',
 'corona-schnelltester.com',
 'corona-virus-test.shop',
 'corona19kit.com',
 'corona19test.co.uk',
 'corona20.in',
 'corona2019.org',
 'coronacide.com',
 'coronacure.news',
 'coronadetectionkit.com',
 'coronadiseasetracker.com',
 'coronaeq.com',
 'coronafaq.org',
 'coronafighter.cn',
 'coronaidmd.com',
 'coronalatestnewsupdates.com',
 'coronamedicalsupplies.co',
 'coronamedtest.com',
 'coronameters.com',
 

Now we can look for the intersection between the test kit domains and the Shopify domains we identified earlier.

In [67]:
shopify_domains.intersection(test_kit_domains)

{'rapidmedicalsystems.com'}

Not bad, only one of the Shopify domains mentions test kits! But there are a lot of test kit domains out there overall.

In [68]:
len(test_kit_domains)

388