# Employment by category vs online purchases

Looking at changes in local stores (establishments) vs online purchases, comparing business category (by NAICS code) and product category by states.


Hypothesis: 
There is a negative correlation between change in online purchases for category X and change in employment at establishments of NAICS/category X.

Focusing on changes from 2018 to 2019
- There is both purchases data and (SUSB) census data for these years. SUSB/establishments census data goes only to 2021
- Using other years would incorporate 2020 (COVID!) changes -- not studying COVID impacts here

#### Categories of interest to study:
Criteria for choosing:
- Clear dilineation of categories
- There can be a logical replacement of in person buys by online buys

Categories:

- Books stores
    - hypothesis validated (p<0.05)
- Shoes stores
    - hypothesis validated (p<0.05)

(Didn't pan out)
- Pet supplies
    - overall increase in employment
- Paint and wallpaper 
    - not enough online purchases (most paint purchases in our Amazon data for arts and crafts)
- Electronics
    - ideas for why this didn't work:
    - no clear delineation of category -- included tens of subcategories
    - overlaps with other NAICS codes -- office supplies, used electronics, toys and games (video game is a big category)
    - while book stores and shoes stores clearly have products that people previously needed to buy in stores and can now buy online, there are more and more NEW types of electronics to buy and therefore there is not as clearly a transfer of where specific items are being purchased.

In [5]:
from datetime import date, datetime
import os

from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

census_data_dir = '../data/census/'

## Load in the states data
For matching to Census data.

In [6]:
states_df = pd.read_csv(census_data_dir + 'state-abbreviations.csv')
states_df.head(3)

Unnamed: 0,state,abbrev,code
0,Alabama,Ala.,AL
1,Alaska,Alaska,AK
2,Arizona,Ariz.,AZ


## Census data

Census data are from 
Statistics of U.S. Businesses (SUSB)
https://www.census.gov/programs-surveys/susb/technical-documentation/methodology.html

Data are downloaded from https://www.census.gov/programs-surveys/susb/data/tables.html

I compiled data for each year, pulling out NAICS codes of interest, in spreadsheets: [link](https://docs.google.com/spreadsheets/d/1JNXUakd53ekObgPEwbz6tB56MkIErJ937pdLbCj37IQ/edit?usp=sharing).

I normalized employment changes by population changes by first pulling population data for each year from NST-EST2020. [Population data are here](https://docs.google.com/spreadsheets/d/1_YEiBzyt8BtOl8oPYZ51cZN3SpmBk9QIcerPq6OM0PY/edit?usp=sharing).
I calculated employment/population for each year and then calculated percent change in employment.

i.e. metric of interest is percent change in employment from 2018 to 2019:

```
= [(employment2019/population2019) - (employment2018/population2018)]/(employment2018/population2018)
```


In [7]:
def read_susb_data(fname):
    df = pd.read_csv(census_data_dir + 'SUSB/' + fname)
    # Drop United States total
    df = df.drop(0)
    df['state code'] = df['State Name'].map(states_df.set_index('state')['code'])
    # Index and order by state code to match purchases data
    df = df.set_index('state code').sort_index()
    return df

#### Read in books census data

In [8]:
census_books_stores = read_susb_data('books-stores-aggregated.csv')
census_books_stores.head(3)

Unnamed: 0_level_0,State,State Name,NAICS,NAICS Description,Enterprise Size,Establishments 2018,Establishments 2019,Establishments 2020,Employment 2018,Employment 2019,...,Enterprise size <500 employees: 2019 employment/population,Enterprise size <500 employees: 2018-2019 employment/population percent change,Enterprise Size < 20,Enterprise size <20 employees: Establishments 2018,Enterprise size <20 employees: Establishments 2019,Enterprise size <20 employees: Employment 2018,Enterprise size <20 employees: Employment 2019,Enterprise size <20 employees: 2018 employment/population,Enterprise size <20 employees: 2019 employment/population,Enterprise size <20 employees: 2018-2019 employment/population percent change
state code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AK,2,Alaska,451211,Book Stores,01: Total,27,23,27,175,139,...,8e-05,-0.277525,05: <20 employees,14,9,58,37,7.9e-05,5e-05,-0.359442
AL,1,Alabama,451211,Book Stores,01: Total,107,106,94,983,842,...,3.1e-05,-0.135834,05: <20 employees,27,28,141,115,2.9e-05,2.3e-05,-0.187112
AR,5,Arkansas,451211,Book Stores,01: Total,89,86,76,693,689,...,3.7e-05,0.015208,05: <20 employees,24,24,71,78,2.4e-05,2.6e-05,0.095383


Pull out columns of interest and compare

BEWARE the metrics you use!

- metrics have low correlation if any
- year to year changes negatively correlated
- change in employment low positive correlation with change in establishments

In [10]:
census_books_stores[[
    '2018-2019 Establishments percent change',
    '2019-2020 Establishments percent change',
    '2018-2019 Employment percent change',
    '2019-2020 Employment percent change',
    '2018-2019 employment/population percent change',
    '2019-2020 employment/population percent change',
    # Also have data specific to enterprise size
    'Enterprise size <500 employees: 2018-2019 employment/population percent change',
    'Enterprise size <20 employees: 2018-2019 employment/population percent change',
    
]].corr()

Unnamed: 0,2018-2019 Establishments percent change,2019-2020 Establishments percent change,2018-2019 Employment percent change,2019-2020 Employment percent change,2018-2019 employment/population percent change,2019-2020 employment/population percent change,Enterprise size <500 employees: 2018-2019 employment/population percent change,Enterprise size <20 employees: 2018-2019 employment/population percent change
2018-2019 Establishments percent change,1.0,-0.319861,0.260161,-0.117258,0.216737,-0.079443,0.437304,0.348661
2019-2020 Establishments percent change,-0.319861,1.0,-0.123347,0.250796,-0.073798,0.225837,-0.10325,-0.300708
2018-2019 Employment percent change,0.260161,-0.123347,1.0,-0.092615,0.992434,-0.077472,0.6616,0.352816
2019-2020 Employment percent change,-0.117258,0.250796,-0.092615,1.0,-0.057756,0.973918,-0.201326,-0.177107
2018-2019 employment/population percent change,0.216737,-0.073798,0.992434,-0.057756,1.0,-0.041061,0.635306,0.334797
2019-2020 employment/population percent change,-0.079443,0.225837,-0.077472,0.973918,-0.041061,1.0,-0.179528,-0.145611
Enterprise size <500 employees: 2018-2019 employment/population percent change,0.437304,-0.10325,0.6616,-0.201326,0.635306,-0.179528,1.0,0.443595
Enterprise size <20 employees: 2018-2019 employment/population percent change,0.348661,-0.300708,0.352816,-0.177107,0.334797,-0.145611,0.443595,1.0


In [47]:
# Pull out columns of interest 
census_books_est_pct_change_20182019 = census_books_stores['2018-2019 Establishments percent change']
census_books_est_pct_change_20192020 = census_books_stores['2019-2020 Establishments percent change']
# census_books_emp_pct_change_20182019 = census_books_stores['2018-2019 Employment percent change']
# census_books_emp_pct_change_20192020 = census_books_stores['2019-2020 Employment percent change']
census_books_emp_pct_change_20182019 = census_books_stores['2018-2019 employment/population percent change']
census_books_emp_pct_change_20192020 = census_books_stores['2019-2020 employment/population percent change']

census_books_lt_500_emp_pct_change_20182019 = census_books_stores['Enterprise size <500 employees: 2018-2019 employment/population percent change']
census_books_lt_20_emp_pct_change_20182019 = census_books_stores['Enterprise size <20 employees: 2018-2019 employment/population percent change']

census_books_est_pct_change_20182019.head()

state code
AK   -0.148148
AL   -0.009346
AR   -0.033708
AZ   -0.008065
CA    0.000000
Name: 2018-2019 Establishments percent change, dtype: float64

#### Read in shoe stores census data

For shoe stores, employment and establishments data more correlated than for book stores.

Negative correlations across years.

In [14]:
census_shoe_stores = read_susb_data('shoe-stores-aggregated.csv')
census_shoe_stores.head(3)

Unnamed: 0_level_0,State,State Name,NAICS,NAICS Description,Enterprise Size,Establishments 2018,Establishments 2019,Establishments 2020,Employment 2018,Employment 2019,...,Enterprise size <500 employees: 2019 employment/population,Enterprise size <500 employees: 2018-2019 employment/population percent change,Enterprise Size < 20,Enterprise size <20 employees: Establishments 2018,Enterprise size <20 employees: Establishments 2019,Enterprise size <20 employees: Employment 2018,Enterprise size <20 employees: Employment 2019,Enterprise size <20 employees: 2018 employment/population,Enterprise size <20 employees: 2019 employment/population,Enterprise size <20 employees: 2018-2019 employment/population percent change
state code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AK,2,Alaska,448210,Shoe Stores,01: Total,35,36,35,246,259,...,0.000166,0.123875,05: <20 employees,8,9,57,64,7.7e-05,8.7e-05,0.127431
AL,1,Alabama,448210,Shoe Stores,01: Total,370,360,355,3268,3457,...,0.000122,-0.062692,05: <20 employees,66,61,349,319,7.1e-05,6.5e-05,-0.089002
AR,5,Arkansas,448210,Shoe Stores,01: Total,199,199,194,1548,1647,...,8.6e-05,-0.021734,05: <20 employees,35,34,144,130,4.8e-05,4.3e-05,-0.099859


In [15]:
census_shoe_stores[[
    '2018-2019 Establishments percent change',
    '2019-2020 Establishments percent change',
    '2018-2019 Employment percent change',
    '2019-2020 Employment percent change',
    '2018-2019 employment/population percent change',
    '2019-2020 employment/population percent change',
    # Also have data specific to enterprise size
    'Enterprise size <500 employees: 2018-2019 employment/population percent change',
    'Enterprise size <20 employees: 2018-2019 employment/population percent change',
]].corr()

Unnamed: 0,2018-2019 Establishments percent change,2019-2020 Establishments percent change,2018-2019 Employment percent change,2019-2020 Employment percent change,2018-2019 employment/population percent change,2019-2020 employment/population percent change,Enterprise size <500 employees: 2018-2019 employment/population percent change,Enterprise size <20 employees: 2018-2019 employment/population percent change
2018-2019 Establishments percent change,1.0,-0.466332,0.548621,-0.279487,0.546644,-0.269233,0.036012,0.025754
2019-2020 Establishments percent change,-0.466332,1.0,-0.041917,0.274604,-0.067856,0.286381,0.042038,-0.024941
2018-2019 Employment percent change,0.548621,-0.041917,1.0,-0.308751,0.987648,-0.270753,0.140001,0.154044
2019-2020 Employment percent change,-0.279487,0.274604,-0.308751,1.0,-0.337143,0.992687,0.006106,0.280784
2018-2019 employment/population percent change,0.546644,-0.067856,0.987648,-0.337143,1.0,-0.29343,0.183909,0.147873
2019-2020 employment/population percent change,-0.269233,0.286381,-0.270753,0.992687,-0.29343,1.0,0.027734,0.299174
Enterprise size <500 employees: 2018-2019 employment/population percent change,0.036012,0.042038,0.140001,0.006106,0.183909,0.027734,1.0,0.43134
Enterprise size <20 employees: 2018-2019 employment/population percent change,0.025754,-0.024941,0.154044,0.280784,0.147873,0.299174,0.43134,1.0


In [16]:
# Pull out columns of interest
census_shoes_est_pct_change_20182019 = census_shoe_stores['2018-2019 Establishments percent change']
census_shoes_est_pct_change_20192020 = census_shoe_stores['2019-2020 Establishments percent change']
# census_shoes_emp_pct_change_20182019 = census_shoe_stores['2018-2019 Employment percent change']
# census_shoes_emp_pct_change_20192020 = census_shoe_stores['2019-2020 Employment percent change']
census_shoes_emp_pct_change_20182019 = census_shoe_stores['2018-2019 employment/population percent change']
census_shoes_emp_pct_change_20192020 = census_shoe_stores['2019-2020 employment/population percent change']

census_shoes_lt_500_emp_pct_change_20182019 = census_shoe_stores['Enterprise size <500 employees: 2018-2019 employment/population percent change']
census_shoes_lt_20_emp_pct_change_20182019 = census_shoe_stores['Enterprise size <20 employees: 2018-2019 employment/population percent change']

census_shoes_emp_pct_change_20182019.head()

state code
AK    0.057181
AL    0.054312
AR    0.060846
AZ   -0.159494
CA   -0.092635
Name: 2018-2019 employment/population percent change, dtype: float64

#### Read in electronics stores census data

For electronics stores, employment and establishments data less correlated than for shoe stores.

Negative correlations across years.

In [17]:
census_electronics_stores = read_susb_data('electronics-stores-aggregated.csv')
census_electronics_stores.head(3)

Unnamed: 0_level_0,State,State Name,NAICS,NAICS Description,Enterprise Size,Establishments 2018,Establishments 2019,Establishments 2020,Employment 2018,Employment 2019,...,2018-2019 Employment percent change,2019-2020 Employment percent change,2018 population,2019 population,2020 population,2018 employment/population,2019 employment/population,2020 employment/population,2018-2019 employment/population percent change,2019-2020 employment/population percent change
state code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AK,2,Alaska,443142,Electronics Stores,01: Total,47,43,37,437,418,...,-0.045455,-0.097113,736624,733603,731158,0.000593,0.00057,0.000521,-0.039539,-0.085469
AL,1,Alabama,443142,Electronics Stores,01: Total,220,199,186,2380,2363,...,-0.007194,-0.29978,4891628,4907965,4921532,0.000487,0.000481,0.000369,-0.010448,-0.23276
AR,5,Arkansas,443142,Electronics Stores,01: Total,167,160,153,1674,1693,...,0.011223,-0.095084,3012161,3020985,3030522,0.000556,0.00056,0.00051,0.008396,-0.089702


In [18]:
census_electronics_stores[[
    '2018-2019 Establishments percent change',
    '2019-2020 Establishments percent change',
    '2018-2019 Employment percent change',
    '2019-2020 Employment percent change'
]].corr()

Unnamed: 0,2018-2019 Establishments percent change,2019-2020 Establishments percent change,2018-2019 Employment percent change,2019-2020 Employment percent change
2018-2019 Establishments percent change,1.0,-0.088687,0.357728,0.066169
2019-2020 Establishments percent change,-0.088687,1.0,-0.366091,0.400313
2018-2019 Employment percent change,0.357728,-0.366091,1.0,-0.498358
2019-2020 Employment percent change,0.066169,0.400313,-0.498358,1.0


In [19]:
# Pull out columns of interest
census_electronics_est_pct_change_20182019 = census_electronics_stores['2018-2019 Establishments percent change']
census_electronics_est_pct_change_20192020 = census_electronics_stores['2019-2020 Establishments percent change']
census_electronics_emp_pct_change_20182019 = census_electronics_stores['2018-2019 employment/population percent change']
census_electronics_emp_pct_change_20192020 = census_electronics_stores['2019-2020 employment/population percent change']
census_electronics_emp_pct_change_20182019.head()

state code
AK   -0.039539
AL   -0.010448
AR    0.008396
AZ   -0.081680
CA   -0.076248
Name: 2018-2019 employment/population percent change, dtype: float64

## Amazon purchases

Will restrict analysis to 2018 and 2019 and analyze corresponding changes. 
- Limiting to these years to avoid changes due to COVID-19

Restrict data to response ids that had purchases in 2018.

In [20]:
amzn_data_fpath = '../data/amazon-data/amazon-data-cleaned.csv'
amzn_data = pd.read_csv(amzn_data_fpath, index_col=[0])
# add year to data for convenience
amzn_data['year'] = pd.to_datetime(amzn_data['Order Date']).apply(lambda d: d.year)
# peek at it:
amzn_data.drop(['Survey ResponseID'], axis=1).head(3)

Unnamed: 0,Order Date,Purchase Price Per Unit,Quantity,Shipping Address State,Title,ASIN/ISBN (Product Code),Category,unit price,total price,yyyy-mm,state,year
0,2018-02-21,$7.93,1.0,RHODE ISLAND,Suburban World: The Norling Photos,0873516095,ABIS_BOOK,7.93,7.93,2018-02,RI,2018
1,2018-02-21,$3.53,1.0,RHODE ISLAND,,B004S7EZR0,,3.53,3.53,2018-02,RI,2018
2,2018-03-05,$5.99,1.0,RHODE ISLAND,1952 Back In The Day - 24-page Greeting Card /...,193938012X,ABIS_BOOK,5.99,5.99,2018-03,RI,2018


## Sampling set up

Set up for repeated random sampling.

Since analysis is done comparing states, limit data to states with a sufficient number of response ids to sample from.

Exclude data from respondents who purchase more than the 90th percentile.

In [21]:
# Restrict data to responseIds with purchases in 2018.
print('Before dropping any data:')
print('N=%s unique purchasers' % amzn_data['Survey ResponseID'].nunique())
print('%s total purchases' % len(amzn_data))
print('Dropping data for response ids do that do not have purchases in 2018')
responseids_2018 = amzn_data[amzn_data['year']==2018]['Survey ResponseID'].unique()
amzn_data_sample = amzn_data[amzn_data['Survey ResponseID'].isin(responseids_2018)]
print('N=%s unique purchasers' % amzn_data_sample['Survey ResponseID'].nunique())
print('%s total purchases' % len(amzn_data_sample))

Before dropping any data:
N=5027 unique purchasers
1850717 total purchases
Dropping data for response ids do that do not have purchases in 2018
N=4281 unique purchasers
1745772 total purchases


In [22]:
responseids_by_state = amzn_data.groupby('state')['Survey ResponseID'].nunique()
print('Bottom number of response ids by state')
print(responseids_by_state.sort_values().head(8))
print('\nDistribution of response ids by state')
print(responseids_by_state.describe())

Bottom number of response ids by state
state
PR    10
ND    16
WY    21
AK    28
MT    29
SD    29
VT    45
DE    51
Name: Survey ResponseID, dtype: int64

Distribution of response ids by state
count     52.000000
mean     215.634615
std      203.585799
min       10.000000
25%       67.750000
50%      144.000000
75%      281.000000
max      920.000000
Name: Survey ResponseID, dtype: float64


In [23]:
state_min_n = 50
print('using n=%s as minimum number of response ids for each state' % state_min_n)
sample_states = responseids_by_state[responseids_by_state >= state_min_n].index
print('%s states meet threshold' % len(sample_states))
print('dropping %s purchases for the states that do not meet the threshold' % (
    len(amzn_data_sample) - len(amzn_data_sample[amzn_data_sample['state'].isin(sample_states)])))
amzn_data_sample = amzn_data_sample[amzn_data_sample['state'].isin(sample_states)]

print('N=%s unique purchasers' % amzn_data_sample['Survey ResponseID'].nunique())
print('%s total purchases' % len(amzn_data_sample))

using n=50 as minimum number of response ids for each state
45 states meet threshold
dropping 92969 purchases for the states that do not meet the threshold
N=4211 unique purchasers
1652803 total purchases


In [24]:
purchases_by_responseid = amzn_data_sample.groupby(['Survey ResponseID'])['Quantity'].agg(['sum','count'])
print('Distribution of purchases per response id')
purchases_by_responseid.describe()

Distribution of purchases per response id


Unnamed: 0,sum,count
count,4211.0,4211.0
mean,427.691285,392.496557
std,485.553254,435.463559
min,1.0,1.0
25%,114.5,108.0
50%,276.0,256.0
75%,562.5,517.5
max,5839.0,5413.0


Random sampling pipeline

- before: make sampling frame -- first restrict data as needed by category
- randomly sample with replacement n responseids 
    - not stratified: n total
    - stratified: n for each state in threshold states

In [25]:
def get_random_sample(frame=amzn_data_sample, N=2500, sample_states=sample_states):
    # limit the sampling frame to the states (if not already)
    sample_responseids = np.random.choice(
        frame[frame['state'].isin(sample_states)]['Survey ResponseID'].unique(), 
        size=N, replace=True
    )
    return frame[frame['Survey ResponseID'].isin(sample_responseids)]


def get_random_stratified_sample(frame=amzn_data_sample, state_n=state_min_n, sample_states=sample_states):
    stratified_sample_df = pd.DataFrame(columns=frame.columns)
    for s in sample_states:
        sample_responseids = np.random.choice(
            frame[frame['state']==s]['Survey ResponseID'].unique(),
            size=state_n, replace=True,
        )
        sampled_df = frame[frame['Survey ResponseID'].isin(sample_responseids)]
        stratified_sample_df = pd.concat([stratified_sample_df, sampled_df])
    return stratified_sample_df

## Books

Look at them.

FYI they are the TOP most purchased category.

In [26]:
# Using sample vs all data excludes people who bought hundreds of gift cards
amzn_data_sample['Category'].value_counts().head(5)

Category
ABIS_BOOK                 80764
PET_FOOD                  35748
NUTRITIONAL_SUPPLEMENT    24995
SHIRT                     24876
ELECTRONIC_CABLE          16618
Name: count, dtype: int64

Check: Are there other book categories?

In [27]:
categories = amzn_data['Category'].unique()
bookish_categories = [c for c in categories if 'book' in str(c).lower()]
amzn_data[amzn_data['Category'].isin(bookish_categories)]['Category'].value_counts()

Category
ABIS_BOOK                        87619
BLANK_BOOK                        3422
NOTEBOOK_COMPUTER                 1040
AMAZON_BOOK_READER                 505
BOOKMARK                           322
BOOK_DOCUMENT_STAND                226
AMAZON_BOOK_READER_ACCESSORY       208
BOOKEND                            186
BOOK_COVER                         114
BOOK                                37
BOOKS_1973_AND_LATER                34
ELECTRONIC_BOOK_READER              14
ABIS_EBOOKS                         14
BOOKSHELF_OR_MICRO_STEREO_SYS        1
Name: count, dtype: int64

In [28]:
print('What are these?')
amzn_data[amzn_data['Category']=='BOOK'][['Category','Title','unit price','Quantity','state']].head()

What are these?


Unnamed: 0,Category,Title,unit price,Quantity,state
211,BOOK,Kafka on the Shore,12.25,1.0,CA
283,BOOK,Sarah Plain and Tall,8.95,1.0,OR
205,BOOK,Art of Seduction,23.04,1.0,TX
20,BOOK,Don't Shoot the Dog! : The New Art of Teaching...,9.23,1.0,IL
1562,BOOK,Sesame Street Ultimate Board Books Set for Kid...,16.95,1.0,OR


In [29]:
amzn_data[amzn_data['Category']=='BOOKS_1973_AND_LATER'][['Category','Title','unit price','Quantity','state']].head()

Unnamed: 0,Category,Title,unit price,Quantity,state
23,BOOKS_1973_AND_LATER,Gentle Babies: Essential Oils and Natural Reme...,17.5,1.0,TX
10,BOOKS_1973_AND_LATER,"Explore and Learn, 6 Volume Set: Earth and Spa...",15.68,1.0,IN
143,BOOKS_1973_AND_LATER,Games People Play: The Psychology of Human Rel...,15.5,1.0,IN
2830,BOOKS_1973_AND_LATER,"KIRSTEN, AN AMERICAN GIRL (6 books, Boxed set)",29.71,1.0,OH
132,BOOKS_1973_AND_LATER,National Park Journal: Yellowstone,7.99,1.0,AR


In [30]:
amzn_data[amzn_data['Category']=='AMAZON_BOOK_READER'][['Category','Title','unit price','Quantity','state']].head()

Unnamed: 0,Category,Title,unit price,Quantity,state
521,AMAZON_BOOK_READER,Kindle Paperwhite – (previous generation - 201...,129.99,1.0,IL
303,AMAZON_BOOK_READER,Kindle Paperwhite – (previous generation - 201...,84.99,1.0,OH
555,AMAZON_BOOK_READER,"Kindle Paperwhite (8 GB) – Now with a 6.8"" dis...",76.5,1.0,OH
44,AMAZON_BOOK_READER,Certified Refurbished Kindle Paperwhite E-read...,79.99,1.0,PA
56,AMAZON_BOOK_READER,Kindle Paperwhite – (previous generation - 201...,99.99,1.0,PA


In [31]:
# collect the real book categories
book_categories = ['ABIS_BOOK', 'BOOK', 'BOOKS_1973_AND_LATER']
book_purchases = amzn_data_sample[amzn_data_sample['Category'].isin(book_categories)]
print('%s total book purchases in sample' % len(book_purchases))

80832 total book purchases in sample


### Book purchases analysis

Will look at 2018 and 2019 data and related changes.

In [32]:
book_purchases = book_purchases[book_purchases['year'].isin([2018, 2019])]
print('%s total book purchases from N=%s purchasers in 2018-2019 sample' % (len(book_purchases), book_purchases['Survey ResponseID'].nunique()))
book_purchases.head(3)

30558 total book purchases from N=2991 purchasers in 2018-2019 sample


Unnamed: 0,Order Date,Purchase Price Per Unit,Quantity,Shipping Address State,Title,ASIN/ISBN (Product Code),Category,Survey ResponseID,unit price,total price,yyyy-mm,state,year
0,2018-02-21,$7.93,1.0,RHODE ISLAND,Suburban World: The Norling Photos,0873516095,ABIS_BOOK,R_3I9Pu8iauEcOx9A,7.93,7.93,2018-02,RI,2018
2,2018-03-05,$5.99,1.0,RHODE ISLAND,1952 Back In The Day - 24-page Greeting Card /...,193938012X,ABIS_BOOK,R_3I9Pu8iauEcOx9A,5.99,5.99,2018-03,RI,2018
4,2018-04-29,$4.50,1.0,RHODE ISLAND,Time of Wonder (Picture Puffins),0140502017,ABIS_BOOK,R_3I9Pu8iauEcOx9A,4.5,4.5,2018-04,RI,2018


Data checks

In [33]:
# Sum is sum over quanity. Count is unique purchases per person per year
print('There are some outlier purchasers making lots of purchases!')
print('Book purchases per person per year')
book_purchases_per_person = book_purchases.groupby(['year','Survey ResponseID'])['Quantity'].agg(['sum','count'])
book_purchases_per_person.describe()

There are some outlier purchasers making lots of purchases!
Book purchases per person per year


Unnamed: 0,sum,count
count,4837.0,4837.0
mean,6.541245,6.317552
std,9.715249,9.050545
min,1.0,1.0
25%,2.0,2.0
50%,3.0,3.0
75%,8.0,8.0
max,188.0,179.0


In [34]:
# What is the 90th percentile?
# Actually the 99th percentile is a reasonable amount of books to buy so let's use that as the cut off.
print('90th percentile : ', book_purchases_per_person['sum'].quantile(0.90))
print('95th percentile : ', book_purchases_per_person['sum'].quantile(0.95))
print('99th percentile : ', book_purchases_per_person['sum'].quantile(0.99))
max_purchases = book_purchases_per_person['sum'].quantile(0.99)

90th percentile :  15.0
95th percentile :  22.0
99th percentile :  46.0


In [35]:
# too_many_books_responseids = book_purchases_per_person[
#     (book_purchases_per_person['sum'] > max_purchases)
# ].reset_index()['Survey ResponseID'].unique()
# print('Dropping %s response IDs for people who bought more than %s books' % (len(too_many_books_responseids), max_purchases))

Given our focus on number of buyers rather than purchases: 
What if we didn't remove the top 99th percentile?

Test shows correlation higher without removal (larger N). 

In [36]:
# book_purchases = book_purchases[~book_purchases['Survey ResponseID'].isin(too_many_books_responseids)]
# print('%s total book purchases from N=%s purchasers in 2018-2019 dataset' % (len(book_purchases), book_purchases['Survey ResponseID'].nunique()))
# book_purchases.head(3)

In [37]:
book_purchases2018 = book_purchases[book_purchases['year']==2018]
book_purchases2019 = book_purchases[book_purchases['year']==2019]
print('%s book purchases from N=%s purchasers in 2018 dataset' % (len(book_purchases2018), book_purchases2018['Survey ResponseID'].nunique()))
print('%s book purchases from N=%s purchasers in 2019 dataset' % (len(book_purchases2019), book_purchases2019['Survey ResponseID'].nunique()))

15812 book purchases from N=2435 purchasers in 2018 dataset
14746 book purchases from N=2402 purchasers in 2019 dataset


In [38]:
# Metric: Portion of purchasers increasing number of purchases

def get_portion_increases_by_purchaser(purchases, yr1=2018, yr2=2019):
    # List with a number (float) for each state
    portion_purchaser_increases = []
    for s in sample_states:
        s_purchases = book_purchases[book_purchases['state']==s]
        # Make series mapping response ID to total book purchases (summed quantity)
        # Create dataframe with both seris 
        # fill na with 0 where ResponseIDs are missing in 1 yr
        # quantify portion of response IDs that increased purchases
        s_purchases_yr1 = s_purchases[s_purchases['year']==yr1].groupby(
            'Survey ResponseID'
        )['Quantity'].sum().rename(yr1).to_frame()
        s_purchases_yr2 = s_purchases[s_purchases['year']==yr2].groupby(
            'Survey ResponseID'
        )['Quantity'].sum().rename(yr2).to_frame()
        s_purchases_yr1_yr2 = pd.merge(
            s_purchases_yr1, s_purchases_yr2, how='outer', left_index=True, right_index=True
        ).fillna(0)
        s_purchases_yr1_yr2['increased'] = s_purchases_yr1_yr2[yr2] > s_purchases_yr1_yr2[yr1]
        portion_increased = s_purchases_yr1_yr2['increased'].sum()/len(s_purchases_yr1_yr2)
        portion_purchaser_increases += [portion_increased]
    return pd.Series(portion_purchaser_increases, index=sample_states)

In [39]:
book_portion_increases_by_purchaser = get_portion_increases_by_purchaser(book_purchases)
book_portion_increases_by_purchaser.head()

state
AL    0.519231
AR    0.571429
AZ    0.432836
CA    0.416894
CO    0.465116
dtype: float64

In [40]:
def get_pct_change_buyers(purchases, yr1=2018, yr2=2019, verbose=False):
    purchases_yr1 = purchases[purchases['year']==yr1]
    purchases_yr2 = purchases[purchases['year']==yr2]
    buyers_by_state_yr1 = purchases_yr1.groupby(['state'])['Survey ResponseID'].nunique()
    buyers_by_state_yr2 = purchases_yr2.groupby(['state'])['Survey ResponseID'].nunique()
    pct_change = (buyers_by_state_yr2 - buyers_by_state_yr1)/buyers_by_state_yr1
    if verbose:
        print('%s purchases from N=%s purchasers in 2018 dataset' % (len(purchases_yr1), purchases_yr1['Survey ResponseID'].nunique()))
        print('%s purchases from N=%s purchasers in 2019 dataset' % (len(purchases_yr2), purchases_yr2['Survey ResponseID'].nunique()))
    return pct_change
    
book_buyers_pct_change = get_pct_change_buyers(book_purchases, verbose=True)

15812 purchases from N=2435 purchasers in 2018 dataset
14746 purchases from N=2402 purchasers in 2019 dataset


In [41]:
def get_pct_change_purchases(purchases, yr1=2018, yr2=2019, verbose=False):
    purchases_yr1 = purchases[purchases['year']==yr1]
    purchases_yr2 = purchases[purchases['year']==yr2]
    purchases_by_state_yr1 = purchases_yr1.groupby(['state'])['Quantity'].sum()
    purchases_by_state_yr2 = purchases_yr2.groupby(['state'])['Quantity'].sum()
    pct_change = (purchases_by_state_yr2 - purchases_by_state_yr1)/purchases_by_state_yr1
    if verbose:
        print('%s purchases from N=%s purchasers in 2018 dataset' % (len(purchases_yr1), purchases_yr1['Survey ResponseID'].nunique()))
        print('%s purchases from N=%s purchasers in 2019 dataset' % (len(purchases_yr2), purchases_yr2['Survey ResponseID'].nunique()))
    return pct_change
    
book_purchases_pct_change = get_pct_change_purchases(book_purchases, verbose=True)

15812 purchases from N=2435 purchasers in 2018 dataset
14746 purchases from N=2402 purchasers in 2019 dataset


Quantifying change in distinct purchasers by state and total purchases by state
- these metrics are correlated but not highly correlated (p<0.05)

In [42]:
print('correlation between percent change in buyers vs total purchasers (books)')
r, pvalue = pearsonr(book_purchases_pct_change, book_buyers_pct_change)
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(book_purchases_pct_change, book_buyers_pct_change)
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

correlation between percent change in buyers vs total purchasers (books)
Pearson r=0.3528 (p-value=0.0175)
Spearman r=0.3360 (p-value=0.0240)


#### Compare census data to purchases data

Using random sampling

for N subsamples:
- get random subsample from books purchases
- get pct changes
- get mean pct changes
- compare mean pct changes to census data (correlation)

In [43]:
N_subsamples = 1000
state_pct_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%500==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_book_purchases = get_random_sample(frame=book_purchases)
    state_pct_changes += [get_pct_change_buyers(sampled_book_purchases, verbose=v)]
book_buyers_pct_changes = pd.DataFrame(state_pct_changes).mean()

1/1000
8878 purchases from N=1376 purchasers in 2018 dataset
8320 purchases from N=1364 purchasers in 2019 dataset
501/1000
9040 purchases from N=1400 purchasers in 2018 dataset
8534 purchases from N=1373 purchasers in 2019 dataset


In [44]:
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(book_buyers_pct_changes, census_books_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Comparing to 2018-2019 census data: Employment
Pearson r=-0.3460 (p-value=0.0199)


We find the expected negative correlation (p<0.05) comparing:
- 2018-2019 change in number of book buyers per state
- 2018-2019 change in employment per state

(change found for random sample without stratifying, N=2500)

Not found for:
- total number of purchases per state
- employment in following year (2019-2020 has opposite direction)
- number establishments


Questions:

How does this change by sampling strategy?
- random sample vs stratified random sample 
    - stratifying has no improvement
- smaller N for simple random sample
    - N=2000 --> weaker correlation
    
Notes: 
- Correlation is about the same (slightly weaker) when using portion increases by purchaser
- normalizing employment by population works slightly better than not normalizing by population
- correlation weaker when limiting to enterprises <500 employees or <20 employees

In [352]:
# Look at the data
books_metrics = pd.DataFrame({
    'book buyers pct changes': book_buyers_pct_changes,
    'book purchases pct changes': book_purchases_pct_change,
    'book portion increases by purchaser': book_portion_increases_by_purchaser
})
print('Correlations between metrics:')
display(books_metrics.corr())
books_metrics

Correlations between metrics:


Unnamed: 0,book buyers pct changes,book purchases pct changes,book portion increases by purchaser
book buyers pct changes,1.0,0.366915,0.706474
book purchases pct changes,0.366915,1.0,0.611868
book portion increases by purchaser,0.706474,0.611868,1.0


Unnamed: 0_level_0,book buyers pct changes,book purchases pct changes,book portion increases by purchaser
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AL,0.006327,0.147436,0.519231
AR,-0.02725,0.351648,0.571429
AZ,0.009368,0.42471,0.432836
CA,-0.096519,-0.024564,0.416894
CO,0.193924,-0.284375,0.465116
CT,-0.083245,0.049451,0.439024
DC,-0.145172,-0.252874,0.391304
DE,0.215772,0.0,0.4
FL,-0.060372,-0.23301,0.39738
GA,0.113148,0.033333,0.466102


## Shoes

https://www.naics.com/naics-code-description/?code=448210

Look at them

In [49]:
shoesy_categories = [c for c in categories if 'shoe' in str(c).lower()]
amzn_data[amzn_data['Category'].isin(shoesy_categories)]['Category'].value_counts()

Category
SHOES                   12758
SHOE_INSERT              1804
SHOELACE                  892
SHOE_ACCESSORY            248
SHOE_TREE                 148
TECHNICAL_SPORT_SHOE      123
SHOE_BAG                   37
SNOWSHOE                   27
SHOE_POLISH                18
GUILD_SHOES                 1
Name: count, dtype: int64

FYI the 'Guild shoes' are beautifully crafted crochet sneakers made by an artist. Not to be included in shoes analysis. https://www.amazon.com/Sneakers-Slippers-Crochet-Comfortable-Basketball/dp/B09X5RGJKG

In [50]:
amzn_data[amzn_data['Category']=='TECHNICAL_SPORT_SHOE'][['Category','Title','unit price','Quantity','state']].head(3)

Unnamed: 0,Category,Title,unit price,Quantity,state
104,TECHNICAL_SPORT_SHOE,Dear Time Women Flat Shoes Comfortable Slip on...,14.99,1.0,NY
72,TECHNICAL_SPORT_SHOE,"adidas Men's Alphabounce Em m, White/Metallic ...",31.21,1.0,NJ
502,TECHNICAL_SPORT_SHOE,New Balance Men's 410 V5 Cushioning Trail Runn...,69.95,1.0,VA


Are there other shoesy categories?

In [51]:
slippers_categories = [c for c in categories if 'slipper' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(slippers_categories)]['Category'].value_counts())

sandals_categories = [c for c in categories if 'sandal' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(sandals_categories)]['Category'].value_counts())

boot_categories = [c for c in categories if 'boot' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(boot_categories)]['Category'].value_counts())

Category
SLIPPER    2711
Name: count, dtype: int64
Category
SANDAL    3641
Name: count, dtype: int64
Category
BOOT              3167
SNOWBOARD_BOOT       7
Name: count, dtype: int64


In [52]:
amzn_data[amzn_data['Category']=='SLIPPER'].groupby(
    'Title'
)['Quantity'].agg(['sum','count']).sort_values('count',ascending=False).head()

Unnamed: 0_level_0,sum,count
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
RockDove Men's Original Two-Tone Memory Foam Slipper,67.0,67
"ULTRAIDEAS Women's Fuzzy Wool-Like House Shoes with Memory Foam, Gift for Women, Ladies Slippers with Indoor &Outdoor Anti-Skid Rubber Sole",52.0,51
Jessica Simpson Women's Comfy Faux Fur House Slipper Scuff Memory Foam Slip on Anti-Skid Sole,23.0,23
Dearfoams Women's Rebecca Lightweight Cozy Memory Foam Closed Back Slipper with Wide Widths,22.0,22
landeer Women's and Men's Memory Foam Slippers Casual House Shoes,18.0,18


In [53]:
amzn_data[amzn_data['Category']=='BOOT'].groupby(
    'Title'
)['Quantity'].agg(['sum','count']).sort_values('count',ascending=False).head()

Unnamed: 0_level_0,sum,count
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
Lone Cone Rain Boots with Easy-On Handles in Fun Patterns for Toddlers and Kids,34.0,34
Hudson Baby Unisex-Baby Cozy Fleece Booties,30.0,30
Asgard Women's Ankle Rain Boots Waterproof Chelsea Boots,24.0,24
Clarks Men's Bushacre 2 Chukka Boot,21.0,21
Columbia Men's Newton Ridge Plus Ii Waterproof Hiking Boot Shoe,18.0,18


In [54]:
shoe_categories = ['SHOES', 'TECHNICAL_SPORT_SHOE', 'BOOT', 'SANDAL', 'SLIPPER']
shoe_purchases = amzn_data_sample[amzn_data_sample['Category'].isin(shoe_categories)]
print('%s total shoe purchases in dataset' % len(shoe_purchases))

20482 total shoe purchases in dataset


### Shoe purchases analysis

Limit to 2018 - 2019 related changes

In [55]:
shoe_purchases = shoe_purchases[shoe_purchases['year'].isin([2018, 2019])]
print('%s total shoe purchases from N=%s purchasers in 2018-2019 sample' % (len(shoe_purchases), shoe_purchases['Survey ResponseID'].nunique()))
shoe_purchases.head(3)

6246 total shoe purchases from N=1875 purchasers in 2018-2019 sample


Unnamed: 0,Order Date,Purchase Price Per Unit,Quantity,Shipping Address State,Title,ASIN/ISBN (Product Code),Category,Survey ResponseID,unit price,total price,yyyy-mm,state,year
34,2019-10-08,$14.99,1.0,RHODE ISLAND,GAXmi Flip Flops Women Men Kids Summer Casual ...,B07CYHXHFR,SANDAL,R_3I9Pu8iauEcOx9A,14.99,14.99,2019-10,RI,2019
40,2019-10-09,$39.00,1.0,RHODE ISLAND,Amazon Essentials Men's Chelsea Boot,B07QKNGSNB,BOOT,R_3I9Pu8iauEcOx9A,39.0,39.0,2019-10,RI,2019
41,2019-10-09,$39.00,1.0,RHODE ISLAND,Amazon Essentials Men's Chelsea Boot,B07QH14QTN,BOOT,R_3I9Pu8iauEcOx9A,39.0,39.0,2019-10,RI,2019


Data checks

In [56]:
# Sum is sum over quanity. Count is unique purchases per person per year
print('There are some outlier purchasers making lots of purchases!')
print('Shoe purchases per person per year')
shoe_purchases_per_person = shoe_purchases.groupby(['year','Survey ResponseID'])['Quantity'].agg(['sum','count'])
shoe_purchases_per_person.describe()

There are some outlier purchasers making lots of purchases!
Shoe purchases per person per year


Unnamed: 0,sum,count
count,2566.0,2566.0
mean,2.453624,2.434139
std,3.22091,3.204065
min,1.0,1.0
25%,1.0,1.0
50%,2.0,2.0
75%,3.0,3.0
max,83.0,83.0


In [57]:
# What is the 90th percentile?
# Actually the 99th percentile is a reasonable amount of shoes to buy (imagine they have a family)
print('90th percentile : ', shoe_purchases_per_person['sum'].quantile(0.90))
print('95th percentile : ', shoe_purchases_per_person['sum'].quantile(0.95))
print('99th percentile : ', shoe_purchases_per_person['sum'].quantile(0.99))

90th percentile :  5.0
95th percentile :  7.0
99th percentile :  14.0


In [58]:
shoe_purchases2018 = shoe_purchases[shoe_purchases['year']==2018]
shoe_purchases2019 = shoe_purchases[shoe_purchases['year']==2019]
print('%s shoe purchases from N=%s purchasers in 2018 dataset' % (len(book_purchases2018), shoe_purchases2018['Survey ResponseID'].nunique()))
print('%s shoe purchases from N=%s purchasers in 2019 dataset' % (len(book_purchases2019), shoe_purchases2019['Survey ResponseID'].nunique()))

15812 shoe purchases from N=1178 purchasers in 2018 dataset
14746 shoe purchases from N=1388 purchasers in 2019 dataset


In [59]:
shoe_purchases_pct_change = get_pct_change_purchases(shoe_purchases, verbose=True)
shoe_purchases_pct_change.head()

2746 purchases from N=1178 purchasers in 2018 dataset
3500 purchases from N=1388 purchasers in 2019 dataset


state
AL    0.096774
AR   -0.212121
AZ    0.447368
CA    0.178808
CO   -0.136364
Name: Quantity, dtype: float64

In [60]:
shoe_buyers_pct_change = get_pct_change_buyers(shoe_purchases, verbose=True)
shoe_buyers_pct_change.head()

2746 purchases from N=1178 purchasers in 2018 dataset
3500 purchases from N=1388 purchasers in 2019 dataset


state
AL    0.062500
AR    0.300000
AZ    0.578947
CA    0.222222
CO    0.380952
Name: Survey ResponseID, dtype: float64

In [61]:
print('correlation between percent change in buyers vs total purchasers (shoes)')
r, pvalue = pearsonr(shoe_purchases_pct_change, shoe_buyers_pct_change)
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(shoe_purchases_pct_change, shoe_buyers_pct_change)
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

correlation between percent change in buyers vs total purchasers (shoes)
Pearson r=0.8501 (p-value=0.0000)
Spearman r=0.6214 (p-value=0.0000)


#### Compare census data to purchases data

Using random sampling

for N subsamples:
- get random subsample from books purchases
- get pct changes
- get mean pct changes
- compare mean pct changes to census data (correlation)

In [62]:
N_subsamples = 1000
state_pct_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%500==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_purchases = get_random_sample(frame=shoe_purchases)
    state_pct_changes += [get_pct_change_buyers(sampled_purchases, verbose=v)]
shoe_buyers_pct_changes = pd.DataFrame(state_pct_changes).mean()

1/1000
2055 purchases from N=864 purchasers in 2018 dataset
2656 purchases from N=1017 purchasers in 2019 dataset
501/1000
2021 purchases from N=850 purchasers in 2018 dataset
2580 purchases from N=1025 purchasers in 2019 dataset


In [63]:
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(shoe_buyers_pct_changes, census_shoes_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Comparing to 2018-2019 census data: Employment
Pearson r=-0.3125 (p-value=0.0366)


Notes from experiments

- using a smaller random sample (given fewer ppl buy shoes than books) reduces correlation
- No significant correlation with # Establishments
- Stratifying by state: No improvement
- Slightly stronger negative correlation when comparing total purchases to employment
- Doesn't work: comparing portion of purchasers increasing purchases
- Doesn't work/correlation too weak: Limiting to establishments of smaller size

- normalizing employment by population works slightly better than not normalizing by population


## Paint and wallpaper stores

Not doing this analysis:
It looks like there are not enough overall paint purchases and like the most popular items purchased in the paint category are for art.

In [51]:
painty_categories = [c for c in categories if 'paint' in str(c).lower()]
amzn_data[amzn_data['Category'].isin(painty_categories)]['Category'].value_counts()

Category
PAINT          3483
PAINT_BRUSH    1053
BODY_PAINT      282
Name: count, dtype: int64

In [52]:
wallpaper_categories = [c for c in categories if 'wallpaper' in str(c).lower()]
amzn_data[amzn_data['Category'].isin(wallpaper_categories)]['Category'].value_counts()

Category
WALLPAPER    525
Name: count, dtype: int64

In [53]:
print('What kinds of products are these?')
amzn_data[amzn_data['Category']=='PAINT'].groupby(
    'Title'
)['Quantity'].agg(['sum','count']).sort_values('count',ascending=False).head()

What kinds of products are these?


Unnamed: 0_level_0,sum,count
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
"Apple Barrel PROMOABI Acrylic Paint Set, 2 Fl Oz (Pack of 18), Assorted Matte Colors, 18 Count",23.0,23
"Apple Barrel Acrylic Paint in Assorted Colors (8 Ounce), 20403 White",20.0,16
"Crayola Washable Kids Paint, 6 Count, Kids At Home Activities, Painting Supplies, Gift, Assorted",16.0,16
"Crafts 4 All Acrylic Paint Set for Adults and Kids - 24-Pack of 12mL Paints for Canvas, Wood & Ceramic w/ 3 Art Brushes - Non-Toxic Craft Paint Sets - Stocking Stuffers for Girls and Boys",13.0,13
"Krylon K01305 Gallery Series Artist and Clear Coatings Aerosol, 11-Ounce, UV-Resistant Clear Gloss",12.0,12


## Pet food, pet supplies, etc

Not exploring this since the census data shows overall increases in employment year to year.

In [54]:
pet_categories = [c for c in categories if str(c).startswith('PET_')]
amzn_data[amzn_data['Category'].isin(pet_categories)]['Category'].value_counts()

Category
PET_FOOD                  38256
PET_SUPPLIES              10902
PET_TOY                    7261
PET_PEST_CONTROL           1941
PET_FEEDER                 1870
PET_ACTIVITY_STRUCTURE     1712
PET_BED_MAT                1665
PET_APPAREL                1444
PET_PLACEMAT                436
PET_HEALTH_CARE             300
PET_PLAYPEN                 261
PET_DOOR                    209
PET_SEAT                     52
PET_FUR_DEODORIZER           11
Name: count, dtype: int64

## Electronics

From: https://www.naics.com/naics-code-description/?code=443142

This U.S. industry comprises: (1) establishments known as consumer electronics stores primarily engaged in retailing a general line of new consumer-type electronic products such as televisions, computers, and cameras; (2) establishments specializing in retailing a single line of consumer-type electronic products; (3) establishments primarily engaged in retailing these new electronic products in combination with repair and support services; (4) establishments primarily engaged in retailing new prepackaged computer software; and/or (5) establishments primarily engaged in retailing prerecorded audio and video media, such as CDs, DVDs, and tapes.

Illustrative Examples:

- Cellular telephone accessories stores
- Consumer-type electronic stores (e.g., televisions, computers, cameras)
- Stereo stores (except automotive)
- Radio and television stores
- Computer stores


#### Note what should not be included

From https://www.naics.com/naics-code-description/?code=443142:
- Retailing electronic goods via electronic home shopping, mail-order, or direct sale--are classified in Subsector 454, Nonstore Retailers;
- Retailing automotive electronic sound systems--are classified in Industry 441310, Automotive Parts and Accessories Stores;
- Retailing new computers, computer peripherals, and prepackaged software in combination with retailing new office equipment, office furniture, and office supplies--are classified in Industry 453210, Office Supplies and Stationery Stores;
- Retailing new cellular telephones and communication service plans--are classified in U.S. Industry 517312, Wireless Telecommunications Carriers (except Satellite);
- Providing television or other electronic equipment repair services without retailing new televisions or electronic products--are classified in Industry 81121, Electronic and Precision Equipment Repair and Maintenance;
- Developing film and/or making photographic slides, prints, and enlargements without retailing a range of new photographic equipment and supplies--are classified in Industry 81292, Photofinishing;
- Retailing new electronic toys, such as dedicated video game consoles and handheld electronic games--are classified in Industry 451120, Hobby, Toy, and Game Stores; and
- Retailing used electronics--are classified in Industry 453310, Used Merchandise Stores.


i.e. exclude from below:
- anything with AUTO or CAR or VEHICLE
- cellphones
- probably anything with OFFICE
- probably film
- used electronics
- video game consoles and handheld games, video game hardware

In [55]:
electronics_categories = [c for c in categories if 'electronic' in str(c).lower()]
amzn_data[amzn_data['Category'].isin(electronics_categories)]['Category'].value_counts()

Category
ELECTRONIC_CABLE                      18268
PORTABLE_ELECTRONIC_DEVICE_COVER       7629
ELECTRONIC_ADAPTER                     4890
ELECTRONIC_GIFT_CARD                   2975
PORTABLE_ELECTRONIC_DEVICE_MOUNT       2636
PORTABLE_ELECTRONIC_DEVICE_STAND       2305
CONSUMER_ELECTRONICS                   1925
ELECTRONIC_SWITCH                      1494
ELECTRONIC_COMPONENT_FAN               1447
ELECTRONIC_FINDER                       441
ELECTRONIC_DEVICE_SKIN                  368
ELECTRONIC_COMPONENT_TERMINAL           326
ELECTRONIC_SENSOR                       318
ELECTRONIC_WIRE                         315
PRELOADED_ELECTRONIC_GAME               299
ELECTRONIC_DEVICE_COOLING_PAD           236
ELECTRONIC_DEVICE_DOCKING_STATION       223
ELECTRONIC_LEARNING_TOY                 221
PORTABLE_ELECTRONIC_DEVICE_ARMBAND      210
ELECTRONIC_COMPONENT                    180
SECURITY_ELECTRONICS                    161
CAR_ELECTRONICS                         120
OFFICE_ELECTRONICS     

In [56]:
video_categories = [c for c in categories if 'video' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(video_categories)]['Category'].value_counts())

dvd_categories = [c for c in categories if 'dvd' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(dvd_categories)]['Category'].value_counts())

cd_categories = [c for c in categories if ('cd_' in str(c).lower()) or ('_cd' in str(c).lower())]
print(amzn_data[amzn_data['Category'].isin(cd_categories)]['Category'].value_counts())

stereo_categories = [c for c in categories if 'stereo' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(stereo_categories)]['Category'].value_counts())

speaker_categories = [c for c in categories if 'speaker' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(speaker_categories)]['Category'].value_counts())

tv_categories = [c for c in categories if 'television' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(tv_categories)]['Category'].value_counts())

camera_categories = [c for c in categories if 'camera' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(camera_categories)]['Category'].value_counts())

print('\n----Computers and such----\n')
pc_categories = [c for c in categories if ('pc_' in str(c).lower()) or ('_pc' in str(c).lower())]
print(amzn_data[amzn_data['Category'].isin(pc_categories)]['Category'].value_counts())

computer_categories = [c for c in categories if 'computer' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(computer_categories)]['Category'].value_counts())

radio_categories = [c for c in categories if 'radio' in str(c).lower()]
amzn_data[amzn_data['Category'].isin(radio_categories)]['Category'].value_counts()

Category
PHYSICAL_VIDEO_GAME_SOFTWARE     6829
DOWNLOADABLE_VIDEO_GAME          5669
VIDEO_GAME_CONTROLLER            1795
VIDEO_GAME_ACCESSORIES           1279
VIDEO_GAME_CONSOLE                838
CONSOLE_VIDEO_GAMES               671
VIDEO_CARD                        416
VIDEO_PROJECTOR                   229
VIDEO_DISC_PLAYER                 152
VIDEO_GAME_PERIPHERAL_SET         100
VIDEO_DVD                          96
AUDIO_OR_VIDEO                     84
VIDEO_GAME_HARDWARE                78
ABIS_VIDEO_GAMES                   49
Video Game                         48
PORTABLE_VIDEO_DISC_PLAYER         42
DIGITAL_VIDEO_RECORDER             19
STREAMING_VIDEO_SUBSCRIPTION       12
ABIS_VIDEO                          4
VIDEO_VHS                           3
VIDEO_DEVICE                        2
VIDEO_PROJECTOR_PART                1
VIDEO_GAME                          1
COMPUTER_VIDEO_GAME_CONTOLLER       1
Name: count, dtype: int64
Category
ABIS_DVD                  960
VIDEO_DVD     

Category
TWO_WAY_RADIO    341
RADIO            333
Name: count, dtype: int64

In [57]:
screen_categories = [c for c in categories if 'screen_' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(screen_categories)]['Category'].value_counts())
cord_categories = [c for c in categories if 'cord' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(cord_categories)]['Category'].value_counts())
cable_categories = [c for c in categories if 'cable' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(cable_categories)]['Category'].value_counts())
adapter_categories = [c for c in categories if 'adapter' in str(c).lower()]
print(amzn_data[amzn_data['Category'].isin(adapter_categories)]['Category'].value_counts())

Category
SCREEN_PROTECTOR             9640
FLAT_SCREEN_DISPLAY_MOUNT    1498
Name: count, dtype: int64
Category
POWER_CORD                       1548
CAMCORDER                        1132
THREAD_CORD                       820
CORD_MANAGEMENT_COVER             793
SOUND_AND_RECORDING_EQUIPMENT     481
CORD_ROPE                         318
BUNGEE_CORD                       178
VOICE_RECORDER                    119
SURVEILLANCE_RECORDER_SYSTEM       44
DIGITAL_VIDEO_RECORDER             19
DVD_PLAYER_OR_RECORDER              8
Name: count, dtype: int64
Category
ELECTRONIC_CABLE    18268
CABLE_TIE             785
CABLE_OR_ADAPTER      421
CABLE_ASSEMBLY        175
CABLE                   2
Name: count, dtype: int64
Category
CHARGING_ADAPTER                        8175
ELECTRONIC_ADAPTER                      4890
NETWORK_INTERFACE_CONTROLLER_ADAPTER    1491
CABLE_OR_ADAPTER                         421
WIRELESS_AUDIO_ADAPTER                   417
Name: count, dtype: int64


In [58]:
print('What kinds of products are these?')

What kinds of products are these?


In [59]:
amzn_data[amzn_data['Category']=='PHYSICAL_VIDEO_GAME_SOFTWARE'].groupby(
    'Title'
)['Quantity'].agg(['sum','count']).sort_values('count',ascending=False).head()

Unnamed: 0_level_0,sum,count
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
Super Smash Bros. Ultimate - Nintendo Switch,142.0,141
Animal Crossing: New Horizons - Nintendo Switch,92.0,91
Ring Fit Adventure - Nintendo Switch,85.0,85
The Legend of Zelda: Breath of the Wild - Nintendo Switch,73.0,70
Mario Kart 8 Deluxe - Nintendo Switch,67.0,67


In [60]:
# Don't use this
amzn_data[amzn_data['Category']=='ELECTRONIC_GIFT_CARD'].groupby(
    'Title'
)['Quantity'].agg(['sum','count']).sort_values('count',ascending=False).head()

Unnamed: 0_level_0,sum,count
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
Amazon.com eGift Card,1674.0,1390
"Google Play gift code - give the gift of games, apps and more (Email Delivery - US Only)",365.0,364
Amazon.com Print at Home Gift Card,333.0,285
Grubhub Gift Cards - Email Delivery,312.0,276
Safeway Gift Card - Email Delivery (Must print eGift to redeem),193.0,193


In [61]:
amzn_data[amzn_data['Category']=='COMPUTER'].groupby(
    'Title'
)['Quantity'].agg(['sum','count']).sort_values('count',ascending=False).head()

Unnamed: 0_level_0,sum,count
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
Allstate 4-Year PC Peripheral Protection Plan ($0-49.99),89.0,84
Allstate 4-Year PC Peripheral Protection Plan ($75-99.99),35.0,34
"ARCTIC MX-4 (4 g) - Premium Performance Thermal Paste for all processors (CPU, GPU - PC, PS4, XBOX), very high thermal conductivity, long durability, safe application, non-conductive, non-capacitive",28.0,27
Allstate 3-Year PC Peripheral Protection Plan ($0-49.99),23.0,21
Allstate 4-Year PC Peripheral Protection Plan ($50-74.99),17.0,17


In [62]:
amzn_data[amzn_data['Category']=='WEARABLE_COMPUTER'].groupby(
    'Title'
)['Quantity'].agg(['sum','count']).sort_values('count',ascending=False).head()

Unnamed: 0_level_0,sum,count
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
"Fitbit Inspire 2 Health & Fitness Tracker with a Free 1-Year Fitbit Premium Trial, 24/7 Heart Rate, Black/Black, One Size (S & L Bands Included)",43.0,43
"Fitbit Inspire HR Heart Rate and Fitness Tracker, One Size (S and L Bands Included), 1 Count",26.0,26
"Fitbit Versa Smart Watch, Black/Black Aluminium, One Size (S & L Bands Included)",24.0,23
"Amazfit Band 5 Activity Fitness Tracker with Alexa Built-in, 15-Day Battery Life, Blood Oxygen, Heart Rate, Sleep & Stress Monitoring, 5 ATM Water Resistant, Fitness Watch for Men Women Kids, Black",22.0,22
"Fitbit Inspire 2 Health & Fitness Tracker with a Free 1-Year Fitbit Premium Trial, 24/7 Heart Rate, Black/Rose, One Size (S & L Bands Included)",21.0,21


In [63]:
electronics_categories = [
    'ELECTRONIC_CABLE',
    'PORTABLE_ELECTRONIC_DEVICE_COVER',
    'ELECTRONIC_ADAPTER',
    'PORTABLE_ELECTRONIC_DEVICE_MOUNT',
    'PORTABLE_ELECTRONIC_DEVICE_STAND',
    'CONSUMER_ELECTRONICS',
    'ELECTRONIC_DEVICE_SKIN',
    #'ELECTRONIC_COMPONENT_TERMINAL',
    'ELECTRONIC_WIRE',
    'PRELOADED_ELECTRONIC_GAME',
    'ELECTRONIC_DEVICE_COOLING_PAD',
    'ELECTRONIC_DEVICE_DOCKING_STATION',
    'PORTABLE_ELECTRONIC_DEVICE_ARMBAND',
    'SECURITY_ELECTRONICS',
    #'ELECTRONIC_DEVICE_FACEPLATE',
    'ABIS_ELECTRONICS',
    'ELECTRONIC_BOOK_READER',
    'ELECTRONIC_CONTROLLER',
    'PORTABLE_ELECTRONICS',
    'Electronics',
    'PHYSICAL_VIDEO_GAME_SOFTWARE',
    'DOWNLOADABLE_VIDEO_GAME',
    'CONSOLE_VIDEO_GAMES',
    'VIDEO_CARD',
    'VIDEO_PROJECTOR',
    'VIDEO_DISC_PLAYER',
    'VIDEO_DVD',
    'AUDIO_OR_VIDEO',
    'ABIS_VIDEO_GAMES',
    'Video',
    'PORTABLE_VIDEO_DISC_PLAYER',
    'DIGITAL_VIDEO_RECORDER',
    'ABIS_VIDEO',
    'VIDEO_VHS',
    'VIDEO_DEVICE',
    'VIDEO_PROJECTOR_PART',
    'VIDEO_GAME',
    'ABIS_DVD',
    'VIDEO_DVD',
    'DVD_PLAYER_OR_RECORDER',
    'DVD',
    'AUDIO_CD_PLAYER',
    'LCD_GRAPHIC_DISPLAY',
    'INTEGRATED_STEREO_SYSTEM',
    'BOOKSHELF_OR_MICRO_STEREO_SYS',
    'SPEAKERS',
    'SPEAKER_AMPLIFIER_STAND',
    'COMPUTER_SPEAKER',
    'TELEVISION',
    'SECURITY_CAMERA',
    'CAMERA_TRIPOD',
    'CAMERA_OTHER_ACCESSORIES',
    'CAMERA_CONTINUOUS_LIGHT',
    'CAMERA_LENSES',
    'CAMERA_DIGITAL',
    'CAMERA_SUPPORT',
    'CAMERA_CLEANER',
    'CAMERA_LENS_FILTERS',
    'CAMERA_STAGE_LIGHTING_MODIFIER',
    'CAMERA',
    'CAMERA_ENCLOSURE',
    'CAMERA_FLASH',
    'CAMERA_LENS_ACCESSORY',
    'CAMERA_BAGS_AND_CASES',
    'CAMERA_PRIVACY_COVER',
    'CAMERA_STAGE_LIGHTING_FILTER_DIFFUSER',
    'CAMERA_POWER_SUPPLY',
    'ABIS_PC',
    'COMPUTER_DRIVE_OR_STORAGE',
    'WEARABLE_COMPUTER',
    'NOTEBOOK_COMPUTER',
    'TABLET_COMPUTER',
    'COMPUTER_COMPONENT',
    'COMPUTER_ADD_ON',
    'COMPUTER_CHASSIS',
    'COMPUTER',
    'COMPUTER_PROCESSOR',
    'PERSONAL_COMPUTER',
    'COMPUTER_INPUT_DEVICE',
    'SINGLE_BOARD_COMPUTER',
    'COMPUTER_COOLING_DEVICE',
    'COMPUTER_INPUT_DEVICE_ACCESSORY',
    'COMPUTER_SPEAKER',
    'TWO_WAY_RADIO',
    #'SCREEN_PROTECTOR',
    'FLAT_SCREEN_DISPLAY_MOUNT',
    'POWER_CORD',
    'CAMCORDER',
    'CORD_MANAGEMENT_COVER',
    'SOUND_AND_RECORDING_EQUIPMENT',
    'VOICE_RECORDER',
    'SURVEILLANCE_RECORDER_SYSTEM',
    'DIGITAL_VIDEO_RECORDER',
    'DVD_PLAYER_OR_RECORDER',
    #'CABLE_TIE',
    'CABLE_OR_ADAPTER',
    #'CABLE_ASSEMBLY',
    'CABLE',
    'CHARGING_ADAPTER',
    'ELECTRONIC_ADAPTER',
    #'NETWORK_INTERFACE_CONTROLLER_ADAPTER',
    'WIRELESS_AUDIO_ADAPTER',
]

In [64]:
electronics_purchases = amzn_data_sample[amzn_data_sample['Category'].isin(electronics_categories)]
print('%s total electronics purchases in dataset' % len(electronics_purchases))

75107 total electronics purchases in dataset


### Electronics stores analysis

Limit to 2018 - 2019 related changes

In [65]:
electronics_purchases = electronics_purchases[electronics_purchases['year'].isin([2018, 2019])]
print('%s total electronics related purchases from N=%s purchasers in 2018-2019 sample' % (len(electronics_purchases), electronics_purchases['Survey ResponseID'].nunique()))

23928 total electronics related purchases from N=3345 purchasers in 2018-2019 sample


Data checks

In [66]:
# Sum is sum over quanity. Count is unique purchases per person per year
print('electronics purchases per person per year')
electronics_purchases_per_person = electronics_purchases.groupby(['year','Survey ResponseID'])['Quantity'].agg(['sum','count'])
electronics_purchases_per_person.describe()

electronics purchases per person per year


Unnamed: 0,sum,count
count,5448.0,5448.0
mean,4.595448,4.39207
std,5.576255,5.047862
min,1.0,1.0
25%,1.0,1.0
50%,3.0,3.0
75%,6.0,5.0
max,86.0,73.0


In [67]:
print('90th percentile : ', electronics_purchases_per_person['sum'].quantile(0.90))
print('95th percentile : ', electronics_purchases_per_person['sum'].quantile(0.95))
print('99th percentile : ', electronics_purchases_per_person['sum'].quantile(0.99))

90th percentile :  10.0
95th percentile :  14.0
99th percentile :  26.0


In [68]:
electronics_purchases2018 = shoe_purchases[shoe_purchases['year']==2018]
electronics_purchases2019 = shoe_purchases[shoe_purchases['year']==2019]
print('%s electronics purchases from N=%s purchasers in 2018 dataset' % (len(electronics_purchases2018), electronics_purchases2018['Survey ResponseID'].nunique()))
print('%s electronics purchases from N=%s purchasers in 2019 dataset' % (len(electronics_purchases2019), electronics_purchases2019['Survey ResponseID'].nunique()))

2746 electronics purchases from N=1178 purchasers in 2018 dataset
3500 electronics purchases from N=1388 purchasers in 2019 dataset


In [69]:
electronics_purchases_pct_change = get_pct_change_purchases(electronics_purchases, verbose=True)
electronics_purchases_pct_change.head()

11337 purchases from N=2661 purchasers in 2018 dataset
12591 purchases from N=2787 purchasers in 2019 dataset


state
AL    0.065789
AR    0.042735
AZ    0.323699
CA    0.010189
CO    0.264840
Name: Quantity, dtype: float64

In [70]:
electronics_buyers_pct_change = get_pct_change_buyers(electronics_purchases, verbose=True)
electronics_buyers_pct_change.head()

11337 purchases from N=2661 purchasers in 2018 dataset
12591 purchases from N=2787 purchasers in 2019 dataset


state
AL    0.117647
AR    0.068966
AZ    0.183673
CA    0.050167
CO    0.176471
Name: Survey ResponseID, dtype: float64

In [71]:
print('correlation between percent change in buyers vs total purchasers (electronics)')
r, pvalue = pearsonr(electronics_purchases_pct_change, electronics_buyers_pct_change)
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(electronics_purchases_pct_change, electronics_buyers_pct_change)
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

correlation between percent change in buyers vs total purchasers (electronics)
Pearson r=0.5268 (p-value=0.0002)
Spearman r=0.4570 (p-value=0.0016)


Would there be higher correlation if we limited data to 99th percentile and below?

Answer: Yes. Pearson r goes from <0.6 to >0.7

In [72]:
max_purchases = electronics_purchases_per_person['sum'].quantile(0.99)
too_many_electronics_responseids = electronics_purchases_per_person[
    (electronics_purchases_per_person['sum'] > max_purchases)
].reset_index()['Survey ResponseID'].unique()
print('Dropping %s response IDs for people who bought more than %s items' % (len(too_many_electronics_responseids), max_purchases))
electronics_purchases_99 = electronics_purchases[~electronics_purchases['Survey ResponseID'].isin(too_many_electronics_responseids)]
print('%s total electronics purchases from N=%s purchasers in 2018-2019 dataset' % (len(electronics_purchases), electronics_purchases['Survey ResponseID'].nunique()))

Dropping 44 response IDs for people who bought more than 26.0 items
23928 total electronics purchases from N=3345 purchasers in 2018-2019 dataset


In [73]:
electronics_purchases_pct_change = get_pct_change_purchases(electronics_purchases_99, verbose=True)
electronics_buyers_pct_change = get_pct_change_buyers(electronics_purchases_99, verbose=False)

print('correlation between percent change in buyers vs total purchasers (electronics)')
r, pvalue = pearsonr(electronics_purchases_pct_change, electronics_buyers_pct_change)
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(electronics_purchases_pct_change, electronics_buyers_pct_change)
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

10277 purchases from N=2618 purchasers in 2018 dataset
11371 purchases from N=2743 purchasers in 2019 dataset
correlation between percent change in buyers vs total purchasers (electronics)
Pearson r=0.6779 (p-value=0.0000)
Spearman r=0.6130 (p-value=0.0000)


In [74]:
electronics_purchases = electronics_purchases_99

In [75]:
print('%s total electronics related purchases from N=%s purchasers in 2018-2019 sample' % (len(electronics_purchases), electronics_purchases['Survey ResponseID'].nunique()))

21648 total electronics related purchases from N=3301 purchasers in 2018-2019 sample


#### Compare census data to purchases data

In [76]:
N_subsamples = 1000
state_pct_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%500==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_purchases = get_random_sample(frame=electronics_purchases)
    state_pct_changes += [get_pct_change_buyers(sampled_purchases, verbose=v)]
electronics_buyers_pct_changes = pd.DataFrame(state_pct_changes).mean()

1/1000
5631 purchases from N=1415 purchasers in 2018 dataset
6117 purchases from N=1473 purchasers in 2019 dataset
501/1000
5541 purchases from N=1379 purchasers in 2018 dataset
6036 purchases from N=1464 purchasers in 2019 dataset


In [77]:
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(electronics_buyers_pct_changes, census_electronics_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Comparing to 2018-2019 census data: Employment
Pearson r=-0.0120 (p-value=0.9376)


Overall not finding the hypothesized negative correlation.

Notes from experiments

- stratifying on state does not help

- what about restricting the categories? --> could not find the "right categories"

## Other Experiments / misses

### Books

In [None]:
# Employment
print('\nComparing to 2019-2020 census data: Employment')
r, pvalue = pearsonr(book_buyers_pct_changes, census_books_emp_pct_change_20192020.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(book_buyers_pct_changes, census_books_emp_pct_change_20192020.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

In [None]:
print('Comparing to 2018-2019 census data: Establishments')
r, pvalue = pearsonr(book_buyers_pct_changes, census_books_est_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(book_buyers_pct_changes, census_books_est_pct_change_20182019.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

print('\nComparing to 2019-2020 census data')
r, pvalue = pearsonr(book_buyers_pct_changes, census_books_est_pct_change_20192020.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(book_buyers_pct_changes, census_books_est_pct_change_20192020.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

In [None]:
# N=2000
N_subsamples = 1000
state_pct_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%500==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_book_purchases = get_random_sample(frame=book_purchases, N=2000)
    state_pct_changes += [get_pct_change_buyers(sampled_book_purchases, verbose=v)]
book_buyers_pct_changes = pd.DataFrame(state_pct_changes).mean()

Compare to census 2018-2019 employment data specific to enterprise size (smaller businesses)

In [45]:
print('Comparing to 2018-2019 census data: Employment: Enterprise size <500 employees')
r, pvalue = pearsonr(book_buyers_pct_changes, census_books_lt_500_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Comparing to 2018-2019 census data: Employment: Enterprise size <500 employees
Pearson r=-0.2375 (p-value=0.1162)


In [48]:
print('Comparing to 2018-2019 census data: Employment: Enterprise size <20 employees')
r, pvalue = pearsonr(book_buyers_pct_changes, census_books_lt_20_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Comparing to 2018-2019 census data: Employment: Enterprise size <20 employees
Pearson r=-0.0937 (p-value=0.5404)


In [None]:
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(book_buyers_pct_changes, census_books_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(book_buyers_pct_changes, census_books_emp_pct_change_20182019.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Using stratification

In [None]:
N_subsamples = 1000
state_pct_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%500==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_book_purchases = get_random_stratified_sample(frame=book_purchases)
    state_pct_changes += [get_pct_change_buyers(sampled_book_purchases, verbose=v)]
book_buyers_pct_changes = pd.DataFrame(state_pct_changes).mean()

In [None]:
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(book_buyers_pct_changes, census_books_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(book_buyers_pct_changes, census_books_emp_pct_change_20182019.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Total purchases (vs distinct buyers)

In [None]:
N_subsamples = 1000
state_pct_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%500==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_book_purchases = get_random_stratified_sample(frame=book_purchases)
    state_pct_changes += [get_pct_change_purchases(sampled_book_purchases, verbose=v)]
book_purchases_pct_changes = pd.DataFrame(state_pct_changes).mean()

In [None]:
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(book_purchases_pct_changes, census_books_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(book_purchases_pct_changes, census_books_emp_pct_change_20182019.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

print('\nComparing to 2019-2020 census data')
r, pvalue = pearsonr(book_purchases_pct_changes, census_books_emp_pct_change_20192020.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(book_purchases_pct_changes, census_books_emp_pct_change_20192020.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

In [None]:
N_subsamples = 1000
state_pct_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%250==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_book_purchases = get_random_sample(frame=book_purchases)
    state_pct_changes += [get_pct_change_purchases(sampled_book_purchases, verbose=v)]
book_purchases_pct_changes = pd.DataFrame(state_pct_changes).mean()

In [None]:
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(book_purchases_pct_changes, census_books_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(book_purchases_pct_changes, census_books_emp_pct_change_20182019.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

print('\nComparing to 2019-2020 census data')
r, pvalue = pearsonr(book_purchases_pct_changes, census_books_emp_pct_change_20192020.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(book_purchases_pct_changes, census_books_emp_pct_change_20192020.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

In [None]:
print('Comparing to 2018-2019 census data: Establishments')
r, pvalue = pearsonr(book_purchases_pct_changes, census_books_est_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(book_purchases_pct_changes, census_books_est_pct_change_20182019.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

print('\nComparing to 2019-2020 census data')
r, pvalue = pearsonr(book_purchases_pct_changes, census_books_est_pct_change_20192020.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(book_purchases_pct_changes, census_books_est_pct_change_20192020.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

In [None]:
book_purchases_by_state2019 = book_purchases2019.groupby(['state'])['Quantity'].sum()
print('2019 Top states for total book purchases')
print(book_purchases_by_state2019.sort_values(ascending=False).head())
print('\n2019 Bottom states for total book purchases')
print(book_purchases_by_state2019.sort_values().head())
book_purchases_by_state2019.describe()

In [None]:
# 2018 book purchases by state
book_purchases_by_state2018 = book_purchases2018.groupby(['state'])['Quantity'].sum()
print('2018 Top states for total book purchases')
print(book_purchases_by_state2018.sort_values(ascending=False).head())
print('\n2018 Bottom states for total book purchases')
print(book_purchases_by_state2018.sort_values().head())
book_purchases_by_state2018.describe()

Using metric: Portion of purchasers increasing purchases

In [251]:
N_subsamples = 1000
state_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%200==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_book_purchases = get_random_sample(frame=book_purchases)
    state_changes += [get_portion_increases_by_purchaser(sampled_book_purchases)]
book_portion_increases_by_purchaser = pd.DataFrame(state_changes).mean()

1/1000
201/1000
401/1000
601/1000
801/1000


In [252]:
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(book_portion_increases_by_purchaser, census_books_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Comparing to 2018-2019 census data: Employment
Pearson r=-0.3103 (p-value=0.0380)


### Shoes experiments

Comparing to number of establishments

In [None]:
print('Comparing to 2018-2019 census data: Establishments')
r, pvalue = pearsonr(shoe_buyers_pct_changes, census_shoes_est_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(shoe_buyers_pct_changes, census_books_est_pct_change_20182019.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

print('\nComparing to 2019-2020 census data')
r, pvalue = pearsonr(shoe_buyers_pct_changes, census_shoes_est_pct_change_20192020.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(shoe_buyers_pct_changes, census_shoes_est_pct_change_20192020.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Compare to census 2018-2019 employment data specific to enterprise size (smaller businesses)

In [64]:
print('Comparing to 2018-2019 census data: Employment: Enterprise size <500 employees')
r, pvalue = pearsonr(shoe_buyers_pct_changes, census_shoes_lt_500_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Comparing to 2018-2019 census data: Employment: Enterprise size <500 employees
Pearson r=-0.1774 (p-value=0.2437)


In [65]:
print('Comparing to 2018-2019 census data: Employment: Enterprise size <20 employees')
r, pvalue = pearsonr(shoe_buyers_pct_changes, census_shoes_lt_20_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Comparing to 2018-2019 census data: Employment: Enterprise size <20 employees
Pearson r=0.3099 (p-value=0.0383)


Use a smaller sample because there are fewer total buyers.

In [None]:
N_subsamples = 1000
state_pct_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%500==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_purchases = get_random_sample(frame=shoe_purchases, N=1000)
    state_pct_changes += [get_pct_change_buyers(sampled_purchases, verbose=v)]
shoe_buyers_pct_changes = pd.DataFrame(state_pct_changes).mean()

In [None]:
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(shoe_buyers_pct_changes, census_shoes_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Stratify by state

In [None]:
N_subsamples = 1000
state_pct_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%500==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_purchases = get_random_stratified_sample(frame=shoe_purchases)
    state_pct_changes += [get_pct_change_buyers(sampled_purchases, verbose=v)]
shoe_buyers_pct_changes = pd.DataFrame(state_pct_changes).mean()

In [None]:
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(shoe_buyers_pct_changes, census_shoes_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Purchases

In [None]:
N_subsamples = 1000
state_pct_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%500==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_purchases = get_random_sample(frame=shoe_purchases)
    state_pct_changes += [get_pct_change_purchases(sampled_purchases, verbose=v)]
shoe_purchases_pct_changes = pd.DataFrame(state_pct_changes).mean()

In [None]:
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(shoe_purchases_pct_changes, census_shoes_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(shoe_purchases_pct_changes, census_shoes_emp_pct_change_20182019.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

print('\nComparing to 2019-2020 census data')
r, pvalue = pearsonr(shoe_purchases_pct_changes, census_shoes_emp_pct_change_20192020.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(shoe_purchases_pct_changes, census_shoes_emp_pct_change_20192020.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Using metric: Portion of purchasers increasing purchases

In [269]:
N_subsamples = 1000
state_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%200==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_purchases = get_random_sample(frame=shoe_purchases)
    state_changes += [get_portion_increases_by_purchaser(sampled_purchases)]
shoes_portion_increases_by_purchaser = pd.DataFrame(state_changes).mean()

1/1000
201/1000
401/1000
601/1000
801/1000


In [270]:
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(shoes_portion_increases_by_purchaser, census_shoes_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Comparing to 2018-2019 census data: Employment
Pearson r=0.1912 (p-value=0.2084)


### Electronics experiments

Total purchases (vs distinct buyers)

In [None]:
# Employment
print('Comparing to 2019-2020 census data: Employment')
r, pvalue = pearsonr(electronics_buyers_pct_changes, census_electronics_emp_pct_change_20192020.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(electronics_buyers_pct_changes, census_electronics_emp_pct_change_20192020.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

In [None]:
print('Comparing to 2018-2019 census data: Establishments')
r, pvalue = pearsonr(electronics_buyers_pct_changes, census_electronics_est_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(electronics_buyers_pct_changes, census_electronics_est_pct_change_20182019.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

print('\nComparing to 2019-2020 census data: Establishments')
r, pvalue = pearsonr(electronics_buyers_pct_changes, census_electronics_est_pct_change_20192020.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(electronics_buyers_pct_changes, census_electronics_est_pct_change_20192020.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

Using stratification

In [None]:
N_subsamples = 1000
state_pct_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%500==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_purchases = get_random_stratified_sample(frame=electronics_purchases)
    state_pct_changes += [get_pct_change_buyers(sampled_purchases, verbose=v)]
electronics_buyers_pct_changes = pd.DataFrame(state_pct_changes).mean()

In [None]:
# Employment
print('Comparing to 2018-2019 census data: Employment')
r, pvalue = pearsonr(electronics_buyers_pct_changes, census_electronics_emp_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
print('\nComparing to 2019-2020 census data: Employment')
r, pvalue = pearsonr(electronics_buyers_pct_changes, census_electronics_emp_pct_change_20192020.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(electronics_buyers_pct_changes, census_electronics_emp_pct_change_20192020.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))
# Establishments
print('\nComparing to 2018-2019 census data: Establishments')
r, pvalue = pearsonr(electronics_buyers_pct_changes, census_electronics_est_pct_change_20182019.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(electronics_buyers_pct_changes, census_electronics_est_pct_change_20182019.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

print('\nComparing to 2019-2020 census data')
r, pvalue = pearsonr(electronics_buyers_pct_changes, census_electronics_est_pct_change_20192020.loc[sample_states])
print('Pearson r=%0.4f (p-value=%0.4f)' % (r, pvalue))
r, pvalue = spearmanr(electronics_buyers_pct_changes, census_electronics_est_pct_change_20192020.loc[sample_states])
print('Spearman r=%0.4f (p-value=%0.4f)' % (r, pvalue))

In [None]:
N_subsamples = 1000
state_pct_changes = [] # compute the mean over N_subsamples
for i in range(N_subsamples):
    v = (i%500==0)
    if v:
        print('%s/%s' % (i+1, N_subsamples))
    sampled_purchases = get_random_sample(frame=electronics_purchases)
    state_pct_changes += [get_pct_change_purchases(sampled_purchases, verbose=v)]
electronics_purchases_pct_changes = pd.DataFrame(state_pct_changes).mean()