# Peanut Butter purchase data analysis

This notebook uses two of the Nielsen datasets: the purchase data and the products data. The main task is to examine the sales patterns of two selected peanut butter UPCs. Due to data confidentiality, cell outputs are not shown.  
Questions we tried to answer:  
1) What is the distribution of purchase quantity under different promotion types?  
2) What is the average spending and number of trip purchases per household for each product?  
3) How do JIF's and Skippy's sales and prices interact?

In [None]:
import pandas as pd
import numpy as np

In [None]:
import matplotlib.pyplot as plt

In [None]:
from glob import glob 

In [None]:
%pylab inline

In [None]:
!pwd

In [None]:
cd ~/Documents/Codes/Retail_Analytics/Panel_Data/

In [None]:
!ls

In [None]:
!head ./2006/Annual_Files/trips_2006.tsv

## Load and Prepare master data frame

### Select peanut butter upc list

In [None]:
fields = ["upc","upc_ver_uc","upc_descr","product_module_code" ,"product_module_descr", "product_group_code", 
          "brand_descr", "multi", "size1_amount", "size1_units"]
products = pd.read_csv("./Master_Files/Latest/products.tsv", sep = '\t', usecols=fields)

In [None]:
products.dtypes

In [None]:
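## select products that belong to the peanut butter category
## (.copy() so the upc_descr assignment below works on an independent frame, not a view of products)
peanut_butter = products.query('product_module_code==1421 & product_group_code == 506').copy()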

In [None]:
peanut_butter.shape

From 2004 to 2014, there are 2,748 products in total in the peanut butter category.

In [None]:
unique(peanut_butter.product_module_descr) # check if query successful

In [None]:
peanut_butter['upc_descr']= peanut_butter['upc_descr'] + " / " + peanut_butter['size1_amount'].astype(str) + " " + peanut_butter['size1_units'] + " / " + peanut_butter['multi'].astype(str) 

In [None]:
peanut_butter.head(2)

In [None]:
peanut_butter = peanut_butter[["upc","upc_ver_uc","upc_descr","brand_descr","multi","size1_amount"]]

In [None]:
peanut_butter.head(2)

### Merge purchase file and PB upc list

In [None]:
## make a list of file names
## note: glob does not guarantee any ordering, so inspect the printed list before slicing it below
pl_path = glob('./2[0-1][0-1][0-9]/Annual_Files/purchases_*.tsv')
pl_path

In [None]:
sub_path = pl_path[4:8]

In [None]:
sub_path

In [None]:
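print("Beginning to build Purchase Dataset: ")
# DataFrame.append was removed in pandas 2.0; collect the yearly frames and concatenate once
frames = []
for path in sub_path:
    tmp = pd.read_csv(path, sep='\t')
    frames.append(tmp)
    print(tmp.shape)
pl = pd.concat(frames)
print(pl.shape)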
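
The yearly purchase files are large, and only the peanut butter rows are needed after the merge. As a possible memory-saving variant of the loop above, the sketch below reads a file in chunks and keeps only rows whose `upc` is in a known set. The toy TSV, the `pb_upcs` set, and the name `pl_small` are illustrative assumptions, not part of the original notebook; the later merge on `upc` and `upc_ver_uc` would still be needed.

```python
import io
import pandas as pd

# Toy stand-in for one purchases_*.tsv file (column names follow the notebook).
toy_tsv = io.StringIO(
    "trip_code_uc\tupc\tupc_ver_uc\tquantity\n"
    "1\t100\t1\t2\n"
    "2\t200\t1\t1\n"
    "3\t100\t1\t3\n"
    "4\t300\t1\t1\n"
)

# Hypothetical set of peanut butter UPCs to keep.
pb_upcs = {100, 300}

kept = []
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(toy_tsv, sep="\t", chunksize=2):
    # Keep only the rows for the UPCs of interest before accumulating.
    kept.append(chunk[chunk["upc"].isin(pb_upcs)])

pl_small = pd.concat(kept, ignore_index=True)
print(pl_small)
```

With real files, `toy_tsv` would be replaced by each path in `sub_path` and `pb_upcs` by `set(peanut_butter["upc"])`.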

In [None]:
pl.head(2)

In [None]:
peanut_butter.head(2)

In [None]:
purchase = pd.merge(pl, peanut_butter, how='inner', sort=False, on=["upc","upc_ver_uc"])
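
An inner merge silently drops unmatched rows and silently duplicates rows if the key is not unique on the product side. One way to check both, sketched here on toy frames (`pl_toy` and `pb_toy` are made-up stand-ins for `pl` and `peanut_butter`), is to use the `validate` and `indicator` options of `pd.merge`:

```python
import pandas as pd

# Toy purchase rows: one UPC appears in two versions, one UPC is not peanut butter.
pl_toy = pd.DataFrame({
    "trip_code_uc": [1, 2, 3],
    "upc": [100, 200, 100],
    "upc_ver_uc": [1, 1, 2],
    "quantity": [2, 1, 3],
})

# Toy product list keyed by (upc, upc_ver_uc).
pb_toy = pd.DataFrame({
    "upc": [100, 100],
    "upc_ver_uc": [1, 2],
    "brand_descr": ["JIF", "JIF"],
})

checked = pd.merge(
    pl_toy, pb_toy,
    how="left",
    on=["upc", "upc_ver_uc"],
    validate="many_to_one",  # raises MergeError if the product key is duplicated
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
print(checked["_merge"].value_counts())

# Rows flagged 'both' are the ones an inner merge would keep.
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(matched)
```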

In [None]:
purchase.shape

In [None]:
purchase.head(2)

In [None]:
purchase.to_csv("./pb_purchase_11_14.csv")

### Merge PB purchase file with trip file

In [None]:
tp_path = glob('./2[0-1][0-1][0-9]/Annual_Files/trips_*.tsv')
tp_path = tp_path[4:8]
tp_path

In [None]:
fields = ['trip_code_uc', 'household_code', 'purchase_date','retailer_code', 'store_code_uc']
print("Beginning to build Trip Dataset: ")
# DataFrame.append was removed in pandas 2.0; collect the yearly frames and concatenate once
frames = []
for path in tp_path:
    tmp = pd.read_csv(path, sep='\t', usecols=fields)
    frames.append(tmp)
    print(tmp.shape)
tp = pd.concat(frames)
print(tp.shape)

In [None]:
tp.head(2)

In [None]:
purchase.head(2)

In [None]:
purchase.shape

In [None]:
master_df = pd.merge(purchase, tp, on="trip_code_uc", how='inner',sort=False)

In [None]:
master_df.shape

### Subset only Grocery store data 

In [None]:
retailers = pd.read_csv("./Master_Files/Latest/retailers.tsv", sep = '\t')

In [None]:
retailers.head(2)

In [None]:
master_df = pd.merge(master_df, retailers, on="retailer_code", how="inner", sort=False)

In [None]:
master_df.shape

In [None]:
master_df.to_csv("./master_df.csv")

In [None]:
grocery = master_df.query('channel_type=="Grocery"')

In [None]:
grocery.shape

In [None]:
grocery.to_csv("./grocery_df.csv",index=False)

## Multiple purchases Analysis

### Transform variables

In [None]:
grocery = pd.read_csv("./grocery_df.csv")

In [None]:
grocery.head(2)

#### Compute unit_price_paid

In [None]:
# net price per unit, computed column-wise instead of with a row-wise apply
grocery['unit_price_paid'] = (grocery['total_price_paid'] - grocery['coupon_value']) / grocery['quantity']
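
The unit price is (total price paid minus coupon value) divided by quantity. If any row had a quantity of zero, that division would produce `inf`. The sketch below, on made-up numbers, shows one way to guard against that by turning zero quantities into missing values first; whether zero quantities actually occur in the Nielsen data is not something this notebook establishes.

```python
import numpy as np
import pandas as pd

# Toy rows using the notebook's column names; the last row has quantity 0.
toy = pd.DataFrame({
    "total_price_paid": [5.0, 3.0, 4.0],
    "coupon_value": [1.0, 0.0, 0.0],
    "quantity": [2, 1, 0],
})

net_paid = toy["total_price_paid"] - toy["coupon_value"]

# Replace zero quantities with NaN so the division yields NaN instead of inf.
toy["unit_price_paid"] = net_paid / toy["quantity"].replace(0, np.nan)
print(toy)
```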

In [None]:
grocery.head(2)

#### Select Skippy CRM 16.3 OZ and JIF CRM 18 OZ

In [None]:
selected_df = grocery.query('upc_descr=="SKP CRM H PLS / 16.3 OZ / 1" | upc_descr=="JIF CRM H PLS / 18.0 OZ / 1"')

In [None]:
unique(selected_df["upc_descr"])

In [None]:
selected_df.shape

Shape of `selected_df`: (68197, 17)

In [None]:
min(selected_df.unit_price_paid)

In [None]:
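# drop zero-price rows; .copy() so new columns can be added later without a SettingWithCopyWarning
selected_df = selected_df.query("unit_price_paid != 0").copy()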

In [None]:
selected_df.shape

In [None]:
min(selected_df.unit_price_paid)

In [None]:
selected_df['unit_price_paid'].describe()

In [None]:
def price_bin(row):
    # map unit_price_paid to a half-open price range label
    # (assumes non-negative prices; any value outside the listed ranges falls into the last bin)
    p = row['unit_price_paid']
    if 0 <= p < 1.6:
        return '[0,1.6)'
    if 1.6 <= p < 2.25:
        return '[1.6,2.25)'
    if 2.25 <= p < 2.5:
        return '[2.25,2.5)'
    if 2.5 <= p < 2.6:
        return '[2.5,2.6)'
    if 2.6 <= p < 3:
        return '[2.6,3)'
    return '[3, Inf)'

In [None]:
selected_df['unit_P_range'] = selected_df.apply(price_bin, axis=1)
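
The same left-closed, right-open price ranges can also be produced without a row-wise function. The following is a sketch using `pd.cut` with `right=False` on a made-up price series; the names `prices`, `edges`, `labels`, and `binned` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up unit prices, including values that sit exactly on bin edges.
prices = pd.Series([0.99, 1.6, 2.49, 2.5, 2.99, 3.0, 4.25])

edges = [0, 1.6, 2.25, 2.5, 2.6, 3, np.inf]
labels = ["[0,1.6)", "[1.6,2.25)", "[2.25,2.5)", "[2.5,2.6)", "[2.6,3)", "[3, Inf)"]

# right=False makes every bin include its left edge and exclude its right edge.
binned = pd.cut(prices, bins=edges, labels=labels, right=False)
print(pd.DataFrame({"price": prices, "range": binned}))
```

Unlike the function above, `pd.cut` returns missing values for prices below the first edge, and the result is an ordered categorical, so grouped tables and box plots follow the price order rather than alphabetical order.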

In [None]:
selected_df.head(2)

In [None]:
selected_df.groupby(['brand_descr','unit_P_range','deal_flag_uc'])['trip_code_uc'].count().unstack(2)

#### Examine basic statistics

In [None]:
# how many households involved
households = len(unique(selected_df.household_code))
households

In [None]:
trips = len(unique(selected_df.trip_code_uc))
trips

In [None]:
trips/households
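
The ratio above gives average trips per household across both products. For the second question (average spending and trips per household for each product), one possible approach is a two-step aggregation: first per brand and household, then averaged over households within each brand. This is a sketch on toy data; `net_paid`, `per_hh`, and `summary` are names introduced here, and net spending is assumed to be `total_price_paid` minus `coupon_value`.

```python
import pandas as pd

# Toy trips: household 1 buys both brands, household 2 buys only SKIPPY.
toy = pd.DataFrame({
    "household_code": [1, 1, 1, 2, 2],
    "trip_code_uc": [10, 11, 12, 20, 21],
    "brand_descr": ["JIF", "JIF", "SKIPPY", "SKIPPY", "SKIPPY"],
    "total_price_paid": [3.0, 5.0, 2.5, 2.0, 4.0],
    "coupon_value": [0.0, 1.0, 0.0, 0.0, 0.0],
})
toy["net_paid"] = toy["total_price_paid"] - toy["coupon_value"]

# Step 1: trips and spending for each (brand, household) pair.
per_hh = (
    toy.groupby(["brand_descr", "household_code"])
       .agg(trips=("trip_code_uc", "nunique"), spend=("net_paid", "sum"))
)
per_hh["spend_per_trip"] = per_hh["spend"] / per_hh["trips"]

# Step 2: average those household-level figures within each brand.
summary = per_hh.groupby(level="brand_descr")[["trips", "spend", "spend_per_trip"]].mean()
print(summary)
```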

In [None]:
# how many stores involved
len(unique(selected_df.store_code_uc))

In [None]:
# how many trip purchases for each brand
selected_df.groupby(['brand_descr'])['trip_code_uc'].count()

In [None]:
sum(selected_df.groupby(['brand_descr'])['trip_code_uc'].count())

In [None]:
JIF = selected_df.query('brand_descr=="JIF"')
JIF_gb = JIF.groupby(['household_code'])[['trip_code_uc']].count()

In [None]:
len(unique(JIF.trip_code_uc))

In [None]:
len(unique(JIF.household_code))

In [None]:
len(unique(JIF.trip_code_uc))/len(unique(JIF.household_code))

In [None]:
bin1 = numpy.linspace(1, 30, 30)

In [None]:
plt.hist(JIF_gb.trip_code_uc, bin1)
plt.title("Number of trips involving JIF per household")
plt.xlabel("Number of trips")
plt.ylabel("household counts")

In [None]:
SKIPPY = selected_df.query('brand_descr=="SKIPPY"')
SKIPPY_gb = SKIPPY.groupby(['household_code'])[['trip_code_uc']].count()
SKIPPY_gb.sum()

In [None]:
len(unique(SKIPPY.trip_code_uc))

In [None]:
len(unique(SKIPPY.household_code))

In [None]:
len(unique(SKIPPY.trip_code_uc))/len(unique(SKIPPY.household_code))

In [None]:
plt.hist(SKIPPY_gb.trip_code_uc, bin1)
plt.title("Number of trips involving SKIPPY per household")
plt.xlabel("Number of trips")
plt.ylabel("household counts")

#### How many households purchase both JIF and Skippy

In [None]:
brands_hh = selected_df.groupby('household_code').brand_descr.nunique()
brands_hh.value_counts()

In [None]:
ls = brands_hh[brands_hh==2]
ls.index

In [None]:
len(ls.index)/brands_hh.shape[0]

In [None]:
brands_tp = selected_df.groupby('trip_code_uc').brand_descr.nunique()
brands_tp.value_counts()

In [None]:
# share of trips that include both brands (counts taken from the value_counts output above)
63/(63+65663)
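
The share of trips containing both brands can also be computed directly from the per-trip brand counts instead of typing the counts in by hand. A minimal sketch on toy data, where `brands_per_trip` plays the role of `brands_tp`:

```python
import pandas as pd

# Toy data: trip 1 contains both brands, trips 2-4 contain one brand each.
toy = pd.DataFrame({
    "trip_code_uc": [1, 1, 2, 3, 4],
    "brand_descr": ["JIF", "SKIPPY", "JIF", "SKIPPY", "JIF"],
})

brands_per_trip = toy.groupby("trip_code_uc")["brand_descr"].nunique()

# The mean of a boolean Series is the fraction of True values.
share_both = (brands_per_trip == 2).mean()
print(share_both)
```

The same expression applied to `brands_hh` would give the household-level share.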

#### Plot histograms according to deal_flag_uc

In [None]:
from pylab import rcParams
rcParams['figure.figsize'] = 6, 6

In [None]:
bins = numpy.linspace(0, 25, 25)

In [None]:
sub1 = selected_df.query('upc_descr=="JIF CRM H PLS / 18.0 OZ / 1" & deal_flag_uc==1')
plt.hist(sub1.quantity,bins)
plt.title("Purchase quantity for JIF CRM 18OZ if there is a deal")
plt.xlabel("purchase quantity per trip")
plt.ylabel("counts")

In [None]:
sub2 = selected_df.query('upc_descr=="JIF CRM H PLS / 18.0 OZ / 1" & deal_flag_uc==0')
plt.hist(sub2.quantity,bins)
plt.title("Purchase quantity for JIF CRM 18OZ if there is NO deal")
plt.xlabel("purchase quantity per trip")
plt.ylabel("counts")

In [None]:
sub3 = selected_df.query('upc_descr=="SKP CRM H PLS / 16.3 OZ / 1" & deal_flag_uc==1')
plt.hist(sub3.quantity,bins)
plt.title("Purchase quantity for SKIPPY CRM 16.3OZ if there is a deal")
plt.xlabel("purchase quantity per trip")
plt.ylabel("counts")

In [None]:
sub4 = selected_df.query('upc_descr=="SKP CRM H PLS / 16.3 OZ / 1" & deal_flag_uc==0')
plt.hist(sub4.quantity,bins)
plt.title("Purchase quantity for SKIPPY CRM 16.3OZ if there is NO deal")
plt.xlabel("purchase quantity per trip")
plt.ylabel("counts")
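
The four histograms compare quantity distributions with and without a deal, but the deal and no-deal groups may differ in size, which makes raw counts hard to compare. As a numeric complement, the sketch below tabulates quantity against `deal_flag_uc` on made-up data and converts each column to shares; `counts` and `shares` are names introduced here.

```python
import pandas as pd

# Made-up purchases: 3 without a deal (flag 0) and 4 with a deal (flag 1).
toy = pd.DataFrame({
    "deal_flag_uc": [0, 0, 0, 1, 1, 1, 1],
    "quantity":     [1, 1, 2, 1, 2, 2, 4],
})

# Rows are quantities, columns are deal flags, cells are trip counts.
counts = pd.crosstab(toy["quantity"], toy["deal_flag_uc"])

# Divide each column by its total so the two groups are comparable.
shares = counts / counts.sum()
print(counts)
print(shares)
```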

In [None]:
rcParams['figure.figsize'] = 17, 6

In [None]:
sub6 = selected_df.query('upc_descr=="JIF CRM H PLS / 18.0 OZ / 1"')
sub6 = sub6[['quantity','unit_P_range','deal_flag_uc']]
sub6.boxplot(by=['unit_P_range','deal_flag_uc'])
plt.title("JIF - Distribution of purchase quantity per trip according to price and deal (0=no deal)")
plt.ylabel("Purchase quantity")

In [None]:
sub5 = selected_df.query('upc_descr=="SKP CRM H PLS / 16.3 OZ / 1"')
sub5 = sub5[['quantity','unit_P_range','deal_flag_uc']]
sub5.boxplot(by=['unit_P_range','deal_flag_uc'])
plt.title("Skippy - Distribution of purchase quantity per trip according to price and deal (0=no deal)")
plt.ylabel("Purchase quantity")

In [None]:
selected_df.query(' retailer_code==89').groupby(['upc_descr','unit_P_range','deal_flag_uc'])['trip_code_uc'].count().unstack(2)

In [None]:
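# purchases at retailer 89; .copy() so purchase_date can be converted later without a SettingWithCopyWarning
tmp = selected_df.query('retailer_code==89').copy()
len(unique(tmp.store_code_uc))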

In [None]:
sub7 = selected_df.query('upc_descr=="JIF CRM H PLS / 18.0 OZ / 1" & retailer_code==89')
sub7 = sub7[['quantity','unit_P_range','deal_flag_uc']]
sub7.boxplot(by=['unit_P_range','deal_flag_uc'])
plt.title("JIF - Distribution of purchase quantity per trip according to price and deal (0=no deal), retailer:89")
plt.ylabel("Purchase quantity")
plt.ylim((0,8))

In [None]:
sub8 = selected_df.query('upc_descr=="SKP CRM H PLS / 16.3 OZ / 1" & retailer_code==89')
sub8 = sub8[['quantity','unit_P_range','deal_flag_uc']]
sub8.boxplot(by=['unit_P_range','deal_flag_uc'])
plt.title("Skippy - Distribution of purchase quantity per trip according to price and deal (0=no deal), retailer:89")
plt.ylabel("Purchase quantity")
plt.ylim((0,8))

#### How do JIF's and Skippy's sales and prices interact

In [None]:
tmp.head(2)

In [None]:
# tmp.store_code_uc.value_counts()
## below stores have most trips
#2751743    61 (#trip)
#7346835    60
#742472     53

In [None]:
tmp['purchase_date'] = pd.to_datetime(tmp['purchase_date'])

In [None]:
tmp.dtypes

In [None]:
rcParams['figure.figsize'] = 6, 6

In [None]:
sub7 = selected_df.query('upc_descr=="SKP CRM H PLS / 16.3 OZ / 1" & deal_flag_uc==1')
plt.scatter(x=sub7.unit_price_paid, y=sub7.quantity)
plt.title("Scatter plot of unit_price and quantity - SKIPPY CRM 16.3OZ (with deal)")
plt.xlabel("Unit_price_paid")
plt.ylabel("Quantity")

In [None]:
sub8 = selected_df.query('upc_descr=="SKP CRM H PLS / 16.3 OZ / 1" & deal_flag_uc==0')
plt.scatter(x=sub8.unit_price_paid, y=sub8.quantity)
plt.title("Scatter plot of unit_price and quantity - SKIPPY CRM 16.3OZ (No deal)")
plt.xlabel("Unit_price_paid")
plt.ylabel("Quantity")
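
The scatter plots look at price and quantity within one brand. To look at how the two brands move together over time, one option is to aggregate units sold and average unit price by month and brand and place the brands side by side. This is only a sketch on invented data; the monthly granularity and the names `month` and `monthly` are assumptions rather than the notebook's method.

```python
import pandas as pd

# Invented purchases for two brands over two months.
toy = pd.DataFrame({
    "purchase_date": pd.to_datetime(
        ["2011-01-03", "2011-01-20", "2011-02-05",
         "2011-01-10", "2011-02-11", "2011-02-25"]
    ),
    "brand_descr": ["JIF", "JIF", "JIF", "SKIPPY", "SKIPPY", "SKIPPY"],
    "quantity": [1, 2, 4, 3, 1, 1],
    "unit_price_paid": [2.5, 2.5, 2.0, 2.0, 2.6, 2.4],
})
toy["month"] = toy["purchase_date"].dt.to_period("M")

# Units sold and mean unit price per month and brand, with brands as columns.
monthly = (
    toy.groupby(["month", "brand_descr"])
       .agg(units=("quantity", "sum"), avg_price=("unit_price_paid", "mean"))
       .unstack("brand_descr")
)
print(monthly)
```

In this toy table JIF's price drops in February while its units rise and Skippy's fall, which is the kind of pattern such a table is meant to make visible on the real data.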