# Analysis: new columns

#### This notebook enriches the dataframe with derived columns for analysis. It is kept separate from the client deliverable so that the deliverable stays simple.

### Additional columns interesting for analysis:
- data dictionary for the departments dataframe
- price range of products (low, medium, high price range)
- busy days of the week: when customers shop most (most busy, regularly busy, least busy)
- busy hours: at what time of day customers shop (fewest, average, most orders)
- customer loyalty, based on the maximum order number per customer (new, regular, loyal customer)
- customer frequency, based on mean days since last order (non-frequent, regular, frequent customer)
- spending flag, based on prices, to distinguish high spending from low spending customers

## Importing libraries and data

In [1]:
#importing the libraries

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [2]:
#importing dataset:
path = r'C:\Users\chris\Documents\Instacart Basket Analysis'
orders_products_all = pd.read_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_all.pkl'))
deps = pd.read_pickle(os.path.join(path, '02 Data','Prepared Data', 'departments_wrangled.pkl'))

In [3]:
#calling main dataframe to check if imported completely:
orders_products_all.head(3)

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423


### Data Dictionary for convenient analysis

##### This isn't a new column but a lookup aid for the analysis, so it is tested here.

In [5]:
# defining the data dictionary:
data_dict = deps.to_dict('index')

In [6]:
# displaying the dictionary:
data_dict

{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}

In [7]:
# looking up department 19; should return 'snacks':
print(data_dict.get('19'))

{'department': 'snacks'}
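
The lookup above can be wrapped in a small helper. This is a minimal sketch on a toy stand-in for the departments frame; `deps_demo` and `department_name` are hypothetical names, not part of the notebook. It assumes the dictionary keys are strings (as in the output above) while `department_id` in the main dataframe is numeric, so the id is cast to `str` before the lookup.

```python
import pandas as pd

# Toy stand-in for the departments dataframe (assumption: string index,
# mirroring the keys shown in the dictionary output above).
deps_demo = pd.DataFrame(
    {"department": ["frozen", "beverages", "snacks"]},
    index=["1", "7", "19"],
)

# Same call as in the notebook: one nested dict per index label.
data_dict_demo = deps_demo.to_dict("index")


def department_name(department_id, lookup=data_dict_demo):
    """Return the department name for an id, or None if the id is unknown."""
    entry = lookup.get(str(department_id))
    return entry["department"] if entry else None


print(department_name(19))  # snacks
print(department_name(7))   # beverages
print(department_name(99))  # None
```

Casting to `str` avoids a silent `None` when a numeric `department_id` is passed in directly.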


### new column: price range

In [8]:
# .loc[] assignment, condition 1: prices > 15 -> 'High-range product'
orders_products_all.loc[orders_products_all['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [9]:
# .loc[] assignment, condition 2: prices > 5 and <= 15 -> 'Mid-range product'
orders_products_all.loc[(orders_products_all['prices'] <= 15) & (orders_products_all['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [10]:
# .loc[] assignment, condition 3: prices <= 5 -> 'Low-range product'
orders_products_all.loc[orders_products_all['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [11]:
# check new column:
orders_products_all['price_range_loc'].value_counts(dropna = False)

Mid-range product     21860860
Low-range product     10126321
High-range product      412551
Name: price_range_loc, dtype: int64

In [12]:
# multiple aggregations: mean, min, max
orders_products_all.groupby('price_range_loc').agg({'prices': ['mean', 'min', 'max']}).round(1)

price_range_loc,mean (prices),min (prices),max (prices)
High-range product,20.6,15.1,25.0
Low-range product,3.0,1.0,5.0
Mid-range product,9.8,5.1,15.0
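
As an alternative to three separate `.loc[]` assignments, the same thresholds can be expressed in one step with `pd.cut`. This is a sketch on made-up prices (`prices_demo` is a hypothetical frame), not the method used in this notebook. Note that `pd.cut` returns a categorical column and leaves missing prices as NaN.

```python
import pandas as pd

# Made-up prices chosen to sit on and around the two thresholds (5 and 15).
prices_demo = pd.DataFrame({"prices": [1.0, 5.0, 5.1, 15.0, 15.1, 25.0]})

# Right-closed bins reproduce the three conditions: <= 5, (5, 15], > 15.
prices_demo["price_range_loc"] = pd.cut(
    prices_demo["prices"],
    bins=[-float("inf"), 5, 15, float("inf")],
    labels=["Low-range product", "Mid-range product", "High-range product"],
)

print(prices_demo)
```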


### new column: busy days in the week:

In [13]:
# First, find out what is the busiest day of the week:
orders_products_all['order_day_of_week'].value_counts(dropna=False)

0    6203329
1    5659298
6    4495887
2    4213105
5    4205076
3    3839865
4    3783172
Name: order_day_of_week, dtype: int64

In [14]:
# for-loop to build the labels "Most busy", "Least busy" and "Regularly busy":
# day 0 has the most rows, day 4 the fewest (see the counts above)
result = []

for value in orders_products_all["order_day_of_week"]:
    if value == 0:
        result.append("Most busy")
    elif value == 4:
        result.append("Least busy")
    else:
        result.append("Regularly busy")

In [15]:
#checking if it worked:
print(result[:10])

['Regularly busy', 'Regularly busy', 'Regularly busy', 'Least busy', 'Least busy', 'Regularly busy', 'Regularly busy', 'Regularly busy', 'Regularly busy', 'Least busy']


In [16]:
# inserting "result" as a new column:
orders_products_all['busiest_day'] = result

In [17]:
# count values in new column:
orders_products_all['busiest_day'].value_counts(dropna = False)

Regularly busy    22413231
Most busy          6203329
Least busy         3783172
Name: busiest_day, dtype: int64

In [18]:
#show df with new column
orders_products_all.head(5)

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,...,prices,Gender,STATE,Age,date_joined,n_dependants,fam_status,income,price_range_loc,busiest_day
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,Female,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Regularly busy
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,Female,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Regularly busy
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,Female,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Regularly busy
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,Female,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Least busy
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,Female,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Least busy
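
On a frame with over 32 million rows, a Python for-loop is slow. A vectorized variant with `np.select` should give the same labels. This is a sketch on a toy column (`days_demo` is a hypothetical frame), using the same day numbers as the loop above.

```python
import numpy as np
import pandas as pd

# Toy column with a few day-of-week values.
days_demo = pd.DataFrame({"order_day_of_week": [2, 3, 3, 4, 4, 0, 1, 6]})

# Conditions are checked in order; rows matching none get the default label.
conditions = [
    days_demo["order_day_of_week"] == 0,
    days_demo["order_day_of_week"] == 4,
]
labels = ["Most busy", "Least busy"]

days_demo["busiest_day"] = np.select(conditions, labels, default="Regularly busy")

print(days_demo["busiest_day"].value_counts())
```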


### new column: busy days in the week (2 days)

In [22]:
# first, check the busiest days again. For convenience, sort descending:
orders_products_all['order_day_of_week'].value_counts(dropna=False).sort_values(ascending=False)

0    6203329
1    5659298
6    4495887
2    4213105
5    4205076
3    3839865
4    3783172
Name: order_day_of_week, dtype: int64

In [23]:
# days are identified, so create the loop, using a new result list (result2):
result2 = []

for value in orders_products_all["order_day_of_week"]:
    if value in (0, 1):
        result2.append("Most busy")
    elif value in (3, 4):
        result2.append("Least busy")
    else:
        result2.append("Regularly busy")

In [24]:
#checking if it worked:
print(result2[:10])

['Regularly busy', 'Least busy', 'Least busy', 'Least busy', 'Least busy', 'Regularly busy', 'Most busy', 'Most busy', 'Most busy', 'Least busy']


In [25]:
# inserting "result2" as a new column:
orders_products_all['busiest_days'] = result2

In [26]:
# checking the extended dataframe:
orders_products_all.head(5)

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,...,Gender,STATE,Age,date_joined,n_dependants,fam_status,income,price_range_loc,busiest_day,busiest_days
0,2539329,1,1,2,8,,196,1,0,Soda,...,Female,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Regularly busy,Regularly busy
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Female,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Regularly busy,Least busy
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,Female,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Regularly busy,Least busy
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Female,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Least busy,Least busy
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,Female,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Least busy,Least busy


In [27]:
# count values in new column:
orders_products_all['busiest_days'].value_counts(dropna = False)

Regularly busy    12914068
Most busy         11862627
Least busy         7623037
Name: busiest_days, dtype: int64
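
The two busiest and two slowest days were read off the counts by hand. They can also be derived from the counts, so the labels follow the data if it changes. This is a sketch: the counts are copied from the output above, while `day_counts`, `label_day` and `days_demo2` are hypothetical names.

```python
import pandas as pd

# Row counts per day of week, copied from the value_counts() output above.
day_counts = pd.Series(
    {0: 6203329, 1: 5659298, 6: 4495887, 2: 4213105,
     5: 4205076, 3: 3839865, 4: 3783172}
)

# Two days with the most rows and two days with the fewest rows.
most_busy = {int(day) for day in day_counts.nlargest(2).index}
least_busy = {int(day) for day in day_counts.nsmallest(2).index}


def label_day(day):
    """Map a day-of-week number to its busyness label."""
    if day in most_busy:
        return "Most busy"
    if day in least_busy:
        return "Least busy"
    return "Regularly busy"


days_demo2 = pd.Series([2, 3, 3, 4, 4, 5, 0, 1])
print(days_demo2.map(label_day).tolist())
```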

### new column: busiest hours of the day

In [28]:
# first, check busiest hours:
orders_products_all['order_hour_of_day'].value_counts(dropna=False).sort_values(ascending=False)

10    2761333
11    2735694
14    2688728
15    2661718
13    2660570
12    2618104
16    2534744
9     2453842
17    2087273
8     1717863
18    1636226
19    1258076
20     976000
7      890923
21     795528
22     634159
23     402272
6      290450
0      218742
1      115683
5       87944
2       69360
4       53232
3       51268
Name: order_hour_of_day, dtype: int64

In [29]:
# busiest periods of hours would be, starting at midnight:
## 0 - 6: Fewest orders
## 7 - 9: Average orders
## 10 - 17: Most orders
## 18 - 22: Average orders
## 23: Fewest orders
### for-loop can be as follows:
result3 = []

for value in orders_products_all["order_hour_of_day"]:
    if value in (10,11,12,13,14,15,16,17):
        result3.append("Most orders")
    elif value in (7,8,9,18,19,20,21,22):
        result3.append("Average orders")
    elif value in (0,1,2,3,4,5,6,23):
        result3.append("Fewest orders")
    else:
        result3.append("check again!")

In [30]:
# inserting "result3" as a new column:
orders_products_all['busiest_period_of_day'] = result3

In [31]:
# checking the extended dataframe:
orders_products_all.head(5)

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,...,STATE,Age,date_joined,n_dependants,fam_status,income,price_range_loc,busiest_day,busiest_days,busiest_period_of_day
0,2539329,1,1,2,8,,196,1,0,Soda,...,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Regularly busy,Regularly busy,Average orders
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Regularly busy,Least busy,Average orders
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Regularly busy,Least busy,Most orders
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Least busy,Least busy,Average orders
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,Alabama,31,2019-02-17,3,married,40423,Mid-range product,Least busy,Least busy,Most orders


In [32]:
# count values in new column - if done correctly, the else-value 'check again!' must not appear:
orders_products_all['busiest_period_of_day'].value_counts(dropna = False)

Most orders       20748164
Average orders    10362617
Fewest orders      1288951
Name: busiest_period_of_day, dtype: int64
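
The hour periods can also be stored in a dictionary and applied with `Series.map`. This is a sketch on a toy column (`hour_labels` and `hours_demo` are hypothetical names), using the same period boundaries as the loop above. An hour outside 0 to 23 would come out as NaN, which plays the role of the 'check again!' branch.

```python
import pandas as pd

# Build the hour -> label lookup once, using the periods defined above.
hour_labels = {}
for hour in range(24):
    if 10 <= hour <= 17:
        hour_labels[hour] = "Most orders"
    elif 7 <= hour <= 9 or 18 <= hour <= 22:
        hour_labels[hour] = "Average orders"
    else:
        hour_labels[hour] = "Fewest orders"

hours_demo = pd.DataFrame({"order_hour_of_day": [8, 7, 12, 15, 23, 0, 22]})
hours_demo["busiest_period_of_day"] = hours_demo["order_hour_of_day"].map(hour_labels)

# Unmapped hours would show up as NaN here.
print(hours_demo["busiest_period_of_day"].value_counts(dropna=False))
```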

## customer loyalty

### 1) maximum order number per customer in new column: max_order

In [33]:
#Definition of customer loyalty:

## If max orders the user has made is over 40, then the customer will be labeled a “Loyal customer.”
## If max orders the user has made is over 10 but less than or equal to 40, then the customer will be labeled a “Regular customer.”
## If max orders the user has made is less than or equal to 10, then the customer will be labeled a “New customer.”

In [34]:
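# creating a new column 'max_order' (the string 'max' selects the same built-in
# groupby maximum as np.max and avoids a FutureWarning in newer pandas versions):
orders_products_all['max_order'] = orders_products_all.groupby(['user_id'])['order_number'].transform('max')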

In [35]:
# that created a new column, so we call head() on the frame to check:
orders_products_all.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,...,Age,date_joined,n_dependants,fam_status,income,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order
0,2539329,1,1,2,8,,196,1,0,Soda,...,31,2019-02-17,3,married,40423,Mid-range product,Regularly busy,Regularly busy,Average orders,10
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,31,2019-02-17,3,married,40423,Mid-range product,Regularly busy,Least busy,Average orders,10
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,31,2019-02-17,3,married,40423,Mid-range product,Regularly busy,Least busy,Most orders,10
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,31,2019-02-17,3,married,40423,Mid-range product,Least busy,Least busy,Average orders,10
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,31,2019-02-17,3,married,40423,Mid-range product,Least busy,Least busy,Most orders,10
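
To see why `transform` is used here rather than a plain aggregation, the toy example below (made-up values, hypothetical names) compares the two. The aggregation returns one value per user, while `transform` repeats that value on every row of the user, so it can be assigned straight back as a column.

```python
import pandas as pd

# Toy frame: user 1 has three rows, user 2 has two.
orders_demo = pd.DataFrame(
    {"user_id": [1, 1, 1, 2, 2], "order_number": [1, 2, 3, 1, 45]}
)

# Aggregation: one row per user.
max_per_user = orders_demo.groupby("user_id")["order_number"].max()

# Transform: same length as the original frame, aligned row by row.
orders_demo["max_order"] = (
    orders_demo.groupby("user_id")["order_number"].transform("max")
)

print(max_per_user)
print(orders_demo)
```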


### 2) flagging the loyalty based on number of max orders:

In [42]:
# Flagging the loyal customers: Deriving Columns with loc()
# loyal customer:
orders_products_all.loc[orders_products_all['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [43]:
# regular customer:
orders_products_all.loc[(orders_products_all['max_order'] <= 40) & (orders_products_all['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [44]:
# new customer:
orders_products_all.loc[orders_products_all['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

In [45]:
# checking on flags, no NaN should appear:
orders_products_all['loyalty_flag'].value_counts(dropna=False)

Regular customer    15874128
Loyal customer      10282763
New customer         6242841
Name: loyalty_flag, dtype: int64
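
The counts above are rows (ordered products), not customers. If the number of customers per flag is of interest, counting unique `user_id` values per flag is one way to get it. This is a sketch on a toy frame with made-up values; `flags_demo` is a hypothetical name.

```python
import pandas as pd

# Toy frame: three customers with different numbers of rows.
flags_demo = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 3, 3],
        "loyalty_flag": ["Loyal customer"] * 3
        + ["New customer"]
        + ["Regular customer"] * 2,
    }
)

# Row counts: customers with many rows dominate.
rows_per_flag = flags_demo["loyalty_flag"].value_counts()

# Customer counts: each user_id is counted once per flag.
customers_per_flag = flags_demo.groupby("loyalty_flag")["user_id"].nunique()

print(rows_per_flag)
print(customers_per_flag)
```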

In [46]:
# checking the new flag in the dataframe:
orders_products_all.head(5)

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,...,date_joined,n_dependants,fam_status,income,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,2019-02-17,3,married,40423,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,2019-02-17,3,married,40423,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,2019-02-17,3,married,40423,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,2019-02-17,3,married,40423,Mid-range product,Least busy,Least busy,Average orders,10,New customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,2019-02-17,3,married,40423,Mid-range product,Least busy,Least busy,Most orders,10,New customer


### Customer Frequency

In [47]:
# Frequency labels and definition:
## If mean ('mean_days_since_last_order') > 20, then the customer will be labeled a “Non-frequent customer”
## If mean ('mean_days_since_last_order') > 10 AND <=20 , then the customer will be labeled a “Regular customer”
## If mean ('mean_days_since_last_order') <=10, then the customer will be labeled a “Frequent customer”

In [48]:
# first, we need a column with the mean of 'days_since_last_order' per user, just as we did with 'max_order':
orders_products_all['mean_days_since_last_order'] = orders_products_all.groupby(['user_id'])['days_since_last_order'].transform('mean')

In [49]:
# that created a new column 'mean_days_since_last_order', so we call head() on the frame to check:
orders_products_all.head(5)

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,...,n_dependants,fam_status,income,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,mean_days_since_last_order
0,2539329,1,1,2,8,,196,1,0,Soda,...,3,married,40423,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,20.259259
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,3,married,40423,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,20.259259
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,3,married,40423,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,20.259259
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,3,married,40423,Mid-range product,Least busy,Least busy,Average orders,10,New customer,20.259259
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,3,married,40423,Mid-range product,Least busy,Least busy,Most orders,10,New customer,20.259259
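
One detail worth keeping in mind: the dataframe has one row per ordered product, so a row-level mean weights each order by its basket size. The toy frame below (made-up numbers, hypothetical names) contrasts the row-level mean with a mean taken over one row per order. Which of the two fits better depends on the question being asked; this is only a sketch of the difference.

```python
import pandas as pd

# Toy data: one user, two orders; the second order contains three products.
rows_demo = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1],
        "order_id": [10, 11, 11, 11],
        "days_since_last_order": [10.0, 30.0, 30.0, 30.0],
    }
)

# Row-level mean: the three-product order counts three times.
row_level_mean = rows_demo.groupby("user_id")["days_since_last_order"].mean()

# Order-level mean: keep one row per order before averaging.
order_level_mean = (
    rows_demo.drop_duplicates("order_id")
    .groupby("user_id")["days_since_last_order"]
    .mean()
)

print(row_level_mean.loc[1])    # 25.0
print(order_level_mean.loc[1])  # 20.0
```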


In [50]:
# assigning the labels now:
## Non-frequent customer: > 20
orders_products_all.loc[orders_products_all['mean_days_since_last_order'] > 20, 'frequency_type'] = 'Non-frequent customer'

In [51]:
# assigning the labels now:
## Regular customer: > 10 AND <=20 
orders_products_all.loc[(orders_products_all['mean_days_since_last_order'] > 10) & (orders_products_all['mean_days_since_last_order'] <=20), 'frequency_type'] = 'Regular customer'

In [52]:
# assigning the labels now:
## Frequent customer: <= 10
orders_products_all.loc[orders_products_all['mean_days_since_last_order'] <=10, 'frequency_type'] = 'Frequent customer'

In [53]:
# checking on flags, no NaN should appear:
orders_products_all['frequency_type'].value_counts(dropna=False)

Frequent customer        16464954
Regular customer         12520157
Non-frequent customer     3414616
NaN                             5
Name: frequency_type, dtype: int64

In [54]:
# Subsetting the NaN's:
frequency_type_nan = orders_products_all.loc[orders_products_all['frequency_type'].isnull()]

In [55]:
# calling my subset
frequency_type_nan

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,...,fam_status,income,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,mean_days_since_last_order,frequency_type
32372739,895835,159838,1,0,17,,10749,3,0,Organic Red Bell Pepper,...,married,140994,Mid-range product,Most busy,Most busy,Most orders,1,New customer,,
32372740,895835,159838,1,0,17,,33401,6,0,Goat Cheese Crumbles,...,married,140994,Mid-range product,Most busy,Most busy,Most orders,1,New customer,,
32372741,895835,159838,1,0,17,,23695,2,0,California Veggie Burger,...,married,140994,Low-range product,Most busy,Most busy,Most orders,1,New customer,,
32372742,895835,159838,1,0,17,,21334,5,0,Organic Peeled Garlic,...,married,140994,Mid-range product,Most busy,Most busy,Most orders,1,New customer,,
32372743,895835,159838,1,0,17,,22198,1,0,4X Ultra Concentrated Natural Laundry Detergen...,...,married,140994,Low-range product,Most busy,Most busy,Most orders,1,New customer,,


##### The NaNs are OK: all five rows belong to one customer whose only order (max_order = 1) has 'NaN' in the column 'days_since_last_order', so no mean can be computed. We can ignore them. The aggregation and labeling are correct.
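
The behaviour can be reproduced on a toy frame (made-up values, hypothetical names). The group mean skips NaN values, so a user with at least one valid value gets a mean, while a user with only NaN gets NaN. Every comparison with NaN is False, so none of the three `.loc[]` conditions assigns a label.

```python
import numpy as np
import pandas as pd

# User 1 has a first order (NaN) plus two later orders; user 2 has only NaN.
freq_demo = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2],
        "days_since_last_order": [np.nan, 15.0, 21.0, np.nan, np.nan],
    }
)

freq_demo["mean_days"] = (
    freq_demo.groupby("user_id")["days_since_last_order"].transform("mean")
)

# The three labeling conditions from above, combined.
mean_days = freq_demo["mean_days"]
matches_any = (
    (mean_days > 20)
    | ((mean_days > 10) & (mean_days <= 20))
    | (mean_days <= 10)
)

print(freq_demo)
print(matches_any.tolist())  # [True, True, True, False, False]
```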

### new column: spending flag

In [57]:
#These are the labels:
## If mean ('prices') < 10, then the customer will be labeled a “low spender”
## If mean ('prices') >= 10, then the customer will be labeled a “high spender”

In [58]:
# first, we need a column with mean prices per user, just as we did with 'max_order':
orders_products_all['mean_prices'] = orders_products_all.groupby(['user_id'])['prices'].transform('mean')

In [59]:
# that created a new column 'mean_prices', so we call head() on the frame to check:
orders_products_all.head(5)

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,...,income,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,mean_days_since_last_order,frequency_type,mean_prices
0,2539329,1,1,2,8,,196,1,0,Soda,...,40423,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,20.259259,Non-frequent customer,6.367797
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,40423,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,20.259259,Non-frequent customer,6.367797
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,40423,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,20.259259,Non-frequent customer,6.367797
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,40423,Mid-range product,Least busy,Least busy,Average orders,10,New customer,20.259259,Non-frequent customer,6.367797
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,40423,Mid-range product,Least busy,Least busy,Most orders,10,New customer,20.259259,Non-frequent customer,6.367797


In [60]:
# assigning the labels now:
## low spender:
orders_products_all.loc[orders_products_all['mean_prices'] < 10, 'spender_type'] = 'low spender'

In [61]:
# assigning the labels now:
## high spender:
orders_products_all.loc[orders_products_all['mean_prices'] >=10, 'spender_type'] = 'high spender'

In [62]:
# checking on flags, no NaN should appear:
orders_products_all['spender_type'].value_counts(dropna=False)

low spender     32280013
high spender      119719
Name: spender_type, dtype: int64
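
With only two labels, the flag can also be set in one step with `np.where`. This is a sketch on a toy frame with made-up prices (`spend_demo` is a hypothetical name). One caveat: unlike the two `.loc[]` assignments, `np.where` would put a NaN mean into the "high spender" branch, so it only matches the notebook's logic when `mean_prices` has no missing values.

```python
import numpy as np
import pandas as pd

# Toy frame: user 1 buys cheap products, user 2 expensive ones.
spend_demo = pd.DataFrame(
    {"user_id": [1, 1, 2, 2], "prices": [4.0, 8.0, 12.0, 14.0]}
)

spend_demo["mean_prices"] = (
    spend_demo.groupby("user_id")["prices"].transform("mean")
)

# Mean below 10 -> low spender, otherwise high spender.
spend_demo["spender_type"] = np.where(
    spend_demo["mean_prices"] < 10, "low spender", "high spender"
)

print(spend_demo)
```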

## Exporting the enhanced dataframe

In [64]:
# exporting the extended dataframe as a pkl:
orders_products_all.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_all.pkl'))
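
A quick round-trip check can confirm that an exported pickle reads back unchanged. This is a sketch on a toy frame written to a temporary directory; the file name `orders_products_enhanced.pkl` and the frame `export_demo` are hypothetical and not part of the project folder.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the enhanced dataframe.
export_demo = pd.DataFrame(
    {"user_id": [1, 2], "spender_type": ["low spender", "high spender"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this check.
    out_path = os.path.join(tmp_dir, "orders_products_enhanced.pkl")
    export_demo.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# True when values, dtypes and index all match.
round_trip_ok = reloaded.equals(export_demo)
print(round_trip_ok)
```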