# Order Data Wrangling

### Contents

#### Frequency table of days since order
#### Changing variable names
#### Changing data types
#### Transposing datasets
#### Creating data dictionaries
#### Creating subsets

## 1.0 Importing Libraries

In [4]:
# Importing Libraries
import pandas as pd
import numpy as np
import os


## 2.0 Importing Data

In [5]:
# Importing orders and products dataframe
project_path = r'C:\Users\Owner\Documents\Career Foundry\Instacart Basket Analysis'
# Import all variables except eval_set for orders dataframe
vars_list = ['order_id', 'user_id', 'order_number', 'order_dow', 'order_hour_of_day', 'days_since_prior_order']
df_orders = pd.read_csv(os.path.join(project_path, '02 Data', '02 01 Originals', 'orders.csv'), usecols=vars_list)
# Importing complete products dataframe
df_products = pd.read_csv(os.path.join(project_path, '02 Data', '02 01 Originals', 'products.csv'))
# Importing departments dataframe
df_departments = pd.read_csv(os.path.join(project_path, '02 Data', '02 01 Originals', 'departments.csv'))
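
The comment above mentions importing every variable except eval_set. A minimal sketch of how the `usecols` argument of `pd.read_csv` restricts the imported columns, using a tiny in-memory CSV instead of the real orders.csv (the sample rows and the name `cols_to_keep` are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for orders.csv (values are illustrative only)
sample_csv = io.StringIO(
    "order_id,user_id,eval_set,order_number\n"
    "10,1,prior,1\n"
    "11,1,prior,2\n"
    "12,2,train,1\n"
)

# Only the listed columns are parsed; eval_set is skipped at read time
cols_to_keep = ['order_id', 'user_id', 'order_number']
df_sample = pd.read_csv(sample_csv, usecols=cols_to_keep)

print(df_sample.columns.tolist())
```

Skipping a column at read time also avoids loading it into memory, which can matter for a file with millions of rows.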


## 3.0 Practice Work

In [3]:
#3.1 Frequency of days_since_prior_order
df_orders['days_since_prior_order'].value_counts(dropna = False)

30.0    369323
7.0     320608
6.0     240013
4.0     221696
3.0     217005
5.0     214503
NaN     206209
2.0     193206
8.0     181717
1.0     145247
9.0     118188
14.0    100230
10.0     95186
13.0     83214
11.0     80970
12.0     76146
0.0      67755
15.0     66579
16.0     46941
21.0     45470
17.0     39245
20.0     38527
18.0     35881
19.0     34384
22.0     32012
28.0     26777
23.0     23885
27.0     22013
24.0     20712
25.0     19234
29.0     19191
26.0     19016
Name: days_since_prior_order, dtype: int64
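
The table above is ordered by frequency, which makes it hard to read along the number of days. A small sketch, on a made-up Series rather than the real column, of the same count re-sorted by its index:

```python
import numpy as np
import pandas as pd

# Illustrative values only; NaN stands for an order with no prior order
days = pd.Series([7.0, 30.0, 7.0, np.nan, 3.0, 30.0, 7.0],
                 name='days_since_prior_order')

# Default ordering: most frequent value first, NaN kept because dropna=False
counts_by_freq = days.value_counts(dropna=False)

# Same counts ordered by the number of days; NaN is placed last by default
counts_by_day = counts_by_freq.sort_index()

print(counts_by_day)
```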

In [4]:
#3.2 Changing variable name from order_dow to order_day_of_week
df_orders.rename(columns={'order_dow':'order_day_of_week'}, inplace=True)
df_orders.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


In [9]:
#3.3 Changing datatypes of variables
df_orders['order_id'] = df_orders['order_id'].astype('str')
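
One practical effect of storing an identifier as a string is that `describe()` no longer treats it as a number. A minimal sketch on a two-row dataframe (the name `df_demo` is hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame({'order_id': [2539329, 2398795],
                        'order_number': [1, 2]})

# Convert the identifier to a string so it is not summarised as a quantity
df_demo['order_id'] = df_demo['order_id'].astype('str')

# describe() on a mixed dataframe only reports the numeric columns
numeric_cols = df_demo.describe().columns.tolist()

print(df_demo.dtypes)
print(numeric_cols)
```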

In [5]:
#3.4 Transposing df_departments
df_departments_t = df_departments.T


In [6]:
#3.5 Creating new header
#Copying values of first row to new variable
new_header = df_departments_t.iloc[0]
#Copying new df without first row
df_departments_t_new = df_departments_t[1:]
#Set header row
df_departments_t_new.columns = new_header
df_departments_t_new

department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk
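
The steps above can also be collapsed into a single chain. The sketch below assumes departments.csv is stored wide, with a `department_id` column followed by one column per id, which is what the output above suggests but is not confirmed here. The names `df_wide` and `df_long` are hypothetical.

```python
import pandas as pd

# Hypothetical wide-format stand-in for departments.csv
df_wide = pd.DataFrame({'department_id': ['department'],
                        '1': ['frozen'],
                        '2': ['other'],
                        '3': ['bakery']})

# Move the label column into the index first, then transpose,
# so no header row has to be cut out afterwards
df_long = df_wide.set_index('department_id').T

# Tidy the axis names left over from the transpose
df_long.index.name = 'department_id'
df_long.columns.name = None

print(df_long)
```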


In [8]:
#3.6 Creating a data dictionary of departments
dept_data_dict = df_departments_t_new.to_dict('index')
print(dept_data_dict.get('19'))

{'department': 'snacks'}
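
Note that the dictionary keys are strings, while `department_id` in df_products is compared against integers. A short sketch of the difference on a two-row stand-in (the names `df_dept`, `dept_dict`, and `dept_dict_int` are hypothetical):

```python
import pandas as pd

# Two-row stand-in for the transposed departments dataframe
df_dept = pd.DataFrame({'department': ['produce', 'snacks']},
                       index=['4', '19'])

dept_dict = df_dept.to_dict(orient='index')

print(dept_dict.get('19'))  # string key matches
print(dept_dict.get(19))    # integer key does not match, so get() returns None

# Optional: rebuild the dictionary with integer keys and plain names
dept_dict_int = {int(key): row['department'] for key, row in dept_dict.items()}
print(dept_dict_int[19])
```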


In [7]:
#3.7 Creating subset of snacks
df_snacks = df_products[df_products['department_id']==19]
df_snacks.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5


## 4.0 Task Data Wrangling

In [11]:
#4.2 Change datatype of user_id from int to string
df_orders['user_id'] = df_orders['user_id'].astype('str')
df_orders.describe()


Unnamed: 0,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3214874.0
mean,17.15486,2.776219,13.45202,11.11484
std,17.73316,2.046829,4.226088,9.206737
min,1.0,0.0,0.0,0.0
25%,5.0,1.0,10.0,4.0
50%,11.0,3.0,13.0,7.0
75%,23.0,5.0,16.0,15.0
max,100.0,6.0,23.0,30.0


In [7]:
#4.3 Change order_id variable name to reflect that it is a unique key in this dataframe
df_orders.rename(columns={'order_id':'pk_order_id'}, inplace=True)
df_orders.head()

Unnamed: 0,pk_order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


In [13]:
#4.4 Frequency of hour of day of purchase
df_orders['order_hour_of_day'].value_counts(dropna = False)

10    288418
11    284728
15    283639
14    283042
13    277999
12    272841
16    272553
9     257812
17    228795
18    182912
8     178201
19    140569
20    104292
7      91868
21     78109
22     61468
23     40043
6      30529
0      22758
1      12398
5       9569
2       7539
4       5527
3       5474
Name: order_hour_of_day, dtype: int64

Based on the frequency table of the 'order_hour_of_day' variable, 10:00 AM is the most frequent hour in which customers order from Instacart, followed by 11:00 AM and 3:00 PM (15:00 in 24-hour time).  The seven busiest hours all fall between 10:00 AM and 4:00 PM, which is consistent with the mean of that variable, 13.45 (roughly 1:30 PM).
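
The most frequent hour and the mean hour answer different questions, so it helps to compute both side by side. A minimal sketch on a made-up Series (the values are illustrative, not taken from orders.csv):

```python
import pandas as pd

hours = pd.Series([10, 10, 11, 15, 10, 11, 23], name='order_hour_of_day')

# Most frequent hour: the index label of the largest count
busiest_hour = hours.value_counts().idxmax()

# Mean hour: pulled upwards by the single late-evening order
mean_hour = hours.mean()

print(busiest_hour, round(mean_hour, 2))
```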

In [14]:
#4.5 Meaning of '4' in department_id for df_products
print(dept_data_dict.get('4'))

{'department': 'produce'}


In [15]:
#4.6 Create a subset of breakfast items
df_breakfast = df_products[df_products['department_id']==14]
df_breakfast.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
27,28,Wheat Chex Cereal,121,14,10.1
33,34,,121,14,12.2
67,68,"Pancake Mix, Buttermilk",130,14,13.7
89,90,Smorz Cereal,121,14,3.9
210,211,Gluten Free Organic Cereal Coconut Maple Vanilla,130,14,3.6


In [18]:
#4.7 Create a subset for dinner party items which includes alcohol, beverage, deli, and meat/seafood departments
df_dinner_pty = df_products.loc[df_products['department_id'].isin([5,7,12,20])]
df_dinner_pty.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
6,7,Pure Coconut Water With Orange,98,7,4.4
9,10,Sparkling Orange Juice & Prickly Pear Beverage,115,7,8.4
10,11,Peach Mango Juice,31,7,2.8
16,17,Rendered Duck Fat,35,12,17.1


In [20]:
#4.8 Number of rows of dinner party dataframe
df_dinner_pty.count()

product_id       7650
product_name     7647
aisle_id         7650
department_id    7650
prices           7650
dtype: int64

The dinner party subset contains 7,650 products drawn from the alcohol, beverage, deli, and meat/seafood departments.  Note that only 7,647 of them have a product name, so 3 rows have a product id with no name attached.
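
`count()` reports non-null values per column, which is why the row total and the number of product names differ. A sketch of getting the row count and the missing names directly, on a small made-up dataframe (`df_demo` is hypothetical):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'product_id': [1, 2, 3, 4],
                        'product_name': ['Tea', np.nan, 'Wine', 'Ham'],
                        'department_id': [7, 7, 5, 12]})

# Number of rows, regardless of missing values
n_rows = len(df_demo)

# Non-null values per column (what count() reports)
non_null = df_demo.count()

# Number of missing product names, and the rows they belong to
missing_names = df_demo['product_name'].isna().sum()
rows_missing = df_demo[df_demo['product_name'].isna()]

print(n_rows, missing_names)
print(rows_missing)
```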

In [27]:
#4.9 Information regarding user_id 1
df_user1 = df_orders[df_orders['user_id']=='1']
df_user1

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
5,3367565,1,6,2,7,19.0
6,550135,1,7,1,9,20.0
7,3108588,1,8,1,14,14.0
8,2295261,1,9,1,16,0.0
9,2550362,1,10,4,8,30.0


In [29]:
#4.10 Basic information regarding user1 behavior
df_user1.describe()

Unnamed: 0,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
count,11.0,11.0,11.0,10.0
mean,6.0,2.636364,10.090909,19.0
std,3.316625,1.286291,3.477198,9.030811
min,1.0,1.0,7.0,0.0
25%,3.5,1.5,7.5,14.25
50%,6.0,3.0,8.0,19.5
75%,8.5,4.0,13.0,26.25
max,11.0,4.0,16.0,30.0


Based on the information we have regarding user 1, we know they made 11 total purchases.  They have ordered only on days 1 through 4 of the week, most typically around day 3 (the median; the mean is 2.64).  Which weekdays these are is unclear, as we don't know whether the 0 to 6 coding starts on Sunday or Monday.  So far they have never ordered later than day 4, although eleven orders are too few to say for certain that they never will.  Their orders skew toward the morning: the median hour is 8:00 AM and the mean is about 10:00 AM, with no order placed before 7:00 AM or after the 4:00 PM hour.  On average they wait 19 days between orders, and they have gone as long as 30 days between orders.
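
A frequency count per day and a median hour give a more direct picture of this behavior than the mean alone. The sketch below copies the day and hour values from the ten rows displayed for user 1 above. `describe()` reports 11 orders, so these figures differ slightly from its output. The name `df_u` is hypothetical.

```python
import pandas as pd

# Values copied from the ten displayed rows for user 1
df_u = pd.DataFrame({
    'order_day_of_week': [2, 3, 3, 4, 4, 2, 1, 1, 1, 4],
    'order_hour_of_day': [8, 7, 12, 7, 15, 7, 9, 14, 16, 8],
})

# How many orders fall on each day, listed in day order
dow_counts = df_u['order_day_of_week'].value_counts().sort_index()

# Median and mean order hour for the displayed rows
median_hour = df_u['order_hour_of_day'].median()
mean_hour = df_u['order_hour_of_day'].mean()

print(dow_counts)
print(median_hour, mean_hour)
```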

## 5.0 Exporting Dataframes

In [32]:
#5.0 Export dataframes to csv
df_orders.to_csv(os.path.join(project_path, '02 Data','02 02 Prepared Data', 'orders_wrangled.csv'))
df_departments_t_new.to_csv(os.path.join(project_path, '02 Data', '02 02 Prepared Data', 'departments_wrangled.csv'))
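
One thing to keep in mind when these files are read back: CSV does not store data types, so identifiers converted to strings come back as integers unless a dtype is given. A minimal sketch using an in-memory buffer instead of a file on disk (`df_demo` and `buffer` are hypothetical names):

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'user_id': ['1', '2'], 'order_number': [1, 1]})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

# Default read: user_id is inferred as an integer again
buffer.seek(0)
df_default = pd.read_csv(buffer)

# Passing dtype keeps user_id as a string
buffer.seek(0)
df_typed = pd.read_csv(buffer, dtype={'user_id': 'str'})

print(df_default['user_id'].tolist(), df_typed['user_id'].tolist())
```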