### This script contains the following points:
#### 1. Creating "price_range_loc" and "busiest_day" columns in 'df_ords_prods_merged'
#### 2. Creating "busiest_days" column in 'df_ords_prods_merged' (to indicate the two busiest days and the two slowest days)
#### 3. Checking "busiest_days" column (frequency and accuracy)
#### 4. Creating "busiest_period_of_day" in 'df_ords_prods_merged' (to show ranges of busiest times)
#### 5. Checking "busiest_period_of_day" column (frequency and accuracy)
#### 6. Exporting dataframe as 'orders_products_merged_2.pkl'

# Exercise 4.7 Prep-Work (Step 1)

## Step 1.1. Importing Libraries, Defining Project Path, and Importing Datasets

In [1]:
# Importing pandas, numpy, and os
import pandas as pd
import numpy as np
import os

In [2]:
# Defining project folder path
path = r'C:\Users\prena\03-2023 Instacart Basket Analysis'

In [3]:
# Importing orders_products_merged.pkl dataset
df_ords_prods_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged.pkl'))

## Step 1.2. Create smaller subset to work with (1,000,000 rows)

In [4]:
# Create a subset of 1,000,000 rows
df = df_ords_prods_merged[:1000000]

In [5]:
df.shape

(1000000, 15)

## Step 1.3. User-Defined Functions 
### Define "price_label" function and apply it to 'df' subset as column "price_range"

In [6]:
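# Define function for "price_label"
def price_label(row):

  # Prices up to and including 5 are low-range
  if row['prices'] <= 5:
    return 'Low-range product'
  # Prices above 5 and up to and including 15 are mid-range
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  # Prices above 15 are high-range
  elif row['prices'] > 15:
    return 'High-range product'
  # Anything else (e.g. a missing price) cannot be classified
  else:
    return 'Not enough data'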

In [7]:
# Apply function to each row and store the result in new column "price_range"
df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)
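The warning above appears because 'df' was created by slicing 'df_ords_prods_merged', so pandas cannot tell whether the new column is meant for the slice or for the original dataframe. A minimal sketch of one common way to avoid it is to take an explicit copy of the subset. The frame and column values below are toy stand-ins (assumptions for illustration), not the real dataset:

```python
import pandas as pd

# Hypothetical stand-in for df_ords_prods_merged (toy data, not the real dataset)
df_full = pd.DataFrame({"prices": [3.0, 9.0, 14.8, 20.0]})

# .copy() makes the subset an independent dataframe, so adding a column
# to it is unambiguous and leaves the original untouched
df_subset = df_full[:3].copy()
df_subset["price_range"] = "placeholder"

print(df_subset.columns.tolist())
print(df_full.columns.tolist())
```

In this notebook the equivalent change would be `df = df_ords_prods_merged[:1000000].copy()`, at the cost of holding a second copy of those rows in memory.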


In [8]:
# Check the values in column "price_range"
df['price_range'].value_counts(dropna = False)

Mid-range product    756450
Low-range product    243550
Name: price_range, dtype: int64

In [9]:
# Investigate new column
df.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices,_merge_2,price_range
0,2539329,1,1,2,8,,196,1,0,both,Soda,77,7,9.0,both,Mid-range product
1,2398795,1,2,3,7,15.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product
2,473747,1,3,3,12,21.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product
3,2254736,1,4,4,7,29.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product
4,431534,1,5,4,15,28.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product


In [10]:
df['prices'].max()

14.8

## Step 1.4. loc() Function
### Use loc() function on 'df' subset to set values for "price_range_loc"

In [11]:
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'


In [12]:
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [13]:
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [14]:
df['price_range_loc'].value_counts(dropna = False)

Mid-range product    756450
Low-range product    243550
Name: price_range_loc, dtype: int64

This results in the same value counts as the user-defined function in Step 1.3. No "High-range product" rows appear because the maximum price in this subset is 14.8.
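As a side note, the same three price bands can also be produced in a single call with `pd.cut`. This is not the method used in the exercise, just a sketch of an alternative, run on toy prices that I chose to sit on each boundary (they are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy prices chosen to hit each boundary; not taken from the real dataset
toy = pd.DataFrame({"prices": [1.0, 5.0, 5.1, 15.0, 15.1, np.nan]})

# loc()-based labelling, following the same three rules as the cells above
toy.loc[toy["prices"] > 15, "price_range_loc"] = "High-range product"
toy.loc[(toy["prices"] <= 15) & (toy["prices"] > 5), "price_range_loc"] = "Mid-range product"
toy.loc[toy["prices"] <= 5, "price_range_loc"] = "Low-range product"

# pd.cut with right-closed bins: (-inf, 5], (5, 15], (15, inf)
toy["price_range_cut"] = pd.cut(
    toy["prices"],
    bins=[-np.inf, 5, 15, np.inf],
    labels=["Low-range product", "Mid-range product", "High-range product"],
)

# Compare as plain strings; the missing price stays missing in both columns
loc_labels = toy["price_range_loc"].dropna().tolist()
cut_labels = toy["price_range_cut"].dropna().astype(str).tolist()
same = loc_labels == cut_labels
print(same)
```

Both approaches leave a missing price unlabelled, which differs from the user-defined function, where a missing price falls through to 'Not enough data'.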

## Step 1.5 loc() Function
### Use loc() function on entire 'df_ords_prods_merged' dataframe to set values for "price_range_loc"

In [15]:
df_ords_prods_merged.loc[df_ords_prods_merged['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [16]:
df_ords_prods_merged.loc[(df_ords_prods_merged['prices'] <= 15) & (df_ords_prods_merged['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [17]:
df_ords_prods_merged.loc[df_ords_prods_merged['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [18]:
df_ords_prods_merged['price_range_loc'].value_counts(dropna = False)

Mid-range product     21860860
Low-range product     10126321
High-range product      412551
Name: price_range_loc, dtype: int64

## Step 1.6 For-Loops
### Create column "busiest_day" in 'df_ords_prods_merged' and populate values using for-loop

In [19]:
# Determine which day most orders take place w/ frequency table
df_ords_prods_merged['order_day_of_week'].value_counts(dropna = False)

0    6203329
1    5659298
6    4495887
2    4213105
5    4205076
3    3839865
4    3783172
Name: order_day_of_week, dtype: int64

In [20]:
# Create new column ("busiest day") that will contain one of three values: "Busiest day", "Least busy", "Regularly busy"

result = []

for value in df_ords_prods_merged["order_day_of_week"]:
    if value == 0:
        result.append("Busiest day")
    elif value == 4:
        result.append("Least busy")
    else:
        result.append("Regularly busy")

In [21]:
# Create new column called 'busiest_day' within df_ords_prods_merged dataframe and set it equal to result
df_ords_prods_merged['busiest_day'] = result

In [22]:
df_ords_prods_merged['busiest_day'].value_counts(dropna = False)

Regularly busy    22413231
Busiest day        6203329
Least busy         3783172
Name: busiest_day, dtype: int64
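The for-loop above visits all 32 million rows one at a time in Python. A vectorized version of the same rule is sketched below with `np.select`. The exercise itself uses the loop, so this is only an illustration on a toy column of day numbers:

```python
import numpy as np
import pandas as pd

# Toy day-of-week values (0-6), standing in for df_ords_prods_merged
toy = pd.DataFrame({"order_day_of_week": [0, 1, 2, 3, 4, 5, 6, 0]})

# Conditions are checked in order; rows matching none get the default label
conditions = [
    toy["order_day_of_week"] == 0,
    toy["order_day_of_week"] == 4,
]
labels = ["Busiest day", "Least busy"]
toy["busiest_day"] = np.select(conditions, labels, default="Regularly busy")

print(toy["busiest_day"].value_counts(dropna=False))
```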

# Exercise 4.7 Task (Step 2 - Step 7)

## Step 2. Create column to indicate busiest days (plural) and least busy days (plural)

In this step, I'm creating an updated version of the "busiest_day" column (from Step 1.6) called "busiest_days". For this new column, the days of the week will fall into the following categories (based on the frequency table for "order_day_of_week" shown at the beginning of Step 1.6):
- "Busiest day" = the two busiest days of the week (0 - Saturday or 1 - Sunday)
- "Least busy" = the two least busy days of the week (4 - Wednesday or 3 - Tuesday)
- "Regularly busy" = all other days

In [23]:
# Build the values for new column ("busiest_days"), which will contain one of three values: "Busiest day", "Least busy", "Regularly busy"

result = []

for value in df_ords_prods_merged["order_day_of_week"]:
    if value == 0 or value == 1:
        result.append("Busiest day")
    elif value == 4 or value == 3:
        result.append("Least busy")
    else:
        result.append("Regularly busy")

In [24]:
# Create new column called 'busiest_days' within df_ords_prods_merged dataframe and set it equal to result
df_ords_prods_merged['busiest_days'] = result

## Step 3. Check the values of 'busiest_days' column for accuracy

In [25]:
# Check frequency of column 'busiest_days'
df_ords_prods_merged['busiest_days'].value_counts(dropna = False)

Regularly busy    12914068
Busiest day       11862627
Least busy         7623037
Name: busiest_days, dtype: int64

In [26]:
# Check df_ords_prods_merged for update
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices,_merge_2,price_range_loc,busiest_day,busiest_days
0,2539329,1,1,2,8,,196,1,0,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy
1,2398795,1,2,3,7,15.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Least busy
2,473747,1,3,3,12,21.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Least busy
3,2254736,1,4,4,7,29.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy,Least busy
4,431534,1,5,4,15,28.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy,Least busy


In [28]:
# Define function to check new column 'busiest_days'
def add_num(x,y):
    # Use 'total' rather than 'sum' to avoid shadowing Python's built-in sum()
    total = x + y
    return total

In [29]:
# Run function for "Least busy" days by adding frequencies for days 4 and 3
add_num(3839865,3783172)

7623037

In [30]:
# Run function for "Busiest day" days by adding frequencies for days 0 and 1
add_num(6203329,5659298)

11862627
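Both sums match the "busiest_days" frequencies from In [25]. The same check can be done for all three categories at once, without typing the numbers in by hand. The sketch below copies the per-day counts from the Step 1.6 frequency table and groups them by label. `label_day` is a helper name I made up for this illustration:

```python
import pandas as pd

# Order counts per day, copied from the frequency table in Step 1.6
day_counts = pd.Series({0: 6203329, 1: 5659298, 6: 4495887, 2: 4213105,
                        5: 4205076, 3: 3839865, 4: 3783172})

# Same day-to-label rule as the loop in Step 2 (hypothetical helper name)
def label_day(day):
    if day in (0, 1):
        return "Busiest day"
    if day in (3, 4):
        return "Least busy"
    return "Regularly busy"

# Passing a function to groupby applies it to each index value (the day number)
expected = day_counts.groupby(label_day).sum()
print(expected)
```

The three totals can then be compared against the output of `value_counts()` on the "busiest_days" column.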

## Step 4. Create column to indicate busiest hours 

In [31]:
# Determine the busiest hours w/ frequency table
df_ords_prods_merged['order_hour_of_day'].value_counts(dropna = False)

10    2761333
11    2735694
14    2688728
15    2661718
13    2660570
12    2618104
16    2534744
9     2453842
17    2087273
8     1717863
18    1636226
19    1258076
20     976000
7      890923
21     795528
22     634159
23     402272
6      290450
0      218742
1      115683
5       87944
2       69360
4       53232
3       51268
Name: order_hour_of_day, dtype: int64

Ranking the hours by order count and splitting them into three categories (8 hours per category), we get the following:
- "Most orders": 9-16 (9am, 10am, 11am, 12pm, 1pm, 2pm, 3pm, 4pm)
- "Average orders": 7-8 (7am, 8am) and 17-22 (5pm, 6pm, 7pm, 8pm, 9pm, 10pm)
- "Fewest orders": 23-6 (11pm, 12am, 1am, 2am, 3am, 4am, 5am, 6am)
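The grouping above can be reproduced directly from the frequency table rather than read off by eye. A minimal sketch, using the hourly counts copied from the output of In [31] (the variable names are my own):

```python
import pandas as pd

# Order counts per hour, copied from the frequency table above
hour_counts = pd.Series({
    10: 2761333, 11: 2735694, 14: 2688728, 15: 2661718, 13: 2660570,
    12: 2618104, 16: 2534744, 9: 2453842, 17: 2087273, 8: 1717863,
    18: 1636226, 19: 1258076, 20: 976000, 7: 890923, 21: 795528,
    22: 634159, 23: 402272, 6: 290450, 0: 218742, 1: 115683,
    5: 87944, 2: 69360, 4: 53232, 3: 51268,
})

# Rank hours from busiest to slowest, then cut the ranking into thirds
ranked = hour_counts.sort_values(ascending=False).index.tolist()
most, average, fewest = ranked[:8], ranked[8:16], ranked[16:]

print(sorted(most))
print(sorted(average))
print(sorted(fewest))
```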

In [32]:
# Build the values for new column ("busiest_period_of_day"), which will contain one of three values: "Most orders", "Average orders", "Fewest orders"

result = []

for value in df_ords_prods_merged["order_hour_of_day"]:
    if value >= 9 and value < 17:
        result.append("Most orders")
    elif value == 7 or value == 8 or (value >= 17 and value < 23):
        result.append("Average orders")
    else:
        result.append("Fewest orders")

In [33]:
# Create new column called 'busiest_period_of_day' within df_ords_prods_merged dataframe and set it equal to result
df_ords_prods_merged['busiest_period_of_day'] = result

## Step 5. Check the values of 'busiest_period_of_day' for accuracy 

In [34]:
# Check frequency of column 'busiest_period_of_day'
df_ords_prods_merged['busiest_period_of_day'].value_counts(dropna = False)

Most orders       21114733
Average orders     9996048
Fewest orders      1288951
Name: busiest_period_of_day, dtype: int64
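The frequency table alone does not show whether each hour ended up under the intended label. One way to check that is a cross-tabulation of hour against label. The sketch below runs on a toy frame with one row per hour and re-implements the Step 4 rule in a helper I named `label_hour` for illustration. On the real dataframe, `pd.crosstab` could be called on the two existing columns directly:

```python
import pandas as pd

# Toy frame covering every hour once; stands in for df_ords_prods_merged
toy = pd.DataFrame({"order_hour_of_day": range(24)})

# Same rule as the loop in Step 4 (hypothetical helper name)
def label_hour(hour):
    if 9 <= hour < 17:
        return "Most orders"
    if hour in (7, 8) or 17 <= hour < 23:
        return "Average orders"
    return "Fewest orders"

toy["busiest_period_of_day"] = toy["order_hour_of_day"].map(label_hour)

# One row per hour, one column per label; each hour should fall under one label only
check = pd.crosstab(toy["order_hour_of_day"], toy["busiest_period_of_day"])
labels_per_hour = (check > 0).sum(axis=1)
print(check)
```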

In [35]:
# Check df_ords_prods_merged for update
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices,_merge_2,price_range_loc,busiest_day,busiest_days,busiest_period_of_day
0,2539329,1,1,2,8,,196,1,0,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy,Average orders
1,2398795,1,2,3,7,15.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Least busy,Average orders
2,473747,1,3,3,12,21.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Least busy,Most orders
3,2254736,1,4,4,7,29.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy,Least busy,Average orders
4,431534,1,5,4,15,28.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy,Least busy,Most orders


## Step 7. Exporting 'df_ords_prods_merged' to 'orders_products_merged_2.pkl'

In [36]:
# Export data to pkl
df_ords_prods_merged.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_merged_2.pkl'))
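After exporting, a quick sanity check is to read the file back and confirm it matches what was written. The sketch below does this on a small toy dataframe in a temporary folder. The file name 'toy_export.pkl' and the toy values are assumptions for illustration only:

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for df_ords_prods_merged
toy = pd.DataFrame({"order_id": [1, 2],
                    "busiest_days": ["Busiest day", "Least busy"]})

# Write to a temporary folder, then read the pickle back in
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_export.pkl")
    toy.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# equals() compares values, column names, and dtypes
print(reloaded.equals(toy))
```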