# Table of Contents 

This notebook contains the following: 

* Importing libraries
* Turning project path into a string
* Importing data set
* Applying the following:
    * If-Statements with User-Defined Functions
    * If-Statements with the loc() Function
    * If-Statements with For-Loops
* Deriving new variables (Creating new columns)
* Checking for accuracy 
* Creating a new variable that labels periods of the day as “Most orders,” “Average orders,” and “Fewest orders”
* Printing frequency of dataframe
* Exporting dataframe 


# Step 1

## If you haven’t done so already, complete the instructions in the Exercise for creating the “price_label” and “busiest_day” columns.

### 01. Importing libraries

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import os

### 02. Turning project path into a string

In [2]:
#Turn project folder path into a string

'/Users/aysha/Documents/Instacart Basket Analysis/'

'/Users/aysha/Documents/Instacart Basket Analysis/'

In [3]:
path = r'/Users/aysha/Documents/Instacart Basket Analysis/'

In [4]:
path

'/Users/aysha/Documents/Instacart Basket Analysis/'

### 03. Import the ords_prod_merged df exported from previous task (Task 4.6)

In [5]:
# Import the "orders_products_merged_revised.pkl" file from Task 4.6

df_order_products_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged_revised.pkl'))

In [None]:
df_order_products_merged.head()

In [7]:
# Create a subset of first million rows

df = df_order_products_merged[:1000000]

In [None]:
df.head()

In [None]:
df

In [10]:
# Checking the shape of the df subset

df.shape

(1000000, 15)

## If-Statements with User-Defined Functions

In [11]:
# Define a function (price_label)

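def price_label(row):
    # Label each row by its price: <= 5 is low, > 5 up to 15 is mid, > 15 is high
    if row['prices'] <= 5:
        return 'Low-range product'
    elif (row['prices'] > 5) and (row['prices'] <= 15):
        return 'Mid-range product'
    elif row['prices'] > 15:
        return 'High-range product'
    else:
        # Reached only when the price is missing (NaN)
        return 'Not enough data'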

In [12]:
# Apply the function on the subset df 

df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)
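
The warning above is raised because `df` was created by slicing `df_order_products_merged`, so pandas cannot tell whether the new column is meant for the slice or for the parent frame. One common way to avoid it is to take an explicit copy of the slice before adding columns. The sketch below uses a small made-up frame; `parent` and `subset` are illustrative names, not objects from this notebook, and whether the warning appears at all depends on the pandas version.

```python
import pandas as pd

# Hypothetical toy data standing in for the merged orders/products frame.
parent = pd.DataFrame({'prices': [2.5, 7.0, 14.8, 3.1]})

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and the parent frame is left untouched.
subset = parent[:3].copy()
subset['price_range'] = 'placeholder'

print(subset.columns.tolist())  # columns of the copy, including the new one
print(parent.columns.tolist())  # columns of the parent, unchanged
```

In this notebook the equivalent change would be `df = df_order_products_merged[:1000000].copy()`, at the cost of holding a second copy of those rows in memory.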


In [None]:
# Check the values in your df 

df['price_range'].value_counts(dropna = False)

In [14]:
# Check most expensive product within subset 

df['prices'].max()

14.8

## If-Statements with the loc() Function

In [15]:
# Create conditions for loc Function 

In [18]:
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [17]:
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [19]:
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [None]:
# Check the values in your df 

df['price_range_loc'].value_counts(dropna = False)

In [22]:
# Check this on the entire dataframe (df_order_products_merged) instead of the subset

In [23]:
df_order_products_merged.loc[df_order_products_merged['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [29]:
df_order_products_merged.loc[(df_order_products_merged['prices'] <= 15) & (df_order_products_merged['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [27]:
df_order_products_merged.loc[df_order_products_merged['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [None]:
# Check the values in your entire dataframe 

df_order_products_merged['price_range_loc'].value_counts(dropna = False)
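
As an alternative to three separate `loc` assignments, the same price bands can be written as one vectorized call. This is only a sketch on a made-up frame, assuming the thresholds used above (5 and 15); `toy` and the column name `price_range_select` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical prices chosen to sit on and around the two thresholds.
toy = pd.DataFrame({'prices': [1.0, 5.0, 5.1, 15.0, 15.1, np.nan]})

# One boolean condition per label, checked in order.
conditions = [
    toy['prices'] <= 5,
    (toy['prices'] > 5) & (toy['prices'] <= 15),
    toy['prices'] > 15,
]
labels = ['Low-range product', 'Mid-range product', 'High-range product']

# Rows matching no condition (here, the missing price) get the default.
toy['price_range_select'] = np.select(conditions, labels, default='Not enough data')

print(toy)
```

Unlike the `loc` version, rows with a missing price receive an explicit label instead of staying NaN, so the two approaches differ in how missing data shows up in `value_counts`.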

## If-Statements with For-Loops

In [31]:
# Print frequency of orders_day_of_week column 

df_order_products_merged['orders_day_of_week'].value_counts(dropna = False)

0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: orders_day_of_week, dtype: int64

### From the project, each day corresponds to a number, as follows
#### 0 = Saturday
#### 1 = Sunday 
#### 2 = Monday
#### 3 = Tuesday
#### 4 = Wednesday
#### 5 = Thursday
#### 6 = Friday

In [33]:
# Create a new column 'Busiest day' with the following three values
# * Busiest day
# * Least busy 
# * Regularly busy 


result = []

for value in df_order_products_merged["orders_day_of_week"]:
  if value == 0:
    result.append("Busiest day")
  elif value == 4:
    result.append("Least busy")
  else:
    result.append("Regularly busy")

In [None]:
result

In [35]:
# Combine result with df (create a new column 'busiest day' and set it equal to 'result')

df_order_products_merged['busiest day'] = result

In [None]:
# Print frequency of busiest day column / Check the values in df for busiest day column

df_order_products_merged['busiest day'].value_counts(dropna = False)
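
The loop above hard-codes day 0 as the busiest and day 4 as the least busy, based on reading the frequency table. A possible variation, sketched here on made-up data, is to read those two day codes from the counts and label the rows without a Python loop. The names `days`, `busiest`, and `slowest` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; in the notebook this would be
# df_order_products_merged['orders_day_of_week'].
days = pd.Series([0, 0, 0, 1, 1, 2, 4, 0, 1, 2], name='orders_day_of_week')

counts = days.value_counts()
busiest = counts.idxmax()  # day code with the most rows
slowest = counts.idxmin()  # day code with the fewest rows

# Nested np.where mirrors the if / elif / else structure of the loop.
labels = np.where(days == busiest, 'Busiest day',
                  np.where(days == slowest, 'Least busy', 'Regularly busy'))

print(busiest, slowest)
print(pd.Series(labels).value_counts())
```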

In [None]:
df_order_products_merged.head()

# Step 2


## Suppose your clients have changed their minds about the labels you created in your “busiest_day” column. Now, they want “Busiest day” to become “Busiest days” (plural). This label should correspond with the two busiest days of the week as opposed to the single busiest day. At the same time, they’d also like to know the two slowest days. Create a new column for this using a suitable method.

In [58]:
# Create a new column 'Busiest days' with the following three values
# * Busiest days - should correspond to the two busiest days of the week 
# * Slowest days - should correspond to the two slowest days
# * Regularly busy days 


result2 = []

for value in df_order_products_merged["orders_day_of_week"]:
  if value in [0, 1]:
    result2.append("Busiest days")
  elif value in [3, 4]:
    result2.append("Slowest days")
  else:
    result2.append("Regularly busy")

In [None]:
result2

In [60]:
# Checking if result2 length = DataFrame length

len(result2)

32404859

In [61]:
# Combine the new column with df (create a new column 'Busiest days' and set it equal to 'result2')

df_order_products_merged['Busiest days'] = result2

In [62]:
# Print frequency of Busiest days column 

df_order_products_merged['Busiest days'].value_counts(dropna = False)

Regularly busy    12916111
Busiest days      11864412
Slowest days       7624336
Name: Busiest days, dtype: int64
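
The same labels can also be produced with a dictionary lookup instead of a loop. This is a sketch on a made-up series, assuming the day codes chosen above (0 and 1 busiest, 3 and 4 slowest); `days` and `day_labels` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy column covering every day code at least once.
days = pd.Series([0, 1, 2, 3, 4, 5, 6, 0, 3])

# Only the four special days are listed; every other day falls through.
day_labels = {0: 'Busiest days', 1: 'Busiest days',
              3: 'Slowest days', 4: 'Slowest days'}

# map() returns NaN for codes missing from the dict, which fillna()
# then turns into the catch-all label, like the loop's else branch.
busiest_days = days.map(day_labels).fillna('Regularly busy')

print(busiest_days.value_counts())
```

On tens of millions of rows, a vectorized lookup of this kind usually avoids building a large intermediate Python list such as `result2`.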

In [None]:
df_order_products_merged.head()

# Step 3

## Check the values of this new column for accuracy. Note any observations in markdown format.

In [64]:
# Checking value for busiest days ( 0 = Saturday, 1 = Sunday)

6204182 + 5660230

11864412

In [65]:
# Checking value for slowest days ( 3 = Tuesday, 4 = Wednesday)

3840534 + 3783802

7624336

### The “Busiest days” count (11864412) equals the combined count for days 0 and 1, and the “Slowest days” count (7624336) equals the combined count for days 3 and 4, so the new column matches the day-of-week frequencies.
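
The two manual additions can also be checked in one pass by grouping the per-day counts by label. The sketch below re-enters the counts printed by `value_counts()` earlier in this notebook; `day_counts` and `day_group` are illustrative names.

```python
import pandas as pd

# Row counts per day code, copied from the orders_day_of_week frequencies.
day_counts = pd.Series({0: 6204182, 1: 5660230, 6: 4496490, 2: 4213830,
                        5: 4205791, 3: 3840534, 4: 3783802})

def day_group(day):
    # Same rule as the loop that built 'Busiest days'.
    if day in [0, 1]:
        return 'Busiest days'
    elif day in [3, 4]:
        return 'Slowest days'
    return 'Regularly busy'

# Passing a function to groupby applies it to each index value (the day code).
expected = day_counts.groupby(day_group).sum()

print(expected)
```

The three sums can then be compared directly with the frequencies of the 'Busiest days' column.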

# Step 4 

## When too many users make Instacart orders at the same time, the app freezes. The senior technical officer at Instacart wants you to identify the busiest hours of the day. Rather than by hour, they want periods of time labeled “Most orders,” “Average orders,” and “Fewest orders.” Create a new column containing these labels called “busiest_period_of_day.”

In [67]:
# Check / Print frequency of order_time_of_day column 

df_order_products_merged['order_time_of_day'].value_counts(dropna = False)

10    2761760
11    2736140
14    2689136
15    2662144
13    2660954
12    2618532
16    2535202
9     2454203
17    2087654
8     1718118
18    1636502
19    1258305
20     976156
7      891054
21     795637
22     634225
23     402316
6      290493
0      218769
1      115700
5       87961
2       69375
4       53242
3       51281
Name: order_time_of_day, dtype: int64

In [72]:
# Create a new column 'Busiest_period_of_day' with the following three values
# * Most orders (9 - 17) 
# * Average orders (7-8 and 18 - 23)
# * Fewest orders  (0- 6)


result3 = []

for value in df_order_products_merged["order_time_of_day"]:
  if value in [9, 10, 11, 12, 13, 14, 15, 16, 17]:
    result3.append("Most orders")
  elif value in [7, 8, 18, 19, 20, 21, 22, 23]:
    result3.append("Average orders")
  else:
    result3.append("Fewest orders")

In [None]:
result3

In [74]:
# Checking if result3 length = DataFrame length

len(result3)

32404859

In [75]:
# Combine the new column with df (create a new column 'Busiest_period_of_day' and set it equal to 'result3')

df_order_products_merged['Busiest_period_of_day'] = result3
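
Because the hour groups are contiguous ranges, they can also be expressed with range checks rather than explicit lists. This is a sketch on a made-up series of all 24 hours, assuming the same boundaries as the loop above; `hours` and `period` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical column holding each hour of the day once.
hours = pd.Series(range(24), name='order_time_of_day')

# Series.between is inclusive on both ends by default.
conditions = [
    hours.between(9, 17),
    hours.between(7, 8) | hours.between(18, 23),
]
choices = ['Most orders', 'Average orders']

# Hours matching neither condition (0 to 6) get the default label.
period = pd.Series(np.select(conditions, choices, default='Fewest orders'),
                   index=hours.index)

print(period.value_counts())
```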

In [None]:
df_order_products_merged.head()

# Step 5 


## Print the frequency for this new column.

In [77]:
# Print frequency of Busiest_period_of_day column 

df_order_products_merged['Busiest_period_of_day'].value_counts(dropna = False)

Most orders       23205725
Average orders     8312313
Fewest orders       886821
Name: Busiest_period_of_day, dtype: int64

In [78]:
# Checking value for Most orders ( 9 - 17)

2761760 + 2736140 + 2689136 + 2662144 + 2660954 + 2618532 + 2535202 + 2454203 + 2087654

23205725

In [79]:
# Checking value for Average orders ( 7 - 8 & 18 - 23)

1718118 + 1636502 + 1258305 + 976156 + 891054 + 795637 + 634225 + 402316

8312313

In [80]:
# Checking value for Fewest orders ( 0 - 6) 

290493 + 218769 + 115700 + 87961 + 69375 + 53242 + 51281

886821
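
Another way to inspect the labeling, besides adding the counts by hand, is a cross-tabulation of hour against label, which shows at a glance which hours landed in each period. The sketch below runs on a small made-up frame; `toy` and `period_label` are illustrative names, and the rule is assumed to match the loop in Step 4.

```python
import pandas as pd

# Hypothetical sample of order hours.
toy = pd.DataFrame({'order_time_of_day': [3, 7, 9, 10, 17, 18, 23, 0, 12]})

def period_label(hour):
    # Same boundaries as the loop that built 'Busiest_period_of_day'.
    if 9 <= hour <= 17:
        return 'Most orders'
    elif hour in (7, 8) or 18 <= hour <= 23:
        return 'Average orders'
    return 'Fewest orders'

toy['Busiest_period_of_day'] = toy['order_time_of_day'].apply(period_label)

# Rows are hours, columns are labels, cells are row counts.
check = pd.crosstab(toy['order_time_of_day'], toy['Busiest_period_of_day'])

# Every hour should have a non-zero count under exactly one label.
labels_per_hour = (check > 0).sum(axis=1)

print(check)
```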

# Step 6 

## Ensure your notebook is clean and structured and that your code is well commented.

### Done

# Step 7 

## Export your dataframe as a pickle file (since you added new columns) and store it correctly in your “Prepared Data” folder.

In [81]:
# Export dataframe (with new columns) to pickle format (pkl)


df_order_products_merged.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_merged_4_7.pkl'))
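
A quick way to confirm that an export kept the new columns is to read the file back and compare it with the frame in memory. The sketch below does this with a small made-up frame in a temporary directory, so it does not touch the project folders; `toy` and `toy_export.pkl` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with one of the derived columns.
toy = pd.DataFrame({
    'order_time_of_day': [9, 3],
    'Busiest_period_of_day': ['Most orders', 'Fewest orders'],
})

# Write to a throwaway directory, then load the file again.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_export.pkl')
    toy.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# equals() compares shape, column names, dtypes, and values.
round_trip_ok = reloaded.equals(toy)

print(round_trip_ok)
```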