### This script contains the following points:
Note: Corresponding steps for exercise task shown as (Step x)

#### 1 Exercise 4.5 Reading and Examples
#### 1.1 Importing Libraries, Defining Project Path, and Importing Datasets
- Importing orders_wrangled.csv dataset
- Importing products.csv dataset

#### 1.2 Mixed-Type Data Example
#### 1.3 Missing Values
- Finding Missing Values in 'df_prods'
- Addressing Missing Values in 'df_prods'

#### 1.4 Duplicates
- Finding Duplicates in 'df_prods_clean'
- Addressing Duplicate Values in 'df_prods_clean'

#### 2 Exercise 4.5 Task
#### (Step 2) Observing df_ords and df_prods with descriptive statistics
#### (Step 2.a) Addressing suspected typos in product prices
#### (Step 3) Check for mixed-type data in df_ords dataframe
#### (Step 4) Fix mixed-type data
#### (Step 5) Check for missing values in df_ords
#### (Step 6) Address missing values in df_ords
#### (Step 7) Check for duplicate values in df_ords
#### (Step 8) Address duplicate values in df_ords
#### (Step 9.a) Exporting 'df_ords' (as 'orders_checked.csv')
#### (Step 9.b) Exporting 'df_prods_clean_no_dups_omit_2' (as 'products_checked.csv')

# Exercise 4.5 Reading and Examples

## 1.1 Importing Libraries, Defining Project Path, and Importing Datasets

In [1]:
# Importing pandas, numpy, and os
import pandas as pd
import numpy as np
import os

In [2]:
# Defining project folder path
path = r'C:\Users\prena\03-2023 Instacart Basket Analysis'

In [3]:
path

'C:\\Users\\prena\\03-2023 Instacart Basket Analysis'

In [4]:
# Importing orders_wrangled.csv dataset
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

In [5]:
df_ords.head()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order
0,0,2539329,1,1,2,8,
1,1,2398795,1,2,3,7,15.0
2,2,473747,1,3,3,12,21.0
3,3,2254736,1,4,4,7,29.0
4,4,431534,1,5,4,15,28.0


In [6]:
# Importing products.csv dataset
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [7]:
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


## 1.2 Mixed-Type Data Example

In [8]:
# Create a dataframe
df_test = pd.DataFrame()

In [9]:
# Create a mixed type column
df_test['mix'] = ['a', 'b', 1, True]

In [10]:
df_test

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [11]:
# Check for mixed types
for col in df_test.columns.tolist():
  weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_test[weird]) > 0:
    print (col)

mix


In [12]:
# Fix the mixed-type column (convert every value in the column to a string)
df_test['mix'] = df_test['mix'].astype('str')

In [13]:
df_test['mix'].dtype

dtype('O')

In [14]:
# Re-check for mixed types
for col in df_test.columns.tolist():
  weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_test[weird]) > 0:
    print (col)
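The check above can also be written as a small helper. This is a minimal sketch, not part of the original exercise: the function name `mixed_type_columns` and the toy dataframe `df_demo` are my own, and it uses `Series.map(type)` instead of `DataFrame.applymap`, which is deprecated from pandas 2.1 onward.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns that hold more than one Python type."""
    mixed = []
    for col in df.columns:
        # Series.map(type) yields the Python type of every value in the column
        if df[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed

# Hypothetical toy data mirroring df_test
df_demo = pd.DataFrame({'mix': ['a', 'b', 1, True], 'clean': [1, 2, 3, 4]})
print(mixed_type_columns(df_demo))   # ['mix']

# After converting to strings, nothing is flagged
df_demo['mix'] = df_demo['mix'].astype('str')
print(mixed_type_columns(df_demo))   # []
```

One caveat with this approach: a text column that contains NaN would also be flagged, because NaN is a float.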

## 1.3 Missing Values

### Finding Missing Values in 'df_prods'

In [15]:
# Determine sum of missing values for each column
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [16]:
# View the missing values in column "product_name" - create subset called df_nan
df_nan = df_prods[df_prods['product_name'].isnull() == True]

In [17]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [18]:
# Export missing values to review with stakeholders
df_nan.to_csv(os.path.join(path, '02 Data','Prepared Data', 'missing_values_products.csv'))

### Addressing Missing Values in 'df_prods'

In [19]:
# Checking the size of our df_prods dataframe
df_prods.shape

(49693, 5)

In [20]:
# Create a new subset for df_prods that excludes null values in column product_name
df_prods_clean = df_prods[df_prods['product_name'].isnull() == False]

In [21]:
# Check to see the decrease in rows
df_prods_clean.shape

(49677, 5)
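As an alternative to the boolean filter, `dropna` with the `subset` argument should give the same result. A minimal sketch on a made-up three-row dataframe (`df_small` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_prods with one missing product_name
df_small = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Cookies', np.nan, 'Salt'],
    'prices': [5.8, 9.3, 4.5],
})

# Boolean filter, as used above
via_mask = df_small[df_small['product_name'].isnull() == False]

# Same result: drop rows only where product_name is missing
via_dropna = df_small.dropna(subset=['product_name'])

print(via_mask.equals(via_dropna))   # True
print(via_dropna.shape)              # (2, 3)
```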

## 1.4 Duplicates

### Finding Duplicates in 'df_prods_clean'

In [22]:
# Find full duplicates and put them into a new subset named df_dups
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [23]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


### Addressing Duplicate Values in 'df_prods_clean'

In [24]:
# Checking the size of our df_prods_clean dataframe
df_prods_clean.shape

(49677, 5)

In [25]:
# Create a new subset for df_prods_clean that excludes duplicates
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [26]:
# Check to see the decrease in rows
df_prods_clean_no_dups.shape

(49672, 5)
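To see what `duplicated()` and `drop_duplicates()` are doing, here is a small sketch on made-up data (`df_small` is hypothetical). By default only the second and later copies are marked as duplicates; `keep=False` marks every copy.

```python
import pandas as pd

# Hypothetical data with one fully duplicated row
df_small = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Ranger IPA', 'Body Wash', 'Body Wash', 'Salt'],
})

# Default keep='first': only the repeat (index 2) is flagged
later_copies = df_small[df_small.duplicated()]

# keep=False: both copies (index 1 and 2) are flagged, useful for review
all_copies = df_small[df_small.duplicated(keep=False)]

# drop_duplicates keeps the first copy and removes the repeat
deduped = df_small.drop_duplicates()

print(len(later_copies), len(all_copies), len(deduped))   # 1 2 3
```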

# --------------------------------------------------------------------------------------------------

# Exercise 4.5 Task

## Step 2: Observing df_ords and df_prods with descriptive statistics

In [27]:
# Obtaining df_ords descriptive stats
df_ords.describe()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710541.0,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,0.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,855270.5,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710541.0,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421082.0,3421083.0,206209.0,100.0,6.0,23.0,30.0


#### Observations from df_ords.describe()
- Column "Unnamed: 0" looks to be an index column.
- Columns "order_id" and "user_id" should be treated as strings since these are identifying values.
- Column "user_id" has a min value of 1 and a max value of 206,209. Given there are 3,421,083 records, we can expect there to be duplicates in the "user_id" column (which is okay because users can have multiple order records).
- Column "order_day_of_week" has a min value of 0 (which represents Saturday) and a max value of 6 (which represents Friday).
- Column "order_hour_of_day" has a min value of 0 (which represents 12am) and a max value of 23 (which represents 11pm).
- Column "days_since_last_order" has a min value of 0 and a max value of 30 days, and on average, there are 11 days between customers' orders.
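The observation about identifier columns could be acted on as follows. This is only a sketch on a two-row stand-in (`df_small` and `id_cols` are my own names); the notebook itself leaves these columns as integers.

```python
import pandas as pd

# Hypothetical two-row stand-in for df_ords
df_small = pd.DataFrame({
    'order_id': [2539329, 2398795],
    'user_id': [1, 1],
    'order_number': [1, 2],
})

# Convert identifier columns to strings so they are not summarized as numbers
id_cols = ['order_id', 'user_id']
df_small[id_cols] = df_small[id_cols].astype('str')

# describe() now only reports on the remaining numeric column
print(df_small.describe().columns.tolist())   # ['order_number']
```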

In [28]:
# Obtaining df_prods_clean_no_dups descriptive stats
df_prods_clean_no_dups.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,67.762442,11.728942,9.993282
std,14340.705287,38.315784,5.850779,453.615536
min,1.0,1.0,1.0,1.0
25%,12432.75,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.25,100.0,17.0,11.1
max,49688.0,134.0,21.0,99999.0


#### Observations from df_prods_clean_no_dups.describe()
- Columns "product_id", "aisle_id", and "department_id" should be treated as strings since these are identifying values.
- Column "prices" has a minimum of 1.00 and a max of 99,999.00. This maximum looks out of place since a dollar amount of $99,999 seems pretty large for a grocery store / Instacart item.

In [29]:
# Taking a look at products listed with a price of $99,999
df_prods_clean_no_dups[df_prods_clean_no_dups['prices']==99999]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


In [30]:
# Are there any other products listed at an unusually high price?
df_prods_clean_no_dups[df_prods_clean_no_dups['prices']>25]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


In [31]:
df_incorrect = df_prods_clean_no_dups[df_prods_clean_no_dups['prices']>25]

In [32]:
# Export suspected incorrect prices to review with stakeholders
df_incorrect.to_csv(os.path.join(path, '02 Data','Prepared Data', 'missing_values_products_2.csv'))

That's expensive milk and cottage cheese!!

## Step 2.a: Addressing suspected typos in product prices

I would reach out to stakeholders to check in on these prices to figure out the correct values. Since I can't do this, I'm going to try to create a new dataframe and exclude these items from the product list. First, I want to check how many products there are and how many are in department_id 16, and then play around with getting a new dataframe without these two rows!

In [33]:
# Check size of df_prods_clean_no_dups
df_prods_clean_no_dups.shape

(49672, 5)

In [34]:
# Check size of department_id = 16
df_prods[df_prods['department_id']==16]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
8,9,Light Strawberry Blueberry Yogurt,120,16,7.0
97,98,2% Yellow American Cheese,2,16,4.5
116,117,Petit Suisse Fruit,2,16,3.1
120,121,Sharp Cheddar,21,16,8.1
126,127,Marscapone,108,16,3.5
...,...,...,...,...,...
49621,49617,Creamy Mozzarella Sun-Dried Tomato & Basil Fla...,2,16,10.2
49626,49622,Original Powder Coffee Creamer,53,16,12.0
49632,49628,Yoghurt Blueberry,120,16,14.7
49648,49644,Feta Crumbles,2,16,6.5


From this, I can see that there are 49,672 rows in our df_prods_clean_no_dups dataframe, and there are 3,449 products that make up department_id 16 (where our expensive cottage cheese and milk reside). I'm hoping to remove these two products from df_prods_clean_no_dups, so I expect my new dataframe to have 49,670 rows:

In [35]:
# View the subset made up of product_id 21553 and 33664
df_prods_clean_no_dups[df_prods_clean_no_dups['product_id'].isin([21553,33664])]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


In [36]:
# View a subset that omits product_id 21553 (i.e. our expensive cottage cheese)
df_prods_clean_no_dups[df_prods_clean_no_dups['product_id'] != 21553]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


I can see that the count of rows has gone down by one, which tells me the cottage cheese was omitted correctly!

In [37]:
# Create a new dataframe omitting product_id 21553
df_prods_clean_no_dups_omit_1 = df_prods_clean_no_dups[df_prods_clean_no_dups['product_id'] != 21553]

In [38]:
df_prods_clean_no_dups_omit_1.shape

(49671, 5)

Our new dataframe has one fewer row than our original dataframe (df_prods_clean_no_dups). This is because I omitted the cottage cheese product (product_id = 21553). Next, I'll repeat this to remove the high-priced milk:

In [39]:
# View a subset that omits product_id 33664 (i.e. our expensive milk)
df_prods_clean_no_dups_omit_1[df_prods_clean_no_dups_omit_1['product_id'] != 33664]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


In [40]:
# Create a new dataframe omitting product_id 33664
df_prods_clean_no_dups_omit_2 = df_prods_clean_no_dups_omit_1[df_prods_clean_no_dups_omit_1['product_id'] != 33664]

In [41]:
df_prods_clean_no_dups_omit_2.shape

(49670, 5)

Our next new dataframe has one fewer row than our "df_prods_clean_no_dups_omit_1" dataframe because we've removed our expensive milk product (product_id = 33664). Now, I'll run the descriptive stats for our most recent dataframe to see whether the stats for "prices" look more reasonable than before:

In [42]:
df_prods_clean_no_dups_omit_2.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49670.0,49670.0,49670.0,49670.0
mean,24850.238716,67.761305,11.72877,7.680437
std,14340.93185,38.31606,5.850834,4.199381
min,1.0,1.0,1.0,1.0
25%,12432.25,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.75,100.0,17.0,11.1
max,49688.0,134.0,21.0,25.0


Success! These stats on prices look a lot better!
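The two removals could also be done in a single step by negating `isin`. A minimal sketch: the two suspect product_ids and prices come from the output above, while `df_small`, the other two rows, and their prices are made up for illustration.

```python
import pandas as pd

# Hypothetical four-row stand-in for df_prods_clean_no_dups
df_small = pd.DataFrame({
    'product_id': [21552, 21553, 33664, 33665],
    'prices': [4.2, 14900.0, 99999.0, 7.1],
})

# Keep every row whose product_id is NOT in the list of suspect ids
suspect_ids = [21553, 33664]
df_filtered = df_small[~df_small['product_id'].isin(suspect_ids)]

print(df_filtered.shape)             # (2, 2)
print(df_filtered['prices'].max())   # 7.1
```

Filtering on product_id rather than on `prices > 25` keeps the removal limited to the two rows that were reviewed.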

## Step 3: Check for mixed-type data in df_ords dataframe

In [43]:
# Check for mixed types
for col in df_ords.columns.tolist():
  weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_ords[weird]) > 0:
    print (col)

Since the loop did not print any column names, we can say that there is no mixed-type data in our df_ords dataframe. Checking the data types table for df_ords before moving on:

In [44]:
# Observe df_ords datatypes since we didn't find any mixed-type
df_ords.dtypes

Unnamed: 0                 int64
order_id                   int64
user_id                    int64
order_number               int64
order_day_of_week          int64
order_hour_of_day          int64
days_since_last_order    float64
dtype: object

In [45]:
df_ords

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order
0,0,2539329,1,1,2,8,
1,1,2398795,1,2,3,7,15.0
2,2,473747,1,3,3,12,21.0
3,3,2254736,1,4,4,7,29.0
4,4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...,...
3421078,3421078,2266710,206209,10,5,18,29.0
3421079,3421079,1854736,206209,11,4,10,30.0
3421080,3421080,626363,206209,12,1,12,18.0
3421081,3421081,2977660,206209,13,1,12,7.0


Given that the column "days_since_last_order" is listed as a data type of float64, but our descriptive stats and the first five/last five records for "days_since_last_order" show these values as integers ending in ".0", we could change this column to "int64". However, when doing this we get an error since there are missing values represented as NaN.

## Step 4:  Fix mixed-type data

Because the NaN value is an indicator of the order being a customer's first order, these NaNs should remain and we should not change the data type of this column. I'll be looking into this NaN value in the following step. 
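If whole-number values were ever needed here, one option I'm aware of is pandas' nullable integer dtype `'Int64'` (capital I), which can hold integers alongside missing values. This is only a sketch of that alternative on the first four values from df_ords.head(), not something the exercise requires:

```python
import numpy as np
import pandas as pd

# First four values of days_since_last_order from df_ords.head()
days = pd.Series([np.nan, 15.0, 21.0, 29.0], name='days_since_last_order')

# Plain 'int64' cannot hold NaN; the nullable 'Int64' dtype keeps it as <NA>
days_int = days.astype('Int64')

print(days_int.dtype)          # Int64
print(days_int.isna().sum())   # 1
```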

## Step 5: Check for missing values in df_ords

In [46]:
# Determine sum of missing values for each column
df_ords.isnull().sum()

Unnamed: 0                    0
order_id                      0
user_id                       0
order_number                  0
order_day_of_week             0
order_hour_of_day             0
days_since_last_order    206209
dtype: int64

In [47]:
# View the missing values in column "days_since_last_order" - create subset called df_nan_ords
df_nan_ords = df_ords[df_ords['days_since_last_order'].isnull() == True]

In [48]:
df_nan_ords

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order
0,0,2539329,1,1,2,8,
11,11,2168274,2,1,2,11,
26,26,1374495,3,1,1,14,
39,39,3343014,4,1,6,11,
45,45,2717275,5,1,3,12,
...,...,...,...,...,...,...,...
3420930,3420930,969311,206205,1,4,12,
3420934,3420934,3189322,206206,1,3,18,
3421002,3421002,2166133,206207,1,6,19,
3421019,3421019,2227043,206208,1,1,15,


In [60]:
# Check using crosstab to make sure all NaN values are associated with order_number = 1
crosstab = pd.crosstab(df_ords['days_since_last_order'], df_ords['order_number'], dropna = False)

In [61]:
# Copy results to Excel to review with stakeholders
crosstab.to_clipboard()

## Step 6: Address missing values in df_ords

The field "days_since_last_order" is the only column that contains missing values. Looking at our subset called "df_nan_ords", we can see that the column "order_number" is 1 for all records where "days_since_last_order" is NaN. It makes sense that "days_since_last_order" is NaN for these records, because there is no previous order from which to calculate a value (since this is the first order). Since this is the case, I would either create a new variable for these missing values (maybe something to state that this value can't be calculated) or leave it as NaN, knowing the reasoning above as to why it's blank. I think leaving this as NaN is the best option for the time being.
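The reasoning above can be checked in code rather than by eye, and the "new variable" idea can be sketched as a flag column. Everything below is illustrative: `df_small` is a made-up five-row stand-in for df_ords and `first_order` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_ords: two users, first orders have NaN
df_small = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_last_order': [np.nan, 15.0, 21.0, np.nan, 10.0],
})

is_missing = df_small['days_since_last_order'].isnull()
is_first = df_small['order_number'] == 1

# True only if NaN appears on first orders and nowhere else
nan_matches_first_order = bool((is_missing == is_first).all())
print(nan_matches_first_order)   # True

# Optional flag column that records why the value is blank
df_small['first_order'] = is_first
```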

## Step 7: Check for duplicate values in df_ords

In [51]:
# Find full duplicates and put them into a new subset named df_dups_ords
df_dups_ords = df_ords[df_ords.duplicated()]

In [52]:
df_dups_ords

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order


There were 0 full duplicates found in our df_ords dataframe. The only other check I would make is to find duplicates in the "order_id" column, as I would suspect that each record is represented by a single "order_id". All other columns in this dataframe I would expect to have duplicates.

In [53]:
df_ords[df_ords.duplicated('order_id')]

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_last_order


## Step 8: Address duplicate values in df_ords

Because there were no full duplicates found and no duplicates found for column "order_id", no action is needed.
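A quicker way to confirm that a key column has no repeats is the `is_unique` property of a Series. A minimal sketch using the first three orders from df_ords.head() (`df_small` is my own name):

```python
import pandas as pd

# First three orders from df_ords.head()
df_small = pd.DataFrame({
    'order_id': [2539329, 2398795, 473747],
    'user_id': [1, 1, 1],
})

# order_id should identify a single record; user_id is expected to repeat
print(df_small['order_id'].is_unique)   # True
print(df_small['user_id'].is_unique)    # False
```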

## Step 9.a: Exporting 'df_ords' (as 'orders_checked.csv')

In [54]:
df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_checked.csv'))

## Step 9.b: Exporting 'df_prods_clean_no_dups_omit_2' (as 'products_checked.csv')

In [55]:
df_prods_clean_no_dups_omit_2.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_checked.csv'))
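One thing to be aware of with these exports: `to_csv` writes the dataframe index as an extra unnamed column by default, which is the likely source of the "Unnamed: 0" column seen in df_ords. Passing `index=False` avoids this. The sketch below shows the difference using an in-memory buffer instead of the project folder; `df_small` and the buffer names are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for df_ords
df_small = pd.DataFrame({'order_id': [2539329, 2398795], 'user_id': [1, 1]})

# Default export: the index is written as a column without a header
with_index = io.StringIO()
df_small.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)      # ['Unnamed: 0', 'order_id', 'user_id']

# index=False: only the real columns are written
without_index = io.StringIO()
df_small.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)   # ['order_id', 'user_id']
```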