# Table of Contents

0.1 Importing Libraries

0.2 Importing Data

0.3 Exploring Original Df

0.4 Adjusting Datatypes

0.5 Cleaning / Wrangling / Checking Data

    0.5.1  Overall df checks
    
    0.5.2  order_id
    
    0.5.3  product_id
    
    0.5.4  add_to_cart_order
    
    0.5.5  reordered

0.6 Exporting the Clean Df


### 0.1 Importing Libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

### 0.2 Importing Data

In [2]:
# Identify the file pathway to data files
path = r'C:\Users\CJ\Documents\_CJ-Stuff\Career Foundry\Data Immersion\Ach 4 - Python\2023-03 Instacart Basket Analysis'

In [3]:
# Import data from order_products_prior.csv
df = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'order_products_prior.csv'), index_col = False)

### 0.3 Exploring Original Df

In [4]:
df.shape

(32434489, 4)

In [5]:
df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [6]:
df.tail()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
32434484,3421083,39678,6,1
32434485,3421083,11352,7,0
32434486,3421083,4600,8,0
32434487,3421083,24852,9,1
32434488,3421083,5020,10,1


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434489 entries, 0 to 32434488
Data columns (total 4 columns):
 #   Column             Dtype
---  ------             -----
 0   order_id           int64
 1   product_id         int64
 2   add_to_cart_order  int64
 3   reordered          int64
dtypes: int64(4)
memory usage: 989.8 MB


In [8]:
df.describe()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
count,32434490.0,32434490.0,32434490.0,32434490.0
mean,1710749.0,25576.34,8.351076,0.5896975
std,987300.7,14096.69,7.126671,0.4918886
min,2.0,1.0,1.0,0.0
25%,855943.0,13530.0,3.0,0.0
50%,1711048.0,25256.0,6.0,1.0
75%,2565514.0,37935.0,11.0,1.0
max,3421083.0,49688.0,145.0,1.0


### 0.4 Adjusting Datatypes

In [9]:
# order_id is an integer identifier that was already read in as int64 (see df.info() above),
# so this cast changes nothing and simply makes the intended dtype explicit
df['order_id'] = df['order_id'].astype('int64')

In [10]:
# product_id ranges between 1 and 50k
# so int32 is more than sufficient and allows room for growth
df['product_id'] = df['product_id'].astype('int32')


In [11]:
# add_to_cart_order ranges between 1 and 145
# so int16 (max 32,767) is more than sufficient
df['add_to_cart_order'] = df['add_to_cart_order'].astype('int16')

In [12]:
# reordered only takes the values 0 and 1
# so int8 is more than adequate
df['reordered'] = df['reordered'].astype('int8')

In [13]:
df.dtypes

order_id             int64
product_id           int32
add_to_cart_order    int16
reordered             int8
dtype: object
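
The downcasts above are only safe if each column's observed range fits inside the target dtype. Below is a minimal sketch of that check using `np.iinfo` on a small made-up frame. The helper name `fits_dtype` and the sample values are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fits_dtype(series, dtype):
    """Return True if every value in series lies within the integer range of dtype."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

# Toy data (assumption) mimicking the min / median / max reported by df.describe()
sample = pd.DataFrame({
    'product_id': [1, 25256, 49688],
    'add_to_cart_order': [1, 6, 145],
    'reordered': [0, 1, 1],
})

# Target dtypes taken from the cells above
targets = {'product_id': 'int32', 'add_to_cart_order': 'int16', 'reordered': 'int8'}

for col, dtype in targets.items():
    # Refuse to downcast if any value would overflow the target dtype
    assert fits_dtype(sample[col], dtype), f'{col} does not fit in {dtype}'
    sample[col] = sample[col].astype(dtype)

# Bytes needed per row for these three columns after downcasting
bytes_per_row = sum(sample[col].dtype.itemsize for col in sample.columns)
print(sample.dtypes)
print(bytes_per_row)
```

Running a range check first matters because `astype` to a smaller integer type can silently wrap values that are out of range instead of raising an error.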

### 0.5 Cleaning data

#### 0.5.1 Overall df checks

In [14]:
# Check for mixed data types
# Print any column where some value's type differs from the type of the value in the first row
for col in df.columns.tolist():
    weird = (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
    if weird.any():
        print(col)


In [15]:
# Check for missing values
df.isnull().sum()

order_id             0
product_id           0
add_to_cart_order    0
reordered            0
dtype: int64

In [16]:
# Check for duplicates 
df.duplicated().sum()

0

No nulls, no duplicated rows, and no mixed-type columns (the mixed-type check printed no column names).
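
For reference, the mixed-type check can also be written per column with `Series.map(type)`, which avoids building a boolean frame for every column. This is a sketch on toy data, not the method used in the notebook. The function name `mixed_type_columns` and the sample frame are assumptions made for illustration.

```python
import pandas as pd

def mixed_type_columns(frame):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in frame.columns if frame[col].map(type).nunique() > 1]

# Toy frame (assumption): one homogeneous column and one deliberately mixed column
toy = pd.DataFrame({
    'clean': [1, 2, 3],
    'mixed': [1, 'two', 3.0],
})

mixed = mixed_type_columns(toy)
print(mixed)
```

An empty list corresponds to the notebook's check printing nothing.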

#### 0.5.2 order_id cleaning/wrangling

In [17]:
# Exploring the data for this column
df['order_id'].describe()

count    3.243449e+07
mean     1.710749e+06
std      9.873007e+05
min      2.000000e+00
25%      8.559430e+05
50%      1.711048e+06
75%      2.565514e+06
max      3.421083e+06
Name: order_id, dtype: float64

In [18]:
# Checking the uniqueness of order_id numbers
df.order_id.nunique()

3214874

In [19]:
df['order_id'].value_counts(dropna=False).sort_index()

2           9
3           8
4          13
5          26
6           3
           ..
3421079     1
3421080     9
3421081     7
3421082     7
3421083    10
Name: order_id, Length: 3214874, dtype: int64

Values all seem reasonable. There are roughly 3.2 million distinct orders (3,214,874), with order IDs running up to 3,421,083. The number of products varies by order: for example, order 2 contains 9 products, while order 3,421,079 contains only 1.

#### 0.5.3 product_id cleaning/wrangling

In [20]:
# Exploring the data for this column
df['product_id'].describe()

count    3.243449e+07
mean     2.557634e+04
std      1.409669e+04
min      1.000000e+00
25%      1.353000e+04
50%      2.525600e+04
75%      3.793500e+04
max      4.968800e+04
Name: product_id, dtype: float64

In [21]:
df['product_id'].value_counts(dropna=False).sort_index()

1        1852
2          90
3         277
4         329
5          15
         ... 
49684       9
49685      49
49686     120
49687      13
49688      89
Name: product_id, Length: 49677, dtype: int64

Values seem reasonable. There are just under 50k distinct products (49,677) of varying popularity. For example, product 1 was ordered 1,852 times, whereas product 49,684 was ordered only 9 times.

#### 0.5.4 add_to_cart_order cleaning/wrangling

In [22]:
# Exploring the data for this column
df['add_to_cart_order'].describe()

count    3.243449e+07
mean     8.351076e+00
std      7.126671e+00
min      1.000000e+00
25%      3.000000e+00
50%      6.000000e+00
75%      1.100000e+01
max      1.450000e+02
Name: add_to_cart_order, dtype: float64

In [23]:
df['add_to_cart_order'].value_counts(dropna=False).sort_index()

1      3214874
2      3058126
3      2871133
4      2664106
5      2442025
        ...   
141          1
142          1
143          1
144          1
145          1
Name: add_to_cart_order, Length: 145, dtype: int64

Values seem reasonable. There are 3,214,874 unique orders, and every one of them must have a first item added to the cart, so the count for position 1 correctly equals 3,214,874.
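
The same reasoning can be checked in code rather than by eye. The sketch below uses a small invented frame (the values are assumptions) to compare the number of unique orders with the number of rows at cart position 1. It then goes one step further and checks that each order's positions run 1, 2, ..., n with no gaps or repeats.

```python
import pandas as pd

# Toy data (assumption): three orders with 3, 2 and 1 products
toy = pd.DataFrame({
    'order_id':          [2, 2, 2, 3, 3, 4],
    'add_to_cart_order': [1, 2, 3, 1, 2, 1],
})

# Every order should contribute exactly one row at cart position 1
n_orders = toy['order_id'].nunique()
n_first_items = int((toy['add_to_cart_order'] == 1).sum())

# Per order: positions are exactly 1..n when they start at 1,
# are all distinct, and the largest position equals the row count
per_order = toy.groupby('order_id')['add_to_cart_order'].agg(['count', 'min', 'max', 'nunique'])
gapless = bool((
    (per_order['min'] == 1)
    & (per_order['max'] == per_order['count'])
    & (per_order['nunique'] == per_order['count'])
).all())

print(n_orders, n_first_items, gapless)
```

On the full dataset the same expressions would be run against `df`. The `groupby` step is the expensive part with 32 million rows.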

#### 0.5.5 reordered cleaning/wrangling

In [24]:
# Exploring the data for this column
df['reordered'].describe()

count    3.243449e+07
mean     5.896975e-01
std      4.918886e-01
min      0.000000e+00
25%      0.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      1.000000e+00
Name: reordered, dtype: float64

In [25]:
df['reordered'].value_counts(dropna=False).sort_index()

0    13307953
1    19126536
Name: reordered, dtype: int64

This field behaves like a boolean flag: it only takes the values 0 and 1. It may be worth asking the client whether it can be stored as a true boolean instead of an int8.
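
If the client agrees, the conversion itself is small. Below is a hedged sketch on a made-up series (the values are assumptions). It guards against anything other than 0 and 1 before casting.

```python
import pandas as pd

# Toy stand-in (assumption) for the reordered column
flags = pd.Series([1, 0, 1, 1, 0], dtype='int8', name='reordered')

# Guard: only convert if the column contains nothing but 0 and 1
assert flags.isin([0, 1]).all()

as_bool = flags.astype(bool)

# The mean of a boolean series is still the share of True values
reorder_rate = as_bool.mean()
print(as_bool.dtype, reorder_rate)
```

Both int8 and NumPy's bool use one byte per value, so the change would make the column's meaning clearer without reducing memory use.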

### 0.6 Exporting cleaned data

In [26]:
# Confirming shape
df.shape

(32434489, 4)

In [27]:
# Exporting final df after completing consistency checks
df.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_prior_clean.pkl'))
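
Pickle keeps the downcast dtypes, whereas a CSV round trip would read the columns back as int64 by default. The sketch below shows one way to confirm that after export. It uses a toy frame and a temporary directory, both assumptions, so it runs without the project folder.

```python
import os
import tempfile

import pandas as pd

# Toy frame (assumption) with the same column names and dtypes as the cleaned df
toy = pd.DataFrame({
    'order_id': [2, 2, 3],
    'product_id': [33120, 28985, 9327],
    'add_to_cart_order': [1, 2, 1],
    'reordered': [1, 1, 0],
}).astype({
    'order_id': 'int64',
    'product_id': 'int32',
    'add_to_cart_order': 'int16',
    'reordered': 'int8',
})

# Write to a throwaway location and read the file straight back
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'toy_clean.pkl')  # hypothetical file name
    toy.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# Values and dtypes should both survive the round trip
same_values = restored.equals(toy)
same_dtypes = restored.dtypes.equals(toy.dtypes)
print(same_values, same_dtypes)
```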