# Products Prior EDA

This file records which products (`product_id`) were included in each order (`order_id`). It also records the sequence in which the products were added to the cart (`add_to_cart_order`) and whether each product is a reorder (`reordered` = 1) or not (`reordered` = 0).
For example, the first five rows shown below all belong to `order_id` 2, and three of them are reorders.
The file does not say what these products are. That information is in products.csv, so the two dataframes have to be joined.
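
The join mentioned above could look like the following minimal sketch. It runs on toy data: the order lines reuse values from the head output further down, while the `products` frame, its `product_name` column, and the names in it are placeholders, since products.csv is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for order_products_prior.csv (values from the head output)
order_products = pd.DataFrame({
    "order_id": [2, 2, 2],
    "product_id": [33120, 28985, 9327],
    "add_to_cart_order": [1, 2, 3],
    "reordered": [1, 1, 0],
})

# Hypothetical stand-in for products.csv; column name and values are assumptions
products = pd.DataFrame({
    "product_id": [33120, 28985],
    "product_name": ["product A", "product B"],
})

# A left join keeps every order line, even when no product description matches
merged = order_products.merge(products, on="product_id", how="left", indicator=True)

# Count the order lines that found no match in the products table
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged)
print("Unmatched rows:", unmatched)
```

Both frames must store `product_id` with the same data type before merging. If the key is cast to `str` in one frame, as is done later in this notebook, it has to be cast in the other frame as well.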

## 01 Setup

In [1]:
# Import libraries
import numpy as np
import pandas as pd 
import os

In [2]:
# Path
path = r'/Users/peanutcookie/instacart-book/'

In [3]:
# Import .csv file
df = pd.read_csv(os.path.join(path, '_csv-raw', 'order_products_prior.csv'), index_col = False)
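
As an alternative to casting columns after loading, `pd.read_csv` accepts a `dtype` mapping. The sketch below reads a two-row in-memory sample instead of the real file. The integer widths chosen for `add_to_cart_order` and `reordered` are assumptions that would need to be checked against the actual value ranges.

```python
import io

import pandas as pd

# Two-row sample with the same header as order_products_prior.csv
csv_text = (
    "order_id,product_id,add_to_cart_order,reordered\n"
    "2,33120,1,1\n"
    "2,28985,2,1\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={
        "order_id": "str",             # identifier, not a quantity
        "product_id": "str",           # identifier, not a quantity
        "add_to_cart_order": "int16",  # assumption: values fit in 16 bits
        "reordered": "int8",           # flag with values 0 and 1
    },
)
print(sample.dtypes)
```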

## 02 Data exploration

In [4]:
# Dataframe shape
df.shape

(32434489, 4)

In [5]:
# Column names and the data types they store
df.dtypes

order_id             int64
product_id           int64
add_to_cart_order    int64
reordered            int64
dtype: object

In [6]:
# Dataframe head
df.head(5)

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |
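
To summarise rows like these per order, one option is a `groupby` with named aggregations. This sketch uses only the five rows from the head output, so the counts describe that sample, not the complete order. The names `n_products` and `n_reordered` are chosen here for illustration.

```python
import pandas as pd

# The five rows from the head output
sample = pd.DataFrame({
    "order_id": [2, 2, 2, 2, 2],
    "product_id": [33120, 28985, 9327, 45918, 30035],
    "add_to_cart_order": [1, 2, 3, 4, 5],
    "reordered": [1, 1, 0, 1, 0],
})

# One row per order: number of order lines and number of reordered lines
order_summary = sample.groupby("order_id").agg(
    n_products=("product_id", "count"),
    n_reordered=("reordered", "sum"),
)
print(order_summary)
```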


In [7]:
# Dataframe tail
df.tail(5)

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 32434484 | 3421083 | 39678 | 6 | 1 |
| 32434485 | 3421083 | 11352 | 7 | 0 |
| 32434486 | 3421083 | 4600 | 8 | 0 |
| 32434487 | 3421083 | 24852 | 9 | 1 |
| 32434488 | 3421083 | 5020 | 10 | 1 |


In [8]:
# Explore 'reordered' column
df['reordered'].describe()

count    3.243449e+07
mean     5.896975e-01
std      4.918886e-01
min      0.000000e+00
25%      0.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      1.000000e+00
Name: reordered, dtype: float64
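
Because `reordered` only takes the values 0 and 1, its mean is the share of order lines that are reorders, so the mean of about 0.59 reported above corresponds to roughly 59% reorders. A more direct way to read that share is `value_counts(normalize=True)`, sketched here on the five `reordered` values from the head output:

```python
import pandas as pd

# The 'reordered' values of the five rows in the head output
reordered = pd.Series([1, 1, 0, 1, 0], name="reordered")

# Share of each value; the entry for 1 is the reorder rate
shares = reordered.value_counts(normalize=True)

# For a 0/1 flag the mean gives the same number
reorder_rate = reordered.mean()

print(shares)
print("Reorder rate:", reorder_rate)
```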

## 03 Dataframe Cleansing

### Mixed values

In [9]:
# Check each column for mixed data types
print("Mixed data")
for col in df.columns.tolist():
    # Map every value to its Python type and compare with the type of the first value
    value_types = df[col].map(type)
    is_mixed = (value_types != value_types.iloc[0]).any()
    print(col + ": " + str(bool(is_mixed)))

Mixed data
order_id: False
product_id: False
add_to_cart_order: False
reordered: False
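
Every column reports False here, so it may help to see what a positive result looks like. The sketch below builds a hypothetical toy frame with one clean and one deliberately mixed column. It is a variant of the check above that counts distinct Python types per column.

```python
import pandas as pd

# Hypothetical toy data: 'mixed' holds both integers and a string
toy = pd.DataFrame({
    "clean": [1, 2, 3],
    "mixed": [1, "two", 3],
})

mixed_flags = {}
for col in toy.columns:
    # A column is mixed when its values have more than one Python type
    mixed_flags[col] = bool(toy[col].map(type).nunique() > 1)

print(mixed_flags)
```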


### Data types

In [10]:
# Correct data types: cast order_id and product_id to str so they are treated as identifiers and left out of numeric summary statistics
df = df.astype({"product_id":'str', "order_id":'str'})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434489 entries, 0 to 32434488
Data columns (total 4 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   order_id           object
 1   product_id         object
 2   add_to_cart_order  int64 
 3   reordered          int64 
dtypes: int64(2), object(2)
memory usage: 989.8+ MB
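
The effect of this cast on summary statistics can be seen on a small example. In the sketch below, which uses a hypothetical three-row frame, `describe()` stops computing a mean for `order_id` once the column holds strings and reports the count, number of unique values, most frequent value, and its frequency instead.

```python
import pandas as pd

# Hypothetical three-row frame with an integer identifier column
ids = pd.DataFrame({"order_id": [2, 2, 3], "reordered": [1, 0, 1]})

# While order_id is an integer, describe() treats it like a quantity
numeric_summary = ids.describe()

# After the cast, the default describe() covers only the numeric column
ids_str = ids.astype({"order_id": "str"})
numeric_summary_after = ids_str.describe()

# Describing the string column gives count, unique, top, and freq
id_summary = ids_str["order_id"].describe()

print(numeric_summary_after)
print(id_summary)
```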


### Missing values

In [11]:
# Missing values
df.isna().sum()

order_id             0
product_id           0
add_to_cart_order    0
reordered            0
dtype: int64

### Duplicates

In [12]:
# Return duplicates
duplicated_rows = df[df.duplicated()]
duplicated_rows.shape

(0, 4)
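
`df.duplicated()` only flags rows that match in all four columns. A stricter check is whether the same product appears more than once within one order, which can be tested with the `subset` argument. The sketch below uses hypothetical toy rows; the third row is invented to produce such a collision.

```python
import pandas as pd

# Hypothetical toy rows: product 33120 appears twice in order 2
lines = pd.DataFrame({
    "order_id": ["2", "2", "2"],
    "product_id": ["33120", "28985", "33120"],
    "add_to_cart_order": [1, 2, 3],
    "reordered": [1, 1, 1],
})

# No row repeats in all four columns
full_row_dupes = int(lines.duplicated().sum())

# keep=False marks every row that shares its (order_id, product_id) pair
key_dupes = int(lines.duplicated(subset=["order_id", "product_id"], keep=False).sum())

print("Full-row duplicates:", full_row_dupes)
print("Rows with a repeated key:", key_dupes)
```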

## 04 Export check

In [13]:
df.shape

(32434489, 4)

In [14]:
df.dtypes

order_id             object
product_id           object
add_to_cart_order     int64
reordered             int64
dtype: object

## 05 Export

In [15]:
df.to_pickle(os.path.join(path, '_database', 'order_products_prior.pkl'))
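
One way to confirm the export is to read the pickle back and compare it with the frame that was written. This sketch does so on a small toy frame in a temporary directory, so it does not touch the project folders.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same columns and data types as the cleaned dataframe
frame = pd.DataFrame({
    "order_id": ["2", "2"],
    "product_id": ["33120", "28985"],
    "add_to_cart_order": [1, 2],
    "reordered": [1, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "order_products_prior.pkl")
    frame.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# equals() compares values and data types column by column
round_trip_ok = restored.equals(frame)
print("Round trip OK:", round_trip_ok)
```

Unlike a CSV export, a pickle keeps the data types, so the `str` identifiers do not need to be cast again after loading.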