# Fill That Cart!
## Introduction

Instacart is a grocery delivery platform where customers can place orders and have them delivered, similar to Uber Eats or DoorDash. The dataset provided here is a modified version of the original. Its size was reduced to speed up computations, and missing values and duplicates were intentionally introduced. Care was taken to preserve the original data distributions when making these changes.

# Data Dictionary

The dataset contains five tables. Below is a data dictionary listing each table's columns and describing the data they contain.

In `instacart_orders.csv`, each row corresponds to an order placed through the Instacart app.

- `instacart_orders.csv`
    - `'order_id'`: unique ID number identifying each order.
    - `'user_id'`: unique ID number identifying each customer account.
    - `'order_number'`: the number of times this customer has placed an order.
    - `'order_dow'`: day of the week the order was placed (0 = Sunday).
    - `'order_hour_of_day'`: hour of the day the order was placed.
    - `'days_since_prior_order'`: number of days since this customer’s previous order.
- `products.csv`
    - `'product_id'`: unique ID number identifying each product.
    - `'product_name'`: name of the product.
    - `'aisle_id'`: unique ID number identifying each grocery aisle category.
    - `'department_id'`: unique ID number identifying each grocery department.
- `order_products.csv` 
    - `'order_id'`: unique ID number identifying each order.
    - `'product_id'`: unique ID number identifying each product.
    - `'add_to_cart_order'`: sequential order in which each item was added to the cart.
    - `'reordered'`: 0 if the customer has never ordered this product before, 1 if they have.
- `aisles.csv`
    - `'aisle_id'`: unique ID number identifying each grocery aisle category.
    - `'aisle'`: name of the aisle.
- `departments.csv`
    - `'department_id'`: unique ID number identifying each grocery department.
    - `'department'`: name of the department.
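
The tables are linked through their shared ID columns (`order_id`, `product_id`, `aisle_id`, `department_id`). As a minimal sketch of how those keys could be used, the snippet below joins a few made-up rows; the IDs and names are invented for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy rows for illustration only; values are invented, not read from the dataset.
order_products = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "add_to_cart_order": [1, 2, 1],
    "reordered": [0, 1, 1],
})
products = pd.DataFrame({
    "product_id": [10, 11],
    "product_name": ["banana", "whole milk"],
    "aisle_id": [100, 101],
    "department_id": [4, 16],
})
departments = pd.DataFrame({
    "department_id": [4, 16],
    "department": ["produce", "dairy eggs"],
})

# Attach product and department names to each order line via the shared keys.
# how="left" keeps every order line even if a lookup table has no match.
order_lines = (
    order_products
    .merge(products, on="product_id", how="left")
    .merge(departments, on="department_id", how="left")
)
print(order_lines[["order_id", "product_name", "department"]])
```

A left join is used here so that order lines are never dropped silently; unmatched keys would show up as missing names instead.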

# 1. Initialization

In [None]:
# Add the parent directory to sys.path so the helper modules under src/ can be imported
import sys
import os

sys.path.append(os.path.abspath('..'))

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from src.null_columns import show_null_columns

In [None]:
# Load the datasets (the .csv files use ';' as the separator)

#df_instacart_orders = pd.read_csv('../data/raw/instacart_orders.csv', sep= ';')
#df_products = pd.read_csv('../data/raw/products.csv', sep= ';')
#df_aisles = pd.read_csv('../data/raw/aisles.csv', sep= ';')
#df_departments = pd.read_csv('../data/raw/departments.csv', sep= ';')
#df_order_products = pd.read_csv('../data/raw/order_products.csv', sep= ';')

df_instacart_orders = pd.read_csv('C:\\Users\\gudia\\Documents\\Doc TripleTen\\Proyectos Sprint\\Sprint_4\\Proyecto\\instacart_orders.csv', sep=';')
df_products = pd.read_csv('C:\\Users\\gudia\\Documents\\Doc TripleTen\\Proyectos Sprint\\Sprint_4\\Proyecto\\products.csv', sep= ';')
df_aisles = pd.read_csv('C:\\Users\\gudia\\Documents\\Doc TripleTen\\Proyectos Sprint\\Sprint_4\\Proyecto\\aisles.csv', sep= ';')
df_departments = pd.read_csv('C:\\Users\\gudia\\Documents\\Doc TripleTen\\Proyectos Sprint\\Sprint_4\\Proyecto\\departments.csv', sep= ';')
df_order_products = pd.read_csv('C:\\Users\\gudia\\Documents\\Doc TripleTen\\Proyectos Sprint\\Sprint_4\\Proyecto\\order_products.csv', sep= ';')
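
The absolute paths above only resolve on one machine. As a hedged alternative, the sketch below builds the paths from a single base directory; it assumes the `../data/raw` layout suggested by the commented-out lines, and `DATA_DIR`, `TABLES`, and `load_tables` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

import pandas as pd

# Assumed project layout: the raw CSV files live in ../data/raw relative to the notebook.
DATA_DIR = Path("..") / "data" / "raw"

# File names (without extension) of the five tables in the data dictionary.
TABLES = ["instacart_orders", "products", "aisles", "departments", "order_products"]


def load_tables(data_dir=DATA_DIR, sep=";"):
    """Read each table into a dict keyed by table name (hypothetical helper)."""
    return {
        name: pd.read_csv(Path(data_dir) / f"{name}.csv", sep=sep)
        for name in TABLES
    }
```

With this in place, `tables = load_tables()` followed by `df_products = tables["products"]` would replace the five separate `read_csv` calls, and changing the data location means editing only `DATA_DIR`.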


## 1.1 Data Preprocessing

In [None]:
# General info
df_instacart_orders.info()  # Dataset orders
print()
df_products.info()          # Dataset products
print()
df_aisles.info()            # Dataset aisles
print()
df_departments.info()       # Dataset departments
print()
df_order_products.info()    # Dataset order_products

In [None]:
# Show random rows to see the datasets
print(df_instacart_orders.sample(3))
print()
print(df_products.sample(3))
print()
print(df_aisles.sample(3))
print()
print(df_departments.sample(3))
print()
print(df_order_products.sample(3))

### Copy original Dataframes

In [None]:
# Copy the datasets so the originals stay unchanged
df_instacart_orders_clean = df_instacart_orders.copy()
df_products_clean = df_products.copy()
df_aisles_clean = df_aisles.copy()
df_departments_clean = df_departments.copy()
df_order_products_clean = df_order_products.copy()

### Null Values

In [None]:
# Review null values
show_null_columns(df_instacart_orders_clean, "Instacart Orders")
show_null_columns(df_products_clean, "Products")
show_null_columns(df_aisles_clean, "Aisles")
show_null_columns(df_departments_clean, "Departments")
show_null_columns(df_order_products_clean, "Order Products")
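
The implementation of `show_null_columns` lives in `src/null_columns.py` and is not shown in this notebook. As a rough idea of what a helper of this kind might do, here is a minimal sketch under a different, hypothetical name (`null_summary`); it is an assumption, not the project's actual code.

```python
import pandas as pd


def null_summary(df, name):
    """Print and return the columns of `df` that contain nulls (hypothetical helper)."""
    counts = df.isna().sum()
    counts = counts[counts > 0]  # keep only columns with at least one null
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })
    print(f"{name}: {len(summary)} column(s) with null values")
    print(summary)
    return summary


# Toy frame for illustration only; values are invented.
demo = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "days_since_prior_order": [7.0, None, 3.0, None],
})
result = null_summary(demo, "Demo")
```

Reporting the percentage next to the raw count makes it easier to judge whether nulls in a column are rare exceptions or a structural feature of the data.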

### Duplicated Values

In [None]:
# DataFrame: Instacart Orders
duplicados_totales = df_instacart_orders_clean.duplicated().sum()    # Count of fully duplicated rows
print(f"Number of duplicate rows: {duplicados_totales}")
print()
print(df_instacart_orders_clean[df_instacart_orders_clean.duplicated()])   # Show the duplicated rows
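
To see whether duplicated rows cluster on a particular day and hour, one option is to tabulate them by `order_dow` and `order_hour_of_day`. The sketch below uses invented toy rows, not the real orders table.

```python
import pandas as pd

# Toy data for illustration only; the real orders table has more columns and rows.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 3, 4],
    "order_dow": [0, 3, 3, 3, 3, 5],
    "order_hour_of_day": [9, 2, 2, 2, 2, 14],
})

# keep=False marks every copy of a duplicated row, not only the later repeats.
dupes = orders[orders.duplicated(keep=False)]

# Count duplicated rows per (day of week, hour) combination.
pattern = dupes.groupby(["order_dow", "order_hour_of_day"]).size()
print(pattern)
```

If all duplicates fall into a single (day, hour) group, that points to a systematic cause rather than random repetition.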

In [None]:
"""
Findings:
order_dow: every duplicated row has the value 3. Since 0 represents Sunday (per the data dictionary), these rows correspond to Wednesday.
order_hour_of_day: every duplicated row has the value 2. Hours follow the 24-hour format starting at 0 (midnight), so this means 2 a.m.
We remove these duplicated rows to preserve the dataset's integrity and avoid noise in later analysis.
"""
# Remove duplicate rows and rebuild the index.
# Both methods return a new DataFrame, so the result must be assigned back.
df_instacart_orders_clean = df_instacart_orders_clean.drop_duplicates().reset_index(drop=True)

# Verify that no duplicated order IDs remain in the cleaned data
print('number of duplicate values in column "order_id": ', df_instacart_orders_clean['order_id'].duplicated().sum())
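
By default, `drop_duplicates()` and `reset_index()` return new DataFrames instead of modifying the original, so their results take effect only when assigned. A minimal sketch on invented toy rows:

```python
import pandas as pd

# Toy data for illustration only.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "user_id": [10, 20, 20, 30],
})

# Without reset_index, the surviving rows keep their old labels (0, 1, 3).
without_reset = orders.drop_duplicates()

# Chaining both calls and assigning the result gives a clean 0..n-1 index.
deduped = orders.drop_duplicates().reset_index(drop=True)

print(without_reset.index.tolist())
print(deduped.index.tolist())
print(deduped["order_id"].duplicated().sum())
```

Checking the key column on the cleaned frame (here `deduped`) rather than on the untouched original is what confirms the cleanup worked.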