# Data Cleaning for Database Construction

This notebook is designed to clean and preprocess the dataset (`../data/DataCoSupplyChainDataset.csv`) to prepare it for database construction. The main objectives are:
- Handling missing values
- Removing duplicates and inconsistencies
- Ensuring data integrity for smooth integration into the database

By the end of this notebook, we will have:
- `orders_cleaned_db`, a cleaned dataset for the database
- `customers`, `shipping`, `orders`, `product_categories`, `products` and `order_items` dataframes ready for database storage

For more information about the database structure, read the `README.md` file in the `database` folder.

In [1]:
# Standard libraries
import pandas as pd

# Enable auto-reload for modules during development
%load_ext autoreload
%autoreload 2

# Set display options for Pandas to show all columns
pd.set_option('display.max_columns', None)

# Load custom scripts
from scripts import data_check as ch
from scripts import data_cleaning as dc

In [2]:
# Load dataset
path = "../data/DataCoSupplyChainDataset.csv"
orders = pd.read_csv(path, encoding="ISO-8859-1")

In [3]:
# First look at the first few rows of the dataset
orders.head(5)

Unnamed: 0,Type,Days for shipping (real),Days for shipment (scheduled),Benefit per order,Sales per customer,Delivery Status,Late_delivery_risk,Category Id,Category Name,Customer City,Customer Country,Customer Email,Customer Fname,Customer Id,Customer Lname,Customer Password,Customer Segment,Customer State,Customer Street,Customer Zipcode,Department Id,Department Name,Latitude,Longitude,Market,Order City,Order Country,Order Customer Id,order date (DateOrders),Order Id,Order Item Cardprod Id,Order Item Discount,Order Item Discount Rate,Order Item Id,Order Item Product Price,Order Item Profit Ratio,Order Item Quantity,Sales,Order Item Total,Order Profit Per Order,Order Region,Order State,Order Status,Order Zipcode,Product Card Id,Product Category Id,Product Description,Product Image,Product Name,Product Price,Product Status,shipping date (DateOrders),Shipping Mode
0,DEBIT,3,4,91.25,314.640015,Advance shipping,0,73,Sporting Goods,Caguas,Puerto Rico,XXXXXXXXX,Cally,20755,Holloway,XXXXXXXXX,Consumer,PR,5365 Noble Nectar Island,725.0,2,Fitness,18.251453,-66.037056,Pacific Asia,Bekasi,Indonesia,20755,1/31/2018 22:56,77202,1360,13.11,0.04,180517,327.75,0.29,1,327.75,314.640015,91.25,Southeast Asia,Java Occidental,COMPLETE,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,2/3/2018 22:56,Standard Class
1,TRANSFER,5,4,-249.089996,311.359985,Late delivery,1,73,Sporting Goods,Caguas,Puerto Rico,XXXXXXXXX,Irene,19492,Luna,XXXXXXXXX,Consumer,PR,2679 Rustic Loop,725.0,2,Fitness,18.279451,-66.037064,Pacific Asia,Bikaner,India,19492,1/13/2018 12:27,75939,1360,16.389999,0.05,179254,327.75,-0.8,1,327.75,311.359985,-249.089996,South Asia,Rajastán,PENDING,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,1/18/2018 12:27,Standard Class
2,CASH,4,4,-247.779999,309.720001,Shipping on time,0,73,Sporting Goods,San Jose,EE. UU.,XXXXXXXXX,Gillian,19491,Maldonado,XXXXXXXXX,Consumer,CA,8510 Round Bear Gate,95125.0,2,Fitness,37.292233,-121.881279,Pacific Asia,Bikaner,India,19491,1/13/2018 12:06,75938,1360,18.030001,0.06,179253,327.75,-0.8,1,327.75,309.720001,-247.779999,South Asia,Rajastán,CLOSED,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,1/17/2018 12:06,Standard Class
3,DEBIT,3,4,22.860001,304.809998,Advance shipping,0,73,Sporting Goods,Los Angeles,EE. UU.,XXXXXXXXX,Tana,19490,Tate,XXXXXXXXX,Home Office,CA,3200 Amber Bend,90027.0,2,Fitness,34.125946,-118.291016,Pacific Asia,Townsville,Australia,19490,1/13/2018 11:45,75937,1360,22.940001,0.07,179252,327.75,0.08,1,327.75,304.809998,22.860001,Oceania,Queensland,COMPLETE,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,1/16/2018 11:45,Standard Class
4,PAYMENT,2,4,134.210007,298.25,Advance shipping,0,73,Sporting Goods,Caguas,Puerto Rico,XXXXXXXXX,Orli,19489,Hendricks,XXXXXXXXX,Corporate,PR,8671 Iron Anchor Corners,725.0,2,Fitness,18.253769,-66.037048,Pacific Asia,Townsville,Australia,19489,1/13/2018 11:24,75936,1360,29.5,0.09,179251,327.75,0.45,1,327.75,298.25,134.210007,Oceania,Queensland,PENDING_PAYMENT,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,1/15/2018 11:24,Standard Class


In [4]:
# Check data types, number of missing values, duplicates, and unique values
ch.check(orders)

Number of columns: 53 and rows: 180519

Data types:
Type                              object
Days for shipping (real)           int64
Days for shipment (scheduled)      int64
Benefit per order                float64
Sales per customer               float64
Delivery Status                   object
Late_delivery_risk                 int64
Category Id                        int64
Category Name                     object
Customer City                     object
Customer Country                  object
Customer Email                    object
Customer Fname                    object
Customer Id                        int64
Customer Lname                    object
Customer Password                 object
Customer Segment                  object
Customer State                    object
Customer Street                   object
Customer Zipcode                 float64
Department Id                      int64
Department Name                   object
Latitude                         float64
Longi

In [5]:
rate_missing_data_order_zip_code = orders['Order Zipcode'].isnull().sum() / len(orders) * 100
print("The rate of missing data in the column order zip code is: ", rate_missing_data_order_zip_code)

The rate of missing data in the column order zip code is:  86.23967560201419
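The same calculation can be applied to every column at once, which makes it easier to spot other sparsely populated columns. This is a minimal sketch on toy data; `missing_rates` is a hypothetical helper name and is not part of the `scripts` package.

```python
import pandas as pd

def missing_rates(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, highest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)

# Toy data standing in for the real `orders` dataframe
toy = pd.DataFrame({
    "Order Zipcode": [None, None, None, 725.0],
    "Customer Zipcode": [725.0, 95125.0, None, 90027.0],
    "Order Id": [1, 2, 3, 4],
})

rates = missing_rates(toy)
print(rates)
```

On the real data, `missing_rates(orders)` would report all 53 columns in one pass.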


<span style="color:red;">Initial Findings in the Data:</span>

- There are 53 columns and 180,519 rows.
- There are no duplicated rows.
- `Customer Email` and `Customer Password` are masked (every value is `XXXXXXXXX`), so these columns can be dropped later for the analysis.
- The `order date (DateOrders)` and `shipping date (DateOrders)` columns need to be converted to `datetime` type.
- The `Product Description` column can be changed to object type for the database. Since all values are missing, this column can be dropped later for the analysis.
- The `Product Image` column can be dropped later for the analysis.
- The `Product Status` column has all values set to 0 (indicating available products), so it can be dropped later for the analysis.
- 8 rows are missing `Customer Lname`, but this won't affect the analysis since this column will probably not be used.
- 3 rows are missing `Customer Zipcode`. Since this column might be used to train the model and these rows are far less than 1% of the data, they will be removed.
- `Order Zipcode` is missing in 86% of the rows, so the column will be dropped later for the analysis.
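The database-oriented part of these findings (date conversion and removal of rows without a customer zip code) could look roughly like the sketch below. The body of `dc.clean_for_database` is not shown in this notebook, so `clean_for_database_sketch` is a hypothetical stand-in under that assumption, and the date format string is inferred from the sample rows.

```python
import pandas as pd

DATE_COLUMNS = ["order date (DateOrders)", "shipping date (DateOrders)"]

def clean_for_database_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for dc.clean_for_database (real implementation not shown)."""
    out = df.copy()
    # Parse strings such as "1/31/2018 22:56" into datetime64 values
    for col in DATE_COLUMNS:
        out[col] = pd.to_datetime(out[col], format="%m/%d/%Y %H:%M")
    # Drop the few rows that have no customer zip code
    out = out.dropna(subset=["Customer Zipcode"]).reset_index(drop=True)
    return out

# Toy rows standing in for the real dataset
toy = pd.DataFrame({
    "order date (DateOrders)": ["1/31/2018 22:56", "1/13/2018 12:27", "1/13/2018 12:06"],
    "shipping date (DateOrders)": ["2/3/2018 22:56", "1/18/2018 12:27", "1/17/2018 12:06"],
    "Customer Zipcode": [725.0, None, 95125.0],
})

cleaned = clean_for_database_sketch(toy)
print(cleaned.dtypes)
```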

In [6]:
# Let's clean the data based on the checks for the database
orders_cleaned_db = dc.clean_for_database(orders)

In [7]:
# Let's check these changes
orders_cleaned_db.head(5)

Unnamed: 0,Type,Days for shipping (real),Days for shipment (scheduled),Benefit per order,Sales per customer,Delivery Status,Late_delivery_risk,Category Id,Category Name,Customer City,Customer Country,Customer Email,Customer Fname,Customer Id,Customer Lname,Customer Password,Customer Segment,Customer State,Customer Street,Customer Zipcode,Department Id,Department Name,Latitude,Longitude,Market,Order City,Order Country,Order Customer Id,order date (DateOrders),Order Id,Order Item Cardprod Id,Order Item Discount,Order Item Discount Rate,Order Item Id,Order Item Product Price,Order Item Profit Ratio,Order Item Quantity,Sales,Order Item Total,Order Profit Per Order,Order Region,Order State,Order Status,Order Zipcode,Product Card Id,Product Category Id,Product Description,Product Image,Product Name,Product Price,Product Status,shipping date (DateOrders),Shipping Mode
0,DEBIT,3,4,91.25,314.640015,Advance shipping,0,73,Sporting Goods,Caguas,Puerto Rico,XXXXXXXXX,Cally,20755,Holloway,XXXXXXXXX,Consumer,PR,5365 Noble Nectar Island,725.0,2,Fitness,18.251453,-66.037056,Pacific Asia,Bekasi,Indonesia,20755,2018-01-31 22:56:00,77202,1360,13.11,0.04,180517,327.75,0.29,1,327.75,314.640015,91.25,Southeast Asia,Java Occidental,COMPLETE,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,2018-02-03 22:56:00,Standard Class
1,TRANSFER,5,4,-249.089996,311.359985,Late delivery,1,73,Sporting Goods,Caguas,Puerto Rico,XXXXXXXXX,Irene,19492,Luna,XXXXXXXXX,Consumer,PR,2679 Rustic Loop,725.0,2,Fitness,18.279451,-66.037064,Pacific Asia,Bikaner,India,19492,2018-01-13 12:27:00,75939,1360,16.389999,0.05,179254,327.75,-0.8,1,327.75,311.359985,-249.089996,South Asia,Rajastán,PENDING,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,2018-01-18 12:27:00,Standard Class
2,CASH,4,4,-247.779999,309.720001,Shipping on time,0,73,Sporting Goods,San Jose,EE. UU.,XXXXXXXXX,Gillian,19491,Maldonado,XXXXXXXXX,Consumer,CA,8510 Round Bear Gate,95125.0,2,Fitness,37.292233,-121.881279,Pacific Asia,Bikaner,India,19491,2018-01-13 12:06:00,75938,1360,18.030001,0.06,179253,327.75,-0.8,1,327.75,309.720001,-247.779999,South Asia,Rajastán,CLOSED,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,2018-01-17 12:06:00,Standard Class
3,DEBIT,3,4,22.860001,304.809998,Advance shipping,0,73,Sporting Goods,Los Angeles,EE. UU.,XXXXXXXXX,Tana,19490,Tate,XXXXXXXXX,Home Office,CA,3200 Amber Bend,90027.0,2,Fitness,34.125946,-118.291016,Pacific Asia,Townsville,Australia,19490,2018-01-13 11:45:00,75937,1360,22.940001,0.07,179252,327.75,0.08,1,327.75,304.809998,22.860001,Oceania,Queensland,COMPLETE,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,2018-01-16 11:45:00,Standard Class
4,PAYMENT,2,4,134.210007,298.25,Advance shipping,0,73,Sporting Goods,Caguas,Puerto Rico,XXXXXXXXX,Orli,19489,Hendricks,XXXXXXXXX,Corporate,PR,8671 Iron Anchor Corners,725.0,2,Fitness,18.253769,-66.037048,Pacific Asia,Townsville,Australia,19489,2018-01-13 11:24:00,75936,1360,29.5,0.09,179251,327.75,0.45,1,327.75,298.25,134.210007,Oceania,Queensland,PENDING_PAYMENT,,1360,73,,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,0,2018-01-15 11:24:00,Standard Class


In [8]:
# Re-check the dataframe after cleaning for the database
ch.check(orders_cleaned_db)

Number of columns: 53 and rows: 180516

Data types:
Type                                     object
Days for shipping (real)                  int64
Days for shipment (scheduled)             int64
Benefit per order                       float64
Sales per customer                      float64
Delivery Status                          object
Late_delivery_risk                        int64
Category Id                               int64
Category Name                            object
Customer City                            object
Customer Country                         object
Customer Email                           object
Customer Fname                           object
Customer Id                               int64
Customer Lname                           object
Customer Password                        object
Customer Segment                         object
Customer State                           object
Customer Street                          object
Customer Zipcode                    

In [9]:
# Save the initially cleaned DataFrame to a new CSV file for further cleaning in the data_cleaning notebook
orders_cleaned_db.to_csv('../data/orders_cleaned_db.csv', index=False)

<span style="color:red;">From this point on, separate dataframes are created based on the database structure.</span>

The structure is documented in the `../database` folder; `../database/README.md` explains how to create the database. The table structure is shown below:

![Database Schema](../images/tables_structure.png)  
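The cells that follow repeat the same three steps for each table: select the relevant columns, drop duplicate rows, and convert the column names to snake_case. As a minimal sketch, those steps could be wrapped in one helper. `extract_table` is a hypothetical name and is not part of the `scripts` package.

```python
import pandas as pd

def extract_table(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Select columns, drop duplicate rows, and snake_case the column names."""
    table = df[columns].drop_duplicates().reset_index(drop=True)
    table.columns = table.columns.str.replace(" ", "_").str.lower()
    return table

# Toy rows: the same category appears on several order lines
toy = pd.DataFrame({
    "Category Id": [73, 73, 17],
    "Category Name": ["Sporting Goods", "Sporting Goods", "Cleats"],
    "Order Item Id": [1, 2, 3],
})

categories = extract_table(toy, ["Category Id", "Category Name"])
print(categories)
```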

In [10]:
# Customer table
df_customers = orders_cleaned_db[['Customer Id', 'Customer Fname', 'Customer Lname', 'Customer Email', 
                   'Customer Password', 'Customer City', 'Customer State', 'Customer Zipcode', 
                   'Customer Country', 'Customer Segment', 'Customer Street']].drop_duplicates().reset_index(drop=True)

df_customers.columns = df_customers.columns.str.replace(' ', '_').str.lower()
df_customers.to_csv('../database/data/customers_db.csv', index=False)
df_customers.head(5)

Unnamed: 0,customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_city,customer_state,customer_zipcode,customer_country,customer_segment,customer_street
0,20755,Cally,Holloway,XXXXXXXXX,XXXXXXXXX,Caguas,PR,725.0,Puerto Rico,Consumer,5365 Noble Nectar Island
1,19492,Irene,Luna,XXXXXXXXX,XXXXXXXXX,Caguas,PR,725.0,Puerto Rico,Consumer,2679 Rustic Loop
2,19491,Gillian,Maldonado,XXXXXXXXX,XXXXXXXXX,San Jose,CA,95125.0,EE. UU.,Consumer,8510 Round Bear Gate
3,19490,Tana,Tate,XXXXXXXXX,XXXXXXXXX,Los Angeles,CA,90027.0,EE. UU.,Home Office,3200 Amber Bend
4,19489,Orli,Hendricks,XXXXXXXXX,XXXXXXXXX,Caguas,PR,725.0,Puerto Rico,Corporate,8671 Iron Anchor Corners


In [11]:
ch.check(df_customers)

Number of columns: 11 and rows: 20649

Data types:
customer_id            int64
customer_fname        object
customer_lname        object
customer_email        object
customer_password     object
customer_city         object
customer_state        object
customer_zipcode     float64
customer_country      object
customer_segment      object
customer_street       object
dtype: object

Unique values count:
customer_id          20649
customer_fname         782
customer_lname        1109
customer_email           1
customer_password        1
customer_city          562
customer_state          44
customer_zipcode       995
customer_country         2
customer_segment         3
customer_street       7456
dtype: int64

These columns appear to be categorical (less than 20 unique values):
Index(['customer_email', 'customer_password', 'customer_country',
       'customer_segment'],
      dtype='object')

Unique value count for categorical columns:

customer_email
XXXXXXXXX    20649
Name: count, dtype

In [12]:
# Orders table
df_orders = orders_cleaned_db[['Order Id', 'order date (DateOrders)', 'Order Status', 'Order Region', 
                'Order City', 'Order State', 'Order Country', 'Order Zipcode', 'Order Customer Id']].drop_duplicates().reset_index(drop=True)

df_orders.columns = df_orders.columns.str.replace(' ', '_').str.lower()
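# regex=False so the parentheses are matched literally on every pandas version
df_orders.columns = df_orders.columns.str.replace('order_date_(dateorders)', 'order_date', regex=False)
# Replace missing values with the literal string "NULL" (this turns order_zipcode into an object column)
df_orders.fillna("NULL", inplace=True)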
df_orders.to_csv('../database/data/orders_db.csv', index=False)
df_orders.head(5)

Unnamed: 0,order_id,order_date,order_status,order_region,order_city,order_state,order_country,order_zipcode,order_customer_id
0,77202,2018-01-31 22:56:00,COMPLETE,Southeast Asia,Bekasi,Java Occidental,Indonesia,,20755
1,75939,2018-01-13 12:27:00,PENDING,South Asia,Bikaner,Rajastán,India,,19492
2,75938,2018-01-13 12:06:00,CLOSED,South Asia,Bikaner,Rajastán,India,,19491
3,75937,2018-01-13 11:45:00,COMPLETE,Oceania,Townsville,Queensland,Australia,,19490
4,75936,2018-01-13 11:24:00,PENDING_PAYMENT,Oceania,Townsville,Queensland,Australia,,19489
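Filling missing values with the string `"NULL"` changes the dtype of numeric columns such as `order_zipcode` to `object`. If the only goal is to have `NULL` in the exported file, one alternative is the `na_rep` argument of `to_csv`, which leaves the dataframe itself untouched. The sketch below uses toy data and writes to an in-memory buffer; whether the database loader expects the literal `NULL` is an assumption.

```python
import io
import pandas as pd

# Toy stand-in for df_orders with one missing zip code
toy_orders = pd.DataFrame({
    "order_id": [77202, 75939],
    "order_zipcode": [None, 725.0],
})

# Write missing values as NULL only in the CSV output
buffer = io.StringIO()
toy_orders.to_csv(buffer, index=False, na_rep="NULL")
csv_text = buffer.getvalue()
print(csv_text)

# The in-memory column keeps its numeric dtype
print(toy_orders["order_zipcode"].dtype)
```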


In [13]:
ch.check(df_orders)

Number of columns: 9 and rows: 65749

Data types:
order_id                      int64
order_date           datetime64[ns]
order_status                 object
order_region                 object
order_city                   object
order_state                  object
order_country                object
order_zipcode                object
order_customer_id             int64
dtype: object

Unique values count:
order_id             65749
order_date           65749
order_status             9
order_region            23
order_city            3597
order_state           1089
order_country          164
order_zipcode          610
order_customer_id    20649
dtype: int64

These columns appear to be categorical (less than 20 unique values):
Index(['order_status'], dtype='object')

Unique value count for categorical columns:

order_status
COMPLETE           21714
PENDING_PAYMENT    14381
PROCESSING          7901
PENDING             7321
CLOSED              7249
ON_HOLD             3624
SUSPECTED_FRAUD

In [14]:
# Product category table
df_product_category = orders_cleaned_db[['Category Id', 'Category Name']].drop_duplicates().reset_index(drop=True)

df_product_category.columns = df_product_category.columns.str.replace(' ', '_').str.lower()
df_product_category.to_csv('../database/data/product_category_db.csv', index=False)
df_product_category.head(5)

Unnamed: 0,category_id,category_name
0,73,Sporting Goods
1,17,Cleats
2,29,Shop By Sport
3,24,Women's Apparel
4,13,Electronics


In [15]:
ch.check(df_product_category)

Number of columns: 2 and rows: 51

Data types:
category_id       int64
category_name    object
dtype: object

Unique values count:
category_id      51
category_name    50
dtype: int64

These columns appear to be categorical (less than 20 unique values):
Index([], dtype='object')

Unique value count for categorical columns:

Count of null values:
category_id      0
category_name    0
dtype: int64

Count of missing() values:
category_id      0
category_name    0
dtype: int64

Count of duplicated values:
0
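The check above reports 51 distinct `category_id` values but only 50 distinct `category_name` values, so at least one name is shared by two ids. A short sketch for locating such names follows. The toy data and its labels are made up for illustration and do not indicate which real category is affected.

```python
import pandas as pd

# Toy stand-in for df_product_category; names "A" and "B" are placeholders
toy_categories = pd.DataFrame({
    "category_id": [1, 2, 3],
    "category_name": ["A", "B", "B"],
})

# Count how many distinct ids each name maps to
ids_per_name = toy_categories.groupby("category_name")["category_id"].nunique()
shared_names = ids_per_name[ids_per_name > 1].index.tolist()

# Show the rows behind every shared name
shared_rows = toy_categories[toy_categories["category_name"].isin(shared_names)]
print(shared_rows)
```

Running the same lines on `df_product_category` would show which two ids share a name, which helps decide whether they should be merged or kept separate in the database.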


In [16]:
# Products table
df_products = orders_cleaned_db[['Product Card Id', 'Product Description', 'Product Status', 'Product Image', 
                  'Product Name', 'Product Price', 'Product Category Id']].drop_duplicates().reset_index(drop=True)

df_products.columns = df_products.columns.str.replace(' ', '_').str.lower()
df_products.to_csv('../database/data/products_db.csv', index=False)
df_products.head(5)

Unnamed: 0,product_card_id,product_description,product_status,product_image,product_name,product_price,product_category_id
0,1360,,0,http://images.acmesports.sports/Smart+watch,Smart watch,327.75,73
1,365,,0,http://images.acmesports.sports/Perfect+Fitnes...,Perfect Fitness Perfect Rip Deck,59.990002,17
2,627,,0,http://images.acmesports.sports/Under+Armour+G...,Under Armour Girls' Toddler Spine Surge Runni,39.990002,29
3,502,,0,http://images.acmesports.sports/Nike+Men%27s+D...,Nike Men's Dri-FIT Victory Golf Polo,50.0,24
4,278,,0,http://images.acmesports.sports/Under+Armour+M...,Under Armour Men's Compression EV SL Slide,44.990002,13


In [17]:
ch.check(df_products)

Number of columns: 7 and rows: 118

Data types:
product_card_id          int64
product_description     object
product_status           int64
product_image           object
product_name            object
product_price          float64
product_category_id      int64
dtype: object

Unique values count:
product_card_id        118
product_description      0
product_status           1
product_image          118
product_name           118
product_price           75
product_category_id     51
dtype: int64

These columns appear to be categorical (less than 20 unique values):
Index(['product_description', 'product_status'], dtype='object')

Unique value count for categorical columns:

Series([], Name: count, dtype: int64)

product_status
0    118
Name: count, dtype: int64

Count of null values:
product_card_id          0
product_description    118
product_status           0
product_image            0
product_name             0
product_price            0
product_category_id      0
dtype: int64

C

In [18]:
# Order items table
df_order_items = orders_cleaned_db[['Order Item Id', 'Order Item Quantity', 
                      'Order Item Product Price', 'Order Item Discount', 'Order Item Discount Rate', 'Order Item Total', 
                      'Order Item Profit Ratio', 'Sales', 'Order Profit Per Order', 'Benefit per order', 
                      'Sales per customer', 'Type', 'Order Id', 'Order Item Cardprod Id']].drop_duplicates().reset_index(drop=True)

df_order_items.columns = df_order_items.columns.str.replace(' ', '_').str.lower()
df_order_items.to_csv('../database/data/order_items_db.csv', index=False)
df_order_items.head(5)

Unnamed: 0,order_item_id,order_item_quantity,order_item_product_price,order_item_discount,order_item_discount_rate,order_item_total,order_item_profit_ratio,sales,order_profit_per_order,benefit_per_order,sales_per_customer,type,order_id,order_item_cardprod_id
0,180517,1,327.75,13.11,0.04,314.640015,0.29,327.75,91.25,91.25,314.640015,DEBIT,77202,1360
1,179254,1,327.75,16.389999,0.05,311.359985,-0.8,327.75,-249.089996,-249.089996,311.359985,TRANSFER,75939,1360
2,179253,1,327.75,18.030001,0.06,309.720001,-0.8,327.75,-247.779999,-247.779999,309.720001,CASH,75938,1360
3,179252,1,327.75,22.940001,0.07,304.809998,0.08,327.75,22.860001,22.860001,304.809998,DEBIT,75937,1360
4,179251,1,327.75,29.5,0.09,298.25,0.45,327.75,134.210007,134.210007,298.25,PAYMENT,75936,1360
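Since `order_id` and `order_item_cardprod_id` act as references to the orders and products tables, it may be worth confirming that every referenced key exists before loading the CSV files. The sketch below shows one way to do that on toy data; `orphan_keys` is a hypothetical helper, and treating these columns as foreign keys is an assumption based on the column names.

```python
import pandas as pd

def orphan_keys(child: pd.DataFrame, child_key: str,
                parent: pd.DataFrame, parent_key: str) -> list:
    """Return child key values that have no match in the parent table."""
    missing = ~child[child_key].isin(parent[parent_key])
    return sorted(child.loc[missing, child_key].unique().tolist())

# Toy tables: order 99999 is referenced but does not exist
toy_orders = pd.DataFrame({"order_id": [77202, 75939]})
toy_items = pd.DataFrame({
    "order_item_id": [1, 2, 3],
    "order_id": [77202, 75939, 99999],
})

orphans = orphan_keys(toy_items, "order_id", toy_orders, "order_id")
print(orphans)
```

With the real dataframes, the call would be `orphan_keys(df_order_items, "order_id", df_orders, "order_id")`, and an empty list would mean every order item points at an existing order.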


In [19]:
ch.check(df_order_items)

Number of columns: 14 and rows: 180516

Data types:
order_item_id                 int64
order_item_quantity           int64
order_item_product_price    float64
order_item_discount         float64
order_item_discount_rate    float64
order_item_total            float64
order_item_profit_ratio     float64
sales                       float64
order_profit_per_order      float64
benefit_per_order           float64
sales_per_customer          float64
type                         object
order_id                      int64
order_item_cardprod_id        int64
dtype: object

Unique values count:
order_item_id               180516
order_item_quantity              5
order_item_product_price        75
order_item_discount           1017
order_item_discount_rate        18
order_item_total              2927
order_item_profit_ratio        162
sales                          193
order_profit_per_order       21998
benefit_per_order            21998
sales_per_customer            2927
type                   

In [20]:
# Shipping table
df_shipping = orders_cleaned_db[['Delivery Status', 'Market', 'Shipping Mode', 
                      'Days for shipping (real)', 'Days for shipment (scheduled)', 'shipping date (DateOrders)',
                      'Late_delivery_risk', 'Department Id', 'Department Name', 'Latitude', 'Longitude', 'Order Item Id']].drop_duplicates().reset_index(drop=False)

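# Normalize column names; regex=False treats the parentheses as literal characters.
# Note: the date column ends up as 'shipping_date_dateorders' (see the output below),
# because the parentheses are stripped before any rename to 'shipping_date' could match.
df_shipping.columns = (
    df_shipping.columns.str.replace(' ', '_').str.lower()
    .str.replace('(', '', regex=False).str.replace(')', '', regex=False)
    .str.replace('index', 'delivery_id', regex=False)
)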
df_shipping.to_csv('../database/data/shipping_db.csv', index=False)
df_shipping.head(5)

Unnamed: 0,delivery_id,delivery_status,market,shipping_mode,days_for_shipping_real,days_for_shipment_scheduled,shipping_date_dateorders,late_delivery_risk,department_id,department_name,latitude,longitude,order_item_id
0,0,Advance shipping,Pacific Asia,Standard Class,3,4,2018-02-03 22:56:00,0,2,Fitness,18.251453,-66.037056,180517
1,1,Late delivery,Pacific Asia,Standard Class,5,4,2018-01-18 12:27:00,1,2,Fitness,18.279451,-66.037064,179254
2,2,Shipping on time,Pacific Asia,Standard Class,4,4,2018-01-17 12:06:00,0,2,Fitness,37.292233,-121.881279,179253
3,3,Advance shipping,Pacific Asia,Standard Class,3,4,2018-01-16 11:45:00,0,2,Fitness,34.125946,-118.291016,179252
4,4,Advance shipping,Pacific Asia,Standard Class,2,4,2018-01-15 11:24:00,0,2,Fitness,18.253769,-66.037048,179251
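The header above shows `shipping_date_dateorders` rather than `shipping_date`. The order of chained replacements matters here: once the parentheses are stripped, a later pattern that still contains `(dateorders)` has nothing to match. The sketch below demonstrates both orderings on two of the column names. Whether the database schema expects `shipping_date` is an assumption that should be checked against `../database/README.md`.

```python
import pandas as pd

cols = pd.Index(["shipping date (DateOrders)", "Days for shipping (real)"])
normalized = cols.str.replace(" ", "_").str.lower()

# Parentheses stripped first: the literal '(dateorders)' pattern no longer matches
stripped_first = (
    normalized.str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.replace("shipping_date_(dateorders)", "shipping_date", regex=False)
)

# Rename first, then strip the remaining parentheses
renamed_first = (
    normalized.str.replace("shipping_date_(dateorders)", "shipping_date", regex=False)
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
)

print(list(stripped_first))
print(list(renamed_first))
```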


In [21]:
ch.check(df_shipping)

Number of columns: 13 and rows: 180516

Data types:
delivery_id                             int64
delivery_status                        object
market                                 object
shipping_mode                          object
days_for_shipping_real                  int64
days_for_shipment_scheduled             int64
shipping_date_dateorders       datetime64[ns]
late_delivery_risk                      int64
department_id                           int64
department_name                        object
latitude                              float64
longitude                             float64
order_item_id                           int64
dtype: object

Unique values count:
delivery_id                    180516
delivery_status                     4
market                              5
shipping_mode                       4
days_for_shipping_real              7
days_for_shipment_scheduled         4
shipping_date_dateorders        63699
late_delivery_risk                  2
department

<span style="color:red;">After finishing this notebook, I recommend checking the `data_cleaning.ipynb` file.</span>