# Data Cleaning

In this step, we explore the dataset and prepare the data for exploratory data analysis and feature engineering. We'll go through all of the steps listed below:

- Column Type verification
- Casting
- Inconsistencies
- Missing Values
- Analysis of Constant and Quasi-constant columns
- Rare Categories
- Duplicate Rows
- Duplicate Columns
- Data Split

# 1) Setup

In [41]:
# Libs
import os
import warnings

import pandas as pd
import numpy as np

from dotenv import find_dotenv, load_dotenv
from utils.data.cleaning import check_dtypes, cast_columns, check_missing

In [40]:
import importlib
import utils.data.cleaning  # bind the `utils` name so reload() can resolve it
importlib.reload(utils.data.cleaning)

<module 'utils.data.cleaning' from '/Users/bruno.santos/Desktop/Estudos/case_cornershop/time2delivery/utils/data/cleaning.py'>

In [23]:
# Environment
load_dotenv(find_dotenv())
# Path variables
DATA_INPUT_PATH = os.getenv('DATA_RAW_PATH')
DATA_OUTPUT_PATH = os.getenv('DATA_PROCESSED_PATH')

# 2) Data cleaning

In [24]:
# Loading the data
df_orders = pd.read_csv(os.path.join(DATA_INPUT_PATH, 'all_orders.csv'))

In [25]:
df_orders.head()

Unnamed: 0,order_id,lat_os,lng_os,promised_time,on_demand,shopper_id,store_branch_id,total_minutes,seniority,found_rate,picking_speed,accepted_rate,rating,store_id,lat_strb,lng_strb,sum_kgs,sum_unities,n_distinct_items
0,e750294655c2c7c34d83cc3181c09de4,-33.501675,-70.579369,2019-10-18 20:48:00+00:00,True,e63bc83a1a952fa2b3cc9d558fb943cf,65ded5353c5ee48d0b7d48c591b8f430,67.684264,6c90661e6d2c7579f5ce337c3391dbb9,0.9024,1.3,0.92,4.76,c4ca4238a0b923820dcc509a6f75849b,-33.48528,-70.57925,2.756,16.0,19.0
1,6581174846221cb6c467348e87f57641,-33.440584,-70.556283,2019-10-19 01:00:00+00:00,False,195f9e9d84a4ba9033c4b6a756334d8b,45fbc6d3e05ebd93369ce542e8f2322d,57.060632,41dc7c9e385c4d2b6c1f7836973951bf,0.761,2.54,0.92,4.96,c4ca4238a0b923820dcc509a6f75849b,-33.441246,-70.53545,,11.0,5.0
2,3a226ea48debc0a7ae9950d5540f2f34,-32.987022,-71.544842,2019-10-19 14:54:00+00:00,True,a5b9ddc0d82e61582fca19ad43dbaacb,07563a3fe3bbe7e3ba84431ad9d055af,,50e13ee63f086c2fe84229348bc91b5b,0.8313,2.57,0.76,4.92,c4ca4238a0b923820dcc509a6f75849b,-33.008213,-71.545615,,18.0,5.0
3,7d2ed03fe4966083e74b12694b1669d8,-33.328075,-70.512659,2019-10-18 21:47:00+00:00,True,d0b3f6bf7e249e5ebb8d3129341773a2,f1748d6b0fd9d439f71450117eba2725,52.067742,41dc7c9e385c4d2b6c1f7836973951bf,0.8776,2.8,0.96,4.76,f718499c1c8cef6730f9fd03c8125cab,-33.355258,-70.537787,,1.0,1.0
4,b4b2682d77118155fe4716300ccf7f39,-33.403239,-70.56402,2019-10-19 20:00:00+00:00,False,5c5199ce02f7b77caa9c2590a39ad27d,1f0e3dad99908345f7439f8ffabdffc4,140.724822,50e13ee63f086c2fe84229348bc91b5b,0.7838,2.4,0.96,4.96,c4ca4238a0b923820dcc509a6f75849b,-33.386547,-70.568075,6.721,91.0,51.0


## 2.1) Column Type Verification

In [26]:
column_types = check_dtypes(df_orders)

In [27]:
column_types

{'object': ['order_id',
  'promised_time',
  'shopper_id',
  'store_branch_id',
  'seniority',
  'store_id'],
 'float64': ['lat_os',
  'lng_os',
  'total_minutes',
  'found_rate',
  'picking_speed',
  'accepted_rate',
  'rating',
  'lat_strb',
  'lng_strb',
  'sum_kgs',
  'sum_unities',
  'n_distinct_items'],
 'bool': ['on_demand']}

There are three column types: `float64`, `bool` and `object`. The `promised_time` column, however, represents a point in time, so we should probably cast it to a datetime format.
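
The implementation of `check_dtypes` lives in the project's `utils` module and is not shown here. As a rough idea of how such a grouping could be produced, here is a minimal sketch; the function name `group_columns_by_dtype` and the toy DataFrame are assumptions made for illustration only.

```python
from collections import defaultdict

import pandas as pd


def group_columns_by_dtype(df: pd.DataFrame) -> dict:
    """Map each dtype name to the columns that have it (hypothetical helper)."""
    groups = defaultdict(list)
    for column, dtype in df.dtypes.items():
        groups[str(dtype)].append(column)
    return dict(groups)


# Toy data with a few of the column names used in this notebook
toy = pd.DataFrame({
    "order_id": ["a", "b"],
    "total_minutes": [67.7, 57.1],
    "on_demand": [True, False],
})
dtype_groups = group_columns_by_dtype(toy)
print(dtype_groups)
```

Returning a plain dictionary keeps the result easy to inspect in a notebook cell and easy to reuse when choosing which columns to cast.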

## 2.2) Casting

Since we don't yet know which columns are useful, let's just cast them to the right format for now. Once we assess their predictive power and decide which ones to bring into the training phase, we'll create a Python function or transformer to cover this step in the transformation and cleaning pipeline.

In [28]:
# promised_time to datetime and on_demand to object
df_orders = cast_columns(df=df_orders, 
                         casting={'promised_time':'datetime64[ns]',
                                  'on_demand':'object'})

In [29]:
df_orders.dtypes

order_id                    object
lat_os                     float64
lng_os                     float64
promised_time       datetime64[ns]
on_demand                   object
shopper_id                  object
store_branch_id             object
total_minutes              float64
seniority                   object
found_rate                 float64
picking_speed              float64
accepted_rate              float64
rating                     float64
store_id                    object
lat_strb                   float64
lng_strb                   float64
sum_kgs                    float64
sum_unities                float64
n_distinct_items           float64
dtype: object
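
For reference, a casting helper along these lines could look like the sketch below. It is not the actual `cast_columns` from `utils.data.cleaning`; the name `cast_columns_sketch` is hypothetical, and the choice to parse timestamps as UTC and then drop the timezone is an assumption based on the `+00:00` suffix seen in `promised_time`.

```python
import pandas as pd


def cast_columns_sketch(df: pd.DataFrame, casting: dict) -> pd.DataFrame:
    """Return a copy of df with the requested columns cast (hypothetical helper)."""
    out = df.copy()
    for column, dtype in casting.items():
        if str(dtype).startswith("datetime64"):
            # Assumption: timestamps carry a UTC offset, so parse them as UTC
            # and then drop the timezone to obtain a naive datetime column.
            parsed = pd.to_datetime(out[column], utc=True)
            out[column] = parsed.dt.tz_localize(None)
        else:
            out[column] = out[column].astype(dtype)
    return out


toy = pd.DataFrame({
    "promised_time": ["2019-10-18 20:48:00+00:00", "2019-10-19 01:00:00+00:00"],
    "on_demand": [True, False],
})
casted = cast_columns_sketch(
    toy, casting={"promised_time": "datetime64[ns]", "on_demand": "object"}
)
print(casted.dtypes)
```

Working on a copy leaves the raw DataFrame untouched, which makes it easier to rerun the cell while experimenting.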

## 2.3) Inconsistencies

Checking for inconsistencies is a mandatory step in our analysis. Here, we'll check the bounds of the numerical features to see whether any weird pattern appears (negative time, for example).

In [35]:
df_orders.describe()

Unnamed: 0,lat_os,lng_os,total_minutes,found_rate,picking_speed,accepted_rate,rating,lat_strb,lng_strb,sum_kgs,sum_unities,n_distinct_items
count,10000.0,10000.0,8000.0,9800.0,10000.0,9954.0,9837.0,10000.0,10000.0,6332.0,9900.0,9978.0
mean,-33.42709,-70.668017,81.10613,0.863309,1.6868,0.916928,4.849213,-33.431499,-70.661844,2.738629,34.82303,19.893766
std,0.558675,0.400249,34.720837,0.029801,0.626378,0.097246,0.128929,0.555641,0.400569,2.736629,33.15926,16.434651
min,-36.942135,-73.14428,11.969489,0.7373,0.65,0.24,3.88,-36.904347,-73.09666,0.055,1.0,1.0
25%,-33.426861,-70.605795,55.22548,0.8463,1.26,0.88,4.8,-33.440823,-70.599,0.948,11.0,8.0
50%,-33.39811,-70.574591,74.731672,0.866,1.51,0.96,4.88,-33.386547,-70.568075,1.926,26.0,16.0
75%,-33.353783,-70.540307,100.273498,0.8836,2.0,1.0,4.96,-33.370765,-70.521372,3.602,49.0,28.0
max,-29.833517,-70.453728,304.190303,0.971,7.04,1.0,5.0,-29.901425,-70.492256,32.492,335.0,145.0


Latitude and longitude appear to be within a valid range. `found_rate` and `accepted_rate` lie between 0 and 1, and `total_minutes` is always greater than 0, so nothing looks off. One caveat: if the lat/long features turn out to be highly important for the model, we'll need a validation step to make sure the coordinates make sense (we don't want to pass ocean coordinates to the model, right?).
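
A first version of such a validation step could be a simple bounding-box check, sketched below. The bounds `LAT_RANGE` and `LNG_RANGE` are illustrative assumptions loosely based on the min/max values in the `describe()` output, and the function name is hypothetical. Note that a bounding box cannot by itself rule out ocean coordinates; that would require a land polygon or a geocoding lookup.

```python
import pandas as pd

# Illustrative bounds (assumption): slightly wider than the observed min/max
LAT_RANGE = (-37.0, -29.5)
LNG_RANGE = (-73.5, -70.0)


def flag_out_of_bounds(df, lat_col, lng_col,
                       lat_range=LAT_RANGE, lng_range=LNG_RANGE):
    """Return a boolean Series that is True where a coordinate is outside the box."""
    lat_ok = df[lat_col].between(*lat_range)
    lng_ok = df[lng_col].between(*lng_range)
    # Missing coordinates fail both checks, so they are flagged as well
    return ~(lat_ok & lng_ok)


toy = pd.DataFrame({
    "lat_os": [-33.501675, 0.0],
    "lng_os": [-70.579369, 0.0],
})
flags = flag_out_of_bounds(toy, "lat_os", "lng_os")
print(flags)
```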

## 2.4) Missing Values

In [42]:
series_missing = check_missing(df_orders)

In [43]:
series_missing

sum_kgs             0.3668
total_minutes       0.2000
found_rate          0.0200
rating              0.0163
sum_unities         0.0100
accepted_rate       0.0046
n_distinct_items    0.0022
store_branch_id     0.0000
seniority           0.0000
lat_os              0.0000
picking_speed       0.0000
shopper_id          0.0000
on_demand           0.0000
store_id            0.0000
lat_strb            0.0000
lng_strb            0.0000
promised_time       0.0000
lng_os              0.0000
order_id            0.0000
dtype: float64
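
The ratios above are the fraction of missing rows per column, sorted in descending order. The source of `check_missing` is not shown, but an equivalent result can be obtained with plain pandas, as in this sketch (the function name `missing_ratio` and the toy data are assumptions):

```python
import numpy as np
import pandas as pd


def missing_ratio(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, highest first (hypothetical helper)."""
    return df.isna().mean().sort_values(ascending=False)


toy = pd.DataFrame({
    "sum_kgs": [2.756, np.nan, np.nan, np.nan],
    "total_minutes": [67.7, 57.1, np.nan, 52.1],
    "order_id": ["a", "b", "c", "d"],
})
ratios = missing_ratio(toy)
print(ratios)
```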

- `sum_kgs`: the total weight (in kg) of the order. We interpret a missing value as 0 kg, so we'll fill missing values with 0 for this column.
- `sum_unities`: same situation as `sum_kgs`. Let's replace missing values with 0.
- `n_distinct_items`: it's odd to find missing values in this column, but they are very rare (about 0.2% of rows), so we can safely use the median.
- `rating`, `found_rate` and `accepted_rate`: all of these columns can be filled with the median, because the share of missing values is really low (2% at most).
- `total_minutes`: this column is our target variable. As the README.md in the repo suggests, we'll treat all rows with a missing value in this column as our submission set. So, let's set these rows aside from development and only use them in the prediction step.
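
The plan in the list above could be sketched as follows. The function name `split_and_impute` and the toy data are hypothetical, and computing the medians on the development rows only is a design assumption rather than something stated in the notebook.

```python
import numpy as np
import pandas as pd

TARGET = "total_minutes"
ZERO_FILL = ["sum_kgs", "sum_unities"]
MEDIAN_FILL = ["n_distinct_items", "rating", "found_rate", "accepted_rate"]


def split_and_impute(df: pd.DataFrame):
    """Split off rows without a target and fill missing feature values."""
    is_submission = df[TARGET].isna()
    dev = df.loc[~is_submission].copy()
    sub = df.loc[is_submission].copy()
    # Assumption: medians come from the development rows only, so the
    # submission rows never influence the imputation values.
    medians = dev[MEDIAN_FILL].median()
    for part in (dev, sub):
        part[ZERO_FILL] = part[ZERO_FILL].fillna(0)
        part[MEDIAN_FILL] = part[MEDIAN_FILL].fillna(medians)
    return dev, sub


toy = pd.DataFrame({
    "total_minutes": [67.7, np.nan, 52.1, 140.7],
    "sum_kgs": [2.756, np.nan, np.nan, 6.721],
    "sum_unities": [16.0, 11.0, np.nan, 91.0],
    "n_distinct_items": [19.0, 5.0, np.nan, 51.0],
    "rating": [4.76, np.nan, 4.76, 4.96],
    "found_rate": [0.90, 0.76, 0.88, np.nan],
    "accepted_rate": [0.92, 0.92, np.nan, 0.96],
})
dev, sub = split_and_impute(toy)
print(dev)
print(sub)
```

If the development set is later split into train and validation parts, a stricter variant would compute the medians on the training part only, for example inside a pipeline transformer.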

## 2.5) Constant and Quasi-constant Columns

## 2.6) Rare Categories

## 2.7) Duplicate Rows

## 2.8) Duplicate Columns