# Feature Engineering & Exploratory Data Analysis

In this step, we explore the existing features in order to create new, more informative ones. We also look for patterns that may help solve the challenge, so that they can be included in the model.

# 1) Setup

In [1]:
#Libs
import os
import warnings

import pandas as pd
import numpy as np

from dotenv import load_dotenv, find_dotenv
from haversine import haversine

In [2]:
# Env variables and data
load_dotenv(find_dotenv())
DATA_INPUT_PATH = os.getenv('DATA_PROCESSED_PATH')
DATA_TRAIN_NAME = 'train'
DATA_TEST_NAME = 'test'
# Data
df_orders_train = pd.read_parquet(os.path.join(DATA_INPUT_PATH, DATA_TRAIN_NAME))
df_orders_test = pd.read_parquet(os.path.join(DATA_INPUT_PATH, DATA_TEST_NAME))

# 2) Feature Engineering

## 2.1) Time Features

Since we have the promised time, we can use it as a proxy for the time the order was placed. This lets us extract characteristics such as the hour, the day of the week, and the month, and provide them to the model so it can learn how they relate to total minutes.

In [3]:
# Assumes 'promised_time' has a datetime dtype, so the vectorized .dt accessor is available
df_orders_train['hour'] = df_orders_train['promised_time'].dt.hour
# 'day' holds the day of the week (Monday=0, Sunday=6), not the day of the month
df_orders_train['day'] = df_orders_train['promised_time'].dt.dayofweek
df_orders_train['month'] = df_orders_train['promised_time'].dt.month
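
The same extraction can be wrapped in a small function so that it can be applied identically to the train and test sets. This is only a sketch: `add_time_features` is a hypothetical helper name, and the two timestamps in the demo frame are made up for illustration, not taken from the data.

```python
import pandas as pd


def add_time_features(df, time_col="promised_time"):
    """Return a copy of df with hour, day-of-week and month columns added."""
    out = df.copy()
    # Coerce to datetime in case the column was loaded as strings or objects
    ts = pd.to_datetime(out[time_col])
    out["hour"] = ts.dt.hour
    out["day"] = ts.dt.dayofweek  # Monday=0, Sunday=6
    out["month"] = ts.dt.month
    return out


# Illustrative timestamps only (2024-01-01 was a Monday, 2024-01-07 a Sunday)
demo = pd.DataFrame(
    {"promised_time": pd.to_datetime(["2024-01-01 09:30:00", "2024-01-07 23:15:00"])}
)
demo_features = add_time_features(demo)
```

Under these assumptions, calling `add_time_features(df_orders_train)` and `add_time_features(df_orders_test)` would give both sets the same columns.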

## 2.2) Distance

We should calculate the distance between the store and the customer to get a sense of how long the order will take to complete. To accomplish this, we use the `haversine` function from the `haversine` package imported in the setup.

In [10]:
df_orders_train['distance_km'] = df_orders_train.apply(lambda x: haversine((x['lat_os'], x['lng_os']), 
                                                                          (x['lat_strb'], x['lng_strb'])), axis=1)
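
The row-wise `apply` above calls the distance function once per order, which can be slow on large frames. As a possible alternative, the haversine formula can be written directly in NumPy so that it works on whole columns at once. This is a sketch, not the notebook's method: `haversine_km` and `EARTH_RADIUS_KM` are names introduced here, and the radius is a commonly used mean value, so results may differ slightly from the `haversine` package.

```python
import numpy as np

# Mean Earth radius in kilometres (an assumed constant for this sketch)
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    # Haversine formula
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Two illustrative pairs: identical points, and one degree of latitude apart
demo_dist = haversine_km(
    np.array([0.0, 0.0]), np.array([0.0, 0.0]),
    np.array([0.0, 1.0]), np.array([0.0, 0.0]),
)
```

With the columns used in this notebook, the call would look like `haversine_km(df_orders_train['lat_os'], df_orders_train['lng_os'], df_orders_train['lat_strb'], df_orders_train['lng_strb'])`.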

In [14]:
df_orders_train.columns

Index(['order_id', 'lat_os', 'lng_os', 'promised_time', 'on_demand',
       'shopper_id', 'store_branch_id', 'total_minutes', 'seniority',
       'found_rate', 'picking_speed', 'accepted_rate', 'rating', 'store_id',
       'lat_strb', 'lng_strb', 'sum_kgs', 'sum_unities', 'n_distinct_items',
       'hour', 'day', 'month', 'distance_km'],
      dtype='object')

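Before we proceed, we drop features that are not useful for modeling: the identifiers, the latitude/longitude pairs (now summarized by the distance), and the promised time (now summarized by the time features).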


In [15]:
df_orders_train.drop(['order_id', 'lat_os', 'lng_os', 'promised_time', 
                      'shopper_id', 'store_branch_id', 'lat_strb', 'lng_strb', 'store_id'], axis=1, inplace=True)
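
If the same cleanup is later needed for the test set, one option is to keep the column list in a single place and drop without `inplace`, so the original frame stays untouched. This is a minimal sketch: `COLUMNS_TO_DROP` and `drop_unused_columns` are hypothetical names, and the demo frame contains made-up values.

```python
import pandas as pd

# Same columns as dropped above: identifiers, coordinates and the raw timestamp
COLUMNS_TO_DROP = [
    "order_id", "lat_os", "lng_os", "promised_time",
    "shopper_id", "store_branch_id", "lat_strb", "lng_strb", "store_id",
]


def drop_unused_columns(df, columns=COLUMNS_TO_DROP):
    """Return a new frame without the listed columns; absent columns are skipped."""
    present = [c for c in columns if c in df.columns]
    return df.drop(columns=present)


# Tiny illustrative frame with made-up values
demo = pd.DataFrame(
    {"order_id": ["a"], "lat_os": [0.0], "total_minutes": [60.0], "distance_km": [1.2]}
)
demo_clean = drop_unused_columns(demo)
```

Skipping absent columns means the helper can be run on a frame that lacks some of them, for example a test set without every identifier, without raising a `KeyError`.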

# 3) Exploratory Data Analysis

Let's state some hypotheses and try to check them using statistics and visualization.