## Table of Contents: <a class="anchor" id="table-of-contents"></a>

* [Section 1: Install Packages](#Install-Packages)
* [Section 2: Load Data](#load-data)
* [Section 3: Summary of Data](#summary-of-data)
    * [Section 3.1: Basic Details of Order Data](#section_3_1)
    * [Section 3.2: Basic Details of Labeled Data](#section_3_2)

## Install Packages <a class="anchor" id="Install-Packages"></a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [40]:
# function definitions

def num_unique_values(df):
    """
    Return the number of unique values in each column of a dataframe.
    Missing values (NaN) count as one unique value.
    :param df: dataframe
    :return: dataframe with the column names and the number of unique values in each column
    """
    col_names = []
    values = []
    for col in df.columns:
        col_names.append(col)
        values.append(len(df[col].unique()))
    return pd.DataFrame({'col_names': col_names, 'num_unique_values': values})
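
As a quick check of the helper above, the sketch below runs it on a small toy frame (invented for illustration, not taken from the challenge data) and compares the result with pandas' built-in `DataFrame.nunique`. The helper counts NaN as a value because `Series.unique()` keeps it, so the built-in only matches when `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

def num_unique_values(df):
    """Return one row per column with its number of distinct values (NaN included)."""
    return pd.DataFrame({
        'col_names': list(df.columns),
        'num_unique_values': [len(df[col].unique()) for col in df.columns],
    })

# toy frame, invented for illustration only
toy = pd.DataFrame({
    'customer_id': ['a', 'a', 'b', 'c'],
    'customer_order_rank': [1.0, 2.0, np.nan, 1.0],
})

summary = num_unique_values(toy)
print(summary)

# built-in alternative: dropna=False keeps NaN as its own value, like .unique()
builtin = toy.nunique(dropna=False)
print(builtin)

# the default (dropna=True) ignores NaN, so the rank column reports one value fewer
default_builtin = toy.nunique()
print(default_builtin)
```

If NaN should not count as a category, `df.nunique()` alone may be the simpler choice.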

## Load Data <a class="anchor" id="load-data"></a>

In [10]:
# loading data
df_order = pd.read_csv('data/machine_learning_challenge_order_data.csv',index_col=False) # loading order data
df_labeled = pd.read_csv('data/machine_learning_challenge_labeled_data.csv',index_col=False) # loading labeled data
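
The calls above leave type inference to pandas, so `order_date` is likely read as plain text. A minimal sketch of a stricter load follows, assuming it is useful to parse `order_date` as a datetime and to force `customer_id` to stay a string. The in-memory CSV is a stand-in built from three rows of the `head()` output shown later in this notebook; with the real file, the same keyword arguments would be passed alongside the path.

```python
import io
import pandas as pd

# stand-in for the order file: three rows and five columns from the head() output
sample_csv = io.StringIO(
    "customer_id,order_date,order_hour,customer_order_rank,is_failed\n"
    "000097eabfd9,2015-06-20,19,1.0,0\n"
    "0000e2c6d9be,2016-01-29,20,1.0,0\n"
    "000133bb597f,2017-02-26,19,1.0,0\n"
)

df_sample = pd.read_csv(
    sample_csv,
    index_col=False,
    dtype={'customer_id': str},   # keep ids as text, including leading zeros
    parse_dates=['order_date'],   # parse the date column while reading
)
print(df_sample.dtypes)
```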

## Summary of Data <a class="anchor" id="summary-of-data"></a>

### Basic Details of Order Data <a class="anchor" id="section_3_1"></a>

In [14]:
# get a glimpse of the order and labeled data
print(df_order.head())
print(df_order.columns)
df_labeled.head()

    customer_id  order_date  order_hour  customer_order_rank  is_failed  \
0  000097eabfd9  2015-06-20          19                  1.0          0   
1  0000e2c6d9be  2016-01-29          20                  1.0          0   
2  000133bb597f  2017-02-26          19                  1.0          0   
3  00018269939b  2017-02-05          17                  1.0          0   
4  0001a00468a6  2015-08-04          19                  1.0          0   

   voucher_amount  delivery_fee  amount_paid  restaurant_id  city_id  \
0             0.0         0.000     11.46960        5803498    20326   
1             0.0         0.000      9.55800      239303498    76547   
2             0.0         0.493      5.93658      206463498    33833   
3             0.0         0.493      9.82350       36613498    99315   
4             0.0         0.493      5.15070      225853498    16456   

   payment_id  platform_id  transmission_id  
0        1779        30231             4356  
1        1619        303

|   | customer_id | is_returning_customer |
|---|--------------|-----------------------|
| 0 | 000097eabfd9 | 0 |
| 1 | 0000e2c6d9be | 0 |
| 2 | 000133bb597f | 1 |
| 3 | 00018269939b | 0 |
| 4 | 0001a00468a6 | 0 |


In [24]:
# split columns into three categories (numerical, id and date) for future use
numerical_cols = ['order_hour', 'customer_order_rank','is_failed','voucher_amount', 'delivery_fee', 'amount_paid']
id_cols = ['customer_id','restaurant_id', 'city_id', 'payment_id', 'platform_id','transmission_id']
date_cols = ['order_date']
# summary of numerical columns of order data
df_order.loc[:,numerical_cols].describe()

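|       | order_hour | customer_order_rank | is_failed | voucher_amount | delivery_fee | amount_paid |
|-------|------------|---------------------|-----------|----------------|--------------|-------------|
| count | 786600.0 | 761833.0 | 786600.0 | 786600.0 | 786600.0 | 786600.0 |
| mean  | 17.588796 | 9.43681 | 0.031486 | 0.091489 | 0.18118 | 10.183271 |
| std   | 3.357192 | 17.772322 | 0.174628 | 0.479558 | 0.36971 | 5.618121 |
| min   | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25%   | 16.0 | 1.0 | 0.0 | 0.0 | 0.0 | 6.64812 |
| 50%   | 18.0 | 3.0 | 0.0 | 0.0 | 0.0 | 9.027 |
| 75%   | 20.0 | 10.0 | 0.0 | 0.0 | 0.0 | 12.213 |
| max   | 23.0 | 369.0 | 1.0 | 93.3989 | 9.86 | 1131.03 |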
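
In the summary above, `customer_order_rank` has a count of 761,833 while every other column has 786,600, so 24,767 orders have no rank. That number is close to the number of failed orders implied by the mean of `is_failed` (0.031486 × 786,600 ≈ 24,767), which suggests, but does not prove, that failed orders are the ones without a rank. The sketch below shows one way to count missing values per column and to check that link. It uses a toy frame with invented values, not the real `df_order`.

```python
import numpy as np
import pandas as pd

# toy stand-in for df_order; values are invented for illustration
toy_order = pd.DataFrame({
    'order_hour': [19, 20, 19, 17],
    'customer_order_rank': [1.0, np.nan, 2.0, np.nan],
    'is_failed': [0, 1, 0, 1],
})

# number and share of missing values per column
missing = toy_order.isna().sum()
missing_share = toy_order.isna().mean()
print(pd.DataFrame({'n_missing': missing, 'share_missing': missing_share}))

# missing ranks split by order status (0 = successful, 1 = failed)
missing_by_failed = (
    toy_order['customer_order_rank'].isna().groupby(toy_order['is_failed']).sum()
)
print(missing_by_failed)

# gap implied by the describe() counts quoted above
rank_gap = 786600 - 761833
print(rank_gap)
```

On the real data, the same `groupby` would show whether all missing ranks fall under `is_failed == 1`.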


In [43]:
# Number of unique ids in order data
num_unique_values(df_order[id_cols])

|   | col_names | num_unique_values |
|---|-----------------|--------|
| 0 | customer_id | 245455 |
| 1 | restaurant_id | 13569 |
| 2 | city_id | 3749 |
| 3 | payment_id | 5 |
| 4 | platform_id | 14 |
| 5 | transmission_id | 10 |
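
With 786,600 order rows and 245,455 distinct customers, the average customer has roughly 3.2 orders. The unique counts alone say nothing about how those orders are spread, so a possible next step is to look at the distribution of orders per customer. The sketch below does this with `groupby(...).size()` on a toy frame with invented ids; it is an illustration, not output from the real data.

```python
import pandas as pd

# toy stand-in for df_order; ids are invented for illustration
toy_order = pd.DataFrame({
    'customer_id': ['a', 'a', 'a', 'b', 'c', 'c'],
    'restaurant_id': [10, 10, 11, 10, 12, 12],
})

# one entry per customer: how many order rows they have
orders_per_customer = toy_order.groupby('customer_id').size()
print(orders_per_customer.describe())

# average orders per customer, the same ratio as rows / unique customers
avg_orders = len(toy_order) / toy_order['customer_id'].nunique()
print(avg_orders)
```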


### Basic Details of Labeled Data <a class="anchor" id="section_3_2"></a>
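
A minimal sketch of the basic checks that fit this section, assuming the labeled data has one row per customer: the row count, whether `customer_id` is unique, and the share of each `is_returning_customer` class. It runs on the five labeled rows printed earlier in this notebook, so the numbers it prints describe only those rows, not the full dataset. The name `df_labeled_head` is hypothetical.

```python
import pandas as pd

# the five labeled rows shown earlier in this notebook; not the full dataset
df_labeled_head = pd.DataFrame({
    'customer_id': ['000097eabfd9', '0000e2c6d9be', '000133bb597f',
                    '00018269939b', '0001a00468a6'],
    'is_returning_customer': [0, 0, 1, 0, 0],
})

n_rows = len(df_labeled_head)
ids_are_unique = df_labeled_head['customer_id'].is_unique
# share of each class (0 = not returning, 1 = returning)
class_share = df_labeled_head['is_returning_customer'].value_counts(normalize=True)

print(n_rows, ids_are_unique)
print(class_share)
```

Applied to the full `df_labeled`, the class shares would show how imbalanced the target is.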
