<a href="https://colab.research.google.com/github/ashutosh3060/friday-burger-mojito/blob/master/data_transformation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Table of Contents

0. Libraries
1. User-Defined Functions
2. Import Data and Basic Understanding
3. Data Quality Checks
4. Feature Engineering
5. Save Final Dataset with new features

## 0. Libraries 

In [1]:
# warnings
import warnings
warnings.filterwarnings("ignore")

# Numpy, Pandas
import numpy as np
import pandas as pd

# Display settings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', 15)

## 1. User-Defined Functions

In [None]:
def null_perc_check(df):
    '''
    Calculates the missing value count and percentage for every column in a dataframe

    Inputs
    -------
    df : dataframe
        The dataframe whose missing value distribution needs to be checked

    Output
    -------
    dataframe
        a dataframe showing the missing value count and percentage for every column,
        sorted by count in descending order
    '''
    # Count of nulls per column, indexed by column name
    missing_value_df = df.isnull().sum().to_frame('Missing_Value_Count')
    # Share of nulls per column as a percentage, rounded to one decimal place
    missing_value_df['Missing_Value_Percentage'] = np.round(df.isnull().mean() * 100, 1)
    sorted_df = missing_value_df.sort_values('Missing_Value_Count', ascending=False)
    return sorted_df
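
To illustrate what the helper returns, here is a minimal sketch on a made-up dataframe. The `toy` frame and its columns are hypothetical, and the function is restated in compact form only so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def null_perc_check(df):
    # Compact restatement of the helper above so this snippet is self-contained
    out = df.isnull().sum().to_frame("Missing_Value_Count")
    out["Missing_Value_Percentage"] = np.round(df.isnull().mean() * 100, 1)
    return out.sort_values("Missing_Value_Count", ascending=False)


# Hypothetical toy data: "b" has 2 of 4 values missing, "c" has 1, "a" has none
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, 1.0, np.nan, 2.0],
    "c": ["x", None, "y", "z"],
})

report = null_perc_check(toy)
print(report)
```

The column with the most missing values appears first, which makes it easy to spot the columns that need attention.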

## 2. Import Data and Basic Understanding

In [3]:
# Define the path, filename

order_path = "/content/"
order_file = "machine_learning_challenge_order_data.csv"
label_path = "/content/"
label_file = "machine_learning_challenge_labeled_data.csv"

In [4]:
# import the dataset as a dataframe

df_order = pd.read_csv(order_path+order_file)
df_label = pd.read_csv(label_path+label_file)

In [7]:
# df_order: shape and first few records

print(df_order.shape)
df_order.head(3)

(786600, 13)


|   | customer_id | order_date | order_hour | customer_order_rank | is_failed | voucher_amount | delivery_fee | amount_paid | restaurant_id | city_id | payment_id | platform_id | transmission_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000097eabfd9 | 2015-06-20 | 19 | 1.0 | 0 | 0.0 | 0.0 | 11.4696 | 5803498 | 20326 | 1779 | 30231 | 4356 |
| 1 | 0000e2c6d9be | 2016-01-29 | 20 | 1.0 | 0 | 0.0 | 0.0 | 9.558 | 239303498 | 76547 | 1619 | 30359 | 4356 |
| 2 | 000133bb597f | 2017-02-26 | 19 | 1.0 | 0 | 0.0 | 0.493 | 5.93658 | 206463498 | 33833 | 1619 | 30359 | 4324 |


In [8]:
# df_label: shape and first few records

print(df_label.shape)
df_label.head(3)

(245455, 2)


|   | customer_id | is_returning_customer |
|---|---|---|
| 0 | 000097eabfd9 | 0 |
| 1 | 0000e2c6d9be | 0 |
| 2 | 000133bb597f | 1 |


In [9]:
# Number of unique customers in both the datasets

print(f'df_order: {df_order.customer_id.nunique()}')
print(f'df_label: {df_label.customer_id.nunique()}')

df_order: 245455
df_label: 245455


* As mentioned in the problem document, a single customer can have multiple records (historical data): df_order has 786,600 rows but only 245,455 unique customers.

* The order and label datasets contain the same number of unique customers.

* Before merging the two datasets, the intersection of their unique customer_id values needs to be checked, since equal counts alone do not guarantee that the IDs match.
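
As a lightweight illustration of these points, the sketch below compares the customer IDs of two frames with plain Python sets and counts the rows per customer. The `orders` and `labels` frames are hypothetical stand-ins for df_order and df_label, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for df_order and df_label
orders = pd.DataFrame({"customer_id": ["a", "a", "b", "c", "c", "c"]})
labels = pd.DataFrame({"customer_id": ["a", "b", "c"]})

order_ids = set(orders["customer_id"])
label_ids = set(labels["customer_id"])

# IDs present in both frames, and IDs present on one side only
common = order_ids & label_ids
only_in_orders = order_ids - label_ids
only_in_labels = label_ids - order_ids

print(f"common: {len(common)}")
print(f"only in orders: {len(only_in_orders)}")
print(f"only in labels: {len(only_in_labels)}")

# Rows per customer: values above 1 indicate historical (repeat) records
orders_per_customer = orders["customer_id"].value_counts()
print(orders_per_customer)
```

If both one-sided sets are empty, the two frames cover exactly the same customers and an inner merge will not drop anyone.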

In [10]:
# Intersection of unique customers in both the datasets

cust_intersect = pd.merge(df_order["customer_id"], df_label["customer_id"], how='inner', on=['customer_id'])
print(f'Number of common unique customers in both the datasets: {cust_intersect["customer_id"].nunique()}')

Number of common unique customers in both the datasets: 245455


* This confirms that both datasets contain exactly the same set of unique customer_id values.

* So, I will derive the features from the order data by grouping and aggregating per customer (plus other transformations), and then merge in the target variable from the label dataset.

* For now, df_order serves as the primary dataset for extracting the independent features.
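
The group, aggregate, then merge plan described above could look roughly like the following sketch. The toy frames and the feature names (`total_orders`, `avg_amount_paid`, `failed_orders`) are assumptions chosen for illustration; they are not the features this notebook ends up building.

```python
import pandas as pd

# Hypothetical stand-ins for df_order and df_label
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "amount_paid": [10.0, 12.0, 5.0, 7.0, 8.0, 9.0],
    "is_failed": [0, 1, 0, 0, 0, 0],
})
labels = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "is_returning_customer": [1, 0, 1],
})

# One row per customer, built with named aggregations (illustrative features only)
features = (
    orders.groupby("customer_id")
    .agg(
        total_orders=("amount_paid", "size"),
        avg_amount_paid=("amount_paid", "mean"),
        failed_orders=("is_failed", "sum"),
    )
    .reset_index()
)

# Attach the target; validate raises if either side has duplicate customer_id values
model_df = features.merge(labels, on="customer_id", how="left", validate="one_to_one")
print(model_df)
```

Aggregating first keeps the merge one-to-one, so the target is attached exactly once per customer rather than repeated on every order row.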

In [11]:
# Data types overview of the order dataset

df = df_order.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 786600 entries, 0 to 786599
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   customer_id          786600 non-null  object 
 1   order_date           786600 non-null  object 
 2   order_hour           786600 non-null  int64  
 3   customer_order_rank  761833 non-null  float64
 4   is_failed            786600 non-null  int64  
 5   voucher_amount       786600 non-null  float64
 6   delivery_fee         786600 non-null  float64
 7   amount_paid          786600 non-null  float64
 8   restaurant_id        786600 non-null  int64  
 9   city_id              786600 non-null  int64  
 10  payment_id           786600 non-null  int64  
 11  platform_id          786600 non-null  int64  
 12  transmission_id      786600 non-null  int64  
dtypes: float64(4), int64(7), object(2)
memory usage: 78.0+ MB


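* Only **customer_order_rank** has missing values (761,833 non-null out of 786,600 rows, i.e. 24,767 missing).
* order_date is stored as object (string), so it must be cast to a datetime type before any date operations.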
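
A minimal sketch of the cast mentioned above, assuming the dates follow the `YYYY-MM-DD` pattern seen in the first few records. The `toy` frame is hypothetical, and the derived `order_year` column is only there to show that date accessors work after the cast.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like df_order
toy = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "order_date": ["2015-06-20", "2016-01-29", "2017-02-26"],
    "customer_order_rank": [1.0, np.nan, 1.0],
})

# Cast the string column to datetime64; the explicit format is an assumption
toy["order_date"] = pd.to_datetime(toy["order_date"], format="%Y-%m-%d")
print(toy.dtypes)

# Date accessors become available once the column is a datetime type
toy["order_year"] = toy["order_date"].dt.year

# Count of rows where customer_order_rank is missing
n_missing_rank = int(toy["customer_order_rank"].isnull().sum())
print(n_missing_rank)
```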