Initial Data Analysis Exploration

In [1]:
import pandas as pd
import numpy

In [2]:
# Import all training data and vendor, order, and customer data
train_full = pd.read_csv('../data/train_full.csv', low_memory=False)
train_customers = pd.read_csv('../data/train_customers.csv')
train_locations = pd.read_csv('../data/train_locations.csv')
orders = pd.read_csv('../data/orders.csv')
vendors = pd.read_csv('../data/vendors.csv')
sample_submission = pd.read_csv('../data/SampleSubmission.csv')

# List of all datasets and names
data_collection = [train_full, train_customers, train_locations, orders, vendors, sample_submission]
data_names = ['train_full', 'train_customers', 'train_locations', 'orders', 'vendors', 'sample_submission']

# Display shape of each dataset
print("Data shapes:")
for name, data in zip(data_names, data_collection):
    print(name)
    print(data.shape)
    
# Display column names for each data set
print("Data columns:")
for name, data in zip(data_names, data_collection):
    print(name)
    print(data.columns)


Data shapes:
train_full
(5802400, 73)
train_customers
(34674, 8)
train_locations
(59503, 5)
orders
(135303, 26)
vendors
(100, 59)
sample_submission
(1672000, 2)
Data columns:
train_full
Index(['customer_id', 'gender', 'status_x', 'verified_x', 'created_at_x',
       'updated_at_x', 'location_number', 'location_type', 'latitude_x',
       'longitude_x', 'id', 'authentication_id', 'latitude_y', 'longitude_y',
       'vendor_category_en', 'vendor_category_id', 'delivery_charge',
       'serving_distance', 'is_open', 'OpeningTime', 'OpeningTime2',
       'prepration_time', 'commission', 'is_akeed_delivering',
       'discount_percentage', 'status_y', 'verified_y', 'rank', 'language',
       'vendor_rating', 'sunday_from_time1', 'sunday_to_time1',
       'sunday_from_time2', 'sunday_to_time2', 'monday_from_time1',
       'monday_to_time1', 'monday_from_time2', 'monday_to_time2',
       'tuesday_from_time1', 'tuesday_to_time1', 'tuesday_from_time2',
       'tuesday_to_time2', 'wednesday_from

Notes:
- train_full is a combination of SampleSubmission, vendors, and orders, with the outcome recorded as 1. Since we want to recommend a restaurant to a person based on previous order history, we may be able to ignore many of the features presented. The one column unique to this dataset is target, so it seems we need to classify yes/no (1/0) from the rest of the information given here.
- train_customers is mostly useless: the majority of its information is either missing or nearly identical across rows. Some analysis may still be possible using gender (22k) or dob (3k).
- train_locations is potentially useful. It lists the locations each customer has, although the majority have only one location and the locations are unlabeled, so we would need to make assumptions.
- orders will definitely be helpful since it is the basis of our project.
- vendors will also be helpful for understanding the type of restaurant and for filtering recommendations by location and time of day.
- After further analysis, it is clear that sample_submission is the provided file containing a customer, an order location (home, work, other), and a vendor. Given this file, we need to decide whether the customer would order from that vendor or not (1/0).
- This raises the question of whether we would then only be predicting if someone would order from a given vendor, when the goal is to recommend a number of vendors.
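
One possible way to reconcile the yes/no target with recommending several vendors, sketched here under assumptions: score every (customer, location, vendor) row with a classifier, then keep the highest-scoring vendors for each customer location. The `score` and `vendor_id` columns, the `top_k_vendors` helper, and the toy values below are hypothetical and do not come from the competition files.

```python
import pandas as pd


def top_k_vendors(scored: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    """Keep the k highest-scoring vendors for each customer location."""
    # Sort once by score so that head(k) picks the best rows in each group
    ranked = scored.sort_values("score", ascending=False)
    top = ranked.groupby(["customer_id", "location_number"], sort=False).head(k)
    # Re-sort for a readable, deterministic layout
    return top.sort_values(
        ["customer_id", "location_number", "score"],
        ascending=[True, True, False],
    ).reset_index(drop=True)


# Toy scores (made up); in practice these would be predicted probabilities
scored = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B", "B"],
    "location_number": [0, 0, 0, 0, 0],
    "vendor_id": [4, 13, 20, 4, 13],
    "score": [0.10, 0.70, 0.40, 0.05, 0.30],
})

recommendations = top_k_vendors(scored, k=2)
print(recommendations)
```

With this framing the binary target is still what gets modeled; the recommendation list is just a ranking of the predicted scores.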

In [3]:
# For each dataset, we check whether each column has only one unique value. If so, we print the column name, add it to the list, and drop the column in place.

drop_columns = []  

for name, data in zip(data_names, data_collection):
    print(f'Dataset: {name}')
    for column in data.columns:
        # Check if the column only has one unique value
        if data[column].nunique() == 1:
            drop_columns.append(column)
            print(column)
            data.drop(column, axis=1, inplace=True)
            
                
# Now we drop all rows with missing values and check again whether any column has only one unique value. If so, we print the column name, add it to the list, and drop it.
for name, data in zip(data_names, data_collection):
    print(f'Dataset: {name}')
    temp_data = data.dropna()
    for column in temp_data.columns:
        # Check if the column only has one unique value
        if temp_data[column].nunique() == 1:
            drop_columns.append(column)
            print(column)
            data.drop(column, axis=1, inplace=True)
        

Dataset: train_full
commission
is_akeed_delivering
language
open_close_flags
one_click_vendor
country_id
city_id
display_orders
Dataset: train_customers
language
Dataset: train_locations
Dataset: orders
Dataset: vendors
commission
is_akeed_delivering
language
open_close_flags
one_click_vendor
country_id
city_id
display_orders
Dataset: sample_submission
target
Dataset: train_full
is_open
status_y
verified_y
device_type
Dataset: train_customers
Dataset: train_locations
Dataset: orders
Dataset: vendors
is_open
status
verified
device_type
Dataset: sample_submission


Full datasets: 
train_full: 
- commission
- is_akeed_delivering
- language
- open_close_flags
- one_click_vendor
- country_id
- city_id
- display_orders
train_customers:
- language
vendors: 
- commission
- is_akeed_delivering
- language
- open_close_flags
- one_click_vendor
- country_id
- city_id
- display_orders
sample_submission:
- target

Dropped missing values (new columns only, not already listed above)
train_full:
- is_open
- status_y
- verified_y
- device_type
vendors:
- is_open
- status
- verified
- device_type

Notes:
- A total of 8 columns contain only one unique entry in both train_full and vendors. train_customers contains one of these as well (language), and sample_submission has a constant target column.
- After dropping rows with missing entries, another 4 columns in each of train_full and vendors have only one unique entry.
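
The two checks above can also be written without modifying the datasets, which makes it easier to review the candidates before dropping anything. This is a minimal sketch, not the notebook's own code: the `constant_columns` helper and the toy frame are hypothetical. It relies on `nunique()` skipping missing values by default.

```python
import numpy as np
import pandas as pd


def constant_columns(df: pd.DataFrame, complete_rows_only: bool = False) -> list:
    """Return the names of columns with exactly one distinct non-missing value."""
    # Optionally keep only the rows that have no missing value in any column
    subset = df.dropna() if complete_rows_only else df
    counts = subset.nunique()  # NaN is not counted as a value
    return counts[counts == 1].index.tolist()


# Toy frame (made up) standing in for one of the datasets
toy = pd.DataFrame({
    "language": ["EN", "EN", "EN"],
    "status": [1.0, 0.0, 1.0],
    "rating": [4.5, np.nan, 4.1],
})

always_constant = constant_columns(toy)
constant_when_complete = constant_columns(toy, complete_rows_only=True)
print(always_constant, constant_when_complete)
```

Columns that only become constant after `dropna()` are worth a second look before dropping, since they still vary in the rows that were discarded.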

In [4]:
# Analysis of the columns that have only one unique value
drop_columns = list(set(drop_columns))
print(drop_columns)
print(len(drop_columns))

# Note: the columns were already dropped in place by the loops in the previous cell,
# so drop_columns is kept here only as a record of what was removed.

['is_akeed_delivering', 'display_orders', 'verified_y', 'device_type', 'status_y', 'verified', 'status', 'target', 'city_id', 'one_click_vendor', 'is_open', 'country_id', 'language', 'commission', 'open_close_flags']
15


In [5]:
# Filtering to further understand dataset
filtered_full = train_full[train_full['target'] == 1]

print("Filtered data shapes:")
print(filtered_full.shape)
print(filtered_full.head())

Filtered data shapes:
(78254, 61)
    customer_id gender  status_x  verified_x         created_at_x  \
56      TCHWPBT   Male         1           1  2018-02-07 19:16:23   
227     TCHWPBT   Male         1           1  2018-02-07 19:16:23   
362     ZGFSYCZ   Male         1           1  2018-02-09 12:04:42   
370     ZGFSYCZ   Male         1           1  2018-02-09 12:04:42   
404     ZGFSYCZ   Male         1           1  2018-02-09 12:04:42   

            updated_at_x  location_number location_type  latitude_x  \
56   2018-02-07 19:16:23                0          Work    -96.4400   
227  2018-02-07 19:16:23                2           NaN     -0.1287   
362  2018-02-09 12:04:41                0          Home     -0.1755   
370  2018-02-09 12:04:41                0          Home     -0.1755   
404  2018-02-09 12:04:41                1          Home      0.1912   

     longitude_x  ...  saturday_to_time2             primary_tags  \
56        -67.20  ...           23:45:00               