Initial Data Analysis Exploration

In [37]:
import pandas as pd
import numpy

In [38]:
# Import all training data and vendor, order, and customer data
train_full = pd.read_csv('../data/train_full.csv')
train_customers = pd.read_csv('../data/train_customers.csv')
train_locations = pd.read_csv('../data/train_locations.csv')
orders = pd.read_csv('../data/orders.csv')
vendors = pd.read_csv('../data/vendors.csv')
sample_submission = pd.read_csv('../data/SampleSubmission.csv')

# List of all datasets and names
data_collection = [train_full, train_customers, train_locations, orders, vendors, sample_submission]
data_names = ['train_full', 'train_customers', 'train_locations', 'orders', 'vendors', 'sample_submission']

# Display shape of each dataset
print("Data shapes:")
for name, data in zip(data_names, data_collection):
    print(name)
    print(data.shape)
    
# Display column names for each data set
print("Data columns:")
for name, data in zip(data_names, data_collection):
    print(name)
    print(data.columns)



Data shapes:
train_full
(5802400, 73)
train_customers
(34674, 8)
train_locations
(59503, 5)
orders
(135303, 26)
vendors
(100, 59)
sample_submission
(1672000, 2)
Data columns:
train_full
Index(['customer_id', 'gender', 'status_x', 'verified_x', 'created_at_x',
       'updated_at_x', 'location_number', 'location_type', 'latitude_x',
       'longitude_x', 'id', 'authentication_id', 'latitude_y', 'longitude_y',
       'vendor_category_en', 'vendor_category_id', 'delivery_charge',
       'serving_distance', 'is_open', 'OpeningTime', 'OpeningTime2',
       'prepration_time', 'commission', 'is_akeed_delivering',
       'discount_percentage', 'status_y', 'verified_y', 'rank', 'language',
       'vendor_rating', 'sunday_from_time1', 'sunday_to_time1',
       'sunday_from_time2', 'sunday_to_time2', 'monday_from_time1',
       'monday_to_time1', 'monday_from_time2', 'monday_to_time2',
       'tuesday_from_time1', 'tuesday_to_time1', 'tuesday_from_time2',
       'tuesday_to_time2', 'wednesday_from

Notes:
- train_full combines the vendor and order data, with a target of 1 marking an order. Since we want to recommend a restaurant to a person based on their order history, we may be able to ignore many of the features it contains. The one column unique to this dataset is target, so we assume the task is to classify yes/no (1/0) from the rest of the information given here.
- train_customers is mostly useless. Most of its information is either missing or nearly identical across customers. Some analysis may be possible with gender (22k) or dob (3k).
- train_locations may be useful. It lists the locations each customer has, although most customers have only one location and the locations are unlabeled, so we would need to make assumptions.
- orders will definitely be helpful since it is the basis of our project.
- vendors will also be helpful for understanding the type of restaurant and for filtering recommendations by location and time of day.
- After further analysis, it is clear that sample_submission is the provided file containing a customer, an order location (home, work, other), and a vendor. Given this file, we need to decide whether the customer would order from that vendor or not (1/0).
- This raises the question of whether we would then only be predicting if someone would order from a given vendor, when our goal is to recommend a number of vendors.
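
To make the structure of sample_submission concrete, here is a minimal sketch that splits its combined key into a customer, a location number, and a vendor. It assumes the key column is named 'CID X LOC_NUM X VENDOR' (the label used later in these notes) and that the parts are joined by ' X '. The separator and the sample rows are assumptions, not values taken from the file.

```python
import pandas as pd

# Toy rows standing in for sample_submission (values are invented).
sample = pd.DataFrame({
    "CID X LOC_NUM X VENDOR": ["ABC123 X 0 X 105", "ABC123 X 1 X 44"],
    "target": [0, 0],
})

# Assumed separator: " X " between customer id, location number, and vendor id.
parts = sample["CID X LOC_NUM X VENDOR"].str.split(" X ", expand=True)
parts.columns = ["customer_id", "location_number", "vendor_id"]
parts["location_number"] = parts["location_number"].astype(int)
parts["vendor_id"] = parts["vendor_id"].astype(int)

print(parts)
```

If the split holds on the real file, each row becomes one (customer, location, vendor) pair to score, and a recommendation list could be built by ranking the vendors scored for the same customer and location.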

In [39]:
# For each dataset, we check if the number of unique values is one for each column. If so, we print the column name and add to list.
drop_columns = []  

for name, data in zip(data_names, data_collection):
    print(f'Dataset: {name}')
    for column in data.columns:
        # Check if the column only has one unique value
        if data[column].nunique() == 1:
            drop_columns.append(column)
            print(column)
            data.drop(column, axis=1, inplace=True)
            
                
# Now we check if we drop all missing values, if the number of unique values is one for each column. If so, we print the column name.
for name, data in zip(data_names, data_collection):
    print(f'Dataset: {name}')
    temp_data = data.dropna()
    for column in temp_data.columns:
        # Check if the column only has one unique value
        if temp_data[column].nunique() == 1:
            drop_columns.append(column)
            print(column)
            data.drop(column, axis=1, inplace=True)
        

Dataset: train_full
commission
is_akeed_delivering
language
open_close_flags
one_click_vendor
country_id
city_id
display_orders
Dataset: train_customers
language
Dataset: train_locations
Dataset: orders
Dataset: vendors
commission
is_akeed_delivering
language
open_close_flags
one_click_vendor
country_id
city_id
display_orders
Dataset: sample_submission
target
Dataset: train_full
is_open
status_y
verified_y
device_type
Dataset: train_customers
Dataset: train_locations
Dataset: orders
Dataset: vendors
is_open
status
verified
device_type
Dataset: sample_submission


Full datasets: 
train_full: 
- commission
- is_akeed_delivering
- language
- open_close_flags
- one_click_vendor
- country_id
- city_id
- display_orders
train_customers:
- language
vendors: 
- commission
- is_akeed_delivering
- language
- open_close_flags
- one_click_vendor
- country_id
- city_id
- display_orders

Dropped missing values (unique from previous)
train_full:
- is_open
- status_y
- verified_y
- device_type
vendors:
- is_open
- status
- verified
- device_type

Notes:
- A total of 8 columns in train_full and vendors contain only one unique entry. train_customers contains one of these (language) as well.
- After dropping rows with missing entries, another 4 columns in train_full and vendors have only one unique entry.
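
The two passes above can also be written as a helper that only reports constant columns and leaves the drop to the caller. This is a minimal sketch on an invented toy frame. The function name `constant_columns` and its `ignore_missing` flag are hypothetical, not part of the original notebook.

```python
import pandas as pd

def constant_columns(df, ignore_missing=False):
    """Return names of columns with exactly one distinct value.

    With ignore_missing=True, rows containing any NaN are dropped first,
    mirroring the second pass over each dataset.
    """
    view = df.dropna() if ignore_missing else df
    return [col for col in view.columns if view[col].nunique() == 1]

# Toy frame (invented values) to show the two passes.
toy = pd.DataFrame({
    "language": ["EN", "EN", "EN"],
    "is_open": [1.0, 0.0, 1.0],
    "rating": [4.1, None, 4.5],
})

first_pass = constant_columns(toy)
second_pass = constant_columns(toy, ignore_missing=True)
print(first_pass)
print(second_pass)
```

Because the helper does not modify the frame, the same list can be reviewed first and then passed to `drop(columns=...)`, which keeps the reporting and the mutation separate.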

In [40]:
# Analysis of the columns that have only one unique value
drop_columns = list(set(drop_columns))
print(drop_columns)
print(len(drop_columns))

# Target Counts
target_counts = train_full['target'].value_counts()
print("Counts for target:")
print(target_counts)

['target', 'status_y', 'status', 'language', 'country_id', 'is_open', 'is_akeed_delivering', 'verified', 'verified_y', 'device_type', 'display_orders', 'one_click_vendor', 'city_id', 'commission', 'open_close_flags']
15
Counts for target:
target
0    5724146
1      78254
Name: count, dtype: int64


In [41]:
# Filtering to further understand dataset
filtered_full = train_full[train_full['target'] == 1]

print("Filtered data shapes:")
print(filtered_full.shape)
print(filtered_full.head())
print(filtered_full.columns)

Filtered data shapes:
(78254, 61)
    customer_id gender  status_x  verified_x         created_at_x  \
56      TCHWPBT   Male         1           1  2018-02-07 19:16:23   
227     TCHWPBT   Male         1           1  2018-02-07 19:16:23   
362     ZGFSYCZ   Male         1           1  2018-02-09 12:04:42   
370     ZGFSYCZ   Male         1           1  2018-02-09 12:04:42   
404     ZGFSYCZ   Male         1           1  2018-02-09 12:04:42   

            updated_at_x  location_number location_type  latitude_x  \
56   2018-02-07 19:16:23                0          Work    -96.4400   
227  2018-02-07 19:16:23                2           NaN     -0.1287   
362  2018-02-09 12:04:41                0          Home     -0.1755   
370  2018-02-09 12:04:41                0          Home     -0.1755   
404  2018-02-09 12:04:41                1          Home      0.1912   

     longitude_x  ...  saturday_to_time2             primary_tags  \
56        -67.20  ...           23:45:00               

Below is further analysis of features in train_full

In [42]:
# Going back to look at the train_full dataset
print(f'Before dropping missing values: {train_full.shape}')

# Look at how many targets are 1 and 0
target_counts = train_full['target'].value_counts()
print(f'Counts for target: \n {target_counts}')

# Check how many columns contain missing values in train_full
print("Columns with missing values:")
missing_values = train_full.isnull().sum()
if missing_values.any():
    print(missing_values.to_string())
    
print(train_full.head(5))

Before dropping missing values: (5802400, 61)
Counts for target: 
 target
0    5724146
1      78254
Name: count, dtype: int64
Columns with missing values:
customer_id                     0
gender                    1705100
status_x                        0
verified_x                      0
created_at_x                    0
updated_at_x                    0
location_number                 0
location_type             2654200
latitude_x                    600
longitude_x                   600
id                              0
authentication_id               0
latitude_y                      0
longitude_y                     0
vendor_category_en              0
vendor_category_id              0
delivery_charge                 0
serving_distance                0
OpeningTime                522216
OpeningTime2               522216
prepration_time                 0
discount_percentage             0
rank                            0
vendor_rating                   0
sunday_from_time1           5

We have a total of 5.8 million training samples, but only 78k of them have target = 1. We will need to be careful when removing rows, since we want to retain as many rows with target = 1, our "recommended" flag, as possible.
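
To put the imbalance in numbers, the short sketch below recomputes the positive rate from the target counts printed above. It uses only those two counts; nothing else is assumed.

```python
# Target counts reported above for train_full.
negatives = 5_724_146
positives = 78_254
total = negatives + positives

positive_rate = positives / total
neg_per_pos = negatives / positives

print(f"total rows: {total}")
print(f"positive rate: {positive_rate:.4%}")
print(f"negatives per positive: {neg_per_pos:.1f}")
```

With roughly one positive for every 73 negatives, plain accuracy would be a misleading metric, and any row filter should be checked for how many positives it removes.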

We know we want to remove a lot of columns, since much of the information is unnecessary for our purposes.
- Gender: This column could be useful for gender analysis, but for ordering food we can remove it.
- Location Type: This is the string label for the customer's location. Most customers have one location (most likely home), so we can ignore this.
- Latitude_x and longitude_x: The customer's location. We may be able to drop the rows with missing values. These columns matter when considering whether a vendor will deliver to a customer's location.
- Opening Time/Hours of Operation: These columns would be beneficial if we were considering time of day (TOD) when ordering. For our purposes we can safely drop them. (monday_from_time1, monday_to_time1, wednesday_from_time1, and wednesday_to_time1 have no missing values, but we will still remove them.)
- Primary Tag: This column is inconsistent with the values found in vendor_tag, so we will drop it.
- Vendor Tag/Vendor Tag Name: Three vendors lack vendor tags, but these vendors may come up at test time, so we will need to work around the missing labels.

_Note: we have already removed columns with only one unique entry_
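
A compact way to support decisions like these is to look at the share of missing values per column rather than the raw counts. The sketch below does that on an invented toy frame. The helper name `missing_summary` is hypothetical.

```python
import pandas as pd

def missing_summary(df):
    """Count and share of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame (invented values) with two incomplete columns.
toy = pd.DataFrame({
    "gender": ["Male", None, None, "Female"],
    "latitude_x": [0.1, 0.2, None, 0.4],
    "delivery_charge": [0.7, 0.0, 0.7, 0.7],
})

summary = missing_summary(toy)
print(summary)
```

A column with a large missing share is a candidate for dropping outright, while a column with a tiny share (such as latitude_x here) is usually better handled by dropping the affected rows.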
    

In [43]:
# Create list of columns with missing values
columns_with_missing_values = missing_values[missing_values > 0].index.tolist()
print(columns_with_missing_values)

# Modified printout from above to include the monday and wednesday time1 values and exclude latitude and longitude.
cols_to_remove =['gender', 'location_type', 'OpeningTime', 'OpeningTime2', 'sunday_from_time1', 
                 'sunday_to_time1', 'sunday_from_time2', 'sunday_to_time2', 'monday_from_time1',
                 'monday_to_time1','monday_from_time2', 'monday_to_time2', 'tuesday_from_time1', 
                 'tuesday_to_time1', 'tuesday_from_time2', 'tuesday_to_time2', 'wednesday_from_time1',
                 'wednesday_to_time1','wednesday_from_time2', 'wednesday_to_time2', 'thursday_from_time1', 
                 'thursday_to_time1', 'thursday_from_time2', 'thursday_to_time2', 'friday_from_time1', 
                 'friday_to_time1', 'friday_from_time2', 'friday_to_time2', 'saturday_from_time1', 
                 'saturday_to_time1', 'saturday_from_time2', 'saturday_to_time2', 'primary_tags']

# Let's look at the dataset after dropping all columns above
temp_train_full = train_full.drop(columns=cols_to_remove)
print(f'After dropping columns with missing values: {temp_train_full.shape}')
target_counts = temp_train_full['target'].value_counts()
print("Counts for target:")
print(target_counts)
print(temp_train_full.columns)
print(temp_train_full.head(5))

['gender', 'location_type', 'latitude_x', 'longitude_x', 'OpeningTime', 'OpeningTime2', 'sunday_from_time1', 'sunday_to_time1', 'sunday_from_time2', 'sunday_to_time2', 'monday_from_time2', 'monday_to_time2', 'tuesday_from_time1', 'tuesday_to_time1', 'tuesday_from_time2', 'tuesday_to_time2', 'wednesday_from_time2', 'wednesday_to_time2', 'thursday_from_time1', 'thursday_to_time1', 'thursday_from_time2', 'thursday_to_time2', 'friday_from_time1', 'friday_to_time1', 'friday_from_time2', 'friday_to_time2', 'saturday_from_time1', 'saturday_to_time1', 'saturday_from_time2', 'saturday_to_time2', 'primary_tags', 'vendor_tag', 'vendor_tag_name']
After dropping columns with missing values: (5802400, 28)
Counts for target:
target
0    5724146
1      78254
Name: count, dtype: int64
Index(['customer_id', 'status_x', 'verified_x', 'created_at_x', 'updated_at_x',
       'location_number', 'latitude_x', 'longitude_x', 'id',
       'authentication_id', 'latitude_y', 'longitude_y', 'vendor_category_en',
 

Notes from looking at CSV and intuition:
- Customer ID: Keep 
- Status/Verified Columns: Unsure
- Created At/Updated At Columns: Drop, since many of these values do not change across entries.
- Location Number: Identical to Location obj, representing customer location. Rename and keep one. 
- Latitude/Longitude X/Y: Keep
- ID: Identical to ID obj, representing vendor ID. Rename and keep one. 
- Authentication ID: Unsure
- Vendor Category EN/#: Drop EN, keep # and transform to binary.
- Delivery Charge: Keep
- Serving Distance: Keep
- Preparation Time: Keep
- Discount Percentage: Keep
- Rank: Drop, only two values 1 and 11. 
- Vendor Rating: Keep
- Vendor Tag/Names: Keep and modify
- Location Number obj: Drop, reason above ^
- ID obj: Drop, reason above ^
- CID X LOC_NUM X VENDOR: Keep, but this may be removed for training. 
- target: Keep!
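
As an illustration of the "transform to binary" and "keep and modify" items above, the sketch below encodes a category id as a 0/1 flag and expands a tag string into one indicator column per tag. It assumes that vendor_category_id takes exactly two values and that vendor_tag is a comma-separated string of tag ids. Both are assumptions, and the values are invented.

```python
import pandas as pd

# Toy vendor rows (invented values); the second vendor has no tags.
toy = pd.DataFrame({
    "vendor_category_id": [2.0, 3.0, 2.0],
    "vendor_tag": ["1,5,30", None, "5"],
})

# Binary flag; assumes only two category ids exist, with 3.0 picked
# arbitrarily as the "1" class for this illustration.
toy["vendor_category_binary"] = (toy["vendor_category_id"] == 3.0).astype(int)

# One indicator column per tag id; vendors without tags get all zeros.
tag_flags = toy["vendor_tag"].fillna("").str.get_dummies(sep=",").add_prefix("tag_")
encoded = pd.concat([toy.drop(columns=["vendor_tag"]), tag_flags], axis=1)

print(encoded)
```

Filling missing tags with an empty string gives those vendors a row of zeros, which is one way to work around the missing labels without dropping the vendors.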

In [44]:
# Drop columns after further analysis
drop_cols = ['status_x', 'verified_x', 'created_at_x', 'updated_at_x',
       'authentication_id','vendor_category_en',
       'rank', 'created_at_y', 'updated_at_y']

temp_train_full.drop(columns=drop_cols, inplace=True)

# Drop the duplicated columns (location_number_obj duplicates location_number, id_obj duplicates id)
temp_train_full.drop(columns=['location_number_obj', 'id_obj'], inplace=True)

In [45]:
# Now let's look back at the entries with missing longitude and latitude values
missing_lat_long = temp_train_full[temp_train_full['latitude_x'].isnull() | temp_train_full['longitude_x'].isnull()]
print(f'Missing latitude and longitude values: {missing_lat_long.shape}')

# Get the list of customers with missing latitude and longitude values
missing_lat_long_customers = missing_lat_long['customer_id'].unique()
print(f'Unique customers with missing latitude and longitude values: {len(missing_lat_long_customers)}')
print(missing_lat_long_customers)

Missing latitude and longitude values: (600, 17)
Unique customers with missing latitude and longitude values: 6
['QFWLNUK' 'NSQRO1H' 'VZIK43C' 'O0LALCF' '7URX8JP' '55MCNEF']


Of the 34k+ customers, 6 lack latitude and longitude. Let's see whether they have other rows without missing values.

In [46]:
# Let's see if the customers with missing values are still present otherwise in train_full
missing_ll_customers = ['QFWLNUK', 'NSQRO1H', 'VZIK43C', 'O0LALCF', '7URX8JP', '55MCNEF']

def print_customer_rows(df, id_column, label):
    """Print the shape of each listed customer's rows in df, or a notice if absent."""
    print(f'Frequency in {label}')
    for customer in missing_ll_customers:
        print(f'Customer: {customer}')
        rows = df[df[id_column] == customer]
        if rows.shape[0] == 0:
            print('Customer not found')
        else:
            print(rows.shape)

# Check train_full, then train_customers and train_locations
print_customer_rows(temp_train_full, 'customer_id', 'train_full')
print_customer_rows(train_customers, 'akeed_customer_id', 'train_customers')
print_customer_rows(train_locations, 'customer_id', 'train_locations')

Frequency in train_full
Customer: QFWLNUK
(500, 17)
Customer: NSQRO1H
(400, 17)
Customer: VZIK43C
(400, 17)
Customer: O0LALCF
(100, 17)
Customer: 7URX8JP
(300, 17)
Customer: 55MCNEF
(200, 17)
Frequency in train_customers
Customer: QFWLNUK
(1, 7)
Customer: NSQRO1H
(1, 7)
Customer: VZIK43C
(1, 7)
Customer: O0LALCF
(1, 7)
Customer: 7URX8JP
(1, 7)
Customer: 55MCNEF
(1, 7)
Frequency in train_locations
Customer: QFWLNUK
(5, 5)
Customer: NSQRO1H
(4, 5)
Customer: VZIK43C
(4, 5)
Customer: O0LALCF
(1, 5)
Customer: 7URX8JP
(3, 5)
Customer: 55MCNEF
(2, 5)


We know that 600 rows are missing latitude_x and longitude_x, and the 6 affected customers have a total of 1,900 rows in train_full, so we can safely remove the rows with missing entries.
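
Before dropping rows like these, it can help to tabulate per customer how many rows would remain and how many positive targets would be lost. The sketch below does this on an invented toy frame that reuses the column names from train_full. The data and the `report` layout are illustrative assumptions.

```python
import pandas as pd

# Toy rows (invented values): customer C has only rows without coordinates.
toy = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B", "B", "C"],
    "latitude_x":  [0.1, None, 0.3, None, 0.5, None],
    "longitude_x": [1.1, None, 1.3, None, 1.5, None],
    "target":      [0, 1, 0, 0, 0, 1],
})

# Rows that a dropna on the two coordinate columns would remove.
missing = toy["latitude_x"].isnull() | toy["longitude_x"].isnull()

report = pd.DataFrame({
    "rows_total": toy.groupby("customer_id").size(),
    "rows_missing": missing.groupby(toy["customer_id"]).sum(),
    "positives_lost": (missing & (toy["target"] == 1)).groupby(toy["customer_id"]).sum(),
})
report["rows_kept"] = report["rows_total"] - report["rows_missing"]

print(report)
```

A customer whose rows_kept is zero disappears from the data entirely, which is worth knowing before the rows are removed.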

In [47]:
# Original target outputs
print("Counts for target:")
target_counts = temp_train_full['target'].value_counts()
print(target_counts)

# After removing rows with missing latitude and longitude values
temp_train_full.dropna(subset=['latitude_x', 'longitude_x'], inplace=True)
print(f'After removing rows with missing latitude and longitude values: {temp_train_full.shape}')
target_counts = temp_train_full['target'].value_counts()
print("Counts for target:")
print(target_counts)

# Look again at the customers that had missing latitude and longitude values
for customer in missing_ll_customers:
    print(f'Customer: {customer}')
    if temp_train_full[temp_train_full['customer_id'] == customer].shape[0] == 0:
        print('Customer not found')
    else:
        print(temp_train_full[temp_train_full['customer_id'] == customer].shape)

Counts for target:
target
0    5724146
1      78254
Name: count, dtype: int64
After removing rows with missing latitude and longitude values: (5801800, 17)
Counts for target:
target
0    5723551
1      78249
Name: count, dtype: int64
Customer: QFWLNUK
(400, 17)
Customer: NSQRO1H
(300, 17)
Customer: VZIK43C
(300, 17)
Customer: O0LALCF
Customer not found
Customer: 7URX8JP
(200, 17)
Customer: 55MCNEF
(100, 17)


Unfortunately, we lose 5 cases where one of these customers chose a vendor. Hopefully the remaining information is enough to give them a recommendation.

Customer O0LALCF was removed from train_full completely. We will apply the same preprocessing to test_full so that the two datasets stay consistent.

The final step is to keep only rows where location_number = 0 and then remove the 'location_number' column.

In [None]:
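# Keep only rows where location_number == 0
temp_train_full = temp_train_full[temp_train_full['location_number'] == 0]

# Drop location_number column (reassign rather than modify the filtered frame in place)
temp_train_full = temp_train_full.drop(columns=['location_number'])

# Check target counts and shape after filtering
print("Counts for target:")
print(temp_train_full['target'].value_counts())
print(temp_train_full.shape)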

# Save preprocessed data to csv
temp_train_full.to_csv('../data/preprocessed/train_full.csv', index=False)

Counts for target:
target
0    3411069
1      40931
Name: count, dtype: int64
(3452000, 16)
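
Since the same steps are meant to be applied to test_full, one option is to bundle them into a single function. The sketch below is a minimal version on an invented toy frame. The function name `preprocess` is hypothetical, and the order of steps (drop rows without coordinates, keep location_number == 0, then drop columns) follows the cells above.

```python
import pandas as pd

def preprocess(df, columns_to_drop):
    """Apply the row filters and column drops described above to a merged frame."""
    out = df.dropna(subset=["latitude_x", "longitude_x"])
    out = out[out["location_number"] == 0]
    # errors="ignore" skips any listed column that is absent from this frame.
    return out.drop(columns=columns_to_drop, errors="ignore").reset_index(drop=True)

# Toy frame (invented values) reusing a few train_full column names.
toy = pd.DataFrame({
    "customer_id": ["A", "A", "B"],
    "location_number": [0, 1, 0],
    "latitude_x": [0.1, 0.2, None],
    "longitude_x": [1.1, 1.2, None],
    "gender": ["Male", "Male", None],
    "target": [1, 0, 0],
})

clean = preprocess(toy, columns_to_drop=["gender", "location_number", "rank"])
print(clean)
```

Calling the same function on both datasets keeps their columns aligned, and `errors="ignore"` avoids a failure if one frame lacks a column that the other has.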


In [50]:
drop_columns = ["is_open", "status_y", "device_type", 'verified_y',"commission", "is_akeed_delivering", "language", "open_close_flags", "one_click_vendor", "country_id", "city_id", "display_orders"]

# Lists of dropped columns: drop_columns, cols_to_remove, drop_cols
columns_to_drop = drop_columns + cols_to_remove + drop_cols
columns_to_drop.append('location_number')
print(len(drop_columns))
print(len(cols_to_remove))
print(len(drop_cols))
print(len(columns_to_drop))
print(columns_to_drop)

12
33
9
55
['is_open', 'status_y', 'device_type', 'verified_y', 'commission', 'is_akeed_delivering', 'language', 'open_close_flags', 'one_click_vendor', 'country_id', 'city_id', 'display_orders', 'gender', 'location_type', 'OpeningTime', 'OpeningTime2', 'sunday_from_time1', 'sunday_to_time1', 'sunday_from_time2', 'sunday_to_time2', 'monday_from_time1', 'monday_to_time1', 'monday_from_time2', 'monday_to_time2', 'tuesday_from_time1', 'tuesday_to_time1', 'tuesday_from_time2', 'tuesday_to_time2', 'wednesday_from_time1', 'wednesday_to_time1', 'wednesday_from_time2', 'wednesday_to_time2', 'thursday_from_time1', 'thursday_to_time1', 'thursday_from_time2', 'thursday_to_time2', 'friday_from_time1', 'friday_to_time1', 'friday_from_time2', 'friday_to_time2', 'saturday_from_time1', 'saturday_to_time1', 'saturday_from_time2', 'saturday_to_time2', 'primary_tags', 'status_x', 'verified_x', 'created_at_x', 'updated_at_x', 'authentication_id', 'vendor_category_en', 'rank', 'created_at_y', 'updated_at_y