# DOTE6756BB-Machine Learning Individual Project

### Dataset Description

| Category | Features |
| --- | --- |
| **Target variable (classification)** | `action_type` - Action chosen by the courier (`PICKUP` or `DELIVERY`) |
| **Target variable (regression)** | `expected_use_time` - Expected time of the courier's next action |
| **Demographic** | `courier_id`, `wave_index`, `tracking_id`, `date`, `group`, `id` |
| **Geographic Information** | `courier_wave_start_lng`, `courier_wave_start_lat`, `target_lng`, `target_lat` |
| **Courier Information** | `level`, `speed`, `max_load` |
| **Courier's Previous Action Information** | `source_type`, `source_tracking_id`, `source_lng`, `source_lat` |
| **Others** | - `weather_grade` - Weather condition <br> - `aoi_id` - Area of Interest (i.e. delivery destination) <br> - `shop_id` - Shop ID <br> - `grid_distance` - Shortest Travel Distance (provided by GPS) <br> - `hour` - The hour in the day <br> - `urgency` - How urgent the order is |

### Part 1: Data Loading and Exploration

#### Import libraries

In [None]:
# Data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

#### Load dataset

In [2]:
df_train = pd.read_csv('../Data/dataframe_train.csv')
df_test = pd.read_csv('../Data/dataframe_test.csv')
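Identifier columns such as `tracking_id` hold values around 2.1e18, which is beyond the range (2^53) in which `float64` can represent every integer exactly. If the CSV files store these IDs as full integers (an assumption, not something confirmed here), reading them as strings keeps them distinct. The sketch below uses an in-memory CSV with made-up values rather than the project files:

```python
import io
import pandas as pd

# Made-up two-row CSV standing in for the training file (illustrative values only).
csv_text = (
    "courier_id,tracking_id,action_type\n"
    "10007871,2100074555462100001,PICKUP\n"
    "10007871,2100074555462100002,DELIVERY\n"
)

# Read identifier columns as strings so no numeric conversion is applied to them.
df_demo = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"courier_id": str, "tracking_id": str},
)

first_id, second_id = df_demo["tracking_id"].tolist()

# The IDs stay distinct as strings...
print(first_id != second_id)
# ...but collapse to the same value once converted to float64.
print(float(first_id) == float(second_id))
```

Whether this matters depends on how the IDs are used; if they only serve as grouping keys, string columns are usually sufficient.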

In [3]:
df_train.head()

| | courier_id | wave_index | tracking_id | courier_wave_start_lng | courier_wave_start_lat | action_type | date | group | level | speed | ... | source_type | source_tracking_id | source_lng | source_lat | target_lng | target_lat | grid_distance | expected_use_time | urgency | hour |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10007871 | 0 | 2.10007e+18 | 121.630997 | 39.142343 | PICKUP | 20200201 | 2.02002e+16 | 3 | 4.751832 | ... | ASSIGN | 2.10007e+18 | 121.630997 | 39.142343 | 121.632547 | 39.141946 | 377.0 | 804 | 1246 | 11 |
| 1 | 10007871 | 0 | 2.10007e+18 | 121.630997 | 39.142343 | DELIVERY | 20200201 | 2.02002e+16 | 3 | 4.751832 | ... | PICKUP | 2.10007e+18 | 121.632547 | 39.141946 | 121.626144 | 39.140281 | 780.0 | 298 | 1246 | 11 |
| 2 | 10007871 | 0 | 2.10007e+18 | 121.630997 | 39.142343 | PICKUP | 20200201 | 2.02002e+16 | 3 | 4.751832 | ... | DELIVERY | 2.10007e+18 | 121.626144 | 39.140281 | 121.631219 | 39.141811 | 550.0 | 545 | 2462 | 11 |
| 3 | 10007871 | 0 | 2.10007e+18 | 121.630997 | 39.142343 | DELIVERY | 20200201 | 2.02002e+16 | 3 | 4.751832 | ... | PICKUP | 2.10007e+18 | 121.631219 | 39.141811 | 121.632084 | 39.146201 | 707.0 | 341 | 1205 | 11 |
| 4 | 10007871 | 0 | 2.10007e+18 | 121.630997 | 39.142343 | PICKUP | 20200201 | 2.02002e+16 | 3 | 4.751832 | ... | DELIVERY | 2.10007e+18 | 121.632084 | 39.146201 | 121.631574 | 39.142231 | 770.0 | 166 | 1882 | 11 |


In [None]:
# check data types and non-null counts
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509604 entries, 0 to 509603
Data columns (total 25 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   courier_id              509604 non-null  int64  
 1   wave_index              509604 non-null  int64  
 2   tracking_id             509604 non-null  float64
 3   courier_wave_start_lng  509604 non-null  float64
 4   courier_wave_start_lat  509604 non-null  float64
 5   action_type             509604 non-null  object 
 6   date                    509604 non-null  int64  
 7   group                   509604 non-null  float64
 8   level                   509604 non-null  int64  
 9   speed                   509604 non-null  float64
 10  max_load                509604 non-null  int64  
 11  weather_grade           509604 non-null  object 
 12  aoi_id                  509604 non-null  object 
 13  shop_id                 509604 non-null  object 
 14  id                  

In [5]:
df_train.describe()

| | courier_id | wave_index | tracking_id | courier_wave_start_lng | courier_wave_start_lat | date | group | level | speed | max_load | id | source_tracking_id | source_lng | source_lat | target_lng | target_lat | grid_distance | expected_use_time | urgency | hour |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 | 509604.0 |
| mean | 81512550.0 | 2.400154 | 2.100076e+18 | 121.534935 | 39.179724 | 20200220.0 | 1.496302e+17 | 2.607338 | 5.348056 | 8.980295 | 254801.5 | 2.100076e+18 | 121.534923 | 39.179897 | 121.534882 | 39.179971 | 1078.2749 | 441.655107 | 1572.033695 | 14.482592 |
| std | 49037810.0 | 2.168523 | 4797124000000.0 | 0.151081 | 0.114737 | 7.776252 | 1.333406e+17 | 0.698855 | 0.62607 | 2.02849 | 147110.147627 | 4797965000000.0 | 0.150718 | 0.113594 | 0.150752 | 0.113615 | 1124.569317 | 405.080785 | 4344.556228 | 3.310272 |
| min | 10007870.0 | 0.0 | 2.10007e+18 | 119.876654 | 36.064995 | 20200200.0 | 2.02002e+16 | 0.0 | 3.008735 | 1.0 | 0.0 | 2.10007e+18 | 119.876654 | 36.064995 | 121.059274 | 38.826421 | 0.0 | 1.0 | -340771.0 | 6.0 |
| 25% | 10697340.0 | 1.0 | 2.10007e+18 | 121.444628 | 39.116955 | 20200210.0 | 2.02002e+16 | 2.0 | 4.868302 | 8.0 | 127400.75 | 2.10007e+18 | 121.444174 | 39.11734 | 121.444254 | 39.117201 | 330.0 | 189.0 | 859.0 | 12.0 |
| 50% | 111751100.0 | 2.0 | 2.10008e+18 | 121.523819 | 39.162378 | 20200220.0 | 2.02002e+17 | 3.0 | 5.458097 | 9.0 | 254801.5 | 2.10008e+18 | 121.52393 | 39.161311 | 121.523587 | 39.161241 | 869.0 | 354.0 | 1752.0 | 14.0 |
| 75% | 118760800.0 | 4.0 | 2.10008e+18 | 121.591983 | 39.218092 | 20200220.0 | 2.02002e+17 | 3.0 | 5.779434 | 10.0 | 382202.25 | 2.10008e+18 | 121.591344 | 39.218011 | 121.591347 | 39.218921 | 1572.0 | 584.0 | 2590.0 | 17.0 |
| max | 125996900.0 | 16.0 | 2.10008e+18 | 122.256382 | 39.705013 | 20200230.0 | 2.02002e+18 | 3.0 | 6.943103 | 19.0 | 509603.0 | 2.10008e+18 | 122.260124 | 39.705013 | 122.260124 | 39.695211 | 429173.0 | 9246.0 | 11345.0 | 23.0 |
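
The summary statistics show a minimum `urgency` of -340771 and a maximum `grid_distance` of 429173 against a 75th percentile of 1572, which suggests a small number of extreme rows. One possible way to count such rows is the 1.5 × IQR rule. The sketch below runs on made-up values; `count_iqr_outliers` is a hypothetical helper, and the multiplier of 1.5 is a conventional default rather than a choice made in this project.

```python
import pandas as pd


def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(outside.sum())


# Made-up rows: five typical values per column plus one extreme value.
df_demo = pd.DataFrame({
    "urgency": [859, 1246, 1752, 1882, 2590, -340771],
    "grid_distance": [330.0, 377.0, 869.0, 1572.0, 780.0, 429173.0],
})

# Number of flagged rows per column.
outlier_counts = {col: count_iqr_outliers(df_demo[col]) for col in df_demo.columns}
print(outlier_counts)
```

On the real data, the same helper could be applied to `df_train[col]`; whether flagged rows should be dropped, clipped, or kept is a separate modelling decision.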


### Part 2: Exploratory Data Analysis (EDA)

In [None]:
# check target variable distribution
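
One way to inspect the classification target is `value_counts`, which gives class counts, or class shares when called with `normalize=True`. This is a minimal sketch on a made-up frame: the labels match the dataset description, but the counts are illustrative and do not come from the data.

```python
import pandas as pd

# Made-up sample using the two action labels from the dataset description.
df_demo = pd.DataFrame(
    {"action_type": ["PICKUP", "DELIVERY", "PICKUP", "DELIVERY", "PICKUP"]}
)

# Absolute number of rows per class.
class_counts = df_demo["action_type"].value_counts()

# Share of rows per class, useful for spotting class imbalance.
class_shares = df_demo["action_type"].value_counts(normalize=True)

print(class_counts)
print(class_shares.round(3))
```

Replacing `df_demo` with `df_train` would apply the same check to the training set.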
