## New York City Taxi Fare Prediction
### Can you predict a rider's taxi fare?


This is one of Kaggle's beginner competitions, and it has already finished.
Even though the estimation task is simple and straightforward, I thought it would be a good example for practicing an estimation problem from scratch.

See the competition page for details: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction

In [None]:
import os
import sys
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils

In [None]:
files = os.listdir('./')
files

### Please download the pre-created train_s file, a random 10% sample of the original file
**train_s.csv file download** 
[train_s.csv](https://drive.google.com/uc?export=download&id=1AbO1jfrwJ0IKSJQactC6msPNub131LAh)
<br>**test_s.csv file download**
[test_s.csv](https://drive.google.com/uc?export=download&id=1iE_JilybsBIBpaTveD6vX8AfwaZzI8i8)

### Let's read the CSV data into pandas DataFrames
Loading takes some time because train_s.csv is fairly large, about 601MB.
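
The notebook does not show how the 10% sample was produced. The following is only a sketch of one possible approach, reading a large CSV in chunks and sampling each chunk. The in-memory CSV text, the chunk size and the seed are illustrative assumptions, not the author's procedure.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file (100 rows, built in memory).
csv_text = "key,fare_amount\n" + "\n".join(
    "k%d,%.1f" % (i, 2.5 + i) for i in range(100)
)

# Read 20 rows at a time so the whole file never has to fit in memory,
# and keep a random 10% of every chunk.
chunks = pd.read_csv(io.StringIO(csv_text), index_col=0, chunksize=20)
sampled = pd.concat([chunk.sample(frac=0.1, random_state=0) for chunk in chunks])

print(sampled.shape)
```

For a real file, the `io.StringIO(...)` argument would be replaced by the path to the CSV.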

In [None]:
train_sdf = pd.read_csv('./train_s.csv', index_col=0)
test_sdf = pd.read_csv('./test_s.csv', index_col=0)
test_df = pd.read_csv('./test.csv', index_col=0)
print('The size of train data =', train_sdf.shape)
print('The size of test =', test_sdf.shape)
print('The size of test for submission =', test_df.shape)

**Let's see what the data looks like**

In [None]:
train_sdf.head(2)

### Removing NaN and Outliers

In [None]:
# Check whether any feature in the train and test data contains NaN
print(train_sdf.isnull().sum() , test_sdf.isnull().sum())

In [None]:
# Remove missing values
train_sdf = train_sdf.dropna(how = 'any')
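
As a small illustration of the two steps above, the sketch below counts missing values per column and then drops incomplete rows. The toy frame and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete row (values are invented).
toy = pd.DataFrame({
    "fare_amount": [5.0, 7.5, 10.0],
    "dropoff_longitude": [-73.98, np.nan, -73.95],
    "dropoff_latitude": [40.75, np.nan, 40.77],
})

# Number of NaN values in each column.
missing_per_column = toy.isnull().sum()

# how='any' removes a row as soon as one of its values is missing.
cleaned = toy.dropna(how="any")

print(missing_per_column)
print(len(toy) - len(cleaned), "row(s) dropped")
```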

### From [the starter code](https://www.kaggle.com/dster/nyc-taxi-fare-starter-kernel-simple-linear-model) 

Add columns

```
# Given a dataframe, add two new features 'abs_diff_longitude' and
# 'abs_diff_latitude' representing the "Manhattan vector" from
# the pickup location to the dropoff location.
def add_travel_vector_features(df):
    df['abs_diff_longitude'] = (df.dropoff_longitude - df.pickup_longitude).abs()
    df['abs_diff_latitude'] = (df.dropoff_latitude - df.pickup_latitude).abs()

add_travel_vector_features(train_df)
```

In [None]:
def add_travel_vector_features(df):
    df['abs_diff_longitude'] = (df.dropoff_longitude - df.pickup_longitude).abs()
    df['abs_diff_latitude'] = (df.dropoff_latitude - df.pickup_latitude).abs()

# add longitude, latitude difference to column features
add_travel_vector_features(train_sdf)
add_travel_vector_features(test_sdf)
add_travel_vector_features(test_df)

### Filtering out outliers
From the starter code<br>
"We expect most of these values to be very small (likely between 0 and 1) since it should all be differences between GPS coordinates within one city. For reference, one degree of latitude is about 69 miles. However, we can see the dataset has extreme values which do not make sense. Let's remove those values from our training set. Based on the scatterplot, it looks like we can safely exclude values above 5"

In [None]:
train_sdf = train_sdf[(train_sdf.abs_diff_longitude < 5.0) & (train_sdf.abs_diff_latitude < 5.0)]
print(train_sdf.shape)
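
To get a feeling for why a 5 degree difference is far too large for a taxi ride, coordinate differences can be converted to miles. The sketch below uses the haversine formula with an approximate mean Earth radius. The helper name `haversine_miles` and the sample coordinates are assumptions made for illustration and are not part of the starter code.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# One and five degrees of latitude, starting near New York City.
one_degree_lat = haversine_miles(40.0, -74.0, 41.0, -74.0)
five_degree_lat = haversine_miles(40.0, -74.0, 45.0, -74.0)

print("1 degree of latitude  ~ %.1f miles" % one_degree_lat)
print("5 degrees of latitude ~ %.1f miles" % five_degree_lat)
```

One degree of latitude comes out at roughly 69 miles, in line with the quoted reference value, so a 5 degree difference corresponds to a trip of several hundred miles.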

**Removing location data outside the region covered by the test set**<br>
Find the min and max longitude and latitude of the test data.

In [None]:
def location_boundary(df):
    """
    Find min, max values of longitude, latitude in df
    
    Return
        (longitude_min, longitude_max), (latitude_min, latitude_max)
    """
    
    lon_min = min(df.pickup_longitude.min(), df.dropoff_longitude.min())
    # The upper bound is the larger of the two maxima, so use max() here
    lon_max = max(df.pickup_longitude.max(), df.dropoff_longitude.max())
    
    lat_min = min(df.pickup_latitude.min(), df.dropoff_latitude.min())
    lat_max = max(df.pickup_latitude.max(), df.dropoff_latitude.max())
    
    return (lon_min, lon_max), (lat_min, lat_max)

In [None]:
(lon_min, lon_max), (lat_min, lat_max) = location_boundary(test_df)
print(lon_min, lon_max, lat_min, lat_max)

Keep only the rows that fall within the location boundary of the test data (test.csv)

In [None]:
train_sdf = \
    train_sdf[(train_sdf.pickup_longitude >= lon_min) & (train_sdf.pickup_longitude <= lon_max) &
    (train_sdf.dropoff_longitude >= lon_min) & (train_sdf.dropoff_longitude <= lon_max) &
    (train_sdf.pickup_latitude >= lat_min) & (train_sdf.pickup_latitude <= lat_max) &
    (train_sdf.dropoff_latitude >= lat_min) & (train_sdf.dropoff_latitude <= lat_max)]

print(train_sdf.shape)
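
The same bounding-box filter can be written more compactly with `Series.between`, which is inclusive on both ends by default. This is only an alternative sketch: the helper name `within_box`, the toy rows and the bounds are illustrative assumptions, not the values computed from test.csv.

```python
import pandas as pd

def within_box(df, lon_range, lat_range):
    """Keep rows whose pickup and dropoff both lie inside the box."""
    lon_lo, lon_hi = lon_range
    lat_lo, lat_hi = lat_range
    mask = (
        df.pickup_longitude.between(lon_lo, lon_hi)
        & df.dropoff_longitude.between(lon_lo, lon_hi)
        & df.pickup_latitude.between(lat_lo, lat_hi)
        & df.dropoff_latitude.between(lat_lo, lat_hi)
    )
    return df[mask]

# Toy rows: only the first one lies completely inside the box.
toy = pd.DataFrame({
    "pickup_longitude": [-73.98, -73.99, 0.0],
    "dropoff_longitude": [-73.95, -60.00, -73.97],
    "pickup_latitude": [40.75, 40.76, 40.70],
    "dropoff_latitude": [40.77, 40.78, 40.71],
})

kept = within_box(toy, (-74.3, -72.9), (40.5, 41.7))
print(kept.shape)
```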

**In most cases passenger_count is at most 6, so let's treat counts above 6 as outliers and also remove rides with 0 passengers**

This filter has already been applied to the test_s set.

In [None]:
test_df.groupby('passenger_count').size()

In [None]:
train_sdf = train_sdf[(train_sdf.passenger_count <= 6) & (train_sdf.passenger_count > 0)]
print(train_sdf.shape)

**Let's remove fare_amount values lower than 2.5 or of 100 and above**<br>
The base fare is $2.50, and most fares are under $100.

In [None]:
train_sdf = train_sdf[ (train_sdf.fare_amount >= 2.5) & (train_sdf.fare_amount < 100)]
print(train_sdf.shape)

**Add pickup time information as input features**

In [None]:
train_sdf['pickup_datetime'] = pd.to_datetime(train_sdf['pickup_datetime'],format='%Y-%m-%d %H:%M:%S UTC')

In [None]:
def pickup_time_featues(data):
    """
    Insert year, month, day, day of week, hour, minute information 
    from the pickup_datetime column
    """
    
    data['Year'] = data['pickup_datetime'].dt.year
    data['Month'] = data['pickup_datetime'].dt.month
    data['Date'] = data['pickup_datetime'].dt.day
    data['DayofWeek'] = data['pickup_datetime'].dt.dayofweek
    data['Hour'] = data['pickup_datetime'].dt.hour
    data['Minute'] = data['pickup_datetime'].dt.minute
    
    return data

In [None]:
train_sdf = pickup_time_featues(train_sdf)
train_sdf.head(1)

In [None]:
test_sdf['pickup_datetime'] = pd.to_datetime(test_sdf['pickup_datetime'],format='%Y-%m-%d %H:%M:%S UTC')
test_df['pickup_datetime'] = pd.to_datetime(test_df['pickup_datetime'],format='%Y-%m-%d %H:%M:%S UTC')
test_sdf = pickup_time_featues(test_sdf)
test_df = pickup_time_featues(test_df)
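
To make the parsing step concrete, the sketch below applies the same format string to two made-up timestamps and extracts the same fields. Note that `dayofweek` counts from Monday = 0.

```python
import pandas as pd

# Two invented timestamps in the same layout as pickup_datetime.
raw = pd.Series(["2015-01-27 13:08:24 UTC", "2011-10-08 11:53:44 UTC"])

# The trailing ' UTC' is matched literally by the format string.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S UTC")

features = pd.DataFrame({
    "Year": parsed.dt.year,
    "Month": parsed.dt.month,
    "Date": parsed.dt.day,
    "DayofWeek": parsed.dt.dayofweek,  # Monday = 0, Sunday = 6
    "Hour": parsed.dt.hour,
    "Minute": parsed.dt.minute,
})

print(features)
```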

In [None]:
# Let's check the final size of the data again
print('The shape of train data =', train_sdf.shape)
print('The shape of test =', test_sdf.shape)
print('The size of test for submission =', test_df.shape)

### Set Columns to drop

In [None]:
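def preprocess_data(df, cols_to_drop=(), label='fare_amount'):
    """
    Drop the columns in cols_to_drop (if present) and split off the label.

    Returns df_x (features), df_y (labels), or only df_x when label is None
    """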
    drop_cols = []
    for col in cols_to_drop:
        if col in df:
            drop_cols.append(col)

    df_x = df.drop(columns=drop_cols)

    if label is not None:
        df_y = df_x.pop(label)
        return df_x, df_y
    else:
        return df_x

In [None]:
train_sdf.columns.values
train_sdf.head(2)

In [None]:
# Columns to drop from the input features.
# The raw pickup_datetime column is not used as a feature; the fields derived from it are.
COLUMNS_TO_DROP = ['pickup_datetime']

In [None]:
train_x, train_y = preprocess_data(train_sdf, COLUMNS_TO_DROP, 'fare_amount')
test_x, test_y = preprocess_data(test_sdf, COLUMNS_TO_DROP, 'fare_amount')
stest_x = preprocess_data(test_df, COLUMNS_TO_DROP, None)

In [None]:
print(train_x.shape, train_y.shape, test_x.shape, test_y.shape, stest_x.shape)
train_x.head(2)

In [None]:
def _MinMax(dframes):
    """
    Find the per-column min and max values across all data frames listed in dframes
    """
    s_min = dframes[0].min(axis=0)
    s_max = dframes[0].max(axis=0)
    
    for df in dframes[1:]:
        d_min =  df.min(axis=0)
        d_max =  df.max(axis=0)
        s_min = s_min.combine(d_min, min)
        s_max = s_max.combine(d_max, max)
    return s_min, s_max

# x_min and x_max are pandas Series indexed by column name
x_min, x_max = _MinMax([train_x, test_x, stest_x])
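
The element-wise `combine` used in `_MinMax` can be checked on toy data. The sketch below compares it with an alternative that stacks the frames and takes one min/max per column. The two small frames are invented, and the comparison assumes all frames share the same columns.

```python
import pandas as pd

a = pd.DataFrame({"x": [0.0, 2.0], "y": [10.0, 30.0]})
b = pd.DataFrame({"x": [-1.0, 1.0], "y": [20.0, 25.0]})

# Element-wise combination of per-frame results, as in _MinMax.
col_min = a.min(axis=0).combine(b.min(axis=0), min)
col_max = a.max(axis=0).combine(b.max(axis=0), max)

# Alternative: stack all rows, then reduce once per column.
stacked = pd.concat([a, b], axis=0)

print(col_min.tolist(), col_max.tolist())
print(stacked.min().tolist(), stacked.max().tolist())
```

Stacking is shorter but copies all rows, so the pairwise version may be preferable for large frames.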

### Let's implement your own PyTorch code here

preprocess - drops the columns that will not be used as feature columns <br>
Define the Torch Dataset `class NewYorkTaxiFareDataset(Dataset)`

In [None]:
def preprocess(df, cols_to_drop=''):
    """
    Returns df with columns dropped
    """
    drop_cols = []
    for col in cols_to_drop:
        if col in df:
            drop_cols.append(col)
    df = df.drop(columns=drop_cols)
    return df
    
class NewYorkTaxiFareDataset(Dataset):
    """New York Taxi Fare dataset."""

    def __init__(self, df, drop_cols, label=None, transform=None):
        """
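        Args:
            df : data frame which is read from csv file
            drop_cols : columns to drop before building samples
            label : name of the label column, or None for unlabeled data.
                When set, the label is assumed to be the first column
                left after drop_cols are removed.
            transform (callable, optional): Optional transform to be applied
                on a sample.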
        """
        self.df = df
        self.label = label
        self.drop_cols = drop_cols
        self.df = preprocess(self.df, self.drop_cols)
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
    
        if self.label is not None:
            taxi_fare_tg = np.array([self.df.iloc[idx, 0]])
            taxi_fare_input = np.array([self.df.iloc[idx, 1:]])
            sample = {'features': taxi_fare_input, 'target': taxi_fare_tg}
        else:
            taxi_fare_input = np.array([self.df.iloc[idx, :]])
            sample = {'features': taxi_fare_input}
        
        if self.transform:
            sample = self.transform(sample)

        return sample

### Transform

In [None]:
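def _MinMaxScaler(X, s_min, s_max, feature_range=(0, 1)):
    
    # Named range_min/range_max to avoid shadowing the built-in min/max
    range_min, range_max = feature_range
    
    # Small constant that prevents division by zero for constant columns
    epsilon = 1e-10
    X_std = (X - s_min) / (s_max - s_min + epsilon)
    X_scaled = X_std * (range_max - range_min) + range_min
    
    return X_scaled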

class MinMaxScaler(object):
    """Apply min max scaler to sample"""
    def __init__(self, label, _min, _max):
        self.label = label
        self.min = _min.to_numpy()
        self.max = _max.to_numpy()

    def __call__(self, sample):
        if self.label is not None:
            features, target = sample['features'], sample['target']
            features = _MinMaxScaler(features, self.min, self.max)
            return {'features' : features, 'target' : target}
        else:
            features = sample['features']
            features = _MinMaxScaler(features, self.min, self.max)
            return {'features' : features}
    
class ToTensor(object):
    """Convert ndarrays in sample to Tensors."""
    def __init__(self, label):
        self.label = label

    def __call__(self, sample):
        if self.label is not None:
            taxi_features, taxi_fare = sample['features'], sample['target']
            return {'features': torch.from_numpy(taxi_features),
                    'target': torch.from_numpy(taxi_fare)}
        else:
            taxi_features = sample['features']
            return {'features': torch.from_numpy(taxi_features)}
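
A tiny numeric example may help to see what the min-max scaling does, including the role of the small constant in the denominator. The function name `min_max_scale` and the numbers are illustrative assumptions. The formula is the one used in `_MinMaxScaler`.

```python
import numpy as np

def min_max_scale(x, x_min, x_max, feature_range=(0.0, 1.0), eps=1e-10):
    """Scale x column-wise into feature_range; eps guards against x_max == x_min."""
    lo, hi = feature_range
    x_std = (x - x_min) / (x_max - x_min + eps)
    return x_std * (hi - lo) + lo

# One sample with three features, shaped (1, 3) like a single dataset item.
x = np.array([[2.0, 50.0, 7.0]])
x_min = np.array([0.0, 0.0, 7.0])
x_max = np.array([4.0, 100.0, 7.0])  # the third column is constant

scaled = min_max_scale(x, x_min, x_max)
print(scaled)
```

The first two features land at about the middle of the range, and the constant column maps to 0 instead of raising a division-by-zero warning.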

### Apply the transforms and build the train dataset and data loader

In [None]:
torch.set_default_dtype(torch.float64)
label = 'fare_amount'
train_dataset = NewYorkTaxiFareDataset(train_sdf, COLUMNS_TO_DROP, label=label, 
                                       transform = transforms.Compose([MinMaxScaler(label,x_min,x_max),
                                                                       ToTensor(label)]))

Now let's look at one transformed sample, which is a tensor, and check its size

In [None]:
dataset = train_dataset[0]
print('dataset features size=', dataset['features'].shape, '\n', dataset['features'])
print('dataset taxi_fare size=', dataset['target'].shape, '\n', dataset['target'])

In [None]:
trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=128,
                                          shuffle=True, num_workers=8)

In [None]:
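dataiter = iter(trainloader)
# Fetch a single batch; the built-in next() works across PyTorch versions,
# unlike dataiter.next(), and both sizes now come from the same batch
batch = next(dataiter)
print(batch['features'].size(), batch['target'].size())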
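
Because each sample's features have shape (1, 13), a batch is expected to have shape (batch_size, 1, 13). The sketch below reproduces that batching behaviour with random stand-in tensors and a `TensorDataset`, which returns tuples rather than the dictionaries used in this notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 10 samples, each with features of shape (1, 13).
features = torch.randn(10, 1, 13, dtype=torch.float64)
targets = torch.randn(10, 1, dtype=torch.float64)

loader = DataLoader(TensorDataset(features, targets), batch_size=4, shuffle=False)

# Shape of the feature tensor in every batch; the last batch is smaller.
shapes = [tuple(batch_features.shape) for batch_features, _ in loader]
print(len(loader), shapes)
```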

### Build Your Model

In [None]:
class Net_NewYorkTaxiFare(nn.Module):
    def __init__(self, input_size, output_size, fc_units):
        """
        Model define
            input_size - input features size
            output_size - final output size
            fc_units - list of units of hidden layers
        """
        super(Net_NewYorkTaxiFare, self).__init__()
        
        self.nn_layers = nn.ModuleList()
        
        for i in range(len(fc_units)):
            if i == 0:
                self.nn_layers.append(nn.Linear(input_size, fc_units[i], bias=True))
            else:
                self.nn_layers.append(nn.Linear(fc_units[i-1], fc_units[i], bias=True))
        
        self.output = nn.Linear(fc_units[-1], output_size, bias=True)

    def forward(self, x):
        # Feed Forward
        for layer in self.nn_layers:
            x = F.relu(layer(x))
        x = self.output(x)
        x = x.view(-1,1)
        return x

In [None]:
# two hidden layers with units 13, 13
model = Net_NewYorkTaxiFare(13, 1, [13,13])

In [None]:
print(model)
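
As a cross-check of the printed architecture, the same layer sizes can be written as an `nn.Sequential` stack and inspected. This is an assumed equivalent for illustration, not the class defined above. It counts the trainable parameters and confirms the output shape after the final `view(-1, 1)`.

```python
import torch
import torch.nn as nn

# Same layer sizes as Net_NewYorkTaxiFare(13, 1, [13, 13]).
net = nn.Sequential(
    nn.Linear(13, 13), nn.ReLU(),
    nn.Linear(13, 13), nn.ReLU(),
    nn.Linear(13, 1),
)

# Each Linear layer has in*out weights plus out biases.
n_params = sum(p.numel() for p in net.parameters())

# A batch of 4 samples shaped (1, 13), flattened to (4, 1) at the end.
x = torch.randn(4, 1, 13)
out = net(x).view(-1, 1)

print(n_params, tuple(out.shape))
```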

Define the loss, metric, optimizer and learning rate

In [None]:
import torch.optim as optim
import time

metrics = torch.nn.L1Loss()
criterion = torch.nn.MSELoss(reduction='mean')
learning_rate = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-4)

In [None]:
epochs = 5
num_batches = len(trainloader)
running_loss = 0.0
running_mae = 0.0

for epoch in range(1, epochs + 1):
    run_time = time.time()
    print('Epoch {:d}/{:d}'.format(epoch,epochs))
    
    for i, data in enumerate(trainloader):
        # time for batch start
        batch_st = time.time()
        
        inputs = data['features']
        labels = data['target']
        
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        # print statistics
        mae = metrics(outputs, labels)
        batch_et = (time.time() - batch_st)
        
        running_loss += loss.item()
        running_mae += mae.item()
        sys.stdout.write('[%d/%d] %.2fms/step - loss: %.4f - mae: %.4f\r' % \
                         (i, num_batches, batch_et*1000, loss.item(), mae.item()))
        sys.stdout.flush()
            
    
    # Measure one epoch training time
    t_epoch = time.time() - run_time
    print('[%d/%d] %.2fs %.2fms/step - loss: %.4f - mae: %.4f\r' % \
                         (i, num_batches, t_epoch, batch_et*1000, \
                          running_loss/num_batches, running_mae/num_batches))
    running_loss = 0
    running_mae = 0
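
Before running the loop on the full data, it can be useful to confirm on a small synthetic problem that the same zero_grad / forward / backward / step pattern actually reduces the loss. Everything in this sketch (the data, the `toy_model`, the learning rate and the step count) is an assumption chosen for a quick sanity check.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in data: 13 random features, the target is their sum.
X = torch.rand(256, 13)
y = X.sum(dim=1, keepdim=True)

toy_model = nn.Sequential(nn.Linear(13, 13), nn.ReLU(), nn.Linear(13, 1))
toy_criterion = nn.MSELoss()
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)

losses = []
for step in range(200):
    toy_optimizer.zero_grad()                # reset accumulated gradients
    loss = toy_criterion(toy_model(X), y)    # forward pass
    loss.backward()                          # backpropagate
    toy_optimizer.step()                     # update parameters
    losses.append(loss.item())

print("first loss: %.4f, last loss: %.4f" % (losses[0], losses[-1]))
```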


### Build the test_s dataset and loader for model evaluation

In [None]:
# Test loader for test data set
test_dataset = NewYorkTaxiFareDataset(test_sdf, COLUMNS_TO_DROP, label=label, 
                                       transform = transforms.Compose([MinMaxScaler(label,x_min,x_max),
                                                                       ToTensor(label)]))

# Keep shuffle=False so predictions stay in the same row order as test_y
testloader = torch.utils.data.DataLoader(test_dataset, batch_size=128,
                                          shuffle=False, num_workers=4)

In [None]:
dataiter = iter(testloader)
# Fetch a single batch with the built-in next() and report both sizes from it
batch = next(dataiter)
print(batch['features'].size(), batch['target'].size())

In [None]:
test_predict = np.array([])
with torch.no_grad():
    total_loss, total_mae = 0.0, 0.0
    num_test_batches = len(testloader)
    for i, data in enumerate(testloader, 0):
        inputs = data['features']
        labels = data['target']
        
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        mae = metrics(outputs, labels)

        # Append in batch order so predictions line up with the rows of test_y
        test_predict = np.append(test_predict, outputs.view(-1,).numpy())
        
        total_loss += loss.item()
        total_mae += mae.item()
        sys.stdout.write('[%d/%d] - loss: %.4f - mae: %.4f\r' % \
                         (i, num_test_batches, loss.item(), mae.item()))
        sys.stdout.flush()
    
    print('\nAverage loss: %.4f  - mae: %.4f\r' % (total_loss/num_test_batches, total_mae/num_test_batches))

In [None]:
test_predict = test_predict.reshape(-1,1)
print(test_predict.shape)

In [None]:
test_taxi_fare = test_y.values.reshape(-1,1)
print(test_taxi_fare.shape)

In [None]:
taxi_fare_array = np.concatenate((test_taxi_fare, test_predict), axis=1)
print(taxi_fare_array.shape)

A DataFrame with two columns: the label, 'fare_amount', and its predicted value

In [None]:
taxi_fare = pd.DataFrame(taxi_fare_array , columns=['fare_amount', 'prediction'])
taxi_fare.describe()

In [None]:
y = abs(taxi_fare.fare_amount - taxi_fare.prediction)
y.describe()
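
The competition itself is scored by root mean squared error (RMSE), so it can be helpful to report RMSE next to the mean absolute error. The helpers below are a minimal NumPy sketch with invented numbers. In the notebook they would be applied to the `fare_amount` and `prediction` columns.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

# Invented fares and predictions; the errors are -2, 2 and -4.
fares = [10.0, 20.0, 30.0]
preds = [12.0, 18.0, 34.0]

print("RMSE: %.4f  MAE: %.4f" % (rmse(fares, preds), mae(fares, preds)))
```

RMSE penalises large errors more strongly than MAE, which is why it is never smaller than MAE on the same data.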

### Cumulative distribution function

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
n, bins, patches = ax.hist(y, bins=y.size, linewidth=2, density=True, \
            histtype='step', cumulative=True)
ax.grid(True)
ax.set_title('New York Taxi Fare Prediction Error CDF')
ax.set_xlabel('fare amount error')
ax.set_ylabel('Likelihood of occurrence')
plt.show()
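
An empirical CDF can also be computed directly by sorting the errors, which avoids building a histogram with one bin per sample. The sketch below uses made-up error values, and the helper name `empirical_cdf` is an assumption.

```python
import numpy as np

def empirical_cdf(values):
    """Return sorted values and the fraction of samples <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    p = np.arange(1, x.size + 1) / x.size
    return x, p

# Invented absolute errors in dollars.
errors = np.array([0.5, 3.0, 1.0, 2.0])
x, p = empirical_cdf(errors)

# Share of predictions that are off by at most $2.
share_within_2 = float(np.mean(errors <= 2.0))

print(x, p, share_within_2)
```

The resulting `x` and `p` arrays could be drawn with `ax.step(x, p, where='post')` in place of the histogram call.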

### For submission

In [None]:
# Data loader for the submission test set (test.csv, which has no labels)
stest_dataset = NewYorkTaxiFareDataset(stest_x, COLUMNS_TO_DROP, label=None, 
                                       transform = transforms.Compose([MinMaxScaler(None,x_min,x_max),
                                                                       ToTensor(None)]))

stestloader = torch.utils.data.DataLoader(stest_dataset, batch_size=128,
                                          shuffle=False, num_workers=4)

In [None]:
taxi_fare_predict = np.array([])
with torch.no_grad():
    for data in stestloader:
        inputs = data['features']
        outputs = model(inputs)
        predict = outputs.view(-1,).numpy()
        taxi_fare_predict = np.append(taxi_fare_predict, predict)

In [None]:
key = stest_x.index.values
print(key.shape, taxi_fare_predict.shape)

In [None]:
# Write the predictions to a CSV file which we can submit to the competition.
submission = pd.DataFrame(
    {'key': stest_x.index, 'fare_amount': taxi_fare_predict},
    columns = ['key', 'fare_amount'])
submission.to_csv('submission.csv', index = False)
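
Before uploading, it may be worth checking that the file has one row per key, no missing predictions and the expected header. The sketch below runs these checks on a toy submission written to an in-memory buffer. The keys and predicted values are hypothetical.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical keys and predictions standing in for stest_x.index
# and taxi_fare_predict.
keys = pd.Index(["k1", "k2", "k3"], name="key")
preds = np.array([8.1, 12.4, 5.9])

toy_submission = pd.DataFrame(
    {"key": keys, "fare_amount": preds},
    columns=["key", "fare_amount"])

# Basic sanity checks: one prediction per key and no NaN values.
rows_match = len(toy_submission) == len(keys)
no_missing = bool(toy_submission["fare_amount"].notna().all())

# Write to a buffer instead of disk and inspect the header line.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
lines = buffer.getvalue().splitlines()

print(rows_match, no_missing, lines[0])
```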