# Week 9
# Make Predictions on NYC Taxi Data

In the Week 8 notebook, we explored the contents of the [NYC Taxi Data](https://www.kaggle.com/c/nyc-taxi-trip-duration/data). This week, we will use numerical and graphical tools to discover the relationship between taxi trip duration and other relevant variables, so that we can make predictions on the test data set.

## Create functions to streamline data pre-processing
- Download and unzip the data file from Kaggle.
- Load the dataset and remove dubious records.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import zipfile
import folium
%matplotlib inline 

In [None]:
def unzip_training_data(download_path, unzip_path):
    
    if os.path.exists(unzip_path + "train.csv"):
        print("File train.csv already exists.")
        return
    
    file_name = download_path + "nyc-taxi-trip-duration.zip"
    with zipfile.ZipFile(file_name, "r") as f:
        f.extractall(unzip_path)
        
    file_name = unzip_path + "train.zip"
    with zipfile.ZipFile(file_name, "r") as f:
        f.extractall(unzip_path)
        print("File train.csv has been created.")

In [None]:
# Unzip the training data
unzip_training_data("C:/Users/lzhao/Downloads/", "Data/nyctaxi/")

In [None]:
def load_training_data(data_path):
    
    # Check if train.csv exists.
    assert os.path.exists(data_path + "train.csv"), "File train.csv does not exist."
    
    # Load train.csv as a data frame
    raw_data = pd.read_csv(data_path + "train.csv", sep=',')
    
    # Adjust data types
    raw_data['pickup_datetime'] = pd.to_datetime(raw_data['pickup_datetime'])
    raw_data['dropoff_datetime'] = pd.to_datetime(raw_data['dropoff_datetime'])

    # Remove trips that are too long or too short
    upper_trip_limit = 7200
    lower_trip_limit = 60
    long_trips = raw_data[raw_data['trip_duration'] > upper_trip_limit]
    data = raw_data.drop(long_trips.index)
    short_trips = data[data['trip_duration'] < lower_trip_limit]
    data = data.drop(short_trips.index)
    
    # Remove locations that are not in NYC.
    data_not_nyc = data[(data['pickup_longitude'] < -74.1) | (data['pickup_longitude'] > -73.7) |
                        (data['pickup_latitude'] < 40.5) | (data['pickup_latitude'] > 40.9) |
                        (data['dropoff_longitude'] < -74.1) | (data['dropoff_longitude'] > -73.7) |
                        (data['dropoff_latitude'] < 40.5) | (data['dropoff_latitude'] > 40.9)]
    data = data.drop(data_not_nyc.index)
    
    # Remove irrelevant columns
    data = data.drop(['id', 'vendor_id', 'passenger_count', 'store_and_fwd_flag', 'dropoff_datetime'], axis=1)
    
    print("Shape:", data.shape)
    
    return data

In [None]:
# Load the training data
data = load_training_data("Data/nyctaxi/")

In [None]:
data.head()
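
The filtering steps inside `load_training_data` can also be written as a single boolean mask instead of repeated `drop` calls. The sketch below is one possible way to do this, not the code used in this notebook: the helper name `filter_trips` and the tiny data frame are made up for illustration, while the duration and coordinate limits are the ones used above.

```python
import pandas as pd

def filter_trips(df, min_duration=60, max_duration=7200,
                 lon_range=(-74.1, -73.7), lat_range=(40.5, 40.9)):
    """Hypothetical helper: keep trips with a plausible duration
    whose pick-up and drop-off points lie inside the NYC bounding box."""
    # Series.between is inclusive on both ends by default
    mask = df['trip_duration'].between(min_duration, max_duration)
    for prefix in ('pickup', 'dropoff'):
        mask &= df[prefix + '_longitude'].between(*lon_range)
        mask &= df[prefix + '_latitude'].between(*lat_range)
    return df[mask]

# Made-up records: too short (row 1), outside NYC (row 2), too long (row 3)
toy = pd.DataFrame({
    'trip_duration':     [455, 30, 900, 10000],
    'pickup_longitude':  [-73.98, -73.98, -75.00, -73.98],
    'pickup_latitude':   [40.76, 40.76, 40.76, 40.76],
    'dropoff_longitude': [-73.96, -73.96, -73.96, -73.96],
    'dropoff_latitude':  [40.77, 40.77, 40.77, 40.77],
})
filtered = filter_trips(toy)
print(filtered.shape)
```

Only the first made-up record passes every check, so a single row remains.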

## Examine the bivariate relationships
Examine the relationship between a single feature and the trip duration.


In [None]:
# How is pick-up date time related to trip duration?
plt.plot(data['pickup_datetime'], data['trip_duration'], 'r,', alpha=0.05)

In [None]:
# How is pick-up location related to trip duration?
fig = plt.figure(figsize=(12, 8))

ax1 = fig.add_subplot(2, 2, 1)
ax1.plot(data['pickup_longitude'], data['trip_duration'], 'r,', alpha=0.05)
ax1.set_xlabel("Pick-Up Longitude")
ax1.set_ylabel("Trip Duration (sec)")

ax2 = fig.add_subplot(2, 2, 2)
ax2.plot(data['pickup_latitude'], data['trip_duration'], 'r,', alpha=0.05)
ax2.set_xlabel("Pick-Up Latitude")
ax2.set_ylabel("Trip Duration (sec)")

ax3 = fig.add_subplot(2, 2, 3)
ax3.plot(data['dropoff_longitude'], data['trip_duration'], 'r,', alpha=0.05)
ax3.set_xlabel("Drop-Off Longitude")
ax3.set_ylabel("Trip Duration (sec)")

ax4 = fig.add_subplot(2, 2, 4)
ax4.plot(data['dropoff_latitude'], data['trip_duration'], 'r,', alpha=0.05)
ax4.set_xlabel("Drop-Off Latitude")
ax4.set_ylabel("Trip Duration (sec)")


In [None]:
# Calculate a numerical statistic called "correlation coefficient"
# (only numeric columns are used; the datetime column is left out)
data.select_dtypes(include='number').corr()

### Correlation Coefficient
The **correlation coefficient** is a numerical measurement of *linear* correlation between two variables.
- The value of the correlation coefficient always lies in [-1, 1].
- If there is a strong positive correlation, then the coefficient is close to 1.
- If there is a strong negative correlation, then the coefficient is close to -1.
- If there is a very weak correlation, then the coefficient is close to 0.
- However, a near-zero coefficient does not rule out a relationship: two variables with a strong non-linear relationship can still have a coefficient close to 0.
![](https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg)
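
To make the point about non-linear relationships concrete, here is a small sketch with made-up data (it does not use the taxi data set). The variable `y_square` is completely determined by `x`, yet its correlation coefficient with `x` is essentially zero because the relationship is not linear.

```python
import numpy as np

x = np.linspace(-1, 1, 201)     # values symmetric around 0
y_linear = 2 * x + 1            # perfectly linear in x
y_square = x ** 2               # determined by x, but not linearly

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_square = np.corrcoef(x, y_square)[0, 1]
print(round(r_linear, 3), round(abs(r_square), 3))
```

The first coefficient is 1 up to rounding, while the second is 0 up to floating-point error.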

**It looks like the relationship between trip durations and other features cannot be seen directly.**

## Feature Engineering
- Create additional useful features
    - Distance between pick-up and drop-off
    - Hour of the day
    - Day of the week
    - Holidays?
    - Weather?

In [None]:
# Create a column representing the aerial distance between pickup and dropoff
# Aerial distance = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

def aerial_distance(record):
    
    return np.sqrt((record['pickup_longitude'] - record['dropoff_longitude']) ** 2 + \
                   (record['pickup_latitude'] - record['dropoff_latitude']) ** 2)

In [None]:
aerial_distance(data.loc[0, :])

In [None]:
data['aerial_distance'] = data.apply(aerial_distance, axis=1)  # axis=1 applies the function to each row
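
Applying a Python function row by row can be slow on a large training set. As an alternative sketch (not the approach taken in this notebook), the same formula can be evaluated on whole columns at once. The data frame `toy` below holds made-up coordinates, and the result is still measured in degrees, not in kilometers.

```python
import numpy as np
import pandas as pd

# Made-up coordinates for two trips
toy = pd.DataFrame({
    'pickup_longitude':  [-73.98, -73.95],
    'pickup_latitude':   [40.75, 40.80],
    'dropoff_longitude': [-73.95, -73.95],
    'dropoff_latitude':  [40.79, 40.80],
})

# np.hypot(a, b) computes sqrt(a**2 + b**2) element-wise
toy['aerial_distance'] = np.hypot(
    toy['pickup_longitude'] - toy['dropoff_longitude'],
    toy['pickup_latitude'] - toy['dropoff_latitude'],
)
print(toy['aerial_distance'])
```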

In [None]:
data.head()

In [None]:
# Add hours of the day
data['hour'] = data['pickup_datetime'].dt.hour

data.head()

In [None]:
# Add day of the week
data['day'] = data['pickup_datetime'].dt.dayofweek

data.head()
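
As a quick check of what the two new columns contain, the sketch below applies the same `.dt` accessors to two made-up timestamps. In pandas, `dayofweek` counts from Monday = 0 to Sunday = 6, and `hour` ranges from 0 to 23.

```python
import pandas as pd

# Made-up timestamps: 2016-03-14 was a Monday and 2016-03-20 a Sunday
stamps = pd.Series(pd.to_datetime(['2016-03-14 17:24:55', '2016-03-20 08:05:00']))

hours = stamps.dt.hour        # hour of the day, 0-23
days = stamps.dt.dayofweek    # Monday = 0, ..., Sunday = 6
print(list(hours), list(days))
```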

In [None]:
data.select_dtypes(include='number').corr()

In [None]:
plt.plot(data['hour'], data['trip_duration'], 'r,', alpha=0.05)

In [None]:
plt.plot(data['day'], data['trip_duration'], 'r,', alpha=0.05)

# Make Predictions

To make predictions, data scientists create a **mathematical model** that explicitly describes how a prediction is computed from the data. The model often comes with a set of tunable parameters whose values are determined by a **training algorithm**. The field of designing models and algorithms that let computers explore relationships in a data set is called **machine learning**. It is one of the most successful approaches to building **artificial intelligence**.

Today, we will employ a straightforward strategy called **k-nearest-neighbors** to predict trip durations for records in the test set.

- For each record that requires a prediction, find the **k** most similar records in the training set, whose trip durations are known.
- Since these records share similar attributes, their trip durations should be similar too.
- Use the average trip duration of the k records as the prediction for the new record.
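
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up data, not the implementation built later in this notebook: the function name `knn_predict` is hypothetical, and the sum of absolute differences is just one possible way to measure how close two records are.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=5):
    """Average the targets of the k training rows closest to the query."""
    # Distance from the query to every training row (sum of absolute differences)
    distances = np.abs(train_X - query).sum(axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return train_y[nearest].mean()

# Made-up training set with a single feature
train_X = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])
train_y = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

# The two nearest neighbors of 2.5 are the rows with features 2.0 and 3.0
prediction = knn_predict(train_X, train_y, np.array([2.5]), k=2)
print(prediction)
```

With `k=2` the prediction is the average of the two neighboring targets, 20 and 30.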

<img src="https://www.researchgate.net/profile/Debo_Cheng/publication/293487460/figure/fig1/AS:651874571149316@1532430417078/An-example-of-kNN-classification-task-with-k-5.png" width="600">

In [None]:
# Load test.csv
data_path = "Data/nyctaxi/"
with zipfile.ZipFile(data_path + "test.zip", "r") as f:
    f.extractall(data_path)
    
test_data = pd.read_csv(data_path + "test.csv", sep=",")

test_data.head()

In [None]:
# Change the data type of pick-up datetime
test_data['pickup_datetime'] = pd.to_datetime(test_data['pickup_datetime'])

# Add hour of the day and day of the week
test_data['day'] = test_data['pickup_datetime'].dt.dayofweek

test_data['hour'] = test_data['pickup_datetime'].dt.hour

# Drop irrelevant features
remove_cols = ['id', 'vendor_id', 'passenger_count', 'store_and_fwd_flag']
test_data = test_data.drop(remove_cols, axis=1)

test_data.head()

In [None]:
# Extract the first record as a Series, so that comparing it with a
# training record yields plain numbers
test_1 = test_data.loc[0, :]
test_1

In [None]:
# Which records in the training data are similar to test_1?
# How to measure similarity?

# Let's use the following weighted sum of differences to measure similarity
# (a smaller value means the two records are more similar):
# similarity = differences in coordinates * 500 + difference in days of the week + difference in hours

def similarity(training_record, test_record):
    
    diff_coordinates = np.abs(training_record['pickup_longitude'] - test_record['pickup_longitude']) + \
                        np.abs(training_record['pickup_latitude'] - test_record['pickup_latitude']) + \
                        np.abs(training_record['dropoff_longitude'] - test_record['dropoff_longitude']) + \
                        np.abs(training_record['dropoff_latitude'] - test_record['dropoff_latitude'])
    
    diff_day = np.abs(training_record['day'] - test_record['day'])
    
    diff_hour = np.abs(training_record['hour'] - test_record['hour'])
        
    return diff_coordinates * 500 + diff_day + diff_hour

In [None]:
similarity(data.loc[0, :], test_1)

In [None]:
data['similarity_test_1'] = data.apply(similarity, args=(test_1,), axis=1)

data.head()
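
Calling `similarity` once per row through `apply` can take a long time on the full training set. The sketch below shows one possible column-wise version of the same weighted sum. The function name `dissimilarity_to`, the parameter `coord_weight`, and the two-row data frame are assumptions made for illustration; only the column names and the weight of 500 come from the cell above.

```python
import pandas as pd

def dissimilarity_to(train, record, coord_weight=500):
    """Weighted sum of absolute differences between every row of train and one record."""
    coord_cols = ['pickup_longitude', 'pickup_latitude',
                  'dropoff_longitude', 'dropoff_latitude']
    # Each term is a whole column, so no Python-level loop over rows is needed
    diff_coordinates = sum((train[c] - record[c]).abs() for c in coord_cols)
    diff_day = (train['day'] - record['day']).abs()
    diff_hour = (train['hour'] - record['hour']).abs()
    return diff_coordinates * coord_weight + diff_day + diff_hour

# Made-up training records
train = pd.DataFrame({
    'pickup_longitude':  [-73.98, -73.97],
    'pickup_latitude':   [40.75, 40.75],
    'dropoff_longitude': [-73.95, -73.95],
    'dropoff_latitude':  [40.79, 40.79],
    'day':  [0, 2],
    'hour': [17, 18],
})
record = train.iloc[0]          # compare every row with the first one
scores = dissimilarity_to(train, record)
print(scores)
```

The first score is 0 because the record is compared with itself; the second is about 0.01 * 500 + 2 + 1 = 8.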

In [None]:
# Let's use k = 5
# Find the 5 closest records
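
One possible way to finish this step, sketched on a made-up data frame so that it runs on its own: pick the k rows with the smallest score and average their trip durations. The column names follow the ones used above; the values are invented.

```python
import pandas as pd

# Made-up training records with a precomputed score (smaller = more similar)
toy = pd.DataFrame({
    'trip_duration':     [300, 600, 900, 1200, 1500, 4000, 5000],
    'similarity_test_1': [0.5, 1.0, 1.5, 2.0, 2.5, 30.0, 40.0],
})

k = 5
nearest = toy.nsmallest(k, 'similarity_test_1')   # the k rows with the lowest score
prediction = nearest['trip_duration'].mean()      # average of their durations
print(prediction)
```

On the real training set, the same two lines would be applied to `data` once its `similarity_test_1` column has been computed.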

