# Mean Duration
<p>This tutorial walks through a simple submission for the NYC Taxi Trip Duration competition on <a href="https://www.kaggle.com/c/nyc-taxi-trip-duration">Kaggle</a>. In this notebook, we read the dataset and use *only* the mean trip duration to make a submission. This gives us a (not very good) baseline score and a chance to walk through the process of reading in the data and submitting predictions.</p>

<h2>Step 1: Download and Prepare data </h2>
<p>The first step is to download the raw data from the <a href="https://www.kaggle.com/c/nyc-taxi-trip-duration/data">Kaggle website</a>. For the purposes of this tutorial, only two files are necessary: `test.csv` and `train.csv`. Download them and save them in the `data` folder.</p>

We begin by importing the necessary packages. We use the `pandas` data analysis library to load the data into a format that is convenient to work with in Python, and `numpy` for some mathematical functions.


In [1]:
import pandas as pd
import numpy as np
import taxi_utils

Next, we use the `read_data` function, which you can find in the `taxi_utils.py` file in this folder. Here, `read_data` creates two *dataframes*, one for the training data and one for the test data, which store our tabular data. A dataframe has a `head()` method, which returns its first rows (five by default). We can use it to get a sense of what the dataframe looks like.

In [2]:
TRAIN_DIR = "data/train.csv"
TEST_DIR = "data/test.csv"

data_train, data_test = taxi_utils.read_data(TRAIN_DIR, TEST_DIR)

data_train.head(5)

| | id | vendor_id | pickup_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id2875421 | 2 | 2016-03-14 17:24:55 | 1 | -73.982155 | 40.767937 | -73.96463 | 40.765602 | False | 455 |
| 1 | id2377394 | 1 | 2016-06-12 00:43:35 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | False | 663 |
| 2 | id3858529 | 2 | 2016-01-19 11:35:24 | 1 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | False | 2124 |
| 3 | id3504673 | 2 | 2016-04-06 19:32:31 | 1 | -74.01004 | 40.719971 | -74.012268 | 40.706718 | False | 429 |
| 4 | id2181028 | 2 | 2016-03-26 13:30:55 | 1 | -73.973053 | 40.793209 | -73.972923 | 40.78252 | False | 435 |
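The contents of `taxi_utils.read_data` are not shown in this notebook, so the following is only a rough sketch of what a loader like it might do, assuming it is a thin wrapper around `pandas.read_csv`. The name `read_data_sketch` is hypothetical, and the in-memory CSV text stands in for the real files (the test row is made up for illustration).

```python
import io

import pandas as pd


def read_data_sketch(train_source, test_source):
    """Hypothetical stand-in for taxi_utils.read_data (an assumption, not the real code)."""
    # Parse the pickup time as a datetime so it is not left as a plain string.
    data_train = pd.read_csv(train_source, parse_dates=["pickup_datetime"])
    data_test = pd.read_csv(test_source, parse_dates=["pickup_datetime"])
    return data_train, data_test


# Small in-memory samples so the sketch runs without the Kaggle files.
train_csv = io.StringIO(
    "id,vendor_id,pickup_datetime,passenger_count,trip_duration\n"
    "id2875421,2,2016-03-14 17:24:55,1,455\n"
    "id2377394,1,2016-06-12 00:43:35,1,663\n"
)
# Made-up test row: the test set has no trip_duration column.
test_csv = io.StringIO(
    "id,vendor_id,pickup_datetime,passenger_count\n"
    "id0000001,1,2016-05-01 12:00:00,2\n"
)

sample_train, sample_test = read_data_sketch(train_csv, test_csv)
print(sample_train.dtypes)
```

With the real files, the same function could be called with the `TRAIN_DIR` and `TEST_DIR` paths instead of the in-memory samples.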


<h2>Step 2: Make a Submission </h2>

The submission is a CSV file with two columns: the trip `id` and the predicted `trip_duration`. We guess that every trip lasts as long as the average trip in the training data. That turns out to be a fairly poor estimate.

In [3]:
data_test['trip_duration'] = data_train.trip_duration.mean()
data_test[['id', 'trip_duration']].head(5)

| | id | trip_duration |
|---|---|---|
| 0 | id3004672 | 796.837574 |
| 1 | id3505355 | 796.837574 |
| 2 | id1217141 | 796.837574 |
| 3 | id2150126 | 796.837574 |
| 4 | id1598245 | 796.837574 |
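To get a feel for why a constant guess scores poorly, we can evaluate it on a few known durations. The sketch below assumes the leaderboard score is a root mean squared logarithmic error (RMSLE), which is our understanding of the competition's evaluation metric; the `rmsle` helper is our own illustration and not part of `taxi_utils`.

```python
import numpy as np


def rmsle(actual, predicted):
    """Root mean squared logarithmic error between two arrays of durations."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # log1p(x) = log(1 + x), which stays defined for zero-length trips.
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# The five training durations shown in the head() output of Step 1 (in seconds).
durations = np.array([455, 663, 2124, 429, 435])

# Predict the mean for every trip, as the baseline does.
mean_prediction = np.full(len(durations), durations.mean())
baseline_score = rmsle(durations, mean_prediction)
print(f"mean = {durations.mean():.1f} s, RMSLE = {baseline_score:.3f}")
```

On this tiny sample, the single long trip (2124 seconds) pulls the mean above the other four durations, so most of the predictions overshoot. Five rows are far too few to estimate the real score; the point is only to show how the error is computed.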


In [4]:
data_test[['id', 'trip_duration']].to_csv('trip_duration_average.csv', index=False)
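Before uploading, it can be worth reading the file back and checking its shape. The following is a minimal sketch of such a check; the helper name `find_submission_problems` and the specific rules it applies are our own assumptions rather than requirements taken from Kaggle, and the two-row frame stands in for the real `data_test`.

```python
import os
import tempfile

import pandas as pd


def find_submission_problems(path, expected_ids):
    """Return a list of human-readable problems found in a submission CSV."""
    submission = pd.read_csv(path)
    if list(submission.columns) != ["id", "trip_duration"]:
        return [f"unexpected columns: {list(submission.columns)}"]
    problems = []
    if submission["id"].duplicated().any():
        problems.append("duplicate ids")
    if set(submission["id"]) != set(expected_ids):
        problems.append("ids do not match the test set")
    if submission["trip_duration"].isna().any():
        problems.append("missing predictions")
    if (submission["trip_duration"] < 0).any():
        problems.append("negative durations")
    return problems


# Illustrative two-row submission written to a temporary folder.
sample = pd.DataFrame(
    {"id": ["id3004672", "id3505355"], "trip_duration": [796.837574, 796.837574]}
)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "trip_duration_average.csv")
    sample.to_csv(sample_path, index=False)
    problems = find_submission_problems(sample_path, sample["id"])

print(problems)  # an empty list means no problems were found
```

For the real submission, the call would be `find_submission_problems('trip_duration_average.csv', data_test['id'])`.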

<dl>
<dt>This solution:</dt>
<dd>&nbsp; &nbsp; Received a score of 0.82631 on the Kaggle competition.</dd>
<dd>&nbsp; &nbsp; Placed 1158 out of 1257.</dd>
<dd>&nbsp; &nbsp; Beat 8% of competitors on Kaggle.</dd>
</dl>

December 27, 2017.


Featuretools was created by the developers at [Alteryx](https://www.alteryx.com). If building impactful data science pipelines is important to you or your business, please [get in touch](https://www.alteryx.com/contact-us/).