## Introduction

Hello and welcome to the introduction notebook of this repo! Hopefully you have run the `run.sh` script at the top level of this repository and opened this notebook successfully. This repo is the culmination of work by `cmrfrd`, `iopoi`, and `cfu288` for the Kaggle NYC taxi fare competition.

The goal of this competition is to predict a taxi fare given the pickup location, drop-off location, and number of passengers. We are provided with 55 million taxi trips with known fares and about 10 thousand trips whose fares are unknown. The most accurate predictions for those unknown trips win.
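
To make "most accurate" concrete, here is a minimal sketch of one common error measure, root-mean-squared error (RMSE). Treat the choice of RMSE as an assumption on our part, since the text above does not name the scoring metric, and note that the fares in the example are made up.

```python
import math

def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences of fares."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical fares in dollars, not taken from the dataset
actual_fares = [10.0, 12.0, 8.0]
predicted_fares = [11.0, 12.0, 6.0]
print(rmse(actual_fares, predicted_fares))
```

A lower value means the predictions sit closer to the true fares, and large misses are penalized more heavily than small ones.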

## Basic setup

If you already have the dataset in the `/home/jovyan/work/data/` directory of your container, you can skip this step!

Reference: https://github.com/Kaggle/kaggle-api

To replicate our model and our results, you need a copy of the dataset on your machine. The easiest way to obtain it is through Kaggle, so you will need a Kaggle account to proceed.

Set one up here: https://www.kaggle.com/

Assuming you now have a Kaggle account, you need to copy the credentials Kaggle has given you (these can be found at `https://www.kaggle.com/<username>/account`). Follow the next few steps to continue and download the dataset!

1. Go to `https://www.kaggle.com/<username>/account`
2. Download your credentials file, `kaggle.json`, to your host machine
3. Click the `Upload Files` button in the File browser toolbar to upload your `kaggle.json` credentials
4. Move `kaggle.json` into the `/home/jovyan/work/data/` directory

Now that your Kaggle credentials are set up, continue running the cells.
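
If you want to confirm that the upload worked before moving on, a check along these lines may help. It is only a sketch: `check_kaggle_credentials` is a hypothetical helper, and it assumes the token file is a JSON object with `username` and `key` fields.

```python
import json
import os

def check_kaggle_credentials(config_dir):
    """Return True if config_dir holds a kaggle.json with both expected fields."""
    path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(path):
        return False
    with open(path) as f:
        try:
            credentials = json.load(f)
        except ValueError:  # the file is not valid JSON
            return False
    # Assumption: the token is a JSON object with non-empty "username" and "key"
    return isinstance(credentials, dict) and all(
        credentials.get(field) for field in ("username", "key")
    )

print(check_kaggle_credentials("/home/jovyan/work/data/"))
```

A result of `False` suggests the file is missing, misplaced, or incomplete, in which case repeating steps 2 to 4 is the simplest fix.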

### Configure kaggle environment
The `kaggle` command is installed in this container by default. In this setup it reads your credentials from the directory named by the `KAGGLE_CONFIG_DIR` environment variable, so we set that variable first.

In [1]:
%env KAGGLE_CONFIG_DIR=/home/jovyan/work/data/

env: KAGGLE_CONFIG_DIR=/home/jovyan/work/data/


Now that your environment is configured with the `KAGGLE_CONFIG_DIR` environment variable, the `kaggle` command can find your `kaggle.json` file and use it to download the dataset. Run the next cell to download the zip file containing the dataset and extract it into the `KAGGLE_CONFIG_DIR` directory.

In [2]:
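%%bash
kaggle_files=(test.csv train.csv.zip)
kaggle_unzip_file=train.csv
for i in "${kaggle_files[@]}"
do
    # Only download when the file is not already in the data directory
    if [ -f "$KAGGLE_CONFIG_DIR$i" ]; then
        echo
    else
        kaggle competitions download -c new-york-city-taxi-fare-prediction -p "$KAGGLE_CONFIG_DIR"
    fi
done

# Extract train.csv next to the archive unless it has already been extracted
if [ ! -f "$KAGGLE_CONFIG_DIR$kaggle_unzip_file" ]; then
    unzip "${KAGGLE_CONFIG_DIR}train.csv.zip" -d "$KAGGLE_CONFIG_DIR"
fi

echo "All good!"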



All good!
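
The check-then-extract step of the cell above can also be sketched in Python with the standard `zipfile` module. `extract_if_missing` is a hypothetical helper, not something this repo provides, and it assumes the archive and the extracted file live in the same directory.

```python
import os
import zipfile

def extract_if_missing(data_dir, archive_name="train.csv.zip", member="train.csv"):
    """Extract `member` from the archive into data_dir unless it already exists.

    Returns True if an extraction happened, False if the file was already there.
    """
    target = os.path.join(data_dir, member)
    if os.path.isfile(target):
        return False
    with zipfile.ZipFile(os.path.join(data_dir, archive_name)) as archive:
        archive.extract(member, path=data_dir)
    return True
```

In this notebook it would be called as `extract_if_missing(os.environ["KAGGLE_CONFIG_DIR"])`. Skipping the extraction when `train.csv` already exists keeps re-runs of the notebook cheap.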


## Visualization

Visualization is key to understanding the dataset you are working with. Here we use it to answer a few fundamental questions that are easiest to grasp with a graphic. The dataset we will use to answer these questions is `test.csv`.

Questions:

1. What does the data look like on top of a map of Manhattan?
2. What areas of NYC are more/less popular at different times of the day?
3. Where are the most people picked up in all of NYC?

## Importing dependencies

In the next cell we will import all the dependencies we need to execute the remaining cells of this notebook.

In [84]:
import os
import pandas as pd
import holoviews as hv
import geoviews as gv
import dask.dataframe as dd
import cartopy.crs as ccrs

from holoviews.operation.datashader import aggregate, datashade, dynspread, shade, rasterize, spread

import datashader as ds
from datashader import transfer_functions as tf
from datashader.colors import Greys9

from beakerx import *

from bokeh.tile_providers import STAMEN_TONER
from bokeh.models import WMTSTileSource

hv.extension("bokeh")

## 1. What does the test dataset look like on top of a map of Manhattan?

Because this data is mostly location based, it is extremely useful to view it on top of a map to gain more insight into what our data physically looks like. This can help us with later tasks such as data cleaning and clustering visualizations.

In the next cell we will use the [`geoviews`](http://geo.holoviews.org/) library to superimpose the pickup and dropoff points from the `test.csv` dataset on a map. We won't use the full training set because it may not fit in memory on some devices. We will use a single color to represent both kinds of point. Feel free to zoom in and check out your favorite destinations!

In [82]:
%%opts WMTS [width=800 height=500]
%opts Points [width=800 height=500 show_legend=True] (color="grey" size=2)

## load dataset lazily
pickup_columns = ['pickup_longitude','pickup_latitude']
dropoff_columns = ['dropoff_longitude','dropoff_latitude']
vdim = "passenger_count"
df = dd.read_csv(os.environ["KAGGLE_CONFIG_DIR"] + 'test.csv')

## Create holoviews, then geoviews points object 
def create_points(df, columns, **kwargs):
    points = (hv.Points(df, kdims=columns, vdims=[vdim])
                .options({'Points': {'color': Greys9}}))
    points = gv.Points(points, crs=ccrs.PlateCarree())
    return points

## Create pickup and dropoff points
pickup_points = create_points(df, pickup_columns)
dropoff_points = create_points(df, dropoff_columns)

## Merge points and tiles
gv.tile_sources.Wikipedia * pickup_points * dropoff_points
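
The `crs=ccrs.PlateCarree()` argument declares that the coordinates are plain longitude/latitude degrees, which lets the points be reprojected to line up with the map tiles. As a rough illustration of what such a reprojection involves, the sketch below applies the standard spherical Web Mercator formula. We assume here that the tiles use Web Mercator, which is typical for web tile services but is not stated above, and `lonlat_to_web_mercator` is a hypothetical helper rather than part of geoviews.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis used by spherical Web Mercator

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to Web Mercator metres (x, y)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# Approximate coordinates of midtown Manhattan, for illustration only
x, y = lonlat_to_web_mercator(-73.98, 40.75)
print(round(x), round(y))
```

Longitudes west of Greenwich map to negative `x` and latitudes north of the equator to positive `y`, which is why NYC points end up in the upper-left quadrant of the projected plane.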

## 2. What areas of NYC are more/less popular at different times of the day?

In [149]:
## load dataset lazily
pickup_columns = ['pickup_longitude','pickup_latitude']
dropoff_columns = ['dropoff_longitude','dropoff_latitude']
vdim = "pickup_datetime"
df = dd.read_csv(os.environ["KAGGLE_CONFIG_DIR"] + 'test.csv')

In [150]:
import datetime

d = datetime.datetime.now()

d.timestamp()

1540184879.837069

In [151]:
datetime.datetime.strptime('2015-01-27 13:08:24 UTC', "%Y-%m-%d %H:%M:%S UTC")

datetime.datetime(2015, 1, 27, 13, 8, 24)

In [162]:
df = dd.read_csv(os.environ["KAGGLE_CONFIG_DIR"] + 'test.csv')
df[vdim].to_frame().compute().iloc[0]

pickup_datetime    2015-01-27 13:08:24 UTC
Name: 0, dtype: object
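
Since the timestamps follow the format parsed above, the hour of day can be read off each one after parsing. Here is a minimal sketch of bucketing pickups by hour. Only the first sample value comes from `test.csv`; the others are invented for illustration, and `pickups_per_hour` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime

PICKUP_FORMAT = "%Y-%m-%d %H:%M:%S UTC"

def pickups_per_hour(timestamps):
    """Count how many pickups fall into each hour of the day (0-23)."""
    hours = (datetime.strptime(t, PICKUP_FORMAT).hour for t in timestamps)
    return Counter(hours)

sample = [
    "2015-01-27 13:08:24 UTC",  # first row of test.csv, as shown above
    "2015-01-27 13:45:00 UTC",  # made-up value
    "2015-01-28 08:15:10 UTC",  # made-up value
]
print(pickups_per_hour(sample))
```

Keep in mind that these hours are in UTC, so they are offset from New York local time.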

In [156]:
df[df[vdim].str.contains("f")].head()



| | key | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|


In [157]:
df[vdim].apply(lambda t:datetime.datetime.strptime(t, "%Y-%m-%d %H:%M:%S UTC"))

  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result


ValueError: Metadata inference failed in `apply`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
ValueError("time data 'foo' does not match format '%Y-%m-%d %H:%M:%S UTC'",)

Traceback:
---------
  File "/home/jovyan/.conda/envs/jovyan/lib/python3.6/site-packages/dask/dataframe/utils.py", line 137, in raise_on_meta_error
    yield
  File "/home/jovyan/.conda/envs/jovyan/lib/python3.6/site-packages/dask/dataframe/core.py", line 3585, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/home/jovyan/.conda/envs/jovyan/lib/python3.6/site-packages/dask/utils.py", line 695, in __call__
    return getattr(obj, self.method)(*args, **kwargs)
  File "/home/jovyan/.conda/envs/jovyan/lib/python3.6/site-packages/pandas/core/series.py", line 3194, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/src/inference.pyx", line 1472, in pandas._libs.lib.map_infer
  File "<ipython-input-157-335b387170ae>", line 1, in <lambda>
    df[vdim].apply(lambda t:datetime.datetime.strptime(t, "%Y-%m-%d %H:%M:%S UTC"))
  File "/home/jovyan/.conda/envs/jovyan/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/home/jovyan/.conda/envs/jovyan/lib/python3.6/_strptime.py", line 362, in _strptime
    (data_string, format))
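
The "Original error" in the traceback shows the cause: to infer the output type, Dask called the lambda on a placeholder string, `'foo'`, which `strptime` rejects. The message suggests passing `meta=` so that Dask does not have to guess. Another option, sketched below under the assumption that unparseable values should simply become `None`, is a parser that tolerates non-matching input. `parse_pickup` is a hypothetical helper name.

```python
from datetime import datetime

PICKUP_FORMAT = "%Y-%m-%d %H:%M:%S UTC"

def parse_pickup(text):
    """Parse a pickup timestamp, returning None when the text does not match."""
    try:
        return datetime.strptime(text, PICKUP_FORMAT)
    except (ValueError, TypeError):
        # ValueError: wrong format (e.g. 'foo'); TypeError: not a string at all
        return None

print(parse_pickup("2015-01-27 13:08:24 UTC"))
print(parse_pickup("foo"))
```

Following the hint in the error message, such a function could then be applied as `df[vdim].apply(parse_pickup, meta=(vdim, 'datetime64[ns]'))`; the dtype given there is our guess at a suitable result type, not something the output above confirms.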


In [89]:
df.columns

Index(['key', 'pickup_datetime', 'pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude', 'passenger_count'],
      dtype='object')

## 3. Where are the most people picked up in all of NYC?
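
One rough way to approach this question, sketched here as an assumption rather than as the analysis the cells below perform, is to snap each pickup to a coarse longitude/latitude grid and sum `passenger_count` per cell. The cell size and the sample rows are made up for illustration, and `busiest_cell` is a hypothetical helper.

```python
from collections import Counter

def busiest_cell(rows, cell_size=0.01):
    """Return ((lon_index, lat_index), passengers) for the busiest grid cell.

    rows: iterable of (pickup_longitude, pickup_latitude, passenger_count).
    """
    totals = Counter()
    for lon, lat, passengers in rows:
        # Floor division assigns every point to the grid cell that contains it
        cell = (int(lon // cell_size), int(lat // cell_size))
        totals[cell] += passengers
    return totals.most_common(1)[0]

# Invented pickups: two fall in the same cell, one lands elsewhere
rows = [
    (-73.985, 40.755, 2),
    (-73.986, 40.758, 1),
    (-73.955, 40.785, 2),
]
print(busiest_cell(rows))
```

Multiplying the returned indices by `cell_size` gives the south-west corner of the winning cell in degrees.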

In [None]:
del df["key"]

In [None]:
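# Dask needs `meta` to know the result type; [11:13] is the hour in "YYYY-MM-DD HH:MM:SS UTC"
df["pickup_datetime"] = df["pickup_datetime"].apply(lambda d: int(d[11:13]), meta=("pickup_datetime", "i8"))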

In [None]:
import sklearn.linear_model
import numpy as np

In [None]:
model = sklearn.linear_model.LinearRegression()

In [None]:
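# Materialize the lazy Dask dataframe as a NumPy array (`as_matrix` is deprecated in pandas)
X = df.compute().values
# test.csv has no fare column, so random values stand in for the target here
y = np.random.random(X.shape[0])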

In [None]:
model.fit(X, y)