## Loading Training Set

Loading the entire dataset into Pandas is slow and memory-hungry, so we apply the following optimizations:

- Ignore the `key` column
- Parse pickup datetime while loading data
- Specify data types for other columns
   - `float32` for geo coordinates
   - `float32` for fare amount
   - `uint8` for passenger count
- Work with a 1% sample of the data (~500k rows)

We can apply all of these optimizations while loading the data with [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):

In [None]:
import pandas as pd
import random

In [None]:
sample_frac = 0.01

selected_cols = 'fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count'.split(',')

# Compact dtypes, matching the list above
dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8'
}

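def skip_row(idx):
    # read_csv calls this with each row index; returning True skips that row
    if idx == 0:
        return False  # don't skip the header
    # keep each data row with probability sample_frac
    return random.random() > sample_frac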

random.seed(42)

df = pd.read_csv('train.csv',
                 usecols=selected_cols,
                 dtype=dtypes,
                 parse_dates=['pickup_datetime'],
                 skiprows=skip_row)

In [None]:
df.head()
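
To see what these options do without reading the full file, here is a minimal sketch that runs the same loading logic on a small in-memory CSV. The rows are made up for illustration, and `csv_text`, `demo_dtypes`, `load_sample` and `demo_df` are hypothetical names that are not part of the notebook. The sketch uses a 10% sample so that the tiny dataset still yields a useful number of rows.

```python
import io
import random

import pandas as pd

# Hypothetical miniature stand-in for train.csv (all values are made up).
header = ('key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,'
          'dropoff_longitude,dropoff_latitude,passenger_count')
row = '{i},7.5,2014-03-01 12:00:00 UTC,-73.98,40.75,-73.97,40.76,1'
csv_text = '\n'.join([header] + [row.format(i=i) for i in range(10_000)])

demo_dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8',
}

def load_sample(buffer, frac, seed=42):
    """Load roughly `frac` of the rows, skipping the `key` column."""
    rng = random.Random(seed)  # local generator, so the sample is reproducible

    def skip_row(idx):
        if idx == 0:
            return False  # never skip the header row
        return rng.random() > frac  # keep each data row with probability `frac`

    return pd.read_csv(buffer,
                       usecols=list(demo_dtypes) + ['pickup_datetime'],
                       dtype=demo_dtypes,
                       parse_dates=['pickup_datetime'],
                       skiprows=skip_row)

demo_df = load_sample(io.StringIO(csv_text), frac=0.1)

print(len(demo_df))                          # roughly 10% of 10,000 rows
print(demo_df.dtypes)                        # compact dtypes, no `key` column
print(demo_df.memory_usage(deep=True).sum()) # total bytes used by the sample
```

On the real data, calling `df.dtypes` and `df.memory_usage(deep=True)` in the same way is a quick check that the column types and the sample size came out as intended.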

In [None]:
test_df = pd.read_csv('test.csv', dtype=dtypes, parse_dates=['pickup_datetime'])