# Exploring the NYC taxi data

In Project 2, you will work on the [NYC taxi trip data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). Every month, the city of New York publishes open data which contains a record of every taxi ride taken that month in the city.

The function `get_taxi_data()` is provided for you in `utils.py` to easily download and read data for a particular month and type of taxi. You should use it in your project.

Open `utils.py` in VSCode, study it carefully, and try the example below. If you are not sure how it works, ask a tutor!

In [1]:
import pandas as pd

# Import the function get_taxi_data() from utils.py
from utils import get_taxi_data

In [6]:
# Example: get yellow taxi data for January 2022
cols_to_read = ['tpep_pickup_datetime',
                'tpep_dropoff_datetime',
                'passenger_count',
                'trip_distance',
                'fare_amount']

# Download the data and get the specified columns, save the file locally
df1 = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read, save=True)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463931 entries, 0 to 2463930
Data columns (total 5 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   tpep_pickup_datetime   datetime64[ns]
 1   tpep_dropoff_datetime  datetime64[ns]
 2   passenger_count        float64       
 3   trip_distance          float64       
 4   fare_amount            float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 94.0 MB


In [7]:
df1.head(10)

| | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | fare_amount |
|---|---|---|---|---|---|
| 0 | 2022-01-01 00:35:40 | 2022-01-01 00:53:29 | 2.0 | 3.8 | 14.5 |
| 1 | 2022-01-01 00:33:43 | 2022-01-01 00:42:07 | 1.0 | 2.1 | 8.0 |
| 2 | 2022-01-01 00:53:21 | 2022-01-01 01:02:19 | 1.0 | 0.97 | 7.5 |
| 3 | 2022-01-01 00:25:21 | 2022-01-01 00:35:23 | 1.0 | 1.09 | 8.0 |
| 4 | 2022-01-01 00:36:48 | 2022-01-01 01:14:20 | 1.0 | 4.3 | 23.5 |
| 5 | 2022-01-01 00:40:15 | 2022-01-01 01:09:48 | 1.0 | 10.3 | 33.0 |
| 6 | 2022-01-01 00:20:50 | 2022-01-01 00:34:58 | 1.0 | 5.07 | 17.0 |
| 7 | 2022-01-01 00:13:04 | 2022-01-01 00:22:45 | 1.0 | 2.02 | 9.0 |
| 8 | 2022-01-01 00:30:02 | 2022-01-01 00:44:49 | 1.0 | 2.71 | 12.0 |
| 9 | 2022-01-01 00:48:52 | 2022-01-01 00:53:28 | 1.0 | 0.78 | 5.0 |


In [8]:
# Now, get the data for only these 3 columns.
# The file is already saved locally from the previous command, so this should be faster!
cols_to_read = ['tpep_pickup_datetime',
                'tpep_dropoff_datetime',
                'trip_distance']

# We also don't need to save this as it's a subset of the file we already have.
df2 = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read)
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463931 entries, 0 to 2463930
Data columns (total 3 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   tpep_pickup_datetime   datetime64[ns]
 1   tpep_dropoff_datetime  datetime64[ns]
 2   trip_distance          float64       
dtypes: datetime64[ns](2), float64(1)
memory usage: 56.4 MB


In [9]:
df2.head(10)

| | tpep_pickup_datetime | tpep_dropoff_datetime | trip_distance |
|---|---|---|---|
| 0 | 2022-01-01 00:35:40 | 2022-01-01 00:53:29 | 3.8 |
| 1 | 2022-01-01 00:33:43 | 2022-01-01 00:42:07 | 2.1 |
| 2 | 2022-01-01 00:53:21 | 2022-01-01 01:02:19 | 0.97 |
| 3 | 2022-01-01 00:25:21 | 2022-01-01 00:35:23 | 1.09 |
| 4 | 2022-01-01 00:36:48 | 2022-01-01 01:14:20 | 4.3 |
| 5 | 2022-01-01 00:40:15 | 2022-01-01 01:09:48 | 10.3 |
| 6 | 2022-01-01 00:20:50 | 2022-01-01 00:34:58 | 5.07 |
| 7 | 2022-01-01 00:13:04 | 2022-01-01 00:22:45 | 2.02 |
| 8 | 2022-01-01 00:30:02 | 2022-01-01 00:44:49 | 2.71 |
| 9 | 2022-01-01 00:48:52 | 2022-01-01 00:53:28 | 0.78 |


In [10]:
# Now, I want the same data, but I need a new column 'total_amount' which is not in my current file.
cols_to_read = ['fare_amount',
                'total_amount']

# The function tries to read the columns from the existing data file,
# but 'total_amount' is not in it, so it automatically re-downloads the data.
df3 = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read)
df3.info()

File is in current folder, but may not contain all required columns.
Re-downloading data...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463931 entries, 0 to 2463930
Data columns (total 2 columns):
 #   Column        Dtype  
---  ------        -----  
 0   fare_amount   float64
 1   total_amount  float64
dtypes: float64(2)
memory usage: 37.6 MB


In [11]:
df3.head(10)

| | fare_amount | total_amount |
|---|---|---|
| 0 | 14.5 | 21.95 |
| 1 | 8.0 | 13.3 |
| 2 | 7.5 | 10.56 |
| 3 | 8.0 | 11.8 |
| 4 | 23.5 | 30.3 |
| 5 | 33.0 | 56.35 |
| 6 | 17.0 | 26.0 |
| 7 | 9.0 | 12.8 |
| 8 | 12.0 | 18.05 |
| 9 | 5.0 | 8.8 |


Now choose another month and type of vehicle, use `get_taxi_data()` to obtain the data, and start exploring the dataset!
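If you are not sure where to start, here is one possible first step, given only as a sketch. It rebuilds the first three rows of the January 2022 output by hand (so it runs without downloading anything) and derives a trip duration from the two datetime columns. The column name `duration_min` is our own choice for illustration, not something defined in the dataset or in `utils.py`.

```python
import pandas as pd

# First three rows of the January 2022 yellow taxi output shown earlier,
# typed in by hand so the example runs without downloading anything.
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2022-01-01 00:35:40", "2022-01-01 00:33:43", "2022-01-01 00:53:21"]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2022-01-01 00:53:29", "2022-01-01 00:42:07", "2022-01-01 01:02:19"]),
    "passenger_count": [2.0, 1.0, 1.0],
    "trip_distance": [3.8, 2.1, 0.97],
    "fare_amount": [14.5, 8.0, 7.5],
})

# Trip duration in minutes ('duration_min' is a hypothetical column name)
sample["duration_min"] = (
    sample["tpep_dropoff_datetime"] - sample["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# Summary statistics for a few numeric columns
print(sample[["trip_distance", "fare_amount", "duration_min"]].describe())
```

With the real data, you would replace `sample` with the dataframe returned by `get_taxi_data()`; the rest should work unchanged as long as the same columns are present.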

---

## Important tips about memory usage

Some of the data files are very large (several gigabytes!). Depending on your computer's RAM (memory), you may not be able to read an entire data file at once into a single dataframe.
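To get a feel for the numbers, you can estimate a dataframe's size before loading it. The sketch below assumes every column uses 8 bytes per value, which holds for the `float64` and `datetime64[ns]` columns in the examples above but not for text columns. The helper `estimate_memory_mb()` is a hypothetical name of our own, not part of `utils.py`.

```python
def estimate_memory_mb(n_rows, n_columns, bytes_per_value=8):
    """Rough size in MB (1 MB = 1024**2 bytes) of n_columns fixed-width columns."""
    return n_rows * n_columns * bytes_per_value / 1024**2

# Number of rows in the January 2022 yellow taxi data, taken from df1.info()
n_rows = 2_463_931

# 5, 3 and 2 columns correspond to df1, df2 and df3 in the examples above
for n_columns in (5, 3, 2):
    size = round(estimate_memory_mb(n_rows, n_columns), 1)
    print(f"{n_columns} columns: about {size} MB")
```

The three estimates agree with the `memory usage` lines printed by `df.info()` above (94.0 MB, 56.4 MB and 37.6 MB), which shows how directly the memory footprint scales with the number of columns you load.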

### Specify `columns`

The `columns` argument lets you select which columns to include in your dataframe. Always specify the columns you need when you read data, to avoid loading unnecessary data into memory.
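The same idea applies whenever you read files with pandas directly. As an illustration only (this is not necessarily how `get_taxi_data()` works internally), the sketch below reads a tiny in-memory CSV twice, once in full and once with the `usecols` argument of `pd.read_csv()`, and compares the memory used. The CSV text is a hand-typed sample built from values in the outputs above.

```python
import io
import pandas as pd

# A tiny hand-typed sample standing in for a real data file
csv_text = """tpep_pickup_datetime,trip_distance,fare_amount,total_amount
2022-01-01 00:35:40,3.8,14.5,21.95
2022-01-01 00:33:43,2.1,8.0,13.3
2022-01-01 00:53:21,0.97,7.5,10.56
"""

# Read every column, then read only the two columns we actually need
all_cols = pd.read_csv(io.StringIO(csv_text))
two_cols = pd.read_csv(io.StringIO(csv_text),
                       usecols=["trip_distance", "fare_amount"])

print(list(two_cols.columns))

# Compare memory used by the column data (in bytes, index excluded)
print(all_cols.memory_usage(index=False, deep=True).sum())
print(two_cols.memory_usage(index=False, deep=True).sum())
```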

### Save your processed data into CSV files

To create your report, you will select specific parts of the data, and likely clean and/or aggregate them. You may wish to save your data to CSV files at intermediate steps of your processing, so that you can load them directly the next time you start your notebook (instead of redoing all the processing every time you restart Jupyter).
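A minimal sketch of this save-and-reload pattern is shown below. The dataframe `processed` and the file name are hypothetical stand-ins for your own intermediate results, and a temporary folder is used only to keep the example self-contained; in your project you would probably save next to your notebook.

```python
import os
import tempfile
import pandas as pd

# Stand-in for a dataframe you have already cleaned or aggregated
processed = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-01-01 00:35:40", "2022-01-01 00:33:43"]),
    "trip_distance": [3.8, 2.1],
    "fare_amount": [14.5, 8.0],
})

# Hypothetical file name, written to a temporary folder for this example
csv_path = os.path.join(tempfile.mkdtemp(), "yellow_2022_01_processed.csv")

# index=False avoids writing the row index as an extra unnamed column
processed.to_csv(csv_path, index=False)

# CSV stores dates as text, so ask pandas to parse them again when loading
reloaded = pd.read_csv(csv_path, parse_dates=["tpep_pickup_datetime"])
print(reloaded.dtypes)
```

Note that CSV does not remember column types, which is why `parse_dates` is needed on reload; without it, the pickup times would come back as plain strings.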

---