# Exploring the NYC taxi data

In Project 2, you will work with the [NYC taxi trip data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). Every month, the city of New York publishes open data containing a record of every taxi ride taken in the city that month.

The function `get_taxi_data()` is provided for you in `utils.py` to easily download and read data for a particular month and type of taxi. You should use it in your project.

Open `utils.py` in VS Code, study it carefully, and try the example below. If you are not sure how it works, ask a tutor!

In [15]:
import pandas as pd

# Import the function get_taxi_data() from utils.py
from utils import get_taxi_data

In [48]:
# Example: get yellow taxi data for January 2022
cols_to_read = ['tpep_pickup_datetime',
                'tpep_dropoff_datetime',
                'passenger_count',
                'trip_distance',
                'fare_amount']

# Download the data and get the specified columns, save the file locally
df1 = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read, save=True)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463931 entries, 0 to 2463930
Data columns (total 5 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   tpep_pickup_datetime   datetime64[ns]
 1   tpep_dropoff_datetime  datetime64[ns]
 2   passenger_count        float64       
 3   trip_distance          float64       
 4   fare_amount            float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 94.0 MB


In [17]:
df1.head(10)

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,fare_amount
0,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,14.5
1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,8.0
2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,7.5
3,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,8.0
4,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,23.5
5,2022-01-01 00:40:15,2022-01-01 01:09:48,1.0,10.3,33.0
6,2022-01-01 00:20:50,2022-01-01 00:34:58,1.0,5.07,17.0
7,2022-01-01 00:13:04,2022-01-01 00:22:45,1.0,2.02,9.0
8,2022-01-01 00:30:02,2022-01-01 00:44:49,1.0,2.71,12.0
9,2022-01-01 00:48:52,2022-01-01 00:53:28,1.0,0.78,5.0


In [50]:
print(df1['fare_amount'])

0          14.50
1           8.00
2           7.50
3           8.00
4          23.50
           ...  
2463926     8.00
2463927    16.80
2463928    11.22
2463929    12.40
2463930    25.48
Name: fare_amount, Length: 2463931, dtype: float64


In [18]:
df1.tail(10)

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,fare_amount
2463921,2022-01-31 23:36:07,2022-01-31 23:48:05,,3.04,13.09
2463922,2022-01-31 23:09:46,2022-01-31 23:20:50,,2.42,10.97
2463923,2022-01-31 23:51:47,2022-02-01 00:10:07,,7.51,26.73
2463924,2022-01-31 23:49:00,2022-02-01 00:08:00,,8.5,25.41
2463925,2022-01-31 23:02:51,2022-01-31 23:13:54,,1.63,9.71
2463926,2022-01-31 23:36:53,2022-01-31 23:42:51,,1.32,8.0
2463927,2022-01-31 23:44:22,2022-01-31 23:55:01,,4.19,16.8
2463928,2022-01-31 23:39:00,2022-01-31 23:50:00,,2.1,11.22
2463929,2022-01-31 23:36:42,2022-01-31 23:48:45,,2.92,12.4
2463930,2022-01-31 23:46:00,2022-02-01 00:13:00,,8.94,25.48


In [45]:
# Example: get yellow taxi data for February 2022
cols_to_read = ['tpep_pickup_datetime',
                'tpep_dropoff_datetime',
                'passenger_count',
                'trip_distance',
                'fare_amount']

# Download the data and get the specified columns, save the file locally
yellow = get_taxi_data('2022', '02', 'yellow', columns=cols_to_read, save=True)
yellow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2979431 entries, 0 to 2979430
Data columns (total 5 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   tpep_pickup_datetime   datetime64[ns]
 1   tpep_dropoff_datetime  datetime64[ns]
 2   passenger_count        float64       
 3   trip_distance          float64       
 4   fare_amount            float64       
dtypes: datetime64[ns](2), float64(3)
memory usage: 113.7 MB


In [49]:
# Example: get green taxi data for February 2022
# Download the data without selecting columns; do not save the file locally
green_feb = get_taxi_data('2022', '02', 'green', save=False)
green_feb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69399 entries, 0 to 69398
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   VendorID               69399 non-null  int64         
 1   lpep_pickup_datetime   69399 non-null  datetime64[ns]
 2   lpep_dropoff_datetime  69399 non-null  datetime64[ns]
 3   store_and_fwd_flag     61978 non-null  object        
 4   RatecodeID             61978 non-null  float64       
 5   PULocationID           69399 non-null  int64         
 6   DOLocationID           69399 non-null  int64         
 7   passenger_count        61978 non-null  float64       
 8   trip_distance          69399 non-null  float64       
 9   fare_amount            69399 non-null  float64       
 10  extra                  69399 non-null  float64       
 11  mta_tax                69399 non-null  float64       
 12  tip_amount             69399 non-null  float64       
 13  t

In [19]:
# Now, get the data only for those 3 columns.
# We have the file already saved from the previous command, so this should be faster!
cols_to_read = ['tpep_pickup_datetime',
                'tpep_dropoff_datetime',
                'trip_distance']

# We also don't need to save this as it's a subset of the file we already have.
df2 = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read)
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463931 entries, 0 to 2463930
Data columns (total 3 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   tpep_pickup_datetime   datetime64[ns]
 1   tpep_dropoff_datetime  datetime64[ns]
 2   trip_distance          float64       
dtypes: datetime64[ns](2), float64(1)
memory usage: 56.4 MB


In [20]:
df2.head(10)

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance
0,2022-01-01 00:35:40,2022-01-01 00:53:29,3.8
1,2022-01-01 00:33:43,2022-01-01 00:42:07,2.1
2,2022-01-01 00:53:21,2022-01-01 01:02:19,0.97
3,2022-01-01 00:25:21,2022-01-01 00:35:23,1.09
4,2022-01-01 00:36:48,2022-01-01 01:14:20,4.3
5,2022-01-01 00:40:15,2022-01-01 01:09:48,10.3
6,2022-01-01 00:20:50,2022-01-01 00:34:58,5.07
7,2022-01-01 00:13:04,2022-01-01 00:22:45,2.02
8,2022-01-01 00:30:02,2022-01-01 00:44:49,2.71
9,2022-01-01 00:48:52,2022-01-01 00:53:28,0.78


In [21]:
# Now, I want the same data, but I need a new column 'total_amount' which is not in my current file.
cols_to_read = ['fare_amount',
                'total_amount']

# The function tries to get the columns from the existing data file,
# but can't find them, so it automatically re-downloads the data.
df3 = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read)
df3.info()

File is in current folder, but may not contain all required columns.
Re-downloading data...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463931 entries, 0 to 2463930
Data columns (total 2 columns):
 #   Column        Dtype  
---  ------        -----  
 0   fare_amount   float64
 1   total_amount  float64
dtypes: float64(2)
memory usage: 37.6 MB


In [22]:
df3.head(10)

Unnamed: 0,fare_amount,total_amount
0,14.5,21.95
1,8.0,13.3
2,7.5,10.56
3,8.0,11.8
4,23.5,30.3
5,33.0,56.35
6,17.0,26.0
7,9.0,12.8
8,12.0,18.05
9,5.0,8.8
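
The messages printed above suggest a "check the local file first, download only if needed" pattern. The sketch below illustrates that pattern only; it is not the code in `utils.py`. The helper `load_with_cache`, the stand-in `fake_download`, the file name `cache.csv`, and the use of CSV as the storage format are all assumptions made for illustration.

```python
import os
import tempfile
import pandas as pd


def load_with_cache(path, columns, download):
    """Hypothetical helper, NOT the code in utils.py.

    Reuse the local CSV at `path` if it has every requested column;
    otherwise call `download()` (any function returning a DataFrame),
    save the result, and return the requested columns.
    """
    if os.path.exists(path):
        # nrows=0 reads only the header row, so this check is cheap
        available = pd.read_csv(path, nrows=0).columns
        if set(columns) <= set(available):
            return pd.read_csv(path, usecols=columns)[columns]
        print('File is in current folder, but may not contain all required columns.')
        print('Re-downloading data...')
    df = download()
    df.to_csv(path, index=False)
    return df[columns]


def fake_download():
    # Stand-in for a real download: two rows copied from the output above
    return pd.DataFrame({'fare_amount': [14.5, 8.0],
                         'total_amount': [21.95, 13.3]})


# Pretend an earlier call saved a file that lacks 'total_amount'
path = os.path.join(tempfile.mkdtemp(), 'cache.csv')
pd.DataFrame({'fare_amount': [14.5, 8.0]}).to_csv(path, index=False)

df = load_with_cache(path, ['fare_amount', 'total_amount'], fake_download)
print(df)
```

Reading only the header before deciding is what keeps the check fast even for a large file. To find out what `get_taxi_data()` really does, read `utils.py`.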


Now choose another month and type of vehicle, use `get_taxi_data()` to obtain the data, and start exploring the dataset!
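
A few first exploration steps might look like the sketch below. It uses a four-row sample copied from the January 2022 yellow taxi output above (three rows from the head, one from the tail) so that it runs without a download. The column name `duration_min` is an illustrative choice, not part of the dataset.

```python
import pandas as pd

# Four rows copied from the January 2022 yellow taxi output;
# the last one comes from the tail, where passenger_count is missing
sample = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime([
        '2022-01-01 00:35:40', '2022-01-01 00:33:43',
        '2022-01-01 00:53:21', '2022-01-31 23:36:07']),
    'tpep_dropoff_datetime': pd.to_datetime([
        '2022-01-01 00:53:29', '2022-01-01 00:42:07',
        '2022-01-01 01:02:19', '2022-01-31 23:48:05']),
    'passenger_count': [2.0, 1.0, 1.0, None],
    'trip_distance': [3.8, 2.1, 0.97, 3.04],
    'fare_amount': [14.5, 8.0, 7.5, 13.09],
})

# Count missing values in each column
missing = sample.isna().sum()
print(missing)

# Trip duration in minutes, stored in a new (illustrative) column
duration = sample['tpep_dropoff_datetime'] - sample['tpep_pickup_datetime']
sample['duration_min'] = duration.dt.total_seconds() / 60

# Summary statistics for the numeric columns of interest
print(sample[['trip_distance', 'fare_amount', 'duration_min']].describe())
```

On the full dataset, the same calls work unchanged on the dataframe returned by `get_taxi_data()`.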

---

## Important tips about memory usage

Some of the data files are very large (several gigabytes!). Depending on your computer's RAM (memory), you may not be able to read an entire data file at once into a single dataframe.
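
To see how much memory a dataframe actually occupies, pandas can report it per column. The sketch below uses a tiny stand-in frame: the numeric values are copied from the yellow taxi output above, while the `store_and_fwd_flag` values are placeholders.

```python
import pandas as pd

# Tiny stand-in frame; the flag values are made up for illustration
df = pd.DataFrame({
    'trip_distance': [3.8, 2.1, 0.97],
    'fare_amount': [14.5, 8.0, 7.5],
    'store_and_fwd_flag': ['N', 'N', 'Y'],
})

# Bytes used by each column (and by the index);
# deep=True also counts the contents of text columns
per_column = df.memory_usage(deep=True)
print(per_column)

# Total size in megabytes
total_mb = per_column.sum() / 1e6
print(f'{total_mb:.6f} MB')
```

When `info()` prints a size with a trailing `+` (for example `61.1+ MB`), the frame contains object columns whose contents were not fully counted; `df.info(memory_usage='deep')` gives the full figure.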

### Specify `columns`

The `columns` input argument lets you select which columns to include in your dataframe. You should always specify which columns you need when you read data, to avoid loading unnecessary data into memory.
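
The effect of reading fewer columns can be seen with pandas' own `read_csv` and its `usecols` argument. This is a stand-in that runs without a download; it makes no claim about how `get_taxi_data()` handles `columns` internally. The three rows are copied from the yellow taxi output above.

```python
import io
import pandas as pd

# In-memory CSV standing in for a data file
csv_text = """tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,fare_amount
2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,14.5
2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,8.0
2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,7.5
"""

# Read every column, then read only the two columns we need
full = pd.read_csv(io.StringIO(csv_text))
subset = pd.read_csv(io.StringIO(csv_text),
                     usecols=['trip_distance', 'fare_amount'])

print(full.shape, subset.shape)

# Compare the memory footprint of the two frames, in bytes
full_bytes = full.memory_usage(deep=True).sum()
subset_bytes = subset.memory_usage(deep=True).sum()
print(full_bytes, subset_bytes)
```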

### Save your processed data into CSV files

To create your report, you will be selecting specific parts of the data, and likely performing some cleaning and/or aggregation on this data. You may wish to save your data at intermediate steps of your processing into CSV files, so that you can load these directly the next time you start your notebook (instead of having to re-do all the processing every time you restart Jupyter).
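
A minimal sketch of saving an intermediate result and loading it back, using two rows from the yellow taxi output above. The file name and the filtering step are illustrative assumptions, and a temporary folder is used only to keep the example self-contained.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime(['2022-01-01 00:35:40',
                                            '2022-01-01 00:33:43']),
    'trip_distance': [3.8, 2.1],
    'fare_amount': [14.5, 8.0],
})

# Example processing step: keep rows with trip_distance above 3
long_trips = df[df['trip_distance'] > 3]

# Hypothetical file name inside a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'yellow_2022_01_long_trips.csv')

# index=False avoids writing the row index as an extra unnamed column
long_trips.to_csv(path, index=False)

# parse_dates restores the datetime dtype, which CSV text does not preserve
reloaded = pd.read_csv(path, parse_dates=['tpep_pickup_datetime'])
print(reloaded.dtypes)
```

In your own notebook, replace the temporary folder with a path inside your project folder so the file is still there after a restart.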

---

In [26]:
# Example: get green taxi data for January 2022
# Download the data without selecting columns, save the file locally
green_vehicle = get_taxi_data('2022', '01', 'green', save=True)
green_vehicle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62495 entries, 0 to 62494
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   VendorID               62495 non-null  int64         
 1   lpep_pickup_datetime   62495 non-null  datetime64[ns]
 2   lpep_dropoff_datetime  62495 non-null  datetime64[ns]
 3   store_and_fwd_flag     56200 non-null  object        
 4   RatecodeID             56200 non-null  float64       
 5   PULocationID           62495 non-null  int64         
 6   DOLocationID           62495 non-null  int64         
 7   passenger_count        56200 non-null  float64       
 8   trip_distance          62495 non-null  float64       
 9   fare_amount            62495 non-null  float64       
 10  extra                  62495 non-null  float64       
 11  mta_tax                62495 non-null  float64       
 12  tip_amount             62495 non-null  float64       
 13  t

In [27]:
green_vehicle.head(10)

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
0,2,2022-01-01 00:14:21,2022-01-01 00:15:33,N,1.0,42,42,1.0,0.44,3.5,0.5,0.5,0.0,0.0,,0.3,4.8,2.0,1.0,0.0
1,1,2022-01-01 00:20:55,2022-01-01 00:29:38,N,1.0,116,41,1.0,2.1,9.5,0.5,0.5,0.0,0.0,,0.3,10.8,2.0,1.0,0.0
2,1,2022-01-01 00:57:02,2022-01-01 01:13:14,N,1.0,41,140,1.0,3.7,14.5,3.25,0.5,4.6,0.0,,0.3,23.15,1.0,1.0,2.75
3,2,2022-01-01 00:07:42,2022-01-01 00:15:57,N,1.0,181,181,1.0,1.69,8.0,0.5,0.5,0.0,0.0,,0.3,9.3,2.0,1.0,0.0
4,2,2022-01-01 00:07:50,2022-01-01 00:28:52,N,1.0,33,170,1.0,6.26,22.0,0.5,0.5,5.21,0.0,,0.3,31.26,1.0,1.0,2.75
5,1,2022-01-01 00:47:57,2022-01-01 00:54:09,N,1.0,150,210,1.0,1.3,7.0,0.5,0.5,0.0,0.0,,0.3,8.3,2.0,1.0,0.0
6,2,2022-01-01 00:13:38,2022-01-01 00:33:50,N,1.0,66,67,1.0,6.47,22.5,0.5,0.5,0.0,0.0,,0.3,23.8,2.0,1.0,0.0
7,2,2022-01-01 00:43:00,2022-01-01 00:49:20,N,1.0,40,195,1.0,1.15,6.0,0.5,0.5,0.0,0.0,,0.3,7.3,2.0,1.0,0.0
8,2,2022-01-01 00:41:04,2022-01-01 00:47:04,N,1.0,112,80,1.0,1.3,6.0,0.5,0.5,0.0,0.0,,0.3,7.3,2.0,1.0,0.0
9,2,2022-01-01 00:51:07,2022-01-01 01:09:31,N,1.0,256,186,1.0,4.75,17.0,0.5,0.5,4.21,0.0,,0.3,25.26,1.0,1.0,2.75


In [28]:
green_vehicle.tail(10)

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
62485,2,2022-01-31 22:01:00,2022-01-31 22:13:00,,,244,151,,4.69,19.01,0.0,0.0,4.12,0.0,,0.3,26.18,,,
62486,2,2022-01-31 22:54:00,2022-01-31 23:10:00,,,25,188,,3.12,13.9,0.0,0.0,3.16,0.0,,0.3,17.36,,,
62487,2,2022-01-31 23:23:00,2022-01-31 23:39:00,,,179,112,,3.8,16.48,0.0,0.0,3.73,0.0,,0.3,20.51,,,
62488,2,2022-01-31 23:50:00,2022-02-01 00:11:00,,,112,239,,6.04,23.45,0.0,0.0,5.83,0.0,,0.3,32.33,,,
62489,2,2022-01-31 23:19:00,2022-01-31 23:37:00,,,152,233,,6.71,25.4,0.0,0.0,6.27,0.0,,0.3,34.72,,,
62490,2,2022-01-31 23:25:00,2022-01-31 23:33:00,,,40,65,,1.4,8.38,0.0,0.0,1.93,0.0,,0.3,10.61,,,
62491,2,2022-01-31 23:52:00,2022-02-01 00:10:00,,,36,61,,2.97,14.92,0.0,0.0,0.0,0.0,,0.3,15.22,,,
62492,2,2022-01-31 23:17:00,2022-01-31 23:36:00,,,75,167,,3.7,16.26,0.0,0.0,0.0,0.0,,0.3,16.56,,,
62493,2,2022-01-31 23:45:00,2022-01-31 23:55:00,,,116,166,,1.88,9.48,0.0,0.0,2.17,0.0,,0.3,11.95,,,
62494,2,2022-01-31 23:52:00,2022-02-01 00:26:00,,,225,179,,9.6,32.18,0.0,0.0,7.23,10.0,,0.3,49.71,,,


In [29]:
# Example: get high-volume for-hire vehicle (fhvhv) data for January 2022
# Download the data without selecting columns, save the file locally
fhvhv = get_taxi_data('2022', '01', 'fhvhv', save=True)
fhvhv.info()

File not in current folder; trying to download data...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14751591 entries, 0 to 14751590
Data columns (total 24 columns):
 #   Column                Dtype         
---  ------                -----         
 0   hvfhs_license_num     object        
 1   dispatching_base_num  object        
 2   originating_base_num  object        
 3   request_datetime      datetime64[ns]
 4   on_scene_datetime     datetime64[ns]
 5   pickup_datetime       datetime64[ns]
 6   dropoff_datetime      datetime64[ns]
 7   PULocationID          int64         
 8   DOLocationID          int64         
 9   trip_miles            float64       
 10  trip_time             int64         
 11  base_passenger_fare   float64       
 12  tolls                 float64       
 13  bcf                   float64       
 14  sales_tax             float64       
 15  congestion_surcharge  float64       
 16  airport_fee           float64       
 17  tips                  float

In [31]:
fhvhv.head(10)

Unnamed: 0,hvfhs_license_num,dispatching_base_num,originating_base_num,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,...,sales_tax,congestion_surcharge,airport_fee,tips,driver_pay,shared_request_flag,shared_match_flag,access_a_ride_flag,wav_request_flag,wav_match_flag
0,HV0003,B03404,B03404,2022-01-01 00:05:31,2022-01-01 00:05:40,2022-01-01 00:07:24,2022-01-01 00:18:28,170,161,1.18,...,2.21,2.75,0.0,0.0,23.03,N,N,,N,N
1,HV0003,B03404,B03404,2022-01-01 00:19:27,2022-01-01 00:22:08,2022-01-01 00:22:32,2022-01-01 00:30:12,237,161,0.82,...,1.06,2.75,0.0,0.0,12.32,N,N,,N,N
2,HV0003,B03404,B03404,2022-01-01 00:43:53,2022-01-01 00:57:37,2022-01-01 00:57:37,2022-01-01 01:07:32,237,161,1.18,...,2.65,2.75,0.0,0.0,23.3,N,N,,N,N
3,HV0003,B03404,B03404,2022-01-01 00:15:36,2022-01-01 00:17:08,2022-01-01 00:18:02,2022-01-01 00:23:05,262,229,1.65,...,0.7,2.75,0.0,0.0,6.3,N,N,,N,N
4,HV0003,B03404,B03404,2022-01-01 00:25:45,2022-01-01 00:26:01,2022-01-01 00:28:01,2022-01-01 00:35:42,229,141,1.65,...,0.84,2.75,0.0,0.0,7.44,N,N,,N,N
5,HV0003,B03404,B03404,2022-01-01 00:34:44,2022-01-01 00:36:52,2022-01-01 00:38:50,2022-01-01 00:51:32,263,79,4.51,...,1.57,2.75,0.0,0.0,12.25,N,N,,N,N
6,HV0003,B03404,B03404,2022-01-01 00:47:51,2022-01-01 00:52:00,2022-01-01 00:53:25,2022-01-01 01:08:56,113,140,3.68,...,1.48,2.75,0.0,0.0,12.75,N,N,,N,N
7,HV0003,B03404,B03404,2022-01-01 00:06:21,2022-01-01 00:06:58,2022-01-01 00:08:58,2022-01-01 00:23:01,151,75,2.77,...,1.28,0.0,0.0,4.0,11.47,N,N,,N,N
8,HV0003,B03404,B03404,2022-01-01 00:27:54,2022-01-01 00:30:26,2022-01-01 00:32:25,2022-01-01 00:44:15,263,229,2.04,...,0.94,2.75,0.0,0.0,9.55,N,N,,N,N
9,HV0003,B03404,B03404,2022-01-01 00:44:59,2022-01-01 00:48:23,2022-01-01 00:50:23,2022-01-01 01:15:30,237,169,8.79,...,2.45,2.75,0.0,0.0,23.67,N,N,,N,N


In [32]:
fhvhv.tail(10)

Unnamed: 0,hvfhs_license_num,dispatching_base_num,originating_base_num,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,...,sales_tax,congestion_surcharge,airport_fee,tips,driver_pay,shared_request_flag,shared_match_flag,access_a_ride_flag,wav_request_flag,wav_match_flag
14751581,HV0003,B03404,B03404,2022-01-31 23:15:36,2022-01-31 23:19:05,2022-01-31 23:19:05,2022-01-31 23:33:23,163,244,7.57,...,1.64,2.75,0.0,0.0,15.87,N,N,,N,N
14751582,HV0003,B03404,B03404,2022-01-31 23:33:34,2022-01-31 23:34:20,2022-01-31 23:36:02,2022-01-31 23:50:15,244,47,3.05,...,1.44,0.0,0.0,0.0,10.85,N,N,,N,N
14751583,HV0003,B03404,B03404,2022-01-31 22:57:18,2022-01-31 23:07:52,2022-01-31 23:09:52,2022-01-31 23:19:46,86,86,2.05,...,0.85,0.0,0.0,0.0,8.51,N,N,,N,N
14751584,HV0003,B03404,B03404,2022-01-31 23:23:00,2022-01-31 23:24:44,2022-01-31 23:26:37,2022-01-31 23:34:37,86,117,1.3,...,0.7,0.0,0.0,0.0,6.73,N,N,,N,N
14751585,HV0003,B03404,B03404,2022-01-31 23:33:19,2022-01-31 23:40:56,2022-01-31 23:41:58,2022-01-31 23:47:44,86,86,1.53,...,0.64,0.0,0.0,0.0,6.68,N,N,,N,N
14751586,HV0003,B03404,B03404,2022-01-31 23:22:16,2022-01-31 23:26:04,2022-01-31 23:27:20,2022-01-31 23:40:46,77,71,2.59,...,1.27,0.0,0.0,0.0,9.9,N,N,,N,N
14751587,HV0003,B03404,B03404,2022-01-31 23:42:30,2022-01-31 23:45:08,2022-01-31 23:45:46,2022-01-31 23:59:44,72,72,1.56,...,0.92,0.0,0.0,0.0,9.03,N,N,,N,N
14751588,HV0003,B03404,B03404,2022-01-31 22:56:50,2022-01-31 23:03:17,2022-01-31 23:03:25,2022-01-31 23:17:17,136,20,1.23,...,0.7,0.0,0.0,0.0,8.73,N,N,,N,N
14751589,HV0003,B03404,B03404,2022-01-31 23:15:07,2022-01-31 23:19:25,2022-01-31 23:20:26,2022-01-31 23:30:26,20,136,1.69,...,0.83,0.0,0.0,0.0,7.3,N,N,,N,N
14751590,HV0003,B03404,B03404,2022-01-31 23:33:24,2022-01-31 23:36:13,2022-01-31 23:38:13,2022-02-01 00:07:24,136,82,14.7,...,3.01,0.0,0.0,0.0,31.28,N,N,,N,N


In [53]:
# Example: get high-volume for-hire vehicle (fhvhv) data for March 2022
# Download the data without selecting columns, save the file locally
fhvhv = get_taxi_data('2022', '03', 'fhvhv', save=True)
fhvhv.info()

File not in current folder; trying to download data...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18453548 entries, 0 to 18453547
Data columns (total 24 columns):
 #   Column                Dtype         
---  ------                -----         
 0   hvfhs_license_num     object        
 1   dispatching_base_num  object        
 2   originating_base_num  object        
 3   request_datetime      datetime64[ns]
 4   on_scene_datetime     datetime64[ns]
 5   pickup_datetime       datetime64[ns]
 6   dropoff_datetime      datetime64[ns]
 7   PULocationID          int64         
 8   DOLocationID          int64         
 9   trip_miles            float64       
 10  trip_time             int64         
 11  base_passenger_fare   float64       
 12  tolls                 float64       
 13  bcf                   float64       
 14  sales_tax             float64       
 15  congestion_surcharge  float64       
 16  airport_fee           float64       
 17  tips                  float

In [36]:
fhvhv.tail(10)

Unnamed: 0,hvfhs_license_num,dispatching_base_num,originating_base_num,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,...,sales_tax,congestion_surcharge,airport_fee,tips,driver_pay,shared_request_flag,shared_match_flag,access_a_ride_flag,wav_request_flag,wav_match_flag
16019273,HV0003,B03404,B03404,2022-02-28 23:15:00,2022-02-28 23:03:45,2022-02-28 23:04:18,2022-02-28 23:14:26,113,4,1.57,...,0.96,2.75,0.0,0.0,6.82,N,N,,N,N
16019274,HV0003,B03404,B03404,2022-02-28 23:16:00,2022-02-28 23:18:13,2022-02-28 23:20:13,2022-02-28 23:42:06,4,238,6.02,...,2.29,2.75,0.0,3.15,18.56,N,N,,N,N
16019275,HV0003,B03404,B03404,2022-02-28 23:45:23,2022-02-28 23:51:33,2022-02-28 23:53:33,2022-03-01 00:10:26,151,247,4.4,...,1.82,0.0,0.0,0.0,15.35,N,N,,N,N
16019276,HV0003,B03404,B03404,2022-02-28 23:29:21,2022-02-28 23:34:00,2022-02-28 23:34:14,2022-02-28 23:39:39,142,163,0.68,...,1.26,2.75,0.0,0.0,11.33,N,N,,N,N
16019277,HV0003,B03404,B03404,2022-02-28 23:20:02,2022-02-28 23:20:20,2022-02-28 23:22:01,2022-02-28 23:28:23,211,209,1.14,...,1.51,2.75,0.0,0.0,11.57,N,N,,N,N
16019278,HV0005,B03406,,2022-02-28 23:53:03,NaT,2022-02-28 23:55:36,2022-03-01 00:22:34,90,265,4.655,...,0.0,0.0,0.0,0.0,31.77,N,N,N,N,N
16019279,HV0003,B03404,B03404,2022-02-28 22:58:07,2022-02-28 22:59:48,2022-02-28 23:00:19,2022-02-28 23:04:59,234,249,0.98,...,1.26,2.75,0.0,3.52,10.62,N,N,,N,N
16019280,HV0003,B03404,B03404,2022-02-28 23:15:58,2022-02-28 23:18:36,2022-02-28 23:19:10,2022-02-28 23:33:12,234,162,2.53,...,1.93,2.75,0.0,0.0,19.26,N,N,,N,N
16019281,HV0003,B03404,B03404,2022-02-28 23:45:41,2022-02-28 23:47:40,2022-02-28 23:50:22,2022-02-28 23:59:57,161,249,2.05,...,1.54,2.75,0.0,0.0,15.48,N,N,,N,N
16019282,HV0003,B03404,B03404,2022-02-28 23:10:56,2022-02-28 23:11:59,2022-02-28 23:13:59,2022-02-28 23:44:49,163,265,16.6,...,5.43,2.75,0.0,0.0,42.5,N,N,,N,N


In [30]:
# Example: get for-hire vehicle (fhv) data for January 2022
# Download the data without selecting columns, save the file locally
fhv = get_taxi_data('2022', '01', 'fhv', save=True)
fhv.info()

File not in current folder; trying to download data...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143691 entries, 0 to 1143690
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   dispatching_base_num    1143691 non-null  object        
 1   pickup_datetime         1143691 non-null  datetime64[ns]
 2   dropOff_datetime        1143691 non-null  datetime64[ns]
 3   PUlocationID            267997 non-null   float64       
 4   DOlocationID            1012291 non-null  float64       
 5   SR_Flag                 0 non-null        object        
 6   Affiliated_base_number  1143691 non-null  object        
dtypes: datetime64[ns](2), float64(2), object(3)
memory usage: 61.1+ MB


In [43]:
# Example: get for-hire vehicle (fhv) data for February 2022
# Download the data without selecting columns, save the file locally
fhv = get_taxi_data('2022', '02', 'fhv', save=True)
fhv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1251504 entries, 0 to 1251503
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   dispatching_base_num    1251504 non-null  object        
 1   pickup_datetime         1251504 non-null  datetime64[ns]
 2   dropOff_datetime        1251504 non-null  datetime64[ns]
 3   PUlocationID            281584 non-null   float64       
 4   DOlocationID            1064403 non-null  float64       
 5   SR_Flag                 0 non-null        object        
 6   Affiliated_base_number  1251504 non-null  object        
dtypes: datetime64[ns](2), float64(2), object(3)
memory usage: 66.8+ MB


In [33]:
fhv.head(10)

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number
0,B00009,2022-01-01 00:31:00,2022-01-01 01:05:00,,,,B00009
1,B00009,2022-01-01 00:37:00,2022-01-01 01:05:00,,,,B00009
2,B00037,2022-01-01 00:56:37,2022-01-01 01:06:11,,85.0,,B00037
3,B00037,2022-01-01 00:19:54,2022-01-01 00:30:47,,85.0,,B00037
4,B00037,2022-01-01 00:41:49,2022-01-01 00:52:16,,188.0,,B00037
5,B00037,2022-01-01 00:21:32,2022-01-01 00:35:06,,61.0,,B00037
6,B00037,2022-01-01 00:51:19,2022-01-01 01:08:06,,76.0,,B00037
7,B00111,2022-01-01 00:30:00,2022-01-01 01:41:00,,,,B03406
8,B00112,2022-01-01 00:31:30,2022-01-01 01:10:06,,67.0,,B00112
9,B00112,2022-01-01 00:12:26,2022-01-01 00:37:22,,155.0,,B00112


In [34]:
fhv.tail(10)

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number
1143681,B03380,2022-01-31 23:39:32,2022-01-31 23:47:43,246.0,158.0,,B03380
1143682,B03380,2022-01-31 23:52:52,2022-02-01 00:03:14,158.0,107.0,,B03380
1143683,B03380,2022-01-31 23:24:44,2022-01-31 23:35:46,231.0,4.0,,B03380
1143684,B03380,2022-01-31 23:21:35,2022-01-31 23:32:16,229.0,48.0,,B03380
1143685,B03380,2022-01-31 23:02:50,2022-01-31 23:20:07,142.0,113.0,,B03380
1143686,B03380,2022-01-31 23:22:41,2022-01-31 23:26:39,234.0,107.0,,B03380
1143687,B03380,2022-01-31 23:42:42,2022-01-31 23:52:58,114.0,148.0,,B03380
1143688,B03380,2022-01-31 23:07:13,2022-01-31 23:13:40,90.0,113.0,,B03380
1143689,B03380,2022-01-31 23:16:14,2022-01-31 23:31:03,113.0,140.0,,B03380
1143690,B03381,2022-01-31 23:47:42,2022-02-01 00:15:03,,122.0,,B03404
