# VIEW DATA
This notebook loads and inspects the raw NYC TLC Yellow Taxi trip data for January 2021.


In [1]:
import pandas as pd

## VIEW JANUARY'S RAW DATA


In [2]:
df1 = pd.read_parquet("../raw/yellow_tripdata_2021-01.parquet")
print("Raw data from January has shape:", df1.shape)


Raw data from January has shape: (1369769, 19)


In [3]:
print("First ten rows of the data: ")
df1.head(10)

First ten rows of the data: 


,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2021-01-01 00:30:10,2021-01-01 00:36:12,1.0,2.1,1.0,N,142,43,2,8.0,3.0,0.5,0.0,0.0,0.3,11.8,2.5,
1,1,2021-01-01 00:51:20,2021-01-01 00:52:19,1.0,0.2,1.0,N,238,151,2,3.0,0.5,0.5,0.0,0.0,0.3,4.3,0.0,
2,1,2021-01-01 00:43:30,2021-01-01 01:11:06,1.0,14.7,1.0,N,132,165,1,42.0,0.5,0.5,8.65,0.0,0.3,51.95,0.0,
3,1,2021-01-01 00:15:48,2021-01-01 00:31:01,0.0,10.6,1.0,N,138,132,1,29.0,0.5,0.5,6.05,0.0,0.3,36.35,0.0,
4,2,2021-01-01 00:31:49,2021-01-01 00:48:21,1.0,4.94,1.0,N,68,33,1,16.5,0.5,0.5,4.06,0.0,0.3,24.36,2.5,
5,1,2021-01-01 00:16:29,2021-01-01 00:24:30,1.0,1.6,1.0,N,224,68,1,8.0,3.0,0.5,2.35,0.0,0.3,14.15,2.5,
6,1,2021-01-01 00:00:28,2021-01-01 00:17:28,1.0,4.1,1.0,N,95,157,2,16.0,0.5,0.5,0.0,0.0,0.3,17.3,0.0,
7,1,2021-01-01 00:12:29,2021-01-01 00:30:34,1.0,5.7,1.0,N,90,40,2,18.0,3.0,0.5,0.0,0.0,0.3,21.8,2.5,
8,1,2021-01-01 00:39:16,2021-01-01 01:00:13,1.0,9.1,1.0,N,97,129,4,27.5,0.5,0.5,0.0,0.0,0.3,28.8,0.0,
9,1,2021-01-01 00:26:12,2021-01-01 00:39:46,2.0,2.7,1.0,N,263,142,1,12.0,3.0,0.5,3.15,0.0,0.3,18.95,2.5,


#### Findings from `.info()`

* **Dtypes:** Data types are mostly correct. `tpep_pickup_datetime` and `tpep_dropoff_datetime` are already `datetime64[ns]`, so no parsing is needed. `passenger_count` and `RatecodeID` are stored as `float64` although their values are integer-like.
* **Null Values:**
    * `airport_fee`: Almost entirely null (only 5 non-null values out of 1,369,769 rows).
    * `passenger_count`, `RatecodeID`, `store_and_fwd_flag`, `congestion_surcharge`: Each of these columns has the same number of missing values (98,352, or about 7.2% of the data).

-> Action: Handle these missing values in the QA step.
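
The null findings above can be double-checked with a small helper. The sketch below is illustrative only: `null_summary`, `nulls_co_occur`, and the toy `demo` frame are hypothetical names that are not part of this notebook, and the toy values stand in for `df1`. It reports null counts per column and tests whether a set of columns is missing on exactly the same rows, which equal counts alone do not prove.

```python
import numpy as np
import pandas as pd


def null_summary(df):
    """Return per-column null counts and null percentages, highest first."""
    n_null = df.isna().sum()
    summary = pd.DataFrame({
        "n_null": n_null,
        "pct_null": (n_null / len(df) * 100).round(2),
    })
    return summary.sort_values("n_null", ascending=False)


def nulls_co_occur(df, cols):
    """True if every row is null in all of `cols` or in none of them."""
    n_null_per_row = df[cols].isna().sum(axis=1)
    return bool(n_null_per_row.isin([0, len(cols)]).all())


# Toy stand-in for df1 (assumed values, not real TLC records).
demo = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 2.0, np.nan],
    "RatecodeID": [1.0, np.nan, 1.0, np.nan],
    "store_and_fwd_flag": ["N", None, "N", None],
    "congestion_surcharge": [2.5, np.nan, 0.0, np.nan],
    "trip_distance": [2.1, 0.2, 14.7, 10.6],
})
cols = ["passenger_count", "RatecodeID", "store_and_fwd_flag", "congestion_surcharge"]

print(null_summary(demo))
print("Nulls fall on the same rows:", nulls_co_occur(demo, cols))
```

Calling `null_summary(df1)` and `nulls_co_occur(df1, cols)` on the real data would show whether the four columns are missing together, which may inform whether the QA step can treat those rows as one group.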

In [4]:
print("Column names, non-null counts, and dtypes for the January data:")
df1.info()

Column names, non-null counts, and dtypes for the January data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1369769 entries, 0 to 1369768
Data columns (total 19 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   VendorID               1369769 non-null  int64         
 1   tpep_pickup_datetime   1369769 non-null  datetime64[ns]
 2   tpep_dropoff_datetime  1369769 non-null  datetime64[ns]
 3   passenger_count        1271417 non-null  float64       
 4   trip_distance          1369769 non-null  float64       
 5   RatecodeID             1271417 non-null  float64       
 6   store_and_fwd_flag     1271417 non-null  object        
 7   PULocationID           1369769 non-null  int64         
 8   DOLocationID           1369769 non-null  int64         
 9   payment_type           1369769 non-null  int64         
 10  fare_amount            1369769 non-null  float64       
 11  extr

#### Findings from `.describe()`

The summary statistics reveal several anomalies and outliers that will define our QA rules:
1.  **`passenger_count`**: `min = 0.0`. (Invalid value -> Rule: Must be > 0).
2.  **`trip_distance`**: `min = 0.0` and `max = 263163.3`, which is implausibly large. (Invalid values -> Rule: Must be > 0 and below a reasonable threshold).
3.  **`fare_amount`**: `min = -490.0`. (Invalid value -> Rule: Must be > 0).
4.  **`total_amount`**: `min = -492.8`. (Invalid value -> Rule: Must be > 0).
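
As a rough sketch of how the four rules above might be expressed in pandas, the block below builds one boolean mask per rule. The function name `qa_rule_masks`, the toy `demo` frame, and especially the `MAX_TRIP_DISTANCE` cutoff of 100 are assumptions made for illustration; the notebook does not fix a threshold. Note that comparisons against null values evaluate to `False`, so rows with a missing `passenger_count` are counted as violations here.

```python
import pandas as pd

# Hypothetical upper bound for trip_distance; the real threshold
# would be chosen in the QA step.
MAX_TRIP_DISTANCE = 100.0


def qa_rule_masks(df, max_trip_distance=MAX_TRIP_DISTANCE):
    """Return one boolean column per QA rule (True means the row passes)."""
    return pd.DataFrame({
        "passenger_count_gt_0": df["passenger_count"] > 0,
        "trip_distance_in_range": df["trip_distance"].gt(0)
        & df["trip_distance"].lt(max_trip_distance),
        "fare_amount_gt_0": df["fare_amount"] > 0,
        "total_amount_gt_0": df["total_amount"] > 0,
    })


# Toy stand-in for df1 using values seen in the outputs of this notebook.
demo = pd.DataFrame({
    "passenger_count": [1.0, 0.0, 2.0, None],
    "trip_distance": [2.1, 10.6, 263163.3, 0.0],
    "fare_amount": [8.0, 29.0, -490.0, 12.0],
    "total_amount": [11.8, 36.35, -492.8, 18.95],
})

masks = qa_rule_masks(demo)
violations = (~masks).sum()          # number of failing rows per rule
clean = demo[masks.all(axis=1)]      # rows that pass every rule

print(violations)
print("Rows passing all rules:", len(clean))
```

Keeping the masks separate, rather than filtering immediately, makes it possible to report how many rows each rule would remove before any data is dropped.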

In [5]:
print("Summary statistics for the numeric columns:")
df1.describe()

Summary statistics for the numeric columns:


,VendorID,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
count,1369769.0,1271417.0,1369769.0,1271417.0,1369769.0,1369769.0,1369769.0,1369769.0,1369769.0,1369769.0,1369769.0,1369769.0,1369769.0,1369769.0,1271417.0,5.0
mean,1.721725,1.411508,4.631983,1.035081,165.2474,161.4957,1.188578,12.09663,0.9705133,0.4930412,1.918098,0.2477473,0.2969412,17.4744,2.239047,0.0
std,0.5925347,1.059831,393.9037,0.599483,67.83854,72.10795,0.5776546,12.91337,1.231258,0.07632059,2.597151,1.672761,0.04222168,14.69342,0.7989435,0.0
min,1.0,0.0,0.0,1.0,1.0,1.0,0.0,-490.0,-5.5,-0.5,-100.0,-31.12,-0.3,-492.8,-2.5,0.0
25%,1.0,1.0,1.0,1.0,124.0,107.0,1.0,6.0,0.0,0.5,0.0,0.0,0.3,10.8,2.5,0.0
50%,2.0,1.0,1.7,1.0,162.0,162.0,1.0,8.5,0.0,0.5,1.86,0.0,0.3,13.8,2.5,0.0
75%,2.0,1.0,3.02,1.0,236.0,236.0,1.0,13.5,2.5,0.5,2.75,0.0,0.3,19.12,2.5,0.0
max,6.0,8.0,263163.3,99.0,265.0,265.0,4.0,6960.5,8.25,0.5,1140.44,811.75,0.3,7661.28,3.0,0.0


In [6]:
print("Columns in the January data:")
df1.columns

Columns in the January data:


Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee'],
      dtype='object')

## VIEW TAXI ZONE LOOKUP TABLE

In [7]:
tzl = pd.read_csv("../raw/taxi_zone_lookup.csv")
print(tzl.shape)
tzl.head(5)

(265, 4)


,LocationID,Borough,Zone,service_zone
0,1,EWR,Newark Airport,EWR
1,2,Queens,Jamaica Bay,Boro Zone
2,3,Bronx,Allerton/Pelham Gardens,Boro Zone
3,4,Manhattan,Alphabet City,Yellow Zone
4,5,Staten Island,Arden Heights,Boro Zone


In [8]:
tzl.describe()

,LocationID
count,265.0
mean,133.0
std,76.643112
min,1.0
25%,67.0
50%,133.0
75%,199.0
max,265.0


In [9]:
tzl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   LocationID    265 non-null    int64 
 1   Borough       264 non-null    object
 2   Zone          264 non-null    object
 3   service_zone  263 non-null    object
dtypes: int64(1), object(3)
memory usage: 8.4+ KB


## VIEW DATA IN SPECIFIC COLUMNS


In [10]:
df1.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'airport_fee'],
      dtype='object')

In [11]:
print("Value counts for the VendorID column:")
df1['VendorID'].value_counts()

Value counts for the VendorID column:


2    937141
1    422337
6     10291
Name: VendorID, dtype: int64

In [12]:
print("Value counts for the payment_type column:")
df1['payment_type'].value_counts()

Value counts for the payment_type column:


1    934475
2    322891
0     98352
3      8384
4      5667
Name: payment_type, dtype: int64

In [13]:
print("Summary statistics for the passenger_count column:")
df1['passenger_count'].describe()

Summary statistics for the passenger_count column:


count    1.271417e+06
mean     1.411508e+00
std      1.059831e+00
min      0.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      8.000000e+00
Name: passenger_count, dtype: float64

In [14]:
print("Summary statistics for the extra column:")
df1['extra'].describe()

Summary statistics for the extra column:


count    1.369769e+06
mean     9.705133e-01
std      1.231258e+00
min     -5.500000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      2.500000e+00
max      8.250000e+00
Name: extra, dtype: float64

In [15]:
print("Unique values in Airport fee column are:")
list(df1['airport_fee'].unique())

Unique values in Airport fee column are:


[nan, 0.0]

In [16]:
print("Counts of non-null (True) and null (False) values in the store_and_fwd_flag column:")
df1['store_and_fwd_flag'].notna().value_counts()

Counts of non-null (True) and null (False) values in the store_and_fwd_flag column:


True     1271417
False      98352
Name: store_and_fwd_flag, dtype: int64