# VIEW DATA
This notebook views and inspects the raw data from January 2021.


In [None]:
import pandas as pd

## VIEW JANUARY'S RAW DATA


In [None]:
df1 = pd.read_parquet("../raw/yellow_tripdata_2021-01.parquet")
print("Raw data from January has shape:", df1.shape)


In [None]:
print("First ten rows of the data: ")
df1.head(10)

In [None]:
print("Column names, non-null counts, and dtypes for one month of data:")
df1.info()

#### Findings from `.info()`

* **Dtypes:** Data types are mostly correct. `tpep_pickup_datetime` and `tpep_dropoff_datetime` are already proper `datetime64[ns]` objects.
* **Null Values:**
    * `airport_fee`: Almost 100% null (only 5 non-null values)
    * `passenger_count`, `RatecodeID`, `store_and_fwd_flag`, `congestion_surcharge`: These columns share the same number of missing values (~98k, or about 7% of the data).

-> Action: These nulls need to be handled in the QA step.
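
The null findings above could be double-checked with a sketch like the following. It uses a small made-up frame named `toy` as a stand-in for `df1` (the values are illustrative, not taken from the real data), reports null counts and percentages per column, and tests whether the four columns are missing on the same rows. An equal null count alone does not prove that.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1; the values are made up for illustration only.
toy = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 2.0, np.nan],
    "RatecodeID": [1.0, np.nan, 1.0, np.nan],
    "store_and_fwd_flag": ["N", None, "N", None],
    "congestion_surcharge": [2.5, np.nan, 2.5, np.nan],
    "airport_fee": [np.nan, np.nan, np.nan, np.nan],
})

# Null count and null percentage for every column.
null_summary = pd.DataFrame({
    "null_count": toy.isna().sum(),
    "null_pct": (toy.isna().mean() * 100).round(1),
})
print(null_summary)

# Check whether the four columns are null on exactly the same rows:
# on every row, either all four are null or none of them is.
cols = ["passenger_count", "RatecodeID", "store_and_fwd_flag", "congestion_surcharge"]
na_mask = toy[cols].isna()
same_rows = bool((na_mask.all(axis=1) == na_mask.any(axis=1)).all())
print("Nulls fall on the same rows:", same_rows)
```

On the real data, replacing `toy` with `df1` should be enough. If `same_rows` comes back `True`, the four columns can probably be treated as one group in the QA step.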

In [None]:
print("Info about numerical columns in data:")
df1.describe()

#### Findings from `.describe()`

The summary statistics reveal several anomalies and outliers that will define our QA rules:
1.  **`passenger_count`**: `min = 0.0`. (An invalid value -> Rule: Must be > 0).
2.  **`trip_distance`**: `min = 0.0` and `max` is extremely large (263k). (Invalid values -> Rule: Must be > 0 and below a reasonable threshold).
3.  **`fare_amount`**: `min = -490.0`. (Invalid value -> Rule: Must be > 0).
4.  **`total_amount`**: `min = -492.8`. (Invalid value -> Rule: Must be > 0).
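
The four rules above might be expressed as boolean masks, as in the sketch below. It is not the actual QA step: `toy` is a made-up frame standing in for `df1`, and `MAX_TRIP_DISTANCE` is a hypothetical threshold chosen only for illustration, since the notebook does not fix a value.

```python
import pandas as pd

# Hypothetical upper bound; the notebook only asks for "a reasonable threshold".
MAX_TRIP_DISTANCE = 100.0

# Made-up rows that each trigger one of the anomalies listed above.
toy = pd.DataFrame({
    "passenger_count": [1.0, 0.0, 2.0, 1.0, 1.0],
    "trip_distance": [2.5, 1.0, 263000.0, 0.0, 3.2],
    "fare_amount": [10.0, 8.0, 12.0, 5.0, -490.0],
    "total_amount": [12.8, 10.3, 15.0, 7.3, -492.8],
})

# One boolean column per rule. Comparisons against NaN evaluate to False,
# so rows with missing values would also be flagged as invalid here.
rules = pd.DataFrame({
    "passenger_count_ok": toy["passenger_count"] > 0,
    "trip_distance_ok": (toy["trip_distance"] > 0)
    & (toy["trip_distance"] < MAX_TRIP_DISTANCE),
    "fare_amount_ok": toy["fare_amount"] > 0,
    "total_amount_ok": toy["total_amount"] > 0,
})

# A row is valid only if it passes every rule.
valid = rules.all(axis=1)

# Number of rows failing each rule.
violations = (~rules).sum()
print(violations)
print("Valid rows:", int(valid.sum()), "of", len(toy))
```

Keeping one column per rule makes it easy to see which rule rejects the most rows before deciding whether to drop or repair them.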

In [None]:
print("Columns in one month of data:")
df1.columns

## VIEW TAXI ZONE LOOKUP TABLE

In [None]:
tzl = pd.read_csv("../raw/taxi_zone_lookup.csv")
print(tzl.shape)
tzl.head(5)

In [None]:
tzl.describe()

In [None]:
tzl.info()

## VIEW DATA OF SOME SPECIFIC COLUMNS


In [None]:
df1.columns

In [None]:
print("Value counts for the VendorID column:")
df1['VendorID'].value_counts()

In [None]:
print("Value counts for the payment_type column:")
df1['payment_type'].value_counts()

In [None]:
print("Summary statistics for the passenger_count column:")
df1['passenger_count'].describe()

In [None]:
print("Summary statistics for the extra column:")
df1['extra'].describe()

In [None]:
print("Unique values in the airport_fee column:")
list(df1['airport_fee'].unique())

In [None]:
print("Counts of non-null (True) and null (False) values in the store_and_fwd_flag column:")
df1['store_and_fwd_flag'].notna().value_counts()