# 📕 05 - Plot Training Data

## Introduction

After the data collection, validation, and transformation steps in our previous notebooks, we take one last look at the data, this time visually. This exploration has two main purposes:

1. **Insight Discovery:** Charts and graphs can reveal patterns, outliers, or anomalies that are not evident from raw tables or basic statistics. They help us understand the nuances of our dataset, which pays off when we build and refine our models later.

2. **Validation & Quality Assurance:** This is our last step before we start model building, so we need to confirm that our transformations and processing steps have not introduced errors. Mistakes can happen, and many of them are easier to catch visually than by scanning rows of numbers. This step is a "last line of defense" against potential issues in our dataset.

In [1]:
# import libraries
import pandas as pd
from src.paths import TRANSFORMED_DATA_DIR
from src.plot import plot_one_sample

# load features and target data
features_and_target = pd.read_parquet(TRANSFORMED_DATA_DIR / 'tabular_data.parquet')
features_and_target.head()

| | rides_previous_672_hour | rides_previous_671_hour | rides_previous_670_hour | rides_previous_669_hour | rides_previous_668_hour | rides_previous_667_hour | rides_previous_666_hour | rides_previous_665_hour | rides_previous_664_hour | rides_previous_663_hour | ... | rides_previous_7_hour | rides_previous_6_hour | rides_previous_5_hour | rides_previous_4_hour | rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id | target_rides_next_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2022-01-29 00:00:00 | 1 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2022-01-29 01:00:00 | 1 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 | 2.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2022-01-29 02:00:00 | 1 | 0.0 |
| 3 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2022-01-29 03:00:00 | 1 | 0.0 |
| 4 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2022-01-29 04:00:00 | 1 | 0.0 |
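
In the preview above, each row looks like the previous row shifted left by one hour: the value under `rides_previous_671_hour` in row 0 reappears under `rides_previous_672_hour` in row 1, and so on. If the table really is a one-hour sliding window per location (an assumption read off the preview, not something the notebook states), that property can be checked numerically before plotting. The helper name `count_shift_violations` and the toy data are made up for this sketch.

```python
import pandas as pd


def count_shift_violations(df: pd.DataFrame, n_lags: int) -> int:
    """Count consecutive-hour row pairs whose lag features are not a one-step shift.

    Hypothetical helper: assumes columns named rides_previous_<k>_hour,
    pickup_hour and pickup_location_id, as in the preview above.
    """
    # Lag columns ordered from oldest (n_lags) to newest (1)
    lag_cols = [f"rides_previous_{k}_hour" for k in range(n_lags, 0, -1)]
    violations = 0
    for _, group in df.groupby("pickup_location_id"):
        group = group.sort_values("pickup_hour")
        values = group[lag_cols].to_numpy()
        # True where a row is exactly one hour after the row before it
        consecutive = (group["pickup_hour"].diff() == pd.Timedelta(hours=1)).to_numpy()[1:]
        # Row i without its oldest lag should equal row i+1 without its newest lag
        # (missing values never compare equal, so they are counted as mismatches)
        mismatch = (values[:-1, 1:] != values[1:, :-1]).any(axis=1)
        violations += int((consecutive & mismatch).sum())
    return violations


# Toy data with 3 lags instead of 672, built from the hourly counts 0, 1, 1, 0, 2
toy = pd.DataFrame({
    "rides_previous_3_hour": [0.0, 1.0, 1.0],
    "rides_previous_2_hour": [1.0, 1.0, 0.0],
    "rides_previous_1_hour": [1.0, 0.0, 2.0],
    "pickup_hour": pd.to_datetime(
        ["2022-01-29 03:00:00", "2022-01-29 04:00:00", "2022-01-29 05:00:00"]
    ),
    "pickup_location_id": [1, 1, 1],
})
n_violations = count_shift_violations(toy, n_lags=3)
print(n_violations)
```

On the real table the call would be `count_shift_violations(features_and_target, n_lags=672)`. A non-zero count would not prove a bug, since the window step may differ from one hour, but it would be worth investigating.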


In [2]:
# split into features and target
features = features_and_target.drop(columns=['target_rides_next_hour'])
targets = features_and_target['target_rides_next_hour']
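
Before plotting, a few cheap numeric checks on the split can complement the visual inspection. The sketch below is not part of the original notebook: `basic_quality_report` is a hypothetical helper, and it only relies on the fact that ride counts cannot be negative and that features and targets should stay aligned row by row.

```python
import pandas as pd


def basic_quality_report(features: pd.DataFrame, targets: pd.Series) -> dict:
    """Summarize a few simple properties worth confirming before plotting."""
    # Restrict the sign check to numeric columns (timestamps are skipped)
    numeric = features.select_dtypes(include="number")
    return {
        "same_length": len(features) == len(targets),
        "same_index": features.index.equals(targets.index),
        "missing_feature_values": int(features.isna().sum().sum()),
        "missing_targets": int(targets.isna().sum()),
        "negative_feature_values": int((numeric < 0).sum().sum()),
        "negative_targets": int((targets < 0).sum()),
    }


# Toy inputs with one deliberately missing feature and one negative target
toy_features = pd.DataFrame({
    "rides_previous_2_hour": [0.0, 1.0],
    "rides_previous_1_hour": [1.0, float("nan")],
    "pickup_location_id": [1, 1],
})
toy_targets = pd.Series([0.0, -1.0], name="target_rides_next_hour")

report = basic_quality_report(toy_features, toy_targets)
print(report)
```

With the notebook's own variables this would be `basic_quality_report(features, targets)`. Any non-zero count of missing or negative values is a reason to go back to the transformation step.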

In [3]:
# plot one sample
plot_one_sample(
    example_id=0,
    features=features,
    targets=targets,
)
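
The implementation of `plot_one_sample` lives in `src.plot` and is not shown in this notebook. As a rough illustration of what plotting one sample involves, the sketch below rebuilds the hourly time axis for a single row. It is not the project's actual code: `sample_to_series` is a hypothetical name, and the sketch assumes that a lag of `k` hours sits `k` hours before `pickup_hour` and that the target is the ride count at `pickup_hour` itself.

```python
import pandas as pd


def sample_to_series(features: pd.DataFrame, targets: pd.Series, example_id: int) -> pd.Series:
    """Return one example as an hourly series: lag features followed by the target."""
    row = features.iloc[example_id]
    lag_cols = [c for c in features.columns if c.startswith("rides_previous_")]
    # Sort from oldest to newest lag using the number embedded in the column name
    lag_cols.sort(key=lambda c: int(c.split("_")[2]), reverse=True)
    lags = [int(c.split("_")[2]) for c in lag_cols]

    pickup_hour = pd.Timestamp(row["pickup_hour"])
    # Assumption: a lag of k hours corresponds to pickup_hour minus k hours
    timestamps = [pickup_hour - pd.Timedelta(hours=k) for k in lags]
    values = [float(row[c]) for c in lag_cols]

    # Assumption: the target is the ride count at pickup_hour itself
    timestamps.append(pickup_hour)
    values.append(float(targets.iloc[example_id]))
    return pd.Series(values, index=pd.DatetimeIndex(timestamps), name="rides")


# Toy data with 3 lags instead of 672
toy_features = pd.DataFrame({
    "rides_previous_3_hour": [0.0, 1.0],
    "rides_previous_2_hour": [1.0, 1.0],
    "rides_previous_1_hour": [1.0, 0.0],
    "pickup_hour": pd.to_datetime(["2022-01-29 03:00:00", "2022-01-29 04:00:00"]),
    "pickup_location_id": [1, 1],
})
toy_targets = pd.Series([0.0, 2.0], name="target_rides_next_hour")

series = sample_to_series(toy_features, toy_targets, example_id=0)
print(series)
```

If matplotlib is installed, `series.plot(marker="o")` draws the result. The last point is the target, so it is worth checking that it continues the preceding history plausibly.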