# Data Analysis and Pipeline Verification

This notebook analyzes the flight price prediction dataset at each stage of the MLOps pipeline: raw, bronze, silver, and gold. The goal is to perform a basic data analysis at each stage and verify that the corresponding data processing pipeline has worked as expected.

In [20]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings("ignore")

## 1. Raw Data Analysis

We start by loading the raw dataset and performing a preliminary analysis.

In [21]:
from shared.config import core_paths, config_bronze, config_gold, config_silver

In [22]:
raw_df = pd.read_csv(core_paths.RAW_DATA_DIR / "train_validation_test" / "train.csv")
raw_df.head()

Unnamed: 0,travelCode,userCode,from,to,flightType,price,time,distance,agency,date
0,0,0,Recife (PE),Florianopolis (SC),firstClass,1434.38,1.76,676.53,FlyingDrops,2019-09-26
1,121138,1202,Florianopolis (SC),Natal (RN),firstClass,1315.27,1.84,709.37,CloudFy,2019-09-26
2,132076,1301,Florianopolis (SC),Salvador (BH),premium,1311.38,2.44,937.77,CloudFy,2019-09-26
3,28904,276,Recife (PE),Rio de Janeiro (RJ),economic,908.93,2.3,885.57,Rainbow,2019-09-26
4,88695,877,Aracaju (SE),Natal (RN),firstClass,598.61,0.46,176.33,CloudFy,2019-09-26


In [23]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190321 entries, 0 to 190320
Data columns (total 10 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   travelCode  190321 non-null  int64  
 1   userCode    190321 non-null  int64  
 2   from        190321 non-null  object 
 3   to          190321 non-null  object 
 4   flightType  190321 non-null  object 
 5   price       190321 non-null  float64
 6   time        190321 non-null  float64
 7   distance    190321 non-null  float64
 8   agency      190321 non-null  object 
 9   date        190321 non-null  object 
dtypes: float64(3), int64(2), object(5)
memory usage: 14.5+ MB


In [24]:
raw_df.describe()

Unnamed: 0,travelCode,userCode,price,time,distance
count,190321.0,190321.0,190321.0,190321.0,190321.0
mean,67919.412272,667.159567,955.601391,1.418343,545.875221
std,39051.352769,387.624777,362.432241,0.543563,209.246707
min,0.0,0.0,301.51,0.44,168.22
25%,34245.0,328.0,672.66,1.02,392.76
50%,68332.0,662.0,899.6,1.46,562.14
75%,101679.0,1009.0,1222.24,1.76,676.53
max,135943.0,1339.0,1754.17,2.44,937.77


In [25]:
raw_df.describe(include="object")

Unnamed: 0,from,to,flightType,agency,date
count,190321,190321,190321,190321,190321
unique,9,9,3,3,461
top,Florianopolis (SC),Florianopolis (SC),firstClass,Rainbow,2019-09-26
freq,39868,39737,81242,81741,1335


In [26]:
raw_df.nunique()

travelCode    95430
userCode       1335
from              9
to                9
flightType        3
price           490
time             33
distance         35
agency            3
date            461
dtype: int64

In [27]:
cat_cols = list(raw_df.select_dtypes(include="object").columns)
cat_cols.remove("date")
for col in cat_cols:
    unique_cats = list(raw_df[col].unique())
    print(f"{col}:{unique_cats}")

from:['Recife (PE)', 'Florianopolis (SC)', 'Aracaju (SE)', 'Campo Grande (MS)', 'Brasilia (DF)', 'Natal (RN)', 'Sao Paulo (SP)', 'Rio de Janeiro (RJ)', 'Salvador (BH)']
to:['Florianopolis (SC)', 'Natal (RN)', 'Salvador (BH)', 'Rio de Janeiro (RJ)', 'Aracaju (SE)', 'Sao Paulo (SP)', 'Recife (PE)', 'Campo Grande (MS)', 'Brasilia (DF)']
flightType:['firstClass', 'premium', 'economic']
agency:['FlyingDrops', 'CloudFy', 'Rainbow']


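### Initial Thoughts on Raw Data
- The training split has 190,321 rows and 10 columns, with no missing values.
- The dataset mixes numerical features (`price`, `time`, `distance`), identifier columns (`travelCode`, `userCode`), and categorical features (`from`, `to`, `flightType`, `agency`). The `date` column is stored as `object` (string).
- Column names are not standardized: several use camelCase (`travelCode`, `userCode`, `flightType`).
- `travelCode` has only 95,430 unique values across 190,321 rows, so it does not identify a row on its own.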
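
Observations like these can also be checked programmatically instead of being read off the outputs. The following is a minimal sketch on a tiny stand-in frame; `summarize_quality` is a hypothetical helper written for illustration and is not part of the project code.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> dict:
    """Return a few basic data-quality counts for a DataFrame."""
    return {
        "rows": len(df),
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        # Columns containing uppercase letters, e.g. camelCase names
        "columns_with_uppercase": [c for c in df.columns if c != c.lower()],
    }


# Tiny stand-in for raw_df (values copied from the head() output above,
# with the first row repeated to produce one full-row duplicate)
sample = pd.DataFrame(
    {
        "travelCode": [0, 0, 132076],
        "flightType": ["firstClass", "firstClass", "premium"],
        "price": [1434.38, 1434.38, 1311.38],
    }
)

report = summarize_quality(sample)
print(report)
```

In the notebook, the same helper could be called as `summarize_quality(raw_df)`.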


## 2. Bronze Data Analysis

In [28]:
bronze_train_df = pd.read_csv(config_bronze.BRONZE_PROCESSED_DIR / "train.csv")
bronze_train_df.head()

Unnamed: 0,travelCode,userCode,from,to,flightType,price,time,distance,agency,date
0,0,0,Recife (PE),Florianopolis (SC),firstClass,1434.38,1.76,676.53,FlyingDrops,2019-09-26
1,121138,1202,Florianopolis (SC),Natal (RN),firstClass,1315.27,1.84,709.37,CloudFy,2019-09-26
2,132076,1301,Florianopolis (SC),Salvador (BH),premium,1311.38,2.44,937.77,CloudFy,2019-09-26
3,28904,276,Recife (PE),Rio de Janeiro (RJ),economic,908.93,2.3,885.57,Rainbow,2019-09-26
4,88695,877,Aracaju (SE),Natal (RN),firstClass,598.61,0.46,176.33,CloudFy,2019-09-26


In [29]:
bronze_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190321 entries, 0 to 190320
Data columns (total 10 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   travelCode  190321 non-null  int64  
 1   userCode    190321 non-null  int64  
 2   from        190321 non-null  object 
 3   to          190321 non-null  object 
 4   flightType  190321 non-null  object 
 5   price       190321 non-null  float64
 6   time        190321 non-null  float64
 7   distance    190321 non-null  float64
 8   agency      190321 non-null  object 
 9   date        190321 non-null  object 
dtypes: float64(3), int64(2), object(5)
memory usage: 14.5+ MB


In [30]:
bronze_train_df.describe()

Unnamed: 0,travelCode,userCode,price,time,distance
count,190321.0,190321.0,190321.0,190321.0,190321.0
mean,67919.412272,667.159567,955.601391,1.418343,545.875221
std,39051.352769,387.624777,362.432241,0.543563,209.246707
min,0.0,0.0,301.51,0.44,168.22
25%,34245.0,328.0,672.66,1.02,392.76
50%,68332.0,662.0,899.6,1.46,562.14
75%,101679.0,1009.0,1222.24,1.76,676.53
max,135943.0,1339.0,1754.17,2.44,937.77


In [31]:
bronze_train_df.describe(include="object")

Unnamed: 0,from,to,flightType,agency,date
count,190321,190321,190321,190321,190321
unique,9,9,3,3,461
top,Florianopolis (SC),Florianopolis (SC),firstClass,Rainbow,2019-09-26
freq,39868,39737,81242,81741,1335


In [32]:
bronze_train_df.nunique()

travelCode    95430
userCode       1335
from              9
to                9
flightType        3
price           490
time             33
distance         35
agency            3
date            461
dtype: int64

### Bronze Pipeline Verification
`bronze_pipeline.py` is mainly responsible for data validation. It checks:
- Column presence and order.
- Data types.

Because the bronze data is a validated copy of the raw data, the data itself should not change. The outputs above confirm this: the row count (190,321), dtypes, summary statistics, and unique-value counts all match the raw data. The value of this stage is added confidence in data quality rather than a transformation.
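
The kind of validation described here can be sketched as follows. The actual checks in `bronze_pipeline.py` are not shown in this notebook, so `check_schema` and `EXPECTED_KINDS` are hypothetical names, and only three of the ten columns are used to keep the example short.

```python
import pandas as pd

# Hypothetical expected schema: column name -> NumPy dtype kind
# ("i" = signed integer, "f" = float). Dict order encodes column order.
EXPECTED_KINDS = {"travelCode": "i", "userCode": "i", "price": "f"}


def check_schema(df: pd.DataFrame, expected: dict) -> list:
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    # Column presence and order
    if list(df.columns) != list(expected):
        problems.append(
            f"expected columns {list(expected)}, got {list(df.columns)}"
        )
    # Data types, checked only for columns that are present
    for col, kind in expected.items():
        if col in df.columns and df[col].dtype.kind != kind:
            problems.append(
                f"{col}: expected dtype kind {kind!r}, got {df[col].dtype.kind!r}"
            )
    return problems


good = pd.DataFrame(
    {"travelCode": [0, 121138], "userCode": [0, 1202], "price": [1434.38, 1315.27]}
)
reordered = good[["userCode", "travelCode", "price"]]
wrong_type = good.assign(price=[1434, 1315])  # integers instead of floats

print(check_schema(good, EXPECTED_KINDS))
print(check_schema(reordered, EXPECTED_KINDS))
print(check_schema(wrong_type, EXPECTED_KINDS))
```

Returning a list of problems rather than raising on the first failure makes it easier to report every schema issue in a single run.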

## 3. Silver Data Analysis

In [33]:
silver_train_df = pd.read_parquet(config_silver.SILVER_PROCESSED_DIR / "train.parquet")
silver_train_df.head()

Unnamed: 0,travel_code,user_code,from_location,to_location,flight_type,price,time,distance,agency,date,year,month,day,day_of_week,day_of_year,week_of_year
0,0,0,Recife (PE),Florianopolis (SC),firstClass,1434.380005,1.76,676.530029,FlyingDrops,2019-09-26,2019,9,26,3,269,39
1,121138,1202,Florianopolis (SC),Natal (RN),firstClass,1315.27002,1.84,709.369995,CloudFy,2019-09-26,2019,9,26,3,269,39
2,132076,1301,Florianopolis (SC),Salvador (BH),premium,1311.380005,2.44,937.77002,CloudFy,2019-09-26,2019,9,26,3,269,39
3,28904,276,Recife (PE),Rio de Janeiro (RJ),economic,908.929993,2.3,885.570007,Rainbow,2019-09-26,2019,9,26,3,269,39
4,88695,877,Aracaju (SE),Natal (RN),firstClass,598.609985,0.46,176.330002,CloudFy,2019-09-26,2019,9,26,3,269,39


In [34]:
silver_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190321 entries, 0 to 190320
Data columns (total 16 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   travel_code    190321 non-null  int32         
 1   user_code      190321 non-null  int16         
 2   from_location  190321 non-null  category      
 3   to_location    190321 non-null  category      
 4   flight_type    190321 non-null  category      
 5   price          190321 non-null  float32       
 6   time           190321 non-null  float32       
 7   distance       190321 non-null  float32       
 8   agency         190321 non-null  category      
 9   date           190321 non-null  datetime64[ns]
 10  year           190321 non-null  int16         
 11  month          190321 non-null  int8          
 12  day            190321 non-null  int8          
 13  day_of_week    190321 non-null  int8          
 14  day_of_year    190321 non-null  int16         
 15  

In [35]:
silver_train_df.describe()

Unnamed: 0,travel_code,user_code,price,time,distance,date,year,month,day,day_of_week,day_of_year,week_of_year
count,190321.0,190321.0,190321.0,190321.0,190321.0,190321,190321.0,190321.0,190321.0,190321.0,190321.0,190321.0
mean,67919.412272,667.159567,955.601379,1.418343,545.875244,2020-07-11 06:14:53.299215872,2020.032041,6.44995,15.85308,3.373122,181.345238,26.484939
min,0.0,0.0,301.51001,0.44,168.220001,2019-09-26 00:00:00,2019.0,1.0,1.0,0.0,1.0,1.0
25%,34245.0,328.0,672.659973,1.02,392.76001,2020-02-02 00:00:00,2020.0,3.0,8.0,3.0,80.0,12.0
50%,68332.0,662.0,899.599976,1.46,562.140015,2020-06-26 00:00:00,2020.0,6.0,16.0,3.0,168.0,24.0
75%,101679.0,1009.0,1222.23999,1.76,676.530029,2020-12-10 00:00:00,2020.0,10.0,24.0,4.0,290.0,42.0
max,135943.0,1339.0,1754.170044,2.44,937.77002,2021-07-01 00:00:00,2021.0,12.0,31.0,6.0,366.0,53.0
std,39051.352769,387.624777,362.426453,0.543455,209.208664,,0.638354,3.673053,8.843703,1.653325,112.40369,16.158775


In [36]:
silver_train_df.describe(include=["object", "category"])

Unnamed: 0,from_location,to_location,flight_type,agency
count,190321,190321,190321,190321
unique,9,9,3,3
top,Florianopolis (SC),Florianopolis (SC),firstClass,Rainbow
freq,39868,39737,81242,81741


In [37]:
silver_train_df.nunique()

travel_code      95430
user_code         1335
from_location        9
to_location          9
flight_type          3
price              490
time                33
distance            35
agency               3
date               461
year                 3
month               12
day                 31
day_of_week          5
day_of_year        327
week_of_year        53
dtype: int64

### Silver Pipeline Verification
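`silver_pipeline.py` performs several transformations:
- **Column Renaming and Standardization**: Column names are now in snake_case (e.g., `travelCode` became `travel_code`), and `from` and `to` were renamed to `from_location` and `to_location`.
- **Data Type Optimization**: Integer columns are downcast from `int64` to `int32` or `int16`, float columns from `float64` to `float32`, string columns are stored as `category`, and `date` is now `datetime64[ns]`.
- **Feature Engineering**: Six date-derived features have been created: `year`, `month`, `day`, `day_of_week`, `day_of_year`, and `week_of_year`.
- **Duplicate Handling**: The pipeline removes erroneous duplicates. The row count is unchanged from bronze (190,321), which suggests that none were found in this split.

Comparing the silver data to the bronze data confirms that these transformations have been applied. The `float32` downcast also explains the small rounding differences in `price` and `distance` (e.g., 1434.38 now appears as 1434.380005).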
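
Two of these transformations, column renaming and date feature creation, can be spot-checked with a short sketch. The implementation in `silver_pipeline.py` is not shown here, so `to_snake_case`, `silver_name`, `date_features`, and `RENAME_OVERRIDES` are hypothetical; the `from`/`to` overrides are simply read off the silver column names above.

```python
import re

import pandas as pd

# Renames that plain snake_case conversion cannot produce (taken from the
# silver output above; the pipeline's own mapping is an assumption).
RENAME_OVERRIDES = {"from": "from_location", "to": "to_location"}


def to_snake_case(name: str) -> str:
    """Convert camelCase to snake_case, e.g. travelCode -> travel_code."""
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()


def silver_name(raw_name: str) -> str:
    """Map a raw column name to the name expected in the silver data."""
    return RENAME_OVERRIDES.get(raw_name, to_snake_case(raw_name))


def date_features(dates: pd.Series) -> pd.DataFrame:
    """Derive calendar features from a date column (Monday = 0)."""
    d = pd.to_datetime(dates)
    return pd.DataFrame(
        {
            "year": d.dt.year,
            "month": d.dt.month,
            "day": d.dt.day,
            "day_of_week": d.dt.dayofweek,
            "day_of_year": d.dt.dayofyear,
            "week_of_year": d.dt.isocalendar().week.astype("int64"),
        }
    )


raw_columns = ["travelCode", "userCode", "from", "to", "flightType",
               "price", "time", "distance", "agency", "date"]
renamed = [silver_name(c) for c in raw_columns]
print(renamed)

features = date_features(pd.Series(["2019-09-26"])).iloc[0].to_dict()
print(features)
```

For 2019-09-26 this yields `day_of_week` 3, `day_of_year` 269, and `week_of_year` 39, which agrees with the first rows of the silver `head()` output.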

## 4. Gold Data Analysis

In [29]:
gold_train_df = pd.read_parquet(config_gold.GOLD_PROCESSED_DIR / "train.parquet")
gold_train_df.head()

Unnamed: 0,from_location_aracaju_(se),from_location_brasilia_(df),from_location_campo_grande_(ms),from_location_florianopolis_(sc),from_location_natal_(rn),from_location_recife_(pe),from_location_rio_de_janeiro_(rj),from_location_salvador_(bh),from_location_sao_paulo_(sp),to_location_aracaju_(se),...,price,time,distance,year,month_sin,month_cos,day_of_week_sin,day_of_week_cos,day_sin,day_cos
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.2948,0.74725,0.748272,2019.0,-1.0,-1.83697e-16,0.433884,-0.900969,-0.848644,0.528964
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.033898,0.891218,0.895637,2019.0,-1.0,-1.83697e-16,0.433884,-0.900969,-0.848644,0.528964
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.02514,1.963484,1.896318,2019.0,-1.0,-1.83697e-16,0.433884,-0.900969,-0.848644,0.528964
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.016546,1.714398,1.670994,2019.0,-1.0,-1.83697e-16,0.433884,-0.900969,-0.848644,0.528964
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.968255,-1.635872,-1.681132,2019.0,-1.0,-1.83697e-16,0.433884,-0.900969,-0.848644,0.528964


In [30]:
gold_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94742 entries, 0 to 94741
Columns: 105 entries, from_location_aracaju_(se) to day_cos
dtypes: float64(105)
memory usage: 75.9 MB


In [31]:
gold_train_df.describe()

Unnamed: 0,from_location_aracaju_(se),from_location_brasilia_(df),from_location_campo_grande_(ms),from_location_florianopolis_(sc),from_location_natal_(rn),from_location_recife_(pe),from_location_rio_de_janeiro_(rj),from_location_salvador_(bh),from_location_sao_paulo_(sp),to_location_aracaju_(se),...,price,time,distance,year,month_sin,month_cos,day_of_week_sin,day_of_week_cos,day_sin,day_cos
count,94742.0,94742.0,94742.0,94742.0,94742.0,94742.0,94742.0,94742.0,94742.0,94742.0,...,94742.0,94742.0,94742.0,94742.0,94742.0,94742.0,94742.0,94742.0,94742.0,94742.0
mean,0.121572,0.121045,0.122617,0.116907,0.114099,0.121931,0.083595,0.084535,0.113698,0.151095,...,-3.802381e-16,7.784756e-17,6.209805000000001e-17,2020.077484,0.08310325,0.08454659,-0.228624,-0.208192,-0.003698261,-0.009739
std,0.326793,0.326181,0.327999,0.321311,0.317934,0.327208,0.276782,0.27819,0.317446,0.358143,...,1.0,1.0,1.0,0.642241,0.6883345,0.7156478,0.552117,0.774315,0.712855,0.701242
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-2.271354,-1.673352,-1.725432,2019.0,-1.0,-1.0,-0.974928,-0.900969,-0.9987165,-0.994869
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.7285338,-0.9108241,-0.8962967,2020.0,-0.5,-0.5,-0.781831,-0.900969,-0.7247928,-0.758758
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.03970741,0.1687241,0.1970218,2020.0,1.224647e-16,6.123234000000001e-17,-0.433884,-0.222521,-2.449294e-16,-0.050649
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.7888394,0.7472501,0.7482719,2020.0,0.8660254,0.8660254,0.433884,0.62349,0.7247928,0.688967
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.936008,1.963484,1.896318,2021.0,1.0,1.0,0.433884,1.0,0.9987165,1.0


In [35]:
gold_train_df.nunique()

from_location_aracaju_(se)           2
from_location_brasilia_(df)          2
from_location_campo_grande_(ms)      2
from_location_florianopolis_(sc)     2
from_location_natal_(rn)             2
                                    ..
month_cos                           11
day_of_week_sin                      5
day_of_week_cos                      5
day_sin                             31
day_cos                             26
Length: 105, dtype: int64

### Gold Pipeline Verification
`gold_pipeline.py` applies the final feature engineering and preprocessing steps:
- **Duplicates**: After the identifier columns (`travel_code` and `user_code`) are dropped, rows that differed only by identifier become identical. These are not erroneous duplicates; they arise because different people boarded the same flight on the same route. These rows were dropped successfully, reducing the training set from 190,321 to 94,742 rows.
- **Imputation**: Missing values have been imputed.
- **Feature Engineering**: Cyclical features (`month_sin`, `month_cos`, `day_of_week_sin`, `day_of_week_cos`, `day_sin`, `day_cos`) and interaction features have been created.
- **Categorical Encoding**: Categorical features have been one-hot encoded (e.g., `from_location_recife_(pe)`).
- **Outlier Handling**: Outliers have been handled.
- **Power Transformation and Scaling**: Numerical features have been transformed and scaled; `price`, `time`, and `distance` now have approximately zero mean and unit standard deviation.

The gold data is now ready for model training. All 105 columns are `float64`, and `describe()` reports a count of 94,742 for every column shown, so those columns contain no missing values.
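
Two of the gold steps can be illustrated with a short sketch: dropping duplicates after the identifier columns are removed, and cyclical encoding. The code in `gold_pipeline.py` is not shown here, so `drop_ids_and_duplicates` and `cyclical_encode` are hypothetical helpers, and the periods (12 for month, 7 for day of week, 31 for day) are assumptions. These periods do, however, reproduce the sine and cosine values in the gold `head()` output for 2019-09-26.

```python
import math

import pandas as pd


def drop_ids_and_duplicates(df: pd.DataFrame, id_cols: list) -> pd.DataFrame:
    """Drop identifier columns, then drop rows that became identical."""
    return df.drop(columns=id_cols).drop_duplicates().reset_index(drop=True)


def cyclical_encode(value: float, period: float) -> tuple:
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Two passengers on the same flight differ only in their identifiers
bookings = pd.DataFrame(
    {
        "travel_code": [0, 1, 2],
        "user_code": [0, 7, 9],
        "flight_type": ["firstClass", "firstClass", "premium"],
        "price": [1434.38, 1434.38, 1311.38],
    }
)
deduped = drop_ids_and_duplicates(bookings, ["travel_code", "user_code"])
print(len(bookings), "->", len(deduped))  # 3 -> 2

# 2019-09-26: month 9, day_of_week 3, day 26
month_sin, month_cos = cyclical_encode(9, 12)
dow_sin, dow_cos = cyclical_encode(3, 7)
day_sin, day_cos = cyclical_encode(26, 31)
print(month_sin, month_cos)
print(dow_sin, dow_cos)
print(day_sin, day_cos)
```

Encoding each periodic value as a sine and cosine pair places the end of a cycle next to its start (December next to January, for example), which a plain integer encoding does not do.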