# 01 – TradyFlow Dataset EDA

This notebook performs initial exploration of the TradyFlow options dataset and prepares a cleaned version for use in the Sentinel Premarket Forecasting System.

**Objectives**

- Load the raw options flow dataset from `data/raw/`
- Understand the schema, datatypes, and basic distributions
- Check data quality (missing values, ranges, uniqueness)
- Save a cleaned Parquet file to `data/processed/` for downstream feature engineering and modeling

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("default")
sns.set_theme(style="whitegrid")

## 1. Load TradyFlow Raw Data

The raw dataset is a single CSV exported from TradyFlow (best options trades).  
It contains one row per options trade, with columns describing:

- The **time** of the trade  
- The **underlying symbol**  
- Whether the contract is a **Call or Put (C/P)**  
- The **expiration date (Exp)** and **strike price (Strike)**  
- The **spot price of the underlying (Spot)**  
- The **bid–ask spread (BidAsk)** as a liquidity proxy  
- The number of **orders** and **volume (Vol)**  
- The total **premium traded (Prems)**  
- **Open interest (OI)** at the contract level  
- A **Diff(%)** column (percent difference / move)  
- An **ITM flag** indicating whether the option is in-the-money

We treat this as the base “options flow tape” for Sentinel.

In [2]:
RAW_PATH = "../data/raw/tradyflow_options.csv"  # change if needed

df = pd.read_csv(RAW_PATH)
df.head()

| | Time | Sym | C/P | Exp | Strike | Spot | BidAsk | Orders | Vol | Prems | OI | Diff(%) | ITM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6/17/2022 15:07 | ISEE | Call | 10/21/2022 | 10.0 | 9.54 | 5.05 | 7 | 360 | 183.60K | 4.07K | 4.71 | 0 |
| 1 | 6/17/2022 15:05 | CVNA | Call | 1/19/2024 | 60.0 | 23.52 | 4.6 | 7 | 634 | 310.66K | 130 | 155.05 | 0 |
| 2 | 6/17/2022 14:51 | PTLO | Put | 2/17/2023 | 15.0 | 15.19 | 3.5 | 7 | 800 | 281.00K | 0 | 1.39 | 0 |
| 3 | 6/17/2022 14:39 | TWLO | Call | 6/24/2022 | 86.0 | 84.51 | 2.95 | 5 | 722 | 198.80K | 436 | 2.48 | 0 |
| 4 | 6/17/2022 13:56 | ATUS | Put | 9/16/2022 | 7.0 | 8.62 | 0.68 | 5 | 6.27K | 501.84K | 8.63K | 23.13 | 0 |
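
The column descriptions can be sanity-checked against the five sample rows. The sketch below hard-codes those rows (with `Vol` and `Prems` written out as plain numbers) and derives a per-share price as `Prems / (Vol × 100)`. The 100-share contract multiplier and the names `implied_px` and `ratio` are assumptions for illustration, not part of the TradyFlow export. In these rows the derived price lands close to `BidAsk`, which hints that the column may hold a quoted contract price rather than a spread width; that reading is worth verifying before `BidAsk` is used as a liquidity proxy.

```python
import pandas as pd

# The five rows shown by df.head(), with K-suffixed values expanded by hand.
sample = pd.DataFrame({
    "Sym": ["ISEE", "CVNA", "PTLO", "TWLO", "ATUS"],
    "BidAsk": [5.05, 4.60, 3.50, 2.95, 0.68],
    "Vol": [360, 634, 800, 722, 6270],
    "Prems": [183_600, 310_660, 281_000, 198_800, 501_840],
})

CONTRACT_MULTIPLIER = 100  # assumption: standard 100-share equity option contract

# Premium per share implied by total premium and contract volume.
sample["implied_px"] = sample["Prems"] / (sample["Vol"] * CONTRACT_MULTIPLIER)

# A ratio near 1 means BidAsk behaves like a price, not a spread width.
sample["ratio"] = sample["implied_px"] / sample["BidAsk"]

print(sample.round(2))
```

Five rows are not enough to settle the question; running the same comparison on the full frame after parsing `Vol` and `Prems` would give a clearer picture.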


## 2. Schema and Data Types

First, we inspect the overall schema and datatypes to understand how the data is represented in pandas.

Key questions:

- Are timestamp-like fields (Time, Exp) strings or already datetimes?
- Which columns are numeric vs categorical?
- Do we have any unexpected mixed types?

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7827 entries, 0 to 7826
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Time     7827 non-null   object 
 1   Sym      7827 non-null   object 
 2   C/P      7827 non-null   object 
 3   Exp      7827 non-null   object 
 4   Strike   7827 non-null   float64
 5   Spot     7827 non-null   float64
 6   BidAsk   7827 non-null   float64
 7   Orders   7827 non-null   int64  
 8   Vol      7827 non-null   object 
 9   Prems    7827 non-null   object 
 10  OI       7827 non-null   object 
 11  Diff(%)  7827 non-null   float64
 12  ITM      7827 non-null   int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 795.1+ KB
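
The `info()` output answers the first question: `Time` and `Exp` are `object` columns, so they are still strings. A minimal sketch of converting them follows. The format strings are inferred from the sample rows (for example `6/17/2022 15:07`) and are assumed to hold for the whole file; `errors="coerce"` turns any value that does not match into `NaT` so it can be counted rather than raising. The `dte` (days to expiry) column is a hypothetical illustration, not a field in the dataset.

```python
import pandas as pd

# Two rows copied from the head() output above.
sample = pd.DataFrame({
    "Time": ["6/17/2022 15:07", "6/17/2022 13:56"],
    "Exp": ["10/21/2022", "9/16/2022"],
})

# Explicit formats avoid ambiguous month/day guessing.
sample["Time"] = pd.to_datetime(sample["Time"], format="%m/%d/%Y %H:%M", errors="coerce")
sample["Exp"] = pd.to_datetime(sample["Exp"], format="%m/%d/%Y", errors="coerce")

# Rows that failed to parse show up as NaT.
print(sample[["Time", "Exp"]].isna().sum())

# Hypothetical derived field: calendar days from trade date to expiration.
sample["dte"] = (sample["Exp"] - sample["Time"].dt.normalize()).dt.days
print(sample)
```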


## 3. Descriptive Statistics

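Next, we compute descriptive statistics to get a feel for:

- The range of the numeric columns **Strike**, **Spot**, **BidAsk**, **Orders**, and **Diff(%)**
- The number of distinct values and the most frequent value of **Vol**, **Prems**, and **OI** (these are `object` columns, so `describe` reports only `count`, `unique`, `top`, and `freq` for them)
- Typical size of trades and spreads
- How many unique tickers and expirations we have

This helps calibrate feature engineering later (e.g., log transforms, percentiles).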

In [4]:
df.describe(include="all")

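| | Time | Sym | C/P | Exp | Strike | Spot | BidAsk | Orders | Vol | Prems | OI | Diff(%) | ITM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7827 | 7827 | 7827 | 7827 | 7827.0 | 7827.0 | 7827.0 | 7827.0 | 7827.0 | 7827 | 7827.0 | 7827.0 | 7827.0 |
| unique | 7266 | 1107 | 2 | 83 | | | | | 1740.0 | 6500 | 2978.0 | | |
| top | 7/28/2021 9:55 | F | Call | 1/21/2022 | | | | | 500.0 | 1.01M | 0.0 | | |
| freq | 16 | 77 | 5077 | 980 | | | | | 89.0 | 22 | 106.0 | | |
| mean | | | | | 151.178342 | 148.59549 | 4.220649 | 7.109493 | | | | 11.477625 | 0.679826 |
| std | | | | | 358.668235 | 353.566766 | 5.207856 | 5.312003 | | | | 19.367772 | 0.466573 |
| min | | | | | 1.5 | 1.23 | 0.11 | 5.0 | | | | 0.02 | 0.0 |
| 25% | | | | | 30.0 | 28.27 | 1.65 | 5.0 | | | | 2.11 | 0.0 |
| 50% | | | | | 60.0 | 58.95 | 3.1 | 5.0 | | | | 5.6 | 1.0 |
| 75% | | | | | 145.0 | 145.16 | 5.12 | 7.0 | | | | 12.715 | 1.0 |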
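
`Vol`, `Prems`, and `OI` have no mean or percentiles above because they are strings such as `6.27K` and `1.01M`. One way to get numeric statistics is to expand the suffixes first. The helper below is a sketch: the name `parse_abbrev` is hypothetical, the `K` and `M` multipliers are inferred from the values seen in this notebook, and the `B` entry is an assumption in case billions appear.

```python
import numpy as np
import pandas as pd

# Suffix multipliers; "K" and "M" appear in the data, "B" is assumed.
_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_abbrev(value) -> float:
    """Convert strings like '6.27K' or '1.01M' to floats; plain numbers pass through."""
    text = str(value).strip().replace(",", "").upper()
    if not text:
        return float("nan")
    if text[-1] in _MULTIPLIERS:
        return float(text[:-1]) * _MULTIPLIERS[text[-1]]
    return float(text)


# Values copied from the outputs above.
sample = pd.DataFrame({
    "Vol": ["360", "6.27K"],
    "Prems": ["183.60K", "1.01M"],
    "OI": ["4.07K", "0"],
})

numeric = pd.DataFrame({col: sample[col].map(parse_abbrev) for col in sample.columns})
print(numeric.describe())

# log1p handles the zeros in OI and compresses the heavy right tail.
print(np.log1p(numeric))
```

Because the source strings are already rounded (`6.27K`), the parsed values are approximations of the original counts.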


## 4. Missing Values Check

We verify that each column is fully populated before moving into modeling.

If any columns had significant missing values, we would either:
- Impute (for continuous variables), or  
- Drop/flag them (for less critical fields).

In this dataset, all columns are fully populated, which simplifies preprocessing.

In [5]:
df.isna().sum().sort_values(ascending=False)

Time       0
Sym        0
C/P        0
Exp        0
Strike     0
Spot       0
BidAsk     0
Orders     0
Vol        0
Prems      0
OI         0
Diff(%)    0
ITM        0
dtype: int64
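
A zero null count does not cover the range and uniqueness checks listed in the objectives. The sketch below adds a few of them. The function name `quality_report` and the specific checks are illustrative choices; the allowed values for `C/P` (`Call`, `Put`) and `ITM` (0, 1) are taken from the outputs and column descriptions in this notebook.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize a few checks that a null count alone would miss."""
    return {
        # Fully identical rows may indicate a double export.
        "duplicate_rows": int(df.duplicated().sum()),
        # Values outside the expected categorical domains.
        "unexpected_cp": df.loc[~df["C/P"].isin(["Call", "Put"]), "C/P"].unique().tolist(),
        "unexpected_itm": df.loc[~df["ITM"].isin([0, 1]), "ITM"].unique().tolist(),
        # Prices should be strictly positive.
        "nonpositive_strike": int((df["Strike"] <= 0).sum()),
        "nonpositive_spot": int((df["Spot"] <= 0).sum()),
        # Empty strings are not counted by isna().
        "blank_sym": int(df["Sym"].astype(str).str.strip().eq("").sum()),
    }


# Three rows copied from the head() output.
sample = pd.DataFrame({
    "Sym": ["ISEE", "CVNA", "PTLO"],
    "C/P": ["Call", "Call", "Put"],
    "Strike": [10.0, 60.0, 15.0],
    "Spot": [9.54, 23.52, 15.19],
    "ITM": [0, 0, 0],
})

report = quality_report(sample)
print(report)
```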

## 5. Final Column Dictionary

For future notebooks (feature engineering, modeling, dashboarding), it is useful to lock in a simple column dictionary:

- `Time`: trade timestamp (string → will be converted to datetime)
- `Sym`: underlying ticker symbol
- `C/P`: Call or Put contract type
- `Exp`: contract expiration date (string → will be converted to datetime)
- `Strike`: option strike price
- `Spot`: underlying spot price at trade time
- `BidAsk`: bid–ask spread (liquidity / microstructure signal)
- `Orders`: number of orders
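- `Vol`: trade volume (contracts); stored as a string, with large values abbreviated (e.g., `6.27K`)
- `Prems`: total premium traded; stored as a string with `K`/`M` suffixes (e.g., `183.60K`, `1.01M`)
- `OI`: open interest; stored as a string, with large values abbreviated (e.g., `4.07K`)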
- `Diff(%)`: percent difference / move metric
- `ITM`: in-the-money flag (0/1)

In [6]:
df.columns.tolist()

['Time',
 'Sym',
 'C/P',
 'Exp',
 'Strike',
 'Spot',
 'BidAsk',
 'Orders',
 'Vol',
 'Prems',
 'OI',
 'Diff(%)',
 'ITM']
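
To make the column dictionary enforceable in later notebooks, the expected names can be checked when the file is loaded. This is a small sketch; `EXPECTED_COLUMNS` and `check_columns` are hypothetical names, and the list simply mirrors the output of `df.columns.tolist()` above.

```python
import pandas as pd

# Column names as reported by df.columns.tolist() in this notebook.
EXPECTED_COLUMNS = [
    "Time", "Sym", "C/P", "Exp", "Strike", "Spot", "BidAsk",
    "Orders", "Vol", "Prems", "OI", "Diff(%)", "ITM",
]


def check_columns(df: pd.DataFrame) -> tuple[list, list]:
    """Return (missing, extra) column names relative to EXPECTED_COLUMNS."""
    actual = list(df.columns)
    missing = [c for c in EXPECTED_COLUMNS if c not in actual]
    extra = [c for c in actual if c not in EXPECTED_COLUMNS]
    return missing, extra


# An empty frame with the right header passes the check.
ok = pd.DataFrame(columns=EXPECTED_COLUMNS)
print(check_columns(ok))

# A renamed column is reported as one missing name and one extra name.
renamed = ok.rename(columns={"Diff(%)": "Diff"})
print(check_columns(renamed))
```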

## 6. Save Cleaned Dataset to `data/processed/`

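For downstream use in the Sentinel pipeline (feature engineering, model training, and Streamlit dashboard), we save the dataset as a Parquet file. Compared with re-reading the CSV each time, Parquet:

- Stores column dtypes with the data, so the schema stays consistent across loads
- Loads faster than CSV
- Avoids repeatedly parsing the raw file

No type conversions are applied in this notebook, so `Time`, `Exp`, `Vol`, `Prems`, and `OI` are written to the file as strings, exactly as loaded.

This file will be the canonical input for the next notebook (`02_feature_engineering.ipynb`) and the Python data loader in `src/data/load_tradyflow.py`.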

In [7]:
df.to_parquet("../data/processed/tradyflow_clean.parquet")

pd.read_parquet("../data/processed/tradyflow_clean.parquet").head()

| | Time | Sym | C/P | Exp | Strike | Spot | BidAsk | Orders | Vol | Prems | OI | Diff(%) | ITM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6/17/2022 15:07 | ISEE | Call | 10/21/2022 | 10.0 | 9.54 | 5.05 | 7 | 360 | 183.60K | 4.07K | 4.71 | 0 |
| 1 | 6/17/2022 15:05 | CVNA | Call | 1/19/2024 | 60.0 | 23.52 | 4.6 | 7 | 634 | 310.66K | 130 | 155.05 | 0 |
| 2 | 6/17/2022 14:51 | PTLO | Put | 2/17/2023 | 15.0 | 15.19 | 3.5 | 7 | 800 | 281.00K | 0 | 1.39 | 0 |
| 3 | 6/17/2022 14:39 | TWLO | Call | 6/24/2022 | 86.0 | 84.51 | 2.95 | 5 | 722 | 198.80K | 436 | 2.48 | 0 |
| 4 | 6/17/2022 13:56 | ATUS | Put | 9/16/2022 | 7.0 | 8.62 | 0.68 | 5 | 6.27K | 501.84K | 8.63K | 23.13 | 0 |
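
The save step can fail for two reasons unrelated to the data: the target directory may not exist yet, and pandas needs a Parquet engine (`pyarrow` or `fastparquet`) installed. The helper below is a sketch that handles the first case and reports the second; `save_parquet` is a hypothetical name, and the demo writes to a temporary directory rather than to `data/processed/`.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_parquet(df: pd.DataFrame, path) -> Path:
    """Write df to Parquet, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False leaves the default RangeIndex out of the file.
    df.to_parquet(path, index=False)
    return path


demo = pd.DataFrame({"Sym": ["ISEE", "CVNA"], "Strike": [10.0, 60.0]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "processed" / "demo.parquet"
    try:
        save_parquet(demo, target)
        # Reload to confirm the dtypes survive the round trip.
        print(pd.read_parquet(target).dtypes)
    except ImportError:
        # Raised by pandas when neither pyarrow nor fastparquet is installed.
        print("No Parquet engine available; install pyarrow or fastparquet.")
```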
