# Features and States

Up to this point, we have built our main enriched dataset:

- Clean SPY 1-minute candles
- Initial Balance (IB) levels (ib_high, ib_low, ib_mid, ib_width, etc.)
- Three AVWAP lines (avwap_open, avwap_up, avwap_down)
- Slopes for each AVWAP line

This means we have completed the **technical market structure layer** of the project.


## Why do we create “Features” and “States”?

Right now, our dataset contains many useful raw columns and technical levels.  
But for the next steps of the project, we need the data to be in a form that is:

- Easy to test with **hypotheses**
- Ready for **label creation** (future outcome definitions)
- Ready for **prediction and ML methods** later

To do that, we convert our technical information into:

- **Features** → numerical or categorical inputs we can analyze or feed into models  
  (example: distance to ib_high, slope of avwap_open, etc.)
- **States** → simple market “conditions” at each minute  
  (example: price above both AVWAPs, price between two AVWAPs, price inside IB range, etc.)


## What we do in this notebook

In this notebook, we create two structured outputs:

1. **A states table**
   - Gives a clear label for the market condition at each minute  
   - Helps us group and compare similar situations

2. **A features table**
   - Converts our raw technical columns into clean, usable variables  
   - These variables will be used for:
     - Hypothesis testing
     - Labeling logic
     - Later prediction / ML steps


## Final goal

After this step, each 1-minute row will not only have raw prices and lines,  
but also a clear set of **state definitions** and **feature values**.

This prepares the dataset for the next phases of the project:

- **Labeling**
- **Hypothesis tests**
- **Prediction and ML**


In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

In [3]:
# Load the cleaned data with IB levels and AVWAP levels that we built in "02_avwap_build.ipynb"

PROJECT_ROOT = Path("..").resolve()

DATA_CACHE = PROJECT_ROOT / "data" / "cache"

CACHE_FILE = DATA_CACHE / "spy_1min_et_clean_with_IBlevels_and_AWVAPs.csv"

df_aw = pd.read_csv(CACHE_FILE, parse_dates=['datetime'])

df_aw.head()

Unnamed: 0,datetime,high,low,close,Volume,ib_high,ib_low,ib_mid,ib_width,ib_width_type,...,prev_day,prev_close,gap,gap_dir,avwap_open,avwap_down,avwap_up,slope_open,slope_up,slope_down
0,2025-09-08 09:30:00,648.86,648.24,648.26,141588,649.06,647.75,648.405,1.31,narrow,...,,,,0.0,648.453333,,,,,
1,2025-09-08 09:31:00,648.45,648.15,648.27,42118,649.06,647.75,648.405,1.31,narrow,...,,,,0.0,648.415886,,,,,
2,2025-09-08 09:32:00,648.46,648.1,648.26,37143,649.06,647.75,648.405,1.31,narrow,...,,,,0.0,648.391911,,,,,
3,2025-09-08 09:33:00,648.47,648.23,648.4,42231,649.06,647.75,648.405,1.31,narrow,...,,,,0.0,648.387859,,,,,
4,2025-09-08 09:34:00,648.68,648.32,648.665,23659,649.06,647.75,648.405,1.31,narrow,...,,,,0.0,648.40165,,,,,


## 1) Setting Timeframes (IB zone, Analysis zone, Label window)

Before we build features or create labels, we must clearly define **where each 1-minute candle belongs during the trading day**.

Not every minute should be used for the same purpose.  
Some minutes are only for **collecting information**, others are for **testing hypotheses**, and others are for **creating labels** (future outcomes).

So we create three boolean flags for every 1-minute candle:

- **`is_ib`**
- **`is_analysis`**
- **`is_labelwin`**

These flags simply answer:  
> “Is this minute inside this specific time window?”


### **IB zone (09:30–10:30 ET) → `is_ib`**

**What it means:**
- This is the **Initial Balance window**.
- It is the first hour of the regular session.

**How we use it:**
- We use this window to **measure and extract information**, such as:
  - **ib_high / ib_low / ib_mid / ib_width**
  - the strongest 5-minute up/down bursts (anchors)

**Important rule:**
- We **do not run hypothesis tests, labeling, or predictions inside the IB minutes**.

So:

- **`is_ib = True`** → this minute is inside **09:30–10:30**
- These minutes are mainly a **measurement / observation zone**, not a testing zone.



### **Analysis zone (10:45–15:30 ET) → `is_analysis`**

**What it means:**
- This is the main part of the day where we actively study market behavior.

**Why it starts at 10:45 (not 10:30):**
- We intentionally leave a **buffer after the IB ends**.
- This avoids mixing our “IB measurement period” with immediate post-IB noise.

**How we use it:**
Inside this window we will:

- Compute and analyze **features**
- Run our **hypothesis tests**
- Study how price behaves relative to:
  - IB levels
  - AVWAP lines
  - AVWAP slopes
  - distance-to-lines, etc.

So:

- **`is_analysis = True`** → this minute is inside **10:45–15:30**
- These minutes are the “main research zone” of the day.



### **Label window (10:45–15:10 ET) → `is_labelwin`**

**What it means:**
- This window is used for creating **labels** (future outcomes).
- Labels require looking **forward in time** (example: “what happens in the next 15 minutes?”).

**Why it ends at 15:10 (not 15:30):**
- Each label needs a full look-ahead window that stays inside the analysis zone, so we must avoid “running out of future data”.
- Example:
  - If we label using the next 20 minutes,
  - we cannot label a candle at 15:25, because its 20-minute window would run to 15:45, past the 15:30 end of the analysis zone.
  - The last candle with a full 20-minute window is the one at 15:10.
- So we stop label creation earlier to leave a **future-data buffer**.

**How we use it:**
- We will define labels only for candles where **`is_labelwin = True`**.
- Later, ML models will learn from:
  - the features at time *t*
  - and the label that describes what happened after time *t*

So:

- **`is_labelwin = True`** → this minute is inside **10:45–15:10**
- These minutes are valid starting points for **future outcome labels**.


In [4]:
dt = df_aw["datetime"]

# 09:30–10:30 (IB)
df_aw["is_ib"] = (
    ((dt.dt.hour == 9) & (dt.dt.minute >= 30)) |
    ((dt.dt.hour == 10) & (dt.dt.minute <= 30))
)

# 10:45–15:30 (analysis)
df_aw["is_analysis"] = (
    ((dt.dt.hour == 10) & (dt.dt.minute >= 45)) |
    ((dt.dt.hour > 10) & (dt.dt.hour < 15))     |
    ((dt.dt.hour == 15) & (dt.dt.minute <= 30))
)

# 10:45–15:10 (label window)
df_aw["is_labelwin"] = (
    ((dt.dt.hour == 10) & (dt.dt.minute >= 45)) |
    ((dt.dt.hour > 10) & (dt.dt.hour < 15))     |
    ((dt.dt.hour == 15) & (dt.dt.minute <= 10))
)

df_aw.head()

Unnamed: 0,datetime,high,low,close,Volume,ib_high,ib_low,ib_mid,ib_width,ib_width_type,...,gap_dir,avwap_open,avwap_down,avwap_up,slope_open,slope_up,slope_down,is_ib,is_analysis,is_labelwin
0,2025-09-08 09:30:00,648.86,648.24,648.26,141588,649.06,647.75,648.405,1.31,narrow,...,0.0,648.453333,,,,,,True,False,False
1,2025-09-08 09:31:00,648.45,648.15,648.27,42118,649.06,647.75,648.405,1.31,narrow,...,0.0,648.415886,,,,,,True,False,False
2,2025-09-08 09:32:00,648.46,648.1,648.26,37143,649.06,647.75,648.405,1.31,narrow,...,0.0,648.391911,,,,,,True,False,False
3,2025-09-08 09:33:00,648.47,648.23,648.4,42231,649.06,647.75,648.405,1.31,narrow,...,0.0,648.387859,,,,,,True,False,False
4,2025-09-08 09:34:00,648.68,648.32,648.665,23659,649.06,647.75,648.405,1.31,narrow,...,0.0,648.40165,,,,,,True,False,False
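
The three windows can also be sanity-checked on a handful of boundary timestamps. The sketch below is an illustrative alternative, not the notebook's own method: it converts each timestamp to minutes since midnight and uses inclusive range checks. The `demo` frame and the `in_window` helper are hypothetical names introduced only for this example.

```python
import pandas as pd

# Toy timestamps around each window boundary (illustrative, not project data)
times = pd.to_datetime([
    "2025-09-08 09:30", "2025-09-08 10:30", "2025-09-08 10:31",
    "2025-09-08 10:44", "2025-09-08 10:45", "2025-09-08 15:10",
    "2025-09-08 15:11", "2025-09-08 15:30", "2025-09-08 15:31",
])
demo = pd.DataFrame({"datetime": times})

# Minutes since midnight turn each window into a simple inclusive range check
mins = demo["datetime"].dt.hour * 60 + demo["datetime"].dt.minute


def in_window(minutes, start, end):
    """True where start <= minute-of-day <= end (both ends inclusive)."""
    return minutes.between(start, end)


demo["is_ib"] = in_window(mins, 9 * 60 + 30, 10 * 60 + 30)
demo["is_analysis"] = in_window(mins, 10 * 60 + 45, 15 * 60 + 30)
demo["is_labelwin"] = in_window(mins, 10 * 60 + 45, 15 * 60 + 10)

print(demo)
```

With these definitions, 10:31 to 10:44 belongs to no window (the post-IB buffer), and 15:11 to 15:30 is inside the analysis zone but outside the label window.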


## 2) States and Delta (price position relative to AVWAPs, and AVWAP positions relative to each other)

Now that we have our AVWAP lines, we want to describe **where the current price is** compared to those lines **at every single 1-minute candle**.

To do this in a clear and reusable way, we create two kinds of information:

- **State** → a simple categorical description (above / between / below)
- **Delta** → a numeric distance showing *how far* price is from an AVWAP

Both are computed for each 1-minute row in our dataset.


### **State: “Where is price relative to AVWAP?”**

A **state** is a short label that tells us the price’s position compared to AVWAP levels.

We use states because they are:

- Easy to interpret
- Easy to group and test in hypotheses
- Useful for rule-based logic (later: labeling, signals, ML features)

For each minute, we compare the **current price (usually `close`)** to our daily AVWAP lines
(**`avwap_open`**, **`avwap_up`**, **`avwap_down`**) and assign a state such as:

- **`above`**  
  → price is **higher** than the selected AVWAP (price is trading above that benchmark)

- **`below`**  
  → price is **lower** than the selected AVWAP (price is trading below that benchmark)

- **`between`**  
  → price is **in the middle of two AVWAP lines**  
  (for example: above one AVWAP but below another)

So the **state** answers:

> “Is the market trading above the benchmark, below it, or trapped between two benchmarks?”


### **Delta: “How far is price from AVWAP?”**

A **delta** is a number that measures the distance between price and AVWAP.

For a selected AVWAP line, the simplest delta is:

- **delta = price − AVWAP**

This tells us:

- If **delta > 0** → price is **above** that AVWAP
- If **delta < 0** → price is **below** that AVWAP
- If **delta ≈ 0** → price is **very close** to the AVWAP

Why delta is useful:

- It gives more detail than just “above/below”.
- It helps measure the **strength** of the position:
  - Small delta → barely above/below (weak separation)
  - Large delta → clearly above/below (strong separation)

So delta answers:

> “Not only where is price, but also how strongly it is separated from the AVWAP.”
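
As a minimal sketch of the signed delta described above, assuming a frame with `close` and `avwap_open` columns: the column names `delta_px_open`, `delta_px_open_pct`, and `side_px_open` are hypothetical and are not created elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative values); the last row has no AVWAP yet
demo = pd.DataFrame({
    "close": [648.26, 648.40, 648.39],
    "avwap_open": [648.45, 648.39, np.nan],
})

# Signed distance: positive -> price above the AVWAP, negative -> below
demo["delta_px_open"] = demo["close"] - demo["avwap_open"]

# Same distance as a fraction of price, so it is comparable across price levels
demo["delta_px_open_pct"] = demo["delta_px_open"] / demo["close"]

# Side as -1 / +1; it stays NaN where the AVWAP is not available
demo["side_px_open"] = np.sign(demo["delta_px_open"])

print(demo)
```

Keeping the NaN instead of filling it with 0 makes it possible to tell "no AVWAP yet" apart from "price exactly on the AVWAP".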


### **How State and Delta work together**

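- **State** gives a simple category: **above / between / below**
- **Delta** gives the exact magnitude of distance
- In the code below, the stored `delta_{tag}` columns measure the absolute distance between the two AVWAP lines of each pair (the corridor width), and `delta_{tag}_pct` expresses that width as a fraction of `close`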

Together, they allow us to:

- Define clean market regimes (states)
- Test hypotheses about reversion vs continuation
- Build stronger features for later prediction work

In [None]:
# First, we define the relationships between our three AVWAPs as pairs.
# Each pair forms a corridor, which gives us three different AVWAP states.

pairs = {
    # "ud": the key corridor between avwap_up and avwap_down, i.e. the volume-weighted price
    # interval between those pushing prices up and those pushing prices down.
    # Most of our hypotheses use this corridor.
    "ud": ("avwap_up", "avwap_down"),
    # "ou": a deviation corridor showing how far buyers have moved the volume-weighted
    # price level away from the market-open volume-weighted price level.
    "ou": ("avwap_open", "avwap_up"),
    # "od": a deviation corridor showing how far sellers have moved the volume-weighted
    # price level away from the market-open volume-weighted price level.
    "od": ("avwap_open", "avwap_down"),
}

for tag, (a, b) in pairs.items():  # tag is the pair name (ud, ou, od); a and b are the two AVWAP columns of that pair

    # Either line of a pair can be the higher one, so we compute the corridor bounds row by row.
    # Note: min/max skip NaN by default, so when only one line of the pair exists, lo and hi both equal that line.
    lo = df_aw[[a, b]].min(axis=1)  # lower bound of the corridor
    hi = df_aw[[a, b]].max(axis=1)  # upper bound of the corridor

    # Compare the "close" price with each of the three corridors
    df_aw[f"state_{tag}_above"]   = (df_aw["close"] > hi).astype(int)  # close is above the corridor
    df_aw[f"state_{tag}_below"]   = (df_aw["close"] < lo).astype(int)  # close is below the corridor
    df_aw[f"state_{tag}_between"] = ((df_aw["close"] >= lo) & (df_aw["close"] <= hi)).astype(int)  # close is inside the corridor

    df_aw[f"delta_{tag}"] = (df_aw[a] - df_aw[b]).abs()  # distance between the two AVWAP levels, i.e. the corridor width
    df_aw[f"delta_{tag}_pct"] = df_aw[f"delta_{tag}"] / df_aw["close"]  # corridor width relative to the market price
    # For instance, a corridor width of 0.5 is relatively larger at a 300-dollar price than at a 500-dollar price.


df_aw.head()


Unnamed: 0,datetime,high,low,close,Volume,ib_high,ib_low,ib_mid,ib_width,ib_width_type,...,state_ou_above,state_ou_below,state_ou_between,delta_ou,delta_ou_pct,state_od_above,state_od_below,state_od_between,delta_od,delta_od_pct
0,2025-09-08 09:30:00,648.86,648.24,648.26,141588,649.06,647.75,648.405,1.31,narrow,...,0,1,0,,,0,1,0,,
1,2025-09-08 09:31:00,648.45,648.15,648.27,42118,649.06,647.75,648.405,1.31,narrow,...,0,1,0,,,0,1,0,,
2,2025-09-08 09:32:00,648.46,648.1,648.26,37143,649.06,647.75,648.405,1.31,narrow,...,0,1,0,,,0,1,0,,
3,2025-09-08 09:33:00,648.47,648.23,648.4,42231,649.06,647.75,648.405,1.31,narrow,...,1,0,0,,,1,0,0,,
4,2025-09-08 09:34:00,648.68,648.32,648.665,23659,649.06,647.75,648.405,1.31,narrow,...,1,0,0,,,1,0,0,,
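
In the output above, `state_ou_below` and `state_ou_above` are already 1 on rows where `delta_ou` is NaN, that is, where `avwap_up` does not exist yet. This happens because `min(axis=1)` and `max(axis=1)` skip NaN by default, so the corridor collapses to the single available line. If that behavior is not wanted, one possible variant (an assumption, not the notebook's method) is to pass `skipna=False` so that an incomplete pair yields no state at all. The `demo` frame and the `*_default`, `*_strict`, and `corridor_valid` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative values); avwap_up is missing on the first two rows
demo = pd.DataFrame({
    "close": [648.26, 648.40, 649.00],
    "avwap_open": [648.45, 648.39, 648.50],
    "avwap_up": [np.nan, np.nan, 648.80],
})
pair = ["avwap_open", "avwap_up"]

# Default behavior: NaN is skipped, so lo == hi == avwap_open on rows 0 and 1
lo_default = demo[pair].min(axis=1)

# Strict behavior: any NaN in the pair makes the bound NaN
lo_strict = demo[pair].min(axis=1, skipna=False)
hi_strict = demo[pair].max(axis=1, skipna=False)

# Comparisons against NaN are False, so strict states are 0 for incomplete pairs
demo["below_default"] = (demo["close"] < lo_default).astype(int)
demo["below_strict"] = (demo["close"] < lo_strict).astype(int)
demo["above_strict"] = (demo["close"] > hi_strict).astype(int)

# Explicit flag telling later steps whether the corridor really exists
demo["corridor_valid"] = demo[pair].notna().all(axis=1).astype(int)

print(demo)
```

Which behavior is right depends on the hypothesis being tested; the explicit validity flag lets either choice be filtered later.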


In [None]:
# Check whether the ud logic works as intended

df_aw[["datetime", "close", "avwap_up", "avwap_down", "avwap_open","state_ud_above", "state_ud_below", "state_ud_between"]].loc[400:450]

Unnamed: 0,datetime,close,avwap_up,avwap_down,avwap_open,state_ud_above,state_ud_below,state_ud_between
400,2025-09-09 09:40:00,649.08,,,649.140265,0,0,0
401,2025-09-09 09:41:00,649.08,,,649.133135,0,0,0
402,2025-09-09 09:42:00,649.07,,,649.132224,0,0,0
403,2025-09-09 09:43:00,649.11,,,649.128963,0,0,0
404,2025-09-09 09:44:00,649.28,,,649.132274,0,0,0
405,2025-09-09 09:45:00,649.01,,,649.129564,0,0,0
406,2025-09-09 09:46:00,648.95,,,649.123505,0,0,0
407,2025-09-09 09:47:00,649.25,,,649.124427,0,0,0
408,2025-09-09 09:48:00,648.97,,,649.122892,0,0,0
409,2025-09-09 09:49:00,649.08,,,649.121063,0,0,0


In [None]:
# After building the corridors, we need to describe how the AVWAP lines move, because a corridor is not constant.
# The AVWAP lines shift over time as price trades at different levels, so defining corridors alone is not enough.

K = 5  # last 5 minutes
gday = df_aw["datetime"].dt.normalize()  # groupby key (calendar day) so that shifts and rolling windows stay within one day

for name in ["open", "up", "down"]:
    av = f"avwap_{name}"  # each iteration works on one of our three AVWAP lines

    # slope direction: +1 / 0 / -1
    df_aw[f"slope_{name}_sign"] = (np.sign(df_aw[f"slope_{name}"])).fillna(0).astype(int)
    # +1 if the slope is positive, -1 if it is negative, 0 if it is NaN (or exactly zero)
    # REMINDER: we expect an exactly-zero slope to be very rare for SPY, because it is highly liquid
    # and there are always bids and asks within 5 minutes. In practice a 0 mostly means the slope is not available yet.

    # Flag whether the market price crossed the AVWAP line
    diff = df_aw["close"] - df_aw[av]  # negative diff -> price below AVWAP, positive diff -> price above AVWAP
    prev = diff.groupby(gday).shift(1)  # the same difference on the previous candle of the same day

    cross_now = (np.sign(diff) != np.sign(prev)).astype(int)  # 1 when the sign of diff differs between the previous and current candle
    # A sign change means the market price crossed the AVWAP line.
    # Note: NaN != NaN is True, so rows where diff or prev is NaN (first candle of a day, AVWAP not defined yet) are also flagged 1.
    df_aw[f"cross_px_{name}"] = cross_now

    # A cross can matter for a few minutes after it happens, so we also flag whether any cross occurred in the last K minutes.
    # If the current candle has no cross but one happened within the last K minutes, that recent cross may still explain the price move.
    df_aw[f"cross_px_{name}_last{K}"] = (
        cross_now.groupby(gday)
                 .rolling(K, min_periods=1)
                 .max()
                 .reset_index(level=0, drop=True)
                 .astype(int)
    )
    # These columns prepare the data for our later hypothesis tests.

# Our hypotheses also cover AVWAP-to-AVWAP crosses, which may help explain changes in the market price
for tag, (a, b) in pairs.items():  # the ud, ou, od pairs describe the relation between each two AVWAP lines

    diff = df_aw[a] - df_aw[b]  # same logic: position of one AVWAP line relative to the other
    prev = diff.groupby(gday).shift(1)  # the same difference on the previous candle of the same day

    cross_now = (np.sign(diff) != np.sign(prev)).astype(int)  # 1 when the sign of diff changed; NaN != NaN is True, so NaN rows are flagged too
    df_aw[f"cross_av_{tag}"] = cross_now  # cross event between the two AVWAP lines (previous vs. current sign)

    # A cross can matter for a few minutes after it happens, so we also flag whether any cross occurred in the last K minutes.
    # If the current candle has no cross but one happened within the last K minutes, that recent cross may still explain the price move.
    df_aw[f"cross_av_{tag}_last{K}"] = (
        cross_now.groupby(gday)
                 .rolling(K, min_periods=1)
                 .max()
                 .reset_index(level=0, drop=True)
                 .astype(int)
    )
    # These columns prepare the data for our later hypothesis tests.


df_aw.head()

Unnamed: 0,datetime,high,low,close,Volume,ib_high,ib_low,ib_mid,ib_width,ib_width_type,...,cross_px_up_last5,slope_down_sign,cross_px_down,cross_px_down_last5,cross_av_ud,cross_av_ud_last5,cross_av_ou,cross_av_ou_last5,cross_av_od,cross_av_od_last5
0,2025-09-08 09:30:00,648.86,648.24,648.26,141588,649.06,647.75,648.405,1.31,narrow,...,1,0,1,1,1,1,1,1,1,1
1,2025-09-08 09:31:00,648.45,648.15,648.27,42118,649.06,647.75,648.405,1.31,narrow,...,1,0,1,1,1,1,1,1,1,1
2,2025-09-08 09:32:00,648.46,648.1,648.26,37143,649.06,647.75,648.405,1.31,narrow,...,1,0,1,1,1,1,1,1,1,1
3,2025-09-08 09:33:00,648.47,648.23,648.4,42231,649.06,647.75,648.405,1.31,narrow,...,1,0,1,1,1,1,1,1,1,1
4,2025-09-08 09:34:00,648.68,648.32,648.665,23659,649.06,647.75,648.405,1.31,narrow,...,1,0,1,1,1,1,1,1,1,1
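
In the output above, every cross flag is 1 in the first rows, including `cross_px_down` while `avwap_down` is still NaN. The cause is that `np.sign` of NaN is NaN and `NaN != NaN` evaluates to True, so missing values look like sign changes. A NaN-safe variant could be sketched as follows; this is a suggestion under stated assumptions, not the notebook's original logic, and the `demo` frame and the `naive`, `valid`, and `safe` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: two sessions, one NaN AVWAP value in the first session
demo = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2025-09-08 09:30", "2025-09-08 09:31", "2025-09-08 09:32",
        "2025-09-08 09:33", "2025-09-09 09:30", "2025-09-09 09:31",
    ]),
    "close": [648.2, 648.3, 648.6, 648.7, 649.0, 649.3],
    "avwap_open": [648.4, 648.4, 648.5, np.nan, 649.1, 649.2],
})
gday = demo["datetime"].dt.normalize()

diff = demo["close"] - demo["avwap_open"]
prev = diff.groupby(gday).shift(1)  # previous candle within the same day

# Naive version: NaN on either side counts as a "cross"
naive = (np.sign(diff) != np.sign(prev)).astype(int)

# Safe version: only count a cross when both the current and previous diff exist
valid = diff.notna() & prev.notna()
safe = ((np.sign(diff) != np.sign(prev)) & valid).astype(int)

print(pd.DataFrame({"naive": naive, "safe": safe}))
```

Here the naive flags are `[1, 0, 1, 1, 1, 1]`, while the safe flags are `[0, 0, 1, 0, 0, 1]`: only the two genuine sign changes remain.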


## 3) Short-Term Volatility and Trend Pressure

So far, we have built the main technical structure of our dataset:

- **IB levels**
- **AVWAP lines**
- **AVWAP slopes**
- **Price position relative to AVWAPs** (states and deltas)
- **AVWAP position relative to other AVWAPs** (AVWAP-to-AVWAP crosses and deltas)
- **Timeframe flags** (IB zone, analysis zone, label window)

This is a strong foundation.  
But to test our hypotheses in a more realistic way, we still need a few **control variables** that describe the market’s *current condition*.

Why?

Because the same AVWAP setup can behave very differently depending on:

- How volatile the market is *right now*
- Whether price is already trending strongly or trapped in a range

So we add two extra tools that capture the short-term “environment” around each minute.


### Why do we need these extra controls?

Our AVWAP-based features explain **structure** and **relative positioning**.

But hypotheses usually depend not only on structure, but also on **context**:

- A “reversion” setup may fail if volatility is extremely high.
- A “breakout” setup may work better if trend pressure is already strong.
- Two identical “states” (e.g., price above AVWAP) can produce different outcomes if the market is calm vs. explosive.

So we add:

1. **Short-term volatility**
2. **Short-term trend pressure / position**

These variables help us avoid misleading conclusions by answering:

> “Is this setup happening in a calm market or a wild market?”  
> “Is price already leaning strongly in one direction or stuck in the middle?”


#### a) Short-term volatility (last 5–15 minutes)

**Question it answers:**

> **How wide are the candles recently?**

Meaning:

- We look at the last **5 to 15 minutes**
- We measure how large the recent price movement has been

Interpretation:

- **High short-term volatility**
  - candles are wide
  - price is moving quickly
  - outcomes may be more extreme (fast breakouts or sharp reversals)

- **Low short-term volatility**
  - candles are small
  - price is quieter
  - moves may be slower, with more mean-reversion behavior

So this feature tells us how “aggressive” the market is in the short run.
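
A minimal sketch of this idea on toy data, assuming the volatility proxy is the rolling mean of the relative candle range `(high - low) / close`. Grouping by day keeps a window from mixing yesterday's last candles with today's first ones. The `demo` frame and the `hl5_demo` column are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

# Two toy sessions of 6 one-minute candles each (made-up numbers)
idx = pd.to_datetime(
    [f"2025-09-08 09:{m}" for m in range(30, 36)]
    + [f"2025-09-09 09:{m}" for m in range(30, 36)]
)
demo = pd.DataFrame({
    "datetime": idx,
    "high": 100.0 + np.tile([0.10, 0.20, 0.10, 0.30, 0.20, 0.10], 2),
    "low": 100.0,
    "close": 100.0,
})

day_key = demo["datetime"].dt.normalize()
hl = (demo["high"] - demo["low"]) / demo["close"]  # relative 1-minute range

# Rolling 5-candle mean, restarted at the beginning of every day
demo["hl5_demo"] = (
    hl.groupby(day_key)
      .rolling(5, min_periods=5)
      .mean()
      .reset_index(level=0, drop=True)
)

print(demo[["datetime", "hl5_demo"]])
```

With `min_periods=5`, the first four candles of each day have no value, so the feature only appears once a full window of same-day data exists.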


#### b) Short-term trend pressure / position (last 30-minute channel)

**Question it answers:**

> **Where is the current close inside the recent 30-minute range?**

Meaning:

- We take the **last 30 minutes**
- We identify a simple “channel”:
  - recent **high**
  - recent **low**
- Then we locate the current close inside that channel

Interpretation:

- If the close is near the **top** of the last 30-minute range  
  → there is **upward pressure** (buyers are pushing and holding higher levels)

- If the close is near the **bottom** of the last 30-minute range  
  → there is **downward pressure** (sellers are pushing and holding lower levels)

- If the close is near the **middle**  
  → the market is more balanced (less directional pressure)

So this feature captures whether the market is currently “leaning” bullish or bearish in a local sense.
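
The position inside the channel can be written as `(close - low) / (high - low)`. The scalar sketch below illustrates that formula and adds a guard for a flat channel, where the denominator would be zero. The function name `channel_position` is hypothetical, and the guard is an assumption rather than part of the notebook.

```python
import numpy as np


def channel_position(close, roll_hi, roll_lo):
    """Position of close inside [roll_lo, roll_hi]: 0 = bottom, 1 = top.

    Returns NaN when the channel is flat or undefined, to avoid dividing by zero.
    """
    width = roll_hi - roll_lo
    if not np.isfinite(width) or width <= 0:
        return float("nan")
    return (close - roll_lo) / width


top_heavy = channel_position(649.5, 650.0, 648.0)  # three quarters of the way up
at_bottom = channel_position(648.0, 650.0, 648.0)  # exactly on the recent low
flat = channel_position(648.0, 648.0, 648.0)       # flat channel -> NaN

print(top_heavy, at_bottom, flat)
```

In a vectorized pandas version, the same guard could be expressed by replacing zero widths with NaN before dividing, for example with `(roll_hi - roll_lo).replace(0, np.nan)`.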


#### Why these two features matter for our hypotheses

These two controls connect our AVWAP-based logic to **real price behavior**:

- **Volatility** tells us how much price is moving right now.
- **Trend pressure** tells us whether price is already being pushed toward one side.

When we test hypotheses (continuation vs. reversion), these controls help us explain:

- *When* the hypothesis is more likely to work
- *When* it is likely to fail


In [25]:
day_key = df_aw["datetime"].dt.normalize()

hl = (df_aw["high"] - df_aw["low"]) / df_aw["close"]  # 1-minute candle range (high - low) relative to the "close" price
df_aw["hl_pct"] = hl

# Volatility proxy: average relative candle range over a rolling window
# Mean range of the last 5 one-minute candles
df_aw["hl5"] = (hl.groupby(day_key)
               .rolling(5, min_periods=5)
               .mean()
               .reset_index(level=0, drop=True))

# Mean range of the last 15 one-minute candles
df_aw["hl15"] = (hl.groupby(day_key)
                .rolling(15, min_periods=15)
                .mean()
                .reset_index(level=0, drop=True))


# Price corridor of the last 30 minutes, between the lowest low and the highest high in that window
m = 30

# highest high price
roll_hi = (df_aw["high"].groupby(day_key)
             .rolling(m, min_periods=m)
             .max()
             .reset_index(level=0, drop=True))

# lowest low price
roll_lo = (df_aw["low"].groupby(day_key)
             .rolling(m, min_periods=m)
             .min()
             .reset_index(level=0, drop=True))

# Position of the "close" price inside the rolling 30-minute corridor
df_aw["trend_score_m30"] = (df_aw["close"] - roll_lo) / (roll_hi - roll_lo)
# score near 0   --> close is near the lowest low, at the bottom of the corridor --> selling pressure
# score near 0.5 --> close is in the middle of the corridor --> no clear buying or selling pressure
# score near 1   --> close is near the highest high, at the top of the corridor --> buying pressure

# REMINDER: a score near 0.5 means price is respecting its recent range. As the score approaches
# 0 (zero) or 1, price is pressing against the edge of the range, which shows a tendency toward a breakout or a new trend.

df_aw.head(30)


Unnamed: 0,datetime,high,low,close,Volume,ib_high,ib_low,ib_mid,ib_width,ib_width_type,...,cross_av_ud,cross_av_ud_last5,cross_av_ou,cross_av_ou_last5,cross_av_od,cross_av_od_last5,hl_pct,hl5,hl15,trend_score_m30
0,2025-09-08 09:30:00,648.86,648.24,648.26,141588,649.06,647.75,648.405,1.31,narrow,...,1,1,1,1,1,1,0.000956,,,
1,2025-09-08 09:31:00,648.45,648.15,648.27,42118,649.06,647.75,648.405,1.31,narrow,...,1,1,1,1,1,1,0.000463,,,
2,2025-09-08 09:32:00,648.46,648.1,648.26,37143,649.06,647.75,648.405,1.31,narrow,...,1,1,1,1,1,1,0.000555,,,
3,2025-09-08 09:33:00,648.47,648.23,648.4,42231,649.06,647.75,648.405,1.31,narrow,...,1,1,1,1,1,1,0.00037,,,
4,2025-09-08 09:34:00,648.68,648.32,648.665,23659,649.06,647.75,648.405,1.31,narrow,...,1,1,1,1,1,1,0.000555,0.00058,,
5,2025-09-08 09:35:00,648.88,648.62,648.78,38252,649.06,647.75,648.405,1.31,narrow,...,1,1,1,1,1,1,0.000401,0.000469,,
6,2025-09-08 09:36:00,648.92,648.78,648.79,36436,649.06,647.75,648.405,1.31,narrow,...,1,1,1,1,1,1,0.000216,0.000419,,
7,2025-09-08 09:37:00,649.06,648.8,648.87,35151,649.06,647.75,648.405,1.31,narrow,...,1,1,1,1,1,1,0.000401,0.000388,,
8,2025-09-08 09:38:00,648.91,648.23,648.23,52975,649.06,647.75,648.405,1.31,narrow,...,1,1,1,1,1,1,0.001049,0.000524,,
9,2025-09-08 09:39:00,648.35,648.06,648.11,58512,649.06,647.75,648.405,1.31,narrow,...,1,1,1,1,1,1,0.000447,0.000503,,


In [26]:
# Move the short-term volatility and trend columns so that they sit between "Volume" and "ib_high",
# because these variables are unrelated to our AVWAP and IB calculations

cols = df_aw.columns.to_list()
move = ["hl_pct","hl5","hl15","trend_score_m30"]  # columns to move right after the "Volume" column
left = "Volume"   # anchor column

idx = cols.index(left) + 1  # insertion position: one past the index of "Volume"
new_cols = [c for c in cols if c not in move]  # column order without the columns being moved
new_cols[idx:idx] = move  # slice assignment inserts the moved columns at position idx
df2 = df_aw[new_cols]

df2.head()

Unnamed: 0,datetime,high,low,close,Volume,hl_pct,hl5,hl15,trend_score_m30,ib_high,...,cross_px_up_last5,slope_down_sign,cross_px_down,cross_px_down_last5,cross_av_ud,cross_av_ud_last5,cross_av_ou,cross_av_ou_last5,cross_av_od,cross_av_od_last5
0,2025-09-08 09:30:00,648.86,648.24,648.26,141588,0.000956,,,,649.06,...,1,0,1,1,1,1,1,1,1,1
1,2025-09-08 09:31:00,648.45,648.15,648.27,42118,0.000463,,,,649.06,...,1,0,1,1,1,1,1,1,1,1
2,2025-09-08 09:32:00,648.46,648.1,648.26,37143,0.000555,,,,649.06,...,1,0,1,1,1,1,1,1,1,1
3,2025-09-08 09:33:00,648.47,648.23,648.4,42231,0.00037,,,,649.06,...,1,0,1,1,1,1,1,1,1,1
4,2025-09-08 09:34:00,648.68,648.32,648.665,23659,0.000555,0.00058,,,649.06,...,1,0,1,1,1,1,1,1,1,1


## 4) Saving the final dataframe as a `.csv` file

At this point, we have finished building the full **Features and States** dataset.  
Our dataframe now includes everything we need to move into the next project stages.


### What we have completed so far

We successfully added:

- **Timeframe flags**
  - **Initial Balance window** → `is_ib` (09:30–10:30)
  - **Analysis zone** → `is_analysis` (10:45–15:30)
  - **Label window** → `is_labelwin` (10:45–15:10)

- **States and deltas**
  - **Price → AVWAP** relationships  
    (where price is relative to `avwap_open`, `avwap_up`, `avwap_down`)
  - **AVWAP → AVWAP** relationships  
    (how the AVWAP lines relate to each other)
  - **Delta values** that measure *how far* things are from each other  
    (not just “above/below”, but the strength of separation)

- **Short-term market context**
  - **Short-term volatility** (recent 5–15 minute behavior)
  - **Short-term trend pressure / position**  
    (where the close sits inside the recent 30-minute channel)

So we now have a dataset that contains:

- Raw 1-minute price information
- Technical structure (IB + AVWAPs + slopes)
- Clear state definitions
- Extra context variables that make hypotheses more reliable


### Why we save the dataframe now

This is a good point to save a “checkpoint” file because:

- The dataset is now **feature-rich** and **ready for reuse**
- We do not want to recompute all these columns in every notebook
- Future notebooks (labeling, hypothesis evaluation, model fitting) can simply load this file and start immediately


### Where we save it

We export this dataframe as a **`.csv` file** into:

- **`data/cache/`**

This folder is meant for **processed intermediate datasets** that will be used later in the pipeline.

In [27]:
from pathlib import Path

# 1) Define the project root, i.e. the top-level folder of our repository
PROJECT_ROOT = Path("..").resolve()

# 2) Define the path to the data/cache folder
DATA_CACHE = PROJECT_ROOT / "data" / "cache"
DATA_CACHE.mkdir(parents=True, exist_ok=True)  # create the folder if it does not exist

clean_csv_path = DATA_CACHE / "spy_1min_et_clean_with_completed_all_features_states.csv"

df2.to_csv(clean_csv_path, index=False)

print("Saved CSV to:", clean_csv_path)

Saved CSV to: /Users/canka/Dev/python/DSA210-Project-Can-Karadogan/data/cache/spy_1min_et_clean_with_completed_all_features_states.csv
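
A later notebook would need to re-parse the `datetime` column when loading the checkpoint, because CSV stores it as text. The round trip can be sketched on a toy frame as below. The temporary folder and the file name `features_states_demo.csv` are placeholders for illustration, not the project's real path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with one datetime, one float and one boolean flag column
demo = pd.DataFrame({
    "datetime": pd.to_datetime(["2025-09-08 10:45", "2025-09-08 10:46"]),
    "close": [648.9, 649.1],
    "is_analysis": [True, True],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "features_states_demo.csv"  # placeholder file name
    demo.to_csv(path, index=False)                 # index=False avoids an extra "Unnamed: 0" column
    loaded = pd.read_csv(path, parse_dates=["datetime"])

print(loaded.dtypes)
```

Writing with `index=False` and reading with `parse_dates` keeps the column list and the datetime type consistent between notebooks.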
