# Data Acquisition & Engineering  
## Quantitative Trading System (NIFTY 50)

**Objective:**  
To acquire, clean, and structure one year of daily market data for a complete quantitative trading system covering:
- Data engineering
- Feature engineering
- Regime detection
- Algorithmic strategy development
- Machine learning and statistical analysis

**Market:** Indian Equity & Derivatives Market (NSE)  
**Instruments:** NIFTY 50 Spot, NIFTY Futures, NIFTY Options  
**Data Period:** 14 January 2025 – 14 January 2026


## Why Daily Data Instead of 5-Minute Interval Data?

Although intraday (5-minute) data is commonly used in short-horizon trading systems, this project intentionally uses **daily data** for the following reasons:

### 1. Strategy Horizon Alignment
This project focuses on:
- Market regime detection
- Medium-term trend identification
- Derivatives positioning analysis
- Statistical and ML-based signal generation  

Daily data is better suited to capturing **structural market behavior** than short-term noise.

### 2. Noise Reduction
Intraday data contains significant microstructure noise and frictions, including:
- Order book fluctuations
- Short-term liquidity imbalances
- Transaction costs that are large relative to short-horizon price moves

Daily aggregation improves signal stability and statistical robustness.

### 3. Data Reliability & Reproducibility
- Clean intraday historical derivatives data is not freely available from NSE's public archives
- Daily NSE historical data and F&O bhavcopies are **official, reliable, and reproducible**
- This ensures transparency and auditability of the research

### 4. Computational Efficiency
Daily data allows:
- Faster feature engineering
- Easier regime clustering
- Efficient ML model training (with only about 250 observations per year, overfitting still has to be controlled through validation)

### 5. Realistic Retail Quant Constraints
Most retail and academic quant research in Indian markets operates on daily data due to:
- API limitations
- Cost constraints
- Data availability

> Hence, daily data is sufficient and appropriate for building a robust end-to-end quantitative trading system.


In [1]:
import pandas as pd
import numpy as np
import os
from datetime import datetime

# NIFTY 50 Spot Data Acquisition (Daily)

This notebook combines daily NSE index snapshot files into a single
clean NIFTY 50 daily OHLCV dataset.

**Input:**
- 249 daily raw NSE index CSV files (one file per trading day)

**Output:**
- nifty_spot_daily.csv

Only the **NIFTY 50** index is retained.


In [3]:
import os

RAW_SPOT_PATH = "/content/raw spot"

files = [f for f in os.listdir(RAW_SPOT_PATH) if f.endswith(".csv")]
print("Total raw spot files found:", len(files))

files[:5]

Total raw spot files found: 249


['ind_close_all_21032025.csv',
 'ind_close_all_29072025.csv',
 'ind_close_all_21012025.csv',
 'ind_close_all_14112025.csv',
 'ind_close_all_04122025.csv']
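The file names appear to embed the trading date as `ddmmyyyy` (for example `ind_close_all_21032025.csv`), so sorting by name orders files by day-of-month rather than chronologically. The sketch below assumes that naming pattern holds for every file and parses the date out of each name; `parse_file_date` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime
import re

def parse_file_date(filename):
    """Extract the ddmmyyyy date embedded in an index snapshot file name."""
    match = re.search(r"_(\d{8})\.csv$", filename)
    if match is None:
        raise ValueError(f"No ddmmyyyy date found in {filename!r}")
    return datetime.strptime(match.group(1), "%d%m%Y").date()

sample_files = [
    "ind_close_all_21032025.csv",
    "ind_close_all_29072025.csv",
    "ind_close_all_21012025.csv",
]

# Sort chronologically by the parsed date rather than alphabetically by name.
chronological = sorted(sample_files, key=parse_file_date)
print(chronological)
```

The same parsed dates could also be compared against the `Index Date` column to confirm that each file holds the day its name claims.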

In [5]:
import pandas as pd

dfs = []

for file in sorted(files):
    file_path = os.path.join(RAW_SPOT_PATH, file)
    df = pd.read_csv(file_path)
    dfs.append(df)

combined_spot_df = pd.concat(dfs, ignore_index=True)

combined_spot_df.head()

Unnamed: 0,Index Name,Index Date,Open Index Value,High Index Value,Low Index Value,Closing Index Value,Points Change,Change(%),Volume,Turnover (Rs. Cr.),P/E,P/B,Div Yield
0,Nifty 50,01-01-2026,26173.3,26197.55,26113.4,26146.55,16.95,0.06,425631910,23454.66,22.76,3.56,1.28
1,Nifty Next 50,01-01-2026,69523.15,69722.15,69276.8,69675.4,310.9,0.45,216114798,7311.28,20.27,3.62,1.41
2,Nifty 100,01-01-2026,26736.55,26764.25,26674.1,26723.25,33.95,0.13,641746708,30765.95,22.32,3.57,1.3
3,Nifty 200,01-01-2026,14560.7,14571.25,14518.4,14560.25,26.75,0.18,3987031128,47016.89,23.81,3.7,1.21
4,Nifty 500,01-01-2026,23914.65,23925.55,23835.55,23909.55,38.0,0.16,4549066214,61090.24,24.49,3.72,1.15


In [6]:
combined_spot_df.columns

Index(['Index Name', 'Index Date', 'Open Index Value', 'High Index Value',
       'Low Index Value', 'Closing Index Value', 'Points Change', 'Change(%)',
       'Volume', 'Turnover (Rs. Cr.)', 'P/E', 'P/B', 'Div Yield'],
      dtype='object')

In [7]:
nifty50_df = combined_spot_df[
    combined_spot_df['Index Name'].str.strip() == 'Nifty 50'
].copy()

print("Rows after filtering Nifty 50:", nifty50_df.shape)
nifty50_df.head()

Rows after filtering Nifty 50: (249, 13)


Unnamed: 0,Index Name,Index Date,Open Index Value,High Index Value,Low Index Value,Closing Index Value,Points Change,Change(%),Volume,Turnover (Rs. Cr.),P/E,P/B,Div Yield
0,Nifty 50,01-01-2026,26173.3,26197.55,26113.4,26146.55,16.95,0.06,425631910,23454.66,22.76,3.56,1.28
143,Nifty 50,01-04-2025,23341.1,23565.15,23136.4,23165.7,-353.65,-1.5,375118343,30680.7,21.05,3.43,1.34
280,Nifty 50,01-07-2025,25551.35,25593.4,25501.8,25541.8,24.75,0.1,260669106,23921.02,22.99,3.7,1.24
419,Nifty 50,01-08-2025,24734.9,24784.15,24535.05,24565.35,-203.0,-0.82,263357194,24972.45,21.7,3.35,1.41
554,Nifty 50,01-09-2025,24432.7,24635.6,24432.7,24625.05,198.2,0.81,229327424,20704.88,21.63,3.29,1.36


In [9]:
nifty50_df['Index Date'] = pd.to_datetime(
    nifty50_df['Index Date'],
    dayfirst=True,
    errors='coerce'
)
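Because `errors='coerce'` turns any unparseable date into `NaT` without raising, it may be worth confirming that nothing was silently lost. A small sketch on toy data (the values are illustrative, not taken from the NSE files), assuming the `dd-mm-yyyy` layout seen in the output above:

```python
import pandas as pd

toy = pd.DataFrame({"Index Date": ["14-01-2025", "15-01-2025", "not a date"]})

# An explicit format removes day/month ambiguity; bad values become NaT.
toy["Index Date"] = pd.to_datetime(
    toy["Index Date"], format="%d-%m-%Y", errors="coerce"
)

n_bad = int(toy["Index Date"].isna().sum())
n_dupes = int(toy["Index Date"].dropna().duplicated().sum())
print(f"unparseable dates: {n_bad}, duplicate dates: {n_dupes}")
```

On the real data both counts would be expected to be zero; a non-zero count points to a malformed or repeated file.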

In [11]:
nifty50_df = nifty50_df.sort_values('Index Date')

In [12]:
print(nifty50_df['Index Date'].min())
print(nifty50_df['Index Date'].max())

2025-01-14 00:00:00
2026-01-14 00:00:00


In [16]:
for col in nifty50_df.columns:
    print(f"'{col}'")

'Index Name'
'Index Date'
'Open Index Value'
'High Index Value'
'Low Index Value'
'Closing Index Value'
'Points Change'
'Change(%)'
'Volume'
'Turnover (Rs. Cr.)'
'P/E'
'P/B'
'Div Yield'


In [17]:
nifty50_df.columns = nifty50_df.columns.str.strip()

In [20]:
nifty50_df = nifty50_df[[
    'Index Date',
    'Open Index Value',
    'High Index Value',
    'Low Index Value',
    'Closing Index Value',
    'Volume'
]]

In [22]:
nifty50_df = nifty50_df.rename(columns={
    'Index Date': 'date',
    'Open Index Value': 'open',
    'High Index Value': 'high',
    'Low Index Value': 'low',
    'Closing Index Value': 'close',
    'Volume': 'volume'
})

In [23]:
nifty50_df = nifty50_df.reset_index(drop=True)

In [25]:
nifty50_df.to_csv("nifty_spot_daily.csv", index=False)

## NIFTY 50 Spot Data – Final Dataset

- Frequency: Daily
- Period: 14 Jan 2025 – 14 Jan 2026
- Source: NSE Historical Index Data
- Rows: 249 (one per trading day)
- Fields: date, open, high, low, close, volume

This dataset serves as the base market reference for:
- Return calculations
- Volatility estimation
- Market regime detection
- Benchmarking derivative strategies
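As a rough illustration of the first two uses, the sketch below computes daily log returns and a rolling annualized volatility. The prices are made-up toy values, and both the 3-day window and the 252-trading-day annualization factor are assumptions chosen only to keep the example small.

```python
import numpy as np
import pandas as pd

# Toy closing prices; in practice these would come from nifty_spot_daily.csv.
spot = pd.DataFrame({
    "date": pd.date_range("2025-01-14", periods=6, freq="B"),
    "close": [100.0, 101.0, 99.5, 100.5, 102.0, 101.0],
})

# Daily log return: ln(close_t / close_{t-1}).
spot["log_ret"] = np.log(spot["close"]).diff()

# Rolling standard deviation of returns, annualized with sqrt(252).
window = 3
spot["vol_ann"] = spot["log_ret"].rolling(window).std() * np.sqrt(252)
print(spot)
```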




---



# NIFTY Futures Data Acquisition (Daily)

This notebook constructs a **continuous daily NIFTY Index Futures dataset**
using NSE F&O bhavcopy files.

### Objective
- Extract **NIFTY Index Futures (FUTIDX)**
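- Select the **near-month contract** (the nearest expiry listed on each trading day)
- Handle **monthly expiry rollover** (the series switches to the next contract once the expiring one drops out of the bhavcopy; prices are not back-adjusted)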
- Create a clean daily futures time series

### Data Details
- Source: NSE F&O Bhavcopy
- Frequency: Daily
- Period: 14 Jan 2025 – 14 Jan 2026

### Output
- `nifty_futures_daily.csv`


In [26]:
import pandas as pd
import os

In [28]:
RAW_FUTURE_PATH = "/content/raw future"

In [29]:
files = sorted([
    os.path.join(RAW_FUTURE_PATH, f)
    for f in os.listdir(RAW_FUTURE_PATH)
    if f.endswith(".csv")
])

print("Total bhavcopy files found:", len(files))

Total bhavcopy files found: 250


In [30]:
dfs = []

for file in files:
    df = pd.read_csv(file)
    df.columns = df.columns.str.strip()  # Clean column names
    dfs.append(df)

fno_df = pd.concat(dfs, ignore_index=True)
fno_df.head()

Unnamed: 0,INSTRUMENT,SYMBOL,EXPIRY_DT,EXPIRY_DT_FINAL,STRIKE_PR,OPTION_TYP,OPEN,HIGH,LOW,CLOSE,SETTLE_PR,CONTRACTS,VAL_INLAKH,OPEN_INT,CHG_IN_OI,TIMESTAMP
0,OPTSTK,NCC,27-Feb-2025,2025-02-27,265.0,CE,0.0,0.0,0.0,37.45,9.2,0,0.0,0,0,14-JAN-2025
1,OPTSTK,NCC,30-Jan-2025,2025-01-30,265.0,PE,26.35,26.35,26.35,26.35,21.05,1,5.171462,252050,1775,14-JAN-2025
2,OPTSTK,NCC,30-Jan-2025,2025-01-30,265.0,CE,1.95,2.3,1.45,2.1,2.1,125,592.359212,228975,39050,14-JAN-2025
3,OPTSTK,NCC,27-Mar-2025,2025-03-27,270.0,CE,0.0,0.0,0.0,40.1,12.25,0,0.0,0,0,14-JAN-2025
4,OPTSTK,NCC,30-Jan-2025,2025-01-30,220.0,PE,1.55,2.45,1.2,1.2,1.2,137,539.105662,156200,12425,14-JAN-2025


In [31]:
fno_df.columns

Index(['INSTRUMENT', 'SYMBOL', 'EXPIRY_DT', 'EXPIRY_DT_FINAL', 'STRIKE_PR',
       'OPTION_TYP', 'OPEN', 'HIGH', 'LOW', 'CLOSE', 'SETTLE_PR', 'CONTRACTS',
       'VAL_INLAKH', 'OPEN_INT', 'CHG_IN_OI', 'TIMESTAMP'],
      dtype='object')

In [32]:
fut_df = fno_df[
    (fno_df['INSTRUMENT'] == 'FUTIDX') &
    (fno_df['SYMBOL'] == 'NIFTY')
].copy()

print("Rows after filtering FUTIDX NIFTY:", fut_df.shape)
fut_df.head()

Rows after filtering FUTIDX NIFTY: (750, 16)


Unnamed: 0,INSTRUMENT,SYMBOL,EXPIRY_DT,EXPIRY_DT_FINAL,STRIKE_PR,OPTION_TYP,OPEN,HIGH,LOW,CLOSE,SETTLE_PR,CONTRACTS,VAL_INLAKH,OPEN_INT,CHG_IN_OI,TIMESTAMP
39384,FUTIDX,NIFTY,27-Feb-2025,2025-02-27,,,23348.0,23471.2,23335.75,23399.6,23399.6,10908,191381.4,2802150,466425,14-JAN-2025
40705,FUTIDX,NIFTY,27-Mar-2025,2025-03-27,,,23518.0,23620.0,23493.0,23562.4,23562.4,1945,34355.84,411450,2775,14-JAN-2025
40706,FUTIDX,NIFTY,30-Jan-2025,2025-01-30,,,23248.0,23339.0,23198.45,23271.75,23271.75,194699,1132413.0,13753850,-113075,14-JAN-2025
80646,FUTIDX,NIFTY,27-Feb-2025,2025-02-27,,,23420.05,23479.7,23341.05,23400.65,23400.65,4942,86747.42,2906625,104475,15-JAN-2025
81597,FUTIDX,NIFTY,27-Mar-2025,2025-03-27,,,23574.25,23620.0,23495.0,23546.55,23546.55,1639,28952.57,422400,10950,15-JAN-2025


In [33]:
fut_df['TIMESTAMP'] = pd.to_datetime(fut_df['TIMESTAMP'], dayfirst=True)
fut_df['EXPIRY_DT'] = pd.to_datetime(fut_df['EXPIRY_DT'], dayfirst=True)



In [34]:
fut_df = fut_df.sort_values(['TIMESTAMP', 'EXPIRY_DT'])

In [36]:
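# Near-month series: rows are sorted by (TIMESTAMP, EXPIRY_DT), so the first
# row per date is the nearest-expiry contract. drop_duplicates keeps that row
# whole, whereas GroupBy.first() takes the first non-null value per column.
fut_near = fut_df.drop_duplicates(subset='TIMESTAMP', keep='first').reset_index(drop=True)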

In [37]:
print(type(fut_near))
print(fut_near.shape)

<class 'pandas.core.frame.DataFrame'>
(250, 16)
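One subtlety when picking the near-month row for each day: `GroupBy.first()` returns the first non-null value of each column independently, so a missing value in the near contract would be filled from a farther contract. Sorting and then calling `drop_duplicates(keep="first")` keeps whole rows instead. The toy frame below (made-up values) shows the difference:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime(["2025-01-14", "2025-01-14"]),
    "EXPIRY_DT": pd.to_datetime(["2025-01-30", "2025-02-27"]),
    "CLOSE": [np.nan, 101.0],   # near contract has a missing close
    "OPEN_INT": [500, 50],
}).sort_values(["TIMESTAMP", "EXPIRY_DT"])

# GroupBy.first(): first non-null value per column, so CLOSE comes from
# the February row while OPEN_INT comes from the January row.
via_first = toy.groupby("TIMESTAMP").first().reset_index()

# drop_duplicates(keep="first"): keeps the entire first row per date.
via_rows = toy.drop_duplicates(subset="TIMESTAMP", keep="first")

print(via_first)
print(via_rows)
```

With fully populated futures rows the two approaches give the same values, so this matters mainly as a safeguard.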


In [38]:
fut_near = fut_near[[
    'TIMESTAMP',
    'OPEN',
    'HIGH',
    'LOW',
    'CLOSE',
    'CONTRACTS',
    'OPEN_INT'
]]

In [39]:
fut_near = fut_near.rename(columns={
    'TIMESTAMP': 'date',
    'OPEN': 'open',
    'HIGH': 'high',
    'LOW': 'low',
    'CLOSE': 'close',
    'CONTRACTS': 'volume',
    'OPEN_INT': 'open_interest'
})

In [40]:
fut_near = fut_near.reset_index(drop=True)

In [41]:
fut_near.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           250 non-null    datetime64[ns]
 1   open           250 non-null    float64       
 2   high           250 non-null    float64       
 3   low            250 non-null    float64       
 4   close          250 non-null    float64       
 5   volume         250 non-null    int64         
 6   open_interest  250 non-null    int64         
dtypes: datetime64[ns](1), float64(4), int64(2)
memory usage: 13.8 KB


In [42]:
fut_near.head()

Unnamed: 0,date,open,high,low,close,volume,open_interest
0,2025-01-14,23248.0,23339.0,23198.45,23271.75,194699,13753850
1,2025-01-15,23302.05,23345.0,23201.0,23265.9,143790,13536300
2,2025-01-16,23406.0,23423.9,23346.15,23377.55,171198,13500850
3,2025-01-17,23344.9,23353.8,23150.1,23267.2,218388,14032150
4,2025-01-20,23339.95,23449.0,23220.0,23400.2,163712,14133750


In [43]:
fut_near.to_csv("nifty_futures_daily.csv", index=False)

## NIFTY Futures – Final Dataset

- Instrument: NIFTY Index Futures (FUTIDX)
- Contract Type: Near-month continuous futures (nearest expiry each day, no price back-adjustment at rollover)
- Frequency: Daily
- Period: 14 Jan 2025 – 14 Jan 2026

### Fields
- Open, High, Low, Close
- Volume (Contracts Traded)
- Open Interest
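Because the series is built from whichever contract is nearest to expiry, a close-to-close return across a rollover compares two different contracts. The sketch below flags roll days and blanks those returns. It assumes the expiry date is kept alongside each row (the saved file does not include it), and the values are made up.

```python
import pandas as pd

# Toy near-month series that still carries its expiry date.
fut = pd.DataFrame({
    "date": pd.to_datetime(
        ["2025-01-29", "2025-01-30", "2025-01-31", "2025-02-03"]),
    "expiry": pd.to_datetime(
        ["2025-01-30", "2025-01-30", "2025-02-27", "2025-02-27"]),
    "close": [100.0, 101.0, 103.0, 104.0],
})

# A roll day is one whose contract differs from the previous day's contract.
prev_expiry = fut["expiry"].shift()
fut["is_roll"] = fut["expiry"].ne(prev_expiry) & prev_expiry.notna()

# Simple returns, blanked where they would span two different contracts.
fut["ret"] = (fut["close"] / fut["close"].shift() - 1).mask(fut["is_roll"])
print(fut)
```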

### Usage in Quant System
- Futures-spot trend confirmation
- Open interest based position analysis
- Regime and momentum validation
- Strategy signal filtering
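Futures-spot comparisons rely on the two datasets sharing the same dates. The spot set above has 249 rows and the futures set 250, so an alignment check is worthwhile before computing anything. A minimal sketch on toy data (made-up prices), where the basis is taken as futures close minus spot close:

```python
import pandas as pd

spot = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-14", "2025-01-15"]),
    "close": [100.0, 101.0],
})
fut = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-14", "2025-01-15", "2025-01-16"]),
    "close": [100.4, 101.3, 102.0],
})

# Outer merge with an indicator column to expose dates missing on one side.
merged = fut.merge(spot, on="date", how="outer",
                   suffixes=("_fut", "_spot"), indicator=True)
unmatched = merged.loc[merged["_merge"] != "both", "date"]

# Basis = futures close minus spot close, on matched dates only.
both = merged[merged["_merge"] == "both"].copy()
both["basis"] = both["close_fut"] - both["close_spot"]

print(unmatched.tolist())
print(both[["date", "basis"]])
```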




---



# NIFTY Options Data Acquisition (Daily)

This notebook constructs a **daily NIFTY index options dataset**
from NSE F&O bhavcopy files.

### Objective
- Extract NIFTY index options (OPTIDX)
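- Select the nearest expiry for each strike (intended as near-month; if weekly NIFTY contracts are present in the bhavcopy, this is typically the nearest weekly expiry)
- Retain ATM, ATM ±1, ATM ±2 strikes (50-point strike interval, ATM derived from the spot close)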
- Include both Call and Put options
- Build a clean daily options dataset

### Data Source
- NSE F&O Bhavcopy
- Frequency: Daily
- Period: 14 Jan 2025 – 14 Jan 2026

### Output
- nifty_options_daily.csv


In [45]:
import pandas as pd
import os

In [47]:
# Index options are stored in the same F&O bhavcopy files as the futures.
RAW_FUTURE_PATH = "/content/raw future"

files = sorted([
    os.path.join(RAW_FUTURE_PATH, f)
    for f in os.listdir(RAW_FUTURE_PATH)
    if f.endswith(".csv")
])

dfs = []
for file in files:
    df = pd.read_csv(file)
    df.columns = df.columns.str.strip()
    dfs.append(df)

fno_df = pd.concat(dfs, ignore_index=True)

In [48]:
opt_df = fno_df[
    (fno_df['INSTRUMENT'] == 'OPTIDX') &
    (fno_df['SYMBOL'] == 'NIFTY')
].copy()

print("Total option rows:", opt_df.shape)

Total option rows: (396825, 16)


In [49]:
opt_df['TIMESTAMP'] = pd.to_datetime(opt_df['TIMESTAMP'], dayfirst=True)
opt_df['EXPIRY_DT'] = pd.to_datetime(opt_df['EXPIRY_DT'], dayfirst=True)



In [50]:
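# Keep the nearest-expiry row per (date, strike, type). With weekly NIFTY
# contracts in the bhavcopy, this is typically a weekly expiry, not the monthly.
opt_df = opt_df.sort_values(['TIMESTAMP', 'STRIKE_PR', 'OPTION_TYP', 'EXPIRY_DT'])
opt_df = opt_df.drop_duplicates(subset=['TIMESTAMP', 'STRIKE_PR', 'OPTION_TYP'], keep='first').reset_index(drop=True)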
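Selecting the earliest expiry separately for each strike means different strikes on the same day can end up with different expiries, and the result need not be the monthly contract. The sketch below shows one alternative. It assumes the monthly contract is the last expiry listed in each calendar month, which should be checked against the actual expiry calendar. The data are toy values.

```python
import pandas as pd

toy = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime(["2025-01-14"] * 4),
    "EXPIRY_DT": pd.to_datetime(
        ["2025-01-16", "2025-01-23", "2025-01-30", "2025-02-27"]),
    "STRIKE_PR": [23200.0] * 4,
    "OPTION_TYP": ["CE"] * 4,
})

# Treat the last expiry seen in each calendar month as the monthly contract.
expiries = pd.Series(toy["EXPIRY_DT"].unique())
monthly = expiries.groupby(expiries.dt.to_period("M")).max()
monthly_rows = toy[toy["EXPIRY_DT"].isin(monthly)]

# For each trading day, keep only rows of the nearest monthly expiry.
nearest = monthly_rows.groupby("TIMESTAMP")["EXPIRY_DT"].transform("min")
near_month = monthly_rows[monthly_rows["EXPIRY_DT"] == nearest]
print(near_month)
```

Picking one expiry per day first, and filtering strikes afterwards, also guarantees that every retained strike shares the same expiry.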

In [51]:
spot_df = pd.read_csv("nifty_spot_daily.csv")
spot_df['date'] = pd.to_datetime(spot_df['date'])

In [52]:
opt_df = opt_df.merge(
    spot_df[['date', 'close']],
    left_on='TIMESTAMP',
    right_on='date',
    how='left'
)

In [54]:
import numpy as np

def get_atm(price, interval=50):
    if pd.isna(price):
        return np.nan
    return round(price / interval) * interval
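Python's built-in `round` sends exact midpoints to the nearest even integer, so a close that sits exactly halfway between two strikes can round down or up depending on the strike. This rarely matters for index closes, but a vectorized half-up variant is easy to write. `get_atm_half_up` is a hypothetical name and the prices are toy values.

```python
import numpy as np
import pandas as pd

def get_atm_half_up(price, interval=50):
    """Round to the nearest strike, sending exact midpoints upward."""
    return np.floor(price / interval + 0.5) * interval

prices = pd.Series([23176.05, 23125.0, 23149.9])

# Built-in round: the midpoint 23125 goes to 23100 (462.5 rounds to 462).
builtin = prices.apply(lambda p: round(p / 50) * 50)

# Half-up variant works on the whole Series at once: 23125 goes to 23150.
half_up = get_atm_half_up(prices)

print(builtin.tolist())
print(half_up.tolist())
```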

In [55]:
opt_df['ATM'] = opt_df['close'].apply(get_atm)

In [56]:
opt_df['strike_diff'] = opt_df['STRIKE_PR'] - opt_df['ATM']

In [58]:
opt_df['ATM'].isna().sum()

np.int64(0)

In [59]:
opt_df = opt_df.dropna(subset=['ATM'])

In [60]:
opt_df = opt_df[
    opt_df['strike_diff'].isin([-100, -50, 0, 50, 100])
]
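After this filter each trading day would ideally contribute ten rows (five strikes times two option types). A sketch of a coverage check on a toy frame, in which the second day is deliberately missing one strike:

```python
import pandas as pd

# Toy filtered frame: one complete day and one day missing the ATM+2 strike.
rows = []
for strike_diff in [-100, -50, 0, 50, 100]:
    for option_type in ["CE", "PE"]:
        rows.append(("2025-01-14", strike_diff, option_type))
        if strike_diff != 100:
            rows.append(("2025-01-15", strike_diff, option_type))
toy = pd.DataFrame(rows, columns=["TIMESTAMP", "strike_diff", "OPTION_TYP"])

expected = 5 * 2  # five strikes times two option types
per_day = toy.groupby("TIMESTAMP").size()
incomplete = per_day[per_day != expected]
print(incomplete)
```

Days that show up in `incomplete` may indicate strikes that were not listed, or a spot close that did not map onto the 50-point grid as expected.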

In [61]:
opt_df = opt_df[[
    'TIMESTAMP',
    'EXPIRY_DT',
    'OPTION_TYP',
    'STRIKE_PR',
    'CLOSE',
    'CONTRACTS',
    'OPEN_INT'
]]

In [62]:
opt_df = opt_df.rename(columns={
    'TIMESTAMP': 'date',
    'EXPIRY_DT': 'expiry',
    'OPTION_TYP': 'option_type',
    'STRIKE_PR': 'strike',
    'CLOSE': 'close_price',
    'CONTRACTS': 'volume',
    'OPEN_INT': 'open_interest'
})

In [63]:
opt_df = opt_df.reset_index(drop=True)
opt_df.to_csv("nifty_options_daily.csv", index=False)

## NIFTY Options – Final Dataset

- Instrument: NIFTY Index Options (OPTIDX)
- Expiry: Nearest listed expiry per strike (intended as near-month; may be a weekly contract)
- Strikes: ATM, ATM ±1, ATM ±2 (50-point interval)
- Option Types: Call (CE), Put (PE)
- Frequency: Daily

### Usage in Quant System
- Market sentiment analysis (PCR)
- Options positioning analysis
- Volatility regime confirmation
- Signal filtering for futures strategies
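As an illustration of the first use, an open-interest put-call ratio can be computed per day from the saved columns (`date`, `option_type`, `open_interest`). The numbers below are toy values, and defining PCR as total put OI divided by total call OI over the retained strikes is one common convention rather than the only one.

```python
import pandas as pd

opts = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-14"] * 4),
    "option_type": ["CE", "PE", "CE", "PE"],
    "strike": [100.0, 100.0, 150.0, 150.0],
    "open_interest": [1000, 1500, 2000, 900],
})

# Sum open interest per day and option type, then take puts over calls.
oi = opts.pivot_table(index="date", columns="option_type",
                      values="open_interest", aggfunc="sum")
oi["pcr_oi"] = oi["PE"] / oi["CE"]
print(oi)
```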




---




