# Collecting and cleaning SPY 1-minute data

This notebook implements the "Data Sources" section:

- Load raw SPY 1-minute OHLCV data from `data/raw/`.
- Convert Unix timestamps to U.S. Eastern Time.
- Drop the per-minute open and keep only high, low, close, and volume.
- Save a cleaned dataset into `data/clean/` for all later notebooks.


In [52]:
from pathlib import Path
import pandas as pd

In [53]:
# Check that the raw data file is in the expected folder

PROJECT_ROOT = Path("..").resolve()

DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_CLEAN = PROJECT_ROOT / "data" / "clean"

RAW_FILE = DATA_RAW / "spy_1min_bats_2025.csv"

print("RAW_FILE path:", RAW_FILE)
print("File exists?:", RAW_FILE.exists())


RAW_FILE path: /Users/canka/Dev/python/DSA210-Project-Can-Karadogan/data/raw/spy_1min_bats_2025.csv
File exists?: True


## 1) Reading raw SPY 1-minute OHLCV data

First, we inspect how the raw data is structured. This lets us run our EDA on the raw file and decide how to convert it into the cleaned dataset.

- The raw file should contain the six original columns: time, open, high, low, close, Volume
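
A column check along the following lines could make that expectation explicit. This is only a minimal sketch: `EXPECTED_COLUMNS` and `check_columns` are hypothetical names that do not appear elsewhere in this notebook, and the one-row sample is built by hand from the first record of the raw file.

```python
import pandas as pd

# Hypothetical constant: the six column names we expect in the raw CSV
EXPECTED_COLUMNS = ["time", "open", "high", "low", "close", "Volume"]


def check_columns(df: pd.DataFrame) -> list:
    """Return the expected column names that are missing from df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# One-row sample standing in for the raw file
sample = pd.DataFrame(
    {
        "time": [1757338200],
        "open": [648.63],
        "high": [648.86],
        "low": [648.24],
        "close": [648.26],
        "Volume": [141588],
    }
)

missing = check_columns(sample)  # nothing missing in the full sample
missing_after_drop = check_columns(sample.drop(columns=["Volume"]))
print("Missing columns:", missing, "| after dropping Volume:", missing_after_drop)
```

In the notebook, the same function could be called on `df_raw` right after `pd.read_csv`.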



In [54]:
#checking raw data head

df_raw = pd.read_csv(RAW_FILE)
df_raw.head()

Unnamed: 0,time,open,high,low,close,Volume
0,1757338200,648.63,648.86,648.24,648.26,141588
1,1757338260,648.26,648.45,648.15,648.27,42118
2,1757338320,648.3,648.46,648.1,648.26,37143
3,1757338380,648.28,648.47,648.23,648.4,42231
4,1757338440,648.4,648.68,648.32,648.665,23659


In [55]:
# Check column dtypes and summary statistics so that missing or incorrect values are easy to spot

df_raw.info()
df_raw.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21450 entries, 0 to 21449
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    21450 non-null  int64  
 1   open    21450 non-null  float64
 2   high    21450 non-null  float64
 3   low     21450 non-null  float64
 4   close   21450 non-null  float64
 5   Volume  21450 non-null  int64  
dtypes: float64(4), int64(2)
memory usage: 1005.6 KB


Unnamed: 0,time,open,high,low,close,Volume
count,21450.0,21450.0,21450.0,21450.0,21450.0,21450.0
mean,1760548000.0,668.012768,668.153902,667.867178,668.010579,31363.66
std,1917741.0,9.477913,9.467176,9.486371,9.477293,35760.77
min,1757338000.0,647.33,647.51,647.22,647.31,1314.0
25%,1758825000.0,661.13,661.33,660.91225,661.1225,14615.5
50%,1760547000.0,666.83,666.97,666.66,666.82,22888.0
75%,1762272000.0,673.0175,673.12,672.87,673.0,36983.0
max,1763759000.0,689.6,689.7,689.52,689.59,1362579.0
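
Beyond reading the summary by eye, the internal consistency of each bar can be checked programmatically. The sketch below is an assumption about what such a check might look like, not part of the original pipeline: `find_inconsistent_bars` is a hypothetical helper, and the third sample row is deliberately broken synthetic data.

```python
import pandas as pd


def find_inconsistent_bars(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose OHLC values or volume contradict each other."""
    # The high must be at least as large as open, close, and low
    bad_high = df["high"] < df[["open", "close", "low"]].max(axis=1)
    # The low must be at most as large as open, close, and high
    bad_low = df["low"] > df[["open", "close", "high"]].min(axis=1)
    # Traded volume cannot be negative
    bad_volume = df["Volume"] < 0
    return df[bad_high | bad_low | bad_volume]


bars = pd.DataFrame(
    {
        "open": [648.63, 648.26, 648.30],
        "high": [648.86, 648.45, 648.00],  # last high is below its open (synthetic error)
        "low": [648.24, 648.15, 648.10],
        "close": [648.26, 648.27, 648.26],
        "Volume": [141588, 42118, 37143],
    }
)

bad_bars = find_inconsistent_bars(bars)
print(bad_bars)
```

An empty result on `df_raw` would indicate that every bar satisfies low <= open, close <= high.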


## 2) Convert Unix Timestamp to Real New York Dates and Hours

The main difficulty when working with time-series data such as stock prices is that the original timestamps are Unix timestamps, i.e. the number of seconds elapsed since 1970-01-01 00:00:00 UTC.

- We convert these values to U.S. Eastern Time (New York time) so that the date, hour, and minute of each bar are explicit


In [56]:
import numpy as np

# Our data spans both DST (Daylight Saving Time) and ST (Standard Time) in U.S. Eastern Time.
# U.S. clocks were set back one hour on November 2nd: 02:00 in DST became 01:00 in ST.
# From that moment on, U.S. Eastern Time is on ST (UTC-5), one hour behind DST (UTC-4).

# Unix timestamp 1762063200 --> November 2nd at 02:00 (in DST) = 01:00 (in ST)
threshold = 1762063200

# Create a new 'datetime' column in New York time: timestamps before the threshold are treated as DST (UTC-4), the rest as ST (UTC-5)
df_raw["datetime"] = np.where(
    df_raw["time"] < threshold,
    pd.to_datetime(df_raw["time"], unit="s") - pd.Timedelta(hours=4),
    pd.to_datetime(df_raw["time"], unit="s") - pd.Timedelta(hours=5)
)

df_raw.drop(columns=["time"], inplace=True)

df_raw.head()

Unnamed: 0,open,high,low,close,Volume,datetime
0,648.63,648.86,648.24,648.26,141588,2025-09-08 09:30:00
1,648.26,648.45,648.15,648.27,42118,2025-09-08 09:31:00
2,648.3,648.46,648.1,648.26,37143,2025-09-08 09:32:00
3,648.28,648.47,648.23,648.4,42231,2025-09-08 09:33:00
4,648.4,648.68,648.32,648.665,23659,2025-09-08 09:34:00
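
As an alternative to a hand-picked threshold, pandas can take the UTC offset from the time zone database. The sketch below is not the method used in this notebook; it assumes that time zone data for `America/New_York` is available on the machine. The four sample timestamps are the first bar of the data, the second just before the threshold, the threshold itself, and a value computed by hand for 2025-11-21 15:59 Eastern Time.

```python
import pandas as pd

# Sample Unix timestamps (seconds); the last one is computed by hand for illustration
unix_seconds = pd.Series([1757338200, 1762063140, 1762063200, 1763758740])

# Interpret the values as UTC, then convert to New York time (DST handled automatically)
eastern = pd.to_datetime(unix_seconds, unit="s", utc=True).dt.tz_convert("America/New_York")

# Drop the time zone information to get naive local timestamps like the 'datetime' column
naive_eastern = eastern.dt.tz_localize(None)

print(naive_eastern)
```

The first value converts to 2025-09-08 09:30:00, which agrees with the first row shown above.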


In [None]:
# Move the 'datetime' column to the first position

cols = df_raw.columns.tolist()
cols.remove('datetime')     
new_order = ['datetime'] + cols
df_raw = df_raw[new_order]

df_raw.head()

Unnamed: 0,datetime,open,high,low,close,Volume
0,2025-09-08 09:30:00,648.63,648.86,648.24,648.26,141588
1,2025-09-08 09:31:00,648.26,648.45,648.15,648.27,42118
2,2025-09-08 09:32:00,648.3,648.46,648.1,648.26,37143
3,2025-09-08 09:33:00,648.28,648.47,648.23,648.4,42231
4,2025-09-08 09:34:00,648.4,648.68,648.32,648.665,23659


In [63]:
# Check the first and the last 1-minute candle to confirm that the data covers the expected U.S. datetime range:
# 55 trading days from September 8th to November 21st, within regular stock market hours 09:30 - 16:00

# The same information can be read from the summary returned by .describe()

df_raw['datetime'].info()
df_raw['datetime'].describe()


<class 'pandas.core.series.Series'>
RangeIndex: 21450 entries, 0 to 21449
Series name: datetime
Non-Null Count  Dtype         
--------------  -----         
21450 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 167.7 KB


count                  21450
mean     2025-10-15 12:44:30
min      2025-09-08 09:30:00
25%      2025-09-25 14:22:15
50%      2025-10-15 12:44:30
75%      2025-11-04 11:06:45
max      2025-11-21 15:59:00
Name: datetime, dtype: object
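
The overall minimum and maximum do not show whether every individual day is complete. A per-day summary, sketched below, could confirm that: a regular 09:30 to 16:00 session has 390 one-minute bars, and 21450 rows correspond to 55 such sessions. `session_summary` is a hypothetical helper, and the two days of timestamps are synthetic stand-ins for `df_raw['datetime']`.

```python
import pandas as pd


def session_summary(datetimes: pd.Series) -> pd.DataFrame:
    """First bar, last bar, and number of bars for each calendar day."""
    return datetimes.groupby(datetimes.dt.normalize()).agg(["min", "max", "count"])


# Synthetic stand-in: two full sessions of 390 one-minute bars each
day1 = pd.date_range("2025-09-08 09:30", periods=390, freq="min")
day2 = pd.date_range("2025-09-09 09:30", periods=390, freq="min")
datetimes = pd.Series(day1.append(day2))

summary = session_summary(datetimes)
print(summary)
```

Days whose `count` differs from 390 would point to missing minutes or to shortened sessions.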

## 3) Drop the per-minute open and keep only high, low, close, and volume.

As mentioned in the README, we do not use the open of each minute, because the close is a standard reference for one-minute bars and keeps the label definition simple. So we drop this column.

In [64]:
df_raw.drop(columns=["open"], inplace=True)
df_raw.head()

Unnamed: 0,datetime,high,low,close,Volume
0,2025-09-08 09:30:00,648.86,648.24,648.26,141588
1,2025-09-08 09:31:00,648.45,648.15,648.27,42118
2,2025-09-08 09:32:00,648.46,648.1,648.26,37143
3,2025-09-08 09:33:00,648.47,648.23,648.4,42231
4,2025-09-08 09:34:00,648.68,648.32,648.665,23659


## 4) Save a cleaned dataset into `data/clean/` for all later notebooks.

Now we save the final form of the data to the `data/clean/` folder, so that the transition from the original raw data is clear and all later notebooks can use the cleaned file.

In [65]:
from pathlib import Path

# 1) Define the project root, i.e. the top-level folder of our repository
PROJECT_ROOT = Path("..").resolve()

# 2) Define the path to the data/clean folder
DATA_CLEAN = PROJECT_ROOT / "data" / "clean"
DATA_CLEAN.mkdir(parents=True, exist_ok=True)  # create it if it does not exist

# 3) Name of the CSV file to be saved
clean_csv_path = DATA_CLEAN / "spy_1min_et_clean.csv"

# 4) df_raw is now in its final processed (clean) form, so we save it
df_raw.to_csv(clean_csv_path, index=False)

print("Saved CSV to:", clean_csv_path)


Saved CSV to: /Users/canka/Dev/python/DSA210-Project-Can-Karadogan/data/clean/spy_1min_et_clean.csv
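
When a later notebook reads this file back, the `datetime` column arrives as plain strings unless it is parsed explicitly. The round trip below is a sketch of how that could be handled: it writes to a temporary directory instead of `data/clean/`, and the two-row frame is a hand-built sample in the cleaned layout.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two-row sample in the cleaned layout (datetime, high, low, close, Volume)
clean_sample = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2025-09-08 09:30:00", "2025-09-08 09:31:00"]),
        "high": [648.86, 648.45],
        "low": [648.24, 648.15],
        "close": [648.26, 648.27],
        "Volume": [141588, 42118],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "spy_1min_et_clean.csv"
    clean_sample.to_csv(csv_path, index=False)

    # parse_dates restores the datetime dtype that CSV text cannot store
    reloaded = pd.read_csv(csv_path, parse_dates=["datetime"])

print(reloaded.dtypes)
```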
