# Using DataRequest.py and Preprocessing.py to Fetch and Aggregate Financial Data

In this tutorial, we will demonstrate how to use **DataRequest.py** and **Preprocessing.py** to fetch tick data in the financial domain and aggregate it into minute-level candlestick data.


## Prerequisites

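- Ensure that `DataRequest.py` and `Preprocessing.py` are importable. The examples below import them as `Utility.DataRequest` and `Utility.Preprocessing`, so both files should sit in a `Utility/` folder under the current working directory.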
- Install the required dependencies:
  ```bash
  pip install pandas finnhub-python # you should have this already
  ```
- Obtain an API key (for example, from Finnhub) and substitute it for the `YOUR_API_KEY` placeholder used in the examples below.
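
To keep the key out of the notebook itself, one option is to read it from an environment variable. The following is a minimal sketch; the helper `load_api_key` and the variable name `FINNHUB_API_KEY` are illustrative assumptions and are not part of `DataRequest.py`.

```python
import os

def load_api_key(env_var: str = "FINNHUB_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The helper and the default variable name are hypothetical;
    adapt them to however you manage secrets.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key
```

The returned string can then be passed wherever the examples use `'YOUR_API_KEY'`.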


## 1. Use DataRequest.py to Fetch Raw Financial Data

Below we demonstrate how to fetch **candlestick data** and **tick data**, respectively.


In [2]:
```python
from Utility.DataRequest import fetch_candle_data

# Example: Fetch 1-minute candlestick data for AAPL on 2023-01-03
candle_df = fetch_candle_data(
    symbol='AAPL',
    start_date='2023-01-03',
    end_date='2023-01-03',
    interval='1',        # '1' means 1-minute candlestick interval
    token='YOUR_API_KEY', # API key
    max_workers=1         # Number of threads to use
)

candle_df.head()
```


AAPL 1-min candlestick chunks: 100%|██████████| 1/1 [00:00<00:00,  6.43it/s]


| | timestamp | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| 0 | 2023-01-03 04:00:00-05:00 | 130.28 | 131.0 | 130.28 | 131.0 | 8174 |
| 1 | 2023-01-03 04:01:00-05:00 | 130.87 | 131.17 | 130.87 | 131.1 | 8820 |
| 2 | 2023-01-03 04:02:00-05:00 | 131.18 | 131.24 | 131.17 | 131.17 | 2112 |
| 3 | 2023-01-03 04:03:00-05:00 | 131.19 | 131.29 | 131.19 | 131.28 | 3888 |
| 4 | 2023-01-03 04:04:00-05:00 | 131.28 | 131.46 | 131.28 | 131.46 | 5984 |


### `fetch_tick_data` Parameter Description

- **`page_workers`** (_int_)  
  Controls the number of threads fetching paginated data within a single trading day.  
  - Tick data is often returned in “pages,” and `page_workers` determines how many page requests are issued concurrently.  
  - A higher value speeds up retrieval for a single day, but too many threads may trigger API rate limiting. Adjust it to your network and API constraints.

- **`day_workers`** (_int_)  
  Controls the number of threads fetching data across multiple trading days concurrently.  
  - When fetching tick data for multiple days, `day_workers` determines how many days’ requests are processed simultaneously.  
  - Allows parallel retrieval across days to improve overall efficiency; be mindful of API concurrency limits.
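
How the two settings combine can be pictured as nested thread pools: an outer pool over days and an inner pool over pages. The sketch below illustrates that idea only. `fetch_page`, `fetch_day`, and `fetch_days` are hypothetical stand-ins, and the actual `fetch_tick_data` in `DataRequest.py` may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(day: str, page: int) -> list:
    """Hypothetical stand-in for one paginated API request (returns 3 fake ticks)."""
    return [f"{day}-p{page}-t{i}" for i in range(3)]

def fetch_day(day: str, n_pages: int, page_workers: int) -> list:
    """Fetch all pages of one day, at most `page_workers` requests at a time."""
    with ThreadPoolExecutor(max_workers=page_workers) as pool:
        pages = pool.map(lambda p: fetch_page(day, p), range(n_pages))
        return [tick for page in pages for tick in page]

def fetch_days(days, n_pages: int, page_workers: int, day_workers: int) -> list:
    """Fetch several days, at most `day_workers` days at a time."""
    with ThreadPoolExecutor(max_workers=day_workers) as pool:
        per_day = pool.map(lambda d: fetch_day(d, n_pages, page_workers), days)
        return [tick for day_ticks in per_day for tick in day_ticks]

ticks = fetch_days(["2023-01-03", "2023-01-04"], n_pages=4,
                   page_workers=2, day_workers=2)
print(len(ticks))  # 2 days * 4 pages * 3 ticks = 24
```

Under this layout the peak number of simultaneous requests is roughly `day_workers * page_workers`, which is the figure to compare against the API's rate limit.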

In [4]:
```python
from Utility.DataRequest import fetch_tick_data

# Example: Fetch tick-by-tick trade details for AAPL on 2023-01-03
tick_df = fetch_tick_data(
    symbol='AAPL',
    start_date='2023-01-03',
    end_date='2023-01-03',
    api_key='YOUR_API_KEY',     # API key
    page_workers=10,            # Number of page workers for concurrent fetches within a single day
    day_workers=1               # Number of day workers for parallel fetch across days (here only 1)
)

tick_df.head()
```


Days:   0%|          | 0/1 [00:00<?, ?it/s]

Fetching ticks for 2023-01-03 - total ticks for the day: 1021095


Days: 100%|██████████| 1/1 [00:08<00:00,  8.48s/it]


| | timestamp | symbol | price | volume | condition |
|---|---|---|---|---|---|
| 0 | 2023-01-03 04:00:00.004000-05:00 | AAPL | 130.28 | 100 | [1, 24] |
| 1 | 2023-01-03 04:00:00.004000-05:00 | AAPL | 130.28 | 12 | [1, 24, 12] |
| 2 | 2023-01-03 04:00:00.005000-05:00 | AAPL | 130.28 | 10 | [1, 24, 12] |
| 3 | 2023-01-03 04:00:00.007000-05:00 | AAPL | 130.28 | 4 | [1, 24, 12] |
| 4 | 2023-01-03 04:00:00.009000-05:00 | AAPL | 130.28 | 5 | [1, 24, 12] |
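
The timestamps in the tick output carry a `-05:00` offset, meaning they are expressed in Eastern Time. As a rough illustration of how such values can be produced with pandas, the sketch below assumes the raw feed delivers UNIX epoch timestamps in milliseconds (UTC); that input format is an assumption here, not something confirmed by `DataRequest.py`.

```python
import pandas as pd

# Assumed raw input: UNIX epoch timestamps in milliseconds, UTC.
raw_ms = [1672736400004, 1672736400005]

# Parse as UTC, then convert to the exchange's local time zone.
ts = pd.to_datetime(raw_ms, unit="ms", utc=True).tz_convert("America/New_York")
print(ts[0])  # 2023-01-03 04:00:00.004000-05:00

# Flooring to the minute gives the bar each tick belongs to.
minute_start = ts.floor("min")
print(minute_start[0])  # 2023-01-03 04:00:00-05:00
```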


## 2. Use Preprocessing.py to Aggregate Tick Data into Minute-Level Candlestick Data

Use the `aggregate_tick_to_minute` function to convert tick data into OHLCV data:


In [7]:
```python
from Utility.Preprocessing import aggregate_tick_to_minute

# Aggregate tick data into minute-level OHLCV data
minute_df = aggregate_tick_to_minute(tick_df)
```


[Info] Dropped 0 rows due to invalid timestamps.
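
For intuition, the core of a tick-to-minute aggregation can be sketched with pandas `resample`. This is a simplified illustration on made-up ticks, not the actual implementation of `aggregate_tick_to_minute`; in particular, the way invalid timestamps are detected and dropped here (`errors="coerce"`) is an assumption modelled on the log message above.

```python
import pandas as pd

# Made-up ticks; the last row has a deliberately invalid timestamp.
ticks = pd.DataFrame({
    "timestamp": ["2023-01-03 04:00:05-05:00", "2023-01-03 04:00:30-05:00",
                  "2023-01-03 04:01:10-05:00", "not a timestamp"],
    "price": [130.28, 131.00, 130.89, 130.90],
    "volume": [100, 12, 10, 5],
})

# Unparseable timestamps become NaT and are dropped.
ts = pd.to_datetime(ticks["timestamp"], errors="coerce", utc=True)
valid = ts.notna()
print(f"[Info] Dropped {(~valid).sum()} rows due to invalid timestamps.")

clean = ticks.loc[valid].assign(
    timestamp=ts[valid].dt.tz_convert("America/New_York")
)
clean = clean.sort_values("timestamp").set_index("timestamp")

# One row per minute: first/max/min/last price plus summed volume.
bars = clean["price"].resample("1min").ohlc()
bars["volume"] = clean["volume"].resample("1min").sum()
bars = bars.dropna(subset=["open"])  # discard minutes with no trades
print(bars)
```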


## 3. Display Aggregated Results and Explain the Structure

View the first 5 rows of the aggregated minute-level data:

In [8]:
```python
minute_df.head(5)
```

| | timestamp | open | high | low | close | volume | dollar_volume | vwap | tick_count | trade_size_mean | trade_size_std | zero_return_count | price_direction_ratio | large_trade_count | large_trade_ratio | large_trade_volume_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-03 04:00:00-05:00 | 130.28 | 131.0 | 130.06 | 130.89 | 8174.0 | 1069601.58 | 130.85412 | 208.0 | 39.298077 | 89.09474 | 101.0 | 0.246377 | 5.0 | 0.024038 | 0.285784 |
| 1 | 2023-01-03 04:01:00-05:00 | 130.89 | 131.18 | 130.85 | 131.1 | 8820.0 | 1155025.24 | 130.955243 | 157.0 | 56.178344 | 188.856313 | 71.0 | 0.25 | 1.0 | 0.006369 | 0.255556 |
| 2 | 2023-01-03 04:02:00-05:00 | 131.17 | 131.29 | 131.1 | 131.19 | 2112.0 | 277112.92 | 131.208769 | 53.0 | 39.849057 | 70.22132 | 23.0 | 0.25 | 1.0 | 0.018868 | 0.210227 |
| 3 | 2023-01-03 04:03:00-05:00 | 131.17 | 131.29 | 131.15 | 131.28 | 3888.0 | 510186.43 | 131.22079 | 90.0 | 43.2 | 63.307223 | 54.0 | 0.202247 | 2.0 | 0.022222 | 0.190329 |
| 4 | 2023-01-03 04:04:00-05:00 | 131.29 | 131.46 | 131.24 | 131.4 | 5984.0 | 785866.15 | 131.327899 | 88.0 | 68.0 | 125.710888 | 37.0 | 0.264368 | 2.0 | 0.022727 | 0.229445 |


In the table above, each row represents a one-minute interval, with the following columns explained:
- **timestamp**: The start of the one-minute interval, floored to the minute. It is usually converted to Eastern Time (America/New_York) to align with exchange hours. For example, `2023-01-03 04:00:00-05:00` is 04:00 Eastern Time (09:00 UTC), which falls in the pre-market session before the 09:30 regular open.
- **open**: The price of the first trade within that minute. For example, at the 04:00 interval, the first trade price was 130.28.
- **high**: The highest trade price within that minute. For example, at 04:00 the highest price was 131.00.
- **low**: The lowest trade price within that minute. For example, at 04:00 the lowest price was 130.06.
- **close**: The price of the last trade within that minute, i.e., the price at the end of the interval. For example, the close at 04:00 was 130.89.
- **volume**: The total number of shares traded during that minute. At 04:00, a total of 8,174 shares were traded.
- **dollar_volume**: The total dollar amount traded during that minute, calculated as the sum of each trade price multiplied by its volume. For example, the dollar volume at 04:00 was $1,069,601.58.
- **vwap**: Volume-Weighted Average Price, i.e., `dollar_volume / volume`, the average price paid per share in that minute. In the example, the VWAP at 04:00 is approximately 130.85 (1,069,601.58 / 8,174).
- **tick_count**: The number of trades executed during that minute. For example, there were 208 trades at 04:00.
- **trade_size_mean**: The average number of shares per trade during that minute. For example, the average trade size at 04:00 was 39.30 shares.
- **trade_size_std**: The standard deviation of trade sizes during that minute, measuring variability in trade sizes. For the 04:00 interval, the standard deviation was approximately 89.09.
- **zero_return_count**: The number of times the trade price did not change between consecutive trades within that minute. For example, the 04:00 interval has 101 such unchanged pairs. A value of 0 would indicate that every trade printed at a different price from the one before it.
- **price_direction_ratio**: The share of consecutive-trade price comparisons that were upward moves. A value close to 1 indicates mostly increasing prices. The sample values appear consistent with a denominator of `tick_count - 1`, i.e., all comparisons including unchanged prices (for 04:00, 51 / 207 ≈ 0.246), so a low value can reflect many unchanged prices as well as decreases.
- **large_trade_count**: The number of large trades (trades with unusually high volume) during the minute. Large trades are typically identified as those exceeding the mean by two standard deviations. A value of 0 indicates no large trades.
- **large_trade_ratio**: The proportion of large trades to total trades in the minute. A value of 0 indicates no large trades occurred.
- **large_trade_volume_ratio**: The proportion of volume contributed by large trades to total volume during the minute. A value of 0 indicates no volume came from large trades.
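
To make the microstructure columns concrete, here is a small sketch that computes them for the trades of a single minute. The formulas follow the descriptions above under stated assumptions: the large-trade threshold is taken as mean plus two sample standard deviations, and `price_direction_ratio` divides by all consecutive comparisons. The function `minute_features` is hypothetical and may differ from what `Preprocessing.py` does internally.

```python
import pandas as pd

def minute_features(prices, sizes) -> dict:
    """Compute illustrative per-minute features from one minute of trades."""
    p = pd.Series(prices, dtype="float64")
    v = pd.Series(sizes, dtype="float64")
    diffs = p.diff().dropna()            # price change between consecutive trades
    threshold = v.mean() + 2 * v.std()   # assumed rule for a "large" trade
    large = v > threshold
    dollar_volume = float((p * v).sum())
    return {
        "dollar_volume": dollar_volume,
        "vwap": dollar_volume / float(v.sum()),
        "tick_count": len(p),
        "trade_size_mean": float(v.mean()),
        "trade_size_std": float(v.std()),
        "zero_return_count": int((diffs == 0).sum()),
        # Assumed denominator: all consecutive comparisons, unchanged included.
        "price_direction_ratio": float((diffs > 0).sum() / len(diffs)) if len(diffs) else float("nan"),
        "large_trade_count": int(large.sum()),
        "large_trade_ratio": float(large.mean()),
        "large_trade_volume_ratio": float(v[large].sum() / v.sum()),
    }

# Ten made-up trades: nine small ones and a single 1000-share block.
prices = [100.0, 100.0, 101.0, 101.0, 100.0, 100.0, 102.0, 102.0, 102.0, 101.0]
sizes = [10, 10, 10, 1000, 10, 10, 10, 10, 10, 10]
features = minute_features(prices, sizes)
print(features["zero_return_count"])   # 5 unchanged consecutive prices
print(features["large_trade_count"])   # 1 (the 1000-share trade)
```

Note that with very few trades in a minute, no trade can exceed mean plus two standard deviations, so the large-trade columns tend to be 0 in thinly traded intervals under this rule.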

In summary, by using `Preprocessing.py` to aggregate data, we convert raw high-frequency tick data into an easy-to-analyze minute-level candlestick DataFrame `minute_df`.

This table has a clear structure: the first columns are the classic OHLCV data for price charting, and the subsequent columns provide additional market microstructure features for deeper analysis of trading behavior.

For beginners, focus on basic fields such as OHLC and volume; for advanced analysis, leverage VWAP and microstructure features to explore market details.

**At this point, we have completed the entire workflow**: from using `DataRequest.py` to fetch raw financial market data to using `Preprocessing.py` to aggregate tick data into minute-level candlestick data. You can apply these steps to other stocks or time ranges and adjust parameters (e.g., symbol, date range, interval) to obtain the desired data. Happy learning!
