# Feature Engineering
In this section we derive new features from the raw data, which we can then use to train our model.

We use a single file here to demonstrate the feature engineering process.

For the LOB, I have already applied the same process to the whole dataset and saved the new features to a new CSV file. The code is in `ParsingLOB.py` and `Parsing_LOB3.0.py`.

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
tape_df = pd.read_csv('datasets/tape1.csv')
lob_df = pd.read_csv('datasets/lob_expanded_clean.csv')

## 1. Tape Data
The `Timestamp` column has already been converted to datetime, and the hour and minute extracted from it are stored in the `Hour` and `Minute` columns of the dataframe.

In [2]:
tape_df.head()

Unnamed: 0,Timestamp,Price,Volume,time_bin,Hour,Minute
0,1970-01-01 00:00:11.067000000,269,1,0,0,0
1,1970-01-01 00:00:11.222000000,267,2,0,0,0
2,1970-01-01 00:00:12.338000000,270,2,0,0,0
3,1970-01-01 00:00:13.733000000,267,3,0,0,0
4,1970-01-01 00:00:18.321000000,265,2,0,0,0


### Rolling Statistics
1. **Rolling Mean**: Also known as moving average, it calculates the average value of data points within a time window and updates as the time window moves. In financial data analysis, the rolling mean helps to smooth out price data, reducing short-term fluctuations, thereby revealing the long-term trend of prices.

2. **Rolling Standard Deviation**: Rolling standard deviation measures the variability or volatility of data points within a time window. In the financial markets, a higher rolling standard deviation indicates greater volatility in asset prices, which may signify higher risk.

3. **Rolling Max and Min**: These metrics represent the highest and lowest prices within the time window, respectively. They help identify extreme price changes within a given time window, useful for identifying potential support and resistance levels.


In [3]:
# Define the rolling window size
window_size = 5

# Calculate rolling statistics for price
tape_df['Price_Rolling_Mean'] = tape_df['Price'].rolling(window=window_size).mean()
tape_df['Price_Rolling_Std'] = tape_df['Price'].rolling(window=window_size).std()
tape_df['Price_Rolling_Max'] = tape_df['Price'].rolling(window=window_size).max()
tape_df['Price_Rolling_Min'] = tape_df['Price'].rolling(window=window_size).min()

# Calculate rolling statistics for volume
tape_df['Volume_Rolling_Mean'] = tape_df['Volume'].rolling(window=window_size).mean()
tape_df['Volume_Rolling_Std'] = tape_df['Volume'].rolling(window=window_size).std()
tape_df['Volume_Rolling_Max'] = tape_df['Volume'].rolling(window=window_size).max()
tape_df['Volume_Rolling_Min'] = tape_df['Volume'].rolling(window=window_size).min()

# Display the first 10 rows to verify the rolling statistics.
# The first window_size - 1 rows are NaN because the window is not yet full.
rolling_cols = ['Price', 'Price_Rolling_Mean', 'Price_Rolling_Std', 'Price_Rolling_Max', 'Price_Rolling_Min',
                'Volume', 'Volume_Rolling_Mean', 'Volume_Rolling_Std', 'Volume_Rolling_Max', 'Volume_Rolling_Min']
tape_df[rolling_cols].head(10)


Unnamed: 0,Price,Price_Rolling_Mean,Price_Rolling_Std,Price_Rolling_Max,Price_Rolling_Min,Volume,Volume_Rolling_Mean,Volume_Rolling_Std,Volume_Rolling_Max,Volume_Rolling_Min
0,269,,,,,1,,,,
1,267,,,,,2,,,,
2,270,,,,,2,,,,
3,267,,,,,3,,,,
4,265,267.6,1.949359,270.0,265.0,2,2.0,0.707107,3.0,1.0
5,266,267.0,1.870829,270.0,265.0,1,2.0,0.707107,3.0,1.0
6,264,266.4,2.302173,270.0,264.0,1,1.8,0.83666,3.0,1.0
7,261,264.6,2.302173,267.0,261.0,1,1.6,0.894427,3.0,1.0
8,264,264.0,1.870829,266.0,261.0,1,1.2,0.447214,2.0,1.0
9,264,263.8,1.788854,266.0,261.0,4,1.6,1.341641,4.0,1.0
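
The window above counts trades rather than time, so five rows can span very different intervals, and the first `window_size - 1` rows are NaN. As a rough sketch of two alternatives (neither is used in this notebook), `min_periods=1` makes pandas emit a value as soon as one observation is available, and a time-based window such as `'5s'` averages over a fixed time span. The `demo` dataframe and the column names `Mean_MinPeriods1` and `Mean_5s` are hypothetical; the rows are the first five trades shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for tape_df: the first five trades shown earlier
demo = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "1970-01-01 00:00:11.067",
        "1970-01-01 00:00:11.222",
        "1970-01-01 00:00:12.338",
        "1970-01-01 00:00:13.733",
        "1970-01-01 00:00:18.321",
    ]),
    "Price": [269, 267, 270, 267, 265],
})

# Count-based window that emits a value as soon as one row is available
demo["Mean_MinPeriods1"] = demo["Price"].rolling(window=5, min_periods=1).mean()

# Time-based window: mean of the trades within the 5 seconds up to each trade
demo["Mean_5s"] = (
    demo.set_index("Timestamp")["Price"].rolling("5s").mean().to_numpy()
)

print(demo)
```

With a time-based window the last trade is averaged only with trades from the preceding 5 seconds, whereas the count-based window always reaches back five trades regardless of how long ago they occurred.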


### Historical Differences
1. **Price Change**: The historical difference in prices (i.e., the difference between consecutive prices) reflects the change in price from one point in time to another. This is very useful for capturing short-term momentum in prices and can help analyze upward or downward price trends.

2. **Volume Change**: The historical difference in trading volume represents the change in volume between consecutive time points. An increase in trading volume typically indicates an increase in the strength of the trend, while a decrease in volume may indicate a weakening of the trend.


In [4]:
# Calculate historical price change
tape_df['Price_Change'] = tape_df['Price'].diff()

# Calculate historical volume change
tape_df['Volume_Change'] = tape_df['Volume'].diff()

# Display the first 10 rows to verify the historical changes.
# Only the first row is NaN, since it has no previous trade to compare against.
tape_df[['Price', 'Price_Change', 'Volume', 'Volume_Change']].head(10)

Unnamed: 0,Price,Price_Change,Volume,Volume_Change
0,269,,1,
1,267,-2.0,2,1.0
2,270,3.0,2,0.0
3,267,-3.0,3,1.0
4,265,-2.0,2,-1.0
5,266,1.0,1,-1.0
6,264,-2.0,1,0.0
7,261,-3.0,1,0.0
8,264,3.0,1,0.0
9,264,0.0,4,3.0
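
As a small check of what `diff()` computes, the sketch below reproduces the first few price changes by subtracting the previous value by hand. It also shows `pct_change()`, which expresses the same step relative to the previous price; that relative version is only an illustration and is not one of the features used in this notebook.

```python
import pandas as pd

# First five prices from the tape shown earlier
prices = pd.Series([269, 267, 270, 267, 265], name="Price")

# diff() is the current value minus the previous one
manual_change = prices - prices.shift(1)

# pct_change() divides that difference by the previous price (illustration only)
relative_change = prices.pct_change()

print(pd.DataFrame({
    "diff": prices.diff(),
    "manual": manual_change,
    "pct_change": relative_change,
}))
```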




In summary, these rolling statistics and historical differences provide rich information for feature extraction from time series data. In financial market analysis, they help characterize market behavior and provide a foundation for modeling and forecasting. Combining them gives a model more to work with when forecasting future price movements or identifying trading signals.
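
Before such features are passed to a model, the NaN rows produced by the rolling window and by `diff()` have to be handled. A minimal sketch, assuming the simplest choice of dropping those rows (the notebook does not state which strategy it uses); `demo` and `features` are hypothetical names, and the values are the first seven trades shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for tape_df: the first seven trades shown earlier
demo = pd.DataFrame({
    "Price": [269, 267, 270, 267, 265, 266, 264],
    "Volume": [1, 2, 2, 3, 2, 1, 1],
})

window_size = 5
demo["Price_Rolling_Mean"] = demo["Price"].rolling(window=window_size).mean()
demo["Price_Change"] = demo["Price"].diff()

# Drop the rows where the rolling window is not yet full (assumed strategy)
features = demo.dropna().reset_index(drop=True)

print(features)
```

Dropping rows discards the first `window_size - 1` trades of each file; filling them instead is possible, but it introduces values computed from fewer observations than the rest.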

## 2. Limit Order Book Data
### 1. Add New Features
We extract more information from the `Orders` column of the dataframe. For each row (one snapshot of the order book), we calculate the following features:
- Total Bid Quantity: The total quantity of all bid orders
- Total Ask Quantity: The total quantity of all ask orders
- Max Bid Price: The highest bid price
- Min Ask Price: The lowest ask price
- Spread: The difference between the lowest ask price and the highest bid price
- Weighted Avg Bid Price: The quantity-weighted average of all bid prices
- Weighted Avg Ask Price: The quantity-weighted average of all ask prices
- Bid-Ask Quantity Ratio: The ratio of total bid quantity to total ask quantity
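
To make these definitions concrete, here is a small worked check on a single snapshot. The bid side is taken from row 3 of the LOB data; the ask side `[[798, 1]]` is an assumption, since that cell is truncated in the printed output. All variable names are for illustration only.

```python
# Hypothetical snapshot modeled on row 3: each entry is [price, quantity]
bid_orders = [[261, 1], [1, 6]]
ask_orders = [[798, 1]]  # assumed, the ask side is truncated in the printed output

total_bid_quantity = sum(quantity for _, quantity in bid_orders)
total_ask_quantity = sum(quantity for _, quantity in ask_orders)
max_bid_price = max(price for price, _ in bid_orders)
min_ask_price = min(price for price, _ in ask_orders)

# Spread: lowest ask minus highest bid
spread = min_ask_price - max_bid_price

# Quantity-weighted average price: sum(price * quantity) / sum(quantity)
weighted_avg_bid_price = (
    sum(price * quantity for price, quantity in bid_orders) / total_bid_quantity
)
bid_ask_quantity_ratio = total_bid_quantity / total_ask_quantity

print(spread, round(weighted_avg_bid_price, 6), bid_ask_quantity_ratio)
```

The weighted average bid price here is (261 × 1 + 1 × 6) / 7 ≈ 38.14, far below the best bid of 261, because the six units resting at price 1 dominate the weighting.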

In [11]:
import ast  # parse string representation of list to list

# read the cleaned LOB data
lob_cleaned_path = './datasets/lob_expanded_clean.csv'
lob_cleaned = pd.read_csv(lob_cleaned_path)

# Function to parse the orders and calculate the features
def parse_orders(row):
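    # Each 'Orders' cell is a string of the form
    # [['bid', [[price, quantity], ...]], ['ask', [[price, quantity], ...]]]
    orders = ast.literal_eval(row['Orders'])
    bid_orders = orders[0][1]  # list of [price, quantity] pairs on the bid side
    ask_orders = orders[1][1]  # list of [price, quantity] pairs on the ask side

    # Initialize variables
    total_bid_quantity, total_ask_quantity = 0, 0  # total bid and ask quantities
    total_bid_value, total_ask_value = 0, 0  # total bid and ask values (value = price * quantity)
    max_bid_price, min_ask_price = float('-inf'), float('inf')  # running max bid and min ask prices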

    for price, quantity in bid_orders:
        total_bid_quantity += quantity
        total_bid_value += price * quantity
        max_bid_price = max(max_bid_price, price)

    for price, quantity in ask_orders:
        total_ask_quantity += quantity
        total_ask_value += price * quantity
        min_ask_price = min(min_ask_price, price)

    # Calculate additional features
    # Note: if one side of the book is empty, max_bid_price / min_ask_price keep their
    # infinite initial values (so the spread is infinite), and the averages and ratio fall back to 0
    spread = min_ask_price - max_bid_price  # lowest ask price minus highest bid price
    weighted_avg_bid_price = total_bid_value / total_bid_quantity if total_bid_quantity > 0 else 0
    weighted_avg_ask_price = total_ask_value / total_ask_quantity if total_ask_quantity > 0 else 0
    bid_ask_quantity_ratio = total_bid_quantity / total_ask_quantity if total_ask_quantity > 0 else 0

    return pd.Series([total_bid_quantity, total_ask_quantity, max_bid_price, min_ask_price, spread, weighted_avg_bid_price, weighted_avg_ask_price, bid_ask_quantity_ratio])

# add the new features to the dataframe
lob_cleaned[['Total Bid Quantity', 'Total Ask Quantity', 'Max Bid Price', 'Min Ask Price', 'Spread', 'Weighted Avg Bid Price', 'Weighted Avg Ask Price', 'Bid-Ask Quantity Ratio']] = lob_cleaned.apply(parse_orders, axis=1)

print(lob_cleaned.head())

   Timestamp Exchange                                             Orders  \
0      1.333    Exch0           [['bid', [[1, 6]]], ['ask', [[800, 1]]]]   
1      1.581    Exch0           [['bid', [[1, 6]]], ['ask', [[799, 1]]]]   
2      1.643    Exch0           [['bid', [[1, 6]]], ['ask', [[798, 1]]]]   
3      1.736    Exch0  [['bid', [[261, 1], [1, 6]]], ['ask', [[798, 1...   
4      1.984    Exch0  [['bid', [[261, 1], [1, 6]]], ['ask', [[797, 1...   

   Bid Price  Bid Quantity  Ask Price  Ask Quantity  Total Bid Quantity  \
0        1.0           6.0      800.0           1.0                 6.0   
1        1.0           6.0      799.0           1.0                 6.0   
2        1.0           6.0      798.0           1.0                 6.0   
3      261.0           1.0      798.0           1.0                 7.0   
4      261.0           1.0      797.0           1.0                 7.0   

   Total Ask Quantity  Max Bid Price  Min Ask Price  Spread  \
0                 1.0        

In [15]:
# save the new features to a new csv file
selected_columns = ['Timestamp', 'Exchange', 'Total Bid Quantity', 'Total Ask Quantity', 
                    'Max Bid Price', 'Min Ask Price', 'Spread', 
                    'Weighted Avg Bid Price', 'Weighted Avg Ask Price', 
                    'Bid-Ask Quantity Ratio']

new_df = lob_cleaned[selected_columns].copy()
print(new_df.head())

new_df.to_csv('lob_only_features.csv', index=False)

   Timestamp Exchange  Total Bid Quantity  Total Ask Quantity  Max Bid Price  \
0      1.333    Exch0                 6.0                 1.0            1.0   
1      1.581    Exch0                 6.0                 1.0            1.0   
2      1.643    Exch0                 6.0                 1.0            1.0   
3      1.736    Exch0                 7.0                 1.0          261.0   
4      1.984    Exch0                 7.0                 1.0          261.0   

   Min Ask Price  Spread  Weighted Avg Bid Price  Weighted Avg Ask Price  \
0          800.0   799.0                1.000000                   800.0   
1          799.0   798.0                1.000000                   799.0   
2          798.0   797.0                1.000000                   798.0   
3          798.0   537.0               38.142857                   798.0   
4          797.0   536.0               38.142857                   797.0   

   Bid-Ask Quantity Ratio  
0                     6.0  
1     