<a href="https://colab.research.google.com/github/chengshengli/hflf/blob/main/Copy_of_Untitled0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
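For the file m1fa-5rows.csv, treat [high, low] as a single price unit and [hf, lf] as a single indicator unit. Analyze in depth how the two are related, how price turning points relate to indicator turning points, and which features and feature combinations emerge.

Note: hf and lf are both Delta RLE-Increments encoded. The encoding follows "Delta RLE (delta run-length encoding) + end + length + sum", a compact and efficient representation for sparse integer arrays that are mostly consecutive with occasional jumps.

RLE-Increments encoding and decoding procedure:

*   Encoding: if the array is empty, every field is 0 or empty. Record the first element as start. Compute the differences in order; merge consecutive +1 steps into +1xN and write jumps out directly (for example +10). end is the last element, length is the number of elements, and sum is their total.
*   Decoding: take start, expand the deltas by cumulative addition, and recover all elements. Check that end, length, and sum all match; discard the record if any check fails.
*   Edge cases: an empty array is encoded as an empty string; a single-element array has empty deltas.

For the 'hf' and 'lf' columns, use only the decoded 'end' field for further data analysis and feature engineering.

In general, during a rising segment [high, low] climbs with high as the reference until a peak turning point appears in high, after which a falling segment begins; during a falling segment [high, low] declines with low as the reference until a trough turning point appears in low, after which a rising segment begins. This is the alternating rise, fall, rise pattern of the price.
The data is at 1-minute granularity, the finest available. It is discrete, and some timestamps are missing because of non-trading hours or minutes without trades.
In general, during a rising segment the end value of hf increases in one or several linear or near-linear forms while lf is usually 0; during a falling segment the end value of lf increases in one or several linear or near-linear forms while hf is usually 0.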


Here is all the data you need:
"m1fa-5rows.csv"

## Data loading

### Subtask:
Load the data from "m1fa-5rows.csv" into a dataframe.


**Reasoning**:
Load the data into a pandas DataFrame and display the first few rows.



In [None]:
import pandas as pd

df = pd.read_csv('m1fa-5rows.csv')
display(df.head())

Unnamed: 0,strDt,high,low,hf,lf
0,2024-10-29T02:10:00,5975.5,5975.5,0||0|0|0,0||0|0|0
1,2024-10-29T08:35:00,5964.25,5964.25,0||0|0|0,0||0|0|0
2,2024-10-29T09:29:00,5968.0,5968.0,0||0|0|0,0||0|0|0
3,2024-10-29T10:12:00,5970.75,5970.5,2||2|1|2,0||0|0|0
4,2024-10-29T10:22:00,5974.75,5974.75,2|+1x1|3|2|5,0||0|0|0


## Data wrangling

### Subtask:
Decode the 'hf' and 'lf' columns using the 'end' value from the RLE-Increments encoding description provided. This will involve extracting the 'end' value from the string representation in each cell for 'hf' and 'lf' columns and creating new numerical columns for these decoded values.


**Reasoning**:
Define a function to extract the 'end' value from the RLE-Increments encoded string.



In [None]:
def extract_end_value(encoded_string):
    """
    Extracts the 'end' value from an RLE-Increments encoded string.

    The sample rows follow the layout start|deltas|end|length|sum,
    so 'end' is the third field (index 2); index 3 is 'length'.

    Args:
        encoded_string: The RLE-Increments encoded string.

    Returns:
        The integer value of 'end' if found, otherwise None.
    """
    if not isinstance(encoded_string, str):
        return None
    parts = encoded_string.split('|')
    if len(parts) >= 4:
        try:
            return int(parts[2])
        except ValueError:
            return None
    return None

df['hf_end'] = df['hf'].apply(extract_end_value)
df['lf_end'] = df['lf'].apply(extract_end_value)

display(df.head())

Unnamed: 0,strDt,high,low,hf,lf,hf_end,lf_end
0,2024-10-29T02:10:00,5975.5,5975.5,0||0|0|0,0||0|0|0,0,0
1,2024-10-29T08:35:00,5964.25,5964.25,0||0|0|0,0||0|0|0,0,0
2,2024-10-29T09:29:00,5968.0,5968.0,0||0|0|0,0||0|0|0,0,0
3,2024-10-29T10:12:00,5970.75,5970.5,2||2|1|2,0||0|0|0,2,0
4,2024-10-29T10:22:00,5974.75,5974.75,2|+1x1|3|2|5,0||0|0|0,3,0
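
The task description also specifies consistency checks (end, length, sum) that the `end`-only extraction skips. The following is a minimal sketch of a full decoder, not the notebook's method. It assumes the field layout `start|deltas|end|length|sum` inferred from the sample rows, and it assumes delta tokens look like `+1xN` or a single signed jump such as `+10`; the separator between tokens is not shown in the data, so the regular expression simply scans for tokens. The function name `decode_rle_increments` is hypothetical.

```python
import re

# Assumed layout (inferred from the sample rows): start|deltas|end|length|sum
# Assumed delta tokens: "+1xN" (N repeats of +1) or one signed jump such as "+10"
_DELTA_TOKEN = re.compile(r"([+-]\d+)(?:x(\d+))?")

def decode_rle_increments(encoded):
    """Return the decoded list of ints, or None if a consistency check fails."""
    if not isinstance(encoded, str) or encoded == "":
        return []  # an empty string encodes an empty array
    parts = encoded.split("|")
    if len(parts) != 5:
        return None
    try:
        start, end, length, total = (int(parts[i]) for i in (0, 2, 3, 4))
    except ValueError:
        return None
    if length == 0:
        # "0||0|0|0" in the sample rows is treated here as an empty array
        return [] if (parts[1] == "" and end == 0 and total == 0) else None
    values = [start]
    for match in _DELTA_TOKEN.finditer(parts[1]):
        step = int(match.group(1))
        repeat = int(match.group(2)) if match.group(2) else 1
        for _ in range(repeat):
            values.append(values[-1] + step)
    # Discard the record if end, length, or sum disagree with the decoded values
    if values[-1] != end or len(values) != length or sum(values) != total:
        return None
    return values

print(decode_rle_increments("2|+1x1|3|2|5"))  # [2, 3]
print(decode_rle_increments("2||2|1|2"))      # [2]
print(decode_rle_increments("0||0|0|0"))      # []
print(decode_rle_increments("2|+1x1|4|2|5"))  # None (end does not match)
```

Rows for which this returns `None` could be dropped before the `end` field is used, which matches the "discard on validation failure" rule in the task description.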


## Data preparation

### Subtask:
Based on the decoded 'hf_end' and 'lf_end' columns, identify potential price turning points. Since 'hf_end' increasing suggests an uptrend and 'lf_end' increasing suggests a downtrend, look for transitions between these states. Also, identify price turning points based on 'high' and 'low' values (peaks and troughs).


**Reasoning**:
Calculate the differences for 'hf_end' and 'lf_end', identify potential indicator turning points, and identify potential price peaks and troughs based on neighboring values. Create boolean columns to flag these points.



In [None]:
df['hf_end_diff'] = df['hf_end'].diff()
df['lf_end_diff'] = df['lf_end'].diff()

# Identify potential indicator turning points
# hf_end turning point: when hf_end_diff changes from non-positive to positive
# lf_end turning point: when lf_end_diff changes from non-positive to positive
df['hf_indicator_turning_point'] = ((df['hf_end_diff'].shift(1) <= 0) & (df['hf_end_diff'] > 0))
df['lf_indicator_turning_point'] = ((df['lf_end_diff'].shift(1) <= 0) & (df['lf_end_diff'] > 0))


# Identify potential price peaks (high is greater than immediate neighbors)
df['price_peak'] = (df['high'] > df['high'].shift(1)) & (df['high'] > df['high'].shift(-1))

# Identify potential price troughs (low is less than immediate neighbors)
df['price_trough'] = (df['low'] < df['low'].shift(1)) & (df['low'] < df['low'].shift(-1))

display(df[['hf_end_diff', 'lf_end_diff', 'hf_indicator_turning_point', 'lf_indicator_turning_point', 'price_peak', 'price_trough']].head())

Unnamed: 0,hf_end_diff,lf_end_diff,hf_indicator_turning_point,lf_indicator_turning_point,price_peak,price_trough
0,,,False,False,False,False
1,0.0,0.0,False,False,False,True
2,0.0,0.0,False,False,False,False
3,1.0,0.0,True,False,False,False
4,1.0,0.0,False,False,False,False
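
The task description says that hf_end rises during rising segments while lf_end stays at 0, and the reverse during falling segments. One way to turn that into indicator turning points is to label each row with a regime and mark the rows where the regime flips. The sketch below uses toy values, and the column names `regime` and `regime_switch` are hypothetical, not part of the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "hf_end": [0, 2, 3, 5, 0, 0, 0, 1, 4],
    "lf_end": [0, 0, 0, 0, 1, 3, 6, 0, 0],
})

# +1 when only hf_end is active, -1 when only lf_end is active, 0 otherwise
toy["regime"] = np.select(
    [(toy["hf_end"] > 0) & (toy["lf_end"] == 0),
     (toy["lf_end"] > 0) & (toy["hf_end"] == 0)],
    [1, -1],
    default=0,
)

# Carry the last non-zero regime forward so neutral rows do not create switches
active = toy["regime"].where(toy["regime"] != 0).ffill()
previous = active.shift()
toy["regime_switch"] = active.ne(previous) & active.notna() & previous.notna()

print(toy)
```

In this toy series the switches land on rows 4 (up to down) and 7 (down to up), which are the rows one would compare against price peaks and troughs respectively.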


## Data analysis

### Subtask:
Analyze the relationship between the decoded 'hf_end', 'lf_end' and the price movements ([high, low]). Specifically, investigate the correlation between increases in 'hf_end' and rising prices, and increases in 'lf_end' and falling prices. Analyze the relationship between the identified price turning points and the turning points indicated by 'hf_end' and 'lf_end'.


**Reasoning**:
Calculate the correlation coefficients between 'hf_end' and price ('high', 'low'), and between 'lf_end' and price ('high', 'low').



In [None]:
correlation_hf_high = df['hf_end'].corr(df['high'])
correlation_hf_low = df['hf_end'].corr(df['low'])
correlation_lf_high = df['lf_end'].corr(df['high'])
correlation_lf_low = df['lf_end'].corr(df['low'])

print(f"Correlation between hf_end and high: {correlation_hf_high}")
print(f"Correlation between hf_end and low: {correlation_hf_low}")
print(f"Correlation between lf_end and high: {correlation_lf_high}")
print(f"Correlation between lf_end and low: {correlation_lf_low}")

Correlation between hf_end and high: 0.020713712562954567
Correlation between hf_end and low: 0.02051047978156996
Correlation between lf_end and high: -0.04798378477228087
Correlation between lf_end and low: -0.04860713610444682
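
The coefficients above compare indicator levels with price levels, whereas the subtask asks about increases in the indicators and rising or falling prices, which is a statement about changes. A minimal sketch of the change-based version follows, using toy values and the default Pearson correlation; the variable names are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "high":   [100.0, 100.5, 101.0, 101.75, 101.5, 101.0, 100.25, 100.5],
    "hf_end": [0, 1, 2, 4, 0, 0, 0, 1],
    "lf_end": [0, 0, 0, 0, 1, 2, 5, 0],
})

d_high = toy["high"].diff()
# Keep only increases of each indicator; resets back to 0 are clipped to 0
hf_up = toy["hf_end"].diff().clip(lower=0)
lf_up = toy["lf_end"].diff().clip(lower=0)

corr_hf = d_high.corr(hf_up)  # Pearson by default; NaN pairs are dropped
corr_lf = d_high.corr(lf_up)
print(f"d_high vs hf increases: {corr_hf:.3f}")
print(f"d_high vs lf increases: {corr_lf:.3f}")
```

On the toy series the first coefficient is positive and the second negative, which is the pattern the task description predicts; whether the real data shows it would need to be checked on `df`.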


**Reasoning**:
Analyze the co-occurrence of indicator turning points and price turning points by counting the rows where 'hf_indicator_turning_point' coincides with 'price_peak' and the rows where 'lf_indicator_turning_point' coincides with 'price_trough'.



In [None]:
# Count occurrences where hf indicator turning point and price peak happen at the same time
hf_peak_coincidence_count = df[df['hf_indicator_turning_point'] & df['price_peak']].shape[0]

# Count occurrences where lf indicator turning point and price trough happen at the same time
lf_trough_coincidence_count = df[df['lf_indicator_turning_point'] & df['price_trough']].shape[0]

print(f"Number of times hf indicator turning point coincides with a price peak: {hf_peak_coincidence_count}")
print(f"Number of times lf indicator turning point coincides with a price trough: {lf_trough_coincidence_count}")

# Optional: Display rows where these coincidences occur
# display(df[df['hf_indicator_turning_point'] & df['price_peak']].head())
# display(df[df['lf_indicator_turning_point'] & df['price_trough']].head())

Number of times hf indicator turning point coincides with a price peak: 4716
Number of times lf indicator turning point coincides with a price trough: 4847
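
A raw coincidence count is hard to read without knowing how often the two flags would overlap by chance, and exact same-row matching misses turning points that are one or two rows apart. The sketch below, on toy flags, computes the hit rate, the base rate, and their ratio (lift), with an optional tolerance in rows. The function name `coincidence_stats` and its `tolerance` parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def coincidence_stats(event_a, event_b, tolerance=0):
    """Share of event_a rows with an event_b row within +/- tolerance rows, plus lift."""
    a = pd.Series(event_a, dtype=bool).reset_index(drop=True)
    b = pd.Series(event_b, dtype=bool).reset_index(drop=True)
    # Widen event_b by the tolerance using a centered rolling maximum
    b_near = (
        b.astype(int)
        .rolling(window=2 * tolerance + 1, center=True, min_periods=1)
        .max()
        .astype(bool)
    )
    hits = int((a & b_near).sum())
    hit_rate = hits / a.sum() if a.sum() else np.nan
    base_rate = b_near.mean()  # chance of a match if event_a were unrelated
    lift = hit_rate / base_rate if base_rate else np.nan
    return {"hits": hits, "hit_rate": hit_rate, "base_rate": base_rate, "lift": lift}

a = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
b = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
stats_exact = coincidence_stats(a, b, tolerance=0)
stats_near = coincidence_stats(a, b, tolerance=1)
print(stats_exact)
print(stats_near)
```

A lift well above 1 suggests the flags co-occur more often than independence would produce; a lift near 1 means the count is mostly explained by how frequently each flag fires.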


## Feature engineering

### Subtask:
Create new features that capture the relationship between price and the decoded 'hf_end' and 'lf_end'. This could include features like the difference between consecutive 'hf_end' values, the difference between consecutive 'lf_end' values, indicators of when 'hf_end' or 'lf_end' are increasing, and combinations of these with price changes. Also, create features representing the identified price and indicator turning points.


**Reasoning**:
Calculate the differences between consecutive 'hf_end' and 'lf_end' values, create boolean flags for increasing values, calculate the price change, and then create interaction features. The turning point features from the previous step are kept as they are.



In [None]:
df['hf_end_diff'] = df['hf_end'].diff()
df['lf_end_diff'] = df['lf_end'].diff()
df['hf_end_increasing'] = df['hf_end_diff'] > 0
df['lf_end_increasing'] = df['lf_end_diff'] > 0
df['price_change'] = df['high'].diff()
df['hf_increase_price_change'] = df['hf_end_increasing'] * df['price_change']
df['lf_increase_price_change'] = df['lf_end_increasing'] * df['price_change']

# Ensure turning point features are present (they were created in the previous step)
# The columns 'hf_indicator_turning_point', 'lf_indicator_turning_point',
# 'price_peak', 'price_trough' should already exist in the dataframe.

display(df[['hf_end_diff', 'lf_end_diff', 'hf_end_increasing', 'lf_end_increasing', 'price_change', 'hf_increase_price_change', 'lf_increase_price_change', 'hf_indicator_turning_point', 'lf_indicator_turning_point', 'price_peak', 'price_trough']].head())

Unnamed: 0,hf_end_diff,lf_end_diff,hf_end_increasing,lf_end_increasing,price_change,hf_increase_price_change,lf_increase_price_change,hf_indicator_turning_point,lf_indicator_turning_point,price_peak,price_trough
0,,,False,False,,,,False,False,False,False
1,0.0,0.0,False,False,-11.25,-0.0,-0.0,False,False,False,True
2,0.0,0.0,False,False,3.75,0.0,0.0,False,False,False,False
3,1.0,0.0,True,False,2.75,2.75,0.0,True,False,False,False
4,1.0,0.0,True,False,4.0,4.0,0.0,False,False,False,False
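
Because the task description expects hf_end to keep increasing throughout a rising segment, the length of the current increasing run may be a more informative feature than the single-row flag. This is a small sketch on toy values; the helper `streak_length` and the column `hf_up_streak` are hypothetical names.

```python
import pandas as pd

def streak_length(flag):
    """Length of the current run of consecutive True values (0 where flag is False)."""
    flag = flag.astype(bool)
    # Every False row starts a new group; summing True values inside a group
    # gives the running length of the current streak
    groups = (~flag).cumsum()
    return flag.astype(int).groupby(groups).cumsum()

toy = pd.DataFrame({"hf_end": [0, 0, 1, 2, 4, 0, 0, 3, 5]})
toy["hf_end_increasing"] = toy["hf_end"].diff() > 0
toy["hf_up_streak"] = streak_length(toy["hf_end_increasing"])
print(toy)
```

The same helper can be applied to an `lf_end` increasing flag to measure how long a falling segment has been running.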


## Summary:

### Data Analysis Key Findings

*   The correlation between the decoded 'hf\_end' and price ('high' and 'low') is close to zero (about 0.02).
*   The correlation between the decoded 'lf\_end' and price ('high' and 'low') is also close to zero and slightly negative (about -0.05).
*   'hf\_indicator\_turning\_point' and 'price\_peak' are flagged on the same row in 4716 instances.
*   'lf\_indicator\_turning\_point' and 'price\_trough' are flagged on the same row in 4847 instances.
*   New features were created: the differences between consecutive 'hf\_end' and 'lf\_end' values, boolean flags for when 'hf\_end' or 'lf\_end' is increasing, and interaction terms that combine these flags with price changes.

### Insights or Next Steps

*   Linear correlation between indicator levels and price levels is weak. The same-row coincidence counts suggest that 'hf\_end' and 'lf\_end' may carry information about price reversals, but the counts alone do not establish it: they need to be compared with how often each flag fires and with the overlap expected if the flags were independent. Further investigation into the timing and sequence of these turning points (lead or lag) is warranted.
*   'price\_peak' and 'price\_trough' compare each row with the following row (`shift(-1)`), so they are only known one row later. Any model that uses them as inputs should account for this look-ahead.
*   The newly engineered features, including differences, increasing flags, and interaction terms, can serve as inputs to predictive models that forecast price movements from the combined behavior of price and these indicators.


# Task
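For the file m1fa-5rows.csv, treat [high, low] as a single price unit and [hf, lf] as a single indicator unit. Analyze in depth how the two are related, how price turning points relate to indicator turning points, and which features and feature combinations emerge. Use deep learning to extract features automatically. For the initial ideas and definitions, refer to the description below.

Note: hf and lf are both Delta RLE-Increments encoded. The encoding follows "Delta RLE (delta run-length encoding) + end + length + sum", a compact and efficient representation for sparse integer arrays that are mostly consecutive with occasional jumps.

RLE-Increments encoding and decoding procedure:

*   Encoding: if the array is empty, every field is 0 or empty. Record the first element as start. Compute the differences in order; merge consecutive +1 steps into +1xN and write jumps out directly (for example +10). end is the last element, length is the number of elements, and sum is their total.
*   Decoding: take start, expand the deltas by cumulative addition, and recover all elements. Check that end, length, and sum all match; discard the record if any check fails.
*   Edge cases: an empty array is encoded as an empty string; a single-element array has empty deltas.

For the 'hf' and 'lf' columns, use only the decoded 'end' field for further data analysis and feature engineering.

In general, during a rising segment [high, low] climbs with high as the reference until a peak turning point appears in high, after which a falling segment begins; during a falling segment [high, low] declines with low as the reference until a trough turning point appears in low, after which a rising segment begins. This is the alternating rise, fall, rise pattern of the price. The data is at 1-minute granularity, the finest available. It is discrete, and some timestamps are missing because of non-trading hours or minutes without trades. In general, during a rising segment the end value of hf increases in one or several linear or near-linear forms while lf is usually 0; during a falling segment the end value of lf increases in one or several linear or near-linear forms while hf is usually 0.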

Here is all the data you need:
"m1fa-5rows.csv"

## Data loading

### Subtask:
Load the data from "m1fa-5rows.csv" into a dataframe.


**Reasoning**:
Import the pandas library and load the data into a pandas DataFrame, then display the head and info of the dataframe.



In [None]:
import pandas as pd

df = pd.read_csv('m1fa-5rows.csv')
display(df.head())
display(df.info())

Unnamed: 0,strDt,high,low,hf,lf
0,2024-10-29T02:10:00,5975.5,5975.5,0||0|0|0,0||0|0|0
1,2024-10-29T08:35:00,5964.25,5964.25,0||0|0|0,0||0|0|0
2,2024-10-29T09:29:00,5968.0,5968.0,0||0|0|0,0||0|0|0
3,2024-10-29T10:12:00,5970.75,5970.5,2||2|1|2,0||0|0|0
4,2024-10-29T10:22:00,5974.75,5974.75,2|+1x1|3|2|5,0||0|0|0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102706 entries, 0 to 102705
Data columns (total 5 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   strDt   102706 non-null  object 
 1   high    102706 non-null  float64
 2   low     102706 non-null  float64
 3   hf      102706 non-null  object 
 4   lf      102706 non-null  object 
dtypes: float64(2), object(3)
memory usage: 3.9+ MB


None
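
The task description notes that some minutes are missing, and the first rows already show jumps of several hours between consecutive timestamps. Row-based operations such as `diff()` and `shift()` treat those rows as adjacent. The sketch below, on toy timestamps, parses `strDt` and marks where a new contiguous segment starts; the column names `dt`, `gap_minutes`, `after_gap`, and `segment_id`, and the 1-minute threshold, are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "strDt": [
        "2024-10-29T10:12:00",
        "2024-10-29T10:13:00",
        "2024-10-29T10:22:00",
        "2024-10-29T10:23:00",
    ],
})

toy["dt"] = pd.to_datetime(toy["strDt"])
# Minutes elapsed since the previous row (NaN for the first row)
toy["gap_minutes"] = toy["dt"].diff().dt.total_seconds() / 60
# A row follows a gap when more than one minute has passed
toy["after_gap"] = toy["gap_minutes"] > 1
# Rows that share a segment_id are contiguous 1-minute bars
toy["segment_id"] = toy["after_gap"].cumsum()

print(toy[["dt", "gap_minutes", "after_gap", "segment_id"]])
```

Differences and rolling windows could then be computed per `segment_id` (for example with `groupby`) so that they do not span a gap.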

## Data wrangling

### Subtask:
Decode the 'hf' and 'lf' columns using the 'end' value as described in the task.


**Reasoning**:
Define a function to extract the 'end' value from the encoded strings and apply it to the 'hf' and 'lf' columns, then convert the results to numeric and display the head and info of the updated dataframe.



In [None]:
def decode_rle_increment_end(encoded_string):
    """Returns the 'end' field of a Delta RLE-Increments encoded string, or None."""
    if isinstance(encoded_string, str):
        parts = encoded_string.split('|')
        if len(parts) >= 4:  # Require the trailing end, length, and sum fields
            # Fields are start|deltas|end|length|sum, so 'end' is third from last
            return parts[-3]
    return None

# Apply the function to 'hf' and create 'hf_end', then convert to numeric
df['hf_end'] = df['hf'].apply(decode_rle_increment_end)
df['hf_end'] = pd.to_numeric(df['hf_end'], errors='coerce')

# Apply the function to 'lf' and create 'lf_end', then convert to numeric
df['lf_end'] = df['lf'].apply(decode_rle_increment_end)
df['lf_end'] = pd.to_numeric(df['lf_end'], errors='coerce')

# Display the first few rows with the new columns
display(df.head())

# Check the data types of the new columns
display(df[['hf_end', 'lf_end']].info())

Unnamed: 0,strDt,high,low,hf,lf,hf_end,lf_end
0,2024-10-29T02:10:00,5975.5,5975.5,0||0|0|0,0||0|0|0,0,0
1,2024-10-29T08:35:00,5964.25,5964.25,0||0|0|0,0||0|0|0,0,0
2,2024-10-29T09:29:00,5968.0,5968.0,0||0|0|0,0||0|0|0,0,0
3,2024-10-29T10:12:00,5970.75,5970.5,2||2|1|2,0||0|0|0,2,0
4,2024-10-29T10:22:00,5974.75,5974.75,2|+1x1|3|2|5,0||0|0|0,3,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102706 entries, 0 to 102705
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   hf_end  102706 non-null  int64
 1   lf_end  102706 non-null  int64
dtypes: int64(2)
memory usage: 1.6 MB


None

## Data preparation

### Subtask:
Combine 'high' and 'low' into a single representation of price movement, and combine the decoded 'hf_end' and 'lf_end' into a single representation of the indicator movement.


**Reasoning**:
Create the 'price_movement' column as 'high' minus 'low' and the 'indicator_movement' column as 'hf_end' minus 'lf_end', then display the head of the dataframe.



In [None]:
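# 1. Create 'price_movement' column: the bar range (high minus low), which has no direction
df['price_movement'] = df['high'] - df['low']

# 2. Create 'indicator_movement' column: positive when hf_end dominates, negative when lf_end dominates
df['indicator_movement'] = df['hf_end'] - df['lf_end']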

# 3. Display the head of the dataframe
display(df.head())

Unnamed: 0,strDt,high,low,hf,lf,hf_end,lf_end,price_movement,indicator_movement
0,2024-10-29T02:10:00,5975.5,5975.5,0||0|0|0,0||0|0|0,0,0,0.0,0
1,2024-10-29T08:35:00,5964.25,5964.25,0||0|0|0,0||0|0|0,0,0,0.0,0
2,2024-10-29T09:29:00,5968.0,5968.0,0||0|0|0,0||0|0|0,0,0,0.0,0
3,2024-10-29T10:12:00,5970.75,5970.5,2||2|1|2,0||0|0|0,2,0,0.25,2
4,2024-10-29T10:22:00,5974.75,5974.75,2|+1x1|3|2|5,0||0|0|0,3,0,0.0,3
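
Since `high - low` is the width of each bar, it does not say whether price moved up or down. If a signed price representation is wanted next to the signed `indicator_movement`, one option is the change of the bar midpoint. This is a sketch on toy values, and the column names `bar_range`, `mid`, and `mid_change` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "high": [5970.75, 5974.75, 5975.00, 5973.50],
    "low":  [5970.50, 5974.75, 5974.25, 5972.00],
})

toy["bar_range"] = toy["high"] - toy["low"]    # unsigned width of the bar
toy["mid"] = (toy["high"] + toy["low"]) / 2    # midpoint of the bar
toy["mid_change"] = toy["mid"].diff()          # signed move from the previous bar

print(toy)
```

The midpoint is only one possible choice; the task description suggests tracking `high` in rising segments and `low` in falling segments, which would need the segment direction to be known first.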


## Feature engineering

### Subtask:
Based on the descriptions, engineer features that capture price turning points (based on 'high' and 'low') and indicator turning points (based on decoded 'hf_end' and 'lf_end'). Explore interactions between these features.


**Reasoning**:
Identify potential price and indicator turning points using a rolling window and create binary features.



In [None]:
import numpy as np

# 1. Identify potential price turning points using rolling windows
window_size = 5 # Example window size
df['price_high_rolling_max'] = df['high'].rolling(window=window_size, center=True).max()
df['price_low_rolling_min'] = df['low'].rolling(window=window_size, center=True).min()

# Create binary features for price turning points
# A peak is a row whose high equals the centered rolling maximum (ties are flagged too)
df['is_price_peak'] = (df['high'] == df['price_high_rolling_max']).astype(int)
# A trough is a row whose low equals the centered rolling minimum (ties are flagged too)
df['is_price_trough'] = (df['low'] == df['price_low_rolling_min']).astype(int)

# 2. Identify potential indicator turning points
# Onset of a rise: the indicator is at zero (<= 0) on this row and higher on the next row
# Note: shift(-1) and the centered window both read rows after the current one
df['is_indicator_hf_increasing'] = ((df['hf_end'].shift(-1) > df['hf_end']) & (df['hf_end'] <= 0)).astype(int)
df['is_indicator_lf_increasing'] = ((df['lf_end'].shift(-1) > df['lf_end']) & (df['lf_end'] <= 0)).astype(int)

# Fill the NaN values created by the rolling window with 0 for simplicity
df.fillna(0, inplace=True)

# 3. Explore interaction features
# Example interaction: Price peak occurring when hf_end is not increasing (could be decreasing or staying low)
df['interaction_peak_no_hf_increase'] = df['is_price_peak'] * (1 - df['is_indicator_hf_increasing'])

# Example interaction: Price trough occurring when lf_end is not increasing
df['interaction_trough_no_lf_increase'] = df['is_price_trough'] * (1 - df['is_indicator_lf_increasing'])

# Example interaction: Price peak and lf_end increasing
df['interaction_peak_lf_increase'] = df['is_price_peak'] * df['is_indicator_lf_increasing']

# Example interaction: Price trough and hf_end increasing
df['interaction_trough_hf_increase'] = df['is_price_trough'] * df['is_indicator_hf_increasing']


# Display the head with new features
display(df.head())

Unnamed: 0,strDt,high,low,hf,lf,hf_end,lf_end,price_movement,indicator_movement,price_high_rolling_max,price_low_rolling_min,is_price_peak,is_price_trough,is_indicator_hf_increasing,is_indicator_lf_increasing,interaction_peak_no_hf_increase,interaction_trough_no_lf_increase,interaction_peak_lf_increase,interaction_trough_hf_increase
0,2024-10-29T02:10:00,5975.5,5975.5,0||0|0|0,0||0|0|0,0,0,0.0,0,0.0,0.0,0,0,0,0,0,0,0,0
1,2024-10-29T08:35:00,5964.25,5964.25,0||0|0|0,0||0|0|0,0,0,0.0,0,0.0,0.0,0,0,0,0,0,0,0,0
2,2024-10-29T09:29:00,5968.0,5968.0,0||0|0|0,0||0|0|0,0,0,0.0,0,5975.5,5964.25,0,0,1,0,0,0,0,0
3,2024-10-29T10:12:00,5970.75,5970.5,2||2|1|2,0||0|0|0,2,0,0.25,2,5975.0,5964.25,0,0,0,0,0,0,0,0
4,2024-10-29T10:22:00,5974.75,5974.75,2|+1x1|3|2|5,0||0|0|0,3,0,0.0,3,5986.0,5968.0,0,0,0,0,0,0,0,0
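
The task description expects peaks and troughs to alternate, but window-based flags can mark several peaks in a row before the next trough. The following sketch filters candidate flags so that the kept extrema alternate, keeping the more extreme candidate when two of the same kind follow each other. The function `alternate_extrema` and the toy values are hypothetical.

```python
def alternate_extrema(highs, lows, is_peak, is_trough):
    """Filter candidate flags so that peaks and troughs strictly alternate.

    Among consecutive candidates of the same kind, the more extreme one is kept.
    Returns a list of (row_index, kind) tuples.
    """
    kept = []
    for i in range(len(highs)):
        for kind, flag in (("peak", is_peak[i]), ("trough", is_trough[i])):
            if not flag:
                continue
            if kept and kept[-1][1] == kind:
                j = kept[-1][0]
                better = highs[i] > highs[j] if kind == "peak" else lows[i] < lows[j]
                if better:
                    kept[-1] = (i, kind)  # replace the weaker candidate
            else:
                kept.append((i, kind))
    return kept

highs = [10, 12, 13, 11, 9, 10, 14, 12]
lows = [9, 11, 12, 10, 8, 9, 13, 11]
is_peak = [0, 1, 1, 0, 0, 0, 1, 0]
is_trough = [1, 0, 0, 0, 1, 0, 0, 0]

extrema = alternate_extrema(highs, lows, is_peak, is_trough)
print(extrema)
```

With the notebook's columns, the inputs would be `df['high']`, `df['low']`, `df['is_price_peak']`, and `df['is_price_trough']` converted to lists.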


## Data exploration

### Subtask:
Analyze the relationship between the combined price representation ('price_movement') and the combined indicator representation ('indicator_movement'). Look for patterns and correlations, especially around the engineered turning points features ('is_price_peak', 'is_price_trough', 'is_indicator_hf_increasing', 'is_indicator_lf_increasing') and their interactions.


**Reasoning**:
Calculate the correlation matrix for the relevant features and then perform grouped analysis based on the turning point and interaction features to understand the relationship with price and indicator movements.



In [None]:
# Select the columns for correlation analysis
correlation_cols = [
    'price_movement',
    'indicator_movement',
    'is_price_peak',
    'is_price_trough',
    'is_indicator_hf_increasing',
    'is_indicator_lf_increasing',
    'interaction_peak_no_hf_increase',
    'interaction_trough_no_lf_increase',
    'interaction_peak_lf_increase',
    'interaction_trough_hf_increase'
]

# Calculate the correlation matrix
correlation_matrix = df[correlation_cols].corr()

# Print the correlation matrix
print("Correlation Matrix:")
display(correlation_matrix)

# Group by turning point features and calculate mean of movements
print("\nMean Movements Grouped by Turning Point Features:")
for turning_point_feature in ['is_price_peak', 'is_price_trough', 'is_indicator_hf_increasing', 'is_indicator_lf_increasing']:
    print(f"\nGrouping by: {turning_point_feature}")
    grouped_analysis = df.groupby(turning_point_feature)[['price_movement', 'indicator_movement']].mean()
    display(grouped_analysis)

# Group by interaction features and calculate mean of movements
print("\nMean Movements Grouped by Interaction Features:")
for interaction_feature in ['interaction_peak_no_hf_increase', 'interaction_trough_no_lf_increase', 'interaction_peak_lf_increase', 'interaction_trough_hf_increase']:
    print(f"\nGrouping by: {interaction_feature}")
    grouped_analysis_interaction = df.groupby(interaction_feature)[['price_movement', 'indicator_movement']].mean()
    display(grouped_analysis_interaction)

Correlation Matrix:


Unnamed: 0,price_movement,indicator_movement,is_price_peak,is_price_trough,is_indicator_hf_increasing,is_indicator_lf_increasing,interaction_peak_no_hf_increase,interaction_trough_no_lf_increase,interaction_peak_lf_increase,interaction_trough_hf_increase
price_movement,1.0,-0.104222,0.009202,0.025874,-0.001474,-0.007743,0.012736,0.027808,0.013576,0.023644
indicator_movement,-0.104222,1.0,0.030068,-0.007569,-0.004257,0.061624,0.029783,-0.010629,0.03102,-0.004707
is_price_peak,0.009202,0.030068,1.0,-0.141798,-0.125633,0.036184,0.974632,-0.141854,0.363882,-0.061536
is_price_trough,0.025874,-0.007569,-0.141798,1.0,0.031577,-0.124044,-0.142141,0.976812,-0.062686,0.367103
is_indicator_hf_increasing,-0.001474,-0.004257,-0.125633,0.031577,1.0,-0.120432,-0.177338,0.036748,-0.059745,0.423836
is_indicator_lf_increasing,-0.007743,0.061624,0.036184,-0.124044,-0.120432,1.0,0.041485,-0.171809,0.434607,-0.059225
interaction_peak_no_hf_increase,0.012736,0.029783,0.974632,-0.142141,-0.177338,0.041485,1.0,-0.142441,0.367454,-0.075162
interaction_trough_no_lf_increase,0.027808,-0.010629,-0.141854,0.976812,0.036748,-0.171809,-0.142441,1.0,-0.07467,0.370981
interaction_peak_lf_increase,0.013576,0.03102,0.363882,-0.062686,-0.059745,0.434607,0.367454,-0.07467,1.0,-0.027695
interaction_trough_hf_increase,0.023644,-0.004707,-0.061536,0.367103,0.423836,-0.059225,-0.075162,0.370981,-0.027695,1.0



Mean Movements Grouped by Turning Point Features:

Grouping by: is_price_peak


Unnamed: 0_level_0,price_movement,indicator_movement
is_price_peak,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2.503318,-1185.657157
1,2.583976,-615.262442



Grouping by: is_price_trough


Unnamed: 0_level_0,price_movement,indicator_movement
is_price_trough,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2.478188,-1060.693054
1,2.706928,-1205.505074



Grouping by: is_indicator_hf_increasing


Unnamed: 0_level_0,price_movement,indicator_movement
is_indicator_hf_increasing,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2.519411,-1073.446962
1,2.505009,-1163.453927



Grouping by: is_indicator_lf_increasing


Unnamed: 0_level_0,price_movement,indicator_movement
is_indicator_lf_increasing,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2.527444,-1257.31737
1,2.450634,65.628319



Mean Movements Grouped by Interaction Features:

Grouping by: interaction_peak_no_hf_increase


Unnamed: 0_level_0,price_movement,indicator_movement
interaction_peak_no_hf_increase,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2.498379,-1182.193003
1,2.611915,-607.596283



Grouping by: interaction_trough_no_lf_increase


Unnamed: 0_level_0,price_movement,indicator_movement
interaction_trough_no_lf_increase,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2.476231,-1051.450398
1,2.72592,-1257.982787



Grouping by: interaction_peak_lf_increase


Unnamed: 0_level_0,price_movement,indicator_movement
interaction_peak_lf_increase,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2.509873,-1123.138351
1,2.787123,247.791474



Grouping by: interaction_trough_hf_increase


Unnamed: 0_level_0,price_movement,indicator_movement
interaction_trough_hf_increase,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2.504294,-1079.885461
1,2.989076,-1288.755014


## Feature engineering

### Subtask:
Use deep learning techniques to automatically extract additional features from the combined price and indicator data.


**Reasoning**:
Select, normalize, and reshape the relevant columns for deep learning input.



In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np

# 1. Select the relevant columns for deep learning input
# Using the combined 'price_movement' and 'indicator_movement' as requested in the prompt
selected_features = ['price_movement', 'indicator_movement']
data_for_dl = df[selected_features].values

# 2. Normalize these selected features
scaler = StandardScaler()
normalized_data = scaler.fit_transform(data_for_dl)

# 3. Reshape the normalized data for sequential input
# Define sequence length
sequence_length = 10 # Example sequence length

# Create sequences
sequences = []
for i in range(len(normalized_data) - sequence_length + 1):
    sequence = normalized_data[i : i + sequence_length]
    sequences.append(sequence)

sequences = np.array(sequences)

print(f"Original data shape: {data_for_dl.shape}")
print(f"Normalized data shape: {normalized_data.shape}")
print(f"Reshaped data shape (sequences): {sequences.shape}")

Original data shape: (102706, 2)
Normalized data shape: (102706, 2)
Reshaped data shape (sequences): (102697, 10, 2)
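
The Python loop that builds the sequences can also be written with NumPy's `sliding_window_view`, which avoids copying each window during construction. This sketch uses a small toy array; it places the window axis last, so a transpose is needed to obtain the `(n_windows, sequence_length, n_features)` layout used above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.arange(12, dtype=float).reshape(6, 2)  # 6 time steps, 2 features
sequence_length = 3

# Resulting shape is (n_windows, n_features, sequence_length): window axis comes last
windows = sliding_window_view(data, window_shape=sequence_length, axis=0)
# Reorder to (n_windows, sequence_length, n_features)
sequences = windows.transpose(0, 2, 1)

# Reference result from the explicit loop, for comparison
loop_sequences = np.array(
    [data[i:i + sequence_length] for i in range(len(data) - sequence_length + 1)]
)

print(sequences.shape)
print(np.array_equal(sequences, loop_sequences))
```

The result is a read-only view of `data`; `np.ascontiguousarray(sequences)` makes an independent copy if a writable array is required.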


**Reasoning**:
Define and train a simple deep learning model to extract features from the sequential data.



In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN  # SimpleRNN keeps the example small
from tensorflow.keras.optimizers import Adam

# 4. Define a simple deep learning model architecture
model = Sequential()
# RNN layer that processes each sequence
model.add(SimpleRNN(units=32, activation='relu', input_shape=(sequence_length, len(selected_features))))
# Dense layer whose 8 activations are used as the extracted features
model.add(Dense(units=8, activation='relu', name='extracted_features'))
# Output layer: one value per input feature; it exists only to provide a training signal
model.add(Dense(units=2))

model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')  # MSE as a simple loss function

model.summary()

# 5. Train this model on the prepared data
# Feature extraction has no natural label here, so a proxy target is needed.
# This cell uses the last time step of each sequence as the target. That step is
# also part of the input, so the task is close to copying; the aim is only to
# obtain an intermediate representation, not a useful forecast.
target_data = sequences[:, -1, :]  # The last step's values

# Train the model (a small number of epochs, for demonstration)
history = model.fit(sequences, target_data, epochs=5, batch_size=32, verbose=0)

print("Model training finished.")



Model training finished.
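
In the cell above, `target_data = sequences[:, -1, :]` is the last row of each input window, so the network can reach a low loss by passing that row through. One alternative proxy task is to predict the row that follows each window. The sketch below shows only the dataset construction, in NumPy on a toy array; the function name `make_next_step_dataset` is hypothetical.

```python
import numpy as np

def make_next_step_dataset(data, sequence_length):
    """Windows of `sequence_length` rows as X, the row right after each window as y."""
    n_samples = len(data) - sequence_length
    if n_samples <= 0:
        raise ValueError("data must have more rows than sequence_length")
    X = np.stack([data[i:i + sequence_length] for i in range(n_samples)])
    y = data[sequence_length:]
    return X, y

data = np.arange(16, dtype=float).reshape(8, 2)  # 8 time steps, 2 features
X, y = make_next_step_dataset(data, sequence_length=3)

print(X.shape, y.shape)
print(X[0], y[0])
```

With this construction there is one fewer sample than windows, and each target row lies strictly after the rows the model sees.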


**Reasoning**:
Use the trained model to extract features from the data and add them as new columns to the original DataFrame.



In [None]:
from tensorflow.keras.models import Model

# 6. Use the trained model to predict or extract the new features
# Create a new model that outputs the result of the 'extracted_features' layer
feature_extractor_model = Model(inputs=model.input, outputs=model.get_layer('extracted_features').output)

# Predict the features for the sequences
extracted_features = feature_extractor_model.predict(sequences)

# 7. Add these extracted features as new columns to the original DataFrame
# Each feature row corresponds to the last row of its sequence, so the first
# 'sequence_length - 1' rows of the DataFrame have no features.
# num_sequences = len(df) - sequence_length + 1
# Pad those leading rows with NaN so the array lines up with df.index
nan_padding = np.full((sequence_length - 1, extracted_features.shape[1]), np.nan)

# Prepend the NaN padding to the extracted features
aligned_extracted_features = np.vstack((nan_padding, extracted_features))

# Add the extracted features as new columns to the original DataFrame
feature_column_names = [f'dl_feature_{i+1}' for i in range(extracted_features.shape[1])]
extracted_features_df = pd.DataFrame(aligned_extracted_features, columns=feature_column_names, index=df.index)

df = pd.concat([df, extracted_features_df], axis=1)

# Display the head of the dataframe with the new features
display(df.head(sequence_length + 5)) # Display enough rows to see the features appear after the padding

# Display info to check new columns and dtypes
display(df.info())

AttributeError: The layer sequential has never been called and thus has no defined input.

**Reasoning**:
The error `AttributeError: The layer sequential has never been called and thus has no defined input.` is raised when `model.input` is accessed: the `sequential` model does not expose a defined input here, even though `input_shape` was passed to its first layer. I will rebuild the feature extractor with an explicit `Input` layer and connect it to the trained `simple_rnn` and `extracted_features` layers, so the trained weights are reused up to the feature extraction layer.



In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input

# 6. Use the trained model to predict or extract the new features
# Create an input layer that matches the expected input shape of the trained model
input_layer = Input(shape=(sequence_length, len(selected_features)))

# Get the layers from the trained model
rnn_layer = model.get_layer('simple_rnn')
extracted_features_layer = model.get_layer('extracted_features')

# Connect the layers to build the feature extractor model
x = rnn_layer(input_layer)
extracted_features_output = extracted_features_layer(x)

# Create the new model that outputs the result of the 'extracted_features' layer
feature_extractor_model = Model(inputs=input_layer, outputs=extracted_features_output)

# Predict the features for the sequences
extracted_features = feature_extractor_model.predict(sequences)

# 7. Add these extracted features as new columns to the original DataFrame
# Each row of extracted_features belongs to the last row of its input sequence, and there are
# num_sequences = len(df) - sequence_length + 1 sequences in total.
# The first 'sequence_length - 1' rows of df therefore have no features, so pad them with NaN
# to make the feature array line up with df.index.
nan_padding = np.full((sequence_length - 1, extracted_features.shape[1]), np.nan)

# Prepend the NaN padding to the extracted features
aligned_extracted_features = np.vstack((nan_padding, extracted_features))

# Add the extracted features as new columns to the original DataFrame
feature_column_names = [f'dl_feature_{i+1}' for i in range(extracted_features.shape[1])]
extracted_features_df = pd.DataFrame(aligned_extracted_features, columns=feature_column_names, index=df.index)

df = pd.concat([df, extracted_features_df], axis=1)

# Display the head of the dataframe with the new features
display(df.head(sequence_length + 5)) # Display enough rows to see the features appear after the padding

# Display info to check new columns and dtypes
display(df.info())

[1m3210/3210[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 2ms/step


Unnamed: 0,strDt,high,low,hf,lf,hf_end,lf_end,price_movement,indicator_movement,price_high_rolling_max,...,interaction_peak_lf_increase,interaction_trough_hf_increase,dl_feature_1,dl_feature_2,dl_feature_3,dl_feature_4,dl_feature_5,dl_feature_6,dl_feature_7,dl_feature_8
0,2024-10-29T02:10:00,5975.5,5975.5,0||0|0|0,0||0|0|0,0,0,0.0,0,0.0,...,0,0,,,,,,,,
1,2024-10-29T08:35:00,5964.25,5964.25,0||0|0|0,0||0|0|0,0,0,0.0,0,0.0,...,0,0,,,,,,,,
2,2024-10-29T09:29:00,5968.0,5968.0,0||0|0|0,0||0|0|0,0,0,0.0,0,5975.5,...,0,0,,,,,,,,
3,2024-10-29T10:12:00,5970.75,5970.5,2||2|1|2,0||0|0|0,2,0,0.25,2,5975.0,...,0,0,,,,,,,,
4,2024-10-29T10:22:00,5974.75,5974.75,2|+1x1|3|2|5,0||0|0|0,3,0,0.0,3,5986.0,...,0,0,,,,,,,,
5,2024-10-29T10:30:00,5975.0,5975.0,2|+1x2|4|3|9,0||0|0|0,4,0,0.0,4,5986.75,...,0,0,,,,,,,,
6,2024-10-29T12:00:00,5986.0,5986.0,2|+1x3|5|4|14,0||0|0|0,5,0,0.0,5,5986.75,...,0,0,,,,,,,,
7,2024-10-29T12:18:00,5986.75,5986.75,2|+1x9|11|10|65,0||0|0|0,11,0,0.0,11,5986.75,...,0,0,,,,,,,,
8,2024-10-29T12:21:00,5982.5,5982.5,0||0|0|0,0||0|0|0,0,0,0.0,0,5986.75,...,0,0,,,,,,,,
9,2024-10-29T12:25:00,5985.5,5985.5,0||0|0|0,0||0|0|0,0,0,0.0,0,5988.5,...,0,0,0.821598,0.140565,0.025867,0.946876,0.524412,0.0,0.035807,0.292607


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102706 entries, 0 to 102705
Data columns (total 27 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   strDt                              102706 non-null  object 
 1   high                               102706 non-null  float64
 2   low                                102706 non-null  float64
 3   hf                                 102706 non-null  object 
 4   lf                                 102706 non-null  object 
 5   hf_end                             102706 non-null  int64  
 6   lf_end                             102706 non-null  int64  
 7   price_movement                     102706 non-null  float64
 8   indicator_movement                 102706 non-null  int64  
 9   price_high_rolling_max             102706 non-null  float64
 10  price_low_rolling_min              102706 non-null  float64
 11  is_price_peak                      1027

None
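
The NaN padding can be sanity-checked on a toy array. The sketch below is illustrative only: `align_to_rows` is a hypothetical helper, not part of the notebook, and it assumes one window per row with stride 1, so that the window ending at row `i` supplies the features for row `i`. The value `sequence_length = 10` is inferred from the output above, where the first non-NaN features appear at index 9.

```python
import math

import numpy as np


def align_to_rows(features: np.ndarray, n_rows: int, sequence_length: int) -> np.ndarray:
    """Left-pad window-level features with NaN so they line up with the original rows.

    Hypothetical helper: assumes stride-1 windows, each labelled by its last row.
    """
    expected_windows = n_rows - sequence_length + 1
    if features.shape[0] != expected_windows:
        raise ValueError(
            f"expected {expected_windows} windows, got {features.shape[0]}"
        )
    padding = np.full((sequence_length - 1, features.shape[1]), np.nan)
    return np.vstack((padding, features))


# Toy data: 8 rows and windows of length 3 give 8 - 3 + 1 = 6 windows.
toy = np.arange(12, dtype=float).reshape(6, 2)
aligned = align_to_rows(toy, n_rows=8, sequence_length=3)

# Row counts reported in the notebook output, with the inferred window length.
n_rows_in_df = 102706
inferred_sequence_length = 10  # assumption, read off the displayed head
n_windows = n_rows_in_df - inferred_sequence_length + 1

# Keras predict() uses batch_size=32 by default, which matches the 3210 steps logged.
assert math.ceil(n_windows / 32) == 3210
```

If the window count and the row count disagree, `align_to_rows` raises instead of silently misaligning features, which is the failure mode worth guarding against when `index=df.index` is passed to `pd.DataFrame`.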

## Model training

### Subtask:
Prepare the data and train a deep learning model to predict or classify the relationship between price and indicator movements, potentially focusing on identifying turning points.


**Reasoning**:
Drop the rows whose deep learning features are NaN, choose the target variable and the feature set, split the data into training and test sets, then define, compile, and train a deep learning model.
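
Turning points are typically a small minority of rows, so accuracy can look high even for a model that never predicts a peak. The class ratio is not shown in this section, so the sketch below uses made-up labels (5 peaks in 100 rows) purely for illustration; `binary_report` is a hypothetical helper written with NumPy only, not something the notebook defines.

```python
import numpy as np


def binary_report(y_true, y_pred) -> dict:
    """Accuracy, precision and recall for 0/1 labels (hypothetical helper)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))    # peaks correctly flagged
    fp = int(np.sum(~y_true & y_pred))   # non-peaks flagged as peaks
    fn = int(np.sum(y_true & ~y_pred))   # peaks that were missed
    tn = int(np.sum(~y_true & ~y_pred))  # non-peaks correctly ignored
    return {
        "accuracy": (tp + tn) / y_true.size,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative labels only, not taken from the dataset: 5 peaks in 100 rows.
y_true = np.zeros(100, dtype=int)
y_true[[10, 30, 50, 70, 90]] = 1

# A "model" that never predicts a peak still reaches 95% accuracy here.
always_zero = np.zeros_like(y_true)
baseline = binary_report(y_true, always_zero)
print(baseline)
```

On the real data, the same report could be computed from `y_test` and the thresholded model predictions, and compared against this always-zero baseline before reading too much into the accuracy column.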



In [None]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
import numpy as np

# 1. Drop rows with NaN values in the deep learning features
# The deep learning features start from 'dl_feature_1'
dl_feature_cols = [f'dl_feature_{i+1}' for i in range(8)] # Assuming 8 features were extracted
df_cleaned = df.dropna(subset=dl_feature_cols).copy()

# 2. Define your target variable(s)
# Let's choose 'is_price_peak' as the target for a classification task
target_variable = 'is_price_peak'

# 3. Define your feature set (X) for the model
# Include original relevant features, engineered features, and deep learning features.
# The target ('is_price_peak') is deliberately left out of this list.
# Caution: if the 'interaction_peak_*' columns were derived from 'is_price_peak',
# they leak the target into X and should be dropped before trusting the scores.
feature_cols = [
    'price_movement',
    'indicator_movement',
    'is_price_trough',
    'is_indicator_hf_increasing',
    'is_indicator_lf_increasing',
    'interaction_peak_no_hf_increase',
    'interaction_trough_no_lf_increase',
    'interaction_peak_lf_increase',
    'interaction_trough_hf_increase'
] + dl_feature_cols

X = df_cleaned[feature_cols]

# 4. Define your target set (y)
y = df_cleaned[target_variable]

# 5. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # Stratify for classification target

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

# 6. Define a deep learning model architecture
# Simple feedforward network for binary classification
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid') # Sigmoid for binary classification
])

# 7. Compile the deep learning model
model.compile(optimizer='adam',
              loss='binary_crossentropy', # Binary crossentropy for binary classification
              metrics=['accuracy'])

model.summary()

# 8. Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)


Shape of X_train: (82157, 17)
Shape of X_test: (20540, 17)
Shape of y_train: (82157,)
Shape of y_test: (20540,)


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/10
2054/2054 ━━━━━━━━━━━━━━━━━━━━ 8s 3ms/step - accuracy: 0.8614 - loss: 16.0719 - val_accuracy: 0.9523 - val_loss: 0.2091
Epoch 2/10
2054/2054 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.9531 - loss: 0.7280 - val_accuracy: 0.9645 - val_loss: 0.1055
Epoch 3/10
2054/2054 ━━━━━━━━━━━━━━━━━━━━ 9s 2ms/step - accuracy: 0.9638 - loss: 0.1788 - val_accuracy: 0.9733 - val_loss: 0.0839
Epoch 4/10
2054/2054 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.9701 - loss: 0.1298 - val_accuracy: 0.9735 - val_loss: 0.0803
Epoch 5/10
2054/2054 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - accuracy: 0.9733 - loss: 0.1120 - val_accuracy: 0.9782 - val_loss: 0.0697
Epoch 6/10
2054/2054 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.9770 - loss: 0.0858 - val_accuracy: 0.9797 - val_loss: 0.0641
Epoch 7/10
[1