In [130]:
# import libraries
import pandas as pd
import datetime as dt
import numpy as np

### Question 1: Dummies for Month and Week-of-Month

**What is the ABSOLUTE CORRELATION VALUE of the most correlated dummy variable `<month>_w<week_of_month>` with the binary outcome `is_positive_growth_30d_future`?**

From the correlation analysis and modeling, you may have observed that October and November are potentially important seasonal months. In this task, you'll go further by generating dummy variables for both the **Month** and **Week-of-Month** (starting from 1). For example, the first week of October should be coded as: `'October_w1'`.

Once you've generated these new variables, identify the one with the **highest absolute correlation** with `is_positive_growth_30d_future`, and round the result to **three decimal places**.


#### Suggested Steps

1. Use [this StackOverflow reference](https://stackoverflow.com/questions/25249033/week-of-a-month-pandas) to compute the week of the month using the following formula:
  ```python
  (d.day - 1) // 7 + 1
  ```
2. Create a new string variable that combines the month name and week of the month, e.g. `'October_w1'`, `'November_w2'`.

3. Add the new variable (e.g., `month_wom`) to your set of **categorical features**.

   Your updated categorical feature list should include:
   - `'Month'`
   - `'Weekday'`
   - `'Ticker'`
   - `'ticker_type'`
   - `'month_wom'`

4. Use [`pandas.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to generate dummy variables for all categorical features.

   This should result in approximately **115 dummy variables**, including around **60** for the `month_wom` feature (`12 months × up to 5 weeks`).

5. Use [`DataFrame.corr()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) to compute the correlation between each feature and the target variable `is_positive_growth_30d_future`.

6. Filter the correlation results to include only the dummy variables generated from `month_wom`.

7. Create a new column named `abs_corr` in the correlation results that stores the **absolute value** of the correlations.

8. Sort the correlation results by `abs_corr` in **descending** order.

9. Identify and report the **highest absolute correlation value** among the `month_wom` dummy variables, rounded to **three decimal places**.
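
Steps 1 and 2 can be sketched on a handful of dates. This is a minimal illustration, assuming a datetime column; the sample dates are made up and are not taken from the homework dataset. Note that the formula counts 7-day blocks starting from the 1st of the month, not calendar weeks.

```python
import pandas as pd

# Hypothetical sample dates chosen to hit the block boundaries (days 7/8 and 29).
dates = pd.Series(pd.to_datetime(
    ["2024-10-01", "2024-10-07", "2024-10-08", "2024-10-29", "2024-11-15"]
))

# Step 1: days 1-7 -> 1, 8-14 -> 2, 15-21 -> 3, 22-28 -> 4, 29-31 -> 5.
week_of_month = (dates.dt.day - 1) // 7 + 1

# Step 2: combine the month name and the week of the month.
month_wom = dates.dt.month_name() + "_w" + week_of_month.astype(str)

print(month_wom.tolist())
# ['October_w1', 'October_w1', 'October_w2', 'October_w5', 'November_w3']
```

Because days 29 to 31 fall into a fifth block, every month of 29 or more days contributes five levels, which is where the figure of about 60 `month_wom` dummies comes from.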


**NOTE**: The new dummies will be used as features in the next tasks, so please keep them in the dataset.

In [131]:
df_full = pd.read_parquet('/content/drive/MyDrive/Colab Notebooks/1. DataTalk/PythonInvest/Homework - Deadline Jul 14/data/stocks_df_combined_2025_06_13.parquet.brotli')

In [132]:
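# truncated df_full with 25 years of data
# .copy() makes df an independent frame, so adding columns later does not raise SettingWithCopyWarning
df = df_full[df_full.Date>='2000-01-01'].copy()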
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 191795 entries, 3490 to 5700
Columns: 203 entries, Open to growth_btc_usd_365d
dtypes: datetime64[ns](3), float64(129), int32(64), int64(5), object(2)
memory usage: 251.7+ MB


In [133]:
[col for col in df.keys() if '30d' in col]

['growth_30d',
 'growth_future_30d',
 'is_positive_growth_30d_future',
 'growth_dax_30d',
 'growth_snp500_30d',
 'growth_dji_30d',
 'growth_epi_30d',
 'growth_gold_30d',
 'growth_wti_oil_30d',
 'growth_brent_oil_30d',
 'growth_btc_usd_30d']

In [134]:
# Week of month counts 7-day blocks from the 1st: days 1-7 -> w1, 8-14 -> w2, ...
df['month_wom'] = df['Date'].dt.month_name() + '_w' + ((df['Date'].dt.day - 1) // 7 + 1).astype(str)
df['Month'] = df['Date'].dt.month_name()
df['Weekday'] = df['Date'].dt.day_name()

In [135]:
# Categorical feature list
CATEGORICAL = ['Month', 'Weekday', 'Ticker', 'ticker_type', 'month_wom']
# Generate dummy variables for the categorical features
dummy_variables = pd.get_dummies(df[CATEGORICAL],dtype='int32')
# append dummy variables into df
df_with_dummies = pd.concat([df, dummy_variables], axis=1)

In [136]:
DUMMIES = dummy_variables.keys().to_list()
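
A quick sanity check on the dummy counts (about 115 in total, about 60 for `month_wom`) could look like the sketch below. It uses a tiny made-up frame rather than the homework data, and relies on `pd.get_dummies` naming each column `<feature>_<level>` when given a DataFrame.

```python
import pandas as pd

# Hypothetical toy frame with three of the categorical features.
toy = pd.DataFrame({
    "Month": ["October", "October", "November"],
    "Weekday": ["Monday", "Tuesday", "Monday"],
    "month_wom": ["October_w1", "October_w2", "November_w1"],
})
CATEGORICAL = ["Month", "Weekday", "month_wom"]
dummies = pd.get_dummies(toy[CATEGORICAL], dtype="int32")

# Count dummy columns per source feature by matching the '<feature>_' prefix.
# The match is case-sensitive, so 'Month_' does not pick up 'month_wom_' columns.
counts = {
    feat: int(dummies.columns.str.startswith(feat + "_").sum())
    for feat in CATEGORICAL
}
print(counts)
# {'Month': 2, 'Weekday': 2, 'month_wom': 3}
```

The prefix match would over-count if one feature name were a case-identical prefix of another, so it is worth checking that the per-feature counts add up to the total number of dummy columns.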

In [137]:
PREDICT = ['is_positive_growth_30d_future']

In [138]:
df_with_dummies['Date'].min()
df_with_dummies.shape[0]

191795

In [139]:
# correlation between each dummy variable and the target
corr = df_with_dummies[DUMMIES+PREDICT].corr()['is_positive_growth_30d_future']
# keep only the month_wom dummies and take absolute values
abs_corr = corr[corr.index.str.startswith('month_wom_')].abs()
abs_corr.sort_values(ascending=False).head(5)

                        is_positive_growth_30d_future
month_wom_October_w4                         0.024968
month_wom_November_w3                        0.022097
month_wom_November_w2                        0.018822
month_wom_January_w2                         0.018327
month_wom_October_w3                         0.017734
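
Computing the full correlation matrix and then selecting one column does more work than needed. As a possible alternative, `DataFrame.corrwith` correlates each column with a single Series directly. The sketch below checks on random toy data (hypothetical, seeded for reproducibility) that both routes agree; the column name `target` is a stand-in for `is_positive_growth_30d_future`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a categorical feature and a random binary target.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "month_wom": rng.choice(["October_w4", "November_w3", "January_w2"], size=300),
    "target": rng.integers(0, 2, size=300),
})
dummies = pd.get_dummies(toy["month_wom"], prefix="month_wom", dtype="int32")

# Route 1: full correlation matrix, then pick the target column.
via_corr = pd.concat([dummies, toy["target"]], axis=1).corr()["target"].drop("target")

# Route 2: correlate each dummy column with the target only.
via_corrwith = dummies.corrwith(toy["target"])

# Absolute values sorted in descending order, as in steps 7 and 8.
abs_corr = via_corrwith.abs().sort_values(ascending=False)
print(abs_corr.round(3))
```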


---
### Question 2:  Define New "Hand" Rules on Macro and Technical Indicator Variables

**What is the precision score for the best of the NEW predictions (`pred3` or `pred4`), rounded to three decimal places?**

In this task, you'll apply insights from the **visualized decision tree (`clf10`)** (see *Code Snippet 5: 1.4.4 Visualisation*) to manually define and evaluate new predictive rules.


1. **Define two new 'hand' rules** based on branches that lead to 'positive' predictions in the tree:
   - `pred3_manual_dgs10_5`:  
     ```python
     (DGS10 <= 4) & (DGS5 <= 1)
     ```
   - `pred4_manual_dgs10_fedfunds`:  
     ```python
     (DGS10 > 4) & (FEDFUNDS <= 4.795)
     ```

2. **Extend Code Snippet 3** (Manual "hand rule" predictions):  
   - Implement and apply the above two rules (`pred3`, `pred4`) to your dataset.
   - Add the resulting predictions as new columns in your dataframe (e.g., `new_df`).

3. **Compute precision**:
   - For the rule that **does** make positive predictions on the TEST set, compute its **precision score**.
   - Use standard precision metrics (`TP / (TP + FP)`).
   - Round the precision score to **three decimal places**.  
     Example: If your result is `0.57897`, your final answer should be: `0.579`.
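
The precision formula can be sketched on a few made-up labels. The column names `actual` and `pred` are placeholders for the target and one rule's output; the values are illustrative, not homework results.

```python
import pandas as pd

# Hypothetical toy labels and rule outputs (1 = positive).
toy = pd.DataFrame({
    "actual": [1, 0, 1, 1, 0, 0, 1, 0],
    "pred":   [1, 1, 1, 0, 0, 1, 1, 0],
})

tp = ((toy["pred"] == 1) & (toy["actual"] == 1)).sum()  # predicted 1, actually 1
fp = ((toy["pred"] == 1) & (toy["actual"] == 0)).sum()  # predicted 1, actually 0

# Precision is undefined when the rule never predicts 1, hence the guard.
precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
print(round(precision, 3))
# 0.6
```

The guard matters for this task: a rule that makes no positive predictions on the TEST set has `TP + FP = 0`, so its precision cannot be computed.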

In [140]:
# Temporal split function based on the Date field
def temporal_split(df, min_date, max_date, train_prop=0.7, val_prop=0.15, test_prop=0.15):
    # Define the date intervals
    train_end = min_date + pd.Timedelta(days=(max_date - min_date).days * train_prop)
    val_end = train_end + pd.Timedelta(days=(max_date - min_date).days * val_prop)

    # Assign split labels based on date ranges
    split_labels = []
    for date in df['Date']:
        if date <= train_end:
            split_labels.append('train')
        elif date <= val_end:
            split_labels.append('validation')
        else:
            split_labels.append('test')

    # Add 'split' column to the DataFrame
    df['split'] = split_labels

    return df
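
The row-by-row loop above is easy to read but slow on a frame of this size. One possible vectorized variant is sketched below with `np.select`; the function name `temporal_split_vectorized` is hypothetical, it derives the date range from the frame itself, and it returns a copy instead of modifying its argument.

```python
import numpy as np
import pandas as pd

def temporal_split_vectorized(df, train_prop=0.7, val_prop=0.15):
    """Hypothetical helper: same date cut-offs as temporal_split, without the loop."""
    min_date, max_date = df["Date"].min(), df["Date"].max()
    total_days = (max_date - min_date).days
    train_end = min_date + pd.Timedelta(days=total_days * train_prop)
    val_end = train_end + pd.Timedelta(days=total_days * val_prop)

    out = df.copy()  # leave the caller's frame untouched
    # Conditions are checked in order; rows matching neither fall through to 'test'.
    out["split"] = np.select(
        [out["Date"] <= train_end, out["Date"] <= val_end],
        ["train", "validation"],
        default="test",
    )
    return out

# Toy usage on 101 consecutive days (made-up dates).
toy = pd.DataFrame({"Date": pd.date_range("2020-01-01", periods=101, freq="D")})
toy = temporal_split_vectorized(toy)
counts = toy["split"].value_counts()
print(counts)
```

As in the original function, the split is by calendar time rather than by row count, so the resulting shares only approximate 70/15/15 when rows are unevenly distributed over time.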

In [141]:
min_date = df_with_dummies['Date'].min()
max_date = df_with_dummies['Date'].max()
df_with_dummies = temporal_split(df_with_dummies, min_date = min_date, max_date = max_date)
df_with_dummies['split'].value_counts()/len(df_with_dummies)

split       count
train       0.676399
test        0.163758
validation  0.159843


In [142]:
# All Supported Ta-lib indicators: https://github.com/TA-Lib/ta-lib-python/blob/master/docs/funcs.md

TECHNICAL_INDICATORS = ['adx', 'adxr', 'apo', 'aroon_1','aroon_2', 'aroonosc',
 'bop', 'cci', 'cmo','dx', 'macd', 'macdsignal', 'macdhist', 'macd_ext',
 'macdsignal_ext', 'macdhist_ext', 'macd_fix', 'macdsignal_fix',
 'macdhist_fix', 'mfi', 'minus_di', 'mom', 'plus_di', 'dm', 'ppo',
 'roc', 'rocp', 'rocr', 'rocr100', 'rsi', 'slowk', 'slowd', 'fastk',
 'fastd', 'fastk_rsi', 'fastd_rsi', 'trix', 'ultosc', 'willr',
 'ad', 'adosc', 'obv', 'atr', 'natr', 'ht_dcperiod', 'ht_dcphase',
 'ht_phasor_inphase', 'ht_phasor_quadrature', 'ht_sine_sine', 'ht_sine_leadsine',
 'ht_trendmod', 'avgprice', 'medprice', 'typprice', 'wclprice']

TECHNICAL_PATTERNS = [g for g in df_full.keys() if g.find('cdl')>=0]

MACRO = ['gdppot_us_yoy', 'gdppot_us_qoq', 'cpi_core_yoy', 'cpi_core_mom', 'FEDFUNDS',
 'DGS1', 'DGS5', 'DGS10']

GROWTH = [g for g in df_full.keys() if (g.find('growth_')==0)&(g.find('future')<0)]

# CUSTOM_NUMERICAL = ['SMA10', 'SMA20', 'growing_moving_average', 'high_minus_low_relative','volatility', 'ln_volume']

In [143]:
INDICATOR = ['DGS10','DGS5','FEDFUNDS']
SPLIT = ['split']
PREDICT = ['is_positive_growth_30d_future','growth_30d','growth_snp500_30d']

In [153]:
# NUMERICAL = GROWTH + TECHNICAL_INDICATORS + TECHNICAL_PATTERNS + MACRO #+ CUSTOM_NUMERICAL

In [158]:
# .copy() so that the prediction columns added below do not raise SettingWithCopyWarning
new_df = df_with_dummies[['Date']+INDICATOR+PREDICT+SPLIT].copy()

In [159]:
new_df['pred3_manual_dgs10_5'] = np.where((new_df['DGS10'] <= 4) & (new_df['DGS5'] <= 1),1,0)
new_df['pred4_manual_dgs10_fedfunds'] = np.where((new_df['DGS10'] > 4) & (new_df['FEDFUNDS'] <= 4.795),1,0)

In [160]:
# True where the rule's prediction matches the actual outcome
new_df['pred3_validation'] = (new_df['pred3_manual_dgs10_5']==new_df['is_positive_growth_30d_future'])
new_df['pred4_validation'] = (new_df['pred4_manual_dgs10_fedfunds']==new_df['is_positive_growth_30d_future'])

In [161]:
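# precision score on the TEST set: TP / (TP + FP), where TP + FP is the number of positive predictions
new_df_test=new_df[new_df.split=='test']
# true positives: the rule predicts 1 and the actual outcome is 1
# (`&` binds tighter than `==`, so each comparison is parenthesized explicitly)
pred3_tp = ((new_df_test['pred3_manual_dgs10_5']==1) & (new_df_test['is_positive_growth_30d_future']==1)).sum()
pred4_tp = ((new_df_test['pred4_manual_dgs10_fedfunds']==1) & (new_df_test['is_positive_growth_30d_future']==1)).sum()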
print(f"rule dgs10 <=4 and dgs5 <=1: {(pred3_tp/(new_df_test['pred3_manual_dgs10_5']==1).sum()):.3f}")
print(f"rule dgs10 >4 and fedfunds <=4.795: {(pred4_tp/(new_df_test['pred4_manual_dgs10_fedfunds']==1).sum()):.3f}")

rule dgs10 <=4 and dgs5 <=1: 0.580
rule dgs10 >4 and fedfunds <=4.795: 0.466
