In this homework, we're going to work with categorical variables, first ML models (Decision Trees), and hyperparameter tuning.

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score

In [2]:
# !gdown "https://drive.google.com/file/d/1mb0ae2M5AouSDlqcUnIwaHq7avwGNrmB/view?usp=sharing" --fuzzy -O ./stocks_df_combined_2025_06_13.parquet.brotli

In [3]:
df_full = pd.read_parquet("./stocks_df_combined_2025_06_13.parquet.brotli")

In [4]:
df_full['ln_volume'] = df_full.Volume.apply(lambda x: np.log(x + 1e-8))
df = df_full[df_full.Date>='2000-01-01']
df.loc[:,'Month'] = df.Month.dt.strftime('%B')
df.loc[:,'Weekday'] = df.Weekday.astype(str)

  df.loc[:,'Month'] = df.Month.dt.strftime('%B')
  df.loc[:,'Weekday'] = df.Weekday.astype(str)


### Question 1: Dummies for Month and Week-of-Month
What is the ABSOLUTE CORRELATION VALUE of the most correlated dummy variable _w week_of_month with the binary outcome is_positive_growth_30d_future?

From the correlation analysis and modeling, you may have observed that October and November are potentially important seasonal months. In this task, you'll go further by generating dummy variables for both the Month and Week-of-Month (starting from 1). For example, the first week of October should be coded as: 'October_w1'.

Once you've generated these new variables, identify the one with the highest absolute correlation with is_positive_growth_30d_future, and round the result to three decimal places.

Suggested Steps:
* Use this to compute the week of the month using the following formula: (d.day - 1) // 7 + 1
* Create a new string variable that combines the month name and week of the month. Example: 'October_w1', 'November_w2', etc.
* Add the new variable (e.g., month_wom) to your set of categorical features.
* Your updated categorical feature list should include:
    - 'Month'
    - 'Weekday'
    - 'Ticker'
    - 'ticker_type'
    - 'month_wom'
* Use pandas.get_dummies() to generate dummy variables for all categorical features.
This should result in approximately 115 dummy variables, including around 60 for the month_wom feature (12 months × up to 5 weeks).
* Use DataFrame.corr() to compute the correlation between each feature and the target variable is_positive_growth_30d_future.
* Filter the correlation results to include only the dummy variables generated from month_wom.
* Create a new column named abs_corr in the correlation results that stores the absolute value of the correlations.
* Sort the correlation results by abs_corr in descending order.
* Identify and report the highest absolute correlation value among the month_wom dummy variables, rounded to three decimal places.

NOTE: new dummies will be used as features in the next tasks, please leave them in the dataset.

In [5]:
df['week_of_month'] = (df['Date'].dt.day - 1) // 7 + 1
df['month_wom'] = df['Month'] + '_w' + df['week_of_month'].astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['week_of_month'] = (df['Date'].dt.day - 1) // 7 + 1
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['month_wom'] = df['Month'] + '_w' + df['week_of_month'].astype(str)


In [6]:
df['month_wom'].sample(3)

5218        May_w3
2170      March_w1
3282    October_w2
Name: month_wom, dtype: object

In [7]:
CATEGORICAL = ['Month', 'Weekday', 'Ticker', 'ticker_type', 'month_wom']

In [8]:
df_dummies = pd.get_dummies(df[CATEGORICAL], columns=CATEGORICAL)

In [9]:
df_dummies.shape

(191795, 115)

In [10]:
df_dummies[[i for i in df_dummies.columns if '_w' in i]].shape

(191795, 60)

In [11]:
df = pd.concat([df, df_dummies], axis=1)

In [12]:
corr_is_positive_growth_30d_future = df[df_dummies.columns.tolist() + ['is_positive_growth_30d_future']].corr()['is_positive_growth_30d_future']
corr_is_positive_growth_30d_future_df = pd.DataFrame(corr_is_positive_growth_30d_future)

In [13]:
corr_is_positive_growth_30d_future_df[corr_is_positive_growth_30d_future_df.index.str.startswith('month_wom')]\
        .sort_values(by='is_positive_growth_30d_future').head()

Unnamed: 0,is_positive_growth_30d_future
month_wom_January_w2,-0.018327
month_wom_January_w5,-0.017437
month_wom_January_w3,-0.016737
month_wom_February_w1,-0.0167
month_wom_January_w4,-0.015362


In [14]:
corr_is_positive_growth_30d_future_df[corr_is_positive_growth_30d_future_df.index.str.startswith('month_wom')]\
        .sort_values(by='is_positive_growth_30d_future').tail()

Unnamed: 0,is_positive_growth_30d_future
month_wom_September_w4,0.013558
month_wom_October_w3,0.017734
month_wom_November_w2,0.018822
month_wom_November_w3,0.022097
month_wom_October_w4,0.024968


### Question 2: Define New "Hand" Rules on Macro and Technical Indicator Variables
What is the precision score for the best of the NEW predictions (pred3 or pred4), rounded to 3 digits after the comma?

In this task, you'll apply insights from the visualized decision tree (clf10) to manually define and evaluate new predictive rules.

* Define two new 'hand' rules based on branches that lead to 'positive' predictions in the tree:
    - pred3_manual_dgs10_5:\
    (DGS10 <= 4) & (DGS5 <= 1)
    - pred4_manual_dgs10_fedfunds:\
    (DGS10 > 4) & (FEDFUNDS <= 4.795)\
Hint: This is not exactly the same condition as in the estimated tree (original: (DGS10 <= 4.825) & (DGS5 <= 0.745); (DGS10 > 4.825) & (FEDFUNDS <= 4.795)), since in that case, there are no true positive predictions for both variables. Consider why this might be the case.
* Extend Manual "hand rule" predictions:
    - Implement and apply the above two rules (pred3, pred4) to your dataset.
    - Add the resulting predictions as new columns in your dataframe (e.g., new_df)
* Compute precision:
    - For the rule that does make positive predictions on the TEST set, compute its precision score.
    - Use standard precision metrics (TP / (TP + FP)).
    - Round the precision score to three decimal places.\
Example: If your result is 0.57897, your final answer should be: 0.579.

In [15]:
def temporal_split(df, min_date, max_date, train_prop=0.7, val_prop=0.15, test_prop=0.15):
    """
    Splits a DataFrame into three buckets based on the temporal order of the 'Date' column.

    Args:
        df (DataFrame): The DataFrame to split.
        min_date (str or Timestamp): Minimum date in the DataFrame.
        max_date (str or Timestamp): Maximum date in the DataFrame.
        train_prop (float): Proportion of data for training set (default: 0.6).
        val_prop (float): Proportion of data for validation set (default: 0.2).
        test_prop (float): Proportion of data for test set (default: 0.2).

    Returns:
        DataFrame: The input DataFrame with a new column 'split' indicating the split for each row.
    """
    train_end = min_date + pd.Timedelta(days=(max_date - min_date).days * train_prop)
    val_end = train_end + pd.Timedelta(days=(max_date - min_date).days * val_prop)

    split_labels = []
    for date in df['Date']:
        if date <= train_end:
            split_labels.append('train')
        elif date <= val_end:
            split_labels.append('validation')
        else:
            split_labels.append('test')

    df['split'] = split_labels

    return df

In [16]:
min_date_df = df.Date.min()
max_date_df = df.Date.max()

df = temporal_split(df,
                        min_date = min_date_df,
                        max_date = max_date_df)

In [17]:
df['split'].value_counts(normalize=True)

split
train         0.676399
test          0.163758
validation    0.159843
Name: proportion, dtype: float64

In [18]:
df['pred0_manual_cci'] = (df.cci>200).astype(int)
df['pred1_manual_prev_g1'] = (df.growth_30d>1).astype(int)
df['pred2_manual_prev_g1_and_snp'] = ((df['growth_30d'] > 1) & (df['growth_snp500_30d'] > 1)).astype(int)

In [19]:
df['pred3_manual_dgs10_5'] = ((df['DGS10'] <= 4) & (df['DGS5'] <= 1)) * 1
df['pred4_manual_dgs10_fedfunds'] = ((df['DGS10'] > 4) & (df['FEDFUNDS'] <= 4.795)) * 1

In [20]:
real_target = df[df['split'] == 'test']['is_positive_growth_30d_future']
pred_target = df[df['split'] == 'test']['pred3_manual_dgs10_5']
tp = ((pred_target == 1) & (pred_target == real_target)).sum()
fp = ((pred_target == 1) & (pred_target != real_target)).sum()
precision = tp / (tp+fp)
precision

np.float64(0.5797392176529589)

In [21]:
real_target = df[df['split'] == 'test']['is_positive_growth_30d_future']
pred_target = df[df['split'] == 'test']['pred4_manual_dgs10_fedfunds']
tp = ((pred_target == 1) & (pred_target == real_target)).sum()
fp = ((pred_target == 1) & (pred_target != real_target)).sum()
precision = tp / (tp+fp)
precision

np.float64(0.4664310954063604)

### Question 3: Unique Correct Predictions from a 10-Level Decision Tree Classifier (pred5_clf_10)
What is the total number of records in the TEST dataset where the new prediction pred5_clf_10 is correct, while all 'hand' rule predictions (pred0 to pred4) are incorrect?

To ensure reproducibility, please include the following parameter in the Decision Tree Classifier:\
clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42) 
* Step 1: Train the Decision Tree and Generate Predictions
    - Initialize a Decision Tree Classifier with a maximum depth of 10 and set random_state=42 for reproducibility.
    - Fit the classifier on the combined TRAIN and VALIDATION datasets.
    - Use the trained model to predict on the entire dataset (TRAIN + VALIDATION + TEST).
    - Store these predictions in a new column named pred5_clf_10 within your main dataframe.
* Step 2: Identify Unique Correct Predictions by pred5_clf_10
    - Create a new boolean column, only_pred5_is_correct, that is True only when:
        - The prediction from pred5_clf_10 is correct (i.e., matches the true label).
        - All other hand rule predictions (pred0 through pred4) are incorrect.
* Step 3: Count Unique Correct Predictions on the TEST Set
    - Convert the only_pred5_is_correct column from boolean to integer.
    - Filter the dataframe for records belonging to the TEST dataset.
    - Count how many records in the TEST set have only_pred5_is_correct equal to 1.
* Advanced (Optional)
    - To generalize this for many prediction columns (e.g., pred0 to pred99), define a function that can be applied to an entire dataframe row.
    - This function should identify whether a specific prediction (predX) is uniquely correct (correct while all others are incorrect).\
    This approach avoids hardcoding conditions for each predictor and scales easily.\
    For examples of how to apply functions to rows in pandas, see this helpful resource:
    Pandas apply function to every row


In [35]:
GROWTH = [g for g in df_full.keys() if (g.find('growth_')==0)&(g.find('future')<0)]
TECHNICAL_INDICATORS = ['adx', 'adxr', 'apo', 'aroon_1','aroon_2', 'aroonosc',
 'bop', 'cci', 'cmo','dx', 'macd', 'macdsignal', 'macdhist', 'macd_ext',
 'macdsignal_ext', 'macdhist_ext', 'macd_fix', 'macdsignal_fix',
 'macdhist_fix', 'mfi', 'minus_di', 'mom', 'plus_di', 'dm', 'ppo',
 'roc', 'rocp', 'rocr', 'rocr100', 'rsi', 'slowk', 'slowd', 'fastk',
 'fastd', 'fastk_rsi', 'fastd_rsi', 'trix', 'ultosc', 'willr',
 'ad', 'adosc', 'obv', 'atr', 'natr', 'ht_dcperiod', 'ht_dcphase',
 'ht_phasor_inphase', 'ht_phasor_quadrature', 'ht_sine_sine', 'ht_sine_leadsine',
 'ht_trendmod', 'avgprice', 'medprice', 'typprice', 'wclprice']
TECHNICAL_PATTERNS = [g for g in df_full.keys() if g.find('cdl')>=0]
CUSTOM_NUMERICAL = ['SMA10', 'SMA20', 'growing_moving_average', 'high_minus_low_relative','volatility', 'ln_volume']
MACRO = ['gdppot_us_yoy', 'gdppot_us_qoq', 'cpi_core_yoy', 'cpi_core_mom', 'FEDFUNDS',
 'DGS1', 'DGS5', 'DGS10']
NUMERICAL = GROWTH + TECHNICAL_INDICATORS + TECHNICAL_PATTERNS + CUSTOM_NUMERICAL + MACRO
DUMMIES = df_dummies.keys().to_list()

features_list = NUMERICAL+DUMMIES
print('Shape of the features df', df[features_list].shape)
to_predict = 'is_positive_growth_30d_future'

train_df = df[df.split.isin(['train','validation'])].copy(deep=True)
val_df = df[df.split.isin(['validation'])].copy(deep=True)
test_df = df[df.split.isin(['test'])].copy(deep=True)

X_train = train_df[features_list+[to_predict,'Date','Ticker']]
X_val = val_df[features_list+[to_predict,'Date','Ticker']]
X_test = test_df[features_list+[to_predict,'Date','Ticker']]

X_train.replace([np.inf, -np.inf], np.nan, inplace=True)
X_val.replace([np.inf, -np.inf], np.nan, inplace=True)
X_test.replace([np.inf, -np.inf], np.nan, inplace=True)
X_train.fillna(0, inplace=True)
X_val.fillna(0, inplace=True)
X_test.fillna(0, inplace=True)

print(f'length: X_train {X_train.shape},  X_test {X_test.shape}')

y_train = X_train[to_predict]
y_val = X_val[to_predict]
y_test = X_test[to_predict]

# remove y_train, y_test from X_ dataframes
del X_train[to_predict]
del X_val[to_predict]
del X_test[to_predict]

Shape of the features df (191795, 299)
length: X_train (160387, 302),  X_test (31408, 302)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train.replace([np.inf, -np.inf], np.nan, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_val.replace([np.inf, -np.inf], np.nan, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test.replace([np.inf, -np.inf], np.nan, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  

In [36]:
clf = DecisionTreeClassifier(max_depth=10, random_state=42)
clf.fit(X_train.drop(['Date','Ticker'],axis=1), y_train)

0,1,2
,criterion,'gini'
,splitter,'best'
,max_depth,10
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,
,random_state,42
,max_leaf_nodes,
,min_impurity_decrease,0.0


In [37]:
full_df_for_predict = pd.concat([X_train.drop(['Date','Ticker'],axis=1), X_test.drop(['Date','Ticker'],axis=1)])
pred5_clf_10 = clf.predict(full_df_for_predict)
df['pred5_clf_10'] = pred5_clf_10

In [38]:
real_target = 'is_positive_growth_30d_future'
preds_cols = ['pred0_manual_cci',
       'pred1_manual_prev_g1', 'pred2_manual_prev_g1_and_snp',
       'pred3_manual_dgs10_5', 'pred4_manual_dgs10_fedfunds', 'pred5_clf_10']
for idx, col in enumerate(preds_cols):
    df[f'is_correct_pred_{idx}'] = (df[col] == df[real_target]).astype(int)

In [39]:
df['only_pred5_is_correct'] = ((df['is_correct_pred_5'] == 1) 
    & (df['is_correct_pred_0'] != 1) 
    & (df['is_correct_pred_1'] != 1)
    & (df['is_correct_pred_2'] != 1)
    & (df['is_correct_pred_3'] != 1)
    & (df['is_correct_pred_4'] != 1))

In [40]:
df[(df['split'] == 'test') & (df['only_pred5_is_correct'] == 1)].shape[0]

4178

### Question 4: Hyperparameter tuning for a Decision Tree
What is the optimal tree depth (from 1 to 20) for a DecisionTreeClassifier?

NOTE: please include random_state=42 to the Decision Tree Classifier initialization (e.g., clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)) to ensure consistency in results.

Instructions:
* Iterate through max_depth values from 1 to 20.\
For each max_depth:
    - Train a Decision Tree Classifier with the current max_depth on the combined TRAIN+VALIDATION dataset.
    - Optionally, visualize how the 'head' (top levels) of each fitted tree changes with increasing tree depth. You can use:
sklearn.tree.plot_tree() for graphical visualization, or
The compact textual approach with export_text() function. For example:
```python
from sklearn.tree import export_text
tree_rules = export_text(model, feature_names=list(X_train), max_depth=3)
print(tree_rules)
```
* Calculate the precision score on the TEST dataset for each fitted tree. You may also track precision on the VALIDATION dataset to observe signs of overfitting.
* Identify the optimal max_depth where the precision score on the TEST dataset is highest. This value is your best_max_depth.
* Using best_max_depth, retrain the Decision Tree Classifier on the combined TRAIN+VALIDATION set.
* Predict on the entire dataset (TRAIN + VALIDATION + TEST) and add the predictions as a new column pred6_clf_best in your dataframe new_df.
* Compare the precision score of the tuned tree with previous predictions (pred0 to pred5). You should observe an improvement, ideally achieving precision > 0.58, indicating the tuned tree outperforms earlier models.
* Advanced (Optional)
    - Plot the precision (or accuracy) scores against the max_depth values to detect saturation or overfitting trends.
    - Observe the trade-off between model complexity (deeper trees) and generalization capability.


In [30]:
max_depth_scores = {}
for max_depth in range(1,21):
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    clf.fit(X_train.drop(['Date','Ticker'],axis=1), y_train)
    y_pred_val = clf.predict(X_val.drop(['Date','Ticker'],axis=1))
    y_pred_test = clf.predict(X_test.drop(['Date','Ticker'],axis=1))
    pr_val = round(precision_score(y_val, y_pred_val), 3)
    pr_test = round(precision_score(y_test, y_pred_test), 3)
    max_depth_scores[max_depth] = pr_test
    print('depth:', max_depth, 'presision val:', pr_val, 'presision test:', pr_test)

depth: 1 presision val: 0.64 presision test: 0.547
depth: 2 presision val: 0.64 presision test: 0.551
depth: 3 presision val: 0.64 presision test: 0.551
depth: 4 presision val: 0.644 presision test: 0.551
depth: 5 presision val: 0.718 presision test: 0.628
depth: 6 presision val: 0.712 presision test: 0.569
depth: 7 presision val: 0.717 presision test: 0.594
depth: 8 presision val: 0.741 presision test: 0.59
depth: 9 presision val: 0.76 presision test: 0.586
depth: 10 presision val: 0.759 presision test: 0.589
depth: 11 presision val: 0.788 presision test: 0.591
depth: 12 presision val: 0.795 presision test: 0.578
depth: 13 presision val: 0.81 presision test: 0.587
depth: 14 presision val: 0.825 presision test: 0.582
depth: 15 presision val: 0.857 presision test: 0.587
depth: 16 presision val: 0.877 presision test: 0.59
depth: 17 presision val: 0.885 presision test: 0.598
depth: 18 presision val: 0.906 presision test: 0.585
depth: 19 presision val: 0.932 presision test: 0.59
depth: 20 

In [31]:
sorted(max_depth_scores.items(), key=lambda x: -x[1])[0]

(5, 0.628)

In [32]:
clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf.fit(X_train.drop(['Date','Ticker'],axis=1), y_train)
pred6_clf_best = clf.predict(full_df_for_predict)
df['pred6_clf_best'] = pred6_clf_best

In [33]:
real_target = 'is_positive_growth_30d_future'
df['is_correct_pred_6'] = df['pred6_clf_best'] == df[real_target]
df['only_pred6_is_correct'] = ((df['is_correct_pred_6'] == 1) 
    & (df['is_correct_pred_0'] != 1) 
    & (df['is_correct_pred_1'] != 1)
    & (df['is_correct_pred_2'] != 1)
    & (df['is_correct_pred_3'] != 1)
    & (df['is_correct_pred_4'] != 1)
    & (df['is_correct_pred_5'] != 1))
df[(df['split'] == 'test') & (df['only_pred6_is_correct'] == 1)].shape[0]

665

In [34]:
from sklearn.tree import export_text
tree_rules = export_text(clf, feature_names=list(X_train.drop(['Date','Ticker'],axis=1)), max_depth=5)
print(tree_rules)

|--- DGS10 <= 4.83
|   |--- DGS5 <= 0.75
|   |   |--- growth_dax_90d <= 1.13
|   |   |   |--- growth_365d <= 1.48
|   |   |   |   |--- growth_gold_365d <= 1.40
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- growth_gold_365d >  1.40
|   |   |   |   |   |--- class: 1
|   |   |   |--- growth_365d >  1.48
|   |   |   |   |--- typprice <= 537.39
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- typprice >  537.39
|   |   |   |   |   |--- class: 0
|   |   |--- growth_dax_90d >  1.13
|   |   |   |--- obv <= 663286816.00
|   |   |   |   |--- growth_epi_7d <= 1.01
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- growth_epi_7d >  1.01
|   |   |   |   |   |--- class: 1
|   |   |   |--- obv >  663286816.00
|   |   |   |   |--- trix <= 0.53
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- trix >  0.53
|   |   |   |   |   |--- class: 1
|   |--- DGS5 >  0.75
|   |   |--- DGS1 <= 4.06
|   |   |   |--- growth_gold_365d <= 1.27
|   |   |   |   |--- growth_brent_oil_365d <= 1.36


### [EXPLORATORY] Question 5: What data is missing?
Now that you have gained insights from the correlation analysis and Decision Tree results regarding the most influential variables, suggest new indicators you would like to include in the dataset and explain your reasoning.

Alternatively, you may propose a completely different approach based on your intuition, provided it remains relevant to the shared dataset of the largest stocks from India, the EU, and the US. If you choose this route, please also specify the data source.

DGS10; DGS5; DGS1; FEDFUNDS; growth 90 and 365 days are most important

we can also add Exchange Rates