# Statistical and Machine Learning Models for Macroeconomic Data

This notebook is a tool for investors interested in Brazilian macroeconomic data. It integrates machine learning techniques and statistical models to analyze macroeconomic indicators together with the fundamental data of companies listed on the stock exchange. The aim is to provide in-depth analysis and facilitate investment decision-making, focusing on identifying opportunities and mitigating risks. It includes interactive visualizations and real-time updates, making it accessible and practical for both experienced investors and beginners.

## Initial Setup

### Install Packages

In [1]:
%pip install pandas -q
%pip install plotly -q
%pip install scikit-learn -q

Note: you may need to restart the kernel to use updated packages.


### Import libs

In [2]:
import os
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import plotly.graph_objects as go

### Set Default File Paths

In [3]:
file_path_book = str(Path(os.getcwd()).parent.parent / "data/book")
file_path_scored = str(Path(os.getcwd()).parent.parent / "data/scored_base")

### Pandas Config

In [4]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### Load data

In [5]:
df_macroeconomic_book = pd.read_csv(file_path_book + "/macroeconomic_book.csv")
df_macroeconomic_book.head(5)

Unnamed: 0,date,selic,confidence,pib,incc,ipca,dolar,monthly_inflation,gdp_growth,dollar_growth,real_interest_rate,inflation_confidence_difference
0,2019-01-31,6.5,128.64,578214.5,0.49,3.78,3.6513,-0.111948,-0.015756,0.016459,2.72,-124.86
1,2019-02-28,6.5,139.39,576089.7,0.09,3.89,3.7379,0.029101,-0.003675,0.023718,2.61,-135.5
2,2019-03-31,6.5,125.53,601749.8,0.31,4.58,3.8961,0.177378,0.044542,0.042323,1.92,-120.95
3,2019-04-30,6.5,121.71,612918.4,0.38,4.94,3.9447,0.078603,0.01856,0.012474,1.56,-116.77
4,2019-05-31,6.5,117.01,615304.9,0.03,4.66,3.9401,-0.05668,0.003894,-0.001166,1.84,-112.35


In [6]:
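# 'number' already covers every integer and float dtype; copy() makes the later column assignments act on an independent frame
df_macroeconimic_numeric_cols = df_macroeconomic_book.select_dtypes(include='number').copy()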
df_macroeconimic_numeric_cols.head()

Unnamed: 0,selic,confidence,pib,incc,ipca,dolar,monthly_inflation,gdp_growth,dollar_growth,real_interest_rate,inflation_confidence_difference
0,6.5,128.64,578214.5,0.49,3.78,3.6513,-0.111948,-0.015756,0.016459,2.72,-124.86
1,6.5,139.39,576089.7,0.09,3.89,3.7379,0.029101,-0.003675,0.023718,2.61,-135.5
2,6.5,125.53,601749.8,0.31,4.58,3.8961,0.177378,0.044542,0.042323,1.92,-120.95
3,6.5,121.71,612918.4,0.38,4.94,3.9447,0.078603,0.01856,0.012474,1.56,-116.77
4,6.5,117.01,615304.9,0.03,4.66,3.9401,-0.05668,0.003894,-0.001166,1.84,-112.35


## Models

### Creating trend columns (Target)

In [7]:
for column in df_macroeconimic_numeric_cols.columns:
    # 1 when the value rose versus the previous row, otherwise 0 (flat or falling).
    # diff() is NaN on the first row and NaN > 0 is False, so no fillna is needed.
    df_macroeconimic_numeric_cols[column + '_trend'] = (df_macroeconimic_numeric_cols[column].diff() > 0).astype(int)

df_macroeconimic_numeric_cols.head()

Unnamed: 0,selic,confidence,pib,incc,ipca,dolar,monthly_inflation,gdp_growth,dollar_growth,real_interest_rate,inflation_confidence_difference,selic_trend,confidence_trend,pib_trend,incc_trend,ipca_trend,dolar_trend,monthly_inflation_trend,gdp_growth_trend,dollar_growth_trend,real_interest_rate_trend,inflation_confidence_difference_trend
0,6.5,128.64,578214.5,0.49,3.78,3.6513,-0.111948,-0.015756,0.016459,2.72,-124.86,0,0,0,0,0,0,0,0,0,0,0
1,6.5,139.39,576089.7,0.09,3.89,3.7379,0.029101,-0.003675,0.023718,2.61,-135.5,0,1,0,0,1,1,1,1,1,0,0
2,6.5,125.53,601749.8,0.31,4.58,3.8961,0.177378,0.044542,0.042323,1.92,-120.95,0,0,1,1,1,1,1,1,1,0,1
3,6.5,121.71,612918.4,0.38,4.94,3.9447,0.078603,0.01856,0.012474,1.56,-116.77,0,0,1,1,1,1,0,0,0,0,1
4,6.5,117.01,615304.9,0.03,4.66,3.9401,-0.05668,0.003894,-0.001166,1.84,-112.35,0,0,1,0,0,0,0,0,0,1,1
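
To make the target definition concrete, here is a minimal sketch of the same `diff() > 0` rule on a toy series. The values are hypothetical and are not taken from `macroeconomic_book.csv`.

```python
import pandas as pd

# Hypothetical toy series for illustration only.
selic = pd.Series([6.5, 6.5, 6.0, 6.0, 6.25], name="selic")

# diff() compares each row with the previous one; the first row has no
# predecessor, so its difference is NaN and the comparison is False.
selic_trend = (selic.diff() > 0).astype(int)

print(pd.DataFrame({"selic": selic, "selic_trend": selic_trend}))
```

Only the last row is flagged as `1`. Rows where the rate is unchanged or lower both receive `0`, so the class later labelled `Descend` also contains months in which the rate simply stayed flat.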


### Feature Selection and Data Splitting

In [8]:
X = df_macroeconimic_numeric_cols.drop([col for col in df_macroeconimic_numeric_cols.columns if 'trend' in col], axis=1)
y = df_macroeconimic_numeric_cols['selic_trend']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

#### Distribution of the Target Variable

In [9]:
fig = px.histogram(df_macroeconimic_numeric_cols, x='selic_trend', title='Distribution of the Target Variable (selic_trend)', color_discrete_sequence=['rgb(100, 195, 181)'])
fig.update_layout(template="plotly_dark")

fig.show()

- The histogram shows the distribution of the target variable **`selic_trend`**, that is, the number of observations (**`count`**) for each of its values.


- Most observations fall at **`selic_trend`** = **0**.
- The **`count`** at **`selic_trend`** = **0** is **44**, the highest frequency in this distribution.

- The remaining observations fall at **`selic_trend`** = **1**.
- The **`count`** at **`selic_trend`** = **1** is **12**, about 21% of the 56 observations.

#### General Observations:
- **`selic_trend`** is binary: **1** marks a month in which the Selic rate rose relative to the previous month, and **0** marks a month in which it stayed flat or fell.
- With 44 observations against 12, the target is clearly imbalanced toward **0**.


#### Comparison between Training and Testing

In [10]:
colors = ['#7fdbda', '#1f77b4']

train_test_comparison = pd.DataFrame({'selic_trend': np.concatenate([y_train, y_test]), 'Dataset': ['Train'] * len(y_train) + ['Test'] * len(y_test)})

# color_discrete_sequence assigns one color per Dataset value (Train, Test)
fig = px.histogram(train_test_comparison, x='selic_trend', color='Dataset', barmode='overlay', title='Comparison between Training and Testing Sets for selic_trend', template='plotly_dark', color_discrete_sequence=colors)
fig.update_traces(opacity=0.6)

fig.show()

- The chart is built from a DataFrame with two columns: `selic_trend` and `Dataset`.
- Each row of that DataFrame is one record, with its `selic_trend` value and the split it belongs to (`Train` or `Test`).

**Summary of `selic_trend` Values**:

- The `selic_trend` column contains binary values: **0** or **1**.

**Distribution in the Training Set**:

- The value **0** appears frequently in the `Train` dataset.
- The value **1** also appears in the `Train` dataset but with less frequency.

**Distribution in the Testing Set**:

- Because the split uses `test_size=0.5`, the `Test` dataset holds the same number of records as the `Train` dataset (28 each of the 56 rows).
- Similar to the `Train` dataset, **0** is more prevalent than **1** in the `Test` dataset.

**Counts of `selic_trend` Values**:

- In the `Train` dataset, `selic_trend` = **0** appears **20** times and `selic_trend` = **1** appears **8** times (the remainder of the overall 44 and 12 once the test rows are set aside).
- In the `Test` dataset, `selic_trend` = **0** appears **24** times and `selic_trend` = **1** appears **4** times, matching the support column of the classification report.

**General Observation**:

- The dataset appears to be imbalanced with more occurrences of `selic_trend` = **0** compared to `selic_trend` = **1**.
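
With so few positive cases, a purely random split can leave the two halves with different class ratios. One common option is a stratified split. The sketch below uses synthetic stand-in data with the same 44/12 class counts; `X_demo` and `y_demo` are hypothetical names, not the notebook's frames.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 56 rows, 44 zeros and 12 ones.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(56, 3))
y_demo = np.array([0] * 44 + [1] * 12)

# stratify=y_demo asks for the same class ratio in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42, stratify=y_demo
)

print("train counts:", np.bincount(y_tr))
print("test counts: ", np.bincount(y_te))
```

Stratification only balances the class ratio between the halves. It does not address the fact that the rows are consecutive months, so a time-ordered hold-out may still be worth considering for this kind of data.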


### Model Selection and Cross-Validation

In [11]:
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression()
}

scores = {}

for name, model in models.items():
    
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    scores[name] = cv_scores
    print(f"{name} - Mean Cross-Validation Accuracy: {np.mean(cv_scores):.4f}")

fig = px.box(pd.DataFrame(scores), y=["Random Forest", "SVM", "Logistic Regression"], color_discrete_sequence=['rgb(100, 195, 181)'])
fig.update_layout(title_text="Comparison of Cross-Validation Accuracy", template="plotly_dark")

fig.show()


Random Forest - Mean Cross-Validation Accuracy: 0.8267
SVM - Mean Cross-Validation Accuracy: 0.7200
Logistic Regression - Mean Cross-Validation Accuracy: 0.7600


**`Analysis of Cross-Validation Accuracy Comparison`**

**`Random Forest`**

- Average cross-validation accuracy: **0.8267**

**`SVM (Support Vector Machine)`**

- Average cross-validation accuracy: **0.7200**

**`Logistic Regression`**

- Average cross-validation accuracy: **0.7600**

**`Visual Representation`**

- The `Random Forest` model shows a higher median accuracy and smaller interquartile range, indicating more consistent results across cross-validation folds.

- The `SVM` model displays the lowest median accuracy and a larger interquartile range, suggesting more variability in the validation accuracy.

- The `Logistic Regression` model has a median accuracy that lies between the Random Forest and SVM models, with a similar interquartile range to the SVM.
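
The features span very different magnitudes (`pib` is in the hundreds of thousands, the growth columns are near zero), and both SVM and Logistic Regression are sensitive to feature scale. The sketch below shows one way to standardize inside the cross-validation loop. It runs on synthetic data (`X_demo`, `y_demo` are hypothetical), and `max_iter=1000` is an assumed setting, so it says nothing about how the ranking above would change.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: one large-scale feature and one small-scale feature.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 20 + [1] * 8)
X_demo = np.column_stack([
    rng.normal(600_000, 20_000, size=28),
    rng.normal(0.0, 0.02, size=28) + 0.03 * y_demo,
])

# The scaler is refit on each training fold, so no test-fold statistics leak in.
scaled_models = {
    "SVM (scaled)": make_pipeline(StandardScaler(), SVC()),
    "Logistic Regression (scaled)": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv)
    for name, model in scaled_models.items()
}

for name, fold_scores in demo_scores.items():
    print(f"{name} - Mean Cross-Validation Accuracy: {fold_scores.mean():.4f}")
```

Tree ensembles such as Random Forest do not depend on feature scale, which may be part of why it led the unscaled comparison.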


### Optimization of the Best Model

In [12]:
model = RandomForestClassifier(random_state=19051992)

param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}


grid_search = GridSearchCV(model, param_grid, cv=5, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)

Fitting 5 folds for each of 360 candidates, totalling 1800 fits
Best Parameters: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 10}
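
`GridSearchCV` ranks candidates by plain accuracy for classifiers unless a `scoring` argument is given, and accuracy rewards predicting the majority class. A hedged sketch of the same search driven by balanced accuracy follows. It uses a deliberately small grid and synthetic data (`X_demo`, `y_demo`, `small_grid` are hypothetical), so the selected parameters are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical imbalanced data: 20 zeros and 8 ones.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 20 + [1] * 8)
X_demo = rng.normal(size=(28, 4)) + y_demo[:, None]

# A reduced grid to keep the example fast.
small_grid = {"n_estimators": [10, 50], "min_samples_split": [2, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    small_grid,
    scoring="balanced_accuracy",  # mean of per-class recall
    cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("Best Parameters:", search.best_params_)
print(f"Best balanced accuracy: {search.best_score_:.4f}")
```

Other built-in scorers such as `"f1"` could be substituted, depending on whether missed increases or false alarms matter more.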


### Model Evaluation

In [13]:
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
fig = px.imshow(cm, text_auto=True, x=['Descend', 'Ascend'], y=['Descend', 'Ascend'], color_continuous_scale='Blues')
fig.update_layout(title="Confusion Matrix", xaxis_title="Prediction", yaxis_title="Actual")
fig.show()


              precision    recall  f1-score   support

           0       0.95      0.88      0.91        24
           1       0.50      0.75      0.60         4

    accuracy                           0.86        28
   macro avg       0.73      0.81      0.76        28
weighted avg       0.89      0.86      0.87        28



**`Confusion Matrix`**

- The confusion matrix shows the following counts, which match the classification report above (support **24** and **4**):

  - True Negative (actual `Descend` / predicted `Descend`): **21**

  - False Positive (actual `Descend` / predicted `Ascend`): **3**

  - False Negative (actual `Ascend` / predicted `Descend`): **1**

  - True Positive (actual `Ascend` / predicted `Ascend`): **3**


- The rows represent the actual classes, with `Descend` being class **0** and `Ascend` being class **1**.
- The columns represent the predicted classes, with `Descend` being class **0** and `Ascend` being class **1**.
- The color gradient represents the count of instances, with darker colors indicating higher counts.
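
The four cells can be recovered from the classification report alone, which is a useful cross-check. The sketch below rebuilds them from the reported support and recall, then recomputes precision and accuracy. The only inputs are the figures printed in the report; the variable names are illustrative.

```python
import numpy as np

# Figures copied from the classification report.
support = {0: 24, 1: 4}
recall = {0: 0.88, 1: 0.75}

# Recall is the share of each actual class that was predicted correctly.
tn = round(recall[0] * support[0])  # actual Descend, predicted Descend
tp = round(recall[1] * support[1])  # actual Ascend, predicted Ascend
fp = support[0] - tn                # actual Descend, predicted Ascend
fn = support[1] - tp                # actual Ascend, predicted Descend

# Same layout as sklearn: rows are actual classes, columns are predictions.
cm = np.array([[tn, fp], [fn, tp]])

precision_descend = tn / (tn + fn)
precision_ascend = tp / (tp + fp)
accuracy = (tn + tp) / cm.sum()
majority_baseline = max(support.values()) / sum(support.values())

print(cm)
print(f"precision Descend={precision_descend:.2f}, Ascend={precision_ascend:.2f}")
print(f"accuracy={accuracy:.2f}, majority-class baseline={majority_baseline:.2f}")
```

The recomputed precision values agree with the report, and the accuracy equals the share of the majority class in the test set.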

### Save File

In [14]:
# Add the predictions to a copy of the test DataFrame
X_test_scored = X_test.copy()
X_test_scored['predicted_trend'] = y_pred  # Predicted class (0 = Descend, 1 = Ascend)

# Also add the predicted probability of each class
y_proba = grid_search.predict_proba(X_test)
X_test_scored['probability_descend'] = y_proba[:, 0]  # Probability of 'Descend'
X_test_scored['probability_ascend'] = y_proba[:, 1]  # Probability of 'Ascend'

# X_test_scored now holds the predictions and their associated probabilities


In [15]:
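# Caveat: reset_index(drop=True) discards the original row labels of X_test_scored, so this merge pairs the
# shuffled test predictions with the first rows of the full frame by position, not with the rows they came from.
df_macroeconimic_selic_scored = pd.merge(df_macroeconimic_numeric_cols[['selic', 'selic_trend']], X_test_scored[['predicted_trend', 'probability_descend', 'probability_ascend']].reset_index(drop=True), left_index=True, right_index=True)
df_macroeconimic_selic_scored = pd.merge(df_macroeconimic_selic_scored, df_macroeconomic_book['date'], left_index=True, right_index=True)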
df_macroeconimic_selic_scored.head()

Unnamed: 0,selic,selic_trend,predicted_trend,probability_descend,probability_ascend,date
0,6.5,0,0,1.0,0.0,2019-01-31
1,6.5,0,0,0.798864,0.201136,2019-02-28
2,6.5,0,1,0.414501,0.585499,2019-03-31
3,6.5,0,0,0.922727,0.077273,2019-04-30
4,6.5,0,0,0.729672,0.270328,2019-05-31
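
The merge in cell 15 calls `reset_index(drop=True)` on the scored test rows before joining on the index. Since `train_test_split` shuffles rows, that call replaces the original row labels with 0, 1, 2, ... and the predictions end up next to the first rows of the full frame. A minimal sketch of the difference, on a hypothetical six-row frame (`full` and `scored` are illustrative names):

```python
import pandas as pd

# Hypothetical miniature of the full dataset.
full = pd.DataFrame({
    "date": ["2019-01-31", "2019-02-28", "2019-03-31",
             "2019-04-30", "2019-05-31", "2019-06-30"],
    "selic": [6.5, 6.5, 6.0, 5.5, 5.0, 5.25],
})

# Pretend rows 4, 1 and 3 landed in the test set, in that shuffled order.
scored = pd.DataFrame({"predicted_trend": [0, 1, 0]}, index=[4, 1, 3])

# Joining on the untouched index pairs each prediction with its own row.
aligned = full.join(scored, how="inner")

# Resetting the index first pairs the predictions with rows 0, 1 and 2.
misaligned = full.join(scored.reset_index(drop=True), how="inner")

print(aligned)
print(misaligned)
```

Under this assumption, keeping the original index of `X_test_scored` in the notebook's merge would attach each prediction to the date it was actually computed for.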


In [16]:
file_path_scored

'/Users/aalves/Documents/repos/tcc-fia-trading-advisor/data/scored_base'

In [17]:
Path(file_path_scored).mkdir(parents=True, exist_ok=True)
df_macroeconimic_selic_scored.to_csv(file_path_scored + '/macroeconimic_selic_scored.csv', index=False)

## TL;DR

**`Model Analysis`**

- **`Precision`**: The model has a precision of **95%** for the 'Descend' class (`0`), which means that when it predicts the indicator will go down, it is correct **95%** of the time. For the 'Ascend' class (`1`), the precision is **50%**, which is significantly lower.

- **`Recall`**: The recall for the 'Descend' class is **88%**, indicating that the model correctly identifies **88%** of the actual descending instances. For the 'Ascend' class, the recall is **75%**, suggesting it identifies **75%** of the ascending instances.

- **`F1-Score`**: The F1-score for the 'Descend' class is **91%**, a high value indicating a good balance between precision and recall. However, for the 'Ascend' class, the F1-score is **60%**, which is mediocre, reflecting the lower precision for this class.

- **`Support`**: The 'Descend' class has a support of **24**, meaning there are **24** actual instances of this class in the test set, compared to just **4** instances for the 'Ascend' class. This 6:1 class imbalance helps explain why the model performs better for 'Descend' than for 'Ascend', and with only 4 positive cases the 'Ascend' metrics are very sensitive to a single prediction.

- **`Confusion Matrix`**: The model correctly predicted **21 out of 24** 'Descend' instances and **3 out of 4** 'Ascend' instances. Its weak spot is false alarms: **3** of its **6** 'Ascend' predictions were wrong, which is what holds precision for that class at **50%**. Overall accuracy (**86%**, 24 of 28) is the same as always predicting 'Descend' would achieve, so accuracy alone does not show that the model adds value.

**`Analysis of the Model's Performance`**

Based on this analysis, we can infer that the model is quite effective at predicting the 'Descend' class but not as effective for the 'Ascend' class. The high performance for 'Descend' might be due to the larger number of instances of this class in the dataset, allowing the model to learn more effectively. The low precision for 'Ascend' could be improved with resampling techniques or reweighting to address the class imbalance.
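
As a hedged illustration of the reweighting idea, the sketch below computes scikit-learn's `balanced` class weights and passes the same option to a Random Forest. The class counts (20 and 8) and the synthetic features are assumptions chosen to resemble the imbalance in the training half, not the notebook's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Assumed class counts for illustration: 20 zeros and 8 ones.
y_demo = np.array([0] * 20 + [1] * 8)

# 'balanced' weight for a class = n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
print({label: float(w) for label, w in zip([0, 1], weights)})

# The same weighting can be requested directly from the estimator.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(28, 4)) + y_demo[:, None]
weighted_rf = RandomForestClassifier(class_weight="balanced", random_state=42)
weighted_rf.fit(X_demo, y_demo)
```

Here the minority class receives a weight of 28 / (2 × 8) = 1.75 and the majority class 28 / (2 × 20) = 0.7, so each misclassified 'Ascend' row counts 2.5 times as much as a misclassified 'Descend' row during training.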

**`Why It Was the Champion`**

The Random Forest was carried forward as the "champion" because it had the highest mean cross-validation accuracy of the three candidates (**0.8267**, against **0.7600** for Logistic Regression and **0.7200** for SVM). Other factors that may have weighed in its favor:

- **`High Overall F1-score`**: The weighted-average F1-score on the test set is **0.87**, driven mainly by the 'Descend' class.
- **`Robustness to the Majority Class`**: It is very good at predicting the majority class, which might be the most prevalent or important for the problem context.

- **`Generalization`**: The tuned model's test accuracy (**0.86**) is in line with the mean cross-validation accuracy measured on the training half (**0.8267**), which suggests no severe overfitting, although the sample is small.