# Statistical and Machine Learning Models for Fundamental Data

This notebook is a tool for investors interested in the Brazilian stock market. It combines machine learning techniques and statistical models to analyze the fundamental data of companies listed on the stock exchange. The aim is to support investment decision-making by identifying opportunities and mitigating risks. It includes interactive visualizations and real-time updates, making it practical for both experienced investors and beginners.

## Initial Setup

### Install Packages

In [17]:
%pip install pandas -q
%pip install plotly -q
%pip install scikit-learn -q

Note: you may need to restart the kernel to use updated packages.


### Import libs

In [18]:
import os
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score as sil_score
import warnings
warnings.filterwarnings('ignore')

### Set default file paths

In [19]:
file_path_book = str(Path(os.getcwd()).parent.parent / "data/book")
file_path_scored = str(Path(os.getcwd()).parent.parent / "data/scored_base")


### Pandas Config

In [20]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### Load data

In [21]:
df_fundamentals_book = pd.read_csv(file_path_book + "/fundamentals_book.csv")
df_fundamentals_book.head(5)

Unnamed: 0,ticker,long_name,sector,industry,market_cap,enterprise_value,total_revenue,profit_margins,operating_margins,dividend_rate,beta,ebitda,trailing_pe,forward_pe,volume,average_volume,fifty_two_week_low,fifty_two_week_high,price_to_sales_trailing_12_months,fifty_day_average,two_hundred_day_average,trailing_annual_dividend_rate,trailing_annual_dividend_yield,book_value,price_to_book,total_cash,total_cash_per_share,total_debt,earnings_quarterly_growth,revenue_growth,gross_margins,ebitda_margins,return_on_assets,return_on_equity,gross_profits,total_assets_approx,asset_turnover,earnings_growth_rate,dividend_payout_ratio,equity,debt_to_equity,roi,roce
0,ABCB4.SA,Banco ABC Brasil S.A.,Financial Services,Banks - Regional,4265434000.0,14773390000.0,1941779000.0,0.41576,0.38826,1.56,0.679,0.0,4.069768,4.706601,92300.0,747165.0,15.85,21.99,2.196663,19.3382,18.14667,1.55,0.080687,24.518,0.785138,7774306000.0,35.162,18298460000.0,0.001,0.003,0.0,0.0,0.0153,0.1568,1973086000.0,7774306000.0,0.249769,0.1,155000.0,-10524160000.0,-1.73871,0.131438,0.0
1,AGRO3.SA,BrasilAgro - Companhia Brasileira de Proprieda...,Consumer Defensive,Farm Products,2466480000.0,2912933000.0,1249437000.0,0.21493,0.25031,3.21,0.432,264892000.0,9.450382,6.332481,298100.0,666692.0,22.29,32.71,1.974073,27.0106,25.58635,3.24,0.132029,22.237,1.11346,383837000.0,3.885,872075000.0,6.801,0.671,0.25252,0.21201,0.03839,0.1217,315504000.0,383837000.0,3.255124,680.1,47.640053,-488238000.0,-1.786168,0.428927,0.079343
2,RAIL3.SA,Rumo S.A.,Industrials,Railroads,42288820000.0,55243050000.0,10317460000.0,0.07639,0.33544,0.07,0.227,4522541000.0,54.309525,21.72381,5733400.0,14644522.0,16.21,24.44,4.098764,22.5852,20.95235,0.066,0.002993,8.334,2.736981,7656040000.0,4.132,21843200000.0,3.935,0.121,0.34493,0.43834,0.04252,0.05163,3146360000.0,7656040000.0,1.347623,393.5,1.677255,-14187160000.0,-1.539646,0.186765,0.070519
3,ALPA3.SA,Alpargatas S.A.,Consumer Cyclical,Footwear & Accessories,5309793000.0,6482982000.0,4022153000.0,-0.05671,-0.06434,0.4,0.571,-198000.0,0.0,0.0,1100.0,3953.0,7.27,17.8,1.320137,8.7146,9.6354,0.0,0.0,7.867,1.008008,414288000.0,0.614,1550341000.0,0.0,-0.127,0.43246,-5e-05,-0.0091,-0.04153,1968303000.0,414288000.0,9.708591,0.0,0.0,-1136053000.0,-1.364673,0.620417,-2.9e-05
4,ALPA4.SA,Alpargatas S.A.,Consumer Cyclical,Footwear & Accessories,5350758000.0,6395236000.0,4022153000.0,-0.05671,-0.06434,0.43,0.571,-198000.0,0.0,14.555555,1132100.0,5605825.0,6.81,22.51,1.330322,8.3228,9.2729,0.0,0.0,7.867,0.99911,414288000.0,0.614,1550341000.0,0.0,-0.127,0.43246,-5e-05,-0.0091,-0.04153,1968303000.0,414288000.0,9.708591,0.0,0.0,-1136053000.0,-1.364673,0.62893,-2.9e-05


## Models

#### Transforming categorical features into numerical using OneHotEncoder

In [22]:
numeric_columns = df_fundamentals_book.select_dtypes(include=['int64', 'float64']).columns
categorical_columns = df_fundamentals_book.drop(['ticker', 'long_name', 'industry'], axis='columns').copy(deep=True)
categorical_columns = categorical_columns.select_dtypes(include='object').columns

encoder = OneHotEncoder(sparse_output=False)  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
encoded_categorical = encoder.fit_transform(df_fundamentals_book[categorical_columns])

df_encoded = pd.DataFrame(encoded_categorical, columns=encoder.get_feature_names_out(categorical_columns))
df_fundamentals_final = pd.concat([df_fundamentals_book.drop(categorical_columns, axis=1), df_encoded], axis=1)
df_fundamentals_final.columns = ['_'.join(col.lower().replace('-', '').split()) for col in df_fundamentals_final.columns]

df_fundamentals_final.head()

Unnamed: 0,ticker,long_name,industry,market_cap,enterprise_value,total_revenue,profit_margins,operating_margins,dividend_rate,beta,ebitda,trailing_pe,forward_pe,volume,average_volume,fifty_two_week_low,fifty_two_week_high,price_to_sales_trailing_12_months,fifty_day_average,two_hundred_day_average,trailing_annual_dividend_rate,trailing_annual_dividend_yield,book_value,price_to_book,total_cash,total_cash_per_share,total_debt,earnings_quarterly_growth,revenue_growth,gross_margins,ebitda_margins,return_on_assets,return_on_equity,gross_profits,total_assets_approx,asset_turnover,earnings_growth_rate,dividend_payout_ratio,equity,debt_to_equity,roi,roce,sector_basic_materials,sector_communication_services,sector_consumer_cyclical,sector_consumer_defensive,sector_energy,sector_financial_services,sector_healthcare,sector_industrials,sector_real_estate,sector_technology,sector_utilities
0,ABCB4.SA,Banco ABC Brasil S.A.,Banks - Regional,4265434000.0,14773390000.0,1941779000.0,0.41576,0.38826,1.56,0.679,0.0,4.069768,4.706601,92300.0,747165.0,15.85,21.99,2.196663,19.3382,18.14667,1.55,0.080687,24.518,0.785138,7774306000.0,35.162,18298460000.0,0.001,0.003,0.0,0.0,0.0153,0.1568,1973086000.0,7774306000.0,0.249769,0.1,155000.0,-10524160000.0,-1.73871,0.131438,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,AGRO3.SA,BrasilAgro - Companhia Brasileira de Proprieda...,Farm Products,2466480000.0,2912933000.0,1249437000.0,0.21493,0.25031,3.21,0.432,264892000.0,9.450382,6.332481,298100.0,666692.0,22.29,32.71,1.974073,27.0106,25.58635,3.24,0.132029,22.237,1.11346,383837000.0,3.885,872075000.0,6.801,0.671,0.25252,0.21201,0.03839,0.1217,315504000.0,383837000.0,3.255124,680.1,47.640053,-488238000.0,-1.786168,0.428927,0.079343,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,RAIL3.SA,Rumo S.A.,Railroads,42288820000.0,55243050000.0,10317460000.0,0.07639,0.33544,0.07,0.227,4522541000.0,54.309525,21.72381,5733400.0,14644522.0,16.21,24.44,4.098764,22.5852,20.95235,0.066,0.002993,8.334,2.736981,7656040000.0,4.132,21843200000.0,3.935,0.121,0.34493,0.43834,0.04252,0.05163,3146360000.0,7656040000.0,1.347623,393.5,1.677255,-14187160000.0,-1.539646,0.186765,0.070519,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,ALPA3.SA,Alpargatas S.A.,Footwear & Accessories,5309793000.0,6482982000.0,4022153000.0,-0.05671,-0.06434,0.4,0.571,-198000.0,0.0,0.0,1100.0,3953.0,7.27,17.8,1.320137,8.7146,9.6354,0.0,0.0,7.867,1.008008,414288000.0,0.614,1550341000.0,0.0,-0.127,0.43246,-5e-05,-0.0091,-0.04153,1968303000.0,414288000.0,9.708591,0.0,0.0,-1136053000.0,-1.364673,0.620417,-2.9e-05,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ALPA4.SA,Alpargatas S.A.,Footwear & Accessories,5350758000.0,6395236000.0,4022153000.0,-0.05671,-0.06434,0.43,0.571,-198000.0,0.0,14.555555,1132100.0,5605825.0,6.81,22.51,1.330322,8.3228,9.2729,0.0,0.0,7.867,0.99911,414288000.0,0.614,1550341000.0,0.0,-0.127,0.43246,-5e-05,-0.0091,-0.04153,1968303000.0,414288000.0,9.708591,0.0,0.0,-1136053000.0,-1.364673,0.62893,-2.9e-05,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Feature Selection using PCA

In [23]:
n_components = 5
pca = PCA(n_components)
pca_result = pca.fit_transform(df_fundamentals_final.select_dtypes(include=['int', 'float64', 'number']))
explained_variance = pca.explained_variance_ratio_

variance_explained = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(variance_explained)

fig = px.bar(x=[f'PC{i+1}' for i in range(len(explained_variance))], y=explained_variance, labels={'x': 'Principal Component', 'y': 'Explained Variance'}, title='Explained Variance by Each Principal Component', color_discrete_sequence=['rgb(100, 195, 181)'], text=[f'{x:.2f}%' for x in explained_variance*100])
fig.update_traces(textposition='outside')
fig.update_layout(template='plotly_dark', font=dict(color='white'), height=500)
fig.show()

**`PC1 Explained Variance` (83.11%)**:

  - The first principal component `(PC1)` accounts for the majority of the variance in the dataset, suggesting a strong pattern or trend that PC1 represents.

**`PC2 Explained Variance` (11.82%)**

  - The second principal component `(PC2)` captures a significant but much smaller portion of the variance, indicating a secondary pattern in the data that's less dominant than `PC1`.

**`PC3 to PC5 Explained Variance` (Decreasing Significance)**:

  - `PC3` (**3.60%**), `PC4` (**1.12%**), and `PC5` (**0.29%**) explain increasingly smaller amounts of variance.
  - These components might represent noise or less relevant underlying patterns in the dataset.

**`Cumulative Explained Variance`**

  - The first two components (`PC1` and `PC2`) together explain **94.93%** of the variance.
  - This high percentage suggests that the dataset can be effectively represented in a reduced dimensionality space with minimal loss of information.

**`Implications for Trading Strategy`**

  - Focusing on `PC1` and `PC2` could simplify the model without sacrificing significant predictive power.
  - Further investigation is needed to understand what market factors or combinations of factors these principal components represent.
  - Caution should be exercised when discarding the lower variance components, as they might capture important nuances in the data relevant for predictive accuracy in certain market conditions.
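The cumulative figures quoted above follow directly from the per-component ratios. As a small worked check, the sketch below hard-codes the five percentages reported in this section (rather than refitting the PCA) and accumulates them with `np.cumsum`, the same call the notebook uses.

```python
import numpy as np

# Explained-variance percentages as reported above for PC1..PC5
ratios_pct = np.array([83.11, 11.82, 3.60, 1.12, 0.29])
cumulative_pct = np.cumsum(ratios_pct)

for i, (ratio, total) in enumerate(zip(ratios_pct, cumulative_pct), start=1):
    print(f"PC{i}: {ratio:5.2f}%  cumulative: {total:6.2f}%")
```

The running total reaches 94.93% after two components and 98.53% after three.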


In [24]:
components = pca.components_
component_loadings = pd.DataFrame(components, columns=df_fundamentals_final.select_dtypes(include=['int', 'float64', 'number']).columns)

n_components_to_display = 5  
n_top_features = 20  
top_features_per_component = []

for i in range(n_components):
    component = component_loadings.iloc[i]
    top_features = component.abs().nlargest(n_top_features).index.tolist()
    top_features_per_component.append(top_features)

df_top_features = pd.DataFrame(top_features_per_component, index=[f'Component {i+1}' for i in range(n_components)])
df_top_features

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Component 1,enterprise_value,total_debt,equity,market_cap,total_assets_approx,total_cash,total_revenue,gross_profits,ebitda,average_volume,volume,asset_turnover,dividend_payout_ratio,dividend_rate,book_value,fifty_two_week_high,debt_to_equity,total_cash_per_share,two_hundred_day_average,fifty_two_week_low
Component 2,total_revenue,total_debt,market_cap,ebitda,gross_profits,enterprise_value,total_cash,total_assets_approx,equity,average_volume,volume,asset_turnover,dividend_payout_ratio,earnings_growth_rate,debt_to_equity,dividend_rate,book_value,total_cash_per_share,fifty_day_average,fifty_two_week_low
Component 3,total_assets_approx,total_cash,equity,market_cap,enterprise_value,total_revenue,gross_profits,total_debt,ebitda,average_volume,volume,dividend_payout_ratio,asset_turnover,debt_to_equity,fifty_two_week_high,earnings_growth_rate,total_cash_per_share,dividend_rate,two_hundred_day_average,fifty_day_average
Component 4,total_revenue,market_cap,enterprise_value,total_debt,equity,gross_profits,total_cash,total_assets_approx,ebitda,average_volume,volume,asset_turnover,dividend_payout_ratio,earnings_growth_rate,fifty_two_week_high,two_hundred_day_average,fifty_day_average,fifty_two_week_low,total_cash_per_share,trailing_pe
Component 5,gross_profits,ebitda,market_cap,total_revenue,enterprise_value,total_debt,equity,total_cash,total_assets_approx,average_volume,volume,asset_turnover,dividend_payout_ratio,earnings_growth_rate,dividend_rate,fifty_two_week_high,book_value,total_cash_per_share,forward_pe,fifty_two_week_low


**`Component 1 Analysis`**:
  - *Dominant Metrics*: enterprise value, total debt, equity, market cap, total assets.
  - *Interpretation*: This component may represent the overall size and leverage of companies, as it loads heavily on enterprise value and total debt.

**`Component 2 Analysis`**:
  - *Dominant Metrics*: total revenue, total debt, market cap, ebitda.
  - *Interpretation*: This could be capturing companies' operational performance, given the emphasis on revenue and earnings before interest, taxes, depreciation, and amortization (EBITDA).

**`Component 3 Analysis`**:
  - *Dominant Metrics*: total assets, total cash, equity.
  - *Interpretation*: Appears to reflect the liquidity and solvency aspects of companies, highlighting the significance of cash holdings and assets.

**`Component 4 Analysis`**:
  - *Dominant Metrics*: total revenue, market cap, enterprise value.
  - *Interpretation*: This component seems to capture the growth aspect of the companies by loading on total revenue and market capitalization.

**`Component 5 Analysis`**:
  - *Dominant Metrics*: gross profits, ebitda, market cap.
  - *Interpretation*: May represent profitability and efficiency, as it is heavily influenced by profit-related metrics.

**`Implications for Quantitative Trading`**:
  - When developing trading strategies, these components could be used to classify companies into different profiles, like size & leverage, operational performance, liquidity, growth, and profitability.
  - A trading strategy might be developed to exploit patterns found in companies with similar loadings on these components.
  - It's important to note that PCA is sensitive to the scale of the variables, and thus the data should be appropriately normalized if it's not already.
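To illustrate the scale sensitivity mentioned in the last point, here is a minimal sketch on synthetic data (not the notebook's dataset). It assumes two independent features, one in the range of currency amounts and one in the range of ratios, and compares the PCA variance split with and without `StandardScaler`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features on very different scales
big = rng.normal(0, 1e9, size=200)    # e.g. a currency amount
small = rng.normal(0, 0.1, size=200)  # e.g. a margin or ratio
X = np.column_stack([big, small])

# Without scaling, the large-magnitude column dominates the first component
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# After standardization, both columns contribute on equal footing
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

If the notebook's features were standardized in the same way before fitting, the loadings would likely be less dominated by large-magnitude columns such as `enterprise_value` and `total_debt`; that is an expectation to verify, not a result from this notebook.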


In [25]:
pca.fit(df_fundamentals_final.select_dtypes(include=['int', 'float64', 'number']))
variance_explained = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(variance_explained)

fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(1, len(variance_explained) + 1), y=cumulative_variance, mode='lines+markers', line=dict(color='rgb(100, 195, 181)'), name='Cumulative Variance Explained'))
fig.update_layout(title='Elbow Method for PCA Analysis', xaxis_title='Number of Principal Components', yaxis_title='Cumulative Variance Explained', xaxis=dict(tickmode='linear'), yaxis=dict(tickformat='.0%'), template='plotly_dark', font=dict(color='white')) 
fig.show()

**`Scree Plot Analysis`**:
  - The plot shows that the first principal component explains about 83% of the variance.
  - Adding the second component raises the cumulative explained variance to roughly 95%.
  - The third component brings it to approximately 98.5%.
  - The fourth and fifth components add little, as the cumulative explained variance is already close to 100%.

**`Elbow Method Observation`**:
  - The "elbow" in the graph seems to occur at the third component, which is where the rate of increase in explained variance significantly tapers off.
  - This suggests that most of the variability in the data can be captured by the first three components.

**`Implications for Dimensionality Reduction`**:
  - For dimensionality reduction, it would be reasonable to choose three principal components since they explain the majority of the variance.
  - Choosing more than three components may not add substantial information and could lead to overfitting in a predictive model.

**`Recommendation for Quantitative Analysis`**:
  - In a quantitative trading context, this could imply that a model can be sufficiently informed with three aggregated metrics derived from the original variables, thus simplifying the model and potentially increasing computational efficiency.

**`Considerations for Trading Models`**:
  - It's essential to consider the nature of the financial data and market conditions when deciding on the number of components to use.
  - A qualitative review of what each principal component represents in terms of financial metrics should be performed to ensure that important information is not omitted from the model.
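As an alternative to reading the elbow by eye, the number of components can be chosen from a variance threshold. The sketch below runs on synthetic data and relies on scikit-learn's documented behavior that a float `n_components` between 0 and 1 (with `svd_solver="full"`) keeps the fewest components whose cumulative explained variance exceeds that fraction. The 95% threshold is an illustrative assumption, not a value taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-in for a feature matrix: 6 columns with decreasing variance
X = rng.normal(size=(300, 6)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5, 0.1])

# Keep the fewest components whose cumulative explained variance exceeds 95%
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)
n_kept = pca_95.n_components_

# Cumulative curve from a full fit, for comparison with the scree plot
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
print("components kept:", n_kept)
print("cumulative variance:", cumulative.round(4))
```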


#### Selecting the Ideal Number of Clusters

In [26]:
n_components = 4

- Based on the analyses conducted so far, I am choosing to work with 4 components, one more than the three suggested by the elbow.

In [27]:
pca = PCA(n_components)
principal_components = pca.fit_transform(df_fundamentals_final.select_dtypes(include=['int', 'float64', 'number']))
range_num_clusters = list(range(2, 10))
silhouette_scores = []
distortions = []

for num_cluster in range_num_clusters:

    kmeans = KMeans(n_clusters=num_cluster, max_iter=10_000, random_state=19051992)
    cluster_labels = kmeans.fit_predict(principal_components)
    silhouette_scores.append(sil_score(principal_components, cluster_labels))

    # Reuse the model fitted above so inertia is measured in the same PCA space as the silhouette score
    distortions.append(kmeans.inertia_)

scaler = StandardScaler()
silhouette_scores_scaled = scaler.fit_transform(np.array(silhouette_scores).reshape(-1, 1)).flatten()
distortions_scaled = scaler.fit_transform(np.array(distortions).reshape(-1, 1)).flatten()

fig = go.Figure()
fig.add_trace(go.Scatter(x=range_num_clusters, y=silhouette_scores_scaled, mode='lines+markers', name='Silhouette Score'))
fig.add_trace(go.Scatter(x=range_num_clusters, y=distortions_scaled, mode='lines+markers', name='Distortions'))
fig.update_layout(title='Elbow Method for Determining the Optimal Number of Clusters (Normalized)', xaxis_title='Number of clusters', yaxis_title='Silhouette Score / Distortions (Normalized)', template='plotly_dark', font=dict(color='white'))
fig.show()

**`Elbow Method for Optimal Clusters`**:

  - The graph plots two lines representing normalized silhouette scores and distortions against the number of clusters.
  - The silhouette score measures how similar an object is to its own cluster compared to other clusters. A higher silhouette score indicates better-defined clusters.
  - Distortions typically represent the within-cluster sum of squares and decrease with increasing number of clusters.

**`Silhouette Score Analysis`**:

  - The silhouette score starts high at **2 clusters** and decreases as more clusters are added, with noticeable changes up to **4 clusters**.
  - The score stabilizes after five clusters, suggesting limited improvement in cluster definition beyond this point.
  - In the visualization of cluster distributions, **5 clusters** appear more tightly grouped than **4 clusters**, indicating better within-cluster cohesion at that setting.

**`Distortion Analysis`**:

  - The distortions decrease rapidly as the number of clusters increases from **2** to **4**, suggesting significant improvement in the compactness of clusters.
  - There is an "elbow" at three clusters where the rate of decrease sharply changes, indicating that adding more clusters beyond this point yields diminishing returns in terms of reduced distortions.

**`Optimal Number of Clusters`**:

  - Considering both metrics, **3 clusters** appears to be the best choice for this clustering scenario.
  - The elbow at **3 clusters** for distortions and the relatively high silhouette score at that point suggest that **3 clusters** provide a good balance between cluster compactness and separation.
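The two criteria discussed above can be reproduced on data where the true number of groups is known. This is a minimal sketch on synthetic blobs generated with `make_blobs` (three well-separated centers chosen for illustration), not on the fundamentals data. It records the silhouette score and the inertia (within-cluster sum of squares) for each candidate `k`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

candidate_ks = list(range(2, 7))
silhouettes, inertias = [], []
for k in candidate_ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    silhouettes.append(silhouette_score(X, model.labels_))  # higher is better
    inertias.append(model.inertia_)                         # lower is tighter

best_k = candidate_ks[int(np.argmax(silhouettes))]
print("best k by silhouette:", best_k)
```

Inertia tends to keep falling as `k` grows, which is why it is read through its elbow, while the silhouette score can be compared directly across values of `k`.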

#### Running Model (Kmeans)

In [28]:
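# Refit with the chosen number of clusters; `cluster_labels` from the loop above
# still holds the labels of its last iteration (9 clusters).
kmeans = KMeans(n_clusters=3, max_iter=10_000, random_state=19051992)
cluster_labels = kmeans.fit_predict(principal_components)

df_pca = pd.DataFrame(principal_components[:, :2], columns=['PC1', 'PC2'])
df_pca['cluster'] = cluster_labels.astype(str)  # string labels give discrete colors

fig = px.scatter(df_pca, x='PC1', y='PC2', color='cluster', title='Visualization of Clusters with PCA', color_discrete_sequence=px.colors.qualitative.Vivid)
fig.update_layout(template='plotly_dark', font=dict(color='white'))
fig.show()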


In [29]:
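def calcular_coesao(X, labels, centroids):
    """Cohesion: sum of squared distances from each point to its cluster centroid (lower = tighter)."""
    coesao = 0
    for i in range(len(centroids)):
        cluster_points = X[labels == i]
        coesao += np.sum((cluster_points - centroids[i])**2)
    return coesao

def calcular_separacao(centroids):
    """Separation: sum of squared distances between every pair of centroids (higher = more distinct)."""
    separacao = 0
    for i in range(len(centroids)):
        for j in range(i+1, len(centroids)):
            separacao += np.sum((centroids[i] - centroids[j])**2)
    return separacao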

In [30]:
pca = PCA(n_components)
principal_components = pca.fit_transform(df_fundamentals_final.select_dtypes(include=['int', 'float64', 'number']))

tsne = TSNE(n_components=2, metric='cosine')
tsne_results = tsne.fit_transform(principal_components)

kmeans = KMeans(n_clusters=3, max_iter=10_000, random_state=19051992)
clusters = kmeans.fit_predict(tsne_results)

df_visualization = pd.DataFrame(tsne_results, columns=['TSNE1', 'TSNE2'])
df_visualization['cluster'] = clusters

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

fig = go.Figure()
for i in range(kmeans.n_clusters):  # one trace per fitted cluster (3), not a hard-coded 5
    cluster_points = df_visualization[df_visualization['cluster'] == i]
    fig.add_trace(go.Scatter(x=cluster_points['TSNE1'], y=cluster_points['TSNE2'], mode='markers', name=f'Cluster {i}', marker=dict(color=px.colors.qualitative.Vivid[i])))
fig.add_trace(go.Scatter(x=centroids[:, 0], y=centroids[:, 1], mode='markers', marker=dict(size=10, color='white'), name='Centroids'))
fig.update_layout(title='Visualization of Clusters with TSNE and Cosine Distance', template='plotly_dark', font=dict(color='white'))

cohesion = calcular_coesao(tsne_results, clusters, centroids)
separation = calcular_separacao(centroids)

fig.add_annotation(x=max(tsne_results[:, 0]), y=max(tsne_results[:, 1]), text=f'Cohesion: {cohesion:.2f}, Separation: {separation:.2f}', showarrow=False, yshift=10)
fig.show()

**`t-SNE and Cosine Distance Cluster Visualization`**

- **Visualization Technique**: Uses t-SNE for dimensionality reduction and cosine distance to measure similarity.
- **Clusters Identified**: Three distinct clusters labeled as Cluster 0 (orange), Cluster 1 (blue), and Cluster 2 (green).
- **Centroids**: Marked with white dots, representing the central point of each cluster.

**`Quantitative Measures`**

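- **Cohesion**: 6920.75
  - The within-cluster sum of squared distances to the centroids, measured in the t-SNE embedding. Lower values indicate tighter clusters; the number depends on the scale of the embedding and is most useful when compared across runs.

- **Separation**: 2008.37
  - The sum of squared distances between each pair of centroids. Higher values indicate more distinct clusters.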

**`Observations`**

- **Cluster 0** (orange) and **Cluster 1** (blue) are more dispersed compared to **Cluster 2** (green), which is more compact.
- The centroids are well-placed within each cluster, suggesting accurate representation of the central tendency.
- Because cohesion is a sum of squared distances, the value of 6920.75 does not establish cluster quality on its own; it should be compared against runs with other cluster counts or embeddings.

**`Implications`**

- The visible separation between the three clusters suggests that the clustering has identified distinct groups within the data.
- This visualization can inform decision-making in scenarios where data segmentation is crucial, such as customer segmentation, targeted marketing, or portfolio management in finance.
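To make the two measures concrete, here is a small worked example on four hand-picked points. It is a sketch that mirrors the logic of `calcular_coesao` and `calcular_separacao`, with the English names `cohesion` and `separation` introduced here for illustration only. As computed, cohesion is a within-cluster sum of squares, so smaller values mean tighter clusters, while larger separation means centroids that are farther apart.

```python
import numpy as np

def cohesion(X, labels, centroids):
    """Sum of squared distances from each point to its own cluster centroid."""
    return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centroids))

def separation(centroids):
    """Sum of squared distances between every pair of centroids."""
    total = 0.0
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            total += np.sum((centroids[i] - centroids[j]) ** 2)
    return total

# Two clusters of two points each, 10 units apart on the x axis
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

coh = cohesion(X, labels, centroids)  # each point is 1 unit from its centroid: 4 * 1**2
sep = separation(centroids)           # centroids are 10 units apart: 10**2
print(f"Cohesion: {coh:.2f}, Separation: {sep:.2f}")  # Cohesion: 4.00, Separation: 100.00
```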



#### Inserting the clusters into the base

In [31]:
df_fundamentals_final['kmeans_cluster'] = clusters
df_fundamentals_final.head()

Unnamed: 0,ticker,long_name,industry,market_cap,enterprise_value,total_revenue,profit_margins,operating_margins,dividend_rate,beta,ebitda,trailing_pe,forward_pe,volume,average_volume,fifty_two_week_low,fifty_two_week_high,price_to_sales_trailing_12_months,fifty_day_average,two_hundred_day_average,trailing_annual_dividend_rate,trailing_annual_dividend_yield,book_value,price_to_book,total_cash,total_cash_per_share,total_debt,earnings_quarterly_growth,revenue_growth,gross_margins,ebitda_margins,return_on_assets,return_on_equity,gross_profits,total_assets_approx,asset_turnover,earnings_growth_rate,dividend_payout_ratio,equity,debt_to_equity,roi,roce,sector_basic_materials,sector_communication_services,sector_consumer_cyclical,sector_consumer_defensive,sector_energy,sector_financial_services,sector_healthcare,sector_industrials,sector_real_estate,sector_technology,sector_utilities,kmeans_cluster
0,ABCB4.SA,Banco ABC Brasil S.A.,Banks - Regional,4265434000.0,14773390000.0,1941779000.0,0.41576,0.38826,1.56,0.679,0.0,4.069768,4.706601,92300.0,747165.0,15.85,21.99,2.196663,19.3382,18.14667,1.55,0.080687,24.518,0.785138,7774306000.0,35.162,18298460000.0,0.001,0.003,0.0,0.0,0.0153,0.1568,1973086000.0,7774306000.0,0.249769,0.1,155000.0,-10524160000.0,-1.73871,0.131438,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2
1,AGRO3.SA,BrasilAgro - Companhia Brasileira de Proprieda...,Farm Products,2466480000.0,2912933000.0,1249437000.0,0.21493,0.25031,3.21,0.432,264892000.0,9.450382,6.332481,298100.0,666692.0,22.29,32.71,1.974073,27.0106,25.58635,3.24,0.132029,22.237,1.11346,383837000.0,3.885,872075000.0,6.801,0.671,0.25252,0.21201,0.03839,0.1217,315504000.0,383837000.0,3.255124,680.1,47.640053,-488238000.0,-1.786168,0.428927,0.079343,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,RAIL3.SA,Rumo S.A.,Railroads,42288820000.0,55243050000.0,10317460000.0,0.07639,0.33544,0.07,0.227,4522541000.0,54.309525,21.72381,5733400.0,14644522.0,16.21,24.44,4.098764,22.5852,20.95235,0.066,0.002993,8.334,2.736981,7656040000.0,4.132,21843200000.0,3.935,0.121,0.34493,0.43834,0.04252,0.05163,3146360000.0,7656040000.0,1.347623,393.5,1.677255,-14187160000.0,-1.539646,0.186765,0.070519,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1
3,ALPA3.SA,Alpargatas S.A.,Footwear & Accessories,5309793000.0,6482982000.0,4022153000.0,-0.05671,-0.06434,0.4,0.571,-198000.0,0.0,0.0,1100.0,3953.0,7.27,17.8,1.320137,8.7146,9.6354,0.0,0.0,7.867,1.008008,414288000.0,0.614,1550341000.0,0.0,-0.127,0.43246,-5e-05,-0.0091,-0.04153,1968303000.0,414288000.0,9.708591,0.0,0.0,-1136053000.0,-1.364673,0.620417,-2.9e-05,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,ALPA4.SA,Alpargatas S.A.,Footwear & Accessories,5350758000.0,6395236000.0,4022153000.0,-0.05671,-0.06434,0.43,0.571,-198000.0,0.0,14.555555,1132100.0,5605825.0,6.81,22.51,1.330322,8.3228,9.2729,0.0,0.0,7.867,0.99911,414288000.0,0.614,1550341000.0,0.0,-0.127,0.43246,-5e-05,-0.0091,-0.04153,1968303000.0,414288000.0,9.708591,0.0,0.0,-1136053000.0,-1.364673,0.62893,-2.9e-05,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


##### Save File

In [32]:
Path(file_path_scored).mkdir(parents=True, exist_ok=True)  # create the directory the file is written to
df_fundamentals_final.to_csv(file_path_scored + "/fundamentals_scored_clusters.csv", index=False)  # frame that holds `kmeans_cluster`

## TL;DR

**Decision Summary on Clustering**

- **`Selected Number of Clusters`**: After consideration, the decision is to proceed with `3 clusters`.

- **`Methodological Foundation`**: The Elbow Method suggested that **3** clusters strike a good balance between simplicity and meaningful data segmentation.

- **`t-SNE Visualization Support`**: The t-SNE visualizations confirmed that `3 clusters` have clear boundaries and show a strong degree of separation.

- **`Cohesion and Separation Metric`**: The selected clusters exhibit **high cohesion** and **adequate separation**, indicating a robust clustering model.

- **`Practicality and Simplicity`**: A model with **3 clusters** is preferred for its practicality and ease of interpretation in strategic applications.

- **`Flexibility for Future Analysis`**: The clustering approach with `3 clusters` allows for scalability and adaptability in future analyses.

- **`Use Case Application`**: The three-cluster model is well-suited for applications such as `customer segmentation` and `market analysis`, providing broad yet distinct segments for strategic initiatives.

The focus on **3 clusters** aligns with a strategic direction that prioritizes a more generalized understanding of the dataset, facilitating clearer communication and decision-making processes.
