We abstracted the code for June 2022 into the file ``carteiras.py``.
Now we will use it to check whether the results are consistent with other months.

In [1]:
import numpy as np
import pandas as pd
import carteiras as c

For many months we do not have data for Banco Western Union, so we decided to remove it from our data.

This changes the order in which the groups appear for June 2022, but the groups themselves remain essentially the same.

In [2]:
estban = c.read_estban(202206)
carteiras, volume, _ = c.make_carteiras(estban)
kmeans_jun2022 = c.run_kmeans(carteiras, seed = 131)

c.sizes(carteiras, kmeans_jun2022)

0     6
1    46
2    27
3    25
4    12
dtype: int64

In [3]:
centers_jun2022 = c.find_centers(carteiras, kmeans_jun2022).round(1)

centers_jun2022

Grupo,161,162,163,164,165,166,167,169,171,172,173,174,176
0,0.3,0.1,0.3,0.0,0.0,0.0,0.0,0.1,0.0,0.3,-0.0,-0.0,0.0
1,0.1,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.9,-0.0,-0.0,0.0
2,0.8,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.1,-0.0,-0.0,0.0
3,0.5,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,-0.0,-0.0,0.0
4,0.2,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3,-0.0,-0.1,0.0


In [4]:
c.clusters_and_vol(carteiras, volume, kmeans_jun2022).groupby("Grupo").head(5)

NOME_INSTITUICAO,Grupo,Volume
BCO DO BRASIL S.A.,0,904788300000.0
BCO COOPERATIVO SICREDI S.A.,0,25511410000.0
BANCO SICOOB S.A.,0,18706440000.0
BCO DA AMAZONIA S.A.,0,15844230000.0
BCO BOCOM BBM S.A.,0,9521166000.0
BCO CITIBANK S.A.,1,56655770000.0
BCO BNP PARIBAS BRASIL S A,1,36730170000.0
BCO CRÉDIT AGRICOLE BR S.A.,1,31100450000.0
BCO SOCIETE GENERALE BRASIL,1,25672200000.0
BCO MUFG BRASIL S.A.,1,23092450000.0


To check whether this classification is reasonable in other months, we will look at the last 2 years, month by month.

To avoid problems with the order of the groups, we will initialize KMeans with the centers we found for June 2022. The algorithm normally draws its starting points at random and iterates until they stabilize. Starting from the June 2022 centers, if they really are similar to the centers of another month, the algorithm will stabilize quickly, at centers that are very close to them and in the same order.
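The warm-start idea can be illustrated with a minimal sketch. It does not use `carteiras.py` (we do not show here how `c.find_centers` works internally); instead it runs plain Lloyd iterations in NumPy on synthetic data. The function `lloyd`, the synthetic `points`, and `true_centers` are all hypothetical names introduced only for this illustration.

```python
import numpy as np

def lloyd(points, init_centers, max_iter=100):
    """Plain Lloyd (k-means) iterations from given centers; returns (centers, n_iter)."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for n_iter in range(1, max_iter + 1):
        # Assign each point to its nearest center (squared Euclidean distance).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members; empty clusters stay put.
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            return new_centers, n_iter
        centers = new_centers
    return centers, max_iter

# Synthetic data: three well-separated blobs (an assumption for the demo).
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([mu + 0.3 * rng.standard_normal((50, 2)) for mu in true_centers])

# Warm start: begin from slightly perturbed versions of the known centers.
warm_centers, warm_iters = lloyd(points, true_centers + 0.1)
print(warm_iters)
print(np.abs(warm_centers - true_centers).max())
```

Because each starting center already sits inside its own cluster, row `k` of the result still describes cluster `k`, which is what makes a row-by-row comparison between months meaningful.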

Let's list the months. There are only 24, so it is not worth writing anything sophisticated.

In [5]:
months = [202007,202008,202009,202010,202011,202012,202101,202102,202103,202104,202105,202106,202107,202108,202109,202110,202111,202112,202201,202202,202203,202204,202205,202206]
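For a longer window, typing the list by hand becomes error-prone. As an alternative sketch, the same `YYYYMM` integers could be generated with a small helper. `month_range` is a hypothetical name, not part of `carteiras.py`.

```python
def month_range(start, end):
    """Return all YYYYMM integers from start to end, inclusive."""
    year, month = divmod(start, 100)
    result = []
    while year * 100 + month <= end:
        result.append(year * 100 + month)
        # Advance one month, rolling over to January of the next year.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return result

generated_months = month_range(202007, 202206)
print(len(generated_months))
```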

The function below computes the centers for a given month as described above and measures how far they are from the June 2022 centers.

In [6]:
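def diff_centers(YEARMONTH, centers = centers_jun2022):
  """Return the reference centers minus the centers found for YEARMONTH (YYYYMM)."""
  estban = c.read_estban(YEARMONTH)
  carteiras, _, _ = c.make_carteiras(estban)
  # Start KMeans from the reference centers so the group order is preserved
  new_centers = c.find_centers(carteiras, centers=centers)
  diff = centers - new_centers

  return diff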

We need some way to turn this difference into a single number, so that it is easier to compare. The most natural way is to use a matrix norm.

In [7]:
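def err_centers(YEARMONTH, centers = centers_jun2022):
  """Norm of the center differences for YEARMONTH, rounded to 2 decimals."""
  diff = diff_centers(YEARMONTH, centers)
  err  = np.linalg.norm(diff)  # default norm for a 2-D input is Frobenius
  return np.round(err, 2)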
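When called on a 2-D array with no `ord` argument, `np.linalg.norm` returns the Frobenius norm: the square root of the sum of the squared entries. The small check below uses a made-up 2×2 matrix (not data from the notebook) to show the equivalence.

```python
import numpy as np

# Made-up difference matrix, chosen so the result is easy to verify by hand.
example_diff = np.array([[3.0, 0.0],
                         [0.0, 4.0]])

frob = np.linalg.norm(example_diff)           # default: Frobenius norm
manual = np.sqrt((example_diff ** 2).sum())   # sqrt(3^2 + 4^2)
print(frob, manual)
```

Every entry of the difference contributes, so a single badly placed center is enough to make the error large.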

Let's look at the errors:

In [8]:
pd.Series([err_centers(x) for x in months], index = months)

202007     0.46
202008    10.62
202009    10.76
202010    12.30
202011    10.67
202012    10.20
202101     0.26
202102     0.20
202103     0.15
202104     0.15
202105     0.19
202106     0.16
202107     0.16
202108     0.14
202109     0.17
202110     0.16
202111     0.22
202112     0.19
202201     0.20
202202     0.17
202203     0.21
202204     0.19
202205     0.15
202206     0.15
dtype: float64
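To pick out the anomalous months without reading the column by eye, one option is a simple threshold. The sketch below copies a few of the values printed above into a new Series; the cutoff of `1.0` is an arbitrary assumption, chosen only because the errors fall clearly on either side of it.

```python
import pandas as pd

# A subset of the errors printed above (copied by hand).
errors = pd.Series({
    202007: 0.46, 202008: 10.62, 202009: 10.76, 202010: 12.30,
    202011: 10.67, 202012: 10.20, 202101: 0.26, 202102: 0.20,
})

threshold = 1.0  # assumed cutoff, not derived from the data
outlier_months = errors[errors > threshold].index.tolist()
print(outlier_months)
```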

June 2022 itself shows some error, because the algorithm was run from these starting points and there are floating-point errors; in addition, the reference centers `centers_jun2022` were rounded to one decimal place, which by itself introduces a difference. Most months have very small errors. We can look at the differences for some of them:
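Part of the residual for June 2022 may come from the `.round(1)` applied to `centers_jun2022` in cell In [3]. As a rough, hedged illustration: rounding each entry of a 5×13 matrix to one decimal place moves it by at most 0.05, so the norm of the rounding error is at most `0.05 * sqrt(65)`. The matrix `exact` below is random, a stand-in for the unrounded centers, which are not shown in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
exact = rng.random((5, 13))       # hypothetical unrounded centers
rounded = exact.round(1)          # same rounding as applied to centers_jun2022

rounding_err = np.linalg.norm(exact - rounded)
bound = 0.05 * np.sqrt(exact.size)   # worst case: every entry off by 0.05
print(rounding_err, bound)
```

An error of 0.15 is well within this bound, so it is compatible with rounding alone.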

In [19]:
diff_centers(202103, centers=centers_jun2022).round(2)

Grupo,161,162,163,164,165,166,167,169,171,172,173,174,176
0,0.04,-0.01,-0.01,0.0,0.0,0.0,-0.0,0.03,-0.0,0.01,0.0,0.03,0.0
1,0.01,-0.02,-0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.02,0.0,0.02,0.0
2,-0.06,-0.03,-0.02,0.0,0.0,0.0,0.0,-0.02,-0.0,-0.01,0.0,0.04,0.0
3,0.01,-0.01,-0.01,0.0,0.0,0.0,-0.0,-0.03,-0.01,0.01,0.0,0.03,0.0
4,0.02,-0.07,-0.01,0.0,0.0,0.0,-0.0,-0.0,-0.01,0.0,0.0,-0.04,0.0


In [21]:
diff_centers(202201, centers=centers_jun2022).round(2)

Grupo,161,162,163,164,165,166,167,169,171,172,173,174,176
0,0.09,-0.0,-0.07,0.0,0.0,0.0,-0.0,0.02,-0.0,0.04,0.0,0.02,0.0
1,0.01,-0.03,-0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.03,0.0,0.05,0.0
2,-0.05,-0.05,-0.01,0.0,0.0,0.0,0.0,-0.03,-0.0,0.01,0.0,0.03,0.0
3,-0.01,0.03,-0.02,0.0,0.0,0.0,-0.0,-0.03,-0.0,0.01,0.0,0.03,0.0
4,-0.02,-0.06,-0.02,0.0,0.0,0.0,-0.0,-0.0,-0.01,0.05,0.0,-0.03,0.0


From August to December 2020, the errors become large, in the double digits. But 2020 really was an atypical year. We even have negative numbers in the tables.

In [10]:
estban = c.read_estban(202012)
carteiras, _, _ = c.make_carteiras(estban)
new_centers = c.find_centers(carteiras, n_clusters=7)

new_centers.round(1)

Grupo,161,162,163,164,165,166,167,169,171,172,173,174,176
0,0.4,0.1,0.1,0.0,0.0,0.0,0.0,0.1,0.0,0.4,-0.0,-0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.1,0.0,-7.1,0.0
2,0.6,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3,-0.0,-0.0,0.0
3,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.7,-0.0,-0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,-0.0,-0.0,0.0
5,0.2,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,-0.1,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,1.0,-0.0,-0.0,0.0


In [11]:
credito_sum, verbetes = c.make_sum(estban)

In [15]:
credito_sum["174"].sort_values()

NOME_INSTITUICAO
BCO DO BRASIL S.A.                   -4.571042e+10
BCO BRADESCO S.A.                    -3.700343e+10
CAIXA ECONOMICA FEDERAL              -3.411017e+10
ITAÚ UNIBANCO S.A.                   -2.597765e+10
BCO SANTANDER (BRASIL) S.A.          -2.162102e+10
                                          ...     
BCO BANDEPE S.A.                      0.000000e+00
ITAÚ UNIBANCO HOLDING S.A.            0.000000e+00
BCO CLASSICO S.A.                     0.000000e+00
STATE STREET BR S.A. BCO COMERCIAL    0.000000e+00
BCO WESTERN UNION                     0.000000e+00
Name: 174, Length: 115, dtype: float64

In [13]:
verbetes

Index(['VERBETE_160_OPERACOES_DE_CREDITO',
       'VERBETE_161_EMPRES_E_TIT_DESCONTADOS', 'VERBETE_162_FINANCIAMENTOS',
       'VERBETE_163_FIN_RURAIS_AGRICUL_CUST/INVEST',
       'VERBETE_164_FIN_RURAIS_PECUAR_CUST/INVEST',
       'VERBETE_165_FIN_RURAIS_AGRICUL_COMERCIALIZ',
       'VERBETE_166_FIN_RURAIS_PECUARIA_COMERCIALIZ',
       'VERBETE_167_FINANCIAMENTOS_AGROINDUSTRIAIS+VERBETE_168_RENDAS_A_APROPRIAR_FINANC_RURAIS_AGROINDUSTRIAIS',
       'VERBETE_169_FINANCIAMENTOS_IMOBILIARIOS',
       'VERBETE_171_OUTRAS_OPERACOES_DE_CREDITO',
       'VERBETE_172_OUTROS_CREDITOS', 'VERBETE_173_CREDITOS_EM_LIQUIDACAO',
       'VERBETE_174_PROV_P/_OPER_CREDITOS', 'VERBETE_176_OPERACOES_ESPECIAIS'],
      dtype='object')
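Verbete 174 holds provisions for credit operations, which enter with a negative sign. Assuming the portfolio shares are each verbete divided by the bank's total (the rows of the center tables sum to roughly 1, which is consistent with this, but we have not confirmed it in `carteiras.py`), a large negative provision shrinks the denominator and inflates every share. The balances below are invented to reproduce the 8.1 and -7.1 seen in group 1 of the December 2020 centers.

```python
import numpy as np

# Invented balances for one bank: [172 other credits, 174 provisions].
balances = np.array([81.0, -71.0])

# Shares relative to the bank's total; the total is only 10 because of the provision.
shares = balances / balances.sum()
print(shares)
```

The shares still add up to 1, but they are no longer confined to the interval from 0 to 1, which is enough to throw the distance-based clustering off.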

We can therefore conclude that the classification we found for June 2022 holds up well for the preceding year and a half, and breaks down only when highly abnormal situations occurred.