<a href="https://colab.research.google.com/github/WellingtonFidelis/fc-mod-1-cluster-kmeans-head-brain/blob/main/bc_arch_big_data_brain_cluster.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Identifying Brain Weight Clusters Using K-Means Clustering

## Final challenge - Module 1

> Bootcamp: **Big Data Architect**
>
> Module 1

This notebook is the final challenge of Module 1 of the bootcamp.

---

> ## Context
> Business problem

You are a data professional hired by a research team that is studying the relationship between head size and brain weight in a group of individuals. The goal is to analyze whether there are any significant patterns or clusters in these measurements. The team wants to use these insights to conduct further studies and better understand the factors that may influence these characteristics.

For this activity, you will have access to the dataset "headbrain.csv", which contains information on head size (in cubic centimeters) and brain weight (in grams). Your job is to use clustering techniques, especially the K-Means algorithm, to identify groups of individuals with similar characteristics.

> ## Environments configuration
>

In [None]:
from platform import python_version
print('The Python version used in this notebook is:', python_version())

The Python version used in this notebook is: 3.10.12


In [None]:
!cat /etc/*release

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy


Creating a folder for the downloads.

In [None]:
!mkdir -p Downloads


In [None]:
!ls -la

total 20
drwxr-xr-x 1 root root 4096 May  2 23:44 .
drwxr-xr-x 1 root root 4096 May  2 23:43 ..
drwxr-xr-x 4 root root 4096 May  1 13:19 .config
drwxr-xr-x 4 root root 4096 May  2 23:44 Downloads
drwxr-xr-x 1 root root 4096 May  1 13:20 sample_data


> ## Project Structure and Libraries
>

> ### Loading libraries
>

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
from os import path
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
import plotly.graph_objects as go

> ### Loading datasets
> Data **collection**

Checking whether the datasets already exist.

In [None]:
ds_head_brain_exists = path.exists("./Downloads/dados_headbrain/dados_headbrain.csv")
ds_head_brain_exists

True

In [None]:
ds_patients_exists = path.exists("./Downloads/dados_pacientes/dados_pacientes.csv")
ds_patients_exists

True

Downloading and extracting the dataset if it doesn't exist.

In [None]:
if not ds_head_brain_exists:
  !wget https://leandrolessa.com.br/wp-content/uploads/2024/04/dados_headbrain.zip -P ./Downloads

In [None]:
if not ds_patients_exists:
  !wget https://leandrolessa.com.br/wp-content/uploads/2024/02/dados_pacientes.zip -P ./Downloads
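
The download and extraction steps in this section can also be expressed with Python's standard library instead of shell commands. The sketch below is one possible way to do it, not the method this notebook uses; `fetch_and_extract` is a hypothetical helper name, and the usage comment reuses the URL and paths shown in the cells of this section.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve


def fetch_and_extract(url: str, zip_path: Path, target_dir: Path) -> list[str]:
    """Download `url` to `zip_path` if it is missing, then extract it.

    Hypothetical helper: returns the names of the extracted members.
    """
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    if not zip_path.exists():
        urlretrieve(url, str(zip_path))  # network access happens only here
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Possible usage (not executed here because it needs network access):
# fetch_and_extract(
#     "https://leandrolessa.com.br/wp-content/uploads/2024/04/dados_headbrain.zip",
#     Path("Downloads/dados_headbrain.zip"),
#     Path("Downloads/dados_headbrain"),
# )
```

Creating the folders with `exist_ok=True` keeps the helper safe to re-run, which is the same goal as the existence checks above.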

Checking whether the directories were created.

In [None]:
!ls -la Downloads/

total 108
drwxr-xr-x 4 root root  4096 May  2 23:44 .
drwxr-xr-x 1 root root  4096 May  2 23:44 ..
drwxr-xr-x 2 root root  4096 May  2 23:44 dados_headbrain
-rw-r--r-- 1 root root  1931 Apr 22 03:34 dados_headbrain.zip
drwxr-xr-x 2 root root  4096 May  2 23:44 dados_pacientes
-rw-r--r-- 1 root root 89108 Feb 26 21:14 dados_pacientes.zip


Unzipping the head-brain archive.

In [None]:
!unzip -o ./Downloads/dados_headbrain.zip -d ./Downloads/dados_headbrain/

Archive:  ./Downloads/dados_headbrain.zip
  inflating: ./Downloads/dados_headbrain/dados_headbrain.csv  


Listing the contents of the directory created by the unzip step.

In [None]:
!ls -la ./Downloads/dados_headbrain/

total 16
drwxr-xr-x 2 root root 4096 May  3 01:13 .
drwxr-xr-x 4 root root 4096 May  2 23:44 ..
-rw-r--r-- 1 root root 7773 Apr 21 22:14 dados_headbrain.csv


Unzipping the patients archive.

In [None]:
!unzip -o ./Downloads/dados_pacientes.zip -d ./Downloads/dados_pacientes/

Archive:  ./Downloads/dados_pacientes.zip
  inflating: ./Downloads/dados_pacientes/dados_pacientes.csv  


Listing the contents of the directory created by the unzip step.

In [None]:
!ls -la ./Downloads/dados_pacientes/

total 624
drwxr-xr-x 2 root root   4096 May  3 01:13 .
drwxr-xr-x 4 root root   4096 May  2 23:44 ..
-rw-r--r-- 1 root root 628519 Feb 26 14:38 dados_pacientes.csv


Creating a pandas DataFrame for the head-brain dataset.

In [None]:
df_head_brain = pd.read_csv(
    filepath_or_buffer="./Downloads/dados_headbrain/dados_headbrain.csv",
    sep=";",
    header="infer",
)
df_head_brain.head(n=5)

Unnamed: 0,cod_paciente,genero,Head Size(cm^3),Brain Weight(grams)
0,1,Masculino,4512,1530
1,2,Masculino,3738,1297
2,3,Masculino,4261,1335
3,4,Masculino,3777,1282
4,5,Masculino,4177,1590


Creating a pandas DataFrame for the patients dataset.

In [None]:
df_patients = pd.read_csv(
    filepath_or_buffer="./Downloads/dados_pacientes/dados_pacientes.csv",
    sep=";",
    header="infer",
)
df_patients.head(n=5)

Unnamed: 0,id_cliente,idade,classe_trabalho,escolaridade,id_estado,estado_civil,raca,qtde_filhos,salario
0,1,39,Servidor Público,Ensino Médio Completo,2,Solteiro,Branco,2.0,4754
1,2,50,Autônomo,Superior Incompleto,24,Casado,Branco,1.0,3923
2,3,38,Funcionário Setor Privado,Ensino Médio Incompleto,4,Divorciado,Branco,0.0,1100
3,4,53,Funcionário Setor Privado,Ensino Médio Incompleto,24,Casado,Negro,1.0,1100
4,5,28,Funcionário Setor Privado,Ensino Médio Completo,15,Casado,Negro,0.0,3430


> ## Exploratory Data Analysis (EDA)
>

> ### Data Overview
>

In [None]:
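def data_overview(df: pd.DataFrame) -> None:
  """Print a quick summary of a DataFrame."""
  print(f"> Head\n{df.head(n=5)}", end="\n\n\n")
  # df.info() prints its report directly and returns None,
  # so the report appears before the "> Info" label and the label shows None.
  print(f"> Info\n{df.info()}", end="\n\n\n")
  print(f"> Shape\n{df.shape}", end="\n\n\n")
  # isna() and isnull() are aliases, so the next two summaries are identical.
  print(f"> Is NAN\n{df.isna().sum()}", end="\n\n\n")
  print(f"> Is null\n{df.isnull().sum()}", end="\n\n\n")
  print(f"> Describe\n{df.describe()}", end="\n\n\n")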
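
`DataFrame.info()` writes its report straight to standard output and returns `None`, so an f-string built around it shows `None` under the `> Info` label. A minimal sketch of one way to capture the report as text, using the `buf` parameter of `info`; `info_as_text` is a hypothetical helper name and the tiny frame is made-up sample data.

```python
import io

import pandas as pd


def info_as_text(df: pd.DataFrame) -> str:
    """Return the text that df.info() would normally print."""
    buffer = io.StringIO()
    df.info(buf=buffer)  # redirect the report into the in-memory buffer
    return buffer.getvalue()


# Made-up sample frame, only to exercise the helper.
sample = pd.DataFrame({"cod_paciente": [1, 2], "genero": ["Masculino", "Feminino"]})
report = info_as_text(sample)
print(f"> Info\n{report}")
```

With the report held in a string, the label and its content print together and in order.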

In [None]:
data_overview(df=df_head_brain)

> Head
   cod_paciente     genero  Head Size(cm^3)  Brain Weight(grams)
0             1  Masculino             4512                 1530
1             2  Masculino             3738                 1297
2             3  Masculino             4261                 1335
3             4  Masculino             3777                 1282
4             5  Masculino             4177                 1590


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319 entries, 0 to 318
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   cod_paciente         319 non-null    int64 
 1   genero               319 non-null    object
 2   Head Size(cm^3)      319 non-null    int64 
 3   Brain Weight(grams)  319 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 10.1+ KB
> Info
None


> Shape
(319, 4)


> Is NAN
cod_paciente           0
genero                 0
Head Size(cm^3)        0
Brain Weight(grams)    0
dtype: int

⚠ Note that the *cod_paciente* variable is numeric; it would be better if it were a string.

⚠ Note that the shape is 319 rows (observations) and 4 columns (variables).

⚠ Note that the dataset has no null or NaN values.

⚠ Note that the _Head Size(cm^3)_ and _Brain Weight(grams)_ column names contain white space and special characters, which can make them awkward to handle in the analysis.

In [None]:
data_overview(df=df_patients)

> Head
   id_cliente  idade            classe_trabalho             escolaridade  \
0           1     39           Servidor Público    Ensino Médio Completo   
1           2     50                   Autônomo      Superior Incompleto   
2           3     38  Funcionário Setor Privado  Ensino Médio Incompleto   
3           4     53  Funcionário Setor Privado  Ensino Médio Incompleto   
4           5     28  Funcionário Setor Privado    Ensino Médio Completo   

   id_estado estado_civil    raca  qtde_filhos  salario  
0          2     Solteiro  Branco          2.0     4754  
1         24       Casado  Branco          1.0     3923  
2          4   Divorciado  Branco          0.0     1100  
3         24       Casado   Negro          1.0     1100  
4         15       Casado   Negro          0.0     3430  


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7999 entries, 0 to 7998
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           ------------

⚠ Note that the _id_cliente_ and _id_estado_ variables are numeric; it would be better if they were strings.

⚠ Note that the _qtde_filhos_ variable is a float; it would be better as an integer, because there are no fractional children.

⚠ Note that the _salario_ variable is an integer; it would be better as a float.

⚠ Note that the _classe_trabalho_ and _qtde_filhos_ variables contain NaN (null) values.
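
A float column that holds only whole numbers is usually a side effect of missing values, since NumPy integers cannot represent NaN. As a sketch, pandas' nullable `Int64` dtype can keep integers and missing values in the same column; the series below is made-up sample data, not taken from the dataset.

```python
import pandas as pd

# Made-up sample mimicking qtde_filhos: whole numbers plus one missing value.
children = pd.Series([2.0, 1.0, None, 0.0], name="qtde_filhos")
print(children.dtype)  # float64, because NaN forces a float column

# The nullable integer dtype stores the gap as <NA> instead of NaN.
children_int = children.astype("Int64")
print(children_int.dtype)
print(children_int.isna().sum())
```

This is an alternative to dropping the rows first and casting afterwards; which option fits better depends on how many rows are affected.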

> ### Data cleaning
>

In [None]:
df_patients.loc[df_patients["classe_trabalho"].isna()]

Unnamed: 0,id_cliente,idade,classe_trabalho,escolaridade,id_estado,estado_civil,raca,qtde_filhos,salario
61,62,32,,Ensino Fundamental Incompleto,12,União Estável,Branco,4.0,1100
69,70,25,,Ensino Médio Completo,1,Solteiro,Branco,3.0,3450
127,128,35,,Ensino Médio Incompleto,1,Casado,Amarelo,4.0,1100
148,149,43,,Ensino Médio Completo,21,Divorciado,Branco,0.0,2490
153,154,52,,Ensino Médio Incompleto,18,Divorciado,Branco,1.0,1100
...,...,...,...,...,...,...,...,...,...
7931,7932,51,,Ensino Médio Completo,4,Divorciado,Branco,2.0,2660
7952,7953,19,,Ensino Médio Completo,1,Solteiro,Branco,0.0,2581
7960,7961,30,,Ensino Fundamental Completo,10,Solteiro,Branco,4.0,1100
7988,7989,20,,Ensino Médio Completo,1,Solteiro,Branco,2.0,3610


In [None]:
df_patients[df_patients["classe_trabalho"].isnull()]

Unnamed: 0,id_cliente,idade,classe_trabalho,escolaridade,id_estado,estado_civil,raca,qtde_filhos,salario
61,62,32,,Ensino Fundamental Incompleto,12,União Estável,Branco,4.0,1100
69,70,25,,Ensino Médio Completo,1,Solteiro,Branco,3.0,3450
127,128,35,,Ensino Médio Incompleto,1,Casado,Amarelo,4.0,1100
148,149,43,,Ensino Médio Completo,21,Divorciado,Branco,0.0,2490
153,154,52,,Ensino Médio Incompleto,18,Divorciado,Branco,1.0,1100
...,...,...,...,...,...,...,...,...,...
7931,7932,51,,Ensino Médio Completo,4,Divorciado,Branco,2.0,2660
7952,7953,19,,Ensino Médio Completo,1,Solteiro,Branco,0.0,2581
7960,7961,30,,Ensino Fundamental Completo,10,Solteiro,Branco,4.0,1100
7988,7989,20,,Ensino Médio Completo,1,Solteiro,Branco,2.0,3610


In [None]:
qtt_obs_na_null = df_patients[df_patients["classe_trabalho"].isna()].shape[0]

print("{:.2f}%".format(((qtt_obs_na_null / len(df_patients)) * 100)))

4.84%


Note that the observations (rows) with NaN in the _classe_trabalho_ variable make up only 4.84% of the dataset.

So, I will remove these rows.

In [None]:
df_patients.dropna(how="all", subset=["classe_trabalho"], inplace=True)

In [None]:
df_patients.isna().sum()

id_cliente          0
idade               0
classe_trabalho     0
escolaridade        0
id_estado           0
estado_civil        0
raca                0
qtde_filhos        10
salario             0
dtype: int64

The NaN values in the _classe_trabalho_ variable have been removed.
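
The per-column percentage computed above can be generalized to every column at once. The sketch below is one way to do that; `missing_share` is a hypothetical helper and the frame is made-up sample data.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Made-up sample with gaps in two columns.
sample = pd.DataFrame(
    {
        "classe_trabalho": ["MEI", None, "Autônomo", None],
        "qtde_filhos": [2.0, 1.0, None, 0.0],
        "salario": [1100, 3450, 2490, 2660],
    }
)
shares = missing_share(sample)
print(shares.round(2))
```

Taking the mean of the boolean mask returned by `isna()` gives the fraction of missing entries directly, which avoids dividing counts by `len(df)` by hand.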

In [None]:
df_patients[df_patients["qtde_filhos"].isna()]

Unnamed: 0,id_cliente,idade,classe_trabalho,escolaridade,id_estado,estado_civil,raca,qtde_filhos,salario
12,13,23,Funcionário Setor Privado,Pós-Graduado,22,Solteiro,Branco,,5316
20,21,40,Funcionário Setor Privado,Doutorado,8,Casado,Branco,,12535
189,190,38,Funcionário Setor Privado,Ensino Médio Completo,27,Divorciado,Negro,,2369
233,234,28,Funcionário Setor Privado,Mestrado,26,Solteiro,Branco,,10671
295,296,37,MEI,Ensino Médio Completo,18,Divorciado,Branco,,4300
358,359,26,Autônomo,Ensino Fundamental Completo,24,Casado,Branco,,1100
359,360,36,Funcionário Setor Privado,Ensino Fundamental Completo,11,Casado,Branco,,1100
360,361,62,Funcionário Setor Privado,Mestrado,7,Casado,Branco,,6072
362,363,43,Funcionário Setor Privado,Ensino Médio Completo,12,Divorciado,Branco,,2240
402,403,28,Funcionário Setor Privado,Ensino Fundamental Incompleto,27,Casado,Amarelo,,1100


In [None]:
qtt_obs_na_null = df_patients[df_patients["qtde_filhos"].isna()].shape[0]

print("{:.2f}%".format((qtt_obs_na_null / len(df_patients)) * 100))

0.13%


Note that the observations (rows) with NaN in the _qtde_filhos_ variable make up only 0.13% of the dataset.

So, I will remove these rows.

In [None]:
df_patients.dropna(how="all", subset=["qtde_filhos"], inplace=True)

In [None]:
df_patients.isna().sum()

id_cliente         0
idade              0
classe_trabalho    0
escolaridade       0
id_estado          0
estado_civil       0
raca               0
qtde_filhos        0
salario            0
dtype: int64

> ### Data wrangling
>

Renaming columns for easier handling.

In [None]:
df_head_brain.columns

Index(['cod_paciente', 'genero', 'Head Size(cm^3)', 'Brain Weight(grams)'], dtype='object')

In [None]:
df_head_brain.rename(
    columns={
        "Head Size(cm^3)": "head_size_cm_cubic",
        "Brain Weight(grams)": "brain_weight_gr"
    },
    inplace=True
  )

In [None]:
df_head_brain.columns

Index(['cod_paciente', 'genero', 'head_size_cm_cubic', 'brain_weight_gr'], dtype='object')

Changing data types

In [None]:
df_head_brain["cod_paciente"] = df_head_brain["cod_paciente"].astype(dtype=np.str_)

In [None]:
df_head_brain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319 entries, 0 to 318
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   cod_paciente        319 non-null    object
 1   genero              319 non-null    object
 2   head_size_cm_cubic  319 non-null    int64 
 3   brain_weight_gr     319 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 10.1+ KB


In [None]:
df_patients = df_patients.astype(dtype={
    "id_cliente": np.str_,
    "id_estado": np.str_,
    "qtde_filhos": np.int16,
    "salario": np.float64
})

In [None]:
df_patients.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7602 entries, 0 to 7998
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id_cliente       7602 non-null   object 
 1   idade            7602 non-null   int64  
 2   classe_trabalho  7602 non-null   object 
 3   escolaridade     7602 non-null   object 
 4   id_estado        7602 non-null   object 
 5   estado_civil     7602 non-null   object 
 6   raca             7602 non-null   object 
 7   qtde_filhos      7602 non-null   int16  
 8   salario          7602 non-null   float64
dtypes: float64(1), int16(1), int64(1), object(6)
memory usage: 549.4+ KB


> ### Data visualization
>

In [None]:
df_head_brain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319 entries, 0 to 318
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   cod_paciente        319 non-null    object
 1   genero              319 non-null    object
 2   head_size_cm_cubic  319 non-null    int64 
 3   brain_weight_gr     319 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 10.1+ KB


In [None]:
df_head_brain["genero"].value_counts().reset_index()

Unnamed: 0,genero,count
0,Masculino,198
1,Feminino,121


In [None]:
fig = px.bar(
    df_head_brain["genero"].value_counts().reset_index(),
    x="count",
    y="genero",
    orientation='h',
    color="genero",
    title=""

  )
fig.show()

In [None]:
fig = px.histogram(
    data_frame=df_head_brain,
    x="head_size_cm_cubic"
)
fig.show()

In [None]:
fig = px.histogram(
    data_frame=df_head_brain,
    x="brain_weight_gr"
)
fig.show()

In [None]:
fig = px.scatter(
    data_frame=df_head_brain,
    x="head_size_cm_cubic",
    y="brain_weight_gr"
)
fig.show()

In [None]:
df_patients.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7602 entries, 0 to 7998
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id_cliente       7602 non-null   object 
 1   idade            7602 non-null   int64  
 2   classe_trabalho  7602 non-null   object 
 3   escolaridade     7602 non-null   object 
 4   id_estado        7602 non-null   object 
 5   estado_civil     7602 non-null   object 
 6   raca             7602 non-null   object 
 7   qtde_filhos      7602 non-null   int16  
 8   salario          7602 non-null   float64
dtypes: float64(1), int16(1), int64(1), object(6)
memory usage: 549.4+ KB


In [None]:
df_patients["classe_trabalho"].value_counts().reset_index()

Unnamed: 0,classe_trabalho,count
0,Funcionário Setor Privado,5374
1,Autônomo,587
2,Empresário,496
3,Servidor Público,299
4,MEI,274
5,Aposentado,267
6,Funcionário Público,216
7,Menor Aprendiz,85
8,Desempregado,3
9,Estagiário,1


In [None]:
fig = px.bar(
    data_frame=df_patients["classe_trabalho"].value_counts().reset_index(),
    x="count",
    y="classe_trabalho",
    orientation='h',
    color="classe_trabalho",
)
fig.show()

In [None]:
df_patients["escolaridade"].value_counts().reset_index()

Unnamed: 0,escolaridade,count
0,Ensino Médio Completo,3520
1,Ensino Médio Incompleto,2823
2,Mestrado,543
3,Ensino Fundamental Completo,349
4,Ensino Fundamental Incompleto,249
5,Doutorado,90
6,Analfabeto,10
7,Superior Incompleto,7
8,Superior Completo,6
9,Pós-Graduado,5


In [None]:
fig = px.bar(
    data_frame=df_patients["escolaridade"].value_counts().reset_index(),
    x="count",
    y="escolaridade",
    orientation='h',
    color="escolaridade",
)
fig.show()

In [None]:
df_patients["estado_civil"].value_counts().reset_index()

Unnamed: 0,estado_civil,count
0,Casado,3561
1,Solteiro,2433
2,Divorciado,1052
3,Separado,243
4,Viúvo,213
5,União Estável,100


In [None]:
fig = px.bar(
    data_frame=df_patients["estado_civil"].value_counts().reset_index(),
    x="count",
    y="estado_civil",
    orientation='h',
    color="estado_civil",
)
fig.show()

In [None]:
df_patients["raca"].value_counts().reset_index()

Unnamed: 0,raca,count
0,Branco,6499
1,Negro,735
2,Pardo,230
3,Indígena,73
4,Amarelo,65


In [None]:
fig = px.bar(
    data_frame=df_patients["raca"].value_counts().reset_index(),
    x="count",
    y="raca",
    orientation='h',
    color="raca",
)
fig.show()

In [None]:
fig = px.histogram(
    data_frame=df_patients,
    x="qtde_filhos",
    nbins=10
)
fig.show()

In [None]:
fig = px.histogram(
    data_frame=df_patients,
    x="salario"
)
fig.show()

In [None]:
fig = px.scatter(
    data_frame=df_patients,
    x="salario",
    y="qtde_filhos",

)
fig.show()

In [None]:
fig = px.violin(
    data_frame=df_patients,
    x="qtde_filhos",
    y="salario",
    #color="raca",
    #box=True, # draw box plot inside the violin
    #points='all', # can be 'outliers', or False
  )
fig.show()

> ## Feature engineering
>

> ## Pre-processing
>

In [None]:
df_head_brain.head(n=5)

Unnamed: 0,cod_paciente,genero,head_size_cm_cubic,brain_weight_gr
0,1,Masculino,4512,1530
1,2,Masculino,3738,1297
2,3,Masculino,4261,1335
3,4,Masculino,3777,1282
4,5,Masculino,4177,1590


In [None]:
le = LabelEncoder()
df_head_brain["le_genero"] = le.fit_transform(df_head_brain["genero"])
df_head_brain.sample(n=5)

Unnamed: 0,cod_paciente,genero,head_size_cm_cubic,brain_weight_gr,le_genero
233,234,Feminino,3394,1215,0
74,75,Masculino,3824,1240,1
4,5,Masculino,4177,1590,1
105,106,Masculino,3648,1260,1
152,153,Feminino,3680,1321,0


Now the _Feminino_ category is encoded as 0 and the _Masculino_ category as 1.
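
`LabelEncoder` assigns codes following the sorted order of the class labels, which is why _Feminino_ receives 0. A short sketch of how the mapping can be read back from the fitted encoder's `classes_` attribute, using a made-up list of labels rather than the real column:

```python
from sklearn.preprocessing import LabelEncoder

genders = ["Masculino", "Feminino", "Masculino", "Feminino"]  # made-up sample

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# classes_ holds the labels in sorted order; the position is the code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoder.inverse_transform([0, 1]))
```

Reading `classes_` avoids inferring the mapping from sampled rows. Note that refitting the same encoder on another column replaces `classes_`, so keeping one encoder per column would be needed to preserve every mapping.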

In [None]:
df_patients.head(n=5)

Unnamed: 0,id_cliente,idade,classe_trabalho,escolaridade,id_estado,estado_civil,raca,qtde_filhos,salario
0,1,39,Servidor Público,Ensino Médio Completo,2,Solteiro,Branco,2,4754.0
1,2,50,Autônomo,Superior Incompleto,24,Casado,Branco,1,3923.0
2,3,38,Funcionário Setor Privado,Ensino Médio Incompleto,4,Divorciado,Branco,0,1100.0
3,4,53,Funcionário Setor Privado,Ensino Médio Incompleto,24,Casado,Negro,1,1100.0
4,5,28,Funcionário Setor Privado,Ensino Médio Completo,15,Casado,Negro,0,3430.0


In [None]:
# Encode each categorical column; fit_transform refits the encoder on every call.
for col in ["classe_trabalho", "escolaridade", "estado_civil", "raca"]:
    df_patients[f"le_{col}"] = le.fit_transform(df_patients[col])

df_patients.sample(n=10)

Unnamed: 0,id_cliente,idade,classe_trabalho,escolaridade,id_estado,estado_civil,raca,qtde_filhos,salario,le_classe_trabalho,le_escolaridade,le_estado_civil,le_raca
1000,1001,54,Autônomo,Ensino Médio Incompleto,18,Casado,Branco,4,1100.0,1,5,0,1
3451,3452,38,Funcionário Setor Privado,Ensino Médio Incompleto,23,Casado,Branco,1,1100.0,6,5,0,1
5357,5358,25,Funcionário Setor Privado,Ensino Médio Completo,12,Solteiro,Branco,5,3519.0,6,4,3,1
1837,1838,45,Funcionário Público,Ensino Médio Incompleto,15,Solteiro,Negro,2,1103.0,5,5,3,3
4004,4005,20,Funcionário Setor Privado,Ensino Médio Completo,21,Solteiro,Branco,0,3079.0,6,4,3,1
6003,6004,28,Funcionário Setor Privado,Ensino Médio Completo,11,Solteiro,Branco,4,3705.0,6,4,3,1
207,208,58,MEI,Ensino Médio Incompleto,6,Casado,Branco,4,4559.0,7,5,0,1
2534,2535,54,Funcionário Setor Privado,Ensino Médio Incompleto,7,Viúvo,Branco,0,1100.0,6,5,5,1
7419,7420,56,Funcionário Setor Privado,Ensino Fundamental Completo,8,Solteiro,Branco,0,1100.0,6,2,3,1
7459,7460,40,Funcionário Setor Privado,Ensino Médio Completo,14,Solteiro,Branco,4,2376.0,6,4,3,1


In [None]:
data = pd.merge(
    left=df_head_brain,
    right=df_patients,
    how="inner",
    left_on="cod_paciente",
    right_on="id_cliente",
)
data.head(n=5)

Unnamed: 0,cod_paciente,genero,head_size_cm_cubic,brain_weight_gr,le_genero,id_cliente,idade,classe_trabalho,escolaridade,id_estado,estado_civil,raca,qtde_filhos,salario,le_classe_trabalho,le_escolaridade,le_estado_civil,le_raca
0,1,Masculino,4512,1530,1,1,39,Servidor Público,Ensino Médio Completo,2,Solteiro,Branco,2,4754.0,9,4,3,1
1,2,Masculino,3738,1297,1,2,50,Autônomo,Superior Incompleto,24,Casado,Branco,1,3923.0,1,9,0,1
2,3,Masculino,4261,1335,1,3,38,Funcionário Setor Privado,Ensino Médio Incompleto,4,Divorciado,Branco,0,1100.0,6,5,1,1
3,4,Masculino,3777,1282,1,4,53,Funcionário Setor Privado,Ensino Médio Incompleto,24,Casado,Negro,1,1100.0,6,5,0,3
4,5,Masculino,4177,1590,1,5,28,Funcionário Setor Privado,Ensino Médio Completo,15,Casado,Negro,0,3430.0,6,4,0,3


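Sorting by patient identifier. Because _cod_paciente_ is now a string, the order is lexicographic (1, 10, 100, ...) rather than numeric.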

In [None]:
data.sort_values(by="cod_paciente", ascending=True, inplace=True, ignore_index=True)
#data.reset_index(inplace=True)
data.head()

Unnamed: 0,cod_paciente,genero,head_size_cm_cubic,brain_weight_gr,le_genero,id_cliente,idade,classe_trabalho,escolaridade,id_estado,estado_civil,raca,qtde_filhos,salario,le_classe_trabalho,le_escolaridade,le_estado_civil,le_raca
0,1,Masculino,4512,1530,1,1,39,Servidor Público,Ensino Médio Completo,2,Solteiro,Branco,2,4754.0,9,4,3,1
1,10,Masculino,3982,1375,1,10,42,Funcionário Setor Privado,Ensino Médio Completo,16,Casado,Branco,5,3798.0,6,4,0,1
2,100,Masculino,3478,1270,1,100,76,Aposentado,Mestrado,9,Casado,Branco,0,2024.0,0,6,0,1
3,100,Masculino,3478,1270,1,100,76,Aposentado,Mestrado,9,Casado,Branco,0,2024.0,0,6,0,1
4,101,Masculino,3495,1218,1,101,44,Funcionário Setor Privado,Ensino Médio Completo,9,Casado,Branco,0,3154.0,6,4,0,1
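
Since _cod_paciente_ was cast to string, `sort_values` orders it lexicographically, and the merged frame also shows patient 100 twice. A hedged sketch of how both points could be handled with `drop_duplicates` and `sort_values(key=...)`, on a few rows modeled on the output above; whether the repeated rows should really be dropped is an assumption that depends on the source data.

```python
import pandas as pd

# Sample rows modeled on the merged output, including the repeated patient 100.
sample = pd.DataFrame(
    {
        "cod_paciente": ["1", "10", "100", "100", "2"],
        "head_size_cm_cubic": [4512, 3982, 3478, 3478, 3738],
    }
)

# Remove repeats of the same patient, keeping the first occurrence.
deduplicated = sample.drop_duplicates(subset="cod_paciente", keep="first")

# Sort by the numeric value of the ID while keeping the column as string.
ordered = deduplicated.sort_values(
    by="cod_paciente",
    key=lambda ids: ids.astype(int),
    ignore_index=True,
)
print(ordered)
```

Duplicated rows matter for clustering because each copy pulls the cluster center toward itself, so it is worth checking them before fitting.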


> ## Data Modeling
>

> ### Model 1

> #### Building model

In [None]:
data_to_model = data[["head_size_cm_cubic", "brain_weight_gr"]]
data_to_model.head()

Unnamed: 0,head_size_cm_cubic,brain_weight_gr
0,4512,1530
1,3982,1375
2,3478,1270
3,3478,1270
4,3495,1218


> #### Fitting model

Estimating the number of clusters needed, using the within-cluster sum of squares (WCSS).

In [None]:
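def calculate_wcss(df: pd.DataFrame) -> list:
  """Return the WCSS (inertia) of K-Means for k = 1..10."""
  wcss = []
  for k in range(1, 11):
    # No random_state is set, so the values can vary slightly between runs.
    kmeans = KMeans(n_clusters=k, n_init="auto")
    kmeans.fit(X=df)
    wcss.append(kmeans.inertia_)

  return wcss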

In [None]:
wcss_to_model = calculate_wcss(df=data_to_model)

In [None]:
for k, value in enumerate(wcss_to_model, start=1):
  print(f"k = {k}: WCSS = {value}")

k = 1: WCSS = 41802610.35099338
k = 2: WCSS = 15812104.560796317
k = 3: WCSS = 9004712.027605243
k = 4: WCSS = 6148569.157782188
k = 5: WCSS = 4599789.713283785
k = 6: WCSS = 3641465.4154849686
k = 7: WCSS = 3370039.6387988077
k = 8: WCSS = 2582634.926365639
k = 9: WCSS = 2423585.7210448487
k = 10: WCSS = 2056195.921896149


In [None]:
fig = px.line(
    x=range(1, 11),
    y=wcss_to_model
)
fig1 = go.Figure(fig)
fig1.update_layout(
    title="Measuring the WCSS",
    xaxis_title="Number of the clusters",
    yaxis_title="WCSS value",
    template="plotly_white"
)
fig1.show()
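
The `inertia_` value collected in `calculate_wcss` is the within-cluster sum of squares: the sum of squared distances from each point to the center of its assigned cluster. A minimal pure-Python sketch of that quantity, with made-up points and a hypothetical `wcss` helper:

```python
def wcss(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its center."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Made-up data: two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# One cluster: every point is measured against the global mean (5, 1).
one_cluster = wcss(points, [0, 0, 0, 0], [(5.0, 1.0)])

# Two clusters: each group gets its own center, so the total drops sharply.
two_clusters = wcss(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])

print(one_cluster, two_clusters)
```

Because WCSS generally keeps shrinking as more clusters are added, the curve is read for the point where the decrease flattens, not for its minimum.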

> #### Getting cluster labels

After training the model, predict the cluster of each observation.

In [None]:
kmeans_head_brain = KMeans(n_clusters=3, random_state=42, init="k-means++", n_init="auto")
data["cluster"] = kmeans_head_brain.fit_predict(data_to_model)

In [None]:
data.head()

Unnamed: 0,cod_paciente,genero,head_size_cm_cubic,brain_weight_gr,le_genero,id_cliente,idade,classe_trabalho,escolaridade,id_estado,estado_civil,raca,qtde_filhos,salario,le_classe_trabalho,le_escolaridade,le_estado_civil,le_raca,cluster
0,1,Masculino,4512,1530,1,1,39,Servidor Público,Ensino Médio Completo,2,Solteiro,Branco,2,4754.0,9,4,3,1,1
1,10,Masculino,3982,1375,1,10,42,Funcionário Setor Privado,Ensino Médio Completo,16,Casado,Branco,5,3798.0,6,4,0,1,1
2,100,Masculino,3478,1270,1,100,76,Aposentado,Mestrado,9,Casado,Branco,0,2024.0,0,6,0,1,2
3,100,Masculino,3478,1270,1,100,76,Aposentado,Mestrado,9,Casado,Branco,0,2024.0,0,6,0,1,2
4,101,Masculino,3495,1218,1,101,44,Funcionário Setor Privado,Ensino Médio Completo,9,Casado,Branco,0,3154.0,6,4,0,1,2


Getting the cluster centers.

In [None]:
centers_head_brain = kmeans_head_brain.cluster_centers_
centers_head_brain

array([[3265.54545455, 1174.        ],
       [4168.35087719, 1413.54385965],
       [3704.67123288, 1308.13013699]])
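
K-Means label IDs are arbitrary, so the size category of each cluster has to be read from the centers themselves. A small sketch that ranks the three centers printed above by head size; `SIZE_NAMES` and `name_clusters` are hypothetical names, and the helper assumes exactly three clusters.

```python
# Centers copied from the output above: (head size in cm^3, brain weight in g).
centers = [
    (3265.54545455, 1174.0),
    (4168.35087719, 1413.54385965),
    (3704.67123288, 1308.13013699),
]

# Assumption: three clusters, named from the smallest to the largest head size.
SIZE_NAMES = ["Small size", "Medium size", "Large size"]


def name_clusters(cluster_centers):
    """Map each cluster ID to a size name by ascending head size."""
    ranked_ids = sorted(
        range(len(cluster_centers)), key=lambda i: cluster_centers[i][0]
    )
    return {cluster_id: SIZE_NAMES[rank] for rank, cluster_id in enumerate(ranked_ids)}


print(name_clusters(centers))
```

Deriving the names from the centers keeps them correct even if a different `random_state` shuffles the label IDs.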

In [None]:
fig = px.scatter(
    x=data["head_size_cm_cubic"],
    y=data["brain_weight_gr"],
    color=data["cluster"]
)
fig1 = px.scatter(
    x=centers_head_brain[:,0],
    y=centers_head_brain[:,1],
    size=[7, 7, 7]
)
fig2 = go.Figure(
    data=fig.data + fig1.data
)
fig2.update_layout(
    title="Predicted clusters",
    xaxis_title="Head size (cm^3)",
    yaxis_title="Brain weight (g)",
    template="plotly_white"
)
fig2.show()

**Cluster definitions**

| ID_CLUSTER | CATEGORY | DESCRIPTION |
| :--: | -- | :-- |
| 0 | Small size | Patients with the smallest head sizes and the lowest brain weights (center at about 3,266 cm³ and 1,174 g). |
| 1 | Large size | Patients with the largest head sizes and the highest brain weights (center at about 4,168 cm³ and 1,414 g). |
| 2 | Medium size | Patients with intermediate head sizes and brain weights (center at about 3,705 cm³ and 1,308 g). |

> #### Evaluation

> #### Visualization

Checking the distribution of the _idade_ variable.

In [None]:
fig = px.histogram(
    data_frame=data,
    x="idade",
    nbins=20,
    title="'idade' Histogram",
)
fig.update_layout(
    xaxis_title="Idade",
    yaxis_title="Frequency",
)
fig.show()

In [None]:
fig = px.bar(
    data["genero"].value_counts().reset_index(),
    x="count",
    y="genero",
    orientation='h',
    color="genero",
    title=""

  )
fig.show()

In [None]:
fig = px.histogram(
    data_frame=data[(data["le_genero"] == 0) & (data["cluster"] == 0)],
    x="idade",
    nbins=15
)
fig.show()

> #### Interpretation and Conclusion

In [None]:
data.groupby(by=[ "cluster"]).describe()

Unnamed: 0_level_0,head_size_cm_cubic,head_size_cm_cubic,head_size_cm_cubic,head_size_cm_cubic,head_size_cm_cubic,head_size_cm_cubic,head_size_cm_cubic,head_size_cm_cubic,brain_weight_gr,brain_weight_gr,...,le_estado_civil,le_estado_civil,le_raca,le_raca,le_raca,le_raca,le_raca,le_raca,le_raca,le_raca
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
cluster,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
0,99.0,3265.545455,167.889314,2720.0,3170.5,3297.0,3394.0,3490.0,99.0,1174.0,...,3.0,5.0,99.0,1.222222,0.942809,0.0,1.0,1.0,1.0,4.0
1,57.0,4168.350877,173.619194,3962.0,4036.0,4121.0,4253.0,4747.0,57.0,1413.54386,...,3.0,5.0,57.0,1.385965,0.860946,1.0,1.0,1.0,1.0,4.0
2,146.0,3704.671233,137.177414,3478.0,3582.75,3690.5,3831.5,3937.0,146.0,1308.130137,...,3.0,5.0,146.0,1.383562,0.926694,0.0,1.0,1.0,1.0,4.0


In order, the number of patients per cluster (from largest to smallest) is:
- cluster 2: 146;
- cluster 0: 99;
- cluster 1: 57;

In [None]:
data[["head_size_cm_cubic", "brain_weight_gr"]].corr()

Unnamed: 0,head_size_cm_cubic,brain_weight_gr
head_size_cm_cubic,1.0,0.791284
brain_weight_gr,0.791284,1.0


There is a strong positive correlation (r ≈ 0.79) between the _head_size_cm_cubic_ and _brain_weight_gr_ variables.
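
The value reported by `DataFrame.corr()` is, by default, the Pearson correlation coefficient. As a sketch of what that number measures, the function below computes it from scratch on the five head-size and brain-weight pairs printed near the top of the notebook; `pearson_r` is a hypothetical helper, and a five-row sample is not expected to reproduce the full-dataset value.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# First five rows of the head-brain dataset, as printed earlier.
head_size = [4512, 3738, 4261, 3777, 4177]
brain_weight = [1530, 1297, 1335, 1282, 1590]

print(round(pearson_r(head_size, brain_weight), 3))
```

The coefficient ranges from -1 to 1; values near 1 mean that larger head sizes tend to come with heavier brains in a roughly linear way.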

In [None]:
data.groupby(by=["genero", "cluster"]).describe()[["head_size_cm_cubic", "brain_weight_gr"]]

Unnamed: 0_level_0,Unnamed: 1_level_0,head_size_cm_cubic,head_size_cm_cubic,head_size_cm_cubic,head_size_cm_cubic,head_size_cm_cubic,head_size_cm_cubic,head_size_cm_cubic,head_size_cm_cubic,brain_weight_gr,brain_weight_gr,brain_weight_gr,brain_weight_gr,brain_weight_gr,brain_weight_gr,brain_weight_gr,brain_weight_gr
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
genero,cluster,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2
Feminino,0,65.0,3209.415385,172.91995,2720.0,3145.0,3233.0,3323.0,3479.0,65.0,1152.446154,75.839681,955.0,1103.0,1160.0,1220.0,1322.0
Feminino,1,6.0,4065.5,107.719543,3979.0,3997.5,4005.5,4154.5,4204.0,6.0,1327.0,43.409676,1280.0,1297.5,1313.0,1366.0,1380.0
Feminino,2,39.0,3667.74359,110.592357,3493.0,3571.5,3685.0,3735.0,3903.0,39.0,1306.0,78.28356,1127.0,1250.0,1305.0,1350.0,1520.0
Masculino,0,34.0,3372.852941,87.443534,3095.0,3330.25,3392.5,3412.75,3490.0,34.0,1215.205882,65.538691,1120.0,1173.0,1217.5,1269.25,1340.0
Masculino,1,51.0,4180.45098,176.583613,3962.0,4046.0,4121.0,4265.5,4747.0,51.0,1423.72549,91.997408,1256.0,1363.0,1422.0,1485.0,1635.0
Masculino,2,107.0,3718.130841,143.777415,3478.0,3589.0,3700.0,3850.0,3937.0,107.0,1308.906542,80.514564,1165.0,1251.0,1297.0,1348.0,1588.0


Descriptive statistics for the _head_size_cm_cubic_ and _brain_weight_gr_ variables, grouped by gender and cluster.

In [None]:
data[data["cluster"] == 0]["idade"].mean()

41.54545454545455

The mean of the _idade_ variable in cluster 0 (Small size).

In [None]:
data[(data["le_genero"] == 0) & (data["cluster"] == 0)].shape

(65, 19)

The shape of the subset where _le_genero_ equals 0 (Feminino) and the cluster is 0 (Small size).

**Analyze Cluster Characteristics**: Analyze the average head size and brain weight within each cluster.

**Insights and Next Steps**: Discuss the meaning of the clusters and their potential implications. Mention limitations and suggest further exploration (e.g., trying different K values or including additional features).
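
One way to act on the suggestion of trying different K values is to compare silhouette scores, a metric this notebook does not compute. The sketch below is illustrative only: it runs on synthetic blobs rather than the head-brain data, and `best_k_by_silhouette` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(features, k_values, seed=42):
    """Fit K-Means for each k and return (best_k, scores_by_k)."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return max(scores, key=scores.get), scores


# Synthetic data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
features = np.vstack(
    [rng.normal(center, 1.0, size=(50, 2)) for center in blob_centers]
)

best_k, scores = best_k_by_silhouette(features, range(2, 7))
print(best_k)
```

Unlike WCSS, the silhouette score does not improve automatically as k grows, so it can complement the elbow plot when the bend is hard to read.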