# **Analyzing the "Cardiovascular Study Dataset"**
#### Sourced from Kaggle

Dataset link and data description: https://www.kaggle.com/christofel04/cardiovascular-study-dataset-predict-heart-disea?select=train.csv

## **Objectives**
This analysis first aims to determine whether there is a patient profile that may go on to develop coronary heart disease.
Afterwards, a classification algorithm will be applied in an attempt to predict diseases of this kind.

## **Feature Description**

1. Sex: male or female ("M" or "F")
2. Age: age of the patient (Continuous; although the recorded ages have been truncated to whole numbers, the concept of age is continuous)
3. is_smoking: whether or not the patient is a current smoker ("YES" or "NO")
4. Cigs Per Day: the average number of cigarettes the person smoked per day (can be considered continuous, since one can smoke any number of cigarettes, even half a cigarette)
5. BP Meds: whether or not the patient was on blood pressure medication (Nominal)
6. Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
7. Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
8. Diabetes: whether or not the patient had diabetes (Nominal)
9. Tot Chol: total cholesterol level (Continuous)
10. Sys BP: systolic blood pressure (Continuous)
11. Dia BP: diastolic blood pressure (Continuous)
12. BMI: Body Mass Index (Continuous)
13. Heart Rate: heart rate (Continuous; in medical research, variables such as heart rate, although in fact discrete, are treated as continuous because of the large number of possible values)
14. Glucose: glucose level (Continuous)
15. TenYearCHD: 10-year risk of coronary heart disease (CHD) (binary: "1" means "Yes", "0" means "No")

### Initial Setup

In [1]:
# import libraries

import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import chi2
from scipy.stats import chi2_contingency
import matplotlib.pyplot as plt
%matplotlib inline

from pandas_profiling import ProfileReport

In [2]:
# load the study dataset
ds = pd.read_csv('dataset.csv')

In [3]:
# inspect the variables
ds.head()

Unnamed: 0,id,age,education,sex,is_smoking,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,0,64,2.0,F,YES,3.0,0.0,0,0,0,221.0,148.0,85.0,,90.0,80.0,1
1,1,36,4.0,M,NO,0.0,0.0,0,1,0,212.0,168.0,98.0,29.77,72.0,75.0,0
2,2,46,1.0,F,YES,10.0,0.0,0,0,0,250.0,116.0,71.0,20.35,88.0,94.0,0
3,3,50,1.0,M,YES,20.0,0.0,0,1,0,233.0,158.0,88.0,28.26,68.0,94.0,1
4,4,64,1.0,F,YES,30.0,0.0,0,0,0,241.0,136.5,85.0,26.42,70.0,77.0,0


In [4]:
ds.shape

(3390, 17)

In [5]:
# attribute information
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3390 entries, 0 to 3389
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               3390 non-null   int64  
 1   age              3390 non-null   int64  
 2   education        3303 non-null   float64
 3   sex              3390 non-null   object 
 4   is_smoking       3390 non-null   object 
 5   cigsPerDay       3368 non-null   float64
 6   BPMeds           3346 non-null   float64
 7   prevalentStroke  3390 non-null   int64  
 8   prevalentHyp     3390 non-null   int64  
 9   diabetes         3390 non-null   int64  
 10  totChol          3352 non-null   float64
 11  sysBP            3390 non-null   float64
 12  diaBP            3390 non-null   float64
 13  BMI              3376 non-null   float64
 14  heartRate        3389 non-null   float64
 15  glucose          3086 non-null   float64
 16  TenYearCHD       3390 non-null   int64  
dtypes: float64(9),

In [6]:
# summary statistics of the numeric attributes
ds.describe().round(2)

Unnamed: 0,id,age,education,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
count,3390.0,3390.0,3303.0,3368.0,3346.0,3390.0,3390.0,3390.0,3352.0,3390.0,3390.0,3376.0,3389.0,3086.0,3390.0
mean,1694.5,49.54,1.97,9.07,0.03,0.01,0.32,0.03,237.07,132.6,82.88,25.79,75.98,82.09,0.15
std,978.75,8.59,1.02,11.88,0.17,0.08,0.46,0.16,45.25,22.29,12.02,4.12,11.97,24.24,0.36
min,0.0,32.0,1.0,0.0,0.0,0.0,0.0,0.0,107.0,83.5,48.0,15.96,45.0,40.0,0.0
25%,847.25,42.0,1.0,0.0,0.0,0.0,0.0,0.0,206.0,117.0,74.5,23.02,68.0,71.0,0.0
50%,1694.5,49.0,2.0,0.0,0.0,0.0,0.0,0.0,234.0,128.5,82.0,25.38,75.0,78.0,0.0
75%,2541.75,56.0,3.0,20.0,0.0,0.0,1.0,0.0,264.0,144.0,90.0,28.04,83.0,87.0,0.0
max,3389.0,70.0,4.0,70.0,1.0,1.0,1.0,1.0,696.0,295.0,142.5,56.8,143.0,394.0,1.0


In [7]:
# Number of elderly men (age > 59, i.e. 60 or older)
ds[(ds['age'] > 59) & (ds['sex'] == 'M')].count()

id                 232
age                232
education          224
sex                232
is_smoking         232
cigsPerDay         232
BPMeds             231
prevalentStroke    232
prevalentHyp       232
diabetes           232
totChol            231
sysBP              232
diaBP              232
BMI                229
heartRate          231
glucose            222
TenYearCHD         232
dtype: int64

In [8]:
# summary of the nominal attributes
ds.describe(include=['object'])

Unnamed: 0,sex,is_smoking
count,3390,3390
unique,2,2
top,F,NO
freq,1923,1703


### **Preprocessing**

In [9]:
# drop the irrelevant column
ds = ds.drop(['id'], axis=1)

In [10]:
# rename the target variable
ds.rename(columns={'TenYearCHD': 'Cardiopatia'}, inplace=True)

In [11]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3390 entries, 0 to 3389
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              3390 non-null   int64  
 1   education        3303 non-null   float64
 2   sex              3390 non-null   object 
 3   is_smoking       3390 non-null   object 
 4   cigsPerDay       3368 non-null   float64
 5   BPMeds           3346 non-null   float64
 6   prevalentStroke  3390 non-null   int64  
 7   prevalentHyp     3390 non-null   int64  
 8   diabetes         3390 non-null   int64  
 9   totChol          3352 non-null   float64
 10  sysBP            3390 non-null   float64
 11  diaBP            3390 non-null   float64
 12  BMI              3376 non-null   float64
 13  heartRate        3389 non-null   float64
 14  glucose          3086 non-null   float64
 15  Cardiopatia      3390 non-null   int64  
dtypes: float64(9), int64(5), object(2)
memory usage: 423.9+ KB


In [12]:
ds.shape

(3390, 16)

###  Data Adjustment

For the data in this study specifically, variables will not need to be discretized or normalized. For future work, should a model need to be developed, carrying out both processes is recommended.

In [13]:
# inspect the attribute data types
ds.dtypes

age                  int64
education          float64
sex                 object
is_smoking          object
cigsPerDay         float64
BPMeds             float64
prevalentStroke      int64
prevalentHyp         int64
diabetes             int64
totChol            float64
sysBP              float64
diaBP              float64
BMI                float64
heartRate          float64
glucose            float64
Cardiopatia          int64
dtype: object

In [14]:
# skewness of the numeric attributes ('sex' and 'is_smoking' are text columns and are skipped)
ds.skew(numeric_only=True)

age                 0.225796
education           0.698946
cigsPerDay          1.223005
BPMeds              5.524325
prevalentStroke    12.297612
prevalentHyp        0.795189
diabetes            6.001977
totChol             0.940636
sysBP               1.175837
diaBP               0.718173
BMI                 1.022252
heartRate           0.676490
glucose             6.144390
Cardiopatia         1.953182
dtype: float64

#### Checking for missing values

In [15]:
# count missing values per column
print(ds.isnull().sum())

age                  0
education           87
sex                  0
is_smoking           0
cigsPerDay          22
BPMeds              44
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             38
sysBP                0
diaBP                0
BMI                 14
heartRate            1
glucose            304
Cardiopatia          0
dtype: int64


In [16]:
# fill missing values using summary statistics

ds['totChol'] = ds['totChol'].fillna(ds['totChol'].median())
# NOTE: 'education' is filled with the mode of 'BPMeds' (0.0) rather than its own mode;
# ds['education'].mode().iloc[0] was probably intended. 'education' is dropped before the analysis.
ds['education'] = ds['education'].fillna(ds['BPMeds'].mode().iloc[0])
ds['cigsPerDay'] = ds['cigsPerDay'].fillna(ds['cigsPerDay'].median())
ds['BPMeds'] = ds['BPMeds'].fillna(ds['BPMeds'].mode().iloc[0])
ds['BMI'] = ds['BMI'].fillna(ds['BMI'].mean())
ds['glucose'] = ds['glucose'].fillna(ds['glucose'].median())
ds['heartRate'] = ds['heartRate'].fillna(ds['heartRate'].mean())
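
The per-column imputation above can also be expressed as a single `fillna` call with a dictionary. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset), assuming the median is used for continuous columns and the mode for binary ones:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names mirror the dataset.
toy = pd.DataFrame({
    "totChol": [200.0, np.nan, 260.0, 240.0],
    "BPMeds": [0.0, 0.0, np.nan, 1.0],
})

# Map each column to the statistic used to fill its missing values.
fill_values = {
    "totChol": toy["totChol"].median(),      # continuous column: median
    "BPMeds": toy["BPMeds"].mode().iloc[0],  # binary column: most frequent value
}

# fillna with a dict fills each column with its own value and returns a new frame.
toy_filled = toy.fillna(fill_values)
print(toy_filled)
```

Keeping the column-to-statistic mapping in one place makes it harder to pair a column with the wrong source by accident.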

In [17]:
print(ds.isnull().sum())

age                0
education          0
sex                0
is_smoking         0
cigsPerDay         0
BPMeds             0
prevalentStroke    0
prevalentHyp       0
diabetes           0
totChol            0
sysBP              0
diaBP              0
BMI                0
heartRate          0
glucose            0
Cardiopatia        0
dtype: int64


In [18]:
# inspect the summary statistics after imputation
ds.describe()

Unnamed: 0,age,education,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,Cardiopatia
count,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0
mean,49.542183,1.920354,9.010619,0.029499,0.00649,0.315339,0.025664,237.039823,132.60118,82.883038,25.794964,75.977279,81.720059,0.150737
std,8.592878,1.053104,11.862839,0.169224,0.080309,0.464719,0.158153,44.994205,22.29203,12.023581,4.106939,11.970102,23.161265,0.357846
min,32.0,0.0,0.0,0.0,0.0,0.0,0.0,107.0,83.5,48.0,15.96,45.0,40.0,0.0
25%,42.0,1.0,0.0,0.0,0.0,0.0,0.0,206.0,117.0,74.5,23.03,68.0,72.0,0.0
50%,49.0,2.0,0.0,0.0,0.0,0.0,0.0,234.0,128.5,82.0,25.4,75.0,78.0,0.0
75%,56.0,3.0,20.0,0.0,0.0,1.0,0.0,264.0,144.0,90.0,27.9975,83.0,85.0,0.0
max,70.0,4.0,70.0,1.0,1.0,1.0,1.0,696.0,295.0,142.5,56.8,143.0,394.0,1.0


In [19]:
# sort the rows by age
ds.sort_values(by=['age'],axis=0)

Unnamed: 0,age,education,sex,is_smoking,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,Cardiopatia
2642,32,2.0,F,YES,15.0,0.0,0,0,0,242.0,111.0,70.0,29.840000,80.0,88.0,0
518,33,2.0,F,YES,5.0,0.0,0,0,0,200.0,119.0,74.0,23.800000,75.0,74.0,0
193,33,3.0,F,YES,15.0,0.0,0,0,0,199.0,116.0,81.0,21.610000,75.0,93.0,0
440,33,2.0,M,NO,0.0,0.0,0,1,0,165.0,141.5,95.0,26.740000,54.0,77.0,0
3027,33,4.0,M,NO,0.0,0.0,0,0,0,165.0,136.0,75.0,24.950000,88.0,90.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1070,69,1.0,M,YES,4.0,0.0,0,1,0,232.0,151.0,74.0,24.140000,75.0,62.0,0
3175,69,3.0,F,NO,0.0,1.0,0,1,0,203.0,166.0,90.0,25.400000,77.0,80.0,0
1098,69,1.0,M,YES,1.0,0.0,0,0,0,245.0,123.0,77.0,26.580000,70.0,81.0,1
2231,70,1.0,F,NO,0.0,0.0,1,1,0,107.0,143.0,93.0,25.794964,68.0,62.0,1


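#### Checking for duplicated rows

In [20]:
# check for duplicated rows
# NOTE: len() of the boolean Series always equals the number of rows, so this value
# does not show whether duplicates exist; ds.duplicated().sum() counts them
len(ds.duplicated() == False)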

3390
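
As a small illustration of the difference between counting duplicates and measuring the length of the boolean mask, here is a sketch on a made-up three-row frame (the rows are hypothetical, not from the dataset):

```python
import pandas as pd

# Toy frame in which the third row repeats the first one.
toy = pd.DataFrame({"age": [64, 36, 64], "sex": ["F", "M", "F"]})

# duplicated() marks every repeat of an earlier row as True, so the sum counts duplicates.
n_duplicates = int(toy.duplicated().sum())

# len() of the boolean Series is always the number of rows, duplicates or not.
n_rows = len(toy.duplicated() == False)

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates()
print(n_duplicates, n_rows, len(deduplicated))
```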

#### Allocating the variables of interest to a new variable

In [21]:
# allocate into a new dataset

df = ds.drop(['is_smoking','BMI','cigsPerDay','prevalentStroke','prevalentHyp','BPMeds','sysBP','diaBP','glucose','heartRate','education'], axis=1)
df

Unnamed: 0,age,sex,diabetes,totChol,Cardiopatia
0,64,F,0,221.0,1
1,36,M,0,212.0,0
2,46,F,0,250.0,0
3,50,M,0,233.0,1
4,64,F,0,241.0,0
...,...,...,...,...,...
3385,60,F,0,261.0,0
3386,46,F,0,199.0,0
3387,44,M,0,352.0,1
3388,60,M,0,191.0,0


In [22]:
df.dtypes

age              int64
sex             object
diabetes         int64
totChol        float64
Cardiopatia      int64
dtype: object

In [23]:
# distribution of the variables
sns.pairplot(df, hue='Cardiopatia')

<seaborn.axisgrid.PairGrid at 0x21a83fcccd0>

#### Categorizing Attributes

In [24]:
# convert the 'diabetes' attribute to categorical

df['diabetes'] = df['diabetes'].astype('category')

In [25]:
# convert the categorical attributes to dummy variables (dtype=int keeps the 0/1 encoding)

df_dummy = pd.get_dummies(df, dtype=int)
df_dummy.head()

Unnamed: 0,age,totChol,Cardiopatia,sex_F,sex_M,diabetes_0,diabetes_1
0,64,221.0,1,1,0,1,0
1,36,212.0,0,0,1,1,0
2,46,250.0,0,1,0,1,0
3,50,233.0,1,0,1,1,0
4,64,241.0,0,1,0,1,0


In [26]:
df_dummy.describe().round(2)

Unnamed: 0,age,totChol,Cardiopatia,sex_F,sex_M,diabetes_0,diabetes_1
count,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0
mean,49.54,237.04,0.15,0.57,0.43,0.97,0.03
std,8.59,44.99,0.36,0.5,0.5,0.16,0.16
min,32.0,107.0,0.0,0.0,0.0,0.0,0.0
25%,42.0,206.0,0.0,0.0,0.0,1.0,0.0
50%,49.0,234.0,0.0,1.0,0.0,1.0,0.0
75%,56.0,264.0,0.0,1.0,1.0,1.0,0.0
max,70.0,696.0,1.0,1.0,1.0,1.0,1.0


In [27]:
# Discretize 'totChol' into the levels 'Baixo' (low, up to 240) and 'Alto' (high, above 240)

df_dummy['totChol'] = pd.cut(df_dummy['totChol'],[0,240,800],labels=['Baixo', 'Alto'])
df_dummy.head()

Unnamed: 0,age,totChol,Cardiopatia,sex_F,sex_M,diabetes_0,diabetes_1
0,64,Baixo,1,1,0,1,0
1,36,Baixo,0,0,1,1,0
2,46,Alto,0,1,0,1,0
3,50,Baixo,1,0,1,1,0
4,64,Alto,0,1,0,1,0


In [28]:
# discretize the ages into 'Adulto' (adult, up to 60) and 'Idoso' (elderly, above 60)

df_dummy['age_dist'] = pd.cut(df_dummy['age'],[18,60,100],labels=['Adulto', 'Idoso'])
df_dummy['age_dist'].value_counts() 

Adulto    2928
Idoso      462
Name: age_dist, dtype: int64
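
Because `pd.cut` closes each bin on the right by default, a patient aged exactly 60 falls into 'Adulto' here, whereas the earlier filter `age > 59` counts that patient as elderly. The sketch below, using three made-up ages, shows how the `right` parameter moves the boundary:

```python
import pandas as pd

ages = pd.Series([59, 60, 61])

# Default right=True: bins are (18, 60] and (60, 100], so 60 is labelled 'Adulto'.
right_closed = pd.cut(ages, [18, 60, 100], labels=["Adulto", "Idoso"])

# right=False: bins are [18, 60) and [60, 100), so 60 is labelled 'Idoso'.
left_closed = pd.cut(ages, [18, 60, 100], labels=["Adulto", "Idoso"], right=False)

print(list(right_closed))
print(list(left_closed))
```

Which convention is appropriate depends on the definition of "elderly" adopted for the study; the point is only that the same definition should be used wherever the group is counted.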

In [29]:
# Remove the original 'age' variable

data = df_dummy.drop('age', axis=1)

In [30]:
data.head()

Unnamed: 0,totChol,Cardiopatia,sex_F,sex_M,diabetes_0,diabetes_1,age_dist
0,Baixo,1,1,0,1,0,Idoso
1,Baixo,0,0,1,1,0,Adulto
2,Alto,0,1,0,1,0,Adulto
3,Baixo,1,0,1,1,0,Adulto
4,Alto,0,1,0,1,0,Idoso


In [31]:
# build the remaining dummy variables
data = pd.get_dummies(data)
data.head()

Unnamed: 0,Cardiopatia,sex_F,sex_M,diabetes_0,diabetes_1,totChol_Baixo,totChol_Alto,age_dist_Adulto,age_dist_Idoso
0,1,1,0,1,0,1,0,0,1
1,0,0,1,1,0,1,0,1,0
2,0,1,0,1,0,0,1,1,0
3,1,0,1,1,0,1,0,1,0
4,0,1,0,1,0,0,1,0,1


In [32]:
# Rename the study columns

data.rename(columns={'age_dist_Adulto': 'Adulto', 'age_dist_Idoso': 'Idoso', 'totChol_Baixo': 'Col_Baixo', 'totChol_Alto': 'Col_Alto'}, inplace=True)

In [33]:
data.head()

Unnamed: 0,Cardiopatia,sex_F,sex_M,diabetes_0,diabetes_1,Col_Baixo,Col_Alto,Adulto,Idoso
0,1,1,0,1,0,1,0,0,1
1,0,0,1,1,0,1,0,1,0
2,0,1,0,1,0,0,1,1,0
3,1,0,1,1,0,1,0,1,0
4,0,1,0,1,0,0,1,0,1


In [34]:
data.rename(columns={'sex_F': 'Feminino', 'sex_M': 'Masculino', 'diabetes_0': 'Não Diabético', 'diabetes_1': 'Diabético'}, inplace=True)
data.head()

Unnamed: 0,Cardiopatia,Feminino,Masculino,Não Diabético,Diabético,Col_Baixo,Col_Alto,Adulto,Idoso
0,1,1,0,1,0,1,0,0,1
1,0,0,1,1,0,1,0,1,0
2,0,1,0,1,0,0,1,1,0
3,1,0,1,1,0,1,0,1,0
4,0,1,0,1,0,0,1,0,1


### **Statistical Analysis**

In [35]:
# Functions that convert the encoded values into readable labels

def conv(item):
    if item == 0:
        item = 'Não Diabético'
    else:
        item = 'Diabético'
    return item

def conv2(item):
    if item == 'F':
        item = 'Feminino'
    else:
        item = 'Masculino'
    return item

def conv3(item):
    if item == 0:
        item = 'Ausente'
    else:
        item = 'Presente'
    return item

In [36]:
# Copy the data into a new dataset intended for the statistical analysis

df_stats = df.copy()
df_stats['totChol'] = df_dummy['totChol']

df_stats['diabetes'] = df_stats['diabetes'].apply(conv)
df_stats['sex'] = df_stats['sex'].apply(conv2)
df_stats['Cardiopatia'] = df_stats['Cardiopatia'].apply(conv3)

#### **Plotting Charts**

In [42]:
# plt.rcParams["backend"] = "agg"

In [44]:
# patients by sex

sns.set_theme(style="darkgrid")
plt.figure(figsize=(8,6))
sns.countplot(x=df_stats['sex'])
plt.xlabel('Sex')
plt.ylabel('Number of Patients')
plt.savefig('fig-sexo-pacientes')



In [45]:
# patients by cholesterol level

sns.set_theme(style="darkgrid")
plt.figure(figsize=(8,6))
sns.countplot(x=df_stats['totChol'])
plt.xlabel('Cholesterol')
plt.ylabel('Number of Patients')
# plt.show()
plt.savefig('fig-colesterol-pacientes')



In [46]:
# patients by diabetes status

sns.set_theme(style="darkgrid")
plt.figure(figsize=(8,6))
sns.countplot(x=df_stats['diabetes'])
plt.xlabel('Diabetes')
plt.ylabel('Number of Patients')
# plt.show()
plt.savefig('fig-diabetes-pacientes')

In [47]:
# patients by heart disease status ('Cardiopatia')

sns.set_theme(style="darkgrid")
plt.figure(figsize=(8,6))
sns.countplot(x=df_stats['Cardiopatia'])
plt.xlabel('Heart Disease')
plt.ylabel('Number of Patients')
# plt.show()
plt.savefig('fig-cardio-pacientes')

In [48]:
# Patient ages, with a divider marking the start of the elderly range

plt.figure(figsize=(15,6))
sns.countplot(x=df_stats['age'])
plt.grid(True, linestyle='--')
plt.xlabel('Age')
plt.ylabel('Number of Patients')
# countplot places the bars at category positions 0, 1, 2, ...;
# position 28 corresponds to age 60 only if every age from 32 to 70 is present
plt.axvline(x = 28, color = 'r', label = 'elderly threshold')
# plt.show()
plt.savefig('fig-idade-pacientes')



In [39]:
# Percentage of patients with heart disease
round((df_stats['Cardiopatia'].value_counts()/df_stats.shape[0])*100,2)

Ausente     84.93
Presente    15.07
Name: Cardiopatia, dtype: float64

In [40]:
# Percentage of diabetic patients
round((df_stats['diabetes'].value_counts()/df_stats.shape[0])*100,2)

Não Diabético    97.43
Diabético         2.57
Name: diabetes, dtype: float64

In [41]:
# Percentage by sex
round((df_stats['sex'].value_counts()/df_stats.shape[0])*100,2)

Feminino     56.73
Masculino    43.27
Name: sex, dtype: float64

In [42]:
# Percentage by age group
round((df_dummy['age_dist'].value_counts()/df_stats.shape[0])*100,2)

Adulto    86.37
Idoso     13.63
Name: age_dist, dtype: float64

In [43]:
# Percentage by cholesterol level
round((df_stats['totChol'].value_counts()/df_stats.shape[0])*100,2)

Baixo    57.23
Alto     42.77
Name: totChol, dtype: float64

## **Chi-Square Statistical Analysis**

![chi](chi-square.jpg)

The chi-square test is a statistical test of independence that assesses whether two (or more, depending on the case) categorical variables are related, by checking how far the observed frequencies deviate from the frequencies expected under independence. In other words, it checks whether the frequency with which an event is observed in a sample deviates significantly from the frequency expected of it.

A high chi-square value corresponds to a low p-value, which indicates that there is enough statistical evidence to infer that the observed and expected values are not the same, and hence that the variables are dependent. For a fixed number of degrees of freedom, the larger the chi-square value, the stronger the evidence of dependence.
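
To make the observed-versus-expected comparison concrete, here is a minimal sketch using `scipy.stats.chi2_contingency` on a hypothetical 2x2 table of counts (the numbers are invented for illustration and are not taken from this dataset). Note that `sklearn.feature_selection.chi2`, used further on, scores features in a different way, so its values may not match a contingency-table test exactly.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts (rows: group absent/present, columns: outcome no/yes).
observed = np.array([[90, 10],
                     [60, 40]])

# correction=False disables Yates' continuity correction, giving the plain Pearson statistic.
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(expected)          # counts expected if group and outcome were independent
print(stat, dof, p_value)
```

Under independence each row would split 75/25, so the statistic is the sum of (observed - expected)^2 / expected over the four cells, which works out to 24 with 1 degree of freedom.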

In [49]:
# cast all variables to categorical

data['Cardiopatia'] = data['Cardiopatia'].astype('category')
data['Feminino'] = data['Feminino'].astype('category')
data['Masculino'] = data['Masculino'].astype('category')
data['Não Diabético'] = data['Não Diabético'].astype('category')
data['Diabético'] = data['Diabético'].astype('category')
data['Col_Baixo'] = data['Col_Baixo'].astype('category')
data['Col_Alto'] = data['Col_Alto'].astype('category')
data['Adulto'] = data['Adulto'].astype('category')
data['Idoso'] = data['Idoso'].astype('category')

#### Creating Groups for Analysis

Next, groups are built from the variables so that each group can be analyzed against the target variable 'Cardiopatia'.

The focus is on the groups of individuals with high cholesterol levels.

In [50]:
# SEX

# male
data['Perfil_M'] = np.where(((data['Masculino'] == 1) & (data['Col_Alto'] == 1)), 1, 0)

# female
data['Perfil_F'] = np.where(((data['Feminino'] == 1) & (data['Col_Alto'] == 1)), 1, 0)

# AGE

# adult
data['Perfil_A'] = np.where(((data['Adulto'] == 1) & (data['Col_Alto'] == 1)), 1, 0)

# elderly
data['Perfil_I'] = np.where(((data['Idoso'] == 1) & (data['Col_Alto'] == 1)), 1, 0)

# DIABETES

# diabetic
data['Perfil_D'] = np.where(((data['Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)

# non-diabetic
data['Perfil_N'] = np.where(((data['Não Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)

# SEX AND AGE

# male
data['Perfil_MA'] = np.where(((data['Masculino'] == 1) & (data['Adulto'] == 1) & (data['Col_Alto'] == 1)), 1, 0)
data['Perfil_MI'] = np.where(((data['Masculino'] == 1) & (data['Idoso'] == 1) & (data['Col_Alto'] == 1)), 1, 0)

# female
data['Perfil_FA'] = np.where(((data['Feminino'] == 1) & (data['Adulto'] == 1) & (data['Col_Alto'] == 1)), 1, 0)
data['Perfil_FI'] = np.where(((data['Feminino'] == 1) & (data['Idoso'] == 1) & (data['Col_Alto'] == 1)), 1, 0)

# SEX AND DIABETES

# male
data['Perfil_MD'] = np.where(((data['Masculino'] == 1) & (data['Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)
data['Perfil_MN'] = np.where(((data['Masculino'] == 1) & (data['Não Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)

# female
data['Perfil_FD'] = np.where(((data['Feminino'] == 1) & (data['Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)
data['Perfil_FN'] = np.where(((data['Feminino'] == 1) & (data['Não Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)


# SEX, AGE AND DIABETES

# male
data['Perfil_MAD'] = np.where(((data['Masculino'] == 1) & (data['Adulto'] == 1) &  (data['Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)
data['Perfil_MID'] = np.where(((data['Masculino'] == 1) & (data['Idoso'] == 1) &  (data['Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)
data['Perfil_MAN'] = np.where(((data['Masculino'] == 1) & (data['Adulto'] == 1) &  (data['Não Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)
data['Perfil_MIN'] = np.where(((data['Masculino'] == 1) & (data['Idoso'] == 1) &  (data['Não Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)

# female

data['Perfil_FAD'] = np.where(((data['Feminino'] == 1) & (data['Adulto'] == 1) &  (data['Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)
data['Perfil_FID'] = np.where(((data['Feminino'] == 1) & (data['Idoso'] == 1) &  (data['Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)
data['Perfil_FAN'] = np.where(((data['Feminino'] == 1) & (data['Adulto'] == 1) &  (data['Não Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)
data['Perfil_FIN'] = np.where(((data['Feminino'] == 1) & (data['Idoso'] == 1) &  (data['Não Diabético'] == 1) & (data['Col_Alto'] == 1)), 1, 0)
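
Since every profile is the logical AND of a few indicator columns with 'Col_Alto', the same flags could be generated from a small mapping. This is only a sketch on a made-up toy frame; `build_profile` and the `profiles` dictionary are hypothetical helpers, not part of the notebook:

```python
import numpy as np
import pandas as pd

# Toy indicator frame with made-up rows; the column names mirror the notebook.
toy = pd.DataFrame({
    "Masculino": [1, 0, 1, 1],
    "Idoso":     [1, 1, 0, 1],
    "Col_Alto":  [1, 1, 1, 0],
})

def build_profile(frame, columns):
    """Return 1 where every listed indicator column equals 1, else 0."""
    mask = np.ones(len(frame), dtype=bool)
    for column in columns:
        mask &= (frame[column] == 1).to_numpy()
    return mask.astype(int)

# Each profile name maps to the indicators it combines; 'Col_Alto' is added to all of them.
profiles = {"Perfil_M": ["Masculino"], "Perfil_MI": ["Masculino", "Idoso"]}
for name, columns in profiles.items():
    toy[name] = build_profile(toy, columns + ["Col_Alto"])

print(toy)
```

Adding a new profile then means adding one dictionary entry rather than another hand-written condition.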

In [51]:
# Counts for the male, adult, diabetic profile

data['Perfil_MAD'].value_counts()

0    3378
1      12
Name: Perfil_MAD, dtype: int64

In [52]:
data.dtypes

Cardiopatia      category
Feminino         category
Masculino        category
Não Diabético    category
Diabético        category
Col_Baixo        category
Col_Alto         category
Adulto           category
Idoso            category
Perfil_M            int32
Perfil_F            int32
Perfil_A            int32
Perfil_I            int32
Perfil_D            int32
Perfil_N            int32
Perfil_MA           int32
Perfil_MI           int32
Perfil_FA           int32
Perfil_FI           int32
Perfil_MD           int32
Perfil_MN           int32
Perfil_FD           int32
Perfil_FN           int32
Perfil_MAD          int32
Perfil_MID          int32
Perfil_MAN          int32
Perfil_MIN          int32
Perfil_FAD          int32
Perfil_FID          int32
Perfil_FAN          int32
Perfil_FIN          int32
dtype: object

In [53]:
# Convert the new groups to categorical

for col in data:
    data[col] = data[col].astype('category')

In [54]:
data.dtypes

Cardiopatia      category
Feminino         category
Masculino        category
Não Diabético    category
Diabético        category
Col_Baixo        category
Col_Alto         category
Adulto           category
Idoso            category
Perfil_M         category
Perfil_F         category
Perfil_A         category
Perfil_I         category
Perfil_D         category
Perfil_N         category
Perfil_MA        category
Perfil_MI        category
Perfil_FA        category
Perfil_FI        category
Perfil_MD        category
Perfil_MN        category
Perfil_FD        category
Perfil_FN        category
Perfil_MAD       category
Perfil_MID       category
Perfil_MAN       category
Perfil_MIN       category
Perfil_FAD       category
Perfil_FID       category
Perfil_FAN       category
Perfil_FIN       category
dtype: object

In [55]:
# Split the groups into predictor variables (x) and the target variable (y)

x = data.drop(['Cardiopatia'], axis=1)
y = data['Cardiopatia']

In [56]:
# collect the chi-square scores and p-values
chi_scores = chi2(x,y)

In [57]:
scores = pd.Series(chi_scores[0], index = x.columns)
pvalues = pd.Series(chi_scores[1], index = x.columns)

In [58]:
# chi-square values sorted in descending order, with their p-values

final = pd.DataFrame({'Chi2':scores, 'p-Value':pvalues})
final.sort_values(by = 'Chi2', ascending=False)

Unnamed: 0,Chi2,p-Value
Idoso,72.228656,1.9165240000000002e-17
Diabético,35.506306,2.542262e-09
Perfil_I,33.786,6.152e-09
Perfil_D,30.323399,3.656877e-08
Perfil_M,26.984868,2.050545e-07
Perfil_MD,21.190952,4.157224e-06
Perfil_MI,20.936739,4.747025e-06
Perfil_MN,20.177696,7.057101e-06
Perfil_FI,16.09024,6.039451e-05
Perfil_MIN,15.080636,0.0001030146


In [59]:
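# Percentage of elderly men with high cholesterol, relative to the 232 men counted in cell 7
# NOTE: cell 7 used age > 59, while 'Idoso' covers age > 60, so the two groups differ at age 60
round((data['Perfil_MI'].value_counts()[1]/232)*100,2)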

29.74

In [60]:
round(data['Perfil_MI'].value_counts(normalize=True) * 100, 2)

0    97.96
1     2.04
Name: Perfil_MI, dtype: float64
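
One way to keep the numerator and the denominator on the same definition of "elderly" is to derive both from the same indicator columns. The sketch below uses a made-up five-row frame (not the study data) and assumes the 'Masculino', 'Idoso' and 'Col_Alto' indicators built earlier:

```python
import pandas as pd

# Toy indicator frame with made-up rows; the column names mirror the notebook.
toy = pd.DataFrame({
    "Masculino": [1, 1, 1, 0, 1],
    "Idoso":     [1, 1, 0, 1, 1],
    "Col_Alto":  [1, 0, 1, 1, 0],
})

# Denominator: all elderly men. Numerator: elderly men who also have high cholesterol.
elderly_men = (toy["Masculino"] == 1) & (toy["Idoso"] == 1)
profile_mi = elderly_men & (toy["Col_Alto"] == 1)

share = round(profile_mi.sum() / elderly_men.sum() * 100, 2)
print(share)
```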

In [61]:
# Heatmap of the chi-square results
fig = plt.figure(figsize=(8,8))
sns.heatmap(final, annot=True, cmap='Blues')
plt.title('Chi-Square Test Results')
# plt.show()
plt.savefig('fig-heatmap-chi')

In [None]:
# generate the distribution

x = np.linspace(-1, 20, 1000)
dist = scipy.stats.chi2(1,0)
plt.plot(x, dist.pdf(x), ls='-', c='black', label=r'$df=1$')
# 3.84 is the critical value for 1 degree of freedom at the 0.05 significance level
plt.axvline(x = 3.84, color = 'r', label = 'critical value')

plt.text(4, 0.2, 'p = 0.05')

plt.xlim(0, 10)
plt.ylim(0, 0.8)
plt.xlabel('$Q$')
plt.ylabel(r'$p(Q|k)$')
plt.title('Chi-Square Distribution for 1 Degree of Freedom')
plt.savefig('fig-dist')
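
The 3.84 threshold drawn above can be recovered from the distribution itself, and a chi-square score can be converted to a p-value in the same way. The following is a short sketch, assuming 1 degree of freedom and a 0.05 significance level; it uses SciPy's `stats.chi2` distribution object, which is distinct from the `chi2` function imported from scikit-learn in this notebook:

```python
from scipy import stats

# Critical value: the point that leaves 5% of the probability in the upper tail.
critical_value = stats.chi2.ppf(0.95, df=1)

# Upper-tail probability (p-value) for the largest score in the results table ('Idoso').
p_idoso = stats.chi2.sf(72.228656, df=1)

print(round(critical_value, 2))
print(p_idoso)
```

Any score to the right of the critical value has a p-value below 0.05 under this distribution.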