# A3 PROJECT | DATA ANALYSIS AND BIG DATA & ARTIFICIAL INTELLIGENCE

**Group members**

<br>

NAME | RA
--- | ---
Gustavo Buenos Aires Colpaert | 823126177
Stephany Silva Dantas | 822223694
Paloma Lopes de Sousa | 822167506
Bianca Alves Ribeiro | 8222240261
Lucas Vasconcellos | 8222242709
Sara Alves | 822224386
Fernando Fukunaga | 821232457
<br>

**Project Description**

Given the alarming rise in mortality from stroke (Acidente Vascular Cerebral, AVC) and the worrying prospect of a continued increase in deaths, this project aims to predict stroke risk from clinical and social variables such as age, gender, marital status, clinical conditions, smoking history, and body mass index (BMI, named IMC in the columns below). The central problem is to develop models that identify individuals at higher risk of stroke, enabling preventive interventions and early treatment, with the goal of reducing the impact of the disease.

## DATA ACQUISITION

---



In [25]:
import pandas as pd
import numpy as np

In [26]:
# CSV file - https://drive.google.com/file/d/1BvZcRMYNPjr_hDUiwj0SdAaimAlqelie/view?usp=sharing

avc_dataset = pd.read_csv('healthcare-dataset-stroke-data.csv', delimiter=';')
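
The `delimiter=';'` argument matters because `pd.read_csv` splits on commas by default. As a minimal sketch, assuming a semicolon-separated export like the file loaded above (the two sample rows are made up for illustration), reading with the wrong separator fuses every field into a single column:

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same semicolon-separated layout.
sample = "id;gender;age\n1;Male;67.0\n2;Female;61.0\n"

wrong = pd.read_csv(io.StringIO(sample))           # default separator is ','
right = pd.read_csv(io.StringIO(sample), sep=";")  # 'delimiter' is an alias for 'sep'

print(wrong.shape)  # one fused column
print(right.shape)  # one column per field
```

Checking `.shape` or `.columns` right after loading is a cheap way to catch a separator mismatch early.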

In [27]:
avc_dataset.columns

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

## EXPLORATORY DATA ANALYSIS

In [28]:
avc_dataset.describe()

| | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 4909.0 | 5110.0 |
| mean | 36517.829354 | 43.226614 | 0.097456 | 0.054012 | 106.147677 | 28.893237 | 0.048728 |
| std | 21161.721625 | 22.612647 | 0.296607 | 0.226063 | 45.28356 | 7.854067 | 0.21532 |
| min | 67.0 | 0.08 | 0.0 | 0.0 | 55.12 | 10.3 | 0.0 |
| 25% | 17741.25 | 25.0 | 0.0 | 0.0 | 77.245 | 23.5 | 0.0 |
| 50% | 36932.0 | 45.0 | 0.0 | 0.0 | 91.885 | 28.1 | 0.0 |
| 75% | 54682.0 | 61.0 | 0.0 | 0.0 | 114.09 | 33.1 | 0.0 |
| max | 72940.0 | 82.0 | 1.0 | 1.0 | 271.74 | 97.6 | 1.0 |


In [29]:
avc_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


### DATA PREPARATION

In [30]:
# Drop the id column
avc_dataset.drop(['id'], axis=1, inplace=True)

In [31]:
# Translate the column titles into Portuguese
avc_dataset.rename(columns= {'age': 'Idade' , 'gender': 'Genero' , 'hypertension': 'Hipertensao',
                      'heart_disease': 'Doenca Cardiaca' , 'ever_married': 'Estado Civil',
                      'work_type': 'Tipo de trabalho', 'Residence_type': 'Localizacao Residencial' ,
                      'avg_glucose_level' : 'Nivel medio de glicose' , 'bmi': 'IMC',
                      'stroke': 'AVC', 'smoking_status': 'Condicao de fumante'}, inplace=True)
avc_dataset.columns

Index(['Genero', 'Idade', 'Hipertensao', 'Doenca Cardiaca', 'Estado Civil',
       'Tipo de trabalho', 'Localizacao Residencial', 'Nivel medio de glicose',
       'IMC', 'Condicao de fumante', 'AVC'],
      dtype='object')

In [32]:
# Inspect the 'object'-typed values in the Genero column
avc_dataset['Genero'].unique()

array(['Male', 'Female', 'Other'], dtype=object)

In [33]:
# Count records in the 'Genero' column, for later replacement with the most frequent value (mode)
avc_dataset['Genero'].value_counts()

Genero
Female    2994
Male      2115
Other        1
Name: count, dtype: int64

In [34]:
# Count records in the 'Condicao de fumante' column, for replacement with the mode
avc_dataset['Condicao de fumante'].value_counts()

Condicao de fumante
never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: count, dtype: int64

In [35]:
# Check for null values
avc_dataset.isnull().sum()

Genero                       0
Idade                        0
Hipertensao                  0
Doenca Cardiaca              0
Estado Civil                 0
Tipo de trabalho             0
Localizacao Residencial      0
Nivel medio de glicose       0
IMC                        201
Condicao de fumante          0
AVC                          0
dtype: int64
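
The counts above say that only `IMC` has gaps: 201 of 5110 rows, roughly 3.9%. A small sketch of how that share can be computed per column is shown here; the four-row frame is hypothetical and only mimics the structure of the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; only 'IMC' has a gap, as in the real data.
df = pd.DataFrame({
    "Idade": [67.0, 61.0, 80.0, 49.0],
    "IMC": [36.6, np.nan, 32.5, 34.4],
})

# The mean of a boolean mask is the fraction of True values,
# so this yields the percentage of missing entries per column.
missing_share = df.isnull().mean() * 100
print(missing_share.round(2))
```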

In [36]:
# Replace null values (NaN) in the IMC column with the median
avc_dataset['IMC'] = avc_dataset['IMC'].fillna(avc_dataset['IMC'].median())
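
The cell above fills the gaps with `median()`. One common reason to prefer the median over the mean is robustness to extreme values, and the `describe()` output shows a maximum BMI of 97.6. The sketch below illustrates the difference on made-up numbers; it is not the project's data.

```python
import numpy as np
import pandas as pd

# Made-up BMI values with one extreme entry and one gap.
imc = pd.Series([23.5, 28.1, 33.1, 97.6, np.nan])

# Assigning the result back avoids relying on inplace=True on a column selection.
mean_fill = imc.fillna(imc.mean())
median_fill = imc.fillna(imc.median())

print(mean_fill.iloc[-1])    # pulled upward by the extreme value
print(median_fill.iloc[-1])  # stays near the bulk of the data
```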

In [37]:
# Replace the rare gender value and the unknown smoking status with the mode; map No/Yes to 0/1
avc_dataset.replace({'Other':'Female',
                  'No': 0, 'Yes': 1,
                  'Unknown':'never smoked'}, inplace=True)
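
The replacement values above are the modes read off the `value_counts()` outputs. As a hedged alternative, the mode can be computed instead of typed by hand, so the code keeps working if the data changes. The series below is a made-up example.

```python
import pandas as pd

# Made-up gender column with a single rare category.
genero = pd.Series(["Male", "Female", "Female", "Other", "Female"])

moda = genero.mode()[0]                     # most frequent value
genero_limpo = genero.replace("Other", moda)

print(moda)
print(genero_limpo.value_counts())
```

Note that the same strategy has a larger effect on the smoking column, where `Unknown` covers 1544 of 5110 rows (about 30%).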

In [38]:
# Data ready for logistic regression and KNN; .copy() keeps this snapshot independent of later in-place changes to avc_dataset
dataset_rl_knn = avc_dataset.copy()
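
Whether a saved variable keeps the string labels depends on copying. Plain assignment only binds a second name to the same DataFrame, so later changes to that object show up under both names, while `.copy()` creates an independent snapshot. A minimal sketch with hypothetical variable names:

```python
import pandas as pd

original = pd.DataFrame({"Genero": ["Female", "Male"]})

alias = original            # second name for the same object
snapshot = original.copy()  # independent copy of the data

# Encode the column on the original object.
original["Genero"] = original["Genero"].map({"Female": 0, "Male": 1})

print(alias["Genero"].tolist())     # follows the change
print(snapshot["Genero"].tolist())  # still holds the strings
```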

In [39]:
# Replace strings with numeric codes, with substitutions by the most frequent values (mode)
avc_dataset.replace({'Female': 0, 'Male': 1,
                    'Urban': 0, 'Rural': 1,
                    'Unknown':0, 'never smoked':0 ,'smokes': 1, 'formerly smoked': 2,
                    'children':0, 'Never_worked':0,'Private':1, 'Self-employed':2,'Govt_job':3 }, inplace=True)
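
The `replace` call above applies one dictionary to every column. A possible alternative, sketched here under the assumption that each column should have its own code table, is to map column by column. Labels missing from a table then become `NaN`, which makes typos visible instead of leaving stray strings behind. The frame is a made-up two-row example.

```python
import pandas as pd

df = pd.DataFrame({
    "Genero": ["Female", "Male"],
    "Localizacao Residencial": ["Urban", "Rural"],
    "Condicao de fumante": ["never smoked", "formerly smoked"],
})

# One code table per column, using the same codes as the cell above.
codes = {
    "Genero": {"Female": 0, "Male": 1},
    "Localizacao Residencial": {"Urban": 0, "Rural": 1},
    "Condicao de fumante": {"never smoked": 0, "smokes": 1, "formerly smoked": 2},
}

for col, mapping in codes.items():
    df[col] = df[col].map(mapping)  # unmapped labels would become NaN

print(df)
```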

In [40]:
avc_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Genero                   5110 non-null   int64  
 1   Idade                    5110 non-null   float64
 2   Hipertensao              5110 non-null   int64  
 3   Doenca Cardiaca          5110 non-null   int64  
 4   Estado Civil             5110 non-null   int64  
 5   Tipo de trabalho         5110 non-null   int64  
 6   Localizacao Residencial  5110 non-null   int64  
 7   Nivel medio de glicose   5110 non-null   float64
 8   IMC                      5110 non-null   float64
 9   Condicao de fumante      5110 non-null   int64  
 10  AVC                      5110 non-null   int64  
dtypes: float64(3), int64(8)
memory usage: 439.3 KB


In [41]:
avc_dataset

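| | Genero | Idade | Hipertensao | Doenca Cardiaca | Estado Civil | Tipo de trabalho | Localizacao Residencial | Nivel medio de glicose | IMC | Condicao de fumante | AVC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 67.0 | 0 | 1 | 1 | 1 | 0 | 228.69 | 36.6 | 2 | 1 |
| 1 | 0 | 61.0 | 0 | 0 | 1 | 2 | 1 | 202.21 | 28.1 | 0 | 1 |
| 2 | 1 | 80.0 | 0 | 1 | 1 | 1 | 1 | 105.92 | 32.5 | 0 | 1 |
| 3 | 0 | 49.0 | 0 | 0 | 1 | 1 | 0 | 171.23 | 34.4 | 1 | 1 |
| 4 | 0 | 79.0 | 1 | 0 | 1 | 2 | 1 | 174.12 | 24.0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5105 | 0 | 80.0 | 1 | 0 | 1 | 1 | 0 | 83.75 | 28.1 | 0 | 0 |
| 5106 | 0 | 81.0 | 0 | 0 | 1 | 2 | 0 | 125.20 | 40.0 | 0 | 0 |
| 5107 | 0 | 35.0 | 0 | 0 | 1 | 2 | 1 | 82.99 | 30.6 | 0 | 0 |
| 5108 | 1 | 51.0 | 0 | 0 | 1 | 1 | 1 | 166.29 | 25.6 | 2 | 0 |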


In [42]:
# Min-max normalization of the values
avc=(avc_dataset-avc_dataset.min())/(avc_dataset.max()-avc_dataset.min())
avc

| | Genero | Idade | Hipertensao | Doenca Cardiaca | Estado Civil | Tipo de trabalho | Localizacao Residencial | Nivel medio de glicose | IMC | Condicao de fumante | AVC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 0.816895 | 0.0 | 1.0 | 1.0 | 0.333333 | 0.0 | 0.801265 | 0.301260 | 1.0 | 1.0 |
| 1 | 0.0 | 0.743652 | 0.0 | 0.0 | 1.0 | 0.666667 | 1.0 | 0.679023 | 0.203895 | 0.0 | 1.0 |
| 2 | 1.0 | 0.975586 | 0.0 | 1.0 | 1.0 | 0.333333 | 1.0 | 0.234512 | 0.254296 | 0.0 | 1.0 |
| 3 | 0.0 | 0.597168 | 0.0 | 0.0 | 1.0 | 0.333333 | 0.0 | 0.536008 | 0.276060 | 0.5 | 1.0 |
| 4 | 0.0 | 0.963379 | 1.0 | 0.0 | 1.0 | 0.666667 | 1.0 | 0.549349 | 0.156930 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5105 | 0.0 | 0.975586 | 1.0 | 0.0 | 1.0 | 0.333333 | 0.0 | 0.132167 | 0.203895 | 0.0 | 0.0 |
| 5106 | 0.0 | 0.987793 | 0.0 | 0.0 | 1.0 | 0.666667 | 0.0 | 0.323516 | 0.340206 | 0.0 | 0.0 |
| 5107 | 0.0 | 0.426270 | 0.0 | 0.0 | 1.0 | 0.666667 | 1.0 | 0.128658 | 0.232532 | 0.0 | 0.0 |
| 5108 | 1.0 | 0.621582 | 0.0 | 0.0 | 1.0 | 0.333333 | 1.0 | 0.513203 | 0.175258 | 1.0 | 0.0 |
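
The normalization cell rescales every column to the range [0, 1] with the min-max formula `(x - min) / (max - min)`. As a worked check, the sketch below recomputes three values of row 0 from the minima and maxima reported by `describe()`. The helper name `min_max` is introduced here only for illustration.

```python
def min_max(x, lo, hi):
    """Rescale x from the interval [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

# (value in row 0, column minimum, column maximum), read from the outputs above.
row0 = {
    "Idade": (67.0, 0.08, 82.0),
    "Nivel medio de glicose": (228.69, 55.12, 271.74),
    "IMC": (36.6, 10.3, 97.6),
}

scaled = {name: round(min_max(*values), 6) for name, values in row0.items()}
print(scaled)
```

The results agree with the first row of the normalized table (0.816895, 0.801265 and 0.301260). Because the formula is applied to the whole frame, the target column `AVC` is rescaled too, which only turns its integers 0 and 1 into the floats 0.0 and 1.0.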


### Separating the values (0 and 1) in the AVC column

In [43]:
# Count of AVC = 0/1 values
avc['AVC'].value_counts()

AVC
0.0    4861
1.0     249
Name: count, dtype: int64
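
One way to read these counts: the classes are heavily imbalanced, so a model that always predicts "no stroke" would already look accurate while finding no positive case at all. The arithmetic below uses only the two counts reported above.

```python
negatives, positives = 4861, 249  # counts from the value_counts() output
total = negatives + positives

baseline_accuracy = negatives / total  # accuracy of always predicting AVC = 0
positive_share = positives / total     # fraction of stroke cases

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"positive share:    {positive_share:.4f}")
```

This is a plausible motivation for balancing the two classes before training, which the following cells do by undersampling the majority class.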

In [44]:
# Select the records of patients who did not have a stroke (AVC == 0)
avc_negativo = avc.loc[avc['AVC'] == 0]
avc_negativo = avc_negativo.sample(n=249)
avc_negativo

| | Genero | Idade | Hipertensao | Doenca Cardiaca | Estado Civil | Tipo de trabalho | Localizacao Residencial | Nivel medio de glicose | IMC | Condicao de fumante | AVC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1278 | 0.0 | 0.133301 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.346367 | 0.079038 | 0.0 | 0.0 |
| 1541 | 0.0 | 0.572754 | 0.0 | 0.0 | 1.0 | 0.333333 | 0.0 | 0.470363 | 0.180985 | 0.5 | 0.0 |
| 399 | 1.0 | 0.707031 | 1.0 | 0.0 | 1.0 | 0.333333 | 1.0 | 0.776660 | 0.357388 | 1.0 | 0.0 |
| 3030 | 0.0 | 0.487305 | 1.0 | 0.0 | 1.0 | 0.333333 | 0.0 | 0.320192 | 0.323024 | 0.0 | 0.0 |
| 4486 | 1.0 | 0.169922 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.0 | 0.329840 | 0.178694 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1258 | 0.0 | 0.304199 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.0 | 0.127643 | 0.297824 | 0.0 | 0.0 |
| 2668 | 0.0 | 0.768066 | 0.0 | 0.0 | 1.0 | 0.333333 | 1.0 | 0.454621 | 0.202749 | 1.0 | 0.0 |
| 2004 | 1.0 | 0.645996 | 0.0 | 0.0 | 1.0 | 0.333333 | 1.0 | 0.660696 | 0.318442 | 0.0 | 0.0 |
| 4737 | 0.0 | 0.658203 | 0.0 | 0.0 | 1.0 | 1.000000 | 0.0 | 0.814422 | 0.224513 | 0.0 | 0.0 |


In [45]:
# Select the records of patients who had a stroke (AVC == 1)
avc_positivo = avc.loc[avc['AVC'] == 1]
avc_positivo

| | Genero | Idade | Hipertensao | Doenca Cardiaca | Estado Civil | Tipo de trabalho | Localizacao Residencial | Nivel medio de glicose | IMC | Condicao de fumante | AVC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 0.816895 | 0.0 | 1.0 | 1.0 | 0.333333 | 0.0 | 0.801265 | 0.301260 | 1.0 | 1.0 |
| 1 | 0.0 | 0.743652 | 0.0 | 0.0 | 1.0 | 0.666667 | 1.0 | 0.679023 | 0.203895 | 0.0 | 1.0 |
| 2 | 1.0 | 0.975586 | 0.0 | 1.0 | 1.0 | 0.333333 | 1.0 | 0.234512 | 0.254296 | 0.0 | 1.0 |
| 3 | 0.0 | 0.597168 | 0.0 | 0.0 | 1.0 | 0.333333 | 0.0 | 0.536008 | 0.276060 | 0.5 | 1.0 |
| 4 | 0.0 | 0.963379 | 1.0 | 0.0 | 1.0 | 0.666667 | 1.0 | 0.549349 | 0.156930 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 244 | 1.0 | 0.694824 | 0.0 | 0.0 | 1.0 | 0.333333 | 1.0 | 0.137753 | 0.302405 | 0.0 | 1.0 |
| 245 | 0.0 | 0.169922 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 0.012972 | 0.235968 | 0.0 | 1.0 |
| 246 | 0.0 | 0.914551 | 0.0 | 0.0 | 1.0 | 0.666667 | 1.0 | 0.109316 | 0.217640 | 1.0 | 1.0 |
| 247 | 1.0 | 0.865723 | 1.0 | 0.0 | 1.0 | 0.666667 | 1.0 | 0.150863 | 0.203895 | 0.0 | 1.0 |


In [46]:
# Concatenate both groups and shuffle the rows
avc = pd.concat([avc_positivo, avc_negativo], ignore_index=True)
avc = avc.sample(frac=1).reset_index(drop=True)
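
The undersampling and shuffling steps above draw random rows, so each run of the notebook produces a different balanced set. A hedged variant is sketched below on a made-up ten-row frame: it derives the sample size from the minority class instead of hard-coding 249 and fixes `random_state` so the result is repeatable. The seed value 42 is an arbitrary assumption, not something taken from the project.

```python
import pandas as pd

# Made-up frame: 2 positive and 8 negative records.
df = pd.DataFrame({
    "Idade": [0.82, 0.74, 0.13, 0.57, 0.71, 0.49, 0.17, 0.30, 0.77, 0.65],
    "AVC": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

n_minority = df["AVC"].value_counts().min()  # size of the smaller class

positives = df[df["AVC"] == 1]
negatives = df[df["AVC"] == 0].sample(n=n_minority, random_state=42)

# Concatenate, shuffle with a fixed seed, and rebuild a clean index.
balanced = (
    pd.concat([positives, negatives])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["AVC"].value_counts())
```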

In [47]:
avc.head(500)

| | Genero | Idade | Hipertensao | Doenca Cardiaca | Estado Civil | Tipo de trabalho | Localizacao Residencial | Nivel medio de glicose | IMC | Condicao de fumante | AVC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.645996 | 0.0 | 0.0 | 1.0 | 1.000000 | 0.0 | 0.041778 | 0.357388 | 0.0 | 1.0 |
| 1 | 1.0 | 0.816895 | 0.0 | 0.0 | 1.0 | 0.333333 | 1.0 | 0.129443 | 0.206186 | 0.0 | 0.0 |
| 2 | 0.0 | 0.853516 | 0.0 | 0.0 | 1.0 | 0.333333 | 0.0 | 0.768442 | 0.426117 | 0.0 | 1.0 |
| 3 | 0.0 | 0.060059 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 0.043071 | 0.130584 | 0.0 | 0.0 |
| 4 | 1.0 | 0.987793 | 0.0 | 0.0 | 1.0 | 0.666667 | 1.0 | 0.168129 | 0.241695 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 493 | 0.0 | 0.487305 | 0.0 | 0.0 | 1.0 | 0.333333 | 0.0 | 0.047549 | 0.081329 | 1.0 | 0.0 |
| 494 | 1.0 | 0.072266 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 0.043532 | 0.081329 | 0.0 | 0.0 |
| 495 | 1.0 | 0.975586 | 0.0 | 1.0 | 1.0 | 0.666667 | 1.0 | 0.186363 | 0.243986 | 0.0 | 0.0 |
| 496 | 1.0 | 0.853516 | 1.0 | 0.0 | 1.0 | 0.333333 | 1.0 | 0.865109 | 0.403207 | 1.0 | 1.0 |


In [48]:
avc.to_csv('dataset-avc-tratado.csv', index=False)
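
The `index=False` argument keeps the row index out of the saved file. As a small sketch on a made-up frame, written to an in-memory buffer rather than to disk, leaving the index in produces an extra unnamed column when the file is read back:

```python
import io
import pandas as pd

df = pd.DataFrame({"Idade": [0.816895, 0.743652], "AVC": [1.0, 0.0]})

def round_trip(frame, **to_csv_kwargs):
    """Write a frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

restored = round_trip(df, index=False)
with_index = round_trip(df)  # default keeps the index as a first column

print(list(restored.columns))
print(list(with_index.columns))
```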