## Diabetes Data Cleaning

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Its purpose is to predict diagnostically whether a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are women at least 21 years old of Pima Indian heritage.

In [62]:
import numpy as np
import pandas as pd
import scipy.stats  # load the stats submodule explicitly; scipy.stats.linregress is used later

In [63]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

In [64]:
df = pd.read_csv('C:/Users/Pichau/Desktop/Trabalho/diabetes.csv')

In [65]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [67]:
df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


Checking the data for null values and zero values that need cleaning.


In [68]:
df['Insulin'].value_counts()

0      374
105     11
130      9
140      9
120      8
      ... 
73       1
171      1
255      1
52       1
112      1
Name: Insulin, Length: 186, dtype: int64

In [69]:
df['SkinThickness'].value_counts().max()

227
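
The two per-column checks above can be extended to all measurement columns at once. The following is a minimal sketch on a small made-up frame (`toy`, not the real dataset); the list `measure_cols` is an assumption about which columns cannot legitimately be zero.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the values are invented for illustration.
toy = pd.DataFrame({
    'Glucose': [148, 0, 183, 89],
    'Insulin': [0, 0, 0, 94],
    'BMI': [33.6, 26.6, 0.0, 28.1],
})

# Columns where a zero is assumed to mean "not measured".
measure_cols = ['Glucose', 'Insulin', 'BMI']

zero_counts = (toy[measure_cols] == 0).sum()    # zeros per column
null_counts = toy[measure_cols].isnull().sum()  # NaN values per column

print(pd.DataFrame({'zeros': zero_counts, 'nulls': null_counts}))
```

Applied to the real `df`, a summary like this shows in one table which columns use zero as a placeholder for missing measurements.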

Replacing all inappropriate zero values with NaN, so that these zeros do not take part in the correlation calculation.

In [70]:
# Zeros in these columns are implausible measurements and are treated as missing.
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
# Assign the result back: chained inplace replace may not update df under pandas
# copy-on-write, and np.NaN was removed in NumPy 2.0 (use np.nan).
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

With the zeros replaced by NaN values, the correlation matrix is now computed without them.

In [71]:
df.corr()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
Pregnancies,1.0,0.128135,0.214178,0.100239,0.082171,0.021719,-0.033523,0.544341,0.221898
Glucose,0.128135,1.0,0.223192,0.228043,0.581186,0.232771,0.137246,0.267136,0.49465
BloodPressure,0.214178,0.223192,1.0,0.226839,0.098272,0.28923,-0.002805,0.330107,0.170589
SkinThickness,0.100239,0.228043,0.226839,1.0,0.184888,0.648214,0.115016,0.166816,0.259491
Insulin,0.082171,0.581186,0.098272,0.184888,1.0,0.22805,0.130395,0.220261,0.303454
BMI,0.021719,0.232771,0.28923,0.648214,0.22805,1.0,0.155382,0.025841,0.31368
DiabetesPedigreeFunction,-0.033523,0.137246,-0.002805,0.115016,0.130395,0.155382,1.0,0.033561,0.173844
Age,0.544341,0.267136,0.330107,0.166816,0.220261,0.025841,0.033561,1.0,0.238356
Outcome,0.221898,0.49465,0.170589,0.259491,0.303454,0.31368,0.173844,0.238356,1.0
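
One way to confirm the replacement directly is to count the remaining zeros and the new NaN values. The sketch below uses a made-up frame (`toy`) rather than the real data. It also illustrates why the replacement matters for the matrix above: `DataFrame.corr()` excludes missing values pairwise, so only rows where both columns are present contribute to each coefficient.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    'Glucose': [148, 0, 183, 89, 137],
    'Insulin': [0, 0, 543, 94, 168],
})

cleaned = toy.replace(0, np.nan)

remaining_zeros = (cleaned == 0).sum()  # should be 0 for every column
nan_counts = cleaned.isnull().sum()     # number of values now marked missing
print(remaining_zeros)
print(nan_counts)

# With two columns, the pairwise-complete correlation equals the correlation
# computed after dropping every row that has a missing value.
r_pairwise = cleaned.corr().loc['Glucose', 'Insulin']
r_complete = cleaned.dropna().corr().loc['Glucose', 'Insulin']
print(r_pairwise, r_complete)
```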


First we handle the missing glucose values, then we impute the missing insulin data from the glucose values.

Checking the rows that have null glucose values.

In [72]:
df[pd.isnull(df['Glucose'])]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
75,1,,48.0,20.0,,24.7,0.14,22,0
182,1,,74.0,20.0,23.0,27.7,0.299,21,0
342,1,,68.0,35.0,,32.0,0.389,22,0
349,5,,80.0,32.0,,41.0,0.346,37,1
502,6,,68.0,41.0,,39.0,0.727,41,1


For each row, we impute the value based on the BMI column: we compute the mean glucose of the rows with the same BMI and replace the missing glucose value with that mean. No glucose values exist for a BMI of 41, so a BMI of 40 was used for that row instead.

In [73]:
df.at[75, 'Glucose'] = df[df['BMI'] == 24.7]['Glucose'].mean()
df.at[182, 'Glucose'] = df[df['BMI'] == 27.7]['Glucose'].mean()
df.at[342, 'Glucose'] = df[df['BMI'] == 32.0]['Glucose'].mean()
df.at[349, 'Glucose'] = df[df['BMI'] == 40.0]['Glucose'].mean()
df.at[502, 'Glucose'] = df[df['BMI'] == 39.0]['Glucose'].mean()
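
The row-by-row assignments above could also be written as a reusable helper. This is only a sketch under stated assumptions: the function name `impute_glucose_by_bmi` is hypothetical, the frame `toy` holds invented values, and the fallback rule (use the nearest BMI that has glucose data) is an assumption that automates the manual choice of BMI 40 for BMI 41; on the real data it may select a different neighbor than the one chosen by hand.

```python
import numpy as np
import pandas as pd

def impute_glucose_by_bmi(frame):
    """Fill missing Glucose with the mean Glucose of rows sharing the same BMI.

    Assumed fallback: if no row with that BMI has a glucose value, use the
    nearest BMI that does.
    """
    out = frame.copy()
    # Mean glucose per BMI value; BMI groups without any glucose are dropped.
    mean_by_bmi = out.groupby('BMI')['Glucose'].mean().dropna()
    known_bmis = mean_by_bmi.index.to_numpy()

    needs_fill = out['Glucose'].isnull() & out['BMI'].notnull()
    for idx in out[needs_fill].index:
        bmi = out.at[idx, 'BMI']
        if bmi not in mean_by_bmi.index:
            # Nearest BMI for which a mean glucose exists.
            bmi = known_bmis[np.abs(known_bmis - bmi).argmin()]
        out.at[idx, 'Glucose'] = mean_by_bmi.loc[bmi]
    return out

# Invented values for illustration only.
toy = pd.DataFrame({
    'BMI':     [24.7, 24.7, 24.7, 40.0, 41.0],
    'Glucose': [100.0, 120.0, np.nan, 150.0, np.nan],
})
filled = impute_glucose_by_bmi(toy)
print(filled)
```

The helper returns a copy, so the input frame is left unchanged and the imputed result can be compared against the original.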

The Glucose column no longer has any null values.

In [74]:
df[pd.isnull(df['Glucose'])]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome


Next we fill in the missing values in the Insulin column. One way to do this is to estimate them from the regression line relating glucose to insulin. We fit the line on the rows where both values are present, then write a function that takes the Insulin and Glucose columns and, if the Insulin value is NaN, returns the insulin estimate derived from the regression line.

In [75]:
gluc_ins_df = df[['Glucose', 'Insulin']].copy(deep=True)
gluc_ins_df.dropna(inplace = True)

In [76]:
gluc_ins_df.head()

Unnamed: 0,Glucose,Insulin
3,89.0,94.0
4,137.0,168.0
6,78.0,88.0
8,197.0,543.0
13,189.0,846.0


In [77]:
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(gluc_ins_df['Glucose'],  gluc_ins_df['Insulin'])
intercept

-118.95263596735862

In [78]:
slope

2.238674263550429

Using the slope and intercept values, we can now write a function that estimates missing insulin values from glucose levels.

In [79]:
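def impute_insulin(row):
    # Keep a measured insulin value; otherwise estimate it from glucose
    # using the fitted regression line. Columns are accessed by label
    # because positional access such as row[0] is deprecated in pandas.
    if pd.isnull(row['Insulin']):
        return row['Glucose'] * slope + intercept
    return row['Insulin']

df['Insulin'] = df[['Insulin', 'Glucose']].apply(impute_insulin, axis=1)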
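
The same regression-based imputation can be expressed without a row-wise `apply`. The sketch below is one possible vectorized variant, not the method used in this notebook; it fits the line with `scipy.stats.linregress` on a made-up frame (`toy`) whose points lie exactly on a line so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Invented values: the first four rows lie on insulin = 2 * glucose - 100.
toy = pd.DataFrame({
    'Glucose': [80.0, 100.0, 120.0, 140.0, 90.0],
    'Insulin': [60.0, 100.0, 140.0, 180.0, np.nan],
})

# Fit the regression line on rows where both values are present.
known = toy.dropna(subset=['Glucose', 'Insulin'])
fit = stats.linregress(known['Glucose'], known['Insulin'])

# Estimate insulin for every row, then use the estimate only where it is missing.
estimated = fit.slope * toy['Glucose'] + fit.intercept
toy['Insulin'] = toy['Insulin'].fillna(estimated)
print(toy)
```

With the slope and intercept fitted earlier (about 2.24 and -118.95), the line gives a negative estimate for glucose below roughly 53, so it may be worth checking the minimum of the imputed column.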

Checking the remaining null values for BMI and BloodPressure.

In [80]:
df[pd.isnull(df['BMI'])]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
9,8,125.0,96.0,,160.881647,,0.232,54,1
49,7,105.0,,,116.108162,,0.305,24,0
60,2,84.0,,,69.096002,,0.304,21,0
81,2,74.0,,,46.70926,,0.102,22,0
145,0,102.0,75.0,23.0,109.392139,,0.572,21,0
371,0,118.0,64.0,23.0,89.0,,1.731,21,0
426,0,94.0,,,91.482745,,0.256,25,0
494,3,80.0,,,60.141305,,0.174,22,0
522,6,114.0,,,136.25623,,0.189,26,0
684,5,136.0,82.0,,185.507064,,0.64,69,0


Several rows are missing data for BloodPressure, BMI, and SkinThickness. SkinThickness has many missing values and correlates only weakly with Outcome (about 0.26), so we drop that column first instead of imputing values. The rows that still have missing values are few, so we then drop them from the dataset.

In [81]:
df.drop('SkinThickness', axis=1, inplace=True)

In [82]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,212.371155,33.6,0.627,50,1
1,1,85.0,66.0,71.334676,26.6,0.351,31,0
2,8,183.0,64.0,290.724754,23.3,0.672,32,1
3,1,89.0,66.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,168.0,43.1,2.288,33,1


In [83]:
df.dropna(inplace=True)
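
Before discarding rows, it can help to record how many are affected. The following is a small sketch on a made-up frame (`toy`, with invented values) that reports the missing values per column and the number of rows removed by `dropna()`.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    'Glucose': [148.0, 85.0, 183.0, 89.0],
    'BloodPressure': [72.0, np.nan, 64.0, 66.0],
    'BMI': [33.6, 26.6, np.nan, 28.1],
})

rows_before = len(toy)
missing_per_column = toy.isnull().sum()  # NaN count per column
cleaned = toy.dropna()                   # drop every row with at least one NaN
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print(f'dropped {rows_dropped} of {rows_before} rows')
```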

In [55]:
#df.to_csv('D:/diabetesLimpo.csv', index=False)