# PIMA Dataset

The Pima Indians Diabetes Dataset, originally from the National Institute of Diabetes and Digestive and Kidney Diseases, contains records for 768 women from a population near Phoenix, Arizona, USA. The outcome tested was diabetes: 268 women tested positive and 500 tested negative. The dataset therefore has one target (dependent) variable and 8 attributes (TYNECKI, 2018): pregnancies, OGTT (Oral Glucose Tolerance Test), blood pressure, skin thickness, insulin, BMI (Body Mass Index), age, and diabetes pedigree function. The Pima population has been studied by the National Institute of Diabetes and Digestive and Kidney Diseases at 2-year intervals since 1965. Because epidemiological evidence indicates that T2DM results from the interaction of genetic and environmental factors, the dataset includes attributes that are expected to be related to the onset of diabetes and its future complications.

### 1. EDA

In [13]:
import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-Dataset/master/diabetes.csv"

data = pd.read_csv(url)

data.head(10)

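| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |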


### Simple statistical analysis

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [5]:
data.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


Features to modify: ['Glucose', 'BloodPressure', 'BMI'] (their minimum of 0 is not a plausible measurement).

### Analyzing the class balance

In [9]:
# Ideal case -> 50% (positive, 1) - 50% (negative, 0)
data['Outcome'].unique() # Existing classes
print("Measured diabetes samples: ")
data['Outcome'].value_counts()

Measured diabetes samples: 


Outcome
0    500
1    268
Name: count, dtype: int64
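The raw counts are easier to compare with the 50/50 ideal once they are expressed as proportions. The following is a minimal sketch; the `outcome` series is rebuilt here from the counts reported above so that the snippet is self-contained, whereas in the notebook it would simply be `data['Outcome']`.

```python
import pandas as pd

# Assumption: rebuild the Outcome column from the reported counts
# (500 negative, 268 positive) instead of downloading the CSV.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts()
# normalize=True returns the share of each class instead of the count
proportions = outcome.value_counts(normalize=True).round(3)

# Ratio of the majority class to the minority class
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

Roughly 65% of the samples are negative and 35% positive, so the dataset is moderately imbalanced rather than extreme.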

### NaN or missing data ('?')

In [11]:
datos_faltantes = data.isnull()

for i in datos_faltantes.columns.values.tolist():
    print(datos_faltantes[i].value_counts())

Pregnancies
False    768
Name: count, dtype: int64
Glucose
False    768
Name: count, dtype: int64
BloodPressure
False    768
Name: count, dtype: int64
SkinThickness
False    768
Name: count, dtype: int64
Insulin
False    768
Name: count, dtype: int64
BMI
False    768
Name: count, dtype: int64
DiabetesPedigreeFunction
False    768
Name: count, dtype: int64
Age
False    768
Name: count, dtype: int64
Outcome
False    768
Name: count, dtype: int64
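The same check can be written more compactly with `isnull().sum()`, and placeholder markers such as `'?'` can be converted to NaN while reading the file. The sketch below uses a small in-memory CSV (`csv_text`, a hypothetical stand-in for the real file) to show both ideas.

```python
import io

import pandas as pd

# Hypothetical toy CSV in which missing values are written as '?'
csv_text = "Glucose,BMI\n148,33.6\n?,26.6\n183,?\n"

# na_values tells read_csv which extra strings to treat as missing
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# One missing-value count per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Applied to `data`, `data.isnull().sum()` would return 0 for every column, in line with the output above.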


### Anomalous or zero values

In [19]:
col_zero = ['Glucose', 'BloodPressure', 'BMI']
data_sec = data[col_zero]
data_zero = pd.DataFrame(data_sec == 0) # Flag where zeros occur

for i in data_zero.columns.values.tolist():
    print(data_zero[i].value_counts())
    

Glucose
False    763
True       5
Name: count, dtype: int64
BloodPressure
False    733
True      35
Name: count, dtype: int64
BMI
False    757
True      11
Name: count, dtype: int64
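Since a zero in these columns stands for a missing measurement, a common next step is to turn those zeros into NaN so that pandas treats them as missing. The sketch below assumes that interpretation and uses the first ten rows shown in `data.head()` (the `sample` frame) rather than the full dataset.

```python
import numpy as np
import pandas as pd

col_zero = ["Glucose", "BloodPressure", "BMI"]

# First ten rows of the three columns, copied from the head() output
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
})

# Count zeros per column in a single expression
zero_counts = (sample[col_zero] == 0).sum()

# Replace zeros with NaN on a copy so the original stays untouched
sample_nan = sample.copy()
sample_nan[col_zero] = sample_nan[col_zero].replace(0, np.nan)

print(zero_counts)
print(sample_nan.isnull().sum())
```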


### Imputation with the mean and median

In [20]:
data_mod = data.copy()
data_mod.head()

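| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |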
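A minimal sketch of mean and median imputation on a copy such as `data_mod` could look as follows. Which statistic goes with which column is an assumption made here for illustration (mean for `Glucose`, median for `BloodPressure` and `BMI`), and the five-row frame is hypothetical toy data, not the real dataset.

```python
import numpy as np
import pandas as pd

col_zero = ["Glucose", "BloodPressure", "BMI"]

# Hypothetical toy rows with one zero placed in each column
data_mod = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 70],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1],
})

# Step 1: mark zeros as missing so they do not distort the statistics
data_mod[col_zero] = data_mod[col_zero].replace(0, np.nan)

# Step 2 (assumed choice): mean for Glucose
data_mod["Glucose"] = data_mod["Glucose"].fillna(data_mod["Glucose"].mean())

# Step 3 (assumed choice): median for BloodPressure and BMI
for col in ["BloodPressure", "BMI"]:
    data_mod[col] = data_mod[col].fillna(data_mod[col].median())

print(data_mod)
```

Computing the mean or median after the zeros are converted to NaN matters: pandas skips NaN in both statistics, whereas leaving the zeros in place would pull the imputed value downward.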


## Model generation (after imputation)

In [43]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Input matrix
X = data.drop('Outcome', axis=1)
# Target - output vector
y = data['Outcome']

# Build the training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)

# Create the model
modelo_rf = RandomForestClassifier(n_estimators=200) # hyperparameters
# Train the model
modelo_rf.fit(X_train, y_train)
# Validate the model
y_pred = modelo_rf.predict(X_test)
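The forest above is created without a `random_state`, so its predictions, and therefore the metrics, can vary slightly from run to run. The split is also not stratified, so the class mix of the validation set is left to chance. The sketch below illustrates both options on synthetic data from `make_classification` (an assumption, used only to keep the snippet self-contained); the helper `fit_predict` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 768 samples, 8 features, roughly 65% negatives
X, y = make_classification(n_samples=768, n_features=8,
                           weights=[0.65], random_state=0)

# stratify=y keeps the class proportions nearly equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=40, stratify=y)


def fit_predict(seed):
    """Train a forest with a fixed seed and predict the validation set."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_test)


pred_a = fit_predict(40)
pred_b = fit_predict(40)

print("Same predictions with the same seed:", np.array_equal(pred_a, pred_b))
print(f"Positive share overall: {y.mean():.3f}, in validation: {y_test.mean():.3f}")
```

With 768 samples and `test_size=0.3`, `train_test_split` rounds 230.4 up and places 231 samples in the validation set.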


### Metrics

In [44]:
# Confusion matrix
con_mat = confusion_matrix(y_test, y_pred)
print(f"Confusion matrix:\n {con_mat}")
# Accuracy
exactitud = accuracy_score(y_test, y_pred)
# Precision
precision = precision_score(y_test, y_pred)
# Sensitivity (recall)
recall = recall_score(y_test, y_pred)
# Specificity: true negatives / (true negatives + false positives)
vn, fp, fn, vp = con_mat.ravel()
especificidad = vn / (vn + fp)
# F1-score
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {exactitud:.2f}\nPrecision: {precision:.2f}\nSensitivity: {recall:.2f}\nSpecificity: {especificidad:.2f}\nF1-score: {f1:.2f}")

Confusion matrix:
 [[124  18]
 [ 36  53]]
Accuracy: 0.77
Precision: 0.75
Sensitivity: 0.60
Specificity: 0.87
F1-score: 0.66
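Each of these metrics can be recomputed by hand from the four cells of the confusion matrix, which is a useful check on how they relate to one another. The sketch below starts from the matrix printed above; the English variable names are chosen here for readability.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted
con_mat = np.array([[124, 18],
                    [36, 53]])
tn, fp, fn, tp = con_mat.ravel()

accuracy = (tp + tn) / con_mat.sum()   # correct predictions / all cases
precision = tp / (tp + fp)             # correct among predicted positives
recall = tp / (tp + fn)                # detected among actual positives
specificity = tn / (tn + fp)           # detected among actual negatives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy: {accuracy:.2f}")        # 177 / 231
print(f"Precision: {precision:.2f}")      # 53 / 71
print(f"Sensitivity: {recall:.2f}")       # 53 / 89
print(f"Specificity: {specificity:.2f}")  # 124 / 142
print(f"F1-score: {f1:.2f}")              # 106 / 160
```

The rounded values are 0.77, 0.75, 0.60, 0.87 and 0.66, the same as those reported by scikit-learn.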


#### Accuracy: 0.77
On the VALIDATION sample (X_test), the model correctly predicts 77% of the cases.
768 cases -> 500 negative - 268 positive.
The validation set holds 231 cases (768 × 0.3 = 230.4, rounded up), here 142 negative - 89 positive according to the confusion matrix. Accuracy depends on that class mix: a validation split of about 230 cases could in principle be 200 negative - 30 positive, 229 negative - 1 positive, or 1 negative - 229 positive.
#### Precision: 0.75
Each time the model predicts a diabetes diagnosis, that diagnosis is correct 75% of the time.
#### Sensitivity: 0.60 (TP)
The model correctly identifies 60% of the positive diabetes cases. The remaining 40% of positive cases are missed.
#### Specificity: 0.87 (TN)
The model correctly identifies 87% of the negative diabetes cases.
#### F1-score: 0.66 (harmonic mean of precision and sensitivity) - the data is not balanced
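Because the classes are not balanced, it helps to compare the model against a trivial baseline that always predicts the majority class. The sketch below rebuilds the validation labels from the confusion matrix (142 negative, 89 positive) and also computes balanced accuracy, the mean of sensitivity and specificity; the baseline comparison is an illustration added here, not part of the original analysis.

```python
import numpy as np

# Validation labels rebuilt from the confusion matrix row totals
y_true = np.array([0] * 142 + [1] * 89)

# Baseline: always predict "no diabetes" (the majority class)
y_baseline = np.zeros_like(y_true)

baseline_accuracy = (y_true == y_baseline).mean()
baseline_recall = ((y_true == 1) & (y_baseline == 1)).sum() / (y_true == 1).sum()

# Balanced accuracy of the random forest, from its confusion matrix cells
model_recall = 53 / 89
model_specificity = 124 / 142
model_balanced_accuracy = (model_recall + model_specificity) / 2

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Baseline sensitivity: {baseline_recall:.2f}")
print(f"Model balanced accuracy: {model_balanced_accuracy:.2f}")
```

The baseline reaches about 0.61 accuracy while detecting no positive case at all, which is why sensitivity and the F1-score are more informative than accuracy on this dataset.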

