[Minerando Dados](https://minerandodados.com.br)

# Handling Missing Values with Pandas

In [0]:
import pandas as pd
import numpy as np

In [0]:
df_diabetes = pd.read_csv('sample_data/diabetes.csv')

In [0]:
df_diabetes.head()

### Preparing the Dataset

In [0]:
df_diabetes.describe()

In [0]:
df_diabetes.isnull().sum()

In [51]:
(df_diabetes[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI']] == 0).sum()

Pregnancies      111
Glucose            5
BloodPressure     35
SkinThickness    227
Insulin          374
BMI               11
dtype: int64

In [0]:
df_diabetes = df_diabetes[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0, np.nan)

In [53]:
df_diabetes.isnull().sum()

Pregnancies      111
Glucose            5
BloodPressure     35
SkinThickness    227
Insulin          374
BMI               11
dtype: int64
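Whether a zero really means "missing" depends on the column: a zero in `Pregnancies` can be a genuine value, while a zero `Glucose` reading is not physiologically plausible. The sketch below uses a small made-up DataFrame (not rows from `diabetes.csv`) to show how the replacement could be restricted to selected columns. The list `zero_means_missing` is an assumption made for illustration, not something the dataset documents.

```python
import numpy as np
import pandas as pd

# Made-up sample rows (not taken from diabetes.csv).
sample = pd.DataFrame({
    'Pregnancies': [0, 2, 0, 5],
    'Glucose': [148, 0, 183, 89],
    'Insulin': [0, 94, 0, 168],
})

# Assumption: zeros in these columns stand for "not measured".
zero_means_missing = ['Glucose', 'Insulin']

cleaned = sample.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)

# Count the missing values per column after the replacement.
missing_counts = cleaned.isnull().sum()
print(missing_counts)
```

In this sketch the zeros in `Pregnancies` are left untouched, so only the columns named in the list gain NaN entries.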

## Handling the Missing Records

### Dropping records

**Dropping every record that has at least one NaN in any attribute**

In [0]:
df2 = df_diabetes.dropna()

Checking the records of the DataFrame

In [55]:
df2.isnull().sum()

Pregnancies      0
Glucose          0
BloodPressure    0
SkinThickness    0
Insulin          0
BMI              0
dtype: int64

**Dropping only the records in which every attribute is NaN**

In [0]:
df2 = df_diabetes.dropna(how='all')

Checking the records of the DataFrame

In [57]:
df2.isnull().sum()

Pregnancies      111
Glucose            5
BloodPressure     35
SkinThickness    227
Insulin          374
BMI               11
dtype: int64
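Besides `how='any'` (the default) and `how='all'`, `dropna` also accepts `subset` and `thresh`. The following is a minimal sketch on a made-up frame (the values are illustrative only) comparing which rows each variant keeps.

```python
import numpy as np
import pandas as pd

# Made-up data: row 2 is complete, row 3 is entirely NaN.
toy = pd.DataFrame({
    'Glucose': [148.0, np.nan, 183.0, np.nan],
    'Insulin': [np.nan, 94.0, 168.0, np.nan],
    'BMI': [33.6, np.nan, 23.3, np.nan],
})

any_nan = toy.dropna()                       # keeps rows with no NaN at all
all_nan = toy.dropna(how='all')              # drops only rows that are entirely NaN
by_subset = toy.dropna(subset=['Insulin'])   # only checks the Insulin column
by_thresh = toy.dropna(thresh=2)             # keeps rows with at least 2 non-NaN values

for name, frame in [('any', any_nan), ('all', all_nan),
                    ('subset', by_subset), ('thresh', by_thresh)]:
    print(name, list(frame.index))
```

`thresh` can be a useful middle ground when dropping on any NaN would discard most of the rows.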

### Filling in the missing records

**Filling the missing records with zero.**

In [58]:
df_diabetes['SkinThickness'].head(10)

0    35.0
1    29.0
2     NaN
3    23.0
4    35.0
5     NaN
6    32.0
7     NaN
8    45.0
9     NaN
Name: SkinThickness, dtype: float64

In [59]:
df_diabetes['SkinThickness'].fillna(0).head(10)

0    35.0
1    29.0
2     0.0
3    23.0
4    35.0
5     0.0
6    32.0
7     0.0
8    45.0
9     0.0
Name: SkinThickness, dtype: float64

**Filling the missing records with the mean, mode, or median.**

In [0]:
media = df_diabetes['SkinThickness'].mean()

In [61]:
print(media)

29.153419593345657


In [0]:
df_diabetes['SkinThickness'].fillna(media).head(10)

In [0]:
mode = df_diabetes['SkinThickness'].mode()

In [65]:
print(mode)

0    32.0
dtype: float64


In [0]:
df_diabetes['SkinThickness'].fillna(mode[0]).head(10)

In [0]:
median = df_diabetes['SkinThickness'].median()

In [68]:
print(median)

29.0


In [0]:
df_diabetes['SkinThickness'].fillna(median).head(10)
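The three statistics can give quite different fill values when a column is skewed. As a minimal sketch on a made-up Series (not the diabetes data), the example below computes each one and fills the same gaps three ways. Note that `Series.mode()` returns a Series, because ties are possible, so a single value has to be picked from it.

```python
import numpy as np
import pandas as pd

# Made-up values with one large outlier (90.0).
s = pd.Series([10.0, 20.0, 20.0, np.nan, 90.0, np.nan])

mean_value = s.mean()        # NaN values are skipped by default
median_value = s.median()
mode_value = s.mode()[0]     # mode() returns a sorted Series; take the first entry

filled = pd.DataFrame({
    'original': s,
    'mean': s.fillna(mean_value),
    'median': s.fillna(median_value),
    'mode': s.fillna(mode_value),
})
print(filled)
```

Here the outlier pulls the mean well above the median and mode, which is one reason the median is often preferred for skewed columns.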

**Filling each attribute of the DataFrame with a different value**

In [0]:
valores_preenchimento = {'SkinThickness': media, 'Insulin': df_diabetes['Insulin'].median(), 'Glucose': 0}

In [0]:
df_diabetes.fillna(value=valores_preenchimento).isnull().sum()
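When `fillna` receives a dictionary, only the columns named as keys are filled; every other column keeps its NaN values. A minimal sketch on a made-up frame (the values and the name `fill_values` are illustrative) makes this visible:

```python
import numpy as np
import pandas as pd

# Made-up data with one NaN per column.
toy = pd.DataFrame({
    'SkinThickness': [35.0, np.nan, 23.0],
    'Insulin': [np.nan, 94.0, 168.0],
    'BMI': [33.6, np.nan, 23.3],
})

# One fill value per column; BMI is deliberately left out.
fill_values = {
    'SkinThickness': toy['SkinThickness'].mean(),
    'Insulin': toy['Insulin'].median(),
}

filled = toy.fillna(value=fill_values)
remaining = filled.isnull().sum()
print(remaining)
```

`BMI` still reports one missing value afterwards, because it has no entry in the dictionary.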

**Filling missing records by propagating values forward and backward**

In [73]:
df_diabetes['SkinThickness'].head(10)

0    35.0
1    29.0
2     NaN
3    23.0
4    35.0
5     NaN
6    32.0
7     NaN
8    45.0
9     NaN
Name: SkinThickness, dtype: float64

In [74]:
df_diabetes['SkinThickness'].ffill().head(10)

0    35.0
1    29.0
2    29.0
3    23.0
4    35.0
5    35.0
6    32.0
7    32.0
8    45.0
9    45.0
Name: SkinThickness, dtype: float64

In [75]:
df_diabetes['SkinThickness'].bfill().head(10)

0    35.0
1    29.0
2    23.0
3    23.0
4    35.0
5    32.0
6    32.0
7    45.0
8    45.0
9    23.0
Name: SkinThickness, dtype: float64
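Propagation copies a neighbouring row's value, so it cannot fill a gap that has no neighbour in the chosen direction. The sketch below, on a made-up Series, shows the edge cases: a leading NaN survives a forward fill, a trailing NaN survives a backward fill, and `limit` caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

# Made-up values with a leading NaN, a run of two NaNs, and a trailing NaN.
s = pd.Series([np.nan, 35.0, np.nan, np.nan, 23.0, np.nan])

forward = s.ffill()            # leading NaN stays: nothing precedes it
backward = s.bfill()           # trailing NaN stays: nothing follows it
limited = s.ffill(limit=1)     # fill at most one consecutive NaN per gap
both = s.ffill().bfill()       # forward fill, then back-fill what is left

print(pd.DataFrame({'original': s, 'ffill': forward, 'bfill': backward,
                    'ffill(limit=1)': limited, 'ffill+bfill': both}))
```

Chaining the two directions is one way to make sure no NaN remains, as long as the Series has at least one valid value.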

**Filling missing records with SimpleImputer**

In [0]:
from sklearn.impute import SimpleImputer

In [0]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

In [0]:
df_diabetes['SkinThickness2'] = imp.fit_transform(df_diabetes[['SkinThickness']]).ravel()

In [81]:
df_diabetes[['SkinThickness','SkinThickness2']].head(10)

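| | SkinThickness | SkinThickness2 |
|---|---|---|
| 0 | 35.0 | 35.0 |
| 1 | 29.0 | 29.0 |
| 2 | NaN | 29.15342 |
| 3 | 23.0 | 23.0 |
| 4 | 35.0 | 35.0 |
| 5 | NaN | 29.15342 |
| 6 | 32.0 | 32.0 |
| 7 | NaN | 29.15342 |
| 8 | 45.0 | 45.0 |
| 9 | NaN | 29.15342 |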
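One practical difference from `fillna` is that a fitted `SimpleImputer` remembers the value it learned, so the same value can be applied to data it has not seen. The following is a minimal sketch under that assumption, using two made-up frames (`train` and `new_rows` are hypothetical names) and the `median` strategy instead of the mean:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data: one frame to learn from, one to apply the result to.
train = pd.DataFrame({'SkinThickness': [35.0, 29.0, np.nan, 23.0]})
new_rows = pd.DataFrame({'SkinThickness': [np.nan, 40.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(train[['SkinThickness']])

# statistics_ holds the learned fill value for each column.
learned = imputer.statistics_[0]

# transform returns a 2-D array; ravel() flattens it into a single column.
new_rows['SkinThickness_filled'] = imputer.transform(new_rows[['SkinThickness']]).ravel()
print(learned)
print(new_rows)
```

The NaN in `new_rows` is filled with the median learned from `train`, not with a statistic computed on `new_rows` itself.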
