# Detecting and Removing Duplicate Rows

Detecting duplicate data in a DataFrame (or any other dataset) is an important step in ensuring data quality and integrity. Duplicates can arise in several ways, such as errors in the collection process, the integration of multiple data sources, or import failures.

### Why detect duplicate data?

1. Avoid incorrect analyses: duplicates distort metrics such as means and sums, producing invalid insights.
2. Preserve data integrity: in machine learning, duplicates can lead to biased or less generalizable models.
3. Improve performance: removing duplicates reduces the size of the dataset, saving memory and processing time.

### What can duplicate data cause?

1. Distorted insights: for example, duplicated records can inflate metrics such as sales or revenue.
2. Problems in ML/DL: duplicates can cause overfitting, since the model "learns" redundant patterns.
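
To make the inflation effect concrete, here is a minimal sketch with a made-up sales table. The `sales` DataFrame, its column names, and its values are invented for illustration and are not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical sales records (illustrative values only)
sales = pd.DataFrame({
    "order_id": [101, 102, 103],
    "amount": [50.0, 20.0, 30.0],
})

# Simulate an import error that loads order 101 twice
sales_with_dup = pd.concat([sales, sales.iloc[[0]]], ignore_index=True)

total_correct = sales["amount"].sum()            # 100.0
total_inflated = sales_with_dup["amount"].sum()  # 150.0
mean_correct = sales["amount"].mean()
mean_distorted = sales_with_dup["amount"].mean()  # 37.5

print(total_correct, total_inflated)
print(mean_correct, mean_distorted)
```

A single repeated row is enough to raise the total by 50% in this toy case, and the mean shifts as well because the repeated value is above average.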


## Setting up the data

In [1]:
# Import the pandas library
import pandas as pd

# Import the minmax_scale function from scikit-learn
from sklearn.preprocessing import minmax_scale

# Read the CSV file into a DataFrame
df = pd.read_csv('datas/diabetes.csv')

# Iterate over the DataFrame columns
for column in df:

    # Replace NaN values with zero for demonstration purposes
    df[column] = df[column].fillna(0)

    # Rescale values to a common range (between 0 and 1) for plotting purposes
    df[column] = minmax_scale(df[column])

# Duplicate rows for demonstration purposes
linhas_duplicadas = df.iloc[:10]  # Select the first ten rows
df = pd.concat([df, linhas_duplicadas], ignore_index=True)

# Display the first 10 rows of the DataFrame
df.head(10)

index,preg,plas,pres,skin,insu,mass,pedi,age
0,0.352941,0.743719,0.590164,0.353535,0.0,0.500745,0.269183,0.617284
1,0.058824,0.427136,0.540984,0.292929,0.0,0.396423,0.150672,0.382716
2,0.470588,0.919598,0.52459,0.0,0.0,0.347243,0.288505,0.395062
3,0.058824,0.447236,0.540984,0.232323,0.111111,0.418778,0.071665,0.259259
4,0.0,0.688442,0.327869,0.353535,0.198582,0.642325,0.982395,0.0
5,0.294118,0.582915,0.606557,0.0,0.0,0.38152,0.086264,0.37037
6,0.176471,0.39196,0.409836,0.323232,0.104019,0.461997,0.106445,0.320988
7,0.588235,0.577889,0.0,0.0,0.0,0.52608,0.057495,0.358025
8,0.117647,0.98995,0.57377,0.454545,0.641844,0.454545,0.0678,0.654321
9,0.470588,0.628141,0.786885,0.0,0.0,0.0,0.099575,0.666667


## Detecting duplicate data

In [2]:
# Flag duplicate rows (by default, every repeat after the first occurrence is marked True)
duplicacoes = df.duplicated()

# Display the duplicate rows
df[duplicacoes].head(10)

index,preg,plas,pres,skin,insu,mass,pedi,age
768,0.352941,0.743719,0.590164,0.353535,0.0,0.500745,0.269183,0.617284
769,0.058824,0.427136,0.540984,0.292929,0.0,0.396423,0.150672,0.382716
770,0.470588,0.919598,0.52459,0.0,0.0,0.347243,0.288505,0.395062
771,0.058824,0.447236,0.540984,0.232323,0.111111,0.418778,0.071665,0.259259
772,0.0,0.688442,0.327869,0.353535,0.198582,0.642325,0.982395,0.0
773,0.294118,0.582915,0.606557,0.0,0.0,0.38152,0.086264,0.37037
774,0.176471,0.39196,0.409836,0.323232,0.104019,0.461997,0.106445,0.320988
775,0.588235,0.577889,0.0,0.0,0.0,0.52608,0.057495,0.358025
776,0.117647,0.98995,0.57377,0.454545,0.641844,0.454545,0.0678,0.654321
777,0.470588,0.628141,0.786885,0.0,0.0,0.0,0.099575,0.666667
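
The output above lists only the appended copies (indices 768 to 777) because `duplicated()` defaults to `keep='first'`, which leaves the first occurrence unflagged. The sketch below shows the `keep` and `subset` parameters on a small toy DataFrame. The `toy` data is invented for illustration and only borrows two column names from the dataset.

```python
import pandas as pd

# Toy DataFrame (illustrative values): row 3 repeats row 0 exactly,
# row 4 repeats only the "preg" value of row 1
toy = pd.DataFrame({
    "preg": [0.1, 0.2, 0.3, 0.1, 0.2],
    "plas": [0.5, 0.6, 0.7, 0.5, 0.9],
})

# Default keep="first": only the later copies are flagged
later_copies = toy.duplicated()

# keep=False: every row that belongs to a duplicated group is flagged
all_copies = toy.duplicated(keep=False)

# subset: compare only the listed columns
same_preg = toy.duplicated(subset=["preg"])

print(later_copies.sum())  # 1
print(all_copies.sum())    # 2
print(same_preg.sum())     # 2
```

Using `keep=False` is handy when you want to inspect originals and copies side by side before deciding which ones to drop.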


## Removing duplicate data

In [3]:
# Remove duplicates, keeping only the first occurrence
df = df.drop_duplicates()

# Check for duplicate rows again
duplicacoes = df.duplicated()

# Display any remaining duplicate rows (none are expected)
df[duplicacoes].head(10)

index,preg,plas,pres,skin,insu,mass,pedi,age
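
The empty result confirms that no duplicates remain. As a complementary sketch, the example below counts how many rows were removed and shows the `keep`, `subset`, and `ignore_index` parameters of `drop_duplicates()`. The `toy` DataFrame is invented for illustration, and the choice of `preg` as the subset column is an assumption made only for the demo.

```python
import pandas as pd

# Toy DataFrame (illustrative values): row 2 repeats row 0 exactly,
# row 3 repeats only the "preg" value of row 1
toy = pd.DataFrame({
    "preg": [0.1, 0.2, 0.1, 0.2],
    "plas": [0.5, 0.6, 0.5, 0.9],
})

rows_before = len(toy)

# keep="first" is the default; ignore_index=True renumbers the rows 0..n-1
deduped = toy.drop_duplicates(ignore_index=True)
rows_removed = rows_before - len(deduped)

# Compare on "preg" only and retain the last occurrence of each value
last_kept = toy.drop_duplicates(subset=["preg"], keep="last")

print(rows_removed)                 # 1
print(deduped.duplicated().any())   # False
print(list(last_kept.index))        # [2, 3]
```

Without `ignore_index=True`, the surviving rows keep their original index labels, which can leave gaps when the removed rows are not at the end of the DataFrame.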
