<a href="https://colab.research.google.com/github/eldercamposds/Breast-Cancer-Wisconsin-Diagnostic-Data-Set./blob/Pr%C3%A9-Processamento/Breast_Cancer_Wisconsin_(Diagnostic)_Data_Set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#KDD

This project applies the stages of KDD (Knowledge Discovery in Databases), a systematic process for extracting useful knowledge from large volumes of data.

Stages:

1.   **Selection:** Choosing and obtaining the relevant data from the full set.
2.   **Preprocessing:** Cleaning and preparing the data, including the correction of missing and inconsistent values.
3.   **Transformation:** Converting the data into formats suitable for mining, for example through normalization and aggregation.
4.   **Data mining:** Analyzing large volumes of data to discover patterns, using statistical or machine learning techniques.
5.   **Interpretation/Evaluation:** Analyzing and interpreting the discovered patterns.
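The five stages can be sketched as a chain of small functions. This is only an illustration in plain Python: the toy records, the function names, and the "pattern" (mean normalized radius per class) are assumptions made for the example and are not the analysis performed in this notebook.

```python
# Toy records with illustrative values only (not taken from the dataset).
records = [
    {"id": 1, "diagnosis": "M", "radius": 18.0},
    {"id": 2, "diagnosis": "B", "radius": 12.0},
    {"id": 3, "diagnosis": "B", "radius": None},  # missing value
    {"id": 4, "diagnosis": "M", "radius": 20.0},
    {"id": 5, "diagnosis": "B", "radius": 10.0},
]

def select(rows):
    """1. Selection: keep only the fields relevant to the question."""
    return [{"diagnosis": r["diagnosis"], "radius": r["radius"]} for r in rows]

def preprocess(rows):
    """2. Preprocessing: drop rows that contain missing values."""
    return [r for r in rows if r["radius"] is not None]

def transform(rows):
    """3. Transformation: min-max normalize radius to the range [0, 1]."""
    values = [r["radius"] for r in rows]
    lo, hi = min(values), max(values)
    return [{**r, "radius": (r["radius"] - lo) / (hi - lo)} for r in rows]

def mine(rows):
    """4. Data mining: a trivial pattern, the mean normalized radius per class."""
    groups = {}
    for r in rows:
        groups.setdefault(r["diagnosis"], []).append(r["radius"])
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def evaluate(pattern):
    """5. Interpretation/Evaluation: summarize what the pattern suggests."""
    return "M > B" if pattern["M"] > pattern["B"] else "M <= B"

pattern = mine(transform(preprocess(select(records))))
summary = evaluate(pattern)
print(pattern, summary)
```

Each function consumes the output of the previous one, which mirrors the order of the stages in the list.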



#About the dataset

**Breast Cancer Wisconsin (Diagnostic) Data Set**<br>
Predict whether the cancer is benign or malignant



Source: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data

The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

##Attribute description

* 1) ID number
* 2) Diagnosis (M = malignant, B = benign)
* 3-32) Ten real-valued features are computed for each cell nucleus; the columns store the mean (`_mean`), standard error (`_se`), and "worst" (`_worst`) value of each, giving 30 feature columns:

* a) radius (mean of distances from the center to points on the perimeter)
* b) texture (standard deviation of gray-scale values)
* c) perimeter
* d) area
* e) smoothness (local variation in radius lengths)
* f) compactness (perimeter^2 / area - 1.0)
* g) concavity (severity of concave portions of the contour)
* h) concave points (number of concave portions of the contour)
* i) symmetry
* j) fractal dimension ("coastline approximation" - 1)
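The column names printed by `data.head()` in this notebook combine each of the ten features with one of three suffixes (`_mean`, `_se`, `_worst`). A minimal sketch of how the 30 names can be generated, assuming the spelling and order seen in that output (note the space in `concave points` and the underscore in `fractal_dimension`):

```python
# Base feature names as they are spelled in the CSV header.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Suffixes seen in the header: all *_mean columns, then *_se, then *_worst.
statistics = ["mean", "se", "worst"]

feature_columns = [f"{feature}_{stat}" for stat in statistics for feature in base_features]

print(len(feature_columns))
print(feature_columns[0], feature_columns[10], feature_columns[-1])
```

A list like this can be handy for checking that a loaded DataFrame contains every expected feature column.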

<br>

All feature values are recoded with four significant digits.
 <br>

Class distribution: 357 benign, 212 malignant
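As a quick arithmetic check, the two class counts add up to the 569 rows reported by `data.shape` later in the notebook, and the classes are moderately imbalanced. The sketch below uses only the two counts quoted above:

```python
benign, malignant = 357, 212  # counts quoted in the class distribution

total = benign + malignant
benign_pct = 100 * benign / total
malignant_pct = 100 * malignant / total

print(total)
print(f"benign: {benign_pct:.1f}%  malignant: {malignant_pct:.1f}%")
```

So roughly 63% of the samples are benign and 37% are malignant.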

#Data selection

In [None]:
from zipfile import ZipFile
import os

In [None]:
diretorio_destino = "/content/drive/MyDrive/Colab Notebooks/Breast Cancer Wisconsin (Diagnostic) Data Set"
arquivo_zip = os.path.join(diretorio_destino, "archive.zip")

In [None]:
with ZipFile(arquivo_zip, "r") as zf:  # extract the zip archive containing the dataset
  zf.extractall(diretorio_destino)
  print("Archive extracted successfully!")

Archive extracted successfully!
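A slightly more defensive variant of the extraction step is sketched below. The helper name `extract_archive` and the throwaway archive are hypothetical and exist only so the example runs without Google Drive. The helper checks that the path is a readable zip file before extracting and returns the names of the extracted members.

```python
import os
import tempfile
from zipfile import ZipFile, is_zipfile

def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the archive member names."""
    # is_zipfile returns False for missing files as well as non-zip files.
    if not is_zipfile(zip_path):
        raise ValueError(f"Not a readable zip archive: {zip_path}")
    os.makedirs(dest_dir, exist_ok=True)
    with ZipFile(zip_path, "r") as zf:
        zf.extractall(dest_dir)
        return zf.namelist()

# Demo on a temporary archive (illustrative content only).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "archive.zip")
    with ZipFile(demo_zip, "w") as zf:
        zf.writestr("data.csv", "id,diagnosis\n1,M\n")
    members = extract_archive(demo_zip, os.path.join(tmp, "out"))
    extracted_ok = os.path.exists(os.path.join(tmp, "out", "data.csv"))
    print(members, extracted_ok)
```

Returning the member list makes it easy to confirm that the expected CSV file is actually inside the archive.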


In [None]:
import pandas as pd

In [None]:
data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Breast Cancer Wisconsin (Diagnostic) Data Set/data.csv")
data.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [None]:
data.isnull().sum()  # count missing values per column

Unnamed: 0,0
id,0
diagnosis,0
radius_mean,0
texture_mean,0
perimeter_mean,0
area_mean,0
smoothness_mean,0
compactness_mean,0
concavity_mean,0
concave points_mean,0


In [None]:
data.isna().sum()  # isna() is an alias of isnull(), so this gives the same counts

Unnamed: 0,0
id,0
diagnosis,0
radius_mean,0
texture_mean,0
perimeter_mean,0
area_mean,0
smoothness_mean,0
compactness_mean,0
concavity_mean,0
concave points_mean,0


In [None]:
data.shape

(569, 33)

In [None]:
data.drop(["Unnamed: 32", "id"], axis=1, inplace=True)  # remove the id column and the "Unnamed: 32" column
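A column named `Unnamed: N` with no values typically appears when every line of a CSV file ends with a trailing comma. The source file is not shown here, so treating that as the cause of `Unnamed: 32` is an assumption. The sketch below reproduces the effect on a hypothetical three-column CSV and drops the columns without `inplace=True`, which returns a new frame and leaves the original untouched.

```python
import io
import pandas as pd

# Hypothetical CSV whose header and data lines end with a trailing comma.
csv_text = "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,12.45,\n"
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.columns.tolist())

# drop() without inplace=True returns a new DataFrame.
cleaned = toy.drop(columns=["Unnamed: 3", "id"])
print(cleaned.columns.tolist())
```

Keeping the original frame around can make it easier to re-run a cell without reloading the file.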

In [None]:
data.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


#Preprocessing

In [None]:
pd.set_option('display.max_columns', None)  # show all DataFrame columns
data.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
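`pd.set_option` changes the display setting for the rest of the session. If the wide display is needed for one cell only, `pd.option_context` applies the setting temporarily. A minimal sketch, using a hypothetical 40-column frame:

```python
import pandas as pd

# Hypothetical wide frame with 40 columns.
toy = pd.DataFrame({f"col_{i}": [i] for i in range(40)})

default_limit = pd.get_option("display.max_columns")
with pd.option_context("display.max_columns", None):
    limit_inside = pd.get_option("display.max_columns")  # None means no limit
    wide_text = repr(toy)  # rendered with every column visible
restored_limit = pd.get_option("display.max_columns")

print(limit_inside, restored_limit == default_limit)
```

Outside the `with` block the previous limit is restored automatically.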
