# **Data Complexity and Meta-Learning**

This project explores the application of **meta-learning techniques** to understand the relationship between dataset characteristics and classification performance. Using Instance Space Analysis (ISA), it investigates how datasets, characterized by their complexity and meta-features, behave across a variety of machine learning algorithms. The goal is to identify patterns that correlate with good or poor algorithmic performance and to better understand the challenges posed by complex data.

## **Imports**

In [1]:
from ucimlrepo import fetch_ucirepo 
import pandas as pd
from sklearn.preprocessing import StandardScaler
from pymfe.mfe import MFE
import numpy as np
import matplotlib.pyplot as plt
import problexity as px
import warnings
warnings.filterwarnings("ignore")

## **Loading the datasets**

The datasets chosen for this analysis span a wide range of domains and include binary and multiclass classification tasks, numeric and categorical features, and varying degrees of complexity. Each dataset represents a different level of classification difficulty, providing a rich foundation for exploring how meta-features affect model effectiveness. Fifteen diverse datasets were selected from the UCI repository; they are retrieved with `fetch_ucirepo`.
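To make "diverse" concrete, it can help to profile each dataset once it is available as a DataFrame. The sketch below is illustrative only: `profile_dataset` is a hypothetical helper (not part of `ucimlrepo`), the toy frame is made-up data, and the last column is assumed to be the target, as elsewhere in this notebook.

```python
import pandas as pd

def profile_dataset(df: pd.DataFrame) -> dict:
    """Summarize a dataset whose last column is assumed to be the target."""
    features = df.iloc[:, :-1]
    target = df.iloc[:, -1]
    n_numeric = features.select_dtypes(include="number").shape[1]
    n_classes = int(target.nunique())
    return {
        "n_instances": len(df),
        "n_features": features.shape[1],
        "n_numeric": n_numeric,
        "n_categorical": features.shape[1] - n_numeric,
        "n_classes": n_classes,
        "task": "binary" if n_classes == 2 else "multiclass",
    }

# Toy data standing in for a real dataset (values are made up).
toy = pd.DataFrame({
    "length": [5.1, 4.9, 6.3, 5.8],
    "color": ["red", "blue", "red", "green"],
    "label": ["a", "b", "a", "c"],
})

profile = profile_dataset(toy)
print(profile)
```

Applying the same helper to every entry of a dictionary of DataFrames would give a compact overview of how the selected datasets differ in size, feature types, and number of classes.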

In [2]:
iris = fetch_ucirepo(id=53) 
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)
wine = fetch_ucirepo(id=109) 
blood_transfusion_service_center = fetch_ucirepo(id=176)
ionosphere = fetch_ucirepo(id=52) 
default_of_credit_card_clients = fetch_ucirepo(id=350) 
mammographic_mass = fetch_ucirepo(id=161)
connectionist_bench_sonar_mines_vs_rocks = fetch_ucirepo(id=151) 
spambase = fetch_ucirepo(id=94) 


In [5]:
datasets = {
    "wine": wine,
    "iris": iris,
    "breast_cancer_wisconsin_diagnostic": breast_cancer_wisconsin_diagnostic,
    "connectionist_bench_sonar_mines_vs_rocks": connectionist_bench_sonar_mines_vs_rocks,
    "blood_transfusion_service_center": blood_transfusion_service_center,
    "ionosphere": ionosphere,
    "mammographic_mass": mammographic_mass,
    "default_of_credit_card_clients": default_of_credit_card_clients,
    "spambase": spambase
    # "bank_marketing": bank_marketing,
    # "adult": adult,
    # "car_evaluation": car_evaluation,
    # "mushroom": mushroom,
    # "heart_disease": heart_disease,
    # "wine_quality": wine_quality,
    # "connect_4": connect_4,
    # "pen_based_recognition_of_handwritten_digits": pen_based_recognition_of_handwritten_digits
}

## **Preprocessing**

### **Preprocessing Functions**

In [6]:
# Convert all datasets to dataframes
def convert_all_to_dataframes(datasets_dict):
    dataframes = {}

    for name, dataset in datasets_dict.items():
        features = dataset['data']['features']
        targets = dataset['data']['targets']

        df = pd.concat([features, targets], axis=1)
        
        dataframes[name] = df

    return dataframes

In [7]:
# Analyze missing values in the datasets
def analyze_missing_values_with_columns(dataframes_dict):
    missing_values_info = []
    
    for dataset_name, df in dataframes_dict.items():
        # Calculate total missing values
        total_missing = df.isnull().sum().sum()
        
        # Get columns with missing values
        missing_columns = df.columns[df.isnull().any()].tolist()
        missing_columns_count = df.isnull().sum()[missing_columns].to_dict()
        
        missing_values_info.append({
            "Dataset": dataset_name,
            "Total Missing Values": total_missing,
            "Missing Columns": missing_columns,
            "Missing Values Count": missing_columns_count
        })
    
    # Create a DataFrame with the results
    missing_values_df = pd.DataFrame(missing_values_info)
    
    return missing_values_df

In [8]:
# Check the data types of each column in all DataFrames
def check_column_types(dataframes_dict):
    column_types_info = []
    
    for dataset_name, df in dataframes_dict.items():
        # Get the data types of each column
        column_types = df.dtypes
        
        # Append dataset name, column name, and data type to the list
        for column_name, column_type in column_types.items():
            column_types_info.append({
                "Dataset": dataset_name,
                "Column": column_name,
                "Data Type": column_type
            })
    
    # Create a DataFrame with the results
    column_types_df = pd.DataFrame(column_types_info)
    
    return column_types_df

In [9]:
# Check the data types of specific columns in all DataFrames
def check_specific_column_types(dataframes_dict, target_columns):
    specific_column_types_info = []
    
    for dataset_name, df in dataframes_dict.items():
        # Check only the columns specified in target_columns
        for column in target_columns:
            if column in df.columns:
                column_type = df[column].dtype
                specific_column_types_info.append({
                    "Dataset": dataset_name,
                    "Column": column,
                    "Data Type": column_type
                })
    
    # Create a DataFrame with the results
    specific_column_types_df = pd.DataFrame(specific_column_types_info)
    
    return specific_column_types_df

In [10]:
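# Impute missing values in the given columns: mode (or an 'unknown' category)
# for categorical columns, mean for numeric ones.
# Note: the DataFrames passed in are modified in place.
def impute_missing_values(dataframes_dict, target_columns, create_unknown_category=False):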
    imputed_dataframes = {}
    
    for dataset_name, df in dataframes_dict.items():
        for column in target_columns:
            if column in df.columns:
                # Check if there are missing values in the column
                if df[column].isnull().any():
                    # Check the dtype of the column
                    if df[column].dtype == 'object' or df[column].dtype == 'category':
                        # If creating an unknown category is desired
                        if create_unknown_category:
                            # Create a new category for missing values
                            df[column] = df[column].fillna('unknown')
                        else:
                            # Impute with the mode for categorical columns
                            mode_value = df[column].mode()[0]
                            df[column] = df[column].fillna(mode_value)
                    elif df[column].dtype == 'float64':
                        # Impute with the mean for float64 columns
                        mean_value = df[column].mean()
                        df[column] = df[column].fillna(mean_value)
                    elif df[column].dtype == 'int64':
                        # Optionally handle int64 columns with mean or mode
                        mean_value = df[column].mean()
                        df[column] = df[column].fillna(mean_value)

        # Store the imputed DataFrame in the result dictionary
        imputed_dataframes[dataset_name] = df
    
    return imputed_dataframes

In [11]:
def normalize_z_score(dataframes):
    normalized_dataframes = {}
    
    for dataset_name, df in dataframes.items():
        # Assume the last column is the target column
        target_col = df.columns[-1]
        
        # Identify the numeric columns, excluding the target column
        numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
        if target_col in numeric_cols:
            numeric_cols.remove(target_col)  # Remove the target column if it is numeric
        
        if len(numeric_cols) == 0:
            print(f"No numeric columns found in {dataset_name}. Skipping normalization.")
            normalized_dataframes[dataset_name] = df  # Keep the original DataFrame
            continue
        
        # Normalize the numeric columns
        scaler = StandardScaler()
        df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
        
        normalized_dataframes[dataset_name] = df
    
    return normalized_dataframes
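
As a rough illustration of what `normalize_z_score` does to each numeric feature, the sketch below applies the z-score formula by hand with pandas. The toy frame and its column names are made up for the example. The population standard deviation (`ddof=0`) is used because that is the convention `StandardScaler` follows.

```python
import numpy as np
import pandas as pd

# Toy data (made up); the last column is assumed to be the target.
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
    "target": [0, 1, 0, 1],
})

feature_cols = toy.columns[:-1]

# z = (x - mean) / std, computed column by column with the population std.
z = (toy[feature_cols] - toy[feature_cols].mean()) / toy[feature_cols].std(ddof=0)

# Work on a copy so the original frame stays untouched.
scaled = toy.copy()
scaled[feature_cols] = z

print(scaled)
```

After scaling, every feature column has mean 0 and unit (population) standard deviation, while the target column is left as it was.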


In [12]:
# Encode categorical columns in the datasets
def one_hot_encode(dataframes):
    encoded_dataframes = {}
    
    for dataset_name, df in dataframes.items():
        target_col = df.columns[-1]  # Assume the last column is the target column
        target_data = df[target_col]
        
        # Check the type of the target column
        if not pd.api.types.is_numeric_dtype(target_data) and target_data.nunique() < 10:
            # A non-numeric target with few unique values can be converted to categorical
            df[target_col] = df[target_col].astype('category')
        
        df_features = df.drop(columns=[target_col])  # Separate the features
        
        # Identify the categorical columns
        categorical_cols = df_features.select_dtypes(include=['object', 'category']).columns
        
        # Encode the categorical columns
        df_encoded = pd.get_dummies(df_features, columns=categorical_cols, drop_first=True)
        
        # Reinsert the original target column; keeping its index avoids
        # misaligned rows when the DataFrame index is not a default RangeIndex
        df_encoded[target_col] = target_data
        
        encoded_dataframes[dataset_name] = df_encoded
    
    return encoded_dataframes
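
The effect of `drop_first=True` in `pd.get_dummies` is easiest to see on a small example. The sketch below uses a made-up frame and compares the full encoding with the reduced one, in which the first level of each categorical column is dropped and becomes the implicit reference level.

```python
import pandas as pd

# Toy data (made up) with one categorical and one numeric feature.
toy = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1.0, 2.0, 3.0, 4.0],
    "target": ["yes", "no", "yes", "no"],
})

features = toy.drop(columns=["target"])

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(features, columns=["color"])

# Reduced encoding: the first category (in sorted order) is dropped.
reduced = pd.get_dummies(features, columns=["color"], drop_first=True)

print(sorted(full.columns))
print(sorted(reduced.columns))
```

Dropping one level removes the exact linear dependence among the indicator columns, at the cost of making the dropped category implicit.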

### **Applying the Preprocessing**

In [13]:
dataframes = convert_all_to_dataframes(datasets)
dataframes

{'wine':      Alcohol  Malicacid   Ash  Alcalinity_of_ash  Magnesium  Total_phenols  \
 0      14.23       1.71  2.43               15.6        127           2.80   
 1      13.20       1.78  2.14               11.2        100           2.65   
 2      13.16       2.36  2.67               18.6        101           2.80   
 3      14.37       1.95  2.50               16.8        113           3.85   
 4      13.24       2.59  2.87               21.0        118           2.80   
 ..       ...        ...   ...                ...        ...            ...   
 173    13.71       5.65  2.45               20.5         95           1.68   
 174    13.40       3.91  2.48               23.0        102           1.80   
 175    13.27       4.28  2.26               20.0        120           1.59   
 176    13.17       2.59  2.37               20.0        120           1.65   
 177    14.13       4.10  2.74               24.5         96           2.05   
 
      Flavanoids  Nonflavanoid_phenols  Pr

In [14]:
column_types_df = check_column_types(dataframes)
column_types_df

| | Dataset | Column | Data Type |
|---|---|---|---|
| 0 | wine | Alcohol | float64 |
| 1 | wine | Malicacid | float64 |
| 2 | wine | Ash | float64 |
| 3 | wine | Alcalinity_of_ash | float64 |
| 4 | wine | Magnesium | int64 |
| ... | ... | ... | ... |
| 234 | spambase | char_freq_# | float64 |
| 235 | spambase | capital_run_length_average | float64 |
| 236 | spambase | capital_run_length_longest | int64 |
| 237 | spambase | capital_run_length_total | int64 |


In [15]:
missing_values_df = analyze_missing_values_with_columns(dataframes)
missing_values_df

| | Dataset | Total Missing Values | Missing Columns | Missing Values Count |
|---|---|---|---|---|
| 0 | wine | 0 | [] | {} |
| 1 | iris | 0 | [] | {} |
| 2 | breast_cancer_wisconsin_diagnostic | 0 | [] | {} |
| 3 | connectionist_bench_sonar_mines_vs_rocks | 0 | [] | {} |
| 4 | blood_transfusion_service_center | 0 | [] | {} |
| 5 | ionosphere | 0 | [] | {} |
| 6 | mammographic_mass | 162 | [BI-RADS, Age, Shape, Margin, Density] | {'BI-RADS': 2, 'Age': 5, 'Shape': 31, 'Margin'... |
| 7 | default_of_credit_card_clients | 0 | [] | {} |
| 8 | spambase | 0 | [] | {} |


In [16]:
target_columns = ['BI-RADS', 'Age', 'Shape', 'Margin', 'Density']
specific_column_types_df = check_specific_column_types(dataframes, target_columns)
specific_column_types_df

| | Dataset | Column | Data Type |
|---|---|---|---|
| 0 | mammographic_mass | BI-RADS | float64 |
| 1 | mammographic_mass | Age | float64 |
| 2 | mammographic_mass | Shape | float64 |
| 3 | mammographic_mass | Margin | float64 |
| 4 | mammographic_mass | Density | float64 |


TODO: CHANGE THIS

**bank_marketing Dataset**
- **job:** This column represents the type of job held by a client (e.g., admin, technician, services, etc.).
- **education:** This column indicates the education level of the client (e.g., primary, secondary, tertiary).
- **contact:** This represents the communication type used to contact the client (e.g., cellular, telephone).
- **poutcome:** This column reflects the outcome of the previous marketing campaign (e.g., success, failure, nonexistent).

**adult Dataset**
- **education:** This indicates the education level attained by the individual (e.g., HS-grad, Bachelors, Masters).
- **workclass:** This column shows the type of employment (e.g., Private, Self-emp, Government).
- **occupation:** This column represents the job title of the individual (e.g., Tech-support, Sales).
- **native-country:** This indicates the country of origin of the individual (e.g., United States, Mexico, etc.).

**mushroom Dataset**
- **stalk-root:** This column indicates the type of stalk root present in the mushroom (e.g., bulbous, club, etc.).

**heart_disease Dataset**
- **ca:** This indicates the number of major vessels (ranging from 0 to 3) that are colored by fluoroscopy, which is a diagnostic imaging technique used to identify potential blockages in the heart's arteries.
- **thal:** This represents the type of thalassemia, with values indicating normal (1), fixed defect (2), or reversible defect (3), which helps in assessing the patient's heart condition and potential for heart disease.

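Given the nature and meaning of these columns, imputation was chosen to handle the missing data: categorical columns are filled with the mode (or, optionally, an `unknown` category) and numeric columns with the mean.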
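
One point worth checking is how the imputation treats numerically coded categories. The affected `mammographic_mass` columns are all reported as `float64` above, so `impute_missing_values` takes its mean branch for them. Assuming that columns such as `Shape` and `Margin` hold numeric category codes, the mean can produce a value that matches no existing code. The sketch below contrasts mean and mode imputation on a toy series; the values are made up and not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy column standing in for a numerically coded categorical feature
# (illustrative values only, not the real data).
shape = pd.Series([1.0, 4.0, 4.0, np.nan, 2.0, np.nan], name="Shape")

# Mean imputation: may introduce a code that never occurs in the data.
mean_imputed = shape.fillna(shape.mean())

# Mode imputation: always fills with an observed code.
mode_imputed = shape.fillna(shape.mode()[0])

print(mean_imputed.tolist())
print(mode_imputed.tolist())
```

Whether the mean or the mode is the better choice depends on whether the column is truly numeric (such as `Age`) or a coded category; this sketch only makes the difference visible.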

In [17]:
target_columns = ['BI-RADS', 'Age', 'Shape', 'Margin', 'Density']

imputed_dataframes = impute_missing_values(dataframes, target_columns, create_unknown_category=False)

# Re-check the imputed DataFrames to confirm that no missing values remain
missing_values_df = analyze_missing_values_with_columns(imputed_dataframes)
missing_values_df

| | Dataset | Total Missing Values | Missing Columns | Missing Values Count |
|---|---|---|---|---|
| 0 | wine | 0 | [] | {} |
| 1 | iris | 0 | [] | {} |
| 2 | breast_cancer_wisconsin_diagnostic | 0 | [] | {} |
| 3 | connectionist_bench_sonar_mines_vs_rocks | 0 | [] | {} |
| 4 | blood_transfusion_service_center | 0 | [] | {} |
| 5 | ionosphere | 0 | [] | {} |
| 6 | mammographic_mass | 0 | [] | {} |
| 7 | default_of_credit_card_clients | 0 | [] | {} |
| 8 | spambase | 0 | [] | {} |


In [18]:
import pandas as pd
import numpy as np

# Initialize a dictionary to store class counts for each dataset
class_counts = {}

# Iterate over each dataset in the dictionary
for df_name, df in dataframes.items():
    y = df.iloc[:, -1]  # Assuming the last column is the target

    # Get the count of each class in the target column
    counts = y.value_counts().to_dict()
    
    # Add to class_counts dictionary with dataset name as key
    class_counts[df_name] = counts

# Display the class counts for each dataset
for dataset, counts in class_counts.items():
    print(f"Dataset: {dataset}")
    for class_label, count in counts.items():
        print(f"  Class {class_label}: {count} samples")
    print()  # Add a newline for readability


Dataset: wine
  Class 2: 71 samples
  Class 1: 59 samples
  Class 3: 48 samples

Dataset: iris
  Class Iris-setosa: 50 samples
  Class Iris-versicolor: 50 samples
  Class Iris-virginica: 50 samples

Dataset: breast_cancer_wisconsin_diagnostic
  Class B: 357 samples
  Class M: 212 samples

Dataset: connectionist_bench_sonar_mines_vs_rocks
  Class M: 111 samples
  Class R: 97 samples

Dataset: blood_transfusion_service_center
  Class 0: 570 samples
  Class 1: 178 samples

Dataset: ionosphere
  Class g: 225 samples
  Class b: 126 samples

Dataset: mammographic_mass
  Class 0: 516 samples
  Class 1: 445 samples

Dataset: default_of_credit_card_clients
  Class 0: 23364 samples
  Class 1: 6636 samples

Dataset: spambase
  Class 0: 2788 samples
  Class 1: 1813 samples
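
The class counts above can be condensed into a single imbalance indicator per dataset. The sketch below is a minimal illustration: `imbalance_ratio` is a hypothetical helper (majority count divided by minority count), and the counts are copied from the output above for three of the datasets.

```python
import pandas as pd

# Class counts copied from the printed output for three datasets.
printed_counts = {
    "iris": {"Iris-setosa": 50, "Iris-versicolor": 50, "Iris-virginica": 50},
    "blood_transfusion_service_center": {0: 570, 1: 178},
    "default_of_credit_card_clients": {0: 23364, 1: 6636},
}

def imbalance_ratio(counts: dict) -> float:
    """Majority-class count divided by minority-class count (1.0 means balanced)."""
    values = list(counts.values())
    return max(values) / min(values)

ratios = pd.Series({name: imbalance_ratio(c) for name, c in printed_counts.items()})
print(ratios.round(2))
```

A ratio of 1.0 indicates perfectly balanced classes; larger values indicate a stronger skew toward the majority class.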



In [19]:
# normalized_dfs = normalize_z_score(dataframes)
# normalized_dfs

In [20]:
# encoded_dfs = one_hot_encode(normalized_dfs)
# df_dict = encoded_dfs

## **Meta-Feature Extraction**

In [28]:
metafeatures_data = []

for dataset_name, df in dataframes.items():
    # Create an instance of the MFE class
    mfe = MFE(groups=["general", "statistical", "info-theory"])
    
    # Fit the MFE instance to the dataset
    mfe.fit(df.iloc[:, :-1].values, df.iloc[:, -1].values)
    
    # Extract the metafeature names and values
    feature_names, feature_values = mfe.extract()
    
    # Create a dictionary of metafeature names and values
    metafeatures = dict(zip(feature_names, feature_values))
    
    # Add the dataset name to the metafeatures dictionary
    metafeatures["Dataset"] = dataset_name
    
    # Append the metafeatures dictionary to the list
    metafeatures_data.append(metafeatures)

# Create the final DataFrame
df_metafeatures = pd.DataFrame(metafeatures_data).set_index('Dataset')

In [29]:
df_metafeatures

| Dataset | attr_conc.mean | attr_conc.sd | attr_ent.mean | attr_ent.sd | attr_to_inst | can_cor.mean | can_cor.sd | cat_to_num | class_conc.mean | class_conc.sd | ... | sd_ratio | skewness.mean | skewness.sd | sparsity.mean | sparsity.sd | t_mean.mean | t_mean.sd | var.mean | var.sd | w_lambda |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wine | 0.082144 | 0.050032 | 2.317306 | 0.008828 | 0.073034 | NaN | NaN | 0.0 | 0.152791 | 0.071452 | ... | 1.36247 | 0.344289 | 0.465443 | 0.006197 | 0.005509 | 65.071108 | 191.571123 | 7645.5 | 27498.76 | NaN |
| iris | 0.209222 | 0.11995 | 2.27901 | 0.057426 | 0.026667 | NaN | NaN | 0.0 | 0.272326 | 0.142589 | ... | 1.267185 | 0.066034 | 0.298864 | 0.028715 | 0.011032 | 3.469722 | 1.905054 | 1.142323 | 1.331291 | NaN |
| breast_cancer_wisconsin_diagnostic | 0.069047 | 0.078168 | 2.999956 | 4.9e-05 | 0.052724 | NaN | NaN | 0.0 | 0.051585 | 0.03584 | ... | NaN | 1.731241 | 1.271995 | 0.000209 | 0.000152 | 53.294606 | 165.816892 | 15063.22 | 62593.55 | NaN |
| connectionist_bench_sonar_mines_vs_rocks | 0.042509 | 0.044148 | 2.321728 | 0.000239 | 0.288462 | NaN | NaN | 0.0 | 0.01716 | 0.013992 | ... | 1.291039 | 0.953314 | 0.939512 | 0.000689 | 0.001122 | 0.271992 | 0.235652 | 0.02913314 | 0.02589374 | NaN |
| blood_transfusion_service_center | 0.222062 | 0.366012 | 2.882541 | 0.268578 | 0.005348 | NaN | NaN | 0.0 | 0.010159 | 0.008808 | ... | NaN | 2.254043 | 1.183456 | 0.025118 | 0.009126 | 267.449444 | 506.647734 | 532947.0 | 1065432.0 | NaN |
| ionosphere | 0.09086 | 0.04767 | 2.504029 | 0.650889 | 0.096866 | NaN | NaN | 0.0 | 0.067908 | 0.111682 | ... | NaN | NaN | NaN | 0.045112 | 0.18905 | 0.300629 | 0.307977 | 0.2725297 | 0.08203703 | NaN |
| mammographic_mass | 0.092187 | 0.103962 | 1.777794 | 0.960494 | 0.005203 | NaN | NaN | 0.0 | 0.143962 | 0.153532 | ... | 1.185422 | 4.072998 | 11.111378 | 0.140141 | 0.077749 | 13.796784 | 23.639551 | 43.1428 | 92.49021 | NaN |
| default_of_credit_card_clients | 0.08199 | 0.0917 | 3.39044 | 1.486675 | 0.000767 | NaN | NaN | 0.0 | 0.006327 | 0.009802 | ... | NaN | 5.285823 | 7.553778 | 0.064655 | 0.114855 | 13159.477889 | 30239.805946 | 1974938000.0 | 3789461000.0 | NaN |
| spambase | 0.043967 | 0.062612 | 1.130098 | 0.993315 | 0.012389 | NaN | NaN | 0.0 | 0.047754 | 0.040697 | ... | 2.821298 | 11.179346 | 6.954063 | 0.006084 | 0.004604 | 2.535708 | 16.107536 | 7134.779 | 48864.66 | NaN |
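
Several meta-features in the table above are missing for every dataset (for example `can_cor.mean` and `w_lambda`), and others only for some. Before using such a table downstream, it may help to quantify and prune the gaps. The sketch below is one possible approach, shown on a small excerpt of the table (three datasets, four meta-features) rather than on `df_metafeatures` itself.

```python
import numpy as np
import pandas as pd

# A small excerpt of the table above; empty cells are represented as NaN.
excerpt = pd.DataFrame(
    {
        "attr_conc.mean": [0.082144, 0.209222, 0.09086],
        "can_cor.mean": [np.nan, np.nan, np.nan],
        "skewness.mean": [0.344289, 0.066034, np.nan],
        "w_lambda": [np.nan, np.nan, np.nan],
    },
    index=pd.Index(["wine", "iris", "ionosphere"], name="Dataset"),
)

# Fraction of datasets for which each meta-feature is missing.
nan_fraction = excerpt.isna().mean()

# Drop meta-features that are missing for every dataset.
cleaned = excerpt.dropna(axis=1, how="all")

print(nan_fraction)
print(list(cleaned.columns))
```

Columns that are only partially missing, such as `skewness.mean` here, are kept; how to treat them (drop, impute, or leave as is) is a separate decision.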


In [30]:
complexity_data = []

for dataset_name, df in dataframes.items():
    # Create an MFE instance restricted to the complexity group, summarized with max
    mfe = MFE(groups="complexity", summary="max")
    
    # Fit the MFE instance to the dataset
    mfe.fit(df.iloc[:, :-1].values, df.iloc[:, -1].values)
    
    # Extract the metafeature names and values
    feature_names, feature_values = mfe.extract()
    
    # Create a dictionary of metafeature names and values
    metafeatures = dict(zip(feature_names, feature_values))
    
    # Add the dataset name to the metafeatures dictionary
    metafeatures["Dataset"] = dataset_name
    
    # Append the metafeatures dictionary to the list
    complexity_data.append(metafeatures)

# Create the final DataFrame
df_complexity = pd.DataFrame(complexity_data).set_index('Dataset')

Exception: 

In [None]:
df_complexity

In [26]:
# Initialize a list to store the metrics of each dataset
complexity_data = []

# Iterate over each dataset in the dictionary
for df_name, df in dataframes.items(): 
    # Convert X to float before passing it to the ComplexityCalculator
    X = df.iloc[:, :-1].astype(float).to_numpy()
    y = df.iloc[:, -1].to_numpy()   # Last column (target)
    
    try:
        # Compute the complexity measures with the ComplexityCalculator
        cc = px.ComplexityCalculator()
        cc.fit(X, y)
        
        # Build a dictionary with the complexity metrics and the dataset name
        metrics_dict = {metric: value for metric, value in zip(cc._metrics(), cc.complexity)}
        metrics_dict['Dataset'] = df_name  # Add the dataset name
        
        # Append the dictionary to the list
        complexity_data.append(metrics_dict)
        print(df_name, ":" , metrics_dict)

    except Exception as e:
        print(f"Error calculating complexity for dataset: {df_name} - {e}")


# Create the final DataFrame
df_prbxty = pd.DataFrame(complexity_data).set_index('Dataset')

wine : {'f1': 0.14721884574690347, 'f1v': 0.026841189195927397, 'f2': 2.0354274454154463e-05, 'f3': 0.1532643826761474, 'f4': 0.0, 'l1': 0.004271114001573237, 'l2': 0.0038461538461538464, 'l3': 0.0, 'n1': 0.07449681526097139, 'n2': 0.3454838412330818, 'n3': 0.13522504616512313, 'n4': 0.12440438432741896, 't1': 0.2953460320145312, 'lsc': 0.8181968676480578, 'density': 0.9096724942558718, 'clsCoef': 0.5479668767863313, 'hubs': 0.7851428584117524, 't2': 0.11024634152726509, 't3': 0.008480487809789624, 't4': 0.07692307692307693, 'c1': 0.013636449935969167, 'c2': 0.036611795976815964, 'Dataset': 'wine'}
iris : {'f1': 0.07870888942123055, 'f1v': 0.026266732914796354, 'f2': 0.0063817663817663794, 'f3': 0.12333333333333334, 'f4': 0.043333333333333335, 'l1': 0.01076722138530393, 'l2': 0.01, 'l3': 0.005, 'n1': 0.015, 'n2': 0.24253220945423562, 'n3': 0.020000000000000018, 'n4': 0.01, 't1': 0.07666666666666667, 'lsc': 0.6055666666666666, 'density': 0.7757575757575758, 'clsCoef': 0.2463936337819516

KeyboardInterrupt: 

In [None]:
df_prbxty

| Dataset | f1 | f1v | f2 | f3 | f4 | l1 | l2 | l3 | n1 | n2 | ... | t1 | lsc | density | clsCoef | hubs | t2 | t3 | t4 | c1 | c2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wine | 0.147219 | 0.026841 | 2e-05 | 0.153264 | 0.0 | 0.004271 | 0.003846 | 0.0 | 0.074497 | 0.345484 | ... | 0.295346 | 0.818197 | 0.909672 | 0.547967 | 0.785143 | 0.110246 | 0.00848 | 0.076923 | 0.013636 | 0.036612 |
| iris | 0.078709 | 0.026267 | 0.006382 | 0.123333 | 0.043333 | 0.010767 | 0.01 | 0.001667 | 0.015 | 0.242532 | ... | 0.076667 | 0.605567 | 0.775758 | 0.246394 | 0.65091 | 0.04 | 0.023333 | 0.583333 | 0.0 | 0.0 |
| mammographic_mass | 0.520937 | 0.2586 | 0.081119 | 0.953174 | 0.952133 | 0.171232 | 0.171696 | 0.120708 | 0.12487 | 0.454525 | ... | 0.508845 | 0.99323 | 0.846976 | 0.288398 | 0.748813 | 0.005203 | 0.001041 | 0.2 | 0.003941 | 0.010858 |


In [None]:
# List of datasets to analyze
datasets_to_analyze = ['breast_cancer_wisconsin_diagnostic', 'connectionist_bench_sonar_mines_vs_rocks', 'ionosphere', 'default_of_credit_card_clients', 'spambase']

for df_name in datasets_to_analyze:
    # Get the corresponding DataFrame from the dictionary
    df = dataframes[df_name]

    # Display general information about the dataset
    print(f"\nAnalysis of: {df_name}")
    print(f"DataFrame's Size: {df.shape}")

    # Descriptive statistics
    print("\nDescriptive Analysis:")
    print(df.describe())

    # Count unique values in each column
    unique_counts = df.nunique()
    print("\nUnique values per column:")
    print(unique_counts)

    # Calculate variance for numeric columns, excluding the target (assumed to be the last column)
    numeric_columns = df.select_dtypes(include=[np.number]).columns.difference([df.columns[-1]])
    variance = df[numeric_columns].var()
    print("\nVariance per column:")
    print(variance)



Analysis of: breast_cancer_wisconsin_diagnostic
DataFrame's Size: (569, 31)

Descriptive Analysis:
          radius1    texture1  perimeter1        area1  smoothness1  \
count  569.000000  569.000000  569.000000   569.000000   569.000000   
mean    14.127292   19.289649   91.969033   654.889104     0.096360   
std      3.524049    4.301036   24.298981   351.914129     0.014064   
min      6.981000    9.710000   43.790000   143.500000     0.052630   
25%     11.700000   16.170000   75.170000   420.300000     0.086370   
50%     13.370000   18.840000   86.240000   551.100000     0.095870   
75%     15.780000   21.800000  104.100000   782.700000     0.105300   
max     28.110000   39.280000  188.500000  2501.000000     0.163400   

       compactness1  concavity1  concave_points1   symmetry1  \
count    569.000000  569.000000       569.000000  569.000000   
mean       0.104341    0.088799         0.048919    0.181162   
std        0.052813    0.079720         0.038803    0.027414   
min