# S01 Self Directed Activity

This activity will be composed of the following notebooks:

1. `N01_EDA`: notebook with a small EDA of the 2 datasets.

2. `N02_models`: notebook with 3 classification models for the iris dataset: SVM, Logistic Regression and Random Forest. Finally, a comparison will be made with some evaluation metrics.

3. `N03_MLP_Comparison`: last notebook, with the Scikit-Learn implementation of Multi Layer Perceptron (MLP) and a comparison with the other 3 models.

The objective of this Self Directed Activity is to refresh the concepts and contents of the previous subjects, serving as a summary of what has been learnt.


## GitHub Repository

Here's a link to the GitHub repository for this activity. For the course as a whole, a separate repository will be created and set to private. If an assessment needs to be displayed publicly, a different repository will be created for that activity:

- [S01_Self_Directed_Activity](https://github.com/VforVitorio/S01_Self_Directed_Activity.git)

In [51]:
__author__ = "Víctor Vega Sobral"

#### Loading necessary libraries 

In [52]:
import pandas as pd
import numpy as np 
import plotly.express as px
from sklearn.preprocessing import LabelEncoder

### Loading the datasets 

First of all, we need to load both of the datasets we will be working on. We can do this either by loading them from local files or by setting a seed. In this case, I'll load them locally. The datasets are:

- Breast Cancer Wisconsin Diagnostics.
- Iris Dataset.

In [53]:
# Load Breast Cancer dataset
cancer_df = pd.read_csv('datasets/breast+cancer+wisconsin+diagnostic/wdbc.data', header=None)

# Load Iris dataset
iris_df = pd.read_csv('datasets/iris/iris.data', header=None)

# Verify correct loading
print("Cancer dataset shape:", cancer_df.shape)
print("Iris dataset shape:", iris_df.shape)

Cancer dataset shape: (569, 32)
Iris dataset shape: (150, 5)


In [54]:
print(f"First records of cancer dataframe \n {cancer_df.head()}")

First records of cancer dataframe 
          0  1      2      3       4       5        6        7       8   \
0    842302  M  17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001   
1    842517  M  20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869   
2  84300903  M  19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974   
3  84348301  M  11.42  20.38   77.58   386.1  0.14250  0.28390  0.2414   
4  84358402  M  20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980   

        9   ...     22     23      24      25      26      27      28      29  \
0  0.14710  ...  25.38  17.33  184.60  2019.0  0.1622  0.6656  0.7119  0.2654   
1  0.07017  ...  24.99  23.41  158.80  1956.0  0.1238  0.1866  0.2416  0.1860   
2  0.12790  ...  23.57  25.53  152.50  1709.0  0.1444  0.4245  0.4504  0.2430   
3  0.10520  ...  14.91  26.50   98.87   567.7  0.2098  0.8663  0.6869  0.2575   
4  0.10430  ...  22.54  16.67  152.20  1575.0  0.1374  0.2050  0.4000  0.1625   

       30       31  
0  0.4601

In [55]:
print(f"First records of iris dataframe \n {iris_df.head()}")

First records of iris dataframe 
      0    1    2    3            4
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa


#### Data types 

In [56]:
datasets = {
    'Cancer': cancer_df,
    'Iris': iris_df
}

for name, df in datasets.items():
    print(f"\n{name} Dataset Types Summary:")
    display(df.dtypes.value_counts().to_frame(name='count'))


Cancer Dataset Types Summary:


| | count |
|---|---|
| float64 | 30 |
| int64 | 1 |
| object | 1 |



Iris Dataset Types Summary:


| | count |
|---|---|
| float64 | 4 |
| object | 1 |


### Identification of numeric and categorical variables

1. Replacing the default column names with more descriptive ones. These names can be found in the `.names` files of both datasets.

2. Brief summary of numeric columns.
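
As a complementary sketch of these two steps, the column names can also be assigned while reading the file through the `names` parameter of `pd.read_csv`, and `select_dtypes` can then split numeric from categorical columns. The in-memory `sample` below is only a stand-in built from the first rows of `iris.data` shown earlier; it is an assumption for illustration, not the notebook's actual loading code.

```python
import io
import pandas as pd

# First rows of iris.data, used as an in-memory stand-in for the real file
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

iris_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# header=None: the file has no header row; names=: labels to assign on load
iris_sample = pd.read_csv(sample, header=None, names=iris_columns)

# Split columns by dtype: numeric vs. everything else
numeric_cols = iris_sample.select_dtypes(include='number').columns.tolist()
categorical_cols = iris_sample.select_dtypes(exclude='number').columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Assigning `.columns` after loading, as done in this notebook, gives the same result; the `names` route just avoids the intermediate integer labels.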

In [57]:
# Names for iris dataset
iris_columns = [
    'sepal_length',
    'sepal_width', 
    'petal_length',
    'petal_width',
    'class'
]

# Names for breast cancer dataset
cancer_columns = [
    'id',
    'diagnosis',
    'radius_mean', 
    'texture_mean',
    'perimeter_mean',
    'area_mean',
    'smoothness_mean',
    'compactness_mean',
    'concavity_mean',
    'concave_points_mean',
    'symmetry_mean',
    'fractal_dimension_mean',
    'radius_se',
    'texture_se', 
    'perimeter_se',
    'area_se',
    'smoothness_se',
    'compactness_se',
    'concavity_se',
    'concave_points_se',
    'symmetry_se',
    'fractal_dimension_se',
    'radius_worst',
    'texture_worst',
    'perimeter_worst',
    'area_worst',
    'smoothness_worst', 
    'compactness_worst',
    'concavity_worst',
    'concave_points_worst',
    'symmetry_worst',
    'fractal_dimension_worst'
]

# Renaming columns 
iris_df.columns = iris_columns
cancer_df.columns = cancer_columns

# Verifying new names
print("Iris columns:\n", iris_df.columns.tolist())
print("\nCancer columns:\n", cancer_df.columns.tolist())

Iris columns:
 ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

Cancer columns:
 ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst']


In [58]:
# Brief summary of numerical columns 
for name, df in datasets.items():
    print(f"\n{'-'*20} {name} Dataset {'-'*20}")
    # Identifying numeric and categorical columns
    numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns

    # Summary
    print("\nBrief summary of numeric columns:")
    display(df[numeric_cols].describe())

        



-------------------- Cancer Dataset --------------------

Brief summary of numeric columns:


Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075



-------------------- Iris Dataset --------------------

Brief summary of numeric columns:


| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [59]:
def create_boxplots(datasets):
    for dataset_name, df in datasets.items():
        # Select numeric columns
        numeric_cols = df.select_dtypes(include=['number']).columns
        
        # Select categorical columns
        categorical_cols = df.select_dtypes(include=['object', 'category']).columns
        
        print(f"\nCreating boxplots for {dataset_name} dataset:")
        
        # Generate boxplots for numeric columns
        for column in numeric_cols:
            fig = px.box(df, y=column)
            fig.update_layout(
                title=f'Boxplot for {column} - {dataset_name} Dataset',
                yaxis_title=column,
                xaxis_title='Category'
            )
            fig.show()

create_boxplots(datasets)


Creating boxplots for Cancer dataset:



Creating boxplots for Iris dataset:
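
The points drawn beyond the whiskers of a boxplot can also be listed numerically. The sketch below assumes the common 1.5 × IQR rule; the helper `iqr_outliers` and the toy values are hypothetical, and a plotting library may compute quartiles slightly differently, so the flagged points need not match the figures exactly.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Toy values for illustration only (not taken from the datasets)
values = pd.Series([2.0, 2.8, 3.0, 3.0, 3.1, 3.3, 3.4, 4.4, 9.0], name="toy_feature")

outliers = iqr_outliers(values)
print(outliers)
```

Applied to a real column, the call would look like `iqr_outliers(iris_df['sepal_width'])`.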


#### Identify missing values



In [60]:
for name, df in datasets.items():
    print(f"Analysis of {name}")
    # Count missing values per column
    missing_values = df.isnull().sum()

    if not missing_values.any():
        print("There are no missing values")
    else:
        print("\nMissing values:")
        print(missing_values[missing_values > 0])


Analysis of Cancer
There are no missing values
Analysis of Iris
There are no missing values


#### Encoding categorical columns

Each dataset has one categorical column (`diagnosis` in the cancer dataset, `class` in the iris dataset) that needs to be encoded numerically before implementing the classification models. `LabelEncoder` maps each label to an integer, which preserves the information in these columns.
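
To make the mapping explicit, here is a minimal sketch on toy labels (the list `labels` is made up for illustration). `LabelEncoder` sorts the distinct labels and uses each label's position in `classes_` as its integer code, which is why `B` becomes 0 and `M` becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Toy diagnosis values for illustration only
labels = ['M', 'B', 'B', 'M']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted labels; the position of a label is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)           # {'B': 0, 'M': 1}
print(encoded.tolist())  # [1, 0, 0, 1]

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())  # ['M', 'B', 'B', 'M']
```

Note that calling `fit_transform` again on another column replaces `classes_`, so a separate encoder per column is needed if the original labels must be recovered later.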


In [61]:


le = LabelEncoder()

# Cancer dataset 
cancer_df['diagnosis'] = le.fit_transform(cancer_df['diagnosis'])  # M=1, B=0

# Iris dataset
iris_df['class'] = le.fit_transform(iris_df['class'])  # Encode the 3 classes as 0, 1, 2

## Display transformed columns
print(f"\n{'-'*55}")
print("\nVerify the correct transformation of diagnosis column:")
print(f"\n{'-'*55}")

display(cancer_df.head(10))

print(f"\n{'-'*55}")
print("\n Verify the correct transformation of class column")
print(f"\n{'-'*55}")

display(iris_df.head(10))



-------------------------------------------------------

Verify the correct transformation of diagnosis column:

-------------------------------------------------------


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,842302,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
5,843786,1,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,...,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244
6,844359,1,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,...,22.88,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368
7,84458202,1,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,...,17.06,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151
8,844981,1,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,...,15.49,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072
9,84501001,1,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,...,15.09,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075



-------------------------------------------------------

 Verify the correct transformation of class column

-------------------------------------------------------


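| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | 0 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | 0 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | 0 |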


### Data normalization

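We use Scikit-Learn's `StandardScaler` to standardize the features, so that each one has zero mean and unit variance. Features on a common scale can improve the performance of scale-sensitive models such as SVM, Logistic Regression and MLP.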
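
As a quick check of what the scaler computes, each value is transformed as $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the column mean and the population standard deviation. The sketch below reproduces this by hand on a small toy array (two iris-like columns chosen for illustration) and compares it with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix for illustration: 5 samples, 2 features
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [4.7, 3.2],
              [4.6, 3.1],
              [5.0, 3.6]])

scaled = StandardScaler().fit_transform(X)

# Same computation by hand; np.std defaults to the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```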

In [63]:
from sklearn.preprocessing import StandardScaler

# Cancer dataset - exclude 'id' and 'diagnosis' (binary)

# Drop the id column; it carries no significant information
cancer_df.drop('id', axis=1, inplace=True)

cancer_numeric = cancer_df.drop(['diagnosis'], axis=1)
cancer_df[cancer_numeric.columns] = StandardScaler().fit_transform(cancer_numeric)

# Iris dataset - exclude 'class' (categorical)
iris_numeric = iris_df.drop(['class'], axis=1)
iris_df[iris_numeric.columns] = StandardScaler().fit_transform(iris_numeric)

# Display results
display(cancer_df.head(), iris_df.head())

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,1,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,...,1.88669,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.24389,0.28119
2,1,1.579888,0.456187,1.566503,1.558884,0.94221,1.052926,1.363478,2.037231,0.939685,...,1.51187,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955,1.152255,0.201391
3,1,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.93501
4,1,1.750297,-1.151816,1.776573,1.826229,0.280372,0.53934,1.371011,1.428493,-0.00956,...,1.298575,-1.46677,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.3971


| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | -0.900681 | 1.032057 | -1.341272 | -1.312977 | 0 |
| 1 | -1.143017 | -0.124958 | -1.341272 | -1.312977 | 0 |
| 2 | -1.385353 | 0.337848 | -1.398138 | -1.312977 | 0 |
| 3 | -1.506521 | 0.106445 | -1.284407 | -1.312977 | 0 |
| 4 | -1.021849 | 1.26346 | -1.341272 | -1.312977 | 0 |


### Storing the Iris dataset in a CSV file for model implementation

In [67]:
iris_df.to_csv("datasets/iris_normalized_data.csv")
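
One detail to keep in mind when reading this file back in the modelling notebook: `to_csv` writes the index by default, so a plain `read_csv` returns an extra `Unnamed: 0` column. The sketch below illustrates this with an in-memory buffer and a two-row toy frame instead of the real file; passing `index_col=0` on load is one possible way to handle it.

```python
import io
import pandas as pd

# Toy frame standing in for the normalized iris data
df = pd.DataFrame({'sepal_length': [-0.900681, -1.143017], 'class': [0, 0]})

buffer = io.StringIO()
df.to_csv(buffer)  # default index=True writes the index as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)
print(naive.columns.tolist())     # ['Unnamed: 0', 'sepal_length', 'class']

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())  # ['sepal_length', 'class']
```

Alternatively, writing with `index=False` avoids storing the index in the first place.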