# **CSC173: Group Activity 1**
>*Members: Febe Belvis, Kervin Paalisbo, Joshua Radz Adlaon*

This notebook documents our group's Activity 1, whose goal is to implement a neural network from scratch (without machine learning libraries such as TensorFlow or PyTorch). The network will be trained and evaluated on the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository.

***

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ucimlrepo import fetch_ucirepo # to directly import the dataset from its repo


***

## Loading and Preprocessing the Dataset

Before training the network, we import the dataset from its repository using the ```ucimlrepo``` library.

In [25]:
breast_cancer = fetch_ucirepo(id=17) # import the dataset from the ML repo

# extract the features and the target
X = breast_cancer.data.features
y = breast_cancer.data.targets

# combine into one DataFrame
df = pd.concat([X, y], axis=1)

We then preprocess the dataset by renaming the target column to lowercase ```diagnosis``` (to match the feature names) and converting its values from ```'M'```/```'B'``` to ```1```/```0```.

In [26]:
# rename the target (last) column to lowercase 'diagnosis' to match the feature names
df.rename(columns={df.columns[-1]: 'diagnosis'}, inplace=True)

# convert diagnosis values to binary (M=1, B=0)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

df.head(10)

| | radius1 | texture1 | perimeter1 | area1 | smoothness1 | compactness1 | concavity1 | concave_points1 | symmetry1 | fractal_dimension1 | ... | texture3 | perimeter3 | area3 | smoothness3 | compactness3 | concavity3 | concave_points3 | symmetry3 | fractal_dimension3 | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 1 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 1 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 1 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 1 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 1 |
| 5 | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 | 0.08089 | 0.2087 | 0.07613 | ... | 23.75 | 103.4 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.1244 | 1 |
| 6 | 18.25 | 19.98 | 119.6 | 1040.0 | 0.09463 | 0.109 | 0.1127 | 0.074 | 0.1794 | 0.05742 | ... | 27.66 | 153.2 | 1606.0 | 0.1442 | 0.2576 | 0.3784 | 0.1932 | 0.3063 | 0.08368 | 1 |
| 7 | 13.71 | 20.83 | 90.2 | 577.9 | 0.1189 | 0.1645 | 0.09366 | 0.05985 | 0.2196 | 0.07451 | ... | 28.14 | 110.6 | 897.0 | 0.1654 | 0.3682 | 0.2678 | 0.1556 | 0.3196 | 0.1151 | 1 |
| 8 | 13.0 | 21.82 | 87.5 | 519.8 | 0.1273 | 0.1932 | 0.1859 | 0.09353 | 0.235 | 0.07389 | ... | 30.73 | 106.2 | 739.3 | 0.1703 | 0.5401 | 0.539 | 0.206 | 0.4378 | 0.1072 | 1 |
| 9 | 12.46 | 24.04 | 83.97 | 475.9 | 0.1186 | 0.2396 | 0.2273 | 0.08543 | 0.203 | 0.08243 | ... | 40.68 | 97.65 | 711.4 | 0.1853 | 1.058 | 1.105 | 0.221 | 0.4366 | 0.2075 | 1 |
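
One thing to watch with ```Series.map``` is that any label missing from the dictionary becomes ```NaN``` instead of raising an error. The sketch below uses made-up toy labels (not the dataset itself), including a hypothetical stray lowercase ```'b'```, to show how such cases could be counted after the conversion.

```python
import pandas as pd

# Toy labels standing in for the diagnosis column (hypothetical values, not real data)
labels = pd.Series(['M', 'B', 'B', 'M', 'b'])

# Same mapping as in the preprocessing cell; labels not in the dict become NaN
encoded = labels.map({'M': 1, 'B': 0})

# Count the labels that matched neither key
n_unmapped = int(encoded.isna().sum())
print("Unmapped labels:", n_unmapped)  # the lowercase 'b' is the only unmapped label
```

On the real DataFrame, the equivalent check would be ```df['diagnosis'].isna().sum()```, which should be 0 if every row was labeled ```'M'``` or ```'B'```.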


***
## Feature Selection (using Correlation)
In this step, we identify the two features most correlated with the target ```diagnosis```.

We use the **Pearson correlation coefficient**, which measures the linear relationship between two variables. 


In [31]:
# compute correlation with the target 'diagnosis'
corr = df.corr()['diagnosis'].abs().sort_values(ascending=False)

# display top correlations
corr.head(10)

diagnosis          1.000000
concave_points3    0.793566
perimeter3         0.782914
concave_points1    0.776614
radius3            0.776454
perimeter1         0.742636
area3              0.733825
radius1            0.730029
area1              0.708984
concavity1         0.696360
Name: diagnosis, dtype: float64
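
For reference, the Pearson coefficient of two variables is their covariance divided by the product of their standard deviations, which is what ```DataFrame.corr()``` computes by default. A minimal NumPy sketch follows; the function name ```pearson_r``` and the toy values are for illustration only and are not taken from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y over the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

# Toy data (made up): one feature and a binary target
feature = [0.27, 0.19, 0.24, 0.05, 0.08, 0.03]
target = [1, 1, 1, 0, 0, 0]

r = pearson_r(feature, target)
print("Pearson r:", r)

# Cross-check against NumPy's built-in correlation matrix
print("np.corrcoef:", np.corrcoef(feature, target)[0, 1])
```

Both printed values should agree up to floating-point error. Taking ```.abs()``` of the coefficient, as the cell above does, ranks features by the strength of the linear relationship regardless of its sign.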

*A higher absolute correlation (closer to 1) means the feature has a stronger linear association with the target, which suggests it is more useful for prediction.*

<span style="font-size:12px">*Source: GeeksforGeeks. (2025, July 23). Pearson Correlation Coefficient. GeeksforGeeks. https://www.geeksforgeeks.org/maths/pearson-correlation-coefficient/*</span>




<br><br>Since we are limited to 2 input features, we pick the two features most correlated with ```diagnosis```. The slice starts at index 1 because the first entry of ```corr``` is ```diagnosis``` itself.

In [None]:
top_features = corr.index[1:3].tolist()

# extract top 2 correlated features
X_selected = X[top_features]

print("Selected features:", top_features)
print(X_selected)

Selected features: ['concave_points3', 'perimeter3']
     concave_points3  perimeter3
0             0.2654      184.60
1             0.1860      158.80
2             0.2430      152.50
3             0.2575       98.87
4             0.1625      152.20
..               ...         ...
564           0.2216      166.10
565           0.1628      155.00
566           0.1418      126.70
567           0.2650      184.60
568           0.0000       59.16

[569 rows x 2 columns]
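
The positional slice ```corr.index[1:3]``` works because ```diagnosis``` correlates perfectly with itself and therefore sorts first. As a possible alternative, the target can be dropped by name before taking the two largest values, which does not depend on where it lands in the ordering. The sketch below uses a toy Series with made-up feature names and values.

```python
import pandas as pd

# Toy absolute correlations (made-up names and values), sorted in descending order
corr_toy = pd.Series(
    {'diagnosis': 1.0, 'feat_a': 0.79, 'feat_b': 0.78, 'feat_c': 0.60}
)

# Drop the target by name, then keep the two largest remaining correlations
top_two = corr_toy.drop('diagnosis').nlargest(2).index.tolist()
print(top_two)  # ['feat_a', 'feat_b']
```

On this toy input, the result matches what the positional slice ```corr_toy.index[1:3]``` would return.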
