<a href="https://colab.research.google.com/github/YogithL/Data-Science/blob/main/YogiLogaU9proj.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [34]:
%%capture
!pip install lightning
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

import lightning as L
from torch.utils.data import TensorDataset, DataLoader

import pandas as pd
from sklearn.model_selection import train_test_split

# **Breast Cancer Classification Using Neural Networks**

Breast Cancer is one of the most common and deadliest cancers affecting women worldwide. However, like most diseases, making accurate and early diagnoses goes a long way in improving patient outcomes. This project uses the Wisconsin Breast Cancer Dataset, which contains features extracted from digitized images of Fine Needle Aspirates (FNAs) of breast masses.

*One source I relied on heavily was https://colab.research.google.com/github/StatQuest/signa/blob/main/chapter_03/chapter_03_multiple_inputs_and_outputs.ipynb. This guide was as effective as the notes in explaining some of the new topics used in this project. Although the guide comments on almost all of its code, I've put the explanations into my own words, or at least my own understanding.*

## **The Data**


Each of the 10 cell characteristics in the dataset comes with three types of measurements: mean, standard error, and "worst". For my model, I chose to focus only on the mean because I wanted to keep things simple. Looking back, including the "worst" values might have been helpful, since each one is the mean of the three largest values for that feature. This is especially relevant since cancer itself is a type of abnormality.

In [35]:
columns = [  # This was given in a separate file from the data
    "ID number", "Diagnosis",
    "radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean",
    "compactness_mean", "concavity_mean", "concave_points_mean", "symmetry_mean", "fractal_dimension_mean",

    "radius_se", "texture_se", "perimeter_se", "area_se", "smoothness_se",
    "compactness_se", "concavity_se", "concave_points_se", "symmetry_se", "fractal_dimension_se",

    "radius_worst", "texture_worst", "perimeter_worst", "area_worst", "smoothness_worst",
    "compactness_worst", "concavity_worst", "concave_points_worst", "symmetry_worst", "fractal_dimension_worst"
]


wbc = pd.read_csv('https://raw.githubusercontent.com/YogithL/Data-Science/refs/heads/main/wdbc.data', header=None, names=columns)

wbc.head()

Unnamed: 0,ID number,Diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
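The 30 measurement columns follow a regular pattern: ten features, each repeated with the suffixes `_mean`, `_se`, and `_worst`. As a small sketch, the same list could be generated instead of typed out. The names `features`, `suffixes`, and `generated_columns` are only for illustration and aren't used elsewhere in this notebook.

```python
# Ten base features, in the same order as the hand-written list above
features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension",
]
suffixes = ["mean", "se", "worst"]

# All the means come first, then all the standard errors, then all the worst values
generated_columns = ["ID number", "Diagnosis"] + [
    f"{feature}_{suffix}" for suffix in suffixes for feature in features
]

print(len(generated_columns))  # 2 ID/label columns + 10 features * 3 suffixes
```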


## **Prepping The Data**

Since I have some background knowledge about cancer from previous classes and research, I decided to select features that made the most sense to me. I mainly focused on traits commonly seen in skin cancer and tried to draw parallels to breast cancer. I'm not sure how effective this approach will be, but I'm confident it won't hurt the model too much. For example, I know that cancerous patches of skin are often uneven, vary in color, and tend to be larger, so I kept those general patterns in mind when choosing which features to include.

In [36]:
input_values = wbc[['area_mean', 'texture_mean', "smoothness_mean"]]
input_values.head()

Unnamed: 0,area_mean,texture_mean,smoothness_mean
0,1001.0,10.38,0.1184
1,1326.0,17.77,0.08474
2,1203.0,21.25,0.1096
3,386.1,20.38,0.1425
4,1297.0,14.34,0.1003


These are the values I'm predicting: whether the mass is malignant or benign.

In [37]:
label_values = wbc['Diagnosis']
label_values.head()

Unnamed: 0,Diagnosis
0,M
1,M
2,M
3,M
4,M


Here, I'm converting the categorical values into numbers so the computer can understand and work with them; the guide told me this process is called factorizing.

In [38]:
#0= Malignant
#1= Benign
diagnosis_as_numbers = label_values.factorize()[0]
diagnosis_as_numbers

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
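To show why malignant ended up as 0 and benign as 1, here is a plain-Python sketch of the idea behind factorizing. It is my own simplified version, not how pandas implements it: each new category gets the next unused number, in order of first appearance. Since the first rows of the dataset are all `M`, `M` gets code 0.

```python
def factorize_sketch(values):
    """Return (codes, uniques), numbering categories in order of first appearance."""
    code_for = {}   # category -> integer code
    uniques = []    # categories in the order they were first seen
    codes = []
    for value in values:
        if value not in code_for:
            code_for[value] = len(uniques)
            uniques.append(value)
        codes.append(code_for[value])
    return codes, uniques


codes, uniques = factorize_sketch(["M", "M", "B", "M", "B"])
print(codes)    # [0, 0, 1, 0, 1]
print(uniques)  # ['M', 'B']
```

If the file had started with a benign record, the codes would have been flipped, so it's worth checking the mapping rather than assuming it.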

Now I'm splitting the input values (my features) and labels (malignant or benign) into a training and testing set. I'll use 75% of the data to train the model and keep the remaining 25% to test how well the model performs.

In [39]:
input_train, input_test, label_train, label_test = train_test_split(input_values,
                                                                    diagnosis_as_numbers,
                                                                    test_size=0.25,
                                                                    stratify=diagnosis_as_numbers)
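The `stratify` argument keeps the ratio of malignant to benign cases roughly the same in both sets. The sketch below shows that idea in plain Python with made-up labels. It is a simplified stand-in with a hypothetical `stratified_split` helper, not scikit-learn's actual algorithm, so the exact rows and counts it picks can differ from `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split row indices so each class contributes about test_fraction of its rows to the test set."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        rows_by_class[label].append(index)

    train_idx, test_idx = [], []
    for rows in rows_by_class.values():
        rng.shuffle(rows)
        n_test = round(len(rows) * test_fraction)
        test_idx.extend(rows[:n_test])
        train_idx.extend(rows[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up labels: 40 malignant and 60 benign
toy_labels = ["M"] * 40 + ["B"] * 60
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))                 # 75 25
print(sum(toy_labels[i] == "M" for i in test_idx))   # 10 of the 25 test rows are malignant
```

One thing to keep in mind: my call above does not pass `random_state`, so the split (and every number downstream of it) changes each time the notebook is rerun.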

Here, I'm checking the shape of the data to make sure the process went well.

In [40]:
#75 percent of the 569 records is about 426, so this is good!
input_train.shape

(426, 3)

In [41]:
#The labels should have the same 426 rows as the inputs, so this is good!
label_train.shape

(426,)

Since the neural network returns two output values, one for each diagnosis, I need to convert the labels into a two-element array so they match the expected output format.

In [42]:
#[0,1] is B
#[1,0] is M
one_hot_label_train = F.one_hot(torch.tensor(label_train)).type(torch.float32)
one_hot_label_train[:10]

tensor([[1., 0.],
        [0., 1.],
        [1., 0.],
        [1., 0.],
        [1., 0.],
        [1., 0.],
        [0., 1.],
        [1., 0.],
        [1., 0.],
        [0., 1.]])
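As a plain-Python sketch of what one-hot encoding does (the `one_hot_sketch` name is my own, and this only mirrors the idea behind `F.one_hot`): each label becomes a row of zeros with a single 1 at the position of its class.

```python
def one_hot_sketch(labels, num_classes):
    """Turn integer labels into rows with a 1.0 at the label's index and 0.0 elsewhere."""
    return [
        [1.0 if position == label else 0.0 for position in range(num_classes)]
        for label in labels
    ]


# 0 = malignant, 1 = benign, as in the factorized labels above
print(one_hot_sketch([0, 1, 0], num_classes=2))
# [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```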

Now, I need to normalize the data; here that means min-max scaling, which rescales each feature so that its minimum in the training set becomes 0 and its maximum becomes 1. This helps the NN train more effectively since it prevents features with large values from overpowering those with smaller ones; it also slightly speeds up how quickly the model learns.

In [43]:
## Determine the maximum values
max_vals_in_input_train = input_train.max()

## Printing them
max_vals_in_input_train

Unnamed: 0,0
area_mean,2499.0
texture_mean,39.28
smoothness_mean,0.1425


In [44]:
## Determining the minimum values
min_vals_in_input_train = input_train.min()

## Printing them
min_vals_in_input_train

Unnamed: 0,0
area_mean,143.5
texture_mean,9.71
smoothness_mean,0.06251


In [45]:
## Normalizing input_train with the maximum and minimum values from input_train
input_train = (input_train - min_vals_in_input_train) / (max_vals_in_input_train - min_vals_in_input_train)
input_train.head()


Unnamed: 0,area_mean,texture_mean,smoothness_mean
517,0.454468,0.356781,0.514939
287,0.158098,0.11532,0.088011
25,0.326555,0.226243,0.701213
253,0.333135,0.249239,0.478685
369,0.568245,0.412242,0.547443


In [46]:
## Normalizing input_test with the maximum and minimum values from input_train
input_test = (input_test - min_vals_in_input_train) / (max_vals_in_input_train - min_vals_in_input_train)
input_test.head()

Unnamed: 0,area_mean,texture_mean,smoothness_mean
519,0.148716,0.236388,0.624953
408,0.360093,0.370308,0.513689
291,0.230864,0.317552,0.342668
518,0.148419,0.287792,0.741218
385,0.221269,0.459249,0.303913
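Here is the same min-max formula as a small plain-Python sketch, using the `area_mean` minimum and maximum printed above. The `min_max_scale` helper and the value 2600.0 are made up for illustration. The point is that test values are scaled with the training minimum and maximum, so a test value larger than anything in the training set lands slightly above 1.

```python
def min_max_scale(values, train_min, train_max):
    """Rescale values so train_min maps to 0 and train_max maps to 1."""
    return [(value - train_min) / (train_max - train_min) for value in values]


area_min, area_max = 143.5, 2499.0  # from input_train above

# Training-style values: the minimum, a value in between, and the maximum
print(min_max_scale([143.5, 1001.0, 2499.0], area_min, area_max))

# A hypothetical test value above the training maximum scales to more than 1
print(min_max_scale([2600.0], area_min, area_max))
```

Reusing the training statistics for the test set keeps information about the test data from leaking into the preprocessing.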


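Here, I'm putting my data into a data loader. According to the guide, data loaders help organize the data so the model can access it in smaller groups (batches). They can also shuffle the data each epoch, which I learned prevents the model from learning patterns based on the order of the data rather than actually generalizing. Note that shuffling only happens when `shuffle=True` is passed; with the default settings used below, the loader keeps the original order and uses a batch size of 1.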

In [47]:
#Converting the DataFrame input_train into tensors
input_train_tensors = torch.tensor(input_train.values).type(torch.float32)

#Printing out the first 5 rows to make sure they are what I expect
input_train_tensors[:5]

tensor([[0.4545, 0.3568, 0.5149],
        [0.1581, 0.1153, 0.0880],
        [0.3266, 0.2262, 0.7012],
        [0.3331, 0.2492, 0.4787],
        [0.5682, 0.4122, 0.5474]])

In [48]:
#Converting the DataFrame input_test into tensors
input_test_tensors = torch.tensor(input_test.values).type(torch.float32)

#Printing out the first 5 rows to make sure they are what I expect
input_test_tensors[:5]

tensor([[0.1487, 0.2364, 0.6250],
        [0.3601, 0.3703, 0.5137],
        [0.2309, 0.3176, 0.3427],
        [0.1484, 0.2878, 0.7412],
        [0.2213, 0.4592, 0.3039]])

In [49]:
train_dataset = TensorDataset(input_train_tensors, one_hot_label_train)
train_dataloader = DataLoader(train_dataset)
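To make batching and shuffling concrete, here is a plain-Python sketch of what a data loader does during one epoch. The `iterate_batches` helper is hypothetical and only imitates the behavior; in PyTorch the equivalent settings would be passed as `DataLoader(train_dataset, batch_size=32, shuffle=True)`, where 32 is just an example value.

```python
import random


def iterate_batches(n_rows, batch_size=1, shuffle=False, seed=None):
    """Yield lists of row indices, one list per batch, for a single epoch."""
    indices = list(range(n_rows))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_rows, batch_size):
        yield indices[start:start + batch_size]


# Defaults, like the loader above: one row per batch, original order
default_batches = list(iterate_batches(426))
print(len(default_batches), default_batches[:3])  # 426 batches: [0], [1], [2], ...

# Larger shuffled batches: 13 full batches of 32 plus one final batch of 10
bigger_batches = list(iterate_batches(426, batch_size=32, shuffle=True, seed=0))
print(len(bigger_batches), len(bigger_batches[-1]))  # 14 10
```

With a batch size of 1, every row triggers its own weight update, which is why 10 epochs over 426 rows adds up to 4260 training steps.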

## **Building The Model**

The class below I got from the guide. To my understanding, the class defines a simple neural network model using PyTorch Lightning for binary classification.
1. The **\_\_init\_\_()** method sets up the model by initializing the weights and biases, plus other basic settings
2. The **forward()** method runs the input data through the model to make predictions
3. The **configure_optimizers()** method sets up the optimizer, which updates the model’s weights during training
4. The **training_step()** method takes a batch of training data, uses **forward()** to make predictions, calculates the loss between the predictions and actual labels, and then returns that loss so the model can learn




In [50]:
#ALL TAKEN FROM THE GUIDE
class MultipleInsOuts(L.LightningModule):

    def __init__(self):

        super().__init__()

        L.seed_everything(seed=42)

        self.input_to_hidden = nn.Linear(in_features=3, out_features=2, bias=True)

        self.hidden_to_output = nn.Linear(in_features=2, out_features=2, bias=True)

        self.loss = nn.MSELoss(reduction='sum')


    def forward(self, input):
        hidden = self.input_to_hidden(input)
        output_values = self.hidden_to_output(torch.relu(hidden))

        return(output_values)


    def configure_optimizers(self):
        return Adam(self.parameters(), lr=0.001)


    def training_step(self, batch, batch_idx):
        inputs, labels = batch

        outputs = self.forward(inputs)

        loss = self.loss(outputs, labels)

        return loss
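To check my understanding of what **forward()** and the loss actually compute, here is a plain-Python sketch of the same structure: 3 inputs, a hidden layer of 2 nodes with ReLU, and 2 outputs. The weights and the input row are made-up numbers, not the trained model's values, and the helper names are my own.

```python
def linear(x, weights, biases):
    """One linear layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    """Replace negative values with 0."""
    return [max(0.0, v) for v in values]


def forward(x, w1, b1, w2, b2):
    """Input -> hidden layer -> ReLU -> output layer, like the class above."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


def sum_squared_error(outputs, targets):
    """The quantity MSELoss(reduction='sum') adds up: squared differences, summed."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))


def count_params(n_in, n_out):
    """A linear layer has one weight per input/output pair plus one bias per output."""
    return n_in * n_out + n_out


# Hypothetical weights and biases, chosen only to make the arithmetic easy to follow
w1, b1 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.0, 0.0]
w2, b2 = [[1.0, 2.0], [-1.0, 1.0]], [0.1, 0.0]

output = forward([0.4, 0.2, 0.6], w1, b1, w2, b2)
loss = sum_squared_error(output, [1.0, 0.0])   # target [1, 0] means malignant
total_params = count_params(3, 2) + count_params(2, 2)

print(output)        # roughly [1.3, 0.6]
print(loss)          # roughly 0.45
print(total_params)  # 8 + 6 = 14
```

The parameter count of 8 + 6 = 14 is the same total that Lightning reports in its model summary when training starts.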

Now, after we handled all the data processing, we're ready to finally build and train our model!

In [51]:
model = MultipleInsOuts()

INFO:lightning.fabric.utilities.seed:Seed set to 42


The code below creates a Trainer object that lets us train our model for 10 epochs, basically how many times we want to run through our data. I can increase this amount later if I want, but for now this is fine.

In [52]:
trainer = L.Trainer(max_epochs=10)
trainer.fit(model, train_dataloaders=train_dataloader)

INFO:lightning.pytorch.utilities.rank_zero:Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
INFO:lightning.pytorch.utilities.rank_zero:GPU available: False, used: False
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.callbacks.model_summary:
  | Name             | Type    | Params | Mode 
-----------------------------------------------------
0 | input_to_hidden  | Linear  | 8      | train
1 | hidden_to_output | Linear  | 6      | train
2 | loss             | MSELoss | 0      | train
-----------------------------------------------------
14        Trainable params
0         Non-trainable params
14        Total params
0.000     Total estimated model params size (MB)
3         Modules in train mode
0         Modules in eval mode



INFO:lightning.pytorch.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=10` reached.


Great! Now that we've trained our model, we can input our test tensors into it and save the predictions.

In [53]:
predictions = model(input_test_tensors)

#Printing our first 4 predictions to make sure they look right
predictions[0:4,]

tensor([[0.2762, 0.6763],
        [0.7471, 0.3952],
        [0.3349, 0.6410],
        [0.3905, 0.6072]], grad_fn=<SliceBackward0>)

Note how the output above has 2 elements in each row; these represent our two different diagnoses. We can decode which diagnosis was predicted by choosing the index with the largest value. In the first 4 predictions, index 1 had the largest value in the first, third, and fourth rows, while index 0 had the largest value in the second row. This means the model believes the second mass is likely malignant and the other three are likely benign. We can automate this process using **torch.argmax()**, a function that returns the index of the largest value in each row.

In [54]:
predicted_labels = torch.argmax(predictions, dim=1)
predicted_labels[0:4]

tensor([1, 0, 1, 1])

The output above matches our interpretation! Now we can find our model's accuracy by counting the number of correct predictions and dividing by the total number of predictions.

In [55]:
#This line does just that!
torch.sum(torch.eq(torch.tensor(label_test), predicted_labels)) / len(predicted_labels)

tensor(0.9231)
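Here is a plain-Python sketch of the two steps, decoding with argmax and then computing accuracy, using the first four prediction rows printed earlier. The `hypothetical_truth` labels are made up just to show the calculation; they are not the real test labels.

```python
def argmax_rows(rows):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in rows]


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(predicted_labels)


first_four = [
    [0.2762, 0.6763],
    [0.7471, 0.3952],
    [0.3349, 0.6410],
    [0.3905, 0.6072],
]
decoded = argmax_rows(first_four)
print(decoded)  # [1, 0, 1, 1], the same as torch.argmax gave

hypothetical_truth = [1, 0, 0, 1]  # made-up labels for illustration
print(accuracy(hypothetical_truth, decoded))  # 0.75, since 3 of 4 match

# With 569 - 426 = 143 test rows, 0.9231 is what 132 correct predictions would give
print(round(132 / 143, 4))  # 0.9231
```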

Okay, so our model was 92.31% accurate; this is definitely not bad at all. Let's see if we can improve it by training it further.

In [56]:
#This gets the path to the checkpoint Lightning saved at the end of the first fit, so our first 10 epochs are kept
path_to_checkpoint = trainer.checkpoint_callback.best_model_path

#Now we're training to epoch 50
trainer = L.Trainer(max_epochs=50)
trainer.fit(model, train_dataloaders=train_dataloader, ckpt_path=path_to_checkpoint)

INFO:lightning.pytorch.utilities.rank_zero:Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
INFO:lightning.pytorch.utilities.rank_zero:GPU available: False, used: False
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.utilities.rank_zero:Restoring states from the checkpoint path at /content/lightning_logs/version_2/checkpoints/epoch=9-step=4260.ckpt
/usr/local/lib/python3.11/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py:362: The dirpath has changed from '/content/lightning_logs/version_2/checkpoints' to '/content/lightning_logs/version_3/checkpoints', therefore `best_model_score`, `kth_best_model_path`, `kth_value`, `last_model_path` and `best_k_models` won't be reloaded. Only `best_model_path` will be reloaded.
INFO:lightning.p


INFO:lightning.pytorch.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=50` reached.
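A quick bit of arithmetic helps explain the checkpoint name in the log above. This sketch assumes, as I understand it, that `max_epochs` counts total epochs when resuming from a checkpoint, so the second fit continues from epoch 10 instead of starting over. The variable names are my own.

```python
train_rows = 426       # rows in the training set
batch_size = 1         # the DataLoader default used in this notebook
first_fit_epochs = 10
total_epochs = 50

# Number of batches per epoch, rounding up for a partial final batch
steps_per_epoch = -(-train_rows // batch_size)

steps_after_first_fit = steps_per_epoch * first_fit_epochs
remaining_epochs = total_epochs - first_fit_epochs

print(steps_after_first_fit)  # 4260, matching "epoch=9-step=4260.ckpt"
print(remaining_epochs)       # 40 more epochs, assuming training resumes at epoch 10
```

The checkpoint says `epoch=9` because Lightning numbers epochs from 0, so epoch 9 is the tenth one.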


In [58]:
#The code below finds our accuracy as a fraction, just like before
predictions = model(input_test_tensors)

predicted_labels = torch.argmax(predictions, dim=1)

torch.sum(torch.eq(torch.tensor(label_test), predicted_labels)) / len(predicted_labels)

tensor(0.8951)

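In this run the model scored 89.51%, slightly below the 92.31% it reached after 10 epochs, though when I rerun the notebook it is over 90% accurate most of the time. The results change from run to run because the train/test split isn't given a fixed `random_state`. I think I'm going to leave it here; running more than 50 epochs might overfit the model (if I haven't already done that).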



## **My Conclusion**

Ultimately, I'm proud of this project; it was one of my most successful and complex, other than the Conditional Autoregression Model. I'm proud of achieving over 90% accuracy, but I'm sure I could've gotten that slightly better if time had allowed.