# Cancer Cell Classification and Clustering Report

## Introduction
This project develops machine learning and deep learning models that classify cell types and detect cancerous cells in a medical imaging dataset. The dataset consists of a custom CSV file (`data_labels_mainData.csv`) and the corresponding images stored in a `dataset/images` folder. The CSV contains 9,896 rows (the shape reported when the notebook loads it) with columns such as `ImageName`, `cellTypeName`, `cellType`, and `isCancerous`. The primary objectives are to predict cell types (e.g., fibroblast, inflammatory, epithelial, others) and to determine whether a cell is cancerous (binary classification: 0 for non-cancerous, 1 for cancerous). In addition, unsupervised clustering is used to identify inherent patterns in the data, with the aim of supporting early diagnosis and treatment planning.

## Tools and Technologies
The project leverages a robust set of tools and libraries to process data, train models, and visualize results:
- **Python**: Core programming language for implementation.
- **Pandas**: Used for data manipulation and analysis of the CSV dataset.
- **NumPy**: Facilitates numerical computations and array operations.
- **Scikit-learn**: Provides machine learning algorithms (Logistic Regression, Random Forest, K-Means, Hierarchical Clustering) and utilities (train-test split, cross-validation, scaling).
- **TensorFlow/Keras**: Enables the construction and training of a Convolutional Neural Network (CNN) for image-based classification.
- **Matplotlib & Seaborn**: Employed for creating visualizations such as bar plots, confusion matrices, ROC curves, and clustering dendrograms.
- **Pillow**: Handles image loading and preprocessing.
- **SciPy**: Supports hierarchical clustering visualization.
- **Joblib**: Used for saving trained machine learning models.
- **Jupyter Notebook**: Environment for interactive development and documentation.

The project is managed within a virtual environment (`venv`) to isolate dependencies, with all required libraries listed in `requirements.txt` for reproducibility.

## Models
The project implements a variety of supervised and unsupervised learning models to address the classification and clustering tasks:
- **Logistic Regression**: A linear model applied for binary classification of cancerous cells, using `cellTypeName` and `cellType` as features.
- **Random Forest**: An ensemble method with 100 trees, enhancing robustness in classifying cancer status and cell types.
- **Convolutional Neural Network (CNN)**: A deep learning model designed for image-based classification, featuring two convolutional layers (32 and 64 filters), each followed by max-pooling, a dense layer (128 units), dropout (0.5), and two output layers: a sigmoid unit for cancer status and a softmax layer for cell type.
- **K-Means Clustering**: Groups similar cells into 4 clusters based on scaled features, reflecting the four unique cell types.
- **Hierarchical Clustering**: Explores hierarchical relationships using the Ward linkage method, visualized via a dendrogram.
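
The two clustering steps listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the feature matrix `X` is synthetic (an assumption standing in for the scaled CSV features), and `fcluster` is used here only to turn the Ward tree into four flat clusters for comparison with K-Means.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the tabular features (assumption: two numeric
# columns forming four well-separated groups of 25 points each).
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

# Scale features so each column contributes equally to Euclidean distances.
X_scaled = StandardScaler().fit_transform(X)

# K-Means with 4 clusters, matching the four cell types.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Ward linkage builds the merge tree; fcluster cuts it into 4 flat clusters.
Z = linkage(X_scaled, method="ward")
ward_labels = fcluster(Z, t=4, criterion="maxclust")

print("K-Means cluster sizes:", np.bincount(kmeans_labels))
print("Ward cluster sizes:", np.bincount(ward_labels)[1:])  # fcluster labels start at 1
```

The linkage matrix `Z` is what `scipy.cluster.hierarchy.dendrogram` expects as input when drawing the dendrogram.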

These models are trained and evaluated to determine the best-performing approach for the given dataset.
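
For the CNN listed above, the layer sizes determine how many features reach the dense layer. The short calculation below works this out, assuming 3x3 kernels with stride 1 and no padding, 2x2 max-pooling, and 64x64 RGB inputs, as in the notebook code further down. The helper names `conv_out` and `pool_out` are illustrative only.

```python
def conv_out(size: int, kernel: int = 3) -> int:
    """Spatial size after a stride-1 convolution with 'valid' (no) padding."""
    return size - kernel + 1


def pool_out(size: int, pool: int = 2) -> int:
    """Spatial size after non-overlapping max-pooling (floor division)."""
    return size // pool


size = 64                         # input images are 64x64x3
size = pool_out(conv_out(size))   # Conv2D(32): 62, MaxPooling2D: 31
size = pool_out(conv_out(size))   # Conv2D(64): 29, MaxPooling2D: 14
flat_features = size * size * 64  # values fed into the dense layer

# Trainable parameters per layer = weights + biases
conv1_params = 3 * 3 * 3 * 32 + 32
conv2_params = 3 * 3 * 32 * 64 + 64
dense_params = flat_features * 128 + 128

print(flat_features)                             # 12544
print(conv1_params, conv2_params, dense_params)  # 896 18496 1605760
```

Under these assumptions, almost all of the trainable parameters sit in the 128-unit dense layer rather than in the convolutional layers.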

## Evaluation Pipeline
The evaluation pipeline ensures a comprehensive assessment of model performance:
- **Data Preprocessing**: Rows with missing values are dropped, `cellTypeName` is encoded using LabelEncoder, and features (`cellTypeName`, `cellType`) are scaled with StandardScaler. Images are resized to 64x64 pixels and their pixel values are scaled to the range [0, 1].
- **Train-Test Split**: An 80/20 split is applied for both cancer detection and cell type classification tasks.
- **Cross-Validation**: 5-fold cross-validation is performed for Logistic Regression and Random Forest to assess generalization.
- **Metrics**: Models are evaluated using accuracy, precision, recall, F1-score (weighted averages), confusion matrices, and ROC curves with AUC scores.
- **Visualizations**: 
  - Dataset distributions (bar plots for `cellTypeName` and `isCancerous`).
  - Model performance (confusion matrices, ROC curves).
  - Clustering results (K-Means scatter plot, Hierarchical Clustering dendrogram).
  - Comparative analysis (bar plot of evaluation metrics).
- **Model Selection**: The best-performing model (based on accuracy) is saved under the file name for its type: `logistic_regression_model.pkl`, `random_forest_model.pkl`, or `cnn_model.h5`.
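
The split, cross-validation, and metric steps above can be sketched for the two scikit-learn classifiers as follows. This is a hedged illustration under stated assumptions: the data comes from `make_classification` rather than the project's CSV, the `stratify` argument is an addition not mentioned in the report, and the `results` dictionary is a hypothetical structure used only to compare models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for the cancer-detection task (assumption).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           random_state=42)

# 80/20 split; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training data only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy on the training portion
    cv_scores = cross_val_score(model, X_train_s, y_train, cv=5, scoring="accuracy")
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    y_prob = model.predict_proba(X_test_s)[:, 1]  # probability of class 1
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted", zero_division=0)
    results[name] = {
        "cv_accuracy": cv_scores.mean(),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision, "recall": recall, "f1": f1,
        "auc": roc_auc_score(y_test, y_prob),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    summary = {k: round(float(v), 3) for k, v in results[name].items()
               if k != "confusion_matrix"}
    print(name, summary)

# Model selection by test accuracy, as described in the last bullet
best_name = max(results, key=lambda n: results[n]["accuracy"])
print("Best by test accuracy:", best_name)
```

Selecting on test accuracy alone mirrors the report. On an imbalanced dataset, the weighted F1-score or AUC stored in `results` may be a more informative criterion.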

## Complete End-to-End Workflow
The end-to-end pipeline integrates all stages from data loading to real-time testing:
1. **Data Loading**: The CSV file is read from `./dataset/data_labels_mainData.csv`, and images are loaded from `./dataset/images` using `ImageName` with `.png` extension.
2. **Preprocessing**: Data cleaning, encoding, scaling, and image preprocessing are performed.
3. **Model Training**: Logistic Regression, Random Forest, and CNN are trained on the preprocessed data.
4. **Clustering**: K-Means and Hierarchical Clustering are applied to identify patterns.
5. **Evaluation**: Models are evaluated with cross-validation, and visualizations are generated.
6. **Model Saving**: The best model is saved for future use.
7. **Real-Time Testing**: A sample image from the test set is selected, displayed, and predicted using the best model, with results (true vs. predicted cancer status and cell type) printed.
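
Because every later step depends on the data-loading step, it can help to confirm that the image files named in the CSV actually exist before training. The helper below is a small sketch of such a check. The function name `find_missing_images` is hypothetical, and the demonstration uses a temporary folder with placeholder files rather than the real dataset.

```python
import os
import tempfile


def find_missing_images(image_names, img_folder):
    """Return the names in image_names that have no file inside img_folder."""
    return [name for name in image_names
            if not os.path.isfile(os.path.join(img_folder, name))]


# Self-contained demonstration using a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for name in ("1.png", "2.png"):
        # Create empty placeholder files; only their existence matters here.
        open(os.path.join(folder, name), "wb").close()
    missing = find_missing_images(["1.png", "2.png", "3.png"], folder)

print(f"{len(missing)} of 3 images missing: {missing}")
# prints: 1 of 3 images missing: ['3.png']
```

With the project data, the same check could be run as `find_missing_images(df['ImageName'], './dataset/images')`. If every name comes back as missing, the folder path is the likely cause rather than the images themselves.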
 

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, Input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
import matplotlib.pyplot as plt
import os
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility (NumPy and TensorFlow)
import tensorflow as tf
np.random.seed(42)
tf.random.set_seed(42)

# Function to load and preprocess images from CSV
def load_and_preprocess_images(csv_path, img_folder='images', img_size=(64, 64), max_images=1000):
    try:
        # Load CSV
        df = pd.read_csv(csv_path)
        print("Dataset loaded successfully. Shape:", df.shape)
        
        # Select relevant columns and drop rows with missing values
        df = df[['ImageName', 'cellTypeName', 'isCancerous']].dropna()
        print("Dropped rows with missing values. Shape:", df.shape)
        
        # Encode cellTypeName
        le_cell = LabelEncoder()
        df['cellTypeName_encoded'] = le_cell.fit_transform(df['cellTypeName'])
        
        # Initialize lists for images and labels
        images, y_cancer, y_cell_type = [], [], []
        
        # Load images
        for idx, row in df.iterrows():
            if len(images) >= max_images:  # Limit to prevent memory issues
                break
            img_path = os.path.join(img_folder, row['ImageName'])
            if os.path.exists(img_path):
                try:
                    img = load_img(img_path, target_size=img_size)
                    img_array = img_to_array(img) / 255.0  # Normalize pixel values
                    images.append(img_array)
                    y_cancer.append(row['isCancerous'])
                    y_cell_type.append(row['cellTypeName_encoded'])
                except Exception as e:
                    print(f"Error loading image {img_path}: {e}")
            else:
                print(f"Image not found: {img_path}")
        
        # Convert to numpy arrays
        images = np.array(images)
        y_cancer = np.array(y_cancer)
        y_cell_type = np.array(y_cell_type)
        print(f"Loaded {len(images)} images successfully.")
        
        return images, y_cancer, y_cell_type, df, le_cell
    except Exception as e:
        print(f"Error loading/preprocessing data: {e}")
        return None, None, None, None, None

# Function to build and train multi-output CNN
def train_multi_output_cnn(X_train, y_train_cancer, y_train_cell_type, X_test, y_test_cancer, y_test_cell_type):
    # Define input layer from the shape of the training images (64x64 RGB by default)
    input_layer = Input(shape=X_train.shape[1:])
    
    # Convolutional base
    x = Conv2D(32, (3, 3), activation='relu')(input_layer)
    x = MaxPooling2D((2, 2))(x)
    x = Conv2D(64, (3, 3), activation='relu')(x)
    x = MaxPooling2D((2, 2))(x)
    x = Flatten()(x)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.5)(x)
    
    # Output layers
    cancer_output = Dense(1, activation='sigmoid', name='cancer_output')(x)  # Binary classification
    cell_type_output = Dense(len(np.unique(y_train_cell_type)), activation='softmax', name='cell_type_output')(x)  # Multi-class classification
    
    # Define model
    model = Model(inputs=input_layer, outputs=[cancer_output, cell_type_output])
    
    # Compile model
    model.compile(optimizer='adam',
                  loss={'cancer_output': 'binary_crossentropy', 'cell_type_output': 'sparse_categorical_crossentropy'},
                  metrics={'cancer_output': 'accuracy', 'cell_type_output': 'accuracy'})
    
    # Train model
    history = model.fit(X_train, {'cancer_output': y_train_cancer, 'cell_type_output': y_train_cell_type},
                        validation_data=(X_test, {'cancer_output': y_test_cancer, 'cell_type_output': y_test_cell_type}),
                        epochs=10, batch_size=32, verbose=1)
    
    # Evaluate model
    loss, cancer_loss, cell_type_loss, cancer_acc, cell_type_acc = model.evaluate(X_test, {'cancer_output': y_test_cancer, 'cell_type_output': y_test_cell_type}, verbose=0)
    print(f"Test Cancer Accuracy: {cancer_acc:.4f}")
    print(f"Test Cell Type Accuracy: {cell_type_acc:.4f}")
    
    # Predict on test set
    y_pred_cancer, y_pred_cell_type = model.predict(X_test)
    y_pred_cancer = (y_pred_cancer > 0.5).astype(int)
    y_pred_cell_type = np.argmax(y_pred_cell_type, axis=1)
    
    return model, cancer_acc, cell_type_acc, y_pred_cancer, y_pred_cell_type

# Function to visualize a sample prediction
def visualize_prediction(test_image, true_cancer, true_cell_type, pred_cancer, pred_cell_type, image_name, le_cell):
    plt.figure(figsize=(4, 4))
    plt.imshow(test_image)
    true_cancer_str = 'Cancerous' if true_cancer == 1 else 'Non-Cancerous'
    pred_cancer_str = 'Cancerous' if pred_cancer == 1 else 'Non-Cancerous'
    true_cell_type_str = le_cell.inverse_transform([true_cell_type])[0]
    pred_cell_type_str = le_cell.inverse_transform([pred_cell_type])[0]
    plt.title(f'Image: {image_name}\nTrue: {true_cell_type_str}, {true_cancer_str}\nPred: {pred_cell_type_str}, {pred_cancer_str}')
    plt.axis('off')
    plt.show()

# Function for real-time prediction
def predict_image(model, img_path, img_size=(64, 64), le_cell=None):
    try:
        img = load_img(img_path, target_size=img_size)
        img_array = img_to_array(img) / 255.0
        img_array = np.expand_dims(img_array, axis=0)
        
        # Predict
        pred_cancer, pred_cell_type = model.predict(img_array)
        pred_cancer = (pred_cancer > 0.5).astype(int)[0][0]
        pred_cell_type = np.argmax(pred_cell_type, axis=1)[0]
        
        # Convert to human-readable format
        cancer_status = 'Cancerous' if pred_cancer == 1 else 'Non-Cancerous'
        cell_type = le_cell.inverse_transform([pred_cell_type])[0]
        
        return cell_type, cancer_status
    except Exception as e:
        print(f"Error predicting image: {e}")
        return None, None

# Main execution
if __name__ == "__main__":
    # Load and preprocess data
    X_images, y_cancer, y_cell_type, df, le_cell = load_and_preprocess_images('data_labels_mainData.csv')
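    if X_images is None or len(X_images) == 0:
        # exit() does not stop a running notebook cell, so raise instead of
        # falling through to train_test_split with zero samples
        raise SystemExit("No images loaded. Exiting.")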
    
    # Split data into train and test sets
    X_train_img, X_test_img, y_train_cancer, y_test_cancer, y_train_cell_type, y_test_cell_type = train_test_split(
        X_images, y_cancer, y_cell_type, test_size=0.2, random_state=42
    )
    
    # Train multi-output CNN
    model, cancer_acc, cell_type_acc, y_pred_cancer, y_pred_cell_type = train_multi_output_cnn(
        X_train_img, y_train_cancer, y_train_cell_type, X_test_img, y_test_cancer, y_test_cell_type
    )
    
    # Save the model
    model.save('multi_output_cnn_model.h5')
    print("Model saved as 'multi_output_cnn_model.h5'")
    
    # Visualize a sample prediction from test set
    test_idx = 0
    test_image = X_test_img[test_idx]
    true_cancer = y_test_cancer[test_idx]
    true_cell_type = y_test_cell_type[test_idx]
    pred_cancer = y_pred_cancer[test_idx][0]
    pred_cell_type = y_pred_cell_type[test_idx]
    # train_test_split shuffles the samples and the loader does not return file names,
    # so the original file name cannot be recovered here; label the plot by test index
    image_name = f"test sample {test_idx}"
    visualize_prediction(test_image, true_cancer, true_cell_type, pred_cancer, pred_cell_type, image_name, le_cell)
    
    # Real-time prediction example
    sample_image_path = os.path.join('images', '22405.png')  # Example image path
    if os.path.exists(sample_image_path):
        predicted_cell_type, predicted_cancer_status = predict_image(model, sample_image_path, le_cell=le_cell)
        if predicted_cell_type and predicted_cancer_status:
            print(f"\nReal-Time Prediction for {os.path.basename(sample_image_path)}:")
            print(f"Predicted Cell Type: {predicted_cell_type}")
            print(f"Predicted Cancer Status: {predicted_cancer_status}")
    else:
        print(f"Sample image {sample_image_path} not found.")

Dataset loaded successfully. Shape: (9896, 6)
Dropped rows with missing values. Shape: (9896, 3)
Image not found: images\22405.png
Image not found: images\22406.png
Image not found: images\22407.png
Image not found: images\22408.png
Image not found: images\22409.png
Image not found: images\22410.png
Image not found: images\22411.png
Image not found: images\22412.png
Image not found: images\22413.png
Image not found: images\22414.png
Image not found: images\22415.png
Image not found: images\22417.png
Image not found: images\22418.png
Image not found: images\22419.png
Image not found: images\22420.png
Image not found: images\22421.png
Image not found: images\22422.png
Image not found: images\22423.png
Image not found: images\22424.png
Image not found: images\19035.png
Image not found: images\19036.png
Image not found: images\19037.png
Image not found: images\19038.png
Image not found: images\19039.png
Image not found: images\19040.png
Image not found: images\19041.png
Image not found: im

ValueError: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
