# Vidhamjot Kaur
# C0909093

**https://github.com/VidhamjotKaur/Breast-Cancer-Data-Analysis**


## Assignment Guide: Breast Cancer Data Analysis and Streamlit App

### Step 1: Project Setup

1. Create a Project Directory:
   - Create a new directory for your project in VS Code.
   - Initialize a Git repository.
2. Set Up a Virtual Environment:
   - Create and activate a virtual environment for the project.

### Step 2: Dataset Acquisition and Preparation

1. Download the Dataset:
   - Download the Breast Cancer dataset from a reliable source like the UCI Machine Learning Repository or Kaggle, or get the dataset from sklearn.
2. Data Preparation:
   - Write a Python script to load and preprocess the dataset, ensuring it is ready for analysis.

### Step 3: Feature Selection

1. Feature Selection Technique:
   - Implement feature selection using methods like SelectKBest from sklearn.feature_selection.
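
To see why SelectKBest keeps a given feature, it can help to rank all features by their ANOVA F-score. The sketch below is illustrative only: it assumes the dataset is loaded from sklearn and that `k=10`, as in the pipeline later in this notebook. The variable names `scores`, `top_by_score`, and `kept` are introduced here for the example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Load the features into a DataFrame so the scores can be labelled by name
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

# Fit the selector; scores_ holds one ANOVA F-value per feature
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Rank the features from highest to lowest F-value
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
top_by_score = scores.head(10).index.tolist()

# get_support() marks the columns the selector keeps
kept = X.columns[selector.get_support()].tolist()

print(scores.head(10).round(1))
```

The ten highest-scoring names in `scores` are the same columns that `get_support()` marks, which makes the selection easy to audit.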

### Step 4: Grid Search CV for Model Tuning

1. Grid Search Cross-Validation:
   - Provide a template or guide for setting up Grid Search CV to optimize the parameters of an ANN model (MLPClassifier from sklearn.neural_network).
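
One possible template for this step is sketched below. It is not the notebook's own pipeline: it assumes a scikit-learn `Pipeline` that wraps feature selection, scaling, and the MLP, so that every cross-validation fold fits the preprocessing on its training portion only. The grid values, the `stratify=y` split, and the step names (`select`, `scale`, `mlp`) are illustrative assumptions, kept small so the search runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing lives inside the pipeline, so it is refit on each CV training fold
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(max_iter=1000, random_state=42)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {
    "mlp__hidden_layer_sizes": [(20,), (50,)],
    "mlp__alpha": [0.0001, 0.01],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validation accuracy:", round(search.best_score_, 3))
print("Held-out test accuracy:", round(search.score(X_test, y_test), 3))
```

Because the test split never passes through `fit`, the held-out accuracy is not influenced by the feature selection or the scaling statistics.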

### Step 5: Implementing an Artificial Neural Network (ANN) Model

1. ANN Model Creation:
   - Outline the steps to create an ANN model.
   - Train and evaluate the model using the breast cancer dataset.

### Step 6: Building a Streamlit App Locally

1. Streamlit code:
   - Use Streamlit as a tool for building interactive web apps with Python.
2. Developing the Streamlit App:
   - Create a basic Streamlit app that allows users to interact with the breast cancer dataset and view model predictions.
   - Integrate model predictions and user interaction within the Streamlit app.
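
The prediction path of such an app comes down to loading the saved artifacts and applying them in the same order as during training. The sketch below leaves out the Streamlit widgets and shows only that round trip. The file names `selected_features.pkl`, `scaler.pkl`, and `mlp_model.pkl` match the ones written by the pipeline in this notebook. The helper `predict_one`, the temporary `artifact_dir`, and the small stand-in training run are hypothetical and exist only to keep the example self-contained.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler


def predict_one(sample: dict, artifact_dir: Path) -> str:
    """Hypothetical helper: map one row of raw feature values to a class name."""
    features = joblib.load(artifact_dir / "selected_features.pkl")
    scaler = joblib.load(artifact_dir / "scaler.pkl")
    model = joblib.load(artifact_dir / "mlp_model.pkl")
    # Keep only the selected columns, in the order used during training
    row = pd.DataFrame([sample])[list(features)]
    # Pass a plain array because the stand-in scaler below was fitted on one
    label = model.predict(scaler.transform(row.to_numpy()))[0]
    # In load_breast_cancer, class 0 is malignant and class 1 is benign
    return "malignant" if label == 0 else "benign"


# Stand-in training run (an assumption, not the notebook's tuned model)
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
features = X.columns[selector.get_support()]
scaler = StandardScaler().fit(X[features].to_numpy())
model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=42)
model.fit(scaler.transform(X[features].to_numpy()), y)

# Save the artifacts to a temporary directory under the notebook's file names
artifact_dir = Path(tempfile.mkdtemp())
joblib.dump(features, artifact_dir / "selected_features.pkl")
joblib.dump(scaler, artifact_dir / "scaler.pkl")
joblib.dump(model, artifact_dir / "mlp_model.pkl")

# In the app, these values would come from input widgets
sample = X.iloc[0].to_dict()
prediction = predict_one(sample, artifact_dir)
print("Prediction for the first row:", prediction)
```

In a Streamlit script, the `sample` dictionary would be filled from one numeric input per selected feature, and the returned label would be written to the page.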

### Step 7: Deployment and Version Control

1. GitHub Repository Setup:
   - Set up a GitHub repository for the project and give the link in the comment section.
   - Commit code regularly and push changes to GitHub.
2. Submission Requirements:
   - Specify the deliverables, such as the Python scripts, Streamlit app code, and a README.md file documenting the project.

### Additional Tips

- Documentation and Comments: Emphasize the importance of clear documentation and comments in the code to explain each step and rationale.
- Encourage Exploration: Encourage students to explore different feature selection techniques, model architectures, and hyperparameter configurations beyond the basic requirements.

By following these steps, students can gain hands-on experience in data preprocessing, model development, and interactive web application creation using Streamlit, enhancing their understanding of machine learning concepts and practical skills.

In [None]:
!python.exe -m pip install --upgrade pip

In [None]:
!pip install pandas
!pip install scikit-learn
!pip install joblib
!pip install streamlit
!pip install numpy
# warnings is part of the Python standard library, so it needs no pip install

In [None]:
pip freeze > requirements.txt

In [3]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix
import joblib
import warnings
from sklearn.exceptions import ConvergenceWarning

def load_dataset():
    """
    Load the breast cancer dataset from scikit-learn.

    Returns:
        X (pd.DataFrame): Feature data.
        y (pd.Series): Target labels.
    """
    data = load_breast_cancer()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = pd.Series(data.target, name='target')
    return X, y

def check_missing_values(X):
    """
    Check and print the number of missing values in each feature.

    Args:
        X (pd.DataFrame): Feature data.
    """
    missing = X.isnull().sum()
    print("Missing values in each feature:\n", missing)

def select_features(X, y, k=10):
    """
    Select the top k features based on ANOVA F-value.

    Args:
        X (pd.DataFrame): Feature data.
        y (pd.Series): Target labels.
        k (int): Number of top features to select.

    Returns:
        X_selected_df (pd.DataFrame): DataFrame containing the selected features.
        selected_features (pd.Index): Names of the selected features.
    """
    selector = SelectKBest(score_func=f_classif, k=k)
    X_selected = selector.fit_transform(X, y)
    selected_features = X.columns[selector.get_support()]
    print("Selected Features:", selected_features.tolist())
    
    # Save the selected features
    joblib.dump(selected_features, 'selected_features.pkl')
    print("Selected features saved to 'selected_features.pkl'.")
    
    X_selected_df = pd.DataFrame(X_selected, columns=selected_features)
    return X_selected_df, selected_features

def standardize_features(X, selected_features):
    """
    Standardize the selected features using StandardScaler.

    Args:
        X (pd.DataFrame): DataFrame containing selected features.
        selected_features (pd.Index): Names of the selected features.

    Returns:
        X_scaled_df (pd.DataFrame): Standardized feature data.
        scaler (StandardScaler): Fitted scaler object.
    """
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    X_scaled_df = pd.DataFrame(X_scaled, columns=selected_features)
    
    # Save the scaler
    joblib.dump(scaler, 'scaler.pkl')
    print("Scaler saved to 'scaler.pkl'.")
    
    return X_scaled_df, scaler

def split_data(X, y, test_size=0.2, random_state=42):
    """
    Split the dataset into training and testing sets.

    Args:
        X (pd.DataFrame): Feature data.
        y (pd.Series): Target labels.
        test_size (float): Proportion of the dataset to include in the test split.
        random_state (int): Random seed.

    Returns:
        X_train, X_test, y_train, y_test: Split data.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return X_train, X_test, y_train, y_test


def perform_grid_search(X_train, y_train):
    """
    Perform an expanded Grid Search to find the best hyperparameters for MLPClassifier.

    This function defines a moderately extensive parameter grid and uses GridSearchCV to
    search for the optimal combination of hyperparameters. The search includes various
    configurations for hidden layer sizes, activation functions, solvers, regularization
    parameters, and learning rates.

    Args:
        X_train (pd.DataFrame): Training feature data.
        y_train (pd.Series): Training target labels.

    Returns:
        clf (GridSearchCV): Fitted GridSearchCV object with the best found parameters.
    """
    # Define an expanded yet reasonable parameter grid for Grid Search
    parameter_space = {
        'hidden_layer_sizes': [
            (50,), (100,), (100, 50), (100, 100), (150, 100)
        ],
        'activation': ['tanh', 'relu', 'logistic'],
        'solver': ['adam', 'lbfgs'],
        'alpha': [0.0001, 0.001, 0.01, 0.05],
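        # Note: scikit-learn uses learning_rate only when solver='sgd', so with
        # 'adam' and 'lbfgs' this axis repeats the same fits
        'learning_rate': ['constant', 'adaptive'],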
    }

    # Initialize the MLPClassifier with a random state for reproducibility
    mlp = MLPClassifier(
        max_iter=500,  # Raised above the scikit-learn default of 200 to aid convergence
        random_state=42
    )

    # Initialize GridSearchCV with the expanded parameter grid
    clf = GridSearchCV(
        estimator=mlp,
        param_grid=parameter_space,
        n_jobs=-1,          # Utilize all available CPU cores
        cv=3,               # 3-fold cross-validation
        verbose=2,          # Verbosity level for detailed logs
        scoring='accuracy'  # Evaluation metric
    )

    # Fit GridSearchCV to the training data
    clf.fit(X_train, y_train)

    # Display the best parameters found by Grid Search
    print('Best parameters found:\n', clf.best_params_)
    print('Best cross-validation accuracy:', clf.best_score_)

    return clf


def evaluate_model(model, X_test, y_test):
    """
    Evaluate the trained model on the test set and print metrics.

    Args:
        model (MLPClassifier): Trained MLPClassifier model.
        X_test (pd.DataFrame): Testing feature data.
        y_test (pd.Series): Testing target labels.
    """
    y_pred = model.predict(X_test)
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))

def save_model(model, filename='mlp_model.pkl'):
    """
    Save the trained model to a file.

    Args:
        model (MLPClassifier): Trained MLPClassifier model.
        filename (str): Filename to save the model.
    """
    joblib.dump(model, filename)
    print(f"Trained model saved to '{filename}'.")

def main():
    """
    Main function to execute the machine learning pipeline:
    - Load data
    - Check for missing values
    - Select features
    - Standardize features
    - Split data
    - Perform Grid Search
    - Train and evaluate the model
    - Save the trained model and preprocessing objects
    """
    # Optionally suppress convergence warnings (not recommended)
    # warnings.filterwarnings("ignore", category=ConvergenceWarning)
    
    # Load dataset
    X, y = load_dataset()
    
    # Check for missing values
    check_missing_values(X)
    
    # Feature Selection
    X_selected_df, selected_features = select_features(X, y, k=10)
    
    # Standardize Features (only selected features)
    X_scaled_df, scaler = standardize_features(X_selected_df, selected_features)
    
    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = split_data(X_scaled_df, y, test_size=0.2, random_state=42)
    
    # Perform Grid Search to find the best MLPClassifier parameters
    clf = perform_grid_search(X_train, y_train)
    
    # GridSearchCV refits the best configuration on the full training set
    # (refit=True by default), so best_estimator_ is already trained
    best_mlp = clf.best_estimator_
    
    # Evaluate the model
    evaluate_model(best_mlp, X_test, y_test)
    
    # Save the trained model
    save_model(best_mlp, 'mlp_model.pkl')

if __name__ == "__main__":
    main()


Missing values in each feature:
 mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64
Selected Features: ['mean radius', 'mean perimeter', 'mean area', 'mean concavity', 