In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

In [4]:
# Function for loading the dataset
def load_dataset(file_path):
    df = pd.read_csv(file_path)
    return df

In [5]:
# Function for data exploration and cleaning
def explore_and_clean_data(df):
    # Displaying the first few rows of the dataset
    print("First few rows of the dataset:")
    print(df.head())

    # Checking for missing values
    print("\nMissing values summary:")
    print(df.isnull().sum())

    # Dropping the 'Unnamed: 32' column (skipped without error if it is absent)
    df = df.drop(columns=["Unnamed: 32"], errors="ignore")

    # Mapping 'M' (malignant) to 1 and 'B' (benign) to 0 in the 'diagnosis' column
    df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

    # Displaying the modified dataset
    print("\nModified dataset:")
    print(df.head())

    # Checking for missing values again
    print("\nMissing values after modifications:")
    print(df.isnull().sum())

    return df
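
The cleaning step above drops `Unnamed: 32` by name. As a minimal sketch, assuming that column is entirely empty (which is typical of columns created by trailing delimiters in a CSV, but is not confirmed by the output shown here), such columns can also be detected and dropped generically. The miniature frame below is hypothetical and only stands in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the last column mimics an all-empty CSV column.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, np.nan],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing
empty_columns = df.columns[df.isnull().all()].tolist()
print(empty_columns)

# Drop only the columns that are entirely missing; partially filled ones stay
cleaned = df.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

A column with only some missing values, such as `radius_mean` in this toy frame, is kept, so this approach does not silently discard real features.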

In [6]:
# Function for data preprocessing and feature engineering
def preprocess_and_engineer_features(df):
    # Separating features and labels
    features = df.iloc[:, 2:32]
    label = df.iloc[:, 1]

    return features, label
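
The positional slices above depend on the exact column order of the file. A possible alternative, sketched here on a hypothetical two-feature frame, selects the label by name and treats every column other than `id` and `diagnosis` as a feature. The helper name `split_features_label` and its parameters are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature frame using column names seen in the dataset preview.
df = pd.DataFrame({
    "id": [842302, 842517],
    "diagnosis": [1, 1],
    "radius_mean": [17.99, 20.57],
    "texture_mean": [10.38, 17.77],
})

def split_features_label(df, label_col="diagnosis", id_col="id"):
    """Return (features, label), selecting columns by name instead of position."""
    features = df.drop(columns=[id_col, label_col])
    label = df[label_col]
    return features, label

features, label = split_features_label(df)
print(features.columns.tolist())
print(label.tolist())
```

Selecting by name keeps working if columns are reordered, and it raises a `KeyError` if an expected column is missing rather than silently picking the wrong one.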

In [8]:
# Function for model training using cross-validation
def train_model(x_train, y_train):
    # Creating a Logistic Regression model
    model = LogisticRegression()

    # Performing cross-validation
    cross_val_scores = cross_val_score(model, x_train, y_train, cv=5, scoring='accuracy')

    # Displaying cross-validation scores
    print("\nCross-Validation Scores:")
    print(cross_val_scores)
    print(f"Mean Cross-Validation Score: {np.mean(cross_val_scores)}")

    # Fitting the model on the entire training data
    model.fit(x_train, y_train)

    return model
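
With default settings, `LogisticRegression` can stop before converging on unscaled features; scikit-learn's warning for that case suggests raising `max_iter` or scaling the data. The sketch below shows one way to do both with a pipeline, so that the scaler is refit inside each cross-validation fold. It uses scikit-learn's bundled breast cancer dataset as a stand-in for the CSV loaded in this notebook (an assumption; note that its labels are coded 0 = malignant, 1 = benign, the reverse of the mapping used here), and `max_iter=1000` is an arbitrary illustrative value.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 30 numeric features per sample, binary target.
X, y = load_breast_cancer(return_X_y=True)

# Standardize features, then fit logistic regression with a larger iteration cap.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The whole pipeline is cross-validated, so scaling statistics come only
# from the training part of each fold.
scores = cross_val_score(scaled_model, X, y, cv=5, scoring="accuracy")
print(scores)
print(f"Mean cross-validation accuracy: {scores.mean():.3f}")
```

The fitted pipeline exposes the same `fit` and `predict` methods as the bare model, so it could be returned from `train_model` without changing the evaluation code.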

In [9]:
# Function for model evaluation
def evaluate_model(model, x_test, y_test):
    # Making predictions on test data
    predicted_classes = model.predict(x_test)

    # Displaying confusion matrix and classification report
    con_mat = confusion_matrix(y_test, predicted_classes)
    class_rep = classification_report(y_test, predicted_classes)

    print("\nConfusion Matrix:")
    print(con_mat)
    print("\nClassification Report:")
    print(class_rep)
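
`evaluate_model` only prints its results. If the numbers are needed later, for example to compare models, `classification_report` can return a dictionary instead of a string through its `output_dict` parameter. The labels and predictions below are made up for illustration.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
con_mat = confusion_matrix(y_true, y_pred)

# Per-class metrics as a nested dict; class labels become string keys
report = classification_report(y_true, y_pred, output_dict=True)

print(con_mat)
print(report["1"]["recall"])
print(report["accuracy"])
```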

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [10]:
def main():
    # Load the dataset
    file_path = '/content/drive/MyDrive/breast cancer.csv'
    dataset = load_dataset(file_path)

    # Data exploration and cleaning
    dataset = explore_and_clean_data(dataset)

    # Data preprocessing and feature engineering
    features, label = preprocess_and_engineer_features(dataset)

    # Splitting the data into training and testing sets
    x_train, x_test, y_train, y_test = train_test_split(
        features, label, test_size=0.3, random_state=88
    )

    # Model training with cross-validation
    trained_model = train_model(x_train, y_train)

    # Sample prediction on placeholder values 1..30 (not a real measurement),
    # wrapped in a DataFrame so the column names match the training data
    showpd = pd.DataFrame([np.arange(1, 31)], columns=features.columns)
    prediction = trained_model.predict(showpd)
    print("\nSample prediction using Logistic Regression:")
    print(prediction)

    # Model evaluation
    evaluate_model(trained_model, x_test, y_test)

if __name__ == "__main__":
    main()

First few rows of the dataset:
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   

   ...  texture_worst  perimete

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Cross-Validation Scores:
[0.925      0.925      0.9        0.97468354 0.92405063]
Mean Cross-Validation Score: 0.9297468354430378

Sample prediction using Logistic Regression:
[1]

Confusion Matrix:
[[109   5]
 [  4  53]]

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       114
           1       0.91      0.93      0.92        57

    accuracy                           0.95       171
   macro avg       0.94      0.94      0.94       171
weighted avg       0.95      0.95      0.95       171
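
The figures in the report for class 1 (malignant) follow directly from the confusion matrix above, where rows are true classes and columns are predicted classes. A short worked check, using only the four counts printed in this output:

```python
# Counts taken from the confusion matrix printed above:
# [[109   5]
#  [  4  53]]
tn, fp = 109, 5   # true class 0: predicted 0, predicted 1
fn, tp = 4, 53    # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision_1 = tp / (tp + fp)   # of all predicted malignant, how many were malignant
recall_1 = tp / (tp + fn)      # of all true malignant, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
# prints: accuracy=0.95 precision=0.91 recall=0.93 f1=0.92
```

The 4 false negatives are malignant cases predicted as benign, which is the error type that the recall of 0.93 for class 1 reflects.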



