# Breast Cancer Prediction Project

Welcome to the Breast Cancer Prediction project! 🎯

In this beginner-friendly notebook, we'll walk through a machine learning pipeline step by step: loading the data, exploring and cleaning it, and then training and evaluating a classifier that predicts whether a tumor is malignant or benign.

## 🔍 Objective

Our goal is to build a machine learning model that can accurately predict whether a tumor is **malignant (M)** or **benign (B)** based on a set of measurements.

We'll use the dataset `Cancer_Data.csv` for this purpose.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

## 📥 Step 1: Load the Dataset

In [None]:
data = pd.read_csv('Cancer_Data.csv')

## 🔍 Step 2: Explore the Dataset

In [None]:
# For displaying the first few rows of the dataset
print(data.head())

In [None]:
# To check the shape (number of rows and columns)
print("\nShape of the dataset:", data.shape)

In [None]:
# To get summary statistics of the dataset
print("\nSummary statistics:\n", data.describe())

In [None]:
# To check for missing values
print("\nMissing values in the dataset:\n", data.isnull().sum())

In [None]:
# To check the data types of each column
print("\nData types:\n", data.dtypes)

## 🧹 Step 3: Clean the Data

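We'll remove the columns that aren't useful as features: the `id` identifier and `Unnamed: 32` (if present). If the Step 2 check showed `Unnamed: 32` as entirely missing, dropping it also takes care of the missing values.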

In [None]:
# Drop the 'Unnamed: 32' column and 'id'
data = data.drop(columns=['Unnamed: 32', 'id'], errors='ignore')
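
To confirm the cleanup did what we expect, it helps to re-check for missing values afterwards. The sketch below is a minimal, self-contained illustration on a tiny made-up table rather than the real dataset. The helper `drop_unused_columns`, the `toy` frame, and the `radius_mean` column name are hypothetical stand-ins, and it assumes `Unnamed: 32` is an entirely empty column.

```python
import numpy as np
import pandas as pd


def drop_unused_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the identifier column and any column that has no values at all."""
    df = df.drop(columns=["id"], errors="ignore")
    # how="all" removes only columns where every entry is missing
    return df.dropna(axis=1, how="all")


# Tiny made-up frame that mimics the layout (values are illustrative only)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.9, 12.4, 11.2],   # hypothetical feature column
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

cleaned = drop_unused_columns(toy)
remaining_missing = int(cleaned.isnull().sum().sum())

print(cleaned.columns.tolist())
print("Missing values left:", remaining_missing)
```

On the real data, the same check is `data.isnull().sum().sum()` after the drop. If it is not zero, the remaining gaps need to be handled before training.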

## 📊 Step 4: Visualize the Data

In [None]:
# to plot the distribution of the 'diagnosis' column
sns.countplot(x='diagnosis', data=data)
plt.title('Distribution of Diagnosis')
plt.xlabel('Diagnosis (B: Benign, M: Malignant)')  # labels are still 'M'/'B' at this point
plt.ylabel('Count')
plt.show()

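# To plot a heatmap of the correlation matrix
# (numeric_only=True skips the text 'diagnosis' column, which would otherwise raise an error in pandas 2.0 and later)
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)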
plt.title('Correlation Heatmap')
plt.show()

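# To visualize relationships between features
# (restricted to the first five numeric columns; a pair plot of every feature is slow and hard to read)
subset = data.select_dtypes(include='number').columns[:5].tolist()
sns.pairplot(data, vars=subset, hue='diagnosis', diag_kind='kde')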
plt.show()

## ⚙️ Step 5: Preprocess the Data

In [None]:
# Convert diagnosis column to 0 (benign) and 1 (malignant)
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})

# Split features and labels
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
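
When the two classes are not equally common, a purely random split can leave the test set with a different malignant/benign ratio than the training set. One common option is to pass `stratify` to `train_test_split`. The sketch below shows the idea on synthetic stand-in data from `make_classification` (not the real dataset); the `*_demo` and `Xd_*`/`yd_*` names are made up for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for this sketch): 500 samples, imbalanced classes
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.63, 0.37], random_state=42
)

# stratify=y_demo asks the splitter to keep the class proportions in both parts
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_pos = yd_train.mean()  # share of class 1 in the training labels
test_pos = yd_test.mean()    # share of class 1 in the test labels

print(f"Positive share in train: {train_pos:.3f}")
print(f"Positive share in test:  {test_pos:.3f}")
```

In this notebook, the equivalent change would be adding `stratify=y` to the existing `train_test_split` call.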


## 🤖 Step 6: Train a Machine Learning Model

In [None]:
# Use Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

## 🧪 Step 7: Evaluate the Model

In [None]:

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
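
Accuracy alone can hide the errors that matter most here: a malignant tumor predicted as benign (a false negative). As a small illustration of how to read the confusion matrix, the sketch below uses hand-made labels (not real model output) and unpacks the four counts. The `*_demo` names are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only (1 = malignant, 0 = benign)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)       # share of malignant cases the model caught
precision = tp / (tp + fp)    # share of malignant predictions that were correct

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

The same unpacking works on `confusion_matrix(y_test, y_pred)` from the cell above, and the recall for class `1` is also listed in the classification report.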

## ✅ Conclusion

Awesome work! 🎉

You've successfully built a breast cancer prediction model using Logistic Regression. You explored the data, cleaned it, visualized it, trained a model, and evaluated its performance.

### 🚀 Next Steps
- Try different models like RandomForest or SVM
- Perform feature selection
- Tune hyperparameters for better accuracy
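
As a starting point for the first and third ideas, here is a minimal sketch that compares a few models with 5-fold cross-validation and then runs a small grid search. It uses synthetic stand-in data from `make_classification` so it runs on its own; the candidate list, the grid values for `C`, and the `*_demo` names are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; swap in the unscaled X and y from Step 5
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": SVC(),
}

# Keeping the scaler inside the pipeline means it is re-fit on each training fold
cv_scores = {}
for name, estimator in candidates.items():
    pipeline = make_pipeline(StandardScaler(), estimator)
    cv_scores[name] = cross_val_score(pipeline, X_demo, y_demo, cv=5).mean()
    print(f"{name}: mean CV accuracy = {cv_scores[name]:.3f}")

# Small grid over the regularization strength C of logistic regression
grid = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X_demo, y_demo)
print("Best C:", grid.best_params_["logisticregression__C"])
```

`make_pipeline` names each step after its lowercased class name, which is why the grid key is `logisticregression__C`.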