# Logistic Regression Exercise

In this notebook, we will use the Breast Cancer dataset from scikit-learn to build a logistic regression model that classifies whether a tumor is benign or malignant.

## Objectives
- Understand the steps to create a logistic regression model.
- Interpret the model's parameters and evaluate its performance.
- Apply feature engineering and preprocessing methods.

## Step 1: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score


## Step 2: Load the Dataset
We will load the Breast Cancer dataset directly from scikit-learn.

In [None]:
# Load the Breast Cancer dataset
cancer_data = load_breast_cancer()

# Convert to a DataFrame for easier manipulation
data = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
data['target'] = cancer_data.target

# Show the first few rows of the dataset
data.head()


## Step 3: Data Preprocessing
### 3.1 Understand the Data
The dataset contains 569 tumor samples described by 30 numeric features: the mean, standard error, and worst (largest) value of ten characteristics (e.g., radius, texture, perimeter).

In [None]:
# Check the target distribution
data['target'].value_counts()
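
The raw counts are easier to read once the numeric labels are mapped back to class names. The sketch below assumes scikit-learn's encoding for this dataset, where `target_names` is ordered by label (0 is malignant, 1 is benign); the names `label_to_name` and `class_counts` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

# target_names is ordered by label: index 0 -> 'malignant', index 1 -> 'benign'
label_to_name = {i: str(name) for i, name in enumerate(cancer_data.target_names)}

# Count how many samples carry each label, then attach the class names
counts = np.bincount(cancer_data.target)
class_counts = {label_to_name[i]: int(n) for i, n in enumerate(counts)}

print(label_to_name)
print(class_counts)
```

Because benign is encoded as 1, metrics that refer to the "positive" class later in the notebook describe benign tumors unless the labels are flipped.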


### 3.2 Features and Target Variable

Separate the feature columns (X) from the target column (y).

In [None]:
# Features (X) and target variable (y)
X = data.drop('target', axis=1)
y = data['target']


## Step 4: Split the Data into Training and Test Sets

In [None]:
# Split the data: 70% for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
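
The split above is purely random, so the class balance in the two subsets can drift slightly from the full dataset. One optional variation, shown here as a sketch rather than as the notebook's method, is to pass `stratify=y` so both subsets keep roughly the same benign/malignant ratio. The variable names (`X_tr`, `X_te`, and so on) are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same 70/30 split as above, but stratified on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Number of rows in each subset
print("train rows:", X_tr.shape[0], "test rows:", X_te.shape[0])

# Fraction of benign samples (label 1) overall and in each subset
print("benign share overall:", round(float(y.mean()), 3))
print("benign share train:  ", round(float(y_tr.mean()), 3))
print("benign share test:   ", round(float(y_te.mean()), 3))
```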


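## Step 5: Feature Scaling
Standardize the features using StandardScaler so that all variables are on a similar scale. The scaler is fit on the training set only and then applied to the test set, so no information from the test data leaks into preprocessing.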

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
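
A quick sanity check of what standardization does, as a minimal sketch: after `fit_transform`, every training column should have mean close to 0 and standard deviation close to 1, while the test columns are only approximately standardized because they reuse the training statistics. The `_raw` and `_scaled` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)  # learns mean/std from training data
X_test_scaled = scaler.transform(X_test_raw)        # reuses the training mean/std

# Per-feature statistics after scaling
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print("largest |mean| on train:", np.abs(train_means).max())
print("train std range:", train_stds.min(), train_stds.max())
print("largest |mean| on test: ", np.abs(test_means).max())
```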


## Step 6: Train the Logistic Regression Model
We'll train a logistic regression model using scikit-learn.

In [None]:
# Initialize the logistic regression model
log_reg = LogisticRegression(max_iter=10000)  # A high max_iter avoids convergence warnings

# Train the model with the training data
log_reg.fit(X_train, y_train)
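
Once the model is fitted, its learned parameters can be inspected through `coef_` and `intercept_`. The sketch below rebuilds the same setup in a self-contained way and lists the five features with the largest absolute weights; the helper names (`weights`, `top_features`) are illustrative. Because the features are standardized, each weight can be read as the change in log-odds of class 1 (benign) per one standard deviation increase in that feature.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42
)
X_tr_s = StandardScaler().fit_transform(X_tr)

model = LogisticRegression(max_iter=10000).fit(X_tr_s, y_tr)

# coef_ has shape (1, n_features) for a binary problem; flatten it to 1-D
weights = model.coef_.ravel()

# Rank features by absolute weight, largest first
order = np.argsort(np.abs(weights))[::-1]
top_features = [(str(cancer.feature_names[i]), float(weights[i])) for i in order[:5]]

print("intercept:", float(model.intercept_[0]))
for name, w in top_features:
    print(f"{name:25s} {w:+.3f}")
```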


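## Step 7: Model Parameters
We can also set the model's parameters explicitly. The cell below re-creates and refits log_reg, replacing the model from Step 6:

* **C**: Inverse of regularization strength; smaller values apply stronger regularization (the default is 1.0).
* **solver**: The optimization algorithm used to find the best weights.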



In [None]:
# Initialize the logistic regression model with specific parameters
log_reg = LogisticRegression(C=1.0, solver='liblinear', max_iter=10000)
log_reg.fit(X_train, y_train)
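
To make the effect of C concrete, the sketch below fits the same model with three values of C and compares the size of the learned weight vector. It assumes the default L2 penalty; the chosen values (0.01, 1.0, 100.0) and the name `coef_norms` are illustrative. Weaker regularization (larger C) should allow larger weights.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, solver='liblinear', max_iter=10000)
    clf.fit(X_tr_s, y_tr)
    # Euclidean length of the weight vector: a rough measure of model complexity
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<6} ||w||={coef_norms[C]:.3f}  test accuracy={clf.score(X_te_s, y_te):.3f}")
```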


## Step 8: Make Predictions


In [None]:
# Make predictions on the test set
y_pred = log_reg.predict(X_test)
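
Behind `predict`, a binary logistic regression computes a linear score, passes it through the sigmoid function to get a probability for class 1, and labels the sample 1 when that probability exceeds 0.5. The following sketch checks this relationship on a freshly fitted model; the names `scores`, `sigmoid`, and `manual_pred` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

model = LogisticRegression(max_iter=10000).fit(X_tr_s, y_tr)

# Linear score w.x + b for each test sample
scores = model.decision_function(X_te_s)

# Sigmoid of the score should match the predicted probability of class 1 (benign)
sigmoid = 1.0 / (1.0 + np.exp(-scores))
proba_benign = model.predict_proba(X_te_s)[:, 1]

# Thresholding the probability at 0.5 should reproduce predict()
manual_pred = (proba_benign > 0.5).astype(int)

print("sigmoid matches predict_proba:", np.allclose(sigmoid, proba_benign))
print("threshold matches predict:    ", np.array_equal(manual_pred, model.predict(X_te_s)))
```

Using a threshold other than 0.5 on `proba_benign` is one way to trade false positives against false negatives.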


## Step 9: Evaluate the Model
### 9.1 Confusion Matrix

In [None]:
# Calculate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)

# Plotting the confusion matrix
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()


### 9.2 Classification Report and Accuracy


In [None]:
# Classification report
print(classification_report(y_test, y_pred))

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")


## Step 10: Interpret the Results

- **Confusion Matrix**: Shows the number of true positives, true negatives, false positives, and false negatives. In scikit-learn's layout, rows are actual classes and columns are predicted classes.
- **Classification Report**: Provides metrics like precision, recall, and F1-score for each class.
- **Accuracy Score**: The proportion of correct predictions.
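
The three summaries above are linked: precision, recall, and accuracy can all be derived from the four cells of the confusion matrix. The sketch below does this by hand and compares the results with scikit-learn's own metric functions. It assumes the dataset's encoding, in which the positive class (label 1) is benign; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
model = LogisticRegression(max_iter=10000).fit(scaler.fit_transform(X_tr), y_tr)
y_pred = model.predict(scaler.transform(X_te))

# For binary labels {0, 1}, ravel() yields the cells in the order tn, fp, fn, tp
cm = confusion_matrix(y_te, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)      # of the samples predicted benign, how many are benign
recall = tp / (tp + fn)         # of the truly benign samples, how many were found
accuracy = (tp + tn) / cm.sum() # share of all predictions that are correct

print(f"precision: {precision:.3f} (sklearn: {precision_score(y_te, y_pred):.3f})")
print(f"recall:    {recall:.3f} (sklearn: {recall_score(y_te, y_pred):.3f})")
print(f"accuracy:  {accuracy:.3f} (sklearn: {accuracy_score(y_te, y_pred):.3f})")
```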

### Key Points

- Logistic Regression is a strong baseline model for binary classification problems.
- **Regularization** is crucial for controlling model complexity and avoiding overfitting.
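- **Feature scaling** is a key preprocessing step for logistic regression: the regularization penalty treats all coefficients alike, and solvers tend to converge faster on standardized inputs.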

## Step 11: Improving the Model (Optional)

- Adjust the **C** parameter to see how regularization affects performance.
- Use **cross-validation** to obtain a more robust estimate of model performance.
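
Both suggestions can be combined in one experiment. The following is a minimal sketch, not a tuned recipe: it wraps the scaler and the classifier in a pipeline so that scaling is refit inside each fold, then runs 5-fold cross-validation for a few values of C. The grid of C values and the name `cv_results` are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

cv_results = {}
for C in (0.01, 0.1, 1.0, 10.0, 100.0):
    # The pipeline refits the scaler on each training fold, avoiding leakage
    pipe = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=10000))
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    cv_results[C] = (float(scores.mean()), float(scores.std()))
    print(f"C={C:<6} mean accuracy={scores.mean():.3f} (+/- {scores.std():.3f})")
```

The mean across folds is a steadier estimate than a single train/test split, and the spread gives a sense of how much the score depends on the particular split.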

