# Data Science Exercises: Breast Cancer Classification
In this notebook, you will work through six exercises focusing on data loading, model creation, and evaluation. You will learn how to leverage different machine learning algorithms to classify breast cancer samples based on various medical measurements.

In [None]:
!pip install xgboost

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_curve,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

## Exercise 1: Read the Data
In this first exercise, you will import the breast cancer dataset from `sklearn.datasets`. This dataset contains features of different breast cancer tumors, labelled as benign or malignant. In the target array, 0 denotes malignant and 1 denotes benign.

### Instructions:
- Import the necessary libraries.
- Load the dataset using the `load_breast_cancer()` function.
- Store the features in `X` and labels in `Y`.

In [None]:
data = load_breast_cancer()
X = data['data']
Y = data['target']
df = pd.DataFrame(X, columns=data['feature_names'])
df.head()

## Exercise 2: Make a Small EDA (Exploratory Data Analysis) of the Breast Cancer Dataset

In this exercise, you will perform a basic exploratory data analysis (EDA) on the breast cancer dataset. EDA is crucial for understanding the underlying patterns, distributions, and potential relationships in the data, which will help in selecting appropriate models and preprocessing steps.

### Steps:

1. **Check the Data Types and Missing Values:**
   - Start by inspecting the data types of each column. This will help you understand the type of each feature (e.g., numeric, categorical) and whether any conversion is needed.
   - Check for any missing values in the dataset. Missing values can be handled by either filling them with appropriate values (e.g., mean or median) or by removing the corresponding rows or columns.

2. **Summary Statistics:**
   - Use the `describe()` method to get a summary of the numeric columns. This will provide key statistics such as the mean, standard deviation, min, and max values, which are helpful for understanding the distribution of each feature.
   - Look for any anomalies, outliers, or features that may need special attention (e.g., highly skewed distributions).

3. **Visualizations:**
   - **Feature Distributions:** Create histograms or box plots to visualize the distribution of the features, such as tumor size, texture, or smoothness. This will help assess the spread of the data and detect outliers.
   - **Correlation Matrix:** Visualize the correlation matrix to examine relationships between the features, focusing on features that are highly correlated with the target variable (malignant or benign).
   - **Pairwise Scatter Plots:** Visualize the relationships between pairs of features using scatter plots, particularly to explore any separation between malignant and benign cases.

4. **Class Distribution:**
   - Investigate the distribution of the target variable (malignant vs benign). Use the `value_counts()` method to determine how balanced the classes are.
   - Visualize the class distribution using a bar plot or pie chart to understand if there is an imbalance that could affect model training.

5. **Target Variable Analysis:**
   - Investigate how the target variable (malignant or benign) relates to the features. Is there a clear distinction between the two classes based on the features, or do the features overlap?


In [None]:
# your code

## Exercise 3: Baseline Model
Now that you are familiar with the data, it's time to create a baseline machine learning model. You will use a simple model, such as a decision tree, as a starting point to predict the labels.

### Instructions:
- **Split the data:**
  - Split the dataset into training and testing sets using `train_test_split`.
  - `X_train, X_test` are the features for training and testing.
  - `Y_train, Y_test` are the labels for training and testing.
  - `test_size=0.3` means 30% of the data will be used for testing.

- **Train the model:**
  - Create a `DecisionTreeClassifier` model.
  - Fit the model to the training data (`X_train` and `Y_train`).

- **Predictions:**
  - Use the trained model to make predictions on the test set (`X_test`).
  - Store the predictions in `Y_pred`.

- **Evaluation:**
  - Calculate and print the accuracy, precision, and recall of the model using `accuracy_score`, `precision_score`, and `recall_score`.
  - Print the confusion matrix using `confusion_matrix` to assess the model's performance.
  - Compute the ROC AUC using `roc_auc_score`.
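One possible shape for this workflow is sketched below. It is not the only valid solution: `random_state=42` and `stratify=Y` are assumptions added for reproducibility, and `baseline_model`, `Y_proba_baseline`, and `cm` are illustrative names. Note that scikit-learn treats label 1 (benign) as the positive class by default when computing precision and recall.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, Y = load_breast_cancer(return_X_y=True)

# Split the data: 30% is held out for testing (random_state and stratify are assumptions)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)

# Train the baseline model
baseline_model = DecisionTreeClassifier(random_state=42)
baseline_model.fit(X_train, Y_train)

# Hard predictions for accuracy/precision/recall, probabilities for ROC AUC
Y_pred_baseline = baseline_model.predict(X_test)
Y_proba_baseline = baseline_model.predict_proba(X_test)[:, 1]

print("Accuracy:", accuracy_score(Y_test, Y_pred_baseline))
print("Precision:", precision_score(Y_test, Y_pred_baseline))
print("Recall:", recall_score(Y_test, Y_pred_baseline))
print("ROC AUC:", roc_auc_score(Y_test, Y_proba_baseline))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(Y_test, Y_pred_baseline)
print(cm)
```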

 

In [None]:
# Your code here to create and evaluate a baseline model




In [None]:
# Evaluate the baseline model; Y_pred_baseline holds the predictions of the model trained above
print("Accuracy for Baseline:", accuracy_score(Y_test, Y_pred_baseline))
print("Precision for Baseline:", precision_score(Y_test, Y_pred_baseline))
print("Recall for Baseline:", recall_score(Y_test, Y_pred_baseline))
print("ROC AUC for Baseline:", roc_auc_score(Y_test, Y_pred_baseline))

# Plot Precision-Recall Curve
# Note: hard class predictions yield only a coarse curve; predicted probabilities give a smoother one
precision_baseline, recall_baseline, _ = precision_recall_curve(Y_test, Y_pred_baseline)
plt.figure(figsize=(8, 6))
plt.plot(recall_baseline, precision_baseline, label='Baseline', color='blue')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()

## Exercise 4: Ensemble Models
In this exercise, you will implement ensemble methods to improve your model's performance. Consider using a voting classifier or stacking multiple models together.

### Instructions:
- Utilize multiple models like `RandomForestClassifier` and `LogisticRegression`.
- Use `VotingClassifier` or `StackingClassifier` and train on the same training data.
- Evaluate the ensemble model using accuracy and other metrics.
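As a rough sketch of the stacking option, the example below combines a scaled logistic regression and a random forest under a logistic regression meta-learner. The scaling pipeline, `cv=5`, `random_state=42`, and the names `base_estimators` and `stacking_model` are assumptions made for illustration, not requirements of the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)

# Base models; logistic regression is scaled so that it converges reliably
base_estimators = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
]

# The final estimator is trained on cross-validated predictions of the base models
stacking_model = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(),
    cv=5,
)
stacking_model.fit(X_train, Y_train)

Y_pred_stacking = stacking_model.predict(X_test)
stacking_accuracy = accuracy_score(Y_test, Y_pred_stacking)
print("Stacking Accuracy:", stacking_accuracy)
```

A `VotingClassifier` accepts the same `estimators` list, so the two approaches can be swapped with little extra code.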

In [None]:
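# Your code here to create and evaluate ensemble models

# Base models; max_iter is raised so that logistic regression converges on unscaled features
logistic_regression_model = LogisticRegression(max_iter=10000)
random_forest_model = RandomForestClassifier(random_state=42)

# Soft voting averages the class probabilities predicted by the base models
ensemble_model = VotingClassifier(
    estimators=[('lr', logistic_regression_model), ('rf', random_forest_model)],
    voting='soft',
)

ensemble_model.fit(X_train, Y_train)
# Predictions
Y_pred_ensemble = ensemble_model.predict(X_test)
# Evaluation
print("Ensemble Accuracy:", accuracy_score(Y_test, Y_pred_ensemble))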

In [None]:
# Evaluate the ensemble model by reusing the earlier evaluation code

## Exercise 5: Use XGBoost
For this exercise, you will implement an XGBoost model. XGBoost often performs well on tabular data because it combines gradient boosting with built-in regularization that helps limit overfitting.

### Instructions:
- Import the `XGBClassifier` from `xgboost`.
- Train the model on the training data and evaluate its performance.




In [None]:
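# Your code here to create and evaluate an XGBoost model

# Create and train model
xgb_model = XGBClassifier()
xgb_model.fit(X_train, Y_train)

# Predictions and evaluation
Y_pred_xgb = xgb_model.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(Y_test, Y_pred_xgb))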


## Exercise 6: Final Evaluation
In the final exercise, you should compare the performance of the different models you've trained, including the baseline, ensemble, and XGBoost models. Use various evaluation metrics such as precision, recall, and ROC AUC.

### Instructions:
- For each model, compute precision, recall, and other metrics.
- Use the `classification_report` and ROC AUC to summarize model performance.
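A minimal sketch of such a summary is shown below. The helper `summarize_model` is hypothetical, and the two models are stand-ins chosen to keep the example self-contained; in the notebook you would pass your own baseline, ensemble, and XGBoost models instead. ROC AUC is computed from predicted probabilities here, which is usually more informative than using hard class predictions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, Y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)


def summarize_model(name, model, X_test, Y_test):
    """Print a classification report and return the ROC AUC (hypothetical helper)."""
    Y_pred = model.predict(X_test)
    Y_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1 (benign)
    print(f"=== {name} ===")
    print(classification_report(Y_test, Y_pred, target_names=["malignant", "benign"]))
    auc = roc_auc_score(Y_test, Y_proba)
    print(f"ROC AUC: {auc:.3f}")
    return auc


# Stand-in models for illustration only
models = {
    "Baseline (decision tree)": DecisionTreeClassifier(random_state=42),
    "Random forest (stand-in)": RandomForestClassifier(random_state=42),
}

auc_by_model = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    auc_by_model[name] = summarize_model(name, model, X_test, Y_test)
```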

 

In [None]:
# Accuracy
print("Accuracy for Baseline:", accuracy_score(Y_test, Y_pred_baseline))
print("Accuracy for Ensemble:", accuracy_score(Y_test, Y_pred_ensemble))
print("Accuracy for XGBoost:", accuracy_score(Y_test, Y_pred_xgb))

# Precision
print("Precision for Baseline:", precision_score(Y_test, Y_pred_baseline))
print("Precision for Ensemble:", precision_score(Y_test, Y_pred_ensemble))
print("Precision for XGBoost:", precision_score(Y_test, Y_pred_xgb))

# Recall
print("Recall for Baseline:", recall_score(Y_test, Y_pred_baseline))
print("Recall for Ensemble:", recall_score(Y_test, Y_pred_ensemble))
print("Recall for XGBoost:", recall_score(Y_test, Y_pred_xgb))

# ROC AUC (computed from hard class predictions; predicted probabilities give a more informative score)
print("ROC AUC for Baseline:", roc_auc_score(Y_test, Y_pred_baseline))
print("ROC AUC for Ensemble:", roc_auc_score(Y_test, Y_pred_ensemble))
print("ROC AUC for XGBoost:", roc_auc_score(Y_test, Y_pred_xgb))

# Precision-Recall Curve
# Note: hard class predictions yield only a coarse curve; predicted probabilities give a smoother one
precision_baseline, recall_baseline, _ = precision_recall_curve(Y_test, Y_pred_baseline)
precision_ensemble, recall_ensemble, _ = precision_recall_curve(Y_test, Y_pred_ensemble)
precision_xgb, recall_xgb, _ = precision_recall_curve(Y_test, Y_pred_xgb)

# Plot the Precision-Recall Curve for all three models
plt.figure(figsize=(8, 6))
plt.plot(recall_baseline, precision_baseline, label='Baseline', color='blue')
plt.plot(recall_ensemble, precision_ensemble, label='Ensemble', color='green')
plt.plot(recall_xgb, precision_xgb, label='XGBoost', color='red')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()