# Exercises
In this exercise, you will use the well-known "Breast Cancer Wisconsin" dataset that ships with Scikit-learn. The goal is to build classification models that predict whether a tumor is malignant or benign from the features described below.

**Dataset Description:**
The Breast Cancer Wisconsin dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The features describe various characteristics of cell nuclei present in the image. The target variable is binary, where '0' represents malignant tumors, and '1' represents benign tumors. Here you can find more details: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
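
Because the target encoding is easy to get backwards, it can help to check it directly before modelling. The snippet below is a minimal sketch that prints the shape of the feature matrix, the class names in label order, and the class counts; the variable names `labels`, `n`, and `counts` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Feature matrix shape: (number of samples, number of features)
print("Shape:", data.data.shape)

# target_names is indexed by label, so position 0 is the meaning of label 0
print("Label order:", list(data.target_names))

# Count how many samples fall into each class
labels, n = np.unique(data.target, return_counts=True)
counts = {str(data.target_names[label]): int(c) for label, c in zip(labels, n)}
print("Class counts:", counts)
```

The classes are not perfectly balanced, which is one reason the later steps ask for precision-recall curves and F1-score in addition to accuracy.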

**Exercise:**

**Step 1: Data Loading**
1. Load the Breast Cancer Wisconsin dataset from Scikit-learn using `sklearn.datasets.load_breast_cancer()`.
2. Split the data into features and target variables.

**Step 2: Data Preprocessing**
3. Split the dataset into a training set and a testing set (e.g., 70% train and 30% test).
4. Perform any necessary data preprocessing and feature engineering, such as scaling the features.

**Step 2.5: Feature Selection**
5. Apply an initial feature selection step (e.g., a univariate statistical test such as Fisher's F-test).

**Step 3: Baseline Model**
6. Create a simple logistic regression model as a baseline.

**Step 4: AdaBoost Classifier**
7. Train an AdaBoost classifier on the training data.
8. Use cross-validation to find the optimal number of base estimators (n_estimators) for AdaBoost.
9. Tune other hyperparameters (e.g., learning rate) using cross-validation.
10. Visualize the feature importances in the model and try to apply additional feature selection based on it.
11. Evaluate the model's performance on the test set using accuracy, precision-recall curve, and F1-score.
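
One possible way to approach Step 4 is sketched below. It is not a reference solution: the parameter grid, `cv=5`, `scoring="f1"`, and `random_state=42` are assumptions chosen for illustration. The sketch also fits on unscaled features, on the assumption that the default tree-based weak learner is insensitive to feature scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, precision_recall_curve)
from sklearn.model_selection import GridSearchCV, train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# Illustrative grid (an assumption): widen or refine it as needed
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.1, 0.5, 1.0],
}
ada_search = GridSearchCV(
    AdaBoostClassifier(random_state=42), param_grid, cv=5, scoring="f1"
)
ada_search.fit(X_train, y_train)
ada = ada_search.best_estimator_
print("Best parameters:", ada_search.best_params_)

# Rank features by the model's importances and show the top five
ranked = sorted(
    zip(data.feature_names, ada.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked[:5]:
    print(f"{name:25s} {importance:.3f}")

# Evaluate on the held-out test set
y_pred = ada.predict(X_test)
y_score = ada.predict_proba(X_test)[:, 1]  # probability of class 1 (benign)
ada_accuracy = accuracy_score(y_test, y_pred)
ada_f1 = f1_score(y_test, y_pred)
precision, recall, thresholds = precision_recall_curve(y_test, y_score)
ada_ap = average_precision_score(y_test, y_score)
print(f"Accuracy: {ada_accuracy:.3f}  F1: {ada_f1:.3f}  Average precision: {ada_ap:.3f}")
```

The `precision` and `recall` arrays can be passed to a plotting library to draw the curve, and the ranked importances can be drawn as a bar chart in the same way.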

**Step 5: Gradient Boosting Machine (GBM)**
12. Train a Gradient Boosting Machine classifier on the training data.
13. Use cross-validation to find the optimal values for hyperparameters like the number of trees (n_estimators), maximum depth (max_depth), and learning rate.
14. Visualize the feature importances in the model and try to apply additional feature selection based on it.
15. Evaluate the GBM model's performance on the test set using accuracy, precision-recall curve, and F1-score.
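
Step 5 can be approached along the same lines. The sketch below assumes a deliberately small grid over `n_estimators`, `max_depth`, and `learning_rate` so that it runs quickly; the grid values, `cv=5`, `scoring="f1"`, and `random_state=42` are illustrative choices, not tuned recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, precision_recall_curve)
from sklearn.model_selection import GridSearchCV, train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# Small illustrative grid (an assumption) to keep the search fast
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
gbm_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42), param_grid, cv=5, scoring="f1"
)
gbm_search.fit(X_train, y_train)
gbm = gbm_search.best_estimator_
print("Best parameters:", gbm_search.best_params_)

# Rank features by impurity-based importance and show the top five
ranked = sorted(
    zip(data.feature_names, gbm.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked[:5]:
    print(f"{name:25s} {importance:.3f}")

# Evaluate on the held-out test set
y_pred = gbm.predict(X_test)
y_score = gbm.predict_proba(X_test)[:, 1]  # probability of class 1 (benign)
gbm_accuracy = accuracy_score(y_test, y_pred)
gbm_f1 = f1_score(y_test, y_pred)
precision, recall, thresholds = precision_recall_curve(y_test, y_score)
gbm_ap = average_precision_score(y_test, y_score)
print(f"Accuracy: {gbm_accuracy:.3f}  F1: {gbm_f1:.3f}  Average precision: {gbm_ap:.3f}")
```

For additional feature selection, one option is to refit on only the highest-ranked features and check whether the cross-validated score changes.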

**Step 6: Model Comparison**
16. Compare the performance of the AdaBoost, GBM, and Logistic Regression classifiers.
17. Summarize the results and provide insights on which algorithm performed better on this dataset and why.
18. Discuss the impact of hyperparameter tuning on model performance.
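
A side-by-side comparison for Step 6 might look like the sketch below. It assumes default hyperparameters for all three models and the same 70/30 split; tuned settings from the earlier steps could be substituted to see how much tuning changes the picture. The `models` and `results` dictionaries are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Logistic regression is wrapped in a pipeline so scaling is fitted on the
# training data only; the boosted models are fitted on the raw features
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "GBM": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
    print(f"{name:20s} accuracy = {results[name]['accuracy']:.3f}  "
          f"F1 = {results[name]['f1']:.3f}")
```

With a test set this small, a difference of one or two misclassified samples can change the ranking, so repeating the comparison with cross-validation or several random seeds gives a more reliable answer.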

Hint: the following case study from Machine Learning 1 walks through the entire model-building pipeline, including various feature engineering and feature selection techniques: https://github.com/michaelwozniak/ML-in-Finance-I-case-study-forecasting-tax-avoidance-rates

# Import libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_breast_cancer


# Step 1:
### Data loading

In [9]:
# Load the Breast Cancer Wisconsin dataset
data = load_breast_cancer()

# Split the data into features (X) and target (y)
X = data.data  # Features
y = data.target  # Target variable


# Step 2:

In [11]:
# Split the dataset into a training set and a testing set (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


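# Standardize the features; fit the scaler on the training set only so that
# no information from the test set leaks into preprocessing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)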


# Step 2.5

In [12]:


# Select the top k features (you can adjust k as needed)
k = 10  # Number of features to select
selector = SelectKBest(score_func=f_classif, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
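
To see which features the selector actually kept, the fitted `SelectKBest` object can be inspected. The sketch below repeats the split, scaling, and selection from the cells above so that it runs on its own, then lists the selected feature names with their F-scores; `mask`, `selected`, and `ranked` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
X_train = StandardScaler().fit_transform(X_train)

selector = SelectKBest(score_func=f_classif, k=10).fit(X_train, y_train)

# Boolean mask over the 30 original features: True where a feature was kept
mask = selector.get_support()
selected = data.feature_names[mask]

# Sort the kept features by their ANOVA F-score, highest first
ranked = sorted(
    zip(selected, selector.scores_[mask]),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:25s} F = {score:8.1f}")
```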


# Step 3

In [13]:
# Create a logistic regression model
model = LogisticRegression()

# Fit the model on the training data
model.fit(X_train_selected, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test_selected)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the Logistic Regression baseline model: {:.2f}%".format(accuracy * 100))


Accuracy of the Logistic Regression baseline model: 96.49%
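
Accuracy alone hides which class the errors fall in. The sketch below rebuilds the same baseline and adds a confusion matrix, a per-class report, and F1-scores. Note that `f1_score` treats label 1 (benign) as the positive class by default, so `pos_label=0` is passed to score the malignant class; the names `cm`, `f1_benign`, and `f1_malignant` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# Same preprocessing as the baseline: scale, then keep the top 10 features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

model = LogisticRegression().fit(X_train_selected, y_train)
y_pred = model.predict(X_test_selected)

# Rows are true labels, columns are predictions, in label order [0, 1]
cm = confusion_matrix(y_test, y_pred)
print(cm)

f1_benign = f1_score(y_test, y_pred)                  # positive class = 1
f1_malignant = f1_score(y_test, y_pred, pos_label=0)  # positive class = 0
print(f"F1 (benign): {f1_benign:.3f}  F1 (malignant): {f1_malignant:.3f}")

print(classification_report(y_test, y_pred, target_names=data.target_names))
```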
