In [27]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report

In [28]:
# Load datasets
from sklearn.datasets import load_breast_cancer, load_iris
iris = load_iris()
cancer = load_breast_cancer()

## Start with EDA before anything else

Once you've uploaded a real dataset, your first step should be **EDA (Exploratory Data Analysis)**.  
This helps you understand the structure, patterns, and potential issues in your data before diving into modeling.

---

### Recommended workflow: EDA → Feature selection → Train-test split → Scaling → Modeling & evaluation

1. **EDA (Exploratory Data Analysis)**  
   - Work with the **original, unmodified data**  
   - Explore **distributions**, **correlations**, **missing values**, **outliers**, and **class balance**  
   - Gain insights that guide the rest of the workflow

2. **Feature selection / engineering**  
   - Still use the **full dataset** at this stage  
   - Create meaningful features, remove highly correlated or irrelevant ones  
   - Optionally select top-k features for modeling (if the selection looks at the target, fit it on the training split only to avoid leakage)

3. **Train-test split**  
   - Divide data into training and testing sets   

4. **Scaling (or other transformations)**  
   - Fit the scaler **only on the training data**  
   - **Apply the same** transformation **to the test set**  

5. **Modeling & evaluation**  
   - Train your models using the processed training data  
   - Use appropriate metrics to assess performance
   - At the very end, evaluate them on the transformed test data  
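
The checks listed under step 1 can be sketched in a few lines of pandas. This is a minimal sketch, assuming the breast cancer data is loaded as a DataFrame via `as_frame=True`; the 0.95 correlation cut-off is an arbitrary illustrative value, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the data as a DataFrame: 30 feature columns plus a 'target' column
df = load_breast_cancer(as_frame=True).frame
features = df.drop(columns="target")

# Missing values: total count over all cells
n_missing = int(df.isna().sum().sum())

# Class balance: share of each class in the target
class_share = df["target"].value_counts(normalize=True)

# Skewness: large positive values point to long right tails
skewness = features.skew().sort_values(ascending=False)

# Highly correlated pairs: keep the upper triangle so each pair appears once
corr = features.corr().abs()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
high_pairs = corr.where(upper).stack()
high_pairs = high_pairs[high_pairs > 0.95]  # 0.95 is an illustrative cut-off

print(f"Missing values: {n_missing}")
print(class_share)
print(skewness.head())
print(high_pairs.sort_values(ascending=False).head())
```

Everything here runs on the original, unscaled data, which matches the first bullet of step 1.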


## Why does EDA come first?

- You want to explore the **real distribution** of your data.  
- Scaling or splitting early might obscure trends (e.g., **class imbalance**, **skewness**).  
- Think of EDA as your **diagnostic phase** before doing surgery (modeling).

For our toy datasets, we will skip almost all EDA (apart from a quick PCA/t-SNE look further down), as the purpose of this demo is to showcase supervised learning models, not exploratory analysis.

We will also skip feature engineering and selection, since the dataset already contains a clean and well-structured set of features suitable for modeling.

Instead, we will immediately **begin with the train-test split** and proceed directly to modeling and evaluation.

## Step 1. EDA → *Almost omitted here*

## Using PCA/t-SNE in EDA vs PCA for preprocessing before modeling


#### 1. PCA/t-SNE just for visualization in EDA

#### Workflow: EDA (incl. PCA/t-SNE) → Feature selection → Train-test split → Scaling → Modeling & evaluation


- Used to **explore structure**, clusters, or class separation
- Run on **full dataset** (optionally scaled)
- **Does not affect** model training



#### 2. PCA for preprocessing before modeling


#### Workflow:  EDA (incl. PCA/t-SNE) → Feature selection → Train-test split → Scaling → PCA (fit on train) → Modeling

- Used to **reduce dimensionality**
- Fit PCA on **training set only**
- Transform both train and test
- **Part of model input**
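
One way to keep the "fit on train, transform both" rule for PCA is to wrap the steps in a scikit-learn `Pipeline`. The following is a minimal sketch under assumptions: the choice of 5 components and of `LogisticRegression` as the final model are illustrative, not part of the workflow above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaler and PCA are fitted inside the pipeline, on the training data only
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),  # 5 components is an illustrative choice
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_train, y_train)

# At prediction time the fitted scaler and PCA are only applied, never refitted
test_accuracy = pipe.score(X_test, y_test)
explained = pipe.named_steps["pca"].explained_variance_ratio_.sum()

print(f"Variance kept by 5 components: {explained:.2f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```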

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
X_scaled = scaler.fit_transform(cancer.data)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df_pca['Target'] = pd.Categorical.from_codes(cancer.target, categories=cancer.target_names)

sns.scatterplot(data=df_pca, x='PC1', y='PC2', hue='Target', palette='coolwarm', alpha=0.7).set_title("PCA - 2D Projection")
sns.despine()

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

scaler = StandardScaler()
X_scaled = scaler.fit_transform(cancer.data)

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

df_tsne = pd.DataFrame(X_tsne, columns=['Dim1', 'Dim2'])
df_tsne['Target'] = pd.Categorical.from_codes(cancer.target, categories=cancer.target_names)

sns.scatterplot(data=df_tsne, x='Dim1', y='Dim2', hue='Target', palette='coolwarm', alpha=0.7).set_title("t-SNE - 2D Projection")
sns.despine()

## Step 2. Feature Selection / Engineering → *Omitted*

## Step 3. Train-test split

In [31]:
from sklearn.model_selection import train_test_split

# Binary classification: Breast Cancer
X_cancer, y_cancer = cancer.data, cancer.target
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_cancer, y_cancer, test_size=0.3, random_state=42)

# Multiclass classification: Iris
X_iris, y_iris = iris.data, iris.target
Xi_train, Xi_test, yi_train, yi_test = train_test_split(X_iris, y_iris, test_size=0.3, random_state=42)

### Why do we split before scaling?

When working with machine learning models, it's important to **split your dataset into training and testing sets _before_ applying any scaling** (such as standardization or normalization).

If you scale the entire dataset **before splitting**, you're using information from the entire dataset — including the test set — to compute the mean and standard deviation used in scaling.

This causes **data leakage**.

> ⚠️ **Data leakage** happens when your model gets access to information it shouldn't have during training, leading to overly optimistic results and poor generalization.


####  Correct approach: Split → Then Scale

The correct workflow is:

1. **Split** the dataset into training and test sets.
2. **Fit the scaler on the training set** only (i.e., compute mean and std from training data).
3. **Transform** both the training and test sets using this fitted scaler.
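
The three steps can be sketched as follows. This is a minimal sketch on the breast cancer data; it also prints a quick check that the scaler's statistics really come from the training rows alone.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2. Fit the scaler on the training set only
scaler = StandardScaler().fit(X_train)

# 3. Transform both sets with the same fitted scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The stored statistics come from the training rows alone
stats_from_train = np.allclose(scaler.mean_, X_train.mean(axis=0))

# Training features are centred up to floating-point error; test features only
# approximately, because the test rows never influenced the mean and std
print("Stats match training data:", stats_from_train)
print("Largest |mean| in train:", np.abs(X_train_scaled.mean(axis=0)).max())
print("Largest |mean| in test: ", np.abs(X_test_scaled.mean(axis=0)).max())
```

A test-set mean that is close to, but not exactly, zero is expected and is not a sign of a mistake.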


## Step 4. Scaling

**1. `StandardScaler`**

- **What it does**: Standardizes features by removing the mean and scaling to unit variance (mean = 0, std = 1).
- **Best used when**:
  - Your features are approximately **normally distributed**
  - You want to give equal weight to all features, especially when they are on different scales
- **Commonly used with**:
  - ✅ Linear Regression
  - ✅ Logistic Regression
  - ✅ Support Vector Machines (SVM)
  - ✅ k-Nearest Neighbors (k-NN)
  - ✅ Clustering (e.g., k-Means)
  - ❌ Decision Trees / Random Forests (not needed — tree-based models are scale-invariant)
  - ❌ Naive Bayes (scaling most likely will not help, as the algorithm relies on distributional assumptions)

---

**2. `MinMaxScaler`**

- **What it does**: Scales features to a fixed range, usually [0, 1].
- **Best used when**:
  - You want to **preserve the shape** of the original distribution
  - Your model is sensitive to the **magnitude of features**
  - Your features are **bounded** and **do not contain many outliers**
- **Commonly used with**:
  - ✅ Logistic Regression
  - ✅ SVM
  - ✅ k-NN
  - ✅ Clustering
  - ✅ Neural Networks (especially important here)
  - ❌ Decision Trees / Random Forests
  - ❌ Naive Bayes

---

**3. `RobustScaler`**

- **What it does**: Scales using the **median** and **interquartile range (IQR)**, making it more robust to outliers.
- **Best used when**:
  - Your data contains **many outliers**
  - You still need scaling for models that assume similar feature scales
- **Commonly used with**:
  - ✅ Linear Regression (when data has outliers)
  - ✅ Logistic Regression
  - ✅ SVM
  - ✅ k-NN
  - ✅ Clustering
  - ❌ Decision Trees / Random Forests
  - ❌ Naive Bayes

---

**4. `Normalizer`**

- **What it does**: Scales **individual samples (rows)** to have unit norm (i.e., length = 1).
- **Best used when**:
  - You're working with **sparse data** (e.g., text data like TF-IDF vectors)
  - You care about the **direction of vectors**, not their magnitude
- **Commonly used with**:
  - ✅ k-NN
  - ✅ Clustering
  - ✅ Text classification
  - ❌ Linear/Logistic Regression, SVM 
  - ❌ Decision Trees / Random Forests
  - ❌ Naive Bayes


#### Code for common scalers in scikit-learn

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer

# StandardScaler: Centers data (mean = 0) and scales to unit variance (std = 1)
standard_scaler = StandardScaler()
X_train_standard = standard_scaler.fit_transform(X_train)
X_test_standard = standard_scaler.transform(X_test)

# MinMaxScaler: Scales features to a defined range, typically [0, 1]
minmax_scaler = MinMaxScaler()
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)

# RobustScaler: Uses median and IQR for scaling
robust_scaler = RobustScaler()
X_train_robust = robust_scaler.fit_transform(X_train)
X_test_robust = robust_scaler.transform(X_test)

# Normalizer: Scales each sample (row) to unit norm (L2 by default)
# Useful when you care about the direction of feature vectors, not their magnitude
normalizer = Normalizer()
X_train_normalized = normalizer.fit_transform(X_train)
X_test_normalized = normalizer.transform(X_test)
```

Each of the transformed datasets (`X_train_standard`, `X_train_minmax`, etc.) can now be used independently for model training and evaluation.
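
To see how the scalers differ in the presence of an outlier, here is a small sketch on made-up numbers (the array values are purely illustrative and do not come from the datasets used in this notebook).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

# One feature with a single large outlier (made-up values for illustration)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x).ravel()
minmax = MinMaxScaler().fit_transform(x).ravel()
robust = RobustScaler().fit_transform(x).ravel()

# The outlier squeezes the regular points together for the first two scalers,
# while RobustScaler keeps them spread out around the median
print("StandardScaler:", np.round(standard, 2))
print("MinMaxScaler:  ", np.round(minmax, 2))
print("RobustScaler:  ", np.round(robust, 2))

# Normalizer works per row: each sample is rescaled to unit L2 length
row = np.array([[3.0, 4.0]])
unit_row = Normalizer().fit_transform(row)
print("Normalizer:    ", unit_row)
```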


For our datasets, we will use `StandardScaler`.

In [32]:
from sklearn.preprocessing import StandardScaler

# Standardize
scaler_cancer = StandardScaler()
Xc_train_scaled = scaler_cancer.fit_transform(Xc_train)
Xc_test_scaled = scaler_cancer.transform(Xc_test)

scaler_iris = StandardScaler()
Xi_train_scaled = scaler_iris.fit_transform(Xi_train)
Xi_test_scaled = scaler_iris.transform(Xi_test)

## Using clustering during EDA

Clustering is typically considered an unsupervised learning technique, but it can be a powerful tool during **Exploratory Data Analysis (EDA)**.  
It helps uncover **natural groupings**, **outliers**, and **hidden structures** in your data before you build predictive models.

Using clustering in EDA can surface patterns you wouldn't otherwise see — and can lead to smarter, more informed modeling decisions.

**When do we use clustering in EDA?**

- If clustering is used for insights only → okay-ish to cluster before the split (but if you want to be completely unbiased, do NOT do it!).
- If clustering creates a feature → do it after the split, on training data only!

**When is clustering useful for supervised learning?**

- When you want to understand data structure
- To visualize clusters using techniques like PCA or t-SNE
- For detecting outliers or unusual subgroups
- To engineer new features (e.g., cluster labels for supervised models)



**Important considerations**

- Do **not** fit clustering models on the test set — only on the training or full EDA set
- Apply **feature scaling** (e.g., `StandardScaler`) before clustering, especially for distance-based methods like k-Means
- Choose clustering algorithms wisely:
  - `KMeans`: Assumes spherical clusters and equal variance
  - `GaussianMixture`: More flexible — allows **elliptical clusters** and models **probability** of membership
  - `AgglomerativeClustering`: Good for detecting **hierarchical relationships**
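
The difference between hard labels (`KMeans`) and membership probabilities (`GaussianMixture`) can be sketched as below. This is an exploratory sketch only, fitted on the full iris data for insight; the choice of 3 clusters/components is an assumption made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Scale first, as recommended above for distance-based methods
X = StandardScaler().fit_transform(load_iris().data)

# KMeans: every sample gets exactly one cluster label
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
hard_labels = kmeans.labels_

# GaussianMixture: every sample gets a probability for each component
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
soft_membership = gmm.predict_proba(X)  # shape (n_samples, 3), rows sum to 1

print("KMeans label of first sample:", hard_labels[0])
print("GMM membership of first sample:", np.round(soft_membership[0], 3))
```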


**You can use clusters as features**

Cluster assignments can be saved as a **new categorical feature** and added to your dataset.  
This is useful in supervised models, but make sure:
- Clusters are created using **training data only**
- The process is **replicable** on future (test/real-world) data

---


#### Suggested workflow: EDA → Feature selection → Train-test split → Scaling → Clustering **on training data only** (you can add feature "Cluster label") → Modeling & evaluation

**If the cluster feature is created on the training data only, how do I use it on the test set later?**

You apply the clustering model (fitted on training data) to the test set.
Clustering, just like StandardScaler, PCA, or any learned transformation, needs to be:
  - Fitted on training data
  - Applied (transformed) on test data


```python
from sklearn.cluster import KMeans

# Fit on training data only
kmeans = KMeans(n_clusters=3, random_state=42)
train_clusters = kmeans.fit_predict(X_train_scaled)

# Add as feature
X_train_with_cluster = np.column_stack((X_train_scaled, train_clusters))

# Transform test data using THE SAME model
test_clusters = kmeans.predict(X_test_scaled)
X_test_with_cluster = np.column_stack((X_test_scaled, test_clusters))
```
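
Note that the cluster IDs are arbitrary integers with no meaningful order, so models that treat inputs as numeric may misread them. A hedged variant is to one-hot encode the labels before appending them. The sketch below is self-contained on the iris data; the `np.eye` indexing trick is just one simple way to build the one-hot columns, and 3 clusters is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the clustering model on the training data only
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
train_clusters = kmeans.fit_predict(X_train_scaled)
test_clusters = kmeans.predict(X_test_scaled)

# One-hot encode the labels: row i of the identity matrix encodes cluster i
train_onehot = np.eye(n_clusters)[train_clusters]
test_onehot = np.eye(n_clusters)[test_clusters]

X_train_with_cluster = np.hstack([X_train_scaled, train_onehot])
X_test_with_cluster = np.hstack([X_test_scaled, test_onehot])

print(X_train_with_cluster.shape, X_test_with_cluster.shape)
```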

## Step 5. Modeling & evaluation 

## Decision Tree 

The `DecisionTreeClassifier` in `scikit-learn` has several key parameters that control how the tree is built and how complex it becomes. Here's a summary of the most important ones:

- **criterion**: Specifies the function to measure the quality of a split. Common options include `'gini'` (default) and `'entropy'`.

- **max_depth**: The maximum depth of the tree. Limiting depth helps prevent overfitting by controlling how specific the model can become.

- **min_samples_split**: The minimum number of samples required to split an internal node. Higher values make the model more conservative.

- **min_samples_leaf**: The minimum number of samples required to be at a leaf node. This helps smooth the model by avoiding small leaf nodes.

- **max_features**: The number of features to consider when looking for the best split. Can be an integer count, a float fraction of the features, or a method like `'sqrt'` or `'log2'`.

- **random_state**: Setting this ensures reproducibility across runs.

These parameters can be tuned to balance model complexity, accuracy, and generalization.
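
As a rough illustration of how these parameters limit complexity, the sketch below compares an unrestricted tree with a constrained one. The specific limits (`max_depth=3`, `min_samples_leaf=5`) are assumptions chosen for illustration, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unrestricted tree: grows until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Constrained tree: the limits below are illustrative, not tuned
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("small", small_tree)]:
    print(
        f"{name:>5}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"train acc={tree.score(X_train, y_train):.3f}, "
        f"test acc={tree.score(X_test, y_test):.3f}"
    )
```

A large gap between training and test accuracy for the unrestricted tree would be a hint of overfitting.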


In [None]:
from sklearn.tree import DecisionTreeClassifier

# Binary
dt_cancer = DecisionTreeClassifier(random_state=42)  # fixed seed so the tree is reproducible
dt_cancer.fit(Xc_train_scaled, yc_train)
yc_pred = dt_cancer.predict(Xc_test_scaled)

print("Decision Tree - Breast Cancer")
print(classification_report(yc_test, yc_pred))

In [None]:
# Multiclass
dt_iris = DecisionTreeClassifier(random_state=42)  # fixed seed so the tree is reproducible
dt_iris.fit(Xi_train_scaled, yi_train)
yi_pred = dt_iris.predict(Xi_test_scaled)

print("Decision Tree - Iris")
print(classification_report(yi_test, yi_pred))

In [None]:
from sklearn.tree import plot_tree

plt.figure(figsize=(16, 10))
# Pass plain lists: some scikit-learn versions reject NumPy arrays for these arguments
plot_tree(dt_iris, filled=True, feature_names=list(iris.feature_names), class_names=list(iris.target_names));

In [None]:
from sklearn.tree import plot_tree

plt.figure(figsize=(40, 28))  
# Pass plain lists: some scikit-learn versions reject NumPy arrays for these arguments
plot_tree(dt_cancer, filled=True, feature_names=list(cancer.feature_names), class_names=list(cancer.target_names));

## Naive Bayes

The Naive Bayes family of classifiers in `scikit-learn` is based on Bayes’ Theorem and assumes that features are conditionally independent given the class label. These models are simple, fast, and effective, especially for high-dimensional data.

### GaussianNB

Used for continuous numeric features that are assumed to follow a normal (Gaussian) distribution.

- **priors**: Set prior probabilities for each class. If `None`, the priors are learned from the training data.
- **var_smoothing**: The fraction of the largest feature variance that is added to all variances for numerical stability (default `1e-9`).

### MultinomialNB

Used for discrete count data, such as word counts in text classification problems.

- **alpha**: Additive (Laplace/Lidstone) smoothing parameter. Helps handle features that don’t appear in the training data.
- **fit_prior**: Whether to learn class prior probabilities from the training data. If `False`, uniform priors are used.
- **class_prior**: Manually specified class priors. If given, they are used as-is and are not adjusted to the data.

### BernoulliNB

Used for binary/boolean features, where each feature is either present or absent (1 or 0).

- **alpha**: Smoothing parameter, similar to `MultinomialNB`.
- **binarize**: Threshold for converting feature values to binary (0 or 1). Default is `0.0`, meaning any positive value becomes 1.
- **fit_prior** and **class_prior**: Same behavior as in `MultinomialNB`.

Each variant of Naive Bayes is suited for different types of input features. Despite the naive assumption of feature independence, these models often perform well in practice and are especially useful as a baseline for classification tasks.
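
Since the cells in this notebook only use `GaussianNB`, here is a small sketch of `MultinomialNB` on word counts. The four-document corpus and its labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only
texts = [
    "win money now",
    "free money offer",
    "meeting at noon",
    "lunch at noon tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each text into word counts; alpha=1.0 is Laplace smoothing
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

new_text = ["free money"]
predicted = model.predict(new_text)
probabilities = model.predict_proba(new_text)

classes = model.named_steps["multinomialnb"].classes_
print(predicted)
print(dict(zip(classes, probabilities[0].round(3))))
```

With smoothing, words that never occurred in one class still get a small non-zero probability there, so a single unseen word does not zero out the whole class.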


In [None]:
from sklearn.naive_bayes import GaussianNB

# Binary
nb_cancer = GaussianNB()
nb_cancer.fit(Xc_train_scaled, yc_train)
yc_pred = nb_cancer.predict(Xc_test_scaled)

print("Naive Bayes - Breast Cancer")
print(classification_report(yc_test, yc_pred))


In [None]:
# Multiclass
nb_iris = GaussianNB()
nb_iris.fit(Xi_train_scaled, yi_train)
yi_pred = nb_iris.predict(Xi_test_scaled)
print("Naive Bayes - Iris")
print(classification_report(yi_test, yi_pred))

## K-Nearest Neighbors (KNN)

The `KNeighborsClassifier` in `scikit-learn` is a simple, non-parametric method that classifies a data point based on the majority class among its *k* nearest neighbors in the feature space.

Here are the key parameters that control how the model behaves:

- **n_neighbors**: The number of nearest neighbors to consider (i.e., the "k" in KNN). Increasing this value can make the model more stable but less sensitive to local patterns.

- **weights**: Determines how to weight the contribution of neighbors. Options include:
  - `'uniform'`: All neighbors are weighted equally (default).
  - `'distance'`: Closer neighbors have a greater influence.

- **algorithm**: The algorithm used to compute nearest neighbors. Options include `'auto'`, `'ball_tree'`, `'kd_tree'`, and `'brute'`. `'auto'` chooses the best method based on the data.

- **metric**: The distance metric used to find neighbors. Common options include `'minkowski'`, `'euclidean'`, and `'manhattan'`.

- **p**: Power parameter for the Minkowski metric. When `p=1`, it is equivalent to Manhattan distance; when `p=2`, it becomes Euclidean distance.

KNN is intuitive and often effective for smaller datasets, but it can become computationally expensive with large datasets or high-dimensional feature spaces.
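
One common way to pick `n_neighbors` is cross-validation on the training set. The following is a minimal sketch, assuming an illustrative grid of odd values for k and 5-fold cross-validation; neither is prescribed by the text above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate values of k (an illustrative grid, not a recommendation)
candidate_k = [1, 3, 5, 7, 9, 11]
cv_scores = {}
for k in candidate_k:
    # Scaling sits inside the pipeline so each CV fold fits its own scaler
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print({k: round(float(score), 3) for k, score in cv_scores.items()})
print("Best k by 5-fold CV on the training set:", best_k)
```

The test set is not touched during this search, so it can still serve for the final evaluation.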


In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Binary
knn_cancer = KNeighborsClassifier()
knn_cancer.fit(Xc_train_scaled, yc_train)
yc_pred = knn_cancer.predict(Xc_test_scaled)

print("KNN - Breast Cancer")
print(classification_report(yc_test, yc_pred))

In [None]:
# Multiclass
knn_iris = KNeighborsClassifier()
knn_iris.fit(Xi_train_scaled, yi_train)
yi_pred = knn_iris.predict(Xi_test_scaled)

print("KNN - Iris")
print(classification_report(yi_test, yi_pred))

## Logistic Regression

The `LogisticRegression` classifier in `scikit-learn` is a linear model used for binary and multiclass classification. It models the probability of class membership using a logistic (sigmoid) function and finds the best-fitting linear decision boundary.

Below are the key parameters that control its behavior:

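- **penalty**: Specifies the type of regularization to apply. Common options include:
  - `'l2'` (default): Ridge-style regularization.
  - `'l1'`: Lasso-style regularization (only supported by the `'liblinear'` and `'saga'` solvers).
  - `'elasticnet'`: Combination of L1 and L2 (only with `'saga'`, and it requires `l1_ratio`).
  - `None`: No regularization (older scikit-learn versions spelled this as the string `'none'`).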

- **C**: Inverse of regularization strength. Smaller values imply stronger regularization. It helps control overfitting.

- **solver**: The algorithm used to optimize the model. Common options include:
  - `'lbfgs'` (default, good for small to medium data).
  - `'liblinear'` (works with L1 and L2 penalties).
  - `'saga'` (supports L1, L2, and elasticnet for large datasets).

- **max_iter**: Maximum number of iterations taken for the solver to converge. Increase this if the model doesn't converge with the default setting.

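- **multi_class**: Strategy for multiclass problems (deprecated since scikit-learn 1.5, so prefer leaving it unset). Options include:
  - `'auto'` (default): Chooses `'ovr'` for binary data or the `'liblinear'` solver, otherwise `'multinomial'`.
  - `'ovr'`: One-vs-Rest.
  - `'multinomial'`: A single model fitted with the multinomial (softmax) loss.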

- **random_state**: Seed used to shuffle the data. It only has an effect with the `'sag'`, `'saga'`, and `'liblinear'` solvers.

Logistic Regression is a robust and interpretable model that works well when the relationship between features and the log-odds of the outcome is linear.
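
The effect of `C` can be made visible by comparing coefficient sizes. This is a minimal sketch with the default L2 penalty; the three values of `C` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Smaller C = stronger (default L2) regularization = smaller coefficients
coef_norms = {}
for C in [0.01, 1.0, 100.0]:  # illustrative values
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train_scaled, y_train)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(
        f"C={C:>6}: ||coef||={coef_norms[C]:.2f}, "
        f"test acc={model.score(X_test_scaled, y_test):.3f}"
    )
```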


In [None]:
from sklearn.linear_model import LogisticRegression

# Binary
lr_cancer = LogisticRegression(max_iter=1000)
lr_cancer.fit(Xc_train_scaled, yc_train)
yc_pred = lr_cancer.predict(Xc_test_scaled)

print("Logistic Regression - Breast Cancer")
print(classification_report(yc_test, yc_pred))

In [None]:
# Multiclass
lr_iris = LogisticRegression(max_iter=1000)
lr_iris.fit(Xi_train_scaled, yi_train)
yi_pred = lr_iris.predict(Xi_test_scaled)

print("Logistic Regression - Iris")
print(classification_report(yi_test, yi_pred))

#### Feature selection

In [None]:
model = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
model.fit(Xc_train_scaled, yc_train)

feature_names = cancer.feature_names  
coefs = model.coef_.ravel()  
eliminated = feature_names[coefs == 0]
selected = feature_names[coefs != 0]

print("Eliminated features:")
print(eliminated)

print("\nSelected features:")
print(selected)

In [None]:
# Fit logistic regression with L1 for multiclass (use 'saga' solver with multinomial loss)
model = LogisticRegression(
    penalty='l1',
    solver='saga',
    C=1.0,
    max_iter=5000
)
model.fit(Xi_train_scaled, yi_train)

feature_names = iris.feature_names  
coefs = model.coef_  # Get coefficients matrix: shape (n_classes, n_features)

used_mask = (coefs != 0).any(axis=0)  # True for selected features
eliminated = np.array(feature_names)[~used_mask]
selected = np.array(feature_names)[used_mask]

print("Eliminated features (all-zero across all classes):")
print(eliminated)

print("\nSelected features (used by at least one class):")
print(selected)


### Customizing the Decision Threshold in Logistic Regression

By default, a binary `LogisticRegression` effectively uses a **0.5 threshold** to make class predictions: a sample is assigned to the positive class when its predicted probability exceeds 0.5. (For multiclass problems, `predict` simply returns the class with the highest probability.) However, in many real-world scenarios, it can be useful to **adjust this threshold** to better control the trade-off between **precision** and **recall**.


####  Why Adjust the Threshold?

- **Imbalanced classes**: In cases like fraud detection or medical diagnosis, the positive class is rare. Lowering the threshold can increase the sensitivity (recall).
- **Business or application priorities**: You may prefer to avoid false negatives or false positives depending on the context.
- **More control over model behavior**: Especially useful for fine-tuning performance beyond what accuracy alone can provide.


In [None]:
# Get predicted probabilities
probs = lr_cancer.predict_proba(Xc_test_scaled)

# Use custom threshold (e.g., 0.3)
threshold = 0.3
custom_preds = (probs[:, 1] >= threshold).astype(int)
print(classification_report(yc_test, custom_preds))
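
To make the precision/recall trade-off explicit, one can sweep a few thresholds and compare. This is a self-contained sketch; the thresholds 0.3, 0.5, and 0.7 are illustrative, and class 1 is the benign class in this dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Probability of class 1 (benign in this dataset)
pos_probs = model.predict_proba(X_test)[:, 1]

recalls = {}
for threshold in [0.3, 0.5, 0.7]:  # illustrative thresholds
    preds = (pos_probs >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, preds)
    print(
        f"threshold={threshold}: "
        f"precision={precision_score(y_test, preds):.3f}, "
        f"recall={recalls[threshold]:.3f}"
    )
```

Lowering the threshold can only add positive predictions, so recall for class 1 never decreases as the threshold goes down; precision usually moves in the opposite direction.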

In [None]:
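probs = lr_iris.predict_proba(Xi_test_scaled)
threshold = 0.6

# Assign class only if highest probability ≥ threshold
max_probs = np.max(probs, axis=1)
pred_classes = np.argmax(probs, axis=1)
custom_preds = np.where(max_probs >= threshold, pred_classes, -1)  # -1 = uncertain

# Score only the confident predictions, since -1 is not a real class
confident = custom_preds != -1
print(f"Confident predictions: {confident.sum()} of {len(custom_preds)}")
print(classification_report(yi_test[confident], custom_preds[confident]))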

## Support Vector Machine (SVM)

The `SVC` classifier in `scikit-learn` is a powerful model that finds the optimal hyperplane to separate classes in the feature space. It works well for both linear and non-linear classification tasks and can handle high-dimensional data effectively.

Below are the main parameters to control an SVM model:

- **kernel**: Specifies the kernel type used to transform the data. Common options include:
  - `'linear'`: Linear decision boundary.
  - `'rbf'` (default): Radial Basis Function, useful for non-linear problems.
  - `'poly'`: Polynomial kernel.
  - `'sigmoid'`: Used less frequently.

- **C**: Regularization parameter. A smaller value allows a softer margin (more tolerance for misclassification), while a larger value tries to fit the training data more strictly.

- **gamma**: Controls the influence of individual data points for non-linear kernels (`'rbf'`, `'poly'`, `'sigmoid'`). 
  - `'scale'` (default) or `'auto'` are common settings.
  - Higher values lead to tighter decision boundaries (can overfit).

- **degree**: Degree of the polynomial kernel function (used only if `kernel='poly'`).

- **probability**: If set to `True`, enables probability estimates via cross-validation. This makes the model slower but useful when you need class probabilities.

- **random_state**: Seeds the data shuffling used for probability estimates. It is ignored when `probability=False`.

SVMs are effective in high-dimensional spaces and are versatile due to the use of different kernel functions. They can be sensitive to parameter settings, so tuning `C`, `gamma`, and `kernel` is often necessary.
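
Tuning `C` and `gamma` is often done with a cross-validated grid search. The following is a minimal sketch; the grid values are small illustrative choices, and the step names `"scaler"` and `"svc"` are just labels picked for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling lives inside the pipeline so every CV fold is scaled on its own
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Small illustrative grid; real searches are usually wider
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```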


In [None]:
from sklearn.svm import SVC

# Binary - Linear
svm_linear_cancer = SVC(kernel='linear')
svm_linear_cancer.fit(Xc_train_scaled, yc_train)
yc_pred = svm_linear_cancer.predict(Xc_test_scaled)

print("SVM (Linear) - Breast Cancer")
print(classification_report(yc_test, yc_pred))

In [None]:
# Binary - RBF
svm_rbf_cancer = SVC(kernel='rbf')
svm_rbf_cancer.fit(Xc_train_scaled, yc_train)
yc_pred = svm_rbf_cancer.predict(Xc_test_scaled)

print("SVM (RBF) - Breast Cancer")
print(classification_report(yc_test, yc_pred))


In [None]:
# Multiclass - Linear
svm_linear_iris = SVC(kernel='linear')
svm_linear_iris.fit(Xi_train_scaled, yi_train)
yi_pred = svm_linear_iris.predict(Xi_test_scaled)

print("SVM (Linear) - Iris")
print(classification_report(yi_test, yi_pred))

In [None]:
# Multiclass - RBF
svm_rbf_iris = SVC(kernel='rbf')
svm_rbf_iris.fit(Xi_train_scaled, yi_train)
yi_pred = svm_rbf_iris.predict(Xi_test_scaled)

print("SVM (RBF) - Iris")
print(classification_report(yi_test, yi_pred))