# <h3 align="center">__Module 8 Activity__</h3>
# <h3 align="center">__Assigned at the start of Module 8__</h3>
# <h3 align="center">__Due at the end of Module 8__</h3><br>

# Weekly Discussion Forum Participation

Each week, you are required to participate in the module’s discussion forum. The discussion centers on the week's Module Activity, which is released at the beginning of the module. You must complete or attempt the activity before posting about it or about anything related to the topic.

## Grading of the Discussion

### 1. Initial Post:
Create your thread by **Day 5 (Saturday night at midnight, PST).**

### 2. Responses:
Respond to at least two other posts by **Day 7 (Monday night at midnight, PST).**

---

## Grading Criteria:

Your participation will be graded as follows:

### Full Credit (100 points):
- Submit your initial post by **Day 5.**
- Respond to at least two other posts by **Day 7.**

### Half Credit (50 points):
- If your initial post is late but you respond to two other posts.
- If your initial post is on time but you fail to respond to at least two other posts.

### No Credit (0 points):
- If both your initial post and responses are late.
- If you fail to submit an initial post and do not respond to any others.

---

## Additional Notes:

- **Late Initial Posts:** Late posts will automatically receive half credit if two responses are completed on time.
- **Substance Matters:** Responses must be thoughtful and constructive. Comments like “Great post!” or “I agree!” without further explanation will not earn credit.
- **Balance Participation:** Aim to engage with threads that have fewer or no responses to ensure a balanced discussion.

---

## Avoid:
- Posting several times within a very short time frame, especially immediately before the posting deadline.
- Posts that compliment another post and then merely summarize it.


```python
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
```

# Problem 1: Support Vector Machines (SVM) - Understanding Margins & Decision Boundaries

A company is classifying emails as spam or not spam using an SVM classifier. The dataset consists of word frequency features extracted from emails. Your task is to visualize the decision boundary, experiment with the kernel type, and analyze how support vectors influence classification.

## Dataset
We will generate a synthetic dataset with two classes (spam and not spam) for visualization purposes.

```python
# Generate synthetic dataset (2 features for visualization)
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=42)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define an SVM classifier
svm_model = SVC(kernel='linear', C=1.0)
svm_model.fit(X_train, y_train)

# Function to plot decision boundary
def plot_decision_boundary(model, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('SVM Decision Boundary')
    plt.show()

# Plot decision boundary for SVM
plot_decision_boundary(svm_model, X_train, y_train)
```
## Questions
1. Change the kernel type in the SVM classifier from `SVC(kernel='linear')` to `rbf` and then to `poly`. How does the decision boundary change?
2. Identify the support vectors in the model. What role do they play in defining the decision boundary?
3. Adjust the `C` parameter (try `C=0.1` vs. `C=10`). What effect does this have on the margin width and classification?
4. If you have overlapping classes, which kernel type would likely perform best? Why?
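
As a starting point for questions 2 and 3, the following is a minimal sketch rather than a reference solution. It regenerates the same synthetic data, refits a linear SVM for a few values of `C`, and reports the number of support vectors together with the margin width, computed as `2 / ||w||`. The `margin_widths` dictionary and the extra value `C=1.0` are illustrative choices, and `coef_` is only available for the linear kernel.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same synthetic data and split as in the cell above
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

margin_widths = {}  # illustrative container, not part of the assignment code
for C in (0.1, 1.0, 10.0):
    model = SVC(kernel='linear', C=C).fit(X_train, y_train)
    w = model.coef_[0]                          # normal vector of the separating line
    margin_widths[C] = 2.0 / np.linalg.norm(w)  # distance between the two margin lines
    print(f"C={C}: {len(model.support_vectors_)} support vectors, "
          f"margin width={margin_widths[C]:.3f}, "
          f"test accuracy={model.score(X_test, y_test):.3f}")
```

The support vectors themselves are stored in `model.support_vectors_` and their row indices in `model.support_`, so they can be overlaid on the decision boundary plot with an extra `plt.scatter` call.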

# Problem 2: Calculating Decision Boundaries - Logistic Regression vs SVM

You are training a binary classifier to predict whether a customer will buy a product based on their income and age. The model must learn the decision boundary that separates the two groups.

```python
# Generate dataset with two features
X, y = make_classification(n_samples=200, n_features=2, n_classes=2, n_clusters_per_class=1, n_informative=2, n_redundant=0, random_state=42)

# Train Logistic Regression model
log_reg = LogisticRegression()
log_reg.fit(X, y)

# Train SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X, y)

# Function to plot decision boundaries
def plot_models(models, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))

    plt.figure(figsize=(12, 5))

    for i, (model, title) in enumerate(models):
        plt.subplot(1, 2, i + 1)
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        plt.contourf(xx, yy, Z, alpha=0.3)
        plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
        plt.xlabel('Feature 1')
        plt.ylabel('Feature 2')
        plt.title(title)

    plt.show()

# Plot decision boundaries for both models
plot_models([(log_reg, "Logistic Regression Decision Boundary"), (svm_model, "SVM Decision Boundary")], X, y)

```
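
The plots can be complemented with a numeric comparison. Both fitted models here are linear, so each boundary is the line `w · x + b = 0`. The sketch below, which assumes the same dataset as above, normalizes `(w, b)` for each model and reports the angle between the two boundaries and the share of points on which the predictions differ. The helper `unit_normal` is a hypothetical name introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Same dataset as in the cell above
X, y = make_classification(n_samples=200, n_features=2, n_classes=2, n_clusters_per_class=1,
                           n_informative=2, n_redundant=0, random_state=42)

log_reg = LogisticRegression().fit(X, y)
svm_model = SVC(kernel='linear').fit(X, y)

def unit_normal(model):
    """Return the boundary's unit normal vector and the intercept scaled by the same factor."""
    w, b = model.coef_[0], model.intercept_[0]
    norm = np.linalg.norm(w)
    return w / norm, b / norm

n_lr, b_lr = unit_normal(log_reg)
n_svm, b_svm = unit_normal(svm_model)

# Angle between the two boundary normals, in degrees
cos_angle = np.clip(np.dot(n_lr, n_svm), -1.0, 1.0)
angle_deg = np.degrees(np.arccos(cos_angle))

# Fraction of training points where the two models predict different classes
disagreement = np.mean(log_reg.predict(X) != svm_model.predict(X))

print(f"Logistic Regression: normal={n_lr}, scaled intercept={b_lr:.3f}")
print(f"Linear SVM:          normal={n_svm}, scaled intercept={b_svm:.3f}")
print(f"Angle between boundaries: {angle_deg:.2f} degrees")
print(f"Prediction disagreement:  {disagreement:.1%}")
```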

## Questions
1. Compare the decision boundaries of Logistic Regression and SVM. How do they differ in separating the two classes?
2. Logistic Regression models the probability of a class using a sigmoid function. How does this affect its decision boundary compared to SVM?
3. Modify the SVM model to use different kernel functions (RBF, Polynomial). How does the boundary change compared to the linear model?
4. If a dataset has overlapping classes, which model is more likely to generalize well? Explain.
5. Try adding random noise to the dataset with `X += np.random.normal(0, 0.5, X.shape)`, then refit both models. Which model is more robust to noise, and why?
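
One possible way to set up the noise experiment in question 5 is sketched below. It makes a few assumptions that are not in the original cell: a seeded `np.random.default_rng(0)` generator so the run is repeatable, a noisy copy `X_noisy` instead of modifying `X` in place, and a held-out test split so accuracy is measured on unseen points. The `results` dictionary is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same dataset as in the cell above
X, y = make_classification(n_samples=200, n_features=2, n_classes=2, n_clusters_per_class=1,
                           n_informative=2, n_redundant=0, random_state=42)

rng = np.random.default_rng(0)             # fixed seed (an assumption) for repeatability
X_noisy = X + rng.normal(0, 0.5, X.shape)  # noisy copy; the original X is left untouched

results = {}
for label, data in (("clean", X), ("noisy", X_noisy)):
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, test_size=0.2, random_state=42)
    for name, model in (("Logistic Regression", LogisticRegression()),
                        ("Linear SVM", SVC(kernel='linear'))):
        results[(label, name)] = model.fit(X_tr, y_tr).score(X_te, y_te)
        print(f"{label:>5} | {name:<20} test accuracy = {results[(label, name)]:.3f}")
```

A single seed gives only one draw of the noise, so repeating the loop over several seeds gives a fairer picture before drawing conclusions.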

# Problem 3: Comparing Multiple Classification Models

A company wants to classify its customers into three segments based on their purchasing behavior:

- Class 0: Low-Value Customers
- Class 1: Mid-Value Customers
- Class 2: High-Value Customers

Using customer transaction data, we need to train multiple classification models and determine which performs best.

```python
# Generate a multi-class dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3, n_clusters_per_class=1, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Complete the code using the task list below
```
## Tasks
1. Apply SMOTE for class balancing. Resample the training split only, so that no synthetic samples end up in the test set.
2. Train Logistic Regression, Random Forest, and Support Vector Machine (SVM) on the dataset.
3. Compare their performance using Accuracy, Precision, Recall, F1-Score, and ROC-AUC.
4. Analyze the confusion matrix to understand misclassifications.
5. Discuss the impact of using SMOTE for class balancing and whether it improves performance.

## Questions
1. Which model performs best overall? Compare using F1-score and ROC-AUC.
2. How does SMOTE affect classification performance? Would balancing the dataset improve certain metrics?
3. If this were a business decision-making tool, should we prioritize Precision or Recall? Why?
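4. Try switching the multi-class strategy used by Logistic Regression to one-vs-rest. How does the performance change? (Hint: `ovr`. In recent scikit-learn releases the `multi_class` argument is deprecated, so wrapping the model in `OneVsRestClassifier` is an alternative.)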
5. Modify the dataset to have more overlapping classes. Which model handles this situation better?
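
For questions 1 and 4, a comparison loop along the following lines may be a useful starting point. It is a sketch under stated assumptions: macro averaging for F1, one-vs-rest averaging for ROC-AUC, `max_iter=1000` for Logistic Regression, an RBF-kernel SVM with `probability=True` so that `predict_proba` is available, and no SMOTE step. The `models` and `scores` dictionaries are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Same dataset and split as in the cell above
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LogReg (default)": LogisticRegression(max_iter=1000),
    "LogReg (one-vs-rest)": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM (RBF)": SVC(probability=True, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)  # class probabilities, needed for ROC-AUC
    scores[name] = {
        "f1_macro": f1_score(y_test, y_pred, average="macro"),
        "roc_auc_ovr": roc_auc_score(y_test, y_proba, multi_class="ovr"),
    }
    cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
    print(f"{name}: F1 (macro) = {scores[name]['f1_macro']:.3f}, "
          f"ROC-AUC (OvR) = {scores[name]['roc_auc_ovr']:.3f}")
    print(cm)
```

For question 5, one option is to lower the `class_sep` argument of `make_classification` and rerun the same loop.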