# Assignment Solutions (SVM, Naïve Bayes, Spam Classification)

This notebook contains solutions to all questions (theory and practical) from the assignment.

## Q1: What is a Support Vector Machine (SVM), and how does it work?
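A Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression. For classification, it finds the hyperplane that separates the classes with the largest margin, where the margin is the distance from the hyperplane to the nearest training points. Those nearest points are the support vectors, and they alone determine where the hyperplane lies. Kernels extend the method to non-linear decision boundaries.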
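
As a rough illustration of the margin and the support vectors, the sketch below fits a linear SVM to six hand-picked 2-D points (toy data invented for this example, not part of the assignment). A very large `C` is used to approximate a hard margin, and the margin width is then `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical values chosen for illustration)
X_toy = np.array([[0, 0], [0, 2], [-1, 1],   # class 0
                  [2, 1], [3, 0], [3, 2]])   # class 1
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM
clf = SVC(kernel='linear', C=1e6)
clf.fit(X_toy, y_toy)

w = clf.coef_[0]                          # normal vector of the hyperplane
b = clf.intercept_[0]                     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two margin lines

print("w =", w, "b =", b)
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

For these points the separating line is `x1 = 1`, so the margin width should come out close to 2, and only the points nearest the boundary are reported as support vectors.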

## Q2: Difference between Hard Margin and Soft Margin SVM
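- **Hard Margin**: Assumes the data is perfectly linearly separable and allows no training point inside the margin or on the wrong side of the hyperplane. It has no solution when the classes overlap and is sensitive to outliers.
- **Soft Margin**: Allows margin violations and misclassifications through slack variables. The parameter C sets the trade-off: a large C penalizes violations heavily (approaching hard-margin behavior), while a small C tolerates more violations in exchange for a wider margin. More robust to noise.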
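
To make the role of the slack variables and `C` concrete, here is a small hand calculation in plain Python. It assumes the standard soft-margin objective `0.5 * ||w||^2 + C * Σ ξ_i` with `ξ_i = max(0, 1 - y_i (w·x_i + b))`. The hyperplane and the four points are made-up values, not fitted to any dataset.

```python
# Fixed, hypothetical hyperplane w.x + b = 0 (chosen by hand, not learned)
w = (1.0, 0.0)
b = -1.0

# (point, label) pairs with labels in {-1, +1}; values invented for illustration
data = [
    ((-1.0, 0.0), -1),  # outside the margin, correct side
    ((2.0, 0.0), +1),   # exactly on the margin
    ((1.5, 0.0), +1),   # inside the margin, still correctly classified
    ((0.5, 0.0), +1),   # wrong side of the hyperplane (misclassified)
]

def slack(x, y):
    """Slack is 0 if the point respects the margin, positive otherwise."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

slacks = [slack(x, y) for x, y in data]

def objective(C):
    """Soft-margin objective: margin term plus C-weighted total slack."""
    return 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)

print("Slack per point:", slacks)
for C in (0.1, 1.0, 10.0):
    print("C =", C, "-> objective =", objective(C))
```

The same two violations cost little when `C` is small and dominate the objective when `C` is large, which is why a large `C` pushes the optimizer toward hard-margin behavior.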

## Q3: Kernel Trick in SVM
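The kernel trick computes inner products in a high-dimensional feature space directly from the original inputs, without ever constructing the mapping explicitly. Example: the RBF kernel `K(x,x') = exp(-γ||x-x'||^2)`, whose implicit feature space is infinite-dimensional. Use case: data whose decision boundary is non-linear in the original feature space.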
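
The identity behind the kernel trick can be checked by hand. The sketch below uses the degree-2 polynomial kernel `K(x, z) = (x·z)^2` rather than the RBF kernel, because its feature map is small enough to write out. The function names `phi` and `poly_kernel` are illustrative only.

```python
import math

def dot(u, v):
    """Plain inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Same inner product, computed without building phi."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))   # inner product in the 3-D feature space
implicit = poly_kernel(x, z)     # kernel evaluated in the original 2-D space

print("Explicit:", explicit)
print("Kernel:  ", implicit)
print("Match:", math.isclose(explicit, implicit))
```

Both routes give the same value up to floating-point rounding, but the kernel never materializes the higher-dimensional vectors.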

## Q4: Naïve Bayes Classifier and 'naïve' assumption
Naïve Bayes applies Bayes’ theorem under the assumption that the features are conditionally independent given the class. It is called "naïve" because this assumption rarely holds exactly in real data, yet the classifier often performs well regardless.
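
Under this assumption the posterior factorizes as `P(y | x1..xn) ∝ P(y) · Π P(xi | y)`. The following hand calculation shows the arithmetic for a two-word message. All probabilities are invented for illustration and do not come from any dataset.

```python
# Hypothetical class priors and per-word likelihoods (made-up numbers)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.5, "free": 0.6},
    "ham":  {"offer": 0.1, "free": 0.2},
}

message = ["offer", "free"]  # words observed in the message

# Unnormalized score per class: prior times the product of word likelihoods
scores = {}
for label, prior in priors.items():
    score = prior
    for word in message:
        score *= likelihoods[label][word]
    scores[label] = score

# Normalize so the posteriors sum to 1
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}

print("Unnormalized scores:", scores)
print("Posteriors:", posteriors)
print("Prediction:", max(posteriors, key=posteriors.get))
```

With these numbers the spam posterior is about 0.91, so the message is labeled spam.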

## Q5: Types of Naïve Bayes
- **Gaussian NB**: For continuous features (assumes normal distribution).
- **Multinomial NB**: For discrete counts (word counts, TF).
- **Bernoulli NB**: For binary features (word present/absent).
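
The three variants above map onto three scikit-learn classes. The sketch below pairs each with the kind of feature matrix it expects, using tiny arrays invented for illustration (names such as `X_cont` and `y_nb` are not from the assignment).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y_nb = np.array([0, 0, 1, 1])  # toy labels shared by all three examples

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [5.0, 6.2], [5.1, 5.8]])
gnb_pred = GaussianNB().fit(X_cont, y_nb).predict([[1.1, 2.0], [5.0, 6.0]])

# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
mnb_pred = MultinomialNB().fit(X_counts, y_nb).predict([[3, 0], [0, 4]])

# Word present/absent -> BernoulliNB (counts reduced to 0/1 indicators)
X_binary = (X_counts > 0).astype(int)
bnb_pred = BernoulliNB().fit(X_binary, y_nb).predict([[1, 0], [0, 1]])

print("GaussianNB:   ", gnb_pred)
print("MultinomialNB:", mnb_pred)
print("BernoulliNB:  ", bnb_pred)
```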

In [None]:
## Q6: Iris + Linear SVM
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

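# Linear-kernel SVM; only class predictions are used below, so probability
# estimates (which add an internal cross-validation step) are left off.
svm_lin = SVC(kernel='linear')
svm_lin.fit(X_train, y_train)
y_pred = svm_lin.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Training points on or inside the margin; these define the decision boundary
support_vectors = svm_lin.support_vectors_

print("Accuracy:", accuracy)
print("Support vectors shape:", support_vectors.shape)
print("Sample support vectors (up to 5):", support_vectors[:5])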


In [None]:
## Q7: Breast Cancer + GaussianNB
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

breast = datasets.load_breast_cancer()
X, y = breast.data, breast.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Take class names from the dataset (0 = malignant, 1 = benign) instead of hard-coding them
print(classification_report(y_test, y_pred, target_names=breast.target_names, zero_division=0))


In [None]:
## Q8: Wine + SVM GridSearchCV
from sklearn.model_selection import GridSearchCV

wine = datasets.load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1]}
svc = SVC(kernel='rbf')
grid = GridSearchCV(svc, param_grid, cv=3, n_jobs=1)
grid.fit(X_train, y_train)

best_params = grid.best_params_
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best hyperparameters:", best_params)
print("Test Accuracy:", accuracy)
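
RBF-kernel SVMs are sensitive to feature scale, and the Wine features span very different ranges, so standardizing before the SVM is a common refinement. The sketch below is one possible variant, not part of the original solution: it wraps `StandardScaler` and `SVC` in a pipeline so that the scaler is refit inside each cross-validation fold. The `svc__` prefixes come from the step name that `make_pipeline` assigns automatically, and the variable names are chosen to avoid clashing with the cell above.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = datasets.load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
scaled_grid_params = {
    'svc__C': [0.1, 1, 10],
    'svc__gamma': ['scale', 0.01, 0.1],
}
scaled_grid = GridSearchCV(scaled_svc, scaled_grid_params, cv=3, n_jobs=1)
scaled_grid.fit(X_tr, y_tr)

# GridSearchCV refits the best pipeline on all training data by default
scaled_accuracy = accuracy_score(y_te, scaled_grid.predict(X_te))
print("Best hyperparameters (scaled):", scaled_grid.best_params_)
print("Test accuracy (scaled):", scaled_accuracy)
```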


In [None]:
## Q9: Text + MultinomialNB ROC-AUC
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

try:
    # Two-class subset of 20 Newsgroups; headers, footers, and quotes are
    # stripped so the model cannot key on metadata. Downloads on first use.
    newsgroups = fetch_20newsgroups(subset='train',
                                    categories=['alt.atheism','soc.religion.christian'],
                                    remove=('headers','footers','quotes'))
    texts, labels = newsgroups.data, newsgroups.target
except Exception as exc:
    # Offline fallback: build a small synthetic two-class corpus so the cell
    # still runs. The resulting ROC-AUC is a smoke test, not a real result.
    import numpy as np
    print("Could not fetch 20 Newsgroups (%s); using synthetic data." % exc)
    rng = np.random.RandomState(42)
    vocab = ['god','church','science','atheist','belief','pray','team','win','game','policy','religion','text','email','offer','discount']
    seed_words = {0: ['atheist', 'science'], 1: ['church', 'pray']}
    texts, labels = [], []
    for i in range(300):
        label = 0 if i < 150 else 1
        w = rng.choice(vocab, size=rng.randint(5,12), replace=True)
        w[:2] = seed_words[label]  # plant two class-specific words
        texts.append(" ".join(w))
        labels.append(label)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.95)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

mnb = MultinomialNB()
mnb.fit(X_train_vec, y_train)
# Probability of the positive class (label 1), used as the ranking score for ROC-AUC
proba = mnb.predict_proba(X_test_vec)[:,1]
roc_auc = roc_auc_score(y_test, proba)

print("ROC-AUC:", roc_auc)


## Q10: Spam classification pipeline & business impact
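1. **Preprocessing**: clean the text, handle missing values, vectorize with TF-IDF, and add hand-crafted features (URL count, ratio of capital letters, etc.).
2. **Model choice**: start with MultinomialNB as a fast baseline, then compare against LinearSVC or Logistic Regression.
3. **Imbalance handling**: class weights, resampling (SMOTE or undersampling), and decision-threshold tuning.
4. **Evaluation**: precision, recall, F1, ROC-AUC, and PR-AUC. With imbalanced classes, accuracy alone is misleading.
5. **Business impact**: filtering reduces spam volume and phishing exposure, but false positives (legitimate mail flagged as spam) erode user trust, so the system needs ongoing monitoring and a user feedback loop.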

**Sample pipeline:**
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_df=0.95)),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=500))
])

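# `texts` and `labels` are assumed to hold the email bodies and their spam/ham labels
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, stratify=labels, test_size=0.2, random_state=42
)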
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```
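
The threshold tuning mentioned in step 3 can be sketched with `precision_recall_curve`. The labels and scores below are hand-written stand-ins for `y_test` and `pipeline.predict_proba(X_test)[:, 1]`, and the 0.95 precision target is an arbitrary example of a business constraint on false positives.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical ground truth (1 = spam) and predicted spam probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.85, 0.95])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision and recall have one more entry than thresholds; drop the last point
target_precision = 0.95  # assumed requirement, not from the assignment
meets_target = precision[:-1] >= target_precision

if meets_target.any():
    # Thresholds are ascending, so the first qualifying one keeps recall highest
    idx = int(np.argmax(meets_target))
    chosen_threshold = thresholds[idx]
    y_pred_tuned = (scores >= chosen_threshold).astype(int)
    print("Threshold:", chosen_threshold)
    print("Precision:", precision[idx], "Recall:", recall[idx])
else:
    chosen_threshold = None
    print("No threshold reaches the target precision")
```

Predictions at the tuned threshold come from comparing the scores to `chosen_threshold` directly rather than calling `predict`, which for logistic regression corresponds to a 0.5 cutoff.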
