<a href="https://colab.research.google.com/github/aliemirerten/INTRODUCTION-TO-MACHINE-LEARNING/blob/main/quiz.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Undergrad-Friendly Data Science Quiz — Questions Only

**Instructions:** This notebook contains only the questions (no solutions).  
Answer each question in the provided code cells. You can run all questions directly in **Google Colab**.

**Distribution:**
- 5 Python basics
- 2 Descriptive statistics
- 2 Math for data science
- 2 Linear algebra
- 9 Intro ML topics (linear regression, logistic regression, k-means, k-NN, PCA, etc.)

_Generated on 2025-11-01 06:39:35 UTC_


### Setup (run once)

In [None]:
import numpy as np
import pandas as pd

from scipy import stats

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, r2_score, silhouette_score
from sklearn.datasets import make_classification, load_iris, make_blobs, make_moons, load_wine
from sklearn.linear_model import LinearRegression, LogisticRegression, Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

np.set_printoptions(suppress=True, precision=4)
rng = np.random.default_rng(1)


## Part A — Python Programming (5 questions)

### Q1 — List basics
Create a list of integers from 1 to 10. Print the first 3 and the last 2 elements.

In [None]:
numbers = list(range(1, 11))
print("Full list:", numbers)

print("First 3 elements:", numbers[:3])

print("Last 2 elements:", numbers[-2:])

Full list: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
First 3 elements: [1, 2, 3]
Last 2 elements: [9, 10]


### Q2 — Sum of evens
Using a loop or list comprehension, compute the sum of even numbers from 1 to 20.

In [None]:
even_sum_loop = 0
for i in range(1, 21):
    if i % 2 == 0:
        even_sum_loop += i
print("Sum of even numbers (1-20) using loop:", even_sum_loop)

Sum of even numbers (1-20) using loop: 110
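
The question also allows a list comprehension. A minimal sketch of that variant, with a closed-form check, is shown here; the names `even_sum_comp` and `even_sum_formula` are illustrative and not part of the original answer.

```python
# Sum of the even numbers from 1 to 20 with a generator expression.
even_sum_comp = sum(i for i in range(1, 21) if i % 2 == 0)

# Closed-form check: 2 + 4 + ... + 2n = n * (n + 1), here with n = 10.
n = 10
even_sum_formula = n * (n + 1)

print("Comprehension:", even_sum_comp)
print("Formula:      ", even_sum_formula)
```

Both values should agree with the loop result above.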


### Q3 — Function with default
Write `greet(name, lang='EN')` that prints `Hello, <name>!` when `lang` is `'EN'` and `Merhaba, <name>!` when `lang` is `'TR'`.

In [None]:
def greet(name, lang='EN'):
    if lang == 'EN':
        print(f'Hello, {name}!')
    elif lang == 'TR':
        print(f'Merhaba, {name}!')
    else:
        print(f'Unknown language: {lang}')

greet("Alice")
greet("Ahmet", "TR")
greet("Bob", "EN")

Hello, Alice!
Merhaba, Ahmet!
Hello, Bob!


### Q4 — Dictionary counts
Given `text = 'data science is fun and data is useful'`, build a dict mapping each word to its count.

In [None]:
text = 'data science is fun and data is useful'

word_count = {}
words = text.split()
for word in words:
    if word in word_count:
        word_count[word] += 1
    else:
        word_count[word] = 1

print("Word count", word_count)

Word count {'data': 2, 'science': 1, 'is': 2, 'fun': 1, 'and': 1, 'useful': 1}
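
The same tally can be sketched with `collections.Counter` from the standard library, which does the "increment or initialise" bookkeeping itself. The name `word_count_counter` is illustrative.

```python
from collections import Counter

text = 'data science is fun and data is useful'

# split() breaks the text on whitespace; Counter tallies each word.
word_count_counter = Counter(text.split())

print(dict(word_count_counter))
print("Two most common:", word_count_counter.most_common(2))
```

A `Counter` is a `dict` subclass, so it can be used wherever the hand-built dictionary would be.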


### Q5 — Simple NumPy array
Create a NumPy array `a = [1,2,3,4]`. Compute elementwise square and the mean.

In [None]:
a = np.array([1, 2, 3, 4])
print("Original array:", a)

a_squared = a ** 2
print("Elementwise square:", a_squared)

mean_value = np.mean(a)
print("Mean of original array:", mean_value)

mean_squared = np.mean(a_squared)
print("Mean of squared array:", mean_squared)

Original array: [1 2 3 4]
Elementwise square: [ 1  4  9 16]
Mean of original array: 2.5
Mean of squared array: 7.5


## Part B — Descriptive Statistics (2 questions)

### Q6 — Mean/Median/Mode/Std
For `data = [1,2,2,3,4,4,4,5]`, compute mean, median, mode, and population std.

In [None]:
data = [1, 2, 2, 3, 4, 4, 4, 5]
print("Data:", data)

mean = np.mean(data)
print(f"Mean: {mean}")

median = np.median(data)
print(f"Median: {median}")

mode_result = stats.mode(data, keepdims=True)
mode_value = mode_result.mode[0]
mode_count = mode_result.count[0]
print(f"Mode: {mode_value} (appears {mode_count} times)")

pop_std = np.std(data, ddof=0)
print(f"Population standard deviation: {pop_std}")

sample_std = np.std(data, ddof=1)
print(f"Sample standard deviation: {sample_std}")

Data: [1, 2, 2, 3, 4, 4, 4, 5]
Mean: 3.125
Median: 3.5
Mode: 4 (appears 3 times)
Population standard deviation: 1.2686114456365274
Sample standard deviation: 1.3562026818605375
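
The two standard deviations differ only in the divisor. As a rough cross-check using the standard library alone, the sketch below computes the shared sum of squared deviations once and divides by `n` (population, `ddof=0`) or `n - 1` (sample, `ddof=1`).

```python
import math
import statistics

data = [1, 2, 2, 3, 4, 4, 4, 5]

mean = statistics.mean(data)
# Sum of squared deviations from the mean, shared by both formulas.
ss = sum((v - mean) ** 2 for v in data)

pop_std = math.sqrt(ss / len(data))            # divide by n      (ddof=0)
sample_std = math.sqrt(ss / (len(data) - 1))   # divide by n - 1  (ddof=1)

print("Mean:", mean, "Median:", statistics.median(data), "Mode:", statistics.mode(data))
print("Population std:", pop_std)
print("Sample std:    ", sample_std)
```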


### Q7 — Quartiles & simple outlier rule
For `x = [10,11,12,13,14,50]`, find Q1, Q3, and IQR. Mark values greater than `Q3 + 1.5×IQR` as outliers.

In [None]:
x = [10, 11, 12, 13, 14, 50]
print("Data:", x)

Q1 = np.percentile(x, 25)
Q3 = np.percentile(x, 75)
IQR = Q3 - Q1

print(f"Q1 (25th percentile): {Q1}")
print(f"Q3 (75th percentile): {Q3}")
print(f"IQR (Interquartile Range): {IQR}")

outlier_threshold = Q3 + 1.5 * IQR
print(f"Outlier threshold (Q3 + 1.5×IQR): {outlier_threshold}")

outliers = []
normal_values = []

for value in x:
    if value > outlier_threshold:
        outliers.append(value)
    else:
        normal_values.append(value)

print(f"Values greater than {outlier_threshold}: {outliers}")
print(f"Normal values: {normal_values}")

if outliers:
    print(f"Outliers detected: {outliers}")
else:
    print("No outliers detected")

Data: [10, 11, 12, 13, 14, 50]
Q1 (25th percentile): 11.25
Q3 (75th percentile): 13.75
IQR (Interquartile Range): 2.5
Outlier threshold (Q3 + 1.5×IQR): 17.5
Values greater than 17.5: [50]
Normal values: [10, 11, 12, 13, 14]
Outliers detected: [50]
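
`np.percentile` interpolates linearly between neighbouring sorted values by default, which is why Q1 and Q3 are not values from the data. The sketch below reproduces that rule by hand; `percentile_linear` is a hypothetical helper, and the lower fence `Q1 - 1.5×IQR` is the usual companion to the upper fence rather than something the question asks for.

```python
def percentile_linear(values, q):
    """q-th percentile (0-100) by linear interpolation between sorted neighbours."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100      # fractional index into the sorted data
    lo = int(pos)                     # index of the lower neighbour
    hi = min(lo + 1, len(s) - 1)      # index of the upper neighbour
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

x = [10, 11, 12, 13, 14, 50]
q1 = percentile_linear(x, 25)   # position 1.25 -> 11 + 0.25 * (12 - 11)
q3 = percentile_linear(x, 75)   # position 3.75 -> 13 + 0.75 * (14 - 13)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(q1, q3, iqr, lower_fence, upper_fence)
```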


## Part C — Math for Data Science (2 questions)

### Q8 — Simple probability
A fair six-sided die is rolled once. Compute the probability of getting an even number.

In [None]:
total_outcomes = 6
print(f"Total possible outcomes: {total_outcomes}")

even_numbers = [2, 4, 6]
favorable_outcomes = len(even_numbers)
print(f"Even numbers on die: {even_numbers}")
print(f"Number of favorable outcomes (even numbers): {favorable_outcomes}")

probability = favorable_outcomes / total_outcomes
print(f"Probability of getting an even number: {probability}")

print(f"Probability as fraction: {favorable_outcomes}/{total_outcomes}")

percentage = probability * 100
print(f"Probability as percentage: {percentage}%")

Total possible outcomes: 6
Even numbers on die: [2, 4, 6]
Number of favorable outcomes (even numbers): 3
Probability of getting an even number: 0.5
Probability as fraction: 3/6
Probability as percentage: 50.0%
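
For an exact answer rather than a float, the counting argument can be sketched with `fractions.Fraction`, which reduces 3/6 to 1/2 automatically. The names `favorable` and `p_even` are illustrative.

```python
from fractions import Fraction

outcomes = range(1, 7)                            # faces of a fair six-sided die
favorable = [o for o in outcomes if o % 2 == 0]   # the even faces

# Equally likely outcomes: probability = favorable count / total count.
p_even = Fraction(len(favorable), len(outcomes))
print(p_even, float(p_even))
```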


### Q9 — Z-score
Given `x = 70`, mean = 60, std = 10, compute the z-score.

In [None]:
x = 70
mean = 60
std = 10

print(f"Value: {x}")
print(f"Mean: {mean}")
print(f"Standard deviation: {std}")

z_score = (x - mean) / std

print(f"Z-score calculation: ({x} - {mean}) / {std} = {z_score}")

if z_score > 0:
    print(f"Interpretation: The value {x} is {z_score} standard deviations ABOVE the mean.")
elif z_score < 0:
    print(f"Interpretation: The value {x} is {abs(z_score)} standard deviations BELOW the mean.")
else:
    print(f"Interpretation: The value {x} is exactly at the mean.")

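# Sanity check: inverting the z-score should recover the original value.
# (stats.zscore estimates the mean and std from its input, so it cannot be
# applied to a single value whose mean and std are given.)
x_check = mean + z_score * std
print(f"Check: mean + z * std = {x_check}")

Value: 70
Mean: 60
Standard deviation: 10
Z-score calculation: (70 - 60) / 10 = 1.0
Interpretation: The value 70 is 1.0 standard deviations ABOVE the mean.
Check: mean + z * std = 70.0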
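
When the mean and standard deviation are not given, they are estimated from the sample itself; `scipy.stats.zscore` follows this convention with `ddof=0` by default. A small NumPy sketch on a hypothetical three-value sample shows the idea, and why a one-element sample has no meaningful z-score.

```python
import numpy as np

scores = np.array([50.0, 60.0, 70.0])   # hypothetical sample

# Standardise with the sample's own mean and population std (ddof=0).
z = (scores - scores.mean()) / scores.std(ddof=0)
print(z)

# A one-element sample has zero spread, so its z-score would be 0 / 0.
single = np.array([70.0])
print(single.std(ddof=0))
```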


## Part D — Basic Linear Algebra (2 questions)

### Q10 — Dot product & norm
Let `u = [3,4]`, `v = [1,2]`. Compute the dot product `u·v` and the Euclidean norm of `u`.

In [None]:
u = np.array([3, 4])
v = np.array([1, 2])

print(f"Vector u: {u}")
print(f"Vector v: {v}")

dot_product = np.dot(u, v)
print(f"Dot product u·v: {dot_product}")

norm_u = np.linalg.norm(u)
print(f"Euclidean norm of u: {norm_u}")

Vector u: [3 4]
Vector v: [1 2]
Dot product u·v: 11
Euclidean norm of u: 5.0
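
Both quantities are short sums, so they can be checked without NumPy. The sketch below also computes the cosine of the angle between `u` and `v`, which is an extra beyond what the question asks.

```python
import math

u = [3, 4]
v = [1, 2]

dot = sum(ui * vi for ui, vi in zip(u, v))     # 3*1 + 4*2
norm_u = math.sqrt(sum(ui ** 2 for ui in u))   # sqrt(3^2 + 4^2)
norm_v = math.sqrt(sum(vi ** 2 for vi in v))   # sqrt(1^2 + 2^2)

# Cosine of the angle between u and v.
cos_uv = dot / (norm_u * norm_v)
print(dot, norm_u, round(cos_uv, 4))
```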


### Q11 — Solve a 2×2 system
Solve `A x = b` for `x` where `A = [[2,1],[1,3]]` and `b = [5,7]`.

In [None]:
A = np.array([[2, 1],
              [1, 3]])
b = np.array([5, 7])

print("Matrix A:")
print(A)
print(f"Vector b: {b}")
print()

x = np.linalg.solve(A, b)
print("Solution:")
print(f"x = {x}")
print(f"x₁ = {x[0]}, x₂ = {x[1]}")
print()

det_A = np.linalg.det(A)
print(f"Determinant of A: {det_A}")

x1_cramer = np.linalg.det([[b[0], A[0,1]], [b[1], A[1,1]]]) / det_A
x2_cramer = np.linalg.det([[A[0,0], b[0]], [A[1,0], b[1]]]) / det_A

print("Solution using Cramer's rule:")
print(f"x₁ = {x1_cramer}, x₂ = {x2_cramer}")
print()

verification = A @ x
print("Verification (A @ x should equal b):")
print(f"A @ x = {verification}")
print(f"b     = {b}")
print(f"Equal? {np.allclose(verification, b)}")

Matrix A:
[[2 1]
 [1 3]]
Vector b: [5 7]

Solution:
x = [1.6 1.8]
x₁ = 1.6, x₂ = 1.8

Determinant of A: 5.000000000000001
Solution using Cramer's rule:
x₁ = 1.5999999999999988, x₂ = 1.8

Verification (A @ x should equal b):
A @ x = [5. 7.]
b     = [5 7]
Equal? True
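
Floating-point determinants are only approximately exact. As a sketch of the same Cramer's-rule calculation in exact rational arithmetic, the block below uses `fractions.Fraction`; the scalar names `a11` ... `b2` are illustrative stand-ins for the entries of `A` and `b`.

```python
from fractions import Fraction

# Entries of A = [[2, 1], [1, 3]] and b = [5, 7].
a11, a12 = Fraction(2), Fraction(1)
a21, a22 = Fraction(1), Fraction(3)
b1, b2 = Fraction(5), Fraction(7)

det = a11 * a22 - a12 * a21          # 2*3 - 1*1 = 5
x1 = (b1 * a22 - a12 * b2) / det     # (5*3 - 1*7) / 5
x2 = (a11 * b2 - b1 * a21) / det     # (2*7 - 5*1) / 5
print(det, x1, x2)
```

The exact solution is 8/5 and 9/5, that is, 1.6 and 1.8.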


## Part E — Intro ML (9 questions)

### Q12 — Linear regression (tiny)
Generate 50 points: `y = 2x + 1 + noise` with `x ~ U[0,5]` and Gaussian noise with mean 0 and standard deviation 0.5. Fit `LinearRegression` and print the slope and intercept.

In [None]:
np.random.seed(42)

n_samples = 50

x = np.random.uniform(0, 5, n_samples)

noise = np.random.normal(0, 0.5, n_samples)

y = 2 * x + 1 + noise

print(f"Generated {n_samples} data points")
print(f"x range: [{x.min():.2f}, {x.max():.2f}]")
print(f"y range: [{y.min():.2f}, {y.max():.2f}]")
print(f"True relationship: y = 2x + 1 + noise")
print()

X = x.reshape(-1, 1)

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_

print("Linear Regression Results:")
print(f"Fitted equation: y = {slope:.4f}x + {intercept:.4f}")
print(f"Slope (coefficient): {slope:.4f} (true value: 2.0)")
print(f"Intercept: {intercept:.4f} (true value: 1.0)")
print()

slope_error = abs(slope - 2.0)
intercept_error = abs(intercept - 1.0)
print(f"Slope error: {slope_error:.4f}")
print(f"Intercept error: {intercept_error:.4f}")

globals()['X_q12'] = X
globals()['y_q12'] = y
globals()['model_q12'] = model

Generated 50 data points
x range: [0.10, 4.85]
y range: [1.14, 11.27]
True relationship: y = 2x + 1 + noise

Linear Regression Results:
Fitted equation: y = 1.9777x + 1.0483
Slope (coefficient): 1.9777 (true value: 2.0)
Intercept: 1.0483 (true value: 1.0)

Slope error: 0.0223
Intercept error: 0.0483
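
With a single feature, the least-squares line has a closed form: the slope is the covariance of `x` and `y` divided by the variance of `x`, and the intercept follows from the means. The sketch below applies it to freshly generated data; the seed and the `_demo` names are assumptions for illustration, so the numbers will not match the cell above.

```python
import numpy as np

rng_demo = np.random.default_rng(0)      # assumed seed, for illustration only
x_demo = rng_demo.uniform(0, 5, 50)
y_demo = 2 * x_demo + 1 + rng_demo.normal(0, 0.5, 50)

# Closed-form least-squares estimates for one feature.
x_bar, y_bar = x_demo.mean(), y_demo.mean()
slope_cf = np.sum((x_demo - x_bar) * (y_demo - y_bar)) / np.sum((x_demo - x_bar) ** 2)
intercept_cf = y_bar - slope_cf * x_bar

# np.polyfit with degree 1 solves the same least-squares problem.
slope_np, intercept_np = np.polyfit(x_demo, y_demo, 1)
print(slope_cf, intercept_cf)
```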


### Q13 — Linear regression R²
Using the model from Q12, compute and print the R² score.

In [None]:
X = globals().get('X_q12')
y = globals().get('y_q12')
model = globals().get('model_q12')

if X is None or y is None or model is None:
    print("Error: Please run Q12 first to generate the data and model!")
else:
    print("Using data and model from Q12...")
    print(f"Data shape: X={X.shape}, y={y.shape}")
    print()

    r2_score_method1 = model.score(X, y)
    print(f"R² score: {r2_score_method1:.4f}")
    print()

    print("R² Interpretation:")
    print(f"R² = {r2_score_method1:.4f} means {r2_score_method1*100:.2f}% of the variance in y is explained by x")

    if r2_score_method1 > 0.8:
        print("This indicates a very good fit!")
    elif r2_score_method1 > 0.6:
        print("This indicates a good fit.")
    elif r2_score_method1 > 0.4:
        print("This indicates a moderate fit.")
    else:
        print("This indicates a poor fit.")

    print()
    print("Additional metrics:")
    # Residuals must be measured against the model's predictions, not the R² value.
    y_fit = model.predict(X)
    mse = np.mean((y - y_fit) ** 2)
    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Root Mean Squared Error (RMSE): {np.sqrt(mse):.4f}")

    globals()['r2_score_q13'] = r2_score_method1

Using data and model from Q12...
Data shape: X=(50, 1), y=(50,)

R² score: 0.9749

R² Interpretation:
R² = 0.9749 means 97.49% of the variance in y is explained by x
This indicates a very good fit!
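
R² can also be computed from its definition, `1 - SS_res / SS_tot`, where both sums are taken over the fitted data. The sketch below does this on freshly generated data (the seed and the `_demo` names are assumptions), and derives MSE and RMSE from the same residuals.

```python
import numpy as np

rng_demo = np.random.default_rng(0)      # assumed seed, for illustration only
x_demo = rng_demo.uniform(0, 5, 50)
y_demo = 2 * x_demo + 1 + rng_demo.normal(0, 0.5, 50)

slope_demo, intercept_demo = np.polyfit(x_demo, y_demo, 1)
y_hat = slope_demo * x_demo + intercept_demo

ss_res = np.sum((y_demo - y_hat) ** 2)            # unexplained variation
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)    # total variation
r2 = 1 - ss_res / ss_tot

mse = ss_res / len(y_demo)    # mean squared residual
rmse = np.sqrt(mse)
print(r2, mse, rmse)
```

For a one-feature least-squares fit, `r2` equals the squared Pearson correlation between `x_demo` and `y_demo`, which gives a quick consistency check.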


### Q14 — Logistic regression (easy)
Create a simple binary dataset with `make_moons(n_samples=400, noise=0.2)`. Train/test split 80/20. Fit `LogisticRegression` and print accuracy.

In [None]:
X, y = make_moons(n_samples=400, noise=0.2, random_state=42)

print(f"Dataset shape: X={X.shape}, y={y.shape}")
print(f"Number of samples: {len(X)}")
print(f"Number of features: {X.shape[1]}")
print(f"Classes: {np.unique(y)}")
print(f"Class distribution: {np.bincount(y)}")
print()

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"Training set: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Test set: X_test={X_test.shape}, y_test={y_test.shape}")
print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")
print()

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Logistic Regression Results:")
print(f"Test Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print()

# Extra metrics used below (classification_report was imported but never used).
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Additional Classification Metrics:")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print()

print("Sample predictions (first 5 test samples):")
for i in range(min(5, len(y_test))):
    actual = y_test[i]
    predicted = y_pred[i]
    prob_class_0 = y_pred_proba[i][0]
    prob_class_1 = y_pred_proba[i][1]
    print(f"Sample {i+1}: Actual={actual}, Predicted={predicted}, Prob(0)={prob_class_0:.3f}, Prob(1)={prob_class_1:.3f}")

globals()['X_test_q14'] = X_test
globals()['y_test_q14'] = y_test
globals()['model_q14'] = model

Dataset shape: X=(400, 2), y=(400,)
Number of samples: 400
Number of features: 2
Classes: [0 1]
Class distribution: [200 200]

Training set: X_train=(320, 2), y_train=(320,)
Test set: X_test=(80, 2), y_test=(80,)
Train class distribution: [160 160]
Test class distribution: [40 40]

Logistic Regression Results:
Test Accuracy: 0.9375 (93.75%)

Additional Classification Metrics:
Precision: 0.9268
Recall: 0.9500
F1-score: 0.9383

Sample predictions (first 5 test samples):
Sample 1: Actual=0, Predicted=1, Prob(0)=0.484, Prob(1)=0.516
Sample 2: Actual=1, Predicted=1, Prob(0)=0.281, Prob(1)=0.719
Sample 3: Actual=1, Predicted=1, Prob(0)=0.036, Prob(1)=0.964
Sample 4: Actual=0, Predicted=0, Prob(0)=0.965, Prob(1)=0.035
Sample 5: Actual=0, Predicted=0, Prob(0)=0.746, Prob(1)=0.254
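
Accuracy, precision, recall, and F1 all come from four confusion-matrix counts. The counts below are not printed by the notebook; they are inferred from the reported metrics on the 80 test samples (40 per class) and should be read as an assumption.

```python
# Counts inferred from the reported metrics (an assumption, not notebook output).
tp, fn = 38, 2    # actual class 1: predicted 1 / predicted 0
fp, tn = 3, 37    # actual class 0: predicted 1 / predicted 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)     # of the predicted positives, how many are right
recall = tp / (tp + fn)        # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
```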


### Q15 — k-NN (Iris)
Load Iris. Split 80/20. Fit `KNeighborsClassifier(n_neighbors=3)` and print accuracy.

In [None]:
iris = load_iris()
X, y = iris.data, iris.target

print("Iris Dataset Information:")
print(f"Dataset shape: X={X.shape}, y={y.shape}")
print(f"Number of samples: {len(X)}")
print(f"Number of features: {X.shape[1]}")
print(f"Feature names: {iris.feature_names}")
print(f"Classes: {iris.target_names}")
print(f"Class labels: {np.unique(y)}")
print(f"Class distribution: {np.bincount(y)}")
print()

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"Training set: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Test set: X_test={X_test.shape}, y_test={y_test.shape}")
print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")
print()

knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)

y_pred = knn_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("k-NN Classification Results:")
print(f"k (number of neighbors): {knn_model.n_neighbors}")
print(f"Test Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print()

print("Sample predictions (first 10 test samples):")
print("Sample | Actual | Predicted | Actual Name      | Predicted Name")
print("-" * 65)
for i in range(min(10, len(y_test))):
    actual = y_test[i]
    predicted = y_pred[i]
    actual_name = iris.target_names[actual]
    predicted_name = iris.target_names[predicted]
    match = "✓" if actual == predicted else "✗"
    print(f"{i+1:6d} | {actual:6d} | {predicted:9d} | {actual_name:15s} | {predicted_name:15s} {match}")

print()
print(f"Correct predictions in sample: {np.sum(y_pred[:10] == y_test[:10])}/10")

print("\nAdditional k-NN Information:")
print(f"Distance metric: {knn_model.metric}")
print(f"Algorithm used: {knn_model.algorithm}")

globals()['X_test_q15'] = X_test
globals()['y_test_q15'] = y_test
globals()['y_pred_q15'] = y_pred
globals()['knn_model_q15'] = knn_model
globals()['iris_target_names'] = iris.target_names

Iris Dataset Information:
Dataset shape: X=(150, 4), y=(150,)
Number of samples: 150
Number of features: 4
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Classes: ['setosa' 'versicolor' 'virginica']
Class labels: [0 1 2]
Class distribution: [50 50 50]

Training set: X_train=(120, 4), y_train=(120,)
Test set: X_test=(30, 4), y_test=(30,)
Train class distribution: [40 40 40]
Test class distribution: [10 10 10]

k-NN Classification Results:
k (number of neighbors): 3
Test Accuracy: 1.0000 (100.00%)

Sample predictions (first 10 test samples):
Sample | Actual | Predicted | Actual Name      | Predicted Name
-----------------------------------------------------------------
     1 |      0 |         0 | setosa          | setosa          ✓
     2 |      2 |         2 | virginica       | virginica       ✓
     3 |      1 |         1 | versicolor      | versicolor      ✓
     4 |      1 |         1 | versicolor      | versicolor      ✓
     5 | 
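
A k-NN prediction is a majority vote among the closest training points. The sketch below is a bare-bones version under the Euclidean metric, which is also what `KNeighborsClassifier` uses by default; `knn_predict` and the toy arrays are hypothetical, and tie-breaking may differ from scikit-learn's.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Majority vote among the k training points closest to x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k smallest
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Tiny hypothetical data set: two well-separated groups in 2D.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3])))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0])))
```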

### Q16 — Confusion matrix (Iris + k-NN)
Using the model from Q15, print the confusion matrix.

In [None]:
y_test = globals().get('y_test_q15')
y_pred = globals().get('y_pred_q15')
target_names = globals().get('iris_target_names')

if y_test is None or y_pred is None:
    print("Error: Please run Q15 first to generate the k-NN model and predictions!")
else:
    print("Using k-NN predictions from Q15...")
    print()

    cm = confusion_matrix(y_test, y_pred)

    print("Confusion Matrix:")
    print(cm)
    print()

    print("Confusion Matrix with Class Names:")
    print(f"{'':>12}", end="")
    for name in target_names:
        print(f"{name:>12}", end="")
    print()

    for i, name in enumerate(target_names):
        print(f"{name:>12}", end="")
        for j in range(len(target_names)):
            print(f"{cm[i,j]:>12}", end="")
        print()
    print()

    print("Confusion Matrix Analysis:")
    print("-" * 50)

    total_samples = np.sum(cm)
    correct_predictions = np.trace(cm)
    overall_accuracy = correct_predictions / total_samples

    print(f"Total test samples: {total_samples}")
    print(f"Correct predictions: {correct_predictions}")
    print(f"Overall accuracy: {overall_accuracy:.4f} ({overall_accuracy*100:.2f}%)")
    print()

    print("Per-Class Analysis:")
    print(f"{'Class':>12} {'Precision':>10} {'Recall':>8} {'F1-Score':>10} {'Support':>8}")
    print("-" * 58)

    for i, class_name in enumerate(target_names):
        tp = cm[i, i]
        fp = np.sum(cm[:, i]) - tp
        fn = np.sum(cm[i, :]) - tp

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        support = np.sum(cm[i, :])

        print(f"{class_name:>12} {precision:>10.4f} {recall:>8.4f} {f1:>10.4f} {support:>8}")

    print()

    print("Misclassification Analysis:")
    misclassified = 0
    for i in range(len(target_names)):
        for j in range(len(target_names)):
            if i != j and cm[i, j] > 0:
                actual_class = target_names[i]
                predicted_class = target_names[j]
                count = cm[i, j]
                misclassified += count
                print(f"  {actual_class} → {predicted_class}: {count} sample(s)")

    if misclassified == 0:
        print("  No misclassifications! Perfect prediction.")
    else:
        print(f"  Total misclassified: {misclassified}")

    print()

    cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100

    print("Confusion Matrix (Percentages by Row):")
    print(f"{'':>12}", end="")
    for name in target_names:
        print(f"{name:>12}", end="")
    print()

    for i, name in enumerate(target_names):
        print(f"{name:>12}", end="")
        for j in range(len(target_names)):
            print(f"{cm_percent[i,j]:>11.1f}%", end="")
        print()

    print("\nNote: Each row shows how the samples of one actual class are distributed across the predicted classes.")

Using k-NN predictions from Q15...

Confusion Matrix:
[[10  0  0]
 [ 0 10  0]
 [ 0  0 10]]

Confusion Matrix with Class Names:
                  setosa  versicolor   virginica
      setosa          10           0           0
  versicolor           0          10           0
   virginica           0           0          10

Confusion Matrix Analysis:
--------------------------------------------------
Total test samples: 30
Correct predictions: 30
Overall accuracy: 1.0000 (100.00%)

Per-Class Analysis:
       Class  Precision   Recall   F1-Score  Support
----------------------------------------------------------
      setosa     1.0000   1.0000     1.0000       10
  versicolor     1.0000   1.0000     1.0000       10
   virginica     1.0000   1.0000     1.0000       10

Misclassification Analysis:
  No misclassifications! Perfect prediction.

Confusion Matrix (Percentages by Row):
                  setosa  versicolor   virginica
      setosa      100.0%        0.0%        0.0%
  versicolor

### Q17 — Decision Tree (easy)
On Iris, fit a small `DecisionTreeClassifier(max_depth=3)` and print test accuracy.

In [None]:
iris = load_iris()
X, y = iris.data, iris.target

print("Decision Tree on Iris Dataset:")
print(f"Dataset shape: X={X.shape}, y={y.shape}")
print(f"Features: {iris.feature_names}")
print(f"Classes: {iris.target_names}")
print()

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"Training set: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Test set: X_test={X_test.shape}, y_test={y_test.shape}")
print()

dt_model = DecisionTreeClassifier(
    max_depth=3,
    random_state=42,
    criterion='gini'
)
dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Decision Tree Results:")
print(f"Max depth: {dt_model.max_depth}")
print(f"Criterion: {dt_model.criterion}")
print(f"Test Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print()

print("Decision Tree Structure:")
print(f"Number of nodes: {dt_model.tree_.node_count}")
print(f"Number of leaves: {dt_model.tree_.n_leaves}")
print(f"Actual tree depth: {dt_model.tree_.max_depth}")
print()

print("Feature Importance:")
feature_importance = dt_model.feature_importances_
for feature_name, importance in zip(iris.feature_names, feature_importance):
    print(f"{feature_name:20s}: {importance:.4f}")
print()

most_important_idx = np.argmax(feature_importance)
most_important_feature = iris.feature_names[most_important_idx]
print(f"Most important feature: {most_important_feature} (importance: {feature_importance[most_important_idx]:.4f})")
print()

print("Sample predictions (first 8 test samples):")
print("Sample | Actual | Predicted | Actual Name      | Predicted Name     | Match")
print("-" * 75)
for i in range(min(8, len(y_test))):
    actual = y_test[i]
    predicted = y_pred[i]
    actual_name = iris.target_names[actual]
    predicted_name = iris.target_names[predicted]
    match = "✓" if actual == predicted else "✗"
    print(f"{i+1:6d} | {actual:6d} | {predicted:9d} | {actual_name:15s} | {predicted_name:15s}  | {match}")

print()
print(f"Correct predictions in sample: {np.sum(y_pred[:8] == y_test[:8])}/8")

knn_accuracy = None
if 'y_pred_q15' in globals() and 'y_test_q15' in globals():
    y_test_knn = globals()['y_test_q15']
    y_pred_knn = globals()['y_pred_q15']
    if np.array_equal(y_test, y_test_knn):  # Same test set
        knn_accuracy = accuracy_score(y_test_knn, y_pred_knn)
        print(f"\nComparison with k-NN (Q15):")
        print(f"Decision Tree accuracy: {accuracy:.4f}")
        print(f"k-NN accuracy:          {knn_accuracy:.4f}")
        if accuracy > knn_accuracy:
            print("Decision Tree performs better!")
        elif accuracy < knn_accuracy:
            print("k-NN performs better!")
        else:
            print("Both models perform equally!")

globals()['dt_model_q17'] = dt_model
globals()['dt_accuracy_q17'] = accuracy

Decision Tree on Iris Dataset:
Dataset shape: X=(150, 4), y=(150,)
Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Classes: ['setosa' 'versicolor' 'virginica']

Training set: X_train=(120, 4), y_train=(120,)
Test set: X_test=(30, 4), y_test=(30,)

Decision Tree Results:
Max depth: 3
Criterion: gini
Test Accuracy: 0.9667 (96.67%)

Decision Tree Structure:
Number of nodes: 9
Number of leaves: 5
Actual tree depth: 3

Feature Importance:
sepal length (cm)   : 0.0000
sepal width (cm)    : 0.0000
petal length (cm)   : 0.5791
petal width (cm)    : 0.4209

Most important feature: petal length (cm) (importance: 0.5791)

Sample predictions (first 8 test samples):
Sample | Actual | Predicted | Actual Name      | Predicted Name     | Match
---------------------------------------------------------------------------
     1 |      0 |         0 | setosa          | setosa           | ✓
     2 |      2 |         2 | virginica       | virginica        | ✓
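
With `criterion='gini'`, each candidate split is scored by the size-weighted Gini impurity of its two children. The sketch below computes that score for a balanced 40/40/40 label set and a hypothetical split that isolates class 0; `gini` and `split_impurity` are illustrative helpers, not scikit-learn functions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of a two-way split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = [0] * 40 + [1] * 40 + [2] * 40          # balanced three-class node
left, right = [0] * 40, [1] * 40 + [2] * 40      # hypothetical split isolating class 0

print(gini(parent))                  # impurity before the split
print(split_impurity(left, right))   # impurity after the split
```

A tree learner prefers the split that lowers this weighted impurity the most.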
   

### Q18 — Naive Bayes (Iris)
Train `GaussianNB` on Iris with an 80/20 split. Print accuracy.

In [None]:
iris = load_iris()
X, y = iris.data, iris.target

print("Naive Bayes on Iris Dataset:")
print(f"Dataset shape: X={X.shape}, y={y.shape}")
print(f"Features: {iris.feature_names}")
print(f"Classes: {iris.target_names}")
print(f"Class distribution: {np.bincount(y)}")
print()

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"Training set: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Test set: X_test={X_test.shape}, y_test={y_test.shape}")
print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")
print()

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

y_pred = nb_model.predict(X_test)
y_pred_proba = nb_model.predict_proba(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Gaussian Naive Bayes Results:")
print(f"Test Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print()

print("Model Parameters:")
print(f"Classes: {nb_model.classes_}")
print(f"Number of training samples per class: {nb_model.class_count_}")
print()

print("Feature Statistics (means for each class):")
print(f"{'Feature':25s}", end="")
for class_name in iris.target_names:
    print(f"{class_name:>12s}", end="")
print()

for i, feature_name in enumerate(iris.feature_names):
    print(f"{feature_name:25s}", end="")
    for j, class_idx in enumerate(nb_model.classes_):
        mean_value = nb_model.theta_[j, i]
        print(f"{mean_value:>12.2f}", end="")
    print()

print()

print("Sample predictions (first 8 test samples):")
print("Sample | Actual | Predicted | Actual Name      | Predicted Name     | Max Prob | Match")
print("-" * 85)
for i in range(min(8, len(y_test))):
    actual = y_test[i]
    predicted = y_pred[i]
    actual_name = iris.target_names[actual]
    predicted_name = iris.target_names[predicted]
    max_prob = np.max(y_pred_proba[i])
    match = "✓" if actual == predicted else "✗"
    print(f"{i+1:6d} | {actual:6d} | {predicted:9d} | {actual_name:15s} | {predicted_name:15s}  | {max_prob:8.3f} | {match}")

print()
print(f"Correct predictions in sample: {np.sum(y_pred[:8] == y_test[:8])}/8")

print("\nComparison with Previous Models:")
dt_accuracy = globals().get('dt_accuracy_q17')
if dt_accuracy is not None:
    print(f"Naive Bayes accuracy:   {accuracy:.4f}")
    print(f"Decision Tree accuracy: {dt_accuracy:.4f}")
    if accuracy > dt_accuracy:
        print("Naive Bayes performs better than Decision Tree!")
    elif accuracy < dt_accuracy:
        print("Decision Tree performs better than Naive Bayes!")
    else:
        print("Both Naive Bayes and Decision Tree perform equally!")

if 'y_pred_q15' in globals() and 'y_test_q15' in globals():
    y_test_knn = globals()['y_test_q15']
    y_pred_knn = globals()['y_pred_q15']
    if np.array_equal(y_test, y_test_knn):  # Same test set
        knn_accuracy = accuracy_score(y_test_knn, y_pred_knn)
        print(f"k-NN accuracy:          {knn_accuracy:.4f}")

globals()['nb_model_q18'] = nb_model
globals()['nb_accuracy_q18'] = accuracy
globals()['X_test_q18'] = X_test
globals()['y_test_q18'] = y_test
globals()['y_pred_q18'] = y_pred

Naive Bayes on Iris Dataset:
Dataset shape: X=(150, 4), y=(150,)
Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Classes: ['setosa' 'versicolor' 'virginica']
Class distribution: [50 50 50]

Training set: X_train=(120, 4), y_train=(120,)
Test set: X_test=(30, 4), y_test=(30,)
Train class distribution: [40 40 40]
Test class distribution: [10 10 10]

Gaussian Naive Bayes Results:
Test Accuracy: 0.9667 (96.67%)

Model Parameters:
Classes: [0 1 2]
Number of training samples per class: [40. 40. 40.]

Feature Statistics (means for each class):
Feature                        setosa  versicolor   virginica
sepal length (cm)                4.99        5.93        6.61
sepal width (cm)                 3.41        2.75        2.98
petal length (cm)                1.48        4.25        5.58
petal width (cm)                 0.25        1.32        2.04

Sample predictions (first 8 test samples):
Sample | Actual | Predicted | Actual Name      | Predicted

### Q19 — k-Means (blobs)
Generate 3 clusters with `make_blobs(n_samples=300)`. Fit `KMeans(n_clusters=3)` and print inertia.

In [None]:
X, y_true = make_blobs(
    n_samples=300,
    centers=3,
    random_state=42,
    cluster_std=1.5
)

print("k-Means Clustering on Blob Dataset:")
print(f"Dataset shape: X={X.shape}")
print(f"Number of samples: {len(X)}")
print(f"Number of features: {X.shape[1]}")
print(f"True number of clusters: 3")
print(f"X range: [{X.min():.2f}, {X.max():.2f}]")
print()

kmeans = KMeans(
    n_clusters=3,
    random_state=42,
    n_init=10,
    max_iter=300
)
y_pred = kmeans.fit_predict(X)

inertia = kmeans.inertia_
print("k-Means Results:")
print(f"Number of clusters (k): {kmeans.n_clusters}")
print(f"Inertia (WCSS): {inertia:.2f}")
print()

print("Clustering Analysis:")
print(f"Cluster centers:")
for i, center in enumerate(kmeans.cluster_centers_):
    print(f"  Cluster {i}: [{center[0]:.2f}, {center[1]:.2f}]")
print()

unique_labels, counts = np.unique(y_pred, return_counts=True)
print("Samples per cluster:")
for label, count in zip(unique_labels, counts):
    print(f"  Cluster {label}: {count} samples")
print()

print("True vs Predicted Cluster Comparison:")
print(f"True clusters distribution: {np.bincount(y_true)}")
print(f"Predicted clusters distribution: {np.bincount(y_pred)}")
print()

silhouette_avg = silhouette_score(X, y_pred)
print(f"Silhouette Score: {silhouette_avg:.4f}")
print("(Silhouette score ranges from -1 to 1, where 1 indicates perfect clustering)")
print()

print("Sample cluster assignments (first 10 points):")
print("Point | X-coord | Y-coord | True Cluster | Predicted Cluster | Match")
print("-" * 70)
for i in range(min(10, len(X))):
    x_coord = X[i, 0]
    y_coord = X[i, 1]
    true_cluster = y_true[i]
    pred_cluster = y_pred[i]

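    # The Match column stays blank: k-means cluster indices are arbitrary, so a
    # predicted label of 1 can correspond to true cluster 2 without any error.
    print(f"{i+1:5d} | {x_coord:7.2f} | {y_coord:7.2f} | {true_cluster:12d} | {pred_cluster:17d} |")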

print()

distances_to_centers = np.sqrt(((X - kmeans.cluster_centers_[y_pred])**2).sum(axis=1))
print("Cluster Quality Metrics:")
print(f"Mean distance to cluster center: {np.mean(distances_to_centers):.3f}")
print(f"Max distance to cluster center: {np.max(distances_to_centers):.3f}")
print(f"Min distance to cluster center: {np.min(distances_to_centers):.3f}")

print()
print("k-Means Algorithm Details:")
print(f"Number of iterations: {kmeans.n_iter_}")
print(f"Algorithm used: {kmeans.algorithm}")
print(f"Initialization method: {kmeans.init}")

# Keep the Q19 results under distinct names so later cells can reuse them.
kmeans_model_q19 = kmeans
X_blobs_q19 = X
y_true_q19 = y_true
y_pred_q19 = y_pred
inertia_q19 = inertia

k-Means Clustering on Blob Dataset:
Dataset shape: X=(300, 2)
Number of samples: 300
Number of features: 2
True number of clusters: 3
X range: [-10.59, 13.09]

k-Means Results:
Number of clusters (k): 3
Inertia (WCSS): 1275.43

Clustering Analysis:
Cluster centers:
  Cluster 0: [-2.70, 9.06]
  Cluster 1: [-6.89, -7.04]
  Cluster 2: [4.80, 2.03]

Samples per cluster:
  Cluster 0: 100 samples
  Cluster 1: 100 samples
  Cluster 2: 100 samples

True vs Predicted Cluster Comparison:
True clusters distribution: [100 100 100]
Predicted clusters distribution: [100 100 100]

Silhouette Score: 0.7734
(Silhouette score ranges from -1 to 1, where 1 indicates perfect clustering)

Sample cluster assignments (first 10 points):
Point | X-coord | Y-coord | True Cluster | Predicted Cluster | Match
----------------------------------------------------------------------
    1 |   -7.57 |   -8.15 |            2 |                 1 |
    2 |   -8.17 |   -7.46 |            2 |                 1 |
    3 |   -1
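
The sample rows above pair true cluster 2 with predicted cluster 1, which does not by itself indicate an error, because k-means assigns cluster IDs arbitrarily. One label-invariant way to compare a clustering with the ground truth is the Adjusted Rand Index. The sketch below uses small hypothetical label arrays rather than the quiz data.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels: the same grouping, but with the cluster IDs permuted.
true_labels = np.array([2, 2, 0, 0, 1, 1])
pred_labels = np.array([1, 1, 2, 2, 0, 0])

# Comparing raw IDs treats a pure renaming as a total mismatch.
naive_match = float(np.mean(true_labels == pred_labels))

# The Adjusted Rand Index only looks at which points are grouped together.
ari = adjusted_rand_score(true_labels, pred_labels)

print(f"Naive ID match rate: {naive_match:.2f}")
print(f"Adjusted Rand Index: {ari:.2f}")
```

Applied to this question, `adjusted_rand_score(y_true, y_pred)` would give a comparable figure for the blob data.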

### Q20 — Simple Neural Network (MLP)
Use `make_moons(n_samples=400, noise=0.25)`. Split 75/25, fit `MLPClassifier(hidden_layer_sizes=(8,))` and print train/test accuracies.

In [None]:
X, y = make_moons(n_samples=400, noise=0.25, random_state=42)

print("Neural Network (MLP) on Moons Dataset:")
print(f"Dataset shape: X={X.shape}, y={y.shape}")
print(f"Number of samples: {len(X)}")
print(f"Number of features: {X.shape[1]}")
print(f"Classes: {np.unique(y)}")
print(f"Class distribution: {np.bincount(y)}")
print(f"Noise level: 0.25")
print()

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

print(f"Training set: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Test set: X_test={X_test.shape}, y_test={y_test.shape}")
print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")
print()

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling applied (StandardScaler)")
print(f"Original X_train mean: {X_train.mean(axis=0)}")
print(f"Scaled X_train mean: {X_train_scaled.mean(axis=0)}")
print(f"Scaled X_train std: {X_train_scaled.std(axis=0)}")
print()

mlp = MLPClassifier(
    hidden_layer_sizes=(8,),
    random_state=42,
    max_iter=1000,
    alpha=0.01,
    solver='adam',
    learning_rate_init=0.001
)

mlp.fit(X_train_scaled, y_train)

y_train_pred = mlp.predict(X_train_scaled)
y_test_pred = mlp.predict(X_test_scaled)

train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("Multi-Layer Perceptron (MLP) Results:")
print(f"Hidden layer sizes: {mlp.hidden_layer_sizes}")
print(f"Number of layers: {mlp.n_layers_}")
print(f"Number of outputs: {mlp.n_outputs_}")
print()
print(f"Train Accuracy: {train_accuracy:.4f} ({train_accuracy*100:.2f}%)")
print(f"Test Accuracy:  {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
print()

accuracy_diff = train_accuracy - test_accuracy
print("Model Performance Analysis:")
if accuracy_diff > 0.1:
    print(f"Potential overfitting detected (train-test gap: {accuracy_diff:.4f})")
elif accuracy_diff < -0.05:
    print(f"Unusual: test accuracy higher than train accuracy (gap: {accuracy_diff:.4f})")
else:
    print(f"Good generalization (train-test gap: {accuracy_diff:.4f})")
print()

print("Training Details:")
print(f"Number of iterations: {mlp.n_iter_}")
print(f"Solver: {mlp.solver}")
print(f"Learning rate: {mlp.learning_rate_init}")
print(f"Alpha (regularization): {mlp.alpha}")
print(f"Converged: {mlp.n_iter_ < mlp.max_iter}")
print()

print("Network Architecture:")
print(f"Input layer: {X_train.shape[1]} neurons (features)")
print(f"Hidden layer 1: {mlp.hidden_layer_sizes[0]} neurons")
print(f"Output layer: {mlp.n_outputs_} neurons (classes)")
total_params = (X_train.shape[1] * mlp.hidden_layer_sizes[0] + mlp.hidden_layer_sizes[0] +
                mlp.hidden_layer_sizes[0] * mlp.n_outputs_ + mlp.n_outputs_)
print(f"Total parameters (approx): {total_params}")
print()

y_test_proba = mlp.predict_proba(X_test_scaled)

print("Sample predictions (first 8 test samples):")
print("Sample | Actual | Predicted | Prob(Class 0) | Prob(Class 1) | Match")
print("-" * 70)
for i in range(min(8, len(y_test))):
    actual = y_test[i]
    predicted = y_test_pred[i]
    prob_0 = y_test_proba[i][0]
    prob_1 = y_test_proba[i][1]
    match = "✓" if actual == predicted else "✗"
    print(f"{i+1:6d} | {actual:6d} | {predicted:9d} | {prob_0:13.3f} | {prob_1:13.3f} | {match}")

print()
print(f"Correct predictions in sample: {np.sum(y_test_pred[:8] == y_test[:8])}/8")

print("\nComparison with Previous Models:")
# Compare only if Q14 left its model and test split in the notebook namespace.
q14_names = ('model_q14', 'X_test_q14', 'y_test_q14')
if all(name in globals() for name in q14_names):
    # Equal shapes do not guarantee the same data, so treat this as a rough check.
    if X_test_q14.shape == X_test.shape:
        logistic_pred = model_q14.predict(X_test_q14)
        logistic_accuracy = accuracy_score(y_test_q14, logistic_pred)
        print(f"MLP accuracy:               {test_accuracy:.4f}")
        print(f"Logistic Regression accuracy: {logistic_accuracy:.4f}")
        if test_accuracy > logistic_accuracy:
            print("MLP performs better than Logistic Regression!")
        elif test_accuracy < logistic_accuracy:
            print("Logistic Regression performs better than MLP!")
        else:
            print("Both models perform equally!")
    else:
        print("MLP trained on different dataset than Logistic Regression")
else:
    print("Q14 results not found; run Q14 first to enable this comparison")

# Keep the Q20 results under distinct names so later cells can reuse them.
mlp_model_q20 = mlp
scaler_q20 = scaler
X_train_q20 = X_train_scaled
X_test_q20 = X_test_scaled
y_train_q20 = y_train
y_test_q20 = y_test
train_accuracy_q20 = train_accuracy
test_accuracy_q20 = test_accuracy

Neural Network (MLP) on Moons Dataset:
Dataset shape: X=(400, 2), y=(400,)
Number of samples: 400
Number of features: 2
Classes: [0 1]
Class distribution: [200 200]
Noise level: 0.25

Training set: X_train=(300, 2), y_train=(300,)
Test set: X_test=(100, 2), y_test=(100,)
Train class distribution: [150 150]
Test class distribution: [50 50]

Feature scaling applied (StandardScaler)
Original X_train mean: [0.5373 0.2641]
Scaled X_train mean: [-0.  0.]
Scaled X_train std: [1. 1.]

Multi-Layer Perceptron (MLP) Results:
Hidden layer sizes: (8,)
Number of layers: 3
Number of outputs: 1

Train Accuracy: 0.8867 (88.67%)
Test Accuracy:  0.9500 (95.00%)

Model Performance Analysis:
Unusual: test accuracy higher than train accuracy (gap: -0.0633)

Training Details:
Number of iterations: 1000
Solver: adam
Learning rate: 0.001
Alpha (regularization): 0.01
Converged: False

Network Architecture:
Input layer: 2 neurons (features)
Hidden layer 1: 8 neurons
Output layer: 1 neurons (classes)
Total parame
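
The run above reports `Converged: False` after 1000 iterations and derives the parameter count from the layer sizes. As a rough cross-check, the sketch below reads both facts from a fitted model: it catches scikit-learn's `ConvergenceWarning` and counts parameters from `coefs_` and `intercepts_`. The names `demo_mlp`, `X_demo`, and `y_demo`, and the deliberately small `max_iter=5`, are illustrative assumptions and not the quiz settings.

```python
import warnings

from sklearn.datasets import make_moons
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_moons(n_samples=400, noise=0.25, random_state=42)

# Deliberately small max_iter (illustrative) so the solver stops at the cap.
demo_mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=5, random_state=42)

# Record warnings instead of printing them, then look for ConvergenceWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    demo_mlp.fit(X_demo, y_demo)

hit_iteration_cap = any(
    issubclass(w.category, ConvergenceWarning) for w in caught
)

# Count parameters from the fitted weight matrices and bias vectors.
n_params = sum(w.size for w in demo_mlp.coefs_) + sum(
    b.size for b in demo_mlp.intercepts_
)

print(f"Stopped at max_iter: {hit_iteration_cap}")
print(f"Total parameters: {n_params}")
```

For 2 inputs, 8 hidden units, and 1 output unit, the count is 2*8 + 8 + 8*1 + 1 = 33, which matches the formula used in the cell above.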



---
### End of Quiz
Save and submit your Colab notebook after completing all questions.