## Neural Networks

1) What is Deep Learning? Briefly describe how it evolved and how it differs from traditional machine learning.

->

Deep Learning is a subset of Machine Learning that uses **artificial neural networks** with many layers (hence “deep”) to automatically learn complex patterns from data.  
It evolved from traditional ML by overcoming the limitation of manual feature extraction — instead, deep networks learn hierarchical features directly from raw data.

Traditional ML models like Decision Trees or SVMs rely heavily on human-engineered features, while Deep Learning leverages **representation learning**, where hidden layers capture increasingly abstract patterns (e.g., edges → shapes → objects in images).

**Evolution factors:**
- Availability of large datasets  
- Powerful GPUs and TPUs  
- Algorithmic advances (e.g., efficient backpropagation training, ReLU activations, dropout)

Deep Learning has revolutionized fields like computer vision, speech recognition, and NLP.




2) Explain the basic architecture and functioning of a Perceptron. What are its limitations?

->

A **Perceptron** is the simplest form of a neural network, consisting of:
- Inputs (features)
- Weights and bias
- Activation function
- Output node

**Computation:**

$$
z = \sum_{i=1}^{n} w_i x_i + b
$$

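The activation is a step function: the output is 1 if z ≥ 0 and 0 otherwise. The bias b plays the role of the (negative) threshold, so the comparison is always against zero.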
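As a minimal sketch of this computation, the snippet below evaluates a perceptron with the threshold folded into the bias. The function name `perceptron_output` and the weights are hypothetical, hand-picked so that the unit behaves like a logical AND.

```python
def perceptron_output(x, w, b):
    """Return 1 if the weighted sum z = w.x + b is non-negative, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0

# Hand-picked (hypothetical) parameters that realise logical AND:
# z = x1 + x2 - 1.5 is non-negative only when both inputs are 1.
w, b = [1.0, 1.0], -1.5
and_outputs = [perceptron_output(x, w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(and_outputs)  # [0, 0, 0, 1]
```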

**Limitations:**
- Can only solve **linearly separable** problems (fails on XOR-type data)  
- No hidden layers → limited learning capacity  
- Sensitive to feature scaling

Hence, Multi-Layer Perceptrons (MLPs) were developed to learn non-linear relationships.
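The linear-separability limitation can be checked empirically. The sketch below (helper name `train_perceptron` is hypothetical) applies the perceptron learning rule to AND and to XOR. AND reaches an error-free epoch, while XOR never does, because no single line separates its classes.

```python
def train_perceptron(data, lr=1.0, n_epochs=100):
    """Perceptron learning rule; returns (weights, bias, errors in the last epoch run)."""
    w, b = [0.0, 0.0], 0.0
    errors = 0
    for _ in range(n_epochs):
        errors = 0
        for (x1, x2), target in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b >= 0 else 0
            delta = lr * (target - pred)
            if delta != 0:
                # Move the decision boundary toward the misclassified point.
                w[0] += delta * x1
                w[1] += delta * x2
                b += delta
                errors += 1
        if errors == 0:  # converged: a full pass without mistakes
            break
    return w, b, errors

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_errors = train_perceptron(AND)
_, _, xor_errors = train_perceptron(XOR)
print("AND errors in final epoch:", and_errors)
print("XOR errors in final epoch:", xor_errors)
```

Adding a hidden layer with a non-linear activation is what lets an MLP represent XOR.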



3) Describe the purpose of activation function in neural networks. Compare Sigmoid, ReLU, and Tanh functions.

->

Activation functions introduce **non-linearity** in neural networks so they can model complex data.

**1. Sigmoid Function**

$$
f(x) = \frac{1}{1 + e^{-x}}
$$

Range → (0, 1)  
✅ Smooth gradient  
❌ Causes vanishing gradients


**2. Tanh Function**

$$
f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$

Range → (-1, 1)  
✅ Zero-centered  
❌ Still suffers from vanishing gradients


**3. ReLU (Rectified Linear Unit)**

$$
f(x) = \max(0, x)
$$

✅ Fast, efficient  
✅ Helps avoid vanishing gradient  
❌ “Dead neurons”: a unit whose input stays negative outputs 0 and receives zero gradient, so it stops learning

➡ **ReLU** is most commonly used today.
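The vanishing-gradient remarks above can be made concrete by evaluating the derivatives of the three functions. This is an illustrative sketch using only the standard library; the helper names (`d_sigmoid`, `d_tanh`, `d_relu`) are ours.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # largest value is 0.25, at x = 0

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2  # largest value is 1.0, at x = 0

def d_relu(x):
    return 1.0 if x > 0 else 0.0    # 1 for positive inputs, 0 otherwise

for x in (-5.0, 0.0, 5.0):
    print(f"x={x:+.1f}  sigmoid'={d_sigmoid(x):.4f}  "
          f"tanh'={d_tanh(x):.4f}  relu'={d_relu(x):.1f}")
```

For large |x| the sigmoid and tanh derivatives shrink toward zero, and multiplying many such factors across layers is what makes gradients vanish. ReLU keeps a derivative of 1 for positive inputs, but 0 for negative ones, which is the source of dead neurons.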




4) What is the difference between Loss function and Cost function in neural networks? Provide examples.

->

Both measure model performance but differ in **scope**:

- **Loss Function:** error for a single observation  
- **Cost Function:** average of all individual losses

**Mathematical form:**

$$
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(y_i, \hat{y}_i)
$$

**Examples:**

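*Squared Error (per sample; averaging it over the dataset gives the Mean Squared Error, MSE, and the factor 1/2 only simplifies the derivative)*  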
$$
L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2
$$

*Binary Cross-Entropy Loss*  
$$
L(y, \hat{y}) = -[y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})]
$$

**Summary:**  
Loss → single sample | Cost → dataset average
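A small sketch of the distinction, using binary cross-entropy. The labels and predicted probabilities are made up for illustration.

```python
import math

def bce_loss(y, y_hat):
    """Binary cross-entropy for ONE observation."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Made-up labels and predicted probabilities.
y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.2, 0.6, 0.4]

losses = [bce_loss(y, p) for y, p in zip(y_true, y_pred)]  # one loss per sample
cost = sum(losses) / len(losses)                           # J = mean of the losses

print("per-sample losses:", [round(l, 4) for l in losses])
print("cost (dataset average):", round(cost, 4))
```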




5) What is the role of optimizers in neural networks? Compare Gradient Descent, Adam, and RMSprop.

->

Optimizers control **how weights are updated** to minimize the loss function — determining convergence speed and stability.


**1. Gradient Descent**

$$
w = w - \eta \frac{\partial J}{\partial w}
$$

✅ Simple and effective  
❌ May get stuck in local minima  
❌ Sensitive to learning rate
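To illustrate the update rule and its sensitivity to the learning rate, here is a minimal sketch that minimises the toy cost J(w) = (w - 3)^2. The cost function and the two learning rates are assumptions chosen for illustration.

```python
def gradient_descent(grad, w0, lr, n_steps):
    """Repeatedly apply w <- w - lr * dJ/dw."""
    w = w0
    for _ in range(n_steps):
        w = w - lr * grad(w)
    return w

def grad_J(w):
    # Derivative of the toy cost J(w) = (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)

w_good = gradient_descent(grad_J, w0=0.0, lr=0.1, n_steps=100)  # converges to 3
w_bad = gradient_descent(grad_J, w0=0.0, lr=1.1, n_steps=100)   # overshoots and diverges

print("lr=0.1 ->", w_good)
print("lr=1.1 ->", w_bad)
```

With lr = 0.1 each step shrinks the distance to the minimum by a factor of 0.8. With lr = 1.1 each step multiplies it by -1.2, so the iterates oscillate with growing amplitude.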


**2. RMSprop**

Uses a moving average of squared gradients to scale the learning rate of each parameter.  
✅ Good for non-stationary problems  
✅ Faster convergence than plain GD
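A minimal scalar sketch of the RMSprop update, assuming the common form s ← ρ·s + (1 − ρ)·g² followed by w ← w − η·g / (√s + ε). The decay rate `rho`, the learning rate, and the toy cost J(w) = (w − 3)² are illustrative choices, not values taken from the text.

```python
import math

def rmsprop(grad, w0, lr=0.01, rho=0.9, eps=1e-8, n_steps=2000):
    """Scalar RMSprop: divide each step by a running RMS of recent gradients."""
    w, s = w0, 0.0
    for _ in range(n_steps):
        g = grad(w)
        s = rho * s + (1 - rho) * g * g     # moving average of squared gradients
        w -= lr * g / (math.sqrt(s) + eps)  # step size adapts to gradient scale
    return w

def grad_J(w):
    return 2.0 * (w - 3.0)  # gradient of the toy cost J(w) = (w - 3)^2

w_rms = rmsprop(grad_J, w0=0.0)
print("RMSprop result:", w_rms)
```

Because the step is normalised by the gradient's recent magnitude, the iterate ends up hovering within a small band around the minimum rather than converging exactly. In practice this is usually handled with a decaying learning rate.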


**3. Adam (Adaptive Moment Estimation)**

Combines momentum + RMSprop:

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$

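$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

$$
w = w - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

Here \(g_t\) is the gradient at step \(t\), and the bias-corrected estimates \(\hat{m}_t\), \(\hat{v}_t\) compensate for \(m_t\) and \(v_t\) being initialized at zero.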

✅ Fast, adaptive, robust  
✅ Default optimizer for most networks
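The Adam update can be sketched for a single scalar weight as follows. The sketch includes the standard bias-correction step (dividing the moment estimates by 1 − β^t). The hyperparameter defaults and the toy cost J(w) = (w − 3)² are assumptions for illustration.

```python
import math

def adam(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Scalar Adam with bias-corrected first and second moment estimates."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g      # first moment (momentum term)
        v = beta2 * v + (1 - beta2) * g * g  # second moment (RMSprop-like term)
        m_hat = m / (1 - beta1 ** t)         # correct the bias from zero init
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def grad_J(w):
    return 2.0 * (w - 3.0)  # gradient of the toy cost J(w) = (w - 3)^2

w_adam = adam(grad_J, w0=0.0)
print("Adam result:", w_adam)
```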

The programming questions below use NumPy, Matplotlib, and TensorFlow/Keras.

In [None]:
"""
6) Write a Python program to implement a single-layer perceptron from scratch using NumPy
to solve the logical AND gate.

->

"""
import numpy as np

# Training data for AND gate
# Inputs (with bias term appended as 1)
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)  # last column is bias input (x0 = 1)

# Targets
y = np.array([0, 0, 0, 1], dtype=float)

# Perceptron parameters
np.random.seed(42)
weights = np.random.uniform(-1, 1, size=(3,))  # two feature weights + bias weight
lr = 0.1
n_epochs = 50

def step(x):
    return 1.0 if x >= 0 else 0.0

# Training loop (Perceptron learning rule)
for epoch in range(n_epochs):
    errors = 0
    for xi, target in zip(X, y):
        activation = np.dot(weights, xi)
        pred = step(activation)
        update = lr * (target - pred)
        if update != 0:
            weights += update * xi
            errors += 1
    # Early stop if no errors
    if errors == 0:
        break

# Results
print("Trained weights (w1, w2, bias):", weights)
print("Predictions on AND inputs:")
for xi in X:
    print(f"Input: {xi[:2]} -> Pred:", step(np.dot(weights, xi)))

In [None]:
"""
7) Implement and visualize Sigmoid, ReLU, and Tanh activation functions using Matplotlib.

->

"""
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 500)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def tanh(x):
    return np.tanh(x)

y_sig = sigmoid(x)
y_relu = relu(x)
y_tanh = tanh(x)

plt.figure(figsize=(8, 5))
plt.plot(x, y_sig, label="Sigmoid", linewidth=2)
plt.plot(x, y_relu, label="ReLU", linewidth=2)
plt.plot(x, y_tanh, label="Tanh", linewidth=2)
plt.axvline(0, color='gray', linewidth=0.5)
plt.axhline(0, color='gray', linewidth=0.5)
plt.title("Activation Functions: Sigmoid, ReLU, Tanh")
plt.xlabel("Input")
plt.ylabel("Output")
plt.legend()
plt.grid(alpha=0.2)
plt.show()

In [None]:
"""
8) Use Keras to build and train a simple multilayer neural network on the MNIST digits dataset.
Print the training accuracy.

->

"""
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.utils import to_categorical

# Load MNIST
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Preprocess
x_train = x_train.reshape(-1, 28 * 28).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28 * 28).astype("float32") / 255.0
y_train_cat = to_categorical(y_train, 10)
y_test_cat = to_categorical(y_test, 10)

# Build model
model = models.Sequential([
    layers.Input(shape=(28*28,)),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train (set epochs small to be notebook-friendly; increase as needed)
history = model.fit(x_train, y_train_cat, epochs=5, batch_size=128, validation_split=0.1, verbose=2)

# Print final training accuracy
train_acc = history.history['accuracy'][-1]
print(f"Final training accuracy (last epoch): {train_acc:.4f}")

In [None]:
"""
9) Visualize the loss and accuracy curves for a neural network model trained on the Fashion MNIST dataset.
Interpret the training behavior.

->

"""
import tensorflow as tf
from tensorflow.keras import layers, models
import matplotlib.pyplot as plt

# Load and preprocess Fashion MNIST
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train[..., None]  # add channel dimension
x_test = x_test[..., None]
y_train_cat = tf.keras.utils.to_categorical(y_train, 10)
y_test_cat = tf.keras.utils.to_categorical(y_test, 10)

# Simple CNN
model_f = models.Sequential([
    layers.Input(shape=(28, 28, 1)),  # explicit Input layer, as in the MNIST model above
    layers.Conv2D(32, (3,3), activation='relu'),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (3,3), activation='relu'),
    layers.MaxPooling2D((2,2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.4),
    layers.Dense(10, activation='softmax')
])

model_f.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

# Train briefly (adjust epochs if you want more)
history_f = model_f.fit(x_train, y_train_cat, epochs=6, batch_size=256, validation_split=0.1, verbose=2)

# Plot loss and accuracy
plt.figure(figsize=(12,5))

plt.subplot(1,2,1)
plt.plot(history_f.history['loss'], label='train loss')
plt.plot(history_f.history['val_loss'], label='val loss')
plt.title('Loss Curve')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1,2,2)
plt.plot(history_f.history['accuracy'], label='train acc')
plt.plot(history_f.history['val_accuracy'], label='val acc')
plt.title('Accuracy Curve')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

# Interpretation (print summary)
print("Interpretation tips:")
print("- If training loss decreases and training accuracy increases while validation loss decreases and val accuracy increases: model is learning and generalizing.")
print("- If training loss decreases but validation loss increases (val acc drops): model is overfitting.")
print("- If both train and val loss stay high: underfitting or model too simple / need more training.")
print("- Use early stopping, dropout, data augmentation or regularization to handle overfitting.")

10) You are working on a project for a bank that wants to automatically detect fraudulent transactions. The dataset is large, imbalanced, and contains structured features like transaction amount, merchant ID, and customer location. The goal is to classify each transaction as fraudulent or legitimate.

->

**Real-time Fraud Detection — End-to-End Deep Learning Workflow**

**1. Problem framing & data**
- **Task:** Binary classification (fraud = 1, legit = 0).  
- **Data sources:** transaction_amount, merchant_id, customer_id/location, timestamp, device info, historical customer behavior, engineered features (rolling averages, time-since-last-transaction, merchant risk score).  
- **Challenges:** severe class imbalance, concept drift (fraud evolves), high volume (latency constraints), high-cardinality categorical features.

**2. Model design**
- Use a **multilayer neural network (MLP)** for tabular data with specialized handling for categorical IDs:
  - Numeric inputs → standardize (e.g., `StandardScaler`).
  - High-cardinality categorical features (merchant_id, customer_id) → integer encode → **embedding layers**.
  - Concatenate embeddings + numeric features → dense stack.
- Example architecture (conceptual):
  - Inputs: numeric vector \(x_{num}\), embeddings \(e_{merchant}, e_{customer}\)  
  - Concatenate: \(h_0 = [x_{num}, e_{merchant}, e_{customer}]\)  
  - Dense(256, ReLU) → BatchNorm → Dropout(0.3)  
  - Dense(128, ReLU) → BatchNorm → Dropout(0.3)  
  - Dense(64, ReLU) → Dropout(0.2)  
  - Output: Dense(1, activation='sigmoid') → probability of fraud
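To make the data flow of this architecture concrete, here is a NumPy-only sketch of one forward pass: embedding lookup, concatenation into \(h_0\), one dense ReLU layer, and the sigmoid output. All sizes, IDs, and random weights are hypothetical, and the sketch uses a single hidden layer without BatchNorm or Dropout for brevity. In practice this would be built with Keras `Embedding` and `Dense` layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_merchants, n_customers = 1000, 5000
emb_dim, n_numeric, batch = 8, 4, 3

# Embedding tables: one (trainable) row per integer-encoded ID.
merchant_table = rng.normal(scale=0.1, size=(n_merchants, emb_dim))
customer_table = rng.normal(scale=0.1, size=(n_customers, emb_dim))

# One mini-batch: standardized numeric features plus integer IDs.
x_num = rng.normal(size=(batch, n_numeric))
merchant_ids = np.array([12, 7, 999])
customer_ids = np.array([4021, 3, 150])

# An embedding lookup is plain row indexing; concatenate everything into h_0.
h0 = np.concatenate(
    [x_num, merchant_table[merchant_ids], customer_table[customer_ids]], axis=1
)

# One dense ReLU layer followed by the sigmoid output unit.
W1, b1 = rng.normal(scale=0.1, size=(h0.shape[1], 16)), np.zeros(16)
w_out, b_out = rng.normal(scale=0.1, size=16), 0.0
hidden = np.maximum(0.0, h0 @ W1 + b1)
p_fraud = 1.0 / (1.0 + np.exp(-(hidden @ w_out + b_out)))

print(h0.shape, p_fraud.shape)  # (3, 20) (3,)
```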

**3. Activation & loss**
- **Hidden layers:** ReLU (or variants like Leaky ReLU) for fast convergence and to mitigate vanishing gradients.  
- **Output layer:** Sigmoid to produce probability \(p=\hat{y}\in(0,1)\).  
- **Primary loss:** Binary Cross-Entropy (BCE):
  $$
  L_{\text{BCE}}(y,p) = -\big[y\log(p) + (1-y)\log(1-p)\big]
  $$
- **For imbalance:**  
  - **Weighted BCE:** multiply BCE for class 1 (fraud) by a larger weight \(w_1\).  
  - **Focal Loss** (focus on hard examples):
  $$
  FL(p_t) = - (1 - p_t)^\gamma \log(p_t)
  $$
  where \(p_t = p\) if \(y=1\), else \(p_t=1-p\); \(\gamma>0\) focuses training on hard-to-classify examples.
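The effect of the two imbalance-aware losses can be compared on individual examples. The sketch below assumes γ = 2 and a fraud weight \(w_1 = 10\); both values, like the example probabilities, are illustrative choices.

```python
import math

def bce(y, p):
    """Plain binary cross-entropy for one example."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def weighted_bce(y, p, w1=10.0):
    """BCE with the fraud class (y = 1) up-weighted by w1."""
    return (w1 if y == 1 else 1.0) * bce(y, p)

def focal_loss(y, p, gamma=2.0):
    """Focal loss: BCE scaled by (1 - p_t)^gamma, which shrinks easy examples."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = (0, 0.05)  # legitimate transaction, confidently scored as legitimate
hard = (1, 0.10)  # fraudulent transaction that the model scores as unlikely

for name, (y, p) in (("easy", easy), ("hard", hard)):
    print(f"{name}: BCE={bce(y, p):.4f}  weighted={weighted_bce(y, p):.4f}  "
          f"focal={focal_loss(y, p):.4f}")
```

The easy example's focal loss is a tiny fraction of its BCE, while the hard example keeps most of its loss, so the many easy legitimate transactions no longer dominate the gradient. With γ = 0 the focal loss reduces to plain BCE.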

**4. Handling class imbalance (training & evaluation)**
- **Data-level methods:**
  - Careful oversampling of minority (fraud) using time-aware approaches (avoid leakage). Consider SMOTE-like techniques for training but beware temporal dependencies.
  - Undersampling the majority class for experiments (not always production-viable).
- **Algorithm-level methods:**
  - Use **class weights** during training (Keras `class_weight`) or use focal loss.
  - Threshold tuning: choose decision threshold to meet business constraints (maximize recall at acceptable precision).
- **Evaluation metrics:** use **Precision-Recall AUC (PR-AUC)**, **Recall (sensitivity)**, **Precision**, **F1**, and business-cost metrics (cost-weighted confusion matrix). ROC-AUC can be misleading under heavy imbalance — prefer PR-AUC.
- **Validation strategy:** time-aware split (train on past, validate on later window) to avoid leakage; consider rolling-window cross-validation.
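Two of the points above, class weights and threshold tuning, can be sketched in a few lines. The labels and scores are made up. The weighting formula n_samples / (n_classes × class_count) is the common "balanced" heuristic and is an assumption here, since the text does not prescribe one.

```python
def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}

def precision_recall(labels, scores, threshold):
    """Precision and recall when flagging every score >= threshold as fraud."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up validation labels (1 = fraud) and model scores.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.55, 0.90]

weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}

for t in (0.5, 0.7):
    p, r = precision_recall(labels, scores, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

A dictionary of this shape is what Keras expects for the `class_weight` argument of `model.fit`. Raising the threshold from 0.5 to 0.7 trades recall for precision, which is the lever used to meet business constraints.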

**5. Optimizer & training details**
- **Optimizer:** Adam (or AdamW) — adaptive, fast convergence. Use learning rate scheduling (ReduceLROnPlateau) and warm restarts if needed.  
- **Adam moment updates (concept):**
  $$
  m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t,\quad
  v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
  $$
  $$
  \hat{m}_t = \frac{m_t}{1-\beta_1^t},\quad
  \hat{v}_t = \frac{v_t}{1-\beta_2^t}
  $$
  $$
  \theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
  $$

**6. Preventing overfitting & improving robustness**
- Regularization: Dropout, L2 weight decay (or AdamW), Batch Normalization.  
- Feature engineering & noise-robust features (avoid overfitting to spurious correlations).  
- Early stopping on validation PR-AUC.  
- Ensembling: combine NN with tree-based models (e.g., XGBoost) — often improves performance on tabular data.  
- Calibrate probabilities (Platt scaling / isotonic) if using model probabilities in downstream decision-making.

**7. Real-time / production considerations**
- **Feature store & online features:** compute and serve up-to-date rolling statistics in real-time (or near real-time).  
- **Latency:** keep the model small enough to meet inference-time SLAs; use model quantization or distilled models if necessary.  
- **Monitoring:** track data drift, concept drift, input distributions, prediction distributions, and business KPIs (e.g., missed-fraud rate and false-alert rate). Automate alerting when performance degrades.  
- **Retraining strategy:** scheduled retraining plus drift-triggered retraining; maintain model lineage and versioning.  
- **Explainability & audit:** store features and predictions for every decision for auditing; use SHAP/LIME for root-cause analysis on flagged transactions.
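As one possible way to monitor input drift, the sketch below computes the Population Stability Index (PSI) between a training sample and a live sample of transaction amounts. The text does not name a drift statistic, so the choice of PSI, the bin edges, and the sample values are all assumptions.

```python
import math

def bin_fractions(values, edges, floor=1e-6):
    """Share of values in each half-open bin [edges[i], edges[i+1]).

    Empty bins are floored at a small value so the logarithm stays defined.
    Values outside the edges are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return [max(c / len(values), floor) for c in counts]

def psi(expected, actual, edges):
    """Population Stability Index: sum of (a - e) * ln(a / e) over the bins."""
    e = bin_fractions(expected, edges)
    a = bin_fractions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

edges = [0, 50, 100, 500, float("inf")]  # hypothetical amount buckets
train_amounts = [20, 30, 45, 60, 80, 120, 300, 40, 70, 90]
shifted_amounts = [400, 450, 800, 1200, 3000, 60, 700, 900, 2500, 480]

psi_same = psi(train_amounts, list(train_amounts), edges)
psi_shift = psi(train_amounts, shifted_amounts, edges)
print("PSI (same distribution):", psi_same)
print("PSI (shifted distribution):", round(psi_shift, 2))
```

A PSI of zero means the binned distributions match, and larger values indicate a stronger shift. The alert level would need to be calibrated on historical data.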

**8. Deployment & operations**
- Serve via low-latency API (REST/gRPC) or integrate into streaming pipeline (Kafka → model microservice).  
- Build fallback/business rules for critical low-confidence decisions.  
- Log predictions and outcomes to support continuous learning and regulatory audit.

**9. Business alignment**
- Optimize for business cost: weight false negatives (missed fraud) and false positives (customer friction) according to business loss matrix.  
- Provide metrics dashboards for analysts to review flagged transactions and adjust thresholds.
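Cost-based threshold selection can be sketched as follows. The costs (100 per missed fraud, 5 per false alarm), the candidate thresholds, and the validation data are hypothetical placeholders for a real business loss matrix.

```python
def total_cost(labels, scores, threshold, cost_fn, cost_fp):
    """Business cost of flagging every score >= threshold as fraud."""
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return cost_fn * fn + cost_fp * fp

# Made-up validation labels (1 = fraud) and model scores.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.55, 0.90]

# Hypothetical costs: a missed fraud is 20x as expensive as a false alarm.
costs = {
    t: total_cost(labels, scores, t, cost_fn=100.0, cost_fp=5.0)
    for t in (0.3, 0.5, 0.7)
}
best_threshold = min(costs, key=costs.get)

print(costs)           # {0.3: 25.0, 0.5: 5.0, 0.7: 100.0}
print(best_threshold)  # 0.5
```

The chosen threshold depends directly on the cost ratio, so it should be revisited whenever the business loss matrix changes.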

**Summary (one-line):**  
Use a multilayer NN with embeddings for categorical IDs, ReLU hidden activations, sigmoid output, weighted BCE or focal loss, Adam/AdamW optimizer, time-aware training and PR-AUC-based monitoring, plus robust production infra (feature store, drift detection, model versioning) to deliver accurate, low-latency fraud detection with ongoing retraining and monitoring.