<h1 align="center"><b>Programming Assignment 2 (100 points total)</b></h1>
<h3 align="center"><b>Due at the end of Module 14</b></h3><br>


## Question 1: MNIST Feature Extraction and Classification (50 Points Total)

### **Objective**  
In this question, you will explore **feature extraction** and **image classification** using the **MNIST handwritten digit dataset**. Your task is to preprocess raw image data, apply the **2D Discrete Cosine Transform (DCT)**, extract directionally informative coefficients using **frequency masks**, and conduct **dimensionality reduction via eigen decomposition**. You will then use the resulting features to train and compare multiple **classification algorithms**, including traditional machine learning and deep learning approaches.

- **Dataset:** [MNIST Handwritten Digit Dataset](https://www.openml.org/d/554)  
- **Focus Areas:** **Signal-based feature engineering**, **dimensionality reduction**, **supervised classification**, and **model comparison**.

---

### **Part 1: Preprocessing and Visualization (10 Points)**
#### **Instructions:**
1. Load the **MNIST dataset** and inspect its structure (e.g., flattened pixel vectors, label column, number of observations).
   You can use the following code:
```python
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
```
2. Display an example image of each handwritten digit, 0-9.
3. The final dataset must include representative samples from each digit class (at least 100 per class). We understand compute limitations, so you are not required to use all observations, but you may do so if you choose.
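
One possible way to meet the "at least 100 per class" requirement is a stratified random subsample. The sketch below is only illustrative: the helper name `stratified_subset` and its `per_class` and `seed` parameters are hypothetical, and synthetic arrays stand in for `mnist.data` and `mnist.target` so that the block runs without downloading anything.

```python
import numpy as np

def stratified_subset(X, y, per_class=100, seed=0):
    """Return a random subset with exactly `per_class` rows for every label in y."""
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        if idx.size < per_class:
            raise ValueError(f"class {label!r} has only {idx.size} samples")
        keep.append(rng.choice(idx, size=per_class, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

# Synthetic stand-in for the MNIST arrays (assumption: 5000 flattened 28x28 images).
rng = np.random.default_rng(42)
X_demo = rng.random((5000, 784))
y_demo = np.repeat(np.arange(10), 500)  # 500 fake samples per digit

X_sub, y_sub = stratified_subset(X_demo, y_demo, per_class=100)
labels, counts = np.unique(y_sub, return_counts=True)
print(X_sub.shape)                                   # (observations, features)
print(dict(zip(labels.tolist(), counts.tolist())))   # per-class counts
```

With the real data, `fetch_openml` may return pandas objects with string labels, depending on the scikit-learn version, so a conversion such as `mnist.data.to_numpy()` and `mnist.target.astype(int).to_numpy()` is probably needed before calling the helper.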

#### **Deliverables:**
- 10 black-and-white images, one example of each handwritten digit (0-9) from the dataset
- Output showing the number of observations and features
- Proof that you have a representative sample from each class
---

### **Part 2: Feature Engineering using Eigen Decomposition (15 points)**
#### **Instructions:**
1. Using `scipy.fft.dct` or `scipy.fftpack.dct`, apply the **2D Discrete Cosine Transform** to each 28x28 image.
2. Using the directional masks created by the code below, extract the **directional coefficients** for each direction.
   ```python
   import numpy as np

   def create_custom_dct_masks(size=28):
       """Return three boolean (size, size) masks over the DCT coefficient grid."""
       h_mask = np.zeros((size, size), dtype=bool)
       v_mask = np.zeros((size, size), dtype=bool)
       d_mask = np.zeros((size, size), dtype=bool)

       for i in range(size):        # i is the row index
           for j in range(size):    # j is the column index
               # Horizontal mask: lower triangle including the diagonal (row >= column)
               if i >= j:
                   h_mask[i, j] = True
               # Vertical mask: upper triangle including the diagonal (column >= row)
               if j >= i:
                   v_mask[i, j] = True
               # Diagonal mask: band within one entry of the main diagonal
               if abs(i - j) <= 1:
                   d_mask[i, j] = True

       return h_mask, v_mask, d_mask
   ```
3. For each directional component, **flatten the selected masked DCT coefficients** and stack them into a matrix with one row per sample.
4. For each of these matrices, compute the covariance matrix and perform **eigen decomposition**.
5. For each of the three directions, **retain the top 20 eigenvectors** (those with the largest eigenvalues) and project the samples onto them.
6. Concatenate the **three sets of 20-dimensional features** (60 features per sample in total) to form your final feature representation.
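
Steps 1 to 3 can be sketched as follows. Because `scipy.fft.dct` is a 1D transform, the 2D DCT is obtained here by applying it along each axis in turn. The orthonormal scaling (`norm="ortho"`), the helper name `dct2`, the vectorized mask expressions (intended to match `create_custom_dct_masks`), and the random stand-in images are all assumptions made for illustration.

```python
import numpy as np
from scipy.fft import dct, dctn

def dct2(image):
    """2D type-II DCT: a 1D DCT along axis 0, then along axis 1."""
    return dct(dct(image, axis=0, norm="ortho"), axis=1, norm="ortho")

# Vectorized equivalents of the three masks from create_custom_dct_masks.
size = 28
i, j = np.indices((size, size))
h_mask = i >= j               # on or below the main diagonal
v_mask = j >= i               # on or above the main diagonal
d_mask = np.abs(i - j) <= 1   # band around the main diagonal

# Random stand-in for a stack of MNIST images, shape (n_samples, 28, 28).
rng = np.random.default_rng(0)
images = rng.random((50, size, size))

coeffs = np.stack([dct2(img) for img in images])   # (50, 28, 28)

# A 2D boolean mask applied to the last two axes flattens the selected coefficients.
H = coeffs[:, h_mask]
V = coeffs[:, v_mask]
D = coeffs[:, d_mask]
print(H.shape, V.shape, D.shape)

# Sanity check: the separable form agrees with scipy's n-dimensional DCT.
print(np.allclose(dct2(images[0]), dctn(images[0], norm="ortho")))
```

Each of `H`, `V`, and `D` is a samples-by-coefficients matrix, which is the layout expected by the covariance step.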

#### **Deliverables:**
- A dataset with 60 features per observation (60 x number of observations) to use for supervised classification
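
A minimal sketch of the covariance, eigen decomposition, projection, and concatenation steps is shown below. It assumes each directional matrix has one row per sample. The helper name `top_k_projection` is hypothetical, and Gaussian noise stands in for the masked DCT coefficients.

```python
import numpy as np

def top_k_projection(X, k=20):
    """Project the rows of X onto the k covariance eigenvectors with the largest eigenvalues."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)     # (n_features, n_features)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    return X_centered @ eigvecs[:, order]      # (n_samples, k)

# Stand-ins for the three masked coefficient matrices (assumed widths 406, 406, 82).
rng = np.random.default_rng(0)
n = 200
H = rng.normal(size=(n, 406))
V = rng.normal(size=(n, 406))
D = rng.normal(size=(n, 82))

features = np.hstack([top_k_projection(M, k=20) for M in (H, V, D)])
print(features.shape)   # one row per observation, 60 columns
```

The result has observations as rows; transpose it if a 60-by-observations layout is preferred. If the data are split into training and test sets, it is usually safer to compute the mean and eigenvectors on the training portion only and reuse them for the test portion.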

---

### **Part 3: Classification Algorithms (25 Points)**

#### **Instructions:**
1. **Train a supervised classification model** using the reduced feature set generated in **Part 2**. You may use any built-in method from `sklearn` (e.g., KNN, Random Forest, SVM, etc.).  
   - Evaluate model performance (e.g., accuracy, confusion matrix).  
   - **Interpret the model**: What patterns does it learn? Which features seem important?

2. **Train a second model using your own SVM implementation** from Homework 3.  
   - You may choose a **linear or RBF kernel**.  
   - Use the same feature set from Part 2.  
   - Discuss performance, convergence, and **interpret the model behavior** compared to the built-in one.

3. **Compare model performance** between your `sklearn` classifier and your custom SVM.  
   - Use **plots** (accuracy bars, confusion matrices, etc.) and **textual analysis** to highlight key differences.  
   - Consider trade-offs in **accuracy**, **training time**, and **model flexibility**.

4. **Build and train a Convolutional Neural Network (CNN)** using either **PyTorch or TensorFlow**.  
   - Input should be the **raw 28×28 image** (not the reduced feature set).  
   - You may use standard architectures (e.g., 2 convolutional layers + dense layers).  
   - Train and evaluate the CNN on the same subset of data.

5. **Compare and analyze CNN vs. DCT-based models.**  
   - Report the **accuracy of all three models** (built-in, custom SVM, CNN).  
   - Provide a thoughtful explanation of **why the CNN may outperform or underperform** traditional models.  
   - Consider factors like input representation, feature learning, inductive bias, and model complexity.
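
To keep the comparison between models consistent, one option is a small evaluation harness that times the fit and reports accuracy and a confusion matrix for any object exposing `fit` and `predict`. The sketch below assumes your custom SVM follows that interface and that labels are integers from 0 to `n_classes - 1`. The names `evaluate` and `confusion_matrix_np` are hypothetical, and `NearestCentroid` is a toy placeholder that only exists to make the example runnable.

```python
import time
import numpy as np

def confusion_matrix_np(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def evaluate(model, X_train, y_train, X_test, y_test, n_classes):
    """Fit a model, time the fit, and return accuracy plus the confusion matrix."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    y_pred = model.predict(X_test)
    return {
        "accuracy": float(np.mean(y_pred == y_test)),
        "fit_seconds": fit_seconds,
        "confusion": confusion_matrix_np(y_test, y_pred, n_classes),
    }

class NearestCentroid:
    """Toy placeholder classifier (not one of the required models)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]

# Synthetic 60-dimensional features with three well-separated classes.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 100)
X = rng.normal(size=(300, 60)) + 5.0 * y[:, None]
perm = rng.permutation(300)
train, test = perm[:240], perm[240:]

report = evaluate(NearestCentroid(), X[train], y[train], X[test], y[test], n_classes=3)
print(report["accuracy"], report["fit_seconds"])
print(report["confusion"])
```

Calling `evaluate` once per model with the same split yields directly comparable numbers for the bar charts and confusion-matrix plots.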

#### **Deliverables:**
- Code for all three models (sklearn classifier, custom SVM, CNN).
- Accuracy reports and visual comparisons (e.g., bar charts, confusion matrices).
- A short written analysis comparing performance, highlighting **why results differ**, **algorithm complexity**, and discussing **model interpretability** vs. accuracy.


---

### **Key Considerations**
✅ **Logical Flow:** The problem follows a structured pipeline: **Image Reshaping → DCT Feature Extraction → Masking → Eigen Decomposition → Classification**  
✅ **Feature Engineering Emphasis:** Focuses on building compact, informative features using **directional DCT coefficients** and **eigenvectors**, not just using raw pixels.  
✅ **Algorithmic Thinking:** Requires understanding of **signal processing**, **linear algebra** (e.g., eigen decomposition, projections), and **classification pipelines**.    
✅ **Model Comparison:** Involves evaluating and comparing **sklearn models**, a **custom SVM**, and a **CNN**, encouraging reflection on strengths of different classification algorithms.


---

Good luck! 🚀


## Question 2: Design of Experiments (25 Points Total)


In the next cells you’ll be provided with code that defines the initial ground object box (with latitude and longitude boundaries) and runs the simulation. 

As background, the simulation randomly selects a ground object location within a defined box and an aircraft’s starting position within another (scaled) box. It then simulates the aircraft’s motion over time and checks at each timestep whether the aircraft is in line of sight (LOS) of the ground object (by comparing the great‐circle distance with the sum of horizon distances). 

The outcome is recorded as a binary **“target”** variable:  
- `0` for detection (LOS exists)  
- `1` for no detection (no LOS)
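
The LOS rule described above can be illustrated with a short worked example. It uses the same horizon approximation as the simulation, `sqrt(2 * R * alt)`. The helper name `has_los` and the 10 km aircraft altitude are chosen here purely for illustration.

```python
import math

R = 6371000  # Earth's radius in meters, as in the simulation

def horizon_distance(alt):
    """Approximate distance to the horizon (m) from altitude alt (m)."""
    return math.sqrt(2 * R * alt)

def has_los(distance_m, ground_alt_m, plane_alt_m):
    """LOS exists when the separation is within the sum of both horizon distances."""
    return distance_m <= horizon_distance(ground_alt_m) + horizon_distance(plane_alt_m)

d_ground = horizon_distance(1.5)        # observer at 1.5 m
d_plane = horizon_distance(10_000.0)    # aircraft at 10 km (illustrative value)
print(f"ground horizon: {d_ground / 1000:.1f} km, aircraft horizon: {d_plane / 1000:.1f} km")

# A 300 km separation is inside the combined horizon, a 400 km separation is not.
print(has_los(300_000, 1.5, 10_000.0), has_los(400_000, 1.5, 10_000.0))
```

Since the ground observer contributes only a few kilometers, the aircraft's altitude dominates the combined horizon distance under this approximation.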

> **Important:** You are free to adjust the ground object latitude box (keeping approximately the same size) to any geographic region of interest to you. This will allow you to explore the effects of location on LOS detection.

Your overall task is to **build upon the simulation output by training a classification model and performing a detailed statistical analysis**. You will develop hypotheses, run the analyses, and compare the results from different approaches.

Your answers should include:
- Code
- Outputs (e.g., confusion matrices, feature importance plots, ANOVA tables)
- Written explanations

---

## Part 1: Background and Hypothesis (5 points)

**Question:**  
Briefly describe, in your own words, what the simulation code is doing. Your explanation should cover:

- How the simulation uses geographic bounding boxes to set up the ground object and aircraft positions.
- How the simulation determines if LOS exists between the ground object and the aircraft.
- What the “target” variable represents.

**Additionally:**  
Propose a **hypothesis** about which parameters (e.g., aircraft altitude or initial aircraft longitude) you expect to have the greatest influence on LOS detection, and briefly justify your reasoning.

> **Note:** You do not need to fully understand every detail of the simulation code; focus on the overall purpose and mechanism as described above. Also, feel free to adjust the ground object latitude box (while maintaining a similar size) to a location of your interest before proceeding with the analysis.

---

## Part 2: Building a Classification Model (7 points)

**Question:**  
The simulation output is stored in a DataFrame named `df` with columns including:

- `ground_lat`, `ground_lon`, `ground_alt`
- `init_plane_lat`, `init_plane_lon`
- `plane_alt`, `plane_speed`, `plane_heading`
- `target`

Complete the following tasks:

1. Separate the features (`X`) and the target variable (`y`).
2. Split the data into an 80-20 train-test split.
3. Train a **Random Forest classifier** on the training set.
4. Evaluate the model by computing the **accuracy** on both the training and test sets and **visualizing the confusion matrices**.
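
The four tasks above might be sketched as follows. To keep the block self-contained, a synthetic DataFrame `df_demo` with the same column names stands in for the real `df`. Its values and the rule that generates `target` are arbitrary and are not simulation results. The `random_state`, `stratify`, and `n_estimators` settings are illustrative choices, not requirements.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the simulation output (fake values, same columns).
rng = np.random.default_rng(0)
n = 1000
df_demo = pd.DataFrame({
    "ground_lat": rng.uniform(27.5, 29.0, n),
    "ground_lon": rng.uniform(-82.0, -80.0, n),
    "ground_alt": 1.5,
    "init_plane_lat": rng.uniform(24.5, 32.0, n),
    "init_plane_lon": rng.uniform(-86.0, -76.0, n),
    "plane_alt": rng.uniform(45.72, 19812.0, n),
    "plane_speed": 250,
    "plane_heading": rng.uniform(0, 360, n),
})
df_demo["target"] = (df_demo["plane_alt"] < 8000).astype(int)  # arbitrary fake rule

# 1. Separate features and target.
X = df_demo.drop(columns="target")
y = df_demo["target"]

# 2. 80-20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 3. Train the Random Forest.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# 4. Accuracy and confusion matrix on both sets.
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = clf.predict(X_part)
    print(name, accuracy_score(y_part, pred))
    print(confusion_matrix(y_part, pred))

importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

For the plots, `sklearn.metrics.ConfusionMatrixDisplay.from_predictions(y_test, clf.predict(X_test))` is one convenient option when matplotlib is available.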

**Discussion:**  
Explain your **initial hypothesis** regarding which features might most strongly influence LOS detection, and comment on whether your model’s performance and the confusion matrices align with your expectations.

---

## Part 3: Performing ANOVA and Logistic Regression (7 points)

**Question:**  
Using the `statsmodels` module, perform a **statistical analysis** on the training data by:

1. Fitting an **Ordinary Least Squares (OLS)** model using a formula that includes all the features (e.g.,  
   `target ~ ground_lat + ground_lon + ground_alt + init_plane_lat + init_plane_lon + plane_alt + plane_speed + plane_heading`)  
   and generating an **ANOVA table**.

2. Fitting a **logistic regression model** (which is more appropriate for binary outcomes) with the same formula.

3. Reporting and comparing the results, particularly highlighting **which features are statistically significant** in both models.
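
One practical point to watch for: in the provided simulation, `ground_alt` and `plane_speed` are fixed constants, so those columns do not vary. Together with the intercept they make the design matrix rank-deficient, which can leave their coefficients undefined or cause the logistic fit to fail. The sketch below shows one possible workaround, building the formula only from columns that vary. The helper name `build_formula` and the toy DataFrame are hypothetical, and whether to drop such columns is a judgment you should explain in your write-up.

```python
import pandas as pd

def build_formula(df, target="target"):
    """Build a formula string from the columns of df that take more than one value."""
    predictors = [c for c in df.columns if c != target and df[c].nunique() > 1]
    return f"{target} ~ " + " + ".join(predictors)

# Toy frame with two constant columns (values are placeholders).
toy = pd.DataFrame({
    "ground_lat": [27.6, 28.1, 28.9, 27.9],
    "ground_alt": [1.5, 1.5, 1.5, 1.5],
    "plane_alt": [300.0, 9000.0, 15000.0, 1200.0],
    "plane_speed": [250, 250, 250, 250],
    "target": [1, 0, 0, 1],
})

formula = build_formula(toy)
print(formula)
```

The resulting string can be passed to `statsmodels.formula.api.ols(formula, data=train_df).fit()` and `statsmodels.formula.api.logit(formula, data=train_df).fit()`, where `train_df` is a placeholder name for your training DataFrame, and the fitted OLS model can be given to `statsmodels.api.stats.anova_lm(..., typ=2)` to obtain an ANOVA table.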

**Discussion:**  
Formulate a **hypothesis** on which features you expect to be statistically significant in explaining LOS detection. Explain how the ANOVA and logistic regression results support or contradict your hypothesis.

---

## Part 4: Comparative Analysis and Critical Discussion (6 points)

**Question:**  
Compare the insights obtained from your **Random Forest classifier** (particularly the feature importance scores) with the findings from your **ANOVA table** and **logistic regression summary**. Address the following:

- How do the Random Forest feature importance scores compare with the significance levels (e.g., p-values) from the ANOVA and logistic regression outputs?
- What do these comparisons reveal about the key parameters affecting LOS detection in the simulation?
- Based on your analysis, propose potential improvements to the simulation or suggest further experiments to enhance understanding of LOS detection.
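
A side-by-side table can make the first comparison easier to discuss. The sketch below ranks features by Random Forest importance and by p-value and then correlates the two rankings. The helper name `compare_rankings` is hypothetical, and the numbers are placeholders for illustration only, not simulation results.

```python
import pandas as pd

def compare_rankings(importances, p_values):
    """Rank 1 is the most important (highest importance) or most significant (lowest p-value)."""
    table = pd.DataFrame({"rf_importance": importances, "p_value": p_values})
    table["rf_rank"] = table["rf_importance"].rank(ascending=False)
    table["p_rank"] = table["p_value"].rank(ascending=True)
    return table.sort_values("rf_rank")

# Placeholder values; replace with clf.feature_importances_ and your model's p-values.
importances = pd.Series({"plane_alt": 0.5, "init_plane_lat": 0.3, "plane_heading": 0.2})
p_values = pd.Series({"plane_alt": 0.001, "init_plane_lat": 0.04, "plane_heading": 0.6})

table = compare_rankings(importances, p_values)
print(table)

# Pearson correlation of the two rank columns equals the Spearman rank correlation.
rank_agreement = table["rf_rank"].corr(table["p_rank"])
print(rank_agreement)
```

Keep in mind that importance scores and p-values measure different things, so disagreement between the rankings is itself worth interpreting rather than treating as an error.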

**Discussion:**  
In your written analysis, clearly state your **conclusions**, supporting them with **evidence** from your code outputs and plots. Make sure your discussion is well-reasoned and data-driven.


```python
import math
import random
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.patches as mpatches
import pandas as pd

# -----------------------
# Define the Ground Object box (ground object region)
# -----------------------
ground_lon_min, ground_lat_min = -82, 27.5
ground_lon_max, ground_lat_max = -80, 29.0
ground_width = ground_lon_max - ground_lon_min  # 2.0 degrees
ground_height = ground_lat_max - ground_lat_min   # 1.5 degrees

# Compute the center of the ground box
center_lon = (ground_lon_min + ground_lon_max) / 2
center_lat = (ground_lat_min + ground_lat_max) / 2

# -----------------------
# Define the Aircraft bounding box (30% smaller than the 50x area box)
# -----------------------
# Scale up dimensions for a 50x area then reduce by 30%
scale_factor = 50**0.5
aircraft_width = ground_width * scale_factor * 0.7
aircraft_height = ground_height * scale_factor * 0.7

# Center the aircraft box on the ground box center.
aircraft_lon_min = center_lon - aircraft_width / 2
aircraft_lon_max = center_lon + aircraft_width / 2
aircraft_lat_min = center_lat - aircraft_height / 2
aircraft_lat_max = center_lat + aircraft_height / 2

# -----------------------
# Visualization: Plotting Both Boxes
# -----------------------
fig = plt.figure(figsize=(10, 8))
ax = plt.axes(projection=ccrs.PlateCarree())

# Add geographic features.
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.BORDERS, linestyle=':')

# Draw the Ground Object box (red).
ground_rect = mpatches.Rectangle(
    (ground_lon_min, ground_lat_min),
    ground_width,
    ground_height,
    linewidth=2,
    edgecolor='red',
    facecolor='none',
    transform=ccrs.PlateCarree()
)
ax.add_patch(ground_rect)
ax.text((ground_lon_min + ground_lon_max) / 2, ground_lat_max,
        "Ground Object", color='red',
        ha='center', va='bottom', transform=ccrs.PlateCarree())

# Draw the Aircraft bounding box (blue).
aircraft_rect = mpatches.Rectangle(
    (aircraft_lon_min, aircraft_lat_min),
    aircraft_width,
    aircraft_height,
    linewidth=2,
    edgecolor='blue',
    facecolor='none',
    transform=ccrs.PlateCarree()
)
ax.add_patch(aircraft_rect)
ax.text((aircraft_lon_min + aircraft_lon_max) / 2, aircraft_lat_max,
        "Aircraft Box", color='blue',
        ha='center', va='bottom', transform=ccrs.PlateCarree())

# Set the extent to show both boxes with a margin.
margin_lon = 5
margin_lat = 5
ax.set_extent([aircraft_lon_min - margin_lon, aircraft_lon_max + margin_lon,
               aircraft_lat_min - margin_lat, aircraft_lat_max + margin_lat],
              crs=ccrs.PlateCarree())

plt.title("Ground Object Region and Aircraft Region")
plt.show()
```


```python
import math
import random
import pandas as pd

# Earth's radius in meters
R = 6371000

# Simulation parameters
total_time = 3600  # seconds (1 hour)
dt = 10            # time step in seconds
num_steps = total_time // dt

# Number of simulation runs
num_runs = 10000

# -----------------------
# Helper Functions
# -----------------------
def haversine(lat1, lon1, lat2, lon2):
    """Calculate the great-circle distance between two points (in meters)."""
    lat1_rad, lon1_rad = math.radians(lat1), math.radians(lon1)
    lat2_rad, lon2_rad = math.radians(lat2), math.radians(lon2)
    dlat = lat2_rad - lat1_rad
    dlon = lon2_rad - lon1_rad
    a = math.sin(dlat/2)**2 + math.cos(lat1_rad) * math.cos(lat2_rad) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))
    return R * c

def update_position(lat, lon, speed, heading, dt):
    """
    Update position based on current lat/lon, speed (m/s), heading (degrees),
    and time step dt. Uses a simple spherical approximation.
    """
    distance = speed * dt
    heading_rad = math.radians(heading)
    delta_north = distance * math.cos(heading_rad)
    delta_east = distance * math.sin(heading_rad)
    delta_lat = (delta_north / R) * (180 / math.pi)
    delta_lon = (delta_east / (R * math.cos(math.radians(lat)))) * (180 / math.pi)
    new_lat = lat + delta_lat
    new_lon = lon + delta_lon
    new_lon = (new_lon + 180) % 360 - 180  # normalize longitude
    new_lat = max(min(new_lat, 90), -90)   # constrain latitude
    return new_lat, new_lon

def horizon_distance(alt):
    """
    Calculate the horizon distance (in meters) for a given altitude 'alt'
    using the approximation: distance ≈ √(2 * R * alt)
    """
    return math.sqrt(2 * R * alt)

# -----------------------
# Simulation Function
# -----------------------
def simulate_run():
    # Choose a random ground object location within the ground box.
    ground_lon = random.uniform(ground_lon_min, ground_lon_max)
    ground_lat = random.uniform(ground_lat_min, ground_lat_max)
    ground_alt = 1.5  # observer height in meters

    # Choose a random aircraft initial position within the aircraft bounding box.
    init_plane_lat = random.uniform(aircraft_lat_min, aircraft_lat_max)
    init_plane_lon = random.uniform(aircraft_lon_min, aircraft_lon_max)

    # Randomly select the aircraft altitude between 150 ft and 65,000 ft (converted to meters).
    plane_alt = random.uniform(150 * 0.3048, 65000 * 0.3048)

    # Aircraft speed and heading.
    plane_speed = 250  # m/s (~900 km/h)
    plane_heading = random.uniform(0, 360)  # degrees

    # Set initial aircraft position.
    plane_lat = init_plane_lat
    plane_lon = init_plane_lon

    # Flag for line-of-sight occurrence.
    los_occurred = False

    for step in range(int(num_steps) + 1):
        # Calculate the great-circle distance between ground object and aircraft.
        distance = haversine(ground_lat, ground_lon, plane_lat, plane_lon)
        # Calculate horizon distances.
        d_ground = horizon_distance(ground_alt)
        d_plane = horizon_distance(plane_alt)
        # If LOS exists at this timestep, flag it.
        if distance <= (d_ground + d_plane):
            los_occurred = True
            break
        # Update aircraft position.
        plane_lat, plane_lon = update_position(plane_lat, plane_lon, plane_speed, plane_heading, dt)

    # If LOS occurred at least once, target is 0, otherwise 1.
    target = 0 if los_occurred else 1

    return {
        "ground_lat": ground_lat,
        "ground_lon": ground_lon,
        "ground_alt": ground_alt,
        "init_plane_lat": init_plane_lat,
        "init_plane_lon": init_plane_lon,
        "plane_alt": plane_alt,
        "plane_speed": plane_speed,
        "plane_heading": plane_heading,
        "target": target
    }

# -----------------------
# Run the Simulation and Save Results in a DataFrame
# -----------------------
results = [simulate_run() for _ in range(num_runs)]
df = pd.DataFrame(results)

# Calculate the fraction of runs with target = 1 (i.e. no LOS ever occurred)
fraction_no_los = df["target"].mean()
print(f"Fraction of runs with no LOS (target=1): {fraction_no_los:.4f}")
print("\nDataFrame head:")
print(df.head())
```


## Question 3: Generative Models & Sequence Architectures in NLP (25 Points Total)

### **Objective**  
This question explores modern generative modeling techniques and neural architectures used for sequence data. The focus is on understanding **how GANs, VAEs, and Seq2Seq models work**, how they are trained, and how they handle **discrete language data**. You will analyze their design and evaluate basic implementations using pre-trained or lightweight models via **HuggingFace** or **TensorFlow Hub**.

- **Topics Covered:** Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Sequence-to-Sequence (Seq2Seq) models  
- **Focus Areas:** Basic architectural understanding, loss function comparison, implementation insight, and training strategy analysis

---

### **Part 1: Model Comparison and Demonstration (25 Points)**

#### **Instructions:**
1. **Select two model classes** from the list below:
   - GANs (e.g., TextGAN, SeqGAN)
   - VAEs (e.g., VAE for text generation)
   - Seq2Seq (e.g., encoder-decoder with attention)

2. For each selected model, do the following:

   - **(5 pts)** **Load and demonstrate a pre-trained model** using either [🤗 HuggingFace Transformers](https://huggingface.co/models) or [TensorFlow Hub](https://tfhub.dev/).  
     - Run the model on a sample text input.  
     - Show the generated output or the model's prediction.  
     - Briefly explain how the input and output are represented.

   - **(5 pts)** Describe the model architecture in your own words. What are the key layers (e.g., encoder/decoder, attention, positional encoding)? What role does each play?

   - **(5 pts)** Identify and explain the **loss function** used for training. How does it handle discrete or sequential data?

   - **(5 pts)** Discuss one **challenge in training** this model on text (e.g., instability, exposure bias, overfitting) and a technique used to mitigate it.

3. **(5 pts)** Write a short comparison paragraph (150–250 words) addressing the following:
   - What kinds of tasks is each model better suited for?
   - Which model is more interpretable or efficient?
   - Which is easier to fine-tune or deploy?
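
As a starting point for the loss-function item, the standard textbook objectives for the three model classes are summarized below. The notation ($x$ for the input, $y_t$ for the target token at step $t$, $z$ for a latent variable, $\theta$ and $\phi$ for parameters, $G$ and $D$ for generator and discriminator) is introduced here for illustration and does not come from the assignment. The specific pre-trained model you choose may use a variant of these objectives, so check its documentation.

$$
\mathcal{L}_{\text{Seq2Seq}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
$$

$$
\mathcal{L}_{\text{VAE}}(\theta, \phi) = -\,\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] + D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)
$$

$$
\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right]
$$

The first two are sums or expectations of per-token log-likelihoods and can be optimized directly by backpropagation. The GAN objective requires sampling discrete tokens from $G$, which is not differentiable, and this is why text GANs such as SeqGAN rely on policy-gradient training.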

#### **Deliverables:**
- Code cells demonstrating the two selected models (using HuggingFace or TensorFlow Hub) with at least one input/output example per model.
- Written responses covering architecture, loss function, training challenge, and model comparison.

