<h1 align="center"><b>Programming Assignment 1 (100 points total)</b></h1>
<h3 align="center"><b>Due at the end of Module 7</b></h3><br>


## **Question 1: Exploratory Data Analysis and Preprocessing (30 Points)**

### **Objective**  
In this question, you will conduct a structured exploratory data analysis (EDA) and preprocess the **Wine Quality Dataset** to prepare it for downstream analysis. This includes computing **summary statistics**, detecting and removing **outliers using Mahalanobis distance**, applying **feature normalization**, and performing **dimensionality reduction with PCA**. Throughout, you will be required to analyze your results and justify your methodological choices.

- **Dataset:** [Wine Quality Dataset](https://archive.ics.uci.edu/ml/datasets/Wine+Quality)  
- **Focus Areas:** Algorithm analysis, data interpretation, and preprocessing strategies.

---

### **Part 1: Summary Statistics (5 Points)**
#### **Instructions:**
1. Load the **Wine Quality Dataset** and inspect its structure (e.g., feature types, missing values, summary statistics).
2. Compute the following descriptive statistics for each feature, both overall and grouped by wine quality rating:
   - Minimum, Maximum
   - Mean, Trimmed Mean (5%)
   - Standard Deviation
   - Skewness, Kurtosis
3. Present results in a **clear table**.
4. Provide a written **interpretation** of what these statistics reveal about the dataset.

#### **Deliverables:**
- Code implementation for computing summary statistics.
- A table summarizing computed values.
- A written analysis of key insights.
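
As a rough illustration of the statistics listed in this part, the sketch below computes them with `pandas` and `scipy.stats` on a small synthetic table. The helper name `summarize`, the synthetic columns, and the reading of "Trimmed Mean (5%)" as trimming 5% from each tail are assumptions, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from scipy import stats

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per numeric column with the descriptive statistics from Part 1."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "min": numeric.min(),
        "max": numeric.max(),
        "mean": numeric.mean(),
        # Assumption: "5%" means cutting 5% of observations from each tail
        "trimmed_mean_5": numeric.apply(lambda col: stats.trim_mean(col, 0.05)),
        "std": numeric.std(),        # sample standard deviation (ddof=1)
        "skewness": numeric.skew(),
        "kurtosis": numeric.kurt(),  # excess kurtosis (a normal distribution gives 0)
    })

# Synthetic stand-in for the wine data so the sketch runs offline
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, 200),
    "pH": rng.normal(3.2, 0.15, 200),
    "quality": rng.integers(3, 9, 200),
})

overall = summarize(demo.drop(columns="quality"))
by_quality = {q: summarize(g.drop(columns="quality")) for q, g in demo.groupby("quality")}
print(overall.round(3))
```

Returning a DataFrame keeps the overall and per-quality results in the same shape, which makes them easy to present side by side in a table.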

---

### **Part 2: Data Visualization (5 Points)**
#### **Instructions:**
1. Create **scatter plots or pair plots** to visualize how pairs of numerical features relate to each other and to wine quality (for example, by coloring points by quality rating).
2. Identify any **patterns, trends, or clusters** in the data.
3. Discuss whether **certain features appear to separate wine quality levels** more effectively.

#### **Deliverables:**
- Code for generating visualizations.
- A written discussion of key observations.

---

### **Part 3: Outlier Detection and Removal using Mahalanobis Distance (5 Points)**
#### **Instructions:**
1. **Use the provided pairwise ellipse plot method** (which builds on the Mahalanobis distance) to assess outliers in the dataset.
2. Select **at least three distinct feature pairs** for visualization.
3. Develop a **numerical outlier metric** based on Mahalanobis distance to systematically identify extreme values.
4. Implement an **algorithm that removes observations identified as outliers** based on this metric.
5. Justify the **choice of threshold** for outlier removal and explain why Mahalanobis distance is appropriate for multivariate data.

#### **Deliverables:**
- Code implementing the outlier detection and removal algorithm.
- Pairwise ellipse plots for at least three feature pairs.
- A written explanation of the metric used for outlier detection and removal, including justification of the threshold.
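
One possible numerical metric, shown here only as a sketch, is the squared Mahalanobis distance of each observation from the sample mean, compared against a chi-square quantile. The function names, the significance level `alpha = 0.001`, and the synthetic data are assumptions. The chi-square cutoff itself relies on the data being roughly multivariate normal, which should be checked before adopting it.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_sq(X: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row from the sample mean."""
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a near-singular covariance
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

def remove_outliers(X: np.ndarray, alpha: float = 0.001):
    """Drop rows whose squared distance exceeds the chi-square (1 - alpha) quantile."""
    d2 = mahalanobis_sq(X)
    cutoff = chi2.ppf(1.0 - alpha, df=X.shape[1])
    keep = d2 <= cutoff
    return X[keep], keep, cutoff

# Synthetic data with five planted outliers
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0, 0], np.eye(3), size=500)
X[:5] += 12.0

X_clean, keep, cutoff = remove_outliers(X)
print(f"cutoff={cutoff:.2f}, removed={np.sum(~keep)}")
```

Because the mean and covariance are themselves estimated from data that contains the outliers, a stricter analysis might iterate the removal or use a robust covariance estimate.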

---

### **Part 4: Feature Scaling and Normalization (5 Points)**
#### **Instructions:**
1. Apply **Min-Max Normalization** to scale all numerical features between 0 and 1.
2. Verify that the transformed features meet the expected range.
3. Explain why normalization is essential for analyses such as PCA.

#### **Deliverables:**
- Code for Min-Max Normalization.
- A table comparing feature values before and after normalization.
- A written explanation of why normalization is beneficial.
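
A minimal sketch of Min-Max Normalization with NumPy follows. The function name and the handling of constant columns are assumptions. The three sample rows reuse the fixed acidity, volatile acidity, and alcohol values from the dataset preview.

```python
import numpy as np

def min_max_scale(X: np.ndarray) -> np.ndarray:
    """Scale each column to [0, 1] using (x - min) / (max - min)."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Constant columns have zero range; map them to 0 instead of dividing by zero
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range

X = np.array([[7.4, 0.70, 9.4],
              [7.8, 0.88, 9.8],
              [11.2, 0.28, 9.8]])
X_scaled = min_max_scale(X)

# Verification step: every column should span exactly [0, 1]
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```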

---

### **Part 5: Principal Component Analysis and Dimensionality Reduction (10 Points)**
#### **Instructions:**
1. Apply **PCA to the full dataset** and compute the **explained variance for each principal component**.
2. Visualize the **cumulative explained variance** to determine how many principal components should be retained.
3. Apply **PCA separately for different wine quality levels** and compare the variance explained.
4. Discuss whether PCA helps reveal patterns that were not evident in the original features.

#### **Deliverables:**
- Code for PCA computation (using an existing library implementation is allowed).
- A table showing explained variance for each principal component.
- A discussion on the differences between applying PCA to the full dataset vs. subsets by wine quality.
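
To make the explained-variance computation concrete, here is a sketch that performs PCA through an eigendecomposition of the covariance matrix. A library implementation would serve equally well. The function name, the synthetic data, and the 95% retention threshold are assumptions chosen for illustration.

```python
import numpy as np

def pca_explained_variance(X: np.ndarray):
    """Return (explained_variance_ratio, components) from the covariance eigendecomposition."""
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals / eigvals.sum(), eigvecs.T

# Synthetic data: five observed features driven by two latent factors plus small noise
rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(300, 5))

ratio, components = pca_explained_variance(X)
cumulative = np.cumsum(ratio)
# Smallest number of components whose cumulative ratio reaches the assumed 95% threshold
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(np.round(cumulative, 3), n_keep)
```

Running the same function on each quality subset gives the per-level variance profiles that Part 5 asks you to compare.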

---

### **Key Considerations**
✅ **Logical Flow:** The problem walks through **EDA → Cleaning → Transformation → PCA**  
✅ **Focus on Analysis & Justification:** You must **explain your choices** rather than just implement code.  
✅ **Algorithmic Thinking:** Requires **metric development**, use of **Mahalanobis distance**, and **PCA interpretation**.  

---

Good luck! 🚀


```python
import pandas as pd

# Load red and white wine datasets from UCI Machine Learning Repository
url_red = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
url_white = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"

df_red = pd.read_csv(url_red, sep=';')
df_white = pd.read_csv(url_white, sep=';')
df_red['wine_type'] = 'red'
df_white['wine_type'] = 'white'
df_wine = pd.concat([df_red, df_white], axis=0, ignore_index=True)

df_wine.head()
```


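|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | wine_type |
|---|---------------|------------------|-------------|----------------|-----------|---------------------|----------------------|---------|----|-----------|---------|---------|-----------|
| 0 | 7.4  | 0.7  | 0.0  | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |
| 1 | 7.8  | 0.88 | 0.0  | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2  | 0.68 | 9.8 | 5 | red |
| 2 | 7.8  | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997  | 3.26 | 0.65 | 9.8 | 5 | red |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998  | 3.16 | 0.58 | 9.8 | 6 | red |
| 4 | 7.4  | 0.7  | 0.0  | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |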


# **Question 2: Statistical Algorithms (30 Points)**

### **Objective**  
In this question, you will implement a **Naïve Bayes classifier from scratch** (without using built-in machine learning libraries). You will:
1. **Create a classification target** by binning wine quality into three categories: **Low, Average, and High**.
2. **Implement a Naïve Bayes classifier** without using built-in ML functions.
3. **Analyze the runtime complexity** of your implementation.
4. **Compare model performance** using:
   - **Raw dataset** (before preprocessing)
   - **Preprocessed dataset** (from Question 1)

Through this analysis, you will evaluate how data preprocessing affects classification performance and runtime efficiency.

---

## **Part 1: Creating a Classification Target (5 Points)**
1. Using the `quality` column from the Wine Quality Dataset, **convert wine quality into three categories**:
   - **Low Quality:** `quality ≤ 5`
   - **Average Quality:** `quality = 6`
   - **High Quality:** `quality ≥ 7`
2. Store this as a new column: `quality_category`
3. Check whether the resulting classes are **balanced** and discuss how the distribution of classes might affect model performance.

**Deliverables:**
- Code to transform the target variable.
- A frequency table showing the distribution of the three categories.
- A written discussion on class distribution.
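
The binning rule above can be sketched with `pandas.cut`. The helper name and the tiny demonstration frame are assumptions. The bin edges rely on `quality` being an integer score, so the interval (5, 6] contains only the value 6.

```python
import numpy as np
import pandas as pd

def add_quality_category(df: pd.DataFrame) -> pd.DataFrame:
    """Add a `quality_category` column: Low (<= 5), Average (= 6), High (>= 7)."""
    out = df.copy()
    out["quality_category"] = pd.cut(
        out["quality"],
        bins=[-np.inf, 5, 6, np.inf],  # right-closed intervals: (-inf, 5], (5, 6], (6, inf]
        labels=["Low", "Average", "High"],
    )
    return out

demo = pd.DataFrame({"quality": [3, 5, 5, 6, 6, 7, 8, 9]})
demo = add_quality_category(demo)

# Frequency table with absolute counts and class shares
counts = demo["quality_category"].value_counts().reindex(["Low", "Average", "High"])
print(pd.DataFrame({"count": counts, "share": counts / counts.sum()}))
```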

---

## **Part 2: Implementing Naïve Bayes from Scratch (15 Points)**
You will **implement a Naïve Bayes classifier without using built-in ML libraries**.

### **Steps to Implement:**
1. **Compute Prior Probabilities:**  
   - Calculate the probability of each class (`P(Class)`).
   
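2. **Compute Conditional Probabilities:**  
   - For each feature $x_j$, assume a **Gaussian (Normal) distribution** within each class and compute:
     $$ P(x_j | Class) = \frac{1}{\sqrt{2\pi\sigma_j^2}} e^{-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}} $$
   - Estimate the **mean ($\mu_j$) and standard deviation ($\sigma_j$)** of each feature separately for each class.

3. **Implement the Prediction Function:**  
   - Under the naïve independence assumption, the likelihood of an observation $X = (x_1, \dots, x_d)$ is the product of the per-feature likelihoods:
     $$ P(X | Class) = \prod_{j=1}^{d} P(x_j | Class) $$
   - Compute **posterior probabilities** for each class using Bayes’ Theorem:
     $$ P(Class | X) = \frac{P(X | Class) P(Class)}{P(X)} $$
   - Assign each observation to the class with the highest posterior probability. Because $P(X)$ is the same for every class, it can be dropped when comparing classes.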

4. **Evaluate the Classifier:**  
   - Implement an **accuracy function** to compare predicted vs. actual classes.

**Deliverables:**
- Python implementation of Naïve Bayes (without built-in ML functions).
- Code for prior probabilities, likelihood estimation, and classification.
- An accuracy metric for model performance.
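
The steps above could be organized roughly as in the following sketch, which uses only NumPy. The class name, the variance floor of `1e-9`, and the synthetic two-class data are assumptions. Working with log-probabilities instead of raw products is a design choice made here to avoid numerical underflow, not something the assignment prescribes.

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes sketch (priors, per-class mean/variance, argmax posterior)."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.log_prior_ = np.log(np.array([np.mean(y == c) for c in self.classes_]))
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Assumed small floor keeps the variance positive for near-constant features
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        diff = X[:, None, :] - self.mean_[None, :, :]          # shape (n, k, d)
        # Log of the Gaussian density, summed over features: shape (n, k)
        log_lik = -0.5 * (np.log(2 * np.pi * self.var_)[None] + diff**2 / self.var_[None]).sum(axis=2)
        # P(X) is constant across classes, so the unnormalized log-posterior suffices
        return self.classes_[np.argmax(log_lik + self.log_prior_, axis=1)]

def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Two well-separated synthetic classes
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array(["Low"] * 100 + ["High"] * 100)

model = GaussianNaiveBayes().fit(X, y)
print(accuracy(y, model.predict(X)))
```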

---

## **Part 3: Algorithm Runtime Analysis (5 Points)**
1. **Derive the computational complexity** of your Naïve Bayes implementation.
2. Express the runtime as **T(n, d)**, where n is the number of samples and d is the number of features.
3. Provide the **asymptotic runtime** using **Big-O notation**.

**Deliverables:**
- Derivation of the runtime **T(n, d)**.
- Asymptotic **Big-O analysis**.
- A written explanation of how the runtime is affected by dataset size.
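
As a hedged starting point, one way to set up the derivation is sketched below. It assumes a Gaussian implementation that makes a fixed number of passes over the training data. The symbols $k$ (number of classes), $m$ (number of observations to classify), and the constants $c_1, c_2, c_3$ are introduced here for illustration and are not part of the assignment text.

$$
\begin{aligned}
T_{\text{train}}(n, d) &= c_1\, n d + c_2\, k d \;\in\; O(nd + kd) \\
T_{\text{predict}}(m, d) &= c_3\, m k d \;\in\; O(mkd)
\end{aligned}
$$

Under these assumptions, with $k = 3$ fixed and $k \leq n$, training is $O(nd)$ and prediction is $O(md)$, so both grow linearly in the number of samples and in the number of features. Your own derivation should follow the loops in your actual implementation.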

---

## **Part 4: Comparing Performance on Raw vs. Preprocessed Data (5 Points)**
1. Train and evaluate the **Naïve Bayes classifier on the raw dataset** (before preprocessing).
2. Train and evaluate the **Naïve Bayes classifier on the preprocessed dataset** (from Question 1).
3. **Compare results**, considering:
   - **Classification accuracy**
   - **Computation time**
   - **Impact of preprocessing on model performance**
4. Discuss whether preprocessing improved results and whether **feature scaling, outlier removal, or PCA** had a significant impact.

**Deliverables:**
- Accuracy comparison table for **raw vs. preprocessed data**.
- Computation time analysis for both datasets.
- A written discussion on preprocessing impact.
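
For the computation time comparison, a small timing helper along these lines may be useful. The helper name, the best-of-five repeat policy, and the stand-in `dummy_pipeline` function are assumptions; in practice you would pass your own train-and-evaluate routine.

```python
import time
import numpy as np

def time_call(fn, *args, repeats: int = 5):
    """Return (best wall-clock seconds over `repeats` runs, result of the last run)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result

def dummy_pipeline(X):
    """Hypothetical stand-in for a train-and-evaluate routine."""
    return float(np.linalg.norm(X - X.mean(axis=0)))

rng = np.random.default_rng(4)
datasets = {
    "raw": rng.normal(size=(2000, 11)),          # synthetic stand-in for the raw features
    "preprocessed": rng.normal(size=(1900, 5)),  # synthetic stand-in after cleaning and PCA
}
for name, X in datasets.items():
    seconds, _ = time_call(dummy_pipeline, X)
    print(f"{name:>12}: {seconds * 1e3:.3f} ms")
```

Taking the best of several runs reduces noise from other processes, which matters when individual runs take only milliseconds.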

---

### **Key Takeaways**
✅ **Students implement a core ML algorithm from scratch**, reinforcing mathematical intuition.  
✅ **Runtime complexity analysis** encourages computational efficiency considerations.  
✅ **Comparing raw vs. preprocessed data** teaches the importance of data preparation in model performance.

---



# **Question 3: Linear Programming vs. Particle Swarm Optimization (20 Points)**

### **Objective**
In this question, you will solve a **linear programming (LP) optimization problem** using **two different methods**:
1. **Linear Programming (LP) Solver (`scipy.optimize.linprog`)**
2. **Particle Swarm Optimization (PSO) (`pyswarms`)**

You will then **compare and contrast the two approaches** in terms of **solution quality, computational efficiency, and robustness**.

---

## **Problem Statement**
You are given the following **linear objective function** to minimize:

$ \min_{x} \quad f(x) = -4x_1 - 3x_2 $

### **Subject to Constraints:**
$$
\begin{aligned}
x_1 + 2x_2 &\leq 8 \\
3x_1 + x_2 &\leq 9 \\
x_1 \geq 0, \quad x_2 &\geq 0
\end{aligned}
$$

where:
- $ (x_1, x_2) $ are the decision variables.
- The constraints ensure feasible values for $x_1$ and $x_2$.

---

## **Part 1: Solve Using Linear Programming (LP) (7 Points)**
1. **Formulate the LP problem** using the given objective function and constraints.
2. **Use `scipy.optimize.linprog`** to solve for the optimal $x$.
3. **Record the optimal solution $x^*$ and objective value $f(x^*)$.**

**Deliverables:**
- Python code implementing the LP solution.
- The optimal solution $x^*$ and objective function value.
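
A minimal sketch of the `scipy.optimize.linprog` call for this problem is shown below. The variable names and the choice of `method="highs"` are assumptions. `linprog` minimizes by convention, so the objective coefficients can be passed as given.

```python
from scipy.optimize import linprog

c = [-4, -3]                     # objective coefficients for min -4*x1 - 3*x2
A_ub = [[1, 2],                  # x1 + 2*x2 <= 8
        [3, 1]]                  # 3*x1 + x2 <= 9
b_ub = [8, 9]
bounds = [(0, None), (0, None)]  # x1 >= 0, x2 >= 0

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(result.status, result.x, result.fun)
```

Checking `result.success` (or `result.status == 0`) before reading `result.x` guards against reporting a point from a failed solve.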

---

## **Part 2: Solve Using Particle Swarm Optimization (PSO) (7 Points)**
1. Define the **same objective function** as a Python function.
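2. Implement **constraint handling** (for example, a penalty term or a repair step) so that the constraints $Ax \leq b$ and $x \geq 0$ are satisfied, where $A = \begin{bmatrix} 1 & 2 \\ 3 & 1 \end{bmatrix}$ and $b = (8, 9)^\top$ encode the two inequality constraints from the problem statement.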
3. **Use `pyswarms`** to approximate the solution.
4. **Record the optimal solution $x^*$ and objective value $f(x^*)$.**

**Deliverables:**
- Python code implementing the PSO solution.
- The optimal solution $x^*$ and objective function value.
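
One common way to handle constraints in PSO is a penalty term added to the objective, sketched below. The penalty weight, the swarm hyperparameters, the search box, and the function names are all untuned assumptions. `pyswarms` is imported inside `run_pso` so that the objective can be inspected on its own.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 1.0]])
b = np.array([8.0, 9.0])
PENALTY = 1e3  # assumed weight; larger values push the swarm harder toward feasibility

def penalized_objective(positions: np.ndarray) -> np.ndarray:
    """Cost per particle; `positions` has shape (n_particles, 2), as pyswarms passes it."""
    objective = -4.0 * positions[:, 0] - 3.0 * positions[:, 1]
    # Amount by which each row of A x <= b is exceeded (zero when satisfied)
    violation = np.clip(positions @ A.T - b, 0.0, None).sum(axis=1)
    # Amount by which x >= 0 is violated
    negativity = np.clip(-positions, 0.0, None).sum(axis=1)
    return objective + PENALTY * (violation + negativity)

def run_pso(n_particles: int = 50, iters: int = 200):
    """Run a global-best swarm and return (best_cost, best_position)."""
    import pyswarms as ps  # lazy import: only needed when the swarm is actually run
    options = {"c1": 1.5, "c2": 1.5, "w": 0.7}    # assumed, untuned hyperparameters
    bounds = (np.zeros(2), np.array([8.0, 8.0]))  # box chosen to contain the feasible region
    optimizer = ps.single.GlobalBestPSO(
        n_particles=n_particles, dimensions=2, options=options, bounds=bounds
    )
    return optimizer.optimize(penalized_objective, iters=iters)

# Evaluate one feasible and one infeasible point
print(penalized_objective(np.array([[2.0, 3.0], [4.0, 4.0]])))
```

A penalty approach lets the swarm explore infeasible points while steering it back, so the reported solution should still be checked against the original constraints.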

---

## **Part 3: Compare and Contrast LP vs. PSO (6 Points)**
Write a **comparative analysis** of the two optimization methods based on:
1. **Solution Accuracy:** How close was PSO to the exact LP solution?
2. **Computational Efficiency:** Which method was faster? Why?
3. **Robustness:** How does each method perform in more complex scenarios (e.g., non-convex problems)?
4. **Use Cases:** When would you prefer **LP over PSO**, and vice versa?

**Deliverables:**
- A **written analysis** comparing LP vs. PSO.
- A **table summarizing key differences**.

---

### **Key Takeaways**
✅ **Demonstrates the difference between exact (LP) and heuristic (PSO) methods**.  
✅ **Encourages computational analysis by comparing solution accuracy and runtime**.  
✅ **Prepares students to think critically about choosing optimization techniques in real-world problems**.  



# **Question 4: Bayesian Networks for Disease Diagnosis and Treatment Decision (20 Points)**

## **Objective**
In this problem, you will:
1. **Construct a Bayesian Network** for medical diagnosis.
2. **Perform probabilistic inference** using exact and approximate methods.
3. **Analyze the runtime complexity** of different inference algorithms.
4. **Evaluate the impact of graph structure on inference performance.**

---

## **Problem Statement**
A hospital is developing an **AI-driven Bayesian Network** to assist in diagnosing patients. The system includes:

- **Flu (F)** and **COVID-19 (C)** as potential diseases.
- **Cough (K)** and **Fever (V)** as symptoms.
- **COVID-19 Treatment (T)** as an intervention.
- **Recovery (R)** depends on the disease and treatment.

### **Bayesian Network Structure**
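The CPTs below imply the following directed edges:

- Flu → Fever, Flu → Cough
- COVID-19 → Fever, COVID-19 → Cough
- COVID-19 → Treatment
- Flu → Recovery, COVID-19 → Recovery, Treatment → Recovery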

### **Conditional Probability Tables (CPTs)**
The following **CPTs** define the probabilistic relationships in the network:

#### **Disease Probabilities**
| Value | P(Flu = Value) | P(COVID-19 = Value) |
|---------|--------|-------------|
| True    | 0.12   | 0.08        |
| False   | 0.88   | 0.92        |

#### **Symptoms Given Disease**
| Flu | COVID-19 | P(Fever = True) | P(Cough = True) |
|-----|---------|----------|----------|
| False | False | 0.01     | 0.02     |
| False | True  | 0.85     | 0.60     |
| True  | False | 0.90     | 0.70     |
| True  | True  | 0.98     | 0.85     |

#### **Treatment Decision**
Doctors aim to administer treatment **only when COVID-19 is present**, but the decision is imperfect:
- $ P(\text{Treatment} \mid \text{COVID-19}) = 0.95 $
- $ P(\text{Treatment} \mid \neg\,\text{COVID-19}) = 0.05 $ (error rate)

#### **Recovery Probabilities**
| Flu | COVID-19 | Treatment | P(Recovery = True) |
|-----|---------|-----------|-------------|
| False | False | Any       | 0.99        |
| False | True  | Yes       | 0.90        |
| False | True  | No        | 0.50        |
| True  | False | Any       | 0.85        |
| True  | True  | Yes       | 0.80        |
| True  | True  | No        | 0.30        |

## **Part 1: Constructing the Bayesian Network (5 Points)**
1. **Define the Bayesian Network structure** using `pgmpy`.
2. **Assign conditional probability tables (CPTs)** to each node.
3. **Ensure the network is valid and consistent.**

**Deliverables:**
- Python code defining the Bayesian Network.
- Explanation of the model.
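
Before building the model in `pgmpy`, it can help to have a library-free reference for cross-checking. The sketch below is not the `pgmpy` construction itself. It encodes the CPTs as plain dictionaries and evaluates the joint distribution by brute-force enumeration. It assumes the structure implied by the CPT tables (both symptoms depend on both diseases, Treatment depends on COVID-19, Recovery depends on Flu, COVID-19, and Treatment) and that Fever and Cough are conditionally independent given the diseases. All names are illustrative.

```python
from itertools import product

# CPT entries copied from the tables above; each value is P(node = True | parents)
P_FLU, P_COVID = 0.12, 0.08
P_FEVER = {(False, False): 0.01, (False, True): 0.85, (True, False): 0.90, (True, True): 0.98}
P_COUGH = {(False, False): 0.02, (False, True): 0.60, (True, False): 0.70, (True, True): 0.85}
P_TREAT = {True: 0.95, False: 0.05}  # keyed by covid
P_RECOVERY = {  # keyed by (flu, covid, treatment); the "Any" rows are expanded
    (False, False, True): 0.99, (False, False, False): 0.99,
    (False, True, True): 0.90, (False, True, False): 0.50,
    (True, False, True): 0.85, (True, False, False): 0.85,
    (True, True, True): 0.80, (True, True, False): 0.30,
}
VARS = ("F", "C", "V", "K", "T", "R")  # Flu, COVID-19, Fever, Cough, Treatment, Recovery

def bern(p_true: float, value: bool) -> float:
    """Probability of a binary `value` when P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(a: dict) -> float:
    """Joint probability of a full assignment, as a product of the local CPT entries."""
    return (bern(P_FLU, a["F"]) * bern(P_COVID, a["C"])
            * bern(P_FEVER[a["F"], a["C"]], a["V"])
            * bern(P_COUGH[a["F"], a["C"]], a["K"])
            * bern(P_TREAT[a["C"]], a["T"])
            * bern(P_RECOVERY[a["F"], a["C"], a["T"]], a["R"]))

def query(target: str, evidence: dict) -> float:
    """P(target = True | evidence), summing the joint over all 2^6 assignments."""
    num = den = 0.0
    for values in product([False, True], repeat=len(VARS)):
        a = dict(zip(VARS, values))
        if all(a[k] == v for k, v in evidence.items()):
            p = joint(a)
            den += p
            if a[target]:
                num += p
    return num / den

# Validity check: the joint must sum to 1 over all assignments
total = sum(joint(dict(zip(VARS, v))) for v in product([False, True], repeat=len(VARS)))
print(total, query("C", {"V": True, "K": True}))
```

Enumeration is exponential in the number of variables, so it is only practical as a correctness check on a network this small.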

---

## **Part 2: Bayesian Inference (8 Points)**
Compute the following probabilities using different inference algorithms:
1. $ P(\text{COVID-19} \mid \text{Fever} = \text{True}, \text{Cough} = \text{True}) $
2. $ P(\text{Flu} \mid \text{Fever} = \text{True}, \text{Cough} = \text{False}) $
3. $ P(\text{Treatment} \mid \text{Cough} = \text{True}) $
4. $ P(\text{Recovery} \mid \text{Fever} = \text{True}, \text{Treatment} = \text{True}) $

Use:
- **Exact Inference** (Variable Elimination)
- **Approximate Inference** (Gibbs Sampling)

**Deliverables:**
- Python code implementing both inference methods.
- Interpretation of results.
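
To illustrate what Gibbs sampling does on this network, here is a small hand-written sampler in plain Python. It is a sketch for intuition, not the `pgmpy` sampler. It assumes the same structure implied by the CPT tables in the problem statement, and the sample count, burn-in length, and seed are arbitrary choices. Each full conditional is obtained by evaluating the joint with the variable set to True and to False, which is simple but less efficient than using only the Markov blanket.

```python
import random

# CPT entries copied from the problem statement; each value is P(node = True | parents)
P_FLU, P_COVID = 0.12, 0.08
P_FEVER = {(False, False): 0.01, (False, True): 0.85, (True, False): 0.90, (True, True): 0.98}
P_COUGH = {(False, False): 0.02, (False, True): 0.60, (True, False): 0.70, (True, True): 0.85}
P_TREAT = {True: 0.95, False: 0.05}  # keyed by covid
P_RECOVERY = {  # keyed by (flu, covid, treatment); the "Any" rows are expanded
    (False, False, True): 0.99, (False, False, False): 0.99,
    (False, True, True): 0.90, (False, True, False): 0.50,
    (True, False, True): 0.85, (True, False, False): 0.85,
    (True, True, True): 0.80, (True, True, False): 0.30,
}
VARS = ("F", "C", "V", "K", "T", "R")  # Flu, COVID-19, Fever, Cough, Treatment, Recovery

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(s):
    """Joint probability of a full assignment under the assumed structure."""
    return (bern(P_FLU, s["F"]) * bern(P_COVID, s["C"])
            * bern(P_FEVER[s["F"], s["C"]], s["V"])
            * bern(P_COUGH[s["F"], s["C"]], s["K"])
            * bern(P_TREAT[s["C"]], s["T"])
            * bern(P_RECOVERY[s["F"], s["C"], s["T"]], s["R"]))

def gibbs(target, evidence, n_samples=20000, burn_in=1000, seed=0):
    """Estimate P(target = True | evidence) by resampling each free variable in turn."""
    rng = random.Random(seed)
    state = {v: rng.random() < 0.5 for v in VARS}
    state.update(evidence)  # evidence variables stay clamped
    free = [v for v in VARS if v not in evidence]
    hits = 0
    for step in range(burn_in + n_samples):
        for var in free:
            # P(var | everything else) is proportional to the joint with var set each way
            state[var] = True
            p_true = joint(state)
            state[var] = False
            p_false = joint(state)
            state[var] = rng.random() < p_true / (p_true + p_false)
        if step >= burn_in:
            hits += state[target]
    return hits / n_samples

estimate = gibbs("C", {"V": True, "K": True})
print(estimate)
```

Successive Gibbs samples are correlated, so the estimate converges more slowly than the raw sample count suggests. Comparing it against the exact result from Variable Elimination shows how large that gap is.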

---

## **Part 3: Runtime Analysis (7 Points)**
### **Step 1: Measure and Compare Runtime**
1. Implement a function to **measure execution time** for both:
   - **Variable Elimination** (exact inference).
   - **Gibbs Sampling** (approximate inference).
2. Run both algorithms on increasingly **larger networks** (e.g., by adding more symptoms or diseases).
3. Plot runtime as a function of network size.

### **Step 2: Theoretical Complexity Analysis**
1. Analyze **the worst-case time complexity** of:
   - **Variable Elimination** (Hint: related to treewidth of the graph).
   - **Gibbs Sampling** (Hint: depends on number of iterations).
2. Discuss how the **graph structure** (e.g., chain, tree, densely connected) impacts computational efficiency.
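
As a hedged reference point for this analysis, the commonly quoted bounds are summarized below. The symbols are introduced here and are not part of the assignment text: $n$ is the number of variables, $k$ the maximum number of states per variable, $w$ the induced width of the chosen elimination ordering (at least the treewidth), $S$ the number of Gibbs sweeps, and $b$ the cost of evaluating one full conditional, which grows with the size of a node's Markov blanket.

$$
\begin{aligned}
T_{\text{VE}} &\in O\!\left(n \cdot k^{\,w+1}\right) \\
T_{\text{Gibbs}} &\in O\!\left(S \cdot n \cdot b\right)
\end{aligned}
$$

Under these bounds, chains and trees keep $w$ small and Variable Elimination cheap, while densely connected graphs drive $w$ up and make exact inference exponential. The cost of one Gibbs sweep grows much more gently, although the number of sweeps $S$ needed for a given accuracy is not captured by the bound and depends on how well the chain mixes.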

### **Step 3: Interpretation**
1. Based on your runtime measurements, which algorithm scales better?
2. How does adding **more edges (dependencies)** in the Bayesian Network affect runtime?
3. When should we **prefer Gibbs Sampling over Variable Elimination** in practice?

**Deliverables:**
- Python code measuring runtime.
- A **runtime comparison graph**.
- A **written explanation** discussing results.

---

## **Key Takeaways**
✅ **Demonstrates Bayesian inference using exact and approximate algorithms.**  
✅ **Encourages students to evaluate algorithmic efficiency in probabilistic reasoning.**  
✅ **Teaches practical trade-offs between accuracy and computational cost.**  
✅ **Connects Bayesian Networks to Graph Algorithm complexity analysis.**
