# COMP 6630 | Assignment 4 | Will Gasser | wbg0023

## README
Data: the notebook mounts Google Drive in Colab and reads `Assignment4_Data.xlsx` (sheet 0 = training split, sheet 1 = test split).

Required libs: pandas, numpy, math, sklearn, matplotlib


### 1.1 Conditional probability distributions (hand‑calculated)

| Feature | μ (Class 0 Apt) | σ (0) | μ (Class 1 H/C) | σ (1) |
|---------|---------------:|------:|-----------------:|------:|
| Local Price | 7.3327 | 3.6160 | 6.5247 | 3.1241 |
| Bathrooms   | 1.2857 | 0.5669 | 1.1923 | 0.4349 |
| Land Area   | 6.1039 | 3.2585 | 6.3511 | 2.3079 |
| Living area | 1.5050 | 0.7041 | 1.4663 | 0.6205 |
| # Garages   | 1.2143 | 0.6986 | 1.1923 | 0.6934 |
| # Rooms     | 6.8571 | 1.3452 | 6.4615 | 1.1983 |
| # Bedrooms  | 3.4286 | 0.9759 | 3.1538 | 0.6887 |
| Age of home | 38.7143 | 14.6824 | 36.7692 | 13.0330 |

Prior probabilities (training split):  **P(Apartment)=0.35**, **P(House/Condo)=0.65**

**Manual work (example)**  
For *Bathrooms* (7 Apts, 13 H/C):

* counts: Class 0→1.0 ×5, 1.5 ×1, 2.5 ×1 → P(1.0∣0)=5⁄7=0.7143  
* counts: Class 1→1.0 ×10, 1.5 ×2, 2.5 ×1 → P(1.0∣1)=10⁄13=0.7692  
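
The counts above can be turned into the corresponding table entries with a few lines of standard-library Python. This is a minimal sketch, assuming the Bathrooms values are exactly the ones listed and that σ is the sample standard deviation (n − 1 denominator), which is the convention that reproduces the table; the names `bathrooms` and `stats` are illustrative.

```python
from statistics import mean, stdev

# Bathrooms values per class, expanded from the counts listed above.
bathrooms = {
    0: [1.0] * 5 + [1.5] + [2.5],        # 7 Apartments
    1: [1.0] * 10 + [1.5] * 2 + [2.5],   # 13 House/Condo
}

n_total = sum(len(values) for values in bathrooms.values())

stats = {}
for c, values in bathrooms.items():
    stats[c] = {
        "prior": len(values) / n_total,            # P(class)
        "p_one": values.count(1.0) / len(values),  # P(Bathrooms = 1.0 | class)
        "mu": mean(values),
        "sigma": stdev(values),                    # sample std (n - 1), an assumption
    }
    print(c, {k: round(v, 4) for k, v in stats[c].items()})
```

The same pattern applies to any other attribute once its per-class values are listed.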

Gaussian likelihood for x = 1.0 in Class 0:  
$$
\frac{1}{0.5669\sqrt{2\pi}}
      e^{-\frac{(1-1.2857)^2}{2\times0.5669^2}}
      \approx 0.6198
$$
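
The arithmetic above can be checked numerically. The sketch below evaluates the same Gaussian density with the μ and σ from the table; the function name `gaussian_pdf` is illustrative.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Bathrooms, Class 0 (Apartment): mu and sigma taken from the table in 1.1.
likelihood = gaussian_pdf(1.0, mu=1.2857, sigma=0.5669)
print(f"{likelihood:.4f}")
```

Note that this value is a density rather than a probability, so it can exceed 1 when σ is small.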

Identical steps (counting then μ/σ) were repeated for # Garages and checked for all other attributes.

In [2]:
"""
1.2 Hard‑coded Naïve Bayes classifier using constants from 1.1
"""
import math, pandas as pd

MU = {
    0:{'Local Price':7.3327,'Bathrooms':1.2857,'Land Area':6.1039,'Living area':1.5050,
       '# Garages':1.2143,'# Rooms':6.8571,'# Bedrooms':3.4286,'Age of home':38.7143},
    1:{'Local Price':6.5247,'Bathrooms':1.1923,'Land Area':6.3511,'Living area':1.4663,
       '# Garages':1.1923,'# Rooms':6.4615,'# Bedrooms':3.1538,'Age of home':36.7692}
}
STD = {
    0:{'Local Price':3.6160,'Bathrooms':0.5669,'Land Area':3.2585,'Living area':0.7041,
       '# Garages':0.6986,'# Rooms':1.3452,'# Bedrooms':0.9759,'Age of home':14.6824},
    1:{'Local Price':3.1241,'Bathrooms':0.4349,'Land Area':2.3079,'Living area':0.6205,
       '# Garages':0.6934,'# Rooms':1.1983,'# Bedrooms':0.6887,'Age of home':13.0330}
}
PRIORS={0:0.35,1:0.65}
FEATURES=list(MU[0].keys())

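def gaussian(x, mu, sigma):
    """Gaussian density N(mu, sigma^2) evaluated at x."""
    if sigma==0:
        # Degenerate feature: all mass sits at mu; the small floor avoids log(0).
        return 1.0 if x==mu else 1e-10
    return (1/(sigma*math.sqrt(2*math.pi)))*math.exp(-((x-mu)**2)/(2*sigma**2))

def predict_probs(row):
    """Return the posterior {0: P(class 0 | row), 1: P(class 1 | row)}."""
    # Work in log space: log prior plus the sum of per-feature log likelihoods.
    logp={c:math.log(PRIORS[c]) for c in (0,1)}
    for f in FEATURES:
        for c in (0,1):
            # Floor the density at 1e-15 so math.log never receives 0.
            logp[c]+=math.log(max(gaussian(row[f], MU[c][f], STD[c][f]),1e-15))
    # Back to probability space, then normalise so the two posteriors sum to 1.
    p0,p1=math.exp(logp[0]), math.exp(logp[1])
    total=p0+p1
    return {0:p0/total,1:p1/total}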


In [6]:
from google.colab import drive
drive.mount('/content/drive')

excel_path = '/content/drive/MyDrive/Colab Notebooks/Assignment4_Data.xlsx'

# Load the training split (sheet 0) and the test split (sheet 1)
train = pd.read_excel(excel_path, sheet_name=0)
test  = pd.read_excel(excel_path, sheet_name=1)

# binary labels
train['Class']=(train['Construction type']!='Apartment').astype(int)
test['Class'] =(test['Construction type']!='Apartment').astype(int)

correct=0
for i,r in test.iterrows():
    pr=predict_probs(r)
    pred=max(pr,key=pr.get)
    correct+=int(pred==r['Class'])
    print(f"Row {i+1}: P0={pr[0]:.4f} P1={pr[1]:.4f} -> Pred={pred}  Actual={r['Class']}")

print(f"Naïve Bayes accuracy on test split: {correct/len(test):.2f}")


Row 1: P0=0.1126 P1=0.8874 -> Pred=1  Actual=0
Row 2: P0=0.4580 P1=0.5420 -> Pred=1  Actual=1
Row 3: P0=0.1889 P1=0.8111 -> Pred=1  Actual=1
Row 4: P0=0.3585 P1=0.6415 -> Pred=1  Actual=0
Row 5: P0=0.2151 P1=0.7849 -> Pred=1  Actual=0
Naïve Bayes accuracy on test split: 0.40


## 2 Decision Tree  

### 2.1
Using the off-the-shelf `DecisionTreeClassifier()`, the algorithm keeps splitting until every training leaf is pure.  
* **Training accuracy = 1.0000** – the tree memorizes all 20 training rows.  
* **Test accuracy = 0.6000** – on the five unseen houses it gets 3 / 5 correct.  
The perfect in-sample score combined with the drop on new data is a sign of overfitting: the model has captured noise in the training set.  

---

### 2.2   
`max_depth` was limited to each value from 1 to 10:

| depth | train acc | test acc |
|-------|-----------|----------|
| 1 | 0.75 | 0.40 |
| 2 | 0.85 | **0.60** |
| 3-4 | 0.95 | 0.60 |
| 5-10 | 1.00 | 0.60 |

Depth 2 is the shallowest tree that reaches the highest test score (0.60) while still avoiding full memorization. Deeper models keep improving the fit to the tiny training set but do not lift test performance, which suggests the extra splits are fitting noise.
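
The choice of depth 2 can be reproduced from the test column alone. This is a small sketch of the selection rule, assuming ties are broken in favor of the shallowest tree; the dictionary simply restates the test accuracies from the table above.

```python
# Test accuracy per max_depth, copied from the table above.
test_acc = {1: 0.40, 2: 0.60, 3: 0.60, 4: 0.60, 5: 0.60,
            6: 0.60, 7: 0.60, 8: 0.60, 9: 0.60, 10: 0.60}

best_score = max(test_acc.values())
# Among tied depths, prefer the shallowest (simplest) tree.
best_depth = min(d for d, acc in test_acc.items() if acc == best_score)
print(best_depth, best_score)
```

With only five test rows, each prediction moves test accuracy by 0.20, so differences between depths are coarse.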

---

### 2.3
A shallow tree acts like built-in regularization:  

* **Fewer leaves → larger sample per leaf** – each decision is supported by several houses, so the rule is more reliable.
* **Focus on the strongest predictors** – the first two or three splits usually involve the features that reduce impurity the most (here Local Price and Age of home; scikit-learn uses Gini impurity by default). Less informative variables are ignored.  
* **Lower variance** – with only 20 training rows, an unconstrained tree has enormous variance; depth-capping keeps that variance in check, improving generalization.

That is why accuracy peaks at a modest depth and then plateaus: the useful structure is extracted after ~2 levels, and everything deeper just memorizes noise.
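
To make "strongest predictor" concrete, the sketch below computes Gini impurity and the weighted impurity decrease of a split, which is the quantity a tree maximizes when choosing a split under scikit-learn's default criterion. The root counts come from the training split (7 Apartments, 13 House/Condo); the child counts are hypothetical and chosen only for illustration, not read from the fitted tree.

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def impurity_decrease(parent, left, right):
    """Weighted Gini reduction from splitting parent into left and right."""
    n = sum(parent)
    n_left, n_right = sum(left), sum(right)
    return gini(parent) - (n_left / n) * gini(left) - (n_right / n) * gini(right)

root = [7, 13]                  # [Apartment, House/Condo] in the training split
left, right = [2, 11], [5, 2]   # hypothetical children, for illustration only

print(round(gini(root), 4))
print(round(impurity_decrease(root, left, right), 4))
```

A pure leaf has impurity 0, so once every leaf is pure no further split can reduce impurity; on 20 rows that point is reached quickly, which is consistent with the plateau described above.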

---

### 2.4
Given the feature vector  

| Feature | Value |
|---------|-------|
| Local Price | 9.0384 |
| Bathrooms | 1 |
| Land Area | 7.8 |
| Living area | 1.5 |
| # Garages | 1.5 |
| # Rooms | 7 |
| # Bedrooms | 3 |
| Age of home | 23 |

the depth-2 tree follows two comparisons:

1. **Local Price ≤ 8.36 ?** – *No* → right branch  
2. **Age of home ≤ 37.5 ?** – *Yes* → reach leaf labelled **0**

That leaf contains only “Apartment” training examples, so the model assigns a probability of 1.00 to Class 0 (`predict_proba` returns `[1. 0.]`).  

**Prediction:** Apartment (Class 0) with essentially full confidence.
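
The two comparisons can be traced in plain Python. This is a sketch of the single path described above, not a reconstruction of the full tree: the thresholds 8.36 and 37.5 come from the text, the function name `depth2_path` is illustrative, and branches the text does not describe return `None` rather than a guessed label.

```python
def depth2_path(sample):
    """Follow the two comparisons taken by the query in the depth-2 tree."""
    if sample["Local Price"] <= 8.36:
        return None   # left subtree: not described in the text
    if sample["Age of home"] <= 37.5:
        return 0      # leaf labelled 0 (Apartment)
    return None       # other right-subtree leaf: not described in the text

query = {"Local Price": 9.0384, "Bathrooms": 1, "Land Area": 7.8,
         "Living area": 1.5, "# Garages": 1.5, "# Rooms": 7,
         "# Bedrooms": 3, "Age of home": 23}

print(depth2_path(query))
```

Only Local Price and Age of home influence this prediction; the other six features are never consulted on this path.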

---


In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd, numpy as np

train = pd.read_excel(excel_path, sheet_name=0)
test  = pd.read_excel(excel_path, sheet_name=1)
train['Class']=(train['Construction type']!='Apartment').astype(int)
test['Class'] =(test['Construction type']!='Apartment').astype(int)

features=['Local Price','Bathrooms','Land Area','Living area','# Garages','# Rooms','# Bedrooms','Age of home']
Xtr,ytr=train[features],train['Class']
Xte,yte=test[features],test['Class']

print("Depth | Train | Test")
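best_depth,best_score=1,0
for d in range(1,11):
    # Fit a tree capped at depth d; random_state fixes tie-breaking between equal splits.
    dt=DecisionTreeClassifier(max_depth=d,random_state=42).fit(Xtr,ytr)
    tr,te=accuracy_score(ytr,dt.predict(Xtr)),accuracy_score(yte,dt.predict(Xte))
    print(f"{d:>5} | {tr:.2f}  | {te:.2f}")
    # Strict '>' keeps the shallowest depth when several depths tie on test accuracy.
    if te>best_score:
        best_score, best_depth = te, d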

# inference
best=DecisionTreeClassifier(max_depth=best_depth,random_state=42).fit(Xtr,ytr)
query=pd.DataFrame({'Local Price':[9.0384],'Bathrooms':[1],'Land Area':[7.8],'Living area':[1.5],
                    '# Garages':[1.5],'# Rooms':[7],'# Bedrooms':[3],'Age of home':[23]})
print("\nDepth chosen:",best_depth)
print("Prediction:",best.predict(query)[0],"Proba:",best.predict_proba(query)[0])


Depth | Train | Test
    1 | 0.75  | 0.40
    2 | 0.85  | 0.60
    3 | 0.95  | 0.60
    4 | 0.95  | 0.60
    5 | 1.00  | 0.60
    6 | 1.00  | 0.60
    7 | 1.00  | 0.60
    8 | 1.00  | 0.60
    9 | 1.00  | 0.60
   10 | 1.00  | 0.60

Depth chosen: 2
Prediction: 0 Proba: [1. 0.]
