# Part 1: Numbers

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler


In [2]:
# Load the digits dataset
digits = datasets.load_digits()
X = digits['data']
Y = digits['target']


In [3]:
# Train-test split
X_train_dc, X_test_dc, y_train_dc, y_test_dc = train_test_split(X, Y, test_size=0.3, random_state=42)

In [4]:
# 1. Model with 64 features
dt_full = DecisionTreeClassifier(random_state=42)
dt_full.fit(X_train_dc, y_train_dc)
y_pred_full_train_dc = dt_full.predict(X_train_dc)
y_pred_full_test_dc = dt_full.predict(X_test_dc)
acc_full_train_dc = accuracy_score(y_train_dc, y_pred_full_train_dc)
acc_full_test_dc = accuracy_score(y_test_dc, y_pred_full_test_dc)
print("Accuracy with all 64 features - Train:", acc_full_train_dc)
print("Accuracy with all 64 features - Test:", acc_full_test_dc)

Accuracy with all 64 features - Train: 1.0
Accuracy with all 64 features - Test: 0.8425925925925926


In [5]:
# 2. Feature Selection using Chi-Square Test (Top 25 features)
kbest_dc = SelectKBest(chi2, k=25)
X_train_kbest_dc = kbest_dc.fit_transform(X_train_dc, y_train_dc)
X_test_kbest_dc = kbest_dc.transform(X_test_dc)

dt_kbest = DecisionTreeClassifier(random_state=42)
dt_kbest.fit(X_train_kbest_dc, y_train_dc)
y_pred_kbest_train_dc = dt_kbest.predict(X_train_kbest_dc)
y_pred_kbest_test_dc = dt_kbest.predict(X_test_kbest_dc)
acc_kbest_train_dc = accuracy_score(y_train_dc, y_pred_kbest_train_dc)
acc_kbest_test_dc = accuracy_score(y_test_dc, y_pred_kbest_test_dc)
print("Accuracy with 25 selected features (Chi-Square) - Train:", acc_kbest_train_dc)
print("Accuracy with 25 selected features (Chi-Square) - Test:", acc_kbest_test_dc)

Accuracy with 25 selected features (Chi-Square) - Train: 1.0
Accuracy with 25 selected features (Chi-Square) - Test: 0.8629629629629629


In [6]:
# 3. Feature Reduction using PCA (Top 25 components)
pca = PCA(n_components=25)
X_train_pca = pca.fit_transform(X_train_dc)
X_test_pca = pca.transform(X_test_dc)

dt_pca = DecisionTreeClassifier(random_state=42)
dt_pca.fit(X_train_pca, y_train_dc)
y_pred_pca_train = dt_pca.predict(X_train_pca)
y_pred_pca_test = dt_pca.predict(X_test_pca)
acc_pca_train = accuracy_score(y_train_dc, y_pred_pca_train)
acc_pca_test = accuracy_score(y_test_dc, y_pred_pca_test)
print("Accuracy with 25 PCA components - Train:", acc_pca_train)
print("Accuracy with 25 PCA components - Test:", acc_pca_test)


Accuracy with 25 PCA components - Train: 1.0
Accuracy with 25 PCA components - Test: 0.8407407407407408


In [7]:
# Compare models
print("\nModel Comparisons:")
print(f"Full feature model accuracy - Train: {acc_full_train_dc:.4f}, Test: {acc_full_test_dc:.4f}")
print(f"Chi-Square selected feature model accuracy - Train: {acc_kbest_train_dc:.4f}, Test: {acc_kbest_test_dc:.4f}")
print(f"PCA reduced feature model accuracy - Train: {acc_pca_train:.4f}, Test: {acc_pca_test:.4f}")


Model Comparisons:
Full feature model accuracy - Train: 1.0000, Test: 0.8426
Chi-Square selected feature model accuracy - Train: 1.0000, Test: 0.8630
PCA reduced feature model accuracy - Train: 1.0000, Test: 0.8407


## Which Model is More Likely to Overfit or Underfit?
- **Full feature model (64 features):** The unpruned tree reaches 100% training accuracy but only 84.26% on the test set, a gap of about 16 points that indicates overfitting. With all 64 pixel features available, the tree can split on noisy features as well as informative ones.
- **Chi-Square selection (25 features):** This model has the highest test accuracy (86.30%), suggesting the selection keeps the most relevant features while discarding noisy ones. Its train-test gap is the smallest of the three, although the 100% training accuracy shows it still overfits.
- **PCA (25 components):** PCA ranks components by variance rather than by relevance to the target, so some discriminative information may be lost. Its test accuracy (84.07%) is slightly below the full model's, but with 100% training accuracy it is overfitting rather than underfitting.
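
The three test accuracies differ by only about two points on a single 70/30 split, which may be within split-to-split noise. One way to check is k-fold cross-validation, with the selection or reduction step placed inside a pipeline so that it is refit on each training fold. The sketch below is an illustration and not part of the original analysis; the names `candidates` and `cv_scores` and the choice of 5 folds are assumptions.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, Y = datasets.load_digits(return_X_y=True)

# Each candidate bundles its feature step with the tree, so the step is
# fit only on the training part of every fold.
candidates = {
    "full": make_pipeline(DecisionTreeClassifier(random_state=42)),
    "chi2_25": make_pipeline(
        SelectKBest(chi2, k=25), DecisionTreeClassifier(random_state=42)
    ),
    "pca_25": make_pipeline(
        PCA(n_components=25, random_state=42),
        DecisionTreeClassifier(random_state=42),
    ),
}

cv_scores = {}
for name, model in candidates.items():
    # 5-fold accuracy; the fold count is an illustrative choice
    scores = cross_val_score(model, X, Y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

If the fold-to-fold standard deviation is comparable to the differences between the means, the ranking of the three models should be treated as tentative.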

## Model Comparison
| Model | Features | Train Accuracy | Test Accuracy |
|-------|----------|----------------|---------------|
| Full Model | 64 | 1.00 | 0.8426 |
| Chi-Square | 25 | 1.00 | 0.8630 |
| PCA | 25 | 1.00 | 0.8407 |

- The **full model** overfits: training accuracy is 1.00 while test accuracy is 0.8426.
- The **Chi-Square model is the best performer**, achieving the highest test accuracy and the smallest train-test gap, though it still fits the training set perfectly.
- The **PCA model** has the lowest test accuracy, likely because some useful information was lost; its training accuracy is also 1.00, so this is not underfitting.
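
The train-test gap for each model can be computed directly from the accuracies in the table. This is a small worked calculation; the dictionary names `results` and `gaps` are illustrative.

```python
# (train accuracy, test accuracy) as reported in the comparison table
results = {
    "Full Model": (1.00, 0.8426),
    "Chi-Square": (1.00, 0.8630),
    "PCA": (1.00, 0.8407),
}

# Gap between train and test accuracy, rounded to four decimals
gaps = {name: round(train - test, 4) for name, (train, test) in results.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.4f}")
```

All three gaps fall between roughly 0.14 and 0.16, so every model overfits to a similar degree; the Chi-Square model's gap is only the smallest of the three.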

# Part 2: House Prices Prep


In [8]:
## Import data
file_path = "data/House_Prices.csv"
houses = pd.read_csv(file_path)

print("Shape:", houses.shape)
houses.head()


Shape: (10659, 13)


,Unnamed: 0,Record,Sale_amount,Sale_date,Beds,Baths,Sqft_home,Sqft_lot,Type,Build_year,Town,University,Type2
0,1,1,295000.0,42521,5,3.0,2020,38332.8,3,1976,1,10,3
1,2,2,240000.0,42541,4,2.0,1498,54014.4,3,2002,1,10,3
2,3,3,385000.0,42521,5,4.0,4000,85813.2,3,2001,1,10,3
3,4,4,268000.0,42472,3,2.5,2283,118918.8,3,1972,1,10,3
4,5,5,186000.0,42465,3,1.25,1527,15681.6,3,1975,1,10,3


In [9]:
# Dropping unused columns
df = houses.drop(columns = ["Unnamed: 0", "Record", "University", "Type2"])
df

,Sale_amount,Sale_date,Beds,Baths,Sqft_home,Sqft_lot,Type,Build_year,Town
0,295000.0,42521,5,3.00,2020,38332.8,3,1976,1
1,240000.0,42541,4,2.00,1498,54014.4,3,2002,1
2,385000.0,42521,5,4.00,4000,85813.2,3,2001,1
3,268000.0,42472,3,2.50,2283,118918.8,3,1972,1
4,186000.0,42465,3,1.25,1527,15681.6,3,1975,1
...,...,...,...,...,...,...,...,...,...
10654,320000.0,42528,3,2.00,1870,13068.0,3,2012,50
10655,359100.0,42468,5,4.50,2119,11325.6,3,2013,50
10656,349646.0,42534,3,1.75,1949,14374.8,3,2015,50
10657,288000.0,42476,4,4.00,2710,10890.0,3,2012,50


## Before Preprocessing

In [10]:
## Define target variable
y = df["Town"]

In [11]:
## Set features (all columns except the target)
x_raw = df.drop(columns="Town")

In [12]:
# Train test split before preprocessing
x_train_raw, x_test_raw, y_train, y_test = train_test_split(x_raw, y, test_size=0.3, random_state=42)
y

0         1
1         1
2         1
3         1
4         1
         ..
10654    50
10655    50
10656    50
10657    50
10658    50
Name: Town, Length: 10659, dtype: int64

In [None]:
# Initialize and train the model before preprocessing
best_model = RandomForestClassifier(n_estimators=500, max_depth=15, random_state=42)
best_model.fit(x_train_raw, y_train)

In [None]:
# Predictions before preprocessing
y_pred_train_before = best_model.predict(x_train_raw)
y_pred_test_before = best_model.predict(x_test_raw)

In [None]:
# Get accuracy before preprocessing
train_acc_before = accuracy_score(y_train, y_pred_train_before)
test_acc_before = accuracy_score(y_test, y_pred_test_before)

print("Accuracy before preprocessing - Train:", train_acc_before)
print("Accuracy before preprocessing - Test:", test_acc_before)


Accuracy before preprocessing - Train: 0.9923602734217933
Accuracy before preprocessing - Test: 0.38461538461538464


## Apply Preprocessing

In [None]:
x_processed = pd.get_dummies(x_raw['Type'], prefix='Type').astype('int')
x_processed = pd.concat([x_raw, x_processed], axis=1)
x_processed.drop(columns='Type', inplace=True)
x_processed

,Sale_amount,Sale_date,Beds,Baths,Sqft_home,Sqft_lot,Build_year,Type_1,Type_2,Type_3
0,295000.0,42521,5,3.00,2020,38332.8,1976,0,0,1
1,240000.0,42541,4,2.00,1498,54014.4,2002,0,0,1
2,385000.0,42521,5,4.00,4000,85813.2,2001,0,0,1
3,268000.0,42472,3,2.50,2283,118918.8,1972,0,0,1
4,186000.0,42465,3,1.25,1527,15681.6,1975,0,0,1
...,...,...,...,...,...,...,...,...,...,...
10654,320000.0,42528,3,2.00,1870,13068.0,2012,0,0,1
10655,359100.0,42468,5,4.50,2119,11325.6,2013,0,0,1
10656,349646.0,42534,3,1.75,1949,14374.8,2015,0,0,1
10657,288000.0,42476,4,4.00,2710,10890.0,2012,0,0,1


In [None]:
## Scaling Numerical Features
scaler = StandardScaler()

cols_to_scale = ['Sale_amount', 'Sale_date', 'Beds', 'Baths', 'Sqft_home', 'Sqft_lot', 'Build_year']

x_processed[cols_to_scale] = scaler.fit_transform(x_processed[cols_to_scale])
x_processed

,Sale_amount,Sale_date,Beds,Baths,Sqft_home,Sqft_lot,Build_year,Type_1,Type_2,Type_3
0,-0.124695,0.354976,1.499324,0.617398,-0.031612,0.243523,0.236781,0,0,1
1,-0.285850,0.839862,0.542360,-0.326315,-0.344471,0.457297,1.032831,0,0,1
2,0.139015,0.354976,1.499324,1.561111,1.155095,0.890781,1.002214,0,0,1
3,-0.203807,-0.832995,-0.414604,0.145541,0.126017,1.342080,0.114312,0,0,1
4,-0.444076,-1.002706,-0.414604,-1.034100,-0.327090,-0.065260,0.206164,0,0,1
...,...,...,...,...,...,...,...,...,...,...
10654,-0.051442,0.524686,-0.414604,-0.326315,-0.121514,-0.100889,1.339004,0,0,1
10655,0.063125,-0.929973,1.499324,2.032967,0.027724,-0.124642,1.369622,0,0,1
10656,0.035424,0.670152,-0.414604,-0.562244,-0.074165,-0.083075,1.430856,0,0,1
10657,-0.145205,-0.736018,0.542360,1.561111,0.381938,-0.130580,1.339004,0,0,1
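
The scaling cell above fits `StandardScaler` on all rows before the train-test split, so test-set statistics influence the scaling. A variant that splits first and fits the scaler on the training rows only could look like the sketch below. It runs on a small hypothetical frame (`demo`, `demo_y`, and `demo_cols_to_scale` are stand-ins, not the notebook's data) so that it is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for x_processed and y; values are random
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Sale_amount": rng.uniform(100_000, 500_000, size=200),
    "Sqft_home": rng.uniform(800, 4_000, size=200),
    "Type_3": rng.integers(0, 2, size=200),
})
demo_y = rng.integers(1, 6, size=200)
demo_cols_to_scale = ["Sale_amount", "Sqft_home"]

# Split first, then copy so the column assignment below is safe
demo_train, demo_test, demo_y_train, demo_y_test = train_test_split(
    demo, demo_y, test_size=0.3, random_state=42
)
demo_train = demo_train.copy()
demo_test = demo_test.copy()

# Fit on training rows only; reuse the fitted statistics for the test rows
demo_scaler = StandardScaler()
demo_train[demo_cols_to_scale] = demo_scaler.fit_transform(demo_train[demo_cols_to_scale])
demo_test[demo_cols_to_scale] = demo_scaler.transform(demo_test[demo_cols_to_scale])

print(demo_train[demo_cols_to_scale].mean().round(6))
```

With this ordering, the scaled training columns have mean zero, while the test columns are only approximately centered, which is the expected behavior.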


## After Preprocessing

In [None]:
# Train-test split after preprocessing
x_train_processed, x_test_processed, y_train, y_test = train_test_split(x_processed, y, test_size=0.3, random_state=42)

In [None]:
# Train the model after preprocessing
best_model.fit(x_train_processed, y_train)

In [None]:
# Predictions after preprocessing
y_pred_train_after = best_model.predict(x_train_processed)
y_pred_test_after = best_model.predict(x_test_processed)

In [None]:
# Get accuracy after preprocessing
train_acc_after = accuracy_score(y_train, y_pred_train_after)
test_acc_after = accuracy_score(y_test, y_pred_test_after)

print("Accuracy after preprocessing - Train:", train_acc_after)
print("Accuracy after preprocessing - Test:", test_acc_after)

Accuracy after preprocessing - Train: 0.9912880310950275
Accuracy after preprocessing - Test: 0.38586616635397125


## Compare Two Models (Before vs. After Transformations)

In [None]:
# Print results
print(f"Original Model (Before Preprocessing) - Train Accuracy: {train_acc_before:.4f}, Test Accuracy: {test_acc_before:.4f}")
print(f"Processed Model (After Preprocessing) - Train Accuracy: {train_acc_after:.4f}, Test Accuracy: {test_acc_after:.4f}")

Original Model (Before Preprocessing) - Train Accuracy: 0.9924, Test Accuracy: 0.3846
Processed Model (After Preprocessing) - Train Accuracy: 0.9913, Test Accuracy: 0.3859


## Drop Columns Using KBest

In [None]:
# Feature selection using SelectKBest (5 best features)
k = 5
selector = SelectKBest(score_func=chi2, k=k)
x_selected = selector.fit_transform(x_raw, y)

In [None]:
# Get selected feature names
selected_features = x_raw.columns[selector.get_support()]
print("Selected Features using chi2 on raw data:", selected_features)

Selected Features using chi2 on raw data: Index(['Sale_amount', 'Baths', 'Sqft_home', 'Sqft_lot', 'Build_year'], dtype='object')
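
When reading this selection, it helps to know that the `chi2` score is computed from per-class sums of each feature, so multiplying a non-negative feature by a constant multiplies its score by the same constant. Columns with large units (for example prices or lot sizes) can therefore rank highly partly because of their magnitude. The sketch below illustrates this on hypothetical data; `demo_x`, `demo_y`, `feat_a`, and `feat_b` are made-up names, not columns from the house data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical non-negative features; feat_a depends on the label, feat_b does not
rng = np.random.default_rng(0)
demo_y = rng.integers(0, 3, size=300)
demo_x = pd.DataFrame({
    "feat_a": rng.uniform(0, 10, size=300) + demo_y,
    "feat_b": rng.uniform(0, 10, size=300),
})

scores, _ = chi2(demo_x, demo_y)

# Same data, but feat_a expressed in units 1000 times smaller
rescaled = demo_x.copy()
rescaled["feat_a"] = rescaled["feat_a"] * 1000
scores_rescaled, _ = chi2(rescaled, demo_y)

print(pd.Series(scores, index=demo_x.columns, name="original"))
print(pd.Series(scores_rescaled, index=demo_x.columns, name="rescaled"))
```

In the notebook itself, the analogous check would be to print `selector.scores_` alongside `x_raw.columns` and see whether the ranking tracks column magnitude.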


In [None]:
# Convert selected features back to a DataFrame
x_selected_df = pd.DataFrame(x_selected, columns=selected_features)

In [None]:
# Train test split after feature selection
x_train_selected, x_test_selected, y_train, y_test = train_test_split(
    x_selected_df, y, test_size=0.3, random_state=42
)

In [None]:
# Train the model after feature selection
best_model.fit(x_train_selected, y_train)

In [None]:
# Predictions after feature selection
y_pred_train_kbest = best_model.predict(x_train_selected)
y_pred_test_kbest = best_model.predict(x_test_selected)

In [None]:
# Get accuracy after feature selection
train_acc_kbest = accuracy_score(y_train, y_pred_train_kbest)
test_acc_kbest = accuracy_score(y_test, y_pred_test_kbest)

In [None]:
# Print results
print(f"Original Model (Before Preprocessing) - Train Accuracy: {train_acc_before:.4f}, Test Accuracy: {test_acc_before:.4f}")
print(f"Processed Model (After Preprocessing) - Train Accuracy: {train_acc_after:.4f}, Test Accuracy: {test_acc_after:.4f}")
print(f"KBest Model (Feature Selection with chi2 on raw data) - Train Accuracy: {train_acc_kbest:.4f}, Test Accuracy: {test_acc_kbest:.4f}")

Original Model (Before Preprocessing) - Train Accuracy: 0.9924, Test Accuracy: 0.3846
Processed Model (After Preprocessing) - Train Accuracy: 0.9913, Test Accuracy: 0.3859
KBest Model (Feature Selection with chi2 on raw data) - Train Accuracy: 0.9890, Test Accuracy: 0.3271


## **Model Comparison and Discussion**  

### **1. Original Model (Before Preprocessing)**
- **Train Accuracy:** 99.24%  
- **Test Accuracy:** 38.46%  
- **Summary:**  
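  - The model performs very well on training data but poorly on test data, showing **overfitting**.  
  - The low test accuracy suggests that the available features carry limited information about the town, so the forest largely memorizes training rows instead of learning patterns that transfer.  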
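
A test accuracy is easier to interpret next to a trivial baseline. With many town classes, always predicting the most frequent class sets a floor that any useful model should beat. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical labels (`demo_x` and `demo_y` are random stand-ins, with 50 classes chosen as an assumption); in the notebook the same estimator would be fit on `x_train_raw` and `y_train`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical features and 50-class labels, for illustration only
rng = np.random.default_rng(42)
demo_x = rng.normal(size=(1000, 4))
demo_y = rng.integers(1, 51, size=1000)

demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, test_size=0.3, random_state=42
)

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(demo_x_train, demo_y_train)
baseline_acc = accuracy_score(demo_y_test, baseline.predict(demo_x_test))

print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
```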

### **2. Processed Model (After Preprocessing)**
- **Train Accuracy:** 99.13%  
- **Test Accuracy:** 38.59%  
- **Summary:**  
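  - Scaling and one-hot encoding changed test accuracy only marginally (38.46% to 38.59%) and did not reduce overfitting. This is expected, since tree-based models such as random forests are largely insensitive to feature scaling.  
  - The model still struggles to generalize to new data.  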
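
Tree-based models split on thresholds, so standardizing a feature moves the thresholds without changing which rows fall on each side. The sketch below checks this on hypothetical data with very different feature scales; all names (`demo_x`, `raw_tree`, `scaled_tree`, `agreement`) are illustrative, and a single decision tree is used instead of a forest to keep the comparison simple.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features on very different scales
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(400, 3)) * np.array([1.0, 100.0, 1000.0])
demo_y = (demo_x[:, 0] + demo_x[:, 1] / 100.0 > 0).astype(int)

x_tr, x_te, y_tr, y_te = train_test_split(
    demo_x, demo_y, test_size=0.3, random_state=0
)

# Tree trained on the raw features
raw_tree = DecisionTreeClassifier(random_state=0).fit(x_tr, y_tr)

# Tree trained on standardized features (scaler fit on training rows only)
scaler = StandardScaler().fit(x_tr)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(scaler.transform(x_tr), y_tr)

# Fraction of test rows on which the two trees agree
agreement = np.mean(
    raw_tree.predict(x_te) == scaled_tree.predict(scaler.transform(x_te))
)
print(f"Prediction agreement between raw and scaled trees: {agreement:.4f}")
```

The two trees should agree on all or nearly all test rows, which is consistent with the near-identical accuracies before and after preprocessing.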

### **3. Model with SelectKBest (Feature Selection with chi2)**
- **Train Accuracy:** 98.90%  
- **Test Accuracy:** 32.71%  
- **Summary:**  
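  - Feature selection was run on the raw, unscaled data, which also satisfies the `chi2` requirement of non-negative inputs.  
  - Reducing to 5 features lowered training accuracy slightly but **reduced test accuracy much more** (38.46% to 32.71%).  
  - Some important features may have been removed; `Sale_date`, `Beds`, and `Type` were dropped.  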

### **Summary**
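- The **original model overfits**, performing well on training data but poorly on test data.  
- The **processed model** is essentially unchanged: test accuracy rose by about 0.1 percentage points and the overfitting remained.  
- The **feature selection model** did not reduce overfitting: training accuracy stayed near 99% while test accuracy fell to 32.71%, indicating that useful information was lost.  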
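
Since all three models overfit, one direction worth exploring is constraining the forest itself instead of changing the features. The sketch below compares train and test accuracy for a few `min_samples_leaf` values on synthetic data from `make_classification`; the dataset, the parameter values, and the name `acc_by_leaf` are assumptions for illustration and say nothing about what would happen on the house data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic multi-class data, used only to keep the example self-contained
demo_x, demo_y = make_classification(
    n_samples=1500, n_features=8, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)
x_tr, x_te, y_tr, y_te = train_test_split(
    demo_x, demo_y, test_size=0.3, random_state=42
)

acc_by_leaf = {}
for min_leaf in (1, 5, 20):
    # Larger leaves force each tree to average over more training rows
    forest = RandomForestClassifier(
        n_estimators=100, min_samples_leaf=min_leaf, random_state=42
    )
    forest.fit(x_tr, y_tr)
    train_acc = accuracy_score(y_tr, forest.predict(x_tr))
    test_acc = accuracy_score(y_te, forest.predict(x_te))
    acc_by_leaf[min_leaf] = (train_acc, test_acc)
    print(f"min_samples_leaf={min_leaf}: train={train_acc:.4f}, test={test_acc:.4f}")
```

A shrinking gap between the two accuracies as `min_samples_leaf` grows would indicate less memorization; whether test accuracy itself improves depends on the data.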


## Citation

**ChatGPT**: https://chatgpt.com/

**Scikit-learn**: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

**Lab 4 Code**