# Grade: 100 points

# Assignment 01: Traditional Machine Learning 

## Instructions

#### Follow These Steps before submitting your assignment 

This notebook contains the questions for Assignment 1. 

You must upload this completed Jupyter Notebook file as your submission (other file types are not permitted and will result in a grade of 0).

* If you have trouble running neural network models on your laptop, you can use online platforms, like **[Google Colab](https://colab.research.google.com/)**.
* All figures should have x- and y-axis labels and an appropriate title.
**Ensure that your code runs correctly by choosing "Kernel -> Restart and Cell -> Run All" before submitting.**

# Datasets:

`Dataset1.csv` contains housing market data, where the goal is to predict house prices based on various factors:

- SqFt: Square footage of the house
- Bedrooms: Number of bedrooms
- Bathrooms: Number of bathrooms
- Dist_Center: Distance to city center (in miles)
- House_Age: Age of the house (years)
- Floors: Number of floors
- Lot_Size: Lot size (square feet)
- Walk_Score: Walkability score (0-100)
- HOA_Fee: Monthly HOA fee (if applicable)
- Crime_Rate: Local crime rate (incidents per 1000 residents)
- Dist_School: Distance to the nearest school (miles)
- Grocery_Stores: Number of nearby grocery stores
- Prop_Tax: Property tax rate (%)
- Maint_Cost: Yearly maintenance costs
- Median_Income: Household income of the neighborhood (median)
- Region: Region (values: 'A', 'B', 'C', 'D', 'E')
- House_Cond: House condition ('Low', 'Medium', 'High')
- Urban_Rural: Urban/Rural location ('Urban', 'Rural')
- House_Price: House price (Target variable)


-----------------------------------------------------------------------------------------------------


`Dataset2.csv` is a loan status dataset to predict whether a loan application will be fully approved, conditionally approved, or rejected based on various factors:

- Credit_Score: Applicant's credit score (300-850, higher is better)
- Income: Monthly income
- Loan_Amount: Requested loan amount 
- Loan_Term: Duration of loan (in months)
- Debt_Income: Debt-to-income ratio (%)
- Open_Accounts: Number of active credit accounts
- Hist_Length: Credit history length (years)
- Delinquencies: Number of past missed payments
- Total_Loan_Balance: Total outstanding loan balance
- Credit_Inquiries: Number of recent hard credit inquiries
- Employer_Tenure: How long the applicant has been at their job (years)
- LTV_Ratio: Loan-to-value ratio (%)
- Loan_Purpose: Purpose of the loan (X, Y, Z)
- Employment_Type: Employer size (Small, Medium, Large)
- Loan_Status: Target variable (0 is Loan Rejected, 1 is Conditionally Approved, 2 is Fully Approved)

In [None]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

import matplotlib.pyplot as plt

# Q1 - Data Loading and Exploration (5 pts)

1. Load Dataset1.
2. Display basic statistics and inspect for missing data.
3. Encode the categorical features (one-hot encoding). 
4. Visualize the distribution of all features using histograms.
5. **Discussion Question:** Why is it important to explore and visualize the data before building any models? What types of trends or problems could you uncover at this stage?

In [None]:
# 1. Load Dataset1 (pd.read_csv already returns a DataFrame)
df = pd.read_csv('Dataset1.csv')

# 2. Preview the first rows
df.head()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
# 2. inspect for missing data
df.isnull().sum()

In [None]:
# 3. Encode the categorical features (one-hot encoding)
categorical_cols = ['Region', 'House_Cond', 'Urban_Rural']
encoder = OneHotEncoder(sparse_output=False)

encoded_array = encoder.fit_transform(df[categorical_cols])
encoded_df = pd.DataFrame(encoded_array, columns=encoder.get_feature_names_out(categorical_cols))

# Drop original categorical columns and merge
df_encoded = df.drop(columns=categorical_cols).reset_index(drop=True)
df_encoded = pd.concat([df_encoded, encoded_df], axis=1)
df_encoded.info()

In [None]:
# 4. Visualize the distribution of all features using histogram.

# Categorical Features
for column in categorical_cols:
    plt.hist(df[column])
    plt.title(f"Distribution of {column}")
    plt.xlabel(column)
    plt.ylabel("Count")
    plt.show()

In [None]:
# Numerical Features
numerical_cols = [column for column in df.columns if column not in categorical_cols]
for column in numerical_cols:
    plt.hist(df_encoded[column])
    plt.title(f"Distribution of {column}")
    plt.xlabel(column)
    plt.ylabel("Count")
    plt.show()

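**Answer to Discussion Question**: 
1. Exploration reveals missing values and outliers, so we can clean the data properly before modeling.
2. It shows each feature's type and distribution (for example skew or very different scales), which guides feature engineering such as encoding and normalization.
3. It also hints at which models are likely to suit the data, for example whether the relationship between the features and the target looks roughly linear.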

# Q2 - Outlier Detection (10 pts)
1. Train a Gaussian Mixture Model (GMM) on Dataset1 to identify potential outliers.

2. Remove the detected outliers and save the cleaned dataset.

3. How many outliers did you detect?

4. Visualize the histogram plot of the remaining data.

5. **Discussion Question**: What are outliers, and why is it important to detect them?
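
One possible way to approach steps 1 to 3 is to fit a GMM and flag the samples with the lowest log-density. The sketch below runs on synthetic data rather than Dataset1, and both `n_components=2` and the 2nd-percentile cutoff are illustrative assumptions, not values prescribed by the assignment.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the numeric features of Dataset1 (assumption).
rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
scattered = rng.uniform(low=-10.0, high=10.0, size=(10, 3))
X = np.vstack([inliers, scattered])

# Fit the mixture; the number of components is an illustrative choice.
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(X)

# score_samples returns the log-likelihood of each sample under the model.
log_density = gmm.score_samples(X)

# Flag the lowest-density samples; the 2nd percentile is an assumed cutoff.
threshold = np.percentile(log_density, 2)
is_outlier = log_density < threshold

X_clean = X[~is_outlier]
n_outliers = int(is_outlier.sum())
print(f"Detected {n_outliers} outliers; {X_clean.shape[0]} samples remain.")
```

The cutoff directly controls how many points are removed, so it is worth trying a few values and checking the histograms of the remaining data.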

**Answer to Discussion Question**: 



# Q3 - Correlation and Feature Selection (10 pts)

1. Visualize correlations between features and target using a heatmap to identify highly correlated features.
2. Select the numerical features whose correlation with the target is above each of two thresholds (0.02 and 0.04), and print both sets.
3. **Discussion Question:** How do you interpret a correlation value? Does a higher correlation always mean a feature is more important?
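
Threshold-based selection can be sketched as follows. The example uses a small synthetic frame, the column names (`strong`, `weak`, `noise`, `target`) and the helper `select_by_threshold` are hypothetical, and comparing the absolute value of the correlation is an assumption about how the thresholds should be applied.

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for the numeric columns of Dataset1 (assumption).
rng = np.random.default_rng(0)
n = 1000
df_demo = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
df_demo["target"] = (
    3.0 * df_demo["strong"] + 0.5 * df_demo["weak"] + rng.normal(size=n)
)

# Pearson correlation of every feature with the target column.
corr_with_target = df_demo.corr()["target"].drop("target")


def select_by_threshold(corr, threshold):
    """Return the features whose absolute correlation exceeds the threshold."""
    return corr[corr.abs() > threshold].index.tolist()


for threshold in (0.02, 0.04):
    print(threshold, select_by_threshold(corr_with_target, threshold))
```

The full matrix returned by `df_demo.corr()` is what a heatmap would display; only the target column is needed for the selection itself.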

**Answer to Discussion Question**: 



# Q4 - Multiple Linear Regression (20 pts)

1. Build three subsets of the data: two using the feature sets selected by the two correlation thresholds in the previous question, and one using the original dataset with all features. For each subset, split the data into training and test sets, holding out 30% of observations as the test set. Pass random_state=42 to train_test_split to ensure you get the same training and test sets as the solution, and normalize (z-normalization) the data splits. 
2. Build two multiple linear regression models using the subsets defined by the two thresholds in Q3, and train them on their normalized training sets. 
3. Now build and fit a Lasso regression model to the training data using all features in the dataset. Set the penalization parameter to 0.5.
4. Evaluate the three models on their test sets and compare the models using R² and RMSE.
5. **Discussion Question:** How do we decide which features to include in a multiple linear regression model? What challenges might arise from using too many features?
6. **Discussion Question:** Among the three models, which model performs the best and why?
7. **Discussion Question:** If a model has a high R² value but a large RMSE, what might that indicate about the model's performance?
8. **Discussion Question:** Discuss next steps for potential improvements to the best performing model.
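
The split, normalization, and evaluation steps could look roughly like the sketch below. It uses synthetic data in place of Dataset1, scales only the features (whether to scale the target as well is left open here), and the helper `evaluate` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for Dataset1 (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = X @ np.array([4.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=400)

# Hold out 30% of the observations as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Z-normalize with statistics from the training set only, to avoid leakage.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)


def evaluate(model):
    """Fit on the training split and return (R^2, RMSE) on the test split."""
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    r2 = r2_score(y_test, pred)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    return r2, rmse


results = {
    "linear": evaluate(LinearRegression()),
    "lasso": evaluate(Lasso(alpha=0.5)),  # alpha is the penalization parameter
}
for name, (r2, rmse) in results.items():
    print(f"{name}: R2={r2:.3f}, RMSE={rmse:.3f}")
```

For the correlation-based subsets, the same pattern applies after restricting `X` to the selected columns before splitting.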

**Answer to the first Discussion Question**: 



**Answer to the second Discussion Question**: 



**Answer to the third Discussion Question**: 



**Answer to the fourth Discussion Question**: 



# Q5 - Data Loading and Classification (25 pts)

1. Load Dataset2.
2. Display basic statistics and inspect for missing data.
3. Encode the categorical features (one-hot encoding). 
4. Split the data into training, validation, and test sets, holding out 20% of observations as the validation set and 10% as the test set. Pass random_state=42.
5. Normalize the data (Z-normalization) and fit the following models to the training samples:
- Logistic Regression
- K-Nearest Neighbors (KNN) with K equal to 3
- Random Forest (RF) consisting of 5 base decision trees with a maximum depth of 5
- Single-Layer Neural Network (Perceptron) with a stochastic gradient descent (SGD) optimizer and a learning rate of 0.1, run for 10 iterations/epochs

6. Report the training time in milliseconds for all models. 

7. Use the Random Forest model you built to generate feature importance scores and a horizontal bar chart to plot the importance scores of all features in descending order. 

8. Select features from most to least important until the accumulated relative importance score reaches 90% (0.9), and print the selected features with their importance scores.
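
The 70/20/10 split, the timing, and the cumulative-importance selection can be sketched as below. The data come from `make_classification` rather than Dataset2, and the feature names `f0` ... `f9` are placeholders; only the Random Forest is shown, but the same timing pattern would apply to the other models.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for Dataset2 (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, n_classes=3, random_state=42
)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # placeholder names

# First hold out 10% for testing, then 20% of the full data for validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)
n_val = int(round(0.20 * len(X)))
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, random_state=42
)

# Z-normalize using training statistics only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)

# Time the fit in milliseconds.
rf = RandomForestClassifier(n_estimators=5, max_depth=5, random_state=42)
start = time.perf_counter()
rf.fit(X_train_s, y_train)
train_ms = (time.perf_counter() - start) * 1000.0
print(f"RF training time: {train_ms:.2f} ms")

# Rank features by importance and keep them until the running sum reaches 0.9.
order = np.argsort(rf.feature_importances_)[::-1]
cumulative = np.cumsum(rf.feature_importances_[order])
n_keep = min(int(np.searchsorted(cumulative, 0.9)) + 1, len(order))
selected = [
    (feature_names[i], float(rf.feature_importances_[i])) for i in order[:n_keep]
]
for name, score in selected:
    print(f"{name}: {score:.3f}")
```

Splitting in two stages with an explicit validation count avoids rounding surprises that a fractional second split (2/9 of the remainder) could introduce.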

# Q6 - Model Selection (10 pts)

1. Report the prediction results of all models in Q5 on the test set of Dataset2, using these evaluation metrics: Confusion matrix, F1-score, Recall, Precision and Accuracy. 
2. Plot the ROC curve and report AUC of the predictions on the test set.
3. Report the test time (in milliseconds) for all models. 
4. **Discussion Question:** Why is AUC-ROC a better metric than accuracy for this dataset? Provide an example where accuracy can be misleading.
5. **Discussion Question:** Among all models, which one would you choose, and why? 
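
Because `Loan_Status` has three classes, the metrics need an averaging strategy. The sketch below shows one option on synthetic data: macro averaging for precision, recall, and F1, and one-vs-rest AUC. Both choices are assumptions, since the assignment does not specify them, and a single Logistic Regression stands in for the models from Q5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for Dataset2 (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

clf = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
y_pred = clf.predict(X_test_s)
y_proba = clf.predict_proba(X_test_s)  # class probabilities, needed for AUC

# Macro averaging weights each class equally, regardless of its frequency.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, average="macro", zero_division=0),
    "recall": recall_score(y_test, y_pred, average="macro", zero_division=0),
    "f1": f1_score(y_test, y_pred, average="macro", zero_division=0),
    "auc_ovr": roc_auc_score(y_test, y_proba, multi_class="ovr", average="macro"),
}
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)

print(cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A per-class ROC curve can be drawn by treating each class in turn as the positive label and passing the matching column of `y_proba` to `sklearn.metrics.roc_curve`.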

**Answer to the first Discussion Question:** 


**Answer to the second Discussion Question:** 


# Q7 - Model Selection (20 pts)

1. Build a Multi-Layer Perceptron (MLP) and fit it to the normalized training set of Dataset2. The details of the MLP are as follows:
   * Two hidden layers (H1, H2), with 50 and 100 neurons/units in H1 and H2, respectively. 
   * Use tanh function as the activation function for hidden layers.
   * Use a proper activation function for the output layer.  
   * Use Stochastic gradient descent optimizer with a learning rate of 0.1.
   * Run the model for 10 iterations/epochs 
   
2.  Report the training time in milliseconds.
3.  Record the validation and training loss for each iteration, and plot the learning curves (iterations/epochs vs. loss).
4.  Report the prediction results of the MLP on the test set of Dataset2, using these evaluation metrics: Confusion matrix, F1-score, Recall, Precision and Accuracy.
5.  Report the test time (in milliseconds) for the MLP. 
6.  **Analytical Question:** Do you see any signs of overfitting? Why? If the model overfits, how would you fix this issue?
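
One way to record per-epoch losses is to call `partial_fit` once per epoch and compute the log-loss on both splits afterwards. The sketch below assumes scikit-learn's `MLPClassifier` (a deep learning framework would work equally well) and synthetic data in place of Dataset2. For multiclass targets, `MLPClassifier` applies a softmax output layer on its own.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for Dataset2 (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)

classes = np.unique(y_train)
mlp = MLPClassifier(
    hidden_layer_sizes=(50, 100),  # H1 = 50 units, H2 = 100 units
    activation="tanh",
    solver="sgd",
    learning_rate_init=0.1,
    random_state=42,
)

train_losses, val_losses = [], []
start = time.perf_counter()
for epoch in range(10):
    # Each partial_fit call performs one pass over the training data.
    mlp.partial_fit(X_train_s, y_train, classes=classes)
    train_losses.append(
        log_loss(y_train, mlp.predict_proba(X_train_s), labels=classes)
    )
    val_losses.append(log_loss(y_val, mlp.predict_proba(X_val_s), labels=classes))
# Note: this timing also includes the per-epoch loss evaluation.
train_ms = (time.perf_counter() - start) * 1000.0

print(f"Training time: {train_ms:.1f} ms")
for epoch, (tr, va) in enumerate(zip(train_losses, val_losses), start=1):
    print(f"epoch {epoch}: train={tr:.4f}, val={va:.4f}")
```

Plotting `train_losses` and `val_losses` against the epoch number gives the learning curves; a validation loss that rises while the training loss keeps falling is the usual sign of overfitting.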

**Answer to Discussion Question:**
