**Muhammad Abdullah Khan - T546E399**

1) Load the dataset with no headers and whitespace as the delimiter

In [1]:
import pandas as pd

# Load the dataset with no header row, splitting columns on runs of whitespace
df = pd.read_csv('/content/sample_data/boston_housing.data', delimiter=r'\s+', header=None)

# Show the first few rows
print(df.head())

         0     1      2   3      4      5      6       7   8      9     10  \
0   0.10574   0.0  27.74   0  0.609  5.983   98.8  1.8681   4  711.0  20.1   
1   7.75223   0.0  18.10   0  0.713  6.301   83.7  2.7831  24  666.0  20.2   
2   0.02763  75.0   2.95   0  0.428  6.595   21.8  5.4011   3  252.0  18.3   
3   0.09266  34.0   6.09   0  0.433  6.495   18.4  5.4917   7  329.0  16.1   
4  15.17720   0.0  18.10   0  0.740  6.152  100.0  1.9142  24  666.0  20.2   

       11     12    13  
0  390.11  18.07  13.6  
1  272.21  16.23  14.9  
2  395.63   4.32  30.8  
3  383.61   8.67  26.4  
4    9.32  26.45   8.7  


2) Multiply the last column by 1000

In [2]:
# Multiply the last column (column index 13) by 1000
df[13] = df[13] * 1000

# Check that it worked
print(df[13].head())

0    13600.0
1    14900.0
2    30800.0
3    26400.0
4     8700.0
Name: 13, dtype: float64


3) Separate Features and Labels, Then Split the Data into Training and Test Sets

In [3]:
from sklearn.model_selection import train_test_split

# First 13 columns are features, last column is the target (housing price)
X = df.iloc[:, :-1]  # Features (columns 0 to 12)
y = df.iloc[:, -1]   # Target (column 13)

# Perform the split: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print shapes to verify
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Training set shape: (354, 13)
Test set shape: (152, 13)


4) Handle Missing Values (if any) and Scale Features Using StandardScaler

In [4]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler only on the training data
X_train = scaler.fit_transform(X_train)

# Transform the test data using the same scaler
X_test = scaler.transform(X_test)
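
The cell above scales the features but does not check for missing values. A minimal sketch of such a check follows, assuming median imputation would be an acceptable fix. It runs on a small made-up frame, so the toy values and the names `X_train_demo`, `X_test_demo`, and `imputer` are illustrative and not part of the notebook. In the notebook, the same calls would be applied to `X_train` and `X_test` before scaling.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy feature frames with one missing value each (illustrative data only)
X_train_demo = pd.DataFrame({0: [1.0, np.nan, 3.0], 1: [10.0, 20.0, 60.0]})
X_test_demo = pd.DataFrame({0: [np.nan], 1: [40.0]})

# Count missing values per column
missing_per_column = X_train_demo.isna().sum()
print(missing_per_column)

# Learn the medians from the training data only, then reuse them on the test data
imputer = SimpleImputer(strategy='median')
X_train_filled = imputer.fit_transform(X_train_demo)
X_test_filled = imputer.transform(X_test_demo)

print(X_train_filled)
print(X_test_filled)
```

Fitting the imputer on the training split only mirrors how the scaler is handled above: the test set never contributes to the statistics used to fill or scale it.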

5) Initialize the Four Regression Models

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Define the four models
models = {
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=42),
    'RandomForestRegressor': RandomForestRegressor(random_state=42),
    'SVR': SVR(kernel='rbf', gamma=0.01)
}

6) Perform 10-Fold Cross-Validation for Each Model

In [6]:
from sklearn.model_selection import cross_val_score
import numpy as np

# Perform 10-fold CV and print RMSE stats
for name, model in models.items():
    print(f"\nModel: {name}")

    # With this scoring string, cross_val_score returns negated RMSE
    # (so that higher is better); flip the sign to recover the RMSE
    scores = cross_val_score(model, X_train, y_train, cv=10, scoring='neg_root_mean_squared_error')
    rmse_scores = -scores

    print("RMSE for each fold:", np.round(rmse_scores, 2))
    print("Mean RMSE:", round(rmse_scores.mean(), 2))
    print("Standard Deviation:", round(rmse_scores.std(), 2))


Model: LinearRegression
RMSE for each fold: [5950.36 5377.69 4064.73 3410.71 4316.39 3159.05 3238.71 4855.28 5184.96
 5772.  ]
Mean RMSE: 4532.99
Standard Deviation: 993.74

Model: DecisionTreeRegressor
RMSE for each fold: [4048.42 5556.63 4045.13 6892.69 4573.99 3491.5  3168.42 4926.87 3453.32
 6264.62]
Mean RMSE: 4642.16
Standard Deviation: 1193.21

Model: RandomForestRegressor
RMSE for each fold: [3114.33 4971.98 4076.82 2732.86 3434.63 2828.34 2357.64 3384.33 2824.05
 5106.55]
Mean RMSE: 3483.15
Standard Deviation: 896.97

Model: SVR
RMSE for each fold: [ 9376.59  8882.71  8469.07  7542.62  9567.51  8812.32  4699.83  8909.15
 10775.37  9672.35]
Mean RMSE: 8670.75
Standard Deviation: 1547.2
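
Because the scaler in step 4 was fitted on the whole training set, each validation fold above was scaled with statistics that include its own rows. One common way to avoid this is to wrap the scaler and the model in a pipeline, so the scaler is refitted inside every fold. The sketch below illustrates the idea on synthetic data from `make_regression`. That data is an assumption made to keep the example self-contained (the notebook would pass its unscaled training split instead), so the numbers it prints are not comparable to the results above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled training features and labels
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=42)

# The pipeline refits StandardScaler on the training part of each fold
pipeline = make_pipeline(StandardScaler(), LinearRegression())

scores = cross_val_score(pipeline, X_demo, y_demo, cv=10,
                         scoring='neg_root_mean_squared_error')
rmse_scores = -scores  # flip the sign to get positive RMSE values

print("Mean RMSE:", round(rmse_scores.mean(), 2))
print("Standard Deviation:", round(rmse_scores.std(), 2))
```

The same pipeline object can be passed to `GridSearchCV` in place of a bare estimator, in which case parameter names in the grid take a step prefix such as `randomforestregressor__n_estimators`.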


7) GridSearchCV for RandomForestRegressor Hyperparameter Tuning

In [8]:
import numpy as np  # needed for np.sqrt below
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

# Define parameter grid
param_grid = {
    'n_estimators': [10, 20, 30, 40],
    'max_features': [4, 8]
}

# Setup GridSearchCV
grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=10,
    scoring='neg_root_mean_squared_error',
    return_train_score=True
)

# Fit GridSearch
grid_search.fit(X_train, y_train)

# Print results for each combination
print("\nGrid Search CV Results:")
for mean_score, std_score, params in zip(
    grid_search.cv_results_['mean_test_score'],
    grid_search.cv_results_['std_test_score'],
    grid_search.cv_results_['params']
):
    print(f"Params: {params}, Mean RMSE: {-mean_score:.2f}, Std Dev: {std_score:.2f}")

# Print best parameters
print("\nBest Hyperparameters:", grid_search.best_params_)

# Evaluate the best model on the test set
# (GridSearchCV refits it on the full training set by default)
final_model = grid_search.best_estimator_
y_pred = final_model.predict(X_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"\nTest RMSE with best model: {rmse_test:.2f}")


Grid Search CV Results:
Params: {'max_features': 4, 'n_estimators': 10}, Mean RMSE: 3706.72, Std Dev: 734.45
Params: {'max_features': 4, 'n_estimators': 20}, Mean RMSE: 3455.36, Std Dev: 784.00
Params: {'max_features': 4, 'n_estimators': 30}, Mean RMSE: 3304.52, Std Dev: 758.18
Params: {'max_features': 4, 'n_estimators': 40}, Mean RMSE: 3277.26, Std Dev: 758.02
Params: {'max_features': 8, 'n_estimators': 10}, Mean RMSE: 3477.52, Std Dev: 868.64
Params: {'max_features': 8, 'n_estimators': 20}, Mean RMSE: 3307.90, Std Dev: 754.81
Params: {'max_features': 8, 'n_estimators': 30}, Mean RMSE: 3268.24, Std Dev: 744.04
Params: {'max_features': 8, 'n_estimators': 40}, Mean RMSE: 3267.56, Std Dev: 800.43

Best Hyperparameters: {'max_features': 8, 'n_estimators': 40}

Test RMSE with best model: 3317.56
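
The per-combination results printed above can also be inspected as a sorted table. The following is a minimal sketch, assuming a fitted `GridSearchCV` object. It uses synthetic data, a smaller grid, and 5 folds so that it runs quickly on its own. The names `grid_search_demo`, `results`, and `best_cv_rmse` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_regression(n_samples=120, n_features=13, noise=10.0, random_state=42)

grid_search_demo = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid={'n_estimators': [10, 20], 'max_features': [4, 8]},
    cv=5,
    scoring='neg_root_mean_squared_error',
)
grid_search_demo.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame
results = pd.DataFrame(grid_search_demo.cv_results_)
results['mean_rmse'] = -results['mean_test_score']  # scores are negated RMSE
results = results.sort_values('rank_test_score')
print(results[['params', 'mean_rmse', 'std_test_score']])

# best_score_ is the mean CV score of the best combination (also negated)
best_cv_rmse = -grid_search_demo.best_score_
print(f"Best mean CV RMSE: {best_cv_rmse:.2f}")
```

Comparing the best mean CV RMSE with the test RMSE gives a rough sense of how well the cross-validated estimate carries over to unseen data.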
