# **Machine Learning II Regression Project**

# Notebook Structure

## 1. Data Preparation and Exploratory Data Analysis (EDA)
- Prepare and clean the dataset.
- Perform exploratory data analysis to identify patterns, trends, and potential outliers.

## 2. Building Models
- Develop predictive models using machine learning algorithms.
- Train and evaluate the models to ensure optimal performance.

## 3. Comparing Results
- Compare the performance metrics of the models.
- Analyze the results to draw meaningful conclusions.

## 1. Data Preparation and Exploratory Data Analysis (EDA)

In [1]:
# Data manipulation and visualization
import pandas as pd  # Data manipulation
import seaborn as sns  # Data visualization
import matplotlib.pyplot as plt  # Plotting and visualization

# Data preprocessing
from sklearn.preprocessing import LabelEncoder, MinMaxScaler  # Encoding and scaling
from sklearn.pipeline import Pipeline  # Preprocessing pipeline

# Data splitting and model selection
from sklearn.model_selection import train_test_split, GridSearchCV  # Data splitting and hyperparameter tuning

# Machine learning models
from sklearn.linear_model import LinearRegression  # Linear regression model
from scikeras.wrappers import KerasRegressor  # Keras model wrapper for scikit-learn

# Model evaluation
from sklearn.metrics import mean_squared_error, r2_score, make_scorer  # Metrics for evaluation

# LightGBM
import lightgbm as lgb  # LightGBM implementation

# Neural networks (Keras)
from tensorflow.keras.models import Sequential  # Model building
from tensorflow.keras.layers import Dense, Input  # Layers for neural networks
from tensorflow.keras.optimizers import Adam  # Optimizer for neural networks


ImportError: initialization failed

In [2]:
# Load the data
data = pd.read_csv('/Users/elgun/Desktop/Machine_Learning_Project/data/regression.csv')
display(data.head()) # Display the first 5 rows of the data

Unnamed: 0,No,Property_Type,Land_Area,Floor_Area,Condition,View,Amenities,Furnishing_Status,Air_Conditioning (Yes/No),Heating (Yes/No),...,Location,Num_bathrooms,Num_rooms,Price,distance to nearest MRT Station,distance to nearest Hospital,distance to nearest School,distance to nearest Bus Stand,distance to nearest Airport,Crimerate in area
0,1,Condo,1784,1555,New,,Basic,Fully Furnished,No,Yes,...,Phoenix,3,2,334126,7.06,11.83,3.54,2.67,45.83,9.9
1,2,Farmhouse,1768,1530,New,City View,Basic,Semi Furnished,No,Yes,...,San Francisco,2,3,1895788,3.96,14.28,1.02,1.21,21.99,7.7
2,3,Apartment,2667,2667,Old,Park Facing,Rooftop Access,Fully Furnished,Yes,No,...,Miami,2,3,4397494,14.08,0.92,4.94,0.62,11.06,5.3
3,4,Apartment,1262,1262,Old,City View,,Semi Furnished,Yes,No,...,Chicago,3,4,1367842,14.17,14.21,7.13,0.39,43.88,8.9
4,5,Apartment,1720,1720,Renovated,City View,Parking,Semi Furnished,Yes,No,...,New York,4,6,1403758,6.09,8.50,7.16,4.56,40.70,8.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,499996,Condo,3815,2060,Renovated,City View,Basic,Semi Furnished,Yes,No,...,Boston,3,4,549973,8.48,17.41,4.18,0.85,24.97,9.3
499996,499997,Villa,6514,4525,Old,City View,"Gym, Swimming Pool",Semi Furnished,Yes,No,...,San Francisco,4,6,8464858,13.31,3.16,8.40,4.79,11.77,1.2
499997,499998,Apartment,2587,2587,Old,City View,Rooftop Access,Semi Furnished,Yes,No,...,Denver,5,3,3475517,8.51,14.24,4.09,0.37,24.30,6.2
499998,499999,Apartment,2897,2897,Old,,Rooftop Access,Semi Furnished,Yes,No,...,Boston,4,5,1616638,7.24,10.08,5.18,4.70,8.78,8.3


In [None]:
# Separate target and features
target_column = 'Price'
X = data.drop(columns=[target_column, 'No'])
y = data[[target_column]]  # Ensure y is a DataFrame for scaling

# Identify numerical and categorical columns
numerical_columns = X.select_dtypes(include=['number']).columns.tolist()
categorical_columns = X.select_dtypes(include=['object', 'category']).columns.tolist()

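# Encode categorical columns using LabelEncoder.
# Note: astype(str) turns missing values (NaN) into the string 'nan', which is
# then encoded as its own category. The integer codes also imply an ordering.
for col in categorical_columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))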

# Define preprocessing for numerical data
X_preprocessor = Pipeline(steps=[
    ('scale', MinMaxScaler())
])

# Define scaler for the target variable
y_scaler = MinMaxScaler()

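# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Work on explicit copies so the column assignments below do not trigger pandas' SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

# Scale the numerical columns in X (fit on the training set only, then apply to the test set)
X_train[numerical_columns] = X_preprocessor.fit_transform(X_train[numerical_columns])
X_test[numerical_columns] = X_preprocessor.transform(X_test[numerical_columns])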

# Scale the target variable y
y_train = pd.DataFrame(y_scaler.fit_transform(y_train), columns=[target_column], index=y_train.index)
y_test = pd.DataFrame(y_scaler.transform(y_test), columns=[target_column], index=y_test.index)

In [None]:
# Display missing values
print("\nMissing values in the dataset:")
display(data.isnull().sum())

# Visualize the distribution of the target variable
plt.figure(figsize=(10, 6))
sns.histplot(data[target_column], kde=True, color='skyblue')
plt.title('Distribution of Target Variable (Price)', fontsize=16)
plt.xlabel('Price', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.show()

# Visualize the correlation matrix
plt.figure(figsize=(15, 10))
sns.heatmap(data[numerical_columns].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix', fontsize=16)
plt.show()

def plot_relationships(columns, plot_type):
    """Helper function to plot relationships."""
    for col in columns:
        plt.figure(figsize=(10, 6))
        if plot_type == 'scatter':
            sns.scatterplot(x=data[col], y=data[target_column], color='green')
        elif plot_type == 'box':
            sns.boxplot(x=data[col], y=data[target_column])
        plt.title(f'Relationship between {col} and {target_column}', fontsize=16)
        plt.xlabel(col, fontsize=12)
        plt.ylabel(target_column, fontsize=12)
        plt.show()


# Plot numerical relationships
plot_relationships(numerical_columns, plot_type='scatter')

# Plot categorical relationships
plot_relationships(categorical_columns, plot_type='box')


## 2. Building Models

# **Linear Regression**

In [11]:
# Initialize the linear regression model
linear_regressor = LinearRegression()

# Fit the model to the training data
linear_regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = linear_regressor.predict(X_test)

# **LightGBM model**

In [24]:


# Define the parameter grid
param_grid = {
    'num_leaves': [31, 50],
    'learning_rate': [0.1, 0.01],
    'n_estimators': [100, 200],
    'verbose': [-1]
}

# Ensure feature names do not have whitespaces
X_train.columns = [col.replace(' ', '_') for col in X_train.columns]
X_test.columns = [col.replace(' ', '_') for col in X_test.columns]

# Initialize the LightGBM regressor
lgb_regressor = lgb.LGBMRegressor()

# Define a custom scorer for MSE
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

# Initialize the GridSearchCV object
grid_search = GridSearchCV(estimator=lgb_regressor, param_grid=param_grid, scoring=mse_scorer, cv=5, verbose=1)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

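# Get the best parameters and the best score
# (the score is the negated MSE, because the scorer was built with greater_is_better=False)
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f'Best Parameters: {best_params}')
print(f'Best MSE Score: {best_score}')

# Make predictions on the test data using the best estimator and report the test MSE
y_pred_lgb = grid_search.best_estimator_.predict(X_test)
print(f'Test MSE: {mean_squared_error(y_test, y_pred_lgb)}')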

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best Parameters: {'learning_rate': 0.1, 'n_estimators': 100, 'num_leaves': 31, 'verbose': -1}
Best MSE Score: -0.0006798928610166256
Test MSE: 0.0006574862803625383
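
The negative "Best MSE Score" above is a consequence of scikit-learn's convention that `GridSearchCV` always maximizes its score: a scorer built with `make_scorer(..., greater_is_better=False)` returns the negated metric. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo`, and the use of `LinearRegression` as a lightweight stand-in for LightGBM are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustrative, not the project dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Same scorer construction as in the notebook: lower MSE is better,
# so scikit-learn negates it to keep "higher is better" internally.
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

search = GridSearchCV(
    LinearRegression(),
    param_grid={"fit_intercept": [True, False]},
    scoring=mse_scorer,
    cv=5,
)
search.fit(X_demo, y_demo)

neg_mse = search.best_score_  # negated mean cross-validated MSE
cv_mse = -neg_mse             # flip the sign to recover the actual MSE
print(f"best_score_: {neg_mse:.6f} -> CV MSE: {cv_mse:.6f}")
```

Read this way, the cross-validated MSE reported above is the absolute value of the printed score, which can be compared directly with the test MSE.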


# **MLP model**

In [16]:
# Define a function to create the Keras model
def create_model(hidden_layer_sizes=(50,), learning_rate=0.01):
    model = Sequential()
    model.add(Input(shape=(X_train.shape[1],)))  # Input layer
    for units in hidden_layer_sizes:
        model.add(Dense(units=units, activation='relu'))
    model.add(Dense(1, activation='linear'))  # Output layer
    
    # Use Adam optimizer with specified learning rate
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss='mean_squared_error')
    return model

# Wrap the Keras model using KerasRegressor
model = KerasRegressor(model=create_model, verbose=0)

# Define the parameter grid
param_grid = {
    'model__hidden_layer_sizes': [(50,)],  # Single hidden layer with 50 neurons
    'model__learning_rate': [0.01],        # Learning rate for the optimizer
    'epochs': [50]                         # Number of epochs for training
}

# Initialize the GridSearchCV object
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=3)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Get the best parameters and the best score
# (with scoring='neg_mean_squared_error' the score is the negated MSE)
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f'Best Parameters: {best_params}')
print(f'Best MSE Score: {best_score}')

# Make predictions on the test data using the best estimator
y_pred_mlp = grid_search.best_estimator_.predict(X_test)


Best Parameters: {'epochs': 50, 'model__hidden_layer_sizes': (50,), 'model__learning_rate': 0.01}
Best MSE Score: -0.0008581459448876611


## 3. Comparing Results

# Performance Metrics for Regression Models

Two commonly used metrics for evaluating a regression model are **Mean Squared Error (MSE)** and **$R^2$ (Coefficient of Determination)**. Both are applied to all three models in this notebook and are explained below:

---

## 1. Mean Squared Error (MSE)

The Mean Squared Error measures the average squared difference between the actual and predicted values. It provides insight into the magnitude of the error.

### Formula:
$$
MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

### Components:
- $n$: Number of data points
- $y_i$: Actual value of the $i$-th observation
- $\hat{y}_i$: Predicted value of the $i$-th observation

### Key Points:
- Lower $MSE$ indicates a better fit of the model to the data.
- $MSE$ is sensitive to outliers because it squares the errors.
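
As a small worked example of the formula above, the following sketch computes MSE by hand with NumPy and shows how a single large error dominates the result. The arrays are made-up values chosen for easy arithmetic, not data from this project, and the helper name `mse` is hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Illustrative values only (not from the project dataset)
y_true = [3.0, 5.0, 7.0, 9.0]
y_close = [2.0, 5.0, 8.0, 9.0]     # errors: 1, 0, -1, 0
y_outlier = [2.0, 5.0, 8.0, 19.0]  # same errors, except the last one is -10

mse_close = mse(y_true, y_close)      # (1 + 0 + 1 + 0) / 4 = 0.5
mse_outlier = mse(y_true, y_outlier)  # (1 + 0 + 1 + 100) / 4 = 25.5

print(mse_close, mse_outlier)
```

One prediction that is off by 10 raises the MSE from 0.5 to 25.5, which is the outlier sensitivity noted in the key points.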

---

## 2. Coefficient of Determination ($R^2$)

The $R^2$ metric indicates how well the model explains the variance in the target variable. It measures the proportion of the variance in the dependent variable that is predictable from the independent variables.

### Formula:
$$
R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
$$

### Components:
- $SS_{\text{res}} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$: Residual sum of squares
- $SS_{\text{tot}} = \sum_{i=1}^n (y_i - \bar{y})^2$: Total sum of squares
- $\bar{y}$: Mean of the actual values

### Key Points:
- $R^2$ is at most 1 and usually falls between 0 and 1:
  - $R^2 = 1$: Perfect fit.
  - $R^2 = 0$: The model explains no more variance than always predicting the mean.
- Negative $R^2$: Occurs when the model performs worse than simply predicting the mean of the target variable, so $SS_{\text{res}} > SS_{\text{tot}}$.
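
The three cases listed above can be reproduced with a few lines of NumPy. This is a minimal sketch on made-up numbers; the helper name `r2` and the arrays are hypothetical and not part of the notebook.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Illustrative values only: mean = 2.5, SS_tot = 5.0
y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2(y_true, y_true)                # SS_res = 0  -> R^2 = 1
r2_mean = r2(y_true, np.full(4, 2.5))          # SS_res = 5  -> R^2 = 0
r2_worse = r2(y_true, [4.0, 3.0, 2.0, 1.0])    # SS_res = 20 -> R^2 = -3

print(r2_perfect, r2_mean, r2_worse)
```

The reversed predictions in the last case are worse than the mean baseline, which is why the result drops below zero.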

---

## Summary Table

| Metric  | Formula | Interpretation |
|---------|---------|----------------|
| **MSE** | $ \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 $ | Measures average squared error. Lower is better. |
| **$R^2$** | $ 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} $ | Proportion of variance explained. Higher is better. |

---

Used together, these metrics give a fuller picture of a regression model: $MSE$ measures the size of the prediction errors, and $R^2$ measures the goodness of fit.


In [20]:
# Evaluate the performance of the models

# Linear Regression
mse_lr = mean_squared_error(y_test, y_pred)
r2_lr = r2_score(y_test, y_pred)

# LightGBM
mse_lgb = mean_squared_error(y_test, y_pred_lgb)
r2_lgb = r2_score(y_test, y_pred_lgb)

# MLP
mse_mlp = mean_squared_error(y_test, y_pred_mlp)
r2_mlp = r2_score(y_test, y_pred_mlp)

In [22]:
# Create a DataFrame to store the performance metrics of the models
results = pd.DataFrame({
    'Model': ['Linear Regression', 'LightGBM', 'MLP'],
    'MSE': [mse_lr, mse_lgb, mse_mlp],
    'R^2': [r2_lr, r2_lgb, r2_mlp]
})

# Display the results
display(results)

# Identify the best model based on the lowest MSE
best_model = results.loc[results['MSE'].idxmin()]

print(f"Best Model: {best_model['Model']}")
print(f"MSE: {best_model['MSE']}")
print(f"R^2: {best_model['R^2']}")

Unnamed: 0,Model,MSE,R^2
0,Linear Regression,0.001569,0.920011
1,LightGBM,0.000657,0.966484
2,MLP,0.000923,0.952954


Best Model: LightGBM
MSE: 0.0006574862803625383
R^2: 0.9664839840716286


### Model Comparison and Conclusion

| **Model**               | **MSE**    | **R²**      |
|--------------------------|------------|-------------|
| Linear Regression        | 0.001569   | 0.920011    |
| LightGBM                | **0.000657** | **0.966484** |
| MLP (10 min runtime)    | 0.000923   | 0.952954    |

#### Conclusion
- Among the models tested, **LightGBM** performed best on the test set, with the lowest Mean Squared Error (MSE) of **0.000657** (measured on the min-max-scaled price) and the highest R² score of **0.966484**, meaning it explains about 96.6% of the variance in the test prices.
- The **MLP model** achieved a competitive R² score of **0.952954**, but it required **10 minutes of runtime**, making it less efficient than LightGBM.
- **Linear Regression** was the simplest model to implement but had the lowest R² (**0.920011**), which suggests that a purely linear fit misses part of the structure in this dataset.

Based on this comparison, **LightGBM is the recommended model** for its combination of accuracy and efficiency.
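
Because the target was min-max scaled before training, the MSE values above are not in price units. `MinMaxScaler` maps $y$ to $(y - y_{\min}) / (y_{\max} - y_{\min})$, so an error in scaled units converts back to price units by multiplying by $(y_{\max} - y_{\min})$. The sketch below shows the conversion on made-up prices; `demo_scaler` and all values are assumptions for illustration, and in the notebook the fitted `y_scaler` would play the same role.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up prices for illustration (not from the project dataset)
prices = np.array([[100_000.0], [300_000.0], [600_000.0], [1_100_000.0]])

demo_scaler = MinMaxScaler()
prices_scaled = demo_scaler.fit_transform(prices)

# Pretend predictions in scaled space: every prediction is 0.02 too high
preds_scaled = prices_scaled + 0.02
mse_scaled = float(np.mean((prices_scaled - preds_scaled) ** 2))  # 0.02^2 = 0.0004

# Route 1: rescale the error analytically using the fitted range (max - min)
price_range = float(demo_scaler.data_range_[0])          # 1,000,000
rmse_price = float(np.sqrt(mse_scaled) * price_range)    # 0.02 * 1,000,000 = 20,000

# Route 2: invert the scaling and compute RMSE directly in price units
preds_price = demo_scaler.inverse_transform(preds_scaled)
rmse_direct = float(np.sqrt(np.mean((prices - preds_price) ** 2)))

print(f"RMSE in price units: {rmse_price:,.0f} (direct: {rmse_direct:,.0f})")
```

Both routes give the same number, so reporting RMSE in price units alongside the scaled MSE would only require the price range stored in the fitted scaler.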
