# Model Building on a Synthetic Dataset

## Table of Contents

- [Objective](#objective)
- [Exploratory Data Analysis](#eda)
- [Data Processing](#data-processing)
    - [Encoding training data](#encoding-training-data)
    - [Encoding test data](#encoding-test-data)
    - [Missing values](#missing-values)
- [Modeling](#modeling)
    - [Linear Regression](#linear-regression)
    - [Random Forest Regressor](#random-forest-regressor)
    - [Multiple Regression Models](#multiple-regression-models)
    - [Compare Root Mean Square Error (RMSE) and R-Squared for Models](#compare-metrics-all)

## Objective <a id='objective'></a>

This notebook walks through building a predictive model: models are fit on the training set with the goal of predicting the target values for the test set.

Both datasets are synthetic and were generated from the same underlying data model.

Predictive accuracy is assessed with the mean squared error (MSE) on a held-out validation split of the training data. The final comparison also reports the root mean squared error (RMSE) and $R^2$.
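
As a quick reference for these metrics, the sketch below computes MSE, RMSE and $R^2$ by hand and compares them with scikit-learn's implementations. The arrays `y_true` and `y_pred` are made-up illustrative values, not data from this project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only (assumption: not taken from the dataset)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of MSE, expressed in the same units as the target
rmse_manual = np.sqrt(mse_manual)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Compare with scikit-learn's implementations
print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Because RMSE is a monotonic transform of MSE, ranking models by either metric gives the same order.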

## Exploratory Data Analysis <a id='eda'></a>

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [None]:
# Set random seeds for reproducibility (Python's random module and NumPy's global generator)
random.seed(42)
np.random.seed(42)

In [None]:
# Read data
train_data = pd.read_csv('data/raw/codetest_train.txt', delimiter='\t')
test_data = pd.read_csv('data/raw/codetest_test.txt', delimiter='\t')

In [None]:
# Display first 5 rows of training data
train_data.head()

In [None]:
# Summary statistics of training data
train_data.describe()

In [None]:
# Visualize distribution of target variable
plt.figure(figsize=(10, 6))
sns.histplot(train_data['target'], bins=50, kde=True)
plt.title('Distribution of Target Variable')
plt.xlabel('Target')
plt.ylabel('Frequency')
plt.show()

## Data Processing <a id='data-processing'></a>

### Encoding training data <a id='encoding-training-data'></a>

In [None]:
# Check for non-numeric columns
non_numeric_cols = [col for col, dtype in train_data.dtypes.items() if not pd.api.types.is_numeric_dtype(dtype)]

if non_numeric_cols:
    print(f'Columns with non-numeric data types: {non_numeric_cols}')
else:
    print('All columns have float or integer data types.')

In [None]:
# Show first 5 rows of non-numeric columns
train_data[['f_61', 'f_121', 'f_215', 'f_237']].head()

In [None]:
# Select non-numeric features
non_numeric_features = train_data.select_dtypes(include='object').columns

# Encode non-numeric features
encoder = LabelEncoder()
train_data[non_numeric_features] = train_data[non_numeric_features].apply(lambda x: encoder.fit_transform(x.astype(str)))

# Check all features are numeric (an empty Index means no non-numeric columns remain)
train_data.select_dtypes(exclude='number').columns

In [None]:
# Show first 5 rows of the encoded columns
train_data[['f_61', 'f_121', 'f_215', 'f_237']].head()

### Encoding test data <a id='encoding-test-data'></a>

In [None]:
# Check for non-numeric columns
non_numeric_cols = [col for col, dtype in test_data.dtypes.items() if not pd.api.types.is_numeric_dtype(dtype)]

if non_numeric_cols:
    print(f'Columns with non-numeric data types: {non_numeric_cols}')
else:
    print('All columns have float or integer data types.')

In [None]:
# Select non-numeric features
non_numeric_features_test = test_data.select_dtypes(include='object').columns

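# Encode non-numeric features
# Caveat: a LabelEncoder refit on the test set assigns codes in sorted order of the
# values it sees, so the codes match the training encoding only if each column
# contains the same set of categories in both datasets
encoder = LabelEncoder()
test_data[non_numeric_features_test] = test_data[non_numeric_features_test].apply(lambda x: encoder.fit_transform(x.astype(str)))

# Check all features are numeric (an empty Index means no non-numeric columns remain)
test_data.select_dtypes(exclude='number').columns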
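
One way to guarantee that training and test data share the same category codes is to fit a single encoder on the training data and reuse it on the test data. The sketch below uses scikit-learn's `OrdinalEncoder` for this, which is a swapped-in alternative to the `LabelEncoder` approach in this notebook. The toy frames and their values are hypothetical stand-ins for `train_data` and `test_data`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames standing in for train_data / test_data (hypothetical values)
toy_train = pd.DataFrame({'f_61': ['a', 'b', 'c', 'a'],
                          'f_121': ['x', 'y', 'x', 'y']})
toy_test = pd.DataFrame({'f_61': ['c', 'd'],
                         'f_121': ['y', 'x']})

cat_cols = ['f_61', 'f_121']

# Fit on the training data only; categories unseen during fit are mapped to -1
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
toy_train[cat_cols] = ordinal_encoder.fit_transform(toy_train[cat_cols])

# Reuse the fitted mapping on the test data
toy_test[cat_cols] = ordinal_encoder.transform(toy_test[cat_cols])

print(toy_test)
```

Here the value `'d'` never appears in the toy training frame, so it is encoded as `-1` instead of silently receiving a code that means something else in the training data.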

### Missing values <a id='missing-values'></a>

In [None]:
# Check missing values
missing_train = train_data.isna().sum()
missing_test = test_data.isna().sum()

print('Missing values in training data:')
print(missing_train[missing_train > 0])

print('\nMissing values in test data:')
print(missing_test[missing_test > 0])

In [None]:
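# Fill missing values in training data with the column medians
train_medians = train_data.median()
train_data.fillna(train_medians, inplace=True)

# Fill missing values in test data with the training medians,
# so both datasets are imputed with the same values
test_data.fillna(train_medians, inplace=True)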

In [None]:
# Check missing values
missing_train = train_data.isna().sum()
missing_test = test_data.isna().sum()

print('Missing values in training data:')
print(missing_train[missing_train > 0])

print('\nMissing values in test data:')
print(missing_test[missing_test > 0])
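
Median imputation can also be expressed with scikit-learn's `SimpleImputer`, which stores the medians learned from the training data and applies them to any later dataset. This is a minimal sketch under that assumption, not the approach used in the cells above. The toy frames and the column names `f_0` and `f_1` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with missing values (hypothetical data)
toy_train = pd.DataFrame({'f_0': [1.0, np.nan, 3.0, 5.0],
                          'f_1': [10.0, 20.0, np.nan, 40.0]})
toy_test = pd.DataFrame({'f_0': [np.nan, 2.0],
                         'f_1': [np.nan, 30.0]})

# Learn the per-column medians from the training data only
imputer = SimpleImputer(strategy='median')
train_filled = pd.DataFrame(imputer.fit_transform(toy_train), columns=toy_train.columns)

# Apply the training medians to the test data
test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)

print(imputer.statistics_)
print(test_filled)
```

Keeping the imputer as a fitted object makes it straightforward to apply the identical transformation to new data later.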

## Modeling <a id='modeling'></a>

### Linear Regression <a id='linear-regression'></a>

In [None]:
# Separate features and target variable in training data
X = train_data.drop(columns='target')
y = train_data['target']

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (zero mean, unit variance); the scaler is fit on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(test_data)

# Initialize model
lr_model = LinearRegression()

# Fit model to training data
lr_model.fit(X_train_scaled, y_train)

# Make predictions on validation set
y_val_pred = lr_model.predict(X_val_scaled)

# Calculate mean squared error
mse = mean_squared_error(y_val, y_val_pred)
mse
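
A single train/validation split gives one MSE estimate. As an optional complement, the sketch below wraps scaling and regression in a pipeline and estimates MSE with 5-fold cross-validation. It runs on data from `make_regression` as a stand-in for the project data, and the names `X_demo`, `y_demo` and `pipeline` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption: not the project dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=42)

# The pipeline refits the scaler inside each fold, so no fold sees
# statistics computed from its own validation portion
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# scikit-learn scorers follow "higher is better", so MSE is returned negated
neg_mse_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5,
                                 scoring='neg_mean_squared_error')
cv_mse = -neg_mse_scores

print(cv_mse.mean(), cv_mse.std())
```

The spread of the per-fold values gives a rough sense of how much a single validation MSE might vary with the choice of split.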

### Random Forest Regressor <a id='random-forest-regressor'></a>

In [None]:
# Initialize model
rf_model = RandomForestRegressor(random_state=42)

# Fit model to training data
rf_model.fit(X_train_scaled, y_train)

# Make predictions on validation set
y_val_pred_rf = rf_model.predict(X_val_scaled)

# Calculate mean squared error
mse_rf = mean_squared_error(y_val, y_val_pred_rf)
mse_rf

### Multiple Regression Models <a id='multiple-regression-models'></a>

In [None]:
# Initialize models
lasso_model = Lasso(random_state=42)
ridge_model = Ridge(random_state=42)

# Fit models to training data
lasso_model.fit(X_train_scaled, y_train)
ridge_model.fit(X_train_scaled, y_train)

# Make predictions on validation set
y_val_pred_lasso = lasso_model.predict(X_val_scaled)
y_val_pred_ridge = ridge_model.predict(X_val_scaled)

# Calculate mean squared error
mse_lasso = mean_squared_error(y_val, y_val_pred_lasso)
mse_ridge = mean_squared_error(y_val, y_val_pred_ridge)

### Compare Root Mean Square Error (RMSE) and R-Squared for Models <a id='compare-metrics-all'></a>

In [None]:
# Calculate RMSE for all models
rmse_lr = np.sqrt(mse)
rmse_rf = np.sqrt(mse_rf)
rmse_lasso = np.sqrt(mse_lasso)
rmse_ridge = np.sqrt(mse_ridge)

# Calculate R-squared for all models
r2_lr = lr_model.score(X_val_scaled, y_val)
r2_rf = rf_model.score(X_val_scaled, y_val)
r2_lasso = lasso_model.score(X_val_scaled, y_val)
r2_ridge = ridge_model.score(X_val_scaled, y_val)

models_all = ['Linear Regression', 'Random Forest', 'Lasso', 'Ridge']
mse_values_all = [mse, mse_rf, mse_lasso, mse_ridge]
rmse_values_all = [rmse_lr, rmse_rf, rmse_lasso, rmse_ridge]
r2_values_all = [r2_lr, r2_rf, r2_lasso, r2_ridge]

In [None]:
# Create figure and subplots
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))

# Create RMSE plot
ax[0].barh(y=models_all, width=rmse_values_all, color='#2caffe')
ax[0].set_xlabel('Root Mean Squared Error (RMSE)')
ax[0].set_title('Comparison of RMSE for All Models')

# Create R-squared plot
ax[1].barh(y=models_all, width=r2_values_all, color='#544fc5')
ax[1].set_xlabel('R-Squared ($R^2$)')
ax[1].set_title('Comparison of $R^2$ for All Models')

# Fit figure
plt.tight_layout()

# Show figure
plt.show()

Of the four models compared, the random forest regressor has the lowest root mean squared error (RMSE) on the validation set.
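
To read exact numbers rather than bar lengths, the metrics can be collected in a small table. The helper `summarize_metrics` below is a hypothetical sketch, and the numbers in the demo call are placeholders, not results from this notebook.

```python
import pandas as pd

def summarize_metrics(models, mse_values, rmse_values, r2_values):
    """Return a table of metrics sorted by RMSE (lowest first)."""
    summary = pd.DataFrame({
        'Model': models,
        'MSE': mse_values,
        'RMSE': rmse_values,
        'R2': r2_values,
    })
    return summary.sort_values('RMSE').reset_index(drop=True)

# Placeholder numbers for illustration only; NOT results from this notebook
demo = summarize_metrics(['Model A', 'Model B'], [4.0, 1.0], [2.0, 1.0], [0.5, 0.9])
print(demo)
```

In this notebook the call would be `summarize_metrics(models_all, mse_values_all, rmse_values_all, r2_values_all)`, using the lists built in the comparison cell.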

This notebook was inspired by the [Model Building on a Synthetic Dataset](https://platform.stratascratch.com/data-projects/model-building-synthetic-dataset) data project on [StrataScratch](https://www.stratascratch.com/).