In [1]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np
import seaborn as sns
# data preprocessing (LabelEncoder was imported but never used, so it is dropped)
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error
# suppress warnings to keep the notebook output clean
import warnings
warnings.filterwarnings('ignore')

# load the data
df = sns.load_dataset('diamonds')

# separate the features X and the target/labels y
X = df.drop('price', axis=1)
y = df['price']

# numeric features
numeric_features = ['carat', 'depth', 'table', 'x', 'y', 'z']
# categorical features
categorical_features = ['cut', 'color', 'clarity']

# preprocess the data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

# pipeline
pipeline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', LinearRegression())
    ]
)

# fit the model
pipeline.fit(X_train, y_train)

# metric to evaluate the model
y_pred = pipeline.predict(X_test)

print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")
print(f"R2 Score: {r2_score(y_test, y_pred)}")
print(f"Mean Absolute Error: {mean_absolute_error(y_test, y_pred)}")
print(f"Mean Absolute Percentage Error: {mean_absolute_percentage_error(y_test, y_pred)}")
# root mean squared error
print(f"Root Mean Squared Error: {np.sqrt(mean_squared_error(y_test, y_pred))}")

Mean Squared Error: 1288705.4778516763
R2 Score: 0.9189331350419386
Mean Absolute Error: 737.1513665933285
Mean Absolute Percentage Error: 0.3952933516494362
Root Mean Squared Error: 1135.2116445190634


# Interpretation of the Model Metrics
These metrics summarize how well the regression model performs on the test set.

Here's a brief interpretation:

Mean Squared Error (MSE): 1288705.48 - This value is the average of the squared errors. A lower MSE indicates a better fit, but because it is expressed in squared units of the target (here, squared dollars), it is hard to interpret without knowing the scale of the target.

e.g.: if the target variable ranged from 0 to 100, an MSE of 1288705.48 would be very high, but if it ranged from 100000 to 1000000, the same MSE would be low.
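The scale dependence of the MSE can be checked with a few made-up numbers. The sketch below is only an illustration: the arrays are hypothetical values, not taken from the diamonds data. Both cases have the same 10% relative error, yet the MSE differs by a factor of a million because the target is 1000 times larger.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical values, chosen only to illustrate scale dependence.
y_true_small = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_small = y_true_small * 1.10  # every prediction is 10% too high

# Same relative error, but the target is 1000 times larger.
y_true_large = y_true_small * 1000
y_pred_large = y_true_large * 1.10

mse_small = mean_squared_error(y_true_small, y_pred_small)
mse_large = mean_squared_error(y_true_large, y_pred_large)

print(f"MSE (small scale): {mse_small:.2f}")
print(f"MSE (large scale): {mse_large:.2f}")
print(f"Ratio: {mse_large / mse_small:.0f}")
```

Multiplying the target by 1000 multiplies the MSE by 1000 squared, which is why an MSE should be read against the range of the target.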

R2 Score: 0.9189 - This value indicates that approximately 91.89% of the variance in the dependent variable (price) on the test set is explained by the independent variables in the model. An R2 score close to 1 indicates a good fit.

e.g.: a model that always predicted the mean price would score 0, and a model that predicted every price exactly would score 1.
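The "variance explained" reading follows from the definition of R2 as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on hypothetical values (not the diamonds data) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical values, used only to show the calculation.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: squared errors of the model's predictions.
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
# Total sum of squares: squared errors of always predicting the mean.
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true_demo, y_pred_demo)

print(f"Manual R2:  {r2_manual:.4f}")
print(f"sklearn R2: {r2_sklearn:.4f}")
```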

Mean Absolute Error (MAE): 737.15 - This value represents the average absolute difference between the predicted and actual values. Lower values indicate better performance.

e.g.: if the MAE is 737.15, it means that, on average, the model's predictions are off by $737.15.

Mean Absolute Percentage Error (MAPE): 0.3953 - This value represents the average absolute percentage difference between the predicted and actual values. sklearn reports it as a fraction rather than a percentage. Lower values indicate better performance.

e.g.: if the MAPE is 0.3953, it means that, on average, the model's predictions are off by 39.53% of the actual value.
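A high MAPE can coexist with a high R2, because each error is divided by the actual value and low-priced items therefore weigh heavily. The sketch below uses hypothetical prices (not the diamonds data), each mispredicted by the same 300. Whether this effect explains the MAPE reported above is an assumption that would need checking against the actual residuals.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical prices: two cheap items and two expensive ones.
price_true = np.array([400.0, 500.0, 10000.0, 15000.0])
# Every prediction is too high by the same absolute amount.
price_pred = price_true + 300.0

mae_demo = mean_absolute_error(price_true, price_pred)
mape_demo = mean_absolute_percentage_error(price_true, price_pred)

# Per-item relative error shows which rows dominate the MAPE.
relative_errors = np.abs(price_pred - price_true) / price_true

print(f"MAE:  {mae_demo:.2f}")
print(f"MAPE: {mape_demo:.4f}")
print("Relative errors:", np.round(relative_errors, 4))
```

The absolute error is identical for every row, but the two cheap items account for almost all of the MAPE.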

Root Mean Squared Error (RMSE): 1135.21 - This value is the square root of the MSE, so it is expressed in the same units as the target. Because the errors are squared before averaging, large errors count more than they do in the MAE. Lower values indicate better performance.

e.g.: if the RMSE is 1135.21, it means that a typical prediction error is about $1135.21, with large misses weighted more heavily than small ones.
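The gap between the RMSE and the MAE reflects how unevenly the errors are spread. The sketch below uses two hypothetical sets of errors with the same MAE. The numbers are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical setup: actual values are zero, so each prediction equals its error.
actual = np.zeros(4)
pred_even = np.array([100.0, 100.0, 100.0, 100.0])  # four equal errors
pred_outlier = np.array([0.0, 0.0, 0.0, 400.0])     # one large error

results = {}
for label, pred in [("even errors", pred_even), ("one large error", pred_outlier)]:
    mae = mean_absolute_error(actual, pred)
    rmse = np.sqrt(mean_squared_error(actual, pred))
    results[label] = (mae, rmse)
    print(f"{label}: MAE = {mae:.1f}, RMSE = {rmse:.1f}")
```

Both cases have an MAE of 100, but the single large error doubles the RMSE. An RMSE well above the MAE therefore suggests that a few predictions miss by a wide margin.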

Overall, the R2 score of 0.9189 suggests that your model explains most of the variance in the data, which is a good sign. However, an MAE of about $737, an RMSE of about $1135, and a MAPE of nearly 40% show that individual predictions can still be far off, so there is room for improvement in the model's accuracy.

## Conclusion
In this notebook, we created a multiple linear regression model using the sklearn library and saved it to a file. We then loaded the model from the file and used it to make predictions. We also evaluated the model's performance using MSE, R2 score, MAE, MAPE, and RMSE. The results indicate that the model explains most of the variance in the data but has room for improvement in terms of accuracy.
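The saving and loading step is not shown in this section. A minimal sketch of one common approach follows, assuming `joblib` is used for persistence (the notebook may have used something else, such as `pickle`). The synthetic data, the pipeline, and the file name `demo_pipeline.joblib` are all hypothetical stand-ins.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the diamonds dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Fit a small pipeline so preprocessing and model are saved together.
demo_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LinearRegression()),
    ]
)
demo_pipeline.fit(X_demo, y_demo)

# Save to a temporary directory, then load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_pipeline.joblib")  # hypothetical name
    joblib.dump(demo_pipeline, model_path)
    loaded_pipeline = joblib.load(model_path)

# The reloaded pipeline should reproduce the original predictions.
same_predictions = np.allclose(
    demo_pipeline.predict(X_demo), loaded_pipeline.predict(X_demo)
)
print(f"Predictions match after reload: {same_predictions}")
```

Saving the whole pipeline rather than only the model keeps the fitted scaler and encoder with it, so raw feature rows can be passed straight to `predict` after loading.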

# 15 Ways to Improve the ML Model's Performance

1. Feature Engineering: Create new features that capture additional information from the data.
2. Hyperparameter Tuning: Optimize the model's hyperparameters to improve performance.
3. Regularization: Apply regularization techniques to prevent overfitting.
4. Ensemble Methods: Use ensemble methods such as Random Forest or Gradient Boosting to improve predictive performance.
5. Cross-Validation: Use cross-validation to assess the model's performance more accurately.
6. Feature Selection: Identify and select the most relevant features for the model.
7. Data Preprocessing: Clean and preprocess the data to improve model performance.
8. Model Selection: Experiment with different regression models to find the best fit for the data.
9. Error Analysis: Analyze the model's errors to identify patterns and areas for improvement.
10. Domain Knowledge: Incorporate domain knowledge to improve the model's predictive power.
11. Data Augmentation: Increase the size of the training data through data augmentation techniques.
12. Model Stacking: Combine multiple models to improve predictive performance.
13. Model Interpretation: Interpret the model's predictions to gain insights into the data and improve performance.
14. Model Deployment: Deploy the model in a production environment and monitor its performance over time.
15. Feedback Loop: Incorporate feedback from users and stakeholders to continuously improve the model.
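Several of these ideas (cross-validation, regularization, ensemble methods, and model selection) can be tried together in a few lines. The sketch below compares three candidate models with 5-fold cross-validation. It runs on synthetic data from `make_regression` rather than the diamonds dataset, and the hyperparameter values (`alpha=1.0`, `n_estimators=50`) are arbitrary assumptions, not tuned choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the diamonds dataset).
X_syn, y_syn = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=42
)

# Candidate models: a baseline, a regularized variant, and an ensemble.
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Score each candidate with 5-fold cross-validation on R2.
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

On the real data, the same loop could be run with the notebook's `preprocessor` placed in a `Pipeline` in front of each candidate, so that scaling and encoding are refit inside every fold.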