# 5_Modeling: Price Predictor

# Modeling Process

- Explore 4 regression models: one as a baseline and the others more advanced
- Split the data into training and test sets
- Apply cross-validation and hyperparameter tuning to at least 2 models
- Compute feature importances and identify the most relevant features
- Apply cross-validation for evaluation
- Build a table comparing the performance of each model (see Model Comparison below)

# Model Comparison

In this section, we compare the performance of the regression models using several error metrics. The models evaluated are Linear Regression, Random Forest, SVR, K-Nearest Neighbors, and Gradient Boosting. Because the target (price) is continuous, the following regression metrics are used for comparison:

- **Mean Squared Error (MSE)**: The average of the squared differences between predicted and actual values. Lower is better, and large errors are penalized more heavily.
- **Root Mean Squared Error (RMSE)**: The square root of the MSE, expressed in the same units as the target, which makes it easier to interpret.
- **R^2 Score**: The proportion of variance in the target explained by the model. A value of 1 is a perfect fit, 0 is equivalent to always predicting the mean, and negative values are worse than predicting the mean.
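
For a continuous target such as price, the metrics reported later in this notebook are MSE, RMSE, and R^2. The sketch below shows how they are computed with scikit-learn on a handful of made-up prices; the arrays `y_true` and `y_pred` are illustrative assumptions, not values from the project data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical prices, for illustration only (not from the project dataset)
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

mse = mean_squared_error(y_true, y_pred)  # mean of the squared errors
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot

print(f"MSE:  {mse:.2f}")   # MSE:  175.00
print(f"RMSE: {rmse:.2f}")  # RMSE: 13.23
print(f"R^2:  {r2:.3f}")    # R^2:  0.944
```

Here the squared errors are 100, 100, 100, and 400, so the MSE is 175 and the RMSE is about 13.23 price units.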

# Modeling Output

The output of this stage will be pickle files for future integration with the application. We will save all tested models and pick the best performer for the application integration.
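
As a rough sketch of what the application side might do with these files, the example below saves a fitted model with `pickle` and loads it back. The toy data and the temporary `model_path` are assumptions for illustration; the notebook itself writes to its own models folder.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 (illustrative only)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

model = LinearRegression().fit(X, y)

# Hypothetical file location inside a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "linear_regression_model.pkl")

# Save the fitted model
with open(model_path, "wb") as file:
    pickle.dump(model, file)

# Load it back, as the application would
with open(model_path, "rb") as file:
    loaded_model = pickle.load(file)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X), loaded_model.predict(X))
print(same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same library version that created them, and pickle files should only be loaded from trusted sources.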

In [14]:
# Imports
import pickle
import warnings

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

In [15]:
# Suppress all warnings (this also hides scikit-learn data conversion warnings)
warnings.simplefilter(action='ignore')

#### Loading Data from Feature Engineering

In [16]:
# Load the PCA-transformed features and the target from the feature engineering step
X_pca = pd.read_csv('../../data/price_predictor/processed/supply_chain_X_pca.csv')
y = pd.read_csv('../../data/price_predictor/processed/supply_chain_Y_pca.csv')

**Baseline Model**

In [17]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

In [18]:
# Fit the baseline linear regression on the training set
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict on the held-out test set
y_pred_lr = lr_model.predict(X_test)

# Evaluate the baseline with RMSE and R^2
lr_mse = mean_squared_error(y_test, y_pred_lr)
lr_rmse = np.sqrt(lr_mse)
lr_r2 = r2_score(y_test, y_pred_lr)

print(f'Linear Regression RMSE: {lr_rmse}')
print(f'Linear Regression R2: {lr_r2}')

Linear Regression RMSE: 30.034397689991362
Linear Regression R2: -0.030678180531177768


---

**Other Models**

In [19]:
# Models to try for the regression
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(),
    'SVR': SVR(),
    'K-Nearest Neighbors': KNeighborsRegressor(),
    'Gradient Boosting': GradientBoostingRegressor()
}

# Results of all models and saving
results = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)  # Calculate RMSE
    r2 = r2_score(y_test, y_pred)
    results[model_name] = {
        'Mean Squared Error': mse,
        'Root Mean Squared Error': rmse,
        'R^2 Score': r2
    }
    
    # Save the model as a pickle file
    with open(f"../../models/price_predictor/{model_name.replace(' ', '_').lower()}_model.pkl", 'wb') as file:
        pickle.dump(model, file)

for model_name, metrics in results.items():
    print(f"Model: {model_name}")
    print(f"Mean Squared Error: {metrics['Mean Squared Error']}")
    print(f"Root Mean Squared Error: {metrics['Root Mean Squared Error']}")
    print(f"R^2 Score: {metrics['R^2 Score']}\n")


Model: Linear Regression
Mean Squared Error: 902.0650446005585
Root Mean Squared Error: 30.034397689991362
R^2 Score: -0.030678180531177768

Model: Random Forest
Mean Squared Error: 881.3312848321228
Root Mean Squared Error: 29.687224269576348
R^2 Score: -0.006988277101692253

Model: SVR
Mean Squared Error: 877.3263591929172
Root Mean Squared Error: 29.619695460840195
R^2 Score: -0.002412343807650208

Model: K-Nearest Neighbors
Mean Squared Error: 1001.7053044245808
Root Mean Squared Error: 31.64972834677386
R^2 Score: -0.14452478429637772

Model: Gradient Boosting
Mean Squared Error: 909.5002848680857
Root Mean Squared Error: 30.157922422940306
R^2 Score: -0.03917351017133752




## Model Metrics

| **Model**             | **MSE** | **RMSE** | **R^2 Score** |
|-----------------------|---------|----------|---------------|
| **Linear Regression** | 902.07  | 30.03    | -0.031        |
| **Random Forest**     | 881.33  | 29.69    | -0.007        |
| **SVR**               | 877.33  | 29.62    | -0.002        |
| **KNN**               | 1001.71 | 31.65    | -0.145        |
| **Gradient Boosting** | 909.50  | 30.16    | -0.039        |
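
A comparison table like this can also be generated from the metrics rather than typed by hand, which avoids transcription slips. The sketch below is one possible approach using pandas; the `results` dictionary here is rebuilt from the values printed by the training loop in this notebook, with shortened column names chosen for illustration.

```python
import pandas as pd

# Metrics copied from the printed output of the training loop in this notebook
results = {
    "Linear Regression": {"MSE": 902.0650446005585, "RMSE": 30.034397689991362, "R^2 Score": -0.030678180531177768},
    "Random Forest": {"MSE": 881.3312848321228, "RMSE": 29.687224269576348, "R^2 Score": -0.006988277101692253},
    "SVR": {"MSE": 877.3263591929172, "RMSE": 29.619695460840195, "R^2 Score": -0.002412343807650208},
    "K-Nearest Neighbors": {"MSE": 1001.7053044245808, "RMSE": 31.64972834677386, "R^2 Score": -0.14452478429637772},
    "Gradient Boosting": {"MSE": 909.5002848680857, "RMSE": 30.157922422940306, "R^2 Score": -0.03917351017133752},
}

# One row per model, sorted from lowest to highest RMSE
comparison = (
    pd.DataFrame.from_dict(results, orient="index")
    .rename_axis("Model")
    .sort_values("RMSE")
    .round(3)
)

print(comparison)
```

Sorting by RMSE puts the best-performing model in the first row, which makes the ranking easy to read at a glance.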

### Interpretation of Results and Conclusion:

All models perform poorly at predicting shipping prices for the supply chain application. The Mean Squared Error (MSE) values are high across the board (roughly 877 to 1,002), and the Root Mean Squared Error (RMSE) values of about 30 to 32 mean that a typical prediction misses the actual price by around 30 units, which could be significant in a shipping cost context.

Furthermore, the R^2 scores are all negative, which means every model does worse on the test set than a constant prediction equal to the mean of the test targets. The models are not capturing the variance in the data and are effectively unusable for predicting shipping prices in their current form.
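
To make the meaning of a negative R^2 concrete, the following sketch computes R^2 by hand from its definition, 1 - SS_res / SS_tot. The target values and the constant predictions are made-up numbers for illustration, and the helper `r2` is a hypothetical name, not part of the notebook.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical test targets (not from the project dataset); their mean is 25
y_test = [10.0, 20.0, 30.0, 40.0]

# Constant prediction at the test mean: SS_res equals SS_tot, so R^2 is 0
r2_at_mean = r2(y_test, [25.0] * 4)

# Constant prediction slightly off the test mean: SS_res exceeds SS_tot
r2_off_mean = r2(y_test, [30.0] * 4)

print(round(r2_at_mean, 3))   # 0.0
print(round(r2_off_mean, 3))  # -0.2
```

Any prediction whose squared error exceeds that of the test mean produces a negative score, so values slightly below zero indicate a model that has learned essentially nothing useful from the features.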

Despite trying a variety of algorithms, including Linear Regression, Random Forest, Support Vector Regression (SVR), K-Nearest Neighbors (KNN), and Gradient Boosting, the predictions remain about 30 units (RMSE) away from the actual price. SVR performs best (RMSE 29.62) and KNN worst (RMSE 31.65), but the differences between models are small.

Possible next steps could involve:
1. **Feature Engineering**: Refining or adding features that might better capture the nuances of shipping costs.
2. **Model Tuning**: Adjusting hyperparameters or trying different algorithms that might perform better on this specific problem.
3. **Data Collection**: Ensuring that the dataset adequately represents the factors influencing shipping prices.
4. **Domain Expertise**: Consulting with domain experts to better understand the factors at play and refine the modeling approach accordingly.
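
The model tuning step could be sketched as follows, combining k-fold cross-validation with a small grid search and reading the feature importances from the best model. This is a minimal sketch under stated assumptions: the data comes from `make_regression` as a synthetic stand-in for the PCA features and price target, and the parameter grid is arbitrary rather than a recommendation for this dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the PCA features and price target (assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Small, illustrative grid; real ranges would depend on the dataset
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# 5-fold cross-validation with shuffling for reproducible folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
    cv=cv,
)
search.fit(X, y)

# Flip the sign back to get a positive RMSE
best_cv_rmse = -search.best_score_
importances = search.best_estimator_.feature_importances_

print("Best parameters:", search.best_params_)
print(f"Cross-validated RMSE: {best_cv_rmse:.2f}")

# List features from most to least important
for index in np.argsort(importances)[::-1]:
    print(f"feature {index}: importance {importances[index]:.3f}")
```

Because the inputs in this notebook are PCA components, importances computed this way would rank components rather than the original features, so mapping them back to business variables would require inspecting the PCA loadings.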

In conclusion, further exploration and refinement are needed to develop a more accurate predictive model for shipping prices in this supply chain context.