{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Ship Fuel Consumption Prediction Model\n",
    "\n",
    "This notebook trains a machine learning model to predict the fuel consumption (`fuel_per_nm`) of a ship based on its characteristics and voyage details.\n",
    "\n",
    "**Pipeline:**\n",
    "1. **Load Data**: Load the `processed_voyage_data.csv` file created by `preprocess_data.py`.\n",
    "2. **Feature Selection**: Define the features (X) and the target variable (y).\n",
    "3. **Train-Test Split**: Split the data into training and testing sets.\n",
    "4. **Model Training**: Train a `RandomForestRegressor` model.\n",
    "5. **Evaluation**: Evaluate the model's performance on the test set.\n",
    "6. **Save Model**: Serialize and save the trained model to a `.pkl` file for production use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.metrics import mean_absolute_error, r2_score\n",
    "import joblib # For saving the model\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Load Preprocessed Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "processed_data_path = 'processed_voyage_data.csv'\n",
    "\n",
    "try:\n",
    "    df = pd.read_csv(processed_data_path)\n",
    "    print(f\"Successfully loaded processed data. Shape: {df.shape}\")\n",
    "    print(\"Data Head:\")\n",
    "    display(df.head())\n",
    "except FileNotFoundError:\n",
    "    print(f\"Error: Processed data not found at '{processed_data_path}'.\")\n",
    "    print(\"Please run 'preprocess_data.py' first to generate the required file.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Feature Selection\n",
    "\n",
    "We select the features that will be used to train the model. The target variable is `fuel_per_nm`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The target variable we want to predict\n",
    "target = 'fuel_per_nm'\n",
    "\n",
    "# Features are all columns except the target and identifiers\n",
    "# 'fuel_consumed_tonnes' and 'distance_nm' are dropped because they were used to create the target\n",
    "features = [col for col in df.columns if col not in [target, 'voyage_id', 'fuel_consumed_tonnes', 'distance_nm']]\n",
    "\n",
    "X = df[features]\n",
    "y = df[target]\n",
    "\n",
    "print(\"Target Variable (y):\")\n",
    "print(y.name)\n",
    "print(\"\\nFeature Set (X):\")\n",
    "print(X.columns.tolist())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Train-Test Split\n",
    "\n",
    "We split the dataset into a training set (for teaching the model) and a testing set (for evaluating its performance on unseen data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "\n",
    "print(f\"Training set size: {X_train.shape[0]} samples\")\n",
    "print(f\"Testing set size: {X_test.shape[0]} samples\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Model Training\n",
    "\n",
    "We will use a `RandomForestRegressor`, which is a powerful and versatile model suitable for this kind of tabular data problem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)\n",
    "\n",
    "print(\"Training the RandomForestRegressor model...\")\n",
    "model.fit(X_train, y_train)\n",
    "print(\"Model training complete.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Model Evaluation\n",
    "\n",
    "We make predictions on the test set and compare them to the actual values to see how well the model performs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = model.predict(X_test)\n",
    "\n",
    "# Calculate performance metrics\n",
    "mae = mean_absolute_error(y_test, y_pred)\n",
    "r2 = r2_score(y_test, y_pred)\n",
    "\n",
    "print(f\"Model Performance on Test Set:\")\n",
    "print(f\"  - Mean Absolute Error (MAE): {mae:.4f}\")\n",
    "print(f\"  - R-squared (R²): {r2:.4f}\")\n",
    "\n",
    "# Visualize the results\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.scatterplot(x=y_test, y=y_pred, alpha=0.6)\n",
    "plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--r', linewidth=2)\n",
    "plt.title('Actual vs. Predicted Fuel Consumption')\n",
    "plt.xlabel('Actual Fuel per NM')\n",
    "plt.ylabel('Predicted Fuel per NM')\n",
    "plt.grid(True)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. Save the Model\n",
    "\n",
    "Finally, we save the trained model to a file named `fuel_model.pkl`. This file can then be loaded by the backend server to make live predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_output_path = '../backend/models/fuel_model.pkl'\n",
    "\n",
    "try:\n",
    "    joblib.dump(model, model_output_path)\n",
    "    print(f\"Model successfully saved to: {model_output_path}\")\n",
    "except Exception as e:\n",
    "    print(f\"Error saving model: {e}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
