# SoH estimation experiments on Renault vehicles
We will try to express the soh as:
```
soh = charging.battery_energy / (charging.battery_level * model_battery_capacity) 
```

This expression assumes that the `charging.battery_energy` variable represents the actual energy in the battery and is not simply derived from `charging.battery_level` by a cross product.
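As a quick sanity check of the formula, the sketch below evaluates it on made-up numbers. The function name `naive_soh` and the sample reading are illustrative assumptions; `battery_level` is assumed to be a fraction between 0 and 1, and the capacity is assumed to be in the same unit as `battery_energy`.

```python
def naive_soh(battery_energy: float, battery_level: float, model_battery_capacity: float) -> float:
    """Ratio of the reported energy to the energy a new battery would hold at this level."""
    if battery_level <= 0:
        # The ratio is undefined for an empty battery
        raise ValueError("battery_level must be strictly positive")
    return battery_energy / (battery_level * model_battery_capacity)


# Hypothetical reading: 23.4 units of energy at a 50% level for a 52-unit battery
example_soh = naive_soh(23.4, 0.5, 52.0)
print(round(example_soh, 3))
```

With these numbers a new battery would hold 26 units at 50%, so the ratio expresses how far the reported energy falls below that.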

## Setup

In [None]:
! mkdir -p data_cache

### Imports

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from rich import print
import pandas as pd
from pandas import Series
from pandas import DataFrame as DF
import plotly.express as px
from core.config import *
from scipy.stats import linregress as lr 

from transform.high_mobility.high_mobility_raw_tss import get_raw_tss
from transform.ayvens.ayvens_fleet_info import fleet_info

### Data extraction

Then we will use data found online to get the default battery capacity of each model.  
Note: *Here a model is a combination of the `Model` and `Type` fleet_info variables, since cars of the same model with different types can have different battery capacities*.
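A minimal sketch of this lookup, assuming the default capacities live in a small table keyed by both `Model` and `Type`. The VINs, model and type names, and capacity values below are placeholders, not real fleet data.

```python
import pandas as pd

# Placeholder fleet rows; the column names mirror the `Model` and `Type` variables
fleet = pd.DataFrame({
    "vin": ["VIN_A", "VIN_B", "VIN_C"],
    "Model": ["model_x", "model_x", "model_x"],
    "Type": ["type_a", "type_b", "type_a"],
})

# Placeholder table of default capacities, one row per (Model, Type) pair
default_capacity = pd.DataFrame({
    "Model": ["model_x", "model_x"],
    "Type": ["type_a", "type_b"],
    "capacity": [50.0, 40.0],
})

# Joining on both columns lets two types of the same model get different capacities
fleet_with_capacity = fleet.merge(default_capacity, on=["Model", "Type"], how="left")
print(fleet_with_capacity)
```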

Let's extract the raw time series of all the cars we have into a multi-indexed tss. 

In [None]:
raw_tss = get_raw_tss("renault")

In [None]:
# Count the number of unique VINs
nombre_vin_uniques = raw_tss['vin'].nunique()

print(f"Number of distinct VINs in raw_tss: {nombre_vin_uniques}")

**Note**: *There are only R135 models.*

### Time series processing
Let's implement a naive soh estimation pipeline.  

In [None]:
COLS_TO_CPY_FROM_FLEET_INFO = [
    "make",
    "model",
    "version",
    "dummy_soh_maker_offset",
    "dummy_soh_model_offset",
    "dummy_soh_model_slope",
    "dummy_soh_vehicle_offset",
    "capacity",
]

KEYS = [
    "slope",
    "intercept",
    "r_value",
    "p_value",
    "std_err",
]

RENAME_COLS_DICT: dict[str, str] = {
    "date_of_value": "date",
    "diagnostics.odometer": "odometer",
    "odometer.value": "odometer",
    "mileage_km": "odometer",
    "mileage": "odometer",
    "charging.battery_energy": "battery_energy",
    "charging.estimated_range": "estimated_range",
    "soc_hv_header": "soc",
    # `soc` is derived from this column as `battery_level * 100` in the pipeline
    "charging.battery_level": "battery_level",
}

COLS_TO_KEEP = [
    "date",
    "soc",
    "odometer",
    "estimated_range",
    "battery_energy",
    "vin",
]

COL_DTYPES = {
    "soc": "float",
    "odometer": "float",
    "estimated_range": "float",
    "battery_energy": "float",
    "vin": "string",
    "capacity": "float",
}

In [None]:
tss:DF = (
    raw_tss
    .merge(fleet_info[COLS_TO_CPY_FROM_FLEET_INFO], on="vin", how="left")
    .assign(capacity=52)  # Hot fix: some capacity values in the fleet info wrongly contain "R135" (the version of the car) instead of 52 (its capacity)
    .rename(columns=RENAME_COLS_DICT)
    .eval("soc = battery_level * 100")
    .astype(COL_DTYPES, errors="ignore")
    .eval("expected_battery_energy = capacity * battery_level")
    .eval("soh = 100 * battery_energy / expected_battery_energy") 
)
tss.columns

## EDA

### Assumption verification
First, we will verify that `soc` and `battery_energy` are two "real" variables.  
That is, neither of them is calculated from the other.

In [None]:
# Count the number of unique VINs
nombre_vin_uniques = tss['vin'].nunique()

print(f"Number of distinct VINs in tss: {nombre_vin_uniques}")

In [None]:
px.scatter(tss, x="soc", y="battery_energy", color="vin")

Looking at this scatter plot we can see that:
- The two variables are in fact two real variables instead of one being a synthetic variable calculated from the other.  
- The difference is much larger at high `soc` values.
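The visual check can be complemented with a numeric one. If `battery_energy` were a plain cross product of `soc`, the ratio `battery_energy / soc` would be constant for each vin. The sketch below measures the relative spread of that ratio per vin; the helper name `ratio_spread` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def ratio_spread(df: pd.DataFrame) -> pd.Series:
    """Per-vin relative spread (std / mean) of battery_energy / soc."""
    valid = df[df["soc"] > 0].copy()  # the ratio is undefined at soc == 0
    valid["ratio"] = valid["battery_energy"] / valid["soc"]
    stats = valid.groupby("vin")["ratio"].agg(["mean", "std"])
    return stats["std"] / stats["mean"]


# Toy data: "derived" is an exact cross product of soc, "measured" is not
toy = pd.DataFrame({
    "vin": ["derived"] * 3 + ["measured"] * 3,
    "soc": [10.0, 50.0, 100.0] * 2,
    "battery_energy": [5.2, 26.0, 52.0, 6.5, 25.0, 48.0],
})
spread = ratio_spread(toy)
print(spread)
```

A spread close to zero for every vin would suggest a synthetic variable, while a clearly non-zero spread is consistent with two independent measurements.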

Let's verify that the `soh` is not correlated with the `soc` or `odometer`.
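One way to quantify this is a plain correlation table. The sketch below uses Pearson correlation on toy values; the helper name `soh_correlations` and the numbers are illustrative assumptions, and a real check would run on `tss` after dropping missing values.

```python
import pandas as pd


def soh_correlations(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation of soh with soc and odometer."""
    corr_matrix = df[["soh", "soc", "odometer"]].corr()
    return corr_matrix["soh"].drop("soh")


# Toy data where soh grows exactly linearly with soc
toy = pd.DataFrame({
    "soh": [90.0, 92.0, 94.0, 96.0],
    "soc": [20.0, 40.0, 60.0, 80.0],
    "odometer": [40000.0, 10000.0, 30000.0, 20000.0],
})
correlations = soh_correlations(toy)
print(correlations)
```

A coefficient near zero supports independence, while a value close to 1 or -1, as for `soc` in this toy data, signals that the soh estimate depends on the variable.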

In [None]:
energy_over_soc = (
    tss
    .groupby(["vin", "soc"])
    .agg({"soc":"median", "battery_energy":"median"})
    .reset_index(level=0)
)
energy_over_soc

In [None]:
import numpy as np
from scipy.optimize import minimize
from pandas import Series

def lr_with_positive_intercept(tss) -> Series:
    """Fit battery_energy = slope * soc + intercept with the intercept bounded to [0.5, 2.5]."""
    # Get X and y values from the dataframe
    X = tss["soc"].values
    y = tss["battery_energy"].values
    
    # Define the objective function to minimize sum of squared residuals
    def objective_function(params, X, y):
        slope, intercept = params
        y_pred = slope * X + intercept
        return np.sum((y - y_pred) ** 2)
    
    # Define the constraint: intercept between 0.5 and 2.5
    def constraint_intercept(params):
        intercept = params[1]
        return intercept - 0.5, 2.5 - intercept
    
    # Define the constraint: y_pred at X=100 should be between 45 and 52
    def constraint_y_at_100(params):
        slope, intercept = params
        y_pred_100 = slope * 100 + intercept
        return y_pred_100 - 45, 52 - y_pred_100
    
    # Only the intercept bounds are active; the bounds on y_pred at soc=100 are disabled
    constraints = [
        {'type': 'ineq', 'fun': lambda params: constraint_intercept(params)[0]},  # intercept >= 0.5
        {'type': 'ineq', 'fun': lambda params: constraint_intercept(params)[1]},  # intercept <= 2.5
        # {'type': 'ineq', 'fun': lambda params: constraint_y_at_100(params)[0]},  # y_pred_100 >= 45
        # {'type': 'ineq', 'fun': lambda params: constraint_y_at_100(params)[1]},  # y_pred_100 <= 52
    ]
    
    # Initial guess for slope and intercept
    initial_guess = np.array([0, 1])

    # Minimize the objective function with the constraints
    result = minimize(objective_function, initial_guess, args=(X.squeeze(), y), constraints=constraints)

    # Extract the optimal slope and intercept
    slope, intercept = result.x

    return Series({"slope": slope, "intercept": intercept})


In [None]:


soc_to_energy_relation_descr_stats:DF = (
    tss
    .dropna(subset=["soc", "battery_energy"], how="any")
    .groupby("vin")
    .apply(
        lr_with_positive_intercept,
        include_groups=False,
    )
    .eval("soc_0_battery_energy = intercept")
    .eval("soc_100_battery_energy = intercept + slope * 100")
    .eval("integral = (slope * (100**2) / 2) + (intercept * 100)")
    .eval("soh = soc_100_battery_energy / 51 * 100")
    .merge(tss.groupby("vin").agg({"odometer": "max"}), on="vin", how="left")
)
soc_to_energy_relation_descr_stats

In [None]:
COLS_TO_MERGE = [
    "slope",
    "intercept",
    "r_value",
    "p_value",
    "std_err",
]

plt_soc_to_energy_relation_descr_stats = (
    soc_to_energy_relation_descr_stats
    .loc[:, ["soc_0_battery_energy", "soc_100_battery_energy"]]
    .T
    .unstack()
    .reset_index()
    .rename(columns={"level_1": "soc", 0: "battery_energy"})
    .replace({"soc_0_battery_energy": 0, "soc_100_battery_energy": 100})
)

plt_soc_to_energy_relation_descr_stats.dtypes

In [None]:
soc_to_energy_relation_descr_stats.sort_values("soh")

In [None]:
px.scatter(
    soc_to_energy_relation_descr_stats,
    x="odometer",
    y="soh",
    trendline="ols"
)

In [None]:
px.line(
    plt_soc_to_energy_relation_descr_stats,
    x="soc",
    y="battery_energy",
    color="vin"
)

In [None]:
px.line(
    energy_over_soc.query("vin == 'VF1AG000366007352'"),
    x="soc",
    y="battery_energy",
    color="vin",
    # opacity=0.4,
)

In [None]:
px.scatter(
    tss.query("vin == 'VF1AG000164535225'"),
    x="soc",
    y="battery_energy",
    # color="vin",
    # opacity=0.4,
)

In [None]:
px.box(tss, x="soc", y="soh")

This simple soh calculation highlights a non-linear relationship between the `soc` and `battery_energy` variables.  
Looking at a single vin, we can see that the energy is 1.6 at 1% soc and 48 at 100% soc, whereas a purely proportional relationship would give 0.48 per percent of soc.  
To compensate for this, we will use a linear regression model with a log-engineered feature.  
The output of this model will be used to estimate the expected energy at each soc.  
Since we know the total energy capacity of the only model that we are studying (R135), we can add a constraint to the training.  
This will force the model to output the `expected_energy` of a 100% soh battery at any given soc instead of the average of the `battery_energy` points that we have.  
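
One possible way to set this up is sketched below. It assumes the model form `energy = a * soc + b * log(soc) + c` and enforces the constraint that the prediction at 100% soc equals the known capacity by substituting `c` out before an ordinary least-squares fit. The function names, the model form, and the synthetic data are assumptions made for illustration, not the final pipeline.

```python
import numpy as np


def fit_constrained_log_model(soc, energy, capacity):
    """Least-squares fit of a*soc + b*log(soc) + c, with the prediction at soc=100 pinned to capacity."""
    soc = np.asarray(soc, dtype=float)  # must be strictly positive because of the log
    energy = np.asarray(energy, dtype=float)
    # With c = capacity - 100*a - b*log(100), the model becomes
    # energy - capacity = a*(soc - 100) + b*(log(soc) - log(100))
    features = np.column_stack([soc - 100.0, np.log(soc) - np.log(100.0)])
    (a, b), *_ = np.linalg.lstsq(features, energy - capacity, rcond=None)
    c = capacity - 100.0 * a - b * np.log(100.0)
    return a, b, c


def expected_energy(soc, a, b, c):
    """Energy predicted by the fitted curve at the given soc (in percent)."""
    soc = np.asarray(soc, dtype=float)
    return a * soc + b * np.log(soc) + c


# Synthetic check: data generated from known coefficients should be recovered
true_a, true_b, capacity = 0.45, 0.5, 52.0
true_c = capacity - 100.0 * true_a - true_b * np.log(100.0)
soc_grid = np.arange(1, 101)
synthetic_energy = expected_energy(soc_grid, true_a, true_b, true_c)
a, b, c = fit_constrained_log_model(soc_grid, synthetic_energy, capacity)
print(a, b, c, expected_energy(100, a, b, c))
```

Under this sketch, a per-point soh could then be computed as `100 * battery_energy / expected_energy(soc, a, b, c)`.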