# SoH estimation experiments on Renault vehicles
We will try to express the soh as:
```
soh = charging.battery_energy / (charging.battery_level * model_battery_capacity) 
```

This expression assumes that the `charging.battery_energy` variable represents the actual energy in the battery and is not simply a cross product of `charging.battery_level` and the capacity.
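As a quick sanity check of the expression above, here is a minimal sketch that applies it to a single made-up reading. The function name `naive_soh` and the sample values are illustrative assumptions; only the capacity of 52 comes from this notebook.

```python
def naive_soh(battery_energy: float, battery_level: float, capacity: float) -> float:
    """Ratio of the reported energy to the energy a new pack would hold at this level."""
    return battery_energy / (battery_level * capacity)

# Hypothetical reading: 23.4 units of energy at a battery level of 0.5 (50 %)
example_soh = naive_soh(battery_energy=23.4, battery_level=0.5, capacity=52.0)
print(f"naive soh: {example_soh:.3f}")
```

Note that `battery_level` is treated here as a fraction between 0 and 1, which matches the `soc = battery_level * 100` step used later.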

## Setup

In [None]:
! mkdir -p data_cache

### Imports

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from rich import print
import pandas as pd
from pandas import Series
from pandas import DataFrame as DF
import plotly.express as px
from core.config import *

from transform.high_mobility.high_mobility_raw_tss import get_raw_tss
from transform.ayvens.ayvens_fleet_info import fleet_info

### Data extraction

Then we will use data found online to get the default battery capacity of each model.  
Note: *Here a model is a combination of the `Model` and `Type` fleet_info variables, since cars of the same model with different types can have different battery capacities*.

Let's extract the raw time series of all the cars we have into a multi-indexed tss. 
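One possible way to key capacities on the `Model` and `Type` pair is sketched below. The toy table, the `some_model` and `OTHER_TYPE` labels and the value 40.0 are placeholders; only the pairing of R135 with 52 is taken from this notebook.

```python
import pandas as pd

# Toy stand-in for fleet_info (hypothetical rows)
toy_fleet_info = pd.DataFrame({
    "vin": ["VIN_A", "VIN_B"],
    "Model": ["some_model", "some_model"],
    "Type": ["R135", "OTHER_TYPE"],
})

# Capacity lookup keyed on (Model, Type); the OTHER_TYPE entry is made up
CAPACITY_BY_MODEL_TYPE = {
    ("some_model", "R135"): 52.0,
    ("some_model", "OTHER_TYPE"): 40.0,
}

# Same Model, different Type, hence possibly a different capacity
toy_fleet_info["capacity"] = [
    CAPACITY_BY_MODEL_TYPE.get(key)
    for key in zip(toy_fleet_info["Model"], toy_fleet_info["Type"])
]
print(toy_fleet_info)
```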

In [None]:
raw_tss = get_raw_tss("renault")

In [None]:
# Count the number of unique VINs
nombre_vin_uniques = raw_tss['vin'].nunique()

print(f"Number of distinct VINs in raw_tss: {nombre_vin_uniques}")

**Note**: *There are only R135 models.*

### Time series processing
Let's implement a naive soh estimation pipeline.  

In [None]:
COLS_TO_CPY_FROM_FLEET_INFO = [
    "make",
    "model",
    "version",
    "dummy_soh_maker_offset",
    "dummy_soh_model_offset",
    "dummy_soh_model_slope",
    "dummy_soh_vehicle_offset",
    "capacity",
]

# Duplicate keys removed: in a dict literal the last duplicate wins, so
# "charging.battery_level" was effectively mapped to "battery_level", not "soc".
RENAME_COLS_DICT: dict[str, str] = {
    "date_of_value": "date",
    "diagnostics.odometer": "odometer",
    "odometer.value": "odometer",
    "mileage_km": "odometer",
    "mileage": "odometer",
    "charging.battery_energy": "battery_energy",
    "charging.estimated_range": "estimated_range",
    "charging.battery_level": "battery_level",
    "soc_hv_header": "soc",
}

COLS_TO_KEEP = [
    "date",
    "soc",
    "odometer",
    "estimated_range",
    "battery_energy",
    "vin",
]

COL_DTYPES = {
    "soc": "float",
    "odometer": "float",
    "estimated_range": "float",
    "battery_energy": "float",
    "vin": "string",
    "capacity": "float",
}

In [None]:
tss:DF = (
    raw_tss
    .merge(fleet_info[COLS_TO_CPY_FROM_FLEET_INFO], on="vin", how="left")
    .assign(capacity=52)  # Hot fix: some fleet_info capacity values wrongly contain "R135" (the car's version) instead of 52 (its capacity)
    .rename(columns=RENAME_COLS_DICT)
    .eval("soc = battery_level * 100")
    .astype(COL_DTYPES, errors="ignore")
    .eval("expected_battery_energy = capacity * battery_level")
    .eval("soh = 100 * battery_energy / expected_battery_energy") 
)
tss.columns
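To make the effect of each step in the chain above easier to follow, here is a minimal sketch of the same rename and `eval` steps on two made-up readings. The names `toy_raw` and `toy_tss` and the sample values are assumptions for illustration, and the merge with `fleet_info` is left out.

```python
import pandas as pd

# Two made-up readings using the raw column names handled above
toy_raw = pd.DataFrame({
    "vin": ["VIN_A", "VIN_A"],
    "charging.battery_level": [0.5, 1.0],
    "charging.battery_energy": [24.7, 48.0],
})

toy_tss = (
    toy_raw
    .rename(columns={
        "charging.battery_level": "battery_level",
        "charging.battery_energy": "battery_energy",
    })
    .assign(capacity=52.0)
    .eval("soc = battery_level * 100")
    .eval("expected_battery_energy = capacity * battery_level")
    .eval("soh = 100 * battery_energy / expected_battery_energy")
)
print(toy_tss[["soc", "expected_battery_energy", "soh"]])
```

The expected energy scales linearly with the battery level, so any non-linearity in the reported energy shows up directly as an soh that varies with soc.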

## EDA

### Assumption verification
First, we will verify that `soc` and `battery_energy` are two "real" variables.  
That is, neither of them is calculated from the other.
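One numerical way to test this assumption, sketched here on made-up values, is to look at the ratio `battery_energy / soc`: if the energy were a plain cross product of `soc` and a constant, that ratio would not vary at all. The helper `ratio_spread` and the sample arrays are hypothetical.

```python
import numpy as np

def ratio_spread(soc, battery_energy) -> float:
    """Relative spread (std / mean) of battery_energy / soc.

    A value near 0 means the energy is just soc times a constant.
    """
    ratio = np.asarray(battery_energy, dtype=float) / np.asarray(soc, dtype=float)
    return float(ratio.std() / ratio.mean())

soc_grid = np.array([10.0, 50.0, 100.0])
derived_spread = ratio_spread(soc_grid, soc_grid * 0.52)         # energy computed from soc
independent_spread = ratio_spread(soc_grid, [6.5, 27.0, 48.0])   # made-up independent readings
print(derived_spread, independent_spread)
```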

In [None]:
# Count the number of unique VINs
nombre_vin_uniques = tss['vin'].nunique()

print(f"Number of distinct VINs in tss: {nombre_vin_uniques}")

In [None]:
px.scatter(tss, x="soc", y="battery_energy", color="vin")

Looking at this scatter plot we can see that:
- Both are independently reported variables; neither is a synthetic value calculated from the other.  
- The spread in `battery_energy` for a given `soc` is much larger at high `soc` values.

Let's verify that the `soh` is not correlated with the `soc` or `odometer`.

In [None]:
from scipy.stats import linregress as lr 

KEYS = [
    "slope",
    "intercept",
    "r_value",
    "p_value",
    "std_err",
]

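# Fit battery_energy = slope * soc + intercept separately for each vehicle
soc_points = (
    tss
    .dropna(subset=["soc", "battery_energy"])  # linregress returns NaN if any input is NaN
    .groupby("vin")
    .apply(lambda vin_ts: Series(lr(vin_ts["soc"], vin_ts["battery_energy"]), KEYS))
    .reset_index(drop=False)
)
soc_points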
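The regression above relates `battery_energy` to `soc`. A complementary check, sketched here on a made-up frame, is to compute the per-VIN Pearson correlation of `soh` with `soc` and `odometer` directly. The frame `toy_soh` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up readings for two vehicles
toy_soh = pd.DataFrame({
    "vin": ["A"] * 4 + ["B"] * 4,
    "soc": [20.0, 40.0, 60.0, 80.0] * 2,
    "odometer": [1000.0, 1100.0, 1200.0, 1300.0, 5000.0, 5200.0, 5400.0, 5600.0],
    "soh": [95.0, 94.0, 93.0, 92.0, 90.0, 90.5, 89.5, 90.0],
})

# Per-VIN correlation matrix, then keep only the row describing soh
soh_corr = (
    toy_soh
    .groupby("vin")[["soc", "odometer", "soh"]]
    .corr()
    .xs("soh", level=1)[["soc", "odometer"]]
)
print(soh_corr)
```

Values close to 0 would support the idea that the soh estimate does not depend on the state of charge, while values close to -1 or 1 point to a bias in the estimator.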

In [None]:
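# soc_points holds regression parameters, not points: evaluate each
# per-VIN line at both ends of the soc range so that it can be drawn
soc_lines = (
    soc_points
    .merge(DF({"soc": [0.0, 100.0]}), how="cross")
    .eval("battery_energy = slope * soc + intercept")
)
px.line(soc_lines, x="soc", y="battery_energy", color="vin")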

In [None]:
px.box(tss, x="soc", y="soh")

This simple soh calculation highlights a non-linear relationship between the `soc` and `battery_energy` variables.  
Looking at a single vin, we can see that the energy is 1.6 at 1% soc and 48 at 100% soc (so 0.48 per soc point at 100%, versus 1.6 per soc point at 1%).  
To compensate for this, we will use a linear regression model with a log-engineered feature.  
The output of this model will be used to estimate the expected energy at each soc.  
Knowing the total energy capacity of the only model that we are studying (R135), we can add a constraint to the training.  
This will force the model to output the 100% soh `expected_energy` at any given soc instead of the average of the `battery_energy` points that we have.  
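To get a feel for the log model before fitting it on real data, the sketch below solves `energy = slope * log(soc) + intercept` through the two readings quoted above (1.6 at 1% soc and 48 at 100% soc). This two-point solve is only an illustration under that assumption, not the fitting procedure used in this notebook, and the names are hypothetical.

```python
import math

# Readings quoted in the text
SOC_LOW, ENERGY_LOW = 1.0, 1.6
SOC_HIGH, ENERGY_HIGH = 100.0, 48.0

# Solve energy = slope * log(soc) + intercept through both points
log_slope = (ENERGY_HIGH - ENERGY_LOW) / (math.log(SOC_HIGH) - math.log(SOC_LOW))
log_intercept = ENERGY_LOW - log_slope * math.log(SOC_LOW)

def expected_energy(soc: float) -> float:
    """Energy predicted by the two-point log curve at a given soc (in %)."""
    return log_slope * math.log(soc) + log_intercept

for soc_value in (1.0, 25.0, 50.0, 100.0):
    print(f"soc={soc_value:5.1f} -> expected energy {expected_energy(soc_value):.2f}")
```

Because `log(1) = 0`, the intercept equals the energy at 1% soc, and the slope is the energy gained per unit of `log(soc)`.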

In [None]:
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.integrate import quad

# Define the target integral value
TARGET_INTEGRAL = 52
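# Keep rows where both variables are known and soc > 0, since log(0) is undefined
tss = tss.dropna(subset=["soc", "battery_energy"], how="any").query("soc > 0").copy()
tss["log_soc"] = np.log(tss["soc"])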

# Fit a linear model to the log-transformed data using np.polyfit
coeffs = np.polyfit(tss["log_soc"], tss['battery_energy'], 1)  # 1-degree polynomial (linear fit)

# Extract slope (a) and intercept (b) from polyfit
a, b = coeffs
print(f"Initial polyfit coefficients: a = {a}, b = {b}")

# Define the model function: a * log(soc) + b
def model_func(soc, a, b):
    return a * np.log(soc) + b

# Function to calculate the integral of the model between soc=1 and soc=100
def integral_of_model(a, b):
    # Integrate the model function from 1 to 100
    integral, _ = quad(lambda soc: model_func(soc, a, b), 1, 100)
    return integral

# Compute the integral of the model with initial polyfit parameters
initial_integral = integral_of_model(a, b)
print(f"Initial integral of the model: {initial_integral}")

# Adjust the intercept b to satisfy the integral constraint
adjusted_b = b + (TARGET_INTEGRAL - initial_integral) / (100 - 1)  # Spread the adjustment over the range of soc

# Print the adjusted coefficients
print(f"Adjusted coefficients: a = {a}, adjusted_b = {adjusted_b}")

# Use the adjusted model to make predictions
tss['predicted_battery_energy'] = model_func(tss['soc'], a, adjusted_b)
# Compare the measured energy (not the battery level) with the model's prediction
tss:DF = tss.eval("soh2 = 100 * battery_energy / predicted_battery_energy")
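The intercept adjustment above relies on the fact that adding a constant `delta` to `b` changes the integral over `[1, 100]` by `delta * 99`. The sketch below checks this with the closed-form antiderivative of `a * log(s) + b`, which is `a * (s * log(s) - s) + b * s`. The coefficients are arbitrary illustrative values, and the `_demo` names are chosen so as not to overwrite the notebook's variables.

```python
import math

def closed_form_integral(a, b, lo=1.0, hi=100.0):
    """Exact integral of a * log(s) + b between lo and hi."""
    def antiderivative(s):
        return a * (s * math.log(s) - s) + b * s
    return antiderivative(hi) - antiderivative(lo)

def midpoint_integral(a, b, lo=1.0, hi=100.0, n=100_000):
    """Numerical cross-check using the midpoint rule."""
    step = (hi - lo) / n
    return step * sum(a * math.log(lo + (i + 0.5) * step) + b for i in range(n))

a_demo, b_demo, target_demo = 10.0, 2.0, 52.0  # arbitrary illustrative values

# Same adjustment as in the cell above: spread the gap over the soc range
adjusted_b_demo = b_demo + (target_demo - closed_form_integral(a_demo, b_demo)) / (100 - 1)

print(closed_form_integral(a_demo, b_demo), midpoint_integral(a_demo, b_demo))
print(closed_form_integral(a_demo, adjusted_b_demo))
```

After the adjustment the integral matches the target, which confirms the arithmetic of the shift; the slope is left untouched, so only the level of the curve moves.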


In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Custom logarithmic transformer for soc
log_transformer = FunctionTransformer(np.log1p, validate=True)
notna_feature_mask = tss[["soc", "battery_energy"]].notna().all(axis="columns")
print(notna_feature_mask)
# Define the feature matrix and the target as numpy arrays
X = tss.loc[notna_feature_mask, ['soc']].values  # 2D array of shape (n_samples, 1), as scikit-learn expects
y = tss.loc[notna_feature_mask, 'battery_energy'].values  # 1D array of targets

# Construct the pipeline: log-transform soc, scale it, then fit a linear regression
pipeline = Pipeline(steps=[
    ('log', log_transformer),         # Log-transform soc with np.log1p
    ('scaler', StandardScaler()),     # Optional: scale the feature
    ('model', LinearRegression())     # Apply linear regression to the transformed feature
])

# Fit the pipeline on the data
pipeline.fit(X, y)

# Predict battery energy based on soc
predicted_battery_energy = pipeline.predict(X)

# Add predictions to the DataFrame
tss.loc[notna_feature_mask, 'predicted_battery_energy'] = predicted_battery_energy



In [None]:
px.scatter(tss, x="soc", y="predicted_battery_energy")