# Tesla SoH std EDA
The SoH of Tesla vehicles is computed as `energy_added / soc_diff / 100 * capacity`.  
The issue with this relation is that charges with a low `energy_added` and `soc_diff` tend to give less accurate values, as measurement noise weighs more heavily on the SoH estimate.  
To mitigate this, we apply a simple `soc_diff > X` filter to get the final result.  
However, this filter brings its own trade-off.  
If the filter is too restrictive, we end up with very few vehicles in the final result.  
If the filter is too permissive, we keep more vehicles in the final result, but with noisier SoH values.  
The goal of this notebook is to understand how the SoH standard deviation of Tesla vehicles varies with the soc diff.  
This will, in turn, help us decide which filter to apply to get the final result.  
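
To make the noise argument concrete, here is a minimal simulation sketch. It does not use the real data: the energy per SoC point and both noise levels are hypothetical values picked for illustration, and only the noise-sensitive `energy_added / soc_diff` ratio is simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values chosen for illustration only.
true_kwh_per_soc_point = 0.75  # energy needed to raise the SoC by one point
energy_noise_kwh = 0.3         # std of the error on energy_added
soc_noise_points = 0.5         # std of the error on soc_diff


def ratio_std(soc_diff: float, n_charges: int = 20_000) -> float:
    """Std of energy_added / soc_diff over simulated charges with the same true soc_diff."""
    energy = true_kwh_per_soc_point * soc_diff + rng.normal(0, energy_noise_kwh, n_charges)
    measured_soc_diff = soc_diff + rng.normal(0, soc_noise_points, n_charges)
    return float(np.std(energy / measured_soc_diff))


ratio_std_per_soc_diff = {soc_diff: ratio_std(soc_diff) for soc_diff in (5, 10, 20, 40, 80)}
for soc_diff, std in ratio_std_per_soc_diff.items():
    print(f"soc_diff={soc_diff:>2}: std of energy_added / soc_diff = {std:.4f}")
```

Under these assumptions (constant absolute noise on both measurements), the spread of the ratio shrinks as the soc diff grows, which is the effect the `soc_diff > X` filter relies on.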

## Setup

### Imports

In [None]:
import plotly.express as px

from core.pandas_utils import *
from core.ev_models_info import models_info
from transform.processed_tss.ProcessedTimeSeries import ProcessedTimeSeries

### Data extraction

In [None]:
tss = ProcessedTimeSeries("tesla")

In [None]:
results = (
    tss
    .query("trimmed_in_charge")
    .groupby(["vin", "trimmed_in_charge_idx"])
    .agg(
        energy_added=pd.NamedAgg("charge_energy_added", series_start_end_diff),
        soc_diff=pd.NamedAgg("soc", series_start_end_diff),
        soc_start=pd.NamedAgg("soc", "first"),
        soc_end=pd.NamedAgg("soc", "last"),
        temp=pd.NamedAgg("inside_temp", "mean"),
        capacity=pd.NamedAgg("capacity", "first"),
        odometer=pd.NamedAgg("odometer", "first"),
        fast_charger_type=pd.NamedAgg("fast_charger_type", Series.mode),
        size=pd.NamedAgg("soc", "size"),
        model=pd.NamedAgg("model", "first"),
        version=pd.NamedAgg("version", "first"),
        date=pd.NamedAgg("date", "first"),
    )
    .reset_index(drop=False)
    .eval("soh = energy_added / soc_diff / 100 * capacity")
    .eval("model_version = model + version")
)
results

### EDA

In [None]:
px.box(
    results,
    x="soc_diff",
    y="soh",
)

Zooming in on the positive soc diffs, we can see that there is indeed a correlation between the SoH std and the soc diff.  
It would be worth investigating why there are negative soc diffs.

Let's describe the SoH distribution over the soc diff.

In [None]:
results_stats_per_soc_diff = (
    results
    .query("soc_diff > 0")
    .groupby("soc_diff")
    .agg(
        soh_mean=pd.NamedAgg("soh", "mean"),
        soh_std=pd.NamedAgg("soh", "std"),
        nb_soh_points=pd.NamedAgg("soh", "count"),
    )
    .reset_index(drop=False)
    .eval("cum_nb_soh_points = nb_soh_points.cumsum()")
    .eval("cum_ratio_soh_points = cum_nb_soh_points / cum_nb_soh_points.max()")
)
results_stats_per_soc_diff

While this is useful, it is not quite enough for us to decide which filter to apply.  
We also need to know how many vehicles have charges above a given soc diff.  
This will tell us how many vehicles we can actually compute the SoH for at a given soc diff threshold.  
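
As a small illustration of that count, the sketch below uses a toy table (the VINs and soc diff values are made up): a vehicle survives a `soc_diff > threshold` filter as soon as its largest charge passes it.

```python
import pandas as pd

# Toy charges: the VINs and soc_diff values are made up for illustration.
toy_charges = pd.DataFrame({
    "vin": ["A", "A", "B", "B", "C", "D"],
    "soc_diff": [10, 35, 8, 15, 60, 5],
})

# Largest soc diff observed for each vehicle.
max_soc_diff_per_vin = toy_charges.groupby("vin")["soc_diff"].max()


def nb_vins_kept(threshold: float) -> int:
    """Number of vehicles with at least one charge strictly above the threshold."""
    return int((max_soc_diff_per_vin > threshold).sum())


vins_kept_per_threshold = {threshold: nb_vins_kept(threshold) for threshold in (0, 10, 20, 50)}
print(vins_kept_per_threshold)  # {0: 4, 10: 3, 20: 2, 50: 1}
```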

In [None]:
cum_nb_vin_per_max_soc_diff: DF = (
    results
    .query("soc_diff > 0")
    .groupby("vin")
    .agg(max_soc_diff=pd.NamedAgg("soc_diff", "max"))
    .reset_index(drop=False)
    .groupby("max_soc_diff")
    .agg(nb_vins=pd.NamedAgg("vin", "nunique"))
    .reset_index(drop=False)
    .sort_values("max_soc_diff")
    .eval("cum_nb_vins = nb_vins.cumsum()")
    .eval("cum_ratio_vins = cum_nb_vins / cum_nb_vins.max()")
    .pipe(left_merge, results_stats_per_soc_diff, "max_soc_diff", "soc_diff")
    .assign(cum_min_soh_std_to_come=lambda df: df["soh_std"].iloc[::-1].cummin())
)
cum_nb_vin_per_max_soc_diff.to_csv("data_cache/cum_nb_vin_per_max_soc_diff.csv", index=False)
with pd.option_context("display.max_rows", None):
    display(cum_nb_vin_per_max_soc_diff)

## Conclusion
A 20% soc diff seems to be a good threshold: it keeps a good number of vehicles while keeping the noise low.  
Ideally, we would use some sort of optimization to find the best threshold.  
Or, even better, apply a threshold per vehicle.  
However, this would mean that the reliability of the SoH would vary from one vehicle to another.  
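
One possible shape for such an optimization is a plain sweep over candidate thresholds. The sketch below is only an illustration on synthetic data: `sweep_thresholds`, `pick_threshold`, and the "keep at least 95% of the vehicles" criterion are hypothetical and not part of the existing code base.

```python
import numpy as np
import pandas as pd


def sweep_thresholds(charges: pd.DataFrame, thresholds) -> pd.DataFrame:
    """For each candidate threshold, report the share of vehicles kept and the SoH std."""
    total_vins = charges["vin"].nunique()
    rows = []
    for threshold in thresholds:
        kept = charges[charges["soc_diff"] > threshold]
        rows.append({
            "threshold": threshold,
            "ratio_vins_kept": kept["vin"].nunique() / total_vins,
            "soh_std": kept["soh"].std(),
        })
    return pd.DataFrame(rows)


def pick_threshold(sweep: pd.DataFrame, min_ratio_vins_kept: float) -> int:
    """Lowest-noise threshold that still keeps the required share of vehicles."""
    eligible = sweep[sweep["ratio_vins_kept"] >= min_ratio_vins_kept]
    return int(eligible.loc[eligible["soh_std"].idxmin(), "threshold"])


# Synthetic charges: the SoH noise shrinks as soc_diff grows (assumption for the demo).
rng = np.random.default_rng(0)
n_charges = 2_000
soc_diff = rng.integers(1, 91, n_charges)
toy_charges = pd.DataFrame({
    "vin": rng.integers(0, 100, n_charges).astype(str),
    "soc_diff": soc_diff,
    "soh": 0.9 + rng.normal(0, 1, n_charges) / soc_diff,
})

sweep = sweep_thresholds(toy_charges, thresholds=range(0, 90, 5))
best_threshold = pick_threshold(sweep, min_ratio_vins_kept=0.95)
print(sweep)
print("picked threshold:", best_threshold)
```

The constraint on the share of vehicles kept is what prevents the sweep from always picking the most restrictive filter; the 95% value is arbitrary and would need to be chosen for the real use case.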