# SoH estimation
The goal of this notebook is to show all the SoH estimation steps for Tesla vehicles.  
This notebook should always represent the latest version of the SoH estimation that is in `transform.raw_results.tesla_results`.  
If the current method gets deprecated because of a refactor or fundamental change, please copy this notebook into the legacy folder and explain why it was deprecated.

Note: This notebook will get completed in issue DATAEV-279

## Setup

In [None]:
! mkdir -p data_cache

### Imports

In [None]:
import warnings

import numpy as np
import pandas as pd
import plotly.express as px
from scipy.optimize import curve_fit
from sklearn.linear_model import LinearRegression

from core.pandas_utils import *
from core.caching_utils import cache_result
from core.ev_models_info import models_info
from core.stats_utils import lr_params_as_series
from core.plt_utils import plt_3d_df
from transform.fleet_info.main import fleet_info
from transform.processed_tss.ProcessedTimeSeries import TeslaProcessedTimeSeries

## Estimation process

### Raw estimation
Tesla provides a `charge_energy_added` variable that represents the cumulative energy added to the battery during a charge.  
We express the SoH as the ratio between the energy added and the share of the capacity that corresponds to the SoC difference over the charge: `soh = energy_added / (soc_diff / 100 * capacity)`.  
This means that we get an SoH estimation per charge.  
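
As a minimal sketch of the formula above, the following computes the SoH of a single charge. The function name `soh_for_charge` and the numbers are illustrative assumptions, not values from the dataset; SoC is expressed in percent and the capacity in kWh.

```python
def soh_for_charge(energy_added_kwh: float, soc_start: float,
                   soc_end: float, capacity_kwh: float) -> float:
    """SoH = energy added / (fraction of SoC gained * nominal capacity)."""
    soc_diff = soc_end - soc_start
    if soc_diff <= 0:
        raise ValueError("soc_end must be greater than soc_start")
    return energy_added_kwh / (soc_diff / 100.0 * capacity_kwh)


# Hypothetical charge: 20% -> 80% on a 75 kWh pack with 42.3 kWh added.
example_soh = soh_for_charge(42.3, 20.0, 80.0, 75.0)
print(round(example_soh, 3))
```

A charge that gains 60% SoC on a 75 kWh pack should take 45 kWh when the battery is new, so 42.3 kWh added corresponds to an SoH of 0.94.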

In [None]:
# We specify the columns that we will be using to avoid loading unnecessary columns.
# Otherwise my computer tends to crash ;-)
USE_COLS = [
    "vin",
    "trimmed_in_charge_idx",
    "trimmed_in_charge",
    "charge_energy_added",
    "soc",
    "inside_temp",
    "capacity",
    "odometer",
    "model",
    "date",
    "tesla_code",
    "battery_heater",
    "charging_power",
    "version",
]

@cache_result("data_cache/tesla_results.parquet", "local_storage")
def get_raw_results() -> DF:
    return (
        TeslaProcessedTimeSeries("tesla", columns=USE_COLS, filters=[("trimmed_in_charge", "==", True)]) 
        .groupby(["vin", "trimmed_in_charge_idx"])                      # We group by vin and the index of the charge.
        .agg(
            # We use the minimum charge_energy_added rather than the first one.
            # Ideally the first value would do, but the charge masking is noisy: large data gaps sometimes leave two charges end to end, with no discharge in between to separate them.
            # As a result, the first charge_energy_added is not ALWAYS the value at the beginning of the charge.
            # Sometimes we pick up some of the previous charge's points, so we use the minimum to avoid counting them in.
            # This works because charge_energy_added is cumulative: its minimum is the value at the start of the charge.
            energy_added_min=pd.NamedAgg("charge_energy_added", "min"), 
            energy_added_end=pd.NamedAgg("charge_energy_added", "last"),
            soc_end=pd.NamedAgg("soc", "last"),
            soc_min=pd.NamedAgg("soc", "min"),
            #soc_diff=pd.NamedAgg("soc", series_start_end_diff),
            inside_temp=pd.NamedAgg("inside_temp", "mean"),
            capacity=pd.NamedAgg("capacity", "first"),
            odometer=pd.NamedAgg("odometer", "first"),
            version=pd.NamedAgg("version", "first"),
            size=pd.NamedAgg("soc", "size"),
            model=pd.NamedAgg("model", "first"),
            date=pd.NamedAgg("date", "first"),
            charging_power=pd.NamedAgg("charging_power", "median"),
            tesla_code=pd.NamedAgg("tesla_code", "first"),
        )
        .reset_index(drop=False)
        .eval("energy_added = energy_added_end - energy_added_min")
        .eval("soc_diff = soc_end - soc_min")
        .eval("soh = energy_added / (soc_diff / 100.0 * capacity)")
        #.query("soc_diff > 40 & soh.between(0.75, 1.05)")
        #.eval("bottom_soh = soh.between(0.75, 0.9)")
        #.eval("fixed_soh_min_end = soh.mask(tesla_code == 'MTY13', soh / 0.96)")
        #.eval("fixed_soh_min_end = fixed_soh_min_end.mask(bottom_soh & tesla_code == 'MTY13', fixed_soh_min_end + 0.08)")
        .sort_values(["tesla_code", "vin", "date"])
    )


In [None]:
results = get_raw_results(force_update=False)

In [None]:
#px.scatter(
#    results,
#    x="odometer",
#    y="soh",
#    color="tesla_code",
#    #color_discrete_sequence="Rainbow",
#    opacity=0.25,
#)

We can see that the simplest SoH estimation is very noisy...

### Reducing SoH variance by filtering results with low soc diff.
As with any dataset, the values carry some noise, and a division amplifies that noise as the denominator gets smaller.  
Since soc_diff is in the denominator of the SoH, we can reduce the noise by filtering out the results with a low soc diff.  
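
To make the amplification concrete, here is a small sketch that propagates a fixed error on the energy reading through the SoH formula. The 75 kWh capacity and the 0.5 kWh error are assumed values chosen for illustration only, and `soh_error_bound` is a hypothetical helper.

```python
def soh_error_bound(energy_error_kwh: float, soc_diff: float, capacity_kwh: float) -> float:
    """SoH error caused by a fixed error on the energy added."""
    return energy_error_kwh / (soc_diff / 100.0 * capacity_kwh)


CAPACITY_KWH = 75.0     # assumed nominal capacity
ENERGY_ERROR_KWH = 0.5  # assumed error on charge_energy_added

errors = {
    soc_diff: soh_error_bound(ENERGY_ERROR_KWH, soc_diff, CAPACITY_KWH)
    for soc_diff in (5, 10, 20, 40, 80)
}
for soc_diff, err in errors.items():
    print(f"soc_diff={soc_diff:>2}%  SoH error ~ +/-{err:.3f}")
```

Under these assumptions, the same 0.5 kWh error moves the SoH by about 13 points for a 5% charge but by less than 2 points for a 40% charge.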

In [None]:
soh_std_over_soc_diff:DF = (
    results
    .assign(soh_mean=results.groupby("vin")["soh"].transform("mean"))
    .assign(soh_median=results.groupby("vin")["soh"].transform("median"))
    .eval("soh_to_mean = soh - soh_mean")
    .eval("soh_to_median = soh - soh_median")
    .assign(soh_to_mean_abs=lambda df: df["soh_to_mean"].abs())
    .assign(soh_to_median_abs=lambda df: df["soh_to_median"].abs())
)

In [None]:
px.box(
    soh_std_over_soc_diff,
    x="soc_diff",
    y="soh_to_mean_abs",
)

We can see that, per vehicle, the absolute difference between an SoH point and the vehicle's mean SoH decreases until ~40% soc_diff; after that it doesn't change much.  
So this is the minimum soc_diff that we will use to filter SoH points.  
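
The same reading can be sketched numerically by binning soc_diff and summarising the deviation per bin. The points below are made up for a single hypothetical vehicle, the 20% bin width is an arbitrary choice, and the median is used here as the vehicle's reference SoH.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (soc_diff, soh) points for one vehicle.
points = [
    (8, 0.80), (12, 1.08), (25, 0.97), (33, 0.91),
    (45, 0.95), (58, 0.93), (70, 0.94), (85, 0.95),
]
BIN_WIDTH = 20  # arbitrary bin width, in SoC percentage points

vehicle_median = median(soh for _, soh in points)

# Collect the absolute deviation to the vehicle median in each soc_diff bin.
deviation_by_bin = defaultdict(list)
for soc_diff, soh in points:
    bin_start = (soc_diff // BIN_WIDTH) * BIN_WIDTH
    deviation_by_bin[bin_start].append(abs(soh - vehicle_median))

median_deviation = {b: median(v) for b, v in sorted(deviation_by_bin.items())}
for bin_start, dev in median_deviation.items():
    print(f"soc_diff in [{bin_start}, {bin_start + BIN_WIDTH}): {dev:.3f}")
```

With real data the threshold would be read off the point where the per-bin deviation stops decreasing.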

In [None]:
results = results.query("soc_diff > 40")

### Getting the chemistry of the battery of the vehicles.  
This extra information allows us to perform statistical description (and hopefully inference) on the SoH of the vehicles relative to their chemistry.  
Ultimately this information will be pulled from the DB; for now we will do it "manually".  
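
One possible way to keep such a manual mapping unambiguous is a single lookup table from model code to chemistry. This is only a sketch: the codes below are placeholders rather than real Tesla codes, and `chemistry_for` is a hypothetical helper.

```python
# Placeholder codes, for illustration only.
CHEMISTRY_BY_CODE = {
    "CODE_A": "LFP",
    "CODE_B": "NCA",
    "CODE_C": "NMC",
}


def chemistry_for(code: str):
    """Return the chemistry for a model code, or None when the code is unknown."""
    return CHEMISTRY_BY_CODE.get(code)


labels = [chemistry_for(code) for code in ["CODE_A", "CODE_C", "CODE_X"]]
print(labels)
```

With a dictionary, a code cannot end up under two chemistries, and an unknown code yields `None` instead of silently falling into a default group.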

In [None]:
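LFP_TESLA_CODES = [
    "MT351",
    "MT336",
    "MT322",
]
# NOTE: "MT353" is listed under both NCA and NMC.
# The idxmax call below resolves it to NCA, the first matching column.
NCA_TESLA_CODES = [
    "MT353",
    "MT308", # Battery: BT35
]
NMC_TESLA_CODES = [
    "MTY09",
    "MTY12", # Battery: BT43
    "MT353", # Battery: BT43
]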
raw_results = (
    results  # Reuse the results filtered on soc_diff > 40 above rather than reloading the unfiltered cache.
    .eval("LFP = tesla_code in @LFP_TESLA_CODES")
    .eval("NCA = tesla_code in @NCA_TESLA_CODES")
    .eval("NMC = tesla_code in @NMC_TESLA_CODES")
    .sort_values(["tesla_code", "vin"])
)
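chemistry_flags = raw_results[["LFP", "NCA", "NMC"]]
# idxmax returns the first column when no flag is set, so codes that are in none
# of the lists are masked to NaN instead of being labelled LFP.
raw_results["chemistry"] = chemistry_flags.idxmax(axis=1).where(chemistry_flags.any(axis=1))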

### Visualizing SoH over odometer by vin

In [None]:
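results_by_vin = (
    raw_results
    .groupby("vin")
    .agg({"odometer": "last", "soh": "median", "tesla_code": "first", "chemistry": "first"})
    .reset_index()
    # The plots below reference "fixed_soh_min_end", but the per-model correction that produced it
    # is commented out in get_raw_results. Until it is restored, we expose the uncorrected median SoH under that name.
    .assign(fixed_soh_min_end=lambda df: df["soh"])
    .sort_values(["tesla_code", "vin"])
)
results_by_vin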

In [None]:
px.scatter(
    (
        results_by_vin
        .dropna(subset=["fixed_soh_min_end", "odometer", "tesla_code"])
    ),
    x="odometer",
    y="fixed_soh_min_end",
    color="tesla_code",
    color_continuous_scale="Rainbow",
    #trendline="ols",
    opacity=0.25,
    title="SoH(State of Health) over odometer for 12589 Tesla vehicles",
    labels={
        "odometer": "Odometer (km)",
        "fixed_soh_min_end": "SoH (State of Health)",
    },
)

In [None]:
px.scatter(
    (
        results_by_vin
        .dropna(subset=["fixed_soh_min_end", "odometer", "tesla_code"])
    ),
    x="odometer",
    y="fixed_soh_min_end",
    color="chemistry",
    color_continuous_scale="Rainbow",
    #trendline="ols",
    opacity=0.4,
    title="SoH(State of Health) over odometer for 12589 Tesla vehicles",
    labels={
        "odometer": "Odometer (km)",
        "fixed_soh_min_end": "SoH (State of Health)",
    },
)

In [None]:
# Prepare the filtered dataset
filtered_data = (
    results_by_vin
    .dropna(subset=["fixed_soh_min_end", "odometer", "tesla_code", "chemistry"], how="any")
    .query("(tesla_code in @LFP_TESLA_CODES | tesla_code in @NCA_TESLA_CODES | tesla_code in @NMC_TESLA_CODES)")
    .query("fixed_soh_min_end.between(0.9, 1.05)")
    .sort_values("odometer")
)
filtered_data

### Plotting the SoH over odometer with logarithmic trendlines.
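
The trend used in this section is `soh = 1 + a * log1p(odometer / 1000)`, which is linear in its single parameter `a`. As a sketch of what the fit computes (not a replacement for the `curve_fit` call used in the cells of this section), the least-squares estimate can be written in closed form. The data here is synthetic and `fit_log_trend` is a hypothetical helper.

```python
import math


def fit_log_trend(odometers_km, sohs):
    """Least-squares estimate of a in soh = 1 + a * log1p(odometer / 1000)."""
    z = [math.log1p(x / 1000.0) for x in odometers_km]
    numerator = sum(zi * (y - 1.0) for zi, y in zip(z, sohs))
    denominator = sum(zi * zi for zi in z)
    if denominator == 0:
        raise ValueError("need at least one point with odometer > 0")
    return numerator / denominator


# Synthetic, noise-free points generated with a = -0.015 (illustrative value).
TRUE_A = -0.015
odometers = [5_000, 20_000, 60_000, 120_000, 200_000]
sohs = [1 + TRUE_A * math.log1p(x / 1000.0) for x in odometers]

a_hat = fit_log_trend(odometers, sohs)
# Same fit with an extra (0, 1) anchor point prepended.
a_with_origin = fit_log_trend([0] + odometers, [1.0] + sohs)
print(a_hat, a_with_origin)
```

Because the model equals 1 at an odometer of 0 for any `a`, an added (0, 1) point has a zero residual and leaves the estimate unchanged in this sketch.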

In [None]:
# Create base scatter plot
fig = px.scatter(
    filtered_data,
    x="odometer",
    y="fixed_soh_min_end",
    color="chemistry",
    color_continuous_scale="Rainbow",
    opacity=0.25,
    hover_data=["tesla_code"],
    title="SoH(State of Health) over odometer for 4975 Tesla vehicles",
    labels={
        "odometer": "Odometer (km)",
        "fixed_soh_min_end": "SoH (State of Health)",
    },
)

# Define logarithmic function that passes through (0,1)
def log_func(x, a):
    return 1 + a * np.log1p(x/1000)  # Using log1p for numerical stability

# Add trendlines for each chemistry type
for chemistry in filtered_data['chemistry'].unique():
    chemistry_data = filtered_data[filtered_data['chemistry'] == chemistry].copy()
    
    # Add the (0,1) point to force trendlines through it
    chemistry_data = pd.concat([
        pd.DataFrame({'odometer': [0], 'fixed_soh_min_end': [1.0]}),
        chemistry_data
    ]).sort_values('odometer')
    
    # Ensure data is clean for trendline calculation
    chemistry_data = chemistry_data.dropna(subset=['odometer', 'fixed_soh_min_end'])
    
    # Fixed maximum x value for all trendlines
    MAX_ODOMETER = 250000

    if len(chemistry_data) > 0:

        try:
            x_data = chemistry_data['odometer'].values
            y_data = chemistry_data['fixed_soh_min_end'].values
            
            # Fit logarithmic curve
            popt, _ = curve_fit(log_func, x_data, y_data, p0=[-0.01])
            
            # Generate smooth curve
            x_smooth = np.linspace(0, MAX_ODOMETER, 100)
            y_smooth = log_func(x_smooth, *popt)
            
            # Add to plot
            fig.add_trace({
                'x': x_smooth,
                'y': y_smooth,
                'name': f"{chemistry} (Log)",
                'mode': 'lines',
                'showlegend': True
            })
            
        except Exception as e:
            print(f"Could not add logarithmic trendline for {chemistry}: {str(e)}")

# Update layout to ensure y-axis starts at appropriate value
fig.update_layout(
    yaxis_range=[0.85, 1.05]
)

fig.show()

In [None]:
fig = px.scatter(
    filtered_data,
    x="odometer",
    y="fixed_soh_min_end",
    color="chemistry",
    color_discrete_sequence=px.colors.qualitative.Set1,
    opacity=0.25,
    title="SoH(State of Health) over odometer for 4975 Tesla vehicles",
    labels={
        "odometer": "Odometer (km)",
        "fixed_soh_min_end": "SoH (State of Health)",
    },
)

# Get color mapping from the created figure
color_map = {trace.name: trace.marker.color for trace in fig.data}

# Define logarithmic function that passes through (0,1)
def log_func(x, a):
    return 1 + a * np.log1p(x / 1000)

# Constants for the trendlines
LOG_THRESHOLD = 50000  # Transition point
MAX_ODOMETER = 250000
TRANSITION_WINDOW = 20000  # Points for a smooth transition

# Iterate for each chemistry type in the data
for chemistry in filtered_data["chemistry"].unique():
    chemistry_data = filtered_data[filtered_data["chemistry"] == chemistry].copy()
    
    # Add the (0, 1) point
    chemistry_data = pd.concat(
        [pd.DataFrame({"odometer": [0], "fixed_soh_min_end": [1.0]}), chemistry_data]
    ).sort_values("odometer")
    
    chemistry_data = chemistry_data.dropna(subset=["odometer", "fixed_soh_min_end"])
    
    if len(chemistry_data) > 0:
        try:
            # Fit the logarithmic curve
            x_data = chemistry_data["odometer"].values
            y_data = chemistry_data["fixed_soh_min_end"].values
            popt, _ = curve_fit(log_func, x_data, y_data, p0=[-0.01])
            
            # Generate the logarithmic part, stopping where the transition window starts
            # so that the combined x values stay monotonic.
            trans_start = LOG_THRESHOLD - TRANSITION_WINDOW / 2
            x_log = np.linspace(0, trans_start, 100)
            y_log = log_func(x_log, *popt)
            
            # Value and slope of the logarithmic curve at the threshold:
            # d/dx [1 + a * log1p(x / 1000)] = a / (x + 1000)
            y_threshold = log_func(LOG_THRESHOLD, *popt)
            slope_log = popt[0] / (LOG_THRESHOLD + 1000)
            
            # Fit linear regression to data beyond transition
            # NOTE: linear_slope is not used below yet; the linear part uses slope_log
            # so that it stays tangent to the logarithmic curve.
            high_km_data = chemistry_data[chemistry_data["odometer"] >= trans_start]
            if len(high_km_data) > 0:
                lr = LinearRegression()
                lr.fit(
                    high_km_data["odometer"].values.reshape(-1, 1),
                    high_km_data["fixed_soh_min_end"].values,
                )
                linear_slope = lr.coef_[0]
            else:
                linear_slope = slope_log  # Use the log slope if no data
            
            # Linear part, tangent to the log curve at the threshold
            x_linear = np.linspace(LOG_THRESHOLD, MAX_ODOMETER, 100)
            y_linear = y_threshold + slope_log * (x_linear - LOG_THRESHOLD)
            
            # Smooth transition between log and linear
            x_trans = np.linspace(trans_start, LOG_THRESHOLD, 50)
            weight = np.linspace(0, 1, len(x_trans))
            y_trans = (1 - weight) * log_func(x_trans, *popt) + weight * (
                y_threshold + slope_log * (x_trans - LOG_THRESHOLD)
            )
            
            # Combine all parts
            x_combined = np.concatenate([x_log, x_trans, x_linear])
            y_combined = np.concatenate([y_log, y_trans, y_linear])
            
            # Add trendline to the plot
            fig.add_trace({
                "x": x_combined,
                "y": y_combined,
                "name": f"{chemistry} (Trend)",
                "mode": "lines",
                "line": {"color": color_map[chemistry], "width": 2},
                "showlegend": True
            })
        
        except Exception as e:
            print(f"Could not add trendline for {chemistry}: {str(e)}")

# Update layout
fig.update_layout(
    yaxis_range=[0.85, 1.1],
    xaxis_range=[0, MAX_ODOMETER]
)

fig.show()


## Conclusion