# DATAEV-291 New tss processing method: comparison with the previous one
After identifying some issues in the Tesla results, we realized that some of them came from the tss processing step, which led to a new tss processing method.  
This notebook evaluates the impact of the new tss processing method on the final SoH estimation.

## Setup

### Imports

In [None]:
import warnings

import plotly.express as px

from core.pandas_utils import *
from core.stats_utils import *
from core.plt_utils import scatter_and_arrow_fig
from transform.fleet_info.main import fleet_info
from transform.raw_results.tesla_results import get_results
from transform.processed_tss.ProcessedTimeSeries import TeslaProcessedTimeSeries
from transform.raw_results.config import *

### Data extraction

In [None]:
legacy_results = pd.read_parquet("./data_cache/tesla_legacy_results.parquet")
# As of writing this notebook, the vehicle table in the DB is faulty, so we rely on a Tesla results backup to build a stand-in fleet info table.
fake_fleet_info = legacy_results.groupby("vin", observed=True, as_index=False)[["capacity", "tesla_code"]].first()
display(sanity_check(fake_fleet_info))

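# Note: this local definition shadows the `get_results` imported from transform.raw_results.tesla_results.
def get_results() -> DF: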
    logger.info("Processing raw tesla results.")
    return (
        TeslaProcessedTimeSeries("tesla", columns=TESLA_USE_COLS, filters=[("trimmed_in_charge", "==", True)])
        .groupby(["vin", "trimmed_in_charge_idx"], observed=True)
        .agg(
            energy_added_min=pd.NamedAgg("charge_energy_added", "min"),
            energy_added_end=pd.NamedAgg("charge_energy_added", "last"),
            soc_diff=pd.NamedAgg("soc", series_start_end_diff),
            inside_temp=pd.NamedAgg("inside_temp", "mean"),
            # capacity=pd.NamedAgg("capacity", "first"),
            odometer=pd.NamedAgg("odometer", "first"),
            # version=pd.NamedAgg("version", "first"),
            size=pd.NamedAgg("soc", "size"),
            # model=pd.NamedAgg("model", "first"),
            date=pd.NamedAgg("date", "first"),
            charging_power=pd.NamedAgg("charging_power", "median"),
            # tesla_code=pd.NamedAgg("tesla_code", "first"),
        )
        .merge(fake_fleet_info, "left", "vin")
        .eval("energy_added = energy_added_end - energy_added_min")
        .eval("soh = energy_added / (soc_diff / 100.0 * capacity)")
        # .query("soc_diff > 40 & soh.between(0.75, 1.05)")
        .eval("level_1 = soc_diff * (charging_power < @LEVEL_1_MAX_POWER) / 100")
        .eval("level_2 = soc_diff * (charging_power.between(@LEVEL_1_MAX_POWER, @LEVEL_2_MAX_POWER)) / 100")
        .eval("level_3 = soc_diff * (charging_power > @LEVEL_2_MAX_POWER) / 100")
        .eval("bottom_soh = soh.between(0.75, 0.9)")
        .eval("fixed_soh_min_end = soh.mask(tesla_code == 'MTY13', soh / 0.96)")
        .eval("fixed_soh_min_end = fixed_soh_min_end.mask(bottom_soh & tesla_code == 'MTY13', fixed_soh_min_end + 0.08)")
        .eval("soh = fixed_soh_min_end")
        .sort_values(["tesla_code", "vin", "date"])
    )

new_raw_results = get_results()
new_results = new_raw_results.query("soc_diff > 40 & soh.between(0.75, 1.05)")
sanity_check(new_raw_results)
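
To make the `eval` chain above easier to follow, here is a minimal scalar sketch of the SoH formula and of the MTY13 correction. The function names `raw_soh` and `corrected_soh` are hypothetical, the example figures are made up, and the kWh units are an assumption about `capacity` and `charge_energy_added`.

```python
def raw_soh(energy_added_kwh: float, soc_diff_pct: float, capacity_kwh: float) -> float:
    """Energy actually added divided by the energy the SoC change should represent."""
    return energy_added_kwh / (soc_diff_pct / 100.0 * capacity_kwh)


def corrected_soh(soh: float, tesla_code: str) -> float:
    """Scalar version of the MTY13 correction applied in `get_results`."""
    if tesla_code != "MTY13":
        return soh
    fixed = soh / 0.96
    # The 0.75-0.9 band is tested on the uncorrected value (bounds inclusive).
    if 0.75 <= soh <= 0.9:
        fixed += 0.08
    return fixed


# Hypothetical charge: 36 kWh added while the SoC rose by 50 points on a 75 kWh pack.
soh = raw_soh(36.0, 50.0, 75.0)
print(round(soh, 3))                           # 0.96
print(round(corrected_soh(soh, "MT336"), 3))   # 0.96, other codes are untouched
print(round(corrected_soh(0.84, "MTY13"), 3))  # 0.955 = 0.84 / 0.96 + 0.08
```

As in the pipeline, the `bottom_soh` band is evaluated before the division by 0.96, so an MTY13 charge with a raw SoH of 0.84 receives both adjustments.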

In [None]:
all_results = (
    pd.concat({"new_results":new_results, "legacy_results":legacy_results}, names=["res_type", "range_index"])
    .reset_index(level=0)
    .reset_index(drop=True)
    .eval("bottom_MT336 = tesla_code == 'MT336' & soh < 0.87")
)

## Comparison

### Visual comparison
While minor differences will be hard to see, let's make sure there are no wild differences.

In [None]:
px.scatter(
    all_results,
    x="odometer",
    y="soh",
    color="tesla_code",
    # symbol="res_type",
    facet_row="res_type",
    # symbol_map={"legacy_results": "cross", "new_results": "square"},
    opacity=0.3,
    height=700,
)

### Quantitative comparison

In [None]:
pd.concat({
    "new_results": evaluate_soh_estimations(new_results, ["soh"]),
    "legacy_results": evaluate_soh_estimations(legacy_results, ["soh"]),
})

### Coverage comparison
One of the issues with the previous tss processing method was that some charges were split into several segments with different `trimmed_in_charge_idx` values.  
Each segment then had a lower `soc_diff` than the real charge, so the `soc_diff > 40` query filtered out charges that should have been kept.  
Let's see if the final coverage is better.  
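
The effect described above can be illustrated with a small, self-contained sketch. It does not use the notebook's data or helpers: the `Charge` class, the `coverage` function and the VINs are hypothetical, and coverage is assumed here to mean the share of VINs with at least one charge passing the `soc_diff > 40` filter.

```python
from dataclasses import dataclass

MIN_SOC_DIFF = 40  # same threshold as the `soc_diff > 40` query in this notebook


@dataclass
class Charge:
    vin: str
    soc_start: float
    soc_end: float

    @property
    def soc_diff(self) -> float:
        return self.soc_end - self.soc_start


def coverage(all_vins: set[str], charges: list[Charge]) -> float:
    """Share of VINs with at least one charge passing the soc_diff filter."""
    kept_vins = {c.vin for c in charges if c.soc_diff > MIN_SOC_DIFF}
    return len(kept_vins & all_vins) / len(all_vins)


all_vins = {"VIN_A", "VIN_B"}  # made-up VINs

# VIN_B really charged from 20 % to 80 %, but the charge was split in two.
split_charges = [
    Charge("VIN_A", 10, 90),
    Charge("VIN_B", 20, 50),
    Charge("VIN_B", 50, 80),
]
# Same data with VIN_B's charge detected as a single segment.
merged_charges = [
    Charge("VIN_A", 10, 90),
    Charge("VIN_B", 20, 80),
]

print(coverage(all_vins, split_charges))   # 0.5
print(coverage(all_vins, merged_charges))  # 1.0
```

In this toy case each half of the split charge only spans 30 SoC points, so VIN_B disappears from the results even though its real charge spans 60 points.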

In [None]:
# The vehicle RDB table is currently faulty, so to measure the coverage difference
# with the previous method we take all the unique VINs from the time series.
unique_vins = TeslaProcessedTimeSeries(columns=["vin"])["vin"].pipe(uniques_as_series)

In [None]:
display(unique_vins.isin(new_results["vin"].pipe(uniques_as_series)).value_counts(dropna=False, normalize=True))
display(unique_vins.isin(legacy_results["vin"].pipe(uniques_as_series)).value_counts(dropna=False, normalize=True))

The final coverage is actually worse...

## Conclusion
It seems that the new tss processing is on par with, if not worse than, the legacy one. That's okay, it's not like I spent about a month on it...