# Tesla-specific time series processing
The goal of this notebook is to demonstrate the time series processing steps that are specific to Tesla.  
At the time of writing, this covers only the segmentation/masking of charging and discharging periods and the indexing of charging periods.  

## Setup

### Imports

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

from pandas.api.types import CategoricalDtype

from core.pandas_utils import *
from core.caching_utils import cache_result
from transform.processed_tss.config import IN_CHARGE_CHARGING_STATUS_VALS, IN_DISCHARGE_CHARGING_STATUS_VALS
from transform.processed_tss.ProcessedTimeSeries import TeslaProcessedTimeSeries
from transform.raw_results.tesla_results import get_results

In [None]:
! mkdir -p data_cache

### Data extraction

We load the already processed time series even though this notebook is supposed to explain the processing step.  
This is because we only implement the processing steps specific to Tesla and not the rest (renaming, setting types, etc.).

In [None]:
# Subset of columns that we need to segment and index, the rest won't be loaded to minimize memory usage.
# Add columns names that you might need.
COLUMNS = [ 
    "soc",
    "charging_status",
    "vin",
    "charge_limit_soc",
    "charge_energy_added",
    "sec_time_diff",
    "date",
    "time_diff",
    "odometer",
]
EDGE_CASES_VINS = [
    "LRW3E7FA4MC314534", 
    "5YJ3E7EA6LF558840", 
    "5YJ3E7EB7KF474436",
    "5YJ3E7EB1KF334219", # SoC oscillates at 60%
] 

@cache_result("data_cache/tesla_sub_tss.parquet", on="local_storage")
def get_subset_tss() -> DF:
    tss = TeslaProcessedTimeSeries(use_cols=COLUMNS)
    # Despite the name, these are the 100 VINs with the most rows, not a random sample.
    random_vins = tss["vin"].value_counts(sort=True, ascending=False).index[:100]
    return tss.query("(vin in @random_vins) | (vin in @EDGE_CASES_VINS)")

# tss = get_subset_tss() # To get a subset... and prevent my laptop from crashing :)
tss = TeslaProcessedTimeSeries(use_cols=COLUMNS) # To get all the time series 

## Segmentation and indexing
The following cells are the result of many observations and a lot of back-and-forth reasoning.  
Unfortunately, I don't have enough time to show all of the reasoning steps from the naive implementation to the one I have come up with.  

### Charging status interpretation

We base most of our masking/segmentation on the `charging_status` variable.

In [None]:
tss["charging_status"].value_counts(normalize=True, dropna=False)

After observation, the only values that I found truly insightful/reliable are `charging` and `disconnected`.  
Here is a short recap of what I have observed so far:
- `disconnected`: The battery is definitely not being charged.
- `charging`: About 99% of the time (not a measured statistic), the battery really is charging and the SoC is increasing. In the remaining ~1%, the SoC decreases even though `charging_status == 'charging'`.
- `complete`: The battery has reached the desired SoC defined by `charge_limit_soc` (not always 100%).  
    It seems that the battery no longer charges in this status; the SoC then falls to ~3% below `charge_limit_soc` and charging starts again.  
    This causes charging oscillation patterns (very annoying).
- `stopped`: For some reason, the charging stops, but this does not mean that it will not start again afterwards, so this cannot be considered a discharging period.  
- `nopower`: Similar to `stopped`; tends to appear at the beginning or end of charges.
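
The "99% of the time" figure above is a guess, so it may be worth measuring. Here is a minimal sketch of how one might do that, assuming a frame with the `vin`, `charging_status` and `soc` columns whose rows are sorted by date within each vehicle. The `toy` data and the `decreasing_share` name are invented for illustration and do not come from the real fleet.

```python
import pandas as pd

# Hypothetical toy data standing in for tss (all values are invented).
toy = pd.DataFrame({
    "vin": ["A"] * 5 + ["B"] * 4,
    "charging_status": [
        "charging", "charging", "charging", "complete", "charging",
        "disconnected", "charging", "charging", "charging",
    ],
    "soc": [50, 52, 51, 51, 50, 80, 80, 82, 84],
})

# SoC variation between consecutive rows of the same vehicle.
soc_diff = toy.groupby("vin")["soc"].diff()
is_charging = toy["charging_status"].eq("charging")

# Share of `charging` rows (with a known previous row) where the SoC goes down.
decreasing_share = soc_diff[is_charging].dropna().lt(0).mean()
print(f"{decreasing_share:.1%}")
```

On the real `tss`, the same three lines applied to `tss` instead of `toy` would give a factual replacement for the 99%/1% split.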

### Tesla-specific considerations to keep in mind
We monitor a lot of Teslas, so we are exposed to many edge cases, and the negative feedback from our customers tends to be based on those edge cases.   
So we need to make sure that we handle as many of them as possible.  
Here are the considerations that went into the implementation:
- A lot of charges contain `stopped` or `nopower` values. If we were to naively increment `in_charge_idx` every time the status goes from `charging` to anything else, we would end up with many small charges.  
    This would in turn increase the noise in the SoH estimations, as the per-charge values would be smaller.  
- We have holes in the time series because we did not monitor the fleet on some days ("Do data science, it'll be fun they say..." >_<).  
    Some vehicles were charging before and after these missing data periods, so we need to index them with different values to identify them as different charges.
- `charge_energy_added` is a cumulative, forward-filled variable.  
    It also decreases when the charge is `stopped` or there is `nopower`.  
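
To make the first consideration concrete, here is a small sketch comparing a naive charge count with a more tolerant one on an invented status sequence. It assumes that only `charging` and `disconnected` are trusted and that any other status is treated as unknown; the `count_charges` helper and the float encoding (1.0/0.0/NaN) are illustrative choices, not the notebook's implementation.

```python
import pandas as pd

# Invented status sequence: one charge interrupted by `stopped`, then a second charge.
status = pd.Series([
    "charging", "charging", "stopped", "charging", "charging", "disconnected", "charging",
])

def count_charges(in_charge: pd.Series) -> int:
    """Count the False -> True transitions (a True first row counts as a start)."""
    starts = in_charge & ~in_charge.shift(fill_value=False)
    return int(starts.sum())

# Naive: a new charge starts every time the status comes back to `charging`.
naive_count = count_charges(status.eq("charging"))

# Tolerant: unknown rows (NaN) surrounded by the same certainty inherit that certainty.
certain = status.map({"charging": 1.0, "disconnected": 0.0})  # anything else -> NaN
before, after = certain.ffill(), certain.bfill()
filled = certain.mask(certain.isna() & before.eq(after), before)
tolerant_count = count_charges(filled.eq(1.0))

print(naive_count, tolerant_count)
```

The naive count splits the first charge in two because of the single `stopped` row, while the tolerant count keeps it whole.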
    

In [None]:
MIN_POWER_LOSS = -0.0005
MAX_CHARGE_TD = TD(days=1)

def compute_charge_n_discharge_masks(tss:DF) -> DF:
    # We use a nullable boolean Series to represent the rows where:
    # - We are sure that the vehicle is in charge: True.
    # - We are sure that the vehicle is not in charge: False.
    # - We are not sure of anything: NaN.
    tss["nan_charging"] = (
        Series(pd.NA, index=tss.index, dtype="boolean")
        .mask(tss["charging_status"].isin(IN_CHARGE_CHARGING_STATUS_VALS), True)
        .mask(tss["charging_status"].isin(IN_DISCHARGE_CHARGING_STATUS_VALS), False)
    )
    # If a period of uncertainty (NaN) is surrounded by two periods of the same certainty (True-NaN-True or False-NaN-False),
    # we fill it with the value of those certainties.
    # However, some edge cases have uncertainty periods that last multiple days (I can't find the VIN but I'm sure you can ;-) ).
    # Interestingly enough, the charge_energy_added variable does not get forward filled that far and gets reset to zero.
    # This would create outliers in our charge SoH estimation, as we estimate energy_gained as the diff between the last (0) and first values of charge_energy_added.
    # So we set a maximum uncertainty period duration beyond which we do not fill.
    tss["nan_date"] = tss["date"].mask(tss["nan_charging"].isna())
    tss[["ffill_charging", "ffill_date"]] = tss.groupby("vin", observed=True)[["nan_charging", "nan_date"]].ffill()
    tss[["bfill_charging", "bfill_date"]] = tss.groupby("vin", observed=True)[["nan_charging", "nan_date"]].bfill()
    nan_period_duration:Series = tss.eval("bfill_date - ffill_date")
    fill_unknown_period = tss.eval("ffill_charging.eq(bfill_charging) & @nan_period_duration.le(@MAX_CHARGE_TD)")
    tss["nan_charging"] = tss["nan_charging"].mask(fill_unknown_period, tss["ffill_charging"])
    # As mentioned before, the SoC oscillates within [charge_limit_soc - ~3%, charge_limit_soc], so we set these periods to NaN as well.
    tss["nan_charging"] = tss["nan_charging"].mask(tss["soc"] >= (tss["charge_limit_soc"] - 3))
    # Then we separate the Series into two more explicit columns.
    tss["in_charge"] = tss.eval("nan_charging.notna() & nan_charging")
    tss["in_discharge"] = tss.eval("nan_charging.notna() & ~nan_charging")
    return tss

def compute_charge_idx(tss:DF) -> DF:
    tss_grp = tss.groupby("vin", observed=False)
    tss["charge_energy_added"] = tss_grp["charge_energy_added"].ffill()
    tss["energy_added_over_time"] = tss_grp['charge_energy_added'].diff().div(tss["sec_time_diff"].values)
    # charge_energy_added is cumulative and forward filled.
    # We check whether charge_energy_added decreases too fast, so that two charging periods before and after a gap are correctly identified as two separate charging periods.
    new_charge_mask = tss["energy_added_over_time"].lt(MIN_POWER_LOSS, fill_value=0) 
    # For the same reason, we ensure that there are no gaps bigger than MAX_CHARGE_TD between two rows of the same charging period.
    new_charge_mask |= tss["time_diff"].gt(MAX_CHARGE_TD) 
    # And of course we also start a new charge when the status switches from not in charge to in charge.
    new_charge_mask |= (~tss_grp["in_charge"].shift() & tss["in_charge"]) 
    tss["in_charge_idx"] = new_charge_mask.groupby(tss["vin"], observed=True).cumsum()
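    # Renumber the indices of each VIN as consecutive integers starting at 0.
    tss["in_charge_idx"] = tss.groupby("vin", observed=True)["in_charge_idx"].transform(lambda idx: pd.factorize(idx)[0])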

    return tss

# The following functions are not specific to Tesla
def compute_status_col(tss:DF) -> DF:
    tss_grp = tss.groupby("vin", observed=True)
    tss["status"] = Series(pd.NA, index=tss.index, dtype=CategoricalDtype(["in_charge", "moving", "unknown", "idle_discharging"]))
    tss["status"] = tss["status"].mask(tss["in_charge"], "in_charge")
    tss["status"] = tss["status"].mask(
        tss["in_discharge"], 
        np.where(tss_grp["odometer"].diff() > 0, "moving", "idle_discharging")
    )
    return tss

def trim_leading_n_trailing_soc_off_masks(tss:DF, masks:list[str]) -> DF:
    for mask in masks:
        tss["naned_soc"] = tss["soc"].where(tss[mask])
        soc_grp = tss.groupby(["vin", mask + "_idx"], observed=True)["naned_soc"]
        # The leading SoC is the first value of the period, the trailing SoC is the last one.
        leading_soc = soc_grp.transform("first")
        trailing_soc = soc_grp.transform("last")
        tss["leading_soc"] = leading_soc
        tss["trailing_soc"] = trailing_soc
        tss[f"trimmed_{mask}"] = tss[mask] & (tss["soc"] != trailing_soc) & (tss["soc"] != leading_soc)
    tss = tss.drop(columns="naned_soc")
    return tss
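
The bounded filling done in `compute_charge_n_discharge_masks` can be checked on a toy series. The sketch below assumes a single vehicle (so no `groupby`), uses invented dates, and uses `MAX_GAP` as a stand-in for `MAX_CHARGE_TD`; it only mirrors the idea, not the exact column names used above.

```python
import pandas as pd

MAX_GAP = pd.Timedelta(days=1)  # plays the role of MAX_CHARGE_TD

toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00",
        "2024-01-02 00:00", "2024-01-03 00:00", "2024-01-04 00:00",
    ]),
    # True = surely charging, <NA> = not sure of anything.
    "state": pd.array([True, None, True, None, None, True], dtype="boolean"),
})

# Dates of the closest certain rows before and after each row.
known_date = toy["date"].where(toy["state"].notna())
gap = known_date.bfill() - known_date.ffill()

# Closest certain state before and after each row.
before, after = toy["state"].ffill(), toy["state"].bfill()

# Fill only the unknown rows whose neighbours agree and are close enough in time.
fill = (before.eq(after) & toy["state"].isna() & gap.le(MAX_GAP)).fillna(False).astype(bool)
toy["filled"] = toy["state"].mask(fill, before)
print(toy)
```

The two-hour hole is filled with `True`, whereas the hole of almost three days stays unknown, which is the behaviour we want for charges separated by missing data.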

In [None]:
tss = (
    tss
    .pipe(compute_charge_n_discharge_masks)
    .pipe(compute_charge_idx)
    .pipe(compute_status_col)
    .pipe(trim_leading_n_trailing_soc_off_masks, ["in_charge"])
)

## Evaluation
Since we cannot visually check our results on all the vehicles, we look at the ones with the most extreme charging stats.  
The stats can be charge duration, number of charges per vehicle, etc.  
For each stat, we check the first and last N vehicles and conclude one of the following:
- The statistic is not representative of the truth, and it's back to the drawing board.
- The statistic is representative of the truth.
- The extreme is an edge case that is not representative of the truth, but it affects a negligible part of the fleet.

> This reasoning is based on the assumption that if the results on the extreme cases are correct, the results for the vehicles in between them are correct too.  
> Ideally, we would also use the evaluation of the SoH estimation of the charges in the raw_result step of the pipeline to evaluate the processing of the time series...

### Single-point charges
We check the charges with only one point/row, as they are sometimes due to errors in the `in_charge_idx` computation.

In [None]:
tss["in_charge_idx_size"] = tss.groupby(["vin", "in_charge_idx"], observed=True).transform("size")
tss["one_point_charge"] = tss.eval("status == 'in_charge' & in_charge_idx_size == 1")

In [None]:
single_points_charges_counts = (
    tss
    .groupby("vin", observed=True, as_index=False)
    ["one_point_charge"]
    .sum()
    .sort_values(by="one_point_charge")
    .reset_index(drop=True)
)
single_points_charges_counts

In [None]:
top_most_single_point_charges_vins = single_points_charges_counts.iloc[-4:, 0]
px.scatter(
    (
        tss
        .query("vin in @top_most_single_point_charges_vins")
        .eval("charging_status = charging_status.astype('string').fillna('unknown')")
        .eval("status = status.astype('string').fillna('unknown')")
    ),
    x="date",
    y="soc",
    facet_row="vin",
    color="one_point_charge",
    symbol="status",
    hover_data=["in_charge_idx", "charging_status", "in_charge", "in_discharge"],
    height=750,
).update_yaxes(matches=None)

In [None]:
px.histogram(
    single_points_charges_counts,
    x="one_point_charge",
    log_y=True,
    hover_data=["vin"],
)

We can see that some of these charges are actual single point/row charges, whereas others are not single row/point charges but charges with the SoC decreasing over time.  
This causes `in_charge_idx` to increase at every row.  
Looking at the histogram, we can see that this affects only a very small part of the fleet.  
While there certainly is a way to change the computation of `in_charge`/`in_charge_idx` so that SoC-decreasing charge periods are not marked as in charge,  
it would most likely interfere with the indexing of charges that are separated by data gaps/holes.  
Since the single point/row charges won't be used in the SoH estimation, we can leave them for now.  
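
One plausible mechanism for the index increasing at every row, assuming `charge_energy_added` falls along with the SoC, is the `MIN_POWER_LOSS` check in `compute_charge_idx`. The sketch below reproduces only that check on invented values for a single vehicle; the toy numbers are made up and the gap and status checks are left out.

```python
import pandas as pd

MIN_POWER_LOSS = -0.0005  # same threshold as in compute_charge_idx

# Invented rows: the energy counter drops by 1 unit every 60 seconds.
toy = pd.DataFrame({
    "charge_energy_added": [10.0, 9.0, 8.0, 7.0],
    "sec_time_diff": [60.0, 60.0, 60.0, 60.0],
})

# Rate of change of the energy counter between consecutive rows.
rate = toy["charge_energy_added"].diff().div(toy["sec_time_diff"])
# A drop faster than the threshold is read as the start of a new charge.
new_charge = rate.fillna(0).lt(MIN_POWER_LOSS)
toy["in_charge_idx"] = new_charge.cumsum()

rows_per_charge = toy.groupby("in_charge_idx").size()
print(rows_per_charge)
```

Every row after the first one is flagged as a new charge, so each index ends up with a single row.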

### Charges duration
We check the charges with the longest duration, as they are sometimes due to errors in the `in_charge`/`in_charge_idx` computation.

In [None]:
charges_duration:DF = (
    tss
    .query("status == 'in_charge'")
    .groupby(["vin", "in_charge_idx"], observed=True)
    .agg(
        start_date=pd.NamedAgg("date", "first"),
        end_date=pd.NamedAgg("date", "last")
    )
    .eval("duration = end_date - start_date")
    .sort_values(by="duration")
)
charges_duration

In [None]:
longest_charges_vins = charges_duration.iloc[-10:].index.get_level_values(0)
px.scatter(
    tss.query("vin in @longest_charges_vins"),
    x="date",
    y="soc",
    facet_row="vin",
    color="status",
    hover_data=["charging_status", "in_charge_idx"],
    # facet_row_spacing=0.04,
    height=1000,
).update_yaxes(matches=None)

In [None]:
px.histogram(
    charges_duration.eval("hours_duration = duration.dt.total_seconds().div(3600)"),
    x="hours_duration",
    log_y=True,
)

We can see that most long-duration charges are representative of the truth, with the exception of `LRW3E7FA7MC233155` from 2024-12-25 to 2024-12-29.  
This edge case is due to the fact that most of this charge has `charging_status == 'stopped'`.  
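
To spot similar cases without plotting every vehicle, one could compute the share of `stopped` rows in each charge. This is only a sketch on invented data, assuming the `vin`, `in_charge_idx` and `charging_status` columns; `stopped_share` is a name introduced here for illustration.

```python
import pandas as pd

# Invented charges: the first one is mostly `stopped`, the second one is not.
toy = pd.DataFrame({
    "vin": ["A"] * 6,
    "in_charge_idx": [0, 0, 0, 0, 1, 1],
    "charging_status": ["charging", "stopped", "stopped", "stopped", "charging", "charging"],
})

# Fraction of rows with the `stopped` status within each (vin, charge) pair.
stopped_share = (
    toy
    .assign(is_stopped=toy["charging_status"].eq("stopped"))
    .groupby(["vin", "in_charge_idx"])["is_stopped"]
    .mean()
    .sort_values(ascending=False)
)
print(stopped_share)
```

Charges at the top of this ranking would be the first candidates to inspect alongside the longest ones.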

### Number of charges per vin
We will make sure that the number of charges is representative of the truth rather than an error in the computation of `in_charge`/`in_charge_idx`.  

In [None]:
charge_counts = (
    tss
    .query("in_charge")
    .groupby("vin", observed=True, as_index=False)
    .agg(charge_counts=pd.NamedAgg("in_charge_idx", "nunique"))
    .sort_values(by="charge_counts")
)
charge_counts

In [None]:
min_charge_counts = charge_counts.iloc[:5, 0]
px.scatter(
    (
        tss.query("vin in @min_charge_counts")
        .eval("charging_status = charging_status.astype('string').fillna('unknown')")
        .eval("status = status.astype('string').fillna('unknown')")
    ),
    x="date",
    y="soc",
    facet_row="vin",
    color="status",
    # symbol="in_charge_idx",
    hover_data=["in_charge_idx", "charging_status", "in_charge", "in_discharge"],
    height=650,
).update_yaxes(matches=None)

We see that the vehicles with the fewest charges are in fact entire time series without any charging points/rows.

In [None]:
max_charge_counts = charge_counts.iloc[-5:, 0]
px.scatter(
    (
        tss
        .query("vin in @max_charge_counts")
        .eval("charging_status = charging_status.astype('string').fillna('unknown')")
        .eval("status = status.astype('string').fillna('unknown')")
    ),
    x="date",
    y="soc",
    facet_row="vin",
    color="status",
    # symbol="in_charge_idx",
    hover_data=["in_charge_idx", "charging_status", "in_charge", "in_discharge"],
    height=650,
).update_yaxes(matches=None)

To be honest I haven't manually counted all the charges of all these vehicles...

## Conclusion
Tesla time series are full of edge cases, but it is not impossible to process them. Ideally, we would use ML/DL models to segment and index the time series to save time.