# Ituran Second Response time series EDA
Ituran has sent us a second response with more data on more vehicles.  
In this notebook we preprocess that data into a "normalized" time series and perform some simple descriptive analysis.  
This corresponds to the `processed_tss` step in our pipeline.  

## Setup

In [None]:
! mkdir -p data_cache

### Import

In [None]:
import plotly.express as px

from core.pandas_utils import *
from core.plt_utils import plt_3d_df
from transform.processed_tss.ProcessedTimeSeries import ProcessedTimeSeries

### Data Extraction

Let's open the data and see what we have...

In [None]:
raw_tss = (
    pd.read_csv(
        "./data_cache/ituran_tss_response.csv",
        parse_dates=["signal_time", "year_of_manufacture"],
        usecols = [
            "dataran_id",
            "signal_time",
            "vehicle_make",
            "vehicle_model",
            "signal_name",
            "signal_value",
            "year_of_manufacture",
        ],
        dtype={
            "dataran_id": "category",
            "vehicle_make": "category",
            "vehicle_model": "category",
            "signal_name": "category",
        }
    )
)
raw_tss

We can see that the data is in long format, with one row per vehicle, timestamp and signal.  
We will need to pivot the data to wide format, where each row is a vehicle at a given timestamp and each column is a signal.  
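
To make the long-to-wide step concrete, here is a minimal sketch on toy data. The values and the shortened signal names are made up for illustration; only the column names mirror the ones read above.

```python
import pandas as pd

# Toy long-format data: one row per (vehicle, timestamp, signal).
long_df = pd.DataFrame(
    {
        "dataran_id": ["a", "a", "a", "b"],
        "signal_time": pd.to_datetime(
            ["2024-01-01 10:00", "2024-01-01 10:00", "2024-01-01 10:05", "2024-01-01 10:00"]
        ),
        "signal_name": ["soc", "charging_voltage", "soc", "soc"],
        "signal_value": [80.0, 400.0, 81.0, 55.0],
    }
)

# Wide format: one row per (vehicle, timestamp), one column per signal.
wide_df = (
    long_df
    .pivot(index=["dataran_id", "signal_time"], columns="signal_name", values="signal_value")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "signal_name" axis label
print(wide_df)
```

Signals that were not reported at a given timestamp simply end up as NaN in the wide table. Note that `pivot` raises on duplicated (index, column) pairs, which is why duplicates have to be dropped before pivoting.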

First let's see what signals we have...

In [None]:
raw_tss["signal_name"].value_counts(sort=True, ascending=False, dropna=False, normalize=True)

Unfortunately, we have neither an odometer nor a temperature signal.  

Let's also check the number of vehicles we have:

In [None]:
raw_tss["dataran_id"].nunique()

## Processing

Let's define the parameters for the preprocessing steps.  
This corresponds to the variables in the `transform.processed_tss.config` module.  

In [None]:
INDEX_COLS = [
    "year_of_manufacture",
    'vehicle_make',
    'vehicle_model',
    'dataran_id',
    'signal_time',
]

COLUMNS_NAMES_MAP = {
    "Electric Data - Battery Status Of Charge - 2334": "soc",
    "Electric Data - Charging AC Mode - 2227": "charging_ac_mode",
    "Electric Data - Charging Current - 232": "charging_current",
    "Electric Data - Charging DC Mode - 9629": "charging_dc_mode",
    "Electric Data - Charging Voltage - 7C": "charging_voltage",
    "Electric Data - Ready Switch Open - 2015": "switch_open",
    "Electric Data - Time Remaining for Charge - 2291": "time_remaining_for_charge",
    "Electric Data - Vehicle Range Of Battery - 2229": "estimated_range",
    "signal_time": "date",
    "dataran_id": "vehicle_id",
    "vehicle_make": "make",
}

DTYPES = {
    "date": "datetime64[ns]",
    "vehicle_id": "string",
    "switch_open": "bool",
    "charging_ac_mode": "bool",
    "charging_dc_mode": "bool",
    "time_remaining_for_charge": "int",
    "soc": "float32",
    "charging_current": "float32",
    "charging_voltage": "float32",
    "estimated_range": "float32",
}

The data format is quite different from the responses that we get from other data providers.  
The three main differences are:
- The data is in long format, with one row per vehicle, timestamp and signal.
- The charging current and voltage contain the current and voltage of charging AND discharging.
- The frequency of the data is much higher than the data that we get from other data providers.
   => this means that the noise in the SoC has a bigger impact on charge/discharge detection.

To address these differences, we will need to modify the time series processing steps.  
Thanks to the (amazing) time series processing refactoring, this is now quite easy.  
We simply need to inherit from the `ProcessedTimeSeries` class and override the methods that we need to modify.  

In [None]:
class HighFreqProcecssedTimeSeries(ProcessedTimeSeries):

    def run(self) -> DF:
        return (
            raw_tss
            .drop_duplicates(INDEX_COLS + ["signal_name"], keep="first")
            .pivot(index=INDEX_COLS, columns="signal_name", values="signal_value")
            .reset_index()
            .rename(columns=COLUMNS_NAMES_MAP, errors="ignore")
            .astype(DTYPES, errors="ignore")
            .sort_values(by=[self.id_col, "date"], ascending=True)
            .pipe(self.compute_date_vars)
            .pipe(self.compute_charge_n_discharge_masks)
            .pipe(self.compute_current_vars)
            .pipe(self.compute_idx_from_masks, masks=["in_charge", "in_discharge"])
            .pipe(self.trim_leading_n_trailing_soc_off_masks, masks=["in_charge", "in_discharge"])
            .pipe(self.compute_idx_from_masks, masks=["trimmed_in_charge", "trimmed_in_discharge"])
            .pipe(self.ffill_vars, vars=["estimated_range", "charging_voltage", "charging_current", "time_remaining_for_charge","soc"])
        )

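    def compute_charge_n_discharge_masks(self, tss:DF) -> DF:
        tss_grp = tss.groupby(self.id_col)
        # Forward fill the SoC so that rows carrying only other signals still have a value to diff against.
        tss["soc_ffilled"] = tss_grp["soc"].ffill()
        tss["soc_diff"] = tss_grp["soc_ffilled"].diff()
        # Keep only the sign of the variation (+1/-1); 0/0 yields NaN, so flat SoC rows get filled from their neighbors below.
        tss["soc_diff"] /= tss["soc_diff"].abs()
        soc_diff_ffilled = tss_grp["soc_diff"].ffill()
        soc_diff_bfilled = tss_grp["soc_diff"].bfill()
        # A row is in charge (resp. discharge) only if both the last known and the next known SoC changes are positive (resp. negative).
        tss["in_charge"] = soc_diff_ffilled.gt(0, fill_value=False) & soc_diff_bfilled.gt(0, fill_value=False)
        tss["in_discharge"] = soc_diff_ffilled.lt(0, fill_value=False) & soc_diff_bfilled.lt(0, fill_value=False)
        return tss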

    def compute_current_vars(self, tss:DF) -> DF:
        tss["power"] = tss.eval("charging_current * charging_voltage")
        tss["charging_power"] = tss["power"].mask(~tss["in_charge"], pd.NA)
        tss["power"] = tss["power"].mask(tss["in_charge"], pd.NA)
        tss = self.compute_cum_var(tss, var_col="charging_power", cum_var_col="cum_energy_added")
        tss = self.compute_cum_var(tss, var_col="power", cum_var_col="cum_energy_spent")
        return tss

    # This is used for the charging_points SoH analysis; it might not be needed for the other analyses.
    def ffill_vars(self, tss:DF, vars:list[str]) -> DF:
        tss_grp = tss.groupby(self.id_col)
        self.logger.info("ffilling vars")
        for var in vars:
            tss[f"ffilled_{var}"] = tss_grp[var].ffill()
        return tss
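
To see what the charge/discharge masks do, here is a standalone sketch of the same sign-of-SoC-difference logic on a toy series. The function name `charge_discharge_masks` and the SoC values are hypothetical; this is a simplified illustration outside of the class, not the pipeline code itself.

```python
import pandas as pd


def charge_discharge_masks(tss: pd.DataFrame, id_col: str = "vehicle_id") -> pd.DataFrame:
    """Flag rows as charging or discharging from the sign of the SoC variation."""
    tss = tss.copy()
    tss["soc_ffilled"] = tss.groupby(id_col)["soc"].ffill()
    diff = tss.groupby(id_col)["soc_ffilled"].diff()
    # Sign of the variation: +1, -1, or NaN when the SoC did not move (0/0).
    tss["soc_diff"] = diff / diff.abs()
    sign_ffilled = tss.groupby(id_col)["soc_diff"].ffill()
    sign_bfilled = tss.groupby(id_col)["soc_diff"].bfill()
    # Comparisons against NaN are False, so undetermined rows are in neither mask.
    tss["in_charge"] = sign_ffilled.gt(0) & sign_bfilled.gt(0)
    tss["in_discharge"] = sign_ffilled.lt(0) & sign_bfilled.lt(0)
    return tss


toy = pd.DataFrame(
    {
        "vehicle_id": ["a"] * 7,
        "soc": [50.0, 51.0, 51.0, 52.0, 52.0, 51.0, 50.0],
    }
)
masks = charge_discharge_masks(toy)
print(masks[["soc", "soc_diff", "in_charge", "in_discharge"]])
```

The flat row between the rise and the fall (the second 52) ends up in neither mask, since the last known change is positive and the next known one is negative. The flat row in the middle of the rise stays in charge, which is what keeps a plateau from splitting one charge into two.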

In [None]:
tss_without = HighFreqProcecssedTimeSeries(make="ituran", id_col="vehicle_id", force_update=True, log_level="ERROR")
tss_without.to_parquet("./data_cache/ituran_second_response_tss.parquet")
display(sanity_check(tss_without))
display(tss_without.memory_usage(deep=True).div(1024**2).sum())

## Processed Time series descriptive analysis

Model distribution:

In [None]:
tss_without["vehicle_model"] = tss_without["vehicle_model"].replace("ev 6", "ev6")

In [None]:
model_distribution = (
    tss_without
    .groupby("vehicle_model", observed=True, as_index=False)
    .agg(count=pd.NamedAgg(column="vehicle_id", aggfunc="nunique"))
)
display(model_distribution)
px.pie(model_distribution, values="count", names="vehicle_model", title="Model distribution")

We can see that the data mostly contains ev6 models.

Playing around with the data, I noticed that the charging current and voltage are not always present.  
Let's see the count of non-NaN (not not a number (what a mouthful...)) values for each model.

In [None]:
display("Count of not NaN values for each model:")
display(
    tss_without
    .groupby("vehicle_model", observed=True)
    [["charging_voltage", "charging_current"]]
    .count()
)

Unfortunately, we can see that we only have current and voltage values for geometry c models,  
which means that we have current and voltage values for only 17.4% of the fleet (-_-).
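
As a rough sketch of how such a fleet share could be computed, assuming it is the ratio of vehicles with at least one complete current/voltage reading to the total number of vehicles. The helper `share_of_vehicles_with_signals` and the toy data are hypothetical.

```python
import pandas as pd


def share_of_vehicles_with_signals(
    tss: pd.DataFrame, signal_cols: list[str], id_col: str = "vehicle_id"
) -> float:
    """Share of vehicles having at least one row where all `signal_cols` are present."""
    has_all_signals = tss[signal_cols].notna().all(axis=1)
    n_vehicles_with_signals = tss.loc[has_all_signals, id_col].nunique()
    return n_vehicles_with_signals / tss[id_col].nunique()


toy = pd.DataFrame(
    {
        "vehicle_id": ["a", "a", "b", "c", "d"],
        "charging_voltage": [400.0, None, None, None, None],
        "charging_current": [10.0, None, None, 5.0, None],
    }
)
share = share_of_vehicles_with_signals(toy, ["charging_voltage", "charging_current"])
print(f"{share:.1%}")
```

Only vehicle `a` has a row with both signals, so the toy share is one vehicle out of four.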

Number of days of data per vehicle:

In [None]:
nb_days_of_data_per_vehicle = (
    tss_without
    .groupby("vehicle_id")["date"]
    .agg(series_start_end_diff)
    .dt.days
    .rename("nb_days_of_data")
)
display(nb_days_of_data_per_vehicle.describe())

## Processed time series visualization

Let's select a subset of the fleet to plot.  
We will plot 4 geometry c vehicles, as this is the only model that has current and voltage values.

In [None]:
def compute_first_charge_soc(tss:DF) -> DF:
    tss["first_charge_soc"] = (
        tss
        .groupby(["vehicle_id", "trimmed_in_charge_idx"])
        ["soc"]
        .transform("first")
    )
    tss["first_charge_soc"] = tss["first_charge_soc"].where(tss["trimmed_in_charge"], pd.NA)
    return tss

tss_without = compute_first_charge_soc(tss_without)
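
The same `groupby(...).transform("first")` pattern on toy data, to show what ends up in `first_charge_soc`. The values are made up, and the assumption here is that `trimmed_in_charge_idx` takes a distinct value for each consecutive charging period, which is what the grouping above relies on.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "vehicle_id": ["a"] * 6,
        "trimmed_in_charge": [False, True, True, False, True, True],
        # Assumed: one distinct index per consecutive period.
        "trimmed_in_charge_idx": [0, 1, 1, 2, 3, 3],
        "soc": [40.0, 41.0, 45.0, 44.0, 30.0, 35.0],
    }
)

# Broadcast the first SoC of each period to all the rows of that period...
first_soc = (
    toy
    .groupby(["vehicle_id", "trimmed_in_charge_idx"])["soc"]
    .transform("first")
)
# ...and keep it only on the rows that are actually in charge.
toy["first_charge_soc"] = first_soc.where(toy["trimmed_in_charge"])
print(toy)
```

Each charging row carries the SoC at which its charge started (41 for the first charge, 30 for the second), while rows outside of a charge are left empty.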

In [None]:
vehicle_ids_to_plot = tss_without.query("vehicle_model == 'geometry c'")["vehicle_id"].unique()[:4]
tss_to_plot = tss_without.query("vehicle_id in @vehicle_ids_to_plot")

In [None]:
px.scatter(
    tss_to_plot,
    x="date",
    y="soc",
    symbol="soc_diff",
    color="in_charge",
    facet_row="vehicle_id",
    title="SoC over time",
)

In [None]:
px.scatter(
    tss_to_plot,
    x="date",
    y="first_charge_soc",
    facet_row="vehicle_id",
    title="First charge SoC over time",
)

In [None]:
px.scatter(
    tss_to_plot,
    x="date",
    y="cum_energy_added",
    facet_row="vehicle_id",
    title="Cumulative energy added over time",
)
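
The cumulative energy columns come from `compute_cum_var`, which lives in the base class and is not shown here. As a hedged illustration of the general idea only, the sketch below integrates power over time with a simple right Riemann sum. The function `cumulative_energy_kwh`, the column name `power_w` and the units (watts in, kWh out) are all assumptions, not the actual implementation.

```python
import pandas as pd


def cumulative_energy_kwh(
    tss: pd.DataFrame,
    power_col: str = "power_w",
    time_col: str = "date",
    id_col: str = "vehicle_id",
) -> pd.Series:
    """Cumulative energy per vehicle, assuming power in watts and sorted timestamps."""
    # Hours elapsed since the previous sample of the same vehicle.
    dt_hours = tss.groupby(id_col)[time_col].diff().dt.total_seconds().div(3600)
    # Energy of each step in kWh; the first sample of a vehicle contributes nothing.
    step_energy_kwh = (tss[power_col] * dt_hours / 1000).fillna(0)
    return step_energy_kwh.groupby(tss[id_col]).cumsum()


toy = pd.DataFrame(
    {
        "vehicle_id": ["a", "a", "a"],
        "date": pd.to_datetime(
            ["2024-01-01 10:00", "2024-01-01 10:30", "2024-01-01 11:00"]
        ),
        "power_w": [0.0, 7000.0, 7000.0],
    }
)
toy["cum_energy_kwh"] = cumulative_energy_kwh(toy)
print(toy)
```

With 7 kW held over two half-hour steps, the toy series accumulates 3.5 kWh and then 7 kWh.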

## Adding odometer from trips response

In [None]:
trips = (
    pd.read_csv(
        "data_cache/ituran_trips_response.csv",
        parse_dates=["trip_start_date"],
        dtype={"vehicle_id": "category"}
    )
    .rename(columns={"rounded_start_mileage_km": "odometer", "trip_start_date": "date"})
    .astype({"date": "datetime64[ns]"})
)
display(trips.head())
trips.dtypes

In [None]:
tss_without["vehicle_id"].pipe(uniques_as_series).isin(trips["vehicle_id"].unique()).value_counts()

In [None]:
tss_without["vehicle_id"].value_counts(sort=True)

In [None]:
display(tss_without.query("vehicle_id == '878819821'").size)
display(trips.query("vehicle_id == '878819821'").size)

In [None]:
sub_tss = (
    tss_without
    .query("vehicle_id == '878819821'")
    .merge(trips.query("vehicle_id == '878819821'"), on=["vehicle_id", "date"], how="outer")
)
sanity_check(sub_tss)

In [None]:
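tss = (
    tss_without
    .merge(trips, how="outer", on=["vehicle_id", "date"])
    # The forward fill below relies on chronological order within each vehicle.
    .sort_values(["vehicle_id", "date"])
)
tss["odometer"] = tss.groupby("vehicle_id", observed=True)["odometer"].ffill()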
sanity_check(tss[["odometer"]])
tss.groupby("vehicle_id", observed=True)["odometer"].nunique()
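
To illustrate what the outer merge followed by a forward fill does, here is a small sketch on toy data (hypothetical timestamps and mileages). Trip starts rarely fall on the exact timestamp of a signal, so the merge mostly adds new rows, and the odometer is then carried forward until the next trip.

```python
import pandas as pd

tss_toy = pd.DataFrame(
    {
        "vehicle_id": ["a", "a", "a"],
        "date": pd.to_datetime(
            ["2024-01-01 10:00", "2024-01-01 12:00", "2024-01-02 09:00"]
        ),
        "soc": [80.0, 70.0, 90.0],
    }
)
trips_toy = pd.DataFrame(
    {
        "vehicle_id": ["a", "a"],
        "date": pd.to_datetime(["2024-01-01 09:30", "2024-01-01 20:00"]),
        "odometer": [1000, 1050],
    }
)

merged = (
    tss_toy
    .merge(trips_toy, how="outer", on=["vehicle_id", "date"])
    # Chronological order per vehicle is required for the forward fill to make sense.
    .sort_values(["vehicle_id", "date"])
    .reset_index(drop=True)
)
# Carry the last known odometer reading forward within each vehicle.
merged["odometer"] = merged.groupby("vehicle_id")["odometer"].ffill()
print(merged)
```

The rows coming from the trips have no SoC, and any signal row recorded before the first trip of a vehicle would keep an empty odometer, which is worth remembering when reading the sanity check above.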

In [None]:
tss.to_parquet("data_cache/ituran_tss.parquet")

## Conclusion
While the data is better than the previous response, it is still lacking some important variables.  
Most notably, we have no temperature data, and charging current and voltage are missing for 82.6% of the fleet.  
Hopefully, this is enough to get started estimating the SoH of the subset of the fleet that does have them...