# Ituran preliminary EDA
The goal of this notebook is to examine the data provided by Ituran.  
We want to know which columns we will need for the POC and their respective quality (frequency, error margin, ...).  

## Setup

### Import

In [None]:
from rich import print
import pandas as pd
from pandas import DataFrame as DF
import plotly.express as px

from core.config import *

### Data Extraction

In [None]:
raw_ts = pd.read_csv("single_vehicle_ts.csv", parse_dates=["signal_time"])
raw_ts

The data we are interested in resides in the columns `signal_name` and `LocTime_utc`.  
We will first perform a split to obtain the variable names and values in two corresponding columns.  
Then, we will pivot those columns to get the data into a useful format.  
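
To make the split-then-pivot step concrete, here is a minimal sketch on a hypothetical toy frame. The rows and values are made up for illustration, and the `signal_name` layout is assumed to follow the "CarData - variable_name - value" pattern mentioned in the notebook.

```python
import pandas as pd

# Hypothetical toy rows mimicking the long format; all values are made up.
toy = pd.DataFrame({
    "vehicle_id": ["v1", "v1", "v1", "v1"],
    "signal_time": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:00",
        "2024-01-01 10:05", "2024-01-01 10:05",
    ]),
    "signal_name": [
        "CarData - Battery Status Of Charge - value",
        "CarData - Charging Current - value",
        "CarData - Battery Status Of Charge - value",
        "CarData - Charging Current - value",
    ],
    "signal_value": [80.0, 0.0, 79.0, 0.0],
})

# Split "CarData - variable_name - value" and keep the middle part.
toy["variable"] = toy["signal_name"].str.split(" - ", expand=True)[1]

# One row per (vehicle_id, signal_time), one column per variable.
toy_wide = (
    toy.pivot_table(
        values="signal_value",
        index=["vehicle_id", "signal_time"],
        columns="variable",
        aggfunc="first",
    )
    .reset_index()
)
print(toy_wide)
```

Each distinct variable name becomes its own column, and timestamps at which a variable was not reported end up as `NaN` in that column.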

In [None]:
COLS_TO_KEEP_WITHOUT_PIVOTIONG = [
    'vehicle_make',
    'vehicle_model', 
    'vehicle_energy_type',
    'year_of_manufacture',
]

INDEX_COLS = [
    'vehicle_id',
    'signal_time',
]

COLUMNS_NAMES_MAP = {
    "Battery Status Of Charge": "soc",
    "signal_time": "date",
    "Vehicle Range Of Battery": "estimated_range"
}
DTYPES = {
    "date": "datetime64[ns]",
    "vehicle_id": "string",
}
COLS_TO_DROP = [
    "Ready Switch Open"
]

In [None]:
# The signal_name column has the following format: "CarData - variable_name - value"
split_signal_name_col = raw_ts["signal_name"].str.split(" - ", expand=True)
raw_ts["value"] = split_signal_name_col[2]
raw_ts["variable"] = split_signal_name_col[1]


tss: DF = (
    pd.pivot_table(
        raw_ts,
        values="signal_value",
        index=INDEX_COLS,
        columns="variable",
        aggfunc="first",
        # dropna=False,
    )
    .sort_index()
    .reset_index()
    .merge(
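        # Keep one metadata row per key so the merge cannot multiply the pivoted rows.
        raw_ts[COLS_TO_KEEP_WITHOUT_PIVOTIONG + INDEX_COLS].drop_duplicates(subset=INDEX_COLS),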
        on=INDEX_COLS,
        how="left"
    )
    .rename(columns=COLUMNS_NAMES_MAP)
    .astype(DTYPES)
    .drop(columns=COLS_TO_DROP)
)
tss

## EDA

### Data sparsity

In [None]:
DF({
    "count": tss.count(),
    "density": tss.count() / len(tss),
    "dtype": tss.dtypes,
})

In [None]:
most_common_vehicle_ids = (
    tss
    .dropna(subset=["date", "soc"], how="any")
    .loc[:, 'vehicle_id']
    .value_counts()
    .index
)
most_common_vehicle_ids

In [None]:
(
    tss
    .dropna(subset=["date", "soc"], how="any")
    .loc[:, ["soc", "vehicle_id", "date"]]
    .groupby("vehicle_id")
    .agg(["min", "max", "count"])
    .sort_values(("soc", "count"), ascending=False)
)
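
Since sampling frequency is one of the quality criteria we care about, a rough way to measure it is to look at the time between consecutive non-null samples per vehicle. The sketch below assumes a frame with `vehicle_id`, `date` and `soc` columns like `tss`; the helper name `sampling_intervals` and the demo values are hypothetical.

```python
import pandas as pd

def sampling_intervals(df: pd.DataFrame, col: str = "soc") -> pd.DataFrame:
    """Summarise, per vehicle, the gap in seconds between consecutive non-null `col` samples."""
    valid = (
        df.dropna(subset=["date", col])
        .sort_values(["vehicle_id", "date"])
        .reset_index(drop=True)
    )
    # Time since the previous sample of the same vehicle (NaN for the first sample).
    gaps_s = valid.groupby("vehicle_id")["date"].diff().dt.total_seconds()
    return gaps_s.groupby(valid["vehicle_id"]).agg(["median", "max", "count"])

# Hypothetical demo data, not taken from the Ituran export.
demo = pd.DataFrame({
    "vehicle_id": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 00:10", "2024-01-01 00:40",
        "2024-01-01 00:00", "2024-01-01 01:00",
    ]),
    "soc": [50.0, 51.0, 53.0, 70.0, 69.0],
})
intervals = sampling_intervals(demo)
print(intervals)
```

On the real data this would be called as `sampling_intervals(tss)`. A large gap between the median and the max interval would suggest that the signal is only reported in bursts.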

In [None]:
COLS_TO_PLOT = [
    "Charging AC Mode",
    "Charging Current",
    "Charging DC Mode",
    "Charging Voltage",
    "Time Remaining for Charge",
]
for col in COLS_TO_PLOT:
    px.line(
        tss
        .dropna(subset=["date", col], how="any")
        .set_index("vehicle_id", drop=False)
        .loc[most_common_vehicle_ids[:4]],
        x="date",
        y=col,
        color="vehicle_id",
    ).show()

## Conclusion
We can see that while the date range of the time series spans six months, there are only 2 days' worth of data.  
Given the variables at hand, we *could* implement an SoH estimation similar to the one we used in watea.  
For that we would need more data and ideally the temperature.  
If we don't have the temperature, we would need to check how the models handle temperature differences (do they use a heater to compensate for low temperatures? Is the battery simply not affected by the temperature?).
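
The gap between the calendar span and the amount of data actually present can be quantified with a small check like the one below. This is only a sketch: the helper name `date_coverage` and the demo timestamps are hypothetical, and on the real data it would be applied to `tss["date"]`.

```python
import pandas as pd

def date_coverage(dates: pd.Series) -> dict:
    """Compare the calendar span of a timestamp series with the number of distinct days holding data."""
    dates = pd.to_datetime(dates).dropna()
    if dates.empty:
        return {"span_days": 0, "active_days": 0, "coverage": 0.0}
    days = dates.dt.normalize()  # truncate every timestamp to midnight
    span_days = (days.max() - days.min()).days + 1
    active_days = int(days.nunique())
    return {
        "span_days": span_days,
        "active_days": active_days,
        "coverage": active_days / span_days,
    }

# Hypothetical demo timestamps: two days with data over a ten-day span.
demo_dates = pd.Series(["2024-01-01 08:00", "2024-01-01 09:00", "2024-01-10 12:00"])
coverage = date_coverage(demo_dates)
print(coverage)
```

A low `coverage` value confirms that most of the calendar range contains no records at all.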