# SoH estimation experiments on Renault vehicles
Two methods of calculating the SoH are considered.

Based on the battery level:
```
soh = charging.battery_energy / (charging.battery_level * model_battery_capacity) 
```
Based on the estimated range:
```
soh = estimated_range / (soc * model_battery_range)
```
The best estimate is probably a combination of the two.
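
The two formulas, and one possible way of combining them, can be sketched as below. This is only an illustration: it assumes `battery_level` and `soc` are fractions in [0, 1], and it uses a plain weighted average as the combination, which the notes above do not specify. The function names, the `weight` parameter, and the numbers are hypothetical.

```python
def soh_from_energy(battery_energy, battery_level, model_battery_capacity):
    """SoH from the reported energy: energy / (level * nominal capacity)."""
    return battery_energy / (battery_level * model_battery_capacity)


def soh_from_range(estimated_range, soc, model_battery_range):
    """SoH from the estimated range: range / (soc * nominal range)."""
    return estimated_range / (soc * model_battery_range)


def combined_soh(soh_energy, soh_range, weight=0.5):
    """Weighted average of the two estimates (an assumed combination rule)."""
    return weight * soh_energy + (1 - weight) * soh_range


# Made-up values for illustration, not measurements.
energy_based = soh_from_energy(23.4, 0.5, 52)    # 23.4 kWh at 50 % of a 52 kWh pack
range_based = soh_from_range(171.0, 0.5, 380.0)  # 171 km at 50 % of a 380 km nominal range
print(energy_based, range_based, combined_soh(energy_based, range_based))
```

The weight would have to be tuned (or replaced by a different combination rule) once both estimates have been evaluated on real data.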

## Imports

In [None]:
import logging
from datetime import datetime as DT
from datetime import timedelta as TD
from dateutil import parser
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import os
from dotenv import load_dotenv

import numpy as np
from rich import print
import pandas as pd
from pandas import Series
from pandas import DataFrame as DF
import plotly.express as px
from scipy.optimize import curve_fit
from core.s3_utils import S3_Bucket
from core.config import *
from core.time_series_processing import preprocess_date

## Setup

We must ensure that the data points of the different time series are comparable.  
To do this, we extract the car model of each vehicle from `fleet_info.csv` ("List finale des vin a activer" on the drive).

In [None]:
fleet_info = pd.read_csv("../ayvens/fleet_info1.csv", usecols=["VIN","Make","Model","Type"], dtype={"Make":"string"})
# fleet_info = pd.read_csv("fleet_info.csv")
print(fleet_info.columns)
fleet_info = (
    fleet_info
    .rename(columns={"VIN": "vin"})
    .assign(Make=fleet_info["Make"].str.lower())
    .query("Make == 'renault'")
    .set_index("vin", drop=False)
)
fleet_info[["Model", "Type"]].value_counts()

Then we use data found online to get the default battery capacity of each model.  
Note: *Here a model is a combination of the `Model` and `Type` variables of `fleet_info`, since cars of the same model but of different types can have different battery capacities*.

In [None]:
KWH_BATTERY_CAPCITY_DICT = {
    "ZOE": {
        "R90 Life (batterijkoop) 5d": 41,
        "R135 Edition One (batterijkoop) 5d": 52,
        "R135 Intens (batterijkoop) 5d": 52,
        "R135":52
    }
}
KNOW_MODEL_TYPES = ["R90 Life (batterijkoop) 5d", "R135 Edition One (batterijkoop) 5d", "R135 Intens (batterijkoop) 5d", "R135"]

Let's remove the VINs for which we don't have a known default battery capacity.

In [None]:
has_known_capacity = fleet_info["Type"].isin(KNOW_MODEL_TYPES)
fleet_info = fleet_info[has_known_capacity]
fleet_info.head(10)
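
The filter above relies on a separate list of known types that has to be kept in sync with the capacity dictionary. A possible alternative, sketched here under assumptions, is to look up each (`Model`, `Type`) pair directly in the nested dictionary and drop the rows without a match. `CAPACITY_KWH` is a shortened copy of the dictionary above, and `lookup_capacity` and the sample rows are hypothetical.

```python
import pandas as pd

# Shortened copy of the capacity table, for illustration only.
CAPACITY_KWH = {
    "ZOE": {
        "R90 Life (batterijkoop) 5d": 41,
        "R135": 52,
    }
}


def lookup_capacity(model, model_type, table=CAPACITY_KWH):
    """Return the nominal capacity in kWh, or None if the pair is unknown."""
    return table.get(model, {}).get(model_type)


# Made-up rows: one known pair, one unknown type, one unknown model.
sample = pd.DataFrame({
    "vin": ["VIN_A", "VIN_B", "VIN_C"],
    "Model": ["ZOE", "ZOE", "OTHER"],
    "Type": ["R135", "Unknown trim", "R135"],
})
sample["capacity_kwh"] = [
    lookup_capacity(model, model_type)
    for model, model_type in zip(sample["Model"], sample["Type"])
]
# Keep only the vehicles with a known default capacity.
known = sample.dropna(subset=["capacity_kwh"])
print(known)
```

Checking the pair rather than the type alone also rejects a row whose `Type` is known but whose `Model` is missing from the dictionary, which would otherwise raise a `KeyError` at lookup time.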


Let's extract the raw time series of all the cars we have into a multi-indexed DataFrame.

In [None]:
PROD_CREDS = {
    "bucket_name": os.getenv("PROD_S3_BUCKET"),
    "aws_access_key_id": os.getenv("PROD_S3_KEY"),
    "aws_secret_access_key": os.getenv("PROD_S3_SECRET"),
}


bucket = S3_Bucket(PROD_CREDS)

def get_renault_raw_ts(vin:str) -> DF:
    return (
        bucket.read_parquet_df(f"raw_ts/renault/time_series/{vin}.parquet")
        .set_index("date", drop=False)
        .sort_index()
    )

raw_tss = {}
count = 0
for vin, vehicle_info in fleet_info.iterrows():
    default_100_soc_energy = KWH_BATTERY_CAPCITY_DICT[vehicle_info["Model"]][vehicle_info["Type"]]
    try:
        raw_tss[vin] = (
            get_renault_raw_ts(vin)
            .assign(default_100_soc_energy=default_100_soc_energy)
            .assign(vin=vin)
            .assign(type=vehicle_info["Type"])
        )
    except Exception as e:
        display(e)
        # print(vin)
        count += 1
        continue
raw_tss = pd.concat(raw_tss, axis="index", keys=raw_tss.keys(), names=["vin"])
print("Number of VINs that could not be loaded: ", count)
raw_tss["type"].unique()


In [None]:
# Count the number of unique VINs
nombre_vin_uniques = raw_tss['vin'].nunique()

print(f"Number of distinct VINs in raw_tss: {nombre_vin_uniques}")

**Note**: *There are only R135 models.*

### Time series processing
Let's implement a naive SoH estimation pipeline.  

In [None]:
tss: DF = (
    raw_tss
    .rename(columns={
        "charging.battery_energy": "battery_energy",
        "diagnostics.odometer": "odometer",
        "charging.battery_level": "battery_level",
        "charging.estimated_range": "estimated_range",
    })
    .eval("soc = battery_level * 100")
    .eval("expected_battery_energy = default_100_soc_energy * battery_level")
    # Ratio of expected to reported energy (the inverse of the ratio given in the
    # introduction); dividing by 115 normalizes the battery capacity.
    .eval("soh = 100 * expected_battery_energy / battery_energy / 115")
)
tss.columns

In [None]:
# tss[tss['vin']=='VF1AG000366046670'].tail(10).head(25)
# columns_of_interest = ['vin', 'soc', 'battery_energy']  # Replace with your desired columns
# value_counts_specific = tss[columns_of_interest].agg('value_counts')
# print(value_counts_specific)

## EDA

## Assumption verification
First, we verify that `soc` and `battery_energy` are two "real" variables.  
That is, neither of them is calculated from the other.

In [None]:
# Count the number of unique VINs
nombre_vin_uniques = tss['vin'].nunique()

print(f"Number of distinct VINs in tss: {nombre_vin_uniques}")


In [None]:
px.scatter(tss, x="soc", y="battery_energy", color="vin")


Looking at this scatter plot, we can see that:
- The two variables are indeed two real variables: neither is a synthetic variable calculated from the other.  
- The difference is much larger at high `soc` values.
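
The visual check can be complemented by a numerical one. If `battery_energy` were simply proportional to `soc` for a given vehicle, the ratio `battery_energy / soc` would be constant for each VIN. The sketch below computes the relative spread (standard deviation over mean) of that ratio per VIN on synthetic data; the column names follow `tss`, while `ratio_spread_per_vin` and the demo values are hypothetical. It only detects a purely proportional derivation, not every possible one.

```python
import numpy as np
import pandas as pd


def ratio_spread_per_vin(df):
    """Relative spread (std / mean) of battery_energy / soc for each VIN."""
    valid = df[df["soc"] > 0]
    ratio = valid["battery_energy"] / valid["soc"]
    # Group positionally by the VIN column, so that an index level that is
    # also named "vin" cannot make the grouping ambiguous.
    grouped = ratio.groupby(valid["vin"].to_numpy())
    return grouped.std() / grouped.mean()


# Synthetic data: one vehicle whose energy is derived from soc, one with noise.
rng = np.random.default_rng(0)
soc = rng.uniform(10, 100, size=200)
demo = pd.DataFrame({
    "vin": ["DERIVED"] * 100 + ["MEASURED"] * 100,
    "soc": soc,
    "battery_energy": np.concatenate([
        0.52 * soc[:100],
        0.52 * soc[100:] + rng.normal(0, 2.0, size=100),
    ]),
})
spread = ratio_spread_per_vin(demo)
print(spread)
```

A spread close to zero for a VIN would suggest that one variable is computed from the other; a clearly non-zero spread is consistent with two independent measurements.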

Let's verify that the `soh` is not correlated with the `soc` or `odometer`.

In [None]:
px.scatter(tss, x="soc", y="soh", color="vin")

In [None]:
px.scatter(tss, x="odometer", y="soh", color="vin")
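
The two scatter plots can be backed by a correlation coefficient per vehicle. The sketch below computes the Pearson correlation of `soh` with `soc` and with `odometer` for each VIN, on synthetic data. `soh_correlations` and the demo values are hypothetical, and Pearson correlation only captures linear dependence, so it complements the plots rather than replacing them.

```python
import numpy as np
import pandas as pd


def soh_correlations(df):
    """Pearson correlation of soh with soc and odometer, one row per VIN."""
    rows = {}
    # Group positionally by the VIN column to avoid any clash with an
    # index level that is also named "vin".
    for vin, group in df.groupby(df["vin"].to_numpy()):
        corr = group[["soh", "soc", "odometer"]].corr()
        rows[vin] = {
            "soh_vs_soc": corr.loc["soh", "soc"],
            "soh_vs_odometer": corr.loc["soh", "odometer"],
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic data: one vehicle with a flat soh, one whose soh drifts with mileage.
rng = np.random.default_rng(1)
n = 300
odometer = np.linspace(0, 50_000, n)
demo = pd.DataFrame({
    "vin": ["FLAT"] * n + ["DRIFTING"] * n,
    "soc": rng.uniform(10, 100, size=2 * n),
    "odometer": np.tile(odometer, 2),
    "soh": np.concatenate([
        95 + rng.normal(0, 0.5, size=n),
        100 - odometer / 10_000 + rng.normal(0, 0.1, size=n),
    ]),
})
result = soh_correlations(demo)
print(result)
```

On the real `tss`, values close to zero in both columns would support the claim that the estimate does not depend on `soc` or `odometer`.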
