# SOH estimation using UMAP & polynomial linear regression

### Introduction:
We want to estimate the soh of batteries over time using the data recorded while they charge.  

### Vocabulary:
- charging point: Time series samples aggregated over `CHARGING_POINTS_GRP_BY_SOC_QUANTIZATION`, defined in `watea_constants`.
- `energy_added`: Energy received during a charging point.
- `default_100_soh energy_added`: `energy_added` of a battery with 100% soh.
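
As a rough illustration of what "aggregating over a soc quantization" could look like, here is a minimal sketch. The step value `SOC_QUANTIZATION`, the toy columns, and the choice of `mean` as the aggregation are assumptions made for the example; the real logic lives in the `watea` package and may differ.

```python
import pandas as pd

# Hypothetical quantization step (in soc percentage points); the real value
# is `CHARGING_POINTS_GRP_BY_SOC_QUANTIZATION` in `watea_constants`.
SOC_QUANTIZATION = 5

# Toy raw time series for a single charge (made-up values).
raw = pd.DataFrame({
    "charge_id": [0, 0, 0, 0, 0, 0],
    "soc": [40, 42, 44, 45, 47, 49],
    "voltage": [350.0, 351.0, 352.0, 353.0, 354.0, 355.0],
    "current": [10.0, 10.0, 10.0, 12.0, 12.0, 12.0],
})

# Round soc down to the nearest multiple of the step, then aggregate every
# sample that falls in the same (charge, soc bin) pair into one charging point.
toy_charging_points = (
    raw
    .assign(soc_bin=lambda df: (df["soc"] // SOC_QUANTIZATION) * SOC_QUANTIZATION)
    .groupby(["charge_id", "soc_bin"], as_index=False)
    .agg(
        voltage=("voltage", "mean"),
        current=("current", "mean"),
        n_samples=("soc", "count"),
    )
)
print(toy_charging_points)
```

Each resulting row plays the role of one charging point in the rest of the notebook.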

### Assumptions:
Our main assumption is that: *a battery that requires less energy to gain a certain amount of soc than another battery has a lower soh*.  
Our second assumption is that: *The charges that were made at 3k odometer or less can be used to define the expected energy to gain a certain amount of soc for a 100% soh battery*.  
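
The second assumption can be expressed as a boolean flag on each charging point. The sketch below assumes the `is_default_100_soh` column queried later in this notebook is derived from an odometer threshold in this way; the constant name and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical constant name; "3k odometer" is taken in whatever unit the data uses.
DEFAULT_100_SOH_MAX_ODOMETER = 3_000

toy = pd.DataFrame({
    "id": ["a", "a", "b", "b"],
    "odometer": [1_200, 25_000, 2_900, 3_100],
    "energy_added": [450.0, 420.0, 455.0, 452.0],
})

# Charging points recorded at or below the threshold are treated as
# coming from a battery with 100% soh.
toy["is_default_100_soh"] = toy["odometer"] <= DEFAULT_100_SOH_MAX_ODOMETER
print(toy)
```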

### Observations:
1.  The energy required to gain a certain amount of soc depends on multiple factors, namely:
    - voltage/soc
    - temperature
    - current
2.  The relationship between `energy_added` and the aforementioned factors is discontinuous, forming distinct clusters of charging points.  
    We call these clusters charging regimes, as they are most likely representative of different charger types/brands and charging modes (AC/DC and so on).

### Main idea:
We estimate the soh of a charging point as its `energy_added` divided by its `default_100_soh energy_added`.  
The `default_100_soh energy_added` for a given charging point is estimated using linear regression.  
Note: Ideally there would be one regressor per charging regime, but here we implement a single regressor for a single charging regime.
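
A tiny worked example of the ratio, using made-up numbers. The `default_100_energy_added` column stands in for the regressor's output, which is constant here only to keep the arithmetic easy to follow.

```python
import pandas as pd

toy = pd.DataFrame({
    "energy_added": [450.0, 405.0, 360.0],
    # Stand-in for the regressor's prediction; values are made up.
    "default_100_energy_added": [450.0, 450.0, 450.0],
})

# soh expressed as a percentage of the energy a 100% soh battery would receive.
toy = toy.eval("soh = 100 * energy_added / default_100_energy_added")
print(toy["soh"].round(1).tolist())  # [100.0, 90.0, 80.0]
```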

### Imports

In [None]:
import logging

import plotly.express as px
import pandas as pd
from pandas import DataFrame as DF
import matplotlib.pyplot as plt
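from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures, StandardScaler
from sklearn.cluster import DBSCAN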
import numpy as np
import umap

from core.plt_utils import plt_3d_df
from watea.watea_constants import *
from watea.energy_distribution import *

logging.basicConfig(level=logging.INFO)


## Setup

In [None]:
charging_points = (
    extract_raw_fleet_charging_points()
    .pipe(clean_charging_points)
    .pipe(compute_regime_seperation_feature)
)
display(charging_points["estimated_range"].notna().sum())

In [None]:
# `display` is needed here: only the last expression of a cell is rendered automatically.
display(charging_points.head(10))
charging_points.to_csv("charging_points.csv")


Here we can visualize the entirety (minus some outlier points) of the fleet's charging points.

In [None]:
plt_3d_df(charging_points, "soc", "current", "energy_added", color="temperature", colorscale="Rainbow", size=2.5)

## Umap dimensionality reduction

To segment the different charging regimes, we first use a UMAP dimensionality reducer.  
We train it by asking it to group the charging points based on the relation between their input features and the target feature.

In [None]:

def dimensionality_reduction(df: DF, n_components=N_COMPONENTS, features=FEATURE_COLS, n_neighbours=120) -> DF:
    """Embed `features` with UMAP and return `df` with the embedding columns prepended."""
    umap_feature_cols = [f"umap_feature_{i}" for i in range(n_components)]
    # Drop embedding columns left over from a previous run so they are not duplicated.
    umap_feature_cols_to_drop = [col for col in umap_feature_cols if col in df.columns]
    df = df.drop(columns=umap_feature_cols_to_drop)
    return (
        Pipeline([
            ('standard_scaler', StandardScaler()),
            ('reducer', umap.UMAP(n_components=n_components, verbose=True, n_neighbors=n_neighbours, random_state=UMAP_RANDOM_STATE)),
            # Wrap the embedding array in a DataFrame with named columns.
            ('to_df', FunctionTransformer(lambda X: DF(X, columns=umap_feature_cols))),
            # Re-attach the original columns; the index is reset so rows align by position.
            ('concat_with_og_df', FunctionTransformer(lambda X: pd.concat((X, df.reset_index(drop=True)), axis="columns"))),
        ])
        .fit_transform(
            X=df[features].values,
            y=df["energy_added"],  # target passed through to UMAP for a supervised embedding
        )
    )

In [None]:
charging_points = dimensionality_reduction(charging_points, n_neighbours=150)

In [None]:
plt_3d_df(charging_points, "umap_feature_0", "umap_feature_1", "umap_feature_2", color="energy_added", colorscale="Rainbow", size=2.5)

## Charge regime clustering

Then we use DBSCAN to segment the charging points based on their UMAP embedded features.

In [None]:
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='euclidean', n_jobs=-1)
umap_feature_cols = charging_points.filter(regex='umap_feature_').columns
charging_points['cluster_idx'] = dbscan.fit_predict(charging_points[umap_feature_cols])
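
To make the DBSCAN output easier to read, the sketch below runs the same estimator settings on synthetic points and counts the size of each cluster. It assumes that the main charging regime is the most populated cluster, which is only one plausible way of arriving at a value like `MAIN_CHARGING_REGIME_CLUSTER_IDX`; the notebook itself does not say how that constant was chosen.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two tight synthetic blobs plus one far-away outlier (stand-ins for UMAP features).
blob_a = rng.normal(loc=0.0, scale=0.05, size=(50, 3))
blob_b = rng.normal(loc=5.0, scale=0.05, size=(30, 3))
outlier = np.array([[20.0, 20.0, 20.0]])
points = np.vstack([blob_a, blob_b, outlier])

labels = DBSCAN(eps=0.5, min_samples=5, metric="euclidean").fit_predict(points)

# DBSCAN labels noise points as -1; every other label is a cluster index.
cluster_labels, counts = np.unique(labels[labels != -1], return_counts=True)
largest_cluster = cluster_labels[np.argmax(counts)]
n_noise = int((labels == -1).sum())

print(dict(zip(cluster_labels.tolist(), counts.tolist())), "noise:", n_noise)
```

On the real data, the same count (for example via `charging_points["cluster_idx"].value_counts()`) gives a quick sanity check that `eps` and `min_samples` are not fragmenting a regime or labelling most points as noise.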

In [None]:
plt_3d_df(charging_points, "umap_feature_0", "umap_feature_1", "umap_feature_2", color="cluster_idx", colorscale="Rainbow", size=2.5)


## Charge regime cluster EDA
Before training a regressor on this charge regime cluster, let's do some EDA on it to better understand it.  

In [None]:
cluster = (
    charging_points
    .query(f"cluster_idx == {MAIN_CHARGING_REGIME_CLUSTER_IDX}")
    .query("energy_added > 320 & energy_added < 490")
    .query("current < 27.5 & current > 5.8")
)
plt_3d_df(cluster, "voltage", "current", "energy_added", color="odometer", colorscale="Rainbow", size=2.5)

In [None]:
cluster:DF = cluster.eval("current_plt = current * 10")
plt_3d_df(cluster, "voltage", "current_plt", "energy_added", color="temperature", colorscale="Rainbow", size=2.5)

In [None]:
cluster["current"].plot.hist(bins=75)

In [None]:
cluster["temperature"].plot.hist(bins=20)

## SOH estimation
Now we train a linear regression model to estimate the `default_100_soh energy_added`.  
We train it in two steps:  
1. Fit it to the entirety of the cluster.
2. Re-fit its intercept to the charging points at 3k odometer or less, using the mean residual of its predictions on those points.
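
The two steps can be sketched on synthetic data as follows. The features, the target, and the `is_reference` mask (a stand-in for `is_default_100_soh`) are all made up, and the polynomial degree is kept at 2 for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(-1.0, 1.0, size=(n, 2))   # stand-ins for the input features
is_reference = rng.random(n) < 0.2        # stand-in for `is_default_100_soh`

# Synthetic target: reference points sit 5 units above the rest of the cluster.
offset = np.where(is_reference, 5.0, 0.0)
y = 100 + 3 * x[:, 0] + 2 * x[:, 1] ** 2 + offset + rng.normal(0, 0.1, n)

# Step 1: fit shape and level on the whole cluster.
model = Pipeline([
    ("poly_features", PolynomialFeatures(degree=2)),
    ("regressor", LinearRegression()),
]).fit(x, y)

# Step 2: keep the fitted shape, but shift the intercept by the mean residual
# on the reference points so the surface passes through their average level.
residuals = y[is_reference] - model.predict(x[is_reference])
model.named_steps["regressor"].intercept_ += residuals.mean()

mean_reference_residual = (y[is_reference] - model.predict(x[is_reference])).mean()
print(residuals.mean(), mean_reference_residual)
```

After the shift, the mean residual on the reference points is zero up to floating point error, which is exactly what the intercept adjustment is meant to achieve. Note that applying the shift a second time would change nothing, but it would overwrite any saved copy of the original intercept.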

In [None]:
x = cluster[["voltage", "temperature", "current"]].values
y = cluster["energy_added"].values
display(x.shape)
display(y.shape)
soh_estimator = (
    Pipeline([
        ('poly_features', PolynomialFeatures(degree=6)),
        ('regressor', LinearRegression())
    ])
    .fit(X=x, y=y)
)
cluster["general_energy_added"] = (
    soh_estimator
    .predict(X=x)
    .squeeze()
)

In [None]:
plt_3d_df(cluster, "voltage", "current", "temperature", color="general_energy_added", colorscale="Rainbow", size=2.5)

In [None]:
default_100_soh_cluster = cluster.query("is_default_100_soh")
y2_pred = soh_estimator.predict(default_100_soh_cluster[['voltage', 'temperature', 'current']])
residuals = default_100_soh_cluster['energy_added'] - y2_pred
initial_intercept = soh_estimator.named_steps['regressor'].intercept_
adjusted_intercept = initial_intercept + residuals.mean()
soh_estimator.named_steps['regressor'].intercept_ = adjusted_intercept

cluster:DF = (
    cluster
    .assign(default_100_energy_added=soh_estimator.predict(cluster[['voltage', 'temperature', 'current']]))
    .eval("soh = 100 * energy_added / default_100_energy_added")
)

cluster_charges = agg_charging_points_over_charges(cluster, {
    "odometer":"median",
    "energy_added":"median",
    "voltage":"median",
    "current":"median",
    "temperature":"median",
    "sec_duration":"median",
    "date":"median",
    "soc":"median",
    "min_voltage":"median",
    "soc_voltage_feature":"median",
    # Must match the column name created by `.assign(default_100_energy_added=...)` above.
    "default_100_energy_added":"median",
    "soh":"median",
    "estimated_range": "mean",
    "estimated_range_diff": "mean",
    #Debugging
    "id":pd.Series.mode,
    "charge_idx":pd.Series.mode,
    "charge_id":pd.Series.mode,
})

In [None]:
# Save output
! mkdir -p data_cache/soh_estimation
charging_points.to_parquet("data_cache/soh_estimation/charging_points_after_umap.parquet")
cluster.to_parquet("data_cache/soh_estimation/main_charging_regime_lr_soh_estimation.parquet")
cluster_charges.to_parquet("data_cache/soh_estimation/main_charging_regime_lr_soh_estimation_per_charge.parquet")

In [None]:
cluster["estimated_range"].value_counts()

In [None]:
# The intercept adjustment and the `soh` column were already computed in the SOH estimation cell.
# Repeating the adjustment here would overwrite `initial_intercept` with the adjusted value,
# so the before/after comparison printed further down would show two identical numbers.
# We only re-aggregate the charging points per charge, with the default aggregation.
cluster_charges = agg_charging_points_over_charges(cluster)

In [None]:
px.scatter(cluster_charges, x='odometer', y='soh', color='id')

## Estimation evaluation

Here we visualize the estimated `default_100_soh energy_added` across the soh estimation features.

In [None]:
plt_3d_df(cluster, "energy_added", "odometer", "general_energy_added", color="temperature", colorscale="Rainbow", size=2.5)

We can see that there is one particular charge that has a very large spread of soh values.  
Here we take a look at one specific battery (`bob432`) to try to interpret the noisy soh estimations.  
This is a very "minimalist" evaluation...  

In [None]:
px.scatter(cluster.query("id == 'bob432'"), x='odometer', y='soh', opacity=0.6, color='id')

It seems like these low soh charging points are in a much lower `current` region than the rest.  
It might be worth checking the rest of the batteries to see if they all have abnormally lower soh in this `current` region.  
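
One way to run that check is to bin `current` and compare the median soh per battery and per bin. The sketch below uses made-up values and illustrative bin edges; on the real data, `toy` would be replaced by `cluster`.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "b"],
    "current": [6.0, 15.0, 25.0, 7.0, 16.0, 26.0],
    "soh": [88.0, 97.0, 98.0, 90.0, 99.0, 99.5],
})

# Bin edges are illustrative; they roughly span the 5.8 to 27.5 range kept in the cluster.
current_bins = pd.cut(toy["current"], bins=[5, 10, 20, 30])

# One row per battery, one column per current bin, median soh in each cell.
soh_by_current = (
    toy
    .groupby(["id", current_bins], observed=True)["soh"]
    .median()
    .unstack("current")
)
print(soh_by_current)
```

If the lowest current bin is systematically below the others for every battery, the effect is more likely a modelling artefact in that region than a property of one battery.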

In [None]:
plt_3d_df(cluster.query("id == 'bob432'"), "voltage", "current", "soh", color="odometer", colorscale="Rainbow", size=2.5)

Check the difference in intercept values before and after fitting it to the `default_100_soh` batteries.

In [None]:
print(initial_intercept)
print(adjusted_intercept)

Interpret the influence of current on the `energy_added`.

In [None]:
# Get the coefficients from the linear regression model
coefficients = soh_estimator.named_steps['regressor'].coef_

# Get the feature names after polynomial transformation
poly_feature_names = soh_estimator.named_steps['poly_features'].get_feature_names_out(['voltage', 'temperature', 'current'])

# Combine feature names with their corresponding coefficients
coeff_dict = dict(zip(poly_feature_names, coefficients))

# Display the coefficient of every polynomial term that involves 'current'
for feature, coeff in coeff_dict.items():
    if 'current' in feature:
        print(f"Feature: {feature}, Coefficient: {coeff}")
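
Raw polynomial coefficients are hard to compare because the features are on different scales and `current` appears in many interaction terms. A complementary view, sketched here on a synthetic model, is to sweep `current` over its observed range while holding the other features at their median and look at the predicted `energy_added`. The data, the degree, and the name `toy_model` are assumptions for the example; on the real data the sweep would be passed to `soh_estimator.predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic columns in the same order as the notebook: voltage, temperature, current.
x = np.column_stack([
    rng.uniform(350, 400, 300),
    rng.uniform(10, 30, 300),
    rng.uniform(6, 27, 300),
])
# Made-up ground truth in which energy_added increases with current.
y = 400 + 0.1 * x[:, 0] + 2.0 * x[:, 2]

toy_model = Pipeline([
    ("poly_features", PolynomialFeatures(degree=2)),
    ("regressor", LinearRegression()),
]).fit(x, y)

# Hold voltage and temperature at their median and vary only current.
current_grid = np.linspace(x[:, 2].min(), x[:, 2].max(), 20)
sweep = np.tile(np.median(x, axis=0), (len(current_grid), 1))
sweep[:, 2] = current_grid
predicted = toy_model.predict(sweep)

for current, energy in zip(current_grid[::5], predicted[::5]):
    print(f"current={current:.1f} -> predicted energy_added={energy:.1f}")
```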
