# Filtering SoH results
The goal of this notebook is to show how we filter out SoH results that are not valid.  
As of writing (2024-11-26), all the criteria are arbitrary.

## Setup

### Imports

In [None]:
import plotly.express as px

from core.pandas_utils import *
from core.config import valid_soh_points
from core.s3.s3_utils import S3Service, S3Settings
from core.spark_utils import create_spark_session

settings = S3Settings()

spark = create_spark_session(
    settings.S3_KEY,
    settings.S3_SECRET
)

s3 = S3Service()

### Data extraction

The results will be used to fill the `vehicle_data` table, so we will format them to match the expected frequency.

In [None]:
df = s3.read_parquet_df_spark(spark, 'result_phases/result_phases_tesla_fleet_telemetry.parquet').toPandas()

Let's visualize the raw results.

In [None]:
px.scatter(df, x="ODOMETER_LAST", y="SOH", color="VIN")

It's pretty clear that some results are outliers.  
While this is *okay* for a statistical analysis, we would prefer not to show them to our clients.  

## Filtering

To filter the results, we will use the following criteria:
- SoH must be strictly between 0.5 and 1.0.
- SoH must lie between two bounding lines (an upper "max" line and a lower "min" line), each described by a slope and an intercept.   
Each line passes through two points, A and B, stored in `transform.raw_results.config.VALID_SOH_POINTS`.   
A and B were chosen arbitrarily.  

In [None]:
VALID_SOH_POINT = pd.DataFrame({
  "ODOMETER_LAST": [20_000, 200_000, 0, 200_000],
  "SOH": [1.0, 0.95, 0.9, 0.6],
  "point": ["A", "B", "A", "B"],
  "bound": ["max", "max", "min", "min"]
}).set_index(["bound", "point"])
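To make the bounds concrete, here is a minimal sketch of the arithmetic behind the two lines, using plain tuples instead of the DataFrame. The point values are copied from the cell above; the helper name `line_through` is hypothetical and only used for illustration.

```python
# Points copied from the VALID_SOH_POINT cell: (ODOMETER_LAST, SOH).
MAX_A, MAX_B = (20_000, 1.0), (200_000, 0.95)
MIN_A, MIN_B = (0, 0.9), (200_000, 0.6)


def line_through(a, b):
    """Return (intercept, slope) of the straight line through points a and b."""
    slope = (b[1] - a[1]) / (b[0] - a[0])
    intercept = a[1] - slope * a[0]
    return intercept, slope


max_intercept, max_slope = line_through(MAX_A, MAX_B)
min_intercept, min_slope = line_through(MIN_A, MIN_B)

print(f"max bound: SOH <= {max_slope:.3e} * ODOMETER_LAST + {max_intercept:.4f}")
print(f"min bound: SOH >= {min_slope:.3e} * ODOMETER_LAST + {min_intercept:.4f}")
```

Both slopes are on the order of 1e-6 or smaller per odometer unit, so they need full floating-point precision wherever they are reused.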

In [None]:
def filter_results_by_lines_bounds(results: DF) -> DF:
    """Keep rows whose SoH lies between the min and max lines and strictly within (0.5, 1.0)."""
    max_intercept, max_slope = intercept_and_slope_from_points(VALID_SOH_POINT.xs("max", level=0, drop_level=True))
    min_intercept, min_slope = intercept_and_slope_from_points(VALID_SOH_POINT.xs("min", level=0, drop_level=True))
    return (
        results
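        # Reference the locals with `@` instead of formatting them with `:f`:
        # `:f` keeps only 6 decimals and would round the ~1e-7 slopes to zero.
        .eval("max_valid_soh = ODOMETER_LAST * @max_slope + @max_intercept")
        .eval("min_valid_soh = ODOMETER_LAST * @min_slope + @min_intercept")
        .eval("soh_is_valid = (SOH <= max_valid_soh) & (SOH >= min_valid_soh) & (SOH > 0.5) & (SOH < 1.0)")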
        .pipe(debug_df, subset=["SOH", "max_valid_soh", "min_valid_soh", "soh_is_valid"], logger=logger)
        .query("soh_is_valid")
        .dropna(subset=["SOH", "ODOMETER_LAST"], how="any")
    )

def intercept_and_slope_from_points(points: DF) -> tuple[float, float]:
    """Return (intercept, slope) of the line through the rows indexed "A" and "B"."""
    slope = (points.at["B", "SOH"] - points.at["A", "SOH"]) / (points.at["B", "ODOMETER_LAST"] - points.at["A", "ODOMETER_LAST"])
    intercept = points.at["A", "SOH"] - slope * points.at["A", "ODOMETER_LAST"]
    return intercept, slope

filtered_results = filter_results_by_lines_bounds(df)
px.scatter(filtered_results, x="ODOMETER_LAST", y="SOH", color="VIN", hover_data=["DATETIME_BEGIN"])
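As a quick sanity check of the criteria, independent of pandas and of the fleet data, the sketch below applies them to a few hypothetical (odometer, SoH) pairs. The function name `soh_is_valid` and the sample values are assumptions made for illustration; the A and B coordinates are the ones defined in this notebook.

```python
def soh_is_valid(odometer: float, soh: float) -> bool:
    """Apply both filtering criteria to a single (odometer, SoH) pair."""
    # Upper line through A=(20_000, 1.0) and B=(200_000, 0.95).
    max_slope = (0.95 - 1.0) / (200_000 - 20_000)
    max_intercept = 1.0 - max_slope * 20_000
    # Lower line through A=(0, 0.9) and B=(200_000, 0.6).
    min_slope = (0.6 - 0.9) / (200_000 - 0)
    min_intercept = 0.9

    max_valid_soh = max_slope * odometer + max_intercept
    min_valid_soh = min_slope * odometer + min_intercept
    return min_valid_soh <= soh <= max_valid_soh and 0.5 < soh < 1.0


# Hypothetical samples, not taken from the fleet data.
samples = [
    (50_000, 0.93),   # between both lines
    (50_000, 1.02),   # above the max line
    (150_000, 0.60),  # below the min line
    (10_000, 0.995),  # between both lines and below 1.0
]
results = [soh_is_valid(odometer, soh) for odometer, soh in samples]
print(results)
```

With these samples, only the first and last pairs pass; the other two fall outside the band formed by the two lines.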