# Notebook to analyse consumption 

## Conclusion

We have identified 3 main issues:

1. **Some cars have no consumption data**
- BMW: The provider does not send the data. Nothing can be done on our side, and we can’t explain the gap since the cars have recorded trips.
- Renault: The issue comes from the computation of `ODOMETER_DIFF`, which returns None or 0.
- Tesla: The pipeline did not run to completion for older Tesla data.

2. **Some consumption values are negative**

- Negative consumption values are due to negative `ODOMETER_DIFF` values. We need to investigate why the odometer difference becomes negative during the transform/processed_phases step.

3. **Some consumption values are too high**

- Excessively high consumption values occur for two main reasons:

    - A low `ODOMETER_DIFF` during a discharge phase, caused by missing data (e.g., Renault).
    - Incorrect SoC points taken from another potential phase with a lower odometer, due to data gaps.

**Decisions for the first fix:**
- Filter out phases with less than 5 km of odometer difference when computing consumption (`processed_phases` step).
- Exclude consumption values above 100 from the final computation (`result_phases` step).
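
The two decisions can be sketched on a toy DataFrame. This is only an illustration: `apply_first_fix` is a hypothetical helper, it blanks the consumption instead of dropping the row (an assumption), and it applies both rules in one place whereas the pipeline would apply them in two different steps.

```python
import pandas as pd

MIN_ODOMETER_DIFF_KM = 5  # threshold from the first decision
MAX_CONSUMPTION = 100     # threshold from the second decision


def apply_first_fix(phases: pd.DataFrame) -> pd.DataFrame:
    """Blank out CONSUMPTION for phases that fail either rule (hypothetical helper)."""
    out = phases.copy()
    # A missing or negative ODOMETER_DIFF also counts as "less than 5 km".
    too_short = out["ODOMETER_DIFF"].isna() | (out["ODOMETER_DIFF"] < MIN_ODOMETER_DIFF_KM)
    too_high = out["CONSUMPTION"] > MAX_CONSUMPTION
    out.loc[too_short | too_high, "CONSUMPTION"] = float("nan")
    return out


# Made-up values, only to exercise the two rules.
example = pd.DataFrame({
    "ODOMETER_DIFF": [2.0, 40.0, 12.0, None],
    "CONSUMPTION": [310.0, 17.5, 140.0, None],
})
fixed = apply_first_fix(example)
```

Because a negative `ODOMETER_DIFF` is below 5 km, the first rule would also hide the negative consumption values without explaining them.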


In [None]:
import pandas as pd
from core.sql_utils import get_connection
import numpy as np
import plotly.express as px
from core.s3.s3_utils import S3Service
from core.s3.settings import S3Settings
from core.spark_utils import create_spark_session
from sqlalchemy import text
from pyspark.sql.functions import col

settings = S3Settings()

spark = create_spark_session(
    settings.S3_KEY,
    settings.S3_SECRET
)

s3 = S3Service()

In [None]:
tesla_phase = s3.read_parquet_df_spark(spark, "result_phases/result_phases_tesla_fleet_telemetry.parquet")
bmw_phase = s3.read_parquet_df_spark(spark, "result_phases/result_phases_bmw.parquet")
stellantis_phase = s3.read_parquet_df_spark(spark, "result_phases/result_phases_stellantis.parquet")
ford_phase = s3.read_parquet_df_spark(spark, "result_phases/result_phases_ford.parquet")
renault_phase = s3.read_parquet_df_spark(spark, "result_phases/result_phases_renault.parquet")
kia_phase = s3.read_parquet_df_spark(spark, "result_phases/result_phases_kia.parquet")
mercedes_phase = s3.read_parquet_df_spark(spark, "result_phases/result_phases_mercedes_benz.parquet")
volkswagen_phase = s3.read_parquet_df_spark(spark, "result_phases/result_phases_volkswagen.parquet")
volvo_phase = s3.read_parquet_df_spark(spark, "result_phases/result_phases_volvo_cars.parquet")

In [None]:
with get_connection() as con:
    cursor = con.cursor()
    # The query has no placeholders, so execute() gets no second argument
    # (the original passed `con` there, which is not a parameter list).
    cursor.execute("""SELECT v.vin, make_name,  model_name, vm.type, vm.version, vd.speed, vd.consumption, vd.timestamp, b.capacity, b.battery_name FROM vehicle_data vd
        left join vehicle v
        on v.id = vd.vehicle_id
        left join vehicle_model vm
        on vm.id=vehicle_model_id
        left join battery b
        on b.id=vm.battery_id
        left join make m
        on m.id=vm.make_id;""")
    dbeaver = pd.DataFrame(cursor.fetchall(), columns=["vin", "make_name", "model_name", "type", "version", "speed", "consumption", "timestamp", "capacity", "battery_name"])

# Convert to datetime first so the sort is chronological, then drop exact duplicate rows.
dbeaver['timestamp'] = pd.to_datetime(dbeaver['timestamp'])
dbeaver = dbeaver.sort_values('timestamp')
dbeaver = dbeaver.drop_duplicates()

In [None]:
dbeaver.vin.nunique()

# Consumption study

## Current state and missing consumption

In [None]:
dbeaver[["vin", "make_name"]].drop_duplicates()['make_name'].value_counts()

In [None]:
vin_null = (
    dbeaver.groupby('vin')['consumption']
    .apply(lambda x: x.isna().all())
)

In [None]:
vin_null = vin_null[vin_null].index.tolist()

In [None]:
len(vin_null)

In [None]:
dbeaver[dbeaver['vin'].isin(vin_null)][["vin", "make_name"]].drop_duplicates()["make_name"].value_counts()

In [None]:
(dbeaver[dbeaver['vin'].isin(vin_null)][["vin", "make_name"]].drop_duplicates()["make_name"].value_counts() / dbeaver[["vin", "make_name"]].drop_duplicates()['make_name'].value_counts()
).sort_values(ascending=False)

All Tesla vehicles from before fleet-telemetry have a consumption equal to None.  
For BMW, 85% of VINs have a consumption equal to None.

In [None]:
(dbeaver[dbeaver['vin'].isin(vin_null)][["vin", "model_name"]].drop_duplicates()["model_name"].value_counts() / dbeaver[["vin", "model_name"]].drop_duplicates()['model_name'].value_counts()
).sort_values(ascending=False).head(20)

All i3 have a consumption value of None, totaling 39 VINs. The same is true for 75% of i5 and 50% of i4.
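
The per-make and per-model ratios above can also be computed in one pass. The sketch below is an assumption-laden alternative, not the notebook's method: `share_of_vins_without_consumption` is a hypothetical helper, and it assumes each VIN maps to a single make or model.

```python
import pandas as pd


def share_of_vins_without_consumption(df: pd.DataFrame, by: str) -> pd.Series:
    """Share of VINs whose consumption is missing on every row, grouped by `by`."""
    # True for a VIN only if all of its consumption values are missing.
    all_missing = df["consumption"].isna().groupby(df["vin"]).all()
    # One label (make or model) per VIN.
    label = df.groupby("vin")[by].first()
    return all_missing.astype(float).groupby(label).mean().sort_values(ascending=False)


# Toy data with placeholder VINs.
toy = pd.DataFrame({
    "vin": ["A", "A", "B", "C", "C"],
    "make_name": ["bmw", "bmw", "bmw", "tesla", "tesla"],
    "consumption": [None, None, 15.0, None, 18.0],
})
shares = share_of_vins_without_consumption(toy, "make_name")
```

Unlike dividing two `value_counts()` results, this returns 0 rather than NaN for a make or model with no affected VIN.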

In [None]:
last_date = dbeaver[dbeaver['vin'].isin(vin_null)].groupby('vin', as_index=False)[['timestamp']].max()


In [None]:
dbeaver.merge(last_date[last_date['timestamp'] > "2025-06-01"], on="vin", how="inner")[["vin", "model_name"]].drop_duplicates()["model_name"].value_counts()

### BMW

In [None]:
bmw = bmw_phase.toPandas()

In [None]:
bmw_null = (
    bmw.groupby('VIN')['CONSUMPTION']
    .apply(lambda x: x.isna().all())
)
bmw_null = bmw_null[bmw_null].index.tolist()
len(bmw_null)

We get 73 VINs with no consumption in the Postgres database, but only 39 in the `result_phases` step.
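
To see which VINs explain the gap between the two counts, the two lists can be compared as sets. This is a minimal sketch: `compare_vin_sets` is a hypothetical helper and the VINs are placeholders. In the notebook, the inputs would be the BMW subset of `vin_null` and `bmw_null`.

```python
def compare_vin_sets(db_vins, phase_vins):
    """Split VINs into those found in both sources and those found in only one."""
    db_vins, phase_vins = set(db_vins), set(phase_vins)
    return {
        "both": sorted(db_vins & phase_vins),
        "only_db": sorted(db_vins - phase_vins),
        "only_phases": sorted(phase_vins - db_vins),
    }


# Placeholder VINs, not real data.
diff = compare_vin_sets(["VIN1", "VIN2", "VIN3"], ["VIN2", "VIN4"])
```

VINs in `only_db` are either absent from `result_phases` or have at least one phase with a consumption value there.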

In [None]:
bmw[bmw['VIN'].isin(bmw_null)][['VIN', 'MODEL']].drop_duplicates()['MODEL'].value_counts()

In [None]:
bmw[['VIN', 'MODEL']].drop_duplicates()['MODEL'].value_counts()

In [None]:
bmw[bmw['MODEL']=="i3"]

In [None]:
bmw_raw = s3.read_parquet_df_spark(spark, "/raw_ts/bmw/time_series/raw_ts_spark.parquet")


In [None]:
bmw_raw_missing = bmw_raw.filter(col("VIN").isin(bmw_null)).toPandas()

In [None]:
bmw_raw_missing.isna().sum()/ bmw_raw_missing.shape[0]

BMW doesn’t send the consumption data for any of the i3s, and the same is true for the i5s with no consumption.

In [None]:
bmw_raw_missing["mileage"] = bmw_raw_missing["mileage"].astype("float")

In [None]:
bmw_raw_missing.groupby('vin').agg(
    odometer_first=("mileage", "min"),
    odometer_last=("mileage", "max")
)

The vehicles were driven, so we can’t explain why the consumption data wasn’t returned by BMW.
We used the BMW API directly, but on HM we can see that a column called `last_trip_energy_consumption` exists. It might be filled, and we could potentially compute the average consumption from that column.

### Renault

10 Megane vehicles have no consumption. Let's see why.

In [None]:
renault_null = dbeaver[dbeaver["model_name"]=='megane'].merge(last_date[last_date['timestamp'] > "2025-06-01"], on="vin", how="inner")["vin"].to_list()

In [None]:
renault_phase_missing = renault_phase.filter(col("VIN").isin(renault_null)).toPandas()

In [None]:
renault_phase_missing[['SOC_DIFF', 'SOC_LAST', 'SOC_FIRST']]

In [None]:
renault_phase_missing['BATTERY_NET_CAPACITY'].unique()

In [None]:
renault_phase_missing["ODOMETER_DIFF"].unique()

The problem is in `ODOMETER_DIFF`.

In [None]:
renault_raw = s3.read_parquet_df_spark(spark, "/raw_ts/renault/time_series/raw_ts_spark.parquet")


In [None]:
renault_raw_missing = renault_raw.filter(col("VIN").isin(renault_null)).toPandas()
renault_raw_missing[['odometer', 'battery_energy']] = renault_raw_missing[['odometer', 'battery_energy']].astype(float)

In [None]:
renault_raw_missing.vin.unique()

In [None]:
renault_raw_missing.groupby('vin').odometer.max()

For 4 of them, we could possibly compute a consumption.

In [None]:
px.scatter(renault_raw_missing[renault_raw_missing['vin']=='VF1AG000964650090'], x='date', y='battery_energy')

In [None]:
px.scatter(renault_raw_missing[renault_raw_missing['vin']=='VF1AG000964650090'], x='date', y='odometer', color='battery_energy')

In [None]:
renault_test = renault_raw_missing[renault_raw_missing['vin']=='VF1AG000964650090'].sort_values("date")

In [None]:
renault_test['odometer'] = renault_test['odometer'].ffill()

There is a calculation problem with the odometer during the phase.
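
One way such a problem can arise, shown on made-up data: if the odometer and the battery level arrive on different rows, reading the odometer on the SoC rows gives a missing difference, while forward-filling first (as in the cell above) recovers it. This assumes the phase bounds are the rows carrying a `battery_level`. How `ODOMETER_DIFF` is actually computed in the pipeline is not shown in this notebook.

```python
import pandas as pd

# Made-up time series: odometer and battery level never share a row.
ts = pd.DataFrame({
    "date": pd.to_datetime([
        "2025-03-01 08:00", "2025-03-01 08:05",
        "2025-03-01 08:20", "2025-03-01 08:25",
    ]),
    "odometer": [1000.0, None, 1012.0, None],
    "battery_level": [None, 80.0, None, 74.0],
})


def odometer_diff(df: pd.DataFrame) -> float:
    """Odometer difference between the last and first rows that carry a SoC value."""
    soc_rows = df.dropna(subset=["battery_level"])
    return soc_rows["odometer"].iloc[-1] - soc_rows["odometer"].iloc[0]


raw_diff = odometer_diff(ts)  # missing: neither SoC row has an odometer value

# Carry the last known odometer forward in time before taking the difference.
filled = ts.sort_values("date").assign(odometer=lambda d: d["odometer"].ffill())
filled_diff = odometer_diff(filled)
```

Forward-filling only helps when an odometer reading arrives between the two SoC points. Otherwise both bounds get the same value and the difference is 0.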

## Analysis

In [None]:
px.scatter(dbeaver.dropna(subset='consumption'), x='timestamp', y='consumption', color='vin')

There is a cap at 100 in the database due to the (5,2) constraint on the column.

In [None]:
df_concat = pd.concat([tesla_phase.filter(~col("VIN").isin(vin_null)).toPandas(), 
                       bmw_phase.filter(~col("VIN").isin(vin_null)).toPandas(), 
                       stellantis_phase.filter(~col("VIN").isin(vin_null)).toPandas(), 
                       ford_phase.filter(~col("VIN").isin(vin_null)).toPandas(), 
                       renault_phase.filter(~col("VIN").isin(vin_null)).toPandas(), 
                       kia_phase.filter(~col("VIN").isin(vin_null)).toPandas(), 
                       mercedes_phase.filter(~col("VIN").isin(vin_null)).toPandas(), 
                       volkswagen_phase.filter(~col("VIN").isin(vin_null)).toPandas(), 
                       volvo_phase.filter(~col("VIN").isin(vin_null)).toPandas()])

In [None]:
df_concat[df_concat['SOC_DIFF']<=0]["CONSUMPTION"].describe()

### Negative consumption

In [None]:
neg_consumption = df_concat[(df_concat['SOC_DIFF']<=0) & (df_concat['CONSUMPTION']<0)]

In [None]:
neg_consumption[['SOC_DIFF', 'ODOMETER_DIFF']]

Negative consumption values are due to negative `ODOMETER_DIFF` values.  
We will explore that in another notebook.
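
A possible first check for that follow-up, sketched on made-up values: separate the phases whose odometer bounds are themselves reversed from those where `ODOMETER_DIFF` disagrees with `ODOMETER_LAST - ODOMETER_FIRST`. `classify_negative_diffs` and the two flag columns are hypothetical names.

```python
import pandas as pd


def classify_negative_diffs(phases: pd.DataFrame) -> pd.DataFrame:
    """Flag why ODOMETER_DIFF is negative (hypothetical helper)."""
    neg = phases[phases["ODOMETER_DIFF"] < 0].copy()
    recomputed = neg["ODOMETER_LAST"] - neg["ODOMETER_FIRST"]
    # True when the last odometer value is lower than the first one.
    neg["BOUNDS_REVERSED"] = recomputed < 0
    # True when the stored difference equals last minus first.
    neg["DIFF_MATCHES_BOUNDS"] = (recomputed - neg["ODOMETER_DIFF"]).abs() < 1e-6
    return neg


# Made-up phases: one valid, one with reversed bounds, one with an inconsistent diff.
phases = pd.DataFrame({
    "ODOMETER_FIRST": [100.0, 500.0, 300.0],
    "ODOMETER_LAST": [150.0, 480.0, 320.0],
    "ODOMETER_DIFF": [50.0, -20.0, -5.0],
})
flagged = classify_negative_diffs(phases)
```

Rows with `BOUNDS_REVERSED` point at the selection of the first and last points. Rows where `DIFF_MATCHES_BOUNDS` is False point at the difference computation itself.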

### Too high consumption

In [None]:
px.scatter(df_concat[(df_concat['SOC_DIFF']<=0) & (df_concat['CONSUMPTION']>0) & (df_concat['ODOMETER_DIFF']<200)].sample(1000), x='DATETIME_BEGIN', y='CONSUMPTION', color='ODOMETER_DIFF')

In [None]:
absurd_conso = df_concat[(df_concat['SOC_DIFF']<=0) & (df_concat['CONSUMPTION']>100) & (df_concat['ODOMETER_DIFF']>5)].copy()

In [None]:
absurd_conso.MAKE.value_counts()

In [None]:
absurd_conso.loc[0][["CONSUMPTION", "SOC_DIFF", "SOC_LAST", "SOC_FIRST", "ODOMETER_DIFF", "ODOMETER_LAST", "ODOMETER_FIRST", "DATETIME_BEGIN", "DATETIME_END", "VIN"]]


In [None]:
tesla_raw = s3.read_parquet_df_spark(spark, "/raw_ts/tesla-fleet-telemetry/time_series/raw_ts_spark.parquet")


In [None]:
tesla_vin = tesla_raw.filter(col("VIN").isin(['XP7YGCEK7PB143869'])).toPandas()

In [None]:
tesla_vin['BatteryLevel'] = tesla_vin['BatteryLevel'].astype(float)
tesla_vin['Odometer'] = tesla_vin['Odometer'].astype(float)

In [None]:
px.scatter(tesla_vin[(tesla_vin['date']>'2025-04-03') & (tesla_vin['date']<'2025-04-05')], x='date', y='BatteryLevel', color='vin', hover_data={"Odometer": True})

In [None]:
tesla_vin[(tesla_vin['date']>'2025-04-03') & (tesla_vin['date']<'2025-04-05')][["Odometer", "date", "BatteryLevel", 'vin']].dropna(subset=['Odometer'])#[170:230]

Here the high consumption value is due to an error in computing the discharge. After a long period without data, we take the first point that has an odometer value, but we should take the last one before that period.
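
The difference between the two choices can be illustrated with a small lookup over sorted odometer readings. This is a sketch of one reading of the problem, not the pipeline's code. The helper names, timestamps and odometer values are all made up.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def odometer_before(readings, when):
    """Last odometer value recorded at or before `when` (readings sorted by time)."""
    times = [t for t, _ in readings]
    i = bisect_right(times, when)
    return readings[i - 1][1] if i > 0 else None


def odometer_after(readings, when):
    """First odometer value recorded at or after `when` (readings sorted by time)."""
    times = [t for t, _ in readings]
    i = bisect_left(times, when)
    return readings[i][1] if i < len(readings) else None


# Made-up readings with a long gap between the first and second point.
readings = [
    (datetime(2025, 4, 3, 8, 0), 20000.0),
    (datetime(2025, 4, 4, 17, 0), 20140.0),
    (datetime(2025, 4, 4, 18, 0), 20150.0),
]
phase_start = datetime(2025, 4, 3, 9, 0)
phase_end = datetime(2025, 4, 4, 18, 0)

end_odometer = odometer_before(readings, phase_end)
diff_with_after = end_odometer - odometer_after(readings, phase_start)
diff_with_before = end_odometer - odometer_before(readings, phase_start)
```

With the point after the gap, the phase covers only 10 km for the whole SoC drop, which inflates the consumption. With the point before the gap, it covers 150 km.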

In [None]:
renault_vin = renault_raw.filter(col("VIN").isin(['VF1AG000964535215'])).toPandas()

In [None]:
renault_vin['battery_level'] = renault_vin['battery_level'].astype(float)
renault_vin['odometer'] = renault_vin['odometer'].astype(float)

In [None]:
px.scatter(renault_vin[renault_vin['date']>'2025-03-03'], y='battery_level', x='date', color='vin')

In [None]:
renault_vin[renault_vin['date']>'2025-03-03'][["battery_level", "odometer", "date", "vin"]].head(20)

In [None]:
df_concat[(df_concat['VIN']=='VF1AG000964535215') & (df_concat['CONSUMPTION']>100)][["CONSUMPTION", "SOC_DIFF", "ODOMETER_DIFF", "DATETIME_BEGIN", "DATETIME_END", "VIN"]]

Same kind of problem, but here we only have small discharges.

