# Volvo processed_ts series Exploratory Data Analysis
The goal of this notebook is to validate a first SoH (State of Health) estimate.

## Setup

### Imports

In [None]:
from datetime import datetime as DT
import pytz

import numpy as np
import pandas as pd
from pandas import DataFrame as DF
import plotly.express as px

from core.s3_utils import S3_Bucket
from core.config import *
from core.pandas_utils import *
from transform.processed_tss.main import get_processed_tss
from transform.fleet_info.ayvens_fleet_info import fleet_info

### Data extraction

In [None]:
fleet_info.columns

In [None]:
tss = get_processed_tss("volvo-cars")
tss.columns


## Time series EDA

In [None]:
# To plot a single VIN, use:
tss_unique = tss[tss["vin"] == "YV1XZEDVEM2546269"].copy()  # a car that has good data
# To plot a random sample of VINs, use:
selected_vins = np.random.choice(tss["vin"].unique(), size=5, replace=False)
tss_sample = tss[tss["vin"].isin(selected_vins)].copy()


### Available data

Let's list the variables and their respective count ratios.

In [None]:
tss.count() / len(tss)

## Printing first graphs

In [None]:
# Create the scatter plot
fig = px.scatter(
    tss,
    x="soc",
    y="estimated_range",
    title="Estimated range vs State of Charge",
    color="vin",
    labels={
        "soc": "State of Charge (%)",
        "estimated_range": "Estimated Range (km)",
    },
    hover_data=["date"],  # Add the date to the hover info
)
fig.show()




## First attempt on the SoH



### Using the avg_electric_range_consumption

In [None]:
# How many cars have a non-null avg_electric_range_consumption?
cars_with_range = tss[tss["avg_electric_range_consumption"].notna()]['vin'].nunique()
total_cars = tss['vin'].nunique()
print(f"We have data for {cars_with_range} out of {total_cars} cars")
print(tss[tss["avg_electric_range_consumption"].notna()]['vin'].unique())

-> The data is only available for i4 cars. The avg_electric_range_consumption is not useful for the SoH calculation.

### Using the kombi_remaining_electric_range


In [None]:
# How many cars have a non-null kombi_remaining_electric_range?
cars_with_range = tss[tss["kombi_remaining_electric_range"].notna()]['vin'].nunique()
total_cars = tss['vin'].nunique()
print(f"We have data for {cars_with_range} out of {total_cars} cars")
print(tss[tss["kombi_remaining_electric_range"].notna()]['vin'].unique())


-> The data is available for all cars

In [None]:
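tss["SoH"] = tss["kombi_remaining_electric_range"] / tss["soc"]
# Rebuild the subsets so they carry the new SoH column (tss_unique was created before it existed).
tss_unique = tss[tss["vin"] == "YV1XZEDVEM2546269"].copy()
tss_sample = tss[tss["vin"].isin(selected_vins)].copy()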
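The ratio above is a range-per-SoC-point proxy and produces infinities wherever `soc` is 0. Below is a minimal sketch on made-up data, assuming `soc` is expressed in percent and that rows with a zero SoC should be masked rather than kept. The helper name `range_per_soc` and the toy values are hypothetical and not part of the project code.

```python
import numpy as np
import pandas as pd


def range_per_soc(df: pd.DataFrame, range_col: str = "kombi_remaining_electric_range") -> pd.Series:
    """Remaining range per SoC point; NaN where soc is 0, negative or missing."""
    # Mask non-positive SoC values so the division never yields +/-inf.
    soc = df["soc"].where(df["soc"] > 0)
    return df[range_col] / soc


# Toy data standing in for tss (values are invented for illustration).
toy = pd.DataFrame({
    "vin": ["A", "A", "B"],
    "soc": [50.0, 0.0, 80.0],
    "kombi_remaining_electric_range": [200.0, 0.0, 280.0],
})
toy["SoH"] = range_per_soc(toy)
print(toy)
```

Masking rather than dropping keeps the row count unchanged, so the result can be assigned straight back as a column.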

#### Study for one car


In [None]:
px.scatter(
    tss_unique,
    x="soc",
    y="SoH", 
    color="charging_method",
)

-> There doesn't seem to be any difference between the charging methods.
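A visual check like this can be backed by a per-group summary. The sketch below is only an illustration on invented data: the `charging_method` values `"AC"` and `"DC"` are assumptions, and the real column may hold different categories.

```python
import pandas as pd

# Toy data standing in for tss_unique (values are invented for illustration).
toy = pd.DataFrame({
    "charging_method": ["AC", "AC", "DC", "DC"],
    "SoH": [4.0, 4.2, 4.1, 4.1],
})

# Count, median and spread of the SoH proxy for each charging method.
summary = toy.groupby("charging_method")["SoH"].agg(["count", "median", "std"])
print(summary)
```

Similar medians and spreads across groups would support the observation above; large gaps would suggest looking at the groups separately.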

In [None]:
px.scatter(
    tss_unique,
    x="soc",
    y="SoH", 
    color="charging_status",
)

-> No difference whether the car is charging or not.


#### Study for all the cars 

In [None]:
px.scatter(
    tss,
    x="odometer",
    y="SoH", 
    color="vin",
)

In [None]:
# Recompute SoH for each entry, scaled by 100 (this overwrites the SoH column defined earlier)
tss['SoH'] = (tss['kombi_remaining_electric_range'] / tss['soc']) * 100

# Group by VIN to calculate the mean SoH and maximum odometer
aggregated_data = tss.groupby('vin').agg({
    'SoH': 'mean',
    'odometer': 'max'
}).reset_index()


# Create a scatter plot for mean SoH vs. max odometer
fig = px.scatter(
    aggregated_data,
    x='odometer',
    y='SoH',
    color="vin",
    hover_data=['vin'],
    title='Mean SoH vs Maximum Odometer per Vehicle',
    labels={
        'odometer': 'Maximum Odometer Reading',
        'SoH': 'Mean SoH (%)'
    }
)

# Show the plot
fig.show()

#### Adding filters


##### Filtering for cars with a SoC > 40%
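One possible reading of this filter is a row-level one: keep only the measurements taken above 40% SoC, then aggregate per VIN as before. The sketch below assumes that reading and runs on invented data. The constant `SOC_THRESHOLD` and the toy values are hypothetical.

```python
import pandas as pd

SOC_THRESHOLD = 40  # percent, taken from the heading above

# Toy data standing in for tss (values are invented for illustration).
toy = pd.DataFrame({
    "vin": ["A", "A", "B", "B"],
    "soc": [35.0, 60.0, 45.0, 90.0],
    "odometer": [1000, 1100, 5000, 5200],
    "SoH": [5.0, 4.0, 3.6, 3.5],
})

# Keep only the rows measured above the SoC threshold.
filtered = toy[toy["soc"] > SOC_THRESHOLD]

# Same aggregation as earlier in the notebook: mean SoH and max odometer per VIN.
per_vin = (
    filtered.groupby("vin")
    .agg(SoH=("SoH", "mean"), odometer=("odometer", "max"))
    .reset_index()
)
print(per_vin)
```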

### Using the charging 
charging_ac_ampere / charging_ac_voltage

## Data extraction pipelines comparison
Assuming that the data provided by High Mobility comes from the BMW API, we will compare two data extraction pipelines.
As of writing this notebook markdown cell, they are (give or take):
- BMW API - High Mobility - [Tom's ingestion](../../../ingestion/) - My high_mobility_raw_ts
- BMW API - Theophile's ingestion - My bmw_raw_tss - The preprocessing code cell above (unlikely to affect any values)

Let's call them the long and direct pipelines, respectively.

### Long pipeline EDA
We will extract the raw time series of all the vins, even the ones we didn't pull from the BMW API.

In [None]:
bucket = S3_Bucket()

def get_bmw_hm_raw_tss() -> DF:
    keys = bucket.list_keys("raw_ts/bmw/time_series/")
    # Only retain .parquet files
    keys = keys[keys.str.endswith(".parquet")]
    if len(keys) == 0:
        print("no keys found!!!!!!!!")
        return DF(None, columns=KEY_LIST_COLUMN_NAMES)
    # Keys are expected to look like raw_ts/<brand>/time_series/<vin>.parquet
    keys = str_split_and_retain_src(
        keys,
        "/",
        col_names=["key", "dtype_folder", "brand", "dtype_folder2", "file"]
    )
    # Read one DataFrame per file, keyed by the file stem (used as the VIN)
    raw_tss_dict = {key["file"].split(".")[0]: bucket.read_parquet_df(key["key"]) for _, key in keys.iterrows()}
    raw_tss = pd.concat(
        raw_tss_dict,
        axis="index",
        keys=raw_tss_dict.keys(),
        names=["vin", "idx"]
    )
    return raw_tss

long_raw_tss = get_bmw_hm_raw_tss()

long_raw_tss

In [None]:
long_raw_tss.count() / len(long_raw_tss)

Looking at the variables in long_raw_tss, or rather the lack thereof, it is pretty obvious that the direct pipeline is more appropriate.
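To make this comparison explicit, the per-column count ratios of both pipelines can be placed side by side. The sketch below uses invented data: the frames `direct` and `long_` and the helper `completeness` are hypothetical stand-ins for the real time series.

```python
import numpy as np
import pandas as pd


def completeness(df: pd.DataFrame) -> pd.Series:
    """Fraction of non-null values per column (same as df.count() / len(df))."""
    return df.notna().mean()


# Toy frames standing in for the direct and long pipelines' time series.
direct = pd.DataFrame({"soc": [50.0, 60.0], "odometer": [1000.0, np.nan]})
long_ = pd.DataFrame({"soc": [np.nan, 55.0]})

# One column per pipeline; variables missing from a pipeline show up as NaN.
comparison = pd.concat(
    {"direct": completeness(direct), "long": completeness(long_)},
    axis="columns",
)
print(comparison)
```

A variable that is absent from one pipeline appears as NaN in that pipeline's column, which separates "never delivered" from "delivered but mostly empty".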

## Conclusion

We have a fair number of missing values compared to the ones that we asked for in the direct data pipeline.  
The "High Mobility pipeline" is even worse, so we are already better off with the direct one.  