# BMW raw time series Exploratory Data Analysis
The goal of this notebook is to validate the integrity of the data provided by the BMW API.  
We will examine the data on its own, then compare it with the data provided by High Mobility.  

## Setup

### Imports

In [None]:
from datetime import datetime as DT
import pytz

import pandas as pd
from pandas import DataFrame as DF
import plotly.express as px

from core.s3_utils import S3_Bucket
from core.constants import *
from core.pandas_utils import *
from transform.bmw.bmw_raw_tss import get_raw_tss_without_units

### Data extraction

In [None]:
raw_tss = get_raw_tss_without_units(force_update=True)
raw_tss.columns

In [None]:
tss = (
    raw_tss.astype({
        "charging_ac_ampere": "float",
        "charging_ac_voltage": "float",
        "charging_method": "category",
        "charging_plug_connected": "category",
        "charging_status": "category",
        "coolant_temperature": "float",
        "kombi_remaining_electric_range": "float",
        "mileage": "float",
        "soc_customer_target": "float",
        "soc_hv_header": "float",
        "soc_target_charging_time_forecast": "float",
        "teleservice_status": "category",
        "vin": "category",
    })
    .assign(date_of_value=pd.to_datetime(raw_tss["date_of_value"], format='mixed'))
    .rename(columns={
        "date_of_value": "date",
        "mileage": "odometer",
        "soc_hv_header": "soc",
    })
    .sort_values(by=["vin", "date"])
)

## Time series EDA

Let's list the variables and, for each one, the ratio of non-null values to the total number of rows.

In [None]:
raw_tss.count() / len(raw_tss)
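
As a minimal sketch of what this ratio measures, the toy frame below (made-up values, not BMW data) shows that `count() / len()` is the share of non-null values per column, and that the per-VIN counts can be obtained the same way with a `groupby`.

```python
import pandas as pd

# Toy frame standing in for raw_tss; the values are invented for illustration.
toy = pd.DataFrame({
    "vin": ["A", "A", "B", "B"],
    "mileage": [100.0, None, 250.0, 260.0],
    "soc_hv_header": [None, None, None, 80.0],
})

# Share of non-null values per column, as in the cell above.
fill_ratio = toy.count() / len(toy)

# Equivalent formulation: mean of the "is not null" boolean mask.
same_ratio = toy.notna().mean()

# Non-null counts per VIN, to spot vehicles that never report a variable.
per_vin = toy.groupby("vin").count()
```

A ratio close to 0 means the variable is almost never received, which is what we are looking for here.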

In [None]:
tss.set_index("vin", drop=False)

In [None]:
! mkdir -p data_cache
var_counts = raw_tss.groupby('vin').count()
var_counts.to_csv("data_cache/var_counts_per_vin.csv")

In [None]:
px.scatter(
    tss,
    x="date",
    y="odometer",
    facet_col="vin",
    facet_col_wrap=1,
    facet_row_spacing=0.01   # Must not exceed 1 / (rows - 1), here 0.025641
).update_layout(
    height=5000,            # Adjust the height to fit the rows
)
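
The spacing bound in the comment above comes from the number of facet rows: with one row per VIN, Plotly rejects a vertical spacing larger than `1 / (rows - 1)`. The sketch below computes that bound. The helper name `max_facet_row_spacing` is hypothetical, and the count of 40 VINs is an assumption inferred from the 0.025641 value (1 / 39).

```python
def max_facet_row_spacing(n_rows: int) -> float:
    """Upper bound on the vertical spacing for a figure with n_rows facet rows."""
    if n_rows < 2:
        # A single row has no gap to constrain.
        return 1.0
    return 1.0 / (n_rows - 1)


# Assumed number of distinct VINs; in the notebook this would be tss["vin"].nunique().
n_vins = 40
limit = max_facet_row_spacing(n_vins)

# Keep the requested spacing within the bound.
spacing = min(0.01, limit)
```

Deriving the bound from the data avoids having to adjust the constant by hand when VINs are added.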

In [None]:
px.scatter(
    tss,
    x="date",
    y="soc",
    facet_col="vin",
    facet_col_wrap=1
)

We can see that the plots seem skewed.  
Let's see why.  

In [None]:
mask = tss["date"] < DT(year=2024, month=8, day=1, tzinfo=pytz.UTC)
tss[mask].count()

In [None]:
px.box(tss, x="date")

We can see that there are a few points before August, which is surprising given that the BMW POC started much later (late September).
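
To find out which vehicles these early points belong to, we can count them per VIN. This is a minimal sketch on made-up data. The cutoff reuses the 1 August 2024 date from the mask above, and the toy VINs and timestamps are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for tss; timestamps are timezone-aware (UTC).
toy = pd.DataFrame({
    "vin": ["A", "A", "B", "B"],
    "date": pd.to_datetime(
        [
            "2024-03-02T10:00:00Z",
            "2024-10-01T08:00:00Z",
            "2024-10-02T09:00:00Z",
            "2024-10-03T09:30:00Z",
        ],
        utc=True,
    ),
})

# Same cutoff as the mask used above.
cutoff = pd.Timestamp("2024-08-01", tz="UTC")

# Rows dated before the cutoff, then their count per VIN.
early = toy[toy["date"] < cutoff]
early_per_vin = early.groupby("vin").size()
```

If the early points are concentrated on a handful of VINs, the cause is more likely vehicle-specific than a problem with the ingestion as a whole.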

In [None]:
requested_vars = (
    DF.from_dict(data=VARIABLES_THAT_WE_ASKED_FOR)
    .drop(columns=["key_type"])
)

display(requested_vars)

In [None]:
received_vars = (
    tss
    .dtypes
    .to_frame("unit")
    .reset_index(drop=False)
    .rename(columns={"index": "key_name"})  # reset_index() names the former index "index"
)
display(received_vars)
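
One way to put the two tables side by side is an outer merge on the variable name. The sketch below uses toy tables in place of the real ones. `tire_pressure` is a hypothetical variable name, and the assumption is that both tables share a `key_name` column holding the raw (not yet renamed) BMW variable names.

```python
import pandas as pd

# Toy stand-ins for requested_vars and received_vars.
requested = pd.DataFrame({"key_name": ["mileage", "soc_hv_header", "tire_pressure"]})
received = pd.DataFrame({"key_name": ["mileage", "soc_hv_header", "vin"]})

# indicator=True adds a "_merge" column telling which side each row comes from.
comparison = requested.merge(received, on="key_name", how="outer", indicator=True)

# Requested but never received.
missing = comparison.loc[comparison["_merge"] == "left_only", "key_name"].tolist()

# Received without having been requested (e.g. identifier columns).
extra = comparison.loc[comparison["_merge"] == "right_only", "key_name"].tolist()
```

Since `tss` renames some columns (`mileage` to `odometer`, `soc_hv_header` to `soc`), the comparison would have to be run on the `raw_tss` column names for the keys to match.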

In [None]:
raw_tss[raw_tss["date_of_value"].isna()]
# raw_tss.query("date_of_value == 'None'")

## Data extraction pipelines comparison
Assuming that the data provided by High Mobility comes from the BMW API, we will compare the following two pipelines.    
At the time of writing, the two data extraction pipelines are (give or take):  
- BMW API -> High Mobility -> [Tom's ingestion](../../../ingestion/) -> my high_mobility_raw_ts
- BMW API -> Theophile's ingestion -> my bmw_raw_tss -> the preprocessing code cell above (unlikely to alter any values)

Let's call them the long and direct pipelines, respectively.

### Long pipeline EDA
We will extract the raw time series of all the VINs, including the ones we didn't pull from the BMW API.

In [None]:
bucket = S3_Bucket()

def get_bmw_hm_raw_tss() -> DF:
    keys = bucket.list_keys("raw_ts/bmw/time_series/")
    keys = keys[keys.str.endswith(".parquet")]
    if len(keys) == 0:
        print("no keys found!!!!!!!!")
        return DF(None, columns=KEY_LIST_COLUMN_NAMES)
    # Only .parquet files were retained above
    # Keys are organized as follows: raw_ts/bmw/time_series/<file>.parquet, the file stem being used as the VIN
    keys = str_split_and_retain_src(
        keys,
        "/",
        col_names=["key", "dtype_folder", "brand", "dtype_folder2", "file"]
    )
    )
    raw_tss_dict = {key["file"].split(".")[0]: bucket.read_parquet_df(key["key"]) for _, key in keys.iterrows()}
    raw_tss = pd.concat(
        raw_tss_dict,
        axis="index",
        keys=raw_tss_dict.keys(),
        names=["vin", "idx"]
    )
    return raw_tss

long_raw_tss = get_bmw_hm_raw_tss()

long_raw_tss
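
The function above relies on project helpers (`S3_Bucket`, `str_split_and_retain_src`). As a minimal sketch of the same idea using only pandas and the standard library, the snippet below takes the VIN from the file stem of each key and concatenates one frame per VIN into a single frame indexed by `(vin, idx)`. The keys and the frame contents are made up for illustration.

```python
from pathlib import PurePosixPath

import pandas as pd

# Hypothetical S3 keys; only the .parquet ones are time series.
keys = [
    "raw_ts/bmw/time_series/VIN_A.parquet",
    "raw_ts/bmw/time_series/VIN_B.parquet",
    "raw_ts/bmw/time_series/readme.txt",
]
parquet_keys = [key for key in keys if key.endswith(".parquet")]

# One toy frame per VIN; in the notebook these would be read from the bucket.
frames = {
    PurePosixPath(key).stem: pd.DataFrame({"soc": [50.0, 51.0]})
    for key in parquet_keys
}

# Passing a dict to concat uses its keys as the outer index level.
combined = pd.concat(frames, names=["vin", "idx"])
```

Keeping the VIN as an index level makes it straightforward to select a single vehicle with `combined.loc["VIN_A"]`.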

In [None]:
long_raw_tss.count() / len(long_raw_tss)

Looking at the variables in long_raw_tss, or rather the lack thereof, it is clear that the direct pipeline is more appropriate.  
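
To make this comparison explicit, the fill ratios of both pipelines can be placed in one table. This is a sketch on toy frames, assuming the variables carry the same names in both pipelines, which may require a renaming step in practice.

```python
import pandas as pd

# Toy stand-ins for the direct and long raw time series.
direct = pd.DataFrame({
    "soc": [80.0, None, 78.0, 77.0],
    "odometer": [100.0, 101.0, 102.0, None],
})
long = pd.DataFrame({"soc": [80.0, None, None, None]})

# One column per pipeline; a variable absent from a pipeline gets a ratio of 0.
coverage = pd.DataFrame({
    "direct": direct.notna().mean(),
    "long": long.notna().mean(),
}).fillna(0.0)
```

Each row of `coverage` then shows, for a given variable, how much of it each pipeline actually delivers.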

## Conclusion

The direct data pipeline is missing a fair number of values compared to the ones that we asked for.  
The "High Mobility pipeline" is even worse, so we are already better off with the direct one.  