# Raw tss from direct BMW responses
This notebook demonstrates how to parse the raw tss from direct BMW responses.

## Setup




### Imports

In [None]:
from functools import reduce

from core.pandas_utils import *
from core.singleton_s3_bucket import bucket

## Experimentation


Let's first take a look at an example response.

In [None]:
EXAMPLE_KEY = "response/BMW/WBY71AW000FM68170/2024-12-02.json"
response = bucket.read_json_file(EXAMPLE_KEY)
response

Let's list all the responses that we will have to parse.

In [None]:
responses = bucket.list_responses_keys_of_brand("BMW")
responses

Each response holds a list of dicts, where each dict is a single measurement (one key with its value and date) rather than a full row.    
Example:  
```python
  'date_of_value': '2024-10-13T18:16:01Z'},
 {'key': 'charging_status',
  'value': 'NOCHARGING',
  'unit': None,
  'info': None,
  'date_of_value': '2024-10-13T17:19:33Z'},
 {'key': 'mileage',
  'value': '149510.0',
  'unit': 'km',
  'info': None,
  'date_of_value': '2024-10-13T17:19:33Z'},
 {'key': 'charging_ac_ampere',
  'value': '0',
  'unit': 'A',
  'info': None,
  'date_of_value': '2024-10-13T17:19:33Z'},
 {'key': 'charging_ac_voltage',
  'value': '0.0',
  'unit': 'V',
  'info': None,
  'date_of_value': '2024-10-13T17:19:33Z'},
```
When parsing this into a dataframe, we will have to pivot it.    
The structure is convenient because it allows us to first concatenate the lists and then pivot once.
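
As a minimal sketch of the concatenate-then-pivot idea, the toy example below uses two made-up responses (the values and timestamps are illustrative, not taken from the bucket) and plain `pd.DataFrame` rather than the `DF` alias:

```python
import pandas as pd

# Hypothetical payloads: two responses, each a list of single-measurement dicts.
response_a = [
    {"key": "mileage", "value": "100.0", "date_of_value": "2024-10-13T17:00:00Z"},
    {"key": "charging_status", "value": "NOCHARGING", "date_of_value": "2024-10-13T17:00:00Z"},
]
response_b = [
    {"key": "mileage", "value": "105.0", "date_of_value": "2024-10-13T18:00:00Z"},
]

# Concatenate first (long format: one row per measurement) ...
long_df = pd.DataFrame(response_a + response_b)

# ... then pivot once (wide format: one row per timestamp, one column per key).
wide_df = long_df.pivot(index="date_of_value", columns="key", values="value")
print(wide_df)
```

Keys that were not reported at a given timestamp show up as missing values in the wide frame.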

Let's take the responses of a single vin and parse them.

In [None]:
responses_dicts = responses.query("vin == 'WBY1Z610407A12415'")["key"].apply(bucket.read_json_file)
display(responses_dicts)
cat_responses_dicts = reduce(lambda cat_rep, rep_2: cat_rep + rep_2["data"], responses_dicts, [])
display(cat_responses_dicts)
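
The `reduce` call above builds a new, longer list at every step, which can get slow when a vin has many responses. One possible alternative, sketched here on made-up payloads that only mimic the `{"data": [...]}` layout (the `soc` key is hypothetical), is to chain the inner lists and materialize them once:

```python
from functools import reduce
from itertools import chain

# Made-up responses mimicking the {"data": [...]} layout used above.
toy_responses = [
    {"data": [{"key": "mileage", "value": "100.0"}]},
    {"data": [{"key": "mileage", "value": "105.0"}, {"key": "soc", "value": "80"}]},
]

# Same approach as the cell above: each step creates a new concatenated list.
with_reduce = reduce(lambda acc, rep: acc + rep["data"], toy_responses, [])

# Alternative: iterate over the inner lists lazily and build the result once.
with_chain = list(chain.from_iterable(rep["data"] for rep in toy_responses))

# Both approaches yield the same flat list of measurement dicts.
assert with_reduce == with_chain
print(len(with_chain))
```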

Let's see what we get when we parse it.

In [None]:
unpivoted_df = DF.from_dict(cat_responses_dicts).drop(columns=["unit", "info"])
unpivoted_df

We can see that there are duplicate dates and keys.

In [None]:
unpivoted_df.drop_duplicates(subset=["date_of_value", "value"])["date_of_value"].value_counts(sort=True, ascending=False)

If we pivoted it as is, pandas would raise an error because some (date, key) pairs appear more than once.  
This is most likely because responses close to each other in time sometimes contain the same element.  
Therefore, we need to drop duplicates before pivoting.  
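
To illustrate the failure mode on a small scale, here is a sketch with a toy frame (the rows are made up to resemble the data, and plain `pd.DataFrame` is assumed instead of the `DF` alias). `pivot` raises a `ValueError` when an index/column pair is repeated, and succeeds once exact repeats are dropped:

```python
import pandas as pd

# Toy long-format frame: the same (date_of_value, key) pair appears twice,
# as when two responses fetched close together report the same element.
toy = pd.DataFrame(
    {
        "date_of_value": ["2024-11-06T15:48:10Z"] * 3,
        "key": ["mileage", "mileage", "charging_status"],
        "value": ["149510.0", "149510.0", "NOCHARGING"],
    }
)

try:
    toy.pivot(index="date_of_value", columns="key", values="value")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# Dropping exact repeats makes each (date_of_value, key) pair unique again.
deduped = toy.drop_duplicates(subset=["date_of_value", "key", "value"])
wide = deduped.pivot(index="date_of_value", columns="key", values="value")
print(wide)
```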

Here we take the most common date and view the duplicates.

In [None]:
unpivoted_df.query("date_of_value == '2024-11-06T15:48:10Z'")

We can see that there are duplicates. In the output of the cell above, the duplicated rows have equal values, so we can drop them without losing any data.

In [None]:
unpivoted_df.drop_duplicates(subset=["date_of_value", "key", "value"]).query("date_of_value == '2024-11-06T15:48:10Z'")[["date_of_value", "key"]].value_counts()

In [None]:
unpivoted_df.drop_duplicates(subset=["date_of_value", "key", "value"]).query("date_of_value == '2024-11-06T15:48:10Z'")

Let's check the fraction of rows that have a date.

In [None]:
unpivoted_df["date_of_value"].count() / len(unpivoted_df)

We can see that there are a lot of NaT date values, which would cause an equivalent loss of data.  
To remedy this, we will forward-fill and then back-fill (`ffill`, `bfill`) the dates before pivoting.  
We can afford to do this since elements that are close in the list should be close in time.
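
As a small sketch of what the fill does, the toy column below (made-up gaps, with `None` standing in for a missing `date_of_value`) is forward-filled and then back-filled. Note that the filled dates are only approximations borrowed from neighbouring elements:

```python
import pandas as pd

# Toy column with gaps; None stands in for a missing date_of_value.
dates = pd.Series(
    [None, "2024-10-13T17:19:33Z", None, None, "2024-10-13T18:16:01Z", None]
)

# ffill copies the last known date downward; bfill then fills any leading gap.
filled = dates.ffill().bfill()

# Fraction of non-missing dates before and after filling.
print(dates.count() / len(dates))
print(filled.count() / len(filled))
print(filled.tolist())
```

The leading gap is the only one that `ffill` cannot reach, which is why `bfill` is applied afterwards.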


In [None]:
unpivoted_df["date_of_value"].ffill().bfill().count() / len(unpivoted_df)

In [None]:
df = (
    unpivoted_df
    .eval("date_of_value = date_of_value.ffill().bfill()")
    .drop_duplicates(subset=["date_of_value", "key"])
    .pivot(index="date_of_value", columns="key", values="value")
)

df

## Sanity check

Let's check that the frequency of the data is correct.


In [None]:
min_date = df.reset_index()["date_of_value"].pipe(pd.to_datetime, format="mixed").min()
max_date = df.reset_index()["date_of_value"].pipe(pd.to_datetime, format="mixed").max()

duration = (max_date - min_date).total_seconds()
freq = len(df) / duration
freq * 3600

The frequency (rows per hour) is fairly low.

Let's check the number of non-null values per row.

In [None]:
df.count(axis=1).describe()

## Final implementation

Let's implement this for all the vins.

In [None]:
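def parse_responses(responses: DF) -> DF:
    """Read, concatenate and pivot all the responses of a single vin.

    `responses` is one group of the `groupby("vin")` below, so `responses.name` is the vin.
    """
    print("reading responses of", responses.name, end="")
    responses_dicts = responses["key"].apply(bucket.read_json_file)
    print(", concatenating...", end="")
    # Flatten the "data" lists of all the responses into a single list of elements.
    cat_responses_dicts = reduce(lambda cat_rep, rep_2: cat_rep + rep_2["data"], responses_dicts, [])
    print("Parsing reps.")
    return (
        DF.from_dict(cat_responses_dicts)
        .drop(columns=["unit", "info"])
        # Fill missing dates from neighbouring elements, then keep one value per (date, key).
        .eval("date_of_value = date_of_value.ffill().bfill()")
        .drop_duplicates(subset=["date_of_value", "key"])
        .pivot(index="date_of_value", columns="key", values="value")
        .assign(vin=responses.name)
    )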

raw_tss = (
    responses
    .groupby("vin")
    .apply(parse_responses, include_groups=False)
)

raw_tss

In [None]:
sanity_check(raw_tss)

In [None]:
raw_tss.drop(columns=["vin"]).reset_index(drop=False)

## Conclusion

The frequency of the data is still a bit too low.  
We need to fix the NaT dates.  
And we should be good 👍.