##### Research on BMW and Audi offerings and tiers

Based on the model listings on the manufacturers' websites ([BMW](https://www.bmw.lt/lt/all-models.html) and [Audi](https://www.audi.lt/lt/web/lt/models.html)), and for the sake of simplicity, I'll reduce all Audi and BMW models to the hierarchy shown below.

![](images/full-hierarchy.png)

__Considerations__

- This is likely not the most accurate depiction of reality, so it needs more time and research before going to production. The goal here is only to capture the main principles behind the workflow.
- I'm not sure where exactly the _BMW M series_ should fall, since [BMW themselves treat the M series as both a model and a tier](https://www.bmw.lt/lt/all-models.html).
- Some series, such as the _X series_ and _Z series_, can be broken down further (X2, X5, X6, etc.), which requires an additional layer in the BMW hierarchy.
- I'm not exactly sure how to build a hierarchy for Audi body sets and engine types; improving it needs more time and business context.
- In general, I would look for ways to split this VIN decoding task into multiple separate classification tasks. In my experience, modularity leads to better overall prediction accuracy and makes the system much easier to maintain and improve from an engineering point of view.
- For instance, a first service could detect the make of the vehicle (regex / ML model) and then call a make-specific service (most likely just ML) optimized to classify either BMW or Audi models (these would be 2 different models).
- The vehicle model prediction service would then call a service that predicts the engine type, and the same idea of drilling down into the hierarchy would extend to body sets, year, etc., until the whole report is built.
- If possible, I would aim for completely modular and isolated services / ML models that each serve only 1 function (and do that 1 function well) to ensure separation of concerns. Anything else is generally considered an anti-pattern, though more business context is needed before making such claims.
- With the proposal above, we will end up with a plethora of modular services to maintain and manage. This will also introduce additional network latency, though to my understanding the SLA requirements are not very strict, so this might not be a big problem. Once again, more business context and domain expertise are needed to check the validity of this.
- Some car makers are subsidiaries of the same company (e.g. Audi, Porsche, Škoda and SEAT are subsidiaries of Volkswagen), so maybe they follow the same standards and could be combined into 1 ML model / service?
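
The drill-down described in the bullets above could be wired together roughly as follows. This is a minimal sketch, not the actual system: the prefix patterns are taken only from the sample VINs in this notebook's data, and `detect_make`, `decode_vin` and the placeholder classifiers are hypothetical names standing in for what would be separate services.

```python
import re
from typing import Callable, Dict, Optional

# Assumption: illustrative prefixes observed in this notebook's sample VINs only.
MAKE_PATTERNS: Dict[str, re.Pattern] = {
    "BMW": re.compile(r"^(3MW|4US|5UX)"),
    "Audi": re.compile(r"^WUA"),
}


def detect_make(vin: str) -> str:
    """Step 1: regex-based make detection (could equally be an ML model)."""
    for make, pattern in MAKE_PATTERNS.items():
        if pattern.match(vin):
            return make
    return "unknown"


# Step 2: one classifier per make. These are placeholders for real ML services.
def classify_bmw_model(vin: str) -> str:
    return "bmw-model-placeholder"


def classify_audi_model(vin: str) -> str:
    return "audi-model-placeholder"


MODEL_CLASSIFIERS: Dict[str, Callable[[str], str]] = {
    "BMW": classify_bmw_model,
    "Audi": classify_audi_model,
}


def decode_vin(vin: str) -> dict:
    """Route the VIN to the make-specific classifier, if one exists."""
    make = detect_make(vin)
    classifier = MODEL_CLASSIFIERS.get(make)
    model: Optional[str] = classifier(vin) if classifier else None
    # Further stages (engine type, body set, year) would be chained here.
    return {"vin": vin, "make": make, "model": model}
```

A VIN with an unrecognized prefix comes back with `model=None` rather than a guess, which keeps each stage responsible for exactly one decision.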


__N.B.__ In this assignment, I'll focus on building just 1 machine learning model that detects the model of the vehicle from its VIN (1 model for both BMW and Audi). The same approach can be reused to build the rest of the system. The make of the vehicle can be inferred from cached key-value pairs in a key-value store such as [Redis](https://redis.io/), [DynamoDB](https://aws.amazon.com/dynamodb/) or even [Aerospike](https://aerospike.com/).
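
As a rough illustration of that lookup, the sketch below uses a plain Python dict as a stand-in for the key-value store. It assumes the make is keyed by the first three VIN characters (the World Manufacturer Identifier); the entries come only from the sample VINs in this notebook's data, and `lookup_make` is a hypothetical helper name.

```python
from typing import Mapping, Optional

# Assumption: a dict stands in for Redis / DynamoDB / Aerospike.
# Keys are WMI prefixes seen in this notebook's sample data, not a full list.
WMI_TO_MAKE = {
    "3MW": "BMW",
    "4US": "BMW",
    "5UX": "BMW",
    "WUA": "Audi",
}


def lookup_make(vin: str, store: Mapping[str, str] = WMI_TO_MAKE) -> Optional[str]:
    """Return the cached make for a VIN, or None on a cache miss."""
    wmi = vin[:3].upper()  # first three characters identify the manufacturer
    return store.get(wmi)
```

A cache miss returns `None`, so the caller can decide whether to fall back to another detection method.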

A simplified version of the same hierarchy is shown below.

![](images/simple-hierarchy.png)

#### Label preprocessing

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("../data/data_validation/vin_model_pairs_good.csv")
df = df[["vin", "model"]].drop_duplicates()
df

Unnamed: 0,vin,model
0,3MW5R1J0XM8CXXXXX,3 Series
1,4USBT33443LRXXXXX,Z4
2,5UXCR4C06M9FXXXXX,X5
3,5UXCY6C04P9PXXXXX,
4,5UXCY6C04P9PXXXXX,X6
...,...,...
2030,WUAZZZF5XJA9XXXXX,A5
2031,WUAZZZF5XJA9XXXXX,
2032,WUAZZZFX9J79XXXXX,R8
2035,WUAZZZGY4NA9XXXXX,


In [3]:
df.groupby("model")["vin"].agg("count").reset_index().rename(columns={"vin": "count"})

Unnamed: 0,model,count
0,1 Series,74
1,2 Series,25
2,3 Series,257
3,335,1
4,4,17
5,4 Series,5
6,5 Series,213
7,530D (EUR),1
8,6 Series,6
9,7 Series,36


In [4]:
# TODO: Move this to a config file
bmw_3_series = ["335"]
bmw_4_series = ["4", "428I (USA)"]
bmw_5_series = ["530D", "530D (EUR)", "M5"]
bmw_6_series = ["630dx (630dx)", "640dx (640dx)"]
audi_a5 = ["S5"]
audi_a7 = ["rs 7"]
audi_q5 = ["SQ5"]

In [5]:
# TODO: Refactor this into something less embarrassing
bmw_3_series_dict = dict.fromkeys(bmw_3_series, "3 Series")
bmw_4_series_dict = dict.fromkeys(bmw_4_series, "4 Series")
bmw_5_series_dict = dict.fromkeys(bmw_5_series, "5 Series")
bmw_6_series_dict = dict.fromkeys(bmw_6_series, "6 Series")
audi_a5_dict = dict.fromkeys(audi_a5, "A5")
audi_a7_dict = dict.fromkeys(audi_a7, "A7")
audi_q5_dict = dict.fromkeys(audi_q5, "Q5")

# Merge the per-series lookups into one raw-label -> canonical-label mapping
d1 = {
    **bmw_3_series_dict,
    **bmw_4_series_dict,
    **bmw_5_series_dict,
    **bmw_6_series_dict,
    **audi_a5_dict,
    **audi_a7_dict,
    **audi_q5_dict,
}

display(d1)

{'335': '3 Series',
 '4': '4 Series',
 '428I (USA)': '4 Series',
 '530D': '5 Series',
 '530D (EUR)': '5 Series',
 'M5': '5 Series',
 '630dx (630dx)': '6 Series',
 '640dx (640dx)': '6 Series',
 'S5': 'A5',
 'rs 7': 'A7',
 'SQ5': 'Q5'}
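
One possible way to address the two TODOs above is to keep a single dict from canonical label to raw aliases (which could live in a config file) and invert it programmatically. The sketch below is an assumption about how such a refactor might look, not the code that produced the outputs in this notebook; `MODEL_ALIASES` and `invert_aliases` are hypothetical names, and the alias lists are copied from cell `In [4]`.

```python
from typing import Dict, List

# Canonical label -> raw spellings found in the data (copied from In [4]).
MODEL_ALIASES: Dict[str, List[str]] = {
    "3 Series": ["335"],
    "4 Series": ["4", "428I (USA)"],
    "5 Series": ["530D", "530D (EUR)", "M5"],
    "6 Series": ["630dx (630dx)", "640dx (640dx)"],
    "A5": ["S5"],
    "A7": ["rs 7"],
    "Q5": ["SQ5"],
}


def invert_aliases(aliases: Dict[str, List[str]]) -> Dict[str, str]:
    """Flatten {canonical: [raw, ...]} into {raw: canonical}."""
    flat: Dict[str, str] = {}
    for canonical, raw_labels in aliases.items():
        for raw in raw_labels:
            # Fail loudly if one raw label is assigned to two canonical labels.
            if raw in flat and flat[raw] != canonical:
                raise ValueError(f"{raw!r} maps to both {flat[raw]!r} and {canonical!r}")
            flat[raw] = canonical
    return flat


alias_to_model = invert_aliases(MODEL_ALIASES)
```

The resulting `alias_to_model` has the same 11 entries as `d1` and could be passed to `df["model"].map(...)` in the same way.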

In [6]:
df["updated_model"] = df["model"].map(d1).fillna(df["model"])
display(df)

Unnamed: 0,vin,model,updated_model
0,3MW5R1J0XM8CXXXXX,3 Series,3 Series
1,4USBT33443LRXXXXX,Z4,Z4
2,5UXCR4C06M9FXXXXX,X5,X5
3,5UXCY6C04P9PXXXXX,,
4,5UXCY6C04P9PXXXXX,X6,X6
...,...,...,...
2030,WUAZZZF5XJA9XXXXX,A5,A5
2031,WUAZZZF5XJA9XXXXX,,
2032,WUAZZZFX9J79XXXXX,R8,R8
2035,WUAZZZGY4NA9XXXXX,,


In [7]:
df.groupby("updated_model")["vin"].agg("count").reset_index().rename(columns={"vin": "count"})

Unnamed: 0,updated_model,count
0,1 Series,74
1,2 Series,25
2,3 Series,258
3,4 Series,22
4,5 Series,214
5,6 Series,6
6,7 Series,36
7,8 Series,3
8,A1,12
9,A3,72


Next, individual BMW models are grouped further into their parent series.

In [8]:
bmw_x_series = ["X1", "X2", "X3", "X4", "X5", "X6",]
bmw_i = ["i3", "i8",]
bmw_z_series = ["Z3", "Z4",]

In [9]:
bmw_x_series_dict = dict.fromkeys(bmw_x_series, "X Series")
bmw_i_dict = dict.fromkeys(bmw_i, "i")
bmw_z_series_dict = dict.fromkeys(bmw_z_series, "Z Series")

d2 = {**bmw_x_series_dict,
     **bmw_i_dict,
     **bmw_z_series_dict,
}

display(d2)

{'X1': 'X Series',
 'X2': 'X Series',
 'X3': 'X Series',
 'X4': 'X Series',
 'X5': 'X Series',
 'X6': 'X Series',
 'i3': 'i',
 'i8': 'i',
 'Z3': 'Z Series',
 'Z4': 'Z Series'}

In [10]:
df["agg_model"] = df["updated_model"].map(d2).fillna(df["updated_model"])
display(df)

Unnamed: 0,vin,model,updated_model,agg_model
0,3MW5R1J0XM8CXXXXX,3 Series,3 Series,3 Series
1,4USBT33443LRXXXXX,Z4,Z4,Z Series
2,5UXCR4C06M9FXXXXX,X5,X5,X Series
3,5UXCY6C04P9PXXXXX,,,
4,5UXCY6C04P9PXXXXX,X6,X6,X Series
...,...,...,...,...
2030,WUAZZZF5XJA9XXXXX,A5,A5,A5
2031,WUAZZZF5XJA9XXXXX,,,
2032,WUAZZZFX9J79XXXXX,R8,R8,R8
2035,WUAZZZGY4NA9XXXXX,,,


In [11]:
df.groupby("agg_model")["vin"].agg("count").reset_index().rename(columns={"vin": "count"})

Unnamed: 0,agg_model,count
0,1 Series,74
1,2 Series,25
2,3 Series,258
3,4 Series,22
4,5 Series,214
5,6 Series,6
6,7 Series,36
7,8 Series,3
8,A1,12
9,A3,72


##### Impute missing values

In [12]:
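# Within each VIN, copy a known label onto rows where it is missing.
# `fillna(method=...)` is deprecated in recent pandas, so use ffill() / bfill().
df["imputed_model"] = df.groupby("vin")["agg_model"].ffill()
df["imputed_model"] = df.groupby("vin")["imputed_model"].bfill()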
df

Unnamed: 0,vin,model,updated_model,agg_model,imputed_model
0,3MW5R1J0XM8CXXXXX,3 Series,3 Series,3 Series,3 Series
1,4USBT33443LRXXXXX,Z4,Z4,Z Series,Z Series
2,5UXCR4C06M9FXXXXX,X5,X5,X Series,X Series
3,5UXCY6C04P9PXXXXX,,,,X Series
4,5UXCY6C04P9PXXXXX,X6,X6,X Series,X Series
...,...,...,...,...,...
2030,WUAZZZF5XJA9XXXXX,A5,A5,A5,A5
2031,WUAZZZF5XJA9XXXXX,,,,A5
2032,WUAZZZFX9J79XXXXX,R8,R8,R8,R8
2035,WUAZZZGY4NA9XXXXX,,,,


Some VINs have only null model values, so the imputation above leaves them unchanged. We can therefore drop all rows where the model is still NaN.
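
To make the behaviour of the two-step fill concrete, here is a small sketch on made-up data (the `VIN_A`, `VIN_B` and `VIN_C` placeholders are not real VINs). It uses the `ffill()` / `bfill()` groupby methods, which perform the same forward and backward fill within each VIN, and assumes a pandas version that provides them.

```python
import pandas as pd

# Toy data: VIN_A is missing its first label, VIN_B its second,
# and VIN_C has no label at all.
toy = pd.DataFrame(
    {
        "vin": ["VIN_A", "VIN_A", "VIN_B", "VIN_B", "VIN_C"],
        "agg_model": [None, "X Series", "A5", None, None],
    }
)

# Forward fill within each VIN, then backward fill what is still missing.
toy["imputed_model"] = toy.groupby("vin")["agg_model"].ffill()
toy["imputed_model"] = toy.groupby("vin")["imputed_model"].bfill()

# Rows that no sibling row could label remain null.
still_missing = toy[toy["imputed_model"].isna()]
```

Here `VIN_A` and `VIN_B` end up fully labelled, while `VIN_C` stays null because the fill never crosses VIN boundaries. That is the case filtered out in cell `In [13]`.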

In [13]:
final_df = df[df["imputed_model"].notnull()]
final_df = final_df[["vin", "imputed_model"]].rename(columns={"imputed_model": "model"})
final_df

Unnamed: 0,vin,model
0,3MW5R1J0XM8CXXXXX,3 Series
1,4USBT33443LRXXXXX,Z Series
2,5UXCR4C06M9FXXXXX,X Series
3,5UXCY6C04P9PXXXXX,X Series
4,5UXCY6C04P9PXXXXX,X Series
...,...,...
2028,WUAZZZF38N19XXXXX,Q3
2030,WUAZZZF5XJA9XXXXX,A5
2031,WUAZZZF5XJA9XXXXX,A5
2032,WUAZZZFX9J79XXXXX,R8


In [14]:
final_df.to_csv("../data/preprocessed_labels/vin_model_pairs_w_labels.csv", index=False)