# Electric Vehicle Recommendation System

## 🔍 Introduction

This project develops a recommendation system for electric vehicles (EVs) that personalizes suggestions based on user-defined inputs. Users specify numeric and categorical vehicle attributes through an interactive interface; these inputs are then encoded, normalized, and compared against every entry in the dataset with a similarity measure to find the most comparable EVs.

---

## 📦 Dataset Description

The dataset contains detailed specifications for various EV models, including:

- **Core Attributes:** Brand, Model, Car Body Type, Segment  
- **Battery & Range:** Battery Capacity (kWh), Number of Cells, Battery Type, Efficiency (Wh/km), Range (km)  
- **Charging:** Fast Charging Power (kW), Fast Charge Port Type  
- **Performance:** Top Speed (km/h), Acceleration 0-100 km/h (s), Torque (Nm)  
- **Practical Specs:** Towing Capacity (kg), Cargo Volume (L), Seats  
- **Dimensions:** Length, Width, Height (mm)  
- **Technical Info:** Drivetrain, Source URL  

The dataset is cleaned for consistency in the sections that follow: missing values are imputed, non-numeric cargo volumes are parsed, numeric fields are scaled, and categorical variables are one-hot encoded.

---

## 🎯 Project Objective

Build an interactive recommendation tool that:

- Collects user input with dropdowns for categorical variables and sliders (with automatic min/max) for numeric variables.
- Transforms inputs via one-hot encoding aligned with the dataset’s structure.
- Normalizes numeric values using MinMax scaling based on the original dataset ranges.
- Computes cosine similarity between user input and all dataset entries in a normalized, encoded feature space.
- Returns the top 5 most similar EVs with associated brand, model, and source URL information.


In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("urvishahir/electric-vehicle-specifications-dataset-2025")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/electric-vehicle-specifications-dataset-2025


In [None]:
import ipywidgets as widgets

import pandas as pd
import numpy as np

import re

# Only the scikit-learn tools used in this notebook are imported
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

## Data Cleaning

In [None]:
df=pd.read_csv(path+"/electric_vehicles_spec_2025.csv.csv")
df

Unnamed: 0,brand,model,top_speed_kmh,battery_capacity_kWh,battery_type,number_of_cells,torque_nm,efficiency_wh_per_km,range_km,acceleration_0_100_s,...,towing_capacity_kg,cargo_volume_l,seats,drivetrain,segment,length_mm,width_mm,height_mm,car_body_type,source_url
0,Abarth,500e Convertible,155,37.8,Lithium-ion,192.0,235.0,156,225,7.0,...,0.0,185,4,FWD,B - Compact,3673,1683,1518,Hatchback,https://ev-database.org/car/1904/Abarth-500e-C...
1,Abarth,500e Hatchback,155,37.8,Lithium-ion,192.0,235.0,149,225,7.0,...,0.0,185,4,FWD,B - Compact,3673,1683,1518,Hatchback,https://ev-database.org/car/1903/Abarth-500e-H...
2,Abarth,600e Scorpionissima,200,50.8,Lithium-ion,102.0,345.0,158,280,5.9,...,0.0,360,5,FWD,JB - Compact,4187,1779,1557,SUV,https://ev-database.org/car/3057/Abarth-600e-S...
3,Abarth,600e Turismo,200,50.8,Lithium-ion,102.0,345.0,158,280,6.2,...,0.0,360,5,FWD,JB - Compact,4187,1779,1557,SUV,https://ev-database.org/car/3056/Abarth-600e-T...
4,Aiways,U5,150,60.0,Lithium-ion,,310.0,156,315,7.5,...,,496,5,FWD,JC - Medium,4680,1865,1700,SUV,https://ev-database.org/car/1678/Aiways-U5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
473,Zeekr,7X Premium RWD,210,71.0,Lithium-ion,,440.0,148,365,6.0,...,2000.0,539,5,RWD,JD - Large,4787,1930,1650,SUV,https://ev-database.org/car/3081/Zeekr-7X-Prem...
474,Zeekr,X Core RWD (MY25),190,49.0,Lithium-ion,,343.0,148,265,5.9,...,1600.0,362,5,RWD,JB - Compact,4432,1836,1566,SUV,https://ev-database.org/car/3197/Zeekr-X-Core-RWD
475,Zeekr,X Long Range RWD (MY25),190,65.0,Lithium-ion,,343.0,146,360,5.6,...,1600.0,362,5,RWD,JB - Compact,4432,1836,1566,SUV,https://ev-database.org/car/3198/Zeekr-X-Long-...
476,Zeekr,X Privilege AWD (MY25),190,65.0,Lithium-ion,,543.0,153,350,3.8,...,1600.0,362,5,AWD,JB - Compact,4432,1836,1566,SUV,https://ev-database.org/car/3199/Zeekr-X-Privi...


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   brand                      478 non-null    object 
 1   model                      477 non-null    object 
 2   top_speed_kmh              478 non-null    int64  
 3   battery_capacity_kWh       478 non-null    float64
 4   battery_type               478 non-null    object 
 5   number_of_cells            276 non-null    float64
 6   torque_nm                  471 non-null    float64
 7   efficiency_wh_per_km       478 non-null    int64  
 8   range_km                   478 non-null    int64  
 9   acceleration_0_100_s       478 non-null    float64
 10  fast_charging_power_kw_dc  477 non-null    float64
 11  fast_charge_port           477 non-null    object 
 12  towing_capacity_kg         452 non-null    float64
 13  cargo_volume_l             477 non-null    object 

The following code performs data cleaning and missing value imputation on the DataFrame:

- Rows with missing values in the `model` or `fast_charge_port` columns are removed.
- Missing values in the `number_of_cells` and `torque_nm` numeric columns are replaced with their respective median values.
- Missing values in the `towing_capacity_kg` numeric column are replaced with zero.

This process ensures that the dataset has no missing values in critical categorical columns and that numeric missing values are imputed with appropriate defaults, maintaining consistency for subsequent analysis.

In [None]:
df = df.copy()
df = df.dropna(subset=['model', 'fast_charge_port'])  # Each of these columns has a single missing value
df['number_of_cells'] = df['number_of_cells'].fillna(df['number_of_cells'].median())
df['torque_nm'] = df['torque_nm'].fillna(df['torque_nm'].median())
df['towing_capacity_kg'] = df['towing_capacity_kg'].fillna(0)
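
As a quick sanity check, the same imputation rules can be exercised on a small made-up frame to confirm that no missing values remain. This is only a sketch: the column names mirror the dataset, but every value below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "model": ["A", None, "C", "D"],
    "fast_charge_port": ["CCS", "CCS", "CHAdeMO", "CCS"],
    "number_of_cells": [192.0, 102.0, np.nan, 288.0],
    "torque_nm": [235.0, 345.0, 310.0, np.nan],
    "towing_capacity_kg": [0.0, 500.0, np.nan, 1600.0],
})

# Drop rows that lack an identifying categorical value.
toy = toy.dropna(subset=["model", "fast_charge_port"]).copy()

# Median imputation for the two numeric specs, zero for towing capacity.
for col in ["number_of_cells", "torque_nm"]:
    toy[col] = toy[col].fillna(toy[col].median())
toy["towing_capacity_kg"] = toy["towing_capacity_kg"].fillna(0)

remaining_missing = int(toy.isna().sum().sum())
print(remaining_missing)
```

Note that the medians are computed after the rows are dropped, which matches the order of operations in the cell above.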


Next, `cargo_volume_l` needs a closer look, because pandas loaded it as an `object` column rather than a numeric one:

In [None]:
df["cargo_volume_l"].unique()

array(['185', '360', '496', '472', '400', '326', '502', '520', '535',
       '526', '511', '10 Banana Boxes', '514', '350', '405', '470', '490',
       '570', '500', '525', '440', '345', '308', '552', '235', '385',
       '540', '793', '775', '1050', '310', '460', '380', '510', '603',
       '989', '390', '620', '467', '361', '572', '536', '519', '523', nan,
       '672', '228', '333', '354', '432', '503', '438', '280', '480',
       '401', '338', '466', '505', '355', '550', '839', '435', '475',
       '309', '210', '522', '316', '611', '509', '456', '410', '249',
       '363', '479', '453', '448', '151', '270', '31 Banana Boxes', '340',
       '495', '430', '645', '13 Banana Boxes', '828', '551', '1410',
       '1030', '555', '1390', '990', '300', '200', '579', '265', '386',
       '450', '468', '415', '819', '352', '516', '267', '434', '588',
       '412', '608', '471', '348', '407', '484', '446', '366', '367',
       '420', '950', '545', '585', '313', '323', '370', '630', '441',
   

The `NaN` value has to be filled, and the "Banana Boxes" unit has to be converted to litres; one banana box is standardized here as 52 L.

- The function `parse_cargo_volume` processes each value in the column:
  - Returns `NaN` if the value is missing.
  - Converts purely numeric strings to floats.
  - Converts values formatted as "`<number> Banana Boxes`" to a numeric volume by multiplying the number by 52.
  - Returns `NaN` for any other formats.

In [None]:
def parse_cargo_volume(val):
    if pd.isna(val):
        return np.nan
    if re.match(r'^\d+$', str(val)):
        return float(val)
    match = re.match(r'(\d+)\s+Banana Boxes', str(val))
    if match:
        return int(match.group(1)) * 52
    return np.nan

df['cargo_volume_l'] = df['cargo_volume_l'].apply(parse_cargo_volume)

# Fill NaNs with the median
df['cargo_volume_l'] = df['cargo_volume_l'].fillna(df['cargo_volume_l'].median())
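
For larger tables, the same parsing rules could also be written with vectorized string operations instead of a row-wise `apply`. The sketch below is one possible variant, not the method used in this notebook; `cargo_to_litres` and `BANANA_BOX_LITRES` are hypothetical names, and the sample values are invented.

```python
import numpy as np
import pandas as pd

BANANA_BOX_LITRES = 52  # conversion factor stated in this notebook

def cargo_to_litres(series: pd.Series) -> pd.Series:
    """Hypothetical vectorized variant of the cargo-volume parsing rules."""
    text = series.astype(str)
    # Plain integers such as "185" are taken as litres.
    is_plain = text.str.fullmatch(r"\d+", na=False)
    litres = pd.to_numeric(text.where(is_plain), errors="coerce")
    # "<n> Banana Boxes" is converted with the fixed factor.
    boxes = pd.to_numeric(
        text.str.extract(r"^(\d+)\s+Banana Boxes")[0], errors="coerce"
    )
    # Anything that matches neither pattern stays NaN.
    return litres.fillna(boxes * BANANA_BOX_LITRES).astype(float)

raw = pd.Series(["185", "10 Banana Boxes", np.nan, "unknown", "31 Banana Boxes"])
parsed = cargo_to_litres(raw)
print(parsed.tolist())
```

Missing and unrecognized entries come back as `NaN`, so the median fill shown above can be applied afterwards in the same way.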

## Data Processing

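One-hot encoding is applied to the categorical EV features. The identifier columns (`brand`, `model`, `source_url`) are dropped first so that they do not influence the similarity calculation.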

In [None]:
df_encoded = df.drop(columns=['model', 'source_url', 'brand'], errors='ignore')

# Select categorical columns in df_encoded
categorical_cols = df_encoded.select_dtypes(include=['object', 'category']).columns

# One-hot encode those categorical columns, keeping every category level
df_encoded = pd.get_dummies(df_encoded, columns=categorical_cols, drop_first=False)
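
The encoded column layout matters because a single user query only yields the dummy column for the category it actually contains, so it has to be mapped onto the full set of dataset columns before comparison. A minimal sketch on invented rows (the `catalogue` and `query` frames are illustrative, not part of the dataset):

```python
import pandas as pd

# Toy catalogue with invented rows; column names mirror the dataset.
catalogue = pd.DataFrame({
    "range_km": [225, 280, 315],
    "drivetrain": ["FWD", "RWD", "AWD"],
})
catalogue_encoded = pd.get_dummies(catalogue, columns=["drivetrain"], drop_first=False)

# A single query row only produces the dummy column for its own category...
query = pd.DataFrame([{"range_km": 300, "drivetrain": "RWD"}])
query_encoded = pd.get_dummies(query, columns=["drivetrain"])

# ...so it is reindexed to the catalogue layout, filling absent dummies with 0.
query_aligned = query_encoded.reindex(columns=catalogue_encoded.columns, fill_value=0)

print(list(query_aligned.columns))
print(query_aligned.astype(int).iloc[0].tolist())
```

Using `reindex` handles both the missing columns and the column order in one step.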

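Min-max normalization is then applied to the numeric columns, so that every feature lies in the range [0, 1] and contributes on a comparable scale to the similarity calculation.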

In [None]:
# Copy the df_encoded to keep original intact
df_normalized = df_encoded.copy()

# Identify numeric columns
numeric_cols = df_normalized.select_dtypes(include=['int64', 'float64']).columns

# Initialize scaler
scaler = MinMaxScaler(clip=True)

# Scale numeric columns and assign back to df_normalized
df_normalized[numeric_cols] = scaler.fit_transform(df_normalized[numeric_cols])
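
Min-max scaling maps each value to `(x - min) / (max - min)` using the minimum and maximum seen during fitting, and `clip=True` keeps later inputs inside [0, 1] even if they fall outside that range. The following sketch illustrates both effects on invented range values; `scaler_demo` is a separate, hypothetical scaler and does not touch the one fitted above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented range values in km; min is 200 and max is 600.
ranges_km = np.array([[200.0], [300.0], [600.0]])
scaler_demo = MinMaxScaler(clip=True).fit(ranges_km)

scaled = scaler_demo.transform(ranges_km).ravel()
# A query outside the fitted range is clipped to the [0, 1] interval.
out_of_range = scaler_demo.transform(np.array([[700.0]])).ravel()

print(scaled)        # values 0.0, 0.25, 1.0
print(out_of_range)  # value 1.0 (1.25 before clipping)
```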

## User Interface Widget

The user enters the desired characteristics for their car, and cosine similarity is used to find the best matches from the 476-EV catalogue.
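
Cosine similarity measures the angle between two feature vectors: the dot product divided by the product of their norms, so 1 means the same direction and 0 means orthogonal. A small sketch with invented vectors, comparing scikit-learn's result against the formula written out by hand:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Invented 3-feature vectors standing in for encoded, normalized rows.
query = np.array([[1.0, 0.0, 1.0]])
catalogue = np.array([
    [1.0, 0.0, 1.0],   # same direction as the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [1.0, 1.0, 0.0],   # shares one of two active features
])

scores = cosine_similarity(query, catalogue)[0]

# The same quantity computed directly: u.v / (|u| * |v|)
manual = catalogue @ query[0] / (
    np.linalg.norm(catalogue, axis=1) * np.linalg.norm(query[0])
)

# Positions of the rows, most similar first.
ranking = np.argsort(scores)[::-1]
print(scores, ranking)
```

Because the measure depends on direction rather than magnitude, scaling all features to a common range beforehand keeps any single large-valued column from dominating the angle.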

In [None]:
# Use original DataFrame (df) to get min/max
numerical_cols = [
    'top_speed_kmh', 'battery_capacity_kWh', 'number_of_cells', 'torque_nm',
    'efficiency_wh_per_km', 'range_km', 'acceleration_0_100_s',
    'fast_charging_power_kw_dc', 'towing_capacity_kg', 'cargo_volume_l',
    'seats', 'length_mm', 'width_mm', 'height_mm'
]

# Create sliders dynamically
sliders = {
    col: widgets.FloatSlider(
        value=df[col].mean(),
        min=float(df[col].min()),
        max=float(df[col].max()),
        step=1.0,
        description=col,
        continuous_update=False,
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='600px')
    ) for col in numerical_cols
}

# Dropdowns for categoricals
dropdowns = {
    'battery_type': widgets.Dropdown(
        options=sorted(df['battery_type'].dropna().unique()),
        description='Battery Type'
    ),
    'fast_charge_port': widgets.Dropdown(
        options=sorted(df['fast_charge_port'].dropna().unique()),
        description='Fast Charge Port'
    ),
    'drivetrain': widgets.Dropdown(
        options=sorted(df['drivetrain'].dropna().unique()),
        description='Drivetrain'
    ),
    'segment': widgets.Dropdown(
        options=sorted(df['segment'].dropna().unique()),
        description='Segment'
    ),
    'car_body_type': widgets.Dropdown(
        options=sorted(df['car_body_type'].dropna().unique()),
        description='Car Body Type'
    )
}

# Combine and display all widgets
all_widgets = list(sliders.values()) + list(dropdowns.values())
ui = widgets.VBox(all_widgets)
display(ui)

VBox(children=(FloatSlider(value=185.67857142857142, continuous_update=False, description='top_speed_kmh', lay…

## Result

The top 5 matches are printed with brand, model, source URL, and similarity score.

In [None]:
#  Capture user input from widgets
user_input = {
    **{k: s.value for k, s in sliders.items()},
    **{k: d.value for k, d in dropdowns.items()}
}
user_df = pd.DataFrame([user_input])

# One-hot encode user input
user_encoded = pd.get_dummies(user_df)

# Align with df_normalized: add absent dummy columns as 0 and match column order
user_encoded = user_encoded.reindex(columns=df_normalized.columns, fill_value=0)

# Normalize numeric features with the scaler already fitted on the dataset
# (re-fitting a new scaler here is unnecessary and would lose clip=True)
user_encoded[numeric_cols] = scaler.transform(user_df[numeric_cols])

# Compute cosine similarity between the user vector and every catalogue row
similarities = cosine_similarity(
    user_encoded.astype(float), df_normalized.astype(float)
)[0]

# Get the positions of the top 5 most similar entries
top_indices = similarities.argsort()[::-1][:5]
df['similarity'] = similarities  # Temporarily attach similarity

# Return brand, model, source_url. top_indices are positions, not labels,
# and the index has gaps after dropna(), so iloc is used instead of loc.
result = df.iloc[top_indices][['brand', 'model', 'source_url', 'similarity']].copy()
result.reset_index(drop=True, inplace=True)

for i, row in result.iterrows():
    print(f"Match #{i+1}:")
    print(f"  Brand      : {row['brand']}")
    print(f"  Model      : {row['model']}")
    print(f"  Source URL : {row['source_url']}")
    print(f"  Similarity : {row['similarity']:.4f}")
    print("-" * 30)

Match #1:
  Brand      : MG
  Model      : Cyberster GT
  Source URL : https://ev-database.org/car/2203/MG-Cyberster-GT
  Similarity : 0.8044
------------------------------
Match #2:
  Brand      : Maserati
  Model      : GranCabrio Folgore
  Source URL : https://ev-database.org/car/2187/Maserati-GranCabrio-Folgore
  Similarity : 0.7968
------------------------------
Match #3:
  Brand      : Mercedes-Benz
  Model      : EQS SUV 450 4MATIC
  Source URL : https://ev-database.org/car/2088/Mercedes-Benz-EQS-SUV-450-4MATIC
  Similarity : 0.7376
------------------------------
Match #4:
  Brand      : Mercedes-Benz
  Model      : EQS SUV 500 4MATIC
  Source URL : https://ev-database.org/car/2089/Mercedes-Benz-EQS-SUV-500-4MATIC
  Similarity : 0.7366
------------------------------
Match #5:
  Brand      : Mercedes-Benz
  Model      : EQS SUV 580 4MATIC
  Source URL : https://ev-database.org/car/2090/Mercedes-Benz-EQS-SUV-580-4MATIC
  Similarity : 0.7358
------------------------------
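
The ranking step can also be wrapped in a small reusable function. The sketch below is one way to do that under stated assumptions: `top_k_matches` is a hypothetical helper, and the frames are invented toy data whose index deliberately has gaps, as a real frame would after `dropna()`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def top_k_matches(query_vec, feature_df, info_df, k=5):
    """Hypothetical helper: rank rows of feature_df by cosine similarity to query_vec."""
    scores = cosine_similarity(
        np.asarray(query_vec, dtype=float).reshape(1, -1),
        feature_df.to_numpy(dtype=float),
    )[0]
    order = np.argsort(scores)[::-1][:k]   # positional indices, best first
    out = info_df.iloc[order].copy()       # iloc, because index labels may have gaps
    out["similarity"] = scores[order]
    return out.reset_index(drop=True)

# Invented rows; the index skips labels 1, 3 and 4.
info = pd.DataFrame(
    {"brand": ["A", "B", "C"], "model": ["a1", "b1", "c1"]}, index=[0, 2, 5]
)
features = pd.DataFrame(
    {"f1": [1.0, 0.0, 0.6], "f2": [0.0, 1.0, 0.8]}, index=[0, 2, 5]
)

matches = top_k_matches([0.0, 1.0], features, info, k=2)
print(matches)
```

Selecting rows by position keeps the scores and the vehicle metadata aligned regardless of which rows were removed during cleaning.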


## Conclusions

- The dataset was successfully cleaned and standardized, with missing or inconsistent entries handled through targeted imputation and parsing logic.
- Categorical and numeric features were preprocessed through one-hot encoding and MinMax normalization, preparing the data for vector-based similarity comparison.
- An interactive interface was created using `ipywidgets`, allowing users to input both numeric and categorical preferences with automatic min/max range extraction and valid option filtering.
- Cosine similarity was used to identify the top 5 most similar electric vehicles based on the user's specified criteria, returning clear and structured results that include brand, model, and source URL.
- The pipeline maintains alignment between transformed input data and the dataset’s encoded structure, ensuring accurate and meaningful comparisons.

This system provides a foundation for delivering user-personalized EV recommendations using clean data, interpretable logic, and scalable similarity-based methods.