## **Electric Vehicle Population EDA**

**Objective**: Analyze the Electric Vehicle (EV) dataset to uncover patterns, trends, and insights into EV sales and the EV market. \
**Dataset**: Data collected from [data.gov](https://catalog.data.gov/dataset/electric-vehicle-population-data). \
For more information about the dataset, refer to the [README file](../data/README.md).

#### **Import Necessary Modules**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from camel_converter import to_snake

from dotenv import load_dotenv
load_dotenv()

import os
PROJECT_DIR = os.getenv("PROJECT_DIR")

# visualization helper script
import sys
sys.path.append(f"{PROJECT_DIR}/eda-and-visualization/EV-vechicle/scripts")
import visualizing as vis_helper

SyntaxError: '(' was never closed (visualizing.py, line 23)
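If `PROJECT_DIR` is missing from the environment, `os.getenv` returns `None` and the f-string paths above silently become `"None/..."`. A minimal sketch of a guard is shown below; `resolve_project_dir` and the demo path are hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_project_dir(env: Mapping[str, str] = os.environ,
                        key: str = "PROJECT_DIR") -> Path:
    """Hypothetical helper: return the project root or fail with a clear message."""
    value = env.get(key)
    if not value:
        raise RuntimeError(f"{key} is not set; add it to your .env file")
    return Path(value)


# Demo with a made-up value instead of a real .env file
demo_env = {"PROJECT_DIR": "/home/user/projects"}
scripts_dir = (
    resolve_project_dir(demo_env)
    / "eda-and-visualization" / "EV-vechicle" / "scripts"
)
print(scripts_dir.as_posix())
```

Failing early keeps a missing variable from surfacing later as a confusing "file not found" error.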

#### **Getting Started**

##### Read dataset

In [None]:
df = pd.read_csv(f"{PROJECT_DIR}/eda-and-visualization/EV-vechicle/data/ev_population.csv")

##### Dataset overview

In [None]:
df.info()

In [None]:
print(f"The DataFrame contains {df.shape[0]} rows & {df.shape[1]} columns")

In [None]:
df.head()

#### **Data cleaning**

##### Drop unwanted columns

In [None]:
df.sample()

In [None]:
unwanted_cols = ["VIN (1-10)", "DOL Vehicle ID", "2020 Census Tract", "Legislative District",
                 "Postal Code", "Vehicle Location"]
# new df without unwanted columns (columns= already implies axis=1)
df = df.drop(columns=unwanted_cols)

In [None]:
df.sample()

##### Drop duplicated rows

In [None]:
duplicate_df = df.duplicated()
print(f"The DataFrame contains {df[duplicate_df].shape[0]} duplicated rows")
# note: removing this many rows shrinks the dataset considerably, but we drop them anyway

df = df[~duplicate_df]

In [None]:
df.shape
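One point to keep in mind: once identifier columns are dropped, two distinct vehicles with the same make, model, and year look like exact duplicates. The sketch below illustrates this on made-up toy data (the `vehicle_id` column and all values are assumptions for illustration only).

```python
import pandas as pd

# Toy data (made up): the first two rows are different vehicles
# that differ only by their identifier.
toy = pd.DataFrame({
    "vehicle_id": [101, 102, 103],
    "make": ["TESLA", "TESLA", "NISSAN"],
    "model": ["MODEL 3", "MODEL 3", "LEAF"],
    "model_year": [2021, 2021, 2019],
})

# With the identifier present, no row is an exact duplicate
dupes_with_id = int(toy.duplicated().sum())

# Without the identifier, the second Tesla row is flagged as a duplicate
dupes_without_id = int(toy.drop(columns=["vehicle_id"]).duplicated().sum())

print(dupes_with_id, dupes_without_id)
```

If counts per category matter for the analysis, it may be worth checking the duplicate count before and after dropping the identifier columns.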

##### Rename column names

In [None]:
map_cols = {
    "Electric Vehicle Type" : "type",
    "Clean Alternative Fuel Vehicle (CAFV) Eligibility" : "cafv eligibility"}

# replace long column names with shorter, meaningful names
df = df.rename(columns=map_cols)

In [None]:
# convert all column names to snake_case
df.columns = [to_snake("_".join(each.lower().split(" "))) for each in df.columns]
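For readers without `camel_converter` installed, the same renaming can be approximated with a regular expression. This is a swapped-in technique rather than the notebook's method, `to_snake_case` is a hypothetical helper, and the column names are a small illustrative sample.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase the name and collapse runs of non-alphanumerics into one underscore."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


# Empty toy frame that only carries column names
toy = pd.DataFrame(columns=["Model Year", "Electric Range", "cafv eligibility", "Base MSRP"])
toy.columns = [to_snake_case(col) for col in toy.columns]

print(list(toy.columns))
```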

In [None]:
df.sample()

##### Handle invalid data points

In [None]:
for each in df.columns:
    print(each)
    print(df[each].unique())
    print("*" * 100)

In [None]:
df['base_msrp'] = df['base_msrp'].replace({0 : np.nan})
df['electric_range'] = df['electric_range'].replace({0 : np.nan})
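The replacement above treats `0` as a placeholder for "not recorded" rather than a real measurement. A small sketch on made-up values shows the effect: zeros become `NaN`, so they count as missing and no longer drag down statistics such as the mean.

```python
import numpy as np
import pandas as pd

# Toy values (made up): two real ranges and two placeholder zeros
toy = pd.DataFrame({"electric_range": [0, 215, 0, 38]})

mean_with_zeros = toy["electric_range"].mean()

toy["electric_range"] = toy["electric_range"].replace({0: np.nan})

missing_share = toy["electric_range"].isna().mean() * 100  # percent missing
mean_without_zeros = toy["electric_range"].mean()          # NaN is skipped by default

print(mean_with_zeros, missing_share, mean_without_zeros)
```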

In [None]:
cafv_map = {
    "Eligibility unknown as battery range has not been researched" : "Unknown",
    "Clean Alternative Fuel Vehicle Eligible" : "Eligible",
    "Not eligible due to low battery range" : "Not Eligible"
}

df['cafv_eligibility'] = df['cafv_eligibility'].replace(cafv_map)
df['cafv_eligibility'].value_counts()

In [None]:
type_map = {
    "Battery Electric Vehicle (BEV)" : "BEV",
    "Plug-in Hybrid Electric Vehicle (PHEV)" : "PHEV"
}
df['type'] = df['type'].replace(type_map)
df['type'].value_counts()

In [None]:
df.head()

##### Handle Null values

In [None]:
df.isna().mean() * 100     # percentage of missing values per column

In [None]:
df = df.drop(columns=["base_msrp", "electric_range"])
# Drop the columns with a large share of null values.
# We lose two important variables, but there is no better option here:
# imputing this many missing values would distort the data.
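The decision of which columns to drop can also be made explicit with a cutoff on the missing-value percentage. The sketch below uses made-up data, and the 50% threshold is an assumption, not a value taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({
    "make": ["TESLA", "NISSAN", "KIA", "BMW"],
    "base_msrp": [np.nan, np.nan, np.nan, 45000.0],
    "county": ["King", None, "Pierce", "King"],
})

MAX_MISSING_PCT = 50  # assumed cutoff; tune to the dataset

missing_pct = toy.isna().mean() * 100
cols_to_drop = missing_pct[missing_pct > MAX_MISSING_PCT].index.tolist()
reduced = toy.drop(columns=cols_to_drop)

print(missing_pct.to_dict())
print(cols_to_drop)
```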

In [None]:
plt.title("NULL VALUES HEATMAP")
sns.heatmap(data=df.isna())
plt.show()

- Only a small number of values are missing, so we can drop the affected rows.

In [None]:
df = df.dropna(how="any", axis=0)    # remove every row that contains at least one NaN value

In [None]:
df.isna().sum()

In [None]:
df.shape

##### Confirm data types

In [None]:
df.dtypes

#### **Abbreviations**

**CAFV** &nbsp; &nbsp; &nbsp;=&nbsp; Clean Alternative Fuel Vehicle \
**PHEV** &nbsp; &nbsp;=&nbsp; Plug-in Hybrid Electric Vehicle \
**BEV** &nbsp; &nbsp; &nbsp; =&nbsp; Battery Electric Vehicle

#### **Exploratory Data Analysis**

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.describe(include="O").T

In [None]:
numerical_columns = df.select_dtypes(include=np.number).columns
catagorical_columns = df.select_dtypes(include="object").columns

##### **Categorical Features**

In [None]:
plt.figure(figsize=(16,24))
vis_helper.plot_catagorical(df, columns=catagorical_columns)
plt.tight_layout()
plt.show()

##### **Numerical Features**

In [None]:
plt.figure(figsize=(16,6))

plt.suptitle("Distribution of Model Year Feature")
plt.subplot(1,2, 1)
plt.boxplot(df[numerical_columns])

plt.subplot(1,2,2)
plt.hist(df[numerical_columns])

plt.show()

### **Exploration**

**Key Questions to Explore:**
1. What percentage of EVs are CAFV-eligible?
2. Is there an increase in EV adoption over recent years?
3. Is there a hidden relationship between EV type and locality?

### **1. What percentage of EVs are CAFV-eligible?**

In [None]:
cafv_eligible_data = df['cafv_eligibility'].value_counts()

plt.title("Clean Alternative Fuel Vehicle Eligibility")
plt.bar(x=cafv_eligible_data.index, height=cafv_eligible_data.values)

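percentage = [(each/df.shape[0])*100 for each in cafv_eligible_data.values]

# bar heights are counts, so place each percentage label just above its own bar
for i, (count, pct) in enumerate(zip(cafv_eligible_data.values, percentage)):
    plt.text(i, count, f"{pct:.1f}%", ha='center', va='bottom')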
plt.show()

**About *41%* of the EVs are CAFV-eligible.**
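The percentages can also be read directly from pandas without manual division, using `value_counts(normalize=True)`. The sketch below runs on made-up labels, so its numbers do not correspond to the dataset.

```python
import pandas as pd

# Toy labels (made up), using the same category names as the cleaned column
toy = pd.Series(
    ["Eligible", "Unknown", "Eligible", "Not Eligible", "Unknown"],
    name="cafv_eligibility",
)

# normalize=True returns fractions that sum to 1; scale to percent
share = toy.value_counts(normalize=True).mul(100).round(1)

print(share)
```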

### **2. Is there an increase in EV adoption over recent years?**

In [None]:
# count rows per model year and ignore the last model year (2025)
year_by_make_count = df.groupby('model_year')['model'].count().iloc[:-1]

plt.figure(figsize=(16,6))
plt.title("EV adoption by model year")
sns.lineplot(x=year_by_make_count.index, y=year_by_make_count.values)
plt.ylabel("EV count")
plt.xlabel("Model year")
plt.show()

**EV counts increase from model year 2010 onward.**
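To put a number on the trend rather than reading it off the line plot, year-over-year growth can be computed with `Series.pct_change`. The counts below are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Toy counts per model year (made up)
counts = pd.Series(
    [100, 150, 300],
    index=pd.Index([2019, 2020, 2021], name="model_year"),
)

# Relative change versus the previous year, in percent; the first year has no predecessor
growth_pct = counts.pct_change().mul(100).round(1)

print(growth_pct)
```

Applied to the per-year counts from the cell above, this gives one growth figure per model year, which makes it easier to say where adoption accelerates.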