**EV Adoption Forecasting**
As electric vehicle (EV) adoption surges, urban planners need to anticipate infrastructure needs, especially charging stations. Inadequate planning can lead to bottlenecks that hurt user satisfaction and hinder sustainability goals.

**Problem Statement**: Using the electric vehicle dataset (which includes information on EV populations, vehicle types, and possibly historical charging usage), create a model to forecast future EV adoption. For example, predict the number of electric vehicles in upcoming years based on the trends in the data.

**Goal**: Build a regression model that forecasts future EV adoption demand based on historical trends in EV growth, types of vehicles, and regional data.

**Dataset**: This dataset shows the number of vehicles registered by the Washington State Department of Licensing (DOL) each month. The data is broken down by county, separately for passenger vehicles and trucks.

*Date*: Counts of registered vehicles are taken on this day (the end of the month). Range: 2017-01-31 to 2024-02-29.

*County*: This is the geographic region of a state that a vehicle's owner is listed to reside within. Vehicles registered in Washington

*State*: This is the geographic region of the country associated with the record. These addresses may be located in other

*Vehicle Primary Use*: This describes the primary intended use of the vehicle (Passenger: 83%, Truck: 17%).

*Battery Electric Vehicles (BEVs)*: The count of vehicles that are known to be propelled solely by an energy derived from an onboard electric battery.

*Plug-In Hybrid Electric Vehicles (PHEVs)*: The count of vehicles that are known to be propelled by energy partially sourced from an onboard electric battery.

*Electric Vehicle (EV) Total*: The sum of Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs).

*Non-Electric Vehicle Total*: The count of vehicles that are not electric vehicles.

*Total Vehicles*: All powered vehicles registered in the county. This includes electric vehicles.

*Percent Electric Vehicles*: The share of electric vehicles among all registered vehicles, in percent. For example, 7 EVs out of 467 total vehicles is reported as 1.5.
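The derived columns above appear to be linked by simple arithmetic. The sketch below checks that relationship against the five sample rows printed by `df.head()` in this notebook. The helper name `percent_electric` and the rounding to two decimals are assumptions made for illustration, not part of the dataset documentation.

```python
def percent_electric(bevs: int, phevs: int, non_electric: int) -> float:
    """Return the EV share of all registered vehicles, in percent (hypothetical helper)."""
    ev_total = bevs + phevs                    # Electric Vehicle (EV) Total
    total_vehicles = ev_total + non_electric   # Total Vehicles
    if total_vehicles == 0:
        return 0.0  # avoid dividing by zero for empty groups
    return round(100 * ev_total / total_vehicles, 2)

# (BEVs, PHEVs, Non-Electric Vehicle Total, reported Percent Electric Vehicles)
# Values copied from the df.head() output in this notebook.
sample_rows = [
    (7, 0, 460, 1.5),
    (1, 2, 188, 1.57),
    (0, 1, 32, 3.03),
    (0, 0, 3575, 0.0),
    (0, 1, 83, 1.19),
]

for bevs, phevs, non_ev, reported in sample_rows:
    print(percent_electric(bevs, phevs, non_ev), "reported:", reported)
```

If the recomputed and reported values ever disagree on the full dataset, that would be a hint that a count column was parsed incorrectly.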

**Dataset Link**: https://www.kaggle.com/datasets/sahirmaharajj/electric-vehicle-population-size-2024/data

## Importing necessary Python libraries

In [16]:
import pandas as pd       # For data manipulation and analysis
import numpy as np        # For numerical operations and arrays
import matplotlib.pyplot as plt   # For basic plotting
import seaborn as sns     # For advanced visualizations and styling
from sklearn.linear_model import LinearRegression  # For linear regression modeling
from sklearn.metrics import mean_squared_error, r2_score  # For model evaluation metrics
from sklearn.model_selection import train_test_split  # For splitting data into train/test sets

## Load Dataset

In [17]:
# Load data
df = pd.read_csv("Electric_Vehicle_Population_By_County.csv")

## Exploratory Data Analysis (EDA)

In [18]:
df.head() # top 5 rows

Unnamed: 0,Date,County,State,Vehicle Primary Use,Battery Electric Vehicles (BEVs),Plug-In Hybrid Electric Vehicles (PHEVs),Electric Vehicle (EV) Total,Non-Electric Vehicle Total,Total Vehicles,Percent Electric Vehicles
0,September 30 2022,Riverside,CA,Passenger,7,0,7,460,467,1.5
1,December 31 2022,Prince William,VA,Passenger,1,2,3,188,191,1.57
2,January 31 2020,Dakota,MN,Passenger,0,1,1,32,33,3.03
3,June 30 2022,Ferry,WA,Truck,0,0,0,3575,3575,0.0
4,July 31 2021,Douglas,CO,Passenger,0,1,1,83,84,1.19


In [19]:
# Number of rows and columns
df.shape

(20819, 10)

In [20]:
# Column data types, non-null counts, and memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20819 entries, 0 to 20818
Data columns (total 10 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Date                                      20819 non-null  object 
 1   County                                    20733 non-null  object 
 2   State                                     20733 non-null  object 
 3   Vehicle Primary Use                       20819 non-null  object 
 4   Battery Electric Vehicles (BEVs)          20819 non-null  object 
 5   Plug-In Hybrid Electric Vehicles (PHEVs)  20819 non-null  object 
 6   Electric Vehicle (EV) Total               20819 non-null  object 
 7   Non-Electric Vehicle Total                20819 non-null  object 
 8   Total Vehicles                            20819 non-null  object 
 9   Percent Electric Vehicles                 20819 non-null  float64
dtypes: float64(1), object(9)
memory us
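The `df.info()` output reports the vehicle count columns as `object` rather than a numeric dtype, so they cannot be summed or fed to a regression model as they are. One common cause is thousands separators such as `"3,575"`; that is an assumption here, since the raw strings are not shown in this notebook. A minimal sketch of a conversion, using a small made-up frame and a hypothetical helper `to_numeric_counts`:

```python
import pandas as pd

# Hypothetical stand-in for the count columns; the real values may differ.
demo = pd.DataFrame({
    "Electric Vehicle (EV) Total": ["7", "1,204", "0"],
    "Total Vehicles": ["467", "35,750", "n/a"],
})

def to_numeric_counts(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy with the given columns parsed as numbers (hypothetical helper)."""
    out = frame.copy()
    for col in columns:
        # Strip thousands separators (assumed cause of the object dtype)
        cleaned = out[col].astype(str).str.replace(",", "", regex=False)
        # Entries that still cannot be parsed become NaN instead of raising
        out[col] = pd.to_numeric(cleaned, errors="coerce")
    return out

clean = to_numeric_counts(demo, list(demo.columns))
print(clean.dtypes)
print(clean)
```

Using `errors="coerce"` keeps the conversion from failing on stray text, at the cost of introducing NaN values that then need to be counted and handled.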

In [21]:
# Count of missing values
df.isnull().sum()

Unnamed: 0,0
Date,0
County,86
State,86
Vehicle Primary Use,0
Battery Electric Vehicles (BEVs),0
Plug-In Hybrid Electric Vehicles (PHEVs),0
Electric Vehicle (EV) Total,0
Non-Electric Vehicle Total,0
Total Vehicles,0
Percent Electric Vehicles,0


Check whether the 'Percent Electric Vehicles' column contains outliers, using the interquartile range (IQR) rule.

In [22]:
# Calculate the first quartile (Q1 - 25th percentile) of the 'Percent Electric Vehicles' column
Q1 = df['Percent Electric Vehicles'].quantile(0.25)

# Calculate the third quartile (Q3 - 75th percentile) of the same column
Q3 = df['Percent Electric Vehicles'].quantile(0.75)

# Compute the Interquartile Range (IQR), which measures the spread of the middle 50% of the data
IQR = Q3 - Q1

# Define the lower boundary for outliers (1.5 times IQR below Q1)
lower_bound = Q1 - 1.5 * IQR

# Define the upper boundary for outliers (1.5 times IQR above Q3)
upper_bound = Q3 + 1.5 * IQR

# Print the calculated lower and upper bounds
print('lower_bound:', lower_bound)
print('upper_bound:', upper_bound)

# Identify outliers as values below the lower bound or above the upper bound
outliers = df[(df['Percent Electric Vehicles'] < lower_bound) | (df['Percent Electric Vehicles'] > upper_bound)]

# Print the total number of outliers detected
print("Number of outliers in 'Percent Electric Vehicles':", outliers.shape[0])


lower_bound: -3.5174999999999996
upper_bound: 6.9025
Number of outliers in 'Percent Electric Vehicles': 2476
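Because percentages cannot be negative, a lower bound of about -3.52 means that only the upper bound can flag values in this column. The same IQR rule can be wrapped in a small function so it is easy to reuse on other columns. This is a sketch: the name `iqr_bounds`, the parameter `k`, and the toy series are assumptions for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) outlier bounds using the k * IQR rule (hypothetical helper)."""
    q1 = values.quantile(0.25)  # first quartile
    q3 = values.quantile(0.75)  # third quartile
    iqr = q3 - q1               # spread of the middle 50% of the data
    return q1 - k * iqr, q3 + k * iqr

# Toy data with one obvious extreme value
demo = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 50.0])
low, high = iqr_bounds(demo)

# Values outside [low, high] are flagged as outliers
flagged = demo[(demo < low) | (demo > high)]
print("bounds:", low, high)
print("flagged:", flagged.tolist())
```

A larger `k` (for example 3.0) flags only more extreme values; 1.5 is the conventional default used in the cell above.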


## Data Preprocessing


Basic Data Cleaning

In [23]:
# Converts the "Date" column to actual datetime objects
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Remove rows where "Date" conversion failed
df = df[df['Date'].notnull()]

# Remove rows where the target (EV Total) is missing; take a copy so the
# column assignments below do not trigger SettingWithCopyWarning
df = df[df['Electric Vehicle (EV) Total'].notnull()].copy()

# Fill missing values
df['County'] = df['County'].fillna('Unknown')
df['State'] = df['State'].fillna('Unknown')

# Confirm remaining nulls
print("Missing after fill:")
print(df[['County', 'State']].isnull().sum())

df.head()

Missing after fill:
County    0
State     0
dtype: int64


Unnamed: 0,Date,County,State,Vehicle Primary Use,Battery Electric Vehicles (BEVs),Plug-In Hybrid Electric Vehicles (PHEVs),Electric Vehicle (EV) Total,Non-Electric Vehicle Total,Total Vehicles,Percent Electric Vehicles
0,2022-09-30,Riverside,CA,Passenger,7,0,7,460,467,1.5
1,2022-12-31,Prince William,VA,Passenger,1,2,3,188,191,1.57
2,2020-01-31,Dakota,MN,Passenger,0,1,1,32,33,3.03
3,2022-06-30,Ferry,WA,Truck,0,0,0,3575,3575,0.0
4,2021-07-31,Douglas,CO,Passenger,0,1,1,83,84,1.19
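The effect of `errors='coerce'` is easiest to see on a tiny example: strings that cannot be parsed become `NaT`, which the `notnull()` filter then removes. The sketch below uses made-up input and passes an explicit `format`, which is an assumption based on the date strings visible in the first `df.head()` output (such as `September 30 2022`); the notebook cell itself lets pandas infer the format.

```python
import pandas as pd

# Two dates in the style shown by df.head(), plus one deliberately bad entry
raw_dates = pd.Series(["September 30 2022", "January 31 2020", "not a date"])

# Unparseable strings become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, format="%B %d %Y", errors="coerce")

# Keep only the rows whose date was parsed successfully
valid = parsed[parsed.notnull()]

print(parsed)
print(len(raw_dates) - len(valid), "row(s) would be dropped")
```

Counting the dropped rows this way is a cheap check that the coercion is not silently discarding a large part of the data.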



Handle Outliers: cap the values at the IQR bounds instead of removing rows

In [24]:
# Cap the outliers - it keeps all the data while reducing the skew from extreme values.

df['Percent Electric Vehicles'] = np.where(df['Percent Electric Vehicles'] > upper_bound, upper_bound,
                                 np.where(df['Percent Electric Vehicles'] < lower_bound, lower_bound, df['Percent Electric Vehicles']))

# Re-check: no values should fall outside the bounds after capping
outliers = df[(df['Percent Electric Vehicles'] < lower_bound) | (df['Percent Electric Vehicles'] > upper_bound)]
print("Number of outliers in 'Percent Electric Vehicles':", outliers.shape[0])

Number of outliers in 'Percent Electric Vehicles': 0
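The capping step can be verified on a toy series. The sketch below applies the nested `np.where` from the cell above and compares it with `Series.clip`, a shorter pandas call that should give the same result. The bounds are the values printed by the IQR cell (the lower one rounded), and the sample percentages are made up for illustration.

```python
import numpy as np
import pandas as pd

# Bounds as printed by the IQR cell (lower bound rounded)
lower_bound, upper_bound = -3.5175, 6.9025

# Hypothetical percentages; two of them exceed the upper bound
pct = pd.Series([0.0, 1.5, 6.9025, 12.4, 100.0])

# Capping as written in the notebook cell
capped_where = np.where(pct > upper_bound, upper_bound,
                        np.where(pct < lower_bound, lower_bound, pct))

# Equivalent capping with Series.clip
capped_clip = pct.clip(lower=lower_bound, upper=upper_bound)

print(capped_clip.tolist())
print(int((pct > upper_bound).sum()), "value(s) capped at the upper bound")
```

Capping keeps every row but flattens the top of the distribution, so counties with a genuinely high EV share become indistinguishable from each other in this column.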
