# Incentives for Zero-Emission Vehicles (iZEV) Program

## 1. Introduction

The purpose of this project is to take the iZEV program data files and load them into a relational database through an ETL (extract, transform, load) process.

* **Programming language**: Python
* **Relational database**: Microsoft SQL Server 2019

## 2. Data information

This open dataset provides information about the Incentives for Zero-Emission Vehicles (iZEV) program, run by the Government of Canada, for the period 2019 to 2024.

This data is stored in CSV files:

* **iZEV_Data_2019**: data from May 2019 to March 2023.
* **iZEV_Data_2023**: data from April 2023 to March 2024.
* **iZEV_Data_2024**: data from April 2024 to August 2024.

**Link**: https://open.canada.ca/data/en/dataset/42986a95-be23-436e-af15-7c6bf292a2e1/resource/bba4c959-53ca-4d23-9cde-da3ce771bba2
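
The file boundaries above (April to March) match the fiscal-year values that appear in the data, for example "2019-20" for a request dated 2019-05-15. As a minimal sketch, assuming the fiscal year runs from April 1 to March 31, a label can be derived from a date as shown below. `fiscal_year_label` is a hypothetical helper and is not part of the project code.

```python
from datetime import date

def fiscal_year_label(d: date) -> str:
    """Return a label such as '2019-20', assuming the fiscal year
    starts on April 1 and ends on March 31 (assumption of this sketch)."""
    start = d.year if d.month >= 4 else d.year - 1
    return f"{start}-{(start + 1) % 100:02d}"

print(fiscal_year_label(date(2019, 5, 15)))  # 2019-20
print(fiscal_year_label(date(2024, 3, 31)))  # 2023-24
print(fiscal_year_label(date(2024, 4, 1)))   # 2024-25
```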

## 3. ETL process

In [1]:
import os
import pandas as pd

In [2]:
PATH_DATA = os.path.join(os.getenv("PATH_DATA_PROJECTS"), "Tabular", "izev")

### 3.1. Extract

In [3]:
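def get_data_preview(file_name, encoding = "utf-8"):
    '''
        Print the first five lines of a text file so that its header,
        delimiter and encoding can be checked before loading it.
    '''
    with open(file_name, "r", encoding = encoding) as file:
        for i in range(5):
            print(f"Row {i}:", file.readline())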

#### 3.1.1. Get a preview of the data

**Data 2019**

In [4]:
get_data_preview(os.path.join(PATH_DATA, "iZEV_Data_2019.csv"))

Row 0: Incentive Request Date,Month and Year ,Government of Canada Fiscal Year (FY),Calendar Year,Dealership Province / Territory ,Dealership Postal Code,Purchase or Lease,Vehicle Year,Vehicle Make,Vehicle Model,Vehicle Make & Model,"Battery-Electric Vehicle (BEV) Plug-in Hybrid Electric Vehicle (PHEV) or Fuel Cell Electric Vehicle (FCEV)","BEV/PHEV/FCEV - Battery equal to or greater than 15 kWh or Electric range equal to or greater than 50 km","BEV PHEV  ? 15 kWh or PHEV < 15 kWh (until April 24 2022) and PHEV ?  50 km or PHEV < 50 km and  FCEVs ? 50 km or FCEVs < 50 km (April 25 2022 onward)",Eligible Incentive, Amount,"Individual or Organization (Recipient)",Recipient Province / Territory ,Country,

Row 1: 2019-05-15,May 2019,2019-20,2019,British Columbia,V5C 0J4,Purchase,2019,Volkswagen,e-Golf,Volkswagen e-Golf,BEV ,YES,BEV,"5,000",Individual,British Columbia,Canada,

Row 2: 2019-05-16,May 2019,2019-20,2019,Quebec,G3M 1W1,Purchase,2019,Nissan,Leaf,Nissan Leaf,BEV ,YES,BEV,"5,000",I

**Data 2023**

In [5]:
get_data_preview(os.path.join(PATH_DATA, "iZEV_Data_2023.csv"))

Row 0: Incentive Request Date,Month and Year ,Government of Canada Fiscal Year (FY),Calendar Year,Dealership Province / Territory ,Dealership Postal Code,Purchase or Lease,Vehicle Year,Vehicle Make,Vehicle Model,Vehicle Make & Model,"Battery-Electric Vehicle (BEV) Plug-in Hybrid Electric Vehicle (PHEV) or Fuel Cell Electric Vehicle (FCEV)","BEV/PHEV/FCEV - Battery equal to or greater than 15 kWh or Electric range equal to or greater than 50 km","BEV PHEV  ? 15 kWh or PHEV < 15 kWh (until April 24 2022) and PHEV ?  50 km or PHEV < 50 km and  FCEVs ? 50 km or FCEVs < 50 km (April 25 2022 onward)",Eligible Incentive, Amount,"Individual or Organization (Recipient)",Recipient Province / Territory ,Country

Row 1: 2023-04-01,April 2023,2023-24,2023,British Columbia,V6J 1H6,Purchase,2023,Audi,Q4 50 e-tron Quattro,Audi Q4 50 e-tron Quattro,BEV,YES,BEV,"5,000",Organization,British Columbia,Canada

Row 2: 2023-04-01,April 2023,2023-24,2023,British Columbia,V6J 1H6,Purchase,2023,Audi,Q4 50 e-tron

**Data 2024**

In [6]:
get_data_preview(os.path.join(PATH_DATA, "iZEV_Data_2024.csv"), "windows-1252")

Row 0: Incentive Request Date,Month and Year ,Government of Canada Fiscal Year (FY),Calendar Year,Dealership Province / Territory ,Dealership Postal Code,Purchase or Lease,Vehicle Year,Vehicle Make,Vehicle Model,Vehicle Make & Model,"Battery-Electric Vehicle (BEV) Plug-in Hybrid Electric Vehicle (PHEV) or Fuel Cell Electric Vehicle (FCEV)","BEV/PHEV/FCEV - Battery equal to or greater than 15 kWh or Electric range equal to or greater than 50 km","BEV PHEV  ? 15 kWh or PHEV < 15 kWh (until April 24 2022) and PHEV ?  50 km or PHEV < 50 km and  FCEVs ? 50 km or FCEVs < 50 km (April 25 2022 onward)",Eligible Incentive, Amount,"Individual or Organization (Recipient)",Recipient Province / Territory ,Country

Row 1: 2024-04-01,April 2024,2024-25,2024,British Columbia,V5L 1H7,Purchase,2023,Tesla,Model Y,Tesla Model Y,BEV ,YES ,BEV,"5,000",Individual,British Columbia,Canada

Row 2: 2024-04-01,April 2024,2024-25,2024,British Columbia,V5L 1H7,Purchase,2023,Tesla,Model Y,Tesla Model Y,BEV ,YES ,BEV

**Summary:**

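* The column names are too long and complex to work with, so they need to be replaced with shorter ones.
* The lines of the 2019 file end with a trailing comma, which produces an extra empty column when the file is read.
* The 2024 file is read with the "windows-1252" encoding instead of "utf-8".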
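
The 2024 file is previewed with "windows-1252" while the other two use the default "utf-8". When the encoding of a file is not known in advance, one option is to try a short list of candidates in order. The sketch below illustrates that idea on raw bytes; `decode_with_fallback` and the sample text are assumptions made for illustration only.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "windows-1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

# Made-up sample: the same text stored with two different encodings.
as_cp1252 = "Montréal".encode("windows-1252")
as_utf8 = "Montréal".encode("utf-8")

print(decode_with_fallback(as_cp1252))  # ('Montréal', 'windows-1252')
print(decode_with_fallback(as_utf8))    # ('Montréal', 'utf-8')
```

The order matters: "utf-8" is tried first because it rejects most byte sequences written in a single-byte encoding, whereas "windows-1252" accepts almost any input.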

#### 3.1.2. Read data

In [7]:
columns = ["Request_Date", "Month_Year", "Fiscal_Year", "Calendar_Year", "Dealership_Province", "Dealership_Code", "Purchase_or_Lease", "Vehicle_Year", "Vehicle_Make", 
            "Vehicle_Model", "Vehicle_Make_Model", "BEV_PHEV_FCEV", "BEV_PHEV_FCEV_15", "BEV_PHEV_FCEV_15_2022", "Eligible_Incentive_Amount",
            "Recipient", "Recipient_Province", "Country"]

**Data 2019**

In [8]:
df_2019 = pd.read_csv(os.path.join(PATH_DATA, "iZEV_Data_2019.csv"), sep = ",", header = None, skiprows = 1)
print("Shape:", df_2019.shape)
df_2019.head()

Shape: (202206, 19)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,2019-05-15,May 2019,2019-20,2019,British Columbia,V5C 0J4,Purchase,2019,Volkswagen,e-Golf,Volkswagen e-Golf,BEV,YES,BEV,5000,Individual,British Columbia,Canada,
1,2019-05-16,May 2019,2019-20,2019,Quebec,G3M 1W1,Purchase,2019,Nissan,Leaf,Nissan Leaf,BEV,YES,BEV,5000,Individual,Quebec,Canada,
2,2019-05-16,May 2019,2019-20,2019,Quebec,G3M 1W1,Purchase,2019,Nissan,Leaf,Nissan Leaf,BEV,YES,BEV,5000,Individual,Quebec,Canada,
3,2019-05-16,May 2019,2019-20,2019,Quebec,G3M 1W1,Purchase,2019,Nissan,Leaf,Nissan Leaf,BEV,YES,BEV,5000,Individual,Quebec,Canada,
4,2019-05-16,May 2019,2019-20,2019,Quebec,J9P 7B2,Purchase,2019,Hyundai,Ioniq electric,Hyundai Ioniq electric,BEV,YES,BEV,5000,Individual,Quebec,Canada,


In [9]:
df_2019.drop(18, axis = 1, inplace = True)
df_2019.columns = columns
df_2019.head()

Unnamed: 0,Request_Date,Month_Year,Fiscal_Year,Calendar_Year,Dealership_Province,Dealership_Code,Purchase_or_Lease,Vehicle_Year,Vehicle_Make,Vehicle_Model,Vehicle_Make_Model,BEV_PHEV_FCEV,BEV_PHEV_FCEV_15,BEV_PHEV_FCEV_15_2022,Eligible_Incentive_Amount,Recipient,Recipient_Province,Country
0,2019-05-15,May 2019,2019-20,2019,British Columbia,V5C 0J4,Purchase,2019,Volkswagen,e-Golf,Volkswagen e-Golf,BEV,YES,BEV,5000,Individual,British Columbia,Canada
1,2019-05-16,May 2019,2019-20,2019,Quebec,G3M 1W1,Purchase,2019,Nissan,Leaf,Nissan Leaf,BEV,YES,BEV,5000,Individual,Quebec,Canada
2,2019-05-16,May 2019,2019-20,2019,Quebec,G3M 1W1,Purchase,2019,Nissan,Leaf,Nissan Leaf,BEV,YES,BEV,5000,Individual,Quebec,Canada
3,2019-05-16,May 2019,2019-20,2019,Quebec,G3M 1W1,Purchase,2019,Nissan,Leaf,Nissan Leaf,BEV,YES,BEV,5000,Individual,Quebec,Canada
4,2019-05-16,May 2019,2019-20,2019,Quebec,J9P 7B2,Purchase,2019,Hyundai,Ioniq electric,Hyundai Ioniq electric,BEV,YES,BEV,5000,Individual,Quebec,Canada
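
The 2019 file is read with 19 columns instead of 18 because each of its lines ends with a comma, and a trailing delimiter is parsed as one more (empty) field. The sketch below reproduces the effect with the standard `csv` module on a made-up two-column sample, then trims the empty last field, which mirrors dropping column 18 above.

```python
import csv
import io

# Made-up sample in which every line ends with a trailing comma.
sample = (
    "Request Date,Country,\n"
    "2019-05-15,Canada,\n"
)

rows = list(csv.reader(io.StringIO(sample)))
print(len(rows[0]), rows[0])  # 3 ['Request Date', 'Country', '']

# Drop the last field when it is empty.
trimmed = [row[:-1] if row and row[-1] == "" else row for row in rows]
print(trimmed[1])  # ['2019-05-15', 'Canada']
```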


**Data 2023**

In [10]:
df_2023 = pd.read_csv(os.path.join(PATH_DATA, "iZEV_Data_2023.csv"), sep = ",", header = None, skiprows = 1)
print("Shape:", df_2023.shape)
df_2023.head()

Shape: (166758, 18)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,2023-04-01,April 2023,2023-24,2023.0,British Columbia,V6J 1H6,Purchase,2023.0,Audi,Q4 50 e-tron Quattro,Audi Q4 50 e-tron Quattro,BEV,YES,BEV,5000,Organization,British Columbia,Canada
1,2023-04-01,April 2023,2023-24,2023.0,British Columbia,V6J 1H6,Purchase,2023.0,Audi,Q4 50 e-tron Quattro,Audi Q4 50 e-tron Quattro,BEV,YES,BEV,5000,Organization,British Columbia,Canada
2,2023-04-01,April 2023,2023-24,2023.0,British Columbia,V6V 2M6,Purchase,2023.0,Lexus,NX 450h+,Lexus NX 450h+,PHEV,YES,PHEVs ? 50km,5000,Individual,British Columbia,Canada
3,2023-04-01,April 2023,2023-24,2023.0,British Columbia,V6J 1H6,Lease,2023.0,Audi,Q4 50 e-tron Quattro,Audi Q4 50 e-tron Quattro,BEV,YES,BEV,5000,Individual,British Columbia,Canada
4,2023-04-01,April 2023,2023-24,2023.0,Ontario,L6L 2X6,Purchase,2023.0,Mitsubishi,Outlander PHEV,Mitsubishi Outlander PHEV,PHEV,YES,PHEVs ? 50km,5000,Individual,Ontario,Canada


In [11]:
df_2023.columns = columns
df_2023.head()

Unnamed: 0,Request_Date,Month_Year,Fiscal_Year,Calendar_Year,Dealership_Province,Dealership_Code,Purchase_or_Lease,Vehicle_Year,Vehicle_Make,Vehicle_Model,Vehicle_Make_Model,BEV_PHEV_FCEV,BEV_PHEV_FCEV_15,BEV_PHEV_FCEV_15_2022,Eligible_Incentive_Amount,Recipient,Recipient_Province,Country
0,2023-04-01,April 2023,2023-24,2023.0,British Columbia,V6J 1H6,Purchase,2023.0,Audi,Q4 50 e-tron Quattro,Audi Q4 50 e-tron Quattro,BEV,YES,BEV,5000,Organization,British Columbia,Canada
1,2023-04-01,April 2023,2023-24,2023.0,British Columbia,V6J 1H6,Purchase,2023.0,Audi,Q4 50 e-tron Quattro,Audi Q4 50 e-tron Quattro,BEV,YES,BEV,5000,Organization,British Columbia,Canada
2,2023-04-01,April 2023,2023-24,2023.0,British Columbia,V6V 2M6,Purchase,2023.0,Lexus,NX 450h+,Lexus NX 450h+,PHEV,YES,PHEVs ? 50km,5000,Individual,British Columbia,Canada
3,2023-04-01,April 2023,2023-24,2023.0,British Columbia,V6J 1H6,Lease,2023.0,Audi,Q4 50 e-tron Quattro,Audi Q4 50 e-tron Quattro,BEV,YES,BEV,5000,Individual,British Columbia,Canada
4,2023-04-01,April 2023,2023-24,2023.0,Ontario,L6L 2X6,Purchase,2023.0,Mitsubishi,Outlander PHEV,Mitsubishi Outlander PHEV,PHEV,YES,PHEVs ? 50km,5000,Individual,Ontario,Canada


**Data 2024**

In [12]:
df_2024 = pd.read_csv(os.path.join(PATH_DATA, "iZEV_Data_2024.csv"), sep = ",", header = None, skiprows = 1, encoding = "windows-1252")
print("Shape:", df_2024.shape)
df_2024.head()

Shape: (90325, 18)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,2024-04-01,April 2024,2024-25,2024,British Columbia,V5L 1H7,Purchase,2023,Tesla,Model Y,Tesla Model Y,BEV,YES,BEV,5000,Individual,British Columbia,Canada
1,2024-04-01,April 2024,2024-25,2024,British Columbia,V5L 1H7,Purchase,2023,Tesla,Model Y,Tesla Model Y,BEV,YES,BEV,5000,Individual,British Columbia,Canada
2,2024-04-01,April 2024,2024-25,2024,British Columbia,V5L 1H7,Purchase,2023,Tesla,Model Y,Tesla Model Y,BEV,YES,BEV,5000,Individual,British Columbia,Canada
3,2024-04-01,April 2024,2024-25,2024,British Columbia,V5L 1H7,Purchase,2023,Tesla,Model Y,Tesla Model Y,BEV,YES,BEV,5000,Individual,British Columbia,Canada
4,2024-04-01,April 2024,2024-25,2024,British Columbia,V5L 1H7,Purchase,2023,Tesla,Model Y,Tesla Model Y,BEV,YES,BEV,5000,Individual,British Columbia,Canada


In [13]:
df_2024.columns = columns
df_2024.head()

Unnamed: 0,Request_Date,Month_Year,Fiscal_Year,Calendar_Year,Dealership_Province,Dealership_Code,Purchase_or_Lease,Vehicle_Year,Vehicle_Make,Vehicle_Model,Vehicle_Make_Model,BEV_PHEV_FCEV,BEV_PHEV_FCEV_15,BEV_PHEV_FCEV_15_2022,Eligible_Incentive_Amount,Recipient,Recipient_Province,Country
0,2024-04-01,April 2024,2024-25,2024,British Columbia,V5L 1H7,Purchase,2023,Tesla,Model Y,Tesla Model Y,BEV,YES,BEV,5000,Individual,British Columbia,Canada
1,2024-04-01,April 2024,2024-25,2024,British Columbia,V5L 1H7,Purchase,2023,Tesla,Model Y,Tesla Model Y,BEV,YES,BEV,5000,Individual,British Columbia,Canada
2,2024-04-01,April 2024,2024-25,2024,British Columbia,V5L 1H7,Purchase,2023,Tesla,Model Y,Tesla Model Y,BEV,YES,BEV,5000,Individual,British Columbia,Canada
3,2024-04-01,April 2024,2024-25,2024,British Columbia,V5L 1H7,Purchase,2023,Tesla,Model Y,Tesla Model Y,BEV,YES,BEV,5000,Individual,British Columbia,Canada
4,2024-04-01,April 2024,2024-25,2024,British Columbia,V5L 1H7,Purchase,2023,Tesla,Model Y,Tesla Model Y,BEV,YES,BEV,5000,Individual,British Columbia,Canada


### 3.2. Transform

#### 3.2.1. Data profiling

##### 3.2.1.1. Identify missing values

**Data 2019**

In [14]:
print("Total missing values by columns:")
df_2019.isnull().sum()

Total missing values by columns:


Request_Date                 0
Month_Year                   0
Fiscal_Year                  0
Calendar_Year                0
Dealership_Province          0
Dealership_Code              0
Purchase_or_Lease            0
Vehicle_Year                 0
Vehicle_Make                 0
Vehicle_Model                0
Vehicle_Make_Model           0
BEV_PHEV_FCEV                0
BEV_PHEV_FCEV_15             0
BEV_PHEV_FCEV_15_2022        0
Eligible_Incentive_Amount    0
Recipient                    0
Recipient_Province           0
Country                      0
dtype: int64

**Summary:**

* There are no missing values.

**Data 2023**

In [15]:
print("Total missing values by columns:")
df_2023.isnull().sum()

Total missing values by columns:


Request_Date                 1
Month_Year                   1
Fiscal_Year                  1
Calendar_Year                1
Dealership_Province          1
Dealership_Code              1
Purchase_or_Lease            1
Vehicle_Year                 1
Vehicle_Make                 1
Vehicle_Model                1
Vehicle_Make_Model           1
BEV_PHEV_FCEV                1
BEV_PHEV_FCEV_15             1
BEV_PHEV_FCEV_15_2022        1
Eligible_Incentive_Amount    1
Recipient                    1
Recipient_Province           1
Country                      1
dtype: int64

**Summary:**

* There is an entire row of missing values.
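
One missing value in every column is what a single fully empty row would produce. A line made only of delimiters is one way such a row can appear; that is an assumption about the file, not something verified here. The sketch below counts rows of that kind with the standard `csv` module; `count_blank_rows` and the sample are hypothetical.

```python
import csv
import io

def count_blank_rows(text: str) -> int:
    """Count data rows whose fields are all empty (ignoring whitespace)."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header line
    return sum(1 for row in reader if row and all(f.strip() == "" for f in row))

# Made-up sample: the second data line contains only delimiters.
sample = "a,b,c\n1,2,3\n,,\n4,5,6\n"
print(count_blank_rows(sample))  # 1
```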

**Data 2024**

In [16]:
print("Total missing values by columns:")
df_2024.isnull().sum()

Total missing values by columns:


Request_Date                 0
Month_Year                   0
Fiscal_Year                  0
Calendar_Year                0
Dealership_Province          0
Dealership_Code              0
Purchase_or_Lease            0
Vehicle_Year                 0
Vehicle_Make                 0
Vehicle_Model                0
Vehicle_Make_Model           0
BEV_PHEV_FCEV                0
BEV_PHEV_FCEV_15             0
BEV_PHEV_FCEV_15_2022        0
Eligible_Incentive_Amount    0
Recipient                    0
Recipient_Province           0
Country                      0
dtype: int64

**Summary:**

* There are no missing values.

##### 3.2.1.2. Identify duplicate values

**Data 2019**

In [17]:
print(f"Total duplicated values {df_2019.duplicated().sum()} from {df_2019.duplicated().count()}")

Total duplicated values 75232 from 202206


**Data 2023**

In [18]:
print(f"Total duplicated values {df_2023.duplicated().sum()} from {df_2023.duplicated().count()}")

Total duplicated values 69434 from 166758


**Data 2024**

In [19]:
print(f"Total duplicated values {df_2024.duplicated().sum()} from {df_2024.duplicated().count()}")

Total duplicated values 30231 from 90325


##### 3.2.1.3. Validate data consistency 

In [20]:
from datetime import datetime

In [21]:
def validate_date(date_string):
    '''
        Return True if the string is a valid date in YYYY-MM-DD format.
    '''
    try:
        datetime.strptime(date_string, "%Y-%m-%d")
        return True
    except ValueError:
        return False

In [22]:
def validate_number(number_string):
    '''
        Return True if the string can be converted to a float.
    '''
    try:
        float(number_string)
        return True
    except ValueError:
        return False
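
As a quick usage sketch, the two validators are restated below so that the snippet runs on its own, and then called on a few made-up inputs. The last call shows why an amount written with a thousands separator is reported as invalid.

```python
from datetime import datetime

def validate_date(date_string):
    # True only for strings in YYYY-MM-DD format.
    try:
        datetime.strptime(date_string, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def validate_number(number_string):
    # True only for strings that float() can parse.
    try:
        float(number_string)
        return True
    except ValueError:
        return False

print(validate_date("2019-05-15"))  # True
print(validate_date("15/05/2019"))  # False
print(validate_number("625"))       # True
print(validate_number("5,000"))     # False, float() does not accept the comma
```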

**Data 2019**

In [23]:
df_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202206 entries, 0 to 202205
Data columns (total 18 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   Request_Date               202206 non-null  object
 1   Month_Year                 202206 non-null  object
 2   Fiscal_Year                202206 non-null  object
 3   Calendar_Year              202206 non-null  int64 
 4   Dealership_Province        202206 non-null  object
 5   Dealership_Code            202206 non-null  object
 6   Purchase_or_Lease          202206 non-null  object
 7   Vehicle_Year               202206 non-null  int64 
 8   Vehicle_Make               202206 non-null  object
 9   Vehicle_Model              202206 non-null  object
 10  Vehicle_Make_Model         202206 non-null  object
 11  BEV_PHEV_FCEV              202206 non-null  object
 12  BEV_PHEV_FCEV_15           202206 non-null  object
 13  BEV_PHEV_FCEV_15_2022      202206 non-null  

In [24]:
df_2019.head()

Unnamed: 0,Request_Date,Month_Year,Fiscal_Year,Calendar_Year,Dealership_Province,Dealership_Code,Purchase_or_Lease,Vehicle_Year,Vehicle_Make,Vehicle_Model,Vehicle_Make_Model,BEV_PHEV_FCEV,BEV_PHEV_FCEV_15,BEV_PHEV_FCEV_15_2022,Eligible_Incentive_Amount,Recipient,Recipient_Province,Country
0,2019-05-15,May 2019,2019-20,2019,British Columbia,V5C 0J4,Purchase,2019,Volkswagen,e-Golf,Volkswagen e-Golf,BEV,YES,BEV,5000,Individual,British Columbia,Canada
1,2019-05-16,May 2019,2019-20,2019,Quebec,G3M 1W1,Purchase,2019,Nissan,Leaf,Nissan Leaf,BEV,YES,BEV,5000,Individual,Quebec,Canada
2,2019-05-16,May 2019,2019-20,2019,Quebec,G3M 1W1,Purchase,2019,Nissan,Leaf,Nissan Leaf,BEV,YES,BEV,5000,Individual,Quebec,Canada
3,2019-05-16,May 2019,2019-20,2019,Quebec,G3M 1W1,Purchase,2019,Nissan,Leaf,Nissan Leaf,BEV,YES,BEV,5000,Individual,Quebec,Canada
4,2019-05-16,May 2019,2019-20,2019,Quebec,J9P 7B2,Purchase,2019,Hyundai,Ioniq electric,Hyundai Ioniq electric,BEV,YES,BEV,5000,Individual,Quebec,Canada


In [25]:
df_2019["Request_Date_OK"] = df_2019["Request_Date"].apply(validate_date)
df_2019["Request_Date_OK"].value_counts()

Request_Date_OK
True    202206
Name: count, dtype: int64

In [26]:
df_2019["Eligible_Incentive_Amount_OK"] = df_2019["Eligible_Incentive_Amount"].apply(validate_number)
df_2019["Eligible_Incentive_Amount_OK"].value_counts()

Eligible_Incentive_Amount_OK
False    202188
True         18
Name: count, dtype: int64

In [27]:
df_2019[df_2019["Eligible_Incentive_Amount_OK"] == False].head(1)["Eligible_Incentive_Amount"]

0    5,000
Name: Eligible_Incentive_Amount, dtype: object

In [28]:
df_2019[df_2019["Eligible_Incentive_Amount_OK"] == True].head(1)["Eligible_Incentive_Amount"]

7026    625
Name: Eligible_Incentive_Amount, dtype: object

**Data 2023**

In [29]:
df_2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 166758 entries, 0 to 166757
Data columns (total 18 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Request_Date               166757 non-null  object 
 1   Month_Year                 166757 non-null  object 
 2   Fiscal_Year                166757 non-null  object 
 3   Calendar_Year              166757 non-null  float64
 4   Dealership_Province        166757 non-null  object 
 5   Dealership_Code            166757 non-null  object 
 6   Purchase_or_Lease          166757 non-null  object 
 7   Vehicle_Year               166757 non-null  float64
 8   Vehicle_Make               166757 non-null  object 
 9   Vehicle_Model              166757 non-null  object 
 10  Vehicle_Make_Model         166757 non-null  object 
 11  BEV_PHEV_FCEV              166757 non-null  object 
 12  BEV_PHEV_FCEV_15           166757 non-null  object 
 13  BEV_PHEV_FCEV_15_2022      16

In [30]:
df_2023["Request_Date_OK"] = df_2023[df_2023["Request_Date"].notna()]["Request_Date"].apply(validate_date)
df_2023["Request_Date_OK"].value_counts()

Request_Date_OK
True    166757
Name: count, dtype: int64

In [31]:
df_2023["Eligible_Incentive_Amount_OK"] = df_2023[df_2023["Eligible_Incentive_Amount"].notna()]["Eligible_Incentive_Amount"].apply(validate_number)
df_2023["Eligible_Incentive_Amount_OK"].value_counts()

Eligible_Incentive_Amount_OK
False    166746
True         11
Name: count, dtype: int64

In [32]:
df_2023[df_2023["Eligible_Incentive_Amount_OK"] == False].head(1)["Eligible_Incentive_Amount"]

0    5,000
Name: Eligible_Incentive_Amount, dtype: object

In [33]:
df_2023[df_2023["Eligible_Incentive_Amount_OK"] == True].head(1)["Eligible_Incentive_Amount"]

1027    625
Name: Eligible_Incentive_Amount, dtype: object

**Data 2024**

In [34]:
df_2024.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90325 entries, 0 to 90324
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Request_Date               90325 non-null  object
 1   Month_Year                 90325 non-null  object
 2   Fiscal_Year                90325 non-null  object
 3   Calendar_Year              90325 non-null  int64 
 4   Dealership_Province        90325 non-null  object
 5   Dealership_Code            90325 non-null  object
 6   Purchase_or_Lease          90325 non-null  object
 7   Vehicle_Year               90325 non-null  int64 
 8   Vehicle_Make               90325 non-null  object
 9   Vehicle_Model              90325 non-null  object
 10  Vehicle_Make_Model         90325 non-null  object
 11  BEV_PHEV_FCEV              90325 non-null  object
 12  BEV_PHEV_FCEV_15           90325 non-null  object
 13  BEV_PHEV_FCEV_15_2022      90325 non-null  object
 14  Eligib

In [35]:
df_2024["Request_Date_OK"] = df_2024["Request_Date"].apply(validate_date)
df_2024["Request_Date_OK"].value_counts()

Request_Date_OK
True    90325
Name: count, dtype: int64

In [36]:
df_2024["Eligible_Incentive_Amount_OK"] = df_2024["Eligible_Incentive_Amount"].apply(validate_number)
df_2024["Eligible_Incentive_Amount_OK"].value_counts()

Eligible_Incentive_Amount_OK
False    90322
True         3
Name: count, dtype: int64

In [37]:
df_2024[df_2024["Eligible_Incentive_Amount_OK"] == False].head(1)["Eligible_Incentive_Amount"]

0    5,000
Name: Eligible_Incentive_Amount, dtype: object

In [38]:
df_2024[df_2024["Eligible_Incentive_Amount_OK"] == True].head(1)["Eligible_Incentive_Amount"]

21811    625
Name: Eligible_Incentive_Amount, dtype: object

#### 3.2.2. Cleaning

##### 3.2.2.1. Remove missing values

**Data 2023**

In [39]:
# Remove rows where all values are NaN
df_2023 = df_2023.dropna(how = "all")

print("Total missing values by columns:")
df_2023.isnull().sum()

Total missing values by columns:


Request_Date                    0
Month_Year                      0
Fiscal_Year                     0
Calendar_Year                   0
Dealership_Province             0
Dealership_Code                 0
Purchase_or_Lease               0
Vehicle_Year                    0
Vehicle_Make                    0
Vehicle_Model                   0
Vehicle_Make_Model              0
BEV_PHEV_FCEV                   0
BEV_PHEV_FCEV_15                0
BEV_PHEV_FCEV_15_2022           0
Eligible_Incentive_Amount       0
Recipient                       0
Recipient_Province              0
Country                         0
Request_Date_OK                 0
Eligible_Incentive_Amount_OK    0
dtype: int64

##### 3.2.2.2. Remove duplicated data

Whether these rows should be treated as duplicates depends on the problem at hand: identical rows could also be separate, legitimate requests. For the purpose of this project, they are treated as duplicates and removed.

**Data 2019**

In [40]:
# Remove duplicated values
df_2019 = df_2019.drop_duplicates()
print(f"Total duplicated values {df_2019.duplicated().sum()} from {df_2019.duplicated().count()}")

Total duplicated values 0 from 126974


**Data 2023**

In [41]:
# Remove duplicated values
df_2023 = df_2023.drop_duplicates()
print(f"Total duplicated values {df_2023.duplicated().sum()} from {df_2023.duplicated().count()}")

Total duplicated values 0 from 97323


**Data 2024**

In [42]:
# Remove duplicated values
df_2024 = df_2024.drop_duplicates()
print(f"Total duplicated values {df_2024.duplicated().sum()} from {df_2024.duplicated().count()}")

Total duplicated values 0 from 60094
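
The row counts reported so far can be cross-checked with simple arithmetic: rows before cleaning, minus duplicates, minus fully empty rows, should equal the rows that remain. The sketch below performs that check on the numbers printed above and also illustrates the "keep the first occurrence" behaviour on made-up tuples. `drop_duplicates_keep_first` is a hypothetical stand-in written with plain Python, not the pandas call itself.

```python
def drop_duplicates_keep_first(rows):
    """Return the rows without later repeats, preserving the original order."""
    return list(dict.fromkeys(rows))

# Made-up rows: the second one repeats the first.
rows = [("2019-05-16", "NISSAN"), ("2019-05-16", "NISSAN"), ("2019-05-15", "VOLKSWAGEN")]
unique = drop_duplicates_keep_first(rows)
duplicates = len(rows) - len(unique)
print(duplicates, unique)

# Counts reported above: (rows read, duplicates, fully empty rows, rows kept).
reported = {
    "2019": (202206, 75232, 0, 126974),
    "2023": (166758, 69434, 1, 97323),
    "2024": (90325, 30231, 0, 60094),
}
for name, (before, dups, empty, after) in reported.items():
    assert before - dups - empty == after, name
```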


##### 3.2.2.3. Remove white spaces

The idea of this transformation is to reduce cardinality: for example, "V5L 1H7" and " V5L 1H7 " would otherwise count as different values.

In [43]:
string_columns = ["Month_Year", "Fiscal_Year", "Dealership_Province", "Dealership_Code", "Purchase_or_Lease", "Vehicle_Make", "Vehicle_Model", "Vehicle_Make_Model",
                  "BEV_PHEV_FCEV", "BEV_PHEV_FCEV_15", "BEV_PHEV_FCEV_15_2022", "Recipient", "Recipient_Province", "Country"]

**Data 2019**

In [44]:
for c in string_columns:
    df_2019[c] = df_2019[c].str.strip()

**Data 2023**

In [45]:
for c in string_columns:
    df_2023[c] = df_2023[c].str.strip()

**Data 2024**

In [46]:
for c in string_columns:
    df_2024[c] = df_2024[c].str.strip()

##### 3.2.2.4. Convert a string to uppercase

The idea of this transformation is to reduce cardinality: for example, "V5L 1H7" and "v5l 1h7" would otherwise count as different values.

**Data 2019**

In [47]:
for c in string_columns:
    df_2019[c] = df_2019[c].str.upper()

**Data 2023**

In [48]:
for c in string_columns:
    df_2023[c] = df_2023[c].str.upper()

**Data 2024**

In [49]:
for c in string_columns:
    df_2024[c] = df_2024[c].str.upper()
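
The combined effect of trimming white space and upper-casing can be seen on a small, made-up list of postal codes: the number of distinct values drops once the variants are normalized.

```python
# Made-up variants of the same postal codes.
raw_codes = ["V5L 1H7", " V5L 1H7 ", "v5l 1h7", "G3M 1W1"]

# Same two steps as above: strip surrounding white space, then upper-case.
normalized = [code.strip().upper() for code in raw_codes]

print(len(set(raw_codes)), "->", len(set(normalized)))  # 4 -> 2
```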

##### 3.2.2.5. Convert a string to number (if applicable)

Some numbers were flagged as invalid by the validation function because they contain a comma as a thousands separator (for example, "5,000"). The comma must be removed before the column can be converted to a numeric type.
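
On a single value, the conversion amounts to the following sketch. `parse_amount` is a hypothetical helper that assumes whole amounts with "," as the only separator.

```python
def parse_amount(text: str) -> int:
    """Convert an amount such as '5,000' to an integer by removing
    the thousands separator (assumes whole amounts only)."""
    return int(text.replace(",", "").strip())

print(parse_amount("5,000"))  # 5000
print(parse_amount("625"))    # 625
```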

**Data 2019**

In [50]:
df_2019["Eligible_Incentive_Amount"] = df_2019["Eligible_Incentive_Amount"].str.replace(",", "")

In [51]:
df_2019["Eligible_Incentive_Amount"] = pd.to_numeric(df_2019["Eligible_Incentive_Amount"])

In [52]:
df_2019.info()

<class 'pandas.core.frame.DataFrame'>
Index: 126974 entries, 0 to 202205
Data columns (total 20 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   Request_Date                  126974 non-null  object
 1   Month_Year                    126974 non-null  object
 2   Fiscal_Year                   126974 non-null  object
 3   Calendar_Year                 126974 non-null  int64 
 4   Dealership_Province           126974 non-null  object
 5   Dealership_Code               126974 non-null  object
 6   Purchase_or_Lease             126974 non-null  object
 7   Vehicle_Year                  126974 non-null  int64 
 8   Vehicle_Make                  126974 non-null  object
 9   Vehicle_Model                 126974 non-null  object
 10  Vehicle_Make_Model            126974 non-null  object
 11  BEV_PHEV_FCEV                 126974 non-null  object
 12  BEV_PHEV_FCEV_15              126974 non-null  object
 13  BEV_

**Data 2023**

In [53]:
df_2023["Eligible_Incentive_Amount"] = df_2023["Eligible_Incentive_Amount"].str.replace(",", "")

In [54]:
df_2023["Eligible_Incentive_Amount"] = pd.to_numeric(df_2023["Eligible_Incentive_Amount"])

In [55]:
df_2023.info()

<class 'pandas.core.frame.DataFrame'>
Index: 97323 entries, 0 to 166735
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Request_Date                  97323 non-null  object 
 1   Month_Year                    97323 non-null  object 
 2   Fiscal_Year                   97323 non-null  object 
 3   Calendar_Year                 97323 non-null  float64
 4   Dealership_Province           97323 non-null  object 
 5   Dealership_Code               97323 non-null  object 
 6   Purchase_or_Lease             97323 non-null  object 
 7   Vehicle_Year                  97323 non-null  float64
 8   Vehicle_Make                  97323 non-null  object 
 9   Vehicle_Model                 97323 non-null  object 
 10  Vehicle_Make_Model            97323 non-null  object 
 11  BEV_PHEV_FCEV                 97323 non-null  object 
 12  BEV_PHEV_FCEV_15              97323 non-null  object 
 13  BEV_P

**Data 2024**

In [56]:
df_2024["Eligible_Incentive_Amount"] = df_2024["Eligible_Incentive_Amount"].str.replace(",", "")

In [57]:
df_2024["Eligible_Incentive_Amount"] = pd.to_numeric(df_2024["Eligible_Incentive_Amount"])

In [58]:
df_2024.info()

<class 'pandas.core.frame.DataFrame'>
Index: 60094 entries, 0 to 90282
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Request_Date                  60094 non-null  object
 1   Month_Year                    60094 non-null  object
 2   Fiscal_Year                   60094 non-null  object
 3   Calendar_Year                 60094 non-null  int64 
 4   Dealership_Province           60094 non-null  object
 5   Dealership_Code               60094 non-null  object
 6   Purchase_or_Lease             60094 non-null  object
 7   Vehicle_Year                  60094 non-null  int64 
 8   Vehicle_Make                  60094 non-null  object
 9   Vehicle_Model                 60094 non-null  object
 10  Vehicle_Make_Model            60094 non-null  object
 11  BEV_PHEV_FCEV                 60094 non-null  object
 12  BEV_PHEV_FCEV_15              60094 non-null  object
 13  BEV_PHEV_FCEV_15_2022

### 3.3. Load

#### 3.3.1. Script to create database

```sql
CREATE DATABASE IZEV_EDA
GO

USE IZEV_EDA
GO

CREATE TABLE dbo.Incentives (
	Id INT NOT NULL IDENTITY,
	Request_Date DATE NOT NULL,
	Month_Year VARCHAR(100) NOT NULL,
	Fiscal_Year VARCHAR(100) NOT NULL,
	Calendar_Year INT NOT NULL,
	Dealership_Province VARCHAR(50) NOT NULL,
	Dealership_Code VARCHAR(50) NOT NULL,
	Purchase_or_Lease VARCHAR(50) NOT NULL,
	Vehicle_Year INT NOT NULL,
	Vehicle_Make VARCHAR(100) NOT NULL,
	Vehicle_Model VARCHAR(100) NOT NULL,
	Vehicle_Make_Model VARCHAR(100) NOT NULL,
	BEV_PHEV_FCEV VARCHAR(20) NOT NULL,
	BEV_PHEV_FCEV_15 VARCHAR(20) NOT NULL,
	BEV_PHEV_FCEV_15_2022 VARCHAR(20) NOT NULL,
	Eligible_Incentive_Amount INT NOT NULL,
	Recipient VARCHAR(50) NOT NULL,
	Recipient_Province VARCHAR(50) NOT NULL,
	Country VARCHAR(50) NOT NULL,
	File_Origin VARCHAR(100) NOT NULL,

    CONSTRAINT PK_dbo_Incentives PRIMARY KEY (Id)
)  ON [PRIMARY]
GO

SELECT * FROM dbo.Incentives
GO
```
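
Because the table declares fixed `VARCHAR` sizes, a value longer than its column would typically be rejected at insert time. A length check before loading is one way to catch this early. The sketch below uses a subset of the sizes from the script above; `find_oversized` and the sample records are hypothetical.

```python
# Subset of the VARCHAR sizes declared in the CREATE TABLE script.
MAX_LENGTHS = {"Dealership_Code": 50, "BEV_PHEV_FCEV": 20, "BEV_PHEV_FCEV_15_2022": 20}

def find_oversized(records, max_lengths):
    """Return (row_index, column, length) for every value that exceeds its limit."""
    problems = []
    for i, record in enumerate(records):
        for column, limit in max_lengths.items():
            value = record.get(column, "")
            if len(value) > limit:
                problems.append((i, column, len(value)))
    return problems

records = [
    {"Dealership_Code": "V5L 1H7", "BEV_PHEV_FCEV": "BEV", "BEV_PHEV_FCEV_15_2022": "BEV"},
    # Made-up oversized value, used only to trigger the check.
    {"Dealership_Code": "G3M 1W1", "BEV_PHEV_FCEV": "PHEV", "BEV_PHEV_FCEV_15_2022": "X" * 25},
]
print(find_oversized(records, MAX_LENGTHS))  # [(1, 'BEV_PHEV_FCEV_15_2022', 25)]
```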

#### 3.3.2. Prepare connection information

In [59]:
import pyodbc
from dotenv import load_dotenv

In [60]:
load_dotenv()

True

Get database connection information

In [61]:
MSSQL_SERVER = os.getenv("MSSQL_SERVER")
MSSQL_DATABASE = os.getenv("MSSQL_DATABASE")
MSSQL_USERNAME = os.getenv("MSSQL_USERNAME")
MSSQL_PASSWORD = os.getenv("MSSQL_PASSWORD")

In [62]:
connection_string = f"DRIVER={{ODBC Driver 18 for SQL Server}};SERVER={MSSQL_SERVER};DATABASE={MSSQL_DATABASE};UID={MSSQL_USERNAME};PWD={MSSQL_PASSWORD};TrustServerCertificate=yes"
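
`os.getenv` returns `None` when a variable is not set, so a missing setting would end up in the connection string as the text "None". A possible safeguard, sketched below, is to check the required variables before building the string. `build_connection_string` is a hypothetical helper, and the values in the usage example are placeholders.

```python
import os

REQUIRED = ("MSSQL_SERVER", "MSSQL_DATABASE", "MSSQL_USERNAME", "MSSQL_PASSWORD")

def build_connection_string(env=os.environ):
    """Build the ODBC connection string, failing early if a setting is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        f"SERVER={env['MSSQL_SERVER']};"
        f"DATABASE={env['MSSQL_DATABASE']};"
        f"UID={env['MSSQL_USERNAME']};"
        f"PWD={env['MSSQL_PASSWORD']};"
        "TrustServerCertificate=yes"
    )

# Placeholder settings passed as a plain dict instead of the real environment.
example_env = {
    "MSSQL_SERVER": "localhost",
    "MSSQL_DATABASE": "IZEV_EDA",
    "MSSQL_USERNAME": "user",
    "MSSQL_PASSWORD": "secret",
}
print(build_connection_string(example_env))
```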

Build insertion script

In [63]:
table_destination_sql = "dbo.Incentives"

In [64]:
columns = ["Request_Date", "Month_Year", "Fiscal_Year", "Calendar_Year", "Dealership_Province", "Dealership_Code", "Purchase_or_Lease", "Vehicle_Year", "Vehicle_Make", 
            "Vehicle_Model", "Vehicle_Make_Model", "BEV_PHEV_FCEV", "BEV_PHEV_FCEV_15", "BEV_PHEV_FCEV_15_2022", "Eligible_Incentive_Amount",
            "Recipient", "Recipient_Province", "Country", "File_Origin"]

In [65]:
sql_query_columns = ",".join(columns)
sql_query_values = ",".join(["?"] * len(columns))
sql_query = f"INSERT INTO {table_destination_sql} ({sql_query_columns}) VALUES ({sql_query_values})"
print(sql_query)

INSERT INTO dbo.Incentives (Request_Date,Month_Year,Fiscal_Year,Calendar_Year,Dealership_Province,Dealership_Code,Purchase_or_Lease,Vehicle_Year,Vehicle_Make,Vehicle_Model,Vehicle_Make_Model,BEV_PHEV_FCEV,BEV_PHEV_FCEV_15,BEV_PHEV_FCEV_15_2022,Eligible_Incentive_Amount,Recipient,Recipient_Province,Country,File_Origin) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)


#### 3.3.3. Initial load

The first file (iZEV_Data_2019.csv) is used for the initial load: the destination table is emptied first and then filled.

Clean table

In [66]:
connection = pyodbc.connect(connection_string)
cursor = connection.cursor()
cursor.execute(f"TRUNCATE TABLE {table_destination_sql}")
cursor.close()
connection.commit()
connection.close()

Load data

In [67]:
connection = pyodbc.connect(connection_string)
cursor = connection.cursor()
for index, row in df_2019.iterrows():
    cursor.execute(sql_query, 
                    row["Request_Date"], 
                    row["Month_Year"], 
                    row["Fiscal_Year"],
                    row["Calendar_Year"],
                    row["Dealership_Province"],
                    row["Dealership_Code"],
                    row["Purchase_or_Lease"],
                    row["Vehicle_Year"],
                    row["Vehicle_Make"],
                    row["Vehicle_Model"],
                    row["Vehicle_Make_Model"],
                    row["BEV_PHEV_FCEV"],
                    row["BEV_PHEV_FCEV_15"],
                    row["BEV_PHEV_FCEV_15_2022"],
                    row["Eligible_Incentive_Amount"],
                    row["Recipient"],
                    row["Recipient_Province"],
                    row["Country"],
                    "iZEV_Data_2019.csv"
                    )
connection.commit()
cursor.close()
connection.close()
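
The loop above sends one `INSERT` per row. An alternative worth considering is to collect the parameter tuples first and send them in a single batch with `executemany`, which pyodbc cursors also provide (optionally together with `cursor.fast_executemany = True`). The sketch below shows the pattern with the standard `sqlite3` module and an in-memory database as a stand-in, since it accepts the same `?` placeholders. It is not the SQL Server load itself, and the reduced column list and rows are made up.

```python
import sqlite3

# Reduced, illustrative column list (the real table has more columns).
columns = ["Request_Date", "Vehicle_Make", "Eligible_Incentive_Amount", "File_Origin"]
placeholders = ",".join(["?"] * len(columns))
insert_sql = f"INSERT INTO Incentives ({','.join(columns)}) VALUES ({placeholders})"

rows = [
    ("2019-05-15", "VOLKSWAGEN", 5000),
    ("2019-05-16", "NISSAN", 5000),
]
# Append the file name to every row, as the loop above does with a literal.
params = [row + ("iZEV_Data_2019.csv",) for row in rows]

connection = sqlite3.connect(":memory:")  # stand-in for the SQL Server connection
connection.execute(
    "CREATE TABLE Incentives (Request_Date TEXT, Vehicle_Make TEXT, "
    "Eligible_Incentive_Amount INTEGER, File_Origin TEXT)"
)
connection.executemany(insert_sql, params)  # one call for the whole batch
connection.commit()
loaded = connection.execute("SELECT COUNT(*) FROM Incentives").fetchone()[0]
print(loaded)  # 2
connection.close()
```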

#### 3.3.4. Incremental load

The remaining files are appended to the data already in the database.

**Data 2023**

In [68]:
connection = pyodbc.connect(connection_string)
cursor = connection.cursor()
for index, row in df_2023.iterrows():
    cursor.execute(sql_query, 
                    row["Request_Date"], 
                    row["Month_Year"], 
                    row["Fiscal_Year"],
                    row["Calendar_Year"],
                    row["Dealership_Province"],
                    row["Dealership_Code"],
                    row["Purchase_or_Lease"],
                    row["Vehicle_Year"],
                    row["Vehicle_Make"],
                    row["Vehicle_Model"],
                    row["Vehicle_Make_Model"],
                    row["BEV_PHEV_FCEV"],
                    row["BEV_PHEV_FCEV_15"],
                    row["BEV_PHEV_FCEV_15_2022"],
                    row["Eligible_Incentive_Amount"],
                    row["Recipient"],
                    row["Recipient_Province"],
                    row["Country"],
                    "iZEV_Data_2023.csv"
                    )
connection.commit()
cursor.close()
connection.close()

**Data 2024**

In [69]:
connection = pyodbc.connect(connection_string)
cursor = connection.cursor()
for index, row in df_2024.iterrows():
    cursor.execute(sql_query, 
                    row["Request_Date"], 
                    row["Month_Year"], 
                    row["Fiscal_Year"],
                    row["Calendar_Year"],
                    row["Dealership_Province"],
                    row["Dealership_Code"],
                    row["Purchase_or_Lease"],
                    row["Vehicle_Year"],
                    row["Vehicle_Make"],
                    row["Vehicle_Model"],
                    row["Vehicle_Make_Model"],
                    row["BEV_PHEV_FCEV"],
                    row["BEV_PHEV_FCEV_15"],
                    row["BEV_PHEV_FCEV_15_2022"],
                    row["Eligible_Incentive_Amount"],
                    row["Recipient"],
                    row["Recipient_Province"],
                    row["Country"],
                    "iZEV_Data_2024.csv"
                    )
connection.commit()
cursor.close()
connection.close()

#### 3.3.5. Show data

In [74]:
connection = pyodbc.connect(connection_string)
cursor = connection.cursor()
cursor.execute(f"SELECT File_Origin, COUNT(1) Total FROM {table_destination_sql} (NOLOCK) GROUP BY File_Origin")
for row in cursor.fetchall():
    print(f"{row[0]}: {row[1]}")
cursor.close()
connection.close()

iZEV_Data_2019.csv: 126974
iZEV_Data_2024.csv: 60094
iZEV_Data_2023.csv: 97323
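
The counts returned by the database match the number of rows kept after cleaning each file. A small reconciliation step can make that comparison explicit. In the sketch below, the expected values come from the cleaning step and the loaded values from the query output above; `reconcile` is a hypothetical helper.

```python
# Rows kept per file after cleaning (from the transform step).
expected = {
    "iZEV_Data_2019.csv": 126974,
    "iZEV_Data_2023.csv": 97323,
    "iZEV_Data_2024.csv": 60094,
}

# Rows per file as reported by the database query above.
loaded = {
    "iZEV_Data_2019.csv": 126974,
    "iZEV_Data_2024.csv": 60094,
    "iZEV_Data_2023.csv": 97323,
}

def reconcile(expected, loaded):
    """Return {file: (expected, loaded)} for every file whose counts differ."""
    files = sorted(set(expected) | set(loaded))
    return {
        f: (expected.get(f, 0), loaded.get(f, 0))
        for f in files
        if expected.get(f, 0) != loaded.get(f, 0)
    }

print(reconcile(expected, loaded))  # {} means every file matches
print(sum(loaded.values()))         # 284391 rows in total
```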
