# Washington State Electric Vehicle Data
The original premise for working with this data was to confirm or refute whether electric vehicles are becoming more popular. I also wanted to perform some exploratory data analysis to keep my skills sharp.

The cell below imports every function from the pyspark.sql.functions module with the * wildcard, so functions such as col and to_date can be called directly when manipulating, aggregating, and transforming DataFrames. Be aware that the wildcard also shadows Python built-ins of the same name, such as sum, min, max, and round. The cell also imports pandas, Seaborn, and Matplotlib, which are used later for plotting.

In [0]:
from pyspark.sql.functions import *
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

The next cell uses PySpark to read a CSV file into a DataFrame:
1. bronze_path is the file path to the CSV file containing electric vehicle title and registration activity data.
2. ev_df is created by reading the CSV file at bronze_path with Spark's read interface.
3. format("csv") specifies that the file format is CSV.
4. option("header", "true") indicates that the first row of the CSV file contains the column names.
5. option("inferSchema", "true") asks Spark to infer each column's type from the data instead of reading every column as a string, at the cost of an extra pass over the file.
6. option("preferDate", "true") lets schema inference treat string columns whose values match the date format as dates.
7. load(bronze_path) loads the CSV file into a Spark DataFrame.

In [0]:
bronze_path = "/Volumes/main/djsprojects/evdata/bronze/Electric_Vehicle_Title_and_Registration_Activity.csv"
ev_df=spark.read.format("csv")\
           .option("header", "true")\
           .option("inferSchema", "true")\
           .option("preferDate", "true")\
           .load(bronze_path)
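
As a rough, Spark-free illustration of what the header option means, the sketch below uses Python's csv module on a made-up two-row sample. The column names are taken from this notebook; the values are hypothetical. csv.DictReader consumes the first row as column names, much as header=true does, and returns every value as a string, which is why type inference or an explicit conversion is needed afterwards.

```python
import csv
import io

# Hypothetical sample; only the column names come from this notebook.
sample = io.StringIO(
    "Sale Date,Transaction Date\n"
    "January 01 2020,January 15 2020\n"
)

# DictReader treats the first row as column names, similar to header=true.
rows = list(csv.DictReader(sample))
columns = list(rows[0].keys())

print(columns)
# Without any type inference, every value is still a plain string.
print(type(rows[0]["Sale Date"]).__name__)
```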


Next, we cleanse the data so that the date columns are stored as a date type we can analyze:

1. Convert the "Sale Date" column from a string to a date using the to_date function with the pattern "MMMM dd yyyy" (e.g., "January 01 2020").
2. Convert the "Transaction Date" column from a string to a date using to_date with the same pattern.

The withColumn method replaces each column with its newly converted date values.

In [0]:
ev_df = ev_df.withColumn("Sale Date", to_date(col("Sale Date"), "MMMM dd yyyy"))
ev_df = ev_df.withColumn("Transaction Date", to_date(col("Transaction Date"), "MMMM dd yyyy"))
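
To sanity-check what a date pattern expects before running it on the full DataFrame, the same parse can be tried in plain Python. This is only a sketch: it assumes the strptime equivalents of the Spark pattern letters (MMMM is %B, MM is %m, dd is %d, yyyy is %Y) and an English locale for month names. The helper name parses is hypothetical.

```python
from datetime import date, datetime

# Spark pattern "MMMM dd yyyy" corresponds to strptime's "%B %d %Y";
# Spark pattern "MM/dd/yyyy" corresponds to "%m/%d/%Y".
long_form = datetime.strptime("January 01 2020", "%B %d %Y").date()
slash_form = datetime.strptime("12/31/2021", "%m/%d/%Y").date()


def parses(value: str, fmt: str) -> bool:
    """Return True if value matches the strptime format fmt, else False."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# A slash-formatted string does not match the long-form pattern.
mismatch_ok = parses("12/31/2021", "%B %d %Y")

print(long_form, slash_form, mismatch_ok)
```

If a handful of raw values from the file fail this kind of check, the pattern passed to to_date probably needs to change.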

Calling printSchema() on ev_df prints each column's name, data type, and nullability. This makes it easy to check the structure of the data, including whether "Sale Date" and "Transaction Date" now have the date type.

In [0]:
ev_df.printSchema()

Note that the pattern passed to to_date must match how the dates appear in the raw CSV. The conversion cell above uses "MMMM dd yyyy" (e.g., "January 01 2020"). If the file instead stored dates such as "12/31/2021", the pattern would need to be "MM/dd/yyyy" for both columns:

1. withColumn("Sale Date", to_date(col("Sale Date"), "MM/dd/yyyy"))
2. withColumn("Transaction Date", to_date(col("Transaction Date"), "MM/dd/yyyy"))

A string that does not match the pattern is typically returned as null rather than a date, so a mismatched pattern can silently empty the column.

The following code uses the built-in Databricks function dbutils.data.summarize to generate a data profile of the DataFrame ev_df.

In [0]:
dbutils.data.summarize(ev_df,precise=True)

This code counts electric vehicle (EV) transactions per date and visualizes them over time with a line plot. Step by step:

1. Group and count transactions:

    - The code groups the ev_df DataFrame by the "Transaction Date" column and counts the number of transactions for each date.
    - The result is converted to a pandas DataFrame named ev_count_df.
2. Convert the date column:

    - The "Transaction Date" column in ev_count_df is converted to a datetime type.
3. Extract year and month:

    - Two new columns, "Transaction_Year" and "Transaction_Month", are created by extracting the year and month from the "Transaction Date" column. Only the year is used in the plot.
4. Plot the data:

    - A line plot is created using Seaborn (sns.lineplot), with the transaction year on the x-axis and the per-date transaction count on the y-axis. Because each year contains many per-date rows, sns.lineplot by default draws the mean count per year with a 95% confidence band, not the yearly total.
    - The plot is labeled with "Year" on the x-axis, "Number of Transactions" on the y-axis, and titled "Electric Vehicle Transactions in Washington State".
5. Display the plot:

    - The plot is displayed using plt.show().

In [0]:
ev_count_df = ev_df\
                  .groupby("Transaction Date")\
                  .count()\
                  .toPandas()
ev_count_df["Transaction Date"] = ev_count_df["Transaction Date"].astype('datetime64[ns]')
ev_count_df["Transaction_Year"]  = ev_count_df["Transaction Date"].dt.year
ev_count_df["Transaction_Month"] = ev_count_df["Transaction Date"].dt.month
             

ev_plot = sns.lineplot(
    x="Transaction_Year",
    y="count",
    data=ev_count_df
)
ev_plot.set(xlabel ="Year", ylabel = "Number of Transactions", title ='Electric Vehicle Transactions in Washington State')
plt.show()
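
If the goal is to plot total transactions per year rather than the mean of the per-date counts that sns.lineplot computes by default, the counts can be summed per year first. The sketch below shows that aggregation on a small made-up frame; toy_counts and its values are hypothetical stand-ins for ev_count_df, and only the column names come from the cell above.

```python
import pandas as pd

# Hypothetical per-date counts standing in for ev_count_df; values are made up.
toy_counts = pd.DataFrame({
    "Transaction Date": pd.to_datetime(
        ["2020-01-05", "2020-06-10", "2021-02-01", "2021-03-15", "2021-11-20"]
    ),
    "count": [10, 20, 30, 40, 50],
})
toy_counts["Transaction_Year"] = toy_counts["Transaction Date"].dt.year

# Sum the per-date counts so that each year has exactly one total.
yearly_totals = (
    toy_counts.groupby("Transaction_Year", as_index=False)["count"].sum()
)

print(yearly_totals)
```

Passing a frame like yearly_totals to sns.lineplot with the same x and y arguments would then draw one point per year with no confidence band, since there is a single value for each x.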

## Analysis
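The visualization above shows a sharp increase in EV title and registration transactions over the years, which is consistent with electric vehicles becoming more popular in Washington State. Note that the plot counts transactions, which is a proxy for ownership rather than a direct measure of it.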
