Hey, hope you're doing well. This project analyzes EV charging utilization metrics from a public Kaggle dataset that records EV charging station usage in Palo Alto.

To begin, let's first import the necessary libraries:

In [None]:
import zipfile

from pathlib import Path

import pandas as pd

Let's now unzip this dataset locally (it ships as a zip file because the CSV is fairly large).

In [105]:
zip_path = Path("data/EVChargingStationUsage.csv.zip")

extract_dir = Path("data/extracted") # Placing the unzipped CSV in a temporary folder

# Let's now unzip this file:

with zipfile.ZipFile(zip_path, "r") as z: # To open the zip to place it in extract_dir
    z.extractall(extract_dir)

# Let's check if that extraction worked:

list(extract_dir.glob("*"))

[PosixPath('data/extracted/EVChargingStationUsage.csv'),
 PosixPath('data/extracted/.ipynb_checkpoints')]
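
As a side note, the archive can also be inspected without extracting it. The sketch below is not part of the original workflow: it builds a tiny throwaway zip (the file name sample.csv and its contents are made up for illustration) and reads the CSV header straight out of the archive using only the standard library.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def peek_csv_header(zip_path, member):
    """Return the header row of a CSV stored inside a zip, without extracting it."""
    with zipfile.ZipFile(zip_path, "r") as z:
        with z.open(member) as raw:
            # z.open yields bytes, so wrap it to decode text for the csv module
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return next(csv.reader(text))


# Build a small stand-in archive so the example is self-contained (hypothetical data)
tmp_dir = Path(tempfile.mkdtemp())
demo_zip = tmp_dir / "sample.csv.zip"
with zipfile.ZipFile(demo_zip, "w") as z:
    z.writestr("sample.csv", "Station Name,EVSE ID\nDEMO #1,1\n")

# List what is inside the archive, then read the first member's header
with zipfile.ZipFile(demo_zip, "r") as z:
    members = z.namelist()

header = peek_csv_header(demo_zip, members[0])
print(members, header)
```

This can be handy for checking column names before committing to a full extraction.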

Because this dataset tracks individual usages, it is event-based and naturally relational: each usage belongs to exactly one station, and each station has many usages. We can model this with relational tables, and we will use SQL to do so.
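
To make the one-station-to-many-usages idea concrete, here is a minimal sketch on an in-memory database. The table and column names (demo_stations, demo_sessions, station_id) are hypothetical and are not the tables this project builds; the point is only to show how a foreign key expresses "each usage belongs to exactly one station".

```python
import sqlite3

demo = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on
demo.execute("PRAGMA foreign_keys = ON")

demo.executescript("""
CREATE TABLE demo_stations (
    station_id   INTEGER PRIMARY KEY,
    station_name TEXT NOT NULL
);

CREATE TABLE demo_sessions (
    session_id INTEGER PRIMARY KEY,
    station_id INTEGER NOT NULL REFERENCES demo_stations(station_id),
    energy_kwh REAL
);

INSERT INTO demo_stations VALUES (1, 'DEMO #1');
INSERT INTO demo_sessions VALUES (10, 1, 6.2), (11, 1, 3.4);
""")

# One station, many sessions
sessions_per_station = demo.execute("""
SELECT st.station_name, COUNT(*) AS n_sessions
FROM demo_sessions se
JOIN demo_stations st ON se.station_id = st.station_id
GROUP BY st.station_id, st.station_name;
""").fetchall()

# A session pointing at a station that does not exist should be rejected
try:
    demo.execute("INSERT INTO demo_sessions VALUES (12, 99, 1.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(sessions_per_station, orphan_rejected)
```

Tables created with CREATE TABLE ... AS SELECT (as we do below) carry no such constraints, so in this notebook the relationship is maintained by convention rather than enforced.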

In [106]:
import sqlite3

db_path = Path("data/ev_charging.db") # to create a file path object
conn = sqlite3.connect(db_path) # now let's connect to the database
print(db_path.resolve()) # let's see where our db lives

conn.execute("SELECT 1").fetchone() # as a test

/home/jupyter/ev-charging-station-analysis/ev-charging-stations-analysis/data/ev_charging.db


(1,)

So now our database exists, and the connection seems to work, but the database is empty, so we need to load the raw CSV data as a "raw table".

In [107]:
# Let's first just see how big our data is:

df = pd.read_csv("data/extracted/EVChargingStationUsage.csv",
                low_memory = False)

df.shape

(259415, 33)

In [110]:
# Now, let's write this into SQLite

df.to_sql(
    "raw_sessions", # we'll choose this as the table name
    conn, if_exists = "replace",
    index = False
)

# Let's confirm this table exists

pd.read_sql_query(
    "SELECT * FROM raw_sessions LIMIT 5;", # to test a small query
    conn
)

Unnamed: 0,Station Name,MAC Address,Org Name,Start Date,Start Time Zone,End Date,End Time Zone,Transaction Date (Pacific Time),Total Duration (hh:mm:ss),Charging Time (hh:mm:ss),...,Longitude,Currency,Fee,Ended By,Plug In Event Id,Driver Postal Code,User ID,County,System S/N,Model Number
0,PALO ALTO CA / HAMILTON #1,000D:6F00:015A:9D76,City of Palo Alto,7/29/2011 20:17,PDT,7/29/2011 23:20,PDT,7/29/2011 23:20,3:03:32,1:54:03,...,-122.160309,USD,0.0,Plug Out at Vehicle,3,95124.0,3284,,,
1,PALO ALTO CA / HAMILTON #1,000D:6F00:015A:9D76,City of Palo Alto,7/30/2011 0:00,PDT,7/30/2011 0:02,PDT,7/30/2011 0:02,0:02:06,0:01:54,...,-122.160309,USD,0.0,Customer,4,94301.0,4169,,,
2,PALO ALTO CA / HAMILTON #1,000D:6F00:015A:9D76,City of Palo Alto,7/30/2011 8:16,PDT,7/30/2011 12:34,PDT,7/30/2011 12:34,4:17:32,4:17:28,...,-122.160309,USD,0.0,Plug Out at Vehicle,5,94301.0,4169,,,
3,PALO ALTO CA / HAMILTON #1,000D:6F00:015A:9D76,City of Palo Alto,7/30/2011 14:51,PDT,7/30/2011 16:55,PDT,7/30/2011 16:55,2:03:24,2:02:58,...,-122.160309,USD,0.0,Customer,6,94302.0,2545,,,
4,PALO ALTO CA / HAMILTON #1,000D:6F00:015A:9D76,City of Palo Alto,7/30/2011 18:51,PDT,7/30/2011 20:03,PDT,7/30/2011 20:03,1:11:24,0:43:54,...,-122.160309,USD,0.0,Plug Out at Vehicle,7,94043.0,3765,,,


In [111]:
# We can quickly confirm our table and row count:

pd.read_sql_query(
    "SELECT COUNT(*) AS n_rows FROM raw_sessions;", conn)

Unnamed: 0,n_rows
0,259415


In [112]:
# It matches. Let's now just confirm all our column names:

cols = pd.read_sql_query("PRAGMA table_info(raw_sessions);", conn)

cols

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,Station Name,TEXT,0,,0
1,1,MAC Address,TEXT,0,,0
2,2,Org Name,TEXT,0,,0
3,3,Start Date,TEXT,0,,0
4,4,Start Time Zone,TEXT,0,,0
5,5,End Date,TEXT,0,,0
6,6,End Time Zone,TEXT,0,,0
7,7,Transaction Date (Pacific Time),TEXT,0,,0
8,8,Total Duration (hh:mm:ss),TEXT,0,,0
9,9,Charging Time (hh:mm:ss),TEXT,0,,0


Looking at this, we can get a better sense of what we need to do next. Let's focus on creating tables for stations (one row per station), users (one row per user), and sessions (one row per charging session, which references the station and user). Before we do that though, let's make sure the ID columns we'd key on have enough non-null values:

In [113]:
pd.read_sql_query("""
SELECT
    COUNT(*) AS rows,
    COUNT(DISTINCT "EVSE ID") AS distinct_stations,
    SUM(CASE WHEN "EVSE ID" IS NULL THEN 1 ELSE 0 END) AS null_station_id
FROM raw_sessions;
""", conn)



Unnamed: 0,rows,distinct_stations,null_station_id
0,259415,51,78948


In [114]:
pd.read_sql_query("""
SELECT
  COUNT(*) AS rows,
  COUNT(DISTINCT "Plug In Event Id") AS distinct_events,
  SUM(CASE WHEN "Plug In Event Id" IS NULL THEN 1 ELSE 0 END) AS null_event_id
FROM raw_sessions;
""", conn)


Unnamed: 0,rows,distinct_events,null_event_id
0,259415,36838,0


In [115]:
pd.read_sql_query("""
SELECT
  COUNT(*) AS rows,
  COUNT(DISTINCT "User ID") AS distinct_users,
  SUM(CASE WHEN "User ID" IS NULL OR TRIM("User ID")='' THEN 1 ELSE 0 END) AS null_user_id
FROM raw_sessions;
""", conn)

Unnamed: 0,rows,distinct_users,null_user_id
0,259415,21441,7677
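
The three checks above follow the same pattern, so they could be folded into a small helper. This is only a sketch: null_profile and the demo_raw table are hypothetical names, and the toy rows are invented to exercise the NULL and blank-string cases.

```python
import sqlite3


def null_profile(conn, table, column):
    """Return (rows, distinct_values, null_or_blank) for one column of a table."""
    # Identifiers cannot be bound as ? parameters, so quote them by hand
    q_table = '"' + table.replace('"', '""') + '"'
    q_col = '"' + column.replace('"', '""') + '"'
    sql = f"""
        SELECT
            COUNT(*),
            COUNT(DISTINCT {q_col}),
            COALESCE(SUM(CASE WHEN {q_col} IS NULL OR TRIM({q_col}) = ''
                              THEN 1 ELSE 0 END), 0)
        FROM {q_table};
    """
    return conn.execute(sql).fetchone()


demo = sqlite3.connect(":memory:")
demo.execute('CREATE TABLE demo_raw ("EVSE ID" REAL, "User ID" TEXT)')
demo.executemany(
    "INSERT INTO demo_raw VALUES (?, ?)",
    [(1.0, "a"), (1.0, "b"), (None, " "), (2.0, None)],
)

profiles = {col: null_profile(demo, "demo_raw", col) for col in ("EVSE ID", "User ID")}
print(profiles)
```

Note that COUNT(DISTINCT ...) skips NULLs but still counts a blank string as a value, which is why the blank check is done separately.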


Something interesting to uncover here is that there are just 36838 distinct event IDs, which means that one row in the data does not necessarily correspond to one event ID. Let's confirm that event IDs repeat:

In [116]:
pd.read_sql_query("""
SELECT
    "Plug In Event Id" AS plug_in_event_id,
    COUNT(*) AS rows_for_event
    FROM raw_sessions
    GROUP BY "Plug In Event Id"
    ORDER BY rows_for_event DESC
    LIMIT 10;
    """, conn)

Unnamed: 0,plug_in_event_id,rows_for_event
0,657,257
1,100,52
2,406,51
3,345,51
4,152,51
5,401,50
6,154,50
7,140,50
8,131,50
9,130,50


We see that a single event ID can appear in as many as 257 rows. Let's see what these rows look like for that event ID:

In [117]:
pd.read_sql_query("""
SELECT *
    FROM raw_sessions
    WHERE "Plug In Event Id" = 657;
""", conn)
            

Unnamed: 0,Station Name,MAC Address,Org Name,Start Date,Start Time Zone,End Date,End Time Zone,Transaction Date (Pacific Time),Total Duration (hh:mm:ss),Charging Time (hh:mm:ss),...,Longitude,Currency,Fee,Ended By,Plug In Event Id,Driver Postal Code,User ID,County,System S/N,Model Number
0,PALO ALTO CA / HAMILTON #1,000D:6F00:015A:9D76,City of Palo Alto,2/23/2012 23:14,PST,2/24/2012 4:19,PST,2/24/2012 4:19,5:05:55,2:40:13,...,-122.160309,,0.00,Plug Out at Vehicle,657,,0,,,
1,PALO ALTO CA / HAMILTON #2,000D:6F00:009E:D39E,City of Palo Alto,2/25/2012 23:16,PST,2/26/2012 10:20,PST,2/26/2012 10:21,11:03:57,4:06:37,...,-122.160263,USD,0.00,Plug Out at Vehicle,657,94301.0,2670,,,
2,PALO ALTO CA / BRYANT #2,000D6F0000A2108E,City of Palo Alto,5/30/2012 12:10,PDT,5/30/2012 14:44,PDT,5/30/2012 14:44,2:34:43,2:34:36,...,-122.162140,USD,0.00,Plug Out at Vehicle,657,95148.0,5717,,,
3,PALO ALTO CA / HIGH #4,000D6F0000A20F47,City of Palo Alto,6/3/2012 10:03,PDT,6/3/2012 10:55,PDT,6/3/2012 10:55,0:51:30,0:51:19,...,-122.162880,USD,0.00,Plug Out at Vehicle,657,,0,,,
4,PALO ALTO CA / BRYANT #1,000D6F0000A20D9E,City of Palo Alto,6/15/2012 8:37,PDT,6/15/2012 11:44,PDT,6/15/2012 11:44,3:06:56,3:06:46,...,-122.162308,USD,0.00,Plug Out at Vehicle,657,94611.0,48863,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252,PALO ALTO CA / TED THOMPSON #4,0024:B100:0002:9BE4,City of Palo Alto,12/21/2018 12:03,PST,12/21/2018 14:25,PST,12/21/2018 14:26,2:21:34,2:05:30,...,-122.143990,USD,1.41,Plug Out at Vehicle,657,94022.0,3402341,Santa Clara County,1.742410e+11,CT4020-HD
253,PALO ALTO CA / TED THOMPSON #2,0024:B100:0002:BF6D,City of Palo Alto,4/17/2019 7:04,PDT,4/17/2019 9:27,PDT,4/17/2019 9:29,2:23:28,2:08:37,...,-122.143929,USD,1.43,Plug Out at Vehicle,657,95119.0,2492681,,1.810410e+11,CT4020-HD
254,PALO ALTO CA / MPL #6,0024:B100:0003:4209,City of Palo Alto,7/12/2019 15:29,PDT,7/12/2019 16:04,PDT,7/12/2019 16:06,0:34:26,0:34:14,...,-122.113441,USD,0.79,Customer,657,94306.0,764631,Santa Clara County,1.903410e+11,CT4010-HD-GW
255,PALO ALTO CA / CAMBRIDGE #4,0024:B100:0003:3A0A,City of Palo Alto,7/14/2019 10:52,PDT,7/14/2019 11:12,PDT,7/14/2019 11:13,0:20:01,0:19:34,...,-122.146034,USD,0.44,Plug Out at Vehicle,657,94301.0,2390051,Santa Clara County,1.852410e+11,CT4020-HD-GW
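
Rather than reading hundreds of rows by eye, a summary query can show how many distinct stations, users, and start times share one event ID. The sketch below runs on a few invented rows in a hypothetical demo_raw table (the station names are placeholders); the same query shape should work against raw_sessions.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("""
CREATE TABLE demo_raw (
    "Plug In Event Id" INTEGER,
    "Station Name"     TEXT,
    "User ID"          TEXT,
    "Start Date"       TEXT
)
""")
demo.executemany(
    "INSERT INTO demo_raw VALUES (?, ?, ?, ?)",
    [
        (657, "DEMO / A #1", "0", "2/23/2012 23:14"),
        (657, "DEMO / B #2", "2670", "2/25/2012 23:16"),
        (657, "DEMO / C #2", "5717", "5/30/2012 12:10"),
        (3, "DEMO / A #1", "3284", "7/29/2011 20:17"),
    ],
)

# For each event ID: how many rows, and how varied are those rows?
event_spread = demo.execute("""
SELECT
    "Plug In Event Id",
    COUNT(*)                       AS n_rows,
    COUNT(DISTINCT "Station Name") AS n_stations,
    COUNT(DISTINCT "User ID")      AS n_users,
    COUNT(DISTINCT "Start Date")   AS n_start_times
FROM demo_raw
GROUP BY "Plug In Event Id"
ORDER BY n_rows DESC;
""").fetchall()

print(event_spread)
```

If an event ID really identified one event, the last three counts would all be 1 for every ID.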


A few observations can be made from this. Mainly, these rows describe different events: their dates, stations, and user IDs are completely different. This suggests that an event ID does not necessarily identify a unique event (it is likely something that can be recycled), whereas each row does. So instead, we can create a new row_id that treats each row as a unique "log row". Let's do that now by creating a new table:

In [118]:
conn.executescript("""

DROP TABLE IF EXISTS raw_sessions_2;

CREATE TABLE raw_sessions_2 AS 
SELECT
    ROW_NUMBER() OVER () AS row_id,
    *
    FROM raw_sessions;
    """)

pd.read_sql_query("""
SELECT row_id, "Plug In Event Id", "Start Date", "Station Name"
FROM raw_sessions_2
LIMIT 5;
""", conn)

Unnamed: 0,row_id,Plug In Event Id,Start Date,Station Name
0,1,3,7/29/2011 20:17,PALO ALTO CA / HAMILTON #1
1,2,4,7/30/2011 0:00,PALO ALTO CA / HAMILTON #1
2,3,5,7/30/2011 8:16,PALO ALTO CA / HAMILTON #1
3,4,6,7/30/2011 14:51,PALO ALTO CA / HAMILTON #1
4,5,7,7/30/2011 18:51,PALO ALTO CA / HAMILTON #1
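
One caveat about the cell above: ROW_NUMBER() OVER () has no ORDER BY, so SQL does not promise which row receives which number (here it happens to follow the load order). A possible alternative, sketched below on a hypothetical demo_raw table, is to reuse SQLite's implicit rowid, which ordinary tables carry and which increases as rows are inserted into a fresh table.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute('CREATE TABLE demo_raw ("Start Date" TEXT)')
demo.executemany(
    "INSERT INTO demo_raw VALUES (?)",
    [("7/29/2011 20:17",), ("7/30/2011 0:00",), ("7/30/2011 8:16",)],
)

# Copy the table, exposing the implicit rowid as an explicit row_id column
demo.executescript("""
DROP TABLE IF EXISTS demo_raw_2;

CREATE TABLE demo_raw_2 AS
SELECT
    rowid AS row_id,
    *
FROM demo_raw
ORDER BY rowid;
""")

numbered = demo.execute(
    'SELECT row_id, "Start Date" FROM demo_raw_2 ORDER BY row_id'
).fetchall()
print(numbered)
```

Either approach gives each log row a unique identifier; the rowid version just makes the numbering order explicit.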


We want to make sure that each row is a unique session, and not just a log of an event. We can treat a session as the set of rows that share the same EVSE ID, user ID, and start date. Let's create a new table of sessions, where one row is one session:

In [119]:
conn.executescript("""
DROP TABLE IF EXISTS unique_sessions;

CREATE TABLE unique_sessions AS
SELECT
    MIN(row_id) as session_id,
    "EVSE ID" AS evse_id,
    "User ID" as user_id,
    "Start Date" AS start_datetime,
    MAX("End Date") AS end_datetime,
    SUM("Energy (kWh)") AS total_energy_kwh,
    SUM(COALESCE(Fee, 0)) AS total_fee, 
    COUNT(*) AS log_rows
    FROM raw_sessions_2
    GROUP BY 
        "EVSE ID",
        "User ID",
        "Start Date";
        """)

<sqlite3.Cursor at 0x7f32f8b385c0>

Let's do some sanity checks:

In [120]:
pd.read_sql_query("SELECT COUNT(*) as n_sessions FROM unique_sessions;", conn)

Unnamed: 0,n_sessions
0,259362


There are now 259362 distinct sessions, only 53 fewer than the number of rows (259415). This tells us that very few rows share an identical EVSE ID, User ID, and Start Date. One hypothesis is that Start Date is too rigid a key: in the examples we saw earlier, it is precise to the minute. If we want our sessions to be accurate, we can keep the same criteria but replace the exact start time with a "bucket" or time frame, so that rows inside that time frame with the same EVSE ID and User ID are considered one session. This solves the issue of time being too precise. Let's create this table now:

In [121]:
# To do this, we first create a new variable, start_day, which is the substring of
# Start Date that holds just the day. This is part of what we will group by.
# We also keep only rows where EVSE ID, Start Date, and User ID are not null,
# as otherwise we can't comfortably group rows together into a single session.

conn.executescript("""
DROP TABLE IF EXISTS sessions;

CREATE TABLE sessions AS
SELECT
    MIN(row_id) AS session_id,
    CAST("EVSE ID" AS TEXT) AS evse_id,
    NULLIF(TRIM("User ID"), '') AS user_id,

    SUBSTR("Start Date", 1, INSTR("Start Date", ' ') - 1) AS start_day,

    MIN("Start Date") AS first_start_date_time_raw,
    MAX("End Date") AS last_end_datetime_raw,

    SUM(COALESCE("Energy (kWh)", 0)) AS total_energy_kwh,
    SUM(COALESCE(Fee, 0)) AS total_fee,
    COUNT(*) as log_rows
FROM raw_sessions_2
WHERE "EVSE ID" IS NOT NULL
    AND "Start Date" IS NOT NULL
    AND "User ID" IS NOT NULL
GROUP BY 
    CAST("EVSE ID" AS TEXT),
    NULLIF(TRIM("User ID"), ''),
    SUBSTR("Start Date", 1, INSTR("Start Date", ' ') -1);
    """)


# Let's now make sure that this didn't merge too many sessions together:

print(pd.read_sql_query("""
SELECT COUNT(*) AS n_sessions
FROM sessions;
""", conn))

# Let's also confirm that the number of log rows per session isn't unreasonable at the highest end:

print(pd.read_sql_query("""
SELECT evse_id, user_id, start_day, log_rows, total_energy_kwh, total_fee 
FROM sessions
ORDER BY log_rows DESC
LIMIT 10;
""", conn))




   n_sessions
0      170995
    evse_id   user_id  start_day  log_rows  total_energy_kwh  total_fee
0  109973.0    482179  7/27/2018         7            30.946       7.07
1  174559.0  21485181  5/18/2020         5             7.939       1.83
2  174559.0  21485181  6/11/2020         5            14.370       3.30
3  104427.0    804075  5/25/2018         4             9.630       2.21
4  104427.0    804075  6/10/2018         4            24.285       5.59
5  106099.0    483947  4/13/2017         4            13.357       0.00
6  109785.0    332155  5/27/2020         4             5.003       2.11
7  109785.0    858994  8/19/2017         4            11.480       2.64
8  109785.0    858994  8/20/2017         4             6.369       1.47
9  109973.0   2320571  6/15/2018         4            12.342       2.85


Great, we can see that bucketing by day doesn't create too many merges: the largest session combines only 7 log rows. Grouping all same-day rows into one session may seem too coarse, but the way charging stations are used, and the way we will analyze sessions, justifies it. An electric vehicle rarely needs to be charged twice in one day, and when it does (on a road trip, for example), the second charge would likely happen at a different station, which our criteria already classify as a separate session. In the rare case where the same user charges at the same station twice on the same day, we can still treat those rows as a single session, since the events are similar enough for our analysis.
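
To see the day-bucketing rule in isolation, here is a small sketch on invented rows (the demo_rows table and its values are hypothetical). Two rows from the same station, user, and day collapse into one session, while a different day or a different station stays separate.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("""
CREATE TABLE demo_rows (
    row_id         INTEGER,
    "EVSE ID"      TEXT,
    "User ID"      TEXT,
    "Start Date"   TEXT,
    "Energy (kWh)" REAL
)
""")
demo.executemany(
    "INSERT INTO demo_rows VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A", "u1", "7/27/2018 8:05", 5.0),
        (2, "A", "u1", "7/27/2018 17:40", 7.5),  # same station, user, and day as row 1
        (3, "A", "u1", "7/28/2018 8:10", 4.0),   # next day
        (4, "B", "u1", "7/27/2018 12:00", 2.0),  # different station
    ],
)

# Everything before the first space in "Start Date" is the day bucket
bucketed = demo.execute("""
SELECT
    MIN(row_id) AS session_id,
    "EVSE ID",
    "User ID",
    SUBSTR("Start Date", 1, INSTR("Start Date", ' ') - 1) AS start_day,
    SUM("Energy (kWh)") AS total_energy_kwh,
    COUNT(*) AS log_rows
FROM demo_rows
GROUP BY
    "EVSE ID",
    "User ID",
    SUBSTR("Start Date", 1, INSTR("Start Date", ' ') - 1)
ORDER BY session_id;
""").fetchall()

print(bucketed)
```

The log_rows column is what lets us audit the rule afterwards: any session with a large count is worth a closer look.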

Now we've created our sessions table, with one row per session. Let's now create our stations table, which holds information on each station and is keyed by a station ID (evse_id):

In [122]:
conn.executescript("""
DROP TABLE IF EXISTS stations;

CREATE TABLE stations AS
SELECT
    CAST("EVSE ID" AS TEXT) AS evse_id,
    MAX("Station Name") AS station_name,
    MAX("Org Name") AS org_name,
    MAX("Address 1") AS address_1,
    MAX("City") AS city,
    MAX("State/Province") AS state_province,
    MAX("Postal Code") AS postal_code,
    MAX("Country") AS country,
    MAX("County") AS county,
    MAX(Latitude) AS latitude,
    MAX(Longitude) AS longitude

FROM raw_sessions_2
WHERE "EVSE ID" IS NOT NULL
GROUP BY CAST("EVSE ID" AS TEXT);
""")

# Let's see how many stations there are:

print(pd.read_sql_query("SELECT COUNT(*) AS n_stations FROM stations;", conn))

# Let's see if any are missing coordinates:

print(pd.read_sql_query("""
SELECT 
    SUM(CASE WHEN latitude IS NULL OR longitude IS NULL THEN 1 ELSE 0 END) as stations_missing_coords
FROM stations;
""", conn))

# Let's finally just take a peek at our table:

print(pd.read_sql_query("""
SELECT *
FROM stations 
LIMIT 10;
""", conn))




      

   n_stations
0          51
   stations_missing_coords
0                        0
    evse_id                    station_name            org_name  \
0  104339.0          PALO ALTO CA / HIGH #1  City of Palo Alto    
1  104427.0       PALO ALTO CA / WEBSTER #1  City of Palo Alto    
2  106099.0       PALO ALTO CA / WEBSTER #2  City of Palo Alto    
3  107367.0       PALO ALTO CA / WEBSTER #3  City of Palo Alto    
4  107427.0     PALO ALTO CA / CAMBRIDGE #2  City of Palo Alto    
5  109701.0     PALO ALTO CA / CAMBRIDGE #1  City of Palo Alto    
6  109783.0  PALO ALTO CA / TED THOMPSON #1  City of Palo Alto    
7  109785.0      PALO ALTO CA / HAMILTON #2  City of Palo Alto    
8  109973.0          PALO ALTO CA / HIGH #4  City of Palo Alto    
9  174559.0          PALO ALTO CA / HIGH #3  City of Palo Alto    

           address_1       city state_province  postal_code        country  \
0        528 High St  Palo Alto     California        94301  United States   
1     532 Webster St  Pa

We now need to make a table for users, keyed by User ID. For most of our analysis, the stations and sessions tables will likely be more prevalent. However, a users table can help us examine usage concentration, station dependency, or temporal behavior. Since we have little information per user, it cannot be treated as a full profile. Let's create this table, which will be based on the sessions table we created earlier:

In [123]:
conn.executescript("""
DROP TABLE IF EXISTS users; 

CREATE TABLE users AS 
SELECT 
    user_id, 
    
    MIN(start_day) AS first_session_day,
    MAX(start_day) AS last_session_day,
    
    COUNT(*) AS total_sessions,
    SUM(total_energy_kwh) AS total_user_energy,
    COUNT(DISTINCT evse_id) AS distinct_stations_used

FROM sessions
WHERE user_id IS NOT NULL
GROUP BY user_id;
""")

# Let's see how many users we have:

print(pd.read_sql_query("""
SELECT COUNT(*) AS n_users
FROM users;
""", conn))

# Let's also see how many sessions the most dedicated users have:

print(pd.read_sql_query("""
SELECT * 
FROM users
ORDER BY total_sessions DESC
LIMIT 10;
""", conn))

   n_users
0    17950
  user_id first_session_day last_session_day  total_sessions  \
0  804075          1/1/2018         9/9/2018             780   
1  523487         1/10/2017         9/9/2019             761   
2  524259         1/10/2017         9/9/2016             759   
3  779957          1/1/2018         9/9/2019             745   
4  453469         1/10/2017         9/9/2016             733   
5  784343          1/1/2019         9/9/2020             690   
6  485553         1/10/2017         9/9/2020             624   
7  546163         1/10/2018         9/9/2018             610   
8  643155         1/10/2017         9/8/2018             586   
9  283441         1/11/2017         9/9/2017             558   

   total_user_energy  distinct_stations_used  
0        7129.669000                      13  
1        4254.569895                       7  
2        4944.780032                       5  
3        2356.873000                      23  
4        7519.080852                  
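
One thing stands out in the output above: for some users, last_session_day falls before first_session_day (for example 1/10/2017 and 9/9/2016). That is because start_day is stored as M/D/YYYY text, so MIN and MAX compare it character by character rather than as dates. The sketch below reproduces the effect on invented rows and shows one possible remedy; iso_day is a hypothetical helper registered through sqlite3's create_function, not something the notebook defines.

```python
import sqlite3
from datetime import datetime


def iso_day(us_day):
    """Convert 'M/D/YYYY' text to 'YYYY-MM-DD' so text order matches date order."""
    if us_day is None:
        return None
    return datetime.strptime(us_day, "%m/%d/%Y").strftime("%Y-%m-%d")


demo = sqlite3.connect(":memory:")
demo.create_function("iso_day", 1, iso_day)  # expose the helper to SQL

demo.execute("CREATE TABLE demo_sessions (user_id TEXT, start_day TEXT)")
demo.executemany(
    "INSERT INTO demo_sessions VALUES (?, ?)",
    [("u1", "1/10/2017"), ("u1", "9/9/2016"), ("u1", "12/3/2019")],
)

# Text comparison: '9/9/2016' sorts last because '9' > '1'
as_text = demo.execute("""
SELECT MIN(start_day), MAX(start_day)
FROM demo_sessions
GROUP BY user_id;
""").fetchone()

# ISO strings sort chronologically
as_dates = demo.execute("""
SELECT MIN(iso_day(start_day)), MAX(iso_day(start_day))
FROM demo_sessions
GROUP BY user_id;
""").fetchone()

print(as_text, as_dates)
```

The same concern applies to any MIN or MAX taken over the raw "Start Date" and "End Date" strings, so the first and last day columns are best read with this caveat in mind.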

Let's create one more table before we move on to visualizations. One row in this table will equal one station on one day, and it will answer questions like:

- How busy was each station per day?

- How does usage change over time?

- Which stations deliver the most energy?

We'll derive this table from our sessions and stations tables. If a station had, for example, 12 sessions in one day, our new table will have 1 row for that station and day, where sessions would be 12.

Let's create this:

In [127]:
conn.executescript("""

DROP TABLE IF EXISTS station_daily_utilization;

CREATE TABLE station_daily_utilization AS 

SELECT
    st.evse_id,
    st.station_name,
    st.org_name,
    st.address_1,
    st.city,
    st.state_province,
    st.postal_code,
    st.country,
    st.county,
    st.latitude,
    st.longitude,

    se.start_day,

    COUNT(*) AS sessions,
    AVG(se.total_energy_kwh) AS avg_energy_per_session,
    SUM(se.total_fee) AS total_fee

FROM sessions se
JOIN stations st
ON se.evse_id = st.evse_id

GROUP BY
    st.evse_id,
    st.station_name,
    st.org_name,
    st.address_1,
    st.city,
    st.state_province,
    st.postal_code,
    st.country,
    st.county,
    st.latitude,
    st.longitude,
    se.start_day;
    """)

# Let's check how many rows there are in this table, and what it looks like:

pd.read_sql_query(
    """
    SELECT COUNT(*) AS rows
    FROM station_daily_utilization;
    """, conn)

pd.read_sql_query("""
SELECT station_name, start_day, sessions, avg_energy_per_session
FROM station_daily_utilization
ORDER BY sessions DESC 
LIMIT 10;
""", conn)
    

Unnamed: 0,station_name,start_day,sessions,avg_energy_per_session
0,PALO ALTO CA / HAMILTON #2,11/21/2019,16,7.41125
1,PALO ALTO CA / HAMILTON #2,11/22/2019,16,5.728187
2,PALO ALTO CA / HAMILTON #2,6/28/2017,16,6.434125
3,PALO ALTO CA / HAMILTON #2,6/28/2019,16,5.5465
4,PALO ALTO CA / HAMILTON #2,6/29/2019,16,4.395625
5,PALO ALTO CA / HIGH #1,2/2/2017,15,7.173333
6,PALO ALTO CA / WEBSTER #1,2/8/2017,15,8.0134
7,PALO ALTO CA / WEBSTER #1,7/10/2017,15,9.187
8,PALO ALTO CA / WEBSTER #2,9/15/2016,15,6.749909
9,PALO ALTO CA / WEBSTER #3,8/12/2016,15,6.476083


Great, now we have a table that uses both sessions and stations to track how each station is used per day. Let's start prepping our tables for visualization, which we will use Tableau for.

The first thing we can do is load the tables we created with SQL into pandas DataFrames, which we will later export as CSVs for Tableau. Our visualizations will only use users and station_daily_utilization (sessions and stations are already embedded in the latter), so we will just load those:
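
One of the questions listed earlier was which stations deliver the most energy, while the table above stores only the average energy per session. A station-day total could be added with one more aggregate. The sketch below shows the idea on invented rows; demo_stations, demo_sessions, and the total_energy_kwh_day column are hypothetical names used only for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE demo_stations (evse_id TEXT PRIMARY KEY, station_name TEXT);
CREATE TABLE demo_sessions (
    session_id       INTEGER,
    evse_id          TEXT,
    start_day        TEXT,
    total_energy_kwh REAL
);

INSERT INTO demo_stations VALUES ('A', 'DEMO #1'), ('B', 'DEMO #2');
INSERT INTO demo_sessions VALUES
    (1, 'A', '7/27/2018', 6.0),
    (2, 'A', '7/27/2018', 4.0),
    (3, 'A', '7/28/2018', 5.0),
    (4, 'B', '7/27/2018', 3.0);
""")

# One output row per station per day, with a count, an average, and a total
daily = demo.execute("""
SELECT
    st.station_name,
    se.start_day,
    COUNT(*)                 AS sessions,
    AVG(se.total_energy_kwh) AS avg_energy_per_session,
    SUM(se.total_energy_kwh) AS total_energy_kwh_day
FROM demo_sessions se
JOIN demo_stations st
ON se.evse_id = st.evse_id
GROUP BY st.evse_id, st.station_name, se.start_day
ORDER BY st.evse_id, se.start_day;
""").fetchall()

print(daily)
```

Since the total equals sessions multiplied by the average, it can also be derived later in Tableau; storing it directly just saves a calculated field.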

In [129]:
station_daily_df = pd.read_sql_query(
    "SELECT * FROM station_daily_utilization;",
    conn
)

users_df = pd.read_sql_query(
    "SELECT * FROM users;", 
    conn
)

The second thing to note is that if we want to do any temporal analysis/inference, we need to configure our start_day variable to be time-orderable, as it is currently a string. We can do this part with pandas:

In [132]:
station_daily_df["session_date"] = pd.to_datetime(
    station_daily_df["start_day"],
    format="%m/%d/%Y",
    errors="coerce"
)

# Check whether anything failed to parse (ideally this prints 0):
print(station_daily_df["session_date"].isna().sum())


# Let's add time features like year, month, and weekday that we can potentially use later
# (derived while the column is still a datetime):

station_daily_df["year"] = station_daily_df["session_date"].dt.year
station_daily_df["month"] = station_daily_df["session_date"].dt.month
station_daily_df["weekday"] = station_daily_df["session_date"].dt.day_name()

# Finally, let's make it a pure date:
station_daily_df["session_date"] = station_daily_df["session_date"].dt.date

# Let's do the same parsing for the users table, with first_session_day and last_session_day:

users_df["first_session_date"] = pd.to_datetime(
    users_df["first_session_day"],
    format="%m/%d/%Y",
    errors="coerce"
)

users_df["last_session_date"] = pd.to_datetime(
    users_df["last_session_day"],
    format="%m/%d/%Y",
    errors="coerce"
)

# And a quick failure check:
print(users_df[["first_session_date", "last_session_date"]].isna().sum())

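# Let's also create a metric for each user's active span in days.
# Caveat: first_session_day and last_session_day came from MIN/MAX over M/D/YYYY text
# in SQL, which compares as text, so this span can be wrong or even negative: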

users_df["active_span_days"] = (
    users_df["last_session_date"] - users_df["first_session_date"]
).dt.days

0
first_session_date    0
last_session_date     0
dtype: int64


Great, now to get our tables ready for visualization, let's save them as CSVs in our data folder:

In [138]:
station_daily_df.to_csv("data/station_daily_utilization_tableau.csv", index=False)
users_df.to_csv("data/users_tableau.csv", index=False)