# Identifying Missing Data in the Caltrans/PeMS Dataset

The California Department of Transportation (Caltrans) collects data that describe the flow of traffic on California freeways and stores them in a database called PeMS. The data record vehicle counts per unit time, measured by roughly 45,000 sensors at a 30-second cadence. The sensor type varies considerably; examples include radar and magnetometers (see Chapter 1 of the [Introduction to PeMS User Guide](https://pems.dot.ca.gov/Papers/PeMS_Intro_User_Guide_v6.pdf)).

In some cases, these data are missing: faulty or broken sensors do not collect data, or sensor data are not transmitted wirelessly back to PeMS. In addition, Caltrans performs calculations to convert the raw sensor data into physical observables such as speed. These calculations rely on assumptions, such as an assumed vehicle length, $g$. Depending on the quality of those assumptions, the derived data can contain errors.
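To make the role of $g$ concrete, here is a minimal sketch of a common single-sensor speed estimate, in which speed is proportional to the vehicle count times $g$ divided by occupancy. This is an illustration under stated assumptions, not necessarily the exact calculation PeMS performs: the function name, the 20 ft default for `g_feet`, and the use of occupancy as a fraction of the sampling period are all hypothetical choices.

```python
import math

FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600


def estimate_speed_mph(count, occupancy, g_feet=20.0, period_s=30.0):
    """Estimate speed from one sample of count and occupancy.

    count     : vehicles counted during the sampling period
    occupancy : fraction of the period the sensor was occupied (0 to 1)
    g_feet    : assumed effective vehicle length (illustrative default)
    period_s  : sampling period in seconds (30 s cadence)
    """
    # Without a usable occupancy reading, speed cannot be derived.
    if count is None or occupancy is None or occupancy <= 0:
        return math.nan
    # Total vehicle length that passed, divided by the time the sensor was occupied.
    feet_per_second = g_feet * count / (occupancy * period_s)
    return feet_per_second * SECONDS_PER_HOUR / FEET_PER_MILE


example_speed = estimate_speed_mph(count=10, occupancy=0.10)
missing_speed = estimate_speed_mph(count=10, occupancy=0.0)
```

Because the estimate scales linearly with `g_feet`, a 10% error in the assumed vehicle length produces a 10% error in the derived speed, and a missing occupancy reading propagates to a missing speed.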

In this notebook, we will take a look at the nature of the missing data. Some questions to ask:
1. Are all the data available for the most recent year, 2023?
2. Are all the data available for the most recent decade, 2013-2023?
3. If data are missing, do they occur in any spatial or temporal clusters?
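
One way to approach the first two questions is to compare the number of samples each station actually reported against the number expected at a 30-second cadence (2,880 per day). The sketch below uses synthetic data; the column names `station_id` and `sample_time` and the helper `daily_completeness` are hypothetical and do not come from the PeMS schema.

```python
import pandas as pd

SAMPLES_PER_DAY = 24 * 60 * 60 // 30  # 2880 samples at a 30-second cadence


def daily_completeness(obs: pd.DataFrame) -> pd.DataFrame:
    """Fraction of expected 30-second samples present per station and day."""
    counts = (
        obs.assign(day=obs["sample_time"].dt.normalize())
        .groupby(["station_id", "day"])["sample_time"]
        .nunique()  # count distinct timestamps so duplicates are not double-counted
        .rename("n_samples")
        .reset_index()
    )
    counts["completeness"] = counts["n_samples"] / SAMPLES_PER_DAY
    return counts


# Synthetic example: station 1 reports a full day, station 2 only the first half.
step = pd.Timedelta(seconds=30)
full_day = pd.date_range("2023-01-01", periods=SAMPLES_PER_DAY, freq=step)
half_day = pd.date_range("2023-01-01", periods=SAMPLES_PER_DAY // 2, freq=step)
obs = pd.DataFrame(
    {
        "station_id": [1] * len(full_day) + [2] * len(half_day),
        "sample_time": list(full_day) + list(half_day),
    }
)
completeness = daily_completeness(obs)
```

Note that a station-day with no samples at all does not appear in this result, so a real analysis would also need to compare against the full list of stations and dates.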

### Setup

In [1]:
import ibis
import os
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt

from dotenv import load_dotenv
from functools import reduce

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 3000)

In [7]:
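# Load Snowflake credentials from a local .env file; override=True lets
# values in .env take precedence over existing environment variables.
load_dotenv(override=True)
USERNAME = os.getenv('USERNAME')
PASSWORD = os.getenv('PASSWORD')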

In [11]:
# Connect to Snowflake through ibis.
con = ibis.snowflake.connect(
    user=USERNAME,
    password=PASSWORD,
    role='TRANSFORMER_DEV',
    warehouse='TRANSFORMING_XS_DEV',
    account="VSB79059-DSE_CALTRANS_PEMS",
    # The database argument is given as "DATABASE/SCHEMA".
    database="RAW_DEV/CLEARINGHOUSE",
)

In [12]:
# Lazy table expression; no rows are fetched until .execute() is called.
station_metadata = con.table("STATION_META")

Insufficient privileges to operate on account 'NGB13288'


In [13]:
station_metadata

In [14]:
# Fetch only the first 10 rows into a pandas DataFrame.
station_metadata_df = station_metadata.execute(limit=10)

In [15]:
station_metadata_df

|  | FILENAME | ID | FWY | DIR | DISTRICT | COUNTY | CITY | STATE_PM | ABS_PM | LATITUDE | LONGITUDE | LENGTH | TYPE | LANES | NAME | USER_ID_1 | USER_ID_2 | USER_ID_3 | USER_ID_4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | clhouse/meta/d03/2023/11/d03_text_meta_2023_11... | 308511 | 50 | E | 3 | 17 |  | 31.627 | 60.162 | 38.761062 | -120.569835 | 3.134 | ML | 2 | Sly Park Rd | 1 |  |  |  |
| 1 | clhouse/meta/d03/2023/11/d03_text_meta_2023_11... | 308512 | 50 | W | 3 | 17 |  | 31.627 | 60.166 | 38.761182 | -120.569866 | 3.995 | ML | 2 | Sly Park Rd | 1 |  |  |  |
| 2 | clhouse/meta/d03/2023/11/d03_text_meta_2023_11... | 311831 | 5 | S | 3 | 67 |  | 10.896 | 506.189 | 38.409782 | -121.48412 |  | OR | 1 | Elk Grove Blvd | 1 |  |  |  |
| 3 | clhouse/meta/d03/2023/11/d03_text_meta_2023_11... | 311832 | 5 | S | 3 | 67 |  | 10.896 | 506.189 | 38.409782 | -121.48412 |  | FR | 1 | Elk Grove Blvd | 1 |  |  |  |
| 4 | clhouse/meta/d03/2023/11/d03_text_meta_2023_11... | 311844 | 5 | N | 3 | 67 |  | 11.08 | 506.373 | 38.412421 | -121.484289 |  | OR | 2 | Elk Grove Blvd 5NB Slip | 1 |  |  |  |
| 5 | clhouse/meta/d03/2023/11/d03_text_meta_2023_11... | 311847 | 5 | N | 3 | 67 |  | 12.185 | 507.478 | 38.428258 | -121.487578 |  | OR | 3 | Laguna Blvd to 5NB Slip | 1 |  |  |  |
| 6 | clhouse/meta/d03/2023/11/d03_text_meta_2023_11... | 311864 | 5 | N | 3 | 67 |  | 11.933 | 507.226 | 38.424648 | -121.486808 |  | FR | 1 | 5NB to Laguna Blvd | 1 |  |  |  |
| 7 | clhouse/meta/d03/2023/11/d03_text_meta_2023_11... | 311903 | 50 | E | 3 | 67 | 64000.0 | L0.633 | 3.789 | 38.566906 | -121.505888 | 0.883 | ML | 3 | 50EB at 6TH Street | 1 |  |  |  |
| 8 | clhouse/meta/d03/2023/11/d03_text_meta_2023_11... | 311930 | 50 | E | 3 | 67 | 64000.0 | L0.632 | 3.788 | 38.566911 | -121.505906 |  | FF | 3 | 5NB and 5SB to 50EB | 1 |  |  |  |
| 9 | clhouse/meta/d03/2023/11/d03_text_meta_2023_11... | 311973 | 50 | E | 3 | 67 | 64000.0 | L1.22 | 4.376 | 38.564153 | -121.495585 |  | OR | 1 | 13th St | 1 |  |  |  |
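
Even this ten-row metadata sample has empty fields, for example in `CITY` and `LENGTH`. As a first, hedged look at missingness, the sketch below rebuilds four of the rows shown above (a subset of columns only) and computes the fraction of missing values per column and per station `TYPE`. The variable names are hypothetical, and the pattern it shows holds only for this small sample.

```python
import numpy as np
import pandas as pd

# Four rows copied from the sample output above (subset of columns).
sample = pd.DataFrame(
    {
        "ID": [308511, 311831, 311903, 311930],
        "TYPE": ["ML", "OR", "ML", "FF"],
        "CITY": [np.nan, np.nan, 64000.0, 64000.0],
        "LENGTH": [3.134, np.nan, 0.883, np.nan],
    }
)

# Fraction of missing values in each column.
missing_by_column = sample.isna().mean()

# Fraction of rows with a missing LENGTH, grouped by station type.
length_missing_by_type = sample["LENGTH"].isna().groupby(sample["TYPE"]).mean()

print(missing_by_column)
print(length_missing_by_type)
```

In the rows displayed here, `LENGTH` is populated only for `ML` stations; whether that holds across the full table would need to be checked against all of `STATION_META`.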
