# Identifying Missing Data in the Caltrans/PeMS Dataset

The California Department of Transportation (Caltrans) collects data that describe the flow of traffic on California freeways and stores them in a database called PeMS. The data record vehicle counts per unit time, measured by roughly 45,000 sensors on a 30-second cadence. Sensor types vary and include, for example, radar and magnetometers (see Chapter 1 of the [Introduction to PeMS User Guide](https://pems.dot.ca.gov/Papers/PeMS_Intro_User_Guide_v6.pdf)).

In some cases, these data are missing: faulty or broken sensors do not collect data, or the sensor data are not transmitted wirelessly back to PeMS. In addition, Caltrans performs some calculations to convert the raw sensor data into physical observables such as speed. These calculations rely on assumptions such as the length of the vehicle, or $g$. Depending on how well those assumptions hold, the derived values can contain errors.
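To illustrate how an assumed vehicle length propagates into a derived speed, the sketch below uses the common single-detector relation speed = flow × g / occupancy. It is not taken from the PeMS code base: the function name `estimate_speed_mph`, the 30-second sample length, and the example numbers are all assumptions chosen for illustration.

```python
import math

FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600


def estimate_speed_mph(count, occupancy, g_feet, sample_seconds=30.0):
    """Estimate speed from one sample's vehicle count and occupancy.

    count          : vehicles counted during the sample (hypothetical input)
    occupancy      : fraction of the sample the detector was occupied (0-1)
    g_feet         : assumed effective vehicle length in feet
    sample_seconds : sample length in seconds (assumed to be 30)
    """
    if occupancy <= 0:
        # No occupancy means the relation is undefined; report a missing value.
        return math.nan
    flow_per_second = count / sample_seconds
    speed_feet_per_second = flow_per_second * g_feet / occupancy
    return speed_feet_per_second * SECONDS_PER_HOUR / FEET_PER_MILE


# Same hypothetical sample, two different assumed vehicle lengths.
speed_g20 = estimate_speed_mph(count=10, occupancy=0.10, g_feet=20.0)
speed_g22 = estimate_speed_mph(count=10, occupancy=0.10, g_feet=22.0)
print(round(speed_g20, 1), round(speed_g22, 1))
```

Because the estimate is proportional to $g$, a 10% error in the assumed length produces a 10% error in the estimated speed.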

In this notebook, we will take a look at the nature of the missing data. Some questions to ask:
1. Are all the data available for District 5 from the District Map and County Chart during 2023?
2. If data are missing, do the gaps occur in spatial or temporal clusters?
3. Are there any outliers or unexpected values in the data?
4. Are all the data available for all the available districts over the most recent decade, 2013-2023? Again, is there any pattern to the missing data? Are there any odd values?
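For the question about temporal clusters, one possible approach is to compare consecutive timestamps against the expected 30-second cadence and report every interval that is longer. The sketch below runs on synthetic timestamps; the helper `find_gaps` and its column names are hypothetical and are not part of the PeMS tables.

```python
import pandas as pd

EXPECTED_CADENCE = pd.Timedelta(seconds=30)


def find_gaps(timestamps, cadence=EXPECTED_CADENCE):
    """Return one row per gap longer than the expected cadence."""
    ts = pd.Series(pd.to_datetime(timestamps)).sort_values().reset_index(drop=True)
    delta = ts.diff()        # time since the previous sample
    mask = delta > cadence   # True where at least one sample is missing
    gaps = pd.DataFrame({
        "gap_start": ts.shift(1)[mask],  # last sample before the gap
        "gap_end": ts[mask],             # first sample after the gap
        "missing_samples": (delta[mask] / cadence).round().astype(int) - 1,
    })
    return gaps.reset_index(drop=True)


# Synthetic example: ten samples at 30 s, with samples 3, 4 and 7 removed.
full = pd.date_range("2023-01-01", periods=10, freq=pd.Timedelta(seconds=30))
observed = full.delete([3, 4, 7])
gaps = find_gaps(observed)
print(gaps)
```

Long runs of consecutive missing samples appear as a single row with a large `missing_samples` value, which makes clustered outages easy to separate from isolated dropped samples.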

### Setup

In [1]:
import ibis
import os
import itertools

import numpy as np
import pandas as pd
import seaborn as sns
import ibis.selectors as s
import matplotlib.pyplot as plt

from dotenv import load_dotenv
from functools import reduce
from datetime import datetime as dt_obj

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 3000)

In [2]:
load_dotenv(override=True)
USERNAME = os.getenv('USERNAME')
PASSWORD = os.getenv('PASSWORD')

In [3]:
con = ibis.snowflake.connect(
    user=USERNAME,
    password=PASSWORD,
    role='TRANSFORMER_DEV',
    warehouse='TRANSFORMING_XS_DEV',
    account="VSB79059-DSE_CALTRANS_PEMS",
    database="RAW_DEV/CLEARINGHOUSE",
)

In [4]:
station_metadata = con.table("STATION_META");
station_raw = con.table("STATION_RAW");
station_status = con.table("STATION_STATUS");

Insufficient privileges to operate on account 'NGB13288'


In [5]:
station_metadata_df = station_metadata.execute(limit=10)
station_raw_df = station_raw.execute(limit=10)
station_status_df = station_status.execute(limit=10)

### Question 1. Are all the data available for District 5 from the [District Map and County Chart](https://cwwp2.dot.ca.gov/documentation/district-map-county-chart.htm) during 2023?

1. Filter the station metadata file to look at `DISTRICT` values of 5.
2. Discard old versions of the metadata by keeping only the most recent record for each unique ID.
3. Filter the raw data file to look at the selected IDs from Step 2.
4. Identify the time period associated with the selected ID.
   - If the time period is not within 2023, drop the ID.
   - If the time period is within 2023, merge the raw and metadata files on the `ID` column.
5. Look at the `SAMPLE_TIMESTAMP` column in the `STATION_RAW` table.

##### 1. Filter the station metadata file to look at `DISTRICT` values of 5.

In [6]:
district_5_filter = station_metadata.filter(station_metadata["DISTRICT"] == "5")

In [7]:
district_5_df = district_5_filter.execute()

##### 2. Discard old versions of the metadata by keeping only the most recent record for each unique ID.

Identify the number of unique values of ID in District 5.

In [8]:
print('There are {} unique values of ID in District 5.'.format(district_5_df['ID'].nunique()))

There are 706 unique values of ID in District 5.


Construct a new column called `DATA_VERSION`.

In [9]:
district_5_df['DATA_VERSION'] = np.nan  # np.NaN was removed in NumPy 2.0

Extract the date from the `FILENAME` column and store it in the `DATA_VERSION` column.

In [10]:
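# Characters 39-48 of each FILENAME hold the version date as YYYY_MM_DD.
# This fixed slice assumes every filename shares the same prefix length.
data_version = [dt_obj.strptime(filename[39:49], '%Y_%m_%d') for filename in district_5_df['FILENAME'].values]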

In [11]:
district_5_df['DATA_VERSION'] = data_version
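A fixed character slice breaks if any filename has a different prefix length. A pattern-based extraction is one possible alternative, sketched below. The filenames here are made up for illustration and may not match the real `FILENAME` values; only the `YYYY_MM_DD` date format is taken from the cell above.

```python
import pandas as pd

# Hypothetical filenames; the real FILENAME values may look different.
filenames = pd.Series([
    "d05_text_meta_2023_11_17.txt",
    "some/prefix/d05_text_meta_2011_07_15.txt",
    "no_date_here.txt",
])

# Pull out the first YYYY_MM_DD pattern regardless of where it appears.
date_text = filenames.str.extract(r"(\d{4}_\d{2}_\d{2})", expand=False)

# Filenames without a date become NaT instead of raising an error.
data_version = pd.to_datetime(date_text, format="%Y_%m_%d", errors="coerce")
print(data_version)
```

Rows that end up as `NaT` can then be inspected directly, which also surfaces unexpected filename formats.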

Select the most recent `DATA_VERSION` for each unique ID. Drop the rest.

In [12]:
unique_IDs = district_5_df['ID'].value_counts().index.to_list()

In [13]:
drop_these_rows = []
for unique_ID in unique_IDs:
    # All metadata rows (versions) for this station ID
    ID_subset = district_5_df[district_5_df['ID'] == unique_ID]
    # Keep the row with the latest DATA_VERSION and mark the others for removal
    index_for_max_value = ID_subset['DATA_VERSION'].idxmax()
    indices_for_rows_to_drop = ID_subset.drop(index_for_max_value).index.to_list()
    drop_these_rows.append(indices_for_rows_to_drop)

In [14]:
drop_these_rows_flattened = list(itertools.chain.from_iterable(drop_these_rows))

In [15]:
district_5_recent_version_df = district_5_df.drop(drop_these_rows_flattened).reset_index(drop=True)
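The same selection can be written more compactly with a grouped `idxmax`, which returns the index label of the latest `DATA_VERSION` within each `ID`. The sketch below uses a small made-up dataframe; only the column names `ID` and `DATA_VERSION` come from the notebook.

```python
import pandas as pd

# Toy metadata with several versions per station ID (values are made up).
toy_df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C"],
    "DATA_VERSION": pd.to_datetime([
        "2020-01-01", "2023-11-17",
        "2019-05-05", "2021-06-06", "2018-03-03",
        "2011-07-15",
    ]),
})

# Index label of the most recent DATA_VERSION within each ID group.
latest_index = toy_df.groupby("ID")["DATA_VERSION"].idxmax()

# Keep only those rows, one per ID.
toy_recent_df = toy_df.loc[latest_index].reset_index(drop=True)
print(toy_recent_df)
```

When two rows of one ID share the same maximum date, `idxmax` keeps the first occurrence, which matches the behavior of the loop above.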

Have the metadata been updated in 2023? Yes: the most recent `DATA_VERSION` falls in 2023.

In [16]:
max(district_5_recent_version_df['DATA_VERSION'])

Timestamp('2023-11-17 00:00:00')

##### 3. Filter the raw data file to look at the selected IDs from Step 2.

In [24]:
# Select dates within the year 2023 as a half-open interval [start, end)
date_selection_start = dt_obj(2023, 1, 1)
date_selection_end = dt_obj(2024, 1, 1)  # exclusive bound, so 2023-12-31 is included

In [20]:
ID_selection = district_5_recent_version_df["ID"][115]

In [21]:
district_5_recent_version_df["DATA_VERSION"][115]

Timestamp('2011-07-15 00:00:00')

In [22]:
ID_selection

'500001'

In [26]:
# date_and_ID_filter = station_raw.filter((station_raw["SAMPLE_DATE"] >= date_selection_start) & (station_raw["SAMPLE_DATE"] < date_selection_end)).filter(station_raw["ID"] == ID_selection)
# print(ID_selection)
# sample_dates_df = date_and_ID_filter.execute()
# print(len(sample_dates_df))

In [None]:
sample_dates_df

In [31]:
def select_data_per_ID(station_raw, district_5_recent_version_df, date_selection_start, date_selection_end):
    """
    Return the station IDs that have raw data within the selected period.
    One query is executed per station ID.
    
    Parameters
    ----------
    station_raw : Ibis table
        The raw station data
    district_5_recent_version_df : dataframe
        The district 5 recent version dataframe
    date_selection_start : datetime object
        The start date of the data selection (inclusive)
    date_selection_end : datetime object
        The end date of the data selection (exclusive)
    
    Returns
    -------
    list_of_stations : list
        A list of station IDs that contain data during the selected period.
    """

    list_of_stations = []

    for i in range(len(district_5_recent_version_df)):
        ID_selection = district_5_recent_version_df['ID'][i]
        # Keep rows for this station with start <= SAMPLE_DATE < end
        date_and_ID_filter = station_raw.filter(
            (station_raw["SAMPLE_DATE"] >= date_selection_start)
            & (station_raw["SAMPLE_DATE"] < date_selection_end)
        ).filter(station_raw["ID"] == ID_selection)
        # Each iteration runs one query against the warehouse
        sample_dates_df = date_and_ID_filter.execute()
        print(i, ID_selection, len(sample_dates_df))
        if not sample_dates_df.empty:
            list_of_stations.append(ID_selection)

    return list_of_stations

In [32]:
list_of_stations = select_data_per_ID(
    station_raw = station_raw,
    district_5_recent_version_df = district_5_recent_version_df,
    date_selection_start = date_selection_start,
    date_selection_end = date_selection_end
)

Insufficient privileges to operate on account 'NGB13288'


0 5000010092 0
1 5000010093 0
2 5000010101 0
3 5000010102 0
4 5000010121 0
5 5000010122 0
6 5000010132 0
7 5000010133 0
8 5000010142 0
9 5000010143 0
10 5000010152 0
11 5000010153 0
12 5000011021 0
13 5000011022 0
14 5000011042 0
15 5000011043 0
16 5000011052 0
17 5000011053 0
18 5000011062 0
19 5000011063 0
20 5000011072 0
21 5000011073 0
22 5000011092 0
23 5000011102 0
24 5000011112 0
25 5000011113 0
26 5000011121 0
27 5000011123 0
28 5000011141 0
29 5000011143 0
30 5001010022 0
31 5001010023 0
32 5001010031 0
33 5001010032 0
34 5001010052 0
35 5001010053 0
36 5001010061 0
37 5001010062 0
38 5001010082 0
39 5001010083 0
40 5001010102 0
41 5001010103 0
42 5001010112 0
43 5001010113 0
44 5001010122 0
45 5001010124 0
46 5001010131 0
47 5001010132 0
48 5001010142 0
49 5001010152 0
50 5001010153 0
51 5001011021 0
52 5001011032 0
53 5001011041 0
54 5001011081 0
55 5001011082 0
56 5001011091 0
57 5001011092 0
58 5001011111 0
59 5001011112 0
60 5001011121 0
61 5001011122 0
62 5001011131 0
63

KeyboardInterrupt: 
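The loop above issues one query per station, which is slow for several hundred IDs. A single pass that filters on the date window and on membership in the ID list, then collects the distinct IDs, returns the same information. The sketch below shows the logic in pandas on made-up rows. Ibis expressions offer similar `isin` and `distinct` operations, but how they perform against `STATION_RAW` has not been checked here.

```python
import pandas as pd

# Toy stand-in for STATION_RAW (IDs and dates are made up).
raw = pd.DataFrame({
    "ID": ["A", "A", "B", "C"],
    "SAMPLE_DATE": pd.to_datetime(
        ["2023-03-01", "2022-12-31", "2022-06-01", "2023-12-31"]
    ),
})
candidate_ids = ["A", "B", "C", "D"]

# Half-open window [start, end) covering all of 2023.
start = pd.Timestamp("2023-01-01")
end = pd.Timestamp("2024-01-01")

in_window = (raw["SAMPLE_DATE"] >= start) & (raw["SAMPLE_DATE"] < end)
is_candidate = raw["ID"].isin(candidate_ids)

# One filtered pass yields every station that reported inside the window.
stations_with_data = sorted(raw.loc[in_window & is_candidate, "ID"].unique())
stations_without_data = sorted(set(candidate_ids) - set(stations_with_data))
print(stations_with_data, stations_without_data)
```

The complement, `stations_without_data`, is the list of stations to investigate for missing data.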