# A Statistical Analysis of Smart Scale Projects in Virginia

#### Christopher Garcia, University of Mary Washington

## I. Introduction

The Virginia Smart Scale methodology was finalized in November 2017 to provide an objective scoring system for the state's transportation projects competing for funding. The complete technical specification is given in the November 2017 technical guide: [http://vasmartscale.org/documents/20171115/ss_technical_guide_nov13_2017.pdf](http://vasmartscale.org/documents/20171115/ss_technical_guide_nov13_2017.pdf).

The purpose of this analysis is to 1) provide descriptive statistics of recently funded projects, and 2) independently assess whether the Smart Scale scores assigned to these projects are consistent with the Smart Scale methodology specified in the technical guide. The analysis uses a dataset provided by the Fredericksburg Area Metropolitan Planning Organization (FAMPO), containing data on 404 projects. It is carried out in the R statistical computing environment within a Jupyter notebook, which allows the analysis to be both explained and executed in the same place. This makes the analytical methodology fully transparent, removes ambiguity, and makes the results reproducible.

## II. Data Inspection and Preparation

The data was provided in a Microsoft Excel spreadsheet. This was first converted to a CSV file (ss-data.csv) and then a number of transformation processes were applied to clean and scale the data as well as to impute missing values. To begin, we first define several functions which will be used in the data transformation:

In [43]:
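# Use R's default warning handling (warn=0): warnings are collected and shown
# after each top-level call. Setting warn=-1 instead would suppress them entirely.
options(warn=0)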

# Given a vector, a vector of original values, and a corresponding vector of 
# replacement values, apply the replacements and return the transformed vector.
recode <- function(vec, from.vals, to.vals) {
    f <- function(v) {
        # seq_along() is safe when from.vals is empty; isTRUE() skips NA comparisons.
        for(i in seq_along(from.vals)) {
            if(isTRUE(v == from.vals[i])) { return(to.vals[i]) }
        }
        return(v)
    }
    return(sapply(as.vector(vec), f))
}

# Function for imputing mean of a vector to its missing values.
impute.mean <- function(vec) {
    vec[is.na(vec)] = mean(vec, na.rm=TRUE)
    vec
}

# Function for imputing value of 0 to each missing value in the vector.
impute.0 <- function(vec) {
    vec[is.na(vec)] = 0
    vec
}
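
As a quick sanity check of what these helpers are meant to do, the following sketch applies them to toy vectors. The imputation helpers are restated compactly so the snippet runs on its own; `recode.v` is a hypothetical vectorized variant built on `match()`, shown only as an alternative formulation and not as the function used in this analysis. The toy values are made up for illustration.

```r
# Compact restatements of the imputation helpers so this sketch is self-contained.
impute.mean <- function(vec) { vec[is.na(vec)] <- mean(vec, na.rm = TRUE); vec }
impute.0    <- function(vec) { vec[is.na(vec)] <- 0; vec }

# Hypothetical vectorized alternative to recode(), using match() to look up
# each element's position in from.vals. Unmatched elements are left unchanged.
recode.v <- function(vec, from.vals, to.vals) {
    idx <- match(vec, from.vals)
    hit <- !is.na(idx)
    out <- vec
    out[hit] <- to.vals[idx[hit]]
    out
}

toy <- c(10, NA, 30)               # made-up values, one missing
mean.filled <- impute.mean(toy)    # NA replaced by the mean of 10 and 30
zero.filled <- impute.0(toy)       # NA replaced by 0

markers <- c("x", "", "x")         # present/absent markers as in the raw data
recoded <- recode.v(markers, c("x", ""), c("1", "0"))

print(mean.filled)
print(zero.filled)
print(recoded)
```

Mean imputation keeps the column average unchanged, whereas zero imputation pulls it down, which is why the choice of imputation function matters for the later score calculations.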

Next we read in the dataset and inspect the first few rows of data:

In [44]:
data <- read.csv('ss-data.csv')
head(data)

App.Id,Area.Type,District,Organization.Name,Project.Title,Statewide.High.Priority,District.Grant,Throughput.Score,Delay.Score,Crash.Frequency.Score,...,Travel.Time.Reliability.Score,Land.Use.Score,Project.Benefit.Score,Total.Project,Score.Divided.by.Total.Cost,SMART.SCALE.Request,SMART.SCALE.Score,Benefit.Rank,State.Rank,District.Rank
1414,A,NOVA,Northern Virginia Transportation Commission,VRE Fredericksburg Line Capacity Expansion,x,,17.69,87.66,100,...,-,64.016,64.236,216034920,2.973,92636120,6.934,1,109,16
1057,A,Hampton Roads,Hampton Roads Transportation Planning Organization,I-64 Southside Widening and High Rise Bridge - Phase 1,x,,100.0,88.96,45.16,...,29.044,10.464,62.042,600000000,1.034,100000000,6.204,2,120,22
1090,A,Hampton Roads,Hampton Roads Transportation Planning Organization,I-64/I-264 Interchange Improvements,x,,60.61,53.86,-,...,24.245,31.671,48.747,350091800,1.392,50000000,9.749,3,79,18
1293,A,NOVA,Prince William County,Route 234 At Balls Ford Intrchng and Rel/Widen Balls Ford Rd,x,x,65.48,100.0,15.61,...,1.05,-,41.289,126027000,3.276,124027000,3.329,4,170,30
1249,A,NOVA,Fairfax County,VA 286 - Popes Head Road Interchange,x,x,61.09,54.17,8.22,...,0.118,-,37.194,64303070,5.784,50558370,7.357,5,101,14
1240,A,NOVA,Loudoun County,Loudoun County Parkway (Shellhorn Road to US Route 50),x,x,45.27,69.78,5.43,...,1.6,16.364,33.732,112053000,3.01,112053000,3.01,6,179,32
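
The six rows displayed above appear consistent with a simple relationship between the benefit and score columns: `Score.Divided.by.Total.Cost` and `SMART.SCALE.Score` both look like `Project.Benefit.Score` divided by a cost expressed in units of $10 million (the total project cost and the Smart Scale request, respectively). The sketch below checks this on the displayed rows only. The relationship is an assumption inferred from these rows, and `head.rows` and `per.10m` are illustrative names, not part of the original analysis.

```r
# Values transcribed from the six rows displayed by head(data).
head.rows <- data.frame(
    App.Id        = c(1414, 1057, 1090, 1293, 1249, 1240),
    Benefit       = c(64.236, 62.042, 48.747, 41.289, 37.194, 33.732),
    Total.Project = c(216034920, 600000000, 350091800, 126027000, 64303070, 112053000),
    Request       = c(92636120, 100000000, 50000000, 124027000, 50558370, 112053000),
    Per.Total     = c(2.973, 1.034, 1.392, 3.276, 5.784, 3.010),
    SS.Score      = c(6.934, 6.204, 9.749, 3.329, 7.357, 3.010)
)

# Assumed relationship: benefit per $10 million of cost.
per.10m <- function(benefit, cost) benefit / (cost / 1e7)

head.rows$Per.Total.Check <- per.10m(head.rows$Benefit, head.rows$Total.Project)
head.rows$SS.Score.Check  <- per.10m(head.rows$Benefit, head.rows$Request)

# Largest absolute discrepancy against the reported three-decimal values.
max.diff.total <- max(abs(head.rows$Per.Total.Check - head.rows$Per.Total))
max.diff.score <- max(abs(head.rows$SS.Score.Check - head.rows$SS.Score))
print(c(max.diff.total, max.diff.score))
```

Discrepancies at or below rounding precision on these rows would support the assumed relationship, but only a check over all 404 projects could confirm it for the full dataset.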


The technical guide (Table 4.2, p. 36) defines six major factors used to compute the Smart Scale score, each of which is itself calculated from more basic measurements:

1. Congestion Mitigation
2. Economic Development
3. Accessibility
4. Safety
5. Environmental Quality
6. Land Use

Of these six factors, only Land Use is given directly in the dataset. The other five are calculated from their basic constituent measurements as specified on pages 27-31 of the technical guide. The data contain four area types: A, B, C, and D. Each area type has its own factor weighting, specified in Table 4.2 on page 36 of the technical guide, which determines its Smart Scale score. For areas C and D the Land Use factor is not used (equivalently, it carries 0% weight). Inspection of the whole dataset showed that this is represented by empty cells for projects falling within these areas, so all such Land Use values are set to 0. Following this precedent, all empty cells in the numeric columns are taken to correspond to zero, and a value of zero is imputed to each.
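
To make the role of the area-type weightings concrete, the sketch below computes a weighted sum of the six factor scores. The weights and factor scores are placeholders invented for illustration; they are not the values from Table 4.2 of the technical guide. The only property taken from the text is that Land Use carries 0% weight in areas C and D. The names `weights`, `weighted.score`, and `scores` are hypothetical.

```r
factors <- c("Congestion", "Economic", "Accessibility",
             "Safety", "Environment", "LandUse")

# PLACEHOLDER weights only, NOT the values from Table 4.2 of the technical guide.
# Area C is given a Land Use weight of 0, as described in the text.
weights <- rbind(
    A = c(0.30, 0.10, 0.20, 0.10, 0.10, 0.20),
    C = c(0.20, 0.30, 0.20, 0.20, 0.10, 0.00)
)
colnames(weights) <- factors

# Each row of weights should sum to 1 (100%).
stopifnot(all(abs(rowSums(weights) - 1) < 1e-9))

# Weighted sum of factor scores for a given area type.
weighted.score <- function(area, scores) {
    sum(weights[area, ] * scores[factors])
}

# Made-up factor scores for a single project.
scores <- c(Congestion = 50, Economic = 20, Accessibility = 10,
            Safety = 40, Environment = 30, LandUse = 60)

score.A <- weighted.score("A", scores)

# With a 0% weight, the Land Use value has no effect on the result,
# so imputing 0 for an empty Land Use cell changes nothing in area C.
score.C      <- weighted.score("C", scores)
score.C.zero <- weighted.score("C", replace(scores, "LandUse", 0))

print(c(score.A, score.C, score.C.zero))
```

The last two values coincide, which is the sense in which an unused factor is equivalent to one with 0% weight.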

Based upon this, the following basic transformations are made to the data prior to analysis:

* All columns with present/absent markers (x and no x) are recoded to 1 and 0, respectively. This converts them into a numeric equivalent (called binarization).

* All empty cells in basic constituent measurement columns (which are used in calculating the factors) are changed to zero.

* All empty cells in the Land Use column are changed to zero.

This is done as follows:

In [45]:
# Missing value imputation function - can change if needed.
# Defined first so that it exists before it is used below.
imputer.f <- impute.0

# Clean and recode numeric columns. Non-numeric entries (such as "-") become NA,
# which triggers the "NAs introduced by coercion" warning.
numeric.columns <- colnames(data)[8:ncol(data)]
for(col in numeric.columns) {
    data[[col]] <- as.numeric(as.character(data[[col]]))
}

# Impute 0 to all missing Land.Use.Score values (column 20), since area types C and D don't use it.
data[[20]] <- imputer.f(data[[20]])

# Properly binarize binary columns: 'x' becomes 1, anything else becomes 0.
data$Statewide.High.Priority <- as.numeric(data$Statewide.High.Priority == 'x')
data$District.Grant <- as.numeric(data$District.Grant == 'x')


"NAs introduced by coercion"

Before imputing zeros to empty cells, it is instructive to inspect the percentage of empty cells in each of the numeric columns. We do this below.

In [46]:
# Print the percent missing in each component score column, then impute its missing values.
for(i in 8:19) {
    # mean(is.na(x)) is the fraction of missing entries in the column.
    message(paste('Percent Missing Values for Column ', colnames(data)[i], ': ', round(100*mean(is.na(data[[i]])), 2)))
    data[[i]] <- imputer.f(data[[i]])
}

Percent Missing Values for Column  Throughput.Score :  30.94
Percent Missing Values for Column  Delay.Score :  25.99
Percent Missing Values for Column  Crash.Frequency.Score :  17.08
Percent Missing Values for Column  Crash.Rate.Score :  19.8
Percent Missing Values for Column  Access.to.Jobs :  33.91
Percent Missing Values for Column  Disadvantaged.Access.to.Jobs :  33.91
Percent Missing Values for Column  Multimodal.Access.Score :  41.34
Percent Missing Values for Column  Air.Quality.Score :  34.16
Percent Missing Values for Column  Enviro.Impact.Score :  0.99
Percent Missing Values for Column  Econ.Dev.Support.Score :  37.87
Percent Missing Values for Column  Intermodal.Access.Score :  60.15
Percent Missing Values for Column  Travel.Time.Reliability.Score :  31.19


In the output above, the percentage of missing values in the basic constituent measurement columns ranges from 1% (Enviro.Impact.Score) to 60% (Intermodal.Access.Score).
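
The percent-missing calculation can be illustrated on a toy column. This is a minimal sketch with made-up values: non-numeric markers become `NA` under `as.numeric`, the share of `NA` entries gives the percentage, and zero imputation then fills them in. The names `raw`, `num`, and `pct.missing` are illustrative.

```r
# Made-up raw column mixing numbers, a "-" marker, and an empty string.
raw <- c("17.69", "-", "", "100")

# Non-numeric entries are coerced to NA (the coercion warning is silenced here).
num <- suppressWarnings(as.numeric(raw))

# Share of missing entries, as a percentage rounded to two decimals.
pct.missing <- round(100 * mean(is.na(num)), 2)

# Zero imputation, as applied to the component score columns.
num[is.na(num)] <- 0

print(pct.missing)
print(num)
```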

## III. Data Exploration

In this section we explore the data further using descriptive statistics to provide several characterizations. 

First, we look at the number of projects within each of the four areas:

In [48]:
dt1 <- data.frame(table(data$Area.Type))
colnames(dt1) <- c('Area', 'Projects')
dt1

Area,Projects
A,114
B,88
C,86
D,116


We also do the same for each district:

In [50]:
dt2 <- data.frame(table(data$District))
colnames(dt2) <- c('District', 'Projects')
dt2

District,Projects
Bristol,42
Culpeper,35
Fredericksburg,25
Hampton Roads,52
Lynchburg,28
NOVA,58
Richmond,72
Salem,50
Staunton,42
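
As a consistency check on the two tables above, the sketch below re-enters the reported counts by hand, confirms that both breakdowns add up to the 404 projects in the dataset, and converts the area counts to percentage shares. The vectors `area.counts` and `district.counts` are transcribed from the outputs above; in the notebook itself the same figures could be derived from `dt1` and `dt2`.

```r
# Counts transcribed from the two tables above.
area.counts <- c(A = 114, B = 88, C = 86, D = 116)
district.counts <- c(Bristol = 42, Culpeper = 35, Fredericksburg = 25,
                     "Hampton Roads" = 52, Lynchburg = 28, NOVA = 58,
                     Richmond = 72, Salem = 50, Staunton = 42)

# Both breakdowns should account for every project in the dataset.
total.by.area     <- sum(area.counts)
total.by.district <- sum(district.counts)

# Percentage share of projects in each area type.
area.share <- round(100 * area.counts / total.by.area, 1)

print(c(total.by.area, total.by.district))
print(area.share)
```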
