# Commuting in Seattle, WA

## Problem Statement

How can businesses help reduce the number of employees who commute by driving alone each day? Taking cars off the road during commuting hours not only reduces traffic but also lowers a city's environmental impact. With climate change a frequent topic in the news and city populations on the rise worldwide, Seattle and other cities across the country are creating policies that address these concerns.

One specific program is the Commute Trip Reduction (CTR) Program, run by the *City of Seattle* in partnership with *Commute Seattle*. The program began in 1991. The law states that any employer with more than 100 employees who report to work at a single site between 6 and 9 am must participate by developing programs that help employees reduce their drive-alone commute trips and by conducting commuter surveys biennially.  
Read more here: http://www.seattle.gov/waytogo/ctr_req.htm

For my project, I use data collected through the CTR program to evaluate the commuting-related benefits offered to employees and to identify which benefits affect the Drive-Alone Rate most (either positively or negatively) in Seattle, WA.

I formulated this question as a regression problem in which `Alone_Share` is the target. `Alone_Share` includes single-occupancy vehicle drivers but does not include solo motorcycle drivers.

$$\text{Alone Share} = \frac{\text{Weekly Drive Alone Trips}}{\text{Total Weekly Trips}}$$
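
As a small worked example of the ratio above, here is a sketch in pandas. The trip counts are made up, and `Total_Weekly_Trips` is a hypothetical column name used only for illustration.

```python
import pandas as pd

# Made-up trip counts for three hypothetical worksites.
trips = pd.DataFrame({
    "Weekly_Drive_Alone_Trips": [120, 45, 300],
    "Total_Weekly_Trips": [400, 450, 500],  # hypothetical column name
})

# Alone Share = weekly drive-alone trips / total weekly trips.
trips["Alone_Share"] = (
    trips["Weekly_Drive_Alone_Trips"] / trips["Total_Weekly_Trips"]
)
print(trips)
```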



### 1. Data Descriptions 
#### `program_report_data`:
This table contains information about the benefits that individual employers offer to their employees. Additional data includes how much money a company spends on those benefits, how and how often information about the commuting program is distributed, and the number of employees taking advantage of the individual subsidies offered to them.

There are 51 features that represent different types of benefits offered and facilities available to employees.

#### `worksites_in_goal`:
This table contains survey responses about how employees commute to work throughout the course of a week. It also contains the drive-alone-rate goal for each company, percentage of surveys returned, and the year in which the survey was collected. This table contains information beyond the bounds of the City of Seattle, but I will only use Seattle data for the purposes of this project. 

There are 1182 rows containing data from companies in Seattle, WA, but only 328 unique `CTR_ID` codes.

### 2. Data Cleaning
**Column Headers** - Column headers were originally survey questions, so I renamed them to be more succinct.

**Dropped Features** - I dropped a few features that were obviously not informative (either containing no information or the same information across all rows).

**Merged DataFrames** - I merged the two DataFrames on the `CTR_ID` column. This step automatically removed the rows from the `worksites_in_goal` table that had no corresponding data in `program_report_data`, which eliminated employers outside the City of Seattle.
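
The row-dropping behaviour described above matches an inner join. Here is a minimal sketch on toy data, assuming `how="inner"` was used; the IDs, values, and the `Transit_Subsidy` column are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are invented.
program_report_data = pd.DataFrame({
    "CTR_ID": ["E001", "E002"],
    "Transit_Subsidy": ["Yes", "No"],  # hypothetical benefit column
})
worksites_in_goal = pd.DataFrame({
    "CTR_ID": ["E001", "E002", "E003"],
    "Alone_Share": [0.31, 0.48, 0.72],
})

# An inner join keeps only CTR_IDs present in both tables,
# so E003 (no program report) is dropped.
merged = worksites_in_goal.merge(program_report_data, on="CTR_ID", how="inner")
print(merged)
```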

**Anonymize** - I added a UUID key for each `CTR_ID` in the DataFrame and then dropped every column containing proprietary information, including `CTR_ID`, as required by my data-use agreement with *Commute Seattle*.
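
One possible way to do this is sketched below, assuming random UUIDs (`uuid.uuid4`) and a hypothetical `UUID` column name; the toy rows are invented.

```python
import uuid

import pandas as pd

# Toy data: the same employer (E001) appears in two survey years.
df = pd.DataFrame({
    "CTR_ID": ["E001", "E002", "E001"],
    "Alone_Share": [0.31, 0.48, 0.29],
})

# One random UUID per distinct CTR_ID, so repeated surveys from the same
# employer still share a key after the original ID is removed.
id_to_uuid = {ctr_id: str(uuid.uuid4()) for ctr_id in df["CTR_ID"].unique()}
df["UUID"] = df["CTR_ID"].map(id_to_uuid)

# Drop the identifying column once the surrogate key is in place.
anonymized = df.drop(columns=["CTR_ID"])
print(anonymized)
```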


### 3. Exploratory Data Analysis
I began by looking at the distributions of the `Response Rate` and `Alone Share` features.

I found it interesting that `Response Rate` is skewed to the left, which indicates that most worksites had a high survey response rate. This is likely because the CTR law requires most of these companies to conduct biennial surveys and to run some kind of program to reduce the Drive-Alone Rate.

![](../figures/response_rate.png)

`Alone Share` is skewed to the right, which I also find interesting because it shows that more companies have a low alone share than a high one.

![](../figures/alone_share.png)

I then created distplots for the numerical features. Many of the numerical features are skewed to the right by a few high values, something that I will consider when removing outliers. 

![](../figures/numerical_subplots_for_pres.png)

Below are boxplots for the categorical features. It is hard to ascertain how each feature influences the target, and some results are not what I would expect.
![](../figures/categorical_subplots_for_pres.png)

I looked at the correlations between the features and, based on these and on each feature's relationship to the target, chose to drop some. The features that I dropped are:
- `NDAT_Rate`
- `Daily_Roundtrip_GHG_Per_Employee_(Pounds)`
- `VMT/\nEmployee`
- `Weekly_Drive_Alone_Trips`

![](../figures/commute_features_corr.png)
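
One way to surface strongly correlated pairs from a correlation matrix like the one above is sketched below. The data is synthetic, the `Near_Duplicate` and `Unrelated` columns are hypothetical, and the 0.9 cut-off is an assumption rather than the threshold used in this project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
alone_trips = rng.uniform(50, 500, n)

# Synthetic features: one is nearly a rescaled copy of another.
df = pd.DataFrame({
    "Weekly_Drive_Alone_Trips": alone_trips,
    "Near_Duplicate": alone_trips * 2 + rng.normal(0, 1, n),  # hypothetical
    "Unrelated": rng.normal(0, 1, n),                          # hypothetical
})

corr = df.corr().abs()

# Keep only the upper triangle so each pair is reported once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

threshold = 0.9  # assumed cut-off
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(high_pairs)
```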

### 4. Benchmark Score
Before obtaining a benchmark score, I deskewed and scaled the data using `BoxCoxTransformer` and `StandardScaler`. I built a pipeline that I fit on `X_train` only and then used to transform both `X_train` and `X_test`.
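
The fit-on-train, transform-on-test pattern can be sketched as follows. This sketch swaps in scikit-learn's `PowerTransformer(method="box-cox")` as a stand-in for the `BoxCoxTransformer` named above, and it runs on synthetic, strictly positive data (Box-Cox requires positive inputs).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive features.
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

pipe = Pipeline([
    # Stand-in for BoxCoxTransformer (an assumption of this sketch).
    ("deskew", PowerTransformer(method="box-cox", standardize=False)),
    ("scale", StandardScaler()),
])

# Learn the transform parameters from the training split only,
# then apply the same fitted transform to the test split.
X_train_scaled = pipe.fit_transform(X_train)
X_test_scaled = pipe.transform(X_test)
```

Fitting on `X_train` alone keeps information from the test split out of the learned Box-Cox and scaling parameters.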

The four regression models I chose to fit on my data are:
* `LassoCV`: implements regularization to prevent overfitting and automatically eliminates redundant variables.
* `BayesianRidge`: a form of ridge regression that also implements regularization.
* `DecisionTree`: performs feature selection implicitly, does not require extensive data preparation, and makes no assumptions about linearity.
* `KNN`: has low bias and works on non-linear data.

I encoded the categorical features, dropping the UUID feature, and replaced the raw numeric features with the deskewed and scaled values in the respective `X_train` and `X_test` DataFrames.

X_test_benchmark Scores:
* LassoCV: 0.959261
* BayesianRidge: 0.967083		
* DecisionTree: 0.914122
* KNN: 0.799720
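
A minimal sketch of how four such models could be fit and scored is shown below. It uses synthetic data from `make_regression` and assumes the scores are the default `.score()` of scikit-learn regressors, which is R² on the held-out data; the metric is not named above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the commute dataset.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "LassoCV": LassoCV(cv=5),
    "BayesianRidge": BayesianRidge(),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "KNN": KNeighborsRegressor(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the test split

for name, score in scores.items():
    print(f"{name}: {score:.6f}")
```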


### 5. Removing the Outliers
The next step is to remove the outliers. The steps to do so are as follows:
* Deskew and scale the entire DataFrame together using `BoxCoxTransformer` and `StandardScaler`.
* Apply Tukey's method: treat any value that lies more than k times the interquartile range (IQR) below the first quartile or above the third quartile as an outlier.
* For each row, count the number of features in which that row has an outlier.
* I decided that a Tukey value of k = 2.25, combined with flagging rows that have outliers in more than 5 features, is appropriate for this dataset. This yields 27 outlier rows, which is 2.59% of the data.
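
The steps above can be sketched as follows. The helper `count_outlier_features` and the synthetic data are hypothetical; only the multiplier of 2.25 and the "more than 5 features" rule come from the text.

```python
import numpy as np
import pandas as pd


def count_outlier_features(df, k=2.25):
    """Count, per row, how many features fall outside Tukey's fences."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Column-wise comparison against each feature's own fences.
    is_outlier = (df < lower) | (df > upper)
    return is_outlier.sum(axis=1)


rng = np.random.default_rng(0)
# Synthetic stand-in for the deskewed and scaled numeric features.
scaled = pd.DataFrame(
    rng.normal(size=(500, 8)), columns=[f"f{i}" for i in range(8)]
)
scaled.iloc[0] = 25.0  # plant one row that is extreme in every feature

counts = count_outlier_features(scaled, k=2.25)

# Flag rows with outliers in more than 5 features, then drop them.
outlier_rows = counts[counts > 5].index
cleaned = scaled.drop(index=outlier_rows)
print(len(outlier_rows), "rows removed")
```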

Below are visualisations of the data after outlier removal.

The numeric features below are shown deskewed and scaled, with outliers removed:

![](../figures/numerical_subplots_for_pres_no_outliers.png)

The categorical features are shown using the raw data:

![](../figures/categorical_subplots_for_pres_no_outliers.png)


### 6. PCA
I use the raw DataFrame (with removed outliers) and `train_test_split` to separate `X_train` and `X_test` and prevent data leakage. 
I use the same pipeline as before and pass the returned scaled array to PCA. This method allows me to extract new components that describe the data with fewer features than the original dataset.
* I find that `n_components=15` explains 76% of the variance in my data.

![](../figures/Scaled PCA on 15 Features - Commute Dataset.png)

I concatenate the respective PCA values to `X_train` and `X_test`.
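
A minimal sketch of this step on synthetic data is shown below. The array sizes, the `f0`...`f29` feature names, and the `PC_1`...`PC_15` column names are assumptions for illustration; the explained variance printed here will not match the 76% reported above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feature_cols = [f"f{i}" for i in range(30)]

# Synthetic stand-ins for the scaled train and test arrays.
X_train_scaled = pd.DataFrame(rng.normal(size=(150, 30)), columns=feature_cols)
X_test_scaled = pd.DataFrame(rng.normal(size=(50, 30)), columns=feature_cols)

# Fit PCA on the training split only, then project both splits.
pca = PCA(n_components=15)
train_pcs = pca.fit_transform(X_train_scaled)
test_pcs = pca.transform(X_test_scaled)

# Fraction of total variance captured by the 15 components.
explained = pca.explained_variance_ratio_.sum()
print(f"Explained variance: {explained:.2%}")

# Append the component scores to the original features.
pc_cols = [f"PC_{i + 1}" for i in range(15)]
X_train = pd.concat(
    [X_train_scaled, pd.DataFrame(train_pcs, columns=pc_cols)], axis=1
)
X_test = pd.concat(
    [X_test_scaled, pd.DataFrame(test_pcs, columns=pc_cols)], axis=1
)
```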

I perform cluster analysis on the PCA components as an unsupervised way to see if the data can be represented in individual clusters. Below I show `n_clusters=3`. This is something I would like to explore further in the future. 

![](../figures/cluster_analysis_PCA.png)
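
The clustering algorithm is not named above. As one possibility, here is a sketch that assumes k-means with `n_clusters=3`, run on synthetic two-dimensional "component" data made of three well-separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs standing in for PCA components.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
components = np.vstack(
    [center + rng.normal(scale=0.5, size=(50, 2)) for center in centers]
)

# k-means is an assumption of this sketch, not necessarily the method used.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(components)

print(np.bincount(labels))  # number of points assigned to each cluster
```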

### 7. Model Build and Analysis

X_test Scores:
* LassoCV: 0.971045
* BayesianRidge: 0.964379
* DecisionTree: 0.903892	
* KNN: 0.664433

I performed some analysis on two of my models to see which features most influenced the target feature. 

![](../figures/lasso-model-analysis.png)

![](../figures/brr-model-analysis.png)

- Features that are more likely to **decrease** `Alone_Share`
    * Bus Share
    * Total Employees
    * Carpool Share
    * Telecommute Share
    * Last time distributed commute program info to employees - 08/30/2016


- Features that are more likely to **increase** `Alone_Share`
    * Principal Component 2
    * Total Annual Greenhouse Gas per Employee in Metric Tons
    * Aggregate Pounds of Greenhouse Gas 
    * Additional Benefits - None
    * Vanshare/Carpool Subsidy - 50-59%
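
One common way to produce lists like these is to sort a linear model's coefficients by sign and magnitude. The sketch below assumes that approach and uses synthetic data; the feature names and the true coefficients are invented for illustration and are not results from this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Synthetic scaled features with invented names.
X = pd.DataFrame(
    rng.normal(size=(300, 4)),
    columns=["Bus_Share", "Carpool_Share", "No_Benefits", "Noise"],
)
# Invented relationship: two features lower the target, one raises it.
y = (
    -2.0 * X["Bus_Share"]
    - 1.0 * X["Carpool_Share"]
    + 1.5 * X["No_Benefits"]
    + rng.normal(scale=0.1, size=300)
)

model = LassoCV(cv=5).fit(X, y)

# Sort coefficients so the most negative come first, most positive last.
coefs = pd.Series(model.coef_, index=X.columns).sort_values()
decrease = coefs[coefs < 0].index.tolist()
increase = coefs[coefs > 0].index.tolist()

print("Decrease target:", decrease)
print("Increase target:", increase)
```

Coefficients are only comparable in this way when the features are on the same scale, which is one reason to scale before fitting.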

### 8. What Next? - Stretch Goals
1. Use Django to let companies interact with the model.
2. Inform business value.

   I began looking at how the cost of the program compares to the `Alone_Share` rate, and I think it would be interesting to examine this further to see how businesses can optimise each dollar they spend to reduce their alone share most effectively.

   ![](../figures/cost_of_program_against_alone_share.png)

3. Find trends over the years.
4. Use GIS to see whether businesses in certain areas have a lower alone share.
5. Find ways to group open-ended category responses together.
6. Give plots readable labels instead of reusing the header labels.
7. How can I make my code better?