All cleaned data sets are available [here](https://github.com/anly501/dsan-5000-project-jsweren1/tree/main/dsan-website/5000-website/data/cleaned_data).

### Quarterly and Annual Ridership Totals by Mode of Transportation

This data provides a baseline view of the current state of public transit usage in the United States. It should therefore be cleaned so that trends can be visualized, without superfluous information unrelated to current phenomena. The cleaning steps are listed below.

- Trim the data set:
  - Keep columns 1 to 11, dropping the blank columns in the .csv file and the notes added by the source.
  - Keep rows 81 to 133, dropping records from before 2010, which are superfluous when comparing to current trends.
- Create one column that combines year and quarter to improve readability
- Convert all numeric columns to a numeric data type
- Remove the extra year and quarter columns, as they are now unnecessary
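The numeric conversion step can also be illustrated outside of R. The following is a minimal Python sketch, assuming the counts arrive as strings with comma thousands separators. The sample rows and the `to_number` helper are hypothetical and are not taken from the APTA file.

```python
def to_number(text):
    """Convert a string such as '2,345,678' to a float; return None if it is not numeric."""
    cleaned = text.replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

# Hypothetical sample rows (values are made up for illustration)
rows = [
    {"Year - Quarter": "2010 - Q1", "total_ridership": "2,500,000", "bus": "1,300,000"},
    {"Year - Quarter": "2010 - Q2", "total_ridership": "2,600,000", "bus": "n/a"},
]

# Only the listed columns are converted; the label column is left as text
numeric_columns = ["total_ridership", "bus"]
cleaned_rows = [
    {key: (to_number(value) if key in numeric_columns else value) for key, value in row.items()}
    for row in rows
]
print(cleaned_rows[0]["total_ridership"])
```

Returning `None` for unparseable values plays a role similar to the `NA` that R's `as.numeric` produces for non-numeric strings.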

Regarding the numeric fields, I have chosen to keep them all for now as each one can provide insight into which modes of transportation are most affected by certain factors. Below is the code to apply the steps laid out, as well as a comparison between the raw and cleaned data sets.

In [None]:
library(tidyverse)
library(tidyr)

ridership <- read.csv("../data/APTA-Ridership-by-Mode-and-Quarter-1990-Present.csv")
# Keep rows 81-133 (2010 onward) and the 11 populated columns
ridership <- ridership[81:133,1:11]
colnames(ridership)[2] <- 'Year - Quarter'
# Name the eight ridership columns, then strip thousands separators and convert them to numeric
mode_cols <- c("total_ridership", "heavy_rail", "light_rail", "commuter_rail", "trolleybus", "bus", "demand_response", "other")
colnames(ridership)[4:11] <- mode_cols
ridership[mode_cols] <- lapply(ridership[mode_cols], function(x) as.numeric(gsub(",", "", x)))
ggplot(data=ridership, aes(x=factor(`Year - Quarter`), y=total_ridership, group=1)) +
  geom_line()+
  geom_point()+
  labs(x = "Year - Quarter", y = "Total Ridership (000s)", title = "Total Public Transit Ridership in the U.S.")+
  theme(axis.text.x = element_text(angle = 45))
ridership <- ridership[c(2, 4:11)]
head(ridership)
write.csv(ridership, "../data/cleaned_data/ridership_by_quarter_cleaned.csv")

![Raw Quarterly Ridership Data](../images/apta_raw_data.png)

![Cleaned Quarterly Ridership Data](../images/quarterly_ridership_cleaned.png)

### News API Data

In its raw form, this data is a JSON file in which each record corresponds to a single article. The purpose of cleaning it is to analyze word prevalence, which can be done by building a corpus. The steps are adapted from DSAN-5000 Lab 2.1 and are as follows:

- Retrieve the raw data JSON file for WMATA news.
- Create a string cleaning function to deal with punctuation, special characters, and differently cased letters
- Iterate through each article
  - Iterate through each data point in an article to clean strings and append cleaned data to output list
- Convert cleaned data to data frame
- Create corpus from cleaned data
- Use `CountVectorizer` to retrieve vocabulary for the data set
- Repeat for BART
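The cleaning and vocabulary steps above can be sketched on a tiny example. This sketch uses only the standard library, with `collections.Counter` as a stand-in for `CountVectorizer`; the two titles and the `clean_text` helper are hypothetical. `CountVectorizer` tokenizes differently (by default it drops single-character tokens, for instance), so the counts here only approximate what it would produce.

```python
import re
from collections import Counter

def clean_text(text):
    """Replace punctuation with spaces, drop apostrophes, collapse whitespace, lowercase."""
    if not isinstance(text, str):
        # Missing fields (e.g. an author stored as None) become empty strings
        return ""
    out = re.sub(r"[,.;@#?!&$-]+", " ", text)
    out = re.sub(r"[’']+", "", out)
    out = re.sub(r"\s+", " ", out)
    return out.strip().lower()

# Hypothetical article titles, not taken from the News API results
titles = ["Metro's Red Line delays, again!", "Red Line service restored."]

corpus = [clean_text(title) for title in titles]
vocabulary = Counter(word for document in corpus for word in document.split())

print(corpus)
print(vocabulary.most_common(3))
```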

Below is the code, along with images of the cleaned data.

In [None]:
import requests
import json
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

baseURL = "https://newsapi.org/v2/everything?"
total_requests=2
verbose=True

#WMATA

# Load the raw WMATA articles; the context manager closes the file afterwards
with open('../data/WMATA-newapi-raw-data.json') as x:
    response = json.load(x)

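def string_cleaner(input_string):
    """Replace punctuation with spaces, drop apostrophes, collapse whitespace, and lowercase."""
    try:
        # Replace runs of punctuation (and any trailing spaces) with a single space
        out=re.sub(r"""
                    [,.;@#?!&$-]+
                    \ *
                    """,
                    " ",
                    input_string, flags=re.VERBOSE)

        # Operate on `out`, not `input_string`, so the substitution above is not discarded
        out = re.sub('[’.]+', '', out)
        out = re.sub(r'\s+', ' ', out)
        out=out.lower()
    except TypeError:
        # Non-string input, e.g. a missing author stored as None
        print("ERROR")
        out=''
    return out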

article_list=response['articles']
article_keys=article_list[0].keys()
index=0
cleaned_data=[];  
for article in article_list:
    tmp=[]
    for key in article_keys:
        if(key=='source'):
            src=string_cleaner(article[key]['name'])
            tmp.append(src) 

        if(key=='author'):
            author=string_cleaner(article[key])
            if(src in author): 
                author='NA'
            tmp.append(author)

        if(key=='title'):
            tmp.append(string_cleaner(article[key]))

        if(key=='publishedAt'):
            ref = re.compile('.*-.*-.*T.*:.*:.*Z')
            date=article[key]
            if(not ref.match(date)):
                date="NA"
            tmp.append(date)

    cleaned_data.append(tmp)
    index+=1

df = pd.DataFrame(cleaned_data)
print(df.head())
df.to_csv('../data/cleaned_data/wmata_news_cleaned.csv', index=False)

corpus = df[2]
vectorizer=CountVectorizer()
word_counts  =  vectorizer.fit_transform(corpus)
print("vocabulary = ",vectorizer.vocabulary_)


#BART

# Load the raw BART articles; the context manager closes the file afterwards
with open('../data/BART-newapi-raw-data.json') as x:
    response = json.load(x)

article_list=response['articles']
article_keys=article_list[0].keys()
index=0
cleaned_data=[];  
for article in article_list:
    tmp=[]
    for key in article_keys:
        if(key=='source'):
            src=string_cleaner(article[key]['name'])
            tmp.append(src) 

        if(key=='author'):
            author=string_cleaner(article[key])
            if(src in author): 
                author='NA'
            tmp.append(author)

        if(key=='title'):
            tmp.append(string_cleaner(article[key]))

        if(key=='publishedAt'):
            ref = re.compile('.*-.*-.*T.*:.*:.*Z')
            date=article[key]
            if(not ref.match(date)):
                date="NA"
            tmp.append(date)

    cleaned_data.append(tmp)
    index+=1

df = pd.DataFrame(cleaned_data)
print(df.head())
df.to_csv('../data/cleaned_data/bart_news_cleaned.csv', index=False)

corpus = df[2]
vectorizer=CountVectorizer()
word_counts  =  vectorizer.fit_transform(corpus)
print("vocabulary = ",vectorizer.vocabulary_)

![Cleaned WMATA News Data](../images/wmata_news_cleaned.png)

![Cleaned BART News Data](../images/bart_news_cleaned.png)

### Remote Work Trends - Desires of Employers vs. Workers

The insight to be gathered from this data is the discrepancy between the remote work schedule that employers want for their workers and the schedule that workers want for themselves. Although the data comes from two separate .csv files, it is therefore necessary to merge them into one data frame. Each data set has two variables:

1. The amount of working from home (days per week) that employers or workers want for all workers
2. The amount of working from home (days per week) that employers or workers want for workers able to work from home

Since both variables have ample data, we will keep both. The methodology is as follows:

- Read both data sets and trim excess space where the owner of the data had included a citation note
- Merge by date
  - These data sets come from the same series of surveys, so the date column is exactly the same, eliminating any need for removal of rows.
- Convert the date field to a date data type and order by date
- Rename columns based on glossary provided by the data source
- Ensure numeric columns have numeric data type
- Remove rows with too many `NA` values.
  - Rows in which either the values for all workers **OR** the values for workers able to work from home are `NA` can be kept, since the remaining values can still be compared. Only rows in which no comparison can be made are removed.
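The merge and `NA` rule above can be sketched in plain Python. This is a dependency-free illustration, not the project's method: the survey values are made up, and the `comparable` helper encodes one possible reading of "no comparison can be made" (employer and worker values are never both present for the same variable). The R code in this section filters on the employer columns only.

```python
from datetime import datetime

# Hypothetical survey values keyed by date text (m/d/yy): (all workers, workers able to WFH).
# None marks a missing value; none of these numbers come from the real surveys.
employer = {"6/1/21": (1.2, 2.1), "5/1/21": (None, None), "7/1/21": (1.3, None)}
worker = {"5/1/21": (2.0, 2.8), "6/1/21": (2.1, 2.9), "7/1/21": (2.2, 3.0)}

# Inner join on the shared date key, parsing the date along the way
merged = []
for date_text in employer.keys() & worker.keys():
    employer_all, employer_able = employer[date_text]
    worker_all, worker_able = worker[date_text]
    merged.append({
        "date": datetime.strptime(date_text, "%m/%d/%y").date(),
        "employer_desires_all": employer_all,
        "employer_desires_able": employer_able,
        "worker_desires_all": worker_all,
        "worker_desires_able": worker_able,
    })
merged.sort(key=lambda row: row["date"])

def comparable(row):
    """True if employer and worker values are both present for at least one variable."""
    both_all = row["employer_desires_all"] is not None and row["worker_desires_all"] is not None
    both_able = row["employer_desires_able"] is not None and row["worker_desires_able"] is not None
    return both_all or both_able

plans = [row for row in merged if comparable(row)]
for row in plans:
    print(row["date"], row["employer_desires_all"], row["worker_desires_all"])
```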

The code for this is below, along with a screenshot of the cleaned data.

In [None]:
library(tidyverse)
library(tidyr)

employer <- read.csv("../data/WFH_monthly/WFH_monthly_employer.csv")
worker <- read.csv("../data/WFH_monthly/WFH_monthly_worker.csv")
employer <- employer[c(1:3)]
worker <- worker[c(1:3)]
plans <- merge(employer, worker, by = "date")
plans$date <- as.Date(plans$date, format = "%m/%d/%y")
plans <- plans[order(plans$date),]
colnames(plans)[c(2:5)] <- c("employer_desires_all", "employer_desires_able", "worker_desires_all", "worker_desires_able")
typeof(plans$employer_desires_all)
plans <- plans[!(is.na(plans$employer_desires_all) & is.na(plans$employer_desires_able)),]
head(plans)
write.csv(plans, "../data/cleaned_data/WFH_surveys_cleaned.csv")

![Cleaned Data for Remote Work Plans of Employers and Workers](../images/wfh_plans_cleaned.png)

### Remote Work Trends by City

The cleaning methodology for this data set is simple: reduce the data from the ten largest cities in the United States to the two areas this project focuses on. The steps, along with the code and before/after screenshots, are below:

- Read .csv file and remove all columns except for the date of each survey, results from Washington, D.C., and results from the San Francisco Bay Area
- Convert the date field to a date data type and order by date
- Ensure no `NA` values and that numeric columns have a numeric data type

In [None]:
library(tidyverse)
library(tidyr)

city <- read.csv("../data/WFH_monthly/WFH_monthly_city.csv")
city <- city[c(1,6,8)]
city$date <- as.Date(city$date, format = "%m/%d/%y")
city <- city[order(city$date),]
colnames(city)[c(2,3)] <- c("wfh_BayArea", "wfh_WashingtonDC")
city <- na.omit(city)
typeof(city$wfh_BayArea)
head(city)
write.csv(city, "../data/cleaned_data/WFH_city_cleaned.csv")

![Remote Work Percentages by City - Raw](../images/wfh_city.png)

![Remote Work Percentages by City - Cleaned](../images/wfh_city_cleaned.png)

### Ridership Trends by City

These data sets will allow us to compare total and average monthly ridership between Washington, D.C. and the San Francisco Bay Area. To avoid unintended discrepancies, the two data sets need to cover the same period, so the date range selected is January 2018 to September 2023. First, the WMATA data comes with all monthly averages listed in a single row, as shown below:

![WMATA Average Daily Entries by Month](../images/wmata_monthly_boarding.png)

Since the data unit we are after is each month, this should ultimately be transposed when cleaning. Additionally, we will need to combine the years, which act as column names in the raw data, with the months. The steps for this are as follows:

- Read .csv file and remove final row, which is duplicative. It is simply a truncated version of the data directly above it.
- Retrieve column names to create a list of years
- Transpose rows containing months and values and add them to a data frame with the `years` column
- Remove blank row created by this transposition
- Create date column by concatenating year and month and converting it to date type
- Re-arrange columns, remove duplicative columns containing year and month, and rename `avg_daily_entries` column
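The wide-to-long reshaping described above can be sketched in Python. The sketch assumes column headers of the form `X2018`, `X2018.1`, and so on, which is how R's `read.csv` renders repeated numeric headers; the month names and entry counts are hypothetical sample values.

```python
from datetime import datetime

# Hypothetical wide layout: one year header, one month name, and one value per column
headers = ["X2018", "X2018.1", "X2018.2"]
months = ["January", "February", "March"]
values = ["600000", "610000", "620000"]  # made-up average daily entries

records = []
for header, month, value in zip(headers, months, values):
    year = header[1:5]  # characters 2-5 of the header hold the four-digit year
    # Month names are parsed with %B, which assumes an English locale
    date = datetime.strptime(f"1 {month} {year}", "%d %B %Y").date()
    records.append({"date": date, "avg_daily_entries": float(value)})

for record in records:
    print(record["date"], record["avg_daily_entries"])
```

Each column of the wide table becomes one record, dated to the first day of its month.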

The code and screenshot are shown below:

**Note: BART data has not yet been cleaned.**

In [None]:
library(tidyverse)
library(tidyr)

wmata_monthly <- read.csv("../data/WMATA_boardings_by_month.csv")
wmata_monthly <- wmata_monthly[c(1,2),]
wmata_years <- c(colnames(wmata_monthly))
wmata <- data.frame(wmata_years, t(wmata_monthly[1,]), t(wmata_monthly[2,]))
wmata <- wmata[-1,]
wmata$date <- paste(1, wmata$X1, substr(wmata$wmata_years, 2, 5))
wmata$date <- as.Date(wmata$date, "%d %B %Y")
wmata <- wmata[c(4, 3)]
colnames(wmata)[2] <- "avg_daily_entries"
head(wmata)
write.csv(wmata, "../data/cleaned_data/wmata_monthly_ridership.csv")

![WMATA Average Daily Entries by Month - Cleaned](../images/wmata_monthly_cleaned.png)

### Ridership by Hour

These data sets can be used to compare ridership trends before and after the onset of the pandemic. Both show average daily entries and exits in Washington, D.C. by hour of the day, allowing us to see when people use public transit and ultimately to infer why they may be using it. The two differ in date range: the `before` data set covers January 1, 2018 to March 17, 2020, while the `after` data set covers March 18, 2020 to October 5, 2023.

The steps for cleaning these data sets are as follows:

- Read .csv files and remove rounded fields, as they are duplicative
- Rename columns for readability
- Convert numeric columns to numeric data type
- Introduce `hour_numeric` column for future time series analysis
- Rearrange columns
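The `hour_numeric` step can also be derived from the hour labels themselves rather than from row order. The following Python sketch assumes labels such as `4 AM` and `12 PM`; that label format is a guess about the raw file, and `hour_to_numeric` is a hypothetical helper.

```python
from datetime import datetime

def hour_to_numeric(label):
    """Map a 12-hour label such as '4 AM' to an integer hour from 0 to 23."""
    # %I is the 12-hour clock and %p the AM/PM marker (assumes an English locale)
    return datetime.strptime(label.strip(), "%I %p").hour

# Hypothetical labels covering a service day that starts at 4 AM and wraps past midnight
labels = ["4 AM", "12 PM", "11 PM", "12 AM", "3 AM"]
hour_numeric = [hour_to_numeric(label) for label in labels]
print(hour_numeric)
```

Parsing the label guards against rows arriving in an unexpected order, whereas assigning a fixed sequence relies on the file always starting at 4 AM.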

The code for carrying this out and a screenshot of the cleaned `before` data set are below. Additionally, both data sets are plotted to visualize average daily entries by hour.

In [None]:
library(tidyverse)
library(tidyr)

before <- read.csv("../data/WMATA_boardings_by_hour/boardings_pre-covid.csv")
after <- read.csv("../data/WMATA_boardings_by_hour/boardings_post-covid.csv")
before <- before[c(1,2,4)]
after <- after[c(1,2,4)]
colnames(before) <- c("hour", "avg_daily_entries", "avg_daily_exits")
colnames(after) <- c("hour", "avg_daily_entries", "avg_daily_exits")
before$avg_daily_entries <- as.numeric(gsub(",","", before$avg_daily_entries))
before$avg_daily_exits <- as.numeric(gsub(",","", before$avg_daily_exits))
after$avg_daily_entries <- as.numeric(gsub(",","", after$avg_daily_entries))
after$avg_daily_exits <- as.numeric(gsub(",","", after$avg_daily_exits))
before$hour_numeric <- c(4:23, 0:3)
after$hour_numeric <- c(4:23, 0:3)
before <- before[c(1,4,2,3)]
after <- after[c(1,4,2,3)]
ggplot(data=before, aes(x=factor(hour_numeric, ordered = FALSE), y=avg_daily_entries, group=1)) +
  geom_line()+
  geom_point()+
  labs(x = "Numeric Hour of Day", y = "Average Daily Entries", title = "Average Daily Entries by Hour (Pre-Pandemic)")
ggplot(data=after, aes(x=factor(hour_numeric, ordered = FALSE), y=avg_daily_entries, group=1)) +
  geom_line()+
  geom_point()+
  labs(x = "Numeric Hour of Day", y = "Average Daily Entries", title = "Average Daily Entries by Hour (Post-Pandemic)")
head(before)
head(after)
write.csv(before, "../data/cleaned_data/hourly_average_cleaned_pre-covid.csv")
write.csv(after, "../data/cleaned_data/hourly_average_cleaned_post-covid.csv")

![Hourly Ridership from 1/1/2018 to 3/17/2020 - Cleaned](../images/hourly_cleaned.png)

![Average Daily Entries by Hour (Pre-Pandemic)](../images/hourly_entries_before.png)

![Average Daily Entries by Hour (Post-Pandemic)](../images/hourly_entries_after.png)

### Ridership by Demographic

Lastly, the purpose of this data is to see the rates at which demographic groups use different modes of transportation to commute to work. The raw data set contains all demographic differentiators in the same table, which makes it untidy data. Thus, it will be necessary to split it into several tables, one for each demographic type. Additionally, the columns denoting percent error are useful for understanding the data but could be cumbersome for conducting EDA, so we will focus only on the proportions given in the data columns.

Cleaning this data set also provides an opportunity to use `R` to clean qualitative as well as quantitative variables. The steps are as follows:

- Read full .csv file and rename columns for readability based on glossary given by the data source
- Select only rows that split records by `age`, and only columns that contain data points
- Trim leading spaces from `age` column
- Remove percentage symbol from numeric fields and convert them to numeric data type
- Repeat process for `sex`, `race`, `citizenship status`, and `earnings`
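The label trimming and percentage conversion can be sketched in Python as follows. The rows are hypothetical and only mimic the assumed shape of the raw table (indented group labels and values ending in `%`); `percent_to_float` is an illustrative helper.

```python
def percent_to_float(text):
    """Convert a string such as '12.5%' to the float 12.5."""
    return float(text.strip().rstrip("%"))

# Hypothetical raw rows: (indented label, total, public transit share)
raw_rows = [
    ("    16 to 19 years", "2.5%", "1.9%"),
    ("    20 to 24 years", "8.0%", "9.4%"),
]

age = [
    {
        "age_group": label.lstrip(),  # trim leading spaces only
        "total": percent_to_float(total),
        "public_transit": percent_to_float(transit),
    }
    for label, total, transit in raw_rows
]
print(age)
```

Stripping the `%` by character rather than by position avoids cutting off a digit if a value happens to lack the symbol.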

Below is the code, and a sample screenshot from the `earnings` cleaned data set.

In [None]:
library(tidyverse)
library(tidyr)
library(stringr)

demographics <- read.csv("../data/ridership_by_demographic_2021.csv")
colnames(demographics)[c(2,4,6,8)] <- c("total", "drive_alone", "carpool", "public_transit")
age <- demographics[c(3:8), c(1,2,4,6,8)]
colnames(age)[1] <- "age_group"
age$age_group <- str_trim(age$age_group, "left")
age$total <- as.numeric(substr(age$total, 1, nchar(age$total)-1))
age$drive_alone <- as.numeric(substr(age$drive_alone, 1, nchar(age$drive_alone)-1))
age$carpool <- as.numeric(substr(age$carpool, 1, nchar(age$carpool)-1))
age$public_transit <- as.numeric(substr(age$public_transit, 1, nchar(age$public_transit)-1))
head(age)
write.csv(age, "../data/cleaned_data/ridership_age.csv")

sex <- demographics[c(11:12), c(1,2,4,6,8)]
colnames(sex)[1] <- "sex"
sex$sex <- str_trim(sex$sex, "left")
sex$total <- as.numeric(substr(sex$total, 1, nchar(sex$total)-1))
sex$drive_alone <- as.numeric(substr(sex$drive_alone, 1, nchar(sex$drive_alone)-1))
sex$carpool <- as.numeric(substr(sex$carpool, 1, nchar(sex$carpool)-1))
sex$public_transit <- as.numeric(substr(sex$public_transit, 1, nchar(sex$public_transit)-1))
head(sex)
write.csv(sex, "../data/cleaned_data/ridership_sex.csv")

citizenship <- demographics[c(25:28), c(1,2,4,6,8)]
colnames(citizenship)[1] <- "status"
citizenship$status <- str_trim(citizenship$status, "left")
citizenship$total <- as.numeric(substr(citizenship$total, 1, nchar(citizenship$total)-1))
citizenship$drive_alone <- as.numeric(substr(citizenship$drive_alone, 1, nchar(citizenship$drive_alone)-1))
citizenship$carpool <- as.numeric(substr(citizenship$carpool, 1, nchar(citizenship$carpool)-1))
citizenship$public_transit <- as.numeric(substr(citizenship$public_transit, 1, nchar(citizenship$public_transit)-1))
head(citizenship)
write.csv(citizenship, "../data/cleaned_data/ridership_citizenship.csv")

earnings <- demographics[c(35:42), c(1,2,4,6,8)]
colnames(earnings)[1] <- "range"
earnings$range <- str_trim(earnings$range, "left")
earnings$total <- as.numeric(substr(earnings$total, 1, nchar(earnings$total)-1))
earnings$drive_alone <- as.numeric(substr(earnings$drive_alone, 1, nchar(earnings$drive_alone)-1))
earnings$carpool <- as.numeric(substr(earnings$carpool, 1, nchar(earnings$carpool)-1))
earnings$public_transit <- as.numeric(substr(earnings$public_transit, 1, nchar(earnings$public_transit)-1))
head(earnings)
write.csv(earnings, "../data/cleaned_data/ridership_earnings.csv")

![Transportation Methods by Earnings - Cleaned](../images/ridership_earnings_cleaned.png)