The business task is to determine how casual riders and Divvy members use Divvy bikes differently, so that we can work out how to convert casual riders into members. To do this, we need to find out how each group uses the bikes. Key stakeholders include:

Divvy: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Note: Divvy was purchased by Lyft in 2019.

Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns
and initiatives to promote the bike-share program. These may include email, social media, and other channels.

Divvy marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and
reporting data that helps guide Divvy marketing strategy.

Divvy executive team: The notoriously detail-oriented executive team will decide whether to approve the
recommended marketing program.


The data is located here: https://divvy-tripdata.s3.amazonaws.com/index.html

In [2]:
install.packages('readr')
install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")
install.packages('kimisc')
install.packages('dplyr')
install.packages('stringr')

Installing package into 'C:/Users/dakot/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)



package 'readr' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\dakot\AppData\Local\Temp\Rtmpk1whro\downloaded_packages


In [None]:
#To import, we do the following:


The data is organized in chunks covering different periods of time, but together they form a continuous stream from 2013 through the end of 2021. To assess the data's quality, we use ROCCC (reliable, original, comprehensive, current, cited). The data is reliable because it is first-party data collected by the company requesting the analysis. It does not have selection bias with respect to our question: we are only looking at current customers, and the dataset includes only customers and all of their trips during the selected period. We will use the data for 2021, as that is a large enough post-pandemic data set.


First, let's import and take a look at the first monthly data set, from January 2021:

In [1]:
library(readr)

data1 <- read_csv("Divvy Trip Data 2021/202101-divvy-tripdata.csv")
head(data1)

Rows: 96834 Columns: 13
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
<chr>,<chr>,<dttm>,<dttm>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
E19E6F1B8D4C42ED,electric_bike,2021-01-23 16:14:19,2021-01-23 16:24:44,California Ave & Cortez St,17660,,,41.90034,-87.69674,41.89,-87.72,member
DC88F20C2C55F27F,electric_bike,2021-01-27 18:43:08,2021-01-27 18:47:12,California Ave & Cortez St,17660,,,41.90033,-87.69671,41.9,-87.69,member
EC45C94683FE3F27,electric_bike,2021-01-21 22:35:54,2021-01-21 22:37:14,California Ave & Cortez St,17660,,,41.90031,-87.69664,41.9,-87.7,member
4FA453A75AE377DB,electric_bike,2021-01-07 13:31:13,2021-01-07 13:42:55,California Ave & Cortez St,17660,,,41.9004,-87.69666,41.92,-87.69,member
BE5E8EB4E7263A0B,electric_bike,2021-01-23 02:24:02,2021-01-23 02:24:45,California Ave & Cortez St,17660,,,41.90033,-87.6967,41.9,-87.7,casual
5D8969F88C773979,electric_bike,2021-01-09 14:24:07,2021-01-09 15:17:54,California Ave & Cortez St,17660,,,41.90041,-87.69676,41.94,-87.71,casual


Looking at the data, we will definitely need to clean it. The most common problem appears to be incomplete trip data, where the start or end station is missing. However, we are also given the start and end latitude and longitude, so it may be possible to infer the missing stations from the coordinates, depending on how accurate they are. Note, though, that according to Divvy's website, users can lock their bikes at bike racks that are not Divvy stations. This would make sense if a user stops into a shop, or commutes to work and parks the bike close to their place of employment, so a missing station is not necessarily an error.
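
One quick way to gauge how accurate the coordinates are is to check how many decimal places they carry. In the rows printed above, the trips with a missing end station have end coordinates such as 41.89 and -87.72, while the start coordinates carry five decimals. Below is a minimal base-R sketch for flagging such coarse coordinates. The helper name `is_coarse` and the two-decimal threshold are assumptions made for illustration, and the sample values are copied from the first printed row:

```r
# Flag coordinates that carry no more than `digits` decimal places.
# A small tolerance absorbs floating-point noise (e.g. in 41.89 * 100).
is_coarse <- function(x, digits = 2, tol = 1e-6) {
  scaled <- x * 10^digits
  abs(scaled - round(scaled)) < tol
}

# Values from the first printed row: start_lat, then end_lat.
is_coarse(c(41.90034, 41.89))
```

Two decimal places of latitude correspond to roughly a kilometre, so points flagged this way probably cannot be matched to a single station with any confidence.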

Let's go ahead and import the rest of the data for our analysis: 

In [2]:
data2 <- read_csv("Divvy Trip Data 2021/202102-divvy-tripdata.csv")
data3 <- read_csv("Divvy Trip Data 2021/202103-divvy-tripdata.csv")
data4 <- read_csv("Divvy Trip Data 2021/202104-divvy-tripdata.csv")
data5 <- read_csv("Divvy Trip Data 2021/202105-divvy-tripdata.csv")
data6 <- read_csv("Divvy Trip Data 2021/202106-divvy-tripdata.csv")
data7 <- read_csv("Divvy Trip Data 2021/202107-divvy-tripdata.csv")
data8 <- read_csv("Divvy Trip Data 2021/202108-divvy-tripdata.csv")
data9 <- read_csv("Divvy Trip Data 2021/202109-divvy-tripdata.csv")
data10 <- read_csv("Divvy Trip Data 2021/202110-divvy-tripdata.csv")
data11 <- read_csv("Divvy Trip Data 2021/202111-divvy-tripdata.csv")
data12 <- read_csv("Divvy Trip Data 2021/202112-divvy-tripdata.csv")

ERROR: Error in read_csv("Divvy Trip Data 2021/202102-divvy-tripdata.csv"): could not find function "read_csv"
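
The error above means that `read_csv()` was called in a session where readr had not been attached with `library(readr)`, which typically happens after a kernel restart. Below is a hedged sketch of loading the twelve monthly files in one pass instead of with twelve separate calls. The folder and file-name pattern are taken from the cell above; the names `paths`, `present` and `monthly` are made up here, and the `file.exists()` filter is only there so the sketch does not fail on a machine without the files:

```r
# Build the twelve monthly file paths for 2021.
months <- sprintf("2021%02d", 1:12)
paths <- file.path("Divvy Trip Data 2021",
                   paste0(months, "-divvy-tripdata.csv"))

# Keep only the files that are actually present on this machine.
present <- paths[file.exists(paths)]

# Qualifying the call as readr::read_csv() works even when
# library(readr) has not been run in the current session.
monthly <- lapply(present, function(p) {
  readr::read_csv(p, show_col_types = FALSE)
})
names(monthly) <- basename(present)
```

Keeping the tables in one list also makes later steps, such as comparing column names or stacking the months, a single call rather than twelve.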


In [3]:
hms(0,0,1234)

ERROR: Error in hms(0, 0, 1234): could not find function "hms"


All 12 datasets have exactly the same columns, so we can merge them into one dataset and begin cleaning:

In [3]:
raw_data <- rbind(data1,data2,data3,data4,data5,data6,data7,data8,data9,data10,data11,data12)
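
The claim that the columns match can be checked in code before stacking. The following is a minimal base-R sketch on two toy tables; the values are made up, only the column names come from the data, and `tables`, `same_cols` and `combined` are illustrative names:

```r
# Toy stand-ins for two monthly tables (made-up values, real column names).
jan <- data.frame(ride_id = c("A1", "A2"),
                  member_casual = c("member", "casual"))
feb <- data.frame(ride_id = "B1", member_casual = "member")
tables <- list(jan, feb)

# TRUE only if every table has the same column names in the same order.
same_cols <- all(vapply(
  tables,
  function(d) identical(names(d), names(tables[[1]])),
  logical(1)
))
stopifnot(same_cols)

# Stack all tables; the row count should equal the sum of the parts.
combined <- do.call(rbind, tables)
nrow(combined) == sum(vapply(tables, nrow, integer(1)))
```

With the real data, `tables` would be the list of the twelve monthly data frames.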

In [33]:
str(data6)

spec_tbl_df [729,595 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:729595] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
 $ rideable_type     : chr [1:729595] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : POSIXct[1:729595], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
 $ ended_at          : POSIXct[1:729595], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
 $ start_station_name: chr [1:729595] NA NA NA NA ...
 $ start_station_id  : chr [1:729595] NA NA NA NA ...
 $ end_station_name  : chr [1:729595] NA NA NA NA ...
 $ end_station_id    : chr [1:729595] NA NA NA NA ...
 $ start_lat         : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
 $ start_lng         : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ end_lat           : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
 $ end_lng           : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ member_casual   

In [41]:
data6a <- data6[1:floor(nrow(data6)/2),]
data6b <- data6[floor((nrow(data6)/2)+1):nrow(data6),]
nrow(data6a)+nrow(data6b)
nrow(data6)

In [42]:
write.csv(data6[1:floor(nrow(data6)/2),],"202106A-divvy-tripdata.csv", row.names = FALSE)
write.csv(data6[floor((nrow(data6)/2)+1):nrow(data6),],"202106B-divvy-tripdata.csv", row.names = FALSE)

write.csv(data7[1:floor(nrow(data7)/2),],"202107A-divvy-tripdata.csv", row.names = FALSE)
write.csv(data7[floor((nrow(data7)/2)+1):nrow(data7),],"202107B-divvy-tripdata.csv", row.names = FALSE)

write.csv(data8[1:floor(nrow(data8)/2),],"202108A-divvy-tripdata.csv", row.names = FALSE)
write.csv(data8[floor((nrow(data8)/2)+1):nrow(data8),],"202108B-divvy-tripdata.csv", row.names = FALSE)

write.csv(data9[1:floor(nrow(data9)/2),],"202109A-divvy-tripdata.csv", row.names = FALSE)
write.csv(data9[floor((nrow(data9)/2)+1):nrow(data9),],"202109B-divvy-tripdata.csv", row.names = FALSE)

write.csv(data10[1:floor(nrow(data10)/2),],"202110A-divvy-tripdata.csv", row.names = FALSE)
write.csv(data10[floor((nrow(data10)/2)+1):nrow(data10),],"202110B-divvy-tripdata.csv", row.names = FALSE)
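
The cell above repeats the same two lines for every month. A small helper can do the split once, sketched here under the assumption that the goal is simply two files of roughly equal size. `write_halves` is a hypothetical name, and the default output folder is a temporary directory so the example leaves nothing behind:

```r
# Write the first and second half of `df` to "<prefix>A-divvy-tripdata.csv"
# and "<prefix>B-divvy-tripdata.csv" inside `dir`; returns both paths.
write_halves <- function(df, prefix, dir = tempdir()) {
  mid <- floor(nrow(df) / 2)
  out <- file.path(dir, paste0(prefix, c("A", "B"), "-divvy-tripdata.csv"))
  first <- df[seq_len(mid), , drop = FALSE]
  second <- df[seq_len(nrow(df) - mid) + mid, , drop = FALSE]
  write.csv(first, out[1], row.names = FALSE)
  write.csv(second, out[2], row.names = FALSE)
  out
}

# Toy table with five rows: the halves get 2 and 3 rows.
toy <- data.frame(ride_id = sprintf("R%d", 1:5), stringsAsFactors = FALSE)
files <- write_halves(toy, "202106")
sapply(files, function(f) nrow(read.csv(f)))
```

Using `seq_len()` rather than `1:floor(n / 2)` also avoids the `1:0` pitfall when a table has fewer than two rows.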

In [9]:
library(kimisc)
library(dplyr)
library(stringr)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




In [4]:
write.csv(raw_data,"Divvy_raw_data_2021.csv", row.names = FALSE)

In [25]:
v = seconds.to.hms(123)

In [28]:
strptime(v, format = "%H:%M:%S")

[1] "2022-06-16 00:02:03 CDT"
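
As the output shows, `strptime()` returns a full date-time, so the current date and the local time zone get attached to what was meant to be a duration. Below is a minimal base-R sketch for keeping a duration as a number of seconds and formatting it only for display. `format_hms` is a hypothetical helper written for this example, not part of kimisc:

```r
# Format a number of seconds as "HH:MM:SS" (hours are not capped at 23).
format_hms <- function(secs) {
  secs <- as.integer(round(secs))
  sprintf("%02d:%02d:%02d",
          secs %/% 3600L, (secs %% 3600L) %/% 60L, secs %% 60L)
}

format_hms(123)
format_hms(c(407, 3600))
```

Keeping the numeric seconds as the stored value means averages and maxima can be computed directly, with the string form used only in tables and plots.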

In [5]:
test_data <- read_csv("test_data.csv")
head(test_data)

[1m[22mNew names:
[36m•[39m `end_station_name` -> `end_station_name...9`
[36m•[39m `end_station_name` -> `end_station_name...10`
"One or more parsing issues, see `problems()` for details"
[1mRows: [22m[34m49622[39m [1mColumns: [22m[34m15[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m  (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
[32mdbl[39m  (5): day_of_week, start_lat, start_lng, end_lat, end_lng
[34mtime[39m (1): ride_length

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


ride_id,rideable_type,started_at,ended_at,ride_length,day_of_week,start_station_name,start_station_id,end_station_name...9,end_station_name...10,start_lat,start_lng,end_lat,end_lng,member_casual
<chr>,<chr>,<chr>,<chr>,<time>,<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
89E7AA6C29227EFF,classic_bike,2021-02-12 16:14:56,2021-02-12 16:21:43,00:06:47,6,Glenwood Ave & Touhy Ave,525,Sheridan Rd & Columbia Ave,660,42.0127,-87.66606,42.00458,-87.66141,member
0FEFDE2603568365,classic_bike,2021-02-14 17:52:38,2021-02-14 18:12:09,00:19:31,1,Glenwood Ave & Touhy Ave,525,Bosworth Ave & Howard St,16806,42.0127,-87.66606,42.01954,-87.66956,casual
E6159D746B2DBB91,electric_bike,2021-02-09 19:10:18,2021-02-09 19:19:10,00:08:52,3,Clark St & Lake St,KA1503000012,State St & Randolph St,TA1305000029,41.88579,-87.6311,41.88487,-87.6275,member
B32D3199F1C2E75B,classic_bike,2021-02-02 17:49:41,2021-02-02 17:54:06,00:04:25,3,Wood St & Chicago Ave,637,Honore St & Division St,TA1305000034,41.89563,-87.67207,41.90312,-87.67394,member
83E463F23575F4BF,electric_bike,2021-02-23 15:07:23,2021-02-23 15:22:37,00:15:14,3,State St & 33rd St,13216,Emerald Ave & 31st St,TA1309000055,41.83473,-87.62583,41.83816,-87.64512,member
BDAA7E3494E8D545,electric_bike,2021-02-24 15:43:33,2021-02-24 15:49:05,00:05:32,4,Fairbanks St & Superior St,18003,LaSalle Dr & Huron St,KP1705001026,41.89581,-87.62025,41.89489,-87.63198,casual


In [None]:
raw_data$ride_length <- 
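
The assignment above was left unfinished. Judging by the `ride_length` and `day_of_week` columns in `test_data`, the intent appears to be the trip duration plus a weekday number with Sunday as 1; that reading is an assumption. Below is a sketch on a two-row toy table whose timestamps are copied from the `test_data` preview (the `UTC` time zone is chosen only to keep the example reproducible):

```r
trips <- data.frame(
  started_at = as.POSIXct(c("2021-02-12 16:14:56", "2021-02-14 17:52:38"),
                          tz = "UTC"),
  ended_at   = as.POSIXct(c("2021-02-12 16:21:43", "2021-02-14 18:12:09"),
                          tz = "UTC")
)

# Duration in seconds; fixing the units stops difftime() from
# switching between seconds, minutes and hours on its own.
trips$ride_length <- as.numeric(
  difftime(trips$ended_at, trips$started_at, units = "secs")
)

# POSIXlt numbers weekdays 0 (Sunday) to 6, so add 1 to get Sunday = 1.
trips$day_of_week <- as.POSIXlt(trips$started_at)$wday + 1

trips[, c("ride_length", "day_of_week")]
```

The two durations, 407 and 1171 seconds, agree with the 00:06:47 and 00:19:31 shown for the same trips in `test_data`, and the weekday numbers come out as 6 and 1.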

In [5]:
head(raw_

ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
<chr>,<chr>,<dttm>,<dttm>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
E19E6F1B8D4C42ED,electric_bike,2021-01-23 16:14:19,2021-01-23 16:24:44,California Ave & Cortez St,17660,,,41.90034,-87.69674,41.89,-87.72,member
DC88F20C2C55F27F,electric_bike,2021-01-27 18:43:08,2021-01-27 18:47:12,California Ave & Cortez St,17660,,,41.90033,-87.69671,41.9,-87.69,member
EC45C94683FE3F27,electric_bike,2021-01-21 22:35:54,2021-01-21 22:37:14,California Ave & Cortez St,17660,,,41.90031,-87.69664,41.9,-87.7,member
4FA453A75AE377DB,electric_bike,2021-01-07 13:31:13,2021-01-07 13:42:55,California Ave & Cortez St,17660,,,41.9004,-87.69666,41.92,-87.69,member
BE5E8EB4E7263A0B,electric_bike,2021-01-23 02:24:02,2021-01-23 02:24:45,California Ave & Cortez St,17660,,,41.90033,-87.6967,41.9,-87.7,casual
5D8969F88C773979,electric_bike,2021-01-09 14:24:07,2021-01-09 15:17:54,California Ave & Cortez St,17660,,,41.90041,-87.69676,41.94,-87.71,casual


In [6]:
str(raw_data)

spec_tbl_df [5,595,063 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:5595063] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
 $ rideable_type     : chr [1:5595063] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : POSIXct[1:5595063], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
 $ ended_at          : POSIXct[1:5595063], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
 $ start_station_name: chr [1:5595063] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
 $ start_station_id  : chr [1:5595063] "17660" "17660" "17660" "17660" ...
 $ end_station_name  : chr [1:5595063] NA NA NA NA ...
 $ end_station_id    : chr [1:5595063] NA NA NA NA ...
 $ start_lat         : num [1:5595063] 41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num [1:5595063] -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ end_lat           :

In [31]:
library(tidyverse)
library(skimr)
library(janitor)

── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ forcats 0.5.1
✔ tidyr   1.2.0     

── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: 'janitor'


The following objects are masked from 'package:stats':

    chisq.test, fisher.test




In [8]:
glimpse(raw_data)

Rows: 5,595,063
Columns: 13
$ ride_id            <chr> "E19E6F1B8D4C42ED", "DC88F20C2C55F27F", "EC45C94683…
$ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
$ started_at         <dttm> 2021-01-23 16:14:19, 2021-01-27 18:43:08, 2021-01-…
$ ended_at           <dttm> 2021-01-23 16:24:44, 2021-01-27 18:47:12, 2021-01-…
$ start_station_name <chr> "California Ave & Cortez St", "California Ave & Cor…
$ start_station_id   <chr> "17660", "17660", "17660", "17660", "17660", "17660…
$ end_station_name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Wood St & Augu…
$ end_station_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "657", "13258",…
$ start_lat          <dbl> 41.90034, 41.90033, 41.90031, 41.90040, 41.90033, 4…
$ start_lng          <dbl> -87.69674, -87.69671, -87.69664, -8

In [20]:
v = "member_casual"

In [21]:
# count how many unique values the column named in v contains
count(unique(raw_data[v]))

n
<int>
2


In [23]:
# shows us the unique values for a variable
unique(raw_data[v])

member_casual
<chr>
member
casual


In [28]:
# show all rows where member_casual is NA (none are returned here)
raw_data[is.na(raw_data$member_casual), ] 

ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
<chr>,<chr>,<dttm>,<dttm>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>


In [None]:
for(i in 1:ncol(raw_data)) {       # for-loop over columns
  v <- raw_data[ , i] 
}
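
The loop above walks over the columns but does not yet do anything with them. One way to finish the idea is to collect the same checks made by hand for `member_casual` (missing values, distinct values) for every column at once. The sketch below uses `sapply()` on a made-up four-row table; only the column names come from the data, and `profile` is an illustrative name:

```r
# Made-up sample: station names here are placeholders, not real stations.
toy <- data.frame(
  end_station_name = c(NA, NA, "Station X", NA),
  member_casual    = c("member", "member", "casual", "casual"),
  stringsAsFactors = FALSE
)

# One row per column: how many values are missing and how many
# distinct non-missing values there are.
profile <- data.frame(
  n_missing  = sapply(toy, function(col) sum(is.na(col))),
  n_distinct = sapply(toy, function(col) length(unique(col[!is.na(col)])))
)
profile
```

Run on `raw_data`, a table like this would show at a glance which columns need cleaning and whether `member_casual` really has only two values.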

In [1]:
cdata1 <- read_csv("Cleaned Data//202101-divvy-tripdata.csv")

ERROR: Error in read_csv("Cleaned Data//202101-divvy-tripdata.csv"): could not find function "read_csv"
